Multi-Feature Nonlinear Optimization Motion Estimation Based on RGB-D and Inertial Fusion

To achieve a high precision estimation of indoor robot motion, a tightly coupled RGB-D visual-inertial SLAM system is proposed herein based on multiple features. Most of the traditional visual SLAM methods only rely on points for feature matching and they often underperform in low textured scenes. Besides point features, line segments can also provide geometrical structure information of the environment. This paper utilized both points and lines in low-textured scenes to increase the robustness of RGB-D SLAM system. In addition, we implemented a fast initialization process based on the RGB-D camera to improve the real-time performance of the proposed system and designed a new backend nonlinear optimization framework. By minimizing the cost function formed by the pre-integrated IMU residuals and re-projection errors of points and lines in sliding windows, the state vector is optimized. The experiments evaluated on public datasets show that our system achieves higher accuracy and robustness on trajectories and in pose estimation compared with several state-of-the-art visual SLAM systems.


Introduction
In recent years, simultaneous localization and mapping (SLAM) [1] has become an attractive research topic in many self-localization robotic areas, particularly with the development of mobile robots. The SLAM technology can estimate a robot's own motion and attitude using sensors without any prior information of the environment. Consequently, it is widely used for autonomous cars, virtual reality and augmented reality, unmanned air vehicles, and indoor robots. For the localization of indoor robots, high-precision map and global positioning system [2] technologies are not typically available indoors, and low-precision inertial sensors can barely meet the requirements of accuracy and applicability. A visual-inertial odometry (VIO) that combines the information of cameras and IMU has evidently improved the accuracy and robustness for the localization of indoor robots. Cameras can capture rich feature information by processing images, and IMU can obtain accurate motion estimation of robots in a short timeframe, which reduces the impact of dynamic objects on cameras. Further, cameras can provide effective corrections for inertial drifts. To a certain extent, these two sensors can complement each other, and the VIO system has been popularized for use in various potential applications.
Based on whether image feature information is incorporated into the system state vector, a typical VIO system can be divided into a loosely coupled method and a tightly coupled method [3,4]. A loosely coupled method implies that the camera and IMU generate each their own motion estimation, and then the VIO system fuses the position and attitude of these two estimation results. Meanwhile, a tightly coupled method establishes motion and observation equations by combining raw measurements Sensors 2020, 20 from the camera and IMU. Compared with the former, the latter generally has better accuracy and robustness. In this study, the proposed system is a tightly coupled VIO system. For the system state estimation, a filter-based approach and an optimization-based approach are both used [5,6]. Filtering approaches, based on the extended Kalman filter (EKF), use IMU measurements for system state estimation, while visual measurements are designed to update the latest state. Mourikis et al. [7] presented a multi-state constraint Kalman filter based on a filtering-based approach, whose backend predicted and updated the state vectors with EKF in sliding windows. However, the drawback of this system is that not all current camera information is used in the filter and that this system processes the measurements of sensors only once per updating step. Optimization-based approaches obtain an optimal estimate by minimizing a joint nonlinear cost function with all IMU and visual residuals. Thus, optimization-based approaches usually achieve a higher accuracy than filtering-based methods. Stefan et al. [8] proposed an open keyframe visual-inertial SLAM (OKVIS) based on a nonlinear optimization framework, which uses IMU pre-integration [9] to avoid repeated IMU integration and the first-in-last-out sliding window method for bounding computation to handle all measurements. Qin et al. [10] also studied a nonlinear-optimization framework, VINS-MONO, which selectively marginalizes visual landmarks and achieves an accurate VIO system for monocular system. Their work enables optimization-based methods to achieve better real-time performance. However, the above VIO systems are based on monocular and stereo cameras. Monocular cameras cannot directly obtain depth information of visual features and require a complex initialization process to recover the scale of visual features, leading to system-scale uncertainty and increases in the computational complexity. Stereo also requires high computing cost to generate the corresponding depth information. Compared with these methods, RGB-D cameras can obtain both color images and aligned depth information, which can simplify the triangulation of point features and achieve a faster initialization process. However, few works have integrated IMU with RGB-D camera measurements for a VIO system up until now. Yang et al. [11] presented an RGB-D-inertial odometry via keyframe-based nonlinear optimization. A non-iterative odometry was proposed in [12], which used a kernel cross-correlator (KCC) to match point clouds and combine IMU measurements to generate a dense indoor map. Although the above mentioned RGB-D-inertial systems, which rely only on point features, have improved the localization accuracy in most scenes, point detection and matching in textureless environments exhibit a poor performance. In contrast, line features provide more structural information of these scenes than points. There are some monocular SLAM systems that combine point and line features. He et al. [13] proposed the PL-VIO system, which is based on VINS-MONO. Although the PL-VIO has achieved good accuracy on public datasets, it needs a complex initialization process and consumes a lot of computing resources because of system-scale uncertainty of the monocular system. For an RGB-D SLAM system, several works combined point and line features to estimate camera motion [14,15]. Lu et al. [16] proposed the LineSLAM, which detected line segments from depth images of RGB-D cameras and implemented an RGB-D visual odometer by combining points and lines. This system is currently the latest open source RGB-D system based on points and lines. However, without the combination of other sensor data, the LineSLAM will not work when the visual information is lost. Plücker coordinates were first introduced to represent 3D lines in [17], which also used orthonormal representation to handle the nonlinear optimization of line features and avoid over-parameterizations.
For an RGB-D-inertial SLAM system, no work that combines point and line features has been conducted yet. Based on the previous discussion, a tightly coupled RGB-D visual-inertial odometry based on multi-features is presented in this paper. The key contributions are as follows: • To our best knowledge, the proposed system is the first tightly coupled optimization-based RGB-D-inertial system based on points and lines.

•
At the beginning of the system, 3D point features can be directly obtained from depth images produced by the RGB-D camera. To correct IMU bias at the beginning, we implemented a fast initialization aligning vision-only measurements with the values of IMU pre-integration.
• This paper achieves an RGB-D-inertial nonlinear optimization framework with constraints of both the IMU kinematic model and the reprojection of points and lines in sliding windows. An orthonormal representation is employed to parameterize line segments and analytically calculate the corresponding Jacobians. • A series of experiments has been designed to compare the performance of our system with the other four open-source visual SLAM systems: PLVIO [13], OKVIS [8], VINS-MONO [10] and LineSLAM [16]. The results validate the accuracy and robustness of the system proposed herein.

Notations and Definitions
The notations used and details throughout this work are as shown in Figure 1. (·) C is the camera frame; (·) B denotes the body frame, which is the same as the IMU frame; (·) W is the world frame, and [·] × is the skew-symmetric matrix of the vector. We denote T C B ∈ SE(3) as the homogeneous transformation matrix from the body frame to the camera frame, T C B = R C B , t C B . L i and P i are the ith point landmark and the line landmark, respectively.
Sensors 2020, 20, x FOR PEER REVIEW 3 of 21 • To our best knowledge, the proposed system is the first tightly coupled optimization-based RGB-Dinertial system based on points and lines.

•
At the beginning of the system, 3D point features can be directly obtained from depth images produced by the RGB-D camera. To correct IMU bias at the beginning, we implemented a fast initialization aligning vision-only measurements with the values of IMU pre-integration.

•
This paper achieves an RGB-D-inertial nonlinear optimization framework with constraints of both the IMU kinematic model and the reprojection of points and lines in sliding windows. An orthonormal representation is employed to parameterize line segments and analytically calculate the corresponding Jacobians. • A series of experiments has been designed to compare the performance of our system with the other four open-source visual SLAM systems: PLVIO [13], OKVIS [8], VINS-MONO [10] and LineSLAM [16]. The results validate the accuracy and robustness of the system proposed herein.

Notations and Definitions
The notations used and details throughout this work are as shown in Figure 1. (

IMU Pre-Integration
Inertia devices usually consist of an accelerometer and a gyroscope, which measure the three-axis acceleration a and the three-axis angular velocity w . Owing to the influences of internal mechanical accuracy and temperature factors, white noise and bias occur in IMU measurements. The measurement models of IMU are expressed as follows:

IMU Pre-Integration
Inertia devices usually consist of an accelerometer and a gyroscope, which measure the three-axis acceleration a and the three-axis angular velocity w. Owing to the influences of internal mechanical accuracy and temperature factors, white noise and bias occur in IMU measurements. The measurement models of IMU are expressed as follows:ŵ where b w and b a denote a gyroscope and a accelerometer bias, respectively, which are usually modeled as a random walk process. Further, η w and η a are the corresponding Gaussian white noise process, w B andâ B are the raw measurements by IMU, and a W and g W = [0, 0, g] T represent the true acceleration and gravity, respectively, in the world frame. Given a system state, namely, position p W B , velocity v W B , and quaternions q W B , we can obtain all body states in discrete form between consecutive frames B k and B k+1 in the world frame according to IMU's kinematic model.
Sensors 2020, 20, 4666 are called IMU pre-integration [9] and can be directly calculated without knowledge of the system state, indicating that when a body state is changed, re-propagation of IMU is not necessary. Time t = i is between time interval [t k , t k+1 ]. To save computational resources, the biases b a , b w are set to constant during the pre-integration between consecutive frames.

Representation of 3D Line Features
Traditional line segmentation detection usually uses the Canny algorithm [18] to obtain edge information in an image and then uses the Hough transform to extract line features. However, this method is time consuming, and many false detections may occur when edge features are dense. Line segment detector (LSD) [19] is a line detection and segmentation algorithm, which can obtain sub-pixel accuracy detection results in linear time. The LSD algorithm is used to detect and match two endpoints of line features in this study, whereas the merged line features are represented by LBD line descriptors [20]. Based on LSD line detection, we introduce the geometric characteristics of the line to match line pairs effectively. In this study, we set the following conditions for line matching: (1) the angular difference of matched line pairs is smaller than a given threshold θ = 30 • and (2) the distance between two LBD descriptors is less than the value δ = 60 pixel.
As a 3D line segment L can be represented by its two endpoints, Plücker coordinates are introduced for a more compact representation. We assume that the homogeneous coordinates of two endpoints are X 1 = (x 1 , y 1 , z 1 , w 1 ) T and X 2 = (x 2 , y 2 , z 2 , w 2 ) T , while their inhomogeneous coordinates are denoted by X 1 , X 2 . The Plücker coordinates of a line segment can be constructed as follows: where v ∈ R 3 is the direction vector of the line feature, and n ∈ R 3 is the normal vector of plane I as determined by line segments and world coordinate origin, which are shown in Figure 2.
Their relationship is n T v = 0. For the line geometric characteristics, we can obtain the transformation of the Plücker coordinates from the world frame to the camera frame from Equation (7).
where [ ] × ⋅ is the skew-symmetric matrix of the vector. According to a pin-hole model camera, a spatial line represented by Plücker coordinates can be projected to the image plane as in [21].  For the line geometric characteristics, we can obtain the transformation of the Plücker coordinates from the world frame to the camera frame from Equation (7).
where [·] × is the skew-symmetric matrix of the vector. According to a pin-hole model camera, a spatial line represented by Plücker coordinates can be projected to the image plane as in [21].
where K denotes the projection matrix of the camera. As a straight line only has four degrees of freedom, Plücker coordinates, which consist of six parameters, are over-parameterized. At the graph optimization, the extra degrees of freedom will increase the computational cost, resulting in an instability in the system. Therefore, an orthonormal coordinate [22] consisting of four parameters is proposed to optimize the line features. The orthonormal representation (U, W) ∈ SO(3)SO (2) is more suitable than Plücker coordinates for optimization and they can easily be converted into each other. The orthonormal representation of line features is denoted as follows: where the orthogonal matrix U represents the rotation between the line coordinates and the camera frame. Matrix W represents the distance information from the camera coordinate origin to the 3D line. The orthonormal representation (U, W) consists of: We can update the orthonormal representation with a minimum of four parameters θ = ϕ φ T in the process of optimization. Matrix U is updated by ϕ, which are the rotation angles of the three axes in the camera frame. Matrix W is updated by φ, which is the translation of a 3D line in the camera frame.

Overall Structure of the VIO System
The proposed system structure is shown in Figure 3. It mainly includes two parts: the frontend and the backend. At the beginning of the system, the data from IMU and the RGB-D camera are received and processed at the frontend module, where the transformation between two consecutive frames is obtained through feature detection and matching. Further, the corresponding system initial pose of IMU pre-integration is provided. Histogram equalization will be conducted on the original RGB image [23], which strengthens the adaptability of images to scenes with light changes for better extraction of features. When a new frame is received, the point features are tracked between the previous frame and the new frame by the KLT sparse optical flow algorithm [24]. Then, the RANSAC framework with a fundamental matrix was used to remove outliers by establishing a parametric model [25]. We used the LSD algorithm to detect line features whose threshold was set to assure that an adequate number of features can be detected. A FAST detector [26] was used to maintain a adequate Sensors 2020, 20, 4666 6 of 19 number of points features in each image. The depth image and RGB image were aligned by hardware, and depth filtering was performed to reduce corresponding errors. We can determine the aligned depth information of each feature from the depth image according to the projection model of the pinhole camera. Meanwhile, the strategy for selecting the keyframe depends on whether the average parallax of the point feature between the current frame and the latest frame is beyond a specified threshold.
integration is provided. Histogram equalization will be conducted on the original RGB image [23], which strengthens the adaptability of images to scenes with light changes for better extraction of features. When a new frame is received, the point features are tracked between the previous frame and the new frame by the KLT sparse optical flow algorithm [24]. Then, the RANSAC framework with a fundamental matrix was used to remove outliers by establishing a parametric model [25]. We used the LSD algorithm to detect line features whose threshold was set to assure that an adequate number of features can be detected. A FAST detector [26] was used to maintain a adequate number of points features in each image. The depth image and RGB image were aligned by hardware, and depth filtering was performed to reduce corresponding errors. We can determine the aligned depth information of each feature from the depth image according to the projection model of the pinhole camera. Meanwhile, the strategy for selecting the keyframe depends on whether the average parallax of the point feature between the current frame and the latest frame is beyond a specified threshold.  In the backend, first, with a set of corresponding 3D points features, the iterative closest point (ICP) algorithm was used to recover the initial pose in the process of structure from motion (SFM). We maintained a certain number of frames in a sliding window to reduce the computational complexity. The line features were triangulated to set up re-projection residuals. After obtaining the pose of all frames and 3D landmarks form vision-only SFM in the sliding windows, we aligned the poses with the measurements of the IMU pre-integration and the visual measurements to obtain a optimal initial state. We built a nonlinear optimization framework to minimize the residuals of the visual re-projection, IMU measurements and prior information of marginalization. To prevent the number of poses and features from increasing excessively, we used a marginalization strategy to reduce the computational complexity of our system. Finally, we integrated the Loop Closure to reduce cumulative errors in pose estimation and improve the accuracy of the system.

Nonlinear Optimization Framework
In this section, the entire backend structure is detailed. To reduce the computational complexity of the backend process, we used a nonlinear optimization method based on sliding windows, in which only a certain number of frames were maintained. A system marginalization strategy was also applied to exclude the oldest frame in a window and add prior information into the next local window optimization. We derived the state constraints of the system through a factor graph, identified the state vector to be optimized and established a nonlinear optimization equation. Meanwhile, to ensure the global consistency of the system, a global loop closure module was incorporated. As shown in Figure 4, circular nodes were the states that needed to be optimized, which included the camera pose and landmarks. Rectangular nodes were the system constraints of the visual measurements, IMU measurements and loop information.
We derived the state constraints of the system through a factor graph, identified the state vector to be optimized and established a nonlinear optimization equation. Meanwhile, to ensure the global consistency of the system, a global loop closure module was incorporated. As shown in Figure 4, circular nodes were the states that needed to be optimized, which included the camera pose and landmarks. Rectangular nodes were the system constraints of the visual measurements, IMU measurements and loop information.

System State
The full state variables that needed to be optimized in a sliding window are defined as follows: where η is the state vector that contains point and line states.

System State
The full state variables that needed to be optimized in a sliding window are defined as follows: where η is the state vector that contains point and line states. N and M are the numbers of keyframes and line landmarks observed, respectively, by all keyframes in the sliding window. l m denotes the orthonormal representation of the m-th line feature in the world frame. For the i-th frame image, x i is the position, velocity, orientation in the world frame and the bias term for the accelerometer and gyroscope, respectively, of the corresponding IMU state. First, the first frame was set as a keyframe, and then the next keyframe was selected according to our strategy. We sought a bundle adjustment formulation to jointly minimize the residuals of all measurements in the Mahalanobis distance. There were mainly four parts of residuals in our system. The cost function is defined as where r M and J M are the prior information from marginalization, which contain the constraints of IMU measurements, point and line features removed from the sliding windows [27]. ρ is the Huber robust function used to decrease the influence of outliers on the whole state estimation. r B is the residual of the IMU pre-integration between two consecutive frames. r P and r l , respectively, represent the residuals of point measurements and line re-projections. B is the set of all IMU measurements. P and L are separately the sets of landmarks and spatial lines, both of which are observed many times in the windows, C is the set of all keyframes.

System Initialization
Visual-inertial odometry is a highly nonlinear system, and there exists IMU bias during the operation of the system, hence, system initialization is very important for correcting initial IMU bias and achieving a high-precision position. In our initialization strategy, we used a loose-coupled method to fuse data from two sensors, with the RGB-D camera and IMU running separately for a short time to calculate the corresponding poses. Then, we aligned the poses from visual structure from motion (SFM) and the IMU pre-integration to correct the initial error of the IMU. Our process of initialization was divided into the following modules. The monocular camera could not determine an absolute scale of feature in 3D space, while the RGB-D camera could quickly obtain more accurate depth information. With the depth information from RBG-D camera, the 3D landmarks of the matched point features between the two frame could be obtained. Firstly, a visual SFM was taken to find the frame which had enough matching feature points with the current frame in the sliding windows. The selected keyframes were used as a reference frame for pose estimation in turn, which could recover the rotation matrix and translation vector between two consecutive frames. According to the pinhole camera model, we could obtain two sets of matched 3D points between the i-th frame and the j-th frame, which are denoted as: The corresponding transformation is described in Equation (17), and we used an ICP algorithm to estimate the state of 3D-3D matching. We could turn this into a least square problem with Equation (18). In addition, we solved Equation (18) using singular value decomposition (SVD) to obtain the pose of the visual-only SFM.
arg min

Visual-IMU Alignment
During initialization, for all consecutive frames, the rotation matrix measured by the camera should have been theoretically the same as that measured by the IMU. However, in reality, they were not exactly the same because of an existing gyroscope bias. Hence, we built the following cost function to minimize the relative rotation error of them: where q B i+1 B i represents the rotation matrix of consecutive frames B i to B i+1 measured by the visual-only SFM, and γ is the rotation measurement measured by IMU pre-integration.

IMU Measurement Residual
In our system, the IMU measurement residual was the norm of the difference between the observed and estimated values of IMU in the Mahalanobis distance. Considering adjacent frames i and i + 1, we obtained the predicted motion of frame i + 1 by using related IMU data of frame i to create the pre-integrated term; the IMU pre-integrated residual is described as: Sensors 2020, 20, 4666 represents the IMU pre-integrated term between two consecutive image frames.
[·] xyz denotes the error of the three-dimensional rotation vector.

Visual Reprojection Residual
For point features, j-th feature points in the sliding window were observed by the two matched frames I and i + 1. According to the RGB-D camera model, we could directly obtain the corresponding 3D coordinates P j C i and P j C i+1 ,P j C i was the projection point of P j C i from frame C i to C i+1 . The residual of the j-th feature points is defined as: where: To minimize the point measurement residuals, the Levenberg-Marquard algorithm was applied to solve the cost function, which can be denoted as: is the Jacobian matrix of residual r p with respect to x. Therefore, we must derive the rotation and translation of frame C i , C i+1 for J C i C i+1 : With After we obtained all Jacobians for the system state, the Levenberg-Marquard algorithm in the Ceres solver library [28] was used to find the optimal state x.
For the line features, a suitable residual model of line segments must also be developed. According to Equation (7), the spatial 3D line l W in the world frame can be converted to l C in the camera frame. l W and l C are represented in Plücker coordinates. In this study, we defined the difference between the position of a line segment detected in the image plane as the re-projection residual of lines. As shown in Figure 5, P W Q W is the spatial line feature observed in the two keyframes C i and C i+1 in the world frame. P C i Q C i and P C i+1 Q C i+1 are, respectively, the observation model of line P W Q W in frame C i and frame C i+1 .P C iQ C i detected in frame C i+1 is considered to be matched with P C i Q C i , which usually does not coincide with P C i+1 Q C i+1 . Thus, the line re-projection residual denotes a sum of point-to-line distances from two endpoints ofP C iQ C i to line P C i+1 Q C i+1 : To optimize the cost function, we also used the Levenberg-Marquard algorithm, the corresponding Jacobian matrix with respect to the increment of the system state and the four dimensional vector θ , which updates the orthonormal representation: with:

Marginalization
Owing to the limited computational resources of most indoor robots, it was difficult to optimize all camera poses and landmarks. Therefore, the IMU states, feature points and feature lines were selectively marginalized from the sliding windows. Thus, an optimal computational complexity was maintained, within an acceptable range. Meanwhile, the corresponding measurement value was taken as prior information. As shown in Figure 6, our marginalization strategy was to decide which frame to remove according to whether the new frame was a keyframe. When the new frame was a keyframe, the oldest keyframe and its corresponding IMU measurements were removed from the sliding window, and the new frame was retained. Otherwise, the image information of the latest keyframe in the sliding window To optimize the cost function, we also used the Levenberg-Marquard algorithm, the corresponding Jacobian matrix with respect to the increment of the system state and the four dimensional vector θ, which updates the orthonormal representation: (28) with:

Marginalization
Owing to the limited computational resources of most indoor robots, it was difficult to optimize all camera poses and landmarks. Therefore, the IMU states, feature points and feature lines were selectively marginalized from the sliding windows. Thus, an optimal computational complexity was maintained, within an acceptable range. Meanwhile, the corresponding measurement value was taken as prior information. As shown in Figure 6, our marginalization strategy was to decide which frame to remove according to whether the new frame was a keyframe. When the new frame was a keyframe, the oldest keyframe and its corresponding IMU measurements were removed from the sliding window, and the new frame was retained. Otherwise, the image information of the latest keyframe in the sliding window was marginalized, and its corresponding IMU measurements were retained to facilitate subsequent IMU pre-integration. Then, we used the Schur complement [29] to keep important constraint information and ensure calculation consumption.
Sensors 2020, 20, x FOR PEER REVIEW 12 of 21 was marginalized, and its corresponding IMU measurements were retained to facilitate subsequent IMU pre-integration. Then, we used the Schur complement [29] to keep important constraint information and ensure calculation consumption.

Loop Closure
The loop closing thread was used to minimize drift accumulated from the backend optimization framework. The detection of loop closures was able to find an image similarity between the new frame and the candidate keyframe followed by estimating a relative pose change between them. We used DboW2 [30] for loop detection, in which only points feature descriptors were used because matching lines across the whole framework were computationally too expensive. When a loop was detected, the connection between the new frame and the candidate keyframe in the local sliding window was established by points feature descriptors. In addition, we used a two-way geometric constraint to compute the relative pose between them. When the 3D points obtained from the new frame were less than the threshold we set, a 2D-2D strategy was used to estimate relative pose. We used 2D observations of points features in the new frame and the candidate keyframe to perform the fundamental matrix test. Otherwise, we performed the ICP test to recover the relative pose. The corresponding strategy is shown in the Figure 7. Finally, a pose graph optimization was launched to the loop closure.

Loop Closure
The loop closing thread was used to minimize drift accumulated from the backend optimization framework. The detection of loop closures was able to find an image similarity between the new frame and the candidate keyframe followed by estimating a relative pose change between them. We used DboW2 [30] for loop detection, in which only points feature descriptors were used because matching lines across the whole framework were computationally too expensive. When a loop was detected, the connection between the new frame and the candidate keyframe in the local sliding window was established by points feature descriptors. In addition, we used a two-way geometric constraint to compute the relative pose between them. When the 3D points obtained from the new frame were less than the threshold we set, a 2D-2D strategy was used to estimate relative pose. We used 2D observations of points features in the new frame and the candidate keyframe to perform the fundamental matrix test. Otherwise, we performed the ICP test to recover the relative pose. The corresponding strategy is shown in the Figure 7. Finally, a pose graph optimization was launched to the loop closure.

Experiment
We tested our system on public datasets provided in STAR-Center dataset [31] and the OpenLORIS-Scene dataset [32]. To our knowledge, there is no open source project that proposes an RGB-D-inertial system based on points and lines. We compared our system with four state-of-the-art open source visual systems: PLVIO, based on an optimization method, which uses monocular visual-inertial odometry with points and lines. OKVIS, which is a monocular visual-inertial odometry with point-based modes. VINS-MONO [10] is the state-of-the-art SLAM system: monocular integrated with IMU. LineSLAM, which is

Experiment
We tested our system on public datasets provided in STAR-Center dataset [31] and the OpenLORIS-Scene dataset [32]. To our knowledge, there is no open source project that proposes an RGB-D-inertial system based on points and lines. We compared our system with four state-of-the-art open source visual systems: PLVIO, based on an optimization method, which uses monocular visual-inertial odometry with points and lines. OKVIS, which is a monocular visual-inertial odometry with point-based modes. VINS-MONO [10] is the state-of-the-art SLAM system: monocular integrated with IMU. LineSLAM, which is an RGB-D odometry based on points and lines and is not integrated with IMU. The open source evaluation tool evo [33] was used to evaluate the absolute pose error (APE), which directly compared the trajectory error between the estimate values of all systems and the ground truth. All of our experiments were conducted on a computer equipped with a i5-8750H CPU at 1.6-2.3 GHz and 8 GB RAM, and no GPU was used.

STAR-Center Dataset
The STAR-Center dataset contained color images and depth images taken by an RGB-D camera at 30FPS and synchronized inertial data at 250 Hz. This dataset included two scenarios: a handheld and a wheeled robot. According to the different rotation degrees, the handheld datasets were divided into three modes: a handheld simple, handheld normal and handheld with more rotation. According to the velocity of the robot, the wheeled robot datasets were also divided into three parts: wheeled slow, wheeled normal, and wheeled fast. The trajectories were gained from different velocities on a road with two up-down slopes. All of these provided the ground truth trajectory captured by the Optitrack tracking system. For quantitative evaluation, we compared our system with four state-of-art SLAM systems without loop closure. Table 1 shows the root mean square error (RMSE) of APE of the trajectory between the estimate and the ground truth, and the corresponding histogram is shown in Figure 8. The numbers in bold represent the estimated trajectory is more close to the ground truth trajectory.
According to Figure 8, our system, which jointly utilizes points and lines, achieved superior performances for positioning on six sequences. The result in Figure 7 shows that LineSLAM-based RGB-D odometry, which is not integrated with IMU, had the worst performance on all sequences. OKVIS which only relied on point features also performed poorly. PLVIO, based on monocular with points and lines, and VINS-MONO, which is the state-of-art SLAM system, also realized better precision. To further demonstrate the results intuitively, we generated a more specific comparison with PLVIO and VINS-MONO on several sequences: Handheld with More Rotation, Wheeled Fast, and Wheeled Normal. There are several detailed error maps on trajectories estimated by our system, PLVIO and VINS-MONO shown in Figure 9, where the reference represents the ground truth. The more intense the color, the bigger the trajectory error is. Comparing the performances of the three systems, it can be conclued that our algorithm based on RGB-D camera yields smaller errors for all sequences than PLVIO and VINS-MONO based on monocular, especially in the cases of motion with more rotation and fast velocity. The results also demonstrate that our system achieves higher accuracy with changing scenarios and exhibits a better robustness. Furthermore, in order to make a more comprehensive demonstration state of our system and PLVIO with more rotation and slow velocity, we also compared the rotation errors of the two system under these two scenarios: Handheld With more Rotation and wheeled slow, and the results are displayed in Figure 10. From the rotation error diagrams shown in the Figure 10, it can be seen that our system has better observations of the motion rotation, and performs better in terms of orientation along these trajectories.
Sensors 2020, 20, x FOR PEER REVIEW 14 of 21 The numbers in bold represent the estimated trajectory is more close to the ground truth trajectory. According to Figure 8, our system, which jointly utilizes points and lines, achieved superior performances for positioning on six sequences. The result in Figure 7 shows that LineSLAM-based RGB-D odometry, which is not integrated with IMU, had the worst performance on all sequences. OKVIS which only relied on point features also performed poorly. PLVIO, based on monocular with points and lines, and VINS-MONO, which is the state-of-art SLAM system, also realized better precision. To further demonstrate the results intuitively, we generated a more specific comparison with PLVIO and VINS-MONO on several sequences: Handheld With More Rotation, Wheeled Fast, and Wheeled Normal. There are several detailed error maps on trajectories estimated by our system, PLVIO and VINS-MONO shown in Figure 9, where the reference represents the ground truth. The more intense the color, the bigger the trajectory error is. Comparing the performances of the three systems, it can be conclued that our algorithm based on RGB-D camera yields smaller errors for all sequences than PLVIO and VINS-MONO based on monocular, especially in the cases of motion with more rotation and fast velocity. The results also demonstrate that our system achieves higher accuracy with changing scenarios and exhibits a better robustness.

OpenLORIS-Scene Dataset
The OpenLORIS-Scene dataset is the latest SLAM benchmark that recorded the synchronized data of multiple sensor real scenes with a real robot, including: the home, the office, the corridor, the café and the marker scene. All the intrinsic and extrinsic parameters between sensors were well calibrated. The ground truth of trajectory were provided by an OptiTrack MCS and a Hokuyo UTM30LX LiDAR. We used the data collected by the RealSense D435i camera in the home scene, which contains many low textured scenes and motion with rapid rotations. We also compared the proposed system with four system: PLVIO, OKVIS, VINS-MONO and LineSLAM. For fairness, the loop closure module was not used in all systems. Table 2 shows the root mean square error (RMSE) of the translation and rotation of the trajectory between the estimate and the ground truth. It shows that our system provided the best performance on four sequences for the translation, except for Home1-5. In addition, our method performed the best on three sequences for rotation, except for Home1-2 and Home1-4. PLVIO also achieved a highest accuracy on Home1-4 and Home1-5, and VINS-MONO had the best performance at rotation error on the home1-2 dataset. There are several detailed error maps on translation and rotation of trajectories estimated by our system, PLVIO and VINS-MONO, as shown in Figure 11. Sensors 2020, 20, x FOR PEER REVIEW 15 of 21  Furthermore, in order to make a more comprehensive demonstration state of our system and PLVIO with more rotation and slow velocity, we also compared the rotation errors of the two system under these two scenarios: Handheld With more Rotation and wheeled slow, and the results are displayed in Figure  10. From the rotation error diagrams shown in the Figure 10, it can be seen that our system has better observations of the motion rotation, and performs better in terms of orientation along these trajectories.

OpenLORIS-Scene Dataset
The OpenLORIS-Scene dataset is the latest SLAM benchmark that recorded the synchronized data of multiple sensor real scenes with a real robot, including: the home, the office, the corridor,the café and the marker scene. All the intrinsic and extrinsic parameters between sensors were well calibrated. The ground truth of trajectory were provided by an OptiTrack MCS and a Hokuyo UTM30LX LiDAR. We used the data collected by the RealSense D435i camera in the home scene, which contains many low textured scenes and motion with rapid rotations. We also compared the proposed system with four   From Table 2 and Figure 11, it can be seen that our system has the higher accuracy on pose estimation in a home low texture scene. Compared with the SLAM system which only relies on point feature or is based on monocular, the proposed system gives a smaller error on translation and rotation of motion and achieves a better robustness.

Running Time Performance Evaluation
In this section, we first evaluate the initialization process of our system and compare the average time of the initialization with PLVIO and VINS-MONO on the STAR-Center dataset and the OpenLORIS-Scene dataset. Table 3 shows the execution time of each dataset. We can observe that PLVIO and VINS-MONO takes a longer time to conduct visual SFM and align visual and IMU measurements during initialization for all eleven datasets. The result shows that our system achieved a faster initialization process. The numbers in bold represent the faster running time.
In order to further demonstrate the real-time improvement of the system, we evaluated the running time of the whole backend the system specifically. We compared it with PL-VIO in all sequences, which is also a VIO system with points and lines. Both of their backend process could be divided into the initialization process and the optimization process. Experiments on each of the sequences measured the running time of the backend process in a sliding window. The results are shown in Table 4. It can be seen from Table 4 that due to the rapidity in the initialization process of our system, the optimization process was also accelerated. We achieved a more real-time backend process and improved the operation efficiency of the whole system. The numbers in bold represent the faster running time.

Conclusions
In this work, a novel tight-coupled RGB-D inertial SLAM with multiple features is proposed, which fuses color images, depth images, and IMU information for motion estimation. Our system contains two main modules: the frontend and the backend. In the frontend, points and lines are detected and matched; IMU states are also propagated by IMU pre-integration. The keyframes are selected by the number of matched features and IMU measurements.
In the backend, first, a simplified global initialization is achieved by directly obtaining 3D landmarks for every frame, resulting in the system having better initial pose estimation. A line feature represented in Plücker coordinates is used for triangulation, and its orthonormal representation is employed to optimize the system state. A factor graph is applied to show the relationships in our system state. In addition, we present a novel nonlinear optimization framework to minimize the residuals of visual features and IMU measurements in sliding windows. Moreover, the Jacobian matrices of the cost function are derived in an effective way, which is crucial for solving the optimization problem. Finally, a comparison with four other state-of-the-art open-source systems on public datasets is provided. The experimental results on two datasets show that our system can achieve the highest pose estimation and robustness of all cases. Meanwhile, specific experiments are carried out, and the results also demonstrate the better performance of our system in terms of rotation error and translation. The experiment regarding initialization confirms that our system achieves a faster initialization than the monocular system, which ensures a good real-time performance of the system. Moreover, we achieved a more real-time backend process and improved the operation efficiency of the whole system. In the future, we plan to find more ways to set up the constraints of 3D line segments, and a real-time dense 3D map will be realized to develop an entire navigation system.