PL-VIO: Tightly-Coupled Monocular Visual–Inertial Odometry Using Point and Line Features

To address the problem of estimating camera trajectory and building a structural three-dimensional (3D) map from inertial measurements and visual observations, this paper proposes point–line visual–inertial odometry (PL-VIO), a tightly-coupled monocular visual–inertial odometry system exploiting both point and line features. Compared with point features, lines provide significantly more geometric structure information about the environment. To obtain both computational simplicity and representational compactness for a 3D spatial line, Plücker coordinates and the orthonormal representation of the line are employed. To tightly and efficiently fuse the information from inertial measurement units (IMUs) and visual sensors, we optimize the states by minimizing a cost function which combines the pre-integrated IMU error term with the point and line re-projection error terms in a sliding window optimization framework. Experiments on public datasets demonstrate that the PL-VIO method, which combines point and line features, outperforms several state-of-the-art VIO systems that use point features only.


Introduction
Localization and navigation have attracted much attention in recent years with respect to a wide range of applications, particularly self-driving cars, service robots, and unmanned aerial vehicles. Several types of sensors are utilized for localization and navigation, such as global navigation satellite systems (GNSSs) [1], lidar [2,3], inertial measurement units (IMUs), and cameras [4,5]. However, each has obvious drawbacks: GNSSs only provide reliable localization information with a clear sky view [6]; lidar suffers from reflection problems on objects with glass surfaces [7]; measurements from civilian IMUs are noisy, so inertial navigation systems drift quickly due to error accumulation [8]; and monocular simultaneous localization and mapping (SLAM) can only recover the motion trajectory up to an unknown scale, and tracking tends to fail when the camera moves fast or the illumination changes dramatically [9][10][11]. As a result, sensor fusion methods, especially visual-inertial navigation systems, have drawn widespread attention [12]. The acceleration and angular velocity information from an IMU can significantly improve a monocular SLAM system [13,14]. Furthermore, both IMUs and cameras are light-weight and low-cost, and as such they are widely used in civilian applications.
According to whether measurements from the sensors are fused directly or indirectly, visual-inertial odometry (VIO) systems can be divided into two main streams: loosely-coupled and tightly-coupled approaches. Loosely-coupled approaches [15,16] process images and IMU measurements with two separate estimators that compute relative motion independently and then fuse the two estimates to obtain the final result; tightly-coupled approaches jointly estimate the states from all sensor measurements in a single estimator.

To build a structural 3D map and recover the camera's motion, we propose the PL-VIO system, which optimizes the system states by jointly minimizing the IMU pre-integration constraints together with the point and line re-projection errors in a sliding window. Compared to traditional methods which use only point features, our method additionally exploits line features, aiming to increase robustness and accuracy in illumination-changing environments. Our main contributions are as follows:

• To the best of our knowledge, the proposed PL-VIO is the first optimization-based monocular VIO system using both points and lines as landmarks.

• To tightly and efficiently fuse the information from visual and inertial sensors, we introduce a sliding window model with IMU pre-integration constraints and point/line features. To represent a 3D spatial line compactly in the optimization, the orthonormal representation of a line is employed. All the Jacobian matrices of the error terms with respect to the IMU body states are derived for solving the sliding window optimization efficiently.

• We compare the performance of the proposed PL-VIO with three state-of-the-art monocular VIO methods, ROVIO [17], OKVIS [18], and VINS-Mono [32], on both the EuRoc dataset and the PennCOSYVIO dataset, and report detailed evaluation results.
The remainder of this paper is organized as follows. First, we describe the mathematical preliminaries in Section 2, and then formulate the sliding window-based visual-inertial fusion method in Section 3. Next, we describe our PL-VIO system and implementation details in Section 4. Section 5 shows the experimental results. Finally, conclusions and potential future work are given in Section 6.

Figure 1 illustrates the visual-inertial sensors and the visual observations of point and line features. We denote by c_i the camera frame at time t = i and by b_i the IMU body frame at the same time; w is the Earth's inertial frame. (·)^c means the vector (·) is expressed in frame c. The quaternion q_xy rotates a vector from frame y to frame x, and its corresponding matrix form is R_xy. The vector p_xy translates a vector from frame y to frame x. The quaternion q_bc and vector p_bc are the extrinsic parameters between the camera frame and the body frame; these extrinsics are known in the provided datasets or calibrated with the Kalibr calibration toolbox [33]. f_j and L_j are the jth point landmark and line landmark in the map, respectively. z denotes a measurement; specifically, z_f_j^c_i is the observation of the jth point feature in the ith camera frame, and z_b_i b_j represents a pre-integrated IMU measurement between two keyframes.

IMU Pre-Integration
A 6-axis IMU, comprising a 3-axis accelerometer and a 3-axis gyroscope, measures the acceleration a and the angular velocity ω of the body frame with respect to the inertial frame [14]. The raw measurements ω̂ and â are affected by bias and white noise:

ω̂^b = ω^b + b_g^b + n_g
â^b = R_bw (a^w + g^w) + b_a^b + n_a

where b_g^b, b_a^b and n_g, n_a are the biases and white noises of the gyroscope and accelerometer, respectively, and g^w = [0, 0, g]ᵀ is the gravity vector in frame w. We use the following kinematics for the IMU-driven system [34]:

ṗ_wb_t = v_t^w,  v̇_t^w = a_t^w,  q̇_wb_t = q_wb_t ⊗ [0, ω^b_t / 2]

where ⊗ denotes the quaternion multiplication operation. Given the IMU body state at time t = i, namely p_wb_i, v_i^w, and q_wb_i, and the series of ω and a over the interval t ∈ [i, j], we can obtain the body state at time t = j by integrating Equation (3):

p_wb_j = p_wb_i + v_i^w ∆t − (1/2) g^w ∆t² + ∬_{t∈[i,j]} R_wb_t (â − b_a) dt²
v_j^w = v_i^w − g^w ∆t + ∫_{t∈[i,j]} R_wb_t (â − b_a) dt
q_wb_j = ∫_{t∈[i,j]} q̇_wb_t dt

where ∆t is the time difference between i and j. In Equation (4), the body state propagation starts from the ith frame b_i, so whenever the state of b_i changes, all the measurements must be re-propagated. Since the body states are adjusted at each iteration during the optimization, Equation (4) is time-consuming. By decomposing q_wb_t into q_wb_i ⊗ q_b_i b_t, Equation (4) can be written in terms of the quantities

α_b_i b_j = ∬_{t∈[i,j]} R_b_i b_t (â − b_a) dt²,  β_b_i b_j = ∫_{t∈[i,j]} R_b_i b_t (â − b_a) dt,  q_b_i b_j = ∫_{t∈[i,j]} q_b_i b_t ⊗ [0, ω^b_t / 2] dt

which are called pre-integration measurements [22] and can be calculated directly without knowing the body states of b_i. This means that when the body state changes, re-propagation is not necessary. We treat the pre-integrated measurements as constraint factors between successive keyframes.
The pre-integration model (Equation (6)) is derived in continuous time and neglects the biases and noises. In practice, IMU measurements are collected at discrete times, and noise should be considered. In this work, we use mid-point integration to integrate the IMU measurements. The IMU body propagation using measurements at discrete moments k and k + 1 is calculated by:

ω̄ = (1/2)(ω̂_k + ω̂_{k+1}) − b_g
q_b_i b_{k+1} = q_b_i b_k ⊗ [(1/2) ω̄ δt, 1]
ā = (1/2)(R_b_i b_k (â_k − b_a) + R_b_i b_{k+1} (â_{k+1} − b_a))
α_b_i b_{k+1} = α_b_i b_k + β_b_i b_k δt + (1/2) ā δt²
β_b_i b_{k+1} = β_b_i b_k + ā δt

At the beginning, k = i, we have q_b_i b_i = [0, 0, 0, 1]ᵀ, and α_b_i b_i, β_b_i b_i are zero vectors. In Equation (7), in order to compute the pre-integration measurements efficiently, we assume the biases are constant between two keyframes. In practice, biases change slowly, so we model them with random walk noises:

ḃ_a = n_b_a,  ḃ_g = n_b_g

where the Gaussian white noises are defined as n_b_a ∼ N(0, σ²_b_a) and n_b_g ∼ N(0, σ²_b_g). When a bias changes by a small increment, instead of re-computing the pre-integrated measurements iteratively, we use a first-order approximation to update q̂_b_i b_j, α̂_b_i b_j, and β̂_b_i b_j [14]:

α̂_b_i b_j ≈ α_b_i b_j + J^α_b_a δb_a + J^α_b_g δb_g
β̂_b_i b_j ≈ β_b_i b_j + J^β_b_a δb_a + J^β_b_g δb_g
q̂_b_i b_j ≈ q_b_i b_j ⊗ [(1/2) J^q_b_g δb_g, 1]

where the J terms are the Jacobian matrices of the pre-integrated measurements with respect to the biases at time i. They can be derived with error state transformation matrices, as shown in Appendix A. The covariance matrix Σ_b_i b_j of the pre-integrated measurements can be computed iteratively with the IMU propagation; more details are provided in Appendix A.
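As a concrete illustration of the mid-point scheme above, the pre-integrated terms α, β, and q between two keyframes can be computed as in the following Python sketch (our own helper names, not the paper's implementation; quaternions use the [x, y, z, w] convention so that q_b_i b_i = [0, 0, 0, 1]):

```python
import numpy as np

def quat_mul(q, r):
    """Hamilton product of quaternions in [x, y, z, w] convention."""
    x1, y1, z1, w1 = q
    x2, y2, z2, w2 = r
    return np.array([w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2,
                     w1*w2 - x1*x2 - y1*y2 - z1*z2])

def quat_to_rot(q):
    """Rotation matrix of a unit quaternion [x, y, z, w]."""
    x, y, z, w = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def preintegrate(omegas, accs, dt, bg, ba):
    """Mid-point pre-integration of gyro/accel samples between two keyframes.
    Returns (alpha, beta, q) expressed relative to the first body frame."""
    q = np.array([0.0, 0.0, 0.0, 1.0])   # q_{b_i b_i}
    alpha, beta = np.zeros(3), np.zeros(3)
    for k in range(len(omegas) - 1):
        w_mid = 0.5 * (omegas[k] + omegas[k + 1]) - bg
        # rotate with the mid-point angular rate
        dq = np.concatenate([0.5 * w_mid * dt, [1.0]])
        q_next = quat_mul(q, dq)
        q_next /= np.linalg.norm(q_next)
        # mid-point specific force, bias-corrected and rotated into frame b_i
        a_mid = 0.5 * (quat_to_rot(q) @ (accs[k] - ba)
                       + quat_to_rot(q_next) @ (accs[k + 1] - ba))
        alpha += beta * dt + 0.5 * a_mid * dt**2
        beta += a_mid * dt
        q = q_next
    return alpha, beta, q
```

For zero rotation and a constant specific force a, the sketch reduces to α = ½aT² and β = aT over a window of length T, which is a convenient sanity check.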

Geometric Representation of Line
A straight line has only four DoFs, so its minimal parameterization uses four parameters. In our system, we treat a straight line in 3D space as an infinite line and adopt two parameterizations, as in [27]: Plücker coordinates, consisting of six parameters, are used for transformation and projection due to their simplicity, while the orthonormal representation, consisting of four parameters, is used for optimization due to its compactness.

Plücker Line Coordinates
In Figure 2a, a 3D spatial line L in Plücker coordinates is represented by L = (nᵀ, dᵀ)ᵀ ∈ R⁶, where d ∈ R³ is the line direction vector, and n ∈ R³ is the normal vector of the plane determined by the line and the coordinate origin. The Plücker coordinates are over-parameterized since there is an implicit constraint between n and d, i.e., nᵀd = 0. Therefore, the Plücker coordinates cannot be used directly in unconstrained optimization. However, with a 3D line represented by a normal vector and a direction vector, it is simple to triangulate the line from two views geometrically, and it is also convenient to model the line geometry transformation. Given the transformation matrix T_cw = [R_cw, p_cw; 0, 1] from the world frame w to the camera frame c, the Plücker coordinates of a line are transformed by [30]:

L^c = [n^c; d^c] = T_cw L^w = [R_cw, [p_cw]_× R_cw; 0, R_cw] [n^w; d^w]

where [·]_× is the skew-symmetric matrix of a three-dimensional vector, and T_cw is the transformation matrix used to transform a line from frame w to frame c. When a new line landmark is observed in two different camera views, the Plücker coordinates are easily calculated, as shown in Figure 2b. Three non-collinear points, namely the two endpoints of a line segment and the camera origin C = [x₀, y₀, z₀]ᵀ, determine a plane π = [π_x, π_y, π_z, π_w]ᵀ in 3D space. Given two such planes π₁ and π₂ in camera frame c₁, the dual Plücker matrix L* can be computed by

L* = π₁π₂ᵀ − π₂π₁ᵀ = [ [d]_×  n; −nᵀ  0 ]  (up to a homogeneous scale)

The Plücker coordinates L in camera frame c₁ are easily extracted from the dual Plücker matrix L*. Note that n and d do not need to be unit vectors.
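A minimal Python sketch of the Plücker representation and the line transformation above (our own helper names; for brevity the line is constructed directly from two 3D points on it, which is equivalent to intersecting the two viewing planes):

```python
import numpy as np

def plucker_from_two_points(p1, p2):
    """Plücker coordinates (n, d) of the line through 3D points p1 and p2:
    d is the direction vector, n = p1 x p2 is the normal of the plane
    spanned by the line and the origin. Note n . d = 0 by construction."""
    d = p2 - p1
    n = np.cross(p1, p2)
    return n, d

def skew(v):
    """Skew-symmetric matrix [v]_x such that [v]_x w = v x w."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def transform_line(n_w, d_w, R_cw, p_cw):
    """Transform a Plücker line from frame w to frame c:
    n_c = R_cw n_w + [p_cw]_x R_cw d_w,  d_c = R_cw d_w."""
    n_c = R_cw @ n_w + skew(p_cw) @ (R_cw @ d_w)
    d_c = R_cw @ d_w
    return n_c, d_c
```

Transforming the two defining points with (R_cw, p_cw) and rebuilding the line gives the same result, which verifies the 6 x 6 line transformation matrix.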

Orthonormal Representation
Since 3D spatial lines have only four DoFs, the orthonormal representation (U, W) ∈ SO(3) × SO(2) is more suitable than Plücker coordinates during optimization. Additionally, the orthonormal representation and Plücker coordinates can be converted into each other, which means we can adopt both of them in a SLAM system for different purposes. In this section, we introduce the details of the orthonormal representation. As shown in Figure 2a, a coordinate system is defined on the 3D line: the normalized normal vector and the normalized direction vector are two of its axes, and the third axis is the cross product of the other two. We define the rotation matrix between the line coordinate frame and the camera frame as U:

U = R(ψ) = [ n/‖n‖,  d/‖d‖,  (n×d)/‖n×d‖ ]

where ψ = [ψ₁, ψ₂, ψ₃]ᵀ consists of the rotation angles around the x-, y-, and z-axes of the camera frame. Since the pair (‖n‖, ‖d‖) is only defined up to a common scale, it has a single DoF, and we can represent it with trigonometric functions:

W = R(φ) = [ cos φ  −sin φ; sin φ  cos φ ],  (w₁, w₂) = (‖n‖, ‖d‖) / √(‖n‖² + ‖d‖²)

where φ is a rotation angle. Recall that the distance from the coordinate origin to the 3D line is ‖n‖/‖d‖, so W encodes this distance. According to the definitions of U and W, the four DoFs consist of three DoFs from the rotation matrix, which transforms the line coordinate frame to the camera frame, and one DoF from the distance. We use O = [ψᵀ, φ]ᵀ as the minimal representation of a 3D spatial line during optimization.
Once a 3D line has been optimized with the orthonormal representation, the corresponding Plücker coordinates can be recovered by:

L′ = [w₁u₁ᵀ, w₂u₂ᵀ]ᵀ

where uᵢ is the ith column of matrix U, w₁ = cos φ, and w₂ = sin φ. There is a scale factor between L′ and L, but they represent the same 3D spatial line.
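The conversion between the two representations can be sketched as below (our own helper names; valid for any Plücker pair satisfying nᵀd = 0, since then u₁ ⟂ u₂ and u₃ = u₁ × u₂ is automatically unit length):

```python
import numpy as np

def plucker_to_orthonormal(n, d):
    """Build (U, W): U in SO(3) from the normalized n, d, and n x d;
    W in SO(2) from the normalized pair (|n|, |d|)."""
    u1 = n / np.linalg.norm(n)
    u2 = d / np.linalg.norm(d)
    u3 = np.cross(u1, u2)            # unit length because u1 is orthogonal to u2
    U = np.column_stack([u1, u2, u3])
    s = np.hypot(np.linalg.norm(n), np.linalg.norm(d))
    w1, w2 = np.linalg.norm(n) / s, np.linalg.norm(d) / s
    W = np.array([[w1, -w2], [w2, w1]])
    return U, W

def orthonormal_to_plucker(U, W):
    """Recover Plücker coordinates up to a scale factor."""
    w1, w2 = W[0, 0], W[1, 0]
    return w1 * U[:, 0], w2 * U[:, 1]
```

The round trip preserves the line: the recovered n and d are parallel to the originals and the ratio ‖n‖/‖d‖ (the distance to the origin) is unchanged.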

Tightly-Coupled Visual-Inertial Fusion
In visual SLAM, bundle adjustment is used to optimize the camera poses and 3D map points by minimizing the re-projection error in the image planes. Bundle adjustment by nonlinear optimization can be treated as a factor graph [35], as shown in Figure 3a: round nodes are the camera poses or 3D landmarks to be optimized, and edges with square boxes represent the visual measurements acting as constraints between nodes. For visual-inertial fusion, we can also use the tightly-coupled graph-based framework to optimize all the state variables of the visual-inertial system [23]. As shown in Figure 3b, the factor graph contains not only the visual measurements but also the pre-integrated IMU measurements as edges that constrain successive IMU body states.

Sliding Window Formulation
In order to achieve both computational efficiency and high accuracy, we use the sliding window formulation for factor graph optimization. The full state vector in a sliding window at time i is defined as:

X = [x_n, x_{n+1}, …, x_{n+N−1}, λ_m, λ_{m+1}, …, λ_{m+M−1}, O_o, O_{o+1}, …, O_{o+O−1}]
x_i = [p_wb_i, v_i^w, q_wb_i, b_a, b_g]

where x_i comprises the ith IMU body position, velocity, and orientation in the world frame, and the biases in the IMU body frame. Subscripts n, m, and o are the start indexes of body states, point landmarks, and line landmarks, respectively. N is the number of keyframes in the sliding window.
M and O are the numbers of point landmarks and line landmarks observed by all keyframes in the sliding window, respectively. We use a single variable, the inverse depth λ_k, to parameterize the kth point landmark relative to the keyframe in which it was first observed. O_l is the orthonormal representation of the lth line feature in the world frame w.
We optimize all the state variables in the sliding window by minimizing the sum of cost terms from all the measurement residuals:

min_X { ‖r_p − J_p δX‖² + Σ_{(i,i+1)∈B} ‖r_b(z_b_i b_{i+1}, X)‖²_{Σ_b_i b_{i+1}} + Σ_{(k,j)∈F} ρ(‖r_f(z_f_k^c_j, X)‖²) + Σ_{(l,j)∈L} ρ(‖r_l(z_L_l^c_j, X)‖²) }

where r_b(z_b_i b_{i+1}, X) is the IMU measurement residual between the body states x_i and x_{i+1}, and B is the set of all pre-integrated IMU measurements in the sliding window. r_f(z_f_k^c_j, X) and r_l(z_L_l^c_j, X) are the point feature re-projection residual and the line feature re-projection residual, respectively, and F and L are the sets of point features and line features observed by the camera frames. {r_p, J_p} is the prior information computed after marginalizing out a frame in the sliding window [23], where J_p is the prior Jacobian matrix taken from the Hessian matrix resulting from the previous optimization. ρ is the Cauchy robust function used to suppress outliers.
We use the Levenberg-Marquardt algorithm to solve the nonlinear optimization problem (20). The optimal state estimate X is found by iteratively updating an initial guess X₀:

X_{t+1} = X_t ⊕ δX

where ⊕ is the operator used to update parameters with the increment δX. For position, velocity, bias, and inverse depth, the update operator is simple vector addition. However, the update for the attitude q is more complicated: a quaternion uses four parameters to represent a three-DoF rotation, so it is over-parameterized, and the rotation increment should be three-dimensional. Similar to [18], we use a perturbation δθ ∈ R³ in the tangent space as the rotation increment, so q is updated by quaternion multiplication:

q ← q ⊗ [ (1/2) δθ, 1 ]

In rotation matrix form this is approximately:

R ← R (I + [δθ]_×)

where I is the 3 × 3 identity matrix. Similarly, the orthonormal representation is updated as:

U ← U R(δψ),  W ← W R(δφ)

with the increment δO = [δψᵀ, δφ]ᵀ. Finally, the increment δX during the optimization stacks all these parameter increments. At each iteration, δX is solved from the normal equations of problem (20):

(H_p + H_b + H_f + H_l) δX = b

where H_p, H_b, H_f, and H_l are the Hessian matrices for the prior residuals, the IMU measurement residuals, and the point and line re-projection residuals, respectively. For each residual r_(·), we have H_(·) = J_(·)ᵀ Σ_(·)⁻¹ J_(·), where J_(·) is the Jacobian matrix of the residual vector with respect to δX and Σ_(·) is the covariance matrix of the measurements. The residuals and Jacobian matrices are defined in the following subsections.
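The quaternion and SO(2) updates can be illustrated as follows (a minimal sketch with our own function names, using the [x, y, z, w] quaternion convention):

```python
import numpy as np

def quat_mul(q, r):
    """Hamilton product of quaternions in [x, y, z, w] convention."""
    x1, y1, z1, w1 = q
    x2, y2, z2, w2 = r
    return np.array([w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2,
                     w1*w2 - x1*x2 - y1*y2 - z1*z2])

def update_rotation(q, delta_theta):
    """q <- q (x) [delta_theta/2, 1], then renormalize: tangent-space update
    of an over-parameterized quaternion with a 3-vector increment."""
    dq = np.concatenate([0.5 * np.asarray(delta_theta), [1.0]])
    q_new = quat_mul(q, dq)
    return q_new / np.linalg.norm(q_new)

def update_W(W, delta_phi):
    """W <- W R(delta_phi) for the SO(2) part of the orthonormal line."""
    c, s = np.cos(delta_phi), np.sin(delta_phi)
    return W @ np.array([[c, -s], [s, c]])
```

Renormalization after the multiplication keeps q on the unit sphere, so repeated small updates never drift off the rotation manifold.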

IMU Measurement Model
Since the pre-integrated IMU measurement computed by Equation (10) is a constraint factor between two successive keyframes b_i and b_j, the IMU measurement residual can be defined as:

r_b(z_b_i b_j, X) = [ R_b_i w (p_wb_j − p_wb_i − v_i^w ∆t + (1/2) g^w ∆t²) − α̂_b_i b_j ;
                     R_b_i w (v_j^w − v_i^w + g^w ∆t) − β̂_b_i b_j ;
                     2 [ q̂_b_i b_j⁻¹ ⊗ q_wb_i⁻¹ ⊗ q_wb_j ]_xyz ;
                     b_a^b_j − b_a^b_i ;
                     b_g^b_j − b_g^b_i ]

where [·]_xyz extracts the imaginary (vector) part of a quaternion, which is used to approximate the three-dimensional rotation error [18].
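A sketch of this residual in Python (our own helper names, [x, y, z, w] quaternion convention; the bias-corrected pre-integrated terms α̂, β̂, q̂ are taken as inputs):

```python
import numpy as np

def quat_mul(q, r):
    """Hamilton product, [x, y, z, w] convention."""
    x1, y1, z1, w1 = q
    x2, y2, z2, w2 = r
    return np.array([w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2,
                     w1*w2 - x1*x2 - y1*y2 - z1*z2])

def quat_conj(q):
    """Inverse of a unit quaternion."""
    return np.array([-q[0], -q[1], -q[2], q[3]])

def imu_residual(state_i, state_j, alpha, beta, q_ij, g_w, dt):
    """15-dim IMU residual (position, velocity, rotation, accel/gyro bias).
    Each state is (p, v, q, ba, bg) with p, v in the world frame."""
    p_i, v_i, q_i, ba_i, bg_i = state_i
    p_j, v_j, q_j, ba_j, bg_j = state_j

    def rotate(q, v):
        # Rotate vector v by quaternion q via q (x) [v, 0] (x) q^-1.
        vq = np.concatenate([v, [0.0]])
        return quat_mul(quat_mul(q, vq), quat_conj(q))[:3]

    # R_{b_i w} = R(q_i)^T, applied here as rotation by q_i^-1
    r_p = rotate(quat_conj(q_i), p_j - p_i - v_i*dt + 0.5*g_w*dt**2) - alpha
    r_v = rotate(quat_conj(q_i), v_j - v_i + g_w*dt) - beta
    r_q = 2.0 * quat_mul(quat_conj(q_ij), quat_mul(quat_conj(q_i), q_j))[:3]
    return np.concatenate([r_p, r_v, r_q, ba_j - ba_i, bg_j - bg_i])
```

States that are exactly consistent with the pre-integrated terms produce a zero residual, which is a useful unit test for the sign conventions.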
During the nonlinear optimization, the Jacobian matrices of the IMU measurement residual with respect to the body states x_i and x_j are computed by differentiating each residual term, where [q]_L and [q]_R denote the left- and right-quaternion-product matrices, respectively [36], and the operator ⌊·⌋_{3×3} extracts the 3 × 3 matrix from the bottom-right block of its argument.

Point Feature Measurement Model
For point features, the re-projection error of a 3D point is the image distance between the projected point and the observed point. In this work, we handle the point and line feature measurements on the normalized image plane. Given the kth point feature measurement z_f_k^c_j = [u_c_j, v_c_j]ᵀ at frame c_j, the re-projection error is defined as:

r_f(z_f_k^c_j, X) = [ x_c_j / z_c_j − u_c_j,  y_c_j / z_c_j − v_c_j ]ᵀ
[x_c_j, y_c_j, z_c_j]ᵀ = T_bc⁻¹ T_wb_j⁻¹ T_wb_i T_bc (1/λ_k) [u_c_i, v_c_i, 1]ᵀ

where z_f_k^c_i = [u_c_i, v_c_i]ᵀ is the first observation of the feature in camera frame c_i, and the inverse depth λ_k of the feature is also defined in camera frame c_i.
In order to minimize the point re-projection error (30), we need to optimize the rotations and positions of frames b_i and b_j as well as the feature inverse depth λ_k. The corresponding Jacobian matrices are obtained by the chain rule, where f^b_i denotes the 3D point vector expressed in the ith IMU body frame. We define the covariance matrix Σ_f_k^c_i of a point measurement as a 2 × 2 diagonal matrix by assuming that the detected point features have pixel noise in both the vertical and horizontal directions of the image plane.
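A minimal sketch of the inverse-depth re-projection residual; for brevity it uses a direct relative transform between the two camera frames rather than chaining through the body frames and extrinsics as the paper does:

```python
import numpy as np

def point_residual(obs_i, obs_j, inv_depth, R_cjci, p_cjci):
    """Re-projection residual of a point parameterized by inverse depth in
    its first observing frame c_i, evaluated in frame c_j.
    obs_i, obs_j: normalized image coordinates [u, v]."""
    # back-project: [u, v, 1] / lambda is the 3D point in frame c_i
    f_ci = np.array([obs_i[0], obs_i[1], 1.0]) / inv_depth
    # transform into frame c_j and project back to the normalized plane
    f_cj = R_cjci @ f_ci + p_cjci
    return f_cj[:2] / f_cj[2] - np.asarray(obs_j)
```

With an identity relative pose the residual vanishes; a translation of the second camera shifts the projection and yields the expected non-zero error.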

Line Feature Measurement Model
The re-projection error of a line measurement is defined as the distance from the observed endpoints to the projected line. For a pin-hole camera model, a 3D spatial line L = [nᵀ, dᵀ]ᵀ can be projected to the camera image plane by [27]:

l = [l₁, l₂, l₃]ᵀ = K_L n

where K_L is the projection matrix for a line feature. When projecting a line onto the normalized image plane, K_L is the identity matrix. From the projection equation (33), the image coordinates of the projected line depend only on the normal vector n.
Given a 3D line L_l^w with orthonormal representation O_l in the world frame, we first transform it to camera frame c_i by Equation (11) and then project it onto the image plane to get the projection line l_l^c_i. The re-projection error in camera frame i is defined as:

r_l(z_L_l^c_i, X) = [ d(s₁, l_l^c_i),  d(s₂, l_l^c_i) ]ᵀ

with d(s, l) the distance function from an endpoint s = [u, v, 1]ᵀ to the projection line l:

d(s, l) = sᵀ l / √(l₁² + l₂²)

The ith body state and the lth line parameters are optimized by minimizing the line re-projection error r_l(z_L_l^c_i). The corresponding Jacobian matrices are obtained by the chain rule; the derivation details are provided in Appendix B. Similar to the point measurement covariance matrix, the covariance matrix Σ_L_l^c_i of a line measurement is defined by assuming that the endpoints of a line segment have pixel noise.
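On the normalized image plane the projected line is simply l = n^c, so the residual can be sketched as:

```python
import numpy as np

def line_residual(s1, s2, n_c):
    """Signed distances from the two observed endpoints (normalized image
    coordinates [u, v]) to the projected infinite line l = [l1, l2, l3],
    which equals the line's normal vector n_c on the normalized plane."""
    l = np.asarray(n_c, dtype=float)
    denom = np.hypot(l[0], l[1])
    def dist(s):
        return (l[0] * s[0] + l[1] * s[1] + l[2]) / denom
    return np.array([dist(s1), dist(s2)])
```

Because only the first two components of l enter the normalization, the residual is the familiar point-to-line distance in the image, evaluated at both endpoints.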

Monocular Visual Inertial Odometry with Point and Line Features
As shown in Figure 4, the proposed PL-VIO system has two main modules: the front end and the back end. The front-end module is used to pre-process the measurements from IMU and camera. The back-end module is used to estimate and optimize the body states. We will introduce the details in the following subsections.

Front End
The front end extracts information from the raw measurements of the camera and IMU. The body state is updated by propagation with each new IMU measurement, and the newest body state is used as the initial value in sliding window optimization. Additionally, the new IMU measurements are pre-integrated to constrain the successive IMU body states during optimization.
As for image processing, the point and line features are detected in two separate threads. When a new frame arrives, the point features are tracked from the previous frame to the new frame by the KLT optical flow algorithm [37]. Then, we use the RANSAC framework with an essential matrix test to remove outliers. If the number of tracked point features falls below a threshold after outlier rejection, new corner features detected by the FAST detector [38] are added. As for the line features, line segments in the new frame are detected by the LSD detector [39] and matched with those in the previous frame by the appearance-based descriptor LBD [40]. We use geometric constraints to remove outliers among the line matches: the distance between the midpoints of two matched lines should be no more than δ_dist pixels, and the angle difference should be no more than δ_angle degrees. After feature detection and matching, the point features and the endpoints of the line segments are projected onto the normalized image plane. Additionally, a frame is selected as a new keyframe if the average parallax of the tracked point features exceeds a threshold.
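The geometric outlier rejection for line matches can be sketched as below (our own function, with the paper's two thresholds as defaults; each line segment is given by its pixel endpoints):

```python
import numpy as np

def filter_line_matches(lines_prev, lines_curr, matches,
                        dist_th=60.0, angle_th=30.0):
    """Keep only matches whose midpoint distance and direction difference
    fall below the thresholds. Each line is (x1, y1, x2, y2) in pixels;
    matches is a list of (index_prev, index_curr) pairs."""
    inliers = []
    for i, j in matches:
        a = np.asarray(lines_prev[i], dtype=float)
        b = np.asarray(lines_curr[j], dtype=float)
        # midpoint distance test
        mid_a, mid_b = (a[:2] + a[2:]) / 2, (b[:2] + b[2:]) / 2
        if np.linalg.norm(mid_a - mid_b) > dist_th:
            continue
        # direction difference test (lines are undirected, so fold to [0, 90])
        ang_a = np.degrees(np.arctan2(a[3] - a[1], a[2] - a[0]))
        ang_b = np.degrees(np.arctan2(b[3] - b[1], b[2] - b[0]))
        diff = abs(ang_a - ang_b) % 180.0
        if min(diff, 180.0 - diff) > angle_th:
            continue
        inliers.append((i, j))
    return inliers
```

Folding the angle difference into [0, 90] degrees makes the test independent of the arbitrary endpoint ordering of each detected segment.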

Back End
In the back-end thread, the points and lines are first triangulated to build the re-projection residuals. To obtain good landmark estimates, the inverse depth of a point feature is estimated using all of its observations. For line triangulation, we choose only the two frames with the largest spatial separation in the sliding window to initialize the Plücker coordinates.
After we obtain the initial estimates of the map points/lines and the IMU body state predicted from the IMU measurements, the sliding window optimization described in Section 3 is used to find the optimal states. To limit the size of the state vector X, a two-way marginalization strategy is adopted to remove states from the sliding window [23]. When the second newest frame x_{n+N−1} is a keyframe, we marginalize out the earliest frame x_n with all its measurements. Otherwise, if the second newest frame is not a keyframe, we discard its visual measurements and keep its IMU measurements in the pre-integration. New prior information is generated from the marginalized measurements, preserving the constraint information from the removed states.
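Marginalization can be illustrated with a Schur complement on the normal equations (a generic linear-algebra sketch, not the paper's exact implementation):

```python
import numpy as np

def marginalize(H, b, keep, marg):
    """Marginalize the states indexed by `marg` out of the normal equations
    H x = b via the Schur complement, producing a dense prior on `keep`."""
    H = np.asarray(H, dtype=float)
    b = np.asarray(b, dtype=float)
    Hkk = H[np.ix_(keep, keep)]
    Hkm = H[np.ix_(keep, marg)]
    Hmm = H[np.ix_(marg, marg)]
    Hmm_inv = np.linalg.inv(Hmm)
    # Schur complement: constraints of the removed states are folded
    # into a (generally dense) prior on the remaining states.
    H_prior = Hkk - Hkm @ Hmm_inv @ Hkm.T
    b_prior = b[keep] - Hkm @ Hmm_inv @ b[marg]
    return H_prior, b_prior
```

For a linear system, solving the reduced equations gives exactly the same estimate for the kept states as solving the full system; the fill-in mentioned in Section 5.3 is visible here as the dense correction term subtracted from Hkk.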
Lastly, we cull the invalid map points and lines. If the inverse depth of a point is negative, we will delete this point from the map. If the re-projection residuals of a line exceed a threshold it will be removed from the map.

Implementation Details
To bootstrap the VIO system, we adopt the visual-inertial alignment method [41] to recover the scale, gravity vector, initial biases, and the velocity of the initial IMU body state. The sliding window contains N = 10 keyframes. Each frame has at most 150 point features and 150 line segments. The thresholds used in line matching are set to δ_dist = 60 pixels and δ_angle = 30°. Since the visual-inertial system has only four unobservable DoFs (the yaw direction and the absolute position), optimizing over all six DoFs of pose may introduce illusory information into the roll and pitch directions by taking steps along these directions to minimize the cost function. After the sliding window optimization, we therefore reset the body state by rotating it back by the increments along the roll and pitch directions. All the numerical optimizations are solved with the Levenberg-Marquardt method in the Ceres solver library [42]. The line detection and matching code is provided by OpenCV 3 [43].

Experimental Results
We evaluated our PL-VIO system using two public benchmark datasets: the EuRoc MAV dataset [44] and the PennCOSYVIO dataset [45]. The accuracy of PL-VIO is compared with that of three state-of-the-art monocular VIO methods: ROVIO [17], OKVIS [18] in monocular mode, and VINS-Mono [32] without loop closure. ROVIO is a tightly-coupled VIO based on the extended Kalman filter (EKF); it directly uses the intensity errors of the images to find the optimal state during the update step. OKVIS is a sliding window optimization algorithm with point features which can work in monocular or stereo mode. VINS-Mono is a complete VIO-SLAM system employing point features to optimize the IMU body states in a sliding window, and it performs loop closure. All of the experiments were performed on a computer with an Intel Core i7-6700HQ CPU at 2.60 GHz, 16 GB RAM, and ROS Kinetic [46].

EuRoc MAV Dataset
The EuRoc micro aerial vehicle (MAV) datasets were collected by an MAV in two indoor scenes; they contain stereo images from a global shutter camera at 20 FPS and synchronized IMU measurements at 200 Hz [44]. Each dataset provides a ground truth trajectory recorded by a VICON motion capture system. All the extrinsic and intrinsic parameters are also provided. In our experiments, we used only the images from the left camera.
The main advantage of line features is that they provide significant geometric structure information about the environment. As an example, we show the reconstructed map built by PL-VIO from the MH_05_difficult sequence in Figure 5. The four images in Figure 5a-d were captured by an MAV flying in a machine hall and show the room's structure. As shown in Figure 5d, line segment detection was more robust than point feature detection in the weakly illuminated scene. The reconstructed 3D map shows that the geometry of the environment is described by the line segments, so semantic information could be extracted from the map, which is useful for robot navigation. For quantitative evaluation, we compared PL-VIO with three state-of-the-art monocular visual-inertial methods: ROVIO [17], OKVIS [18] in monocular mode, and VINS-Mono [32] without loop closure. For a fair comparison, the default parameters provided by the authors of these methods were used. We chose the absolute pose error (APE) as the evaluation metric, which directly compares the estimated trajectory with the ground truth [47]. The open-source package evo (https://michaelgrupp.github.io/evo/) provides an easy-to-use interface for evaluating the trajectories of odometry and SLAM algorithms, and we employed it throughout this section. Table 1 reports the root mean square error (RMSE) of translation and rotation over the whole trajectory, and the corresponding histograms are shown in Figure 6. Table 1 shows that our PL-VIO, which jointly optimizes points and lines, achieved the best rotation performance on eight sequences, the exception being V1_01_easy. Our method also performed best in translation on six sequences, the exceptions being V1_01_easy, V1_02_medium, and V1_03_difficult; there, however, the difference from the best results was only at the submillimeter level.
The results in Table 1 show that integrating line features into VIO can improve the accuracy of motion estimation. To illustrate this intuitively, heat maps of several trajectories estimated by PL-VIO and VINS-Mono are shown in Figure 7: the redder the color, the larger the translation error. Comparing the trajectories, we conclude that PL-VIO with line features gave smaller errors than VINS-Mono when the camera moved with rapid rotation. Furthermore, the sequences with rapid rotation caused large changes in the viewing direction, and their lighting conditions are especially challenging for tracking point features [25,26,28], as shown in Figure 8. Besides accuracy, the computation times of the different methods are listed in Table 2. The filter-based ROVIO had the highest computational efficiency but the lowest accuracy. The proposed PL-VIO was the most time-consuming method, with its computation time dominated by the line detection and matching step. In Section 5.3, the computation times of the different modules of PL-VIO are measured independently, confirming that the efficiency of the system depends directly on line detection and matching.

PennCOSYVIO Dataset
The PennCOSYVIO dataset is a recent VIO benchmark with synchronized data collected by hand-held equipment along a path from outdoors to indoors at a large glass building (see Figure 9) [45]. Challenging factors include illumination changes, rapid rotations, and repetitive structures. All the intrinsic and extrinsic parameters as well as the ground truth trajectory are provided. We used the data collected by the VI-sensor, the same device used in the EuRoc MAV datasets. We compared the proposed PL-VIO with VINS-Mono without loop closure; for a fair evaluation, the same parameters were used for both. The same metric and evaluation tools as in Section 5.1 were employed to evaluate the trajectories, and Table 3 lists the results. Furthermore, the PennCOSYVIO dataset provides its own evaluation tool (https://daniilidis-group.github.io/penncosyvio/), which adopts two metrics, the APE and the relative pose error (RPE); the RPE is expressed as a percentage of the path length [45]. The creators of PennCOSYVIO carefully selected the evaluation parameters, so their tool is well suited for evaluating VIO approaches on this dataset. We therefore also adopted this tool in our experiments; the results are listed in Table 4. Table 4. The results evaluated by the PennCOSYVIO evaluation tool for different algorithms. The rotation errors for the APE and relative pose error (RPE) are expressed in degrees. The translation error is given along the x-, y-, and z-axes; the APE is expressed in meters, while the RPE is expressed in percentages (%). The numbers in bold indicate that the estimated trajectory is closer to the benchmark trajectory.

From Tables 3 and 4, it can be seen that PL-VIO obtained the best performance for the rotation part. The APE of translation evaluated by the PennCOSYVIO tool provides more detail: compared to VINS-Mono, PL-VIO gave smaller errors on the y-axis and z-axis and a smaller error sum over the three axes, while VINS-Mono performed better only on the x-axis.

Computing Time
Finally, we evaluated the average execution time of PL-VIO on the V1_02_medium sequence, since this image sequence was collected in a typical indoor scene. Table 5 shows the execution time of each block. We can see that line detection and matching, which runs at 11 Hz in the front end, is the efficiency bottleneck. State-of-the-art line detection and matching methods, such as the combination of LSD and LBD, are not yet satisfactory for VIO/SLAM systems. Note that our method is independent of the particular line feature detection and matching, so improving their efficiency is beyond the scope of this paper. Marginalization in the back end is another time-consuming part. The inefficiency of marginalization is caused by the fill-in that occurs when marginalizing out features, which makes the Hessian matrix denser. This problem can potentially be mitigated by discarding some features when performing marginalization in order to maintain a sparse Hessian matrix [18].

Conclusions
This paper presents PL-VIO, a novel tightly-coupled monocular visual-inertial odometry algorithm which optimizes the system states in a sliding window with both point and line features. The proposed PL-VIO system has two main modules: the front end and the back end. The front-end module propagates the IMU body state and detects and matches point/line features. The back-end module estimates and optimizes the body states. In the back end, a line landmark is treated as an infinite 3D spatial line, and its orthonormal representation is employed to parameterize it compactly during optimization. Furthermore, all the Jacobian matrices of the error terms are given in detail for solving the sliding window optimization efficiently. We also provide evaluation results comparing PL-VIO to three state-of-the-art monocular VIO methods, ROVIO [17], OKVIS [18], and VINS-Mono [32], on both the EuRoc dataset and the PennCOSYVIO dataset. From the analysis and results, we draw two further conclusions:
1. The reconstructed 3D map with line features provides geometrical information about the environment, from which semantic information could be extracted. This is useful for robot navigation.
2. Line features improve the system accuracy in both translation and rotation, especially in illumination-changing scenes. However, line detection and matching are time-consuming and become the efficiency bottleneck of the system.
In the future, we plan to improve our system by introducing the structural constraints between 3D spatial lines, such as parallel or coplanar lines in Manhattan-world scenes [48]. Geometric constraints among these lines have the potential to further improve localization precision and reduce rotation accumulation errors.
Appendix A

The covariance of the pre-integrated measurements propagates with each IMU sample as:

Σ_b_i b_{k+1} = F_k Σ_b_i b_k F_kᵀ + G_k Σ_n G_kᵀ

where Σ_n is the covariance of the raw IMU measurements, F_k and G_k are the error state transformation matrices, and the initial covariance is Σ_b_i b_i = 0_{15×15}. The Jacobian matrix of the pre-integrated measurements z_b_i b_j with respect to the error state η_i can also be computed iteratively [49]:

J_{i,k+1} = F_k J_{i,k},  with J_ii = I

The Jacobian matrices used in Equation (10) can be extracted from J_ij.

Appendix B
The Jacobian matrix of the line re-projection error with respect to the orthonormal representation is obtained by the chain rule through the projected line l. To compute the line re-projection error, a spatial line in the world frame w is first transformed to the body frame b and then to the camera frame c with the extrinsic parameters T_bc. The Jacobian matrices of the line re-projection error with respect to the rotation and the position of the ith IMU body state follow from the same chain rule.