Article

PL-VIO: Tightly-Coupled Monocular Visual–Inertial Odometry Using Point and Line Features

1 Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 ReadSense Ltd., Shanghai 200040, China
* Author to whom correspondence should be addressed.
Sensors 2018, 18(4), 1159; https://doi.org/10.3390/s18041159
Submission received: 23 March 2018 / Revised: 8 April 2018 / Accepted: 9 April 2018 / Published: 10 April 2018
(This article belongs to the Special Issue Sensor Fusion and Novel Technologies in Positioning and Navigation)

Abstract

To address the problem of estimating camera trajectory and to build a structural three-dimensional (3D) map based on inertial measurements and visual observations, this paper proposes point–line visual–inertial odometry (PL-VIO), a tightly-coupled monocular visual–inertial odometry system exploiting both point and line features. Compared with point features, lines provide significantly more geometrical structure information on the environment. To obtain both computation simplicity and representational compactness of a 3D spatial line, Plücker coordinates and orthonormal representation for the line are employed. To tightly and efficiently fuse the information from inertial measurement units (IMUs) and visual sensors, we optimize the states by minimizing a cost function which combines the pre-integrated IMU error term together with the point and line re-projection error terms in a sliding window optimization framework. The experiments evaluated on public datasets demonstrate that the PL-VIO method that combines point and line features outperforms several state-of-the-art VIO systems which use point features only.

1. Introduction

Localization and navigation have attracted much attention in recent years across a wide range of applications, particularly self-driving cars, service robots, and unmanned aerial vehicles. Several types of sensors are utilized for localization and navigation, such as global navigation satellite systems (GNSSs) [1], lidar [2,3], inertial measurement units (IMUs), and cameras [4,5]. However, each has obvious drawbacks: GNSSs only provide reliable localization information under a clear sky view [6]; lidar suffers from reflection problems on objects with glass surfaces [7]; measurements from civilian IMUs are noisy, so inertial navigation systems drift quickly due to error accumulation [8]; and monocular simultaneous localization and mapping (SLAM) can only recover the motion trajectory up to an unknown scale, and tends to lose tracking when the camera moves fast or the illumination changes dramatically [9,10,11]. As a result, sensor fusion methods, especially visual–inertial navigation systems, have drawn widespread attention [12]. The acceleration and angular velocity information from an IMU can significantly improve a monocular SLAM system [13,14]. Furthermore, both IMUs and cameras are light-weight and low-cost, so they are widely used in civilian applications.
Depending on whether the raw sensor measurements are fused directly or indirectly, visual–inertial odometry (VIO) systems can be divided into two main categories: loosely-coupled and tightly-coupled approaches. Loosely-coupled approaches [15,16] process the images and the IMU measurements with two separate estimators and then fuse the two motion estimates to obtain the final result. Tightly-coupled approaches [17,18] use a single estimator that finds the optimal estimate by fusing the raw measurements from the camera and the IMU directly. Compared to loosely-coupled approaches, tightly-coupled approaches are generally more accurate and robust. The proposed PL-VIO method is a tightly-coupled VIO system. Related works on tightly-coupled VIO can be categorized by the number of times the measurement model is linearized [14]. Approaches based on the extended Kalman filter (EKF) process each measurement only once in the update step, while batch nonlinear optimization approaches linearize multiple times during optimization. Filtering-based approaches [19,20] integrate IMU measurements to propagate/predict the state, and then update/correct the latest state with visual measurements. Since the coordinates of three-dimensional (3D) landmarks are included in the state vector, the computational complexity of the EKF increases quadratically with the number of landmarks. To address this problem, Mourikis and Roumeliotis [21] proposed the multi-state constraint Kalman filter (MSCKF), which marginalizes the landmark coordinates out of the state vector. A drawback of this method is that a landmark's measurements can only be used to update the state after the landmark has moved out of the camera's view, which means that not all current visual measurements are used in the filter. Furthermore, linearization errors make the filter inconsistent [14].
Optimization-based approaches obtain the optimal estimate by minimizing a joint nonlinear cost function containing IMU measurement residuals and visual re-projection residuals. Thus, they can re-linearize the state vector at different points and achieve higher accuracy than filtering-based methods [14]. The IMU measurement constraints are computed by integrating the IMU measurements between frames. However, the standard IMU integration is tied to the IMU body state at the first frame: whenever the estimated state changes, all integrated IMU measurements need to be re-calculated. Lupton and Sukkarieh [22] proposed an IMU pre-integration technique which avoids such repeated integrations, and IMU pre-integration has been widely used in optimization-based VIO [18,23,24]. Forster et al. [14] reformulated IMU pre-integration by treating the rotation group on a manifold instead of using Euler angles. Liu et al. [13] proposed a continuous pre-integration method. Although optimization-based approaches achieve high accuracy, the computation becomes expensive as more and more landmarks are added into the optimization. OKVIS [18] uses a sliding window to bound the computation by marginalizing out the measurements associated with the oldest state. Shen et al. [23] proposed a two-way marginalization to selectively marginalize body states and landmarks.
Although significant achievements have been made in the VIO area, most VIO systems use only point features as visual information. However, point detection in textureless environments and point tracking in illumination-changing scenes are challenging [25,26]. Line segments are a suitable alternative in such scenes, and they provide more structural information about the environment than points [27]. For visual-only SLAM, several works combine point and line features to estimate camera motion [28,29]. The simplest way to integrate line features into a SLAM system is to represent a line by its two endpoints, but matching the endpoints of a line across different views is difficult. Furthermore, a 3D spatial line has only four degrees-of-freedom (DoFs), while two 3D endpoints introduce six parameters, which results in over-parameterization. Bartoli and Sturm [30] proposed the orthonormal representation, which uses a three-DoF rotation matrix and a one-DoF rotation matrix to update the line parameters during optimization. The orthonormal representation has been used in some stereo visual SLAM systems [27,31]. For VIO, Kottas and Roumeliotis [26] investigated the observability of VIO using line features only. Kong et al. [25] built a stereo VIO system combining point and line features by utilizing trifocal geometry. However, these works are filtering-based VIO. In our proposed PL-VIO method, we integrate line features into an optimization framework in order to achieve higher accuracy than filtering-based methods.
To build a structural 3D map and obtain the camera's motion, we propose the PL-VIO system, which optimizes the system states by jointly minimizing the IMU pre-integration constraints together with the point and line re-projection errors in a sliding window. Compared to traditional methods which use only point features, our method additionally utilizes line features, aiming to increase robustness and accuracy in illumination-changing environments. Our main contributions are as follows:
  • To the best of our knowledge, the proposed PL-VIO is the first optimization-based monocular VIO system using both points and lines as landmarks.
  • To tightly and efficiently fuse the information from visual and inertial sensors, we introduce a sliding window model with IMU pre-integration constraints and point/line features. To represent a 3D spatial line compactly in optimization, the orthonormal representation for a line is employed. All the Jacobian matrices of error terms with respect to IMU body states are derived for solving the sliding window optimization efficiently.
  • We compare the performances of the proposed PL-VIO with three state-of-the-art monocular VIO methods including ROVIO [17], OKVIS [18], and VINS-Mono [32] on both the EuRoc dataset and the PennCOSYVIO dataset, for which detailed evaluation results are reported.
The remainder of this paper is organized as follows. First, we describe the mathematical preliminaries in Section 2, and then formulate the sliding window-based visual–inertial fusion method in Section 3. Next, we describe our PL-VIO system and implementation details in Section 4. Section 5 shows the experimental results. Finally, conclusions and potential future works are given in Section 6.

2. Mathematical Formulation

2.1. Notations

Figure 1 illustrates the visual–inertial sensors and the visual observations for point and line features. We denote $c_i$ as the camera frame at time $t = i$ and $b_i$ as the IMU body frame at the same time. $w$ is the Earth's inertial frame. $(\cdot)^c$ means the vector $(\cdot)$ is expressed in frame $c$. The quaternion $\mathbf{q}_{xy}$ rotates a vector from frame $y$ to frame $x$, and its matrix form is $\mathbf{R}_{xy}$. We use $\mathbf{p}_{xy}$ to translate a vector from frame $y$ to frame $x$. The quaternion $\mathbf{q}_{bc}$ and the vector $\mathbf{p}_{bc}$ are the extrinsic parameters between the camera frame and the body frame; they are provided in the datasets used here or can be calibrated with the Kalibr calibration toolbox [33]. $f_j$ and $\mathcal{L}_j$ are the $j$th point landmark and line landmark in the map, respectively. $\mathbf{z}$ represents a measurement; specifically, $\mathbf{z}_{f_j}^{c_i}$ is the $j$th point feature observed by the $i$th camera frame, and $\mathbf{z}_{b_i b_j}$ represents a pre-integrated IMU measurement between two keyframes.

2.2. IMU Pre-Integration

A 6-axis IMU, comprising a 3-axis accelerometer and a 3-axis gyroscope, measures the acceleration $\mathbf{a}$ and the angular velocity $\boldsymbol{\omega}$ of the body frame with respect to the inertial frame [14]. The raw measurements $\hat{\boldsymbol{\omega}}$ and $\hat{\mathbf{a}}$ are affected by bias and white noise:

$$\hat{\boldsymbol{\omega}}^{b} = \boldsymbol{\omega}^{b} + \mathbf{b}_g^{b} + \mathbf{n}_g^{b} \tag{1}$$

$$\hat{\mathbf{a}}^{b} = \mathbf{R}_{bw}\left(\mathbf{a}^{w} + \mathbf{g}^{w}\right) + \mathbf{b}_a^{b} + \mathbf{n}_a^{b} \tag{2}$$

where $\mathbf{b}_g^{b}, \mathbf{b}_a^{b}$ and $\mathbf{n}_g^{b}, \mathbf{n}_a^{b}$ are the biases and white noises of the gyroscope and accelerometer, respectively. $\mathbf{g}^{w} = [0, 0, g]^{\top}$ is the gravity vector in frame $w$. We use the following kinematics for the IMU-driven system [34]:

$$\dot{\mathbf{p}}_{wb_t} = \mathbf{v}_t^{w}, \qquad \dot{\mathbf{v}}_t^{w} = \mathbf{a}_t^{w}, \qquad \dot{\mathbf{q}}_{wb_t} = \mathbf{q}_{wb_t} \otimes \begin{bmatrix} 0 \\ \tfrac{1}{2}\boldsymbol{\omega}^{b_t} \end{bmatrix} \tag{3}$$
where ⊗ denotes the quaternion multiplication operation.
Given the IMU body state at time $t = i$, namely $\mathbf{p}_{wb_i}$, $\mathbf{v}_i^{w}$, $\mathbf{q}_{wb_i}$, and the values of $\boldsymbol{\omega}$ and $\mathbf{a}$ over the interval $t \in [i, j]$, we can obtain the body state at time $t = j$ by integrating Equation (3):

$$\begin{aligned}
\mathbf{p}_{wb_j} &= \mathbf{p}_{wb_i} + \mathbf{v}_i^{w}\Delta t + \iint_{t\in[i,j]} \left(\mathbf{R}_{wb_t}\mathbf{a}^{b_t} - \mathbf{g}^{w}\right)\delta t^2 \\
\mathbf{v}_j^{w} &= \mathbf{v}_i^{w} + \int_{t\in[i,j]} \left(\mathbf{R}_{wb_t}\mathbf{a}^{b_t} - \mathbf{g}^{w}\right)\delta t \\
\mathbf{q}_{wb_j} &= \int_{t\in[i,j]} \mathbf{q}_{wb_t} \otimes \begin{bmatrix} 0 \\ \tfrac{1}{2}\boldsymbol{\omega}^{b_t} \end{bmatrix} \delta t
\end{aligned} \tag{4}$$

where $\Delta t$ is the time difference between $i$ and $j$. In Equation (4), the body state propagation starts from the $i$th frame $b_i$. When the state of $b_i$ changes, all the measurements need to be re-propagated. Since the body states are adjusted at each iteration during the optimization, Equation (4) is time-consuming. By decomposing $\mathbf{q}_{wb_t}$ into $\mathbf{q}_{wb_i} \otimes \mathbf{q}_{b_i b_t}$, Equation (4) can be written as:

$$\begin{aligned}
\mathbf{p}_{wb_j} &= \mathbf{p}_{wb_i} + \mathbf{v}_i^{w}\Delta t - \tfrac{1}{2}\mathbf{g}^{w}\Delta t^2 + \mathbf{R}_{wb_i}\boldsymbol{\alpha}_{b_i b_j} \\
\mathbf{v}_j^{w} &= \mathbf{v}_i^{w} - \mathbf{g}^{w}\Delta t + \mathbf{R}_{wb_i}\boldsymbol{\beta}_{b_i b_j} \\
\mathbf{q}_{wb_j} &= \mathbf{q}_{wb_i} \otimes \mathbf{q}_{b_i b_j}
\end{aligned} \tag{5}$$
where

$$\begin{aligned}
\boldsymbol{\alpha}_{b_i b_j} &= \iint_{t\in[i,j]} \left(\mathbf{R}_{b_i b_t}\mathbf{a}^{b_t}\right)\delta t^2 \\
\boldsymbol{\beta}_{b_i b_j} &= \int_{t\in[i,j]} \left(\mathbf{R}_{b_i b_t}\mathbf{a}^{b_t}\right)\delta t \\
\mathbf{q}_{b_i b_j} &= \int_{t\in[i,j]} \mathbf{q}_{b_i b_t} \otimes \begin{bmatrix} 0 \\ \tfrac{1}{2}\boldsymbol{\omega}^{b_t} \end{bmatrix} \delta t
\end{aligned} \tag{6}$$

$\mathbf{z}_{b_i b_j} = [\boldsymbol{\alpha}_{b_i b_j}, \boldsymbol{\beta}_{b_i b_j}, \mathbf{q}_{b_i b_j}]$ are called pre-integrated measurements [22] and can be calculated directly without knowing the body state of $b_i$, which means that when the body state changes, re-propagation is not necessary. We treat the pre-integrated measurements as constraint factors between successive keyframes.
The pre-integration model (Equation (6)) is derived in continuous time and neglects the biases and noises. In practice, IMU measurements are collected at discrete times and the noise must be considered. In this work, we use mid-point integration to integrate the IMU measurements. The IMU body propagation using the measurements at discrete moments $k$ and $k+1$ is calculated by:

$$\begin{aligned}
\hat{\boldsymbol{\omega}} &= \tfrac{1}{2}\left(\left(\hat{\boldsymbol{\omega}}^{b_k} - \mathbf{b}_g^{b_k} - \mathbf{n}_g^{b_k}\right) + \left(\hat{\boldsymbol{\omega}}^{b_{k+1}} - \mathbf{b}_g^{b_k} - \mathbf{n}_g^{b_{k+1}}\right)\right) \\
\hat{\mathbf{q}}_{b_i b_{k+1}} &= \hat{\mathbf{q}}_{b_i b_k} \otimes \begin{bmatrix} 1 \\ \tfrac{1}{2}\hat{\boldsymbol{\omega}}\,\delta t \end{bmatrix} \\
\hat{\mathbf{a}} &= \tfrac{1}{2}\left(\mathbf{R}_{b_i b_k}\left(\hat{\mathbf{a}}^{b_k} - \mathbf{b}_a^{b_k} - \mathbf{n}_a^{b_k}\right) + \mathbf{R}_{b_i b_{k+1}}\left(\hat{\mathbf{a}}^{b_{k+1}} - \mathbf{b}_a^{b_{k+1}} - \mathbf{n}_a^{b_{k+1}}\right)\right) \\
\hat{\boldsymbol{\alpha}}_{b_i b_{k+1}} &= \hat{\boldsymbol{\alpha}}_{b_i b_k} + \hat{\boldsymbol{\beta}}_{b_i b_k}\,\delta t + \tfrac{1}{2}\hat{\mathbf{a}}\,\delta t^2 \\
\hat{\boldsymbol{\beta}}_{b_i b_{k+1}} &= \hat{\boldsymbol{\beta}}_{b_i b_k} + \hat{\mathbf{a}}\,\delta t
\end{aligned} \tag{7}$$

At the beginning, $k = i$, we have $\mathbf{q}_{b_i b_i} = [0, 0, 0, 1]^{\top}$, and $\boldsymbol{\alpha}_{b_i b_i}$, $\boldsymbol{\beta}_{b_i b_i}$ are zero vectors. In Equation (7), in order to compute the pre-integrated measurements efficiently, we assume the biases are constant between two keyframes:

$$\mathbf{b}_a^{b_k} = \mathbf{b}_a^{b_{k+1}}, \qquad \mathbf{b}_g^{b_k} = \mathbf{b}_g^{b_{k+1}}, \qquad k \in [i, j-1] \tag{8}$$
In practice, the biases change slowly, so we model them with random walk noise:

$$\mathbf{b}_a^{b_{k+1}} = \mathbf{b}_a^{b_k} + \mathbf{n}_{b_a}\delta t, \qquad \mathbf{b}_g^{b_{k+1}} = \mathbf{b}_g^{b_k} + \mathbf{n}_{b_g}\delta t \tag{9}$$

where the Gaussian white noises are defined as $\mathbf{n}_{b_a} \sim \mathcal{N}(0, \sigma_{b_a}^2)$ and $\mathbf{n}_{b_g} \sim \mathcal{N}(0, \sigma_{b_g}^2)$. When the bias changes by a small increment, instead of re-computing the pre-integrated measurements iteratively, we use a first-order approximation to update $\hat{\mathbf{q}}_{b_i b_j}$, $\hat{\boldsymbol{\alpha}}_{b_i b_j}$, $\hat{\boldsymbol{\beta}}_{b_i b_j}$ [14]:

$$\begin{aligned}
\hat{\boldsymbol{\alpha}}_{b_i b_j} &\approx \hat{\boldsymbol{\alpha}}_{b_i b_j} + \mathbf{J}^{\alpha}_{b_{a_i}}\delta\mathbf{b}_a^{b_i} + \mathbf{J}^{\alpha}_{b_{g_i}}\delta\mathbf{b}_g^{b_i} \\
\hat{\boldsymbol{\beta}}_{b_i b_j} &\approx \hat{\boldsymbol{\beta}}_{b_i b_j} + \mathbf{J}^{\beta}_{b_{a_i}}\delta\mathbf{b}_a^{b_i} + \mathbf{J}^{\beta}_{b_{g_i}}\delta\mathbf{b}_g^{b_i} \\
\hat{\mathbf{q}}_{b_i b_j} &\approx \hat{\mathbf{q}}_{b_i b_j} \otimes \begin{bmatrix} 1 \\ \tfrac{1}{2}\mathbf{J}^{q}_{b_{g_i}}\delta\mathbf{b}_g^{b_i} \end{bmatrix}
\end{aligned} \tag{10}$$

where $\mathbf{J}^{\alpha}_{b_{a_i}} = \frac{\partial\boldsymbol{\alpha}_{b_i b_j}}{\partial\delta\mathbf{b}_a^{b_i}}$, $\mathbf{J}^{\alpha}_{b_{g_i}} = \frac{\partial\boldsymbol{\alpha}_{b_i b_j}}{\partial\delta\mathbf{b}_g^{b_i}}$, $\mathbf{J}^{\beta}_{b_{a_i}} = \frac{\partial\boldsymbol{\beta}_{b_i b_j}}{\partial\delta\mathbf{b}_a^{b_i}}$, $\mathbf{J}^{\beta}_{b_{g_i}} = \frac{\partial\boldsymbol{\beta}_{b_i b_j}}{\partial\delta\mathbf{b}_g^{b_i}}$, and $\mathbf{J}^{q}_{b_{g_i}} = \frac{\partial\mathbf{q}_{b_i b_j}}{\partial\delta\mathbf{b}_g^{b_i}}$ are the Jacobian matrices of the pre-integrated measurements with respect to the biases at time $i$. They can be derived with the error state transition matrices, as shown in Appendix A. The covariance matrix $\boldsymbol{\Sigma}_{b_i b_j}$ of the pre-integrated measurements can be computed iteratively along with the IMU propagation; more details are provided in Appendix A.
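As a concrete illustration of the mid-point propagation in Equation (7), the following is a minimal C++/Eigen sketch of one pre-integration step. The struct and member names are ours, and the noise terms are dropped from the nominal propagation; they enter only through the covariance propagation described in Appendix A.

```cpp
#include <Eigen/Dense>

// One mid-point pre-integration step (Equation (7)). Noise terms are omitted
// from the nominal values; biases are held constant over the interval (Equation (8)).
struct PreIntegration {
  Eigen::Vector3d alpha = Eigen::Vector3d::Zero();          // position pre-integral
  Eigen::Vector3d beta  = Eigen::Vector3d::Zero();          // velocity pre-integral
  Eigen::Quaterniond q  = Eigen::Quaterniond::Identity();   // rotation q_{b_i b_k}
  Eigen::Vector3d ba = Eigen::Vector3d::Zero();              // accelerometer bias
  Eigen::Vector3d bg = Eigen::Vector3d::Zero();              // gyroscope bias

  void propagate(const Eigen::Vector3d& acc0, const Eigen::Vector3d& gyr0,
                 const Eigen::Vector3d& acc1, const Eigen::Vector3d& gyr1,
                 double dt) {
    // Mid-point angular rate and the incremental rotation q_{b_k b_{k+1}}.
    Eigen::Vector3d w = 0.5 * (gyr0 + gyr1) - bg;
    Eigen::Quaterniond dq(1.0, 0.5 * w.x() * dt, 0.5 * w.y() * dt, 0.5 * w.z() * dt);
    Eigen::Quaterniond q_next = (q * dq).normalized();

    // Mid-point acceleration expressed in frame b_i.
    Eigen::Vector3d a = 0.5 * (q * (acc0 - ba) + q_next * (acc1 - ba));

    alpha += beta * dt + 0.5 * a * dt * dt;
    beta  += a * dt;
    q = q_next;
  }
};
```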

2.3. Geometric Representation of Line

A straight line has only four DoFs; thus, a compact parameterization of a straight line uses four parameters. In our system, we treat a straight line in 3D space as an infinite line and adopt two parameterizations, as in [27]: Plücker coordinates, consisting of six parameters, are used for transformation and projection due to their simplicity, while the orthonormal representation, consisting of four parameters, is used during optimization due to its compactness.

2.3.1. Plücker Line Coordinates

In Figure 2a, a 3D spatial line $\mathcal{L}$ in Plücker coordinates is represented by $\mathcal{L} = (\mathbf{n}^{\top}, \mathbf{d}^{\top})^{\top} \in \mathbb{R}^6$, where $\mathbf{d} \in \mathbb{R}^3$ is the line direction vector and $\mathbf{n} \in \mathbb{R}^3$ is the normal vector of the plane determined by the line and the coordinate origin. The Plücker coordinates are over-parameterized, since there is an implicit constraint between $\mathbf{n}$ and $\mathbf{d}$, namely $\mathbf{n}^{\top}\mathbf{d} = 0$. Therefore, the Plücker coordinates cannot be used directly in unconstrained optimization. However, with a 3D line represented by a normal vector and a direction vector, it is simple to triangulate the line from two views geometrically, and it is also convenient to model the line geometry transformation.
For the line geometry transformation, given the transformation matrix $\mathbf{T}_{cw} = \begin{bmatrix} \mathbf{R}_{cw} & \mathbf{p}_{cw} \\ \mathbf{0} & 1 \end{bmatrix}$ from the world frame $w$ to the camera frame $c$, we can transform the Plücker coordinates of a line by [30]

$$\mathcal{L}^{c} = \begin{bmatrix} \mathbf{n}^{c} \\ \mathbf{d}^{c} \end{bmatrix} = \mathcal{T}_{cw}\mathcal{L}^{w} = \begin{bmatrix} \mathbf{R}_{cw} & [\mathbf{p}_{cw}]_{\times}\mathbf{R}_{cw} \\ \mathbf{0} & \mathbf{R}_{cw} \end{bmatrix}\mathcal{L}^{w} \tag{11}$$

where $[\cdot]_{\times}$ is the skew-symmetric matrix of a three-dimensional vector, and $\mathcal{T}_{cw}$ is the matrix used to transform a line from frame $w$ to frame $c$.
When a new line landmark is observed in two different camera views, its Plücker coordinates are easily computed. As shown in Figure 2b, the 3D line $\mathcal{L}$ is observed by cameras $c_1$ and $c_2$ as $\mathbf{z}_{\mathcal{L}}^{c_1}$ and $\mathbf{z}_{\mathcal{L}}^{c_2}$, respectively. The line segment $\mathbf{z}_{\mathcal{L}}^{c_1}$ in the normalized image plane can be represented by its two endpoints, $\mathbf{s}^{c_1} = [u_s, v_s, 1]^{\top}$ and $\mathbf{e}^{c_1} = [u_e, v_e, 1]^{\top}$. Three non-collinear points, namely the two endpoints of the line segment and the camera center $\mathbf{C} = [x_0, y_0, z_0]^{\top}$, determine a plane $\boldsymbol{\pi} = [\pi_x, \pi_y, \pi_z, \pi_w]^{\top}$ in 3D space:

$$\pi_x(x - x_0) + \pi_y(y - y_0) + \pi_z(z - z_0) = 0 \tag{12}$$

where

$$\begin{bmatrix} \pi_x \\ \pi_y \\ \pi_z \end{bmatrix} = [\mathbf{s}^{c_1}]_{\times}\mathbf{e}^{c_1}, \qquad \pi_w = \pi_x x_0 + \pi_y y_0 + \pi_z z_0 \tag{13}$$

Given the two planes $\boldsymbol{\pi}_1$ and $\boldsymbol{\pi}_2$ in camera frame $c_1$, the dual Plücker matrix $\mathcal{L}^{*}$ can be computed by

$$\mathcal{L}^{*} = \begin{bmatrix} [\mathbf{d}]_{\times} & \mathbf{n} \\ -\mathbf{n}^{\top} & 0 \end{bmatrix} = \boldsymbol{\pi}_1\boldsymbol{\pi}_2^{\top} - \boldsymbol{\pi}_2\boldsymbol{\pi}_1^{\top} \in \mathbb{R}^{4\times 4} \tag{14}$$

The Plücker coordinates $\mathcal{L}$ in camera frame $c_1$ are then easily extracted from the dual Plücker matrix $\mathcal{L}^{*}$. Note that $\mathbf{n}$ and $\mathbf{d}$ do not need to be unit vectors.
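The triangulation of Equations (12)–(14) can be sketched in a few lines of C++/Eigen. The sketch assumes both observations have already been expressed as planes in the same camera frame $c_1$ (for the second view this requires transforming its endpoints and camera center into $c_1$); the function names are ours, and the sign convention for $\pi_w$ follows the definition in Equation (13).

```cpp
#include <Eigen/Dense>

// Plane through the camera center C and a segment (s, e) on the normalized
// image plane, pi = [pi_x, pi_y, pi_z, pi_w] as in Equations (12)-(13).
Eigen::Vector4d planeFromSegment(const Eigen::Vector3d& s, const Eigen::Vector3d& e,
                                 const Eigen::Vector3d& C) {
  Eigen::Vector3d n = s.cross(e);                   // [s]_x e
  return Eigen::Vector4d(n.x(), n.y(), n.z(), n.dot(C));
}

// Pluecker coordinates (n, d) from two planes expressed in the same frame,
// read off the dual Pluecker matrix of Equation (14).
void pluckerFromPlanes(const Eigen::Vector4d& pi1, const Eigen::Vector4d& pi2,
                       Eigen::Vector3d& n, Eigen::Vector3d& d) {
  Eigen::Matrix4d Lstar = pi1 * pi2.transpose() - pi2 * pi1.transpose();
  n = Lstar.block<3, 1>(0, 3);                      // column next to the skew block
  d = Eigen::Vector3d(Lstar(2, 1), Lstar(0, 2), Lstar(1, 0));  // entries of [d]_x
}
```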

2.3.2. Orthonormal Representation

Since a 3D spatial line has only four DoFs, the orthonormal representation $(\mathbf{U}, \mathbf{W}) \in SO(3) \times SO(2)$ is more suitable than the Plücker coordinates during optimization. Additionally, the orthonormal representation and the Plücker coordinates can be converted into each other, which means that we can use both of them in a SLAM system for different purposes. In this section, we introduce the details of the orthonormal representation. As shown in Figure 2a, a coordinate system is defined on the 3D line: the normalized normal vector and the normalized direction vector are two of its axes, and the third axis is obtained by their cross product. We can define the rotation matrix $\mathbf{U}$ between the line coordinate frame and the camera frame as:

$$\mathbf{U} = \mathbf{R}(\boldsymbol{\psi}) = \begin{bmatrix} \dfrac{\mathbf{n}}{\|\mathbf{n}\|} & \dfrac{\mathbf{d}}{\|\mathbf{d}\|} & \dfrac{\mathbf{n}\times\mathbf{d}}{\|\mathbf{n}\times\mathbf{d}\|} \end{bmatrix} \tag{15}$$

where $\boldsymbol{\psi} = [\psi_1, \psi_2, \psi_3]^{\top}$ consists of the rotation angles around the $x$-, $y$-, and $z$-axes of the camera frame. The relationship between the Plücker coordinates and $\mathbf{U}$ is:

$$\begin{bmatrix} \mathbf{n} & \mathbf{d} \end{bmatrix} = \begin{bmatrix} \dfrac{\mathbf{n}}{\|\mathbf{n}\|} & \dfrac{\mathbf{d}}{\|\mathbf{d}\|} & \dfrac{\mathbf{n}\times\mathbf{d}}{\|\mathbf{n}\times\mathbf{d}\|} \end{bmatrix}\begin{bmatrix} \|\mathbf{n}\| & 0 \\ 0 & \|\mathbf{d}\| \\ 0 & 0 \end{bmatrix} \tag{16}$$
Since the pair $(\|\mathbf{n}\|, \|\mathbf{d}\|)$ in Equation (16) has only one DoF, we can represent it with trigonometric functions:

$$\mathbf{W} = \begin{bmatrix} \cos(\phi) & -\sin(\phi) \\ \sin(\phi) & \cos(\phi) \end{bmatrix} = \frac{1}{\sqrt{\|\mathbf{n}\|^2 + \|\mathbf{d}\|^2}}\begin{bmatrix} \|\mathbf{n}\| & -\|\mathbf{d}\| \\ \|\mathbf{d}\| & \|\mathbf{n}\| \end{bmatrix} \tag{17}$$

where $\phi$ is a rotation angle. Recall that the distance from the coordinate origin to the 3D line is $d = \|\mathbf{n}\| / \|\mathbf{d}\|$, so $\mathbf{W}$ encodes the information about this distance. According to the definitions of $\mathbf{U}$ and $\mathbf{W}$, the four DoFs consist of three DoFs from the rotation matrix, which transforms the line coordinate frame to the camera frame, and one DoF related to the distance $d$. We use $\mathcal{O} = [\boldsymbol{\psi}, \phi]$ as the minimal representation of a 3D spatial line during optimization.
Once a 3D line has been optimized in the orthonormal representation, the corresponding Plücker coordinates can be recovered by:

$$\mathcal{L}' = \begin{bmatrix} w_1\mathbf{u}_1^{\top}, \; w_2\mathbf{u}_2^{\top} \end{bmatrix}^{\top} = \frac{1}{\sqrt{\|\mathbf{n}\|^2 + \|\mathbf{d}\|^2}}\mathcal{L} \tag{18}$$

where $\mathbf{u}_i$ is the $i$th column of $\mathbf{U}$, $w_1 = \cos(\phi)$, and $w_2 = \sin(\phi)$. There is a scale factor between $\mathcal{L}'$ and $\mathcal{L}$, but they represent the same 3D spatial line.
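The two parameterizations can be converted into each other with a few lines of code; the following C++/Eigen sketch implements Equations (15)–(18) (function names are ours).

```cpp
#include <Eigen/Dense>
#include <cmath>

// Pluecker (n, d) -> orthonormal representation (U, W), Equations (15)-(17).
void pluckerToOrthonormal(const Eigen::Vector3d& n, const Eigen::Vector3d& d,
                          Eigen::Matrix3d& U, Eigen::Matrix2d& W) {
  U.col(0) = n.normalized();
  U.col(1) = d.normalized();
  U.col(2) = n.cross(d).normalized();

  double s = std::sqrt(n.squaredNorm() + d.squaredNorm());
  W << n.norm() / s, -d.norm() / s,
       d.norm() / s,  n.norm() / s;   // [cos(phi) -sin(phi); sin(phi) cos(phi)]
}

// Orthonormal (U, W) -> Pluecker coordinates up to a common scale, Equation (18).
void orthonormalToPlucker(const Eigen::Matrix3d& U, const Eigen::Matrix2d& W,
                          Eigen::Vector3d& n, Eigen::Vector3d& d) {
  double w1 = W(0, 0), w2 = W(1, 0);   // cos(phi), sin(phi)
  n = w1 * U.col(0);
  d = w2 * U.col(1);                    // same scale factor, so the same 3D line
}
```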

3. Tightly-Coupled Visual–Inertial Fusion

In visual SLAM, bundle adjustment is used to optimize the camera poses and 3D map points by minimizing the re-projection error in the image planes. Bundle adjustment by nonlinear optimization can be treated as a factor graph [35], as shown in Figure 3a: round nodes are the camera poses and 3D landmarks to be optimized; edges with square boxes represent the visual measurements acting as constraints between nodes. For visual–inertial fusion, we can use the same tightly-coupled graph-based framework to optimize all the state variables of the visual–inertial system [23]. As shown in Figure 3b, the factor graph not only contains the visual measurements, but also takes the pre-integrated IMU measurements as edges that constrain the successive IMU body states.

3.1. Sliding Window Formulation

To achieve both computational efficiency and high accuracy, we use a sliding window formulation for the factor graph optimization. The full state vector in the sliding window at time $i$ is defined as:

$$\begin{aligned}
\mathcal{X} &= \left[\mathbf{x}_n, \mathbf{x}_{n+1}, \cdots, \mathbf{x}_{n+N},\; \lambda_m, \lambda_{m+1}, \cdots, \lambda_{m+M},\; \mathcal{O}_o, \mathcal{O}_{o+1}, \cdots, \mathcal{O}_{o+O}\right] \\
\mathbf{x}_i &= \left[\mathbf{p}_{wb_i}, \mathbf{q}_{wb_i}, \mathbf{v}_i^{w}, \mathbf{b}_a^{b_i}, \mathbf{b}_g^{b_i}\right], \quad i \in [n, n+N]
\end{aligned} \tag{19}$$

where $\mathbf{x}_i$ contains the $i$th IMU body position, orientation, and velocity in the world frame, and the biases in the IMU body frame. The subscripts $n$, $m$, and $o$ are the start indices of the body states, point landmarks, and line landmarks, respectively. $N$ is the number of keyframes in the sliding window, and $M$ and $O$ are the numbers of point landmarks and line landmarks observed by all keyframes in the sliding window, respectively. We use a single variable, the inverse depth $\lambda_k$, to parameterize the $k$th point landmark with respect to the keyframe in which it was first observed. $\mathcal{O}_l$ is the orthonormal representation of the $l$th line feature in the world frame $w$.
We optimize all the state variables in the sliding window by minimizing the sum of cost terms from all the measurement residuals:

$$\min_{\mathcal{X}}\left\{\rho\left(\left\|\mathbf{r}_p - \mathbf{J}_p\mathcal{X}\right\|^2_{\Sigma_p}\right) + \sum_{i\in\mathcal{B}}\rho\left(\left\|\mathbf{r}_b(\mathbf{z}_{b_i b_{i+1}}, \mathcal{X})\right\|^2_{\Sigma_{b_i b_{i+1}}}\right) + \sum_{(i,j)\in\mathcal{F}}\rho\left(\left\|\mathbf{r}_f(\mathbf{z}_{f_j}^{c_i}, \mathcal{X})\right\|^2_{\Sigma_{f_j}^{c_i}}\right) + \sum_{(i,l)\in\mathcal{L}}\rho\left(\left\|\mathbf{r}_l(\mathbf{z}_{\mathcal{L}_l}^{c_i}, \mathcal{X})\right\|^2_{\Sigma_{\mathcal{L}_l}^{c_i}}\right)\right\} \tag{20}$$

where $\mathbf{r}_b(\mathbf{z}_{b_i b_{i+1}}, \mathcal{X})$ is the IMU measurement residual between the body states $\mathbf{x}_i$ and $\mathbf{x}_{i+1}$, and $\mathcal{B}$ is the set of all pre-integrated IMU measurements in the sliding window. $\mathbf{r}_f(\mathbf{z}_{f_j}^{c_i}, \mathcal{X})$ and $\mathbf{r}_l(\mathbf{z}_{\mathcal{L}_l}^{c_i}, \mathcal{X})$ are the point re-projection residual and the line re-projection residual, respectively, and $\mathcal{F}$ and $\mathcal{L}$ are the sets of point and line features observed by the camera frames. $\{\mathbf{r}_p, \mathbf{J}_p\}$ is the prior information obtained after marginalizing out a frame in the sliding window [23], where $\mathbf{J}_p$ is the prior Jacobian matrix taken from the Hessian matrix of the previous optimization. $\rho$ is the Cauchy robust function used to suppress outliers.
We use the Levenberg–Marquardt algorithm to solve the nonlinear optimization problem (20). The optimal state estimate is found by iteratively updating an initial guess $\mathcal{X}_0$:

$$\mathcal{X}_{t+1} = \mathcal{X}_t \oplus \delta\mathcal{X} \tag{21}$$

where $\oplus$ is the operator used to update the parameters with the increment $\delta\mathcal{X}$. For position, velocity, bias, and inverse depth, the update operator and the increment $\delta$ are defined simply as:

$$\mathbf{p} = \mathbf{p} + \delta\mathbf{p}, \qquad \mathbf{v} = \mathbf{v} + \delta\mathbf{v}, \qquad \lambda = \lambda + \delta\lambda, \qquad \mathbf{b} = \mathbf{b} + \delta\mathbf{b} \tag{22}$$

However, the update operator and increment for the attitude $\mathbf{q}$ are more complicated. A quaternion uses four parameters to represent a three-DoF rotation, so it is over-parameterized: the rotation increment should be only three-dimensional. Similar to [18], we use a perturbation $\delta\boldsymbol{\theta} \in \mathbb{R}^3$ in the tangent space as the rotation increment. Thus, the rotation $\mathbf{q}$ is updated by quaternion multiplication:

$$\mathbf{q} = \mathbf{q} \otimes \delta\mathbf{q}, \qquad \delta\mathbf{q} \approx \begin{bmatrix} 1 \\ \tfrac{1}{2}\delta\boldsymbol{\theta} \end{bmatrix} \tag{23}$$
In rotation matrix form:

$$\mathbf{R} \approx \mathbf{R}\left(\mathbf{I} + [\delta\boldsymbol{\theta}]_{\times}\right) \tag{24}$$

where $\mathbf{I}$ is the $3\times 3$ identity matrix. Similarly, the orthonormal representation is updated as:

$$\mathbf{U} \approx \mathbf{U}\left(\mathbf{I} + [\delta\boldsymbol{\psi}]_{\times}\right), \qquad \mathbf{W} \approx \mathbf{W}\left(\mathbf{I} + \begin{bmatrix} 0 & -\delta\phi \\ \delta\phi & 0 \end{bmatrix}\right) \tag{25}$$

The increment for the orthonormal representation is $\delta\mathcal{O} = [\delta\boldsymbol{\psi}^{\top}, \delta\phi]^{\top}$. Finally, the increment $\delta\mathcal{X}$ used during the optimization is defined as:

$$\begin{aligned}
\delta\mathcal{X} &= \left[\delta\mathbf{x}_n, \delta\mathbf{x}_{n+1}, \cdots, \delta\mathbf{x}_{n+N},\; \delta\lambda_m, \delta\lambda_{m+1}, \cdots, \delta\lambda_{m+M},\; \delta\mathcal{O}_o, \delta\mathcal{O}_{o+1}, \cdots, \delta\mathcal{O}_{o+O}\right] \\
\delta\mathbf{x}_i &= \left[\delta\mathbf{p}, \delta\boldsymbol{\theta}, \delta\mathbf{v}, \delta\mathbf{b}_a^{b_i}, \delta\mathbf{b}_g^{b_i}\right], \quad i \in [n, n+N]
\end{aligned} \tag{26}$$
At each iteration, the increment $\delta\mathcal{X}$ is obtained from Equation (20) by solving the normal equations:

$$\left(\mathbf{H}_p + \mathbf{H}_b + \mathbf{H}_f + \mathbf{H}_l\right)\delta\mathcal{X} = \mathbf{b}_p + \mathbf{b}_b + \mathbf{b}_f + \mathbf{b}_l \tag{27}$$

where $\mathbf{H}_p$, $\mathbf{H}_b$, $\mathbf{H}_f$, and $\mathbf{H}_l$ are the Hessian matrices for the prior residuals, the IMU measurement residuals, and the point and line re-projection residuals, respectively. For each residual $\mathbf{r}_{(\cdot)}$, we have $\mathbf{H}_{(\cdot)} = \mathbf{J}_{(\cdot)}^{\top}\boldsymbol{\Sigma}_{(\cdot)}^{-1}\mathbf{J}_{(\cdot)}$ and $\mathbf{b}_{(\cdot)} = -\mathbf{J}_{(\cdot)}^{\top}\boldsymbol{\Sigma}_{(\cdot)}^{-1}\mathbf{r}_{(\cdot)}$, where $\mathbf{J}_{(\cdot)}$ is the Jacobian matrix of the residual vector $\mathbf{r}_{(\cdot)}$ with respect to $\delta\mathcal{X}$, and $\boldsymbol{\Sigma}_{(\cdot)}$ is the covariance matrix of the corresponding measurements. The residuals and Jacobian matrices are defined in the following subsections.
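Before moving to the individual measurement models, the update operators of Equations (22)–(25) can be summarized in a short C++/Eigen sketch. The exact exponential maps are used here; they coincide with the first-order forms in Equations (23)–(25) for small increments, and the function names are ours.

```cpp
#include <Eigen/Dense>
#include <cmath>

// Quaternion update q <- q (x) delta_q with delta_q ~ [1, 0.5*dtheta] (Equation (23)).
Eigen::Quaterniond boxPlus(const Eigen::Quaterniond& q, const Eigen::Vector3d& dtheta) {
  Eigen::Quaterniond dq(1.0, 0.5 * dtheta.x(), 0.5 * dtheta.y(), 0.5 * dtheta.z());
  return (q * dq).normalized();
}

// Orthonormal line update (Equation (25)): U absorbs the 3-DoF increment dpsi,
// W absorbs the 1-DoF increment dphi.
void updateLine(Eigen::Matrix3d& U, Eigen::Matrix2d& W,
                const Eigen::Vector3d& dpsi, double dphi) {
  double angle = dpsi.norm();
  Eigen::Vector3d axis = (angle > 1e-12) ? Eigen::Vector3d(dpsi / angle)
                                         : Eigen::Vector3d(1, 0, 0);
  U = U * Eigen::AngleAxisd(angle, axis).toRotationMatrix();

  Eigen::Matrix2d dW;
  dW << std::cos(dphi), -std::sin(dphi),
        std::sin(dphi),  std::cos(dphi);
  W = W * dW;
}
```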

3.2. IMU Measurement Model

Since the pre-integrated IMU measurement computed by Equation (10) is a constraint factor between two successive keyframes $b_i$ and $b_j$, the IMU measurement residual is defined as:

$$\mathbf{r}_b(\mathbf{z}_{b_i b_j}, \mathcal{X}) = \begin{bmatrix} \mathbf{r}_p \\ \mathbf{r}_{\theta} \\ \mathbf{r}_v \\ \mathbf{r}_{b_a} \\ \mathbf{r}_{b_g} \end{bmatrix} = \begin{bmatrix} \mathbf{R}_{b_i w}\left(\mathbf{p}_{wb_j} - \mathbf{p}_{wb_i} - \mathbf{v}_i^{w}\Delta t + \tfrac{1}{2}\mathbf{g}^{w}\Delta t^2\right) - \hat{\boldsymbol{\alpha}}_{b_i b_j} \\ 2\left[\hat{\mathbf{q}}_{b_j b_i} \otimes \left(\mathbf{q}_{b_i w} \otimes \mathbf{q}_{wb_j}\right)\right]_{xyz} \\ \mathbf{R}_{b_i w}\left(\mathbf{v}_j^{w} - \mathbf{v}_i^{w} + \mathbf{g}^{w}\Delta t\right) - \hat{\boldsymbol{\beta}}_{b_i b_j} \\ \mathbf{b}_a^{b_j} - \mathbf{b}_a^{b_i} \\ \mathbf{b}_g^{b_j} - \mathbf{b}_g^{b_i} \end{bmatrix} \in \mathbb{R}^{15\times 1} \tag{28}$$

where $[\cdot]_{xyz}$ extracts the vector (imaginary) part of a quaternion, which is used to approximate the three-dimensional rotation error [18].
During the nonlinear optimization, the Jacobian matrices of the IMU measurement residual with respect to the body states $\mathbf{x}_i$ and $\mathbf{x}_j$ are:

$$\mathbf{J}_b = \begin{bmatrix} \dfrac{\partial\mathbf{r}_b}{\partial\delta\mathbf{x}_i} & \dfrac{\partial\mathbf{r}_b}{\partial\delta\mathbf{x}_j} \end{bmatrix} \tag{29}$$

$$\frac{\partial\mathbf{r}_b}{\partial\delta\mathbf{x}_i} = \begin{bmatrix}
-\mathbf{R}_{b_i w} & \left[\mathbf{R}_{b_i w}\left(\mathbf{p}_{wb_j} - \mathbf{p}_{wb_i} - \mathbf{v}_i^{w}\Delta t + \tfrac{1}{2}\mathbf{g}^{w}\Delta t^2\right)\right]_{\times} & -\mathbf{R}_{b_i w}\Delta t & -\mathbf{J}^{\alpha}_{b_{a_i}} & -\mathbf{J}^{\alpha}_{b_{g_i}} \\
\mathbf{0} & -\left[\mathbf{q}_{wb_j}^{-1} \otimes \mathbf{q}_{wb_i}\right]_L\left[\hat{\mathbf{q}}_{b_i b_j}\right]_R\big|_{3\times 3} & \mathbf{0} & \mathbf{0} & \mathbf{J}^{r_{\theta}}_{b_{g_i}} \\
\mathbf{0} & \left[\mathbf{R}_{b_i w}\left(\mathbf{v}_j^{w} - \mathbf{v}_i^{w} + \mathbf{g}^{w}\Delta t\right)\right]_{\times} & -\mathbf{R}_{b_i w} & -\mathbf{J}^{\beta}_{b_{a_i}} & -\mathbf{J}^{\beta}_{b_{g_i}} \\
\mathbf{0} & \mathbf{0} & \mathbf{0} & -\mathbf{I} & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & -\mathbf{I}
\end{bmatrix}_{15\times 15}$$

$$\frac{\partial\mathbf{r}_b}{\partial\delta\mathbf{x}_j} = \begin{bmatrix}
\mathbf{R}_{b_i w} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & \left[\hat{\mathbf{q}}_{b_i b_j}^{-1} \otimes \mathbf{q}_{wb_i}^{-1} \otimes \mathbf{q}_{wb_j}\right]_L\big|_{3\times 3} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & \mathbf{R}_{b_i w} & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{I} & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{I}
\end{bmatrix}_{15\times 15}$$

where $[\mathbf{q}]_L$ and $[\mathbf{q}]_R$ are the left- and right-quaternion-product matrices, respectively [36], and $\big|_{3\times 3}$ extracts the bottom-right $3\times 3$ block of a matrix. The Jacobian with respect to the gyroscope bias is $\mathbf{J}^{r_{\theta}}_{b_{g_i}} = \frac{\partial\mathbf{r}_{\theta}}{\partial\delta\mathbf{b}_g^{b_i}} = -\left[\mathbf{q}_{wb_j}^{-1} \otimes \mathbf{q}_{wb_i} \otimes \hat{\mathbf{q}}_{b_i b_j}\right]_L\big|_{3\times 3}\,\mathbf{J}^{q}_{b_{g_i}}$.
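For reference, a direct C++/Eigen sketch of evaluating the 15-dimensional residual of Equation (28) is given below. The bias-corrected pre-integrated terms are assumed to have been updated with Equation (10) beforehand; the function signature and names are ours.

```cpp
#include <Eigen/Dense>

// Evaluate the 15x1 IMU residual of Equation (28) between body states i and j.
Eigen::Matrix<double, 15, 1> imuResidual(
    const Eigen::Vector3d& p_i, const Eigen::Quaterniond& q_wbi,
    const Eigen::Vector3d& v_i, const Eigen::Vector3d& ba_i, const Eigen::Vector3d& bg_i,
    const Eigen::Vector3d& p_j, const Eigen::Quaterniond& q_wbj,
    const Eigen::Vector3d& v_j, const Eigen::Vector3d& ba_j, const Eigen::Vector3d& bg_j,
    const Eigen::Vector3d& alpha_hat, const Eigen::Vector3d& beta_hat,
    const Eigen::Quaterniond& q_hat,  // bias-corrected pre-integration (Equation (10))
    const Eigen::Vector3d& g_w, double dt) {
  Eigen::Matrix<double, 15, 1> r;
  Eigen::Quaterniond q_biw = q_wbi.conjugate();   // rotation world -> b_i

  r.segment<3>(0)  = q_biw * (p_j - p_i - v_i * dt + 0.5 * g_w * dt * dt) - alpha_hat;
  r.segment<3>(3)  = 2.0 * (q_hat.conjugate() * (q_biw * q_wbj)).vec();  // 2 [.]_{xyz}
  r.segment<3>(6)  = q_biw * (v_j - v_i + g_w * dt) - beta_hat;
  r.segment<3>(9)  = ba_j - ba_i;
  r.segment<3>(12) = bg_j - bg_i;
  return r;
}
```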

3.3. Point Feature Measurement Model

For point features, the re-projection error of a 3D point is the distance between the projected point and the observed point in the image. In this work, we handle point and line measurements in the normalized image plane. Given the $k$th point feature measurement in frame $c_j$, $\mathbf{z}_{f_k}^{c_j} = [u_{f_k}^{c_j}, v_{f_k}^{c_j}, 1]^{\top}$, the re-projection error is defined as:

$$\mathbf{r}_f(\mathbf{z}_{f_k}^{c_j}, \mathcal{X}) = \begin{bmatrix} \dfrac{x^{c_j}}{z^{c_j}} - u_{f_k}^{c_j} \\[4pt] \dfrac{y^{c_j}}{z^{c_j}} - v_{f_k}^{c_j} \end{bmatrix}, \qquad
\mathbf{f}_k^{c_j} = \begin{bmatrix} x^{c_j} \\ y^{c_j} \\ z^{c_j} \end{bmatrix} = \mathbf{R}_{bc}^{\top}\left(\mathbf{R}_{wb_j}^{\top}\left(\mathbf{R}_{wb_i}\left(\mathbf{R}_{bc}\frac{1}{\lambda_k}\begin{bmatrix} u_{f_k}^{c_i} \\ v_{f_k}^{c_i} \\ 1 \end{bmatrix} + \mathbf{p}_{bc}\right) + \mathbf{p}_{wb_i} - \mathbf{p}_{wb_j}\right) - \mathbf{p}_{bc}\right) \tag{30}$$

where $\mathbf{z}_{f_k}^{c_i} = [u_{f_k}^{c_i}, v_{f_k}^{c_i}, 1]^{\top}$ is the first observation of the feature, in camera frame $c_i$, and the inverse depth $\lambda_k$ of the feature is also defined in camera frame $c_i$.
To minimize the point re-projection error (30), we need to optimize the rotations and positions of frames $b_i$ and $b_j$ and the feature inverse depth $\lambda$. The corresponding Jacobian matrix is obtained by the chain rule:

$$\mathbf{J}_f = \frac{\partial\mathbf{r}_f}{\partial\mathbf{f}^{c_j}}\begin{bmatrix} \dfrac{\partial\mathbf{f}^{c_j}}{\partial\delta\mathbf{x}_i} & \dfrac{\partial\mathbf{f}^{c_j}}{\partial\delta\mathbf{x}_j} & \dfrac{\partial\mathbf{f}^{c_j}}{\partial\delta\lambda} \end{bmatrix} \tag{31}$$

with

$$\begin{aligned}
\frac{\partial\mathbf{r}_f}{\partial\mathbf{f}^{c_j}} &= \begin{bmatrix} \dfrac{1}{z^{c_j}} & 0 & -\dfrac{x^{c_j}}{(z^{c_j})^2} \\[4pt] 0 & \dfrac{1}{z^{c_j}} & -\dfrac{y^{c_j}}{(z^{c_j})^2} \end{bmatrix} \\
\frac{\partial\mathbf{f}^{c_j}}{\partial\delta\mathbf{x}_i} &= \begin{bmatrix} \mathbf{R}_{bc}^{\top}\mathbf{R}_{wb_j}^{\top} & -\mathbf{R}_{bc}^{\top}\mathbf{R}_{wb_j}^{\top}\mathbf{R}_{wb_i}[\mathbf{f}^{b_i}]_{\times} & \mathbf{0} & \mathbf{0} & \mathbf{0} \end{bmatrix}_{3\times 15} \\
\frac{\partial\mathbf{f}^{c_j}}{\partial\delta\mathbf{x}_j} &= \begin{bmatrix} -\mathbf{R}_{bc}^{\top}\mathbf{R}_{wb_j}^{\top} & \mathbf{R}_{bc}^{\top}[\mathbf{f}^{b_j}]_{\times} & \mathbf{0} & \mathbf{0} & \mathbf{0} \end{bmatrix}_{3\times 15} \\
\frac{\partial\mathbf{f}^{c_j}}{\partial\delta\lambda} &= -\frac{1}{\lambda}\mathbf{R}_{bc}^{\top}\mathbf{R}_{wb_j}^{\top}\mathbf{R}_{wb_i}\mathbf{R}_{bc}\mathbf{f}^{c_i}
\end{aligned} \tag{32}$$

where $\mathbf{f}^{b_i}$ and $\mathbf{f}^{b_j}$ are the 3D point expressed in the $i$th and $j$th IMU body frames, respectively. We define the covariance matrix $\boldsymbol{\Sigma}_{f_k}^{c_i}$ of a point measurement as a $2\times 2$ diagonal matrix, assuming that the detected point features have pixel noise in both the horizontal and vertical directions of the image plane.
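A direct implementation of the transform chain in Equation (30) is compact; the following C++/Eigen sketch evaluates the point re-projection residual on the normalized image plane. Names are ours, and $(\mathbf{R}_{bc}, \mathbf{p}_{bc})$ denote the known camera–IMU extrinsics.

```cpp
#include <Eigen/Dense>

// Point re-projection residual on the normalized image plane (Equation (30)).
// obs_i is the first observation of the feature in c_i, obs_j its observation in c_j.
Eigen::Vector2d pointResidual(
    const Eigen::Matrix3d& R_wbi, const Eigen::Vector3d& p_wbi,
    const Eigen::Matrix3d& R_wbj, const Eigen::Vector3d& p_wbj,
    const Eigen::Matrix3d& R_bc,  const Eigen::Vector3d& p_bc,
    const Eigen::Vector2d& obs_i, const Eigen::Vector2d& obs_j,
    double inv_depth) {
  Eigen::Vector3d f_ci(obs_i.x() / inv_depth, obs_i.y() / inv_depth, 1.0 / inv_depth);
  Eigen::Vector3d f_bi = R_bc * f_ci + p_bc;                 // camera i -> body i
  Eigen::Vector3d f_w  = R_wbi * f_bi + p_wbi;               // body i  -> world
  Eigen::Vector3d f_bj = R_wbj.transpose() * (f_w - p_wbj);  // world   -> body j
  Eigen::Vector3d f_cj = R_bc.transpose() * (f_bj - p_bc);   // body j  -> camera j
  return Eigen::Vector2d(f_cj.x() / f_cj.z() - obs_j.x(),
                         f_cj.y() / f_cj.z() - obs_j.y());
}
```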

3.4. Line Feature Measurement Model

The re-projection error of a line measurement is defined as the distance from the observed endpoints to the projected line. For a pinhole camera model, a 3D spatial line $\mathcal{L} = (\mathbf{n}^{\top}, \mathbf{d}^{\top})^{\top}$ is projected onto the image plane by [27]:

$$\mathbf{l} = \begin{bmatrix} l_1 \\ l_2 \\ l_3 \end{bmatrix} = \mathcal{K}\mathbf{n}^{c} = \begin{bmatrix} f_y & 0 & 0 \\ 0 & f_x & 0 \\ -f_y c_x & -f_x c_y & f_x f_y \end{bmatrix}\mathbf{n}^{c} \tag{33}$$

where $\mathcal{K}$ is the line projection matrix. When projecting a line onto the normalized image plane, $\mathcal{K}$ is the identity matrix. From the projection Equation (33), the coefficients of the projected line depend only on the normal vector $\mathbf{n}$.
Given a 3D line $\mathcal{L}_l^{w}$ with orthonormal representation $\mathcal{O}_l$ in the world frame, we first transform it into camera frame $c_i$ by Equation (11) and then project it onto the image plane to obtain the projected line $\mathbf{l}_l^{c_i}$. The re-projection error in camera frame $c_i$ is defined as

$$\mathbf{r}_l(\mathbf{z}_{\mathcal{L}_l}^{c_i}, \mathcal{X}) = \begin{bmatrix} d(\mathbf{s}_l^{c_i}, \mathbf{l}_l^{c_i}) \\ d(\mathbf{e}_l^{c_i}, \mathbf{l}_l^{c_i}) \end{bmatrix} \tag{34}$$

with $d(\mathbf{s}, \mathbf{l})$ denoting the distance from an endpoint $\mathbf{s}$ to the projected line $\mathbf{l}$:

$$d(\mathbf{s}, \mathbf{l}) = \frac{\mathbf{s}^{\top}\mathbf{l}}{\sqrt{l_1^2 + l_2^2}} \tag{35}$$
The $i$th body state and the $l$th line parameters are optimized by minimizing the line re-projection error $\mathbf{r}_l(\mathbf{z}_{\mathcal{L}_l}^{c_i}, \mathcal{X})$. The corresponding Jacobian matrix is obtained by the chain rule:

$$\mathbf{J}_l = \frac{\partial\mathbf{r}_l}{\partial\mathbf{l}^{c_i}}\frac{\partial\mathbf{l}^{c_i}}{\partial\mathcal{L}^{c_i}}\begin{bmatrix} \dfrac{\partial\mathcal{L}^{c_i}}{\partial\delta\mathbf{x}_i} & \dfrac{\partial\mathcal{L}^{c_i}}{\partial\mathcal{L}^{w}}\dfrac{\partial\mathcal{L}^{w}}{\partial\delta\mathcal{O}} \end{bmatrix} \tag{36}$$

with

$$\begin{aligned}
\frac{\partial\mathbf{r}_l}{\partial\mathbf{l}^{c_i}} &= \begin{bmatrix}
-\dfrac{l_1\,(\mathbf{s}_l^{c_i})^{\top}\mathbf{l}}{(l_1^2+l_2^2)^{3/2}} + \dfrac{u_s}{(l_1^2+l_2^2)^{1/2}} &
-\dfrac{l_2\,(\mathbf{s}_l^{c_i})^{\top}\mathbf{l}}{(l_1^2+l_2^2)^{3/2}} + \dfrac{v_s}{(l_1^2+l_2^2)^{1/2}} &
\dfrac{1}{(l_1^2+l_2^2)^{1/2}} \\[8pt]
-\dfrac{l_1\,(\mathbf{e}_l^{c_i})^{\top}\mathbf{l}}{(l_1^2+l_2^2)^{3/2}} + \dfrac{u_e}{(l_1^2+l_2^2)^{1/2}} &
-\dfrac{l_2\,(\mathbf{e}_l^{c_i})^{\top}\mathbf{l}}{(l_1^2+l_2^2)^{3/2}} + \dfrac{v_e}{(l_1^2+l_2^2)^{1/2}} &
\dfrac{1}{(l_1^2+l_2^2)^{1/2}}
\end{bmatrix}_{2\times 3} \\[4pt]
\frac{\partial\mathbf{l}^{c_i}}{\partial\mathcal{L}^{c_i}} &= \begin{bmatrix} \mathcal{K} & \mathbf{0} \end{bmatrix}_{3\times 6} \\[4pt]
\frac{\partial\mathcal{L}^{c_i}}{\partial\delta\mathbf{x}_i} &= \begin{bmatrix}
\mathcal{T}_{bc}^{-1}\begin{bmatrix} \mathbf{R}_{wb}^{\top}[\mathbf{d}^{w}]_{\times} \\ \mathbf{0}_{3\times 3} \end{bmatrix} &
\mathcal{T}_{bc}^{-1}\begin{bmatrix} \left[\mathbf{R}_{wb}^{\top}\left(\mathbf{n}^{w} + [\mathbf{d}^{w}]_{\times}\mathbf{p}_{wb}\right)\right]_{\times} \\ \left[\mathbf{R}_{wb}^{\top}\mathbf{d}^{w}\right]_{\times} \end{bmatrix} &
\mathbf{0} & \mathbf{0} & \mathbf{0}
\end{bmatrix}_{6\times 15} \\[4pt]
\frac{\partial\mathcal{L}^{c_i}}{\partial\mathcal{L}^{w}}\frac{\partial\mathcal{L}^{w}}{\partial\delta\mathcal{O}} &= \mathcal{T}_{wc}^{-1}\begin{bmatrix}
\mathbf{0} & -w_1\mathbf{u}_3 & w_1\mathbf{u}_2 & -w_2\mathbf{u}_1 \\
w_2\mathbf{u}_3 & \mathbf{0} & -w_2\mathbf{u}_1 & w_1\mathbf{u}_2
\end{bmatrix}_{6\times 4}
\end{aligned} \tag{37}$$

The derivation details are provided in Appendix B. Similar to the point measurement covariance matrix, the line measurement covariance matrix $\boldsymbol{\Sigma}_{\mathcal{L}_l}^{c_i}$ is defined by assuming that the endpoints of a line segment are corrupted by pixel noise.
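The line measurement model of Equations (33)–(35) can likewise be evaluated directly: transform the Plücker line from the world frame into the camera frame, take its normal vector as the projected line on the normalized image plane, and compute the point-to-line distances of the observed endpoints. The following C++/Eigen sketch does exactly that; names are ours, and the world-to-body simplification follows the transform given in Appendix B.

```cpp
#include <Eigen/Dense>
#include <cmath>

// Line re-projection residual on the normalized image plane (Equations (33)-(35)).
// (R_wb, p_wb) is the body pose, (R_bc, p_bc) the camera-IMU extrinsics;
// s and e are the observed endpoints [u, v, 1]^T.
Eigen::Vector2d lineResidual(
    const Eigen::Vector3d& n_w, const Eigen::Vector3d& d_w,
    const Eigen::Matrix3d& R_wb, const Eigen::Vector3d& p_wb,
    const Eigen::Matrix3d& R_bc, const Eigen::Vector3d& p_bc,
    const Eigen::Vector3d& s, const Eigen::Vector3d& e) {
  // Transform the Pluecker line: world -> body -> camera (cf. Appendix B).
  Eigen::Vector3d n_b = R_wb.transpose() * (n_w + d_w.cross(p_wb));
  Eigen::Vector3d d_b = R_wb.transpose() * d_w;
  Eigen::Vector3d n_c = R_bc.transpose() * (n_b + d_b.cross(p_bc));

  // On the normalized image plane K is the identity, so l = n_c (Equation (33)).
  const Eigen::Vector3d& l = n_c;
  double denom = std::sqrt(l.x() * l.x() + l.y() * l.y());
  return Eigen::Vector2d(s.dot(l) / denom, e.dot(l) / denom);   // Equation (35)
}
```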

4. Monocular Visual Inertial Odometry with Point and Line Features

As shown in Figure 4, the proposed PL-VIO system consists of two main modules: the front end and the back end. The front-end module pre-processes the measurements from the IMU and the camera, while the back-end module estimates and optimizes the body states. We introduce the details in the following subsections.

4.1. Front End

The front end extracts information from the raw measurements of the camera and IMU. The body state is updated by propagation with each new IMU measurement, and the newest body state is used as the initial value in the sliding window optimization. Additionally, the new IMU measurements are pre-integrated to constrain the successive IMU body states during optimization.
For image processing, the point and line features are detected in two separate threads. When a new frame arrives, the point features are tracked from the previous frame to the new frame with the KLT optical flow algorithm [37]. Then, we use a RANSAC framework with an essential matrix test to remove outliers. If the number of tracked point features falls below a threshold after outlier rejection, new corner features detected by the FAST detector [38] are added. As for line features, line segments in the new frame are detected by the LSD detector [39] and matched against those in the previous frame using the appearance-based LBD descriptor [40]. We use geometric constraints to remove outliers among the line matches: the distance between the midpoints of two matched lines should be no more than $\delta_{dist}^{th}$ pixels, and their angle difference should be no more than $\delta_{angle}^{th}$ degrees. After feature detection and matching, the point features and the endpoints of line segments are projected onto the normalized image plane. A frame is selected as a new keyframe if the average parallax of the tracked point features exceeds a threshold.
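A minimal sketch of this front-end pipeline using OpenCV is shown below. It assumes OpenCV 3 built with the line_descriptor contrib module (which provides the LSD/LBD implementations mentioned above); the thresholds and parameter values are illustrative only and are not the exact settings used in the paper.

```cpp
#include <vector>
#include <opencv2/opencv.hpp>
#include <opencv2/line_descriptor.hpp>

// Front-end sketch: KLT point tracking plus LSD/LBD line detection and matching.
void processFrame(const cv::Mat& prev_img, const cv::Mat& cur_img,
                  std::vector<cv::Point2f>& prev_pts, std::vector<cv::Point2f>& cur_pts) {
  // 1. Track existing corners with KLT optical flow (prev_pts must be non-empty).
  std::vector<uchar> status;
  std::vector<float> err;
  cv::calcOpticalFlowPyrLK(prev_img, cur_img, prev_pts, cur_pts, status, err);

  // 2. Top up with new FAST corners if too few survive; a RANSAC essential-matrix
  //    test for outlier rejection would follow here.
  if (cv::countNonZero(cv::Mat(status)) < 100) {
    std::vector<cv::KeyPoint> kps;
    cv::FAST(cur_img, kps, /*threshold=*/20, /*nonmaxSuppression=*/true);
  }

  // 3. Detect line segments with LSD and describe/match them with LBD.
  using namespace cv::line_descriptor;
  cv::Ptr<LSDDetector> lsd = LSDDetector::createLSDDetector();
  std::vector<KeyLine> lines_prev, lines_cur;
  lsd->detect(prev_img, lines_prev, /*scale=*/2, /*numOctaves=*/1);
  lsd->detect(cur_img, lines_cur, /*scale=*/2, /*numOctaves=*/1);

  cv::Ptr<BinaryDescriptor> lbd = BinaryDescriptor::createBinaryDescriptor();
  cv::Mat desc_prev, desc_cur;
  lbd->compute(prev_img, lines_prev, desc_prev);
  lbd->compute(cur_img, lines_cur, desc_cur);

  cv::Ptr<BinaryDescriptorMatcher> matcher =
      BinaryDescriptorMatcher::createBinaryDescriptorMatcher();
  std::vector<cv::DMatch> matches;
  matcher->match(desc_cur, desc_prev, matches);
  // Geometric checks (midpoint distance, angle difference) would prune matches here.
}
```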

4.2. Back End

In the back-end thread, the points and lines are first triangulated to build the re-projection residuals. To obtain good landmark estimates, the inverse depth of a point feature is estimated from all its observations. For line triangulation, we only choose the two frames in the sliding window that are spatially furthest apart to initialize the Plücker coordinates.
After obtaining the initial estimates of the map points/lines and the IMU body state predicted from the IMU measurements, the sliding window optimization described in Section 3 is used to find the optimal states. To limit the size of the state vector $\mathcal{X}$, a two-way marginalization strategy is adopted to remove states from the sliding window [23]. When the second newest frame $\mathbf{x}_{n+N-1}$ is a keyframe, we marginalize out the oldest frame $\mathbf{x}_n$ together with all its measurements. Otherwise, if the second newest frame is not a keyframe, we discard its visual measurements and keep its IMU measurements in the pre-integration. New prior information is obtained from the marginalized measurements, preserving the constraint information from the removed states.
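For concreteness, the prior $\{\mathbf{r}_p, \mathbf{J}_p\}$ obtained from marginalization can be formed with the standard Schur complement, as in the following sketch. The block ordering (states to be marginalized placed first) and the function name are our assumptions, not the authors' exact implementation.

```cpp
#include <Eigen/Dense>

// Form the prior for the remaining states by marginalizing the first m state
// dimensions of the Hessian H and gradient b via the Schur complement.
void marginalize(const Eigen::MatrixXd& H, const Eigen::VectorXd& b, int m,
                 Eigen::MatrixXd& H_prior, Eigen::VectorXd& b_prior) {
  const int r = static_cast<int>(H.rows()) - m;
  Eigen::MatrixXd Hmm = H.topLeftCorner(m, m);
  Eigen::MatrixXd Hmr = H.topRightCorner(m, r);
  Eigen::MatrixXd Hrm = H.bottomLeftCorner(r, m);
  Eigen::MatrixXd Hrr = H.bottomRightCorner(r, r);
  Eigen::VectorXd bm = b.head(m);
  Eigen::VectorXd br = b.tail(r);

  // LDLT solve; a robust implementation would use an eigenvalue-based
  // pseudo-inverse, since Hmm may be rank deficient.
  Eigen::MatrixXd Hmm_inv = Hmm.ldlt().solve(Eigen::MatrixXd::Identity(m, m));

  H_prior = Hrr - Hrm * Hmm_inv * Hmr;   // Schur complement
  b_prior = br - Hrm * Hmm_inv * bm;
}
```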
Lastly, we cull invalid map points and lines. If the inverse depth of a point is negative, the point is deleted from the map; if the re-projection residuals of a line exceed a threshold, the line is removed from the map.

4.3. Implementation Details

To bootstrap the VIO system, we adopt the visual–inertial alignment method of [41] to recover the scale, the gravity vector, the initial biases, and the velocity of the initial IMU body state. The sliding window contains $N = 10$ keyframes. Each frame has at most 150 point features and 150 line segments. The thresholds used in line matching are set to $\delta_{dist}^{th} = 60$ pixels and $\delta_{angle}^{th} = 30$ degrees. Since the visual–inertial system has only four unobservable DoFs (the yaw direction and the absolute position), optimizing over all six DoFs may introduce spurious information into the roll and pitch directions, because the optimizer automatically takes steps along these directions to decrease the cost function. After the sliding window optimization, we therefore reset the body state by rotating it back by the increments accumulated along the roll and pitch directions. All numerical optimizations are solved with the Levenberg–Marquardt method in the Ceres solver library [42]. The line detection and matching code is provided by OpenCV 3 [43].
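Since the optimization is solved with Ceres, the following simplified sketch illustrates the mechanics of registering a robustified re-projection residual and running Levenberg–Marquardt. It uses automatic differentiation and a plain world-point parameterization rather than the analytic Jacobians and inverse-depth/line parameterizations derived in Section 3, so it should be read as an illustration of the solver usage only; all names are ours.

```cpp
#include <ceres/ceres.h>
#include <ceres/rotation.h>

// Robustified point re-projection residual on the normalized image plane.
struct PointReprojectionError {
  PointReprojectionError(double u, double v) : u_(u), v_(v) {}

  template <typename T>
  bool operator()(const T* const q_wc,   // camera orientation (w, x, y, z)
                  const T* const p_wc,   // camera position in the world frame
                  const T* const point,  // 3D point in the world frame
                  T* residuals) const {
    // X_c = R_cw (X_w - p_wc), then project onto the normalized image plane.
    T diff[3] = {point[0] - p_wc[0], point[1] - p_wc[1], point[2] - p_wc[2]};
    T q_cw[4] = {q_wc[0], -q_wc[1], -q_wc[2], -q_wc[3]};   // conjugate quaternion
    T p_c[3];
    ceres::QuaternionRotatePoint(q_cw, diff, p_c);
    residuals[0] = p_c[0] / p_c[2] - T(u_);
    residuals[1] = p_c[1] / p_c[2] - T(v_);
    return true;
  }

  double u_, v_;
};

// Usage sketch: one residual block per observation, Cauchy loss, LM solver.
//   ceres::Problem problem;
//   problem.AddResidualBlock(
//       new ceres::AutoDiffCostFunction<PointReprojectionError, 2, 4, 3, 3>(
//           new PointReprojectionError(u_obs, v_obs)),
//       new ceres::CauchyLoss(1.0), q_wc, p_wc, point);
//   problem.SetParameterization(q_wc, new ceres::QuaternionParameterization());
//   ceres::Solver::Options options;
//   options.trust_region_strategy_type = ceres::LEVENBERG_MARQUARDT;
//   ceres::Solver::Summary summary;
//   ceres::Solve(options, &problem, &summary);
```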

5. Experimental Results

We evaluated our PL-VIO system on two public benchmark datasets: the EuRoc MAV dataset [44] and the PennCOSYVIO dataset [45]. The accuracy of PL-VIO is compared with that of three state-of-the-art monocular VIO methods: ROVIO [17], OKVIS [18] in monocular mode, and VINS-Mono [32] without loop closure. ROVIO is a tightly-coupled VIO based on the extended Kalman filter (EKF); it directly uses the intensity errors of the images to find the optimal state in the update step. OKVIS is a sliding window optimization algorithm with point features that can work in monocular or stereo mode. VINS-Mono is a complete VIO-SLAM system that employs point features to optimize the IMU body states in a sliding window and performs loop closure. All experiments were performed on a computer with an Intel Core i7-6700HQ CPU at 2.60 GHz, 16 GB RAM, and ROS Kinetic [46].

5.1. EuRoc MAV Dataset

The EuRoc micro aerial vehicle (MAV) datasets were collected by an MAV in two indoor scenes and contain stereo images from a global shutter camera at 20 FPS together with synchronized IMU measurements at 200 Hz [44]. Each sequence provides a ground truth trajectory given by the VICON motion capture system. All the extrinsic and intrinsic parameters are also provided in the datasets. In our experiments, we only used the images from the left camera.
The main advantage of line features is that they provide significant geometric structure information about the environment. As an example, Figure 5 shows the map reconstructed by PL-VIO on the MH_05_difficult sequence. The four images in Figure 5a–d were captured by an MAV flying in a machine hall and show the room's structure. As shown in Figure 5d, line segment detection was more robust than point feature detection in the weakly illuminated scene. From the reconstructed 3D map, it can be seen that the geometry of the environment is described by the line segments, so semantic information could be extracted from the map, which is useful for robot navigation.
For quantitative evaluation, we compared PL-VIO with the three state-of-the-art monocular visual–inertial methods: ROVIO [17], OKVIS [18] in monocular mode, and VINS-Mono [32] without loop closure. For a fair comparison, the default parameters provided by the authors of these methods were used. We chose the absolute pose error (APE) as the evaluation metric, which directly compares the estimated trajectory against the ground truth [47]. The open-source package evo (https://michaelgrupp.github.io/evo/) provides an easy-to-use interface for evaluating the trajectories of odometry and SLAM algorithms, and we employed this tool in this section. Table 1 shows the root mean square error (RMSE) of translation and rotation over the whole trajectory, and the corresponding histograms are provided in Figure 6.
Table 1 shows that PL-VIO, which jointly optimizes points and lines, achieved the best rotation accuracy on eight sequences, the exception being V1_01_easy. Our method also performed best in translation on six sequences, the exceptions being V1_01_easy, V1_02_medium, and V1_03_difficult; however, the difference with respect to the best results was only at the submillimeter level. The results in Table 1 show that integrating line features into VIO can improve the accuracy of motion estimation. To illustrate the results intuitively, several heat maps of the trajectories estimated by PL-VIO and VINS-Mono are shown in Figure 7; the redder the color, the larger the translation error. Comparing the three trajectories, we conclude that PL-VIO with line features gave smaller errors than VINS-Mono when the camera underwent rapid rotation. Furthermore, we found that sequences with rapid rotation cause large changes in the viewing direction and lighting conditions, which are especially challenging for tracking point features [25,26,28]. As shown in Figure 8, only 27 point pairs (including 10 outliers) were matched successfully, while 33 lines were matched successfully and all of the line matches were correct.
In addition, the computation times of the different methods are listed in Table 2. The computational efficiency of the filter-based ROVIO was the highest, while its accuracy was the lowest. The proposed PL-VIO system was the most time-consuming method, but its computation time was dominated by the line detection and matching step. In Section 5.3, the computation times of the different modules in PL-VIO are measured separately on the V1_03_difficult sequence, and it was found that the efficiency of the PL-VIO system depends directly on the line detection and matching.

5.2. PennCOSYVIO Dataset

The PennCOSYVIO dataset is a recent VIO benchmark containing synchronized data recorded with hand-held equipment while moving from outdoors to indoors through a large glass building (see Figure 9) [45]. Challenging factors include illumination changes, rapid rotations, and repetitive structures. All the intrinsic and extrinsic parameters as well as the ground truth trajectory are provided. We used the data collected by the VI-sensor, the same sensor used in the EuRoc MAV datasets.
We compared the proposed PL-VIO with VINS-Mono without loop closure. For a fair evaluation, the same parameters were used for PL-VIO and VINS-Mono, and the same metric and evaluation tools as in Section 5.1 were employed to evaluate the trajectories. Table 3 lists the results.
Furthermore, the PennCOSYVIO dataset provides its own evaluation tool (https://daniilidis-group.github.io/penncosyvio/), which adopts two metrics: the APE and the relative pose error (RPE). The RPE expresses errors as percentages by dividing the error by the path length [45]. The creators of PennCOSYVIO carefully selected the evaluation parameters, so their tool is well suited to evaluating VIO approaches on this dataset. We therefore also adopted this tool in our experiments; the results are listed in Table 4.
From Table 3 and Table 4, it can be seen that PL-VIO obtained the best performance for the rotation. The APE of translation evaluated by the PennCOSYVIO tool provides more detail: compared to VINS-Mono, PL-VIO gave smaller errors along the y-axis and z-axis as well as a smaller error sum over the three axes, while VINS-Mono performed better only along the x-axis.

5.3. Computing Time

Finally, we evaluated the average execution time of PL-VIO on the V1_02_medium sequence, since this image sequence was collected in a typical indoor scene. Table 5 shows the execution time of each block. We can see that line detection and matching, which runs at 11 Hz in the front end, is the efficiency bottleneck; state-of-the-art line detection and matching methods, such as the combination of LSD and LBD, are not yet satisfactory for VIO/SLAM systems. Note that our method is independent of the particular line feature detection and matching used, so improving their efficiency is beyond the scope of this paper. Marginalization in the back end is another time-consuming part. We observed that the inefficiency of marginalization is caused by the fill-in that occurs when marginalizing out features, which makes the Hessian matrix less sparse. This problem could be mitigated by discarding some features when performing marginalization so as to maintain a sparse Hessian matrix [18].

6. Conclusions

This paper presented PL-VIO, a novel tightly-coupled monocular visual–inertial odometry algorithm which optimizes the system states in a sliding window using both point and line features. The proposed PL-VIO system has two main modules: the front end and the back end. The front-end module propagates the IMU body state and detects and matches point/line features; the back-end module estimates and optimizes the body states. In the back end, a line landmark is treated as an infinite 3D spatial line, and its orthonormal representation is employed to parameterize it compactly during optimization. Furthermore, all the Jacobian matrices of the error terms are given in detail for solving the sliding window optimization efficiently. We also compared the proposed PL-VIO with three state-of-the-art monocular VIO methods, ROVIO [17], OKVIS [18], and VINS-Mono [32], on both the EuRoc dataset and the PennCOSYVIO dataset. From the analysis and results, two further conclusions can be drawn:
  • The reconstructed 3D map with line features can provide geometrical information with respect to the environment, and thus semantic information could be extracted from the map. This is useful for robot navigation.
  • Line features can improve the system accuracy both for translation and rotation, especially in illumination-changing scenes. However, the line detection and matching are time-consuming and become the bottlenecks in the efficiency of the system.
In the future, we plan to improve our system by introducing the structural constraints between 3D spatial lines, such as parallel or coplanar lines in Manhattan-world scenes [48]. Geometric constraints among these lines have the potential to further improve localization precision and reduce rotation accumulation errors.

Acknowledgments

This research work was supported by the National Natural Science Foundation of China (Grant No. 61421004). We would like to thank Yang Ding for testing line matching methods. Finally, Yijia He would like to thank, in particular, the support received from Qiang Tang and Fangbo Qin.

Author Contributions

Yijia He and Ji Zhao conceived and designed the algorithm; Yijia He performed the experiments, analyzed the data, and drafted the paper; Yue Guo and Wenhao He contributed analysis tools; Ji Zhao and Kui Yuan revised the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

The IMU error state propagation equation can be written as [49]:

$$\begin{bmatrix} \delta\boldsymbol{\alpha}_{b_i b_{k+1}} \\ \delta\boldsymbol{\theta}_{b_i b_{k+1}} \\ \delta\boldsymbol{\beta}_{b_i b_{k+1}} \\ \delta\mathbf{b}_a^{b_{k+1}} \\ \delta\mathbf{b}_g^{b_{k+1}} \end{bmatrix} = \mathbf{F}_k\begin{bmatrix} \delta\boldsymbol{\alpha}_{b_i b_k} \\ \delta\boldsymbol{\theta}_{b_i b_k} \\ \delta\boldsymbol{\beta}_{b_i b_k} \\ \delta\mathbf{b}_a^{b_k} \\ \delta\mathbf{b}_g^{b_k} \end{bmatrix} + \mathbf{G}_k\begin{bmatrix} \mathbf{n}_a^{b_k} \\ \mathbf{n}_g^{b_k} \\ \mathbf{n}_a^{b_{k+1}} \\ \mathbf{n}_g^{b_{k+1}} \\ \mathbf{n}_{b_a} \\ \mathbf{n}_{b_g} \end{bmatrix} \tag{A1}$$
with

$$\mathbf{F}_k = \begin{bmatrix}
\mathbf{I} & \mathbf{f}_{12} & \mathbf{I}\delta t & -\tfrac{1}{4}\left(\mathbf{R}_{b_i b_k} + \mathbf{R}_{b_i b_{k+1}}\right)\delta t^2 & \mathbf{f}_{15} \\
\mathbf{0} & \mathbf{I} - [\hat{\boldsymbol{\omega}}]_{\times}\delta t & \mathbf{0} & \mathbf{0} & -\mathbf{I}\delta t \\
\mathbf{0} & \mathbf{f}_{32} & \mathbf{I} & -\tfrac{1}{2}\left(\mathbf{R}_{b_i b_k} + \mathbf{R}_{b_i b_{k+1}}\right)\delta t & \mathbf{f}_{35} \\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{I} & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{I}
\end{bmatrix} \tag{A2}$$

$$\mathbf{G}_k = \begin{bmatrix}
\tfrac{1}{4}\mathbf{R}_{b_i b_k}\delta t^2 & \mathbf{g}_{12} & \tfrac{1}{4}\mathbf{R}_{b_i b_{k+1}}\delta t^2 & \mathbf{g}_{14} & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & \tfrac{1}{2}\mathbf{I}\delta t & \mathbf{0} & \tfrac{1}{2}\mathbf{I}\delta t & \mathbf{0} & \mathbf{0} \\
\tfrac{1}{2}\mathbf{R}_{b_i b_k}\delta t & \mathbf{g}_{32} & \tfrac{1}{2}\mathbf{R}_{b_i b_{k+1}}\delta t & \mathbf{g}_{34} & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{I}\delta t & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{I}\delta t
\end{bmatrix} \tag{A3}$$

$$\begin{aligned}
\mathbf{f}_{12} &= \frac{\partial\boldsymbol{\alpha}_{b_i b_{k+1}}}{\partial\delta\boldsymbol{\theta}_{b_i b_k}} = -\frac{1}{4}\left(\mathbf{R}_{b_i b_k}\left[\hat{\mathbf{a}}^{b_k} - \mathbf{b}_a^{b_k}\right]_{\times}\delta t^2 + \mathbf{R}_{b_i b_{k+1}}\left[\hat{\mathbf{a}}^{b_{k+1}} - \mathbf{b}_a^{b_k}\right]_{\times}\left(\mathbf{I} - [\hat{\boldsymbol{\omega}}]_{\times}\delta t\right)\delta t^2\right) \\
\mathbf{f}_{32} &= \frac{\partial\boldsymbol{\beta}_{b_i b_{k+1}}}{\partial\delta\boldsymbol{\theta}_{b_i b_k}} = -\frac{1}{2}\left(\mathbf{R}_{b_i b_k}\left[\hat{\mathbf{a}}^{b_k} - \mathbf{b}_a^{b_k}\right]_{\times}\delta t + \mathbf{R}_{b_i b_{k+1}}\left[\hat{\mathbf{a}}^{b_{k+1}} - \mathbf{b}_a^{b_k}\right]_{\times}\left(\mathbf{I} - [\hat{\boldsymbol{\omega}}]_{\times}\delta t\right)\delta t\right) \\
\mathbf{f}_{15} &= \frac{\partial\boldsymbol{\alpha}_{b_i b_{k+1}}}{\partial\delta\mathbf{b}_g^{b_k}} = -\frac{1}{4}\left(\mathbf{R}_{b_i b_{k+1}}\left[\hat{\mathbf{a}}^{b_{k+1}} - \mathbf{b}_a^{b_k}\right]_{\times}\delta t^2\right)(-\delta t) \\
\mathbf{f}_{35} &= \frac{\partial\boldsymbol{\beta}_{b_i b_{k+1}}}{\partial\delta\mathbf{b}_g^{b_k}} = -\frac{1}{2}\left(\mathbf{R}_{b_i b_{k+1}}\left[\hat{\mathbf{a}}^{b_{k+1}} - \mathbf{b}_a^{b_k}\right]_{\times}\delta t\right)(-\delta t) \\
\mathbf{g}_{12} &= \frac{\partial\boldsymbol{\alpha}_{b_i b_{k+1}}}{\partial\mathbf{n}_g^{b_k}} = \mathbf{g}_{14} = \frac{\partial\boldsymbol{\alpha}_{b_i b_{k+1}}}{\partial\mathbf{n}_g^{b_{k+1}}} = -\frac{1}{4}\left(\mathbf{R}_{b_i b_{k+1}}\left[\hat{\mathbf{a}}^{b_{k+1}} - \mathbf{b}_a^{b_k}\right]_{\times}\delta t^2\right)\left(\frac{1}{2}\delta t\right) \\
\mathbf{g}_{32} &= \frac{\partial\boldsymbol{\beta}_{b_i b_{k+1}}}{\partial\mathbf{n}_g^{b_k}} = \mathbf{g}_{34} = \frac{\partial\boldsymbol{\beta}_{b_i b_{k+1}}}{\partial\mathbf{n}_g^{b_{k+1}}} = -\frac{1}{2}\left(\mathbf{R}_{b_i b_{k+1}}\left[\hat{\mathbf{a}}^{b_{k+1}} - \mathbf{b}_a^{b_k}\right]_{\times}\delta t\right)\left(\frac{1}{2}\delta t\right)
\end{aligned} \tag{A4}$$
We define the error state vector as $\boldsymbol{\eta}_{k+1} = [\delta\boldsymbol{\alpha}_{b_i b_{k+1}}, \delta\boldsymbol{\theta}_{b_i b_{k+1}}, \delta\boldsymbol{\beta}_{b_i b_{k+1}}, \delta\mathbf{b}_a^{b_{k+1}}, \delta\mathbf{b}_g^{b_{k+1}}]$ and the noise vector as $\mathbf{n}_k = [\mathbf{n}_a^{b_k}, \mathbf{n}_g^{b_k}, \mathbf{n}_a^{b_{k+1}}, \mathbf{n}_g^{b_{k+1}}, \mathbf{n}_{b_a}, \mathbf{n}_{b_g}]$. Equation (A1) can then be written in compact matrix form as:

$$\boldsymbol{\eta}_{k+1} = \mathbf{F}_k\boldsymbol{\eta}_k + \mathbf{G}_k\mathbf{n}_k \tag{A5}$$
The covariance of the pre-integrated measurements can be computed iteratively from the linear model (A5) [14]:

$$\boldsymbol{\Sigma}_{b_i b_{k+1}} = \mathbf{F}_k\boldsymbol{\Sigma}_{b_i b_k}\mathbf{F}_k^{\top} + \mathbf{G}_k\boldsymbol{\Sigma}_n\mathbf{G}_k^{\top} \tag{A6}$$

where $\boldsymbol{\Sigma}_n$ is the covariance of the raw IMU measurement noise, and the initial covariance is $\boldsymbol{\Sigma}_{b_i b_i} = \mathbf{0}_{15\times 15}$. We can also compute the Jacobian matrix of the pre-integrated measurements $\mathbf{z}_{b_i b_j}$ with respect to the error state $\boldsymbol{\eta}_i$ iteratively by [49]:

$$\mathbf{J}_{k+1} = \mathbf{F}_k\mathbf{J}_{k} \tag{A7}$$

with $\mathbf{J}_{i} = \mathbf{I}$. The Jacobian matrices used in Equation (10) can be extracted from $\mathbf{J}_{j}$.
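The iteration of Equations (A6) and (A7) reduces to two matrix recursions, as in the following C++/Eigen sketch; F_k and G_k are assumed to be filled per Equations (A2)–(A4) at each IMU step, and the struct name is ours.

```cpp
#include <Eigen/Dense>

// Iterative propagation of the pre-integration covariance (Equation (A6)) and
// of the Jacobian used for the first-order bias update (Equation (A7)).
struct PreIntegrationUncertainty {
  Eigen::Matrix<double, 15, 15> covariance = Eigen::Matrix<double, 15, 15>::Zero();
  Eigen::Matrix<double, 15, 15> jacobian   = Eigen::Matrix<double, 15, 15>::Identity();

  void propagate(const Eigen::Matrix<double, 15, 15>& F,
                 const Eigen::Matrix<double, 15, 18>& G,
                 const Eigen::Matrix<double, 18, 18>& noise_cov) {
    covariance = F * covariance * F.transpose() + G * noise_cov * G.transpose();
    jacobian   = F * jacobian;   // J_{k+1} = F_k J_k, starting from J_i = I
  }
};
```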

Appendix B

The Jacobian matrix of the line re-projection error with respect to the projected line is:

$$\frac{\partial\mathbf{r}_l}{\partial\mathbf{l}} = \begin{bmatrix} \dfrac{\partial r_1}{\partial l_1} & \dfrac{\partial r_1}{\partial l_2} & \dfrac{\partial r_1}{\partial l_3} \\[6pt] \dfrac{\partial r_2}{\partial l_1} & \dfrac{\partial r_2}{\partial l_2} & \dfrac{\partial r_2}{\partial l_3} \end{bmatrix} = \begin{bmatrix}
-\dfrac{l_1\,\mathbf{s}_l^{\top}\mathbf{l}}{(l_1^2+l_2^2)^{3/2}} + \dfrac{u_s}{(l_1^2+l_2^2)^{1/2}} &
-\dfrac{l_2\,\mathbf{s}_l^{\top}\mathbf{l}}{(l_1^2+l_2^2)^{3/2}} + \dfrac{v_s}{(l_1^2+l_2^2)^{1/2}} &
\dfrac{1}{(l_1^2+l_2^2)^{1/2}} \\[8pt]
-\dfrac{l_1\,\mathbf{e}_l^{\top}\mathbf{l}}{(l_1^2+l_2^2)^{3/2}} + \dfrac{u_e}{(l_1^2+l_2^2)^{1/2}} &
-\dfrac{l_2\,\mathbf{e}_l^{\top}\mathbf{l}}{(l_1^2+l_2^2)^{3/2}} + \dfrac{v_e}{(l_1^2+l_2^2)^{1/2}} &
\dfrac{1}{(l_1^2+l_2^2)^{1/2}}
\end{bmatrix}_{2\times 3}$$

The Jacobian of the Plücker line with respect to the orthonormal increment is:

$$\begin{aligned}
\frac{\partial\mathcal{L}^{w}}{\partial\delta\mathcal{O}} &= \begin{bmatrix} \dfrac{\partial\mathcal{L}^{w}}{\partial\mathbf{u}_1} & \dfrac{\partial\mathcal{L}^{w}}{\partial\mathbf{u}_2} & \dfrac{\partial\mathcal{L}^{w}}{\partial w_1} & \dfrac{\partial\mathcal{L}^{w}}{\partial w_2} \end{bmatrix}_{6\times 8}
\begin{bmatrix} \dfrac{\partial\mathbf{u}_1}{\partial\delta\boldsymbol{\psi}} & \dfrac{\partial\mathbf{u}_1}{\partial\delta\phi} \\[4pt] \dfrac{\partial\mathbf{u}_2}{\partial\delta\boldsymbol{\psi}} & \dfrac{\partial\mathbf{u}_2}{\partial\delta\phi} \\[4pt] \dfrac{\partial w_1}{\partial\delta\boldsymbol{\psi}} & \dfrac{\partial w_1}{\partial\delta\phi} \\[4pt] \dfrac{\partial w_2}{\partial\delta\boldsymbol{\psi}} & \dfrac{\partial w_2}{\partial\delta\phi} \end{bmatrix}_{8\times 4} \\
&= \begin{bmatrix} w_1\mathbf{I}_{3\times 3} & \mathbf{0}_{3\times 3} & \mathbf{u}_1 & \mathbf{0}_{3\times 1} \\ \mathbf{0}_{3\times 3} & w_2\mathbf{I}_{3\times 3} & \mathbf{0}_{3\times 1} & \mathbf{u}_2 \end{bmatrix}_{6\times 8}
\begin{bmatrix} \mathbf{0} & -\mathbf{u}_3 & \mathbf{u}_2 & \mathbf{0} \\ \mathbf{u}_3 & \mathbf{0} & -\mathbf{u}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & -w_2 \\ \mathbf{0} & \mathbf{0} & \mathbf{0} & w_1 \end{bmatrix}_{8\times 4} \\
&= \begin{bmatrix} \mathbf{0} & -w_1\mathbf{u}_3 & w_1\mathbf{u}_2 & -w_2\mathbf{u}_1 \\ w_2\mathbf{u}_3 & \mathbf{0} & -w_2\mathbf{u}_1 & w_1\mathbf{u}_2 \end{bmatrix}_{6\times 4}
\end{aligned}$$
To compute the line re-projection error, a spatial line in the world frame $w$ is first transformed into the body frame $b$ and then into the camera frame $c$ using the extrinsic parameters $\mathcal{T}_{bc}$:

$$\mathcal{L}^{c} = \mathcal{T}_{bc}^{-1}\mathcal{T}_{wb}^{-1}\mathcal{L}^{w} = \mathcal{T}_{bc}^{-1}\begin{bmatrix} \mathbf{R}_{wb}^{\top}\left(\mathbf{n}^{w} + [\mathbf{d}^{w}]_{\times}\mathbf{p}_{wb}\right) \\ \mathbf{R}_{wb}^{\top}\mathbf{d}^{w} \end{bmatrix}_{6\times 1}$$

The Jacobian matrix of the transformed line with respect to the rotation of the $i$th IMU body state is:

$$\frac{\partial\mathcal{L}^{c}}{\partial\delta\boldsymbol{\theta}} = \mathcal{T}_{bc}^{-1}\begin{bmatrix} \dfrac{\partial\left(\mathbf{I} - [\delta\boldsymbol{\theta}]_{\times}\right)\mathbf{R}_{wb}^{\top}\left(\mathbf{n}^{w} + [\mathbf{d}^{w}]_{\times}\mathbf{p}_{wb}\right)}{\partial\delta\boldsymbol{\theta}} \\[6pt] \dfrac{\partial\left(\mathbf{I} - [\delta\boldsymbol{\theta}]_{\times}\right)\mathbf{R}_{wb}^{\top}\mathbf{d}^{w}}{\partial\delta\boldsymbol{\theta}} \end{bmatrix} = \mathcal{T}_{bc}^{-1}\begin{bmatrix} \left[\mathbf{R}_{wb}^{\top}\left(\mathbf{n}^{w} + [\mathbf{d}^{w}]_{\times}\mathbf{p}_{wb}\right)\right]_{\times} \\ \left[\mathbf{R}_{wb}^{\top}\mathbf{d}^{w}\right]_{\times} \end{bmatrix}_{6\times 3}$$

The Jacobian matrix with respect to the position of the $i$th IMU body state is:

$$\frac{\partial\mathcal{L}^{c}}{\partial\delta\mathbf{p}} = \mathcal{T}_{bc}^{-1}\begin{bmatrix} \dfrac{\partial\,\mathbf{R}_{wb}^{\top}\left(\mathbf{n}^{w} + [\mathbf{d}^{w}]_{\times}\left(\mathbf{p}_{wb} + \delta\mathbf{p}\right)\right)}{\partial\delta\mathbf{p}} \\[6pt] \dfrac{\partial\,\mathbf{R}_{wb}^{\top}\mathbf{d}^{w}}{\partial\delta\mathbf{p}} \end{bmatrix} = \mathcal{T}_{bc}^{-1}\begin{bmatrix} \mathbf{R}_{wb}^{\top}[\mathbf{d}^{w}]_{\times} \\ \mathbf{0} \end{bmatrix}_{6\times 3}$$

References

  1. Groves, P.D. Principles of GNSS, Inertial, and Multisensor Integrated Navigation Systems; Artech House: Norwood, MA, USA, 2013. [Google Scholar]
  2. Martínez, J.L.; Morán, M.; Morales, J.; Reina, A.J.; Zafra, M. Field Navigation Using Fuzzy Elevation Maps Built with Local 3D Laser Scans. Appl. Sci. 2018, 8, 397. [Google Scholar] [CrossRef]
  3. Hess, W.; Kohler, D.; Rapp, H.; Andor, D. Real-time loop closure in 2D LIDAR SLAM. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 1271–1278. [Google Scholar]
  4. Liu, L.; Mei, T.; Niu, R.; Wang, J.; Liu, Y.; Chu, S. RBF-Based Monocular Vision Navigation for Small Vehicles in Narrow Space below Maize Canopy. Appl. Sci. 2016, 6, 182. [Google Scholar] [CrossRef]
  5. Valiente, D.; Gil, A.; Payá, L.; Sebastián, J.M.; Reinoso, Ó. Robust Visual Localization with Dynamic Uncertainty Management in Omnidirectional SLAM. Appl. Sci. 2017, 7, 1294. [Google Scholar] [CrossRef]
  6. Borraz, R.; Navarro, P.J.; Fernández, C.; Alcover, P.M. Cloud Incubator Car: A Reliable Platform for Autonomous Driving. Appl. Sci. 2018, 8, 303. [Google Scholar] [CrossRef]
  7. Wang, X.; Wang, J. Detecting glass in simultaneous localisation and mapping. Robot. Auton. Syst. 2017, 88, 97–103. [Google Scholar] [CrossRef]
  8. Titterton, D.; Weston, J.L. Strapdown Inertial Navigation Technology; The Institution of Engineering and Technology: Stevenage, UK, 2004; Volume 17. [Google Scholar]
  9. Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Robot. 2016, 32, 1309–1332. [Google Scholar] [CrossRef]
  10. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  11. Engel, J.; Koltun, V.; Cremers, D. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 611–625. [Google Scholar] [CrossRef] [PubMed]
  12. Gui, J.; Gu, D.; Wang, S.; Hu, H. A review of visual inertial odometry from filtering and optimisation perspectives. Adv. Robot. 2015, 29, 1289–1301. [Google Scholar] [CrossRef]
  13. Liu, Y.; Chen, Z.; Zheng, W.; Wang, H.; Liu, J. Monocular Visual-Inertial SLAM: Continuous Preintegration and Reliable Initialization. Sensors 2017, 17, 2613. [Google Scholar] [CrossRef] [PubMed]
  14. Forster, C.; Carlone, L.; Dellaert, F.; Scaramuzza, D. On-Manifold Preintegration for Real-Time Visual–Inertial Odometry. IEEE Trans. Robot. 2017, 33, 1–21. [Google Scholar] [CrossRef]
  15. Weiss, S.; Siegwart, R. Real-time metric state estimation for modular vision-inertial systems. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, 9–13 May 2011; pp. 4531–4537. [Google Scholar]
  16. Kneip, L.; Weiss, S.; Siegwart, R. Deterministic initialization of metric state estimation filters for loosely-coupled monocular vision-inertial systems. In Proceedings of the 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), San Francisco, CA, USA, 25–30 September 2011; pp. 2235–2241. [Google Scholar]
  17. Bloesch, M.; Burri, M.; Omari, S.; Hutter, M.; Siegwart, R. Iterated extended Kalman filter based visual-inertial odometry using direct photometric feedback. Int. J. Robot. Res. 2017, 36, 1053–1072. [Google Scholar] [CrossRef]
  18. Leutenegger, S.; Lynen, S.; Bosse, M.; Siegwart, R.; Furgale, P. Keyframe-based visual–inertial odometry using nonlinear optimization. Int. J. Robot. Res. 2015, 34, 314–334. [Google Scholar] [CrossRef]
  19. Jones, E.S.; Soatto, S. Visual-inertial navigation, mapping and localization: A scalable real-time causal approach. Int. J. Robot. Res. 2011, 30, 407–430. [Google Scholar] [CrossRef]
  20. Bloesch, M.; Omari, S.; Hutter, M.; Siegwart, R. Robust visual inertial odometry using a direct EKF-based approach. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 298–304. [Google Scholar]
  21. Mourikis, A.I.; Roumeliotis, S.I. A multi-state constraint Kalman filter for vision-aided inertial navigation. In Proceedings of the 2007 IEEE International Conference on Robotics and Automation, Roma, Italy, 10–14 April 2007; pp. 3565–3572. [Google Scholar]
  22. Lupton, T.; Sukkarieh, S. Visual-inertial-aided navigation for high-dynamic motion in built environments without initial conditions. IEEE Trans. Robot. 2012, 28, 61–76. [Google Scholar] [CrossRef]
Figure 1. An illustration of visual–inertial sensors, point observations, and line observations. IMU: inertial measurement unit.
Figure 2. Plücker coordinates for line features. (a) Plücker line coordinates; (b) Initialization of a newly observed line.
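As background for the caption above, the standard Plücker parameterization of a 3D line (in the spirit of Bartoli and Sturm's line geometry) can be sketched as follows; the symbols are generic and are not necessarily the paper's notation. A line through two 3D points $\mathbf{p}_1$ and $\mathbf{p}_2$ is

$$ \mathcal{L} = \begin{bmatrix} \mathbf{n} \\ \mathbf{d} \end{bmatrix} \in \mathbb{R}^{6}, \qquad \mathbf{n} = \mathbf{p}_1 \times \mathbf{p}_2, \quad \mathbf{d} = \mathbf{p}_2 - \mathbf{p}_1, \quad \mathbf{n}^{\mathsf{T}} \mathbf{d} = 0, $$

where $\mathbf{n}$ is the normal of the plane spanned by the line and the origin and $\mathbf{d}$ is the line direction. Under a rigid transform from the world frame $w$ to the camera frame $c$ with rotation $\mathbf{R}_{cw}$ and translation $\mathbf{t}_{cw}$, the line transforms as

$$ \mathcal{L}^{c} = \begin{bmatrix} \mathbf{R}_{cw} & [\mathbf{t}_{cw}]_{\times} \mathbf{R}_{cw} \\ \mathbf{0} & \mathbf{R}_{cw} \end{bmatrix} \mathcal{L}^{w}. $$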
Figure 3. Factor graphs of (a) the visual simultaneous localization and mapping (SLAM) problem versus (b) visual–inertial SLAM.
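To make the factor-graph picture concrete, a sliding-window visual–inertial objective of the kind described in the abstract can be written in the generic form below; this is an illustrative sketch, not the paper's exact equation:

$$ \min_{\mathcal{X}} \; \left\| \mathbf{r}_{\mathrm{prior}}(\mathcal{X}) \right\|^{2} + \sum_{k} \left\| \mathbf{r}_{\mathcal{B}}\!\left(\mathbf{z}_{b_k b_{k+1}}, \mathcal{X}\right) \right\|^{2}_{\boldsymbol{\Sigma}_{b_k b_{k+1}}} + \sum_{(i,j)} \rho\!\left( \left\| \mathbf{r}_{\mathcal{P}}\!\left(\mathbf{z}^{p}_{ij}, \mathcal{X}\right) \right\|^{2}_{\boldsymbol{\Sigma}^{p}_{ij}} \right) + \sum_{(i,l)} \rho\!\left( \left\| \mathbf{r}_{\mathcal{L}}\!\left(\mathbf{z}^{l}_{il}, \mathcal{X}\right) \right\|^{2}_{\boldsymbol{\Sigma}^{l}_{il}} \right), $$

where $\mathcal{X}$ collects the body states in the window together with point and line landmarks, the first term is the prior from marginalization, $\mathbf{r}_{\mathcal{B}}$ are IMU pre-integration residuals, $\mathbf{r}_{\mathcal{P}}$ and $\mathbf{r}_{\mathcal{L}}$ are point and line re-projection residuals with covariances $\boldsymbol{\Sigma}$, and $\rho$ is a robust loss (e.g., Huber).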
Figure 4. Overview of our point–line visual–inertial odometry (PL-VIO) system. (A) The front-end module extracts information from the raw measurements; (B) the back-end module estimates and optimizes the body states.
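As an illustration of what a front end such as module A might do, the sketch below tracks point features with KLT and detects line segments with LSD. It is a minimal sketch, not the authors' implementation; it assumes an OpenCV 3.x Python build in which cv2.createLineSegmentDetector() is available, and the function name process_frame is ours.

```python
# Minimal point/line front-end sketch: KLT tracking of FAST corners plus LSD line
# detection. Illustrative only; descriptor-based line matching (e.g., LBD) and
# outlier rejection are omitted.
import cv2
import numpy as np

fast = cv2.FastFeatureDetector_create(threshold=20)
lsd = cv2.createLineSegmentDetector()          # available in OpenCV 3.x builds

def process_frame(prev_gray, gray, prev_pts):
    """Track previous corners into the new frame, then detect fresh corners and lines."""
    tracked = np.empty((0, 1, 2), dtype=np.float32)
    if prev_gray is not None and prev_pts is not None and len(prev_pts) > 0:
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_pts, None)
        tracked = nxt[status.ravel() == 1].reshape(-1, 1, 2)   # keep successfully tracked points
    keypoints = fast.detect(gray, None)                        # candidate new corners
    new_pts = np.array([kp.pt for kp in keypoints], dtype=np.float32).reshape(-1, 1, 2)
    lines = lsd.detect(gray)[0]                                # each row: x1, y1, x2, y2
    return tracked, new_pts, lines
```

In a complete system, these raw detections would be passed to the back end (module B) together with the pre-integrated IMU measurements.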
Figure 5. Mapping results in the MH_05_difficult sequence. (a–d) Detected point and line features in different frames. (e) The reconstructed line map (red) and the trajectory (green). Scenes with rich line features (such as the window, baluster, and cabinet) are clearly recognizable in the map. (f) The reconstructed point map (blue). Compared with the line map, the structure is hard to identify in the point map.
Figure 6. RMSEs for ROVIO, OKVIS-Mono, and VINS-Mono without loop closure, and for the proposed PL-VIO on the EuRoC dataset. (a) RMSEs in translation. (b) RMSEs in rotation. The rotation errors of ROVIO are one order of magnitude higher than those of the other three methods, so its results are not reported in (b).
Figure 7. Comparison of the proposed method versus VINS-Mono. The three trajectories in the left column were produced by VINS-Mono on the (a) MH_05_difficult, (c) MH_04_difficult, and (e) V1_03_difficult sequences. The three trajectories in the right column were produced by the proposed PL-VIO on the (b) MH_05_difficult, (d) MH_04_difficult, and (f) V1_03_difficult sequences. Colors encode the corresponding absolute pose errors.
Figure 8. Tracked point/line features under rapid rotation in V1_03_difficult. Lighting conditions and viewing directions change between successive frames. Matched point features are marked with the same color in frames (a) and (b); matched line segment features are marked with the same color in frames (c) and (d).
Figure 9. Images from the PennCOSYVIO dataset. Red lines are detected line segments. (a) Outside the glass building; (b) inside the glass building.
Table 1. Root mean square error (RMSE) results on several EuRoC MAV dataset sequences. Translation errors are given in meters and rotation errors in radians. For each sequence, the lowest error per metric corresponds to the estimated trajectory closest to the ground-truth trajectory.

Seq.              ROVIO               OKVIS-Mono          VINS-Mono           PL-VIO
                  Trans.    Rot.      Trans.    Rot.      Trans.    Rot.      Trans.    Rot.
MH_02_easy        0.59075   2.21181   0.36059   0.06095   0.25663   0.04802   0.23274   0.04204
MH_03_medium      0.40709   1.92561   0.21534   0.02622   0.11239   0.02027   0.11224   0.02016
MH_04_difficult   0.88363   2.30330   0.23984   0.01943   0.15366   0.02173   0.13942   0.01915
MH_05_difficult   1.17578   2.27213   0.39644   0.01987   0.30351   0.01038   0.25687   0.00892
V1_01_easy        0.18153   2.03399   0.08583   0.09665   0.05843   0.09995   0.05916   0.09869
V1_02_medium      0.19563   1.93652   0.12207   0.04073   0.06970   0.03022   0.07656   0.02871
V1_03_difficult   0.17091   2.02069   0.19613   0.06591   0.14531   0.08021   0.13016   0.04382
V2_02_medium      0.60686   1.84458   0.18253   0.04773   0.10218   0.04558   0.09450   0.04177
V2_03_difficult   0.18912   1.92380   0.30513   0.07527   0.26446   0.06162   0.26085   0.06098
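For reference, translation and rotation RMSE values of the kind reported in Table 1 can be computed from time-associated, aligned trajectories roughly as in the sketch below. This is illustrative code, not the evaluation script used for the paper, and the function names are ours.

```python
# RMSE of absolute trajectory errors from corresponding, already-aligned poses.
import numpy as np

def translation_rmse(est_xyz, gt_xyz):
    """est_xyz, gt_xyz: (N, 3) arrays of corresponding positions in meters."""
    err = est_xyz - gt_xyz
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))

def rotation_rmse(est_R, gt_R):
    """est_R, gt_R: (N, 3, 3) rotation matrices; RMSE of the geodesic angle in radians."""
    angles = []
    for Re, Rg in zip(est_R, gt_R):
        cos_a = np.clip((np.trace(Rg.T @ Re) - 1.0) / 2.0, -1.0, 1.0)
        angles.append(np.arccos(cos_a))
    return float(np.sqrt(np.mean(np.square(angles))))
```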
Table 2. Average running times of different methods on the EuRoC dataset (unit: milliseconds).

Seq.              ROVIO   OKVIS-Mono   VINS-Mono   PL-VIO
MH_02_easy        15      31           63          127
MH_03_medium      15      30           62          112
MH_04_difficult   15      24           54          108
MH_05_difficult   15      27           58          102
V1_01_easy        14      26           45          93
V1_02_medium      15      23           37          86
V1_03_difficult   15      20           29          82
V2_02_medium      15      22           33          86
V2_03_difficult   15      21           27          85
Table 3. RMSE of the absolute pose error (APE) for different algorithms. For each metric, the lower value corresponds to the estimated trajectory closer to the ground-truth trajectory.

Algorithm    Translation Error (m)   Rotation Error (rad)
VINS-Mono    1.14690                 0.04156
PL-VIO       1.05975                 0.03742
Table 4. Results computed with the PennCOSYVIO evaluation tool for different algorithms. Rotation errors for the APE and the relative pose error (RPE) are given in degrees. Translation errors are reported along the x-, y-, and z-axes: APE translation errors are given in meters, while RPE translation errors are given in percent (%). For each column, the lower value corresponds to the estimated trajectory closer to the ground-truth trajectory.

Algorithm    APE                               RPE
             x       y       z       Rot.      x       y       z       Rot.
VINS-Mono    0.423   0.173   0.861   2.3477    2.807   1.844   4.663   1.9337
PL-VIO       0.524   0.070   0.769   2.0782    2.375   1.844   4.361   1.7350
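For readers unfamiliar with the two metrics in Table 4, the commonly used definitions (following the RGB-D benchmark convention of Sturm et al.) can be sketched as follows; the exact conventions of the PennCOSYVIO tool may differ in detail. With ground-truth poses $\mathbf{Q}_i$, estimated poses $\mathbf{P}_i$, and an alignment transform $\mathbf{S}$, the absolute pose error at time $i$ and the relative pose error over an interval $\Delta$ are

$$ \mathbf{E}_i = \mathbf{Q}_i^{-1} \, \mathbf{S} \, \mathbf{P}_i, \qquad \mathbf{F}_i^{\Delta} = \left( \mathbf{Q}_i^{-1} \mathbf{Q}_{i+\Delta} \right)^{-1} \left( \mathbf{P}_i^{-1} \mathbf{P}_{i+\Delta} \right), $$

from which translational and rotational error components are extracted; reporting the RPE translation as a percentage normalizes the drift by the distance traveled over the interval.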
Table 5. Mean execution time of PL-VIO on the V1_02_medium sequence.

Module      Operation                               Time (ms)   Rate (Hz)   Thread ID
front end   point feature detection and matching    4           25          1
            line feature detection and matching     86          11          2
            IMU forward propagation                 1           100         3
back end    nonlinear optimization                  28          15          4
            marginalization                         35          15          4
            feature triangulation and culling       2           15          4
