Tightly-Coupled Stereo Visual-Inertial Navigation Using Point and Line Features

This paper presents a novel approach for estimating the ego-motion of a vehicle in dynamic and unknown environments using tightly-coupled inertial and visual sensors. To improve the accuracy and robustness, we exploit the combination of point and line features to aid navigation. The mathematical framework is based on trifocal geometry among image triplets, which is simple and unified for point and line features. For the fusion algorithm design, we employ the Extended Kalman Filter (EKF) for error state prediction and covariance propagation, and the Sigma Point Kalman Filter (SPKF) for robust measurement updating in the presence of high nonlinearities. The outdoor and indoor experiments show that the combination of point and line features improves the estimation accuracy and robustness compared to the algorithm using point features alone.


Introduction
Reliable navigation in dynamic and unknown environments is a key requirement for many applications, particularly for autonomous ground, underwater and air vehicles. The most common sensor modality used to tackle this problem is the Inertial Measurement Unit (IMU). However, inertial navigation systems (INS) are known to drift over time due to error accumulation [1]. In recent decades, the topic of vision-aided inertial navigation has received considerable attention in the research community, thanks to some important advantages [2][3][4][5][6][7][8][9]. Firstly, the integrated system can operate in environments where GPS is unreliable or unavailable. Secondly, the complementary frequency responses and noise characteristics of vision and inertial sensors address each other's limitations and deficiencies [10]. In particular, fast and highly dynamic motions can be precisely tracked by an IMU over short periods, which mitigates the scale ambiguity and large latency of vision to a certain extent. On the other hand, the low-frequency drift of the inertial measurements can be significantly constrained by visual observations. Furthermore, both cameras and IMUs are low-cost, lightweight and low-power devices, which makes them ideal for many payload-constrained platforms. Corke [10] has presented a comprehensive introduction to these two sensory modalities from both a biological and an engineering perspective.
The simplest fusion scheme for a vision-aided inertial navigation system (VINS) uses separate INS and visual blocks, and fuses information in a loosely-coupled manner [10]. For instance, some methods fuse the inertial navigation solution with the relative pose estimated between consecutive image measurements [11][12][13][14]. Tightly-coupled methods, in contrast, process the raw information of both sensors in a single estimator, so that all the correlations between them are considered, leading to higher accuracy [15,16]. The most common tightly-coupled scheme augments the filter state with the 3D feature positions, and concurrently estimates motion and structure [2]. However, this method suffers from high computational complexity, as the dimension of the state vector grows with the number of observed features. To address this problem, Mourikis [15] proposed an EKF-based algorithm that maintains a sliding window of poses in the filter state, and makes use of the tracked features to impose constraints on these poses. The shortcomings of this approach are twofold: (1) the space complexity is high, because all the tracked features must be stored; (2) it requires reconstructing the 3D positions of the tracked feature points, which is not necessary for navigation tasks. To overcome these shortcomings, Hu [9] developed a sliding-window odometry that uses the monocular camera geometry constraints among three images as measurements, resulting in a tradeoff between accuracy and computational cost.
While vision-aided inertial navigation has been extensively studied, and a considerable amount of work has been dedicated to processing visual observations of point features [2,4,5,7], much less work has been aimed at exploiting line features. In fact, line primitives and point primitives provide complementary information about the image [17]. There are many scenes (e.g., wall corners, stairwell edges, etc.) where point primitive matches are unreliable while line primitives are well matched, thanks to their multi-pixel support [6].
On the other hand, points are crucial as they give more information than lines. For instance, there are no pose constraints imposed by line correspondences from two views, while there are well-known epipolar geometry constraints for point correspondences from two views [18].
In this paper, we propose a method that combines point and line features for navigation aiding in a simple and unified framework. Our algorithm can deal with any mixed combination of point and line correspondences by utilizing the trifocal geometry across two stereo views. In the implementation, the inertial sensors are tightly coupled with feature tracking to improve robustness and tracking speed. Meanwhile, the drift of the inertial sensors is greatly reduced by the constraints imposed by the tracked features. Leveraging both the complementary characteristics of the inertial and visual sensors and the complementary characteristics of point and line features, the proposed algorithm demonstrates improved accuracy and robustness.
The remainder of this paper is organized as follows: we describe the mathematical model of the VINS in Section 2, and then develop our estimator in Section 3. Experimental results are given in Section 4. Finally, Section 5 contains some conclusions and suggests several directions for future work.

Notations and Convention
We denote scalars by italic lowercase letters (e.g., $a$), vectors by boldface lowercase letters (e.g., $\mathbf{p}$), and matrices by boldface uppercase letters (e.g., $\mathbf{R}$). If a vector or matrix describes the relative pose of one reference frame with respect to another, we combine subscript letters to designate the frames; e.g., $\mathbf{p}_{WI}$ represents the translation vector from the origin of frame $\{W\}$ to the origin of frame $\{I\}$, and $\mathbf{R}_{WI}$ represents the direction cosine matrix of frame $\{I\}$ with respect to frame $\{W\}$. The six-degrees-of-freedom transform between two reference frames can be represented as a translation followed by a rotation. In the remaining Sections, unit quaternions are also used to describe the relative orientation of two reference frames; e.g., $\mathbf{q}_{WI}$ represents the orientation of frame $\{I\}$ in frame $\{W\}$. Finally, to represent projective geometry, it is simpler and more symmetric to introduce homogeneous coordinates, which provide a scale-invariant representation for points and lines in the Euclidean plane. In this paper, vectors in homogeneous coordinate form are expressed with an underline; e.g., $\underline{\mathbf{m}} = (u, v, w)^T$ represents the point $(u', v')$ in the Euclidean plane, with $u' = u/w$, $v' = v/w$, $w \neq 0$.
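As a minimal illustration of the homogeneous-coordinate convention above (function names are ours, not the paper's):

```python
import numpy as np

def to_homogeneous(p):
    """Lift a Euclidean 2D point (u', v') to homogeneous form (u', v', 1)."""
    return np.append(np.asarray(p, dtype=float), 1.0)

def to_euclidean(m):
    """Recover the Euclidean point (u/w, v/w) from m = (u, v, w), w != 0."""
    return m[:2] / m[2]
```

Any nonzero scalar multiple of $\underline{\mathbf{m}}$ maps to the same Euclidean point, which is the scale invariance the text refers to.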

System Model
The evolving IMU state is described by the vector:

$\mathbf{x}_{IMU} = \begin{bmatrix} \mathbf{q}_{WI}^T & \mathbf{p}_{WI}^T & \mathbf{v}_{WI}^T & \mathbf{b}_g^T & \mathbf{b}_a^T \end{bmatrix}^T$

where $\mathbf{q}_{WI}$ is the unit quaternion describing the orientation of the IMU frame $\{I\}$ with respect to the world frame $\{W\}$, $\mathbf{p}_{WI}$ and $\mathbf{v}_{WI}$ are the position and velocity of the IMU expressed in the world frame, and $\mathbf{b}_g$ and $\mathbf{b}_a$ are the IMU gyroscope and accelerometer biases, respectively.
In this work, we model the biases as Gaussian random walk processes, driven by the white, zero-mean noise vectors $\mathbf{n}_{gw}$ and $\mathbf{n}_{aw}$, with covariance matrices $\mathbf{Q}_{gw}$ and $\mathbf{Q}_{aw}$, respectively.
The time evolution of the IMU state is given by the following equations [2]:

$\dot{\mathbf{q}}_{WI} = \tfrac{1}{2}\,\boldsymbol{\Omega}(\boldsymbol{\omega}_{WI}^{I})\,\mathbf{q}_{WI}, \quad \dot{\mathbf{p}}_{WI} = \mathbf{v}_{WI}, \quad \dot{\mathbf{v}}_{WI} = \mathbf{a}_{WI}^{W}, \quad \dot{\mathbf{b}}_g = \mathbf{n}_{gw}, \quad \dot{\mathbf{b}}_a = \mathbf{n}_{aw} \quad (3)$

where $\boldsymbol{\Omega}(\boldsymbol{\omega})$ is the quaternion multiplication matrix:

$\boldsymbol{\Omega}(\boldsymbol{\omega}) = \begin{bmatrix} -\lfloor \boldsymbol{\omega} \rfloor_{\times} & \boldsymbol{\omega} \\ -\boldsymbol{\omega}^T & 0 \end{bmatrix}$

which relates the time rate of change of the unit quaternion to the angular velocity; $\boldsymbol{\omega}_{WI}^{I}$ is the angular velocity of the IMU with respect to the world frame, expressed in the IMU frame, and $\mathbf{a}_{WI}^{W}$ is the acceleration of the IMU with respect to the world frame, expressed in the world frame. The measured angular velocity and linear acceleration are:

$\boldsymbol{\omega}_m = \boldsymbol{\omega}_{WI}^{I} + \mathbf{b}_g + \mathbf{n}_g, \qquad \mathbf{a}_m = \mathbf{R}_{WI}^{T}\left(\mathbf{a}_{WI}^{W} - \mathbf{g}^{W}\right) + \mathbf{b}_a + \mathbf{n}_a$

where $\mathbf{R}_{WI}$ is the direction cosine matrix corresponding to the unit quaternion $\mathbf{q}_{WI}$, $\mathbf{g}^{W}$ is the gravity vector in the world frame, and $\mathbf{n}_g$ and $\mathbf{n}_a$ are the measurement noises of the gyroscope and accelerometer, assumed to be zero-mean Gaussian with covariance matrices $\mathbf{Q}_g$ and $\mathbf{Q}_a$, respectively. Note that we do not consider the Earth's rotation rate in the gyroscope measurement, because it is small relative to the noise and bias of the low-cost gyroscope.
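The quaternion kinematics above can be integrated numerically; a minimal sketch, assuming the $[x, y, z, w]$ component ordering (function names are ours):

```python
import numpy as np

def skew(w):
    """Skew-symmetric matrix [w]x such that [w]x @ v = w x v."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def omega(w):
    """Quaternion multiplication matrix Omega(w) for q = [x, y, z, w],
    so that q_dot = 0.5 * Omega(w) @ q."""
    O = np.zeros((4, 4))
    O[:3, :3] = -skew(w)
    O[:3, 3] = w
    O[3, :3] = -w
    return O

def propagate_quaternion(q, w, dt):
    """First-order (Euler) integration of the quaternion kinematics,
    followed by renormalization to keep q a unit quaternion."""
    q = q + 0.5 * omega(w) @ q * dt
    return q / np.linalg.norm(q)
```

In practice higher-order integration (e.g., Runge-Kutta) is often preferred; the Euler step is shown only to make the kinematics concrete.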

Camera Model
In this Section, we consider the standard perspective camera model, which is commonly used in computer vision applications. Let $\mathbf{K}$ denote the intrinsic camera parameter matrix, which can be obtained by calibration. The mapping between a 3D point in homogeneous coordinates $\underline{\mathbf{X}}$ and its homogeneous image point $\underline{\mathbf{m}}$ is given by:

$\underline{\mathbf{m}} \propto \mathbf{P}\,\underline{\mathbf{X}}$

where $\propto$ means equality up to scale, and $\mathbf{P} = \mathbf{K}[\mathbf{R} \mid \mathbf{t}]$ is the $3 \times 4$ camera matrix, with $\mathbf{R}$ and $\mathbf{t}$ representing the pose of the camera with respect to the world reference frame. Similarly, the mapping between a 3-space line represented as a Plücker matrix $\mathbf{L}$ and the homogeneous image line $\underline{\mathbf{l}}$ is given by [18]:

$[\underline{\mathbf{l}}]_{\times} \propto \mathbf{P}\,\mathbf{L}\,\mathbf{P}^T$
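The perspective point projection can be sketched as follows (a minimal version; function names are ours):

```python
import numpy as np

def project_point(K, R, t, Xw):
    """Project a 3D world point Xw through P = K [R | t];
    returns the Euclidean pixel coordinates (u, v)."""
    X = np.append(Xw, 1.0)                    # homogeneous 3D point
    P = K @ np.hstack([R, t.reshape(3, 1)])   # 3x4 camera matrix
    x = P @ X                                 # homogeneous image point, defined up to scale
    return x[:2] / x[2]
```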

Review of the Trifocal Tensor
A trifocal tensor is a 3 × 3 × 3 array of numbers that describes the geometric relations among three views. It depends only on the relative motion between the views and is independent of the scene structure. Assuming that the camera matrices of the three views are $\mathbf{P}_1 = [\mathbf{I} \mid \mathbf{0}]$, $\mathbf{P}_2$ and $\mathbf{P}_3$, the entries of the trifocal tensor can be derived accordingly using the standard matrix-vector notation [18]:

$\mathbf{T}_i = \mathbf{a}_i \mathbf{b}_4^T - \mathbf{a}_4 \mathbf{b}_i^T, \qquad i = 1, 2, 3$

where $\mathbf{a}_i$ and $\mathbf{b}_i$ denote the $i$-th columns of the camera matrices $\mathbf{P}_2$ and $\mathbf{P}_3$, respectively. Once the trifocal tensor is computed, we can use it to map a pair of matched points $\underline{\mathbf{m}}_1 \leftrightarrow \underline{\mathbf{m}}_2$ in the first and second views into the third view, using the homography between the first and third views induced by a line in the second image [18]. As shown in Figure 1a, a line in the second view defines a plane in space, and this plane induces a homography between the first and third views. As recommended by Hartley [18], the line $\underline{\mathbf{l}}_2$ is chosen as the line perpendicular to the epipolar line. The transfer procedure is summarized as follows [18]: (1) compute the epipolar line of $\underline{\mathbf{m}}_1$ in the second view; (2) choose $\underline{\mathbf{l}}_2$ as the line through $\underline{\mathbf{m}}_2$ perpendicular to this epipolar line; (3) the transferred point is

$\underline{\mathbf{m}}_3 = \left(\sum_i m_1^i\, \mathbf{T}_i\right)^T \underline{\mathbf{l}}_2$

Similarly, it is possible to transfer a pair of matched lines $\underline{\mathbf{l}}_2 \leftrightarrow \underline{\mathbf{l}}_3$ in the second and third views into the first view according to the line transfer equation [18]:

$l_1^i = \underline{\mathbf{l}}_2^T\, \mathbf{T}_i\, \underline{\mathbf{l}}_3$
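The tensor construction and point transfer above can be sketched directly, assuming the first camera is canonical, $\mathbf{P}_1 = [\mathbf{I} \mid \mathbf{0}]$ (function names are ours):

```python
import numpy as np

def trifocal_tensor(P2, P3):
    """Trifocal tensor T_i = a_i b_4^T - a_4 b_i^T for canonical P1 = [I | 0],
    where a_i and b_i are the i-th columns of P2 and P3."""
    T = np.zeros((3, 3, 3))
    for i in range(3):
        T[i] = np.outer(P2[:, i], P3[:, 3]) - np.outer(P2[:, 3], P3[:, i])
    return T

def transfer_point(T, x1, l2):
    """Point transfer x3^k = x1^i l2_j T_i^{jk}: map the point x1 (view 1)
    into view 3 through the homography induced by the line l2 in view 2."""
    M = sum(x1[i] * T[i] for i in range(3))   # 3x3 matrix with indices (j, k)
    return M.T @ l2                            # homogeneous point in view 3
```

Any non-degenerate line through the matched point in the second view (i.e., not the epipolar line) yields the same transferred point, which is why the perpendicular choice is merely the numerically best one.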

Stereo Vision Measurement Model via Trifocal Geometry
In this Section, we exploit the trifocal geometry of stereo vision to derive the measurement model. The stereo camera configuration of two consecutive frames is depicted in Figure 1b; for the sake of clarity, only the geometrical relations of lines are shown. The camera matrices of the stereo image pair at the previous time step are represented in canonical form, with the previous left camera taken as the reference. For simplicity, we assume that the IMU frame of reference coincides with the camera frame of reference, so that the terms in Equation (12) can be expressed in terms of the IMU states. The point observation function relating to the current right image can then be determined according to Equation (9), using the camera matrices of Equations (11) and (12), where $h_1(\cdot)$ denotes the pixel differences between the transferred point and the measured point. For line measurements, we also need a formulation that compares the transferred lines with the measured lines. Because of the aperture problem [19], only the measurement components that are orthogonal to the transferred line can be used for correction. In [3,17], the line-point is chosen as the observation, defined as the closest point on the line segment to the image origin; accordingly, the error function is defined as the difference between the measured and transferred line-points, which is similar to the error function of point features. However, when a line passes through the origin, its orientation error cannot be revealed by this error function. Thus, we choose as the observation the signed distances from the endpoints of the measured line segment to the transferred line.
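The endpoint-to-line residual can be sketched as follows (a minimal version; the function name is ours):

```python
import numpy as np

def line_endpoint_residual(l, p1, p2):
    """Signed perpendicular distances from the two endpoints (p1, p2) of the
    measured segment to the transferred homogeneous line l = (a, b, c)."""
    a, b, c = l / np.linalg.norm(l[:2])   # normalize so distances are in pixels
    return np.array([a * p1[0] + b * p1[1] + c,
                     a * p2[0] + b * p2[1] + c])
```

Unlike the line-point error, this two-component residual reflects the orientation error of the line even when the line passes through the image origin.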
Similarly, the line observation function concerning the first, second, and fourth views is defined in terms of the line transferred from the second and fourth views into the first view.
As we process the point and line measurements in a unified manner once the corresponding errors are defined, we write the observation model as a single function:

$\mathbf{z} = h(\mathbf{f}_1, \mathbf{f}_2, \mathbf{f}_3, \mathbf{f}_4, \mathbf{T}_L, \mathbf{T}_R) \quad (23)$

where $\mathbf{f}_1, \mathbf{f}_2, \mathbf{f}_3, \mathbf{f}_4$ denote the general feature correspondences among the four views, $\mathbf{T}_L$ and $\mathbf{T}_R$ encode the motion information between the successive stereo image pairs, and the function $h(\cdot)$ defines the observations based on the feature type.

Structure of the State Vector
As can be seen in the previous Section, the measurement models are implicit relative-pose measurements, which relate the system state at two different time instants (i.e., the current time and the previous time at which an image pair was captured). However, the "standard" Kalman filter formulation requires that the measurements employed for the state update be independent of any previous filter states. The problem can be addressed by augmenting the state vector to include the IMU pose at the time the last image pair was recorded. With this state augmentation, the measurements are related only to the current state, and thus a Kalman filter framework can be applied. The augmented nominal state is given by:

$\hat{\mathbf{x}} = \begin{bmatrix} \hat{\mathbf{x}}_{IMU}^T & \hat{\mathbf{q}}_{WI_p}^T & \hat{\mathbf{p}}_{WI_p}^T \end{bmatrix}^T$

where $\hat{\mathbf{x}}_{IMU}$ is the nominal state of the IMU, which can be obtained by integrating Equation (3), and $\hat{\mathbf{q}}_{WI_p}$ and $\hat{\mathbf{p}}_{WI_p}$ are the orientation and position of the IMU at the previous imaging time. The corresponding error state is defined analogously, where $\delta\mathbf{x}_{IMU}$ is the IMU error-state:

$\delta\mathbf{x}_{IMU} = \begin{bmatrix} \delta\boldsymbol{\theta}_I^T & \delta\mathbf{p}^T & \delta\mathbf{v}^T & \delta\mathbf{b}_g^T & \delta\mathbf{b}_a^T \end{bmatrix}^T$

The standard additive error definition is used for the position, velocity and biases, while for the orientation error $\delta\boldsymbol{\theta}_I$, the multiplicative error definition is applied:

$\mathbf{q}_{WI} = \delta\mathbf{q} \otimes \hat{\mathbf{q}}_{WI}, \qquad \delta\mathbf{q} \approx \begin{bmatrix} \tfrac{1}{2}\delta\boldsymbol{\theta}_I^T & 1 \end{bmatrix}^T$

where the symbol $\otimes$ denotes quaternion multiplication. With the above error definition, the true state may be expressed as a suitable composition of the nominal and the error states:

$\mathbf{x} = \hat{\mathbf{x}} \oplus \delta\mathbf{x} \quad (28)$

where $\oplus$ means a generic composition.
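The multiplicative composition of the orientation error can be sketched as follows, assuming the $[x, y, z, w]$ quaternion ordering (function names are ours):

```python
import numpy as np

def quat_mul(q1, q2):
    """Hamilton product of two quaternions in [x, y, z, w] order."""
    x1, y1, z1, w1 = q1
    x2, y2, z2, w2 = q2
    return np.array([
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2])

def inject_attitude_error(q_nom, dtheta):
    """True-state composition q = dq (x) q_nom with the small-angle error
    quaternion dq ~ [dtheta/2, 1], renormalized to unit length."""
    dq = np.append(0.5 * np.asarray(dtheta), 1.0)
    q = quat_mul(dq, q_nom)
    return q / np.linalg.norm(q)
```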
Each time a new IMU measurement is received, the nominal state prediction is performed by numerical integration of the kinematic Equations (3) and (33). In order to propagate the error covariance, we compute the discrete-time state transition matrix $\boldsymbol{\Phi}_k$. The elements of $\boldsymbol{\Phi}_k$ can be computed analytically, following a derivation similar to [20]. The state transition matrix differs slightly from [21], because the rates of filter prediction and filter update are different in our case.
The noise covariance matrix $\mathbf{Q}_d$ of the discrete-time system is evaluated by:

$\mathbf{Q}_d = \int_{t_k}^{t_{k+1}} \boldsymbol{\Phi}(t_{k+1}, \tau)\, \mathbf{G}\, \mathbf{Q}_c\, \mathbf{G}^T\, \boldsymbol{\Phi}(t_{k+1}, \tau)^T\, d\tau$

where $\mathbf{G}$ is the noise input matrix and $\mathbf{Q}_c$ is the continuous-time process noise covariance. The predicted covariance is then obtained as:

$\mathbf{P}_{k+1|k} = \boldsymbol{\Phi}_k\, \mathbf{P}_{k|k}\, \boldsymbol{\Phi}_k^T + \mathbf{Q}_d$
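A minimal sketch of the covariance time update, with $\mathbf{Q}_d$ approximated by a zero-order hold of the continuous-time noise rather than the full integral (function name is ours):

```python
import numpy as np

def propagate_covariance(P, Phi, G, Qc, dt):
    """EKF covariance time update: P <- Phi P Phi^T + Qd, where the
    discrete noise covariance is approximated as Qd ~ G Qc G^T * dt."""
    Qd = G @ Qc @ G.T * dt
    return Phi @ P @ Phi.T + Qd
```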

Measurement Update
Since the measurement model is highly nonlinear, we employ statistical linearization for the measurement update, which is generally more accurate than a first-order Taylor series expansion [22]. Specifically, the Sigma Point approach is applied. First, the following set of $2n+1$ sigma points is selected from the predicted state distribution:

$\mathcal{X}_0 = \hat{\mathbf{x}}_{k+1|k}, \qquad \mathcal{X}_i = \hat{\mathbf{x}}_{k+1|k} + \left(\sqrt{(n+\kappa)\,\mathbf{P}_{k+1|k}}\right)_i, \qquad \mathcal{X}_{i+n} = \hat{\mathbf{x}}_{k+1|k} - \left(\sqrt{(n+\kappa)\,\mathbf{P}_{k+1|k}}\right)_i, \qquad i = 1, \ldots, n$

with mean weights $W_0^m = \kappa/(n+\kappa)$ and $W_i^m = 1/(2(n+\kappa))$, and covariance weights $W_0^c = W_0^m + \beta$ and $W_i^c = W_i^m$, where $n$ is the state dimension, $\kappa$ is a spread parameter, and $\beta$ is related to the higher-order moments of the distribution [23] (a good starting guess is $\beta = 2$ for a Gaussian distribution).
The predicted measurement vector is determined by propagating each sigma point through the nonlinear observation function $h(\cdot)$ defined in Equation (23):

$\mathcal{Z}_j = h(\mathcal{X}_j), \qquad j = 0, \ldots, 2n$

The mean and covariances are computed as:

$\hat{\mathbf{z}} = \sum_{j=0}^{2n} W_j^m\, \mathcal{Z}_j, \qquad \mathbf{P}_{zz} = \sum_{j=0}^{2n} W_j^c\, (\mathcal{Z}_j - \hat{\mathbf{z}})(\mathcal{Z}_j - \hat{\mathbf{z}})^T, \qquad \mathbf{P}_{xz} = \sum_{j=0}^{2n} W_j^c\, (\mathcal{X}_j - \hat{\mathbf{x}}_{k+1|k})(\mathcal{Z}_j - \hat{\mathbf{z}})^T$

where $\mathbf{P}_{zz}$ and $\mathbf{P}_{xz}$ are the predicted measurement covariance matrix and the state-measurement cross-covariance matrix, respectively. The filter gain is given as follows:

$\mathbf{K} = \mathbf{P}_{xz}\,(\mathbf{P}_{zz} + \mathbf{R})^{-1}$

where $\mathbf{R}$ is the measurement noise covariance matrix.
Then, the error state and error covariance are updated using the standard Kalman filter equations:

$\delta\hat{\mathbf{x}}_{k+1|k+1} = \mathbf{K}\,(\mathbf{z} - \hat{\mathbf{z}}), \qquad \mathbf{P}_{k+1|k+1} = \mathbf{P}_{k+1|k} - \mathbf{K}\,(\mathbf{P}_{zz} + \mathbf{R})\,\mathbf{K}^T$

After the measurement update, the estimated error state $\delta\hat{\mathbf{x}}_{k+1|k+1}$ is used to correct the nominal state. Finally, the old augmented pose is replaced by the current one, and the corresponding error covariance is revised accordingly.
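Putting the sigma-point steps together, a compact sketch of a generic SPKF measurement update (not the paper's exact implementation; with a linear observation function it reduces to the standard Kalman update):

```python
import numpy as np

def sigma_point_update(x, P, z, h, R, kappa=1.0, beta=2.0):
    """Sigma-point (unscented) measurement update: 2n+1 sigma points are
    drawn from N(x, P), pushed through the observation function h, and the
    resulting statistics Pzz, Pxz give the filter gain. kappa is a spread
    parameter; beta extra-weights the center point in the covariances
    (beta = 2 is a common choice for Gaussian priors)."""
    n = x.size
    S = np.linalg.cholesky((n + kappa) * P)       # columns are scaled sqrt(P)
    X = np.column_stack([x] + [x + S[:, i] for i in range(n)]
                            + [x - S[:, i] for i in range(n)])
    wm = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    wm[0] = kappa / (n + kappa)
    wc = wm.copy()
    wc[0] += beta                                  # covariance weighting of center point
    Z = np.column_stack([h(X[:, j]) for j in range(2 * n + 1)])
    z_hat = Z @ wm                                 # predicted measurement mean
    dX = X - x[:, None]
    dZ = Z - z_hat[:, None]
    Pzz = dZ @ np.diag(wc) @ dZ.T + R              # innovation covariance
    Pxz = dX @ np.diag(wc) @ dZ.T                  # cross-covariance
    K = np.linalg.solve(Pzz, Pxz.T).T              # filter gain
    x_new = x + K @ (z - z_hat)
    P_new = P - K @ Pzz @ K.T
    return x_new, P_new
```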

Outdoor Experiment
We evaluate the proposed method using the publicly available KITTI Vision Benchmark Suite [24], which provides several multi-sensor datasets with ground truth. The selected dataset was captured in a residential area by an experimental vehicle equipped with a GPS/IMU localization unit with RTK correction signals (OXTS RT3003) and a stereo rig with two grayscale cameras (PointGrey Flea2). The duration is about 440 s, the traveling distance is about 3600 m, and the average speed is about 29 km/h. All the sensors are rigidly mounted on top of the vehicle. The intrinsic parameters of the cameras and the transformation between the cameras and the GPS/IMU were well calibrated. Moreover, the cameras and GPS/IMU are manually synchronized, with sampling rates of 10 Hz and 100 Hz, respectively.
The announced gyroscope and accelerometer bias specifications are 36 deg/h (1σ) and 1 mg (1σ), respectively. The resolution of the stereo images is 1226 × 370 pixels, with a 90° field of view. For the position ground truth, we use the trajectory output by the GPS/IMU unit, whose open-sky localization errors are less than 5 cm.

Feature Detection, Tracking, and Outlier Rejection
For point features, the FAST corner detection algorithm [25] was used for feature extraction, and matching was carried out by normalized cross-correlation. The main advantage of the FAST detector over others is its better trade-off between accuracy and efficiency. In order to reduce the computational complexity and to guarantee a good distribution of the image features, we choose a subset of the matched point features by means of bucketing [26]: divide the image into several non-overlapping rectangles, and maintain at most a maximal number of feature points in each rectangle.
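The bucketing step can be sketched as follows (a simplified version; the cell size and per-cell cap are illustrative, not the paper's values):

```python
def bucket_features(points, bucket_w=50, bucket_h=50, max_per_bucket=4):
    """Keep at most max_per_bucket features per grid cell so that the
    retained features are well distributed over the image.
    points: iterable of (u, v) pixel coordinates, assumed pre-sorted
    by descending detector response. Returns the kept indices."""
    counts = {}
    keep = []
    for idx, (u, v) in enumerate(points):
        cell = (int(u // bucket_w), int(v // bucket_h))
        if counts.get(cell, 0) < max_per_bucket:
            counts[cell] = counts.get(cell, 0) + 1
            keep.append(idx)
    return keep
```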
We extract lines using the EDLines detector [27] in scale space, which gives accurate results in real time. We then employ the method described in [28] for line matching. The lines are described by their local appearance using the so-called Line Band Descriptor (LBD), analogous to SIFT [29] for point features, and are matched by exploiting local appearance similarities and geometric consistencies [28]. The average execution time of line matching between views is about 56 ms on an Intel Core i5 2.6 GHz processor running non-optimized C++ code. Figure 2 shows a sample image from the dataset with the extracted points and lines. As can be seen, both point and line features are abundant in the selected sequence.
In order to reject mismatched features and features located on independently moving objects (e.g., a moving car), we employ a chi-square test [30] on the measurement residuals. We compute the Mahalanobis distance:

$d_j^2 = (\mathbf{z}_j - \hat{\mathbf{z}}_j)^T\, (\mathbf{P}_{zz} + \mathbf{R}_j)^{-1}\, (\mathbf{z}_j - \hat{\mathbf{z}}_j)$

where $\mathbf{z}_j - \hat{\mathbf{z}}_j$ is the measurement residual of the $j$-th feature, and $\mathbf{P}_{zz} + \mathbf{R}_j$ is the covariance of the measurement residual. The rejection threshold is usually chosen by an empirical evaluation of the reliability of feature matching; we set the threshold to 12 in the experiment. Feature measurements whose Mahalanobis distances exceed the threshold are discarded.
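The gating test can be sketched as (function name is ours):

```python
import numpy as np

def chi_square_gate(residual, S, threshold=12.0):
    """Mahalanobis gating: accept the measurement if r^T S^-1 r < threshold,
    where S is the residual covariance (predicted measurement covariance
    plus measurement noise covariance)."""
    d2 = residual @ np.linalg.solve(S, residual)
    return d2 < threshold
```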

Experimental Results
In this Section, we compare the performance of our algorithm with the following: (1) the GPS/IMU localization unit, with open-sky localization errors less than 5 cm; (2) a VINS using point features only; (3) an inertial-only navigation solution; (4) a pure stereo visual odometry [31].
The trajectory estimation results of the different algorithms, together with the ground truth, are shown in Figure 3. The corresponding 3D position errors are depicted in Figure 4, and the overall root-mean-square errors (RMSE) are listed in Table 1. Firstly, the inertial-only navigation suffers from error accumulation and is not reliable for long-term operation. Secondly, the result of the pure stereo visual odometry is inferior, especially where the vehicle turns, and its error grows super-linearly owing to the inherent bias in stereo visual odometry. Thirdly, combining inertial navigation and stereo vision with point features alone reduces the drift rate effectively, and the additional information from the line measurements yields better performance still. Note that the jumps from 80 s to 100 s are caused by ground truth errors, which also illustrates the advantage of the proposed method in cluttered urban environments where GPS information is less reliable. We demonstrate the velocity and attitude deviations of the proposed method with the corresponding 3σ bounds in Figures 5 and 6, which verify that the velocity and attitude estimates are consistent. Note that the standard deviations of the roll and pitch angle errors are bounded, while the standard deviation of the yaw angle error grows over time. This is consistent with the observability properties of the VINS, which indicate that the yaw angle is unobservable [8]. The yaw angle error remains below 5° owing to the accuracy of the gyroscopes used in the experiment.

Indoor Experiment
To demonstrate the robustness of our algorithm in a textureless structured environment, we performed indoor experiments in a corridor with textureless walls, where very few points can be tracked in some frames. The test rig consists of a PointGrey Bumblebee2 stereo pair, an Xsens MTi unit, and a laptop for data recording (Figure 8a). The accuracy specifications and sampling rates of the sensors are listed in Table 2. The relative pose of the IMU and the camera was calibrated prior to the experiment using the method proposed in [32], and was kept unchanged during the experiment. The actual motion of the pushcart was a traversal along the corridor and back to the initial point; the full length of the path is about 82 m. In Figure 8b, we show the bird's-eye view of the estimated trajectories. It is obvious that the combination of point and line features leads to much better performance than the use of point features alone in this scenario. The reason is that in some frames the point features are few or not well distributed over the image, leading to bad orientation estimates. In Figure 8c, a plot of the number of inlier point and line features per frame is shown, which clearly demonstrates the benefit of combining both feature types under such circumstances.

Conclusions/Outlook
This paper presents a tightly-coupled vision-aided inertial navigation algorithm that exploits point and line features in a simple and unified framework. The measurement models of the point and line features are derived and incorporated into a single estimator. The outdoor experimental results show that the proposed algorithm performs well in cluttered urban environments: the overall RMSE of position and orientation is about 10.6 m and 0.83°, respectively, over a path about 4 km in length. The indoor experiment demonstrates the improved performance and robustness of combining point and line features in textureless structured environments. The proposed approach can deal with different types of environments at only a slight increase in computational cost.
As part of future work, we aim to improve the proposed approach by taking advantage of the structural regularity of man-made environments, such as Manhattan-world scenes, i.e., scenes in which lines are mutually orthogonal or parallel [33]. Unlike ordinary lines, Manhattan-world lines encode global orientation information, which can be used to eliminate accumulated orientation errors and further suppress position drift.