Reliable navigation in dynamic and unknown environments is a key requirement for many applications, particularly for autonomous ground, underwater and air vehicles. The most common sensor modality used to tackle this problem is the Inertial Measurement Unit (IMU). However, inertial navigation systems (INS) are known to drift over time due to error accumulation [1]. In recent decades, the topic of vision-aided inertial navigation has received considerable attention in the research community, thanks to several important advantages [2]. Firstly, the integrated system can operate in environments where GPS is unreliable or unavailable. Secondly, the complementary frequency responses and noise characteristics of vision and inertial sensors offset each other's limitations [10]. In particular, fast and highly dynamic motions can be precisely tracked by an IMU over short time intervals, which mitigates the scale ambiguity and large latency of vision to a certain extent. Conversely, the low-frequency drift in the inertial measurements can be significantly reduced by visual observations. Furthermore, both cameras and IMUs are low-cost, lightweight and low-power devices, which makes them ideal for many payload-constrained platforms. Corke [10] has presented a comprehensive introduction to these two sensing modalities from both a biological and an engineering perspective.
The simplest fusion scheme for a vision-aided inertial navigation system (VINS) uses separate INS and visual blocks, and fuses their outputs in a loosely-coupled approach [10]. For instance, some methods fuse the inertial navigation solution with the relative pose estimated between consecutive image measurements [11]. Tightly-coupled methods, in contrast, process the raw information of both sensors in a single estimator, so that all the correlations between them are accounted for, leading to higher accuracy [15]. The most common tightly-coupled scheme augments the filter state with the 3D feature positions and concurrently estimates the motion and the structure [2]. However, this method suffers from high computational complexity, as the dimension of the state vector grows with the number of observed features. To address this problem, Mourikis [15] proposed an EKF-based algorithm that maintains a sliding window of poses in the filter state and uses the tracked features to impose constraints on these poses. The shortcomings of this approach are twofold: (1) the space complexity is high, because all the tracked features must be stored; (2) it requires reconstructing the 3D positions of the tracked feature points, which is not necessary for navigation tasks. To overcome these shortcomings, Hu [9] developed a sliding-window odometry that uses the monocular camera geometry constraints among three images as measurements, offering a trade-off between accuracy and computational cost.
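For intuition, the two tightly-coupled formulations above differ mainly in what is appended to the IMU state (the notation here is illustrative, not taken from the cited works):
\[
\mathbf{x}_{\text{SLAM}} = \big[\, \mathbf{x}_{\text{IMU}}^{\mathsf T} \;\; {}^{G}\mathbf{p}_{f_1}^{\mathsf T} \;\cdots\; {}^{G}\mathbf{p}_{f_N}^{\mathsf T} \,\big]^{\mathsf T},
\qquad
\mathbf{x}_{\text{SW}} = \big[\, \mathbf{x}_{\text{IMU}}^{\mathsf T} \;\; \boldsymbol{\pi}_{1}^{\mathsf T} \;\cdots\; \boldsymbol{\pi}_{M}^{\mathsf T} \,\big]^{\mathsf T},
\]
where ${}^{G}\mathbf{p}_{f_j}$ is the global position of feature $j$ and $\boldsymbol{\pi}_i$ is a past camera pose. The first state grows with the number of tracked features $N$, whereas the second grows only with the bounded window size $M$; since the cost of EKF covariance maintenance scales at least quadratically with the state dimension, bounding the state size is what keeps the sliding-window formulation tractable.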
While vision-aided inertial navigation has been extensively studied, and a considerable amount of work has been dedicated to processing visual observations of point features [2], much less work has explored line features. In fact, line primitives and point primitives provide complementary information about the image [17]. There are many scenes (e.g., wall corners, stairwell edges, etc.) where point matches are unreliable while line primitives are well matched, thanks to their multi-pixel support [6]. On the other hand, points are crucial as they provide more information than lines. For instance, line correspondences between two views impose no pose constraints, whereas point correspondences between two views satisfy the well-known epipolar geometry constraints [18].
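As a reminder of this standard two-view result (general multi-view geometry, not specific to the cited works), a pair of corresponding points $\mathbf{x} \leftrightarrow \mathbf{x}'$ in normalized homogeneous image coordinates must satisfy the epipolar constraint
\[
\mathbf{x}'^{\mathsf T}\,\mathbf{E}\,\mathbf{x} = 0, \qquad \mathbf{E} = [\mathbf{t}]_{\times}\,\mathbf{R},
\]
where $\mathbf{R}$ and $\mathbf{t}$ denote the relative rotation and translation between the two views. No analogous constraint exists for a single line correspondence over two views, since any pair of image lines can be interpreted as projections of some 3D line; at least three views are needed before line observations constrain the camera poses.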
In this paper, we propose a method that combines point and line features for navigation aiding in a simple and unified framework. Our algorithm can handle any mixed combination of point and line correspondences by utilizing trifocal geometry across two stereo views. In the implementation, the inertial sensors are tightly coupled with feature tracking to improve robustness and tracking speed, while the drift of the inertial sensors is greatly reduced by the constraints imposed by the tracked features. By leveraging both the complementary characteristics of inertial and visual sensors and those of point and line features, the proposed algorithm demonstrates improved performance and robustness.
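For reference, the standard trifocal tensor incidence relations illustrate why points and lines can be treated in one unified constraint set (this is the conventional three-view formulation; the exact measurement model used by our estimator is derived in Section 2). Given the tensor slices $\{\mathbf{T}_1, \mathbf{T}_2, \mathbf{T}_3\}$ of three views, corresponding lines $\mathbf{l} \leftrightarrow \mathbf{l}' \leftrightarrow \mathbf{l}''$ and points $\mathbf{x} \leftrightarrow \mathbf{x}' \leftrightarrow \mathbf{x}''$ satisfy
\[
l_i \simeq \mathbf{l}'^{\mathsf T}\,\mathbf{T}_i\,\mathbf{l}'', \qquad
[\mathbf{x}']_{\times}\Big(\sum_{i=1}^{3} x_i\,\mathbf{T}_i\Big)[\mathbf{x}'']_{\times} = \mathbf{0}_{3\times 3},
\]
and a mixed point-line-line correspondence satisfies $\mathbf{l}'^{\mathsf T}\big(\sum_i x_i\,\mathbf{T}_i\big)\mathbf{l}'' = 0$. All of these relations are linear in the tensor entries, which is what allows point and line correspondences to be mixed within a single estimator.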
The remainder of this paper is organized as follows: we describe the mathematical model of the VINS in Section 2, and then develop our estimator in Section 3. Experimental results are given in Section 4. Finally, Section 5 contains some conclusions and suggests several directions for future work.
This paper presents a tightly-coupled vision-aided inertial navigation algorithm that exploits point and line features to aid navigation in a simple and unified framework. The measurement models of the point and line features are derived and incorporated into a single estimator. The outdoor experimental results show that the proposed algorithm performs well in cluttered urban environments: the overall RMSEs of position and orientation are about 10.6 m and 0.83°, respectively, over a path of about 4 km in length. The indoor experiment demonstrates the improved performance and robustness of combining point and line features in textureless structured environments. The proposed approach, which combines both feature types, can handle different types of environments with only a slight increase in computational cost.
As part of future work, we aim to improve the proposed approach by taking advantage of the structural regularity of man-made environments, such as Manhattan-world scenes, i.e., scenes in which lines are orthogonal or parallel to each other [33]. Unlike ordinary lines, Manhattan-world lines encode global orientation information, which can be used to eliminate the accumulated orientation errors and further suppress position drift.
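The underlying reason, stated here in the standard vanishing-point form rather than as a worked-out component of our system, is that a line whose 3D direction $\mathbf{d}$ is one of the global Manhattan axes projects to a vanishing point
\[
\mathbf{v} \simeq \mathbf{K}\;{}^{C}_{G}\mathbf{R}\;\mathbf{d}, \qquad \mathbf{d} \in \{\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3\},
\]
where $\mathbf{K}$ is the camera intrinsic matrix and ${}^{C}_{G}\mathbf{R}$ the rotation from the global to the camera frame. Each observed vanishing point therefore constrains the absolute orientation directly, independently of position, which is why such lines can bound the orientation error rather than merely slow its growth.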