1. Introduction
In recent years, with the advancement of sparse nonlinear optimization theory, camera technology, and computing performance, Visual Simultaneous Localization And Mapping [VSLAM] technology has achieved tremendous development [
1,
2]. Visual SLAM approaches have been widely used and researched because of their simple equipment and remarkable effects. Real-time and robust state estimation plays a significant role in robotics. Accurate state estimation is of crucial importance in a variety of intelligent applications, such as robot autonomy, Augmented Reality [AR] and Virtual Reality (VR). 
The vision-based state estimation method is able to estimate the 6-Degrees-Of-Freedom (6-DOF) state of sensors simultaneously and reconstruct a three-dimensional (3D) map of the surrounding environment. Many works based on nonlinear optimization have been reported, including SVO [
3], LSD-SLAM [
4], DSO [
5], ORB-SLAM [
6,
7]. The ORB-SLAM system supports monocular cameras, stereo cameras, and depth cameras. The system is also designed based on the idea of PTAM [
8], which is divided into three threads: tracking, mapping and loop closing. However, due to the existence of observation noise, the pose estimation of the feature points in space contains uncertainty, leading to localization errors in Visual Odometry. The position change between the current frame and the previous one can be calculated and continuously accumulated by Visual Odometry in order to realize the estimation of the motion process of a robot. During this process, the error of each estimation is collected, thus limiting the accuracy of Visual Odometry over time.
An Inertial Measurement Unit (IMU) is composed of an accelerometer and a gyroscope. The pose of the robot can be calculated and the three-dimensional heading information can be estimated, respectively, by the accelerometer and gyroscope in the IMU. However, the navigation errors caused by the low-frequency noise of the IMU work process will accumulate over time due to the integral operation, and the accuracy will easily drift, resulting in inaccurate pose estimation. A general and effective solution to these problems is to fuse visual and IMU measurements using a filter-based or optimization-based system. During the fusing process, IMU and Vision are combined to form a Visual-Inertial Odometry, which not only takes advantage of the flexibility of the visual method and is adaptable to a wide range of scenes, but also utilizes the high-precision features of the IMU in the short term. Therefore, research into the SLAM algorithm based on visual and inertial sensors is of great significance and application value, allowing robots to perceive the surrounding environment in order to obtain localization information.
In Visual-Inertial Odometry (VIO), the easiest way to handle visual and inertial measurements is to loosely couple the sensor fusion [
9], whereby the IMU is considered to be a separate module for assisting in visual pose estimation, and then is fused by an extended Kalman filter (EKF). In comparison, tightly coupled visual-inertial algorithms can mainly be classified into two types: EKF-based algorithms [
10,
11,
12] and optimization-based algorithms [
13,
14,
15]. For example, MSCKF [
16] is a popular EKF-based VIO method. MSCKF keeps several previous camera poses in the state vector, and arranges them in chronological order, which is also known as a sliding window. If a feature point is observed in several poses of the sliding window, the constraint will be established between these poses, and KF updates will be performed. A disadvantage of using a filter-based approach is that it can lead to suboptimal results due to the state of early linearization estimates.
Meanwhile, with the deepening of research and the improvement of computer performance, optimization-based methods have commonly been used in visual-inertial SLAM systems to ensure higher accuracy. A full smoothing method [
17] for estimating the entire state history is described by solving a substantial nonlinear optimization problem. Although the outlook is promising, its computational complexity is high, and its real-time performance will gradually decrease as the map grows. Recently, the work proposed in [
14] applied a keyframe-based approach to fuse visual-inertial measurements. The use of a sliding window and marginal technology [
14,
15] ensures the real-time operation of the system, and it has achieved remarkable success. Additionally, the IMU pre-integration technique proposed in [
18] can avoid the repeated calculation of the integral when the linearization point changes. Mur-Artal et al. [
15] suggested a VI-SLAM system based on their original ORB-SLAM, which originated from the pre-integration ideas of Forster [
18]. Qin et al. [
19] demonstrated a new Visual-Inertial System (VINS) which is also a complete VI-SLAM system based on tight coupling and nonlinear optimization. The initialization process is robust to unknown states and has excellent application potentials in the field of UAV navigation control.
Very few VIO solutions are planned for stereo or multi-camera systems [
14,
20,
21] compared to the large amount of work that has been carried out for monocular systems. This could partly be due to the costs associated with processing additional images and matching features. Leutenegger et al. [
14] proposed a complete optimization framework for multi-camera VIO that can run in real-time. Usenko et al. [
20] introduced direct methods into stereo VIO to further improve accuracy. The tested dataset in [
20] was acquired using a stereo visual inertial camera that was processed offline. IMU pre-integration was further improved in the stereo VIO in [
20]; however, this solution is not suitable for practical applications. Liu et al. [
22] proposed a stereo VIO assembled with three separate Kalman filters, including an attitude filter, an orientation filter, and a position filter. Wang et al. [
23] proposed a stereo direct sparse odometry, which maintains high accuracy while achieving real-time pose estimation. Xiong et al. [
24] presented a dual-stage EKF-based algorithm for the robust stereo VIO. Zheng et al. [
25] introduced a tightly coupled filtering-based stereo VIO system using both points and lines. Some VIO algorithms have been explored in other areas, such as initialization [
26], online calibration [
27], optimization [
28,
29], and camera type [
30]. These solutions are either based on optimization methods or filtering methods. For the optimization-based methods, they require a powerful CPU for real-time use and cannot run on low-cost devices such as embedded devices. For the filtering-based methods, the localization accuracy may not be enough. 
In this paper, we summarize our contributions as follows:
- The FAST feature detector was employed for its efficiency, and features were tracked by the KLT sparse optical flow algorithm to decrease the cost of computation. 
- Stereo matching was performed, which can remove outliers in the stereo matching and feature tracking steps leading to reduction of the mismatch of feature points and improvement of the robustness and accuracy of the system. 
- A tightly coupled nonlinear optimization method was employed to combine pre-integrated IMU measurements with visual measurements of stereo cameras, and thus a highly accurate and robust visual-inertial odometry was achieved, which can run in real-time on devices such as drones. 
- The proposed method was extensively evaluated and validated in comparison to state-of-the-art open source VIO methods (including OKVIS [ 14- ], VINS-MONO [ 19- ] and S-MSCKF [ 16- ]) by using the EuRoC dataset. 
  2. Visual-Inertial Odometry Overview
The schematic depiction of the Visual-Inertial optimization framework is shown in 
Figure 1. The input of the system is the image acquired by the stereo cameras and the acceleration and angular velocity measured by the IMU. The output is a 6-degree-of-freedom pose with real scale, i.e., the trajectory of the inertial camera (robot) motion. The system starts with Visual-Inertial Preliminaries (
Section 3), where features are extracted and tracked, and pre-integrate IMU measurements between two consecutive frames. Then the initialization preprocessing is introduced (
Section 4). The inertial pose and visual pose are combined by visual inertia joint initialization to obtain the initial estimated value of the system, including pose, velocity, gyroscope bias, gravity vector. These values will iteratively and sequentially be updated by the tightly coupled visual-inertial odometry (
Section 5). Finally, the 6-DOF pose can be obtained.
In this work, we consider  to be the camera frame,  to be the body frame, and we regard the IMU frame as the body frame. The matrix R represents rotation.  represents the rotation from the body frame to the world frame.  is translation from the body frame to the world frame.
  3. Visual-Inertial Preliminaries
This section gives the preliminary steps for stereo visual measurements and the inertial model. For visual measurements, features are tracked in real-time using the KLT optical flow algorithm. For IMU measurements, two consecutive frames are pre-integrated. 
  3.1. Visual Processing Frontend
In the Visual-Inertial odometry, the camera needs to track the pixel points of the captured image frame, and the pixel points need to be the feature points extracted from the previous and subsequent frames. However, there is only a small part of the overlap between successive image frames in many cases, and it would be inappropriate to match all the pixels, due to the massive demand for computing resources. Additionally, because the feature points are matched through the calculation process of the camera pose, it is unnecessary to extract many feature points. Therefore, in visual odometry, only the feature points from part of the image—i.e., the points with the most noticeable features in the image (corner points, points with sharp changes in brightness or at the edge of the contour)—are extracted and subsequently tracked.
In this paper, the feature points are extracted by the FAST [
31] feature detector, and the feature tracking method is the KLT [
32] optical flow algorithm. Through experiments, it is found that the KLT optical flow method takes up less CPU resources than the descriptor-based method, making the optical flow method more advantageous in robot applications. The KLT optical flow method was also used for stereo feature matching to save computing resources. Optical flow tracking is able to obtain the feature point coordinates of cam1 (the right camera of the stereo), which correspond to stereo cam0. Then, based on the principle of epipolar geometry, the following constraint can be defined as
        
        where 
 and 
 are the normalized plane coordinates of the feature points, 
 is the essential matrix, and 
 is the antisymmetric matrix corresponding to the t vector. t and R are the translation vector and rotation matrix between the stereo cameras, respectively, which can be obtained from the initial calibration.
During the measurement,  are not strictly equal to zero, due to the presence of noise. There is an error in this process, and if the error is greater than a certain threshold, it is marked as an outlier.  are obtained by LK optical flow tracking according to the original image, and the accuracy of optical flow tracking is verified by the constraint (relative poses R and t) between stereo cameras. Therefore, the effect of removing outliers can be achieved. Meanwhile, several grids were added to the image, which was assigned with feature points. The number of feature points attached to each grid was checked to ensure that it did not exceed the maximum value of grid feature points.
In the visual processing frontend, outlier rejection was performed using two-point RANSAC similar to (16) in temporal tracking. Assuming that the normalized coordinates of the corresponding points of the previous and current frames are 
, 
, the following constraints can be established based on epipolar geometry:
        where 
R is obtained from the average angular velocity of the IMU. When 
, the coordinate system is unified into one coordinate system.
        
Equation (3) can be expanded as
        
According to the RANSAC principle, two matching points of the previous frame and the current frame are arbitrarily selected, and the outliers with large errors are removed.
        
        where 
. The following three formulas are available:
The values among 
 can be calculated, whereby if 
 is the minimum value, the corresponding 
 = 1 and the other two can be solved by matrix inversion. Then, the obtained t can be brought into Equation (4), and the value calculated by the coordinates of each matching point are regarded as the error. If the error satisfies certain conditions, the matching point is set as inliers. Finally, an inlier set will be obtained, and all inliers in the set will be substituted into Equation (5) to recalculate the new model by the least squares method. The model will be used to compute the final sum of error. Over multiple iterations, the inliers set with the smallest sum of errors is selected, thereby achieving the effect of removing the outliers. Additionally, performing circular matching [
33] further removes the outlier matching step generated in the feature tracking and stereo matching between the previous and current stereo image pairs. Through the above steps, the mismatch of feature points is reduced, thereby improving the robustness and accuracy of the system.
  3.2. IMU Measurement Model and Pre-Integration
As shown in 
Figure 2, the measurement rate of IMU is much faster than that of the visual camera. To simultaneously optimize the constraints of vision and IMU in a single framework, it is necessary to integrate the measurements of many IMUs between two adjacent visual keyframes into one constraint. Therefore, the theory of the pre-integration formula based on SO3 manifolds (18) was used in this paper.
The IMU measures acceleration and angular velocity in three directions within the inertial system by a three-axis accelerometer and a three-axis gyroscope. The IMU measurement is often affected by Gaussian white noise and zero offsets. The noise model of IMU can be expressed by the following formula:
        where 
 is the angular velocity measured by the gyroscope; 
 is the acceleration measured by the accelerometer; 
 and 
 are the real angular velocity and the real acceleration; 
 and 
 represent the corresponding zero-bias and white Gaussian noise, respectively.
To calculate the motion of the robot from the measured values of the IMU, the following kinematic models need to be introduced.
        
        where 
, 
 and 
 respectively represent the derivatives of the rotation matrix 
, the velocity vector 
 and the translation vector 
 with respect to time. Equation (9) describes the change in pose and velocity of the IMU in differential form. To obtain the value of the state of the IMU at each moment, the integration to Equation (9) needs to be determined. The pose and velocity of the IMU at time interval 
 can be described as follows:
        where 
 is the time interval between two adjacent IMU measurements. According to Equations (7) and (8), 
 and 
 in Equation (10) can be rewritten as the following forms.
        
It is worth mentioning that the subscript of the reference coordinate system in Equation (11) is hidden for the legibility of the formula.
Based on Equation (11), the positional relationship of the inertial component at the adjacent 
 can be obtained. The sampling frequency of the IMU can be optimized if a state is added that needs to be evaluated every time interval in the collection of IMU data, which, however, will result in a slow calculation and arduous process of the IMU data. The IMU pre-integration method combines the IMU measurements between two adjacent visual keyframes i and j into a composite term that constitutes the motion constraints of two adjacent visual keyframes. Assuming that the IMU is synchronized with the visual frame detection time and the measurement data is acquired at discrete time 
k, the pose relationship between the two keyframes 
k = 
i and 
k = 
j can be obtained by the IMU measurement. To avoid the situation where the estimation of the initial frame of the pose estimation leads to the recalculation of the integral, this paper uses the incremental expression, which is defined as
        
It should be noted that the IMU biases are regarded as being constants in the time interval 
. However, the estimated biases are changed with a small amount of 
 during optimization. The Jacobians 
 and 
 are employed to account for a first-order approximation of the effect of changing the biases without explicitly recomputing the pre-integration. The pose and velocity can be described as
        
  4. Visual-Inertial Initialization
The primary purpose of Visual-Inertial SLAM system initialization is to obtain the parameters necessary for the system to optimize and the initial values of the state. Since the stereo inertial tightly coupled system is a highly nonlinear system, it becomes very sensitive to specific initial values. The initialization quality directly affects the robustness and accuracy of the localization of the entire tightly coupled system. Therefore, it is necessary to initialize the system properly to provide correct parameters and initial values.
In the process of initialization, the information that needs to be initialized or estimated can be divided into two categories. One is the parameters that almost remain unchanged during the system operation, such as absolute scale and gravity acceleration. Another is the initial values of the system’s starting state quantities, including the pose and velocity information of the first few frames, the position of the 3D landmark points, and the zero offset of the IMU accelerometer and gyroscope. The initialization can be separated into two processes. Firstly, the stereo visual initialization can initialize the initial frame pose and the three-dimensional landmark position information based on the sliding window. Then, the visual inertia joint initialization can initialize the absolute scale, gravity acceleration, and camera speed information.
  4.1. Stereo Vision Initialization Based on a Sliding Window
In the process of stereo visual initialization based on a sliding window, an up-to-scale purely visual structure of the sliding window is constructed to restore the information of the initial frames pose and the three-dimensional landmark point position.
Two frames are first selected that have sufficient feature disparity in the sliding window. Next, the eight-point method of the polar geometry recovery pose is used to recover the essential matrix E. The scale of the translation transformation is fixed, and E is used to retrieve the motion pose and triangulate the 3D map points. After initializing a batch of 3D points, the perspective-n-point (PnP) method is employed to solve the pose information of the remaining frames in the sliding window. A global full bundle adjustment is used to minimize the total reprojection error for all feature observations in all frames. At this point, the measurement can obtain the pose information and the three-dimensional information of the feature points. Since the external parameters 
 between the camera and the IMU are known, all variables can be converted to representations in the IMU coordinate system:
        where 
s is an unknown scale factor that aligns the visual structure to the metric scale, and feature positions 
 are represented with respect to the world coordinate system.
  4.2. Visual Inertial Joint Initialization
Through the visual inertia joint initialization, the absolute scale, gravity acceleration, camera state speed information, and IMU zero offset can be initialized.
  4.2.1. Gyroscope Bias Estimation
The IMU bias is firstly initialized under the assumption that the IMU gyroscope zero offset 
 is constant in the current window. Considering the adjacent kth and k+1th frames in the window, in the stereo visual initialization of the previous step, we can obtain the rotations 
 and 
 in their relative world coordinate system. The angular velocity among the results of IMU pre-integration can be derived from Equation (12). The gyroscope bias can be predicted by minimizing the error of the term.
          
By solving this least squares problem, we can estimate the gyroscope zero offsets .
  4.2.2. Accelerometer Bias, Gravity and Metric Scale Estimation
It is difficult to solve the accelerometer bias during the initialization process because of  and the accelerometer bias are measured simultaneously. However, the accelerometer bias  has little effect on system stability, a rough initial guess can thus be made, and the bias will be continuously optimized during subsequent optimizations. Therefore, in the initialization step, we set the accelerometer offset  to zero.
The remaining estimated parameters: the speed of the state, the gravity, and the scale factor are defined as the variables:
          where 
 is velocity in the body frame while tracking the 
kth image, 
g is the gravity vector in the world frame, and 
s is the scale of Visual-Inertial SLAM system to metric units.
Considering two consecutive frames 
 and 
, we can develop the following linear measurement model from Equations (12), (14) and (15):
The initial value can be guessed by solving the following least squares problem:
The speed of each local frame, the gravity vector (including direction and intensity), and the scale factor can be obtained.
  5. Tightly Coupled Stereo Visual-inertial Odometry
After the state initialization, a highly accurate state is estimated using a sliding window estimator. A state estimator based on the combination of stereo vision and IMU is described in this section. The state estimator can be regarded as a backend based on nonlinear graph optimization. It optimizes the coordinates of the 3D landmark points and the pose, velocity and IMU bias of the inertial camera. In this section, the state vector of the system estimation is defined firstly, then the expression of the visual inertia optimization term is given. After that, the explicit calculation formulas of the visual error term and the inertia error term are provided, and finally, the system’s marginalization method is presented.
  5.1. System State Vector
First, the state vectors of the system are defined. The state vector to be estimated are the state variable 
 of the inertial camera at the image time 
k, including position 
, pose 
, velocity 
, gyroscope bias 
, and accelerometer bias 
.
        
  5.2. IMU and Visual Error Term
In the traditional visual SLAM or visual odometers, nonlinear optimization is used to obtain the best estimate of camera pose and 3D landmark points by minimizing the reprojection error of feature points in the camera frame. Once inertial measurements are introduced, constraints are applied to the camera’s continuous motion poses, the IMU speed and the estimated IMU bias for continuous time, resulting in an increase of the system state variables. The IMU error term can be defined as:
        where 
 is the Huber robust cost function, and 
 is the information matrix of the pre-integration and 
 is the bias random walk.
Consider the feature in the 
 image, and the reprojection error of feature points 
 for a given match 
 to be defined as follows:
        where 
 is the location of the observation keypoint in the image, 
 is the map point in world coordinates, and 
 represents the translation of IMU relative to the world coordinate system at 
 frames, and 
 is the information matrix associated with the keypoint scale.
  5.3. Marginalization
As time goes by, the feature points and camera poses will accumulate, and the amount of optimization calculation will increase accordingly. Because the dimension of the visual inertia system to be optimized is more than visual system, the number of variables to be optimized is large. If the estimated variables are not limited, the calculation load will surge with time. Therefore, the non-linearization is performed while the state of the estimated system is marginalized, i.e., the time-limited keyframes are removed without changing the consistency of the estimation.
The visual-inertial optimization term is a least squares problem, which can generally be solved by the Gauss-Newton iteration method. This can be defined as:
Suppose 
 is the state variable to be marginalized, and 
 is the state variable to be retained. Depending on the conditional independence, we can simplify the process of marginalization as follows:
 in the above formula is the variable for marginalization, such as a camera pose. We cannot directly delete 
 and its related landmark points, because this would reduce the constraint information. Therefore, the following Schur complement can be used to eliminate the element:
        where 
 and 
 are marginalized 
H matrices and error quantities, and state variables 
 are marginalized. By repeating the marginalization, as the new state variable is added, the number of state variables to be optimized remains constant, which dramatically decreases the amount of computation.
  6. Experimental Results
To validate the effectiveness of the proposed reduced mismatch algorithm, the EuRoC dataset of images was used to test the visual front end. Two consecutive frames of images were randomly selected, a total of eight groups (among monocular image using the left camera image), keypoints were extracted for both the monocular image and stereo image, and the RANSAC algorithm was then used to reduce the mismatch. By calculating the comparison between the number of points removed and the original untreated points, it was found that the image matching accuracy after stereo processing was high, with the specific results shown in 
Figure 3.
An experiment was performed to validate the proposed method. A comparison with the existing state-of-art visual-inertial odometry algorithms VINS-Mono and S-MSCKF through the EuRoC MAV Visual-Inertial datasets was also made. The EuRoC datasets were retrieved by a microdrone. The UAV was equipped with sensors such as stereo cameras (Aptina MT9V034, 20FPS) and IMU (ADIS16448, 200 Hz), and the real-time ground truth states were acquired by the motion capture device. The movement of the drone had a large rotation and a significant change in light intensity, which was challenging for our vision-based odometry. The tracking of the visual front end is shown in 
Figure 4.
In these experiments, the proposed method was compared with VINS-Mono and S-MSCKF, which were tested on a desktop computer. The desktop CPU was equipped with quad-core i7-6700 3.4 Hz, the memory is 8 GB. The EuRoC datasets have eleven datasets, and four representative sequences, including MH_03_median, MH 05 difficult, V2_01_easy, and V2_02_median, were used in the dataset, to compare with VINS-Mono. 
Figure 5 shows the trajectory for the sequence MH_03_median, MH 05 difficult, V2_01_easy, and V2_02_median. Obviously, the estimated trajectory by our method is even closer to the real one.
The relationship between the translation error and the distance is shown in 
Figure 6. Due to space limitations, only the results of MH_03_median and MH_05_difficult are demonstrated. In the error graph, the proposed method produces the smallest translation error as the distance increases. Overall, the estimated error of this method is less than 5% on all sequences.
The relationship between the translation error and the distance is shown in 
Figure 6. Due to space limitations, only the results of MH_03_median and MH_05_difficult are demonstrated. In the error graph, the proposed method produces the smallest translation error as the distance increases. Overall, the estimated error of this method is less than 5% on all sequences.
In addition, we also compared our method with VINS-Mono in terms of CPU resource requirements, and the results are listed in 
Table 1. As can be seen, our method is comparable to the VINS-Mono on the CPU load, but the memory utilization statistics appears relatively better. Thus, from 
Figure 5 and 
Figure 6 and 
Table 1, it can be concluded that our method obtains comparable accuracy at the start of the trajectory, but in a more efficient manner, and as the distance increases, the method begins to have much better performance than VINS-Mono.
To evaluate the performance of this method, the tool evo (
https://github.com/MichaelGrupp/evo) is used to determine the Root Mean Square Error (RMSE) of the experimental results, and the results are shown in 
Table 2. Notably, stereo feature matching failed because of the continuous inconsistency in brightness between stereo images in V2_03_difficult. Therefore, V2_03_difficult sequence was not included here. In the initial stage, the accuracy of our method is not greatly superior to VINS-Mono. As the flight distance increases, however, the error of our method becomes significantly lower than that of VINS, ensuring comparably accurate results over long paths. In the end, our method maintains the best performance, with an average RMSE of 0.075 m, as demonstrated in 
Table 2. It should be emphasized here that our method successfully recovers the metric scale, which, however, could not be achieved just by using the monocular camera. Compared to the optimization-based monocular SLAM, our method is more accurate than VINS in most scenarios except for individual scenes. For filter-based S-MSCKF, its accuracy is much lower than our method and VINS, which proves that optimization-based methods have greater potential than filter-based methods. It is worth mentioning that VINS-Mono, slightly inferior to our proposed algorithm, has outperformed both OKVIS and MSCKF in our results as shown in 
Table 2. These findings may not agree with the published results in [
16] as VINS-Mono in their tests are running at a relatively low frequency without loop closure detection. When playing at full strength, VINS-Mono has experimentally been proved to be state-of-the-art and perform better than the widely used VIO systems, including OKVIS and MSCKF [
33]. In summary, the optimization-based method has high localization accuracy and low memory utilization, while the filter-based method has advantages in computing resources. This is a remarkable result of our optimization-based stereo method, compared to the filter-based stereo VIO (16) and the optimization-based monocular VIO (19).