Accurate Initial State Estimation in a Monocular Visual–Inertial SLAM System

The fusion of monocular visual and inertial cues has become popular in robotics, unmanned vehicles and augmented reality fields. Recent results have shown that optimization-based fusion strategies outperform filtering strategies. Robust state estimation is the core capability for optimization-based visual–inertial Simultaneous Localization and Mapping (SLAM) systems. As a result of the nonlinearity of visual–inertial systems, the performance heavily relies on the accuracy of initial values (visual scale, gravity, velocity and Inertial Measurement Unit (IMU) biases). Therefore, this paper aims to propose a more accurate initial state estimation method. On the basis of the known gravity magnitude, we propose an approach to refine the estimated gravity vector by optimizing the two-dimensional (2D) error state on its tangent space, then estimate the accelerometer bias separately, which is difficult to be distinguished under small rotation. Additionally, we propose an automatic termination criterion to determine when the initialization is successful. Once the initial state estimation converges, the initial estimated values are used to launch the nonlinear tightly coupled visual–inertial SLAM system. We have tested our approaches with the public EuRoC dataset. Experimental results show that the proposed methods can achieve good initial state estimation, the gravity refinement approach is able to efficiently speed up the convergence process of the estimated gravity vector, and the termination criterion performs well.


Introduction
In recent years, visual SLAM has reached a mature stage, and there exist a number of robust real-time systems or solutions [1][2][3]. Vision-based approaches can estimate simultaneously the six-degrees-of-freedom (6-DOF) state of sensors and reconstruct a three-dimensional (3D) map of the environment. The concept of using one camera has become popular since the emergence of MonoSLAM [4], which is based on the extended Kalman filter (EKF) framework and is able to achieve real-time localization and mapping indoors in room-sized domains. After this, there have been many scholarly works on monocular visual SLAM, including PTAM [1], SVO [5], and ORB-SLAM2 [6]. PTAM [1] is the first optimization-based solution to split tracking and mapping into separate tasks processed in two parallel threads. However, similarly to many earlier works, it can only work in small scenes and easily suffers from tracking loss. ORB-SLAM2 [6] takes advantages of PTAM and further improves it. Up to now, ORB-SLAM2 has been the most reliable and complete solution for monocular visual SLAM. Although monocular visual SLAM has made great achievements in localization and mapping, it is a partially observable problem, in which sensors do not offer the depth of landmarks. To address these problems, a common and effective solution is to fuse IMU and visual measurements using filter-or optimization-based frameworks.
Most recently, the work presented in [10] applies a keyframe-based method to fuse visual-inertial measurements. Sliding window and marginalization techniques are utilized to ensure real-time operation and achieve remarkable success. Additionally, the IMU pre-integration technique proposed in [26] is able to form relative motion constrains by integrating inertial measurements between keyframes, which avoids computing the integration repeatedly whenever a linearization point changes. However, the performance of state-of-the-art nonlinear monocular visual-inertial systems heavily relies on the accuracy of the initial estimated states. A poor initial state estimation will decrease the convergence speed or even lead to completely incorrect estimates.
Therefore the initial state estimation is very important and attracts great interest among researchers. The early paper [27] presents a deterministic closed-form solution to compute the gravity and the visual scale and provide the initial metric values for the state estimation filter. However as a result of the lack of IMU biases, the estimated scale and gravity are not accurate, which results in a poor system stability. In [24], the scale, velocity and IMU biases are estimated as additional state variables under an EKF framework. However, the estimated variables are slow to converge to stable values. The authors of [28] put forward a loosely coupled visual-inertial system that assumes that MAVs need to take off nearly horizontally at the beginning so as to complete the initialization process. The initialization method proposed in [29] requires that the initial attitude should be aligned with the gravity direction. Without prior information, the above two approaches are not suitable for on-the-fly initialization. Moreover, the gyroscope bias is ignored in the initialization procedure of [17,20], which leads to inaccurate state estimation.
A pioneering work is proposed in [30]. The authors propose a lightweight visual-inertial initialization method. However, the IMU biases and scale need to be refined in the tracking thread. In visual-inertial ORB-SLAM2 [8], the authors propose a loosely coupled visual-inertial alignment method that can recover entire visual-inertial parameters. While promising, it lacks a robust termination criterion to automatically bootstrap the following SLAM algorithm. In addition, considering that the accelerometer bias is usually coupled with gravity under small rotation, estimating the gravity and accelerometer bias separately is a better solution.
For this reason, it is promising to propose a robust and complete initialization procedure that can obtain accurate initial values, particularly the visual scale and the gravity direction. Therefore this paper is dedicated to initializing the gravity and accelerometer bias separately. Additionally, we also present an automatic termination criterion for determining when the estimated values converge.

Visual-Inertial Preliminaries
We consider a visual-inertial odometry problem [9] in which the state of a sensing system equipped with an IMU and a monocular camera need to be estimated in real-time. In this paper, we consider (·) C as the camera frame, which is an arbitrary fixed frame in a visual structure. We define the first camera frame as the world frame (·) W . The IMU frame is aligned with the body frame (·) B , thus we regard the IMU frame as the body frame, which is irrelevant to the camera frame. The matrix T CB = [R CB C P B ] represents the transformation from the body frame B to the camera frame C, R CB is the rotational matrix and C P B is the translation vector. We assume that the intrinsic parameters of the camera and extrinsic parameters between the camera and IMU are calibrated by using the methods of [31,32], respectively. In this section, we introduce some preliminary knowledge about visual measurements, the inertial sensor model, and IMU pre-integration. Figure 1 shows the situation of a camera-IMU setup with its corresponding coordinate frames. Multiple camera-IMU units represent the consecutive states at continuous time, which is convenient for understanding the equations illustrated in Section 4.2.

Visual Measurements
Visual-inertial odometry includes measurements from the camera and the IMU. Our visual measurement algorithm is based on visual ORB-SLAM2 [6], which includes three threads for tracking, local mapping and loop closing. For the process of the initial state estimation, we use the tracking thread and the local mapping thread. For each frame, the tracking thread is performed and decides whether the new frame can be considered as a keyframe. Once a new keyframe is generated, the corresponding IMU pre-integration can be computed iteratively by integrating all IMU measurements between two consecutive keyframes. At every frame, the camera can observe multiple landmarks. With the conventional pinhole-camera model [33], a 3D landmark X c ∈ R 3 in the camera frame is mapped to the image coordinate x ∈ R 2 through a camera projection function π : R 3 → R 2 : where f u f v T is the focal length and c u c v T is the principal point. Hence, by minimizing the re-projection error, we are able to recover the relative rotation and translation up to an unknown scale within multiple keyframes poses.

Inertial Measurements and Kinematics Model
An IMU generally integrates a 3-axis gyroscope sensor and a 3-axis accelerometer sensor, and correspondingly, the measurements provide us the angular velocity and the acceleration of the inertial sensor at a high frame rate with respect to the body frame B. The IMU measurement model contains two kinds of noise. One is white noise n(t); another is random walk noise that is a slowly varying sensor bias b(t). Thus we have where Bw (t) and Bã (t) are the measured angular velocity and acceleration values expressed in the body frame; the real angular velocity B w WB (t) and the real acceleration W a(t) are what we need to estimate. R WB is the rotational part of the transformation matrix [R WB W P B ], which maps a point from a body frame B to the world frame W. Generally, the dynamics of nonstatic bias b g , b a are modeled as a random process, which can be described aṡ Here n b g and n b a are the zero-mean Gaussian white noise. We utilize the following IMU kinematics model commonly used in [34] to deduce the evolution of the pose and velocity of the body frame: where WṘWB , Wν and Wṗ respectively represent the derivatives of the rotation matrix R WB , the velocity vector W ν and the translation vector W p with respect to time. When we assume that W a and B ω are constants in the time interval [t, t + ∆t], the pose and velocity of the IMU at time [t, t + ∆t] can be described as follows: Equations (6)- (8) can be further represented by using IMU measurements:

IMU Pre-Integration
From Equations (9)-(11), we can see that the IMU state propagation requires the rotation, position and velocity of the body frame. With the starting states changing, we need to re-propagate the IMU measurements, which is time consuming. To avoid this problem, we use the IMU pre-integration technique that is first proposed in [35] and is further extended to the manifold structure in [26]. Here we give a rough overview of its theory and usage within monocular visual-inertial SLAM systems. We assume that the IMU is synchronized with the camera and provides measurements at discrete times k. The relative motion increments between two consecutive keyframes at times k = i and k = j are defined as In the above equations, the IMU biases are considered to be constants in the time interval ∆t. However, more likely, the estimated biases change by a small amount δb during optimization. Therefore, the Jacobians J g (·) and J a (·) are applied to indicate how the measurements ∆(·) change with a change δb in the bias estimation; then the pose and velocity can be further expressed as Here the IMU pre-integration is computed iteratively when IMU measurements arrive, and the Jacobians can be precomputed during the pre-integration with the method mentioned in [26].

Visual-Inertial Initial State Estimation
In this section, we detail the complete process of our initial state estimation algorithm, which sequentially estimates the gyroscope bias, gravity vector (including gravity refinement), accelerometer bias, metric scale and velocity. An overview of our method is given in Figure 2. Our algorithm first only uses visual measurements as in the ORB-SLAM2 [6] for a few keyframes. The corresponding IMU pre-integration between these keyframes are computed at the same time. These two steps have been detailed in Section 3. When a new keyframe is created, we run our loosely coupled visual-inertial initial state estimation algorithm to iteratively update the gyroscope bias, gravity vector, accelerometer bias, metric scale and velocity sequentially. This procedure continues until the termination criterion is achieved. In our loosely coupled visual-inertial initial state estimation module, we first recover the gyroscope bias and then roughly estimate the gravity vector and scale without considering the accelerometer bias. Because the gravity norm is usually known (∼9.8 m/s 2 ), we refine the estimated gravity vector by optimizing the 2D error state on its tangent space. After the gravity refinement, we regard it as a fixed vector. Then we begin to accurately estimate the scale and accelerometer bias. Finally we compute the velocities of all keyframes. This is the same as the IMU initialization process of [8] in the first two steps. The main differences are reflected in the remaining steps. In our method, we are dedicated to estimating the gravity and accelerometer bias separately, these are normally difficult to distinguish from each other under the small rotation condition. Furthermore, we constrain the magnitude to refine the estimated gravity vector. In addition, because the condition number can indicate whether a problem is well conditioned, we regard it as one of the termination indicators. Once the termination criterion is achieved, the initialization process will be automatically terminated. The estimated initial state values can be used to launch the nonlinear tightly coupled visual-inertial SLAM system. To sum up, our initial state estimation procedure is partly based on the IMU initialization of [8], but we further improve the method and provide a more accurate and complete initialization procedure.

Gyroscope Bias Estimation
Considering two consecutive keyframes i and i + 1 in the keyframe database, we have their orientations R i WC and R i+1 WC from visual ORB-SLAM2, as well as their integration ∆R i,i+1 from the IMU pre-integration. We estimate the gyroscope bias b g by minimizing the residual errors between the relative rotation from the vision and gyroscope integration. The detailed derivation of Equation (18) can be found in [26].
In Equation (18), N is the number of keyframes and J g ∆R denotes the first-order approximation of the impact of the changing gyroscope bias. R WC R CB , which can be computed by transforming the pose of the IMU to the world coordinate system. By solving Equation (18) by the Gauss-Newton method, we can obtain the estimated gyroscope bias b g . Because the initial gyroscope bias is set to zero at the beginning, we now update the pre-integration ∆R ij , ∆ν ij and ∆p ij with respect to the estimated b g .

Coarse Scale and Gravity Estimation
With small rotation, the accelerometer bias is difficult to be distinguished from gravity. Therefore the second step of our initialization process is to coarsely estimate the preliminary scale s and gravity g 0 without regard to the accelerometer bias b a . We define the variables that we want to estimate as Because of the scale ambiguity existing in monocular visual SLAM systems, an additional visual scale s is necessary when transforming the position in the camera frame C to the body frame B, which is expressed as We substitute Equation (20) into Equation (17), which represents the relative position relation between two consecutive keyframes i and i + 1. Without considering the effect of the accelerometer bias, we can obtain If stacking all equations between every two consecutive keyframes using Equation (21), there will be N − 1 velocities that need to be solved. This would lead to a high computational complexity. Therefore in this section we do not solve the velocities of N keyframes. On the contrary, we consider Equation (21) between three consecutive keyframes ( Figure 1 shows an example) and exploit the velocity Equation (13): In the above formula, W p (·) c and R WC are obtained from ORB-SLAM2, ∆p (·) and ∆ν (·) are from the IMU pre-integration, and ∆t i,i+1 is the time interval between two consecutive keyframes. Stacking every three consecutive keyframes using Equation (22), we can form the following least-square problem. Solving this, we can obtain the coarsely estimated gravity vectorĝ 0 and scale s.

Gravity Refinement
Because the gravity norm is known in most cases, the gravity vector only has 2 degrees of freedom. On the basis of this, the estimated gravityĝ 0 obtained from Section 4.2 can be further refined. If the additional gravity norm constraint is straightway added into the optimization problem in Equation (23), it will become a nonlinear system that is hard to solve. Therefore, we enforce the gravity magnitude by optimizing the 2D error state on its tangent space, similarly to [30].
As shown in Figure 3, the estimated gravity can be re-parameterized aŝ where g mag is the known gravity magnitude,ĝ 0 is the direction of the current estimated gravityĝ 0 , and b 1 and b 2 are two orthogonal bases on the tangent plane; w 1 and w 2 are the corresponding 2D components that need to be estimated. It is easy to find one set of b 1 and b 2 using the Gram-Schmidt process. Then we replace gravityĝ 0 in Equation (22) with Equation (24). In this way, we can form a least-square problem similar to Equation (23) and solve it via Singular Value Decomposition (SVD). Then we iterate these steps several times until the estimatedĝ 0 converges.

Accelerometer Bias and Scale Estimation
After refining the gravity vector, we regard it as a fixed vector g W in the world frame. In Section 4.2, we do not consider the accelerometer bias. The estimated scale s may be coarse, and thus we estimate the accelerometer bias b a and scale s together in this step using Equation (17). The variables that we would like to estimate are defined as Now adding the accelerometer bias into Equation (21), it becomes The Jacobian J a (·) denotes a first-order approximation of the impact of the changing accelerometer bias. Similarly to the method described in Section 4.2, we can obtain the estimated accelerometer bias b a and scale s.

Velocity Estimation
So far, we have estimated all variables except the velocity. In other words, the velocity is the only unknown in Equation (26). Therefore we can compute the velocities of the first N − 1 keyframes using Equation (26), then compute the velocity of the last keyframe using Equation (13).

Termination Criterion
In our method the visual-inertial initialization process is automatically terminated when all estimated states are convergent. Because the norm of the nominal gravity is a constant 9.806 m/s 2 , we regard it as one of the convergence indicators. Another we use here is the condition number, which can indicate whether the problem is well conditioned. Once the visual-inertial initialization is successful, all 3D points in the map and the position of keyframes are updated according to the estimated scale. Because the IMU parameters have been estimated, we can integrate all IMU measurements to predict the next camera pose.

Experimental Results
In order to evaluate the performance of our initial state estimation approach, the public EuRoC dataset [36] was used. The EuRoC dataset consists of 11 sequences of 2 scenes in the Vicon room and industrial machine hall, and it provides synchronized global shutter stereo images at 20 Hz with IMU measurements at 200 Hz and trajectory ground truth. We only used one camera image set and IMU measurements to conduct the experiments in a virtual machine with 2 GB of RAM.
Because the EuRoC dataset does not provide an explicit ground truth scale, we need to calculate the true scale according to the ground truth data and the trajectory generated from visual ORB-SLAM2. Once the initialization of ORB-SLAM2 system completes, it produces an initial translation between the first two keyframes. After this, we can calculate the true translation on the basis of their corresponding ground truth states. Then the true scale (benchmark scale) will be the ratio of the true translation to the initial translation.

The Performance of Visual-Inertial Initial State Estimation
Here, we use the sequences of two scenes for evaluation. The variables of gyroscope bias, gravity vector, visual scale and accelerometer bias are sequentially estimated. Figures 4 and 5 show the convergence process of all the estimated variables on sequences V1_02_medium, V2_02_medium, MH_03_medium and MH_04_difficult. We can see that all variables converged to stable values after 8 s. Even on sequence V1_02_medium, all variables converged quickly after 5 s. In particular, the estimated visual scale was quite close to the benchmark scale. From Figures 4b,c and 5b,c, it can be seen that the gyroscope bias converged quickly and the accelerometer bias converged to almost 0. Figures 4d  and 5d demonstrate the convergence process of the estimated gravity vector, whose three components seriously deviated from stable values within 6 s, while Figures 4e and 5e show that the components of the refined gravity vector quickly converged to final steady-state values only after 2 s. Thus it can be indicated that our gravity refinement approach can efficiently speed up the convergence process of the estimated gravity vector.

The Accuracy of Scale Estimation
In this section, we evaluate the accuracy of the estimated scale using our method in two scenes of the EuRoC dataset including sequences V1_01_easy, V2_01_easy, V1_02_medium, V2_02_medium, MH_01_easy, MH_02_easy, MH_03_medium and MH_04_difficult. In order to effectively verify the accuracy and reliability of our approach, the visual measurements started without any prior information. Table 1 indicates the testing results on eight sequences. Compared with the state-of-the-art visual-inertial ORB-SLAM2 [8], the estimated scale using our method outperformed it on seven test sequences. The scale estimation error of our method was less than 5% on five sequences, and some of them were quite close to the benchmark scale. The scale error was under 7% on the sequences V2_02_medium, MH_02_easy and MH_04_difficult with a bright scene; in particular, the scales estimated by our approach achieved a higher precision than those from [8] in the mass. On sequence V2_01_easy, the results of [8] were better than ours, but fortunately, our approach also achieved a high accuracy, with an error below 5%. Table 1. The results of scale estimation, compared with the scale from visual-inertial ORB-SLAM2 (VI ORB-SLAM2) [8] and benchmark scale after VI ORB-SLAM2 system runs for 15 s. The fifth column shows the percentage of error between the estimated scale using our method and the benchmark scale. The numbers in bold represent the estimated scale is more close to the benchmark scale.

The Effect of Termination Criterion
The visual-inertial initialization process continues until both the termination criteria are achieved. For the sequences V2_02_medium and MH_04_difficult, Figures 6 and 7 show that the condition number dropped to a small and stable value after 8 and 6 s, respectively, which means that we obtain a well-conditioned problem. Meanwhile, the norm of the estimated gravity (blue) converged to almost the nominal gravity (green). On the right side of Figures 4 and 5, we can see that all estimated variables were convergent after 8 and 6 s. This proves that the termination criteria are valid.

The Tracking Accuracy of Keyframes
Once we have estimated a stable and accurate scale, the initialization procedure terminates. All 3D points in the map and the positions of keyframes are updated according to the estimated scale. The estimated IMU parameters can be used to launch the nonlinear tightly coupled visual-inertial SLAM system. Figure 8 shows the processed images of the Vicon room and the industrial machine hall. The final reconstructed sparse map corresponding to the above two scenes is presented in Figure 9.
Because the evo (https://michaelgrupp.github.io/evo/) package provides a small library for handling and evaluating the trajectory of odometry and SLAM algorithms, we made use of this open-source tool to evaluate the trajectory accuracy of visual-inertial SLAM initialized with our algorithm. Figure 10 illustrates the trajectory of keyframes computed by combining our initialization method with the visual-inertial ORB-SLAM2 back-end, which is close to the ground truth trajectory provided by the EuRoC dataset. The colormap reveals the relationship between the colors and the absolute pose error (APE). As shown in Figure 10a, the corresponding pose error for sequence V1_01_easy varied from the minimum, 0.0062 m, to the maximum, 0.1715 m. The values of the mean error (ME), root mean square error (RMSE) and sum of squares error (SSE) were 0.0913 m, 0.0972 m and 1.4839 m 2 respectively. Figure 10b also shows the APE tested on sequence MH_01_easy; the corresponding ME, RMSE and SSE were 0.094816 m, 0.102795 m, and 2.018254 m 2 . Thus it can be concluded that visual-inertial SLAM initialized with our initial state estimation algorithm is able to recover the metric scale and does not suffer from scale drift. We compared the tracking performance of our method with those of state-of-the-art methods [8] and [6] on EuRoC dataset. The above systems could process all sequences, except V1_03_difficult and V2_03_difficult, for which the movement was so intense that the system could not survive. On each sequence, we tested five times and used the evo package to calculate the relative pose error (RPE) by aligning the estimated trajectory with the ground truth; we show the average results of the translation ME, RMSE and SSE in Table 2. From Table 2, we can see that the results of our approach were worse than those of [6] for six sequences. This was because the tightly coupled nonlinear optimization for visual-inertial fusion is more complex and costs more time, as there are nine additional states (IMU biases and velocity) for each keyframe. In order to achieve real-time performance, the local window size for local bundle adjustment of visual-inertial ORB-SLAM2 initialized with our method has to be smaller than that of [6], which would result in a decrease of the optimized states of keyframes and map points and further cause reduced accuracy of the trajectory and map. However, comparing the results from [8] with ours, we can clearly see that our initial state estimation approach could improve the tracking accuracy for six sequences, which were V1_01_easy, V2_02_medium, MH_01_easy, MH_02_easy, MH_04_difficult, and MH_05_difficult. Table 2. The accuracy of keyframe trajectories generated by visual-inertial ORB-SLAM2 (VI ORB-SLAM2) [8], ORB-SLAM2 [6] and VI ORB-SLAM2 system initialized with our initialization approach on EuRoC dataset with 11 sequences. The corresponding values of the mean error (ME), root mean square error (RMSE) and sum of squares error (SSE) are listed as follows.

Conclusions
In this paper, we propose a more accurate algorithm for initial state estimation in a monocular visual-inertial SLAM system. The main contributions of our initialization method are given in the following ways. Firstly, considering that the gravity magnitude is known, we propose a method to refine the estimated gravity by optimizing the 2D error state on its tangent space. Then we estimate the accelerometer bias with the refined gravity fixed. Secondly, we propose an automatic way to determine when to terminate the process of visual-inertial initialization. On the whole, we present a complete and robust initialization method and provide accurate initial values (scale, gravity vector, velocity and IMU biases) to bootstrap the nonlinear visual-inertial SLAM framework. We verify the effectiveness of the algorithm on all sequences in two scenes of the public EuRoC dataset. Experimental results show that the proposed methods can achieve accurate initial state estimation, the gravity refinement approach can efficiently speed up the convergence process of the estimated gravity vector, and the termination criterion performs well.