Monocular Visual-Inertial Odometry with an Unbiased Linear System Model and Robust Feature Tracking Front-End

The research field of visual-inertial odometry has entered a mature stage in recent years. However, unneglectable problems still exist. Tradeoffs have to be made between high accuracy and low computation for users. In addition, notation confusion exists in quaternion descriptions of rotation; although not fatal, this may results in unnecessary difficulties in understanding for researchers. In this paper, we develop a visual-inertial odometry which gives consideration to both precision and computation. The proposed algorithm is a filter-based solution that utilizes the framework of the noted multi-state constraint Kalman filter. To dispel notation confusion, we deduced the error state transition equation from scratch, using the more cognitive Hamilton notation of quaternion. We further come up with a fully linear closed-form formulation that is readily implemented. As the filter-based back-end is vulnerable to feature matching outliers, a descriptor-assisted optical flow tracking front-end was developed to cope with the issue. This modification only requires negligible additional computation. In addition, an initialization procedure is implemented, which automatically selects static data to initialize the filter state. Evaluations of proposed methods were done on a public, real-world dataset, and comparisons were made with state-of-the-art solutions. The experimental results show that the proposed solution is comparable in precision and demonstrates higher computation efficiency compared to the state-of-the-art.


Introduction
The fusion of monocular cameras and inertial measurement units (IMUs) is very popular recently, thanks to great improvements in the computation capacity of computers with low energy costs and low weight, and the increasing demand for accurate motion tracking or positioning in unmanned aerial vehicles (UAVs), augmented reality, and driverless cars. This fusion problem has been studied by brilliant scientists for years [1][2][3][4], and as a result, one can now build his own visual-inertial odometry (VIO) module simply with cheap sensors and open-source software [2,[5][6][7][8][9].
Fusion frameworks are divided into two main branches: filter-based and optimizationbased [10,11]. Optimization-based methods are so far widely recognized performing better in terms of precision [12] due to their iterating mechanism, which is essentially solving a noted bundle adjustment (BA) problem [13]. The BA problem was considered to be computational costly in earlier years, until the literature recognized and revealed its sparse structure [13,14] so as to develop real-time algorithms. Kümmerle et al. [7] modeled BA as a graph optimization problem. Kaess et al. [8] introduced a factor graph model to further illustrate the Bayesian nature of BA. Kaess et al. [8] also found that the incremental fact of sophiscated Hessian matrix in normal equation can be utilized for solving BA, thus speeding up the calculation further. With these profound insights, researchers also made efforts to overcome the inconsistency in fixed-lag fusion algorithms, which has the advantages of having bounded computation with less information loss and maintaining the sparse structure [15]. Several open-source libraries are available for building the back-end for algorithms of this branch, based on different mathematical descriptions listed above and providing convenient application program interfaces (APIs) [7][8][9]16]. Although enabling reduced computation by leveraging sparse matrix factorization, optimization-based VIO systems still need to be tailored sometimes in order to be deployed on a computation-limited platform. Sometimes, this leads to a downgraded performance [17].
Before the flowering of optimization-based methods, the solving of fusion problems was dominated by filtering [18]. The ordinary procedure is to include IMU pose and map point positions in the filter state and recursively propagate and update as IMU and camera measurements respectively arrive [10]. The accurate estimation of map point positions is the key to bring about an unbiased IMU pose updating. In traditional filter-based solutions, the filter state would invariably have a very large dimension since it always preserves a lot of map points, resulting in enhanced computation requirements [1]. The use of a multi-state constraint Kalman filter (MSCKF) was proposed as an effective and optimal filter-based solution that does not maintain map points in filter state [19]. By properly handling the camera measurement, MSCKF can achieve as competitive a performance as optimization-based algorithms and demands far less computation [17].
By correcting observability properties [19][20][21] and incorporating camera-IMU extrinsic parameters into the filter state [22], the performance of MSCKF was further improved. Many follow-up works emerged, including an open-source monocular implementation [23], expansion to stereo camera rig [24], and schemes using direct visual front-ends [25] or adding line features [26].
It should be emphasized that all members of the MSCKF family so far have been developed based on Shuster's notation of quaternion [27], whereas most of the community utilizes the traditional Hamilton notation, which results in unnecessary trouble in understanding for researchers [28].
Visual front-ends apparently play an important role in VIOs. There are typically of two categories. Feature-based methods use descriptors to match features between consecutive images [6], while direct methods seek a minimization of photometric residuals to accomplish data correlation [5,25]. Sparse optical flow tracking is an efficient direct method that is widely used [2,23,24]. It provides sub-pixel accuracy but contains more outliers than feature descriptor matching [29]. An optimization-based back-end would eliminate outliers during iteration [30]. Filter-based back-ends are meanwhile vulnerable to the outliers if only one-off updating is applied [23]. Using an iterated update scheme would mitigate this situation while introducing additive computation [31].
To recap, in order to make VIO algorithms more practical, it is desirable to develop algorithms with lower computation while maintaining high precision.
In this paper, we developed a filter-based monocular visual-inertial odometry which can be regarded as a member of MSCKF family, giving consideration to both high precision and computation efficiency. The main contributions of this paper are as follows: • We deduced a closed-form IMU error state transition equation based on the more cognitive Hamilton notation of quaternion. By solving integration terms analytically, a novel fully linear formulation was further obtained, which is also closed-form, and furthermore, is readily implemented.
• By analyzing the statistical properties of ORB descriptor [32] distances of matched and unmatched feature points, we introduced a novel descriptor-assisted sparse optical flow tracking technique, which enhances the feature tracking robustness and barely adds any computation complexity.

•
More improvements are made to improve the usability and performance of the filter. An initialization procedure is developed that automatically detects stationary scenes by analyzing tracked features and initializes the filter state based on static IMU data. The feature triangulation mechanism is carefully refined to provide efficient measurement updates. • A filter-based monocular VIO using the proposed state transition equation, visual front-end, and initialization procedure under Sun et al.'s [24] framework is implemented. The performances of our VIO and MSCKF-MONO [23], an open-source monocular implementation of MSCKF, are compared with parameters setup as similarly as possible. Ours is also compared with other state-of-the-art open-source VIOs including ROVIO [5], OKVIS [6], and VINS-MONO [2]. In addition, we analyze the process time of our algorithm. All of the evaluations above are done on EuRoC datasets [33]. Detailed evaluations are reported.
The rest of this paper is organized as follows. The problem of quaternion notation confusion is illustrated in Section 2. Section 3 deduces the error state differential equation based on Hamilton's notation. Section 4 gives a closed-form error state transition formulation and then solves the integration terms in it, obtaining a fully linear closed-form formulation. Section 5 presents the descriptor-assisted sparse optical flow tracking front-end. Other implementation details and improvements are presented in Section 6, including the overall filter model, automatic initialization procedure, and refined feature triangulation mechanism. Section 7 presents the experimental results in detail. Finally, conclusions are made in Section 8.

Quaternion Notation Confusion
Quaternion is one of the widely used representations of rotation in numerical calculations [34]. In the related literature, there are mainly two different notations: Hamilton's notation and Shuster's notation [35]. The difference between them lies in their flipped rule for the multiplication of imaginary parts i, j, and k. Hamilton utilizes ij = k, while Shuster advocates for ij = −k to maintain the order of chain rule when transferring to direction cosine matrices (DCMs). Sommer et al. [28] surveyed this notation confusion problem in detail and argue for entirely abandoning Shuster's notation. In this section, we present the original problem that Shuster's notation is designed to solve and a solution for maintaining chain rule order while still using Hamilton's notation.
A quaternion of rotation q is basically a unit quaternion; it can be defined as where u A is the unit vector of rotation axis in frame A, and θ is the angle of rotation. In the rest of this article, the term "quaternion" will be used to refer to a quaternion of rotation, for the sake of simplicity. Equation (1) shows how to construct a quaternion q from an axis-angle θu A , which describes the anticlockwise rotation of an angle θ about the axis u. If the original frame A is rotated to a new frame B after this rotation, as illustrated in Figure 1, then we can use a quaternion q B A or a DCM R B A to describe this rotation.
Note that R B A can be used to compute the coordinate of a vector v in frame B given its coordinate where C S (•) is an operator mapping q B A to R B A . Let q C B be the quaternion describing the rotation from frame B to frame C and q C A the rotation from frame A to frame C. Then, according to Equation (2), we have Coordinate transformation of vectors can also be done by applying the triple product of quaternions: Here we abuse the notation of v A , v B , and v C to describe quaternions with zero real part such Combining the two equations above yields: Referring to Equation (5), there is At the same time, by applying the chain rule in DCM production, it follows that Now we can conclude that C S q B A ⊗ q C B = C S q C B C S q B A , which means the mapping C S (•) is not a homomorphism. One would prefer a homomorphic mapping between DCM and quaternion to maintain the chain-rule order, which is convenient to manipulate. Shuster utilized a flipped multiplication rule to avoid this problem. This notation was adopted by the Jet Propulsion Laboratory (JPL) and thus introduced to spacecraft literatures, while other research fields were still using the traditional Hamilton notation. But as researchers have exchanged ideas between different research fields, Shuster's notation has been utilized in robotics for rotation representation [28]. So far, all of the theories about MSCKF are deduced based on this notation [1].
As Sommer et al. [28] claimed, a homomorphic mapping could be obtained even under Hamilton's notation. Let C H (•) be an operator that satisfies C H (q) = C S (q) T . Thus, we have According to Equations (6) and (7), we now have Given a quaternion the operator C H (•) is defined as a function mapping quaternion q B A to a DCM R A B as This is, in fact, the classical Rodrigues Rotation Formula. There is a more thorough discussion about this mapping in [36].

IMU Error State Differential Equation
In this section, we deduce the IMU error state differential equation based on Hamilton's quaternion notation. The Earth's rotation is ignored as low cost gyros cannot measure it. The static world assumption is employed, which means that gravity has a fixed direction. This is acceptable when a VIO is working in a limited region.

Notation
The east-north-up geographic coordinate system at initial position is selected as the reference world frame w. As the Earth's rotation is omitted, w can be regarded as an inertial frame. Quaternion q b w is used to represent the rotation from frame w to body frame b. According to Equation (8), we obtain The error quaternion is defined as whereq b w is the estimated quaternion of q b w . According to Equation (10), applying the C H (•) mapping to Equation (12) leads to where w is the estimated world frame, and δq b w corresponds to the rotation between w and w . δq b w can be expressed in axis-angle formulation as where is an axis-angle in frame w that rotates frame w to frame w.
As |δθ θ θ w | is a small angle, an approximate expression of Equation (14) is Based on Equations (10) and (14), an approximate expression of R w w is formulated as where the operator [•×] is used to denote the skew matrix. For a given three-

IMU Measurement Model
An IMU includes a 3-axis gyroscope and a 3-axis accelerometer, whose axes are aligned with the body frame. The output of the gyroscope is modeled as where ω ω ω b wb is the true angular velocity, b g denotes the gyroscope bias under the body frame, and n g is the Gaussian white noise.
The accelerometer measures the specific force along a body-fixed axis, which includes an opposite gravity and is affected by bias and noise as well: where a b is the true acceleration, and g w = 0 0 −g T denotes the gravity under the world frame. b a and n a denote the bias and the Gaussian white noise under the body frame, respectively.
The biases b g and b a are modeled as random walk processeṡ where n wg and n wa are Gaussian white noises.

IMU Error State Definition
The IMU state includes the quaternion q b w , velocity v w b and position p w b of the body frame origin in the world frame, and IMU biases b g and b a . The IMU state can be defined as The filter is designed based on the error state because it is convenient to process by extended Kalman filter (EKF). Three dimensional angular error δθ θ θ w rather than four dimensional quaternion error δq b w is utilized since it is accordance with the degree of freedom (DOF) of rotation, and thus a minimum parameterization.
Other error state components are simply defined as the Euclidean distances between true states and the estimated states, which lead to The overall IMU error state can now be concluded as

Differential Equation
The matrix form of the differential equation of the overall IMU error state is as follows.
where n I MU denotes the IMU noise, given by and the matrices F and G are as follows:

Fully Linear State Transition Equation Formulation
A state transition equation is needed for the extended Kalman filter (EKF) to propagate the state and covariance. One commonly used method is to make a first-order approximation based on a continuous differential equation [37]. Li and Mourikis [19] proposed a closed-form error state transition equation that effectuated a system model with no information loss. However, there are still some tricky integration terms left behind. In this section, we first present the closed-form IMU error state transition equation based on the results of Section 3. Then, we solve the integration terms by two-sample fitting of the rotation matrix, resulting in a closed-form formulation that is fully linear.

Original Closed-Form Equation
Following the methodology of Li and Mourikis [19], the closed-form transition equation was deduced and presented in what follows. Noting that k and k + 1 are consecutive discrete sampling instants of IMU, and ∆t is the sampling period, where Φ x 1 x 2 is used to represent the transition matrix of the error state of x 1 with respect to the error state of x 2 , and n * terms represent noise. All of the Φ * and n * terms are listed as follows:

Fully Linear Closed-Form Formulation
Notice that in Equations (38), (41), (42), (46), and (47), although they are closed-form expressions, there are still some tricky integration terms that are not straightforward for implementation. One can solve these terms with numerical integration, but here we present a fully linear analytical expression that is readily implemented. The key is to solve We first apply a two-sample fitting method to approximate the axis-angle representingR b k b τ , then Rodrigues' rotation formula is applied to express the DCM as a linear function of τ, thus making the integration easy to solve.

Two-Sample Fitting of Axis-Angle
Any DCM can be regarded as a single rotation about a fixed axis, and thus can be represented by an axis-angle. Let the axis-angle ofR where α is the angle of rotation and u b k is the rotation axis. Let τ be a time instant between t k and t k+1 and ε = τ − t k , then a linear model can be used to represent the axis-angle ofR Angular velocity measurements of t k and t k+1 are available when calculating the transition matrix Φ Φ Φ (k + 1, k), so a two-sample fitting method can be used to approximate the axis-angle φ φ φ. We start from the differential equation of φ φ φ (ε) [36]: whereω ω ω is the average angular velocity between t k and t k+1 . As two gyro measurements are available, we use a straight line model to fitω ω ω as Consideringω According to Equation (55), φ φ φ is equal to φ φ φ t k+1 , then using Taylor expansion to expand φ φ φ at linearized point t k yields Now define a new function of ε as It can be pointed out that φ φ φ (ε) ≈ ∆θ θ θ (ε). Derivatives of ∆θ θ θ (0) are defined as The third term in Equation (56) is a high-order small quantity that can be omitted. By substituting Now the high-order derivatives of φ φ φ (ε) can be obtained: Let ε = 0, and considering Equation (61), we have Substituting the equations above into Equation (59) yields This is how the axis-angle between two consecutive sampling time instants t k and t k+1 can be computed.
According to Rodrigues' rotation formula, As α is a small angular, since ∆t is small, Equation (66) has an approximation Now, substituting Equation (55) into Equation (67) leads tô Finally, the general procedure to solve the integration term t k+1 t kR b k b τ dτ can be summarized as follows: 1.
Compute the axis-angle between t k and t k+1 according to Equation (65).

ExpressR
Easily solve the Notice that all of the variables needed are available at the time of calculating the Φ Φ Φ terms above. This model is unbiased up to the information loss of the two-sample fitting of DCM, which is small due to the utilization of all related measurement data.

Process Noise Terms
The property of noise terms in Equations (50)-(54) should be acquired to compute the process noise covariance matrix in a Kalman filter. The process noise covariance at t k can be computed as [37]: We temporarily abuse symbol q here to represent the noise intensity matrix. Φ Φ Φ is the overal; IMU error state transition matrix. As ∆t is a small quantity, an approximate expression of Equation (70) is formulated as The discrete form, which will be preferable for a discrete filter implementation, is

Summarization
According to the derivation above, the proposed fully linear closed-form IMU error state transition equation is as follows: where and the covariance matrix of n I MU (k) is E n I MU (k) n T I MU (k) = Q (k). The integration terms are solved using a fitting rule of DCMs by utilizing all of the related measurements, so we claim that the obtained formulation is an unbiased model up to the numerical integration resolution.

ORB Descriptor-Assisted Optical Flow Front-End
In this section, we propose a sparse visual front-end using descriptor-assisted optical flow feature tracking.
Different kinds of feature descriptors are used in several VIOs to accomplish feature extraction and matching [1, 6,30]. In contrast, other solutions choose optical flow feature tracking as their front-end solution since it is not that time-consuming compared to the descriptor-based methods [2,23,24]. However, there are more wrong matches in optical flow tracking than in descriptor-based methods, and these wrong matches exist even after eliminating algorithms such as random sample consensus (RANSAC). Filter-based VIOs are very sensitive to feature outliers since they don't eliminate outliers in their iterations as the optimization-based ones do. Wrong matches left behind will participate in measurement updates, which may result in deteriorating estimates or even failure. As a conclusion, a robust front-end is needed to achieve stable performance for filter-based VIOs, while a real-time solution also calls for fast data correlation.
Yang et al. [29], refined ORB-SLAM [38] by using a sparse optical flow algorithm. The key idea was to correct the image coordinates of ORB features by optical flow tracking results to achieve sub-pixel precision. The proposed method here is a bit different since we use optical flow to first conduct a fast tracking, then compute descriptor distance between matched feature pair members and justify whether they are a good match-up.
There exist plenty of feature descriptor algorithms. We chose the ORB descriptor in our proposed method for two reasons: 1. The ORB descriptor is a binary string, so the distance between two descriptors can be expressed as a Hamming distance, which can be computed efficiently. 2. The rotation between consecutive images in a real-time application is usually very gentle, so invariance to rotation is not very important for a descriptor.

Descriptor Distance Analysis for General Corner Features
The basic visual front-end is based on Shi-Tomasi corner detection [39] and optical flow tracking [40]. It is important to figure out whether the ORB descriptor is meaningful for general Shi-Tomasi corner features. An experiment was done and proved that it is indeed meaningful statistically. We calculated the feature angle for a Shi-Tomasi feature and then used it to compute the ORB descriptor [32]. Several tests were conducted in the experiment. For each test, feature pairs from every two adjacent images of a continuous image stream were stored separately in two sequences. These tests basically analyzed the statistical properties of ORB descriptor distances of feature pairs, including 1.
Coarsely matched feature pairs based on Shi-Tomasi corner detection and optical flow tracking.

2.
Relatively strictly matched feature pairs based on ORB descriptor matching and RANSAC.
Unmatched feature pairs generated by inverse order of one of the strictly matched feature sequences.
One feature sequence from strictly matched pairs was inverted to generate strictly unmatched feature pairs. The experimental result is shown in Figure 2. We strongly suspect the very long tail in Figure 2a may be due to wrong matches because no further outlier rejection method was applied after optical flow tracking in this test. In Figure 2b, except for the massive Guassian-like distribution, a little bump centered at about 17 appeared, which is framed by a red rectangular border. This is because the random pairs were constructed in two adjacent images and thus, two matched features have a considerable probability of being coincidentally formed into a pair. These two experiments prove that ORB descriptors and descriptor distances are meaningful for general Shi-Tomasi corners, from a statistical standpoint.
In order to clearly analyze the statistical properties of matched and unmatched pairs, two further tests were conducted. First, a descriptor-based matching and RANSAC mechanism were applied to obtain relatively strictly matched feature pairs. Then, the order of one of the feature sequences was reversed, which is a simple yet effective way to make two sequences unmatched. Descriptor distances before and after order reversion were computed, and statistical results are shown in Figure 2c,d. It can be seen from the figures that the long tail and little bump disappear because of the relatively strict pairing rule. They are plotted together in Figure 3 to make a clear comparison.  The experimental results show that the descriptor distances of unmatched and matched feature pairs possess significantly different statistical properties. As shown in Figure 3, descriptor distances of unmatched features approximately follow the Gaussian distribution with a mean, or we can say peak, at about 124.7 and with a standard deviation of 21.8. For matched pairs, the distribution shows a sharper peak at about 18.5. There is still a tail in the matched distribution, but it is much smaller than the one in Figure 2a. The difference between matched and unmatched pairs is significant enough to design a strategy to filter out wrong matches.
We use a heuristic to complete the mission: • For feature pairs with distances lower than the smaller peak value, classify them as inliers.

•
For feature pairs with distances higher than the bigger peak value, classify them as outliers.

•
For feature pairs whose distances are between two peaks, calculate and compare the Mahalanobis distances to both peak to decide their classification.

EKF-Based VIO Implementation Details and Improvements
In this section, implementation details and improvements of the proposed EKF-based VIO are presented, including filtering scheme, automatic initialization procedure, and refined feature triangulation mechanism. An overall flow chart of the implemented VIO algorithm is shown in Figure 4. Red sections highlight novelties proposed in this paper.  Red sections highlight novelties proposed in this paper. Term "IMU" stands for inertial measurement unit, and term "MSCKF" stands for multi-state constraint Kalman filter.

Filter State and Measurement Model
A VIO following the scheme of Mourikis and Roumeliotis [1] is implemented. The system state includes a sliding window of N historical IMU poses and camera-IMU extrinsic as proposed by Li and Mourikis [22]. The overall system state is formulated as Therefore, the overall error state of the filter is The measurement residual is a linearized residual about historical IMU pose errors and camera-IMU extrinsic errors. The original reprojection error is manipulated firstly by left nullspace multiplication to marginalize out the feature position, and secondly by applying QR decomposition to decrease the residual dimensions without information loss [1]. Furthermore, only residuals passing through the Mahalanobis gating test would be used in measurement updating.

Automatic Initialization Procedure
An automatic initialization procedure is developed. Firstly, a stationary scene is automatically detected by only using image stream. Secondly, stationary IMU data is used to initialize the system state. The detailed procedure is described in Algorithm 1.
The algorithm identifies a stationary scene by continuously detecting almost no motion of tracked features. Then, static gyro data is used to initialize gyro bias. Rotation matrixR b w is computed by aligning gravity in frame b, which is the mean static accelerometer data, with gravity in frame w. This initialization procedure is a rough one since accelerometer bias has not been eliminated, but its uncertainty can be modeled by the initial covariance matrix of the filter state.
Algorithm 1 Automatic initialization procedure 1. Detect stationary scene Counter is big enough then break end if end for

Initialize system state
Save stationary acc data in arry acc Save stationary gyro data in arry gyro

Refined Feature Triangulation Mechanism
In Mourikis and Roumeliotis's work [1], features are triangulated only if they are no longer being tracked; however, we found that this mechanism does not perform well, especially when using cheap IMUs. To conduct frequent and effective measurement updates, which is crucial to correct biased IMU propagation, a maximum feature tracking length is set. This means each feature would be triangulated when it has been tracked for a certain number of frames, even if it is still being tracked. In the latter situation, the current observation would not be used in triangulation.
Generally, features that failed in triangulation would be discarded directly. While the proposed mechanism is that if a feature fails in triangulation while it is still being tracked, it will have another chance to triangulate when the next image is coming. This mechanism improves the performance when the camera is moving slowly, where features in adjacent images exhibit a small parallax that would easily result in triangulation failure.

Experimental Results
The public dataset EuRoC [33] was used to evaluate the performance of the proposed VIO. It includes 11 sequences that were collected by a UAV in three different scenes. One is a machine hall and the other two are rooms equipped with motion capture systems and different manual layout arrangements. The extrinsic and intrinsic parameters of sensors are carefully calibrated, and ground truths of UAV poses are provided. It is one of the widely used benchmarks for evaluating algorithms of different configurations, including monocular-visual, stereo-visual and monocular/stereo-visual-inertial setups. All of the experiments below were performed on an Ubuntu 16.04 virtual machine powered by MacBook Pro Mid 2015 assigned with two core and 8 GB RAM. Our implementation is a real-time algorithm based on ROS nodelet [41].
The estimated trajectories and corresponding ground truths are shown in Figure 5. Estimated trajectories are aligned with ground truths by a 6-DOF Sim (3) transformation without adjusting the scale [42].

Front-End Improvement
For implementations of front-ends with and without ORB descriptor assistance, we run each EuRoC sequence 50 times. A boxplot summary is shown in Figure 6. The corresponding means and standard deviations are listed in Table 1.
After adding ORB descriptor assistance, the estimator performs better in most sequences, since boxes became narrow and their position lower in Figure 6. The statistics in Table 1 give a numerical display of the results. Obvious improvement can be observed in seven sequences. In the other four sequences, performance are similar with or without ORB descriptor assistance. This may be due to the small quantity of outliers of optical flow tracking in these sequences.
We also analyzed the processing time of the proposed ORB descriptor-assisted outlier elimination procedure. The maximum feature number is set as 150. The results are listed in Table 2.   Figure 6. For each sequence, the one with an obviously better performance is highlighted. As shown in Table 2, the proposed ORB descriptor-assisted outlier elimination procedure introduces little computation. The processing time varies among sequences, mostly due to the motion speed. Sequences with aggressive motion tend to take less processing time than those with slow motion since fewer features are tracked in the former case, and fewer ORB descriptor distances need to be calculated.  [23]. MSCKF-MONO has a visual front-end based on optical flow and utilizes first-order approximation state transition equations in filtering. It also applies observability-constrained Kalman filter (OC-KF) [21] to fix the observability problem, which would fix the wrong observability properties and improve filter performance. Note that ours does not apply any similar techniques.

Sequence
In our experiment, we removed the coarse initialization and forbid the reset module in MSCKF-MONO because for some reason, MSCKF-MONO did not work properly on nearly half of sequences under the original coarse initialization, and reset does not help if there is no stop during running. The initial state was assigned by noisy ground truth for both our algorithm and MSCKF-MONO in this experiment. To make a fair comparison, we tried to run with same setup for common parameters in both algorithms, such as noise densities for sensors measurement, sliding window size, and maximum or minimum track lengths for features. However, MSCKF-MONO barely worked in any sequences under a similar setup as ours. This is mainly due to the different state transition model and visual front-end implementations. As we explored further and could not find a setup which generally performed better than the original setup for MSCKF-MONO, we left the original parameters unaltered. The comparison results are listed in Table 3. The results show that the proposed monocular MSCKF is far more accurate than MSCKF-MONO. We claimed that this is due to a more accurate state transition model and a robust visual front-end.

Comparison with the State-Of-The-Art
The results of proposed VIO algorithm are compared with several state-of-the-art open-source monocular VIOs using the EuRoC dataset, including OKVIS [6], ROVIO [5], and VINS-MONO [2]. To make a fair comparison between pure VIOs, we turned off the closure detection in VINS-MONO. The proposed VIO automatically selects stationary IMU data to initialize the rotation and gyro bias at the beginning of every sequence, while other states are initialized as zeros. In addition, a unique parameter configuration is applied in all sequences. Results are listed in Table 4. As shown above, the proposed VIO algorithm is comparable in accuracy to the state-of-the-art. Notice that VINS-MONO generally performs best out of all four algorithms, and the proposed algorithm has a similar performance in vicon rooms, which is due to good feature triangulation results in a limited area. In addition, the proposed algorithm and ROVIO perform better in V1_03 and V2_03 than others. There are aggressive motions in these two sequences that might result in tracking failure in the front-end; the proposed algorithm and ROVIO are filter-based methods that can utilize IMU measurements to propagate for a short period in this situation, while VINS-MONO and OKVIS sometimes fail and have to lean on relocalization in this circumstance. Notice that the machine hall is a relatively large-scale scenario [33], where triangulations in the proposed method mostly deal with points of large depth. This results in a relatively downgraded performance of the proposed method in the machine hall, even in sequences with mild motions.

Processing Time
As mentioned by Delmerico and Scaramuzza [17], the better performance of VINS-MONO is a trade-off requiring more computer resources than others. In contrast, the proposed method has a similar architecture to MSCKF-MONO, which is a light-weight solution. The average processing time of the visual front-end and EKF/optimization back-end of our implementation and the state-of-the-art are listed in Table 5.
The results show that, the proposed method has higher processing speed than the listed optimization-based methods. ROVIO is the fastest solution among all listed solutions, but as shown in Table 4, its precision is generally the worst. In proposed method, the visual front-end can process images at about 60 Hz. Notice that V2_03 is a little bit slower than others, because aggressive motions in this sequence result in a short feature tracking length, and thus, the front-end will take more time to extract new features. The EKF-based back-end run at more than 160Hz and the difference between each sequence is due to the difference in the number of features used in measurement updating. As can be concluded from Tables 4 and 5, the proposed method is a VIO solution which has comparable precision and generally required less computation resources than the state-of-the-art.

Conclusions
In this paper, we first deduced a highly closed-form IMU error state transition equation from scratch. By using Hamilton's notation of quaternion, we tried to eliminate notation ambiguity. We then managed to solve the integration terms left behind in the transition equation by introducing a two-sample fitting method to approximate the axis-angle, resulting in a fully linear closed-form formulation that is unbiased up to the fitting resolution. This formulation also has potential to incorporate IMU intrinsics into the filter state, since it is a linear function of IMU measurements. An automatic initialization procedure is developed and the feature triangulation mechanism is carefully refined. The ORB descriptor distance between Shi-Tomasi corner pairs was analyzed, and we found that there is a statistical difference in descriptor distances between matched and unmatched feature pairs. As outliers are sometimes fatal for filter-based VIOs, this inspired us to propose a visual front-end based on optical flow tracking and additionally, to use ORB descriptors to eliminate outliers. We implement a monocular VIO under the framework of MSCKF with proposed novelties.
Through a comparison between estimation results with and without the proposed outlier eliminating method, we demonstrate its effectiveness. Furthermore, an experiment was done to compare the proposed method with several state-of-the-art VIOs, both in terms of precision and computation. Results show that the proposed VIO is a visual inertial fusion solution with comparable precision to the state-of-the-ar but which demands less computation resources.
Future works include adding a robust initialization procedure adapting to versatile scenes and analyzing the point selection mechanism in detail.
Author Contributions: X.Q. and H.Z. designed the algorithms. X.Q. deduced all the formulas, analyzed the experimental results and drafted the paper. W.F., C.Z., and Y.J. revised the draft.