A Hybrid Sliding Window Optimizer for Tightly-Coupled Vision-Aided Inertial Navigation System

The fusion of visual and inertial measurements for motion tracking has become prevalent in the robotic community, due to its complementary sensing characteristics, low cost, and small space requirements. This fusion task is known as the vision-aided inertial navigation system problem. We present a novel hybrid sliding window optimizer to achieve information fusion for a tightly-coupled vision-aided inertial navigation system. It possesses the advantages of both the conditioning-based method and the prior-based method. A novel distributed marginalization method was also designed based on the multi-state constraints method with significant efficiency improvement over the traditional method. The performance of the proposed algorithm was evaluated with the publicly available EuRoC datasets and showed competitive results compared with existing algorithms.


Introduction
Accurate localization in an unknown environment is essential for a robot to succeed in its missions. In many cases, existing external localization systems, such as motion capture systems, the global positioning system, or a pre-constructed map of the working area, are costly, insufficiently accurate, or unavailable. In this study, we focused on estimating a vehicle's motion by fusing the measurements from a monocular camera and an inertial measurement unit (IMU). This task-the well-known monocular vision-aided inertial navigation system (VINS) problem-has drawn great interest in the robotic community (e.g., [1][2][3][4][5][6][7]) for many reasons.
First, the monocular camera and the micro-electro-mechanical system IMU are both small and cheap and consume little power, and both are passive sensors, which means that they exert no influence on the external environment and do not interfere with each other like Lidars and RGB-D cameras. Second, the monocular camera cannot compute inter-frame motion alone, so some studies have assumed a constant-velocity model [2,3] between two frames. However, this assumption results in an inconsistent estimation. The micro-electro-mechanical system IMU usually works at a higher frequency than the monocular camera, so the IMU measurements can be integrated for accurate recovery of short-term inter-frame motion. However, the IMU measurements are corrupted by noise and slowly time-varying biases, which makes long-time integration unreliable [4,8], so visual information is required to aid the IMU in estimation of biases. Third, the monocular camera is a projective sensor that provides bearing information regarding visual features, so motion and structure can only be recovered up to an unknown scale [1,5]. In order to recover its metric scale by using IMU, a large window are directly fixed, and the related edges are used as usual. After the measurements of IMU are added, the first node (including the pose, velocity, and biases of IMU) of the sliding window must also be fixed to eliminate the ambiguity of motion [4]. This kind of method is highly robust but is not optimal theoretically, because only a part of the graph is active while LBA is performed. For the prior-based method, a marginalization technique is performed on the edges related with the outside nodes to construct a prior distribution (typically, a Gaussian distribution) for the nodes in the sliding window. This kind of method is optimal theoretically but is also affected by linearization errors numerically and cannot cope properly with features whose tracking length lies outside the range of the sliding window.
In this paper, we propose a novel hybrid sliding window optimizer (HSWO) that has the advantages of both the conditioning-based and prior-based methods. To make the linearization points of the prior distribution reliable, the sliding window in our method is divided into two parts. We call the front part the mature region and the back part the growing region. We marginalize a map point out only if its last measurement lies within the mature region. To cope with map points whose tracking length lies outside the range of the sliding window, we follow the conditioning-based method. The nodes outside the sliding window are fixed directly and are called the fixed basis. For balancing the linearization errors within the growing region and fixed basis, the size of the mature region should be selected carefully.
The two kinds of marginalization techniques-the Schur complement technology [8,18] and the null-space-based method [12,15,16]-are equivalent mathematically [22]. For the traditional method, all reprojection factors of marginalized map points are linearized at the current estimation, and a Hessian matrix is constructed by stacking linearized factors. Because there is a submatrix in the Hessian matrix that needs to be inversed, and the dimension of the submatrix depends on the number of marginalized map points (possibly more than hundreds). It is time-consuming and numerically unstable if we directly calculate the inverse matrix of the submatrix, as in [8]. For our method, we first used the null-space-based method to calculate the multi-state constraints factors of the marginalized map points, i.e., the map point position parameters were marginalized first. Then, we stacked the multistate constraint factors to construct the Hessian matrix. In our method, the dimension of the submatrix that needs to be inversed is quite low.
To avoid the repeated integration of IMU measurements, the well-known IMU pre-integration technology has been widely adopted in graph-based optimization methods. This technique was first proposed by Lupton et al. [23,24] and further improved by [8,25,26]. The integration is performed in the first body reference frame during a period, so no previous estimates or covariance are necessary except for the current estimates of the IMU biases.
It is well known that the uncertainty of low-parallax features is poorly approximated by Gaussian distribution in Euclidean (XYZ) space, which makes XYZ parameterization suitable only for relatively close features. An inverse depth parameterization technology was proposed by Civera et al. to handle this case [5,27]. However, the dimension of the inverse depth parameterization is six, which is twice the size of the XYZ parameterization. For this reason, an anchored inverse depth parameterization was proposed by Pietzsch [28]. In this study, we adopted the anchored inverse depth parameterization and selected the first keyframe in the sliding window as the anchored frame.
Some VINS solutions, such as VI-ORB-SLAM [4], assume that the camera-to-IMU transformation is known. This requirement is not always met, such as with a new device. Although the camera-to-IMU transformation can be calibrated by offline methods [29,30], these methods are time-consuming and complex and require a professional user to carefully move the device in front of a stationary calibration target [29,30]. In this study, the camera-to-IMU transformation was estimated online.
Both the conditioning-based method and the prior-based method are susceptible to linearization error. Directly fixing the outside nodes makes the conditioning-based method more sensitive to linearization error, so a high-precision initialization method is essential for the VINS problem [7]. Two types of online initialization methods are found in the literature: loosely-coupled methods [31][32][33]  and tightly-coupled methods [34,35]. As reported by Liu et al. [32], the tightly-coupled methods that aim to recover all navigation quantities in one attempt perform poorly because they attempt to solve a large number of variables in a poorly conditioned system. In this study, we adopted the method proposed by Huang et al. [31].
Our contributions are threefold: a.
We designed a novel hybrid sliding window optimizer that has the advantages of both the conditioning-based method and the prior-based method. b.
We designed a distributed marginalization technology based on multi-state constraint factors. c.
We estimated the camera-to-IMU transformation online.
The remainder of this paper is organized as follows. Section 2 briefly introduces the anchored inverse depth parameterization and IMU pre-integration technology. In Section 3, we demonstrate the framework of the hybrid sliding window optimizer in detail. In Section 4, we demonstrate the distributed marginalization technology, and in Section 5, we show our experimental results and some implementation details, and in Section 6, we make our conclusions.

Measurement Model Formulation
This section briefly introduces the anchored inverse depth parameterization and the IMU pre-integration technology. The anchored inverse depth parameterization effectively improves the accuracy of linearization relative to Euclidean XYZ parameterization [5,27]. To avoid the repeated integration of IMU measurements during optimization iteration, IMU pre-integration technology was proposed by Lupton et al. [23,24] and further improved in other studies [8,25,26]. In this study, we followed the one proposed by Qin et al. [8].

Notation and Frames of Reference
In this paper, the world reference frame is denoted as W; the camera reference frame is denoted as C; the IMU body reference frame is denoted as B. All vectors and matrixes are bold, and a vector projected in a specific reference frame is appended with a right superscript, e.g., p B ∈ R 3 means the vector p is projected into the reference frame B. A transformation matrix T WB ∈ SE 3 which transforms the vector p B ∈ R 3 from the reference frame B to the reference frame W. The transformation matrix can be further divided into a rotation matrix R WB ∈ SO 3 and a relative translation vector p W WB ∈ R 3 , as follows: In addition, the notation q WB signifies the unit quaternion corresponding to the rotation matrix R WB . We refer readers to [31] for more details about the operation of unit quaternion. The function T WB (·) is defined to transform a vector projected in the reference B to the reference frame W, i.e., The pose of the IMU at time t k with respect to the world reference frame W is denoted as T WB k ; the camera-to-IMU transformation is denoted as T BC . For a full-rank square matrix A, the notation A −1 signifies the inverse matrix of A, i.e., A −1 A = AA −1 = I. The notation I signifies identity matrix. The notation A T signifies the transposed matrix of A. For a vector p = [p x , p y , p z ] T ∈ R 3 , the notation [p] × signifies the skew-symmetric matrix of p, as follows:

Inverse Depth Parameterization
Assume that the map point l m is co-visible in the keyframe KF a and KF j . We selected KF a as the anchored keyframe for l m . The pose of the anchored keyframe is denoted as T WB a . The pose of the observing keyframe KF j is denoted as T WB j . Let ψ l m = [ψ 1 , ψ 2 , ψ 3 ] T ∈ R 3 be the inverse depth parameterization of the co-visible map point in the anchored camera reference frame C a .
The function Π(ψ l m ) = [ T transforms an inverse depth parameterization to its Euclidean XYZ counterpart and vice versa. We adopted a conventional pinhole-camera model, and the 2D projection of the co-visible map point on the image plane of the anchored keyframe are The XYZ coordinates of the co-visible map point l m in the observing camera reference frame C j are the 2D projection of the co-visible map point l m on the image plane of the observing keyframe are this model does not consider the distortion of the camera lens. All 2D coordinates of the key points are undistorted immediately after extraction.
Let z l m a and z l m j be the 2D positions of the key points in the anchored keyframe and the observing keyframe, respectively, and their extraction errors η l m a and η l m j are typically zero-mean Gaussian. The visual reprojection residual functions are as follows: where the covariance matrix of η l m a and η l m j are denoted as P l m a and P l m j (the covariance notation P is capital for making a distinction with the translation notation p which is lowercase).

IMU Pre-Integration Technology
The IMU measures the acceleration a B and the angular velocity ω B of a vehicle, typically at hundreds of Hz. Both measurements are corrupted by additive measurement noises and time-varying biases; therefore, the raw measurements of IMU at time t are modeled as follows: where the notation b B t (·) signifies the time-varying biases of IMU. The notation n B t (·) signifies the additive noises of IMU and are typically Gaussian, i.e., n B t a ∼ N (0, σ 2 a ), n B t g ∼ N (0, σ 2 g ). The time-varying biases are modeled as a random walk, where the noises that drive the biases are Gaussian, i.e., Given two consecutive keyframes at times t k and t k+1 , all IMU measurements are pre-integrated into a single measurement that serves as a constraint between the IMU poses at times t k and t k+1 . According to Newton's second law, the related integration equations are as follows: where ∆t k = t k+1 − t k ; ⊗ represents quaternion multiplication, and we refer readers to [31]  represent the velocity of IMU at time t k and t k+1 , respectively; α B k B k+1 , β B k B k+1 , and γ B k B k+1 are the so-called pre-integrations of the IMU measurements during [t k , t k+1 ]. Note that the IMU pre-integration is performed in the first IMU reference frame B k without the need for initial pose or velocity estimates. Because of the presence of random noises, the pre-integration also involves random variables. We need to calculate its covariance for information fusion. We directly cite the linearized continuous-time dynamic model of the deviation of the pre-integration and refer readers to the literature [8] for more details, as follows: where represents the orientation error. According to the linear system theorem [36], this system can be discretized as follows: where ∆T is the sampling time of IMU; n t = 1 ∆T t+∆T t n τ dτ is the equivalent noise during [t, t + ∆T]; its mean is zero, and its covariance is calculated as follows: where ). Because n t is a Gaussian white noise sequence, the covariance of the deviation of the pre-integration can be calculated recursively, as follows: where the initial covariance P k is set to be zeros. The residual function about the IMU pre-integration in [t k , t k+1 ] is where cov(n pre ) = P k+1 .
) are functions about the IMU biases. To perform iterative optimization, we require their Jacobian matrices about the IMU biases. Therefore, the transfer matrix Φ k+1,k is computed recursively as follows.
the related sub-matrices of Φ k+1,k are the Jacobian matrices.

Hybrid Sliding Window Optimizer
The best modern systems work via interleaved tracking and mapping via optimization [21]. One of the most representative works is PTAM [37], which was the first work to split mapping and tracking in parallel threads. Based on the main ideas of PTAM, the famous ORB-SLAM was proposed by Mur-Artal [3], and its inertial version was described by Mur-Artal et al. [4]. For this study, we built our implementation upon the local mapping thread of ORB-SLAM. We claim that the hybrid sliding window optimizer is generic and suitable for any keyframe-based system with a pipeline similar to that of PTAM or ORB-SLAM.

Framework
The hybrid sliding window optimizer is performed once a new keyframe is received. We kept the last N successive keyframes in the sliding window and all map points visible by those N keyframes. Co-visible keyframes outside the sliding window are fixed during optimization. Figure 1 illustrates the framework of the hybrid sliding window optimizer. The sliding window is divided into two parts: the mature region and the growing region. The nodes (state variables) within the mature region have been updated many times and doubtless possess greater estimation precision than those in the growing region. The map point is to be marginalized out only if all of its edges (at least 3) are related only with the nodes in the mature region (at least 2) and the fixed basis, which decreases the linearization error. As a result, the prior becomes more accurate. After each update, the first IMU pose, velocity, and bias nodes and all 3D position nodes of the map points that meet the above criteria are marginalized out. A prior factor is then constructed that is related to the IMU pose nodes (excluding the first one) within the mature region and the second IMU velocity and bias nodes. Finally, all factors used for marginalization are removed from the sliding window. This process is repeated once a new keyframe is received. For map points with a tracking length greater than the size of the sliding window, we continue to estimate the 3D positions until all edges (residuals) move into the mature region and the fixed basis. Some existing methods, such as MSCKF [12,15] and SWF [8,18], cannot deal well with this kind of map point. Reprojection factor in the anchored keyframe Reprojection factor in the observing keyframe

Formulation
In this section, the cost function used for the graph-based optimization is constructed. All related residual functions and state variables are defined in a compact form. Let the full state variables of IMU as where the visual reprojection residual function ( ) p . For convenience of notation, we directly used the has no influence on the ( ) m l  r . The notation 2 P r signifies the Mahalanobis distance of r given the covariance matrix P . The notation  written in "Euclid Math one" signifies the cost function.

Formulation
In this section, the cost function used for the graph-based optimization is constructed. All related residual functions and state variables are defined in a compact form. Let the full state variables of IMU . Let the state variables of camera-to-IMU transformation as . For a visible map point l m , let S F , S M , and S G be the sets of observation keyframes corresponding to the fixed basis, the mature region, and the growing region, respectively. The first observation keyframe in the sliding window is selected as the anchored keyframe KF a . We denote the full state variables of IMU about KF a as x a . According to Equation (6), the cost function about the map point l m is where the visual reprojection residual function r l m (·) is only related with a part of elements of the full state variables of IMU x (·) , i.e., q WB (·) and p W WB (·) . For convenience of notation, we directly used the notation x (·) , the other part of x (·) , i.e., v W g , has no influence on the r l m (·) . The notation r 2 P signifies the Mahalanobis distance of r given the covariance matrix P. The notation C written in "Euclid Math one" signifies the cost function. According to Equation (14), the cost function about the IMU pre-integration during [t k , t k+1 ] is where P k+1 is the covariance matrix of the deviation of the IMU pre-integration, as in Equation (13).
To simplify the notations, the time of the first keyframe in the sliding window is denoted as t 0 . Let r prior (x 0 , x 1 , · · · , x M−2 ) be the linearized and normalized prior residual function; its cost function can be written as where M is the size of the mature region. We stacked the individual state variables into the full state variables of the sliding window, as follows: where N denotes the size of the sliding window. K denotes the number of map points visible in the sliding window. According to Equations (15)- (17), the cost function for optimization is The optimal estimation χ * can be obtained by minimizing the cost function Equation (19): The Equation (19) is the cost function of the nonlinear least-squares problem. This problem can be solved with a numerical optimization method [38], such as the Dogleg method, 2D subspace minimization, the Gauss-Newton method, or the Levenberg-Marquardt method. In this study, we adopted the Levenberg-Marquardt method and implemented our optimizer based on Ceres [39].
The likelihood function considers all measurements. To keep the computation time bounded, marginalization technology is used to construct a linearized and normalized prior factor that gains information over time. A fixed basis is used to cope with map points whose tracing length exceeds the size of the sliding window. Both the marginalization technology and the fixed basis introduce linearization error, so the estimation result is suboptimal. If the linearization error is negligible, the estimation result will tend to be optimal.
All nodes within the fixed basis remain constant during optimization iteration. Hence, the structure of the Hessian matrix of the hybrid sliding window optimizer is same as that calculated by traditional sliding window optimizers (e.g., [8,18]), and its sparsity was proven by Sibley et al. [18]. Because of the prior factor and the camera-to-IMU transformation, the second-order sparsity may vanish. As a result, the computation complexity of the hybrid sliding window optimizer is max(O(N 3 ), O(N 2 K)) [20,21].

Distributed Marginalization
Marginalization technology is widely used in both Bayesian filters and graph-based optimization methods to make system scalable. For the traditional method (e.g., [8,18]), the last prior residual function, all reprojection residual functions of marginalized map points, and the first pre-integration residual function in the sliding window are directly linearized and normalized at current estimates.
Then, all of those linearized and normalized residuals are stacked into one residual (we refer readers to the open-source code of [8] for more details), as follows: whereχ is the state variables related with the residuals used in the marginalization.ˆχ represents the current estimation ofχ. The notationr(ˆχ) denotes the function value of r(χ) at current estimationˆχ.
δχ is the perturbation increments ofχ with respect toˆχ, as follows: where M is the size of the mature region. K M is the number of marginalized map points. δχ m denotes perturbation increments of the state variables that need to be marginalized. δχ r corresponds to the state variables that remain. x (·) is the full state variables of IMU as the above section. The perturbation increments of the elements of x (·) are additive, e.g., p W , except for the unit quaternion q WB (·) . The perturbation of unit quaternion is defined as follows: where δθ T

WB (·)
is an angle-axis vector. So, the perturbation increments of x (·) can be denoted as follows: the perturbation increments of ψ l (·) are additive, i.e., ψ l (·) =ψ l (·) + δψ l (·) . For calculating the prior residual function for next optimization, the Shur complement technology was used on Equation (21) to marginalize the map points out, yields the Equation (25) is only related withχ r , and serves as the prior residual function for next optimization after normalization. The map points that have be marginalized out will never be used again. According to Equation (21), it can be found that the dimension of the submatrix C is 3 * K M + 15, and mainly depends on the number of marginalized map points. However, in some cases, the K M will be quite large, e.g., rich texture and fast rotation, which makes the matrix inversion C −1 time-consuming and numerically unstable.
In this paper, we designed a distributed marginalization method to reduce the dimension of the submatrix C, in order to make the matrix inversion C −1 efficient and stable. The core idea of our method is that we first calculate the MSC factors of the marginalized map points. The 3D inverse depth positions of the map points are marginalized out by using the null-space-based method [12,22]. Therefore, those MSC factors are only related with the poses of IMU within the mature region and the camera-to-IMU transformation.
We marginalized a map point out only if all its measurements (at least 3) lie outside the growing region and at least two measurements lie within the mature region. In mathematic terms, there are S G = ∅, card(S M ) + card(S F ) ≥ 3, and card(S M ) ≥ 2, where card(·) denotes the number of elements of the set. For a marginalized map point, there are no visual measurements within the growing region. According to Equation (15), the cost function of the marginalized map point l m is as follows: where r l m is the reprojection residual function of the map point l m .
T is the state variables including the full state variables of IMU within the mature region and the camera-to-IMU transformation. P l m = diag(P l m a , · · · , P l m i , · · · P l m j , · · · ) is a block-diagonal covariance matrix. The residual function of l m can be linearized and normalized at the current estimationˆ χ and ψ l m , as follows: wherer l m (ˆ χ,ψ l m ) signifies the function value of r l m ( χ, ψ l m ) at current estimationˆ χ andψ l m . We rewrite Equation (28) in a compact form, as follows: the 3D inverse depth position deviation δψ l m can be marginalized out by using the left null space of the Jacobian matrix J ψ . The left null space can be calculated by QR decomposition, and we denote it as N Le f t . Substituting N Le f t into the Equation (28), we have (29) is the well-known MSC factor [12,15]. It can be found that the Equation (29) is independent of the 3D inverse depth position δψ l m .
By combining the last prior residual function Equation (17), the linearized and normalized pre-integration residual function in Equation (14) of the first pre-integration in the sliding window, and all of the MSC factors, we can construct the equation Equation (21). Compared with the traditional method (e.g., [8,18]), the dimension of submatrix C of our method decreases to 6 + 3 + 6 (corresponding to the first IMU pose, velocity, and biases in our implementation). We can compute the inversed matrix of C in an effective and stable manner.

Results and Discussion
We evaluated the proposed hybrid sliding window optimizer on the publicly available EuRoC datasets using a commercial laptop computer (Lenovo ThinkPad T470p, Intel i7-7700HQ, 2.8GHz). The EuRoC visual-inertial datasets were collected onboard by an MAV flying in two Vicon covered rooms and a large industrial machine hall. The datasets contain synchronized stereo image sequences, IMU measurements, and accurate ground truth [40]. All sequences were classified as easy, medium, and difficult levels according to texture, illumination, fast/slow motion, and motion blur [40]. In our data processing, we used only the left image sequence (i.e., monocular). The original publicly available ORB-SLAM has three main threads: the tracking thread, the local mapping thread, and the loop closing thread. We implemented the HSWO based on the local mapping thread, and the loop closing thread was removed. The iterative linear solver of the nonlinear VINS problem requires a good initial value to achieve a rapid rate of convergence. Hence, a good initialization for the VINS problem is necessary. In this paper, we followed the method proposed by Huang et al. [31]. A visual-inertial bundle adjustment is performed immediately after visual-inertial initialization.
We used EVO [41], a third-party evaluation software, to analyze the trajectory calculated by the HSWO. EVO aligned the trajectory with the ground truth via Umeyama's method and then provided the RMSE of the absolute metric position deviation. Because the threads of ORB-SLAM run in parallel with other tasks of the operating system, some randomness will be introduced into the results [3]. Our implementation also inherited this property; therefore, we provide the median value of the RMSEs of five runs just as reference [3] did.
There are different choices for the size of the sliding window N and the size of the mature area M, which affects the results significantly. In this paper, we evaluated and optimize the performance of the HSWO by changing the size of N and M. In addition, the "MapPointFusion()" function in the ORB-SLAM is in charge of fusing the same map point (if available) in the scene, which may introduce some local loop closures and large loop closures. Those potential loop closures may interfere with the evaluation results and cause unfair comparison. Therefore, we adopt three methods to remove them: In the EuRoC datasets, the dataset labeled with MH_03_medium is the longest and contains fast motion that motivates the IMU. So, we analyzed the influence of changing N, M, and F in the HSWO by using this representative dataset. The parameter tuning results are listed in Table 1.
According to Table 1, we have a.
The accuracy of the system will be improved if increase N, which is because more information is contained; b.
By comparing N15-M10-F15 with N15-M10-F00, we can find that the accuracy will degrade dramatically if the fixed basis is removed, which is because the information in the fixed basis is omitted; c.
By comparing N15-M10-F15 with N15-M05-F15, or N20-M10-F20 with N20-M05-F20, we can find that the accuracy will degrade if the value of M is too small, in which case the fixed basis has more weight than the prior factor, and directly fixing the pose of the keyframe in the fixed basis makes it more sensitive to the linearization error; d.
By comparing N20-M10-F20 and N20-M15-F20, we can find that the accuracy of the system will also degrade if the value of M is too large.
Theoretically, the prior factor has one-order accuracy, the keyframe nodes related with the prior factor still have the chance to be updated. However, the linearization error of the prior factor will increase if M is too large, because the keyframe nodes with larger ID are estimated less times than the ones with smaller ID. For making the linearization error less and giving the keyframe nodes with large ID more freedom (i.e., making the keyframe nodes with large ID not related with the prior factor), we set N = 20, M = 10 in the following data processing. An array of publicly available VINS pipelines (MSCKF, OKVIS, ROVIO, VINS-Mono, etc.) have been evaluated on various hardware configurations [42]. They performed sim3 trajectory alignment to the ground truth and computed the root-mean-square error (RMSE) of the absolute position over the aligned trajectory. All results are listed as a table, and we used the results evaluated on a common mobile workstation (Lenovo ThinkPad W540) for comparison. The accuracy of VI-ORB-SLAM proposed in [4] was also evaluated with the EuRoC dataset, and all metric position RMSEs are shown in the same table. Because our implementation includes no full visual-inertial bundle adjustment, we selected the results labeled as "NO full BA" for comparison. We also directly used the results reported for the VI-DSO proposed by Von Stumberg [43]. All of the results are listed in Table 2, "HSWO (Limited)" means that we removed potential local loop closures by using the three methods mentioned above, and the maximum of the size of the fixed basis F is set as 20. "HSWO" means that we keep the original "MapPointFusion()", the map point will be treated as a new map point after fusion with other map point, which will cause reuse of visual measurements. Note that many details of these VINS pipelines, e.g., the data association of the front end, setting of optimizers, and trajectory saved, are different. For those reasons, the purpose of the comparison is not to show which VINS pipeline is better, but to demonstrate that our implementation can also achieve high-accuracy results by using the results provided by others VINS pipelines as a third-party reference standard.
The full trajectory contains the poses of every frame, not only the keyframe. The original ORB-SLAM recovered this trajectory from the relative pose and the reference keyframe of every frame, and we follow the same method in this paper. In most cases, the accuracy of the full trajectory may be lower than the one of the keyframe trajectory. But, in some cases, the accuracy of full trajectory may be better, and there are two reasons for this: (a) the full trajectory is recovered from the keyframe trajectory, therefore the accuracy of two trajectory should be comparable; (b) the number of pose in the full trajectory is much larger than the one in the keyframe trajectory. So, the weight of every pose deviation will be decreased during calculating the RMSE of trajectory. Then, the weight of some big deviation will also be decreased. In this case, the RMSE of the full trajectory will be less than the RMSE of the keyframe trajectory. The keyframe trajectory with highest accuracy is highlighted in bold and black; The best performance among the VINS pipelines, except for VI-ORB-SLAM and HSWO, is also highlighted in bold. VI-ORB-SLAM used a conditioning-based method for fusing the visual and inertial measurements, and the first IMU pose, velocity, and biases node are also directly fixed. Theoretically, this method is more sensitive to linearization error. However, the accuracy of the keyframe trajectory produced by VI-ORB-SLAM is quite high. The reason might be: The EuRoC datasets contain many local and large loops, the "MapPointFusion ()" function in the local mapping thread may implicitly close many local loops and some large loops, which phenomenon is also found by [44]. In VINS-mono, visual features are tracked by the KLT sparse optical flow algorithm. Both VI-DSO and ROVIO use direct method for data association. Also, they cannot achieve map point fusion like ORB-SLAM. Therefore, it cannot be proven that the conditioning-based method is sufficient for VINS problem, even if the accuracy of VI-ORB-SLAM is higher than some other VINS pipelines. In order to eliminate the influence of map point fusion, the results of the HSWO in which map point fusion has been limited are also listed in Table 2. It can be found that its accuracy slightly degrades compared with the HSWO with map point fusion. Also, compared with other VINS pipelines with no map point fusion, our implementation still achieves better performance on more than half of the EuRoC datasets. Note that the VI-ORB-SLAM cannot process the V1_03_ difficult dataset because the movement has exceeded the limit for the monocular system [4]. In our implementation, IMU measurements are used to reckon the frame pose during tracking lost epoch, and a reprojection method is then used to perform feature matching, which makes it possible to process the V1_03_ difficult dataset. However, our implementation cannot process the V2_03_ difficult dataset if the map point fusion function is limited using the methods mentioned above, even if an approximate frame pose has been provided by integrating the IMU measurements, our implementation still cannot find suitable map points for matching.
Based on all of the above analysis, we claim that: (a) The accuracy of the proposed HSWO will degrade if the size of mature region M is too small, in which case, the performance of HSWO tends to the conditioning-based method. (b) The accuracy of HSWO will degrade if the fixed basis is removed, in this case, the HSWO degenerates into the prior-based method, and the map point whose tracking length is larger than the sliding window cannot be used effectively.  Figures 2b and 3b, respectively, and most of the deviations are less than 10 cm; the rotation deviations of camera-to-IMU transformation are shown in the Figures 2c  and 3c, respectively, and the translation deviations of camera-to-IMU transformation are shown in the Figures 2d and 3d, respectively. It can be found that the estimation of camera-to-IMU transformation fluctuates at the start period, and then tends to be stable as time goes on. But there are still some constant biases in the estimation. This is because the resolution of the digital camera is limited, and most map points are far from the camera, in which case, small rotation biases and small translation biases in the camera-to-IMU transformation estimation will not cause sufficient parallax, and therefore cannot be observed effectively by the optimizer.
limited, and most map points are far from the camera, in which case, small rotation biases and small translation biases in the camera-to-IMU transformation estimation will not cause sufficient parallax, and therefore cannot be observed effectively by the optimizer.  For attitude error, we followed the evaluation method proposed by Delmerico et al. [42] and used the transformation matrix calculated by EVO to align the estimated attitude with its ground truth, and then computed the yaw error. The boxplots in Figures 4 and 5 summarize the statistics of the yaw error on each sequence of the EuRoC datasets. The results provided by the HSWO (limited) are labeled with (*). The yaw error on V1_01_easy dataset is significantly larger than on the other datasets. Here we noticed that the motion on V1_01_easy dataset is weak during initialization, which means our implementation cannot be well initialized, even if a visual-inertial bundle adjustment is performed right after the initialization. For attitude error, we followed the evaluation method proposed by Delmerico et al. [42] and used the transformation matrix calculated by EVO to align the estimated attitude with its ground truth, and then computed the yaw error. The boxplots in Figures 4 and 5 summarize the statistics of the yaw error on each sequence of the EuRoC datasets. The results provided by the HSWO (limited) are labeled with (*). The yaw error on V1_01_easy dataset is significantly larger than on the other datasets. Here we noticed that the motion on V1_01_easy dataset is weak during initialization, which means our implementation cannot be well initialized, even if a visual-inertial bundle adjustment is performed right after the initialization.   In the multithread framework, the optimizer has no need to operate at the frame rate. Table 3 lists the statistical results for the time consumption of the HSWO used in our implementation and the LBA (vision-only, a conditioning-based method) used in the mono ORB-SLAM [3].

METHOD MEDIAN (ms) MEAN (ms) MAX (ms) STD (ms)
HSWO 189 219 507 82 LBA 163 216 1202 189 The standard deviation of the time consumed by the HSWO is less than that for the LBA, which means that the HSWO's time consumption is more stable. This better stability is a result of the fixed size of the HSWO's sliding window, whereas the number of the keyframes used in the LBA varies according to the co-visibility between keyframes, which makes the time consumed by the LBA fluctuate. If a large number of keyframes are used in the LBA and a map point is visible in all keyframes, the second-order sparsity will disappear. In this case, the computation complexity of the LBA is cubic with the number of keyframes. Therefore, the LBA's maximum time consumption is much larger than that of the HSWO. The mean time consumption is approximately equivalent for both methods. During the initial period, the number of keyframes is small and the co-visibility is sparse, so the LBA is much more efficient. As the map grows and the co-visibility gradually becomes denser, the time consumed by LBA will increase. Therefore, the median time consumption of the HSWO is slightly greater than that of the LBA. Based on these statistical results, we claim that the HSWO shows comparable efficiency as the LBA.   In the multithread framework, the optimizer has no need to operate at the frame rate. Table 3 lists the statistical results for the time consumption of the HSWO used in our implementation and the LBA (vision-only, a conditioning-based method) used in the mono ORB-SLAM [3].

METHOD MEDIAN (ms) MEAN (ms) MAX (ms) STD (ms)
HSWO 189 219 507 82 LBA 163 216 1202 189 The standard deviation of the time consumed by the HSWO is less than that for the LBA, which means that the HSWO's time consumption is more stable. This better stability is a result of the fixed size of the HSWO's sliding window, whereas the number of the keyframes used in the LBA varies according to the co-visibility between keyframes, which makes the time consumed by the LBA fluctuate. If a large number of keyframes are used in the LBA and a map point is visible in all keyframes, the second-order sparsity will disappear. In this case, the computation complexity of the LBA is cubic with the number of keyframes. Therefore, the LBA's maximum time consumption is much larger than that of the HSWO. The mean time consumption is approximately equivalent for both methods. During the initial period, the number of keyframes is small and the co-visibility is sparse, so the LBA is much more efficient. As the map grows and the co-visibility gradually becomes denser, the time consumed by LBA will increase. Therefore, the median time consumption of the HSWO is slightly greater than that of the LBA. Based on these statistical results, we claim that the HSWO shows comparable efficiency as the LBA. In the multithread framework, the optimizer has no need to operate at the frame rate. Table 3 lists the statistical results for the time consumption of the HSWO used in our implementation and the LBA (vision-only, a conditioning-based method) used in the mono ORB-SLAM [3]. The standard deviation of the time consumed by the HSWO is less than that for the LBA, which means that the HSWO's time consumption is more stable. This better stability is a result of the fixed size of the HSWO's sliding window, whereas the number of the keyframes used in the LBA varies according to the co-visibility between keyframes, which makes the time consumed by the LBA fluctuate. If a large number of keyframes are used in the LBA and a map point is visible in all keyframes, the second-order sparsity will disappear. In this case, the computation complexity of the LBA is cubic with the number of keyframes. Therefore, the LBA's maximum time consumption is much larger than that of the HSWO. The mean time consumption is approximately equivalent for both methods. During the initial period, the number of keyframes is small and the co-visibility is sparse, so the LBA is much more efficient. As the map grows and the co-visibility gradually becomes denser, the time consumed by LBA will increase. Therefore, the median time consumption of the HSWO is slightly greater than that of the LBA. Based on these statistical results, we claim that the HSWO shows comparable efficiency as the LBA.
Generally, there are 30-80 map points to be marginalized at every step. But in some cases, such map points increase to more than a hundred. In order to evaluate our marginalization method, we increased the size of mature region for making more map points marginalized. Additionally, we performed marginalization in one thread to remove randomness. Figure 6 plots the time consumed by the traditional method (e.g., [8,18]) and by our method. The time consumed by the traditional method is cubic with the number of map points used in marginalization, but the time consumed by our method is linear with the number of map points used in marginalization. Therefore, our method is much more efficient than the traditional method, especially with a large number of map points. Generally, there are 30-80 map points to be marginalized at every step. But in some cases, such map points increase to more than a hundred. In order to evaluate our marginalization method, we increased the size of mature region for making more map points marginalized. Additionally, we performed marginalization in one thread to remove randomness. Figure 6 plots the time consumed by the traditional method (e.g., [8,18]) and by our method. The time consumed by the traditional method is cubic with the number of map points used in marginalization, but the time consumed by our method is linear with the number of map points used in marginalization. Therefore, our method is much more efficient than the traditional method, especially with a large number of map points.

Conclusions
Due to the complementary sensing characteristics, low cost, and small space requirements of the visual-aided inertial navigation system, the VINS problem has become prevalent in the robotic community. In this paper, we designed a hybrid sliding window optimizer for the VINS problem, the method can effectively handle visual features whose tracking length exceeds the size of the sliding window. The sliding window was divided into two parts: the mature area and the growing area. The nodes outsides the window served as fixed basis. We marginalized a feature out only if all of its measures lay within the fixed basis and the mature area. A distributed marginalization technology was also used with significant efficiency improvement than the traditional method, and the accuracy loss was negligible. Finally, we evaluated our implementation using the open shared EuRoC datasets, and the results are competitive compared with other VINS pipelines.

Conclusions
Due to the complementary sensing characteristics, low cost, and small space requirements of the visual-aided inertial navigation system, the VINS problem has become prevalent in the robotic community. In this paper, we designed a hybrid sliding window optimizer for the VINS problem, the method can effectively handle visual features whose tracking length exceeds the size of the sliding window. The sliding window was divided into two parts: the mature area and the growing area. The nodes outsides the window served as fixed basis. We marginalized a feature out only if all of its measures lay within the fixed basis and the mature area. A distributed marginalization technology was also used with significant efficiency improvement than the traditional method, and the accuracy loss was negligible. Finally, we evaluated our implementation using the open shared EuRoC datasets, and the results are competitive compared with other VINS pipelines.