A Robust Indoor/Outdoor Navigation Filter Fusing Data from Vision and Magneto-Inertial Measurement Unit

Visual-Inertial Navigation Systems (VINS) are nowadays used for robotic and augmented reality applications. They aim to compute the motion of the robot or pedestrian in an environment that is unknown and has no dedicated localization infrastructure. Because of the low quality of the inertial sensors that can reasonably be used for these two applications, state-of-the-art VINS rely heavily on visual information to correct, at high frequency, the drift of inertial sensor integration. These methods struggle when the environment does not provide usable visual features, such as in low-light or texture-less areas. In the last few years, some work has focused on using an array of magnetometers to exploit the opportunistic, stationary magnetic disturbances available indoors in order to deduce a velocity. This led to Magneto-Inertial Dead-Reckoning (MI-DR) systems that show interesting performance in their nominal conditions, even if they can be defeated when the local magnetic gradient is too low, for example outdoors. In this work, we propose to fuse the information from a monocular camera with the MI-DR technique to increase the robustness of both traditional VINS and MI-DR itself. We use an inverse square-root filter inspired by the MSCKF algorithm and describe its structure thoroughly in this paper. We show navigation results on a real dataset captured by a sensor setup combining a commercial-grade camera with our custom MIMU (Magneto-Inertial Measurement Unit) sensor. The fused estimate demonstrates higher robustness than the pure VINS estimate, especially in areas where vision is not informative. These results could ultimately increase the working domain of mobile augmented reality systems.


Motivation
Infrastructure-less navigation and positioning in indoor locations is a technical prerequisite for numerous industrial and consumer applications, ranging from lone-worker safety in industrial facilities to augmented reality. Still, it remains an open challenge to efficiently and reliably combine embedded sensors to reconstruct a position or a trajectory. In the present work, we address this challenge while restricting ourselves to the following sensors: MEMS gyroscopes, accelerometers, magnetometers and a standard industrial vision camera. We motivate this choice by the fact that these sensors are cheap and can easily be embedded in a wearable form factor, which makes this combination of sensors attractive. Section 4 describes the filter design: the chosen state and the equations for the propagation and measurement update steps. The last section (Section 5) presents comparative results of trajectory estimation obtained on real datasets.

General Conventions
Bold capital letters X denote matrices or elements of a manifold. Parentheses are used to denote the Cartesian product of two elements, a ∈ A, b ∈ B → (a, b) ∈ A × B, and brackets for the concatenation of two row vectors. For a vector x = [x_1, x_2, x_3]^T, x_2 denotes its second component and x_{2:3} the sub-vector [x_2, x_3]^T. For matrices, we define the vectorization operation Vec(·) that stacks the columns of a matrix, so that Vec([x_11 x_12; x_21 x_22]) = [x_11, x_21, x_12, x_22]^T. A ⊗ B denotes the Kronecker product of the matrices A and B. ∂_A is a shorthand for the derivative with respect to the coefficients of Vec(A). The notation ‖x‖_Σ is the Mahalanobis norm associated with an invertible covariance Σ: ‖x‖²_Σ = x^T Σ^{-1} x. I_n is the identity matrix of size n and 0_{n×m} the zero matrix of size n × m. O(n) is the orthogonal matrix group of size n.

Reserved Symbols
Generally, and except where stated otherwise, we use the following symbols: p for the translational part of the body pose, R for its rotational part, v for the velocity, b_a and b_g for the inertial sensor biases, B for the magnetic field and ∇B for its 3 × 3 gradient matrix. ω is used for the rotational speed and a for the specific acceleration. 3D landmark positions are denoted with the letter l and their observations in an image with the letter o. We use a tilde for measured quantities, or generally quantities that can be derived from sensor readings, and a hat for the estimated version of a physical quantity. The symbol g is kept for the gravity vector in inertial coordinates. The world coordinates are defined such that the gravity vector writes g = [0, 0, −9.81]^T. When ambiguous, the reference frame in which a quantity is expressed is noted as a superscript: w stands for the world gravity-aligned reference frame, b for the current body reference frame and c for the camera frame. Figure 1 summarizes the chosen notations.

Rotation Parametrization
For rotations, we use the convention that R^w transforms a vector from body frame to world frame by left multiplying it. For the sake of clarity of the developments, we represent the attitude of the sensor as a rotation matrix. The Special Orthogonal Group is denoted SO(3) and its associated Lie algebra so(3) is the set of skew-symmetric matrices. Any element of so(3) can be identified with a vector of R³: [x]_× ∈ so(3) with x ∈ R³, and vex is defined so that vex([x]_×) = x. exp and log are the standard exponential map and logarithm on SO(3). As in [16], we use vectorized versions of exp and log: Exp : R³ → SO(3), x ↦ exp([x]_×), and Log : SO(3) → R³ the inverse function. With these conventions, Log(Exp(x)) = x.
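As an illustration of the conventions above, the vectorized maps Exp and Log can be sketched as follows. This is a minimal NumPy implementation using Rodrigues' formula; the function names mirror the text's notation but the code itself is not from the paper.

```python
import numpy as np

def skew(x):
    """[x]_x in so(3): skew(x) @ y == np.cross(x, y)."""
    return np.array([[0.0, -x[2], x[1]],
                     [x[2], 0.0, -x[0]],
                     [-x[1], x[0], 0.0]])

def Exp(x):
    """Vectorized exponential map R^3 -> SO(3) (Rodrigues' formula)."""
    theta = np.linalg.norm(x)
    if theta < 1e-10:
        return np.eye(3) + skew(x)  # first-order approximation near identity
    K = skew(x / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def Log(R):
    """Vectorized logarithm SO(3) -> R^3, inverse of Exp on the principal branch."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    axis_term = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-10:
        return 0.5 * axis_term
    return theta / (2.0 * np.sin(theta)) * axis_term
```

With these definitions, Log(Exp(x)) = x for any x with norm below π, matching the convention stated above.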

Sensing Hardware
The MIMU sensors (see Figure 2) provide raw measurements of the biased proper acceleration ã^b, the biased angular velocity ω̃^b, the magnetic field B̃^b and its gradient ∇B̃^b. The latter is a 3 × 3 matrix whose elements are estimated by finite differences between signals recorded on an array of magnetometers. These sensors are carefully calibrated offline and harmonized in the same spatial coordinate frame with a method similar to [17]. The resulting MIMU coordinate frame, centered on the magnetometer and accelerometer (which are assumed colocalized here), will be used as the body frame in all subsequent derivations. We use a global shutter camera modeled as a pinhole camera with instantaneous exposure. In practice, recorded real images are undistorted using the intrinsic calibration parameters. The intrinsic parameters (focal lengths in pixels (f_x, f_y), principal point coordinates (c_x, c_y) and distortion coefficients) are assumed to be known from a preliminary calibration. The pinhole projection function π maps a landmark's 3D coordinates l^c = [x, y, z]^T, expressed in the camera frame, to the pixel coordinates of its projection onto the image:

π(l^c) = [f_x x/z + c_x, f_y y/z + c_y]^T. (3)

Figure 2. Schematic view of on-board sensors. In addition to accelerometers and gyrometers, the MIMU includes several magnetometers: a central one and at least three peripheral ones in order to compute the full 3 × 3 matrix of magnetic field gradients. The camera is rigidly attached to the MIMU sensor.
The camera is rigidly attached to the MIMU sensor board. The transformation (R^b_c, p^b_c) between the camera frame and the body frame is assumed to be known. We could alternatively include this transform in the filter state, as done in [18] for instance.
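The projection (3) and the body-to-camera change of frame can be sketched in a few lines. This is an illustrative NumPy snippet, not the paper's code; the frame-composition convention in body_to_camera (that (R^b_c, p^b_c) maps camera coordinates into body coordinates) is an assumption for this sketch.

```python
import numpy as np

def project(l_c, fx, fy, cx, cy):
    """Pinhole projection pi: camera-frame point [x, y, z], z > 0 -> pixel coordinates."""
    x, y, z = l_c
    return np.array([fx * x / z + cx, fy * y / z + cy])

def body_to_camera(l_b, R_b_c, p_b_c):
    """Express a body-frame point in the camera frame,
    assuming (R_b_c, p_b_c) maps camera coordinates to body coordinates."""
    return R_b_c.T @ (np.asarray(l_b, dtype=float) - np.asarray(p_b_c, dtype=float))
```

For instance, with f_x = f_y = 400 px and principal point (320, 240), the camera-frame point [1, 1, 2] projects to pixel (520, 440).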
Since hardware synchronization was not possible with the components used here, we use a timestamping approach. In the online estimation process, we make the simplifying assumption that images are captured simultaneously with the MIMU sample closest in time. More precisely, the image information at time t^image_p is processed using the state estimate at time t^mimu_k with k such that k = arg min_j |t^mimu_j − t^image_p|. The temporal error made with this approach is always smaller than the sampling period of the MIMU sensors, which is often below the exposure time of the camera. Besides, this approximation significantly simplifies time management, as all measurements are indexed by the MIMU time sampling.
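The nearest-sample association described above can be sketched as a binary search over the sorted MIMU timestamps. This is an illustrative helper, not the paper's implementation.

```python
import bisect

def nearest_mimu_index(t_image, t_mimu):
    """Index k minimizing |t_mimu[k] - t_image| over a sorted list of MIMU timestamps."""
    j = bisect.bisect_left(t_mimu, t_image)
    candidates = [i for i in (j - 1, j) if 0 <= i < len(t_mimu)]
    return min(candidates, key=lambda i: abs(t_mimu[i] - t_image))
```

At a 325 Hz MIMU rate, the association error is bounded by half a sample period, i.e., about 1.5 ms.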

Evolution Model
In order to estimate the position and orientation of the body coordinate frame in the world frame, one has to track an estimate of its speed and of the magnetic field at the current position of its center. We model the evolution of these quantities with the following differential equations:

ṗ^w = R^w v^b, (4)
Ṙ^w = R^w [ω]_×, (5)
v̇^b = −[ω]_× v^b + a^b + (R^w)^T g, (6)
Ḃ^b = −[ω]_× B^b + ∇B^b v^b. (7)

This model relies on the following assumptions on the environment: • Flat-earth approximation. We assume that the East-North-Up (ENU) earth frame at filter initialization is an inertial frame.
• A magnetic field stationary in the world frame, although possibly spatially non-uniform, leading to the spatial gradient term in (7).
The first assumption is, in practice, common in the VINS literature, in which gyroscopes are not precise enough to measure the earth rotational speed (roughly 7 × 10⁻⁵ rad/s). However, if high-end gyroscopes had been used, an estimate based on the simple model (4)-(7) would likely introduce some error, confusing the earth rotational velocity with the bias estimates. In our opinion though, even in the case of high-end hardware, it is not clear that the visual pipeline used in this work would be able to provide information reliable enough to estimate biases with sufficient accuracy to be influenced by the earth rotational speed. Addressing this would be a significant challenge.
Note also that the world frame differs from the ENU frame defined at filter initialization: they both have their origin at the position of the center of the body frame at initial time, but they can differ by a rotation around the gravity direction, as heading is not observable at initialization. Equation (7) is the key equation for MI-DR. It relates the evolution of the magnetic field to kinematic quantities and the local magnetic gradient. It actually renders the body velocity observable, provided the matrix ∇B^b is invertible. However, it fails to give useful translational information if the magnetic field is uniform. This happens outdoors, where the magnetic field is uniformly equal to the earth magnetic field, or in very large empty rooms. In the latter situation, magnetic gradients can vanish a few meters away from any wall, ceiling or floor.
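The velocity observability argument can be made concrete. Assuming the magnetic evolution equation takes the standard MI-DR form dB^b/dt = −ω × B^b + ∇B^b v^b (Equation (7)), the body velocity can be recovered by inverting the gradient matrix when it is well conditioned. The snippet below is an illustrative sketch, not the filter itself; the singular-value threshold min_sv is a hypothetical tuning parameter.

```python
import numpy as np

def velocity_from_magnetics(dB_dt, omega, B, grad_B, min_sv=1e-2):
    """
    Invert the magnetic evolution equation dB/dt = -omega x B + grad_B @ v
    for the body velocity v. Returns None when the smallest singular value of
    grad_B is below min_sv (weak gradient, e.g. outdoors), where the velocity
    is effectively unobservable.
    """
    if np.linalg.svd(grad_B, compute_uv=False)[-1] < min_sv:
        return None
    return np.linalg.solve(grad_B, dB_dt + np.cross(omega, B))
```

When the gradient nearly vanishes, the function declines to invert, mirroring the failure mode of MI-DR described above.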
Also, the stationarity assumption can be challenged in some environments, or momentarily when pieces of metal are moving in the vicinity of the magnetometers. While some non-stationarities can be modeled, as in the case of power-line interference [9], in practice, any useful estimator should also be robust to non-modeled effects.

Model Discretization
The model will be used in the discrete extended Kalman filtering framework described in Section 4 below. We thus discretize it as equations (8)-(11), introducing the notation (12)-(17) for the corresponding continuous integrals, with ΔR_k(τ) the rotation matrix that transforms a point in the body frame at time τ to the corresponding point in the body frame at time t_k; it can be deduced from gyroscope integration.
These integrals can be computed from unbiased MIMU measurements. We thus estimate the biases of the accelerometers and gyrometers along with the previously introduced quantities. These biases are assumed to follow a first-order Gauss-Markov stochastic evolution:

ḃ_g = −b_g/τ_bg + ε_bg, (24)
ḃ_a = −b_a/τ_ba + ε_ba, (25)

where the generating white noises ε_bg and ε_ba satisfy W_c = diag(σ²_bg;c I_3, σ²_ba;c I_3).
The presented model discretization exhibits integrals that all have to be computed numerically in order to build the filter. Normally, the choice of the numerical integration method depends on the required accuracy. In the current implementation though, we take a very simple approach: we assume that the biases, the acceleration and the magnetic field gradient (the last two in the world frame) are constant between two MIMU samples, and use the resulting closed-form identities (26)-(31). The error induced by this simple integration scheme is limited by the relatively high frequency of the MIMU samples (325 Hz).
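As an example of the piecewise-constant discretization, a first-order Gauss-Markov bias such as (24)-(25) admits an exact one-step discrete form. The snippet below is a generic sketch (not the paper's code) of that discrete step; the noise-variance formula is the standard exact discretization of the linear SDE, stated here as an assumption.

```python
import numpy as np

def propagate_bias(b, dt, tau, sigma_c, rng):
    """
    One discrete step of the first-order Gauss-Markov bias model
        db/dt = -b / tau + eps,   eps white with PSD sigma_c^2.
    Exact discretization over a step dt:
        b_{k+1} = exp(-dt/tau) * b_k + w_k,
    with w_k ~ N(0, sigma_c^2 * tau/2 * (1 - exp(-2 dt / tau))).
    """
    phi = np.exp(-dt / tau)
    q = sigma_c**2 * tau / 2.0 * (1.0 - np.exp(-2.0 * dt / tau))  # discrete noise variance
    return phi * b + rng.normal(0.0, np.sqrt(q), size=np.shape(b))
```

At dt = 1/325 s and correlation times τ of tens of seconds, the decay factor exp(−dt/τ) is extremely close to one, which is why the simple constant-bias-per-sample assumption is harmless here.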

Sensors Error Model
Sensor errors propagate through the discrete model when it is used for filtering. We assume that the noisy sensor reading at time k (the input vector of the filter) can be written ũ_k = u_k + δu_k. The noise δu_k is assumed to be Gaussian, δu_k ∼ N(0, Σ_u), and is a sensor characteristic. Note that P_∇B, defined in (34), reflects the fact that we explicitly exploit the symmetry of the magnetic field gradient implied by Maxwell's equations: the gradient must be a symmetric matrix with zero trace, and thus has 5 degrees of freedom instead of the nine coefficients of a full matrix. P_∇B ∈ R^{9×5} is the matrix of the linear map generating the vectorized gradient matrix from the minimal gradient coordinates.
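One possible construction of P_∇B is sketched below. The choice of minimal coordinates m = (g11, g12, g13, g22, g23) is an assumption for illustration (the paper does not spell out the ordering); symmetry fills the lower triangle and the zero-trace constraint gives g33 = −g11 − g22. The Vec() convention stacking columns follows the General Conventions section.

```python
import numpy as np

# Minimal coordinates (assumed ordering): m = (g11, g12, g13, g22, g23).
# Vec() stacks columns: [g11, g21, g31, g12, g22, g32, g13, g23, g33].
P_gradB = np.zeros((9, 5))
P_gradB[0, 0] = 1            # g11
P_gradB[1, 1] = 1            # g21 = g12 (symmetry)
P_gradB[2, 2] = 1            # g31 = g13 (symmetry)
P_gradB[3, 1] = 1            # g12
P_gradB[4, 3] = 1            # g22
P_gradB[5, 4] = 1            # g32 = g23 (symmetry)
P_gradB[6, 2] = 1            # g13
P_gradB[7, 4] = 1            # g23
P_gradB[8, 0] = P_gradB[8, 3] = -1   # g33 = -g11 - g22 (zero trace)
```

Any m ∈ R⁵ then maps to a symmetric, trace-free gradient matrix, as required by Maxwell's equations in a source-free region.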

Tight Fusion Filter
This section describes thoroughly the MI-MSCKF filter that is used in the experimental Section 5. Section 4.1 describes the parametrization of the state of the filter, Section 4.2 is dedicated to the propagation step, while Section 4.3 details the magnetic and visual measurement equations.

State and Error State
We define the state space at time index k, X_k, as a manifold which is the compound of:
• the keyframe poses state space K_k, whose elements are the poses of a set of N past frames at time indexes {i_1, ..., i_N}, not necessarily temporally successive but close in time. With poses written ξ_i = (R^w_i, p^w_i), this part of the state has the form (ξ_{i_1}, ..., ξ_{i_N});
• the current MIMU state space S, whose elements gather the current pose, velocity, inertial sensor biases and local magnetic field.
A complete state space element at time index k is thus written X_k = (K_k, s_k). We use the Lie group structure of this manifold and its tangent space to (i) define the error tracked by the filter and (ii) define the Jacobian of the measurement process. This derives from the fact that a perturbation around an element can be expressed as an element of its Lie algebra. We use the ⊞ operator symbol, so that X_k ⊞ δX_k computes a new state in X_k from a tangent perturbation δX_k around X_k. We define it as the regular addition operation for all components of the state except for pose states, where we use ξ ⊞ (δθ, δp) = (Exp(δθ) R^w, p^w + δp). Similarly, we define the reciprocal operator ⊟ as the binary operator giving the perturbation element between two states of X. It is defined as the regular subtraction operation except for pose states, for which it is ξ_1 ⊟ ξ_2 = (Log(R_1 (R_2)^T), p_1 − p_2). We define the error state as the result of the ⊟ operator applied between the true state and the estimated state, the latter denoted hereafter with a hat. It is thus an element of the tangent space at the current estimate.
Note that this implies a parametrization of the rotation error in world frame, and is different from our previous work [13] where rotation error was parametrized in the body frame.
The filtering process propagates the estimated mean along with an estimate of its uncertainty. This uncertainty is represented as a Gaussian density on the error state, ε_k ∼ N(0, P_k), in order to take advantage of the Lie group structure defined previously, i.e., a minimal parametrization and a locally Euclidean structure in the tangent space.
For numerical reasons, the covariance P_k is tracked by the filter in a square-root information form, such that P_k = (Ŝ_k^T Ŝ_k)^{-1}, with Ŝ_k an upper triangular matrix. The next section describes how this quantity evolves through the different steps of the filter: propagation and update.
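The square-root information factor can be obtained, for instance, from a Cholesky factorization of the information matrix. This is an illustrative sketch of the relationship P = (S^T S)^{-1}, not the paper's implementation (which maintains S incrementally through QR factorizations rather than ever forming P).

```python
import numpy as np

def sqrt_info_from_cov(P):
    """Upper-triangular S such that P = (S^T S)^{-1},
    via a Cholesky factorization of the information matrix inv(P)."""
    L = np.linalg.cholesky(np.linalg.inv(P))  # inv(P) = L @ L.T, L lower triangular
    return L.T                                # S = L.T is upper triangular
```

Tracking S instead of P keeps the represented uncertainty symmetric positive-definite by construction and improves numerical conditioning, since S has the square root of P's condition number.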

Propagation/Augmentation/Marginalization
On arrival of a new IMU datum k + 1, X̂_{k|k} and Ŝ_{k|k} are propagated. This is summarized by Figure 3 and split into three steps. First, the MIMU state s_k is propagated with the discretized model (Section 4.2.1), then the full state is augmented with the resulting new MIMU state s_{k+1} (Section 4.2.2). Finally, some parts of this augmented state are marginalized before the update step (Section 4.2.3).

Figure 3. Evolution of the state from time k to time k + 1, for new MIMU data without an image and for new data containing a keyframe.

Propagation
Keyframe pose states are estimates of physical quantities frozen at fixed instants in time; their error does not evolve with time, and they are thus propagated with the identity function. The current MIMU state, by contrast, is propagated according to ŝ_{k+1|k} = f_mimu(ŝ_{k|k}, ũ_k, 0), where the discrete MIMU process function f_mimu summarizes (8)-(11) and (24)-(25).
The MIMU state error grows from three sources of uncertainty: the stochastic model of the biases, the measurement noise δu_k on the input vector, and the uncertainty of the previous estimate. The expressions of the matrices Φ^mimu_{k+1,k}, G^mimu_{k+1,k}, C^mimu_{k+1,k} are derived by a first-order expansion of (42), where P_∇B is defined as in (34). Note that we wrote the transition and noise matrices as functions of the integrals (12)-(17), independently of the way they are computed, so these expressions remain valid if one chooses to compute the integrals with a more sophisticated scheme. Derivatives of these integrals with respect to the biases are also required: these quantities should be computed simultaneously with the integrals. In our implementation, they are easily computed with the approximations made in (26)-(31).

State Augmentation
When MIMU sample k + 1 arrives, the MIMU propagation function is used to augment the state with the new current MIMU state ŝ_{k+1} = f_mimu(ŝ_k, ũ_k, 0), leading to the augmented state X̂^⊕_{k+1|k}. The error-state square-root information matrix Ŝ^⊕_{k+1|k} is augmented accordingly to [14], using the Jacobians derived in the previous subsection and the discrete model noise Q_k at time k.

Marginalization of Old State
Then, in order to bound the size of X̂_{k+1|k}, some parts of it are marginalized. The state elements to be marginalized depend on the type of data available at the current timestamp, as depicted in Figure 3. If only MIMU data (without a new keyframe image attached) arrive at time k, s_k is marginalized. If MIMU data and a new keyframe image are available at time k, the oldest keyframe pose is marginalized together with the non-pose elements of s_k. Note that since the image frame rate is well below the MIMU frequency, the first case happens more often than the second.
Within the square-root information form, marginalization is done similarly to [14]. With Π_k the permutation matrix putting the error states to be marginalized first, a QR decomposition of the permuted square-root information matrix of the augmented error state re-triangularizes it. We obtain the predicted state X̂_{k+1|k} by removing the marginalized states, and its (upper triangular) square-root information matrix Ŝ_{k+1|k} as the trailing block of the triangular factor. If no measurement update has to be performed at the current time step, they can be used directly as X̂_{k+1|k+1} and Ŝ_{k+1|k+1} for the next propagation.
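The permute-then-QR marginalization can be sketched in a few lines. This is an illustrative NumPy version, not the paper's code; a production filter would exploit the sparsity of S rather than dense QR.

```python
import numpy as np

def marginalize_sqrt_info(S, marg_idx):
    """
    Marginalize the error-state entries listed in marg_idx from an
    upper-triangular square-root information matrix S (P = (S^T S)^{-1}):
    permute the states to marginalize to the front, re-triangularize with QR,
    and keep the trailing block of the triangular factor.
    """
    n = S.shape[1]
    keep_idx = [i for i in range(n) if i not in set(marg_idx)]
    Pi = np.eye(n)[:, list(marg_idx) + keep_idx]  # column permutation matrix
    _, R = np.linalg.qr(S @ Pi)                   # re-triangularize the permuted factor
    m = len(marg_idx)
    return R[m:, m:]                              # sqrt-info of the remaining states
```

Writing R = [[R11, R12], [0, R22]], the full information is R^T R, and the Schur complement of its leading block collapses to R22^T R22, which is why simply dropping the leading rows and columns yields the correct marginal factor.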
Note: the marginalization of a joint Gaussian distribution in square-root information form is not often demonstrated in textbooks or lectures, but it can be deduced from the full information form Λ_joint: after re-triangularization, the square-root information matrix resulting from the marginalization of the first M variables is the trailing block of the triangular factor, S_marg = Ŝ_R. This follows by calculus from the usually demonstrated fact that the marginal information matrix is the Schur complement of the block corresponding to the marginalized variables.

Measurement Update
The filter processes two kinds of measurements: (i) the magnetic one, which compares the magnetic field measured at the current timestamp with the magnetic field predicted by the filter; (ii) the visual measurement equation for features whose tracking has just ended.
We first briefly recall the update process of the inverse square-root filter on a manifold. Suppose that a measurement occurs that can be modeled as z̃_{k+1} = h(X_{k+1}) + η, with n_m the dimension of the measurement vector. Writing the dimension of the predicted state as n_s, the measurement error z_{k+1} = z̃_{k+1} − h(X̂_{k+1|k}) and H the Jacobian of the measurement function, the update step finds the tangent correction δX* that minimizes the linearized cost ‖Ŝ_{k+1|k} δX‖² + ‖H δX − z_{k+1}‖²_{Σ_h}. The optimal point can be obtained by a thin QR decomposition of the stacked system; R_u being upper triangular, δX* is efficiently computed by back-substitution. This optimal correction is finally applied to the predicted state with the retraction operator, X̂_{k+1|k+1} = X̂_{k+1|k} ⊞ δX*, with J_r being the Jacobian at e = 0 of R^{n_s} → R^{n_s}, e ↦ (X̂_{k+1|k} ⊞ (δX* + e)) ⊟ (X̂_{k+1|k} ⊞ δX*).
Intuitively, this Jacobian transforms the square-root information matrix from the tangent space of the predicted state to the tangent space of the updated state. Note that, with the current parametrization choice (cf. (38)), this Jacobian is the identity matrix; this was not the case in our previous work [13], where the rotation error was expressed in the body frame.
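The QR-based update recalled above can be sketched as follows. This is an illustrative Euclidean version (no manifold retraction), with an isotropic measurement noise sigma standing in for a general Σ_h; it is not the paper's implementation.

```python
import numpy as np

def sqrt_info_update(S, H, z, sigma):
    """
    One linearized update of the inverse square-root filter:
        min_dx ||S dx||^2 + ||(H dx - z) / sigma||^2,
    solved with a thin QR of the stacked system, then back-substitution.
    Returns the optimal correction dx* and the updated sqrt-info matrix R_u.
    """
    n = S.shape[1]
    A = np.vstack([S, H / sigma])            # stack prior factor and whitened Jacobian
    b = np.concatenate([np.zeros(n), z / sigma])
    Q, R_u = np.linalg.qr(A)                 # thin QR: A = Q R_u, R_u upper triangular
    dx = np.linalg.solve(R_u, Q.T @ b)       # back-substitution on the triangular system
    return dx, R_u
```

Since the posterior information is A^T A = R_u^T R_u, the triangular factor R_u is directly the updated square-root information matrix, with no explicit covariance ever formed.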

Magnetic Measurement Update
The magnetic measurement update is the simplest: at MIMU frequency, it uses the direct magnetic field measurement, which writes B̃^b_{k+1} = P_{B^b}(X_{k+1}) + η_B, with P_{B^b} the projection operator from the state space to the coordinates of the state corresponding to B^b, and σ_B the noise of the magnetometer reading.

Opportunistic Feature Tracks Measurement Update
We use the feature tracks in the same way as proposed in [20]. We derive the equations here for completeness.
When a feature track ends or when the frame in which the feature was detected is about to be marginalized, we process the entire feature track as a measurement constraining the pose of each in-state frame where the feature was detected.
The predicted reprojection of feature i in the frame at time t is obtained by transforming l^w_i ∈ R³, the 3D position of feature i in inertial coordinates, into the camera frame and applying the projection function π : R³ → R² of the camera. We recall that (R^b_c, p^b_c) is the known transform between the body frame and the camera frame. l^w_i is computed by a fast triangulation function from the known in-state poses and the measured observations, noted o_it.
By stacking all these reprojections, we can write the non-linear measurement function, where η_fi is assumed to be additive Gaussian white noise: η_fi ∼ N(0, Σ_C), Σ_C = σ²_c I_{2m}. We subsequently note o_i the vector resulting from stacking the 2-vector observations o_it. The attentive reader may note that this measurement equation is not of the form of (63), because l^w_i is not part of our state: for this reason we cannot directly use (65) and (67). Worse, the computed l^w estimate is correlated with the current state error, so we cannot just use it as a fixed constant. The solution used here consists in expressing a projection of the residual that depends only on the poses, up to first order. We start by linearizing the residual, then proceed with the minimization in two steps to extract the measurement equation actually used. Linearization of r_i : (X, l^w_i) ↦ h_fi(X, l^w_i) − o_i yields r_i ≈ r̂_i + H_X δX + E_f δl^w_i, where r̂_i is the 2m-vector of predicted residual error and E_f is a 2m × 3 matrix of rank 3, m denoting the number of observations of the feature. The rank of E_f is guaranteed during triangulation. Its QR decomposition writes E_f = [Q_1 Q_2] [R_E1; 0]. The properties of square orthogonal matrices with respect to the L2 norm of vectors allow us to split the cost function into two terms, one of which depends only on the current predicted state vector.
As R_E1 is invertible, the first term of the quantity to be minimized can be reduced to zero for any δX. The minimization thus reduces to the second term. Finally, the previous linearized residual is used in the cost function (65), i.e., we use for the H matrix, the z error vector and the measurement covariance Σ_h the projected quantities Q_2^T H_X, −Q_2^T r̂_i and σ²_c I_{2m−3}. Note that everything happens here as if we had introduced the feature position into the state, but instantly marginalized it.
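The nullspace projection described above can be sketched with a full QR decomposition of E_f. This is an illustrative version of the standard MSCKF landmark-elimination step, not the paper's code.

```python
import numpy as np

def project_out_landmark(H_X, E_f, r):
    """
    Remove the landmark-position error from the linearized visual residual
        r ~= r_hat + H_X dX + E_f dl
    by projecting onto the left nullspace of E_f (rank 3), obtained from its
    full QR decomposition E_f = [Q1 | Q2] [R_E1; 0]. Returns the reduced
    Jacobian and residual, which depend only on the pose errors.
    """
    Q, _ = np.linalg.qr(E_f, mode='complete')  # Q is (2m x 2m) orthogonal
    Q2 = Q[:, 3:]                              # left-nullspace basis: Q2.T @ E_f = 0
    return Q2.T @ H_X, Q2.T @ r
```

The projected residual has dimension 2m − 3, consistent with the covariance σ²_c I_{2m−3} used in the update: three degrees of freedom are spent eliminating the landmark, exactly as if it had been added to the state and instantly marginalized.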

Filter Initialization
One sensitive issue in a VINS filter is initialization. It usually involves a specific algorithm, as described in [21,22] for instance. In our implementation, we proceed as follows: after switch-on, the current state is initialized at the origin with an attitude matching the current acceleration direction and a zero speed, both with high variance. The filter is then run using the high-frequency magnetic update equation in order to rapidly obtain a stabilized trajectory from the first few seconds. This trajectory is used to bootstrap the first keyframe pose states and then to start using feature information. This initialization process relies on the empirical observation that the MI-DR filter convergence basin is large. Admittedly, the filter would be severely degraded if the system were switched on in an area where magneto-inertial dead-reckoning is not reliable.

Hardware Prototype Description and Data Syncing
The sensor system is pictured in Figure 4. The camera is rigidly attached 47 cm away from the MIMU system to avoid any potential magnetic perturbation. Such a large distance is necessary because the off-the-shelf camera has not been specifically designed to reduce its magnetic footprint. Reducing the hardware to a wearable size would require sensor co-design, which was not in the scope of this work. For the vision part, we use an IDS uEye 3241-LE equipped with a Lensagon BM4018S118 lens, providing around 100 degrees of field of view. Camera intrinsic and extrinsic parameters are calibrated with the Kalibr toolbox [23]. The MIMU provides data at 325 Hz, the camera at 20 Hz. Magnetometers, accelerometers and gyrometers are all MEMS sensors, digitized with sigma-delta ADCs and carefully calibrated. The camera and MIMU provide timestamps computed from different clocks. We synchronize them offline: timestamp shifts are estimated both at the start and at the end of the records by the checkerboard-based Kalibr calibration toolbox; the clock drift is then deduced and corrected for.
The camera exposure time is capped at 10 milliseconds, reducing worst-case motion blur and allowing each image observation to be timestamped accordingly. However, this choice limits the camera's ability to adapt to very low-light environments.

Filter Parameters Tuning
Most parameters of the filter are chosen in a deterministic and consistent fashion. The MIMU noise standard deviations σ_a, σ_ω, σ_B and the bias evolution parameters τ_bg, τ_ba, σ_bg;c, σ_ba;c are derived from sensor characteristics measured empirically with an Allan standard deviation. The pixel reprojection noise σ_C is set to √2, the diagonal size of a pixel. Only the magnetic part of the covariance, Σ_∇B, is tuned empirically. It is set greater than what would derive from the covariance of the magnetometers' white noise, so as to absorb some modeling errors of the magnetic field, such as small non-stationarities or high values of the second-order spatial terms. Note that the parameter tuning is identical for all presented datasets, in order to draw fair conclusions.

Visual Processing Implementation
The visual processing pipeline aims at constantly tracking 200 interest points well spread over the image. In order to enforce a good distribution of corners across the entire image, we use a bucketing strategy: the Harris corner response is computed over the entire image and we retain only the strongest features in buckets for which the number of already tracked points is below a threshold. We use here a partition of the image into 6 × 8 buckets. Detected corners are tracked from frame to frame with the OpenCV pyramidal KLT algorithm until either: • they go out of the field of view; • the tracking fails; • they are classified as outliers; • the frame where they were first detected is to be marginalized at the next propagation step.
We run a 2-point RANSAC algorithm between subsequent frames for outlier detection and rejection, using the relative orientation from the integrated gyroscope as the rotation between the two frames.
An ended feature track is used as a measurement only if: • it spans at least three poses; • its initial triangulation did not exhibit any degeneracy; • its reprojection error is below a threshold.
Contrary to our previous implementation [13], this threshold is set dynamically with a χ² test. As a result, the criterion becomes looser when the estimated uncertainty of the relative poses increases.
In order to make the visual pipeline more robust to dark areas, we normalize each input image by its average intensity before corner detection and tracking. Some very dark images then become usable for corner detection, despite the significant increase of the photometric noise affecting them. Noise amplification leads to spurious feature tracks, but these are most often correctly rejected by our outlier rejection strategy. Overall, we found that normalization significantly improves the performance of the pure MSCKF VINS algorithm on our dataset. Some raw/normalized images are presented in Figure 5. Note that, in contrast with the tuning of the filter, the parameters of the vision pipeline (KLT window size, number of octaves in the KLT pyramid, RANSAC threshold and minimum Harris score for corner detection) were chosen empirically.
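The intensity normalization can be sketched as a simple gain applied per image. The target mean of 128 (mid-range for 8-bit images) is an assumption for this illustration; the paper does not state the exact normalization constant.

```python
import numpy as np

def normalize_image(img, target_mean=128.0, eps=1e-6):
    """Scale a grayscale uint8 image by its average intensity, so dark frames
    reach a usable brightness before corner detection (noise is amplified too)."""
    scaled = img.astype(np.float64) * (target_mean / max(float(img.mean()), eps))
    return np.clip(scaled, 0, 255).astype(np.uint8)
```

A very dark frame with mean intensity 10 is thus brightened by a factor of about 12.8, at the cost of proportionally amplified photometric noise, which the RANSAC-based outlier rejection then has to absorb.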

Dataset Presentation
We evaluate our algorithm on a dataset of five test trajectories. A pedestrian carries the system depicted in Figure 4 and walks through an industrial facility. The trajectories are specifically designed to be challenging for MI-DR and VINS: they are partly outdoors, with a very low magnetic field gradient, and they also contain visually non-informative portions recorded in the unlit basement of the building. Detailed results on Traj2 and Traj5 are presented in Figures 6-8, the others being more briefly depicted in Figure 9. In all cases, a dedicated plot indicates with a color code the parts of the trajectory where the magnetic field gradient vanishes and the parts where the mean image intensity is low.

Overall Comparison
Three estimators derived from the presented filter are compared. MI-DR is the presented filter without any visual update, MSCKF is the presented filter without any magnetic update, and MI-MSCKF is the proposed filter fusing both kinds of information. Moreover, we also compare our results with the state-of-the-art VINS filter of [24].
Unfortunately, we do not have access to a ground truth trajectory for our datasets. We therefore consider three evaluation criteria: • the estimated trajectories are superimposed on a georeferenced orthoimage in which one pixel represents precisely 0.5 m. We compute an alignment of the trajectory from the checkerboard detected for the first time in the sequence. This alignment results from manually setting the position and heading of the checkerboard frame relative to the coordinate system of the satellite image. Note that no manual scale alignment has been made, hence this visualization allows a rough evaluation of the correctness of the global scale of the estimate, for instance in Figure 6a; • the z profile is globally known, as the pedestrian walks along flat corridors, except when taking stairs to change levels; • a translational error is computed each time the system comes back to its initial position, thanks to a static checkerboard placed at the starting point. This criterion can be visualized in Figure 6c for Traj2, where it is clear that the MI-DR estimate is less stable vertically over the entire trajectory.
The last criterion can be quantitatively evaluated: results are displayed in Table 1, with the error given as a percentage of the trajectory length. The next sections emphasize some differences in the behaviors of the tested methods.

The Fused Estimate Improves MI-DR in Outdoor Trajectories
Figure 6 shows the three versions of our filter on Traj2. The trajectory estimated by MI-DR is very close to the others until some point in the outdoor part; note that the outdoor part corresponds to the weak-gradient part of the trajectory, as depicted in Figure 6b. During this outdoor part, as expected, MI-DR drifts away from the two vision-based estimates, which directly leads to a higher final translation error. The same effect is also clearly visible on Traj3, see Figure 9b.

Data Fusion Improves Local Consistency
By reducing the drift in dark areas or low-gradient areas, the fused estimates improve the local consistency of the position estimate, an effect which is not always visible in the metrics of Table 1.
The benefit of magnetometry information in this respect is demonstrated in the detailed results on Traj2 displayed in Figure 7. The two left plots show a similar situation: the VINS estimate (red) is, for a few seconds, strongly corrected by the filter, leading to discontinuous estimates (see green circles). The MI-MSCKF filter stays smoother over the entire trajectory, thanks to the speed observability provided by Equation (7). Interestingly, the pure VINS estimate rejoins the MI-MSCKF estimate later in the sequence, which makes the temporary drift mostly invisible in the final loop error metric of Table 1. This correction happens when visual information becomes available again: the VINS filter is still able to correct itself after a reasonable drift, through the information stored in the prior. The same effect occurs on all trajectories. Consider for instance trajectory Traj5 depicted in Figure 8, where the lowest part also goes through the dark basement of the building. The MSCKF estimate drifts vertically before being corrected when the pedestrian takes the stairs up again. This effect is shown in detail in Figure 8c, which plots the evolution of the estimated height with respect to time.
In turn, visual information also helps trajectory consistency. This is clearly visible in the strong vertical drift of MI-DR displayed in Figure 6c around 250 s. Again, this drift is corrected as soon as magnetic gradients become sufficiently high to make speed observable again. Note that the horizontal drift is never corrected, though, leading to the larger final drift shown in Table 1.
The reader may have noticed at this point that Figure 8c shows MI-DR failing badly on Traj5. Yet, we would like to stress that the fully fused estimate is still able to outperform the VINS estimate, correctly leveraging magnetic information beyond the breaking point of MI-DR. The reason for the failure of MI-DR is a non-stationary local magnetic field in the first few seconds of Traj5. It perturbs the MI-DR initialization, with dramatic consequences for the rest of the trajectory. The fact that the fused estimate prevents local drift could be, in our opinion, highly beneficial to the long-term performance of an Extended Kalman Filter strategy. Indeed, it could reduce overall linearization errors and the maximum magnitude of corrections, which are recognized in the VINS community as an important drawback of filtering approaches compared to bundle adjustment or optimization-based methods.

Comparison with a State of the Art Filter
We also ran the released binary version of [24] (available at https://github.com/MARSLab-UMN/MARS-VINS; we used commit 8531daf) on our dataset, as we consider it the state of the art in VINS filters for pedestrian navigation. Since [24] takes only stereo images (even though their method also works well with a monocular setup, as stated in the paper), we had to slightly adapt their software to use a monocular input. Note that, to be as fair as possible, we fed their code the same normalized monocular images we use. In doing so, we observed that normalization also drastically improved the performance of their filter on our data.
An in-depth and comprehensive comparison between the two implementations is difficult because the code of [24] is not open. While the inertial handling in both implementations should be similar, the visual pipeline is very sensitive to parameter values and implementation details that are unknown to us. Nevertheless, Table 1 demonstrates that both filters compare reasonably and are clearly below 1% of trajectory-length error.
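For reproducibility, the error metric we quote can be computed as follows, under our assumption that it is the final loop-closure drift normalized by the total traveled path length (the trajectory returning to its starting point, so the first-to-last position distance is pure drift). A minimal sketch on a toy trajectory:

```python
import numpy as np

def loop_error_percent(positions):
    """Final loop-closure drift as a percentage of total path length.
    Assumes the trajectory starts and ends at the same physical point,
    so first-to-last distance is pure estimation drift (our assumed
    definition of the Table 1 metric)."""
    positions = np.asarray(positions, dtype=float)
    path_len = np.sum(np.linalg.norm(np.diff(positions, axis=0), axis=1))
    drift = np.linalg.norm(positions[-1] - positions[0])
    return 100.0 * drift / path_len

# Toy square loop (~400 m walked) with 2 m of final drift -> ~0.5 % error.
traj = [(0, 0), (100, 0), (100, 100), (0, 100), (0, 2)]
print(round(loop_error_percent(traj), 2))  # 0.5
```

Note that this scalar hides the temporary drifts discussed above, which is precisely why the local consistency analysis of Section 5.4.4 is needed in addition to Table 1.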
We also observed in the implementation of [24] strong filter corrections after dark areas, as described in Section 5.4.4. This indicates that this local consistency problem, which is essentially solved by the proposed fused estimate, is indeed a general issue for all VINS filters.

Conclusions
This work presented a filter to fuse information from a magnetometer array with the other sensors traditionally used in VINS. We described the method in detail and discussed its results on real datasets. Comparing the results of three estimators (one using only magneto-inertial information, one using only visual-inertial information, and one fusing both), we showed that the fused estimate leads to a more robust trajectory estimate. First, our fused estimate is able to reconstruct the trajectory outdoors, where MI-DR techniques break because of the lack of gradient; second, our system avoids the unrealistic trajectory corrections of VINS after a significant duration without good illumination. One trajectory also highlighted the need for either robust estimation techniques or an outlier rejection scheme for magnetic information, which deserves further work in the future.
We think that the proposed approach could help improve localization systems for augmented reality (AR) that currently use VINS with consumer-grade cameras and IMUs. Not only does this combination extend the applicability domain of traditional VINS to degraded environments, but we also foresee opportunities to reduce power consumption. Indeed, taking advantage of the good trajectories given by MI-DR in various conditions, it might be possible to reduce the computational load of the visual pipeline, which could be of major interest in practical applications.