VIMO: A Visual-Inertial-Magnetic Navigation System Based on Non-Linear Optimization

Visual-inertial navigation systems are credited with superiority over both pure visual approaches and filtering ones. In spite of the high precision many state-of-the-art schemes have attained, yaw remains unobservable in those systems all the same. More accurate yaw estimation not only means more accurate attitude calculation but also leads to better position estimation. This paper presents a novel scheme that combines visual and inertial measurements as well as magnetic information for suppressing deviation in yaw. A novel method for initializing visual-inertial-magnetic odometers, which recovers the directions of magnetic north and gravity, the visual scalar factor, inertial measurement unit (IMU) biases etc., has been conceived, implemented, and validated. Based on non-linear optimization, a magnetometer cost function is incorporated into the overall optimization objective function as a yawing constraint among others. We have done extensive research and collected several datasets recorded in large-scale outdoor environments to certify the proposed system’s viability, robustness, and performance. Cogent experiments and quantitative comparisons corroborate the merits of the proposed scheme and the desired effect of the involvement of magnetic information on the overall performance.


Introduction
As befits one of the most crucial problems in robotics, SLAM has received much attention for the past two decades and opened up many new vistas for autonomous robots, alongside numerous proposed approaches and schemes for implementing it, which fall into three categories by sensing modality: Laser SLAM, visual SLAM, and visual-inertial SLAM. Laser SLAM preponderated in the early stages of the development of SLAM by virtue of its high precision, long range, and capacity for obstacle avoidance. Notwithstanding its advantages, its scale in size and weight means its application is confined to platforms allowing a limited load and a capacity enough for carrying a laser scanner, which therefore circumscribes the application range and agility of the platform. Being cost-effective, lightweight, and efficient in energy consumption, cameras, or vision-based sensors in general, are in the ascendant, which is also attributed to its informative representation of geometry in a single image caught on camera. A desirable consequence of images being abundant in information is the capacity to retrieve previously registered scenes, known as scene recognition, so as to perform loop closure to curb drift in estimates.
For a vision-based SLAM system or those reliant mainly on visual perception, four components are deemed indispensable, including visual tracking, filtering/optimization, loop closure, and mapping, each of which holds systemic effects on the overall performance of a system and to which various solutions have been conceived, implemented, and experimented. Approaches to tracking visual cues can be categorized into such two classes as feature-based and direct methods according to how they implement data association. Feature-based methods rely on the descriptors of features extracted in images to associate image points appearing across several images, and points existing in more than one images can then be used to estimate poses with stability due to the invariance of descriptors. Apart from the relatively high computational cost of extracting features and descriptors, one of the penalties to feature-based methods is their lack of resistance to texture-less views. The descriptor, dependent on features being identifiable, does not contain the amount of information enough to perform data association. Direct methods, however, eschew this problem by direct exploitation of changes in illumination and by minimization of photometric errors rather than geometric ones, thereby being more impervious to scarcity of texture as well as driving down computational consumption since they don't have to describe features. The concept behind every formulations of estimation is probabilistic modelling with noisy measurements as input to estimate certain parameters. Typical models are based on maximum a posterior (MAP) in which the final estimates maximize the a posteriori probability given existent measurements. For pure visual SLAM, the probabilistic model is usually based on Maximum Likelihood (ML) in the absence of prediction models. Earlier SLAM schemes estimate states mainly in a filtering way in which the prior parameters are always marginalized out to focus merely on current states for containment of computation, i.e., all the previous states are fixed, which is bound to affect precision. Recently nonlinear optimization-based methodology has begun to gain ground among scholars alongside increases in the performance of hardware and decreases in its cost. Due to the sparsity inherent in the Hessian matrix of BA (Bundle Adjustment), the computational complexity of estimation in SLAM is lighter than would be conceived. By benefiting from the sparsity and properly assigning tasks among several parallels and with existent optimization modules including Ceres, g2o, iSAM, and gtsam which have been leveraged across many platforms of navigation, optimization-based systems can run smoothly within reasonable time but still need to marginalize out older states to restrict the number of keyframes. The factor graph based on fast incremental matrix factorization, on the other hand, allows for a more accurate and efficient solution by virtue of recalculating only the matrix entries in need of change, whereby the system can considerably slash computation while holding the whole trajectory.
The inertial measurement unit (IMU), light in weight, size, and price, and often in the form of a MEMS (Micro-Electro-Mechanical System), has been gaining in popularity as a source of inertial information complementary to vision since it recovers poses from motion itself and permits rendering observable the metric scale of monocular vision and the direction of gravity whereas visual constraints, in reverse, given accurate data association, can check error accumulation in the integration of IMU measurements (angular velocity, acceleration). The fusion of visual and inertial information began with loosely-coupled mechanisms and has now transitioned into tightly-coupled ones. The procedure of propagating and predicting states with inertial measurements between two consecutive frames and later using a visual image (or several) as an observation to update estimates is characteristic of loosely-coupled mechanisms which are straightforward but fall short of gratifying decision because they fail to take into account the correlation between the two types of data, as opposed to which, tightly-coupled ones fare better by incorporating variables specific to vision (the coordinates of landmarks) and those to inertia (gyroscope and accelerometer biases) into the whole set of optimization states, which in essence utilizes the complementary attribute to a greater extent. To ameliorate the additional computational expense incurred by the introduction of inertial constraints into the optimization graph, in which case the Hessian matrix is no longer as sparse as that of purely visual models, pre-integration on manifold may be adopted to reduce calculation for optimization and bias correction.
Merging inertial and visual information leads to the angles of pitch and roll being observable with the last dimension of attitude, namely yaw, still unobservable and bound to drift over time. It is conceivable that high accuracy in yawing estimation can bolster the overall precision as every pose is estimated partly based on previous estimates and is therefore affected by their yaw angles. The magnetometer has long been employed in the field of navigation, often in combination with other types of apparatus such as an IMU. The magnetometer, as its name reveals, is a type of sensors that measure magnetic density in a magnetic field, especially the Earth's magnetic field (EMF), one of the Geophysical Fields of the Earth (GFE) including the Earth's Gravitational Field (EGF). It has been used in a wide range of commercial and military applications, mostly for directional information [1], whereas another way of using it for motion estimation is through measuring magnetic field gradients so as to acquire velocity information as in [2] which formulated a sensing suite consisting of a vision sensor and a MIMU (Magneto-Inertial Measurement Unit) that, on the presupposition of stationary and non-uniform magnetic field surroundings (particularly indoor environments), can render the body speed observable. The said two manners of employing the magnetometer imply that for this sensor there is a dichotomy in how to process its readings between outdoor and indoor scenarios, and it will not be straightforward to reconcile them.
However, little research has been done about integrating visual-inertial frameworks with magnetic observation that could ably suppress the cumulative azimuth error and further facilitate navigation. It seems that the magnetometer is not so much appealing for scholars in the area of SLAM as it ought to be. What keeps magnetometers from being adopted might be its liability to magnetic disturbance. Ubiquitous sources magnetic interference from ferrous metals and even from the platform itself would entail the system both growing in size and demanding more intricate handling. Since applications of SLAM are always towards compactness and generality, the magnetometer has remained out of favour.
We gather that despite the magnetometer's weaknesses, outdoor environments might still be in accord with its characteristics with appropriate measures taken, and therein lies the main motivation of the paper.
In this paper, we present a visual-inertial-magnetic system of navigation based on graph optimization that, apart from visual and inertial measurements, utilizes the geomagnetic field to restrain drift in yaw. From devising the scheme to verifying its superiority, the following tasks have been followed through along the way:

1.
Exhaustive mathematic deductions have been made to support the proposed system theoretically and mathematically, embodying observation models, least squares problems concerned with initialization, the novel optimization framework that fuses visual-inertial-magnetic information, and a novel way of loop-closure formulation aimed at enhancing the robustness to false loops.

2.
A complete and reliable procedure of initialization for visual-inertial-magnetic navigation systems is presented.

3.
An effective and efficient optimization using visual-inertial-magnetic measurements as observation is established.

4.
A suite of sensors and a CPU with other hardware, capable of data acquisition and real-time operation, has been assembled. 5.
The system has been tried and tested on several datasets collected in large-scale outdoor environments. Analysis and comparison of the experiment results attest to the feasibility, efficacy, and excellence of the proposed system.

Related Work
Since [3] came out, large numbers of well-designed systems [4][5][6][7][8] have swollen the ranks of SLAM schemes. Copious studies have been done on visual-inertial SLAM along with various applications that have since been on a continuum to full maturity [9][10][11]. Until today, scholars are still endeavoring to bring it to perfection with novel ideas and approaches [12][13][14][15][16]. Incipient methods fuse inertial and visual measurements loosely where the two types of information are processed, filtered, and used for estimation all separately [17]. In overcoming the notorious inconsistency of loosely-coupled approaches, researchers have come to appreciate that the advantages of fusing tightly, such as improved consistency [18,19]. Whether sources of information are amalgamated loosely or tightly, filtering mechanisms hold more likelihood to arrive at a suboptimal solution in the wake of linearizing all previous states. In break with traditional filter-based methods, non-linear optimization employed in SLAM has been deemed more desirable with its high precision and tolerable computational requirements. Reference [20] copes with holding on to and optimizing an entire trajectory by what the authors call 'full smoothing', but the ever increasing complexity in line with the incremental map and trajectory largely limits its applicability. Reference [21] is generally recognized as one of most early mature systems using non-linear optimization based on sliding window and marginalization for containment of problem size. To make a virtue of escalated computation due to the addition of inertial measurements, reference [22] proposes what is called the IMU preintegration technique that integrates on manifold a segment of measurements between two time points with the Jacobian matrices maintained for correction. In that way, changes in linearization will not need complete re-integration but a few minor adjustments with respect to the Jacobian matrices as long as they are not too large.
Another property of visual-inertial SLAM that comes to scholar's attention is its initialization and what its does is to work out several parameters describing the system's initial states and sensors' relation. Sfm calls for enough motion to be reliable while estimation of gravity's direction is better off under stationary condition. This contradiction suggests that visual-inertial initialization can not be trifled with. Reference [23] proposes a deterministic closed-form method that can recover gravity's direction and scalar factor, but fails to take cognizance of IMU biases, making the system less stable than it would otherwise be. Reference [17] estimates not only IMU biases but velocity using an EFK but the convergence takes long. Reference [24] initializes system on the assumption that MAVs take off flat with as small inclination as possible, which is, of necessity, unlikely to be the case in practice. Reference [25], much like [24], relies on alignment with gravity at the beginning to initialize. In [26,27], initialization does not calculate gyroscope biases, derogating from the precision.
As it develops, SLAM has begun to go beyond passively observing surroundings to actively exploring environments so as to gain coverage, hence the name 'active SLAM' [28]. So-called active SLAM integrates SLAM itself with path planing. Using this technique, a system covers an area autonomously while performing plain SLAM.
Reference [29] is aimed at solving active SLAM problems where coverage is required and certain constraints are imposed. With that end in view, reference [29] proposes a solution to it that focuses on minimization and area coverage within an MPC (Model Predictive Control) framework. It uses a sub-map joining method to improve both effectiveness and efficiency. The D-opt MPC problem is resolved with recourse to a graph topology and convex optimization and the SQP method is employed to address the coverage problem.The main contribution of [29] is it presents a new method capable of generating a sound collision-free trajectory so as to better perform coverage tasks than many other systems do.
Reference [30] presents an effective indoor navigation system for the Fetch robot. The main idea is founded on sub-mapping and DeepLCD and the system is implemented using Cartographer and AMCL (adaptive Monte Carlo localization). The main method comprises mapping and on-line localization modules and sequential sub-maps are generated by fusing data from a 2D laser scanner and a RGBD camera. AMCL is used to perform accurate localization according to image matching results. Not only the localization system itself but novel evaluation methods for it are presented in the paper and by using it the robustness and accuracy of the system is demonstrated.
Reference [31] presents a navigation method whereby a flying robot is able to explore and map an underground mine without collision. Simulations have verified that the system performs as well as the authors expected whether the robot is circling above flat or sloping ground. The authors also claim in the paper that the system is as simple as it is reliable.

Notation
We employ the following symbols throughout this paper. (.) W denotes the world frame, (.) B k the kth body frame which doubles as the kth IMU frame as aligned with the body frame, (.) C k the kth camera frame, and (.) M k the kth magnetometer frame. As for measurements, w k represents the gyroscope's readings, a k the accelerometer's, and h k the magnetometer's, at kth frame. (u, v) k the coordinates of a feature point in the kth image.
The overall states to be estimated are expressed in such a vector as x = P, q, V, bg, ba, q BC B k with P the position, q the body orientation quaternion, V the velocity, b g and b a the gyroscope's and accelerometer's biases respectively at the kth frame, q BC the transformation from frame C to frame B intrinsic to the installation and thus considered to be invariant. L W symbolizes a landmark expressed homogeneously with L W = x, y, z, 1 .
θ × represents the askew matrix of the vector θ: Note that θ × T = −θ × by its very definition.

Useful Properties of SO(3)
SO (3), the special orthogonal group in 3 dimensions, describes rotations in 3D space, with its corresponding vector so(3) in parameter. SO(3) and so(3) are related to each other through the exponential and logarithmic maps: Four properties of SO(3) are essential in IMU pre-integration and Jacobian computation for optimization and thus merit a mention: • commutative • the righ-hand jacobian Note that there is also J l called the left-hand jacobian as opposed to J r . We shall refer to them as J r and J l for the following content.

Vision
We use a pinhole camera as a source of visual measurement. Under the pinhole camera projection model, a 3D point is projected onto a plane with Z normalized and then projected onto the image plane with projection parameters intrinsic to the camera: where u, v denote the coordinates of the point projected onto the plane. In most cases, coordinates go through radial and tangential distortion on the normalized plane with parameters called distortion coefficients before scaled and displaced to form the final pixel coordinates. The distortion coefficients vary in value and number according to the lenses of cameras. Lenses with a great degree of distortion should be treated with up to 3 coefficients to express the distortion properly. The lens of the camera used in the proposed scheme is an ordinary one and only two distortion coefficients are adopted.
The Jacobian matrices of pixel coordinates with respect to distorted ones, distorted ones to normalized ones, and normalized ones to 3D coordinates are present as follows: The overall Jacobian matrix can then be obtained by applying the chain rule:

Inertia
Inertial data are obtain from an IMU at successive time instants at a frequency of 200 Hz: where w m and a m are acceleration and angular readings, n w and n a conceived of as Gaussian white noise, b w and b a modelled as random walk, and w t and a t the angular rate and acceleration in the world frame.
The error-state kinematics in continuous time are In traditional navigation schemes where filters are often used, states are predicted by integration of inertial measurements based on the prior states, indicating that accurate estimates of the initial states are crucial since even slightly skewed gravity direction would translate into enormously egregious errors in position and velocity estimates, giving rise to complete divergence. For positioning systems based on state optimization applied in more generic environments, not only is the unknown initial orientation problematic but the necessity of the system being re-linearized after each optimization step acts as a further drag. With a view to tackling this issue, the concept of IMU pre-integration has emerged and been well applied in visual-inertial navigation. Through multiplying both sides of the kinematics by a rotation matrix intended to transform the reference frame from the world frame W to the beginning frame B k in question, relative integration terms independent of state variables and gravity can be separated from other terms, obviating the need for re-propagation.
We reckon that mid-point integration is accurate enough as well as being efficient in computation. As IMU readings arrive at regular intervals, the integration goes on step by step as follows: where ∆R, ∆V, and ∆P are the pre-integration terms with the subscript (.) k denoting IMU frames. Note that the pre-integration terms are independent of state variables except biases. No re-propagation will be required after the position P k , the velocity V k , and the rotation R k change and for small alterations in biases the pre-integration will be adjusted according to the jacobians, caculated iteratively alongside pre-integration, of errors in ∆R, ∆V, ∆R with respect to the biases b w and b a . For error propagation, the error-state kinematics differ from those corresponding to Euler integration since they involve measurements at both the previous and the next time instant. The error updates are as follows: δbw k+1 ←δbw k + n bw ∆t (33) where δ(.) indicates error states, ∆t the discrete time interval between t k and t k+1 . Note that in the above equationsw m andā m take the place of for notation simplicity. For the Equation (30) of rotation error propagation, the angular error is defined on the right side, i.e., R true = δRR, probably opposite to most other definitions.
The jacobians δθ δb w , δV δb w , δV δb a , δP δb w , δP δb a calculated by (30), (31), and (32) are used for correction of pre-integration values in response to variations in biases. If the norm of the bias vector b w b a T (rad/s 2 for b w and m/s 2 for b a ) reaches above a threshold of 10 −4 , re-propagation over this period is warranted and executed as the linearization point has changed too much.
According to (30)-(34), the error-state transition matrix is and the Jabocian matrix of pre-integration with respect to the noise vector.
The propagation of errors and covariances for pre-integration is summarized as where Q i is the covariance matrix of Gaussian white noise determined from the IMU datasheet or through calibration experiments. P B k B k+1 propagated by (38) during the pre-integration from f B k to f B k+1 accounts for uncertainty, or observation noise covariances, and is used to weight inertial residuals.

Magnetism
Besides vision and inertia, geomagnetism is also involved to relieve localization of the accumulative yawing error that would otherwise be ever mounting up.
As with the accelerator that can measure the gravitational field on a static platform, the magnetometer purveys readings of the projection of its surrounding magnetic field onto its body frame. While what it gauges is not the projection purely of the EMF, traditionally for centuries it has been used for bearing information. Figure 1 illustrates the vector of the total density EMF T and its projections on the geomagnetic and geographic coordinate axes. Such parameters as T, I, D can be determined by referencing geomagnetic maps that describe the geomagnetic features of various locations around the globe.
The average magnetometer's calibration model is k xx k yx k zx k yy k xy k zy k zz k xz k yz where: M x M y M z T is the magnetometer's raw readings that are a coefficient matrix and a bias vector away from the true magnetic projection on the sensor's body frame M xt M y t M zt T ; k xx , k yy , and k zz are scale factor coefficients; k xy , k yz , and k zx are the transverse coefficients caused by the magnetometer's axes non-orthogonality; b m x , b m y , and b m z are correction coefficients for biases created by local magnetic field.
According to Figure 1, the observation model of the magnetometer is where R MB is the relative rotation matrix from the body frame f B to the magnetometer's frame f M and is determined though calibration. Following from the magnetometer's observation model, the Jacobians of the measurement function with respect to the variable to optimize R WB are where the perturbation is defined on the right side of R WB (globally defined). Figure 2 depicts the system's structure plainly.

Figure 2.
Overall system structure.

Visual-Inertial-Magnetic Initialization
The initialization of monocular visual inertial odometry is as crucial as it is intricate, on account of its precarious structure. Futile or incomplete initialization spell trouble for the entire system. On the one hand, monocular vision calls for a certain length of translation long enough to reflect the depths of key points, on the other hand, the projection of gravity onto the body frame can only be calculated when there's no extra acceleration other than gravity.
This section intends to present and lay out a novel and efficacious procedure of initialization. As drawn out in Section 2, successful initialization is a prerequisite for the system to start off properly. How accurate the initial parameters are estimated will play a huge part in how stably and smoothly the system operates. The involvement of magnetometers introduces extra parameters to be determined, namely the initial magnetic bearing. The proposed method delivers visual-inertial-magnetic initialization with efficacy and credibility assured in some measure by deciding whether or not it is initialized successfully through a set of specific criteria. Figure 3 illustrates the procedure of initialization.
As opposed to VI-SLAM, the system uses magnetic information as rotation constraint during visual recovery with the intention of expediting the initialization process and enhancing its precision.

Visual Recovery and Rotation Calculation through Magnetism
The task the visual module undertakes is estimating the relative transformation with respect to the first frame by vision itself. Using a monocular camera without depth information indicates the first step is to extract from two selected images the essential or homography matrix which can be decomposed to recover the transformation between the images. The two images for recovering poses are selected if there's enough parallax between them. The essential matrix is better at computing poses if the camera's moved, whereas if it is only rotated without translation the homography matrix fare better. The problem is there's no way of establishing whether or not the camera's moved because either pure rotation or translation causes parallax between two images. For better robustness, we compute and decompose both of them and check through whose transformation the reprojection is smaller to decide to use which matrix. Key points tracked in the two images are triangulated to determine depths if there's sufficient translation between them. Those 3D key points are then utilized to execute PnP (Perspective-n-Point) on all the intervening key frames under the auspices of BA (Bundle Adjustment). Magnetic information is also exploited for rotation estimation by incorporating its measurements into the BA problem as extra cost functions in Equation (42).
where M k denotes magnetometer's kth frame and R BM is the rotation from magnetometer's frame to body's frame.

Visual-Inertial-Magnetic Alignment
Before alignment, gyroscope biases need to be worked out. The reason why only gyroscope biases are computed is because attitude holds much more influence on pose estimation as its estimates dictate whether gravity can be rightly projected onto the body frame. By solving Equation (43) through least squares, gyroscope biases b w B k can be obtained: where ∆R B k B k+1 is the rotational preintegration and ∂∆R B k B k+1 ∂b w B k is its partial derivative with respect to gyroscope biases. Section 4.2 lays out how to maintain this partial derivative. come from the visual recovery module. Every combination of adjacent images and the preintegration in-between forms a set of equations like (44), stacking up into a least squares problem.
The process described above is merely visual-inertial alignment after which the z-axis of the world frame is aligned with the vector of gravity, velocity states projected onto the world frame, and key points' depths scaled to proper size. The next step is to align the magnetometer with the world frame by making its projection on the xy plane of the world frame parallel with the x-axis.
After the complete alignment, the world frame's z-axis is parallel with the direction of gravity and x-axis with magnetic north.

Initialization Completion Verdict
It is not beyond the bounds of possibility that parameters initialized could be dubious. A common case explaining this phenomenon is when the system keeps moving in one direction at a constant speed, not administering sufficient excitation to IMU.
Inspired by [32,33], we examines initialization's efficacy by reviewing estimation error that links in to a degree with the uncertainty of the initialization system.
With the estimation error satisfactorily low, the system will be notified and go to the optimization stage.

Joint Nonlinear Optimization
As regards calculation of Jacobians, it warrants mention that the perturbation is defined on the right side where rotation variables are involved, as evidenced in [34] to have better properties.
Different definitions of the perturbation for differentiation certainly lead to different Jacobians and inconsistency of which side the perturbation is operated on and how the Jacobians are calculated is heading for failure in the process of optimization.

Visual Constraint
Feature points for tracking are extracted from every image [35] and data association across image frames is realized by tracking them through optical flow [36], which, as exhibited in real-time operation, relieves the visual front-end of heavy computation because it does not have to describe features and subsequently match them as do standard feature-based methods. The number of successfully tracked points will surely diminish as images arrive and pass frame after frame either because of points moving out of the image region or simply of tracking failure. Additional features are extracted in every frame where the number of successfully tracked points drops to a certain amount. Non-maximum suppression is applied to the extraction of key points to obtain a more even distribution of points that would otherwise cluster around a few areas little more than single features.
As key points have to be dispersed to better represent the whole image, so do frames need to be selected as key frames and the number of them be curtailed to avoid redundancy. Other than representing a solitary frame and indicating a static state through small errors of re-projection, frames with little parallax in-between or even identical due to a stationary state hold little significance for the whole, probably quite long and large, trajectory and map retained in the system for loop closure detection and other uses. A key frame ought to hold enough connections with its previous and next key frames through co-visibility [37] for pose coverage and efficient estimation in regard to the local graph while being distinct enough from its adjacent ones for the conciseness of the whole pose-graph. In our formulation, whether or not a frame is "key" is conditioned by its key points' association with the previous and next frames, or specifically, the ratio of the number in the next frame to that in the current frame of successfully tracked key-points that originate from the previous or older frames. A high value (say 0.8) of this ratio suggests that the majority of key points tracked from previous frames to the current one are well observed again in the next frame and thus this frame may be considered to be redundant in the presence of its neighbours, whereas the lower the ratio is, the more contributory the frame is to the observation and retention of landmarks. A practice is to pre-set a ratio threshold above which a frame is deemed 'non-key' and is consequently going to be marginalized out after the current round of optimization.
We use a plain pinhole camera as the visual sensing module with an ordinary projection model that has been presented in Section 4.1. The re-projection error is where i indexes landmarks and k denotes frames, variables to be optimized in bold type. π(.) is the projection function whose Jacobian matrix (J pro 2D→3D ) with respect to the original 3D point vector has been demonstrated in Section 4.1. T BC is the transformation from f C to f B that is regarded as being constant provided the sensors are rigidly fixed on the platform, and the optimization of which is thus optional with it calibrated in advance. The Jacobians of the re-projection error with respect to each variable are as follows: where t BC and t WB are the translation parts of T BC and T WB . J pro 2D→3D are the Jacobian matrices of pixel coordinates with respect to normalized 3D points as duduced in Section 4.1. The above jacobians are derived with the rotation and translation variables treated separately as they are when being updated, rather than computed as a whole as in [38].

Inertial Constraint
As inertial observations take the form of the integration of measurements between two adjacent frames aligned in time with camera frames, they somewhat resemble a measurement of the relative transformation between each two frames, save that gravity is incorporated in the pre-integration. The equations below express the observation model of pre-integration.
where (.) B k and (.) B k+1 denote the first and last frames of the pre-integration and the variables to optimize are highlighted in bold type. The whole state vector to be optimized according to the pre-integration observation from f B k to f B k+1 is given below: with the corresponding residuals: where the variables to be optimized are in bold type as before. Note that the representations of R B k B k+1 (b w ) and R B k B k+1 (a w ) indicate that by nature the two biases would not change over a period of pre-integration.
The optimization of the variables for inertial residuals needs Jacobians, as do other optimization problems, around various linearized points to nudge variables towards at least a suboptimal solution. While the analytical form of Jacobians may not necessarily be required as there are available optimization modules capable of automatic differentiation, the analytical form is favourable nonetheless in its optimality and efficiency. The equations below present the Jacobians of pre-integration residuals with respect to variables

Magnetic Constraint
Magnetic observation, as laid out in Section 4.3, is the projection of the vector of the EMF onto the body frame: where R MB is the relative rotation matrix from the body frame f B to the magnetometer's frame f M and is determined though calibration. With the observation model, it is fairly straightforward to establish the magnetic residual: where R MB and R W B are to be optimized. The corresponding Jacobians: Note that M M and H W are always normalized with norm equal to 1. H W is the vector of the total intensity of the EMF whose projection on the X-Y plane is in the direction of magnetic north cross true north at an angle called magnetic declination which is not concerned in our system since we use magnetometers only for suppressing yaw estimation shift.

Loop Closure Constraint
Loop closure constraints come from the front-end's detection of loop closures when the robot arrives where it has roamed and thus its trajectory loops. Loop closure constraints take the form of where i and j are the indexes of frames between which a loop closure occurs. f i is the older one whose pose together with the landmarks it holds are fixed on the grounds that the older a frame is the less its estimate has drifted. Equation (78) is almost identical to the visual constraint, save that the landmarks of the older frame are in regular type, signaling that they are not to be optimized since their would not be any other variable more precise than the start of the loop. Apparently, only frames involved in loop closures are to be adjusted, leaving the rest unchanged, if optimization is executed merely on loop closures. That brings about inconsistency in the pose graph. One putative way is to combine odometry constraints and loop closures. Odometry constraints are relative pose transformations obtained from the current pose graph.
where T ij is a relative transformation taken from the existing pose graph as a measurement. The standard pose graph optimization problem is then presented as As [39] suggests, optimization of a pose graph is no more than a plain nonlinear least squares problem. What makes it call for extra cautious treatment is its susceptibility to false positive loop closures. Even a single outlier made by the front-end has the potential to bring the optimization into complete divergence or make the pose graph wrongly deform. One way to tackle this is to obviate outliers as early as in the detection stage. By applying RANSAC or other similar strategies under geometric models, outliers can be fairly picked out and eliminated, making false loops less likely to be carried over into the following optimization. Another relies on what is called the robust cost function which bears much resemblance to the Huber function [40] and can mitigate the impact of outliers by downgrading the cost functions to linear functions rather than quadratic ones, where the overall cost is high enough to be deemed harhouring false loops. Neither of these is foolproof. Systems using these methods are liable to suffer from wrongly detected closures all the same. False positive loop constraints are as much problematic as they are difficult to combat. Erroneous edges in the pose graph following detection of false loops can either lead the optimization to diverge or converge to a entirely egregious solution.
Inspired by [39], we adopt the idea that the topology of the pose graph is subject to the optimization with outliers being identified and removed automatically. To actualize this idea, a weighting factor is put on every loop closure constraint as a way of distinguishing normal loop closures from erroneous ones.
In the above equation, w ij is a weighting factor ranging from 0 to 1 corresponding to whether the constraint is active or deactivated, or rather, removed. As the weighting factor holds influence on the cost value, it indeed makes the topology itself subject to the optimization. The weighting factor is then given by a sigmoid function: where s ij is a switch variable switches on and off a loop closure constraint during the optimization process, hence the name switch variable. These switch variables are to be optimized together with others and are set to 10 initially, meaning all loop closure constraints are initially activated. Switch functions alone are apparently not enough to attain what is intended. The optimization module will simply drive all switch function costs to nearly 0 as a result of attempting to minimizing the overall cost. Introducing a penalty cost for each switch variables could counter this phenomenon. Penalty costs are to keep switch variables to the initial value and in the form of prior constraints: where Ξ, the covariance matrix for the penalty cost, is empirically set to 20 2 . Switch functions and prior constraints together enable the optimization unit to winnow out false loops, and the principle behind it is to tap into the knowledge that there's a certain amount of inconsistency between false and true loops and between false ones themselves in all likelihood while true loops tend to be in agreement with each other, and so what the proposed mechanism does essentially is to always slash the impact of loop closures that are incongruous with other loops.

Implementation Details
The navigation system's software is programmed in C++ with recourse to ceres-solver on ROS (Robot Operating System).
The system runs on a standard central processing unit (Intel R NUC Kit NUC7i7BNH, Intel R Core TM i7-7567U Processor, 16 GB RAM). The suite of sensors is rigidly installed on top of the vehicle and collects data as it is moving. Figures 4-6 are pictures of the sensors employed in the system. Figure 7 is the sensor suite with the CPU. Figure 8 is pictures of the vehicle used.

Experiments
We conducted three large-scale outdoor experiments each of which covers over 1 kilometer by collecting and saving data into a bag file for later analysis. We examined the performance of VI-SLAM (visual-inertial SLAM) and VIMO (the proposed visual-inertial-magnetic navigation system) by running them on recorded datasets to glean quantitative results. Figures 9-11 are three experiments on EUROC datasets "MH 01 easy", "MH 03 medium" and "MH 05 difficult" for VI-SLAM and OKVIS. VIMO is not compared with them, since the public datasets do not contain magnetic measurements. As is illustrated by the pose error map in Figure 9, the absolute pose error by VI-SLAM is (−0.046 (m) to −0.606 (m)) compared to that by OKVIS (−0.051 (m) to −1.184 (m)).
In Figure 10, for the dataset "MH 03 medium", absolute pose error is dragged down from −1.714 (m) by OKVIS to −1.452 (m) by VI-SLAM.
In Figure 11, since the dataset is labeled 'difficult', the accuracy has fallen for either system, with −2.102 (m) by OKVIS and −1.686 (m) by VI-SLAM. An absolute decrease of 0.416 (m) in error is shown. Tables 1-3 show differences in accuracy between OKVIS [21] and VI-SLAM for EUROC datasets. It is shown that VI-SLAM generally outperforms OKVIS.  In Table 1, the root square mean error by VI-SLAM (0.314019 m) and that error by OKVIS is (0.623834 m).
In Table 2, all error terms except for root mean square error are smaller by VI- In Table 3 Figure 15 compares VIMO with OKVIS [21], a state-of-art method. Improvement in performance is demonstrated.
Another phenomenon warranting consideration is that for VI-SLAM the absolute pose error is larger (Max: 21.8827 m, Mean: 11.0507 m, Median: 11.8128 m, RMSE: 12.4916 m, StD: 5.8243 m) in experiment 2 ( Figure 13) than in other experiments, while for VIMO it is quite the contrary. We conjures that it is because the shorter the vehicle travels along a straight line, the lesser error should have been incurred if it were not for accretion in yawing estimation, and since VIMO is immune to yawing deviation it is able to achieve much higher performance, which, again, brings to the fore the significance of magnetic information and the superiority of VIMO.
Various error terms, including max absolute pose error, mean, median, min, root mean squared error, standard deviation of errors, are arranged in Tables 4-6. Quantitative comparisons between VI-SLAM and VIMO drawn from the tables reveal the potency of fusing magnetic information into the system. Tables 4-6 shows differences in accuracy among OKVIS [21], VI-SLAM and VIMO. As they demonstrate, both VI-SLAM and VIMO are able to achieve better results than OKVIS do for every error item. In Table 4, the root square mean error for VI-SLAM (7.6135 m) is about half that for OKVIS (16.3864 m), and VIMO makes it even lower (3.8741 m). In Table 5, the difference between VI-SLAM (12.4916 m) and OKVIS (19.9318 m) is not so pronounced as in Table 4, but for VIMO the error is nearly 20 times smaller (2.7394).

Conclusions
This paper presents a visual-inertial-magnetic navigation system with the originality of exploiting EMF as yaw observation. Respective measurement models are presented in Section 4. Section 5 provides an overview of the system's structure. We present mathematic fundamentals and theories concerned with visual-inertial-magnetic initialization and non-linear optimization in Sections 6 and 7, together with other algorithmic details. Lastly, we demonstrate the validity and superiority of our system over visual-inertial-only ones through 3 outdoor large-scale experiments with their error analysis.
According to Section 8.2, VIMO performs localization more accurately than both VI-SLAM and state-of-the-art methods such as OKVIS, slashing errors by half or more (from 16.3864 m to 1.7543 m in Table 4; 19.9318 m to 2.7394 m Table 5; 11.8855 m to 6.6390 m Table 6 in terms of root mean square error).
The most significant implication of the proposed system is that it opens up new vistas for the development of navigation systems with a new combination of sensors and new ways of information fusion and state estimation.
Future research shall follow the line of designing installation mechanisms that can blot out as much magnetic interference as possible, making better use of magnetometers for initialization (for example applying magnetic measurements to the least squares problem in initialization for more accurate estimation), further improving the system's overall performance.  Acknowledgments: Zheng Jiang, from School of Automation, Beijing Institute of Technology, offered technical support and advice.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: