An Enhanced Hybrid Visual–Inertial Odometry System for Indoor Mobile Robot

As mobile robots are being widely used, accurate localization of the robot counts for the system. Compared with position systems with a single sensor, multi-sensor fusion systems provide better performance and increase the accuracy and robustness. At present, camera and IMU (Inertial Measurement Unit) fusion positioning is extensively studied and many representative Visual–Inertial Odometry (VIO) systems have been produced. Multi-State Constraint Kalman Filter (MSCKF), one of the tightly coupled filtering methods, is characterized by high accuracy and low computational load among typical VIO methods. In the general framework, IMU information is not used after predicting the state and covariance propagation. In this article, we proposed a framework which introduce IMU pre-integration result into MSCKF framework as observation information to improve the system positioning accuracy. Additionally, the system uses the Helmert variance component estimation (HVCE) method to adjust the weight between feature points and pre-integration to further improve the positioning accuracy. Similarly, this article uses the wheel odometer information of the mobile robot to perform zero speed detection, zero-speed update, and pre-integration update to enhance the positioning accuracy of the system. Finally, after experiments carried out in Gazebo simulation environment, public dataset and real scenarios, it is proved that the proposed algorithm has better accuracy results while ensuring real-time performance than existing mainstream algorithms.


Introduction
Simultaneous Localization and Mapping (SLAM) technology has rapidly developed recently, and it is widely used in fields such as drones, Augmented Reality (AR), and unmanned vehicles. Because of the low price of vision sensors, the improvement of computing power, and the advancement of algorithms, vision SLAM has received widespread attention [1][2][3] and produced many excellent results. However, poor lighting conditions, fast motion, and lack of texture which prone to large positioning error or even system crashes remain many flaws to pure vision SLAM. In order to improve the robustness and positioning accuracy of the system, VI-SLAM fade in researcher's sight as a fusion solution of visual measurement and IMU measurement.
The current fusion framework of vision and inertial navigation can be divided into two ways, one is based on filtering, and the other is based on optimization. The optimizationbased method tends to achieve more accurate result by transforming the estimation problem into a nonlinear least squares optimization problem which is a bundle adjustment problem, and iteratively solving it to obtain higher accuracy. However, it was not until the sparsity of the Hessian matrix involved in the solution process was discovered that real-time algorithms were developed to overcome large amount of calculation. For example OKVIS [4], VINS [5], and ORB-SLAM3 [6] use a tightly coupled method to simultaneously optimize the visual projection constraints and IMU pre-integration constraints. In order to balance the calculation load and positioning accuracy, they use a sliding window method to maintain a certain number of optimization variables. Although the use of sparse matrix factorization can reduce the amount of calculation, VIO system based on optimization method still occasionally leads to a decrease in the performance of the entire system while the calculation resource are limited because of the necessary occupation of the other modules in the mobile robot system, such as the navigation module which needs to occupy certain system resources in addition to the localization module [7].
To make sure of the real-time capability, the fusion framework was mostly based on the filtering method. The state vector of general EKF-SLAM [8] contains IMU pose and map feature points. While IMU and image information are obtained, state variables are propagated and updated. The accurate estimation of the map points can obtain the estimation of unbiased pose. As one of the typical filtering method, EKF-SLAM still has great potential for further optimization of the computation load because the state vector contains a large number of spatial feature points. Mourikis and Roumeliotis [9] improved the system performance by proposing the Multi-State Constraint Kalman Filter (MSCKF) algorithm, whose state vector does not contain spatial feature points but only the pose of camera or IMU at a limited time in the sliding window. MSCKF can achieve accuracy comparable to optimization-based methods under a small computational load. Subsequently, many improved versions appeared, such as S-MSCKF-VIO [10], which extended MSCKF to use stereo cameras to acquire certain scale information and improve the robustness of the system. ROVIO [11] used luminosity for status update and zerospeed detection, which is fast but has poor accuracy. Li and Mourikis [12,13], whose work was named as MSCKF2.0, added the stable tracking SLAM feature points, the cameraimu external parameters, and the time drift between camera-imu into the state vector, and simultaneously estimated them together with the camera state. Furthermore, they also used the First Estimate Jacobian (FEJ) technology to maintain system observability. OpenVINS [14] is a framework based on the MSCKF sliding window, using a suitable typebased state modular programming method, which facilitates the management of covariance. It integrates various improved methods for MSCKF, such as monocular, binocular, FEJ, SLAM feature points, camera intrinsic and extrinsic parameters estimation, etc., which facilitate the comparison of effects between various improved methods and provides complete theory and programming documentation. Adding a sliding window-based optimization method on the basis of MSCKF framework to provide constraints between the pose states in the sliding window is proposed in [15], but this undoubtedly increases the computational load due to the additional use of optimization methods.
Inspired by optimization-based methods, we can use the IMU pre-integration between two images [16,17] as an observation to update the camera state in the state vector. The idea is similar to one described in [15], but it can greatly reduce the computational load and ensure the real-time performance. Since reprojection error of feature points is also involved in the update process, how to choose a suitable covariance weight between it and IMU pre-integration is very important. HVCE [18] can determine the weights of different types of observations, and has achieved good results in fusion positioning applications in global satellite navigation [19,20]. Xu et al. [21] used the HVCE method to calculate the weights of the point and line factors in the point-line VIO system for proper estimation, which improved the accuracy of the VIO system. The proposed method in this article also use HVCE method considering its effectiveness in multi-observation fusion.
Hesch et al. [22] analyzed the observability of the VIO system, and concluded that there are four unobservable directions in the VIO system, namely the yaw angle and three position information. Since the linearization point changes during the state update process make yaw become observable from an unobservable state, the paper proposes to use FEJ to maintain the observability of the system and improve the positioning accuracy of the system. In addition, Wu et al. [23] proposed that in some special situations of the robot, VIO will produce additional unobservable directions. For example, in uniform motion, the scale of the system becomes unobservable due to the lack of output acceleration. Quan et al. [24] Sensors 2022, 22, 2930 3 of 21 designed a tightly coupled VIO system for indoor mobile robots by tightly coupling the wheel odometer and gyroscope.
The goal of this paper is to develop a visual-inertial navigation positioning system with high accuracy and low computational load suitable for indoor mobile robots. From this introduction, we can conclude that the optimization-based VIO method can obtain higher positioning accuracy but heavier computation load than the EKF-based VIO method. Currently, VIO systems based on the EKF framework can achieve localization accuracy comparable to optimization-based methods with a smaller computational load. However, from the knowledge we have obtained so far, the mainstream VIO systems based on the EFK framework mostly use IMU measurement for state prediction and covariance propagation but not next update stage. This article proposed a new idea of applying IMU data in system update for further improvement of accuracy. The main contributions of this paper are: (1) Cosidering the insufficient use of the IMU information in the traditional MSCKF VIO system, the IMU pre-integration method is used to constrain and update the state of the sliding window to improve the positioning accuracy of the system. In order to select the appropriate weight between the covariance of the visual feature point reprojection and the IMU pre-integration, this paper uses the Helmert variance component estimation method in the sliding window update process to select the maximum posterior weights between the visual reprojection and the IMU pre-integration. (2) For indoor mobile robots in the process of positioning using the MSCKF-based VIO system, there are observable changes (such as loss of scale) during start-stop and uniform motion, resulting in a decrease in positioning accuracy. The speed information provided by the wheel odometer is used for zero-velocity detection, wheel odometer pre-integration, and corresponding status updates to improve the positioning accuracy of the mobile robot system. (3) Tests and experiments were carried out in the Gazebo simulation environment, public dataset EuRoc [25], and actual environment. The results of tests and experiments were compared with related mainstream algorithms S-MSCKF [10], VINS-Fusion [5,26], and OpenVINS [14]. Simulations and experiments show that the algorithm proposed in this paper can not only ensure real-time performance but also improve the positioning accuracy significantly.
The organization of this paper follows. After the comprehensive introduction in Section 1, the system overview and mathematical method are introduced in Section 2. In Section 3, we build a simulation environment in Gazebo and some simulations are carried out. Next, experimental verification based on the EuRoc data set and the real Mir robot is shown in Section 4. Finally, discussions and conclusions are given in Sections 5 and 6.

System Overview
General VIO systems based on MSCKF framework can roughly be divided into two processes. One is to use the measurement of IMU to predict the state and propagate the covariance. The other one is extracting and tracking the visual features of the images from the camera and updating the state by visual geometric constraints. This paper adds the IMU pre-integration module on the basis of the original framework and uses the HVCE method to estimate the maximum posterior covariance weight of the visual constraint and the pre-integration constraint, which are used to simultaneously update the state vector variables in the sliding window. Meanwhile, for the indoor mobile robot, more accurate speed information provided by the wheel odometer is used to detect the zero velocity, and the pre-integration constraint of the wheel odometer is also used to update the pose state. Figure 1 shows the pipeline of the system, in which the red box is the module added in this paper on the basis of the general MSCKF framework. straint and the pre-integration constraint, which are used to simultaneously update the state vector variables in the sliding window. Meanwhile, for the indoor mobile robot, more accurate speed information provided by the wheel odometer is used to detect the zero velocity, and the pre-integration constraint of the wheel odometer is also used to update the pose state. Figure 1 shows the pipeline of the system, in which the red box is the module added in this paper on the basis of the general MSCKF framework.

State Representation
In order to consider accuracy and efficiency, MSCKF framework uses a sliding window algorithm for back-end optimization. The state vector variables include the current IMU inertial navigation state, N historical clone camera states, M stable SLAM features, time offset between IMU and camera, and the intrinsic and extrinsic parameters of the camera. It is defined as follows: where I x is the state of inertial navigation system (INS) at the moment k, and its rotation posture represents the transformation from the global coordinate frame G to the local

State Representation
In order to consider accuracy and efficiency, MSCKF framework uses a sliding window algorithm for back-end optimization. The state vector variables include the current IMU inertial navigation state, N historical clone camera states, M stable SLAM features, time offset between IMU and camera, and the intrinsic and extrinsic parameters of the camera. It is defined as follows: where x I is the state of inertial navigation system (INS) at the moment k, and its rotation posture represents the transformation from the global coordinate frame G to the local coordinate frame I k through the unit quaternion I k G q; G p I k and G v I k represent the position and velocity of the IMU body coordinate frame I k relative to the global coordinate frame G; b ω k and b a k represent the bias of the gyroscope and accelerometer; x T C is a set of N history IMU state clones and, in order to update the pre-integration, it also contains velocity and bias information in addition to the general pose information; x L represents the SLAM features for stable tracking; and the intrinsic camera parameters and the extrinsic parameters between IMU and camera are represented by x W .

IMU Dynamic Model and Pre-Integration
A six-axis IMU can measure acceleration and angular velocity of the body frame with respect to the inertial frame through a 3-axis gyroscope and a 3-axis accelerometer, respectively [17]. Its measurement model follows: where I q G is the quaternion representation from the world frame G to the inertial frame I, and R( I q G ) represents the conversion of the quaternion I q G into the corresponding rotation matrix. The true values measured by IMU areω(t) andâ(t), and the measured values are ω m (t) and I a m (t). The kinetic equation is shown as below: G .
According to the previous definition, the state vector of INS at time t is: Since the use of the real IMU state leads to the singularity of the covariance matrix, there are additional units constrained to the quaternion in the state vector. Therefore, the error IMU state is usually used, which is defined as: Assuming the error of the quaternion is δ I t G θ, the update method is: For vector variable position, speed, and offset, standard update methods can be used (e.g., G p I ← G p I + δ G p I ). Forward integration can be used to calculate the IMU state at time k+1 from the IMU state at time k according to the dynamic equation through the discrete linear acceleration I k a m , I k+1 a m and angular velocity I k ω m , I k+1 ω m measured by IMU at time k and k + 1. At the same time, the covariance propagation of the state vector can be performed by linearizing the nonlinear model. According to [14], the error transfer equation can be written in a compact form as: where P is the covariance matrix; Φ k and G k are the Jacobian matrix of system state and noise characterizing the error transfer; Q d is the discrete noise matrix. For details, please refer to [14]. Referring to [5], the pre-integration from time i to j can be calculated whose form is defined as follows: The formula is the calculation of pre-integration under continuous conditions. The IMU measurements at time k and k + 1 can be used to accomplish the calculation in the discrete case by numerical methods (such as Euler integral, median integral, or Runge-Kutta method). While considering the bias and noise, the equation can be get as shown: At the beginning of the pre-integration calculation, are zero vectors. It is assumed that the bias is constant in the calculation process. Refer to Appendix A for the calculation method of pre-integration covariance Σ ij . After calculating the observations I jq I i , I iα I j , I iβ I j , and their corresponding covariance Σ ij , the standard extended Kalman filter can be used to update the state in the sliding window.
are affected by the biases b a k and b g k , and the bias changes are small. In order to avoid calculating the pre-integration again after the bias is updated, the first-order approximation formula is used directly to update after a small change in the bias:q are the pre-integrated Jacobian with respect to the bias, which can be solved iteratively according to the method described in Appendix A.

Point Feature Measurement Update
The feature points used for visual observation update have two types: SLAM feature points for long-term stable tracking, the state of which can be added to the state vector of the sliding window, and the MSCKF feature points lost in tracking. The predicted pixel coordinates of the spatial feature in the camera image at time k can be expressed as: where h d (·) maps the normalized coordinates to the distorted pixels coordinates; h p (·) converts the coordinates in the camera coordinate frame into normalized coordinates frame; h t (·) maps the 3D coordinates of the feature point in the world frame to the coordinates in the camera coordinate frame, and h r (·) converts the feature point representation λ into 3D coordinates in the world frame. ζ are the camera's intrinsic parameters, including focal length and distortion parameters. R C k w and p wC k are the position and orientation of the camera; n k is the measurement noise, usually assumed to be Gaussian white noise of one pixel. By stacking the multiple observations of different feature points, the observation equation can be constructed: where δX k is the error state vector in the sliding window involved in the feature point observation update; H x and H f are the measurement Jacobian matrix with respect to state vector variables and the position of the feature. For the SLAM feature point update, since the feature point state is included in the state vector of the system, there is no H f δ G p f term.
As for the update of MSCKF feature points, since the state of the feature points is affected by the state of the camera, the observation equation can be projected onto the left null-space of H f and then updated with a standard extended Kalman filter.

Pre-Integration Measurement Update
The pre-integration of velocity and displacement can be expressed as: The corresponding Jacobian matrix can be obtained: For quaternions, suppose the ideal observation is: The residual of the quaternion can then be defined as . The corresponding Jacobian matrix is: where R L and R R are the left and right quaternion multiplication matrices. The residual of pre-integration can be expressed as: where H I is the Jacobian matrix of the pre-integration relative to state in the slide window and R I is the covariance of the pre-integration.

Wheel Odometer Measurement Update
When the wheel odometer detects that the robot's velocity is zero, it can be defined that the observation speed, acceleration, and angular velocity are all zero. The residual can then be defined as: The corresponding Jacobian is: If the system detects that the robot is in a stationary state, it will not update the visual observation feature points and pre-integration after completing the zero-velocity update. When the speed of the robot is detected by the wheel odometer to be nonzero, d r and d l are the moving distance of the right and left wheels of the differential drive wheel, respectively; d r and d l satisfy [27]: where ∆d r and ∆d l are the displacement in unit tick of the left and right wheel odometer, respectively; ε r and ε l are the scale coefficients; δ r and δ l are zero-mean Gaussian distributions whose variance is proportional to the moving distance of the left and right wheels and the scale factor K. According to kinematics, the following formula can be obtained: where d is the center distance between the two wheels. Since the calculated result of Formula (21) is one-dimensional, it needs to be expanded into a three-dimensional vector when performing the following calculation, that is . The rotation and displacement increments R b j b i and p b i b j of the wheel odometer between image frames i and j can then be obtained by numerical integration of the dynamic Equation (22); the initial condition R b i b i is identity matrix and p b i b i is a zero vector: where R b k b k represents the rotation transformation of the intermediate time between k and k+1 relative to the time k, which can be calculated by b k θ b k+1 b k . For the orientation error propagation using SO(3) perturbation, we obtain: where superscriptˆrepresents the true value and superscript~represents the perturbation. J r (·) is the right Jacobian of SO(3) that maps the variation in rotation angle in the parameter vector space into the variation in the tangent vector space to the manifold. Then, the error transfer coefficients can be obtained as: Similarly, the error transfer equation in the translation direction can be obtained as: The error transfer coefficient can be calculated as: The wheel odometer error transfer calculation can then be performed according to Equation (7). In the mobile robot system, vision information can be used to detect whether the wheel odometer is slipping. According to the positional relationship O R C and O P C between the camera and the wheel odometer coordinate frame, the relative movement C i R C j and C i P C j of the rotation and position of the camera can be calculated. Then, the essential matrix E = C i P C j × C i R C j between two frames can be constructed. According to the camera internal parameter K, we can calculate the essential matrix F = K −T EK −1 .
Assume that the corresponding feature point observations of image frames i and j are P 1 = (u 1 , v 1 , 1) T and P 2 = (u 2 , v 2 , 1) T . The distance from P 1 to the corresponding epipolar line can then be calculated as: According to the threshold th, count the number of feature points with D P 1 P 2 < th and calculate the ratio to the total number of feature points. If the ratio exceeds a certain threshold, the wheel odometer is considered to be slipping. If there is no slip, the state in the sliding window can be updated by using the pre-integration component measured by the wheel odometer, which is similar to the method of updating by using IMU pre-integration. In addition, note that the state variables in the sliding window are defined in the IMU coordinate frame, so the extrinsic parameters from the wheel odom coordinate frame to the IMU coordinate frame need to be used during the update process to convert R b j b i and p b i b j in the wheel odom coordinate frame to the IMU coordinate system.

Helmert Variance Component Estimation
In the update process of the system proposed in this article, because it involves visual feature points, IMU pre-integration, and observation information from the wheel odometer, the determination of the weight between observations is very important. However, their weights are generally inappropriate because of the errors of parameter calibration and calculation and the corresponding unit weight variances. Taking observations of visual feature points and IMU pre-integration as an example, the update equation of the observations to the state variables of the system based on extended Kalman filter is: where m = f or I, which indicate that the system state is updated by visual feature points and IMU pre-integration, respectively. X andX represent the state variables of the system before and after the update; N −1 andN −1 are the covariance matrices of the system state vector variables before and after the update; Z m is the observation and R m is corresponding covariance matrix; H m is the Jacobian matrix of the observed measurement to the system state vector variables, and its definition is the same as Equations (12) and (17); K is the gain matrix and I is the identity matrix; r m can be calculated by Equations (12) and (17).
Then, the calculation equation of weight can be obtained by Helmert variance component estimation theory [18,20] as: where N f = H T f R −1 f H f and N I = H T I R −1 I H I . R f and R I are the covariance of visual feature points and pre-integration observations, respectively. The "tr" item represents the trace of the matrix. Because it is much smaller than the observation numbers n 1 and n 2 in the sliding window calculation process, the trace items can be ignored in the calculation to improve the calculation efficiency. Using Equation (29), σ 2 1 and σ 2 2 can be calculated. Furthermore, the covariance matrices R f = where σ 2 0 is an arbitrary constant and the unit weight variance σ 2 f of the feature point can be taken here. The system state vector variables and their covariance can be updated directly using Equation (28) after R f and R I are determined.

Simulation Environment Settings
To verify the effect of the algorithm proposed in this article on the localization accuracy, we first conduct a simulation analysis of the algorithm in a simulation environment built in Gazebo. We directly use the simulation environment provided by the Chinese Academy of Sciences [28], which contains rich environmental feature information, and import the Mir robot URDF model, as shown in Figure 2. Mir is a differential drive robot equipped with a camera, IMU, and wheel odometer. The camera can output images with a resolution of 1080 × 1920 at a rate of 10 Hz, and the IMU outputs acceleration and angular velocity information at 150 Hz. The wheel odometer outputs speed information and angular velocity information at 50 Hz. During the simulation process, the maximum linear velocity and maximum angular velocity of the robot motion are 1.0 m/s and 0.7 rad/s, respectively. Sensor data are sent and received in the form of ROS (Robot Operating System) topics. The computer is configured with Intel Core i7-7700K CPU with 3.5 GHz, 16 GB RAM, and the system is Ubuntu 18.04 and ROS Melodic.

Analysis of Localization Accuracy
The following is the analysis of the trajectory estimation accuracy of the algorithm proposed in this paper and the original OpenVINS trajectory estimation accuracy of the robot in two motion situations, that is, the motion mainly composed of uniform circular motion or uniform linear motion. All algorithm tests are based on stereovision fusion with IMU and wheel odometer information. We use the open source tool evo [29] and use absolute pose error (APE) as the error evaluation standard to analyze the accuracy of the

Analysis of Localization Accuracy
The following is the analysis of the trajectory estimation accuracy of the algorithm proposed in this paper and the original OpenVINS trajectory estimation accuracy of the robot in two motion situations, that is, the motion mainly composed of uniform circular motion or uniform linear motion. All algorithm tests are based on stereovision fusion with IMU and wheel odometer information. We use the open source tool evo [29] and use absolute pose error (APE) as the error evaluation standard to analyze the accuracy of the trajectory. Figure 3a,c,e shows several comparison heat maps of predicted trajectory and true trajectory (without alignment and scale correction). The motion of the robot is mainly composed of uniform circular motion. The redder the color in the figure, the greater the translation error. The corresponding errors are shown in Figure 3b,d,f.

Analysis of Localization Accuracy
The following is the analysis of the trajectory estimation accuracy of the algorithm proposed in this paper and the original OpenVINS trajectory estimation accuracy of the robot in two motion situations, that is, the motion mainly composed of uniform circular motion or uniform linear motion. All algorithm tests are based on stereovision fusion with IMU and wheel odometer information. We use the open source tool evo [29] and use absolute pose error (APE) as the error evaluation standard to analyze the accuracy of the trajectory. Figure 3a,c,e shows several comparison heat maps of predicted trajectory and true trajectory (without alignment and scale correction). The motion of the robot is mainly composed of uniform circular motion. The redder the color in the figure, the greater the translation error. The corresponding errors are shown in Figure 3b,d,f.  Table 1 shows statistics of trajectory RMSE (Root Mean Squared Error) of translation and rotation. In circular motion, the stereo VIO system will not increase the trajectory estimation error sharply as does the monocular VIO system. Because the stereo can pro-  Table 1 shows statistics of trajectory RMSE (Root Mean Squared Error) of translation and rotation. In circular motion, the stereo VIO system will not increase the trajectory estimation error sharply as does the monocular VIO system. Because the stereo can provide the system with a certain scale of observability, therefore, the RMSE of the translation during its circular motion is not very large, which is only 0.1826 m. After introducing the IMU pre-integration as the observation and using the HVCE method to obtain the maximum a posteriori weight of the covariance between the visual observation and the IMU pre-integration, using this constraint to update the system state can reduce the RMSE to 0.1368 m and improve the accuracy by 25.08%. If the system introduces wheel odometer information at the same time, the error can be further reduced which the RMSE of the translation is reduced to 0.1197 m and the accuracy is improved by 34.45%. This fully proves that the algorithm proposed in this paper has a positive effect on localization accuracy after introducing pre-integration information. When the robot is mainly in uniform linear motion and in a closed loop, the heat maps of the estimated trajectory of the algorithm and the real trajectory are shown in Figure 4a,c,e and the corresponding errors are shown in Figure 4b,d,f. The statistical results of the RMSE of translation and rotation are shown in Table 2. The RMSE of the translation part of the original OpenVINS algorithm is 0.1011 m. After using the IMU pre-integration as the observation constraint and obtaining the optimal a posterior weight to update the state of the system by HVCE method, the RMSE of translation of the system can be reduced to 0.0787 m, and the position accuracy can be improved by 22.15%. Because the robot moves on a plane in the simulation environment, and the wheel odometer pre-integration can provide such constraints, the positioning error can be further reduced to 0.0698 m, and the position accuracy can be improved by 30.95%. It can be seen that in both cases the robot's motion error is relatively small, mainly because the robot's motion speed in the simulation environment is low (maximum 0.8 m/s) and stable, so the error is relatively small. However, in actual situations, especially in the process of autonomous navigation, there are situations where the speed and angular velocity change drastically which may cause error increased rapidly in the absence of loopback. Therefore, next we conduct experiments on the actual data collected and the actual robot to verify and evaluate the performance of the algorithm in real world scenarios. the robot's motion speed in the simulation environment is low (maximum 0.8 m/s) and stable, so the error is relatively small. However, in actual situations, especially in the process of autonomous navigation, there are situations where the speed and angular velocity change drastically which may cause error increased rapidly in the absence of loopback. Therefore, next we conduct experiments on the actual data collected and the actual robot to verify and evaluate the performance of the algorithm in real world scenarios.

Public Dataset Test
To verify the effectiveness of the algorithm, we tested it on the EuRoc MAV Dataset. The EuRoc dataset includes the global shutter camera and Micro Electro Mechanical Systems (MEMS) IMU configured on the drone to simultaneously collect stereoimages and IMU data. The true value of the drone's motion trajectory is provided through the motion capture system. The collection environment of the dataset includes three environments, one is the machine hall and the other two are the Vicon room. The algorithm proposed in this paper compared the performance of trajectory estimation with three classical VIO methods which include S-MSCKF, OpenVINS, and VINS-Fusion without loop closure. S-MSCKF supports stereo and OpenVINS supports monocular and stereo mode. They are both VIO methods that tightly couple visual information and IMU measurements with extended kalman filters. VINS-Fusion is an optimization method which supports a monocular and stereo visual-inertial navigation system and uses tight coupling of visual measurement and pre-integration in a sliding window.
In the experiments, we use the open source tool evo to evaluate the accuracy of the results of the algorithm running on the dataset and use Absolute Pose Error (APE) as the error evaluation standard. Table 3 lists the Root Mean Square Errors (RMSE) of translation and rotation of the estimated trajectory of the proposed method and the stateof-art stereo visual-inertial navigation system. From the table, it can be seen intuitively that OpenVINS gives better positioning performance than the other two mainstream methods by adopting the strategies of using FEJ, adding camera intrinsic parameters, IMU and camera extrinsic parameters, time drift between IMU and camera, and SLAM features to the state vector. Our algorithm adds pre-integration constraints to the basic framework of OpenVINS as observations, which can also further improve the positioning accuracy of the system. Especially when the dataset contains long linear motion, which is similar to the motion of indoor mobile robot, the algorithm proposed in this paper improves the positioning accuracy more obviously. For example, for the OpenVINS system with excellent positioning performance, in the MH-04-difficult and MH-05-difficult sequence, the RMSE of the translation is reduced from 0.1625 m and 0.1518 m to 0.1162 m and 0.1031 m, a reduction of 28.49% and 32.08%, respectively. The symbol "×" indicates that the dataset could not be completed in the program. To display the estimation results intuitively, Figure 5 shows the comparison of trajectory estimation between the original OpenVINS and the algorithm proposed in this paper in the MH-04-difficult and MH-05-difficult sequences. This sequence involves the drone entering a dark environment; it is difficult to extract feature points and the probability of mismatching increases. Only using visual features to update the system status leads to larger errors. Here, the pre-integration of the IMU, which is not affected by light, is introduced as an observation to update the status of the system, which can improve the positioning accuracy of the system.

Real-World Test
To verify the performance of the algorithm in real-world scenarios, the self-built Mir robot mobile platform is used for experiments, as shown in Figure 6a. The Mir robot is equipped with front and back laser sensors and wheel odometer. At the same time, the platform is also equipped with a ZED2i camera produced by STEREOLABS, which can output the stereoimage and IMU message required by the algorithm described in this paper. The image accepted by the algorithm is a three-channel color image and is first converted into a single-channel grayscale image with 720 × 1280 resolution and 20 Hz frequency. In this experiment, the cartographer algorithm [30] is used to build the environment map through the laser sensors. Since the loop closure can be used to eliminate the cumulative error of the map in the process of map building, it can achieve a high map-building accuracy. During the movement of the robot, the adaptive Monte Carlo localization algorithm is used, which applies the acquired laser scanning information and the established environment map to obtain globally consistent positioning information. The localization of the robot on the map is shown in Figure 6b, and its positioning accuracy reaches 5 cm, which meets the requirements of the visual-inertial odometry (without closed loop) accuracy evaluation benchmark. The test trajectory of the robot is a closed-loop motion in an indoor environment. The feature point tracking and the estimated trajectory displayed in RVIZ during motion are shown in Figure 6c,d, respectively.

Real-World Test
To verify the performance of the algorithm in real-world scenarios, the self-built Mir robot mobile platform is used for experiments, as shown in Figure 6a. The Mir robot is equipped with front and back laser sensors and wheel odometer. At the same time, the platform is also equipped with a ZED2i camera produced by STEREOLABS, which can output the stereoimage and IMU message required by the algorithm described in this paper. The image accepted by the algorithm is a three-channel color image and is first converted into a single-channel grayscale image with 720 × 1280 resolution and 20 Hz frequency. In this experiment, the cartographer algorithm [30] is used to build the environment map through the laser sensors. Since the loop closure can be used to eliminate the cumulative error of the map in the process of map building, it can achieve a high map-building accuracy. During the movement of the robot, the adaptive Monte Carlo localization algorithm is used, which applies the acquired laser scanning information and the established environment map to obtain globally consistent positioning information. The localization of the robot on the map is shown in Figure 6b, and its positioning accuracy reaches 5 cm, which meets the requirements of the visual-inertial odometry (without closed loop) accuracy evaluation benchmark. The test trajectory of the robot is a closed-loop motion in an indoor environment. The feature point tracking and the estimated trajectory displayed in RVIZ during motion are shown in Figure 6c,d, respectively.
The trajectory length of the entire test is about 122 m, and the maximum linear velocity and maximum angular velocity are 1.0 m/s and 0.7 rad/s, respectively, during robot motion. As with the above method for accuracy analysis, we continue to use evo_ape for trajectory accuracy analysis, as shown in Figure 7. Figure 7a,c,e shows the comparison of the estimated trajectory with the ground truth, and Figures 6b and 7d,f indicate the corresponding errors. The RMSE of translation is counted separately, the error of the original OpenVINS algorithm is 0.3145 m, the trajectory error after constraint update using IMU pre-integration is 0.2132 m, and the error after further using wheel odometer preintegration update is 0.1841 m. It can be seen that adding IMU pre-integration constraints can greatly improve the positioning accuracy of the system, and the accuracy is increased by 32.21% compared with the original OpenVINS method. Furthermore, by adding the wheel odometer as another observation constraint, the accuracy can be further improved by 41.46%.  The trajectory length of the entire test is about 122 m, and the maximum linear velocity and maximum angular velocity are 1.0 m/s and 0.7 rad/s, respectively, during robot motion. As with the above method for accuracy analysis, we continue to use evo_ape for trajectory accuracy analysis, as shown in Figure 7. Figure 7a,c,e shows the comparison of the estimated trajectory with the ground truth, and Figures 6b and 7d,f indicate the corresponding errors. The RMSE of translation is counted separately, the error of the original OpenVINS algorithm is 0.3145 m, the trajectory error after constraint update using IMU pre-integration is 0.2132 m, and the error after further using wheel odometer pre-integration update is 0.1841 m. It can be seen that adding IMU pre-integration constraints can greatly improve the positioning accuracy of the system, and the accuracy is increased by 32.21% compared with the original OpenVINS method. Furthermore, by adding the wheel odometer as another observation constraint, the accuracy can be further improved by 41.46%.

Discussion
In this paper, in order to make full use of the sensor information to improve the positioning accuracy of the system, three main works were carried out in the VIO system based on the MSCKF framework. The first was to use the pre-integration of the IMU as the observation information of the system, and use the standard EKF method to update the state of the system. Secondly, in the positioning process for indoor mobile robots, we used the same idea to extract the pre-integrated information of the wheel odometer, and constrain and update the state of the system. In addition, visual features are collected and used as observations to update the system state in the MSCKF framework. This paper uses the Helmert variance component estimation method to determine a more reasonable covariance weight between visual features and IMU pre-integration. To verify the effectiveness of the algorithm proposed in this paper, we tested it in the Gazebo simulation environment, public dataset, and actual environment. In the Gazebo simulation environment, the accuracy analysis was carried out on the situation where the robot mainly moved in a circular motion or in a uniform linear motion. The results show that the algorithm proposed in this paper has a significant improvement in accuracy compared to the OpenVINS algorithm. In the case of using IMU pre-integration with the HVCE method, the positioning accuracy of translation can be improved by 25.08% and 22.15% in two cases, respectively. Meanwhile, if the pre-integration information of the wheel odometer is used, the accuracy can be further improved by 34.45% and 30.95%. On the public dataset, we compared the positioning accuracy performance of the proposed algorithm with several mainstream algorithms: S-MSCKF, VINS-Fusion, and OpenVINS. All test results show the excellent performance of the algorithm proposed in this paper. Especially in a complex environment, such as the MH-04-difficult and MH-05-difficult sequences, due to obvious changes in illumination, it has a greater impact on visual measurement. Compared with OpenVINS, the algorithm proposed in this paper can reduce the RMSE of translation from 0.1625 m and 0.1518 m to 0.1162 m and 0.1031 m and increase the accuracy by 28.49% and 32.08%. Finally, we also conducted the corresponding accuracy analysis experiments in the actual robot, and proved that the proposed al-

Discussion
In this paper, in order to make full use of the sensor information to improve the positioning accuracy of the system, three main works were carried out in the VIO system based on the MSCKF framework. The first was to use the pre-integration of the IMU as the observation information of the system, and use the standard EKF method to update the state of the system. Secondly, in the positioning process for indoor mobile robots, we used the same idea to extract the pre-integrated information of the wheel odometer, and constrain and update the state of the system. In addition, visual features are collected and used as observations to update the system state in the MSCKF framework. This paper uses the Helmert variance component estimation method to determine a more reasonable covariance weight between visual features and IMU pre-integration. To verify the effectiveness of the algorithm proposed in this paper, we tested it in the Gazebo simulation environment, public dataset, and actual environment. In the Gazebo simulation environment, the accuracy analysis was carried out on the situation where the robot mainly moved in a circular motion or in a uniform linear motion. The results show that the algorithm proposed in this paper has a significant improvement in accuracy compared to the OpenVINS algorithm. In the case of using IMU pre-integration with the HVCE method, the positioning accuracy of translation can be improved by 25.08% and 22.15% in two cases, respectively. Meanwhile, if the pre-integration information of the wheel odometer is used, the accuracy can be further improved by 34.45% and 30.95%. On the public dataset, we compared the positioning accuracy performance of the proposed algorithm with several mainstream algorithms: S-MSCKF, VINS-Fusion, and OpenVINS. All test results show the excellent performance of the algorithm proposed in this paper. Especially in a complex environment, such as the MH-04-difficult and MH-05-difficult sequences, due to obvious changes in illumination, it has a greater impact on visual measurement. Compared with OpenVINS, the algorithm proposed in this paper can reduce the RMSE of translation from 0.1625 m and 0.1518 m to 0.1162 m and 0.1031 m and increase the accuracy by 28.49% and 32.08%. Finally, we also conducted the corresponding accuracy analysis experiments in the actual robot, and proved that the proposed algorithm can improve the accuracy by 32.21% compared with OpenVINS after introducing IMU pre-integration as the observation constraint. Similarly, after we introduce the pre-integration of the wheel odometer as the observation constraint, the accuracy of the system can be further enhanced, which is 41.46% higher than the original OpenVINS. However, in our actual experimental verification process, we also found that the parameters of the IMU, including the noise variance of the accelerometer and the gyroscope, and the variance of the random walk noise, have a greater impact on the performance of the system. Even if the parameter selection is inappropriate, it may cause the system to crash. In the next step, we will conduct further research on how to improve the robustness of the system to parameters. There are also problems including the estimation of external parameters between the wheel odometer and the camera while using the wheel odometer. The time stamp between the two sensors are not strictly hardware aligned, which will be further solved in the future.

Conclusions
In this paper, the pre-integration of the IMU is introduced as an observation, and the state vector in the sliding window is updated, making full use of the information of the IMU and improving the system positioning accuracy in the MSCKF framework. At the same time, since the system also involves the observation of visual feature points, in order to obtain the appropriate covariance between the two observations, this paper uses the Helmert variance component estimation method to estimate the maximum a posterior weight of the covariance. For indoor mobile robots, the same method is used, and the pre-integration of the wheel odometer is used as the constraint of the pose estimation, which further improves the positioning accuracy of the system. Additionally, through the Gazebo environment simulation, public dataset EuRoc, and actual experiments the performance of the proposed algorithm is verified. Analysis of the results leads to following conclusions. One is that introducing pre-integration by IMU and wheel odometer as observation constraints in MSCKF framework can help improve the positioning accuracy of the system. The other one is that with the Helmert variance component estimation method, the largest a posteriori covariance weight between visual observation and pre-integration observation can be obtained. This method can further improve the positioning accuracy of the system. Since the wheel odometer can only provide plane constraint information, namely x, y, yaw directions, and in an indoor environment, the ground is not completely flat. In the next task, we will model the ground more accurately so that higher localization accuracy can be obtained after using the wheel odometer information.

Conflicts of Interest:
The authors declare no conflict of interest.