Robust Tightly Coupled Pose Measurement Based on Multi-Sensor Fusion in Mobile Robot System

Currently, simultaneous localization and mapping (SLAM) is one of the main research topics in the robotics field. Visual-inertia SLAM, which consists of a camera and an inertial measurement unit (IMU), can significantly improve robustness and enable scale weak-visibility, whereas monocular visual SLAM is scale-invisible. For ground mobile robots, the introduction of a wheel speed sensor can solve the scale weak-visibility problem and improve robustness under abnormal conditions. In this paper, a multi-sensor fusion SLAM algorithm using monocular vision, inertia, and wheel speed measurements is proposed. The sensor measurements are combined in a tightly coupled manner, and a nonlinear optimization method is used to maximize the posterior probability to solve the optimal state estimation. Loop detection and back-end optimization are added to help reduce or even eliminate the cumulative error of the estimated poses, thus ensuring global consistency of the trajectory and map. The outstanding contribution of this paper is that the wheel odometer pre-integration algorithm, which combines the chassis speed and IMU angular speed, can avoid the repeated integration caused by linearization point changes during iterative optimization; state initialization based on the wheel odometer and IMU enables a quick and reliable calculation of the initial state values required by the state estimator in both stationary and moving states. Comparative experiments were conducted in room-scale scenes, building scale scenes, and visual loss scenarios. The results showed that the proposed algorithm is highly accurate—2.2 m of cumulative error after moving 812 m (0.28%, loopback optimization disabled)—robust, and has an effective localization capability even in the event of sensor loss, including visual loss. The accuracy and robustness of the proposed method are superior to those of monocular visual inertia SLAM and traditional wheel odometers.


Introduction
At present, the influential simultaneous localization and mapping (SLAM) systems with monocular vision include ORB-SLAM2 [1], LSD-SLAM [2], DSO [3], and SVO [4]. Among them, ORB-SLAM2 [4] has high adaptability and strong robustness to indoor and outdoor environments. However, the algorithm itself is based on pure vision. When there are no visual features, the accuracy and robustness of the pose estimation will decrease rapidly, and the algorithm may fail. Therefore, in the follow-up development of visual SLAM, in order to overcome the shortcomings of pure vision, the strategy of multisensor fusion is adopted. Accordingly, sensors with scale measurement capabilities and monocular vision sensors are used to perform fusion vision SLAM to increase accuracy and robustness. Relatively stable and reliable solutions can be obtained with lasers [5]; however, at present, this method is generally only suitable for large-scale scenarios, such as unmanned driving, and is unsuitable for applications with limited costs. The inertial measurement unit (IMU) has become a generally accepted option. However, it exhibits a non-negligible cumulative error if run for a long period of time [6], especially in visually restricted conditions without texture or under weak illumination, in which case the visual mile cannot be used to correct the IMU error. In [7], the scale observability of a monocular visual inertial odometer on a ground mobile robot was analyzed in detail. When the robot moved at a constant speed, due to the lack of acceleration excitation, the constraint on the scale was lost, resulting in a gradual increase in the scale uncertainty and positioning error.
A wheel speed inertial odometer was integrated with a monocular visual odometer using the extended Kalman filter (EKF) [8,9], assuming the robot to be running on an ideal plane, and 3-DOF pose measurement was performed. The wheel speed inertial odometer uses wheel speed measurement for EKF status prediction; the angular speed measurement is integrated for dead reckoning; and the visual odometer method is used for an EKF measurement update. In the above-mentioned EKF-based loose coupling method, when the visual odometer cannot accurately calculate the pose due to insufficient visual characteristics, some proportion of its output in the filter will fall, resulting in ineffective visual observation. Consequently, the accuracy is reduced. VINS-Mono used tightly coupled nonlinear optimization [10,11]. By combining the IMU pre-integral measurement and visual measurement in a tightly coupled form to achieve the maximum posterior probability, we use the nonlinear optimization method to reach the optimal state. An opensource visual inertia SLAM algorithm VINS-Fusion [12,13] was developed based on VINS-Mono, supporting multiple sensor combinations (binocular camera + IMU; monocular camera + IMU; binocular camera only); GPS is used to improve the accuracy of the global path. In pursuit of better performance, wheel speed is used in SLAM system [14]-the wheel speed sensor and the visual odometer were integrated tightly. In this way, they make least-squares rather than optimization. However, they do not consider the unreliability of wheel speed. When a robot moves on uneven surfaces, or a wheel slip occurs, an incorrect wheel speed measurement will significantly affect the scale accuracy, and can potentially lead to system failure. In [15], a batch optimization framework with formulated edges was proposed. Moreover, tightly coupled nonlinear optimization methods were used to integrate vision, IMU, and wheel speed, and to perform pose. The error cost function was composed of vision errors, inertial measurement errors, and wheel speed sensor dead reckoning errors. In addition, assuming that the vehicle was moving on an approximate plane, a "soft" plane constraint term was added to the error cost function. While the robot is running with fixed acceleration or going straight, we do not know how long it moves in real time or what the direction is. Therefore, the encoder and soft plane constraints will make the task easier.
According to whether the visual feature points are added to the state vector of the filter or the optimization equation, the visual inertial odometry method based on filtering can be divided into two types: loose coupling and tight coupling. Further, the loose coupling method is divided into two parts: the visual odometer and state estimator. The visual odometer tracks the feature points and calculates the current camera pose, and the state estimator uses the IMU to measure and adjust the camera pose obtained by the visual odometer. As the visual features are invisible to the state estimator, they cannot be adjusted based on the IMU measurement, and the poor position accuracy of the visual features causes the accuracy of the visual odometer to decline. As a result, the correlation between measurements is not fully utilized for pose measurement, thereby reducing the accuracy. In the tight coupling method, the carrier pose and position of the feature point are used as variables to be estimated, and the filter method or optimized method is used for estimation. Thus far, for the existing visual inertia SLAM algorithm, when the robot is moving at a constant speed, or purely rotating, and encounters scenes with insufficient visual features, the problems of low accuracy and poor robustness arise. Tightly coupled wheel speed information can help solve this problem. However, there are few studies on multisensor fusion SLAM for wheeled mobile robots based on vision, inertia, and wheel speed measurements that are tightly coupled and optimized. There is no complete solution similar to ORB-SLAM. Therefore, research on tightly coupled monocular visual odometer combined with wheel speed measurement is significant. Therefore, we use IMU and wheel odometry to make a monocular visual odometry system to gain the scale reconstruction.
The main novelty and contributions of this paper include:

1.
A multi-sensor fusion SLAM algorithm using monocular vision, inertial measurement, and wheel speed measurement is proposed. The wheel speed and IMU angular velocity pre-integration of the wheel odometer can avoid the repeated integration caused by the linearization point change in the iterative optimization process.

2.
Based on the state initialization of the wheel odometer and IMU, the initial state value required by the state estimator can be quickly and reliably calculated in both stationary and moving states.
The method in this paper solves the problem of poor pose estimation accuracy caused by the weak observability of monocular visual inertial SLAM, and further improves the robustness of pose estimation.

SLAM Based on Multi-Sensor Fusion
The multi-sensor fusion state estimator in this study uses monocular vision, IMU, and wheel odometer measurements based on feature point optical flow tracking [16]. None of these sensors can measure the absolute pose. Therefore, the multi-sensor fusion state estimator, as an odometer algorithm, has an unavoidable cumulative error. Therefore, we used key frame selection, loop detection [17], and back-end optimization algorithms based on the VINS-Mono framework and applied them to the multi-sensor fusion state estimator to form a complete SLAM system. Figure 1 shows a system block diagram of the SLAM method.
In a multi-sensor state estimation process, the main data processing and analysis processes include raw sensor input, data pre-processing (including calibration compensation, time alignment, and pre-fused wheel odometer), state initialization, state estimation problem solving, loopback detection, and backend optimization. The processes of optimal state estimator, wheel odometer pre-integration, and initialization reflect the contributions and innovations of this paper. The rest is our flexible use of existing methods. Sections 2.2-2.7 analyze the core multi-sensor fusion state estimation problem and the pre-integration process of IMU and wheel speed. Section 2.8 analyzes the initialization process of the state.

Variables to be Estimated and Observation
Considering that the IMU zero offset always exists, when defining the variable to be estimated, the IMU zero offset of each key frame needs to be included. Therefore, in the k key frame, we have: For the IMU, ∈ × is the state, ∈ × denotes the position, ∈ × is the attitude (quaternion form), ∈ × denotes the speed, and ∈ × and ∈ × are the accelerometer and gyroscope bias, respectively.
is the key frames and ℒ is the feature points. Further, is the inverse depth of ℒ (the inverse of the Z-axis coordinate).
Additionally, pre-fusion wheel odometer observations are added based on VINS-Mono. The original wheel odometer is defined as: where { } is the coordinate of the wheel odometer relative to the starting point and is the angle of rotation relative to the starting direction. Therefore, the observation used to constrain the :

Variables to Be Estimated and Observation
Considering that the IMU zero offset always exists, when defining the variable X to be estimated, the IMU zero offset of each key frame needs to be included. Therefore, in the k key frame, we have: For the IMU, x k ∈ R 16×1 is the state, p W B ∈ R 3×1 denotes the position, q W B ∈ R 4×1 is the attitude (quaternion form), v W B ∈ R 3×1 denotes the speed, and b a ∈ R 3×1 and b g ∈ R 3×1 are the accelerometer and gyroscope bias, respectively. K is the key frames and L is the feature points. Further, λ l is the inverse depth of L (the inverse of the Z-axis coordinate).
Additionally, pre-fusion wheel odometer observations are added based on VINS-Mono. The original wheel odometer is defined as: where {p xk p yk } is the coordinate of the wheel odometer relative to the starting point and θ k is the angle of rotation relative to the starting direction. Therefore, the observation Z used to constrain the X : where visual feature point observation Z C i = {ẑ il } l∈L i , containing all feature points L i observed under the i key frame; B ij = {â t ,ω t } t i ≤t≤t j is IMU pre-integration of accel-erated speed and angular speed; and O ij = {∆m odomt ,â avgt ,ω avgt } t i ≤t≤t j is pre-fusion wheel odometer.

Optimal Estimation Problem
The maximum posterior estimation and Bayes' theorem shows: Here, p(Z |X ) is the conditional probability of observing the occurrence of Z under the existed state X , which can be obtained by the observation equation and the covariance. p(X ) is the prior probability (edge probability) of X . According to formula (3), we have: We use the least-squares method to find the maximum posterior estimate, and use Mahalanobis distance to measure the difference between the residual and the covariance matrix.
Here, r (·) 2 Σ is the Mahalanobis distance and Σ is the covariance matrix. {r p , H p } is the prior information from marginalization; r C ∈ R is the visual residual; r O ∈ R 3×1 is the wheel odometer residual; and r B ∈ R 15×1 is the IMU residual. In addition, the Huber loss function ρ [18] is used to improve the robustness, where ρ is

Visual Measurement Constraints
The specific process is as follows: while the feature point l is first observed in the key frame i, it is recorded and tracked. Its spatial pose is defined as a function of the key frame i pose (p C i , q C i ) and the inverse depth λ l of the feature point. While the feature point l is observed again in the key frame j, a visual residual term is generated. The residual term r C jl represents the error of the feature point l in the position of j. It is also called the re-projection error, which is a function of the key frame i pose (p C i , q C i ).
Here,P In the formula, R represents the rotation matrix from F 2 coordinate system to F 1 coordinate system;P c j l ∈ R 2×1 is the position where the feature point l is projected onto the unit ball in the key frame j; π −1 c is the back projection function, which can project the pixel coordinates into the camera coordinate system C j ; P l C j ∈ R 2×1 is the position projected on the unit ball in the key frame j; to compare the error with P l C j , it needs to be transformed into the camera coordinate system C j of the key frame j; and b 1 b 2 are the two orthogonal base vectors on the tangent plane of the unit ball and the projection lines with the feature points in the orthogonal direction.

IMU Constraints
In the visual odometer method based on bundle adjustment, the state of the carrier is optimized and visual measurement is used to constrain. The IMU measurement between frames is added as a constraint on the optimization framework, thereby improving robustness.

IMU Pre-Integration
To reduce the complicated operation caused by reintegration, we used the IMU preintegration method [19] to fuse IMU measurements between two consecutive key frames.

Residual Term
The IMU pre-integration processes the IMU measurement for a continuous period based on the given IMU zero offset and obtains the relative pose constraint between the initial and end states of the time period. The IMU pre-integration residual term is defined as: The random distribution of the residual term r B conforms to N(0, Σ B ) and Σ B is obtained by updating the covariance equation. δθ B k B k +1 ∈ R 3×1 is three-dimensional small perturbation and δ represents the error term. The IMU pre-integration provides constraints on the variables to be optimized contained in the two key frames before and after. In the process of nonlinear optimization, the essence of the constraint is to provide the direction and gradient of the variable to be optimized by calculating the Jacobian matrix of the residual. As the direction of gravity is obtained during the initialization of the visual inertial odometer, the gravity acceleration g W is not used as a variable to be optimized.

Wheeled Odometer Constraints
On ground mobile robots, a wheel speed meter is typically used to perform dead reckoning to obtain continuous relative poses of the robot. The continuous position and reliable scale estimation of the wheel odometer make it suitable for tasks such as path planning and navigation.

Two-Dimensional Wheel Odometer Algorithm
The two-dimensional wheel odometer has an unavoidable cumulative, error but can provide a continuous carrier trajectory. As the wheel speed meter measures the average wheel speed over a period of time, the chassis speed measurementm base measures the average speed during this time. The position and attitude update methods of the wheel speed odometer include Euler points, median points, and higher-order Runge-Kutta methods. As the sampling speed of the wheel speed meter is high (1 kHz), the Euler integration method is used to reduce the calculation time of the main control microcontroller. This is achieved while assuming that the chassis moves with fixed direction and speed in the original direction during the period and rotates to a new direction at the end of the time period.
The initial state of the wheel odometer ism odom0 = p x0 p y0 θ 0 T = 0. Given the of the wheel odometer, the current chassis speed measurementv basek = v xk v yk ω k T , and the time difference dt k = t k − t k−1 , we can obtain the new wheel odometer statem odomk as:

Wheel Odometer Pre-Integration
In this paper, we propose a pre-integration method for wheel odometer. We use IMU and wheel speed to measure the relative pose between two key frames. The data of the two sensors makes pre-fusionm f used odom . Thereafter, only them f used odom is measured, according to the wheel odometer kinematics equation, and a continuous calculation and integration is made to find the displacement.
The incremental update equation of the wheel odometer is: The nominal weight of the wheel odometer pre-integration item can be incrementally updated based on the pre-fusion wheel odometer measurement: The initial value of nominal weight: According to the definition of the error amount of the pre-integration term, an incremental update formula of the error amount of the pre-integration term is: (16) (·)ˆis 3 × 3 anti-symmetric matrix of Lie algebra SO (3). The nominal value of the wheel odometer pre-integration term depends on the pre-fusion wheel odometer measurement and the gyroscope zero offset. For the variable to be optimized, the zero bias of the gyroscope needs to be continuously adjusted in the pose measurement process to reduce the residual error. Therefore, in the optimization process, the partial derivatives J of the nominal value of the pre-integral term of the wheel odometer with respect to the zero offset of the gyroscope need to be used.
According to the incremental update of the error value of the pre-integration term of the wheel odometer, the Jacobian matrix of the error value between the two frames before and after can be obtained as: According to the definition of the nominal value of the pre-integration item of the wheel odometer, the Jacobian matrix of the error value is the nominal Jacobian matrix of Jx i+1 Jx î x 0 , the update equation of the nominal value of the pre-integration term of the wheel odometer with respect to the zero-biased Jacobian matrix is: The initial value of the Jacobian matrix: J 0 = Jx 0 x 0 = diag(9).

Residual Term
Definition: In the least-squares problem of robot pose measurement, the wheel odometer residual term r O represents the error distance between the frame-to-frame relative displacementp in the variable to be optimized, The wheel odometer residual does not include the errors δθ with respect to the rotation and gyro zero offset. This is as these terms are already defined in the residual term of the IMU pre-integration. The IMU pre-integration uses the original IMU measurement as the angular velocity input, which provides higher rotational integration accuracy than the wheel odometer pre-integration measured with a lower frequency prefused wheel odometer.
To use the variable x k , x k+1 to represent the wheel odometer pre-integration residual term, p O k O k+1 needs to be transformed: We obtain the residual term expressed using only the variables to be optimized and the wheel odometer pre-integration: Here, R B O and p B O are known constants. As a maximum posterior problem, the robot pose measurement is the same as leastsquares by introducing a covariance matrix of the residuals to transform the residuals with dimensions into a unified probability representation. The wheeled odometer residual r O obeys the covariance matrix Σ O of the wheeled odometer pre-integration, key frame, and the gyroscope zero offset b gi of the previous frame. To provide the necessary gradient direction for optimization, the system needs to be linearized in the current state x k , x k+1 and the ratio between the increment of the residual ∂r O and the increment of the variable to be optimized is calculated. Thus, the Jacobian matrix J r O x k ,x k+1 is defined: Here, J r O x k ,x k+1 ∈ R 3×30 . As the increment is small, using the quaternion definition will produce additional degrees of freedom. The increment for the rotation state in the formula is defined as the shaft angle representation.
As the wheel odometer residual is only related to some variables in the previous key frame state x k and next key frame state x k+1 , the value of the Jacobian matrix J

Marginalization and Prior Constraints
The state of the key frame and its related observations are constantly removed from the optimization equation. If all observations related to the removed key frames are directly discarded, the constraints of state estimation will be reduced, and the loss of valid information will lead to a decrease in accuracy. Here, a marginalization algorithm is used, while removing the key frames, retaining the removed observations to constrain the optimization variables. According to [6], the use of the Gauss-Newton method can be understood as adding an increment to the variable to be optimized; the objective function is the smallest.
If the residual function r(x) is linearized at x, and the Jacobian matrix J r x is obtained, the nonlinear least-squares problem becomes a linear least-squares problem: Here, r(x) + J r Taking the derivative of this formula with respect to δx be 0, we obtain: x T r(x); thus, we obtain the incremental equation Hδx = b, where H is called the Hessian matrix.
Divide the variable x to be optimized into the part x a , which needs to be removed, and the part x b , δx = [δx a δx b ] T , which needs to be retained, then the incremental equation becomes: The Schur method is used to eliminate the element to obtain the solution of δx b : Intercepting the second row of the above matrix, we obtain: In the above formula, only x b is unknown, and no information in H and b is lost. This process removes the rows and columns related to x a from the incremental equation, marginalizes the state x a that needs to be removed, and retains the historical observation constraints on the state x b . When the next image frame arrives, the prior information in the above Formula (31) will be used as a prior constraint term to construct a nonlinear least-squares problem.

State Initialization
VINS-Mono uses multiple steps to initialize the state: gyroscope zero offset correction, initializing gravity, speed, and scale coefficients, and modifying the direction of gravity. The disadvantage of this method is that it depends on sufficient visual measurement of parallax and sufficient acceleration excitation. When there is no abnormal situation, such as skidding, the wheel odometer has better accuracy and reliability over a short distance and a short duration. Compared with monocular vision, there is no scale uncertainty, and it is easier to initialize the keyframe pose, velocity, and gravity directions.

Gyro Zero Offset Initialization
As the gyroscope and wheel odometer measurements are on the same rigid body, the rotations of the two are the same. The relative rotation between the two key frames can be obtained through IMU pre-integration and wheel odometer pre-integration, respectively: q . The rotation term of the pre-integration of the wheeled odometer above is also obtained through the gyro integration and has no reference value. Therefore, during the initialization process, the gyroscope pre-integration will use the heading angle for rotation integration. Rotation term q is used as a constraint, the gyroscope bias b g can be estimated. Assuming that the gyro bias b gk of each key frame during the initialization process is the same b gk = b g , the construction of the least-squares problem is as follows: , we obtain: Here, J q b g is the partial derivative of the inter-frame rotation q B k B k+1 . b g is the gyroscope zero bias. The objective function of the least-squares problem is written as: Considering only the imaginary part of the quaternion, we obtain: The above Formula (35) conforms to the format of Hx = b, and the Cholesky decomposition can be used to find the least-squares solution: If the robot rotates rapidly during the gyro work offset initialization process, the elasticity of the wheel, the rigid connection between the wheel and the IMU, the misalignment of the wheel odometer clock and the IMU clock, and a calibration error of the wheel odometer rotation scale factor may lead to poor gyro work offset initialization results.

Initialization of Key Frame Speed and Gravity
As the Mecanum wheel will tremble during movement, and the wheeled odometer algorithm can only obtain the heading angle information, it is difficult to obtain accurate relative rotation between key frames through wheeled odometer integration. In the previous step, the zero offset of the gyroscope has been initialized, and the relative rotation between all key frames can be obtained through IMU pre-integration. As the rotation is known, the key frame speed and gravity can be calculated by solving linear equations. Decomposing the position term α B k B k+1 and speed term β in the IMU pre-integration and transforming it into the form of matrix multiplicationẑ , we obtain: Here,ẑ is the IMU pre-integration measurement between the key frames B k and B k+1 . The variable to be estimated related to the key frames B k and B k+1 is defined as x represents the distance of the wheel odometer with respect to the actual distance, that is, the X-axis and Y-axis scale factors of the wheel odometer S x , S y . If the IMU excitation is sufficient, it can be used to calibrate the scale factor. S = 1 is defined here for the reliability of initialization.
To reduce initialization errors and improve reliability, multiple key frame measurements need to be used as constraints to calculate the key frame speed and gravity direction. By combining the multiple linear equations above, we can obtain the least-squares problem: Here, x is the variable to be estimated, and x * is the optimal estimated value of x. The program uses Cholesky decomposition to solve the least-squares problem: In the formula, the matrix H is obtained by inserting all H B k B k+1 into empty columns at the corresponding positions of the unrelated variables and summing them, andẑ is obtained by combining allẑ

Experiments
In this paper, a mobile robot platform is built, using a dual-processor architecture. The high-performance PC upper computer is used for data processing of each sensor and the SLAM algorithm operation, and the embedded controller lower computer is used for the motion control of the mobile robot chassis. The parameters of the high-performance PC are Intel Core i7-9750H CPU @ 2.6 GHz, 16 GB RAM. The camera is IntelRealSense ZR300.

Room-Scale Pose Measurement Experiment
Experimental conditions: In a laboratory where objects are placed in a complex environment, as shown in Figure 2, the control robot walked through all the channels. The channel width is narrow, and the width at the narrowest point is less than 1 m; the movement speed was maintained at approximately 0.5 m/s; a large number of fixed marking points are arranged inside the room to obtain the real position of the robot in order to analyze the positioning error of the robot. Abnormal conditions during the experiment included:

3.
magnetic guide bars with a height of approximately 0.5 cm were fixed on the ground, and the wheels slipped slightly as they passed; 4.
due to turning too close to the weakly textured wall surface, the visual tracking was completely lost several times.  The RViz interface at the end of the test data playback is shown in Figure 3, which includes the raw picture, feature points, matched feature points, and the path of pose measurement. Figure 2. Room-scale experimental environment. The path diagrams of pose measurement, as shown in Figure 4, start along the positive X-axis. Therefore, the distance between the start point and end point shows the experimental accuracy. Throughout the experiment, the car traveled for 184.3 s, the average speed was 0.264 m/s, the maximum speed was 0.591 m/s, the cumulative translation was 51.321 m, the cumulative rotation was 3428.318°, and the chassis had no abnormalities. As this experiment was performed in an actual scene, the experimental system did not have the equipment to measure ground truth; therefore, it could only measure the distance error of the car passing several fixed road marking points. In each experiment, the robot was controlled to accurately pass the road marking point and keep the rotation direction consistent. Therefore, the pose error in Table 1 is average position error and yaw angle error among several road marking points. When the robot passes through each road marking point, a timestamp is marked. Calculate the coordinate error (∆X and ∆Y), position error (∆P(m), absolute value), and yaw angle error (∆A) between the algorithm trajectory pose and the road marking point when it is at the time stamp. For each algorithm, take the average of the errors generated by all road marking points as the final experimental result. The position error rate (∆P(%)) is the ratio of the position error to the cumulative translation. The table shows our algorithm performed best, with and without closed loops. The position error was 0.206 m and 0.015 m, respectively. As shown in Figure 4, we can clearly see the error of the different methods. The path diagrams of pose measurement, as shown in Figure 4, start along the positive X-axis. Therefore, the distance between the start point and end point shows the experimental accuracy. Throughout the experiment, the car traveled for 184.3 s, the average speed was 0.264 m/s, the maximum speed was 0.591 m/s, the cumulative translation was 51.321 m, the cumulative rotation was 3428.318 • , and the chassis had no abnormalities. As this experiment was performed in an actual scene, the experimental system did not have the equipment to measure ground truth; therefore, it could only measure the distance error of the car passing several fixed road marking points. In each experiment, the robot was controlled to accurately pass the road marking point and keep the rotation direction consistent. Therefore, the pose error in Table 1 is average position error and yaw angle error among several road marking points. When the robot passes through each road marking point, a timestamp is marked. Calculate the coordinate error (∆X and ∆Y), position error (∆P(m), absolute value), and yaw angle error (∆A) between the algorithm trajectory pose and the road marking point when it is at the time stamp. For each algorithm, take the average of the errors generated by all road marking points as the final experimental result. The position error rate (∆P(%)) is the ratio of the position error to the cumulative translation. The table shows our algorithm performed best, with and without closed loops. The position error was 0.206 m and 0.015 m, respectively. As shown in Figure 4, we can clearly see the error of the different methods. As listed in Table 1, we compared the error of the X-axis, Y-axis, position, and heading angle. According to the data presented in Table 1, the accuracy of posture estimation using the monocular vision inertial wheel odometer algorithm was better than that of the VINS-Mono algorithm, and the accumulated position error was only approximately 0.2 m after Sensors 2021, 21, 5522 14 of 19 51 m. We divided the cumulative position error by the cumulative translation distance to obtain the cumulative position error rate. The position error rate of the proposed algorithm was only 0.4%, which was lower than that of VINS-Mono (1%). This experiment verified that a multi-sensor fusion odometer algorithm could perform high-precision positioning in a room-scale indoor environment, and the result was better than that of the monocular and wheel separately. By further combining complete SLAM system, ours almost have no bias.   As listed in Table 1, we compared the error of the X-axis, Y-axis, position, and heading angle. According to the data presented in Table 1, the accuracy of posture estimation using the monocular vision inertial wheel odometer algorithm was better than that of the VINS-Mono algorithm, and the accumulated position error was only approximately 0.2 m after 51 m. We divided the cumulative position error by the cumulative translation distance to obtain the cumulative position error rate. The position error rate of the proposed algorithm was only 0.4%, which was lower than that of VINS-Mono (1%). This experiment verified that a multi-sensor fusion odometer algorithm could perform high-precision positioning in a room-scale indoor environment, and the result was better than that of the monocular and wheel separately. By further combining complete SLAM system, ours almost have no bias.  Anomalies during the experiment included: 1 There was a considerable amount of dust on the ground, which decreased the friction between the wheels, leading to a slight slip during rapid turns and a significant slip during left-to-right translation; 2 Due to the fast movement of the robot, the picture of the rolling shutter camera continued to exhibit disturbances; 3 There were cable manhole covers in numerous places in the corridor, the ground surface was uneven, and there were 2-3 cm step-like undulations. The robot vibrated significantly while passing over these obstacles; 4 The corridor contained semi-open areas and closed areas, the environment brightness changed significantly, and some areas were almost completely dark; 5 The walls around the robot in some areas were covered with tiles and had reflections; 6 The corridor and hall scenes had a high degree of similarity, lacking special landmarks. Loop detection was successfully performed only when the robot passed the hall halfway and finally returned to the hall; Sensors 2021, 21, 5522 15 of 19 7 During the experiment, pedestrians appeared several times.
The RViz interface at the end of test data playback is shown in Figure 5, which includes the raw picture, feature points, matched feature points, and the path. face was uneven, and there were 2-3 cm step-like undulations. The robot vibrated significantly while passing over these obstacles; ④ The corridor contained semi-open areas and closed areas, the environment brightness changed significantly, and some areas were almost completely dark; ⑤ The walls around the robot in some areas were covered with tiles and had reflections; ⑥ The corridor and hall scenes had a high degree of similarity, lacking special landmarks. Loop detection was successfully performed only when the robot passed the hall halfway and finally returned to the hall; ⑦ During the experiment, pedestrians appeared several times.
The RViz interface at the end of test data playback is shown in Figure 5, which includes the raw picture, feature points, matched feature points, and the path. Throughout the experiment, the car traveled for 889.2 s, the average speed was 0.87 m/s, the maximum speed was 1.36 m/s, the cumulative translation was 809.27 m, the cumulative rotation was 15,502.1°, and the chassis abnormal time was 26.983 s. As shown in Figure 6 and Table 2, the uneven ground made the wheel slip, so the error was very high. As shown in the floor-scale scenario, both the VINS-Mono algorithm with loop detection disabled and ours performed better; however, there was still a cumulative positioning error that could not be neglected. Compared with the complete SLAM system, VINS-Mono and ours both significantly improved the accuracy. Additionally, ours was better than that of VINS-Mono. Throughout the experiment, the car traveled for 889.2 s, the average speed was 0.87 m/s, the maximum speed was 1.36 m/s, the cumulative translation was 809.27 m, the cumulative rotation was 15,502.1 • , and the chassis abnormal time was 26.983 s. As shown in Figure 6 and Table 2, the uneven ground made the wheel slip, so the error was very high. As shown in the floor-scale scenario, both the VINS-Mono algorithm with loop detection disabled and ours performed better; however, there was still a cumulative positioning error that could not be neglected. Compared with the complete SLAM system, VINS-Mono and ours both significantly improved the accuracy. Additionally, ours was better than that of VINS-Mono.

Robustness Verification Experiment
This time, we created experiments with sensor measurement errors or even loss. Experimental conditions included intentionally blocking the camera, resulting in the visual signal being lost for approximately 15 s. This was undertaken in the laboratory, during the movement of the robot, during which the robot kept moving and turning.

Robustness Verification Experiment
This time, we created experiments with sensor measurement errors or even loss. Experimental conditions included intentionally blocking the camera, resulting in the visual signal being lost for approximately 15 s. This was undertaken in the laboratory, during the movement of the robot, during which the robot kept moving and turning.
Throughout the experiment, the car traveled for 104.4 s, the average speed was 0.142 m/s, the maximum speed was 0.803 m/s, the cumulative translation was 17.809 m, the cumulative rotation was 1110.32°, and the chassis had no abnormalities. In the process of visual measurement loss, upon which the VINS-Mono is based, the state estimator was downgraded to inertial navigation dead reckoning. The error increased rapidly, thereby reducing the final positioning accuracy. As shown in Figure 7 and Table 3 during the loss of visual features, the path of the SLAM algorithm was the same as that measured by the wheel odometer. This shows that, although the positioning accuracy was affected by the loss of visual measurement, the pre-integration constraint can still provide absolute speed measurement; thus, the final positioning error was less than that of VINS-Mono. Throughout the experiment, the car traveled for 104.4 s, the average speed was 0.142 m/s, the maximum speed was 0.803 m/s, the cumulative translation was 17.809 m, the cumulative rotation was 1110.32 • , and the chassis had no abnormalities. In the process of visual measurement loss, upon which the VINS-Mono is based, the state estimator was downgraded to inertial navigation dead reckoning. The error increased rapidly, thereby reducing the final positioning accuracy. As shown in Figure 7 and Table 3 during the loss of visual features, the path of the SLAM algorithm was the same as that measured by the wheel odometer. This shows that, although the positioning accuracy was affected by the loss of visual measurement, the pre-integration constraint can still provide absolute speed measurement; thus, the final positioning error was less than that of VINS-Mono.

Discussion
Aiming to solve the problems of low accuracy and poor robustness of the visual inertial SLAM algorithm, we designed and implemented a tightly coupled monocular visual inertial pose estimation algorithm that integrates wheel speed information. The state estimator, based on tightly coupled multi-sensor information, is used as the core of the algorithm, and the wheel odometer pre-integration that integrates wheel speed and IMU angular velocity avoids repeated integration in the iterative optimization process. The pose of the robot in the sliding window and the visual feature points are the state to be estimated. Then, the state is initialized based on the wheel odometer and IMU, so that the initial value of the state can be calculated quickly and reliably in both stationary and moving states. Finally, the residual constraint is added to the state estimator, and the optimal robot pose is solved through nonlinear optimization. Compared with the existing SLAM algorithm for many experiments, the SLAM algorithm in this paper can achieve higher pose accuracy and robustness in environments with more extreme situations. Experimental comparison results show that the SLAM algorithm proposed in this paper is feasible and effective.
In future research, our algorithm can be improved in the following ways. First, a motion capture system or D-GPS can be employed to obtain ground truth for quantitative pose comparison analysis. Second multiple monocular cameras with non-common view relationships can be used, which can improve robustness. When some cameras are blocked, the other cameras can still provide reliable visual measurement. Further, based on the robust robot pose measurement, using binocular or depth cameras to build a dense map of the environment can meet the requirements of robot navigation and obstacle avoidance.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to the data may involve confidential information of our research institution.

Conflicts of Interest:
The authors declare no conflict of interest. IMU pre-integration between key frame i and key frame j t timê a t pre-integration of accelerated speed at time t ω t pre-integration of angular speed at time t O ij pre-fusion wheel odometer between key frame i and key frame j X * maximum posterior estimation of variables p(Z |X ) conditional probability of observing the occurrence of Z under the existed state X p(X ) prior probability (edge probability) r(·) 2 Σ Mahalanobis distance Σ covariance matrix {r p , H p } prior information from marginalization r C visual residual r O wheel odometer residual r B IMU residual ρ(·)

Abbreviations
Huber loss function (p C i , q C i ) spatial pose under key frame i r C jl error of the feature point l in the position of key frame j R F 1 F 2 rotation matrix from F 2 coordinate system to F 1 coordinate system P c j l position where the feature point l is projected onto the unit ball in the key frame j π −1 c back projection function C j camera coordinate system of the key frame ĵ P C j l position projected on the unit ball in the key frame j [b 1 b 2 ] two orthogonal base vectors on the tangent plane of the unit ball and the projection lines with the feature points in the orthogonal direction pre-integrated IMU measurement terms, observationα is position term; β is velocity term; q is rotation term δ represents the error term three-dimensional small perturbation displacement between key frame k and key frame i(the term with "ˆ" is the observation, without "ˆ" is the estimate) η os position measurement noise, Gaussian noise variance coefficient between key frame k and key frame k+1