RNGC-VIWO: Robust Neural Gyroscope Calibration Aided Visual-Inertial-Wheel Odometry for Autonomous Vehicle

: Accurate and robust localization using multi-modal sensors is crucial for autonomous driving applications. Although wheel encoder measurements can provide additional velocity information for visual-inertial odometry (VIO), the existing visual-inertial-wheel odometry (VIWO) still cannot avoid long-term drift caused by the low-precision attitude acquired by the gyroscope of a low-cost inertial measurement unit (IMU), especially in visually restricted scenes where the visual information cannot accurately correct for the IMU bias. In this work, leveraging the powerful data processing capability of deep learning, we propose a novel tightly coupled monocular visual-inertial-wheel odometry with neural gyroscope calibration (NGC) to obtain accurate, robust, and long-term localization for autonomous vehicles. First, to cure the drift of the gyroscope, we design a robust neural gyroscope calibration network for low-cost IMU gyroscope measurements (called NGC-Net). Following a carefully deduced mathematical calibration model, NGC-Net leverages the temporal convolutional network to extract different scale features from raw IMU measurements in the past and regress the gyroscope corrections to output the de-noised gyroscope. A series of experiments on public datasets show that our NGC-Net has better performance on gyroscope de-noising than learning methods and competes with state-of-the-art VIO methods. Moreover, based on the more accurate de-noised gyroscope, an effective strategy for combining the advantages of VIWO and NGC-Net outputs is proposed in a tightly coupled framework, which signiﬁcantly improves the accuracy of the state-of-the-art VIO/VIWO methods. In long-term and large-scale urban environments, our RNGC-VIWO tracking system performs robustly, and experimental results demonstrate the superiority of our method in terms of robustness and accuracy.


Introduction
Using low-cost multi-sensors for high accuracy and robust positioning is a challenging task for ground vehicles in GPS-denied urban environments [1].Approaches fusing visual, inertial, and wheel encoder measurements called visual-inertial-wheel odometry (VIWO) have received a lot of attention in recent years.Compared with traditional visual odometry (VO) [2][3][4][5][6] and visual-inertial odometry (VIO) [7][8][9][10][11][12][13][14], the additional wheel measurements can provide true-scale velocities to render the scale of VO/VIO observable, especially when the vehicle is in plane motion with constant acceleration, which is the common motion Remote Sens. 2023, 15, 4292 2 of 26 situation of the ground vehicle [15].Therefore, integrating the wheel measurements with VIO can further improve the accuracy and robustness of vehicle localization.
However, similar to VIO, VIWO still cannot avoid the pose drift caused by the lowprecision attitude acquired by the gyroscope of a low-cost IMU in complex environments.When VIO or VIWO experiences visual degradation scenes, such as weak or repetitive textures, shimmering lighting, or dynamic change, the visual information cannot accurately correct for the IMU bias [16].Limited by the accuracy of an IMU based on a microelectro-mechanical system (MEMS), the attitude integrated from the original gyroscope measurements suffers from large drift for a long time, especially due to the bias instability (BI) and angular random walk (ARW) resulting from the noise and thermal effect [17].The errors in attitude will further lead to the rapid accumulation of positional errors solved by systems that conduct motion estimation through past state recursion, such as an inertial navigation system (INS), VIO, VIWO, etc.In particular, for the INS, the positional errors of the system grow cubically in time caused by the gyroscope bias [18].In VIO/VIWO systems [7,[19][20][21], inaccurate positions will be added to the joint optimization framework as important residual constraints.Even though the existing methods have applied the pre-integration technique, it will inevitably decrease the pose accuracy calculated by the solver and further lead to long-term pose drift.
Therefore, the accuracy of IMU measurements, especially the quality of gyroscope data, is one of the important factors affecting the overall orientation and position accuracy of the VIO/VIWO system over a longer period.Thus, correcting and compensating for the errors of the MEMS IMU gyroscope is crucial to obtain long-term attitude stability, which further helps to improve the overall accuracy of the VIO/VIWO system.Currently, for VIO/VIWO systems, to limit the long-term drift as much as possible, one approach is to introduce the global navigation satellite system (GNSS) [13,22,23].The GNSS can provide global positioning information, helping to eliminate cumulative errors.Nevertheless, satellite signals are susceptible to multipath fading and shadow effects, which affect the accuracy of the GNSS, especially in the city canyon, tunnel, or underground parking lot environments.Another method is to employ loop closing, which can effectively reduce the positioning drift but is hard to apply in large-scale outdoor environments [24,25].Among the commonly proposed methods, some other researchers also employ a pre-built map to bound the long-term errors by matching the map features with the on-the-fly sensor readings [26].In contrast to these works, de-noising the gyroscope measurements to improve long-term positioning accuracy does not need to rely on the additional sensors, specific motion patterns, and prior maps.
In recent years, deep-learning-based methods have been introduced into inertial navigation systems, where motion can be inferred from the IMU, such as pedestrian motion state estimation [27,28].These studies show that human motion priors can be learned from IMU measurements through networks.Meanwhile, the results show that the accuracy based on deep learning is comparable to that of the VIO algorithm.At present, inertial learning is rarely used in robots.Reference [29] uses a CNN to correct IMU noise and bias and directly integrates the de-noised IMU to obtain the orientation.The de-noised gyroscope is also applied to the VIO methods.Similar efforts directly replace the original measurements in the VIO approach with network outputs [30][31][32].The way of fusing network outputs with VIO is very simple and cannot significantly improve the accuracy of the traditional VIO methods.
In this paper, aiming at reducing the long-term drift of VIWO deployed for autonomous vehicles, we present a tight-coupled nonlinear optimization method, which effectively integrates the robust neural gyroscope data with the traditional visual-inertialwheel odometry to perform long-term pose estimation in complex driving environments.Our approach is motivated by the observation that calibrating the errors of MEMS gyroscope measurements based on neural networks can provide more accurate attitude estimation, which can further provide better directional constraints for the VIWO method to cure long-term drift.The main contributions of this work include: Remote Sens. 2023, 15, 4292 3 of 26 (1) A robust neural calibration network for low-cost IMU gyroscopes called NGC-Net is proposed, which leverages the temporal convolutional network to extract the error features from the raw IMU.By effective data enhancement strategy, well-designed network structure, and multiple losses considered, our experiments show the proposed NGC-Net can achieve better de-noising performance.(2) We design an effective fusion strategy to combine the advantages of network outputs and VIWO methods and further propose a novel multi-sensor fusion tracking method to reduce the long-term drift using the heading obtained by our NGC-Net outputs.(3) Through a series of experiments on public datasets, our NGC-Net has better performance than both learning methods and competes with VIO methods.We implement the RNGC-VIWO system and validate the proposed method in complex urban driving datasets.Compared with state-of-the-art methods, our method can significantly improve the accuracy and robustness of vehicle localization in long-term and largescale areas.
The structure of this paper is as follows.Section 1 is the introduction.Section 2 presents the related work.Section 3 elaborates on the details of our method.The experimental results are presented in Section 4. Finally, a brief conclusion and outlook are given in Section 5.

Related Work
Over the past decades, localization methods by fusing multi-model sensors have become popular research.Autonomous driving system (ADS) vehicles equipped with multi-modal sensors (such as cameras, GNSS, LiDAR) provide multi-source sensor data to perceive the surrounding traffic environment and obtain vehicle position, speed, and attitude information [33].We mainly review related work on vision-based methods and IMU correction approaches.

Vision-Aided or -Based Methods
Over the past few decades, great progress has been made in the research of monocular visual odometry/SLAM.The typical studies include PTAM [2], SVO [3], and ORB-SLAM [5].While visual odometry can achieve high precision in ideal environments, it often fails when dealing with untextured areas, motion blur, and severe lighting variations.In addition, it cannot estimate the true scale of the scene, resulting in scale drift.
To reduce the dependence on visual information and obtain high-precision and robust localization, researchers have proposed visual-inertial odometry (VIO) that combines visual information with IMU measurements.According to the VIO algorithm framework, it can be divided into filter-based and optimization-based approaches.Popular filtering methods include MSCKF [34] and Open-VINS [7], which achieve great estimation performance.The typical optimization method is OKVIS [8], which infers a probabilistic cost function that minimizes visual reprojection errors and pre-integrated IMU errors.VINS-Mono [11] is another robust visual-inertial SLAM system.Compared to OKVIS, it has robust initialization, relocalization, and pose graph optimization.VINS-Fusion [12,13] is an extension of VINS-Mono that supports multiple sensor combinations, including fusing VINS with GPS.Under the constraints of IMU pre-integration, ORB-SLAM3 optimizes camera poses in a covisibility graph [14].Although these VIO systems achieve impressive accuracy and robustness, they will suffer from large drift if running in complex environments for a long time, which is insufficient for autonomous vehicle applications [35].
Recent research has incorporated wheel encoder measurements into VIO to improve positioning accuracy and robustness.Reference [15] demonstrates that the VIO system could not estimate the correct scale with constant acceleration motion, nor could it estimate good roll and pitch without rotational motion.To address these issues, wheel encoder readings with the VIO are integrated in a tightly coupled framework to make the scale observable.Reference [19] proposes a multi-sensor fusion SLAM algorithm using monocular vision, IMU, and wheel encoder measurements.Reference [20] presents a tightly coupled monocular visual SLAM using wheels and a MEMS gyroscope and introduces the wheel odometer error term into the optimization process, which is a complete SLAM framework and can increase the accuracy and robustness of localization.However, the use of wheel odometers for positioning is mainly focused on "planar" applications, which are usually only suitable for indoor environments.In [21], IMU-odometer pre-integration is introduced into the initialization and optimization of VIO systems, and an online extrinsic calibration is designed to improve the accuracy dramatically.Reference [36] propagates the system state by angular from a gyroscope and linear velocity obtained from a wheel odometer and also adds GPS directly to make position observable.In [37], an effective MSCKF-based VIWO method is developed to fuse IMU, camera, and pre-integrated wheel measurements, which calibrates the intrinsic and spatiotemporal extrinsic parameters between sensors online to further improve the overall accuracy of the system.Although incorporating wheel encoder measurements can improve the accuracy of the VIO system, it still cannot avoid long-term drift caused by the limitations of vision and the low accuracy of MEMS IMUs in complex driving environments.

IMU Correction Methods
The IMU measurement output is often modeled as a linear polynomial equation for systematic errors, such as constant bias, scale factor, and axis misalignment error.The coefficient parameters are determined using high-precision external equipment, such as a three-axis turntable.Reference [38] presents a calibration approach for low-precision MEMS IMUs using a nonlinear model and the transformed unscented Kalman filter (TUKF) with a turntable.Reference [39] proposes a self-calibrated visual-inertial odometry to estimate the IMU scale factor and axis misalignment error using an extended Kalman filter-based pose estimator.In addition, the vehicle chassis sensors and vehicle kinematics/dynamics reflect the vehicle's own characteristic information such as wheel speed, steering wheel angle, and sideslip angle estimation, which provide the longitudinal and lateral vehicle velocities to correct IMU errors [40][41][42].For example, according to the error dynamics and observation equations, the degree of the observability of the yaw misalignment is analyzed, and the yaw misalignment of the IMU is estimated by using a Kalman filter [43].
Many researchers have introduced deep learning techniques into the visual-inertial navigation field to improve navigation and positioning accuracy.VI-Net uses the LSTM network to extract motion features from the raw IMU data and fuses these features directly with image features for pose estimation [44].IONet uses a two-layer LSTM network to learn the IMU measurements of smartphones and to track user movement over time [28].A tight learned inertial odometry (TLIO) is proposed to extract original IMU features using ResNet and estimate 3D displacement and its uncertainty, allowing them to be fused tightly in an extended Kalman filter (EKF) to estimate pose, speed, and biases for pedestrian dead reckoning [27].RNIN-VIO also uses an LSTM-style IMU neural network to learn pedestrian movement priors from raw IMU data and fuses network outputs and visualinertial information into the EKF to improve the robustness of VIO [45].To sum up, these position estimation algorithms based on deep learning achieve good results compared with traditional methods in different scenarios.With the increasing complexity of scenes, accurate position estimation alone can no longer meet the existing task requirements.Recently, the use of deep learning technology to calibrate IMU errors has begun to attract the attention of researchers, and existing studies have shown that MEMS IMU calibration based on deep learning is feasible.Reference [46] proposes a convolutional neural network (CNN) to reduce accelerometer error.Reference [47] uses a neural structure based on an LSTM to estimate the attitude and introduces an efficient algorithm to calibrate the bias of the gyroscope.In [30], the depthwise separable convolution is used to compensate for IMU errors, which can improve the localization accuracy of inertial navigation.Reference [32] uses dilation convolution for raw IMU feature extraction and outputs the gyroscope error compensation.These network estimation results in gyroscopes are also applied to VIO methods [29][30][31][32], but the fused accuracy gains over traditional VIO are very limited, which does not fully reflect the advantages of the network calibration and VIO methods.Based on VINS-Mono [11], the proposed approach applies a new IMU learning network and fusing strategy to achieve more efficient and accurate pose estimation.The framework is shown in Figure 1.Unlike previous methods [29][30][31][32], the de-noised gyroscope outputs are simply directly input into the mature process of the existing VIO method.Our method effectively fuses the de-noised gyroscope outputs with the existing VIO method in different modules, better combining the advantages of deep learning with traditional VIO methods.Our system consists of four modules: measurements preprocessing, initialization, nonlinear optimization, and yaw attitude correction.
Remote Sens. 2023, 15, x FOR PEER REVIEW 5 of 27 based on an LSTM to estimate the attitude and introduces an efficient algorithm to calibrate the bias of the gyroscope.In [30], the depthwise separable convolution is used to compensate for IMU errors, which can improve the localization accuracy of inertial navigation.Reference [32] uses dilation convolution for raw IMU feature extraction and outputs the gyroscope error compensation.These network estimation results in gyroscopes are also applied to VIO methods [29][30][31][32], but the fused accuracy gains over traditional VIO are very limited, which does not fully reflect the advantages of the network calibration and VIO methods.

System Overview
Based on VINS-Mono [11], the proposed approach applies a new IMU learning network and fusing strategy to achieve more efficient and accurate pose estimation.The framework is shown in Figure 1.Unlike previous methods [29][30][31][32], the de-noised gyroscope outputs are simply directly input into the mature process of the existing VIO method.Our method effectively fuses the de-noised gyroscope outputs with the existing VIO method in different modules, better combining the advantages of deep learning with traditional VIO methods.Our system consists of four modules: measurements preprocessing, initialization, nonlinear optimization, and yaw attitude correction.In the data preprocessing phase, for IMU inputs, a robust neural calibration network for low-cost IMU gyroscope called NGC-Net is first proposed.Then, the IMU pre-integrations between two consecutive frames are calculated using the de-noised gyroscope outputs and raw accelerometer measurements.The wheel odometer pre-integrations between two consecutive frames are also calculated using the de-noised gyroscope outputs In the data preprocessing phase, for IMU inputs, a robust neural calibration network for low-cost IMU gyroscope called NGC-Net is first proposed.Then, the IMU pre-integrations between two consecutive frames are calculated using the de-noised gyroscope outputs and raw accelerometer measurements.The wheel odometer pre-integrations between two consecutive frames are also calculated using the de-noised gyroscope outputs and the velocities transformed by wheel encoder readings.To maintain the best orientation accuracy, the z-axis bias of the gyroscope obtained from the nonlinear optimization is not used, which is set to zero.
In the initialization phase, thanks to our NGC-Net, we only solve the gravity direction in the first frame and the optimized velocity for each frame and no longer solve the gyroscope bias, which can help to improve the speed of initialization.
In the nonlinear optimization stage, combining the advantages of VIO and NGC-Net, we do not use the z-axis bias of the gyroscope calibrated online and only update the preintegration of IMU and odometer according to the calibrated x-axis and y-axis biases of the gyroscope.
In the phase of yaw attitude correction, the horizontal attitude obtained by integrating the de-noised gyroscope angular velocity is used to correct the yaw attitude of each optimized frame, and the corrected poses are further fed back into the sliding window for the following keyframe optimization and output as the final pose.

Notation
We now define notations and coordinate systems throughout the paper.The coordinate systems of the sensors are illustrated in Figure 2. The camera, IMU, and wheel encoder are mounted on a vehicle, and the wheel encoder is installed on the left rear wheel.
we do not use the z-axis bias of the gyroscope calibrated online and only upd integration of IMU and odometer according to the calibrated x-axis and y-ax the gyroscope.
In the phase of yaw attitude correction, the horizontal attitude obtained b ing the de-noised gyroscope angular velocity is used to correct the yaw attitu optimized frame, and the corrected poses are further fed back into the sliding w the following keyframe optimization and output as the final pose.

Notation
We now define notations and coordinate systems throughout the paper.T nate systems of the sensors are illustrated in Figure 2. The camera, IMU, and coder are mounted on a vehicle, and the wheel encoder is installed on the left Figure 2 shows the top view of the coordinate systems, in which ⊙ deno perpendicular to the paper plane outward, and ⊗ denotes the axis perpendic paper plane inward.Each sensor has its individual local coordinate system, a (•) , (•) , (•) , and (•) to denote the world coordinate system, the camera system, the IMU coordinate system, and the odometer coordinate system, re The world coordinate system (•) is the coordinate system of the first IMU gravity rotation correction when the system is initialized, and it is fixed once  = [0; 0; ] T is the prior gravity vector in the world coordinate system, an 9.81007.We use (•) , (•) , and (•) to represent measurements or estimat camera frame, IMU frame, and odometer frame corresponding to the time of th frame, respectively.Finally, we denote (⋅) as the estimated states or the origina ments of a certain quantity with some noise, and (•) as the de-noised measure certain quantity.

Gyroscope Correction Model
A typical low-cost inertial measurement unit (IMU) usually consists of a gyroscope and a three-axis accelerometer.The gyroscope measures angular v Figure 2 shows the top view of the coordinate systems, in which denotes the axis perpendicular to the paper plane outward, and ⊗ denotes the axis perpendicular to the paper plane inward.Each sensor has its individual local coordinate system, and we use (•) W , (•) C , (•) I , and (•) O to denote the world coordinate system, the camera coordinate system, the IMU coordinate system, and the odometer coordinate system, respectively.The world coordinate system (•) W is the coordinate system of the first IMU frame after gravity rotation correction when the system is initialized, and it is fixed once initialized.G W = [0; 0; g] T is the prior gravity vector in the world coordinate system, and here g = 9.81007.We use (•) C k , (•) I k , and (•) O k to represent measurements or estimations in the camera frame, IMU frame, and odometer frame corresponding to the time of the k th image frame, respectively.Finally, we denote (•) as the estimated states or the original measurements of a certain quantity with some noise, and ( ∼ • ) as the de-noised measurements of a certain quantity.

Gyroscope Correction Model
A typical low-cost inertial measurement unit (IMU) usually consists of a three-axis gyroscope and a three-axis accelerometer.The gyroscope measures angular velocity ωi , the accelerometer measures acceleration âi in the IMU coordinate system, and the output model of the IMU sensor can be represented as seen below [48,49]: where ω i and a i are the actual values of the gyroscope angular velocity and accelerometer linear acceleration, which are affected by the gyroscope bias b ω i , the accelerometer bias b a i , and the corresponding additive noises n ω i and n a i .The additive noises are zero-mean white Gaussian noises.C is the intrinsic calibration matrix (C ≈ I 6 ) for the IMU model, which can be given by where L ω and M ω and L a and M a represent the scale factor and axis misalignment matrix of the gyroscope and the accelerometer, respectively, all of which are approximately equal to the identity matrix.C * denotes the linear accelerations on the gyroscope measurements, i.e., g-sensitivity [29], and it is approximately equal to 0 3×3 .The gyroscope angular velocity after error calibration can be shown as ( ∼ where ∼ ω i is the de-noised gyroscope angular velocity, and ωi is the original angular velocity of the gyroscope.Both C −1 ω and δω i affect gyroscope calibration and attitude estimation.C −1 ω contains scale factor and axis misalignment errors of the gyroscope, i.e., as the main gyroscope error correction term, which is a time-varying error.In addition, the acceleration a i can also provide information for the angular velocity correction, since it has a certain influence on the error δω i and should be included as part of the input of the neural network to reduce the impact on the error.
We now need to estimate δω i and C −1 ω .The neural network described in Section 3.2.3predicts δω i by leveraging the gyroscope and accelerometer measurements in a past local window of size N.In most cases, C −1 ω can be taken as an identity matrix, so we set it as the static parameter initialized at I 3 and then set it as the trainable variable optimized in the training process.Here, the form of network can be expressed as where f (•) represents the nonlinear function defined by the proposed NGC-Net.âi−N and ωi−N respectively represent the raw IMU at the time corresponding to the (i-N) th inertial frame.âi and ωi are the IMU at the time corresponding to the i th inertial frame, and N is the size of the local window.After modeling the gyroscope correction using the deep neural network, the NGC-Net parameters are updated by calculating the loss function between the ground truth and the predicted attitude during the training process.By using iterative training, a well-trained network model can be obtained to compute the de-noised gyroscope angular velocity ∼ ω i by subtracting δω i from C −1 ω ωi , and further, more accurate attitude can also be obtained using integration of the de-noised gyroscope outputs.

Data Preprocessing
To improve the performance of our network, we also adopt data enhancement strategy, which can get diverse data and help avoid overfitting.Firstly, considering different types of IMUs may have different white Gaussian noise and biases.In order to reduce the sensitivity to different IMU noises in the training phase, Gaussian white noise and biases are added randomly to the raw IMU data.Secondly, as the network has only access to the IMU measurements, it suffers from the same observability problem that yaw is not observable.To deal with this unobservability problem, a yaw angle rotation is randomly added to each sample data to learn the invariance characteristics of yaw.Thirdly, in order to strengthen training, referring to [50], we also integrate 20%, 40%, 60%, 80%, and 100% measurements to get multiple losses.This is because for multi-sensor fusion algorithm, IMU integration may be required in different durations.It is worth noting that only the enhanced data are fed into the network.

Network Structure
Previous work has shown that temporal convolutional network (TCN) convincingly outperforms baseline recurrent architectures to solve time-series data modeling problems.Compared with canonical networks such as LSTMs and GRUs, TCN has the advantages of stable gradient and flexible acceptance field, requiring less computing resources, and its structure is simpler and clearer [51].
As shown in Figure 3, we design the NGC-Net based on the architecture of TCN, which is composed of six residual blocks with 32, 64, 128, 256, 72, and 36 channels, respectively, and an output layer with a 1D convolutional layer.We measurements to get multiple losses.This is because for multi-sensor fusion algorithm, IMU integration may be required in different durations.It is worth noting that only the enhanced data are fed into the network.

Network Structure
Previous work has shown that temporal convolutional network (TCN) convincingly outperforms baseline recurrent architectures to solve time-series data modeling problems.Compared with canonical networks such as LSTMs and GRUs, TCN has the advantages of stable gradient and flexible acceptance field, requiring less computing resources, and its structure is simpler and clearer [51].
As shown in Figure 3, we design the NGC-Net based on the architecture of TCN, which is composed of six residual blocks with 32, 64, 128, 256, 72, and 36 channels, respectively, and an output layer with a 1D convolutional layer.We set the same kernel size k = 5 for each residual block and dilation factors d = 1, 2, 4, 8, 16, 32, which increase exponentially with the depth of the network (i.e., d = O(b m ), b = 2) at level m of the network [52]).The kernel size k and the dilation factors d determine the receptive field N of NGC-Net, which is 505 in Equation ( 4).As shown in Figure 3, each residual block contains two layers of dilated causal convolution.Gaussian error linear unit (GELU) [53] function is adopted to extract hidden features from the data, and weight normalization is performed for each residual block [54].In order to avoid overfitting, a spatial dropout [55] is added after each dilated convolution for regularization.Furthermore, a residual connection is introduced into each residual block to maintain the stabilization of deeper and larger TCN [51].We note that As shown in Figure 3, each residual block contains two layers of dilated causal convolution.Gaussian error linear unit (GELU) [53] function is adopted to extract hidden features from the data, and weight normalization is performed for each residual block [54].In order to avoid overfitting, a spatial dropout [55] is added after each dilated convolution for regularization.Furthermore, a residual connection is introduced into each residual block to maintain the stabilization of deeper and larger TCN [51].We note that our NGC-Net is relatively preliminary and can be further optimized in the future.However, our results show that the current network is already superior to competing methods, so the improved accuracy of the de-noised gyroscope readings provides a more robust justification for the proposed multi-sensor fusion strategy.

Loss Function
We use the Log-cosh loss for gyroscope correction in the proposed NGC-Net.Theoretically, the loss function should be defined as the error between the real angular velocity and the estimated value.However, on the one hand, the corrected angular velocity is generally consistent with the IMU frequency, in hundreds of Hz, much higher than the frequency of the current best tracking systems, which are generally accurate at 20-120 Hz [29].On the other hand, supervised learning methods require ground truth references for training, but many datasets usually provide attitude information rather than true gyroscope measurements.Thus, in order to conveniently calculate the loss and achieve better calibration performance, similar to [29], we also use the integrated orientation increment errors rather than the angular velocity errors to construct the loss function.The integral orientation increments can be expressed as (5): where exp(•) is the exponential map in the SO(3).
∼ ω j is the corrected gyroscope angular velocity.δt is the time interval between two consecutive gyroscope measurements.l is the integral increment length.
The Log-cosh loss for a given l can be computed as follows: where log(•) is the SO(3) logarithm map, and the loss approximately equals )) 2 /2 for the small loss and |log(δR i,i+l δ for the large loss.δRi,i+l and δ ∼ R i,i+l are the real orientation increments and the estimated orientation increments, respectively.Additionally, we also use a regularization loss as follows: where ωi is the raw gyroscope measurement, ∼ ω i is the de-noised gyroscope data, λ is a controllable parameter, only when the de-noised data deviates from the original measured value by more than a threshold value, it will punish the de-noised data to ensure the fast convergence of the network.

Implementation Details
For the training of our NGC-Net, we use a desktop environment equipped with Intel (R) Core(TM) i7-6700HQ 2.60 GHz CPU and NVIDIA GRID RTX8000-12Q 11 GB RAM.The framework of NGC-Net is implemented using PyTorch 1.5.The Adam optimizer [56] is used during training.We set the initial learning rate at 0.001 and adjust the learning rate adaptively.To prevent overfitting, the weight decay is set to 0.1 and the dropout parameter to 0.2.The model with the best validation loss is chosen as the best model for testing.

Image Processing
Similar to [11], we also use FAST [57] to detect feature points from each new image and track them between consecutive frames using the KLT sparse optical flow algorithm [58].However, since the optical flow tracking method may result in lots of mismatches in complex environments, we conduct a reverse optical flow tracking method to reject the outliers, similar to VINS-fusion [13].

De-Noised IMU and Odometer Pre-Integration
In the pre-integration step, both the de-noised IMU pre-integration and odometer pre-integration need to be calculated at the same time to provide more constraints for system initialization and back-end nonlinear optimization.The pre-integration of IMU and odometer in discrete time is shown in (8): where i is the discrete moment corresponding to an IMU measurement within [T k , T k+1 ], and T k and T k+1 represent the corresponding time of the kth and (k + 1)th image keyframes, respectively.δt i is the time interval between two IMU measurements of i and i + 1. αI , βI , and γI denote the IMU position, velocity, and attitude pre-integration in the IMU coordinate system, respectively.ηI represents the position pre-integration of the odometer in the IMU coordinate system.At time i = T k , αI k i , βI k i , γI k i , ηI k i are all zero.R I o is the extrinsic rotation matrix between the IMU sensor and the odometer.In this paper, the external parameters of the IMU and odometer are not corrected online; instead, we use the values calibrated offline directly.âi , ∼ ω i , and vi represent the accelerometer measurement, the de-noised gyroscope measurement, and the speed provided by wheel encoder reading at the time i. b a i and b ω i represent the bias of accelerometer and gyroscope at the time i.
Since our NGC-Net described in Section 3.2 can effectively reduce the noises of the gyroscope's original measurements, we use the de-noised gyroscope outputs instead of the raw gyroscope measurements to calculate these pre-integrations to achieve more accurate attitude, thus further improving the state estimation accuracy during the initialization and back-end nonlinear optimization.
It should be noted that to accurately calculate the IMU pre-integration and odometer pre-integration between the kth and (k + 1)th image keyframes, we need to align the timestamps between image frames and IMU and odometer measurements.As implemented in [11], we interpolate the IMU measurements and wheel measurements based on the timestamps of the image frames and obtain all the IMU measurements and wheel measurements between the two image frames for calculating the IMU pre-integration and odometer pre-integration, respectively.
In the original VIO algorithm framework, to improve the accuracy of IMU input, IMU bias (including gyroscope bias) is calibrated online during the optimization stage.However, as demonstrated by [29][30][31][32] and our experiments in Section 4.3, we note that the horizontal direction integrated directly from the de-noised gyroscope can be more accurate than the results obtained by the state-of-the-art VIO methods.Thus, we always set (b ω i ) z = 0 when calculating the IMU and odometer pre-integration using the de-noised gyroscope measurements to maintain the best orientation accuracy.
The error state propagation equation of (8) at the discrete time is shown in Equation ( 9): where δ θI k i is the perturbation of rotation error, and its relationship with the rotation attitude is γ • × means the skew-symmetric matrix corresponding to a vector.R γI k i is the rotation matrix of attitude γI k i .n a ,n ω , and n v represent the noises in accelerometer, gyroscope, and odometer measurements, respectively, which are commonly Gaussian white noises.n b a and n b ω denote the derivatives of accelerometer bias and gyroscope bias that are also white noises.
According to Equation ( 9), we can get the Jacobian matrix of αI , βI , γI , and ηI with respect to the IMU biases.The covariance matrix of each pre-integration term can also be obtained by forward propagation of the covariance.The calculation process is similar to [21], which is not explained in detail here.

System Initialization
During the initialization stage, the traditional monocular vision-inertial VIO system aims to recover the true scale of the visual map, the true scale velocity of each frame, the direction of gravity in the first IMU body frame, and the gyroscope biases.Thanks to our NGC-Net, we no longer need to solve the gyroscope bias in the initialization phase.In addition, the odometer can obtain real-scale speeds and positions combined with de-noised gyroscope measurements, helping to recover the real scale of monocular visual maps, i.e., obtaining the real-scale depth of map points.Therefore, in this initialization stage, our system only needs to solve the gravity direction and optimization velocity for each frame at the same time, which improves the speed of initialization.
For the consecutive kth frame and (k + 1)th frame, we align the IMU pre-integration with the odometer pre-integration to construct the joint equation, as shown in Equations ( 10) and (11): where δT k means the time interval between image frames k and k + 1, namely, represent the velocity of k and k + 1 frames in the IMU body to be solved.g I 0 denotes the value of gravity in the first IMU body frame to be computed.P I o denotes the value of the extrinsic translation between the IMU and odometer, which is known because we directly use the values calibrated offline.
The equations constructed using every pair of consecutive frames in the sliding window are concatenated to establish linear simultaneous equations.By solving the linear least-squares problem, the velocity of each frame and the gravity vector g I 0 in the first IMU frame can be solved.
Since the magnitude of gravity is usually a known quantity, the value used in this paper is g = 9.81007.To facilitate the iterative optimization of the gravity vector, we reset and put it into (10) and (11).By resolving the new least-squares problem, the optimized gravity vector and the velocity of each frame are computed.Here, b 1 and b 2 are a pair of orthogonal bases in the tangent space of the gravity vector ĝI 0 i .ĝI 0 i is the i th iteration solution of g I 0 .ϕ 1 and ϕ 2 represent the values adjusted for the gravity vector ĝI 0 i along the b 1 and b 2 directions, which are solved from the new least-squares problem.
We repeat the iterative steps until ĝI 0 convergence.In this way, we need not assume the vehicle moves on an approximately flat surface at this stage because we use the first solution computed from Equations ( 10) and (11) as the initial value for the subsequent optimization iterations.
Then, the first IMU frame is rotated to the world coordinate system.Finally, the optimized velocities and the IMU measurements are integrated to calculate the initial poses of all keyframes in the sliding window, and the feature points tracked between these frames are triangulated.

Nonlinear Optimization
The purpose of back-end nonlinear optimization is to combine a variety of measurements, such as camera, IMU and odometer, for high precision and robust estimation of vehicle states.To balance the requirements of computational load and real-time performance, similar to [11,12], we also use the sliding window with partial marginalization for nonlinear optimization.
The state vector to be estimated in the sliding window is defined as (12): where x k represents the vehicle state at the time that the kth image is captured, including the position p w I k , velocity v w I k , orientation q w I k in the world frame, acceleration bias b a k , and gyroscope bias b ω k in the IMU body frame.n is the total number of keyframes in the sliding window.λ l is the inverse depth of the lth feature from its first observation.x I C is the extrinsic parameters between the camera and IMU, including the relative position p I C and orientation q I C .We do not correct the extrinsic parameters between IMU and odometer but directly use the parameters calibrated offline.
At this stage, since the bias-corrected acceleration may provide more precise constraints for solving the pitch angle and roll angle, we further use the calibrated x-axis and y-axis biases of the gyroscope to update the pre-integration of IMU and odometer accordingly.However, as described in Section 3.3.2,since the learned horizontal direction can be more accurate than the result obtained by the state-of-the-art VIO methods (demonstrated by [29][30][31][32] and our experiments in Section 4.3), we do not update these pre-integrations using the z-axis bias of the gyroscope calibrated online; that is, we set b ω k z to 0.
The objective function is composed of three residual terms, which are the visual reprojection residual, the IMU-odometer pre-integration residual, and the marginalization residual.The objective function is defined as follows: In Equation ( 13), the first term represents the marginalization residual, and r p , H p is the prior information from marginalization.r Ω ẑI k I k+1 , χ is the IMU-odometer preintegration residual, and Ω represents the set of IMU-odometer pre-integration in the sliding window.P is the covariance matrix of IMU-odometer pre-integration from frame k to frame k + 1. r Ψ ẑc j l , χ is the visual reprojection residual.Ψ is the set of features and the corresponding frames in the current sliding window.l means the lth feature in Ψ, and c j is the j th image frame.P c j l is the uniform covariance matrix used for visual reprojection, and ρ is the robust kernel function.
Here, we give the combined residual function of IMU-odometer pre-integration, as shown in Equation ( 14): Similar to [21], .In this paper, Ceres Solver [59] is also used for solving this nonlinear problem.

Yaw Attitude Correction
The VINS system has 4 unobservable degrees of freedom (DOF) corresponding to 3 DOF of global translation and 1 DOF of rotation around the gravity vector (namely the yaw angle) [15].Although the pre-integrations in Section 3.3.2can provide relatively accurate residual constraints for the nonlinear optimization in Section 3.3.4,the yaw angle accuracy of the pose after optimization may still not be ideal.
In this stage, we further correct the yaw angle of each keyframe in the sliding window after nonlinear optimization to obtain the best attitude accuracy.Since the accuracy of attitude plays an important role in estimating the long-term trajectory, the corrected yaw angles are beneficial to improve the accuracy of our VIWO system.As shown in Section 4.4, we finally obtain more accurate trajectory results based on the accurate yaw attitude obtained by our NGC-Net.
For the correction steps, we first obtain the yaw angle of each keyframe in the sliding window by directly integrating the de-noised gyroscope measurements.Then, these yaw angles are substituted into the optimized poses of all keyframes in the sliding window.Finally, on the one hand, the corrected pose of the current keyframe is output as the final result.On the other hand, the corrected poses (besides the marginalized keyframe) in the sliding window are further used for the following keyframe optimization.
It is worth noting that we only substitute the yaw angle for the pose of each optimized keyframe during this stage, not including the pitch and roll angle.

Experiments
In this part, a series of experiments are carried out to verify the de-noising performance of our proposed NGC-Net and the accuracy and robustness of the multi-sensor fusion localization method assisted by the neural gyroscope calibration proposed in this paper.

Baselines
We compare our method with existing methods, including the following: (1) Raw IMU: Orientation computed using the original IMU readings.
(5) VINS-Mono: Representative of state-of-the-art visual-inertial odometry with an opensource code [11].( 6) Open-VINS: A state-of-the-art filter-based visual-inertial estimator for which we choose the stereo and IMU configuration [7].(7) Proposed VIWO: An optimization-based monocular visual-inertial-wheel odometry developed based on VINS-Mono without the aid of NGC-Net, which is similar to the work proposed by [11] without the online IMU-odometer extrinsic parameter calibration module.(8) Proposed VIWO+NGC: A method that is the same as method ( 7), but the gyroscope inputs are the NGC-Net outputs rather than the raw gyroscope measurements.(9) Proposed RNGC-VIWO: The method described in Section 3.3, in which the de-noised gyroscope measurements are effectively fused into the overall framework.
We divide the baseline methods into two categories corresponding to the two purposes of our experiments:

•
3D orientation estimates based on the learning method and VIO method.We compare these deep learning methods to demonstrate the de-noising performance of our NGC-Net.For a fair comparison, we use the same training sequence and test sequence.Since the OriNet has not been published and only provides test results on the EuRoC MAV Dataset [60], we do not compare our method with the OriNet on the KAIST Urban Dataset [61].The DIG-Net is not tested on the KAIST Urban Dataset, although it is open-sourced, and we use the default network parameters and take the best training results as the network output.We also compare the VIO methods (VINS-Mono, Open-VINS) to demonstrate that our NGC-Net can accurately estimate the orientation and compete with VIO methods.

•
6DOF pose estimates based on multi-sensor fusion.We further evaluate the localization performance of our proposed RNGC-VIWO on the KAIST dataset, which is equipped with the stereo camera, IMU, and wheel odometer for vehicle localization.However, an open-source algorithm with the same sensor configuration cannot be found at present.Thus, we compare our RNGC-VIWO with the proposed VIWO, proposed VIWO+NGC, and the state-of-the-art VIO methods (VINS-Mono, Open-VINS).In addition, reference [37] proposes an excellent VIWO algorithm and tests it on KAIST urban39; we directly include the test results of urban39 in Section 4.4 for comparison since it is not open-source.

Metrics Definitions
We evaluate the method in terms of 3D orientation/yaw and 3D translation estimates, which are defined as follows: (1) Absolute Yaw Error (AYE): The AYE computes the root mean square error between the ground truth and estimated heading as the following equation: where n is the sequence length, and θ i and ∼ θ i are the ground truth and estimated yaw angle at the instant i.
(2) Absolute Orientation Error (AOE): The AOE calculates the root mean square error between the ground truth and estimated orientation as where log(•) is the SO(3) logarithm map, n is the sequence length, and R i and ∼ R i are the ground truth and estimated orientation at the instant i.
(3) Absolute Translation Error (ATE): The ATE calculates the root mean square error between the ground truth and estimated position as where P i and ∼ P i are the ground truth and estimated position at the instant i.

Neural Gyroscope Calibration Network Performance
(1) Evaluation of the EuRoC MAV Dataset: The dataset provides stereo image and IMU data from a micro aerial vehicle including 11 sequences [60].The IMU measurements are measured using the ADIS16448 at 200 Hz, and 6DOF pose ground truth is provided.As noticed in [7], the ground truth of the V1_01 easy sequence has incorrect orientation values, and we refer to the corrected ground truth provided by reference [7].
The same training and test sequences are used with the corresponding baseline methods [29,47] to make a fair comparison.The training set is defined as the first one and a half minutes of six sequences, namely V1_02_medium, V2_01_easy, V2_03_difficult MH_01_easy, MH_03_medium, and MH_05_difficult, and the validation set is the rest of these sequences.The test set contains five sequences of MH_02_easy, MH_04_difficult, V1_01_easy, V1_03_difficult, and V2_02_medium.
As shown in Table 1, our proposed NGC-Net is superior to the original IMU data in direction and yaw angle estimation.Compared with the other two learning methods, our proposed NGC-Net achieves the best performance on most test sequences, the AOEs and AYEs of which are 1.53/0.85degrees.When comparing with VIO methods, we find that our NGC-Net is even comparable to the VIO method in orientation estimation.In addition, the mean yaw error of our NGC-Net is 0.85 degrees, which is better than all VIO methods such as Open-VINS with 1.37 degrees and VINS-Mono with 2.14 degrees.The orientation estimates and errors of the different methods on the test sequence of V1_01_easy and V1_03_difficult are also shown in Figure 4 and Figure 5, respectively.In addition, the IMU of the EuRoc MAV Dataset runs at 200 Hz, the number of network model parameters is about 856,013, and the size of the network model is only 3.5 MB.The 1200 epochs of training take about 7 min.
(2) Evaluation of the KAIST Urban Dataset: This dataset contains a variety of sensor measurements, such as stereo camera images, IMU measurements, wheel encoder readings, etc., which can provide the required inputs for our multi-sensor fusion localization system and is the main target dataset for our study.It also provides the ground truth generated by high-precision gyroscope, VRS-GPlS, and LiDAR sensors.The commercial-grade IMU measurements are obtained with Xsens MTi-300, and the frequency is 100 Hz.More details of the dataset are in [61].
The results of DIG-Net in Table 2 are the best results of dozens of training experiments using the default network parameters provided by [29], except for some parameters that must be adjusted to train on the KAIST Urban Dataset, such as IMU frequency, training sequences, etc.     (2) Evaluation of the KAIST Urban Dataset: This dataset contains a variety of sen measurements, such as stereo camera images, IMU measurements, wheel encoder r ings, etc., which can provide the required inputs for our multi-sensor fusion localiza system and is the main target dataset for our study.It also provides the ground t generated by high-precision gyroscope, VRS-GPlS, and LiDAR sensors.The commer grade IMU measurements are obtained with Xsens MTi-300, and the frequency is 100 More details of the dataset are in [61].As shown in Table 2, in terms of 3D orientation and yaw angle estimation, our proposed NGC-Net outperforms the raw IMU data.Compared with DIG-Net, our NGC-Net maintains higher accuracy on most sequences except urban29.When comparing the estimated orientation of NGC-Net with VIO methods, the conclusion is similar to that of the Eu-RoC MAV Dataset.Compared with VINS-Mono, our proposed NGC-Net achieves the best performance on most test sequences, the AOEs and AYEs of which are 3.74/0.96degrees.Compared with Open-VINS, which shows the excellent performance of its system, it still fails to run through the other two sequences since it still suffers from some initialization problems in complex dynamic environments.NGC-Net can provide more accurate 3D orientation estimation, especially accurate yaw angle estimation.
The orientation estimation and error of urban33 and urban39 by different methods are also shown in Figures 6 and 7, respectively.As shown in the figures, Our NGC-Net, the red line, is closest to ground truth and achieves the best performance among these methods.In addition, the training time for 1200 epochs is approximately 24 min, and the size of the network model is about 3.5 MB.Moreover, the prediction time of the network model is about 10 s for the test sequence of more than 30 min and has real-time computing ability.

Muti-Sensor Fusion System Performance
To verify the performance of our approach in more realistic traffic scenarios, we use the KAIST Urban Dataset [61] to further evaluate the accuracy of our proposed multisensor fusion algorithm.The length of the dataset (3-12 km), the significant changes in illumination, and the complexity of the environment (including many pedestrians, oncoming traffic, high-rise buildings, etc.) lift it to another level of difficulty.We use the left of the stereo camera, the mounted commercial-grade IMU, and the left wheel encoder for pose estimation.Our proposed approaches are compared with the state-of-the-art methods such as Open-VINS [7], VINS-Mono [11], and VIWO [37] mentioned in Section 4.1.Since the purpose of our experiments is to evaluate the performance of these methods in complex long-term driving environments, we do not enable the loop-closing module for all these methods.Note that approaches [7,11] do not provide the test results on the KAIST Urban Dataset; we evaluate them ourselves by running each method five times on each testing sequence and taking the average values as the results.For reference [37], we only cite the results because it is not open-source.
Table 3 shows the root mean squared error (RMSE) of the 3D position (in units of meters) and 3D orientation (in units of degrees) of each algorithm.Figures 8 and 9 show the qualitative comparison of experimental results among different approaches.Figure 8 shows the comparison of estimated 2D trajectories of each tasting sequence among different approaches, and Figure 9 shows the detailed 6DOF trajectories of each test sequence among different approaches.
From Table 3 and Figures 8 and 9, we can see that our proposed RNGC-VIWO approach outperforms the state-of-the-art approaches in terms of both translation and orientation.From the above results, we can draw the following conclusions: First, the pure VINS-Mono suffers from very large errors, up to hundreds of meters to about one kilometer, due to the scale drift affected by restricted motion (mostly constant speed, on straight lines) and complex environments.However, as more information becomes available, such as the addition of the right camera or the wheel encoder readings, In addition, the training time for 1200 epochs is approximately 24 min, and the size of the network model is about 3.5 MB.Moreover, the prediction time of the network model is about 10 s for the test sequence of more than 30 min and has real-time computing ability.

Muti-Sensor Fusion System Performance
To verify the performance of our approach in more realistic traffic scenarios, we use the KAIST Urban Dataset [61] to further evaluate the accuracy of our proposed multisensor fusion algorithm.The length of the dataset (3-12 km), the significant changes in illumination, and the complexity of the environment (including many pedestrians, oncoming traffic, high-rise buildings, etc.) lift it to another level of difficulty.We use the left of the stereo camera, the mounted commercial-grade IMU, and the left wheel encoder for pose estimation.Our proposed approaches are compared with the state-of-the-art methods such as Open-VINS [7], VINS-Mono [11], and VIWO [37] mentioned in Section 4.1.Since the purpose of our experiments is to evaluate the performance of these methods in complex long-term driving environments, we do not enable the loop-closing module for all these methods.Note that approaches [7,11] do not provide the test results on the KAIST Urban Dataset; we evaluate them ourselves by running each method five times on each testing sequence and taking the average values as the results.For reference [37], we only cite the results because it is not open-source.
Table 3 shows the root mean squared error (RMSE) of the 3D position (in units of meters) and 3D orientation (in units of degrees) of each algorithm.Figures 8 and 9 show the qualitative comparison of experimental results among different approaches.Figure 8 shows the comparison of estimated 2D trajectories of each tasting sequence among different approaches, and Figure 9 shows the detailed 6DOF trajectories of each test sequence among different approaches.
We test these methods in the same Ubuntu 16.04 environment, using a desktop equipped with Intel (R) Core(TM) i7-6700HQ 2.60 GHz CPU and 16 GB RAM.Our proposed RNGC-VIWO method can achieve real-time performance similar to VINS-Mono [11], since the average time needed for visual feature detection and tracking is about 20 ms, and the maximum time of nonlinear optimization is an adjustable parameter (80 ms used in our experiments).From Table 3 and Figures 8 and 9, we can see that our proposed RNGC-VIWO approach outperforms the state-of-the-art approaches in terms of both translation and orientation.From the above results, we can draw the following conclusions: First, the pure VINS-Mono suffers from very large errors, up to hundreds of meters to about one kilometer, due to the scale drift affected by restricted motion (mostly constant speed, on straight lines) and complex environments.However, as more information becomes available, such as the addition of the right camera or the wheel encoder readings, which are methods of Open-VINS and our proposed VIWO, the positioning accuracy is significantly improved.Surprisingly, the 3D orientation accuracy of our proposed VIWO method is slightly improved compared with VINS-Mono, as the odometer pre-integration may provide some additional constraints for orientation estimation.
Second, by comparing our proposed VIWO and Open-VINS, the accuracy of Open-VINS in the four successfully initialized sequences is better than our proposed VIWO, but it still fails to run through the other two sequences in complex environments.Moreover, it should be noted that we choose the stereo-IMU configuration instead of the mono-IMU configuration of Open-VINS to test on the KAIST Urban Dataset.

Conclusions
In this paper, to reduce the long-term drift of VIWOs deployed for ground vehicles without relying on additional GNSS sensors, prior maps, or specific motion patterns, we propose a tightly coupled nonlinear optimization method combining robust neural gyroscope observations with traditional visual-inertial-wheel odometry for pose estimation.The learning-based gyroscope calibration method can provide more powerful observations, and the de-noised gyroscope outputs can provide more accurate attitude estimation and better directional constraints for the multi-sensor fusion method by effectively fusing them together.
We evaluate the proposed NGC-Net by comparing the orientation estimation performance with other learning-based methods on two public datasets, namely, the EuRoC MAV Dataset and the KAIST Urban Dataset.The results show that our NGC-Net achieves better de-noising performance compared with the existing learning-based methods and VIO methods and obtains excellent accurate attitude estimates with only a low-cost IMU.To verify the performance of our RNGC-VIWO in more realistic driving scenarios, we further use the KAIST Urban Dataset to evaluate the accuracy of our proposed method.The positioning accuracy of our RNGC-VIWO algorithm outperforms the state-of-the-art monocular and stereo visual-inertial methods, such as VINS-Mono and Open-VINS, and is even better than the stereo VIWO methods in [37].The experimental results show our method has reliable performance in complex long-term environments.
In the future, we will further investigate how deep learning algorithms can be used to learn more sensor characteristics, such as accelerometers and odometers, and how to better integrate them with multi-sensor fusion algorithms to further improve localization Third, in comparison with our proposed VIWO, our proposed VIWO+NGC approach just slightly improves the estimated position and orientation accuracy on the six test sequences, since it directly uses the de-noised gyroscope outputs as input.Comparing these two methods with our proposed RNGC-VIWO approach, our RNGC-VIWO can significantly improve the estimated position and orientation accuracy on the six test sequences, which proves the effectiveness of our proposed overall framework, which effectively fuses the de-noised gyroscope outputs into the traditional VIWO.
Fourth, our RNGC-VIWO approach also outperforms Open-VINS in terms of accuracy and robustness.Compared with the results on Urban39 provided by the VIWO method in reference [37], the position accuracy of our proposed RNGC-VIWO method still outperforms [37], even though [37] uses the stereo camera, IMU, and two wheel odometers along with online IMU-odometer spatiotemporal extrinsic calibration.The orientation accuracy of our proposed RNGC-VIWO on urban39 is comparable to the best result obtained by [37] with online IMU-odometer spatiotemporal extrinsic calibration and outperforms the result obtained by [37] without online IMU-odometer spatiotemporal extrinsic calibration.These results further prove the effectiveness of our proposed overall framework.
In conclusion, our proposed RNGC-VIWO approach can achieve only 21.80 m and 1.83-degree average errors over the long-term complex test sequences and a total length of 53.23 km for six sequences in the dataset, and the average distance of the six test sequences is 8.87 km, only using one camera, one wheel odometer, and a vehicle-mounted IMU de-noised by our NGC-Net.
We test these methods in the same Ubuntu 16.04 environment, using a desktop equipped with Intel (R) Core(TM) i7-6700HQ 2.60 GHz CPU and 16 GB RAM.Our proposed RNGC-VIWO method can achieve real-time performance similar to VINS-Mono [11], since the average time needed for visual feature detection and tracking is about 20 ms, and the maximum time of nonlinear optimization is an adjustable parameter (80 ms used in our experiments).

Conclusions
In this paper, to reduce the long-term drift of VIWOs deployed for ground vehicles without relying on additional GNSS sensors, prior maps, or specific motion patterns, we propose a tightly coupled nonlinear optimization method combining robust neural gyroscope observations with traditional visual-inertial-wheel odometry for pose estimation.The learning-based gyroscope calibration method can provide more powerful observations, and the de-noised gyroscope outputs can provide more accurate attitude estimation and better directional constraints for the multi-sensor fusion method by effectively fusing them together.
We evaluate the proposed NGC-Net by comparing the orientation estimation performance with other learning-based methods on two public datasets, namely, the EuRoC MAV Dataset and the KAIST Urban Dataset.The results show that our NGC-Net achieves better de-noising performance compared with the existing learning-based methods and VIO methods and obtains excellent accurate attitude estimates with only a low-cost IMU.To verify the performance of our RNGC-VIWO in more realistic driving scenarios, we further use the KAIST Urban Dataset to evaluate the accuracy of our proposed method.The positioning accuracy of our RNGC-VIWO algorithm outperforms the state-of-the-art monocular and stereo visual-inertial methods, such as VINS-Mono and Open-VINS, and is even better than the stereo VIWO methods in [37].The experimental results show our method has reliable performance in complex long-term environments.
In the future, we will further investigate how deep learning algorithms can be used to learn more sensor characteristics, such as accelerometers and odometers, and how to better integrate them with multi-sensor fusion algorithms to further improve localization accuracy.Moreover, deep learning for object detection has been the focus of much research and has been successful in many applications, including autonomous driving [63].We also plan to incorporate object detection into our localization system to address dynamic interference and improve the robustness of vehicle localization in dynamic environments.Finally, inspired by [37], since online IMU-odometer spatiotemporal extrinsic calibration can improve the overall accuracy of the system, especially correcting the time offset between the IMU and odometer (0.027 s of the KAIST Dataset reported in [37]), we will add this online calibration module into our proposed framework to further improve the performance of vehicle localization.

Figure 1 .
Figure 1.The overall framework of our proposed RNGC-VIWO method.The highlighted modules demonstrate the special contributions of our work.

Figure 2 .
Figure 2. Coordinate systems of sensors mounted on the vehicle.

Figure 2 .
Figure 2. Coordinate systems of sensors mounted on the vehicle.
set the same kernel size k = 5 for each residual block and dilation factors d = 1, 2, 4, 8, 16, 32, which increase exponentially with the depth of the network (i.e., d = O(b m ), b = 2) at level m of the network [52]).The kernel size k and the dilation factors d determine the receptive field N of NGC-Net, which is 505 in Equation (4).

(Figure 3 .
Figure 3. Network structure of the proposed NGC-Net.

Figure 3 .
Figure 3. Network structure of the proposed NGC-Net.
IMU biases b a k and b ω k .Since we set b ω k z = 0, the z-axis bias of the gyroscope produces no updates to αI k

Figure 5 .
Figure 5. V1_03_difficult with different methods: (a) estimated orientation; (b) 3D orientation er In addition, the IMU of the EuRoc MAV Dataset runs at 200 Hz, the number of work model parameters is about 856,013, and the size of the network model is only MB.The 1200 epochs of training take about 7 min.(2)Evaluation of the KAIST Urban Dataset: This dataset contains a variety of sen measurements, such as stereo camera images, IMU measurements, wheel encoder r ings, etc., which can provide the required inputs for our multi-sensor fusion localiza system and is the main target dataset for our study.It also provides the ground t generated by high-precision gyroscope, VRS-GPlS, and LiDAR sensors.The commer grade IMU measurements are obtained with Xsens MTi-300, and the frequency is 100 More details of the dataset are in[61].

Rn−1) …… Initial pose predicted by denoised IMU Bg.z=0 Acc + denoised Gyr DenoisedGyr Monocular Camera Feature detection and tracking Denoised IMU Pre-intergration Initialized? Pose after optimization Yaw attitude correction for keyframes Final pose Bg.z=0 IMU Raw Gyr Raw Acc NGC-Net Denoising Attitude Integration Acc + denoised Gyr Bg.z=0 yaw Yes Keyframes Visual Pre-processing IMU Pre-processing Nonlinear Optimization Output Wheel Encoder Readings Odometer Pre-intergration Odometer Input Vision-only SFM IMU-Odometer Pre-intergration Alignment Odometer Input? No Yes Vision-Inertial Alignment Bg.z=0 Velocity + denoised Gyr Feed back Initialization Figure 1. The overall
framework of our proposed RNGC-VIWO method.The highlighted modules demonstrate the special contributions of our work.

Table 1 .
[32]lute orientation error (AOE) and absolute yaw error (AYE) in terms of 3D orientation/yaw, in degrees.Results of VINS are the reported ones in[32].The best results are in bold.

Table 2 .
Absolute orientation error (AOE) and absolute yaw error (AYE) in terms of 3D orientation/yaw, in degrees.The best results are in bold, and the symbol --fails to run on these test sequences.

Table 3 .
Comparison of absolute translation error (ATE) and absolute orientation error (AOE) in terms of 3D translation/orientation in units of meters/degrees, of resulting trajectories from different approaches.The best results are in bold.