Pose Estimation Utilizing a Gated Recurrent Unit Network for Visual Localization

Abstract: Lately, pose estimation based on learning-based Visual Odometry (VO) methods, where raw image data are provided as the input of a neural network to obtain 6 Degrees of Freedom (DoF) information, has been intensively investigated. Despite its recent advances, learning-based VO still performs worse than classical VO, which comprises feature-based VO methods and direct VO methods. In this paper, a new pose estimation method is proposed that uses a Gated Recurrent Unit (GRU) network trained on pose data acquired by an accurate sensor. The historical trajectory data of the yaw angle are provided to the GRU network to predict the yaw angle at the current timestep. The proposed method can easily be combined with other VO methods to enhance the overall performance via an ensemble of predicted results. Pose estimation using the proposed method is especially advantageous in cornering sections, which often introduce estimation errors. The performance is improved by reconstructing the rotation matrix using a yaw angle that is the fusion of the yaw angles estimated by the proposed GRU network and by other VO methods. The KITTI dataset is utilized to train the network. On average over the KITTI sequences, performance is improved by as much as 1.426% in terms of translation error and 0.805 deg/100 m in terms of rotation error.


Introduction
Localization of mobile robots and vehicles is a crucial factor in the development of autonomous robots and vehicles. While various localization methods have been developed, the abundance of information in images led to Visual Odometry (VO)-based localization techniques, first carried out by Nister et al. [1]. VO estimates the pose of a rigid body incrementally by analyzing the changes that the motion of the rigid body induces on images collected by a camera [2].
Classical VO can be categorized into two types of methods, direct VO methods and feature-based VO methods, as depicted in Figure 1. Feature-based VO methods extract key feature points from the image (e.g., corners, edges, blobs) and match them across sequential frames. In these methods, the ego-motion of the camera is estimated by minimizing reprojection errors between feature pairs obtained from sequential images. On the other hand, direct VO methods comprise dense VO methods using all pixels, semi-dense VO methods using pixels with a sufficiently large intensity gradient, and sparse VO methods utilizing sparsely selected pixels. In direct VO methods, the ego-motion of the camera is estimated by minimizing the photometric error with non-linear optimization algorithms [3,4]. For each type of method, various techniques have been investigated to increase the accuracy of the VO. In the case of feature-based VO methods, which are considered the mainstream of the VO, many studies focus on outlier rejection and on finding appropriate features to track in order to reduce the error of the VO [6][7][8]. To obtain better pose estimation results, Patruno et al. presented a robust correspondence feature filter relying on statistical and geometrical information [9]. Learning-based feature detectors have also been investigated. In [10], Yi et al. proposed an integrated feature extraction pipeline, called the Learned Invariant Feature Transform (LIFT) pipeline, consisting of a detector, an orientation estimator, and a descriptor. In [11], DeTone et al. presented a neural network architecture for point detection, trained using a self-supervised domain adaptation technique called Homographic Adaptation. In [12], Revaud et al. proposed the Repeatable and Reliable Detector and Descriptor (R2D2), a new learning-based feature extraction method that learns the repeatability and reliability of each keypoint.
In [13], a hierarchical localization technique is introduced, employing the Hierarchical Feature Network (HF-Net), a monolithic Convolutional Neural Network (CNN) that efficiently predicts hierarchical features.
As for direct VO methods, to overcome the main drawback of feature-based VO methods, namely their fundamental reliance on features, pixel-to-pixel direct matching under the constant-brightness assumption of optical flow has also been studied. Despite this adjustment, feature-based VO methods have been regarded as more suitable than direct VO methods for cases involving large baselines, fast motions, and varied illumination [14][15][16][17][18][19][20]. From this perspective, studies combining the advantages of direct VO methods and feature-based VO methods to decrease VO errors have been conducted [20,21].
Recently, the application of Deep Neural Networks (DNNs) to the VO has been intensively investigated [22][23][24][25][26]. Most applications of DNNs are for learning-based VO methods. In [22], Kendall et al. presented a real-time re-localization system using a Deep Convnet consisting of CNNs. Since VO is a sequential estimation problem, architectures such as the Recurrent Neural Network (RNN), the Long Short-Term Memory (LSTM) network, and the Gated Recurrent Unit (GRU) network can be adopted. In [23], Wang et al. proposed a learning-based Monocular Visual Odometry (MVO) with the help of a deep Recurrent Convolutional Neural Network (RCNN). In [24], a VO approach using a dual-stream neural network consisting of a CNN and an LSTM network is proposed by Lit et al. Liu et al. presented an MVO system employing an RCNN [25]. In [26], Zhao et al. proposed a Learning Kalman Network for improved learning of dynamically changing parameters during end-to-end training for VO without specifying the parameters explicitly. In learning-based VO methods, the structures of feature-based and direct VO methods are replaced with fine-tuned DNNs; however, results of the KITTI benchmark test show that the performances of learning-based VO methods have not yet surpassed those of classical VO (e.g., feature-based VO methods, direct VO methods). For this reason, hybrid methods making use of classical VO combined with learning-based VO are studied in [27,28]. In [27], Peretroukhin et al. proposed a Deep Pose Correction (DPC) network consisting of CNNs. The DPC network predicts camera motion corrections according to stereo images, and a classical VO framework is fused with the DPC network by Pose Graph Relaxation. In [28], pose estimation relying on the combined use of classical VO with DNNs is presented, with camera motion estimated from pre-processed optical flow images.
Despite the foregoing efforts, many problems in the VO remain unsolved. Imaging conditions (e.g., sunlight, shadows, image blur) fundamentally affect the performance of the VO [4]. To overcome the problems caused by changes of the light source, various approaches have been attempted. For instance, learning-based detectors (e.g., R2D2, SuperPoint [11]) have proven to be superior to non-machine-learning detectors (e.g., Speeded-Up Robust Features, Scale Invariant Feature Transform). When the camera rotates (e.g., yaw rotation), a large change of pixels in the image occurs compared to that caused by translation. This large change of pixels quickly reduces the lifespan of the feature points in feature-based VO methods, and also causes large rotation and translation errors in learning-based VO methods as well as in classical VO [29][30][31][32]. Furthermore, learning-based VO methods require tremendous training time because images are used to train the network.
This work introduces a correction method for the VO that uses prior driving data generated from a relatively accurate sensor such as a Real-Time Kinematic Global Positioning System (RTK-GPS), which has 0.02 m/0.1 deg resolution [33]. The performance of the correction method is not affected by the characteristics of the images. Inspired by the trend of using learning-based VO and classical VO together, this work presents a pose estimation method that uses a GRU network, an improved version of the RNN that is computationally more efficient than the LSTM network [34,35]. The GRU network is trained on historical trajectory data of the yaw angle, which is constrained by the shape and structure of a robot or vehicle. The GRU network receives the stacked yaw angles converted from the 6 Degrees of Freedom (DoF) pose, which is the result of classical VO, as the input and predicts the next yaw angle. The two yaw angles, one obtained from the GRU network and the other obtained from the classical VO, are fused together. The simulation is conducted with a VO adopting an MVO framework.
The main contributions of this paper are summarized as follows:

1. A VO based on a GRU network is proposed for predicting the future yaw angle by making use of stacked yaw angles obtained from classical VO.
2. The proposed method is able to extract the rotational tendency constrained by the shape or type of robot or vehicle, and is particularly effective in cornering sections.
3. A modified VO framework is developed to improve VO performance by applying the GRU network to the classical VO without any change to the pipeline of the original VO.
4. Fusion of yaw angles by Normalized Cross-Correlation (NCC) and subsequent reconstruction of the rotation matrix using the fused yaw angle are presented.
For performance comparison, the KITTI dataset [33], which is often used in the development of VO, is utilized.
The rest of this paper is organized as follows. Section 2 describes the process of the MVO, the error analysis during camera rotation, and the structure of the GRU network for predicting the yaw angle. Section 3 presents how the GRU network is trained and tested and gives the simulation results obtained with the GRU network. Section 4 concludes this paper.

Visual Odometry
The process of the feature-based VO method is shown in Figure 2. Features are detected by feature detectors such as Smallest Univalue Segment Assimilating Nucleus (SUSAN) and Features from Accelerated Segment Test (FAST). The detected feature points are tracked in sequential frames. Local optimization to obtain the optimized pose can be executed by sparse bundle adjustment. The VO dealt with in this work is 2D-to-2D motion estimation involving rotation and translation. The pipeline of the VO in Figure 2 can be described by the pseudo-code in Algorithm 1.
Algorithm 1. Pseudo-code of the MVO.
1. Capture images I_{k−1}, I_k.
2. Undistort the images I_{k−1}, I_k.
3. Use the FAST algorithm to detect features in I_{k−1} and match those features to I_k; a new detection is triggered if the number of features falls below a threshold.
4. Use Nister's 5-point algorithm with RANSAC [37] to compute the essential matrix.
5. Estimate R, t from the essential matrix that was computed in the previous step.
6. Take scale information from ground truth data and concatenate the translation vector and the rotation matrix.
7. Repeat from line 1.
In 2D-to-2D VO, the transformation

$$T_k = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{4\times4}$$

is estimated in a way that minimizes the reprojection error

$$\arg\min_{T_k} \sum_i \left\| p_i^k - \hat{p}_i^{k-1} \right\|^2 \quad (1)$$

where $T_k$ represents the motion estimation between two images $I_{k-1}$, $I_k$ obtained with a calibrated camera, $\hat{p}_i^{k-1} = f(T_k, X_i^{k-1})$ represents the reprojection of the corresponding feature point $X_i^{k-1}$ onto image $I_k$ following transformation $T_k$ by the reprojection function $f(\cdot)$, and $p_i^k$ is the observed feature point in the current frame $I_k$. The essential matrix $E$, defined according to the rotation matrix $R \in SO(3)$, where $SO(3) = \{R \in \mathbb{R}^{3\times3} \mid RR^T = I, \det(R) = 1\}$, and the translation vector $t \in \mathbb{R}^3$, is given as follows

$$E = [t]_\times R \quad (2)$$

where $[t]_\times$ is the skew-symmetric matrix of the translation vector $t = (t_x, t_y, t_z)^T$. Furthermore, the essential matrix $E$ can be represented by Singular Value Decomposition (SVD) as follows

$$E = U \Sigma V^T$$

where $U$ is the unitary matrix, $\Sigma$ is the diagonal matrix of singular values, and $V^T$ is the conjugate transpose of $V$. Then, the rotation matrix $R$ and the translation direction $\hat{t}$ can be extracted as follows

$$R = U W V^T \ \text{or} \ U W^T V^T \quad (3)$$

$$\hat{t} = \pm u_3 \quad (4)$$

where $W = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$ and $u_3$ is the third column of $U$. Among the 4 possible combinations in Equations (3) and (4), $R$ and $\hat{t}$ are selected by judging whether the depth of the corresponding feature points is positive in both camera coordinate systems at frames $k-1$ and $k$. Given the rotation matrix $R$ and the translation vector $t$, the ego-motion of the camera is estimated using the following equations

$$R_{traj}^k = R_{traj}^{k-1} R \quad (5)$$

$$t_{traj}^k = t_{traj}^{k-1} + \lambda R_{traj}^{k-1} t \quad (6)$$

where $\lambda$ is the scale information obtained from ground truth data, and $R_{traj}$ and $t_{traj}$ are the accumulated rotation and translation.
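As a concrete illustration of Equations (3) and (4), the following NumPy sketch builds the four candidate $(R, \hat{t})$ pairs from an essential matrix. The function names and the sign handling of the null-space columns are our own, not from the paper; a real pipeline would then apply the positive-depth (cheirality) check to pick one candidate.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def decompose_essential(E):
    """Return the 4 candidate (R, t_hat) pairs from E = U diag(1,1,0) V^T."""
    U, _, Vt = np.linalg.svd(E)
    # Flip the null-space column/row so U and V are proper rotations;
    # this leaves U diag(1,1,0) V^T (and hence E) unchanged.
    if np.linalg.det(U) < 0:
        U[:, 2] *= -1.0
    if np.linalg.det(Vt) < 0:
        Vt[2, :] *= -1.0
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt    # Equation (3): two rotation candidates
    t_hat = U[:, 2]                      # Equation (4): direction only, up to sign
    return [(R1, t_hat), (R1, -t_hat), (R2, t_hat), (R2, -t_hat)]
```

The true pose is always among the four candidates; in practice, depths of triangulated points decide which one is physically valid.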


Effect of Cornering on VO
The coordinate system and the rotation of the camera around each axis are shown in Figure 3. In the VO, rotation around the y-axis (yaw) is the main component of the rotation of robots and vehicles. Rotation of the camera in the x-z plane (yaw) results in a large amount of change in the scene observed by the camera. It leads to tracking failure by changing the light source and causes large pixel changes in the image, which are directly related to increased photometric error and reprojection error. That is why increasing the accuracy of the VO when the camera rotates in the x-z plane is crucial. In the 2D-to-2D VO (i.e., Nister's 5-point-based MVO) called MVO described in Algorithm 1, the pitch angle ($\theta$), yaw angle ($\psi$), and roll angle ($\varphi$) are obtained from the entries $R_{ij}$ of the rotation matrix as follows

$$\theta = \mathrm{atan2}\!\left(-R_{31}, \sqrt{R_{32}^2 + R_{33}^2}\right), \quad \psi = \mathrm{atan2}(R_{32}, R_{33}), \quad \varphi = \mathrm{atan2}(R_{21}, R_{11}) \quad (7)$$

where $\mathrm{atan2}(\cdot)$ is a function that returns the arctangent value according to IEEE standard 754, and the degenerate case $\sqrt{R_{32}^2 + R_{33}^2} < k$ with threshold $k = 10^{-6}$ is handled separately. To obtain Figure 4, representing the tendency of change in the yaw angle and the cumulative error of the yaw angle, the absolute value of $\psi$ corresponding to each pair of images is calculated. Then, the error $e$ is computed as follows

$$e = \left| \psi_{GT} - \psi_{MVO} \right| \quad (8)$$

where $\psi_{GT}$ and $\psi_{MVO}$ are the yaw angles obtained with the ground truth (GT) and the MVO, respectively.
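The angle extraction above can be sketched in NumPy. The rotation convention $R = R_z(\varphi) R_y(\theta) R_x(\psi)$ and the helper names are our assumptions, chosen to match the rotation matrix used later for reconstruction in Equation (16):

```python
import numpy as np

def rot_zyx(phi, theta, psi):
    """Build R = Rz(phi) @ Ry(theta) @ Rx(psi)."""
    cp, sp = np.cos(phi), np.sin(phi)
    ct, st = np.cos(theta), np.sin(theta)
    cs, ss = np.cos(psi), np.sin(psi)
    Rz = np.array([[cp, -sp, 0.0], [sp, cp, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[ct, 0.0, st], [0.0, 1.0, 0.0], [-st, 0.0, ct]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cs, -ss], [0.0, ss, cs]])
    return Rz @ Ry @ Rx

def euler_from_rotation(R, k=1e-6):
    """Extract (theta, psi, phi) with atan2; falls back near the gimbal lock."""
    sy = np.hypot(R[2, 1], R[2, 2])           # |cos(theta)|
    theta = np.arctan2(-R[2, 0], sy)
    if sy > k:
        psi = np.arctan2(R[2, 1], R[2, 2])
        phi = np.arctan2(R[1, 0], R[0, 0])
    else:                                      # cos(theta) ~ 0: only psi-phi observable
        psi = np.arctan2(R[0, 1], R[1, 1])
        phi = 0.0
    return theta, psi, phi
```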
Appl. Sci. 2020, 10, x FOR PEER REVIEW
When the camera rotates, the variation of the yaw angle calculated through the pairs of images takes a bell shape that gradually increases and then decreases. In Figure 4, the starting point of the increasing yaw angle appears when the camera enters the cornering section (turn-in). The highest value of the yaw angle appears when the camera is in the middle of the cornering section (mid-corner), and the yaw angle decreases when the camera escapes from the cornering section (exit). The turn-in, mid-corner, and exit phases of a cornering section are shown in Figure 5. As shown in Figure 4, the accumulated error of the yaw angle increases rapidly when the camera rotates and increases steadily as the sequence continues. Note that the rotation is used for the calculation of R_traj in Equation (5) and also for the calculation of t_traj in Equation (6). Therefore, increasing the accuracy in the cornering section can significantly improve the performance of the VO.


Network Architecture
The proposed pose estimation method utilizes the GRU network shown in Figure 6a for predicting the yaw angle at the current timestep. The predicted yaw angle is used to minimize the rotation error defined in the x-z plane. In this study, sequential yaw angles obtained over the time interval $t_{-5}$ to $t_{-1}$ are applied as the network inputs and pass through the GRU layers, and then the prediction corresponding to the yaw angle at $t_0$ is obtained through the output layer.
The GRU network consists of 5 GRU layers, each consisting of 5 GRU cells with 200 hidden units per cell, and 3 fully connected layers consisting of 256, 128, and 64 neurons, respectively. The structure of the GRU network and of a GRU cell is shown in Figure 6a,b, respectively. The GRU network in Figure 6a consists of 25 (= 5 cells/layer × 5 layers) GRU cells, where 5 cells/layer represents the feature dimension (input dimension), in other words, the length of the input sequence. After the GRU layers, there are 3 fully connected layers before the final output layer that predicts the yaw angle. In the $l$-th GRU cell in Figure 6b, where $l = 1, \ldots, 5$, the forward transfer of the GRU is calculated for each processing step as follows

$$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r) \quad (9)$$

$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z) \quad (10)$$

$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h) \quad (11)$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad (12)$$

where $r_t$ is the reset gate, which determines how much information from the previous GRU cell should be forgotten, $z_t$ is the update gate, which determines how much information should be transferred from the current GRU cell to the next one, $\tilde{h}_t$ is the intermediate state, $h_t$ is the hidden state, $\sigma(\cdot)$ is the sigmoid function, $\tanh(\cdot)$ is the hyperbolic tangent function, $W_r$, $W_z$, $W_h$ are the weight matrices, $b_r$, $b_z$, $b_h$ are the bias vectors, and $\odot$ denotes element-wise multiplication. For the reset gate, a small value of $r_t$ means that much information from the previous cell is forgotten. For the update gate, a large value of $z_t$ means that much information is transferred to the next cell [34,39].
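The forward step of a single GRU cell described above can be written compactly in NumPy. This is a generic sketch with our own weight shapes and names (weights act on the concatenation of the previous hidden state and the input), not the paper's MATLAB implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, Wr, Wz, Wh, br, bz, bh):
    """One GRU forward step; each W has shape (hidden, hidden + input)."""
    hx = np.concatenate([h_prev, x_t])
    r_t = sigmoid(Wr @ hx + br)                  # reset gate
    z_t = sigmoid(Wz @ hx + bz)                  # update gate
    h_tilde = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]) + bh)
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde   # large z_t passes new info on
    return h_t
```

With a zero initial hidden state, the output is a convex combination of 0 and a tanh activation, so it stays strictly inside (−1, 1).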
The 5 historical yaw angles estimated by the VO are the input to the GRU layer at each time, and the current output of each GRU cell is determined by both past and current inputs. A dropout layer is placed between GRU layers to prevent overfitting. The overall structure is determined by a grid search to find the hyperparameter values providing the best accuracy. The GRU network is trained on time series data, so continual training is required. The GRU network learns the cornering tendency in terms of the yaw angle variation of the camera and then operates when the camera enters a cornering section so that it can predict the next yaw angle in cornering. This method can improve the performance of the VO by correcting the yaw angle estimation of the VO.

Framework with Classical VO

The proposed method is combined with a classical VO. The framework combining the classical VO and the proposed GRU network is shown in Figure 7. An image sequence is input to the classical VO and feature points are detected when the classical VO is a feature-based VO method. After detecting the feature points, feature matching (i.e., tracking) is conducted. As a result, the rotation matrix R and the translation vector t are obtained from the classical VO. In this process, a memory buffer containing the yaw angle for the last 5 timesteps is kept. If the camera enters the cornering section, the GRU network is activated and uses the previous 5 historical yaw angles obtained from the classical VO to predict the yaw angle for correcting the pose obtained from the classical VO.
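The memory buffer and the cornering test described above can be sketched as follows. The buffer length of 5 and the 0.85 deg threshold come from the paper; the function names are ours:

```python
from collections import deque

GAMMA_DEG = 0.85   # cornering threshold gamma from the paper, in degrees
BUF_LEN = 5        # number of historical yaw angles kept

def make_buffer():
    """Ring buffer holding the last BUF_LEN yaw angles from the classical VO."""
    return deque(maxlen=BUF_LEN)

def is_cornering(buf, gamma=GAMMA_DEG):
    """Camera is in a cornering section when all 5 recent |yaw| exceed gamma."""
    return len(buf) == BUF_LEN and all(abs(psi) > gamma for psi in buf)
```

Each frame, the yaw angle estimated by the classical VO is appended to the buffer; when `is_cornering` fires, the GRU network is queried for a corrected prediction.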
To detect when the camera enters a cornering section, the following condition is checked

$$|\psi_{k-i}| > \gamma, \quad i = 1, \ldots, 5 \quad (13)$$

where $\gamma$ is the threshold used to check whether the camera is in a cornering section. The camera is determined to be in a cornering section when the yaw angles of the camera are continuously larger than the threshold angle $\gamma$. In this work, the threshold angle $\gamma$ is set to 0.85 deg through trial and error, based on whether the cornering section could be well distinguished. It is recommended to set the threshold angle $\gamma$ slightly higher than the angle at which the cornering section is well detected, as will be discussed in Section 3.1. The condition identifying when the yaw angle estimated by the classical VO deviates significantly from the previous rotation trend is given in Equation (14), where $\alpha$ is a scale factor determining the lower limit of the trend. The scale factor $\alpha$ is set to 1.5 for the simulations, considering the range of change allowed for the variation of the estimate by the classical VO in preliminary simulations. The condition identifying when the yaw angle estimated by the classical VO deviates from the yaw angle obtained from the GRU network is given in Equation (15). If the change of rotation is large enough that Equations (14) and (15) are satisfied, the yaw angle is corrected. The correction is executed by fusing the yaw angle obtained from the GRU network and the yaw angle obtained from the classical VO, using the Normalized Cross-Correlation (NCC) as the weighting factor. The NCC represents the similarity between images and takes values between 0 and 1 [40]. The value of the NCC rapidly decreases in the cornering section, as shown in Figure 8. In addition, estimation by the classical VO becomes inaccurate in the cornering section, as described in Section 2.2. The value of the NCC can therefore be considered a reliability factor of the yaw angle calculated by the classical VO. In sections with large NCC values, the proposed framework trusts the value estimated by the GRU less, preventing the increased error that could occur if the GRU were fully trusted. Reconstruction of the rotation matrix is computed as follows

$$R = \begin{bmatrix} \cos\theta\cos\varphi & \sin\psi_{corr}\sin\theta\cos\varphi - \cos\psi_{corr}\sin\varphi & \cos\psi_{corr}\sin\theta\cos\varphi + \sin\psi_{corr}\sin\varphi \\ \cos\theta\sin\varphi & \sin\psi_{corr}\sin\theta\sin\varphi + \cos\psi_{corr}\cos\varphi & \cos\psi_{corr}\sin\theta\sin\varphi - \sin\psi_{corr}\cos\varphi \\ -\sin\theta & \sin\psi_{corr}\cos\theta & \cos\psi_{corr}\cos\theta \end{bmatrix} \quad (16)$$

where

$$\psi_{corr} = \mathrm{NCC}\,\psi_{VO} + (1 - \mathrm{NCC})\,\psi_{GRU}$$

The NCC for the image $I_1$ centered on $(u, v)$ and the image $I_2$ centered on $(u', v')$ is calculated over an $m \times n$ patch, where $m = 2a + 1$ and $n = 2b + 1$ are odd integers, as follows

$$\mathrm{NCC} = \frac{\sum_{i=-a}^{a} \sum_{j=-b}^{b} \big(I_1(u+i, v+j) - \bar{I}_1\big)\big(I_2(u'+i, v'+j) - \bar{I}_2\big)}{\sqrt{\sum_{i=-a}^{a} \sum_{j=-b}^{b} \big(I_1(u+i, v+j) - \bar{I}_1\big)^2 \sum_{i=-a}^{a} \sum_{j=-b}^{b} \big(I_2(u'+i, v'+j) - \bar{I}_2\big)^2}} \quad (17)$$

where $\bar{I}_1$ and $\bar{I}_2$ are the mean intensities of the two patches. The pseudo-code of the MVO based on the proposed pose estimation using the GRU network is shown in Algorithm 2. Line 6 represents the condition for determining whether the camera is in a cornering section, and line 7 represents the condition for checking whether the yaw angle from the VO deviates significantly from the previous rotation trend. In line 7, the yaw angle is predicted by the GRU network and corrected using the NCC, and as a result the rotation matrix R is reconstructed.

Algorithm 2. Pseudo-code of the MVO with the proposed pose estimation.
1. Capture images I_{k−1}, I_k.
Undistort the images I k−1 , I k 3.
Use FAST algorithm to detect features in I k−1 , and match those features to I k , A new detection is triggered if the number of features falls below threshold.

4.
Use Nister's 5-point algorithm with RANSAC to compute the essential matrix.

5.
Estimate R, t from the essential matrix that was computed in the previous step.

6.
If, all the absolute value of ψ in the memory buffer ( ψ k−5 , ψ k−4 , ψ k−3 , ψ k−2 , ψ k−1 ) are larger than γ for determining that the camera is in a cornering section go to 7. Else, go to 8. In this study, γ is set to 0.85 deg.

If
and is satisfied, then, the yaw angle is corrected by ψ corr = NCCψ VO + (1 − NCC)ψ GRU and R is reconstructed using Equation (16). Else, go to 8. In this study, α is set to 1.5. 8.
Take scale information form ground truth data, and concatenate the translation vector, and the rotation matrix. 9.
Repeat from line 1.
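The correction logic of steps 6 and 7 can be sketched in plain Python. This is an illustrative reimplementation, not the authors' code: all function names are assumptions, the angles passed to `reconstruct_rotation` are in radians while the yaw buffer follows the paper's degree threshold, and `ncc` operates on flattened patch intensities.

```python
import math

def ncc(p1, p2):
    """Zero-mean normalized cross-correlation between two equal-size
    flattened patches; the paper uses this value as the fusion weight."""
    n = len(p1)
    m1, m2 = sum(p1) / n, sum(p2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(p1, p2))
    den = math.sqrt(sum((a - m1) ** 2 for a in p1) *
                    sum((b - m2) ** 2 for b in p2))
    return num / den

def corrected_yaw(psi_vo, psi_gru, weight):
    """psi_corr = NCC * psi_VO + (1 - NCC) * psi_GRU."""
    return weight * psi_vo + (1.0 - weight) * psi_gru

def in_cornering_section(yaw_buffer, gamma=0.85):
    """Step 6: all |psi| in the 5-element memory buffer exceed gamma (deg)."""
    return all(abs(psi) > gamma for psi in yaw_buffer)

def reconstruct_rotation(theta, phi, psi_corr):
    """Rebuild the rotation matrix with the corrected angle psi_corr,
    following Equation (16) (Rz(phi) * Ry(theta) * Rx(psi_corr))."""
    ct, st = math.cos(theta), math.sin(theta)
    cp, sp = math.cos(phi), math.sin(phi)
    cy, sy = math.cos(psi_corr), math.sin(psi_corr)
    return [
        [cp * ct, sy * st * cp - cy * sp, cy * st * cp + sy * sp],
        [sp * ct, sy * st * sp + cy * cp, cy * st * sp - sy * cp],
        [-st, sy * ct, cy * ct],
    ]
```

Because `ncc` is bounded by 1 in magnitude, a high-correlation match pulls `corrected_yaw` toward the VO estimate, while a poor match shifts the weight to the GRU prediction, which is the intended reliability behavior.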

Simulation
This section provides a concise and precise description of the training and testing of the proposed method as well as the analysis of simulation results. MATLAB® (R2020a, MathWorks, USA) is used to train and test the GRU network, and Visual Studio® (2019, Microsoft Corporation, Redmond, WA, USA) and OpenCV 3.4.0 (Intel Corporation, Santa Clara, CA, USA) are used to simulate the VO. The GRU network trained in MATLAB® is linked with Visual Studio® using MEX, a function of MATLAB®. This process is carried out on a desktop PC with an Intel® i7-9700KF CPU (Intel Corporation, Santa Clara, CA, USA) and an Nvidia® RTX 2060 GPU (NVIDIA Corporation, Santa Clara, CA, USA).

Training and Testing
To train the GRU network for predicting the yaw angle when a robot or a vehicle is in a cornering section, absolute values of the yaw angle obtained from RTK-GPS ground truth data are used as the target values and absolute values of the yaw angle obtained from the MVO are taken as the input values. When the yaw angle of the robot or vehicle from the RTK-GPS at time t exceeds a certain threshold, the robot or vehicle is considered to be moving around a cornering section. If the vehicle is judged to be in a cornering section, the two yaw angles obtained by the MVO before time t, the yaw angle at time t, and the two yaw angles obtained by the MVO after time t are taken to form 5 historical yaw angles around time t. Note that the time t is according to the RTK-GPS. Therefore, the dataset can be organized as follows:

$$X = \{x_1, \ldots, x_N\}, \quad Y = \{y_1, \ldots, y_N\}$$

where X is the set of feature vectors, Y is the set of target values, x is a feature vector consisting of yaw angles, and y is a target value. The value of the sequential length is set to 5 after conducting ablation studies comparing the final loss obtained for sequential lengths of 3, 5, and 7. However, the sequential length can be adjusted depending on the characteristics of the system, such as the frame rate of the camera. In this work, the threshold of the lower limit for ψ_t is set to 0.8 deg; if the yaw angle exceeds 0.8 deg, the corners can be extracted as shown in Figure 9. This threshold must be set carefully, as it may vary depending on the hardware constraints of the robot or vehicle and the resolution and type of the sensor. The training set consists of 4811 training sequences obtained from the MVO and the ground truth of the KITTI dataset, which consists of 11 sequences with indices 00-10. Then, 80% of the training sequences are used for training of the GRU network and the remaining 20% are used for network validation (testing). The GRU network is trained using the Adam optimizer [41].
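The windowing procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: `build_corner_dataset` and its argument names are hypothetical, and the GRU training itself (done in MATLAB in the paper) is omitted.

```python
def build_corner_dataset(psi_gt, psi_vo, threshold=0.8):
    """Collect 5-step windows of |yaw| from the VO around timesteps where
    the ground-truth |yaw| exceeds the cornering threshold (in deg).

    psi_gt: ground-truth yaw angles from RTK-GPS.
    psi_vo: yaw angles estimated by the MVO at the same timesteps.
    Returns (X, Y): feature windows and target values.
    """
    X, Y = [], []
    # skip the first and last two timesteps so the window stays in range
    for t in range(2, len(psi_gt) - 2):
        if abs(psi_gt[t]) > threshold:  # vehicle judged to be cornering
            # two VO yaw angles before t, the one at t, and two after t
            window = [abs(psi_vo[k]) for k in range(t - 2, t + 3)]
            X.append(window)
            Y.append(abs(psi_gt[t]))
    return X, Y
```

Each returned feature vector has the sequential length of 5 used in the paper; changing the slice bounds gives the lengths 3 or 7 used in the ablation study.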
The training of the GRU network is done similarly to regression. In other words, the GRU network is trained to reduce the loss

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\big(f(x_i) - y_i\big)^2$$

where $f(x_i)$ is the prediction of the GRU network and N is the number of training sequences. Both the output of the network and the ground truth yaw angle values are real numbers. The model at the epoch with the minimum value of the loss $\mathcal{L}$ is chosen as the best model for the simulations in Section 3.2. The minibatch size is set to 32 during training.

Evaluation
In Figure 10, the accumulated Root Mean Square Errors (RMSEs) of the yaw angle obtained from the MVO framework employing the proposed method are reduced compared with the MVO. The yaw angle obtained from the MVO employing the proposed method is closer to the ground truth than that of the MVO, as shown in the enlarged timesteps of Figure 10a corresponding to the cornering section of the KITTI 06 sequence over timesteps 710-730. For the section where the proposed method is applied, the yaw angle from the proposed method is closer to the ground truth, as depicted in Figure 11. The timesteps over which the green line segments are overlaid correspond to the timesteps when the yaw angle changes. In addition, as seen in Figure 12, it is intuitively clear that odometry performance is improved over the trajectories of most sequences in the KITTI dataset.
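The accumulated RMSE curves of Figure 10 can be reproduced from logged yaw angles with a short helper. This is a generic sketch (the function name and inputs are illustrative, not from the paper's code):

```python
import math

def accumulated_rmse(psi_est, psi_gt):
    """Accumulated RMSE of the yaw angle up to each timestep:
    out[k] = sqrt(mean of squared errors over timesteps 0..k)."""
    out, sq_sum = [], 0.0
    for k, (e, g) in enumerate(zip(psi_est, psi_gt), start=1):
        sq_sum += (e - g) ** 2
        out.append(math.sqrt(sq_sum / k))
    return out
```

Plotting `accumulated_rmse` for the plain MVO and for the MVO with GRU-based correction against the same ground truth gives the two curves compared in Figure 10.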
The KITTI evaluation kit [42,43] is used for the evaluations in this subsection. In Table 1, translation error [44], rotation error [44], Absolute Trajectory Error (ATE) [43], and Relative Pose Error (RPE) [43] are presented for demonstrating the performance improvement achieved with the proposed method.
Assuming that the sequence of spatial poses from the estimated trajectory is $P_i$, the sequence of spatial poses from the ground truth is $Q_i$, and $S$ is the rigid body transformation, the absolute trajectory error matrix at time $i$ is given as follows:

$$F_i = Q_i^{-1} S P_i$$

Then, the ATE is defined as follows:

$$ATE = \left(\frac{1}{N}\sum_{i=1}^{N}\big\|\mathrm{trans}(F_i)\big\|^2\right)^{1/2}$$

where $\mathrm{trans}\!\left(\begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}\right) = t$. The function $\mathrm{trans}(\cdot)$ extracts $t$ using $T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$ as an input parameter.
In addition, the relative pose error matrix at time $i$ over a fixed time interval $\Delta$ is given as follows:

$$E_i = \left(Q_i^{-1} Q_{i+\Delta}\right)^{-1}\left(P_i^{-1} P_{i+\Delta}\right)$$

The RPE for translation, $RPE_{trans}$, and for rotation, $RPE_{rot}$, are defined as follows:

$$RPE_{trans} = \left(\frac{1}{M}\sum_{i=1}^{M}\big\|\mathrm{trans}(E_i)\big\|^2\right)^{1/2}, \quad RPE_{rot} = \frac{1}{M}\sum_{i=1}^{M}\arccos\!\left(\frac{\mathrm{tr}\big(\mathrm{rot}(E_i)\big) - 1}{2}\right)$$

where $\mathrm{rot}\!\left(\begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}\right) = R$. The function $\mathrm{rot}(\cdot)$ extracts $R$ using $T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$ as an input parameter, and $\mathrm{tr}(\cdot)$ is the sum of the components on the main diagonal of a square matrix.

Table 1 shows the simulation results obtained for the KITTI sequences 00 to 10. The values in the 'Diff' rows of Table 1 are obtained by subtracting the error value of the MVO from the error value obtained with the proposed method; the values marked in red indicate the amount of improvement. With the proposed method, the translation error is decreased for KITTI sequences 00, 02, 03, 05, 08, and 10 as compared to the MVO. The rotation error is decreased for KITTI sequences 00, 02, 05, 06, 08, and 10. The ATE (m) is decreased for KITTI sequences 00, 01, 02, 04, 05, 06, 07, 08, and 10.
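The ATE and RPE metrics defined above can be computed with a self-contained sketch using plain list-of-lists 4x4 homogeneous matrices. This is an illustrative implementation of the cited definitions, not the KITTI evaluation kit; all function names are assumptions.

```python
import math

def mat_mul(A, B):
    """Multiply two 4x4 homogeneous transformation matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def inv_se3(T):
    """Invert a rigid-body transform T = [R t; 0 1]: T^-1 = [R^T -R^T t; 0 1]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    mt = [-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3)]
    return [Rt[0] + [mt[0]], Rt[1] + [mt[1]], Rt[2] + [mt[2]], [0, 0, 0, 1]]

def trans(T):
    """Extract the translation vector t from T = [R t; 0 1]."""
    return [T[0][3], T[1][3], T[2][3]]

def ate(P, Q, S):
    """RMSE over ||trans(F_i)|| with F_i = Q_i^-1 S P_i."""
    total = 0.0
    for Pi, Qi in zip(P, Q):
        Fi = mat_mul(mat_mul(inv_se3(Qi), S), Pi)
        total += sum(c * c for c in trans(Fi))
    return math.sqrt(total / len(P))

def rpe_rot(P, Q, delta=1):
    """Mean rotation angle of E_i = (Q_i^-1 Q_{i+d})^-1 (P_i^-1 P_{i+d})."""
    angles = []
    for i in range(len(P) - delta):
        Ei = mat_mul(inv_se3(mat_mul(inv_se3(Q[i]), Q[i + delta])),
                     mat_mul(inv_se3(P[i]), P[i + delta]))
        tr = Ei[0][0] + Ei[1][1] + Ei[2][2]  # trace of rot(E_i)
        angles.append(math.acos(max(-1.0, min(1.0, (tr - 1.0) / 2.0))))
    return sum(angles) / len(angles)
```

`RPE_trans` follows the same pattern as `ate` but over the relative error matrices `E_i`; the clamp inside `acos` guards against floating-point values marginally outside [-1, 1].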
The RPE (m) is decreased for KITTI sequences 07 and 08, and unchanged for KITTI sequences 02, 03, 04, 05, 06, and 09. The RPE (deg) is decreased for KITTI sequences 00, 02, 05, 06, 08, 09, and 10. On average, the errors of all metrics except the RPE (m) are reduced; for the RPE (m) metric, the performance obtained with the proposed method is similar to that of the classical MVO. On average, the translation error is reduced by as much as 1.426%, the rotation error is decreased by as much as 0.805 deg/100 m, the ATE is reduced by as much as 12.752 m, and the RPE is decreased by as much as 0.014 deg. Using a hybrid method for yaw correction, the DPC-Net [27], which combines learning-based VO with the classical VO, achieves a 30.06% improvement in translation error and a 24.98% improvement in rotation error over the classical VO for the KITTI 00, 02, and 05 sequences. In comparison, the proposed method demonstrates a 34.64% improvement in translation error and a 43.75% improvement in rotation error over the classical VO for the same sequences. However, the RPE (deg) for KITTI sequence 01 is slightly increased, by as much as 0.007 deg. This is due to incorrect estimation by the GRU network at timesteps when the value estimated by the classical VO differs significantly from the previous rotation tendency. The increment of the RPE can be addressed by fine-tuning the GRU network and adjusting the parameters α and γ.
In Table 1, a comparison with Deep VO [23], one of the learning-based VO methods, is also presented. The results shown in Table 1 indicate that the classical VO performs better than Deep VO.

Conclusions
In this paper, a pose estimation method utilizing a GRU network is proposed. The GRU network predicts yaw angles based on sequential yaw angle data. The predicted yaw angle is used in the proposed method for yaw angle correction in cornering sections, where rotational error and translation error tend to increase. The GRU network is trained with sequences of yaw angle data at cornering sections obtained from the classical VO and an accurate sensor. During testing of the GRU network, the next yaw angle is predicted by the GRU network taking the yaw angles estimated by the classical VO as the input. In this way, a more accurate yaw angle is obtained by fusing the yaw angles estimated by the classical VO and the GRU network, with the NCC as the weighting factor of the fusion mechanism. With the estimated yaw angle, the rotation matrix is reconstructed for subsequent use.
Simulation results with 11 sequences of the KITTI dataset show that, with the classical VO employing the proposed method, the performance in terms of translation error and rotation error is improved significantly. In addition, the proposed method requires less computational effort for learning when compared with learning-based VO methods that use images as the input data. Since the proposed method is combined with the classical VO without any change to the original pipeline, it can be applied to various VO methods: not only feature-based VO methods, as in the simulations conducted in this paper, but also direct and learning-based VO methods. The proposed method suggests that prediction based on a GRU network can improve performance in unstable areas such as cornering sections.