Forward and Backward Visual Fusion Approach to Motion Estimation with High Robustness and Low Cost

We present a novel low-cost visual odometry method for estimating the ego-motion (self-motion) of ground vehicles by detecting the changes that motion induces on the images. Different from traditional localization methods that use a differential global positioning system (GPS), a precise inertial measurement unit (IMU) or a 3D Lidar, the proposed method only leverages data from inexpensive visual sensors, namely forward and backward onboard cameras. Starting with spatial-temporal synchronization, the scale factor of the backward monocular visual odometry is estimated with an MSE-based optimization method in a sliding window. Then, in trajectory estimation, an improved two-layer Kalman filter is proposed, comprising orientation fusion and position fusion; in the orientation fusion step, the trajectory error space represented by unit quaternions is used as the state of the filter. The resulting system enables high-accuracy, low-cost ego-pose estimation and is robust to camera module degradation by automatically reducing the confidence of the failed sensor in the fusion pipeline. Therefore, it can operate in the presence of complex and highly dynamic motion, such as entering and exiting tunnels, texture-less scenes, illumination changes, bumpy roads, and even the failure of one of the cameras. The experiments carried out in this paper show that our algorithm achieves the best performance on the evaluation indexes of average error in distance (AED), average error in the X direction (AEX), average error in the Y direction (AEY), and root mean square error (RMSE) compared to other state-of-the-art algorithms, which indicates that the output of our approach is superior to other methods.


Motivations and Technical Challenges
This paper aims at developing a visual fusion approach for online ego-motion estimation using data from onboard forward and backward cameras. Ego-motion describes the self-motion of a moving object, and the ego-motion problem can be stated as the recovery of the observer's rotation and direction of translation at a given instant of time as the observer moves through the environment.

Literature Review
Visual odometry is a particular case of structure-from-motion (SFM) [21], which aims to incrementally estimate the ego-motion of an agent (e.g., vehicle, human or robot) using only a stream of visual data; its origins can be traced back to works such as References [22,23]. The first visual odometry systems were successfully used by NASA in its Mars exploration program, endeavoring to provide all-terrain rovers with the capability of estimating their motion on terrain prone to wheel slippage [24]. Since then, several methods, techniques and sensor models have been developed and employed to accomplish ego-motion estimation for both vehicles and robots [25,26].
Some of these works were accomplished using monocular vision technology with a single camera, in which the relative motion must be measured from 2-D bearing data [27]. Since the absolute scale is unobservable, monocular visual odometry has to compute the relative scale and camera pose using either the trifocal tensor method or knowledge of the 3-D structure [28]. Related works can generally be divided into three categories: appearance-based methods, feature-based methods and hybrid methods. Undeniably, some of them deliver good results [29][30][31], but they still require high-quality features to perform robustly and are more prone to errors [19]. Since our work emphasizes robust ego-motion estimation for road vehicles in complex urban environments, a monocular vision system alone is not sufficient to meet this requirement and to alleviate the disturbances of a complex environment, including low-textured scenes, non-stationary objects and others.
The idea of estimating ego-motion from consecutive 3-D frames with a stereo vision method was successfully exploited in Reference [32]. A common factor of these works is the use of feature-based methods because of their efficiency and robustness. As these works show, both sparse and dense features are used to establish correspondences between consecutive frames of the input image sequence. Some of the most common choices are points [33], lines [34], contours [35] and hybrid features [36], which can be tracked in both the spatial and temporal domains. Point features are the most commonly used of all because of their simplicity. However, because of the complexity of dynamic road scenes, errors are unavoidable and appear in every stage of detection, matching and tracking, especially in urban environments with texture-less scenes, illumination changes or bumpy roads. This makes the robustness problem extremely critical.
Considering the problem of robustness, many researchers have established hypothesis-and-test and coarse-to-fine motion estimation mechanisms based on the Random Sample Consensus (RANSAC) scheme, the bundle adjustment (BA) optimization approach and loop closure (LC) methods. The 5-point RANSAC algorithm for monocular visual odometry [37] and the 3-point scheme for the stereo method [38] were proposed early on. Yet, as the complexity of the environment increases, the number of RANSAC iterations grows exponentially, along with increasing conflicts among the internal keypoints and decreasing confidence of the vision-based estimation results. Other works use BA and LC methods to compensate for the drift of visual odometry once a loop is successfully detected, but the absolute scale is very difficult to determine with monocular ego-motion estimation. Moreover, the paths of a moving agent do not always contain loops in urban environments.
In this case, combinations with other sensors such as wheel odometry [2], inertial sensing [18] and the global positioning system [39] have been used. However, they have intrinsic limitations: wheel odometry is seriously affected by uneven or slippery terrain and other adverse conditions and is prone to drift due to error accumulation, the commonly used civil GPS only updates at a frequency of 1 Hz, and low-cost IMUs are not accurate enough for ego-motion estimation. Some researchers have integrated high-definition maps (HDM) into the ego-motion estimation system, which can yield good results but raises problems of large-scale HDM collection, high cost and local security laws [40]. In this work, we describe a novel visual odometry method that only leverages data from inexpensive visual sensors, namely forward and backward cameras, which enables high-accuracy, low-cost ego-pose estimation and provides robustness to camera module degradation. Moreover, different from existing works, using only cameras instead of other sensors for computing ego-motion allows a simple integration of the ego-motion data into other vision-based algorithms, such as obstacle, pedestrian and lane detection, without the need for calibration between sensors.

Main Contributions
We are interested in solving ego-motion estimation with high robustness and low cost, while also solving the problem reliably and in real time. The issue is closely related to dealing with sensor degradation caused by complex environments [41], such as illumination changes and texture-less or structure-less circumstances. To this end, in order to keep the proposed approach cost-competitive, we only leverage data from inexpensive visual sensors with different orientations. Our main contributions are summarized as follows: (1) Utilizing the outstanding features of the symmetry-adapted configuration of forward and backward cameras, we provide a new fusion mechanism with two-layer data processing to comprehensively utilize the nearby environment information. Therefore, it achieves high accuracy and low drift.
(2) The data processing pipeline was carefully designed to handle sensor degradation. Starting with spatial-temporal synchronization, the scale factor of the backward monocular visual odometry is estimated with an MSE-based optimization method in a sliding window. With this, both the forward and the backward cameras have the capability of localizing independently.
(3) Furthermore, by fusing with a two-layer Kalman filter, the proposed pipeline can fully or partially bypass failed modules and make use of the rest of the pipeline to handle sensor degradation.
To the best of our knowledge, the proposed method is by far the cheapest method that enables such highly robust ego-motion estimation capable of handling sensor degradation under various lighting and structural conditions. The remainder of this paper is organized as follows: Section 2 describes the detailed methodology, including scale estimation and trajectory fusion. In Section 3, the experimental dataset and the performance evaluation results of the proposed method are presented. Sections 4 and 5 summarize this paper and discuss future research directions.

Assumptions and Coordinate Systems
Considering a sensor system consisting of a forward stereo camera and a backward monocular camera, we assume that the ego-motion estimation model is a rigid-body motion model and that the above cameras can be described by the pinhole camera model. Then, using Zhang's method [42], the intrinsic and extrinsic parameters can be easily calibrated in advance. With the calibration matrices, the relative pose transformation matrix between the two camera systems can be obtained. Hence, we can use a single coordinate system for both the forward stereo camera and the backward monocular camera. For simplicity, the 6DoF pose calculated by the proposed method is transformed into the original global coordinate system. The coordinate systems are defined in the following (see Figure 1 for illustration). There are three coordinate systems used in our system, the original global coordinate system {G}, the forward camera coordinate system {C_F} and the backward camera coordinate system {C_R}, which are defined as follows:
• The X-O-Y plane of {G} is parallel to the horizontal plane. The Z0-axis points opposite to gravity, the X0-axis points forward from the mobile platform, and the Y0-axis is determined by the right-hand rule.
• The forward camera coordinate system {C_F} originates at the optical center of the left camera of the stereo system {O_F1}. The x-axis points to the left, the y-axis points upward, and the z-axis points forward, coinciding with the camera principal axis.
• The backward camera coordinate system {C_R} originates at the optical center of the monocular camera {O_R}. The x-axis points to the left, the y-axis points upward, and the z-axis points forward, coinciding with the camera principal axis.
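Since the extrinsic parameters are calibrated offline, re-expressing a measurement from {C_R} in {C_F} reduces to applying a fixed rigid-body transform. The following minimal Python sketch illustrates this step; the 4x4 homogeneous-matrix convention and the numeric value of T_F_R are illustrative assumptions, not our platform's calibration.

```python
# Minimal sketch: express backward-camera measurements in the forward-camera frame
# using the calibrated extrinsic transform (values below are hypothetical placeholders).
import numpy as np

def make_T(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = np.asarray(t).ravel()
    return T

# T_F_R maps coordinates from {C_R} into {C_F}; an illustrative placeholder value.
T_F_R = make_T(np.eye(3), [-1.5, 0.0, -2.0])

def point_R_to_F(p_R):
    """Transform a 3D point from {C_R} into {C_F}."""
    return (T_F_R @ np.append(p_R, 1.0))[:3]

def motion_R_to_F(T_motion_R):
    """A relative motion measured in {C_R} corresponds to the conjugated motion in {C_F}."""
    return T_F_R @ T_motion_R @ np.linalg.inv(T_F_R)
```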

Data Association of Forward-Facing and Backward-Facing Cameras
With the purpose of fitting and fusing the data of the backward-facing monocular camera and the forward-facing stereo camera, the key problem of data synchronization in both the spatial and temporal domains must be solved beforehand, as shown in Figure 2. Within the stereo camera, the left and right frames are triggered at the same time by a hardware flip-flop circuit. We can therefore assume that they are hardware-synchronized and have the same timestamps. Hence, we only need to synchronize the data between the monocular camera and the stereo camera in both the spatial and temporal domains.

Using the camera intrinsic and extrinsic parameters, the data association in the spatial domain can be easily computed, and it does not change over time because of the rigid-body motion assumption. For temporal synchronization, we assume that the ego-motion has a constant velocity between two consecutive frames of the stereo camera. So, after transforming the data from the {C_R} space into the {C_F} space, two temporal synchronization steps need to be dealt with: translation synchronization and rotation synchronization.
In the translation synchronization step, a linear interpolation method is used under the constant-velocity assumption. In our method, we transform the data from the monocular camera into the stereo camera coordinate system as follows:

t_ms = t_s1 + v_s21 · ∆T,  with  ∆T = T_m − T_s1  and  v_s21 = (t_s2 − t_s1)/(T_s2 − T_s1).    (1)

Here, T_s1 and T_s2 are the consecutive timestamps of the stereo camera frames, and T_m is the timestamp of the monocular camera frame to be associated. ∆T is the time difference calculated in (1) (we take T_m − T_s1 as an example), v_s21 is the velocity in this time gap, and t_s1 and t_s2 are the motion states at the times T_s1 and T_s2 of the stereo camera.
For the rotation synchronization step, however, the rotation representation suffers from the well-known gimbal lock problem, and rotation matrices cannot be directly interpolated. In our method, we use the spherical linear interpolation (SLERP) method, with the rotations transformed to the unit quaternion representation. Assuming q_s1 = (x_1, y_1, z_1, w_1) and q_s2 = (x_2, y_2, z_2, w_2) are the unit quaternions at the timestamps T_s1 and T_s2, the rotation synchronization value q_ms can be calculated by the following equation:

q_ms = ( sin((1 − e)θ) · q_s1 + sin(eθ) · q_s2 ) / sin θ.

In this equation, e is the interpolation coefficient, which can be calculated as e = (T_m − T_s1)/(T_s2 − T_s1), and θ is obtained from the inverse trigonometric function θ = arccos( (q_s1 · q_s2) / (|q_s1||q_s2|) ). This method has the advantages of convenient calculation and good performance.
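As a concrete illustration of the two synchronization steps, the sketch below interpolates the translation with the constant-velocity model of Equation (1) and the rotation with SLERP; the function and variable names are ours, and the quaternion convention (x, y, z, w) is an assumption.

```python
# Minimal sketch of the temporal synchronization (illustrative names, not the paper's code).
import numpy as np

def interp_translation(t_s1, t_s2, T_s1, T_s2, T_m):
    """Linear interpolation of the translation state at monocular timestamp T_m."""
    v_s21 = (t_s2 - t_s1) / (T_s2 - T_s1)          # velocity in the frame gap
    return t_s1 + v_s21 * (T_m - T_s1)             # constant-velocity model, Eq. (1)

def slerp(q_s1, q_s2, e):
    """Spherical linear interpolation between unit quaternions (x, y, z, w)."""
    q_s1, q_s2 = q_s1 / np.linalg.norm(q_s1), q_s2 / np.linalg.norm(q_s2)
    dot = np.dot(q_s1, q_s2)
    if dot < 0.0:                                   # take the shorter arc
        q_s2, dot = -q_s2, -dot
    if dot > 0.9995:                                # nearly parallel: fall back to lerp
        q = (1.0 - e) * q_s1 + e * q_s2
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - e) * theta) * q_s1 + np.sin(e * theta) * q_s2) / np.sin(theta)

def sync_rotation(q_s1, q_s2, T_s1, T_s2, T_m):
    e = (T_m - T_s1) / (T_s2 - T_s1)               # interpolation coefficient
    return slerp(q_s1, q_s2, e)
```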

Loosely Coupled Framework for Trajectory Fusion
The detailed fusion processing of the forward stereo camera and the backward monocular camera is shown in Figure 3. Our approach runs the stereo visual odometry and the monocular visual odometry separately and fuses them at the decision level. Therefore, it should be classified as a loosely coupled fusion method. This framework choice brings several benefits. First, the loosely coupled fusion method allows the stereo visual odometry and the monocular visual odometry to work as independent modules, which gives the fusion system the capability of handling camera module degradation. Second, with the opposite orientations of the forward and backward cameras, the fusion pipeline can comprehensively utilize the nearby environment information and take advantage of the time difference between the front forward stereo camera and the rear backward monocular camera. Third, the system cost is much lower than that of other construction schemes, such as two stereo vision systems, Lidar-based odometry, or other systems based on precise localization sensors.

Basic Visual Odometry Method
In our proposed loosely coupled fusion method, feature-based visual odometry is used as the fundamental motion estimation method to compute the translation and rotation of both the forward-facing stereo camera and the backward-facing monocular camera, and it has a deep influence on the correctness and effectiveness of our method. It consists of three modules: feature extraction and matching, pose tracking, and local bundle adjustment.
Feature Extraction and Matching: In order to extract and match feature points rapidly, ORB (Oriented FAST and Rotated BRIEF) points were selected as image features, which have good multi-scale and orientation invariance to viewpoint changes. In the implementation, feature points are extracted in each layer of the image pyramid, and the maximum number of feature points per grid cell is limited to fewer than five, making them uniformly distributed. Specifically, for the stereo camera, feature points in the left and right images are extracted simultaneously using two parallel threads, and the matching process removes isolated points under the epipolar geometry constraint. In the feature matching stage, the Hamming distance is applied to measure the distance between feature points, and the best match must not exceed a 30% distance threshold. Meanwhile, the Fast Library for Approximate Nearest Neighbors (FLANN) method is also used to speed up the feature matching process.
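The sketch below shows what this feature stage could look like with OpenCV; the parameter values, the LSH-based FLANN configuration, and the reading of the 30% criterion as a fraction of the 256-bit ORB descriptor length are our assumptions rather than the authors' exact settings.

```python
# Minimal sketch of multi-scale ORB extraction and FLANN (LSH) matching with OpenCV.
import cv2

orb = cv2.ORB_create(nfeatures=2000, nlevels=8)      # multi-scale ORB features

def extract(img):
    """Detect keypoints and compute binary ORB descriptors."""
    return orb.detectAndCompute(img, None)

def match(desc1, desc2, max_ratio=0.30):
    # FLANN with an LSH index handles binary (Hamming-distance) descriptors such as ORB.
    flann = cv2.FlannBasedMatcher(dict(algorithm=6, table_number=6,
                                       key_size=12, multi_probe_level=1),
                                  dict(checks=50))
    good = []
    for pair in flann.knnMatch(desc1, desc2, k=2):
        if len(pair) < 2:
            continue
        m, n = pair
        # Keep a match only if its Hamming distance stays below 30% of the 256-bit
        # descriptor length and it is clearly better than the second-best candidate
        # (one possible interpretation of the 30% criterion).
        if m.distance < max_ratio * 256 and m.distance < 0.8 * n.distance:
            good.append(m)
    return good
```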
Pose Tracking: A motion-model-based and keyframe-based pose tracking method is used to estimate the successive poses of the ego vehicle. First, the uniform (constant-velocity) motion model is used to track feature points and obtain matching pairs in successive frames with the help of the observed 3D map points. With the matching pairs, we can estimate the camera motion using a PnP optimization method that minimizes the reprojection errors of all matched point pairs. However, if the number of matching point pairs falls below a certain threshold, the keyframe-based pose tracking method is activated; using a bag-of-words (BoW) model, it tracks and matches feature points between the current frame and the closest keyframe.
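A hedged sketch of the PnP step is given below, using OpenCV's RANSAC-based solver as a stand-in for the reprojection-error minimization described above; the thresholds and the fallback condition are illustrative assumptions.

```python
# Minimal pose-tracking sketch: estimate the frame pose from 3D map points and their
# 2D matches with RANSAC-PnP. K is the calibrated intrinsic matrix.
import cv2
import numpy as np

def track_pose(map_points_3d, image_points_2d, K, dist_coeffs=None):
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(map_points_3d, dtype=np.float64),
        np.asarray(image_points_2d, dtype=np.float64),
        K, dist_coeffs,
        reprojectionError=2.0, iterationsCount=100, flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok or inliers is None or len(inliers) < 20:
        return None                     # too few matches: fall back to keyframe/BoW tracking
    R, _ = cv2.Rodrigues(rvec)          # rotation vector -> rotation matrix
    return R, tvec, inliers
```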
Local Bundle Adjustment: In this step, the 3D map points observed by the closest keyframes are matched to the current feature points, and the poses and 3D map points are adjusted and optimized simultaneously to minimize the reprojection errors. In the application, in order to keep the system running in real time, the maximum number of iterations is capped at a fixed value.

Scale of Monocular Visual Odometry
To avoid the scale ambiguity of the monocular visual odometry and make it work independently, we use a sliding-window-based scale estimation method. In the sliding window, the mean square error (MSE) is used to dynamically correct and update the scale coefficient under the rigid-body motion assumption, as shown in Figure 4. In the sliding window, we assume that there are n monocular visual odometry states t_m,i, t_m,i+1, t_m,i+2, ..., t_m,i+n−1 and n stereo visual odometry states t_sm,i, t_sm,i+1, t_sm,i+2, ..., t_sm,i+n−1. Here, it is noteworthy that the states of the stereo visual odometry have been synchronized to the monocular visual odometry states in both the spatial and temporal domains. Unbiased estimates of the mathematical expectations of those states can be calculated as

t̄_m = (1/n) Σ_{j=0}^{n−1} t_m,i+j,   t̄_sm = (1/n) Σ_{j=0}^{n−1} t_sm,i+j.

The above states can then be data-centered with respect to these means:

r_m,j = t_m,i+j − t̄_m,   r_sm,j = t_sm,i+j − t̄_sm.

Theoretically, under the rigid-body assumption, the relative distance moved by the forward stereo camera and by the backward monocular camera should be equal within the sliding window. So the problem can be transformed into aligning the n monocular visual odometry states to the corresponding n points in the stereo visual odometry state space.
In practice, there is inevitably some alignment bias, and the objective of the optimization should be the minimization of this alignment bias. Here, the MSE is employed to solve the issue, as in the following equation:

e_i = r_sm,i − s R r_m,i − r_0.

In this equation, e_i is the alignment bias, s is the real scale coefficient of the monocular visual odometry, R is the rotation matrix obtained from the camera calibration, and r_0 is the residual translation between the data-centered point sets. In order to minimize the alignment bias e_i, we can set r_0 equal to 0. Then, we get the following equation:

min_s Σ_i ||e_i||² = min_s Σ_i ||r_sm,i − s R r_m,i||².    (7)
Our goal is to solve for the scale coefficient s. The above cost is quadratic in s, and setting its derivative with respect to s to zero yields the closed-form solution

s = ( Σ_i r_sm,i · (R r_m,i) ) / ( Σ_i ||r_m,i||² ).    (8)

However, in Equation (8), the scale coefficient s is not symmetric. That is, there is no reciprocal relation between s (projecting data from the monocular visual odometry onto the stereo visual odometry) and the scale obtained in the reverse direction (projecting data from the stereo visual odometry onto the monocular visual odometry). So, we rewrite the alignment bias e_i in the following symmetric form:

e_i = (1/√s) r_sm,i − √s R r_m,i.
Then, Equation (7) turns into

min_s Σ_i || (1/√s) r_sm,i − √s R r_m,i ||²,

and we obtain a new expression for the scale coefficient s, which we take as the scale of the current sliding-window end state t_m,i+n−1 of the monocular visual odometry:

s = sqrt( Σ_i ||r_sm,i||² / Σ_i ||r_m,i||² ).    (11)

By comparing Equations (8) and (11), we can see that the new Equation (11) computes the scale coefficient s without solving for the rotation matrix R. In practice, the scale estimation is seriously influenced by the vehicle motion state, especially the stationary state and changes of the motion state. Therefore, we adjust the strategy of scale estimation and correct the monocular scale step by step by taking the effect of the vehicle motion state into account.
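A minimal sketch of the window-level scale computation, assuming the symmetric least-squares form of Equation (11) applied to mean-centered, synchronized positions (array shapes and names are ours):

```python
# Sliding-window scale estimation sketch: symmetric scale that does not require R.
import numpy as np

def window_scale(t_mono, t_stereo):
    """t_mono, t_stereo: (n, 3) arrays of synchronized positions inside the window."""
    r_m = t_mono - t_mono.mean(axis=0)        # data-centered monocular states
    r_sm = t_stereo - t_stereo.mean(axis=0)   # data-centered stereo states
    return np.sqrt(np.sum(r_sm ** 2) / np.sum(r_m ** 2))   # symmetric scale, Eq. (11)
```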
In the update equation, λ is the scale updating factor and s* is the updated scale coefficient of the monocular visual odometry. The velocity ve_m,i+n−1 is computed in the following steps: first, the initial velocity ve_m,i+n−1 is computed by aligning the stereo visual odometry positions under the uniform-velocity model. Finally, the real-valued trajectory of the monocular visual odometry can be calculated using the updated scale coefficient s*, as shown in the following equation.
In this equation, R_e is the transformation matrix from the forward stereo camera to the backward monocular camera, R_sm is the transformation matrix accounting for the different initialization time points, and R_c is the transformation matrix from the monocular coordinate system {C_R} to the coordinate system {x_B, y_B, z_B}, which is opposite to the global coordinate system {G} in the X-O-Y plane.
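One way the motion-state-aware scale update described above could be realized is sketched below; the blending factor λ, the stationarity threshold, and the exact form of the update rule are illustrative assumptions rather than the implemented equation.

```python
# Hedged sketch of a motion-state-aware scale update: freeze the scale when the vehicle
# is (nearly) stationary, otherwise blend the new window estimate into the running value.
def update_scale(s_prev, s_window, velocity, lambda_=0.3, v_min=0.5):
    if velocity < v_min:
        # Near-stationary vehicle: the window-based scale is unreliable, keep the old value.
        return s_prev
    # Otherwise update the scale coefficient step by step with the factor lambda_.
    return lambda_ * s_window + (1.0 - lambda_) * s_prev
```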

Kalman Filter Based Trajectory Fusion
After assigning a dynamically updated scale coefficient to the backward monocular visual odometry, an absolute ego-motion estimate can be obtained from it. Then, in the sliding window, loosely coupled trajectory fusion is performed to obtain a precise ego-motion estimate, using a two-layer Kalman filter method.

Prediction Equation and Observation Equation
In the sliding window [T_i, T_i+n−1], we first take the ego-motion state from the forward stereo camera ∆x_(i,i+n−1)s as the state variable and establish the state prediction equation as in Formula (14). Then, we take the ego-motion state from the backward monocular camera ∆x_(i,i+n−1)m as the observation and establish the observation equation as in Formula (15):

x_(i,i+n−1) = I · ∆x_(i,i+n−1)s + w_(i,i+n−1),    (14)

z_(i,i+n−1) = ∆x_(i,i+n−1)m = I · x_(i,i+n−1) + v_(i,i+n−1).    (15)
Here, w_(i,i+n−1) is the white Gaussian noise of the state prediction equation, with covariance matrix Q_(i,i+n−1); v_(i,i+n−1) is the white Gaussian noise of the observation equation, with covariance matrix R_(i,i+n−1); and I is the identity matrix.

Calculation of Covariance Matrix
The accuracy of the trajectory fusion results depends on accurately determining the covariance matrices Q and R. In the proposed method, the matching error of all image frames in the sliding window is considered during the calculation of the visual odometry. We use the minimum optimization error of each frame to measure the matching confidence and determine the covariance matrices.
Feature-point-based visual odometry is applied to both the forward stereo camera and the backward monocular camera, and sparse 3-D map points are reconstructed by different approaches: back-projection for the stereo camera and triangulation for the monocular camera. By tracking the feature pairs between sequential images, the ego-motion estimation can be described as a 3D-to-2D problem, in which the RANSAC method is employed to remove outliers. The 3D-to-2D feature pairs from both cameras are all considered in the same optimization function, shown in Equation (16), which minimizes the reprojection error of the images and takes the rigid constraint of the system into account:
e_sum = Σ_{i∈χ} || p_i − π(R P_i + t) ||²,    (16)

where e_sum is the total reprojection error, R and t are the rotation matrix and translation vector between successive frames, χ is the set of inliers of the visual odometry, and i ∈ χ indexes the feature points in the frame that match the 3D map points. p_i are the matched feature points in the frame, P_i are the corresponding 3D points in the sparse 3D map, π is the projection function from 3D to 2D, and Σ denotes the summation of the projection errors over all inliers.
From the optimization function in Equation (16), we find that the more accurate the matching between the sparse 3D map points and the corresponding 2D frame points, the better the pose obtained by solving the optimization equation. In practice, for comparability, we take the number of frames in the sliding window into account and obtain the mean reprojection error

e_k = (η / n) Σ_{j=1}^{n} e_sum,j,   k ∈ {s, m},

where η is a correction coefficient, n is the number of frames in the sliding window, and k = s for the stereo visual odometry and k = m for the monocular visual odometry. Then, we use e_s as the diagonal elements to construct the covariance matrix Q of the process noise w, and we use e_m as the diagonal elements to construct the covariance matrix R of the measurement noise v.
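The construction of Q and R from the per-window mean reprojection errors can be sketched as follows; η, the matrix dimension, and the use of identical diagonal entries are modeling choices consistent with the description above, not the exact implementation.

```python
# Sketch: derive process/measurement covariances from per-frame matching errors.
import numpy as np

def mean_reproj_error(frame_errors, eta=1.0):
    """frame_errors: per-frame minimum optimization errors inside the sliding window."""
    return eta * np.mean(frame_errors)

def build_covariances(stereo_frame_errors, mono_frame_errors, dim=3):
    e_s = mean_reproj_error(stereo_frame_errors)   # stereo VO confidence
    e_m = mean_reproj_error(mono_frame_errors)     # monocular VO confidence
    Q = e_s * np.eye(dim)                          # process noise covariance
    R = e_m * np.eye(dim)                          # measurement noise covariance
    return Q, R
```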

Two-Layers Kalman Filter Based Trajectory Fusion
After obtaining the parameters of the prediction equation and the observation equation, along with the covariance matrices Q and R, we can benefit from the data fusion pipeline. Here, we employ two fusion spaces for the motion propagation process: one for position estimation and one for orientation estimation.
In the first layer of the fusion stage, the relative position ∆t = (∆tx, ∆ty, ∆tz) is estimated in the Euclidean linear space. In the sliding window, the Kalman filter is used with the prediction equation and observation equation constructed above; the details are shown in Table 1, where ∆t_(i,i+n−1) is the estimated change of relative position within the sliding window and t_f,i and t_f,i+n−1 are the estimated ego-motion states. The Kalman gain K_(i,i+n−1) is a diagonal matrix with identical diagonal elements; hence, we write K = k · I, where k is the coefficient of the Kalman gain matrix. Table 1. Kalman filter steps for position fusion in the sliding window.

Table 1 lists the per-step procedure of the position-fusion filter (computing the current predicted value from the prediction equation, propagating the covariance matrix of the prediction equation, and updating the covariance after the correction step).

In the orientation estimation stage, we propose an improved Kalman filter, which uses the orientation error space represented by unit quaternions as the state of the filter. In order to construct the motion propagation equation, the state vector of this filter needs to be unified into four elements ∆q_(i,i+n−1) = (qx, qy, qz, qw), which can be described as follows:

∆q_(i,i+n−1) = q*_{f,i} ⊗ q_{f,i+n−1},

where ⊗ represents quaternion multiplication and q* denotes the quaternion conjugate. Then, the estimated orientation propagation state can be acquired by the following equation:

q_{f,i+n−1} = q_{f,i} ⊗ ∆q_(i,i+n−1).

Here, ∆q_(i,i+n−1) is the estimated relative orientation change within the sliding window, and t_f,i and t_f,i+n−1 are the estimated ego-motion states represented by unit quaternions (q_{f,i} and q_{f,i+n−1} denote their orientation parts). In this form, the state propagation model and measurement model become much simpler. Moreover, the data fusion takes place in the error space represented by the error quaternion, which is closer to a linear space and, thus, better suited to the Kalman filter.
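To make the two-layer fusion concrete, the sketch below implements one window of the position Kalman update in Euclidean space and a simple error-quaternion blend for the orientation layer; the scalar gain k, the quaternion convention (x, y, z, w), and the blending form are our assumptions about the structure described above.

```python
# Sketch of one sliding-window step of the two-layer fusion (illustrative, not the paper's code).
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions stored as (x, y, z, w)."""
    ax, ay, az, aw = a
    bx, by, bz, bw = b
    return np.array([aw*bx + ax*bw + ay*bz - az*by,
                     aw*by - ax*bz + ay*bw + az*bx,
                     aw*bz + ax*by - ay*bx + az*bw,
                     aw*bw - ax*bx - ay*by - az*bz])

def quat_conj(q):
    return np.array([-q[0], -q[1], -q[2], q[3]])

def fuse_position(dt_stereo, dt_mono, P, Q, R):
    """Kalman update of the relative position (Euclidean layer) within the window."""
    x_pred = dt_stereo                        # prediction from stereo VO (identity model)
    P_pred = P + Q                            # propagate covariance
    K = P_pred @ np.linalg.inv(P_pred + R)    # Kalman gain
    x = x_pred + K @ (dt_mono - x_pred)       # correct with the monocular observation
    P = (np.eye(3) - K) @ P_pred
    return x, P

def fuse_orientation(dq_stereo, dq_mono, k):
    """Blend the relative rotations in the error-quaternion space with gain k in [0, 1]."""
    err = quat_mul(quat_conj(dq_stereo), dq_mono)                 # error quaternion
    err_corr = (1.0 - k) * np.array([0.0, 0.0, 0.0, 1.0]) + k * err  # small-error blend
    err_corr /= np.linalg.norm(err_corr)
    return quat_mul(dq_stereo, err_corr)                          # corrected relative rotation
```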
With the proposed trajectory fusion method, the ego-motion states can be computed accurately. When either the forward-facing stereo camera or the backward-facing monocular camera fails, the Kalman filter automatically updates the Kalman gain and naturally reduces the credibility of the impaired camera. This means that the proposed pipeline can fully or partially bypass failed modules and make use of the rest of the pipeline to handle sensor degradation.

Results
We evaluated the proposed method on the Oxford RobotCar Dataset and compared its performance to some state-of-the-art visual odometry systems. Key aspects, including the monocular scale estimation, robustness, and accuracy, were comprehensively evaluated.

Oxford RobotCar Dataset
The Oxford RobotCar Dataset [43] was collected by the Mobile Robotics Group, University of Oxford, UK. It focuses on long-term and large-scale real driving data for autonomous road vehicles and contains over 1000 km of driving data sequences. It was recorded from a Nissan LEAF car equipped with a Point Grey Bumblebee XB3 trinocular stereo camera, three monocular cameras, three 2D Lidars, GPS and an INS.
The reason we chose the Oxford RobotCar Dataset as our testing dataset is that it was collected in all weather conditions, including heavy rain, night, direct sunlight and snow, which makes it an ideal choice for our evaluation, especially for the robustness and accuracy tests. The data in the Oxford RobotCar Dataset are shown in Figure 5. In our experimental setup, the Bumblebee XB3 and the rear Grasshopper2 are selected as the forward and backward cameras of the proposed method. For the trinocular Bumblebee XB3, the timestamps of the three cameras are synchronized by the built-in hardware, and the left and right cameras are combined to form the forward stereo camera with the wider baseline (24 cm). The relative positioning setup is shown in Figure 6, and the more accurate extrinsic calibrations and camera models are provided in the software development kit (SDK) on the dataset website. In addition, the SDK supplies Matlab and Python functions for demosaicing and undistorting the raw Bayer images.

Evaluation of Scale Estimation Method
Monocular visual odometry always suffers from the scale ambiguity problem due to the unknown translation length of the ego-motion between frames. Hence, in this paper, a real-time scale estimation strategy for the backward-facing monocular visual odometry is proposed. In the experiment, we show that the traditional scale estimation method suffers from the influence of the vehicle motion state. For example, Figure 7a depicts how the scale coefficient may vary significantly when the ego-vehicle is waiting at a traffic light and keeps still. After considering the effect of the vehicle motion state, as proposed in this paper, the scale coefficient stays steady and smooth, as shown in Figure 7b. In order to test the validity of the proposed method, we compared our work with the traditional method proposed by Nistér [28], which sets the scale of the monocular visual odometry to a constant value. As a result, Figure 7 shows that the error caused by the ambiguous scale continuously accumulates for Nistér's method, which makes the system more prone to deviating from the ground truth. In this paper, the sliding-window-based scale estimation method is proposed to keep the system working stably. In the sliding window, the MSE is used to dynamically correct and update the scale coefficient under the rigid-body motion assumption. With this help, Figure 8 shows that our method can significantly improve the performance of the monocular visual odometry.

Robustness Evaluation
Utilizing the proposed fusion method, the system can fully or partially bypass failed modules and make use of the rest of the pipeline to handle sensor degradation. Figure 9 demonstrates the fusion procedure and the corresponding fusion confidence. As it shows, with our proposed method, the coefficient of fusion confidence is updated in real time according to the matching error of all image frames in the sliding window. Even at a point of sensor performance degradation, such as point A in the figures below, the system automatically increases the confidence of the remaining undamaged camera to guarantee the safety of the system. In addition, the failed single stereo VO reinitializes successfully at point B. Furthermore, in order to test the robustness, we compared the proposed fusion method with the Single Stereo-VO (single stereo visual odometry) and Single Mono-VO (single monocular visual odometry) methods separately. Figure 10 shows that both the Single Stereo-VO and the Single Mono-VO may fail under some complicated driving conditions. In sequence 02 of the Oxford RobotCar dataset, the Single Stereo-VO failed due to strong sunlight, which left insufficient point features for the matching step. The same situation also happened in sequence 16 for the Single Mono-VO system. In sequences 09 and 12, the Single Stereo-VO lost effectiveness because of large, fast corner turns; during the quick turning, the large motion between corresponding feature points of the images led to matching errors and made the system invalid. In our proposed system, the data processing pipeline was carefully designed to handle sensor degradation. The comparative experiments in Figure 10 show that the proposed system has higher robustness than the single stereo and monocular visual odometry. It can still work even under harsh conditions in which one of the cameras fails.
In order to test the robustness of the system in harsh environments, we also compared the proposed method with some state-of-the-art works, including the ORB-SLAM2 [33], DSO-Mono (direct sparse monocular odometry) and DSO-Stereo (direct sparse stereo odometry) [44] algorithms, using our collected dataset. Our dataset contains 14 sequences of harsh driving environments, such as rain, dusk, night, snow, direct sunlight, texture-less scenes, intense illumination, bumpy roads, and fast turns. The performance of the above-mentioned works on this dataset is shown in Table 2, which lists, for each of the 14 sequences, the driving conditions and whether each method completed the sequence (T) or failed (F). During the experiment, some methods failed due to strong lighting and some failed in turning corners. In total, the DSO monocular method and the DSO stereo method failed 3 times and 9 times respectively, and ORB-SLAM2 failed 7 times, whereas our proposed method maintained good performance on all the testing sequences.

Accuracy Evaluation
The accuracy of the proposed method was also evaluated. Firstly, the absolute translation root mean square error (RMSE) [45] was employed as the quantitative metric. Sequence 05 of the Oxford RobotCar Dataset was used as the testing data, which was captured by the forward-facing Bumblebee XB3 stereo camera and the backward-facing Grasshopper2 camera. The GPS + INS (NovAtel) data were used as the ground truth. In the testing data, the speed of the ego-vehicle changed quickly and the vehicle encountered a number of red lights. For the evaluation, some state-of-the-art monocular and stereo visual odometry methods were compared. In the experiment, the trajectories of the monocular visual odometry methods were aligned to the ground truth using a similarity transformation because of the lack of scale. The results are shown in Figure 11 and Table 3. From the results we can see that the monocular-camera-based methods M-DSO and M-ORBSLAM2 obtained RMSEs of 3.372% and 3.195% respectively, which are poor compared to the stereo-camera-based methods because monocular visual odometry always suffers from the scale ambiguity problem. In contrast, our method achieved the best accuracy among the four state-of-the-art methods [33,44,46], with an RMSE of 0.419%, which is lower than that of the S-ORBSLAM2 method (RMSE of 0.519%). This is because our method takes advantage of the outstanding features of the symmetry-adapted configuration of forward and backward cameras and comprehensively utilizes the nearby environment information.
In the second experiment, we employed the average errors (AEX, AEY and AED) [47] as the quantitative metric to evaluate the accuracy. Among the three average error metrics, AED is more important than the others, since AEX and AEY depend on the choice of the coordinate system, while AED is invariant to it. They are calculated as follows:

AEX = (1/N) Σ_{i=1}^{N} |X_i^GT − X_i|,   AEY = (1/N) Σ_{i=1}^{N} |Y_i^GT − Y_i|,
AED = (1/N) Σ_{i=1}^{N} sqrt( (X_i^GT − X_i)² + (Y_i^GT − Y_i)² ).

Here, N is the number of frames, X_i^GT and Y_i^GT are the ground-truth values in the X and Y directions obtained by the GPS + INS at the i-th frame, and X_i and Y_i are the outputs in the X and Y directions obtained by the corresponding odometry method at the i-th frame.
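For reference, these metrics can be computed directly from the aligned trajectories, for example as in the following sketch (array names are ours):

```python
# Sketch of the average-error metrics AEX, AEY and AED from aligned 2D trajectories.
import numpy as np

def average_errors(x_gt, y_gt, x_est, y_est):
    """All inputs are length-N arrays; ground truth comes from the GPS+INS trajectory."""
    aex = np.mean(np.abs(x_gt - x_est))                   # average error in X
    aey = np.mean(np.abs(y_gt - y_est))                   # average error in Y
    aed = np.mean(np.hypot(x_gt - x_est, y_gt - y_est))   # average error in distance
    return aex, aey, aed
```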
The experiment was carried out on sequence 18 of the Oxford RobotCar Dataset, which was collected on a drizzly day. We compared the proposed method with some state-of-the-art algorithms, including S-ORBSLAM2 [33] and S-VINS (stereo visual-inertial system) [46]. The visual results of the odometry estimates of the different methods are shown in Figure 12, which shows the trajectories consisting of the (X_i, Y_i) points. The ground-truth trajectory formed by the (X_i^GT, Y_i^GT) points is plotted as a black line in Figure 12. The corresponding average errors (AEX, AEY, and AED) are given in Table 4. The results show that our method achieves the best performance on AED, AEX and AEY among all the methods, which indicates that the output of our method is the most stable compared with the other methods [33,44,46]. In order to demonstrate the results clearly, we mapped the trajectories onto Google Earth. We also present the error curves of all the methods compared with the ground truth from the GPS + INS. Figure 13 shows the distance error curve in each frame and the corresponding error for each method. The results show that our proposed method has smaller overall errors and a smaller error boundary than the other state-of-the-art methods [33,44,46]. Therefore, our method is capable of navigating in the real world for kilometers and performs better than the other state-of-the-art algorithms.
In Figure 13, a curve close to the zero line means that the error is small.

Time Results
In this experiment, we ran all sequences 5 times on a computational platform with an Intel Core i5-7500 CPU (four cores @ 3.4 GHz) and 8 GB RAM to eliminate randomness in the results. In addition, we set the number of extracted features to 2000 and the number of image pyramid layers to 8 for both the stereo and monocular visual odometry systems. The time consumed by the main modules of our proposed method is shown in Table 5. The monocular and stereo visual odometry systems run at the same time on the distributed robot operating system, so the maximum processing frame rate is determined by the system that consumes more time. From the table we can see that the total processing time of our method is about 55 milliseconds, giving the capability of processing about 20 frames per second.

Discussion
Based on the obtained results, it can be seen that our approach achieves superior performance compared with other state-of-the-art methods while proving its efficiency, robustness, and accuracy. The method can fully utilize the information of the driving conditions to promote the performance of our positioning system.
Regarding the scale estimation, the scale ambiguity problem of the monocular visual odometry has been solved with the MSE optimization method. We show that, in the sliding window, the MSE is used to dynamically correct and update the scale coefficient under the rigid-body motion assumption. After considering the effect of the vehicle motion state, the scale coefficient stays steady and smooth, which makes the system less prone to deviating from the ground truth.
Regarding robustness, the proposed system can fully or partially bypass failed modules and make use of the rest of the pipeline to handle sensor degradation. It can still work even under harsh conditions in which one of the cameras fails. The proposed method was also tested in complex driving environments, such as rain, dusk, night, direct sunlight, texture-less scenes, intense illumination, bumpy roads, and fast turns. We show that our method works well in the above-mentioned complex and highly dynamic driving environments.
Regarding accuracy, our method utilizes the remarkable characteristics of the symmetry-adapted configuration of forward and backward cameras. Meanwhile, a novel fusion mechanism based on a two-layer Kalman filter data processing framework was employed to comprehensively utilize the nearby environment information. We compared the proposed method with some state-of-the-art works, including the M-ORBSLAM2, S-ORBSLAM2 [33], M-DSO, S-DSO [44] and S-VINS [46] algorithms. The results show that our method achieves the best performance on the evaluation indexes of AED, AEX, AEY and RMSE among all the methods, which indicates that the output of our method is the most accurate compared with the other methods.

Conclusions
We have presented a novel low-cost visual odometry method for estimating the ego-motion of ground vehicles in challenging environments. To improve the robustness and accuracy of the system, the scale factor of the backward monocular visual odometry is estimated with an MSE-based optimization method in a sliding window. Then, in trajectory estimation, an improved two-layer Kalman filter is proposed, comprising orientation fusion and position fusion. The experiments carried out in this paper show that our algorithm is superior to other state-of-the-art algorithms. Different from traditional localization methods that use a differential GPS, a precise IMU or a 3D Lidar, the proposed method only leverages data from inexpensive cameras. Meanwhile, our fusion system employs the outstanding features of the symmetry-adapted configuration of forward and backward cameras and provides a new fusion mechanism with a two-layer data processing framework to comprehensively utilize the nearby environment information. Therefore, the proposed pipeline can fully or partially bypass failed modules and make use of the rest of the pipeline to handle sensor degradation, making the system more robust and accurate.
In future work, we will further optimize our proposed algorithm, reduce its computational complexity, and try to implement it on a compact embedded platform. We will try to integrate other low-cost sensors, such as a conventional low-cost GPS and an IMU, into our system. Meanwhile, we will also explore methods for adjusting accelerometer biases using the output of our system, since the velocity measured by our system should be equal to the velocity integrated from the bias-corrected accelerometer measurements.