Semi-Direct Point-Line Visual Inertial Odometry for MAVs

Abstract: Traditional Micro-Aerial Vehicles (MAVs) are usually equipped with a low-cost Inertial Measurement Unit (IMU) and monocular cameras. Achieving high-precision, high-reliability navigation at low computational complexity is therefore the main problem for MAVs. To this end, a novel semi-direct point-line visual inertial odometry (SDPL-VIO) is proposed for MAVs. In the front-end, point and line features are introduced to enhance image constraints and increase environmental adaptability. At the same time, the semi-direct method combined with IMU pre-integration is used to complete motion estimation. This hybrid strategy combines the accuracy and loop closure detection performance of the feature-based method with the rapidity of the direct method, which track keyframes and non-keyframes, respectively. In the back-end, the sliding window mechanism is adopted to limit the computation, while the improved marginalization method is used to decompose the high-dimensional matrix corresponding to the cost function, reducing the computational complexity of the optimization process. The comparison results on the EuRoC datasets demonstrate that SDPL-VIO performs better than the other state-of-the-art visual inertial odometry (VIO) methods, especially in terms of accuracy and real-time performance.


Introduction
A navigation system is one of the main applications for MAVs [1]. It mainly provides accurate and reliable attitude, speed and position information, and is indispensable for MAV flight and control. Due to limitations in size and payload capacity, MAVs are typically equipped with low-cost sensors and lack the hardware required to run advanced integrated navigation algorithms. Designing a low-cost integrated navigation platform for MAVs, and building a more efficient system on this basis to ensure accuracy and real-time performance, is the basic premise for MAVs to perform tasks and even survive.
Visual Inertial Navigation (VIN) is a hot topic in the field of MAV research [2,3]. Cameras can provide abundant visual information, and inertial sensors (gyroscopes and accelerometers) can provide short-term, high-precision pose estimation. Their low cost, low power, small size and complementarity make their combination particularly suitable for MAVs.
Based on the functional structure, a VIN system can be divided into the front end and the back end. The front end completes the calculation of visual and inertial motion states, and the back end realizes data fusion and outputs the optimal state estimation.
Front-end image processing methods can be divided into the following: (1) Feature-based methods [4-12], which extract representative features (such as point or line features) from images, and then match them according to the feature descriptors. OKVIS [5] finds features using the Harris corner detector, and matches them using Binary Robust Invariant Scalable Keypoints (BRISK) descriptors. ORB-SLAM2 [7] performs feature extraction, description and matching, and then performs motion estimation using Oriented FAST and Rotated BRIEF (ORB) features. PL-SLAM [8] integrates a line representation within SLAM, and improves the performance of ORB-SLAM2, especially in poorly textured environments. However, feature extraction and the calculation of descriptors are very time-consuming. (2) Direct methods [13,14], which estimate the motion based on the pixel gray difference between two images. DSO [14] minimizes the photometric error to estimate the camera motion, which greatly reduces the amount of computation compared with feature-based methods. However, it has high requirements regarding image quality and is not suitable for large inter-frame motion. (3) Semi-direct methods [15-22], which combine the above two methods and have received increasing attention from researchers in recent years. SVO [15] first uses the image intensity to estimate the pose, and then uses the position of feature points to optimize the pose. However, it still has shortcomings in motion-tracking accuracy and robustness. PL-SVO [16] extends the SVO algorithm with line segments, and has stronger robustness in poorly textured environments. SVL [18] combines ORB-SLAM and SVO, in which the former is used in keyframes and the latter is used in non-keyframes. PCSD-VIO [22] follows the frame-tracking strategy of SVL, but integrates online photometric calibration, and its fusion method with the IMU is slightly different.
Back-end data fusion methods can be divided into the following: (1) Filtering-based methods [4,6], which use inertial observations for state propagation and visual observations for state update. As the number of features in the state vector increases, the computational complexity rapidly increases, so these methods are not suitable for large-scale scenes.
(2) Optimization-based methods [5,9-11,23-28], which are usually based on keyframes and minimize the overall error by establishing the connections between frames and constantly adjusting the frame poses. VINS-Mono [10] is a tightly coupled algorithm based on nonlinear optimization, with initialization, loop detection and relocalization modules. PL-VIO [11] adds line features to the basic framework of VINS-Mono, and has achieved good results. However, it lacks a loop-detection module, and requires a large amount of computation in nonlinear optimization.
The above work has both advantages and disadvantages, such as the adoption of time-consuming feature extraction, sensitivity to poorly textured environments and large illumination changes, ability to work without inertial sensors, and the large amount of computation needed in nonlinear optimization. Therefore, a novel semi-direct point-line visual inertial odometry for MAVs is proposed. The main contributions are as follows:
Firstly, motion estimation is accomplished using the semi-direct method, in which the feature-based method and the direct method compensate for each other's weaknesses. The former accurately tracks keyframes, and extracts point and line features for back-end nonlinear optimization and loop-closure detection, while the latter rapidly tracks non-keyframes through direct image alignment.
Secondly, the sliding window mechanism is adopted to effectively combine visual and inertial information, and an improved marginalization method is used to decompose the high-dimensional matrix corresponding to the cost function, which bounds the computational complexity and improves the computational efficiency.
Finally, experiments are conducted to compare SDPL-VIO with other state-of-the-art VIO methods. The results on the European Robotics Challenge (EuRoC) datasets show that SDPL-VIO balances speed and accuracy.
The rest of this paper is organized as follows. In Section 2, the mathematical formulation is given. Next, the proposed system implementation is described in Section 3. The experimental results and analysis are shown in Section 4. Finally, a conclusion is given in Section 5.

Notations
We define {w}, {b} and {c} as the world, body and camera coordinates, respectively. $R_{cw}$ and $p_{cw}$ are the rotation and translation from the world coordinate to the camera coordinate. $R_{bc}$ and $p_{bc}$ represent the extrinsic parameters, which can be calibrated in advance. $T = \begin{bmatrix} R & p \\ 0 & 1 \end{bmatrix}$ is the 4 × 4 homogeneous transformation. $q$ is the quaternion representation of $R$, and $\otimes$ represents the multiplication between two quaternions. $[\cdot]_\times$ represents the skew-symmetric matrix corresponding to a vector.
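The notation above can be made concrete with a minimal Python sketch. The function names (`skew`, `make_transform`, `apply_transform`) are illustrative choices, not part of the original system; matrices are plain nested lists.

```python
def skew(v):
    """Skew-symmetric matrix [v]_x, so that skew(a) @ b equals the cross product a x b."""
    x, y, z = v
    return [[0.0, -z, y],
            [z, 0.0, -x],
            [-y, x, 0.0]]


def make_transform(R, p):
    """Build the 4 x 4 homogeneous transformation T = [[R, p], [0, 1]]."""
    return [list(R[0]) + [p[0]],
            list(R[1]) + [p[1]],
            list(R[2]) + [p[2]],
            [0.0, 0.0, 0.0, 1.0]]


def apply_transform(T, P):
    """Apply T to a 3D point P (returns R * P + p)."""
    Ph = list(P) + [1.0]  # homogeneous coordinates
    return [sum(T[r][c] * Ph[c] for c in range(4)) for r in range(3)]
```

For example, with a 90 degree rotation about the z axis and a translation of one unit along x, `apply_transform` maps the point (1, 0, 0) to (1, 1, 0).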

IMU Pre-Integration
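The raw IMU measurements (angular velocity $\hat{\omega}$ and acceleration $\hat{a}$) are affected by gravity $g^w$, bias $b$ and noise $n$ [10], whereas $\omega$ and $a$ denote the true angular velocity and acceleration.
The IMU state propagation between the consecutive frames $b_k$ and $b_{k+1}$ is expressed in terms of the pre-integrated measurements, which are accumulated from the IMU samples collected between the two frames. The measurement model, the state propagation and the pre-integrated measurements correspond to Equations (1)-(3) and follow [10].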
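As an illustration of how pre-integrated measurements can be accumulated between two frames, the following sketch uses simple Euler integration. The names `alpha`, `beta` and `gamma` (pre-integrated position, velocity and rotation) follow the convention commonly used with [10] and are an assumption here, as is the Euler scheme; the original system may integrate differently and also propagates covariance and bias Jacobians, which this sketch omits.

```python
import math


def quat_mul(q, r):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return (w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
            w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
            w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
            w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2)


def rotate(q, v):
    """Rotate vector v by the unit quaternion q."""
    w, ux, uy, uz = q
    # t = 2 * (u x v)
    tx = 2.0 * (uy * v[2] - uz * v[1])
    ty = 2.0 * (uz * v[0] - ux * v[2])
    tz = 2.0 * (ux * v[1] - uy * v[0])
    # v' = v + w * t + u x t
    return (v[0] + w * tx + uy * tz - uz * ty,
            v[1] + w * ty + uz * tx - ux * tz,
            v[2] + w * tz + ux * ty - uy * tx)


def preintegrate(samples, dt, b_a=(0.0, 0.0, 0.0), b_g=(0.0, 0.0, 0.0)):
    """Euler pre-integration of (acc, gyro) samples expressed in frame b_k.

    Gravity is deliberately not removed here: in the pre-integration
    formulation it is handled in the state propagation, not in these terms.
    """
    alpha = [0.0, 0.0, 0.0]          # pre-integrated position
    beta = [0.0, 0.0, 0.0]           # pre-integrated velocity
    gamma = (1.0, 0.0, 0.0, 0.0)     # pre-integrated rotation (quaternion)
    for acc, gyr in samples:
        a = rotate(gamma, [acc[i] - b_a[i] for i in range(3)])
        w = [gyr[i] - b_g[i] for i in range(3)]
        for i in range(3):
            alpha[i] += beta[i] * dt + 0.5 * a[i] * dt * dt
            beta[i] += a[i] * dt
        # First-order quaternion update, then re-normalization
        dq = (1.0, 0.5 * w[0] * dt, 0.5 * w[1] * dt, 0.5 * w[2] * dt)
        g = quat_mul(gamma, dq)
        norm = math.sqrt(sum(c * c for c in g))
        gamma = tuple(c / norm for c in g)
    return alpha, beta, gamma
```

Because the accumulated terms depend only on the IMU samples and the bias estimates, they do not need to be re-integrated when the pose of frame $b_k$ changes during optimization.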

Point Feature Projection
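For a pin-hole camera model, the projection $\pi_c(\cdot)$ from a point in the camera coordinate, $P_c = [x_c, y_c, z_c]^T \in \mathbb{R}^3$, to the camera image plane, $p = [u, v, 1]^T$, can be defined as:

$$p = \pi_c(P_c) = \begin{bmatrix} f_x x_c / z_c + c_x \\ f_y y_c / z_c + c_y \\ 1 \end{bmatrix}$$

where $(f_x, f_y, c_x, c_y)$ represent the camera intrinsic parameters.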
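A minimal sketch of the pin-hole projection described above is given below. The function name and the intrinsic values used in the usage note are hypothetical.

```python
def project_point(P_c, fx, fy, cx, cy):
    """Project a 3D point in the camera coordinate onto the image plane.

    Returns the homogeneous pixel coordinates [u, v, 1].
    """
    x, y, z = P_c
    if z <= 0:
        raise ValueError("the point must lie in front of the camera (z > 0)")
    return [fx * x / z + cx, fy * y / z + cy, 1.0]
```

A point on the optical axis, such as `project_point([0.0, 0.0, 2.0], 460.0, 457.0, 367.0, 248.0)`, projects to the principal point (367, 248).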

Line Feature Projection
The Plücker coordinates [29] are used for line parameterization. A 3D line in Plücker coordinates is constructed as $L = [n^T, d^T]^T$, where $n \in \mathbb{R}^3$ represents the normal vector of the plane determined by the line and the coordinate origin, and $d \in \mathbb{R}^3$ represents the line direction vector. According to the description in [30], the transformation of a 3D line from the world coordinate, $L_w$, to the camera coordinate, $L_c$, is given by Equation (5), and the projection from the camera coordinate to the camera image plane is given by Equation (6), where $K$ represents the projection matrix of line $L_c$.
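The sketch below illustrates one common form of the Plücker line transformation and projection. It is an assumption that the original Equations (5) and (6) take this form: the transformation is $n_c = R n_w + [p]_\times R d_w$, $d_c = R d_w$, and the projected line is $l = K n_c$ with the line projection matrix built from $(f_x, f_y, c_x, c_y)$. All function names are illustrative.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def matvec(M, v):
    """Multiply a 3 x 3 matrix by a 3-vector."""
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]


def transform_line(R_cw, p_cw, n_w, d_w):
    """Transform Plücker coordinates (n, d) from the world to the camera coordinate."""
    Rn = matvec(R_cw, n_w)
    Rd = matvec(R_cw, d_w)
    p_cross_Rd = cross(p_cw, Rd)          # [p]_x R d
    n_c = [Rn[i] + p_cross_Rd[i] for i in range(3)]
    return n_c, Rd


def project_line(n_c, fx, fy, cx, cy):
    """Project the line onto the image plane: l = K n_c, with
    K = [[fy, 0, 0], [0, fx, 0], [-fy*cx, -fx*cy, fx*fy]]."""
    return [fy * n_c[0],
            fx * n_c[1],
            -fy * cx * n_c[0] - fx * cy * n_c[1] + fx * fy * n_c[2]]
```

Only the normal vector $n_c$ enters the projection. Any pixel $[u, v, 1]^T$ that is the pin-hole projection of a point on the 3D line satisfies $l^T [u, v, 1]^T = 0$, which is a convenient consistency check for this formulation.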

System Framework
The proposed system mainly includes two parts: the front end (measurement preprocessing, initialization and image tracking) and the back end (nonlinear optimization, marginalization and loop closure detection), as shown in Figure 1. In the front end, the system extracts point and line features from the image, and estimates the motion between adjacent image frames in combination with IMU pre-integration. By aligning visual and inertial information, visual-inertial joint initialization is performed to restore the metric scale and to estimate the inertial bias and the gravity vector. Then, after judging whether the current frame is a keyframe according to the selection criteria, the system uses the semi-direct method to track keyframes and non-keyframes, respectively. In the back end, the system obtains the optimal state estimation by minimizing a cost function composed of prior information, IMU residuals and visual re-projection errors, and uses improved marginalization to decompose the high-dimensional matrix corresponding to the cost function step by step, which speeds up the solution and improves the computational efficiency. The key processes are described in detail in the following sections.

Visual-Inertial Initialization
We conducted measurement preprocessing, and completed the visual-inertial initialization by aligning the results of vision-only Structure from Motion (SFM) with IMU pre-integration. The system performs this process first, in order to restore the metric scale and estimate the inertial bias and the gravity vector. The detailed initialization procedure is described in [11].

Keyframe Selection
After initialization, the system determines whether the current frame is a keyframe based on scene changes and IMU pre-integration. The overall output accuracy of the system depends largely on the quality of the inserted keyframes. The selection criteria for keyframes are as follows. First, the average parallax of tracked points between the current frame and the latest keyframe exceeds a certain threshold. Second, the number of tracked points falls below a certain threshold. Third, the translation calculated by IMU pre-integration since the latest keyframe was inserted exceeds a preset threshold. If any of the above criteria is met, the current frame is considered a keyframe.
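The three criteria can be sketched as a single predicate. The threshold values below are hypothetical placeholders, since the text does not state them, and the function name is illustrative.

```python
def is_keyframe(avg_parallax_px, num_tracked_points, imu_translation_m,
                parallax_thresh_px=10.0,      # hypothetical value
                min_tracked_points=20,        # hypothetical value
                translation_thresh_m=0.2):    # hypothetical value
    """Return True if any of the three keyframe criteria is met."""
    enough_parallax = avg_parallax_px > parallax_thresh_px
    too_few_points = num_tracked_points < min_tracked_points
    large_translation = imu_translation_m > translation_thresh_m
    return enough_parallax or too_few_points or large_translation
```

The criteria are combined with a logical OR, so a frame with good tracking and little parallax can still become a keyframe if the IMU indicates that the platform has moved far enough.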

Semi-Direct Method for Tracking
(a) For keyframes, we detected point features using the Features from Accelerated Segment Test (FAST) [31] algorithm and tracked them between adjacent frames using Kanade-Lucas-Tomasi (KLT) optical flow [32]. Then, we eliminated outliers by Random Sample Consensus (RANSAC) combined with the essential matrix model. Meanwhile, we detected line segments using the Line Segment Detector (LSD) [33] algorithm, and matched them between the current frame and the reference frame using Line Band Descriptors (LBD) [34]. In addition, we removed line-matching outliers using geometric constraints.
(b) For non-keyframes, as shown in Figure 2, we first extracted the key points $p$ from the last keyframe $I_n$ in the sliding window, and used the transformation to construct the matching 3D points $P_w$ in the world coordinate. Then, we projected the 3D points onto the current frame $I_{n+1}$ and obtained the pixel intensity residual $\delta I(T, p)$, that is, the intensity difference between each projected pixel in $I_{n+1}$ and its key point in $I_n$, evaluated over the image area $R$ formed by the key points $p$. By minimizing the photometric error, the relative pose $T_{n+1,n}$ can be calculated, as in Equation (8) [15]. However, the current frame pose obtained by Equation (8) alone is insufficiently accurate. Therefore, we need to find more co-visible feature points between the current frame and the co-visible keyframes in the sliding window. As shown in Figure 3, based on the pose solved by Equation (8), we projected the co-visible 3D points $P_{wi}$ in the world coordinate onto the current frame. By minimizing the photometric error of the co-visible points, the corresponding 2D feature points $p_i$ in the current frame can be obtained. After feature point matching was completed, we again optimized the camera pose $T_{cw}$ by minimizing the re-projection errors.

Back-End Optimization

Sliding Window Formulation
The sliding window [28] limits the number of keyframes and prevents the number of poses and features from increasing over time, so that back-end optimization always remains within a limited complexity. The full state variables in the sliding window are defined as:

$$\chi = [x_1, x_2, \ldots, x_n, P_{w1}, P_{w2}, \ldots, P_{wm}, L_{w1}, L_{w2}, \ldots, L_{wo}]$$

where $x_i$ is the $i$th IMU state. The cost function, which simultaneously optimizes visual and IMU variables, is given in Equation (12), where $\{r_p, H_p\}$ are the prior information from marginalization, $r_B(z, \chi)$ are the IMU measurement residuals, and $B$ is the set of IMU states. $r_P(z^{c_j}_p, \chi)$ and $r_L(z^{c_j}_l, \chi)$ are the point and line feature re-projection errors, respectively. $P$ and $L$ are the sets of observed point and line features, and $\rho$ is the Cauchy loss function, which reduces the influence of outliers. The individual error terms are described below.

(a) IMU measurement residual
According to Equations (1)-(3), the IMU measurement residuals for two consecutive frames can be calculated, where $[\cdot]_{xyz}$ extracts the vector part of the quaternion.

(b) Point feature error term
As shown in Figure 4, the point feature re-projection error is the distance between the observed projection position and the currently estimated projection position of the 3D point, where $p^{c_j}_p$ is the observed point in camera frame $j$ and $P_{wp}$ is the matching 3D point. The Jacobian of the point feature re-projection error relative to the pose increment can be obtained by the chain rule.

(c) Line feature error term
According to Figure 4, the line feature re-projection error is the distance from the two endpoints ($p_s = [u_1, v_1, 1]^T$ and $p_e = [u_2, v_2, 1]^T$) of the matching line segment to the projected line $l$. It can be calculated by combining Equations (5) and (6) [29]. The Jacobian of the line feature re-projection error relative to the pose increment can also be obtained.
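The visual error terms described in (b) and (c), together with the robust kernel $\rho$, can be sketched as follows. The sign convention of the point residual (observed minus projected), the unscaled Cauchy kernel $\rho(s) = \log(1 + s)$ and all function names are assumptions made for illustration; the original equations may differ in scaling and weighting.

```python
import math


def cauchy_loss(s):
    """Cauchy robust kernel applied to a squared error s (unit scale assumed)."""
    return math.log1p(s)


def point_residual(observed_uv, projected_uv):
    """Point re-projection error: observed pixel minus projected pixel."""
    return (observed_uv[0] - projected_uv[0],
            observed_uv[1] - projected_uv[1])


def point_to_line_distance(p, l):
    """Signed distance from pixel p = (u, v, ...) to the image line l = (l1, l2, l3)."""
    return (p[0] * l[0] + p[1] * l[1] + l[2]) / math.hypot(l[0], l[1])


def line_residual(p_s, p_e, l):
    """Line re-projection error: distances of both segment endpoints to line l."""
    return (point_to_line_distance(p_s, l), point_to_line_distance(p_e, l))


def robust_cost(residual):
    """Robustified cost of one residual vector: rho(||r||^2)."""
    return cauchy_loss(sum(r * r for r in residual))
```

For the horizontal image line $v = 5$, written as `l = (0, 1, -5)`, the endpoints (3, 7) and (10, 5) give the residual (2.0, 0.0). The kernel grows logarithmically, so a large residual from a mismatched feature contributes far less than it would under a plain squared error.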

Improved Marginalization
Marginalization is performed to bound the computational complexity, as illustrated in Figure 5. If the second latest frame is a keyframe, the system marginalizes the oldest frame, including its pose and some of the visual landmarks it observes. When the oldest frame has co-visibility with other keyframes in the sliding window, that is, they observe the same visual landmarks, marginalization keeps the constraint on the other keyframes. Otherwise, the system retains the IMU measurements attached to this non-keyframe but removes its visual measurements. This process is solved by the Gauss-Newton iterative method and takes the form $H \delta x = b$:

$$\begin{bmatrix} H_m & H_{mp} \\ H_{mp}^T & H_p \end{bmatrix} \begin{bmatrix} \delta x_m \\ \delta x_p \end{bmatrix} = \begin{bmatrix} b_m \\ b_p \end{bmatrix}$$

where $x_m$ represents the set of states that are to be marginalized and $x_p$ represents the set of preserved states [23,24]. Through the Schur complement, $\delta x_p$ can be obtained:

$$\left(H_p - H_{mp}^T H_m^{-1} H_{mp}\right) \delta x_p = b_p - H_{mp}^T H_m^{-1} b_m \qquad (25)$$

which describes the basic marginalization, in which the states in $x_m$ are marginalized together. Equation (25) requires calculating $H_{mp}^T H_m^{-1} H_{mp}$, whose computational complexity is $O(i^3 + i^2 \times j + j^2 \times i)$, supposing that $H_m$ and $H_p$ have dimensions of $i \times i$ and $j \times j$.
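The basic marginalization step can be sketched with a small dense example. This is a pure-Python illustration under the assumptions that $H$ is symmetric and that the states to be marginalized are ordered first; the helper names are illustrative, and a real system would use a sparse linear algebra library instead.

```python
def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting.

    A is n x n and B is n x k, both given as lists of rows.
    """
    n, k = len(A), len(B[0])
    M = [[float(x) for x in A[r]] + [float(x) for x in B[r]] for r in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [[M[r][n + j] / M[r][r] for j in range(k)] for r in range(n)]


def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(row) for row in zip(*A)]


def subtract(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def marginalize(H, b, m):
    """Marginalize the first m states of H dx = b via the Schur complement.

    Returns (H_prior, b_prior) acting on the remaining (preserved) states.
    """
    H_m = [row[:m] for row in H[:m]]
    H_mp = [row[m:] for row in H[:m]]
    H_p = [row[m:] for row in H[m:]]
    b_m = [[v] for v in b[:m]]
    b_p = [[v] for v in b[m:]]
    H_mp_T = transpose(H_mp)
    H_prior = subtract(H_p, matmul(H_mp_T, solve(H_m, H_mp)))
    b_prior = subtract(b_p, matmul(H_mp_T, solve(H_m, b_m)))
    return H_prior, [row[0] for row in b_prior]
```

Solving the reduced system `H_prior dx_p = b_prior` gives the same update for the preserved states as solving the full system, which is why the marginalized information can be kept as a prior.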
The above basic marginalization method has high computational complexity, so we adopted the improved marginalization method to solve this problem. Firstly, we divided the state vector $x_m$ to be marginalized into two parts: the uncorrelated states, which contain the visual landmarks, and the correlated states, which contain the keyframe poses, velocities and IMU biases. Uncorrelated states are not related to each other, only to correlated states [24].
We describe this process using the example shown in Figure 6. The states that are to be marginalized are represented by dashed lines, where $L_{p_i} \in \mathbb{R}^3$ and $L_{l_i} \in \mathbb{R}^6$ ($1 \le i \le 2$) are points and line segments, respectively, $P_j \in \mathbb{R}^6$ ($1 \le j \le 4$) are the poses of keyframes, and $b_k \in \mathbb{R}^9$ ($1 \le k \le 4$) are the velocity and IMU bias. Figure 7 shows the corresponding Hessian matrix. The state variable that is to be marginalized is $x_m = (L_{p_1}, L_{p_2}, L_{l_1}, L_{l_2}, P_1, b_1)$, while the state variable that is to be preserved is $x_p = (P_2, b_2, P_3, b_3, P_4)$.

Firstly, the uncorrelated states in $x_m$, which only contain points and line segments ($L_{p_1}$, $L_{p_2}$, $L_{l_1}$ and $L_{l_2}$), were marginalized. The computational complexity for the marginalization of $L_{p_1}$ in Figure 7 is mainly determined by the corresponding Schur complement term. The maximum computational complexity is $\max O_{p_1} = O(3^3 + 3^2 \times 51 + 3 \times 51^2)$, supposing that $L_{p_1}$ can be observed by $P_2$, $P_3$ and $P_4$. The Hessian matrix after the marginalization of $L_{p_1}$ is shown in Figure 8. $L_{p_2}$ is marginalized after $L_{p_1}$, which requires calculating a term of the same form. The remaining $L_{l_1}$ and $L_{l_2}$ were marginalized in the same way, with a computational complexity of $O_{l_1} = O_{l_2} = O(6^3 + 6^2 \times 51 + 6 \times 51^2)$. The Hessian matrix after the marginalization of the uncorrelated states is shown in Figure 9.

Then, the correlated states in $x_m$, which contain $P_1$ and $b_1$, were marginalized by calculating $A_p - A_{mp}^T A_m^{-1} A_{mp}$, with a computational complexity of $O_r = O(15^3 + 15^2 \times 36 + 15 \times 36^2)$. The total computational complexity is therefore the sum of these terms. Compared with the basic marginalization method corresponding to Equation (25), the improved marginalization method reduces the amount of computation.
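The operation counts in this example can be checked with a few lines of arithmetic. The sketch below evaluates the stated cost expression $i^3 + i^2 j + j^2 i$ for each step. Two points are assumptions: the cost of marginalizing $L_{p_2}$ is taken as equal to the stated upper bound for $L_{p_1}$, and the one-step baseline uses $i = 3 + 3 + 6 + 6 + 15 = 33$ and $j = 36$, which is derived from the state dimensions given above rather than stated in the text.

```python
def schur_cost(i, j):
    """Operation count i^3 + i^2 * j + j^2 * i for marginalizing i states
    against j remaining states."""
    return i ** 3 + i ** 2 * j + j ** 2 * i


# Step-by-step (improved) marginalization, using the stated upper bounds
cost_points = 2 * schur_cost(3, 51)     # L_p1 and L_p2 (L_p2 assumed equal to L_p1)
cost_lines = 2 * schur_cost(6, 51)      # L_l1 and L_l2
cost_correlated = schur_cost(15, 36)    # P_1 and b_1
improved_total = cost_points + cost_lines + cost_correlated

# One-step (basic) marginalization of all 33 marginalized dimensions at once
basic_total = schur_cost(33, 36)

ratio = improved_total / basic_total
```

Under these assumptions, the step-by-step scheme needs about 30% fewer operations than the one-step scheme even at its stated maximum. The actual saving is larger whenever a landmark is observed by fewer keyframes than the upper bound assumes.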

Loop Closure Detection
For loop closure detection, we adopted DBoW2 [35], a state-of-the-art bag-of-words place recognition approach. The loop closure detection module is activated when the current frame is selected as a keyframe. If a loop closure is detected, the drift accumulated during the exploration is greatly reduced [10].

Experimental Results and Analysis
We compared SDPL-VIO with the state-of-the-art methods SVO [17], PCSD-VIO [22], PL-VIO [11] and VINS-Mono [10] on the EuRoC datasets [36]. The EuRoC datasets were collected by various sensors installed on an MAV, and are divided into three series of flight scenarios, each including motion sequences ranging from easy to difficult. First, the comparison results of the accuracy and robustness performance based on the datasets are shown in Section 4.1. Then, to demonstrate the computational performance of SDPL-VIO, the time usage statistics are illustrated in Section 4.2. Finally, the loop-closure detection capability of SDPL-VIO is evaluated in Section 4.3.

Accuracy and Robustness Performance
We implemented the above methods on the EuRoC datasets. In order to intuitively reflect the tracking effect of our proposed semi-direct method, and because PL-VIO lacks a loop-closure module, we disabled the loop-closure module of all methods. A comparison of the root-mean-square error (RMSE) on the EuRoC datasets is shown in Table 1, and the corresponding histograms are shown in Figure 10. Table 1 shows that SDPL-VIO works robustly and accurately, and that the other methods also work well. Compared with point features alone, the combination of point and line features strengthens the constraints between images, is less sensitive to illumination changes, copes better with dynamic environments, and reduces the error caused by fast rotation. Thus, due to the introduction of line features, the accuracy of SDPL-VIO is better than that of SVO, PCSD-VIO and VINS-Mono, which only use point features for tracking. In easy sequences such as MH01 and V101, which have rich features, good illumination and slow motion, the feature-based method can extract a high number of point and line features. In such environments, the quality of tracking is about the same, so the accuracy of SDPL-VIO is comparable to that of PL-VIO. In difficult sequences such as MH05 and V203, which contain motion blur, fast motion and low texture, the feature-based method faces challenges due to the lack of sufficient visual features. Our proposed semi-direct method inherits the good performance of direct methods in low-texture environments. The feature-based method provides an accurate initialization state and generates keyframes that provide good priors for the direct method, while the direct method uses direct image alignment to move the frame very close to its final pose, followed by a refinement step that reduces the pose estimation error. Therefore, SDPL-VIO performs relatively well in difficult sequences.

Figure 12 shows the translation errors on the MH01, MH05 and V202 datasets, in which blue represents SDPL-VIO, green represents PCSD-VIO, purple represents PL-VIO, and yellow represents VINS-Mono. It can be seen that, in the MH01 sequence, due to the rich scene features and good illumination conditions, there is not much difference in the accuracy of the four methods. The maximum error of SDPL-VIO in the MH01 sequence is 0.37 m, while those of PCSD-VIO, PL-VIO and VINS-Mono are 0.41 m, 0.41 m and 0.40 m, respectively.

The MH05 dataset includes fast motion and large illumination changes. The combination of point and line features is more robust to fast rotation in the trajectory, and the semi-direct method is more adaptable to low-texture environments, so SDPL-VIO has the highest accuracy. PCSD-VIO and VINS-Mono only extract point features, and they struggle to extract corner points with large grayscale differences from the surrounding pixel blocks. Therefore, the number of effective feature points is reduced, and their accuracy is lower than that of the other two methods. The maximum error of SDPL-VIO in the MH05 sequence is 0.73 m, while those of PCSD-VIO, PL-VIO and VINS-Mono are 0.85 m, 0.78 m and 0.90 m, respectively.

For the V202 dataset, SDPL-VIO still performs better than the other three methods. The maximum error of SDPL-VIO is 0.33 m, while those of PCSD-VIO, PL-VIO and VINS-Mono are 0.37 m, 0.36 m and 0.42 m, respectively. From Figures 11 and 12, it can be seen that, because the semi-direct method tracks both point and line features, SDPL-VIO performs better than the other three methods in challenging sequences, such as those with fast motion, large illumination changes or poorly textured environments.
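For reference, the RMSE and maximum translation error used in this comparison can be computed as in the sketch below. It assumes that the estimated and ground-truth positions have already been time-associated and aligned to a common frame, which the text does not detail; the function names are illustrative.

```python
import math


def translation_errors(estimated, ground_truth):
    """Per-pose Euclidean translation error between two aligned trajectories."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    return [math.sqrt(sum((e - g) ** 2 for e, g in zip(pe, pg)))
            for pe, pg in zip(estimated, ground_truth)]


def translation_rmse(estimated, ground_truth):
    """Root-mean-square of the per-pose translation errors (in metres)."""
    errors = translation_errors(estimated, ground_truth)
    return math.sqrt(sum(err ** 2 for err in errors) / len(errors))


def max_translation_error(estimated, ground_truth):
    """Largest per-pose translation error (in metres)."""
    return max(translation_errors(estimated, ground_truth))
```

The RMSE summarizes the whole trajectory, while the maximum error highlights the worst moment of tracking, which is why the two can rank methods differently on the same sequence.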

Computational Performance
To evaluate the computational performance of SDPL-VIO, the average times taken for tracking and marginalization were measured and analyzed.

Average Time for Tracking
Firstly, the computational performance when tracking images was analyzed to compare SDPL-VIO and PL-VIO, and the comparison results are shown in Table 2. PL-VIO simultaneously extracts point and line features for each frame, which is very time-consuming. SDPL-VIO uses the semi-direct method for tracking, and takes much less time than PL-VIO. This is because non-keyframes account for the majority of the tracked frames, and the system tracks them in the front end with the direct method, which is fast and does not need to extract image features or calculate descriptors. This strategy effectively reduces the average front-end tracking time. Taking V201 as an example, as shown in Figure 13, according to the keyframe selection strategy, the number of selected keyframes is 127, accounting for about 23% of the total frames, while 77% of the frames were selected as non-keyframes. However, the time needed to track keyframes accounted for 69% of the total time, while the time needed to track non-keyframes accounted for only 31%. Although the tracking time was reduced, the accuracy of SDPL-VIO was not significantly reduced; it was still comparable with that of PL-VIO, and even higher in some environments, as shown in Section 4.1.
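The V201 percentages above imply how much more expensive a keyframe is to track than a non-keyframe. The short calculation below derives that ratio from the reported shares only; the function name is illustrative, and the result inherits the rounding of the reported percentages.

```python
def per_frame_cost_ratio(keyframe_frame_share, keyframe_time_share):
    """Average tracking time of a keyframe divided by that of a non-keyframe."""
    non_keyframe_frame_share = 1.0 - keyframe_frame_share
    non_keyframe_time_share = 1.0 - keyframe_time_share
    keyframe_cost = keyframe_time_share / keyframe_frame_share
    non_keyframe_cost = non_keyframe_time_share / non_keyframe_frame_share
    return keyframe_cost / non_keyframe_cost


# V201: keyframes are about 23% of the frames and take about 69% of the tracking time
ratio = per_frame_cost_ratio(0.23, 0.69)
```

With these shares, tracking one keyframe costs roughly seven to eight times as much as tracking one non-keyframe, which is consistent with the saving obtained by reserving feature extraction for keyframes.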

Average Time for Marginalization
Marginalization is another time-consuming part of the back end, so its average time consumption was also analyzed. Taking V102 as an example, as explained in Section 3.3.2, when the back end of SDPL-VIO used the basic marginalization method, marginalization was carried out in one step, resulting in a very large amount of computation, and the average time needed for marginalization was 39 ms. When the back end used the improved marginalization method, the high-dimensional matrix corresponding to the cost function was decomposed step by step. Therefore, the amount of computation was significantly reduced, and the average time needed for marginalization was 16 ms.

Loop Closure Detection Evaluation
The most obvious advantage of the feature-based method over the direct method is that it can be used for loop closure detection, which reduces the drift accumulated during long-term operation. Therefore, the loop-closure detection ability of SDPL-VIO was finally evaluated to verify the integrity and reliability of the system. The comparison results on the MH03 dataset, shown in Figure 14, clearly show that the accuracy of SDPL-VIO improves with loop closure detection. The analysis of multiple experiments shows that SDPL-VIO balances speed and accuracy: it increases the operation efficiency without reducing the operation precision, and is superior to the other methods in terms of comprehensive performance.

Conclusions
In this paper, a novel semi-direct point-line visual inertial odometry for MAVs was proposed, which extracts point and line features from the image and uses the semi-direct method to track keyframes and non-keyframes. The proposed method also adopts the sliding window strategy and uses the improved marginalization method to decompose the high-dimensional matrix corresponding to the cost function step by step. Experiments on the EuRoC datasets show that the accuracy and real-time performance of SDPL-VIO are better than those of other state-of-the-art VIO methods, especially in challenging datasets containing fast motion, large illumination changes or poorly textured environments. The performance of SDPL-VIO, in terms of accuracy and efficiency, shows that it is suitable for navigation on MAVs with low-cost sensors. In future work, we aim to adopt a faster line detector that could be more conducive to continuous and real-time tracking.
In the sliding window state vector $\chi$, each IMU state $x_i$ contains the position, velocity and orientation of the body, together with the acceleration bias $b_a$ and the gyroscope bias $b_g$. $P_{wi}$ ($i \in [1, m]$) represent the point landmarks, and $L_{wi}$ ($i \in [1, o]$) are the 3D lines represented in Plücker coordinates. $m$ and $o$ are the number of point and line features in the sliding window, respectively.

Figure 4. Illustration of the visual reprojection errors.

Figure 6. States in the sliding window.

Figure 8. The Hessian matrix after marginalization of $L_{p_1}$.

Figure 9. The Hessian matrix after the marginalization of all landmarks.

Figure 11 shows a comparison of the trajectories obtained by SDPL-VIO, PCSD-VIO, PL-VIO and VINS-Mono on the MH01, MH05 and V202 datasets. The ground truth is represented by the red line, while the results of SDPL-VIO, PCSD-VIO, PL-VIO and VINS-Mono are represented by the blue, green, purple and yellow lines, respectively.

Figure 11. The comparison of trajectories on the MH01, MH05 and V202 datasets.

Figure 12. The comparison of translation errors on the MH01, MH05 and V202 datasets.

Figure 13. The ratio of the number and tracking time between keyframes and non-keyframes on the V201 dataset.

Figure 14. The comparison of translation errors on the MH03 dataset.

Table 1. RMSE (m) of the five methods.

Table 2. Average time (ms) spent tracking an image.