IT-SVO: Improved Semi-Direct Monocular Visual Odometry Combined with JS Divergence in Restricted Mobile Devices

Simultaneous localization and mapping (SLAM) has a wide range for applications in mobile robotics. Lightweight and inexpensive vision sensors have been widely used for localization in GPS-denied or weak GPS environments. Mobile robots not only estimate their pose, but also correct their position according to the environment, so a proper mathematical model is required to obtain the state of robots in their circumstances. Usually, filter-based SLAM/VO regards the model as a Gaussian distribution in the mapping thread, which deals with the complicated relationship between mean and covariance. The covariance in SLAM or VO represents the uncertainty of map points. Therefore, the methods, such as probability theory and information theory play a significant role in estimating the uncertainty. In this paper, we combine information theory with classical visual odometry (SVO) and take Jensen-Shannon divergence (JS divergence) instead of Kullback-Leibler divergence (KL divergence) to estimate the uncertainty of depth. A more suitable methodology for SVO is that explores to improve the accuracy and robustness of mobile devices in unknown environments. Meanwhile, this paper aims to efficiently utilize small portability for location and provide a priori knowledge of the latter application scenario. Therefore, combined with SVO, JS divergence is implemented, which has been realized. It not only has the property of accurate distinction of outliers, but also converges the inliers quickly. Simultaneously, the results show, under the same computational simulation, that SVO combined with JS divergence can more accurately locate its state in the environment than the combination with KL divergence.


Introduction
As computer technology rapidly grows, multimedia applications have penetrated into almost every aspect of our daily life. In recent years, simultaneous localization and mapping (SLAM) has been a major research topic in the robotics and computer vision communities. Workspace maps and robot location information are essential to the unmanned microaerial vehicle (MAV) and any other robots performing tasks in unknown environments. For example, in unfamiliar indoor environments, a MAV demands real-time poses and orientation information for obstacle avoidance, motion planning, and navigation [1]. In the GPS-denied or weak GPS environment [2,3], robust and accurate state estimation is necessary to realize environmental perception and motion planning. Because of energy restriction, computing resources, and platform size, it is crucial to utilize resources to perform tasks and estimate positions reasonably [4]. Therefore, it has become an important research hot spot for localization to utilize the information from a single sensor effectively and to provide optimal self-state estimation in factories, mines, etc.
SLAM was the first approach to be adopted by mobile robotics [5,6] and is estimated by the self-state or reconstructs the unknown environment [7]. Currently, SLAM technology is gradually moving closer to lightweight applications and miniaturization. Many teams are committed to effectively integrating the SLAM method into small devices, such as embedded devices, then considering it as an underlying work [8]. However, SLAM only provides a self-state estimation for the upper applications. Our goal with SLAM is to achieve the functionality of robotics, AR/VR devices, and UAVs/UGVs, which functions involve motion, navigation [9,10], teaching, entertainment, etc. SLAM is not expected to consume all computational resources, but there is a strong demand for the miniaturization and lightweight of SLAM. Among the different sensor modalities, cameras offer the advantage of low cost but rich information. Therefore, visual SLAM solutions, in which the primary sensor is a camera, are of significant interest. Monocular cameras are one of the most common sensors found in many SLAM applications [11]. SVO [12] and ORB-SLAM [13] are typical representatives of monocular applications. The semi-direct visual odometry (SVO) algorithm uses features. However, feature-correspondence is the implicit result of direct motion estimation rather than explicit feature matching [12]. Compared with other systems, the remarkable advantages of SVO are that it allows increased speed without extracting features from each frame and provides accuracy through sub-pixel feature correspondence. Meanwhile, SVO is integrated with a probabilistic mapping method, which is robust to outlier measurements, so it can achieve real-time performance on a low-cost computing platform, like MAVs. However, the target platform for application is the top-view camera on a MAV. The object in the field of view is mainly the ground, and the camera mainly moves horizontally, up and down, which makes it perform poorly in rotation localization and easily blur. ORB-SLAM is a feature-based method and realtime graph optimization system, which has tracking, loop detection, and re-localization modules. However, the entire system uses ORB feature points for mapping and tracking, and as a result, it will consume more computing resources than using optical flow [14]. It is challenging to implement the algorithm on a microprocessor and only to calculate in real-time on the current computer architecture, while all three threads are simultaneously starting in the running process of a system. On the contrary, other solutions that provide lightweight localization, such as PTAM [15], DSO [16], VINS [17], etc., can be implemented on low-cost processors or allow the CPU to handle other tasks.
At present, there are mainly feature-based and pixel-based methods for V-SLAM. Although feature-based methods have dominated the field for a long time, in recent years, many direct, dense, and semi-dense methods have still remained popular. In feature-based visual SLAM, the map information includes localization and sparse features [18], both of which are used for better localization and mapping. Unfortunately, they are not suitable for other tasks. Nowadays, the primary feature-based method is ORB-SLAM2 [19]. Direct methods do not extract features but rather use the pixel intensities in the images directly and estimate the motion by minimizing photometric errors. LSD-SLAM [20] can build large scope maps from pixels with high gradients. However, the map estimation reduces pose-graph and achieves lower accuracy than ORB-SLAM. The current mainstream pixelbased approach is the direct sparse odometry (DSO) created by Engel et al. [16], which can compute accurate camera poses in cases of poor performances of some point detectors, and enhance the robustness in low texture regions. To make the map more detailed, Lzadi et al., constructed a dense reconstruction system with Kinect-Fusion [21]. Although the method used 3D reconstruction of small-scale scenes and individual objects, and applied the intercepted symbol distance fusion (TSDF) map results to further elevate the precision, the memory size was limited. Range and accuracy of reconstruction cannot be obtained at the same time. The hybrid system SVO [22] extracts FAST features and makes use of a direct method to track features and any pixel with non-zero intensity gradient, which optimizes camera pose through reprojection errors. SVO is particularly efficient, but as a pure VO it can only perform short-term tracking, limiting its accuracy. Therefore, this method can only be efficiently implemented on restricted mobile devices.
Simultaneously, the rapid development of science and technology has been a hot spot for multidisciplinary intersections. Information theory, probability theory, and game theory are being increasingly exploited in deep learning, state estimation, localization [23], and navigation. Reasonable inference allows machines and equipment to make optimal judgments in the face of emergencies, which increases the robustness and anti-interference of the system. In visual SLAM, LSD-SLAM combines a Kalman filter with a first-order Markov model for the location of map points in three-dimensional space. ClusterVO [24] proposed a multilevel probabilistic data association strategy for dynamic road signs. It also used Shannon entropy to express the uncertainty of map points in three-dimensional space. The clustering method is combined with road features to realize the uncertainty convergence and realize association of low-level features. Since the experiments are only for a single object and the versatility is insufficient, it has to verify the data association method under multiple categories. In the scene structure analysis of structure-SLAM [25], the author uses cross-entropy to predict the normal of the plane, which reduces the rate of mismatching to a certain extent. However, it consumes a lot of resources when removing outliers, which leads to poor real-time performance. Kullback-Leibler divergence (KL divergence) [26] was used to parameterize the probability in the mapping thread in SVO and achieve a balance between the inliers and outliers. Because KL divergence is bounded, significant differences in linearization and asymmetry of KL divergence can lead to gradients that cannot be updated. We are also exploring a method that is more suitable for special mobile devices. Considering the limitations of devices and environment, it is efficient and accurate in localization, so we need a more robust algorithm. Therefore, during the exploration, we finally chose Jensen-Shannon divergence (JS divergence), a deformation of KL divergence, with SVO instead. Figure 1 illustrates a simplified pipeline of the improved SVO, where the algorithm remains with SVO on a parallel thread, mainly for sparse mapping and localization on small devices. The tracking thread uses a semi-direct method of relative pose estimation, and the mapping thread uses the depth filter. As shown in Figure 1, the application of the depth filter in the mapping thread, and using JS divergence from information theory instead of the KL divergence gives better results in dealing with outliers, which is demonstrated in Section 3.
This paper takes MAV and camera as the test platform and SVO as the improved algorithm. Meanwhile, we named it IT-SVO, which stands for semi-direct visual odometry combined with information theory. We discuss the difference between KL divergence and JS divergence, then use JS divergence to estimate the depth and carry out the following experiments:

•
The front-end tracking of PL-SVO [27][28][29] was adopted (the mapping phase is the same as SVO), which improved the robustness of the system; • The back-end depth filter replaced KL divergence with JS divergence and quantified the uncertainty of tracking. The experiments showed that JS divergence improved the accuracy of system localization.

•
We had performed a numerical evaluation on standard SLAM datasets (such as TUM RGB-D, EuRoC), and the results showed that the proposed method was better than the original SVO.
The content of this paper as divided as follows: We introduce the application of information theory in Section 2, compare JS Divergence and KL Divergence in Section 3, experimentally confirm our findings in Section 4 and summarize the work in Section 5.

Introduction of Information Theory
SVO employs a depth filter in the mapping thread, but it is not described in detail [12]. The authors of SVO describe each step of the depth filter [30]. This paper describes a threedimensional reconstruction using the pose provided by SVO, which belongs to progressive three-dimensional reconstruction. The process is the same as the SVO depth filter, except that GPU acceleration is implemented [30]. The depth filter in SVO uses a series of front and back frames to obtain the pixel depth of the mentioned frames, but there will be errora in calculating the depth of pixels and fusing multiple depth measurements by using a series of frames for each pixel depth and a single match recovery depth. SVO considers the depth error model of pixels as a probability distribution. It has two properties: one is the depth values follow a Gauss distribution, and the other is the probability of outliers obeys a beta distribution.

Reasons for Choosing KL Divergence
Kullback-Leibler divergence (KL divergence), also known as relative entropy, was initially used to indicate the degree of loss in information transmissiona. It is used in probability theory to show the difference between two similar distributions. KL divergence itself is non-negative and asymmetric: KL(P||Q) = KL(Q||P) , expressed as: KL divergence is equivalent to the difference in information entropy between two probability distributions. Suppose that one of the probability distributions is a proper distribution, and the other is a theoretical (fitting). In that case, KL divergence represents the information loss produced by fitting of the proper distribution to the theoretical distri-bution. The smaller between P and Q, the closer they are. Therefore, we can use Q as a probability to estimate P. SVO constructs a depth filter by taking a proper estimation of the mapping thread.

KL Divergence in SVO
We know that uncertainty exists in any probability-based model. In machine learning and state estimation, the uncertainty of the model can be approximated by information theory, where the probabilities are parameterized by comparing the similarity of the measured and the estimated to obtain a more precise estimation range. The mapping phase of SVO is how to deal with the depth uncertainty caused by monocular vision as shown in Figure 2: According to the epipolar constraint, I r , I k : Two frames of images in continuous motion; T k,r : The transformation matrix between two frames; P, P : Depth of the same point observed in different frames; σ: Because of the uncertainty of the monocular scale, here is the difference between two observation points, so multiple observations are needed to reduce the uncertainty. First, the depth filter model of SVO is: where d k : depth measurements (k = r, . . . , r + n); N d k d , τ 2 k : Gaussian distribution centered on truth-valued; τ 2 k is the variance; ρ: The probability of the measured value, which belongs to a good model; d min , d max : The upper and lower limits of uniform distribution; Supposed, ρ is s uniform distribution. We label each depth measurement with y k . The probability model of Equation (2) is rewritten as follows: According to the multiplication theorems in probability theory, Equation (3) is rewritten as: Suppose there is an approximate solution q Y,d, ρ of a posterior estimate p Y,d, ρ X , which satisfies: The decomposition of Equation (5) is derived from the maximum a posteriori (MAP) probability arg max[p(x|v,y)] of state estimation [31]. It is well known that in the estimation process, the maximum posterior estimation is to solve: wherex: a posteriori estimate (containing the observed); x: The system state; y: The measurement; v: The input. Then arg max is obtained by rewriting the maximum a posteriori (MAP) estimate using Bayes' rule: (7) where we drop the denominator because it does not depend on x, then we get the product of two probabilities, so Equation (7) is similar to Equation (5). In SVO, depth estimation is modeled as a linear Gaussian system (LG system) because the full Bayesian posterior is precisely Gaussian [32]. Therefore, the optimization approach will find the maximum (this is mode) of Gaussian, which is the same as the mean. Finally, we will get: where H: Hessian matrix. x: Update variable. b: A vector consisting of the system state, input and measurement. Therefore, Equations (6)-(8) are expressed as converting probability parameterization to numerical calculation. When KL divergence of p Y,d, ρ X and q Y,d, ρ is minimum value, the following conditions are satisfied: Finally, the depth model of SVO is modeled as an approximate solution of probabilistic parameterization.

Reasons for Choosing JS Divergence
Due to the non-commutative of KL divergence, it cannot be understood as the concept of 'distance'. It is not the distance of two distributions in space. Still, the information loss of one distribution has to be compared with another one. Because KL divergence is asymmetric, for some p and q there will be exposure to D KL (p||q) = D KL (q||p). This asymmetry means that choosing D KL (p||q) or D KL (q||p) can make a big difference.
During the operation of SLAM or VO, outliers are matching in data association. When iterating along with the outliers gradually, the system is likely to diverge and cause the operation to crash, so the two distributions, map points (p) and noise (q), are estimated. The noise also contains certain information that affects the distribution of map points. When distributing KL divergence and JS divergence that are far away or not overlapping, there is no mean in KL divergence that will be a constant, which means that the pixel is equipped with a nonzero intensity gradient. Whether machine learning or SLAM/VO, if the gradient is zero, it will produce a seven-dimensional zero space [16]. The manifestation may be abruptly reduced in the scale of the map. At the same time, it is clear from the introduction that the choices of different information theory methods will have a significant influence on the results, and that a proper information theory method will lead to a more robust and accurate distribution of the system.

Application of JS Divergence in SVO
The KL divergence is asymmetric, and can be transformed into JS divergence with a slight modification. First, we assume that: The KL divergence replaced by the following expression: Finally: When A and B non-overlapping, the part of left is zero as seen in Figure 3. Let P r , P g obey the normal distribution and p(x) be the probability when P r gets x, it will be found that there is non-overlapping between the two distributions, so on the left side of Equation (13), we knon that when x ≥ 5, p(x) ≈ 0, Equation (13) becomes: When x ≤ 5, p(x) ≈ 0, Equation (11) has become: Finally: ∀x ∈ R, JSD(p||q) = log 2, D KL (p||q) = ∅ (16) This is the difference between KL divergence and JS divergence. When the two distributions do not overlap, even centers of the two distributions are close, the KL divergence becomes nonsense, so the gradient is zero and cannot be updated. The JS divergence still has a numerical, which smoothed the influence of noise to a certain extent.

Why These Two Distributions Not Overlap
To understand the need for depth estimation is important to understand the nature of sensors. All sensors have limited precision. All measurements derived from real sensors have associated uncertainty. Therefore, there will be more noise for the camera sensor, which usefully carrying information determines the corresponding distribution of overlap. Theoretically, the real data distribution is usually a low-dimensional manifold, which means that the data does not have high-dimensional properties but exist in a low-dimensional space embedded in a high-dimensional area. In the three-dimensional space, the data is actually on a two-dimensional plane as shown in Figure 4. In practice, the dimensional space is much more than three-dimensions, and may be hundreds of dimensions, so the data is more challenging to overlap.

Application of JS Divergence in SVO
We need to introduce JS divergence in the posterior estimate. Comparing Equations (9) and (10), assuming m = 1 2 (p + q), and the relationship between KL divergence and expectation in Equation (11) we get: From Equation (17): The inequality of arithmetic and geometric means (Equation (19)) and the properties of the logarithmic function, lead to Equation (20): a 1 + a 2 + · · · + a n n n ≥ a 1 + a 2 + · · · + a n (19) We have obtained the correspondence between JS divergence and KL divergence in the estimated point of SVO as seen in Figure 5. The actual probabilities-KL divergence approximation probability, and JS divergence approximation probability-are denoted as p(x), q(x), q (x), respectively, and the relationship between them is represented by vector simplification.
It is well known the measurement of probabilities in learning algorithms and SLAM/VO assumes a Gaussian distribution and has a more complex character. Finally, we need to transform the model into a continuous product of two Gaussian distributions [33]. It also contains how to transform from two different distributions into a product of two Gaussian distributions. The final model is: Meanwhile, the covariance represents the difference between them, so the fusion strategy chooses Gaussian normalized product, and the uncertainty in the depth filter updated: In Equation (22), τ 2 : The variance introduced by the depth filter itself. τ 2 h : The variance is introduced by the JS divergence. (τ ) 2 : The variance after fusing the Gaussian normalized product. According to Equation (21), the second part on the right, τ 2 in Gaussian distribution, is corresponding to Equation (22). Therefore, with the increase of τ 2 , the second probability moment and the uncertainty decrease, which improves the robustness of the system against outliers. Take the limit of JS divergence, and obtain τ 2 h according to the above equation. Finally, the Gaussian normalized product gets a new (τ ) 2 .

Experimental Description
The systemic performance was evaluated by using the TUM RGB-D and EUROC MAV datasets. The system selected and improved was point-and-line-based SVO (PL-SVO) with the same depth filter as SVO, and we compare it with the state of the art: SVO/PL-SVO. All the experiments were performed on a notebook equipped with an Intel Core i5-9300H CPU (@2.40 GHz) and NVIDIA 1660Ti GPU. We ran each sequence seven times and show the average result of trajectory accuracy estimation. The evaluation indices used in the experiment were absolute trajectory error (ATE), and root means square error (RMSE) [34].

1.
Root Mean Square Error (RMSE): It represents the root mean square error of the Lie algebra of the camera pose, which can describe the difference in rotation and translation between two trajectories. 2.
Absolute Trajectory Error (ATE): The absolute trajectory error (ATE) is used to reflect the drift between the ground truth trajectory and estimated trajectory.

3.
Root Mean Square Error of ATE: It is applied to evaluate the system accuracy. So the ATE and RMSE of the frame are defined as follows: • Ground-truth: Q 1 , . . . , Q n ∈ SE(3), subscript represents time t (or frame). {t, δ}, where t = t x , t y , t z . δ = q x , q y , q z , q w , where q x , q y , q z is the image, and (q w ) is the real. Since there are enough valid map points in the keyframes, the selection of keyframes can also lay the foundation for the relocalization module. The accuracy of the system and the robustness of the location can also be verified by matching previous keyframes to find the camera pose.

•
Initialization/Relocalization Time: Tracking and mapping in SVO are parallel threads that depend on each other. The improved SVO needs to initialize both points and lines features. Relocalization is the ability of the system from crash to recovery when the light intensity or the camera moves. Here we provide relocalization time as measurement. Both initialization and relocalization are based on valid map points, so a comparison is necessary. The code is based on C++ with the addition of a clock for timing. It is important to note that the relocalization retrieves the current camera pose by matching the previous keyframe and the projection of the map point after the tracking is lost.

Experimental Evaluation of TUM RGB-D Datasets
The TUM RGB-D datasets are offered by the Computer Vision Group of the Technical University of Munich and were recorded in different environments with a Microsoft Kinect RGB-D camera. Dataset Resolution: a unique motion capture system records 640 × 480, frame rate: 30 (FPS), the ground-truth is recorded by a special motion capture system.

4.
Fr1_xyz uses a device that is an RGB-D camera (Microsoft Kinect RGB-D camera), and the acquired include both RGB image and depth image. The dataset is chosen because the images all become blurred when the camera moves rapidly, accompanied by repeated up-down/left-right movements, which brings some challenges to the robustness of the motion. 5.
Fr2_desk uses the same equipment and records data with loop, this dataset is mainly to check the loop and the accuracy of localization.
We select two datasets for testing (fr1_xyz/fr2_desk) as shown in Figure 6. We chose SVO, PL-SVO, and improved PL-SVO (the improved method in this article) for testing during the experiment: SVO: Open sourced by ETH Zurich, the open-source SVO has simplified for some reason. PL-SVO: Line features are added based on SVO, which improved accuracy and robustness in a low-texture environment.
Improved PL-SVO: Optimizing information theory (JS divergence) is added based on PL-SVO.
The experimental results are shown in Figures 7 and 8: As shown in Figure 7 each row represents the trajectory of the same system over time/number of frame, each column represents the trajectory of different systems over the same time/number of frames. Figure 7 shows the trajectories of three systems in the TUM RGB-D Dataset fr2_desk. The number of frames in this dataset is around 3000. The primary process is to take pictures around the desktop with a camera to verify the accuracy of the algorithm.
As seen in the first column, SVO initialization is completed successfully, but during the running process, rapid movement and motion blurring cause the SVO location to fail. As the dataset runs, the relocalization of SVO is turned on, and the current camera pose is retrieved by matching with the previous keyframes and projection of map points after the motion tracking is lost, resulting in discontinuous trajectories. PL-SVO works well but also has interruptions with large gaps at F = 900 frames. IT-SVO is superior to PL-SVO in that the overall trajectory is smoother and at F = 2500 frames the interval is smaller. Due to more keyframes are extracted, the current camera pose is restored quickly by the relocalization module.  The advantage of SVO is its speed. We tested it on airborne platform devices (DJI matrice100, NVIDIA XAVIER) and it reached 55 FPS, while on PC it reached 70 FPS. It is well known that the most feared scenes of camera motion are estimated to be pure rotation, changes in light intensity, motion blur, etc., so it is easy to lose in places where the trajectory changes sharply. Since the estimation and optimization of the poses all rely on grayscale matching, this leads to the lack of robustness of the system to illumination, while point features are selected, and the features in the scene are not sufficiently described.
In the complex environment, a large number of point features are extracted while having a simple keyframe selection strategy, resulting in unrepresentative keyframes, which can easily be terminated in the re-localization due to inadequate keyframe information. Figure 8 shows the trajectories of three systems in the TUM RGB-D dataset fr1_xyz. Because of the partial censoring of the SVO open-source algorithm, the running results of fr1_xyz are difficult to verify. When F = 50 frames, SVO has not finished initialization, so the camera does not generate a motion trajectory. PL-SVO and IT-SVO completed the initialization at the same number of frames and successfully generated the trajectories, mainly reflected in the better relocalization and depth update module. However, with the continuation of the dataset, PL-SVO has discontinuities in the left and right corners, due to the outliers generated by the fast movement. Based on KL divergence, it is more challenging for SVO and KL-SVO to extract sufficient depth information from the disordered outliers. When F = 600 frames, there are also discontinuities at the up and down motion corners, while IT-SVO completes the location accurately and produce a smooth trajectory. During the algorithm, SVO has a poor localization performance for a hand-held camera (for the top view of UAV), PL-SVO can generate smooth trajectories under horizontal and vertical motion, its performance is poor at the corner, and IT-SVO completes tracking better.
We modified the SVO to improve its robustness and accuracy by adding line features to the point features. For efficiency we convert the line features to edge features, i.e., segment the line features.
First, the mapping thread provides the initial depth based on the tracking thread. Then, the build thread calculates the exact depth and feeds it back to the tracking thread to achieve the pose estimation. In this process, the JS divergence is used to optimize the mapping thread, and the mean-Gaussian filter is used to suppress the outlier when solving for map points.
We compare the performance of the improved SVO with state-of-the-art systems, i.e., SVO, PL-SVO. Table 1 shows the comparative results of RMSE for ATE. Trans, and Rot representing RMSE of relative pose error (RPE) of the translation and rotation, respectively, where "-" means tracking failure. The smallest indicate the best accuracy (the trajectory lengths of fr1_xyz and fr2_desk are 7.122 and 12.6 m, respectively). As can be seen from Table 1, SVO can barely complete the location, so it is set to "-", which represents the data is invalid-in fr1_xyz, fast steering, up-and-down movement lead to low robustness of the PL-SVO and underestimation of the scene scale. Similiarly in fr2_desk, the rapid movement of the camera results in two relocalizations. In contrast, IT-SVO has a lower RMSE of ATE and RMSE of RPE, which improves the robustness and accuracy. The number of key frames is selected by three systems as shown in Figure 9. In Table 1. xyz/right: fr2_desk the x-axis represents the system type, and the y-axis represents the number of keyframes. In terms of the number of keyframes, it can be seen that compared with the state-of-the-art, PL-SVO, the improved SVO has a small improvement. As described in Section 4.1 Experimental Description, we know that there are many map points in the keyframes. On the contrary, the depth of the map points is obtained through the mapping thread, and both coexist with a linear relationship between the number of keyframes and the amount of information.
Due to the rapid movement of the camera in fr1_xyz, SVO initialization and relocalization start almost simultaneously. PL-SVO and IT-SVO require initialization of point-andline features, which are slower than SVO in the initialization phase. The comparison of initialization time is illustrated in Figure 10. From the comparison of initialization time in Figure 10, it can be seen that SVO has a long initialization time for the dataset fr1_xyz and a short initialization time for fr2_desk, which needs to be explained in third aspects for this problem: First, for fr1_xyz, we can see that the scene is single and partly white desktop and black computer. As SVO only extracts point features, although a large number of fast corners can be extracted, it is difficult to obtain map points with effective depth information, so it is necessary to perform multiple translations to achieve the initialization of the system.
Second, for fr2_desk, we can see that the scene is complex and colorful, the SVO can easily accomplish the initialization, and in contrast to PL-SVO/IT-SVO which need to extract point/line features at the same time, the initialization can be finished quickly under complex textures.
Third, PL-SVO/IT-SVO both have the same front-end (tracking thread) and have the ability of stable feature extraction. The difference in initialization time between them is mainly reflected in the back-end (mapping thread) feedback for generating valid map points. Because of the rapid movement of the camera in fr1_xyz, SVO initialization and relocalization start almost simultaneously. PL-SVO and IT-SVO require the initialization of point-and-line features. SVO runs poorly in both datasets and impossible to complete, so relocalization time is ignored and represented by "-". The relocalization time (s) are listed in Table 2. Table 2. Results of Re-localization time (Sequence: Two datasets. Average: The average time of re-localization between in two datasets. Note: The less time, the better capability of re-localization).

Experimental Evaluation of EuRoC Datasets
We evaluate our method by using the EuRoC MAV visual-inertia datasets. The data is collected on the MAV, including a binocular camera (used to capture stereo images, FPS: 20), an IMU measurement (200 Hz) and an image resolution of 752 × 480. This experiment only uses the image sequence collected by the left camera. We select two datasets for testing (MH01/MH02). The dataset is shown in Figure 11. (1) The content of MH01/MH02 are recorded by the UAV during flight, and the data include image from the binocular as well as the IMU. The EuRoC datasets are divided into three levels: easy/medium/high, the content of image include rotation, movement, turning, etc., and is mainly used to test the coupling algorithms of IMU and image in restricted mobile devices. (2) The high level exists fast movements and is mainly suitable for testing algorithms of IMU+Camera, including DSO/ORB-SLAM2 are unable to complete that level currently, so we chose two datasets for easy and medium levels.
We chose the datasets collected in a factory in this experiment and chose the same SVO/PL-SVO/IT-SVO for testing. The results are shown in Figure 12.
In Figures 12 and 13, each row represents the trajectory of the same system over time/number of frames. Each column represents the trajectory of different systems over the same time/number of frames. There are three main reasons why SVO trajectory failure occur in quite smooth regions:  First, SVO is a monocular visual odometery with no closed-loop detection, which means that relocalization will not be possible when the position difference is significant.
Second, SVO compares the current frame with the previous frame to obtain a rough estimate of the pose, therefore requires the previous frame to be sufficiently accurate. If the previous frame is lost due to occlusion, blurring, etc., then the current frame will also get an incorrect result (we need a solution to effectively suppress the outlier), resulting in a poor comparison with the map, so the system will easily stop and the trajectory will have a large interruption.
Third, the smooth linear motion causes the SVO to store a considerable number of seed points, but the depth filter converges much slower than the seed point selection speed, which leads to depth mis-estimation and serious underestimation, and intuitively generates a interruption in the running. Figure 12 shows the trajectories of three systems in the EuRoC Datasets MH01. The number of frames in this dataset is about 3700. The main content of this dataset is visual localization and navigation in a factory using IMU and camera on MAV. F = 700 is the initialization of the MAV by lifting and rotating. In the first row, due to the fast movement and insufficient point feature extraction, the initialization of SVO is low, there are more localization failures, and scale uncertainties occur during the straight-line operation. Interframe drift occurs when F = 3500, and the algorithm basically fails. PL-SVO and IT-SVO complete initialization, but localization fails when their running in a straight line. IT-SVO fails only on a small straight line, but it achieves tracking soon after relocalization. IT-SVO has better performance than SVO and PL-SVO in running datasets. The detailed error analysis is shown in Table 3. In Figure 14, the SVO still has fewer keyframes, most of which are due to the instability of the system and the complex variation of the environmental characteristics. Compared with PL-SVO, the improved SVO has a little improvement. As described experimentally, the selection of keyframes brings some robustness to the system. In Figure 15, the initialization time is longer for the three systems because the initial state of the EuRoC Datasets requires initializing the camera and IMU. The initialization of the IMU requires static initialization, which has a longer initialization time than the TUM RGB-D Datasets. The overall MH01/MH02 initialization time is higher than fr1_xyz/fr2_desk because the former requires to initialize the IMU at the same time, so the device must be stationary, while the vision-based algorithm only needs to have some translation to complete the initialization, so the initialization time is longer than the latter.
As shown in Table 4, compared with the first two systems, the improved SVO has a better relocalization function, mainly in terms of the shorter time. The relocalization time (s) are listed in Table 4. Finally, the experiments were performed multiple times while the FPS of the processed image was used to indicate the speed. As shown in Table 5. The processing speed on PC is higher than the onboard processor of the mobile device, but for a full-HD camera the processing speed will be slower, because in the tracking thread we only follow the third level of the pyramid, and the depth of scene will be complemented by the mapping thread.

Experiment and Comparison of Generalization Ability of Visual Odometry
We tested on the whole TUM RGB-D datasets, totaling twenty-three groups, and performed the corresponding experimental results and trajectory analysis. Then we choose the well-known ORB-SLAM2 for experimental comparison and select eight groups of experiments in which the proposed algorithm outperforms ORBSLAM2. Finally, the comparison video will be put to the link in the paper.
In the TUM RGB-D datasets, four groups of the robot SLAM category: (fr2_pioneer_360/ fr2_pioneer_slam/fr2_pioneer_slam2/fr2_pioneer_slam3), all the comparison algorithms mentioned in this paper cannot be completed (including ORB-SLAM2). Meanwhile, we show the experimental screenshots of the datasets as well as the trajectories. In Figure 16, eight groups of datasets and trajectories are illustrated. fr1/fr2: The data collected by different cameras, and some datasets that should be clarified. fr1_rpy: This dataset represents the rotation of the camera in space to verify the rotational invariance and motion robustness of the proposed algorithm, while ORB-SLAM2 is difficult to initialize, which leads to operational failure.
fr1_floor: The main scene of this dataset is the ground (containing inhomogeneous light), where IT-SVO can achieve localization. In contrast, ORB-SLAM2 has difficulty in extracting feature points in repetitive low-texture environments and cannot complete the initialization, which leads to localization failure.
There are also several unfinished datasets containing white ground, white boxes and open field, where SVO, PL-SVO, ORB-SLAM2 and IT-SVO all fail to achieve localization (fr1_360/fr2_large_with_loop/fr3_nostructure_notexture_near_with_loop). Finally, related to that provide seven groups of demonstrations of IT-SVO outperforming ORB-SLAM2 in the Supplementary Materials, and the code will be open source in the future.
In Figures 17 and 18, we compare the number of keyframe selections and the time of initialization, as well as the detailed error analysis given in Table 6.   As shown in Figure 17, SVO does not work in the majority of environments, and in Table 6, IT-SVO outperforms PL-SVO in terms of accuracy and robustness. where a lower ATE means that the trajectory is closer to the ground-truth.
At the same time, we also tested the algorithm on the EuRoC datasets, in which due to the the difficulty level, ORB-SLAM2 that only takes the localization mode and the algorithm proposed in this paper cannot be completed, mainly because of the fast moving of the UAV. As shown in Figure 19, we tested on the additional datasets V1_01_easy, V2_01_easy, mav03_medium and V2_02_medium except for the difficulty level. The trajectory is generated from keyframes and connected under the temporal sequence. Also, it is a criterion to evaluate the accuracy of the algorithm.     Finally, we performed experiments with a total of twenty-three groups and related to that seven groups of demonstrations of IT-SVO outperforming ORB-SLAM2 in the Supplementary Materials.

•
We have optimized the depth filter module of the monocular visual odometry (SVO) by combining information theory. Through multiple sets of experimental validation experiments and error analysis, we found that the improved SVO has better localization results when combined with JS divergence in the information theory. The constructed sparse maps are more accurate and robust. Comparing the estimated trajectory with ATE, initialization time, keyframe selection, and relocalization time, it is found that the precision and robustness of the optimized SVO system are greatly improved. Therefore, our contributions are as follow: • An efficient algorithm similar to SVO is implemented, which is also suitable for microprocessors, small devices, mobile robots, etc. • Improved accuracy of the depth filter and sufficient information can be obtained from outliers, laying the foundation for the next update. • Improved the system robustness and more accurate recovery of three-dimensional map points with the same computational resources.

•
Increases the diversity of VO and provides a new way of methodology combined with SLAM. Simultaneously, it lays a foundation for multi-source information fusion in our future work.