SD-VIS: A Fast and Accurate Semi-Direct Monocular Visual-Inertial Simultaneous Localization and Mapping (SLAM)

In practical applications, how to achieve a perfect balance between high accuracy and computational efficiency can be the main challenge for simultaneous localization and mapping (SLAM). To solve this challenge, we propose SD-VIS, a novel fast and accurate semi-direct visual-inertial SLAM framework, which can estimate camera motion and structure of surrounding sparse scenes. In the initialization procedure, we align the pre-integrated IMU measurements and visual images and calibrate out the metric scale, initial velocity, gravity vector, and gyroscope bias by using multiple view geometry (MVG) theory based on the feature-based method. At the front-end, keyframes are tracked by feature-based method and used for back-end optimization and loop closure detection, while non-keyframes are utilized for fast-tracking by direct method. This strategy makes the system not only have the better real-time performance of direct method, but also have high accuracy and loop closing detection ability based on feature-based method. At the back-end, we propose a sliding window-based tightly-coupled optimization framework, which can get more accurate state estimation by minimizing the visual and IMU measurement errors. In order to limit the computational complexity, we adopt the marginalization strategy to fix the number of keyframes in the sliding window. Experimental evaluation on EuRoC dataset demonstrates the feasibility and superior real-time performance of SD-VIS. Compared with state-of-the-art SLAM systems, we can achieve a better balance between accuracy and speed.


Introduction
Simultaneous localization and mapping (SLAM) plays an important role in self-driving cars, virtual reality, unmanned aerial vehicles (UAV), augmented reality, and artificial intelligence [1,2]. This technology can provide reliable state estimation for UAVs and self-driving cars in GPS-denied environments by relying on its own sensors. Various types of sensors can be utilized in SLAM, such as stereo cameras, lidar, inertial measurement units (IMU), and monocular cameras. However, they have significant disadvantages when used individually: a stereo camera obtains the metric scale directly from its fixed baseline length, but the depth can only be estimated accurately within a limited range [3]; lidar has high precision indoors, but it encounters reflection problems on glass surfaces outdoors [4]; cheap IMUs are extremely susceptible to bias and noise [5]; a monocular camera cannot estimate the absolute metric scale [6]. This paper mainly studies the monocular visual-inertial navigation system (VINS) based on multi-sensor fusion [7], which has the advantages of small size, light weight, and observable scale, roll, and pitch angles.
According to the different methods of image information processing, there are two categories of SLAM: feature-based method and direct method. The standard process of feature-based method between speed and accuracy according to camera motion and environment. Our method is motivated by SVL, but we go one step further in real-time performance. Specifically, thanks to the keyframe selection strategy and sliding window-based back-end, we only need to extract new feature points on the keyframes and track them with KLT sparse optical flow algorithm, which can further reduce the calculation complexity while ensuring accuracy.
In this paper, we present SD-VIS, a novel fast and accurate semi-direct visual-inertial SLAM framework, which combines the exactness of feature-based method and quickness of direct method. The keyframes in SD-VIS are tracked by feature-based method, which is used for sliding window-based non-linear optimization and loop closure detection. This strategy solves the problem of drift in the long-term operation and ensures the robustness of the algorithm in case of large baseline motion and image blur. Non-keyframes are tracked by the direct method, and the distance between adjacent non-keyframes is minimal, which ensures the convergence of error function. Compared with the direct method, SD-VIS exhibits the function of loop closure detection and solves the problem of drift in long-term operation. Compared with the feature-based method, SD-VIS can achieve the same accuracy while maintaining a faster speed. Figure 1 demonstrates the framework of the semi-direct vision-inertial SLAM system. Sensor data comes from a monocular camera and IMU. IMU measurements between two consecutive images are pre-integrated, and the pre-integration is used as the constraint of IMU between two images (Section 3.2). In the initialization procedure, we detect the feature points on each image and track them with KLT sparse optical flow algorithm [15].

System Framework Overview
Sensors 2020, 20, x FOR PEER REVIEW 3 of 19
Figure 1. The semi-direct visual-inertial SLAM system framework.
In the following visual-inertial alignment, we align the pre-integrated IMU measurements with the visual images and calibrate the metric scale, initial velocity, gravity vector, and gyroscope bias by using multiple view geometry (MVG) theory based on the feature-based method (Section 3.3).
After initialization, keyframe selection is performed based on the IMU pre-integration and the previous feature matching results. The previous feature matching refers to the feature matching between the last frame and the penultimate frame in the sliding window, which is completed before the current frame arrives. Non-keyframes are used for fast-tracking and localization by the direct method [11], and keyframes are tracked by the feature-based method [18] and used for non-linear optimization and loop closure detection (Section 4). In the following tightly-coupled optimization framework, we obtain a more accurate state estimate by minimizing the visual re-projection error, IMU residual, prior information from marginalization, and re-localization information from loop closure detection (Section 5).

Definition of Symbols
Figure 2 shows the definition of symbols in the semi-direct visual-inertial SLAM framework. C, B, and W are the camera coordinate system, the IMU body coordinate system, and the world coordinate system, respectively. We define $T_{WB} = (R^W_B, P^W_B)$ as the motion of B relative to W. $T_{BC} = (R^B_C, P^B_C)$ represents the extrinsic parameters between C and B, which can be calibrated in advance. $T_{t,t+1} \in SE(3)$ represents the motion from time t to time t + 1 in the coordinate system C, and $Z_{t,t+1}$ represents a pre-integrated IMU measurement between the camera coordinate systems $C_t$ and $C_{t+1}$.

Figure 2. Symbol definition of the algorithm.

The 3D points $F_1, F_2 \in \mathbb{R}^3$ represent the spatial feature points observed simultaneously by the camera coordinate systems $C_t$ and $C_{t+1}$. $P_1, P_2, P_3, P_4 \in \mathbb{R}^2$ are the projections of the feature points on the image coordinate system. We adopt the traditional pinhole camera model to map $F_1 = [x, y, z]^T$ in the camera coordinate system to the image coordinate system by the projection function $\pi: \mathbb{R}^3 \to \mathbb{R}^2$:
$$\pi(F_1) = \begin{bmatrix} f_u \, x / z + c_u \\ f_v \, y / z + c_v \end{bmatrix}$$
where $[f_u \; f_v]^T$ and $[c_u \; c_v]^T$ are the camera intrinsic parameters.
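The pinhole projection π and its inverse π⁻¹ (used later for the photometric and re-projection errors) can be illustrated with a minimal sketch. The function names and the intrinsic values below are assumptions chosen for illustration; they are not taken from SD-VIS or from the EuRoC calibration.

```python
def project_pinhole(point_cam, fu, fv, cu, cv):
    """Project a 3D point (x, y, z) in the camera frame to pixel coordinates (u, v)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fu * x / z + cu, fv * y / z + cv)


def back_project_pinhole(pixel, depth, fu, fv, cu, cv):
    """Inverse of project_pinhole: recover the 3D point from a pixel and its depth z."""
    u, v = pixel
    return ((u - cu) / fu * depth, (v - cv) / fv * depth, depth)


# Hypothetical intrinsics, for illustration only.
FU, FV, CU, CV = 450.0, 450.0, 320.0, 240.0

pixel = project_pinhole((0.2, -0.1, 2.0), FU, FV, CU, CV)
point = back_project_pinhole(pixel, 2.0, FU, FV, CU, CV)
```

Back-projection needs the depth as an extra input because the projection discards it, which is why the direct method later carries a depth value for every tracked feature point.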

IMU Pre-Integration
In the back-end optimization and visual-inertial alignment, the constraints of vision and IMU need to be optimized in the same frame, so the IMU measurements between two adjacent frames need to be integrated into one constraint. The IMU outputs the 3-axis angular velocity ω and the 3-axis acceleration α, both including bias and Gaussian white noise, where $n^B_\omega \sim N(0, \sigma^2_\omega)$ and $n^B_\alpha \sim N(0, \sigma^2_\alpha)$ represent the Gaussian white noise, $g^W = [0, 0, g]^T$ is the gravity vector, $R^B_W$ represents the rotation matrix from W to B, and $b^B_\omega$, $b^B_\alpha$ represent the biases of the gyroscope and accelerometer.
The IMU body state at time t = j, given by formulas (4)-(6), is related to the IMU body coordinate system $B_i$. In the back-end tightly-coupled optimization, we iteratively update the IMU state variables in the sliding window. Whenever the IMU body state at time t = i is updated, the state at time t = j has to be recalculated, which is very time-consuming. We adopt the IMU pre-integration technique to avoid this unnecessary cost: formulas (4)-(6) can be rewritten in terms of pre-integrated measurements. From formulas (10)-(12), the pre-integrated measurements and the IMU body coordinate system $B_i$ are independent of each other. This means that when the states in the $B_i$ coordinate system are iteratively updated, there is no need to recalculate the states in the $B_j$ coordinate system. Since the pre-integrated IMU measurements are affected by the bias, when the bias is updated iteratively, we update the pre-integrated measurements by a first-order approximation, using the Jacobian matrices of the pre-integrated measurements with respect to the bias.
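The idea of accumulating IMU samples into one relative constraint can be sketched as follows. This is a simplified illustration, not the SD-VIS implementation: it assumes a constant sample period, simple Euler-style integration, and noise-free measurements, and the names `delta_p`, `delta_v`, and `delta_q` are chosen here for the position-like, velocity-like, and rotation terms expressed in the frame $B_i$. Gravity and the world-frame state do not appear in the accumulation, which is what makes the result independent of the state at time t = i.

```python
import math


def quat_mul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def quat_rotate(q, v):
    """Rotate vector v by the unit quaternion q."""
    w, x, y, z = q
    rotated = quat_mul(quat_mul(q, (0.0, v[0], v[1], v[2])), (w, -x, -y, -z))
    return rotated[1:]


def rotation_increment(omega, dt):
    """Quaternion for a constant angular velocity omega applied over dt."""
    rate = math.sqrt(sum(c * c for c in omega))
    if rate * dt < 1e-12:
        return (1.0, 0.0, 0.0, 0.0)
    half = 0.5 * rate * dt
    s = math.sin(half) / rate
    return (math.cos(half), omega[0] * s, omega[1] * s, omega[2] * s)


def preintegrate(samples, dt, bias_gyro=(0.0, 0.0, 0.0), bias_acc=(0.0, 0.0, 0.0)):
    """Accumulate (gyro, acc) samples into relative terms expressed in frame B_i."""
    delta_p = [0.0, 0.0, 0.0]
    delta_v = [0.0, 0.0, 0.0]
    delta_q = (1.0, 0.0, 0.0, 0.0)  # rotation from the current body frame to B_i
    for gyro, acc in samples:
        omega = [m - b for m, b in zip(gyro, bias_gyro)]
        acc_in_bi = quat_rotate(delta_q, [m - b for m, b in zip(acc, bias_acc)])
        for k in range(3):
            delta_p[k] += delta_v[k] * dt + 0.5 * acc_in_bi[k] * dt * dt
            delta_v[k] += acc_in_bi[k] * dt
        delta_q = quat_mul(delta_q, rotation_increment(omega, dt))
    return delta_p, delta_v, delta_q


# Synthetic data: 1 s of constant 1 m/s^2 acceleration along x, no rotation.
accel_samples = [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))] * 100
delta_p, delta_v, delta_q = preintegrate(accel_samples, dt=0.01)

# Synthetic data: 1 s of constant yaw rate (pi/2 rad/s), no acceleration.
turn_samples = [((0.0, 0.0, math.pi / 2), (0.0, 0.0, 0.0))] * 100
_, _, turn_q = preintegrate(turn_samples, dt=0.01)
```

In a full system the bias arguments would be the current bias estimates; after a small bias change, the first-order correction described above replaces a complete re-run of this loop.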

Visual-Inertial Alignment
The convergence speed and effect of nonlinear visual-inertial SLAM systems depend heavily on reliable initial values. Therefore, in the initialization procedure of SD-VIS, we align the pre-integrated IMU measurements with the visual image to complete the system initialization.

Gyroscope Bias Correction
We regard the camera coordinate system $C_0$ as the world coordinate system. We detect the feature points on each image and track them with the KLT sparse optical flow algorithm; the rotations $R^{C_0}_{C_t}$ and $R^{C_0}_{C_{t+1}}$ of the two adjacent frames $C_t$ and $C_{t+1}$ can then be estimated by visual structure from motion (SfM), while the relative rotation between the IMU body coordinate systems $B_t$ and $B_{t+1}$ can be estimated by IMU pre-integration. Relating the two estimates gives formula (16), a least squares problem in the gyroscope bias. We solve it to get the initial calibration of the gyroscope bias and use it to update θ.

Gravity Vector, Initial Velocity, and Metric Scale Correction
The variables that need to be calibrated are the initial velocity of each keyframe in the IMU body coordinate system, the gravity vector $g^{C_0}$ in the camera coordinate system $C_0$, and s, the metric scale of the semi-direct visual-inertial SLAM framework.
Suppose we have obtained the pre-calibrated extrinsic parameters $T_{BC} = (R^B_C, P^B_C)$. We can then transform the pose $T_{C_0 C_t} = (R^{C_0}_{C_t}, P^{C_0}_{C_t})$ from the camera coordinate system to the IMU body coordinate system (formulas (19) and (20)). Considering two adjacent keyframes $B_t$ and $B_{t+1}$, formulas (7) and (8) can be rewritten for this pair. Combining formulas (19)-(22) gives a formula in which the rotation R can be obtained through visual SfM. Solving this formula, we can calibrate the initial velocity of each keyframe, the gravity vector, and the absolute metric scale. After estimating the scale, we adjust the translation vectors of the visual SfM so that the system has an observable scale.
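The camera-to-body pose transformation can be sketched as below. This is a minimal illustration under two assumptions that are not spelled out in the text above: poses compose as $T_{C_0 C_t} = T_{C_0 B_t} \cdot T_{BC}$, and the SfM translation is known only up to the scale s. The helper names are hypothetical.

```python
def mat_mul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)] for r in range(3)]


def mat_vec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def transpose(m):
    """Transpose of a 3x3 matrix (the inverse, for a rotation matrix)."""
    return [[m[c][r] for c in range(3)] for r in range(3)]


def camera_pose_to_body_pose(R_c0_c, p_c0_c, R_b_c, p_b_c, scale):
    """Convert an up-to-scale camera pose in C_0 into a metric IMU body pose in C_0.

    R_c0_b = R_c0_c * R_b_c^T
    p_c0_b = scale * p_c0_c - R_c0_b * p_b_c
    """
    R_c0_b = mat_mul(R_c0_c, transpose(R_b_c))
    lever_arm = mat_vec(R_c0_b, p_b_c)
    p_c0_b = [scale * p - l for p, l in zip(p_c0_c, lever_arm)]
    return R_c0_b, p_c0_b


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical extrinsics: camera rotated 90 degrees about z relative to the body.
R_B_C = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
P_B_C = [0.1, 0.0, 0.0]

R_c0_b, p_c0_b = camera_pose_to_body_pose(IDENTITY, [1.0, 2.0, 3.0], R_B_C, P_B_C, scale=2.0)
```

The lever arm $P^B_C$ is metric (it comes from calibration), so only the SfM translation is multiplied by the scale.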

Gravity Vector Refinement
We can know the magnitude of the gravity vector in advance, so we refer to the VINS-Mono [18] method to re-parameterize the gravity vector obtained in Section 3.3.2 with two variables in tangent space, and perform further optimization.
After obtaining the accurate gravity vector, we can rotate the coordinate system C 0 , which is temporarily the world coordinate system, to the real world coordinate system W. However, since the yaw angle in the visual-inertial SLAM system is unobservable, the yaw angle of the C 0 coordinate system remains unchanged during the rotation process. At this time, the initialization procedure of the semi-direct visual-inertial SLAM system is completed.
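The two-variable tangent-space parameterization can be sketched as follows, in the spirit of the VINS-Mono approach [18]: the gravity direction is perturbed along two basis vectors orthogonal to it, and the result is rescaled to the known magnitude. The function names, the choice of helper axis, and the magnitude 9.81 are assumptions made for illustration.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def tangent_basis(g_dir):
    """Two unit vectors b1, b2 spanning the plane orthogonal to g_dir."""
    g_hat = normalize(g_dir)
    # Pick a helper axis that is not (nearly) parallel to the gravity direction.
    helper = (1.0, 0.0, 0.0) if abs(g_hat[0]) < 0.9 else (0.0, 0.0, 1.0)
    b1 = normalize(cross(g_hat, helper))
    b2 = cross(g_hat, b1)
    return b1, b2


def refine_gravity(g_est, magnitude, w1, w2):
    """Apply a tangent-space update (w1, w2) and restore the known magnitude."""
    g_hat = normalize(g_est)
    b1, b2 = tangent_basis(g_hat)
    updated = [magnitude * g_hat[k] + w1 * b1[k] + w2 * b2[k] for k in range(3)]
    return tuple(magnitude * c for c in normalize(updated))


G_MAGNITUDE = 9.81  # assumed known gravity magnitude
g_initial = (0.3, -0.2, 9.6)  # hypothetical estimate from the linear alignment step
b1, b2 = tangent_basis(g_initial)
g_refined = refine_gravity(g_initial, G_MAGNITUDE, w1=0.05, w2=-0.02)
```

Because only w1 and w2 are optimized, the gravity vector keeps two degrees of freedom and its norm can never drift away from the known magnitude.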

Keyframe Selection
We have three different keyframe selection strategies. Satisfying any one of them makes the current frame a keyframe. The first and third strategies are based on the feature matching results between the last frame and the penultimate frame in the sliding window, which have been matched before the current frame arrives. The first selection strategy is the number of tracked feature points. No new feature points are extracted when tracking non-keyframes, and the translational motion of the camera reduces the number of tracked feature points. If the number of tracked points in the last frame in the sliding window is less than 70% of the minimum tracking point threshold, the current frame is treated as a keyframe. The second selection strategy is related to IMU pre-integration. If the translation distance between the last two adjacent frames in the sliding window, calculated by IMU pre-integration, exceeds a preset threshold, the current frame is also considered a new keyframe. The third selection strategy is the average parallax of the feature points tracked on the last frame and the penultimate frame in the sliding window. The translation or rotation of the camera will cause parallax. When the average parallax exceeds the threshold, the current frame is also regarded as a keyframe.
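The three criteria combine as a logical OR, which can be sketched as below. The function name, argument names, and all threshold values in the usage lines are hypothetical; the text above only fixes the 70% factor.

```python
def is_keyframe(num_tracked, min_tracked_threshold,
                imu_translation, translation_threshold,
                average_parallax, parallax_threshold):
    """Return True if any of the three keyframe selection criteria is met."""
    # 1) Too few feature points are still tracked in the last frame of the window.
    too_few_tracks = num_tracked < 0.7 * min_tracked_threshold
    # 2) IMU pre-integration reports a large translation between the last two frames.
    large_translation = imu_translation > translation_threshold
    # 3) The average parallax between the last two frames is large.
    large_parallax = average_parallax > parallax_threshold
    return too_few_tracks or large_translation or large_parallax


# Hypothetical thresholds: 200 points, 0.2 m, 10 px.
slow_motion = is_keyframe(180, 200, 0.05, 0.2, 3.0, 10.0)
lost_tracks = is_keyframe(120, 200, 0.05, 0.2, 3.0, 10.0)
fast_motion = is_keyframe(180, 200, 0.35, 0.2, 3.0, 10.0)
high_parallax = is_keyframe(180, 200, 0.05, 0.2, 14.0, 10.0)
```

All three inputs are available before the current image is processed, so the decision adds almost no cost to tracking.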

Keyframes Tracking
If the current frame is treated as a keyframe, we first use the FAST feature detector [28] to add new feature points in the last frame in the sliding window and then use the KLT sparse optical flow algorithm to track them in the current frame (Figure 3). At least 200 feature points are maintained in each frame. Since there is no need to calculate feature point descriptors, the optical flow method saves time. In addition, we use RANSAC [29] with the fundamental matrix model to eliminate outliers generated during tracking.


Non-Keyframes Tracking
If the current frame is considered a non-keyframe, we use direct image alignment to estimate the relative pose $T_{t,t+1}$ between the current frame and the last frame in the sliding window. The initial value of the relative pose can be obtained directly by IMU pre-integration. The feature points observed in the last frame are projected into the current frame according to the estimated pose $T_{t,t+1}$. Under the hypothesis of photometric invariance, if the same feature point is observed by two adjacent frames, the photometric values of its projections on the two frames are equal (Figure 4). Therefore, we can optimize the relative pose $T_{t,t+1}$ by minimizing the photometric error between image blocks (4 × 4 pixels):
$$T_{t,t+1} = \arg\min_{T_{t,t+1}} \frac{1}{2} \sum_{u \in \mathcal{R}} \left\| \delta I(T_{t,t+1}, u) \right\|^2$$
The photometric error δI is:
$$\delta I(T_{t,t+1}, u) = I_{t+1}\left(\pi\left(T_{t,t+1} \cdot \pi^{-1}(u, d_p)\right)\right) - I_t(u) \quad \forall u \in \mathcal{R}$$
where $d_p$ is the depth of the feature point in the last frame in the sliding window, and $I_k$ represents the intensity image in the k-th frame.
We use the inverse compositional formulation [30] of the photometric error, which can avoid unnecessary Jacobian derivation. The update step T(ξ) for the last frame in the sliding window is:
$$\delta I(\xi, u) = I_{t+1}\left(\pi\left(T_{t,t+1} \cdot \pi^{-1}(u, d_p)\right)\right) - I_t\left(\pi\left(T(\xi) \cdot \pi^{-1}(u, d_p)\right)\right) \quad \forall u \in \mathcal{R}$$
We solve it with an iterative Gauss-Newton method and update $T_{t,t+1}$ in the following way:
$$T_{t,t+1} \leftarrow T_{t,t+1} \cdot T(\xi)^{-1}$$
After image alignment, we get the optimized relative pose $T_{t,t+1}$ between the current frame and the last frame in the sliding window. We define all 3D points observed in all frames in the sliding window as the local map, and project the local map to the current frame to find the visible 3D points of the current frame. Due to the inaccuracy of the visible 3D point positions and the camera pose, there will be errors in the projection positions in the current frame. To make the projection positions more accurate, the current frame needs to be aligned with the local map. The feature matching step optimizes the positions of all the projection points in the current frame by minimizing the photometric errors of the projection blocks (5 × 5 pixels) in the current frame and the reference frame (Figure 5). Solving this minimization with an iterative Gauss-Newton method gives the update of the projection block position. The reference frame is usually far away from the current frame, so we apply an affine warping $A_i$ to the reference patch.
Through image alignment and feature matching, we get the implicit result of direct motion estimation: feature correspondence with sub-pixel accuracy. Note that when tracking non-keyframes with the direct method, no new feature points are extracted. In the back-end optimization, we will combine the IMU residual, visual re-projection error, prior information, and re-localization information to optimize the camera pose and 3D point positions again.
Figure 5. Adjust the position of the projection block u′ on the current frame to minimize the photometric error of the projection block in the current frame and the reference frame in the sliding window.
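The inverse compositional Gauss-Newton update can be illustrated on a deliberately reduced problem. The sketch below replaces the SE(3) pose with a single 1D shift and uses analytic functions as stand-ins for the two images, so it is not the SD-VIS tracker; it only shows the two properties that matter: the Jacobian is evaluated once on the reference image, and the estimate is updated with the inverse of the computed step. All names and the synthetic signal are assumptions.

```python
import math


def reference_image(x):
    """Synthetic 1D intensity profile standing in for the last frame in the window."""
    return math.sin(0.3 * x) + 0.5 * math.cos(0.11 * x)


TRUE_SHIFT = 1.7  # ground-truth displacement used to generate the current frame


def current_image(x):
    """Current frame: the reference profile displaced by TRUE_SHIFT."""
    return reference_image(x - TRUE_SHIFT)


def align_inverse_compositional(ref, cur, pixels, initial_shift=0.0, iterations=20, eps=1e-3):
    """Estimate the shift p such that cur(x + p) matches ref(x) on the given pixels."""
    # Jacobian and Gauss-Newton Hessian come from the reference image only,
    # so they are computed once and reused in every iteration.
    jacobian = [(ref(x + eps) - ref(x - eps)) / (2.0 * eps) for x in pixels]
    hessian = sum(j * j for j in jacobian)
    shift = initial_shift
    for _ in range(iterations):
        residuals = [cur(x + shift) - ref(x) for x in pixels]
        step = sum(j * r for j, r in zip(jacobian, residuals)) / hessian
        shift -= step  # 1D counterpart of T <- T * T(xi)^-1
        if abs(step) < 1e-10:
            break
    return shift


# The initial value 0.0 plays the role of the IMU-predicted pose.
estimated_shift = align_inverse_compositional(reference_image, current_image, range(40))
```

The iteration only converges when the initial value lies close enough to the true displacement, which is why a small inter-frame distance and an IMU-predicted initial pose help the direct method.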


Sliding Window-based Tightly-coupled Optimization Framework
After tracking non-keyframes and keyframes, we proceed with a sliding window-based tightly-coupled optimization framework for highly accurate and robust state estimation. In the optimization framework, we combine the IMU residual, visual re-projection error, prior information, and re-localization information to optimize the camera pose and 3D point positions again.

Formulation
The state variables to be estimated by SD-VIS are defined over the sliding window, where $X_k$ includes the translation, velocity, and rotation quaternion of the k-th IMU body coordinate system with respect to the world coordinate system, as well as the biases of the gyroscope and accelerometer, and n represents the size of the sliding window. By minimizing the sum of IMU residuals, visual re-projection errors, prior information, and re-localization information in the sliding window, we can obtain a robust and accurate semi-direct visual-inertial SLAM system, where $r_B(Z^{B_k}_{B_{k+1}}, X)$, $r_C(Z^{C_j}_F, X)$, $\{r_p, H_p\}$, and $r_C(Z^{C_L}_F, X)$ are the IMU residuals, visual re-projection errors, prior information, and re-localization information, respectively.
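For orientation, one way such an objective is commonly written is sketched below. It follows the VINS-Mono-style formulation [18] that this back-end builds on; the covariance weights $\Sigma_B$, $\Sigma_C$ and the index sets are assumptions and may differ from the exact expression used in SD-VIS.

```latex
\min_{X} \Big\{
    \left\| r_p - H_p X \right\|^2
  + \sum_{k} \left\| r_B\!\left(Z^{B_k}_{B_{k+1}}, X\right) \right\|^2_{\Sigma_B}
  + \sum_{(F, j)} \left\| r_C\!\left(Z^{C_j}_{F}, X\right) \right\|^2_{\Sigma_C}
  + \sum_{(F, L)} \left\| r_C\!\left(Z^{C_L}_{F}, X\right) \right\|^2_{\Sigma_C}
\Big\}
```

The four terms correspond, in order, to the prior from marginalization, the IMU residuals between consecutive keyframes, the visual re-projection errors inside the window, and the re-projection errors against loop-closure frames.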

IMU Residuals
According to formulas (4)-(6) in Section 3.2, we can get the IMU measurement residual, where $[\cdot]_{xyz}$ extracts the vector (imaginary) part of the quaternion, and θ is the IMU pre-integration between two adjacent keyframes $B_i$ and $B_j$.

Visual Re-Projection Errors
When the feature point $F_1$ is first observed in the i-th image, the visual re-projection error in the j-th image can be defined by the pinhole camera model, where the two pixel coordinates u are the projections of the feature point $F_1$ onto the i-th and j-th frame images, respectively, and $\pi^{-1}$ is the back-projection function of the pinhole camera model.
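A minimal sketch of such a re-projection residual is given below. It is simplified on purpose: the relative pose is applied directly between the two camera frames, whereas a full visual-inertial system would chain the IMU body poses and the camera-IMU extrinsics. The function names, the sign convention (predicted minus observed), and the intrinsics are assumptions.

```python
def project(point, fu, fv, cu, cv):
    """Pinhole projection of a camera-frame point (x, y, z) to pixel coordinates."""
    x, y, z = point
    return (fu * x / z + cu, fv * y / z + cv)


def back_project(pixel, depth, fu, fv, cu, cv):
    """Back-projection of a pixel with known depth to a camera-frame point."""
    u, v = pixel
    return ((u - cu) / fu * depth, (v - cv) / fv * depth, depth)


def reprojection_error(pixel_i, depth_i, pixel_j, R_ji, t_ji, intrinsics):
    """Residual between the predicted and the observed pixel in frame j.

    R_ji and t_ji map a point from camera frame i to camera frame j.
    """
    fu, fv, cu, cv = intrinsics
    point_i = back_project(pixel_i, depth_i, fu, fv, cu, cv)
    point_j = [sum(R_ji[r][c] * point_i[c] for c in range(3)) + t_ji[r] for r in range(3)]
    predicted = project(point_j, fu, fv, cu, cv)
    return (predicted[0] - pixel_j[0], predicted[1] - pixel_j[1])


# Hypothetical intrinsics and measurements, for illustration only.
INTRINSICS = (450.0, 450.0, 320.0, 240.0)
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

residual = reprojection_error(
    pixel_i=(320.0, 240.0), depth_i=2.0,
    pixel_j=(343.0, 240.5),
    R_ji=IDENTITY, t_ji=[0.1, 0.0, 0.0],
    intrinsics=INTRINSICS,
)
```

In the back-end, residuals of this kind are stacked over all features and frames in the window and minimized jointly with the IMU terms.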

Marginalization Strategy
In order to limit the computational complexity of SD-VIS, the back-end adopts a sliding window-based tightly-coupled optimization framework, and we use the marginalization strategy [31] to keep the number of frames in the sliding window fixed. As shown in Figure 6, if the current frame is determined to be a keyframe, the frame remains in the sliding window, and the oldest frame is marginalized out. When the oldest frame is marginalized, the feature points that are observed only by the oldest frame are discarded directly, and the other visual and inertial measurements associated with that frame are removed from the sliding window by the Schur complement. The new prior information constructed by the Schur complement is added to the existing prior information. If the current frame is not a keyframe, the last frame in the sliding window is marginalized, and all visual measurements related to that frame are removed directly from the sliding window.
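The Schur complement step can be sketched on a tiny linear system. The example assumes the problem has already been linearized into normal equations H x = b and marginalizes a single scalar variable; a real back-end marginalizes a whole block of states, and the matrix values here are arbitrary illustrative numbers.

```python
def marginalize_first(H, b):
    """Marginalize variable 0 from H x = b and return the prior on the remaining ones.

    H_prior = H_rr - H_rm * H_mm^-1 * H_mr
    b_prior = b_r  - H_rm * H_mm^-1 * b_m
    """
    n = len(b)
    h_mm = H[0][0]
    H_prior = [[H[r][c] - H[r][0] * H[0][c] / h_mm for c in range(1, n)] for r in range(1, n)]
    b_prior = [b[r] - H[r][0] * b[0] / h_mm for r in range(1, n)]
    return H_prior, b_prior


def solve_2x2(A, rhs):
    """Solve a 2x2 linear system with Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [
        (rhs[0] * A[1][1] - A[0][1] * rhs[1]) / det,
        (A[0][0] * rhs[1] - rhs[0] * A[1][0]) / det,
    ]


# Arbitrary symmetric positive definite system with three scalar states.
H = [[4.0, 1.0, 0.5],
     [1.0, 3.0, 0.2],
     [0.5, 0.2, 2.0]]
b = [1.0, 2.0, 3.0]

H_prior, b_prior = marginalize_first(H, b)
x_remaining = solve_2x2(H_prior, b_prior)

# Recover the marginalized state only to check consistency with the full system.
x_marginalized = (b[0] - H[0][1] * x_remaining[0] - H[0][2] * x_remaining[1]) / H[0][0]
x_full = [x_marginalized] + x_remaining
```

The reduced system yields the same values for the remaining states as the full one, which is why the removed frame's measurements can be kept as prior information instead of being thrown away.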

Re-Localization
Because the global 3D position and yaw angle are unobservable, accumulated errors are inevitable in the visual-inertial SLAM system. To eliminate the accumulated error, we introduce the re-localization module. After a keyframe is tracked successfully, loop closure detection determines whether the SLAM system has visited the current place before. We utilize DBoW2, a state-of-the-art bag-of-words place recognition approach, for loop closure detection. When a loop is detected, the re-localization module aligns the current sliding window with the previously visited place, thus eliminating the accumulated error. For a detailed description of re-localization, readers may refer to [18].

Experiment
We evaluate the accuracy, robustness, and real-time performance of SD-VIS on the EuRoC dataset [32]. SD-VIS is compared with state-of-the-art visual-inertial SLAM methods, such as VINS-Mono [18] and VINS-Fusion [20]. In Section 6.1, the accuracy and robustness of SD-VIS are evaluated, and the experimental results show that the accuracy and robustness of the proposed method reach the same level as the state-of-the-art methods. Section 6.2 evaluates real-time performance. The experimental results show that the proposed method achieves a good balance between accuracy and real-time performance. Section 6.3 evaluates the loop closure detection capability and verifies the overall feasibility of the SLAM system.

Accuracy and Robustness Evaluation
In the experiments on the EuRoC dataset, we adopt the open-source tool EVO [33] to evaluate the performance of SD-VIS. By comparing the estimated trajectory with the ground truth, we calculate the absolute pose error (APE) as the evaluation metric [34]. Table 1 shows the root mean square error (RMSE) of the translation on the EuRoC dataset. For fairness, none of the compared algorithms uses its loop closure detection module. As can be seen from Table 1, SD-VIS, VINS-Mono, and VINS-Fusion are at the same level of accuracy. The accuracy of SD-VIS is slightly lower than that of VINS-Mono when moving at low and medium speed in environments with abundant feature points (such as MH_01_easy and MH_03_medium). This is because tracking non-keyframes with the direct method is susceptible to illumination changes. We have observed that some sequences exhibit strong exposure changes between images, and the tracking quality of the direct method degrades in these cases. In addition, to further improve real-time performance, we extract new feature points only in keyframes, so non-keyframes contain fewer feature points than keyframes, which also has some negative effect on accuracy. Although the accuracy is not significantly better than that of the traditional method, SD-VIS achieves the same level of accuracy as VINS-Mono while greatly improving real-time performance. Therefore, our algorithm is well suited to small unmanned platforms with limited computing resources. The accuracy of SD-VIS is higher than that of VINS-Mono and VINS-Fusion when moving fast in a low-texture environment (such as V2_03_difficult). This is due to the excellent performance of the direct method in low-texture environments.
In addition, the keyframe selection strategy tends to generate more keyframes during fast motion, which also improves the accuracy of the algorithm. Figure 7 shows the trajectory heat maps estimated by SD-VIS, VINS-Mono, and VINS-Fusion in MH_01_easy. Figure 8 shows the change of the translation absolute pose error over time in MH_01_easy, MH_04_medium, and V2_03_difficult. From Figures 7 and 8, we conclude that the accuracy and robustness of our algorithm reach the level of the state-of-the-art algorithms. Our algorithm performs better especially in the initialization procedure and in low-texture environments.
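
As a rough illustration of the metric reported in Table 1, the translation RMSE of the APE can be sketched as follows. This is a minimal sketch and not EVO's implementation: it assumes the estimated and ground-truth positions are already time-associated and stored as N×3 arrays, and the function names `align_rigid` and `ape_translation_rmse` are hypothetical. The optional rigid alignment uses the standard Kabsch least-squares solution, and the points in the usage part are synthetic, not EuRoC data.

```python
import numpy as np


def align_rigid(est, gt):
    """Return (R, t) minimizing sum ||R @ est_i + t - gt_i||^2 (Kabsch)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_g - R @ mu_e
    return R, t


def ape_translation_rmse(est, gt, align=True):
    """RMSE of the per-pose translation error between two N x 3 trajectories."""
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if align:
        R, t = align_rigid(est, gt)
        est = est @ R.T + t  # apply R and t to every row
    err = np.linalg.norm(est - gt, axis=1)  # Euclidean error per pose
    return float(np.sqrt(np.mean(err ** 2)))


# Synthetic example (assumed data): a trajectory shifted by a constant offset.
gt = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1], [2, 1, 3]], dtype=float)
offset = np.array([0.3, 0.4, 0.0])
rmse_raw = ape_translation_rmse(gt + offset, gt, align=False)
rmse_aligned = ape_translation_rmse(gt + offset, gt, align=True)
print(f"RMSE without alignment: {rmse_raw:.3f} m")
print(f"RMSE with alignment:    {rmse_aligned:.3e} m")
```

The alignment step matters because a constant offset between the estimate and the ground-truth frame would otherwise be counted as error; whether and how trajectories are aligned should be reported alongside the RMSE.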

Real-Time Performance Evaluation
In this section, we evaluate the real-time performance of SD-VIS by comparing the average time required to track an image (Table 2). As can be seen from Table 2, ORB-SLAM2 [11] uses the feature-based method to extract and match the ORB features of each frame, which takes a long time. VINS-Mono instead uses the optical flow method to track FAST features, which avoids the calculation of feature descriptors, so its time consumption is lower than that of ORB-SLAM2. VINS-Fusion is a stereo visual-inertial SLAM algorithm, and its image tracking also takes a long time. In SD-VIS, non-keyframes are used for fast tracking and localization by the direct method, while keyframes are tracked by the feature-based method and used for back-end optimization and loop closure detection. This strategy saves considerable time and reduces the average time SD-VIS needs to track an image. Due to the keyframe selection strategy, in low-speed motion scenes (such as MH_01_easy and V1_01_easy) fewer keyframes are generated, and the average time to track an image decreases. In fast-moving scenes (such as V1_02_medium and V2_03_difficult), the number of keyframes increases, and so does the average time required to track an image.
In summary, the good real-time performance is due first to the use of the KLT sparse optical flow algorithm when tracking keyframes, which eliminates the calculation of descriptors and feature matching. In addition, for non-keyframes, only the direct method is used to track existing feature points, and no new feature points are extracted. Because two adjacent non-keyframes are close to each other, the image alignment and feature matching of the direct method converge quickly.
Compared with the feature-based method, we use the direct method to track non-keyframes and accelerate the algorithm without reducing accuracy and robustness. As shown in Figure 9, in MH_02_easy, 26% of the frames are determined to be keyframes, while 74% are determined to be non-keyframes. Tracking keyframes accounts for 65% of the tracking time, while tracking non-keyframes accounts for only 35%. Combined with Section 6.1, we conclude that, compared with the state-of-the-art SLAM systems, SD-VIS achieves a better balance between speed and accuracy.
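
The shares reported for MH_02_easy also imply a relative per-frame cost, which the following short sketch works out. It is only arithmetic on the four percentages above; the function name `per_frame_cost_ratio` is hypothetical, and the result is an average over the sequence rather than a measured per-frame timing.

```python
def per_frame_cost_ratio(kf_frame_share, kf_time_share):
    """Average cost of tracking one keyframe relative to one non-keyframe.

    kf_frame_share: fraction of frames that are keyframes (0..1).
    kf_time_share:  fraction of total tracking time spent on keyframes (0..1).
    """
    # Time share divided by frame share gives the average cost per frame,
    # in units where the average cost over all frames is 1.
    kf_cost = kf_time_share / kf_frame_share
    nonkf_cost = (1.0 - kf_time_share) / (1.0 - kf_frame_share)
    return kf_cost / nonkf_cost


# Shares reported for MH_02_easy: 26% keyframes consuming 65% of the time.
ratio = per_frame_cost_ratio(kf_frame_share=0.26, kf_time_share=0.65)
print(f"A keyframe costs about {ratio:.1f}x as much as a non-keyframe")
```

With these shares, the average keyframe costs roughly five times as much to track as the average non-keyframe, which is consistent with the explanation that the savings come from processing most frames with the direct method.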

Figure 9. The left picture (a) shows a comparison of the number of keyframes and non-keyframes. The right picture (b) shows a comparison of the time consumption between tracking keyframes and non-keyframes.

Loop Closure Detection Evaluation
Finally, to verify the integrity and feasibility of the proposed algorithm, we evaluate the loop closure detection capability of SD-VIS. As can be seen from Figures 10 and 11, the accuracy of SD-VIS improves noticeably when loop closure detection is enabled. Unlike the direct method, SD-VIS supports loop closure detection and thereby addresses the problem of drift in long-term operation.

Conclusions
We present SD-VIS, a novel fast and accurate semi-direct visual-inertial SLAM framework that combines the accuracy of the feature-based method with the speed of the direct method. Compared with the state-of-the-art feature-based methods, we use the direct method to track non-keyframes and accelerate the algorithm without reducing accuracy and robustness. Unlike the direct method, SD-VIS supports loop closure detection and thereby addresses the problem of drift in long-term operation. SD-VIS achieves a better balance between accuracy and speed, so it is better suited to platforms with limited computing resources. In the future, we will extend the algorithm to support more types of multi-sensor fusion to increase its robustness in complex environments.