A Review of Visual-Inertial Simultaneous Localization and Mapping from Filtering-Based and Optimization-Based Perspectives

Visual-inertial simultaneous localization and mapping (VI-SLAM) is a popular research topic in robotics. Because of its robustness, VI-SLAM is widely applied to localization and mapping in mobile robotics, self-driving cars, unmanned aerial vehicles, and autonomous underwater vehicles. This study provides a comprehensive survey of VI-SLAM. Following a short introduction, this study is the first to review VI-SLAM techniques from filtering-based and optimization-based perspectives. It summarizes state-of-the-art studies from the last 10 years according to back-end approach, camera type, and sensor fusion type. Key VI-SLAM technologies, such as feature extraction and tracking, core theory, and loop closure, are also introduced. The performance of representative VI-SLAM methods and well-known VI-SLAM datasets are also surveyed. Finally, this study compares filtering-based and optimization-based methods through experiments, which helps clarify the differences in their operating principles. Optimization-based methods achieve excellent localization accuracy and lower memory utilization, while filtering-based methods have advantages in terms of computing resources. Furthermore, this study proposes future development trends and research directions for VI-SLAM. It provides a detailed survey of VI-SLAM techniques and can serve as a brief guide for newcomers to the field of SLAM and for experienced researchers looking for possible directions for future work.


Introduction
Simultaneous localization and mapping (SLAM) technology was first proposed by Smith [1,2] and applied in robotics with the goal of building a map of the surroundings in real time from sensor data in an unknown environment while the sensor simultaneously localizes itself. Over the years, new methods have appeared using different sensors such as sonar [3], lidar [4], and cameras [5]. These methods created new data representations and consequently new maps. Durrant-Whyte and Bailey [6,7] systematically reviewed SLAM technologies. Due to recent advances in CPU and GPU technologies, visual SLAM methods have attracted increased interest because of the rich visual information available from low-cost cameras compared to other sensors. Many excellent visual SLAM methods have advanced the development of SLAM technologies, such as MonoSLAM [5], PTAM [8], RatSLAM [9], DTAM [10], KinectFusion [11], and ORB-SLAM [12]. SLAM technology has undergone three major iterations over the last 30 years [13]. Today, SLAM research is thriving, and robust, real-time, high-precision SLAM technology is urgently needed in robotics.
Visual-inertial simultaneous localization and mapping (VI-SLAM), which fuses camera and IMU data for localization and environmental perception, has become increasingly popular for several reasons. First, the technology is used in robotics, especially in extensive research and applications involving the autonomous navigation of micro aerial vehicles (MAVs). Second, augmented reality (AR) and virtual reality (VR) are growing rapidly. Third, unmanned technology and artificial intelligence have expanded tremendously.
VI-SLAM is generally divided into two approaches: filtering-based and optimization-based. Maplab [14,15] and VINS-mono [16][17][18] are typical of these two approaches, respectively, and both are open source. Maplab is a filtering-based VI-SLAM system that also provides the research community with a collection of multi-session mapping tools, including map merging, loop closure, and visual-inertial optimization. VINS-mono is a real-time optimization-based VI-SLAM system that uses a sliding window to provide high-precision odometry. Furthermore, it features efficient IMU pre-integration with bias correction, automatic estimator initialization, online extrinsic calibration, failure detection, and loop detection.
Much research has been conducted on SLAM over the last few decades, including reviews and tutorials. The classic reviews [6,7], however, do not reflect more recent and emerging SLAM technology. Most other reviews [19][20][21][22][23] have focused solely on visual SLAM or visual odometry without addressing VI-SLAM technology. This study, therefore, provides an overview of VI-SLAM technology from filtering-based and optimization-based perspectives. Feature extraction and tracking, core theory, and loop closure, which are key technologies in VI-SLAM methods, are presented. This work also summarizes research from the previous 10 years and well-known VI-SLAM datasets, and compares filtering-based and optimization-based methods through experiments. Finally, potential development trends and forthcoming research directions are introduced.

Filtering-Based Methods
VI-SLAM approaches can be further categorized as either loosely or tightly coupled according to the sensor fusion type. State-of-the-art VI-SLAM studies from the last 10 years are listed in Table 1. This study divides VI-SLAM methods into filtering-based and optimization-based approaches, mainly according to their back-end optimization type. The loosely coupled method [24,25] usually fuses the IMU only to estimate the orientation and possibly the change in position, but not the full pose. In contrast, the tightly coupled method [26,27] fuses the states of the camera and IMU together into joint motion and observation equations and then performs state estimation. Tightly coupled methods presently constitute the main research focus, thanks to advances in computer technology.
Table 1. State-of-the-art visual-inertial simultaneous localization and mapping (VI-SLAM) methods.

Feature Extraction
Tracking, which relies on following pixels across camera images, is an important component of VI-SLAM systems. VI-SLAM tracking strategies are presented in Table 2. Feature detection aims to identify features and determine their position in an image. The features used in VI-SLAM are mainly Harris [78], FAST [79], ORB [80], SIFT [81], and SURF [82]. Descriptors are then used to describe the keypoint neighborhoods. Feature points in an image can be characterized in several ways: (1) the pixel corresponding to a local maximum of the first derivative, (2) the intersection point of two or more edges, (3) a point where the rate of change of both the gradient magnitude and the gradient direction is high, and (4) a corner point at which the first derivative is largest and the second derivative is zero.
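To make the corner criteria above concrete, the following is a minimal sketch of the segment test that underlies the FAST detector [79]. It assumes a grayscale image stored as a list of rows; the function name `is_fast_corner` and the default values for `threshold` and `n` are illustrative choices rather than settings taken from any of the surveyed systems, and the speed-up tests and non-maximum suppression used in practical implementations are omitted.

```python
# Offsets (dx, dy) of the 16-pixel Bresenham circle of radius 3 used by FAST.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def is_fast_corner(img, x, y, threshold=20, n=9):
    """Segment test: True if at least n contiguous circle pixels are all
    brighter, or all darker, than the centre pixel by more than threshold.
    The caller must keep (x, y) at least 3 pixels away from the image border."""
    centre = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    for sign in (1, -1):  # +1 checks "brighter", -1 checks "darker"
        flags = [sign * (value - centre) > threshold for value in ring]
        run = 0
        for flag in flags + flags:  # ring is doubled so runs may wrap around
            run = run + 1 if flag else 0
            if run >= n:
                return True
    return False
```

With this test, a pixel on a straight edge is rejected because only about half of the circle differs from the centre, whereas a pixel at the tip of a right-angled corner is accepted.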
Brito [83] evaluated different state-of-the-art methods for interest point matching, including SURF, SIFT, ORB, BRISK, and FREAK, with the aim of projective reconstruction of three-dimensional scenes. New feature types have also been incorporated into SLAM systems, such as planar features [84,85] and line or edge features [86][87][88]. Importantly, Yang [85] converted monocular sequences into a 3D plane map and proposed semantic monocular plane SLAM for low-texture environments.

Feature Tracking
There are four commonly used methods to track pixels in SLAM systems: descriptor matching [28], filter-based tracking [75], optical flow tracking [26], and direct pixel processing [77]. Descriptor matching relies on the same principle as feature description. Filter-based tracking includes the Kalman filter, particle filter, and mean-shift method. These methods model the target area in the current frame and predict its position by finding the area most similar to the model in the next frame. Optical flow is an effective means of estimating the state of motion, such as velocity, pose, and displacement, during navigation. Optical flow is the apparent movement of the image brightness pattern and expresses how the image changes.
Optical flow methods can be divided into three types according to how the flow is calculated, namely difference-based [89], correlation-based [90], and phase-based methods [91]. Among these, the block-matching algorithm is most commonly used in SLAM. However, it has shortcomings, such as a lack of sub-pixel accuracy and a reduced matching degree after image deformation. To mitigate these problems, an image pyramid is applied, which also increases computing speed [92].
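The block-matching idea and the role of the image pyramid can be illustrated with a small sketch. It assumes grayscale images stored as lists of rows, an integer-pixel search that minimizes the sum of squared differences (SSD), and a two-level pyramid built by plain 2x subsampling; practical pyramids low-pass filter before subsampling. All function names are hypothetical.

```python
def ssd(img_a, img_b, ax, ay, bx, by, size):
    """Sum of squared differences between two size x size blocks whose
    top-left corners are (ax, ay) in img_a and (bx, by) in img_b."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            diff = img_a[ay + dy][ax + dx] - img_b[by + dy][bx + dx]
            total += diff * diff
    return total


def block_match(prev, curr, x, y, size, radius, init=(0, 0)):
    """Integer displacement (u, v) of the block at (x, y) in prev that
    minimises the SSD in curr, searched within +-radius around init."""
    height, width = len(curr), len(curr[0])
    best = None
    for v in range(init[1] - radius, init[1] + radius + 1):
        for u in range(init[0] - radius, init[0] + radius + 1):
            nx, ny = x + u, y + v
            if nx < 0 or ny < 0 or nx + size > width or ny + size > height:
                continue  # candidate block would leave the image
            cost = ssd(prev, curr, x, y, nx, ny, size)
            if best is None or cost < best[0]:
                best = (cost, u, v)
    if best is None:
        raise ValueError("no candidate block lies inside the image")
    return best[1], best[2]


def subsample(img):
    """Next pyramid level: keep every second row and column (no smoothing)."""
    return [row[::2] for row in img[::2]]


def pyramid_match(prev, curr, x, y, size=4, coarse_radius=2, fine_radius=1):
    """Coarse-to-fine search: match at half resolution, then refine the
    doubled coarse displacement with a small search at full resolution."""
    cu, cv = block_match(subsample(prev), subsample(curr),
                         x // 2, y // 2, size // 2, coarse_radius)
    return block_match(prev, curr, x, y, size, fine_radius,
                       init=(2 * cu, 2 * cv))
```

With these defaults the coarse-to-fine search evaluates 25 small blocks at half resolution plus 9 blocks at full resolution and can recover displacements of up to about 5 pixels, whereas a single-level search with the same reach would evaluate well over 100 full-resolution blocks. The result is still limited to integer pixels, which reflects the sub-pixel limitation noted above.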

Dynamic and Observational Models
The filtering-based SLAM method uses linear or nonlinear dynamic and observation models. The nonlinear model is mainly used in filtering-based VI-SLAM, where the dynamic model propagates the previous state under the control vector $u_t$ and is perturbed by the process noise $w_t$, with $w_t \sim N(0, Q_t)$, where $Q_t$ is the process noise covariance. The IMU state is expressed as the 16-dimensional vector

$$x_I = \left[\, {}^{I}_{W}q^{T},\; {}^{W}p_{I}^{T},\; {}^{W}v_{I}^{T},\; b_g^{T},\; b_a^{T} \,\right]^{T},$$

where ${}^{I}_{W}q$ is the unit quaternion describing the rotation from the world frame to the IMU frame, ${}^{W}p_{I}$ and ${}^{W}v_{I}$ are the position and velocity of the IMU in the world frame, and $b_g$ and $b_a$ are the gyroscope bias and accelerometer bias, respectively. The classic filtering-based framework is shown in Table 3. Propagation and update steps are central to filtering-based methods: the nonlinear prediction and observation models are linearized about the current estimate, the covariance matrix is propagated in the prediction step, and the state and covariance are corrected in the update step. Work on filtering-based VIO focuses mainly on the covariance matrix, feature processing, and EKF updates.
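For reference, the propagation and update steps can be written in the standard EKF form below. This is a sketch under assumptions rather than the exact equations of any surveyed system: additive noise is assumed, and the symbols $f$, $h$, $z_t$, $v_t$, $R_t$, $F_t$, $H_t$, $K_t$, and $P$ are introduced here for illustration ($F_t$ and $H_t$ denote the Jacobians of $f$ and $h$ evaluated at the current estimate). Only $u_t$, $w_t$, and $Q_t$ appear in the text above.

```latex
\begin{aligned}
x_t &= f(x_{t-1}, u_t) + w_t, & w_t &\sim \mathcal{N}(0, Q_t) \\
z_t &= h(x_t) + v_t,          & v_t &\sim \mathcal{N}(0, R_t) \\[4pt]
\hat{x}_{t|t-1} &= f(\hat{x}_{t-1|t-1}, u_t) \\
P_{t|t-1} &= F_t P_{t-1|t-1} F_t^{T} + Q_t \\[4pt]
K_t &= P_{t|t-1} H_t^{T} \left( H_t P_{t|t-1} H_t^{T} + R_t \right)^{-1} \\
\hat{x}_{t|t} &= \hat{x}_{t|t-1} + K_t \left( z_t - h(\hat{x}_{t|t-1}) \right) \\
P_{t|t} &= \left( I - K_t H_t \right) P_{t|t-1}
\end{aligned}
```

The additive state correction holds for vector-valued quantities such as position, velocity, and biases. The quaternion part of the IMU state is normally corrected through a multiplicative error-state update instead.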

Filtering-Based VIO and VI-SLAM
MSCKF [28] is a classic VI-SLAM system. It is a visual-inertial navigation system based on the multi-state constraint EKF, and it employs a measurement model that expresses the geometric constraints arising when a static feature is observed from multiple camera poses. The algorithm extracts and matches SIFT features and maintains 30 camera poses in the filter state.
In addition, Li [27,36] showed through simulation tests that the standard method of computing Jacobian matrices in filters inevitably results in inconsistency and a loss of accuracy: the yaw errors of the MSCKF and FLS [93] lay outside the ±3σ bounds, which indicates inconsistency. They therefore proposed modifications to the MSCKF algorithm that ensure the correct observability properties without incurring additional computational cost. Clement [53] compared the MSCKF and the sliding window filter (SWF). The results showed the SWF to be more accurate and less sensitive to tuning parameters than the MSCKF. However, the MSCKF is computationally cheaper, has good consistency properties, and improves in accuracy as more features are tracked. In contrast to feature-based methods, Tanskanen [50] combined the advantages of EKF filtering and photometric error minimization to propose a direct VIO that runs using only a CPU. A growing number of studies have also begun to apply VI-SLAM technologies to small devices such as mobile phones and cleaning robots [41,46].
Bloesch [51] proposed a monocular VIO, ROVIO (https://github.com/ethz-asl/rovio), which directly uses the photometric error of image patches to obtain accurate, robust tracking. FAST corners are used to identify candidate feature regions. A multi-layer image pyramid is used to extract multi-level features, with edge features added. The workflow of the filter features is shown in Figure 1. For an image pyramid $I_l$ and a multi-level patch feature with coordinates $p$ and patches $P_l$, the photometric error of patch pixel $p_j$ at pyramid level $l$ is computed from the stored patch intensity and the image intensity at the warped pixel location, where $W$ is the affine warping matrix and $m$ is the mean intensity error.
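The photometric error described above can be written out as follows. This is a sketch that follows the form commonly given for ROVIO's multi-level patch features and should be read as an assumption, because the equation itself is not reproduced in this text. In particular, the scale factor $s_l$, which accounts for the down-sampling between pyramid levels, is a symbol not defined above.

```latex
e_{l,j}(p, P, I, W) = P_l(p_j) - I_l\!\left(p\, s_l + W p_j\right) - m,
\qquad s_l = 0.5^{\,l}
```

Here $P_l(p_j)$ is the stored intensity of patch pixel $p_j$ at level $l$, and $I_l(\cdot)$ is the intensity of the current image at the same level, sampled at the warped location.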
The average image processing time with 50 features at initialization is 29.72 ms, and the system can run smoothly at 20 Hz. Furthermore, a VIO based on an iterated extended Kalman filter was proposed [63]. S-MSCKF (https://github.com/KumarRobotics/msckf_vio) [26] can be considered a stereo version of MSCKF. The software takes synchronized stereo images and IMU messages and generates a real-time 6DOF camera pose estimate. It uses FAST corners [79] to increase speed and tracks features with KLT optical flow [94]. In addition, circular matching can be used to remove outliers generated during feature tracking and stereo matching. It is hard to compare these VI-SLAM methods using accuracy alone, because their application platforms and scenarios differ. Therefore, this study surveys representative filtering-based and optimization-based VI-SLAM methods in Appendix A.
Robust and accurate state estimation in robotics remains challenging. If a system can obtain accurate pose estimates based on a prior map, its adaptability will improve. Therefore, Schneider [15] proposed a VI-SLAM system called Maplab that integrates functions for creating, processing, and merging multiple maps. The system is extensible, which makes it suitable for research, and it provides evaluation methods for selecting system components. In addition, Maplab extracts BRISK [95] and FREAK [96] features from the image and fuses them with IMU data for localization and mapping. Separate sessions can be combined into a single global map to correct drift in odometry and localization. ROVIOLI [63] is the front-end of Maplab for localization and mapping; its system modules and data flow are presented in Figure 2. A matching window based on integrated gyroscope measurements has been shown to improve efficiency and robustness. New algorithms, such as multithreaded map building, semantic SLAM, and positioning, can easily be added to the current framework.
Methods combining the advantages of filtering-based and optimization-based approaches have also drawn wide attention. Quan [97] proposed a monocular VI-SLAM system that uses a Kalman filter as an assistant. To enable place recognition and reduce trajectory estimation drift, the authors constructed a factor-graph-based nonlinear optimization in the back-end. A feedback mechanism was used to guarantee the estimation accuracy of the front-end and back-end.
The continuous updating and maintenance of maps in large-scale environments is still a challenge. It is particularly essential for platforms that work in repetitive scenarios or use previous maps, such as inspection robots and driverless cars. To update the map according to dynamic changes and newly explored areas, Labbé [98] introduced a memory management mechanism into the SLAM system, which identifies the locations that should remain in fast-access memory for online processing.

Optimization-Based Methods
With the development of computer technology, optimization-based VI-SLAM has proliferated rapidly. Optimization-based methods divide the entire SLAM framework into a front-end and a back-end according to image processing; the front-end is responsible for map construction, whereas the back-end is responsible for pose optimization. Back-end optimization techniques are usually implemented with g2o [99], ceres-solver [100], and gtsam [101]. Many excellent datasets can be used to study visual-inertial methods, such as EuRoC [102], Canoe [103], Zurich urban MAV [104], TUM VI Benchmark [105], and PennCOSYVIO [106]. Details of the surveyed datasets are provided in Appendix B.

Loop Closure
Loop closure detects whether the robot has returned to a previously visited location, thus creating a loop in its trajectory. Loop closure also optimizes the map around the entire loop and increases system positioning accuracy.
Loop closure methods are mainly classified into odometry-based geometric relationship approaches and appearance-based approaches. The odometry-based geometric relationship approach does not work when the cumulative error is large [107]. The appearance-based approach determines the loop closure relationship according to the similarity of two images in order to eliminate the cumulative error, and it has been used successfully in VI-SLAM systems [18,31,60].
As shown in Figure 3, the camera data in a VI-SLAM system are processed and matched against the places stored in the map, and a place recognition decision is made after successful matching. The stored map is then updated. Loop closure is essentially a matter of scene recognition, which is difficult because appearances vary widely across places in the real world. To solve this problem, Gálvez-López [109] proposed DBoW2, which builds a binary bag-of-words model from FAST keypoints and BRIEF descriptors. Although this algorithm was more efficient and robust in terms of feature extraction than those using SIFT or SURF, the BRIEF descriptor lacks rotation and scale invariance, and it can only be used in 2D environments. To address this issue, Mur-Artal [12] used a bag-of-words place recognition model based on DBoW2 and ORB that included covisibility information.
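The appearance-based comparison can be sketched in a few lines. The example assumes binary descriptors stored as Python integers, a small flat vocabulary in place of the hierarchical vocabulary tree used by DBoW2, and no tf-idf weighting; the function names are hypothetical, and the score is the L1-based similarity commonly used to compare bag-of-words vectors.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors (ints)."""
    return bin(a ^ b).count("1")


def bow_vector(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word (Hamming distance)
    and return the L1-normalised histogram of word occurrences."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        word = min(range(len(vocabulary)),
                   key=lambda i: hamming(descriptor, vocabulary[i]))
        counts[word] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [count / total for count in counts]


def l1_score(v1, v2):
    """Similarity in [0, 1] between two L1-normalised bag-of-words vectors;
    1 means identical word distributions."""
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(v1, v2))
```

A loop closure candidate would be an earlier keyframe whose score against the current image is high. Practical systems additionally check temporal and geometric consistency before accepting the match.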
Loop closure methods based on deep learning continue to emerge [110][111][112]. Compared with the appearance-based method, they are more robust to environmental changes. However, designing a neural network architecture that runs in real time in a VI-SLAM system remains challenging. In the robotic area coverage problem, the goal is to explore and map a given target area within a reasonable amount of time, which requires trajectories with minimal redundant overlap for coverage efficiency. However, system estimates will inevitably drift over time in the absence of loop closures. Efficient area coverage and good SLAM navigation performance therefore represent competing objectives. In this case, an active SLAM algorithm is needed that accounts for both area coverage and navigation uncertainty in order to efficiently explore a target area of interest [113]. Thrun [114] found a balance between visiting new places (exploration) and reducing uncertainty by revisiting known areas (exploitation), providing a more efficient alternative to random exploration or pure exploitation.

Optimization-Based VI-SLAM Algorithms
OKVIS (https://github.com/ethz-asl/okvis) [43][44][45] is an excellent keyframe-based VI-SLAM system that combines the IMU and reprojection error terms into a single cost function to optimize the system. Old keyframes are marginalized to maintain a bounded-size optimization window, ensuring real-time operation. As a first step of initialization and matching, the last pose is propagated using the acquired IMU measurements to obtain a preliminary, uncertain estimate of the states. The optimization strategies of optimization-based VI-SLAM algorithms are surveyed in Table 4. To avoid the repeated integration of IMU constraints caused by the parameterization of relative motion, pre-integration was proposed to reduce computation. This method was first described by Lupton [35], in which the IMU measurements between two frames are pre-integrated into a single relative-motion constraint. The pre-integration principle is illustrated in Figure 4. The pre-integration theory was further developed by Forster [47], who applied it to the VI-SLAM framework to reduce bias.
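The benefit of pre-integration can be illustrated with a simplified planar sketch. It assumes motion in a plane (heading angle, 2D velocity, and 2D position), simple Euler integration, and no gravity, bias, or noise terms, all of which the full formulations handle. The function names are hypothetical. The point is that the pre-integrated increments depend only on the IMU samples, so they can be reused unchanged whenever the optimizer updates the state of the first keyframe.

```python
import math


def rotate(theta, vec):
    """Rotate a 2D vector by the angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * vec[0] - s * vec[1], s * vec[0] + c * vec[1])


def preintegrate(samples, dt):
    """Accumulate (d_theta, d_v, d_p, T) from (body_accel_xy, gyro_z)
    samples, expressed in the body frame of the first keyframe."""
    d_theta, d_v, d_p = 0.0, (0.0, 0.0), (0.0, 0.0)
    for accel, gyro in samples:
        a = rotate(d_theta, accel)  # acceleration in the first body frame
        d_p = (d_p[0] + d_v[0] * dt + 0.5 * a[0] * dt * dt,
               d_p[1] + d_v[1] * dt + 0.5 * a[1] * dt * dt)
        d_v = (d_v[0] + a[0] * dt, d_v[1] + a[1] * dt)
        d_theta += gyro * dt
    return d_theta, d_v, d_p, len(samples) * dt


def predict(state, delta):
    """Apply pre-integrated increments to a keyframe state (theta, v, p)."""
    theta, v, p = state
    d_theta, d_v, d_p, T = delta
    rv, rp = rotate(theta, d_v), rotate(theta, d_p)
    return (theta + d_theta,
            (v[0] + rv[0], v[1] + rv[1]),
            (p[0] + v[0] * T + rp[0], p[1] + v[1] * T + rp[1]))


def integrate_directly(state, samples, dt):
    """Reference: integrate every sample in the world frame from state."""
    theta, v, p = state
    for accel, gyro in samples:
        a = rotate(theta, accel)
        p = (p[0] + v[0] * dt + 0.5 * a[0] * dt * dt,
             p[1] + v[1] * dt + 0.5 * a[1] * dt * dt)
        v = (v[0] + a[0] * dt, v[1] + a[1] * dt)
        theta += gyro * dt
    return theta, v, p
```

Calling `preintegrate` once and then `predict` for each new estimate of the first keyframe gives the same result as `integrate_directly`, up to floating-point rounding, without revisiting the raw IMU samples.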
Systems that fuse IMU data into classic visual SLAM have also garnered widespread attention. Usenko [56] proposed a stereo direct VIO that combined the IMU with stereo LSD-SLAM [115]. They formulated a joint optimization problem to recover the full state, containing the camera pose, translational velocity, and IMU biases of all frames. Concha [55] devised the first direct tightly coupled VIO algorithm that could run in real time on a standard CPU, but initialization was not addressed.
VIORB [60] is a monocular tightly coupled VI-SLAM system based on ORB-SLAM that contains an ORB sparse front-end, a graph optimization back-end, loop closure, and relocalization. The method is first initialized using only monocular vision and, after a few seconds, performs a specific initialization of the scale, gravity direction, velocity, and accelerometer and gyroscope biases. VIORB proposed a novel IMU initialization method, which is divided into the following four steps: (1) gyroscope bias estimation, (2) scale and gravity approximation (assuming no accelerometer bias), (3) accelerometer bias estimation (with scale and gravity direction refinement), and (4) velocity estimation. After a new keyframe is inserted, the local map module uses local BA to optimize the latest N keyframes and all points observed in these N keyframes. Local maps are then retrieved based on the time series of the keyframes. The fixed window connects the (N + 1)th keyframe and the co-visibility graph. The keyframes in the local map are shown in Figure 5. In addition to monocular and IMU fusion methods, SLAM methods that fuse stereo and RGB-D cameras with an IMU have also been investigated [54,58].

Table 4 (fragment), VINS-mono: IMU residual integration with the visual-only SfM results to recover scale, gravity, velocity, and even bias; sliding window; two-way marginalization scheme.
VINS-mono (https://github.com/HKUST-Aerial-Robotics/VINS-Mono) is a standout VI-SLAM method whose front-end uses KLT optical flow [94] to track Harris corners, while the back-end uses a sliding window for nonlinear optimization. The entire system includes measurement processing, estimator initialization, local bundle adjustment without relocalization, loop closure, and global pose optimization. See Figure 6 for the system framework. A fisheye camera model is used in the front-end, and outliers are rejected by RANSAC with a fundamental matrix model. The calibration error between the camera and IMU is less than 0.02 m in translation, and the rotation error is less than 1° [76]. In addition, this method has been successfully applied to AR [18].

Details
Different VI-SLAM methods are designed for different applications, and it is hard to evaluate them comprehensively. To compare filtering-based and optimization-based methods in depth, this section reports experiments with representative methods on the EuRoC datasets under conditions that emulate state estimation for a flying robot. Because VIORB does not have open source code, this study uses an implementation from Jing Wang (https://github.com/jingpang/LearnVIORB).
Experiments are performed on an Intel Core i7-6700 × 8 @ 3.40 GHz computer with 16 GB RAM. The EuRoC datasets consist of 11 visual-inertial sequences recorded onboard a micro-aerial vehicle while it is manually piloted around three different indoor environments. Within each environment, the sequences increase qualitatively in difficulty with increasing sequence number. For example, MH_01 is "easy", while MH_05 is a more challenging sequence in the same environment, introducing factors such as faster motions and poor illumination.
To account for the nondeterministic nature of multithreading, we run each sequence five times and report the median result for accuracy. To compare these methods on equal terms, the mapping thread of VIORB is disabled and the camera frequency of all methods is set to 20 Hz.
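The median-over-runs protocol can be sketched as follows. This is a minimal illustration; the function name `median_rmse` and the sample values are hypothetical and are not results from this study.

```python
import statistics


def median_rmse(rmse_by_sequence):
    """Return the median RMSE over repeated runs for each sequence."""
    return {seq: statistics.median(runs) for seq, runs in rmse_by_sequence.items()}


# Made-up RMSE values in metres for five runs of one sequence.
example = {"MH_01_easy": [0.081, 0.079, 0.080, 0.250, 0.078]}
print(median_rmse(example))
```

The median is used rather than the mean so that a single run that diverges, such as the 0.250 m value in this made-up sample, does not dominate the reported accuracy.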

Experiments
Experiment results are shown in Tables 5-7. In Table 5, a CPU utilization of 100% corresponds to all eight logical cores being in use. This study uses the evaluation tool evo (https://github.com/MichaelGrupp/evo) to calculate the root mean square error of the estimated trajectories with respect to the ground truth. Notably, VIORB cannot obtain the full trajectory result on the V2_03_difficult dataset. In Table 7, memory utilization is represented as a percentage of the available RAM on the given platform.
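For readers unfamiliar with the accuracy metric, the following sketch shows how a root mean square translation error can be computed from two time-associated trajectories. It is not the evo implementation; the function names `align_rigid` and `ate_rmse` are hypothetical, and the optional rigid alignment (rotation and translation only, no scale) is a standard Kabsch-style least-squares fit assumed here for illustration.

```python
import numpy as np


def align_rigid(est, gt):
    """Least-squares rotation R and translation t such that gt ~ R @ est + t."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # guard against reflections
    t = mu_g - R @ mu_e
    return R, t


def ate_rmse(est, gt, align=True):
    """RMSE of position errors between matched (N, 3) trajectories, in metres."""
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if align:
        R, t = align_rigid(est, gt)
        est = est @ R.T + t
    errors = np.linalg.norm(est - gt, axis=1)  # per-pose translation error
    return float(np.sqrt(np.mean(errors ** 2)))


# Toy trajectories: the estimate is the ground truth shifted by a constant offset.
gt = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1]], dtype=float)
est = gt + np.array([0.3, 0.4, 0.0])
print(ate_rmse(est, gt, align=False), ate_rmse(est, gt, align=True))
```

Without alignment, the constant offset appears directly in the error; with alignment it is removed, which is why the alignment setting should be reported alongside any RMSE value.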

This section evaluates representative optimization-based and filtering-based methods, all of which were proposed in recent years. As shown in Table 5, the CPU utilization of ROVIO is the lowest among the five methods, and filtering-based methods use less CPU than optimization-based methods. ROVIO, VINS-mono, and VIORB use a monocular camera, while S-MSCKF and OKVIS use a stereo camera. The stereo VI-SLAM methods use more computing resources than the monocular VI-SLAM methods, whether filtering-based or optimization-based. As shown in Table 6, VINS-mono obtains the best accuracy, with an average root mean square error of 0.079 m. OKVIS and VIORB have advantages in terms of memory utilization (according to Table 7), which makes them better suited to memory-constrained platforms. In summary, optimization-based methods achieve excellent localization accuracy and lower memory utilization, while filtering-based methods have advantages in terms of computing resources. Finding the right balance between resource requirements and accuracy can be challenging.

SLAM with Deep Learning
At present, the semantic level of the image features used in SLAM is too low, which makes features hard to distinguish; the point cloud maps constructed by current methods do not distinguish between different objects. Deep learning is expected to advance SLAM technology, for example by building semantic maps that improve human-computer interaction. Rambach [117] proposed a deep learning approach to visual-inertial camera pose estimation through a trained long short-term memory (LSTM) model. Shamwell [118] presented an unsupervised deep neural network approach to the fusion of RGB-D imagery with inertial measurements for absolute trajectory estimation.
Although the study of semantic issues in SLAM is still in a nascent stage, combining semantics with SLAM will enable robots to obtain poses more effectively by building consistent maps using semantic concepts of categories, relationships, and environmental attributes. In addition, new map representations for SLAM systems, such as SkiMap [119] and Road-SLAM [120], can store and display information effectively. The continuous updating and maintenance of maps still presents an obstacle in the field.

Hardware Integration and Multi-Sensor Fusion
Lightweight, miniaturized SLAM systems can run well on small devices, such as embedded systems or cell phones. Excellent results have been achieved with Microsoft HoloLens, Intel RealSense, and Google Tango [121]. Customized hardware for VI-SLAM can support robot functions, and AR/VR devices are being applied to sports, navigation, teaching, and entertainment. Therefore, a strong demand exists for SLAM miniaturization and weight reduction, pointing toward a future of embedded SLAM [122].
A single sensor cannot adequately sense environmental information, and the resulting state estimation is highly uncertain. Multi-sensor fusion can mitigate these problems and improve the accuracy of system positioning and environment mapping. VI-SLAM technology is an example of multi-sensor fusion. Research and applications involving multi-sensor fusion in SLAM are expected to grow, as evidenced by [123,124].

Active SLAM on Robots
SLAM is conventionally a passive estimation problem in robotics: it is typically performed on preplanned or human-controlled trajectories. The purpose of controlling the robot's motion, by contrast, is to minimize the uncertainty of the robot's map representation and positioning. A fully autonomous robot must plan its motion given a high-level command, such as a task-level command from a human supervisor to explore a given area. In this example, the robot should plan accordingly to accomplish the given task and should not require detailed input from a human supervisor [113]. Active SLAM [125] has therefore gradually attracted attention. Active SLAM algorithms have demonstrated good results in enabling the robot to identify possible locations, evaluate each candidate vantage point, and select the most efficient action plan. SLAM technology should thus incorporate technologies such as path planning [126], mission planning [127], and object recognition [128]. References [129,130] contributed to active SLAM and combined it with such technologies to make robots more intelligent and practical. In addition, integrating the advantages of different branches of SLAM technology (such as filtering-based and optimization-based approaches, and loosely and tightly coupled methods) would greatly improve system robustness and accuracy.

Applications on Complex Dynamic Environments
SLAM algorithms generally assume a static environment. However, the actual working environment of a mobile robot often involves changes in the spatial positions of pedestrians and vehicles over time. These dynamic features can provide useful information about environmental changes. It is therefore important to identify static and dynamic features in the environment and to localize the robot and build the map effectively. Saarinen [131] contributed to enabling long-term operation of autonomous vehicles in dynamic industrial environments and proposed novel 3D normal distributions transform occupancy maps. Additionally, for effective practical application, seasonal weather changes in unstructured terrain require a more robust SLAM system that can handle complex dynamic environments. Multi-robot collaborative SLAM [132] offers high accuracy and efficiency, and it is emerging as a common research area.

Conclusions
VI-SLAM technology is a popular and complicated research topic in the fields of robotics and computer vision. This study provided an overview of VI-SLAM technology and summarized methods from the last 10 years. State-of-the-art VI-SLAM methods were introduced from filtering-based and optimization-based perspectives, and their respective frameworks, key technologies, and advantages were presented. In addition, central technologies in VI-SLAM were systematically reviewed, including feature extraction and tracking, pre-integration, and loop closure. This study surveyed the performance of representative VI-SLAM methods and well-known VI-SLAM datasets. Comparisons were made between filtering-based and optimization-based methods through experiments, which indicate that filtering-based methods have advantages in terms of computing resources, while optimization-based methods achieve excellent localization accuracy and lower memory utilization. This study also outlined upcoming development trends and research directions for SLAM that have the potential to make the technology more capable and practical.

Table 3.
Classic filtering-based method framework.
Propagation: For each IMU measurement received, propagate the filter state and covariance.
Image registration: Every time a new image is recorded, augment the state and covariance matrix with a copy of the current camera pose estimate; the image processing module begins operation.
Update: When the feature measurements of a given image become available, perform an EKF update.
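The three stages in Table 3 can be sketched with a toy linear Kalman filter. This is a simplified illustration, not an MSCKF implementation: real systems use nonlinear IMU and camera models, and the function names, the 1-D position/velocity state, and the direct measurement of the cloned pose are all assumptions made for this example.

```python
import numpy as np


def propagate(x, P, F, Q):
    """Propagation: advance the state mean and covariance by one IMU step."""
    return F @ x, F @ P @ F.T + Q


def augment(x, P, J):
    """Image registration: append a copy of the current pose (selected by J)."""
    x_aug = np.concatenate([x, J @ x])
    A = np.vstack([np.eye(len(x)), J])  # maps old state to [old state; clone]
    return x_aug, A @ P @ A.T


def update(x, P, z, H, R):
    """Update: standard Kalman correction with measurement z = H x + noise."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x_new = x + K @ (z - H @ x)
    I_KH = np.eye(len(x)) - K @ H
    P_new = I_KH @ P @ I_KH.T + K @ R @ K.T  # Joseph form keeps P symmetric
    return x_new, P_new


# Toy 1-D example: state is [position, velocity].
dt = 0.1
x, P = np.array([0.0, 1.0]), np.eye(2)
F = np.array([[1.0, dt], [0.0, 1.0]])
Q = 0.01 * np.eye(2)

x, P = propagate(x, P, F, Q)                    # IMU measurement received
x, P = augment(x, P, np.array([[1.0, 0.0]]))    # new image: clone the position
H = np.array([[0.0, 0.0, 1.0]])                 # measurement constrains the clone
x, P = update(x, P, np.array([0.3]), H, np.array([[0.01]]))
print(x)
```

Because the clone is perfectly correlated with the current position at the moment of augmentation, a measurement of the clone also corrects the current position, which is the mechanism that lets delayed image measurements refine the live state.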

Table 4.
Optimization strategies of optimization-based VI-SLAM algorithms.