A Comprehensive Survey of Visual SLAM Algorithms

Abstract: Simultaneous localization and mapping (SLAM) techniques are widely researched, since they allow the simultaneous creation of a map and the estimation of the sensors' pose in an unknown environment. Visual-based SLAM techniques play a significant role in this field, as they rely on a low-cost and small sensor system, which gives them an advantage over other sensor-based SLAM techniques. The literature presents different approaches and methods to implement visual-based SLAM systems. Among this variety of publications, a beginner in this domain may find it difficult to identify and analyze the main algorithms and to select the most appropriate one according to his or her project constraints. Therefore, we present the three main visual-based SLAM approaches (visual-only, visual-inertial, and RGB-D SLAM), providing a review of the main algorithms of each approach through diagrams and flowcharts, and highlighting the main advantages and disadvantages of each technique. Furthermore, we propose six criteria that ease the SLAM algorithm's analysis and consider both the software and hardware levels. In addition, we present some major issues and future directions in the visual-SLAM field, and provide a general overview of some of the existing benchmark datasets. This work aims to be the first step for those initiating a SLAM project to have a good perspective of SLAM techniques' main elements and characteristics.


Introduction
Simultaneous localization and mapping (SLAM) technology, first proposed by Smith in 1986 [1], is used in an extensive range of applications, especially in the domains of augmented reality (AR) [2][3][4] and robotics [5][6][7]. The SLAM process aims at mapping an unknown environment and simultaneously locating a sensor system in this environment through the signals provided by the sensor(s). In robotics, the construction of a map is a crucial task, since it allows the visualization of landmarks and, through them, of the environment. In addition, it can help in the state estimation of the robot, relocating it, and decreasing estimation errors when re-visiting registered areas [8].
The map construction comes with two other tasks: localization and path planning. According to Stachniss [9], the mapping problem may be described by examining three questions from the robot's perspective: What does the world look like? Where am I? How can I reach a given location? The first question is answered by the mapping task, which seeks to construct a map, i.e., a model of the environment. To do so, it requires the location of the observed landmarks, i.e., the answer to the second question, provided by the localization task. The localization task seeks to determine the robot's pose, i.e., its orientation and position, and, consequently, locates the robot on the map. Depending on the first two tasks, path planning answers the last question, and seeks to estimate a trajectory for the robot to reach a given location. It relies on the current robot's pose, provided by the localization task, and on the environment's characteristics, provided by the mapping task. SLAM is a solution that integrates both the mapping and localization tasks.
to the presented approaches. Section 4 presents some of the recent major issues faced by the visual-SLAM community and points out future directions to deal with these problems. Section 5 provides a general overview of some of the most significant publicly available benchmark datasets. Finally, our conclusions are presented in Section 6.

Visual-Based SLAM Concepts
This section presents concepts related to visual-based SLAM and odometry algorithms, and the main characteristics of the visual-based approaches covered in this paper. The visual-based SLAM techniques use one or more cameras in the sensor system, receiving 2D images as the source of information. In general, the visual-based SLAM algorithms are divided into three main threads: initialization, tracking, and mapping [10]. Figure 1 shows a general view of the three main parts generally present in visual-based SLAM approaches. The depth and inertial data may be added to the 2D visual input to generate a sparse map (generated with the ORB-SLAM3 algorithm [22] in the MH_01 sequence [23]), a semi-dense map (obtained with LSD-SLAM [24] in the dataset provided by the authors), and a dense reconstruction (Reprinted from [25]).
As one can see in Figure 1, in visual-SLAM systems, the input can be a 2D image, both a 2D image and IMU data, or a 2D image and depth data, depending on the approach used, i.e., visual-only (Section 2.1), visual-inertial (Section 2.2), or RGB-D-based (Section 2.3), respectively. The initialization determines the global coordinates and builds an initial map, used to perform the two main steps: tracking and mapping. The tracking process is responsible for the continuous estimation of the sensor's pose. In general, the algorithm establishes 2D-3D correspondences between the current frame and the map, which constitutes the perspective-n-point (PnP) problem. There are several ways to solve this problem, EPnP being one of the most representative solutions [26]. The mapping process is in charge of computing and expanding the 3D structure as the camera moves. The depth data computation differs according to the employed algorithm (Section 3 addresses each algorithm individually, providing detailed explanations). Finally, the mapping process results in a sparse, semi-dense, or dense 3D reconstruction, according to the implemented technique.
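To make the 2D-3D correspondence idea concrete, the following minimal sketch evaluates the reprojection error that a PnP solver drives toward zero. It is not EPnP itself: the pinhole intrinsics `K`, the synthetic map points, and the function names `project` and `reprojection_rmse` are illustrative assumptions, and the pose is simply evaluated rather than solved for.

```python
import numpy as np

def project(points_w, R, t, K):
    """Project Nx3 world points into pixel coordinates with a pinhole model."""
    points_c = points_w @ R.T + t          # world frame -> camera frame
    uv_h = points_c @ K.T                  # apply intrinsics (homogeneous pixels)
    return uv_h[:, :2] / uv_h[:, 2:3]      # perspective division

def reprojection_rmse(points_w, pixels, R, t, K):
    """Root-mean-square pixel distance between observations and projections."""
    residuals = project(points_w, R, t, K) - pixels
    return float(np.sqrt(np.mean(np.sum(residuals ** 2, axis=1))))

# Hypothetical intrinsics and a small synthetic map (illustrative values only).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
map_points = np.array([[0.0, 0.0, 4.0], [1.0, 0.5, 5.0],
                       [-1.0, 0.2, 6.0], [0.5, -0.8, 3.0]])
R_true = np.eye(3)
t_true = np.array([0.1, -0.05, 0.2])
observations = project(map_points, R_true, t_true, K)

# The true pose explains the observations; a 5 cm offset does not.
err_true = reprojection_rmse(map_points, observations, R_true, t_true, K)
err_off = reprojection_rmse(map_points, observations, R_true,
                            t_true + [0.05, 0.0, 0.0], K)
```

A tracking front-end would search for the pose that minimizes this error over the matched points, typically starting from a closed-form PnP solution and refining it iteratively.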
Although we mainly refer to the concepts as belonging to the SLAM methodology, we consider, in this paper, both visual-SLAM and visual-odometry (VO) techniques, since they are closely related. The VO algorithms also seek to estimate a robot's position using cameras as a source of information. The main difference between visual-SLAM and VO lies in considering, or not, the global consistency of the estimated trajectory and map [14]. While VO performs only local optimizations, visual-SLAM algorithms also employ loop closure detection (see Section 3), which makes them capable of correcting the drift accumulated over the robot's trajectory.

Visual-Only SLAM
The visual-only SLAM systems are based on 2D image processing. After acquiring images from more than one point of view, the system performs the initialization process to define a global coordinate system and reconstruct an initial map. In the feature-based algorithms relying on filters (filtering-based algorithms), the first step consists of initializing the map points with high uncertainty, which may converge later to their actual positions. This procedure is followed by tracking, which attempts to estimate the camera pose. Simultaneously, the mapping process includes new points in the 3D reconstruction as more unknown scenes are observed.
The visual-only SLAM system may use a monocular or stereo camera. Monocular camera-based SLAM is a well-explored domain given the small size of the sensor (the smallest of all the presented approaches), its low price, easy calibration, and reduced power consumption [27]. Despite these advantages, monocular-based systems present a more complex initialization, since at least two different views are necessary to determine an initial depth and pose estimate, as well as problems concerning drift and scale estimation. This last problem may be compensated for by stereo cameras, whose main advantage is that they provide the stereo view in a single frame. However, a stereo sensor is larger than a simple monocular camera. In addition, it requires more processing for each frame, mainly due to the need for an image rectification process in the stereo matching stage.
The visual-only SLAM category can be divided into two main methods: feature-based and direct.

Feature-Based Methods
SLAM algorithms based on features consider a certain number of points of interest, called keypoints. They can be detected in several images and matched by comparing their descriptors; this process provides the information needed to estimate the camera pose. The descriptor data and keypoint location compose the feature, i.e., the data used by the algorithm to perform tracking and mapping. As the feature-based methods do not use all the frame information, they are suitable for embedded implementations. However, feature extraction may fail in a textureless environment [28], and it generates a sparse map, which provides less information than a dense one.
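As an illustration of matching keypoints by comparing their descriptors, the sketch below matches binary descriptors by Hamming distance with a mutual nearest-neighbour check. The descriptors are random stand-ins rather than the output of a real detector, and the 256-bit length, the `max_dist` threshold, and the function names are assumptions made for the example.

```python
import numpy as np

def hamming_matrix(desc_a, desc_b):
    """Pairwise Hamming distances between two sets of packed binary descriptors."""
    xor = desc_a[:, None, :] ^ desc_b[None, :, :]     # bitwise difference per pair
    return np.unpackbits(xor, axis=2).sum(axis=2)     # count the differing bits

def match_cross_check(desc_a, desc_b, max_dist=40):
    """Return (i, j) pairs that are mutual nearest neighbours within max_dist."""
    dist = hamming_matrix(desc_a, desc_b)
    best_b = dist.argmin(axis=1)   # best match in B for each descriptor in A
    best_a = dist.argmin(axis=0)   # best match in A for each descriptor in B
    return [(i, int(j)) for i, j in enumerate(best_b)
            if best_a[j] == i and dist[i, j] <= max_dist]

# Synthetic data: frame B sees the same five keypoints in a different order,
# with one descriptor bit flipped to mimic appearance change.
rng = np.random.default_rng(0)
desc_a = rng.integers(0, 256, size=(5, 32), dtype=np.uint8)   # 5 descriptors of 256 bits
perm = np.array([2, 0, 4, 1, 3])
desc_b = desc_a[perm].copy()
desc_b[:, 0] ^= 1
matches = match_cross_check(desc_a, desc_b)
```

The resulting index pairs are the correspondences that the pose estimation step consumes; the cross-check is one common way to discard ambiguous matches.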

Direct Methods
In contrast with the feature-based methods, the direct methods use the sensor data without pre-processing, considering pixels' intensities and minimizing the photometric error. There are many different algorithms based on this methodology, and depending on the chosen technique, the reconstruction may be dense, semi-dense, or sparse. The reconstruction density is a substantial constraint on the algorithm's real-time operation, since the joint optimization of both structure and camera positions is more computationally expensive for dense and semi-dense reconstructions than for a sparse one [29]. Figure 2 shows the main difference between feature-based (indirect) and direct methods according to their front-end and back-end, that is, the part of the algorithm responsible for the abstraction of the sensor's data and the part responsible for the interpretation of the abstracted data, respectively.

Figure 2. General differences between feature-based and direct methods. Top: main steps followed by the feature-based methods, resulting in a sparse reconstruction (map generated with the ORB-SLAM3 algorithm [22] in the MH_01/EuRoC sequence [23]). Bottom: main steps followed by a direct method, which may result in a sparse (generated from the reconstruction of sequence_02/TUM MonoVO [30] with the DSO algorithm [31]) or dense reconstruction (Reprinted from [25]), according to the chosen technique.
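The photometric error minimized by direct methods can be illustrated with a deliberately simplified sketch. Here the camera motion is reduced to an integer 2D image shift and the minimization to an exhaustive search; real direct methods warp pixels through a 3D pose and depth and use iterative optimization. The function names and the synthetic image are assumptions.

```python
import numpy as np

def photometric_error(ref, cur, dx, dy):
    """Mean squared intensity difference after shifting `cur` by (dx, dy) pixels."""
    warped = np.roll(cur, shift=(dy, dx), axis=(0, 1))   # toy warp: pure image shift
    return float(np.mean((ref - warped) ** 2))

def align_by_search(ref, cur, radius=3):
    """Exhaustively search integer shifts and keep the one with minimal error."""
    candidates = [(dx, dy) for dx in range(-radius, radius + 1)
                  for dy in range(-radius, radius + 1)]
    return min(candidates, key=lambda s: photometric_error(ref, cur, s[0], s[1]))

rng = np.random.default_rng(1)
ref = rng.random((32, 32))                          # synthetic textured reference image
cur = np.roll(ref, shift=(-1, 2), axis=(0, 1))      # same scene moved by dy=-1, dx=+2
best_shift = align_by_search(ref, cur)              # shift that maps cur back onto ref
```

No keypoints or descriptors are involved: every pixel intensity contributes to the error, which is why these methods can still operate where distinctive features are scarce.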

Visual-Inertial SLAM
The VI-SLAM approach incorporates inertial measurements to estimate the structure and the sensor pose. The inertial data are provided by an inertial measurement unit (IMU), which consists of a combination of gyroscope, accelerometer, and, optionally, magnetometer devices. This way, the IMU is capable of providing information on the angular rate (gyroscope) and acceleration (accelerometer) along the x-, y-, and z-axes, and, optionally, the magnetic field around the device (magnetometer). While adding an IMU may increase the richness of the information about the environment and provide higher accuracy, it also increases the algorithm's complexity, especially during the initialization step, since, besides the initial estimation of the camera pose, the algorithm also has to estimate the IMU poses. VI-SLAM algorithms can be divided according to the type of fusion between the camera and IMU data, which can be loosely or tightly coupled. The loosely coupled methods do not merge the IMU states to estimate the full pose: instead, the IMU data are used to estimate the orientation and changes in the sensor's position [18]. On the other hand, the tightly coupled methods are based on the fusion of camera and IMU data into a motion equation, resulting in a state estimation that considers both data sources.
In addition, VI-SLAM algorithms present different implementations according to their back-end approach, which can be filtering-based or optimization-based. The front-end of filtering-based approaches for VI-SLAM relies on feature extraction, while optimization-based methods (also known as keyframe-based approaches) rely on global optimizations, which increase the system's accuracy, as well as the algorithm's computational cost.
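To show how gyroscope and accelerometer samples translate into motion estimates, the sketch below dead-reckons a planar pose from IMU data. It is a simplification under stated assumptions: motion is restricted to a plane, gravity and sensor biases are ignored, and the function name `integrate_imu` is hypothetical.

```python
import numpy as np

def integrate_imu(gyro_z, accel_body, dt, yaw0=0.0, vel0=(0.0, 0.0), pos0=(0.0, 0.0)):
    """Dead-reckon a planar pose from yaw-rate and body-frame acceleration samples."""
    yaw = yaw0
    vel = np.array(vel0, dtype=float)
    pos = np.array(pos0, dtype=float)
    for w, a_b in zip(gyro_z, accel_body):
        c, s = np.cos(yaw), np.sin(yaw)
        # Rotate the measured acceleration from the body frame to the world frame.
        a_w = np.array([c * a_b[0] - s * a_b[1], s * a_b[0] + c * a_b[1]])
        pos = pos + vel * dt + 0.5 * a_w * dt ** 2
        vel = vel + a_w * dt
        yaw = yaw + w * dt
    return yaw, vel, pos

# One second of constant 1 m/s^2 forward acceleration without rotation.
n, dt = 100, 0.01
gyro = np.zeros(n)
accel = np.tile([1.0, 0.0], (n, 1))
yaw, vel, pos = integrate_imu(gyro, accel, dt)
```

Because the position is a double integral of acceleration, a constant accelerometer bias b produces a position error of 0.5·b·t², which is one reason the inertial estimate is fused with camera measurements rather than used alone.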

RGB-D SLAM
SLAM systems based on RGB-D data started to attract more attention with the advent of Microsoft's Kinect in 2010. RGB-D sensors consist of a monocular RGB camera and a depth sensor, allowing SLAM systems to directly acquire depth information with reasonable accuracy, in real time, using low-cost hardware. As the RGB-D devices directly provide the depth map to the SLAM systems, the general framework of SLAM based on this approach differs from the other ones already presented.
Most of the RGB-D-based systems make use of the iterative closest point (ICP) algorithm to locate the sensor, fusing the depth maps to obtain the reconstruction of the whole structure. RGB-D systems present advantages such as providing color image data and a dense depth map without any pre-processing step, hence decreasing the complexity of the SLAM initialization [10]. Despite this, the approach is most suitable for indoor environments, and requires large memory and power consumption [32].
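The ICP algorithm mentioned above can be sketched in its basic point-to-point form: match each point to its nearest neighbour, solve for the best rigid transform, and repeat. This is a minimal illustration with brute-force matching on a small synthetic cloud, not the variant used by any particular RGB-D system; all function names are assumptions.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares R, t such that R @ src_i + t ~= dst_i (Kabsch / SVD method)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    return R, dst_c - R @ src_c

def nearest_neighbours(src, dst):
    """Closest point in dst for every point in src, plus the mean squared distance."""
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
    return dst[d2.argmin(axis=1)], float(d2.min(axis=1).mean())

def icp(src, dst, iterations=20):
    """Point-to-point ICP: alternate nearest-neighbour matching and rigid alignment."""
    R_total, t_total = np.eye(3), np.zeros(3)
    current = src.copy()
    for _ in range(iterations):
        matched, _ = nearest_neighbours(current, dst)
        R, t = best_rigid_transform(current, matched)
        current = current @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t   # accumulate the motion
    return R_total, t_total, current

# Synthetic "depth scans": the second cloud is the first one moved by a small motion.
rng = np.random.default_rng(2)
model = rng.uniform(-1.0, 1.0, size=(60, 3))
angle = np.deg2rad(5.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.05, -0.02, 0.03])
scan = model @ R_true.T + t_true

_, err_before = nearest_neighbours(model, scan)
R_est, t_est, aligned = icp(model, scan)
_, err_after = nearest_neighbours(aligned, scan)
```

Each iteration can only keep or reduce the mean squared nearest-neighbour distance, so ICP converges, but only to a local minimum; this is why it is normally seeded with a good initial guess such as the previous frame's pose.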

Visual-SLAM Algorithms
Each approach presented in Section 2 includes several algorithms, making it difficult to select the most suitable SLAM or odometry algorithm according to one's project constraints. Therefore, we present the most representative algorithms of each approach, selected based on literature feedback, with a brief review of each one and a systematic analysis based on six selected criteria that, in general, act as limiting factors of SLAM projects. Besides the proposed criteria, it is also necessary to characterize the scene and application, since some scenarios may present specific attributes that imply specific evaluation criteria, as in the analysis presented in [33]. The authors consider the characteristics of the autonomous driving application, which imply a set of specific criteria, such as the required accuracy, scalability, and dynamicity. Thus, considering the general approach of the SLAM systems, we established six criteria that influence system dimensioning, accuracy, and hardware implementation. They are: algorithm type, map density, global optimization, loop closure, availability, and embedded implementations:
• Algorithm type: this criterion indicates the methodology adopted by the algorithm. We divide the visual-only algorithms into feature-based, hybrid, and direct methods. The visual-inertial algorithms are either filtering-based or optimization-based methods. Lastly, the RGB-D approaches can be divided according to their tracking method, which can be direct, hybrid, or feature-based.
• Map density: in general, a dense reconstruction requires more computational resources than a sparse one, having an impact on memory usage and computational cost. On the other hand, it provides a more detailed and accurate reconstruction, which may be a key factor in a SLAM project.
• Embedded implementations: embedded SLAM implementation is an emerging field used in several applications, especially in the robotics and automotive domains. This criterion depends on each algorithm's hardware constraints and specificity, since there must be a trade-off in the algorithm architecture in terms of energy consumption, memory, and processing usage. We assembled the main publications we found presenting fully embedded SLAM systems on platforms such as microcontrollers and FPGA boards.
In the following, we present the selected SLAM algorithms considered the most representative of each of the three presented approaches according to their publication years.

Visual-Only SLAM
The selected visual-only SLAM algorithms are presented in Figure 3 and explained in the following subsections.

MonoSLAM (2007)
The first monocular SLAM algorithm is MonoSLAM, which was proposed by Davison et al. [27] in 2007. The first step of the algorithm consists of the system's initialization. Then, it updates the state vector considering a constant velocity motion model, where the camera motion and environment structure are estimated in real time using an extended Kalman filter (EKF). The algorithm is represented in Figure 4. MonoSLAM operates in real time and was made available by the authors. Moreover, since MonoSLAM is based on the EKF, an already well-covered topic, several embedded implementations based on this algorithm are found in the literature. In [34,35], Vincke et al. based their implementation on the MonoSLAM algorithm, combining multiple sensors and a multi-processor architecture to evaluate its implementation. In [34], the authors used an ARM + DSP + GPU architecture (OMAP3530 architecture) to implement the localization, reconstruction, and feature detection. They combined this architecture with an ATMega168 co-processor used for data pre-processing and robot control. In [35], they based the architecture on a combination of multiple CPUs and GPUs provided by an OMAP4430 architecture. The authors implemented the different tasks of the algorithm on both single-core and dual-core ARM architectures, and compared their performance. In addition, they parallelized the matching and initialization tasks using the ARM and NEON processors provided by the OMAP4430.
MonoSLAM requires a known target for the initialization step, which is not always accessible. In addition, the algorithm's complexity increases proportionally with the size of the environment. This algorithm employs neither global optimization techniques nor loop closure detection. Finally, it only reconstructs a map of landmarks, which may be a drawback for applications that require a more accurate reconstruction.
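The predict and update cycle behind a filter with a constant velocity motion model can be sketched in one dimension. Because the model and the measurement are linear here, the EKF reduces to a standard Kalman filter; the state holds only a position and a velocity, whereas MonoSLAM's state also contains the camera orientation and the landmarks. The noise levels and variable names are assumptions.

```python
import numpy as np

def predict(x, P, F, Q):
    """Propagate the state and its covariance with the motion model."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, H, R):
    """Correct the prediction with a measurement z."""
    y = z - H @ x                       # innovation
    S = H @ P @ H.T + R                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity model on [position, velocity]
H = np.array([[1.0, 0.0]])              # only the position is observed
Q = 1e-4 * np.eye(2)                    # assumed process noise
R = np.array([[1e-2]])                  # assumed measurement noise

x = np.array([0.0, 0.0])                # initial guess: at rest at the origin
P = 10.0 * np.eye(2)                    # high initial uncertainty
for k in range(1, 51):
    x, P = predict(x, P, F, Q)
    z = np.array([1.0 * k * dt])        # noise-free position of a target moving at 1 m/s
    x, P = update(x, P, z, H, R)
```

Although the velocity is never measured, the filter recovers it from successive position measurements, and the covariance shrinks from its high initial value, mirroring how map points initialized with high uncertainty converge over time.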

Parallel Tracking and Mapping (2007)
Another pioneering algorithm is the Parallel Tracking and Mapping (PTAM) [36] algorithm. PTAM was the first algorithm to separate tracking and mapping into two different threads and to apply the concept of keyframes to the mapping thread. First, the mapping thread performs the map initialization. New keyframes are added to the system as the camera moves and the initial map is expanded. Triangulation between two consecutive keyframes provides the depth information of the new points. The tracking thread computes the camera poses, and for each new frame, it estimates an initial pose used to project the map points onto the image. PTAM uses the correspondences to compute the camera pose by minimizing the reprojection error. Figure 5 represents the steps performed by the PTAM algorithm.

PTAM allows the map to be represented by a large number of features and performs global optimization. Despite these advantages, the PTAM algorithm presents a high complexity due to the bundle adjustment step. In addition, it does not include loop closure, and the generated map is more suitable for identifying landmarks. Furthermore, it requires the user's interaction to establish the initial keyframes, and it presents a non-negligible power consumption, which makes it unsuitable for low-cost embedded systems [37].
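The triangulation between two keyframes that PTAM relies on to obtain the depth of new points can be illustrated with a linear (DLT) triangulation. This is a generic textbook formulation under assumed intrinsics and keyframe poses, not PTAM's own implementation.

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one point from two 3x4 projection matrices."""
    A = np.vstack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]                  # null-space direction = homogeneous 3D point
    return X_h[:3] / X_h[3]

def project(P, X):
    """Project a 3D point with a 3x4 projection matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical intrinsics; the second keyframe is 0.2 m to the right of the first.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])

X_true = np.array([0.3, -0.1, 4.0])
X_est = triangulate(P1, P2, project(P1, X_true), project(P2, X_true))
```

The accuracy of the recovered depth depends on the baseline between the two keyframes, which is one reason new keyframes are only inserted after the camera has moved sufficiently.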

Dense Tracking and Mapping (2011)
Dense tracking and mapping (DTAM), proposed by Newcombe et al. [38], was the first fully direct method in the literature. The algorithm is divided into two main parts: dense mapping and dense tracking. The first stage estimates the depth values by defining a data cost volume that represents the average photometric error of multiple frames, computed as a function of the inverse depth in the current frame. The inverse depth that minimizes the photometric error is selected to integrate the reconstruction. In the dense tracking stage, DTAM estimates the motion parameters by aligning an image of the dense model, projected into a virtual camera, with the current frame. Figure 6 shows a general view of the DTAM algorithm. The algorithm provides an accurate and detailed reconstruction, but this level of reconstruction density impacts the computational cost of storing and processing the data. As a consequence, to achieve real-time operation, the algorithm requires state-of-the-art GPUs [10]. The authors in [39] employed the CPU + GPU architecture of different iPhone models to implement a fully dense algorithm based on DTAM. They used the CPU for the tracking task and the GPU for depth estimation and frame fusion. DTAM does not implement loop closure techniques or global optimization.
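The cost-volume idea can be illustrated with a one-dimensional toy example: for every inverse-depth hypothesis, the reference scanline is compared with several other frames, the photometric errors are averaged, and the hypothesis with the lowest cost is kept per pixel. The geometry is a strong simplification (a single fronto-parallel plane, integer pixel shifts proportional to baseline times inverse depth, wrap-around borders), and DTAM's regularization step is omitted; all names and values are assumptions.

```python
import numpy as np

def cost_volume(ref, frames, shifts_per_unit, inv_depths):
    """Average photometric error for each inverse-depth hypothesis and pixel.

    Toy geometry: frame m sees reference pixel u at column
    u + shifts_per_unit[m] * inv_depth (integer pixels, wrap-around border).
    """
    volume = np.zeros((len(inv_depths), ref.size))
    for i, rho in enumerate(inv_depths):
        for frame, gain in zip(frames, shifts_per_unit):
            warped = np.roll(frame, -gain * rho)   # bring frame m back onto the reference
            volume[i] += np.abs(ref - warped)
        volume[i] /= len(frames)
    return volume

rng = np.random.default_rng(3)
ref = rng.random(64)                               # textured reference scanline
true_rho = 3                                       # inverse depth of the observed plane
gains = [1, 2, 4]                                  # proportional to each frame's baseline
frames = [np.roll(ref, g * true_rho) for g in gains]

hypotheses = list(range(8))
volume = cost_volume(ref, frames, gains, hypotheses)
depth_index = volume.argmin(axis=0)                # winning hypothesis per pixel
```

Storing one cost per pixel and per hypothesis is what makes the approach memory- and compute-intensive, in line with the GPU requirement noted above.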

Semi-Direct Visual Odometry (2014)
The semi-direct visual odometry (SVO) algorithm [40] combines the advantages of both feature-based and direct methods. The algorithm is divided into two main threads: motion estimation and mapping. The first thread estimates the sensor's motion parameters by minimizing the photometric error. The mapping thread is based on probabilistic depth filters, and it seeks the optimum depth value for each 2D feature. When the algorithm achieves a low uncertainty, it inserts the 3D point in the reconstruction, as shown in Figure 7. SVO enables direct pixel correspondences and the usage of a probabilistic mapping method. In addition, the algorithm is capable of operating at a high frame rate, since it does not need to extract features for every frame [41], which enables its operation in a low-cost embedded system, such as the Odroid-U2 platform considered in [40]. Nonetheless, SVO presents a limited accuracy due to the short-term data association [22]. SVO does not implement global optimization techniques or loop closure. The authors have since proposed an extended version of SVO, SVO 2.0 [42], in which the algorithm is capable of processing stereo data and IMU information.

Large-Scale Direct Monocular SLAM (2014)
The large-scale direct monocular SLAM (LSD-SLAM) [24] is a direct algorithm that performs a semi-dense reconstruction. The algorithm consists of three main steps: tracking, depth map estimation, and map optimization. The first step minimizes the photometric error to estimate the sensor's pose. Next, LSD-SLAM performs the keyframe selection in the depth map estimation step. If it adds a new keyframe, it initializes its depth map; otherwise, it refines the depth map of the current keyframe by performing several small-baseline stereo comparisons. Finally, in the map optimization step, LSD-SLAM incorporates the new keyframe in the map and optimizes it by applying a pose-graph optimization algorithm. Figure 8 illustrates the procedure. This technique allows the real-time construction of large-scale maps and employs global optimization and loop closure. In addition, by combining the absence of feature extraction, characteristic of the direct methods, with a semi-dense reconstruction, this method improves its efficiency, enabling embedded implementations. Boikos and Bouganis [29,43] used CPU + FPGA architectures to implement the LSD-SLAM algorithm. In [29], the authors implemented two accelerators on the FPGA to perform the more expensive tasks of the tracking thread, that is, Jacobian calculations, as well as residual and weight calculations. The ARM CPU was used to implement the other tasks of the algorithm. In [43], the authors implemented the direct tracking thread on the FPGA, while the CPU was responsible for memory, hardware control, and parameter setup. The LSD-SLAM map estimation is based essentially on pose-graph optimization [22], and the algorithm achieved lower accuracy than others, such as PTAM and ORB-SLAM [41].

ORB-SLAM2 (2017)
The ORB-SLAM2 algorithm [44], which originated from ORB-SLAM [41], is considered the state of the art among feature-based algorithms. It works in three parallel threads: tracking, local mapping, and loop closing. The first thread locates the sensor by finding feature correspondences and minimizing the reprojection error. The local mapping thread is responsible for the map management operations. The last thread, loop closing, is in charge of detecting new loops and correcting the drift error in the loop. After processing the three threads, the algorithm also enforces the consistency of the whole structure and estimated motion by performing a full bundle adjustment. Figure 9 represents the threads that constitute the algorithm. ORB-SLAM2 supports the monocular, stereo, and RGB-D approaches, and implements global optimization and loop closure techniques. Nonetheless, a tracking failure may lead to a lost state if the system does not recognize a frame with high similarity [45]. In addition, this method needs to acquire the images at the same frame rate as it processes them, which makes real-time operation on embedded platforms difficult [46]. Even so, several embedded implementations may be found in the literature. Yu et al. [47] used a CPU to run the ORB-SLAM algorithm, and Abouzahir et al. [46] implemented the algorithm on different CPU- and GPU-based platforms, and evaluated the performance of each thread on the platforms.


CNN-SLAM (2017)
CNN-SLAM [48] is one of the first works to present a real-time SLAM system based on convolutional neural networks (CNN). The algorithm may be divided into two different pipelines: one applied to every input frame and another to every keyframe. The first is responsible for the camera pose estimation by minimizing the photometric error between the current frame and the nearest keyframe. In parallel, for every keyframe, the depth is predicted by a CNN. In addition, the algorithm predicts the semantic segmentation for each frame. After these processing steps, the algorithm performs a pose-graph optimization to obtain a globally optimized pose estimation, as shown in Figure 10.
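The effect of a pose-graph optimization can be shown with a one-dimensional toy problem in which the poses are scalars and all constraints are linear, so that a single least-squares solve suffices. Real systems optimize over 3D rigid or similarity transforms with iterative solvers; the edge format, the weights, and the function name are assumptions.

```python
import numpy as np

def optimize_pose_graph(num_poses, edges):
    """Linear least squares over 1D poses; each edge is (i, j, measured x_j - x_i, weight)."""
    A = np.zeros((len(edges) + 1, num_poses))
    b = np.zeros(len(edges) + 1)
    for row, (i, j, meas, weight) in enumerate(edges):
        A[row, i], A[row, j] = -weight, weight
        b[row] = weight * meas
    A[-1, 0] = 1.0                      # gauge constraint: anchor the first pose at 0
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

# Odometry overestimates each 1 m step by 10%; a loop closure measures the true 4 m span.
odometry = [(k, k + 1, 1.1, 1.0) for k in range(4)]
loop_closure = [(0, 4, 4.0, 10.0)]

drifted = np.cumsum([0.0] + [edge[2] for edge in odometry])   # pure dead reckoning
optimized = optimize_pose_graph(5, odometry + loop_closure)
```

The loop-closure edge spreads the accumulated error over all poses instead of leaving it at the end of the trajectory, which is the mechanism by which globally consistent methods correct drift.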

General Comments
In this Section, we presented the main visual-only-based SLAM algorithms. Table 1 summarizes the main characteristics and analyzed criteria for the presented visual-only SLAM algorithms. The main benefits and drawbacks of each method were individually addressed. From a general point of view, the visual-only-based SLAM algorithms may be considered a well-explored field, since most of the algorithms were made available by the authors, which also had consequences for the embedded SLAM implementations found in the literature. The embedded implementations presented in Table 1 consider implementations of the full SLAM algorithms and works that do not perform essential modifications to the originally proposed technique. However, it is possible to find in the literature several embedded implementations based on fundamental concepts of the presented algorithms. For instance, the MonoSLAM principles have been used for the development and implementation of several other SLAM-on-SoC implementations, such as the heterogeneous architecture recently proposed by Piat et al. [59]. Furthermore, the growing development of CNN-based SLAM algorithms can be noticed. Besides the presented CNN-SLAM, other algorithms are found in the literature, such as the CNN-SVO [28] algorithm, which uses depth prediction to initialize the depth filters. Developments in the hardware implementation of CNN-based SLAM algorithms have been growing since the launch of the AI accelerator Xilinx Deep-learning Processor Unit [60] in 2019. This hardware has already enabled progress on embedded implementations of CNN-based algorithms: one example is the work presented in [61], which uses an FPGA platform to run a CNN-based feature extractor.

Visual-Inertial SLAM
A timeline representing the selected visual-inertial algorithms is presented in Figure 12 and the algorithms are explained in the following subsections.

Multi-State Constraint Kalman Filter (2007)
The multi-state constraint Kalman filter (MSCKF) [62] can be implemented using both monocular and stereo cameras [63]. The algorithm's pipeline consists of three main steps: propagation, image registration, and update. In the first step, the MSCKF considers the discretization of a continuous-time IMU model to obtain the propagation of the filter state and covariance. Then, the image registration performs the state augmentation each time a new image is recorded. This estimate is added to the state and covariance matrix to initiate the image processing module (feature extraction). Finally, the algorithm performs the filter update. Figure 13 represents the algorithm. The MSCKF is considered one of the fastest filter-based methods in the literature [64], a consequence of its low computational cost [63], which makes this algorithm suitable for embedded implementations. Delmerico and Scaramuzza [65] used different hardware platforms based on CPU architectures to implement visual-inertial SLAM algorithms. The authors implemented the algorithm on three different embedded boards: Intel NUC, Up Board, and ODROID. However, the Jacobian calculations performed by the algorithm may cause inconsistency and loss of accuracy [66].

Open Keyframe-Based Visual-Inertial SLAM (2014)
Open Keyframe-based Visual-Inertial SLAM (OKVIS) [67] is an optimization-based method. It combines the IMU data and reprojection terms into an objective function, allowing the algorithm to jointly optimize both the weighted reprojection error and the temporal error from the IMU. The algorithm builds a local map, and then the subsequent keyframes are selected according to the keypoints match area. The algorithm can be depicted as shown in Figure 14. The OKVIS algorithm presented a lower memory usage than other algorithms explained in the following subsections, such as VINS-Mono, VIORB, and ROVIO [18], enabling its embedded implementation. The already mentioned work of Delmerico and Scaramuzza [65] used different CPU platforms to implement the OKVIS algorithm. However, to achieve real-time performance on the Up Board and ODROID, the authors needed to reduce the number of keypoints, the keyframe window, and the IMU-linked frames. Nikolic et al. [68] used an FPGA-CPU architecture to evaluate the OKVIS algorithm's performance. The authors took advantage of the logic blocks on the FPGA to implement the image processing techniques and accelerated the keypoint detection process. However, it was demonstrated that the algorithm is less accurate than others [18].

Robust Visual Inertial Odometry (2015)
The Robust Visual Inertial Odometry (ROVIO) algorithm [69] is another filter-based method that uses the EKF approach, and similar to other filter-based methods, it uses the IMU data for state propagation and the camera data for the filter update. However, besides performing the feature extraction, ROVIO executes the extraction of multi-level patches around the features, as illustrated by Figure 15. The patches are used by the prediction and update steps to obtain the innovation term, i.e., the error between the frame and the projection of the multi-level patch into the frame. The ROVIO algorithm achieves good accuracy and robustness with low resource utilization [18,65], being suitable for embedded implementations [65]. However, the algorithm proved to be more sensitive to per-frame processing time [65] and less accurate than other algorithms, such as VI-DSO [70].

Visual Inertial ORB-SLAM (2017)
The Visual-Inertial ORB-SLAM (VIORB) algorithm [71] is based on the already presented ORB-SLAM algorithm [44]. As such, the system also has three main threads: tracking, local mapping, and loop closing. In VIORB, the tracking thread estimates the sensor pose, velocity, and IMU biases. Additionally, this thread performs the joint optimization of the reprojection error of the matched points and the IMU error data. The local mapping thread adopts a different culling policy considering the IMU operation. Finally, the loop closing thread implements a place recognition module to identify the keyframes already visited by the sensors. Furthermore, the algorithm performs an optimization to minimize the accumulated error. Figure 16 illustrates the main differences between the ORB-SLAM algorithm (see Figure 9) and its visual-inertial version. The VIORB algorithm was the first visual-inertial method to employ map reuse, and it performs well in terms of accuracy [64,70,72] and memory usage [18]. Nonetheless, the IMU initialization takes between 10 and 15 s [71], and no embedded implementations were found. In [22], the authors propose the ORB-SLAM3 algorithm, which is based on the ORB-SLAM2 and VIORB algorithms. The system presents a reduced initialization time compared to its predecessor, VIORB.

Monocular Visual-Inertial System (2018)
Monocular Visual-Inertial System (VINS-Mono) [73] is a monocular visual-inertial state estimator. It starts with a measurement process responsible for feature extraction and tracking, and a pre-integration of the IMU data between the frames. Then, the algorithm performs an initialization process to provide the initial values for a non-linear optimization process that minimizes the visual and inertial errors. VINS-Mono also implements a relocalization module and a pose-graph optimization module that merges the IMU measurements and feature observations. Figure 17 illustrates the VINS-Mono algorithm. The algorithm can also be applied considering binocular and stereo approaches [74]. VINS-Mono has been shown to achieve high accuracy when compared to other algorithms. Yet, it presented the highest memory usage when compared to algorithms such as ROVIO, VIORB, and OKVIS [18]. Even so, since it only considers the pose and velocity from the latest IMU states during the optimization process, this algorithm still demonstrated its suitability for embedded implementations [73].

Visual-Inertial Direct Sparse Odometry (2018)
The Visual-Inertial Direct Sparse Odometry (VI-DSO) algorithm [70] is based on the already presented DSO algorithm [31]. The algorithm seeks to minimize an energy function that combines the photometric and inertial errors, which is built considering a nonlinear dynamic model. Figure 18 shows an overview of the VI-DSO algorithm that illustrates its main differences with respect to the DSO technique. VI-DSO is an extension of DSO that considers the inertial information, which results in better accuracy and robustness than the original DSO and other algorithms, like ROVIO [70]. However, the initialization procedure relies on bundle adjustment, which makes the initialization slow [22]. The algorithm does not perform global optimization or loop closure detection, and embedded implementations were not found in the literature.

ORB-SLAM3 (2020)
The already mentioned ORB-SLAM3 algorithm [75] combines the ORB-SLAM and VIORB algorithms. As with its predecessors, the algorithm is divided into three main threads: tracking, local mapping, and, in place of loop closing alone, a combined loop closing and map merging thread. In addition, ORB-SLAM3 uses a multi-map representation called Atlas, which maintains an active map used by the tracking thread and non-active maps used for relocalization and place recognition. The first two threads follow the same principle as in VIORB, while map merging is added to the last thread. The loop closing and map merging thread uses all the maps in Atlas to identify common parts and, depending on the location of the overlapped area, either performs loop correction or merges maps and changes the active map. Another important aspect of ORB-SLAM3 is the proposed initialization technique, which relies on Maximum-a-Posteriori estimation applied individually to the visual and inertial estimations, which are later jointly optimized. This algorithm can be used with monocular, stereo, and RGB-D cameras, and implements global optimization and loop closure techniques. However, the authors of [76] reported significant errors in the online performance of ORB-SLAM3. In [77], the algorithm obtained good performance, but failed to process all the sequences and produced inaccurate estimates in outdoor sequences.

General Comments
This section presented seven main visual-inertial SLAM algorithms, as well as an individual analysis of each of them. Table 2 summarizes the main characteristics and analyzed criteria for the presented visual-inertial SLAM algorithms. In general, the primary purpose of adding an IMU to visual-based SLAM algorithms is to increase the system's robustness, a benefit that has already been demonstrated [2,22,70]. We observed greater literature feedback for the algorithms made available by their authors, which directly influenced the embedded implementations found in the literature. Unlike its visual-only version, we did not find an embedded version of the VIORB algorithm, since the original article does not provide an open-source version, and the more recent open-source ORB-SLAM3 was only published in 2020 [22]. As for the inertial version of the DSO algorithm, the authors do not provide an open-source implementation; however, a third-party implementation is available [83], even though it requires optimization. Visual-inertial SLAM is a growing field, and several recent articles have been published that combine IMUs with a large variety of sensors [85][86][87]. Limiting our research to visual-SLAM techniques, we found several articles proposing solutions to improve the initialization step of VI-based SLAM algorithms [75,88,89].

RGB-D SLAM
The most representative SLAM algorithms based on RGB-D sensors, i.e., those that use RGB images and depth information directly, are presented in Figure 19 according to their year of publication, and are explained in the following subsections.

KinectFusion (2011)
The KinectFusion algorithm [90] was the first algorithm based on an RGB-D sensor to operate in real time. The algorithm includes four main steps: measurement, pose estimation, reconstruction update, and surface prediction. In the first step, the RGB image and depth data are used to generate a vertex map and a normal map. In the pose estimation step, the algorithm applies ICP alignment between the current surface and the predicted one (provided by the previous step). Then, the reconstruction update step integrates the new depth frame into the 3D reconstruction, which is raycast into the newly estimated frame to obtain a new dense surface prediction. The KinectFusion algorithm maps well in rooms of up to medium size [90]. However, it accumulates drift errors, since it does not perform loop closing [91]. Nardi et al. [92] propose an implementation of KinectFusion and test it on different CPU- and GPU-based platforms. Bodin et al. [93] use the framework proposed in [92] to implement KinectFusion on two different CPU and GPU platforms. An overview of the steps performed by the algorithm is shown in Figure 20.
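The measurement step described above, which turns a depth frame into a vertex map and a normal map, can be sketched as follows. This is a simplified illustration rather than the implementation of [90]: it assumes an ideal pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, uses the depth values only, treats a depth of 0 as invalid, and omits any depth filtering. The function names are illustrative.

```python
import math


def backproject(depth, fx, fy, cx, cy):
    """Convert a depth image (rows of metres, 0 = invalid) into a vertex map.

    Each valid pixel (u, v) with depth z becomes the 3D point
    ((u - cx) * z / fx, (v - cy) * z / fy, z) in the camera frame.
    """
    vertices = []
    for v, row in enumerate(depth):
        out_row = []
        for u, z in enumerate(row):
            if z > 0:
                out_row.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
            else:
                out_row.append(None)
        vertices.append(out_row)
    return vertices


def normal_map(vertices):
    """Estimate unit normals from the cross product of neighbouring vertices.

    The last row and column, and pixels next to invalid depth, stay None.
    """
    h, w = len(vertices), len(vertices[0])
    normals = [[None] * w for _ in range(h)]
    for v in range(h - 1):
        for u in range(w - 1):
            p, right, down = vertices[v][u], vertices[v][u + 1], vertices[v + 1][u]
            if p is None or right is None or down is None:
                continue
            a = tuple(right[i] - p[i] for i in range(3))
            b = tuple(down[i] - p[i] for i in range(3))
            n = (a[1] * b[2] - a[2] * b[1],
                 a[2] * b[0] - a[0] * b[2],
                 a[0] * b[1] - a[1] * b[0])
            norm = math.sqrt(sum(c * c for c in n))
            if norm > 0:
                normals[v][u] = tuple(c / norm for c in n)
    return normals
```

For a flat wall facing the camera, every valid normal produced by this sketch is parallel to the optical axis, which is a quick sanity check for the two maps before they are passed to an ICP-style alignment.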

SLAM++ (2013)
The SLAM++ algorithm [94] is an object-oriented SLAM algorithm that takes advantage of previously known scenes containing repeated objects and structures, such as a classroom. After the system initialization, SLAM++ operates in four steps: camera pose estimation, object insertion and pose update, pose-graph optimization, and surface rendering. The first step estimates the current camera pose by applying the ICP algorithm, considering the dense multi-object prediction in the current SLAM graph. Next, the algorithm seeks to identify objects in the current frame using the database information. The third step inserts the considered objects in the SLAM graph by performing a pose-graph optimization. Finally, the algorithm renders the objects in the graph, as shown in Figure 21. SLAM++ performs loop closure detection and, by exploiting the repeatability of objects, increases its efficiency and improves its scene description. Nevertheless, the algorithm is most suitable for already known scenes.

Dense Visual Odometry (2013)
The dense visual odometry SLAM (DVO-SLAM) algorithm, proposed by Kerl et al. [95], is a keyframe-based technique. It minimizes the photometric error between the keyframes to acquire the depth values and pixel coordinates, as well as the camera motion. To select keyframes, the algorithm calculates, for each input frame, an entropy value that is compared to a threshold value. The same principle is used for loop detection, although with a different threshold value. The map is represented by a SLAM graph in which the vertices are camera poses and the edges are the transformations between keyframes. This algorithm is robust to textureless scenes and performs loop closure detection. The map representation relies on the keyframes, and the algorithm does not perform an explicit map reconstruction. Figure 22 shows an overview of the DVO algorithm.
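The entropy-based keyframe test described above can be sketched as follows. The sketch assumes that each motion estimate comes with a Gaussian covariance matrix and uses the differential entropy of a Gaussian, H = 0.5 m (1 + ln 2π) + 0.5 ln|Σ|; the ratio form, the function names, and the threshold value are assumptions made for illustration and may differ from the exact criterion in [95].

```python
import math


def log_det_spd(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    log_det = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("covariance must be positive definite")
                chol[i][j] = math.sqrt(s)
                log_det += 2.0 * math.log(chol[i][j])
            else:
                chol[i][j] = s / chol[j][j]
    return log_det


def entropy(cov):
    """Differential entropy of an m-dimensional Gaussian with covariance cov."""
    m = len(cov)
    return 0.5 * m * (1.0 + math.log(2.0 * math.pi)) + 0.5 * log_det_spd(cov)


def needs_new_keyframe(cov_first, cov_current, threshold=0.9):
    """Compare the current estimate's entropy with that of the first frame
    tracked against the same keyframe.

    Assumes small covariances, so both entropies are negative and the ratio
    drops below 1 as the uncertainty of the current estimate grows.
    """
    ratio = entropy(cov_current) / entropy(cov_first)
    return ratio < threshold, ratio
```

Normalizing by the entropy of the first frame after the keyframe makes the test relative to the scene, so a single threshold can be reused across sequences with different overall uncertainty levels.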

[Figure 22: DVO-SLAM pipeline. Camera motion estimation (photometric and geometric error minimization); keyframe selection (based on the calculation of an entropy ratio); loop closure detection (nearest-neighbour search); map optimization (non-linear least-squares optimization).]

RGBDSLAMv2 (2014)
RGBDSLAMv2 [96] is one of the most popular RGB-D-based algorithms and relies on feature extraction. It applies the RANSAC algorithm to estimate the transformation between the matched features and the ICP algorithm to obtain the pose estimate. Finally, the system executes a global optimization and loop closure to eliminate the accumulated error. In addition, this method proposes using an environment measurement model (EMM) to validate the transformations obtained between the frames. The algorithm is based on SIFT features, which degrades its real-time performance. RGBDSLAMv2 has a high computational cost and requires slow sensor movement to operate correctly [91]. Figure 23 represents the algorithm.

[Figure 23: RGBDSLAMv2 pipeline. Blocks: input data, input depth, transformation estimation, transformation validation, transformation optimization.]

General Comments
Section 3.3 individually presented the most representative RGB-D-based techniques. Table 3 summarizes the main characteristics and analyzed criteria for the presented algorithms. RGB-D-based SLAM algorithms represent an alternative to visual-only and visual-inertial SLAM. In general, they construct dense maps, which allows them to represent the environment in greater detail. In addition, thanks to the depth sensor, the approach is more robust in low-texture environments. Concerning embedded implementations, the literature offers several solutions that seek to accelerate the parts of RGB-D-based algorithms that usually carry the highest computational load, such as the ICP algorithm. Beshaw et al. [100] and Williams et al. [101] propose different architectures to accelerate the ICP algorithm, and Gautier et al. [102] implemented the ICP and the volumetric integration algorithms in a heterogeneous architecture. Recent publications have focused on developing RGB-D SLAM algorithms that are robust in dynamic environments [103-105].

Open Problems and Future Directions
Although the SLAM domain has been widely studied for years, there are still several open problems. The current state of the art in SLAM and odometry algorithms increasingly seeks to reinforce the algorithms' robustness, optimize the usage of computational resources, and improve the understanding of the environment in the map representations [8]. Concerning robustness, SLAM and odometry techniques still present some major issues [8]. One of them is tracking failure [106]: in challenging or long-term scenarios, the algorithms may still fail to recognize and associate features in the current image, resulting in inaccurate pose estimation. This may have consequences for loop closure techniques [107] and relocalization [8,108]. As a solution to this issue, authors have been exploring new methods to deal with the SLAM problem. Recent works propose incorporating deep learning and spectral techniques [109,110] to increase the system's robustness; some main examples of deep-learning-based algorithms are discussed in Section 4.1.
Another main issue that decreases the robustness of SLAM algorithms is the assumption of static scenes, whereas the real world presents dynamic environments; this may cause failures in tracking [111] and reconstruction [112]. Dealing with dynamic scenes is a challenge, since it requires the algorithm to detect the dynamic object, avoid tracking it, and exclude it from the map [113]. As mentioned in Section 3.3.5, several works have been published proposing solutions to this central issue; more representative examples are discussed in Section 4.3.
Besides robustness, recent SLAM algorithms seek to account for the usage of computational resources [8]. This topic leads to the open problem of the memory used for map storage [8]. Storing the map during long-term operation may considerably increase memory usage, which may affect the operation of memory-limited systems, e.g., embedded SLAM. However, the literature already contains works proposing solutions for this topic. One example is the work of Opdenbosch et al. [114], who propose an efficient map compression and demonstrate its ability to significantly reduce the map's data and size without losing relevant information. In addition to map storage, another major issue that influences resource usage is map sparsity. Dense and semi-dense maps provide a more detailed representation of the environment, but at the cost of higher resource usage. Wan et al. [115] have demonstrated that sparse maps present lower power consumption than semi-dense and dense ones. Consequently, sparse maps may be more suitable for an embedded implementation, although they provide fewer details.
Currently, SLAM algorithms also seek to improve the understanding of the environment in the performed reconstructions [8]. Besides obtaining geometric information, the algorithms obtain information about the environment by, for example, recognizing objects within it. An evolving SLAM category that enables this better abstraction of the environment is semantic-based SLAM. Semantic SLAM is a trending topic, and some main examples are discussed in Section 4.2. In what follows, we briefly discuss some recent and relevant articles that we believe are representative of future directions of the visual-SLAM and visual-odometry fields.

Deep Learning-Based Algorithms
One remarkable algorithm that incorporates deep learning concepts is UnDeepVO [116]. This monocular visual-odometry algorithm performs pose and depth estimation via a deep neural network. The authors train UnDeepVO with unsupervised learning using stereo images; additionally, they consider both spatial and temporal dense information in the training loss function. This method proved to be more accurate and robust than other monocular methods, such as ORB-SLAM (without loop closure).
Recently, the same research group proposed DeepSLAM [117]. The system uses a Tracking-Net and a Mapping-Net trained with unsupervised learning, with spatial and temporal geometry considered in the loss function. The algorithm also contains a Loop-Net to perform loop detection. DeepSLAM presented better performance than other monocular algorithms, such as ORB-SLAM, and better robustness than ORB-SLAM and LSD-SLAM.
Another relevant algorithm based on deep learning is DF-SLAM [118]. DF-SLAM follows a framework similar to ORB-SLAM, but instead of using the hand-crafted features explained in Section 2.1.1, it uses deep local features described by the TFeat network. The authors provide several results comparing DF-SLAM to ORB-SLAM2; for most sequences, the proposed algorithm obtained better performance. Several recent overviews [119][120][121] address deep learning-based algorithms applied to depth estimation and the main concepts of SLAM's direction. More methods that use deep learning techniques are discussed in Section 4.3 as a solution to dynamic SLAM algorithms.

Semantic-Based Algorithms
Incorporating semantic information into the visual-SLAM problem is a growing field that has been attracting more attention in recent years. One important and recent study in this area is presented in [122]. The authors propose a new methodology for data association that incorporates information from an object detector and can represent both data association and landmark class in a factor graph. This method presents reduced errors compared to other solutions incorporating semantic data association techniques. More methods containing semantic data are discussed in Section 4.3 as a solution to dynamic SLAM algorithms. As this field grows, it is also necessary to establish methods to validate semantic-based algorithms. The authors of [123] introduce a new synthetically generated benchmark dataset that, besides the traditional ground truth of the trajectory, contains semantic labels, information about the scene composition, ground truth 3D models, and the pose of the objects. In addition, they propose evaluation metrics to assess the performance of semantic-based algorithms.

Dynamic SLAM Algorithms
Research into SLAM algorithms for dynamic environments is essential to increase the algorithms' robustness in more realistic situations. First in [124] and then in [125], Sun et al. propose a motion removal technique to deal with the environment's dynamicity in the RGB-D approach. In [125], the removal algorithm may be divided into two parts: first, it identifies the moving object and updates the foreground model using the error caused by the object in the image; then, it performs the foreground segmentation. The algorithm obtained better performance than some state-of-the-art techniques, such as DVO, especially in highly dynamic environments.
An essential algorithm robust to dynamic scenes is Dynamic-SLAM, proposed by Xiao et al. [126]; this method incorporates both deep learning and semantic techniques. The system employs a CNN to detect dynamic objects at a semantic level; it separates the dynamic and static features, treating the dynamic ones as outliers. In addition, the authors propose a compensation algorithm to increase the detection accuracy, as well as a feature-based framework. The tracking thread incorporates the semantic data to decide whether to discard or retain each feature. Dynamic-SLAM presented greater accuracy than other methods such as LSD-SLAM, SVO, and PTAM, and better robustness than ORB-SLAM2.
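The separation of dynamic and static features described above can be illustrated with a minimal sketch. It assumes that an object detector (not implemented here) has already returned axis-aligned bounding boxes for the objects it considers dynamic; the function name `split_features` and the box format `(x_min, y_min, x_max, y_max)` are hypothetical and do not reproduce the pipeline of [126].

```python
def split_features(keypoints, dynamic_boxes):
    """Split 2D keypoints into static and dynamic sets.

    keypoints:     list of (x, y) pixel coordinates.
    dynamic_boxes: list of (x_min, y_min, x_max, y_max) boxes assumed to come
                   from a detector that flags potentially moving objects.
    A keypoint inside any box is treated as dynamic, i.e., as an outlier that
    should not be used for pose estimation.
    """
    static, dynamic = [], []
    for x, y in keypoints:
        inside = any(x0 <= x <= x1 and y0 <= y <= y1
                     for x0, y0, x1, y1 in dynamic_boxes)
        (dynamic if inside else static).append((x, y))
    return static, dynamic


# Example: one detected box covering the image region of a moving object.
keypoints = [(10, 10), (55, 60), (200, 120)]
static, dynamic = split_features(keypoints, [(50, 50, 100, 100)])
```

Only the `static` list would then be passed to the tracking stage. A box-level test is coarse, since background points that fall inside a box are discarded as well; pixel-level segmentation masks reduce that loss at a higher computational cost.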
DynaSLAM II [127] is another relevant method that incorporates semantic segmentation to track dynamic objects. This algorithm is based on ORB-SLAM2 and performs semantic segmentation and feature extraction at each new frame. It makes no assumptions about the dynamic objects and performs the data association of dynamic and static features. Static features are used to estimate the initial camera poses, and then trajectories, bounding boxes, and 3D points are optimized. DynaSLAM II achieved performance comparable to other state-of-the-art algorithms, such as ORB-SLAM2.

Datasets and Benchmarking
Given the number of SLAM algorithms in the literature, a fair comparison between them is essential to determine which one performs better in a given situation. Several benchmarking datasets with different characteristics have been proposed in the literature to explore the capabilities and robustness of SLAM. Here, we present the publicly available benchmark datasets used to evaluate the presented SLAM algorithms in their original articles.
The TUM RGB-D dataset [128] consists of several image sequences containing color and depth images recorded in indoor environments with a Microsoft Kinect on two different platforms: robot and handheld. The system was synchronized with a motion-capture system to provide the ground truth. In addition, the authors propose two metrics to evaluate the local accuracy and the global consistency of the trajectory: the relative pose error and the absolute trajectory error, respectively. The KITTI dataset [129] contains outdoor sequences recorded by color and grayscale stereo cameras. KITTI also provides data from a 3D laser scanner and the ground truth provided by an INS/GPS. The sensor system is synchronized and mounted on a car. In addition, the authors provide tracklets for dynamic object classification and benchmarks to evaluate robotics tasks, such as visual odometry and SLAM.
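The two trajectory metrics associated with the TUM RGB-D dataset above, the absolute trajectory error (ATE) and the relative pose error (RPE), can be sketched for planar poses. The sketch makes several simplifying assumptions: poses are SE(2) triples `(x, y, theta)` rather than full 6-DoF poses, the two trajectories are already time-associated and expressed in the same frame (no alignment step), and only the translational error is reported. The function names are illustrative and are not those of the benchmark's evaluation tools.

```python
import math


def inverse(pose):
    """Inverse of a planar pose (x, y, theta)."""
    x, y, th = pose
    c, s = math.cos(th), math.sin(th)
    return (-(c * x + s * y), -(-s * x + c * y), -th)


def compose(a, b):
    """Composition a * b of two planar poses."""
    ax, ay, ath = a
    bx, by, bth = b
    c, s = math.cos(ath), math.sin(ath)
    return (ax + c * bx - s * by, ay + s * bx + c * by, ath + bth)


def _rmse(values):
    if not values:
        raise ValueError("trajectory too short for this metric")
    return math.sqrt(sum(v * v for v in values) / len(values))


def ate_rmse(gt, est):
    """Global consistency: RMSE of the translation of gt_i^-1 * est_i."""
    errors = []
    for q, p in zip(gt, est):
        ex, ey, _ = compose(inverse(q), p)
        errors.append(math.hypot(ex, ey))
    return _rmse(errors)


def rpe_rmse(gt, est, delta=1):
    """Local accuracy: RMSE of the translational drift over `delta` steps."""
    errors = []
    for i in range(len(gt) - delta):
        rel_gt = compose(inverse(gt[i]), gt[i + delta])
        rel_est = compose(inverse(est[i]), est[i + delta])
        ex, ey, _ = compose(inverse(rel_gt), rel_est)
        errors.append(math.hypot(ex, ey))
    return _rmse(errors)
```

A constant offset between the two trajectories increases the ATE but leaves the RPE at zero, whereas a scale error that grows along the path affects both, which is why the two metrics are usually reported together.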
Another main benchmark dataset is ICL-NUIM [130]. The dataset focuses on RGB-D algorithms and provides data for evaluating 3D reconstruction through eight synthetically generated indoor scenes. A handheld RGB-D camera generates the sequences, and the ground truth consists of a 3D surface model and the trajectory estimated by a SLAM algorithm [131]. The EuRoC benchmark dataset [23] is widely used to evaluate visual-only and visual-inertial SLAM and odometry algorithms. The data were collected in two indoor environments by a micro aerial vehicle (MAV), and the dataset provides eleven sequences of stereo images and IMU data. The ground truth is obtained by a total station and a motion capture system.
A dataset commonly used to evaluate monocular systems is TUM MonoVO [30]. It contains several photometrically calibrated indoor and outdoor sequences recorded by two handheld non-stereo monocular cameras. Due to the variety of the scenes, the authors do not provide ground truth for the poses, but they record long sequences that start and end at the same position, which allows the loop drift to be evaluated. Lastly, a dataset provided for the evaluation of visual-inertial systems is the TUM VI dataset [132]. It provides several indoor and outdoor sequences captured by a stereo camera synchronized with an IMU. The sensor system is handheld, and, as for TUM MonoVO, it was impossible to establish the ground truth for the entire sequences. However, the authors provide the ground truth via a motion capture system for the beginning and end of each sequence. Table 4 summarizes the main characteristics of the benchmark datasets presented in this work.

Conclusions
Visual-based SLAM techniques represent a wide field of research thanks to the robustness and accuracy they provide with a cheap and small sensor system. The literature presents many different visual-SLAM algorithms, which makes it difficult for researchers to choose among them without criteria for evaluating their benefits and drawbacks. In this paper, we introduced the main visual-based SLAM approaches, together with a brief description and a systematic analysis of a set of the most representative techniques of each approach. To guide the choice among all the algorithms, we proposed six criteria that are limiting factors in several SLAM projects: the algorithm type, the density of the reconstructed map, the presence of global optimization, the presence of loop closure techniques, the algorithm's availability, and the embedded implementations already performed. Researchers can weigh each criterion according to their application and obtain an initial analysis from the present paper. In addition, we presented some major issues, suggested future directions for the field, and discussed the main benchmarking datasets for the evaluation of visual-SLAM and odometry algorithms. Regarding future work, we will apply the proposed criteria analysis to nuclear decommissioning scenarios. The best SLAM algorithm will be selected after considering the variety of features and specificities of this environment and application.

Figure 1. General components of a visual-based SLAM. The depth and inertial data may be added to the 2D visual input to generate a sparse map (generated with the ORB-SLAM3 algorithm [22] in the MH_01 sequence [23]), semi-dense map (obtained with the LSD-SLAM [24] in the dataset provided by the authors), and a dense reconstruction (Reprinted from [25]).

Figure 2. General differences between feature-based and direct methods. Top: main steps followed by the feature-based methods, resulting in a sparse reconstruction (map generated with the ORB-SLAM3 algorithm [22] in the MH_01/EuRoC sequence [23]). Bottom: main steps followed by a direct method, that may result in a sparse (generated from the reconstruction of sequence_02/TUM MonoVO [30] with the DSO algorithm [31]) or dense reconstruction (Reprinted from [25]), according to the chosen technique.

Figure 3. Timeline representing the most representative visual-only SLAM algorithms.

Figure 12. Timeline representing the most representative visual-inertial SLAM algorithms.

Figure 19. Timeline representing the most representative RGB-D-based SLAM algorithms.

Figure 23. Diagram representing the RGBDSLAMv2 algorithm. Adapted from [96].

Table 1. Main aspects related to the visual-only SLAM approaches.
Diagram representing the feature handling performed by the ROVIO algorithm. Adapted from [69].

Table 2. Main aspects related to the visual-inertial SLAM approaches. All approaches present tightly coupled sensor fusion.

Table 3. Main aspects related to the RGB-D-based SLAM approaches.

Table 4. Main aspects related to the presented benchmark datasets.
* Environment: indoor or outdoor.