Modelling Software Architecture for Visual Simultaneous Localization and Mapping

: Visual simultaneous localization and mapping (VSLAM) is an essential technique used in areas such as robotics and augmented reality for pose estimation and 3D mapping. Research on VSLAM using both monocular and stereo cameras has grown signiﬁcantly over the last two decades. There is, therefore, a need for emphasis on a comprehensive review of the evolving architecture of such algorithms in the literature. Although VSLAM algorithm pipelines share similar mathematical backbones, their implementations are individualized and the ad hoc nature of the interfacing between different modules of VSLAM pipelines complicates code reuseability and maintenance. This paper presents a software model for core components of VSLAM implementations and interfaces that govern data ﬂow between them while also attempting to preserve the elements that offer performance improvements over the evolution of VSLAM architectures. The framework presented in this paper employs principles from model-driven engineering (MDE), which are used extensively in the development of large and complicated software systems. The presented VSLAM framework will assist researchers in improving the performance of individual modules of VSLAM while not having to spend time on system integration of those modules into VSLAM pipelines.


Introduction
Visual simultaneous localization and mapping (VSLAM) is a technique used to estimate camera motion and to generate 3D maps of environments using vision-based sensors. VSLAM is an essential building block in several robotic, automotive, and augmented/mixed reality (AR/MR) applications [1]. In the presence of sufficient illumination and visually distinctive environments, VSLAM can be used to track the 6 degrees-offreedom (DOF) pose of the camera as well as to generate globally consistent 3D maps and camera trajectories with minimal drift over large-scale motion [2]. Therefore, VSLAM can also be viewed as a combination of visual odometry and loop closure [3]. Most VSLAM implementations share common components such as filtering, feature detection, and image alignment in the front-end for generating sensor-agnostic representations of the input data. The back-end usually consists of keyframe management, optimization, and map improvement for estimating motion and structure with global consistency and minimal error [4]. There still remain open challenges in VSLAM implementations such as resource allocation, distributed mapping, and optimizing map storage [4]. The research on improving the state-of-the-art in VSLAM is still ongoing, and several different approaches and implementations of VSLAM are being continuously proposed [5][6][7][8]. The newly proposed VSLAM approaches, conforming to their experimental software nature, are also vulnerable to the same issues that arise commonly in complex software engineering systems. This necessitates developing a better understanding of the core components and interfaces within VSLAM systems captured by an overarching software architecture for VSLAM.
The ability to create a generalized software model from commonly occurring components in any robotic application can speed up the overall application development process by improving code reusability and by reducing the time required for system design, development, integration, and maintenance [9]. Therefore, a software architecture for VSLAM that not only satisfies the original performance requirements but also offers long-term maintainability, reliability, stability, and robustness could significantly assist researchers while also promoting a commonality of language and a flexibility in composition. A framework for VSLAM could be highly beneficial in the long-term if it were designed with the requirements for robust perception under consideration [4], as well.
This paper aims to present an overview of VSLAM algorithms and, from this overview, to propose a model of VSLAM using components and defining interfaces with specific message types for assisting in the design, development, and testing of individual modules of VSLAM. Such an approach offers three key advantages: (i) better transparency in troubleshooting the software system, (ii) the ability to independently modify individual components, (iii) improved efficiency in system integration, and (iv) a better conceptual model of the system in general. The survey on VSLAM algorithms from 2010 to 2016 by Taketomi et al. [3] and the review of the SLAM literature up to 2016 by Cadena et al. [4] explain the high-level framework employed by state-of-the-art SLAM algorithms up until 2016. However, their works more generally review SLAM and do not provide detailed discussion on vision-based SLAM modules. Fraundorfer and Scaramuzza [10] systematically explained visual odometry algorithms developed between 1980 and 2011 but did not cover techniques that employ loop closures for global optimization. To the best of our knowledge, the current literature only reviews VSLAM algorithms up until 2016. In this work, we include more recent advances while also introducing a standardized comparison. Most notably, we introduce a novel framework to address algorithm reusability and architecture modularity to assist in the development of new VSLAM approaches. Therefore, the main contributions of this paper include (i) an analysis of past trends in design and performance of VSLAM algorithms; (ii) a comprehensive review of open source libraries and packages used in their implementations; and (iii) a general overview of VSLAM algorithms developed between 2000 and 2019; and finally, leveraging these other contributions, (iv) a component-based model of VSLAM modules.
The rest of this paper is organized into the following sections: Section 2 summarizes the approaches that have been used to implement various VSLAM algorithms. Section 3 develops an accumulated list of modules commonly found in VSLAM implementations for both feature-based and direct approaches. Section 4 proposes a model for VSLAM based on its evolving architecture from the literature. Section 5 lists the tools, libraries, and packages used in VSLAM module implementations. Section 6 provides an overview of the performance of various implementations of VSLAM against standard benchmark datasets for VSLAM. Finally, Section 7 presents the conclusion.

Overview of Approaches in VSLAM
The majority of currently existing visual SLAM algorithms primarily consist of two stages: the front-end, which handles transforming sensor data into a representation in feature space and establishing constraints in robot motion and sensor measurements, and the back-end, which consumes the constraints generated by the front-end and performs an optimization to maximize the maximum a posteriori estimate of the unknown poses and landmarks. Representation of the back-end as a factor graph has been de facto standard for formulating the core graph-based structure of SLAM problems due to its intuitive and generalized nature.

Front-End Modules for VSLAM
Often, the data received from perception sensors is large, dense, and too complex to be represented directly in the core graph of constraints. Due to this reason, it is common to derive constraints from salient portions of the sensor data such as features and patches with high image gradients. Depending on which approach is chosen, feature-based and direct approaches stem out. Specific implementation of the front-end also varies based on the type of sensor configuration used such as monocular, stereo, RGB-D, etc. or some combination of them. Most of the modules of VSLAM can be used with these sensor configurations interchangeably. The majority of VSLAM algorithms have been developed using featurebased [10] and direct [11,12] approaches. Hybrid approaches that combine the selected benefits from both direct and feature-based approaches have also been proposed and are described below.

Feature-Based
Feature-based VSLAM methods track keypoint correspondences over successive image frames and perform a joint optimization of camera pose and 3D world point locations using a framework known as bundle adjustment. These methods require rotation and scale-invariant feature detection, effective matching, and algorithms such as Random Sample Consensus (RANSAC) for robust outlier removal [10]. Similar to most VSLAM approaches, feature-based methods also require modules such as pose-graph optimization, local mapping, and global map optimization to achieve high performance overall [13,14].

Initialization:
The initialization module refers to the initial local map and camera pose generation, and the first setup of the global world frame for the algorithm. The frontend usually performs the initialization by triangulating an initial set of keypoints using two-view reconstruction algorithms such as the five-point algorithm [15], eight-point algorithm [16], or objects with known geometry [14,17]. In feature-based monocular VSLAM, two-view initialization becomes delayed since at least two keyframes are needed for local mapping and tracking [5]. However, in stereo setups, a single stereo-pair can be used to initialize SLAM by directly estimating parameters of the essential matrix, which is a matrix that relates 2D image points corresponding to the same 3D landmark points in two different images [14,18]. Approaches such as model selection between homography (for planar scenes) and fundamental matrix (for general scenes) have also been proposed [5]. Although most approaches estimate local map and pose by estimating parameters of the homography or the essential matrix between two keyframes by matching point correspondences, recent line correspondence-based methods that assume a constant rotation model over three successive keyframes have been explored [19].
Feature matching/tracking: Feature matching and tracking are necessary to find corresponding feature points in successive images, known as correspondences [14]. The brute force approach to feature matching employs a comparison of the pairwise sum of squared distances or normalized cross correlation between the feature points detected in first and second images. However, due to the quadratic computational complexity of such an approach, other techniques have been explored utilizing data structures such as search trees, bag-of-words, and hash tables for fast lookup of neighboring feature points in both images [5,34]. In the presence of wide-baseline motion between the two images, corresponding feature points are searched at their expected positions in the second image using image motion and distortion models [21]. Such an approach predicts the expected location of the feature point by assuming a motion model between the two images. Epipolar match-ing is also used to reduce the search space in the second image to only the points along the epipolar lines corresponding to the feature points in the first image [35].

Keypoint outlier removal:
Often, corresponding feature points are mismatched, and due to a high number of such outliers in the set of correspondences, RANSAC is performed over a camera motion model [10]. Usually a perspective-n-point (PnP) or efficient PnP (EPnP) algorithm is used to generate a camera motion model upon which RANSAC [36] is performed. The 5-point, 6-point, 7-point, and 8-point algorithms are widely used PnP variations in the literature. However, varying degrees of accuracy have also been achieved with lower numbers of point correspondences. For line correspondences, Pumarola et al. [19] explored EPnPL [37] for model fitting.

Direct
Direct methods avoid feature extraction and are designed to process image data directly using numerical optimization techniques for minimizing photometric cost using most of the pixels in the image to obtain the initial estimates of the structure and motion [11]. Usually, variants of image alignment techniques based on the famous Lucas-Kanade algorithm [38] are used to iteratively estimate the parameters of camera motion and 3D structure of the scene. Modules such as pose-graph optimization, loop closure, and keyframe management are also used in direct methods to improve overall performance and robustness [6,11].
Image alignment: Dense tracking using image alignment has been performed primarily using variants of the optical flow algorithm proposed by Lucas and Kanade [38] to estimate warping parameters between consecutive images [39]. The camera motion and depth map are jointly optimized for a large number of pixels in the image with the cost function arg min where W(I, p) is the warp function applied to image I 1 with parameters p for transforming it close to the second image I 2 . Baker et al. [39] provide a detailed overview of four different configurations of the Lucas-Kanade algorithm (forward additive, inverse additive, forward compositional, and inverse compositional) based on the direction of warping between the template and image as well as on whether the update rule is additive or compositional.
Optimizer: Dense tracking methods require minimization of the energy function consisting of photometric residuals from all pixels based on brightness constancy [12]. A warping model is selected with appropriate parametrization for optimization [11]. The parameters of this warping transform are optimized with update steps based on steepest-descent, Newton, Gauss-Newton, and Levenberg-Marquardt formulations [39]. The overall framework of direct methods can be modelled as a maximum a posteriori (MAP) estimation [4]. The image alignment-based dense and semi-dense camera pose tracking optimization is essentially a maximization over the negative log likelihood of the photometric error over all pixels [40]. The negative log of the prior term therefore becomes the regularization term in the energy that is optimized. In DTAM [7], a non-convex energy functional was proposed with a photometric error data term and a regularization term that uses a spatial smoothness prior in an inverse depth parameterization.

Hybrid Approaches
In semi-direct methods, camera motion is estimated using feature-based keypoints tracked using optical flow. The mapping is performed using direct approach on keyframes that are generated after large baseline motion of the camera from the last keyframe [40]. Semi-dense methods combine the computational efficiency of feature-based methods and the ability to work in featureless conditions of direct methods to estimate motion and structure in real time [41].

Other Common Modules
The front-end in VSLAM is also responsible for the task of data association. In the short term, the front-end associates measurements to variables in consecutive camera frames. In the long term, the front-end is responsible for finding loop closures to generate looping constraints on the graph of keyframes for back-end optimization [4].
Keyframe management: Keyframes are specific frames captured by the camera when large disparities are observed. Keyframe-based techniques offer robustness in map generation due to better 3D landmark triangulation [42]. The frequency at which keyframes are chosen and discarded can be used to establish an adaptive usage of processing and memory resources used by the algorithm. The technique of bundle adjustment (BA) is usually performed on a subset (window) of keyframes for local mapping to reduce the computational complexity and to improve convergence of the optimization. Various strategies for insertion and culling of keyframes based on factors such as the number of co-visible points and disparity with the previous keyframe, and computational resource limits have been used in the literature [5,43].
The keyframe pose-point constraints are usually stored in the form of a covisibility graph with a threshold number of shared landmarks for efficient BA and pose-graph optimization [44,45].
Relocalization and loop detection: VSLAM systems fail on occlusions and fast camera movements when tracking cannot be performed. Techniques using place recognition and data structures such as co-visibility graphs and spanning tree of keyframes are used to detect previously observed landmarks and to either relocalize the camera or identify loop constraints on the pose graph of the system [46]. The performance of relocalization improves when place recognition techniques such as bag-of-words (DBoW2 [47]) are used [5].
Loop closure: Visual odometry systems lack global map consistency checks and are therefore prone to drift from the ground truth over time as motion estimation error accumulates [10]. This can be corrected by detecting constraints from instances of the system coming back to a previously observed location, also known as place recognition [48]. The visual vocabulary for both local and global image descriptors can be used to generate loop candidates [10]. Techniques using visual bag-of-words have been frequently used in the past to implement loop closure detection [49]. The detected loop candidates are then verified for geometric consistency using epipolar constraints against previously observed keypoints [50].

Back-End Modules of VSLAM
Feature-based, direct, and hybrid VSLAM approaches all share the overall framework since they take camera image frames as input and generate camera motion models and 3D models of the scene as the output. They, however, differ in some of the core modules associated with local mapping and the optimization framework. This section provides a description of the most commonly found internal modules in both feature-based and direct approaches. Figure 1 also provides a general illustration of the feature-based VS-LAM pipelines specifically and loosely generalizes the state-of-the-art pipelines from the literature such as ORB-SLAM [5], PL-SLAM [8], and PTAM [14].
Visual odometry systems accumulate both rotational and translational drift errors over time due to uncertainties in camera poses and 3D landmark positions. This issue is resolved by optimizing the old map points and camera poses with newly found geometrical constraints such as loop detection and multiple overlapping keyframes [42]. Several techniques such as performing BA on keyframes after loop constraint detection, and alternating between pose-graph optimization and BA [5] have been proposed that perform global bundle adjustment while also keeping the required computation within resource limits for real-time operation [3].
Bundle adjustment: Bundle adjustment is the estimation of camera positions T, camera orientations R, and 3D landmark positions X through a joint optimization on the reprojection error between the landmark projections onto the keyframes π(X) and matched keypoints. The optimization on reprojection error is defined as follows: arg min where n is the number of images, m is the number of point projections in the ith image and j iterates over the observation of m landmark points visible in the ith image as z ij . The nonlinear nature of such an optimization problem requires the use of nonlinear optimization algorithms such as Levenberg-Marquardt and Gauss-Newton. Local BA is used to obtain local 3D landmark positions around the current keyframe or a window of keyframes [13]. Full or global BAs can be performed on all keyframes, however, with increased computational cost. For multiple types of features (points, edgelets, and lines), cost functions are custom designed based on the reprojection constraints [8].
Factor graph optimization: The camera poses and their relative transforms are usually represented as a probabilistic graph, where the camera poses and landmark positions are the nodes and the probabilistic constraints between them are the edges [51]. Visual odometry algorithms usually accumulate drift over time, which can be modelled as the propagation of error in both poses and transforms through the pose graph [10]. Given initial noisy estimates for the variables such as camera poses and landmark positions, a factor graph can be used to calculate the maximum a posteriori (MAP) inference to optimize the graph further for consistency and to remove drift. An error term generated from the expected camera pose is summed up to a cost function over all edge constraints, arg min which on expanding the logarithm expression and dropping the constant prior factor becomes a sum of exponents as follows: arg min where an assumption of Gaussian distributions as P( This shows that, under Gaussian assumption, MAP inference is equivalent to solving a nonlinear least squares. At this point, a nonlinear optimizer such as the Gauss-Newton, Levenberg-Marquardt, or Dogleg algorithms could be used to generate the MAP estimate for the posterior P(X|Z) [51]. Such techniques have also been referred to as graph-based SLAM, pose-graph SLAM, and smoothing in the literature.
Loop constraints in a factor graph provide important information for extended consistency of the global map and for reduction of error in camera poses and their relative transforms. Several approaches for loop constraint detection have been proposed in the past using both local [52] and global [53,54] image feature descriptors.

Software Model for VSLAM Architecture
The principles of component-based software engineering help model a software system in terms of components, connectors, and rules to define the model topology using those components and connectors [9]. This section presents a novel architectural framework for VSLAM that not only has the potential to promote code reusability and maintainability but also can function to model existing state-of-the-art VSLAM systems. The following subsection defines VSLAM as consisting of both active components that perform computations on the processor and passive components such as data structures and databases that primarily store different kinds of information throughout the execution of the active components. We believe such a delineation of VSLAM modules into active and passive components as well as the interfaces between them can help make research on individual components of the VSLAM pipeline independent of the other components and, therefore, allow researchers to focus on improving particular components of VSLAM rather than on expending their efforts on system integration.

Architecture
The objects and their connections shown in Figure 2 illustrate the proposed architecture visually. This architecture consists of active component types and the interfaces between them as specified below:

1.
Feature detector: requires an image frame and descriptor type and provides image coordinates of the detected feature points and feature descriptors; 2.
Feature matcher: requires feature coordinates and descriptors from two images and provides feature point correspondences between two images; 3.
Local mapper: requires keypoint correspondences and camera intrinsics, and provides local 3D map points; 4.
Local pose estimator: requires feature point correspondences from two images and provides a T ∈ SE(3) transform between camera image and last keyframe; 5.
Keyframe manager: requires a new image frame and its feature points, and provides a keyframe decision and keyframe update; 6.
Loop detector: requires a last keyframe and a new image frame and provides a loop closure constraint for a factor graph; 7.
Nonlinear optimizer: requires a factor graph of measurement, motion, and loop constraints and provides an optimized graph based on maximum a posteriori inferencing; and 8.
Analyzer (for benchmarking purposes): requires the camera pose and 3D map points and provides accuracy measurements tested against benchmarking datasets.
The following passive components assist the active components in storing and retrieving different types of information as specified:

1.
Keyframe list: stores the keyframes collected by the keyframe manager based on the keyframe generation and culling conditions and continuously optimized by the factor-graph nonlinear optimizer component 2.
Factor graph: stores a graph of all the constraints from odometry; landmark measurements; loop closures; and other sensors such as Inertial Measurement Unit (IMU) factors, kinematics factors, etc.

Data Flow
This paper defines the data flow through the architecture as consisting of the messages that are exchanged at the interfaces between both the active and passive components. Selection of the appropriate set of message types can have a significant impact on the overall performance of the system. Based on the previous literature on VSLAM, we define a specific set of message types to be an appropriate selection for transfer of data across VSLAM components.
Keyframe messages: A keyframe message consists of the current image and the camera pose associated with the image. Keyframes are generated, stored, retrieved, and deleted by the keyframe manager component based on factors such as availability of wide-baseline images, limitations on computational resources, and detection of loop closures. The keyframe messages are stored in the keyframe database in the form of a covisibility graph and spanning tree for fast access using both the global map optimizer and the keyframe manager components. This definition gains inspiration from the survey by Younes et al. [42].

Keypoint messages:
The keypoint message consists of detected feature point coordinates of a correspondence pairs between two images. The filtered keypoint messages are used by the keyframe manager to generate keyframes when a threshold number of keypoints are detected in consecutive image frames. The keypoint messages are associated with the 3D landmark measurements in the sensor frame.
Image messages: An image message consists of the normal RGB images generated by a monocular camera with specified height and width parameters. Raw images are only processed by the feature detector and loop detector components to create sensor agnostic data for subsequent components to use. Image messages are not passed or stored anywhere in raw format to conserve space and to maintain computational efficiency.

Map point messages:
The map point messages consist of the 3D coordinates of the map points as generated by the local mapper. The map points are stored in the map database and occasionally optimized by the factor graph optimizer.
Camera pose messages: The camera pose messages consist of the 6 DoF position and orientation information needed to uniquely define the extrinsic parameters of the camera for a given time instant. The initial camera pose is generated by the initializer component and then later updated by the local mapper component.

Additional Modules
Resource monitor: The standard for robust perception as described by Cadena et al. in their review paper on visual odometry considers resource awareness as one of the important characteristics of any robust perception system. We therefore recommend an additional component in the VSLAM pipeline termed "resource monitor", which runs in the background and continuously monitors the CPU, GPU, memory, storage, and network usage metrics and dynamically changes the quantitative parameters of the VSLAM modules to vary the amount of processing time and space needed on-the-go accordingly.
Visualizer: To further enhance the productivity of researchers in VSLAM, we also recommend that particular emphasis be applied in the development of a graphical user interface, called the "visualizer" component, which extracts data from every interface between the VSLAM components and offers clear and reliable visualization of the different messages that flow through the VSLAM pipeline.

Algorithms, Tools, and Libraries
Since most of the VSLAM back-end problems can be formulated as optimization problems, software libraries that offer fast and accurate numerical optimization are indispensable to the current state-of-the-art in VSLAM implementations. In VSLAM pipelines, modules involving bundle adjustment, pose-graph optimization, and direct image alignment require efficient and real-time optimization over cost functions with thousands of parameters. This has only been achieved due to open-source implementations of software packages for large optimization problems including linear, nonlinear, constrained, and unconstrained ones.
Georgia Tech Smoothing and Mapping (GTSAM) is a smoothing and mapping (SAM) library for robotics and vision in C++ built using Bayesian networks and factor graphs. The General Graph Optimization (g2o) package offers a powerful back-end for solving optimization problems where constraints can be formulated as nodes and edges, such as in graph-based SLAM and bundle adjustment [51]. Ceres is a high-performance C++ library for solving large optimization problems such as nonlinear least squares with bounds and general unconstrained optimization [55]. The Incremental Smoothing and Mapping (iSAM and iSAM2) libraries are sparse nonlinear optimization libraries implemented in C++ that calculate incremental updates using efficient matrix factorization [56].

Benchmarking and Datasets
In this section, we discuss several datasets that have been used for evaluating visual SLAM systems [57]. The following subsections of this paper describe these datasets and compare the performance of VSLAM implementations that have been evaluated on each dataset by the respective authors of the VSLAM implementations.

KITTI Odometry:
The KITTI Odometry dataset [57] consists of 22 stereo sequences taken from four Flea4 cameras, 3D laser scans taken from a Velodyne HDL-64E, and GPS-RTK readings taken on their self-driving platform over a length of 39.2 km in outdoor conditions. The sensor fusion and localization system based on OXTS RT3003 is used to define the ground truth for benchmarking. The KITTI dataset has been used for evaluation by ORB-SLAM [13], DSO [58], LDSO [59], GDVO [60], and Stereo LSD-VO [61], as shown in Table 1. Since ORB-SLAM2 is the most complete state-of-the-art VSLAM pipeline in featurebased methods (includes almost all the modules of VSLAM as proposed in the model), it achieves the best performances on this dataset. Stereo DSO [62] and Stereo LSD [61] can be observed in Table 1 to be a close seconds in terms of relative and absolute translational errors, respectively.  [63] consists of 11 sequences captured from the on-board stereo camera and IMU of a quadrotor in varying motion and illumination conditions. The external motion capture system and 3D scans of the environment are provided as the ground truth for evaluation of the SLAM systems. The EuRoC dataset has been used in evaluating ORB-SLAM2 [13], LSD-SLAM [61], SVO [17], DSO [58], and PL-SLAM [8] as shown in Table 2. This is one of the most challenging datasets for any VSLAM pipeline as the motion consists of rapid translations and rotations, as observed from a micro-aerial vehicle. ORB-SLAM2 can again be observed to offer one of the best performances compared to other state-of-the-art VSLAM pipelines in Table 2.

TUM RGB-D:
The TUM RGB-D dataset [64] consists of color and depth sequences captured from a Microsoft Kinect in two different environments at 30 Hz. The ground truth is defined by the time-synchronized 6 DoF poses provided by an external motion capture system at 100 Hz. The TUM RGB-D dataset has been used for evaluation by PL-SLAM [8], ORB-SLAM [5], LSD-SLAM [6], Semidense-VO [41], and PL-SVO [65], as shown in Table 3.
The PTAM [14] and ORB-SLAM [5] pipelines deliver the best performances on this dataset out of the solely vision-based SLAM pipelines given in Table 3.  [66] consists of RGB-D sequences, ground truth camera poses, and 3D surface models for four different trajectories generated in a synthetic environment. This is the first dataset to provide a 3D surface ground truth for assessing reconstruction accuracy in SLAM systems. The ICL-NUIM dataset has been used for the evaluation of SVO, ORB-SLAM, DSO, and LSD-SLAM by Forster et al. [17] and of PL-SVO by Gomez-Ojeda et al. [65]. The New College dataset [67] consists of 30 GB of data collected as 5 DoF odometry, omnidirectional, and stereo camera sequences captured by a robotic vehicle driving through college campus. The New College dataset has been used for evaluation of RSLAM [68] and ORB-SLAM [5]. Other notable datasets used for evaluation of the VSLAM systems include RobotCar dataset [69], TrakMark [70], SLAMBench2 [71], and the dataset proposed by Martull et al. [72].

Conclusions
Numerous approaches to VSLAM including filter-based, feature-based, direct, semidirect, and semi-dense methods have been developed and tested in the past. Exploration of VSLAM has fueled research in several different areas of computer vision such as feature detection, place recognition, numerical optimization methods, and visual geometry. The landscape of VSLAM pipelines continues to grow, with new approaches being continuously proposed. A standardized model of VSLAM pipelines is required now more than ever before. Appropriate delineation of module boundaries and definition of the interfaces between them can help improve conceptual understanding, can optimize performance, and can offer high code reuse by bringing consistency in implementation techniques. Several aspects for robust 3D perception using monocular and stereo visual cameras have not been addressed sufficiently enough for high reliability, efficiency, and performance. This paper reviews VSLAM implementations through the lens of a consistent model, compares their performances as presented in the literature, and provides a concise summary of open source libraries that have been used to develop VSLAM systems.

Conflicts of Interest:
The authors declare no conflict of interest.