A Review on Visual-SLAM: Advancements from Geometric Modelling to Learning-Based Semantic Scene Understanding Using Multi-Modal Sensor Fusion

Simultaneous Localisation and Mapping (SLAM) is one of the fundamental problems in autonomous mobile robots, where a robot needs to reconstruct a previously unseen environment while simultaneously localising itself with respect to the map. In particular, Visual-SLAM uses various sensors on the mobile robot for collecting and sensing a representation of the map. Traditionally, geometric model-based techniques were used to tackle the SLAM problem, but these tend to be error-prone in challenging environments. Recent advancements in computer vision, such as deep learning techniques, have provided a data-driven approach to tackle the Visual-SLAM problem. This review summarises recent advancements in the Visual-SLAM domain using various learning-based methods. We begin by providing a concise overview of the geometric model-based approaches, followed by technical reviews of the current paradigms in SLAM. Then, we present the various learning-based approaches to collecting sensory inputs from mobile robots and performing scene understanding. The current paradigms in deep-learning-based semantic understanding are discussed and placed in the context of Visual-SLAM. Finally, we discuss challenges and further opportunities in the direction of learning-based approaches in Visual-SLAM.


Introduction
Autonomous navigation in mobile robots has become an active research field in recent years due to advancements in material science for robot construction, compact batteries for prolonged remote operations, and increases in computational hardware for powering algorithmic and artificial intelligence methods. Mobile robots are capable of navigating within some environment to perform their respective objectives. Most autonomous robots need to move around in the environment while respecting its physical constraints. Robots "see and understand" the world by collecting information from their attached sensors and making sense of the readings. Specifically, for robots to interact with the real world, they need the capability to understand the scene geometrically and to interpret it semantically. Accurate localisation is essential for the robot to compute suitable actions, and a good perception of the environment allows the robot to react to its surroundings.
Mobile robots typically receive sensor information from their attached sensors [1], for example, in the form of 2D projections of image frames or 3D spatial points from high-frequency LiDAR scans [2]. However, this perceived information is often insufficient for the robot to navigate, as it lacks a geometric understanding and reconstruction of the scene. Geometric modelling is especially essential in complex tasks where the robot needs to localise itself with respect to the modelled map of the scene to navigate and accomplish its objectives. For example, mobile robot navigation often requires the robot to maintain a map representation for planning in tasks such as motion planning [3], navigating mobile robots [4], moving robotic arms [5], or even autonomous cars [6]. It is infeasible for the robot to perform safe autonomous operations in a dynamic environment if it lacks the ability to perceive and make sense of potential obstacles. This is especially important when the robot operates in a novel region without any prior information about the environment, for example, in planetary exploration or search and rescue operations.
Difficulty in the geometric reconstruction of the environment while navigating within it often arises from issues like sensor modalities, misalignment of map representations, or observation noise [7]-[9]. Simultaneous Localisation and Mapping (SLAM) addresses this problem as an online incremental process in which the mobile robot iteratively refines the reconstructed map and its current location by observing more of the unknown environment with various sensors [10]. In Visual-SLAM, the most straightforward representation of the environment often consists of a collection of sparse 3D points processed continuously during navigation. A robotic system collects visual inputs for Visual-SLAM while operating in some unknown environment [11]; it constructs a map of the surroundings using various sensors while simultaneously estimating its position with respect to the environment. The constructed map can be used for mapping a novel environment or for the robot to plan its mission autonomously. Such a system can maintain stability, plan its own movements, and react to dynamic changes in the surroundings without human intervention [12]. SLAM for mobile robot semantic scene understanding involves the domains of robotics, computer vision, and sensor technologies. A wide range of sensors can be used for SLAM. For example, most autonomous vehicles use Light Detection and Ranging (LiDAR) sensors [13] or stereo cameras [14] to perceive the surrounding environment during navigation. LiDAR can often provide a more accurate environment representation by producing a 3D point cloud with ranging measurements. Traditional LiDAR sensors are rarely used in consumer-grade mobile robots due to their high cost; however, advancements in manufacturing have made LiDAR sensors more common in mobile robot navigation [15], [16]. In contrast, stereo cameras are more ubiquitous as their manufacturing cost is much lower than LiDAR. On the other hand, the hybridisation of
multiple sensors has been shown to significantly enhance localisation and mapping performance in most SLAM approaches [17]. Hybridising visual sensors with classical proprioceptive sensors such as IMUs or odometers can often reduce the localisation drift caused by the cumulative error of these relative positioning approaches. Rather than there being some superior type of sensor, different sensing methods have their relative strengths and weaknesses [18], [19]. For example, laser scanners in LiDAR are efficient for obstacle detection but are highly sensitive to weather conditions like rain. In contrast, RGB cameras can extract semantic meaning from the captured images but are sensitive to lighting conditions. Therefore, most sensors are complementary, and it remains an open research question how to match and balance each type of sensor with its respective strengths and weaknesses. Semantic scene understanding is neglected in traditional approaches, which focus only on the geometric reconstruction of the environment. Rather than treating the collected points as unrelated, semantic understanding assigns higher-level meanings to the collected data [20]. Real-world environments often contain many structures and objects that carry high-level semantic information and can act as landmarks in SLAM. Assigning semantic meaning can help reconstruct the scene by inferring missing information and by providing complementary information [21], [22]. Moreover, reconstructing the geometric representation of the scene along with its semantic meaning can help the mobile robot make higher-level decisions, such as selecting landmarks suitable for the environment and informing the planner when deciding its mission route.
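As an illustration of how complementary sensors combine, the classical information-form fusion of two independent Gaussian position estimates weights each sensor by its certainty. The values below are invented, and the snippet is a generic textbook sketch rather than any particular SLAM system's fusion module:

```python
import numpy as np

# Covariance-weighted fusion of two independent position estimates,
# e.g. one from a camera and one from an IMU/odometry chain.
# Measurement values and covariances are illustrative assumptions.
def fuse(x1, P1, x2, P2):
    # Information-form fusion: sum the information matrices (inverse
    # covariances); the fused mean weights each sensor by its certainty.
    P = np.linalg.inv(np.linalg.inv(P1) + np.linalg.inv(P2))
    x = P @ (np.linalg.inv(P1) @ x1 + np.linalg.inv(P2) @ x2)
    return x, P

x_cam = np.array([2.0, 1.0]); P_cam = np.diag([0.04, 0.25])  # good in x, poor in y
x_imu = np.array([2.2, 1.1]); P_imu = np.diag([0.25, 0.04])  # poor in x, good in y
x, P = fuse(x_cam, P_cam, x_imu, P_imu)
print(x)  # each axis is dominated by the more certain sensor
```

The fused estimate is tighter than either input on both axes, which is why hybridising complementary sensors reduces drift.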
This review article provides an overview of the current state-of-the-art Visual-SLAM paradigms and models with a focus on sensor fusion. This paper is organised as follows. We begin by providing a concise summary of the theory behind the SLAM process and the formulation of the geometric modelling of the surrounding environment. Then, we present an overview of the evolving SLAM paradigms throughout recent years, including approaches in both pure geometric reconstruction and semantic scene understanding using deep-learning models. Finally, we discuss the current state-of-the-art Visual-SLAM models to understand our progress and future directions in SLAM.

Simultaneous Localisation and Mapping
Simultaneous Localisation and Mapping (SLAM) is a problem where a robot needs to operate in an unknown environment, constructing a map while estimating its uncertain location [23]. SLAM is fundamental to numerous robotics applications that require the robot to autonomously navigate within some environment and interact with the real world. In the following, we introduce the problem setup in SLAM, followed by a formalisation of the fundamental theory behind SLAM algorithms.

Problem Setup
The problem's difficulty comes from its recursive dependency: constructing a map often depends on the robot observing the environment from some known location, while state estimation often requires the robot to infer its location from known landmarks. A SLAM algorithm estimates the sensor motion and simultaneously reconstructs the geometrical structure of the visited area. Chatila and Laumond [24] first formalised the problem setup in 1985 for mobile robot navigation. The problem lies in the need to model the environment and for the robot to locate itself correctly despite the inaccuracies introduced by the sensors. The proposed methodology defines a general principle to deal with uncertainties in the collected data and for a mobile robot to define its reference landmarks while exploring the environment.

Figure 1: Formulating the Visual-SLAM problem with a factor graph, where the camera poses are denoted as X^w_i and landmarks as l_j. The observations of the landmarks and the odometry at various camera poses are denoted as z_k and u_i, respectively. The prior belief on the initial pose is denoted as P_0, and the joint probability distribution of the MAP problem can be computed as the product of the depicted factors.
The fundamental idea of SLAM lies in using landmark correlations, data association, and loop closure to reduce the uncertainties about previously visited areas and poses [25]. Traditional techniques for sequential state estimation include the Kalman filter [26], which is the optimal state estimator in a linear system with Gaussian noise. Practical implementations of SLAM often use the extended Kalman filter (EKF) for state estimation [27], which is advantageous because the Gaussian assumption allows the posterior to be computed analytically. If the system has non-Gaussian noise, the Kalman filter is still the optimal linear filter but performs worse than other techniques. For nonlinear systems, methods such as the particle filter [28] can be a more flexible alternative, as they do not rely on local linearisation or crude functional approximations. However, the higher performance comes at a higher computational cost than Kalman filters. In particle filtering, we perform weighted sampling to estimate the distribution of the robot state, rather than obtaining it analytically from the mean and covariance matrix of a Gaussian distribution.
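The predict, weight, and resample cycle of particle filtering can be sketched in a few lines. This is a minimal illustration with made-up motion and measurement models (a 1-D robot with additive Gaussian noise), not a production implementation:

```python
import numpy as np

# Minimal 1-D particle filter step (predict + weight + resample),
# contrasting with the analytic Kalman update. Motion and measurement
# models are illustrative assumptions: x' = x + u + noise, z = x + noise.
rng = np.random.default_rng(0)
n = 5000
particles = rng.normal(0.0, 1.0, n)      # prior belief over robot position

u, motion_std = 1.0, 0.2                 # control input and motion noise
particles = particles + u + rng.normal(0.0, motion_std, n)   # predict

z, meas_std = 1.1, 0.5                   # range-style observation
weights = np.exp(-0.5 * ((z - particles) / meas_std) ** 2)   # likelihood
weights /= weights.sum()

# Resample: draw particles proportionally to their weights.
particles = rng.choice(particles, size=n, p=weights)
print(particles.mean())                  # posterior mean estimate
```

For this linear-Gaussian toy case the analytic posterior mean is about 1.08, so the particle estimate should land nearby; the advantage of the particle representation only appears once the models become nonlinear or non-Gaussian.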
There are multiple metrics to measure the benefit of actions. For example, A-optimality measures the trace of the covariance matrix [29], which is equivalent to minimising the mean squared error between the data and model parameters. D-optimality, on the other hand, minimises the determinant of the covariance matrix [30], which is equivalent to minimising the entropy of the SLAM system [31]. As a further example of informative features, we can utilise building structure lines for localisation and mapping; these encode global orientation information that constrains the robot's heading over time. Such features help eliminate accumulated orientation errors and reduce position drift in SLAM algorithms. In SLAM, loop closure can also reduce drift errors by allowing the robot to reset its estimated state upon revisiting a known portion of the map [30]. Active SLAM methods often exploit this property by guiding the robot to regions where it can close the loop [32], which can significantly reduce the localisation error [33]. Autonomous navigation in an indoor environment often requires multiple sensory inputs and actuating outputs. Wheeled ground mobile robots are often designed with a differential drive base that uses DC geared or stepper motors for their driving wheels. Mobile robots collect data from onboard sensors like wheel encoders, inertial measurement units (IMUs), RGB cameras for visual inputs, or LiDAR as remote sensing to measure ranges.
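The two optimality criteria can be contrasted numerically. The sketch below uses two invented candidate covariance matrices to show that the trace-based and determinant-based scores can favour different actions:

```python
import numpy as np

# A-optimality (trace) and D-optimality (log-determinant) of a pose
# covariance, used to score candidate actions in active SLAM.
# The two candidate covariances below are illustrative assumptions.
P_a = np.diag([0.5, 0.5])                # action A: moderate, isotropic uncertainty
P_b = np.diag([0.05, 2.0])               # action B: certain in x, poor in y

for name, P in [("A", P_a), ("B", P_b)]:
    a_opt = np.trace(P)                  # proportional to mean squared error
    d_opt = np.log(np.linalg.det(P))     # proportional to Gaussian entropy
    print(name, a_opt, d_opt)
# A-optimality prefers action A (trace 1.0 < 2.05), while D-optimality
# prefers action B (det 0.1 < 0.25): the two criteria can disagree.
```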
Loop closing detects whether a given keyframe has been seen previously [34], [35]. Loop closure can be formulated as an optimisation problem, such as a nonlinear least squares problem that matches the current scans with previously visited areas. One reason that loop closing is hard in SLAM is that the internal estimates can, despite best efforts, be in gross error. Loop closing is essentially a data association problem, where a positive loop closure occurs when the robot recognises the local scene as one that it has previously visited. Traditional feature-based SLAM uses simple geometric primitives such as corners or lines as features. When a loop closure is detected, it provides an opportunity to constrain the robot's internal estimate of its current state with respect to the map.

SLAM Formulation
SLAM is a multi-discipline problem that spans both the computer vision and robotics domains, and it is traditionally formulated as a maximum a posteriori (MAP) estimation problem. In Visual-SLAM, we define X = {X^w_i}_{i=1}^N as the trajectory of the robot over time, where each pose X^w_i is parameterised in the set of rigid Euclidean transformations SE(3). Let L = {l_j}_{j=1}^M denote the set of landmarks parameterised by their appropriate representation space, Z = {z_k}_{k=1}^K be the set of observations of the detected landmarks, and U = {u_i}_{i=1}^{N-1} be the set of odometry measurements between robot poses. The observations Z of the landmarks are collected under some observation model h_k(·), given by

    z_k = h_k(X^w_{i_k}, l_{j_k}) + ε_k,    (1)

where X^w_{i_k} and l_{j_k} denote the actual robot state and landmark pose, and ε_k is a random measurement noise. The solution to the SLAM problem is the optimal MAP estimate

    (X̂, L̂) = arg max_{X,L} P(X, L | Z, U),    (2)

where P(X, L | Z, U) is the joint probability of all latent estimate variables given all of our previous observations and measurements. For a classical SLAM problem without odometry measurements [36], we can rewrite (2) as

    (X̂, L̂) = arg max_{X,L} P(Z | X, L) P(X, L),    (3)

where P(Z | X, L) is the likelihood of obtaining the measurements Z given X and L, and P(X, L) is the prior knowledge on X and L. Assuming that each observation z_k is independent, we can then factorise (3) as

    (X̂, L̂) = arg max_{X,L} P(X, L) ∏_{k=1}^K P(z_k | X^w_{i_k}, l_{j_k}).    (4)

Factor Graph and Loop-Closure
The factor graph [37] represents an essential part of modern approaches to the probabilistic SLAM problem, enabling factorisation of and inference over arbitrary distribution functions. A factor graph G(V, F; E) is a bipartite graph that determines the factorisation of variables from a global function into a product of local functions. Specifically, the set of vertices V in the graph G represents the latent variables that participate in the estimation process. The set of factors F represents the prior knowledge regarding variable nodes and the constraints between nodes, where the connections between nodes are represented by the set of edges E.
We can represent a classical SLAM problem as a factor graph as depicted in Fig. 1, where the joint probability distribution of the MAP estimation is factorised as a product over observation factors. Using the factor graph notation, we can rewrite the MAP formulation in (2) as

    (X̂, L̂) = arg min_{X,L} ∑_{k=1}^K ‖h_k(X^w_{i_k}, l_{j_k}) ⊖ z_k‖²_{Σ_k},    (5)

where h_k denotes the k-th factor of observing a landmark l_{j_k} from the camera pose X^w_{i_k} with the sensor model for z_k, the notation ‖·‖²_Σ denotes the squared Mahalanobis norm with covariance matrix Σ, and ⊖ is the difference operator in the target measurement space.
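The squared Mahalanobis norm appearing in this objective is straightforward to compute; the sketch below uses an invented residual and covariance purely for illustration:

```python
import numpy as np

# Squared Mahalanobis norm ||r||^2_Sigma = r^T Sigma^{-1} r, used to
# weight each factor's residual by its measurement covariance.
# Residual and covariance values are illustrative assumptions.
r = np.array([0.3, -0.1])                     # residual h_k(X, l) - z_k
Sigma = np.array([[0.09, 0.0],
                  [0.0, 0.01]])               # measurement covariance

sq_mahalanobis = float(r @ np.linalg.inv(Sigma) @ r)
print(sq_mahalanobis)  # (0.3^2)/0.09 + (0.1^2)/0.01 = 1.0 + 1.0 = 2.0
```

A precise sensor (small variance) thus penalises the same residual far more heavily than a noisy one.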
The SLAM problem can be formulated as a Bayes net under the factor graph formulation, with factorisation of and inference over probability distributions and functions [36], [38]-[41]. A factor graph is a bipartite graph that characterises how a global multi-variable function can be factorised into a product of local functions. Each blue and green node in Fig. 1, also known as a variable node, represents a latent variable that needs to be estimated, which in the case of SLAM is the state of the robot or a landmark. The nodes in between the variables are known as factors, which represent the set of constraints and information between the variables. We can use a factor graph to factorise a joint probability distribution over some random variables by encoding the inherent conditional independence of local variables into the joint probability distribution.
The joint probability distribution of all the latent estimate variables of the SLAM problem can be written as

    P(X, L | Z, U) ∝ P(X^w_0) ∏_{i=1}^{N-1} P(X^w_i | X^w_{i-1}, u_{i-1}) ∏_{k=1}^K P(z_k | X^w_{i_k}, l_{j_k}),    (8)

where P(X^w_0) ≡ P_0 is the prior belief on the robot's initial pose, P(z_k | X^w_{i_k}, l_{j_k}) represents the effect of landmark observation z_k given the data association (i_k, j_k), and P(X^w_i | X^w_{i-1}, u_{i-1}) represents the state update given the motion model. Assuming zero-mean Gaussian observation noise for the observations Z and odometry U, we can rewrite (8) as

    P(X, L | Z, U) ∝ P_0 ∏_{k=1}^K exp(-½ ‖h_k(X^w_{i_k}, l_{j_k}) ⊖ z_k‖²_{Σ_k}) ∏_{i=1}^{N-1} exp(-½ ‖f_o(X^w_{i-1}, u_{i-1}) ⊖ X^w_i‖²_{Σ_o}),    (9)

where the first product captures the effect of the observations and the second the effect of the odometry, h_k is the sensor model, f_o is the motion model, and Σ_k and Σ_o are the covariance matrices for the Gaussian noise in Z and U, respectively.
We can further factorise this joint probability distribution to obtain the optimal MAP estimate by solving the equivalent least-squares form

    (X̂, L̂) = arg min_{X,L} -log P(X, L | Z, U)    (10)
            = arg min_{X,L} ∑_{k=1}^K ‖h_k(X^w_{i_k}, l_{j_k}) ⊖ z_k‖²_{Σ_k} + ∑_{i=1}^{N-1} ‖f_o(X^w_{i-1}, u_{i-1}) ⊖ X^w_i‖²_{Σ_o},    (11)

which can be interpreted graphically as a factor graph like the ones shown in Figs. 1 and 2.
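The least-squares form above can be made concrete with a toy example. The sketch below solves a 1-D pose graph with a prior factor, three noisy odometry factors, and one loop-closure factor as a weighted linear least-squares problem; all measurement values and weights are invented for illustration:

```python
import numpy as np

# Toy 1-D pose graph: the MAP problem reduced to weighted linear least
# squares. Unknowns: poses x0..x3. Factors: a prior on x0, odometry
# measurements u_i ~ x_{i+1} - x_i, and a loop-closure measurement
# z ~ x3 - x0. All values below are made-up illustration numbers.
u = [1.1, 0.9, 1.05]          # noisy odometry (true step = 1.0)
z_loop = 3.0                  # loop-closure: relative measurement x3 - x0

rows, rhs, w = [], [], []
rows.append([1, 0, 0, 0]); rhs.append(0.0); w.append(1000.0)  # prior x0 = 0
for i, ui in enumerate(u):                                    # odometry factors
    r = [0, 0, 0, 0]; r[i] = -1; r[i + 1] = 1
    rows.append(r); rhs.append(ui); w.append(1.0)
rows.append([-1, 0, 0, 1]); rhs.append(z_loop); w.append(1.0) # loop closure

# Weighted least squares: scale each residual row by the square root of
# its weight (i.e. the inverse measurement standard deviation).
A = np.array(rows) * np.sqrt(w)[:, None]
b = np.array(rhs) * np.sqrt(w)
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x)  # dead-reckoned x3 would be 3.05; loop closure pulls it to ~3.0125
```

Even this tiny example shows the mechanism discussed earlier: the loop-closure factor redistributes the accumulated odometry error across the whole trajectory.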
The depiction in Fig. 2 indicates an instance of the classical bundle adjustment (BA) [42], [43]. In BA, the factor graph's variable nodes can be considered camera poses and 3D landmarks, with the goal of minimising the re-projection error factors. BA applications use sensor information from odometry in mobile robots or IMUs to further improve the accuracy of the estimated robot trajectory. In Fig. 2, the loop-closure factors can be extended to higher-level entities that impose more sophisticated constraints and factors. Loop closure can often improve the consistency of the mapping results [33], as these factors act as additional constraints during the factorisation of the joint distribution [44].

Evolution of SLAM Techniques and Paradigms
Various approaches to SLAM have been proposed throughout the years to address different challenges. The following discusses the operational process of traditional SLAM algorithms and recent developments in SLAM paradigms. A summary of their algorithmic approaches, advantages, and shortcomings is provided as follows.
• Fast-SLAM (2002) [45] addresses the localisation problem by using a decomposition strategy for recursively estimating the full posterior distribution over robot pose and landmark locations. The algorithm performs an exact factorisation of the posterior into a conditional landmark distribution and a distribution over robot paths.
Advantage: The complexity scales logarithmically with the number of landmarks on the map. Disadvantage: FastSLAM behaves like a non-optimal local search algorithm; it is capable of producing consistent uncertainty estimates in the short term but, in the long term, it is unable to explore the state space as adequately as a Bayesian estimator.
• Extended Kalman Filter (EKF) SLAM (2007) [46] uses a divide and conquer approach to obtain a consistent state estimate, using the state covariance to represent the real error in the estimation process. EKF SLAM iterates through prediction, observation, and update phases to perform state estimation in a Bayesian manner. Advantage: EKF SLAM often achieves a more consistent estimation than other approaches as it computes the exact solution rather than using approximation, and the approach tackles the combinatorial complexity. Disadvantage: Despite the consistent estimation, the approach uses probabilistic inference to forecast the current state, which might diverge from the actual current state.
• V-SLAM (2011) [47] computes locally dense stereo correspondences from the potentially sparse raw representation. The dense representation avoids the sparsity problem that often arises when operating SLAM with a sparse set of landmarks. Advantage: The computational overhead is relatively small, and the dense representation increases the robustness of the inner SLAM algorithm in a sparse environment. Disadvantage: The dense representation can be more sensitive to sensor flaws or changes in the environment.
• Large-Scale Direct (LSD) SLAM (2015) [14] aligns images directly using the photoconsistency of high-contrast pixels. LSD SLAM can concurrently estimate the depths at these pixels using static stereo and temporal multi-view stereo by utilising the camera motion. Advantage: The SLAM can operate directly at the pixel level rather than requiring a separate procedure for processing the captured images. Disadvantage: The procedure can be costly when computing the translational motion between frames.
• ORB-SLAM2 (2017) [48] makes use of multiple features from monocular, stereo and RGB-D cameras, which greatly enhances the versatility of the method. The algorithm uses bundle adjustment to create a 3D map of the environment by extracting features from different images and placing them in 3D. Advantage: ORB-SLAM2 is highly versatile and can perform sensor fusion to improve detection quality. The model includes loop-closure detection, keyframe selection and per-frame localisation, which enhance its robustness. Disadvantage: High processing cost, which might be prohibitive for small systems.
• 2D-LiDAR SLAM (2018) [49] is an algorithmic approach that uses a laser sensor to create a 2D view of its surroundings. The method uses laser and visual fusion to provide localisation by combining laser-based SLAM and monocular camera-based SLAM. Advantage: The fusion allows high performance in spotting complex structures like hollow ceilings and can achieve high precision even at a significant distance range. Disadvantage: The 2D-LiDAR-based approach is highly sensitive to visibility conditions and performs poorly in bad weather.
• GRAPH-SLAM (2019) [50] utilises a stochastic gradient descent approach for nonlinear optimisation. GRAPH-SLAM uses radar sensors to perform point matching with ICP. Advantage: The approach exploits the higher range and angular resolutions of radar for performing SLAM over long tracks. Disadvantage: GRAPH-SLAM can be sensitive to the choice of parameters and requires fine-tuning.
• Particle Filter SLAM (2020) [51] uses a Monte Carlo sequential filtering method to maintain an estimated distribution of the current robot state. The filtering process is performed with state identification, mass modification and a resampling procedure. Advantage: The particle representation can capture arbitrary, non-Gaussian state distributions. Disadvantage: It requires many particles to perform state estimation in an environment with a large spatial area; otherwise, the likelihood mass becomes spatially separated with large gaps.
• Direct Sparse Mapping (DSM) (2020) [52] adopts the photometric bundle adjustment (PBA) method for SLAM, which was shown to be effective for estimating scene geometry and camera motion in Visual Odometry (VO). Unlike PBA, which estimates the camera odometry with a temporary map, DSM builds a persistent map for SLAM usage. Advantage: DSM is a direct monocular Visual-SLAM method that detects point observations and extracts the geometric information from the photometric formulation. Disadvantage: PBA is needed during the DSM procedure, significantly increasing the runtime processing cost.

Visual-SLAM
Visual-SLAM and visual sensors have been the main research direction for SLAM solutions due to their capability of collecting a large amount of information and their measurement range for mapping. The principle of Visual-SLAM lies in the sequential estimation of the camera motions from the perceived movements of pixels in the image sequence. Besides robotics, Visual-SLAM is also essential for many vision-based applications such as virtual and augmented reality. Many existing Visual-SLAM methods explicitly model camera projections, motions, and environments based on visual geometry. Recently, many methods have assigned and incorporated semantic meaning to the observed objects to provide a localisation that is more robust against observation noise and dynamic objects. In this section, we review the different families of algorithms within the branch of Visual-SLAM. Early approaches were dominated by filtering techniques such as EKF-SLAM; however, the computational complexity of these methods increased with the number of landmarks, and they did not efficiently handle non-linearities in the measurements [57]. FastSLAM was proposed to improve upon EKF-SLAM by combining particle filters with EKFs for landmark estimation [45]. However, it also suffered from sample degeneracy when sampling the proposal distribution. Parallel Tracking and Mapping [58] was proposed to address this issue by splitting the pose and map estimation into separate threads, which enhances real-time performance [59], [60].
A place recognition system with ORB features was first proposed in [61], developed based on Bag-of-Words (BoW). ORB is a rotation-invariant and scale-aware feature [62] that can be extracted at high frequency. Place recognition algorithms can often be highly efficient and run in real time. The algorithm is helpful for relocalisation and loop closure in Visual-SLAM, and it was further developed with monocular cameras for operation in large-scale environments [63].
RGB-D SLAM [63] is another feature-based SLAM that uses feature points to generate dense and accurate 3D maps. Several models utilise the active camera sensor to develop a 6-DOF motion tracking model capable of 3D reconstruction, achieving impressive performance even under challenging scenarios [64], [65]. In contrast to low-level point features, high-level objects often provide more accurate tracking performance. For example, using a planar SLAM system, we can detect planes in the environment to yield a planar map while detecting objects such as desks and chairs for localisation [66]. The recognition of the objects, however, requires an offline supervised-learning procedure before executing the SLAM procedure.
Direct SLAM refers to methods that directly use the input images without any feature detectors or descriptors. In contrast to feature-based methods, these feature-less approaches generally use photometric consistency to register two successive images. Using deep-learning models to extract the environment's feature representation is promising in numerous robotic domains [67], [68]. For example, DTAM [69], LSD-SLAM [70] and SVO [71] are some of the models that have achieved considerable success. DSO models [72], [73] have also been shown to be capable of using a bundle adjustment pipeline of temporal multi-view stereo to achieve high accuracy in a real-time system. In addition, models such as CodeSLAM [74] and CNN-SLAM [75] use deep-learning approaches to extract dense representations of the environment for performing direct SLAM. However, direct SLAM is often more time-consuming than feature-based SLAM since it operates directly on the image space.
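The photometric consistency objective that direct methods minimise can be illustrated on synthetic data. The sketch below restricts the camera motion to a horizontal pixel shift and evaluates the photometric error only at high-gradient pixels, loosely in the spirit of high-contrast pixel selection in direct methods; it is not any published system's implementation:

```python
import numpy as np

# Minimal sketch of the photometric error minimised by direct methods:
# for a candidate camera motion (reduced here to an integer horizontal
# shift), compare intensities of high-gradient pixels in the reference
# image against the warped current image. All data is synthetic.
ref = np.zeros((8, 8)); ref[:, 4] = 1.0          # bright vertical edge
cur = np.roll(ref, 1, axis=1)                    # camera moved 1 px right

grad = np.abs(np.diff(ref, axis=1))              # select high-contrast pixels
mask = np.zeros_like(ref, dtype=bool); mask[:, :-1] = grad > 0.5

def photometric_error(shift):
    warped = np.roll(cur, -shift, axis=1)        # warp current image back
    return float(np.sum((warped[mask] - ref[mask]) ** 2))

errors = {s: photometric_error(s) for s in range(-2, 3)}
best = min(errors, key=errors.get)
print(best)  # the true 1-px motion minimises the photometric error
```

Real systems minimise this residual over continuous SE(3) motions and per-pixel depths with nonlinear least squares, but the objective has the same shape.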

Localisation with Scene Modelling
Deep learning plays an essential role in scene understanding by utilising a range of information in techniques such as CNN classification. CNNs can be applied to RGB images to extract semantic information, such as detecting scenes or pedestrians within the images [77]-[79]. CNNs can also operate directly on point cloud information captured from range-based sensors like LiDAR. Models like PointNet [53], shown in Fig. 3, can classify objects based purely on point clouds. For example, PointNet++ [53], TangentConvolutions [80], DOPS [81] and RandLA-Net [82] are some of the recent deep learning models that can perform semantic understanding on large-scale point clouds. Most models are trained on point cloud datasets that enable them to infer object and scene information based purely on the geometric orientation of the input points.

Figure 4: Using Dynamic Scene Graphs (DSG) [76] for generating a multi-layer abstraction of an indoor environment.

Figure 5: Using environment features to create a semantic map. SUMA++ [84] operates in an environment using LiDAR sensors, which provide rich information for understanding the surroundings of the vehicle.
Dynamic objects can introduce difficulties in SLAM, particularly during loop closure. SLAM can tackle this difficulty by utilising semantic information to filter dynamic objects from the input images [83]. Using the scene understanding module, we can filter out moving objects from the images to prevent the SLAM algorithm from conditioning on dynamic objects. For example, the SUMA++ model illustrated on the right of Fig. 5 can obtain a semantic understanding of each detected object to filter out dynamic objects such as pedestrians and other moving vehicles. However, the increased robustness comes at the cost of discarding part of the perceived information, which can lower the accuracy of the estimated robot pose.
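The filtering step can be sketched as a simple mask over candidate feature points. The label IDs, label map, and keypoints below are invented for illustration and do not correspond to any particular dataset's class scheme:

```python
import numpy as np

# Sketch of semantic filtering for a SLAM front-end: given a per-pixel
# semantic label map, discard candidate feature points that fall on
# dynamic classes (pedestrians, vehicles) before data association.
DYNAMIC = {11, 13}                      # hypothetical IDs: 11 = person, 13 = car
labels = np.zeros((100, 100), dtype=int)
labels[40:60, 40:60] = 11               # a pedestrian detected in the frame

keypoints = [(10, 10), (50, 50), (80, 20)]  # (row, col) feature candidates
static_kps = [(r, c) for r, c in keypoints if labels[r, c] not in DYNAMIC]
print(static_kps)  # (50, 50) lies on the pedestrian and is removed
```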

Scene Modelling with Typological Relationship and Dynamic Models
Scene graphs are a different approach to building a model of the environment that includes the metric, semantic, and topological relationships between the scene objects and the overall environment [85]. Scene graphs can construct an environmental graph that spans an entire building, including the objects, materials and rooms within it [86]. The main disadvantage of scene graphs is the need to compute them offline, requiring a known 3D mesh of the building with registered RGB images to generate the 3D scene graph. Previous approaches rely on registering RGB images with the 3D mesh of the building to generate the 3D scene graph, which limits their applicability to static environments. Fig. 4 illustrates one of the approaches, Dynamic Scene Graphs (DSG) [76], which can also include dynamic elements within the environment. For example, DSG can model humans navigating within the building. The original DSG approach needs to be built offline, but an extension has been proposed [76] that is capable of building a 3D dynamic scene graph from visual-inertial data in a fully automatic manner. The approach first builds a 3D mesh-based semantic map that is fed to the dynamic scene generator.
In addition, we can perform reasoning about the current situation by projecting what is likely to happen based on previous events [87]. This class of methods predicts the possible future state of the robot by conditioning on the current belief of the robot state and the robot's dynamic model [88]. Dynamic models can also be attached to objects in the surrounding environment, such as pedestrians and vehicles, allowing the model to predict the future poses of nearby objects with some amount of uncertainty [89].

Semantic Understanding with Segmentation
Pixel-wise semantic segmentation is another promising direction for semantic understanding in SLAM. FCN [77] is a fully convolutional neural network from the computer vision community that performs pixel-wise segmentation and can be applied to SLAM. ParseNet derived a similar CNN architecture [90] and injected global context information into the global pooling layers of FCN. The global context information allows the model to achieve better scene segmentation with a more feature-rich representation of the network. SegNet is another network [91] that uses an encoder-decoder architecture for segmentation; the decoder helps upsample the low-resolution features captured from the images. Bayesian approaches are helpful in many learning-based robotics applications [92], [93]. Bayesian SegNet [92] takes a probabilistic approach by using dropout layers in the original SegNet for sampling. The Bayesian approach estimates the probability of each pixel-level segmentation, which often outperforms the original approach. Conditional Random Fields have also been combined with CNN architectures [94], formulating mean-field approximate inference as a Recurrent Neural Network.
Semantic information is particularly valuable in environments where a robot needs to interact with humans [95].
The progress in computer vision semantic segmentation using deep learning is constructive for pushing research progress in semantic SLAM. By combining model-based SLAM methods with spatio-temporal CNN-based semantic segmentation [96], we can often provide the SLAM model with a more informative feature representation for localisation. The proposed system can simultaneously perform 3D semantic scene mapping and 6-DOF localisation, even in large indoor environments. The pixel-voxel network [97] is a similar approach that uses a CNN-like architecture for semantic mapping. SemanticFusion [98] integrates CNN-based semantic segmentation with the dense SLAM system ElasticFusion [99], resulting in a model that produces a dense semantic map and performs well in indoor environments.

Sensors Fusions for Semantic Scene Understanding
With the recent advancements in deep learning, numerous Visual-SLAM systems have also gained tremendous success in using learned models for semantic understanding through data fusion. Models such as Frustum PointNets [100] utilise both RGB cameras and LiDAR sensors to improve the accuracy of understanding the semantics of the scene. Fig. 6 illustrates how Frustum PointNets utilises information from both sensors for data fusion, where a PointNet is applied for object instance segmentation and amodal bounding box regression. Sensor fusion provides a richer feature representation for performing data association. For example, VINet [101] is a sensor fusion network that combines the pose estimated by DeepVO [102] with inertial sensor readings using an LSTM. During training, the prediction and fusion networks are trained jointly so that gradients flow through the entire network; the two networks can therefore compensate for each other, and the fused system achieves high performance compared to traditional sensor fusion methods. The same methodology can also be used as a fusion system [103] capable of fusing 6-DOF pose data from cameras and magnetic sensors [104].
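The recurrent fusion pattern described above can be sketched compactly. The snippet below is a toy stand-in for a VINet-style system, with assumed feature sizes, randomly initialised weights, and a vanilla tanh RNN cell substituted for the LSTM: visual and inertial features are concatenated per time step and folded into a recurrent hidden state that summarises the fused motion history.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical feature sizes: 6-DoF visual pose + 6-axis IMU window.
VIS_DIM, IMU_DIM, HID_DIM = 6, 6, 16

# Stand-in recurrent weights; a real system would learn these jointly
# with the visual front-end so gradients flow through both networks.
W_in = rng.normal(scale=0.1, size=(HID_DIM, VIS_DIM + IMU_DIM))
W_h = rng.normal(scale=0.1, size=(HID_DIM, HID_DIM))

def fuse_step(visual_pose, imu_reading, hidden):
    """Concatenate the two modalities, then update the recurrent state
    (a vanilla RNN cell as a simplified stand-in for the LSTM)."""
    x = np.concatenate([visual_pose, imu_reading])
    return np.tanh(W_in @ x + W_h @ hidden)

hidden = np.zeros(HID_DIM)
for _ in range(10):                          # ten fused time steps
    visual_pose = rng.normal(size=VIS_DIM)   # e.g. from a VO network
    imu_reading = rng.normal(size=IMU_DIM)   # accelerometer + gyro
    hidden = fuse_step(visual_pose, imu_reading, hidden)
# A small output head would regress the pose increment from `hidden`.
```

The key design point is that fusion happens at the feature level inside one differentiable graph, rather than combining two independently estimated poses after the fact.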
The information obtained from a camera can also be fused with GPS, INS, and wheel odometry readings in a motion estimation system [105]. The model essentially uses deep learning to capture the temporal motion dynamics; the motion observed by the camera is fed to a mixture density network that constructs an optical flow vector for better estimation. Direct methods for visual odometry (VO) can exploit intensity-level information gathered from the input images; however, these methods cannot guarantee optimality compared to feature-based methods. Semi-direct VO (SVO2) [106] is a hybrid method that uses direct methods to track pixels while relying on feature-based methods for the joint optimisation of structure and motion. Hybrid methods take advantage of both approaches to improve the robustness of VO. Similar approaches, such as VINS-Fusion [107], fuse IMU readings with monocular visual input to estimate odometry with high reliability. Deep neural networks can further learn rigid-body motion in a CNN architecture [108], using raw point cloud data as input to predict the SE3 rigid transformation of the robot.
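Whatever network produces them, the per-frame SE3 increments mentioned above are composed by matrix multiplication of homogeneous transforms to recover the robot's pose relative to its starting frame. The sketch below is an illustrative example (planar motion, hand-picked increments), not taken from any cited system.

```python
import numpy as np

def se3(rotation_z_rad, translation):
    """Homogeneous 4x4 transform: rotation about z plus a translation."""
    c, s = np.cos(rotation_z_rad), np.sin(rotation_z_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0],
                 [s,  c, 0],
                 [0,  0, 1]]
    T[:3, 3] = translation
    return T

# Chain two per-frame increments (as a VO or fusion network would emit)
# into the pose relative to the starting frame.
T1 = se3(np.pi / 2, [1.0, 0.0, 0.0])  # move 1 m forward, turn 90 degrees
T2 = se3(0.0, [1.0, 0.0, 0.0])        # move 1 m forward again
pose = T1 @ T2
# After the turn, the second step moves along the world y-axis,
# so the final position is (1, 1, 0).
```

Because small rotation errors in early increments are amplified by later translations under this composition, accurate per-frame SE3 prediction matters disproportionately for long trajectories, which is why loop closure remains essential even with learned odometry.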

Conclusion and Future Directions
Numerous studies have been conducted in the SLAM domain, as mapping and navigation are critical for enabling robots to interact autonomously with the real world. SLAM remains a promising and exciting research domain due to its ubiquitous need in mobile robotic applications. SLAM merges ideas from multiple fields within the broader robotics community, for example, sensing, perception, localisation, and mapping.
In addition, a visual SLAM system with learning capability has shown tremendous potential for further exploration.
Deep learning approaches have been shown to be more flexible and more robust by utilising semantic information about the surrounding objects. Sensor information such as pose, depth, 3D point clouds, and semantic maps of the surrounding objects has been shown to be highly useful in Visual-SLAM. By fusing the measured readings from different sensors, learning-based models can draw on more sources of information for a more feature-rich data-association process. We believe a learning-based approach to semantic SLAM is a promising and exciting direction for developing autonomous robots.
SLAM provides the foundation for the autonomous operation of robots. Many possible future directions can further address the challenges discussed in earlier sections. For example, data association is one of the core problems in SLAM. Some current IMU and visual odometry-based approaches depend heavily on the sensors' accuracy, or assume a prior of normally distributed, stationary noise. An adaptive approach to handling possibly shifting temporal noise distributions could further mitigate the data association problem. Sensor fusion should be another focus in Visual-SLAM due to the availability of various sensors in modern robots. LiDAR and RGB-D cameras are the two most popular sensing modalities in modern SLAM; therefore, combining the rich information provided by these sensors can further improve the current state-of-the-art Visual-SLAM algorithms. Currently, the SLAM and motion planning problems in robotics are typically tackled in a disjointed manner. However, integrating the uncertainty and probabilistic information obtained in the SLAM framework would provide more information for the robot to plan its next movement during motion planning. Therefore, integrating motion planning algorithms such as RRT or PRM within SLAM could provide a more robust robotic framework. Finally, several works that depend on deep neural networks have been discussed in previous sections. Integrating methodologies from the deep reinforcement learning literature could provide SLAM with a learnable policy that exploits past SLAM episodes to improve future execution in unseen environments.
We have provided a thorough literature review of the fundamental and current state-of-the-art Visual-SLAM models to communicate our current understanding of SLAM approaches. We have shown the ongoing evolution in Visual-SLAM, from model-based approaches to deep learning-based methods. Most current SLAM models seek to improve their accuracy and robustness in the high-level cognition and perception components of Visual-SLAM systems. The key to most current semantic Visual-SLAM models lies in designing the network architecture, an appropriate loss function, and the data representation of deep learning-based methods. Therefore, ongoing research progress in deep learning models will further enhance the capability of Visual-SLAM models.

Figure 2 :
Figure 2: Visual-SLAM Bundle Adjustment (BA) in a factor graph. The odometry factors u_i constrain the relative camera poses, with potential loop-closure factors c_{i1,i2}, where i1, i2 are the indices of the camera poses. This figure demonstrates loop-closure factors c_{1,N-1} between camera poses X_{w1} and X_{w,N-1}, and c_{2,N} between X_{w2} and X_{wN}, for deciding whether the mobile robot has returned to a previously visited area.

Figure 3 :
Figure 3: Example of using PointNet [53] for performing part segmentation directly on input point clouds.

Figure 6 :
Figure 6: The multi-modal model Frustum PointNets [100], which uses a CNN to project detected objects from RGB images into 3D space, thus improving the accuracy of semantic understanding.