1. Introduction
People increasingly expect mobile robots to perform tasks autonomously, which requires the ability to adapt to unfamiliar environments. Therefore, SLAM [1] (Simultaneous Localization and Mapping), which enables localization and mapping in unknown environments, has become a necessary capability for autonomous mobile robots. Since it was first proposed in 1986, SLAM has attracted extensive attention from researchers and has developed rapidly in robotics, virtual reality, and other fields. SLAM refers to estimating the sensor's own position from observations and a map while incrementally building that map from the estimated positions; it mainly addresses the problem of robot localization and map construction when moving in an unknown environment [2]. As a fundamental technology, SLAM was applied to mobile robot localization and navigation early on. With the development of computer hardware and artificial intelligence software, robotics research has received growing attention and investment, and numerous researchers are committed to making robots more intelligent. SLAM is considered key to achieving true autonomy for mobile robots [3].
Some scholars divide SLAM into laser SLAM and visual SLAM (VSLAM) according to the sensors adopted [4]. Owing to its earlier start, laser SLAM research is relatively mature and was long considered the preferred solution for mobile robots. Similar to human eyes, VSLAM mainly uses images as the source of environmental perception, which is more consistent with human understanding and carries more information than laser scans. In recent years, camera-based VSLAM research has attracted extensive attention. Because cameras are cheap, easy to install, provide abundant environmental information, and fuse easily with other sensors, many vision-based SLAM algorithms have emerged [5]. With its richer environmental information, VSLAM is expected to give mobile robots stronger perceptual ability in specific scenarios. Therefore, this paper focuses on VSLAM and the algorithms derived from it. SLAM based on the various kinds of laser radar is outside the scope of this paper; interested readers can refer to [6,7,8] and other sources in the literature.
As one of the solutions for autonomous robot navigation, traditional VSLAM is essentially a simple environmental understanding based on geometric image features [9]. Because it relies only on low-level geometric features of the environment, such as points and lines, traditional VSLAM can achieve high real-time performance. However, where changes in lighting, texture, and dynamic objects are widespread, it shows obvious shortcomings in positioning precision and robustness [10]. Although the map constructed by traditional visual SLAM contains important environmental information and meets the positioning needs of the robot to a certain extent, it is inadequate for supporting the robot's autonomous navigation and obstacle avoidance tasks. Furthermore, it cannot meet the needs of intelligent robots interacting with the environment and with humans [11].
People’s demand for intelligent mobile robots is increasing day by day, which places high requirements on the autonomy and human–computer interaction ability of robots [12]. The traditional VSLAM algorithm can meet basic positioning and navigation requirements, but cannot complete higher-level tasks such as “help me close the bedroom door” or “go to the kitchen and get me an apple”. To achieve such goals, robots need to recognize objects in the scene, determine their locations, and build semantic maps. With the help of semantic information, data association is upgraded from the traditional pixel level to the object level, and the perceived geometric information is assigned semantic labels to obtain a high-level semantic map. This helps the robot understand the environment autonomously and interact with humans [13]. We believe that the rapid development of deep learning provides a bridge for introducing semantic information into VSLAM. Especially in semantic map construction, combining deep learning with VSLAM can give robots high-level perception and understanding of the scene, significantly improving their ability to interact with the environment [14].
In 2016, Cadena et al. [15] first proposed dividing the development of SLAM into three stages; in their description, we are now in the robust-perception stage, as shown in Figure 1. They describe the emphasis and contributions of SLAM in different eras from three aspects: classical, algorithmic, and robust. Ref. [16] summarizes the development of vision-based SLAM algorithms from 2010 to 2016 and provides a toolkit to help beginners. Yousif et al. [17] discussed the elementary framework of VSLAM and summarized several mathematical problems to help readers make the best choice. Bavle et al. [18] summarized robot SLAM technology and pointed out the development trend of robot scene understanding. Starting from the fusion of visual and visual-inertial methods, Servieres et al. [19] reviewed and compared important methods and summarized excellent algorithms emerging in SLAM. Azzam et al. [20] conducted a comprehensive study of feature-based methods, classifying the reviewed approaches according to the visual features observed in the environment and proposing possible problems and solutions for the future development of SLAM. Ref. [21] introduces in detail SLAM methods based on monocular, stereo, RGB-D, and visual-inertial fusion, and gives existing problems and future directions. Ref. [22] describes the opportunities and challenges of VSLAM from geometry to deep learning and forecasts the development prospects of VSLAM in the coming semantic era.
As can be seen, several surveys of vision-based SLAM technologies already exist. However, most focus on a single aspect of VSLAM and lack a comprehensive summary of its development. Moreover, the above reviews emphasize traditional visual SLAM algorithms, while semantic SLAM combined with deep learning is not introduced in detail. A comprehensive review of vision-based SLAM algorithms is therefore needed to help researchers and students entering the field obtain an overview of this large area.
To give readers a deeper and more comprehensive understanding of the field of SLAM, we review the history of general SLAM algorithms from their inception to the present and summarize the key solutions driving their technological evolution, from problem formulation to the most commonly used state-estimation methods. Rather than focusing on just one aspect, we present the main approaches so as to show the connections between the lines of research that have brought SLAM to its current state. In addition, we review the evolution of SLAM from traditional to semantic, a perspective that covers major, interesting, and leading design approaches throughout this history. On this basis, we make a comprehensive summary of deep-learning-based SLAM algorithms, and semantic VSLAM is explained in detail to help readers better understand its characteristics. We think our work can help readers better understand robot environment perception, and our treatment of semantic VSLAM can provide a useful reference for future SLAM research and even robot autonomous sensing. This paper therefore comprehensively supplements and updates the development of vision-based SLAM technology. Furthermore, it divides this development into two stages, traditional VSLAM and semantic VSLAM integrating deep learning, so that readers can better understand the research hot spots and grasp the development direction of VSLAM. We believe that in the traditional phase SLAM research mainly solved the framework of the algorithm, while in the semantic era SLAM focuses on advanced situational awareness and system robustness in combination with deep learning.
Our review makes the following contributions to the state of the art:
We review the development of vision-based SLAM comprehensively, including recent research progress in simultaneous localization and map construction based on environmental semantic information.
Starting with convolutional neural networks (CNN) and recurrent neural networks (RNN), we describe the application of deep learning in VSLAM in detail. To our knowledge, this is the first review to introduce VSLAM from a neural network perspective.
We describe the combination of semantic information and VSLAM in detail and point out the development direction of VSLAM in the semantic era. We introduce and summarize outstanding research on combining semantic information with traditional visual SLAM for system localization and map construction, and make an in-depth comparison between traditional visual SLAM and semantic SLAM. Finally, future research directions for semantic SLAM are proposed.
Specifically, Section 1 of this paper introduces the characteristics of traditional VSLAM in detail, including the direct and indirect methods of front-end visual odometry, and compares depth-camera-based VSLAM with classical VSLAM integrating an IMU. Section 2 is divided into two parts. We first introduce the combination of deep learning and VSLAM through two neural networks, CNN and RNN; we believe that introducing deep learning is the precondition for the development of semantic VSLAM, and this stage can also be regarded as its beginning. Then, the paper describes how deep learning led semantic VSLAM to an advanced stage through target detection and semantic segmentation, and summarizes the development of semantic VSLAM from three aspects: localization, mapping, and elimination of dynamic objects. Section 3 introduces some mainstream SLAM data sets and some outstanding laboratories in this area. In the end, we summarize the current research and point out directions for future VSLAM research. The section table of contents for this article is shown in Figure 2.
4. Semantic VSLAM
Semantic SLAM refers to a SLAM system that not only obtains geometric information about the unknown environment and the robot's motion but also detects and identifies targets in the scene, obtaining semantic information such as their functional attributes and relationships with surrounding objects, and even understanding the contents of the whole environment [134]. Traditional VSLAM represents the environment in forms such as point clouds, which to us are a collection of meaningless points. To perceive the world at both the geometric and content levels and provide better services to humans, robots need to further abstract the features of these points and understand them [135]. With the development of deep learning, researchers have gradually realized its potential for SLAM problems [136]. Semantic information can help SLAM understand the map at a higher level; furthermore, it lessens the dependence of the SLAM system on feature points and improves the robustness of the system [137].
Modern semantic VSLAM systems cannot do without the help of deep learning, and the feature attributes and association relations obtained through learning can be used in different tasks [138]. As an important branch of machine learning, deep learning has achieved remarkable results in image recognition [139], semantic understanding [140], image matching [141], 3D reconstruction [142], and other tasks. The application of deep learning in computer vision can greatly ease the problems encountered by traditional methods [143]. Traditional VSLAM systems have achieved commendable results in many respects, but many challenging problems remain to be solved [144]. Ref. [145] summarizes deep-learning-based VSLAM in detail and points out the problems of traditional VSLAM. These works [146,147,148,149] suggest using deep learning to replace some modules of traditional SLAM, such as loop closure and pose estimation, to improve on traditional methods.
Machine learning is a subset of artificial intelligence that uses statistical techniques to give computers the ability to “learn” from data without explicit programming. Unlike task-specific algorithms, deep learning is a subset of machine learning based on learning representations of data, inspired by the function and structure of artificial neural networks. Deep learning gains great flexibility and power by learning to represent the world as a hierarchy of concepts, computing more abstract representations from less abstract ones. The most important difference between traditional machine learning and deep learning is how performance scales with data. Deep learning algorithms do not work well when the data set is very small, because they need big data to identify and understand patterns well. The performance of machine learning algorithms depends on the accuracy of the features identified and extracted; deep learning algorithms, on the other hand, learn these high-level features from the data themselves, reducing the effort of developing an entirely new feature extractor for each problem. Deep learning has proven to be a more powerful and promising branch compared with traditional machine learning algorithms, realizing through its layered structure many functions that traditional machine learning cannot achieve. SLAM systems need to collect a large amount of environmental information, producing a huge amount of data to process, and deep learning models are well suited to this problem.
This paper views semantic VSLAM as an evolving process. In the early stage, researchers tried to improve the performance of VSLAM by extracting semantic information from the environment using neural networks such as CNNs. In the modern stage, target detection, semantic segmentation, and other deep learning methods are powerful tools promoting the development of semantic VSLAM. Therefore, in this chapter, we first describe the application of typical neural networks in VSLAM; we believe this is the premise for the development of modern semantic VSLAM and provided its model. In our view, neural networks are the bridge that introduced semantic information into the modern semantic VSLAM system and enabled its rapid development.
4.1. Neural Networks with VSLAM
Figure 13 shows the typical frameworks of CNN and RNN. A CNN can capture spatial features from an image, which helps to accurately identify an object and its relationships with other objects in the image [150]. The characteristic of an RNN is that it can process image or numerical sequences: because of the memory capacity of the network itself, it can learn data with contextual correlation [151]. In addition, other types of neural networks such as DNNs (deep neural networks) have seen some tentative use, but this work is at an initial stage. This paper notes that CNNs have the advantage of extracting the features of objects with a given model and then classifying, identifying, predicting, or deciding based on those features, which can be helpful to different modules of VSLAM. In addition, we believe RNNs have great advantages in helping to establish consistency between nearby frames; furthermore, their high-level features offer better discrimination, which can help robots better complete data association.
4.1.1. CNN with VSLAM
Traditional inter-frame estimation methods adopt feature-based or direct methods to estimate camera pose through multi-view geometry [152]. Feature-based methods need complex feature extraction and matching, while direct methods rely on pixel intensity values, which makes it difficult for traditional methods to obtain the desired results in environments with intense illumination changes or sparse texture [153]. In contrast, methods based on deep learning are more intuitive and concise, because they do not need explicit feature extraction, feature matching, or complex geometric operations [154]. As the feature detection layers of a CNN learn from training data, explicit feature extraction is avoided and features are learned implicitly during use. Refs. [155,156] and other works have summarized this in detail.
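For concreteness, the following is a minimal sketch of the indirect two-view pipeline that such learned methods aim to replace (OpenCV-based; the intrinsic matrix K and parameter values are illustrative assumptions, not taken from any cited system):

```python
# Indirect two-view pose estimation: ORB features + essential matrix.
import cv2
import numpy as np

def relative_pose(img1, img2, K):
    orb = cv2.ORB_create(2000)                    # detect and describe features
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)           # brute-force descriptor matching
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # RANSAC rejects outliers; this step degrades in low-texture or dynamic scenes
    E, mask = cv2.findEssentialMat(pts1, pts2, K, cv2.RANSAC, 0.999, 1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t                                   # rotation and unit-scale translation
```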
CNN’s advantages in image processing have been fully verified. For example, visual depth estimation alleviates the problem that monocular cameras cannot obtain reliable depth information [157]. In 2017, Tateno et al. [158] proposed “CNN-SLAM”, a real-time SLAM system based on CNN within the framework of LSD-SLAM. As shown in Figure 14, the algorithm obtains a reliable depth map by training a depth estimation network model. The CNN-predicted depth is fed into subsequent modules such as traditional pose estimation to improve positioning and mapping accuracy. In addition, a CNN semantic segmentation module is added to the framework, supporting high-level information perception in the VSLAM system. Similar work using networks to estimate depth information includes CodeSLAM [42] and DVSO [159], which is based on a stereo camera. In the same year, Godard et al. [160] proposed an unsupervised image depth estimation scheme: unsupervised learning is improved by training on stereo data, and depth is then estimated from a single frame, a great improvement compared with other schemes.
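As a hedged illustration of how predicted depth feeds a SLAM pipeline, the sketch below uses an off-the-shelf monocular depth network (MiDaS via torch.hub, not the network of [158]) to lift a pixel to 3D; since monocular depth is relative, metric scale recovery is deliberately left open:

```python
# Predict depth for one RGB frame and back-project pixel (u, v) to 3D.
import torch
import numpy as np

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
tfm = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

def backproject(img_rgb, u, v, K):
    with torch.no_grad():
        pred = midas(tfm(img_rgb))                    # (1, h, w) relative inverse depth
        pred = torch.nn.functional.interpolate(       # resize to input resolution
            pred.unsqueeze(1), size=img_rgb.shape[:2],
            mode="bicubic", align_corners=False).squeeze()
    z = 1.0 / max(float(pred[v, u]), 1e-6)            # relative depth; a SLAM system
                                                      # must still recover metric scale
    x = (u - K[0, 2]) * z / K[0, 0]                   # pinhole back-projection
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])
```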
CNNs not only address the inability of traditional monocular methods to obtain reliable depth data but also improve on the defects of traditional camera pose estimation. In 2020, Yang et al. [48] proposed D3VO, which uses deep learning in three aspects: depth estimation, pose estimation, and uncertainty estimation. The predicted depth, pose, and uncertainty are tightly incorporated into a direct visual odometry to simultaneously improve front-end tracking and back-end nonlinear optimization. However, self-supervised methods have difficulty adapting to all environments. In addition, Qin et al. [161] proposed a semantic-feature-based localization method in 2020, which effectively addresses the tendency of traditional visual SLAM to lose tracking. A CNN detects semantic features in the narrow, crowded environment of an underground parking lot, which lacks GPS signal and has dim light and sparse texture. U-Net [162] then performs semantic segmentation to separate parking lines, speed bumps, and other ground markings, and odometry information is used to map the semantic features into the global coordinate system and build the parking-lot map. The semantic features are matched against the previously constructed map to localize the vehicle, and finally an EKF integrates the visual positioning results and odometry information to ensure continuous, stable localization in the underground parking environment. Zhu et al. [163] learned rotation and translation by using CNNs to attend to different quadrants of the optical flow input. However, such end-to-end replacements of visual odometry remain simple and crude, lacking theoretical support and generalization ability.
Loop closure detection eliminates accumulated trajectory and map errors and determines the accuracy of the whole system; it is essentially a scene recognition problem [164]. Traditional methods match artificially designed sparse features or pixel-level dense features. Deep learning can instead learn high-level image features through neural networks, and by using its powerful recognition ability to extract robust high-level features, recognition rates can reach a higher level. In this way, the system becomes more adaptable to changes in viewpoint and illumination, improving loop-closure image recognition [165]. Therefore, scene recognition based on deep learning can improve the accuracy of loop closure detection, and CNNs have produced many reliable results here. Memon et al. [166] proposed a dictionary-based deep learning method which, unlike the traditional BoW dictionary, uses higher-level, more abstract deep features. This method does not need to create a vocabulary, has higher memory efficiency, and runs faster than similar methods; however, it detects loops based only on a similarity score, so it is not widely representative. Li et al. [167] proposed DXSLAM, a visual SLAM system based on learned features that overcomes the limitations of the above methods. Local and global features are extracted from each frame using a CNN and fed into a modern SLAM pipeline for pose tracking, local mapping, and relocalization; compared with traditional BoW-based methods, it achieves higher efficiency and lower computational cost. In addition, Qin et al. [168] used a CNN to extract environmental semantic information and modeled the visual scene as a semantic subgraph, effectively improving the efficiency of loop closure detection. Refs. [169,170] and others describe in detail the achievements of deep learning in many aspects. However, as more complex and better models are introduced, ensuring real-time model computation, deploying loop closure detection models on resource-constrained platforms, and making the models lightweight remain major problems [171].
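As a minimal illustration of deep-feature loop closure, the sketch below compares pooled CNN descriptors by cosine similarity; the backbone choice and the threshold are our illustrative assumptions, not those of the cited systems:

```python
# Loop-closure candidate search over global CNN image descriptors.
import torch
import torchvision.models as models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()        # keep the 512-d pooled feature
backbone.eval()

@torch.no_grad()
def descriptor(img_tensor):              # img_tensor: (3, H, W), ImageNet-normalized
    f = backbone(img_tensor.unsqueeze(0)).squeeze(0)
    return f / f.norm()                  # L2-normalize for cosine similarity

def detect_loop(query, keyframe_descs, thresh=0.9):
    best_i, best_s = -1, thresh          # 0.9 is an example value, not tuned
    for i, d in enumerate(keyframe_descs):
        s = float(torch.dot(query, d))   # cosine similarity of unit vectors
        if s > best_s:
            best_i, best_s = i, s
    return best_i                        # index of loop candidate, or -1
```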
CNNs have achieved good results in replacing some modules of the traditional VSLAM algorithm, such as depth estimation and loop closure detection, although their stability is still not as good as that of traditional VSLAM algorithms [172]. In contrast, CNN-based extraction of semantic information has brought better effects: the traditional VSLAM process is optimized by using CNNs to extract higher-level semantic information about the environment, achieving better results. Using neural networks to extract semantic information and combining it with VSLAM will be an area of great interest. With the help of semantic information, data association is upgraded from the traditional pixel level to the object level, and the perceived geometric information is assigned semantic labels to obtain a high-level semantic map, helping the robot understand the environment autonomously and support human–computer interaction.
Table 8 shows the main application areas of CNNs in VSLAM. Some works contribute to many aspects; only the main contributions are listed here.
4.1.2. RNN with VSLAM
Research on RNNs (recurrent neural networks) began in the 1980s and 1990s, and they developed into one of the classical deep learning algorithms in the early 21st century. Long short-term memory (LSTM) networks are among the most common recurrent neural networks [178]. LSTM is a variant of the RNN that remembers a controllable amount of previous training data, or forgets it when appropriate [179]. As shown in Figure 15, the structure of LSTM and the state equations of its different modules are given. With special hidden units, LSTM can preserve inputs over long horizons. It inherits most characteristics of the RNN model and solves the vanishing-gradient problem caused by the gradual attenuation of gradients during backpropagation. As another variant of the RNN, the GRU (gated recurrent unit) is easier to train and can improve training efficiency [180]. RNNs have advantages in learning the nonlinear features of sequences because of their memory and parameter sharing, and RNNs built on top of convolutional neural networks can deal with computer vision problems involving sequence input [181].
In pose estimation, end-to-end deep learning methods solve for the pose parameters between visual frames without feature matching or complex geometric operations; relative pose parameters can be obtained quickly by directly inputting nearby frames [182]. Xue et al. [183] use deep learning to learn the feature-selection process and realize pose estimation based on an RNN. Rotation and translation are trained separately, which gives better adaptability than traditional methods. In 2021, Teed et al. [184] introduced DROID-SLAM, whose core is a learnable update operator. As shown in Figure 16, the update operator is a 3 × 3 convolutional GRU with hidden state H. Iterative application of the update operator produces a sequence of poses and depths that converge to a fixed point reflecting a consistent reconstruction. The algorithm is an end-to-end neural network architecture for visual SLAM with great advantages over previous work in challenging environments.
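The structure of such an update operator can be sketched as follows (an illustrative ConvGRU cell in PyTorch, patterned after the description above rather than the authors' code):

```python
# ConvGRU cell: 3x3 convolutions gate a spatial hidden state h.
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    def __init__(self, hidden_dim, input_dim):
        super().__init__()
        ch = hidden_dim + input_dim
        self.convz = nn.Conv2d(ch, hidden_dim, 3, padding=1)  # update gate
        self.convr = nn.Conv2d(ch, hidden_dim, 3, padding=1)  # reset gate
        self.convq = nn.Conv2d(ch, hidden_dim, 3, padding=1)  # candidate state

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))
        r = torch.sigmoid(self.convr(hx))
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q       # each call refines h; pose and depth
                                         # updates are then read out from h
```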
Most existing methods combine a CNN with an RNN to improve the overall performance of VSLAM. The two can be joined through a separate layer, with the CNN output serving as the RNN input: on the one hand, the CNN automatically learns effective feature representations for the VO problem; on the other hand, the RNN implicitly models the temporal dynamics (motion model) and data association across the image sequence [185]. In 2017, Yu et al. [60] combined an RNN with KinectFusion to semantically annotate RGB-D images and reconstruct a 3D semantic map, introducing a new loop closure unit into the RNN to reduce GPU computing resource consumption. This method makes full use of the advantages of RNNs to annotate semantic information; the high-level features discriminate better and help the robot better complete data association. However, because it relies on an RGB-D camera, it can only operate indoors. DeepSeqSLAM [186] solved this problem well: a trainable CNN+RNN architecture jointly learns visual and positional representations from a single monocular image sequence, with the RNN integrating temporal information over short image sequences. Using the dynamic information-processing abilities of these networks, end-to-end place and sequence learning was realized for the first time, together with the ability to learn meaningful temporal relationships from single image sequences of large driving datasets. In running time, accuracy, and computational requirements, sequence-based methods are significantly superior to traditional methods and operate stably in outdoor environments.
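A minimal sketch of this CNN-to-RNN pattern for visual odometry is given below; the layer sizes are illustrative and do not reproduce any published architecture:

```python
# CNN encodes each stacked frame pair; an LSTM models temporal dependence;
# a linear head regresses the 6-DoF relative pose per step.
import torch
import torch.nn as nn

class CNNLSTMVO(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(                 # consumes two RGB frames stacked
            nn.Conv2d(6, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4))
        self.lstm = nn.LSTM(64 * 16, 256, batch_first=True)
        self.head = nn.Linear(256, 6)             # 3 translation + 3 rotation

    def forward(self, seq):                       # seq: (B, T, 6, H, W)
        B, T = seq.shape[:2]
        feats = self.cnn(seq.flatten(0, 1)).flatten(1).view(B, T, -1)
        out, _ = self.lstm(feats)                 # memory links nearby frames
        return self.head(out)                     # (B, T, 6) relative poses
```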
A CNN can be combined with many parts of VSLAM, such as feature extraction and matching, depth estimation, and pose estimation, and has achieved good results in these respects. An RNN, by contrast, has a narrower scope of application, but it has a great advantage in helping to establish consistency between nearby frames. RNNs are a common data-driven method for temporal modeling in deep learning. Inertial data, such as the high-frame-rate angular velocity and acceleration output by an IMU, depend strictly on timing, which makes them especially suitable for RNN models. Based on this, Clark et al. [175] proposed using a small conventional LSTM network to process the raw IMU data and obtain motion features from it. Finally, they combined the visual motion features with the IMU motion features and fed them into a core LSTM network for feature fusion and pose estimation. The principle is shown in Figure 17.
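The two-stage fusion described above can be sketched as follows (our illustrative PyTorch rendering of the idea in [175], with invented dimensions):

```python
# A small LSTM summarizes the high-rate IMU window between camera frames;
# its output is concatenated with the visual feature and a core LSTM fuses
# both for pose estimation.
import torch
import torch.nn as nn

class VIFusion(nn.Module):
    def __init__(self, vis_dim=512):
        super().__init__()
        self.imu_lstm = nn.LSTM(6, 64, batch_first=True)    # gyro + accel channels
        self.core_lstm = nn.LSTM(vis_dim + 64, 256, batch_first=True)
        self.pose = nn.Linear(256, 6)

    def forward(self, vis_feat, imu_seq):
        # vis_feat: (B, T, vis_dim); imu_seq: (B, T, N, 6) with N IMU samples
        # per camera frame (the IMU runs at a much higher rate)
        B, T, N, _ = imu_seq.shape
        _, (h, _) = self.imu_lstm(imu_seq.flatten(0, 1))     # summarize each window
        imu_feat = h[-1].view(B, T, 64)
        fused, _ = self.core_lstm(torch.cat([vis_feat, imu_feat], -1))
        return self.pose(fused)                              # (B, T, 6) poses
```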
Compared with pose estimation, we believe RNNs are more attractive for their contribution to visual-inertial data fusion: this approach fuses visual-inertial data effectively and is more convenient than traditional methods. Similar work, such as [187,188], proves the effectiveness of the fusion strategy, which provides better performance than direct fusion. The contributions of RNNs to parts of VSLAM are given in Table 9.
In this section, this paper introduced the combination of deep learning and traditional VSLAM through the classical neural networks CNN and RNN.
Table 10 shows some excellent algorithms combining neural networks with VSLAM.
4.2. Modern Semantic VSLAM
Deep learning has made many achievements in pose estimation, depth estimation, and loop closure detection. In VSLAM, however, deep learning is currently unable to displace the dominance of traditional methods. Applying deep learning to semantic VSLAM research, by contrast, yields more valuable discoveries and can rapidly advance the development of semantic VSLAM. Refs. [60,158,168] used a CNN or RNN to extract semantic information from the environment to improve the performance of different modules of traditional VSLAM; the semantic information was used for pose estimation and loop closure detection, significantly improving the performance of traditional methods and proving the effectiveness of semantic information for VSLAM systems. This paper believes that this provided technical support for, and marked the beginning of, modern semantic VSLAM. Using deep learning methods such as target detection and semantic segmentation to create semantic maps represents an important period in the development of semantic SLAM. Refs. [135,200] point out that semantic SLAM can be divided into two types according to the target detection method. One detects targets using traditional methods; real-time monocular object SLAM, using a large number of binary words and a database of object models to provide real-time detection, is the most common. However, it is very limited, because a semantic class such as “cars” contains many types of 3D object instances. The other approach performs object recognition using deep learning methods, such as those proposed in [46].
Semantics and SLAM may seem to be separate modules, but they are not; in many applications, the two go hand in hand. On the one hand, semantic information can help SLAM improve the accuracy of mapping and localization, especially for complex dynamic scenes [201]. Mapping and localization in traditional SLAM are mostly based on pixel-level geometric matching; with semantic information, data association is upgraded from the traditional pixel level to the object level, improving accuracy in complex scenes [202]. On the other hand, by using SLAM to compute positional constraints between objects, consistency constraints can be applied to the recognition results of the same object from different angles and at different times, thus improving the accuracy of semantic understanding. The integration of semantics and SLAM not only contributes greatly to the accuracy of both but also promotes the application of SLAM in robotics, such as robot path planning and navigation, carrying objects according to human instructions, doing housework, and accompanying human movement.
For example, suppose we want a robot to walk from the bedroom to the kitchen to get an apple. How does that work? Relying on traditional SLAM, the robot computes its own location (automatically) and the apple's location (set manually) and then performs path planning and navigation; if the apple is in the refrigerator, the relationship between the refrigerator and the apple must also be set manually. With semantic SLAM, it is much more natural: a human tells the robot, “Please go to the kitchen and get me an apple”, and the robot does the rest automatically. Likewise, if there is contaminated ground in front of the robot during operation, traditional path planning algorithms need the contaminated area to be marked manually so the robot can bypass it [203].
Semantic information can help robots better understand their surroundings. Integrating semantic information into VSLAM is a growing field that has received more and more attention in recent years. This section elaborates our understanding of semantic VSLAM from three aspects: localization, mapping, and dynamic object removal. We believe the biggest contribution of deep learning to VSLAM is the introduction of semantic information, which improves the performance of different modules of traditional methods to varying degrees. This holds especially for the construction of semantic maps, which is driving innovation across the whole intelligent-robot field.
4.2.1. Image Information Extraction
The core difference between modern semantic VSLAM and traditional VSLAM lies in the integration of an object detection module, which obtains the attributes and semantic information of objects in the environment [204]. The first step of semantic VSLAM is to extract semantic information from the images captured by the camera; this can be achieved by classifying the image information [205]. Traditional target detection relies on interpretable machine learning classifiers, such as decision trees and SVMs, to classify target features; however, detection is slow, accuracy is low, and generalization is weak [206]. Image understanding based on deep learning can be divided into object detection, semantic segmentation, and instance segmentation, as shown in Figure 18.
How to better extract semantic information from images is a hot research issue in computer vision; its essence is to extract object-level information from scenes [207]. We believe that although neural networks such as CNNs also contribute to semantic information extraction, modern semantic VSLAM relies more on dedicated modules such as target detection. Object detection and image semantic segmentation are both methods of extracting semantic information from images. Semantic segmentation understands images at the pixel level to obtain deep information, including space, category, and edges; segmentation based on deep neural networks has broken through the bottleneck of traditional semantic segmentation [208]. Compared with semantic segmentation, target detection obtains only the object identity and spatial information of the image, identifying the category of each object by drawing a candidate box around it, so it is faster than semantic segmentation [209]. Compared with object detection, semantic segmentation has higher accuracy, but its speed is much lower [210].
Target detection methods are divided into one-stage and two-stage structures [211]. Early target detection algorithms use a two-stage architecture: after generating a series of candidate boxes as samples, sample classification is carried out through a convolutional neural network. Common algorithms include R-CNN [212], Fast R-CNN [213], and Faster R-CNN [214]. Later, YOLO [215] creatively proposed the one-stage structure, performing the two steps of the two-stage pipeline in one: classification and localization are completed in a single step, and the regressed candidate boxes and their categories are output directly. The one-stage design reduces the steps of the detection algorithm by casting box localization as a regression problem without first generating candidate boxes, and is therefore superior in speed. Common algorithms include YOLO and SSD [216].
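For illustration, the snippet below pulls object-level semantics from a frame with a pretrained two-stage detector (torchvision's Faster R-CNN); the confidence threshold is an arbitrary example value, and a semantic VSLAM front end would attach these labels and boxes to map landmarks:

```python
# Object detection with a pretrained Faster R-CNN from torchvision.
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn(weights=weights).eval()
categories = weights.meta["categories"]          # COCO class names

@torch.no_grad()
def detect(img):                                 # img: (3, H, W) float in [0, 1]
    out = detector([img])[0]
    keep = out["scores"] > 0.7                   # illustrative confidence cut
    return [(categories[l], b.tolist())          # (label, [x1, y1, x2, y2])
            for l, b in zip(out["labels"][keep], out["boxes"][keep])]
```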
In 2014, the appearance of R-CNN overturned traditional object detection schemes, improved detection accuracy, and promoted the rapid development of object detection technology. Its core is to extract candidate regions, obtain feature vectors through AlexNet, and finally use an SVM for classification and box regression. However, the serial feature extraction used by R-CNN limits its speed. Ross Girshick proposed Fast R-CNN in 2015 to solve this problem, using Region-of-Interest pooling (ROI pooling) to improve the efficiency of feature extraction. Faster R-CNN then introduced the region proposal network (RPN): many candidate boxes (anchors) are placed, and the network judges whether each anchor covers foreground or background and refines its coordinates to determine whether a target is present. In addition, YOLO improves prediction accuracy, speeds up processing, increases the number of recognized object types, and proposes joint training for target classification and detection. YOLO is one of the most widely used target detection algorithms, offering real-time detection, and a series of improved versions has followed.
Different from object detection, semantic segmentation not only predicts the position and category of objects in the image but also accurately delineates the boundaries between different kinds of objects. An ordinary convolutional neural network, however, cannot obtain enough spatial information for this task. To solve this problem, Long et al. proposed the fully convolutional network FCN [217]. Compared with a CNN, the FCN has no fully connected layer; it retains the spatial layout of the feature maps and fuses the outputs of layers at different depths in a hierarchical structure. This combines local information with global information and improves the accuracy of semantic segmentation. SegNet, proposed by Badrinarayanan et al. [218], introduced the encoder-decoder structure, combining two independent networks to improve segmentation accuracy; however, this combination severely reduces detection speed. Zhao et al. proposed PSPNet [219] with a pyramid pooling module, which fuses the features of each level of the pyramid and merges the outputs to further improve the segmentation effect.
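A corresponding minimal example of per-pixel semantic labeling with a pretrained FCN from torchvision is shown below; a real system would apply the preprocessing bundled with the weights, and the resulting argmax map is what a scene-oriented semantic SLAM system projects onto its 3D reconstruction:

```python
# Semantic segmentation with a pretrained FCN from torchvision.
import torch
from torchvision.models.segmentation import fcn_resnet50, FCN_ResNet50_Weights

weights = FCN_ResNet50_Weights.DEFAULT
seg_net = fcn_resnet50(weights=weights).eval()

@torch.no_grad()
def segment(img):                                 # img: (3, H, W), normalized
    logits = seg_net(img.unsqueeze(0))["out"]     # (1, C, H, W) class scores
    return logits.argmax(dim=1).squeeze(0)        # (H, W) integer label map
```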
In recent years, continuous improvements in computer performance have promoted the rapid development of instance segmentation in vision. Instance segmentation provides not only pixel-level classification (semantic segmentation) but also the location of individual objects (target detection), so that even instances of the same class can be distinguished. In 2017, He et al. proposed Mask R-CNN [220], the pioneering work of instance segmentation. As shown in Figure 19, its main idea is to add a branch for semantic segmentation to Faster R-CNN.
Although neural-network-based detection and segmentation technology has matured, real-time processing still relies on powerful computing capacity. VSLAM has strict real-time requirements, so efficiently separating the needed objects and their semantic information from the environment will remain a long-term, hard task. With semantic extraction in place as the basis of semantic VSLAM, we now turn to the influence of semantic information on different aspects of VSLAM, elaborated below for localization, mapping, and dynamic object removal. Object detection and semantic segmentation are both means of extracting semantic information from images.
Table 11 shows the contributions of some algorithms. Object detection is faster than semantic segmentation, while semantic segmentation is better in precision. Instance segmentation integrates object detection and semantic segmentation and has outstanding precision, but cannot guarantee running speed. For schemes without an original paper, we provide the open-source code, such as YOLOv5.
4.2.2. Semantic with Location
Localization accuracy is one of the most basic assessment standards for a SLAM system and a precondition for mobile robots to perform many tasks [225]. Introducing environmental semantic information can effectively mitigate the scale uncertainty and cumulative drift in visual SLAM localization, thus improving localization accuracy to varying degrees [226].
Bowman et al. [177] formulated a joint optimization over sensor states and semantic landmark positions that integrates metric information, semantic information, and data association. After obtaining semantic information from target detection, they introduced Expectation-Maximization (EM) and computed the probability of each data association from the semantic classification results. They successfully converted semantic SLAM into a probabilistic problem and improved the localization accuracy of the SLAM system. However, the paper makes several strong assumptions, such as that the projection of an object's 3D center lies close to the center of its detection box, which is not easy to meet in practice.
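Schematically, the alternation in [177] can be summarized as follows (our paraphrase, with $z_k$ the semantic measurements, $\mathcal{X}$ the sensor states, $\ell_j$ the landmarks, and $w_{kj}$ the soft data-association weights):

```latex
% E-step: soft data-association weights from the current estimate
w_{kj} \propto p\!\left(z_k \mid \mathcal{X}^{(t)}, \ell_j^{(t)}\right)

% M-step: re-estimate states and landmarks under those weights
\mathcal{X}^{(t+1)}, \mathcal{L}^{(t+1)}
  = \arg\max_{\mathcal{X},\,\mathcal{L}} \sum_{k}\sum_{j}
    w_{kj} \log p\!\left(z_k \mid \mathcal{X}, \ell_j\right)
```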
In 2020, Zhao et al. [227] of Xi'an Jiaotong University proposed a landmark-based visual semantic SLAM system for large-scale outdoor environments. Its core is to combine the 3D point cloud in ORB-SLAM with semantic segmentation information from the convolutional neural network model PSPNet-101, enabling the construction of a 3D semantic map of a large-scale environment. They proposed a method to associate real-world landmarks with the point cloud map: architectural landmarks are associated with the semantic point cloud, and landmarks obtained from Google Maps are associated with the semantic 3D map for urban area navigation. With the help of the semantic point cloud, the system achieves landmark-based relocalization over wide outdoor areas without GPS information; the process is shown in Figure 20. In 2018, ETH Zurich proposed VSO [228], which uses semantic information for autonomous driving scenarios. This scheme addresses visual SLAM localization under outdoor lighting changes: it establishes constraints between semantic information and images, exploiting the fact that semantic identity is unaffected by viewing angle, scale, and illumination. Similarly, Stenborg et al. [229] proposed solutions to such problems.
In trajectory estimation, geometric features can provide only short-term constraints on camera pose, which produces large deviations over wide areas. In contrast, objects, as higher-level features, keep their semantic identity under changes in light intensity, observation distance, and angle: a table is still a table under any light and from any angle, and this stability can provide long-term constraints for the camera pose. In addition, semantic SLAM can effectively alleviate the sensitivity of traditional visual SLAM to illumination changes, which undermines the robustness of system localization. We believe VSLAM localization is essentially camera pose estimation, and semantic information can improve the positioning accuracy of traditional VSLAM systems under strong illumination and large camera rotation. In practice, however, introducing semantic information inevitably slows down the whole system, which is an urgent problem for VSLAM. We believe that in most cases traditional VSLAM still performs well in localization accuracy, but semantic assistance in improving it remains worthy of research.
Table 12 compares traditional and semantic methods for VSLAM localization.
4.2.3. Semantic with Mapping
Another key juncture of VSLAM and deep learning is semantic map construction, on which most semantic VSLAM systems are based [230]. For a robot to understand the environment as a human does and perform different tasks from place to place requires capabilities beyond what a geometric map can provide [231]. Robots should have a human-centered understanding of their environment: in the future they will need to distinguish a room from a hallway, or the different functions of a kitchen and a living room [232]. Therefore, semantic attributes involving human concepts, such as room types, objects, and their spatial layout, are considered necessary attributes of future robots [233]. In recent years, with the rapid development of deep learning, semantic maps containing such information have gradually come into view [234]. The semantic map in a semantic SLAM system lets robots obtain geometric information such as environmental feature points while also identifying objects in the environment and obtaining semantic information such as location, attributes, and category. Compared with maps constructed by traditional VSLAM, this equips the robot with perceptual ability, which is significant for dealing with complex environments and completing human–computer interaction [235]. Semantic map construction is one of the hot topics in SLAM research [236]. In 2005, Galindo et al. [237] proposed the concept of a semantic map; as shown in Figure 21, it is represented by two parallel layers, a spatial representation and a semantic representation, giving robots a human-like ability to reason about the environment (for example, that a bedroom is a room containing a bed). Later, Vasudevan et al. [238] further advanced the understanding of semantic maps.
In recent years, as deep learning has developed rapidly, more and more researchers have combined it with SLAM, using target detection, semantic segmentation, and other algorithms to obtain semantic information about the environment and integrate it into the map, constructing an environmental semantic map [239]. As shown in Figure 22, research on semantic map construction falls into two main directions: scene-oriented and object-oriented semantic map construction.
Most scene-oriented semantic maps are built with deep learning methods that map 2D semantic information onto 3D point clouds; such maps can help robots better understand their environment [240]. In 2020, MIT proposed Kimera [241], a mature scene-oriented semantic SLAM system. Ref. [242] proposed a scene-oriented semantic map construction algorithm: based on RTAB-Map [243], YOLO is used for target detection; after roughly estimating the object's position, the Canny operator detects the edges of the target object in the depth image, and region-growing-based edge processing achieves accurate segmentation of the object. Through this non-deep-learning segmentation, they avoided the heavy computational load of traditional semantic map construction and built the scene-oriented semantic map in real time. A scene-oriented semantic map helps the robot better understand the environment and yields a more expressive environment map. However, this approach provides little support for the robot to interact with individual entities in the environment, which restricts the degree of robot intelligence to a certain extent [244]. In addition, such algorithms need pixel-level semantic segmentation of objects in the scene, which entails heavy computation and poor real-time performance. Therefore, some scholars have turned to object-oriented semantic map construction algorithms [245].
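The core back-projection step of scene-oriented semantic mapping can be sketched as follows (assuming a depth image, a label map from a segmentation network such as the one above, and a hypothetical intrinsic matrix K):

```python
# Lift each labeled pixel to a 3D point in the camera frame, producing a
# semantically labeled point cloud that can be fused into the global map
# using the camera pose.
import numpy as np

def semantic_point_cloud(depth, label_map, K):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0                              # skip pixels without depth
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]         # pinhole inverse projection
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    points = np.stack([x, y, z], axis=1)           # (N, 3) 3D points
    labels = label_map[valid]                      # (N,) semantic class ids
    return points, labels
```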
An object-oriented semantic map contains only the semantic information of individual instances, with each entity's semantics stored independently through clustering [246]. This type of map allows robots to operate on and maintain the semantic information of each entity on the map, making it easier for them to understand the environment and interact with entities in it, and improving the practicality of the map [247]. Ref. [45] proposed a voxel-based semantic visual SLAM system built on Mask R-CNN and KinectFusion: after object detection with Mask R-CNN, the detection results are fused with the TSDF model based on voxel-foreground reasoning to construct an object-oriented semantic map. Although detection accuracy is guaranteed, the algorithm's poor real-time performance remains unsolved. Ref. [248] proposed a lightweight object-oriented SLAM system, which effectively addresses data association and pose estimation and solves the real-time problem of the above methods. Its core framework is developed from ORB-SLAM2, with YOLOv3 as the object detector in a fused semantic thread; in the tracking thread, bounding box, semantic label, and point cloud information are fused to construct an object-oriented semi-dense semantic map. Experimental results show that, compared with ORB-SLAM2, the scheme can handle multiple object classes with different scales and orientations in a complex environment and can express the environment better; however, accurate pose estimation is not possible for some large objects. Similarly, University College London proposed DSP-SLAM [249].
At present, most semantic map construction methods need to handle instance segmentation and semantic segmentation at the same time, which leads to poor real-time performance [250].
Table 13 lists some semantic map construction work. In addition, when dealing with dynamic objects, most algorithms achieve robustness by eliminating them, which makes the system lose much useful information. SLAM for dynamic scenes therefore remains an urgent problem to be solved [251].
4.2.4. Elimination of Dynamic Objects
Traditional VSLAM algorithms assume that objects in the environment are static or slow-moving, which limits the applicability of VSLAM systems in real scenes [258]. When dynamic objects exist in the environment (such as people, vehicles, and pets), they introduce erroneous observations and reduce the accuracy and robustness of the system [259]. Traditional methods suppress the influence of some outliers through the RANSAC algorithm; however, if dynamic objects occupy most of the image area or move quickly, reliable observation data still cannot be obtained [260]. As shown in Figure 23, dynamic objects prevent the camera from capturing accurate data, so how to address the impact of dynamic objects on the SLAM system has become the goal of many researchers.
Current solutions to the disturbance caused by dynamic objects are broadly consistent: before visual odometry, target detection and image segmentation algorithms filter out the dynamic regions of the image; the remaining static environment points are then used to compute the camera poses between nearby frames and to construct a map containing semantic information [261]. Figure 24 shows a typical structure. Although the influence of dynamic objects cannot be completely eliminated, the robustness of the system is greatly improved.
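A minimal sketch of this masking step is shown below, using torchvision's Mask R-CNN; the set of dynamic classes and the thresholds are illustrative choices, not those of any particular system:

```python
# Drop keypoints falling inside segmentation masks of dynamic classes
# before pose estimation.
import torch
import numpy as np
from torchvision.models.detection import (
    maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights)

seg = maskrcnn_resnet50_fpn(weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT).eval()
DYNAMIC_CLASSES = {1}                              # COCO id 1 == person; extend as needed

@torch.no_grad()
def static_keypoints(img_tensor, keypoints):       # keypoints: OpenCV KeyPoint list
    out = seg([img_tensor])[0]
    mask = np.zeros(img_tensor.shape[1:], dtype=bool)
    for m, lbl, score in zip(out["masks"], out["labels"], out["scores"]):
        if int(lbl) in DYNAMIC_CLASSES and score > 0.5:
            mask |= (m[0] > 0.5).numpy()           # union of dynamic regions
    return [kp for kp in keypoints                 # keep only static points
            if not mask[int(kp.pt[1]), int(kp.pt[0])]]
```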
In 2018, Bescos et al. [262] proposed DynaSLAM, a visual SLAM algorithm for dynamic scenarios based on ORB-SLAM2. The system provides interfaces for monocular, stereo, and RGB-D cameras. For monocular and stereo cameras, Mask R-CNN segments dynamic objects in each frame so that features are not extracted on them in the SLAM system; with an RGB-D camera, multi-view geometry is additionally used for more accurate motion segmentation. Dynamic segments are removed from the current frame and the map. However, this method removes all potentially moving objects, such as parked cars, which may leave too few stationary feature points and affect camera pose estimation. In the same year, a Tsinghua University team proposed DS-SLAM [263], a complete SLAM system based on ORB-SLAM2. Its core is a semantic segmentation network added to ORB-SLAM2 and running in real time as a separate thread. It removes dynamic objects from the scene via segmentation and creates a separate thread to build a dense semantic octree map, helping the robot accomplish higher-level tasks.
Some methods use semantic information to mask out objects considered dynamic. Although this mitigates the influence of dynamic objects to a certain extent, such a one-size-fits-all approach may cause the system to lose many useful feature points. For example, a car parked on the roadside may be regarded as a dynamic object and all the feature points on it filtered out [264], even though a car stationary on the side of the road can serve as a reliable, even major, source of high-quality feature points. Ref. [265] proposed integrating semantic information into traditional VSLAM without explicit motion detection: a confidence measure gives each object a motion probability used to judge whether it is actually moving. The semantic label distribution is combined with the observation consistency of map points to estimate the reliability of each 3D point measurement, which is then used in the pose estimation and optimization steps. This method can handle objects that belong to dynamic classes but are stationary, such as cars parked on the side of the road. Ref. [266] removes dynamic objects based on optical flow. Built on ORB-SLAM2, its front end uses four CNN networks to simultaneously predict the depth, pose, optical flow, and semantic mask of each frame. By synthesizing the rigid optical flow from depth and pose and comparing it with the estimated optical flow, an initial motion region is obtained. The algorithm can distinguish moving objects from the current scene and retain the feature points of static objects, avoiding the tracking failures of SLAM systems that remove objects based on class attributes alone. The article [267] presented a visual SLAM system built on ORB-SLAM2 that performs robustly and accurately in dynamic environments by discarding moving feature points with the help of semantic information obtained by Mask R-CNN and depth information provided by an RGB-D camera. It tries to exploit more reliable feature points for camera pose estimation by identifying the static feature points on movable objects, which is a great benefit when static objects cannot provide enough feature points in the scene.
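The rigid-flow test used in [266] can be sketched as follows (our simplified rendering: the flow that camera motion alone would induce is synthesized from depth and relative pose and compared with the estimated flow; shapes and threshold are illustrative):

```python
# Flag pixels whose flow residual exceeds a threshold as truly moving.
import numpy as np

def moving_mask(depth, flow_est, R, t, K, thresh=3.0):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    pts = np.linalg.inv(K) @ pix * depth.reshape(-1)     # back-project pixels
    pts2 = R @ pts + t[:, None]                          # apply camera motion
    proj = K @ pts2
    proj = (proj[:2] / proj[2]).T.reshape(h, w, 2)       # re-project to pixels
    rigid_flow = proj - np.stack([u, v], axis=-1)        # flow from motion alone
    residual = np.linalg.norm(flow_est - rigid_flow, axis=-1)
    return residual > thresh            # True where the object itself moves
```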
Semantic information can help the system cope with the interference of dynamic objects, but at the cost of high computational resource consumption. Existing schemes are generally not real-time enough to be widely deployed on practical robots, so their application scenarios are greatly limited [268]. In addition, semantic information may not be available at the camera frame rate and may not always provide accurate data [269]. Assigning an image region to the wrong semantic class may unnecessarily exclude it from pose estimation, which can be critical in a sparsely textured environment [270]. Current solutions to this problem focus on using methods such as optical flow to detect objects that are actually moving in the scene [271]. Although existing algorithms have achieved good results on data sets, they have not yet proven reliable in practical engineering.
Table 14 shows recent VSLAM algorithms that use deep neural networks to improve performance in dynamic environments.
5. Conclusions and Prospect
Simultaneous localization and mapping is a major research problem in the robotics community, where a great deal of effort has been devoted to developing new methods that maximize robustness and reliability. Vision-based SLAM technology has matured over many years of development, and many excellent algorithms have emerged and been successfully applied in fields such as robotics and UAVs. The rapid development of deep learning has driven innovation throughout computing, and its combination with SLAM has become an active research field; research on VSLAM has therefore received more and more attention. In addition, with the advent of the intelligent era, higher requirements are placed on the autonomy of mobile robots; to realize advanced environmental perception, semantic VSLAM has been proposed and has developed rapidly. Traditional VSLAM recovers only the geometric features of the environment when constructing a map, which cannot meet the requirements of robot navigation, human–computer interaction, autonomous exploration, and other applications. Early semantic map construction generally matched against a model library that had to be built in advance, a serious limitation that hinders popularization and application. With improving computer performance and the rapid development of deep learning, VSLAM is now combined with deep learning to fill the deficiencies of the traditional VSLAM system. In recent years, as the most promising and advantageous computer vision methodology, deep learning has attracted wide attention from SLAM researchers. In a semantic SLAM system, environmental semantic information can be learned by deep learning techniques directly from pre-trained image sets and images perceived in real time; deep learning also makes better use of large data sets, giving the system greater generalization ability. When constructing a semantic map, the semantic SLAM system can use deep learning to detect and classify objects in the environment and build a map with richer information, which has better practicality.
In this article, we have surveyed most of the state-of-the-art visual SLAM solutions that use features to localize the robot and map its surroundings. We classify them according to the feature types on which feature-based visual SLAM methods rely: traditional VSLAM and VSLAM combined with deep learning. The strengths and weaknesses of each category are thoroughly investigated and, where applicable, the challenges that each solution overcomes are highlighted. This work demonstrates the importance of using vision as the only external perceptual sensor to solve SLAM problems, mainly because the camera is an ideal sensor: it is light, passive, low-power, and capable of capturing rich and distinctive information about a scene. However, relying on vision requires algorithms that perform reliably and consistently under variable lighting conditions, in the presence of moving people or objects, in featureless areas, during transitions between day and night, and under other unforeseen circumstances. Therefore, SLAM systems using vision as the only sensor remain a challenging and promising research area. Image matching and data association are still open research problems in computer vision and robot vision, respectively. The choice of detectors and descriptors directly affects the system's ability to track salient features, recognize previously seen areas, build a consistent environmental model, and operate in real time. Long-term navigation, in particular, demands reliable data association despite a growing database and a constantly changing, complex environment. Accepting bad associations causes serious errors throughout the SLAM system, making pose computation and map construction inconsistent.
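As a concrete example of where the detector/descriptor choice and data association enter the pipeline, the following sketch (illustrative only; the function name is ours) matches ORB features between two frames and applies Lowe's ratio test to discard ambiguous associations before they can corrupt pose estimation.

```python
import cv2

def match_frames(img1, img2, ratio=0.75):
    """Detect ORB features in two grayscale frames and return the
    keypoints plus the matches that survive the ratio test."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    # Hamming distance is the appropriate metric for binary descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    matches = matcher.knnMatch(des1, des2, k=2)
    good = []
    for pair in matches:
        # Keep a match only if it is clearly better than the runner-up.
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return kp1, kp2, good
```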
In addition, we highlight the development of VSLAM that fuses semantic information. VSLAM systems combined with semantic information achieve better results in terms of robustness, precision, and high-level perception. More attention will be paid to semantic VSLAM research, and semantic VSLAM will fundamentally improve the autonomous interaction ability of robots.
Drawing on other studies, we offer the following prospects for the future development of VSLAM:
(1) Engineering application. After decades of development, VSLAM has been widely used in many fields such as robotics. However, SLAM remains sensitive to environmental illumination, high-speed motion, motion interference, and other disturbances, so improving system robustness and building large-scale maps over long periods remain worthwhile challenges. The two main application scenarios of SLAM are embedded platforms, such as smartphones and drones, and 3D reconstruction and scene understanding combined with deep learning. How to balance real-time performance and accuracy is an important open question. Solutions for dynamic, unstructured, complex, uncertain, and large-scale environments remain to be explored.
(2) Theoretical support. The features learned through deep learning still lack intuitive meaning and clear theoretical guidance. At present, deep learning is mainly applied to local sub-modules of SLAM, such as depth estimation and loop-closure detection; how to apply deep learning to the entire SLAM system remains a big challenge. Traditional VSLAM still has advantages in positioning and navigation. Although some modules of traditional methods have been improved by deep learning, the scope of these improvements is generally narrow: a method may achieve good results on some datasets yet prove unstable in other scenes. Moreover, the positioning and mapping process involves many mathematical formulations, where deep learning has drawbacks, and when little data is available for training, deep learning offers no significant advantage over the traditional SLAM framework and cannot yet replace the main algorithms of SLAM technology. In the future, SLAM will gradually absorb deep learning methods and improved training datasets to increase the accuracy and robustness of positioning and mapping.
(3) High-level environmental information perception and human–computer interaction. With the further development of deep learning, the research and application of semantic VSLAM will have huge room for development. In the coming intelligent era, people's demand for intelligent autonomous mobile robots will increase rapidly, and using semantic VSLAM technology to improve the autonomy of robots will be a long-term and difficult task. Although there have been some excellent achievements in recent years, semantic VSLAM is still in the development stage compared with classical VSLAM algorithms. Currently, there are few open-source solutions for semantic SLAM, and its application is still at an initial stage, mainly because constructing an accurate semantic map requires substantial computing resources, which severely interferes with the real-time performance of SLAM. With the continuous improvement of hardware in the future, the poor real-time performance of semantic SLAM systems may be greatly alleviated.
(4) Establish a sound evaluation system. Semantic VSLAM technology has developed rapidly in recent years. However, compared with traditional VSLAM, it still lacks complete evaluation criteria. In SLAM research, ATE (Absolute Trajectory Error) or RPE (Relative Pose Error) is generally used to evaluate system performance. However, both criteria are based on the pose estimation results of the SLAM system, and there is no universally recognized, reliable criterion for evaluating the quality of map construction. For a semantic SLAM system, how to evaluate the accuracy of the acquired semantic information and the quality of the constructed semantic map are issues that future evaluation criteria should address. Furthermore, evaluating only by subjective indicators is not a long-term solution. How to establish systematic evaluation indicators for semantic VSLAM will be a hot topic in the future.
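For concreteness, the two pose-based metrics mentioned above can be computed as in the following sketch, which follows the commonly used TUM-benchmark definitions (translational components only). It assumes the trajectories have already been timestamp-associated and, for ATE, aligned, e.g., with the Umeyama/Horn similarity alignment; the function names are ours.

```python
import numpy as np

def ate_rmse(gt_xyz, est_xyz):
    """Absolute Trajectory Error: RMSE of translational differences
    between aligned ground-truth and estimated positions (both Nx3)."""
    err = np.linalg.norm(gt_xyz - est_xyz, axis=1)
    return np.sqrt(np.mean(err ** 2))

def rpe_rmse(gt_poses, est_poses, delta=1):
    """Relative Pose Error over a fixed frame interval `delta`.
    Poses are Nx4x4 homogeneous transformation matrices."""
    errs = []
    for i in range(len(gt_poses) - delta):
        # Relative motion over the interval, in each trajectory.
        gt_rel = np.linalg.inv(gt_poses[i]) @ gt_poses[i + delta]
        est_rel = np.linalg.inv(est_poses[i]) @ est_poses[i + delta]
        # Discrepancy between the two relative motions.
        diff = np.linalg.inv(gt_rel) @ est_rel
        errs.append(np.linalg.norm(diff[:3, 3]))  # translational part
    return np.sqrt(np.mean(np.array(errs) ** 2))
```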