For decades, methods in the field of LIDAR and visual SLAM have flourished, but each has its limitations and is easily affected by external factors; therefore, more and more researchers are focusing on the integration of the two. By combining LIDAR's accurate depth estimation with the camera's powerful feature tracking ability, this type of fusion offers many advantages. However, even the fusion of LIDAR and visual SLAM accumulates large errors under high-speed motion. Inertial sensing units, with their low price and excellent performance, have therefore become the first choice to compensate for this defect, and fusion methods combining the three types of sensors have gradually emerged. Although multi-sensor fusion appears on the surface to combine the complementary advantages of different sensors, it is in essence the fusion of different algorithms, whose advantages are expressed through the sensors. Based on the existing literature on fusion methods, this paper analyzes multi-sensor fusion in terms of methods based on uncertainty, traditional features, and deep learning.
4.3.1. Fusion Methods Based on Uncertainty
Uncertainty-based methods are usually used in 2D visual–LIDAR fusion SLAM. At present, there are three mainstream approaches: the Kalman filter (KF), the particle filter, and graph-based algorithms, together with their derivatives. The Kalman filter and particle filter are two different implementations of the Bayesian filter. The Kalman filter consists mainly of a prediction step and an update step, but it cannot handle nonlinear problems; therefore, researchers developed the extended Kalman filter (EKF). The EKF achieves good results in small- and medium-scale scenes, but for large maps, it leads to a huge amount of computation. The unscented Kalman filter (UKF) is another good solution to nonlinear problems. However, the KF and its variants can only deal with Gaussian distributions, and when facing an arbitrary distribution, they introduce larger errors. Particle-filter-based methods address arbitrary distributions by representing them with multiple samples (particles); a region containing a larger number of particles has a higher probability. Graph-based SLAM, on the other hand, finds the relationship between poses by minimizing the sum of squared errors of the pose constraints.
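Since the EKF recurs throughout this subsection, a minimal sketch of its prediction and update steps is given below for reference; the interfaces and the generic motion and observation models f and h are illustrative and do not correspond to any specific cited system.

```python
import numpy as np

def ekf_predict(x, P, f, F_jac, Q):
    """Propagate the state with the (nonlinear) motion model f and its Jacobian."""
    x_pred = f(x)
    F = F_jac(x)
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def ekf_update(x_pred, P_pred, z, h, H_jac, R):
    """Correct the prediction with a sensor measurement z (e.g., a LIDAR or camera observation)."""
    H = H_jac(x_pred)
    y = z - h(x_pred)                       # innovation
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x = x_pred + K @ y
    P = (np.eye(len(x)) - K @ H) @ P_pred
    return x, P
```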
(a) A fusion method based on the KF and particle filter
In 2006, P. Newman et al. [
110] mounted a LIDAR and a camera together on a mobile robot for the first time; the LIDAR acquired the local geometry of the environment, which was used to incrementally construct a 3D point cloud map of the workspace. Data fusion was performed using the standard EKF equations, with the observations applied to the state vector. During loop closure detection, the camera sequence is used for detection, and local LIDAR scans are then used again to process the images involved in loop closure detection, which effectively eliminates the error generated in this process; however, the huge amount of computation remained difficult to resolve. In 2009, F. Malartre et al. [
128] developed a perception system combining visual and LIDAR SLAM. By adding LIDAR data to the EKF, the drift of visual SLAM was reduced, and a density-controlled digital elevation map (DEM) could be quickly recovered. In 2010, F. Sun et al. [
129] assumed that the sensor noise obeyed a Gaussian distribution and used the EKF to obtain the minimum-mean-square-error estimate of the system state. Visual data and LIDAR data sharing the same corner features were fused, and an active detection strategy was adopted to improve the accuracy of SLAM and obtain richer 3D map information.
The fusion of the two sensors has gradually matured, and scholars have deployed it on mobile robots, autonomous vehicles, and drones. Through continuous algorithmic improvement, performance unattainable with a single sensor has been achieved.
In 2007, L. Iocchi et al. [
130] used a particle filter to estimate the displacement between local maps for the mapping problem of large indoor environments. They mainly used a binocular camera to measure plane displacement, supplemented with 2D LIDAR data, which cooperated with high-precision IMU sensors to successfully construct a low-cost 3D map. The study of [
131] used LIDAR data as the input to a binocular vision system applied to complex intersection scenes, with the risk levels of vehicles and other dynamic objects as the output, and used a particle filter to solve the localization problem (each particle corresponds to a vehicle location, and LIDAR data are used to compute the probability of each particle). This method obtained good detection results and further demonstrated the broad prospects of fusing different types of sensors, which has attracted widespread attention from researchers. Multi-sensor fusion for UAVs has also made breakthrough progress. In 2013, J. Collier et al. [
132] used SIFT and variable-dimension local shape descriptor (VD-LSD) features to train bag-of-words models for the LIDAR and visual sensors based on the FAB-MAP algorithm and performed place recognition on a UAV. Even under poor lighting or in low-texture scenes, the approach achieves good recall and accuracy, but when the UAV flies too fast, feature tracking can easily fail.
Figure 16 shows the multi-sensor fusion framework based on the FAB-MAP algorithm, for the readers’ convenience.
To address this problem, D. Magree and Johnson [
133] used visual- and LIDAR-assisted navigation. The navigation architecture was based on an EKF that provides sensor updates for the UAV, coupled at the scan and point-correspondence levels, which reduced the impact of the ambiguous geometry generated by rapid UAV flight. This work also motivated Monte Carlo approaches to LIDAR-based SLAM and the application of vision to scan matching. S. Wang et al. [
134] improved the Monte Carlo localization (MCL) method and applied it to the robot pose estimation procedure. They proposed a localization algorithm in which 2D structure extracted from a 3D point cloud is used to align 2D LIDAR scans, so that the robot can be localized within the map while a local map is maintained at the same scale.
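To make the particle-filter variants above concrete, a minimal Monte Carlo localization step is sketched below; the motion and likelihood functions are placeholders and do not reproduce the implementations of the cited works.

```python
import numpy as np

def mcl_update(particles, weights, motion, scan, likelihood):
    """One Monte Carlo localization step: propagate particles with the motion
    estimate (e.g., from visual odometry), weight them by the LIDAR scan
    likelihood, and resample."""
    # Predict: apply the motion increment (with noise) to each pose hypothesis.
    particles = np.array([motion(p) for p in particles])
    # Update: weight each particle by how well the scan fits the map at that pose.
    weights = np.array([likelihood(scan, p) for p in particles])
    weights /= weights.sum()
    # Resample: concentrate particles in high-probability regions.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```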
Traditional fusion methods combine LIDAR and visual SLAM only coarsely yet still achieve good results; more and more researchers have confirmed that selecting and combining sensor types more carefully can produce even better results. S. Huh et al. [
135] deployed a monocular camera, LIDAR, and an inertial sensor on an unmanned aerial vehicle (UAV), used visual markers to calibrate the camera and LIDAR, and developed an EKF-based real-time navigation algorithm that requires no a priori knowledge of the environment. E. Lopez et al. [
136] built on the then state-of-the-art monocular visual SLAM methods (LSD-SLAM and ORB-SLAM) to develop a SLAM system that integrates visual, LIDAR, and inertial measurements through an EKF that optimizes the 6D pose estimate; local 2.5D maps and footprint estimates of the robot's position can be obtained, improving the ability of low-cost commercial aerial platforms to build pose and environment maps in real time on board. Y. Bi et al. [
137] fused a depth camera and LIDAR and deployed them on a UAV. Hector SLAM was used to determine the relative position and orientation of the UAV, and the Gauss–Newton method was used to find the best transformation between the current scan and the map. This system can carry out positioning, mapping, planning, and flight in unknown indoor environments; accurate landing on visual targets can be achieved, and all computations are performed in real time on board the aircraft. V. De Silva et al. [
138] solved the problem of fusing the outputs of a LIDAR and a wide-angle monocular camera for free-space detection by first spatially aligning the sensor outputs and then using a Gaussian process (GP) regression resolution-matching algorithm to interpolate the missing data with quantified uncertainty. This data fusion method significantly improves the perception of unmanned vehicles and mobile robots. B.P.E. Vasquez et al. [
139] installed a LIDAR, camera, and radio frequency identification (RFID) system on the mobile robot Doris and improved the positioning accuracy of the robot in a real environment through the EKF. The robot can be localized using only sensor fusion and a semantic map, without mapping the whole environment as a point cloud, so SLAM does not need to be used. One study even put forward the bolder idea of applying SLAM technology to a simulated spacecraft landing on the moon [
140]. This study combined LIDAR and monocular camera images to eliminate the error caused by scale uncertainty in the spacecraft landing problem. The unscented Kalman filter (UKF) was used to provide state estimates based on inertial and star-tracker camera data at various stages of the spacecraft's approach to the moon. A visual localization algorithm based on 2D–3D point correspondences and LIDAR ranges was proposed, which, compared with purely optical navigation, can be initialized without systematic errors.
At present, fusion algorithms can run stably on UAVs and mobile robots, but many problems remain that cannot be ignored, such as the large amount of computation, the complexity of the mapping process, and low positioning accuracy. To solve these problems, researchers have carried out numerous experiments and made many improvements to the front-end odometry, back-end optimization, loop closure detection, and mapping steps.
The fusion methods of front-end odometry mainly include those presented in [
64,
141,
142,
143]. The study of [
64] combined visual and LIDAR odometry and proposed V-LOAM, an online ego-motion estimation method. At high frequency, ego-motion is estimated by visual odometry, and at low frequency, LIDAR odometry refines the motion estimate and corrects drift. Accurate and robust motion estimation can be achieved even during high-speed motion and dramatic illumination changes. Additionally, this method can work with different types of ranging sensors: for example, it can improve the positioning accuracy of fisheye cameras suffering from severe distortion caused by their large viewing angles.
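A minimal sketch of this two-rate scheme (high-frequency visual odometry increments corrected by low-frequency LIDAR odometry) is given below; the class and method names are illustrative and do not correspond to the actual V-LOAM implementation.

```python
import numpy as np

class TwoRateOdometry:
    """Accumulate high-rate visual odometry increments and periodically
    replace the drifting estimate with a low-rate LIDAR odometry pose."""

    def __init__(self):
        self.pose = np.eye(4)          # 4x4 homogeneous world-from-body transform

    def on_visual_increment(self, dT):
        # High-frequency update: compose the latest camera-rate motion increment.
        self.pose = self.pose @ dT

    def on_lidar_pose(self, T_lidar):
        # Low-frequency correction: LIDAR scan matching yields a drift-corrected
        # pose that overrides the accumulated visual estimate.
        self.pose = T_lidar
```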
The study of [
143] developed a low-cost positioning system combining binocular vision with the multi-state constraint Kalman filter (MSCKF) visual-inertial odometry (VIO) and LIDAR information provided by a prior 3D map, which greatly improved the performance of standard visual-inertial odometry and reduced the positioning error to an acceptable range. The study of [
143] proposed a new method of fusing visual and LIDAR data in odometry. A LIDAR voxel map and a visual map of map points were constructed, and both maps were integrated into the odometry measurement residuals to eliminate the errors caused by assigning LIDAR depths to non-corresponding visual features. A large number of geometric residuals obtained from the LIDAR were used instead of a single linearized residual, which greatly accelerated the iterative optimization of a Levenberg–Marquardt-type algorithm and yielded a more accurate pose estimate.
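To make the notion of geometric LIDAR residuals concrete, a generic point-to-plane formulation is sketched here; this is a standard construction, not the exact residual used in the cited work. A LIDAR point \(\mathbf{p}_i\), transformed by the estimated rotation and translation \((\mathbf{R},\mathbf{t})\), is penalized by its distance to a locally fitted plane with unit normal \(\mathbf{n}_i\) passing through \(\mathbf{q}_i\), and the stacked residuals are minimized with a Levenberg–Marquardt-style update:

\[
r_i(\mathbf{R},\mathbf{t}) = \mathbf{n}_i^{\top}\!\left(\mathbf{R}\mathbf{p}_i + \mathbf{t} - \mathbf{q}_i\right), \qquad
\min_{\mathbf{R},\mathbf{t}} \sum_i r_i^2, \qquad
\Delta\boldsymbol{\xi} = -\left(\mathbf{J}^{\top}\mathbf{J} + \lambda\mathbf{I}\right)^{-1}\mathbf{J}^{\top}\mathbf{r},
\]

where \(\mathbf{J}\) is the Jacobian of the stacked residual vector \(\mathbf{r}\) with respect to the pose parameters \(\boldsymbol{\xi}\) and \(\lambda\) is the damping factor.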
The study of [
142] proposed a fully automatic end-to-end method based on the 3D–2D joint correspondence mask (CoMask), which can directly estimate the extrinsic parameters with high accuracy. A genetic algorithm was combined with the Levenberg–Marquardt method to solve the global optimization problem without any initial estimate. The general framework of the study is presented in
Figure 17, wherein different colors are used to represent different steps to facilitate a better understanding by the readers.
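For readers unfamiliar with extrinsic calibration, the quantity being estimated is the rigid transform \((\mathbf{R},\mathbf{t})\) that maps LIDAR points into the camera frame so that they project onto their corresponding pixels; the following is the generic pinhole relation rather than the specific CoMask formulation:

\[
s\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{K}\left(\mathbf{R}\,\mathbf{p}_{\mathrm{LIDAR}} + \mathbf{t}\right),
\]

where \(\mathbf{K}\) is the camera intrinsic matrix and \(s\) the projective depth. Calibration seeks the \((\mathbf{R},\mathbf{t})\) that minimizes the reprojection error over all 3D–2D correspondences.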
High-precision mapping requires very rich information, and the rich texture information provided by visual algorithms offers great advantages for relocalization. The information carried by the LIDAR point cloud, however, has not been deeply mined in the works above. For high-end, long-range, multi-line LIDARs, the returned point cloud contains not only direction and range information but also the reflectance of the target point. When the lines are numerous and dense, the data composed of reflectance values can be regarded as a type of texture information. Once this information is integrated into the high-precision map, the map can switch seamlessly between its point cloud and texture forms, which is also the research direction of some teams abroad [
144].
In 2017, M. Shin et al. [
145] proposed a point-cloud–image registration method fusing LIDAR, an inertial sensing unit, and a camera. The SLAM method used was LOAM. By combining the odometry information obtained from the LIDAR with the 3D positions of selected image feature points, the co-located positions in the two maps were extracted, and the rigid transformation between the origins of the two maps was accurately estimated. In the same year, M. Chen et al. [
146] used a 2D LIDAR to realize real-time 3D mapping: poses were estimated through visual-inertial fusion, the point cloud was converted into the world frame in real time using these pose estimates, and the robot's motion was accurately estimated through EKF-based, IMU-assisted visual SLAM. Subsequently, Z. Zhu et al. [
147] proposed a visualization method based on 3D LIDAR SLAM, which uses LIDAR to provide poses and undeformed point cloud information for visual keyframes and simultaneously detects and corrects loop closures. Compared with the original LOAM system, the accuracy is improved, and the motion drift generated by pose estimation is effectively reduced. The great potential of LIDAR point cloud information has attracted the attention of more and more researchers, who have begun to use point cloud information in combination with other information. By registering photometric images with the geometric point cloud, K. Huang et al. [
148] proposed a direct LIDAR–visual odometry method that combines the outputs of the different sensors in an image-patch, single-plane pixel alignment formulation to compute accurate inter-frame motion estimates and provide accurate projections of obstructions. Experiments on the KITTI dataset produced consistently registered colored point clouds. G. Zhou et al. [
149] proposed a visual localization algorithm combining points and lines. Using the most advanced simultaneous localization and mapping algorithms of the time (such as LIO-SAM, LVI-SLAM, and Fast-LIO), 2D lines are extracted online from the image, and 3D lines are extracted offline from the 3D LIDAR map. Sparse 3D points are obtained by visual odometry, and the poses are continually corrected by minimizing the reprojection errors of 2D–3D lines and 3D–3D points; ultimately, drift-free positioning can be achieved. Various researchers have tried to add point cloud information to direct methods and achieved impressive results. J. Qian et al. [
150] proposed a new direct odometry method in which the image data obtained by the vision sensor are combined with the sparse point cloud obtained by the LIDAR, with the relative pose as a prior. With this method, high-accuracy, robust positioning and mapping of a UAV are realized. W. Wang et al. [
151] proposed a direct visual–LIDAR fusion SLAM framework, which includes a frame-by-frame tracking module, an improved sliding window-based refinement module, and a parallel global and local search loop closure detection (PGLS-LCD) module, and combines the bag-of-visual-words (BoW) and LIDAR iris features for location recognition. Finally, a high-precision real-world trajectory and point cloud maps can be generated. The framework diagram for generating high-precision point cloud maps using direct methods is presented in
Figure 18 for the readers’ convenience.
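The direct methods referenced above share a common core: the LIDAR supplies the depth of selected pixels, and the relative pose \(\mathbf{T}\) between frames is found by minimizing a photometric error rather than by matching features. A generic form of this objective (not the exact formulation of any single cited system) is

\[
E(\mathbf{T}) = \sum_{i} \left\| I_{k}\!\left(\pi\!\left(\mathbf{T}\,\pi^{-1}(\mathbf{u}_i, d_i)\right)\right) - I_{k-1}(\mathbf{u}_i) \right\|^{2},
\]

where \(\mathbf{u}_i\) is a pixel in frame \(k-1\), \(d_i\) its LIDAR-derived depth, \(\pi\) the camera projection, and \(I_{k}\), \(I_{k-1}\) the image intensities.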
However, such cross-modal associations are easily affected by illumination, and different sensors observe the scene from different viewpoints, which are difficult to reconcile during feature extraction. To address this issue, A. Gawel et al. [
152] proposed a framework that uses structural descriptors to match LIDAR point clouds against sparse visual keypoints, an approach that is unaffected by sensor viewpoint and illumination changes. The framework takes two independent pipeline inputs, LIDAR–inertial data and visual–inertial data, and fuses the two when constructing the structural descriptors for feature matching, allowing it to adapt to different environments. J. Mo and J. Sattar [
153] proposed a SLAM method that performs place recognition with 3D LIDAR descriptors: a LIDAR sensor is used for place recognition of the 3D points obtained from stereo visual odometry. The system is more robust when the environment changes dramatically, and its place recognition accuracy and computational efficiency are better than those of traditional 2D methods.
Loop closure detection and back-end optimization are usually considered together because of the close connection between them. In 2016, Q. Wu et al. [
154] presented a recursive Bayesian filter that can handle arbitrary distributions using multiple samples. Through this method, they calibrated a 2D LIDAR and a panoramic camera, used visual loop closure detection to assist 2D SLAM, registered the panoramic camera images with the point cloud, and obtained a 3D map with RGB-D information. This method solves the positioning problem in indoor scenes without GPS signals and in relatively flat outdoor scenes. In the same year, R.O. Chavez-Garcia et al. [
155] proposed a complete perception fusion architecture based on an evidence framework (the architecture includes three main sensors: radar, LIDAR, and camera), which uses a Bayesian filter for state prediction. The authors addressed the problem of moving object detection and tracking by integrating composite representations and uncertainty management and carried out tests in real driving scenarios, drastically reducing the number of false detections and misclassifications. S.H. Chan et al. [
156] developed a robust localization method for lightweight indoor SLAM. In contrast with traditional feature matching, the algorithm uses a path-matching method, a curvature filter, and a pyramid filter to find the basic matrix between different trajectories and can be applied to any type of general SLAM fusion architecture. Even with cheap sensors, the fusion method achieves reasonably high localization accuracy and sufficiently robust navigation. Z. Jin et al. [
157] used the particle filter method to introduce a FastSLAM approach that fuses stereo visual images and 2D LIDAR data. Given a prior map, the submaps obtained by the particle filter were compared with one another, which effectively eliminated particles with large deviations and made the algorithm converge quickly, easing the construction of high-definition maps. When tested on the KITTI dataset, the estimated trajectory was closer to the ground truth than that of the popular ORB-SLAM. Y. Tao et al. [
158] proposed an EKF-based multi-sensor information fusion SLAM algorithm, which uses Bayesian inference and joint probability density estimation on fixed-time frames to fuse LIDAR RBPF-SLAM with monocular vision information and achieves high positioning accuracy in real scenes.
Similarly, direct methods are also widely used in loop closure detection and back-end optimization, and in 2018, great breakthroughs were achieved in this direction. R. Giubilato et al. [
159] solved the scale ambiguity problem of monocular vision by fusing the range data of the LIDAR altimeter in the monocular vision odometry framework. Using the keyframe-based tracking and optical flow mapping algorithm, the distance data were used as the scale constraint between keyframes, and the optimization algorithm based on iSAM2 was applied to the back-end trajectory optimization and map estimation, which can eliminate the scale drift before the loop closure detection process. Y. Kim et al. [
160] proposed a lightweight vision localization algorithm that matches the depth from a stereo disparity map to a 3D LIDAR map. Similar to the drift-compensation scheme of LSD-SLAM, depth residual minimization is applied to camera pose estimation, making the approach suitable for urban environments with weak GPS signals. Y. Shin et al. [
161] proposed a camera–LIDAR sensor system using a direct method, which applies a sliding-window approach in pose estimation to limit local drift, while global consistency is maintained with an appearance-based place recognition module and a pose graph optimizer. This system demonstrates the advantages of the direct method, which are especially evident when fusing low-resolution cameras with sparse LIDAR data; however, large changes in lighting conditions and fast-moving objects require further consideration. In 2020, ref. [
162] proposed a direct visual–LIDAR SLAM framework combining LIDAR ranging and a monocular camera for sparse depth measurements, jointly optimizing each measurement over multiple keyframes to make direct use of the sparse depth. This study addressed the unavailability of traditional keyframe-based methods in sparse-depth scenes. The method achieves robust SLAM results even with extremely sparse depth measurements (eight beams), but it is not applicable when illumination changes are severe. The proposed DVL-SLAM framework is presented in
Figure 19.
(b) Graph optimization-based fusion method
In addition to using direct methods, graph optimization methods can also be used to determine the robot’s position. A.L. Majdik et al. [
163] regarded speeded-up robust features (SURF) as environmental landmarks and tracked the displacement of these landmarks between different robot positions. Using the visual and LIDAR mapping systems in combination achieves efficient localization and autonomously filters the detected landmarks. S. Houben et al. [
164] abstracted the different stages of LIDAR SLAM into a thin interface connected only to the map construction process, proposed a fast and simple marker scheme whose labels can be effectively detected and decoded, and provided a graph optimization method that seamlessly and continuously integrates the marker location information into the map. G. Jiang et al. [
165] proposed a new graph-optimization-based SLAM framework built on the fusion of cheap LIDAR and vision sensor data. In this framework, a cost function was designed to process both scan data and image data, and a bag-of-words model of visual features was incorporated into the loop closure stage. A 2.5D map containing visual features and obstacles was generated, which can be built faster than a traditional grid map. L. Mu et al. [
166] proposed a graph-optimized SLAM method. Based on the unscented Kalman filter (UKF), four sensors including LIDAR, an RGB-D camera, an encoder, and an IMU were combined for joint positioning, which effectively improved the accuracy of loop closure detection and made the map more refined. S. Chen et al. [
167] studied the back-end of LIDAR and visual SLAM and constructed a method based on loop closure detection and global graph optimization (GGO). In the main stage, LIDAR geometric features and visual features are used, and in the auxiliary stage, a bag-of-words (BoW) model describing visual similarity is constructed. Loop closure detection and graph optimization performance were both significantly improved.
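The graph-based formulation shared by these works can be summarized as follows (a standard pose-graph objective, independent of any particular cited system): each node is a robot pose \(\mathbf{x}_i\), each edge a relative constraint \(\mathbf{z}_{ij}\) obtained from LIDAR scan matching, visual matching, or loop closure, and the optimizer seeks

\[
\mathbf{x}^{*} = \arg\min_{\mathbf{x}} \sum_{(i,j)} \mathbf{e}_{ij}^{\top} \boldsymbol{\Omega}_{ij} \, \mathbf{e}_{ij}, \qquad
\mathbf{e}_{ij} = \mathbf{z}_{ij} \ominus \left(\mathbf{x}_i^{-1} \oplus \mathbf{x}_j\right),
\]

where \(\boldsymbol{\Omega}_{ij}\) is the information matrix of the constraint and \(\oplus\), \(\ominus\) denote pose composition and difference operators.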
At this point, the various fusion algorithms for front-end odometry, back-end optimization, loop closure detection, mapping, and so on had achieved better results than single-sensor algorithms and demonstrated the advantages and great potential of sensor fusion, indicating that multi-sensor fusion is applicable to the complete SLAM pipeline. Inspired by the loosely coupled and tightly coupled combinations of visual sensors and IMUs, researchers have also attempted to apply loose and tight coupling throughout the whole SLAM process to make full use of the respective advantages of the different sensors.
(c) Fusion method based on loose coupling
At first, researchers focused on multi-sensor loose coupling methods. In 2017, M. Yan et al. [
168] proposed a loosely coupled visual–LIDAR odometry method combining VISO2 (the second version of visual odometry) and LOAM (LIDAR odometry and mapping), which exploits the complementary advantages of the different sensors to reduce the number of scenarios in which the method fails. They demonstrated reasonably high accuracy even where environmental texture is repetitive and shape features are not prominent, but scenes with high-speed motion and a lack of color and shape features still presented challenges. In 2020, multi-sensor loose coupling approaches achieved significant breakthroughs, with researchers applying them to mobile robots and various harsh environments. A multi-sensor fusion state estimation framework for legged robots was proposed by M. Camurri et al. [
169], whose core is an extended Kalman filter (EKF) that fuses IMU and leg odometry for attitude and velocity estimation while visual sensors and LIDAR correct motion drift in a loosely coupled manner. The performance is reliable when the robot travels long distances, but the approach is not suitable when the movement speed is too high. P. Alliez et al. [
170] developed a SLAM system equipped with dual LIDAR, an IMU, a GPS receiver, and camera sensors for emergencies in the military and civilian fields. The information of each sensor is fused using the loose coupling method. The visual part is based on the ORB-SLAM algorithm, and the LIDAR part is based on the error-state Kalman filter (ESKF); the two cooperate through pose sharing and relocation and can even operate stably in harsh environments. Subsequently, in another paper [
171], the same authors fused more types of sensors and proposed a real-time indoor/outdoor positioning and offline 3D reconstruction system combining visual, LIDAR, inertial, and GPS measurements; the system is based on the KF and performs loosely coupled fusion between the LVI-SLAM method and GPS positioning. In cases of GPS failure, dark environments, or smoky scenes, the signal can be transmitted by radio, and the localization is more accurate than that of existing technology. To allow the readers to better understand the loose coupling of four-sensor fusion, the general framework is presented in
Figure 20 for reference.
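A minimal sketch of the loose-coupling idea is given below: each odometry module (visual, LIDAR, GPS, and so on) produces its own pose estimate, and a filter fuses these pose-level outputs rather than the raw measurements. The interfaces are illustrative and do not correspond to any of the cited systems.

```python
import numpy as np

class LooselyCoupledFusion:
    """Fuse pose estimates from independent odometry pipelines with a
    constant-pose Kalman update (position-only, for brevity)."""

    def __init__(self, p0, P0):
        self.p = np.asarray(p0, dtype=float)   # fused 3D position
        self.P = np.asarray(P0, dtype=float)   # 3x3 covariance

    def update(self, p_meas, R_meas):
        # Each module's pose output is treated as a direct measurement of the state.
        K = self.P @ np.linalg.inv(self.P + R_meas)   # Kalman gain (H = I)
        self.p = self.p + K @ (np.asarray(p_meas, dtype=float) - self.p)
        self.P = (np.eye(3) - K) @ self.P
        return self.p

# Example: fuse a visual odometry fix and a LIDAR odometry fix.
fusion = LooselyCoupledFusion(p0=[0, 0, 0], P0=np.eye(3))
fusion.update(p_meas=[1.02, 0.01, 0.0], R_meas=0.05 * np.eye(3))  # visual estimate
fusion.update(p_meas=[0.98, 0.00, 0.0], R_meas=0.02 * np.eye(3))  # LIDAR estimate
```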
(d) Fusion method based on tight coupling
In contrast with the earlier loosely coupled fusion methods based on the Kalman filter, the current focus of the academic community is tightly coupled fusion based on nonlinear optimization. For example, fusing with an IMU and performing real-time mutual calibration allow the LIDAR or visual module to maintain positioning accuracy during aggressive maneuvers (violent acceleration, deceleration, and rotation), which prevents tracking loss and greatly improves the stability of positioning and map construction.
In 2019, Z. Wang et al. [
172] proposed a robust, high-precision visual–inertial–LIDAR SLAM system that combines the advantages of VINS-Mono and LOAM and can effectively deal with sensor-degraded scenes. A tightly coupled visual-inertial method is used to estimate the motion and attitude, and this estimate is then refined through LIDAR scan matching; when one of the modules fails, motion tracking can still continue. T. Wang et al. [
173] fused LIDAR, camera, IMU, encoder, and GNSS sensors and proposed a tightly coupled method that improves positioning accuracy and eliminates dynamic targets and unstable features in order to estimate the robot's pose robustly and accurately. Through continued efforts, researchers found that multi-sensor tight coupling based on graph optimization can significantly improve mapping accuracy and robustness in complex environments.
In 2021, J. Lin et al. [
174] proposed a tightly coupled framework that fuses LIDAR, camera, and IMU sensors, which is mainly composed of two parts: factor graph optimization and filter-based odometry. State estimation is performed within the framework of iterative Kalman filtering, and the overall accuracy is further improved through factor graph optimization. This method overcomes the problems of sensor failure and violent motion, and at the time of its release, it had the highest accuracy. Broad development prospects have prompted more and more scholars to study this field. Based on factor graphs, T. Shan et al. [
175] designed a tightly coupled method involving a visual-inertial system (VIS) and a LIDAR-inertial system (LIS), where the VIS uses the LIS estimate to facilitate initialization and the LIDAR provides depth information for the visual features, significantly improving performance in texture-less and feature-poor environments. It can be used for real-time state estimation and mapping in complex scenes. D. Wisth et al. [
176] developed a tightly coupled, factor-graph-based joint optimization of visual, LIDAR, and IMU measurements. The authors proposed a procedure for extracting 3D line primitives from LIDAR point clouds and a new way of handling such geometric primitives, which overcomes the suboptimal performance of frame-by-frame tracking and is especially suitable for vigorous motion and rapidly changing illumination. L. Meng et al. [
177] proposed a tight coupling of monocular vision and LIDAR ranging that extracts 3D features from both the LIDAR and the visual information. In this system, the monocular camera and 3D LIDAR measurements are jointly optimized in a tightly coupled fashion, which provides accurate initial data for 6-DOF pose estimation, and the ICP method is used to construct loop closure constraints. Global pose optimization is then performed to obtain high-frequency, high-precision pose estimates. The approximate tightly coupled framework is presented in
Figure 21 for the convenience of the readers.
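In contrast with the loosely coupled scheme sketched earlier, tightly coupled approaches optimize the raw (or preintegrated) measurements of all sensors in a single objective. A generic form of such a joint cost, not tied to any particular system above, is

\[
\min_{\mathcal{X}} \;
\sum_{k} \left\| \mathbf{r}_{\mathrm{IMU}}(\mathcal{X},k) \right\|^{2}_{\boldsymbol{\Sigma}_{I}}
+ \sum_{(i,j)} \left\| \mathbf{r}_{\mathrm{cam}}(\mathcal{X},i,j) \right\|^{2}_{\boldsymbol{\Sigma}_{C}}
+ \sum_{m} \left\| \mathbf{r}_{\mathrm{LIDAR}}(\mathcal{X},m) \right\|^{2}_{\boldsymbol{\Sigma}_{L}},
\]

where \(\mathcal{X}\) collects the poses, velocities, biases, and landmarks in the sliding window or factor graph, \(\mathbf{r}_{\mathrm{IMU}}\) are IMU preintegration residuals, \(\mathbf{r}_{\mathrm{cam}}\) visual reprojection residuals, and \(\mathbf{r}_{\mathrm{LIDAR}}\) point-to-line or point-to-plane residuals, each weighted by its covariance.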
(e) Assessment tools
Thus far, this paper has summarized multi-sensor fusion methods based on uncertainty, which optimize the local or global SLAM process. This paper now turns to evaluation tools that can be used to evaluate the quality of these improved solutions. A. Kassir and T. Peynot [
178] proposed a reliable and accurate camera–LIDAR calibration method that can accurately find the rigid transformation between the two sensors according to the internal parameters of the camera and the LIDAR. The method is divided into two main stages: in the first stage, a chessboard extraction algorithm automatically processes the images of the camera calibration dataset; in the second stage, the corresponding LIDAR data are processed to achieve automatic extraction. M. Labbe and F. Michaud [
57] introduced an extended version of RTAB-Map (real-time appearance-based mapping), distributed as an open-source library, which provides a set of tools for comparing the performance of different sensor fusion methods and various other 3D and 2D methods on datasets and for online evaluation. This tool can help analyze which robot sensor configuration is most suitable for a given navigation situation. Y. Xie et al. [
179] designed A4LidarTag, a marker pattern composed of circular holes. Because sensors are susceptible to environmental factors during extrinsic calibration, the depth information obtained by the LIDAR is used to encode the position information, which generalizes well to both indoor and outdoor scenes.
4.3.2. Fusion Method Based on Traditional Features
Traditional feature-based methods also play a vital role in the field of multi-sensor fusion, and current fusion methods are mainly based on the ORB-SLAM framework. ORB-SLAM and its subsequent versions have become some of the most widely used visual SLAM solutions due to their excellent real-time performance and robustness. However, the ORB-SLAM series heavily depends on environmental features, and it is difficult to obtain enough feature points in environments without texture features [
180]. Nonetheless, traditional features can provide sufficient information for ORB-SLAM systems, and with researchers' continued efforts, the integration of the two approaches has become increasingly mature [
181].
(a) Fusion method based on ORB-SLAM framework
In 2016, Liang et al. [
182] used ORB features and a bag-of-words model for loop closure detection, applied the well-established LRGC SLAM framework and the SPA optimization algorithm, introduced visual information into the environment representation, and successfully solved the problem of large-scale LIDAR-SLAM loop closure detection. However, this method can easily fail when ORB features are missing. In the same year, Q. Lv et al. [
183] used LIDAR to accurately obtain distance information, improved the map initialization process of the ORB-SLAM algorithm, estimated the absolute scale by calculating the average depth, and realized accurate positioning in unknown environments. Z. Zhang et al. [
184] fused 1D range information and image information to estimate the absolute scale, used the LIDAR range information to correct the scale drift of monocular SLAM, and adopted a keyframe extraction method similar to ORB-SLAM in the scale-drift correction stage. However, because the approach relies on SfM methods and local dense reconstruction, errors are prone to occur under pure rotational motion, in texture-less regions, or at extreme discontinuities. To address this problem, S. Yun et al. [
185] proposed a new method that uses 2D image data to determine the 3D position of feature points. The feature point localization process involves a combination of visual sensors and LIDAR and uses iterative automatic scaling parameter adjustment technology. In indoor and outdoor environments, this method has a strong performance. H.H. Jeon and Y. Ko [
186] used bilinear interpolation to interpolate sparse LIDAR data, which accelerated the process of extracting feature points in the 3D–2D motion estimation of visual SLAM. Y. Zhang et al. [
187] used LIDAR information to assist visual pose optimization, with the overall framework based on ORB-SLAM2. First, the visual part obtains precise environmental information from the LIDAR sensor, and this information is then passed to the visual tracking thread as an initial value for pose optimization. The system can adapt the relative weights of the two sensor types during fusion and achieves high accuracy for the reference keyframe and motion models; however, the accuracy may fluctuate while the trajectory is being generated. Since then, some researchers have tried to use point and line features [
188] and LIDAR point clouds [
189] instead of ORB features. The study of [
188] introduced point and line features into pose estimation, using ORB features for the point features (point and line features are less susceptible to noise, wide viewing angles, and motion blur). Compared with traditional visual–LIDAR odometry based only on points, the utilization of environmental structure information was improved, and the accuracy of pose estimation was greatly increased. A new scale correction method was also proposed to optimize the tracking results and was tested on the KITTI dataset; compared with purely geometric methods such as DEMO and DVL-SLAM, this method has higher pose estimation accuracy. The study by [
189] proposed a feature-based SLAM algorithm. First, the 3D point cloud is rasterized and converted into an image using the camera parameter matrix, and this image is then fed into the ORB feature detector. The method can estimate the 6-DOF pose of the robot and performs excellently in various environments, but dynamic objects in the environment affect system performance. However, the use of point clouds and point and line features also has its limitations, since these features are vulnerable to interference in similar-looking scenes and large-scale outdoor environments. To this end, J. Kang et al. [
190] proposed a range-enhanced panoramic visual simultaneous localization and mapping system (RPV-SLAM). A panoramic camera is used as the main sensor, and the range information obtained by a tilted LIDAR is used to enhance the visual features and provide metric scale. Initial depth information is obtained from the LIDAR sensor, ORB features are extracted within this range, and a dense depth map is recovered from the sparse depth measurements; the system remains robust under complex outdoor conditions. Y.C. Chang et al. [
191] combined RTK-GPS, camera, and LIDAR sensors for the first time to accurately locate vehicles and build high-precision maps in scenes with drastic weather changes. Through the normal distributions transform (NDT), ICP, and ORB-SLAM, image feature points are extracted and mapped to anchor points, and the map can be updated quickly when the position of an object in the map changes.
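Several of the monocular systems above recover absolute scale by comparing LIDAR depths with the up-to-scale depths that visual SLAM estimates for the same points. A minimal sketch of that idea follows (a generic ratio-of-depths estimator, not the exact procedure of the cited works).

```python
import numpy as np

def estimate_scale(vo_depths, lidar_depths):
    """Estimate the metric scale of a monocular reconstruction from matched depths:
    vo_depths are up-to-scale feature depths from visual odometry, lidar_depths are
    metric depths of the same features from LIDAR."""
    vo_depths = np.asarray(vo_depths, dtype=float)
    lidar_depths = np.asarray(lidar_depths, dtype=float)
    # Least-squares scale minimizing sum_i (lidar_i - s * vo_i)^2.
    return np.dot(vo_depths, lidar_depths) / np.dot(vo_depths, vo_depths)

def apply_scale(points, s):
    """Rescale monocular map points (and, analogously, translations) to metric units."""
    return s * np.asarray(points, dtype=float)
```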
Similarly, tightly coupled variants also exist among traditional feature-based multi-sensor fusion methods. C.-C. Chou and C.-F. Chou [
192], inspired by the ORB-SLAM2 framework, proposed a tightly coupled visual–LIDAR SLAM system in which the front-end and back-end run independently. At the back-end, all the LIDAR and visual information is fused, a novel LIDAR residual compression method is proposed, and large-scale BA optimization is performed, achieving performance superior to that of existing visual–LIDAR SLAM methods. However, when the scene is dense and contains a large number of objects, corner information can easily be lost. The framework of this article is shown in
Figure 22.
(b) Other fusion options
In addition to the current dominant multi-sensor fusion method based on the ORB-SLAM framework, there are also many excellent fusion methods worthy of reference and research. R. Radmanesh et al. [
193] proposed a monocular SLAM method that uses LIDAR ranging to provide depth information; it processes unknown objects in the camera data in an unsupervised way, treats visually detected features as landmarks, and fuses them with the LIDAR sensor data [
194]. The proposed method is superior to maps generated by LIDAR alone in terms of computational efficiency and accuracy. In 2021, D. Cheng et al. [
195] addressed the problem of the limited space inside structures using a method based on the feature-level fusion of LIDAR, camera, and inertial measurements for accurate sensor positioning. To overcome the poor localization of the sensors relative to the surrounding objects, multiple sensors were used to capture finer details and clearer geometric shapes in order to better reconstruct a high-texture 3D point cloud map in real time. K. Wang et al. [
196] proposed a two-layer optimization strategy. In the local estimation layer, the relative pose is obtained through LIDAR odometry and visual inertial odometry, and GPS information is introduced in the global optimization layer to correct the cumulative drift, so that accurate absolute positioning can be achieved without global drift. S. Yi et al. [
197] adapted ORB-SLAM3 and proposed a behavior tree framework that intelligently selects the best global positioning method from among visual features, LIDAR landmarks, and GPS, forming a long-term usable feature map that can autonomously correct scale, minimize global drift, and maintain geographic registration. This method meets the needs of complex large-scale scenarios.
4.3.3. Fusion Method Based on Deep Learning
In the first two sections, this paper summarized the current multi-sensor fusion methods based on uncertainty and traditional features, which greatly improve the effectiveness and robustness of SLAM. At the same time, with the continuous development of traditional machine learning, the field of deep learning has gradually developed [
199]. Deep learning involves training a model on a large number of samples and allowing the computer to discover the latent patterns among them [
200]. This technology promotes the development of artificial intelligence, for example, robots with independent analysis, judgment, and decision-making abilities [
200]. Deep learning shows extraordinary potential in image recognition and sound processing [
201], and more and more researchers have attempted to combine it with SLAM. At present, the neural networks used in deep learning technology can be mainly divided into three categories: convolutional neural networks (CNNs), recurrent neural networks (RNNs), and deep neural networks (DNNs) [
202]. The concept of neural networks originated in the 1950s and 1960s, when they were called perceptrons. Perceptrons could handle simple functions but were unable to deal with slightly more complex ones [
203]. This drawback was not overcome until the 1980s with the advent of multilayer perceptrons, which this paper will refer to from now on as “neural networks”.
This paper presents the classical neural network framework used by the fusion method in
Figure 23. Since there are inherent local patterns in images (such as the human mouth, eyes, and nose) [
204], the recognition of local image features is often faster and more accurate [
205], so researchers combined image processing with neural networks to create CNNs. DNNs have a structure similar to multilayer perceptrons, and these networks can mitigate the vanishing gradient problem and avoid falling into local optima [
206]. However, DNNs are unable to model changes in time series. To adapt to this demand, RNNs have been proposed which can deal with context-dependent data types [
207], but these networks have not been sufficiently tested in the field of multi-sensor fusion. Therefore, this paper argues that CNNs are strong in classification, recognition, prediction, and decision making, while DNNs excel at fitting and reach a (local) optimum more quickly; both can be used in the different multi-sensor fusion modules [
208] of SLAM. In the following, this paper presents the structure diagrams of a CNN and DNN for the readers’ convenience.
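To complement those diagrams, the two structures can be contrasted in a few lines of PyTorch; this is a deliberately tiny, illustrative pair of models unrelated to any cited system.

```python
import torch
import torch.nn as nn

# A DNN in the sense used above: a fully connected multilayer perceptron.
dnn = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# A CNN: convolutional layers exploit the local patterns present in images.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),
)

print(dnn(torch.rand(1, 128)).shape)        # torch.Size([1, 10])
print(cnn(torch.rand(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```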
(a) CNN-based fusion method
In recent years, the strengths of CNNs in image processing have been widely exploited in single-sensor SLAM methods such as [
209], since they allow monocular cameras to obtain reliable depth information. Similarly, CNNs have also achieved surprising results in the field of multi-sensor fusion. In 2018, J. Ku et al. [
210] proposed AVOD, an aggregated-view object detection network for autonomous driving scenes, which generates shared features from LIDAR point clouds and RGB images. The network uses a high-resolution feature extractor and a multi-modal fusion region proposal network (RPN) architecture (built on Faster R-CNN, a commonly used 2D object detector) to reliably generate 3D candidate regions for multiple object classes in road scenes in real time. In the same year, F. Ma and S. Karaman [
211] considered predicting dense depth data from low-resolution sparse depth data to obtain maps with higher robustness and accuracy. They proposed a new depth prediction method that uses a CNN to learn a deep regression model for depth image prediction and used this model as a plug-in for sparse SLAM, visual inertial odometry algorithms, and super-resolution LIDAR measurement. Experiments on the KITTI dataset showed that the accuracy is improved by 33% compared with previous algorithms. Subsequently, X. Kang et al. [
212] further explored depth data. They aligned LIDAR data with an RGB-D point cloud to generate continuous video frames of the corresponding scene and used a CNN for training. Deep learning methods (mainly the PoseNet neural network) were used to achieve motion recovery and automatic initialization of the system. Experiments were carried out in large-scale indoor and complex scenes: compared with the traditional SLAM algorithm, the cumulative error in the loop closure detection stage is reduced by a factor of two, and the overall robustness is higher than that of ORB-SLAM2. In
Figure 24, a detailed flowchart of the CNN-based fusion method is presented. The flowchart consists mainly of five parts: the first part collects the environmental information and LIDAR data; the second part is the tracking process for the precise automatic initialization of the RGB-D SLAM algorithm and the extraction of all keyframes; the third part eliminates redundant keyframes; the fourth part uses the ICP algorithm to determine the camera pose and select the correct keyframes; and the fifth part performs loop closure detection.
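To illustrate the kind of depth-regression model these works employ, a toy sparse-to-dense network is sketched below in PyTorch; the architecture and sizes are illustrative only and far simpler than the networks in the cited papers.

```python
import torch
import torch.nn as nn

class SparseToDenseNet(nn.Module):
    """Toy encoder-decoder that regresses a dense depth map from an RGB image
    concatenated with a sparse (mostly zero) LIDAR depth channel."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, rgb, sparse_depth):
        x = torch.cat([rgb, sparse_depth], dim=1)   # (B, 4, H, W)
        return self.decoder(self.encoder(x))        # (B, 1, H, W) dense depth

# Example with random tensors standing in for a KITTI-sized crop.
net = SparseToDenseNet()
rgb = torch.rand(1, 3, 128, 416)
sparse = torch.zeros(1, 1, 128, 416)   # LIDAR depth projected into the image; zero where no return
dense = net(rgb, sparse)
loss = nn.functional.l1_loss(dense, torch.rand_like(dense))  # supervised against ground-truth depth
```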
In recent years, CNN-based multi-sensor fusion methods have been further developed. Z. Gong et al. [
213] proposed a real-time LIDAR-based 3D object detector, which combines vision and range information in a frustum-based probabilistic framework, effectively addressing the sparsity and noise of LIDAR point clouds. It can detect 3D objects in large built environments with a CNN without pre-training. When tested on the KITTI dataset, the results were better than those of the state-of-the-art object localization and bounding box estimation methods of the time. Z. Shi et al. [
214] proposed an effective method for extracting the LIDAR projection line from the image and improved a visual-SLAM-based LIDAR scanning system. First, an adaptive threshold is introduced for object identification; then, image features are used for visual SLAM pose estimation; finally, CNN-based semantic segmentation is used to establish an accurate, realistic 3D model that can generate 3D point cloud maps with true colors and real scale. K. Park et al. [
215] developed a CNN model for the first time to fuse uncalibrated LIDAR and binocular camera depth information and proposed a fusion framework for high-precision depth estimation, including a deep fusion network for enhanced encoding using the complementary characteristics of sparse LIDAR and dense stereo depth, and a calibration network for correcting initial extrinsic parameters and generating pseudo-ground truth labels from the KITTI dataset. The proposed network outperforms current state-of-the-art algorithms and can meet various real-time requirements. H. Qiu et al. [
216] proposed a semantic map construction method that combines a camera, odometry, and LIDAR and uses the YOLOv3 algorithm to process the pictures taken by the camera to obtain the semantic information and location information of the target object. Subsequently, semantic information and location information are integrated into the grid map constructed by the Gmapping algorithm, which promotes research in semantic navigation and other aspects.
(b) DNN-based fusion method
CNNs have been favored by a large number of researchers because of their excellent image processing performance. In contrast, DNN-based multi-sensor fusion methods are relatively few, and at present, DNNs are mostly used in partial combination with CNNs. Y. An et al. [
217] proposed a new unsupervised multi-channel visual–LIDAR SLAM method (MVL-SLAM), which fully combines the advantages of both LIDAR and visual sensors, applies a recurrent convolutional neural network (RCNN) to the fusion method component, and uses the features of a DNN as the loop closure detection component. This method does not need to produce pre-training data and can directly construct the 3D map of the environment from the 3D mapping component. D. Cattaneo et al. [
218] used LIDAR maps to perform global visual localization, leveraging a deep neural network (DNN) to create a shared embedding space that contains both image and LIDAR map information, allowing image-to-3D-LIDAR place recognition. The proposed method uses a DNN and a CNN to extract LIDAR point cloud and image information, respectively, and achieved the best performance reported thus far on the Oxford RobotCar dataset (which contains images under all weather and lighting conditions). Once the network weights are in place, fusion is performed in the shared embedding space to achieve accurate place recognition. The framework of this article is shown in
Figure 25.
(c) Other fusion options
Of course, in addition to the multi-sensor fusion method based on CNNs and DNNs, researchers have also tested other deep learning methods and achieved excellent results. For example, J. Graeter et al. [
219] proposed a depth estimation method in which a local plane is fitted to the LIDAR points surrounding each camera feature in the image. This deep-learning-assisted method can detect landmarks even in environments containing dynamic objects, and when combined with the measurements of the LIDAR sensor and a high-precision depth vision sensor, it has strong tracking ability. The authors also proposed a combined keyframe and landmark selection scheme and embedded it into a bundle adjustment (BA)-based visual odometry framework for online use. H. Deng et al. [
220] proposed a neural-network-based method for combining vision sensor, odometry, and LIDAR observations, using a three-layer BP neural network (including an input layer, hidden layer, and output layer) to fuse the observation information of a Kinect camera and 2D LIDAR sensor. Compared to only using a single sensor, the ability to detect multi-scale obstacles and the accuracy of localization and mapping are improved. S. Arshad et al. [
221] proposed a deep-learning-based loop closure detection algorithm to solve the problem of loop closure detection based on visual–LIDAR fusion, which effectively reduces the cumulative error of robot pose estimation and generates a consistent global map, but this method depends on the dataset used to train the network.