1. Introduction
Autonomous mobile robots (AMRs) depend on the effectiveness of mapping the surrounding environment to determine their movement. Simultaneous Localization and Mapping (SLAM) enables the AMR to understand its surroundings and constructs a map model of the surrounding environment simultaneously [
1]. When the AMR navigates in an unknown environment, it acquires point clouds, images, and inertial data through mounted sensors. Then, SLAM restores the real-time position and attitude of the AMR through the spatial–temporal correlation of the observations to realize its localization and mapping. There are two main categories according to the sensor types in the SLAM methods: visual SLAM (VSLAM) [
2,
3] and LiDAR SLAM [
4,
5,
6,
7].
A robot equipped with a vision sensor to estimate the position and orientation of the robot to perform SLAM is called VSLAM. VSLAM relies on camera images to perform the localization and mapping tasks and it covers the variety of visual data detectors, including monocular, stereo, omnidirectional, and RGB-D cameras [
8]. VSLAM has the advantages of cheaper hardware requirements and straightforward object detection and tracking, and it provides visual and semantic information [
9]. A fundamental classification used in VSLAM is the SLAM system based on the type of data, which can be divided into direct and indirect (feature-based) methods [
10]. The direct algorithm estimates camera motions from pixel-level data to build an optimization problem that minimizes photo-metric error. The direct approach has a higher-accuracy 3D reconstruction and obtains more information from images than the indirect method. The indirect method extracts the feature points obtained by processing the scene and by matching their descriptors in sequential frames. These methods are precise and robust against the photo-metric changes in frame intensities, but the performance of feature extraction and matching stages are expensive [
11].
RGB-D SLAM methods use a combination of an RGB camera and a depth sensor to measure pixel depth directly to simplify the depth estimation process [
12]. RGB-D SLAM methods are popular choices for indoor settings as they provide precise and accurate depth information. However, for outdoor applications, RGB-D cameras may provide unreliable mapping results due to potential interference from sunlight [
13]. In dynamic environments, the efficiency of RGB-D SLAM depends on factors such as the speed of moving objects, the quality of sensor data, and the complexity of the scene. Therefore, the use of multiple sensors is the solution to integrate motion models and detect moving objects in dynamic environments. Therefore, the impact on RGB-D SLAM can be mitigated by employing strategies to enhance overall accuracy and robustness [
14].
LiDAR is a perception device utilized to check exterior surfaces and measure object distance by emitting a laser toward the object to capture the traveling time [
15]. Two-dimensional LiDAR sensors record X and Y parameters using the axes of beams [
16]. Three-dimensional LiDAR sensors work in the same way as 2D LiDAR, but the additional measurement around the Z-axis is performed to collect true 3D data. The data from the Z-axis are collected using several lasers at longitudinal projections of various angles. Three-dimensional LiDAR sensors are ideal for visualization due to their higher precision and resolution, but they are more expensive than 2D LiDAR versions. The 2D LiDAR SLAM architecture is equipped with a LiDAR sensor to create a 2D map of its surroundings [
17]. The LiDAR sensor illuminates the target with a laser pulse to calculate the distance from the object. Three-dimensional LiDAR SLAM is a precise solution for map generation in various landscapes [
18]. For indoor and outdoor robot navigation, 2D LiDAR SLAM is unaffected by disruptions like temperature and light changes and is commonly used in a wide range of autonomous applications, including building inspections, autonomous driving, and smart fabrication [
19,
20].
RTAB-MAP (Real-Time Appearance-Based Mapping) is a mapping algorithm that combines SLAM features with appearance-based mapping [
21]. It is designed for real-time 3D modeling in dynamic environments and has the ability to adapt to fast-changing environments [
22]. RTAB-MAP can be used to implement either a visual SLAM approach or a LiDAR SLAM approach on a real robot. It can be used to create 2D or 3D maps by using an RGB-D camera or a LiDAR to estimate its position and orientation in the map simultaneously [
23,
24]. Therefore, multi-sensor fusion SLAM techniques can be utilized in the robotic system to collect LiDAR data and perform RGB-D reconstructions [
25]. RTAB-MAP includes important features such as stereo image matching and depth image matching. It can be integrated with an ROS (Robot Operating System) on various robotic platforms. Many studies optimize the performance of RTAB-MAP in real-time 3D mapping, including the use of loop closure techniques [
26]. Loop closure is an important process that detects whether an AMR has returned to a previously visited location and connects the separated parts of the map into a unified whole. It is very important because loop closure can produce accurate maps. Three-dimensional loop closure is the process of detecting the 3D map created by the system to identify previously visited locations [
27].
A fast deep learning-based VSLAM system for dynamic environments was proposed by Su et al. [
28]. The proposed real-time framework integrates ORB-SLAM 2.0 with the lightweight object detection network YOLOv5s to acquire semantic information in the scene [
29]. An optimized homograph matrix module is introduced to create more accurate Optical Flow vectors. Their method performs better than existing semantic VSLAM algorithms in terms of accuracy and performance, and the TUM RGB-D dataset is employed to analyze the system’s runtime and accuracy. The backbone network of YOLOv5 needs to be optimized to make it more robust and practicable. Recently, a comprehensive review found that YOLO 11 for object detection was revolutionized by achieving an optimal balance between speed and accuracy [
30]. The key architectural improvements, performance benchmarks, and applications were evaluated using the framework’s strengths and limitations in practical scenarios. Critical gaps in the literature were identified and the review outlines future directions to enhance YOLO’s adaptability, robustness, and integration into emerging technologies [
31].
In recent years, SLAM technologies have been developed that can solve complex problems related to real-time 3D mapping. Many algorithms based on SLAM have been developed to enhance its work efficiency. In terms of visual SLAMs, three known algorithms are used: ORB-SLAM2, SPTAM, and RTAB-MAP [
32]. ORB-SLAM2 generates the sparse reconstruction from RGB-D inputs, exhibiting better stereo SLAM solutions and achieving zero-drift localization in already-mapped areas for the KITTI visual odometry benchmark. In the evaluation results, ORB-SLAM2 exhibits higher speeds than RTAB-MAP and SPTAM. SPTAM has higher accuracy in terms of feature recognition in 3D maps [
33]. Comparing ORB-SLAM2 and RTAB-MAP, the performance of ORB SLAM2 is better than that of RTAB-MAP in terms of accuracy, speed, and feature usage. Although ORB-SLAM2 provides better overall performance, RTAB-MAP remains the priority in cases that deal with difficult lighting and camera motion conditions. Therefore, RTAB-MAP has the advantage in that the object creates shadows and the camera motion is faster [
2]. Moreover, RTAB-MAP is superior in terms of real-time data processing and can generate more accurate maps than ORB-SLAM2 and SPTAM in indoor environments [
34]. When the AMR system detects a previously visited location, it adjusts the robot’s position and orientation to update the map accordingly. There are various techniques to optimize the 3D loop closure in RTAB-MAP. It can improve the accuracy and efficiency of loop closure detection through key frame selection, feature extraction and matching, and pose graph optimization [
35].
In this study, an AMR is equipped with RGB-D and LiDAR sensors. RTAB-MAP is used for the mapping and localization system. Then, its ability is tested to determine the robot’s position in indoor environments. The RTAB-MAP memory management supports the data inputs from a variety of sensors, such as the 2D LiDAR and RGB-D depth cameras, making it particularly suitable for complex indoor environments. The rest of this paper is organized as follows: 
Section 2 describes the methodology and framework of the system. This section introduces the hardware architecture of the AMR system and the RTAB-MAP method. 
Section 3 describes the experiments, including loop closure detection using RTAB-MAP, and three graph optimization approaches—TORO, g2o, and GTSAM—are integrated into RTAB-MAP. 
Section 4 provides an analysis of RTAB-MAP and the AMCL method with the three graph optimization approaches. Finally, 
Section 5 summarizes the main conclusions of this research.
  2. Methodology
In this study, the AMR system consists of four stages: the results display, RGB-D data acquisition, the TECO AMR system, and real-time 3D reconstruction using RTAB-MAP, as shown in 
Figure 1. Three map optimization methods, TORO (Tree-based Network Optimizer), g2o (General Graph Optimization), and GTSAM (Georgia Tech Smoothing and Mapping), are used for the graph optimization of RTAB-MAP. The maps built using the three methods are evaluated with RTAB-MAP localization in a high-similarity long corridor environment. The odometer, RGB-D D435i depth camera, and 2D LiDAR device are input to the AMR for the RTAB-MAP architecture. RTAB-MAP is the method used for the real-time location and construction of the real-time environment mapping and localization, and is particularly suitable for automated navigation robot applications. The RTAB-MAP mapping process begins with the systematic collection of data from the RGB-D camera and the 2D LiDAR, followed by pre-processing to filter and synchronize the data from different sensors.
The AMR is controlled using the computer (AD-LINK, Taoyuan, Taiwan), AmITX-SL-G-H110, whose CPU is an Intel
® Core™ i7-6700 LGA1151 with a 32 GB DDR, as shown in 
Figure 1. To reach a specified destination while avoiding obstacles, an Intel
® RealSense D435i RGB-D camera (Intel Cooperation, Santa Clara, CA, USA) and a SICK TIM551 2D-LiDAR (SICK AG, Waldkirch, Germany) were used to determine the geography of the environment. The operating system of the AMR is the ROS environment on the Ubuntu 20.04 (Focal) release; the AMR incorporates several sensing devices, including a laser rangefinder, a depth camera, an odometer via the servo drives, and an IMU. The odometer utilizes the encoder feedback from the servo drives and RTAB-MAP is used for the construction of a graph optimizer. The RGB-D depth camera and the 2D LiDAR device (SICK Tim 551) are the feedback inputs of the RTAB-MAP architecture, as shown in 
Figure 2a.
RTAB-MAP is the construction of SLAM and real-time environment mapping and localization. The method is suitable for navigation robot applications. The RTAB-MAP mapping process begins with the systematic collection of data from the RGB-D camera and the 2D LiDAR, followed by pre-processing to filter and synchronize the data from different sensors. The characteristics are then extracted from the data and matched, and the pose of the robot is estimated. If loops are detected, the map will be optimized to reduce the accumulated error. The optimized attitude and sensor data are used to generate high-precision 3D maps. After extracting and matching the features from the data, the attitude of the AMR is estimated and the navigation target is set for path planning and obstacle avoidance. Then, the positioning is updated in real time during navigation to improve the positioning accuracy. The ROS provides the precise synchronization and approximate synchronization. RTAB-MAP mainly uses the approximate synchronization method to synchronize the top publishing speed of multiple sensors with minimal latency error. We used rtabmap_ros/rgbd_sync to synchronize the RGB-D camera’s RGB and the depth image data into a single subject rtabmap_ros/RGBD image. Then, rgbd_sync approximated synchronization and rgbd_sync was used to synchronize data with different sensors to improve the overall performance and stability, as shown in 
Figure 2b.
  3. Visual SLAM with RTAB-MAP
Short-term internal memory (STM) is used to observe the current location scene where the robot walks. The time change between the images is used to compare the similarity between the new node 
Nt and the previous node 
Nt−1. To update the weight of the acquired location, the location Lt is compared with the last one in the STM, and the similarity (
sim) is evaluated using Equation (1) [
36].
      where 
Npair is the number of matched word pairs between the compared location signatures, and 
 and 
 are the total number of words of signature 
t and the compared signature 
t − 1, respectively.
The working memory (WM) detects the closed loop of the spatial position and sets the STM size according to the robot speed and position acquisition rate. The number of nodes stored in the STM is set by the parameter “Mem/STMSize”. When the STM reaches the upper limit of capacity, the oldest data will be transferred to the WM for closed-loop detection. After the closed-loop detection is completed, the data will be transferred to the long-term memory (LTM), making it easier for long-term storage and deeper information processing, which helps to maintain system efficiency and reduce data loss [
36].
      where 
 is used to predict the distribution of 
, given each state of distribution 
 in accordance with the robot’s motion between 
t and 
t − 1. Between 
t − 1 and 
t, the full posterior is updated according to the time evolution model of the pdf 
, which gives the probability of transition from one state 
i at time 
t − 1 to every possible state at time 
t. According to the robot combining the latter part of the filter between time 
t and 
t − 1, 
, the next closed-loop detection is achieved [
36].
Depending on the respective values of  and , this probability takes one of the following values:
	   
- (1)
-  = 0.9; the probability that no loop closure event will occur at time t, given that none occurred at time t − 1. 
- (2)
-  = , ; NWM refers to the number of positions in the WM of the current iteration. 
- (3)
-  = 0.1, ; the probability of the event “no loop closure at time t” is low given that a loop closure occurred at time t − 1. 
- (4)
-  with i, j ∈ [0; ] is a Gaussian on the distance in time between i and j whose sigma value is chosen so that it is nonzero for exactly four neighbors. 
If there are closed-loop connections, a location in the graph may have more than two neighbors [
36]. The closed-loop detection will be limited due to the lack of visual features when the robot faces a highly similar wall. The proximity detection uses laser scanning data to correct the drift of the odometer. It connects nodes with close locations to make up for the lack of image data and enhances the accuracy of location estimation.
  3.1. Loop Closure Detection
In RTAB-MAP, the LTM affects large-scale and long-term loop closure detection. It is mainly responsible for storing information that is currently not within the direct perception range of the robot. For example, map data, past path records, and environmental features are used for navigation. Loop closure detection uses a Bayesian filter to track by estimating the probability of the current position, assuming that Lt matches the accessed position stored in the WM. It represents the sequence of states for all closed-loop hypotheses at a certain moment [
36,
37].
RTAB-MAP performs real-time positioning and provides visual positioning and map construction information. It displays loopback detection results in the middle, including the matching results between the current image and the historical image. When a loop closure or a proximity detection are detected or some nodes are retrieved or transferred because of memory management, a graph optimization approach is applied to minimize errors in the map. In RTAB-MAP, loop closure detection is performed by comparing current sensor data with previous sensor data to find similarities in the features. When the system has returned to a previously visited location, loop closure detection is performed. If the loop closure detection is successful, then an estimation of the transformation between the two locations is made.
When a loop closure hypothesis is accepted, a graph optimizer minimizes the errors in the map. First, the graph optimization approach TORO is integrated into RTAB-MAP [
38]. The AMR is controlled from A, B, C, D, E, and F, and then to A to perform the loop closure to establish a localization map with a velocity of 0.2 m/s, as shown in 
Figure 3. This process involves matching features between the two locations, followed by optimization using the pose graph optimization method to improve the estimation of the transformation. In 
Figure 4, the right graph is the generated 3D map and the blue lines indicate the AMR’s trajectory and current anchor points. Once an accurate estimation of the transformation is obtained, loop closure correction is performed, which involves adjusting the camera position and adjusting the 3D model using the new estimation of the transformation. This way, the errors that arise from inaccurate camera or sensor movements can be corrected, resulting in a more accurate and consistent 3D model. The loop closure detection of the environment at the initial time (
t = 10) is shown in 
Figure 4a and the AMR continues to move forward to the end of the corridor, as shown in 
Figure 4c. Then, the AMR turns back to move forward in the original position (as shown in 
Figure 4e). Following this, the loop closure detection and correction processes are performed continuously as the camera moves around the environment to ensure that the 3D model is always up to date and accurate. In this paper, three graph optimization approaches—TORO [
38], g2o [
39], and GTSAM [
40]—are integrated in RTAB-MAP. After the loop closure process is finished, the localization graph can be obtained by RTAB-MAP with TORO, g2o, and GTSAM approaches, as shown in 
Figure 5, 
Figure 6 and 
Figure 7.
  3.2. Three Graph Optimizations with RTAB-MAP
When a loop closure hypothesis is accepted, a graph optimizer minimizes the errors in the map. In this study, three graph optimization approaches, i.e., TORO, g2o, and GTSAM, are integrated in RTAB-MAP. All of these optimizations use node poses and link transformations as constraints. As a loop closure is detected, the errors introduced by the odometer can be propagated to all links, correcting the map.
- (a)
- TORO uses a tree structure to organize and simplify problems effectively. In the tree structure, each node represents a pose of the robot, which is used to optimize the pose map. The global positioning error is reduced by arranging the pose into a tree structure. 
- (b)
- g2o is a highly flexible and modular optimization framework for solving large-scale nonlinear least squares problems, supporting Gaussian Newton and a variety of different local optimization algorithms, making it suitable for processing a variety of sensor data. 
- (c)
- GTSAM supports real-time updates and can quickly recalculate and update maps and pose estimates when new nodes enter. Smooth mapping provides accurate estimates even when the data are incomplete or noisy, making it suitable for dynamic environment mapping. 
In the RTAB-MAP graph optimization results shown in 
Table 1, the results show that g2o and GTSAM converge faster than TORO. The results obtained using TORO are less sensitive to poorly estimated odometry covariance. For the single map, the quality of g2o and GTSAM optimization is better than that of TORO based on empirical data. In the experimental results, GTSAM is more robust than g2o and TORO. The experimental results show that the performances of g2o and GTSAM are better than that of TORO. However, in terms of the computational cost of the three methods, GTSAM requires the most CPU usage, as shown in 
Table 1.
  5. Conclusions
In this study, the navigation of a corridor with high similarity features was carried out using an AMR based on the AMCL method and RTAB-MAP with three graph optimization approaches: TORO, g2o, and GTSAM. The experimental results reveal distinct characteristics of the two methods tested. For the AMCL method, the error of the x, y positions for GTSAM was smaller than for the g2o and TORO methods, with average errors of . However, the TORO had the smallest error of angle among the three methods. However, the processing time for GTSAM is longer than for the TORO. For the RTAB-MAP method, GTSAM was the best in terms of the error of x, y positions and the angle; the average errors were  . However, the TORO method had the shortest computation time. Due to the sequential relationship between the tree-like data, the ability of the algorithm to deal with complex intersections or multipath scenarios in the environment is limited. In long-term applications, GTSAM uses smoothing and mapping techniques to provide more accurate estimation and faster convergence speeds with differential function, allowing users to reduce calculation errors and make it easier to optimize the model.
As shown in the experiments, AMCL uses 2D LiDAR scan matching for positioning; the positioning accuracy is influenced by map similarity and feature points. However, RTAB-MAP can use a visual SLAM approach to create 2D or 3D maps by using an RGB-D camera. LiDAR is used to estimate its position and orientation in the map simultaneously. Therefore, RTAB-MAP can perform closed loop detection by using the two sensors to more accurately extract environmental features. Therefore, the positioning accuracies of RTAB-MAP in the three methods are better than those of AMCL. RTAB-MAP positioning relies on both visual and LiDAR data to achieve positioning accuracy, while AMCL only considers LiDAR information, leading to low accuracy. However, the depth camera used by RTAB-MAP is affected by light; thus, the detection results are affected by the accuracy of map construction and positioning as well as navigation error accuracy. The AMCL method uses 2D LiDAR for particle filtering positioning; the AMCL method does not depend heavily on light conditions and can be used even if the visual information is limited or disturbed. Therefore, in scenarios where the light conditions change greatly, the performance of AMCL will be better than that of RTAB-MAP, which requires visual positioning.