To verify the performance of the proposed algorithm, a series of simulation studies and experimental validations were conducted. All experiments were performed on a laptop with an Intel Core i7-13700H CPU (Intel Corporation, Santa Clara, CA, USA) and 16 GB of RAM, running Ubuntu 18.04.
4.1. Simulation Studies Under Datasets
In this study, we evaluated the performance of our algorithm on the TUM RGB-D and ICL-NUIM datasets, employing the Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) as indicators of pose-estimation accuracy. To validate the pose-estimation performance of the SLAM system that integrates point–line features in this paper, we first conducted comparative experiments on the static TUM dataset sequences using ORB-SLAM2 [2], PL-SLAM [9], RGB-D SLAM [32], PLP-SLAM [33], and Ours. All statistics are taken either from the papers of the corresponding algorithms or from real experiments with this system, where “-” indicates that the algorithm did not provide the relevant experimental result, and “×” indicates that the algorithm implemented in this paper showed no accuracy improvement in that comparison.
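For reference, ATE is typically computed as the RMSE of the translational differences after rigidly aligning the estimated trajectory to the ground truth. The following is a minimal NumPy sketch of that standard computation, assuming the two trajectories are already time-associated; it illustrates the metric itself, not the exact evaluation script used here.

```python
import numpy as np

def align_umeyama(gt, est):
    """Closed-form (Horn/Umeyama) SE(3) alignment of est onto gt (both Nx3)."""
    mu_gt, mu_est = gt.mean(0), est.mean(0)
    H = (est - mu_est).T @ (gt - mu_gt)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.eye(3)
    if np.linalg.det(Vt.T @ U.T) < 0:        # guard against a reflection
        S[2, 2] = -1
    R = Vt.T @ S @ U.T
    t = mu_gt - R @ mu_est
    return R, t

def ate_rmse(gt, est):
    """Absolute Trajectory Error (RMSE) after rigid alignment."""
    R, t = align_umeyama(gt, est)
    err = gt - (est @ R.T + t)               # per-pose translational residuals
    return np.sqrt((err ** 2).sum(axis=1).mean())
```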
From Table 1, it can be seen that the proposed method achieved superior accuracy in multiple typical scenes of the TUM dataset. In scenarios such as fr1_floor and fr3_long_office, the proposed method improved upon PL-SLAM by 77% and 53%, respectively, demonstrating its strong robustness in environments with unclear structure. Meanwhile, in relatively simple structured environments such as fr1_xyz and fr2_xyz, our method still showed a stable accuracy advantage over PL-SLAM and ORB-SLAM2, further verifying its adaptability and stability under different structural features. We also found that our method obtained the highest accuracy on five of the test sequences, and on the remaining sequences the gap between our algorithm and the most accurate one was small. From this, it can be concluded that the line features effectively improve the localization accuracy and stability of the system.
Table 2, Table 3 and Table 4 compare the localization accuracy of ORB-SLAM2, DS-SLAM [11], RDS-SLAM [34], DynaTm-SLAM [35], and Ours on the walking and sitting_static dynamic sequences of the TUM dataset. In the highly dynamic walking sequences, our method achieved average ATE improvements of 96.74% and 69.95% over ORB-SLAM2 and RDS-SLAM, respectively. Meanwhile, we found no significant difference in pose-estimation accuracy between our algorithm and DS-SLAM. However, DS-SLAM mainly relies on epipolar geometric constraints to distinguish dynamic from static features; under frequent camera rotation, its accuracy may degrade because the epipolar lines are difficult to locate accurately.
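To make the contrast concrete, the sketch below illustrates the kind of epipolar-constraint check that DS-SLAM-style methods rely on: a match is flagged as dynamic when the current point lies far from the epipolar line induced by the fundamental matrix. The pixel threshold is a hypothetical value; when rotation makes the estimated fundamental matrix (and hence the epipolar lines) inaccurate, this distance test becomes unreliable, which is exactly the weakness noted above.

```python
import cv2
import numpy as np

def epipolar_dynamic_flags(pts_prev, pts_curr, thresh_px=1.0):
    """Flag matches whose current point lies far from its epipolar line.
    pts_prev, pts_curr: Nx2 matched pixel coordinates (N >= 8)."""
    F, _ = cv2.findFundamentalMat(pts_prev, pts_curr,
                                  cv2.FM_RANSAC, 1.0, 0.999)
    ones = np.ones((len(pts_prev), 1))
    # Epipolar line in the current image for each previous point: l = F [u, v, 1]^T
    lines = (F @ np.hstack([pts_prev, ones]).T).T          # Nx3 rows (a, b, c)
    # Point-to-line distance |a*u + b*v + c| / sqrt(a^2 + b^2)
    num = np.abs(np.sum(lines * np.hstack([pts_curr, ones]), axis=1))
    den = np.linalg.norm(lines[:, :2], axis=1)
    return (num / den) > thresh_px                          # True = likely dynamic
```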
Furthermore, although deep learning-based methods can determine a more complete dynamic region, they may misjudge the state of a dynamic target when the camera's angle-of-view changes. Instead, our method preliminarily identifies dynamic points by checking geometric relationships between adjacent frames, and combines YOLOv8-seg segmentation to determine dynamic regions, which allows dynamic features to be detected accurately even under angle-of-view changes. Such a dual strategy effectively enhances the adaptability of the algorithm. It can be further seen from Table 2, Table 3 and Table 4 that the ATE of our algorithm was higher than that of DynaTm-SLAM on the fr3_walk_rpy sequence. This is because the motion state was misjudged when the angle-of-view changed sharply, and because for a period of time in this sequence only a single metal partition was captured, leaving few feature points; in this case, the line features in our method effectively maintain the stability of the system. In addition, on the walking_static sequence, there were cases where pedestrians occupied most of the scene image, yet our method adapted well thanks to its ability to extract enough line features, resulting in smaller trajectory errors.
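A simplified sketch of how such a dual strategy could be wired together is given below, assuming the ultralytics YOLOv8 Python API; the movable-class list, mask handling, and fusion rule are illustrative assumptions rather than the exact pipeline of this paper.

```python
import cv2
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")    # any YOLOv8-seg weights; this variant is an assumption
MOVABLE = {"person"}               # hypothetical set of potentially dynamic classes

def dynamic_feature_flags(frame, keypoints, geo_dynamic):
    """Label a keypoint dynamic if it falls inside a movable-instance mask
    OR was flagged by the adjacent-frame geometric check (dual strategy).
    keypoints: Nx2 integer pixel coords; geo_dynamic: length-N bool array."""
    h, w = frame.shape[:2]
    dyn_region = np.zeros((h, w), dtype=bool)
    res = model(frame, verbose=False)[0]
    if res.masks is not None:
        for mask, cls_id in zip(res.masks.data.cpu().numpy(),
                                res.boxes.cls.cpu().numpy()):
            if res.names[int(cls_id)] in MOVABLE:
                # Masks come at network resolution; resize to the frame size
                m = cv2.resize(mask.astype(np.uint8), (w, h)) > 0
                dyn_region |= m
    in_mask = dyn_region[keypoints[:, 1], keypoints[:, 0]]
    return in_mask | geo_dynamic   # dynamic if either cue fires
```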
Figure 3 shows the ATE results of ORB-SLAM2 and our algorithm on the fr3_walk_xyz, fr3_walk_static, fr3_walk_rpy, and fr3_walk_half sequences, respectively, where a larger red area indicates a larger error between the estimated and reference trajectories. The trajectories estimated by the proposed algorithm fit the real trajectories closely, mainly because we employ a dynamic-feature rejection technique based on YOLOv8-seg and Delaunay triangulation, which not only keeps errors low but also enhances the stability of the system.
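One plausible reading of the YOLOv8-seg + Delaunay combination is to build a Delaunay triangulation over the keypoints and propagate dynamic labels along short edges, so that points the segmentation mask misses (e.g., at object boundaries) are still rejected. The sketch below follows that reading; the propagation rule and the edge-length cutoff are our assumptions, not the paper's stated design.

```python
import numpy as np
from scipy.spatial import Delaunay

def propagate_dynamic_labels(keypoints, is_dynamic, max_edge_px=40.0):
    """Flood-fill dynamic labels along Delaunay edges shorter than a cutoff.
    keypoints: Nx2 float pixel coords; is_dynamic: length-N bool seed labels."""
    tri = Delaunay(keypoints)
    labels = is_dynamic.copy()
    changed = True
    while changed:
        changed = False
        for simplex in tri.simplices:          # each triangle's 3 vertex indices
            for i in range(3):
                a, b = simplex[i], simplex[(i + 1) % 3]
                # Long sliver edges should not carry the label across objects
                if np.linalg.norm(keypoints[a] - keypoints[b]) > max_edge_px:
                    continue
                if labels[a] != labels[b]:     # one endpoint dynamic -> both
                    labels[a] = labels[b] = True
                    changed = True
    return labels
```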
Additionally, we quantitatively evaluated the performance of the 3D line-segment extraction method in this paper by calculating the average reconstruction rate as follows:

$$ \bar{R} = \frac{1}{N} \sum_{i=1}^{N} \frac{L_i^{3D}}{L_i^{2D}}, $$

where $N$ denotes the number of images in the dataset used in the experiment, $L_i^{3D}$ denotes the number of 3D line-segments constructed using the method of this paper in the $i$-th frame, and $L_i^{2D}$ denotes the number of 2D line-segments used to construct the 3D line-segments in the $i$-th frame.
As a note, due to the over-segmentation of the LSD line-segment extractor, we optimized the raw LSD line-segments, so the number of line-segments used for reconstruction is not identical to the number originally extracted.
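As a worked example, the average reconstruction rate above reduces to a one-line computation over per-frame counts; the numbers in the usage line are made up for illustration.

```python
import numpy as np

def average_reconstruction_rate(n3d_per_frame, n2d_per_frame):
    """Mean over frames of (#3D segments reconstructed) / (#2D segments used)."""
    n3d = np.asarray(n3d_per_frame, dtype=float)
    n2d = np.asarray(n2d_per_frame, dtype=float)
    return float(np.mean(n3d / n2d))

# Two hypothetical frames: 42 of 60 and 38 of 55 segments reconstructed.
print(average_reconstruction_rate([42, 38], [60, 55]))   # ~0.70
```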
Figure 4 shows the 3D line-segment reconstruction results of our method on different dataset sequences. As can be seen, our method accurately reveals the structural features of the scene by reconstructing a local sparse map from 3D line-segments. From Table 5, it can be seen that the fitting-based reconstruction method proposed in this paper shows significant advantages over the traditional line-segment endpoint triangulation method. On the live_room_traj1_frei, traj0_frei_png, and freiburg3_long_office_household sequences, the reconstruction accuracy increased by 34.8%, 21.5%, and 39.7%, respectively, an average increase of 32%; the improvement was particularly pronounced in complex office scenes. Moreover, the accuracy of our method fluctuated by only 3.5% across the three scenes, versus 14.7% for the traditional method. This indicates that the geometrically-constrained optimization and continuous line-segment fitting strategies effectively improve the algorithm's generalization across different scenes.
Figure 5 illustrates the local mapping results of different algorithms on the dynamic fr3_walk_xyz sequence. When dynamic objects are present in the camera's field-of-view, they not only degrade the accuracy of pose estimation but also leave many residual artifacts in the constructed 3D dense map; such dense maps cannot be used for navigation and obstacle avoidance after conversion into octree maps. In contrast, our algorithm detects dynamic objects in the front-end and excludes dynamic features from the map construction process, effectively ensuring the consistency of the map.
Figure 6 demonstrates the matching results based on semantic instance similarity in this paper. From Figure 6a, it can be seen that the two images are highly similar, each containing two chair instances, two display instances, and two keyboard instances; the proposed method uses position-threshold restrictions to prevent mismatches and thus improve the accuracy of the similarity computed between the two frames. In Figure 6d, although the static backgrounds of the two scenes are highly similar, the occlusion caused by dynamic objects results in a low similarity score from the point–line bag-of-words model, whereas our method effectively avoids the interference of dynamic objects. For the case of Figure 6f, where the two images differ significantly but contain the same instance class, our algorithm does not report a spuriously high similarity merely because the same instance appears. From Table 6, it can be further concluded that the proposed instance-matching method adapts well to various scenes and computes reasonable similarity scores, effectively improving the recall of loop-closure detection.
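The following sketch illustrates the kind of class-aware, position-gated instance matching described above: same-class instances are paired only when their normalized centers are close, which prevents, e.g., the two chairs in Figure 6a from being cross-matched. The greedy pairing and the scoring rule are our simplification, not the paper's exact formulation.

```python
import numpy as np

def instance_similarity(inst_a, inst_b, pos_thresh=0.15):
    """Greedy class-aware instance matching with a position gate.
    Each instance: (class_name, cx, cy) with centers normalized to [0, 1];
    pos_thresh is a hypothetical gating distance."""
    used = set()
    matches = 0
    for cls_a, xa, ya in inst_a:
        best, best_d = None, pos_thresh
        for j, (cls_b, xb, yb) in enumerate(inst_b):
            if j in used or cls_b != cls_a:
                continue
            d = np.hypot(xa - xb, ya - yb)
            if d < best_d:            # the position threshold blocks pairing
                best, best_d = j, d   # same-class instances that sit far apart
        if best is not None:
            used.add(best)
            matches += 1
    denom = max(len(inst_a), len(inst_b))
    return matches / denom if denom else 0.0

# Two chairs, but only the nearby pair matches: similarity = 1/2
print(instance_similarity([("chair", 0.2, 0.5), ("chair", 0.8, 0.5)],
                          [("chair", 0.22, 0.52)]))
```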
4.2. Experimental Testing in Real Scenes
In the experiment, we verified our algorithm on a TurtleBot3 mobile robot (ROBOTIS, Seoul, Republic of Korea) equipped with an Intel RealSense D435i depth camera. Figure 7 and Table 7 show the experimental scene and the camera parameters, respectively. It should be noted that pedestrians walk randomly through this scene, making it a genuinely dynamic environment.
Figure 8 illustrates the feature extraction and sparse mapping results of the robot in a real dynamic scene. As observed in Figure 8a, the system extracts many point–line features from the pedestrian, resulting in poor consistency between the constructed sparse map and the real scene. In Figure 8b, by contrast, most of the features on the pedestrian have been removed, and the sparse map is highly consistent with the real scene, demonstrating the effectiveness of the dynamic-feature removal method proposed in this paper.
To demonstrate the global mapping effect, we constructed a global semantic map in another office containing more objects, as shown in Figure 9. Because ORB-SLAM2 performs no dynamic-feature rejection, residual shadows of pedestrians walking back and forth appear in its global map, and the overall structure deviates from the real environment, which would hinder the robot's autonomous navigation and obstacle avoidance. In contrast, we combined YOLOv8-seg and Delaunay triangulation to remove feature points on dynamic objects, yielding a global map that maintains high consistency with the real environment and exhibits good robustness throughout the entire process.
To clearly show the effect of semantic mapping, we rendered the entities in different colors according to the YOLOv8-seg instance segmentation results, and constructed a dense semantic map and a semantic octree map of the real scene, respectively. From Figure 10c,d, we find that the semantic maps constructed by our method accurately identify the corresponding entities, and the whole map exhibits no misalignment and good consistency.
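A minimal sketch of how such a semantic rendering can be produced is shown below: depth pixels are back-projected through the camera intrinsics and each 3D point is colored by the class of the instance mask it falls in. The color table, label encoding, and function names are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

# Hypothetical per-class RGB colors for rendering entities in the semantic map
CLASS_COLORS = {"chair": (255, 0, 0), "tv": (0, 255, 0), "keyboard": (0, 0, 255)}
DEFAULT_COLOR = (128, 128, 128)       # unlabeled/background points stay gray

def colorize_cloud(depth, labels, id_to_name, K):
    """Back-project a depth image and color each point by its instance class.
    depth: HxW in meters; labels: HxW class ids (-1 = background); K: 3x3 intrinsics."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    v, u = np.nonzero(depth > 0)
    z = depth[v, u]
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)
    colors = np.array([CLASS_COLORS.get(id_to_name.get(int(l), ""), DEFAULT_COLOR)
                       for l in labels[v, u]], dtype=np.uint8)
    return pts, colors                # feed into the dense/octree map builder
```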
Meanwhile, we utilized the SLAM trajectory evaluation tool “EVO” to evaluate the global localization accuracy of our algorithm.
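For reproducibility, evo also exposes a Python API for this evaluation; a minimal sketch (with hypothetical file names, and assuming TUM-format trajectory files) might look as follows.

```python
from evo.core import metrics, sync
from evo.tools import file_interface

# Hypothetical trajectory files in TUM format
traj_ref = file_interface.read_tum_trajectory_file("groundtruth.txt")
traj_est = file_interface.read_tum_trajectory_file("estimated.txt")

# Associate poses by timestamp, then rigidly align the estimate to the reference
traj_ref, traj_est = sync.associate_trajectories(traj_ref, traj_est)
traj_est.align(traj_ref)

ape = metrics.APE(metrics.PoseRelation.translation_part)
ape.process_data((traj_ref, traj_est))
print("ATE RMSE:", ape.get_statistic(metrics.StatisticsType.rmse))
```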
Figure 11 shows the trajectories generated by our algorithm and ORB-SLAM2; the ORB-SLAM2 trajectory exhibits serious errors compared to ours. The main reason is that when the robot first reached the loop-closure point, dynamic objects were in the field-of-view and part of the static background was occluded, whereas on the second pass there were no dynamic objects. This discrepancy disrupted ORB-SLAM2's BoW model, so the loop closure was not detected at this point and the positional drift was not corrected in time. In contrast, our method combines semantic information with point–line features and detected the loop closure correctly.
Furthermore, we quantitatively assessed the effectiveness of loop-closure detection using Precision–Recall (P–R) curves, balancing precision and recall by adjusting the normalized similarity coefficient.
Figure 12 compares the P–R curves of loop-closure detection using different features in the indoor dynamic environment. With a single feature (ORB or LBD), both Precision (0.88–0.92) and Recall (0.55–0.65) were limited; the “ORB+LBD” combination significantly improved Recall, but its precision remained constrained by dynamic interference. In contrast, “ORB+LBD+Semantic” improved precision while maintaining high Recall, forming a P–R curve clearly shifted toward the upper right. This indicates that computing similarity with instance matching provides richer semantic information and effectively suppresses the influence of dynamic objects on loop-closure detection.
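Such a P–R curve is obtained by sweeping the similarity threshold over the scored candidate loop pairs; a minimal sketch using scikit-learn is shown below, with made-up scores and labels standing in for the values logged during the experiment.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical stand-ins: similarity scores of candidate loop pairs and
# ground-truth labels (1 = true loop closure) logged during the run.
scores = np.array([0.91, 0.34, 0.78, 0.12, 0.66, 0.85, 0.27, 0.73])
labels = np.array([1, 0, 1, 0, 0, 1, 0, 1])

# Each threshold yields one (precision, recall) point of the P-R curve,
# like those compared in Figure 12.
precision, recall, thresholds = precision_recall_curve(labels, scores)
print(precision, recall)
```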
Table 8 compares the average per-frame tracking time and detection-plus-segmentation time of our algorithm against several mainstream algorithms. The average time required to detect and segment each frame in our method was reduced by 87% compared to Dyna-SLAM [10], 87% compared to YOLO-SLAM [36], and 70% compared to DO-SLAM [37]. Moreover, the average tracking time of our method was reduced by 76.6% compared to DO-SLAM. Hence, we can reasonably conclude that our method achieves good real-time performance while preserving accuracy.
To further verify the performance of our algorithm in large-scale scenes, we carried out an experiment in a 27 m × 20 m outdoor corridor environment. From the global mapping results in Figure 13, we find that our method extracts rich line features in the outdoor corridor, which effectively assist the point features in localization and mapping. In Figure 13a, the added line features better reveal the structural information of the scene, while the dense map constructed in Figure 13b shows high consistency with the real scene, without distortion or deformation. We can reasonably conclude that our algorithm maintains good accuracy and robustness in large-scale scenes.