Sensors
  • Article
  • Open Access

11 May 2024

DSOMF: A Dynamic Environment Simultaneous Localization and Mapping Technique Based on Machine Learning

School of Electromechanical Engineering, Beijing Institute of Technology, Beijing 100081, China
Author to whom correspondence should be addressed.
This article belongs to the Topic 3D Computer Vision and Smart Building and City, 2nd Volume

Abstract

To address the reduced localization accuracy and incomplete map construction exhibited by classical semantic simultaneous localization and mapping (SLAM) algorithms in dynamic environments, this study introduces a dynamic scene SLAM technique that builds upon direct sparse odometry (DSO) and incorporates instance segmentation and video completion algorithms. While prioritizing the algorithm’s real-time performance, we leverage the rapid matching capabilities of DSO to link identical dynamic objects in consecutive frames. This association is achieved by merging semantic and geometric data, and the inclusion of semantic probability further enhances the matching accuracy during image tracking. Furthermore, we incorporate a loop closure module based on video inpainting algorithms into our mapping thread. This allows our algorithm to rely on the completed static background for loop closure detection, further enhancing its localization accuracy. The efficacy of this approach is validated using the TUM and KITTI public datasets and an unmanned platform experiment. Experimental results show that, in various dynamic scenes, our method achieves an improvement exceeding 85% in localization accuracy compared with the DSO system.

1. Introduction

Simultaneous localization and mapping (SLAM) technology facilitates the acquisition of location and environmental information through pose estimation and map construction [1]. In recent years, visual SLAM technology, as a subject of extensive research, has advanced significantly [2]. Sophisticated visual SLAM algorithms have been developed to achieve localization precision at the centimeter level and can be used to successfully construct large-scale three-dimensional (3D) maps [3,4,5,6,7,8]. However, these advanced visual SLAM algorithms operate predominantly under the strong assumption of rigid scenes, which substantially limits their applicability to dynamic environments.
With the development of deep learning technologies, an increasing number of researchers have integrated object detection and semantic segmentation algorithms with classical SLAM algorithms to enhance their robustness in dynamic scenarios. Dyna-SLAM [9] combines semantic segmentation algorithms with geometric methods to identify and remove dynamic objects from images. DM-SLAM [10] utilizes instance segmentation information and optical flow data to mitigate the effects of dynamic objects in a scene. Detect-SLAM [11] employs object detection algorithms to identify dynamic feature points and uses the motion probability to convey the motion information of these points. However, these algorithms were all built upon feature-based SLAM systems. When most of the selected feature points originate from dynamic objects, the localization accuracy of the SLAM algorithm can decrease significantly because of the reduced number of usable feature points. By contrast, a more flexible selection mechanism for photometric gradient points can prevent the selected points from being overly concentrated. Moreover, the frame-to-frame matching speed of SLAM algorithms based on direct methods is higher.
Direct Sparse Odometry (DSO) is a visual odometry algorithm that estimates the camera motion in a sparse and direct manner [6]. Unlike traditional feature-based methods, which extract and match key-points across images, DSO operates directly on pixel intensities. This approach significantly enhances the flexibility of map point selection, effectively avoiding the decrease in positioning accuracy caused by an excessive number of feature points originating from dynamic objects. While DSO primarily focuses on static scenes, its authors also noted in their paper that direct methods have a greater advantage when dealing with dynamic objects. In our practical applications, we have likewise observed that DSO exhibits greater robustness than ORB-SLAM2 when dynamic objects are present in the environment.
Although the DSO algorithm holds significant advantages over feature-based SLAM algorithms when it comes to handling dynamic environments, its adoption for further research and improvement is limited due to its low code readability and the challenging nature of modifying its direct method-based optimization approach. In this work, we not only integrate the DSO algorithm with semantic segmentation and video inpainting algorithms but also restructure its algorithmic framework. We introduce a SLAM method tailored for rigid dynamic scenes, named DSOMF. This method employs instance segmentation and video inpainting algorithms to mitigate the impact of dynamic objects, thereby enhancing the localization precision and mapping performance of the DSO algorithm. The principal contributions of this study are as follows:
  • We refine the region segmentation approach within the DSO algorithm, accelerating the motion recognition speed of the algorithm through dynamic region segmentation and the direct selection of optical flow points.
  • We propose a method that utilizes inter-frame semantic information to identify and remove dynamic objects. This approach effectively reduces the interference caused by noise introduced by dynamic objects, enhancing the system’s robustness in dynamic environments. Furthermore, we improve the matching accuracy of photometric gradient points by adding semantic probability, so that our algorithm can make full use of semantic information.
  • We synergize video inpainting algorithms with the map-building thread of the DSO algorithm, compensating for static background gaps caused by the removal of moving objects, thereby optimizing the map construction performance of the DSO algorithm in dynamic environments.
  • We integrate a loop closure detection module, thus rendering the DSO algorithm framework more comprehensive. Moreover, by associating loop closure detection with video inpainting algorithms, we enhance the efficiency of this module in identifying loop closures within dynamic environments.
In the following, we discuss related studies in Section 2 and provide a comprehensive introduction to our methodology in Section 3, with a detailed elaboration on motion object segmentation, semantic data association, motion object recognition, and static background completion. Section 4 presents a comparative analysis of the positioning accuracy of our algorithm and existing ones using indoor and outdoor datasets. Real-scene experiments based on the unmanned aerial vehicle and unmanned ground vehicle platforms are presented in Section 5, and we conclude the paper in Section 6.

3. Algorithm Design

In this section, we first introduce the overall algorithmic framework of the proposed SLAM system, followed by a detailed discussion of the methods introduced herein. This study addresses the problem of localization and mapping in dynamic environments and proposes an algorithm framework based on DSO, as illustrated in Figure 1. Within this framework, the regions highlighted in blue represent programs that have been added or modified, whereas those in green indicate the original programs.
Figure 1. Algorithm framework.
Firstly, we employed an instance segmentation network, i.e., mask R-CNN [29], to extract all prior instances of dynamic objects. Subsequently, we utilized an enhanced PARSAC algorithm [30] for the motion recognition of dynamic object instances and for eliminating photometric gradient points within the dynamic object regions. Once elimination was completed, photometric gradient points were supplemented from static regions to increase the number of trackable points for the algorithm. After selecting the keyframes, we applied the flow-guided video completion (FGVC) algorithm [31] to replenish the static background occluded by dynamic objects in the keyframes. Finally, we re-supplemented photometric gradient points from the replenished portions of the keyframes and input them into the mapping thread for loop closing. The image processing procedure is depicted in Figure 2 below, with the tracking thread enclosed in blue and the mapping thread in orange.
Figure 2. Image processing workflow (The blue box in the diagram represents the tracking thread, while the orange box represents the mapping thread).
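For concreteness, the following Python sketch outlines the per-frame handling described above. It is a minimal illustration under stated assumptions, not the actual DSOMF implementation: the segment, is_dynamic, and inpaint callables are placeholders standing in for Mask R-CNN, the PARSAC-based motion check, and FGVC, respectively.

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable

@dataclass
class Frame:
    image: np.ndarray        # H x W x 3 image from the camera
    is_keyframe: bool = False

def handle_dynamic_objects(frame: Frame,
                           segment: Callable[[np.ndarray], np.ndarray],
                           is_dynamic: Callable[[np.ndarray], bool],
                           inpaint: Callable[[np.ndarray, np.ndarray], np.ndarray]) -> np.ndarray:
    """Return the image handed to point selection after dynamic-object handling:
    segment potential movers, drop the regions judged dynamic, and (for keyframes
    only) fill the uncovered static background before points are re-selected."""
    masks = segment(frame.image)                       # H x W x K per-instance masks
    dynamic = np.zeros(frame.image.shape[:2], dtype=bool)
    for k in range(masks.shape[-1]):
        if is_dynamic(masks[..., k]):                  # geometric motion check
            dynamic |= masks[..., k].astype(bool)
    if frame.is_keyframe and dynamic.any():
        return inpaint(frame.image, dynamic.astype(np.uint8))
    out = frame.image.copy()
    out[dynamic] = 0                                   # non-keyframes: mask out movers
    return out
```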

3.1. Segmentation of Potential Dynamic Objects

To detect objects within an image that possess the potential for motion, we employed a mask R-CNN instance segmentation network to segment the input image. The mask R-CNN network is a widely used instance segmentation technique that enables the acquisition of pixel-level semantic segmentation outcomes, along with instance-level object labels. Current semantic SLAM algorithms designed for dynamic environments typically utilize only semantic segmentation information and directly eliminate objects that are likely to move. This approach inadvertently removes objects that are potentially mobile yet currently stationary, thus diminishing the number of static feature points available to the algorithm and affecting the precision of SLAM localization. This study innovatively integrates semantic segmentation information with instance label data, thereby refining the identification of moving objects to the level of individual instances and preserving as many static points in the image as possible.
The input for the mask R-CNN is the raw image captured using the camera, as shown in Figure 3a. In this study, the mask R-CNN network was exclusively utilized to segment the categories of objects that exhibit the potential for motion, including, but not limited to, humans, bicycles, cars, cats, and dogs. These categories are posited to encompass most of the dynamic objects likely to appear in typical scenes. If additional object categories must be identified, the mask R-CNN network can be retrained on the COCO dataset [32] and its weight files fine-tuned accordingly.
Figure 3. Dynamic object instance segmentation results. (a) Original image; (b) mask image.
For the output of the mask R-CNN, we assumed that the input image size was s × h × 3. During processing, the network generated an s × h × w matrix, where w represents the number of dynamic objects in the image. Each output channel $c_w$ contains a binary mask image for a specific object instance. Stacking these w masks yielded a segmented result that encompassed all instances of dynamic objects, as illustrated in Figure 3b.
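As a minimal illustration of how the per-instance channels can be combined, the following sketch (assuming a NumPy s × h × w mask stack; the function name and toy data are our own) collapses them into a single binary mask covering all potentially dynamic pixels.

```python
import numpy as np

def combine_instance_masks(masks: np.ndarray) -> np.ndarray:
    """Collapse an s x h x w stack of per-instance binary masks into one mask
    covering every potentially dynamic object (logical OR over channels)."""
    return masks.astype(bool).any(axis=-1).astype(np.uint8)

# Illustrative use with two hypothetical 4 x 4 instance masks:
m = np.zeros((4, 4, 2), dtype=np.uint8)
m[0:2, 0:2, 0] = 1      # instance 1
m[2:4, 2:4, 1] = 1      # instance 2
combined = combine_instance_masks(m)   # 4 x 4 mask with both instances set to 1
```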

3.2. Region Segmentation and Data Association

The DSO performs tracking by selecting points with photometric gradients within an image. To achieve a more uniform point distribution, the DSO segments the image into n square regions, each with a side length of d, and selects photometric gradient points within each region. To further conserve computational resources, this study enhances the original algorithm by redefining the partitioning of regions based on the number and size of potentially moving objects in an image, thereby facilitating motion detection. An analysis of semantic segmentation labels across multiple datasets revealed that when the labels indicating potential motion exceeded 30% of the total label count, the image was likely to contain a higher number and larger projection area of moving objects. Consequently, when the proportion of labels for potentially moving objects exceeded 30%, the image was partitioned into square regions with a side length of 6d. Otherwise, the regions were defined as having a side length of 4d. Upon completing the partitioning of regions, the central point of each region was identified, and the Kanade–Lucas–Tomasi method was applied to track their optical flow f: (u, v) → (u + Δu, v + Δv). Thus, the correspondence between the segmentation regions of the two frames was identified, as illustrated in Figure 4.
Figure 4. Optical flow tracking of regional centroids.
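The following sketch illustrates the adaptive region partitioning and centroid tracking described above, using OpenCV’s pyramidal Lucas–Kanade (KLT) tracker. The function names and the way the dynamic-label ratio is supplied are illustrative assumptions rather than the paper’s actual implementation.

```python
import numpy as np
import cv2  # OpenCV provides pyramidal Lucas-Kanade (KLT) tracking

def region_centroids(img_shape, dyn_label_ratio, d):
    """Partition the image into square regions whose side length depends on the
    proportion of potentially dynamic labels (6d above the 30% threshold,
    4d otherwise, as in Section 3.2) and return the centre of each region."""
    side = 6 * d if dyn_label_ratio > 0.3 else 4 * d
    h, w = img_shape[:2]
    ys = np.arange(side // 2, h, side)
    xs = np.arange(side // 2, w, side)
    grid = np.array([[x, y] for y in ys for x in xs], dtype=np.float32)
    return side, grid.reshape(-1, 1, 2)

def track_centroids(prev_gray, cur_gray, centroids):
    """KLT optical flow f: (u, v) -> (u + du, v + dv) for each region centre;
    returns only the centroids that were tracked successfully."""
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, centroids, None)
    ok = status.reshape(-1) == 1
    return centroids[ok], nxt[ok]
```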
The association degree between segmentation regions of successive frames can be represented as follows:
$$\varphi_f\!\left(C_a^{t-1}, C_b^{t} \mid Z, f^{t-1 \to t}\right) = \frac{\sum_{k=1}^{N} \left[ Loc\!\left(f(z_k), C_b^{t}\right) \times Loc\!\left(z_k, C_a^{t-1}\right) \right]}{\sum_{k=1}^{N} Loc\!\left(z_k, C_a^{t-1}\right)},$$
where $\varphi_f$ denotes the association degree of the regional centroids, $Z$ the observation function, $z_k$ the observation points, $a$ and $b$ the semantic labels, and $C_a^{t-1}$ and $C_b^{t}$ the semantic label clusters of the central points at times $t-1$ and $t$, respectively. By defining the spatial association of segment clustering, we can represent the overlap between segmented regions across two frames [33]. Specifically, it can be articulated as:
$$\varphi_s\!\left(C_a^{t-1}, C_b^{t}\right) = \frac{\left| C_a^{t-1} \cap C_b^{t} \right|}{\sum_{k=0}^{N} \left| C_k^{t-1} \right|},$$
where $\varphi_s$ denotes the spatial association of the segmentation clusters. By considering both the spatial and centroid associations, the probability that $C_a^{t-1}$ and $C_b^{t}$ belong to the same category can be expressed as:
$$P\!\left(l_i^{t} = j \mid C^{t-1}, C^{t}, L^{t-1}, Z, f^{t-1 \to t}\right) = \frac{\varphi_s\!\left(C_i^{t-1}, C_j^{t}\right) + \alpha \times \varphi_f\!\left(C_i^{t-1}, C_j^{t}\right)}{\sum_{m=0}^{N} \sum_{n=0}^{N} \left[ \varphi_s\!\left(C_m^{t-1}, C_n^{t}\right) + \alpha \times \varphi_f\!\left(C_m^{t-1}, C_n^{t}\right) \right]},$$
where $l_i^{t}$ denotes the object label. Consequently, the label of $C_i^{t}$ can be expressed as:
$$l_i^{t} = \arg\max_{j} P\!\left(l_i^{t} = j \mid C^{t-1}, C^{t}, L^{t-1}, Z, f^{t-1 \to t}\right),$$
The equation above clearly indicates that when $l_i^{t} = l_0^{t-1}$ and $i \neq 0$, a new object label emerges in the scene, and the label count is incremented. Using this approach, the association between the segmentation results of potential moving objects across successive frames can be established based on both centroid matching and the overlap of segmentation regions.
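A minimal sketch of how the two association terms might be combined to relabel clusters is given below. The label-map representation, the Loc indicator realised as pixel membership tests, and the weight alpha = 0.5 are assumptions made for illustration.

```python
import numpy as np

def associate_clusters(prev_labels, cur_labels, prev_pts, tracked_pts, alpha=0.5):
    """Combine the centroid-flow association (phi_f) with the spatial overlap
    (phi_s) defined above and assign each current cluster its best-matching
    previous label. prev_labels / cur_labels are H x W integer label maps
    (0 = background); prev_pts / tracked_pts are N x 2 (x, y) region centroids
    before and after KLT tracking; alpha is an assumed weighting factor."""
    h, w = cur_labels.shape
    px, py = prev_pts[:, 0].astype(int), prev_pts[:, 1].astype(int)
    tx = np.clip(tracked_pts[:, 0].astype(int), 0, w - 1)
    ty = np.clip(tracked_pts[:, 1].astype(int), 0, h - 1)
    n_prev, n_cur = prev_labels.max() + 1, cur_labels.max() + 1
    score = np.zeros((n_prev, n_cur))
    total_prev = max((prev_labels > 0).sum(), 1)
    for a in range(n_prev):
        in_a = prev_labels[py, px] == a
        for b in range(n_cur):
            # phi_f: fraction of cluster-a centroids whose flow lands in cluster b
            phi_f = (cur_labels[ty[in_a], tx[in_a]] == b).mean() if in_a.any() else 0.0
            # phi_s: pixel overlap of the two clusters over the total labelled area
            phi_s = np.logical_and(prev_labels == a, cur_labels == b).sum() / total_prev
            score[a, b] = phi_s + alpha * phi_f
    # The normalising denominator is shared by all pairs, so the arg max of the
    # unnormalised scores yields the same label assignment.
    return score.argmax(axis=0)     # previous label assigned to each current cluster
```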

3.3. Dynamic Object Recognition and Determination of Dynamic Regions

Because the methodology in this study is adapted from DSO, it relies on photometric error-based matching, which is computationally efficient but may compromise the accuracy of point matching. Consequently, this study incorporates a motion recognition method based on semantic probability [34], which is expressed as follows:
$$\varepsilon^{2}\!\left(x, x', F\right) = \frac{\left( \bar{x}'^{\top} F \bar{x} \right)^{2}}{\left(F\bar{x}\right)_1^{2} + \left(F\bar{x}\right)_2^{2} + \left(F^{\top}\bar{x}'\right)_1^{2} + \left(F^{\top}\bar{x}'\right)_2^{2}},$$
where $x$ denotes the coordinates of the regional centroid in the previous frame, $x'$ the coordinates of the regional centroid in the current frame, $\bar{x}$ the coordinates in homogeneous form, and $F$ the fundamental matrix between two adjacent frames. Consequently, for each label cluster, a parameter $D = \{ d_i \mid i = 1, 2, \ldots, N \}$ can be obtained to evaluate whether the matching of the centroid of the cluster region adheres to geometric constraints. This can be expressed as follows:
$$D\!\left(Z, C_i^{t}, F, f^{t-1 \to t}\right) = \frac{\sum_{k=1}^{N} \varepsilon^{2}\!\left(z_k, f(z_k), F\right) \times Loc\!\left(z_k, C_i^{t}\right)}{\sum_{k=1}^{N} Loc\!\left(z_k, C_i^{t}\right)},$$
$$d_i = D\!\left(Z, C_i^{t}, F, f^{t-1 \to t}\right),$$
When $d_i$ surpasses a specified threshold $\tau_{dyn}$, we assume that motion has occurred within the associated segmentation cluster.
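The following sketch applies the Sampson-distance test above per cluster. The threshold tau_dyn is a tuning parameter whose value is not specified here, and the function interfaces are illustrative assumptions rather than the system’s exact API.

```python
import numpy as np

def sampson_error(x, x_prime, F):
    """Sampson distance of a point match (x, x') under fundamental matrix F,
    with x and x' given as 2-vectors in pixel coordinates."""
    xb = np.array([x[0], x[1], 1.0])
    xpb = np.array([x_prime[0], x_prime[1], 1.0])
    Fx = F @ xb
    Ftxp = F.T @ xpb
    num = float(xpb @ F @ xb) ** 2
    den = Fx[0] ** 2 + Fx[1] ** 2 + Ftxp[0] ** 2 + Ftxp[1] ** 2
    return num / den

def cluster_is_dynamic(centroids, tracked, F, tau_dyn):
    """Average the Sampson error over a cluster's tracked centroids to obtain the
    score d_i; the cluster is flagged as moving when d_i exceeds tau_dyn."""
    errs = [sampson_error(x, xp, F) for x, xp in zip(centroids, tracked)]
    d_i = float(np.mean(errs)) if errs else 0.0
    return d_i > tau_dyn, d_i
```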
After determining the motion status of the segmentation clusters, regions whose centroids are dynamic cluster points are marked as dynamic areas. In addition, regions whose centroids are not dynamic cluster points but whose extremities fall within a dynamic cluster are defined as edge areas. The dynamism of an edge area is ascertained based on the distance from its centroid to the segmentation cluster. The distance from the centroid of an edge area to the segmentation cluster is expressed as:
$$dist\!\left(x_k, C_j\right) = \min_{x_i \in C_j} \left\| x_k - x_i \right\|_2, \quad x_k \in \bar{C}_j,$$
In the equation, $\bar{C}_j$ represents the set of edge centroids of segmentation cluster $C_j$. Therefore, the semantic probability that the edge centroid $x_k$ belongs to $C_j$ can be expressed using binomial logistic regression as follows:
$$P = \frac{1}{\exp\!\left( \lambda \times dist\!\left(x_k, C_j\right) \right) + 1},$$
When $P$ exceeds $P_{dyn}$, the edge area is classified as dynamic. Following the determination of the dynamic areas, these regions are marked as occupied (as illustrated in Figure 5), and all photometric gradient points selected using the DSO algorithm within these areas are excluded. Subsequently, based on the proportion of dynamic areas to the total area, the threshold for selecting photometric gradient points in the static areas is reduced appropriately, so that the points selected in the unoccupied areas compensate for those removed.
Figure 5. Delineation of dynamic and static regions.
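A minimal sketch of the edge-area decision, with lambda and P_dyn treated as unspecified tuning parameters and the point-set representation assumed for illustration, might look as follows.

```python
import numpy as np

def edge_area_dynamic(edge_centroid, cluster_pts, lam, p_dyn):
    """Distance from an edge-region centroid to the nearest point of a dynamic
    segmentation cluster, mapped to a probability with the logistic function
    above; the edge area is treated as dynamic when the probability exceeds p_dyn."""
    dists = np.linalg.norm(np.asarray(cluster_pts, dtype=float)
                           - np.asarray(edge_centroid, dtype=float), axis=1)
    dist = dists.min()
    p = 1.0 / (np.exp(lam * dist) + 1.0)
    return p > p_dyn, p
```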

3.4. Completion of Static Background in Keyframes and Loop Closure

After instance segmentation and motion-consistency verification are performed to eliminate dynamic objects, the constructed static background map may exhibit gaps, as illustrated in Figure 6. In the DSO framework, non-keyframes participate only in localization and tracking, whereas map construction relies on keyframes. Consequently, we utilized pixels from non-keyframes to fill in the static background voids caused by dynamic objects obstructing the keyframes. Thus, the keyframes can be used to synthesize more realistic static environment images after the removal of dynamic objects. This type of composite image, which includes static structures, can not only further optimize and enhance the accuracy of camera pose estimation but also contribute significantly to virtual and augmented reality applications.
Figure 6. Comparative analysis of mapping outcomes pre and post dynamic object elimination.
For missing regions in an image, computing the optical flow field of the area is easier than directly filling pixels, and optical flow-guided pixel propagation can naturally maintain temporal coherence [35]. Additionally, owing to the tracking of object movement in non-keyframes, their optical flow can be obtained more readily. Therefore, we adopted the FGVC algorithm, which is guided by optical flow edges. After the tracking thread identifies dynamic objects, it sends the keyframe images and dynamic object labels to the video completion thread. The FGVC video completion algorithm comprises three main steps (as shown in Figure 7): (1) flow completion, where the forward and backward optical flows between adjacent frames are calculated and completed within the missing regions; (2) temporal propagation, where the trajectory of the optical flow is followed to identify a set of candidate pixels for each missing pixel, and a confidence score as well as a binary validity indicator are estimated for each candidate; (3) fusion, where confidence-weighted averaging is employed to fill each missing pixel that has at least one valid candidate. Single-image completion techniques are used to fill areas devoid of candidate pixels. To ensure the effectiveness of the map completion thread, the tracking thread retains 20 non-keyframes for completion. After the missing areas are completed, photometric gradient points are selected within the completed areas and integrated into the backend optimization thread to enhance the localization accuracy and mapping outcomes.
Figure 7. FGVC optical flow completion process.
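The confidence-weighted fusion step can be illustrated with the following simplified sketch. It is not the FGVC implementation; the candidate, confidence, and validity arrays are assumed inputs produced by the temporal propagation step.

```python
import numpy as np

def fuse_candidates(candidates, confidences, valid):
    """Confidence-weighted average of the valid candidate colours propagated along
    the completed optical flow for one missing pixel; returns None when no valid
    candidate exists, signalling a fall-back to single-image completion."""
    candidates = np.asarray(candidates, dtype=float)    # N x 3 candidate colours
    w = np.asarray(confidences, dtype=float) * np.asarray(valid, dtype=float)
    if w.sum() == 0:
        return None
    return (candidates * w[:, None]).sum(axis=0) / w.sum()
```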
After completing the video inpainting, we re-selected photometric gradient points on the restored static background and incorporated them into the mapping thread. Subsequently, we detected loop closure candidates using the bag-of-words (BoW) model, computed the Sim(3) transformation between the candidate frames and the current frame, and applied these constraints in a pose graph optimization to enhance the overall map accuracy. This method effectively avoids failures in BoW-based detection caused by occlusions from dynamic objects, thereby improving the efficiency of loop closure detection in dynamic environments.
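As an illustration of the candidate-selection step only (Sim(3) estimation and pose graph optimization are omitted), the following sketch scores a keyframe’s BoW vector against earlier keyframes. The cosine-similarity scoring and the min_score and min_gap parameters are assumptions for illustration rather than the exact criteria used in our system.

```python
import numpy as np

def loop_candidates(current_bow, keyframe_bows, min_score=0.3, min_gap=30):
    """Score the current keyframe's BoW vector (built from the inpainted static
    background) against earlier keyframes and return candidates above a
    similarity threshold, skipping the most recent keyframes."""
    cur = np.asarray(current_bow, dtype=float)
    cur = cur / (np.linalg.norm(cur) + 1e-12)
    earlier = keyframe_bows[:-min_gap] if len(keyframe_bows) > min_gap else []
    hits = []
    for idx, bow in enumerate(earlier):
        v = np.asarray(bow, dtype=float)
        score = float(cur @ (v / (np.linalg.norm(v) + 1e-12)))
        if score > min_score:
            hits.append((idx, score))
    return sorted(hits, key=lambda t: -t[1])   # best candidates first
```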

4. Simulation Testing

To verify the effectiveness of the proposed method, we conducted tests using the publicly available indoor dynamic TUM dataset [36] and the outdoor dynamic KITTI dataset [37]. Because the proposed method modifies the existing stable open-source DSO algorithm, DSO primarily serves as the benchmark for comparison. Additionally, we compared the proposed method with the similar Dyna-SLAM algorithm, which also utilizes the mask R-CNN algorithm for segmenting dynamic objects. To evaluate the effect of map completion on the localization accuracy, we used an algorithm that removes dynamic objects but does not perform map completion (named DSOM) and an algorithm that removes dynamic objects and performs map completion (named DSOMF). Additionally, we employed the absolute trajectory error (ATE) and relative pose error (RPE) for a quantitative evaluation of the algorithms’ localization accuracy. The ATE metric represents the global consistency of the trajectory, whereas the RPE metric reflects the drift in translation and rotation. All the experiments were conducted on a notebook computer equipped with an Intel Core i9-14900H CPU, an RTX 4090 GPU with 16 GB of graphics memory, and 64 GB of RAM in a dual 32 GB configuration.
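For reference, the ATE statistics reported below can be computed as in the following simplified sketch, which assumes time-synchronised and already-aligned trajectories (the TUM benchmark tools and EVO additionally perform the alignment step).

```python
import numpy as np

def ate_stats(gt_xyz, est_xyz):
    """RMSE and STD of the translational absolute trajectory error for
    time-synchronised, already-aligned ground-truth and estimated positions
    (arrays of shape N x 3)."""
    err = np.linalg.norm(np.asarray(gt_xyz) - np.asarray(est_xyz), axis=1)
    rmse = float(np.sqrt(np.mean(err ** 2)))
    std = float(np.std(err))
    return rmse, std
```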

4.1. Simulation Testing on TUM Dataset

The TUM dataset captures data for various task types in indoor environments using RGB-D cameras. Moreover, the dataset provides the ground truth for camera poses and comparison tools, thus rendering it highly suitable for SLAM researchers to evaluate the performance of their algorithms.
In this study, three sets of data from the dynamic object module of the dataset were used for simulation testing: fr3_sitting_static, fr3_walking_static, and fr3_walking_xyz. The first dataset represents a static environment, whereas the latter two are dynamic environments. The results are shown in Table 1, where the first column lists the dataset names and the second to fourth columns correspond to the compared algorithms, each reporting the root mean square error (RMSE) and standard deviation (STD); the STD effectively reflects the stability and robustness of the algorithm. In dynamic environments, the improvement in the localization accuracy of the proposed method relative to the other two algorithms is expressed as:
$$\delta = \frac{\alpha - \beta}{\alpha} \times 100\%,$$
where δ represents the improvement rate, α the error of the compared algorithm, and β the error of the proposed algorithm.
Table 1. Comparison of absolute trajectory errors in TUM datasets.
Based on the data shown in Table 1, one can observe that in the fr3_sitting_static data scenario, the RMSE and STD of DSOM improved by 37% and 24%, respectively, compared with those of DSO, whereas they did not improve significantly with respect to those of Dyna-SLAM. The RMSE and STD of DSOMF improved by 45% and 27%, respectively, compared with those of DSO, whereas they did not improve significantly with respect to those of Dyna-SLAM. In the fr3_walking_static data scenario, the RMSE and STD of DSOM improved by 88.5% and 65.4%, respectively, compared with those of DSO, and by 5.4% and 9.3%, respectively, compared with those of Dyna-SLAM. The RMSE and STD of DSOMF improved by 90.6% and 71.6%, respectively, compared with those of DSO, and by 21.6% and 25.5%, respectively, compared with those of Dyna-SLAM. In the fr3_walking_xyz data scenario, the RMSE and STD of DSOM improved by 90.8% and 87.8%, respectively, compared with those of DSO, and by 10.9% and 17.5%, respectively, compared with those of Dyna-SLAM. The RMSE and STD of DSOMF improved by 92% and 89.7%, respectively, compared with those of DSO, and by 20.8% and 24.6%, respectively, compared with those of Dyna-SLAM.
To further analyze the test results, Figure 8 shows a comparison of the ATEs among DSO, Dyna-SLAM, and the methods presented in this study, i.e., DSOM and DSOMF, across the three scenarios of fr3_sitting_static, fr3_walking_static, and fr3_walking_xyz. One can intuitively observe that as the dynamic nature of the scenes increases, the improvement in localization accuracy provided by our methods gradually intensifies. In environments with higher dynamics, the localization performance of our methods presents a clear enhancement compared with that of DSO. However, the improvement achieved by DSOMF relative to Dyna-SLAM was slightly weaker, which is attributed to the deteriorating image completion effect owing to the increased number of dynamic objects in the image.
Figure 8. Comparison of absolute trajectory error for the camera on the TUM dataset.
To further enhance the real-time performance of our method, we drew inspiration from RDS-SLAM [19] and incorporated the selection of semantic keyframes into our algorithm. Therefore, in addition to comparing the localization accuracy, we quantitatively compared the runtime overhead and GPU memory consumption of several methods on the TUM dataset, as listed in Table 2. As shown, the per-frame processing time of DSOM was lower than that of Dyna-SLAM. After the video completion thread was incorporated, the per-frame processing time of DSOMF was slightly longer than that of Dyna-SLAM.
Table 2. Execution time and GPU memory cost on the TUM dataset for several methods.
Additionally, Table 3 shows the time overhead of the main modules of the proposed method on the TUM dataset. Clearly, the semantic segmentation module was the most time-consuming. In this study, instance segmentation is executed as a separate, independent thread, thus allowing for the substitution with a less time-consuming instance segmentation algorithm (if necessary) to further enhance the real-time performance of the proposed method. Moreover, through efficient parallel operations, the incorporation of instance segmentation and video completion modules, in addition to DSO, results in frame rates and memory usage that are comparable to those of Dyna-SLAM.
Table 3. Execution time cost of our method’s main modules.

4.2. Simulation Testing on KITTI Dataset

The KITTI dataset is widely recognized and utilized in the computer vision field for testing algorithms in autonomous driving scenarios. It encompasses various dynamic outdoor scenarios, including urban, rural, and highway environments. The images containing the most dynamic objects include up to 15 vehicles and 30 pedestrians, indicating a high level of scene dynamism. Sequences 00–10 from the dataset were used to validate the DSOM and DSOMF algorithms proposed herein. The comparisons were performed against the DSO and Dyna-SLAM algorithms using the EVO evaluation tool.
To analyze the positioning performance of DSOMF in outdoor dynamic environments, we present a comparison of the absolute pose errors (APEs) between the DSOMF and DSO algorithms for dataset sequences 01, 02, 04, and 06 in Table 4, including the corresponding RMSE, mean, maximum, and minimum values. The data indicate that, for sequences 01 and 02, the absolute pose error of DSOMF is significantly lower than that of DSO, whereas for sequences 04 and 06 it is comparable to that of DSO. This is because sequences 01 and 02 contain many dynamic vehicles and pedestrians, whereas sequences 04 and 06 contain primarily stationary vehicles.
Table 4. Comparison of absolute pose errors in KITTI dataset.
To analyze the effect of the video completion module on the localization accuracy in outdoor environments, we present a comparison of the APEs among the DSOM, DSOMF, and Dyna-SLAM algorithms across dataset sequences 00–10 in Table 5. As shown in the table, in outdoor environments, both DSOM and DSOMF exhibited slight improvements in terms of pose-estimation accuracy across various dataset sequences compared with Dyna-SLAM. However, the inclusion of the video completion thread in DSOMF did not significantly enhance its performance compared with those of the first two methods. This is because in large outdoor scenes, dynamic objects typically occupy a relatively small area in images, and images completed thereafter fail to provide additional map information for backend optimization.
Table 5. Comparison of absolute pose errors of same-type algorithms.
Figure 9 shows the process of dynamic object removal using a video completion thread in the KITTI-04 dataset. In this figure, a white vehicle moving at a constant speed in the same direction is identified as a static object, whereas a black vehicle moving in the opposite direction is recognized as a dynamic object and thus removed and completed. The map-construction effect, as presented in Figure 10, shows that DSOMF excluded the interference of the black vehicle during map construction, thus resulting in a complete, linear road point cloud map.
Figure 9. Video completion process in KITTI-04 dataset.
Figure 10. Map construction effect of DSOMF in the KITTI-04 dataset.

5. Unmanned Platform Experiment

5.1. Drone Experiment

Given the absence of distinct loops in the datasets utilized above, to verify the effectiveness of the loop closure detection module, we conducted real-world loop closure detection experiments using the quadcopter drone depicted in Figure 11. The drone is equipped with an Intel Realsense D435 image sensor (Santa Clara, CA, USA) and a micro-computer developed on the Jetson AGX Orin platform (2048-core NVIDIA Ampere architecture GPU (Santa Clara, CA, USA) featuring 64 Tensor Cores, 12-core ARM Cortex-A78AE v8.2 64-bit CPU, PVA v2 visual processing accelerator, 64 GB of 256-bit LPDDR5 memory).
Figure 11. Unmanned flight platform SLAM algorithm test system.
In this experiment, we operated the drone to circle around the fixed-wing aircraft shown in Figure 12. By observing whether the semi-dense point clouds constructed at the starting and ending points overlap, we assessed the effectiveness of the loop closure detection module of our algorithm in a real-world setting. Figure 13a and Figure 13b, respectively, represent the semi-dense point cloud maps of the fixed-wing aircraft with and without loop closure detection. The red boxes indicate the semi-dense point cloud maps at the starting and ending points of the drone’s flight. As shown in the red box in Figure 13a, with loop closure detection enabled, the drone successfully triggered loop closure at the starting and ending points, correcting the cumulative trajectory error and allowing the point cloud maps of these points to overlap effectively. In contrast, as illustrated in the red box in Figure 13b, without loop closure detection, the drone could not trigger loop closure at the endpoint, resulting in the inability to eliminate the cumulative trajectory error, which prevented the point cloud maps at the starting and ending points from overlapping. This experiment validated the effectiveness of the loop closure detection module in a real-world setting on the drone platform.
Figure 12. Top view of fixed wing aircraft.
Figure 13. Loop closure detection experiment. (a) With the loop closure detection module running; (b) without the loop closure detection module running.

5.2. Driverless Car Experiment

To further verify the effectiveness of the algorithm presented in this paper in handling dynamic objects, we conducted a localization and mapping experiment in an outdoor environment using a ground unmanned platform, as shown in Figure 14 below. To ensure the safety of the experiment, the unmanned platform was placed on the left side of the road, moving forward close to the edge of the road. The image sensor was oriented towards the center of the road to capture dynamic objects on the road. This experiment utilized the monocular mode of the Intel Realsense D435i image sensor mounted on the robot to collect environmental information, together with the same model of microcomputer as used on the drone. Due to the limited computing power of the onboard computing platform, the unmanned vehicle sent the collected image information to a high-performance laptop computer for instance segmentation and video completion.
Figure 14. SLAM algorithm test system for unmanned ground platform.
We tested the algorithm proposed in this study by deploying the unmanned vehicle to navigate one loop around the site, running both DSO and DSOMF. The trajectories produced by the two algorithms are compared in Figure 15 below. In the site schematic, the green trajectory includes a section where a pedestrian is walking slowly (as shown in Figure 16a), while the white trajectory incorporates a segment with an electric truck moving at a higher speed (as shown in Figure 16b). A comparative observation of the trajectories generated by the two algorithms reveals that the algorithm proposed in this study demonstrates a notable improvement in positioning accuracy in both the white trajectory segments and the sections where loops occur. The average processing time per frame of the proposed algorithm during the experiment was 49.6 ms. Although the dynamic settings of this experiment were somewhat simplified due to hardware limitations, the experiment nonetheless validated, to a certain extent, the feasibility of the algorithm presented in this study in real-world scenarios.
Figure 15. Comparison of outdoor dynamic environment trajectories (In the real-life scenario, the outlined boxes represent the trajectories of dynamic objects. Route one denotes the path of vehicles, while route two signifies pedestrian pathways).
Figure 16. Environment image. (a) Pedestrian environment image; (b) electric vehicle environment image.

6. Conclusions

This study addressed the need for high-precision navigation and positioning of unmanned platforms in dynamic environments by introducing instance segmentation and video completion into the DSO algorithm framework, proposing a machine learning-based SLAM algorithm for dynamic environments. Initially, the algorithm utilizes instance segmentation to divide the scene into objects with the potential for motion. Subsequently, it combines semantic and geometric information to identify and eliminate moving objects. After the removal of dynamic objects, a video inpainting algorithm fills in the static background obscured by these moving entities, enhancing the algorithm’s loop closure detection and improving pose alignment accuracy in dynamic environments. Finally, the methodology is validated using the TUM dataset, the KITTI dataset, and real-world scenarios. The results demonstrate that in dynamic settings, the positioning accuracy of the proposed algorithm significantly surpasses that of DSO. In future research, we aim to enhance the real-time performance of the proposed algorithm by improving keyframe selection strategies and data association methods, thereby reducing the computational resource consumption of the semantic segmentation and optical flow completion algorithms. Additionally, dynamic objects in the environment hold significant value, and we plan to incorporate them into map construction in subsequent studies.

Author Contributions

S.Y. conceived the idea; S.Y. designed the software; S.Y., X.Z. and Z.W. collected the test data of the real environment; S.Y. and X.Z. collected the related resources and supervised the experiment; X.Z. and Z.W. provided comments on the paper and the experiment. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (41971306).

Institutional Review Board Statement

This study does not involve research on human or animal subjects.

Data Availability Statement

The COCO dataset is obtained from http://images.cocodataset.org/zips/train2014.zip (accessed on 6 August 2023). TUM dataset is obtained from https://vision.in.tum.de/data/datasets/rgbd-dataset/download (accessed on 7 August 2023). KITTI dataset is obtained from http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=2d (accessed on 9 August 2023). Our real-world test data are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tourani, A.; Bavle, H.; Sanchez-Lopez, J.L.; Voos, H. Visual SLAM: What Are the Current Trends and What to Expect? Sensors 2022, 22, 9297. [Google Scholar] [CrossRef]
  2. Zhu, Z.; Peng, S.; Larsson, V.; Xu, W.; Bao, H.; Cui, Z.; Oswald, M.R.; Pollefeys, M. NICE-SLAM: Neural Implicit Scalable Encoding for SLAM. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12776–12786. [Google Scholar]
  3. Wang, H.; Wang, J.; Agapito, L. Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13293–13302. [Google Scholar]
  4. Mur-Artal, R.; Tardos, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  5. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  6. Engel, J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 611–625. [Google Scholar] [CrossRef]
  7. El Ghazouali, S.; Mhirit, Y.; Oukhrid, A.; Michelucci, U.; Nouira, H. FusionVision: A Comprehensive Approach of 3D Object Reconstruction and Segmentation from RGB-D Cameras Using YOLO and Fast Segment Anything. Sensors 2024, 24, 2889. [Google Scholar] [CrossRef]
  8. Yugay, V.; Li, Y.; Gevers, T.; Oswald, M.R. Gaussian-SLAM: Photo-Realistic Dense SLAM with Gaussian Splatting. arXiv 2023, arXiv:2312.10070. [Google Scholar] [CrossRef]
  9. Bescos, B.; Facil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef]
  10. Cheng, J.; Wang, Z.; Zhou, H.; Li, L.; Yao, J. DM-SLAM: A Feature-Based SLAM System for Rigid Dynamic Scenes. ISPRS Int. J. Geo-Inf. 2020, 9, 202. [Google Scholar] [CrossRef]
  11. Zhong, F.; Wang, S.; Zhang, Z.; Chen, C.; Wang, Y. Detect-SLAM: Making Object Detection and SLAM Mutually Beneficial. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1001–1010. [Google Scholar]
  12. Sun, Y.; Liu, M.; Meng, M.Q.-H. Improving RGB-D SLAM in Dynamic Environments: A Motion Removal Approach. Robot. Auton. Syst. 2017, 89, 110–122. [Google Scholar] [CrossRef]
  13. Li, S.; Lee, D. RGB-D SLAM in Dynamic Environments Using Static Point Weighting. IEEE Robot. Autom. Lett. 2017, 2, 2263–2270. [Google Scholar] [CrossRef]
  14. Wang, Y.; Huang, S. Towards Dense Moving Object Segmentation Based Robust Dense RGB-D SLAM in Dynamic Scenarios. In Proceedings of the 2014 13th International Conference on Control Automation Robotics & Vision (ICARCV), Singapore, 10–12 December 2014; pp. 1841–1846. [Google Scholar]
  15. Fischler, M.A.; Bolles, R.C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  16. Tan, W.; Dong, Z.; Zhang, G.; Bao, H. Robust Monocular SLAM in Dynamic Environments. In Proceedings of the 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Adelaide, Australia, 1–4 October 2013; pp. 209–218. [Google Scholar]
  17. Ferrera, M.; Moras, J.; Trouvé-Peloux, P.; Creuze, V. Real-Time Monocular Visual Odometry for Turbid and Dynamic Underwater Environments. arXiv 2018, arXiv:1806.05842. [Google Scholar] [CrossRef]
  18. Yu, C.; Liu, Z.; Liu, X.-J.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1168–1174. [Google Scholar]
  19. Liu, Y.; Miura, J. RDS-SLAM: Real-Time Dynamic SLAM Using Semantic Segmentation Methods. IEEE Access 2021, 9, 23772–23785. [Google Scholar] [CrossRef]
  20. Alcantarilla, P.F.; Yebes, J.J.; Almazan, J.; Bergasa, L.M. On Combining Visual SLAM and Dense Scene Flow to Increase the Robustness of Localization and Mapping in Dynamic Environments. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation, St. Paul, MN, USA, 14–18 May 2012; pp. 1290–1297. [Google Scholar]
  21. Wang, C.-C.; Thorpe, C.; Thrun, S.; Hebert, M.; Durrant-Whyte, H. Simultaneous Localization, Mapping and Moving Object Tracking. Int. J. Robot. Res. 2007, 26, 889–916. [Google Scholar] [CrossRef]
  22. Reddy, N.D.; Singhal, P.; Chari, V.; Krishna, K.M. Dynamic Body VSLAM with Semantic Constraints. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–3 October 2015; pp. 1897–1904. [Google Scholar]
  23. Salas-Moreno, R.F.; Newcombe, R.A.; Strasdat, H.; Kelly, P.H.J.; Davison, A.J. SLAM++: Simultaneous Localisation and Mapping at the Level of Objects. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 1352–1359. [Google Scholar]
  24. Tateno, K.; Tombari, F.; Navab, N. When 2.5D Is Not Enough: Simultaneous Reconstruction, Segmentation and Recognition on Dense SLAM. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 2295–2302. [Google Scholar]
  25. Sucar, E.; Wada, K.; Davison, A. NodeSLAM: Neural Object Descriptors for Multi-View Shape Reconstruction. In Proceedings of the 2020 International Conference on 3D Vision (3DV), Fukuoka, Japan, 25–28 November 2020; pp. 949–958. [Google Scholar]
  26. Hosseinzadeh, M.; Li, K.; Latif, Y.; Reid, I. Real-Time Monocular Object-Model Aware Sparse SLAM. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 7123–7129. [Google Scholar]
  27. Nicholson, L.; Milford, M.; Sunderhauf, N. QuadricSLAM: Dual Quadrics From Object Detections as Landmarks in Object-Oriented SLAM. IEEE Robot. Autom. Lett. 2019, 4, 1–8. [Google Scholar] [CrossRef]
  28. Bescos, B.; Campos, C.; Tardós, J.D.; Neira, J. DynaSLAM II: Tightly-Coupled Multi-Object Tracking and SLAM. arXiv 2020, arXiv:2010.07820. [Google Scholar] [CrossRef]
  29. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  30. Kluger, F.; Rosenhahn, B. PARSAC: Accelerating Robust Multi-Model Fitting with Parallel Sample Consensus. Proc. AAAI Conf. Artif. Intell. 2024, 38, 2804–2812. [Google Scholar] [CrossRef]
  31. Gao, C.; Saraf, A.; Huang, J.-B.; Kopf, J. Flow-Edge Guided Video Completion. arXiv 2020, arXiv:2009.01835. [Google Scholar] [CrossRef]
  32. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312. [Google Scholar] [CrossRef]
  33. Runz, M.; Buffier, M.; Agapito, L. MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects. In Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Munich, Germany, 16–20 October 2018; pp. 10–20. [Google Scholar]
  34. Pan, X.; Liu, H.; Fang, M.; Wang, Z.; Zhang, Y.; Zhang, G. Dynamic 3D Scenario-Oriented Monocular SLAM Based on Semantic Probability Prediction. J. Image Graph. 2023, 28, 2151–2166. [Google Scholar] [CrossRef]
  35. Xu, R.; Li, X.; Zhou, B.; Loy, C.C. Deep Flow-Guided Video Inpainting. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3718–3727. [Google Scholar]
  36. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A Benchmark for the Evaluation of RGB-D SLAM Systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 573–580. [Google Scholar]
  37. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision Meets Robotics: The KITTI Dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
