YOD-SLAM: An Indoor Dynamic VSLAM Algorithm Based on the YOLOv8 Model and Depth Information
Abstract
1. Introduction
- To improve the mask’s coverage of a priori dynamic objects, a method is proposed that refines the original a priori dynamic object mask with a depth filtering mechanism. The method first limits the effective depth range of the camera, then uses the mask’s average depth to filter out mask pixels that do not belong to the a priori dynamic object, and finally applies a dilation operation to the result (a code sketch of this step is given after this list).
- To better cover a priori dynamic objects with small pixel areas, a method is proposed that detects missed a priori dynamic objects from the average depth and center point of the mask and redraws the mask after a missed detection. A constant-velocity motion model first predicts the average depth and center point of the a priori dynamic object mask in the current frame; these two quantities are then used to decide whether a detection has been missed; if so, pixels near the predicted center point whose depth matches the expected range are collected to redraw the mask (see the second sketch after this list).
- To avoid the adverse effects of non-prior dynamic objects on the system, a method is proposed that identifies non-prior dynamic objects from mask edge and depth information. It uses the image distance and depth difference between the mask edges of dynamic objects and a priori static objects to decide whether an a priori static object has been interacted with. If an interaction occurs, the object is treated as a non-prior dynamic object and is removed in subsequent operations (see the third sketch after this list).
- We compared the performance of our system with ORB-SLAM3, DS-SLAM, and DynaSLAM on the widely used Technical University of Munich (TUM) RGB-D dataset [16] and evaluated it in a real indoor environment. The experimental results show that the system achieves the best camera positioning accuracy among the four systems on the three high-dynamic sequences fr3/w/half, fr3/w/rpy, and fr3/w/xyz. On low-dynamic and static sequences, its accuracy is comparable to, or slightly better than, that of the other SLAM systems. In the mask processing experiments, YOD-SLAM effectively covers all dynamic objects in the image, and it builds clear and accurate dense point cloud maps in both the dense point cloud experiments and the real-world experiments. We conclude that YOD-SLAM can perform visual SLAM tasks in indoor dynamic environments.
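To make the first contribution concrete, the following minimal sketch (Python with NumPy and OpenCV) shows one way such a depth filtering step could be implemented; the thresholds `max_depth` and `depth_tol` and the kernel size are illustrative assumptions, not values from the paper.

```python
import numpy as np
import cv2

def refine_dynamic_mask(mask, depth, max_depth=6.0, depth_tol=0.4, kernel_size=5):
    """Illustrative refinement of an a priori dynamic-object mask using depth.

    mask  : binary uint8 mask from the segmentation model (H x W, 0/255)
    depth : per-pixel depth in metres (H x W, float32); 0 means invalid
    """
    # 1. Limit the effective depth of the camera: ignore invalid or far pixels.
    valid = (depth > 0) & (depth < max_depth)

    # 2. Average depth of the valid masked pixels.
    masked = (mask > 0) & valid
    if not masked.any():
        return np.zeros_like(mask)
    mean_d = depth[masked].mean()

    # 3. Keep only mask pixels whose depth is close to the mask's average depth,
    #    discarding background that leaked into the original mask.
    keep = masked & (np.abs(depth - mean_d) < depth_tol)
    refined = np.where(keep, 255, 0).astype(np.uint8)

    # 4. Dilate so the mask fully covers the object's boundary.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    return cv2.dilate(refined, kernel)
```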
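The second contribution can likewise be sketched as below, under the assumption of a simple constant-velocity state per tracked object; the matching thresholds `center_tol` and `depth_tol` and the search `radius` are illustrative, not the paper's parameters.

```python
import numpy as np

def predict_constant_velocity(prev_center, prev_velocity, prev_depth, prev_depth_rate):
    """Constant-velocity extrapolation of a mask's centre (pixels) and average depth (m).

    prev_center, prev_velocity : np.array([x, y]) from the previous frames
    prev_depth, prev_depth_rate: average depth and its per-frame change
    """
    pred_center = prev_center + prev_velocity
    pred_depth = prev_depth + prev_depth_rate
    return pred_center, pred_depth

def detect_missed_and_redraw(pred_center, pred_depth, current_masks, depth,
                             center_tol=40.0, depth_tol=0.3, radius=60):
    """Return a redrawn mask if no current mask matches the prediction (a sketch)."""
    for m in current_masks:                              # list of binary H x W masks
        ys, xs = np.nonzero(m)
        if len(xs) == 0:
            continue
        center = np.array([xs.mean(), ys.mean()])
        mean_d = depth[ys, xs].mean()
        if (np.linalg.norm(center - pred_center) < center_tol
                and abs(mean_d - pred_depth) < depth_tol):
            return None                                  # object was detected; nothing to redraw

    # Missed detection: redraw by collecting pixels near the predicted centre
    # whose depth falls inside the expected depth window.
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    near = (xs - pred_center[0]) ** 2 + (ys - pred_center[1]) ** 2 < radius ** 2
    in_range = np.abs(depth - pred_depth) < depth_tol
    return ((near & in_range) * 255).astype(np.uint8)
```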
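For the third contribution, a minimal sketch of the edge-contact test is given below; the edge-distance and depth-difference thresholds are assumptions chosen only for illustration.

```python
import numpy as np
import cv2

def is_non_prior_dynamic(static_mask, dynamic_mask, depth,
                         edge_dist_thresh=10.0, depth_diff_thresh=0.25):
    """Flag an a priori static object as non-prior dynamic if its mask edge is close
    to a dynamic object's mask edge both in the image and in depth (a sketch).

    static_mask, dynamic_mask : uint8 masks (H x W, 0/255)
    depth                     : per-pixel depth in metres (H x W, float32)
    """
    # Extract one-pixel-wide mask edges with a morphological gradient.
    kernel = np.ones((3, 3), np.uint8)
    static_edge = cv2.morphologyEx(static_mask, cv2.MORPH_GRADIENT, kernel) > 0
    dynamic_edge = cv2.morphologyEx(dynamic_mask, cv2.MORPH_GRADIENT, kernel) > 0

    if not static_edge.any() or not dynamic_edge.any():
        return False

    # Distance (in pixels) from every pixel to the nearest dynamic-edge pixel.
    dist_to_dynamic = cv2.distanceTransform(
        (~dynamic_edge).astype(np.uint8), cv2.DIST_L2, 5)
    close = static_edge & (dist_to_dynamic < edge_dist_thresh)
    if not close.any():
        return False

    # Require the touching regions to lie at a similar depth as well.
    depth_diff = abs(depth[close].mean() - depth[dynamic_edge].mean())
    return depth_diff < depth_diff_thresh
```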
2. Related Works
3. System Description
3.1. Overview of YOD-SLAM
3.2. Obtaining Semantic Information from YOLOv8
3.3. Using a Depth Filtering Mechanism to Modify the Mask
3.4. Missed-Detection Judgment and Redrawing Using Depth Information
3.5. Using Mask Edge and Depth Information to Determine Non-Prior Dynamic Objects
4. Experimental Results
4.1. Positioning Accuracy Experiments
4.2. Mask Processing Experiments
4.3. Evaluation of Dense Point Cloud Maps
4.4. Evaluation in the Real-World Environment
5. Conclusions and Future Work
- This paper presents YOD-SLAM, a system built on ORB-SLAM3. Different from previous studies, the main idea of YOD-SLAM is to incorporate depth information into the refinement of a priori dynamic object masks and the judgment of non-prior dynamic objects, especially during mask processing, so as to improve the accuracy of pose estimation and reduce artifacts in dense point cloud maps. In YOD-SLAM, the YOLOv8s model is first used to obtain the original mask. Second, a depth filtering mechanism modifies the a priori dynamic object mask so that it covers the a priori dynamic objects as fully as possible while minimizing coverage of the static environment. Third, the mask’s average depth and center point are used to determine whether an a priori dynamic object has been missed; in the case of a missed detection, the mask is redrawn from these two quantities, reducing the impact of a priori dynamic objects with small pixel areas that the segmentation model fails to detect. Fourth, mask edge and depth information are used to exclude a priori static objects that are in a dynamic state, i.e., non-prior dynamic objects. Finally, ORB feature points are extracted only in the static region, yielding higher positioning accuracy and a more accurate dense point cloud map.
- To test the performance of YOD-SLAM, we compared it with ORB-SLAM3, DS-SLAM, and DynaSLAM on the TUM RGB-D dataset and conducted experiments in a real indoor environment. In the positioning accuracy experiments, YOD-SLAM performed best among the four systems in the high-dynamic scenes, with ATE and RPE improved by roughly 90% relative to ORB-SLAM3. Our improvements also benefit SLAM tasks in low-dynamic scenarios and have no negative impact in static scenarios, where accuracy remains at the same level as ORB-SLAM3. In the mask processing experiments, YOD-SLAM handled the dynamic object masks better and achieved high accuracy. In the dense point cloud map experiments, YOD-SLAM obtained a more precise dense point cloud map, significantly reducing the ghosting left by people as well as by small distant objects. YOD-SLAM is not only suitable for the TUM RGB-D dataset but also performs well in the laboratory’s real indoor environment. In summary, YOD-SLAM is competent for SLAM tasks in indoor dynamic environments, producing accurate pose estimation and precise dense point cloud maps.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Theodorou, C.; Velisavljevic, V.; Dyo, V.; Nonyelu, F. Visual SLAM Algorithms and Their Application for AR, Mapping, Localization and Wayfinding. Array 2022, 15, 100222.
- Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast Semi-Direct Monocular Visual Odometry. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; IEEE: Piscataway, NJ, USA, 2014.
- Klein, G.; Murray, D. Parallel Tracking and Mapping for Small AR Workspaces. In Proceedings of the 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007; IEEE: Piscataway, NJ, USA, 2007.
- Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense Tracking and Mapping in Real-Time. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011.
- Campos, C.; Elvira, R.; Rodriguez, J.J.G.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890.
- Badrinarayanan, V.; Handa, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling. arXiv 2015, arXiv:1505.07293.
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.
- Terven, J.; Córdova-Esparza, D.-M.; Romero-González, J.-A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Springer International Publishing: New York, NY, USA, 2015.
- Dvornik, N.; Shmelkov, K.; Mairal, J.; Schmid, C. BlitzNet: A Real-Time Deep Network for Scene Understanding. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
- Tateno, K.; Tombari, F.; Laina, I.; Navab, N. CNN-SLAM: Real-Time Dense Monocular SLAM with Learned Depth Prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017.
- Zhu, Z.; Peng, S.; Larsson, V.; Xu, W.; Bao, H.; Cui, Z.; Oswald, M.R.; Pollefeys, M. NICE-SLAM: Neural Implicit Scalable Encoding for SLAM. arXiv 2021, arXiv:2112.12130.
- Yu, C.; Liu, Z.; Liu, X.; Xie, F.; Yang, Y.; Wei, Q.; Qiao, F. DS-SLAM: A Semantic Visual SLAM Towards Dynamic Environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1168–1174.
- Bescós, B.; Fácil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083.
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO. Available online: https://github.com/ultralytics/ultralytics (accessed on 30 July 2024).
- Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A Benchmark for the Evaluation of RGB-D SLAM Systems. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Algarve, Portugal, 7–12 October 2012; IEEE: Piscataway, NJ, USA, 2012.
- Islam, Q.U.; Ibrahim, H.; Chin, P.K.; Lim, K.; Abdullah, M.Z.; Khozaei, F. ARD-SLAM: Accurate and Robust Dynamic SLAM Using Dynamic Object Identification and Improved Multi-View Geometrical Approaches. Displays 2024, 82, 102654.
- Cheng, J.; Wang, C.; Meng, M.Q.H. Robust Visual Localization in Dynamic Environments Based on Sparse Motion Removal. IEEE Trans. Autom. Sci. Eng. 2020, 17, 658–669.
- Jeon, H.; Han, C.; You, D.; Oh, J. RGB-D Visual SLAM Algorithm Using Scene Flow and Conditional Random Field in Dynamic Environments. In Proceedings of the 22nd International Conference on Control, Automation and Systems (ICCAS), Busan, Republic of Korea, 27 November–1 December 2022; IEEE: Piscataway, NJ, USA, 2022.
- Zhong, M.; Hong, C.; Jia, Z.; Wang, C.; Wang, Z. DynaTM-SLAM: Fast Filtering of Dynamic Feature Points and Object-Based Localization in Dynamic Indoor Environments. Robot. Auton. Syst. 2024, 174, 104634.
- Yang, L.; Cai, H. Enhanced Visual SLAM for Construction Robots by Efficient Integration of Dynamic Object Segmentation and Scene Semantics. Adv. Eng. Inform. 2024, 59, 102313.
- Wang, C.; Zhang, Y.; Li, X. PMDS-SLAM: Probability Mesh Enhanced Semantic SLAM in Dynamic Environments. In Proceedings of the 5th International Conference on Control, Robotics and Cybernetics (CRC), Wuhan, China, 16–18 October 2020; IEEE: Piscataway, NJ, USA, 2020.
- Wei, B.; Zhao, L.; Li, L.; Li, X. Research on RGB-D Visual SLAM Algorithm Based on Adaptive Target Detection. In Proceedings of the IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Zhengzhou, China, 14–17 November 2023; IEEE: Piscataway, NJ, USA, 2023.
- Zhang, J.; Yuan, L.; Ran, T.; Peng, S.; Tao, Q.; Xiao, W.; Cui, J. A Dynamic Detection and Data Association Method Based on Probabilistic Models for Visual SLAM. Displays 2024, 82, 102663.
- Yang, L.; Wang, L. A Semantic SLAM-Based Dense Mapping Approach for Large-Scale Dynamic Outdoor Environment. Measurement 2022, 204, 112001.
- Gou, R.; Chen, G.; Yan, C.; Pu, X.; Wu, Y.; Tang, Y. Three-Dimensional Dynamic Uncertainty Semantic SLAM Method for a Production Workshop. Eng. Appl. Artif. Intell. 2022, 116, 105325.
- Cai, L.; Ye, Y.; Gao, X.; Li, Z.; Zhang, C. An Improved Visual SLAM Based on Affine Transformation for ORB Feature Extraction. Optik 2021, 227, 165421.
- Li, A.; Wang, J.; Xu, M.; Chen, Z. DP-SLAM: A Visual SLAM with Moving Probability Towards Dynamic Environments. Inf. Sci. 2021, 556, 128–142.
- Ai, Y.; Rui, T.; Yang, X.-Q.; He, J.-L.; Fu, L.; Li, J.-B.; Lu, M. Visual SLAM in Dynamic Environments Based on Object Detection. Def. Technol. 2021, 17, 1712–1721.
- Ran, T.; Yuan, L.; Zhang, J.; Tang, D.; He, L. RS-SLAM: A Robust Semantic SLAM in Dynamic Environments Based on RGB-D Sensor. IEEE Sens. J. 2021, 21, 20657–20664.
- Li, X.; Guan, S. SIG-SLAM: Semantic Information-Guided Real-Time SLAM for Dynamic Scenes. In Proceedings of the 35th Chinese Control and Decision Conference (CCDC), Yichang, China, 20–22 May 2023; IEEE: Piscataway, NJ, USA, 2023.
- Qian, R.; Guo, H.; Chen, M.; Gong, G.; Cheng, H. A Visual SLAM Algorithm Based on Instance Segmentation and Background Inpainting in Dynamic Scenes. In Proceedings of the 38th Youth Academic Annual Conference of Chinese Association of Automation (YAC), Hefei, China, 27–29 August 2023; IEEE: Piscataway, NJ, USA, 2023.
- Li, Y.; Wang, Y.; Lu, L.; Guo, Y.; An, Q. Semantic Visual SLAM Algorithm Based on Improved DeepLabv3+ Model and LK Optical Flow. Appl. Sci. 2024, 14, 5792.
- Cong, P.; Liu, J.; Li, J.; Xiao, Y.; Chen, X.; Feng, X.; Zhang, X. YDD-SLAM: Indoor Dynamic Visual SLAM Fusing YOLOv5 with Depth Information. Sensors 2023, 23, 9592.
- Cong, P.; Li, J.; Liu, J.; Xiao, Y.; Zhang, X. SEG-SLAM: Dynamic Indoor RGB-D Visual SLAM Integrating Geometric and YOLOv5-Based Semantic Information. Sensors 2024, 24, 2102.
- Intel RealSense. Intel RealSense Depth Camera D455. Available online: https://store.intelrealsense.com/buy-intel-realsense-depth-camera-d455.html (accessed on 30 July 2024).
Sequences | ORB-SLAM3 RMSE | ORB-SLAM3 S.D. | DS-SLAM RMSE | DS-SLAM S.D. | DynaSLAM RMSE | DynaSLAM S.D. | YOD-SLAM RMSE | YOD-SLAM S.D. | Improvements RMSE | Improvements S.D.
---|---|---|---|---|---|---|---|---|---|---
fr2/rpy | 0.009843 | 0.004591 | 0.009662 | 0.004521 | 0.010335 | 0.005053 | 0.009918 | 0.004587 | −0.76% | 0.09% |
fr2/desk/p | 0.073530 | 0.016453 | 0.072904 | 0.018297 | 0.073061 | 0.015868 | 0.070515 | 0.017356 | 4.10% | −5.49% |
fr3/s/half | 0.022589 | 0.013636 | 0.014042 | 0.006401 | 0.017184 | 0.007823 | 0.018360 | 0.008658 | 18.72% | 36.51% |
fr3/s/static | 0.009239 | 0.004286 | 0.006685 | 0.003335 | 0.008844 | 0.004220 | 0.006904 | 0.003635 | 25.27% | 15.19% |
fr3/w/half | 0.432602 | 0.128498 | 0.027896 | 0.014280 | 0.043314 | 0.026787 | 0.027876 | 0.013535 | 93.56% | 89.47% |
fr3/w/rpy | 0.809910 | 0.438483 | 0.370257 | 0.179338 | 0.112192 | 0.083041 | 0.039225 | 0.021384 | 95.16% | 95.12% |
fr3/w/static | 0.072855 | 0.063654 | 0.006829 | 0.003080 | 0.050325 | 0.046775 | 0.008974 | 0.004196 | 87.68% | 93.41% |
fr3/w/xyz | 0.578596 | 0.384372 | 0.027407 | 0.017815 | 0.454917 | 0.242555 | 0.013828 | 0.006929 | 97.61% | 98.20% |
Sequences | ORB-SLAM3 RMSE | ORB-SLAM3 S.D. | DS-SLAM RMSE | DS-SLAM S.D. | DynaSLAM RMSE | DynaSLAM S.D. | YOD-SLAM RMSE | YOD-SLAM S.D. | Improvements RMSE | Improvements S.D.
---|---|---|---|---|---|---|---|---|---|---
fr2/rpy | 0.004275 | 0.001831 | 0.003938 | 0.001708 | 0.004416 | 0.001968 | 0.004512 | 0.001945 | −5.54% | −6.23% |
fr2/desk/p | 0.010366 | 0.005163 | 0.009930 | 0.005042 | 0.010651 | 0.005805 | 0.009989 | 0.005086 | 3.64% | 1.49% |
fr3/s/half | 0.024164 | 0.017052 | 0.017709 | 0.007784 | 0.016475 | 0.007843 | 0.021756 | 0.010244 | 9.97% | 39.92% |
fr3/s/static | 0.010594 | 0.005205 | 0.007431 | 0.003682 | 0.010001 | 0.004664 | 0.008761 | 0.004625 | 17.30% | 11.14% |
fr3/w/half | 0.222896 | 0.181602 | 0.029448 | 0.015015 | 0.049016 | 0.031655 | 0.026020 | 0.012465 | 88.33% | 93.14% |
fr3/w/rpy | 0.366246 | 0.262886 | 0.139005 | 0.107797 | 0.140947 | 0.108961 | 0.047089 | 0.026256 | 87.14% | 90.01% |
fr3/w/static | 0.106157 | 0.094985 | 0.008839 | 0.004344 | 0.078228 | 0.072836 | 0.011144 | 0.005235 | 89.50% | 94.49% |
fr3/w/xyz | 0.263517 | 0.223604 | 0.035181 | 0.025513 | 0.289587 | 0.257472 | 0.018448 | 0.008892 | 93.00% | 96.02% |
Sequences | ORB-SLAM3 RMSE | ORB-SLAM3 S.D. | DS-SLAM RMSE | DS-SLAM S.D. | DynaSLAM RMSE | DynaSLAM S.D. | YOD-SLAM RMSE | YOD-SLAM S.D. | Improvements RMSE | Improvements S.D.
---|---|---|---|---|---|---|---|---|---|---
fr2/rpy | 0.323221 | 0.157730 | 0.321130 | 0.156694 | 0.328782 | 0.159128 | 0.327383 | 0.155683 | −1.29% | 1.30% |
fr2/desk/p | 0.429150 | 0.236126 | 0.420757 | 0.231161 | 0.434628 | 0.247476 | 0.426132 | 0.232241 | 0.70% | 1.65% |
fr3/s/half | 0.597328 | 0.274515 | 0.622036 | 0.298539 | 0.602074 | 0.293220 | 0.678658 | 0.328636 | −13.62% | −19.72% |
fr3/s/static | 0.298310 | 0.124228 | 0.274786 | 0.121917 | 0.296515 | 0.123293 | 0.282690 | 0.128440 | 5.24% | −3.39% |
fr3/w/half | 4.422840 | 3.587926 | 0.822474 | 0.407067 | 0.865281 | 0.422776 | 0.784272 | 0.384858 | 82.27% | 89.27% |
fr3/w/rpy | 7.068179 | 5.018182 | 2.791948 | 2.125861 | 2.310514 | 1.710229 | 1.201026 | 0.735262 | 83.01% | 85.35% |
fr3/w/static | 1.875602 | 1.646718 | 0.247073 | 0.105284 | 1.353593 | 1.234624 | 0.285816 | 0.130954 | 84.76% | 92.05% |
fr3/w/xyz | 4.889436 | 4.102002 | 0.851439 | 0.607327 | 5.487427 | 4.903747 | 0.602239 | 0.371467 | 87.68% | 90.94% |
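The Improvements columns in the three tables above are consistent with the relative error reduction of YOD-SLAM with respect to ORB-SLAM3; for example, for fr3/w/half in the first table, (0.432602 - 0.027876) / 0.432602 ≈ 93.56%. A minimal check of this interpretation in Python:

```python
def improvement(baseline_rmse: float, ours_rmse: float) -> float:
    """Relative error reduction (percent) with respect to the baseline system."""
    return (baseline_rmse - ours_rmse) / baseline_rmse * 100.0

# fr3/w/half, first table: ORB-SLAM3 vs. YOD-SLAM
print(f"{improvement(0.432602, 0.027876):.2f}%")  # -> 93.56%
```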
Indicators | Parameters |
---|---|
Image Sensor Technology | Global Shutter |
Depth FOV (H × V) | 87° × 58° |
Depth Resolution | Up to 1280 × 720 |
Depth Accuracy | <2% at 4 m |
Depth Frame Rate | Up to 90 fps |
Depth Filter | All-Pass/IR-Pass |
RGB Sensor Technology | Global Shutter |
RGB Resolution and Frame Rate | 1280 × 800 at 30 fps |
RGB Sensor FOV (H × V) | 90° × 65° |
Ideal Range | 0.6 m to 6 m |
Interface | USB 3 |