YSAG-VINS—A Robust Visual-Inertial Navigation System with Adaptive Geometric Constraints and Semantic Information Based on YOLOv8n-ODUIB in Dynamic Environments
Abstract
1. Introduction
2. Related Work
2.1. Methods Based on Geometric Information
2.2. Methods Based on Semantic Information
3. System Overview
3.1. System Architecture of the Proposed YSAG-VINS Algorithm
3.2. Object Detection Optimization Based on the Enhanced YOLOv8
3.3. Dynamic Feature Rejection Strategy Based on an Epipolar Constraint with an Adaptive Threshold
```
Algorithm 1: Dynamic feature rejection strategy

Input:  previous frame's feature points P1, current frame's feature points P2,
        bounding box (x1, y1, x2, y2)
Output: the set of static feature points in the current frame, S

F ← FindFundamentalMat(P2, P1, seven-point method based on RANSAC)
for each matched pair (p1, p2) in (P1, P2) do
    if a bounding box exists then
        if (xp2 < x1 || xp2 > x2 || yp2 < y1 || yp2 > y2) then    ▷ p2 outside the box
            append p2 to S
            D_Si ← CalEpipolarLineDistance(p1, p2, F)
            ▷ traverse the N points in S to obtain the adaptive threshold ε = Σ D_Si / N
        else                                                      ▷ p2 inside the box
            D_A ← CalEpipolarLineDistance(p1, p2, F)
            if D_A < ε then
                append p2 to S
            end if
        end if
    else
        append p2 to S
    end if
end for
```
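To make the control flow concrete, the following is a minimal Python/OpenCV sketch of Algorithm 1 for a single detection box. The helper names (`epipolar_distance`, `reject_dynamic_features`) and the fallback threshold are illustrative assumptions, and `cv2.FM_RANSAC` stands in for the seven-point-based RANSAC described above; this is a sketch, not the authors' VINS-Fusion-based implementation.

```python
import numpy as np
import cv2

def epipolar_distance(p1, p2, F):
    """Distance from current-frame point p2 to the epipolar line induced by p1."""
    l = F @ np.array([p1[0], p1[1], 1.0])
    return abs(l[0] * p2[0] + l[1] * p2[1] + l[2]) / np.hypot(l[0], l[1])

def reject_dynamic_features(P1, P2, box=None):
    """P1, P2: (N, 2) matched points (previous/current frame); box: (x1, y1, x2, y2)."""
    # The paper estimates F with a seven-point method inside RANSAC; OpenCV's
    # FM_RANSAC is the closest available call. The argument order (P1, P2) makes
    # l = F @ p1 a line in the current frame; the paper's (P2, P1) ordering
    # merely transposes F.
    F, _ = cv2.findFundamentalMat(P1.astype(np.float32), P2.astype(np.float32),
                                  cv2.FM_RANSAC)
    if box is None:
        return list(map(tuple, P2))          # no detection: keep all features
    x1, y1, x2, y2 = box
    S, outside_dists, deferred = [], [], []
    for p1, p2 in zip(P1, P2):
        if p2[0] < x1 or p2[0] > x2 or p2[1] < y1 or p2[1] > y2:
            S.append(tuple(p2))              # outside the box: treated as static
            outside_dists.append(epipolar_distance(p1, p2, F))
        else:
            deferred.append((p1, p2))        # inside the box: decide by geometry
    # Adaptive threshold: mean epipolar-line distance over the assumed-static set
    # (the 1.0 px fallback for an empty set is an arbitrary assumption).
    eps = np.mean(outside_dists) if outside_dists else 1.0
    for p1, p2 in deferred:
        if epipolar_distance(p1, p2, F) < eps:
            S.append(tuple(p2))
    return S
```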
4. Experiments
4.1. Performance Evaluation of the Proposed Object Detection Model
- Group 1: Baseline model using YOLOv8n.
- Group 2: YOLOv8n enhanced with the ODConv module.
- Group 3: YOLOv8n model incorporating the C2f module integrated with the proposed UIB structure (a UIB sketch follows this list).
- Group 4: Integrated model combining both ODConv and C2f_UIB modifications, under the same input conditions and hyperparameter settings as the other groups.
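For concreteness, below is a minimal PyTorch sketch of a unified inverted bottleneck (UIB) block in the spirit of MobileNetV4 [32]. The module name `UIB`, the channel settings, and the exact way it would replace the bottleneck inside C2f are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class UIB(nn.Module):
    """Inverted bottleneck with optional depthwise convs before and within expansion."""
    def __init__(self, c_in, c_out, expand=2.0, start_dw=True, mid_dw=True):
        super().__init__()
        c_mid = int(c_in * expand)
        layers = []
        if start_dw:  # optional depthwise conv on the input features
            layers += [nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False),
                       nn.BatchNorm2d(c_in)]
        layers += [nn.Conv2d(c_in, c_mid, 1, bias=False),   # 1x1 expansion
                   nn.BatchNorm2d(c_mid), nn.SiLU()]
        if mid_dw:    # optional depthwise conv in the expanded space
            layers += [nn.Conv2d(c_mid, c_mid, 3, padding=1, groups=c_mid, bias=False),
                       nn.BatchNorm2d(c_mid), nn.SiLU()]
        layers += [nn.Conv2d(c_mid, c_out, 1, bias=False),  # 1x1 projection
                   nn.BatchNorm2d(c_out)]
        self.block = nn.Sequential(*layers)
        self.residual = (c_in == c_out)      # skip connection when shapes match

    def forward(self, x):
        y = self.block(x)
        return x + y if self.residual else y

# e.g., dropping a UIB into a C2f-style bottleneck position:
x = torch.randn(1, 64, 80, 80)
print(UIB(64, 64)(x).shape)  # torch.Size([1, 64, 80, 80])
```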
4.2. Performance Evaluation of Pose Estimation on the KITTI Dataset
4.3. Performance Evaluation of Pose Estimation on the M2DGR and M2UD Datasets
4.4. Validation of the Adaptive Threshold Strategy
4.5. Effectiveness Evaluation of the Dynamic Feature Removal Strategy
- VINS-Fusion (baseline);
- YSAG-VINS (S), which utilizes the original YOLOv8n model for semantic-based feature removal;
- YSAG-VINS (SI), which employs the improved semantic detection model proposed in this work;
- YSAG-VINS (SI+G), the full pipeline that integrates both the improved semantic model and adaptive geometric constraints.
4.6. Timing Analysis
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| SLAM | Simultaneous localization and mapping |
| ODConv | Omni-dimensional dynamic convolution |
| C2f | Concatenate and fuse |
| UIB | Unified inverted bottleneck |
| SGD | Stochastic gradient descent |
| mAP@0.5 | mean average precision at IoU threshold = 0.5 |
| mAP@0.5:0.95 | mean average precision across IoU thresholds from 0.5 to 0.95 |
| Param | parameter count |
| GFLOPs | giga floating-point operations |
References
- Durrant-Whyte, H.; Bailey, T. Simultaneous localization and mapping: Part I. IEEE Robot. Autom. Mag. 2006, 13, 99–110.
- Klein, G.; Murray, D. Parallel tracking and mapping for small AR workspaces. In Proceedings of the 6th IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR), Nara, Japan, 13–16 November 2007; pp. 225–234.
- Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 2017, 33, 1255–1262.
- Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 15–22.
- Rosinol, A.; Abate, M.; Chang, Y.; Carlone, L. Kimera: An open-source library for real-time metric-semantic localization and mapping. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 1689–1696.
- Qin, T.; Li, P.; Shen, S. VINS-Mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020.
- Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-scale direct monocular SLAM. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 834–849.
- He, J.; Li, M.; Wang, Y.; Wang, H. OVD-SLAM: An online visual SLAM for dynamic environments. IEEE Sens. J. 2023, 23, 13210–13219.
- Wen, S.; Li, X.; Zhang, H.; Li, J.; Tao, S.; Long, Y. Dynamic SLAM: A visual SLAM in outdoor dynamic scenes. IEEE Trans. Instrum. Meas. 2023, 72, 1–11.
- Wu, W.; Guo, L.; Gao, H.; You, Z.; Liu, Y.; Chen, Z. YOLO-SLAM: A semantic SLAM system towards dynamic environment with geometric constraint. Neural Comput. Appl. 2022, 34, 6011–6026.
- Guan, H.; Qian, C.; Wu, T.; Hu, X.; Duan, F.; Ye, X. A dynamic scene vision SLAM method incorporating object detection and object characterization. Sustainability 2023, 15, 3048.
- Mur-Artal, R.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163.
- Kerl, C.; Sturm, J.; Cremers, D. Dense visual SLAM for RGB-D cameras. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Tokyo, Japan, 3–7 November 2013; pp. 2100–2106.
- Alcantarilla, P.F.; Yebes, J.J.; Almazán, J.; Bergasa, L.M. On combining visual SLAM and dense scene flow to increase the robustness of localization and mapping in dynamic environments. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Saint Paul, MN, USA, 14–18 May 2012; pp. 1290–1297.
- Tan, W.; Liu, H.; Dong, Z.; Zhang, G.; Bao, H. Robust monocular SLAM in dynamic environments. In Proceedings of the IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Adelaide, Australia, 1–4 October 2013; pp. 209–218.
- Wang, Y.; Huang, S. Towards dense moving object segmentation based robust dense RGB-D SLAM in dynamic scenarios. In Proceedings of the International Conference on Control, Automation, Robotics and Vision (ICARCV), Singapore, 10–12 December 2014; pp. 1841–1846.
- Bescos, B.; Fácil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083.
- Runz, M.; Buffier, M.; Agapito, L. MaskFusion: Real-time recognition, tracking and reconstruction of multiple moving objects. In Proceedings of the IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Munich, Germany, 16–20 October 2018; pp. 10–20.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Chang, J.; Dong, N.; Li, D. A real-time dynamic object segmentation framework for SLAM system in dynamic scenes. IEEE Trans. Instrum. Meas. 2021, 70, 1–9.
- Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, South Korea, 27 October–2 November 2019; pp. 9157–9166.
- Zhang, J.; Henein, M.; Mahony, R.; Ila, V. VDO-SLAM: A visual dynamic object-aware SLAM system. arXiv 2020, arXiv:2005.11052.
- Zhang, C.; Huang, T.; Zhang, R.; Yi, X. PLD-SLAM: A new RGB-D SLAM method with point and line features for indoor dynamic scene. ISPRS Int. J. Geo-Inf. 2021, 10, 163.
- Xing, Z.; Zhu, X.; Dong, D. DE-SLAM: SLAM for highly dynamic environment. J. Field Robot. 2022, 39, 528–542.
- Yang, J.; Wang, Y.; Tan, X.; Fang, M.; Ma, L. DHP-SLAM: A real-time visual SLAM system with high positioning accuracy under dynamic environment. Displays 2025, 84, 103067.
- Cheng, S.; Sun, C.; Zhang, S.; Zhang, D. SG-SLAM: A real-time RGB-D visual SLAM toward dynamic scenes with semantic and geometric information. IEEE Trans. Instrum. Meas. 2022, 72, 1–12.
- Song, S.; Lim, H.; Lee, A.J.; Myung, H. DynaVINS: A visual-inertial SLAM for dynamic environments. IEEE Robot. Autom. Lett. 2022, 7, 11523–11530.
- Min, F.; Wu, Z.; Li, D.; Wang, G.; Liu, N. COEB-SLAM: A robust VSLAM in dynamic environments combined object detection, epipolar geometry constraint, and blur filtering. IEEE Sens. J. 2023, 23, 26279–26291.
- Sun, H.; Fan, Q.; Zhang, H.; Liu, J. A real-time visual SLAM based on semantic information and geometric information in dynamic environment. J. Real-Time Image Process. 2024, 21, 169.
- Zhu, Y.; Cheng, P.; Zhuang, J.; Wang, Z.; He, T. Visual simultaneous localization and mapping optimization method based on object detection in dynamic scene. Appl. Sci. 2024, 14, 1787.
- Li, C.; Zhou, A.; Yao, A. Omni-dimensional dynamic convolution. arXiv 2022, arXiv:2209.07947.
- Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B.; et al. MobileNetV4: Universal models for the mobile ecosystem. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 78–96.
- Xu, G.; Zhang, Z. Epipolar Geometry in Stereo, Motion and Object Recognition; Computational Imaging and Vision; Springer: Dordrecht, The Netherlands, 1996.
- Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 2636–2645.
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
- Yin, J.; Li, A.; Li, T.; Yu, W.; Zou, D. M2DGR: A multi-sensor and multi-scenario SLAM dataset for ground robots. IEEE Robot. Autom. Lett. 2021, 7, 2266–2273.
- Jia, Y.; Wang, S.; Shao, S.; Wang, Y.; Zhang, F.; Wang, T. M2UD: A multi-model, multi-scenario, uneven-terrain dataset for ground robot with localization and mapping evaluation. arXiv 2025, arXiv:2503.12387.
- Chen, J.; Xu, Y. DynamicVINS: Visual-inertial localization and dynamic object tracking. In Proceedings of the China Automation Congress (CAC), Beijing, China, 10–12 October 2022; pp. 6861–6866.
| Studies | G/S | Methods | RT Perf. | Pred. Acc. | Occ. Rob. | Dyn. Rej. | HW Cost | Scene Gen. |
|---|---|---|---|---|---|---|---|---|
| [14] | G | Scene flow | RT | N/A | N | M | L | L |
| [15] | G | Adaptive RANSAC | RT | N/A | N | M | L | L |
| [16] | G | Optical flow + clustering | RT | N/A | N | M | L | L |
| DynaSLAM [17] | S | Mask R-CNN seg. + multi-view geom. | Non-RT | H | H | H | H | M |
| MaskFusion [18] | S | Mask R-CNN seg. | Non-RT | H | H | H | VH | M |
| [20] | S | YOLACT seg. + epipolar + flow | Non-RT | M–H | M | H | M–H | M |
| VDO-SLAM [22] | S | Mask R-CNN seg. + dense optical flow + rigid-body constraint | Non-RT | M–H | M | H | M–H | M |
| [23,24,25] | S | Lightweight seg. + epipolar | Non-RT | M | M | M–H | M | M–H |
| SG-SLAM [26] | S | Lightweight det. + epipolar | RT | M | L | M–H | M–L | M |
| DynaVINS [27] | S | YOLOv3 det. + multi-hypothesis constraint | RT | M | M | H | M | M–H |
| COEB-SLAM [28] | S | Det. + motion-blur constraints + epipolar | Non-RT | M | L | M–H | M | M |
| [29] | S | YOLOv8s det. + depth RANSAC | RT | M | L | M–H | M–L | M |
| [30] | S | YOLOv5s det. + dynamic logic decision mechanism | RT | M | L | H | M–H | M–H |
| YSAG-VINS (Ours) | S | YOLOv8n-ODUIB det. + adaptive geometric constraints | RT | H | H | H | L | M–H |

G/S: geometric/semantic; RT: real-time; Non-RT: non-real-time; H/M/L: high/medium/low; VH: very high; N: none; N/A: not applicable.
| Group | ODConv | UIB | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Params (M) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|---|---|
| 1 |  |  | 88.4 | 94.1 | 89.6 | 63.7 | 3.20 | 8.7 | 132.04 |
| 2 | √ |  | 92.8 | 95.3 | 92.4 | 66.9 | 4.20 | 7.3 | 128.93 |
| 3 |  | √ | 90.9 | 96.5 | 92.0 | 66.2 | 2.18 | 5.3 | 180.63 |
| 4 | √ | √ | 92.4 | 96.6 | 92.6 | 68.1 | 3.38 | 6.2 | 157.81 |
| Category | P (YOLOv8n) | R (YOLOv8n) | mAP@0.5 (YOLOv8n) | mAP@0.5:0.95 (YOLOv8n) | P (Ours) | R (Ours) | mAP@0.5 (Ours) | mAP@0.5:0.95 (Ours) |
|---|---|---|---|---|---|---|---|---|
| Car | 95.6 | 98.8 | 96.6 | 75.6 | 98.2 | 98.9 | 97.8 | 78.1 |
| Cyclist | 87.9 | 95.6 | 89.9 | 65.3 | 93.7 | 96.3 | 91.9 | 66.3 |
| Pedestrian | 81.6 | 87.7 | 83.0 | 50.2 | 85.3 | 89.6 | 87.6 | 54.7 |
| All | 88.4 | 94.1 | 89.6 | 63.7 | 92.4 | 96.6 | 92.6 | 68.1 |
| Sequences | VINS-Fusion RMSE | VINS-Fusion Mean | VINS-Fusion STD | YSAG-VINS (Ours) RMSE | YSAG-VINS (Ours) Mean | YSAG-VINS (Ours) STD | Impr. RMSE | Impr. Mean | Impr. STD |
|---|---|---|---|---|---|---|---|---|---|
| 0930-0027 | 1.1708 | 1.0821 | 0.4471 | 0.5795 | 0.5442 | 0.1991 | 49.71% | 50.50% | 55.47% |
| 0930-0033 | 5.7936 | 4.7357 | 3.3376 | 2.2445 | 1.8391 | 1.2866 | 61.26% | 61.17% | 61.45% |
| 0930-0034 | 2.6243 | 2.3275 | 1.2123 | 1.4945 | 1.3281 | 0.6854 | 43.05% | 34.19% | 33.73% |
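For clarity, the improvement columns above follow the usual relative-reduction convention; the minimal sketch below reproduces, e.g., the 61.26% RMSE entry for sequence 0930-0033 (that the authors used exactly this formula and rounding is an assumption).

```python
def improvement(baseline: float, ours: float) -> float:
    """Relative error reduction, in percent."""
    return (baseline - ours) / baseline * 100.0

# Sequence 0930-0033, RMSE: VINS-Fusion 5.7936 vs. YSAG-VINS 2.2445
print(f"{improvement(5.7936, 2.2445):.2f}%")  # 61.26%
```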
| Sequences | VDO-SLAM RMSE | Dynamic-VINS RMSE | YSAG-VINS RMSE |
|---|---|---|---|
| 0930-0027 | 0.8324 | 0.7923 | 0.5795 |
| 0930-0033 | 2.3253 | 2.4356 | 2.2445 |
| 0930-0034 | 1.5357 | 1.5864 | 1.4945 |
| Sequences | VINS-Fusion RMSE | VINS-Fusion Mean | VINS-Fusion STD | DynaVINS RMSE | DynaVINS Mean | DynaVINS STD | YSAG-VINS (Ours) RMSE | YSAG-VINS (Ours) Mean | YSAG-VINS (Ours) STD |
|---|---|---|---|---|---|---|---|---|---|
| rotation_01 | 1.7228 | 1.4023 | 0.7822 | 1.4319 | 1.2352 | 0.7242 | 1.2090 | 1.0670 | 0.5685 |
| Urban_02 | 6.7327 | 5.8551 | 3.3237 | 6.4968 | 5.6180 | 3.2628 | 5.8653 | 4.8862 | 3.0764 |
| Urban_03 | 3.7253 | 3.1834 | 1.9349 | 3.5204 | 3.1054 | 1.8736 | 3.0490 | 3.0490 | 1.7156 |
| Sequences | Fixed Threshold RMSE | Fixed Threshold Mean | Fixed Threshold STD | Ours RMSE | Ours Mean | Ours STD |
|---|---|---|---|---|---|---|
| 0930-0027 | 0.8325 | 0.7937 | 0.3146 | 0.5795 | 0.5442 | 0.1991 |
| 0930-0033 | 3.5631 | 2.7781 | 1.7369 | 2.2445 | 1.8391 | 1.2866 |
| 0930-0034 | 1.9372 | 1.7469 | 1.0237 | 1.4945 | 1.3281 | 0.6854 |
| rotation_01 | 1.6324 | 1.3763 | 0.9651 | 1.2090 | 1.0670 | 0.5685 |
| Urban_02 | 6.0235 | 5.3299 | 3.2691 | 5.8653 | 4.8862 | 3.0764 |
| Urban_03 | 3.6836 | 3.1107 | 1.8749 | 3.0490 | 3.0490 | 1.7156 |
| Sequences | VINS-Fusion RMSE | VINS-Fusion STD | YSAG-VINS(S) RMSE | YSAG-VINS(S) STD | YSAG-VINS(SI) RMSE | YSAG-VINS(SI) STD | YSAG-VINS(SI+G) RMSE | YSAG-VINS(SI+G) STD |
|---|---|---|---|---|---|---|---|---|
| 0930-0027 | 1.1708 | 0.4471 | 0.8627 | 0.4505 | 0.9473 | 0.4763 | 0.5795 | 0.1991 |
| 0930-0033 | 5.7936 | 3.3376 | 3.9746 | 2.2575 | 2.6578 | 1.4299 | 2.2445 | 1.2866 |
| 0930-0034 | 2.6243 | 1.2123 | 2.0257 | 0.9850 | 1.7270 | 0.8034 | 1.4945 | 0.6854 |
| Comparison | t-Value | p-Value | Significance |
|---|---|---|---|
| YSAG-VINS(S) vs. VINS-Fusion | 4.53 | 0.045 | Significant (p < 0.05) |
| YSAG-VINS(SI) vs. VINS-Fusion | 6.40 | 0.023 | Significant (p < 0.05) |
| YSAG-VINS(SI+G) vs. VINS-Fusion | 10.9 | 0.008 | Highly significant (p < 0.01) |
| YSAG-VINS(SI) vs. YSAG-VINS(S) | 2.81 | 0.106 | Not significant |
| YSAG-VINS(SI+G) vs. YSAG-VINS(SI) | 6.55 | 0.022 | Significant (p < 0.05) |
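As a sketch of the statistics above, a paired t-test can be run with `scipy.stats.ttest_rel`. Which samples the authors actually paired (per-sequence RMSE, pooled per-frame errors, or relative improvements) is not stated in this section, so the inputs below are illustrative placeholders rather than a reproduction of the reported t-values.

```python
from scipy import stats

# Illustrative per-sequence RMSE pairs (KITTI sequences from the tables above).
rmse_baseline = [1.1708, 5.7936, 2.6243]   # VINS-Fusion
rmse_ours     = [0.5795, 2.2445, 1.4945]   # YSAG-VINS(SI+G)

# Paired (dependent-samples) t-test; with three sequences, df = 2,
# consistent with the t/p pairs reported in the table.
t, p = stats.ttest_rel(rmse_baseline, rmse_ours)
print(f"t = {t:.2f}, p = {p:.3f}")
```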
| Systems | Average Processing Time per Frame (ms) |
|---|---|
| VINS-Fusion | 34.37 |
| YSAG-VINS(S) | 38.15 |
| YSAG-VINS(SI) | 37.53 |
| YSAG-VINS(SI+G) | 37.75 |
| DynaVINS | 68.54 |
| Dynamic-VINS | 56.23 |
| VDO-SLAM | 121.32 |
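As a sketch of how such per-frame averages are conventionally obtained, the snippet below times a processing loop with `time.perf_counter`; `process_frame` is a hypothetical stand-in for the full front-end-plus-back-end pipeline, not the authors' benchmarking harness.

```python
import time

def average_frame_time_ms(frames, process_frame):
    """Wall-clock time per frame, averaged over a sequence, in milliseconds."""
    t0 = time.perf_counter()
    for f in frames:
        process_frame(f)
    return (time.perf_counter() - t0) / len(frames) * 1000.0

# e.g., average_frame_time_ms(sequence, tracker.track)
# would yield ~37.75 ms per frame for YSAG-VINS(SI+G) per the table above.
```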