Vision-Based Navigation and Perception for Autonomous Robots: Sensors, SLAM, Control Strategies, and Cross-Domain Applications—A Review
Abstract
1. Introduction
1.1. Scope and Contributions
1.2. Methodology
- RQ1: What vision sensor modalities and fusion schemes are employed across robotic domains?
- RQ2: How do the current localization, mapping, and navigation algorithms exploit vision data, and what quantitative performance do they achieve?
- RQ3: What outstanding technical challenges remain, and what promising future trends are emerging?
1.3. Organization of the Review
2. Vision Sensor Modalities for Robotics
2.1. Monocular and Multi-Camera RGB Systems
2.2. Stereo Vision Rigs
2.3. RGB-D and Time-of-Flight Cameras
2.4. LiDAR and Camera–LiDAR Hybrids
2.5. Event-Based (Neuromorphic) Cameras
2.6. Infrared and Thermal Imaging Sensors
2.7. A Comparative Summary
3. Depth and Scene Understanding Techniques
3.1. Geometry-Based Depth Recovery
3.1.1. Structure from Motion and Monocular SLAM
3.1.2. Stereo Correspondence
3.1.3. Active Depth Sensors
3.2. Learning-Based Monocular Depth Estimation
3.3. Self-Supervised and Hybrid Approaches
3.4. Sensor Fusion for Dense 3D Perception
3.5. The Quantitative Impact of Sensor Fusion on 3D Perception
3.6. Discussion
4. Visual Localization and Mapping (SLAM)
4.1. Visual Odometry Foundations
4.2. Sparse Feature-Based SLAM
4.3. Direct and Dense SLAM
4.4. Visual–Inertial SLAM and Factor Graphs
4.5. LiDAR–Visual–Inertial Fusion
4.6. Loop Closure, Relocalization, and Multi-Session Mapping
4.7. Semantic, Object-Level, and Neural Implicit Mapping
4.8. NeRF-Based Real-Time 3D Reconstruction for Navigation and Localization
- Hash-grid accelerations. Instant-NGP’s multiresolution hash encoding cuts both training and rendering latency by two orders of magnitude, turning minutes of optimization into seconds and enabling live updates on a single RTX-level GPU [64] (a minimal encoding sketch follows this list);
- Dense monocular pipelines. NeRF-SLAM couples a direct photometric tracker with an uncertainty-aware NeRF optimizer to achieve real-time tracking and mapping using only a monocular camera, outperforming classical dense SLAM on TUM RGB-D in L1 depth error [65];
- System-level engineering. The recent SP-SLAM prunes sample rays via scene priors and parallel CUDA kernels, reporting a substantial speed-up over NICE-SLAM while maintaining low trajectory error on ScanNet [66].
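To make the hash-grid idea above concrete, the sketch below implements a minimal multiresolution hash-encoding lookup in the spirit of Instant-NGP [64]. The level count, table size, feature width, growth factor, and hash constants are illustrative assumptions rather than values taken from the cited implementation, and the feature tables are left untrained.

```python
# Minimal multiresolution hash encoding in the spirit of Instant-NGP [64].
# All sizes (levels, table length, feature width, growth factor) are
# illustrative assumptions, not values from the cited implementation.
import numpy as np

PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def spatial_hash(ijk, table_size):
    """XOR-based spatial hash of integer voxel corners, (N, 3) -> (N,)."""
    h = np.zeros(ijk.shape[0], dtype=np.uint64)
    for d in range(3):
        h ^= ijk[:, d].astype(np.uint64) * PRIMES[d]
    return h % np.uint64(table_size)

class HashGridEncoding:
    def __init__(self, n_levels=8, table_size=2**14, n_features=2,
                 base_res=16, growth=1.5, seed=0):
        rng = np.random.default_rng(seed)
        self.tables = rng.normal(0.0, 1e-4, (n_levels, table_size, n_features))
        self.res = [int(base_res * growth**l) for l in range(n_levels)]
        self.table_size = table_size

    def __call__(self, x):
        """Encode points x in [0, 1]^3, shape (N, 3) -> (N, n_levels * n_features)."""
        feats = []
        for level, res in enumerate(self.res):
            xs = x * res
            lo = np.floor(xs).astype(np.int64)          # lower voxel corner
            w = xs - lo                                  # trilinear weights
            acc = 0.0
            for corner in range(8):                      # visit the 8 voxel corners
                offset = np.array([(corner >> d) & 1 for d in range(3)])
                idx = spatial_hash(lo + offset, self.table_size)
                cw = np.prod(np.where(offset, w, 1 - w), axis=1, keepdims=True)
                acc = acc + cw * self.tables[level, idx]
            feats.append(acc)
        return np.concatenate(feats, axis=1)

# Example: encode a batch of 3D sample points along camera rays.
enc = HashGridEncoding()
pts = np.random.rand(1024, 3)
print(enc(pts).shape)   # (1024, 16) with the assumed 8 levels x 2 features
```

In a full pipeline, these per-level features feed a small MLP that predicts density and color, and both the hash tables and the MLP weights are optimized against photometric (and, in SLAM variants, depth) losses.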
Limitations and Outlook
4.9. Robustness in Dynamic and Adverse Environments
System | Year | Sensor Suite | EuRoC ATE (Avg.) | Loop Closure | Semantic | Hardware |
---|---|---|---|---|---|---|
VINS-Fusion [52] | 2018 | Mono/Stereo + IMU + GPS | 0.18 m | Yes | No | CPU |
BAD-SLAM [7] | 2019 | RGB-D | NR | Yes | No | GPU |
LIO-SAM [53] | 2020 | LiDAR + IMU + GNSS | NR | Yes | No | CPU |
Kimera [58] | 2020 | Mono/Stereo/RGB-D + IMU | NR | Yes | Yes | CPU |
DROID-SLAM [51] | 2021 | Mono/Stereo/RGB-D | 0.022 m | Implicit | No | GPU |
ORB-SLAM3 [1] | 2021 | Mono/Stereo/RGB-D/IMU | 0.036 m | Yes | No | CPU |
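The EuRoC ATE values in the table above follow the usual evaluation recipe: the estimated trajectory is rigidly aligned to ground truth (Umeyama alignment) and the RMSE of the residual translations is reported. The sketch below illustrates this computation under the assumption of already time-synchronized (N, 3) position arrays; a complete evaluation also handles timestamp association and, for monocular systems, scale.

```python
# Sketch of the absolute trajectory error (ATE RMSE) reported in the table.
# Inputs are assumed to be time-synchronized (N, 3) position arrays.
import numpy as np

def align_umeyama(est, gt):
    """Least-squares rigid alignment (rotation + translation, unit scale)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    U, _, Vt = np.linalg.svd(G.T @ E)
    S = np.eye(3)
    S[2, 2] = np.sign(np.linalg.det(U @ Vt))   # guard against reflections
    R = U @ S @ Vt
    t = mu_g - R @ mu_e
    return R, t

def ate_rmse(est, gt):
    R, t = align_umeyama(est, gt)
    residuals = gt - (est @ R.T + t)
    return np.sqrt((residuals ** 2).sum(axis=1).mean())

# Example with synthetic data: a noisy, rotated copy of a ground-truth path.
gt = np.cumsum(np.random.randn(500, 3) * 0.01, axis=0)
theta = 0.1
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
est = gt @ Rz.T + np.random.randn(500, 3) * 0.005
print(f"ATE RMSE: {ate_rmse(est, gt):.4f} m")
```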
5. Navigation and Control Strategies
5.1. Classical Vision-Aided Planning and Obstacle Avoidance
5.2. Visual Servoing for Robotic Navigation
5.2.1. Techniques and Control Laws
5.2.2. Sensors
5.2.3. Path Planning and Navigation Strategies
5.2.4. Advantages and Limitations
5.3. Learning-Based Controllers
5.3.1. Deep Reinforcement Learning
5.3.2. Imitation Learning and Behavior Cloning
5.3.3. Hybrid-Model-Based and Learning Approaches
5.4. Topological Visual Memories and Teach-and-Repeat
5.5. Visual Language Navigation (VLN)
5.6. Vision-Guided Manipulation, Visual Servoing, and Inspection
5.7. Discussion and Outlook
6. Cross-Domain Application Case Studies
6.1. Autonomous Road Vehicles
6.2. Industrial Robots and Smart Factories
6.3. Autonomous Underwater Vehicles (AUVs)
6.4. Planetary and Lunar Rovers
6.5. Emerging Platforms: Drones and Humanoids
7. Benchmark Datasets, Simulators, and Evaluation Metrics
7.1. SLAM and Odometry Benchmarks
- KITTI Odometry—Stereo/mono driving sequences with GPS/INS ground-truth trajectories; reports average translational (%) and rotational (deg/m) errors over 100–800 m segments [41] (a simplified computation is sketched after this list);
- EuRoC MAV—Indoor VICON-tracked stereo+IMU recordings for micro-aerial vehicles [44];
- TUM VI—A high-rate fisheye+IMU dataset emphasizing photometric calibration and rapid motion [45];
- 4Seasons—Multi-season, day/night visual–LiDAR sequences for robustness testing under appearance changes [12].
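The KITTI-style segment metric referenced above can be computed as sketched below: for each segment length, the relative motion over the segment is compared between the ground-truth and estimated trajectories, and translational (%) and rotational (deg/m) errors are averaged. The 4×4 pose format and the sampling stride are assumptions, and the official development kit should be used for any reported numbers.

```python
# Simplified KITTI-style segment error (Section 7.1). Poses are assumed to be
# 4x4 homogeneous matrices in a common frame; trajectories shorter than the
# smallest segment length yield no samples.
import numpy as np

SEGMENT_LENGTHS = [100, 200, 300, 400, 500, 600, 700, 800]  # meters

def path_distances(poses_gt):
    """Cumulative traveled distance along the ground-truth trajectory."""
    steps = np.linalg.norm(np.diff(poses_gt[:, :3, 3], axis=0), axis=1)
    return np.concatenate([[0.0], np.cumsum(steps)])

def segment_errors(poses_gt, poses_est, stride=10):
    dist = path_distances(poses_gt)
    t_errs, r_errs = [], []
    for length in SEGMENT_LENGTHS:
        for i in range(0, len(poses_gt), stride):
            j = int(np.searchsorted(dist, dist[i] + length))
            if j >= len(poses_gt):
                break
            # Relative motion over the segment in both trajectories.
            rel_gt = np.linalg.inv(poses_gt[i]) @ poses_gt[j]
            rel_est = np.linalg.inv(poses_est[i]) @ poses_est[j]
            err = np.linalg.inv(rel_est) @ rel_gt
            t_errs.append(np.linalg.norm(err[:3, 3]) / length)
            angle = np.arccos(np.clip((np.trace(err[:3, :3]) - 1) / 2, -1.0, 1.0))
            r_errs.append(angle / length)
    return 100.0 * np.mean(t_errs), np.degrees(np.mean(r_errs))  # %, deg/m
```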
7.2. Navigation and VLN Test Suites
- CARLA—An open-source urban driving simulator with weather perturbations; metrics include the success, collision, and lane invasion rates [128].
7.3. Domain-Specific Datasets
7.4. Evaluation Metrics
7.5. Open-Source Software and Reproducibility Resources
8. Engineering Design Guidelines
8.1. Sensor Selection and Placement
- Maximize complementary overlap: Arrange multi-modal rigs with overlapping fields of view to ease extrinsic calibration and provide redundancy. For 360° coverage, stagger cameras so that adjacent views overlap, preserving feature continuity across camera boundaries [21];
- Minimize parallax in manipulation: Mount eye-in-hand cameras close to the end-effector to reduce hand-eye calibration errors while ensuring a sufficient baseline for stereo depth (a hand-eye calibration sketch follows this list).
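Because the eye-in-hand mounting choice directly affects hand-eye calibration quality, the sketch below shows the standard AX = XB solve using OpenCV's calibrateHandEye on synthetic poses. The camera-to-gripper and target-to-base transforms are assumed placeholders; in a real setup, the per-view board poses come from detecting a calibration pattern (e.g., with cv2.solvePnP) and the robot poses from forward kinematics.

```python
# Eye-in-hand calibration sketch (AX = XB) with OpenCV, on synthetic poses.
# The "ground-truth" transforms below are assumptions used only to generate
# consistent data; the solver should recover T_cam2gripper.
import cv2
import numpy as np

rng = np.random.default_rng(0)

def rt_to_T(rvec, t):
    """4x4 homogeneous transform from a rotation vector and a translation."""
    T = np.eye(4)
    T[:3, :3], _ = cv2.Rodrigues(np.asarray(rvec, dtype=float))
    T[:3, 3] = t
    return T

# Assumed camera-to-gripper extrinsics and a board fixed in the base frame.
T_cam2gripper = rt_to_T([0.05, -0.03, 0.10], [0.03, 0.00, 0.08])
T_target2base = rt_to_T([0.0, np.pi, 0.0], [0.6, 0.1, 0.2])

R_g2b, t_g2b, R_t2c, t_t2c = [], [], [], []
for _ in range(15):                                   # 15 varied robot poses
    T_gripper2base = rt_to_T(rng.uniform(-1.0, 1.0, 3), rng.uniform(-0.3, 0.3, 3))
    # Board pose seen by the camera, implied by the fixed transform chain.
    T_target2cam = (np.linalg.inv(T_cam2gripper)
                    @ np.linalg.inv(T_gripper2base) @ T_target2base)
    R_g2b.append(T_gripper2base[:3, :3]); t_g2b.append(T_gripper2base[:3, 3:4])
    R_t2c.append(T_target2cam[:3, :3]);   t_t2c.append(T_target2cam[:3, 3:4])

R_est, t_est = cv2.calibrateHandEye(R_g2b, t_g2b, R_t2c, t_t2c,
                                    method=cv2.CALIB_HAND_EYE_TSAI)
print(np.round(t_est.ravel(), 3))   # should recover the assumed [0.03, 0.0, 0.08]
```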
8.2. Calibration and Synchronization
8.3. Real-Time Computation and Hardware Acceleration
8.4. Safety, Reliability, and Redundancy
8.5. System Integration, Testing, and Deployment
9. Future Trends and Open Research Challenges
9.1. Generative AI for Synthetic Data and Scene Completion
9.2. Neural Implicit Representations and Dense 3D Mapping
9.3. High-Density 3D Sensing and Sensor Fusion
9.4. Event-Based Vision for Ultra-Fast Control
9.5. Human-Centric Autonomy and Humanoids
9.6. All-Weather, All-Condition Operation
9.7. Ethical, Security, and Societal Considerations
10. Conclusions
- Fusion is mandatory. Across the benchmarks, combining complementary sensors markedly improves accuracy: RGB + LiDAR depth completion cuts the RMSE by about 30% on KITTI, and adding an IMU reduces visual drift on EuRoC/TUM VI from 1–2% to below 0.5% of the trajectory length (see Section 3.5).
- Learning is the new front-end. Learned feature extractors (SuperPoint + LightGlue) and dense networks such as DROID-SLAM consistently outperform handcrafted pipelines, maintaining high recall and robust tracking under severe viewpoint, lighting, and dynamic scene changes (Section 4.2 and Section 4.9).
- Implicit mapping is now real-time. Hash-grid NeRF variants—e.g., Instant-NGP and NICE-SLAM—reach >15 Hz camera tracking with centimeter-level depth accuracy on a single consumer GPU (Section 4.8).
- Topological representations slash memory costs. Systems such as NavTopo achieve an order-of-magnitude (∼10×) reduction in storage relative to dense metric grids while retaining centimeter-level repeatability for long routes (Section 5.4).
- Safety relies on redundancy and self-diagnosis. The guideline section (Section 8.4) shows that cross-checking LiDAR, stereo, and network uncertainty signals—coupled with watchdog triggers—detects most perception anomalies before they propagate to the controller, underscoring the importance of built-in fault detection.
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
- Gallego, G.; Delbrück, T.; Orchard, G.; Bartolozzi, C.; Taba, B.; Censi, A.; Leutenegger, S.; Davison, A.J.; Conradt, J.; Daniilidis, K.; et al. Event-based vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 154–180. [Google Scholar] [CrossRef] [PubMed]
- Wang, J.; Yu, Z.; Zhou, D.; Shi, J.; Deng, R. Vision-Based Deep Reinforcement Learning of Unmanned Aerial Vehicle (UAV) Autonomous Navigation Using Privileged Information. Drones 2024, 8, 782. [Google Scholar] [CrossRef]
- Verma, V.; Maimone, M.W.; Gaines, D.M.; Francis, R.; Estlin, T.A.; Kuhn, S.R.; Rabideau, G.R.; Chien, S.A.; McHenry, M.M.; Graser, E.J.; et al. Autonomous robotics is driving Perseverance rover’s progress on Mars. Sci. Robot. 2023, 8, eadi3099. [Google Scholar] [CrossRef]
- Arafat, M.Y.; Alam, M.M.; Moh, S. Vision-based navigation techniques for unmanned aerial vehicles: Review and challenges. Drones 2023, 7, 89. [Google Scholar] [CrossRef]
- Panduru, K.; Walsh, J. Exploring the Unseen: A Survey of Multi-Sensor Fusion and the Role of Explainable AI (XAI) in Autonomous Vehicles. Sensors 2025, 25, 856. [Google Scholar] [CrossRef]
- Schops, T.; Sattler, T.; Pollefeys, M. Bad slam: Bundle adjusted direct rgb-d slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 134–144. [Google Scholar] [CrossRef]
- Zhou, S.; Wang, Z.; Dai, X.; Song, W.; Gu, S. LIR-LIVO: A Lightweight, Robust LiDAR/Vision/Inertial Odometry with Illumination-Resilient Deep Features. arXiv 2025, arXiv:2502.08676. [Google Scholar] [CrossRef]
- Bescos, B.; Fácil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef]
- Bhaskar, A.; Mahammad, Z.; Jadhav, S.R.; Tokekar, P. NAVINACT: Combining Navigation and Imitation Learning for Bootstrapping Reinforcement Learning. arXiv 2024, arXiv:2408.04054. [Google Scholar]
- Huang, C.; Mees, O.; Zeng, A.; Burgard, W. Visual language maps for robot navigation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 10608–10615. [Google Scholar] [CrossRef]
- Wenzel, P.; Yang, N.; Wang, R.; Zeller, N.; Cremers, D. 4Seasons: Benchmarking visual slam and long-term localization for autonomous driving in challenging conditions. Int. J. Comput. Vis. 2024, 133, 1564–1586. [Google Scholar] [CrossRef]
- Morgan, A.S.; Wen, B.; Liang, J.; Boularias, A.; Dollar, A.M.; Bekris, K. Vision-driven compliant manipulation for reliable, high-precision assembly tasks. arXiv 2021, arXiv:2106.14070. [Google Scholar] [CrossRef]
- Wang, X.; Fan, X.; Shi, P.; Ni, J.; Zhou, Z. An overview of key SLAM technologies for underwater scenes. Remote Sens. 2023, 15, 2496. [Google Scholar] [CrossRef]
- Villalonga, C. Leveraging Synthetic Data to Create Autonomous Driving Perception Systems. Ph.D. Thesis, Universitat Autònoma de Barcelona, Barcelona, Spain, 2020. Available online: https://ddd.uab.cat/pub/tesis/2021/hdl_10803_671739/gvp1de1.pdf (accessed on 6 July 2025).
- Yu, T.; Xiao, T.; Stone, A.; Tompson, J.; Brohan, A.; Wang, S.; Singh, J.; Tan, C.; M, D.; Peralta, J.; et al. Scaling Robot Learning with Semantically Imagined Experience. In Proceedings of the Robotics: Science and Systems (RSS), Daegu, Republic of Korea, 10–14 July 2023. [Google Scholar] [CrossRef]
- Shahria, M.T.; Sunny, M.S.H.; Zarif, M.I.I.; Ghommam, J.; Ahamed, S.I.; Rahman, M.H. A comprehensive review of vision-based robotic applications: Current state, components, approaches, barriers, and potential solutions. Robotics 2022, 11, 139. [Google Scholar] [CrossRef]
- Wang, H.; Li, J.; Dong, H. A Review of Vision-Based Multi-Task Perception Research Methods for Autonomous Vehicles. Sensors 2025, 25, 2611. [Google Scholar] [CrossRef]
- Grant, M.J.; Booth, A. A typology of reviews: An analysis of 14 review types and associated methodologies. Health Inf. Libr. J. 2009, 26, 91–108. [Google Scholar] [CrossRef]
- Liang, H.; Ma, Z.; Zhang, Q. Self-supervised object distance estimation using a monocular camera. Sensors 2022, 22, 2936. [Google Scholar] [CrossRef]
- Won, C.; Seok, H.; Cui, Z.; Pollefeys, M.; Lim, J. OmniSLAM: Omnidirectional localization and dense mapping for wide-baseline multi-camera systems. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 559–566. [Google Scholar] [CrossRef]
- Liu, P.; Geppert, M.; Heng, L.; Sattler, T.; Geiger, A.; Pollefeys, M. Towards robust visual odometry with a multi-camera system. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1154–1161. [Google Scholar] [CrossRef]
- Fields, J.; Salgian, G.; Samarasekera, S.; Kumar, R. Monocular structure from motion for near to long ranges. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, Kyoto, Japan, 27 September–4 October 2009; pp. 1702–1709. [Google Scholar] [CrossRef]
- Haseeb, M.A.; Ristić-Durrant, D.; Gräser, A. Long-range obstacle detection from a monocular camera. In Proceedings of the ACM Computer Science in Cars Symposium (CSCS), Munich, Germany, 13–14 September 2018; pp. 13–14. [Google Scholar] [CrossRef]
- Pinggera, P.; Franke, U.; Mester, R. High-performance long range obstacle detection using stereo vision. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 1308–1313. [Google Scholar] [CrossRef]
- Huang, A.S.; Bachrach, A.; Henry, P.; Krainin, M.; Maturana, D.; Fox, D.; Roy, N. Visual odometry and mapping for autonomous flight using an RGB-D camera. In Proceedings of the Robotics Research: The 15th International Symposium ISRR, Puerto Varas, Chile, 11–14 December 2017; pp. 235–252. [Google Scholar]
- Adiuku, N.; Avdelidis, N.P.; Tang, G.; Plastropoulos, A. Advancements in learning-based navigation systems for robotic applications in MRO hangar. Sensors 2024, 24, 1377. [Google Scholar] [CrossRef]
- Jeyabal, S.; Sachinthana, W.; Bhagya, S.; Samarakoon, P.; Elara, M.R.; Sheu, B.J. Hard-to-Detect Obstacle Mapping by Fusing LIDAR and Depth Camera. IEEE Sens. J. 2024, 24, 24690–24698. [Google Scholar] [CrossRef]
- Fan, Z.; Zhang, L.; Wang, X.; Shen, Y.; Deng, F. LiDAR, IMU, and camera fusion for simultaneous localization and mapping: A systematic review. Artif. Intell. Rev. 2025, 58, 1–59. [Google Scholar] [CrossRef]
- Shi, C.; Song, N.; Li, W.; Li, Y.; Wei, B.; Liu, H.; Jin, J. A Review of Event-Based Indoor Positioning and Navigation. IPIN-WiP 2022, 3248, 12. [Google Scholar]
- Furmonas, J.; Liobe, J.; Barzdenas, V. Analytical review of event-based camera depth estimation methods and systems. Sensors 2022, 22, 1201. [Google Scholar] [CrossRef] [PubMed]
- Rebecq, H.; Gallego, G.; Mueggler, E.; Scaramuzza, D. EMVS: Event-based multi-view stereo—3D reconstruction with an event camera in real-time. Int. J. Comput. Vis. 2018, 126, 1394–1414. [Google Scholar] [CrossRef]
- Munford, M.J.; Rodriguez y Baena, F.; Bowyer, S. Stereoscopic Near-Infrared Fluorescence Imaging: A Proof of Concept Toward Real-Time Depth Perception in Surgical Robotics. Front. Robot. AI 2019, 6, 66. [Google Scholar] [CrossRef] [PubMed]
- Nguyen, T.X.B.; Rosser, K.; Chahl, J. A review of modern thermal imaging sensor technology and applications for autonomous aerial navigation. J. Imaging 2021, 7, 217. [Google Scholar] [CrossRef]
- Khattak, S.; Papachristos, C.; Alexis, K. Visual-thermal landmarks and inertial fusion for navigation in degraded visual environments. In Proceedings of the 2019 IEEE Aerospace Conference, Big Sky, MT, USA, 2–9 March 2019; pp. 1–9. [Google Scholar] [CrossRef]
- NG, A.; PB, D.; Shalabi, J.; Jape, S.; Wang, X.; Jacob, Z. Thermal Voyager: A Comparative Study of RGB and Thermal Cameras for Night-Time Autonomous Navigation. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 14116–14122. [Google Scholar] [CrossRef]
- Delaune, J.; Hewitt, R.; Lytle, L.; Sorice, C.; Thakker, R.; Matthies, L. Thermal-inertial odometry for autonomous flight throughout the night. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 1122–1128. [Google Scholar] [CrossRef]
- Hou, F.; Zhang, Y.; Zhou, Y.; Zhang, M.; Lv, B.; Wu, J. Review on infrared imaging technology. Sustainability 2022, 14, 11161. [Google Scholar] [CrossRef]
- Zhang, J. Survey on Monocular Metric Depth Estimation. arXiv 2025, arXiv:2501.11841. [Google Scholar] [CrossRef]
- Tateno, K.; Tombari, F.; Laina, I.; Navab, N. Cnn-slam: Real-time dense monocular slam with learned depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6243–6252. [Google Scholar] [CrossRef]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
- Hu, M.; Wang, S.; Li, B.; Ning, S.; Fan, L.; Gong, X. Penet: Towards precise and efficient image guided depth completion. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13656–13662. [Google Scholar] [CrossRef]
- Klenk, S.; Chui, J.; Demmel, N.; Cremers, D. TUM-VIE: The TUM stereo visual-inertial event dataset. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 8601–8608. [Google Scholar] [CrossRef]
- Burri, M.; Nikolic, J.; Gohl, P.; Schneider, T.; Rehder, J.; Omari, S.; Achtelik, M.W.; Siegwart, R. The EuRoC micro aerial vehicle datasets. Int. J. Robot. Res. 2016, 35, 1157–1163. [Google Scholar] [CrossRef]
- Schubert, D.; Goll, T.; Demmel, N.; Usenko, V.; Stückler, J.; Cremers, D. The TUM VI benchmark for evaluating visual-inertial odometry. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1680–1687. [Google Scholar] [CrossRef]
- Gehrig, M.; Aarents, W.; Gehrig, D.; Scaramuzza, D. Dsec: A stereo event camera dataset for driving scenarios. IEEE Robot. Autom. Lett. 2021, 6, 4947–4954. [Google Scholar] [CrossRef]
- Bai, L.; Yang, J.; Tian, C.; Sun, Y.; Mao, M.; Xu, Y.; Xu, W. DCANet: Differential convolution attention network for RGB-D semantic segmentation. Pattern Recognit. 2025, 162, 111379. [Google Scholar] [CrossRef]
- Yu, K.; Tao, T.; Xie, H.; Lin, Z.; Liang, T.; Wang, B.; Chen, P.; Hao, D.; Wang, Y.; Liang, X. Benchmarking the robustness of lidar-camera fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 3188–3198. [Google Scholar] [CrossRef]
- Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
- Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-scale direct monocular SLAM. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 834–849. [Google Scholar] [CrossRef]
- Teed, Z.; Deng, J. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Adv. Neural Inf. Process. Syst. 2021, 34, 16558–16569. [Google Scholar]
- Qin, T.; Li, P.; Shen, S. Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
- Shan, T.; Englot, B.; Meyers, D.; Wang, W.; Ratti, C.; Rus, D. LIO-SAM: Tightly-coupled Lidar Inertial Odometry via Smoothing and Mapping. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 5135–5142. [Google Scholar] [CrossRef]
- DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar] [CrossRef]
- Lindenberger, P.; Sarlin, P.E.; Pollefeys, M. Lightglue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 17627–17638. [Google Scholar] [CrossRef]
- Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5297–5307. [Google Scholar] [CrossRef]
- Rosinol, A.; Sattler, T.; Pollefeys, M.; Carlone, L. Incremental visual-inertial 3d mesh generation with structural regularities. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019. [Google Scholar] [CrossRef]
- Rosinol, A.; Abate, M.; Chang, Y.; Carlone, L. Kimera: An Open-Source Library for Real-Time Metric-Semantic Localization and Mapping. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020. [Google Scholar] [CrossRef]
- Rosinol, A.; Gupta, A.; Abate, M.; Shi, J.; Carlone, L. 3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans. In Proceedings of the Robotics: Science and Systems (RSS), Corvalis, OR, USA, 12–16 July 2020. [Google Scholar] [CrossRef]
- Salas-Moreno, R.F.; Newcombe, R.A.; Strasdat, H.; Kelly, P.H.; Davison, A.J. Slam++: Simultaneous localisation and mapping at the level of objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 1352–1359. [Google Scholar] [CrossRef]
- Sucar, E.; Liu, S.; Ortiz, J.; Davison, A.J. imap: Implicit mapping and positioning in real-time. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6229–6238. [Google Scholar] [CrossRef]
- Li, G.; Chen, Q.; Yan, Y.; Pu, J. EC-SLAM: Real-time Dense Neural RGB-D SLAM System with Effectively Constrained Global Bundle Adjustment. arXiv 2024, arXiv:2404.13346. [Google Scholar] [CrossRef]
- Zhu, Z.; Peng, S.; Larsson, V.; Xu, W.; Bao, H.; Cui, Z.; Oswald, M.R.; Pollefeys, M. Nice-slam: Neural implicit scalable encoding for slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12786–12796. [Google Scholar] [CrossRef]
- Müller, T.; Evans, A.; Schied, C.; Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 2022, 41, 1–15. [Google Scholar] [CrossRef]
- Rosinol, A.; Leonard, J.J.; Carlone, L. Nerf-slam: Real-time dense monocular slam with neural radiance fields. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 3437–3444. [Google Scholar] [CrossRef]
- Hong, Z.; Wang, B.; Duan, H.; Huang, Y.; Li, X.; Wen, Z.; Wu, X.; Xiang, W.; Zheng, Y. SP-SLAM: Neural Real-Time Dense SLAM With Scene Priors. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 5182–5194. [Google Scholar] [CrossRef]
- Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8922–8931. [Google Scholar] [CrossRef]
- Yang, L.; Zhang, Y.; Tian, R.; Liang, S.; Shen, Y.; Coleman, S.; Kerr, D. Fast, Robust, Accurate, Multi-Body Motion Aware SLAM. IEEE Trans. Intell. Transp. Syst. 2023, 25, 4381–4397. [Google Scholar] [CrossRef]
- Kitt, B.M.; Rehder, J.; Chambers, A.D.; Schonbein, M.; Lategahn, H.; Singh, S. Monocular Visual Odometry Using a Planar Road Model to Solve Scale Ambiguity. In Proceedings of the European Conference on Mobile Robots (ECMR), Orebro, Sweden, 7–9 September 2011. [Google Scholar]
- Zheng, S.; Wang, J.; Rizos, C.; Ding, W.; El-Mowafy, A. Simultaneous localization and mapping (slam) for autonomous driving: Concept and analysis. Remote Sens. 2023, 15, 1156. [Google Scholar] [CrossRef]
- Keetha, N.; Karhade, J.; Jatavallabhula, K.M.; Yang, G.; Scherer, S.; Ramanan, D.; Luiten, J. SplaTAM: Splat Track & Map 3D Gaussians for Dense RGB-D SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 21357–21366. [Google Scholar] [CrossRef]
- Joo, K.; Kim, P.; Hebert, M.; Kweon, I.S.; Kim, H.J. Linear RGB-D SLAM for structured environments. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8403–8419. [Google Scholar] [CrossRef]
- Kim, P.; Coltin, B.; Kim, H.J. Linear RGB-D SLAM for Planar Environments. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Berlin/Heidelberg, Germany, 2018; pp. 350–366. [Google Scholar] [CrossRef]
- Zhou, D.; Wang, Z.; Bandyopadhyay, S.; Schwager, M. Fast, on-line collision avoidance for dynamic vehicles using buffered voronoi cells. IEEE Robot. Autom. Lett. 2017, 2, 1047–1054. [Google Scholar] [CrossRef]
- Wu, J. Rigid 3-D registration: A simple method free of SVD and eigendecomposition. IEEE Trans. Instrum. Meas. 2020, 69, 8288–8303. [Google Scholar] [CrossRef]
- Liu, Y.; Kong, D.; Zhao, D.; Gong, X.; Han, G. A point cloud registration algorithm based on feature extraction and matching. Math. Probl. Eng. 2018, 2018, 7352691. [Google Scholar] [CrossRef]
- Tiar, R.; Lakrouf, M.; Azouaoui, O. Fast ICP-SLAM for a bi-steerable mobile robot in large environments. In Proceedings of the 2015 IEEE International Workshop of Electronics, Control, Measurement, Signals and their Application to Mechatronics (ECMSM), Istanbul, Turkey, 27–31 July 2015; pp. 1–6. [Google Scholar] [CrossRef]
- Ahmadi, A.; Nardi, L.; Chebrolu, N.; Stachniss, C. Visual servoing-based navigation for monitoring row-crop fields. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 4920–4926. [Google Scholar] [CrossRef]
- Li, Y.; Košecka, J. Learning view and target invariant visual servoing for navigation. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 658–664. [Google Scholar] [CrossRef]
- Caron, G.; Marchand, E.; Mouaddib, E.M. Photometric visual servoing for omnidirectional cameras. Auton. Robot. 2013, 35, 177–193. [Google Scholar] [CrossRef]
- Petiteville, A.D.; Hutchinson, S.; Cadenat, V.; Courdesses, M. 2D visual servoing for a long range navigation in a cluttered environment. In Proceedings of the 2011 50th IEEE Conference on Decision and Control and European Control Conference, Orlando, FL, USA, 12–15 December 2011; pp. 5677–5682. [Google Scholar] [CrossRef]
- Rodríguez Martínez, E.A.; Caron, G.; Pégard, C.; Lara-Alabazares, D. Photometric-Planner for Visual Path Following. IEEE Sens. J. 2020, 21, 11310–11317. [Google Scholar] [CrossRef]
- Rodríguez Martínez, E.A.; Caron, G.; Pégard, C.; Lara-Alabazares, D. Photometric Path Planning for Vision-Based Navigation. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 9007–9013. [Google Scholar] [CrossRef]
- Wang, Y.; Yan, Y.; Shi, D.; Zhu, W.; Xia, J.; Jeff, T.; Jin, S.; Gao, K.; Li, X.; Yang, X. NeRF-IBVS: Visual servo based on nerf for visual localization and navigation. Adv. Neural Inf. Process. Syst. 2023, 36, 8292–8304. [Google Scholar]
- Vassallo, R.F.; Schneebeli, H.J.; Santos-Victor, J. Visual servoing and appearance for navigation. Robot. Auton. Syst. 2000, 31, 87–97. [Google Scholar] [CrossRef]
- Li, J.; Wang, X.; Tang, S.; Shi, H.; Wu, F.; Zhuang, Y.; Wang, W.Y. Unsupervised reinforcement learning of transferable meta-skills for embodied navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12123–12132. [Google Scholar] [CrossRef]
- Martins, R.; Bersan, D.; Campos, M.F.; Nascimento, E.R. Extending maps with semantic and contextual object information for robot navigation: A learning-based framework using visual and depth cues. J. Intell. Robot. Syst. 2020, 99, 555–569. [Google Scholar] [CrossRef]
- Kwon, O.; Kim, N.; Choi, Y.; Yoo, H.; Park, J.; Oh, S. Visual graph memory with unsupervised representation for visual navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 15890–15899. [Google Scholar] [CrossRef]
- Cèsar-Tondreau, B.; Warnell, G.; Stump, E.; Kochersberger, K.; Waytowich, N.R. Improving autonomous robotic navigation using imitation learning. Front. Robot. AI 2021, 8, 627730. [Google Scholar] [CrossRef]
- Furgale, P.; Barfoot, T.D. Visual teach and repeat for long-range rover autonomy. J. Field Robot. 2010, 27, 534–560. [Google Scholar] [CrossRef]
- Delfin, J.; Becerra, H.M.; Arechavaleta, G. Humanoid navigation using a visual memory with obstacle avoidance. Robot. Auton. Syst. 2018, 109, 109–124. [Google Scholar] [CrossRef]
- Bista, S.R.; Ward, B.; Corke, P. Image-based indoor topological navigation with collision avoidance for resource-constrained mobile robots. J. Intell. Robot. Syst. 2021, 102, 55. [Google Scholar] [CrossRef]
- Chaplot, D.S.; Salakhutdinov, R.; Gupta, A.; Gupta, S. Neural Topological SLAM for Visual Navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
- Muravyev, K.; Yakovlev, K. NavTopo: Leveraging Topological Maps for Autonomous Navigation of a Mobile Robot. In Proceedings of the International Conference on Interactive Collaborative Robotics, New Mexico, Mexico, 14–18 October 2024; pp. 144–157. [Google Scholar] [CrossRef]
- Liu, Q.; Cui, X.; Liu, Z.; Wang, H. Cognitive navigation for intelligent mobile robots: A learning-based approach with topological memory configuration. IEEE/CAA J. Autom. Sin. 2024, 11, 1933–1943. [Google Scholar] [CrossRef]
- Wang, H.; Wang, W.; Liang, W.; Xiong, C.; Shen, J. Structured scene memory for vision-language navigation. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8455–8464. [Google Scholar] [CrossRef]
- Liu, R.; Wang, X.; Wang, W.; Yang, Y. Bird’s-Eye-View Scene Graph for Vision-Language Navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 10968–10980. [Google Scholar] [CrossRef]
- Liu, R.; Wang, W.; Yang, Y. Volumetric Environment Representation for Vision-Language Navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16317–16328. [Google Scholar] [CrossRef]
- Wang, H.; Liang, W.; Gool, L.V.; Wang, W. Towards versatile embodied navigation. Adv. Neural Inf. Process. Syst. 2022, 35, 36858–36874. [Google Scholar] [CrossRef]
- Wang, X.; Wang, W.; Shao, J.; Yang, Y. Lana: A language-capable navigator for instruction following and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19048–19058. [Google Scholar] [CrossRef]
- Shah, D.; Osiński, B.; Levine, S. Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action. In Proceedings of the Conference on Robot Learning, Jeju Island, Republic of Korea, 16–18 October 2023; pp. 492–504. [Google Scholar] [CrossRef]
- Cai, W.; Ponomarenko, Y.; Yuan, J.; Li, X.; Yang, W.; Dong, H.; Zhao, B. SpatialBot: Precise Spatial Understanding with Vision Language Models. arXiv 2024, arXiv:2406.13642. [Google Scholar] [CrossRef]
- Park, S.M.; Kim, Y.G. Visual language navigation: A survey and open challenges. Artif. Intell. Rev. 2023, 56, 365–427. [Google Scholar] [CrossRef]
- Liu, H.; Xue, W.; Chen, Y.; Chen, D.; Zhao, X.; Wang, K.; Hou, L.; Li, R.; Peng, W. A survey on hallucination in large vision-language models. arXiv 2024, arXiv:2402.00253. [Google Scholar] [CrossRef]
- Jiang, P.; Ishihara, Y.; Sugiyama, N.; Oaki, J.; Tokura, S.; Sugahara, A.; Ogawa, A. Depth image–based deep learning of grasp planning for textureless planar-faced objects in vision-guided robotic bin-picking. Sensors 2020, 20, 706. [Google Scholar] [CrossRef]
- Hütten, N.; Alves Gomes, M.; Hölken, F.; Andricevic, K.; Meyes, R.; Meisen, T. Deep learning for automated visual inspection in manufacturing and maintenance: A survey of open-access papers. Appl. Syst. Innov. 2024, 7, 11. [Google Scholar] [CrossRef]
- Machkour, Z.; Ortiz-Arroyo, D.; Durdevic, P. Classical and deep learning based visual servoing systems: A survey on state of the art. J. Intell. Robot. Syst. 2022, 104, 11. [Google Scholar] [CrossRef]
- Chen, D.; Zhou, B.; Koltun, V.; Krähenbühl, P. Learning by cheating. In Proceedings of the Conference on Robot Learning, Virtual, 16–18 November 2020; pp. 66–75. [Google Scholar]
- Chitta, K.; Prakash, A.; Jaeger, B.; Yu, Z.; Renz, K.; Geiger, A. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 12878–12895. [Google Scholar] [CrossRef]
- Wijmans, E.; Kadian, A.; Morcos, A.; Lee, S.; Essa, I.; Parikh, D.; Savva, M.; Batra, D. Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames. arXiv 2019, arXiv:1911.00357. [Google Scholar] [CrossRef]
- Ramakrishnan, S.K.; Al-Halah, Z.; Grauman, K. Occupancy anticipation for efficient exploration and navigation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 400–418. [Google Scholar] [CrossRef]
- Shafiullah, N.M.M.; Paxton, C.; Pinto, L.; Chintala, S.; Szlam, A. Clip-fields: Weakly supervised semantic fields for robotic memory. arXiv 2022, arXiv:2210.05663. [Google Scholar] [CrossRef]
- Anderson, P.; Chang, A.; Chaplot, D.S.; Dosovitskiy, A.; Gupta, S.; Koltun, V.; Kosecka, J.; Malik, J.; Mottaghi, R.; Savva, M.; et al. On evaluation of embodied navigation agents. arXiv 2018, arXiv:1807.06757. [Google Scholar] [CrossRef]
- Rosero, L.A.; Gomes, I.P.; Da Silva, J.A.R.; Przewodowski, C.A.; Wolf, D.F.; Osório, F.S. Integrating modular pipelines with end-to-end learning: A hybrid approach for robust and reliable autonomous driving systems. Sensors 2024, 24, 2097. [Google Scholar] [CrossRef]
- Cheng, J.; Zhang, L.; Chen, Q.; Hu, X.; Cai, J. A review of visual SLAM methods for autonomous driving vehicles. Eng. Appl. Artif. Intell. 2022, 114, 104992. [Google Scholar] [CrossRef]
- Philion, J.; Fidler, S. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 194–210. [Google Scholar] [CrossRef]
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar] [CrossRef]
- Rahman, S.; Li, A.Q.; Rekleitis, I. Svin2: An underwater slam system using sonar, visual, inertial, and depth sensor. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 1861–1868. [Google Scholar] [CrossRef]
- Wang, Z.; Cheng, Q.; Mu, X. Ru-slam: A robust deep-learning visual simultaneous localization and mapping (slam) system for weakly textured underwater environments. Sensors 2024, 24, 1937. [Google Scholar] [CrossRef]
- Zarei, M.; Chhabra, R. Advancements in autonomous mobility of planetary wheeled mobile robots: A review. Front. Space Technol. 2022, 3, 1080291. [Google Scholar] [CrossRef]
- Giubilato, R.; Le Gentil, C.; Vayugundla, M.; Schuster, M.J.; Vidal-Calleja, T.; Triebel, R. GPGM-SLAM: A robust slam system for unstructured planetary environments with gaussian process gradient maps. Field Robot. 2022, 2, 1721–1753. [Google Scholar] [CrossRef]
- Tian, H.; Zhang, T.; Jia, Y.; Peng, S.; Yan, C. Zhurong: Features and mission of China’s first Mars rover. Innovation 2021, 2, 100121. [Google Scholar] [CrossRef]
- IEEE Spectrum. Figure 01: General Purpose Humanoid Robot. Robot. Guide 2023, 7, e797. [Google Scholar] [CrossRef]
- Border, R.; Chebrolu, N.; Tao, Y.; Gammell, J.D.; Fallon, M. Osprey: Multi-Session Autonomous Aerial Mapping with LiDAR-based SLAM and Next Best View Planning. IEEE Trans. Field Robot. 2024, 1, 113–130. [Google Scholar] [CrossRef]
- Kuindersma, S.; Deits, R.; Fallon, M.; Valenzuela, A.; Dai, H.; Permenter, F.; Koolen, T.; Marion, P.; Tedrake, R. Optimization-based locomotion planning, estimation, and control design for the atlas humanoid robot. Auton. Robot. 2016, 40, 429–455. [Google Scholar] [CrossRef]
- Radford, N.A.; Strawser, P.; Hambuchen, K.; Mehling, J.S.; Verdeyen, W.K.; Donnan, A.S.; Holley, J.; Sanchez, J.; Nguyen, V.; Bridgwater, L.; et al. Valkyrie: Nasa’s first bipedal humanoid robot. J. Field Robot. 2015, 32, 397–419. [Google Scholar] [CrossRef]
- Giernacki, W.; Skwierczyński, M.; Witwicki, W.; Wroński, P.; Kozierski, P. Crazyflie 2.0 quadrotor as a platform for research and education in robotics and control engineering. In Proceedings of the 2017 22nd International Conference on Methods and Models in Automation and Robotics (MMAR), Miedzyzdroje, Poland, 28–31 August 2017; pp. 37–42. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. CARLA: An open urban driving simulator. In Proceedings of the Conference on Robot Learning, Mountain View, CA, USA, 13–15 November 2017; pp. 1–16. [Google Scholar]
- Szot, A.; Clegg, A.; Undersander, E.; Wijmans, E.; Zhao, Y.; Turner, J.; Maestre, N.; Mukadam, M.; Chaplot, D.; Maksymets, O.; et al. Habitat 2.0: Training Home Assistants to Rearrange their Habitat. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 6–14 December 2021. [Google Scholar]
- Puig, X.; Undersander, E.; Szot, A.; Cote, M.D.; Yang, T.Y.; Partsey, R.; Desai, R.; Clegg, A.W.; Hlavac, M.; Min, S.Y.; et al. Habitat 3.0: A co-habitat for humans, avatars and robots. arXiv 2023, arXiv:2310.13724. [Google Scholar] [CrossRef]
- Savva, M.; Kadian, A.; Maksymets, O.; Zhao, Y.; Wijmans, E.; Jain, B.; Straub, J.; Liu, J.; Koltun, V.; Malik, J.; et al. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
- Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar] [CrossRef]
- Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2446–2454. [Google Scholar] [CrossRef]
- Liao, X.; Duan, J.; Huang, Y.; Wang, J. RUIE: Retrieval-based Unified Information Extraction using Large Language Model. arXiv 2024, arXiv:2409.11673. [Google Scholar] [CrossRef]
- Ferrera, M.; Creuze, V.; Moras, J.; Trouvé-Peloux, P. AQUALOC: An underwater dataset for visual–inertial–pressure localization. Int. J. Robot. Res. 2019, 38, 1549–1559. [Google Scholar] [CrossRef]
- ISO 10218-1:2025; Robotics—Safety Requirements—Part 1: Industrial Robots. International Organization for Standardization: Geneva, Switzerland, 2025.
- ISO 10218-2:2025; Robotics—Safety Requirements—Part 2: Industrial Robot Applications and Robot Cells. International Organization for Standardization: Geneva, Switzerland, 2025.
- Weber, E.; Holynski, A.; Jampani, V.; Saxena, S.; Snavely, N.; Kar, A.; Kanazawa, A. Nerfiller: Completing scenes via generative 3d inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 20731–20741. [Google Scholar] [CrossRef]
- Ren, X.; Huang, J.; Zeng, X.; Museth, K.; Fidler, S.; Williams, F. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4209–4219. [Google Scholar] [CrossRef]
- Zhang, D.; Williams, F.; Gojcic, Z.; Kreis, K.; Fidler, S.; Kim, Y.M.; Kar, A. Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 20145–20154. [Google Scholar] [CrossRef]
- Wen, B.; Xie, H.; Chen, Z.; Hong, F.; Liu, Z. 3D Scene Generation: A Survey. arXiv 2025, arXiv:2505.05474. [Google Scholar] [CrossRef]
- Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.Y.; Isola, P.; Saenko, K.; Efros, A.; Darrell, T. Cycada: Cycle-consistent adversarial domain adaptation. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1989–1998. [Google Scholar] [CrossRef]
- Tremblay, J.; Prakash, A.; Acuna, D.; Brophy, M.; Jampani, V.; Anil, C.; To, T.; Cameracci, E.; Boochoon, S.; Birchfield, S. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 969–977. [Google Scholar] [CrossRef]
- Qin, Y.; Wang, C.; Kang, Z.; Ma, N.; Li, Z.; Zhang, R. SupFusion: Supervised LiDAR-camera fusion for 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 22014–22024. [Google Scholar] [CrossRef]
- Wang, S.; Xie, Y.; Chang, C.P.; Millerdurai, C.; Pagani, A.; Stricker, D. Uni-slam: Uncertainty-aware neural implicit slam for real-time dense indoor scene reconstruction. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 2228–2239. [Google Scholar] [CrossRef]
- Tancik, M.; Casser, V.; Yan, X.; Pradhan, S.; Mildenhall, B.; Srinivasan, P.P.; Barron, J.T.; Kretzschmar, H. Block-nerf: Scalable large scene neural view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8248–8258. [Google Scholar] [CrossRef]
- Zhang, Q.; Zhang, Z.; Cui, W.; Sun, J.; Cao, J.; Guo, Y.; Han, G.; Zhao, W.; Wang, J.; Sun, C.; et al. HumanoidPano: Hybrid Spherical Panoramic-LiDAR Cross-Modal Perception for Humanoid Robots. arXiv 2025, arXiv:2503.09010. [Google Scholar] [CrossRef]
- Chae, Y.; Kim, H.; Yoon, K.J. Towards robust 3d object detection with lidar and 4d radar fusion in various weather conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15162–15172. [Google Scholar] [CrossRef]
- Gehrig, D.; Scaramuzza, D. Low-latency automotive vision with event cameras. Nature 2024, 629, 1034–1040. [Google Scholar] [CrossRef]
- Dong, W.; Li, S.; Zheng, P. Toward Embodied Intelligence-Enabled Human–Robot Symbiotic Manufacturing: A Large Language Model-Based Perspective. J. Comput. Inf. Sci. Eng. 2025, 25, 050801. [Google Scholar] [CrossRef]
- Li, J.; Shi, X.; Chen, F.; Stroud, J.; Zhang, Z.; Lan, T.; Mao, J.; Kang, J.; Refaat, K.S.; Yang, W.; et al. Pedestrian crossing action recognition and trajectory prediction with 3d human keypoints. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 1463–1470. [Google Scholar] [CrossRef]
- Tong, Y.; Liu, H.; Zhang, Z. Advancements in humanoid robots: A comprehensive review and future prospects. IEEE/CAA J. Autom. Sin. 2024, 11, 301–328. [Google Scholar] [CrossRef]
- Kalb, T.; Beyerer, J. Principles of forgetting in domain-incremental semantic segmentation in adverse weather conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19508–19518. [Google Scholar] [CrossRef]
- Sui, T. Exploring moral algorithm preferences in autonomous vehicle dilemmas: An empirical study. Front. Psychol. 2023, 14, 1229245. [Google Scholar] [CrossRef]
Sensor Modality | Range | FoV | Power | Cost | Typical Use Cases | Contributions from 2020 |
---|---|---|---|---|---|---|
Monocular RGB/Multi-Camera | Algorithm-dependent; reliable within a few meters | Up to 360° (multi) | Low | Low | Appearance-based recognition, SLAM with priors, place recognition, loop closure | 128 (e.g., [20,21]) |
Stereo Vision | 5–40 m | 120° | Low | Low to moderate | Real-time metric depth, indoor/outdoor navigation, robust passive depth sensing | 528 (e.g., [27]) |
RGB-D/ToF Cameras | 0.3–5 m (typical) | 67°–108° horizontal | Higher than passive | Higher than RGB | Dense mapping, obstacle avoidance, low-texture surface handling | 743 (e.g., [5]) |
LiDAR/Camera–LiDAR Fusion | Tens to hundreds of meters | 360° | 8–15 W | Decreasing | Long-range 3D mapping, drift-free SLAM, robust perception in varied lighting/weather | 1449 (e.g., [6,28,29]) |
Event-Based Cameras | 1–5 m | 75° | Very low (0.25 mW) | High | High-speed motion scenes, HDR environments, drones, low-latency SLAM | 94 (e.g., [2,30]) |
Infrared/Thermal Imaging | Long-distance | 12.4°–50° | Moderate–high but decreasing | Low to high | Low-light/night vision, search-and-rescue, firefighting, night-time driving | 78 (e.g., [34,38]) |
Task | Metric | What It Measures |
---|---|---|
Pose estimation | ATE, RPE | Absolute and relative drift of estimated trajectory with respect to ground truth |
Mapping | Completeness, Accuracy | Point-/surface-level recall and deviation versus dense laser scans |
Navigation | Success rate, Collision rate, SPL | Goal reached, safety violations, efficiency normalized by path length |
Detection and segm. | mAP, PQ | Precision–recall area (2D/3D) and unified panoptic quality |
Manipulation | Pick success, cycle time, force | Grasp completion ratio, throughput, and applied force thresholds |
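For the navigation row, success weighted by path length (SPL), introduced in the embodied-navigation evaluation guidelines of Anderson et al. cited in the references, combines per-episode success with path efficiency. The minimal sketch below assumes per-episode success flags, shortest-path lengths, and traveled path lengths are already available.

```python
# Success weighted by Path Length (SPL): mean over episodes of
# success_i * shortest_i / max(traveled_i, shortest_i).
import numpy as np

def spl(successes, shortest_lengths, traveled_lengths):
    s = np.asarray(successes, dtype=float)
    l = np.asarray(shortest_lengths, dtype=float)
    p = np.asarray(traveled_lengths, dtype=float)
    return float(np.mean(s * l / np.maximum(p, l)))

# Example: three episodes; the second fails, the third takes a detour.
print(spl([1, 0, 1], [10.0, 8.0, 12.0], [11.0, 20.0, 18.0]))  # ~0.53
```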