Semantic SLAM with Multi-Modal Perception: Survey on Robust Long-Term Localization for Autonomous Vehicles
Abstract
1. Introduction
2. Time-Varying Semantic Maps for Long-Term Localization
3. Multi-Modal Perception
3.1. Extrinsic Calibration
3.2. Temporal Synchronization
- Trigger pulses sent from a master device (often the camera) to slave sensors (e.g., LiDAR or IMU) to initiate acquisitions simultaneously.
- GPS-disciplined clocks driven by a Pulse Per Second (PPS) signal, which provide a global reference time for all sensors.
- Time synchronization protocols such as IEEE 1588 Precision Time Protocol (PTP) or Network Time Protocol (NTP) for distributed setups.
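To make the protocol-based option concrete, the following is a minimal sketch of the two-way timestamp exchange that underlies PTP (and, in a similar form, NTP). It assumes a symmetric network path, and the names `SyncExchange`, `offset_and_delay`, and `to_master_time` are hypothetical helpers chosen for illustration, not part of any PTP library.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SyncExchange:
    """Four timestamps (seconds) of one two-way exchange (hypothetical container)."""
    t1: float  # master sends its sync message (master clock)
    t2: float  # slave receives the sync message (slave clock)
    t3: float  # slave sends its delay request (slave clock)
    t4: float  # master receives the delay request (master clock)


def offset_and_delay(x: SyncExchange) -> Tuple[float, float]:
    """Return (slave clock minus master clock, one-way delay).

    Assumption: the path delay is the same in both directions.
    """
    forward = x.t2 - x.t1    # equals delay + offset
    backward = x.t4 - x.t3   # equals delay - offset
    offset = (forward - backward) / 2.0
    delay = (forward + backward) / 2.0
    return offset, delay


def to_master_time(slave_stamp: float, offset: float) -> float:
    """Express a slave-clock timestamp (e.g., a LiDAR scan time) in master time."""
    return slave_stamp - offset


if __name__ == "__main__":
    # Synthetic exchange: slave clock runs 1.5 s ahead, one-way delay is 0.25 s.
    exchange = SyncExchange(t1=100.0, t2=101.75, t3=102.0, t4=100.75)
    offset, delay = offset_and_delay(exchange)
    print(offset, delay)
    print(to_master_time(102.0, offset))
```

In practice the offset is estimated repeatedly and filtered, because asymmetric or variable network delay violates the symmetry assumption; hardware timestamping, as used in PTP-capable network interfaces, reduces that error.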
4. Semantic Mapping
4.1. Vision-Based Semantic Mapping
| Ref. | Learning | Object Detection | Env. | Dynamic Handling | Mapping Type | Real Time | Accuracy/Robustness | Efficiency | Scalability |
|---|---|---|---|---|---|---|---|---|---|
| [92] | CNN (2D) | None | Outdoor | ✗ | Semi-dense 3D | Partially | Basic semantic consistency | Lightweight | Early-stage baseline |
| [93] | Deep CNN | None | Indoor | Partially | 2D semantic | ✓ | Reliable short-term accuracy | Real-time feasible | Limited to small indoor setups |
| [83] | Deep segmentation | None | Indoor/outdoor | ✓ | 3D semantic | ✓ | Strong against motion blur | GPU-efficient | Handles dynamic motion |
| [33] | None | Faster R-CNN | Indoor/outdoor | ✗ | Object-level 3D | ✓ | Good object localization | Efficient computation | Interpretable structure |
| [86] | Weakly supervised CNN | None | Indoor/dynamic | ✓ | Semi-dense | Partially | Acceptable with fewer labels | Low training cost | Moderate adaptability |
| [94] | Mask R-CNN | Mask R-CNN | Indoor/dynamic | ✓ | 3D semantic | Partially | Accurate instance masking | Medium real-time | Moderate generalization |
| [84] | Deep segmentation | None | Indoor/dynamic | ✓ | Dense 3D | ✓ | Excellent motion robustness | Real-time GPU | Stable in moderate dynamics |
| [85] | Deep CNN | None | Indoor | ✓ | Dense 3D | ✓ | Stable tracking in motion | Efficient CNN pipeline | Indoor adaptability |
| [87] | CNN | YOLOv3 | Outdoor large-scale | Partially | Dense 3D | ✓ | High semantic precision | Real-time optimized | Excellent for road scenes |
| [90] | CNN | Faster R-CNN | Indoor static | ✗ | Object-level | ✓ | Strong semantic association | Real-time feasible | Good indoor scalability |
| [88] | CNN segmentation | YOLOv3 | Indoor multi-robot | Partially | Shared 3D | ✓ | Reliable data fusion | Distributed real-time | High multi-agent scalability |
| [95] | CNN segmentation | None | Dynamic scenes | ✓ | Sparse 3D | Partially | Accurate dynamic removal | Medium processing speed | Adaptable to motion levels |
| [89] | Semantic + stereo depth | YOLOv4 | Outdoor dynamic | ✓ | Dense 3D | ✓ | High spatial accuracy | Real-time stereo | Robust outdoor adaptability |
| [91] | Instance segmentation | Mask R-CNN | Outdoor dynamic | ✓ | Dense 3D | ✓ | Excellent dynamic robustness | Real-time optimized | Suitable for complex scenes |
| [96] | Hybrid heuristic segmentation | None | Indoor dynamic | ✓ | Dense 3D | ✓ | Strong tracking robustness | Optimized pipeline | Adaptive to scene changes |
4.2. LiDAR-Based Semantic Mapping
4.3. Sensor-Fusion-Based Semantic Mapping
5. 3D Semantic Segmentation
5.1. LiDAR-Only 3D Semantic Segmentation
5.1.1. Image-Based Methods
5.1.2. Voxel-Based Methods
5.1.3. Point-Cloud-Based Methods
5.2. LiDAR–Camera Fusion 3D Semantic Segmentation
5.2.1. Early Fusion
5.2.2. Deep Fusion
5.2.3. Late Fusion
5.2.4. Asymmetry Fusion
6. Long-Term Localization
6.1. Vision-Based Place Recognition
6.2. LiDAR-Based Place Recognition
6.2.1. Point-Based Methods
6.2.2. Voxel-Based Methods
6.2.3. Segment-Based Methods
6.2.4. Projection-Based Methods
6.2.5. Graph-Based Methods
6.3. Fusion-Based Place Recognition
7. Benchmarking
7.1. Datasets
7.2. Evaluation Metrics
7.3. Long-Term Evaluation Perspective
8. Open Challenges and Outlook
9. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Bowman, S.L.; Atanasov, N.; Daniilidis, K.; Pappas, G.J. Probabilistic data association for semantic SLAM. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1722–1729.
- McCormac, J.; Clark, R.; Bloesch, M.; Davison, A.J.; Leutenegger, S. Fusion++: Volumetric Object-Level SLAM. arXiv 2018, arXiv:1808.08378.
- Yusefi, A.; Durdu, A.; Toy, I. Camera/LiDAR Sensor Fusion-based Autonomous Navigation. In Proceedings of the 2024 23rd International Symposium INFOTEH-JAHORINA (INFOTEH), East Sarajevo, Bosnia and Herzegovina, 20–22 March 2024; pp. 1–6.
- Abdelfattah, M.; Yuan, K.; Wang, Z.J.; Ward, R. Multi-modal Streaming 3D Object Detection. arXiv 2022, arXiv:2209.04966.
- Wang, M.; Liu, W.; Zhou, B.; Wang, Z.; Liu, R.; Wang, H. A Robust Camera-LiDAR Fusion Framework for 3D Object Detection in High-Dust Environments. In Proceedings of the 2024 IEEE 22nd International Conference on Industrial Informatics (INDIN), Beijing, China, 18–20 August 2024; pp. 1–6.
- Chen, B.; Shen, H.; Zhao, Z.; Yu, L.; Zhao, Y. LiDAR-Camera cross fusion network towards 3D object detection in self-driving. IEEE Sensors J. 2024, 25, 21857–21866.
- Erkent, Ö.; Wolf, C.; Laugier, C.; Gonzalez, D.S.; Cano, V.R. Semantic Grid Estimation with a Hybrid Bayesian and Deep Neural Network Approach. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 888–895.
- Tsintotas, K.A.; Bampis, L.; Gasteratos, A. The Revisiting Problem in Simultaneous Localization and Mapping: A Survey on Visual Loop Closure Detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19929–19953.
- Li, L.; Kong, X.; Zhao, X.; Li, W.; Wen, F.; Zhang, H.; Liu, Y. SA-LOAM: Semantic-aided LiDAR SLAM with Loop Closure. arXiv 2021, arXiv:2106.11516.
- Stenborg, E.; Toft, C.; Hammarstrand, L. Long-term Visual Localization using Semantically Segmented Images. arXiv 2018, arXiv:1801.05269.
- Huang, X.; Xu, Z.; Wu, H.; Wang, J.; Xia, Q.; Xia, Y.; Li, J.; Gao, K.; Wen, C.; Wang, C. L4DR: LiDAR-4DRadar Fusion for Weather-Robust 3D Object Detection. arXiv 2025, arXiv:2408.03677.
- Zhu, D.; Yang, G. RDynaSLAM: Fusing 4D Radar Point Clouds to Visual SLAM in Dynamic Environments. J. Intell. Robot. Syst. 2025, 111, 11.
- Wang, B.; Zhuang, Y.; Huai, J.; Chen, Y.; Chen, J.; El-Bendary, N. GV-iRIOM: GNSS-visual-aided 4D radar inertial odometry and mapping in large-scale environments. ISPRS J. Photogramm. Remote. Sens. 2025, 221, 310–323.
- Liu, H.; Xu, G.; Liu, B.; Li, Y.; Yang, S.; Tang, J.; Pan, K.; Xing, Y. A real time LiDAR-Visual-Inertial object level semantic SLAM for forest environments. ISPRS J. Photogramm. Remote. Sens. 2025, 219, 71–90.
- Zhao, Y.; Wang, C.; Ouyang, Y.; Zhong, J.; Li, Y.; Zhao, N. DHDP-SLAM: Dynamic Hierarchical Dirichlet Process based data association for semantic SLAM. Displays 2025, 86, 102892.
- Chen, N.; Wei, D.; Lin, D.; Lin, L. Semantic SLAM using laser-vision data fusion: Enhancing autonomous navigation in unstructured environments. Alex. Eng. J. 2025, 127, 606–618.
- Li, F.; Fu, C.; Wang, J.; Sun, D. Dynamic Semantic SLAM Based on Panoramic Camera and LiDAR Fusion for Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2025, 27, 2763–2776.
- Jiao, J.; Geng, R.; Li, Y.; Xin, R.; Yang, B.; Wu, J.; Wang, L.; Liu, M.; Fan, R.; Kanoulas, D. Real-Time Metric-Semantic Mapping for Autonomous Navigation in Outdoor Environments. IEEE Trans. Autom. Sci. Eng. 2025, 22, 5729–5740.
- Wan, J.; Zhang, X.; Dong, S.; Zhang, Y.; Yang, Y.; Wu, R.; Jiang, Y.; Li, J.; Lin, J.; Yang, M. Monocular Localization with Semantics Map for Autonomous Vehicles. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 14146–14152.
- Zhang, C.; Zhao, H.; Wang, C.; Tang, X.; Yang, M. Cross-Modal Monocular Localization in Prior LiDAR Maps Utilizing Semantic Consistency. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 4004–4010.
- Cheng, X.; Geng, K.; Yin, G.; Sun, Y.; Wang, J.; Ding, P. Semantic Mapping Optimization Based on LIDAR and Camera Data Fusion for autonomous vehicle. In Proceedings of the 2022 6th CAA International Conference on Vehicular Control and Intelligence (CVCI), Nanjing, China, 28–30 October 2022; pp. 1–6.
- Song, X.; Zhijiang, Z.; Liang, X.; Huaidong, Z. Monocular camera and laser based semantic mapping system with temporal-spatial data association for indoor mobile robots. Multimed. Tools Appl. 2023, 82, 34459–34484.
- Ding, F.; Ji, X.; Wei, D.; Zhang, J.; Li, K.; Yuan, H. Monocular Mapping and Localization of Urban Road Scenes Based on Parameterized Semantic Representation. In Proceedings of the CEUR Workshop Proceedings at the 13th International Conference on Indoor Positioning and Indoor, Nuremberg, Germany, 25–28 September 2023.
- Yi, F.; Ye, L.; Huang, M.; Wang, Q. A Method for Constructing Semantic Navigation Maps in Urban Environments. In Proceedings of the 2023 6th International Conference on Intelligent Autonomous Systems (ICoIAS), Qinhuangdao, China, 22–24 September 2023; pp. 141–147.
- Berrio, J.S.; Shan, M.; Worrall, S.; Nebot, E. Camera-LIDAR Integration: Probabilistic Sensor Fusion for Semantic Mapping. IEEE Trans. Intell. Transp. Syst. 2022, 23, 7637–7652.
- Qian, Z.; Patath, K.; Fu, J.; Xiao, J. Semantic SLAM with Autonomous Object-Level Data Association. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11203–11209.
- Kang, X.; Li, J.; Fan, X.; Jian, H.; Xu, C. Object-level semantic map construction for dynamic scenes. Appl. Sci. 2021, 11, 645.
- Sualeh, M.; Kim, G.W. Semantics aware dynamic SLAM based on 3D MODT. Sensors 2021, 21, 6355.
- Paz, D.; Zhang, H.; Li, Q.; Xiang, H.; Christensen, H.I. Probabilistic Semantic Mapping for Urban Autonomous Driving Applications. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 2059–2064.
- Guan, P.; Cao, Z.; Chen, E.; Liang, S.; Tan, M.; Yu, J. A real-time semantic visual SLAM approach with points and objects. Int. J. Adv. Robot. Syst. 2020, 17, 1729881420905443.
- Jiao, J.; Geng, R.; Li, Y.; Yang, B.; Liu, M. Online Metric-Semantic Mapping for Autonomous Robot Navigation. In Proceedings of the Robot Representations Workshop, Robotics: Science and Systems (RSS), Corvallis, OR, USA, 12–16 June 2020.
- Li, J.; Zhang, X.; Li, J.; Liu, Y.; Wang, J. Building and optimization of 3D semantic map based on Lidar and camera fusion. Neurocomputing 2020, 409, 394–407.
- Nakajima, Y.; Saito, H. Efficient object-oriented semantic mapping with object detector. IEEE Access 2019, 7, 3206–3213.
- Chi, J.; Wu, H.; Tian, G. Object-oriented 3D semantic mapping based on instance segmentation. J. Adv. Comput. Intell. Intell. Inform. 2019, 23, 695–704.
- Zhang, L.; Wei, L.; Shen, P.; Wei, W.; Zhu, G.; Song, J. Semantic SLAM Based on Object Detection and Improved Octomap. IEEE Access 2018, 6, 75545–75559.
- Barsan, I.A.; Liu, P.; Pollefeys, M.; Geiger, A. Robust Dense Mapping for Large-Scale Dynamic Environments. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 7510–7517.
- Yang, Y.; Qiu, F.; Li, H.; Zhang, L.; Wang, M.L.; Fu, M.Y. Large-scale 3D Semantic Mapping Using Stereo Vision. Int. J. Autom. Comput. 2018, 15, 194–206.
- Goga, S.E.C.; Nedevschi, S. Fusing semantic labeled camera images and 3D LiDAR data for the detection of urban curbs. In Proceedings of the 2018 IEEE 14th International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 6–8 September 2018; pp. 301–308.
- McCormac, J.; Handa, A.; Davison, A.; Leutenegger, S. SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 4628–4635.
- Mur-Artal, R.; Tardos, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262.
- Segal, A.; Hähnel, D.; Thrun, S. Generalized-ICP. In Proceedings of the Robotics: Science and Systems; Trinkle, J., Matsuoka, Y., Castellanos, J.A., Eds.; The MIT Press: Cambridge, MA, USA, 2009.
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
- Khanam, R.; Hussain, M. What is YOLOv5: A deep look into the internal features of the popular object detector. arXiv 2024, arXiv:2407.20892.
- Yaseen, M. What is YOLOv8: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector. arXiv 2024, arXiv:2408.15857.
- Kwak, J.; Sung, Y. DeepLabV3-Refiner-Based Semantic Segmentation Model for Dense 3D Point Clouds. Remote. Sens. 2021, 13, 1565.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2018, arXiv:1703.06870.
- Biber, P.; Straßer, W. The normal distributions transform: A new approach to laser scan matching. In Proceedings of the 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003) (Cat. No.03CH37453), Las Vegas, NV, USA, 27–31 October 2003; Volume 3, pp. 2743–2748.
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets Robotics: The KITTI Dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
- Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. arXiv 2020, arXiv:1903.11027.
- Zhou, L.; Li, Z.; Kaess, M. Automatic Extrinsic Calibration of a Camera and a 3D LiDAR Using Line and Plane Correspondences. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; IEEE Press: Piscataway, NJ, USA, 2018; pp. 5562–5569.
- Kim, E.s.; Park, S.Y. Extrinsic Calibration between Camera and LiDAR Sensors by Matching Multiple 3D Planes. Sensors 2020, 20, 52.
- Meilin, C.; Guotao, J.; Zhichao, P.; Qian, H.; Bin, T.; Shuo, L.; Hailang, Y. An Extrinsic Calibration Method for LiDAR and Camera Based on Feature Matching. Control. Inf. Technol. 2024, 102–108.
- Verma, S.; Berrio, J.S.; Worrall, S.; Nebot, E. Automatic extrinsic calibration between a camera and a 3D Lidar using 3D point and plane correspondences. arXiv 2019, arXiv:1904.12433.
- Mishra, S.; Osteen, P.R.; Pandey, G.; Saripalli, S. Experimental Evaluation of 3D-LIDAR Camera Extrinsic Calibration. arXiv 2020, arXiv:2007.01959.
- Wang, W.; Sun, R.; Wang, Z.; Hu, Z. A Step-By-Step Approach for Camera and Low-Resolution-3D-LiDAR Calibration. In Proceedings of the 2022 17th International Conference on Control, Automation, Robotics, and Vision (ICARCV), Las Vegas, NV, USA, 6–8 January 2023; IEEE: Piscataway, NJ, USA, 2022; pp. 555–561.
- Guo, J.; Liu, C.; Cui, X.; He, Z.; Wang, Z. Automatic Extrinsic Calibration for Lidar-Photoneo Camera Using a Hemispherical Calibration Board. arXiv 2023, arXiv:2304.09062.
- Zhang, X.; Luo, W.; Xu, Y. Camera–LiDAR Calibration Using Iterative Random Sampling and Intersection Line-Based Quality Evaluation. Electronics 2024, 13, 249.
- Wang, Z.; Li, M.; Yang, Y.; Zhang, Y.F.; Sörstedt, J.; Chen, Y. A two-step approach to lidar-camera calibration. In Proceedings of the 2021 IEEE 19th International Conference on Embedded and Ubiquitous Computing (EUC), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 201–207.
- Lao, Y.; Wei, S.; Liu, G.; Liu, C.; Yang, T. Enhanced Extrinsic Calibration Method for Camera-LiDAR Fusion and Monitoring of Safety Threats to Power Transmission Lines. Int. Arch. Photogramm. Remote. Sens. Spat. Inf. Sci. 2025, XLVIII-G-2025, 877–884.
- Zhang, B.; Zheng, Y.; Zhang, Z.; He, Q. LiDAR and Camera Calibration Using Pyramid and Checkerboard Calibrators. In Proceedings of the 2023 IEEE 8th International Conference on Big Data Analytics (ICBDA), Harbin, China, 3–5 March 2023; pp. 187–192.
- Cai, Y.; Zhan, Y.; Deng, W. A Novel Extrinsic Calibration Method of a Camera-And-LiDAR System. In Proceedings of the 2021 IEEE 7th International Conference on Virtual Reality (ICVR), Foshan, China, 20–22 May 2021; pp. 109–116.
- Xiao, Z.; Li, H.; Zhou, D.; Dai, Y.; Dai, B. Accurate extrinsic calibration between monocular camera and sparse 3D Lidar points without markers. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV), Los Angeles, CA, USA, 11–14 June 2017; pp. 424–429.
- Liu, X.; Yuan, C.; Zhang, F. Targetless Extrinsic Calibration of Multiple Small FoV LiDARs and Cameras Using Adaptive Voxelization. IEEE Trans. Instrum. Meas. 2022, 71, 8502612.
- Borer, J.; Tschirner, J.; Ölsner, F.; Milz, S. From Chaos to Calibration: A Geometric Mutual Information Approach to Target-Free Camera LiDAR Extrinsic Calibration. arXiv 2023, arXiv:2311.01905.
- Borer, J.; Tschirner, J.; Ölsner, F.; Milz, S. Continuous Online Extrinsic Calibration of Fisheye Camera and LiDAR. arXiv 2023, arXiv:2306.13240.
- Yoon, H.K.; Bae, J.M.; Kim, M.J.; Lee, B.C.; Ko, S.J. Targetless Multiple Camera-LiDAR Extrinsic Calibration using Object Pose Estimation. In Proceedings of the 2021 21st International Conference on Control, Automation and Systems (ICCAS), Xi’an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 576–581.
- Beltran, J.; Guindel, C.; de la Escalera, A.; Garcia, F. Automatic Extrinsic Calibration Method for LiDAR and Camera Sensor Setups. IEEE Trans. Intell. Transp. Syst. 2022, 23, 17677–17689.
- Ou, J.; Huang, P.; Zhou, J.; Zhao, Y.; Lin, L. Automatic Extrinsic Calibration of 3D LIDAR and Multi-Cameras Based on Graph Optimization. Sensors 2022, 22, 2221.
- Sen, A.; Pan, G.; Mitrokhin, A.; Islam, A. SceneCalib: Automatic Targetless Calibration of Cameras and Lidars in Autonomous Driving. arXiv 2023, arXiv:2304.05530.
- Liu, R.; Shi, J.; Zhang, H.; Zhang, J.; Sun, B. Causal calibration: Iteratively calibrating LiDAR and camera by considering causality and geometry. Complex Intell. Syst. 2023, 9, 7349–7363.
- Zeng, T.; He, D.; Yan, F.; He, M. YOCO: You Only Calibrate Once for Accurate Extrinsic Parameter in LiDAR-Camera Systems. arXiv 2024, arXiv:2407.18043.
- Tan, Z.; Zhang, X.; Teng, S.; Wang, L.; Gao, F. A Review of Deep Learning-Based LiDAR and Camera Extrinsic Calibration. Sensors 2024, 24, 3878.
- Lv, X.; Wang, B.; Ye, D.; Wang, S. LCCNet: LiDAR and Camera Self-Calibration using Cost Volume Network. arXiv 2021, arXiv:2012.13901.
- Jiang, P.; Osteen, P.; Saripalli, S. SemCal: Semantic LiDAR-Camera Calibration using Neural Mutual Information Estimator. arXiv 2021, arXiv:2109.10270.
- Rachman, A.; Seiler, J.; Kaup, A. End-to-End Lidar-Camera Self-Calibration for Autonomous Vehicles. arXiv 2023, arXiv:2304.12412.
- Wu, S.; Hadachi, A.; Vivet, D.; Prabhakar, Y. NetCalib: A Novel Approach for LiDAR-Camera Auto-calibration Based on Deep Learning. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2020.
- Open Robotics. message_filters-ROS 2 Documentation. Open Source Robotics Foundation. 2025. Available online: https://docs.ros.org/en/foxy/Tutorials/Beginner-Client-Libraries/Custom-ROS2-Interfaces.html (accessed on 18 June 2025).
- Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; IEEE: Piscataway, NJ, USA, 1999; Volume 2, pp. 1150–1157.
- Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded Up Robust Features. In Proceedings of the Computer Vision – ECCV 2006; Leonardis, A., Bischof, H., Pinz, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417.
- Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571.
- Baroffio, L.; Cesana, M.; Redondi, A.; Tagliasacchi, M.; Tubaro, S. Fast keypoint detection in video sequences. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1342–1346.
- Calonder, M.; Lepetit, V.; Strecha, C.; Fua, P. BRIEF: Binary robust independent elementary features. In Proceedings of the 11th European Conference on Computer Vision: Part IV, Crete, Greece, 5–11 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 778–792.
- Cui, L.; Ma, C. SOF-SLAM: A Semantic Visual SLAM for Dynamic Environments. IEEE Access 2019, 7, 166528–166539.
- Zhao, X.; Zuo, T.; Hu, X. OFM-SLAM: A Visual Semantic SLAM for Dynamic Indoor Environments. Math. Probl. Eng. 2021, 2021, 5538840.
- Yang, C.; Lyu, T. Research on Vision-based Semantic SLAM towards Indoor Dynamic Environment. In Proceedings of the 2022 International Conference on Computing, Communication, Perception and Quantum Technology (CCPQT), Xiamen, China, 5–7 August 2022; pp. 53–58.
- Liu, J.; Fan, X.; Fu, L.; Chen, Y. Visual SLAM technology based on weakly supervised semantic segmentation in dynamic environment. In Proceedings of the Seventh International Conference on Image and Graphics (ICIG 2020); SPIE: Bellingham, WA, USA, 2020; Volume 11574, p. 115740Q.
- Cheng, Q.; Zeller, N.; Cremers, D. Vision-Based Large-scale 3D Semantic Mapping for Autonomous Driving Applications. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 9235–9242.
- Peng, J.; Xiaoqiang, L.; Wei, G.; Ming, L. A Vision Based Multi-robot Cooperative Semantic SLAM Algorithm. In Proceedings of the 2022 34th Chinese Control and Decision Conference (CCDC), Hefei, China, 15–17 August 2022; pp. 5663–5668.
- Esparza, D.; Flores, G. The STDyn-SLAM: A Stereo Vision and Semantic Segmentation Approach for VSLAM in Dynamic Outdoor Environments. IEEE Access 2022, 10, 18201–18209.
- Song, X.; Liang, X.; Zhijiang, Z.; Huaidong, Z. A Object-augmented Semantic Mapping System for Indoor Mobile Robots. In Proceedings of the 2022 IEEE 2nd International Conference on Software Engineering and Artificial Intelligence (SEAI), Xiamen, China, 10–12 June 2022; pp. 225–229.
- Qin, L.; Wu, C.; Chen, Z.; Kong, X.; Lv, Z.; Zhao, Z. RSO-SLAM: A Robust Semantic Visual SLAM With Optical Flow in Complex Dynamic Environments. IEEE Trans. Intell. Transp. Syst. 2024, 25, 14669–14684.
- Li, X.; Ao, H.; Belaroussi, R.; Gruyer, D. Fast semi-dense 3D semantic mapping with monocular visual SLAM. In Proceedings of the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan, 16–19 October 2017; pp. 385–390.
- Zhang, C.; Liu, Z.; Liu, G.; Huang, D. Large-Scale 3D Semantic Mapping Using Monocular Vision. In Proceedings of the 2019 IEEE 4th International Conference on Image, Vision and Computing (ICIVC), Xiamen, China, 5–7 July 2019; pp. 71–76.
- Hu, S.; Li, D.; Tang, G.; Xu, X. A 3D Semantic Visual SLAM in Dynamic Scenes. In Proceedings of the 2021 6th IEEE International Conference on Advanced Robotics and Mechatronics (ICARM), Chongqing, China, 3–5 July 2021; pp. 522–528.
- Yang, K.; Jiang, Y.; Qi, L.; Fan, H.; Zhang, S.; Dong, J. Visual Semantic SLAM Based on Examination of Moving Consistency in Dynamic Scenes. In Proceedings of the 2022 4th International Conference on Data Intelligence and Security (ICDIS), Shenzhen, China, 24–26 August 2022; pp. 275–282.
- Liu, M.; Zou, Q.; Long, J.; Wang, Y.; Lin, M.; Wang, F. SSE-SLAM: Semantic Visual SLAM Based on RGB-D Camera for High-Accuracy Pose Measurement in Dynamic Environments. IEEE Trans. Instrum. Meas. 2025, 74, 5047611.
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. arXiv 2017, arXiv:1612.00593.
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. arXiv 2017, arXiv:1706.02413.
- Wu, B.; Wan, A.; Yue, X.; Keutzer, K. SqueezeSeg: Convolutional Neural Nets with Recurrent CRF for Real-Time Road-Object Segmentation from 3D LiDAR Point Cloud. arXiv 2017, arXiv:1710.07368.
- Milioto, A.; Vizzo, I.; Behley, J.; Stachniss, C. RangeNet++: Fast and Accurate LiDAR Semantic Segmentation. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 4213–4220.
- Zhang, Y.; Zhou, Z.; David, P.; Yue, X.; Xi, Z.; Gong, B.; Foroosh, H. PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation. arXiv 2020, arXiv:2003.14032.
- Cortinhal, T.; Tzelepis, G.; Aksoy, E.E. SalsaNext: Fast, Uncertainty-aware Semantic Segmentation of LiDAR Point Clouds for Autonomous Driving. arXiv 2024, arXiv:2003.03653.
- Lai, X.; Chen, Y.; Lu, F.; Liu, J.; Jia, J. Spherical Transformer for LiDAR-based 3D Recognition. arXiv 2023, arXiv:2303.12766.
- Zhang, J.; Liu, H.; Yang, K.; Hu, X.; Liu, R.; Stiefelhagen, R. CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers. arXiv 2023, arXiv:2203.04838.
- Ma, X.; Zhang, X.; Pun, M.O.; Liu, M. A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 5403215.
- Wang, Y.; Wu, Y.; Li, D.; Yu, W. Millimeter-Wave Radar and Vision Fusion-Based Semantic Simultaneous Localization and Mapping. IEEE Antennas Wirel. Propag. Lett. 2024, 23, 3977–3981.
- Broedermann, T.; Sakaridis, C.; Fu, Y.; Gool, L.V. CAFuser: Condition-Aware Multimodal Fusion for Robust Semantic Perception of Driving Scenes. arXiv 2025, arXiv:2410.10791.
- Tan, M.; Zhuang, Z.; Chen, S.; Li, R.; Jia, K.; Wang, Q.; Li, Y. EPMF: Efficient Perception-Aware Multi-Sensor Fusion for 3D Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8258–8273.
- Zhao, L.; Zhou, H.; Zhu, X.; Song, X.; Li, H.; Tao, W. LIF-Seg: LiDAR and Camera Image Fusion for 3D LiDAR Semantic Segmentation. arXiv 2021, arXiv:2108.07511.
- Sánchez-García, F.; Montiel-Marín, S.; Antunes-Garcia, M.; Gutiérrez-Moreno, R.; Llamazares Llamazares, Á.; Bergasa, L.M. SalsaNext+: A Multimodal-Based Point Cloud Semantic Segmentation With Range and RGB Images. IEEE Access 2025, 13, 64133–64147.
- Lu, Z.; Cao, B.; Hu, Q. LiDAR-Camera Continuous Fusion in Voxelized Grid for Semantic Scene Completion. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 12330–12344.
- Li, Q.; Wang, Y.; Wang, Y.; Zhao, H. HDMapNet: An Online HD Map Construction and Evaluation Framework. arXiv 2022, arXiv:2107.06307.
- Sauerbeck, F.; Kulmer, D.; Pielmeier, M.; Leitenstern, M.; Weiß, C.; Betz, J. Multi-LiDAR Localization and Mapping Pipeline for Urban Autonomous Driving. arXiv 2023, arXiv:2311.01823.
- Sanchez, J.; Deschaud, J.E.; Goulette, F. 3DLabelProp: Geometric-Driven Domain Generalization for LiDAR Semantic Segmentation in Autonomous Driving. arXiv 2025, arXiv:2501.14605.
- Qiu, S.; Li, X.; Xue, X.; Pu, J. PC-BEV: An Efficient Polar-Cartesian BEV Fusion Framework for LiDAR Semantic Segmentation. arXiv 2024, arXiv:2412.14821.
- Park, J.; Kim, C.; Jo, K. PCSCNet: Fast 3D Semantic Segmentation of LiDAR Point Cloud for Autonomous Car using Point Convolution and Sparse Convolution Network. arXiv 2022, arXiv:2202.10047.
- Razani, R.; Cheng, R.; Taghavi, E.; Bingbing, L. Lite-HDSeg: LiDAR Semantic Segmentation Using Lite Harmonic Dense Convolutions. arXiv 2021, arXiv:2103.08852.
- Nowruzi, F.E.; Kolhatkar, D.; Kapoor, P.; Heravi, E.J.; Hassanat, F.A.; Laganiere, R.; Rebut, J.; Malik, W. PolarNet: Accelerated Deep Open Space Segmentation Using Automotive Radar in Polar Domain. arXiv 2021, arXiv:2103.03387.
- Maturana, D.; Scherer, S. VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition. In Proceedings of the (IROS) IEEE/RSJ International Conference on Intelligent Robots and Systems, Hamburg, Germany, 28 September–2 October 2015; pp. 922–928. [Google Scholar]
- Komorowski, J.; Wysoczanska, M.; Trzcinski, T. MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition. arXiv 2021, arXiv:2104.05327. [Google Scholar]
- Rosu, R.A.; Schütt, P.; Quenzel, J.; Behnke, S. LatticeNet: Fast Spatio-Temporal Point Cloud Segmentation Using Permutohedral Lattices. arXiv 2021, arXiv:2108.03917. [Google Scholar] [CrossRef]
- Zhou, H.; Zhu, X.; Song, X.; Ma, Y.; Wang, Z.; Li, H.; Lin, D. Cylinder3D: An Effective 3D Framework for Driving-scene LiDAR Semantic Segmentation. arXiv 2020, arXiv:2008.01550. [Google Scholar]
- Shi, S.; Jiang, L.; Deng, J.; Wang, Z.; Guo, C.; Shi, J.; Wang, X.; Li, H. PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation for 3D Object Detection. Int. J. Comput. Vision 2022, 131, 531–551. [Google Scholar] [CrossRef]
- Tang, H.; Liu, Z.; Zhao, S.; Lin, Y.; Lin, J.; Wang, H.; Han, S. Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution. In Computer Vision—ECCV 2020; Springer: Berlin/Heidelberg, Germany, 2020.
- Qian, G.; Li, Y.; Peng, H.; Mai, J.; Hammoud, H.A.A.K.; Elhoseiny, M.; Ghanem, B. PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies. arXiv 2022, arXiv:2206.04670.
- Thomas, H.; Tsai, Y.H.H.; Barfoot, T.D.; Zhang, J. KPConvX: Modernizing Kernel Point Convolution with Kernel Attention. arXiv 2024, arXiv:2405.13194.
- Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds. arXiv 2020, arXiv:1911.11236.
- Yan, X.; Gao, J.; Zheng, C.; Zheng, C.; Zhang, R.; Cui, S.; Li, Z. 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds. arXiv 2022, arXiv:2207.04397.
- Zhao, H.; Jiang, L.; Jia, J.; Torr, P.; Koltun, V. Point Transformer. arXiv 2021, arXiv:2012.09164.
- Wu, X.; Lao, Y.; Jiang, L.; Liu, X.; Zhao, H. Point Transformer V2: Grouped Vector Attention and Partition-based Pooling. In Proceedings of the Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 33330–33342.
- Wu, X.; Jiang, L.; Wang, P.S.; Liu, Z.; Liu, X.; Qiao, Y.; Ouyang, W.; He, T.; Zhao, H. Point Transformer V3: Simpler, Faster, Stronger. arXiv 2024, arXiv:2312.10035.
- Ni, P.; Li, X.; Xu, W.; Kong, D.; Hu, Y.; Wei, K. Robust 3D Semantic Segmentation Based on Multi-Phase Multi-Modal Fusion for Intelligent Vehicles. IEEE Trans. Intell. Veh. 2024, 9, 1602–1614.
- Zhan, Q.M.; Dong, Z.; Wang, J.X.; Zhu, L.J. Fusion of images and point clouds for the semantic segmentation of large-scale 3D scenes based on deep learning. ISPRS J. Photogramm. Remote Sens. 2018, 143, 85–101.
- Lee, J.S.; Park, T.H. Fast Road Detection by CNN-Based Camera–Lidar Fusion and Spherical Coordinate Transformation. IEEE Trans. Intell. Transp. Syst. 2021, 22, 5802–5810.
- Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential Fusion for 3D Object Detection. arXiv 2020, arXiv:1911.10150.
- Xu, S.; Zhou, D.; Fang, J.; Yin, J.; Bin, Z.; Zhang, L. FusionPainting: Multimodal Fusion with Adaptive Attention for 3D Object Detection. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021.
- Li, J.; Dai, H.; Han, H.; Ding, Y. MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving. arXiv 2023, arXiv:2303.08600.
- Valada, A.; Mohan, R.; Burgard, W. Self-Supervised Model Adaptation for Multimodal Semantic Segmentation. Int. J. Comput. Vis. 2019, 128, 1239–1285.
- Schieber, H.; Duerr, F.; Schoen, T.; Beyerer, J. Deep Sensor Fusion with Pyramid Fusion Networks for 3D Semantic Segmentation. arXiv 2022, arXiv:2205.13629.
- Gu, S.; Lu, T.; Zhang, Y.; Alvarez, J.M.; Yang, J.; Kong, H. 3-D LiDAR + Monocular Camera: An Inverse-Depth-Induced Fusion Framework for Urban Road Detection. IEEE Trans. Intell. Veh. 2018, 3, 351–360.
- Jaritz, M.; Vu, T.H.; de Charette, R.; Wirbel, E.; Pérez, P. xMUDA: Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation. arXiv 2020, arXiv:1911.12676.
- Park, J.; Yoo, H.; Wang, Y. Drivable Dirt Road Region Identification Using Image and Point Cloud Semantic Segmentation Fusion. IEEE Trans. Intell. Transp. Syst. 2022, 23, 13203–13216.
- Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR Object Candidates Fusion for 3D Object Detection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 10386–10393.
- Luo, Y.; Han, T.; Liu, Y.; Su, J.; Chen, Y.; Li, J.; Wu, Y.; Cai, G. CSFNet: Cross-Modal Semantic Focus Network for Semantic Segmentation of Large-Scale Point Clouds. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5701415.
- Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9297–9307.
- Geyer, J.; Kassahun, Y.; Mahmudi, M.; Ricou, X.; Durgesh, R.; Chung, A.S.; Hauswald, L.; Pham, V.H.; Mühlegg, M.; Dorn, S.; et al. A2D2: Audi Autonomous Driving Dataset. arXiv 2020, arXiv:2004.06320.
- Qin, T.; Zheng, Y.; Chen, T.; Chen, Y.; Su, Q. A Light-Weight Semantic Map for Visual Localization towards Autonomous Driving. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11248–11254.
- Schaefer, A.; Büscher, D.; Vertens, J.; Luft, L.; Burgard, W. Long-Term Urban Vehicle Localization Using Pole Landmarks Extracted from 3-D Lidar Scans. In Proceedings of the 2019 European Conference on Mobile Robots (ECMR), Prague, Czech Republic, 4–6 September 2019; pp. 1–7.
- Shan, T.; Englot, B. LeGO-LOAM: Lightweight and Ground-Optimized Lidar Odometry and Mapping on Variable Terrain. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 4758–4765.
- Chen, S.W.; Nardari, G.V.; Lee, E.S.; Qu, C.; Liu, X.; Romero, R.A.F.; Kumar, V. SLOAM: Semantic Lidar Odometry and Mapping for Forest Inventory. arXiv 2019, arXiv:1912.12726.
- Yi, S.; Lyu, Y.; Hua, L.; Pan, Q.; Zhao, C. Light-LOAM: A Lightweight LiDAR Odometry and Mapping based on Graph-Matching. arXiv 2023, arXiv:2310.04162.
- Feng, Y.; Jiang, Z.; Shi, Y.; Feng, Y.; Chen, X.; Zhao, H.; Zhou, G. Block-Map-Based Localization in Large-Scale Environment. arXiv 2024, arXiv:2404.18192.
- Koide, K.; Oishi, S.; Yokozuka, M.; Banno, A. Tightly Coupled Range Inertial Localization on a 3D Prior Map Based on Sliding Window Factor Graph Optimization. arXiv 2024, arXiv:2402.05540.
- Zhang, J.; Singh, S. LOAM: Lidar Odometry and Mapping in real-time. Robot. Sci. Syst. Conf. 2014, 2, 109–111.
- Cao, F.; Wu, H.; Wu, C. An End-to-End Localizer for Long-Term Topological Localization in Large-Scale Changing Environments. IEEE Trans. Ind. Electron. 2023, 70, 5140–5149.
- Kong, D.; Li, X.; Hu, Y.; Xu, Q.; Wang, A.; Hu, W. Learning a Novel LiDAR Submap-Based Observation Model for Global Positioning in Long-Term Changing Environments. IEEE Trans. Ind. Electron. 2023, 70, 3147–3157.
- Blumenthal, D.B.; Gamper, J. On the exact computation of the graph edit distance. Pattern Recognit. Lett. 2020, 134, 46–57.
- Bai, Y.; Ding, H.; Bian, S.; Chen, T.; Sun, Y.; Wang, W. SimGNN: A Neural Network Approach to Fast Graph Similarity Computation. arXiv 2020, arXiv:1808.05689.
- Kong, X.; Yang, X.; Zhai, G.; Zhao, X.; Zeng, X.; Wang, M.; Liu, Y.; Li, W.; Wen, F. Semantic Graph Based Place Recognition for 3D Point Clouds. arXiv 2020, arXiv:2008.11459.
- Pramatarov, G.; De Martini, D.; Gadd, M.; Newman, P. BoxGraph: Semantic place recognition and pose estimation from 3D LiDAR. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 7004–7011.
- Wu, S.C.; Wald, J.; Tateno, K.; Navab, N.; Tombari, F. SceneGraphFusion: Incremental 3D Scene Graph Prediction from RGB-D Sequences. arXiv 2021, arXiv:2103.14898.
- Arandjelović, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. arXiv 2016, arXiv:1511.07247.
- Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-Net: A Trainable CNN for Joint Detection and Description of Local Features. arXiv 2019, arXiv:1905.03561.
- Wang, R.; Shen, Y.; Zuo, W.; Zhou, S.; Zheng, N. TransVPR: Transformer-based place recognition with multi-level attention aggregation. arXiv 2022, arXiv:2201.02001.
- Ali-bey, A.; Chaib-draa, B.; Giguère, P. MixVPR: Feature Mixing for Visual Place Recognition. arXiv 2023, arXiv:2303.02190.
- DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-Supervised Interest Point Detection and Description. arXiv 2018, arXiv:1712.07629.
- Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning Feature Matching with Graph Neural Networks. arXiv 2020, arXiv:1911.11763.
- Hausler, S.; Garg, S.; Xu, M.; Milford, M.; Fischer, T. Patch-NetVLAD: Multi-Scale Fusion of Locally-Global Descriptors for Place Recognition. arXiv 2021, arXiv:2103.01486.
- Khaliq, A.; Milford, M.; Garg, S. MultiRes-NetVLAD: Augmenting Place Recognition Training With Low-Resolution Imagery. IEEE Robot. Autom. Lett. 2022, 7, 3882–3889.
- Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-Free Local Feature Matching with Transformers. arXiv 2021, arXiv:2104.00680.
- Woo, S.; Kim, S.W. Context-Based Visual-Language Place Recognition. arXiv 2024, arXiv:2410.19341.
- Sferrazza, D.; Berton, G.; Trivigno, G.; Masone, C. To Match or Not to Match: Revisiting Image Matching for Reliable Visual Place Recognition. arXiv 2025, arXiv:2504.06116.
- Truhlařík, V.; Pivoňka, T.; Kasarda, M.; Přeučil, L. Multi-Platform Teach-and-Repeat Navigation by Visual Place Recognition Based on Deep-Learned Local Features. arXiv 2025, arXiv:2503.13090.
- Uy, M.A.; Lee, G.H. PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition. arXiv 2018, arXiv:1804.03492.
- Zhang, W.; Xiao, C. PCAN: 3D Attention Map Learning Using Contextual Information for Point Cloud Based Retrieval. arXiv 2019, arXiv:1904.09793.
- Liu, Z.; Zhou, S.; Suo, C.; Liu, Y.; Yin, P.; Wang, H.; Liu, Y.H. LPD-Net: 3D Point Cloud Learning for Large-Scale Place Recognition and Environment Analysis. arXiv 2019, arXiv:1812.07050.
- Du, J.; Wang, R.; Cremers, D. DH3D: Deep Hierarchical 3D Descriptors for Robust Large-Scale 6DoF Relocalization. arXiv 2020, arXiv:2007.09217.
- Xia, Y.; Xu, Y.; Li, S.; Wang, R.; Du, J.; Cremers, D.; Stilla, U. SOE-Net: A Self-Attention and Orientation Encoding Network for Point Cloud based Place Recognition. arXiv 2021, arXiv:2011.12430.
- Hui, L.; Yang, H.; Cheng, M.; Xie, J.; Yang, J. Pyramid Point Cloud Transformer for Large-Scale Place Recognition. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 6078–6087.
- Lyu, J.; Li, J.; Chen, D.; Zhang, Y.; Liu, J.; Guo, Y.; Xu, Y.; Zhu, Y. An Efficient 3D Point Cloud-Based Place Recognition Approach for Underground Tunnels Using Convolution and Self-Attention Mechanism. J. Field Robot. 2024, 42, 1537–1549.
- Li, L.; Kong, X.; Zhao, X.; Huang, T.; Li, W.; Wen, F.; Zhang, H.; Liu, Y. RINet: Efficient 3D Lidar-Based Place Recognition Using Rotation Invariant Neural Network. IEEE Robot. Autom. Lett. 2022, 7, 4321–4328.
- Vidanapathirana, K.; Ramezani, M.; Moghadam, P.; Sridharan, S.; Fookes, C. LoGG3D-Net: Locally Guided Global Descriptor Learning for 3D Place Recognition. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2215–2221.
- Hui, L.; Cheng, M.; Xie, J.; Yang, J.; Cheng, M.M. Efficient 3D Point Cloud Feature Learning for Large-Scale Place Recognition. IEEE Trans. Image Process. 2022, 31, 1258–1270.
- Komorowski, J. MinkLoc3D: Point Cloud Based Large-Scale Place Recognition. arXiv 2020, arXiv:2011.04530.
- Zywanowski, K.; Banaszczyk, A.; Nowicki, M.R.; Komorowski, J. MinkLoc3D-SI: 3D LiDAR Place Recognition With Sparse Convolutions, Spherical Coordinates, and Intensity. IEEE Robot. Autom. Lett. 2022, 7, 1079–1086.
- Komorowski, J.; Wysoczanska, M.; Trzcinski, T. EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale. arXiv 2021, arXiv:2110.12486.
- Zhou, Z.; Zhao, C.; Adolfsson, D.; Su, S.; Gao, Y.; Duckett, T.; Sun, L. NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 5654–5660.
- Xu, T.X.; Guo, Y.C.; Li, Z.; Yu, G.; Lai, Y.K.; Zhang, S.H. TransLoc3D: Point Cloud based Large-scale Place Recognition using Adaptive Receptive Fields. arXiv 2022, arXiv:2105.11605.
- Cattaneo, D.; Vaghi, M.; Valada, A. LCDNet: Deep Loop Closure Detection and Point Cloud Registration for LiDAR SLAM. arXiv 2022, arXiv:2103.05056.
- Fan, Z.; Song, Z.; Liu, H.; Lu, Z.; He, J.; Du, X. SVT-Net: Super Light-Weight Sparse Voxel Transformer for Large Scale Place Recognition. arXiv 2021, arXiv:2105.00149.
- Fu, C.; Li, L.; Mei, J.; Ma, Y.; Peng, L.; Zhao, X.; Liu, Y. A Coarse-to-Fine Place Recognition Approach using Attention-guided Descriptors and Overlap Estimation. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 8493–8499.
- Dubé, R.; Dugas, D.; Stumm, E.; Nieto, J.; Siegwart, R.; Cadena, C. SegMatch: Segment based place recognition in 3D point clouds. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 5266–5272.
- Dubé, R.; Cramariuc, A.; Dugas, D.; Sommer, H.; Dymczyk, M.; Nieto, J.; Siegwart, R.; Cadena, C. SegMap: Segment-based mapping and localization using data-driven descriptors. Int. J. Robot. Res. 2019, 39, 339–355.
- Zaganidis, A.; Zerntev, A.; Duckett, T.; Cielniak, G. Semantically Assisted Loop Closure in SLAM Using NDT Histograms. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; IEEE Press: Piscataway, NJ, USA, 2019; pp. 4562–4568.
- Jiang, J.; Wang, J.; Wang, P.; Bao, P.; Chen, Z. LiPMatch: LiDAR Point Cloud Plane Based Loop-Closure. IEEE Robot. Autom. Lett. 2020, 5, 6861–6868.
- Tomono, M. Loop detection for 3D LiDAR SLAM using segment-group matching. Adv. Robot. 2020, 34, 1530–1544.
- Vidanapathirana, K.; Moghadam, P.; Harwood, B.; Zhao, M.; Sridharan, S.; Fookes, C. Locus: LiDAR-based Place Recognition using Spatiotemporal Higher-Order Pooling. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 5075–5081.
- Li, L.; Kong, X.; Zhao, X.; Huang, T.; Liu, Y. SSC: Semantic Scan Context for Large-Scale Place Recognition. arXiv 2021, arXiv:2107.00382.
- Fan, Y.; Yuan, H.; Zhu, S.; Zhou, G.; Du, R.; Gu, J. A Semantic-Based Loop Closure Detection of 3D Point Cloud. In Proceedings of the 2021 IEEE International Conference on Robotics and Biomimetics (ROBIO), Sanya, China, 27–31 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1184–1189.
- Yin, P.; Xu, L.; Feng, Z.; Egorov, A.; Li, B. PSE-Match: A Viewpoint-free Place Recognition Method with Parallel Semantic Embedding. arXiv 2021, arXiv:2108.00552.
- Song, T.; He, S.; Wu, X. Semantic Assisted Loop Closure Detection for Automated Driving. In CICTP 2022; American Society of Civil Engineers: Reston, VA, USA, 2022; pp. 690–698.
- Yin, H.; Ding, X.; Tang, L.; Wang, Y.; Xiong, R. Efficient 3D LIDAR based loop closing using deep neural network. In Proceedings of the 2017 IEEE International Conference on Robotics and Biomimetics (ROBIO), Macau, Macao, 5–8 December 2017; pp. 481–486.
- Yin, H.; Tang, L.; Ding, X.; Wang, Y.; Xiong, R. LocNet: Global Localization in 3D Point Clouds for Mobile Vehicles. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 728–733.
- Sun, T.; Liu, M.; Ye, H.; Yeung, D.Y. Point-cloud-based place recognition using CNN feature extraction. arXiv 2018, arXiv:1810.09631.
- Chen, X.; Läbe, T.; Milioto, A.; Röhling, T.; Vysotska, O.; Haag, A.; Behley, J.; Stachniss, C. OverlapNet: Loop Closing for LiDAR-based SLAM. In Proceedings of Robotics: Science and Systems XVI; Robotics: Science and Systems Foundation, 2020. arXiv:2105.11344.
- Xu, X.; Yin, H.; Chen, Z.; Wang, Y.; Xiong, R. DiSCO: Differentiable Scan Context with Orientation. arXiv 2021, arXiv:2010.10949.
- Ma, J.; Zhang, J.; Xu, J.; Ai, R.; Gu, W.; Chen, X. OverlapTransformer: An Efficient and Yaw-Angle-Invariant Transformer Network for LiDAR-Based Place Recognition. IEEE Robot. Autom. Lett. 2022, 7, 6958–6965.
- Yin, P.; Xu, L.; Zhang, J.; Choset, H. FusionVLAD: A Multi-view Deep Fusion Networks for Viewpoint-free 3D Place Recognition. IEEE Robot. Autom. Lett. 2021, 6, 2304–2310.
- Yin, P.; Wang, F.; Egorov, A.; Hou, J.; Jia, Z.; Han, J. Fast Sequence-matching Enhanced Viewpoint-invariant 3D Place Recognition. IEEE Trans. Ind. Electron. 2021, 69, 2127–2135.
- Zhao, S.; Yin, P.; Yi, G.; Scherer, S. SphereVLAD++: Attention-based and Signal-enhanced Viewpoint Invariant Descriptor. arXiv 2022, arXiv:2207.02958.
- Lu, S.; Xu, X.; Tang, L.; Xiong, R.; Wang, Y. DeepRING: Learning Roto-translation Invariant Representation for LiDAR based Place Recognition. arXiv 2022, arXiv:2210.11029.
- Luo, L.; Zheng, S.; Li, Y.; Fan, Y.; Yu, B.; Cao, S.; Shen, H. BEVPlace: Learning LiDAR-based Place Recognition using Bird’s Eye View Images. arXiv 2023, arXiv:2302.14325.
- Barros, T.; Garrote, L.; Pereira, R.; Premebida, C.; Nunes, U.J. AttDLNet: Attention-Based Deep Network for 3D LiDAR Place Recognition. In Proceedings of the ROBOT2022: Fifth Iberian Robotics Conference, Zaragoza, Spain, 23–25 November 2022; Springer International Publishing: Berlin/Heidelberg, Germany, 2022; pp. 309–320.
- Zhu, Y.; Ma, Y.; Chen, L.; Liu, C.; Ye, M.; Li, L. GOSMatch: Graph-of-Semantics Matching for Detecting Loop Closures in 3D LiDAR data. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 5151–5157.
- Gong, Y.; Sun, F.; Yuan, J.; Zhu, W.; Sun, Q. A Two-Level Framework for Place Recognition with 3D LiDAR Based on Spatial Relation Graph. Pattern Recognit. 2021, 120, 108171.
- Cui, J.; Huang, T.; Cai, Y.; Zhao, J.; Xiong, L.; Yu, Z. DSC: Deep Scan Context Descriptor for Large-Scale Place Recognition. arXiv 2021, arXiv:2111.13838.
- Yu, J.; Shen, S. SemanticLoop: Loop closure with 3D semantic graph matching. arXiv 2022, arXiv:2211.11977.
- Wang, S.; Kang, Q.; She, R.; Zhao, K.; Song, Y.; Tay, W.P. PRFusion: Toward Effective and Robust Multi-Modal Place Recognition with Image and Point Cloud Fusion. arXiv 2024, arXiv:2410.04939.
- Xu, J.; Ma, J.; Wu, Q.; Zhou, Z.; Wang, Y.; Chen, X.; Pei, L. Explicit Interaction for Fusion-Based Place Recognition. arXiv 2024, arXiv:2402.17264.
- Jung, M.; Fu, L.F.T.; Fallon, M.; Kim, A. ImLPR: Image-based LiDAR Place Recognition using Vision Foundation Models. arXiv 2025, arXiv:2505.18364.
- Melekhin, A.; Yudin, D.; Petryashin, I.; Bezuglyj, V. MSSPlace: Multi-Sensor Place Recognition with Visual and Text Semantics. arXiv 2024, arXiv:2407.15663.
- Qi, Z.; Cheng, L.; Zhou, Z.; Xiong, G. LRFusionPR: A Polar BEV-Based LiDAR-Radar Fusion Network for Place Recognition. arXiv 2025, arXiv:2504.19186.
- Zhou, Z.; Xu, J.; Xiong, G.; Ma, J. LCPR: A Multi-Scale Attention-Based LiDAR-Camera Fusion Network for Place Recognition. IEEE Robot. Autom. Lett. 2024, 9, 1342–1349.
- Maddern, W.; Pascoe, G.; Linegar, C.; Newman, P. 1 Year, 1000 km: The Oxford RobotCar Dataset. Int. J. Robot. Res. 2017, 36, 3–15.
- Burnett, K.; Yoon, D.J.; Wu, Y.; Li, A.Z.; Zhang, H.; Lu, S.; Qian, J.; Tseng, W.K.; Lambert, A.; Leung, K.Y.; et al. Boreas: A multi-season autonomous driving dataset. Int. J. Robot. Res. 2023, 42, 33–42.
- Liao, Y.; Xie, J.; Geiger, A. KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3292–3310.
- Jeong, J.; Cho, Y.; Shin, Y.S.; Roh, H.; Kim, A. Complex urban dataset with multi-level sensors from highly diverse urban environments. Int. J. Robot. Res. 2019, 38, 642–657.
- Udacity. Udacity Self-Driving Car Dataset. 2016. Available online: https://github.com/udacity/self-driving-car (accessed on 7 August 2025).
- Huang, X.; Wang, P.; Cheng, X.; Zhou, D.; Geng, Q.; Yang, R. The ApolloScape Open Dataset for Autonomous Driving and Its Application. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2702–2719.
- Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in Perception for Autonomous Driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 2443–2451.
- Yan, Z.; Sun, L.; Krajnik, T.; Ruichek, Y. EU Long-term Dataset with Multiple Sensors for Autonomous Driving. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Virtual, 25–29 October 2020.
- Kim, G.; Park, Y.S.; Cho, Y.; Jeong, J.; Kim, A. MulRan: Multimodal Range Dataset for Urban Place Recognition. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020.
- Xiao, P.; Shao, Z.; Hao, S.; Zhang, Z.; Chai, X.; Jiao, J.; Li, Z.; Wu, J.; Sun, K.; Jiang, K.; et al. PandaSet: Advanced Sensor Suite Dataset for Autonomous Driving. arXiv 2021, arXiv:2112.12610.
- ISO 26262:2018; Road Vehicles—Functional Safety—Part 1: Vocabulary. International Organization for Standardization: Geneva, Switzerland, 2018.
- Kim, G.; Kim, A. Scan Context: Egocentric Spatial Descriptor for Place Recognition Within 3D Point Cloud Map. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 4802–4809.
- Kim, G.; Choi, S.; Kim, A. Scan Context++: Structural Place Recognition Robust to Rotation and Lateral Variations in Urban Environments. arXiv 2021, arXiv:2109.13494.
- He, L.; Wang, X.; Zhang, H. M2DP: A novel 3D point cloud descriptor and its application in loop closure detection. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016; pp. 231–237.
- Zhu, S.; Shah, M.; Chen, C. TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization. arXiv 2022, arXiv:2204.00097.

| Paper | Sensor | Obstacle Type | Mapping Type | Object Detection | Data Association | Semantic Segmentation | Semantic Representation | Loop Closure | Localization | Dataset | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [11] | L-R | Dynamic | X | 3D Detection | Cross-modal fusion | X | Object-level (3D bounding boxes) | X | X | VoD - K-Radar Adverse | 2025 |
| [12] | C-R | Static Dynamic | Metric | Dyn. Cluster ORB-SLAM3 | Point-to-point Point-to-frame | DBSCAN | X | Visual Appearance | Feature Tracking | Dynamic Scenarios | 2025 |
| [13] | L-I-G-R | Static Dynamic | Metric | YOLOv8-assisted | Fusion-based; geometric matching | ✓ | X | Observation Resilient Weighted | IEKF | MSCRAD 4R | 2025 |
| [14] | L-C-I | Static | Geometric | YOLOv8-Seg | Object-level | Object-level | ORB-SLAM3 | ✓ | LVI-ObjSemant. | Experiment | 2025 |
| [15] | C | Dynamic | Semantic Topological | YOLOv8 | DHDP | X | ORB-SLAM2 ORB-SLAM3 | ✓ | Landmarks | KITTI TUM | 2025 |
| [16] | C-L | Static | Semantic Topological Point cloud | PSPNet | X | PSPNet CBAM | Region-growing | X | LOAM | Cityscapes KITTI | 2025 |
| [17] | C-L | Dynamic Semi-static | Metric-scaled feature-based trajectory | ORB-SLAM2 SuperPoint | SuperGlue | DeepLabv3 DBSCAN | ViL SLAM | ✓ | ViL SLAM | CQU | 2025 |
| [18] | C-L-I | X | Metric–semantic map | X | X | CNN | TSDF | X | Prior map-based | S. KITTI S.USL | 2025 |
| [19] | L-G-R-I | Static | Semantic HD map | Mask | Object-level | BiSeNetV2 | Extracting urban features | X | Vis Mono | KAIST | 2024 |
| [20] | L-C-G-R | Static | 3D semantic prior map | ORB3 | Match Images to point cloud | X | Point-to-plane ICP | X | ICP camera pose | KITTI Own | 2023 |
| [21] | L-C | Dynamic | 3D semantic map | X | 3D point to 2D pixel | DeepLabv3 | Multiple frames | X | LeGO-LOAM | KITTI | 2023 |
| [22] | C-L | Static | 2D–3D hybrid semantic map | YOLOv2 | Object augmented | Inception v3 - LSTM | Labeled scene with Bayesian estimation | Temporal Spatial Association | EKF | X | 2023 |
| [23] | C-E-I | Static | Vectorial Paramet. Semantic map | Sem-LSD | Object-level/ DBSCAN SORT | DeepLabv3+ Sem-LSD | Semantic feature around vehicle center | X | Sliding window | nuScenes | 2023 |
| [24] | C-L-G-I | Static | Urban Structured Semantic map | ORB-SLAM2 | X | ResNet | X | Graph map | | KITTI | 2023 |
| [25] | C-L-I WE | Static dynamic | Voxelized 3D Semantic map | CNN | Motion correction synchronize | ENet CNN | Probabilities with super pixel | X | X | USyd | 2022 |
| [26] | C | Static | Structured Semantic map | YOLOv2 | Object-level | X | Include semantic frames into a database | Factor Graph SLAM | X | TUM RGB-D | 2021 |
| [27] | R-C | Static dynamic | 3D dense map | - | Geometric cue associated to object model | RGB (Mask R-CNN) and depth (bilateral filter, geometric segmentation, erode and dilate processing) | X | X | | TUM RGB-D | 2021 |
| [28] | C-L | Dynamic | Classic Geometric Visual map | Based on clustering | Tracked the corners of bounding box | Ground segmentation, clustering | Dynamic mask Gaussian blur kernel | X | - | KITTI Tracking | 2021 |
| [29] | C-L | X | 2D Probabilistic Semantic map | X | X | DeepLabv3+ HRNet+OCR | Occupation grid with BEV local and global map | X | X | Own Data | 2020 |
| [30] | R-C | Static | ORB-SLAM2 | YOLOv3 | Point–object Object–object | Object detection with depth histogram of 2D BB | X | Semantic Bundle Adjustment | X | X | 2020 |
| [31] | C-L-I | Static | Global metric semantic map | X | X | CNN | Projection of semantic label on the metric map | X | EKF | KITTI | 2020 |
| [32] | L-C-I | Static | 3D accuracy Semantic map | CNN | Based on voxels and timestamp | ResNet50 | Data-level fusion method | X | LOAM | KITTI | 2020 |
| [33] | R-C | Static | Object-oriented semantic map | YOLOv2 | Ratio of object instance with target label in the BB | Class probability of object is assigned to each segment in the BB | InfiniTAM v3 | X | X | Tested indoors | 2019 |
| [34] | R-C | Static | 3D object detector | Dense SIFT | Inter-frame geometry consistency constraint | Object semantic annotation projected onto 3D point | Mask R-CNN ORB-SLAM2 | ORB-SLAM2 | X | ICL | 2019 |
| [7] | L-C | Static | Occupancy grid map | X | Conv–deconv fusion network | SegNet | Semantic image Occupancy grid | X | X | KITTI Own | 2018 |
| [35] | R-C | Office objects | Static | Tiny YOLO | KD structure Bayesian process | Relationship between keyframe and objects | Octomap accelerated using FLRA | X | ORB-SLAM2 | TUM RGB-D | 2018 |
| [36] | S-C | Cars, trucks | Dynamic separated from static | Multi-task network cascade (MNC) | - | X | Dense map with ELA, DispNet, InfiniTAM | X | X | KITTI | 2018 |
| [37] | C | Cars, pedestrians, buildings, poles, signs | Dynamic | - | Match feature point in semantic images (kernels) | SegNet + Voxel CRF project pc onto image | Built from disparity map | X | Visual odometry | KITTI | 2018 |
| [38] | ML-MC | Curbs, lanes, sidewalks, ground, parking | Static | CNN | Computing height of the curb inside the ROI | ERFNet | Graph forest | Projection point cloud onto semantic images | X | X | 2018 |
| [39] | R-C | Indoor objects | Static | VGG-16 | Based on Bayesian approach | CNN | RGB - SF-CRF | X | X | NYUv2 Own | 2018 |

| Ref. | Method Type | Require Target | Scene Dependency | Automation Level | Online Capability | Calibration Domain | Spatial–Temporal Modeling | Multi-Sensor Scalability | Temporal Sync. | Evaluation Metric/Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| [50] | Target-based | ✓ | Low | Medium | X | 3D–2D | X | X | X | Reprojection error |
| [51] | Target-based | ✓ | Low-Medium | Medium | X | 3D–3D | X | ✓ | X | Plane alignment accuracy |
| [52] | Target-based | ✓ | Medium | Medium | X | 3D–3D | X | X | X | Center detection error |
| [53] | Target-based | ✓ | Medium | Medium | X | 3D–2D | X | X | X | Reprojection residuals |
| [54] | Target-based | ✓ | Low-Medium | Medium | X | 3D–2D | X | X | X | RMSE |
| [55] | Target-based | ✓ | Medium | Medium | X | 3D–3D | X | X | X | Translation/rotation error |
| [56] | Target-based | ✓ | Low-Medium | Medium | X | 3D–3D | X | X | X | Reprojection error |
| [57] | Target-based | ✓ | Medium | Medium | X | 3D–2D | X | X | X | Pose alignment quality |
| [58] | Target-based | ✓ | Medium | Medium | X | 3D–2D | X | X | X | Calibration stability |
| [59] | Target-based | ✓ | Low-Medium | Medium | X | 3D–3D | X | ✓ | X | Reprojection error |
| [60] | Target-based | ✓ | Low-Medium | Medium | X | 3D–2D | X | X | X | Feature correspondence acc. |
| [61] | Target-based | ✓ | Low-Medium | Medium | X | 3D–3D | X | X | H | Cross-correlation |
| [62] | Targetless | X | High | Medium | X | 3D–2D | X | ✓ | S | Alignment error |
| [63] | Targetless | X | High | High | Limited | 3D–3D | X | X | H | Runtime/accuracy |
| [64] | Targetless | X | High | High | X | 3D–3D | X | X | X | Initialization error |
| [65] | Targetless | X | Medium | High | ✓ | 3D–3D | ✓ | X | J | Reprojection error |
| [66] | Targetless | X | High | High | X | 3D–3D | ✓ | X | J | RMSE |
| [67] | Targetless | X | Medium-High | High | X | 3D–3D | ✓ | ✓ | J | Temporal drift |
| [68] | Targetless | X | Medium | High | ✓ | 3D–3D | ✓ | ✓ | J | Cross-modal consistency |
| [69] | Targetless | X | High | High | X | 3D–3D | ✓ | ✓ | S | Angular/translation error |
| [70] | Deep Learning | X | Medium | High | ✓ | 3D–3D | ✓ | ✓ | J | Learning loss (pose) |
| [71] | Deep Learning | X | Medium | High | X | 3D–3D | ✓ | ✓ | J | Pose regression error |
| [72] | Deep Learning | X | Medium | High | ✓ | 3D–3D | ✓ | X | J | 6-DoF estimation error |
| [73] | Deep Learning | X | Medium | High | ✓ | 3D–3D | ✓ | X | J | Depth/pose consistency |
| [74] | Deep Learning | X | Medium | High | ✓ | 3D–3D | ✓ | ✓ | J | Reprojection accuracy |
| [75] | Deep Learning | X | Medium | High | ✓ | 3D–3D | ✓ | ✓ | J | RMSE |
| [76] | Deep Learning | X | Medium | High | X | 3D–3D | ✓ | ✓ | J | Learning-based accuracy |

| Ref. | Learning | Input Representation | Real Time | Accuracy/Robustness | Efficiency | Scalability |
|---|---|---|---|---|---|---|
| [97] | MLP-based deep network | Raw point cloud | Limited | Moderate accuracy on simple geometries | Low due to per-point processing | Scalable to small datasets |
| [98] | Hierarchical deep learning | Point cloud (multi-scale grouping) | Limited | High accuracy with local–global features | Moderate | Improved large-scene scalability |
| [99] | CNN + recurrent CRF | Range image projection of LiDAR | ✓ | Reliable road–object segmentation | High (GPU-optimized) | Suitable for large driving datasets |
| [100] | CNN with post-processing CRF | Range image (spherical projection) | ✓ | High accuracy in road scenes | High (fast inference) | Robust for outdoor navigation |
| [101] | CNN with polar grid encoding | Polar coordinate grid | ✓ | Strong global consistency | Efficient GPU computation | Excellent outdoor scalability |
| [102] | CNN + uncertainty modeling | Spherical range image | ✓ | Very high robustness under noise | Real-time optimized | Generalizable to multiple datasets |
| [103] | Transformer-based network | Spherical projection | Near real time | State-of-the-art accuracy | Moderate (transformer-heavy) | Scalable to complex 3D environments |

| Ref. | Learning Model | Modalities | Real-Time Capability | Accuracy | Efficiency | Scalability |
|---|---|---|---|---|---|---|
| [18] | Multi-modal voxel fusion | RGB-D, LiDAR, IMU | X | High semantic–metric integration; fine-grained voxel reasoning | Computationally heavy | Suitable for small-scale 3D semantic mapping |
| [104] | Transformer-based RGB-X fusion (CMX) | RGB + Depth/LiDAR | Near real-time | High semantic consistency under varying illumination | Transformer-heavy; optimized with attention pruning | Scalable to large-scale 3D scenes |
| [105] | Hierarchical multi-modal transformer | LiDAR + multispectral imagery | X | High semantic alignment with context reasoning | Moderate | Remote-sensing and large-scale scene mapping |
| [108] | Perception-aware CNN fusion | LiDAR + RGB camera | ✓ | Robust 3D segmentation under noisy conditions | High (GPU-optimized) | Autonomous driving; outdoor mapping |
| [111] | Voxelized LiDAR–camera continuous fusion | LiDAR + RGB camera | ✓ | Reliable semantic completion and spatial accuracy | GPU-efficient | Dense 3D city-level mapping |
| [109] | LiF-Seg: Late fusion network | LiDAR + RGB camera | ✓ | Accurate object-level segmentation; complementary cues | Moderate | Scalable for dynamic outdoor scenes |
| [110] | Multi-modal range–RGB CNN | LiDAR + RGB | ✓ | High robustness under lighting and occlusion changes | Real-time optimized | Generalizable across driving datasets |
| [106] | Radar–vision semantic SLAM | Millimeter-wave radar/RGB camera | ✓ | Robust under adverse weather and poor visibility | Moderate | All-weather autonomous navigation |
| [107] | Condition-aware multi-modal fusion | LiDAR + camera | ✓ | Adaptive to environmental conditions; strong semantic consistency | Efficient | Long-term semantic perception in dynamic scenes |
| [113] | Multi-LiDAR fusion pipeline | Multiple LiDARs | ✓ | Enhanced 3D reconstruction with semantic annotation | Efficient | Large-scale urban driving |
| [112] | HDMapNet | LiDAR + camera | ✓ | Fine-grained semantic mapping and HD map generation | Real-time | City-scale mapping applications |

| Ref. | Network Type | Input Representation | Feature Extraction | Learning Strategy | Optimization | Dataset |
|---|---|---|---|---|---|---|
| [114] | KPConv | Point-based | Kernel point convolution | Domain generalization | Geometric domain adaptation | SemanticKITTI |
| [115] | BEV-based convolutional fusion | Polar + Cartesian BEV | Dual-branch CNNs | Efficiency-focused supervised | Joint spatial–semantic optimization | nuScenes |
| [116] | Point Conv + Sparse Conv 3D | Voxelized 3D grid | Sparse + point convolutions | Supervised learning | End-to-end supervised | SemanticKITTI |
| [117] | Harmonic dense convolution | Cartesian voxel grid | Harmonic dense filters | Efficiency-oriented supervised | Lightweight efficiency tuning | SemanticKITTI |
| [102] | 2D CNN over spherical projections | Spherical projection | Residual + context layers | Supervised, uncertainty-aware | Bayesian uncertainty modeling | SemanticKITTI |
| [118] | 2D CNN over polar image | Polar grid/range view | Standard CNN | Supervised | Basic supervised training | SemanticKITTI, nuScenes |

| Ref. | Network Type | Input Representation | Feature Extraction | Learning Strategy | Optimization | Dataset |
|---|---|---|---|---|---|---|
| [119] | Dense 3D CNN | Voxelized occupancy grid | 3D convolutional filters | Supervised | Basic CNN optimization | KITTI |
| [120] | Sparse 3D CNN | Sparse voxel grids | Sparse 3D convolutions | Supervised | Efficiency-focused sparse ops | KITTI |
| [121] | Sparse lattice CNN | Permutohedral lattice | Lattice convolutions | Supervised spatio-temporal | Spatio-temporal optimization | SemanticKITTI |
| [122] | Sparse 3D CNN with cylindrical partitioning | Cylindrical voxel grid | Adaptive contextual convolutions | Supervised | Joint spatial–temporal optimization | nuScenes, SemanticKITTI |
| [123] | Hybrid voxel + point network | Voxel + point set encoding | Voxel feature extraction + point refinement | Supervised | End-to-end optimization | KITTI, Waymo |
| [124] | Sparse 3D NAS-optimized CNN backbone | Voxel grid | NAS-optimized convolutional backbone | Automated (NAS) | Architecture search optimization | nuScenes, SemanticKITTI |

| Ref. | Network Type | Input Representation | Feature Extraction | Learning Strategy | Optimization | Dataset |
|---|---|---|---|---|---|---|
| [97] [98] | MLP with KNN/hierarchical grouping | Raw point sets | MLPs + hierarchical KNN aggregation | Supervised | Basic MLP optimization | — |
| [125] | MLP-based hierarchical networks | Raw point sets | Efficient MLP stacking + KNN neighbor search | Supervised | Improved training and scaling strategies | — |
| [126] | Point convolution with attention | Raw point sets | Kernel point convolution + kernel attention | Supervised | Architectural optimization | — |
| [127] | Point-based MLP with attention and detection-aware loss | Raw point sets | Random sampling + local feature aggregation | Supervised | Detection-aware optimization | SemanticKITTI |
| [128] | 3D point network with knowledge distillation | Raw point sets + 2D priors | Semantic distillation from 2D to 3D | Supervised | Multi-modal knowledge distillation | — |
| [103] | Self-attention (transformer style) | Raw point clouds in spherical coordinates | Radial window self-attention | Supervised | Global structure modeling optimization | — |
| [114] | Label propagation with cloud tracking | Registered point scans | Point alignment + ICP propagation | — | Geometric optimization | — |
| [129] | Transformer-based | Raw point sets | Self-attention with positional encoding | Supervised | Transformer-based optimization | — |
| [130] | Transformer-based | Raw point sets | Efficient neighborhood attention + residual context blocks | Supervised | End-to-end optimization | — |
| [131] | Transformer-based | Raw point sets | Multi-scale global-local transformer fusion | Supervised | Computational optimization for robustness | — |

| Refs. | Method | Category | Architecture | Descriptor Role | Metric | Reported Performance | Contribution |
|---|---|---|---|---|---|---|---|
| [162] | NetVLAD | Geometric | CNN + VLAD | Global | Recall@N | ≈ 80% | End-to-end global descriptor |
| [166,167] | SuperPoint + SuperGlue | Geometric | CNN + GNN | Local | Accuracy | >95% | Self-supervised keypoints and learned graph matching |
| [163] | D2-Net | Geometric | CNN | Local | Precision-Recall | 60–70% | Joint detection and dense feature description |
| [168] | Patch-NetVLAD | Hybrid | CNN | Local+Global | Recall@N | 92–97% | Patch-level fusion of local and global context |
| [169] | Multires-NetVLAD | Geometric | CNN Multi-scale | Global | Recall@1,@5 | 90–96% | Multi-resolution training for scale robustness |
| [164] | TransVPR | Hybrid | Vision Transformer | Global | Recall@1, mAP | 94%, 0.88 | Multi-level attention for context aggregation |
| [165] | MixVPR | Hybrid | Transformer + CNN | Global | Recall@N, mAP | 96%, 0.91 | Spatial–semantic feature mixing for robustness |
| [170] | LoFTR/SparseGNN | Geometric | Transformer GNN | Local | Accuracy | ≈97% | Detector-free dense matching with high precision |
| [171] | CLIP-VPR | Semantic | Vision-language | Global | Recall@20 | 0.66 | Vision–text embedding for semantic-level retrieval |
| [172] | Adaptive Matching | Geometric | CNN | Local | Recall@N | ≈89% | Adaptive re-ranking for condition invariance |
| [173] | Multi-Platform VPR | Hybrid | CNN + domain adaptation | Global | Recall@1 | 85–90% | Cross-domain embedding for multi-robot localization |
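
As a hedged illustration of the Recall@N metric reported in the table above, the sketch below retrieves the N nearest database descriptors by L2 distance and counts a query as correct when any retrieved place lies within a distance threshold of the query's ground-truth position. The function name `recall_at_n`, the array layout, and the 25 m threshold are assumptions made for illustration; individual benchmarks define their own success criteria, and none of the surveyed methods is implied to use this exact code.

```python
import numpy as np


def recall_at_n(query_desc, db_desc, query_pos, db_pos, n=1, dist_thresh=25.0):
    """Fraction of queries with at least one true match among the top-n retrievals.

    query_desc: (Q, D) global descriptors of the query places
    db_desc:    (M, D) global descriptors of the database (map) places
    query_pos:  (Q, 2) ground-truth positions of the queries, in metres
    db_pos:     (M, 2) ground-truth positions of the database places, in metres
    dist_thresh: assumed radius (metres) within which a retrieval counts as correct
    """
    # Pairwise squared L2 distances between descriptors, shape (Q, M).
    d2 = (
        np.sum(query_desc**2, axis=1, keepdims=True)
        - 2.0 * query_desc @ db_desc.T
        + np.sum(db_desc**2, axis=1)
    )
    # Indices of the n closest database descriptors for every query.
    top_n = np.argsort(d2, axis=1)[:, :n]
    # Metric distance between every query and every database place, shape (Q, M).
    geo = np.linalg.norm(query_pos[:, None, :] - db_pos[None, :, :], axis=2)
    # A query is a hit if any of its top-n retrievals is geometrically close enough.
    hits = sum(
        bool(np.any(geo[q, top_n[q]] <= dist_thresh)) for q in range(len(query_desc))
    )
    return hits / len(query_desc)


# Toy data (hypothetical): four map places 100 m apart with one-hot descriptors.
db_desc = np.eye(4)
db_pos = np.array([[0.0, 0.0], [100.0, 0.0], [200.0, 0.0], [300.0, 0.0]])
# Query 0 resembles place 0; query 1 is taken near place 1 but looks more like place 2.
query_desc = np.array([[0.9, 0.1, 0.0, 0.0], [0.0, 0.6, 0.7, 0.0]])
query_pos = np.array([[5.0, 0.0], [95.0, 0.0]])

print(recall_at_n(query_desc, db_desc, query_pos, db_pos, n=1))  # 0.5
print(recall_at_n(query_desc, db_desc, query_pos, db_pos, n=2))  # 1.0
```

In the toy data the second query is only recovered once two candidates are retrieved, which is why Recall@2 exceeds Recall@1; the same effect explains why papers often report several values of N side by side.
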

| Ref. | Method | Feature | Invariance to Translation | Invariance to Rotation | Similarity | Dataset | Year |
|---|---|---|---|---|---|---|---|
| [174] | PointNet | NetVLAD | Centroid | - | L2/Cosine | Oxford RobotCar | 2018 |
| [175] | PointNet | NetVLAD | Centroid | - | L2/Cosine | Oxford RobotCar | 2019 |
| [176] | GNN, DG-CNN | NetVLAD | Learned spatial graph normalization | - | Cosine | Oxford RobotCar | 2019 |
| [177] | PointNet + FlexConv | NetVLAD, SE block | Centroid | - | Cosine | Oxford RobotCar, ETH | 2020 |
| [178] | PointOE | NetVLAD | Centroid | x | Cosine | Oxford RobotCar | 2021 |
| [179] | Pyramid Point Transformer | Pyramid VLAD | Centroid | Transformer equivariant layers | L2/Cosine | Oxford RobotCar | 2021 |
| [183] | PPCNN | G-VLAD | Centroid | PCA-like learned | Cosine | KITTI | 2022 |
| [180] | Feature Point Extractor | Point Transformer | Learned normalization | Equivariant transformer | Cosine | KITTI, KITTI-360 | 2022 |
| [181] | Rotation Equivariant Encoder | Siamese CNN | Learned | Strong rotation equivariance | Cosine | KITTI, MulRan | 2022 |
| [182] | U-Net | ePN, 2nd-order pooling | Centroid | High-order rotational invariance | L2 | Oxford RobotCar | 2022 |

| Ref. | Method | Feature | Similarity | Dataset | Year |
|---|---|---|---|---|---|
| [184] | MinkLoc3D | Sparse 3D convolutions, FPN, GeM | L2 distance between global embeddings | Oxford RobotCar | 2021 |
| [185] | MinkLoc3D-SI | Sparse convolutions–spherical coordinates–intensity; FPN; GeM | Triplet loss using L2 distance between descriptors (explicit) | USyd Campus, Oxford RobotCar, KITTI | 2021 |
| [186] | EgoNN | FPN backbone; ECA attention | Uses global descriptor retrieval | MulRan, Apollo-SouthBay, KITTI | 2022 |
| [187] | NDT-Transformer | Voxel-level statistical representation (Normal Distribution Transform); Transformer encoder | Uses descriptor retrieval | Oxford RobotCar | 2021 |
| [188] | TransLoc3D | NDT representation; Adaptive Receptive Field Module (ARFM); Transformer; NetVLAD | Triplet margin loss with L2 distance between descriptors | Oxford RobotCar | 2021 |
| [189] | LCDNet | PV-RCNN feature backbone; NetVLAD for global descriptor | L2/Cosine | KITTI, KITTI-360, Freiburg | 2022 |
| [190] | SVT-Net | Sparse Voxel Transformer (SVT); captures local + long-range structure | Uses descriptor retrieval | Oxford RobotCar | 2022 |
| [191] | OverlapNetVLAD | BEV-based feature extractor (BEVNet) + NetVLAD | NetVLAD descriptors typically compared via L2 | KITTI | 2023 |

| Ref. | Method | Feature Strategy | Similarity/Matching | Dataset | Year |
|---|---|---|---|---|---|
| [192] | SegMatch | Segment-based geometric features, PCA descriptors, 3D segment clustering | Random forest classification, Euclidean distance; geometric consistency checks | KITTI | 2017 |
| [193] | SegMap | Learned 3D segment descriptors using a CNN autoencoder | L2 distance between learned descriptors; nearest-neighbor retrieval | KITTI | 2018 |
| [194] | Zaganidis et al. | Keypoint extraction with local shape descriptors | Nearest-neighbor keypoint matching using Euclidean distance; RANSAC alignment | KITTI | 2019 |
| [195] | LiPMatch | LiDAR keypoints, multi-scale pyramid feature encoding; handcrafted structural descriptors | Descriptor NN matching; geometric verification for loop closure | KITTI | 2020 |
| [196] | Tomono | NDT-based local map representation; Gaussian distribution parameters per cell | Likelihood-based NDT matching; optimization of alignment score | Tc0915, Cit1015, Kitti0027 | 2020 |
| [197] | Locus | Segment-level features combining geometry + intensity cues | L2 distance for segment descriptors + geometric consistency validation | KITTI | 2021 |
| [9] | SA-LOAM | Edge and planar features (LOAM style); spatial-appearance cues | Feature correspondence + ICP-based geometric error minimization | KITTI | 2021 |
| [198] | SSC | PCA-based shape descriptors from segmented point clouds | L2 or Mahalanobis distance between PCA descriptors; voting scheme for matches | KITTI | 2021 |
| [199] | PCA-SSC | Improved PCA descriptors with stronger geometric invariance | L2 distance + consistency scoring to refine correspondences | KITTI, SemanticKITTI | 2021 |
| [200] | PSE-Match | Probabilistic Shape Encoding (mixture-model representation) | Probability-based similarity measure between PSE descriptors | KITTI, NCLT, CMU, Pittsburgh | 2022 |
| [201] | Song et al. | Hybrid descriptor combining geometry, intensity, and local topology | L2 or cosine distance between descriptors; global verification | KITTI | 2022 |

| Ref. | Method | Feature Strategy | Similarity/Matching | Dataset | Year |
|---|---|---|---|---|---|
| [202] | Yin et al. | Spherical projection of LiDAR scans; handcrafted geometric + intensity features | Correlation-based matching; similarity score from projected image alignment | KITTI | 2017 |
| [203] | LocNet | Range image projection; CNN-based learned global descriptor | L2 distance between global descriptors | KITTI | 2018 |
| [204] | Sun et al. | Spherical projection + CNN encoder; multi-channel range-intensity features | L2 distance; nearest-neighbor retrieval | KITTI | 2019 |
| [205] | OverlapNet | Range projection with multi-modal channels | Predicted overlap score + yaw estimation; similarity via overlap probability | KITTI, Ford Campus | 2020 |
| [206] | DiSCO | Dense spherical projection; self-supervised contrastive feature learning | Cosine similarity between learned descriptors | Oxford RobotCar, NCLT, MulRan | 2021 |
| [207] | Overlap Transformer | Transformer encoder applied to range-image feature maps; multi-modal projected features | Overlap prediction + yaw regression; similarity via predicted overlap | KITTI, Ford Campus | 2022 |
| [208] | FusionVLAD | Spherical projection + CNN features + NetVLAD aggregation | VLAD vector similarity using L2 distance | KITTI, NCLT, Campus City | 2021 |
| [213] | AttDLNet | Attention-based CNN encoder on projection images; multi-scale feature extraction | L2 distance between descriptors; attention-weighted retrieval | KITTI | 2022 |
| [209] | SphereVLAD | Spherical projection + deep CNN features + VLAD aggregation | VLAD descriptor matching via L2/cosine distance | KITTI, Campus City | 2022 |
| [210] | SphereVLAD++ | Improved SphereVLAD with multi-scale projections and enhanced CNN backbone | VLAD descriptor L2 distance; weighted similarity scoring | KITTI, Pittsburgh | 2022 |
| [211] | DeepRING | Circular (ring-based) LiDAR projection; CNN encoder for rotation-equivariant features | Correlation-based similarity using circular shift alignment | NCLT, MulRan | 2022 |
| [212] | BEVPlace | Bird’s-eye-view (BEV) projection; CNN encoder with BEV-specific geometric priors | L2 distance between BEV global descriptors | KITTI, Oxford RobotCar | 2023 |

| Ref. | Method | Feature Strategy (Representation) | Similarity/Matching | Dataset | Year |
|---|---|---|---|---|---|
| [159] | SGPR | Builds semantic graphs from segmented point clouds | Graph matching using structural similarity metrics | KITTI | 2020 |
| [214] | GOS Match | Graph-of-semantics representation where each node = semantic instance and edges = geometric/topological relations | Graph similarity scoring with node–edge feature alignment + geometric verification | KITTI | 2020 |
| [215] | Gong et al. | Two-tier representation: local spatial relation graph + global spatial topology; semantic and geometric cues | Coarse-to-fine graph matching; spatial relation similarity + refined structural alignment | KITTI, Hannover, Self-Built Campus | 2021 |
| [160] | BoxGraph | Semantic bounding-box graph generated from object detectors; boxes as nodes, box relations as edges | Graph matching with similarity metrics over box attributes + spatial consistency | KITTI | 2022 |
| [216] | Deep Scan Context | Learned high-dimensional scan context descriptor; pseudo-cylindrical projection + neural embedding | Descriptor matching via cosine/L2 distance + rotation alignment | KITTI | 2022 |
| [158] | SimGNN | Graph embeddings learned with GNN layers; captures node distributions and global graph structure | Neural graph similarity prediction (end-to-end), including attention-based similarity | Custom | 2020 |
| [217] | Semantic Loop | Builds instance-level 3D semantic graphs; node features include object class + geometry | Graph matching + spectral features + RANSAC-based geometric verification | TUM RGBD − COCO | 2022 |

| Ref. | Modalities | Fusion Level | Fusion Type | Feature Representation | Learning Objective | Robustness | Remarks |
|---|---|---|---|---|---|---|---|
| [218] | L-C | Global and local | Attention + manifold | Global–local joint features | Cross-modal alignment | Viewpoint, illumination | High accuracy, high cost |
| [219] | L-C | Cross-modal | Explicit interaction | Shared latent embedding | Contrastive consistency | Domain generalization | Needs balanced modalities |
| [220] | L-C | Feature transfer | LiDAR-image conversion | Vision-based descriptors | Knowledge transfer | Geometry–appearance bridge | Depends on projection quality |
| [221] | L-MC | Late fusion | Descriptor + score fusion | Multi-modal high-level features | Similarity fusion | Appearance robustness | Limited fine-grained fusion |
| [222] | L-R | Mid fusion | BEV + cross-attention | Spatial BEV maps | Feature distillation | Radar sparsity handling | Low semantic detail |
| [223] | L-C | Multi-scale | Attention pyramids | Hierarchical features | Joint scale invariance | Scale, viewpoint adaptation | High memory demand |

| Ref. | Modality | Accuracy | Robustness to Env. Changes | Robustness to Dynamic Scenes | Computational Efficiency | Recall@1 | Strengths | Weaknesses |
|---|---|---|---|---|---|---|---|---|
| [174] | LiDAR | High | High | Medium | Medium | 80.31 Oxford | Learns discriminative global descriptors from raw point clouds; robust to viewpoint variations; scalable to large environments | Limited modeling of local geometric structure; performance depends on training data; moderate computational cost |
| [170] | Vision | High | Medium | Low | Medium | 95 | Dense feature matching without keypoint detection; robust to textureless regions; effective for precise visual correspondence | Computationally expensive; high memory consumption; limited scalability for real-time large-scale localization |
| [182] | LiDAR | High | High | Medium-High | Medium | 93 KITTI | Combines local and global features for discriminative descriptors; robust to viewpoint variations; effective for large-scale place recognition | Requires GPU for real-time inference; performance depends on LiDAR density; computational cost higher than lightweight descriptors |
| [189] | LiDAR | High | High | Medium-High | Medium | 96 KITTI | Joint framework for loop closure detection and point cloud registration; improves global consistency in SLAM; robust descriptors for LiDAR scans | Higher computational cost due to joint descriptor and registration estimation; performance depends on LiDAR scan quality |
| [184] | LiDAR | Very high | High | High | Medium | 98.5 Oxford | Robust LiDAR-based global descriptors; efficient sparse convolution processing; good scalability for large environments | Requires high-quality LiDAR data; performance degrades with partial scans or heavy occlusions |
| [205] | LiDAR | High | High | Medium | Medium | 88.0 KITTI | Effective for loop closure detection; robust to viewpoint changes; integrates geometric overlap estimation | Requires precomputed range images; computational cost increases for large-scale map databases |
| [149] | LiDAR | High | High | Medium | High | 88 KITTI | High geometric accuracy; real-time LiDAR odometry; widely used baseline for LiDAR SLAM | Limited semantic understanding; sensitive to dynamic objects; loop closure requires additional modules |
| [212] | Multi-modal | Very high | Very high | High | Medium | 98.0 KITTI | Robust to viewpoint changes using BEV representation; effective for large-scale urban scenes | Requires accurate projection to BEV; performance depends on sensor calibration |
| [216] | LiDAR | High | High | Medium | Medium | 83.0 KITTI | Learns discriminative LiDAR descriptors; robust to geometric variations | High computational requirements; limited robustness to extreme environmental changes |
| [160] | Multi-modal semantic | Very high | Very high | Very high | Medium-low | 87.0 KITTI | Uses semantic object relationships; robust to appearance changes; suitable for long-term localization | Depends on reliable object detection; graph matching may become expensive in large environments |

| Dataset | Sensors | Synchronization | Ground Truth | Location | Weather | Time | Year |
|---|---|---|---|---|---|---|---|
| KITTI [48], SemanticKITTI [145], KITTI-360 [226] | 1 × 64-layer LiDAR, 2 × grayscale camera, 2 × color camera, 1 × GPS-RTK/IMU | Software and hardware (reed contact) | Scene flow, odometry, object detection-tracking, road-lane | Germany | clear | day, autumn | 2013, 2019, 2022 |
| KAIST [227] | 2 × 16-layer LiDAR, 2 × 1-layer LiDAR, 2 × monocular camera, 1 × GPS (consumer level), 1 × GPS-RTK, 1 × fiber-optic gyro, 1 × independent IMU, 2 × wheel encoder, 1 × altimeter | Software (ROS timestamp) and hardware (PPS for the two Velodynes, an external trigger for the two monocular cameras to obtain stereo) | SLAM algorithm for vehicle self-localization | South Korea | clear | day | 2015 |
| Oxford [224] | 1 × 4-layer LiDAR, 2 × 1-layer LiDAR, 1 × stereo camera, 3 × fisheye camera, 1 × GPS-RTK/INS | Software | GPS-RTK/INS for vehicle self-localization | UK | sun, clouds, overcast, rain, snow | day, dusk, night, four seasons | 2017 |
| Udacity [228] | 1 × 64-layer LiDAR, 3 × RGB cameras, 1 × Radar ARS-408, 1 × GPS/IMU | Software (ROS timestamp) | GPS/IMU for vehicle self-localization | USA | sunny, cloudy | day | 2017 |
| ApolloScape [229] | 2 × 1-layer LiDAR, 6 × monocular camera, 1 × GPS-RTK/IMU | Unknown | Scene parsing, car instance, lane segmentation, detection-tracking | China | unknown | day | 2018 |
| Waymo [230] | 5 × LiDAR, 5 × camera | Strategy not specified; the authors report that the sensors are well synchronized | Object detection-tracking | US | sun, rain | day, night | 2019 |
| EU long-term [231] | 2 × 32-layer LiDAR, 1 × 4-layer LiDAR, 1 × 1-layer LiDAR, 2 × stereo camera, 1 × radar, 1 × GPS-RTK, 1 × independent IMU | Software (ROS timestamp) and hardware (PPS for the two Velodynes) | GPS-RTK/IMU for vehicle self-localization | France | sun, clouds, snow | day, dusk, night, three seasons (spring, summer, winter) | 2020 |
| nuScenes [45] | 1 × 32-layer LiDAR, 6 × monocular camera, 1 × radar, 1 × GPS-RTK, 1 × independent IMU | Software | HD map-based localization, object detection-tracking | US, Singapore | sun, clouds, rain | day, night | 2020 |
| MulRan [232] | 1 × 64-layer LiDAR, 1 × radar | Hardware synchronization (LiDAR–INS) | Global pose (GPS/INS), LiDAR odometry | Republic of Korea | clear, overcast, varying | day, multiple sessions | 2020 |
| A2D2 [146] | 5 × 16-layer LiDAR, 6 × cameras, 1 × GPS/IMU | Hardware | Object detection-tracking | Germany | sunny, rainy and cloudy | day | 2020 |
| Pandaset [233] | 1 × 64-layer LiDAR, 1 × 150-layer LiDAR, 5 × wide-angle cameras, 1 × long-focus camera, 1 × GNSS/IMU | Software and hardware | Object detection–classification | US | sun | day, night | 2021 |
| Boreas [225] | 1 × 128-layer LiDAR, 1 × monocular camera, 1 × Applanix GNSS, 1 × 360-degree radar | Hardware; timestamp-based LiDAR synchronization using UTC time | GPS for vehicle self-localization | Canada | sun, clouds, snow, rain | day, night, overcast | 2023 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Navarro-Pérez, Á.; Bacca-Cortés, B.; Caicedo-Bravo, E. Semantic SLAM with Multi-Modal Perception: Survey on Robust Long-Term Localization for Autonomous Vehicles. Robotics 2026, 15, 88. https://doi.org/10.3390/robotics15050088

