Semantic SLAM with Multi-Modal Perception: Survey on Robust Long-Term Localization for Autonomous Vehicles

Navarro-Pérez, Álvaro; Bacca-Cortés, Bladimir; Caicedo-Bravo, Eduardo

doi:10.3390/robotics15050088

Open Access Review


by Álvaro Navarro-Pérez ¹,*, Bladimir Bacca-Cortés ² and Eduardo Caicedo-Bravo ²

¹ Faculty of Engineering, Electronic Engineering, Technological Development Research Group (GIDET), Universidad del Quindío, Carrera 15 Calle 12N, Armenia 630001, Colombia

² School of Electrical Engineering, Perception and Intelligent Group, (PSI), Universidad del Valle, Campus Melendez, Calle 13 #100-00, Cali 760032, Colombia

* Author to whom correspondence should be addressed.

Robotics 2026, 15(5), 88; https://doi.org/10.3390/robotics15050088

Submission received: 10 February 2026 / Revised: 18 March 2026 / Accepted: 26 March 2026 / Published: 28 April 2026

(This article belongs to the Topic Advances in Robot Vision Perception and Control Technology)


Abstract

Long-term localization in dynamic and changing environments remains a key challenge for autonomous vehicles. Semantic Simultaneous Localization and Mapping (SLAM) enhances traditional SLAM by integrating high-level semantic understanding, enabling robust mapping and localization even under complex scenarios. In this context, multi-modal sensor fusion—particularly the combination of LiDAR and camera data—has proven essential in leveraging complementary strengths: the geometric accuracy of LiDAR and the rich semantic cues from images. A significant advancement in this domain is the adoption of graph-based semantic localization frameworks, where semantic entities and spatial relationships are encoded in graph structures to improve map consistency, loop closure detection, and data association over time. This review presents a comprehensive survey of recent developments in Semantic SLAM, with a focus on long-term localization for autonomous vehicles using multi-modal fusion strategies. We categorize existing methods into traditional SLAM, vision-based, point-cloud-based, and graph-based techniques, emphasizing the role of semantic data association and loop closure in maintaining long-term consistency. Additionally, we discuss the integration of deep learning techniques for semantic segmentation and feature extraction. Finally, we analyze widely used datasets and evaluation metrics, identifying current limitations and proposing directions for future research on robust, scalable, and semantically enriched localization.

Keywords: 3D-LiDAR; sensor fusion; feature extraction; textured point cloud; semantic map; long-term localization; datasets


1. Introduction

Autonomous vehicles (AVs) rely heavily on accurate and persistent localization capabilities to navigate safely and efficiently in complex real-world environments. Among the many technologies that support autonomous navigation, Simultaneous Localization and Mapping (SLAM) has become a foundational component, enabling a vehicle to construct and update a map of its environment while simultaneously determining its position within that map. Traditional SLAM techniques, which primarily rely on low-level geometric features derived from LiDAR or vision sensors, have shown strong performance in structured and static environments. However, these methods often fail in the presence of dynamic objects, occlusions, or environmental changes, such as lighting variations, weather conditions, or seasonal shifts.

To address these challenges, long-term semantic SLAM has emerged as a robust alternative, integrating high-level semantic information—such as objects, road elements, and scene categories—into the SLAM pipeline. By enriching map representations with semantics, autonomous systems gain a more contextualized and interpretable understanding of the scene, allowing for more robust localization, particularly over extended time periods and under varying environmental conditions [1,2].

A pivotal advancement in Semantic SLAM is the use of multi-modal sensor fusion, particularly between camera and LiDAR data. Cameras provide dense image textures and rich semantic cues, while LiDAR sensors offer precise 3D geometry and resilience to lighting changes. When combined, these complementary modalities form textured point clouds that capture both semantic and spatial information, improving the AV’s ability to interpret and localize within complex and changing environments [3,4,5,6,7]. This fusion is especially beneficial for long-term operations, where the scene may undergo significant appearance changes due to weather, seasonal transitions, or urban development, as presented in Figure 1.

One of the central challenges in long-term SLAM is maintaining accurate data association over time. As scenes evolve, features may disappear, move, or transform, making it difficult to match current observations with prior map entries. Semantic information helps mitigate this by enabling higher-level correspondences, such as associating object categories or consistent spatial layouts, rather than relying solely on raw geometric features. Moreover, multi-modal fusion enhances this process by offering redundant and complementary descriptors, increasing the likelihood of successful associations under appearance changes. Another critical component is loop closure detection, which plays a key role in correcting accumulated drift and preserving the global consistency of the map. In long-term deployments, loop closure becomes particularly challenging due to changes in viewpoint, illumination, or scene structure. Semantic and multi-modal representations help address this by providing robust scene signatures that remain consistent even when visual or geometric features alone are insufficient. Effective loop closure mechanisms are essential for maintaining an accurate global map and ensuring reliable re-localization during extended autonomous missions [8,9,10].

Long-term localization is particularly affected by three major sources of error: environmental dynamics, sensor drift, and calibration inaccuracies. Dynamic objects may corrupt feature matching and loop closure detection, while odometry drift leads to cumulative trajectory errors. Additionally, miscalibration between sensors can introduce systematic biases in multi-modal fusion systems. Addressing these challenges requires robust semantic representations and global optimization strategies capable of maintaining map consistency over extended periods of operation.

To ensure robustness in real-world deployments, long-term localization systems must address three major challenges: scalability to large urban environments, robustness to dynamic objects, and adaptation to time-varying scene semantics. Graph-based semantic representations combined with multi-modal perception enable efficient map management, dynamic object filtering, and incremental semantic map updates, allowing localization systems to remain consistent over long-term operation.

This review presents a comprehensive survey of recent developments in long-term Semantic SLAM, with an emphasis on multi-modal fusion strategies for autonomous vehicle localization. We categorize state-of-the-art approaches into traditional SLAM extensions, vision-based techniques, point-cloud-based methods, fusion-based architectures, and graph-based models. Furthermore, we examine how deep learning techniques have been integrated into Semantic SLAM for tasks such as feature extraction, segmentation, and object recognition. Special attention is given to methods that enhance data association and loop closure in dynamic and evolving environments. Finally, we discuss benchmark datasets and evaluation metrics that support comparative analysis of Semantic SLAM systems, and we identify future research directions to advance robust and scalable long-term localization.

This paper is organized as shown in Figure 2. Section 2 introduces the main concept and motivation behind time-varying semantic maps for robust localization in dynamic environments. Section 3 presents multi-modal perception, including spatial calibration and temporal synchronization. Section 4 discusses semantic map construction using vision, LiDAR, and fusion strategies, while Section 5 focuses on 3D semantic segmentation. Section 6 addresses long-term localization through place recognition, loop closure, and data association. Section 7 covers benchmarking aspects such as evaluation metrics, datasets, and long-term perspectives. Section 8 outlines current open challenges in long-term SLAM, and Section 9 concludes this survey.

2. Time-Varying Semantic Maps for Long-Term Localization

Autonomous navigation in real-world environments poses numerous challenges, particularly when vehicles are expected to operate reliably over extended periods in dynamic, unstructured, and constantly changing urban settings. A fundamental requirement for achieving long-term autonomy is the ability to localize the vehicle accurately and robustly despite variations in appearance, structure, and sensor conditions. Traditional localization systems, typically based on geometric SLAM (Simultaneous Localization and Mapping), provide effective short-term solutions under the assumption of static environments. However, their performance often degrades over time due to environmental changes such as seasonal variations, construction, dynamic objects, or sensor wear and drift. These limitations highlight the fact that maps are inherently time-varying, and to be useful in long-term localization tasks, they must be continually analyzed, updated, and adapted to reflect the evolving nature of the environment.

Developing robust time-varying semantic maps for long-term localization requires the integration of several key components, as summarized in Table 1 and illustrated in Figure 2, where each task corresponds to a stage in the semantic mapping pipeline. Multi-modal perception provides complementary spatial and semantic information, forming the foundation of resilient mapping under varying conditions. Differentiating static from dynamic obstacles prevents transient elements from degrading the map's quality, while appropriate mapping strategies (metric, topological, or hybrid) facilitate flexible and efficient long-term updates. Object detection and data association play a crucial role in identifying and consistently linking meaningful semantic entities to existing map features, thereby supporting both map maintenance and accurate pose estimation. Semantic segmentation further enhances environmental understanding by distinguishing structural changes from expected variations, whereas semantic representation strategies define how this information is efficiently stored and updated over time. Loop closure is essential for maintaining temporal and spatial consistency, enabling the system to recognize previously visited places and correct accumulated drift. Localization ultimately benefits from the integration of semantic and geometric cues, resulting in improved robustness in dynamic environments. Finally, long-term datasets and benchmarking provide the foundation for evaluation, standardization, and fair comparison across different approaches. Together, these components establish a comprehensive pipeline in which multi-modal perception, semantic understanding, and temporal consistency enable reliable and scalable long-term localization in complex and evolving scenarios.

To analyze the works in Table 1 in more depth, the following paragraphs review each strategy (one per table column), highlighting the most relevant and commonly adopted methods.

The first aspect corresponds to the sensor configurations presented in Table 1, which reveal a clear trend toward multi-modal perception systems. These configurations demonstrate that the fusion of complementary sensing technologies significantly enhances both spatial accuracy and semantic richness. The LiDAR–camera combination emerges as the most recurrent configuration, reflecting its balanced integration of geometric precision from LiDAR and semantic detail from visual imagery. This pairing has become the standard in semantic SLAM and mapping applications, particularly within autonomous driving frameworks. Extensions of this baseline through the inclusion of inertial measurement units (IMUs), GPS, or wheel encoders demonstrate ongoing efforts to improve pose stability, temporal alignment, and long-term consistency, albeit at the cost of increased calibration complexity and computational demand. A growing number of studies also incorporate radar sensors to enhance robustness under adverse environmental conditions, indicating a shift toward perception systems resilient to lighting and weather variability. In contrast, vision-only or stereo systems, although still employed in the construction of semantic maps, remain less robust compared to multi-sensor configurations due to their sensitivity to illumination changes and environmental disturbances. Overall, the reviewed works illustrate a technological progression from dual-sensor fusion toward heterogeneous multi-sensor architectures, aiming to achieve condition-adaptive, robust, and lifelong semantic mapping capabilities for autonomous vehicles.

Across papers [7,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39], the dominant sensor configuration is the camera–LiDAR (C–L) pair, appearing in nearly half of the studies [16,17,21,25,37]. These works consistently show that fusing visual semantics with LiDAR geometry provides superior localization accuracy and semantic richness. A smaller but significant group integrates GPS, IMU, or radar sensors [11,12,13,14,15,20,28,32], aiming to improve long-term stability and environmental robustness. Radar-based fusion in particular emerges as a growing area, motivated by its resilience under adverse weather conditions. Finally, vision-only systems [34] remain valuable for lightweight or low-cost scenarios but lack the robustness of multi-sensor configurations. Overall, the analyzed works demonstrate a clear technological trajectory toward heterogeneous multi-sensor architectures that exploit complementary modalities to achieve accurate, robust, and condition-adaptive semantic mapping for autonomous vehicles. When multiple sensors are used for perception tasks, calibration plays a crucial role, as it ensures both spatial and temporal synchronization between sensors, thereby guaranteeing accurate data alignment. This aspect is further discussed in Section 3.

The second aspect is the type of obstacle, since obstacles must be identified by class and categorized according to the mapping strategy. Table 1 reveals that static environments dominate most of the analyzed works, reflecting the initial focus of semantic mapping research on structured and predictable scenes such as urban roads, parking areas, or office-like environments. Notable static-scene studies include [16,18,22,24,33], which rely on datasets where environmental dynamics are limited (the datasets are analyzed in Section 7.1). However, more recent studies have increasingly shifted towards handling dynamic environments, where moving objects significantly affect mapping and localization performance. Works such as [7,17,21,25,34] explicitly address this challenge by incorporating mechanisms to detect, filter, or model dynamic elements. For instance, [17] leverages motion cues to remove dynamic regions and preserve geometric consistency, while [21] integrates dynamic objects into the semantic representation to enable object-aware localization. Similarly, [34] focuses on long-term temporal consistency through probabilistic map updates, distinguishing transient motion from structural changes, and [7] employs multi-sensor fusion to robustly segment moving objects in complex scenarios. In line with this trend, recent works summarized in Table 1 further extend dynamic scene handling by incorporating advanced sensing modalities and fusion strategies. For example, approaches based on camera–radar fusion [12] exploit Doppler information to identify and filter dynamic features, improving robustness in highly dynamic environments. Similarly, LiDAR–radar fusion methods [11] enhance the detection of moving objects under adverse weather conditions, while multi-sensor frameworks integrating LiDAR, camera, GNSS, and radar [13] enable more reliable perception and localization in large-scale scenarios. Additionally, object-detection-driven methods [14,15] incorporate semantic cues to explicitly model dynamic entities within the SLAM pipeline. Collectively, these developments highlight a clear transition from static-scene assumptions towards dynamic-aware and sensor-fusion-driven approaches, emphasizing the necessity of robust dynamic object handling for achieving reliable, adaptive, and long-term semantic mapping in real-world autonomous driving applications.

With obstacle types covered, the next topic is the mapping type. This category reveals the conceptual and methodological diversity in how the reviewed works represent the environment. Across studies [7,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39], a clear evolution is observed from purely geometric maps toward semantically structured, probabilistic, and graph-based representations that integrate spatial, temporal, and semantic information for more robust and interpretable perception. This evolution reflects a growing need for maps that not only encode geometry but also capture scene context and object meaning, which are essential for long-term autonomous navigation and reasoning. In the earliest stage, several works employed metric mapping approaches, where the environment is represented as a dense or sparse 3D point cloud derived from LiDAR or stereo vision [7,19,34]. These maps emphasize spatial accuracy and reconstruction fidelity, often using traditional methods such as ORB-SLAM2 [40] or ICP [41] to achieve geometric consistency. However, while effective for localization, such representations lack semantic interpretability and cannot differentiate between object classes or dynamic entities. This limitation motivated the incorporation of higher-level features and semantic annotations into the mapping process.

To overcome these limitations, more recent approaches have incorporated semantic and topological representations, where object-level information and contextual relationships are embedded into the map [15,17,21]. These methods enable richer scene understanding and support higher-level tasks such as object-aware localization and long-term reasoning. Additionally, lightweight and hybrid metric–semantic frameworks [14] have been proposed to balance computational efficiency with semantic expressiveness in dynamic environments. More recently, the integration of radar sensors has further expanded the design space of mapping strategies. Works based on camera–radar fusion, such as [12], retain a metric mapping formulation while enhancing robustness through motion-aware filtering of dynamic elements. In contrast, LiDAR–radar fusion methods [11] focus primarily on perception and do not explicitly construct global maps, highlighting a shift toward perception-driven pipelines. Furthermore, multi-sensor systems combining LiDAR, camera, GNSS, and radar [13] adopt tightly coupled metric mapping frameworks that improve localization accuracy and robustness in large-scale and adverse environments. Overall, these developments demonstrate a transition from purely geometric mapping toward hybrid and sensor-fusion-driven representations, where robustness, scalability, and environmental adaptability are prioritized alongside semantic understanding. This evolution underscores the importance of integrating complementary sensing modalities and representation paradigms to address the challenges of real-world autonomous navigation.

Another subject of the analysis is semantic maps, where every 3D element (voxel, point, or pixel) is associated with a semantic label derived from image segmentation or learned features [16,17,21,25]. These works combine deep learning-based object detection or segmentation networks (YOLOv3, YOLOv5, and YOLOv8 [42,43,44]; DeepLabv3 [45]; Mask R-CNN [46]) with geometric reconstruction pipelines to generate dense semantic maps. This integration enables the robot or vehicle to reason about objects, road elements, and navigable areas rather than raw geometric data. For instance, [16,21] constructed hybrid LiDAR–camera semantic maps where each 3D point inherits a class label from image-based segmentation. Similarly, Li et al. [17] integrate object-level associations into the mapping process, enabling dynamic scene interpretation and improved long-term consistency. Such semantic enrichment transforms maps into contextual knowledge representations, providing the basis for higher-level planning and decision-making tasks. Semantic maps also improve loop closure and relocalization by using semantically stable features instead of purely geometric ones.
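The idea that each 3D point inherits a class label from image-based segmentation can be illustrated with a minimal sketch. It is not the implementation of any surveyed work: the function name `label_points`, the ideal pinhole model, and the assumption that the points have already been transformed into the camera frame are all illustrative choices.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]


def label_points(
    points_cam: Sequence[Point3D],
    label_image: Sequence[Sequence[int]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Optional[int]]:
    """Give each 3D point the class label of the pixel it projects onto.

    Assumes the points are already expressed in the camera frame (the
    LiDAR-to-camera extrinsics have been applied) and an ideal pinhole
    camera without lens distortion.
    """
    height = len(label_image)
    width = len(label_image[0]) if height else 0
    labels: List[Optional[int]] = []
    for x, y, z in points_cam:
        if z <= 0.0:  # behind the camera: no valid projection
            labels.append(None)
            continue
        u = math.floor(fx * x / z + cx)  # pixel column
        v = math.floor(fy * y / z + cy)  # pixel row
        if 0 <= u < width and 0 <= v < height:
            labels.append(label_image[v][u])
        else:
            labels.append(None)  # outside the camera field of view
    return labels
```

Points that fall outside the image or behind the camera receive `None`, which mirrors the practical situation where the LiDAR field of view is wider than the camera's and part of the cloud stays unlabeled.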

A smaller but influential subset of studies [20,22,24,26,27] adopt topological or graph-based mapping frameworks. In these representations, the environment is modeled as a set of interconnected nodes and edges, where each node encodes local geometry and semantics, and edges describe spatial or temporal relations between them. Graph-based maps offer advantages in scalability, memory efficiency, and long-term adaptability, since they allow local map updates without the need to rebuild the entire environment. For example, Zhang et al. [20] integrate LiDAR, camera, GPS, and radar information into a global graph representation, linking semantically labeled regions for large-scale mapping. Similarly, Yi et al. [24] employ probabilistic graph fusion to integrate uncertain visual and LiDAR data, allowing the system to maintain consistent maps under varying conditions. These approaches move semantic mapping closer to cognitive-level perception, where spatial relations and object semantics coexist in a unified topological structure.

Recent contributions such as [30,31,34] expand semantic mapping into probabilistic and temporal domains. These works treat the map as a dynamic entity that evolves over time, incorporating mechanisms for uncertainty modeling and temporal updates. In this context, mapping is not limited to static reconstruction but becomes a continuous learning process, where each observation refines prior knowledge about the environment. For instance, [34] introduces a temporal consistency framework that updates semantic maps only when persistent changes are detected, effectively distinguishing between transient motion and long-term environmental alterations. Similarly, Jiao et al. [31] combine radar and LiDAR inputs within a probabilistic representation to handle incomplete or degraded sensory observations, ensuring robust mapping under adverse conditions. These probabilistic approaches reflect the transition toward lifelong mapping, where the map adapts to environmental evolution rather than remaining static after construction. The strategies employed to build semantic maps based on sensor type are presented in Section 4.

After analyzing the semantic mapping approaches, it is essential to address the evolution of object detection, which plays a central role in enabling semantic awareness within SLAM systems. As summarized in Table 1, object detection has evolved from traditional geometric feature extraction methods to deep learning-based and multi-modal frameworks. Early LiDAR-based approaches relied on geometric clustering techniques to identify obstacles, which, while effective for spatial segmentation, lacked semantic interpretability. Subsequent advances introduced vision-based detectors such as YOLOv3, YOLOv5, and YOLOv8 [42,43,44] and Mask R-CNN [46], enabling accurate object classification and instance-level understanding. These methods significantly improved the ability to distinguish between different object categories, facilitating the integration of semantic information into mapping and localization pipelines. More recent works further enhanced detection performance through LiDAR–camera fusion, combining geometric precision with rich visual semantics to improve robustness across diverse environments. In line with this evolution, recent approaches have increasingly incorporated radar-based sensing to address limitations under adverse weather and highly dynamic conditions. For instance, LiDAR–radar fusion methods [11] perform robust 3D object detection by leveraging complementary sensing modalities, where radar contributes motion and weather-invariant information. Similarly, camera–radar fusion approaches [12] exploit Doppler measurements to identify dynamic objects, improving the reliability of detection in challenging scenarios. Multi-sensor frameworks integrating LiDAR, camera, GNSS, and radar [13] further extend this paradigm by combining perception and localization cues within a unified system. Additionally, several recent SLAM-oriented methods [14,15] employ advanced deep learning detectors such as YOLOv8 to explicitly model dynamic objects within the mapping process, while others [7,17,21,34] incorporate temporal consistency and motion reasoning to distinguish between static and dynamic elements over time. These developments reflect a shift toward object-level semantic fusion, where detection is not only frame-based but also temporally consistent and tightly integrated with SLAM. Overall, object detection has progressed from low-level geometric reasoning to high-level, multi-modal, and temporally stable perception, enabling more robust and adaptive semantic mapping systems for real-world autonomous driving applications.

The next topic corresponds to semantic segmentation. This category highlights the continuous evolution of methods used to assign semantic labels to visual and geometric data, which is a fundamental step for constructing interpretable and structured environmental representations. Initial approaches [16,18,23] relied primarily on image-based semantic segmentation, employing deep convolutional neural networks. These methods operate on camera imagery to produce pixel-wise labels that distinguish semantic categories such as road, vehicle, pedestrian, and vegetation. The advantage of this approach lies in its dense texture information and availability of pretrained models from large visual datasets. However, since they operate in the 2D domain, these methods face limitations in depth estimation and spatial consistency, particularly when transferring the semantic information into the 3D domain. To overcome these limitations, several studies [19,26,32,34] explored 3D semantic segmentation directly on LiDAR point clouds. LiDAR-based segmentation provides accurate geometric delineation of object boundaries and robustness to illumination changes, making it highly suitable for autonomous navigation tasks. Nevertheless, these models require large annotated 3D datasets and are computationally intensive, which can hinder real-time performance. Despite these challenges, these methods represent an essential step toward geometry-driven semantic understanding, especially for systems operating under poor lighting or weather conditions.

The most relevant evolution observed in recent works [17,21,27,37,39] involves multi-modal semantic segmentation, where LiDAR and camera data are fused to leverage their complementary advantages. For example, [17] performs feature-level fusion using a dual-branch network, improving segmentation accuracy in challenging lighting conditions. Cheng et al. [21] integrate a temporal fusion mechanism, ensuring consistent semantic labeling over sequential frames. Yang et al. [37] combine LiDAR geometry with camera semantics to improve the understanding of dynamic scenes, which is critical for updating maps in long-term scenarios. These works illustrate a movement toward robust, sensor-adaptive segmentation architectures capable of maintaining high accuracy across environmental variability. An emerging research direction involves temporal-aware semantic segmentation, as seen in works like [21,36], where models incorporate recurrent neural structures or spatio-temporal filtering to enforce label consistency over time. This approach mitigates flickering or inconsistencies between consecutive frames and enables incremental learning, where the model can adapt to environmental changes (e.g., seasonal variations or construction zones). Such temporal reasoning represents a crucial advancement toward lifelong semantic perception, allowing autonomous systems to continuously refine and maintain their understanding of the environment. The strategies of semantic segmentation are categorized according to sensor modality and are presented in Section 5.

Within the domain of semantic segmentation, semantic representation plays a crucial role, as it defines how the semantic information obtained through segmentation and detection is encoded, structured, and maintained within the map. It represents the bridge between perception and mapping—transforming raw sensor data into interpretable and reusable knowledge that supports localization, planning, and scene reasoning. Several studies [16,18,22] employ occupancy grids or semantic voxel maps, where the environment is discretized into cells containing both occupancy probabilities and semantic labels. This approach facilitates probabilistic reasoning and enables incremental updates. For example, Alexandria et al. [16] combine semantic labels from image segmentation with LiDAR occupancy grids to construct a semantic occupancy map useful for obstacle avoidance. Such representations are intuitive and well suited for real-time applications, but they often suffer from high memory requirements and limited scalability in large-scale environments.
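As a rough illustration of a cell that holds both an occupancy probability and a semantic label, the sketch below uses the standard log-odds occupancy update together with a simple majority vote over observed labels. The class name `SemanticCell`, the voting scheme, and the sparse dictionary grid are assumptions made for this example, not the representation used in [16] or the other cited works.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, Optional, Tuple


@dataclass
class SemanticCell:
    """One grid cell: occupancy in log-odds form plus semantic label votes."""

    log_odds: float = 0.0  # 0.0 corresponds to p = 0.5 (unknown)
    label_votes: Counter = field(default_factory=Counter)

    def update(self, p_occupied: float, label: Optional[int] = None) -> None:
        """Fuse one observation given its inverse sensor model probability."""
        self.log_odds += math.log(p_occupied / (1.0 - p_occupied))
        if label is not None:
            self.label_votes[label] += 1

    @property
    def occupancy(self) -> float:
        """Convert the accumulated log-odds back to a probability."""
        return 1.0 - 1.0 / (1.0 + math.exp(self.log_odds))

    @property
    def label(self) -> Optional[int]:
        """Most frequently observed label, or None if never labeled."""
        if not self.label_votes:
            return None
        return self.label_votes.most_common(1)[0][0]


# Sparse grid: cells are created lazily, keyed by integer (row, col) indices.
grid: DefaultDict[Tuple[int, int], SemanticCell] = defaultdict(SemanticCell)
grid[(3, 4)].update(0.7, label=1)
```

Storing only observed cells in a dictionary is one way to soften the memory cost noted above, although it does not remove the scalability limits of dense voxel maps.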

Other works [19,26,32,34] adopt a point-level representation, where each 3D point in the LiDAR scan retains an associated semantic label. This approach preserves geometric precision and allows for detailed scene reconstruction. However, maintaining and updating dense point-level semantic maps over time is computationally demanding, especially when handling dynamic or changing environments. To mitigate this, some studies employ probabilistic occupancy updates to optimize storage and computational efficiency. Recent studies [21,27,37,39] emphasize object-centric and graph-based semantic representations, where each detected and labeled object becomes a node within a semantic graph. Edges encode spatial or semantic relationships (e.g., adjacency, co-occurrence, temporal correlation). This representation allows the system to reason about the environment at a higher level of abstraction, enabling scene understanding, contextual reasoning, and long-term data association. Another relevant aspect across works [19,36,39] is the inclusion of temporal reasoning and probabilistic modeling in the semantic representation layer. By associating timestamps, confidence scores, or persistence probabilities to each semantic entity, these systems can differentiate between static and transient elements, update knowledge incrementally, and handle uncertainty arising from noisy detections or sensor failures. This paradigm is central to long-term mapping, allowing the environment model to evolve and remain consistent over extended operation periods.
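The object-centric graph with timestamps and persistence probabilities can be sketched as follows. The names `ObjectNode` and `SemanticGraph`, the exponential-smoothing update, and the `gain` and `threshold` values are hypothetical choices for illustration; the surveyed works use their own, more elaborate models.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectNode:
    """A labeled object with a timestamp and a persistence score in [0, 1]."""

    node_id: int
    label: str
    position: Tuple[float, float, float]
    last_seen: float
    persistence: float = 0.5

    def observe(self, timestamp: float, detected: bool, gain: float = 0.3) -> None:
        """Move the persistence score toward 1 on a detection, toward 0 on a miss."""
        target = 1.0 if detected else 0.0
        self.persistence += gain * (target - self.persistence)
        if detected:
            self.last_seen = timestamp


@dataclass
class SemanticGraph:
    """Nodes are objects; edges carry a relation name such as 'adjacent'."""

    nodes: Dict[int, ObjectNode] = field(default_factory=dict)
    edges: List[Tuple[int, int, str]] = field(default_factory=list)

    def add_node(self, node: ObjectNode) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, a: int, b: int, relation: str) -> None:
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((a, b, relation))

    def persistent_nodes(self, threshold: float = 0.7) -> List[int]:
        """Ids of nodes considered static enough to use for localization."""
        return sorted(
            node_id
            for node_id, node in self.nodes.items()
            if node.persistence >= threshold
        )
```

A node that is repeatedly re-observed drifts toward a persistence of 1, while one that stops being detected decays toward 0, which gives a simple way to separate static structure from transient objects.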

After analyzing the main aspects of semantic segmentation, the next topic focuses on long-term localization, which constitutes the most critical component within the context of this survey. Its analysis is divided into localization, loop closure, and data association. Localization reflects the diverse strategies employed to estimate and maintain the ego-pose of a vehicle or robot within a spatial or semantic map. Across the analyzed works, localization evolves from purely geometric odometry toward semantic and probabilistic localization frameworks, leveraging multi-modal cues and learned representations to enhance robustness under long-term and dynamic conditions. Early works [16,19,26] focus on geometric localization, using LiDAR or stereo vision to estimate motion through Iterative Closest Point (ICP) [41], the Normal Distributions Transform (NDT) [47], or visual feature matching. These methods compute the pose by minimizing the spatial difference between consecutive scans or against a pre-built map. While geometrically accurate, such methods are highly sensitive to scene changes, occlusions, and sparse geometric features in unstructured environments. Nevertheless, they form the foundation for subsequent hybrid and semantic-aware localization techniques.
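To make "minimizing the spatial difference between consecutive scans" concrete, the following is a minimal 2D point-to-point ICP sketch: it alternates brute-force nearest-neighbor matching with a closed-form rigid alignment. The 2D simplification, the function names, and the fixed iteration count are assumptions for illustration; practical systems work in 3D with spatial indexing, outlier rejection, and convergence checks.

```python
import math
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]


def transform(p: Point2D, theta: float, tx: float, ty: float) -> Point2D:
    """Rotate p by theta (radians), then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)


def nearest(p: Point2D, targets: Sequence[Point2D]) -> Point2D:
    """Brute-force nearest neighbor (a k-d tree would be used in practice)."""
    return min(targets, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)


def best_rigid_transform(
    src: Sequence[Point2D], dst: Sequence[Point2D]
) -> Tuple[float, float, float]:
    """Closed-form least-squares rotation and translation for paired points."""
    n = len(src)
    csx = sum(p[0] for p in src) / n
    csy = sum(p[1] for p in src) / n
    cdx = sum(q[0] for q in dst) / n
    cdy = sum(q[1] for q in dst) / n
    s_dot = 0.0
    s_cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - csx, py - csy
        bx, by = qx - cdx, qy - cdy
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    tx = cdx - (c * csx - s * csy)
    ty = cdy - (s * csx + c * csy)
    return theta, tx, ty


def icp_2d(
    source: Sequence[Point2D], target: Sequence[Point2D], iterations: int = 20
) -> Tuple[float, float, float]:
    """Estimate (theta, tx, ty) that maps the source scan onto the target scan."""
    theta_total, tx_total, ty_total = 0.0, 0.0, 0.0
    current: List[Point2D] = list(source)
    for _ in range(iterations):
        matches = [nearest(p, target) for p in current]
        theta, tx, ty = best_rigid_transform(current, matches)
        current = [transform(p, theta, tx, ty) for p in current]
        # Compose the incremental transform with the accumulated one.
        c, s = math.cos(theta), math.sin(theta)
        tx_total, ty_total = (
            c * tx_total - s * ty_total + tx,
            s * tx_total + c * ty_total + ty,
        )
        theta_total += theta
    return theta_total, tx_total, ty_total
```

Because the correspondences come from nearest-neighbor search, a moving object or a large initial offset yields wrong matches and a biased pose, which is the sensitivity to scene changes described above.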

Several studies [18,22,30] employ visual or visual–inertial odometry (VIO), integrating camera imagery with IMU measurements to achieve pose tracking at high frequency. VIO systems effectively address short-term drift and motion ambiguity, especially in texture-rich environments. However, they still face challenges under illumination changes or textureless regions. To mitigate these issues, works like [30] combine VIO with radar or LiDAR to improve localization reliability, introducing multi-sensor odometry frameworks that adaptively weight sensor confidence.

A more recent and significant shift appears in works such as [21,27,36,37], where semantic cues—rather than low-level features—are used to infer or correct the vehicle’s position. In these approaches, localization is performed relative to semantic landmarks, detected objects, or scene graphs rather than raw geometric structures. Cheng et al. [21] use semantic entities as nodes in a spatial graph, aligning observed objects with map-level counterparts through probabilistic matching. Kang et al. [27] leverage semantic keypoints to enhance pose estimation in changing environments. The authors of [36] introduce a semantic scene descriptor that improves global localization under appearance variations. This paradigm shift enables localization to remain robust even when geometric or visual features degrade, supporting long-term and condition-invariant positioning.

More advanced works ([27,39]) introduce probabilistic localization frameworks, where the vehicle pose is represented as a distribution rather than a single deterministic estimate. These methods employ Bayesian filters, pose graph optimization, or factor graphs that incorporate both geometric and semantic constraints. This approach facilitates uncertainty modeling, loop closure integration, and temporal consistency, marking a decisive step toward lifelong localization.

Another important aspect corresponds to data association, which defines how systems establish correspondences between newly observed features and previously mapped elements, directly impacting SLAM consistency and long-term mapping stability. As summarized in Table 1, the reviewed works exhibit a clear evolution from purely geometric matching strategies toward semantic, probabilistic, and multi-modal data association frameworks capable of handling perceptual ambiguity, environmental changes, and dynamic scenes. Early approaches [16,19,26] rely on point-level correspondences using nearest-neighbor search or ICP-based matching to associate geometric features across scans. These methods operate at the spatial level, leveraging proximity and shape similarity. While effective in static and structured environments, they are prone to incorrect associations under occlusions, dynamic objects, and viewpoint variations. Their lack of semantic awareness further limits interpretability, as geometrically similar structures may correspond to different semantic entities. To improve robustness, several works [18,22,30] incorporate visual descriptors, enabling feature-based matching across image sequences. These approaches enhance resilience to partial occlusions and moderate viewpoint changes. However, appearance-based association remains sensitive to illumination variations and adverse weather conditions, motivating the transition toward multi-modal and semantically informed strategies. A significant advancement is observed in semantic-level data association frameworks [21,27,36,37], where correspondences are established between object instances or semantic categories rather than low-level features. This approach improves robustness under environmental variability, as semantic entities remain consistent despite changes in appearance. 
For example, [21] employs graph-based semantic matching, while [36] integrates class probabilities and semantic keypoints to reduce ambiguity in loop closure. More recent developments further extend data association through probabilistic modeling [36,39], where correspondences are formulated as likelihood functions that account for sensor uncertainty and temporal drift. These methods leverage Bayesian inference or graph optimization to maintain globally consistent associations over time, which is critical for long-term SLAM in dynamic environments. In line with these advances, recent works incorporating radar sensing introduce a new paradigm based on motion-aware data association. For instance, camera–radar fusion approaches [12] exploit Doppler velocity measurements to distinguish between static and dynamic elements, enabling the filtering of inconsistent features prior to association. Similarly, LiDAR–radar fusion methods [11] perform cross-modal data association by combining geometric consistency with motion cues, improving robustness under adverse weather conditions. Multi-sensor frameworks integrating LiDAR, camera, GNSS, and radar [13] further enhance this process by jointly optimizing correspondences across heterogeneous modalities within a unified estimation framework.
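The Doppler-based separation of static and dynamic elements can be sketched as follows. This is a simplified planar model, not the method of [12]: it assumes the ego velocity is known in the radar frame, that a negative radial velocity means an approaching target, and that a fixed residual threshold `tol` (an illustrative value) is adequate.

```python
import math


def expected_static_doppler(azimuth_rad, ego_vx, ego_vy):
    """Radial velocity a stationary target would show for the given ego velocity."""
    return -(ego_vx * math.cos(azimuth_rad) + ego_vy * math.sin(azimuth_rad))


def split_static_dynamic(detections, ego_vx, ego_vy, tol=0.5):
    """Partition radar detections by Doppler consistency with ego-motion.

    detections: list of (azimuth_rad, measured_radial_velocity_mps)
    tol: assumed residual threshold in m/s (illustrative value)
    """
    static, dynamic = [], []
    for azimuth, v_r in detections:
        residual = abs(v_r - expected_static_doppler(azimuth, ego_vx, ego_vy))
        (static if residual <= tol else dynamic).append((azimuth, v_r))
    return static, dynamic


# Ego vehicle driving straight ahead at 10 m/s.
detections = [
    (0.0, -10.0),                # wall straight ahead: closes at ego speed
    (math.radians(60.0), -5.1),  # pole to the side: closes at about 10*cos(60 deg)
    (0.0, -2.0),                 # lead vehicle moving at about 8 m/s
]
static, dynamic = split_static_dynamic(detections, ego_vx=10.0, ego_vy=0.0)
```

Only the detections classified as static would then be passed on to data association, so that moving objects do not generate spurious correspondences.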

Overall, data association has evolved from purely geometric matching toward multi-modal, semantic, and motion-aware strategies, where the integration of complementary sensing modalities—particularly radar—plays a key role in improving robustness, reliability, and long-term consistency in real-world autonomous driving scenarios.

Another topic analyzed in the localization section corresponds to loop closure. This stage focuses on the methods used to recognize previously visited locations and correct accumulated pose drift. It represents a critical step for maintaining global map consistency. The reviewed literature reveals a progression from geometry-driven loop detection to semantic and descriptor-based approaches that improve reliability across temporal and environmental variations. Traditional SLAM systems [16,19,26] rely on geometric consistency checks, detecting loop closures by matching local LiDAR submaps or visual landmarks based on spatial overlap. While effective in static environments, these methods degrade under long-term changes or when dynamic elements alter the scene’s structure. Moreover, geometric descriptors can fail under significant viewpoint or illumination differences.

Visual SLAM frameworks [18,22,29] employ appearance-based place recognition, using global image descriptors to identify revisited scenes. These methods are computationally efficient and allow for large-scale loop detection, but they remain vulnerable to lighting, weather, and seasonal variations. Some studies mitigate these issues by combining visual and geometric cues, improving resilience across different conditions. Recent works [21,27,36,37] incorporate semantic information into loop closure detection, enabling place recognition based on meaning rather than appearance. Ref. [21] uses semantic graphs to compare scene layouts at the object level, identifying revisited locations even under appearance change. Ref. [27] proposes semantic keypoint descriptors that remain invariant to viewpoint and illumination, improving long-term loop recognition. Ref. [37] exploits scene-level semantic embeddings to perform global matching between current observations and map segments, enhancing robustness under dynamic and evolving environments. These methods demonstrate that incorporating semantics into loop closure allows for condition-invariant place recognition, a cornerstone of long-term mapping.

Advanced works [36,39] extend loop closure to probabilistic frameworks, where loop hypotheses are treated as probabilistic events integrated into the pose graph optimization process. This allows uncertain loop candidates to be verified or rejected based on global consistency criteria. In addition, semantic and geometric constraints can be jointly optimized, improving both map accuracy and long-term stability. Such models reflect the emerging paradigm of semantic graph SLAM, where loop closure is not merely spatial but semantic–topological.
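One simple way to verify or reject an uncertain loop candidate is a chi-square gate on the Mahalanobis distance between the relative pose implied by the current estimate and the one measured by the loop detector. The sketch below assumes a planar pose (x, y, yaw) with a diagonal covariance; it is a pairwise consistency check standing in for the full pose graph optimization used in the surveyed works, and all names and numbers are illustrative.

```python
import math

CHI2_3DOF_95 = 7.815  # 95% quantile of the chi-square distribution, 3 DoF


def wrap_angle(a):
    """Wrap an angle to [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))


def mahalanobis_sq(predicted, measured, sigmas):
    """Squared Mahalanobis distance for (x, y, yaw) with a diagonal covariance."""
    dx = measured[0] - predicted[0]
    dy = measured[1] - predicted[1]
    dyaw = wrap_angle(measured[2] - predicted[2])
    return sum((r / s) ** 2 for r, s in zip((dx, dy, dyaw), sigmas))


def accept_loop(predicted, measured, sigmas, gate=CHI2_3DOF_95):
    """Accept a loop candidate only if it is consistent with the current estimate."""
    return mahalanobis_sq(predicted, measured, sigmas) <= gate


predicted = (1.0, 0.0, 0.0)      # relative pose implied by the current graph
sigmas = (0.5, 0.5, 0.1)         # assumed standard deviations (m, m, rad)
candidate_a = (1.3, -0.2, 0.05)  # close to the prediction
candidate_b = (4.0, 2.0, 1.0)    # far from the prediction
accepted = [accept_loop(predicted, c, sigmas) for c in (candidate_a, candidate_b)]
```

A semantic consistency term (for example, agreement of object classes between the two places) could be added to the same test as an additional residual.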

Finally, the last aspect of this analysis concerns the datasets used for validation. The reviewed works reveal a strong reliance on publicly available benchmark datasets to validate semantic mapping and localization frameworks, reflecting the community’s emphasis on reproducibility and comparative evaluation. Among these, the KITTI dataset [48] remains the most frequently employed benchmark, appearing in the majority of works, such as [16,20,21,32,36,38], due to its rich multi-modal configuration (LiDAR, stereo camera, GPS, and IMU) and its widespread use in odometry and SLAM evaluation. Its versatility allows testing both geometric and semantic tasks under realistic urban driving conditions. More recent studies, such as [17,24], expand validation using custom datasets or institutional data collections, reflecting an emerging need to evaluate algorithms in domain-specific or dynamic environments beyond the constraints of fixed benchmarks. Works like [23] leverage nuScenes [49], which provides high-frequency multi-sensor data and dense semantic annotations, ideal for assessing real-time semantic SLAM and object-level mapping performance.

A smaller number of studies [20,26,32] utilize self-collected or synthetic datasets, particularly to assess temporal consistency and multi-modal calibration performance, which are often under-represented in public datasets. This diversification trend indicates a shift toward dataset specialization, aiming to benchmark algorithms under diverse environmental, temporal, and sensor conditions. Section 7.1 further expands on other datasets employed for the validation of long-term localization.

3. Multi-Modal Perception

Real-world autonomous driving systems require perception and localization modules capable of operating under strict real-time constraints. These constraints impose limits on latency, computational cost, and memory usage, requiring a careful balance between localization accuracy and processing speed. While complex deep learning models can improve robustness and semantic understanding, their computational requirements may challenge real-time deployment unless optimized through efficient architectures or hardware acceleration.

Autonomous vehicles rely on a diverse suite of complementary sensors that collectively enable the perception and interpretation of their environment. These sensing modalities convert external events and environmental variations into measurable signals suitable for subsequent processing. When multiple sensors are integrated—particularly for perception-oriented tasks—calibration becomes a fundamental requirement, as it ensures the correct spatial and temporal alignment of the information captured by each device. This calibration step constitutes one of the initial and most critical components in the construction of semantic maps under long-term operational conditions, especially within multi-modal systems, as illustrated in Figure 2. Achieving reliable and accurate fusion across heterogeneous sensors demands precise transformations between their respective reference frames and acquisition timelines. Consequently, a robust camera–LiDAR calibration pipeline capable of jointly addressing spatial and temporal misalignment is essential, and its components are presented in the following subsections. A detailed taxonomy of the multi-modal calibration aspects discussed next is provided in Figure 3.

3.1. Extrinsic Calibration

Extrinsic calibration aims to determine the parameters that relate the 3D coordinate frame of the LiDAR sensor to that of the camera, that is, the relative position and orientation between the two sensors, so that both modalities observe and represent the environment within a common spatial reference frame. This alignment enables the reliable fusion of 3D spatial data from LiDAR with the rich semantic information provided by cameras, which is fundamental for high-level tasks like object detection, semantic segmentation, and scene understanding. Errors in extrinsic calibration can lead to spatial misalignments that compromise perception accuracy and system safety.

To address this, a wide range of extrinsic calibration methods have been developed in the literature. These can be broadly categorized into target-based, targetless, and learning-based approaches, as depicted in Figure 4. Target-based methods rely on artificial calibration patterns—such as checkerboards, planar boards, or trihedral targets—that are visible to both sensors and facilitate accurate estimation of the relative transformation. In contrast, targetless methods exploit natural features or environmental structures, eliminating the need for specific calibration objects and enabling more flexible deployment. Recently, deep learning-based methods have emerged, which leverage data-driven models to predict extrinsic parameters directly from raw sensor data, often improving robustness and enabling online or self-calibration capabilities. Each of these approaches presents trade-offs in terms of accuracy, practicality, and computational complexity, depending on the application and operational constraints.

The comparative analysis of LiDAR–camera extrinsic calibration methods, as summarized in Table 2, reveals a clear evolution in methodology from traditional target-based techniques to targetless and deep learning approaches. Target-based methods [50,51,52,53,54,55,56,57,58,59,60,61] rely on predefined calibration targets such as checkerboards, spheres, or specialized boards (Figure 4a), offering high precision and repeatability, particularly in controlled environments. These methods generally exhibit low to medium scene dependency and medium levels of automation, with no inherent online calibration capability, making them suitable for offline, lab-based settings. Their key advantages often lie in their geometric rigor and the ability to handle specific calibration challenges—for example, using line and plane correspondences [50], genetic algorithms [53], or custom-designed boards [60,61].

In the camera image, the checkerboard corners or target markers are extracted with sub-pixel accuracy. In the LiDAR point cloud, the same planar surface or reflective elements of the target are identified using geometric cues such as plane fitting or intensity responses. Because the physical structure of the target is known, these detected features establish correspondence pairs between 2D image coordinates and 3D LiDAR points. Using these cross-modal correspondences, different calibration algorithms solve for the extrinsic transformation (rotation R and translation t) that aligns the LiDAR coordinate frame with the camera frame. This is typically formulated as a geometric optimization problem (e.g., minimizing reprojection error).
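The reprojection error that such an optimization minimizes can be sketched as follows. The example assumes an ideal pinhole camera without lens distortion, known intrinsics (fx, fy, cx, cy), and LiDAR axes already oriented like the camera axes so that the true rotation is the identity; all numeric values are made up for illustration, and the optimizer that searches over R and t is not shown.

```python
import math


def project(point_lidar, R, t, fx, fy, cx, cy):
    """Map a 3D LiDAR point to pixel coordinates with a pinhole model."""
    # Rigid transform into the camera frame: p_cam = R * p_lidar + t
    x, y, z = (sum(R[i][j] * point_lidar[j] for j in range(3)) + t[i]
               for i in range(3))
    return fx * x / z + cx, fy * y / z + cy


def mean_reprojection_error(points_lidar, pixels, R, t, fx, fy, cx, cy):
    """Average pixel distance between projected LiDAR points and image features."""
    errors = []
    for p, (u_obs, v_obs) in zip(points_lidar, pixels):
        u, v = project(p, R, t, fx, fy, cx, cy)
        errors.append(math.hypot(u - u_obs, v - v_obs))
    return sum(errors) / len(errors)


R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_true = (0.10, -0.05, 0.0)                      # hypothetical true offset in metres
K = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)  # hypothetical intrinsics
points = [(1.0, 0.5, 5.0), (-1.0, 0.2, 4.0), (0.3, -0.4, 6.0)]

# Synthetic "detected" image features generated with the true extrinsics.
pixels = [project(p, R_identity, t_true, **K) for p in points]

err_true = mean_reprojection_error(points, pixels, R_identity, t_true, **K)
err_zero = mean_reprojection_error(points, pixels, R_identity, (0.0, 0.0, 0.0), **K)
```

The error vanishes at the true extrinsics and grows to several pixels when the translation is ignored, which is the signal a calibration solver exploits.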

However, target-based methods typically require manual intervention and are not practical for in-the-field or real-time applications. In contrast, targetless methods [62,63,64,65,66,67,68,69] eliminate the need for artificial markers by leveraging features from the natural environment, thus supporting more automated calibration pipelines and enabling online capabilities in some cases [65,68]. In this approach, geometric structures, such as edges, planes, road markings, or object boundaries visible in both the LiDAR point cloud and the camera image, are automatically extracted (Figure 4b). These scene-derived features are then matched across the two modalities to establish correspondences. By optimizing the alignment between the projected LiDAR points and the corresponding visual features in the image, the method estimates the rotation and translation between the sensors.

These methods exhibit medium to high scene dependency, as their performance depends on the richness of the scene geometry or motion cues. Their strengths include easier deployment in real-world or dynamic scenarios, as seen in works that utilize mutual information [64], object pose estimation [66], or motion constraints [67]. Notably, some approaches support online calibration across a vehicle’s lifetime [65] or full automation of multi-camera + LiDAR systems [69]. Despite this flexibility, their accuracy may degrade in textureless or geometrically poor scenes, and they often involve more complex algorithmic design compared to target-based methods.

Deep learning-based approaches [70,71,72,73,74,75,76] represent the most recent trend, aiming for fully automatic and highly generalizable calibration without requiring explicit correspondences or markers. These methods typically exhibit medium scene dependency, high automation, and, in several cases, online capabilities [70,71,75]. Their main advantages lie in leveraging learned features or semantics to perform calibration in complex or unstructured environments. For example, some models incorporate causal networks [70], semantic segmentation [74], or cost–volume approaches [73] to improve robustness. While they offer significant practical benefits—particularly in self-driving or multi-sensor robotic systems—their effectiveness strongly depends on training data quality and coverage, and they may suffer from domain shift issues or lack of interpretability. Figure 4c depicts a deep learning-based extrinsic calibration pipeline, where a neural network learns to estimate the rotation and translation between a LiDAR and a camera directly from data. The pipeline begins by transforming the raw inputs—projecting the 3D point cloud into the camera frame and generating image-like representations. These transformed inputs pass through feature extraction networks that encode geometric and visual cues. The extracted features are then matched in a shared latent space, and a regression module predicts the extrinsic parameters. A loss function supervises the training by enforcing geometric consistency between modalities.

In summary, the progression from target-based to deep learning methods illustrates a trade-off between accuracy and practicality. Target-based methods remain the gold standard for high-precision calibration in controlled environments, whereas targetless and learning-based methods offer increasing levels of automation, online adaptability, and scalability, which are essential for modern, autonomous, and dynamic systems.

3.2. Temporal Synchronization

Temporal calibration is the process of determining and compensating for time offsets or synchronization delays between data streams originating from multiple sensors operating at different frequencies and subject to varying latencies. Accurate temporal synchronization is critical in multi-sensor fusion and SLAM systems, as even small timing discrepancies can cause inconsistent data alignment, resulting in localization drift or degraded map quality. Several methods have been developed to address this. Based on Table 2, they can be categorized as hardware-based synchronization methods, software-based synchronization methods, and joint spatial–temporal calibration and motion compensation methods in LiDAR-based systems.

Hardware synchronization methods, labeled as (H), rely on physical triggering mechanisms or shared clock signals to ensure simultaneous data capture across sensors. Common implementations include:

Trigger pulses sent from a master device (often the camera) to slave sensors (e.g., LiDAR or IMU) to initiate acquisitions simultaneously.
GPS-disciplined clocks (PPS–Pulse Per Second) that provide a global reference time for all sensors.
Time synchronization protocols such as IEEE 1588 Precision Time Protocol (PTP) or Network Time Protocol (NTP) for distributed setups.

Hardware triggering, PPS signals, and PTP provide high precision (typically in the microsecond range) but require specialized hardware and wiring, whereas NTP generally reaches only millisecond-level accuracy. Methods like [61,63] employ hardware synchronization to minimize latency and ensure deterministic acquisition timing.

Software synchronization, labeled as (S), aligns sensor data post-acquisition based on message timestamps recorded by the host system. The most common strategy associates sensor measurements using message header timestamps, often managed through frameworks such as ROS message filters [77]. This method selects the most recent message as a pivot and matches other data streams within a defined temporal window.
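A minimal sketch of timestamp-based association is given below. It is not the ROS message filter implementation; it simply pairs each message of a pivot stream with the nearest message of a second stream and keeps the pair only if the gap falls inside a tolerance window. The function name, the choice of the LiDAR stream as pivot, and the 10 ms window are assumptions for illustration.

```python
from bisect import bisect_left


def match_by_timestamp(pivot_stamps, other_stamps, max_dt):
    """Pair each pivot timestamp with the nearest one in a sorted second stream.

    Returns (pivot_index, other_index) pairs whose gap is at most max_dt seconds.
    """
    pairs = []
    for i, t in enumerate(pivot_stamps):
        k = bisect_left(other_stamps, t)
        # Candidates: the neighbor just before and just after t.
        candidates = [j for j in (k - 1, k) if 0 <= j < len(other_stamps)]
        if not candidates:
            continue
        j = min(candidates, key=lambda idx: abs(other_stamps[idx] - t))
        if abs(other_stamps[j] - t) <= max_dt:
            pairs.append((i, j))
    return pairs


camera_stamps = [0.000, 0.033, 0.067, 0.100, 0.133]  # about 30 Hz
lidar_stamps = [0.005, 0.105, 0.205]                  # about 10 Hz
pairs = match_by_timestamp(lidar_stamps, camera_stamps, max_dt=0.010)
```

The third LiDAR message has no camera frame within the window and is dropped, which illustrates how a narrow window trades data throughput for tighter alignment.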

Joint spatial–temporal calibration, labeled as (J), refers to the process of simultaneously estimating both the spatial (extrinsic) parameters (the 3D rotation and translation between sensors) and the temporal offset Δt between their data streams within a single optimization framework. Refs. [64,65,66,67,68,69,70,71,72,73,74,75,76] explicitly or implicitly adopt joint spatial–temporal calibration techniques.

Target-based methods generally disregard temporal synchronization because they are performed in controlled, static conditions where data are captured while the calibration target remains still. Each frame is processed independently, and the sensors are not continuously moving or acquiring time-sensitive data, making precise temporal alignment unnecessary. The main objective of these methods is to estimate accurate spatial transformations (rotation and translation) between sensors rather than synchronize their data streams over time, which is why temporal synchronization is typically marked as not required.

Due to LiDAR’s scanning nature, motion-induced distortions occur when the platform moves during acquisition. Motion compensation techniques correct for this by estimating the vehicle’s pose over the scan period and transforming each LiDAR point into a common reference frame. This correction ensures consistency across multi-sensor fusion tasks. Many joint calibration frameworks integrate motion compensation as part of temporal alignment, e.g., [66,68,73].
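Motion compensation can be sketched under strong simplifying assumptions: planar motion, constant forward speed `v` and yaw rate `omega` during the sweep, and a per-point capture time `tau` relative to the sweep start. The function names are hypothetical; real pipelines usually interpolate poses from odometry or IMU data in 3D.

```python
import math


def pose_at(tau, v, omega):
    """Planar pose (x, y, yaw) after tau seconds of constant speed and yaw rate."""
    yaw = omega * tau
    if abs(omega) < 1e-9:
        return v * tau, 0.0, yaw
    return v / omega * math.sin(yaw), v / omega * (1.0 - math.cos(yaw)), yaw


def deskew(points, v, omega):
    """Express every point in the sensor frame at the start of the sweep.

    points: list of (x, y, tau), where tau is the capture time in seconds
            relative to the sweep start and (x, y) is in the sensor frame at tau.
    """
    corrected = []
    for x, y, tau in points:
        px, py, yaw = pose_at(tau, v, omega)
        c, s = math.cos(yaw), math.sin(yaw)
        corrected.append((c * x - s * y + px, s * x + c * y + py))
    return corrected


# A static landmark 10 m ahead, seen at the start and at the end of a 0.1 s
# sweep while driving straight at 10 m/s: the raw measurements differ by 1 m.
scan = [(10.0, 0.0, 0.0), (9.0, 0.0, 0.1)]
corrected = deskew(scan, v=10.0, omega=0.0)
```

After correction both measurements coincide at 10 m, which removes the motion-induced smear before the scan is fused with camera data.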

Temporal synchronization also plays a critical role in loop closure detection within SLAM pipelines, especially in long-term operations. Loop closure involves recognizing previously visited locations and correcting accumulated pose drift by realigning the current estimate with historical data. As will be discussed in Section 6, robust loop closure heavily depends on the precise temporal alignment of multi-modal data, which in turn influences the reliability and consistency of long-term localization.

4. Semantic Mapping

According to Figure 2, the second task corresponds to the semantic mapping strategy. Geometric maps have been widely adopted in SLAM systems due to their high precision and spatial accuracy. However, they present several limitations when applied to long-term localization, particularly in dynamic and evolving environments. Geometric features—such as corners, edges, and planes—are prone to change or even disappear over time as a result of dynamic objects (e.g., vehicles, pedestrians), structural modifications (e.g., construction or new buildings), and environmental variations (e.g., seasonal foliage, snow, lighting, and shadows). These factors undermine the reliability of purely geometric representations for sustained operations.

To overcome these challenges, it becomes essential to incorporate environmental context for a more robust and interpretable understanding of the scene—this is where semantic maps play a pivotal role. Semantic mapping involves building spatial representations that not only capture the geometry of the environment but also encode high-level information about scene elements, such as object categories, identities, and inter-object relationships.

By enriching maps with semantic information, autonomous systems achieve greater robustness to appearance changes and improved scene interpretation, enabling more reliable localization and navigation over extended time periods. Based on Table 1, we propose a reclassification of semantic mapping approaches according to the sensing modality and the adopted strategy for perception and representation. The resulting categories are vision-based, LiDAR-based, fusion-based, and graph-based methods, as presented in Figure 5.

4.1. Vision-Based Semantic Mapping

Visual SLAM, which relies primarily on camera sensors, has gained significant attention due to its low cost, ability to provide rich scene understanding, and compact hardware requirements. However, traditional visual SLAM systems [78,79,80,81,82] are predominantly based on geometric features—such as keypoints, edges, and landmarks—which makes them highly sensitive to changes in illumination, occlusions, dynamic elements, and long-term environmental variations. To overcome these limitations, recent advancements have introduced semantic visual SLAM, which integrates high-level scene understanding into the SLAM pipeline using deep learning techniques.

In particular, recent research has focused on integrating convolutional neural networks (CNNs) and segmentation architectures into the SLAM pipeline, enabling richer scene interpretation and enhanced robustness.

As observed in Table 3, the majority of recent works, such as those by [83,84,85], utilize deep CNNs or segmentation-based networks that strengthen the system’s ability to track and reconstruct scenes even under motion blur or dynamic conditions. Weakly supervised learning approaches, such as the work presented in [86], have also emerged to reduce labeling effort, although they typically offer lower adaptability to diverse environments.

A critical aspect of semantic SLAM is object detection, which provides the foundation for object-oriented mapping and semantic labeling. Among the methods compared, YOLO and R-CNN stand out as the two predominant detection frameworks. YOLO, employed in [87,88,89], offers high-speed detection suitable for real-time applications, consistent with its single-pass inference design that captures global context efficiently. Conversely, R-CNN and Faster R-CNN [33,90,91] achieve higher localization accuracy and instance-level recognition but at a greater computational cost, making them less practical for real-time embedded systems. Some earlier approaches [33,83,84,86,92,93,94] omit explicit object detection and instead rely on pixel-wise segmentation or feature-based semantics, which limits their ability to handle dynamic scenes and moving objects effectively.

Table 3. Comparative analysis of vision-based semantic mapping and SLAM approaches in dynamic environments.

Ref.	Learning	Object Detection	Env.	Dynamic Handling	Mapping Type	Real Time	Accuracy/Robustness	Efficiency	Scalability
[92]	CNN (2D)	None	Outdoor	x	Semi-dense 3D	Partially	Basic semantic consistency	Lightweight	Early-stage baseline
[93]	Deep CNN	None	Indoor	Partially	2D semantic	✓	Reliable short-term accuracy	Real-time feasible	Limited to small indoor setups
[83]	Deep segmentation	None	Indoor/outdoor	✓	3D semantic	✓	Strong against motion blur	GPU-efficient	Handles dynamic motion
[33]	None	Faster R-CNN	Indoor/outdoor	x	Object-level 3D	✓	Good object localization	Efficient computation	Interpretable structure
[86]	Weakly supervised CNN	None	Indoor/dynamic	✓	Semi-dense	Partially	Acceptable with fewer labels	Low training cost	Moderate adaptability
[94]	Mask R-CNN	Mask R-CNN	Indoor/dynamic	✓	3D semantic	Partially	Accurate instance masking	Medium real-time	Moderate generalization
[84]	Deep segmentation	None	Indoor/dynamic	✓	Dense 3D	✓	Excellent motion robustness	Real-time GPU	Stable in moderate dynamics
[85]	Deep CNN	None	Indoor	✓	Dense 3D	✓	Stable tracking in motion	Efficient CNN pipeline	Indoor adaptability
[87]	CNN	YOLOv3	Outdoor large-scale	Partially	Dense 3D	✓	High semantic precision	Real-time optimized	Excellent for road scenes
[90]	CNN	Faster R-CNN	Indoor static	x	Object-level	✓	Strong semantic association	Real-time feasible	Good indoor scalability
[88]	CNN segmentation	YOLOv3	Indoor multi-robot	Partially	Shared 3D	✓	Reliable data fusion	Distributed real-time	High multi-agent scalability
[95]	CNN segmentation	None	Dynamic scenes	✓	Sparse 3D	Partially	Accurate dynamic removal	Medium processing speed	Adaptable to motion levels
[89]	Semantic + stereo depth	YOLOv4	Outdoor dynamic	✓	Dense 3D	✓	High spatial accuracy	Real-time stereo	Robust outdoor adaptability
[91]	Instance segmentation	Mask R-CNN	Outdoor dynamic	✓	Dense 3D	✓	Excellent dynamic robustness	Real-time optimized	Suitable for complex scenes
[96]	Hybrid heuristic segmentation	None	Indoor dynamic	✓	Dense 3D	✓	Strong tracking robustness	Optimized pipeline	Adaptive to scene changes

x: The method does not take into account scene dynamics.

In terms of environmental adaptability and dynamic handling, evolution across the referenced studies is evident. Early frameworks were primarily tested in controlled or static indoor/outdoor settings ([92,93]), while newer methods explicitly address dynamic environments [84,89,91,94]. These modern approaches integrate motion segmentation, temporal tracking, or depth-based filtering to maintain map consistency in the presence of moving objects. For example, [89,94] align conceptually with the works of Kang et al. [27] and Li et al. [32], who proposed dynamic-aware and sensor-fusion-based systems capable of recognizing and reconstructing moving entities within the environment. Such methods demonstrate the transition from static-feature-matching pipelines to motion-aware semantic perception systems.

The mapping strategies in the reviewed literature also reveal an important progression—from semi-dense or object-level mapping [33,86,92] to dense and shared 3D semantic representations [84,85,87,88,89,90,95]. Dense 3D mapping enhances geometric precision and allows for the inclusion of semantic classes at the voxel or instance level. Object-level maps, such as those presented by Nakajima et al. [33], improve human interpretability and facilitate higher-level reasoning for navigation and task execution. Meanwhile, distributed or shared mapping architectures ([88]) extend these benefits to multi-agent systems, improving scalability and cooperative perception among autonomous vehicles or robots.

When analyzing performance metrics—namely, accuracy, efficiency, and scalability—deep-segmentation-based systems [84,91,96] exhibit superior robustness to motion and dynamic changes. GPU-optimized pipelines and real-time processing capabilities [84,87,89] ensure that such methods are applicable to autonomous driving scenarios, where computational efficiency is critical. Scalability is particularly evident in modular and distributed systems [83,85,94], which leverage stereo or multi-sensor configurations to handle large-scale outdoor environments. This trend resonates with the stereo-based approach of Yang et al. [37], where integrating disparity-based depth estimation and motion segmentation significantly improved both accuracy and scalability.

In summary, the comparative table highlights a clear evolution from purely geometric SLAM toward semantic, dynamic, and scalable mapping frameworks. The integration of object detection, semantic segmentation, and motion modeling has enabled robust and context-aware mapping suitable for complex, real-world environments. Current trends point toward hybrid systems combining deep learning, multi-sensor fusion, and distributed computation, which form the foundation for next-generation semantic SLAM solutions in autonomous vehicles.

4.2. LiDAR-Based Semantic Mapping

With the increasing adoption of LiDAR sensors in autonomous vehicles, point-cloud-based semantic segmentation and mapping techniques have become a critical component for robust 3D scene understanding. Unlike image-based SLAM systems that rely on 2D pixel data, LiDAR-based methods directly process 3D spatial information, providing superior geometric precision and resilience to illumination changes. The works summarized in Table 4 ([97,98,99,100,101,102,103]) represent the evolution of deep learning models designed to efficiently interpret, segment, and represent large-scale LiDAR point clouds for autonomous navigation and semantic mapping.

The pioneering PointNet framework by Qi et al. [97] marked a fundamental shift in how 3D point sets are processed. Instead of projecting 3D data into voxel grids or image-like representations, PointNet introduced a permutation-invariant neural architecture that directly consumes raw point clouds for classification and segmentation tasks. Its successor, PointNet++ [98], extended this concept by introducing hierarchical feature learning capable of capturing both local and global geometric context, improving segmentation accuracy in complex and non-uniform point distributions. Despite their conceptual elegance, PointNet-based models face challenges in large-scale or real-time applications due to their high computational demand and limited exploitation of spatial locality.

Building on the need for real-time performance, Wu et al. [99] proposed SqueezeSeg, which employs convolutional neural networks combined with recurrent conditional random fields (CRFs) for real-time road–object segmentation. By transforming 3D LiDAR point clouds into range images, SqueezeSeg achieves significant computational efficiency while maintaining robust object recognition in driving scenarios. However, its reliance on 2D spherical projections can introduce artifacts and loss of spatial precision, particularly in dense urban scenes with occlusions. To address this, RangeNet++ [100] refined this approach with a more accurate range-image representation and optimized postprocessing, achieving both high inference speed and strong segmentation quality in large-scale driving datasets.
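The range-image representation used by projection-based networks can be illustrated with a minimal Python sketch. It assumes a commonly used spherical projection; the function name `project_to_range_image`, the tiny image size, and the vertical field-of-view values are illustrative assumptions, not parameters taken from [99] or [100].

```python
import math

def project_to_range_image(points, height=4, width=8,
                           fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (x, y, z) points onto a height x width range image.

    Each pixel keeps the range of the closest point that falls into it,
    or None when no point does. All parameters are illustrative.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[None] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue
        yaw = math.atan2(y, x)        # azimuth in [-pi, pi]
        pitch = math.asin(z / r)      # elevation angle
        u = 0.5 * (1.0 - yaw / math.pi) * width          # column coordinate
        v = (1.0 - (pitch - fov_down) / fov) * height    # row coordinate
        col = min(width - 1, max(0, int(math.floor(u))))
        row = min(height - 1, max(0, int(math.floor(v))))
        # Several points can land in one pixel; only the nearest survives.
        if image[row][col] is None or r < image[row][col]:
            image[row][col] = r
    return image
```

Because several 3D points can fall into the same pixel and only one range value is kept, the sketch also shows where the loss of spatial precision mentioned above comes from.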

Further advancements in point cloud representation led to PolarNet [101], which introduced a polar-grid encoding to preserve spatial relationships and achieve more uniform coverage of the LiDAR field of view. This model balances accuracy and computational cost, offering an efficient alternative to Cartesian voxelization while maintaining real-time feasibility. Similarly, SalsaNext [102] advanced semantic segmentation with uncertainty-aware prediction mechanisms, allowing the system to better handle ambiguous or noisy regions within LiDAR scans—an essential feature for safety-critical applications such as autonomous driving.

Finally, the recent Spherical Transformer framework proposed by Lai et al. [103] demonstrates the next generation of LiDAR-based semantic understanding through transformer-based architectures. By leveraging global attention mechanisms, it captures long-range spatial dependencies across the spherical projection domain, significantly improving recognition accuracy over traditional convolutional models. This represents a shift toward hybrid architectures that combine the interpretability and efficiency of convolutional encoders with the contextual modeling capabilities of transformers.

In summary, these LiDAR-based semantic mapping methods illustrate a clear progression from direct point processing (PointNet, PointNet++) to efficient projection-based and transformer-driven architectures (SqueezeSeg, RangeNet++, PolarNet, SalsaNext, Spherical Transformer). The evolution emphasizes three main trends: (1) the pursuit of real-time inference suitable for autonomous navigation; (2) the enhancement of contextual and uncertainty modeling for safety-critical perception; and (3) the integration of advanced deep architectures to achieve scalable and accurate 3D semantic mapping. Together, these advancements form the foundation for robust LiDAR-based perception systems that complement vision-based SLAM approaches in modern autonomous platforms.

4.3. Sensor-Fusion-Based Semantic Mapping

Sensor-fusion-based semantic mapping leverages the complementary strengths of multiple sensor modalities—such as LiDAR, RGB cameras, IMUs, and depth sensors—to create maps that are both geometrically accurate and semantically rich. This approach significantly enhances scene understanding, which is particularly beneficial in complex, dynamic, and long-term environments. Compared to single-modality systems, sensor fusion provides notable advantages in key aspects such as semantic labeling accuracy, robustness to environmental changes, handling of occlusion and sparsity, spatial consistency, and semantic completeness.

Table 5 reveals a clear progression toward deep, adaptive architectures that effectively integrate complementary sensory data to enhance 3D perception and scene understanding. Early voxel-based fusion models such as the multi-modal RGB-D–LiDAR–IMU framework [18] demonstrated strong semantic–metric alignment but were computationally expensive and constrained to small-scale mapping. In contrast, transformer-based fusion frameworks, including the RGB-X model (CMX) presented in [104] and hierarchical multi-modal transformers presented in [105], achieved superior contextual reasoning and large-scale semantic consistency by exploiting attention mechanisms that capture long-range cross-modal dependencies. LiDAR–camera fusion remains the dominant paradigm due to its balance between geometric precision and semantic richness, while the inclusion of radar, depth, and IMU sensors in recent studies [106,107] further enhances robustness under adverse weather, illumination changes, and occlusions. Regarding real-time capability, perception-aware CNN fusion models [108] and late-fusion architectures such as LiF-Seg [109] demonstrate near real-time operation via GPU-optimized inference pipelines, making them suitable for autonomous driving and outdoor mapping applications. In terms of accuracy, multi-modal CNNs [110] and voxelized fusion schemes [111] deliver reliable object-level segmentation and high semantic completeness, particularly in dynamic or cluttered scenes. Nevertheless, transformer-heavy approaches [104,105] remain computationally demanding, motivating the use of attention pruning and sparse convolutions to enhance scalability. Notably, frameworks such as HDMapNet [112] and multi-LiDAR fusion pipelines [113] extend these capabilities to city-scale and high-definition mapping, offering fine-grained semantic reasoning and efficient map generation. 
Overall, the most effective models achieve an equilibrium between semantic richness, computational efficiency, and scalability, enabling robust operation across diverse and long-term environments. This evolution highlights the pivotal role of multi-modal deep fusion in advancing reliable, context-aware, and semantically consistent mapping for autonomous systems.

In summary, sensor-fusion-based semantic mapping has evolved toward architectures that not only integrate geometric and semantic cues but also dynamically adapt to environmental variability and long-term changes. By leveraging advanced deep learning mechanisms—particularly transformer-based and perception-aware fusion networks—recent approaches have demonstrated significant improvements in semantic completeness, contextual understanding, and robustness under adverse conditions. Nevertheless, challenges remain in achieving a balance between real-time performance and computational efficiency, especially for large-scale and resource-constrained applications.

Section 5 further expands on the concepts of semantic segmentation using multi-modal fusion, highlighting the advantages and specific contributions of each approach to perception tasks in autonomous vehicles.

5. 3D—Semantic Segmentation

The next task in the analysis presented in Figure 2 is three-dimensional (3D) semantic segmentation, which assigns a semantic label to every element of a 3D scene. This process enables machines to distinguish between specific categories with high spatial precision. Unlike object detection or classification, which focus on identifying discrete entities, semantic segmentation delivers dense, scene-wide understanding, providing a complete semantic map of the surroundings. With the rapid advancement of deep learning, learning-based segmentation strategies have outperformed conventional techniques, which has enabled new applications in the self-driving car context. In this section, semantic segmentation is divided into two parts depending on the sensor type, LiDAR-only 3D semantic segmentation and LiDAR–camera fusion, as presented in Figure 6.

5.1. LiDAR-Only 3D Semantic Segmentation

When performed using only LiDAR data, 3D semantic segmentation relies exclusively on the geometry and intensity information captured by LiDAR sensors.

In what follows, we highlight the unique strengths and architectural innovations of LiDAR-only models in addressing the challenges of point cloud processing. These models are classified into four major types: image-based, voxel-based, point-based, and fusion-based.

5.1.1. Image-Based Methods

Image-based methods for 3D semantic segmentation rely on 2D visual data—typically RGB images—to infer semantic labels for 3D points within a scene. Instead of directly operating on raw LiDAR point clouds, these approaches leverage the rich texture, color, and contextual cues present in images to guide semantic understanding, often projecting image-based predictions into 3D space using calibration data.

This strategy can improve semantic granularity while mitigating the sparsity and irregularity challenges of LiDAR data. Table 6 provides a comparative overview of recent deep learning architectures developed for LiDAR semantic segmentation based on images. These methods vary in their input representations, convolutional designs, and optimization strategies, each addressing different trade-offs between accuracy, computational efficiency, and generalization capability. The table organizes the approaches by network type, feature extraction scheme, and learning strategy, emphasizing how architectural choices—such as voxelization, polar/spherical projection, and BEV fusion—affect segmentation performance. The optimization techniques range from conventional supervised learning to uncertainty-aware modeling and joint spatial–semantic calibration, illustrating ongoing research efforts to improve robustness across diverse sensing and environmental conditions.

Sánchez-García et al. [114] proposed a domain generalization framework that geometrically propagates semantic labels across sequences of LiDAR scans using ICP-based alignment. This enables training on synthetic or manually annotated datasets and the transfer of knowledge to real-world domains without requiring retraining. While the method demonstrates strong cross-domain generalization, it remains sensitive to alignment drift—particularly in dynamic environments—where errors can accumulate over time and degrade performance.

In another approach, Qiu et al. [115] introduced a fusion pipeline that combines polar grid and Cartesian bird’s-eye-view (BEV) representations, achieving spatial consistency and computational efficiency. The use of convolutional operations over structured grid representations allows capturing local geometric details across large-scale scenes. However, these methods often require careful tuning of the voxel or grid resolution, which can be scene-dependent and impact generalization [116]. Cortinhal et al. [102] improved upon earlier methods like [118] by integrating image-based semantic information into the LiDAR segmentation pipeline. Their use of spherical projection reduces the sampling imbalance commonly observed in polar grid structures and incorporates Bayesian uncertainty modeling to enhance robustness in ambiguous or noisy environments. This fusion of camera and LiDAR data significantly enriches the semantic interpretation of the scene.

5.1.2. Voxel-Based Methods

Instead of projecting 3D point clouds into 2D images—thereby losing spatial resolution—another effective approach for preserving the 3D structure of the environment is to convert irregular point distributions into regular volumetric occupancy grids. Voxel-based methods discretize the 3D space into uniform volumetric elements, or voxels, which allows the application of 3D convolutional neural networks (CNNs) to learn both spatial and semantic patterns directly in three dimensions. Table 7 provides a comparative overview of recent deep learning architectures developed for LiDAR semantic segmentation based on voxels.

Voxel-based approaches enable robust scene understanding by aggregating local and global geometric context, offering a compromise between spatial representation and computational tractability. Early work by Maturana and Scherer [119] introduced voxel feature encoding for 3D object recognition, demonstrating the potential of volumetric CNNs but suffering from high memory consumption due to dense voxel grids. To address this, Komorowski et al. [120] employed sparse convolutions, significantly reducing memory usage and improving scalability for large-scale LiDAR mapping tasks.
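The difference between dense and sparse voxel storage can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names `voxelize` and `voxel_centroids` and the voxel size are hypothetical, and the per-voxel centroid is only a stand-in for the learned voxel features used in [119,120].

```python
import math
from collections import defaultdict

def voxelize(points, voxel_size=0.5):
    """Group points by integer voxel index; only occupied voxels are stored."""
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        voxels[key].append((x, y, z))
    return dict(voxels)

def voxel_centroids(voxels):
    """Mean point of each occupied voxel (a simple hand-crafted feature)."""
    return {key: tuple(sum(axis) / len(pts) for axis in zip(*pts))
            for key, pts in voxels.items()}
```

Storing only occupied cells, as the dictionary does here, is the idea that sparse convolutions exploit: memory grows with the number of occupied voxels rather than with the volume of the grid.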

More recent advancements have focused on improving spatial partitioning and computational efficiency. Zhou et al. [122] proposed Cylinder3D, which uses cylindrical coordinates to better handle the anisotropic distribution of LiDAR points, achieving superior performance in outdoor navigation scenarios. Tang et al. [124] introduced a hybrid point–voxel network optimized through neural architecture search, striking a strong balance between inference speed and segmentation accuracy. Similarly, Rosu et al. [121] utilized permutohedral lattices to process sparse data more efficiently while preserving neighborhood structure. Meanwhile, Milioto et al. [100] favored range image projection for real-time segmentation, trading off some geometric detail in favor of speed and efficiency.
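A cylindrical partition of the kind described for Cylinder3D can be sketched as follows. The function name `cylindrical_index` and all resolutions are illustrative assumptions; the sketch shows only the coordinate transform and binning, not the network of [122].

```python
import math

def cylindrical_index(point, rho_res=1.0, n_sectors=8, z_res=0.5, z_min=-3.0):
    """Map an (x, y, z) point to a (ring, sector, height) cell index."""
    x, y, z = point
    rho = math.hypot(x, y)      # horizontal distance to the sensor
    phi = math.atan2(y, x)      # azimuth in [-pi, pi]
    i_rho = int(rho // rho_res)
    # The modulo wraps phi == pi back into sector 0.
    i_phi = int((phi + math.pi) / (2.0 * math.pi) * n_sectors) % n_sectors
    i_z = int((z - z_min) // z_res)
    return i_rho, i_phi, i_z
```

With a fixed number of angular sectors, cells become wider as the ring index grows, so distant regions, where LiDAR returns are sparse, are covered by larger cells than regions close to the sensor.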

While voxel-based methods are highly competitive in LiDAR-based 3D semantic segmentation—thanks to their ability to model local geometry and leverage deep CNN architectures—they are still constrained by trade-offs between memory usage, spatial resolution, and computational cost. Careful architectural design, including multi-scale voxel grids, sparse convolutional layers, and coordinate-aware encoding, is crucial to maximize their effectiveness in large-scale or highly sparse environments.

5.1.3. Point-Cloud-Based Methods

Point-based methods for 3D semantic segmentation operate directly on raw point clouds without projecting them into intermediate voxel grids, images, or other structured formats. By doing so, these approaches preserve the original geometric fidelity and spatial precision of LiDAR or RGB-D data, avoiding the discretization artifacts and quantization errors typically introduced by voxelization or rasterization. The fundamental principle is to process each point’s coordinates and associated features—such as intensity, color, or surface normals—using neural networks specifically designed to handle unordered and irregular data distributions. Table 8 provides a comparative overview of recent deep learning architectures developed for LiDAR semantic segmentation based on points.

The work PointNet [97] introduced a deep learning architecture that processes each point independently through shared multilayer perceptrons (MLPs) and aggregates features using a symmetric function (e.g., max pooling) to ensure permutation invariance. Its successor, PointNet++ [98], extended this idea by incorporating hierarchical learning of local structures via neighborhood grouping and recursive feature extraction, significantly improving performance in complex scenes.
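The permutation invariance obtained from a shared per-point network followed by a symmetric pooling function can be checked with a small Python sketch. It uses a single linear layer with ReLU and fixed toy weights, which are assumptions chosen for illustration and not the architecture of [97].

```python
def shared_mlp(point, weights, biases):
    """Apply the same linear layer + ReLU to one point."""
    return [max(0.0, sum(w * p for w, p in zip(row, point)) + b)
            for row, b in zip(weights, biases)]

def global_feature(points, weights, biases):
    """Max-pool the per-point features over the whole set (symmetric function)."""
    per_point = [shared_mlp(p, weights, biases) for p in points]
    return [max(channel) for channel in zip(*per_point)]

# Toy weights (hypothetical values): 3 input coordinates -> 4 feature channels.
WEIGHTS = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0], [0.5, 0.5, 0.5], [-1.0, 0.0, 1.0]]
BIASES = [0.0, 0.1, -0.2, 0.0]

cloud = [(0.0, 1.0, 2.0), (1.0, -1.0, 0.5), (2.0, 0.0, -1.0)]
feat = global_feature(cloud, WEIGHTS, BIASES)
feat_shuffled = global_feature(list(reversed(cloud)), WEIGHTS, BIASES)
```

Because the maximum over a set does not depend on the order of its elements, `feat` and `feat_shuffled` are identical even though the points were presented in a different order.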

Further advancements have sought to enhance spatial modeling and computational efficiency. KPConv [126] introduced kernel-based continuous convolution directly on point clouds, enabling fine-grained spatial learning by computing weights based on distances between input points and a learned set of kernel points. However, this method incurs a high computational overhead due to neighborhood search and kernel interpolation, requiring substantial GPU memory. In contrast, RandLA-Net [127] prioritized efficiency by using random sampling combined with a lightweight local feature aggregation strategy. This approach avoids complex sampling techniques like farthest point sampling (FPS) and achieves impressive scalability, processing up to a million points per forward pass—making it well suited for real-time applications.
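The cost difference between farthest point sampling and random sampling can be made concrete with a short Python sketch. Both functions are simplified illustrations with hypothetical names; they are not the implementations used in [127].

```python
import math
import random

def farthest_point_sampling(points, k):
    """Greedy FPS: every new sample requires a pass over all N points."""
    selected = [0]
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(selected) < k:
        # Pick the point that is farthest from everything selected so far.
        idx = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(idx)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[idx]))
    return selected

def random_sampling(points, k, seed=0):
    """Uniform random subset: no distance computation at all."""
    rng = random.Random(seed)
    return rng.sample(range(len(points)), k)
```

FPS gives better spatial coverage but evaluates on the order of k times N distances, whereas random sampling needs none. This is why a random sampler has to be paired with a local feature aggregation step that compensates for the points it may discard.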

In recent years, transformer-based architectures have emerged as powerful tools for point cloud processing. Methods such as Point Transformer [129], Point Transformer V2 [130], and Point Transformer V3 [131] incorporate attention mechanisms to enable geometry-aware, adaptive, and long-range feature interactions. These models offer superior accuracy and contextual understanding by dynamically weighting neighboring points based on both spatial and semantic relevance. While transformer-based methods are more computationally intensive than their MLP- or kernel-based counterparts, their ability to model complex relationships across large-scale point clouds has made them a preferred choice in state-of-the-art research for indoor and outdoor 3D scene understanding.

In the context of autonomous driving, point-based methods are especially appealing due to their ability to process sparse and non-uniform LiDAR data effectively, adapt to varying point densities, and integrate complementary modalities (e.g., camera features). However, they still face challenges in scalability and memory usage when processing millions of points per frame in large-scale, real-time outdoor environments.

5.2. LiDAR–Camera Fusion 3D Semantic Segmentation

LiDAR–camera fusion methods for 3D semantic segmentation are designed to leverage the complementary strengths of both sensor modalities to enhance the accuracy and robustness of scene understanding. LiDAR sensors offer precise 3D geometric information and reliable depth measurements across a wide range of distances, while cameras provide high-resolution texture, color, and rich semantic cues that are essential for detailed perception. By combining these modalities, fusion-based approaches can generate semantic maps that are both geometrically accurate and semantically expressive.

These fusion strategies are generally categorized into three main types: early fusion, where raw data from both sensors is combined before feature extraction; deep fusion, where intermediate features are fused within the network architecture; and late fusion, where high-level predictions from each modality are merged at the output level. Each fusion paradigm presents trade-offs in terms of alignment requirements, computational complexity, and semantic consistency. Selecting the appropriate fusion strategy is crucial for achieving effective cross-modal learning, especially in dynamic, long-term autonomous driving scenarios [132].

5.2.1. Early Fusion

Early fusion, as depicted in Figure 7a, refers to the strategy of integrating LiDAR and camera data at the raw or minimally processed stage—prior to deep feature extraction by a neural network [133,134,135,136]. The core idea is to fuse the complementary spatial and visual information as early as possible, enabling the network to learn joint representations that capture correlations between geometric structure and visual appearance cues from the very beginning. This approach allows the efficient use of multi-modal information, often enhancing feature expressiveness in downstream tasks such as segmentation or object detection. However, it typically incurs higher computational costs due to the need for fine-grained alignment and processing of raw multi-sensor data.

In practice, early fusion is commonly realized by projecting sparse 3D LiDAR points onto the 2D image plane using known extrinsic and intrinsic calibration parameters, thereby associating each 3D point with RGB values or other image-derived features (e.g., semantic priors, edge information). Alternatively, dense image pixels can be back-projected into 3D space using LiDAR depth measurements, resulting in a colorized point cloud or voxel grid that retains spatial and semantic information. The fused representation can then be processed using point-based networks, voxel-based CNNs, or range-image-based architectures.
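The point-to-pixel association described above can be sketched in Python as follows. The sketch assumes an ideal pinhole camera with zero skew and no lens distortion; the function name `colorize_points` and the argument layout are hypothetical.

```python
def colorize_points(points_lidar, image, K, R, t):
    """Attach RGB values to LiDAR points that project inside the image.

    K: 3x3 intrinsic matrix (zero skew assumed), R: 3x3 rotation and
    t: translation mapping LiDAR coordinates into the camera frame,
    image: rows of (r, g, b) tuples. Returns (x, y, z, r, g, b) tuples.
    """
    height, width = len(image), len(image[0])
    fused = []
    for p in points_lidar:
        # Rigid transform into the camera frame: p_cam = R p + t
        pc = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
        if pc[2] <= 0.0:
            continue  # point lies behind the camera
        # Pinhole projection onto the image plane
        u = (K[0][0] * pc[0] + K[0][2] * pc[2]) / pc[2]
        v = (K[1][1] * pc[1] + K[1][2] * pc[2]) / pc[2]
        col, row = int(round(u)), int(round(v))
        if 0 <= col < width and 0 <= row < height:
            fused.append((*p, *image[row][col]))
    return fused
```

Any error in `R`, `t`, or `K` shifts the sampled pixel, so the colour attached to a point may come from a different object. This makes the calibration sensitivity of early fusion directly visible.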

Equation (1) describes the early fusion method for a network with $L + 1$ layers, where $M_{i}$ and $M_{j}$ represent two different modalities, $f_{l}$ stands for the feature map of the neural network at layer $l$, with $l \in \{1, 2, \dots, L\}$, and $f_{l}^{M_{i}}$ and $f_{l}^{M_{j}}$ are the feature maps of the two modalities $M_{i}$ and $M_{j}$ at layer $l$, respectively. $T_{l}(\cdot)$ represents the feature transformation function of layer $l$.

Let $f_{l + 1} = f_{l}^{M_{i}} \oplus f_{l}^{M_{j}}$, where $\oplus$ represents the data fusion operation.

$$f_{L} = T_{L}(T_{L - 1}(\dots T_{l}(\dots T_{2}(T_{1}(f_{0}^{M_{i}} \oplus f_{0}^{M_{j}}))))) \qquad (1)$$

Advantages of early fusion include the ability to learn cross-modal feature dependencies from the start, which can improve the detection of small, occluded, or distant objects and enhance performance in complex or low-visibility conditions. However, limitations include sensitivity to calibration errors, occlusion mismatches, and inconsistencies in resolution and data density between modalities. Consequently, early fusion is most effective in systems where sensors are well-calibrated and temporally synchronized, and where maintaining precise spatial correspondence is crucial.

5.2.2. Deep Fusion

Deep fusion, as depicted in Figure 7b, refers to the integration of LiDAR and camera data at intermediate stages of feature extraction within deep neural networks rather than fusing them only at the raw input level (early fusion) or at the final prediction stage (late fusion) [137,138,139]. In other words, it mixes multi-modal data in the feature space to obtain fused features, compensates for features missing in one modality with those of the other, and then applies the fused features to classification or regression tasks in the prediction stage. Equation (2) represents this process. Each modality first passes through its own feature extraction transformations $T_{l}^{M_{i}}$ and $T_{l}^{M_{j}}$, producing modality-specific representations from the raw inputs $f_{0}^{M_{i}}$ and $f_{0}^{M_{j}}$. These intermediate features are then fused by the operation $\oplus$, which combines the complementary information from both modalities to form a unified multi-modal feature representation. The fused features are subsequently propagated through the following network layers $T_{l + 1}, T_{l + 2}, \dots, T_{L}$, which allows the network to jointly refine and exploit cross-modal dependencies. This hierarchical integration enables the model to compensate for missing or degraded information in one modality by using cues from the other, leading to more robust and context-aware feature learning across diverse sensing conditions.

$$f_{L} = T_{L}(\dots T_{l + 1}(T_{l}^{M_{i}}(\dots T_{1}^{M_{i}}(f_{0}^{M_{i}})) \oplus T_{l}^{M_{j}}(\dots T_{1}^{M_{j}}(f_{0}^{M_{j}})))) \qquad (2)$$

5.2.3. Late Fusion

Late fusion, as depicted in Figure 7c, refers to a multi-modal integration strategy where LiDAR and camera data are processed independently through separate neural networks, each producing its own semantic predictions or feature maps, and fusion occurs only at the final decision stage [140,141,142]. Instead of mixing raw data (early fusion) or intermediate features (deep fusion), late fusion combines the outputs of each modality—often probability maps, class scores, or semantic labels—into a unified segmentation result.

Some late-fusion methods take the outputs of both the LiDAR point cloud branch and the camera image branch and make the final prediction from the results of the two modalities [143].

Late fusion is an integration method that uses multi-modal information to optimize the final proposal, as presented in Equation (3):

$$f_{L} = T_{L}^{M_{i}}(T_{L - 1}^{M_{i}}(\dots T_{1}^{M_{i}}(f_{0}^{M_{i}}))) \oplus T_{L}^{M_{j}}(T_{L - 1}^{M_{j}}(\dots T_{1}^{M_{j}}(f_{0}^{M_{j}}))) \qquad (3)$$
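Equations (1) to (3) differ only in where the fusion operator is applied within the chain of layer transformations. The following Python sketch makes this explicit with toy layers; the fusion operator is assumed here to be concatenation, and the names `compose`, `fuse`, `double`, and `increment` are hypothetical placeholders for real network layers.

```python
from functools import reduce

def compose(layers, x):
    """Apply the transformations T_1, ..., T_n in order."""
    return reduce(lambda feat, layer: layer(feat), layers, x)

def fuse(a, b):
    """Fusion operator; concatenation is assumed for illustration."""
    return list(a) + list(b)

def early_fusion(x_i, x_j, shared_layers):
    # Equation (1): fuse the raw inputs, then run all layers on the result.
    return compose(shared_layers, fuse(x_i, x_j))

def deep_fusion(x_i, x_j, layers_i, layers_j, shared_layers):
    # Equation (2): modality-specific layers, fusion, then shared layers.
    return compose(shared_layers,
                   fuse(compose(layers_i, x_i), compose(layers_j, x_j)))

def late_fusion(x_i, x_j, layers_i, layers_j):
    # Equation (3): every layer is modality-specific; fuse only the outputs.
    return fuse(compose(layers_i, x_i), compose(layers_j, x_j))

def double(feat):
    """Toy layer standing in for a learned transformation."""
    return [2 * v for v in feat]

def increment(feat):
    """Second toy layer."""
    return [v + 1 for v in feat]
```

For example, `deep_fusion([1], [2], [double], [increment], [increment])` first produces the modality-specific features `[2]` and `[3]`, concatenates them, and then applies the shared layer to obtain `[3, 4]`.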

5.2.4. Asymmetry Fusion

Asymmetry fusion, as depicted in Figure 7d, generally fuses the decision-level information of one branch with the data-level or feature-level information from other branches to establish a cascade relationship between multiple modalities.

Yang et al. [144] proposed a framework designed for semantic segmentation of large-scale point clouds that addresses the challenges of long-range and low-density objects. This method introduces a Multi-Scale Feature Dynamic Fusion (MDF) module to integrate multi-scale texture and geometric features, along with a Semantic Focus Module (SFM) employing multi-class contrastive learning to enhance class discrimination. Furthermore, cross-modal knowledge distillation enables effective information transfer from the 2D image branch to the 3D point cloud branch, resulting in improved segmentation accuracy on datasets such as SemanticKITTI [48] and nuScenes [49], particularly for distant or sparse regions. In contrast, Tan et al. [108] presented an approach emphasizing computational efficiency in LiDAR–camera fusion for 3D semantic segmentation.

This method employs a Two-Stream Network (TSNet) that processes LiDAR and camera data separately before merging them via residual fusion modules, while perception-aware losses ensure semantic consistency across modalities. The framework leverages perspective projection to preserve more visual detail than spherical projection and incorporates cross-modal alignment and cropping to reduce computational cost. Experiments on SemanticKITTI-FV [145], nuScenes [49], and A2D2 [146] datasets demonstrate that this method achieves competitive accuracy with significantly improved inference efficiency.

Both methods are regarded as asymmetric fusion methods because they perform feature extraction independently for each modality and fuse them in a hierarchical or unidirectional manner, where one modality (typically the image or LiDAR branch) dominates the information transfer. This asymmetric design enables the networks to exploit the complementary strengths of each sensor while maintaining robustness and computational efficiency.
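The unidirectional information transfer of cross-modal knowledge distillation can be illustrated with a generic loss function. The sketch below assumes a standard formulation, namely the Kullback–Leibler divergence between temperature-softened class distributions of a teacher (image branch) and a student (point cloud branch). It is not the specific loss of [144], and the function names and the temperature value are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

The loss is zero when both branches predict the same distribution and grows as the student departs from the teacher. Gradients are typically propagated only into the student, which is what makes the transfer asymmetric.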

6. Long-Term Localization

Long-term localization, another important task presented in Figure 2, introduces a set of complex challenges that extend beyond those addressed by conventional short-term SLAM systems. One of the core difficulties lies in handling environmental dynamics, such as moving and semi-static objects, structural modifications, seasonal changes, and variations in lighting, all of which degrade the reliability of feature matching and the consistency of the map over time. Additionally, sensor drift and calibration issues emerge, as intrinsic parameters of LiDARs, cameras, and IMUs may shift due to mechanical wear or environmental factors, leading to accumulated pose estimation errors.

Furthermore, infrastructure variability in developing regions—such as irregular road layouts, limited traffic signage, or rapidly changing urban structures—reduces the availability of stable landmarks required for long-term localization. These factors highlight the need for robust multi-modal perception and adaptive mapping strategies capable of maintaining reliable localization under challenging real-world conditions.

Another important factor is that long-term deployment of autonomous vehicles requires semantic SLAM systems capable of continuously adapting to evolving environments. Lifelong SLAM frameworks address this challenge by enabling incremental map updates, semantic label refinement, and the removal of outdated observations while preserving persistent structural landmarks. These mechanisms allow localization systems to remain consistent despite environmental changes occurring over extended periods of operation.

Table 1 summarizes several representative works focused on localization, including methods addressing loop closure and data association, which are fundamental concepts that strengthen and refine vehicle localization over time. However, beyond these traditional localization components, place recognition has emerged as an essential capability for achieving robust and long-term localization in dynamic and large-scale environments. Place recognition enables an autonomous system to identify previously visited locations, even under significant appearance or structural changes caused by illumination, weather, seasonal variations, or viewpoint differences. This capability is particularly critical for ensuring global consistency in mapping and loop closure detection, which directly impact the accuracy and stability of Simultaneous Localization and Mapping (SLAM) systems.

Given its importance, we organize this section around an in-depth discussion of place recognition techniques categorized according to the type of sensory information they rely on. Specifically, we review vision-based place recognition methods (Section 6.1), which exploit visual cues such as texture, color, and geometric patterns; LiDAR-based place recognition approaches (Section 6.2), which utilize structural and geometric information extracted from 3D point clouds; and fusion-based place recognition techniques (Section 6.3), which integrate complementary data from multiple sensors (e.g., cameras, LiDAR) to enhance robustness. This organization (Figure 8) not only highlights the evolution of place recognition methods but also provides a clear comparative framework for understanding how different sensing modalities contribute to the overall goal of accurate, reliable, and long-term localization in autonomous systems. Despite the advancement of numerous vision-based, LiDAR-based, and odometry-assisted methods, long-term semantic localization remains an open and difficult research problem.

Most existing techniques perform well in single-session mapping, but extending them to multi-session and long-term settings is challenging due to their limited capacity to model spatial–temporal correlations.

Several researchers have proposed localization strategies based on semantic feature detection, using high-level cues as references. For example, Qin et al. [147] proposed localizing a vehicle using semantic cues such as lane markings, crosswalks, and road surfaces. Similarly, Schaefer et al. [148] introduced a pole-based localization method suitable for urban environments. Although both approaches showed promising performance across different time periods, they face limitations: the former depends heavily on road marking visibility, while the latter assumes a sufficient number of pole-like structures—an assumption that may not hold in rural or cluttered areas. Additionally, pole-based methods often suffer from ambiguities due to the geometric similarity and proximity of objects.

More recently, researchers have explored topological representations for environment modeling, which leverage graph-based structures for localization in outdoor scenes. However, these systems often fail to recognize revisited locations under environmental change due to inconsistent relative pose estimates when comparing current scans to outdated prior maps. While methods such as [149,150,151] utilize odometry information for localization, many of them struggle with cumulative error and lack robust loop closure mechanisms, resulting in globally inconsistent maps.

To address this, recent efforts such as [152,153] introduced techniques that convert the relative pose between the current scan and prior maps into scan matching factors, which are then integrated into a factor graph for improved consistency. In a similar line of work, ref. [9] combined the classical LOAM pipeline [154] with semantic segmentation to improve loop closure detection using LiDAR-specific deep learning models. However, this system is not optimized for real-time performance on resource-constrained platforms. Conversely, Cao et al. [155] employed deep learning for topological localization from raw sensor data, demonstrating robustness to long-term appearance changes. Yet, the method lacks metric-level information, which limits precision in pose estimation. Another approach presented by Kong et al. [156] leveraged probabilistic filtering for global pose estimation, incorporating urban environment dynamics into the model. Still, performance may degrade in very large-scale environments due to the slow convergence of global localization.
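The idea of adding scan matching factors to a factor graph can be illustrated with a deliberately simplified one-dimensional example. In the Python sketch below, odometry factors constrain consecutive poses and scan matching factors tie individual poses to positions in a prior map; the linear least-squares problem is solved through its normal equations. The function names, weights, and the 1D setting are assumptions for illustration and do not reproduce the formulations of [152,153].

```python
def solve_linear(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def optimize_poses(n_poses, odometry, scan_matches, w_odom=1.0, w_scan=10.0):
    """Weighted least squares over 1D poses.

    odometry: (k, delta) factors meaning x[k+1] - x[k] = delta.
    scan_matches: (k, position) factors meaning x[k] = position in the prior map.
    At least one scan matching factor is needed, otherwise the system is singular.
    """
    H = [[0.0] * n_poses for _ in range(n_poses)]
    g = [0.0] * n_poses
    for k, delta in odometry:
        # Residual x[k+1] - x[k] - delta contributes to rows k and k+1.
        H[k][k] += w_odom
        H[k + 1][k + 1] += w_odom
        H[k][k + 1] -= w_odom
        H[k + 1][k] -= w_odom
        g[k] -= w_odom * delta
        g[k + 1] += w_odom * delta
    for k, position in scan_matches:
        # Residual x[k] - position contributes to row k only.
        H[k][k] += w_scan
        g[k] += w_scan * position
    return solve_linear(H, g)
```

When the odometry overestimates each step, the scan matching factors pull the trajectory back toward the prior map, and the correction is distributed over all poses instead of accumulating at the end of the trajectory.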

In these long-term scenarios, data association becomes particularly difficult due to perceptual aliasing, dynamic objects, and sensor noise. To address this, several works have focused on graph matching techniques to align current observations with prior semantic maps. For instance, [157,158] explored neural network approaches for measuring graph similarity, while [159] developed a semantic graph-based place recognition model using spatial and semantic neural modules. Pramatarov et al. [160] enhanced point cloud descriptors by incorporating object-level bounding boxes, while Wu et al. [161] used image-based depth and geometric segmentation to incrementally build consistent 3D scene graphs. However, these methods often neglect dynamic objects, which can introduce instability in long-term maps.

6.1. Vision-Based Place Recognition

Vision-based place recognition refers to the task of determining whether a robot’s current camera view corresponds to a previously visited location, relying on visual data as the sole perceptual source. This process involves extracting distinctive cues (such as texture, color, shape, and spatial layout) from images or video frames, and comparing them against a database of stored visual representations to identify potential matches.

Table 9 shows a representative set of deep learning-based vision methods for place recognition, ranging from early CNN-driven descriptor approaches such as NetVLAD [162] and D2-Net [163] to more advanced transformer and hybrid architectures like TransVPR [164] and MixVPR [165].

These methods fall into three categories (geometric, hybrid, and semantic) that reflect their feature representation strategies. Geometric models rely primarily on photometric and structural cues, whereas hybrid methods (e.g., Patch-NetVLAD [168], MultiRes-NetVLAD [169]) fuse local and global visual information to improve robustness against appearance changes. The most recent semantic approaches, including CLIP-VPR [171], incorporate vision–language alignment to enable higher-level contextual reasoning. Architecturally, the field has moved from purely CNN-based encoders toward attention-driven and multi-modal transformer frameworks, supported by modules such as GNNs (SuperGlue [166]), dense matchers (LoFTR [170]), and cross-modal pretrained backbones.

The choice of descriptor determines how a scene is represented. Global descriptors (e.g., NetVLAD [162], TransVPR [164]) provide compact scene embeddings suitable for large-scale retrieval, while local descriptors (e.g., SuperPoint + SuperGlue [166,167], LoFTR [170]) focus on keypoints and dense correspondences, improving invariance to viewpoint or illumination changes. Hybrid descriptors, as in Patch-NetVLAD [168], combine both paradigms to balance discriminative granularity and computational efficiency. The contributions reflect these design choices: earlier methods introduced end-to-end descriptor learning and self-supervised feature detection, whereas newer approaches prioritize multi-resolution aggregation, attention-based feature mixing, and semantic alignment. Recent contributions such as vision–language embedding integration (CLIP-VPR [171]) and cross-domain adaptation (Multi-Platform VPR [173]) represent the current direction of the field toward robust, transferable representations necessary for long-term localization.

These methods are compared primarily through their reported metrics. Most studies evaluate recognition ability using Recall@N, Recall@1, mAP, or Accuracy, which together capture both retrieval effectiveness and correspondence precision. Early CNN-based systems such as NetVLAD [162] typically report around 80% Recall@N, whereas enhanced multi-resolution CNN models like MultiRes-NetVLAD [169] achieve up to 96% Recall@5. Transformer-based approaches show the strongest gains: TransVPR [164] and MixVPR [165] report 94–96% Recall and mAP scores approaching 0.9, while dense correspondence models like LoFTR [170] reach ≈97% accuracy. Although semantic vision–language approaches such as CLIP-VPR [171] show moderate recall values (0.66 on the SemanticKITTI dataset), they provide crucial semantic invariance for cross-domain and long-term scenarios. Overall, the reported metrics indicate that recent transformer-based and dense-matching architectures deliver superior robustness and generalization, reinforcing their suitability for long-term semantic mapping and autonomous navigation.
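As a reminder of what Recall@N measures, the following minimal sketch ranks database descriptors by Euclidean distance to each query and counts a query as correct if any true match appears among its N nearest entries. The two-dimensional descriptors, the ground-truth sets, and the function name `recall_at_n` are toy assumptions for illustration and are not taken from any of the cited evaluation protocols.

```python
import math


def recall_at_n(queries, database, ground_truth, n):
    """Fraction of queries with at least one true match among the top n."""
    hits = 0
    for q_idx, q in enumerate(queries):
        # Rank database indices by descriptor distance to the query.
        ranked = sorted(range(len(database)),
                        key=lambda i: math.dist(q, database[i]))
        if any(i in ground_truth[q_idx] for i in ranked[:n]):
            hits += 1
    return hits / len(queries)


# Toy descriptors (assumed values): four database places, three queries.
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
queries = [[0.1, 0.0], [0.9, 0.2], [0.4, 0.6]]
ground_truth = [{0}, {1}, {0}]  # database indices that count as correct

r1 = recall_at_n(queries, database, ground_truth, 1)
r2 = recall_at_n(queries, database, ground_truth, 2)
print(r1, r2)
```

In this toy case the third query retrieves its true place only as the second-nearest candidate, so Recall@2 is higher than Recall@1. This is why results reported at different values of N are not directly comparable.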

6.2. LiDAR-Based Place Recognition

LiDAR-based place recognition is the process of identifying whether a current observation, typically a 3D point cloud acquired by a LiDAR sensor, corresponds to a previously visited location. This capability is essential for long-term autonomous localization, where the environment may change drastically over time due to seasonal variation, weather conditions, lighting changes, or structural alterations.

Unlike vision-based systems, which depend heavily on color, texture, and lighting conditions, LiDAR-based methods focus on geometric and structural information. As a result, they are inherently more robust to appearance changes and are particularly well suited for outdoor, long-term deployments where environmental conditions can vary unpredictably. Depending on how the raw point cloud is preprocessed, 3D LiDAR-based place recognition methods can be divided into five categories.

The five categories considered in this survey are: (1) point-based, (2) voxel-based, (3) segment-based, (4) projection-based, and (5) graph-based methods. The comparative analysis across these families of approaches is conducted by examining several key aspects: the type of descriptor employed to encode geometric or semantic information; the architectural design used to extract, aggregate, and manage features; the strategies adopted to achieve invariance to rotation and translation; the similarity measures applied during place recognition; and finally, the datasets used for validation, which provide insights into generalization and robustness under real-world conditions.

6.2.1. Point-Based Methods

Point-based methods for LiDAR place recognition directly operate on raw 3D point clouds without relying on intermediate representations such as voxel grids, range images, or bird’s-eye projections. By processing the original geometric data, these methods preserve fine-grained spatial structure, enabling them to capture distinctive local and global patterns necessary for identifying previously visited places.

Table 10 summarizes the main features of the point-based methods for place recognition. Early methods such as PointNetVLAD [174] and PCAN [175] build on PointNet and adopt point-wise MLPs, which makes them lightweight and efficient but limited in capturing local structural context because each point is processed independently. LPD-Net [176] combines GNN and DGCNN modules, introducing graph construction and edge convolutions to capture richer topological information, though at higher computational cost. Architectures such as PointNet + FlexConv in DH3D [177] and PointOE in SOE-Net [178] improve spatial reasoning through learned convolutional kernels or ordered embeddings. More advanced architectures, such as the Pyramid Point Transformer in PPT-Net [179] and the Feature Point Extractor + Point Transformer in FPET-Net [180], use attention mechanisms to model long-range dependencies between points. Rotation-equivariant encoders [181] explicitly build SO(3)-equivariance into the model, offering strong robustness to viewpoint changes. Finally, U-Net architectures with hierarchical multiscale encoding, as in LoGG3D-Net [182], provide dense contextual information but are heavier and more complex to train.

The next aspect is feature aggregation, that is, how local descriptors are combined into a single global embedding. Most early approaches rely on NetVLAD, as in [174,175,176,177], because VLAD-style residual pooling provides strong discriminative power for place recognition. However, NetVLAD is sensitive to rotation and may overfit to dataset-specific cluster centers. SE blocks in DH3D [177] and Pyramid VLAD in PPT-Net [179] address this by incorporating attention or multiscale structures to improve robustness. G-VLAD in EPC-Net [183] integrates geometric priors, enhancing generalization to structural variations. Point Transformer pooling in FPET-Net [180] introduces an attention-based aggregation, enabling more context-aware embedding formation, while rotation-equivariant pooling in [181] ensures that the descriptor transforms consistently under 3D rotations. High-order pooling in LoGG3D-Net [182] captures second-order statistics, which improves rotational invariance but increases embedding dimensionality.
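The VLAD-style residual pooling mentioned above can be sketched as follows. This is the classical hard-assignment VLAD, used here only as a simplified stand-in: NetVLAD replaces the hard assignment with a learned soft assignment and learns the cluster centers end to end. The cluster centers, the toy descriptors, and the function names are assumptions for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def vlad(local_descriptors, centers):
    """Aggregate local descriptors into one global vector of size K * D."""
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in centers]
    for d in local_descriptors:
        # Hard-assign the descriptor to its nearest cluster center.
        k = min(range(len(centers)), key=lambda j: math.dist(d, centers[j]))
        for i in range(dim):
            residuals[k][i] += d[i] - centers[k][i]
    flat = []
    for r in residuals:
        flat.extend(l2_normalize(r))  # intra-normalization per cluster
    return l2_normalize(flat)         # final global L2 normalization


centers = [[0.0, 0.0], [10.0, 10.0]]              # assumed codebook, K = 2
local = [[1.0, 0.0], [0.0, 1.0], [11.0, 10.0]]    # assumed local features
descriptor = vlad(local, centers)
print(descriptor)
```

Because the residuals are summed per cluster, the result does not depend on the order of the input points, which is what makes this family of aggregators suitable for unordered point sets. It is not invariant to rotating the input features, consistent with the rotation sensitivity noted above.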

The remaining aspects concern invariance to translation and rotation. For translation, most methods rely on centroid normalization, as seen in [174,175,177,178,179,182], which standardizes the point cloud to a canonical coordinate frame. While simple, centroiding may fail under partial observations or when large occlusions shift the perceived centroid. LPD-Net [176] uses learned spatial graph normalization, enabling the network to learn stable spatial references from neighborhood topology, improving robustness in dynamic environments. FPET-Net [180] and the rotation-equivariant encoder [181] adopt learned normalization mechanisms, better handling irregular structures but requiring more training data. Rotation robustness varies strongly across methods. Early approaches such as PointNetVLAD [174], PCAN [175], and LPD-Net [176] do not explicitly account for rotation, making them sensitive to sensor roll and viewpoint changes. DH3D [177] and SOE-Net [178] provide limited improvements but still lack strong rotational invariance.
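The behavior of centroid normalization under translation and under partial observation can be checked with a few lines of code. The sketch below uses assumed toy coordinates and hypothetical helper names. It shows that a pure translation disappears after centering, while observing only part of the scene shifts the centroid and changes the normalized coordinates of the same physical points.

```python
def centroid(points):
    """Mean position of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def center(points):
    """Express points relative to their centroid."""
    c = centroid(points)
    return [tuple(p[i] - c[i] for i in range(3)) for p in points]


def max_abs_diff(a, b):
    """Largest coordinate difference between two equally long point lists."""
    return max(abs(pa[i] - pb[i]) for pa, pb in zip(a, b) for i in range(3))


scan = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
shifted = [(x + 10.0, y - 5.0, z + 1.0) for x, y, z in scan]  # same place
partial = scan[:2]                                            # half occluded

same_place_error = max_abs_diff(center(scan), center(shifted))
partial_error = max_abs_diff(center(scan)[:2], center(partial))
print(same_place_error, partial_error)
```

The first error is zero, whereas the second is one full unit for points that did not move at all. This is the failure mode that the learned normalization schemes aim to reduce.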

PPT-Net [179] introduces transformer equivariant layers, which capture orientation-aware correspondences. EPC-Net [183] applies PCA-like alignment, offering moderate rotation handling that may become unstable under partial scans. FPET-Net [180] improves this further through transformer-based equivariant mapping. The strongest invariance appears in the rotation-equivariant encoder [181], which is SO(3)-equivariant by construction. LoGG3D-Net [182] also achieves high rotational invariance via second-order pooling operations.

6.2.2. Voxel-Based Methods

Voxel-based methods constitute one of the most influential families of approaches for 3D LiDAR place recognition, offering a structured and computationally manageable representation of large-scale point clouds. Instead of operating directly on raw, irregular point sets, these methods discretize the 3D space into a grid of voxels, enabling the transformation of sparse LiDAR scans into a regular data format that is well suited for convolutional or learning-based architectures. This discretization facilitates efficient spatial reasoning, reduces sensitivity to local noise, and allows the extraction of high-level geometric patterns that are difficult to capture from unordered points alone. Moreover, voxelization introduces a natural mechanism for achieving translation invariance, as local voxel neighborhoods preserve spatial consistency across revisited places. Recent advances leverage sparse 3D CNNs, multi-resolution voxel hierarchies, and learned voxel descriptors to enhance robustness under viewpoint changes, varying densities, and long-term environmental variations. As a result, voxel-based place recognition methods strike a balance between accuracy, scalability, and computational efficiency, making them particularly suitable for autonomous driving and large-scale mapping scenarios.
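The discretization step described above can be sketched as follows. Points are binned by integer voxel index, only occupied voxels are stored, and each voxel is summarized by the centroid of its points. The voxel size, the toy points, and the function name `voxelize` are illustrative assumptions. The cited networks use their own quantization and feature encodings.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size):
    """Map 3D points to occupied voxels and return one centroid per voxel."""
    cells = defaultdict(list)
    for p in points:
        # Integer voxel index; floor keeps negative coordinates consistent.
        key = tuple(math.floor(c / voxel_size) for c in p)
        cells[key].append(p)
    return {
        key: tuple(sum(q[i] for q in pts) / len(pts) for i in range(3))
        for key, pts in cells.items()
    }


points = [(0.1, 0.1, 0.1), (0.3, 0.2, 0.1), (1.2, 0.1, 0.0), (-0.1, 0.0, 0.0)]
voxels = voxelize(points, voxel_size=0.5)
for index, mean_point in sorted(voxels.items()):
    print(index, mean_point)
```

Storing only occupied voxels in a dictionary mirrors the sparse representation that sparse 3D convolutions operate on, since most of the volume around a LiDAR sensor is empty.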

Table 11 shows a summary of the latest works related to voxel-based methods for 3D LiDAR place recognition. The Feature column summarizes the architectural backbone or feature extraction strategy used by each method. Sparse 3D convolutions dominate early methods such as MinkLoc3D [184] and its spherical-coordinate variant MinkLoc3D-SI [185]. Attention-based enhancements appear in EgoNN [186] through ECA attention, while transformer-based models like NDT-Transformer and TransLoc3D [187,188] utilize voxel-level NDT representations combined with Transformer encoders. Other approaches include LCDNet [189], which integrates PV-RCNN with NetVLAD for global descriptors, SVT-Net [190], which employs a sparse voxel transformer to capture local and long-range geometric structures, and OverlapNetVLAD [191], which extracts features from a BEV-based backbone before applying NetVLAD.

The Similarity column describes the technique used to match descriptors produced by these models. The most common metric is L2 distance, used by methods such as MinkLoc3D, MinkLoc3D-SI, TransLoc3D, LCDNet, and OverlapNetVLAD [184,185,188,189,191]. Several works also incorporate triplet loss during training—most notably MinkLoc3D-SI and TransLoc3D [185,188]—to encourage discriminative descriptor learning. Other methods, including EgoNN, NDT-Transformer, and SVT-Net [186,187,190], focus on nearest-neighbor descriptor retrieval rather than specifying a formal similarity metric. LCDNet [189] additionally supports cosine similarity, offering an alternative to Euclidean distance.
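The two similarity measures and the triplet objective mentioned above can be written compactly. The following sketch implements a standard triplet margin loss on Euclidean distance, together with cosine similarity. It is a generic formulation with an assumed margin and toy descriptors, and the exact loss variant used by each cited method may differ.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Penalize triplets where the negative is not farther than the positive
    by at least `margin` (Euclidean distance)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def cosine_similarity(a, b):
    """Cosine of the angle between two descriptors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


anchor = [1.0, 0.0]
positive = [0.9, 0.1]        # same place, slightly different descriptor
easy_negative = [0.0, 1.0]   # clearly different place
hard_negative = [0.8, 0.0]   # different place with a similar descriptor

easy_loss = triplet_margin_loss(anchor, positive, easy_negative)
hard_loss = triplet_margin_loss(anchor, positive, hard_negative)
print(easy_loss, hard_loss, cosine_similarity(anchor, easy_negative))
```

Only the hard negative produces a non-zero loss, which illustrates why triplet-based training depends strongly on how negatives are mined.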

6.2.3. Segment-Based Methods

Segment-based methods represent one of the earliest and most influential approaches to 3D LiDAR place recognition. Instead of processing entire point clouds holistically, these methods decompose the scene into meaningful geometric segments such as clusters of points corresponding to objects, facades, or structural elements. Each segment is then described using either handcrafted geometric descriptors or learned representations obtained through neural networks or autoencoders.

By operating at the segment level, these methods provide robustness to viewpoint changes, partial overlaps, and dynamic elements, since the matching process focuses on stable, repeatable structures rather than the full scan. Place recognition is then performed by comparing segment descriptors via nearest-neighbor search, classification models, or probabilistic similarity measures, often followed by geometric consistency checks to verify structural alignment.
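A minimal sketch of this two-step procedure is given below: nearest-neighbor matching of segment descriptors, followed by a simple geometric consistency check that keeps only matches whose pairwise centroid distances are preserved between the two scans. It is a simplified stand-in for the consistency and RANSAC-style verification used in practice, and the segment data, thresholds, and function names are assumptions.

```python
import math


def nearest_neighbor_matches(query, reference, max_desc_dist):
    """Pair each query segment with its closest reference descriptor."""
    matches = []
    for qi, (_, q_desc) in enumerate(query):
        ri = min(range(len(reference)),
                 key=lambda j: math.dist(q_desc, reference[j][1]))
        if math.dist(q_desc, reference[ri][1]) <= max_desc_dist:
            matches.append((qi, ri))
    return matches


def geometrically_consistent(matches, query, reference, tol, min_support):
    """Keep matches whose centroid distances to other matches are preserved."""
    kept = []
    for qa, ra in matches:
        support = 0
        for qb, rb in matches:
            if (qa, ra) == (qb, rb):
                continue
            d_query = math.dist(query[qa][0], query[qb][0])
            d_ref = math.dist(reference[ra][0], reference[rb][0])
            if abs(d_query - d_ref) <= tol:
                support += 1
        if support >= min_support:
            kept.append((qa, ra))
    return kept


# Each segment is (centroid, descriptor); all values are toy assumptions.
reference = [((0.0, 0.0, 0.0), [1.0, 0.0]), ((4.0, 0.0, 0.0), [0.0, 1.0]),
             ((0.0, 3.0, 0.0), [1.0, 1.0]), ((10.0, 10.0, 0.0), [0.5, 0.5])]
# Same scene seen from a pose shifted by (1, 1, 0); the last segment is a
# look-alike object at an inconsistent position.
query = [((1.0, 1.0, 0.0), [1.0, 0.05]), ((5.0, 1.0, 0.0), [0.05, 1.0]),
         ((1.0, 4.0, 0.0), [0.95, 1.0]), ((7.0, 7.0, 0.0), [0.5, 0.55])]

candidates = nearest_neighbor_matches(query, reference, max_desc_dist=0.2)
verified = geometrically_consistent(candidates, query, reference,
                                    tol=0.1, min_support=2)
print(candidates, verified)
```

The descriptor stage accepts all four pairs, including the look-alike segment. The geometric stage rejects it because its distances to the other matched segments differ between the two scans.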

Table 12 shows a summary of works related to segment-based methods for 3D LiDAR place recognition. The Feature Strategy column describes how each method constructs its segment-level representations. Early methods such as SegMatch [192] rely on geometric features, PCA descriptors, and cluster segmentation. SegMap [193] introduced learned segment embeddings using a CNN-based autoencoder, marking a transition from handcrafted to learned representations. Other methods extract local keypoints and shape descriptors, as in Zaganidis et al. [194] and LiPMatch [195], or use richer scene representations such as the Normal Distribution Transform (NDT) in Tomono [196]. Locus [197] integrates structural and intensity cues, while SA-LOAM [9] embeds appearance-augmented LOAM features. Improvements on PCA-based features appear in SSC [198] and PCA-SSC [199], whereas PSE-Match [200] introduces a probabilistic shape encoding based on mixture models. Song et al. [201] combine geometry, intensity, and topology to create a hybrid descriptor. Together, these strategies show a transition from purely geometric descriptors toward learned, probabilistic, and multi-modal segment representations.

The Similarity column explains how the methods compare segment descriptors to perform place recognition or loop closure. SegMatch [192] uses random forest classification combined with Euclidean distances and structural consistency checks. SegMap [193] employs L2 distance between its learned descriptors. Keypoint-based works like Zaganidis et al. [194] rely on nearest-neighbor matching and RANSAC-based geometric verification, while LiPMatch [195] also uses NN matching with geometric validation. The NDT-based method of Tomono [196] uses a likelihood-driven matching process optimized through alignment scoring. Systems such as Locus [197] and PCA-SSC [199] rely on L2 similarity combined with geometric consistency filtering. SA-LOAM [9] incorporates ICP-based residual minimization. SSC [198] leverages L2 or Mahalanobis distance for PCA descriptors, whereas PSE-Match [200] evaluates similarity via probabilistic distance between mixture-model encodings. Song et al. [201] use L2 or cosine distance with an additional global verification stage. This column shows a continuum of matching strategies ranging from simple metric distances to probabilistic and optimization-based formulations.

6.2.4. Projection-Based Methods

Projection-based methods for 3D LiDAR place recognition convert raw point clouds into 2D representations—such as range images, bird’s-eye-view (BEV) maps, or spherical projections—to simplify processing while preserving geometric structure. By projecting the 3D data into a structured grid, these methods significantly reduce computational cost and enable the use of mature 2D convolutional neural networks for feature extraction. This representation also enhances robustness to varying point densities and provides a consistent spatial layout that facilitates descriptor learning.
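A common form of this projection is the spherical range image, where each point is mapped to a pixel by its yaw and pitch angles and the pixel stores the range of the closest return. The sketch below illustrates the mapping under assumed sensor parameters (image size and vertical field of view). The function name and the convention of using 0.0 for empty pixels are likewise assumptions.

```python
import math


def range_image(points, height, width, fov_up_deg, fov_down_deg):
    """Project 3D points to an H x W image storing the closest range."""
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[0.0] * width for _ in range(height)]  # 0.0 marks "no return"
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue
        yaw = math.atan2(y, x)      # horizontal angle in [-pi, pi]
        pitch = math.asin(z / r)    # vertical angle
        col = int(0.5 * (1.0 - yaw / math.pi) * width)
        row = int((1.0 - (pitch - fov_down) / fov) * height)
        col = min(width - 1, max(0, col))
        row = min(height - 1, max(0, row))
        # Keep the nearest return when several points share a pixel.
        if image[row][col] == 0.0 or r < image[row][col]:
            image[row][col] = r
    return image


points = [(10.0, 0.0, 0.0), (20.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
image = range_image(points, height=5, width=10,
                    fov_up_deg=15.0, fov_down_deg=-15.0)
for row in image:
    print(row)
```

The two points along the x-axis fall into the same pixel and only the closer one is kept, which shows how projection trades some 3D information for a regular grid that 2D convolutional networks can process.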

Once projected, global descriptors can be computed from the entire image or from aggregated local regions, enabling efficient matching between scenes using metric distances or retrieval techniques. Projection-based approaches have become increasingly popular due to their balance between computational efficiency, descriptive power, and compatibility with deep learning architectures, making them well suited for large-scale LiDAR place recognition in real-time applications.

Table 13 shows a summary of the works related to projection-based methods for 3D LiDAR place recognition. In terms of feature strategy, the majority of works leverage some form of spherical or range-image projection because it preserves angular relationships and allows convolutional neural networks to operate efficiently on regular grids. This strategy appears in [202,203,204,205,206,207,208,209,210], often with additional channels such as intensity, normals, or multi-scale features to enrich the representation. Overlap-based methods enhance the projection with modality fusion or transformer encoders to better extract global geometric cues. Meanwhile, methods like DiSCO [206] incorporate self-supervised training pipelines, letting the network learn discriminative features without labeled place data. DeepRING [211] takes a different approach by using ring-based projections to inherently encode rotational symmetry. Finally, BEVPlace [212] introduces a full top-down BEV projection, showing a trend toward spatially interpretable representations that resemble 2D maps and naturally support localization in urban environments.

The similarity and matching techniques also reflect the progression of the field. Early works emphasize correlation-based similarity or Euclidean distance between handcrafted descriptors, but the majority of modern methods rely on L2 or cosine distance applied to learned global descriptors—this is the case for LocNet [203], Sun et al. [204], DiSCO [206], FusionVLAD [208], AttDLNet [213], SphereVLAD [209], SphereVLAD++ [210], and BEVPlace [212]. VLAD-based approaches in particular benefit from this metric due to the normalization effects inherent in VLAD aggregation. A distinct direction emerges with OverlapNet [205] and OverlapTransformer [207], which predict scan overlap ratios and yaw differences, treating place recognition as an overlap estimation problem rather than classical descriptor matching. DeepRING [211] revisits correlation-based similarity but enhances robustness through circular shifting, which compensates for yaw rotation.
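The effect of circular shifting can be illustrated with a one-dimensional ring descriptor in which each entry summarizes one angular sector around the sensor. A yaw rotation of the vehicle then corresponds to a circular shift of the descriptor, so comparing all shifts recovers both the match score and the relative yaw. The sketch below is a generic illustration with assumed toy values and hypothetical function names, not the DeepRING implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equally long descriptors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def best_circular_match(desc_a, desc_b):
    """Try every circular shift of desc_b and return (shift, best score)."""
    best_shift, best_score = 0, -1.0
    for shift in range(len(desc_b)):
        rolled = desc_b[shift:] + desc_b[:shift]
        score = cosine_similarity(desc_a, rolled)
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift, best_score


# Eight angular sectors (45 degrees each); values are toy occupancy counts.
ring_a = [3.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0]
ring_b = ring_a[-3:] + ring_a[:-3]   # same place seen after a yaw rotation

unaligned = cosine_similarity(ring_a, ring_b)
shift, score = best_circular_match(ring_a, ring_b)
yaw_deg = shift * 360.0 / len(ring_a)
print(unaligned, shift, score, yaw_deg)
```

Without alignment the two descriptors look dissimilar, while the best shift yields a near-perfect score and an estimate of the yaw offset at the resolution of one sector.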

6.2.5. Graph-Based Methods

Graph-based methods for 3D LiDAR place recognition model a scene as a graph—where nodes represent keypoints, segments, or semantic instances, and edges encode geometric or semantic relationships between them. Instead of relying solely on raw point clouds or handcrafted descriptors, these approaches capture the structural layout of the environment, making them robust to viewpoint changes, occlusions, and varying densities.

By comparing graphs using techniques such as graph matching, spectral embeddings, or graph neural networks (GNNs), these methods can identify whether two scans correspond to the same place.

This representation is especially effective in large, cluttered, or repetitive environments because it preserves topological consistency and leverages semantic information when available. As a result, graph-based approaches provide a powerful, flexible framework for reliable long-term LiDAR place recognition.
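To illustrate why a graph of semantic instances is robust to viewpoint changes, the sketch below connects instances that lie within a fixed radius of each other and summarizes the graph by a histogram of edge label pairs. This handcrafted descriptor is only an assumed stand-in for the learned graph similarity used by the methods discussed here. The labels, coordinates, radius, and function names are hypothetical.

```python
import math
from collections import Counter


def build_edges(instances, radius):
    """Connect instance pairs whose centroids are within `radius`."""
    edges = []
    for i in range(len(instances)):
        for j in range(i + 1, len(instances)):
            if math.dist(instances[i][1], instances[j][1]) <= radius:
                edges.append((i, j))
    return edges


def edge_label_histogram(instances, radius):
    """Count edges by the unordered pair of semantic labels they connect."""
    edges = build_edges(instances, radius)
    return Counter(tuple(sorted((instances[i][0], instances[j][0])))
                   for i, j in edges)


def histogram_similarity(h1, h2):
    """Ratio of shared to total edge counts (1.0 means identical)."""
    keys = set(h1) | set(h2)
    inter = sum(min(h1[k], h2[k]) for k in keys)
    union = sum(max(h1[k], h2[k]) for k in keys)
    return inter / union if union else 1.0


place = [("pole", (0.0, 0.0, 0.0)), ("tree", (3.0, 0.0, 0.0)),
         ("car", (0.0, 4.0, 0.0)), ("pole", (20.0, 20.0, 0.0))]
# Same place observed after a 90 degree rotation and a translation,
# with the instances listed in a different order.
revisit = [("car", (3.0, -2.0, 0.0)), ("pole", (-13.0, 18.0, 0.0)),
           ("pole", (7.0, -2.0, 0.0)), ("tree", (7.0, 1.0, 0.0))]
elsewhere = [("pole", (0.0, 0.0, 0.0)), ("pole", (2.0, 0.0, 0.0)),
             ("building", (0.0, 3.0, 0.0))]

h_place = edge_label_histogram(place, radius=6.0)
same = histogram_similarity(h_place, edge_label_histogram(revisit, 6.0))
other = histogram_similarity(h_place, edge_label_histogram(elsewhere, 6.0))
print(same, other)
```

Because edges depend only on relative distances and labels, the descriptor is unchanged by rotation, translation, and instance ordering. Its dependence on correct labels also shows why such methods rely on reliable semantic extraction.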

Table 14 provides a summary of recent works on graph-based methods for 3D LiDAR place recognition. Semantic representations vary significantly across the listed approaches, reflecting different philosophies for encoding environmental structure. SGPR [159] builds semantic graphs directly from segmented point clouds, providing a geometric–semantic scaffold of the scene. GOS Match [214] extends this idea by constructing a graph of semantics where each node corresponds to an object instance and edges encode geometric or topological relations, allowing structured reasoning across the scene. Gong et al. [215] propose a two-tier representation that integrates local spatial-relation graphs with a global topological graph, combining semantic cues with geometric layout information. BoxGraph [160] represents scenes through semantic bounding-box graphs generated from object detectors, translating image-level detections into a graph structure. Deep Scan Context [216] instead relies on high-dimensional descriptors produced from pseudo-cylindrical projections and neural embeddings, moving away from explicit graphs. SimGNN [158] departs from handcrafted features entirely by learning graph embeddings directly through GNN layers that capture both node distributions and overall graph structure. Finally, Semantic Loop [217] constructs instance-level 3D semantic graphs that encode both object category and geometric attributes, offering detailed and interpretable spatial representations.

Similarity computation and matching approaches evolve from classical graph-based metrics to more sophisticated neural matching mechanisms. SGPR [159] performs structural graph matching using predefined similarity metrics, while GOS Match [214] scores graph similarity by aligning node and edge features followed by geometric verification. Gong et al. [215] introduce a coarse-to-fine hierarchical strategy, first evaluating spatial relations and then refining alignment through structural similarity. BoxGraph [160] uses similarity metrics computed over bounding-box attributes combined with spatial consistency checks. Deep Scan Context [216] applies descriptor-based matching using cosine or L2 distance and incorporates rotation alignment to handle viewpoint variance. SimGNN [158] represents a transition toward learned similarity: it uses GNN-based embeddings and attention-driven similarity prediction, enabling the model to infer which structural cues are most discriminative. Semantic Loop [217] fuses spectral graph features with RANSAC-based geometric verification, bridging semantic graph information with robust geometric alignment strategies.

Chronologically, the progression from 2020 to 2022 demonstrates a clear shift: early works focus on explicit graph structures and classical similarity metrics, while later contributions incorporate high-dimensional learned descriptors and neural graph embeddings, often combined with geometric verification. This transition reflects a broader movement in semantic SLAM and place recognition toward hybrid systems that fuse semantic reasoning, structural consistency, and learned relational features for robust long-term localization.

6.3. Fusion-Based Place Recognition

Recent advancements in fusion-based place recognition have increasingly leveraged multi-modal sensor integration to address the challenges of long-term localization under diverse and dynamic environmental conditions. Table 15 summarizes the key characteristics of recent approaches, emphasizing their fusion strategies, learning objectives—particularly those designed to minimize cross-modal drift—and their robustness to environmental variations according to the sensor modalities employed. Wang et al. [218] proposed a robust LiDAR–camera fusion framework that jointly integrates geometric and visual cues through deep feature fusion, significantly improving recognition reliability across changes in season, illumination, and viewpoint. Building upon this foundation, Xu et al. [219] introduced an explicit cross-modal attention mechanism that facilitates a more effective exploitation of complementary information between image and point cloud representations, thereby enhancing feature consistency and discriminative power for place recognition.

Further advancing the synergy between vision and LiDAR, Jung et al. [220] employed vision foundation models for image encoding, incorporating high-level semantic context to complement structural LiDAR data, thereby improving robustness in long-term deployments.

Extending the semantic scope, Melekhin et al. [221] combined visual and textual information to enrich place representations, boosting discrimination in environments characterized by repetitive structures or perceptual aliasing.

In parallel, Qi et al. [222] demonstrated the effectiveness of LiDAR–radar fusion by introducing a polar bird’s-eye-view representation that captures complementary spatial and motion cues, particularly advantageous in adverse weather conditions where vision systems may falter. More recently, ref. [223] incorporated multi-scale attention mechanisms within a LiDAR–camera network, facilitating adaptive focus on both global layouts and fine-grained details, thus enhancing recognition performance across scale variations and occlusions.

In summary, recent fusion-based place recognition methods show that hybrid fusion strategies, which combine global–local or multi-scale integration, achieve superior robustness and consistency across modalities. Attention and explicit cross-modal interaction mechanisms [218,219,223] effectively exploit complementary cues, reducing cross-sensor drift and improving long-term stability. Deep joint embeddings and BEV-based representations [218,222,223] provide strong invariance to translation and rotation, while contrastive and cross-supervised learning objectives enhance feature alignment under dynamic environmental conditions. Among these, PRFusion++ [218], LRFusionPR [222], and LCPR [223] demonstrate the best overall performance, maintaining high recognition accuracy and minimal drift across varying weather, illumination, and viewpoint scenarios.

To enable a meaningful and rigorous comparison of the different approaches discussed above, it is essential to evaluate them under consistent and well-defined experimental conditions. In this work, particular emphasis is placed on accuracy, robustness to environmental changes, and Recall@1, as these metrics are critical for assessing the reliability and discriminative capability of SLAM systems in real-world scenarios. The comparison is based on reported performance in the literature under conditions involving environmental appearance changes and dynamic urban scenes.

In addition, complementary factors such as robustness in dynamic scenes and computational efficiency are also considered to provide a more comprehensive evaluation. Structuring the comparison across these criteria, as summarized in Table 16, highlights the strengths and weaknesses of each method alongside the metric most widely used to assess place recognition performance, and clarifies the trade-offs between different SLAM paradigms.

LiDAR-based methods such as PointNetVLAD [174], MinkLoc3D [184], and LoGG3D-Net [182] demonstrate strong capabilities in learning discriminative global descriptors and achieving robustness to viewpoint and environmental changes. Their main strength lies in exploiting geometric consistency, enabling reliable place recognition in large-scale environments. However, these approaches are often limited by their dependence on high-quality point cloud data, sensitivity to partial observations or occlusions, and moderate-to-high computational cost, particularly when deep architectures or dense inputs are involved. Approaches such as LCDNet [189] and OverlapNet [205] further extend LiDAR-based solutions by incorporating loop closure detection and registration mechanisms, improving global consistency. While these methods provide robust descriptors and effective alignment, they introduce additional complexity, resulting in higher computational overhead and reliance on preprocessing steps or scan quality, which may limit scalability.

Vision-based methods like LoFTR [170] exhibit strong performance in dense feature matching and precise correspondence estimation, particularly in low-texture environments. Their main advantage is the ability to capture rich visual details without relying on explicit keypoints. Nevertheless, they suffer from high computational and memory requirements, and their performance is inherently constrained by sensitivity to illumination, seasonal variations, and dynamic changes, reducing their reliability in long-term localization scenarios.

Classical geometric approaches such as LOAM [149] remain highly efficient and provide accurate real-time odometry, making them suitable for deployment in practical systems. However, their main limitation is the lack of semantic understanding, which reduces robustness in dynamic environments and prevents effective long-term adaptation.

Multi-modal and semantic-aware approaches, including BEVPlace [212] and BoxGraph [160], present significant strengths by combining geometric and semantic information. These methods achieve improved robustness to environmental changes and dynamic scenes, particularly in long-term localization tasks. For example, BoxGraph leverages object-level relationships, while BEVPlace uses spatial representations to enhance invariance. However, these approaches introduce new challenges, such as dependence on accurate sensor calibration, reliance on reliable semantic extraction (e.g., object detection), and increased computational complexity, especially in graph-based matching processes.

Finally, learning-based approaches such as Deep Scan Context show a strong ability to learn geometric descriptors but are limited by high computational requirements and reduced generalization under extreme environmental variations, highlighting the ongoing challenge of balancing robustness and efficiency.

Large-scale semantic SLAM systems face significant computational and memory challenges due to the continuous growth of maps and the complexity of semantic representations. To address these limitations, several strategies have been proposed, including submap-based mapping, map sparsification, and hierarchical map representations. These approaches limit the size of the optimization problem while preserving essential structural and semantic information. Furthermore, incremental graph optimization and efficient deep learning architectures help maintain real-time performance while enabling long-term map maintenance in large-scale environments.

Beyond scalability, meeting real-time requirements is also essential for deploying localization systems in autonomous vehicles. To address latency and computational constraints, modern SLAM systems adopt several strategies, including sparse data representations, incremental optimization techniques, and parallel processing pipelines. Additionally, lightweight neural architectures and hierarchical localization frameworks enable a balance between localization accuracy and computational efficiency. These approaches allow semantic SLAM systems to maintain reliable performance while operating within the strict timing constraints required for real-time autonomous navigation.

7. Benchmarking

Benchmarking is the last stage in the analysis of our survey, as presented in Figure 2. It plays a crucial role in advancing long-term localization for autonomous systems, as it provides standardized methodologies for evaluating and comparing the performance of different algorithms under diverse and realistic conditions. In this context, evaluation metrics are essential tools for assessing key aspects such as pose accuracy, place recognition robustness, semantic consistency, and computational efficiency—all of which are critical in dynamic environments subject to seasonal changes, lighting variations, and structural modifications. However, meaningful benchmarking requires not only robust evaluation metrics but also publicly available datasets that capture long-term environmental variability.

As presented in Table 1, the different datasets employed for the validation of various methods illustrate the diversity of testing scenarios used to construct semantic maps through multi-sensor systems in long-term settings. Building upon this, it is equally important to associate the evaluation metrics used as performance measures with the corresponding datasets to establish a comprehensive benchmarking framework. Datasets such as KITTI [48], nuScenes [49], Oxford RobotCar [224], and Boreas [225], among others, have become fundamental to this effort, offering multi-session recordings under diverse weather, time-of-day, and seasonal conditions. These datasets enable systematic and reproducible evaluation of localization systems over time, ensuring fair comparisons and fostering the development of algorithms that are resilient, scalable, and reliable in real-world long-term scenarios.

7.1. Datasets

Datasets are fundamental to the development of perception systems in autonomous vehicles, providing the necessary data for training, validating, and benchmarking algorithms. They support tasks such as 3D object detection, semantic segmentation, tracking, and sensor fusion. Over the past decade, multiple datasets have emerged to meet these challenges, each with varying sensor configurations, annotation detail, and environmental diversity. A comparative overview of major datasets is presented in Table 17.

The KITTI dataset [48] remains one of the most extensively used benchmarks for evaluating perception and localization systems in autonomous driving. It provides synchronized data from stereo cameras, LiDAR, GPS/IMU, and annotations for tasks such as odometry, 3D object detection, and semantic segmentation.

Despite its widespread impact, KITTI has notable limitations, including restricted geographic diversity and relatively simple urban environments, which constrain its applicability for long-term and large-scale localization tasks.

To address these limitations, the Waymo Open Dataset [230] offers a significantly larger and more diverse multi-modal dataset. It includes high-resolution LiDAR data, front-facing and surround-view camera imagery, and detailed 3D bounding-box annotations across a wide range of urban and suburban scenarios. The dataset’s scale and richness in annotated data make it highly suitable for training and evaluating deep learning models. However, its substantial data volume requires considerable storage capacity and computational resources, often limiting accessibility for researchers without high-end hardware.

The nuScenes dataset [49] significantly advanced the field by offering full 360° sensor coverage using a suite of six cameras, five radars, and a single spinning LiDAR sensor. It also includes high-definition map data and precise annotations over 20 s clips sampled at 2 Hz, enabling temporal reasoning in dynamic urban scenes. However, the LiDAR sensor used provides relatively low spatial resolution compared to datasets such as Waymo or Pandaset, potentially limiting its effectiveness in fine-grained 3D perception tasks.

To address long-term challenges, the EU Long-Term Dataset [231] captured sensor data across varying seasons, weather conditions, and times of day. This dataset is particularly valuable for evaluating robustness under environmental change.

Nonetheless, it lacks dense annotations and high-resolution sensor modalities, such as modern LiDAR systems, which limits its applicability for supervised learning and fine-scale semantic tasks.

The Audi A2D2 dataset [146] emphasizes multi-sensor fusion, incorporating six cameras and five LiDAR units to capture scenes under real-world driving conditions. It includes dense semantic segmentation and 3D bounding-box labels. However, its limited geographic diversity and lack of long temporal sequences reduce its utility for long-term localization research.

Pandaset [233] provides high-resolution 3D LiDAR, radar, and multi-camera data, along with detailed annotations suitable for tasks such as object detection and semantic segmentation. While it offers a rich sensor suite and high-quality metadata, its spatial coverage is relatively restricted compared to large-scale datasets like Waymo.

Despite the diversity and technological sophistication of current datasets, most have been collected in well-structured environments typical of developed countries. As a result, they often lack representation of the complex conditions found in developing regions—such as degraded road infrastructure, ambiguous or absent lane markings, non-standard traffic behaviors, and extreme variations in lighting and weather. These limitations underscore the need for more globally representative datasets to support robust, real-world autonomous systems.

The Boreas Dataset [225] is a multi-season, multi-sensor dataset designed to advance research in long-term autonomous localization by addressing the challenges posed by environmental changes over time. Collected over ten months across all four seasons, it includes diverse conditions such as snow, rain, fog, and varying lighting, making it ideal for evaluating the robustness of localization and SLAM algorithms in real-world, dynamic environments. The dataset features data from LiDAR, radar, cameras, GPS/INS, and vehicle CAN bus, providing a rich multi-modal sensor suite. One of its key strengths is the inclusion of radar, which remains effective under adverse weather when LiDAR or cameras may fail, enabling research into sensor fusion for improved robustness. With centimeter-level accurate ground truth and multiple revisits of the same locations, Boreas is particularly suited for studying multi-session mapping, loop closure detection, semantic SLAM, and lifelong localization. Its focus on long-term, highway-based driving scenarios, along with a wide variety of environmental conditions, sets it apart from urban-centric datasets like KITTI [48] or nuScenes [49], making Boreas [225] a valuable benchmark for testing resilient, scalable localization systems for autonomous vehicles.

In summary, recent benchmarking datasets for long-term localization exhibit an evolution toward richer sensor configurations, improved synchronization, and broader environmental coverage. Early datasets such as KITTI [48] and KAIST [227] primarily relied on LiDAR–camera setups with limited weather and time variability, while newer datasets like EU long-term [231], nuScenes [49], and Boreas [225] integrate additional modalities such as radar and high-precision GNSS/IMU to enhance robustness under challenging visibility and GNSS-degraded conditions. Hardware-based synchronization, as implemented in Boreas and EU long-term, provides superior temporal alignment, which is critical to minimize fusion drift in multi-modal systems.

Moreover, datasets including Oxford [224] and Boreas [225] introduce extensive temporal diversity, capturing multiple seasons, lighting, and weather conditions, thereby enabling a realistic evaluation of environmental invariance.

Collectively, modern datasets emphasize long-term consistency, high-fidelity ground truth, and resilience to dynamic environmental changes, positioning them as essential benchmarks for assessing drift, robustness, and generalization in autonomous vehicle localization. Figure 9 shows visual examples of some datasets presented in this section.

7.2. Evaluation Metrics

Evaluation metrics are essential for the design, validation, and benchmarking of long-term localization systems in autonomous vehicles. They provide standardized criteria for assessing accuracy, robustness, and computational efficiency, thereby facilitating fair comparisons across methods and ensuring that proposed solutions meet the demands of real-world deployment. Unlike short-term localization—where evaluation primarily emphasizes geometric accuracy over short trajectories—long-term localization requires metrics that reflect robustness to environmental changes such as seasonal variations, lighting conditions, weather effects, and the presence of dynamic objects.

As such, effective evaluation must go beyond trajectory accuracy to include the quality of semantic representations, the reliability of place recognition and loop closure, and the scalability of mapping frameworks over extended time periods. Well-defined metrics enable a comprehensive analysis of system performance. They range from geometric error measures like Root Mean Square Error (RMSE), Absolute Trajectory Error (ATE), and Relative Pose Error (RPE) to recognition-based metrics such as precision, recall, F1-score, and ROC/AUC, as well as system-level criteria like runtime, memory usage, and performance under varying environmental conditions.

This section introduces and categorizes the key evaluation metrics used in long-term localization, with a focus on two principal dimensions: fine localization (i.e., precise pose estimation) and global localization (i.e., place recognition and loop closure).

In pose estimation, the most common metrics used are Translation Error (TE), Rotation Error (RE), and the combination of both. The translation error corresponds to the Euclidean distance between estimated and true positions.

$$\mathrm{TE} = \lVert t_{est} - t_{gt} \rVert \tag{4}$$

The Rotation Error corresponds to the angular difference between the estimated and true orientations, typically computed using rotation matrices or quaternions.

$$\mathrm{RE} = \cos^{-1}\left(\frac{\operatorname{trace}(R_{est}^{\top} R_{gt}) - 1}{2}\right) \tag{5}$$
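Equations (4) and (5) can be illustrated with a minimal NumPy sketch. The helper names (`translation_error`, `rotation_error`, `rot_z`) and the example poses are hypothetical and chosen only for illustration; the clipping step is an added numerical safeguard, not part of the definition.

```python
import numpy as np

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and true positions (Eq. 4)."""
    diff = np.asarray(t_est, dtype=float) - np.asarray(t_gt, dtype=float)
    return float(np.linalg.norm(diff))

def rotation_error(R_est, R_gt):
    """Angle in radians between estimated and true orientations (Eq. 5)."""
    R_rel = np.asarray(R_est, dtype=float).T @ np.asarray(R_gt, dtype=float)
    cos_angle = (np.trace(R_rel) - 1.0) / 2.0
    # Round-off can push the value slightly outside [-1, 1]; clip before arccos.
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def rot_z(theta):
    """Hypothetical helper: rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Illustrative values: 0.5 m height offset and 0.15 rad heading offset.
te = translation_error([1.0, 2.0, 0.0], [1.0, 2.0, 0.5])
re = rotation_error(rot_z(0.10), rot_z(0.25))
print(f"TE = {te:.3f} m, RE = {np.degrees(re):.2f} deg")
```

The rotation error is symmetric in its two arguments, so swapping the estimate and the ground truth yields the same angle.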

In place recognition, the most common metrics used are precision–recall and F-score. These metrics are derived from how the positive and negative matches reported by a method compare with the ground truth. The precision–recall curve has been widely used to evaluate this imbalanced matching problem.

Based on the matching results and the ground-truth information, correct positive matches are regarded as True Positives (TPs), whereas incorrect positive matches are regarded as False Positives (FPs). Similarly, True Negatives (TNs) and False Negatives (FNs) represent correct negative matches and incorrect negative matches, respectively.

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{6}$$

The F-score $F_{\beta}$ is another widely used evaluation metric associated with place recognition, especially the $F_{1}$-score. $F_{\beta}$ considers both precision and recall, and its calculation is given by:

$$F_{\beta} = (1 + \beta^{2}) \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{(\beta^{2} \cdot \mathrm{Precision}) + \mathrm{Recall}} \tag{7}$$

where $\beta \in \mathbb{R}^{+}$ is chosen to make recall $\beta$ times as important as precision. When $\beta = 1$, we obtain the $F_{1}$ score, which is the harmonic mean of precision and recall.

To quantitatively evaluate localization performance, we adopt the widely used Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) metrics.
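Before turning to the trajectory metrics, the place recognition measures in Equations (6) and (7) can be sketched in a few lines of Python. The function names and the match counts below are hypothetical, and returning zero for an empty denominator is a convention assumed here rather than something the definitions prescribe.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from match counts (Eq. 6)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

def f_beta(precision, recall, beta=1.0):
    """F-score weighting recall beta times as much as precision (Eq. 7)."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # assumed convention when both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denom

# Hypothetical loop-closure counts: 80 correct matches, 20 false alarms, 40 misses.
p, r = precision_recall(tp=80, fp=20, fn=40)
f1 = f_beta(p, r)            # harmonic mean of precision and recall
f2 = f_beta(p, r, beta=2.0)  # recall-weighted variant
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f} F2={f2:.3f}")
```

With these counts precision exceeds recall, so the recall-weighted $F_{2}$ comes out lower than $F_{1}$, which shows how $\beta$ shifts the score toward whichever quantity is emphasized.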

ATE measures the global consistency between the estimated and ground-truth trajectories by computing the Root Mean Square Error (RMSE) of the translational differences after alignment. This metric reflects the accumulated drift over the entire trajectory and is particularly relevant for assessing long-term localization accuracy.

In contrast, RPE evaluates the local accuracy of the estimated motion by comparing relative pose transformations over a fixed interval. It captures the drift per time step or distance traveled, providing insight into the short-term stability of the system. This distinction makes RPE especially useful for analyzing incremental errors that may not be evident in global trajectory comparisons.

In addition to ATE and RPE, other complementary metrics are often considered to provide a more comprehensive evaluation. These include separate translation and rotation errors, which allow finer analysis of positional and angular drift, as well as drift per distance traveled to quantify error accumulation in long-term scenarios. Furthermore, in place recognition and loop closure tasks, metrics such as Recall@K and precision–recall curves are commonly employed to evaluate the system’s ability to correctly identify revisited locations.
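Recall@K is not defined formally in this section. Under its usual reading, a query counts as correct when at least one of its K nearest database entries is a true revisit. The sketch below assumes global descriptors compared by Euclidean distance; the function name `recall_at_k` and the toy descriptors are hypothetical.

```python
import numpy as np

def recall_at_k(query_desc, db_desc, gt_matches, k=1):
    """Fraction of queries with at least one true match among the top-k retrievals."""
    query_desc = np.asarray(query_desc, dtype=float)
    db_desc = np.asarray(db_desc, dtype=float)
    # Pairwise squared Euclidean distances, shape (num_queries, num_database).
    diff = query_desc[:, None, :] - db_desc[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    top_k = np.argsort(dist, axis=1)[:, :k]
    hits = [bool(set(row.tolist()) & set(gt)) for row, gt in zip(top_k, gt_matches)]
    return float(np.mean(hits))

# Toy 2D descriptors for four mapped places and three queries.
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
queries = [[0.1, 0.0], [0.9, 0.2], [0.45, 0.55]]
ground_truth = [{0}, {1}, {0}]  # database indices that are true revisits per query

r_at_1 = recall_at_k(queries, database, ground_truth, k=1)
r_at_2 = recall_at_k(queries, database, ground_truth, k=2)
print(f"Recall@1 = {r_at_1:.3f}, Recall@2 = {r_at_2:.3f}")
```

In this toy case the third query is perceptually aliased with a neighboring place, so it is missed at K = 1 but recovered at K = 2.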

Together, these metrics provide a multi-level evaluation framework, where ATE captures global trajectory accuracy, RPE reflects local motion consistency, and complementary metrics enable a deeper understanding of robustness and long-term performance in dynamic and large-scale environments.

$$\mathrm{ATE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert T_{est,i} - T_{gt,i} \rVert^{2}} \tag{8}$$

$$\mathrm{RPE} = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N-1} \lVert T_{est,i}^{-1} T_{est,i+1} - T_{gt,i}^{-1} T_{gt,i+1} \rVert^{2}} \tag{9}$$

where $T_{est,i}$ and $T_{gt,i}$ are the estimated and ground-truth poses at time step $i$, respectively, and $N$ denotes the total number of poses.
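A minimal sketch of Equations (8) and (9) follows, under two assumptions that are not stated in the equations themselves: the trajectories are already aligned in a common frame, and the norms are applied to the translational components of the poses, in line with the textual description of ATE. The helper names and the synthetic trajectory are hypothetical. Common evaluation tools often compute RPE from the relative error transform instead of the difference used here.

```python
import numpy as np

def make_pose(x, y, yaw):
    """Hypothetical helper: 4x4 homogeneous pose from planar position and heading."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    T[:2, 3] = [x, y]
    return T

def ate_rmse(T_est, T_gt):
    """RMSE of translational differences between aligned trajectories (Eq. 8)."""
    err = T_est[:, :3, 3] - T_gt[:, :3, 3]
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))

def rpe_rmse(T_est, T_gt):
    """RMSE of differences between consecutive relative motions (Eq. 9)."""
    rel_est = np.linalg.inv(T_est[:-1]) @ T_est[1:]  # T_i^{-1} T_{i+1}, estimate
    rel_gt = np.linalg.inv(T_gt[:-1]) @ T_gt[1:]     # T_i^{-1} T_{i+1}, ground truth
    err = rel_est[:, :3, 3] - rel_gt[:, :3, 3]
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))

# Synthetic example: straight 1 m steps, estimated with a 10% scale drift.
gt_traj = np.stack([make_pose(float(i), 0.0, 0.0) for i in range(5)])
est_traj = np.stack([make_pose(1.1 * i, 0.0, 0.0) for i in range(5)])

ate = ate_rmse(est_traj, gt_traj)
rpe = rpe_rmse(est_traj, gt_traj)
print(f"ATE = {ate:.3f} m, RPE = {rpe:.3f} m")
```

In this synthetic case the per-step error stays at 0.1 m, so RPE is constant, while ATE grows with trajectory length because the same error accumulates. This mirrors the global versus local distinction drawn in the text.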

While geometric metrics such as Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) remain the standard for evaluating SLAM systems, they are insufficient to fully characterize the performance of semantic SLAM in real-world and safety-critical applications. In particular, these metrics do not capture semantic-specific failure modes, such as label drift over time, inconsistent object classification, or the impact of misclassifications on downstream tasks. For instance, false-negative detections of critical objects (e.g., pedestrians) may not significantly affect geometric accuracy but can lead to severe consequences in planning and decision-making modules. Therefore, recent research highlights the need for complementary evaluation protocols that account for semantic consistency, temporal stability, and task-aware performance. Metrics that quantify semantic drift, object-level persistence, and classification uncertainty are essential to assess the reliability of semantic representations over long-term deployments. Furthermore, evaluating the effect of perception errors on higher-level tasks—such as path planning under partial or incorrect semantic information—provides a more realistic measure of system robustness. These considerations are closely related to safety certification frameworks, where perception failures must be analyzed in terms of risk and reliability. In this context, standards such as ISO 26262 [234] emphasize the importance of identifying and mitigating hazardous failure modes, including those arising from incorrect or missing semantic labels. Incorporating such safety-oriented evaluation criteria into semantic SLAM benchmarking remains an open challenge, but it is a necessary step toward deploying trustworthy and certifiable autonomous systems in complex, real-world environments.

7.3. Long-Term Evaluation Perspective

The long-term evaluation of SLAM systems requires not only high accuracy in static scenes but also robustness under temporal, environmental, and sensor variability. Such evaluation remains challenging due to the lack of standardized benchmarking protocols. Different studies often employ distinct datasets, evaluation metrics, and experimental setups, making direct comparisons difficult. A unified benchmarking framework should consider multiple evaluation dimensions, including localization accuracy, robustness under environmental changes, computational efficiency, and scalability. Benchmark datasets such as KITTI, SemanticKITTI, and Oxford RobotCar provide valuable testbeds for evaluating long-term localization performance. Establishing standardized evaluation protocols across these datasets would enable more reliable comparisons and accelerate progress in semantic SLAM research.

Among the reviewed works, the best-performing methods reveal consistent trends when comparing their semantic segmentation accuracy, mapping consistency, and localization reliability.

In semantic segmentation, transformer-based and multi-scale fusion models demonstrate superior generalization. For instance, MSeg3D achieves a remarkable mIoU of 72.7% on SemanticKITTI [145], while Spherical Transformer and EPMF report 70.1% and 68.5%, respectively—outperforming early CNN-based architectures such as SalsaNext (59.5%) and PointNet (47.4%). These improvements reflect the benefit of fusing multi-scale geometric and contextual cues, essential for maintaining semantic consistency over time.
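The mIoU scores quoted above are not defined in this section. Under the standard definition, the intersection over union TP / (TP + FP + FN) is computed per class and then averaged over classes. The sketch below assumes integer class labels and skips classes absent from both prediction and ground truth; the function name and the toy labels are hypothetical and unrelated to the reported results.

```python
import numpy as np

def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection over union across classes present in the data."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # Confusion matrix: rows index the true class, columns the predicted class.
    conf = np.bincount(num_classes * y_true + y_pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    valid = denom > 0  # ignore classes that never occur
    return float(np.mean(tp[valid] / denom[valid]))

# Toy per-point labels for three classes.
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 1, 1, 1, 2, 0]
miou = mean_iou(labels_true, labels_pred, num_classes=3)
print(f"mIoU = {miou:.3f}")
```

Because every class contributes equally to the average, rare classes weigh as much as frequent ones, which is one reason mIoU is preferred over plain point accuracy for segmentation benchmarks.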

From the semantic mapping perspective, graph-based and probabilistic fusion methods exhibit stronger temporal stability. Approaches like SemSegMap integrate semantic priors into SLAM back-ends, preserving map consistency even under dynamic object variations. In particular, SemSegMap combines semantic likelihood fields with LiDAR odometry, enabling long-term map reuse with an average pose drift below 1%.

A crucial component of long-term autonomy is place recognition, which supports loop closure and relocalization under substantial environmental changes. Across all modalities, methods that incorporate semantic and cross-modal representations demonstrate the best overall robustness. In vision-based recognition, transformer architectures such as TransVPR [164] and MixVPR [165] achieve outstanding generalization through global attention mechanisms, with MixVPR reaching 93.6% recall@1 on Pittsburgh and 89.2% on Nordland, over 10% higher than traditional CNN-based descriptors like NetVLAD [162]. LiDAR-based approaches similarly exhibit strong geometric stability, with OverlapTransformer [207] reporting 95.1% recall at 1% precision on the MulRan DCC dataset [232], outperforming handcrafted descriptors such as Scan Context, Scan Context++ [235,236], and M2DP [237] by more than 15%. This robustness to viewpoint and structural variations makes LiDAR-based methods essential for geometric consistency over time.

In recent years, fusion-based place recognition has emerged as a dominant paradigm for sustainable performance in long-term mapping. These methods combine complementary sensory modalities—typically LiDAR and camera, and increasingly radar—to exploit both geometric precision and semantic richness. OverlapNet [205] initially demonstrated the benefits of joint feature encoding, achieving 94.8% recall at 1% precision on KITTI [48] and 91.7% on Oxford RobotCar [224]. Transformer-based fusion models such as TransGeo [238] further improved cross-domain adaptation, maintaining over 90% recall across lighting and weather variations.

More recent architectures, including PRFusion++ [218], LCPR [223], and LRFusionPR [222], have extended the paradigm of cross-modal learning through attention-driven feature alignment and unified latent representations. PRFusion++ achieves 93.5% recall@1 on Oxford RobotCar and 91.2% on MulRan [232], surpassing OverlapNet by up to 10%. LCPR attains 92.8% recall@1 on KITTI [48] and remains above 88% under illumination variation, while LRFusionPR, which integrates LiDAR, radar, and visual cues, achieves 90.4% recall at 1% precision on Boreas under snow and fog, improving by 12% over LiDAR-only approaches. These results confirm that deep multi-modal fusion provides a balanced trade-off between semantic fidelity, geometric stability, and environmental adaptability.

The most invariant and semantically meaningful representations, however, arise from semantic graph-based approaches, which encode the environment as structured graphs of objects and relationships. GOSMatch [214] models spatial interrelations through hierarchical or neural graph embeddings, achieving 92–94% recall across multi-season evaluations in Boreas [225] and MulRan [232], exceeding purely geometric baselines by over 18%. SegMap [193] extends this direction by integrating geometric surface descriptors with semantic categories, maintaining over 90% recall even when significant portions of the scene evolve structurally. These approaches demonstrate that topological reasoning at the semantic level is fundamental for long-term map consistency and temporally persistent localization.

Regarding datasets, although SemanticKITTI [145], KITTI-360 [226], and nuScenes [49] remain the main semantic benchmarks, they offer limited temporal diversity. Oxford RobotCar [224] and MulRan [232] enable multi-session evaluation under varying conditions, but the Boreas dataset [225] sets a new standard for long-term place recognition and semantic mapping. Boreas spans an entire year of traversals across diverse lighting, weather, and seasonal conditions, with synchronized LiDAR, camera, radar, and GNSS/INS data. Models jointly trained or fine-tuned on SemanticKITTI [145] and Boreas maintain over 85% recall across seasons, compared to less than 70% when trained on SemanticKITTI [145] alone, establishing Boreas as the most comprehensive dataset for evaluating temporal domain adaptation and semantic persistence.

8. Open Challenges and Outlook

Beyond perception, integrating odometry with semantic graph representations remains a fundamental yet unresolved challenge for long-term localization. While odometric sensors such as IMU or wheel encoders provide reliable short-term motion estimates, their cumulative drift significantly degrades global consistency over time. Semantic graph-based maps offer a promising alternative by encoding spatial and semantic relationships among persistent landmarks, enabling drift correction through probabilistic data association. However, achieving scalable, uncertainty-aware fusion between odometric trajectories and continuously evolving semantic graphs remains an open research problem, particularly in large-scale and dynamic environments.

These challenges are further exacerbated by the reliance of current semantic SLAM systems on closed-set object representations. Most approaches assume a predefined set of semantic classes learned from datasets collected in structured environments, which limits their ability to generalize to unstructured or open-world scenarios. In real-world deployments, environments often exhibit significant variability in geometry, appearance, and dynamics, including objects and structures that are not present in the training data. As a result, semantic perception becomes unreliable, leading to incomplete scene understanding and degraded localization performance. Moreover, the datasets commonly used for benchmarking mapping and localization systems are typically biased toward structured and homogeneous environments, with a limited diversity of object classes and environmental conditions. This creates a gap between controlled experimental settings and real-world scenarios, where factors such as irregular terrain, cluttered layouts, and changing environmental conditions introduce additional uncertainty. Consequently, current systems struggle to maintain robustness and long-term consistency when deployed in unstructured and evolving environments.

Addressing these limitations requires moving beyond closed-set assumptions toward more adaptive and generalizable frameworks. Promising directions include open-set and open-world perception methods, the integration of geometric and motion-based cues to complement semantic information, and the development of richer, more diverse datasets that capture the complexity of real-world environments. Ultimately, enabling robust long-term localization and semantic mapping in dynamic and unstructured scenarios will depend on the ability of systems to continuously adapt, learn, and reason under uncertainty.

Looking forward, future research should focus on developing temporally consistent and adaptive semantic graph representations that evolve alongside the environment while preserving topological and semantic coherence. Incorporating probabilistic similarity measures between semantic subgraphs, as well as lifelong learning strategies for continuous model refinement, will be essential to improve resilience against drift, perceptual uncertainty, and environmental changes over time. Finally, advances in deep multi-modal fusion architectures—leveraging complementary features from LiDAR and camera within intermediate network layers—will pave the way for uncertainty-aware and computationally efficient models capable of supporting persistent localization in large-scale, dynamic urban environments.

In summary, the convergence of deep semantic segmentation, multi-sensor fusion, and graph-based mapping represents a promising direction toward reliable long-term localization. By combining textured point-cloud representations with adaptive graph structures, autonomous vehicles can achieve more stable, context-aware, and drift-resistant navigation, ultimately bridging the gap between perception and robust spatial understanding over extended time horizons.

In addition, semantic SLAM systems play a critical role within the perception and localization layers of the autonomous driving pipeline. Beyond providing accurate vehicle pose estimation, these systems generate semantically enriched maps that can be leveraged by higher-level modules such as decision-making and path planning, as depicted in Figure 10. For example, semantic information about road structures, traffic signs, and dynamic obstacles enables planning algorithms to generate safer and more context-aware navigation strategies. Additionally, semantic maps facilitate more intuitive human–machine interaction by allowing the system to communicate environmental information and navigation decisions in a human-understandable manner.

9. Conclusions

Accurate and persistent localization remains a cornerstone for autonomous vehicle (AV) navigation in complex and evolving real-world environments. Traditional SLAM approaches, while effective in static and structured settings, face limitations under dynamic conditions, long-term deployments, and significant environmental changes. This survey highlights the role of Semantic SLAM and multi-modal fusion as transformative solutions to these challenges. By incorporating semantic understanding into spatial maps and leveraging complementary sensing modalities such as LiDAR and cameras, AVs gain both contextual awareness and geometric precision, enabling more reliable long-term localization.

Recent advances demonstrate that semantic information improves data association and loop closure by providing robust, high-level scene descriptors that remain consistent despite appearance variations or structural changes. Multi-modal fusion further strengthens this capability by integrating redundant and complementary features, resulting in textured point clouds and enriched semantic maps that are resilient to environmental dynamics. Graph-based representations and deep learning-driven segmentation also contribute to advancing scalability and adaptability in long-term SLAM.

Despite these achievements, several open challenges remain. Current systems still struggle with scalability in large-scale urban deployments, uncertainty handling in dynamic object interactions, and continuous adaptation to time-varying semantics. Benchmark datasets and evaluation metrics provide an essential foundation for comparative analysis, yet the community must further advance standardized protocols that reflect real-world complexities. Looking forward, future research should emphasize scalable graph-based models, probabilistic data association techniques, and lifelong learning strategies that allow semantic maps to evolve alongside changing environments. Furthermore, deeper integration of multi-modal cues, together with efficient deployment on real-time platforms, will be key to ensuring robust, reliable, and persistent localization for autonomous vehicles operating in dynamic, long-term scenarios.

Author Contributions

Conceptualization, Á.N.-P.; methodology, Á.N.-P. and B.B.-C.; supervision, B.B.-C.; writing—review and editing, B.B.-C.; investigation and writing—original draft preparation, Á.N.-P.; funding acquisition, E.C.-B. All authors have read and agreed to the published version of the manuscript.

Funding

The publication of this article was partially funded by the Vice-Rector’s Office for Research of the University of Quindío, in support of the doctoral study commission. Likewise, it was partially funded by Universidad del Valle under the call “Support for Doctoral Students 2026”.

Data Availability Statement

All data are contained within the article.

Acknowledgments

We would like to express our sincere gratitude to Universidad del Valle and Universidad del Quindío for their valuable support in the funding of this article, as part of the research of the first author towards his PhD in engineering. Their commitment to the promotion of scientific research and academic development has been instrumental in making this work possible. During the preparation of this manuscript, the authors used QuillBot Premium v40.144.1 as a language refinement tool to improve English grammar, clarity, and writing quality. The authors have critically reviewed and edited all generated content and take full responsibility for the final version of the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Bowman, S.L.; Atanasov, N.; Daniilidis, K.; Pappas, G.J. Probabilistic data association for semantic SLAM. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1722–1729.
McCormac, J.; Clark, R.; Bloesch, M.; Davison, A.J.; Leutenegger, S. Fusion++: Volumetric Object-Level SLAM. arXiv 2018, arXiv:1808.08378.
Yusefi, A.; Durdu, A.; Toy, I. Camera/LiDAR Sensor Fusion-based Autonomous Navigation. In Proceedings of the 2024 23rd International Symposium INFOTEH-JAHORINA (INFOTEH), East Sarajevo, Bosnia and Herzegovina, 20–22 March 2024; pp. 1–6.
Abdelfattah, M.; Yuan, K.; Wang, Z.J.; Ward, R. Multi-modal Streaming 3D Object Detection. arXiv 2022, arXiv:2209.04966.
Wang, M.; Liu, W.; Zhou, B.; Wang, Z.; Liu, R.; Wang, H. A Robust Camera-LiDAR Fusion Framework for 3D Object Detection in High-Dust Environments. In Proceedings of the 2024 IEEE 22nd International Conference on Industrial Informatics (INDIN), Beijing, China, 18–20 August 2024; pp. 1–6.
Chen, B.; Shen, H.; Zhao, Z.; Yu, L.; Zhao, Y. LiDAR-Camera cross fusion network towards 3D object detection in self-driving. IEEE Sensors J. 2024, 25, 21857–21866.
Erkent, Ö.; Wolf, C.; Laugier, C.; Gonzalez, D.S.; Cano, V.R. Semantic Grid Estimation with a Hybrid Bayesian and Deep Neural Network Approach. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 888–895.
Tsintotas, K.A.; Bampis, L.; Gasteratos, A. The Revisiting Problem in Simultaneous Localization and Mapping: A Survey on Visual Loop Closure Detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19929–19953.
Li, L.; Kong, X.; Zhao, X.; Li, W.; Wen, F.; Zhang, H.; Liu, Y. SA-LOAM: Semantic-aided LiDAR SLAM with Loop Closure. arXiv 2021, arXiv:2106.11516.
Stenborg, E.; Toft, C.; Hammarstrand, L. Long-term Visual Localization using Semantically Segmented Images. arXiv 2018, arXiv:1801.05269.
Huang, X.; Xu, Z.; Wu, H.; Wang, J.; Xia, Q.; Xia, Y.; Li, J.; Gao, K.; Wen, C.; Wang, C. L4DR: LiDAR-4DRadar Fusion for Weather-Robust 3D Object Detection. arXiv 2025, arXiv:2408.03677.
Zhu, D.; Yang, G. RDynaSLAM: Fusing 4D Radar Point Clouds to Visual SLAM in Dynamic Environments. J. Intell. Robot. Syst. 2025, 111, 11.
Wang, B.; Zhuang, Y.; Huai, J.; Chen, Y.; Chen, J.; El-Bendary, N. GV-iRIOM: GNSS-visual-aided 4D radar inertial odometry and mapping in large-scale environments. ISPRS J. Photogramm. Remote. Sens. 2025, 221, 310–323.
Liu, H.; Xu, G.; Liu, B.; Li, Y.; Yang, S.; Tang, J.; Pan, K.; Xing, Y. A real time LiDAR-Visual-Inertial object level semantic SLAM for forest environments. ISPRS J. Photogramm. Remote. Sens. 2025, 219, 71–90.
Zhao, Y.; Wang, C.; Ouyang, Y.; Zhong, J.; Li, Y.; Zhao, N. DHDP-SLAM: Dynamic Hierarchical Dirichlet Process based data association for semantic SLAM. Displays 2025, 86, 102892.
Chen, N.; Wei, D.; Lin, D.; Lin, L. Semantic SLAM using laser-vision data fusion: Enhancing autonomous navigation in unstructured environments. Alex. Eng. J. 2025, 127, 606–618.
Li, F.; Fu, C.; Wang, J.; Sun, D. Dynamic Semantic SLAM Based on Panoramic Camera and LiDAR Fusion for Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2025, 27, 2763–2776.
Jiao, J.; Geng, R.; Li, Y.; Xin, R.; Yang, B.; Wu, J.; Wang, L.; Liu, M.; Fan, R.; Kanoulas, D. Real-Time Metric-Semantic Mapping for Autonomous Navigation in Outdoor Environments. IEEE Trans. Autom. Sci. Eng. 2025, 22, 5729–5740.
Wan, J.; Zhang, X.; Dong, S.; Zhang, Y.; Yang, Y.; Wu, R.; Jiang, Y.; Li, J.; Lin, J.; Yang, M. Monocular Localization with Semantics Map for Autonomous Vehicles. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 14146–14152.
Zhang, C.; Zhao, H.; Wang, C.; Tang, X.; Yang, M. Cross-Modal Monocular Localization in Prior LiDAR Maps Utilizing Semantic Consistency. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 4004–4010. [Google Scholar] [CrossRef]
Cheng, X.; Geng, K.; Yin, G.; Sun, Y.; Wang, J.; Ding, P. Semantic Mapping Optimization Based on LIDAR and Camera Data Fusion for autonomous vehicle. In Proceedings of the 2022 6th CAA International Conference on Vehicular Control and Intelligence (CVCI), Nanjing, China, 28–30 October 2022; pp. 1–6. [Google Scholar] [CrossRef]
Song, X.; Zhijiang, Z.; Liang, X.; Huaidong, Z. Monocular camera and laser based semantic mapping system with temporal-spatial data association for indoor mobile robots. Multimed. Tools Appl. 2023, 82, 34459–34484. [Google Scholar] [CrossRef]
Ding, F.; Ji, X.; Wei, D.; Zhang, J.; Li, K.; Yuan, H. Monocular Mapping and Localization of Urban Road Scenes Based on Parameterized Semantic Representation. In Proceedings of the CEUR Workshop Proceedings at the 13th International Conference on Indoor Positioning and Indoor, Nuremberg, Germany, 25–28 September 2023. [Google Scholar]
Yi, F.; Ye, L.; Huang, M.; Wang, Q. A Method for Constructing Semantic Navigation Maps in Urban Environments. In Proceedings of the 2023 6th International Conference on Intelligent Autonomous Systems (ICoIAS), Qinhuangdao, China, 22–24 September 2023; pp. 141–147. [Google Scholar] [CrossRef]
Berrio, J.S.; Shan, M.; Worrall, S.; Nebot, E. Camera-LIDAR Integration: Probabilistic Sensor Fusion for Semantic Mapping. IEEE Trans. Intell. Transp. Syst. 2022, 23, 7637–7652. [Google Scholar] [CrossRef]
Qian, Z.; Patath, K.; Fu, J.; Xiao, J. Semantic SLAM with Autonomous Object-Level Data Association. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11203–11209. [Google Scholar] [CrossRef]
Kang, X.; Li, J.; Fan, X.; Jian, H.; Xu, C. Object-level semantic map construction for dynamic scenes. Appl. Sci. 2021, 11, 645. [Google Scholar] [CrossRef]
Sualeh, M.; Kim, G.W. Semantics aware dynamic SLAM based on 3D MODT. Sensors 2021, 21, 6355. [Google Scholar] [CrossRef]
Paz, D.; Zhang, H.; Li, Q.; Xiang, H.; Christensen, H.I. Probabilistic Semantic Mapping for Urban Autonomous Driving Applications. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 2059–2064. [Google Scholar] [CrossRef]
Guan, P.; Cao, Z.; Chen, E.; Liang, S.; Tan, M.; Yu, J. A real-time semantic visual SLAM approach with points and objects. Int. J. Adv. Robot. Syst. 2020, 17, 1729881420905443. [Google Scholar] [CrossRef]
Jiao, J.; Geng, R.; Li, Y.; Yang, B.; Liu, M. Online Metric-Semantic Mapping for Autonomous Robot Navigation. In Proceedings of the Robot Representations Workshop, Robotics: Science and Systems (RSS), Corvallis, OR, USA, 12–16 June 2020. [Google Scholar]
Li, J.; Zhang, X.; Li, J.; Liu, Y.; Wang, J. Building and optimization of 3D semantic map based on Lidar and camera fusion. Neurocomputing 2020, 409, 394–407. [Google Scholar] [CrossRef]
Nakajima, Y.; Saito, H. Efficient object-oriented semantic mapping with object detector. IEEE Access 2019, 7, 3206–3213. [Google Scholar] [CrossRef]
Chi, J.; Wu, H.; Tian, G. Object-oriented 3D semantic mapping based on instance segmentation. J. Adv. Comput. Intell. Intell. Inform. 2019, 23, 695–704. [Google Scholar] [CrossRef]
Zhang, L.; Wei, L.; Shen, P.; Wei, W.; Zhu, G.; Song, J. Semantic SLAM Based on Object Detection and Improved Octomap. IEEE Access 2018, 6, 75545–75559. [Google Scholar] [CrossRef]
Barsan, I.A.; Liu, P.; Pollefeys, M.; Geiger, A. Robust Dense Mapping for Large-Scale Dynamic Environments. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 7510–7517. [Google Scholar] [CrossRef]
Yang, Y.; Qiu, F.; Li, H.; Zhang, L.; Wang, M.L.; Fu, M.Y. Large-scale 3D Semantic Mapping Using Stereo Vision. Int. J. Autom. Comput. 2018, 15, 194–206. [Google Scholar] [CrossRef]
Goga, S.E.C.; Nedevschi, S. Fusing semantic labeled camera images and 3D LiDAR data for the detection of urban curbs. In Proceedings of the 2018 IEEE 14th International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 6–8 September 2018; pp. 301–308. [Google Scholar] [CrossRef]
McCormac, J.; Handa, A.; Davison, A.; Leutenegger, S. SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 4628–4635. [Google Scholar] [CrossRef]
Mur-Artal, R.; Tardos, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
Segal, A.; Hähnel, D.; Thrun, S. Generalized-ICP. In Proceedings of the Robotics: Science and Systems; Trinkle, J., Matsuoka, Y., Castellanos, J.A., Eds.; The MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
Khanam, R.; Hussain, M. What is YOLOv5: A deep look into the internal features of the popular object detector. arXiv 2024, arXiv:2407.20892. [Google Scholar] [CrossRef]
Yaseen, M. What is YOLOv8: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector. arXiv 2024, arXiv:2408.15857. [Google Scholar]
Kwak, J.; Sung, Y. DeepLabV3-Refiner-Based Semantic Segmentation Model for Dense 3D Point Clouds. Remote. Sens. 2021, 13, 1565. [Google Scholar] [CrossRef]
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2018, arXiv:1703.06870. [Google Scholar]
Biber, P.; Straßer, W. The normal distributions transform: A new approach to laser scan matching. In Proceedings of the 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003) (Cat. No.03CH37453), Las Vegas, NV, USA, 27–31 October 2003; Volume 3, pp. 2743–2748. [Google Scholar]
Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets Robotics: The KITTI Dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. arXiv 2020, arXiv:1903.11027. [Google Scholar] [CrossRef]
Zhou, L.; Li, Z.; Kaess, M. Automatic Extrinsic Calibration of a Camera and a 3D LiDAR Using Line and Plane Correspondences. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; IEEE Press: Piscataway, NJ, USA, 2018; pp. 5562–5569. [Google Scholar] [CrossRef]
Kim, E.s.; Park, S.Y. Extrinsic Calibration between Camera and LiDAR Sensors by Matching Multiple 3D Planes. Sensors 2020, 20, 52. [Google Scholar] [CrossRef] [PubMed]
Meilin, C.; Guotao, J.; Zhichao, P.; Qian, H.; Bin, T.; Shuo, L.; Hailang, Y. An Extrinsic Calibration Method for LiDAR and Camera Based on Feature Matching. Control. Inf. Technol. 2024, 102–108. [Google Scholar] [CrossRef]
Verma, S.; Berrio, J.S.; Worrall, S.; Nebot, E. Automatic extrinsic calibration between a camera and a 3D Lidar using 3D point and plane correspondences. arXiv 2019, arXiv:1904.12433. [Google Scholar] [CrossRef]
Mishra, S.; Osteen, P.R.; Pandey, G.; Saripalli, S. Experimental Evaluation of 3D-LIDAR Camera Extrinsic Calibration. arXiv 2020, arXiv:2007.01959. [Google Scholar] [CrossRef]
Wang, W.; Sun, R.; Wang, Z.; Hu, Z. A Step-By-Step Approach for Camera and Low-Resolution-3D-LiDAR Calibration. In Proceedings of the 2022 17th International Conference on Control, Automation, Robotics, and Vision (ICARCV), Las Vegas, NV, USA, 6–8 January 2023; IEEE: Piscataway, NJ, USA, 2022; pp. 555–561. [Google Scholar] [CrossRef]
Guo, J.; Liu, C.; Cui, X.; He, Z.; Wang, Z. Automatic Extrinsic Calibration for Lidar-Photoneo Camera Using a Hemispherical Calibration Board. arXiv 2023, arXiv:2304.09062. [Google Scholar]
Zhang, X.; Luo, W.; Xu, Y. Camera–LiDAR Calibration Using Iterative Random Sampling and Intersection Line-Based Quality Evaluation. Electronics 2024, 13, 249. [Google Scholar] [CrossRef]
Wang, Z.; Li, M.; Yang, Y.; Zhang, Y.F.; Sörstedt, J.; Chen, Y. A two-step approach to lidar-camera calibration. In Proceedings of the 2021 IEEE 19th International Conference on Embedded and Ubiquitous Computing (EUC), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 201–207. [Google Scholar]
Lao, Y.; Wei, S.; Liu, G.; Liu, C.; Yang, T. Enhanced Extrinsic Calibration Method for Camera-LiDAR Fusion and Monitoring of Safety Threats to Power Transmission Lines. Int. Arch. Photogramm. Remote. Sens. Spat. Inf. Sci. 2025, XLVIII-G-2025, 877–884. [Google Scholar] [CrossRef]
Zhang, B.; Zheng, Y.; Zhang, Z.; He, Q. LiDAR and Camera Calibration Using Pyramid and Checkerboard Calibrators. In Proceedings of the 2023 IEEE 8th International Conference on Big Data Analytics (ICBDA), Harbin, China, 3–5 March 2023; pp. 187–192. [Google Scholar] [CrossRef]
Cai, Y.; Zhan, Y.; Deng, W. A Novel Extrinsic Calibration Method of a Camera-And-LiDAR System. In Proceedings of the 2021 IEEE 7th International Conference on Virtual Reality (ICVR), Foshan, China, 20–22 May 2021; pp. 109–116. [Google Scholar] [CrossRef]
Xiao, Z.; Li, H.; Zhou, D.; Dai, Y.; Dai, B. Accurate extrinsic calibration between monocular camera and sparse 3D Lidar points without markers. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV), Los Angeles, CA, USA, 11–14 June 2017; pp. 424–429. [Google Scholar] [CrossRef]
Liu, X.; Yuan, C.; Zhang, F. Targetless Extrinsic Calibration of Multiple Small FoV LiDARs and Cameras Using Adaptive Voxelization. IEEE Trans. Instrum. Meas. 2022, 71, 8502612. [Google Scholar] [CrossRef]
Borer, J.; Tschirner, J.; Ölsner, F.; Milz, S. From Chaos to Calibration: A Geometric Mutual Information Approach to Target-Free Camera LiDAR Extrinsic Calibration. arXiv 2023, arXiv:2311.01905. [Google Scholar]
Borer, J.; Tschirner, J.; Ölsner, F.; Milz, S. Continuous Online Extrinsic Calibration of Fisheye Camera and LiDAR. arXiv 2023, arXiv:2306.13240. [Google Scholar] [CrossRef]
Yoon, H.K.; Bae, J.M.; Kim, M.J.; Lee, B.C.; Ko, S.J. Targetless Multiple Camera-LiDAR Extrinsic Calibration using Object Pose Estimation. In Proceedings of the 2021 21st International Conference on Control, Automation and Systems (ICCAS), Xi’an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 576–581. [Google Scholar]
Beltran, J.; Guindel, C.; de la Escalera, A.; Garcia, F. Automatic Extrinsic Calibration Method for LiDAR and Camera Sensor Setups. IEEE Trans. Intell. Transp. Syst. 2022, 23, 17677–17689. [Google Scholar] [CrossRef]
Ou, J.; Huang, P.; Zhou, J.; Zhao, Y.; Lin, L. Automatic Extrinsic Calibration of 3D LIDAR and Multi-Cameras Based on Graph Optimization. Sensors 2022, 22, 2221. [Google Scholar] [CrossRef]
Sen, A.; Pan, G.; Mitrokhin, A.; Islam, A. SceneCalib: Automatic Targetless Calibration of Cameras and Lidars in Autonomous Driving. arXiv 2023, arXiv:2304.05530. [Google Scholar] [CrossRef]
Liu, R.; Shi, J.; Zhang, H.; Zhang, J.; Sun, B. Causal calibration: Iteratively calibrating LiDAR and camera by considering causality and geometry. Complex Intell. Syst. 2023, 9, 7349–7363. [Google Scholar] [CrossRef]
Zeng, T.; He, D.; Yan, F.; He, M. YOCO: You Only Calibrate Once for Accurate Extrinsic Parameter in LiDAR-Camera Systems. arXiv 2024, arXiv:2407.18043. [Google Scholar] [CrossRef]
Tan, Z.; Zhang, X.; Teng, S.; Wang, L.; Gao, F. A Review of Deep Learning-Based LiDAR and Camera Extrinsic Calibration. Sensors 2024, 24, 3878. [Google Scholar] [CrossRef]
Lv, X.; Wang, B.; Ye, D.; Wang, S. LCCNet: LiDAR and Camera Self-Calibration using Cost Volume Network. arXiv 2021, arXiv:2012.13901. [Google Scholar]
Jiang, P.; Osteen, P.; Saripalli, S. SemCal: Semantic LiDAR-Camera Calibration using Neural MutualInformation Estimator. arXiv 2021, arXiv:2109.10270. [Google Scholar]
Rachman, A.; Seiler, J.; Kaup, A. End-to-End Lidar-Camera Self-Calibration for Autonomous Vehicles. arXiv 2023, arXiv:2304.12412. [Google Scholar]
Wu, S.; Hadachi, A.; Vivet, D.; Prabhakar, Y. NetCalib: A Novel Approach for LiDAR-Camera Auto-calibration Based on Deep Learning. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2020. [Google Scholar]
Open Robotics. message_filters-ROS 2 Documentation. Open Source Robotics Foundation. 2025. Available online: https://docs.ros.org/en/foxy/Tutorials/Beginner-Client-Libraries/Custom-ROS2-Interfaces.html (accessed on 18 June 2025).
Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; IEEE: Piscataway, NJ, USA, 1999; Volume 2, pp. 1150–1157. [Google Scholar]
Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded Up Robust Features. In Proceedings of the Computer Vision—ECCV 2006; Leonardis, A., Bischof, H., Pinz, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417. [Google Scholar]
Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar] [CrossRef]
Baroffio, L.; Cesana, M.; Redondi, A.; Tagliasacchi, M.; Tubaro, S. Fast keypoint detection in video sequences. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1342–1346. [Google Scholar] [CrossRef]
Calonder, M.; Lepetit, V.; Strecha, C.; Fua, P. BRIEF: Binary robust independent elementary features. In Proceedings of the 11th European Conference on Computer Vision: Part IV, Crete, Greece, 5–11 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 778–792. [Google Scholar]
Cui, L.; Ma, C. SOF-SLAM: A Semantic Visual SLAM for Dynamic Environments. IEEE Access 2019, 7, 166528–166539. [Google Scholar] [CrossRef]
Zhao, X.; Zuo, T.; Hu, X. OFM-SLAM: A Visual Semantic SLAM for Dynamic Indoor Environments. Math. Probl. Eng. 2021, 2021, 5538840. [Google Scholar] [CrossRef]
Yang, C.; Lyu, T. Research on Vision-based Semantic SLAM towards Indoor Dynamic Environment. In Proceedings of the 2022 International Conference on Computing, Communication, Perception and Quantum Technology (CCPQT), Xiamen, China, 5–7 August 2022; pp. 53–58. [Google Scholar] [CrossRef]
Liu, J.; Fan, X.; Fu, L.; Chen, Y. Visual SLAM technology based on weakly supervised semantic segmentation in dynamic environment. In Proceedings of the Seventh International Conference on Image and Graphics (ICIG 2020); SPIE: Bellingham, WA, USA, 2020; Volume 11574, p. 115740Q. [Google Scholar]
Cheng, Q.; Zeller, N.; Cremers, D. Vision-Based Large-scale 3D Semantic Mapping for Autonomous Driving Applications. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 9235–9242. [Google Scholar] [CrossRef]
Peng, J.; Xiaoqiang, L.; Wei, G.; Ming, L. A Vision Based Multi-robot Cooperative Semantic SLAM Algorithm. In Proceedings of the 2022 34th Chinese Control and Decision Conference (CCDC), Hefei, China, 15–17 August 2022; pp. 5663–5668. [Google Scholar] [CrossRef]
Esparza, D.; Flores, G. The STDyn-SLAM: A Stereo Vision and Semantic Segmentation Approach for VSLAM in Dynamic Outdoor Environments. IEEE Access 2022, 10, 18201–18209. [Google Scholar] [CrossRef]
Song, X.; Liang, X.; Zhijiang, Z.; Huaidong, Z. A Object-augmented Semantic Mapping System for Indoor Mobile Robots. In Proceedings of the 2022 IEEE 2nd International Conference on Software Engineering and Artificial Intelligence (SEAI), Xiamen, China, 10–12 June 2022; pp. 225–229. [Google Scholar] [CrossRef]
Qin, L.; Wu, C.; Chen, Z.; Kong, X.; Lv, Z.; Zhao, Z. RSO-SLAM: A Robust Semantic Visual SLAM With Optical Flow in Complex Dynamic Environments. IEEE Trans. Intell. Transp. Syst. 2024, 25, 14669–14684. [Google Scholar] [CrossRef]
Li, X.; Ao, H.; Belaroussi, R.; Gruyer, D. Fast semi-dense 3D semantic mapping with monocular visual SLAM. In Proceedings of the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan, 16–19 October 2017; pp. 385–390. [Google Scholar] [CrossRef]
Zhang, C.; Liu, Z.; Liu, G.; Huang, D. Large-Scale 3D Semantic Mapping Using Monocular Vision. In Proceedings of the 2019 IEEE 4th International Conference on Image, Vision and Computing (ICIVC), Xiamen, China, 5–7 July 2019; pp. 71–76. [Google Scholar] [CrossRef]
Hu, S.; Li, D.; Tang, G.; Xu, X. A 3D Semantic Visual SLAM in Dynamic Scenes. In Proceedings of the 2021 6th IEEE International Conference on Advanced Robotics and Mechatronics (ICARM), Chongqing, China, 3–5 July 2021; pp. 522–528. [Google Scholar] [CrossRef]
Yang, K.; Jiang, Y.; Qi, L.; Fan, H.; Zhang, S.; Dong, J. Visual Semantic SLAM Based on Examination of Moving Consistency in Dynamic Scenes. In Proceedings of the 2022 4th International Conference on Data Intelligence and Security (ICDIS), Shenzhen, China, 24–26 August 2022; pp. 275–282. [Google Scholar] [CrossRef]
Liu, M.; Zou, Q.; Long, J.; Wang, Y.; Lin, M.; Wang, F. SSE-SLAM: Semantic Visual SLAM Based on RGB-D Camera for High-Accuracy Pose Measurement in Dynamic Environments. IEEE Trans. Instrum. Meas. 2025, 74, 5047611. [Google Scholar] [CrossRef]
Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. arXiv 2017, arXiv:1612.00593. [Google Scholar] [CrossRef]
Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. arXiv 2017, arXiv:1706.02413. [Google Scholar]
Wu, B.; Wan, A.; Yue, X.; Keutzer, K. SqueezeSeg: Convolutional Neural Nets with Recurrent CRF for Real-Time Road-Object Segmentation from 3D LiDAR Point Cloud. arXiv 2017, arXiv:1710.07368. [Google Scholar]
Milioto, A.; Vizzo, I.; Behley, J.; Stachniss, C. RangeNet ++: Fast and Accurate LiDAR Semantic Segmentation. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 4213–4220. [Google Scholar] [CrossRef]
Zhang, Y.; Zhou, Z.; David, P.; Yue, X.; Xi, Z.; Gong, B.; Foroosh, H. PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation. arXiv 2020, arXiv:2003.14032. [Google Scholar]
Cortinhal, T.; Tzelepis, G.; Aksoy, E.E. SalsaNext: Fast, Uncertainty-aware Semantic Segmentation of LiDAR Point Clouds for Autonomous Driving. arXiv 2024, arXiv:2003.03653. [Google Scholar]
Lai, X.; Chen, Y.; Lu, F.; Liu, J.; Jia, J. Spherical Transformer for LiDAR-based 3D Recognition. arXiv 2023, arXiv:2303.12766. [Google Scholar]
Zhang, J.; Liu, H.; Yang, K.; Hu, X.; Liu, R.; Stiefelhagen, R. CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers. arXiv 2023, arXiv:2203.04838. [Google Scholar] [CrossRef]
Ma, X.; Zhang, X.; Pun, M.O.; Liu, M. A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 5403215. [Google Scholar] [CrossRef]
Wang, Y.; Wu, Y.; Li, D.; Yu, W. Millimeter-Wave Radar and Vision Fusion-Based Semantic Simultaneous Localization and Mapping. IEEE Antennas Wirel. Propag. Lett. 2024, 23, 3977–3981. [Google Scholar] [CrossRef]
Broedermann, T.; Sakaridis, C.; Fu, Y.; Gool, L.V. CAFuser: Condition-Aware Multimodal Fusion for Robust Semantic Perception of Driving Scenes. arXiv 2025, arXiv:2410.10791. [Google Scholar] [CrossRef]
Tan, M.; Zhuang, Z.; Chen, S.; Li, R.; Jia, K.; Wang, Q.; Li, Y. EPMF: Efficient Perception-Aware Multi-Sensor Fusion for 3D Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8258–8273. [Google Scholar] [CrossRef]
Zhao, L.; Zhou, H.; Zhu, X.; Song, X.; Li, H.; Tao, W. LIF-Seg: LiDAR and Camera Image Fusion for 3D LiDAR Semantic Segmentation. arXiv 2021, arXiv:2108.07511. [Google Scholar] [CrossRef]
Sánchez-García, F.; Montiel-Marín, S.; Antunes-Garcia, M.; Gutiérrez-Moreno, R.; Llamazares Llamazares, Á.; Bergasa, L.M. SalsaNext+: A Multimodal-Based Point Cloud Semantic Segmentation With Range and RGB Images. IEEE Access 2025, 13, 64133–64147. [Google Scholar] [CrossRef]
Lu, Z.; Cao, B.; Hu, Q. LiDAR-Camera Continuous Fusion in Voxelized Grid for Semantic Scene Completion. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 12330–12344. [Google Scholar] [CrossRef]
Li, Q.; Wang, Y.; Wang, Y.; Zhao, H. HDMapNet: An Online HD Map Construction and Evaluation Framework. arXiv 2022, arXiv:2107.06307. [Google Scholar] [CrossRef]
Sauerbeck, F.; Kulmer, D.; Pielmeier, M.; Leitenstern, M.; Weiß, C.; Betz, J. Multi-LiDAR Localization and Mapping Pipeline for Urban Autonomous Driving. arXiv 2023, arXiv:2311.01823. [Google Scholar]
Sanchez, J.; Deschaud, J.E.; Goulette, F. 3DLabelProp: Geometric-Driven Domain Generalization for LiDAR Semantic Segmentation in Autonomous Driving. arXiv 2025, arXiv:2501.14605. [Google Scholar]
Qiu, S.; Li, X.; Xue, X.; Pu, J. PC-BEV: An Efficient Polar-Cartesian BEV Fusion Framework for LiDAR Semantic Segmentation. arXiv 2024, arXiv:2412.14821. [Google Scholar] [CrossRef]
Park, J.; Kim, C.; Jo, K. PCSCNet: Fast 3D Semantic Segmentation of LiDAR Point Cloud for Autonomous Car using Point Convolution and Sparse Convolution Network. arXiv 2022, arXiv:2202.10047. [Google Scholar] [CrossRef]
Razani, R.; Cheng, R.; Taghavi, E.; Bingbing, L. Lite-HDSeg: LiDAR Semantic Segmentation Using Lite Harmonic Dense Convolutions. arXiv 2021, arXiv:2103.08852. [Google Scholar]
Nowruzi, F.E.; Kolhatkar, D.; Kapoor, P.; Heravi, E.J.; Hassanat, F.A.; Laganiere, R.; Rebut, J.; Malik, W. PolarNet: Accelerated Deep Open Space Segmentation Using Automotive Radar in Polar Domain. arXiv 2021, arXiv:2103.03387. [Google Scholar] [CrossRef]
Maturana, D.; Scherer, S. VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition. In Proceedings of the (IROS) IEEE/RSJ International Conference on Intelligent Robots and Systems, Hamburg, Germany, 28 September–2 October 2015; pp. 922–928. [Google Scholar]
Komorowski, J.; Wysoczanska, M.; Trzcinski, T. MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition. arXiv 2021, arXiv:2104.05327. [Google Scholar]
Rosu, R.A.; Schütt, P.; Quenzel, J.; Behnke, S. LatticeNet: Fast Spatio-Temporal Point Cloud Segmentation Using Permutohedral Lattices. arXiv 2021, arXiv:2108.03917. [Google Scholar] [CrossRef]
Zhou, H.; Zhu, X.; Song, X.; Ma, Y.; Wang, Z.; Li, H.; Lin, D. Cylinder3D: An Effective 3D Framework for Driving-scene LiDAR Semantic Segmentation. arXiv 2020, arXiv:2008.01550. [Google Scholar]
Shi, S.; Jiang, L.; Deng, J.; Wang, Z.; Guo, C.; Shi, J.; Wang, X.; Li, H. PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation for 3D Object Detection. Int. J. Comput. Vision 2022, 131, 531–551. [Google Scholar] [CrossRef]
Tang, H.; Liu, Z.; Zhao, S.; Lin, Y.; Lin, J.; Wang, H.; Han, S. Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution. In Computer Vision—ECCV 2020; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
Qian, G.; Li, Y.; Peng, H.; Mai, J.; Hammoud, H.A.A.K.; Elhoseiny, M.; Ghanem, B. PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies. arXiv 2022, arXiv:2206.04670. [Google Scholar]
Thomas, H.; Tsai, Y.H.H.; Barfoot, T.D.; Zhang, J. KPConvX: Modernizing Kernel Point Convolution with Kernel Attention. arXiv 2024, arXiv:2405.13194. [Google Scholar] [CrossRef]
Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds. arXiv 2020, arXiv:1911.11236. [Google Scholar]
Yan, X.; Gao, J.; Zheng, C.; Zheng, C.; Zhang, R.; Cui, S.; Li, Z. 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds. arXiv 2022, arXiv:2207.04397. [Google Scholar] [CrossRef]
Zhao, H.; Jiang, L.; Jia, J.; Torr, P.; Koltun, V. Point Transformer. arXiv 2021, arXiv:2012.09164. [Google Scholar]
Wu, X.; Lao, Y.; Jiang, L.; Liu, X.; Zhao, H. Point Transformer V2: Grouped Vector Attention and Partition-based Pooling. In Proceedings of the Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 33330–33342. [Google Scholar]
Wu, X.; Jiang, L.; Wang, P.S.; Liu, Z.; Liu, X.; Qiao, Y.; Ouyang, W.; He, T.; Zhao, H. Point Transformer V3: Simpler, Faster, Stronger. arXiv 2024, arXiv:2312.10035. [Google Scholar] [CrossRef]
Ni, P.; Li, X.; Xu, W.; Kong, D.; Hu, Y.; Wei, K. Robust 3D Semantic Segmentation Based on Multi-Phase Multi-Modal Fusion for Intelligent Vehicles. IEEE Trans. Intell. Veh. 2024, 9, 1602–1614. [Google Scholar] [CrossRef]
Zhan, Q.M.; Dong, Z.; Wang, J.X.; Zhu, L.J. Fusion of images and point clouds for the semantic segmentation of large-scale 3D scenes based on deep learning. ISPRS J. Photogramm. Remote. Sens. 2018, 143, 85–101. [Google Scholar] [CrossRef]
Lee, J.S.; Park, T.H. Fast Road Detection by CNN-Based Camera–Lidar Fusion and Spherical Coordinate Transformation. Trans. Intell. Transport. Sys. 2021, 22, 5802–5810. [Google Scholar] [CrossRef]
Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential Fusion for 3D Object Detection. arXiv 2020, arXiv:1911.10150. [Google Scholar] [CrossRef]
Xu, S.; Zhou, D.; Fang, J.; Yin, J.; Bin, Z.; Zhang, L. FusionPainting: Multimodal Fusion with Adaptive Attention for 3D Object Detection. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021. [Google Scholar]
Li, J.; Dai, H.; Han, H.; Ding, Y. MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving. arXiv 2023, arXiv:2303.08600. [Google Scholar]
Valada, A.; Mohan, R.; Burgard, W. Self-Supervised Model Adaptation for Multimodal Semantic Segmentation. Int. J. Comput. Vis. 2019, 128, 1239–1285. [Google Scholar] [CrossRef]
Schieber, H.; Duerr, F.; Schoen, T.; Beyerer, J. Deep Sensor Fusion with Pyramid Fusion Networks for 3D Semantic Segmentation. arXiv 2022, arXiv:2205.13629. [Google Scholar] [CrossRef]
Gu, S.; Lu, T.; Zhang, Y.; Alvarez, J.M.; Yang, J.; Kong, H. 3-D LiDAR + Monocular Camera: An Inverse-Depth-Induced Fusion Framework for Urban Road Detection. IEEE Trans. Intell. Veh. 2018, 3, 351–360. [Google Scholar] [CrossRef]
Jaritz, M.; Vu, T.H.; de Charette, R.; Wirbel, E.; Pérez, P. xMUDA: Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation. arXiv 2020, arXiv:1911.12676. [Google Scholar]
Park, J.; Yoo, H.; Wang, Y. Drivable Dirt Road Region Identification Using Image and Point Cloud Semantic Segmentation Fusion. IEEE Trans. Intell. Transp. Syst. 2022, 23, 13203–13216. [Google Scholar] [CrossRef]
Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR Object Candidates Fusion for 3D Object Detection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 10386–10393. [Google Scholar] [CrossRef]
Luo, Y.; Han, T.; Liu, Y.; Su, J.; Chen, Y.; Li, J.; Wu, Y.; Cai, G. CSFNet: Cross-Modal Semantic Focus Network for Semantic Segmentation of Large-Scale Point Clouds. IEEE Trans. Geosci. Remote. Sens. 2025, 63, 5701415. [Google Scholar] [CrossRef]
Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9297–9307. [Google Scholar]
Geyer, J.; Kassahun, Y.; Mahmudi, M.; Ricou, X.; Durgesh, R.; Chung, A.S.; Hauswald, L.; Pham, V.H.; Mühlegg, M.; Dorn, S.; et al. A2D2: Audi Autonomous Driving Dataset. arXiv 2020, arXiv:2004.06320. [Google Scholar] [CrossRef]
Qin, T.; Zheng, Y.; Chen, T.; Chen, Y.; Su, Q. A Light-Weight Semantic Map for Visual Localization towards Autonomous Driving. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11248–11254. [Google Scholar] [CrossRef]
Schaefer, A.; Büscher, D.; Vertens, J.; Luft, L.; Burgard, W. Long-Term Urban Vehicle Localization Using Pole Landmarks Extracted from 3-D Lidar Scans. In Proceedings of the 2019 European Conference on Mobile Robots (ECMR), Prague, Czech Republic, 4–6 September 2019; pp. 1–7. [Google Scholar] [CrossRef]
Shan, T.; Englot, B. LeGO-LOAM: Lightweight and Ground-Optimized Lidar Odometry and Mapping on Variable Terrain. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 4758–4765. [Google Scholar] [CrossRef]
Chen, S.W.; Nardari, G.V.; Lee, E.S.; Qu, C.; Liu, X.; Romero, R.A.F.; Kumar, V. SLOAM: Semantic Lidar Odometry and Mapping for Forest Inventory. arXiv 2019, arXiv:1912.12726. [Google Scholar] [CrossRef]
Yi, S.; Lyu, Y.; Hua, L.; Pan, Q.; Zhao, C. Light-LOAM: A Lightweight LiDAR Odometry and Mapping based on Graph-Matching. arXiv 2023, arXiv:2310.04162. [Google Scholar] [CrossRef]
Feng, Y.; Jiang, Z.; Shi, Y.; Feng, Y.; Chen, X.; Zhao, H.; Zhou, G. Block-Map-Based Localization in Large-Scale Environment. arXiv 2024, arXiv:2404.18192. [Google Scholar]
Koide, K.; Oishi, S.; Yokozuka, M.; Banno, A. Tightly Coupled Range Inertial Localization on a 3D Prior Map Based on Sliding Window Factor Graph Optimization. arXiv 2024, arXiv:2402.05540. [Google Scholar] [CrossRef]
Zhang, J.; Singh, S. LOAM: Lidar Odometry and Mapping in real-time. Robot. Sci. Syst. Conf. 2014, 2, 109–111. [Google Scholar]
Cao, F.; Wu, H.; Wu, C. An End-to-End Localizer for Long-Term Topological Localization in Large-Scale Changing Environments. IEEE Trans. Ind. Electron. 2023, 70, 5140–5149. [Google Scholar] [CrossRef]
Kong, D.; Li, X.; Hu, Y.; Xu, Q.; Wang, A.; Hu, W. Learning a Novel LiDAR Submap-Based Observation Model for Global Positioning in Long-Term Changing Environments. IEEE Trans. Ind. Electron. 2023, 70, 3147–3157. [Google Scholar] [CrossRef]
Blumenthal, D.B.; Gamper, J. On the exact computation of the graph edit distance. Pattern Recognit. Lett. 2020, 134, 46–57. [Google Scholar] [CrossRef]
Bai, Y.; Ding, H.; Bian, S.; Chen, T.; Sun, Y.; Wang, W. SimGNN: A Neural Network Approach to Fast Graph Similarity Computation. arXiv 2020, arXiv:1808.05689. [Google Scholar] [CrossRef]
Kong, X.; Yang, X.; Zhai, G.; Zhao, X.; Zeng, X.; Wang, M.; Liu, Y.; Li, W.; Wen, F. Semantic Graph Based Place Recognition for 3D Point Clouds. arXiv 2020, arXiv:2008.11459. [Google Scholar] [CrossRef]
Pramatarov, G.; De Martini, D.; Gadd, M.; Newman, P. BoxGraph: Semantic place recognition and pose estimation from 3D LiDAR. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 7004–7011. [Google Scholar]
Wu, S.C.; Wald, J.; Tateno, K.; Navab, N.; Tombari, F. SceneGraphFusion: Incremental 3D Scene Graph Prediction from RGB-D Sequences. arXiv 2021, arXiv:2103.14898. [Google Scholar]
Arandjelović, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. arXiv 2016, arXiv:1511.07247. [Google Scholar] [CrossRef]
Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-Net: A Trainable CNN for Joint Detection and Description of Local Features. arXiv 2019, arXiv:1905.03561. [Google Scholar] [CrossRef]
Wang, R.; Shen, Y.; Zuo, W.; Zhou, S.; Zheng, N. TransVPR: Transformer-based place recognition with multi-level attention aggregation. arXiv 2022, arXiv:2201.02001. [Google Scholar]
Ali-bey, A.; Chaib-draa, B.; Giguère, P. MixVPR: Feature Mixing for Visual Place Recognition. arXiv 2023, arXiv:2303.02190. [Google Scholar] [CrossRef]
DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-Supervised Interest Point Detection and Description. arXiv 2018, arXiv:1712.07629. [Google Scholar]
Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning Feature Matching with Graph Neural Networks. arXiv 2020, arXiv:1911.11763. [Google Scholar] [CrossRef]
Hausler, S.; Garg, S.; Xu, M.; Milford, M.; Fischer, T. Patch-NetVLAD: Multi-Scale Fusion of Locally-Global Descriptors for Place Recognition. arXiv 2021, arXiv:2103.01486. [Google Scholar]
Khaliq, A.; Milford, M.; Garg, S. MultiRes-NetVLAD: Augmenting Place Recognition Training With Low-Resolution Imagery. IEEE Robot. Autom. Lett. 2022, 7, 3882–3889. [Google Scholar] [CrossRef]
Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-Free Local Feature Matching with Transformers. arXiv 2021, arXiv:2104.00680. [Google Scholar]
Woo, S.; Kim, S.W. Context-Based Visual-Language Place Recognition. arXiv 2024, arXiv:2410.19341. [Google Scholar]
Sferrazza, D.; Berton, G.; Trivigno, G.; Masone, C. To Match or Not to Match: Revisiting Image Matching for Reliable Visual Place Recognition. arXiv 2025, arXiv:2504.06116. [Google Scholar] [CrossRef]
Truhlařík, V.; Pivoňka, T.; Kasarda, M.; Přeučil, L. Multi-Platform Teach-and-Repeat Navigation by Visual Place Recognition Based on Deep-Learned Local Features. arXiv 2025, arXiv:2503.13090. [Google Scholar]
Uy, M.A.; Lee, G.H. PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition. arXiv 2018, arXiv:1804.03492. [Google Scholar]
Zhang, W.; Xiao, C. PCAN: 3D Attention Map Learning Using Contextual Information for Point Cloud Based Retrieval. arXiv 2019, arXiv:1904.09793. [Google Scholar] [CrossRef]
Liu, Z.; Zhou, S.; Suo, C.; Liu, Y.; Yin, P.; Wang, H.; Liu, Y.H. LPD-Net: 3D Point Cloud Learning for Large-Scale Place Recognition and Environment Analysis. arXiv 2019, arXiv:1812.07050. [Google Scholar]
Du, J.; Wang, R.; Cremers, D. DH3D: Deep Hierarchical 3D Descriptors for Robust Large-Scale 6DoF Relocalization. arXiv 2020, arXiv:2007.09217. [Google Scholar]
Xia, Y.; Xu, Y.; Li, S.; Wang, R.; Du, J.; Cremers, D.; Stilla, U. SOE-Net: A Self-Attention and Orientation Encoding Network for Point Cloud based Place Recognition. arXiv 2021, arXiv:2011.12430. [Google Scholar]
Hui, L.; Yang, H.; Cheng, M.; Xie, J.; Yang, J. Pyramid Point Cloud Transformer for Large-Scale Place Recognition. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 6078–6087. [Google Scholar] [CrossRef]
Lyu, J.; Li, J.; Chen, D.; Zhang, Y.; Liu, J.; Guo, Y.; Xu, Y.; Zhu, Y. An Efficient 3D Point Cloud-Based Place Recognition Approach for Underground Tunnels Using Convolution and Self-Attention Mechanism. J. Field Robot. 2024, 42, 1537–1549. [Google Scholar] [CrossRef]
Li, L.; Kong, X.; Zhao, X.; Huang, T.; Li, W.; Wen, F.; Zhang, H.; Liu, Y. RINet: Efficient 3D Lidar-Based Place Recognition Using Rotation Invariant Neural Network. IEEE Robot. Autom. Lett. 2022, 7, 4321–4328. [Google Scholar] [CrossRef]
Vidanapathirana, K.; Ramezani, M.; Moghadam, P.; Sridharan, S.; Fookes, C. LoGG3D-Net: Locally Guided Global Descriptor Learning for 3D Place Recognition. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2215–2221. [Google Scholar] [CrossRef]
Hui, L.; Cheng, M.; Xie, J.; Yang, J.; Cheng, M.M. Efficient 3D Point Cloud Feature Learning for Large-Scale Place Recognition. IEEE Trans. Image Process. 2022, 31, 1258–1270. [Google Scholar] [CrossRef]
Komorowski, J. MinkLoc3D: Point Cloud Based Large-Scale Place Recognition. arXiv 2020, arXiv:2011.04530. [Google Scholar]
Zywanowski, K.; Banaszczyk, A.; Nowicki, M.R.; Komorowski, J. MinkLoc3D-SI: 3D LiDAR Place Recognition With Sparse Convolutions, Spherical Coordinates, and Intensity. IEEE Robot. Autom. Lett. 2022, 7, 1079–1086. [Google Scholar] [CrossRef]
Komorowski, J.; Wysoczanska, M.; Trzcinski, T. EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale. arXiv 2021, arXiv:2110.12486. [Google Scholar] [CrossRef]
Zhou, Z.; Zhao, C.; Adolfsson, D.; Su, S.; Gao, Y.; Duckett, T.; Sun, L. NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 5654–5660. [Google Scholar] [CrossRef]
Xu, T.X.; Guo, Y.C.; Li, Z.; Yu, G.; Lai, Y.K.; Zhang, S.H. TransLoc3D: Point Cloud based Large-scale Place Recognition using Adaptive Receptive Fields. arXiv 2022, arXiv:2105.11605. [Google Scholar] [CrossRef]
Cattaneo, D.; Vaghi, M.; Valada, A. LCDNet: Deep Loop Closure Detection and Point Cloud Registration for LiDAR SLAM. arXiv 2022, arXiv:2103.05056. [Google Scholar] [CrossRef]
Fan, Z.; Song, Z.; Liu, H.; Lu, Z.; He, J.; Du, X. SVT-Net: Super Light-Weight Sparse Voxel Transformer for Large Scale Place Recognition. arXiv 2021, arXiv:2105.00149. [Google Scholar] [CrossRef]
Fu, C.; Li, L.; Mei, J.; Ma, Y.; Peng, L.; Zhao, X.; Liu, Y. A Coarse-to-Fine Place Recognition Approach using Attention-guided Descriptors and Overlap Estimation. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 8493–8499. [Google Scholar]
Dubé, R.; Dugas, D.; Stumm, E.; Nieto, J.; Siegwart, R.; Cadena, C. SegMatch: Segment based place recognition in 3D point clouds. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 5266–5272. [Google Scholar]
Dubé, R.; Cramariuc, A.; Dugas, D.; Sommer, H.; Dymczyk, M.; Nieto, J.; Siegwart, R.; Cadena, C. SegMap: Segment-based mapping and localization using data-driven descriptors. Int. J. Robot. Res. 2019, 39, 339–355. [Google Scholar] [CrossRef]
Zaganidis, A.; Zerntev, A.; Duckett, T.; Cielniak, G. Semantically Assisted Loop Closure in SLAM Using NDT Histograms. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; IEEE Press: Piscataway, NJ, USA, 2019; pp. 4562–4568. [Google Scholar] [CrossRef]
Jiang, J.; Wang, J.; Wang, P.; Bao, P.; Chen, Z. LiPMatch: LiDAR Point Cloud Plane Based Loop-Closure. IEEE Robot. Autom. Lett. 2020, 5, 6861–6868. [Google Scholar] [CrossRef]
Tomono, M. Loop detection for 3D LiDAR SLAM using segment-group matching. Adv. Robot. 2020, 34, 1530–1544. [Google Scholar] [CrossRef]
Vidanapathirana, K.; Moghadam, P.; Harwood, B.; Zhao, M.; Sridharan, S.; Fookes, C. Locus: LiDAR-based Place Recognition using Spatiotemporal Higher-Order Pooling. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 5075–5081. [Google Scholar] [CrossRef]
Li, L.; Kong, X.; Zhao, X.; Huang, T.; Liu, Y. SSC: Semantic Scan Context for Large-Scale Place Recognition. arXiv 2021, arXiv:2107.00382. [Google Scholar] [CrossRef]
Fan, Y.; Yuan, H.; Zhu, S.; Zhou, G.; Du, R.; Gu, J. A Semantic-Based Loop Closure Detection of 3D Point Cloud. In Proceedings of the 2021 IEEE International Conference on Robotics and Biomimetics (ROBIO), Sanya, China, 27–31 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1184–1189. [Google Scholar]
Yin, P.; Xu, L.; Feng, Z.; Egorov, A.; Li, B. PSE-Match: A Viewpoint-free Place Recognition Method with Parallel Semantic Embedding. arXiv 2021, arXiv:2108.00552. [Google Scholar] [CrossRef]
Song, T.; He, S.; Wu, X. Semantic Assisted Loop Closure Detection for Automated Driving. In CICTP 2022; American Society of Civil Engineers: Reston, VA, USA, 2022; pp. 690–698. [Google Scholar] [CrossRef]
Yin, H.; Ding, X.; Tang, L.; Wang, Y.; Xiong, R. Efficient 3D LIDAR based loop closing using deep neural network. In Proceedings of the 2017 IEEE International Conference on Robotics and Biomimetics (ROBIO), Macau, Macao, 5–8 December 2017; pp. 481–486. [Google Scholar] [CrossRef]
Yin, H.; Tang, L.; Ding, X.; Wang, Y.; Xiong, R. LocNet: Global Localization in 3D Point Clouds for Mobile Vehicles. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 728–733. [Google Scholar] [CrossRef]
Sun, T.; Liu, M.; Ye, H.; Yeung, D.Y. Point-cloud-based place recognition using CNN feature extraction. arXiv 2018, arXiv:1810.09631. [Google Scholar] [CrossRef]
Chen, X.; Läbe, T.; Milioto, A.; Röhling, T.; Vysotska, O.; Haag, A.; Behley, J.; Stachniss, C. OverlapNet: Loop Closing for LiDAR-based SLAM. In Proceedings of Robotics: Science and Systems XVI; Robotics: Science and Systems Foundation, 2020; arXiv:2105.11344. [Google Scholar] [CrossRef]
Xu, X.; Yin, H.; Chen, Z.; Wang, Y.; Xiong, R. DiSCO: Differentiable Scan Context with Orientation. arXiv 2021, arXiv:2010.10949. [Google Scholar] [CrossRef]
Ma, J.; Zhang, J.; Xu, J.; Ai, R.; Gu, W.; Chen, X. OverlapTransformer: An Efficient and Yaw-Angle-Invariant Transformer Network for LiDAR-Based Place Recognition. IEEE Robot. Autom. Lett. 2022, 7, 6958–6965. [Google Scholar] [CrossRef]
Yin, P.; Lingyun, X.; Zhang, J.; Choset, H. FusionVLAD: A Multi-view Deep Fusion Networks for Viewpoint-free 3D Place Recognition. IEEE Robot. Autom. Lett. 2021, 6, 2304–2310. [Google Scholar] [CrossRef]
Yin, P.; Wang, F.; Egorov, A.; Hou, J.; Jia, Z.; Han, J. Fast Sequence-matching Enhanced Viewpoint-invariant 3D Place Recognition. IEEE Trans. Ind. Electron. 2021, 69, 2127–2135. [Google Scholar] [CrossRef]
Zhao, S.; Yin, P.; Yi, G.; Scherer, S. SphereVLAD++: Attention-based and Signal-enhanced Viewpoint Invariant Descriptor. arXiv 2022, arXiv:2207.02958. [Google Scholar] [CrossRef]
Lu, S.; Xu, X.; Tang, L.; Xiong, R.; Wang, Y. DeepRING: Learning Roto-translation Invariant Representation for LiDAR based Place Recognition. arXiv 2022, arXiv:2210.11029. [Google Scholar]
Luo, L.; Zheng, S.; Li, Y.; Fan, Y.; Yu, B.; Cao, S.; Shen, H. BEVPlace: Learning LiDAR-based Place Recognition using Bird’s Eye View Images. arXiv 2023, arXiv:2302.14325. [Google Scholar]
Barros, T.; Garrote, L.; Pereira, R.; Premebida, C.; Nunes, U.J. AttDLNet: Attention-Based Deep Network for 3D LiDAR Place Recognition. In Proceedings of the ROBOT2022: Fifth Iberian Robotics Conference, Zaragoza, Spain, 23–25 November 2022; Springer International Publishing: Berlin/Heidelberg, Germany, 2022; pp. 309–320. [Google Scholar] [CrossRef]
Zhu, Y.; Ma, Y.; Chen, L.; Liu, C.; Ye, M.; Li, L. GOSMatch: Graph-of-Semantics Matching for Detecting Loop Closures in 3D LiDAR data. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 5151–5157. [Google Scholar]
Gong, Y.; Sun, F.; Yuan, J.; Zhu, W.; Sun, Q. A Two-Level Framework for Place Recognition with 3D LiDAR Based on Spatial Relation Graph. Pattern Recognit. 2021, 120, 108171. [Google Scholar] [CrossRef]
Cui, J.; Huang, T.; Cai, Y.; Zhao, J.; Xiong, L.; Yu, Z. DSC: Deep Scan Context Descriptor for Large-Scale Place Recognition. arXiv 2021, arXiv:2111.13838. [Google Scholar] [CrossRef]
Yu, J.; Shen, S. SemanticLoop: Loop closure with 3D semantic graph matching. arXiv 2022, arXiv:2211.11977. [Google Scholar] [CrossRef]
Wang, S.; Kang, Q.; She, R.; Zhao, K.; Song, Y.; Tay, W.P. PRFusion: Toward Effective and Robust Multi-Modal Place Recognition with Image and Point Cloud Fusion. arXiv 2024, arXiv:2410.04939. [Google Scholar] [CrossRef]
Xu, J.; Ma, J.; Wu, Q.; Zhou, Z.; Wang, Y.; Chen, X.; Pei, L. Explicit Interaction for Fusion-Based Place Recognition. arXiv 2024, arXiv:2402.17264. [Google Scholar] [CrossRef]
Jung, M.; Fu, L.F.T.; Fallon, M.; Kim, A. ImLPR: Image-based LiDAR Place Recognition using Vision Foundation Models. arXiv 2025, arXiv:2505.18364. [Google Scholar]
Melekhin, A.; Yudin, D.; Petryashin, I.; Bezuglyj, V. MSSPlace: Multi-Sensor Place Recognition with Visual and Text Semantics. arXiv 2024, arXiv:2407.15663. [Google Scholar] [CrossRef]
Qi, Z.; Cheng, L.; Zhou, Z.; Xiong, G. LRFusionPR: A Polar BEV-Based LiDAR-Radar Fusion Network for Place Recognition. arXiv 2025, arXiv:2504.19186. [Google Scholar] [CrossRef]
Zhou, Z.; Xu, J.; Xiong, G.; Ma, J. LCPR: A Multi-Scale Attention-Based LiDAR-Camera Fusion Network for Place Recognition. IEEE Robot. Autom. Lett. 2024, 9, 1342–1349. [Google Scholar] [CrossRef]
Maddern, W.; Pascoe, G.; Linegar, C.; Newman, P. 1 Year, 1000 km: The Oxford RobotCar Dataset. Int. J. Robot. Res. 2017, 36, 3–15. [Google Scholar] [CrossRef]
Burnett, K.; Yoon, D.J.; Wu, Y.; Li, A.Z.; Zhang, H.; Lu, S.; Qian, J.; Tseng, W.K.; Lambert, A.; Leung, K.Y.; et al. Boreas: A multi-season autonomous driving dataset. Int. J. Robot. Res. 2023, 42, 33–42. [Google Scholar] [CrossRef]
Liao, Y.; Yang, L.; Behley, J.; Stachniss, C. KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3292–3310. [Google Scholar] [CrossRef]
Jeong, J.; Cho, Y.; Shin, Y.S.; Roh, H.; Kim, A. Complex urban dataset with multi-level sensors from highly diverse urban environments. Int. J. Robot. Res. 2019, 38, 642–657. [Google Scholar] [CrossRef]
Udacity. Udacity Self-Driving Car Dataset. 2016. Available online: https://github.com/udacity/self-driving-car (accessed on 7 August 2025).
Huang, X.; Wang, P.; Cheng, X.; Zhou, D.; Geng, Q.; Yang, R. The ApolloScape Open Dataset for Autonomous Driving and Its Application. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2702–2719. [Google Scholar] [CrossRef]
Sun, P.; Kretzschmar, H.; Li, X.D.; Gorban, V.; Portelas, R.; Chiang, J.H.; Chen, C.H.; Chan, C.H.; Caine, B.; Gupta, S.; et al. Scalability in Perception for Autonomous Driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 2443–2451. [Google Scholar]
Yan, Z.; Sun, L.; Krajnik, T.; Ruichek, Y. EU Long-term Dataset with Multiple Sensors for Autonomous Driving. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Virtual, 25–29 October 2020. [Google Scholar]
Kim, G.; Park, Y.S.; Cho, Y.; Jeong, J.; Kim, A. MulRan: Multimodal Range Dataset for Urban Place Recognition. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020. [Google Scholar]
Xiao, P.; Shao, Z.; Hao, S.; Zhang, Z.; Chai, X.; Jiao, J.; Li, Z.; Wu, J.; Sun, K.; Jiang, K.; et al. PandaSet: Advanced Sensor Suite Dataset for Autonomous Driving. arXiv 2021, arXiv:2112.12610. [Google Scholar] [CrossRef]
ISO 26262:2018; Road Vehicles—Functional Safety—Part 1: Vocabulary. International Organization for Standardization: Geneva, Switzerland, 2018.
Kim, G.; Kim, A. Scan Context: Egocentric Spatial Descriptor for Place Recognition Within 3D Point Cloud Map. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 4802–4809. [Google Scholar]
Kim, G.; Choi, S.; Kim, A. Scan Context++: Structural Place Recognition Robust to Rotation and Lateral Variations in Urban Environments. arXiv 2021, arXiv:2109.13494. [Google Scholar] [CrossRef]
He, L.; Wang, X.; Zhang, H. M2DP: A novel 3D point cloud descriptor and its application in loop closure detection. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016; pp. 231–237. [Google Scholar] [CrossRef]
Zhu, S.; Shah, M.; Chen, C. TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization. arXiv 2022, arXiv:2204.00097. [Google Scholar]

Figure 1. Multi-temporal observations of the same location under varying lighting and weather conditions, illustrating challenges in long-term localization and mapping. The system incorporates GNSS/GPS for global positioning and an IMU (inertial measurement unit) for motion estimation, enabling consistent data acquisition across sessions.

Figure 2. Tasks related to long-term semantic maps for autonomous vehicle localization and their corresponding sections.

Figure 3. A detailed taxonomy of multi-modal perception.

Figure 4. Extrinsic calibration methods. (a) Target-based, (b) targetless, and (c) deep learning-based pipelines.

Figure 5. A detailed taxonomy of semantic mapping.

Figure 6. A detailed taxonomy of 3D semantic segmentation.

Figure 7. 3D semantic segmentation methods. (a) Early fusion, (b) deep fusion, (c) late fusion, (d) asymmetry fusion.

Figure 8. A detailed taxonomy of long-term localization.

Figure 9. Examples of datasets. (a) Oxford RobotCar. (b) KITTI. (c) Boreas. (d) EU Long-term.

Figure 10. General architecture of an autonomous driving system, illustrating the end-to-end processing pipeline from multi-modal sensor acquisition through perception and semantic SLAM (localization and semantic mapping) to planning, decision-making, and human–machine interaction modules.

Table 1. Comparative study of semantic SLAM approaches with respect to sensing modalities, mapping strategies, and perception pipeline.

Paper	Sensor	Obstacle Type	Mapping Type	Object Detection	Data Association	Semantic Segmentation	Semantic Representation	Loop Closure	Localization	Dataset	Year
[11]	L-R	Dynamic	X	3D Detection	Cross-modal fusion	X	Object-level (3D bounding boxes)	X	X	VoD - K-Radar Adverse	2025
[12]	C-R	Static Dynamic	Metric	Dyn. Cluster ORB-SLAM3	Point-to-point Point-to-frame	DBSCAN	X	Visual Appearance	Feature Tracking	Dynamic Scenarios	2025
[13]	L-I-G-R	Static Dynamic	Metric	YOLOv8-assisted	Fusion-based; geometric matching	✓	X	Observation Resilient Weighted	IEKF	MSCRAD 4R	2025
[14]	L-C-I	Static	Geometric	YOLOv8-Seg	Object-level	Object-level	ORB-SLAM3	✓	LVI-ObjSemant.	Experiment	2025
[15]	C	Dynamic	Semantic Topological	YOLOv8	DHDP	X	ORB-SLAM2 ORB-SLAM3	✓	Landmarks	KITTI TUM	2025
[16]	C-L	Static	Semantic Topological Point cloud	PSPNet	X	PSPNet CBAM	Region-growing	X	LOAM	CityScape KITTI	2025
[17]	C-L	Dynamic Semi-static	Metric-scaled feature-based trajectory	ORB-SLAM2 Super point	SuperGlue	DeepLabv3 DBSCAN	ViL SLAM	✓	ViL SLAM	CQU	2025
[18]	C-L-I	X	Metric–semantic map	X	X	CNN	TSDF	X	Prior map-based	S. KITTI S.USL	2025
[19]	L-G-R-I	Static	Semantic HD map	Mask	Object-level	BiSeNetV2	Extracting urban features	X	Vis Mono	KAIST	2024
[20]	L-C-G-R	Static	3D semantic prior map	ORB3	Match images to point cloud	X	Point-to-plane ICP	X	ICP camera pose	KITTI Own	2023
[21]	L-C	Dynamic	3D semantic map	X	3D point to 2D pixel	DeepLabv3	Multiple frames, $[R \mid t]$	X	LeGO-LOAM	KITTI	2023
[22]	C-L	Static	2D–3D hybrid semantic map	YOLOv2	Object augmented	Inception v3-LSTM	Labeled scene with Bayesian estimation	Temporal Spatial Association	EKF	X	2023
[23]	C-E-I	Static	Vectorial Paramet. Semantic map	Sem-LSD	Object-level/DBSCAN SORT	DeepLabv3+ Sem-LSD	Semantic feature around vehicle center	X	Sliding window	nuScenes	2023
[24]	C-L-G-I	Static	Urban Structured Semantic map	ORB-SLAM2		X	ResNet	X	Graph map	KITTI	2023
[25]	C-L-I WE	Static dynamic	Voxelized 3D Semantic map	CNN	Motion correction synchronize	Enet CNN	Probabilities with super pixel	X	X	USyd	2022
[26]	C	Static	Structured Semantic map	YOLOv2	Object-level	X	Include semantic frames into a database	Factor Graph SLAM	X	TUM RGB-D	2021
[27]	R-C	Static dynamic	3D dense map	-	Geometric cue associated to object model	RGB (Mask R-CNN) and depth (bilateral filter, geometric segmentation, erode and dilate processing)		X	X	TUM RGB-D	2021
[28]	C-L	Dynamic	Classic Geometric Visual map	Based on clustering	Tracked the corners of bounding box	Ground segmentation, clustering	Dynamic mask Gaussian blur kernel	X	-	KITTI Tracking	2021
[29]	C-L	X	2D Probabilistic Semantic map	X	X	DeepLabv3+ HRNet+OCR	Occupation grid with BEV local and global map	X	X	Own Data	2020
[30]	R-C	Static	ORB-SLAM2	YOLOv3	Point–object Object–object	Object detection with depth histogram of 2D BB	X	Semantic Bundle Adjustment	X	X	2020
[31]	C-L-I	Static	Global metric semantic map	X	X	CNN	Projection of semantic label on the metric map	X	EKF	KITTI	2020
[32]	L-C-I	Static	3D accuracy Semantic map	CNN	Based on voxels and timestamp	ResNet50	Data-level fusion method	X	LOAM	KITTI	2020
[33]	R-C	Static	Object-oriented semantic map	YOLOv2	Ratio of object instance with target label in the BB	Class probability of object is assigned to each segment in the BB	InfiniTAM v3	X	X	Tested indoors	2019
[34]	R-C	Static	3D object detector	Dense SIFT	Inter-frame geometry consistency constraint	Object semantic annotation projected onto 3D point	Mask R-CNN ORB-SLAM2	ORB-SLAM2	X	ICL	2019
[7]	L-C	Static	Occupancy grid map	X	Conv–deconv fusion network	SegNet	Semantic image Occupancy grid	X	X	KITTI Own	2018
[35]	R-C	Office objects	Static	Tiny YOLO	KD structure Bayesian process	Relationship between keyframe and objects	Octomap accelerated using FLRA	X	ORB-SLAM2	TUM RGB-D	2018
[36]	S-C	Cars, trucks	Dynamic separated from static	Multi-task network cascade (MNC)	-	X	Dense map with ELA, dispNet, InfiniTAM	X	X	KITTI	2018
[37]	C	Cars, pedestrians buildings poles, signs	Dynamic	-	Match feature point in semantic images (kernels)	SegNet + Voxel CRF project pc onto image	Built from disparity map	X	Visual odometry	KITTI	2018
[38]	ML-MC	Curbs, lanes, sidewalks, ground, parking	Static	CNN	Computing height of the curb inside the ROI	ERFNet	Graph forest	Projection point cloud onto semantic images	X	X	2018
[39]	R-C	Indoors object	Static	VGG-16	Based on Bayesian approach	CNN	RGB - SF-CRF	X	X	NYUv2 Own	2018

C: camera; S-C: stereo camera; L: LiDAR; I: IMU; G: GPS/GNSS; WE: wheel encoder; R: radar. ✓: available; X: not reported.

Table 2. Comparative summary of LiDAR–camera extrinsic calibration methods.

Ref.	Method Type	Requires Target	Scene Dependency	Automation Level	Online Capability	Calibration Domain	Spatial–Temporal Modeling	Multi-Sensor Scalability	Temporal Sync.	Evaluation Metric/Accuracy
[50]	Target-based	✓	Low	Medium	X	3D–2D	X	X	X	Reprojection error
[51]	Target-based	✓	Low-Medium	Medium	X	3D–3D	X	✓	X	Plane alignment accuracy
[52]	Target-based	✓	Medium	Medium	X	3D–3D	X	X	X	Center detection error
[53]	Target-based	✓	Medium	Medium	X	3D–2D	X	X	X	Reprojection residuals
[54]	Target-based	✓	Low-Medium	Medium	X	3D–2D	X	X	X	RMSE
[55]	Target-based	✓	Medium	Medium	X	3D–3D	X	X	X	Translation/rotation error
[56]	Target-based	✓	Low-Medium	Medium	X	3D–3D	X	X	X	Reprojection error
[57]	Target-based	✓	Medium	Medium	X	3D–2D	X	X	X	Pose alignment quality
[58]	Target-based	✓	Medium	Medium	X	3D–2D	X	X	X	Calibration stability
[59]	Target-based	✓	Low-Medium	Medium	X	3D–3D	X	✓	X	Reprojection error
[60]	Target-based	✓	Low-Medium	Medium	X	3D–2D	X	X	X	Feature correspondence acc.
[61]	Target-based	✓	Low-Medium	Medium	X	3D–3D	X	X	H	Cross-correlation
[62]	Targetless	X	High	Medium	X	3D–2D	X	✓	S	Alignment error
[63]	Targetless	X	High	High	Limited	3D–3D	X	X	H	Runtime/accuracy
[64]	Targetless	X	High	High	X	3D–3D	X	X	X	Initialization error
[65]	Targetless	X	Medium	High	✓	3D–3D	✓	X	J	Reprojection error
[66]	Targetless	X	High	High	X	3D–3D	✓	X	J	RMSE
[67]	Targetless	X	Medium-High	High	X	3D–3D	✓	✓	J	Temporal drift
[68]	Targetless	X	Medium	High	✓	3D–3D	✓	✓	J	Cross-modal consistency
[69]	Targetless	X	High	High	X	3D–3D	✓	✓	S	Angular/translation error
[70]	Deep Learning	X	Medium	High	✓	3D–3D	✓	✓	J	Learning loss (pose)
[71]	Deep Learning	X	Medium	High	X	3D–3D	✓	✓	J	Pose regression error
[72]	Deep Learning	X	Medium	High	✓	3D–3D	✓	X	J	6-DoF estimation error
[73]	Deep Learning	X	Medium	High	✓	3D–3D	✓	X	J	Depth/pose consistency
[74]	Deep Learning	X	Medium	High	✓	3D–3D	✓	✓	J	Reprojection accuracy
[75]	Deep Learning	X	Medium	High	✓	3D–3D	✓	✓	J	RMSE
[76]	Deep Learning	X	Medium	High	X	3D–3D	✓	✓	J	Learning-based accuracy

✓: available; X: not supported; S: software; H: hardware; J: joint.

Table 4. Comparative summary of LiDAR- and point-cloud-based semantic mapping methods.

Ref.	Learning	Input Representation	Real Time	Accuracy/ Robustness	Efficiency	Scalability
[97]	MLP-based deep network	Raw point cloud	Limited	Moderate accuracy on simple geometries	Low due to per-point processing	Scalable to small datasets
[98]	Hierarchical deep learning	Point cloud (multi-scale grouping)	Limited	High accuracy with local–global features	Moderate	Improved large-scene scalability
[99]	CNN + recurrent CRF	Range image projection of LiDAR	✓	Reliable road–object segmentation	High (GPU-optimized)	Suitable for large driving datasets
[100]	CNN with post-processing CRF	Range image (spherical projection)	✓	High accuracy in road scenes	High (fast inference)	Robust for outdoor navigation
[101]	CNN with polar grid encoding	Polar coordinate grid	✓	Strong global consistency	Efficient GPU computation	Excellent outdoor scalability
[102]	CNN + uncertainty modeling	Spherical range image	✓	Very high robustness under noise	Real-time optimized	Generalizable to multiple datasets
[103]	Transformer-based network	Spherical projection	Near real time	State-of-the-art accuracy	Moderate (transformer-heavy)	Scalable to complex 3D environments

✓: The method enables real-time semantic mapping.

Table 5. Representative works on sensor-fusion-based semantic mapping for autonomous driving.

Ref.	Learning Model	Modalities	Real-Time Capability	Accuracy	Efficiency	Scalability
[18]	Multi-modal voxel fusion	RGB-D, LiDAR, IMU	X	High semantic–metric integration; fine-grained voxel reasoning	Computationally heavy	Suitable for small-scale 3D semantic mapping
[104]	Transformer-based RGB-X fusion (CMX)	RGB + Depth/LiDAR	Near real-time	High semantic consistency under varying illumination	Transformer-heavy; optimized with attention pruning	Scalable to large-scale 3D scenes
[105]	Hierarchical multi-modal transformer	LiDAR + multispectral imagery	X	High semantic alignment with context reasoning	Moderate	Remote-sensing and large-scale scene mapping
[108]	Perception-aware CNN fusion	LiDAR + RGB camera	✓	Robust 3D segmentation under noisy conditions	High (GPU-optimized)	Autonomous driving; outdoor mapping
[111]	Voxelized LiDAR–camera continuous fusion	LiDAR + RGB camera	✓	Reliable semantic completion and spatial accuracy	GPU-efficient	Dense 3D city-level mapping
[109]	LiF-Seg: Late fusion network	LiDAR + RGB camera	✓	Accurate object-level segmentation; complementary cues	Moderate	Scalable for dynamic outdoor scenes
[110]	Multi-modal range–RGB CNN	LiDAR + RGB	✓	High robustness under lighting and occlusion changes	Real-time optimized	Generalizable across driving datasets
[106]	Radar–vision semantic SLAM	Millimeter-wave radar/RGB camera	✓	Robust under adverse weather and poor visibility	Moderate	All-weather autonomous navigation
[107]	Condition-aware multi-modal fusion	LiDAR + camera	✓	Adaptive to environmental conditions; strong semantic consistency	Efficient	Long-term semantic perception in dynamic scenes
[113]	Multi-LiDAR fusion pipeline	Multiple LiDARs	✓	Enhanced 3D reconstruction with semantic annotation	Efficient	Large-scale urban driving
[112]	HDMapNet	LiDAR + camera	✓	Fine-grained semantic mapping and HD map generation	Real-time	City-scale mapping applications

✓: The method supports real-time semantic mapping. X: The method does not support real-time semantic mapping.

Table 6. Recent works on image-based methods for 3D semantic segmentation with LiDAR-only strategies.

Ref.	Network Type	Input Representation	Feature Extraction	Learning Strategy	Optimization	Dataset
[114]	KPConv	Point-based	Kernel point convolution	Domain generalization	Geometric domain adaptation	SemanticKITTI
[115]	BEV-based convolutional fusion	Polar + Cartesian BEV	Dual-branch CNNs	Efficiency-focused supervised	Joint spatial–semantic optimization	nuScenes
[116]	Point Conv + Sparse Conv 3D	Voxelized 3D grid	Sparse + point convolutions	Supervised learning	End-to-end supervised	SemanticKITTI
[117]	Harmonic dense convolution	Cartesian voxel grid	Harmonic dense filters	Efficiency-oriented supervised	Lightweight efficiency tuning	SemanticKITTI
[102]	2D CNN over spherical projections	Spherical projection	Residual + context layers	Supervised, uncertainty-aware	Bayesian uncertainty modeling	SemanticKITTI
[118]	2D CNN over polar image	Polar grid/range view	Standard CNN	Supervised	Basic supervised training	SemanticKITTI, nuScenes

Table 7. Recent works on voxel-based methods for 3D semantic segmentation with LiDAR-only strategies.

Ref.	Network Type	Input Representation	Feature Extraction	Learning Strategy	Optimization	Dataset
[119]	Dense 3D CNN	Voxelized occupancy grid	3D convolutional filters	Supervised	Basic CNN optimization	KITTI
[120]	Sparse 3D CNN	Sparse voxel grids	Sparse 3D convolutions	Supervised	Efficiency-focused sparse ops	KITTI
[121]	Sparse lattice CNN	Permutohedral lattice	Lattice convolutions	Supervised spatio-temporal	Spatio-temporal optimization	SemanticKITTI
[122]	Sparse 3D CNN with cylindrical partitioning	Cylindrical voxel grid	Adaptive contextual convolutions	Supervised	Joint spatial–temporal optimization	nuScenes, SemanticKITTI
[123]	Hybrid voxel + point network	Voxel + point set encoding	Voxel feature extraction + point refinement	Supervised	End-to-end optimization	KITTI, Waymo
[124]	Sparse 3D NAS-optimized CNN backbone	Voxel grid	NAS-optimized convolutional backbone	Automated (NAS)	Architecture search optimization	nuScenes, SemanticKITTI
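
The cylindrical voxel grid listed for [122] replaces the Cartesian cells of a regular voxel grid with cells indexed by radius, azimuth, and height, so that cell size grows with distance in step with LiDAR sparsity. The sketch below shows one way to compute such an index. The grid resolution and spatial bounds are assumptions chosen for illustration, and `cylindrical_voxel_index` is a hypothetical name.

```python
import math


def cylindrical_voxel_index(x, y, z, rho_max=50.0, z_min=-4.0, z_max=2.0,
                            grid=(480, 360, 32)):
    """Return (rho_bin, phi_bin, z_bin), or None if the point is out of range.

    Bounds and grid resolution are illustrative assumptions.
    """
    rho = math.hypot(x, y)     # radial distance in the ground plane
    phi = math.atan2(y, x)     # azimuth in [-pi, pi]
    if rho >= rho_max or not (z_min <= z < z_max):
        return None
    n_rho, n_phi, n_z = grid
    i = int(rho / rho_max * n_rho)
    # The modulo folds phi == pi back onto the first azimuth bin.
    j = int((phi + math.pi) / (2.0 * math.pi) * n_phi) % n_phi
    k = int((z - z_min) / (z_max - z_min) * n_z)
    return i, j, k
```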

Table 8. Recent point-based methods for LiDAR-only 3D semantic segmentation.

Ref.	Network Type	Input Representation	Feature Extraction	Learning Strategy	Optimization	Dataset
[97,98]	MLP with KNN/hierarchical grouping	Raw point sets	MLPs + hierarchical KNN aggregation	Supervised	Basic MLP optimization	—
[125]	MLP-based hierarchical networks	Raw point sets	Efficient MLP stacking + KNN neighbor search	Supervised	Improved training and scaling strategies	—
[126]	Point convolution with attention	Raw point sets	Kernel point convolution + kernel attention	Supervised	Architectural optimization	—
[127]	Point-based MLP with attention and detection-aware loss	Raw point sets	Random sampling + local feature aggregation	Supervised	Detection-aware optimization	SemanticKITTI
[128]	3D point network with knowledge distillation	Raw point sets + 2D priors	Semantic distillation from 2D to 3D	Supervised	Multi-modal knowledge distillation	—
[103]	Self-attention (transformer style)	Raw point clouds in spherical coordinates	Radial window self-attention	Supervised	Global structure modeling optimization	—
[114]	Label propagation with cloud tracking	Registered point scans	Point alignment + ICP propagation	—	Geometric optimization	—
[129]	Transformer-based	Raw point sets	Self-attention with positional encoding	Supervised	Transformer-based optimization	—
[130]	Transformer-based	Raw point sets	Efficient neighborhood attention + residual context blocks	Supervised	End-to-end optimization	—
[131]	Transformer-based	Raw point sets	Multi-scale global-local transformer fusion	Supervised	Computational optimization for robustness	—
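
Several point-based entries above rely on KNN neighbor search followed by local feature aggregation. The sketch below illustrates that pattern with a brute-force search and a channel-wise max, a simplified stand-in for the learned MLP layers of the cited networks. The names `knn_indices` and `aggregate_max` are hypothetical.

```python
import math


def knn_indices(points, query, k):
    """Indices of the k points closest to `query` (brute force, O(n log n))."""
    order = sorted(range(len(points)),
                   key=lambda i: math.dist(points[i], query))
    return order[:k]


def aggregate_max(features, indices):
    """Channel-wise max over the features of the selected neighbors."""
    dims = len(features[0])
    return [max(features[i][d] for i in indices) for d in range(dims)]
```

In practice the brute-force search would be replaced by a spatial index such as a k-d tree, since the search dominates the cost for large scans.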

Table 9. Comparative analysis of deep learning-based vision methods for place recognition.

Refs.	Method	Category	Architecture	Descriptor Role	Metric	Reported Performance	Contribution
[162]	NetVLAD	Geometric	CNN + VLAD	Global	Recall@N	≈ 80%	End-to-end global descriptor
[166,167]	SuperPoint + SuperGlue	Geometric	CNN + GNN	Local	Accuracy	>95%	Self-supervised keypoints and learned graph matching
[163]	D2-Net	Geometric	CNN	Local	Precision-Recall	60–70%	Joint detection and dense feature description
[168]	Patch-NetVLAD	Hybrid	CNN	Local+Global	Recall@N	92–97%	Patch-level fusion of local and global context
[169]	Multires-NetVLAD	Geometric	CNN Multi-scale	Global	Recall@1,@5	90–96%	Multi-resolution training for scale robustness
[164]	TransVPR	Hybrid	Vision Transformer	Global	Recall@1, mAP	94%, 0.88	Multi-level attention for context aggregation
[165]	MixVPR	Hybrid	Transformer + CNN	Global	Recall@N, mAP	96%, 0.91	Spatial–semantic feature mixing for robustness
[170]	LoFTR/SparseGNN	Geometric	Transformer GNN	Local	Accuracy	≈97%	Detector-free dense matching with high precision
[171]	CLIP-VPR	Semantic	Vision-language	Global	Recall@20	0.66	Vision–text embedding for semantic-level retrieval
[172]	Adaptive Matching	Geometric	CNN	Local	Recall@N	≈89%	Adaptive re-ranking for condition invariance
[173]	Multi-Platform VPR	Hybrid	CNN + domain adaptation	Global	Recall@1	85–90%	Cross-domain embedding for multi-robot localization
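
Most entries in Table 9 are evaluated with Recall@N: a query counts as correct if at least one of its top-N retrieved candidates is a true match. The sketch below computes this under the assumption that rankings and ground-truth positives are available as plain dictionaries. The function name and data layout are illustrative and not taken from any cited benchmark.

```python
def recall_at_n(ranked_candidates, ground_truth, n):
    """Fraction of queries with at least one true match in the top-n results.

    ranked_candidates: dict mapping query id -> list of candidate ids, best first.
    ground_truth: dict mapping query id -> set of correct candidate ids.
    """
    if not ranked_candidates:
        raise ValueError("no queries to evaluate")
    hits = 0
    for query_id, ranking in ranked_candidates.items():
        positives = ground_truth.get(query_id, set())
        if any(candidate in positives for candidate in ranking[:n]):
            hits += 1
    return hits / len(ranked_candidates)
```

Benchmarks differ in how they define a true match (for example, a distance threshold around the query pose) and in whether queries without any positive are excluded, so reported values are only comparable under the same protocol.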

Table 10. Point-based methods for LiDAR-based place recognition.

Ref.	Method	Feature	Invariance to Translation	Invariance to Rotation	Similarity	Dataset	Year
[174]	PointNet	NetVLAD	Centroid	-	L2/Cosine	Oxford RobotCar	2018
[175]	PointNet	NetVLAD	Centroid	-	L2/Cosine	Oxford RobotCar	2019
[176]	GNN, DG-CNN	NetVLAD	Learned spatial graph normalization	-	Cosine	Oxford RobotCar	2019
[177]	PointNet + FlexConv	NetVLAD, SE block	Centroid	-	Cosine	Oxford RobotCar, ETH	2020
[178]	PointOE	NetVLAD	Centroid	x	Cosine	Oxford RobotCar	2021
[179]	Pyramid Point Transformer	Pyramid VLAD	Centroid	Transformer equivariant layers	L2/Cosine	Oxford RobotCar	2021
[183]	PPCNN	G-VLAD	Centroid	PCA-like learned	Cosine	KITTI	2022
[180]	Feature Point Extractor	Point Transformer	Learned normalization	Equivariant transformer	Cosine	KITTI, KITTI-360	2022
[181]	Rotation Equivariant Encoder	Siamese CNN	Learned	Strong rotation equivariance	Cosine	KITTI, MulRan	2022
[182]	U-Net	ePN, 2nd-order pooling	Centroid	High-order rotational invariance	L2	Oxford RobotCar	2022
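
Two recurring entries in Table 10 are centroid-based translation invariance and L2/cosine descriptor similarity. The sketch below shows both in isolation, assuming point clouds and descriptors are given as plain Python sequences. The helper names are hypothetical.

```python
import math


def center_on_centroid(points):
    """Subtract the centroid so the cloud no longer depends on sensor position."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    return [(p[0] - cx, p[1] - cy, p[2] - cz) for p in points]


def l2_distance(a, b):
    """Euclidean distance between two descriptors (lower means more similar)."""
    return math.dist(a, b)


def cosine_similarity(a, b):
    """Cosine of the angle between two descriptors (higher means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

For L2-normalized descriptors the two measures produce the same ranking, which is consistent with several rows listing them interchangeably.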

Table 11. Voxel-based methods for LiDAR-based place recognition.

Ref.	Method	Feature	Similarity	Dataset	Year
[184]	MinkLoc3D	Sparse 3D convolutions, FPN, GeM	L2 distance between global embeddings	Oxford RobotCar	2021
[185]	MinkLoc3D-SI	Sparse convolutions–spherical coordinates–intensity; FPN; GeM	Triplet loss using L2 distance between descriptors (explicit)	Usyd Campus, Oxford RobotCar, KITTI	2021
[186]	EgoNN	FPN backbone; ECA attention	Uses global descriptor retrieval	MulRan, Apollo-SouthBay, KITTI	2022
[187]	NDT-Transformer	Voxel-level statistical representation (Normal Distribution Transform); Transformer encoder	Uses descriptor retrieval	Oxford RobotCar	2021
[188]	TransLoc3D	NDT representation; Adaptive Receptive Field Module (ARFM); Transformer; NetVLAD	Triplet margin loss with L2 distance between descriptors	Oxford RobotCar	2021
[189]	LCDNet	PV-RCNN feature backbone; NetVLAD for global descriptor	L2/Cosine	KITTI, KITTI-360, Freiburg	2022
[190]	SVT-Net	Sparse Voxel Transformer (SVT); captures local + long-range structure	Uses descriptor retrieval	Oxford RobotCar	2022
[191]	OverlapNetVLAD	BEV-based feature extractor (BEVNet) + NetVLAD	NetVLAD descriptors typically compared via L2	KITTI	2023
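
Several voxel-based methods in Table 11 are trained with a triplet margin loss over the L2 distance between global descriptors. A minimal sketch of that loss for a single triplet follows. The margin value is an assumption for illustration, and batching and hard-negative mining are omitted.

```python
import math


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin) with L2 distance.

    The default margin is an illustrative assumption.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets that violate this ordering contribute gradients.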

Table 12. Segment-based methods for LiDAR-based place recognition.

Ref.	Method	Feature Strategy	Similarity/Matching	Dataset	Year
[192]	SegMatch	Segment-based geometric features, PCA descriptors, 3D segment clustering	Random forest classification, Euclidean distance; geometric consistency checks	KITTI	2017
[193]	SegMap	Learned 3D segment descriptors using a CNN autoencoder	L2 distance between learned descriptors; nearest-neighbor retrieval	KITTI	2018
[194]	Zaganidis et al.	Keypoint extraction with local shape descriptors	Nearest-neighbor keypoint matching using Euclidean distance; RANSAC alignment	KITTI	2019
[195]	LiPMatch	LiDAR keypoints, multi-scale pyramid feature encoding; handcrafted structural descriptors	Descriptor NN matching; geometric verification for loop closure	KITTI	2020
[196]	Tomono	NDT-based local map representation; Gaussian distribution parameters per cell	Likelihood-based NDT matching; optimization of alignment score	Tc0915, Cit1015, Kitti0027	2020
[197]	Locus	Segment-level features combining geometry + intensity cues	L2 distance for segment descriptors; geometric consistency validation	KITTI	2021
[9]	SA-LOAM	Edge and planar features (LOAM style); spatial-appearance cues	Feature correspondence + ICP-based geometric error minimization	KITTI	2021
[198]	SSC	PCA-based shape descriptors from segmented point clouds	L2 or Mahalanobis distance between PCA descriptors; voting scheme for matches	KITTI	2021
[199]	PCA-SSC	Improved PCA descriptors with stronger geometric invariance	L2 distance + consistency scoring to refine correspondences	KITTI, SemanticKITTI	2021
[200]	PSE-Match	Probabilistic Shape Encoding (mixture-model representation)	Probability-based similarity measure between PSE descriptors	KITTI, NCLT, CMU, Pittsburgh	2022
[201]	Song et al.	Hybrid descriptor combining geometry, intensity, and local topology	L2 or cosine distance between descriptors; global verification	KITTI	2022
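
Many segment-based methods in Table 12 follow descriptor matching with a geometric consistency check. One simple form of that check keeps a correspondence only if the distances it spans to other matched segments agree between the two scans. The sketch below implements this pairwise test as a simplified stand-in for the verification steps of the cited works. The tolerance and support threshold are assumed values, and `consistent_matches` is a hypothetical name.

```python
import math


def consistent_matches(src, dst, tol=0.5, min_support=2):
    """Indices of correspondences src[i] <-> dst[i] supported by other matches.

    Two correspondences i and j agree if the distance between the segment
    centroids is preserved: | |src_i - src_j| - |dst_i - dst_j| | <= tol.
    tol and min_support are illustrative assumptions.
    """
    n = len(src)
    keep = []
    for i in range(n):
        support = 0
        for j in range(n):
            if i == j:
                continue
            d_src = math.dist(src[i], src[j])
            d_dst = math.dist(dst[i], dst[j])
            if abs(d_src - d_dst) <= tol:
                support += 1
        if support >= min_support:
            keep.append(i)
    return keep
```

Because rigid motion preserves pairwise distances, correct matches support each other while outliers rarely do.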

Table 13. Projection-based methods for LiDAR-based place recognition.

Ref.	Method	Feature Strategy	Similarity/Matching	Dataset	Year
[202]	Yin et al.	Spherical projection of LiDAR scans; handcrafted geometric + intensity features	Correlation-based matching; similarity score from projected image alignment	KITTI	2017
[203]	LocNet	Range image projection; CNN-based learned global descriptor	L2 distance between global descriptors	KITTI	2018
[204]	Sun et al.	Spherical projection + CNN encoder; multi-channel range-intensity features	L2 distance; nearest-neighbor retrieval	KITTI	2019
[205]	OverlapNet	Range projection with multi-modal channels	Predicted overlap score + yaw estimation; similarity via overlap probability	KITTI, Ford Campus	2020
[206]	DiSCO	Dense spherical projection; self-supervised contrastive feature learning	Cosine similarity between learned descriptors	Oxford RobotCar, NCLT, MulRan	2021
[207]	Overlap Transformer	Transformer encoder applied to range-image feature maps; multi-modal projected features	Overlap prediction + yaw regression; similarity via predicted overlap	KITTI, Ford Campus	2022
[208]	FusionVLAD	Spherical projection + CNN features + NetVLAD aggregation	VLAD vector similarity using L2 distance	KITTI, NCLT, Campus City	2021
[213]	AttDLNet	Attention-based CNN encoder on projection images; multi-scale feature extraction	L2 distance between descriptors; attention-weighted retrieval	KITTI	2022
[209]	SphereVLAD	Spherical projection + deep CNN features + VLAD aggregation	VLAD descriptor matching via L2/cosine distance	KITTI, Campus City	2022
[210]	SphereVLAD++	Improved SphereVLAD with multi-scale projections and enhanced CNN backbone	VLAD descriptor L2 distance; weighted similarity scoring	KITTI, Pittsburgh	2022
[211]	DeepRING	Circular (ring-based) LiDAR projection; CNN encoder for rotation-equivariant features	Correlation-based similarity using circular shift alignment	NCLT, MulRan	2022
[212]	BEVPlace	Bird’s-eye-view (BEV) projection; CNN encoder with BEV-specific geometric priors	L2 distance between BEV global descriptors	KITTI, Oxford RobotCar	2023
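
Projection-based descriptors that are indexed by azimuth turn a yaw rotation of the vehicle into a circular shift of the descriptor. This is the property behind the circular-shift alignment listed for [211]. The sketch below performs an exhaustive search over shifts on a one-dimensional descriptor. It is an illustration of the principle under that simplifying assumption, not the cited method, and `best_circular_shift` is a hypothetical name.

```python
def best_circular_shift(desc_a, desc_b):
    """Return (shift, score) maximizing the correlation of desc_a with
    desc_b circularly shifted by `shift` azimuth bins."""
    n = len(desc_a)
    if n == 0 or n != len(desc_b):
        raise ValueError("descriptors must be non-empty and of equal length")
    best_shift, best_score = 0, float("-inf")
    for s in range(n):
        score = sum(desc_a[i] * desc_b[(i + s) % n] for i in range(n))
        if score > best_score:
            best_shift, best_score = s, score
    return best_shift, best_score
```

The best shift multiplied by 360/n gives a coarse yaw estimate in degrees, and the score at that shift serves as a rotation-invariant similarity. For long descriptors the same correlation is usually computed with an FFT.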

Table 14. Graph-based methods for LiDAR-based place recognition.

Ref.	Method	Feature Strategy (Representation)	Similarity/Matching	Dataset	Year
[159]	SGPR	Builds semantic graphs from segmented point clouds	Graph matching using structural similarity metrics	KITTI	2020
[214]	GOS Match	Graph-of-semantics representation where each node = semantic instance and edges = geometric/topological relations	Graph similarity scoring with node–edge feature alignment + geometric verification	KITTI	2020
[215]	Gong et al.	Two-tier representation: local spatial relation graph + global spatial topology; semantic and geometric cues	Coarse-to-fine graph matching; spatial relation similarity + refined structural alignment	KITTI, Hannover, Self-Built Campus	2021
[160]	BoxGraph	Semantic bounding-box graph generated from object detectors; boxes as nodes, box relations as edges	Graph matching with similarity metrics over box attributes + spatial consistency	KITTI	2022
[216]	Deep Scan Context	Learned high-dimensional scan context descriptor; pseudo-cylindrical projection + neural embedding	Descriptor matching via cosine/L2 distance + rotation alignment	KITTI	2022
[158]	SimGNN	Graph embeddings learned with GNN layers; captures node distributions and global graph structure	Neural graph similarity prediction (end-to-end), including attention-based similarity	Custom	2020
[217]	Semantic Loop	Builds instance-level 3D semantic graphs; node features include object class + geometry	Graph matching + spectral features + RANSAC-based geometric verification	TUM RGBD − COCO	2022
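
The graph-based methods in Table 14 share a common representation: semantic instances become nodes and spatial relations become edges. The sketch below builds such a graph from labeled centroids and compares two graphs through a histogram of edge label pairs. This hand-crafted comparison is a deliberately simple stand-in for the learned or structural graph matching of the cited works. The connection radius and all function names are assumptions.

```python
import math
from collections import Counter


def build_semantic_graph(instances, radius=10.0):
    """instances: list of (label, (x, y, z)). Returns edges as index pairs
    for instances whose centroids lie within `radius` (an assumed value)."""
    edges = []
    for i in range(len(instances)):
        for j in range(i + 1, len(instances)):
            if math.dist(instances[i][1], instances[j][1]) <= radius:
                edges.append((i, j))
    return edges


def edge_label_histogram(instances, edges):
    """Count edges by the unordered pair of semantic labels they connect."""
    return Counter(
        tuple(sorted((instances[i][0], instances[j][0]))) for i, j in edges
    )


def histogram_similarity(h1, h2):
    """Histogram intersection normalized by the larger edge count, in [0, 1]."""
    total = max(sum(h1.values()), sum(h2.values()))
    if total == 0:
        return 0.0
    return sum(min(h1[key], h2[key]) for key in h1) / total
```

Since the histogram depends only on labels and relative distances, it is unaffected by the vehicle pose and by the order in which instances are detected.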

Table 15. Comparison of state-of-the-art fusion-based place recognition approaches across multi-modal sensor configurations.

Ref.	Modalities	Fusion Level	Fusion Type	Feature Representation	Learning Objective	Robustness	Remarks
[218]	L-C	Global and local	Attention + manifold	Global–local joint features	Cross-modal alignment	Viewpoint, illumination	High accuracy, high cost
[219]	L-C	Cross-modal	Explicit interaction	Shared latent embedding	Contrastive consistency	Domain generalization	Needs balanced modalities
[220]	L-C	Feature transfer	LiDAR-image conversion	Vision-based descriptors	Knowledge transfer	Geometry–appearance bridge	Depends on projection quality
[221]	L-MC	Late fusion	Descriptor + score fusion	Multi-modal high-level features	Similarity fusion	Appearance robustness	Limited fine-grained fusion
[222]	L-R	Mid fusion	BEV + cross-attention	Spatial BEV maps	Feature distillation	Radar sparsity handling	Low semantic detail
[223]	L-C	Multi-scale	Attention pyramids	Hierarchical features	Joint scale invariance	Scale, viewpoint adaptation	High memory demand

L: LiDAR, C: camera, MC: multi-camera and R: radar.
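
Late fusion, as listed for [221], combines the modalities only at the level of similarity scores. The sketch below illustrates one common form: per-modality scores for the same candidate list are min-max normalized and then averaged with a weight. The weighting scheme and function names are assumptions for illustration, not the procedure of the cited work.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def late_fusion(lidar_scores, camera_scores, w_lidar=0.5):
    """Weighted sum of normalized per-candidate similarity scores."""
    if len(lidar_scores) != len(camera_scores):
        raise ValueError("both modalities must score the same candidates")
    lidar = min_max_normalize(lidar_scores)
    camera = min_max_normalize(camera_scores)
    return [w_lidar * a + (1.0 - w_lidar) * b for a, b in zip(lidar, camera)]


def best_candidate(fused_scores):
    """Index of the candidate with the highest fused score."""
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)
```

Because the modalities interact only through one scalar per candidate, this design is cheap and modular, which matches the "limited fine-grained fusion" remark in the table.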

Table 16. Quantitative analysis of the performance of different approaches for long-term localization based on place recognition.

Ref.	Modality	Accuracy	Robustness to Environmental Changes	Robustness to Dynamic Scenes	Computational Efficiency	Recall@1	Strengths	Weaknesses
[174]	LiDAR	High	High	Medium	Medium	80.31 Oxford	Learns discriminative global descriptors from raw point clouds; robust to viewpoint variations; scalable to large environments	Limited modeling of local geometric structure; performance depends on training data; moderate computational cost
[170]	Vision	High	Medium	Low	Medium	95	Dense feature matching without keypoint detection; robust to textureless regions; effective for precise visual correspondence	Computationally expensive; high memory consumption; limited scalability for real-time large-scale localization
[182]	LiDAR	High	High	Medium-high	Medium	93 KITTI	Combines local and global features for discriminative descriptors; robust to viewpoint variations; effective for large-scale place recognition	Requires GPU for real-time inference; performance depends on LiDAR density; computational cost higher than lightweight descriptors
[189]	LiDAR	High	High	Medium-high	Medium	96 KITTI	Joint framework for loop closure detection and point cloud registration; improves global consistency in SLAM; robust descriptors for LiDAR scans	Higher computational cost due to joint descriptor and registration estimation; performance depends on LiDAR scan quality
[184]	LiDAR	Very high	High	High	Medium	98.5 Oxford	Robust LiDAR-based global descriptors; efficient sparse convolution processing; good scalability for large environments	Requires high-quality LiDAR data; performance degrades with partial scans or heavy occlusions
[205]	LiDAR	High	High	Medium	Medium	88.0 KITTI	Effective for loop closure detection; robust to viewpoint changes; integrates geometric overlap estimation	Requires precomputed range images; computational cost increases for large-scale map databases
[149]	LiDAR	High	High	Medium	High	88 KITTI	High geometric accuracy; real-time LiDAR odometry; widely used baseline for LiDAR SLAM	Limited semantic understanding; sensitive to dynamic objects; loop closure requires additional modules
[212]	Multi-modal	Very high	Very high	High	Medium	98.0 KITTI	Robust to viewpoint changes using BEV representation; effective for large-scale urban scenes	Requires accurate projection to BEV; performance depends on sensor calibration
[216]	LiDAR	High	High	Medium	Medium	83.0 KITTI	Learns discriminative LiDAR descriptors; robust to geometric variations	High computational requirements; limited robustness to extreme environmental changes
[160]	Multi-modal semantic	Very high	Very high	Very high	Medium-low	87.0 KITTI	Uses semantic object relationships; robust to appearance changes; suitable for long-term localization	Depends on reliable object detection; graph matching may become expensive in large environments

Table 17. Comparison of datasets used in autonomous vehicles.

Dataset	Sensors	Synchronization	Ground Truth	Location	Weather	Time	Year
KITTI [48], SemanticKITTI [145], KITTI-360 [226]	1 × 64-layer LiDAR, 2 × grayscale camera, 2 × color camera, 1 × GPS-RTK/IMU	Software and hardware (reed contact)	Scene flow, odometry, object detection-tracking, road-lane	Germany	clear	day, autumn	2013, 2019, 2022
KAIST [227]	2 × 16-layer LiDAR, 2 × 1-layer LiDAR, 2 × monocular camera, 1 × GPS (consumer level), 1 × GPS-RTK, 1 × fiber optics gyro, 1 × independent IMU, 2 × wheel encoder, 1 × altimeter	Software (ROS timestamp) and hardware (PPS for the two Velodynes, an external trigger for the two monocular cameras to get stereo)	SLAM algorithm for vehicle self-localization	South Korea	clear	day	2015
Oxford [224]	1 × 4-layer LiDAR, 2 × 1-layer LiDAR, 1 × stereo camera, 3 × fisheye camera, 1 × GPS-RTK/INS	Software	GPS-RTK/INS for vehicle self-localization	UK	sun, clouds, overcast, rain, snow	day, dusk, night, four seasons	2017
Udacity [228]	1 × 64-layer LiDAR, 3 × RGB cameras, 1 × Radar ARS-408, 1 × GPS/IMU	Software (ROS timestamp)	GPS/IMU for vehicle self-localization	USA	sunny, cloudy	day	2017
ApolloScape [229]	2 × 1-layer LiDAR, 6 × monocular camera, 1 × GPS-RTK/IMU	Unknown	Scene parsing, car instance, lane segmentation, detection-tracking	China	unknown	day	2018
Waymo [230]	5 × LiDAR, 5 × camera	Strategy not specified, but reported to be very well synchronized	Object detection-tracking	US	sun, rain	day, night	2019
EU long-term [231]	2 × 32-layer LiDAR, 1 × 4-layer LiDAR, 1 × 1-layer LiDAR, 2 × stereo camera, 1 × radar, 1 × GPS-RTK, 1 × independent IMU	Software (ROS timestamp) and hardware (PPS for the two Velodynes)	GPS-RTK/IMU for vehicle self-localization	France	sun, clouds, snow	day, dusk, night, three seasons (spring, summer, winter)	2020
nuScenes [45]	1 × 32-layer LiDAR, 6 × monocular camera, 1 × radar, 1 × GPS-RTK, 1 × independent IMU	Software	HD map-based localization, object detection-tracking	US, Singapore	sun, clouds, rain	day, night	2020
MulRan [232]	1 × 64-layer LiDAR, 1 × radar	Hardware synchronization (LiDAR–INS)	Global pose (GPS/INS), LiDAR odometry	South Korea	clear, overcast, varying	day, multiple sessions	2020
A2D2 [146]	5 × 16-layer LiDAR, 6 × cameras, 1 × GPS/IMU	Hardware	Object detection-tracking	Germany	sunny, rainy, cloudy	day	2020
PandaSet [233]	1 × 64-layer LiDAR, 1 × 150-layer LiDAR, 5 × wide-angle cameras, 1 × long-focus camera, 1 × GNSS/IMU	Software and hardware	Object detection–classification	US	sun	day, night	2021
Boreas [225]	1 × 128-layer LiDAR, 1 × monocular camera, 1 × Applanix GNSS, 1 × 360-degree radar	Hardware; timestamp-based LiDAR synchronization using UTC time	GPS for vehicle self-localization	Canada	sun, clouds, snow, rain	day, night, overcast	2023


