1. Introduction
Autonomous driving (AD) systems rely on robust perception [1] and decision-making modules to operate safely in dynamic and unpredictable road environments. Their ability to perceive surrounding objects, interpret traffic situations, and react in time is essential for preventing accidents and ensuring reliable vehicle operation. In recent years, deep learning-based perception models have achieved significant progress in object detection [2], semantic segmentation, and scene understanding. In particular, camera-based detectors are capable of recognizing common traffic participants such as vehicles, pedestrians, cyclists, traffic signs, and traffic lights with high accuracy under standard driving conditions. However, despite these advances, autonomous systems still face substantial limitations when encountering rare, unexpected, or safety-critical situations [3,4].
Such situations are commonly referred to as rare traffic situations, out-of-distribution situations, or corner-case scenarios. In this paper, the term rare traffic situation denotes the general safety-critical phenomenon, while corner-case scenario denotes a simulated test configuration. Rare traffic situations include events and objects that deviate from normal traffic patterns, such as fallen trees, vehicles moving in the opposite direction, children suddenly entering the road, emergency vehicles, misplaced objects on the road, or other atypical obstacles. Although these events occur less frequently than standard traffic situations, their impact on road safety can be significant. A reliable AD system must therefore be able not only to handle common driving scenarios, but also to detect and react appropriately to rare traffic situations that may represent an immediate risk. This requirement remains challenging because corner-case situations are usually insufficiently represented in common training datasets, which limits the generalization capability of data-driven perception models [5,6].
A key limitation of purely camera-based perception is its dependence on predefined object classes and visual conditions. Standard object detectors, such as models trained on COCO-like datasets [7], can reliably recognize objects that belong to known categories. However, they may fail to semantically identify objects that are not included in their training classes, even if these objects are physically relevant for safe navigation. For example, a fallen tree, a ball on the road, or an unusual obstacle may not be assigned a correct semantic label by the detector, although it should still influence the vehicle’s behavior. Moreover, camera-based perception can be affected by illumination changes, weather conditions, occlusion, motion blur, and viewpoint variations. These limitations show that relying on a single sensor modality may be insufficient in safety-critical traffic scenarios.
For this reason, additional sensor modalities are required to improve the robustness of autonomous perception. LiDAR [8] represents an important complementary sensor because it provides accurate spatial and distance information about the surrounding environment. While RGB cameras provide rich semantic and visual context, LiDAR [9,10] can detect the physical presence and position of objects independently of their visual appearance or semantic category. This is particularly useful for detecting obstacles that do not belong to standard object classes. By projecting LiDAR data into a bird’s-eye-view (BEV) representation [11], the system can reason about the spatial distribution of objects around the ego vehicle and identify potential obstacles in the driving area. Therefore, the fusion of RGB camera data and LiDAR data can improve perception robustness, especially in situations where one modality alone is insufficient [1,12].
Testing safety-critical traffic situations in the real world is difficult, expensive, and potentially dangerous. Rare traffic situations are not easy to reproduce under controlled conditions, and collecting sufficient real-world data for each corner-case type would require substantial effort. Simulation environments provide a practical alternative by enabling repeatable and controllable evaluation of AD systems. In particular, CARLA (Car Learning to Act) [13] provides configurable maps, traffic actors, sensors, weather, lighting, and vehicle-control interfaces, which makes it suitable for testing corner-case scenarios that would be difficult or unsafe to reproduce in real traffic. Beyond its established use in the AD research community, CARLA is also relevant to broader simulation and synthetic-data workflows for AD research, including workflows that may be combined with industrial tools for data generation, scene reconstruction, and closed-loop experimentation. In this work, CARLA is not interpreted as a substitute for real-road validation. It is used as a reproducible and controllable testbed for studying whether lightweight RGB-LiDAR fusion can provide additional safety evidence for rare traffic situations.
This paper addresses the problem of detecting and reacting to rare traffic situations in a simulated AD environment. The proposed system combines RGB-based object detection [2], LiDAR-based BEV clustering, and a rule-based safety agent with hysteresis. The RGB camera is used to provide semantic information through the YOLOv8 detector [2], while LiDAR data are processed to identify spatial clusters corresponding to physical obstacles. The decision module combines these sources of information and determines the appropriate vehicle behavior in safety-critical situations. To improve the stability of decisions, hysteresis is incorporated into the rule-based logic, reducing unnecessary oscillations between different driving actions.
The system is implemented in the CARLA simulator and includes an automatic emergency braking (AEB) override layer above the simulator autopilot. This layer enables the vehicle to react immediately when a critical obstacle is detected in front of the ego vehicle. The proposed approach is evaluated on multiple CARLA maps and several corner-case scenarios leveraging the CARLA Corner Case Simulation (3CSim) framework [14], including situations involving unusual obstacles, vulnerable road users, wrong-way vehicles, and priority vehicles. The main objective is to evaluate whether RGB-LiDAR fusion can improve the detection of objects outside standard detector classes and support reliable agent reactions in safety-critical traffic scenarios.
The main contributions of this paper are summarized as follows:
A real-time RGB-LiDAR perception pipeline for detecting safety-relevant objects and geometric obstacle candidates in CARLA-based AD simulations.
A decision-level fusion strategy that combines camera-based semantic detections with LiDAR-based spatial clustering and explicitly separates semantic–geometric detections, semantic-only detections, and geometric-only obstacle candidates. The novelty is not the existence of unmatched clusters alone, but their integration into a conservative threat-estimation and safety-supervision logic with corridor filtering, distance/TTC reasoning, and hysteresis.
A deterministic safety agent that maps fused perception outputs to interpretable control states and applies an AEB override above the CARLA autopilot. The agent uses distance thresholds, a constant-velocity TTC rule, and hysteresis to reduce unstable transitions and unnecessary braking.
A reproducible evaluation over 13 simulation runs, 19,253 frames, three CARLA maps, and 3CSim-inspired corner-case configurations, complemented by weather/lighting stress tests and a nuScenes mini cross-platform check.
A component ablation and failure-mode analysis showing the complementary contribution of the RGB branch, LiDAR branch, fusion layer, hysteresis, and AEB override, together with an explicit discussion of simulation-only validation and the remaining sim-to-real limitations.
2. Related Work
This section positions the proposed framework with respect to five related research directions: corner-case and simulation-based validation, open-set and anomaly-aware perception, RGB-LiDAR and BEV fusion, robustness under adverse visual conditions, and interpretable safety decision-making. The goal is not only to summarize previous work, but also to clarify the role of the proposed late-fusion safety agent compared with more complex perception architectures.
2.1. Corner Cases, Simulation-Based Validation, and the Sim-to-Real Gap
Rare traffic situations in AD are often described as corner cases, out-of-distribution situations, or long-tail events. They include unusual objects, unexpected behavior of traffic participants, and atypical spatial configurations that are poorly represented in standard training data [3,4,5]. Because these events are safety-critical but infrequent, real-world collection is expensive, difficult to repeat, and may be unsafe. Simulation environments therefore play an important role in early validation because they allow controlled generation of rare traffic situations and repeatable testing of perception and decision-making pipelines. CARLA provides configurable maps, traffic actors, sensors, weather, and vehicle-control interfaces [13]. However, simulation introduces a sim-to-real gap caused by simplified sensor noise, synthetic textures, rendering differences, traffic-behavior assumptions, and calibration mismatch. The present work therefore treats CARLA as a reproducible pre-validation environment and complements it with a limited public-dataset check on nuScenes mini [15]; real-road deployment validation remains future work.
2.2. Geometry-Based Safety Monitoring for Unknown Objects
Modern autonomous perception must address objects and events that are not represented by the detector’s training classes. Open World Object Detection (OWOD), Unknown Object Detection (UOD), Open Vocabulary Detection (OVD), Out-of-Distribution (OOD) detection, and anomaly-detection methods aim to recognize or flag unknown objects rather than forcing every object into a closed label set [16,17,18,19]. These methods are powerful but often require additional training data, specialized confidence calibration, large open-vocabulary models, or image-level anomaly scoring. The proposed framework follows a different objective: it does not try to semantically name every unknown object. Instead, it uses LiDAR geometry to conservatively flag physical obstacle candidates that are not supported by the camera detector and then evaluates them through a transparent safety layer. This makes the method closer to an interpretable safety monitor than to an open-set detector.
2.3. RGB-LiDAR Fusion and BEV-Centric Perception
Camera and LiDAR sensors provide complementary information. RGB images contain rich semantic and appearance cues, whereas LiDAR provides metric depth and geometric evidence that is less dependent on visual texture, illumination, and object category. Fusion methods can be broadly organized into early fusion, feature-level fusion, BEV-level fusion, and decision-level fusion [1,11,12,20]. Recent BEV-centric and transformer-based methods, such as BEVFormer and PETR, project multi-view features or point-aware representations into a shared spatial space to support 3D detection and scene understanding [21]. Such methods are more expressive than the lightweight pipeline used here, but they also require training, calibration, synchronization, and GPU resources. In contrast, the proposed decision-level fusion uses pretrained YOLOv8n detections and LiDAR clusters as modular outputs. This design sacrifices the representational power of feature-level fusion but improves interpretability, CPU feasibility, and ease of failure analysis.
2.4. Robust Visual Perception Under Adverse Conditions
Visual perception can be degraded by small objects, low contrast, low illumination, rain, fog, glare, backlighting, and motion blur. Recent detector modifications, including traffic-sign-oriented YOLO variants such as TSD-YOLO and illumination-robust feature-enhancement networks such as IHENet, aim to improve object detection under challenging visual conditions [22,23]. These works are complementary to the present framework. A stronger detector could be substituted for YOLOv8n in the RGB branch, while the LiDAR branch and safety agent would remain unchanged. The experiments in this manuscript therefore evaluate the safety value of adding geometric evidence to a compact detector rather than claiming state-of-the-art visual recognition.
2.5. Safety Decision-Making and AEB Supervision
Safety decision-making in AD is commonly addressed through rule-based, learning-based, or hybrid approaches [24,25]. Rule-based methods use explicit thresholds, logical conditions, and safety constraints to determine the vehicle response [26,27]. Their main advantage is that each decision can be directly interpreted, which is important for safety-critical systems. Learning-based and vision–language approaches can support higher-level scene interpretation, but they typically require large and diverse datasets and may be difficult to inspect in rare safety-critical situations [28,29,30,31]. The proposed approach belongs to the hybrid direction: object detection and LiDAR clustering provide perception evidence, while the final safety decision is made by a deterministic rule-based agent with distance/TTC checks, hysteresis, and an AEB override.
3. Materials and Methods
This section presents the proposed framework for detecting rare and safety-critical traffic situations in simulated AD. The framework combines RGB-based object detection, LiDAR-based geometric perception, decision-level sensor fusion, and an interpretable rule-based safety agent. The main objective is to detect common road users recognized by a pretrained object detector and to generate conservative geometric obstacle candidates for physical structures that are not associated with a visual detection. These geometric-only observations are not considered confirmed rare objects by themselves; they are treated as potential unclassified obstacle candidates that require spatial filtering and decision-level interpretation in the ego-vehicle corridor.
The overall processing pipeline is organized into four main layers: perception, late fusion, decision-making, and action execution. At each simulation step $t$, the ego vehicle receives an RGB image and a LiDAR point cloud from the CARLA simulator. These inputs are processed independently and subsequently fused at the decision level. The resulting fused representation is then evaluated by a rule-based agent, which determines the safety state of the ego vehicle and optionally triggers an AEB action, as illustrated in Figure 1.
3.1. Sensor Input Representation
Let the synchronized sensor input at time step $t$ be defined as
$$X_t = \{I_t, P_t\},$$
where $I_t$ denotes the RGB image captured by the front-facing camera, and $P_t = \{p_i\}$ denotes the LiDAR point cloud. Each LiDAR point is represented as
$$p_i = (x_i, y_i, z_i, r_i),$$
where $(x_i, y_i, z_i)$ are the 3D coordinates of the point in the LiDAR coordinate frame and $r_i$ denotes the returned intensity.
The RGB image provides semantic information about visible objects, whereas the LiDAR point cloud provides metric information about object distance and spatial structure. The complementary nature of these two modalities is essential for detecting rare traffic situations, especially when the visual detector does not recognize an object but the LiDAR sensor still captures its physical presence.
3.2. RGB-Based Object Detection
The RGB image $I_t$ is processed by an object detector based on YOLOv8. The detector produces a set of visual detections:
$$D_t = \{d_j\},$$
where each detection is defined as
$$d_j = (b_j, c_j, s_j).$$
Here, $b_j$ denotes the 2D bounding box in the image plane, $c_j$ is the predicted object class, and $s_j$ is the confidence score.
Since the detector is pretrained on a finite set of object classes, it is effective for common traffic participants such as vehicles, pedestrians, bicycles, traffic lights, and traffic signs. However, rare or unusual objects, such as fallen trees, construction objects, debris, or other non-standard obstacles, may not be classified correctly. Therefore, visual detection alone is insufficient for robust corner-case detection.
3.3. LiDAR Projection, Corridor Filtering, and BEV Clustering
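The LiDAR point cloud is first transformed from the LiDAR coordinate frame into the camera coordinate frame and then projected into the image plane. For a LiDAR point $p_i = (x_i, y_i, z_i)$, the projection can be expressed as
$$\begin{bmatrix} \tilde{u}_i \\ \tilde{v}_i \\ \tilde{w}_i \end{bmatrix} = K \, T_{L \to C} \begin{bmatrix} x_i \\ y_i \\ z_i \\ 1 \end{bmatrix},$$
where $T_{L \to C} = [R \mid t]$ is the rigid transformation from the LiDAR frame to the camera frame and $K$ is the intrinsic matrix of the camera. The projected image coordinates are obtained by homogeneous normalization:
$$u_i = \frac{\tilde{u}_i}{\tilde{w}_i}, \qquad v_i = \frac{\tilde{v}_i}{\tilde{w}_i}.$$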
For safety reasoning, LiDAR points are additionally filtered in the ego-vehicle coordinate frame. The driving corridor is defined as
where
x is the longitudinal forward axis of the ego vehicle,
y is the lateral axis, and
z is the vertical axis. In the experiments, the corridor was set to 2–
longitudinally,
laterally, and
to
vertically. This wider corridor is used for perception and candidate logging, while the rule-based agent later prioritizes only objects that are relevant to the ego lane and immediate collision risk.
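The projection and corridor filter described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names `project_to_image` and `in_corridor`, the nested-list matrix layout, the assumption that the camera frame is an optical frame with z pointing forward, and every numeric value in the usage lines are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Point3 = Tuple[float, float, float]
Matrix = Sequence[Sequence[float]]


def project_to_image(K: Matrix, T_cam_lidar: Matrix, p: Point3) -> Optional[Tuple[float, float]]:
    """Project a LiDAR point to pixel coordinates (u, v).

    K is a 3x3 intrinsic matrix, T_cam_lidar a 3x4 [R | t] transform.
    Assumption: the camera frame has z pointing forward.
    Returns None for points at or behind the image plane.
    """
    x, y, z = p
    # Rigid transform from the LiDAR frame into the camera frame.
    cam = [row[0] * x + row[1] * y + row[2] * z + row[3] for row in T_cam_lidar]
    if cam[2] <= 0.0:
        return None
    # Homogeneous pixel coordinates, normalized by the third component.
    h = [sum(K[r][c] * cam[c] for c in range(3)) for r in range(3)]
    return (h[0] / h[2], h[1] / h[2])


def in_corridor(p_ego: Point3,
                x_range: Tuple[float, float],
                y_half_width: float,
                z_range: Tuple[float, float]) -> bool:
    """Box test in the ego frame (x forward, y lateral, z vertical)."""
    x, y, z = p_ego
    return (x_range[0] <= x <= x_range[1]
            and abs(y) <= y_half_width
            and z_range[0] <= z <= z_range[1])


# Illustrative values only; they are not the calibration or corridor of the paper.
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
T = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
print(project_to_image(K, T, (1.0, 0.0, 2.0)))
print(in_corridor((10.0, 0.5, 0.2), x_range=(2.0, 40.0), y_half_width=3.0, z_range=(-1.0, 2.0)))
```

Keeping the projection and the corridor test as two separate functions mirrors the text: the first works in the camera frame for later box matching, the second works in the ego frame for safety reasoning.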
The filtered LiDAR points are clustered in BEV using DBSCAN [
32]. Let
denote the horizontal projection of each point. The clustering operation is
with distance threshold
and minimum cluster size
. Each cluster
is represented by its projected 2D bounding box
, its centroid
, its physical size
, and its nearest longitudinal distance from the ego vehicle:
Equation (10) clarifies the distance used in the paper. The Euclidean centroid distance is used for reporting and visualization, whereas the safety agent uses the front-edge longitudinal distance because braking depends on the closest point of a candidate in front of the ego vehicle.
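To make the clustering step and the two distance definitions concrete, the following sketch clusters BEV points with a small pure-Python DBSCAN and then reports, per cluster, the centroid, the Euclidean centroid distance, and the front-edge longitudinal distance (the smallest forward coordinate in the cluster). The function names, the quadratic-time neighbor search, and the values `eps=0.5` and `min_pts=3` are assumptions made for illustration; a practical pipeline would more likely rely on a library implementation of DBSCAN.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Point2 = Tuple[float, float]  # (x forward, y lateral) in the ego frame


def dbscan(points: Sequence[Point2], eps: float, min_pts: int) -> List[int]:
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(points)
    labels: List[Optional[int]] = [None] * n  # None means "not visited yet"

    def region(i: int) -> List[int]:
        xi, yi = points[i]
        return [j for j in range(n)
                if math.hypot(points[j][0] - xi, points[j][1] - yi) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            neighbors = region(j)
            if len(neighbors) >= min_pts:  # core point: keep growing the cluster
                queue.extend(neighbors)
    return [lab if lab is not None else -1 for lab in labels]


def summarize_clusters(points: Sequence[Point2], labels: Sequence[int]) -> List[Dict]:
    """Centroid, centroid distance, and front-edge distance for each cluster."""
    groups: Dict[int, List[Point2]] = {}
    for p, lab in zip(points, labels):
        if lab >= 0:
            groups.setdefault(lab, []).append(p)
    out = []
    for lab, pts in sorted(groups.items()):
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        out.append({
            "id": lab,
            "centroid": (cx, cy),
            "centroid_dist": math.hypot(cx, cy),   # reporting and visualization
            "front_edge": min(p[0] for p in pts),  # used by the safety agent
        })
    return out


# Made-up points: two compact groups and one isolated return.
pts = [(10.0, 0.0), (10.2, 0.0), (10.4, 0.1),
       (30.0, 1.0), (30.1, 1.1), (30.2, 0.9), (50.0, 5.0)]
labels = dbscan(pts, eps=0.5, min_pts=3)
for c in summarize_clusters(pts, labels):
    print(c)
```

The front-edge value is never larger than the forward coordinate of the centroid, so using it in the braking logic is the more conservative of the two choices.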
To reduce the risk that static infrastructure is treated as a safety-relevant unknown object, each geometric-only cluster is further assigned a diagnostic subclass. A cluster is marked as dynamic-like when it can be associated with a cluster in the previous frame and has a longitudinal or lateral velocity above a small threshold. Otherwise, it is marked as static-like. Very small ground-level clusters are marked as road-surface artifacts, and clusters near the corridor boundary are marked as boundary infrastructure. These subclasses are used for error analysis and do not replace the conservative safety logic.
3.4. Decision-Level RGB-LiDAR Fusion
The proposed system uses decision-level, or late, fusion. Instead of merging raw sensor data or intermediate features, the framework compares the final outputs of the RGB and LiDAR perception branches. This design is computationally efficient, modular, and interpretable.
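To associate RGB detections with LiDAR clusters, the intersection-over-union (IoU) between a YOLO bounding box $b_j$ and a projected LiDAR bounding box $\hat{b}_k$ is computed as
$$\mathrm{IoU}(b_j, \hat{b}_k) = \frac{|b_j \cap \hat{b}_k|}{|b_j \cup \hat{b}_k|}.$$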
A LiDAR cluster is considered semantically supported if at least one YOLO detection overlaps with it above a predefined threshold
:
The fusion category of each LiDAR cluster is then defined as
Similarly, a YOLO detection is classified as semantic-only if it has no corresponding LiDAR support:
Thus, the fused perception output at time $t$ is represented as
$$F_t = \{F_t^{SG}, F_t^{S}, F_t^{G}\},$$
where $F_t^{SG}$ denotes semantic–geometric detections, $F_t^{S}$ denotes semantic-only detections, and $F_t^{G}$ denotes geometric-only obstacle candidates. The set $F_t^{G}$ is particularly important because it may indicate a physical object that is not semantically recognized by the RGB detector.
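The three-way split can be illustrated with a short IoU-matching routine. The sketch below assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` in pixels, treats "above the threshold" as a strict inequality, and returns index lists rather than full detection objects; the function names and the threshold value in the usage lines are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0.0 else 0.0


def fuse(yolo_boxes: Sequence[Box], lidar_boxes: Sequence[Box],
         tau: float) -> Tuple[List[int], List[int], List[int]]:
    """Split outputs into semantic-geometric, semantic-only, geometric-only.

    Returns (sg, s, g): sg and g hold LiDAR cluster indices, s holds YOLO
    detection indices. The strict '>' comparison is an assumption.
    """
    sg: List[int] = []
    g: List[int] = []
    supported_yolo = set()
    for k, lidar_box in enumerate(lidar_boxes):
        hits = [j for j, yolo_box in enumerate(yolo_boxes)
                if iou(yolo_box, lidar_box) > tau]
        if hits:
            sg.append(k)
            supported_yolo.update(hits)
        else:
            g.append(k)  # physical evidence without a matching detection
    s = [j for j in range(len(yolo_boxes)) if j not in supported_yolo]
    return sg, s, g


# Made-up boxes: one matched pair, one unmatched detection, one unmatched cluster.
yolo = [(0.0, 0.0, 10.0, 10.0), (100.0, 100.0, 110.0, 110.0)]
lidar = [(1.0, 1.0, 11.0, 11.0), (50.0, 50.0, 60.0, 60.0)]
print(fuse(yolo, lidar, tau=0.3))
```

Because the comparison is made on final outputs only, either branch can be replaced without touching the other, which is the modularity argument made for late fusion above.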
3.5. Threat Estimation
The safety agent evaluates objects only if they are relevant to the ego vehicle’s driving corridor. For each fused object
, the system stores its fusion category, semantic class when available, front-edge longitudinal distance
, and estimated relative speed. The nearest relevant threat is defined as
where
is the set of relevant objects located in the ego-vehicle corridor. The corresponding minimum threat distance is
Objects are prioritized according to their safety relevance. Geometric-only obstacle candidates inside the corridor are assigned high priority because they may represent physical obstacles that are not semantically recognized by the RGB detector. However, they are not interpreted as confirmed rare objects. Semantic–geometric vulnerable road users, such as pedestrians and cyclists, are assigned the next priority, followed by semantic–geometric vehicles. Semantic-only detections are treated with lower priority unless they persist over multiple frames or are spatially close.
To address rapidly approaching targets, the rule-based distance thresholds are complemented with a constant-velocity TTC estimate. For an object
, the relative closing speed is approximated by the change in the front-edge distance over consecutive frames:
and the TTC is defined as
where
avoids numerical instability. The TTC value is not used as a learned predictor; it is a simple constant-speed safety check that can escalate the decision state when an object approaches quickly.
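A constant-velocity TTC check of this kind can be sketched as follows. The exact formula of the paper is not reproduced here: the sketch assumes that the closing speed is the decrease in front-edge distance divided by the frame interval, and that the small constant acts as a lower bound on the closing speed, so that receding or stationary objects yield a very large TTC instead of a division by zero. The names `closing_speed`, `time_to_collision`, and the default `eps` are hypothetical.

```python
def closing_speed(d_prev: float, d_curr: float, dt: float) -> float:
    """Approximate closing speed in m/s; positive when the object approaches."""
    if dt <= 0.0:
        raise ValueError("dt must be positive")
    return (d_prev - d_curr) / dt


def time_to_collision(d_curr: float, v_close: float, eps: float = 1e-3) -> float:
    """Constant-velocity TTC in seconds.

    Assumption: eps is a floor on the closing speed, which keeps the ratio
    finite when the object is not approaching.
    """
    return d_curr / max(v_close, eps)


# Made-up numbers: the front edge moves from 12 m to 11 m within 0.5 s.
v = closing_speed(12.0, 11.0, dt=0.5)
print(v, time_to_collision(11.0, v))
```

With a single-frame difference, the estimate inherits any jitter in the cluster front edge, which is one reason why the agent pairs it with hysteresis rather than acting on TTC alone.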
3.6. Rule-Based Safety Agent
The decision layer is implemented as an interpretable rule-based agent. The agent maps the nearest threat distance
and the minimum TTC value into one of five discrete safety states:
The preliminary state
is determined using distance thresholds:
where
,
,
, and
denote the emergency braking, braking, slowing, and warning thresholds, respectively:
The thresholds are set according to the intended longitudinal control response at
. Distances below
are treated as emergency braking because they leave little time for corrective control. The 5–
interval activates strong braking, the 10–
interval activates speed reduction, and the 20–
interval activates warning-level supervision. TTC escalation is then applied as
where
and
in the reported experiments.
Table 1 summarizes the resulting rule configuration.
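The two-stage rule (distance thresholds first, TTC escalation second) can be written as a pair of small functions. In this sketch the thresholds are passed in explicitly: the interval starts of 5, 10, and 20 m follow the text above, whereas the outer warning bound and both TTC limits used in the usage lines are placeholders, and the assumption that TTC can only raise the state (never lower it) is an interpretation of "escalation".

```python
from enum import IntEnum


class State(IntEnum):
    CLEAR = 0
    WARN = 1
    SLOW = 2
    BRAKE = 3
    EMERGENCY_BRAKE = 4


def distance_state(d_min: float, d_emg: float, d_brake: float,
                   d_slow: float, d_warn: float) -> State:
    """Preliminary state from the nearest front-edge threat distance."""
    if d_min < d_emg:
        return State.EMERGENCY_BRAKE
    if d_min < d_brake:
        return State.BRAKE
    if d_min < d_slow:
        return State.SLOW
    if d_min < d_warn:
        return State.WARN
    return State.CLEAR


def apply_ttc(prelim: State, ttc: float, ttc_emg: float, ttc_brake: float) -> State:
    """Escalate (never relax) the preliminary state using the minimum TTC."""
    if ttc < ttc_emg:
        ttc_state = State.EMERGENCY_BRAKE
    elif ttc < ttc_brake:
        ttc_state = State.BRAKE
    else:
        ttc_state = State.CLEAR
    return max(prelim, ttc_state)


# d_warn and both TTC limits below are placeholders, not values from the paper.
limits = dict(d_emg=5.0, d_brake=10.0, d_slow=20.0, d_warn=30.0)
prelim = distance_state(15.0, **limits)
print(prelim.name, apply_ttc(prelim, ttc=0.8, ttc_emg=1.0, ttc_brake=2.0).name)
```

Using an ordered enumeration keeps the severity comparison explicit, which also simplifies the hysteresis step that follows.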
3.7. Hysteresis for Stable Decision-Making
To prevent unstable switching between states caused by temporary sensor noise or short-term missed detections, the safety agent uses hysteresis. Escalation to a more critical state is applied immediately:
where
maps each state to its severity level.
De-escalation is allowed only after the lower-risk state remains stable for
consecutive simulation ticks:
where
is the number of consecutive ticks for which the less severe state has been observed. This mechanism ensures that the agent reacts quickly to danger while returning to normal driving only after the scene is consistently safe.
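The asymmetric behavior (immediate escalation, delayed de-escalation) fits in a small stateful filter. The sketch below works on integer severity levels, counts any less severe proposal toward the hold period, and then adopts the most recent proposal; both of these details, the class name `HysteresisFilter`, and the hold length used in the usage lines are assumptions.

```python
class HysteresisFilter:
    """Escalate immediately; de-escalate only after a stable hold period."""

    def __init__(self, hold_ticks: int, initial: int = 0) -> None:
        if hold_ticks < 1:
            raise ValueError("hold_ticks must be at least 1")
        self.hold_ticks = hold_ticks
        self.state = initial  # current severity level (larger = more severe)
        self._calm = 0        # consecutive ticks with a less severe proposal

    def update(self, proposed: int) -> int:
        if proposed >= self.state:
            # Same or higher severity: take it at once and restart the count.
            self.state = proposed
            self._calm = 0
        else:
            self._calm += 1
            if self._calm >= self.hold_ticks:
                self.state = proposed
                self._calm = 0
        return self.state


# Severity 4 stands for emergency braking, 0 for clear (made-up sequence).
f = HysteresisFilter(hold_ticks=3)
print([f.update(s) for s in [0, 4, 0, 0, 4, 0, 0, 0]])
```

In the made-up sequence, the second high-severity proposal resets the counter, so the filter releases only after three uninterrupted low-severity ticks.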
3.8. AEB Override Action
The final layer converts the decision state into a vehicle-control command. In passive mode, the agent only records and visualizes the decision. In active mode, the AEB override modifies the control command of the CARLA autopilot.
The action command is defined as
where
denotes throttle and
denotes braking intensity. The brake value is assigned according to the agent state:
When the agent enters a braking state, the throttle is suppressed:
where
is the throttle command generated by the CARLA autopilot.
The final control command applied to the ego vehicle is therefore
where
denotes the autopilot command. This formulation allows the autopilot to handle normal driving while the proposed safety layer intervenes only in potentially dangerous situations.
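The override can be sketched as a function that either passes the autopilot command through or replaces throttle and brake. The `Control` dataclass stands in for the simulator's control structure, the brake levels per state are placeholders rather than the values used in the experiments, and taking the larger of the autopilot brake and the agent brake is an assumption chosen so that the override never weakens braking that the autopilot already requested.

```python
from dataclasses import dataclass
from typing import Dict

BRAKING_STATES = ("SLOW", "BRAKE", "EMERGENCY_BRAKE")


@dataclass
class Control:
    throttle: float
    brake: float


def aeb_override(autopilot: Control, state: str,
                 brake_levels: Dict[str, float]) -> Control:
    """Return the command to apply for the given agent state."""
    if state not in BRAKING_STATES:
        # CLEAR and WARN: the autopilot command passes through unchanged.
        return Control(autopilot.throttle, autopilot.brake)
    # Braking state: suppress throttle and enforce at least the agent brake.
    return Control(throttle=0.0,
                   brake=max(autopilot.brake, brake_levels[state]))


# Placeholder brake intensities in [0, 1]; not taken from the paper.
levels = {"SLOW": 0.3, "BRAKE": 0.6, "EMERGENCY_BRAKE": 1.0}
print(aeb_override(Control(0.7, 0.0), "WARN", levels))
print(aeb_override(Control(0.7, 0.0), "EMERGENCY_BRAKE", levels))
```

Steering is deliberately left out of the sketch because the layer described above modifies only the longitudinal command.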
3.9. Implementation Parameters and Reproducibility
To improve reproducibility, the hardware and implementation parameters are reported explicitly in Table 2 and Table 3. All CARLA and nuScenes experiments used a 12th Gen Intel Core i7-12700H CPU with 16 GB RAM. YOLOv8n inference was executed on the CPU to demonstrate CPU-only feasibility. The dedicated NVIDIA GeForce RTX GPU was used only for visualization.
The reported thresholds should be interpreted as a transparent experimental configuration for simulation-based safety supervision, not as universally optimal thresholds for a production vehicle. Their purpose is to make the rule layer inspectable and reproducible.
3.10. Frame-Level Logging and Evaluation
For each frame, the framework records the perception and decision outputs:
where
,
, and
denote the number of semantic–geometric detections, semantic-only detections, and geometric-only obstacle candidates, respectively. These per-frame records enable quantitative evaluation of fusion behavior, geometric-only candidate distance, safety-state distribution, braking actions, TTC escalation, and real-time performance.
The unnecessary braking rate is defined only on annotated frames. A braking response is considered unnecessary when the agent selects SLOW, BRAKE, or EMERGENCY_BRAKE while the annotation indicates that no safety-relevant object is present in the ego-vehicle corridor. Let
denote a braking state and let
denote an annotated relevant threat in the corridor. The metric is
This definition separates false braking caused by perception errors from correct braking caused by annotated threats.
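The counting rule behind this metric can be illustrated on per-frame records. The sketch assumes one agent state and one Boolean annotation per annotated frame, and it normalizes by the number of annotated frames; that denominator is an assumption, since normalizing by the number of braking frames or of threat-free frames would be equally consistent with the verbal definition. The function name and the sample records are made up.

```python
from typing import Sequence

BRAKING_STATES = ("SLOW", "BRAKE", "EMERGENCY_BRAKE")


def unnecessary_braking_rate(states: Sequence[str],
                             threat_present: Sequence[bool]) -> float:
    """Share of annotated frames with braking but no annotated threat.

    Assumption: the denominator is the number of annotated frames.
    """
    if len(states) != len(threat_present):
        raise ValueError("one annotation is required per frame")
    if not states:
        return 0.0
    unnecessary = sum(1 for s, g in zip(states, threat_present)
                      if s in BRAKING_STATES and not g)
    return unnecessary / len(states)


# Made-up records: only the second frame brakes without an annotated threat.
states = ["CLEAR", "SLOW", "BRAKE", "WARN"]
threat = [False, False, True, False]
print(unnecessary_braking_rate(states, threat))
```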
Overall, the proposed framework provides a lightweight and interpretable approach for detecting rare traffic situations. By treating unmatched LiDAR clusters as geometric-only obstacle candidates, the system can flag physical structures that are not associated with an RGB detection, while avoiding the stronger claim that every unmatched cluster is a true rare object. The rule-based safety agent then converts these fused perception outputs into transparent and reproducible safety actions.
4. Results
This section reports the evaluation of the proposed RGB-LiDAR fusion framework in CARLA and the additional cross-platform check on nuScenes mini. The results are organized to address scenario coverage, ground-truth (GT) validation, fusion-category behavior, safety-agent behavior, weather/lighting robustness, geometric-only candidate analysis, ablation, public-dataset transfer, and CPU real-time performance.
All numerical values from the original CARLA evaluation are retained from executed simulation logs and the manually annotated subset. In the final validation, the manually annotated subset contained 4800 frames selected by stratified sampling across maps, scenarios, weather/lighting conditions, fusion categories, and safety-agent states. This subset is used for the GT validation and control-stability analysis below.
4.1. Experimental Setup and Scenario Coverage
The ego vehicle was equipped with a front-facing RGB camera and a 32-layer LiDAR sensor. RGB frames were processed by YOLOv8n, while LiDAR points were projected into the image plane and clustered in the BEV representation. The simulator was executed at
, and each run contained 1481 frames, corresponding to approximately
. The complete CARLA evaluation covered 13 runs and 19,253 frames, as summarized in
Table 4.
The functional corner-case subset included a child entering the road with a ball, a pedestrian entering the ego corridor, a wrong-way vehicle, and an emergency vehicle leaving a side road. The manual validation was performed on critical intervals in which the safety-relevant object could affect the ego vehicle. The annotation protocol also includes baseline intervals and adverse weather/lighting cases so that normal driving, rare-object interactions, low-visibility conditions, and braking states are represented in the same validation subset.
4.2. Manual Ground-Truth Validation
A manually annotated validation subset was used to evaluate safety-relevant threat detection. In the validation, the subset contains 4800 representative CARLA frames. The frames were selected using stratified sampling across CARLA maps, scenario types, weather and illumination conditions, fusion categories, and safety-agent states. This protocol prevents the annotated subset from being dominated by normal driving frames and ensures that geometric-only candidates, adverse-weather cases, and emergency-braking situations are represented. Each frame was annotated with the presence or absence of a safety-relevant object in the ego-vehicle corridor. A detection was counted as a true positive when the fused output matched an annotated relevant object and the decision state was consistent with the corresponding distance interval. False positives corresponded to detections or braking responses without an annotated relevant object in the corridor. False negatives corresponded to missed relevant objects or missing escalation when the object entered the critical corridor.
The resulting frame-level validation metrics are reported in Table 5. The stratified composition of the manually annotated subset is shown in Table 6, and the scenario-level performance of the full framework is reported in Table 7.
The validation confirms the complementary behavior of the two sensor modalities. The vision-only baseline remained relatively precise but missed several unusual or visually degraded objects. The LiDAR-only baseline achieved high recall, but it produced more false positives because it lacked semantic context. The full framework achieved the best balance between precision and recall by combining semantic support, geometric evidence, corridor filtering, TTC-aware escalation, and hysteresis.
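For reference, frame-level counts of true positives, false positives, and false negatives map to precision, recall, and F1 in the standard way. The sketch below uses the usual definitions; the function name and the counts in the usage line are made up and are not results from this study.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard detection metrics from frame-level counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom > 0.0 else 0.0
    return precision, recall, f1


# Made-up counts for illustration only.
print(precision_recall_f1(tp=8, fp=2, fn=2))
```

A baseline that misses objects lowers recall through `fn`, while a baseline that reacts to irrelevant structures lowers precision through `fp`, which is the trade-off described for the vision-only and LiDAR-only configurations.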
4.3. Fusion Category Distribution
The first log-derived analysis evaluates the distribution of the three fusion categories over all recorded CARLA frames, as summarized in
Figure 2. Semantic–geometric detections formed the largest group, with an average of
detections per frame. Semantic-only detections reached
detections per frame, and geometric-only candidates reached
detections per frame. Thus, geometric-only candidates represented approximately
of all fused observations.
The geometric-only category should be interpreted as candidate-level evidence, not as a confirmed rare-object category. These candidates are valuable because they represent physical structures detected by LiDAR that are not associated with a YOLO detection. However, they may also contain static infrastructure, projection uncertainty, or partial objects. Therefore, the safety agent evaluates them only after corridor filtering, distance estimation, TTC checking, and temporal stabilization.
4.4. Safety-Agent State Distribution
The rule-based agent produced five interpretable safety states: CLEAR, WARN, SLOW, BRAKE, and EMERGENCY_BRAKE. Figure 3 provides the corresponding visualization. The agent remained in CLEAR for
of frames, which indicates that it did not brake continuously during normal driving. Stronger states were activated when relevant objects entered the safety corridor.
The relatively low proportion of WARN frames is explained by the configured distance thresholds and hysteresis logic. Escalation to more severe states is immediate, while de-escalation is delayed until the lower-risk state is stable. This behavior is desirable for safety supervision because a close object should cause a fast response, whereas recovery should be conservative.
4.5. Scenario-Level Corner-Case Results
Table 8 reports the scenario-level fusion statistics, agent-state behavior, and minimum relevant obstacle distance for the baseline and functional corner-case scenarios.
SG, S, and G denote semantic–geometric, semantic-only, and geometric-only detections per frame, respectively. The ball_boy scenario produced the highest geometric-only rate because the small object and the child–ball interaction generated LiDAR-supported evidence that was not always semantically matched by YOLOv8n. The pedestrian-crossing and wrong-way-vehicle scenarios produced stronger emergency-braking rates because the relevant object entered the ego corridor at a short distance.
4.6. Weather and Lighting Robustness
Additional stress tests were conducted across three scenarios and six weather/lighting conditions: clear day, light rain, heavy rain, fog, night, and backlight. Table 9 reports YOLO detections per frame and the CLEAR share, while Table 10 reports the emergency-brake share and LiDAR-only anomaly rate. The goal of this analysis is not to prove real-world robustness, but to verify whether the same pipeline remains stable across controlled CARLA visual conditions.
The stress test shows that the agent response is scenario-dependent. The fallen-tree scenario remained mostly in CLEAR because the object was frequently outside the immediate braking corridor, whereas ball_boy under light rain and fog produced high emergency-brake shares due to short-distance corridor interactions. The geometric-only candidate rate remained below approximately LiDAR-only candidates per frame in all listed conditions, indicating that the LiDAR branch did not become uncontrollably active under adverse visual conditions.
In addition to the log-derived weather stress test, the annotations were used to compute frame-level detection performance under each weather and illumination condition. The results in Table 11 show the expected degradation under night and backlight conditions, but the full framework remained stable because the LiDAR branch preserved geometric evidence when RGB confidence decreased.
4.7. Geometry-Only Candidate Analysis and LiDAR-Only Failure Modes
To clarify the meaning of geometric-only candidates, unmatched LiDAR clusters were not treated as confirmed rare objects. They were subclassified using spatial position, size, persistence, and frame-to-frame motion. The diagnostic categories are summarized in Table 12.
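A rule-based subclassification of this kind could look like the sketch below. It is a hedged illustration only: the four cues come from the text, but the `Cluster` fields, the category labels, the rule order, and all thresholds are hypothetical and are not the diagnostic rules behind the reported categories.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Unmatched LiDAR cluster summary (hypothetical fields)."""
    lateral_offset_m: float  # spatial position relative to the ego centreline
    height_m: float          # size cue: vertical extent of the cluster
    frames_seen: int         # persistence: consecutive frames the cluster was observed
    displacement_m: float    # frame-to-frame motion after ego-motion compensation


def subclassify(
    c: Cluster,
    corridor_half_width: float = 1.5,  # placeholder, metres
    min_height: float = 0.2,           # placeholder, metres
    min_frames: int = 3,               # placeholder, frames
    motion_eps: float = 0.1,           # placeholder, metres per frame
) -> str:
    """Assign one diagnostic label using position, size, persistence, and motion."""
    if c.frames_seen < min_frames:
        return "transient_artifact"        # too short-lived to trust
    if c.height_m < min_height:
        return "road_surface_artifact"     # flat, ground-level return
    if abs(c.lateral_offset_m) > corridor_half_width and c.displacement_m < motion_eps:
        return "static_infrastructure"     # stationary and outside the corridor
    return "safety_relevant_obstacle"      # persistent, raised, in or entering the path
```

Checking persistence first means a single-frame cluster is never promoted to an obstacle, which reflects the conservative treatment of unmatched clusters described above.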
This analysis explains the higher unnecessary-braking rate of the LiDAR-only baseline. Without semantic support and hysteresis, static infrastructure and small ground-level clusters can be interpreted as obstacles. Decision-level fusion reduces this effect by retaining LiDAR evidence for unknown physical objects while using semantic support, corridor relevance, TTC, and temporal stability to suppress many spurious reactions.
The annotated subset was also used to quantify the composition of geometric-only candidates. The results in Table 13 show that most geometric-only candidates corresponded to safety-relevant physical obstacles, while the remaining cases were mainly static infrastructure, road-surface artifacts, or projection/matching artifacts. The main sources of false-positive detections in the LiDAR-only baseline are listed in Table 14.
These results explain why the LiDAR-only baseline achieved high recall but also a higher unnecessary-braking rate. The proposed fusion strategy reduces this effect by preserving geometric evidence for unknown obstacles while using semantic support, candidate size, temporal persistence, and hysteresis before triggering stronger control states.
4.8. Ablation Study
The full framework was compared with three simplified variants, as reported in Table 15. The vision-only baseline used YOLOv8n detections without LiDAR support. The LiDAR-only baseline used BEV clusters without semantic support. The fusion-only variant used RGB-LiDAR fusion but disabled hysteresis. The full framework used RGB-LiDAR fusion, threat prioritization, corridor filtering, TTC escalation, and hysteresis.
The vision-only baseline missed several safety-relevant objects because some obstacles were outside the detector’s semantic classes or appeared in unusual configurations. The LiDAR-only baseline improved recall but increased unnecessary braking because it lacked semantic context. Fusion without hysteresis improved event detection but produced more unstable state transitions. On the annotated subset, the full framework achieved the best balance: recall, unnecessary braking, full event-level success on the annotated corner-case subset, and negligible latency overhead compared with the fusion-only variant.
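For reference, the detection metrics compared across the variants follow the standard definitions, shown in the sketch below. The function name and the example counts are placeholders and do not reproduce any result from the annotated subset.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Placeholder counts for illustration only (not taken from the paper).
example = detection_metrics(tp=8, fp=2, fn=2)
```

A variant that gains recall by accepting more unmatched clusters (fewer false negatives, more false positives) will show the precision loss that the LiDAR-only baseline exhibits as unnecessary braking.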
4.9. Public Dataset Cross-Platform Check on nuScenes Mini
To provide a limited public-dataset sanity check, the perception and fusion logic was applied to nuScenes mini v1.0. The evaluation used 10 scenes and 100 keyframes with the same CPU inference target and LiDAR clustering parameters. The resulting per-frame averages are summarized in Table 16. Because nuScenes and CARLA have different camera/LiDAR layouts, object taxonomies, annotation ranges, and scene compositions, the nuScenes experiment is not presented as a direct benchmark comparison. It is used only to verify that the pipeline can ingest real multimodal data and expose the expected gap between closed-set YOLO detections and public-dataset GT.
The nuScenes check illustrates that closed-set YOLO detections cover only a subset of the annotated objects in dense real scenes. This supports the motivation for using geometric evidence as a conservative safety signal, while also confirming the limitation that the current system is not a replacement for a fully trained 3D detector or a modern BEV fusion network.
4.10. Map-Level Robustness
Figure 4 visualizes the corresponding fusion-category frequencies.
These results support the interpretation that the framework reacts to the spatial structure of the environment rather than producing a fixed response pattern. Corridor filtering is important because LiDAR observes not only vehicles and pedestrians but also walls, curbs, poles, and other static objects that should not necessarily cause braking.
4.11. Distance-Based Safety Consistency
The distance analysis verified whether braking decisions corresponded to physically meaningful obstacle distances. Geometric-only candidates that triggered strong reactions were concentrated mainly in the 4– range, as shown in Figure 5. The minimum recorded relevant candidate distance was in the pedestrian-crossing scenario and in the ball_boy scenario. These distances fall inside the configured braking and emergency-braking regions.
Across the annotated subset, of agent interventions matched the expected threshold interval, while were one level more conservative than the annotation due to hysteresis-delayed de-escalation. No annotated critical event was missed at the event level.
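The consistency check described above can be sketched as a comparison between the agent state and the state expected from distance alone. The threshold table `THRESHOLDS_M` and the function names are hypothetical placeholders; the configured distance thresholds of the agent are not reproduced here.

```python
STATES = ["CLEAR", "WARN", "SLOW", "BRAKE", "EMERGENCY_BRAKE"]

# Placeholder upper distance limits in metres, nearest region first (assumed values).
THRESHOLDS_M = [
    (5.0, "EMERGENCY_BRAKE"),
    (10.0, "BRAKE"),
    (20.0, "SLOW"),
    (30.0, "WARN"),
]


def expected_state(distance_m: float) -> str:
    """State implied by the obstacle distance alone."""
    for limit, state in THRESHOLDS_M:
        if distance_m <= limit:
            return state
    return "CLEAR"


def consistency(agent_state: str, distance_m: float) -> str:
    """Compare the agent's state with the distance-implied state."""
    diff = STATES.index(agent_state) - STATES.index(expected_state(distance_m))
    if diff == 0:
        return "match"
    return "more_conservative" if diff > 0 else "less_conservative"
```

An agent that is still in BRAKE while the obstacle has already receded into the SLOW interval is reported as `more_conservative`, which is the pattern attributed to hysteresis-delayed de-escalation.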
4.12. Latency and Real-Time Performance
The latency analysis was performed on the CPU-only implementation described in Table 2. Table 17 shows that YOLOv8n was the dominant computational component, while LiDAR projection, BEV clustering, fusion, and decision-making introduced only minor overhead.
At , the CARLA tick budget is . The average total latency of therefore leaves a margin of per frame. The observed processing rate ranged from 13 to depending on scene complexity. This confirms that the fusion and decision layers are lightweight; the main optimization target for future deployment is the visual detector.
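The budget arithmetic used in this paragraph is simple: the tick budget is the reciprocal of the simulation tick rate, and the margin is the budget minus the mean per-frame latency. The sketch below shows the calculation; the function names are assumptions, and the numbers in the example are illustrative placeholders rather than the measured values.

```python
def tick_budget_ms(tick_rate_hz: float) -> float:
    """Time available per simulation tick, in milliseconds."""
    if tick_rate_hz <= 0.0:
        raise ValueError("tick rate must be positive")
    return 1000.0 / tick_rate_hz


def latency_margin_ms(tick_rate_hz: float, mean_latency_ms: float) -> float:
    """Remaining time per frame; a negative value means the budget is exceeded."""
    return tick_budget_ms(tick_rate_hz) - mean_latency_ms


def processing_rate_hz(latency_ms: float) -> float:
    """Frames per second achievable at a given per-frame latency."""
    if latency_ms <= 0.0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Illustrative placeholders only: a 20 Hz tick with 35 ms mean latency.
margin = latency_margin_ms(20.0, 35.0)  # 50 ms budget minus 35 ms latency
```

Because the margin depends on the mean latency, frames in complex scenes can still exceed the budget even when the average does not, which is why the observed processing-rate range is reported alongside the mean.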
4.13. Qualitative Examples from Simulation
Qualitative examples were used to verify that the numerical results correspond to visually interpretable behavior. Figure 6 shows normal Town05 driving, where the agent remained in CLEAR because detected objects were outside the immediate collision corridor. Figure 7 shows a critical scene with a vehicle approximately in front of the ego vehicle, where the agent selected BRAKE.
5. Discussion
The results indicate that RGB-LiDAR fusion provides complementary safety evidence beyond a vision-only detector. The annotated CARLA subset showed that the fused threat representation achieved precision, recall, and an F1-score. This improvement stems mainly from the ability of the LiDAR branch to provide geometric evidence for objects that are not reliably classified by the RGB detector. At the same time, the paper avoids interpreting every unmatched LiDAR cluster as a confirmed rare object. The geometric-only category is treated as candidate-level evidence and becomes safety-relevant only after corridor filtering, distance estimation, TTC checking, priority assignment, and temporal stabilization.
The ablation study demonstrates the role of each component. Vision-only processing is computationally simple but misses part of the safety-relevant evidence. LiDAR-only processing improves recall but increases unnecessary braking because semantic context is missing and static infrastructure can be interpreted as a physical obstacle. Fusion improves the availability of relevant evidence, while hysteresis reduces oscillations and unnecessary interventions. The full framework therefore provides the best balance between sensitivity to rare traffic situations and stability during normal driving.
The weather/lighting experiments show that the framework remains operational under controlled CARLA stress conditions, but the behavior is scenario-dependent. For example, ball_boy under light rain and fog produced high emergency-braking shares because the relevant object entered the corridor at short distance, whereas the fallen-tree scenario was often classified as low-risk when the object remained outside the immediate collision corridor. This confirms that the agent responds to spatial relevance rather than to scenario labels alone.
The nuScenes mini experiment should be interpreted cautiously. It demonstrates that the pipeline can be executed on a public multimodal dataset and reveals the expected gap between closed-set YOLO detections and the full nuScenes GT. However, it is not a direct comparison with state-of-the-art 3D detectors or BEV fusion networks. Modern methods such as BEVFormer, PETR, open-vocabulary detectors, and OOD/anomaly detectors may provide stronger perception performance, but they require different training and computational assumptions. The contribution of this paper is a compact, interpretable, CPU-feasible safety-monitoring pipeline rather than a new detector architecture.
Several limitations remain. First, the primary evaluation is simulation-based and should not be interpreted as proof of real-world deployment readiness. CARLA provides repeatability and safety, but it cannot fully reproduce real sensor noise, rolling shutter, LiDAR intensity distributions, weather physics, material reflectance, traffic behavior, or calibration drift. Second, although the manual CARLA validation subset contained 4800 frames, it still covers only approximately 25% of the full 19,253-frame log and does not replace full-frame annotation or real-road validation. Third, the nuScenes check is limited to 100 keyframes and does not replace a full benchmark on KITTI, Waymo Open Dataset, SemanticKITTI, or the full nuScenes split. Fourth, the geometry-only subclassification is diagnostic and rule-based; static infrastructure such as guardrails, curbs, and poles, as well as road-surface artifacts, can still produce false candidates. Fifth, the TTC model assumes short-term constant velocity and does not model complex interactions, occlusion histories, or pedestrian intent.
Future work will therefore evaluate the pipeline on larger public multimodal datasets, compare it with modern BEV and open-set perception methods, incorporate multi-object tracking and more reliable velocity estimation, and study deployment-oriented calibration and sensor-noise mismatch. A further extension will be to annotate the complete 19,253-frame CARLA log and evaluate the method on longer real-world driving sequences. Another important direction is to replace the current hand-tuned thresholds with a formally verified or data-calibrated safety envelope while preserving interpretability.
6. Conclusions
This paper presented a real-time RGB-LiDAR fusion framework for detecting and reacting to rare traffic situations in CARLA. The system combines YOLOv8n-based semantic perception, BEV LiDAR clustering, decision-level fusion, a rule-based safety agent with hysteresis, TTC-aware escalation, and an AEB override layer. The method distinguishes semantic–geometric detections, semantic-only detections, and geometric-only obstacle candidates, while conservatively treating unmatched LiDAR clusters as candidate-level evidence rather than confirmed rare objects. The CARLA evaluation covered 19,253 frames from three maps and 3CSim-inspired corner-case scenarios. On the annotated validation subset of 4800 frames, the fused threat representation achieved precision, recall, and an F1-score. The full framework reduced unnecessary braking to and outperformed vision-only, LiDAR-only, and fusion-without-hysteresis variants by improving critical-event recall and control stability. Additional weather/lighting tests and the nuScenes mini check broadened the analysis beyond the initial CARLA setup, while also making clear that full real-world validation remains future work. The average CPU latency was per frame, which remained within the budget of the simulation. Overall, the results support lightweight RGB-LiDAR fusion with transparent rule-based safety supervision as a reproducible simulation baseline for rare-traffic-situation testing in AD.