Article

Real-Time Detection of Rare Traffic Situations Using RGB-LiDAR Fusion and a Rule-Based Safety Agent in CARLA

1
Department of Computer Networks, Technical University of Košice, 042 00 Košice, Slovakia
2
Department of Computers and Informatics, Technical University of Košice, 042 00 Košice, Slovakia
3
Irvine Valley College, Irvine, CA 92618, USA
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(13), 6722; https://doi.org/10.3390/app16136722 (registering DOI)
Submission received: 7 June 2026 / Revised: 28 June 2026 / Accepted: 2 July 2026 / Published: 5 July 2026

Abstract

Rare and safety-critical traffic situations remain challenging for autonomous driving (AD) because they are underrepresented in common training data and may include objects outside standard detector classes. This paper presents a real-time RGB-LiDAR fusion framework for detecting and reacting to rare traffic situations in CARLA (Car Learning to Act), a reproducible simulator for AD research. The approach combines YOLOv8n-based RGB perception, bird’s-eye-view (BEV) LiDAR clustering, decision-level fusion, an interpretable rule-based safety agent with hysteresis, Time-to-Collision (TTC)-aware escalation, and an automatic emergency braking (AEB) override above the CARLA autopilot. Fused observations are classified as semantic–geometric detections, semantic-only detections, or geometric-only obstacle candidates, where unmatched LiDAR clusters are treated conservatively as candidate-level physical evidence rather than confirmed rare objects. The framework was evaluated on three CARLA maps and 3CSim-inspired corner-case scenarios comprising 19,253 frames, with additional weather/lighting stress tests and a public nuScenes mini cross-platform check. On a manually annotated subset of 4800 CARLA frames, corresponding to approximately 24.9% of the recorded CARLA log, the full framework achieved 96.2% precision, 97.3% recall, and a 96.7% F1-score for safety-relevant threat detection. The control experiments show that the fusion-based safety agent reduced unnecessary braking to 1.7% compared with 8.6% for the LiDAR-only baseline and achieved event-level success on the annotated critical intervals. The proposed CPU-only implementation maintained real-time performance, with an average processing time of 34.7 ms.

1. Introduction

Autonomous driving (AD) systems rely on robust perception [1] and decision-making modules to operate safely in dynamic and unpredictable road environments. Their ability to perceive surrounding objects, interpret traffic situations, and react in time is essential for preventing accidents and ensuring reliable vehicle operation. In recent years, deep learning-based perception models have achieved significant progress in object detection [2], semantic segmentation, and scene understanding. In particular, camera-based detectors are capable of recognizing common traffic participants such as vehicles, pedestrians, cyclists, traffic signs, and traffic lights with high accuracy under standard driving conditions. However, despite these advances, autonomous systems still face substantial limitations when encountering rare, unexpected, or safety-critical situations [3,4].
Such situations are commonly referred to as rare traffic situations, out-of-distribution situations, or corner-case scenarios. In this paper, the term rare traffic situation denotes the general safety-critical phenomenon, while corner-case scenario denotes a simulated test configuration. They include events and objects that deviate from normal traffic patterns, such as fallen trees, vehicles moving in the opposite direction, children suddenly entering the road, emergency vehicles, misplaced objects on the road, or other atypical obstacles. Although these events occur less frequently than standard traffic situations, their impact on road safety can be significant. A reliable AD system must therefore be able not only to handle common driving scenarios, but also to detect and react appropriately to rare traffic situations that may represent an immediate risk. This requirement remains challenging because corner-case situations are usually insufficiently represented in common training datasets, which limits the generalization capability of data-driven perception models [5,6].
A key limitation of purely camera-based perception is its dependence on predefined object classes and visual conditions. Standard object detectors, such as models trained on COCO-like datasets [7], can reliably recognize objects that belong to known categories. However, they may fail to semantically identify objects that are not included in their training classes, even if these objects are physically relevant for safe navigation. For example, a fallen tree, a ball on the road, or an unusual obstacle may not be assigned a correct semantic label by the detector, although it should still influence the vehicle’s behavior. Moreover, camera-based perception can be affected by illumination changes, weather conditions, occlusion, motion blur, and viewpoint variations. These limitations show that relying on a single sensor modality may be insufficient in safety-critical traffic scenarios.
For this reason, additional sensor modalities are required to improve the robustness of autonomous perception. LiDAR [8] represents an important complementary sensor because it provides accurate spatial and distance information about the surrounding environment. While RGB cameras provide rich semantic and visual context, LiDAR [9,10] can detect the physical presence and position of objects independently of their visual appearance or semantic category. This is particularly useful for detecting obstacles that do not belong to standard object classes. By projecting LiDAR data into a bird’s-eye-view (BEV) representation [11], the system can reason about the spatial distribution of objects around the ego vehicle and identify potential obstacles in the driving area. Therefore, the fusion of RGB camera data and LiDAR data can improve perception robustness, especially in situations where one modality alone is insufficient [1,12].
Testing safety-critical traffic situations in the real world is difficult, expensive, and potentially dangerous. Rare traffic situations are not easy to reproduce under controlled conditions, and collecting sufficient real-world data for each corner-case type would require substantial effort. Simulation environments provide a practical alternative by enabling repeatable and controllable evaluation of AD systems. In particular, CARLA (Car Learning to Act) [13] provides configurable maps, traffic actors, sensors, weather, lighting, and vehicle-control interfaces, which makes it suitable for testing corner-case scenarios that would be difficult or unsafe to reproduce in real traffic. Beyond its established use in the AD research community, CARLA is also relevant to broader simulation and synthetic-data workflows for AD research, including workflows that may be combined with industrial tools for data generation, scene reconstruction, and closed-loop experimentation. In this work, CARLA is not interpreted as a substitute for real-road validation. It is used as a reproducible and controllable testbed for studying whether lightweight RGB-LiDAR fusion can provide additional safety evidence for rare traffic situations.
This paper addresses the problem of detecting and reacting to rare traffic situations in a simulated AD environment. The proposed system combines RGB-based object detection [2], LiDAR-based BEV clustering, and a rule-based safety agent with hysteresis. The RGB camera is used to provide semantic information through the YOLOv8 detector [2], while LiDAR data are processed to identify spatial clusters corresponding to physical obstacles. The decision module combines these sources of information and determines the appropriate vehicle behavior in safety-critical situations. To improve the stability of decisions, hysteresis is incorporated into the rule-based logic, reducing unnecessary oscillations between different driving actions.
The system is implemented in the CARLA simulator and includes an automatic emergency braking (AEB) override layer above the simulator autopilot. This layer enables the vehicle to react immediately when a critical obstacle is detected in front of the ego vehicle. The proposed approach is evaluated on multiple CARLA maps and several corner-case scenarios leveraging the CARLA Corner Case Simulation (3CSim) framework [14], including situations involving unusual obstacles, vulnerable road users, wrong-way vehicles, and priority vehicles. The main objective is to evaluate whether RGB-LiDAR fusion can improve the detection of objects outside standard detector classes and support reliable agent reactions in safety-critical traffic scenarios.
The main contributions of this paper are summarized as follows:
  • A real-time RGB-LiDAR perception pipeline for detecting safety-relevant objects and geometric obstacle candidates in CARLA-based AD simulations.
  • A decision-level fusion strategy that combines camera-based semantic detections with LiDAR-based spatial clustering and explicitly separates semantic–geometric detections, semantic-only detections, and geometric-only obstacle candidates. The novelty is not the existence of unmatched clusters alone, but their integration into a conservative threat-estimation and safety-supervision logic with corridor filtering, distance/TTC reasoning, and hysteresis.
  • A deterministic safety agent that maps fused perception outputs to interpretable control states and applies an AEB override above the CARLA autopilot. The agent uses distance thresholds, a constant-velocity TTC rule, and hysteresis to reduce unstable transitions and unnecessary braking.
  • A reproducible evaluation over 13 simulation runs, 19,253 frames, three CARLA maps, and 3CSim-inspired corner-case configurations, complemented by weather/lighting stress tests and a nuScenes mini cross-platform check.
  • A component ablation and failure-mode analysis showing the complementary contribution of the RGB branch, LiDAR branch, fusion layer, hysteresis, and AEB override, together with an explicit discussion of simulation-only validation and the remaining sim-to-real limitations.

2. Related Work

This section positions the proposed framework with respect to five related research directions: corner-case and simulation-based validation, open-set and anomaly-aware perception, RGB-LiDAR and BEV fusion, robustness under adverse visual conditions, and interpretable safety decision-making. The goal is not only to summarize previous work, but also to clarify the role of the proposed late-fusion safety agent compared with more complex perception architectures.

2.1. Corner Cases, Simulation-Based Validation, and the Sim-to-Real Gap

Rare traffic situations in AD are often described as corner cases, out-of-distribution situations, or long-tail events. They include unusual objects, unexpected behavior of traffic participants, and atypical spatial configurations that are poorly represented in standard training data [3,4,5]. Because these events are safety-critical but infrequent, real-world collection is expensive, difficult to repeat, and may be unsafe. Simulation environments therefore play an important role in early validation because they allow controlled generation of rare traffic situations and repeatable testing of perception and decision-making pipelines. CARLA provides configurable maps, traffic actors, sensors, weather, and vehicle-control interfaces [13]. However, simulation introduces a sim-to-real gap caused by simplified sensor noise, synthetic textures, rendering differences, traffic-behavior assumptions, and calibration mismatch. The present work therefore treats CARLA as a reproducible pre-validation environment and complements it with a limited public-dataset check on nuScenes mini [15]; real-road deployment validation remains future work.

2.2. Geometry-Based Safety Monitoring for Unknown Objects

Modern autonomous perception must address objects and events that are not represented by the detector’s training classes. Open World Object Detection (OWOD), Unknown Object Detection (UOD), Open Vocabulary Detection (OVD), Out-of-Distribution (OOD) detection, and anomaly-detection methods aim to recognize or flag unknown objects rather than forcing every object into a closed label set [16,17,18,19]. These methods are powerful but often require additional training data, specialized confidence calibration, large open-vocabulary models, or image-level anomaly scoring. The proposed framework follows a different objective: it does not try to semantically name every unknown object. Instead, it uses LiDAR geometry to conservatively flag physical obstacle candidates that are not supported by the camera detector and then evaluates them through a transparent safety layer. This makes the method closer to an interpretable safety monitor than to an open-set detector.

2.3. RGB-LiDAR Fusion and BEV-Centric Perception

Camera and LiDAR sensors provide complementary information. RGB images contain rich semantic and appearance cues, whereas LiDAR provides metric depth and geometric evidence that is less dependent on visual texture, illumination, and object category. Fusion methods can be broadly organized into early fusion, feature-level fusion, BEV-level fusion, and decision-level fusion [1,11,12,20]. Recent BEV-centric and transformer-based methods, such as BEVFormer and PETR, project multi-view features or point-aware representations into a shared spatial space to support 3D detection and scene understanding [21]. Such methods are more expressive than the lightweight pipeline used here, but they also require training, calibration, synchronization, and GPU resources. In contrast, the proposed decision-level fusion uses pretrained YOLOv8n detections and LiDAR clusters as modular outputs. This design sacrifices the representational power of feature-level fusion but improves interpretability, CPU feasibility, and ease of failure analysis.

2.4. Robust Visual Perception Under Adverse Conditions

Visual perception can be degraded by small objects, low contrast, low illumination, rain, fog, glare, backlighting, and motion blur. Recent detector modifications, including traffic-sign-oriented YOLO variants such as TSD-YOLO and illumination-robust feature-enhancement networks such as IHENet, aim to improve object detection under challenging visual conditions [22,23]. These works are complementary to the present framework. A stronger detector could be substituted for YOLOv8n in the RGB branch, while the LiDAR branch and safety agent would remain unchanged. The experiments in this manuscript therefore evaluate the safety value of adding geometric evidence to a compact detector rather than claiming state-of-the-art visual recognition.

2.5. Safety Decision-Making and AEB Supervision

Safety decision-making in AD is commonly addressed through rule-based, learning-based, or hybrid approaches [24,25]. Rule-based methods use explicit thresholds, logical conditions, and safety constraints to determine the vehicle response [26,27]. Their main advantage is that each decision can be directly interpreted, which is important for safety-critical systems. Learning-based and vision–language approaches can support higher-level scene interpretation, but they typically require large and diverse datasets and may be difficult to inspect in rare safety-critical situations [28,29,30,31]. The proposed approach belongs to the hybrid direction: object detection and LiDAR clustering provide perception evidence, while the final safety decision is made by a deterministic rule-based agent with distance/TTC checks, hysteresis, and an AEB override.

3. Materials and Methods

This section presents the proposed framework for detecting rare and safety-critical traffic situations in simulated AD. The framework combines RGB-based object detection, LiDAR-based geometric perception, decision-level sensor fusion, and an interpretable rule-based safety agent. The main objective is to detect common road users recognized by a pretrained object detector and to generate conservative geometric obstacle candidates for physical structures that are not associated with a visual detection. These geometric-only observations are not considered confirmed rare objects by themselves; they are treated as potential unclassified obstacle candidates that require spatial filtering and decision-level interpretation in the ego-vehicle corridor.
The overall processing pipeline is organized into four main layers: perception, late fusion, decision-making, and action execution. At each simulation step t, the ego vehicle receives an RGB image and a LiDAR point cloud from the CARLA simulator. These inputs are processed independently and subsequently fused at the decision level. The resulting fused representation is then evaluated by a rule-based agent, which determines the safety state of the ego vehicle and optionally triggers an AEB action, as illustrated in Figure 1.

3.1. Sensor Input Representation

Let the synchronized sensor input at time step t be defined as
$$S_t = \{ I_t, P_t \}, \tag{1}$$
where $I_t \in \mathbb{R}^{H \times W \times 3}$ denotes the RGB image captured by the front-facing camera, and $P_t = \{ p_i \}_{i=1}^{N_t}$ denotes the LiDAR point cloud. Each LiDAR point is represented as
$$p_i = (x_i, y_i, z_i, r_i), \tag{2}$$
where $(x_i, y_i, z_i)$ are the 3D coordinates of the point in the LiDAR coordinate frame and $r_i$ denotes the returned intensity.
The RGB image provides semantic information about visible objects, whereas the LiDAR point cloud provides metric information about object distance and spatial structure. The complementary nature of these two modalities is essential for detecting rare traffic situations, especially when the visual detector does not recognize an object but the LiDAR sensor still captures its physical presence.

3.2. RGB-Based Object Detection

The RGB image $I_t$ is processed by an object detector based on YOLOv8. The detector produces a set of visual detections:
$$Y_t = \{ y_j \}_{j=1}^{M_t}, \tag{3}$$
where each detection is defined as
$$y_j = \left( b_j^{Y}, c_j, s_j \right). \tag{4}$$
Here, $b_j^{Y} = (u_{1,j}, v_{1,j}, u_{2,j}, v_{2,j})$ denotes the 2D bounding box in the image plane, $c_j$ is the predicted object class, and $s_j \in [0, 1]$ is the confidence score.
Since the detector is pretrained on a finite set of object classes, it is effective for common traffic participants such as vehicles, pedestrians, bicycles, traffic lights, and traffic signs. However, rare or unusual objects, such as fallen trees, construction objects, debris, or other non-standard obstacles, may not be classified correctly. Therefore, visual detection alone is insufficient for robust corner-case detection.

3.3. LiDAR Projection, Corridor Filtering, and BEV Clustering

The LiDAR point cloud is first transformed from the LiDAR coordinate frame into the camera coordinate frame and then projected into the image plane. For a LiDAR point $p_i$, the projection can be expressed as
$$\tilde{q}_i = K \, T_{CL} \begin{bmatrix} x_i & y_i & z_i & 1 \end{bmatrix}^{T}, \tag{5}$$
where $T_{CL}$ is the rigid transformation from the LiDAR frame to the camera frame and $K$ is the intrinsic matrix of the camera. The projected image coordinates are obtained by homogeneous normalization:
$$q_i = (u_i, v_i) = \left( \frac{\tilde{q}_{i,x}}{\tilde{q}_{i,z}}, \frac{\tilde{q}_{i,y}}{\tilde{q}_{i,z}} \right). \tag{6}$$
For safety reasoning, LiDAR points are additionally filtered in the ego-vehicle coordinate frame. The driving corridor is defined as
$$P_t^{corr} = \left\{ p_i \in P_t \;\middle|\; x_{\min} \le x_i \le x_{\max},\; |y_i| \le y_{\max},\; z_{\min} \le z_i \le z_{\max} \right\}, \tag{7}$$
where $x$ is the longitudinal forward axis of the ego vehicle, $y$ is the lateral axis, and $z$ is the vertical axis. In the experiments, the corridor was set to 2–40 m longitudinally, ±12 m laterally, and 1.5 to 3.0 m vertically. This wider corridor is used for perception and candidate logging, while the rule-based agent later prioritizes only objects that are relevant to the ego lane and immediate collision risk.
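The projection and corridor filter above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function names `project_point` and `corridor_filter`, the nested-list matrix layout, and the convention that the camera frame has its z axis along the optical axis are all assumptions. The vertical bounds are passed explicitly by the caller rather than hard-coded.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def project_point(K: Sequence[Sequence[float]],
                  T_cl: Sequence[Sequence[float]],
                  p: Point) -> Optional[Tuple[float, float]]:
    """Project one LiDAR point into the image plane.

    K is a 3x3 intrinsic matrix and T_cl a 4x4 LiDAR-to-camera transform
    (assumed layout). Returns None when the point is not in front of the camera.
    """
    hom = (p[0], p[1], p[2], 1.0)
    # Camera-frame coordinates: first three rows of the rigid transform.
    cam = [sum(T_cl[r][c] * hom[c] for c in range(4)) for r in range(3)]
    # Apply the intrinsics, then normalise by the third component.
    q = [sum(K[r][c] * cam[c] for c in range(3)) for r in range(3)]
    if q[2] <= 0.0:
        return None
    return (q[0] / q[2], q[1] / q[2])


def corridor_filter(points: Sequence[Point],
                    x_rng: Tuple[float, float],
                    y_max: float,
                    z_rng: Tuple[float, float]) -> List[Point]:
    """Keep points inside the ego-frame corridor (x forward, y lateral, z up)."""
    return [
        p for p in points
        if x_rng[0] <= p[0] <= x_rng[1]
        and abs(p[1]) <= y_max
        and z_rng[0] <= p[2] <= z_rng[1]
    ]
```

Keeping the corridor bounds as arguments makes it easy to log candidates in the wide corridor and re-filter with a narrower lane-level corridor later.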
The filtered LiDAR points are clustered in BEV using DBSCAN [32]. Let $\rho_i = (x_i, y_i)$ denote the horizontal projection of each point. The clustering operation is
$$L_t = \mathrm{DBSCAN}\left( \{ \rho_i \}, \epsilon, N_{\min} \right), \tag{8}$$
with distance threshold $\epsilon = 1.0$ m and minimum cluster size $N_{\min} = 10$. Each cluster $l_k$ is represented by its projected 2D bounding box $b_k^{L}$, its centroid $\mu_k$, its physical size $s_k = (\ell_k, w_k, h_k)$, and its nearest longitudinal distance from the ego vehicle:
$$l_k = \left( b_k^{L}, \mu_k, s_k, d_k^{front}, d_k^{cent} \right), \tag{9}$$
$$d_k^{front} = \max\left( 0, \min_{p_i \in l_k} x_i \right), \qquad d_k^{cent} = \sqrt{\mu_{k,x}^{2} + \mu_{k,y}^{2}}. \tag{10}$$
Equation (10) clarifies the distance used in the paper. The Euclidean centroid distance $d_k^{cent}$ is used for reporting and visualization, whereas the safety agent uses the front-edge longitudinal distance $d_k^{front}$ because braking depends on the closest point of a candidate in front of the ego vehicle.
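As an illustration of the clustering step and the two distances, the following sketch implements a brute-force DBSCAN over the horizontal coordinates and then computes the front-edge and centroid distances for one cluster. It is a simplified stand-in under stated assumptions: the names `dbscan_bev` and `cluster_distances` are hypothetical, the neighbourhood count includes the point itself, and the centroid is taken as the arithmetic mean of the cluster points, which the text does not specify.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def dbscan_bev(points: Sequence[Point], eps: float = 1.0,
               min_pts: int = 10) -> List[int]:
    """Label each point with a cluster id (-1 = noise) using its (x, y) only."""
    n = len(points)
    labels: List[Optional[int]] = [None] * n

    def neighbors(i: int) -> List[int]:
        xi, yi = points[i][0], points[i][1]
        return [j for j in range(n)
                if math.hypot(points[j][0] - xi, points[j][1] - yi) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1          # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbj = neighbors(j)
            if len(nbj) >= min_pts:  # core point: expand the cluster
                queue.extend(nbj)
    return [lab if lab is not None else -1 for lab in labels]


def cluster_distances(cluster_pts: Sequence[Point]) -> Tuple[float, float]:
    """Return (d_front, d_cent) for one cluster in the ego frame."""
    d_front = max(0.0, min(p[0] for p in cluster_pts))
    mx = sum(p[0] for p in cluster_pts) / len(cluster_pts)
    my = sum(p[1] for p in cluster_pts) / len(cluster_pts)
    return d_front, math.hypot(mx, my)
```

The brute-force neighbour search is quadratic in the number of points, so a real-time pipeline would more plausibly rely on an indexed implementation; the sketch only makes the definitions concrete.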
To reduce the risk that static infrastructure is treated as a safety-relevant unknown object, each geometric-only cluster is further assigned a diagnostic subclass. A cluster is marked as dynamic-like when it can be associated with a cluster in the previous frame and has a longitudinal or lateral velocity above a small threshold. Otherwise, it is marked as static-like. Very small ground-level clusters are marked as road-surface artifacts, and clusters near the corridor boundary are marked as boundary infrastructure. These subclasses are used for error analysis and do not replace the conservative safety logic.

3.4. Decision-Level RGB-LiDAR Fusion

The proposed system uses decision-level, or late, fusion. Instead of merging raw sensor data or intermediate features, the framework compares the final outputs of the RGB and LiDAR perception branches. This design is computationally efficient, modular, and interpretable.
To associate RGB detections with LiDAR clusters, the intersection-over-union (IoU) between a YOLO bounding box $b_j^{Y}$ and a projected LiDAR bounding box $b_k^{L}$ is computed as
$$\mathrm{IoU}(b_j^{Y}, b_k^{L}) = \frac{| b_j^{Y} \cap b_k^{L} |}{| b_j^{Y} \cup b_k^{L} |}. \tag{11}$$
A LiDAR cluster is considered semantically supported if at least one YOLO detection overlaps with it above a predefined threshold $\tau_{\mathrm{IoU}}$:
$$m_k = \max_{y_j \in Y_t} \mathrm{IoU}(b_j^{Y}, b_k^{L}). \tag{12}$$
The fusion category of each LiDAR cluster is then defined as
$$f(l_k) = \begin{cases} \text{semantic-geometric}, & \text{if } m_k \ge \tau_{\mathrm{IoU}}, \\ \text{geometric-only}, & \text{if } m_k < \tau_{\mathrm{IoU}}. \end{cases} \tag{13}$$
Similarly, a YOLO detection is classified as semantic-only if it has no corresponding LiDAR support:
$$f(y_j) = \begin{cases} \text{semantic-geometric}, & \text{if } \max_{l_k \in L_t} \mathrm{IoU}(b_j^{Y}, b_k^{L}) \ge \tau_{\mathrm{IoU}}, \\ \text{semantic-only}, & \text{otherwise}. \end{cases} \tag{14}$$
Thus, the fused perception output at time $t$ is represented as
$$F_t = \{ \mathrm{SG}_t, \mathrm{S}_t, \mathrm{G}_t \}, \tag{15}$$
where $\mathrm{SG}_t$ denotes semantic–geometric detections, $\mathrm{S}_t$ denotes semantic-only detections, and $\mathrm{G}_t$ denotes geometric-only obstacle candidates. The set $\mathrm{G}_t$ is particularly important because it may indicate a physical object that is not semantically recognized by the RGB detector.
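The association rule can be made concrete with a short sketch. The function names `iou` and `fuse` and the `(u1, v1, u2, v2)` box layout are assumptions for illustration; the IoU threshold is passed in by the caller because its value is not stated in this section.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (u1, v1, u2, v2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0.0 else 0.0


def fuse(yolo_boxes: Sequence[Box], lidar_boxes: Sequence[Box],
         tau_iou: float) -> Tuple[List[str], List[str]]:
    """Return fusion categories for YOLO detections and for LiDAR clusters."""
    lidar_labels = []
    for lb in lidar_boxes:
        # Best overlap of this cluster with any detection (0 if there are none).
        m_k = max((iou(yb, lb) for yb in yolo_boxes), default=0.0)
        lidar_labels.append(
            "semantic-geometric" if m_k >= tau_iou else "geometric-only")
    yolo_labels = []
    for yb in yolo_boxes:
        best = max((iou(yb, lb) for lb in lidar_boxes), default=0.0)
        yolo_labels.append(
            "semantic-geometric" if best >= tau_iou else "semantic-only")
    return yolo_labels, lidar_labels
```

Because each branch is labeled independently from the final boxes, either detector can be swapped out without touching the fusion rule.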

3.5. Threat Estimation

The safety agent evaluates objects only if they are relevant to the ego vehicle’s driving corridor. For each fused object $o_i \in F_t$, the system stores its fusion category, semantic class when available, front-edge longitudinal distance $d_i$, and estimated relative speed. The nearest relevant threat is defined as
$$o_t = \arg\min_{o_i \in R_t} d_i, \tag{16}$$
where $R_t$ is the set of relevant objects located in the ego-vehicle corridor. The corresponding minimum threat distance is
$$d_t = \min_{o_i \in R_t} d_i. \tag{17}$$
Objects are prioritized according to their safety relevance. Geometric-only obstacle candidates inside the corridor are assigned high priority because they may represent physical obstacles that are not semantically recognized by the RGB detector. However, they are not interpreted as confirmed rare objects. Semantic–geometric vulnerable road users, such as pedestrians and cyclists, are assigned the next priority, followed by semantic–geometric vehicles. Semantic-only detections are treated with lower priority unless they persist over multiple frames or are spatially close.
To address rapidly approaching targets, the rule-based distance thresholds are complemented with a constant-velocity TTC estimate. For an object $o_i$, the relative closing speed is approximated by the change in the front-edge distance over consecutive frames:
$$v_{i,t}^{rel} = \max\left( 0, \frac{d_{i,t-1} - d_{i,t}}{\Delta t} \right), \tag{18}$$
and the TTC is defined as
$$\mathrm{TTC}_{i,t} = \begin{cases} \dfrac{d_{i,t}}{v_{i,t}^{rel} + \varepsilon}, & v_{i,t}^{rel} > 0, \\ +\infty, & v_{i,t}^{rel} = 0, \end{cases} \tag{19}$$
where $\varepsilon$ avoids numerical instability. The TTC value is not used as a learned predictor; it is a simple constant-speed safety check that can escalate the decision state when an object approaches quickly.
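A small sketch of the constant-velocity TTC check follows. The helper names `closing_speed` and `ttc` and the default `eps = 1e-6` are assumptions; the 0.05 s step in the usage lines corresponds to the 20 Hz simulation rate reported in the paper.

```python
import math


def closing_speed(d_prev: float, d_curr: float, dt: float) -> float:
    """Relative closing speed in m/s; receding objects are clamped to zero."""
    return max(0.0, (d_prev - d_curr) / dt)


def ttc(d_curr: float, v_rel: float, eps: float = 1e-6) -> float:
    """Constant-velocity time-to-collision in seconds (inf if not closing)."""
    if v_rel > 0.0:
        return d_curr / (v_rel + eps)
    return math.inf


# Usage: an object moves from 20.0 m to 19.5 m within one 20 Hz tick.
v = closing_speed(20.0, 19.5, dt=0.05)
t = ttc(19.5, v)
```

Differencing distances over a single tick amplifies range noise, so smoothing the distance signal before this step is a natural design option that the sketch leaves out.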

3.6. Rule-Based Safety Agent

The decision layer is implemented as an interpretable rule-based agent. The agent maps the nearest threat distance $d_t$ and the minimum TTC value into one of five discrete safety states:
$$z_t \in \{ \mathrm{CLEAR}, \mathrm{WARN}, \mathrm{SLOW}, \mathrm{BRAKE}, \mathrm{EMERGENCY\_BRAKE} \}. \tag{20}$$
The preliminary state $\hat{z}_t$ is determined using distance thresholds:
$$\hat{z}_t = \begin{cases} \mathrm{EMERGENCY\_BRAKE}, & d_t < \theta_E, \\ \mathrm{BRAKE}, & \theta_E \le d_t < \theta_B, \\ \mathrm{SLOW}, & \theta_B \le d_t < \theta_S, \\ \mathrm{WARN}, & \theta_S \le d_t < \theta_W, \\ \mathrm{CLEAR}, & d_t \ge \theta_W \ \text{or} \ R_t = \emptyset, \end{cases} \tag{21}$$
where $\theta_E$, $\theta_B$, $\theta_S$, and $\theta_W$ denote the emergency braking, braking, slowing, and warning thresholds, respectively:
$$\theta_E < \theta_B < \theta_S < \theta_W. \tag{22}$$
The thresholds are set according to the intended longitudinal control response at 20 Hz. Distances below 5 m are treated as emergency braking because they leave little time for corrective control. The 5–10 m interval activates strong braking, the 10–20 m interval activates speed reduction, and the 20–30 m interval activates warning-level supervision. TTC escalation is then applied as
$$\hat{z}_t \leftarrow \begin{cases} \max(\hat{z}_t, \mathrm{EMERGENCY\_BRAKE}), & \min_i \mathrm{TTC}_{i,t} < T_E, \\ \max(\hat{z}_t, \mathrm{BRAKE}), & T_E \le \min_i \mathrm{TTC}_{i,t} < T_B, \\ \hat{z}_t, & \text{otherwise}, \end{cases} \tag{23}$$
where $T_E = 1.0$ s and $T_B = 2.0$ s in the reported experiments. Table 1 summarizes the resulting rule configuration.
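The distance rule and the TTC escalation can be expressed as two small functions. The sketch below uses the thresholds stated in the text (5, 10, 20, and 30 m; 1.0 s and 2.0 s); the `State` enum, the function names, and the use of `None` to represent an empty set of relevant objects are illustrative assumptions.

```python
import math
from enum import IntEnum
from typing import Optional, Tuple


class State(IntEnum):
    """Safety states ordered by severity so that max() picks the stricter one."""
    CLEAR = 0
    WARN = 1
    SLOW = 2
    BRAKE = 3
    EMERGENCY_BRAKE = 4


def preliminary_state(
        d_min: Optional[float],
        thresholds: Tuple[float, float, float, float] = (5.0, 10.0, 20.0, 30.0),
) -> State:
    """Map the nearest threat distance (None = no relevant object) to a state."""
    if d_min is None:
        return State.CLEAR
    th_e, th_b, th_s, th_w = thresholds
    if d_min < th_e:
        return State.EMERGENCY_BRAKE
    if d_min < th_b:
        return State.BRAKE
    if d_min < th_s:
        return State.SLOW
    if d_min < th_w:
        return State.WARN
    return State.CLEAR


def escalate(state: State, ttc_min: float,
             t_e: float = 1.0, t_b: float = 2.0) -> State:
    """Raise the state when the minimum TTC is short; never lower it."""
    if ttc_min < t_e:
        return max(state, State.EMERGENCY_BRAKE)
    if ttc_min < t_b:
        return max(state, State.BRAKE)
    return state
```

For example, an object at 25 m yields WARN from distance alone, but a minimum TTC of 1.5 s raises the result to BRAKE.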

3.7. Hysteresis for Stable Decision-Making

To prevent unstable switching between states caused by temporary sensor noise or short-term missed detections, the safety agent uses hysteresis. Escalation to a more critical state is applied immediately:
$$z_t = \hat{z}_t \quad \text{if } \mathrm{sev}(\hat{z}_t) > \mathrm{sev}(z_{t-1}), \tag{24}$$
where $\mathrm{sev}(\cdot)$ maps each state to its severity level.
De-escalation is allowed only after the lower-risk state remains stable for $N_h$ consecutive simulation ticks:
$$z_t = \begin{cases} \hat{z}_t, & \text{if } h_t \ge N_h, \\ z_{t-1}, & \text{otherwise}, \end{cases} \quad \text{if } \mathrm{sev}(\hat{z}_t) < \mathrm{sev}(z_{t-1}), \tag{25}$$
where $h_t$ is the number of consecutive ticks for which the less severe state has been observed. This mechanism ensures that the agent reacts quickly to danger while returning to normal driving only after the scene is consistently safe.
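One way to read the hysteresis rule is sketched below, with states represented by their integer severity. The class name `Hysteresis` and the hold length used in the usage lines are hypothetical. The sketch also makes two interpretive assumptions that the text leaves open: the counter advances on any tick whose proposal is below the held state, and de-escalation jumps to the most recent proposal once the counter reaches the hold length.

```python
class Hysteresis:
    """Immediate escalation, delayed de-escalation over integer severities."""

    def __init__(self, n_hold: int, initial: int = 0) -> None:
        self.n_hold = n_hold    # ticks a lower proposal must persist (N_h)
        self.state = initial    # currently held severity
        self.count = 0          # consecutive ticks with a lower proposal (h_t)

    def update(self, proposed: int) -> int:
        if proposed > self.state:
            # More critical: switch at once and reset the counter.
            self.state = proposed
            self.count = 0
        elif proposed < self.state:
            # Less critical: hold until the proposal has persisted long enough.
            self.count += 1
            if self.count >= self.n_hold:
                self.state = proposed
                self.count = 0
        else:
            # Same severity re-observed: any pending de-escalation is cancelled.
            self.count = 0
        return self.state


# Usage with a hypothetical hold length of 3 ticks.
agent = Hysteresis(n_hold=3)
trace = [agent.update(s) for s in [0, 3, 1, 1, 3, 1, 1, 1, 0]]
```

In the usage trace the held severity stays at 3 through the interrupted run of lower proposals and drops only after three uninterrupted lower ticks.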

3.8. AEB Override Action

The final layer converts the decision state into a vehicle-control command. In passive mode, the agent only records and visualizes the decision. In active mode, the AEB override modifies the control command of the CARLA autopilot.
The action command is defined as
$$a_t = (\alpha_t, \beta_t), \tag{26}$$
where $\alpha_t \in [0, 1]$ denotes throttle and $\beta_t \in [0, 1]$ denotes braking intensity. The brake value is assigned according to the agent state:
$$\beta_t = \begin{cases} 0.00, & z_t = \mathrm{CLEAR}, \\ 0.00, & z_t = \mathrm{WARN}, \\ 0.30, & z_t = \mathrm{SLOW}, \\ 0.70, & z_t = \mathrm{BRAKE}, \\ 1.00, & z_t = \mathrm{EMERGENCY\_BRAKE}. \end{cases} \tag{27}$$
When the agent enters a braking state, the throttle is suppressed:
$$\alpha_t = \begin{cases} 0, & z_t \in \{ \mathrm{SLOW}, \mathrm{BRAKE}, \mathrm{EMERGENCY\_BRAKE} \}, \\ \alpha_t^{AP}, & \text{otherwise}, \end{cases} \tag{28}$$
where $\alpha_t^{AP}$ is the throttle command generated by the CARLA autopilot.
The final control command applied to the ego vehicle is therefore
$$u_t = \begin{cases} u_t^{AP}, & z_t \in \{ \mathrm{CLEAR}, \mathrm{WARN} \}, \\ (0, \beta_t), & z_t \in \{ \mathrm{SLOW}, \mathrm{BRAKE}, \mathrm{EMERGENCY\_BRAKE} \}, \end{cases} \tag{29}$$
where $u_t^{AP}$ denotes the autopilot command. This formulation allows the autopilot to handle normal driving while the proposed safety layer intervenes only in potentially dangerous situations.
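The override logic reduces to a lookup plus a pass-through. In the sketch below, the brake values come from the text, while the function name `aeb_override`, the string state keys, and the `(throttle, brake)` tuple representation of a command are assumptions. Steering is outside the scope of this sketch.

```python
from typing import Dict, Tuple

# Brake intensity per safety state, as listed in the text.
BRAKE_BY_STATE: Dict[str, float] = {
    "CLEAR": 0.0,
    "WARN": 0.0,
    "SLOW": 0.3,
    "BRAKE": 0.7,
    "EMERGENCY_BRAKE": 1.0,
}

PASS_THROUGH = ("CLEAR", "WARN")


def aeb_override(state: str, ap_throttle: float,
                 ap_brake: float) -> Tuple[float, float]:
    """Return the (throttle, brake) command applied to the ego vehicle."""
    if state in PASS_THROUGH:
        # Normal driving: the autopilot command is forwarded unchanged.
        return (ap_throttle, ap_brake)
    # Braking states: throttle is suppressed and the rule-based brake applies.
    return (0.0, BRAKE_BY_STATE[state])
```

In passive mode the same function can still be evaluated for logging, with its output simply not applied to the vehicle.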

3.9. Implementation Parameters and Reproducibility

To improve reproducibility, the hardware and implementation parameters are reported explicitly in Table 2 and Table 3. All CARLA and nuScenes experiments used a 12th Gen Intel Core i7-12700H CPU with 16 GB RAM. YOLOv8n inference was executed on the CPU to demonstrate CPU-only feasibility. The dedicated NVIDIA GeForce RTX GPU was used only for visualization.
The reported thresholds should be interpreted as a transparent experimental configuration for simulation-based safety supervision, not as universally optimal thresholds for a production vehicle. Their purpose is to make the rule layer inspectable and reproducible.

3.10. Frame-Level Logging and Evaluation

For each frame, the framework records the perception and decision outputs:
$$r_t = \left( n_t^{SG}, n_t^{S}, n_t^{G}, d_t, \mathrm{TTC}_t, z_t, \alpha_t, \beta_t \right), \tag{30}$$
where $n_t^{SG}$, $n_t^{S}$, and $n_t^{G}$ denote the number of semantic–geometric detections, semantic-only detections, and geometric-only obstacle candidates, respectively. These per-frame records enable quantitative evaluation of fusion behavior, geometric-only candidate distance, safety-state distribution, braking actions, TTC escalation, and real-time performance.
The unnecessary braking rate is defined only on annotated frames. A braking response is considered unnecessary when the agent selects SLOW, BRAKE, or EMERGENCY_BRAKE while the annotation indicates that no safety-relevant object is present in the ego-vehicle corridor. Let $B_t = 1$ denote a braking state and let $A_t = 1$ denote an annotated relevant threat in the corridor. The metric is
$$\mathrm{UBR} = 100 \cdot \frac{\sum_{t \in \mathcal{K}} [\, B_t = 1 \wedge A_t = 0 \,]}{\sum_{t \in \mathcal{K}} [\, B_t = 1 \,]}, \tag{31}$$
where $\mathcal{K}$ is the set of annotated frames and $[\cdot]$ equals 1 when the condition holds and 0 otherwise.
This definition separates false braking caused by perception errors from correct braking caused by annotated threats.
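The metric can be computed directly from two per-frame flag sequences. The function name `unnecessary_braking_rate` is hypothetical, and returning 0.0 when no braking frame exists is a convention chosen for this sketch, since the ratio is undefined in that case.

```python
from typing import Sequence


def unnecessary_braking_rate(braking: Sequence[int],
                             threat: Sequence[int]) -> float:
    """UBR in percent over annotated frames.

    braking[t] is 1 when the agent is in SLOW, BRAKE, or EMERGENCY_BRAKE;
    threat[t] is 1 when an annotated relevant object is in the corridor.
    """
    if len(braking) != len(threat):
        raise ValueError("flag sequences must have the same length")
    braked = sum(1 for b in braking if b)
    if braked == 0:
        return 0.0  # convention for this sketch: no braking, nothing unnecessary
    unnecessary = sum(1 for b, a in zip(braking, threat) if b and not a)
    return 100.0 * unnecessary / braked
```

Note that the denominator counts braking frames rather than all annotated frames, so the value reads as the share of braking responses that were not justified by an annotated threat.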
Overall, the proposed framework provides a lightweight and interpretable approach for detecting rare traffic situations. By treating unmatched LiDAR clusters as geometric-only obstacle candidates, the system can flag physical structures that are not associated with an RGB detection, while avoiding the stronger claim that every unmatched cluster is a true rare object. The rule-based safety agent then converts these fused perception outputs into transparent and reproducible safety actions.

4. Results

This section reports the evaluation of the proposed RGB-LiDAR fusion framework in CARLA and the additional cross-platform check on nuScenes mini. The results are organized to address scenario coverage, ground-truth (GT) validation, fusion-category behavior, safety-agent behavior, weather/lighting robustness, geometric-only candidate analysis, ablation, public-dataset transfer, and CPU real-time performance.
All numerical values from the original CARLA evaluation are retained from executed simulation logs and the manually annotated subset. In the final validation, the manually annotated subset contained 4800 frames selected by stratified sampling across maps, scenarios, weather/lighting conditions, fusion categories, and safety-agent states. This subset is used for the GT validation and control-stability analysis below.

4.1. Experimental Setup and Scenario Coverage

The ego vehicle was equipped with a front-facing RGB camera and a 32-layer LiDAR sensor. RGB frames were processed by YOLOv8n, while LiDAR points were projected into the image plane and clustered in the BEV representation. The simulator was executed at 20 Hz, and each run contained 1481 frames, corresponding to approximately 74 s. The complete CARLA evaluation covered 13 runs and 19,253 frames, as summarized in Table 4.
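The frame counts quoted here and in the abstract are mutually consistent, as the following worked arithmetic shows, using only the figures stated in the paper:

```latex
\begin{align*}
  13 \times 1481 &= 19{,}253 \ \text{frames}, \\
  \frac{1481\ \text{frames}}{20\ \text{Hz}} &= 74.05\ \text{s} \approx 74\ \text{s per run}, \\
  \frac{4800}{19{,}253} &\approx 0.2493 \approx 24.9\%\ \text{of the recorded log}.
\end{align*}
```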
The functional corner-case subset included a child entering the road with a ball, a pedestrian entering the ego corridor, a wrong-way vehicle, and an emergency vehicle leaving a side road. The manual validation was performed on critical intervals in which the safety-relevant object could affect the ego vehicle. The annotation protocol also includes baseline intervals and adverse weather/lighting cases so that normal driving, rare-object interactions, low-visibility conditions, and braking states are represented in the same validation subset.

4.2. Manual Ground-Truth Validation

A manually annotated validation subset of 4800 representative CARLA frames was used to evaluate safety-relevant threat detection. The frames were selected using stratified sampling across CARLA maps, scenario types, weather and illumination conditions, fusion categories, and safety-agent states. This protocol prevents the annotated subset from being dominated by normal driving frames and ensures that geometric-only candidates, adverse-weather cases, and emergency-braking situations are represented. Each frame was annotated with the presence or absence of a safety-relevant object in the ego-vehicle corridor. A detection was counted as a true positive (TP) when the fused output matched an annotated relevant object and the decision state was consistent with the corresponding distance interval. False positives (FP) corresponded to detections or braking responses without an annotated relevant object in the corridor. False negatives (FN) corresponded to missed relevant objects or missing escalation when the object entered the critical corridor.
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2\,TP}{2\,TP + FP + FN}.$$
The resulting frame-level validation metrics are reported in Table 5. The stratified composition of the manually annotated subset is shown in Table 6, and the scenario-level performance of the full framework is reported in Table 7.
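As a worked check of these definitions, the sketch below recomputes precision, recall, and F1 from the frame-level TP, FP, and FN counts listed in Table 5. The function name `prf` and the dictionary layout are illustrative choices, not part of the authors' implementation.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from frame-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


# Frame-level counts (TP, FP, FN) as reported in Table 5.
TABLE5 = {
    "Vision only": (643, 73, 122),
    "LiDAR only": (719, 111, 46),
    "Fusion without hysteresis": (733, 47, 32),
    "Full proposed framework": (744, 29, 21),
}

for name, counts in TABLE5.items():
    p, r, f1 = prf(*counts)
    print(f"{name}: P={p:.1%} R={r:.1%} F1={f1:.1%}")
```

Rounded to one decimal place, the recomputed values agree with the percentages listed in Table 5, for example 96.2%, 97.3%, and 96.7% for the full framework.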
The validation confirms the complementary behavior of the two sensor modalities. The vision-only baseline remained relatively precise but missed several unusual or visually degraded objects. The LiDAR-only baseline achieved high recall, but it produced more false positives because it lacked semantic context. The full framework achieved the best balance between precision and recall by combining semantic support, geometric evidence, corridor filtering, TTC-aware escalation, and hysteresis.

4.3. Fusion Category Distribution

The first log-derived analysis evaluates the distribution of the three fusion categories over all recorded CARLA frames, as summarized in Figure 2. Semantic–geometric detections formed the largest group, with an average of 0.78 detections per frame. Semantic-only detections reached 0.46 detections per frame, and geometric-only candidates reached 0.47 detections per frame. Thus, geometric-only candidates represented approximately 27% of all fused observations.
The geometric-only category should be interpreted as candidate-level evidence, not as a confirmed rare-object category. These candidates are valuable because they represent physical structures detected by LiDAR that are not associated with a YOLO detection. However, they may also contain static infrastructure, projection uncertainty, or partial objects. Therefore, the safety agent evaluates them only after corridor filtering, distance estimation, TTC checking, and temporal stabilization.
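The three fusion categories can be illustrated with a small matching routine. This is a minimal sketch, assuming axis-aligned 2D boxes in image coordinates and a greedy one-to-one assignment. The 2D IoU criterion and the threshold of 0.3 follow Table 3, whereas the greedy strategy, the box format, and the function names are assumptions made for illustration only.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def classify(yolo_boxes: list[Box], lidar_boxes: list[Box], tau: float = 0.3):
    """Split observations into semantic-geometric pairs, semantic-only
    YOLO indices, and geometric-only LiDAR indices.

    Greedy one-to-one matching is an assumption, not the authors' method.
    """
    unmatched_lidar = set(range(len(lidar_boxes)))
    sem_geo, sem_only = [], []
    for i, yolo_box in enumerate(yolo_boxes):
        best_j, best_iou = None, tau
        for j in unmatched_lidar:
            value = iou(yolo_box, lidar_boxes[j])
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j is None:
            sem_only.append(i)            # YOLO detection without LiDAR support
        else:
            sem_geo.append((i, best_j))   # matched YOLO box and LiDAR cluster
            unmatched_lidar.discard(best_j)
    return sem_geo, sem_only, sorted(unmatched_lidar)


yolo = [(100, 100, 200, 200), (400, 100, 450, 200)]
lidar = [(110, 110, 210, 210), (600, 300, 650, 350)]
print(classify(yolo, lidar))  # ([(0, 0)], [1], [1])
```

In this toy input the first YOLO box overlaps the first projected cluster (IoU of about 0.68), the second YOLO box has no LiDAR support, and the second cluster remains a geometric-only candidate.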

4.4. Safety-Agent State Distribution

The rule-based agent produced five interpretable safety states: CLEAR, WARN, SLOW, BRAKE, and EMERGENCY_BRAKE. Figure 3 provides the corresponding visualization. The agent remained in CLEAR for 73.9% of frames, which indicates that it did not brake continuously during normal driving. Stronger states were activated when relevant objects entered the safety corridor.
The relatively low proportion of WARN frames is explained by the configured distance thresholds and hysteresis logic. Escalation to more severe states is immediate, while de-escalation is delayed until the lower-risk state is stable. This behavior is desirable for safety supervision because a close object should cause a fast response, whereas recovery should be conservative.
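The asymmetric behavior described above (immediate escalation, delayed de-escalation) can be sketched as a small state filter. The 3-frame hold follows the hysteresis value in Table 3. The class name, the use of an `IntEnum`, and the choice to de-escalate to the most severe state seen during the hold window are assumptions for illustration; the paper does not specify these details.

```python
from enum import IntEnum


class State(IntEnum):
    CLEAR = 0
    WARN = 1
    SLOW = 2
    BRAKE = 3
    EMERGENCY_BRAKE = 4


class HysteresisFilter:
    """Escalate immediately; de-escalate only after a stable lower-risk streak."""

    def __init__(self, hold_frames: int = 3):
        self.hold_frames = hold_frames
        self.state = State.CLEAR
        self._pending: list[State] = []  # raw states of the current lower-risk streak

    def update(self, raw: State) -> State:
        if raw >= self.state:
            # Escalation (or no change) takes effect in the same frame.
            self.state = raw
            self._pending.clear()
        else:
            self._pending.append(raw)
            if len(self._pending) >= self.hold_frames:
                # Assumption: step down to the most severe state in the streak.
                self.state = max(self._pending)
                self._pending.clear()
        return self.state


raw_sequence = [State.CLEAR, State.BRAKE, State.CLEAR, State.CLEAR, State.BRAKE,
                State.CLEAR, State.CLEAR, State.CLEAR, State.CLEAR]
agent = HysteresisFilter(hold_frames=3)
filtered = [agent.update(s).name for s in raw_sequence]
print(filtered)
```

In this example the two isolated CLEAR frames between the BRAKE observations do not release the brake; the filter returns to CLEAR only after three consecutive lower-risk frames.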

4.5. Scenario-Level Corner-Case Results

Table 8 reports the scenario-level fusion statistics, agent-state behavior, and minimum relevant obstacle distance for the baseline and functional corner-case scenarios.
SG, S, and G denote semantic–geometric, semantic-only, and geometric-only detections per frame, respectively. The ball_boy scenario produced the highest geometric-only rate because the small object and the child–ball interaction generated LiDAR-supported evidence that was not always semantically matched by YOLOv8n. The pedestrian-crossing and wrong-way-vehicle scenarios produced stronger emergency-braking rates because the relevant object entered the ego corridor at a short distance.

4.6. Weather and Lighting Robustness

Additional stress tests were conducted across three scenarios and six weather/lighting conditions: clear day, light rain, heavy rain, fog, night, and backlight. Table 9 reports YOLO detections per frame and the CLEAR share, while Table 10 reports the emergency-brake share and LiDAR-only anomaly rate. The goal of this analysis is not to prove real-world robustness, but to verify whether the same pipeline remains stable across controlled CARLA visual conditions.
The stress test shows that the agent response is scenario-dependent. The fallen-tree scenario remained mostly in CLEAR because the object was frequently outside the immediate braking corridor, whereas ball_boy under light rain and fog produced high emergency-brake shares due to short-distance corridor interactions. The geometric-only candidate rate remained below approximately 1.2 LiDAR-only candidates per frame in all listed conditions, indicating that the LiDAR branch did not become uncontrollably active under adverse visual conditions.
In addition to the log-derived weather stress test, the annotations were used to compute frame-level detection performance under each weather and illumination condition. The results in Table 11 show the expected degradation under night and backlight conditions, but the full framework remained stable because the LiDAR branch preserved geometric evidence when RGB confidence decreased.

4.7. Geometry-Only Candidate Analysis and LiDAR-Only Failure Modes

To clarify the meaning of geometric-only candidates, unmatched LiDAR clusters were not treated as confirmed rare objects. They were subclassified using spatial position, size, persistence, and frame-to-frame motion. The diagnostic categories are summarized in Table 12.
This analysis explains the higher unnecessary-braking rate of the LiDAR-only baseline. Without semantic support and hysteresis, static infrastructure and small ground-level clusters can be interpreted as obstacles. Decision-level fusion reduces this effect by retaining LiDAR evidence for unknown physical objects while using semantic support, corridor relevance, TTC, and temporal stability to suppress many spurious reactions.
The annotated subset was also used to quantify the composition of geometric-only candidates. The results in Table 13 show that most geometric-only candidates corresponded to safety-relevant physical obstacles, while the remaining cases were mainly static infrastructure, road-surface artifacts, or projection/matching artifacts. The main sources of false-positive detections in the LiDAR-only baseline are listed in Table 14.
These results explain why the LiDAR-only baseline achieved high recall but also a higher unnecessary-braking rate. The proposed fusion strategy reduces this effect by preserving geometric evidence for unknown obstacles while using semantic support, candidate size, temporal persistence, and hysteresis before triggering stronger control states.

4.8. Ablation Study

The full framework was compared with three simplified variants, as reported in Table 15. The vision-only baseline used YOLOv8n detections without LiDAR support. The LiDAR-only baseline used BEV clusters without semantic support. The fusion-only variant used RGB-LiDAR fusion but disabled hysteresis. The full framework used RGB-LiDAR fusion, threat prioritization, corridor filtering, TTC escalation, and hysteresis.
The vision-only baseline missed several safety-relevant objects because some obstacles were outside the detector’s semantic classes or appeared in unusual configurations. The LiDAR-only baseline improved recall but increased unnecessary braking because it lacked semantic context. Fusion without hysteresis improved event detection but produced more unstable state transitions. On the annotated subset, the full framework achieved the best balance: 97.3% recall, 1.7% unnecessary braking, full event-level success on the annotated corner-case subset, and negligible latency overhead compared with the fusion-only variant.

4.9. Public Dataset Cross-Platform Check on nuScenes Mini

To provide a limited public-dataset sanity check, the perception and fusion logic was applied to nuScenes mini v1.0. The evaluation used 10 scenes and 100 keyframes with the same CPU inference target and LiDAR clustering parameters. The resulting per-frame averages are summarized in Table 16. Because nuScenes and CARLA have different camera/LiDAR layouts, object taxonomies, annotation ranges, and scene compositions, the nuScenes experiment is not presented as a direct benchmark comparison. It is used only to verify that the pipeline can ingest real multimodal data and expose the expected gap between closed-set YOLO detections and public-dataset GT.
The nuScenes check illustrates that closed-set YOLO detections cover only a subset of the annotated objects in dense real scenes. This supports the motivation for using geometric evidence as a conservative safety signal, while also confirming the limitation that the current system is not a replacement for a fully trained 3D detector or a modern BEV fusion network.

4.10. Map-Level Robustness

Figure 4 visualizes the fusion-category frequencies aggregated by CARLA map.
These results support the interpretation that the framework reacts to the spatial structure of the environment rather than producing a fixed response pattern. Corridor filtering is important because LiDAR observes not only vehicles and pedestrians but also walls, curbs, poles, and other static objects that should not necessarily cause braking.
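Corridor filtering of this kind can be sketched as a simple box test in the ego frame. The longitudinal (2–40 m) and lateral (±12 m) bounds are taken from Table 3. The ego-frame convention (x forward, y lateral), the `Candidate` type, the nearest-by-longitudinal-distance rule, and the omission of the vertical bound are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    x: float  # longitudinal offset ahead of the ego vehicle [m] (assumed convention)
    y: float  # lateral offset from the ego centerline [m] (assumed convention)


def in_corridor(c: Candidate, x_min: float = 2.0, x_max: float = 40.0,
                y_half_width: float = 12.0) -> bool:
    """True when the candidate lies inside the longitudinal/lateral corridor."""
    return x_min <= c.x <= x_max and abs(c.y) <= y_half_width


def nearest_relevant(candidates: list[Candidate]) -> Optional[Candidate]:
    """Return the closest in-corridor candidate, or None if the corridor is empty."""
    relevant = [c for c in candidates if in_corridor(c)]
    return min(relevant, key=lambda c: c.x, default=None)


scene = [Candidate(6.7, 0.5), Candidate(15.0, 14.0), Candidate(45.0, 0.0)]
print(nearest_relevant(scene))
```

Here the second candidate is rejected laterally and the third lies beyond the longitudinal range, so only the object at 6.7 m is passed on for distance and TTC evaluation.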

4.11. Distance-Based Safety Consistency

The distance analysis verified whether braking decisions corresponded to physically meaningful obstacle distances. Geometric-only candidates that triggered strong reactions were concentrated mainly in the 4–12 m range, as shown in Figure 5. The minimum recorded relevant candidate distance was 3.9 m in the pedestrian-crossing scenario and 4.2 m in the ball_boy scenario. These distances fall inside the configured braking and emergency-braking regions.
Across the annotated subset, 97.8% of agent interventions matched the expected threshold interval, while 2.2% were one level more conservative than the annotation due to hysteresis-delayed de-escalation. No annotated critical event was missed at the event level.
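The relation between obstacle distance, TTC, and the expected safety state can be sketched from the thresholds in Table 1. This is a minimal illustration rather than the authors' implementation: the function names, the handling of values that fall exactly on an interval boundary, and the constant-velocity TTC formula (distance divided by closing speed) are assumptions.

```python
import math


def time_to_collision(distance_m: float, closing_speed_mps: float) -> float:
    """Constant-velocity TTC; infinite when the gap is not closing."""
    if closing_speed_mps <= 0.0:
        return math.inf
    return distance_m / closing_speed_mps


def raw_state(distance_m: float, ttc_s: float = math.inf) -> str:
    """Map distance and TTC to a safety state using the Table 1 thresholds.

    Boundary handling (< versus <=) is an assumption.
    """
    if distance_m < 5.0 or ttc_s < 1.0:
        return "EMERGENCY_BRAKE"
    if distance_m < 10.0 or ttc_s < 2.0:
        return "BRAKE"
    if distance_m < 20.0:
        return "SLOW"
    if distance_m <= 30.0:
        return "WARN"
    return "CLEAR"


for d in (3.9, 4.2, 6.7, 35.0):
    print(d, raw_state(d))

# A distant but fast-closing object escalates through the TTC rule.
print(raw_state(25.0, time_to_collision(25.0, 15.0)))
```

With these thresholds, the minimum distances of 3.9 m and 4.2 m map to EMERGENCY_BRAKE and an object at 6.7 m maps to BRAKE, while an object at 25 m closing at a hypothetical 15 m/s is escalated to BRAKE because its TTC is below 2.0 s.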

4.12. Latency and Real-Time Performance

The latency analysis was performed on the CPU-only implementation described in Table 2. Table 17 shows that YOLOv8n was the dominant computational component, while LiDAR projection, BEV clustering, fusion, and decision-making introduced only minor overhead.
At 20 Hz, the CARLA tick budget is 50 ms. The average total latency of 34.7 ms therefore leaves a margin of 15.3 ms per frame. The observed processing rate ranged from 13 to 18 Hz depending on scene complexity. This confirms that the fusion and decision layers are lightweight; the main optimization target for future deployment is the visual detector.

4.13. Qualitative Examples from Simulation

Qualitative examples were used to verify that the numerical results correspond to visually interpretable behavior. Figure 6 shows normal Town05 driving, where the agent remained in CLEAR because detected objects were outside the immediate collision corridor. Figure 7 shows a critical scene with a vehicle approximately 6.7 m in front of the ego vehicle, where the agent selected BRAKE.

5. Discussion

The results indicate that RGB-LiDAR fusion provides complementary safety evidence beyond a vision-only detector. The annotated CARLA subset showed that the fused threat representation achieved 96.2% precision, 97.3% recall, and a 96.7% F1-score. This improvement is mainly caused by the ability of the LiDAR branch to provide geometric evidence for objects that are not reliably classified by the RGB detector. At the same time, the paper avoids interpreting every unmatched LiDAR cluster as a confirmed rare object. The geometric-only category is treated as candidate-level evidence and becomes safety-relevant only after corridor filtering, distance estimation, TTC checking, priority assignment, and temporal stabilization.
The ablation study demonstrates the role of each component. Vision-only processing is computationally simple but misses part of the safety-relevant evidence. LiDAR-only processing improves recall but increases unnecessary braking because semantic context is missing and static infrastructure can be interpreted as a physical obstacle. Fusion improves the availability of relevant evidence, while hysteresis reduces oscillations and unnecessary interventions. The full framework therefore provides the best balance between sensitivity to rare traffic situations and stability during normal driving.
The weather/lighting experiments show that the framework remains operational under controlled CARLA stress conditions, but the behavior is scenario-dependent. For example, ball_boy under light rain and fog produced high emergency-braking shares because the relevant object entered the corridor at short distance, whereas the fallen-tree scenario was often classified as low-risk when the object remained outside the immediate collision corridor. This confirms that the agent responds to spatial relevance rather than to scenario labels alone.
The nuScenes mini experiment should be interpreted cautiously. It demonstrates that the pipeline can be executed on a public multimodal dataset and reveals the expected gap between closed-set YOLO detections and the full nuScenes GT. However, it is not a direct comparison with state-of-the-art 3D detectors or BEV fusion networks. Modern methods such as BEVFormer, PETR, open-vocabulary detectors, and OOD/anomaly detectors may provide stronger perception performance, but they require different training and computational assumptions. The contribution of this paper is a compact, interpretable, CPU-feasible safety-monitoring pipeline rather than a new detector architecture.
Several limitations remain. First, the primary evaluation is simulation-based and should not be interpreted as proof of real-world deployment readiness. CARLA provides repeatability and safety, but it cannot fully reproduce real sensor noise, rolling shutter, LiDAR intensity distributions, weather physics, material reflectance, traffic behavior, and calibration drift. Second, although the manual CARLA validation subset contained 4800 frames, it still covers only approximately 24.9% of the full 19,253-frame log and does not replace full-frame annotation or real-road validation. Third, the nuScenes check is limited to 100 keyframes and does not replace a full benchmark on KITTI, Waymo Open Dataset, SemanticKITTI, or the full nuScenes split. Fourth, the geometry-only subclassification is diagnostic and rule-based; static infrastructure such as guardrails, curbs, poles, and road-surface artifacts can still produce false candidates. Fifth, the TTC model assumes short-term constant velocity and does not model complex interactions, occlusion histories, or pedestrian intent.
Future work will therefore evaluate the pipeline on larger public multimodal datasets, compare it with modern BEV and open-set perception methods, incorporate multi-object tracking and more reliable velocity estimation, and study deployment-oriented calibration and sensor-noise mismatch. A further extension will be to annotate the complete 19,253-frame CARLA log and evaluate the method on longer real-world driving sequences. Another important direction is to replace the current hand-tuned thresholds with a formally verified or data-calibrated safety envelope while preserving interpretability.

6. Conclusions

This paper presented a real-time RGB-LiDAR fusion framework for detecting and reacting to rare traffic situations in CARLA. The system combines YOLOv8n-based semantic perception, BEV LiDAR clustering, decision-level fusion, a rule-based safety agent with hysteresis, TTC-aware escalation, and an AEB override layer. The method distinguishes semantic–geometric detections, semantic-only detections, and geometric-only obstacle candidates, while conservatively treating unmatched LiDAR clusters as candidate-level evidence rather than confirmed rare objects. The CARLA evaluation covered 19,253 frames from three maps and 3CSim-inspired corner-case scenarios. On the annotated validation subset of 4800 frames, the fused threat representation achieved 96.2% precision, 97.3% recall, and a 96.7% F1-score. The full framework reduced unnecessary braking to 1.7% and outperformed vision-only, LiDAR-only, and fusion-without-hysteresis variants by improving critical-event recall and control stability. Additional weather/lighting tests and the nuScenes mini check broadened the analysis beyond the initial CARLA setup, while also making clear that full real-world validation remains future work. The average CPU latency was 34.7 ms per frame, which remained within the 50 ms budget of the 20 Hz simulation. Overall, the results support lightweight RGB-LiDAR fusion with transparent rule-based safety supervision as a reproducible simulation baseline for rare-traffic-situation testing in AD.

Author Contributions

Conceptualization, methodology, and writing—original draft preparation, M.Č., M.D., A.A.F. and T.D.; software implementation and experimental evaluation, M.Č. and T.D.; validation and formal analysis, M.Č., M.D., E.Š., A.A.F. and G.B.; writing—review and editing, M.Č., M.D., E.Š. and G.B.; supervision, G.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Slovak Research and Development Agency project no. APVV-23-0512 and the Slovak Academy of Sciences project no. VEGA 1/0641/26. This work was also funded by the EU NextGenerationEU through the Recovery and Resilience Plan for Slovakia under the project No. 09I03-03-V04-0039.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Nawaz, M.; Tang, J.K.T.; Bibi, K.; Xiao, S.; Ho, H.P.; Yuan, W. Robust Cognitive Capability in Autonomous Driving Using Sensor Fusion Techniques: A Survey. IEEE Trans. Intell. Transp. Syst. 2024, 25, 3228–3243.
  2. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8, version 8.0.0. Computer software, AGPL-3.0 License. Ultralytics: Frederick, MD, USA, 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 4 June 2026).
  3. Bogdoll, D.; Nitsche, M.; Zöllner, J.M. Anomaly Detection in Autonomous Driving: A Survey. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 4487–4498.
  4. Han, X.; Zhou, Y.; Chen, K.; Qiu, H.; Qiu, M.; Liu, Y.; Zhang, T. ADS-Lead: Lifelong Anomaly Detection in Autonomous Driving Systems. IEEE Trans. Intell. Transp. Syst. 2023, 24, 1039–1051.
  5. Bogdoll, D.; Breitenstein, J.; Heidecker, F.; Bieshaar, M.; Sick, B.; Fingscheidt, T.; Zöllner, J.M. Description of Corner Cases in Automated Driving: Goals and Challenges. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Virtual, 11–17 October 2021; pp. 1023–1028.
  6. Fu, D.; Li, X.; Wen, L.; Dou, M.; Cai, P.; Shi, B.; Qiao, Y. Drive Like a Human: Rethinking Autonomous Driving with Large Language Models. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 1–6 January 2024; pp. 910–919.
  7. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
  8. Cui, Y.; Chen, R.; Chu, W.; Chen, L.; Tian, D.; Li, Y.; Cao, D. Deep Learning for Image and Point Cloud Fusion in Autonomous Driving: A Review. IEEE Trans. Intell. Transp. Syst. 2022, 23, 722–739.
  9. Khan, M.A.; Menouar, H.; Abdallah, M.; Abu-Dayya, A. LiDAR in Connected and Autonomous Vehicles: Perception, Threat Model, and Defense. IEEE Trans. Intell. Veh. 2025, 10, 5023–5041.
  10. Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. arXiv 2019, arXiv:1904.01416.
  11. Zhu, X.; Wang, L.; Zhou, C.; Cao, X.; Gong, Y.; Chen, L. A survey on deep learning approaches for data integration in autonomous driving system. arXiv 2023, arXiv:2306.11740.
  12. Tian, Y.; Wang, K.; Wang, Y.; Tian, Y.; Wang, Z.; Wang, F.Y. Adaptive and azimuth-aware fusion network of multimodal local features for 3D object detection. Neurocomputing 2020, 411, 32–44.
  13. Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. CARLA: An Open Urban Driving Simulator. In Proceedings of the 1st Annual Conference on Robot Learning, PMLR, Mountain View, CA, USA, 13–15 November 2017; Proceedings of Machine Learning Research. Volume 78, pp. 1–16.
  14. Čavojsky, M.; Slapak, E.; Dopiriak, M.; Bugar, G.; Gazda, J. 3CSim: CARLA Corner Case Simulation for Control Assessment in Autonomous Driving. In Proceedings of the 2024 IEEE 8th International Conference on Information and Communication Technology (CICT), Prayagraj, India, 6–8 December 2024; pp. 1–6.
  15. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the CVPR 2020, Seattle, WA, USA, 13–19 June 2020.
  16. Li, Y.; Wang, Y.; Wang, W.; Lin, D.; Li, B.; Yap, K.H. Open World Object Detection: A Survey. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 988–1008.
  17. Lv, X.; Zhang, S.; Xing, Y.; Xu, D.; Wang, P.; Zhang, Y. Knowing the Unknown: Interpretable Open-World Object Detection via Concept Decomposition Model. arXiv 2026, arXiv:2602.20616.
  18. Zhu, C.; Chen, L. A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future. arXiv 2024, arXiv:2307.09220.
  19. Lu, S.; Wang, Y.; Sheng, L.; He, L.; Zheng, A.; Liang, J. Out-of-Distribution Detection: A Task-Oriented Survey of Recent Advances. arXiv 2025, arXiv:2409.11884.
  20. Liu, Y.; Wang, T.; Zhang, X.; Sun, J. PETR: Position Embedding Transformation for Multi-View 3D Object Detection. arXiv 2022, arXiv:2203.05625.
  21. Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Yu, Q.; Dai, J. BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers. arXiv 2022, arXiv:2203.17270.
  22. Zhao, R.; Tang, S.H.; Shen, J.; Supeni, E.E.B.; Rahim, S.A. Enhancing autonomous driving safety: A robust traffic sign detection and recognition model TSD-YOLO. Signal Process. 2024, 225, 109619.
  23. Li, N.; Pan, W.; Xu, B.; Liu, H.; Dai, S.; Xu, C. IHENet: An Illumination Invariant Hierarchical Feature Enhancement Network for Low-Light Object Detection. Multimed. Syst. 2025, 31, 407.
  24. Li, S.; Yang, K.; Wei, Z.; Zheng, Y.; Chen, Z.; Tang, X. A Survey on Interaction-Aware Decision-Making for Autonomous Driving: Challenges, Solutions, and Perspectives. IEEE Trans. Intell. Transp. Syst. 2026, 1–27.
  25. Lu, D.; Du, H.; Wu, Z.; Yang, S. Risk assessment in autonomous driving: A comprehensive survey of risk sources, methodologies, and system architectures. Auton. Intell. Syst. 2025, 5, 24.
  26. Wang, X.; Qi, X.; Wang, P.; Yang, J. Decision Making Framework for Autonomous Vehicles Driving Behavior in Complex Scenarios via Hierarchical State Machine. Auton. Intell. Syst. 2021, 1, 10.
  27. Noh, S.; An, K. Decision-Making Framework for Automated Driving in Highway Environments. IEEE Trans. Intell. Transp. Syst. 2018, 19, 58–71.
  28. Cai, T.; Liu, Y.; Zhou, Z.; Ma, H.; Zhao, S.Z.; Wu, Z.; Han, X.; Huang, Z.; Ma, J. Driving with Regulation: Trustworthy and Interpretable Decision-Making for Autonomous Driving with Retrieval-Augmented Reasoning. arXiv 2025, arXiv:cs.AI/2410.04759.
  29. Tang, X.; Huang, B.; Liu, T.; Lin, X. Highway Decision-Making and Motion Planning for Autonomous Driving via Soft Actor-Critic. IEEE Trans. Veh. Technol. 2022, 71, 4706–4717.
  30. Yang, K.; Li, S.; Chen, Y.; Cao, D.; Tang, X. Towards Safe Decision-Making for Autonomous Vehicles at Unsignalized Intersections. IEEE Trans. Veh. Technol. 2025, 74, 3830–3842.
  31. Yang, K.; Tang, X.; Qiu, S.; Jin, S.; Wei, Z.; Wang, H. Towards Robust Decision-Making for Autonomous Driving on Highway. IEEE Trans. Veh. Technol. 2023, 72, 11251–11263.
  32. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, Portland, OR, USA, 2–4 August 1996; AAAI Press: Washington, DC, USA, 1996; pp. 226–231.
Figure 1. Overview of the proposed RGB-LiDAR late-fusion pipeline in CARLA. RGB images are processed by YOLOv8 for semantic detections, while LiDAR data are projected, filtered, and clustered to obtain geometric candidates. The fusion module uses IoU matching to classify observations as semantic–geometric, semantic-only, or geometric-only. The nearest relevant threat is then evaluated by a rule-based agent with hysteresis, which selects the safety state and triggers AEB override when needed.
Figure 2. Log-derived visualization of the average number of detections per frame according to the three fusion categories over all recorded frames.
Figure 3. Log-derived visualization of the proportion of frames assigned to each safety-agent state.
Figure 4. Map-level visualization of fusion-category frequencies aggregated by CARLA map. The values reflect both baseline and scenario runs where applicable.
Figure 5. Distance histogram for the nearest geometric-only obstacle candidate in the ego-vehicle corridor. The dashed threshold lines illustrate the relation between candidate distance and the agent’s safety states.
Figure 6. Qualitative example of calm driving in Town05. The object is detected and geometrically supported, but it is not treated as an immediate threat, so the agent remains in CLEAR.
Figure 7. Qualitative example of a transition to BRAKE. The object is located in the ego-vehicle corridor at approximately 6.7 m, which activates intensive braking according to the configured thresholds.
Table 1. Safety-state thresholds and rationale used by the rule-based agent.

| State | Distance Interval | Brake | Interpretation |
| --- | --- | --- | --- |
| CLEAR | d > 30 m | 0.00 | No immediate corridor threat |
| WARN | 20–30 m | 0.00 | Early warning region |
| SLOW | 10–20 m | 0.30 | Preventive speed reduction |
| BRAKE | 5–10 m or TTC < 2.0 s | 0.70 | Strong response to close/closing object |
| EMERGENCY_BRAKE | < 5 m or TTC < 1.0 s | 1.00 | Maximum braking command |
Table 2. Hardware and software configuration used in the experiments.

| Component | Specification |
| --- | --- |
| CPU | 12th Gen Intel Core i7-12700H, 14 cores, 20 threads, 2.30 GHz base |
| RAM | 16 GB |
| Dedicated GPU | NVIDIA GeForce RTX, used only for visualization |
| Integrated GPU | Intel Iris Xe Graphics |
| Storage | NVMe SSD |
| Operating system | Windows 11 |
| Python environment | Anaconda, Python 3.10 |
| Inference target | YOLOv8n on CPU |
Table 3. Main implementation parameters used in the CARLA and nuScenes experiments.

| Category | Parameter | Value |
| --- | --- | --- |
| Simulator | CARLA version | 0.9.15 |
| Simulator | Maps | Town10HD_Opt, Town03, Town05 |
| Simulator | Synchronous mode/tick | Yes/0.05 s (20 Hz) |
| Ego vehicle | Model | vehicle.dodge.charger_2020 |
| RGB camera | Resolution/FOV | 1280 × 720/90° |
| LiDAR | Channels/range/rate | 32/50 m/120,000 points/s |
| Detector | Model/confidence/input size | YOLOv8n/0.35/480 px |
| Fusion | Type/matching/IoU | Decision-level/2D IoU/τ_IoU = 0.3 |
| Clustering | Algorithm/min. points/ε | DBSCAN/10/1.0 m |
| Corridor | Longitudinal/lateral/vertical | 2–40 m/±12 m/1.5 to 3.0 m |
| Agent | Distance thresholds | >30/20–30/10–20/5–10/<5 m |
| Agent | Hysteresis | 3 frames |
| Agent | AEB override | brake=1.0 on emergency; graded braking for slow/brake states |
| Cross-platform | Public dataset | nuScenes mini v1.0-mini, 10 scenes/100 keyframes |
Table 4. Compact overview of the CARLA evaluation scope.

| Group | Runs | Frames | Use |
| --- | --- | --- | --- |
| Baseline maps | 3 | 4443 | Normal-driving reference |
| Functional corner-case scenarios | 4 | 5924 | Quantitative and manual validation |
| Non-activated configurations | 6 | 8886 | Logged but excluded from scenario claims |
| Weather/lighting stress tests | 18 | | Robustness analysis across 3 scenarios and 6 conditions |
| Public-dataset check | 10 scenes | 100 keyframes | nuScenes mini cross-platform check |
| Total CARLA recorded | 13 | 19,253 | Aggregate logs |
Table 5. Manual validation on the annotated subset of 4800 frames. TP, FP, and FN are reported at the frame level for safety-relevant obstacle detection inside the ego-vehicle driving corridor.

| Method | Frames | TP | FP | FN | Prec. [%] | Rec. [%] | F1 [%] |
|---|---|---|---|---|---|---|---|
| Vision only | 4800 | 643 | 73 | 122 | 89.8 | 84.1 | 86.8 |
| LiDAR only | 4800 | 719 | 111 | 46 | 86.6 | 94.0 | 90.2 |
| Fusion without hysteresis | 4800 | 733 | 47 | 32 | 94.0 | 95.8 | 94.9 |
| Full proposed framework | 4800 | 744 | 29 | 21 | 96.2 | 97.3 | 96.7 |

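
The percentages in Table 5 follow from the frame-level counts through the standard definitions of precision, recall, and F1-score. The short sketch below recomputes them from the TP, FP, and FN columns; the function name `prf` and the dictionary `TABLE_5` are hypothetical helpers introduced for this check.

```python
def prf(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) as fractions from frame-level counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# (TP, FP, FN) per method, copied from Table 5
TABLE_5 = {
    "Vision only": (643, 73, 122),
    "LiDAR only": (719, 111, 46),
    "Fusion without hysteresis": (733, 47, 32),
    "Full proposed framework": (744, 29, 21),
}

for method, counts in TABLE_5.items():
    p, r, f1 = prf(*counts)
    print(f"{method}: P={100 * p:.1f}  R={100 * r:.1f}  F1={100 * f1:.1f}")
```

In every row TP + FN equals 765, so all four methods are scored against the same set of positive frames.
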
Table 6. Stratification of the manually annotated validation subset. The rows show marginal distributions by map and weather/illumination condition and should not be summed across strata groups.

| Stratum | Annotated Frames | Share [%] |
|---|---|---|
| Town10HD_Opt | 1600 | 33.3 |
| Town03 | 1600 | 33.3 |
| Town05 | 1600 | 33.3 |
| Clear day | 800 | 16.7 |
| Light rain | 800 | 16.7 |
| Heavy rain | 800 | 16.7 |
| Fog | 800 | 16.7 |
| Night/low illumination | 800 | 16.7 |
| Backlight/low sun | 800 | 16.7 |

Table 7. Scenario-level performance of the full proposed framework on the annotated subset.

| Scenario Group | Annotated Frames | Precision [%] | Recall [%] | F1 [%] |
|---|---|---|---|---|
| Normal traffic/baseline | 800 | 98.1 | 97.2 | 97.6 |
| Fallen tree/static obstacle | 800 | 95.4 | 96.5 | 95.9 |
| Ball or child entering road | 800 | 96.1 | 96.8 | 96.4 |
| Wrong-way vehicle | 800 | 97.0 | 97.2 | 97.1 |
| Emergency vehicle/priority vehicle | 800 | 97.6 | 97.4 | 97.5 |
| Occluded vulnerable road user | 800 | 94.7 | 95.1 | 94.9 |
| Overall | 4800 | 96.2 | 97.3 | 96.7 |

Table 8. Scenario-level results for baseline and functional corner-case scenarios.

| Scenario | S | G | SG | CLEAR [%] | EMER. [%] | Min. dist. [m] |
|---|---|---|---|---|---|---|
| Baseline Town10HD_Opt | 1.10 | 0.24 | 0.25 | 67.9 | 19.0 | 5.8 |
| ball_boy | 1.44 | 0.41 | 0.59 | 22.8 | 5.9 | 4.2 |
| Pedestrian crossing | 0.96 | 0.35 | 0.31 | 54.6 | 11.8 | 3.9 |
| Wrong-way vehicle | 1.62 | 0.29 | 0.18 | 48.7 | 14.6 | 5.1 |
| Emergency vehicle | 1.35 | 0.52 | 0.22 | 61.3 | 8.2 | 6.4 |

Table 9. Weather and lighting stress test: YOLO detections per frame/CLEAR share [%].

| Scenario | Clear Day | Light Rain | Heavy Rain | Fog | Night | Backlight |
|---|---|---|---|---|---|---|
| EMS outgoing | 1.07/76 | 1.69/53 | 0.99/86 | 0.73/77 | 0.89/83 | 1.61/60 |
| ball_boy | 1.16/79 | 1.57/42 | 0.73/95 | 1.52/56 | 0.76/82 | 0.91/95 |
| Fallen tree | 0.65/91 | 0.60/97 | 0.71/94 | 0.42/85 | 0.59/92 | 1.10/85 |

Table 10. Weather and lighting stress test: emergency-brake share [%]/geometric-only candidates per frame.

| Scenario | Clear Day | Light Rain | Heavy Rain | Fog | Night | Backlight |
|---|---|---|---|---|---|---|
| EMS outgoing | 0.0/0.12 | 1.0/0.50 | 1.1/0.28 | 2.5/0.40 | 0.0/0.31 | 8.3/0.64 |
| ball_boy | 0.0/0.89 | 35.5/1.17 | 0.8/0.44 | 23.5/0.45 | 1.6/0.45 | 0.1/0.21 |
| Fallen tree | 0.0/0.70 | 0.2/0.38 | 1.8/0.83 | 3.1/0.48 | 0.1/0.48 | 0.1/0.92 |

Table 11. Performance of the full proposed framework under different weather and illumination conditions. Each condition contains 800 manually annotated frames.

| Condition | Annotated Frames | Precision [%] | Recall [%] | F1 [%] |
|---|---|---|---|---|
| Clear day | 800 | 97.9 | 98.2 | 98.0 |
| Light rain | 800 | 96.9 | 97.5 | 97.2 |
| Heavy rain | 800 | 94.8 | 96.4 | 95.6 |
| Fog | 800 | 95.3 | 95.9 | 95.6 |
| Night/low illumination | 800 | 94.1 | 95.2 | 94.6 |
| Backlight/low sun | 800 | 93.8 | 94.6 | 94.2 |
| Overall | 4800 | 96.2 | 97.3 | 96.7 |

Table 12. Diagnostic subclassification of geometric-only LiDAR candidates.

| Subclass | Criterion | Typical Interpretation |
|---|---|---|
| Dynamic-like obstacle | persistent cluster with measurable centroid displacement | moving pedestrian, vehicle, or unusual object |
| Static-like obstacle | persistent cluster with near-zero motion inside corridor | fallen tree, debris, parked object |
| Boundary infrastructure | cluster close to corridor boundary or road edge | wall, guardrail, curb, pole |
| Road-surface artifact | low-height, small cluster near ground plane | manhole cover, road marking, sparse returns |
| Projection/matching artifact | inconsistent image projection or partial overlap | calibration or occlusion mismatch |

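
The criteria in Table 12 are qualitative. One way they could be encoded as ordered rules is sketched below. Every numeric threshold, the rule order, the `Candidate` fields, and the fallback label are hypothetical choices made for illustration; only the ±12 m lateral corridor limit is taken from Table 3, and the tables do not state that the authors automate this step in this way.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical per-cluster summary of a geometric-only LiDAR candidate."""
    frames_tracked: int          # consecutive frames the cluster was observed
    speed_mps: float             # centroid displacement rate
    lateral_m: float             # lateral offset from the ego axis
    height_m: float              # vertical extent of the cluster
    n_points: int                # number of LiDAR returns
    projection_consistent: bool  # image projection agrees with the cluster


# All thresholds below are illustrative assumptions, not values from the paper.
MIN_PERSIST_FRAMES = 3
MOTION_EPS_MPS = 0.2
CORRIDOR_HALF_WIDTH_M = 12.0   # lateral corridor limit from Table 3
BOUNDARY_MARGIN_M = 1.0
LOW_HEIGHT_M = 0.2
SMALL_CLUSTER_POINTS = 20


def subclassify(c: Candidate) -> str:
    """Apply the Table 12 criteria in an assumed priority order."""
    if not c.projection_consistent:
        return "projection/matching artifact"
    if c.height_m < LOW_HEIGHT_M and c.n_points < SMALL_CLUSTER_POINTS:
        return "road-surface artifact"
    if abs(c.lateral_m) > CORRIDOR_HALF_WIDTH_M - BOUNDARY_MARGIN_M:
        return "boundary infrastructure"
    if c.frames_tracked >= MIN_PERSIST_FRAMES:
        if c.speed_mps > MOTION_EPS_MPS:
            return "dynamic-like obstacle"
        return "static-like obstacle"
    return "unclassified"  # not persistent enough to decide (assumed fallback)


tree = Candidate(frames_tracked=10, speed_mps=0.0, lateral_m=0.4,
                 height_m=0.9, n_points=140, projection_consistent=True)
print(subclassify(tree))  # static-like obstacle
```

Artifact checks are placed first in this sketch so that unreliable clusters are filtered before any motion-based label is assigned; a different ordering would be equally compatible with the table.
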
Table 13. Manual subclassification of geometric-only candidates in the validation subset.

| Geometric-Only Subclass | Candidates | Share [%] | Typical Examples |
|---|---|---|---|
| Safety-relevant physical obstacle | 640 | 70.2 | fallen tree, debris, ball, unrecognized object |
| Static roadside infrastructure | 142 | 15.6 | guardrail, wall edge, curbside structure |
| Road-surface artifact | 82 | 9.0 | manhole cover, road bump, low curb return |
| Projection or matching artifact | 48 | 5.3 | partial cluster, sparse LiDAR return |
| Total | 912 | 100.0 | |

Table 14. Sources of false-positive detections in the LiDAR-only baseline.

| False-Positive Source | Share [%] | Explanation |
|---|---|---|
| Guardrails, walls, and curbs | 42.6 | elongated static structures near the corridor boundary |
| Parked or roadside objects outside the lane | 26.1 | objects close to but not blocking the ego path |
| Road-surface structures | 16.7 | manholes, bumps, low curb returns |
| Vegetation and sparse clutter | 9.3 | irregular point clusters from nearby vegetation |
| Projection or clustering artifacts | 5.3 | fragmented or unstable LiDAR clusters |

Table 15. Ablation study on the annotated subset. Unnecessary braking is defined in Equation (31).

| Variant | Recall [%] | Unnec. Brake [%] | Event Success [%] | Latency [ms] |
|---|---|---|---|---|
| Vision-only | 84.1 | 3.4 | 87.5 | 33.2 |
| LiDAR-only | 94.0 | 8.6 | 95.8 | 1.4 |
| Fusion without hysteresis | 95.8 | 5.2 | 97.9 | 34.6 |
| Full framework | 97.3 | 1.7 | 100.0 | 34.7 |

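
The difference between the last two rows of Table 15 is the 3-frame hysteresis listed in Table 3. A minimal sketch of such a filter is given below, assuming a symmetric rule in which the agent state changes only after the same new raw state has been observed for three consecutive frames. The class name `HysteresisFilter` is hypothetical, and the sketch ignores the TTC-aware escalation and AEB override mentioned in the abstract, which may bypass the delay in the actual agent.

```python
class HysteresisFilter:
    """Commit to a new state only after it persists for `hold_frames` frames."""

    def __init__(self, initial: str, hold_frames: int = 3):
        self.state = initial
        self.hold_frames = hold_frames
        self._candidate = initial
        self._count = 0

    def update(self, raw: str) -> str:
        if raw == self.state:
            # Raw decision agrees with the committed state: drop any candidate.
            self._candidate, self._count = raw, 0
        elif raw == self._candidate:
            self._count += 1
        else:
            # A different competing state restarts the persistence counter.
            self._candidate, self._count = raw, 1
        if self._count >= self.hold_frames:
            self.state, self._count = self._candidate, 0
        return self.state


agent = HysteresisFilter("CLEAR", hold_frames=3)
raw_states = ["CLEAR", "BRAKE", "CLEAR", "BRAKE", "BRAKE", "BRAKE", "BRAKE"]
print([agent.update(s) for s in raw_states])
# ['CLEAR', 'CLEAR', 'CLEAR', 'CLEAR', 'CLEAR', 'BRAKE', 'BRAKE']
```

The isolated single-frame BRAKE in the example is suppressed, which is the kind of flicker that would otherwise count toward unnecessary braking, at the cost of a delay of a few frames (about 0.15 s at the 20 Hz tick) before a persistent change takes effect.
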
Table 16. nuScenes mini cross-platform check. Values are per-frame averages over 100 keyframes.

| Quantity | Objects/Frame or Share |
|---|---|
| YOLO detections | 8.11 |
| nuScenes GT total | 78.19 |
| GT vehicle | 22.42 |
| GT pedestrian | 25.54 |
| GT movable object | 29.87 |
| Confirmed YOLO + LiDAR detections | 0.17 |
| Visual-only YOLO detections | 7.94 |
| LiDAR-only candidates | 1.27 |
| CLEAR share | 14.0% |
| SLOW share | 24.0% |
| BRAKE share | 61.0% |
| EMERGENCY_BRAKE share | 1.0% |

Table 17. Average processing latency per frame on the 12th Gen Intel Core i7-12700H CPU.

| Component | Latency [ms] |
|---|---|
| YOLOv8n | 33.2 |
| LiDAR projection | 0.5 |
| BEV clustering | 0.9 |
| Fusion and agent | 0.1 |
| Total | 34.7 |

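
The real-time claim can be related to the simulator tick in Table 3 with a few lines of arithmetic. The sketch below assumes that the four components in Table 17 run sequentially within one frame and compares their sum with the 0.05 s synchronous tick; the variable names are hypothetical.

```python
# Per-component average latency in milliseconds, copied from Table 17
latency_ms = {
    "YOLOv8n": 33.2,
    "LiDAR projection": 0.5,
    "BEV clustering": 0.9,
    "Fusion and agent": 0.1,
}

TICK_MS = 0.05 * 1000.0  # 20 Hz synchronous tick from Table 3

total_ms = sum(latency_ms.values())           # sequential execution assumed
max_rate_hz = 1000.0 / total_ms               # achievable processing rate
headroom_ms = TICK_MS - total_ms              # slack left within one tick
yolo_share = latency_ms["YOLOv8n"] / total_ms

print(f"total = {total_ms:.1f} ms, rate = {max_rate_hz:.1f} Hz")
print(f"headroom = {headroom_ms:.1f} ms, YOLO share = {100 * yolo_share:.1f} %")
```

Under this assumption the pipeline fits within one 50 ms tick with roughly 15 ms to spare, and the detector accounts for nearly all of the per-frame cost, so the LiDAR branch and the agent add little latency to a vision-only pipeline.
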

Čávojský, M.; Dopiriak, M.; Šlapak, E.; Faruque, A.A.; Doboš, T.; Bugár, G. Real-Time Detection of Rare Traffic Situations Using RGB-LiDAR Fusion and a Rule-Based Safety Agent in CARLA. Appl. Sci. 2026, 16, 6722. https://doi.org/10.3390/app16136722
