Real-Time Detection of Rare Traffic Situations Using RGB-LiDAR Fusion and a Rule-Based Safety Agent in CARLA

Čávojský, Matúš; Dopiriak, Matúš; Šlapak, Eugen; Faruque, Arisha Al; Doboš, Tomáš; Bugár, Gabriel

doi:10.3390/app16136722


by Matúš Čávojský ¹, Matúš Dopiriak ²,*, Eugen Šlapak ², Arisha Al Faruque ³, Tomáš Doboš ² and Gabriel Bugár ¹

¹ Department of Computer Networks, Technical University of Košice, 042 00 Košice, Slovakia

² Department of Computers and Informatics, Technical University of Košice, 042 00 Košice, Slovakia

³ Irvine Valley College, Irvine, CA 92618, USA

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(13), 6722; https://doi.org/10.3390/app16136722 (registering DOI)

Submission received: 7 June 2026 / Revised: 28 June 2026 / Accepted: 2 July 2026 / Published: 5 July 2026


Abstract

Rare and safety-critical traffic situations remain challenging for autonomous driving (AD) because they are underrepresented in common training data and may include objects outside standard detector classes. This paper presents a real-time RGB-LiDAR fusion framework for detecting and reacting to rare traffic situations in CARLA (Car Learning to Act), a reproducible simulator for AD research. The approach combines YOLOv8n-based RGB perception, bird’s-eye-view (BEV) LiDAR clustering, decision-level fusion, an interpretable rule-based safety agent with hysteresis, Time-to-Collision (TTC)-aware escalation, and an automatic emergency braking (AEB) override above the CARLA autopilot. Fused observations are classified as semantic–geometric detections, semantic-only detections, or geometric-only obstacle candidates, where unmatched LiDAR clusters are treated conservatively as candidate-level physical evidence rather than confirmed rare objects. The framework was evaluated on three CARLA maps and 3CSim-inspired corner-case scenarios comprising 19,253 frames, with additional weather/lighting stress tests and a public nuScenes mini cross-platform check. On a manually annotated subset of 4800 CARLA frames, corresponding to approximately 24.9% of the recorded CARLA log, the full framework achieved 96.2% precision, 97.3% recall, and a 96.7% F1-score for safety-relevant threat detection. The control experiments show that the fusion-based safety agent reduced unnecessary braking to 1.7% compared with 8.6% for the LiDAR-only baseline and achieved event-level success on the annotated critical intervals. The proposed CPU-only implementation maintained real-time performance, with an average processing time of 34.7 ms.

Keywords: autonomous driving; automatic emergency braking; corner-case detection; rule-based decision-making; sensor fusion

1. Introduction

Autonomous driving (AD) systems rely on robust perception [1] and decision-making modules to operate safely in dynamic and unpredictable road environments. Their ability to perceive surrounding objects, interpret traffic situations, and react in time is essential for preventing accidents and ensuring reliable vehicle operation. In recent years, deep learning-based perception models have achieved significant progress in object detection [2], semantic segmentation, and scene understanding. In particular, camera-based detectors are capable of recognizing common traffic participants such as vehicles, pedestrians, cyclists, traffic signs, and traffic lights with high accuracy under standard driving conditions. However, despite these advances, autonomous systems still face substantial limitations when encountering rare, unexpected, or safety-critical situations [3,4].

Such situations are commonly referred to as rare traffic situations, out-of-distribution situations, or corner-case scenarios. In this paper, the term rare traffic situation denotes the general safety-critical phenomenon, while corner-case scenario denotes a simulated test configuration. They include events and objects that deviate from normal traffic patterns, such as fallen trees, vehicles moving in the opposite direction, children suddenly entering the road, emergency vehicles, misplaced objects on the road, or other atypical obstacles. Although these events occur less frequently than standard traffic situations, their impact on road safety can be significant. A reliable AD system must therefore be able not only to handle common driving scenarios, but also to detect and react appropriately to rare traffic situations that may represent an immediate risk. This requirement remains challenging because corner-case situations are usually insufficiently represented in common training datasets, which limits the generalization capability of data-driven perception models [5,6].

A key limitation of purely camera-based perception is its dependence on predefined object classes and visual conditions. Standard object detectors, such as models trained on COCO-like datasets [7], can reliably recognize objects that belong to known categories. However, they may fail to semantically identify objects that are not included in their training classes, even if these objects are physically relevant for safe navigation. For example, a fallen tree, a ball on the road, or an unusual obstacle may not be assigned a correct semantic label by the detector, although it should still influence the vehicle’s behavior. Moreover, camera-based perception can be affected by illumination changes, weather conditions, occlusion, motion blur, and viewpoint variations. These limitations show that relying on a single sensor modality may be insufficient in safety-critical traffic scenarios.

For this reason, additional sensor modalities are required to improve the robustness of autonomous perception. LiDAR [8] represents an important complementary sensor because it provides accurate spatial and distance information about the surrounding environment. While RGB cameras provide rich semantic and visual context, LiDAR [9,10] can detect the physical presence and position of objects independently of their visual appearance or semantic category. This is particularly useful for detecting obstacles that do not belong to standard object classes. By projecting LiDAR data into a bird’s-eye-view (BEV) representation [11], the system can reason about the spatial distribution of objects around the ego vehicle and identify potential obstacles in the driving area. Therefore, the fusion of RGB camera data and LiDAR data can improve perception robustness, especially in situations where one modality alone is insufficient [1,12].

Testing safety-critical traffic situations in the real world is difficult, expensive, and potentially dangerous. Rare traffic situations are not easy to reproduce under controlled conditions, and collecting sufficient real-world data for each corner-case type would require substantial effort. Simulation environments provide a practical alternative by enabling repeatable and controllable evaluation of AD systems. In particular, CARLA (Car Learning to Act) [13] provides configurable maps, traffic actors, sensors, weather, lighting, and vehicle-control interfaces, which makes it suitable for testing corner-case scenarios that would be difficult or unsafe to reproduce in real traffic. Beyond its established use in the AD research community, CARLA is also relevant to broader simulation and synthetic-data workflows for AD research, including workflows that may be combined with industrial tools for data generation, scene reconstruction, and closed-loop experimentation. In this work, CARLA is not interpreted as a substitute for real-road validation. It is used as a reproducible and controllable testbed for studying whether lightweight RGB-LiDAR fusion can provide additional safety evidence for rare traffic situations.

This paper addresses the problem of detecting and reacting to rare traffic situations in a simulated AD environment. The proposed system combines RGB-based object detection [2], LiDAR-based BEV clustering, and a rule-based safety agent with hysteresis. The RGB camera is used to provide semantic information through the YOLOv8 detector [2], while LiDAR data are processed to identify spatial clusters corresponding to physical obstacles. The decision module combines these sources of information and determines the appropriate vehicle behavior in safety-critical situations. To improve the stability of decisions, hysteresis is incorporated into the rule-based logic, reducing unnecessary oscillations between different driving actions.

The system is implemented in the CARLA simulator and includes an automatic emergency braking (AEB) override layer above the simulator autopilot. This layer enables the vehicle to react immediately when a critical obstacle is detected in front of the ego vehicle. The proposed approach is evaluated on multiple CARLA maps and several corner-case scenarios leveraging the CARLA Corner Case Simulation (3CSim) framework [14], including situations involving unusual obstacles, vulnerable road users, wrong-way vehicles, and priority vehicles. The main objective is to evaluate whether RGB-LiDAR fusion can improve the detection of objects outside standard detector classes and support reliable agent reactions in safety-critical traffic scenarios.

The main contributions of this paper are summarized as follows:

A real-time RGB-LiDAR perception pipeline for detecting safety-relevant objects and geometric obstacle candidates in CARLA-based AD simulations.
A decision-level fusion strategy that combines camera-based semantic detections with LiDAR-based spatial clustering and explicitly separates semantic–geometric detections, semantic-only detections, and geometric-only obstacle candidates. The novelty is not the existence of unmatched clusters alone, but their integration into a conservative threat-estimation and safety-supervision logic with corridor filtering, distance/TTC reasoning, and hysteresis.
A deterministic safety agent that maps fused perception outputs to interpretable control states and applies an AEB override above the CARLA autopilot. The agent uses distance thresholds, a constant-velocity TTC rule, and hysteresis to reduce unstable transitions and unnecessary braking.
A reproducible evaluation over 13 simulation runs, 19,253 frames, three CARLA maps, and 3CSim-inspired corner-case configurations, complemented by weather/lighting stress tests and a nuScenes mini cross-platform check.
A component ablation and failure-mode analysis showing the complementary contribution of the RGB branch, LiDAR branch, fusion layer, hysteresis, and AEB override, together with an explicit discussion of simulation-only validation and the remaining sim-to-real limitations.

2. Related Work

This section positions the proposed framework with respect to five related research directions: corner-case and simulation-based validation, open-set and anomaly-aware perception, RGB-LiDAR and BEV fusion, robustness under adverse visual conditions, and interpretable safety decision-making. The goal is not only to summarize previous work, but also to clarify the role of the proposed late-fusion safety agent compared with more complex perception architectures.

2.1. Corner Cases, Simulation-Based Validation, and the Sim-to-Real Gap

Rare traffic situations in AD are often described as corner cases, out-of-distribution situations, or long-tail events. They include unusual objects, unexpected behavior of traffic participants, and atypical spatial configurations that are poorly represented in standard training data [3,4,5]. Because these events are safety-critical but infrequent, real-world collection is expensive, difficult to repeat, and may be unsafe. Simulation environments therefore play an important role in early validation because they allow controlled generation of rare traffic situations and repeatable testing of perception and decision-making pipelines. CARLA provides configurable maps, traffic actors, sensors, weather, and vehicle-control interfaces [13]. However, simulation introduces a sim-to-real gap caused by simplified sensor noise, synthetic textures, rendering differences, traffic-behavior assumptions, and calibration mismatch. The present work therefore treats CARLA as a reproducible pre-validation environment and complements it with a limited public-dataset check on nuScenes mini [15]; real-road deployment validation remains future work.

2.2. Geometry-Based Safety Monitoring for Unknown Objects

Modern autonomous perception must address objects and events that are not represented by the detector’s training classes. Open World Object Detection (OWOD), Unknown Object Detection (UOD), Open Vocabulary Detection (OVD), Out-of-Distribution (OOD) detection, and anomaly-detection methods aim to recognize or flag unknown objects rather than forcing every object into a closed label set [16,17,18,19]. These methods are powerful but often require additional training data, specialized confidence calibration, large open-vocabulary models, or image-level anomaly scoring. The proposed framework follows a different objective: it does not try to semantically name every unknown object. Instead, it uses LiDAR geometry to conservatively flag physical obstacle candidates that are not supported by the camera detector and then evaluates them through a transparent safety layer. This makes the method closer to an interpretable safety monitor than to an open-set detector.

2.3. RGB-LiDAR Fusion and BEV-Centric Perception

Camera and LiDAR sensors provide complementary information. RGB images contain rich semantic and appearance cues, whereas LiDAR provides metric depth and geometric evidence that is less dependent on visual texture, illumination, and object category. Fusion methods can be broadly organized into early fusion, feature-level fusion, BEV-level fusion, and decision-level fusion [1,11,12,20]. Recent BEV-centric and transformer-based methods, such as BEVFormer and PETR, project multi-view features or point-aware representations into a shared spatial space to support 3D detection and scene understanding [21]. Such methods are more expressive than the lightweight pipeline used here, but they also require training, calibration, synchronization, and GPU resources. In contrast, the proposed decision-level fusion uses pretrained YOLOv8n detections and LiDAR clusters as modular outputs. This design sacrifices the representational power of feature-level fusion but improves interpretability, CPU feasibility, and ease of failure analysis.

2.4. Robust Visual Perception Under Adverse Conditions

Visual perception can be degraded by small objects, low contrast, low illumination, rain, fog, glare, backlighting, and motion blur. Recent detector modifications, including traffic-sign-oriented YOLO variants such as TSD-YOLO and illumination-robust feature-enhancement networks such as IHENet, aim to improve object detection under challenging visual conditions [22,23]. These works are complementary to the present framework. A stronger detector could be substituted for YOLOv8n in the RGB branch, while the LiDAR branch and safety agent would remain unchanged. The experiments in this manuscript therefore evaluate the safety value of adding geometric evidence to a compact detector rather than claiming state-of-the-art visual recognition.

2.5. Safety Decision-Making and AEB Supervision

Safety decision-making in AD is commonly addressed through rule-based, learning-based, or hybrid approaches [24,25]. Rule-based methods use explicit thresholds, logical conditions, and safety constraints to determine the vehicle response [26,27]. Their main advantage is that each decision can be directly interpreted, which is important for safety-critical systems. Learning-based and vision–language approaches can support higher-level scene interpretation, but they typically require large and diverse datasets and may be difficult to inspect in rare safety-critical situations [28,29,30,31]. The proposed approach belongs to the hybrid direction: object detection and LiDAR clustering provide perception evidence, while the final safety decision is made by a deterministic rule-based agent with distance/TTC checks, hysteresis, and an AEB override.

3. Materials and Methods

This section presents the proposed framework for detecting rare and safety-critical traffic situations in simulated AD. The framework combines RGB-based object detection, LiDAR-based geometric perception, decision-level sensor fusion, and an interpretable rule-based safety agent. The main objective is to detect common road users recognized by a pretrained object detector and to generate conservative geometric obstacle candidates for physical structures that are not associated with a visual detection. These geometric-only observations are not considered confirmed rare objects by themselves; they are treated as potential unclassified obstacle candidates that require spatial filtering and decision-level interpretation in the ego-vehicle corridor.

The overall processing pipeline is organized into four main layers: perception, late fusion, decision-making, and action execution. At each simulation step t, the ego vehicle receives an RGB image and a LiDAR point cloud from the CARLA simulator. These inputs are processed independently and subsequently fused at the decision level. The resulting fused representation is then evaluated by a rule-based agent, which determines the safety state of the ego vehicle and optionally triggers an AEB action, as illustrated in Figure 1.

3.1. Sensor Input Representation

Let the synchronized sensor input at time step $t$ be defined as

$$S_{t} = \{I_{t}, P_{t}\}, \tag{1}$$

where $I_{t} \in \mathbb{R}^{H \times W \times 3}$ denotes the RGB image captured by the front-facing camera, and $P_{t} = \{p_{i}\}_{i=1}^{N_{t}}$ denotes the LiDAR point cloud. Each LiDAR point is represented as

$$p_{i} = (x_{i}, y_{i}, z_{i}, r_{i}), \tag{2}$$

where $(x_{i}, y_{i}, z_{i})$ are the 3D coordinates of the point in the LiDAR coordinate frame and $r_{i}$ denotes the returned intensity.

The RGB image provides semantic information about visible objects, whereas the LiDAR point cloud provides metric information about object distance and spatial structure. The complementary nature of these two modalities is essential for detecting rare traffic situations, especially when the visual detector does not recognize an object but the LiDAR sensor still captures its physical presence.

3.2. RGB-Based Object Detection

The RGB image $I_{t}$ is processed by an object detector based on YOLOv8. The detector produces a set of visual detections:

$$Y_{t} = \{y_{j}\}_{j=1}^{M_{t}}, \tag{3}$$

where each detection is defined as

$$y_{j} = (b_{j}^{Y}, c_{j}, s_{j}). \tag{4}$$

Here, $b_{j}^{Y} = (u_{1,j}, v_{1,j}, u_{2,j}, v_{2,j})$ denotes the 2D bounding box in the image plane, $c_{j}$ is the predicted object class, and $s_{j} \in [0, 1]$ is the confidence score.

Since the detector is pretrained on a finite set of object classes, it is effective for common traffic participants such as vehicles, pedestrians, bicycles, traffic lights, and traffic signs. However, rare or unusual objects, such as fallen trees, construction objects, debris, or other non-standard obstacles, may not be classified correctly. Therefore, visual detection alone is insufficient for robust corner-case detection.

3.3. LiDAR Projection, Corridor Filtering, and BEV Clustering

The LiDAR point cloud is first transformed from the LiDAR coordinate frame into the camera coordinate frame and then projected into the image plane. For a LiDAR point $p_{i}$, the projection can be expressed as

$$\tilde{q}_{i} = K \, T_{C \leftarrow L} \begin{bmatrix} x_{i} & y_{i} & z_{i} & 1 \end{bmatrix}^{T}, \tag{5}$$

where $T_{C \leftarrow L}$ is the rigid transformation from the LiDAR frame to the camera frame and $K$ is the intrinsic matrix of the camera. The projected image coordinates are obtained by homogeneous normalization:

$$q_{i} = (u_{i}, v_{i}) = \left( \frac{\tilde{q}_{i,x}}{\tilde{q}_{i,z}}, \frac{\tilde{q}_{i,y}}{\tilde{q}_{i,z}} \right). \tag{6}$$
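The projection in Equations (5) and (6) can be illustrated with a short sketch. It is not the authors' implementation: the function names are hypothetical, the matrices are plain nested lists, and it assumes that the 3×4 transform already includes any axis-convention change between the LiDAR and camera frames, so that the camera z-axis points forward.

```python
from typing import List, Optional, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def matvec(m: Matrix, v: Sequence[float]) -> List[float]:
    """Multiply a row-major matrix by a column vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def project_point(
    point_xyz: Tuple[float, float, float],
    K: Matrix,                 # 3x3 camera intrinsic matrix
    T_cam_from_lidar: Matrix,  # 3x4 rigid transform [R | t], LiDAR -> camera
) -> Optional[Tuple[float, float]]:
    """Return pixel coordinates (u, v), or None for points behind the camera."""
    x, y, z = point_xyz
    # Eq. (5): homogeneous LiDAR point -> camera frame -> image plane.
    cam = matvec(T_cam_from_lidar, [x, y, z, 1.0])
    q = matvec(K, cam)
    if q[2] <= 0.0:
        # Non-positive depth: the point cannot be normalized meaningfully.
        return None
    # Eq. (6): homogeneous normalization by the depth component.
    return (q[0] / q[2], q[1] / q[2])
```

With a hypothetical intrinsic matrix (focal length 800 px, principal point at (640, 360)) and an identity transform, a point 10 m ahead and 1 m to the side projects to pixel (720, 400). Points with non-positive depth are discarded rather than projected.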

For safety reasoning, LiDAR points are additionally filtered in the ego-vehicle coordinate frame. The driving corridor is defined as

$$P_{t}^{corr} = \{p_{i} \in P_{t} \mid x_{min} \leq x_{i} \leq x_{max}, \; |y_{i}| \leq y_{max}, \; z_{min} \leq z_{i} \leq z_{max}\}, \tag{7}$$

where $x$ is the longitudinal forward axis of the ego vehicle, $y$ is the lateral axis, and $z$ is the vertical axis. In the experiments, the corridor was set to 2–40 m longitudinally, ±12 m laterally, and −1.5 to 3.0 m vertically. This wider corridor is used for perception and candidate logging, while the rule-based agent later prioritizes only objects that are relevant to the ego lane and immediate collision risk.
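A minimal sketch of the corridor filter in Equation (7), using the bounds reported above. The function and constant names are hypothetical, and points are assumed to be already expressed in the ego-vehicle frame.

```python
from typing import Iterable, List, Tuple

# (x, y, z, intensity) in the ego-vehicle frame, following Eq. (2).
Point = Tuple[float, float, float, float]

# Corridor bounds reported for the experiments, in metres.
X_MIN, X_MAX = 2.0, 40.0   # longitudinal range
Y_MAX = 12.0               # lateral half-width
Z_MIN, Z_MAX = -1.5, 3.0   # vertical range


def in_corridor(p: Point) -> bool:
    """Eq. (7): axis-aligned box test on a single point."""
    x, y, z, _intensity = p
    return X_MIN <= x <= X_MAX and abs(y) <= Y_MAX and Z_MIN <= z <= Z_MAX


def filter_corridor(points: Iterable[Point]) -> List[Point]:
    """Keep only the points that lie inside the driving corridor."""
    return [p for p in points if in_corridor(p)]
```

Points closer than 2 m ahead, farther than 40 m, or more than 12 m to either side are dropped before clustering, which keeps the later BEV clustering step small.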

The filtered LiDAR points are clustered in BEV using DBSCAN [32]. Let $\rho_{i} = (x_{i}, y_{i})$ denote the horizontal projection of each point. The clustering operation is

$$L_{t} = \mathrm{DBSCAN}(\{\rho_{i}\}, \epsilon, N_{min}), \tag{8}$$

with distance threshold $\epsilon = 1.0$ m and minimum cluster size $N_{min} = 10$. Each cluster $l_{k}$ is represented by its projected 2D bounding box $b_{k}^{L}$, its centroid $\mu_{k}$, its physical size $s_{k} = (\ell_{k}, w_{k}, h_{k})$, and its nearest longitudinal distance from the ego vehicle together with its centroid distance:

$$l_{k} = (b_{k}^{L}, \mu_{k}, s_{k}, d_{k}^{front}, d_{k}^{cent}), \tag{9}$$

$$d_{k}^{front} = \max\left(0, \min_{p_{i} \in l_{k}} x_{i}\right), \qquad d_{k}^{cent} = \sqrt{\mu_{k,x}^{2} + \mu_{k,y}^{2}}. \tag{10}$$

Equation (10) clarifies the distance used in the paper. The Euclidean centroid distance $d_{k}^{cent}$ is used for reporting and visualization, whereas the safety agent uses the front-edge longitudinal distance $d_{k}^{front}$ because braking depends on the closest point of a candidate in front of the ego vehicle.
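The two distances in Equation (10) can be sketched for a single cluster as follows. The sketch assumes cluster membership is already known (for example, from a DBSCAN labelling step), and it computes the physical size as an axis-aligned extent, which is an assumption because the text does not state how the size is measured. All names are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Point3 = Tuple[float, float, float]  # (x, y, z) in the ego-vehicle frame


def cluster_descriptor(points: Sequence[Point3]) -> Dict[str, object]:
    """Summarize one BEV cluster with the distance terms of Eq. (10)."""
    if not points:
        raise ValueError("a cluster must contain at least one point")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    zs = [p[2] for p in points]
    n = len(points)
    centroid = (sum(xs) / n, sum(ys) / n)
    return {
        "centroid": centroid,
        # Assumption: size is the axis-aligned extent (length, width, height).
        "size": (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs)),
        # Front-edge longitudinal distance, clipped at zero.
        "d_front": max(0.0, min(xs)),
        # Euclidean distance of the BEV centroid from the ego origin.
        "d_cent": math.hypot(centroid[0], centroid[1]),
    }
```

For a cluster spanning 10 m to 14 m ahead of the vehicle, `d_front` is 10 m while `d_cent` is 12 m, so a braking rule based on the front edge reacts 2 m earlier than one based on the centroid.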

To reduce the risk that static infrastructure is treated as a safety-relevant unknown object, each geometric-only cluster is further assigned a diagnostic subclass. A cluster is marked as dynamic-like when it can be associated with a cluster in the previous frame and has a longitudinal or lateral velocity above a small threshold. Otherwise, it is marked as static-like. Very small ground-level clusters are marked as road-surface artifacts, and clusters near the corridor boundary are marked as boundary infrastructure. These subclasses are used for error analysis and do not replace the conservative safety logic.

3.4. Decision-Level RGB-LiDAR Fusion

The proposed system uses decision-level, or late, fusion. Instead of merging raw sensor data or intermediate features, the framework compares the final outputs of the RGB and LiDAR perception branches. This design is computationally efficient, modular, and interpretable.

To associate RGB detections with LiDAR clusters, the intersection-over-union (IoU) between a YOLO bounding box $b_{j}^{Y}$ and a projected LiDAR bounding box $b_{k}^{L}$ is computed as

$$\mathrm{IoU}(b_{j}^{Y}, b_{k}^{L}) = \frac{|b_{j}^{Y} \cap b_{k}^{L}|}{|b_{j}^{Y} \cup b_{k}^{L}|}. \tag{11}$$

A LiDAR cluster is considered semantically supported if at least one YOLO detection overlaps with it above a predefined threshold $\tau_{IoU}$. The best overlap of cluster $l_{k}$ with any detection is

$$m_{k} = \max_{y_{j} \in Y_{t}} \mathrm{IoU}(b_{j}^{Y}, b_{k}^{L}). \tag{12}$$

The fusion category of each LiDAR cluster is then defined as

$$f(l_{k}) = \begin{cases} \text{semantic–geometric}, & \text{if } m_{k} \geq \tau_{IoU}, \\ \text{geometric-only}, & \text{if } m_{k} < \tau_{IoU}. \end{cases} \tag{13}$$

Similarly, a YOLO detection is classified as semantic-only if it has no corresponding LiDAR support:

$$f(y_{j}) = \begin{cases} \text{semantic–geometric}, & \text{if } \max_{l_{k} \in L_{t}} \mathrm{IoU}(b_{j}^{Y}, b_{k}^{L}) \geq \tau_{IoU}, \\ \text{semantic-only}, & \text{otherwise}. \end{cases} \tag{14}$$

Thus, the fused perception output at time $t$ is represented as

$$F_{t} = \{SG_{t}, S_{t}, G_{t}\}, \tag{15}$$

where $SG_{t}$ denotes semantic–geometric detections, $S_{t}$ denotes semantic-only detections, and $G_{t}$ denotes geometric-only obstacle candidates. The set $G_{t}$ is particularly important because it may indicate a physical object that is not semantically recognized by the RGB detector.
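The association rule in Equations (11) to (14) can be sketched as below. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, boxes are `(u1, v1, u2, v2)` tuples in pixels, and the default threshold of 0.3 is a placeholder because the value of the IoU threshold is not stated in this section.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (u1, v1, u2, v2) in pixels


def iou(a: Box, b: Box) -> float:
    """Eq. (11): intersection area divided by union area."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


def fuse(
    yolo_boxes: Sequence[Box],
    lidar_boxes: Sequence[Box],
    tau_iou: float = 0.3,  # hypothetical placeholder threshold
) -> Dict[str, List[int]]:
    """Return index lists for the three fusion categories."""
    sg_clusters: List[int] = []
    geometric_only: List[int] = []
    for k, lb in enumerate(lidar_boxes):
        # Eq. (12): best overlap of cluster k with any YOLO detection.
        m_k = max((iou(yb, lb) for yb in yolo_boxes), default=0.0)
        # Eq. (13): supported clusters versus geometric-only candidates.
        (sg_clusters if m_k >= tau_iou else geometric_only).append(k)
    semantic_only = [
        j
        for j, yb in enumerate(yolo_boxes)
        # Eq. (14): detections without any supporting LiDAR cluster.
        if max((iou(yb, lb) for lb in lidar_boxes), default=0.0) < tau_iou
    ]
    return {
        "semantic_geometric_clusters": sg_clusters,
        "semantic_only_detections": semantic_only,
        "geometric_only_clusters": geometric_only,
    }
```

Because the two branches are only compared through their final boxes, either branch can be replaced (for example, by a stronger detector) without changing this function.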

3.5. Threat Estimation

The safety agent evaluates objects only if they are relevant to the ego vehicle’s driving corridor. For each fused object $o_{i} \in F_{t}$, the system stores its fusion category, semantic class when available, front-edge longitudinal distance $d_{i}$, and estimated relative speed. The nearest relevant threat is defined as

$$o_{t}^{*} = \arg\min_{o_{i} \in R_{t}} d_{i}, \tag{16}$$

where $R_{t}$ is the set of relevant objects located in the ego-vehicle corridor. The corresponding minimum threat distance is

$$d_{t}^{*} = \min_{o_{i} \in R_{t}} d_{i}. \tag{17}$$

Objects are prioritized according to their safety relevance. Geometric-only obstacle candidates inside the corridor are assigned high priority because they may represent physical obstacles that are not semantically recognized by the RGB detector. However, they are not interpreted as confirmed rare objects. Semantic–geometric vulnerable road users, such as pedestrians and cyclists, are assigned the next priority, followed by semantic–geometric vehicles. Semantic-only detections are treated with lower priority unless they persist over multiple frames or are spatially close.
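The nearest-threat selection in Equations (16) and (17) reduces to a filtered minimum. The sketch below assumes a simple lateral-offset test for corridor relevance; the half-width of 2.0 m, the class name, and the field names are hypothetical, since the exact relevance criterion is not given here, and the category-based prioritization described above is not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical half-width of the ego-lane corridor, in metres.
EGO_LANE_HALF_WIDTH_M = 2.0


@dataclass(frozen=True)
class FusedObject:
    category: str    # "semantic-geometric", "semantic-only", "geometric-only"
    d_front: float   # front-edge longitudinal distance in metres
    lateral: float   # lateral offset from the ego centreline in metres


def nearest_threat(
    objects: Sequence[FusedObject],
    half_width: float = EGO_LANE_HALF_WIDTH_M,
) -> Optional[FusedObject]:
    """Eq. (16): closest relevant object, or None when R_t is empty."""
    relevant = [o for o in objects if abs(o.lateral) <= half_width]
    return min(relevant, key=lambda o: o.d_front, default=None)
```

Returning `None` for an empty relevant set corresponds to the case in which no threat distance exists and the agent remains in its lowest-severity state.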

To address rapidly approaching targets, the rule-based distance thresholds are complemented with a constant-velocity TTC estimate. For an object $o_{i}$, the relative closing speed is approximated by the change in the front-edge distance over consecutive frames:

$$v_{i,t}^{rel} = \max\left(0, \frac{d_{i,t-1} - d_{i,t}}{\Delta t}\right), \tag{18}$$

and the TTC is defined as

$$\mathrm{TTC}_{i,t} = \begin{cases} \dfrac{d_{i,t}}{v_{i,t}^{rel} + \varepsilon}, & v_{i,t}^{rel} > 0, \\ +\infty, & v_{i,t}^{rel} = 0, \end{cases} \tag{19}$$

where $\varepsilon$ avoids numerical instability. The TTC value is not used as a learned predictor; it is a simple constant-speed safety check that can escalate the decision state when an object approaches quickly.
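Equations (18) and (19) translate directly into two small functions. The names are hypothetical, and the stabilizer value of 1e-6 is a placeholder because the paper does not report the value used.

```python
import math


def closing_speed(d_prev: float, d_curr: float, dt: float) -> float:
    """Eq. (18): non-negative closing speed from two consecutive distances."""
    return max(0.0, (d_prev - d_curr) / dt)


def time_to_collision(d_curr: float, v_rel: float, eps: float = 1e-6) -> float:
    """Eq. (19): constant-velocity TTC; infinite when the gap is not closing."""
    if v_rel > 0.0:
        return d_curr / (v_rel + eps)
    return math.inf
```

At the 20 Hz simulation rate the frame interval is 0.05 s, so a front-edge distance that shrinks from 20.5 m to 20.0 m in one frame gives a closing speed of about 10 m/s and a TTC of about 2.0 s. An object that is receding or stationary relative to the ego vehicle yields an infinite TTC and therefore never triggers TTC-based escalation.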

3.6. Rule-Based Safety Agent

The decision layer is implemented as an interpretable rule-based agent. The agent maps the nearest threat distance $d_{t}^{*}$ and the minimum TTC value into one of five discrete safety states:

$$z_{t} \in \{\text{CLEAR}, \text{WARN}, \text{SLOW}, \text{BRAKE}, \text{EMERGENCY\_BRAKE}\}. \tag{20}$$

The preliminary state $\hat{z}_{t}$ is determined using distance thresholds:

$$\hat{z}_{t} = \begin{cases} \text{EMERGENCY\_BRAKE}, & d_{t}^{*} < \theta_{E}, \\ \text{BRAKE}, & \theta_{E} \leq d_{t}^{*} < \theta_{B}, \\ \text{SLOW}, & \theta_{B} \leq d_{t}^{*} < \theta_{S}, \\ \text{WARN}, & \theta_{S} \leq d_{t}^{*} < \theta_{W}, \\ \text{CLEAR}, & d_{t}^{*} \geq \theta_{W} \text{ or } R_{t} = \emptyset, \end{cases} \tag{21}$$

where $\theta_{E}$, $\theta_{B}$, $\theta_{S}$, and $\theta_{W}$ denote the emergency braking, braking, slowing, and warning thresholds, respectively:

$$\theta_{E} < \theta_{B} < \theta_{S} < \theta_{W}. \tag{22}$$

The thresholds are set according to the intended longitudinal control response at 20 Hz. Distances below 5 m are treated as emergency braking because they leave little time for corrective control. The 5–10 m interval activates strong braking, the 10–20 m interval activates speed reduction, and the 20–30 m interval activates warning-level supervision. TTC escalation is then applied as

$$\hat{z}_{t} \leftarrow \begin{cases} \max(\hat{z}_{t}, \text{EMERGENCY\_BRAKE}), & \min_{i} \mathrm{TTC}_{i,t} < T_{E}, \\ \max(\hat{z}_{t}, \text{BRAKE}), & T_{E} \leq \min_{i} \mathrm{TTC}_{i,t} < T_{B}, \\ \hat{z}_{t}, & \text{otherwise}, \end{cases} \tag{23}$$

where $T_{E} = 1.0$ s and $T_{B} = 2.0$ s in the reported experiments. Table 1 summarizes the resulting rule configuration.
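The distance rule of Equation (21) and the TTC escalation of Equation (23) can be sketched with an ordered enumeration, using the thresholds stated in the text (5, 10, 20, and 30 m; 1.0 and 2.0 s). The class and function names are hypothetical, and representing an empty relevant set by `None` is a choice made for this sketch.

```python
from enum import IntEnum
from typing import Optional


class State(IntEnum):
    """Safety states ordered by severity, so max() picks the more severe."""
    CLEAR = 0
    WARN = 1
    SLOW = 2
    BRAKE = 3
    EMERGENCY_BRAKE = 4


# Distance thresholds in metres and TTC thresholds in seconds, from the text.
THETA_E, THETA_B, THETA_S, THETA_W = 5.0, 10.0, 20.0, 30.0
T_E, T_B = 1.0, 2.0


def state_from_distance(d_star: Optional[float]) -> State:
    """Eq. (21): preliminary state; None means no relevant object exists."""
    if d_star is None or d_star >= THETA_W:
        return State.CLEAR
    if d_star < THETA_E:
        return State.EMERGENCY_BRAKE
    if d_star < THETA_B:
        return State.BRAKE
    if d_star < THETA_S:
        return State.SLOW
    return State.WARN


def escalate_with_ttc(state: State, min_ttc: float) -> State:
    """Eq. (23): TTC can only raise the severity, never lower it."""
    if min_ttc < T_E:
        return max(state, State.EMERGENCY_BRAKE)
    if min_ttc < T_B:
        return max(state, State.BRAKE)
    return state
```

For example, an object at 25 m yields WARN from distance alone, but a minimum TTC of 1.5 s raises the preliminary state to BRAKE.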

3.7. Hysteresis for Stable Decision-Making

To prevent unstable switching between states caused by temporary sensor noise or short-term missed detections, the safety agent uses hysteresis. Escalation to a more critical state is applied immediately:

$$z_{t} = \hat{z}_{t} \quad \text{if } \operatorname{sev}(\hat{z}_{t}) > \operatorname{sev}(z_{t-1}), \tag{24}$$

where $\operatorname{sev}(\cdot)$ maps each state to its severity level.

De-escalation is allowed only after the lower-risk state remains stable for $N_{h}$ consecutive simulation ticks:

$$z_{t} = \begin{cases} \hat{z}_{t}, & \text{if } h_{t} \geq N_{h}, \\ z_{t-1}, & \text{otherwise}, \end{cases} \quad \text{if } \operatorname{sev}(\hat{z}_{t}) < \operatorname{sev}(z_{t-1}), \tag{25}$$

where $h_{t}$ is the number of consecutive ticks for which the less severe state has been observed. This mechanism ensures that the agent reacts quickly to danger while returning to normal driving only after the scene is consistently safe.
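A minimal sketch of the hysteresis rule in Equations (24) and (25) follows. Severities are plain integers, the class name is hypothetical, and the hold length is passed in because its value is not given in this section. The sketch also makes one interpretive assumption: the counter tracks consecutive ticks with any less severe proposal and then adopts the most recent one, whereas the text could also be read as requiring the same lower state on every tick.

```python
class HysteresisFilter:
    """Escalate immediately, de-escalate only after n_hold calm ticks."""

    def __init__(self, n_hold: int, initial_severity: int = 0) -> None:
        self.n_hold = n_hold
        self.state = initial_severity
        self.calm_ticks = 0  # plays the role of h_t in Eq. (25)

    def update(self, proposed: int) -> int:
        if proposed > self.state:
            # Eq. (24): a more severe proposal is applied at once.
            self.state = proposed
            self.calm_ticks = 0
        elif proposed < self.state:
            # Eq. (25): count consecutive less-severe proposals.
            self.calm_ticks += 1
            if self.calm_ticks >= self.n_hold:
                self.state = proposed
                self.calm_ticks = 0
        else:
            # Same severity as the held state: the calm streak is broken.
            self.calm_ticks = 0
        return self.state
```

With a hold length of 3, a single-frame missed detection during braking leaves the braking state unchanged, because the counter resets as soon as the threat is observed again.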

3.8. AEB Override Action

The final layer converts the decision state into a vehicle-control command. In passive mode, the agent only records and visualizes the decision. In active mode, the AEB override modifies the control command of the CARLA autopilot.

The action command is defined as

$$a_{t} = (\alpha_{t}, \beta_{t}), \tag{26}$$

where $\alpha_{t} \in [0, 1]$ denotes throttle and $\beta_{t} \in [0, 1]$ denotes braking intensity. The brake value is assigned according to the agent state:

$$\beta_{t} = \begin{cases} 0.00, & z_{t} = \text{CLEAR}, \\ 0.00, & z_{t} = \text{WARN}, \\ 0.30, & z_{t} = \text{SLOW}, \\ 0.70, & z_{t} = \text{BRAKE}, \\ 1.00, & z_{t} = \text{EMERGENCY\_BRAKE}. \end{cases} \tag{27}$$

When the agent enters a braking state, the throttle is suppressed:

$$\alpha_{t} = \begin{cases} 0, & z_{t} \in \{\text{SLOW}, \text{BRAKE}, \text{EMERGENCY\_BRAKE}\}, \\ \alpha_{t}^{AP}, & \text{otherwise}, \end{cases} \tag{28}$$

where $\alpha_{t}^{AP}$ is the throttle command generated by the CARLA autopilot.

The final control command applied to the ego vehicle is therefore

$$u_{t} = \begin{cases} u_{t}^{AP}, & z_{t} \in \{\text{CLEAR}, \text{WARN}\}, \\ (0, \beta_{t}), & z_{t} \in \{\text{SLOW}, \text{BRAKE}, \text{EMERGENCY\_BRAKE}\}, \end{cases} \tag{29}$$

where $u_{t}^{AP}$ denotes the autopilot command. This formulation allows the autopilot to handle normal driving while the proposed safety layer intervenes only in potentially dangerous situations.
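The override in Equations (27) to (29) amounts to a lookup followed by a conditional replacement of the autopilot command. The sketch below models the command as a `(throttle, brake)` tuple and leaves steering outside its scope; it does not call the CARLA API, and the names are hypothetical.

```python
from typing import Tuple

# Brake intensity per state, as listed in Eq. (27).
BRAKE_BY_STATE = {
    "CLEAR": 0.0,
    "WARN": 0.0,
    "SLOW": 0.3,
    "BRAKE": 0.7,
    "EMERGENCY_BRAKE": 1.0,
}
BRAKING_STATES = frozenset({"SLOW", "BRAKE", "EMERGENCY_BRAKE"})


def apply_aeb_override(
    state: str,
    autopilot_throttle: float,
    autopilot_brake: float,
) -> Tuple[float, float]:
    """Eq. (29): return the (throttle, brake) pair applied to the vehicle."""
    if state in BRAKING_STATES:
        # Eq. (28) suppresses throttle; Eq. (27) sets the brake value.
        return (0.0, BRAKE_BY_STATE[state])
    # CLEAR and WARN leave the autopilot command untouched.
    return (autopilot_throttle, autopilot_brake)
```

In the WARN state the autopilot command passes through unchanged, so warning-level supervision is logged without altering the vehicle's motion.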

3.9. Implementation Parameters and Reproducibility

To improve reproducibility, the hardware and implementation parameters are reported explicitly in Table 2 and Table 3. All CARLA and nuScenes experiments used a 12th Gen Intel Core i7-12700H CPU with 16 GB RAM. YOLOv8n inference was executed on the CPU to demonstrate CPU-only feasibility. The dedicated NVIDIA GeForce RTX GPU was used only for visualization.

The reported thresholds should be interpreted as a transparent experimental configuration for simulation-based safety supervision, not as universally optimal thresholds for a production vehicle. Their purpose is to make the rule layer inspectable and reproducible.

3.10. Frame-Level Logging and Evaluation

For each frame, the framework records the perception and decision outputs:

$$r_{t} = [n_{t}^{SG}, n_{t}^{S}, n_{t}^{G}, d_{t}^{*}, \mathrm{TTC}_{t}, z_{t}, \alpha_{t}, \beta_{t}], \tag{30}$$

where $n_{t}^{SG}$, $n_{t}^{S}$, and $n_{t}^{G}$ denote the number of semantic–geometric detections, semantic-only detections, and geometric-only obstacle candidates, respectively. These per-frame records enable quantitative evaluation of fusion behavior, geometric-only candidate distance, safety-state distribution, braking actions, TTC escalation, and real-time performance.

The unnecessary braking rate is defined only on annotated frames. A braking response is considered unnecessary when the agent selects SLOW, BRAKE, or EMERGENCY_BRAKE while the annotation indicates that no safety-relevant object is present in the ego-vehicle corridor. Let $B_{t} = 1$ denote a braking state and let $A_{t} = 1$ denote an annotated relevant threat in the corridor. The metric is

$$\mathrm{UBR} = 100 \cdot \frac{\sum_{t} \mathbb{1}[B_{t} = 1 \land A_{t} = 0]}{\sum_{t} \mathbb{1}[B_{t} = 1]}. \tag{31}$$

This definition separates false braking caused by perception errors from correct braking caused by annotated threats.
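Equation (31) can be computed from two aligned per-frame flag sequences, as in the sketch below. The function name is hypothetical, and returning 0.0 when no braking frame exists is a convention chosen for this sketch, since the ratio is undefined in that case.

```python
from typing import Sequence


def unnecessary_braking_rate(
    braking: Sequence[bool],  # B_t: agent is in SLOW, BRAKE, or EMERGENCY_BRAKE
    threat: Sequence[bool],   # A_t: annotated relevant threat in the corridor
) -> float:
    """Eq. (31): percentage of braking frames without an annotated threat."""
    if len(braking) != len(threat):
        raise ValueError("braking and threat must have the same length")
    n_braking = sum(1 for b in braking if b)
    if n_braking == 0:
        # Assumption: report 0.0 instead of dividing by zero.
        return 0.0
    n_unnecessary = sum(1 for b, a in zip(braking, threat) if b and not a)
    return 100.0 * n_unnecessary / n_braking
```

Note that the denominator counts braking frames, not all annotated frames, so a frame with a missed threat and no braking affects recall but not this metric.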

Overall, the proposed framework provides a lightweight and interpretable approach for detecting rare traffic situations. By treating unmatched LiDAR clusters as geometric-only obstacle candidates, the system can flag physical structures that are not associated with an RGB detection, while avoiding the stronger claim that every unmatched cluster is a true rare object. The rule-based safety agent then converts these fused perception outputs into transparent and reproducible safety actions.

4. Results

This section reports the evaluation of the proposed RGB-LiDAR fusion framework in CARLA and the additional cross-platform check on nuScenes mini. The results are organized to address scenario coverage, ground-truth (GT) validation, fusion-category behavior, safety-agent behavior, weather/lighting robustness, geometric-only candidate analysis, ablation, public-dataset transfer, and CPU real-time performance.

All numerical values for the CARLA evaluation are taken from executed simulation logs and the manually annotated subset. The manually annotated subset contains 4800 frames selected by stratified sampling across maps, scenarios, weather/lighting conditions, fusion categories, and safety-agent states. This subset is used for the GT validation and control-stability analysis below.

4.1. Experimental Setup and Scenario Coverage

The ego vehicle was equipped with a front-facing RGB camera and a 32-layer LiDAR sensor. RGB frames were processed by YOLOv8n, while LiDAR points were projected into the image plane and clustered in the BEV representation. The simulator was executed at 20 Hz, and each run contained 1481 frames, corresponding to approximately 74 s. The complete CARLA evaluation covered 13 runs and 19,253 frames, as summarized in Table 4.

The functional corner-case subset included a child entering the road with a ball, a pedestrian entering the ego corridor, a wrong-way vehicle, and an emergency vehicle leaving a side road. The manual validation was performed on critical intervals in which the safety-relevant object could affect the ego vehicle. The annotation protocol also includes baseline intervals and adverse weather/lighting cases so that normal driving, rare-object interactions, low-visibility conditions, and braking states are represented in the same validation subset.

4.2. Manual Ground-Truth Validation

A manually annotated validation subset was used to evaluate safety-relevant threat detection. The subset contains 4800 representative CARLA frames, selected using stratified sampling across CARLA maps, scenario types, weather and illumination conditions, fusion categories, and safety-agent states. This protocol prevents the annotated subset from being dominated by normal driving frames and ensures that geometric-only candidates, adverse-weather cases, and emergency-braking situations are represented. Each frame was annotated with the presence or absence of a safety-relevant object in the ego-vehicle corridor. A detection was counted as a true positive when the fused output matched an annotated relevant object and the decision state was consistent with the corresponding distance interval. False positives corresponded to detections or braking responses without an annotated relevant object in the corridor. False negatives corresponded to missed relevant objects or missing escalation when the object entered the critical corridor.

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F_{1} = \frac{2\,TP}{2\,TP + FP + FN}. \tag{32}$$

The resulting frame-level validation metrics are reported in Table 5. The stratified composition of the manually annotated subset is shown in Table 6, and the scenario-level performance of the full framework is reported in Table 7.
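The relation between the counts and the percentages in Table 5 can be checked with a short sketch that applies Equation (32) to the reported TP, FP, and FN values. The helper name `prf1` and the dictionary layout are assumptions for this example; the counts themselves are copied from Table 5.

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return precision, recall, and F1 in percent, following Equation (32)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return 100 * precision, 100 * recall, 100 * f1


# (TP, FP, FN) per method, as reported in Table 5.
table5_counts = {
    "Vision only": (643, 73, 122),
    "LiDAR only": (719, 111, 46),
    "Fusion without hysteresis": (733, 47, 32),
    "Full proposed framework": (744, 29, 21),
}

for method, counts in table5_counts.items():
    p, r, f = prf1(*counts)
    print(f"{method}: precision={p:.1f} recall={r:.1f} F1={f:.1f}")
```

Rounded to one decimal place, the computed values agree with the percentages listed in Table 5 for all four methods.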

The validation confirms the complementary behavior of the two sensor modalities. The vision-only baseline remained relatively precise but missed several unusual or visually degraded objects. The LiDAR-only baseline achieved high recall, but it produced more false positives because it lacked semantic context. The full framework achieved the best balance between precision and recall by combining semantic support, geometric evidence, corridor filtering, TTC-aware escalation, and hysteresis.

4.3. Fusion Category Distribution

The first log-derived analysis evaluates the distribution of the three fusion categories over all recorded CARLA frames, as summarized in Figure 2. Semantic–geometric detections formed the largest group, with an average of 0.78 detections per frame. Semantic-only detections reached 0.46 detections per frame, and geometric-only candidates reached 0.47 candidates per frame. Thus, geometric-only candidates represented approximately 27% of all fused observations.

The geometric-only category should be interpreted as candidate-level evidence, not as a confirmed rare-object category. These candidates are valuable because they represent physical structures detected by LiDAR that are not associated with a YOLO detection. However, they may also contain static infrastructure, projection uncertainty, or partial objects. Therefore, the safety agent evaluates them only after corridor filtering, distance estimation, TTC checking, and temporal stabilization.
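The three-way categorization can be sketched as follows, using the 2D IoU threshold of 0.3 from Table 3. This is a minimal illustration, not the implementation used in the experiments: the greedy one-to-one matching order, the inclusive comparison with the threshold, the function names `iou` and `categorize`, and the example boxes are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 2D boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def categorize(yolo: List[Box], lidar: List[Box], tau: float = 0.3) -> Tuple[int, int, int]:
    """Return (n_SG, n_S, n_G) for one frame.

    Assumption: pairs are matched greedily by descending IoU, one-to-one.
    """
    pairs = sorted(
        ((iou(y, c), i, j) for i, y in enumerate(yolo) for j, c in enumerate(lidar)),
        reverse=True,
    )
    used_yolo, used_lidar = set(), set()
    for score, i, j in pairs:
        if score < tau:
            break  # remaining pairs overlap too little to count as a match
        if i in used_yolo or j in used_lidar:
            continue
        used_yolo.add(i)
        used_lidar.add(j)
    n_sg = len(used_yolo)
    return n_sg, len(yolo) - n_sg, len(lidar) - n_sg


# Illustrative boxes: one overlapping pair, one unmatched box per modality.
yolo_boxes = [(100, 100, 200, 200), (400, 100, 450, 180)]
lidar_boxes = [(110, 105, 205, 210), (600, 300, 680, 360)]
print(categorize(yolo_boxes, lidar_boxes))  # (1, 1, 1)
```

The unmatched LiDAR box in the example is counted only as a geometric-only candidate; whether it matters for safety is decided later by the corridor, distance, TTC, and stabilization checks described above.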

4.4. Safety-Agent State Distribution

The rule-based agent produced five interpretable safety states: CLEAR, WARN, SLOW, BRAKE, and EMERGENCY_BRAKE. Figure 3 provides the corresponding visualization. The agent remained in CLEAR for 73.9% of frames, which indicates that it did not brake continuously during normal driving. Stronger states were activated when relevant objects entered the safety corridor.

The relatively low proportion of WARN frames is explained by the configured distance thresholds and hysteresis logic. Escalation to more severe states is immediate, while de-escalation is delayed until the lower-risk state is stable. This behavior is desirable for safety supervision because a close object should cause a fast response, whereas recovery should be conservative.
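The asymmetric behavior described above (immediate escalation, delayed de-escalation) can be sketched with the distance and TTC thresholds from Table 1 and the 3-frame hysteresis from Table 3. The class and function names, the half-open interval boundaries, and the exact counting rule for de-escalation are assumptions of this sketch rather than details confirmed by the implementation.

```python
from enum import IntEnum
from typing import Optional


class State(IntEnum):
    CLEAR = 0
    WARN = 1
    SLOW = 2
    BRAKE = 3
    EMERGENCY_BRAKE = 4


# Brake commands per state, as listed in Table 1.
BRAKE_CMD = {State.CLEAR: 0.0, State.WARN: 0.0, State.SLOW: 0.3,
             State.BRAKE: 0.7, State.EMERGENCY_BRAKE: 1.0}


def proposed_state(distance_m: float, ttc_s: Optional[float] = None) -> State:
    """Map nearest-threat distance and optional TTC to a state (Table 1)."""
    if distance_m < 5.0:
        state = State.EMERGENCY_BRAKE
    elif distance_m < 10.0:
        state = State.BRAKE
    elif distance_m < 20.0:
        state = State.SLOW
    elif distance_m < 30.0:
        state = State.WARN
    else:
        state = State.CLEAR
    if ttc_s is not None:  # TTC can only raise the state, never lower it
        if ttc_s < 1.0:
            state = max(state, State.EMERGENCY_BRAKE)
        elif ttc_s < 2.0:
            state = max(state, State.BRAKE)
    return state


class HysteresisAgent:
    """Escalate immediately; de-escalate after `hold_frames` calmer frames."""

    def __init__(self, hold_frames: int = 3) -> None:
        self.hold_frames = hold_frames
        self.state = State.CLEAR
        self._calm = 0              # consecutive frames proposing a lower state
        self._floor = State.CLEAR   # most severe state proposed while waiting

    def step(self, distance_m: float, ttc_s: Optional[float] = None) -> State:
        target = proposed_state(distance_m, ttc_s)
        if target >= self.state:
            self.state, self._calm, self._floor = target, 0, State.CLEAR
        else:
            self._calm += 1
            self._floor = max(self._floor, target)
            if self._calm >= self.hold_frames:
                # Assumption: recover only to the most severe recent proposal.
                self.state, self._calm, self._floor = self._floor, 0, State.CLEAR
        return self.state


agent = HysteresisAgent()
trace = [agent.step(d).name for d in [35, 8, 35, 35, 35, 35]]
print(trace)  # ['CLEAR', 'BRAKE', 'BRAKE', 'BRAKE', 'CLEAR', 'CLEAR']
```

In the example trace, a single frame at 8 m raises the state to BRAKE at once, while the return to CLEAR happens only on the third consecutive calmer frame.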

4.5. Scenario-Level Corner-Case Results

Table 8 reports the scenario-level fusion statistics, agent-state behavior, and minimum relevant obstacle distance for the baseline and functional corner-case scenarios.

SG, S, and G denote semantic–geometric, semantic-only, and geometric-only detections per frame, respectively. The ball_boy scenario produced the highest geometric-only rate because the small object and the child–ball interaction generated LiDAR-supported evidence that was not always semantically matched by YOLOv8n. The pedestrian-crossing and wrong-way-vehicle scenarios produced stronger emergency-braking rates because the relevant object entered the ego corridor at a short distance.

4.6. Weather and Lighting Robustness

Additional stress tests were conducted across three scenarios and six weather/lighting conditions: clear day, light rain, heavy rain, fog, night, and backlight. Table 9 reports YOLO detections per frame and the CLEAR share, while Table 10 reports the emergency-brake share and LiDAR-only anomaly rate. The goal of this analysis is not to prove real-world robustness, but to verify whether the same pipeline remains stable across controlled CARLA visual conditions.

The stress test shows that the agent response is scenario-dependent. The fallen-tree scenario remained mostly in CLEAR because the object was frequently outside the immediate braking corridor, whereas ball_boy under light rain and fog produced high emergency-brake shares due to short-distance corridor interactions. The geometric-only candidate rate remained below approximately 1.2 LiDAR-only candidates per frame in all listed conditions, indicating that the LiDAR branch did not become uncontrollably active under adverse visual conditions.

In addition to the log-derived weather stress test, the annotations were used to compute frame-level detection performance under each weather and illumination condition. The results in Table 11 show the expected degradation under night and backlight conditions, but the full framework remained stable because the LiDAR branch preserved geometric evidence when RGB confidence decreased.

4.7. Geometry-Only Candidate Analysis and LiDAR-Only Failure Modes

To clarify the meaning of geometric-only candidates, unmatched LiDAR clusters were not treated as confirmed rare objects. They were subclassified using spatial position, size, persistence, and frame-to-frame motion. The diagnostic categories are summarized in Table 12.

This analysis explains the higher unnecessary-braking rate of the LiDAR-only baseline. Without semantic support and hysteresis, static infrastructure and small ground-level clusters can be interpreted as obstacles. Decision-level fusion reduces this effect by retaining LiDAR evidence for unknown physical objects while using semantic support, corridor relevance, TTC, and temporal stability to suppress many spurious reactions.

The annotated subset was also used to quantify the composition of geometric-only candidates. The results in Table 13 show that most geometric-only candidates corresponded to safety-relevant physical obstacles, while the remaining cases were mainly static infrastructure, road-surface artifacts, or projection/matching artifacts. The main sources of false-positive detections in the LiDAR-only baseline are listed in Table 14.

These results explain why the LiDAR-only baseline achieved high recall but also a higher unnecessary-braking rate. The proposed fusion strategy reduces this effect by preserving geometric evidence for unknown obstacles while using semantic support, candidate size, temporal persistence, and hysteresis before triggering stronger control states.

4.8. Ablation Study

The full framework was compared with three simplified variants, as reported in Table 15. The vision-only baseline used YOLOv8n detections without LiDAR support. The LiDAR-only baseline used BEV clusters without semantic support. The fusion-only variant used RGB-LiDAR fusion but disabled hysteresis. The full framework used RGB-LiDAR fusion, threat prioritization, corridor filtering, TTC escalation, and hysteresis.

The vision-only baseline missed several safety-relevant objects because some obstacles were outside the detector’s semantic classes or appeared in unusual configurations. The LiDAR-only baseline improved recall but increased unnecessary braking because it lacked semantic context. Fusion without hysteresis improved event detection but produced more unstable state transitions. On the annotated subset, the full framework achieved the best balance: 97.3% recall, 1.7% unnecessary braking, full event-level success on the annotated corner-case subset, and negligible latency overhead compared with the fusion-only variant.

4.9. Public Dataset Cross-Platform Check on nuScenes Mini

To provide a limited public-dataset sanity check, the perception and fusion logic was applied to nuScenes mini v1.0. The evaluation used 10 scenes and 100 keyframes with the same CPU inference target and LiDAR clustering parameters. The resulting per-frame averages are summarized in Table 16. Because nuScenes and CARLA have different camera/LiDAR layouts, object taxonomies, annotation ranges, and scene compositions, the nuScenes experiment is not presented as a direct benchmark comparison. It is used only to verify that the pipeline can ingest real multimodal data and expose the expected gap between closed-set YOLO detections and public-dataset GT.

The nuScenes check illustrates that closed-set YOLO detections cover only a subset of the annotated objects in dense real scenes. This supports the motivation for using geometric evidence as a conservative safety signal, while also confirming the limitation that the current system is not a replacement for a fully trained 3D detector or a modern BEV fusion network.

4.10. Map-Level Robustness

Figure 4 visualizes the fusion-category frequencies aggregated by CARLA map.

These results support the interpretation that the framework reacts to the spatial structure of the environment rather than producing a fixed response pattern. Corridor filtering is important because LiDAR observes not only vehicles and pedestrians but also walls, curbs, poles, and other static objects that should not necessarily cause braking.
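The corridor filter referred to above can be sketched with the bounds listed in Table 3 (2–40 m longitudinal, ±12 m lateral, −1.5 to 3.0 m vertical). The axis convention (x forward, y lateral, z up, in metres), the inclusive bounds, and the function names are assumptions for this example.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float, float]  # assumed (x forward, y lateral, z up) in metres


def in_corridor(
    point: Point,
    x_range: Tuple[float, float] = (2.0, 40.0),
    y_half_width: float = 12.0,
    z_range: Tuple[float, float] = (-1.5, 3.0),
) -> bool:
    """Return True if the point lies inside the ego-vehicle corridor box."""
    x, y, z = point
    return (
        x_range[0] <= x <= x_range[1]
        and abs(y) <= y_half_width
        and z_range[0] <= z <= z_range[1]
    )


def filter_corridor(points: Iterable[Point]) -> List[Point]:
    """Keep only the points that fall inside the corridor."""
    return [p for p in points if in_corridor(p)]


# Illustrative points: only the first and the last lie inside the corridor.
cloud = [
    (10.0, 0.0, 0.0),     # inside
    (1.0, 0.0, 0.0),      # too close (x < 2 m)
    (10.0, 13.0, 0.0),    # too far to the side
    (10.0, 0.0, 3.5),     # too high
    (40.0, -12.0, -1.5),  # on the boundary, kept
]
print(filter_corridor(cloud))
```

Points belonging to walls, poles, or curbs outside these bounds are discarded before clustering results reach the safety agent, which is why they do not trigger braking in this sketch.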

4.11. Distance-Based Safety Consistency

The distance analysis verified whether braking decisions corresponded to physically meaningful obstacle distances. Geometric-only candidates that triggered strong reactions were concentrated mainly in the 4–12 m range, as shown in Figure 5. The minimum recorded relevant candidate distance was 3.9 m in the pedestrian-crossing scenario and 4.2 m in the ball_boy scenario. These distances fall inside the configured braking and emergency-braking regions.

Across the annotated subset, 97.8% of agent interventions matched the expected threshold interval, while 2.2% were one level more conservative than the annotation due to hysteresis-delayed de-escalation. No annotated critical event was missed at the event level.

4.12. Latency and Real-Time Performance

The latency analysis was performed on the CPU-only implementation described in Table 2. Table 17 shows that YOLOv8n was the dominant computational component, while LiDAR projection, BEV clustering, fusion, and decision-making introduced only minor overhead.

At 20 Hz, the CARLA tick budget is 50 ms. The average total latency of 34.7 ms therefore leaves a margin of 15.3 ms per frame. The observed processing rate ranged from 13 to 18 Hz depending on scene complexity. This confirms that the fusion and decision layers are lightweight; the main optimization target for future deployment is the visual detector.
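The budget arithmetic for the 20 Hz tick can be reproduced with a few lines. The function names are assumptions for this sketch; the tick rate and the average latency are the values reported above.

```python
def tick_budget_ms(rate_hz: float) -> float:
    """Time available per simulation tick, in milliseconds."""
    return 1000.0 / rate_hz


def latency_margin_ms(rate_hz: float, latency_ms: float) -> float:
    """Remaining time per tick after the processing pipeline has run."""
    return tick_budget_ms(rate_hz) - latency_ms


budget = tick_budget_ms(20.0)
margin = latency_margin_ms(20.0, 34.7)
print(f"budget = {budget:.1f} ms, margin = {margin:.1f} ms")
```

A positive margin means that the average frame fits within one synchronous tick; it does not bound the latency of individual frames in complex scenes.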

4.13. Qualitative Examples from Simulation

Qualitative examples were used to verify that the numerical results correspond to visually interpretable behavior. Figure 6 shows normal Town05 driving, where the agent remained in CLEAR because detected objects were outside the immediate collision corridor. Figure 7 shows a critical scene with a vehicle approximately 6.7 m in front of the ego vehicle, where the agent selected BRAKE.

5. Discussion

The results indicate that RGB-LiDAR fusion provides complementary safety evidence beyond a vision-only detector. The annotated CARLA subset showed that the fused threat representation achieved 96.2% precision, 97.3% recall, and a 96.7% F1-score. This improvement is mainly caused by the ability of the LiDAR branch to provide geometric evidence for objects that are not reliably classified by the RGB detector. At the same time, the paper avoids interpreting every unmatched LiDAR cluster as a confirmed rare object. The geometric-only category is treated as candidate-level evidence and becomes safety-relevant only after corridor filtering, distance estimation, TTC checking, priority assignment, and temporal stabilization.

The ablation study demonstrates the role of each component. Vision-only processing is computationally simple but misses part of the safety-relevant evidence. LiDAR-only processing improves recall but increases unnecessary braking because semantic context is missing and static infrastructure can be interpreted as a physical obstacle. Fusion improves the availability of relevant evidence, while hysteresis reduces oscillations and unnecessary interventions. The full framework therefore provides the best balance between sensitivity to rare traffic situations and stability during normal driving.

The weather/lighting experiments show that the framework remains operational under controlled CARLA stress conditions, but the behavior is scenario-dependent. For example, ball_boy under light rain and fog produced high emergency-braking shares because the relevant object entered the corridor at short distance, whereas the fallen-tree scenario was often classified as low-risk when the object remained outside the immediate collision corridor. This confirms that the agent responds to spatial relevance rather than to scenario labels alone.

The nuScenes mini experiment should be interpreted cautiously. It demonstrates that the pipeline can be executed on a public multimodal dataset and reveals the expected gap between closed-set YOLO detections and the full nuScenes GT. However, it is not a direct comparison with state-of-the-art 3D detectors or BEV fusion networks. Modern methods such as BEVFormer, PETR, open-vocabulary detectors, and OOD/anomaly detectors may provide stronger perception performance, but they require different training and computational assumptions. The contribution of this paper is a compact, interpretable, CPU-feasible safety-monitoring pipeline rather than a new detector architecture.

Several limitations remain. First, the primary evaluation is simulation-based and should not be interpreted as proof of real-world deployment readiness. CARLA provides repeatability and safety, but it cannot fully reproduce real sensor noise, rolling shutter, LiDAR intensity distributions, weather physics, material reflectance, traffic behavior, and calibration drift. Second, although the manual CARLA validation subset contained 4800 frames, it still covers only approximately 24.9% of the full 19,253-frame log and does not replace full-frame annotation or real-road validation. Third, the nuScenes check is limited to 100 keyframes and does not replace a full benchmark on KITTI, Waymo Open Dataset, SemanticKITTI, or the full nuScenes split. Fourth, the geometry-only subclassification is diagnostic and rule-based; static infrastructure such as guardrails, curbs, poles, and road-surface artifacts can still produce false candidates. Fifth, the TTC model assumes short-term constant velocity and does not model complex interactions, occlusion histories, or pedestrian intent.

Future work will therefore evaluate the pipeline on larger public multimodal datasets, compare it with modern BEV and open-set perception methods, incorporate multi-object tracking and more reliable velocity estimation, and study deployment-oriented calibration and sensor-noise mismatch. A further extension will be to annotate the complete 19,253-frame CARLA log and evaluate the method on longer real-world driving sequences. Another important direction is to replace the current hand-tuned thresholds with a formally verified or data-calibrated safety envelope while preserving interpretability.

6. Conclusions

This paper presented a real-time RGB-LiDAR fusion framework for detecting and reacting to rare traffic situations in CARLA. The system combines YOLOv8n-based semantic perception, BEV LiDAR clustering, decision-level fusion, a rule-based safety agent with hysteresis, TTC-aware escalation, and an AEB override layer. The method distinguishes semantic–geometric detections, semantic-only detections, and geometric-only obstacle candidates, while conservatively treating unmatched LiDAR clusters as candidate-level evidence rather than confirmed rare objects. The CARLA evaluation covered 19,253 frames from three maps and 3CSim-inspired corner-case scenarios. On the annotated validation subset of 4800 frames, the fused threat representation achieved 96.2% precision, 97.3% recall, and a 96.7% F1-score. The full framework reduced unnecessary braking to 1.7% and outperformed vision-only, LiDAR-only, and fusion-without-hysteresis variants by improving critical-event recall and control stability. Additional weather/lighting tests and the nuScenes mini check broadened the analysis beyond the initial CARLA setup, while also making clear that full real-world validation remains future work. The average CPU latency was 34.7 ms per frame, which remained within the 50 ms budget of the 20 Hz simulation. Overall, the results support lightweight RGB-LiDAR fusion with transparent rule-based safety supervision as a reproducible simulation baseline for rare-traffic-situation testing in AD.

Author Contributions

Conceptualization, methodology, and writing—original draft preparation, M.Č., M.D., A.A.F. and T.D.; software implementation and experimental evaluation, M.Č. and T.D.; validation and formal analysis, M.Č., M.D., E.Š., A.A.F. and G.B.; writing—review and editing, M.Č., M.D., E.Š. and G.B.; supervision, G.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Slovak Research and Development Agency project no. APVV-23-0512 and the Slovak Academy of Sciences project no. VEGA 1/0641/26. This work was also funded by the EU NextGenerationEU through the Recovery and Resilience Plan for Slovakia under the project No. 09I03-03-V04-0039.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

Nawaz, M.; Tang, J.K.T.; Bibi, K.; Xiao, S.; Ho, H.P.; Yuan, W. Robust Cognitive Capability in Autonomous Driving Using Sensor Fusion Techniques: A Survey. IEEE Trans. Intell. Transp. Syst. 2024, 25, 3228–3243.
Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8, version 8.0.0. Computer software, AGPL-3.0 License. Ultralytics: Frederick, MD, USA, 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 4 June 2026).
Bogdoll, D.; Nitsche, M.; Zöllner, J.M. Anomaly Detection in Autonomous Driving: A Survey. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 4487–4498.
Han, X.; Zhou, Y.; Chen, K.; Qiu, H.; Qiu, M.; Liu, Y.; Zhang, T. ADS-Lead: Lifelong Anomaly Detection in Autonomous Driving Systems. IEEE Trans. Intell. Transp. Syst. 2023, 24, 1039–1051.
Bogdoll, D.; Breitenstein, J.; Heidecker, F.; Bieshaar, M.; Sick, B.; Fingscheidt, T.; Zöllner, J.M. Description of Corner Cases in Automated Driving: Goals and Challenges. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Virtual, 11–17 October 2021; pp. 1023–1028.
Fu, D.; Li, X.; Wen, L.; Dou, M.; Cai, P.; Shi, B.; Qiao, Y. Drive Like a Human: Rethinking Autonomous Driving with Large Language Models. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 1–6 January 2024; pp. 910–919.
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
Cui, Y.; Chen, R.; Chu, W.; Chen, L.; Tian, D.; Li, Y.; Cao, D. Deep Learning for Image and Point Cloud Fusion in Autonomous Driving: A Review. IEEE Trans. Intell. Transp. Syst. 2022, 23, 722–739.
Khan, M.A.; Menouar, H.; Abdallah, M.; Abu-Dayya, A. LiDAR in Connected and Autonomous Vehicles: Perception, Threat Model, and Defense. IEEE Trans. Intell. Veh. 2025, 10, 5023–5041.
Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. arXiv 2019, arXiv:1904.01416.
Zhu, X.; Wang, L.; Zhou, C.; Cao, X.; Gong, Y.; Chen, L. A survey on deep learning approaches for data integration in autonomous driving system. arXiv 2023, arXiv:2306.11740.
Tian, Y.; Wang, K.; Wang, Y.; Tian, Y.; Wang, Z.; Wang, F.Y. Adaptive and azimuth-aware fusion network of multimodal local features for 3D object detection. Neurocomputing 2020, 411, 32–44.
Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. CARLA: An Open Urban Driving Simulator. In Proceedings of the 1st Annual Conference on Robot Learning, PMLR, Mountain View, CA, USA, 13–15 November 2017; Proceedings of Machine Learning Research. Volume 78, pp. 1–16.
Čavojsky, M.; Slapak, E.; Dopiriak, M.; Bugar, G.; Gazda, J. 3CSim: CARLA Corner Case Simulation for Control Assessment in Autonomous Driving. In Proceedings of the 2024 IEEE 8th International Conference on Information and Communication Technology (CICT), Prayagraj, India, 6–8 December 2024; pp. 1–6.
Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the CVPR 2020, Seattle, WA, USA, 13–19 June 2020.
Li, Y.; Wang, Y.; Wang, W.; Lin, D.; Li, B.; Yap, K.H. Open World Object Detection: A Survey. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 988–1008.
Lv, X.; Zhang, S.; Xing, Y.; Xu, D.; Wang, P.; Zhang, Y. Knowing the Unknown: Interpretable Open-World Object Detection via Concept Decomposition Model. arXiv 2026, arXiv:2602.20616.
Zhu, C.; Chen, L. A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future. arXiv 2024, arXiv:2307.09220.
Lu, S.; Wang, Y.; Sheng, L.; He, L.; Zheng, A.; Liang, J. Out-of-Distribution Detection: A Task-Oriented Survey of Recent Advances. arXiv 2025, arXiv:2409.11884.
Liu, Y.; Wang, T.; Zhang, X.; Sun, J. PETR: Position Embedding Transformation for Multi-View 3D Object Detection. arXiv 2022, arXiv:2203.05625.
Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Yu, Q.; Dai, J. BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers. arXiv 2022, arXiv:2203.17270.
Zhao, R.; Tang, S.H.; Shen, J.; Supeni, E.E.B.; Rahim, S.A. Enhancing autonomous driving safety: A robust traffic sign detection and recognition model TSD-YOLO. Signal Process. 2024, 225, 109619.
Li, N.; Pan, W.; Xu, B.; Liu, H.; Dai, S.; Xu, C. IHENet: An Illumination Invariant Hierarchical Feature Enhancement Network for Low-Light Object Detection. Multimed. Syst. 2025, 31, 407.
Li, S.; Yang, K.; Wei, Z.; Zheng, Y.; Chen, Z.; Tang, X. A Survey on Interaction-Aware Decision-Making for Autonomous Driving: Challenges, Solutions, and Perspectives. IEEE Trans. Intell. Transp. Syst. 2026, 1–27.
Lu, D.; Du, H.; Wu, Z.; Yang, S. Risk assessment in autonomous driving: A comprehensive survey of risk sources, methodologies, and system architectures. Auton. Intell. Syst. 2025, 5, 24.
Wang, X.; Qi, X.; Wang, P.; Yang, J. Decision Making Framework for Autonomous Vehicles Driving Behavior in Complex Scenarios via Hierarchical State Machine. Auton. Intell. Syst. 2021, 1, 10.
Noh, S.; An, K. Decision-Making Framework for Automated Driving in Highway Environments. IEEE Trans. Intell. Transp. Syst. 2018, 19, 58–71.
Cai, T.; Liu, Y.; Zhou, Z.; Ma, H.; Zhao, S.Z.; Wu, Z.; Han, X.; Huang, Z.; Ma, J. Driving with Regulation: Trustworthy and Interpretable Decision-Making for Autonomous Driving with Retrieval-Augmented Reasoning. arXiv 2025, arXiv:cs.AI/2410.04759.
Tang, X.; Huang, B.; Liu, T.; Lin, X. Highway Decision-Making and Motion Planning for Autonomous Driving via Soft Actor-Critic. IEEE Trans. Veh. Technol. 2022, 71, 4706–4717.
Yang, K.; Li, S.; Chen, Y.; Cao, D.; Tang, X. Towards Safe Decision-Making for Autonomous Vehicles at Unsignalized Intersections. IEEE Trans. Veh. Technol. 2025, 74, 3830–3842.
Yang, K.; Tang, X.; Qiu, S.; Jin, S.; Wei, Z.; Wang, H. Towards Robust Decision-Making for Autonomous Driving on Highway. IEEE Trans. Veh. Technol. 2023, 72, 11251–11263.
Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, Portland, OR, USA, 2–4 August 1996; AAAI Press: Washington, DC, USA, 1996; pp. 226–231.

Figure 1. Overview of the proposed RGB-LiDAR late-fusion pipeline in CARLA. RGB images are processed by YOLOv8 for semantic detections, while LiDAR data are projected, filtered, and clustered to obtain geometric candidates. The fusion module uses IoU matching to classify observations as semantic–geometric, semantic-only, or geometric-only. The nearest relevant threat is then evaluated by a rule-based agent with hysteresis, which selects the safety state and triggers AEB override when needed.

Figure 2. Log-derived visualization of the average number of detections per frame according to the three fusion categories over all recorded frames.

Figure 3. Log-derived visualization of the proportion of frames assigned to each safety-agent state.

Figure 4. Map-level visualization of fusion-category frequencies aggregated by CARLA map. The values reflect both baseline and scenario runs where applicable.

Figure 5. Distance histogram for the nearest geometric-only obstacle candidate in the ego-vehicle corridor. The dashed threshold lines illustrate the relation between candidate distance and the agent’s safety states.

Figure 6. Qualitative example of calm driving in Town05. The object is detected and geometrically supported, but it is not treated as an immediate threat, so the agent remains in CLEAR.

Figure 7. Qualitative example of a transition to BRAKE. The object is located in the ego-vehicle corridor at approximately 6.7 m, which activates intensive braking according to the configured thresholds.

Table 1. Safety-state thresholds and rationale used by the rule-based agent.

State	Distance Interval	Brake Command	Interpretation
`CLEAR`	$d_{t}^{*} \geq 30$ m	0.00	No immediate corridor threat
`WARN`	20–30 m	0.00	Early warning region
`SLOW`	10–20 m	0.30	Preventive speed reduction
`BRAKE`	5–10 m or TTC < 2.0 s	0.70	Strong response to close/closing object
`EMERGENCY_BRAKE`	< 5 m or TTC < 1.0 s	1.00	Maximum braking command

Table 2. Hardware and software configuration used in the experiments.

Component	Specification
CPU	12th Gen Intel Core i7-12700H, 14 cores, 20 threads, 2.30 GHz base
RAM	16 GB
Dedicated GPU	NVIDIA GeForce RTX, used only for visualization
Integrated GPU	Intel Iris Xe Graphics
Storage	NVMe SSD
Operating system	Windows 11
Python environment	Anaconda, Python 3.10
Inference target	YOLOv8n on CPU

Table 3. Main implementation parameters used in the CARLA and nuScenes experiments.

Category	Parameter	Value
Simulator	CARLA version	0.9.15
Simulator	Maps	Town10HD_Opt, Town03, Town05
Simulator	Synchronous mode/tick	Yes/0.05 s (20 Hz)
Ego vehicle	Model	`vehicle.dodge.charger_2020`
RGB camera	Resolution/FOV	$1280 \times 720$/$90^{\circ}$
LiDAR	Channels/range/rate	32/50 m/120,000 points/s
Detector	Model/confidence/input size	YOLOv8n/0.35/480 px
Fusion	Type/matching/IoU	Decision-level/2D IoU/$\tau_{\mathrm{IoU}} = 0.3$
Clustering	Algorithm/min. points/$\epsilon$	DBSCAN/10/1.0 m
Corridor	Longitudinal/lateral/vertical	2–40 m/$\pm 12$ m/$-1.5$ to $3.0$ m
Agent	Distance thresholds	>30/20–30/10–20/5–10/<5 m
Agent	Hysteresis	3 frames
Agent	AEB override	brake=1.0 on emergency; graded braking for slow/brake states
Cross-platform	Public dataset	nuScenes mini v1.0-mini, 10 scenes/100 keyframes

Table 4. Compact overview of the CARLA evaluation scope.

Group	Runs	Frames	Use
Baseline maps	3	4443	Normal-driving reference
Functional corner-case scenarios	4	5924	Quantitative and manual validation
Non-activated configurations	6	8886	Logged but excluded from scenario claims
Weather/lighting stress tests	18	–	Robustness analysis across 3 scenarios and 6 conditions
Public-dataset check	10 scenes	100 keyframes	nuScenes mini cross-platform check
Total CARLA recorded	13	19,253	Aggregate logs

Table 5. Manual validation on the annotated subset of 4800 frames. TP, FP, and FN are reported at the frame level for safety-relevant obstacle detection inside the ego-vehicle driving corridor.

Method	Frames	TP	FP	FN	Prec. [%]	Rec. [%]	F1 [%]
Vision only	4800	643	73	122	89.8	84.1	86.8
LiDAR only	4800	719	111	46	86.6	94.0	90.2
Fusion without hysteresis	4800	733	47	32	94.0	95.8	94.9
Full proposed framework	4800	744	29	21	96.2	97.3	96.7
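
The percentage columns follow directly from the frame-level counts. The sketch below recomputes them with the standard definitions of precision, recall, and F1; the helper name `precision_recall_f1` is hypothetical. Note that TP + FN equals 765 in every row, so all four methods are scored against the same set of positive frames.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return precision, recall, and F1 in percent, rounded to one decimal."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return tuple(round(100.0 * value, 1) for value in (precision, recall, f1))


# Frame-level counts (TP, FP, FN) copied from the table above.
COUNTS = {
    "Vision only": (643, 73, 122),
    "LiDAR only": (719, 111, 46),
    "Fusion without hysteresis": (733, 47, 32),
    "Full proposed framework": (744, 29, 21),
}

for method, (tp, fp, fn) in COUNTS.items():
    print(method, precision_recall_f1(tp, fp, fn))
```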

Table 6. Stratification of the manually annotated validation subset. The rows show marginal distributions by map and weather/illumination condition and should not be summed across strata groups.

Stratum	Annotated Frames	Share [%]
Town10HD_Opt	1600	33.3
Town03	1600	33.3
Town05	1600	33.3
Clear day	800	16.7
Light rain	800	16.7
Heavy rain	800	16.7
Fog	800	16.7
Night/low illumination	800	16.7
Backlight/low sun	800	16.7

Table 7. Scenario-level performance of the full proposed framework on the annotated subset.

Scenario Group	Annotated Frames	Precision [%]	Recall [%]	F1 [%]
Normal traffic/baseline	800	98.1	97.2	97.6
Fallen tree/static obstacle	800	95.4	96.5	95.9
Ball or child entering road	800	96.1	96.8	96.4
Wrong-way vehicle	800	97.0	97.2	97.1
Emergency vehicle/priority vehicle	800	97.6	97.4	97.5
Occluded vulnerable road user	800	94.7	95.1	94.9
Overall	4800	96.2	97.3	96.7

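Table 8. Scenario-level results for baseline and functional corner-case scenarios. SG, S, and G denote the semantic–geometric, semantic-only, and geometric-only fusion categories; CLEAR and EMER. are the shares of frames in the `CLEAR` and `EMERGENCY_BRAKE` states.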

Scenario	SG	S	G	CLEAR [%]	EMER. [%]	Min. dist. [m]
Baseline Town10HD_Opt	1.10	0.24	0.25	67.9	19.0	5.8
`ball_boy`	1.44	0.41	0.59	22.8	5.9	4.2
Pedestrian crossing	0.96	0.35	0.31	54.6	11.8	3.9
Wrong-way vehicle	1.62	0.29	0.18	48.7	14.6	5.1
Emergency vehicle	1.35	0.52	0.22	61.3	8.2	6.4

Table 9. Weather and lighting stress test: YOLO detections per frame/CLEAR share [%].

Scenario	Clear Day	Light Rain	Heavy Rain	Fog	Night	Backlight
EMS outgoing	1.07/76	1.69/53	0.99/86	0.73/77	0.89/83	1.61/60
`ball_boy`	1.16/79	1.57/42	0.73/95	1.52/56	0.76/82	0.91/95
Fallen tree	0.65/91	0.60/97	0.71/94	0.42/85	0.59/92	1.10/85

Table 10. Weather and lighting stress test: emergency-brake share [%]/geometric-only candidates per frame.

Scenario	Clear Day	Light Rain	Heavy Rain	Fog	Night	Backlight
EMS outgoing	0.0/0.12	1.0/0.50	1.1/0.28	2.5/0.40	0.0/0.31	8.3/0.64
`ball_boy`	0.0/0.89	35.5/1.17	0.8/0.44	23.5/0.45	1.6/0.45	0.1/0.21
Fallen tree	0.0/0.70	0.2/0.38	1.8/0.83	3.1/0.48	0.1/0.48	0.1/0.92

Table 11. Performance of the full proposed framework under different weather and illumination conditions. Each condition contains 800 manually annotated frames.

Condition	Annotated Frames	Precision [%]	Recall [%]	F1 [%]
Clear day	800	97.9	98.2	98.0
Light rain	800	96.9	97.5	97.2
Heavy rain	800	94.8	96.4	95.6
Fog	800	95.3	95.9	95.6
Night/low illumination	800	94.1	95.2	94.6
Backlight/low sun	800	93.8	94.6	94.2
Overall	4800	96.2	97.3	96.7

Table 12. Diagnostic subclassification of geometric-only LiDAR candidates.

Subclass	Criterion	Typical Interpretation
Dynamic-like obstacle	persistent cluster with measurable centroid displacement	moving pedestrian, vehicle, or unusual object
Static-like obstacle	persistent cluster with near-zero motion inside corridor	fallen tree, debris, parked object
Boundary infrastructure	cluster close to corridor boundary or road edge	wall, guardrail, curb, pole
Road-surface artifact	low-height, small cluster near ground plane	manhole cover, road marking, sparse returns
Projection/matching artifact	inconsistent image projection or partial overlap	calibration or occlusion mismatch

Table 13. Manual subclassification of geometric-only candidates in the validation subset.

Geometric-Only Subclass	Candidates	Share [%]	Typical Examples
Safety-relevant physical obstacle	640	70.2	fallen tree, debris, ball, unrecognized object
Static roadside infrastructure	142	15.6	guardrail, wall edge, curbside structure
Road-surface artifact	82	9.0	manhole cover, road bump, low curb return
Projection or matching artifact	48	5.3	partial cluster, sparse LiDAR return
Total	912	100.0	–

Table 14. Sources of false-positive detections in the LiDAR-only baseline.

False-Positive Source	Share [%]	Explanation
Guardrails, walls, and curbs	42.6	elongated static structures near the corridor boundary
Parked or roadside objects outside the lane	26.1	objects close to but not blocking the ego path
Road-surface structures	16.7	manholes, bumps, low curb returns
Vegetation and sparse clutter	9.3	irregular point clusters from nearby vegetation
Projection or clustering artifacts	5.3	fragmented or unstable LiDAR clusters

Table 15. Ablation study on the annotated subset. Unnecessary braking is defined in Equation (31).

Variant	Recall [%]	Unnec. Brake [%]	Event Success [%]	Latency [ms]
Vision-only	84.1	3.4	87.5	33.2
LiDAR-only	94.0	8.6	95.8	1.4
Fusion without hysteresis	95.8	5.2	97.9	34.6
Full framework	97.3	1.7	100.0	34.7
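
The last two rows differ only in the hysteresis stage, which Table 3 lists as 3 frames. The following sketch shows one common way such a filter can be built. It assumes that a new state must be observed for 3 consecutive frames before the agent switches; the paper's exact rule (for example, whether escalation to a more severe state bypasses the wait) is not reproduced here, and the class name `HysteresisFilter` is hypothetical.

```python
class HysteresisFilter:
    """Switch state only after `hold_frames` consecutive identical proposals.

    Assumption: the same hold applies to escalation and de-escalation.
    """

    def __init__(self, hold_frames: int = 3, initial: str = "CLEAR") -> None:
        self.hold_frames = hold_frames
        self.state = initial
        self._candidate = initial
        self._count = 0

    def update(self, raw_state: str) -> str:
        """Feed the per-frame raw state and return the stabilized state."""
        if raw_state == self.state:
            # Raw decision agrees with the current state: drop any candidate.
            self._candidate = raw_state
            self._count = 0
        elif raw_state == self._candidate:
            # Same candidate as in the previous frame: extend the streak.
            self._count += 1
        else:
            # A different candidate appeared: start a new streak.
            self._candidate = raw_state
            self._count = 1
        if self._count >= self.hold_frames:
            self.state = self._candidate
            self._count = 0
        return self.state


# Example: a single-frame BRAKE flicker is suppressed, while a BRAKE
# decision that persists for three frames is accepted.
raw = ["CLEAR", "BRAKE", "CLEAR", "BRAKE", "BRAKE", "BRAKE", "BRAKE"]
flt = HysteresisFilter()
stable = [flt.update(s) for s in raw]
```

Suppressing single-frame flicker in this way is consistent with the lower unnecessary-braking share of the full framework, although the sketch makes no claim about reproducing the reported numbers.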

Table 16. nuScenes mini cross-platform check. Values are per-frame averages over 100 keyframes.

Quantity	Objects/Frame or Share
YOLO detections	8.11
nuScenes GT total	78.19
GT vehicle	22.42
GT pedestrian	25.54
GT movable object	29.87
Confirmed YOLO + LiDAR detections	0.17
Visual-only YOLO detections	7.94
LiDAR-only candidates	1.27
`CLEAR` share	14.0%
`SLOW` share	24.0%
`BRAKE` share	61.0%
`EMERGENCY_BRAKE` share	1.0%

Table 17. Average processing latency per frame on the 12th Gen Intel Core i7-12700H CPU.

Component	Latency [ms]
YOLOv8n	33.2
LiDAR projection	0.5
BEV clustering	0.9
Fusion and agent	0.1
Total	34.7
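
The real-time margin can be checked with simple arithmetic against the 0.05 s synchronous tick from Table 3. The sketch below sums the component latencies and derives the headroom per tick and the maximum sustainable frame rate; the variable names are illustrative only.

```python
# Component latencies in milliseconds, copied from the table above.
COMPONENT_LATENCY_MS = {
    "YOLOv8n": 33.2,
    "LiDAR projection": 0.5,
    "BEV clustering": 0.9,
    "Fusion and agent": 0.1,
}
TICK_S = 0.05  # CARLA synchronous tick (20 Hz) from Table 3

total_ms = sum(COMPONENT_LATENCY_MS.values())      # pipeline latency per frame
budget_ms = TICK_S * 1000.0                        # time available per tick
headroom_ms = budget_ms - total_ms                 # slack left in each tick
max_rate_hz = 1000.0 / total_ms                    # rate the pipeline could sustain
yolo_share = COMPONENT_LATENCY_MS["YOLOv8n"] / total_ms

print(f"total {total_ms:.1f} ms, headroom {headroom_ms:.1f} ms, "
      f"max rate {max_rate_hz:.1f} Hz, YOLO share {100 * yolo_share:.1f}%")
```

With these values the pipeline uses 34.7 ms of the 50 ms tick, leaves about 15.3 ms of headroom, and is dominated by the detector, which accounts for roughly 96% of the per-frame latency.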

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Čávojský, M.; Dopiriak, M.; Šlapak, E.; Faruque, A.A.; Doboš, T.; Bugár, G. Real-Time Detection of Rare Traffic Situations Using RGB-LiDAR Fusion and a Rule-Based Safety Agent in CARLA. Appl. Sci. 2026, 16, 6722. https://doi.org/10.3390/app16136722

