1. Introduction
Autonomous driving (AD) has become a critical component of modern mobility, with the potential to improve safety, accessibility, and transportation efficiency. Recent advances in computer vision (CV), machine learning (ML), and deep learning (DL) have enabled significant progress in perception and decision-making for autonomous systems (AS) [1]. Despite these advances, ensuring reliable operation in real-world, open environments remains a major challenge, particularly when vehicles encounter hazards that deviate from their pre-training distributions [2]. A central challenge in AD lies in handling unforeseen situations that are not labeled in the training data, called out-of-label (OOL) hazards, such as rare objects, atypical road situations, low lighting, and visually degraded conditions [3,4]. These hazards mostly occur at long distances, under poor illumination, under partial occlusion, or in unusual semantic contexts, and they are insufficiently represented in standard closed-set datasets [5]. As a result, perception models fail significantly, highlighting safety risks that are difficult to detect and interpret [6].
Explainable Artificial Intelligence (XAI) has emerged in parallel as a promising direction for improving transparency and trust in safety-critical AI systems. By explaining the reasoning behind model decisions to humans, XAI supports debugging, validation, and human oversight in AD scenarios [7]. However, most existing XAI approaches are qualitative, lack standardized evaluation, and are not explicitly designed for open-world perception, which limits their ability to address these issues [8]. Meanwhile, Vision–Language Models (VLMs) have demonstrated strong potential for contextual scene understanding by jointly modeling visual inputs and natural language [9,10]. In AD, VLMs enable semantic interpretation of complex scenes, natural-language descriptions of hazards, and human-interpretable explanations of perception outputs [11]. Nevertheless, their effectiveness in explaining rare and unseen hazards remains insufficiently quantified, particularly under standardized safety benchmarks. To address these challenges, recent research has explored open-vocabulary and open-set perception, which aim to extend recognition beyond predefined object categories [12]. While LiDAR-based and multi-sensor approaches have improved geometric perception and motion forecasting, reliance on point-cloud data alone limits semantic understanding and interpretability [13]. Vision-based reasoning therefore remains essential for interpreting road signs, traffic signals, and human behavior.
In this research, we investigate robustness and interpretability in AD perception under open-world conditions by using the Challenge of Out-Of-Label (COOOL) benchmark dataset. COOOL provides a standard dataset of real dashcam videos with explicit annotations for both common and rare hazards, which enables controlled evaluation of perception systems when confronted with real-world novelty.
Rather than proposing a new detector, this study focuses on system-level behavior by integrating object detection, lane estimation, temporal modeling, and vision–language explanation into a unified multimodal pipeline, illustrated in Figure 1. We combine YOLO11x-based object detection, Hough Transform-based lane estimation, GPT-4o (https://openai.com/api/, accessed on 1 March 2025) hazard captioning, and LSTM-based temporal modeling of vehicle response states. Our objective is to empirically evaluate performance degradation, characterize failure modes, assess temporal robustness gains, and quantify explainability under OOL conditions. All evaluations are grounded in the official COOOL protocol to ensure reproducibility and comparability.
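To make the lane-estimation stage more concrete, the following is a minimal pure-Python sketch of Hough Transform line voting. It is not the implementation used in this study: the function name `hough_peak`, the 1° angular resolution, the unit `rho` bin width, and the toy edge points are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def hough_peak(edge_points, theta_step_deg=1, rho_step=1.0):
    """Return (rho, theta_deg, votes) for the line with the most votes.

    Each edge point (x, y) votes for every line
        rho = x * cos(theta) + y * sin(theta)
    that passes through it. Collinear points accumulate votes in the
    same (rho, theta) bin, so the strongest bin marks a candidate line.
    """
    votes = Counter()
    for x, y in edge_points:
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            rho = x * math.cos(theta) + y * math.sin(theta)
            rho_bin = round(rho / rho_step)  # quantize rho into bins
            votes[(rho_bin, theta_deg)] += 1
    (rho_bin, theta_deg), count = votes.most_common(1)[0]
    return rho_bin * rho_step, theta_deg, count


# Hypothetical toy input: ten edge pixels on the vertical line x = 5,
# plus two outliers that do not lie on that line.
points = [(5, y) for y in range(10)] + [(0, 3), (9, 7)]
rho, theta_deg, count = hough_peak(points)
```

In practice, the edge points would come from an edge detector applied to each dashcam frame, and several neighboring bins can tie for the maximum, so a real pipeline would typically merge nearby peaks rather than keep a single one.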
1.1. Problem Definition
The rapid deployment of AD systems in unconstrained environments requires robust handling of OOL hazards, defined as objects, events, or situations that are not represented within the fixed category set used during training. Within the COOOL benchmark, OOL hazards are defined as follows:
Hazardous objects or events that do not belong to the standard closed-set object categories.
Hazards appearing under atypical visual conditions, such as long distance, low resolution, partial occlusion, physical degradation, or unusual semantic contexts.
Although these hazards are explicitly annotated in COOOL, they are not constrained by a fixed ontology, reflecting the open-world nature of real driving environments. This formulation enables a controlled evaluation of open-set perception without assuming prior knowledge of hazard categories.
1.2. Expected Error Modes
The COOOL benchmark operationalizes OOL hazard recognition through three complementary tasks: (1) driver state prediction, which identifies the frame at which a driver reacts to a hazard; (2) hazard detection, which localizes hazard bounding boxes in each frame; and (3) hazard captioning, which describes the semantic nature of each detected hazard in free-form language. This formulation enables the evaluation of perception robustness beyond conventional closed-set detection.
1.3. Evaluation Criteria and Success Metrics
As per the official COOOL evaluation protocol, system performance is assessed by using the macro-average frame-level accuracy across the three benchmark tasks.
Equation (1) presents the driver response state accuracy, defined in terms of the number of frames in which the vehicle response state is correctly predicted and the total number of evaluated frames.
Equation (2) presents the hazard detection accuracy, defined in terms of the number of correctly detected hazards, the number of ground-truth hazards, and the number of predicted hazards.
Equation (3) presents the hazard captioning accuracy, defined in terms of the number of correctly generated hazard captions, the number of ground-truth hazard names, and the number of predicted hazard names.
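As an illustration of how the scores described above combine, the sketch below computes a frame-level accuracy for the driver response state (correct frames divided by evaluated frames) and the macro-average over the three task scores. The function names and the example numbers are hypothetical, and the detection and captioning scores are passed in as given values because their exact formulas are not reproduced here.

```python
def frame_accuracy(predicted_states, true_states):
    """Fraction of frames whose predicted state matches the annotation."""
    if len(predicted_states) != len(true_states) or not true_states:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == t for p, t in zip(predicted_states, true_states))
    return correct / len(true_states)


def macro_average(task_scores):
    """Unweighted mean of the per-task scores (each task counts equally)."""
    if not task_scores:
        raise ValueError("need at least one task score")
    return sum(task_scores.values()) / len(task_scores)


# Hypothetical example: the driver reacts one frame earlier than predicted.
predicted = [0, 0, 1, 1, 1]
annotated = [0, 1, 1, 1, 1]
driver_acc = frame_accuracy(predicted, annotated)

# Placeholder detection and captioning scores, for illustration only.
overall = macro_average(
    {"driver_state": driver_acc, "detection": 0.5, "captioning": 0.2}
)
```

Macro-averaging gives each task equal weight, so a weak captioning score lowers the overall result as much as a weak detection score would.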
1.4. Research Questions and Hypotheses
RQ1: How do state-of-the-art closed-set object detectors perform under OOL hazard conditions in COOOL? H1: Closed-set detectors exhibit reduced recall and increased localization errors for rare and OOL hazards.
RQ2: Does temporal modeling improve hazard recognition and driver response prediction under OOL conditions? H2: LSTM-based temporal modeling improves consistency and recall compared with frame-wise perception alone.
RQ3: Can VLMs provide measurable interpretability benefits for OOL hazards? H3: Hazard captioning by GPT-4o improves interpretability but remains constrained by visual uncertainty and semantic ambiguity.
1.5. Contribution
We provide a benchmark-grounded OOL problem formulation: a dataset-consistent definition of OOL hazards aligned with the three official COOOL tasks, which enables systematic evaluation of open-world perception failures.
Through controlled evaluation of YOLO11x on COOOL, we empirically characterize missed detections, semantic errors, and temporal instability under OOL conditions, which validates H1.
We introduce an LSTM-based temporal modeling module that improves hazard recall and driver response stability, in support of H2, and demonstrates the importance of temporal context in open-world driving.
We integrate GPT-4o hazard captioning and evaluate explanations by using object coverage and factual consistency metrics, providing a quantitative assessment of interpretability under OOL scenarios and addressing H3.
2. Related Work
Research on AD systems has reached considerable maturity, supported by the availability of large-scale datasets and benchmark platforms. These rich datasets have significantly advanced research on explainability, visual perception, and hazard recognition. However, despite their contributions, most existing datasets are primarily limited to pre-defined scenarios with fixed labels [14]. As a result, they fail to adequately address out-of-category challenges such as identifying novel hazards and providing deeper interpretability of the system’s decision-making process. A route positioning system identifies a vehicle’s current path between different stops; it is often used in public transportation to track movement along fixed routes, using technologies such as GPS, sensors, and cameras to enable accurate and cost-effective location recognition even in challenging environments [15]. The nuScenes dataset extends multimodal perception capabilities by providing more than 1.44 million images from a diverse sensor suite that includes radar, cameras, and LiDAR. It encompasses a wide variety of driving scenarios recorded in Singapore and Boston, featuring unpredictable maneuvers and complex traffic patterns [16,17]. While such diversity is essential for advancing perception and prediction research, interpretable reasoning and novelty detection were not the primary objectives of this dataset, as is also the case for the Waymo Open Dataset. The Waymo Open Dataset emphasizes 2D and 3D detection and segmentation, offering another valuable resource for perception research, with over 200,000 high-frequency images captured across different environments [18]. Likewise, the ApolloScape dataset provides extensive street-view imagery, LiDAR point clouds, and trajectory data that support numerous AD challenges but places limited emphasis on defining and interpreting surprising, unseen events [19]. Although the vast diversity of these datasets, including variations in weather, climate, and geography across 100,000 caption videos representing more than 1000 h of driving, aims to enhance model robustness, their reliance on hard-coded annotations and fixed label schemes constrains the development of truly human-interpretable systems.
This research contributes a system-level empirical evaluation of OOL hazard perception by integrating multimodal perception, temporal reasoning, and vision–language explainability under a standard real-world benchmark. It aims to enhance the dependability and interpretability of AS by addressing hazard detection and responses to newly emerging and previously unseen dangers. Our proposed models are designed to enable safer and more reliable navigation in complex, real-world environments by identifying unexpected events and providing clear, understandable explanations of the system’s decision-making process. Several ML and DL techniques have been explored for novelty detection in AD, which focus on recognizing and explaining unforeseen risks while improving decision transparency. For instance, the authors of [20] develop an end-to-end control framework for autonomous vehicles that uses autoencoders and novelty detection methods to provide interpretable feedback for unexpected events. Similarly, another study employed a dataset collected from a 2015 Toyota Prius equipped with a forward-facing Leopard Imaging LIAR0231-GMSL camera that captured 1080p RGB images at approximately 30 Hz; it was designed to evaluate the system’s ability to detect unforeseen situations while maintaining transparency in its decision-making mechanisms [21]. Another study advances the field of AD safety through explainability by exploring the use of language embeddings and active learning to identify novel driving scenarios [22]. Their work utilizes the LAVA dataset, which contains rich multimodal information, including environmental context, vehicle trajectories, annotated traffic signs, and corresponding video clips [23]. The dataset was collected across various locations in the San Diego area, capturing a diverse range of road types, traffic conditions, and weather phenomena, which enables detailed analysis of decision-making under ambiguous events.
Despite substantial progress in object detection, temporal modeling, and vision–language reasoning for AD, the existing studies remain largely constrained to closed-set assumptions and isolated notions of novelty. Many approaches evaluate robustness by using synthetic unknowns, curated open-set extensions, and qualitative explainability measures, which limit their applicability to real-world driving conditions. A critical gap persists in the availability of standard benchmarks and system-level evaluations that jointly address OOL hazard detection, temporal driver response modeling, and explainable scene understanding under realistic conditions. The COOOL benchmark directly targets this gap by providing annotated dashcam video sequences that contain both common and rare hazards falling outside the fixed object ontologies, as illustrated in
Table 1.
However, while COOOL establishes the evaluation protocol, it does not prescribe how multimodal perception, temporal reasoning, and explainability should be integrated into a unified architecture. This absence of system-level baselines motivates an empirical investigation of how contemporary perception components behave under OOL conditions and how their limitations can be mitigated through multimodal integration.
Prior work reveals three unresolved gaps: (i) the lack of real-world benchmarks that explicitly annotate OOL hazards, (ii) limited understanding of how closed-set perception models fail under such conditions, and (iii) the absence of quantitative frameworks for evaluating interpretability in safety-critical scenarios. The intersection of these gaps motivates the formulation of our task: to evaluate perception robustness, temporal consistency, and explainability within a unified multimodal pipeline by using the COOOL benchmark. Consequently, our experimental design integrates object detection, temporal driver-state modeling, and vision–language hazard captioning, which enables systematic analysis of failure modes and incremental robustness gains under standard out-of-distribution conditions.
Table 1.
Summary of Research Gaps in AD Hazard Perception and Explainability.
| Study/Method | Data Source | Unseen Type | Metrics | Key Limitations |
|---|---|---|---|---|
| [24] YOLO, Faster R-CNN | COCO, BDD100K | None | mAP, Precision, Recall | Closed-set; cannot detect unseen hazards |
| [25] Open-Set/OOD Methods | Synthetic Data | Unknown Objects | AUROC, OSCR | Limited realism; weak temporal modeling |
| [26] Vision Language Models | Toyota Woven Traffic Safety dataset | Semantic Novelty | Caption Accuracy | Not evaluated for safety-critical driving |
| [27] Temporal Models (LSTM, GRU) | Driving Logs | Behavior Anomalies | Accuracy, F1 | No visual hazard localization |
| [28] Explainable AI Methods | Simulation | Model Uncertainty | Saliency Maps | Mostly qualitative evaluation |
| [29] COOOL Baseline Methods | COOOL Dataset | Rare Hazards | Detection Accuracy | No unified temporal baseline |
| [30] COOOL Challenge Winners | COOOL Dataset | Long-tail Hazards | Hazard Recall | Limited explainability analysis |
| [31] Segment Extraction | AI City Challenge Track2 | Descriptions | BLEU, Rouge-L | No validation for captions |
4. Results
Both quantitative and qualitative results obtained on the COOOL benchmark are presented in this section, following the official evaluation protocol. Structural analysis assesses closed-set perception in terms of robustness, temporal consistency, lane estimation stability, anomaly detection capability, and interpretability under OOL conditions. All quantitative results are computed over the complete COOOL benchmark. Object detection metrics are reported at the frame level, while temporal and anomaly-related metrics are aggregated over sliding windows of 16 consecutive frames. Closed-set object detection performance is measured by using mean Average Precision at an IoU threshold of for common object categories. Robustness to rare and OOL hazards is evaluated by using recall over annotated hazard instances. Temporal anomaly detection performance is assessed using the Area Under the Receiver Operating Characteristic curve (AUROC). Lane estimation stability is quantified by mean lateral deviation from detected lane boundaries, reported separately for daylight and low-light conditions. Vision–language interpretability is evaluated using human-verified object coverage and factual consistency scores computed from structured scene descriptions generated for each video. All metrics are reported as dataset-level aggregates to ensure statistical reliability.
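The window-level aggregation and the AUROC metric described above can be sketched as follows. This is an illustrative reading rather than the evaluation code used here: the stride of one frame, the use of the mean as the aggregate, and the function names are assumptions, and the AUROC is computed with the standard pairwise (Mann–Whitney) formulation.

```python
def sliding_window_means(frame_scores, window=16):
    """Mean score over every run of `window` consecutive frames (stride 1)."""
    n = len(frame_scores)
    return [
        sum(frame_scores[i:i + window]) / window
        for i in range(n - window + 1)
    ]


def auroc(scores, labels):
    """Probability that a positive sample outscores a negative one.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical 20-frame score sequence -> 5 windows of 16 frames each.
window_scores = sliding_window_means([float(i) for i in range(20)])

# Hypothetical window-level anomaly scores with hazard (1) / nominal (0) labels.
example_auroc = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale against which the reported anomaly-detection result should be read.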
To evaluate the limitations of closed-set perception under open-world driving scenarios, YOLO11x was applied to the COOOL benchmark without dataset-specific retraining. As shown in
Figure 3, the detector achieved an
of
on common in-distribution object categories. Performance degrades notably for rare and OOL hazards, with hazard-level recall reaching
. Hazards frequently appear under long-range, low-illumination conditions, resulting in intermittent localization and missed detections across frames. Comparison against a closed-set YOLO baseline highlights the difficulty of OOL perception and empirically confirms the limited robustness of conventional detectors in open-world settings. Temporal consistency is evaluated by aggregating hazard confidence scores across consecutive frames for all videos.
Figure 4 reports the mean and standard deviation of hazard confidence trajectories over 200 videos.
While frame-wise recall remains a comparable , temporal modeling reduces the frame-to-frame instability and improves alignment with annotated hazard response intervals. The shaded confidence bands illustrate inter-video variability, demonstrating that temporal context contributes primarily to stability and robustness rather than large gains in instantaneous detection accuracy.
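One way to quantify the frame-to-frame instability and the confidence bands discussed above is sketched below. The instability measure (mean absolute change between consecutive frames) is an assumed definition, and the exponential moving average is a deliberately simple stand-in for the LSTM, used only to show how temporal smoothing lowers this measure; all names and numbers are hypothetical.

```python
from statistics import mean, pstdev


def frame_instability(confidences):
    """Mean absolute change in confidence between consecutive frames."""
    return mean(abs(b - a) for a, b in zip(confidences, confidences[1:]))


def exponential_smoothing(confidences, alpha=0.3):
    """Simple temporal smoothing (a stand-in for a learned temporal model)."""
    smoothed = [confidences[0]]
    for c in confidences[1:]:
        smoothed.append(alpha * c + (1 - alpha) * smoothed[-1])
    return smoothed


def confidence_band(trajectories):
    """Per-frame (mean, standard deviation) across equally long videos."""
    return [(mean(frame), pstdev(frame)) for frame in zip(*trajectories)]


# Hypothetical flickering frame-wise hazard confidences for one video.
raw = [0.2, 0.9, 0.1, 0.8, 0.2, 0.9]
raw_instability = frame_instability(raw)
smoothed_instability = frame_instability(exponential_smoothing(raw))

# Hypothetical two-video, two-frame example of the band computation.
band = confidence_band([[0.0, 1.0], [1.0, 1.0]])
```

The per-frame mean and standard deviation returned by `confidence_band` correspond to the center line and shaded band of a trajectory plot.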
Lane estimation performance was assessed using mean lateral deviation from detected lane boundaries. As illustrated in
Figure 5, daylight scenes achieved a mean deviation of
pixels, while low-light and nighttime scenes exhibited an increased deviation of
pixels.
Distributional analysis reveals higher variance under low illumination, reflecting the sensitivity of classical Hough-based lane detection to visual degradation. Despite this degradation, lane estimates remain sufficiently stable to provide structural context for hazard interpretation.
Temporal anomaly detection capability is evaluated by using AUROC, measuring the discrimination between nominal driving behavior and hazard-induced temporal patterns. As shown in
Figure 6, the model achieves an AUROC of
, exceeding random chance while remaining below thresholds required for safety-critical decision-making.
This result indicates that visual temporal cues alone provide partial but meaningful information for identifying evolving risk states. Interpretability is evaluated using GPT-4o scene descriptions generated from the structural perception outputs. Object coverage and factual consistency were assessed via human verification across all videos. As shown in
Figure 7, the descriptions achieved an object coverage score of
and factual consistency of
.
Error bars indicate annotator variability and highlight that language-based explanations remain sensitive to upstream perception errors. The results demonstrate that vision–language models enhance transparency and contextual understanding. The complete multimodal pipeline operated at approximately 10–12 FPS on a single RTX 4090 GPU. While this throughput does not meet strict real-time control requirements, it supports near–real-time perception analysis, interpretability, and offline evaluation. However, detection of rare and OOL hazards proved substantially more challenging. Despite their explicit annotation in the dataset, these hazards often appear at long distances, under poor illumination, and with ambiguous visual cues. As a result, hazard-level recall reached , with frequent missed detections and intermittent localization across consecutive frames. These results empirically confirm H1, demonstrating that closed-set detectors exhibit reduced robustness when exposed to novel and visually degraded hazards.
Figure 8 shows the frequency of object-related terms in GPT-4o-generated scene descriptions across the COOOL benchmark. This analysis serves as a diagnostic tool to verify the grounding of language outputs in perception inputs and is interpreted jointly with object coverage and factual consistency metrics reported in
Table 4. Frequency alone should not be treated as evidence of recognition performance.
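The term-frequency diagnostic and an object coverage score can be approximated automatically as in the sketch below. Note that the scores reported in this study were obtained through human verification; the token-matching proxy, the function names, and the example descriptions here are assumptions for illustration, and the matching only handles single-word object labels.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphabetic tokens of a description."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(descriptions, vocabulary):
    """Count how often each object term appears across all descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(tok for tok in tokenize(text) if tok in vocabulary)
    return counts


def object_coverage(detected_labels, description):
    """Fraction of detected object labels mentioned in the description."""
    detected = set(detected_labels)
    if not detected:
        raise ValueError("no detected objects to cover")
    mentioned = detected & set(tokenize(description))
    return len(mentioned) / len(detected)


# Hypothetical scene descriptions and detector labels.
descriptions = [
    "A deer crosses in front of the car.",
    "The car brakes near a fallen tree.",
]
freqs = term_frequencies(descriptions, {"car", "deer", "tree"})
coverage = object_coverage(["car", "deer", "truck"], descriptions[0])
```

A frequent term in `freqs` only shows that the language output mentions the object often; it says nothing about whether the object was correctly detected, which is why coverage and consistency are checked separately.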
5. Discussion and Limitations
The quantitative results in
Table 4 demonstrate that the proposed multimodal framework provides a coherent and interpretable perception pipeline for open-world driving scenarios, while also revealing important limitations that constrain its reliability for safety-critical autonomous control. While YOLO11x achieves moderate performance on common object categories with an
and reasonable recall for rare and OOL hazards of
, the results confirm that closed-set detectors remain vulnerable to missed detections and intermittent localization under adverse conditions. This limitation is particularly evident in low-light and nighttime scenes, where lane estimation error increases from
to
pixels, potentially degrading spatial context for downstream decision-making. Consequently, the system is not considered robust enough for standalone deployment in safety-critical environments without additional redundancy. The temporal modeling components improve consistency rather than raw detection accuracy. The LSTM risk-state model yielded a hazard recall of
, comparable to frame-wise detection, while primarily stabilizing hazard recognition across consecutive frames. Similarly, the temporal anomaly detector achieves an AUROC of
, indicating limited discrimination between nominal and hazard-induced behavioral patterns. These findings suggest that vision-only temporal cues provide partial but insufficient information for precise risk-state estimation, especially when hazards are visually subtle, distant, or occluded. False negatives in such scenarios delay hazard awareness, while false positives lead to unnecessary braking or AV system interventions.
The vision–language scene description module enhances interpretability but introduces additional failure modes. Although object coverage of and factual consistency of indicate that most detected elements are correctly verbalized, the language model remains sensitive to upstream perception errors and occasionally produces generic as well as incomplete descriptions under ambiguous visual conditions. Importantly, the GPT-4o module operates with non-negligible latency and relies on API-based inference, making it unsuitable for real-time closed-loop control. For this reason, it is explicitly restricted to post hoc explanation and monitoring; any direct use of language model outputs for decision-making would pose safety risks due to potential hallucinations, omissions, and delays in responses.
Finally, the proposed framework is limited by its reliance on monocular vision and the absence of sensor fusion. The COOOL benchmark does not provide radar, LiDAR, or vehicle telemetry, preventing validation of cross-modal consistency and redundancy. Future work must address these limitations through multi-sensor fusion, uncertainty-aware OOD detection, tighter latency control, and formal validation and verification of failure modes.
Beyond autonomous driving applications, the proposed multimodal anomaly recognition framework presents a valuable research direction for electroceutical systems, particularly in scenarios that require real-time identification of rare or unseen physiological events. Future studies will investigate extending the vision–language–temporal modeling paradigm to implantable and wearable electroceutical devices, where heterogeneous biosignals, contextual interpretation, and temporal dynamics must be jointly analyzed to ensure therapeutic safety and reliability. By adapting anomaly detection and explainable reasoning components to bioelectrical signal monitoring, such frameworks may enable closed-loop electroceutical platforms capable of autonomously adjusting stimulation parameters in response to abnormal or unexpected physiological conditions. This line of research offers a systematic pathway for translating AI-driven perception and robustness methodologies into clinically relevant electroceutical technologies, while maintaining compatibility with regulatory and safety evaluation frameworks.
6. Conclusions
This work presented a multimodal perception and interpretation framework designed to analyze open-world driving scenarios with an emphasis on rare hazards, temporal consistency, and explainability. Evaluation on the COOOL benchmark demonstrates that the proposed system achieves measurable performance across detection, temporal modeling, and language-based interpretation tasks. Specifically, the object detection module attained an
of
on common classes and a recall of
for rare and out-of-label hazards, while the LSTM-based temporal risk model achieved a hazard recall of
. Lane estimation accuracy showed a mean lateral deviation of
pixels in daylight conditions and
pixels under low-light and nighttime scenarios. Temporal anomaly detection yielded an AUROC of
, indicating moderate discrimination capability under open-world conditions. In terms of interpretability, the vision–language component achieved an object coverage score of
and factual consistency of
, demonstrating that language outputs are largely grounded in visual perception while remaining sensitive to upstream detection errors. End-to-end system throughput ranged between 10 and 12 FPS on a single RTX 4090 GPU, highlighting current limitations for real-time deployment while still enabling consistent offline evaluation. These results, summarized quantitatively in
Table 4, indicate that the framework provides meaningful robustness analysis and diagnostic insight rather than production-level autonomy. The contribution of this work lies in providing a unified experimental framework for evaluating perception stability, temporal reasoning, and interpretability under rare and out-of-distribution driving conditions. While the system is not suitable for direct stand-alone autonomous control, it offers a reproducible benchmark and analysis tool for studying failure modes and transparency in safety-critical perception pipelines, thereby supporting future research toward more reliable and trustworthy autonomous systems.