Next Article in Journal
Quantum-Enhanced DNA Image Compression: Theoretical Framework and NISQ Implementation Strategy
Next Article in Special Issue
Energy-Aware Spatio-Temporal Multi-Agent Route Planning for AGVs
Previous Article in Journal
Surface Characterisation of Retrieved Orthopaedic Knee Liners
Previous Article in Special Issue
Construction of an Intelligent Risk Identification System for Highway Flood Damage Based on Multimodal Large Models
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Unseen Hazard Recognition in Autonomous Driving Using Vision–Language and Sensor-Based Temporal Models

1
Department of AI and Software, Gachon University, Seongnam-si 13120, Gyeonggi-do, Republic of Korea
2
Department of Creative Technologies, Air University, Islamabad 44000, Pakistan
3
Department of Biomedical Engineering, Gachon University, Seongnam-si 13120, Gyeonggi-do, Republic of Korea
4
Medical Device Development Center, Osong Medical Innovation Foundation, Cheongju 28160, Chungbuk, Republic of Korea
*
Authors to whom correspondence should be addressed.
Appl. Sci. 2026, 16(3), 1503; https://doi.org/10.3390/app16031503
Submission received: 4 December 2025 / Revised: 8 January 2026 / Accepted: 12 January 2026 / Published: 2 February 2026
(This article belongs to the Special Issue Autonomous Vehicles and Robotics—2nd Edition)

Abstract

Autonomous driving (AD) systems remain vulnerable to rare, ambiguous, and out-of-label (OOL) hazards that are insufficiently represented in conventional training datasets. This work investigates perception robustness under such conditions by using the Challenge of Out-Of-Label (COOOL) benchmark dataset, which consists of 200 dashcam video sequences annotated with both common and uncommon traffic hazards. We analyze that the behavior of widely used methods in the perception of components and present a multimodal pipeline in which we integrate YOLO11x for object detection, Hough Transform for lane estimation, and GPT-4o for scene description, and for temporal modeling, we use Long Short-Term Memory (LSTM) networks. On the COOOL benchmark, YOLO11x achieves an m A P @ 0.5 of 54.1 % on the common object categories, whereas the detection of rare and OFL hazards remains challenging, with a recall of 72.6 % . Incorporating temporal risk modeling improves hazard recall to 71.8 % , indicating a modest but consistent gain in recognizing uncommon events. Hough Transform shows the stable behavior in standard conditions for lane estimation, with a mean lateral deviation of 8.9 pixels in daylight scenes and 13.4 pixels under low-light conditions. The temporal anomaly detection module attains an AUROC of 0.65 , reflecting the limitation but meaningful discrimination between nominal and anomalous driving situations. For interpretability, the GPT-4o scene description module generates context-aware textual explanations with an object coverage score of 0.72 and a factual consistency rate of 78 % , as assessed through manual inspection. The end-to-end pipeline operates at approximately 10–12 frames per second on a single GPU, supporting near-real-time analysis and optimization. Our results confirm that state-of-the-art perception models struggle with OOL hazards and that multimodal vision–language–temporal integration provides incremental improvements in robustness and interpretability when evaluated under the standardized out-of-distribution conditions.

1. Introduction

Autonomous driving (AD) has become a critical source of modern mobility, which has the potential to improve safety, accessibility, and transportation efficiency. Recent advances in computer vision (CV), machine learning (ML), and deep learning (DL) enable significant progress in perception and decision-making for autonomous systems (AS) [1]. Despite these advances, ensuring reliable operation in real-world, open environments remains a major challenge, particularly when vehicles encounter hazards that deviate from their pre-training distributions [2]. A central challenge in AD lies in handling unforeseen situations that are not labeled in the training of data, called out-of-label (OOL) hazards, such as rare objects, atypical road situations, low lighting, and visually degraded conditions [3,4]. These hazards mostly occur at long distances, under poor illumination, partial occlusion, and in unusual semantic contexts and are insufficiently represented in standard closed-set datasets [5]. As a result, perception models fail significantly in highlighting safety risks that are difficult to detect and interpret [6].
Explainable Artificial Intelligence (XAI) in parallel has emerged as a promising direction for improving transparency and trust in safety-critical AI systems. By explaining the reasoning behind model decisions to humans, XAI supports debugging, validation, and human oversight in AD scenarios [7]. However, most existing XAI approaches are qualitative, lacking standardization evaluation, and are also not explicitly designed for open-world perception, which results in failures in countering issues [8]. In parallel, Vision–Language Models (VLMs) have demonstrated strong potential for contextual scene understanding by jointly modeling visual inputs and natural language processing covered in the articles of [9,10]. In AD, VLMs enable a semantic interpretation of complex scenes, natural-language description of hazards, and human-interpretable explanations of perception outputs [11]. Nevertheless, their effectiveness in explaining rare and unseen hazards remains insufficiently quantified, particularly under standardized safety benchmarks. To address these challenges, recent research has explored open-vocabulary and open-set perception, which aim to extend the recognition of beyond predefined object categories [12]. While LiDAR-based and multi-sensor approaches have improved geometric perception and motion forecasting, reliance on point-cloud data alone limits semantic understanding and interpretability [13]. Vision-based reasoning still remains essential for interpreting road signs, traffic signals, and human behavior.
In this research, we investigate robustness and interpretability in AD perception under open-world conditions by using the Challenge of Out-Of-Label (COOOL) benchmark dataset. COOOL provides a standard dataset of real dashcam videos with explicit annotations for both common and rare hazards, which enable of control evaluation of perception systems when confronted with real-world novelty.
Rather than proposing a new detector, this study focuses on system-level behavior by integrating object detection, lane estimation, temporal modeling, and vision–language explanation into a unified multimodal pipeline illustrated in Figure 1. We combine YOLO11x-based object detection, Hough Transform-based lane estimation, GPT-4o (https://openai.com/api/, accessed on 1 March 2025) hazard captioning, and LSTM-based temporal modeling of vehicle response states. Our objective is to empirically evaluate the performance of degradation, characterize failure modes, assess temporal robustness gains, and quantify explainability under OOL conditions. All evaluations are grounded in the official COOOL protocol to ensure reproducibility and comparability.

1.1. Problem Definition

The rapid deployment of AD systems in unconstrained environments requires robust handling of OOL hazards, defined as objects, events, or situations that are not represented within the fixed category set used during training. Within the COOOL benchmark, OOL hazards are defined as follows:
  • Hazardous objects or events that do not belong to the standard closed-set object categories.
  • Hazards appearing under atypical visual conditions, such as long distance, low resolution, partial occlusion, physical degradation, or unusual semantic contexts.
Although these hazards are explicitly annotated in COOOL, they are not constrained by a fixed ontology, reflecting the open-world nature of real driving environments. This formulation enables a controled evaluation of the open-set perception without assuming prior knowledge of hazard categories.

1.2. Expected Error Modes

The COOOL benchmark operationalizes OOL hazard recognition through three complementary tasks given as follows: (1) Driver State Prediction to identify the frame at which a driver reacts to a hazard; (2) hazard detection in localized hazard bounding boxes per frame; and (3) captioning the hazard by describing the semantic nature of the detected hazards by using the free-form language. This formulation enables the evaluation of perception robustness beyond conventional closed-set detection.

1.3. Evaluation Criteria and Success Metrics

As per the official COOOL evaluation protocol, system performance is assessed by using the macro-average frame-level accuracy across the three benchmark tasks.
Equation (1) presents the driver response state accuracy, where N correct DS denotes the number of frames in which the vehicle response state is correctly predicted, and N total is the total number of evaluated frames.
Accuracy DS = N correct DS N total ,
Equation (2) presents the hazard detection accuracy, where N correct HD represents the number of correctly detected hazards, N known H denotes the number of ground-truth hazards, and N pred H is the number of predicted hazards.
Accuracy HD = N correct HD max N known H , N pred H ,
Equation (3) presents the hazard captioning accuracy, where N correct HC indicates the number of correctly generated hazard captions, N known C refers to the number of ground-truth hazard names, and N pred C is the number of predicted hazard names.
Accuracy HC = N correct HC max N known C , N pred C ,

1.4. Research Questions and Hypotheses

RQ1: How do state-of-the-art closed-set object detectors perform under OOL hazard conditions in COOOL? H1: Closed-set detectors exhibit reduced recall and increase the localization errors for rare and OOL hazards.
RQ2: Does the temporal model improve hazard recognition and driver response prediction under OOL conditions? H2: With the temporal modeling of the LSTM, it improves consistency and recall compared to the frame-wise perception alone.
RQ3: Can VLMs provide measurable interpretability benefits for OOL hazards? H3: Hazard captioning by GPT-4o improves interpretability but remains constrained by the visual uncertainty and semantic ambiguity.

1.5. Contribution

We provide a clear benchmark-Ground OOL Problem Formulation to the dataset-consistent definition of OOL hazards aligned with the three official COOOL tasks, which enable systematic evaluation of open-world perception failures.
Through control evaluation of YOLO11x on COOOL, we empirically characterize missed detections, semantic errors, and temporal instability under OOL conditions, which validates H1.
We introduce an LSTM-based temporal modeling module that improves the hazard in terms of recall and driver response stability in support of H2 and demonstrate the importance of temporal context in open-world driving.
We integrate GPT-4o hazard captioning and evaluate explanations by using object coverage and factual consistency metrics to provide a quantitative assessment of interpretability under OOL scenarios and addressing H3.

2. Related Work

Research on AD systems has reached an interesting era of impressive maturity, supported by the availability of large-scale datasets and benchmark platforms. These rich datasets have significantly advanced research on explainability, visual perception, and hazard recognition. However, despite their contributions, most existing datasets are primarily limited to pre-defined scenarios with fixed labels [14]. As a result, they fail to adequately address out-of-category challenges such as identify novel hazards and provide deeper interpretability of the system’s decision-making process. A route positioning system identifies a vehicle’s current path between different stops, often used in public transportation to track movement along fix routes through use of technologies like GPS, sensors, and cameras to enable accurate and cost-effective location recognition even in challenging environments [15]. The nuScenes dataset extends multimodal perception capabilities by providing more than 1.44 million images of diverse sets of sensors, including radar, cameras, and LiDAR. It encompasses a wide variety of driving scenarios recorded in Singapore and Boston, featuring unpredictable maneuvers and complex traffic patterns [16,17]. While such diversity is essential for advancing perception and prediction research, interpretable reasoning and novelty detection were not the primary objectives of this dataset, similar to the Waymo Open Dataset. The Waymo Open Dataset emphasizes 2D and 3D detection and segmentation, offering another valuable resource for perception research, with over 200,000 high-frequency images captured across different environments [18]. Likewise, the ApolloScape dataset provides extensive street-view imagery, LiDAR point clouds, and trajectory data that support numerous AD challenges but place limited emphasis on defining and interpreting surprising, unseen events [19]. Although the vast diversity of these datasets, including the variations in weather, climate, and geography across 100,000 caption videos, which represent more than 1000 h of driving, aims to enhance model robustness, their reliance on hard-coded annotations and fixed label schemes constrains the development of truly human-interpretable systems.
This research contributes a system-level empirical evaluation of OOL hazard perception by integrating multimodal perception, temporal reasoning, and vision–language explainability under a standard real-world benchmark. It aims to enhance the dependability and interpretability of AS by addressing hazard detection and responses to newly emerging and previously unseen dangers. Our proposed models are designed to enable safer and more reliable navigation in complex, real-world environments to identify unexpected events and provide a clear understandable explanations of the system’s decision-making process. Several machine learning (ML) and deep learning (DL) techniques have been explored for novelty detection in AD, which focus on recognizing and explaining unforeseen risks while improving decision transparency. For instance, in the study of [20], the authors develop an innovative end-to-end control framework for autonomous vehicles with the use of auto encoders and novelty detection methods to provide interpretable feedback for unexpected events. Similarly, another study, in which the authors employed a dataset collected from a 2015 Toyota Prius equipped with a forward-facing Leopard Imaging LIAR0231-GMSL camera that captured 1080p RGB images at approximately 30 Hz, was designed to evaluate the system’s ability to detect unforeseen situations while maintaining transparency in its decision-making mechanisms [21]. Another study advances the field of AD safety through explainability by exploring the use of language embeddings and active learning to identify novel driving scenarios [22]. Their work utilizes the LAVA dataset, which contains rich multimodal information, including environmental context, vehicle trajectories, annotated traffic signs, and corresponding video clips [23]. The dataset was collected across various locations in the San Diego area, capturing a diverse range of road types, traffic conditions, and weather phenomena, which enables detailed analysis of decision-making under ambiguous events.
Despite substantial progress in object detection, temporal modeling, and vision–language reasoning for AD, the existing studies remain largely constrained to closed-set assumptions and isolated notions of novelty. Many approaches evaluate robustness by using synthetic unknowns, curated open-set extensions, and qualitative explainability measures, which limit their applicability to real-world driving conditions. A critical gap persists in the availability of standard benchmarks and system-level evaluations that jointly address OOL hazard detection, temporal driver response modeling, and explainable scene understanding under realistic conditions. The COOOL benchmark directly targets this gap by providing annotated dashcam video sequences that contain both common and rare hazards falling outside the fixed object ontologies, as illustrated in Table 1.
However, while COOOL establishes the evaluation protocol, it does not prescribe how multimodal perception, temporal reasoning, and explainability integrate into a unified architecture. This absence of system-level baselines motivates the need for empirical investigation of how contemporary perception components behave under OOL conditions and their limitations are mitigated through multimodal integration.
Prior work reveals three unresolved gaps: (i) the lack of real-world benchmarks that explicitly annotate OOL hazards, (ii) limited understanding of how closed-set perception models fail under such conditions, and (iii) the absence of quantitative frameworks for evaluating interpretability in safety-critical scenarios. The intersection of these gaps motivates the formulation of our task: to evaluate perception robustness, temporal consistency, and explainability within a unified multimodal pipeline by using the COOOL benchmark. Consequently, our experimental design integrates object detection, temporal driver-state modeling, and vision–language hazard captioning, which enable systematic analysis of failure modes and incremental robustness gains under standard out-of-distribution conditions.
Table 1. Summary of Research Gaps in AD Hazard Perception and Explainability.
Table 1. Summary of Research Gaps in AD Hazard Perception and Explainability.
Study/MethodData SourceUnseen TypeMetricsKey Limitations
[24] YOLO, Faster R-CNNCOCO, BDD100KNonemAP, Precision, RecallClosed-set; cannot detect unseen hazards
[25] Open-Set/OOD MethodsSynthetic DataUnknown ObjectsAUROC, OSCRLimited realism; weak temporal modeling
[26] Vision Language ModelsToyota Woven Traffic Safety datasetSemantic NoveltyCaption AccuracyNot evaluated for safety-critical driving
[27] Temporal Models (LSTM, GRU)Driving LogsBehavior AnomaliesAccuracy, F1No visual hazard localization
[28] Explainable AI MethodsSimulationModel UncertaintySaliency MapsMostly qualitative evaluation
[29] COOOL Baseline MethodsCOOOL DatasetRare HazardsDetection AccuracyNo unified temporal baseline
[30] COOOL Challenge WinnersCOOOL DatasetLong-tail HazardsHazard RecallLimited explainability analysis
[31] Segment ExtractionAI City Challenge Track2DescriptionsBLEU, Rouge-LNo validation for captions

3. Methodology

3.1. Overview of the Proposed Framework

This research integrates XAI and VLMs to address the challenges of OOL hazard detection and interpret decision-making in AD systems [32]. Figure 2 presents the end-to-end architecture of the proposed explainable AD framework of the COOOL benchmark.
The framework follows a multi-stage perception–reasoning pipeline that transforms raw dashcam video streams into human-understandable hazard explanations. It consists of object detection and OOL hazard localization by using YOLO11x, lane estimation via geometric Hough Transform analysis, vision–language scene description grounded in perception outputs, and temporal vehicle response state estimation by using LSTM-based modeling.

3.2. Dataset Description

The proposed OOL evaluation utilizes the COOOL benchmark dataset, which consists of 200 high-resolution dashcam videos which cover diverse and safety-critical driving scenarios. The dataset is exclusively evaluative and includes both common and rare hazards annotated by human experts, such as exotic animals, erratic objects, and unconventional road events [29]. A summary of dataset characteristics is provided in Table 2 while COOOL does not provide pixel-level lane masks.

3.3. Object Detection and OOL Hazard Localization

YOLO11x is used for object detection, pre-trained on the COCO dataset [33]. The detector outputs bound boxes, class labels, and confidence scores for all visible traffic participants, including vehicles, pedestrians, and rare hazards. Each video is decomposed into frames and processed independently by YOLO11x. For each detected object, bounding box center coordinates, dimensions, and confidence scores are computed as follows:
x ^ = σ ( x ^ center ) · grid _ size , y ^ = σ ( y ^ center ) · grid _ size
w ^ = w ^ pred · image _ width , h ^ = h ^ pred · image _ height
C ^ = σ ( C ^ pred )
p i = σ ( p ^ i )
C ^ i = C ^ · p i
The center coordinates of each detected object are computed using Equation (4) from the predicted bounding box, while Equation (5) defines the predicted bounding box width and height based on the model’s raw outputs. The sigmoid activation function σ is used in Equation (6) to constrain the output between 0 and 1, representing the confidence score prediction. For each object class i, the class probability is computed using Equation (7). The overall confidence score for each detected object in a frame is given by Equation (8). Objects that do not belong to the closed-set training categories are explicitly labeled as unknown, enabling OOL hazard identification.

3.4. Lane Estimation and Evaluation Metrics

The Hough Transform-based geometric approach is used for the lane estimation, which extracts dominant lane boundaries from edge-detected frames. This method provides light-weight and interpretable lane cues suitable for the monocular dashcam. The COOOL benchmark does not provide ground-truth lane masks or lane boundary annotations. Therefore, pixel-level IoU and segmentation accuracy metrics are not applicable. To quantitatively evaluate lane estimation quality, we therefore adopt geometry-base consistency metrics used in the absence of ground truth. Specifically, we compute the mean lateral deviation (MLD) between detected lane boundaries and the image centerline given in Equation (9):
Δ lane = 1 2 | x left x center | + | x right x center |
where x left and x right in Equation (9) denote the horizontal positions of the detected lane boundaries at a fix vertical reference line, and x center represents the image center. To evaluate temporal stability, we further compute the standard deviation of Δ lane across consecutive frames. Lower variance indicates smoother and more stable lane tracking under challenging conditions such as low illumination and partial occlusion.

3.5. Vision–Language Scene Description and Explainability

To enhance transparency and human interpretability, we integrated a vision–language scene description module that translates structured perception outputs into grounded natural-language explanations. The configuration of the module is summarized in Table 3.
The model operates strictly in inference mode and receives only machine-generated perception signals, including detected object categories, bounding box geometry, confidence scores, and lane geometry. To reduce hallucination risk, the model is explicitly constrained to describe only the supplied detections, while objects outside the closed-set taxonomy are labeled as unknown. Speculative reasoning beyond available evidence is discouraged through prompt design and low-temperature decoding. Formally, the scene description process is expressed in Equation (10):
D = GPT { x ^ i , y ^ i , w ^ i , h ^ i , p i , R i j } ,
where ( x ^ i , y ^ i , w ^ i , h ^ i , p i ) denote bounding box geometry and confidence for object i, and R i j represents spatial relations between detected objects. Description quality is evaluated using object coverage and factual consistency metrics, emphasizing correctness and grounding rather than linguistic fluency.

3.6. Temporal Vehicle Response State Estimation

Due to the absence of driver-facing sensor annotations in the COOOL benchmark, internal cognitive driver states are not inferred. Instead, a vehicle response state is modeled as a discrete temporal risk variable derived from observable ego–vehicle behavior. The response state at time t is defined in Equation (11):
r t R = { Normal , Decelerating , Stopped , Resume } .
For each frame t, perception outputs are represented in Equation (12):
x t = { c t ( i ) , b t ( i ) , s t ( i ) , Δ t ( i ) } i = 1 N t ,
where c t ( i ) denotes the object category, b t ( i ) the bounding box geometry, s t ( i ) the detection confidence, and Δ t ( i ) the relative displacement from the image center.
A temporal input window is defined in Equation (13):
X t = { x t k + 1 , , x t } , k = 16 .
Temporal dependencies are captured using a Long Short-Term Memory (LSTM) network as shown in Equation (14):
h t = LSTM ( X t ) , r ^ t = arg max r R p ( r h t ) .
Ground-truth labels r t are obtained from COOOL hazard–response annotations indicating observable vehicle reactions. These labels reflect vehicle-level behavioral changes rather than human driver intent.

3.7. Experimental Setup and Reproducibility

Experiments were conducted on an NVIDIA RTX 4090 GPU, 24 GB VRAM, of Intel Core i9 CPU, and 64 GB of system RAM. The software environment ran Ubuntu 22.04 LTS with CUDA 12 . x and cuDNN enabled. Python 3.10 was used in all experiments with PyTorch 2.1, Ultralytics YOLO11x, OpenCV 4.9 , NumPy 1.26 , scikit-learn 1.4 , and the OpenAI API for vision–language. All COOOL benchmark videos were processed at a rate of 25–30 FPS without frame skipping. Temporal modeling components operated on fixed-length sliding windows of 16 consecutive frames. Performance metrics, including object detection accuracy, lane stability measures, and temporal response-state indicators, were computed over the full dataset following the official COOOL evaluation protocol. The YOLO11x detector was evaluated using default inference parameters of confidence t h r e s h o l d = 0.25, non-maximum suppression I o U = 0.45, i n p u t r e s o l u t i o n = 1280 × 720) without dataset-specific tuning. Vision–language scene descriptions were generated by using the gpt-4o-mini model with a fixed temperature of 0.2 and a maximum output length of 150 tokens. The language model is used exclusively to interpret the analysis and to not influence the perception and decision-making control. To ensure reproducibility, all hyper-parameters, pre-processing steps, and evaluations were fixed across the experiment.

4. Results

Both quantitative and qualitative results obtained on the COOOL benchmark are presented in this section, following the official evaluation protocol. Structural analysis assesses closed-set perception in terms of robustness, temporal consistency, lane estimation stability, anomaly detection capability, and interpretability under OOL conditions. All quantitative results are computed over the complete COOOL benchmark. Object detection metrics are reported at the frame level, while temporal and anomaly-related metrics are aggregated over sliding windows of 16 consecutive frames. Closed-set object detection performance is measured by using mean Average Precision at an IoU threshold of 0.5 for common object categories. Robustness to rare and OOL hazards is evaluated by using recall over annotated hazard instances. Temporal anomaly detection performance is assessed using the Area Under the Receiver Operating Characteristic curve (AUROC). Lane estimation stability is quantified by mean lateral deviation from detected lane boundaries, reported separately for daylight and low-light conditions. Vision–language interpretability is evaluated using human-verified object coverage and factual consistency scores computed from structured scene descriptions generated for each video. All metrics are reported as dataset-level aggregates to ensure statistical reliability.
To evaluate the limitations of closed-set perception under open-world driving scenarios, YOLO11x was applied to the COOOL benchmark without dataset-specific retraining. As shown in Figure 3, the detector achieved an m A P @ 0.5 of 54.1 % on common in-distribution object categories. Performance degrades notably for rare and OOL hazards, with hazard-level recall reaching 72.6 % . Hazards frequently appear under long-range, low-illumination conditions, resulting in intermittent localization and missed detections across frames. Comparison against a closed-set YOLO baseline highlights the difficulty of OOL perception and empirically confirms the limited robustness of conventional detectors in open-world settings. Temporal consistency is evaluated by aggregating hazard confidence scores across consecutive frames for all videos. Figure 4 reports the mean and standard deviation of hazard confidence trajectories over 200 videos.
While frame-wise recall remains a comparable 71.8 % , temporal modeling reduces the frame-to-frame instability and improves alignment with annotated hazard response intervals. The shaded confidence bands illustrate inter-video variability, demonstrating that temporal context contributes primarily to stability and robustness rather than large gains in instantaneous detection accuracy.
Lane estimation performance was assessed using mean lateral deviation from detected lane boundaries. As illustrated in Figure 5, daylight scenes achieved a mean deviation of 8.9 pixels, while low-light and nighttime scenes exhibited an increased deviation of 13.4 pixels.
Distributional analysis reveals higher variance under low illumination, reflecting the sensitivity of classical Hough-based lane detection to visual degradation. Despite this degradation, lane estimates remain sufficiently stable to provide structural context for hazard interpretation.
Temporal anomaly detection capability is evaluated by using AUROC, measuring the discrimination between nominal driving behavior and hazard-induced temporal patterns. As shown in Figure 6, the model achieves an AUROC of 0.65 , exceeding random chance while remaining below thresholds required for safety-critical decision-making.
This result indicates that visual temporal cues alone provide partial but meaningful information for identifying evolving risk states. Interpretability is evaluated using GPT-4o scene descriptions generated from the structural perception outputs. Object coverage and factual consistency were assessed via human verification across all videos. As shown in Figure 7, the descriptions achieved an object coverage score of 0.72 and factual consistency of 78 % .
Error bars indicate annotator variability and highlight that language-based explanations remain sensitive to upstream perception errors. The results demonstrate that vision–language models enhance transparency and contextual understanding. The complete multimodal pipeline operated at approximately 10–12 FPS on a single RTX 4090 GPU. While this throughput does not meet strict real-time control requirements, it supports near–real-time perception analysis, interpretability, and offline evaluation. However, detection of rare and OOL hazards proved substantially more challenging. Despite their explicit annotation in the dataset, these hazards often appear at long distances, under poor illumination, and ambiguous visual cues. As a result, hazard-level recall reached 72.6 % , with frequent missed detections and intermittent localization across consecutive frames. These results empirically confirm H1, demonstrating that closed-set detectors exhibit reduce robustness when exposed to novel and visually degraded hazards.
Figure 8 shows the frequency of object-related terms in GPT-4o-generated scene descriptions across the COOOL benchmark. This analysis serves as a diagnostic tool to verify the grounding of language outputs in perception inputs and is interpreted jointly with object coverage and factual consistency metrics reported in Table 4. Frequency alone should not be treated as evidence of recognition performance.

5. Discussion and Limitations

The quantitative results in Table 4 demonstrate that the proposed multimodal framework provides a coherent and interpretable perception pipeline for open-world driving scenarios, revealing important limitations that constrain its reliability for safety-critical autonomous control. While YOLO11x achieves moderate performance on common object categories with an m A P @ 0.5 = 54.1 % and reasonable recall for rare and OOL hazards of 72.6 % , the results confirm that closed-set detectors remain vulnerable to missed detections and intermittent localization under adverse conditions. This limitation is particularly evident in low-light and nighttime scenes, where lane estimation error increases from 8.9 to 13.4 pixels, potentially degrading spatial context for downstream decision-making. Consequently, the system is not considered robust enough for standalone deployment in safety-critical environments without additional redundancy. The temporal model components improve consistency rather than raw detection accuracy. The LSTM risk-state model yielded a hazard recall of 71.8 % comparable to frame-wise detection but primarily stabilizing hazard recognition across consecutive frames. Similarly, the temporal anomaly detector achieves an AUROC of 0.65 , indicating limited discrimination between nominal and hazard-induced behavioral patterns. These findings suggest that the vision-only temporal cues provide partial but insufficient information for precise risk-state estimation, especially when hazards are visually subtle, distant, and occluded. False negatives in such scenarios delay hazard awareness, while false positives lead to unnecessary braking of AV system interventions.
The vision–language scene description module enhances interpretability but introduces additional failure modes. Although object coverage of 0.72 and factual consistency of 78 % indicate that most detected elements are correctly verbalized, the language model remains sensitive to upstream perception errors and occasionally produces generic as well as incomplete descriptions under ambiguous visual conditions. Importantly, the GPT-4o module operates with non-negligible latency and relies on API-based inference, making it unsuitable for real-time closed-loop control. For this reason, it is explicitly restricted to post hoc explanation and monitoring; any direct use of language model outputs for decision-making would pose safety risks due to potential hallucinations, omissions, and delays in responses.
Finally, the proposed framework is limited by its reliance on monocular vision and the absence of sensor fusion. The COOOL benchmark does not provide radar, LiDAR, and vehicle telemetry, preventing validation of cross-modal consistency and redundancy. Future work must address these limitations through multi-sensor fusion, uncertainty-aware OOD detection, tighter latency control, and formal validation and verification of failure modes.
Beyond autonomous driving applications, the proposed multimodal anomaly recognition framework presents a valuable research direction for electroceutical systems, particularly in scenarios that require real-time identification of rare or unseen physiological events. Future studies will investigate extending the vision–language–temporal modeling paradigm to implantable and wearable electroceutical devices, where heterogeneous biosignals, contextual interpretation, and temporal dynamics must be jointly analyzed to ensure therapeutic safety and reliability. By adapting anomaly detection and explainable reasoning components to bioelectrical signal monitoring, such frameworks may enable closed-loop electroceutical platforms capable of autonomously adjusting stimulation parameters in response to abnormal or unexpected physiological conditions. This line of research offers a systematic pathway for translating AI-driven perception and robustness methodologies into clinically relevant electroceutical technologies, while maintaining compatibility with regulatory and safety evaluation frameworks.

6. Conclusions

This work presented a multimodal perception and interpretation framework designed to analyze open-world driving scenarios with an emphasis on rare hazards, temporal consistency, and explainability. Evaluation on the COOOL benchmark demonstrates that the proposed system achieves measurable performance across detection, temporal modeling, and language-based interpretation tasks. Specifically, the object detection module attained an m A P @ 0.5 of 54.1 % on common classes and a recall of 72.6 % for rare and out-of-label hazards, while the LSTM-based temporal risk model achieved a hazard recall of 71.8 % . Lane estimation accuracy showed a mean lateral deviation of 8.9 pixels in daylight conditions and 13.4 pixels under low-light and nighttime scenarios. Temporal anomaly detection yielded an AUROC of 0.65 , indicating moderate discrimination capability under open-world conditions. In terms of interpretability, the vision–language component achieved an object coverage score of 0.72 and factual consistency of 78 % , demonstrating that language outputs are largely grounded in visual perception while remaining sensitive to upstream detection errors. End-to-end system throughput ranged between 10 and 12 FPS on a single RTX 4090 GPU, highlighting current limitations for real-time deployment that would enable consistent offline evaluation. These results, summarized quantitatively in Table 4, indicate that the framework provides meaningful robustness analysis and diagnostic insight rather than production-level autonomy. Our proposed contribution of this work lies in providing a unified experimental framework for evaluating perception stability, temporal reasoning, and interpretability under rare and out-of-distribution driving conditions. While the system is not suitable for direct stand-alone autonomous control, it offers a reproducible benchmark and analysis tool for studying failure modes and transparency in safety-critical perception pipelines, thereby supporting future research toward more reliable and trustworthy autonomous systems.

Author Contributions

Conceptualization, F.M.; methodology, S.U.R. and A.M.; software, F.M. and S.U.R.; validation, A.M. and S.U.R.; formal analysis, A.M. and Y.-J.K.; investigation, Y.-J.K.; resources, Y.-J.K.; data curation, A.M.; writing—original draft preparation, F.M.; writing—review and editing, S.U.R. and A.M.; visualization, F.M. and S.U.R.; supervision, A.M. and Y.-J.K.; project administration, A.M. and Y.-J.K.; funding acquisition, Y.-J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation (NRF) and the Ministry of Science and ICT (MSIT) (RS-2024-00419269), Republic of Korea.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study is publicly available as the COOOL (Challenge Of Out-of-Label) benchmark dataset, accessible at https://github.com/alshami52/COOOL_benchmark?tab=readme-ov-file (accessed on 10 July 2025).

Acknowledgments

We used the OpenAI API (https://openai.com/api/, accessed on 1 March 2025) with the GPT-4o-mini model to generate scene descriptions from detected objects and lanes in video frames via the function generate_description_gpt. This approach enhances interpretability in autonomous driving by leveraging generative AI for scene understanding. Generative AI tools were also used solely for grammar checking and proofreading during manuscript preparation. The authors have reviewed and edited all outputs and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chai, Z.; Nie, T.; Becker, J. Autonomous Driving Changes the Future; Springer: Berlin/Heidelberg, Germany, 2021. [Google Scholar]
  2. Mehmood, A.; Muhammad, A.; Mehmood, F.; Song, W.C. Enhancing vehicle location prediction accuracy with road-aware rectification for multi-access edge computing applications. Mathematics 2024, 12, 3980. [Google Scholar] [CrossRef]
  3. Bathla, G.; Bhadane, K.; Singh, R.K.; Kumar, R.; Aluvalu, R.; Krishnamurthi, R.; Kumar, A.; Thakur, R.; Basheer, S. Autonomous vehicles and intelligent automation: Applications, challenges, and opportunities. Mob. Inf. Syst. 2022, 2022, 7632892. [Google Scholar] [CrossRef]
  4. Shabbir, A.; Cheema, A.N.; Ullah, I.; Almanjahie, I.M.; Alshahrani, F. Smart city traffic management: Acoustic-based vehicle detection using stacking-based ensemble deep learning approach. IEEE Access 2024, 12, 35947–35956. [Google Scholar] [CrossRef]
  5. Mehmood, A.; Mehmood, F.; Kim, J. Towards Explainable Deep Learning in Computational Neuroscience: Visual and Clinical Applications. Mathematics 2025, 13, 3286. [Google Scholar] [CrossRef]
  6. Mehmood, F.; Rehman, S.U.; Choi, A. Vision-AQ: Explainable Multi-Modal Deep Learning for Air Pollution Classification in Smart Cities. Mathematics 2025, 13, 3017. [Google Scholar] [CrossRef]
  7. Madhav, A.S.; Tyagi, A.K. Explainable Artificial Intelligence (XAI): Connecting artificial decision-making and human trust in autonomous vehicles. In Proceedings of the Third International Conference on Computing, Communications, and Cyber-Security: IC4S 2021, Ghaziabad, India, 30–31 October 2021; Springer: Berlin/Heidelberg, Germany, 2022; pp. 123–136. [Google Scholar]
  8. Hou, J.; Liu, S.; Bie, Y.; Wang, H.; Tan, A.; Luo, L.; Chen, H. Self-eXplainable AI for Medical Image Analysis: A Survey and New Outlooks. arXiv 2024, arXiv:2410.02331. [Google Scholar]
  9. Zhou, X.; Liu, M.; Zagar, B.L.; Yurtsever, E.; Knoll, A.C. Vision language models in autonomous driving and intelligent transportation systems. arXiv 2023, arXiv:2310.14414. [Google Scholar] [CrossRef]
  10. Ullah, I.; Inayat, T.; Ullah, N.; Alzahrani, F.; Khan, M.I. Clinical decision support system (CDSS) for heart disease diagnosis and prediction by machine learning algorithms: A systematic literature review. J. Mech. Med. Biol. 2023, 23, 2330001. [Google Scholar] [CrossRef]
  11. Sural, S.; Naren; Rajkumar, R. ContextVLM: Zero-Shot and Few-Shot Context Understanding for Autonomous Driving using Vision Language Models. arXiv 2024, arXiv:2409.00301. [Google Scholar]
  12. Wong, K.; Wang, S.; Ren, M.; Liang, M.; Urtasun, R. Identifying unknown instances for autonomous driving. In Proceedings of the Conference on Robot Learning, PMLR, Osaka, Japan, 30 October–1 November 2019; pp. 384–393. [Google Scholar]
  13. Aung, N.H.H.; Sangwongngam, P.; Jintamethasawat, R.; Shah, S.; Wuttisittikulkij, L. A review of LiDAR-based 3D object detection via deep learning Approaches towards robust connected and autonomous vehicles. IEEE Trans. Intell. Veh. 2024, 10, 526–547. [Google Scholar] [CrossRef]
  14. Pravallika, A.; Hashmi, M.F.; Gupta, A. Deep Learning Frontiers in 3D Object Detection: A Comprehensive Review for Autonomous Driving. IEEE Access 2024, 12, 173936–173980. [Google Scholar] [CrossRef]
  15. An, J. Route positioning system for campus shuttle bus service using a single camera. Electronics 2024, 13, 2004. [Google Scholar] [CrossRef]
  16. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
  17. Ullah, I.; Imran, A.; Ashfaq, A.; Adnan, M. Exploiting machine learning models for identification of heart diseases. J. Comput. Biomed. Inform. 2022, 3, 1–20. [Google Scholar] [CrossRef]
  18. Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2446–2454. [Google Scholar]
  19. Huang, X.; Cheng, X.; Geng, Q.; Cao, B.; Zhou, D.; Wang, P.; Lin, Y.; Yang, R. The apolloscape dataset for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 954–960. [Google Scholar]
  20. Thakur, A.; Mishra, S.K. An in-depth evaluation of deep learning-enabled adaptive approaches for detecting obstacles using sensor-fused data in autonomous vehicles. Eng. Appl. Artif. Intell. 2024, 133, 108550. [Google Scholar] [CrossRef]
  21. Chakravarthula, P.; D’Souza, J.A.; Tseng, E.; Bartusek, J.; Heide, F. Seeing with sound: Long-range acoustic beamforming for multimodal scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 982–991. [Google Scholar]
  22. Greer, R. Towards Safe, Human-Centered Autonomous Driving: Real-World Artificial Intelligence for Enhanced Situation Awareness and Transition Control. Ph.D. Thesis, University of California, San Diego, CA, USA, 2024. [Google Scholar]
  23. Greer, R.; Isa, J.; Deo, N.; Rangesh, A.; Trivedi, M.M. On salience-sensitive sign classification in autonomous vehicle path planning: Experimental explorations with a novel dataset. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 636–644. [Google Scholar]
  24. Olorunshola, O.; Jemitola, P.; Ademuwagun, A. Comparative study of some deep learning object detection algorithms: R-CNN, fast R-CNN, faster R-CNN, SSD, and YOLO. Nile J. Eng. Appl. Sci 2023, 1, 70–80. [Google Scholar] [CrossRef]
  25. Wang, K.; Ma, Q.; Shen, C.; Lu, J. Application of Uncertainty to Out-of-Distribution Detection for Autonomous Driving Perception Safety. IEEE Trans. Intell. Transp. Syst. 2025, 26, 11276–11293. [Google Scholar] [CrossRef]
  26. Zhang, R.; Wang, B.; Zhang, J.; Bian, Z.; Feng, C.; Ozbay, K. When language and vision meet road safety: Leveraging multimodal large language models for video-based traffic accident analysis. Accid. Anal. Prev. 2025, 219, 108077. [Google Scholar] [CrossRef] [PubMed]
  27. Nasr Azadani, M. Driving Behavior Analysis and Prediction for Safe Autonomous Vehicles. Ph.D. Thesis, Université d’Ottawa/University of Ottawa, Ottawa, ON, Canada, 2024. [Google Scholar]
  28. Kuznietsov, A.; Gyevnar, B.; Wang, C.; Peters, S.; Albrecht, S.V. Explainable AI for safe and trustworthy autonomous driving: A systematic review. IEEE Trans. Intell. Transp. Syst. 2024, 25, 19342–19364. [Google Scholar] [CrossRef]
  29. AlShami, A.K.; Kalita, A.; Rabinowitz, R.; Lam, K.; Bezbarua, R.; Boult, T.; Kalita, J. COOOL: Challenge Of Out-Of-Label A Novel Benchmark for Autonomous Driving. arXiv 2024, arXiv:2412.05462. [Google Scholar] [CrossRef]
  30. Xie, Z.; Ni, Z.; Yang, W.; Zhang, Y.; Chen, Y.; Zhang, Y.; Ma, X. A robust online multi-camera people tracking system with geometric consistency and state-aware re-id correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 7007–7016. [Google Scholar]
  31. Xuan, K.T.; Nguyen, K.N.; Ngo, B.H.; Xuan, V.D.; An, M.H.; Dinh, Q.V. Divide and conquer boosting for enhanced traffic safety description and analysis with large vision language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 7046–7055. [Google Scholar]
  32. Dong, J.; Chen, S.; Miralinaghi, M.; Chen, T.; Li, P.; Labi, S. Why did the AI make that decision? Towards an explainable artificial intelligence (XAI) for autonomous driving systems. Transp. Res. Part C Emerg. Technol. 2023, 156, 104358. [Google Scholar] [CrossRef]
  33. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014, Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V 13; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
Figure 1. Overview of the proposed multimodal perception and explainability pipeline for OOL hazard analysis on the COOOL benchmark.
Figure 1. Overview of the proposed multimodal perception and explainability pipeline for OOL hazard analysis on the COOOL benchmark.
Applsci 16 01503 g001
Figure 2. MultimodalVisual–Language Pipeline for Explainable Autonomous Driving Scene Understanding.
Figure 2. MultimodalVisual–Language Pipeline for Explainable Autonomous Driving Scene Understanding.
Applsci 16 01503 g002
Figure 3. COOOL benchmark object detection performance.
Figure 3. COOOL benchmark object detection performance.
Applsci 16 01503 g003
Figure 4. Temporal hazard confidence consistency. The solid line represents the mean confidence across frames, while the shaded region denotes one standard deviation, illustrating the inter-video variability.
Figure 4. Temporal hazard confidence consistency. The solid line represents the mean confidence across frames, while the shaded region denotes one standard deviation, illustrating the inter-video variability.
Applsci 16 01503 g004
Figure 5. Distribution of mean lateral lane deviation under day-light and low-light conditions across the COOOL benchmark.
Figure 5. Distribution of mean lateral lane deviation under day-light and low-light conditions across the COOOL benchmark.
Applsci 16 01503 g005
Figure 6. ROC curve for temporal anomaly detection evaluation.
Figure 6. ROC curve for temporal anomaly detection evaluation.
Applsci 16 01503 g006
Figure 7. Human-evaluation of vision–language description quality metrics. Error bars indicate annotator variability for object coverage and factual consistency.
Figure 7. Human-evaluation of vision–language description quality metrics. Error bars indicate annotator variability for object coverage and factual consistency.
Applsci 16 01503 g007
Figure 8. Object detection and lane estimation in a sample dashcam frame. Green bounding boxes (YOLO11x) highlight identified objects, and blue lines (Hough Transform) mark detected lanes.
Figure 8. Object detection and lane estimation in a sample dashcam frame. Green bounding boxes (YOLO11x) highlight identified objects, and blue lines (Hough Transform) mark detected lanes.
Applsci 16 01503 g008
Table 2. Summary of COOOL Benchmark Dataset Characteristics.
Table 2. Summary of COOOL Benchmark Dataset Characteristics.
AttributeDescription
Number of Videos200 high-resolution dashcam videos
Total Frames26,725
Annotated Frames26,724 (manually annotated)
Annotation TypesCommon hazards, rare hazards, lane markings, objects
Hazard CategoriesExotic animals, erratic objects, conventional road threats
Environmental ConditionsDaylight, low-light, nighttime, adverse weather
PurposeEvaluation-only benchmark for OOL detection
Application FocusVision-based and explainable AI for AD systems
Table 3. Configuration of the Vision–Language Scene Description Module.
Table 3. Configuration of the Vision–Language Scene Description Module.
ComponentDescription
ModelGPT-4o-mini (inference-only)
InputObject labels, bbox geometry, confidence, lane data
PromptSystem + user prompts for OOL hazard focus
TokensMax 150 per description
Temperature0.2 (deterministic)
Hallucination ControlOnly supplied detections; OOL = unknown
OutputGrounded natural-language description
EvaluationObject coverage + factual consistency
Table 4. Performance of the proposed multimodal perception framework on the COOOL benchmark. Metrics are reported for object detection, temporal hazard modeling, lane estimation, and interpretability.
Table 4. Performance of the proposed multimodal perception framework on the COOOL benchmark. Metrics are reported for object detection, temporal hazard modeling, lane estimation, and interpretability.
ModuleMetricResult
YOLO11x Object Detection (Common Classes)mAP@0.5 (%)54.1
Rare/OOL Hazard DetectionRecall (%)72.6
Temporal Risk Modeling (LSTM)Hazard Recall (%)71.8
Lane Detection (Daylight Conditions)Mean Lateral Deviation (pixels)8.9
Lane Detection (Low-Light/Night)Mean Lateral Deviation (pixels)13.4
Temporal Anomaly DetectionAUROC0.65
GPT-4o Scene DescriptionObject Coverage Score0.72
GPT-4o Scene DescriptionFactual Consistency (%)78
End-to-End System PerformanceThroughput (FPS)10–12
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Mehmood, F.; Rehman, S.U.; Mehmood, A.; Kim, Y.-J. Unseen Hazard Recognition in Autonomous Driving Using Vision–Language and Sensor-Based Temporal Models. Appl. Sci. 2026, 16, 1503. https://doi.org/10.3390/app16031503

AMA Style

Mehmood F, Rehman SU, Mehmood A, Kim Y-J. Unseen Hazard Recognition in Autonomous Driving Using Vision–Language and Sensor-Based Temporal Models. Applied Sciences. 2026; 16(3):1503. https://doi.org/10.3390/app16031503

Chicago/Turabian Style

Mehmood, Faisal, Sajid Ur Rehman, Asif Mehmood, and Young-Jin Kim. 2026. "Unseen Hazard Recognition in Autonomous Driving Using Vision–Language and Sensor-Based Temporal Models" Applied Sciences 16, no. 3: 1503. https://doi.org/10.3390/app16031503

APA Style

Mehmood, F., Rehman, S. U., Mehmood, A., & Kim, Y.-J. (2026). Unseen Hazard Recognition in Autonomous Driving Using Vision–Language and Sensor-Based Temporal Models. Applied Sciences, 16(3), 1503. https://doi.org/10.3390/app16031503

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop