Article

A Novel Hybrid Deep Learning–Probabilistic Framework for Real-Time Crash Detection from Monocular Traffic Video

by Reşat Buğra Erkartal 1,* and Atınç Yılmaz 2
1 Department of Software Engineering, Faculty of Engineering, Istanbul Topkapı University, Altunizade İstanbul 34662, Türkiye
2 Department of Computer Engineering, Faculty of Engineering and Architecture, Istanbul Beykent University, Ayazağa İstanbul 34398, Türkiye
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(19), 10523; https://doi.org/10.3390/app151910523
Submission received: 1 September 2025 / Revised: 21 September 2025 / Accepted: 22 September 2025 / Published: 29 September 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

The rapid evolution of autonomous vehicle technologies has amplified the need for crash detection that operates robustly under complex traffic conditions with minimal latency. We propose a hybrid temporal hierarchy that augments a Region-based Convolutional Neural Network (R-CNN) with an adaptive time-variant Kalman filter (with total-variation prior), a Hidden Markov Model (HMM) for state stabilization, and a lightweight Artificial Neural Network (ANN) for learned temporal refinement, enabling real-time crash detection from monocular video. Evaluated on simulated traffic in CARLA and real-world driving in Istanbul, the full temporal stack achieves the best precision–recall balance, yielding 83.47% F1 offline and 82.57% in real time (corresponding to 94.5% and 91.2% detection accuracy, respectively). Ablations are consistent and interpretable: removing the HMM reduces F1 by 1.85–2.16 percentage points (pp), whereas removing the ANN has a larger impact of 2.94–4.58 pp, indicating that the ANN provides the largest marginal gains—especially under real-time constraints. The transition from offline to real time incurs a modest overall loss (−0.90 pp F1), driven more by recall than precision. Compared to strong single-frame baselines, YOLOv10 attains 82.16% F1 and a real-time Transformer detector reaches 82.41% F1, while our full temporal stack remains slightly ahead in real time and offers a more favorable precision–recall trade-off. Notably, integrating the ANN into the HMM-based pipeline improves accuracy by 2.2%, while the time-variant Kalman configuration reduces detection lag by approximately 0.5 s—an improvement that directly addresses the human reaction time gap. Under identical conditions, the best RCNN-based configuration yields AP@0.50 ≈ 0.79 with an end-to-end latency of 119 ± 21 ms per frame (~8–9 FPS). Overall, coupling deep learning with probabilistic reasoning yields additive temporal benefits and advances deployable, camera-only crash detection that is cost-efficient and scalable for intelligent transportation systems.

1. Introduction

The rapid advancement of autonomous vehicle (AV) technologies has initiated a paradigm shift in modern transportation, positioning AVs as central components of emerging smart mobility systems. These systems, enabled by advanced sensors, artificial intelligence (AI) algorithms, and machine learning models, are designed to reduce human error, optimize traffic flow, and improve overall road safety. In doing so, they are reshaping the dynamics of urban mobility and influencing the future design of transportation infrastructure.
A fundamental requirement for autonomous driving is reliable perception of the surrounding environment, which forms the basis for situational awareness and real-time decision-making. Core driving tasks such as lane keeping, braking, and obstacle avoidance depend heavily on accurate perception capabilities. To achieve this, AVs employ a combination of sensing technologies, including monocular and stereo cameras, millimeter-wave radar, and Light Detection and Ranging (LiDAR) systems [1]. Cameras provide rich semantic information similar to human vision, radar performs reliably under adverse weather, and LiDAR offers precise depth estimation. Although sensor fusion improves robustness, the high cost and mechanical complexity of LiDAR limit its scalability. By contrast, camera-based approaches, particularly those enhanced with deep learning, have emerged as a more practical and cost-effective alternative.
The urgency of such technologies is underscored by global accident statistics. In 2010 alone, road traffic accidents claimed over 1.3 million lives and injured more than 50 million individuals worldwide [2]. Data from the National Highway Traffic Safety Administration (NHTSA) indicate that human error is responsible for nearly 93% of all crashes [3,4]. According to the World Health Organization (WHO), road traffic accidents caused approximately 1.19 million deaths worldwide in 2021, with an estimated 20 to 50 million non-fatal injuries [5]. These figures have intensified worldwide investment in AI-driven perception and control systems designed to minimize human involvement in high-risk driving situations.
To classify the extent of automation, the NHTSA defines levels ranging from Level 0 (no automation) to Level 4 (full automation under constrained conditions) [4]. Progression along this scale demands increasingly advanced perception and control modules capable of interpreting dynamic traffic scenarios with high precision and low latency.
Despite significant progress in the field, critical challenges remain. Many existing crash detection frameworks rely on static or manually defined transition probabilities within Markov models, limiting their ability to respond effectively to abrupt behavioral changes or nonlinear motion patterns in urban traffic. Similarly, conventional Kalman Filter implementations often struggle under sudden acceleration, short-term occlusion, or dense traffic conditions, reducing their reliability in real-world deployment.
This study addresses these limitations by introducing a hybrid framework that combines three complementary components: R-CNN for object detection, a time-variant Kalman Filter for improved motion estimation, and a Hidden Markov Model enhanced with Artificial Neural Networks (ANNs) for adaptive probabilistic reasoning. The proposed model dynamically updates the Markov transition matrix using real-time spatiotemporal data, thereby improving both temporal consistency and spatial robustness. This enables earlier and more reliable crash prediction, even in occluded or rapidly evolving traffic scenes.
The major contributions of this research can be summarized as follows:
  • A novel hybrid architecture that integrates R-CNN, a Hidden Markov Model, a time-variant Kalman Filter, and an ANN is proposed for accurate, low-latency crash detection using monocular camera input.
  • The model adaptively updates transition probabilities in the Markov framework using a grid-based strategy, enabling real-time responsiveness to sudden changes in traffic dynamics.
  • Unlike prior approaches that employ either static probabilistic models or standalone CNNs, the proposed system leverages the synergy of deep learning and probabilistic filtering for enhanced robustness.
  • A comprehensive dataset combining real-world dashcam footage from Istanbul with crash scenarios generated in the CARLA simulation environment has been developed to ensure realistic evaluation.
  • The framework achieves reduced detection latency while maintaining high accuracy, offering a cost-effective alternative to LiDAR-based crash detection systems.
The remainder of this article is organized as follows: Section 2 provides an overview of related work in vehicle detection, behavior modeling, and crash prediction. Section 3 details the proposed architecture, dataset design, and experimental setup. Section 4 presents the evaluation results from both simulated and real-world tests. Finally, Section 5 concludes with key findings and directions for future research.

2. Related Work

The increasing density of urban traffic networks, coupled with the rising number of vehicles, has made safety assurance and traffic management highly challenging. According to the National Highway Traffic Safety Administration (NHTSA), human error remains the primary cause of traffic accidents, contributing to nearly 93% of all cases [3]. This reality has positioned Intelligent Transportation Systems (ITS) as a critical research domain, with particular emphasis on detection, prevention, and mitigation mechanisms [6]. Over time, vehicle detection technologies have evolved from conventional sensor-based approaches to advanced systems leveraging computer vision, machine learning, and deep neural networks [7,8].
Early efforts relied on handcrafted descriptors and traditional feature-based extraction techniques. Methods such as Histogram of Oriented Gradients (HOG) [9], Haar-like features [10], and Local Binary Patterns (LBP) [11] demonstrated promising results in controlled conditions. However, their limited scalability and sensitivity to lighting variations and scene diversity reduced their robustness. Similarly, early accident detection approaches largely depended on in-vehicle sensors that monitored physical thresholds [12], which constrained their generalizability across diverse contexts.
A major transition occurred with the advent of learning-based detectors. Papageorgiou and Poggio [13] pioneered trainable systems for object detection, while advances in sensor technologies enabled multimodal detection using LiDAR, radar, and cameras [7]. Sensor fusion has since been recognized as an essential strategy for resilience under occlusion and adverse weather conditions, as highlighted by Wang et al. [7] and Galvão et al. [14]. The ensuing paradigm shift towards deep learning, particularly Convolutional Neural Networks (CNNs), revolutionized object detection, with profound implications for traffic surveillance and crash analysis [15]. Girshick et al. [16] proposed the R-CNN framework, which significantly improved accuracy. Subsequent refinements such as Fast R-CNN and Faster R-CNN advanced both precision and computational efficiency [6,17]. In parallel, single-stage detectors—YOLO [18] and SSD [19]—were designed for real-time operation. Later iterations, including YOLOv3 [20], EfficientDet [21], YOLOv4 [22], and YOLOv5 [23,24], further enhanced performance. Notably, hybrid implementations such as YOLOv5 combined with DeepLabv3 achieved accuracies exceeding 94%, with some configurations reporting up to 96% in traffic surveillance scenarios [25].
Crash detection based on monocular video has also gained momentum. Khan et al. [26] developed a CNN-based accident recognition system achieving 91.5% accuracy, while Sun et al. [27] proposed a vehicle behavior-based approach with a low false alarm rate. More recently, Generative Adversarial Networks (GANs) have been explored to address the scarcity of crash-event data. Studies by Kukreja et al. [28] and Patel [29] demonstrated that augmenting datasets with synthetic crash scenarios improves generalizability. Beyond augmentation, GANs have been applied to tasks such as super-resolution [30] and anomaly detection [31], reinforcing their value in traffic scene analysis.
Recent work has further emphasized hybrid and multimodal strategies. Rubaiyat et al. [32] reported improved accuracy by fusing LiDAR and camera data, while Wang et al. [33] proposed compact MEMS-based LiDAR suitable for embedded platforms. Other researchers explored hybrid networks: Moayed [34] integrated CNNs with Support Vector Machines (SVMs) for multi-view crash detection, achieving 90% accuracy, and CNN–Vision Transformer architectures achieved 93.8% mAP under adverse weather conditions [35]. In addition, edge computing and model compression have emerged as crucial enablers of real-time deployment. Liang and Zhang [36] introduced MEC-YOLO (Mobile Edge Computing–You Only Look Once) for edge–cloud integration, whereas Gupta et al. [37] and Kabir et al. [38] applied pruning, quantization, and transfer learning to maintain accuracy while improving efficiency.
Beyond CNN-based systems, graph neural networks (GNNs) and attention mechanisms have recently been employed for spatiotemporal modeling. Li et al. [39] proposed a dynamic perceptual GNN that integrates temporal and spatial dependencies using a location-attention mechanism, enabling effective aggregation of traffic flow data from neighboring road segments. Similarly, Huo et al. [40] improved the Attention-Based Spatial–Temporal Graph Convolutional Network (ASTGCN), demonstrating superior accuracy in traffic flow prediction, with reported MAE, RMSE, and mAP values of 8.62, 14.07, and 0.2402, respectively.
Despite significant progress, several limitations remain. Detection robustness is still influenced by occlusion, adverse weather, and irregular motion patterns [7]. In addition, human-reported accident datasets are often statistically limited [8], while deep learning models remain constrained by the computational costs of real-time deployment. Although the literature highlights substantial improvements in detection accuracy and speed, relatively few studies integrate deep learning with probabilistic motion models and adaptive decision frameworks in real time. A summary of representative studies is provided in Table 1, which consolidates recent approaches, their methodologies, and reported performance.
Reported mAP values in the literature span a wide range due to substantial differences in evaluation setups. Variability stems from dataset heterogeneity (COCO vs. KITTI vs. bespoke crash corpora), class definitions and difficulty levels, IoU thresholds and averaging protocols (single-class vs. multi-class), and the use of distinct detector variants/backbones, preprocessing, and test-time augmentations. Environmental conditions (illumination, weather, camera pose) and reporting practices (best-of-run vs. averaged scores) further widen the spread. Consequently, absolute numbers across studies should be interpreted comparatively rather than as directly commensurate baselines.
This study seeks to address these gaps by presenting a hybrid framework that combines R-CNN, a time-variant Kalman Filter, and a Markov Model enhanced with an Artificial Neural Network (ANN). By coupling deep learning with probabilistic reasoning, the proposed model achieves earlier and more reliable crash prediction using only monocular camera input.

3. Methodology

This study proposes a hybrid framework that integrates Region-based Convolutional Neural Networks (R-CNN), a time-variant Kalman Filter, a Hidden Markov Model (HMM), and an Artificial Neural Network (ANN) to enable real-time crash detection from monocular camera data. The overall system design is shown in Figure 1.
The architecture combines deterministic object recognition, probabilistic motion estimation, and neural decision-making into a unified pipeline. Each image frame, with a resolution of 480 × 640 pixels, is first processed by an R-CNN detector, which extracts vehicle bounding boxes and their coordinates. To mitigate noise caused by environmental variation, these coordinates are refined using a Kalman Filter. The filtered outputs are projected onto adaptive grids that discretize the scene spatially, enabling motion sequences to be represented as state transitions. These transitions are then modeled by a Hidden Markov Model, and the resulting probabilities are further interpreted by an ANN to determine the likelihood of a crash event. By combining spatial precision with temporal reasoning, the framework provides both robust tracking and reliable event detection.
Because precise ground-truth positions cannot be obtained directly from monocular images, the CARLA simulation platform was initially used for validation. Vehicles in simulation were equipped with LiDAR sensors, providing accurate positional data against which Kalman estimates could be benchmarked. Building on our preliminary conference study, which first introduced the R-CNN–HMM–ANN pipeline and validated its feasibility in CARLA simulation with LiDAR ground truth [41], the present work extends this line by systematically comparing three Kalman configurations—a time-invariant kernel, a 10 × 10 time-variant kernel, and a 20 × 20 time-variant kernel—and selecting the optimal setting for subsequent real-world evaluation.

3.1. R-CNN-Based Object Detection

The detection module is based on Region-based Convolutional Neural Networks (R-CNN). The full image is first passed through a convolutional backbone to generate a feature map. Candidate regions, produced by algorithms such as Selective Search, are aligned with this feature map. A Region of Interest (RoI) pooling layer then resizes each proposal into a fixed-length feature vector, which is forwarded to fully connected layers for classification and bounding-box regression [42].
This approach reduces redundant computations, increases efficiency, and supports end-to-end training, allowing convolutional filters to adapt specifically to vehicle detection tasks. Unlike earlier multi-stage pipelines, R-CNN jointly performs localization and classification, which lowers latency while preserving accuracy. In practice, the backbone network is pretrained on large-scale datasets and subsequently fine-tuned with vehicle-specific data, while additional classifiers and regressors are optimized for precise bounding-box adjustment. Figure 2 summarizes the R-CNN workflow.
In this study, R-CNN serves as the principal detection method due to its balance between accuracy and computational cost. To evaluate generalizability, alternative pretrained detectors can also be integrated and benchmarked within the same tracking framework.
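For illustration, the sketch below shows how a ResNet-50-backed detector of this family could be adapted to single-class vehicle detection and queried per frame. It is a minimal Python/torchvision example given purely for exposition, not the MATLAB implementation used in this study; the Faster R-CNN variant, weights option, and score threshold are illustrative assumptions.
```python
# Minimal sketch (assumed PyTorch/torchvision stack, not the authors' MATLAB code):
# adapting a ResNet-50-backed Faster R-CNN to single-class vehicle detection.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 2  # background + vehicle

def build_vehicle_detector():
    # Backbone pretrained on large-scale data; the box head is replaced so it
    # can be fine-tuned on vehicle-specific annotations.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
    return model

def detect_vehicles(model, frame_tensor, score_thresh=0.5):
    # frame_tensor: 3xHxW float tensor in [0, 1] (e.g., a 480 x 640 RGB frame).
    model.eval()
    with torch.no_grad():
        out = model([frame_tensor])[0]
    keep = out["scores"] >= score_thresh
    return out["boxes"][keep], out["scores"][keep]  # boxes as [x1, y1, x2, y2]
```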

3.2. Grid-Based Spatial Modeling

To support motion reasoning, each frame is divided into adaptive grids (Figure 3). The number and shape of the grids are dynamically adjusted according to vehicle speed, ensuring that rapid motion is represented with higher sensitivity. A non-uniform vertical division is applied to account for perspective, giving greater weight to objects close to the camera, while distant vehicles are modeled with coarser resolution.
In this configuration, the frame is divided into a 4 × 3 grid, ensuring that no vehicle spans more than one-quarter of the frame. Prior to adopting this setting, alternative grid layouts such as 3 × 3 and 5 × 4 were evaluated. The 3 × 3 division often resulted in multiple vehicles occupying the same cell, which reduced the precision of state assignments. By contrast, the 5 × 4 partition improved spatial sensitivity but introduced additional computational load, lowering the frame rate in real-time experiments. The 4 × 3 configuration provided the most balanced trade-off between accuracy and efficiency, and was therefore selected for the subsequent analyses (Table 2). Grid boundaries are defined geometrically, and pixel-based measurements are converted into real-world scales using reference object dimensions. For instance, if a vehicle height of four meters corresponds to 300 pixels, the resulting scale factor is 0.0133 m/pixel. Each grid point is further adjusted by a relative velocity factor, defined as the reciprocal of ground speed. This design allows the grid to adapt its sensitivity dynamically, providing more accurate detection when motion is abrupt or nonlinear.
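A minimal sketch of this grid assignment is given below (Python); the 4 × 3 layout and the 4 m/300 px scale example follow the text, whereas the specific non-uniform row boundaries are assumed values chosen only to illustrate the perspective-aware split.
```python
# Illustrative grid assignment for a 480 x 640 frame with a 4 x 3 layout and a
# perspective-aware (non-uniform) vertical split. Row boundary fractions are
# assumptions for exposition, not the tuned values used in the experiments.
FRAME_W, FRAME_H = 640, 480
N_COLS = 4
ROW_BOUNDS = [0.0, 0.55, 0.78, 1.0]   # finer rows near the bottom (close range)

def pixel_scale(ref_height_m=4.0, ref_height_px=300.0):
    # e.g., a 4 m vehicle spanning 300 px gives ~0.0133 m/pixel.
    return ref_height_m / ref_height_px

def grid_cell(box):
    # box = [x, y, w, h] in pixels (upper-left corner plus size).
    cx = box[0] + box[2] / 2.0
    cy = box[1] + box[3] / 2.0
    col = min(int(cx / (FRAME_W / N_COLS)), N_COLS - 1)
    row = next(i for i in range(len(ROW_BOUNDS) - 1)
               if cy < ROW_BOUNDS[i + 1] * FRAME_H)
    return row, col   # (row, col) becomes the discrete state used by the HMM

print(grid_cell([300, 400, 80, 60]))  # -> (2, 2) for this example box
```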

3.3. Markov Model and ANN Decision

The probabilistic reasoning stage employs a Hidden Markov Model (HMM), where vehicle placements in grid cells are treated as states and their temporal transitions represent probable trajectories. A defining property of Markov models is that the probability of moving to the next state depends only on the current state [43]. While this assumption simplifies modeling, conventional approaches with static transition probabilities cannot accommodate abrupt acceleration, sudden stops, or complex maneuvers in dense traffic. To address this, the transition matrix in our framework is dynamically updated based on observed spatiotemporal changes. This ensures resilience to occlusion, sudden speed variation, and nonlinear trajectories. Prior studies have applied Markov models to domains such as language modeling and time-series forecasting [44]; here, adapting the transition probabilities to real-time traffic dynamics extends their applicability to crash prediction. Although our framework does not employ Markov Chain Monte Carlo (MCMC), the conceptual link highlights how probabilistic reasoning can approximate complex transitions in uncertain environments [45].
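As a minimal sketch of this adaptive update, the snippet below maintains a row-stochastic transition matrix and nudges it toward each newly observed grid-to-grid transition. The exponential-moving-average rule and the smoothing factor are assumptions for illustration; the paper's grid-based update strategy may differ in detail.
```python
# Sketch of an online, adaptive HMM transition-matrix update over 4 x 3 = 12
# grid states. ALPHA is an assumed smoothing factor, not a tuned parameter.
import numpy as np

N_STATES = 12
ALPHA = 0.05

# Row-stochastic transition matrix, initialised uniformly.
A = np.full((N_STATES, N_STATES), 1.0 / N_STATES)

def update_transition_matrix(prev_state, curr_state):
    # Move the row for prev_state toward the observed transition while keeping
    # it a valid probability distribution (rows always sum to one).
    target = np.zeros(N_STATES)
    target[curr_state] = 1.0
    A[prev_state] = (1.0 - ALPHA) * A[prev_state] + ALPHA * target
    return A
```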
The output probabilities of the HMM are passed to an ANN for final decision-making. The ANN is structured with two hidden layers containing 50 and 4 neurons, followed by a softmax-based output layer that classifies events as “crash” or “non-crash.” The output weights refine probabilistic inputs from the HMM, yielding real-time, actionable predictions. Formally, a neuron can be expressed as
y = \sum_{i} w_i x_i + b
where y is the output of the neuron, w_i are the weights associated with each input, x_i are the inputs to the neuron, and b is the bias term. Extending to a network with multiple layers, the activation at layer l is given as
a^{(l)} = f\left( W^{(l-1)} a^{(l-1)} + b^{(l-1)} \right)
The network is trained via backpropagation to minimize a loss function, aligning predictions with true outcomes. Importantly, the ANN is pretrained using the same dataset as the R-CNN detector, ensuring that the decision module benefits from shared visual representations while refining the final crash classification. This combination strengthens detection reliability and reduces latency compared to models based solely on deep learning or probabilistic reasoning.
In designing the ANN component, multiple configurations were systematically tested to balance accuracy with real-time feasibility. Architectures with one, two, and three hidden layers were compared under identical training conditions. The two-hidden-layer configuration with 50 and 4 neurons yielded the best trade-off, providing sufficient representational power to capture nonlinear spatiotemporal transitions while avoiding excessive computational overhead. Training was performed using the Adam optimizer with a learning rate of 0.001 and cross-entropy loss, and early stopping was applied to mitigate overfitting. Dropout with a rate of 0.3 was also employed in preliminary tests, but negligible improvements were observed; therefore, the final model omitted dropout for efficiency. The chosen architecture consistently outperformed both simpler and deeper variants in validation accuracy, confirming its suitability for real-time crash detection.
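A compact sketch of this decision layer is shown below (Python/PyTorch rather than the MATLAB toolchain used in the experiments). The 50- and 4-neuron hidden layers, Adam optimizer at a 0.001 learning rate, and cross-entropy loss follow the text; the input feature dimension is an assumption, since it depends on how the HMM outputs are vectorized.
```python
# Sketch of the ANN decision layer: two hidden layers (50 and 4 neurons) and a
# two-class output. IN_DIM is an assumed feature length for the HMM-derived input.
import torch
import torch.nn as nn

IN_DIM = 12

class CrashANN(nn.Module):
    def __init__(self, in_dim=IN_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 50), nn.ReLU(),
            nn.Linear(50, 4), nn.ReLU(),
            nn.Linear(4, 2),           # logits for "crash" / "non-crash"
        )

    def forward(self, x):
        return self.net(x)             # CrossEntropyLoss applies softmax internally

model = CrashANN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(features, labels):
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```
Early stopping on a held-out validation split, as described above, would wrap this training step; dropout is omitted to mirror the final configuration.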

3.4. Proposed Model

This study incorporates a refined Kalman Filter with explicit acceleration modeling to enable robust multi-object tracking in dynamic traffic environments. Beyond smoothing noisy detections, the Kalman Filter’s strength lies in its ability to estimate latent motion parameters that cannot be directly observed. This property is particularly critical for monocular vision-based crash detection, where vehicle speed, trajectory curvature, and scale variations are difficult to infer accurately from raw frames.
The conventional Kalman formulation assumes constant velocity, which is insufficient under real-world driving conditions involving frequent acceleration, deceleration, and abrupt maneuvers. To overcome this limitation, the state vector was extended to include acceleration terms:
\mathrm{state} = \left[\, x,\ v_x,\ a_x,\ y,\ v_y,\ a_y,\ w,\ v_w,\ a_w,\ h,\ v_h,\ a_h \,\right]^{T}
Here, x and y denote the positional coordinates, v the velocity components, and a the corresponding accelerations, while w and h represent the bounding-box dimensions with their respective dynamics. This extension allows second-order motion to be captured, enabling precise tracking of vehicles undergoing rapid speed or scale fluctuations.
The transition matrix A is adjusted to account for acceleration, while the process noise covariance Q is expressed as a function of elapsed time (dt):
G_{1d} = \left[\, \tfrac{dt^2}{2};\ dt;\ 1 \,\right]
Q_{1d} = G_{1d} G_{1d}^{T}
Q = \mathrm{diag}\left[\, Q_{1d};\ Q_{1d};\ Q_{1d};\ Q_{1d} \,\right]
This formulation enables the filter to adaptively update its uncertainty estimates in response to variations in acceleration, mitigating error accumulation during rapid braking or sudden maneuvers—scenarios often associated with imminent collisions. Unlike constant-velocity models, the proposed design improves both accuracy and stability under real traffic dynamics.
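For concreteness, the sketch below builds the constant-acceleration transition and process-noise blocks from the elapsed time dt, which makes F and Q time-variant when dt is measured per frame. It is a NumPy illustration under stated assumptions (the acceleration-noise scale sigma_a and the block ordering are placeholders), not the exact filter tuning used in the experiments.
```python
# Sketch of the acceleration-aware, time-variant Kalman matrices: one 3-state
# block [p, v, a] per tracked quantity (x, y, w, h). sigma_a is an assumed
# acceleration-noise scale.
import numpy as np
from scipy.linalg import block_diag

def f_block(dt):
    # Constant-acceleration transition for a single quantity [p, v, a].
    return np.array([[1.0, dt, 0.5 * dt**2],
                     [0.0, 1.0, dt],
                     [0.0, 0.0, 1.0]])

def q_block(dt, sigma_a=1.0):
    # Q_1d = G_1d G_1d^T, scaled by the assumed acceleration noise variance.
    G = np.array([[0.5 * dt**2], [dt], [1.0]])
    return sigma_a**2 * (G @ G.T)

def predict(x, P, dt):
    # dt is re-measured every frame, so F and Q adapt to the actual frame timing.
    F = block_diag(*[f_block(dt)] * 4)
    Q = block_diag(*[q_block(dt)] * 4)
    return F @ x, F @ P @ F.T + Q
```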
For initialization, a stricter spatial threshold of 30 pixels (instead of the common 50 pixels) was adopted to enhance sensitivity to vehicles at different distances. Newly detected objects are assigned tentative tracks and promoted to confirmed tracks only after at least three successful associations across five consecutive frames. This strategy reduces false positives while maintaining responsiveness. In addition, a coasting mechanism permits up to five consecutive missed detections, preserving valid tracks during short occlusions caused by lane changes or overtaking maneuvers.
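The snippet below sketches this track lifecycle (association gate, tentative-to-confirmed promotion, and coasting). The data structure and the simplified confirmation window are assumptions for exposition; the actual association step would typically match detections against the Kalman-predicted positions.
```python
# Illustrative track-management policy: a 30-pixel association gate, promotion
# after three associations within the first five frames, and coasting through up
# to five consecutive misses. The simplified confirmation window is an assumption.
from dataclasses import dataclass

GATE_PX = 30          # maximum detection-to-track distance for an association
CONFIRM_HITS = 3      # associations required ...
CONFIRM_WINDOW = 5    # ... within this many frames to confirm a tentative track
MAX_COAST = 5         # tolerated consecutive missed detections

@dataclass
class Track:
    centroid: tuple
    age: int = 0
    hits: int = 0
    misses: int = 0
    confirmed: bool = False

def update_track(track, matched_centroid):
    """matched_centroid is None when no detection falls within GATE_PX."""
    track.age += 1
    if matched_centroid is not None:
        track.centroid = matched_centroid
        track.hits += 1
        track.misses = 0
        if (not track.confirmed and track.age <= CONFIRM_WINDOW
                and track.hits >= CONFIRM_HITS):
            track.confirmed = True        # tentative -> confirmed
    else:
        track.misses += 1                 # coasting through a short occlusion
    return track.misses <= MAX_COAST      # False means the track is dropped
```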
The refined bounding box coordinates, velocities, and accelerations generated by the adaptive Kalman Filter serve as inputs to the subsequent reasoning stage (HMM + ANN). This coupling bridges low-level motion estimation with high-level crash inference, thereby reducing latency and improving predictive reliability.
In addition to mathematical refinements, several practical engineering adjustments were incorporated to ensure robustness in real-world deployment. The tighter spatial threshold improves near-range sensitivity, while the tentative-to-confirmed track policy and coasting mechanism safeguard continuity in dense traffic. Although heuristic, these design choices are essential for translating the theoretical advantages of acceleration-aware Kalman filtering into dependable field performance.
Figure 4 illustrates the modifications introduced, including the extended state transition matrix, updated process noise covariance, and revised thresholds for object-to-track assignment. The diagram highlights how the proposed filter departs from the conventional constant-velocity model and adapts effectively to sudden changes in vehicle dynamics, resulting in more stable tracking under urban traffic conditions.
In summary, the proposed adaptive Kalman Filter offers three key improvements over traditional constant-velocity models: (i) explicit modeling of acceleration enables reliable tracking during abrupt speed variations, (ii) a stricter initiation threshold with track confirmation logic reduces false positives while preserving responsiveness, and (iii) temporally consistent motion estimates with reduced noise provide a robust foundation for the subsequent Markov Model and ANN-based decision layers. Collectively, these enhancements strengthen the accuracy, stability, and real-time applicability of crash detection within intelligent transportation systems.
Fk, Bk, uk, Qk, Hk, Rk, and Δtk are treated as time-variant quantities. They must vary because real systems are nonstationary: sampling times fluctuate, motion regimes change (acceleration, braking, turning), and measurement quality is context-dependent. Adapting these parameters improves responsiveness, reduces lag, and increases robustness, which is especially critical in dynamic, occlusion-prone traffic scenes.

4. Experimental Setup

To rigorously assess the performance of the proposed framework, experiments were carried out in three complementary settings: (i) controlled simulation environments, (ii) curated crash video datasets, and (iii) real-world traffic recordings. This multi-stage design enabled the system to be validated under both synthetic and unconstrained conditions, ensuring robustness and generalizability.
Simulation Environment: The CARLA simulator was first used to validate the Kalman Filter-based tracking pipeline. CARLA is an open-source, high-fidelity platform widely adopted in autonomous driving research [46]. It provides access to ground-truth parameters such as vehicle position, velocity, and orientation, as well as sensor modalities including LiDAR. These features allow precise benchmarking of tracking performance against reference data. In addition, CARLA supports environmental variations such as lighting changes, precipitation, dense traffic, and pedestrian interactions, making it suitable for systematic stress-testing before real-world deployment.
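For reference, the sketch below shows the kind of CARLA Python API calls involved in spawning a LiDAR-equipped vehicle and reading its ground-truth pose; it is an assumed, simplified usage example rather than the experimental script, and it presumes a CARLA server running locally on the default port.
```python
# Assumed CARLA usage sketch: spawn a vehicle with an attached LiDAR so that the
# simulator's ground-truth positions can be compared against Kalman estimates.
import carla

client = carla.Client("localhost", 2000)   # assumes a locally running CARLA server
client.set_timeout(10.0)
world = client.get_world()

blueprints = world.get_blueprint_library()
vehicle_bp = blueprints.filter("vehicle.*")[0]
spawn_point = world.get_map().get_spawn_points()[0]
vehicle = world.spawn_actor(vehicle_bp, spawn_point)
vehicle.set_autopilot(True)

lidar_bp = blueprints.find("sensor.lidar.ray_cast")
lidar = world.spawn_actor(lidar_bp,
                          carla.Transform(carla.Location(z=2.4)),
                          attach_to=vehicle)
lidar.listen(lambda point_cloud: None)     # point clouds arrive via this callback

# Ground-truth vehicle position used as the benchmarking reference.
location = vehicle.get_transform().location
print(location.x, location.y, location.z)
```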
Crash Video Dataset: To evaluate detection accuracy under unconstrained conditions, a dataset of 67 crash videos was compiled from publicly available sources (e.g., YouTube: https://www.youtube.com/watch?v=m7C9B5H0I1E&list=PPSV, accessed on 26 May 2025). All videos were standardized to match the training pipeline, with each frame converted into RGB format and annotated using bounding boxes. This ensured compatibility with detector requirements and minimized dataset bias. The compilation included diverse crash scenarios—rear-end, side-impact, and multi-vehicle accidents—offering a challenging benchmark for model evaluation.
Real-World Deployment: Field validation was performed using MATLAB 2024b Simulink on an Intel i7 CPU, with video footage collected on Istanbul’s O-7 highway. This roadway was selected due to its heterogeneous traffic flow and recurrent congestion patterns, providing a realistic testbed. Simulink enabled seamless integration with MATLAB algorithms and real-time processing of streaming video. Figure 5 illustrates the end-to-end real-time pipeline, covering ingestion, detection, tracking, and decision-making stages.
Data Representation: All datasets were standardized in RGB format. Object annotations followed MATLAB’s bounding-box schema, where each object is encoded as a four-element array [x, y, width, height], denoting the upper-left coordinates and dimensions. Each column contained an M × 4 matrix representing a specific object class (e.g., vehicle, stop sign). This uniform format facilitated consistent integration of synthetic, curated, and real-world data across experimental stages.
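A minimal example of this annotation layout is shown below; the numeric values are hypothetical and serve only to make the M × 4 convention concrete.
```python
# Annotation schema sketch: one M x 4 array per object class, each row encoding
# [x, y, width, height] in pixels (upper-left corner plus box size). Values are
# hypothetical.
import numpy as np

vehicle_boxes = np.array([
    [120, 240, 85, 60],   # vehicle 1
    [310, 255, 70, 52],   # vehicle 2
])                         # shape (M, 4) for the "vehicle" class
assert vehicle_boxes.shape == (2, 4)
```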
Baselines: We evaluate YOLOv8, v9, and v10 under identical data splits and evaluation criteria, using official releases and recommended hyperparameters. We additionally evaluate a real-time Transformer detector (Version 2). All baselines run with batch size 1 and identical post-processing (NMS at IoU 0.5) unless stated.

Dataset and Preprocessing

The object detection module was trained on a hybrid dataset combining benchmark collections, curated crash footage, and synthetic crash scenarios generated in CARLA. This multi-source design was chosen to reflect the heterogeneity of urban traffic environments and to improve the generalization of the proposed model.
The initial training set comprised the Caltech Cars 1999 [47] and Caltech Cars 2001 [47] datasets, which together contain 295 annotated vehicle images under diverse viewpoints and lighting conditions. These datasets, widely used in detection benchmarking, provided a reliable foundation for fine-tuning the detection module. Figure 6 shows examples of the images used.
To extend beyond static imagery, 67 crash videos were systematically curated. Each video was decomposed into frame-by-frame RGB images, and bounding boxes were manually annotated along with detection scores. Ground-truth crash moments were established by three independent annotators, with the final labels determined by averaging their assessments. This procedure minimized observer bias and ensured accurate timestamping of impact events. By including a variety of crash dynamics—such as rear-end and lateral impacts—the dataset captured a broad spectrum of realistic traffic incidents.
Synthetic crash scenarios were also generated in CARLA to supplement the dataset. CARLA enables precise control of environmental and behavioral variables, including adverse weather, traffic density, and driver actions. As reported in prior studies [48], such simulated data enriches coverage of rare or hazardous events (e.g., model collisions, sudden braking in rain) that are underrepresented in real-world footage. The inclusion of synthetic sequences reduces the risk of overfitting to limited real crash data and broadens the scope of evaluation.
Preprocessing steps ensured compatibility across datasets. All images were normalized to uint8 format and rescaled to the [0–255] intensity range. To reduce false detections and computational overhead, the detector was restricted to regions below the horizon line, thereby excluding irrelevant background such as sky or distant infrastructure. The R-CNN, pretrained on ResNet-50, was then fine-tuned on this hybrid dataset to optimize performance specifically for vehicle detection in traffic scenes.
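The following sketch summarizes these preprocessing steps; the horizon row is an assumed placeholder, since the actual value depends on camera mounting and calibration.
```python
# Preprocessing sketch: enforce uint8 RGB in [0, 255] and restrict detection to
# the region below the horizon line. HORIZON_Y is an assumed placeholder value.
import numpy as np

HORIZON_Y = 180   # assumed horizon row (pixels from the top) for a 480 x 640 frame

def preprocess(frame):
    frame = np.clip(frame, 0, 255).astype(np.uint8)   # normalise to uint8 [0, 255]
    roi = frame[HORIZON_Y:, :, :]                     # discard sky / distant background
    return roi, HORIZON_Y   # offset needed to map detections back to full-frame coords
```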
To minimize systematic bias, camera placement was standardized: dashcams were aligned with the vehicle centerline so that the hood was excluded from the field of view and the grid origin (0, 0) corresponded to the vehicle center. This ensured consistency in trajectory extraction and improved comparability across recordings.
Finally, the curated dataset was employed to evaluate multiple system variants: (i) Kalman filtering with different kernel settings, (ii) hybrid crash detection with and without the ANN decision layer, and (iii) comparisons of offline versus real-time deployment. This design enabled isolation of the individual contributions of preprocessing, dataset composition, and architectural choices to overall performance.

5. Results and Discussion

The experimental evaluation proceeded in three stages: (i) controlled simulations in the CARLA environment, (ii) offline analysis of curated crash videos, and (iii) real-time testing on Istanbul’s O-7 highway. This multi-tiered design enabled the system to be validated both under idealized conditions with access to ground-truth states and in unconstrained, real-world traffic scenarios.
Initial experiments focused on the effect of different kernel configurations in the Kalman filter. In CARLA simulations, vehicle positions measured by LIDAR served as the ground truth, against which estimated trajectories were compared. Table 3, Table 4 and Table 5 present representative outputs for three settings: a non-time-variant filter, a 10 × 10 time-variant kernel, and a 20 × 20 time-variant kernel.
Although the assigned grid positions were consistent across configurations, a systematic delay of roughly 30 frames (≈0.5 s at 60 fps) was observed when using non-time-variant models. This difference is highly relevant in safety contexts, as the average driver reaction time is typically 0.75–1.5 s [49]. A reduction of half a second in detection latency therefore represents a meaningful safety gain. Moreover, fast-moving vehicles were more accurately tracked using time-variant kernels, which adapted to acceleration and deceleration more effectively.
Lighting and weather primarily act as signal perturbations, including additive sensor noise (e.g., shot/read noise at high ISO, rain speckles), photometric shifts (brightness/contrast/gamma changes, shadows, glare), and contrast degradation (fog/haze), along with compression artifacts. A noise-robust neural module learns features that are less sensitive to these perturbations—favoring shape, multi-scale structure, and temporal consistency over brittle texture/color cues. Consequently, small input corruptions induce small output changes, reducing detection flicker and spurious bounding-box shifts. In our hybrid pipeline, more stable neural detections reduce the variance and outlier rate of the measurement stream entering the time-variant Kalman filter, which leads to more predictable innovations and a better-behaved adaptive Rk. Similarly, the HMM receives cleaner state transitions (grid assignments), improving transition probability estimates. Together, this decreases false corrections, track fragmentation, and decision lag, particularly in rain, fog, glare, and low-light scenes.
Accuracy comparisons further confirm that time-variant Kalman filters outperform traditional approaches at medium-to-longer ranges (>4 m), whereas performance converges or even declines at close distances due to pixel density and grid sensitivity (Figure 7). Importantly, none of the tested configurations were able to maintain reliable detection beyond 30 m, consistent with limitations in commercial dashcam resolution.
In the CARLA scenario, measurements were taken for each detected vehicle using the adaptive Kalman filter configurations, and these estimates were compared to the LiDAR sensor values at one-meter intervals within the simulation engine. The comparison data are presented in the table below.
The Kalman 10 × 10 configuration appears optimal for minimizing error across all measured distances in this test. Both the larger (20 × 20) and the time-invariant configurations yield higher errors and are therefore less suitable for this scenario, since the process noise covariance Qk grows when additional time-variant components (such as acceleration) are introduced.
To assess the value of the neural decision layer, Hidden Markov Model outputs were compared with and without ANN-based refinement. As shown in Table 6, incorporating the ANN improved overall accuracy from 92.3% to 94.5%, a 2.2% gain. Although numerically modest, this improvement is significant in high-speed scenarios where even marginal enhancements can prevent accidents. The ANN module contributed by modeling non-linear spatiotemporal dependencies and filtering noise, thereby complementing the probabilistic transitions of the Markov framework. The overall pipeline's detection duration and mAP scores were calculated with and without each module. Tests were run on a single RTX 3060/3070 GPU with a 640 × 640 input, and durations are reported as mean ± standard deviation.
This table quantifies the speed–accuracy trade-off of RCNN pipelines augmented with temporal modules versus YOLO baselines using AP@0.5. Among RCNN variants, the full stack with time-variant Kalman + HMM + ANN attains the highest accuracy (AP@0.5 = 0.79) at 119 ± 21 ms. A strong alternative is the time-invariant Kalman + HMM + ANN configuration, which is notably faster (95 ± 19 ms) with only a minor accuracy drop (0.78), yielding the best overall balance. The fastest RCNN option is time-variant Kalman + ANN (93 ± 20 ms) with moderate accuracy (0.75). Module-wise, time-variant Kalman consistently adds ≈20 ms over time-invariant updates (e.g., 61 ± 8 vs. 85 ± 9 ms) for a small but systematic AP gain (~+0.01 when stacks are matched). The HMM provides the largest accuracy improvement (+0.02–0.04 AP) at a moderate cost (24–26 ms), while the ANN contributes smaller gains (+0.01–0.02 AP) for low overhead (8–9 ms). YOLOv8s, YOLOv9s, the Real-Time Transformer Detector, and YOLOv10s fall within similar latency ranges (70–120 ms and 90–140 ms) and deliver AP@0.5 values of 0.73–0.75; the RCNN + temporal stacks can surpass these accuracies at comparable latencies. Notably, the end-to-end "Detection Duration" does not always equal the sum of the listed module times (e.g., 64 + 8 ≈ 72 ms vs. 120 ± 22 ms total), indicating additional costs from detector inference, pre/post-processing, I/O, and scheduling. Variability is modest per module, but total jitter (≈18–22 ms) suggests pipeline-level and system effects dominate timing variance. All timings are reported in milliseconds per frame at 640 × 640 input resolution on a single RTX 3060/3070 GPU with identical preprocessing and NMS parameters across variants, ensuring fair runtime comparability.
These results indicate that RCNN + Kalman pipelines—particularly time-variant Kalman with HMM (with or without ANN)—offer the best trade-off between speed and accuracy in this setting, comfortably meeting real-time constraints while minimizing error. The YOLO small baselines, as configured here, are both slower and less accurate. Small differences in mAP (e.g., 38 vs. 39–40) may fall within experimental variability; reporting dispersion (e.g., standard deviations or confidence intervals) would clarify statistical significance. Conclusions also assume comparable hardware and evaluation protocols across models, and the interpretation of mAP should be consistent with its definition in the study.
To identify bottlenecks and clarify real-time feasibility, we decomposed per-frame latency into detector inference, Kalman update, HMM inference, ANN refinement, and post-processing. Timings were collected over N frames (median and interquartile range) on the same hardware used in Table 6, with identical input resolution and preprocessing.
The breakdown shows detector inference dominates latency, while probabilistic updates (Kalman/HMM) and the ANN layer contribute marginal overheads. Removing HMM reduces latency slightly but degrades stability under fast motion; ANN adds negligible cost while improving decision consistency. These results explain why the full variant maintains the tightest latency range observed in Table 6.
Table 7 compares system accuracy in offline video analysis and real-time deployment. The proposed model achieved 94.5% accuracy in offline analysis of annotated videos, while real-time deployment on Istanbul’s O-7 motorway resulted in 91.2%. The slight reduction reflects computational overhead and environmental variability during live operation.
Nevertheless, a 91.2% detection rate in real traffic conditions demonstrates that the system retains practical applicability, outperforming many conventional dashcam-based detection frameworks. The reduction in accuracy is consistent with expectations, as real-time models must balance computational overhead with responsiveness.
Finally, the curated crash dataset of 67 annotated videos was used to evaluate sensitivity in detecting self-induced versus distance crashes. As shown in Table 8, distance crashes achieved a sensitivity of 95%, compared to 88% for self-crashes. The lower performance in self-crash detection can be attributed to their greater variability and less distinctive visual cues.
Similarly, the sensitivity of crash detection was defined as in Equation (7), measuring the proportion of correctly identified crash events relative to all actual crashes.
TP = Crashes correctly identified as crashes
FN = Crashes incorrectly identified as no crashes
\mathrm{Sensitivity} = \dfrac{\text{number of True Positives}}{\text{number of True Positives} + \text{number of False Negatives}}
In addition to accuracy and sensitivity, we report class-wise Precision (P), Recall (R), F1-score, and mean Average Precision at IoU 0.5 (Table 9). Precision and Recall are computed from true/false positives/negatives over frame-level crash/no-crash decisions; F1 is the harmonic mean of Precision and Recall. AP is obtained from the precision–recall curve for each class and then averaged (mAP). Unless otherwise stated, we adopt the same confidence thresholds and non-maximum suppression parameters across variants to ensure a fair comparison.
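For completeness, the metric computation reduces to the following sketch, applied per class to the frame-level decisions.
```python
# Frame-level metrics from crash-class counts of true positives (tp), false
# positives (fp), and false negatives (fn).
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # identical to sensitivity
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```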
Across all settings, the full temporal stack (R-CNN + Kalman [TV] + HMM + ANN) achieves the best balance of precision and recall, yielding the highest F1—83.47% offline and 82.57% in real time. Removing temporal components degrades performance in a consistent, interpretable way. Relative to the full model, omitting the HMM reduces F1 by 1.85 pp offline (from 83.47% to 81.62%) and by 2.16 pp in real time (from 82.57% to 80.41%). Omitting the ANN has a larger impact: −2.94 pp offline (to 80.53%) and −4.58 pp in real time (to 77.99%). These deltas indicate that the ANN contributes the largest incremental gains among the temporal modules, particularly under real-time constraints.
The precision–recall breakdown mirrors these trends. Compared to the full model, removing the HMM reduces precision/recall by ~2 pp in both settings (Offline: −1.8/−1.9; Real-Time: −2.2/−2.1), whereas removing the ANN causes a larger drop (Offline: −3.6/−2.1; Real-Time: −4.7/−4.4). Thus, the ANN chiefly boosts precision and also improves recall, while the HMM provides smaller but consistent gains to both. Transitioning from Offline to Real-Time yields a modest overall performance loss for all variants, with the full model dropping by 0.90 pp in F1 (precision −0.5, recall −1.4). The largest setting-induced degradation occurs when the ANN is absent (F1 −2.54 pp; recall −3.7 pp), suggesting that the learned refinement is especially important to mitigate the recall penalties associated with real-time operation.
We benchmark against YOLOv8–v10 and a real-time Transformer detector (Version 2), using official implementations and identical evaluation settings (IoU 0.5, NMS default, batch size 1). YOLOv10 attains 82.16% F1 (78.4% precision, 86.3% recall), YOLOv9 reaches 80.76% F1, and YOLOv8 78.79% F1. A real-time Transformer detector (Version 2) achieves 82.41% F1 (78.6% precision, 86.6% recall). Our full temporal stack remains competitive, yielding 83.47% F1 offline and 82.57% in real time, with a more favorable precision–recall balance than YOLOv10 at comparable recall. These results indicate that temporal refinement confers a consistent advantage over strong single-frame baselines.
In summary, on this dataset the temporal hierarchy is additive: ANN > HMM in marginal benefit, and combining both yields the strongest F1. Real-time operation incurs small but consistent losses—driven more by recall than precision—that are best compensated by retaining both temporal modules. For context, single-stage baselines are strong but slightly behind: YOLOv10 reaches 82.16% F1 and the real-time transformer detector 82.41% F1, both trailing the full temporal stack (82.57% in real time), underscoring the benefit of temporal modeling.
Figure 8 provides a comparative performance analysis of all tested models, integrating results from Kalman filtering, ANN-enhanced frameworks, offline video, and real-time conditions. The ANN-enhanced and offline video models both achieved the highest accuracy (94.5%), while distance crash detection recorded the highest sensitivity (95%). In contrast, the traditional Kalman filter exhibited the lowest performance (86.6%), underscoring its limitations in dynamic and non-linear traffic scenarios.
These findings collectively underscore three key insights:
Impact of Filtering Mechanisms: The introduction of time-variant kernels significantly improves detection latency, particularly in fast-moving traffic scenarios. However, improvements diminish at short distances, suggesting that hybrid models or adaptive thresholds may be required for dense urban environments.
Role of Neural Decision Layers: Even modest gains from ANN integration prove critical in safety contexts, where every fraction of a second in detection matters. The neural module’s robustness against noise enhances reliability across varied lighting and weather conditions.
Operational Robustness: The gap between offline and real-time analysis highlights the trade-off between accuracy and computational feasibility. Optimizing real-time architectures, potentially through GPU acceleration or lightweight deep learning models, is a crucial direction for future work.
The proposed system demonstrated an overall detection accuracy of 94.5% in offline video analysis and 91.2% in real-time deployment. Incorporating the ANN as a correction factor improved detection by 2.2%, and the time-variant Kalman filter reduced detection delays by approximately 0.5 s relative to traditional methods. By relying solely on camera input rather than costly LiDAR sensors, the framework balances affordability with performance, while maintaining a detection range consistent with current dashcam technology (~30 m); the implications of this range limit for high-speed operation are examined below.
Despite these strengths, limitations remain: reduced detection accuracy in self-induced crashes, performance degradation at close distances, and dependency on MATLAB Simulink and Intel i7-class hardware, which may restrict scalability. Addressing these challenges forms the basis for ongoing work, particularly in optimizing low-cost embedded hardware.
Another important limitation of the proposed framework is the effective detection range of approximately 30 m, which is inherently constrained by the resolution and field of view of commercial dash cameras. While this range is acceptable for urban driving, where vehicle speeds are typically below 50 km/h and braking distances remain within 20–30 m, it is insufficient for highway scenarios. At 120 km/h, a vehicle covers 33 m per second, meaning that a 30 m detection range provides less than one second of warning time. In contrast, safe stopping distances at such speeds often exceed 100 m, leaving a substantial gap between system capability and real-world requirements. Overcoming this limitation will require either hardware improvements—such as higher-resolution or wide-angle cameras, stereo vision, or multi-camera arrays—or algorithmic enhancements including super-resolution, multi-frame fusion, or trajectory extrapolation. Future work should investigate these directions to extend the scalability of the model beyond urban traffic monitoring toward high-speed motorway safety applications.

6. Conclusions

Driving is a complex activity that requires continuous integration of perception, attention, and motor responses within narrow temporal margins. A critical determinant of safety in this context is driver reaction time, which typically ranges between 1.5 and 2 s under normal conditions [50]. The proposed framework reduces effective response latency by approximately 0.5 s through real-time crash prediction, thereby offering a significant safety margin relative to human performance. Literature reports an average accident reaction time of 1.5 s [49], indicating that even a half-second improvement in early detection capability may decisively lower collision risk.
This research contributes a hybrid framework that unifies R-CNN for vehicle detection, an acceleration-aware Kalman Filter for motion estimation, a Markov Model for probabilistic reasoning, and an ANN for adaptive decision-making. The synergy between these components addresses several key challenges in intelligent transportation: robust anomaly detection in dynamic conditions, reduction in detection latency, and improved predictive stability. Notably, the ANN-enhanced Markov module improved accuracy by 2.2%, complementing the roughly 0.5 s reduction in detection latency, a gain with tangible implications for preventing accidents in dense traffic scenarios.
Validation across both CARLA-based simulations and real-world urban traffic confirmed the accuracy and operational robustness of the proposed system. Nevertheless, limitations remain, particularly in detecting self-induced crashes and in maintaining accuracy at very short or extended ranges. Addressing these constraints, alongside adaptation to low-cost embedded platforms, will form an important avenue for future work. With continued advances in hardware acceleration and algorithmic efficiency, the presented framework holds strong potential to become an integral component of next-generation intelligent transportation and autonomous driving infrastructures.
The system’s current strengths are most pronounced in urban environments and mixed traffic at moderate speeds, where interactions typically occur within an effective detection range of about 30 m. Within this regime, the probabilistic–temporal integration improves decision stability and responsiveness, offering a cost-effective alternative to sensor-heavy configurations for urban monitoring and research settings. Extending this capability to high-speed motorway scenarios will require sensor-level enhancements (higher-resolution or multi-camera arrays) and algorithmic optimizations (GPU-accelerated inference, model pruning/quantization, and trajectory extrapolation) to maintain accuracy while achieving sub-30 ms end-to-end latency. Until such scalability is demonstrated, we confine our real-time claims to urban speed contexts.
At the same time, we cautiously delimit the scope of our real-time claims. The present real-time results were obtained on a MATLAB/Simulink, CPU-bound prototype and under the specific conditions of our tests; the effective range of ≈30 m, together with hardware and framework overheads, constrains applicability in higher-speed scenarios and diverse traffic, weather, and geographic contexts. Accordingly, we refrain from making broad claims about general real-time deployment across all traffic scenarios until scalability is demonstrated.
In our current configuration, the effective detection range is approximately 30 m, which is largely constrained by the input resolution (480 × 640), the field of view, and the per-object pixel footprint at distance. This range is generally adequate for urban speeds but becomes limiting at motorway speeds: at 120 km/h a vehicle covers roughly 33 m per second, so a 30 m range yields less than one second of warning time. While our hybrid HMM + ANN layer reduces detection latency by ≈0.5 s and improves decision stability, the pixel-level signal quality beyond ≈30 m remains the primary bottleneck for early detection. Several complementary remedies are available:
  • Higher-resolution cameras and optics: upgrading to 1080p or higher (e.g., 4K) increases pixel density at distance, improving small-object detectability and bounding-box stability. Narrower-FOV optics or optical zoom can further increase per-object pixel counts for long-range scenarios, albeit with a coverage trade-off that can be mitigated by multi-camera arrays.
  • Stereo or multi-view setups can increase robustness to occlusion and provide geometric cues for far-field consistency.
  • Super-resolution and multi-frame fusion can recover spatial detail from compressed video streams and improve far-field detections [30].
  • Predictive trajectory extrapolation can provide earlier risk estimates when appearance cues are weak, complementing the HMM + ANN module.
Our prototype runs in MATLAB/Simulink on an Intel i7 CPU and achieves 91.2% accuracy in real-time operation versus 94.5% offline. The performance gap reflects (i) detector compute cost, (ii) framework overhead, and (iii) limited hardware acceleration. To close this gap without compromising accuracy, porting the detector and post-processing to a GPU-accelerated runtime (e.g., CUDA/cuDNN with TensorRT or ONNX Runtime) on edge devices (e.g., embedded GPU modules) could deliver substantial throughput gains at comparable power budgets [36].

Author Contributions

Conceptualization, R.B.E. and A.Y.; Methodology, R.B.E. and A.Y.; Software, R.B.E.; Validation, R.B.E. and A.Y.; Formal analysis, R.B.E. and A.Y.; Investigation, R.B.E.; Writing—review and editing, R.B.E. and A.Y.; Supervision, A.Y. All authors have read and agreed to the published version of the manuscript.

Funding

The authors declare no funding was received.

Acknowledgments

The authors would like to thank the journal editors and reviewers for their valuable time and ideas for the article to be of better quality and effectiveness. The paper is derived from Reşat Buğra Erkartal’s PhD thesis, supervised by Atınç Yılmaz.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Anderson, J.M.; Nidhi, K.; Stanley, K.D.; Sorensen, P.; Samaras, C.; Oluwatola, O.A. Autonomous Vehicle Technology: A Guide for Policymakers; Rand Corporation: Santa Monica, CA, USA, 2014. [Google Scholar]
  2. Feypell, V.; Methorst, R.; Hughes, T. Non-Motor Pedestrian Accidents: A Hidden Issue. 2012. Available online: https://www.ictct.net/wp-content/uploads/23-Hague-2010/ictct_document_nr_740.pdf (accessed on 25 February 2025).
  3. Ranney, T.A.; Garrott, W.R.; Goodman, M.J. NHTSA Driver Distraction Research: Past, Present, and Future; SAE Technical Paper: Warrendale, PA, USA, 2001. [Google Scholar]
  4. Beiker, S.A. Legal aspects of autonomous driving. Santa Clara L. Rev. 2012, 52, 1145. [Google Scholar] [CrossRef]
  5. World Health Organization. Global Status Report on Road Safety 2023: Time for a New Normal; World Health Organization: Geneva, Switzerland, 2023; Available online: https://www.who.int/teams/social-determinants-of-health/safety-and-mobility/global-status-report-on-road-safety-2023 (accessed on 25 February 2025).
  6. Alam, M.K.; Ahmed, A.; Salih, R.; Al Asmari, A.F.S.; Khan, M.A.; Mustafa, N.; Mursaleen, M.; Islam, S. Faster RCNN based robust vehicle detection algorithm for identifying and classifying vehicles. J. Real-Time Image Process. 2023, 20, 93. [Google Scholar] [CrossRef]
  7. Wang, Z.; Zhan, J.; Duan, C.; Guan, X.; Lu, P.; Yang, K. A review of vehicle detection techniques for intelligent vehicles. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 3811–3831. [Google Scholar] [CrossRef] [PubMed]
  8. Galvão, L.G.; Huda, M.N. Pedestrian and vehicle behaviour prediction in autonomous vehicle system—A review. Expert Syst. Appl. 2024, 238, 121983. [Google Scholar] [CrossRef]
  9. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  10. Wei, Y.; Tian, Q.; Guo, T. An improved pedestrian detection algorithm integrating haar-like features and hog descriptors. Adv. Mech. Eng. 2013, 5, 546206. [Google Scholar] [CrossRef]
  11. Ghahremannezhad, H.; Shi, H.; Liu, C. Object Detection in Traffic Videos: A Survey. IEEE Trans. Intell. Transp. Syst. 2023, 24, 6780–6799. [Google Scholar] [CrossRef]
  12. Chan, C.-Y. On the detection of vehicular crashes-system characteristics and architecture. IEEE Trans. Veh. Technol. 2002, 51, 180–193. [Google Scholar] [CrossRef]
  13. Papageorgiou, C.; Poggio, T. A trainable system for object detection. Int. J. Comput. Vis. 2000, 38, 15–33. [Google Scholar] [CrossRef]
  14. Galvao, L.G.; Abbod, M.; Kalganova, T.; Palade, V.; Huda, M.N. Pedestrian and Vehicle Detection in Autonomous Vehicle Perception Systems—A Review. Sensors 2021, 21, 7267. [Google Scholar] [CrossRef]
  15. Zhao, R.; Tang, S.; Supeni, E.E.B.; Rahim, S.B.A.; Fan, L. A Review of Object Detection in Traffic Scenes Based on Deep Learning. Appl. Math. Nonlinear Sci. 2024, 9, 1–25. [Google Scholar] [CrossRef]
  16. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  17. Hasan, Y.; Arif, M.U.; Asif, A.; Raza, R.H. Comparative analysis of vehicle detection in urban traffic environment using Haar cascaded classifiers and blob statistics. In Proceedings of the 2016 Future Technologies Conference (FTC), San Francisco, CA, USA, 6–7 December 2016; pp. 547–552. [Google Scholar]
  18. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  19. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  20. Jiang, Z.; Zhao, L.; Li, S.; Jia, Y. Real-time object detection method based on improved YOLOv4-tiny. arXiv 2020, arXiv:2011.04244. [Google Scholar] [CrossRef]
  21. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  22. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  23. Xiang, Z. Research on target detection algorithm in autonomous driving scenarios based on improved YOLOv5. In Proceedings of the International Conference on Computer Vision, Robotics, and Automation Engineering (CRAE 2024), Kunming, China, 21–23 June 2024; Volume 13249, p. 132490M. [Google Scholar]
  24. Jocher, G.; Alex, S.; Ayush, C.; Jirka, B.; Yonghye, K.; Kalen, M.; Liu, C.; Jiacong, F.; Abhiram, V.; Piotr, S.; et al. Ultralytics/yolov5: v6. 0-YOLOv5n’Nano’models, Roboflow integration, TensorFlow export, OpenCV DNN support. Zenodo 2021. [Google Scholar] [CrossRef]
  25. Chughtai, B.R.; Jalal, A. Traffic surveillance system: Robust multiclass vehicle detection and classification. In Proceedings of the 2024 5th International Conference on Advancements in Computational Sciences (ICACS), Lahore, Pakistan, 19–20 February 2024; pp. 1–8. [Google Scholar]
  26. Khan, S.W.; Hafeez, Q.; Khalid, M.I.; Alroobaea, R.; Hussain, S.; Iqbal, J.; Almotiri, J.; Ullah, S.S. Anomaly detection in traffic surveillance videos using deep learning. Sensors 2022, 22, 6563. [Google Scholar] [CrossRef]
  27. Sun, Z.; Bebis, G.; Miller, R. On-road vehicle detection: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 694–711. [Google Scholar] [CrossRef] [PubMed]
  28. Kukreja, V.; Kumar, D.; Kaur, A.; Geetanjali; Sakshi. GAN-based synthetic data augmentation for increased CNN performance in Vehicle Number Plate Recognition. In Proceedings of the 2020 4th International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 5–7 November 2020; pp. 1190–1195. [Google Scholar]
  29. Patel, J. Scenario Generation for Vehicles Using Deep Learning. Master’s Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2022. [Google Scholar]
  30. Wang, Z.; Chen, J.; Hoi, S.C.H. Deep Learning for Image Super-Resolution: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3365–3387. [Google Scholar] [CrossRef] [PubMed]
  31. Santhosh, K.K.; Dogra, D.P.; Roy, P.P. Anomaly Detection in Road Traffic Using Visual Surveillance: A Survey. ACM Comput. Surv. 2020, 53, 1–26. [Google Scholar] [CrossRef]
  32. Rubaiyat, A.H.M.; Fallah, Y.; Li, X.; Bansal, G.; Infotechnology, T. Multi-sensor data fusion for vehicle detection in autonomous vehicle applications. Electron. Imaging 2018, 30, 1–6. [Google Scholar] [CrossRef]
  33. Wang, D.; Koppal, S.J.; Xie, H. A monolithic forward-view MEMS laser scanner with decoupled raster scanning and enlarged scanning angle for micro LiDAR applications. J. Microelectromechanical Syst. 2020, 29, 996–1001. [Google Scholar] [CrossRef]
  34. Moayed, Z. Automated Multiview Safety Analysis at Complex Road Intersections. Ph.D. Thesis, Auckland University of Technology, Auckland, New Zealand, 2020. [Google Scholar]
  35. Tahir, N.U.A.; Zhang, Z.; Asim, M.; Chen, J.; ELAffendi, M. Object detection in autonomous vehicles under adverse weather: A review of traditional and deep learning approaches. Algorithms 2024, 17, 103. [Google Scholar] [CrossRef]
  36. Liang, S.; Zhang, X. An Object Detection System for Automatic Driving: MEC-YOLO Based on “Cloud-Edge-End”. In Proceedings of the 2024 6th International Conference on Natural Language Processing (ICNLP), Xi’an, China, 22–24 March 2024; pp. 536–541. [Google Scholar]
  37. Gupta, M.; Miglani, H.; Deo, P.; Barhatte, A. Real-time traffic control and monitoring. e-Prime-Advances Electr. Eng. Electron. Energy 2023, 5, 100211. [Google Scholar] [CrossRef]
  38. Kabir, M.H.; Hasan, M.N.; Ahmad; Jaki, H. Transfer Learning-Based Anomaly Detection System for Autonomous Vehicle. Eng. Proc. 2023, 58, 90. [Google Scholar]
  39. Li, Y.; Zhao, W.; Fan, H. A spatio-temporal graph neural network approach for traffic flow prediction. Mathematics 2022, 10, 1754. [Google Scholar] [CrossRef]
  40. Huo, Y.; Zhang, H.; Tian, Y.; Wang, Z.; Wu, J.; Yao, X. A spatiotemporal graph neural network with graph adaptive and attention mechanisms for traffic flow prediction. Electronics 2024, 13, 212. [Google Scholar] [CrossRef]
  41. Erkartal, B.; Yılmaz, A. Generating Crush Signals in Vehicle Traffic using RCNN, Hidden Markov Chain and ANN. In Proceedings of the International Conference on Science, Engineering Management and IT (SEMIT 2025), Dubai, United Arab Emirates, 11–13 September 2025. Presented at the conference; accepted for publication (in press).
  42. Hmidani, O.; Alaoui, E.I. A comprehensive survey of the R-CNN family for object detection. In Proceedings of the 2022 5th International Conference on Advanced Communication Technologies and Networking (CommNet), Marrakech, Morocco, 12–14 December 2022; pp. 1–6. [Google Scholar]
  43. Norris, J.R. Markov Models, No. 2; Cambridge University Press: Cambridge, UK, 1998. [Google Scholar]
  44. Jurafsky, D.; Martin, J.H. Vector Semantics and Embeddings. In Speech and Language Processing, 3rd ed.; draft; 2019; Available online: https://web.stanford.edu/~jurafsky/slp3/ (accessed on 21 September 2025).
  45. Gelman, A.; Shalizi, C.R. Philosophy and the practice of Bayesian statistics. Br. J. Math. Stat. Psychol. 2013, 66, 8–38. [Google Scholar] [CrossRef] [PubMed]
  46. Deschaud, J.E. KITTI-CARLA: A KITTI-like dataset generated by CARLA Simulator. arXiv 2021, arXiv:2109.00892. [Google Scholar]
  47. Weber, M.; Perona, P. Caltech Cars 1999; CaltechDATA: Pasadena, CA, USA, 2022. [Google Scholar]
  48. Philip, B.; Updike, P.; Perona, P. Caltech Cars 2001; CaltechDATA: Pasadena, CA, USA, 2022. [Google Scholar]
  49. Niranjan, D.R.; Vinay Karthik, B.C. Deep learning based object detection model for autonomous driving research using carla simulator. In Proceedings of the 2021 2nd International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 7–9 October 2021; pp. 1251–1258. [Google Scholar]
  50. Dalvi, Q. World Report on Road Traffic Injury Prevention by World Health Organization and World Bank (WHO, Geneva, April). Transp. Rev. 2004, 24, 365–376. [Google Scholar] [CrossRef]
Figure 1. Illustration of the model.
Figure 2. RCNN Architecture.
Figure 3. Coordinates of the Grid.
Figure 4. Updates of the Kalman Filter.
Figure 5. Flowchart of the Real-Time Model.
Figure 6. Example of the images in the dataset.
Figure 7. RMSE with Different Filters.
Figure 8. Footage of real-time analysis.
Table 1. Overview of the Literature.

| Author/Citation (Year) | Method/Model Used | Accuracy/Performance Metrics | Technology/Approach | Dataset/Context |
|---|---|---|---|---|
| [21] | SVM, AdaBoost, Monocular Vision Features | 92.3% | Classical ML/Feature-based | Precrash scenarios |
| [35] | CNN + ViT | 93.8% | Deep fusion | Adverse weather |
| [39] | CNN | 91.5% | Deep learning | Surveillance videos |
| [28] | GAN + CNN | >90% | GAN augmentation | Vehicle plates |
| [34] | CNN + SVM | 90% | Hybrid approach | Urban scenes |
| [25] | YOLOv5 + DeepLabv3 | 96% | Hybrid detection | Traffic surveillance |
| [32] | LIDAR + Camera fusion | 90.4% | Multi-sensor fusion | KITTI |
| [19] | SSD | 74.3% | Single-stage detector | - |
| [18] | YOLO | 82.1% | Single-stage detector | - |
| [22] | YOLOv4 | 83.4% | Single-stage detector | - |
| [23] | YOLOv5 | 94.5% | Single-stage detector | - |
| [24] | YOLOv5 | 95.0% | Single-stage detector | Challenging scenarios |
| [21] | EfficientDet | 91.3% | Single-stage detector | COCO |
| [37] | EfficientDet | 92.7% | Single-stage detector | - |
Table 2. Comparative Evaluation of Different Grid Partitions.

| Grid Partition | Accuracy (%) | Average FPS | Notes |
|---|---|---|---|
| 3 × 3 | 88.2 | 30 | Vehicle overlap in large cells |
| 4 × 3 | 94.1 | 28 | Best balance between accuracy and speed |
| 5 × 4 | 94.5 | 21 | Higher accuracy but increased latency |
Table 3. Vehicle-to-grid assignment without the time-variant Kalman filter.

| Frame | Coordinates (x, y) | Car ID | Grid ID | LIDAR Coordinates |
|---|---|---|---|---|
| 1 | Not detected | | | |
| 2 | Not detected | | | |
| 3–29 | Not detected | | | |
| 30 | [68.2801, −5.9767] | 1 | 1 | [66.381, −4.2587] |
| | [0.2980, −0.2589] | 2 | 2 | [31.2580, −0.1589] |
| 31 | [64.5202, −3.5626] | 1 | 1 | [66.7202, −3.4568] |
| | [31.6258, −0.07522] | 2 | 2 | [33.6398, −0.0258] |
| 32 | [62.9445, −4.3945] | 1 | 1 | [62.8445, −3.3945] |
| | [32.2565, −0.0566] | 2 | 2 | [30.2105, −0.0562] |
Table 4. Vehicle-to-grid assignment with the 10 × 10 time-variant Kalman filter.

| Frame | Coordinates (x, y) | Car ID | Grid ID | LIDAR Coordinates |
|---|---|---|---|---|
| 1 | [66.2581, −4.2587] | 1 | 1 | [66.381, −4.2587] |
| | [31.8280, −0.1589] | 2 | 2 | [31.2580, −0.1589] |
| 2 | [66.9554, −3.4568] | 1 | 1 | [66.7202, −3.4568] |
| | [33.5238, −0.0258] | 2 | 2 | [33.6398, −0.0258] |
| 3 | [62.254, −3.3945] | 1 | 1 | [62.8445, −3.3945] |
| | [30.2105, −0.8456] | 2 | 2 | [30.2105, −0.0562] |
Table 5. Vehicle-to-grid assignment with the 20 × 20 time-variant Kalman filter.

| Frame | Coordinates (x, y) | Car ID | Grid ID | LIDAR Coordinates |
|---|---|---|---|---|
| 1 | [68.0825, −3.8747] | 1 | 1 | [66.3081, −4.2587] |
| | [29.2580, −0.0589] | 2 | 2 | [31.2580, −0.1589] |
| 2 | [62.7202, −3.5625] | 1 | 1 | [66.7202, −3.4568] |
| | [29.6608, −0.0282] | 2 | 2 | [33.6398, −0.0258] |
| 3 | [60.8445, −3.3945] | 1 | 1 | [62.8445, −3.3945] |
| | [30.2105, −0.0562] | 2 | 2 | [30.2105, −0.0562] |
Table 6. Detection durations for the different models.

| Model | Kalman (ms) | HMM (ms) | ANN (ms) | Detection Duration (ms) | mAP |
|---|---|---|---|---|---|
| RCNN + Kalman (time-invariant) + HMM + ANN | 61 ± 8 | 26 ± 7 | 8 ± 3 | 95 ± 19 | 0.78 |
| RCNN + Kalman (time-variant) + HMM + ANN | 85 ± 9 | 24 ± 6 | 9 ± 3 | 119 ± 21 | 0.79 |
| RCNN + Kalman (time-invariant) + ANN | 64 ± 8 | - | 8 ± 2 | 120 ± 22 | 0.74 |
| RCNN + Kalman (time-variant) + ANN | 84 ± 9 | - | 9 ± 4 | 93 ± 20 | 0.75 |
| RCNN + Kalman (time-variant) + HMM | 86 ± 11 | 25 ± 8 | - | 108 ± 18 | 0.77 |
| RCNN + Kalman (time-invariant) + HMM | 67 ± 9 | 24 ± 7 | - | 107 ± 22 | 0.76 |
| YOLOv8 (default: v8s) | - | - | - | 70–120 | 0.73 |
| YOLOv9 (default: v9s) | - | - | - | 70–115 | 0.74 |
| YOLOv10 (default: v10s) | - | - | - | 90–140 | 0.75 |
| Real-Time Transformer Detector | - | - | - | 90–135 | 0.75 |
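As a quick sanity check on the throughput implied by Table 6, the minimal sketch below (our own illustration, not part of the released code) converts mean per-frame latency into frames per second; 119 ms per frame corresponds to the ~8–9 FPS figure reported for the full time-variant configuration.

```python
# Convert mean per-frame latency (ms) into approximate throughput (FPS).
# Illustrative only; the latencies are the mean values reported in Table 6.
latencies_ms = {
    "RCNN + Kalman (time-variant) + HMM + ANN": 119,
    "RCNN + Kalman (time-invariant) + HMM + ANN": 95,
    "RCNN + Kalman (time-variant) + HMM": 108,
}

for model, ms in latencies_ms.items():
    print(f"{model}: {1000 / ms:.1f} FPS")   # 119 ms -> ~8.4 FPS
```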
Table 7. Comparison of video (offline) and real-time accuracy.

| | Video | Real Time |
|---|---|---|
| Accuracy (%) | 94.5 | 91.2 |
Table 8. Detection percentages for self-crashes and distance crashes.

| Crash Type | True Positive (%) | False Negative (%) | Sensitivity (%) |
|---|---|---|---|
| Self-Crashes | 88 | 7 | 88 |
| Distance Crashes | 95 | 14 | 95 |
Table 9. Classification metrics across settings and architectural variants.

| Setting | Variant | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|---|
| Offline | Full (R-CNN + Kalman (TV) + HMM + ANN) | 79.4 | 88.0 | 83.47 |
| Offline | w/o HMM | 77.6 | 86.1 | 81.62 |
| Offline | w/o ANN | 75.8 | 85.9 | 80.53 |
| Real-Time | Full (R-CNN + Kalman (TV) + HMM + ANN) | 78.9 | 86.6 | 82.57 |
| Real-Time | w/o HMM | 76.7 | 84.5 | 80.41 |
| Real-Time | w/o ANN | 74.2 | 82.2 | 77.99 |
| YOLO | YOLOv8 | 74.5 | 83.6 | 78.79 |
| YOLO | YOLOv9 | 76.6 | 85.4 | 80.76 |
| YOLO | YOLOv10 | 78.4 | 86.3 | 82.16 |
| Real-Time Transformer Detector | Version 2 | 78.6 | 86.6 | 82.41 |
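For reference, the F1 values in Table 9 follow directly from the precision and recall columns; the minimal sketch below (our own check, not part of the released code) reproduces them, up to rounding of the reported precision and recall.

```python
# Recompute F1 = 2PR / (P + R) from selected precision/recall pairs in Table 9.
# Illustrative check only; values are copied from the table.
rows = [
    ("Offline, full stack", 79.4, 88.0),     # Table 9 reports 83.47
    ("Real-time, full stack", 78.9, 86.6),   # Table 9 reports 82.57
    ("YOLOv10", 78.4, 86.3),                 # Table 9 reports 82.16
]

for name, precision, recall in rows:
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name}: F1 = {f1:.2f}")
```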