1. Introduction
Traffic accidents remain a major threat to public safety, causing substantial loss of life, serious injuries, and considerable economic damage each year. Rapid and reliable traffic accident detection is therefore an essential component of intelligent transportation systems (ITSs), as it can support faster emergency response, improve situational awareness, and enhance traffic management. In current practice, however, traffic authorities often depend on manual monitoring of large-scale video feeds, a process that is labor-intensive, error-prone, and difficult to sustain over time. These limitations motivate the development of automated traffic accident detection systems that can operate accurately and robustly in real-world environments [
1,
2].
Recent advances in deep learning, including convolutional neural networks and both two-stage and one-stage object detectors, have improved vision-based traffic analysis. Two-stage detectors (e.g., Faster R-CNN) and transformer-based models (e.g., DETR) have demonstrated strong performance in various vision tasks, but they are often more computationally expensive than one-stage detectors and therefore less suitable for real-time traffic monitoring. Among one-stage detectors, the You Only Look Once (YOLO) family has been widely adopted because it offers an effective balance between detection accuracy and inference speed [3, 4, 5, 6, 7]. These properties make YOLO-based models practical for traffic surveillance and safety-critical tasks and motivate further investigation of YOLOv9 in this work [8].
Several architectural advances within the YOLO lineage have focused on strengthening feature representation and improving optimization stability. YOLOv7 [
9] introduced the Extended ELAN (E-ELAN) design, demonstrating that multi-branch aggregation can enhance feature diversity and gradient propagation. Cross Stage Partial Network (CSPNet) [
10], in turn, showed that partial gradient flow can reduce feature redundancy and improve optimization efficiency in deep convolutional networks. Building on these ideas, YOLOv9 [
11] adopts the GELAN backbone, which integrates principles from both CSPNet and ELAN and provides a strong starting point for further backbone-level refinement.
Despite these advances, traffic accident detection from video footage remains particularly challenging. Real accident videos often contain motion blur, occlusion, abrupt viewpoint shifts, unstable camera motion, and inconsistent illumination. These factors degrade detection reliability and make accident recognition more difficult than conventional object detection. Prior studies have explored YOLO-based traffic accident detection, but many rely on unmodified architectures or image-centric datasets, limiting robustness in real-world deployment. These limitations suggest that improvements may require not only better data or training strategies but also architectural refinements that enhance feature representation and gradient propagation.
Motivated by this gap, this work investigates whether targeted backbone-level enhancements can improve YOLOv9’s effectiveness for traffic accident detection. Rather than treating YOLO architectures as fixed black-box detectors, as most prior work does, we examine how specific modifications within the backbone influence detection performance on a challenging, manually annotated accident dataset.
Although CSP and ELAN are established architectural components, their behavior within the YOLOv9 backbone has received limited systematic analysis for this task. We therefore introduce explicit CSP-based feature partitioning and extend the ELAN structures within the YOLOv9 backbone, and we evaluate their individual and combined effects, their complementary behavior, and their practical trade-offs under real-world conditions.
The main contributions of this work are summarized as follows:
(1) We provide a systematic analysis of backbone-level architectural modifications within YOLOv9 for traffic accident detection in challenging video-based scenarios.
(2) We introduce explicit CSP-based feature partitioning and extend ELAN structures within the YOLOv9 backbone to investigate their impact on feature representation, gradient propagation, and detection performance.
(3) We demonstrate that CSP-based modifications improve localization precision, while deeper-ELAN structures enhance recall, revealing complementary detection behaviors.
(4) We show that combining explicit CSP-based partitioning and extended ELAN structures does not necessarily improve performance, highlighting optimization trade-offs in multi-branch architectures.
(5) We validate the proposed approach through comprehensive experiments on a manually annotated accident dataset, achieving improved performance compared to baseline YOLO models.
2. Related Work
This section reviews the existing literature on traffic accident detection and YOLO-based object detection frameworks, highlighting recent advances and limitations relevant to this study.
Deep learning-based object detection has become a core component of modern traffic monitoring and intelligent transportation systems, particularly for automated traffic accident detection. Among existing approaches, the YOLO (You Only Look Once) family of detectors has been widely adopted due to its favorable balance between detection accuracy and computational efficiency [
3,
4,
5,
6,
7]. Comparative evaluations of YOLOv5, YOLOv8, and YOLO-NAS highlight trade-offs in precision, recall, and inference speed, with YOLOv8 offering a practical balance for real-world traffic surveillance applications [
8]. These findings establish YOLO-based detectors as suitable candidates for safety-critical vision tasks.
Several architectural innovations within the YOLO family have aimed to improve feature representation and optimization stability. YOLOv7 [
9] introduced the Extended ELAN (E-ELAN) architecture, demonstrating that multi-branch aggregation enhances feature diversity and gradient propagation. In parallel, Cross Stage Partial Network (CSPNet) [
10] showed that partial gradient propagation reduces feature redundancy and improves optimization efficiency in deep convolutional networks. YOLOv9 builds upon the Generalized Efficient Layer Aggregation Network (GELAN) backbone, which integrates design principles from both CSPNet and ELAN.
Beyond direct traffic accident detection, YOLO-based frameworks have been applied to related traffic safety problems, including abnormal driving behavior recognition and crash risk analysis. In addition to detection-based approaches, classification-based methods have also been explored, such as cross-modality interaction models for traffic accident classification [
12]. Aung and Oo [
13] proposed an enhanced YOLO-based approach to detect risky driving behaviors such as sudden lane departures and erratic steering. Hybrid methods that combine spatial and temporal modeling have also been explored. For example, CNN-GRU ensembles have been used to model temporal patterns from dashboard camera data [
14], while Arifeen et al. [
15] integrated CNN-based detectors with one-class and multi-class Support Vector Machine (SVM) classifiers to analyze accident anomalies in the UCF-Crime dataset. Other studies have incorporated object detection with tracking and trajectory analysis, such as combining YOLOv5 with StrongSORT to identify highway crashes based on motion cues [
2].
More recent studies have examined next-generation YOLO architectures for traffic safety and infrastructure monitoring applications [
16,
17]. Recent work has also explored improved YOLO-based frameworks for traffic accident detection, such as enhanced YOLO11 models that refine detection accuracy under complex conditions [
18,
19]. Models designed for challenging operating conditions have also been explored, including fine-tuned YOLOv10 for low-light surveillance [
20]. In addition, YOLOv9-based frameworks leveraging GELAN backbones and anchor-free detection strategies have demonstrated improved performance over earlier YOLO variants in traffic accident detection tasks [
11]. These results position YOLOv9 as a competitive and flexible backbone for advanced traffic incident analysis.
Despite these advances, many existing traffic accident detection studies rely on static images, web-scraped data, or simulated crash scenarios [
11,
14]. Relatively fewer works focus on manually annotated offline video footage, which introduces additional challenges such as motion blur, varying viewpoints, occlusions, and temporal inconsistency. These factors are largely absent from image-based datasets but are critical for realistic deployment in closed-circuit television (CCTV) and dashcam footage.
Overall, prior research demonstrates substantial progress in traffic accident detection using YOLO-based and hybrid deep learning approaches. However, most studies treat YOLO architectures as fixed detectors and focus primarily on training strategies or auxiliary components rather than internal backbone design. In particular, limited work has systematically analyzed how backbone-level modifications, such as CSP-based feature partitioning and deeper-ELAN structures, influence detection behavior in challenging accident scenarios. This gap motivates the present work, which investigates targeted backbone-level enhancements to YOLOv9-t for traffic accident detection using manually annotated offline video footage.
3. Materials and Methods
3.1. Proposed Framework
The overall framework of the proposed approach is illustrated in Figure 1.
The proposed framework processes input video frames using a modified YOLOv9 backbone to improve traffic accident detection performance. Two types of architectural modifications are considered: (1) CSP-based customization for feature partitioning, which splits feature maps to enhance gradient flow and reduce redundancy (named ‘CSP-only’), and (2) extended ELAN structures, which improve multiscale feature aggregation and representation learning (named ‘ELAN-only’). These modifications can be applied individually (‘CSP-only’ or ‘ELAN-only’) or in combination (named ‘CSP + ELAN’) to analyze their effect on detection performance. The modified backbone extracts feature representations from input frames, which are then passed to the detection head to predict bounding boxes and identify accident regions. This design enables the model to better capture complex visual patterns, such as motion blur, occlusion, and irregular object shapes, commonly present in real-world accident scenarios.
To evaluate the effectiveness of the proposed modifications, three configurations are considered: CSP-only, ELAN-only, and a combined CSP + ELAN model.
To further clarify the operational steps of the proposed framework, the workflow is described as follows.
3.2. Proposed Framework Workflow
The proposed framework operates through the following steps:
(1) Input video data is processed into individual frames and resized to a fixed resolution suitable for model input.
(2) Each frame is processed by the modified YOLOv9 backbone, where CSP-based feature partitioning and/or extended ELAN structures are applied depending on the configuration.
(3) The backbone extracts multiscale feature representations, which are forwarded to the detection head for prediction.
(4) The detection head generates bounding boxes and confidence scores for accident regions within each frame.
(5) Non-maximum suppression (NMS) is applied to remove redundant detections and refine the final output.
(6) The final predictions are evaluated using standard object detection metrics.
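The NMS step in the workflow can be illustrated with a minimal greedy implementation. This is only a sketch: the corner-format boxes, the helper names `iou` and `nms`, the example detections, and the IoU threshold of 0.5 are assumptions made for illustration and are not taken from the framework's own post-processing code.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corner format (assumed)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_thr: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_thr."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two near-duplicate boxes and one separate box.
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 260, 260)]
scores = [0.90, 0.75, 0.60]
kept = nms(boxes, scores, iou_thr=0.5)  # indices of the surviving boxes
```

In this example the second box overlaps the first almost completely and is suppressed, while the distant third box survives.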
3.3. Problem Formulation
This work addresses the problem of single-class traffic accident detection from individual video frames. Let $\mathcal{D} = \{(I_i, B_i)\}_{i=1}^{N}$ denote the dataset, where $I_i$ is an input frame and $B_i$ is the set of ground-truth accident bounding boxes in that frame. Because the task is formulated as single-class detection, each annotated box belongs to the same semantic category, namely accident. Given a frame $I_i$, the detector predicts a set of bounding boxes $\hat{B}_i = \{(\hat{b}_j, s_j)\}_{j=1}^{M_i}$, where $\hat{b}_j$ denotes a predicted box and $s_j$ denotes its confidence score.
A predicted box $\hat{b}_j$ is considered correct when its Intersection over Union (IoU) with a ground-truth box $b \in B_i$ exceeds a specified threshold $\tau$:
$$\mathrm{IoU}(\hat{b}_j, b) = \frac{|\hat{b}_j \cap b|}{|\hat{b}_j \cup b|} > \tau.$$
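The IoU criterion can be turned into true-positive, false-positive, and false-negative counts by matching predictions to ground-truth boxes. The sketch below assumes corner-format boxes and a greedy one-to-one matching in descending confidence order; the function names and the example boxes are hypothetical, and the evaluation code actually used may match boxes differently.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(preds: List[Tuple[Box, float]], gts: List[Box],
                     tau: float = 0.5) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) for one frame at IoU threshold tau.

    Each ground-truth box can be matched by at most one prediction.
    """
    used = set()
    tp = fp = 0
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gts):
            if j in used:
                continue
            value = box_iou(box, gt)
            if value > best_iou:
                best_j, best_iou = j, value
        if best_j >= 0 and best_iou > tau:
            used.add(best_j)
            tp += 1
        else:
            fp += 1
    fn = len(gts) - len(used)
    return tp, fp, fn


# Hypothetical frame: one good detection, one spurious detection, one missed accident.
gts = [(0, 0, 100, 100), (500, 500, 560, 560)]
preds = [((5, 5, 105, 105), 0.9), ((300, 300, 350, 350), 0.4)]
counts = match_detections(preds, gts, tau=0.5)
```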
3.4. Evaluation Metrics
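Model performance is evaluated using standard object detection metrics, including precision, recall, F1-score, mAP50, and mAP50–95. Precision, recall, and F1-score are defined as
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively. Precision measures the proportion of predicted traffic accident detections that are correct, while recall indicates the proportion of ground-truth accident instances successfully detected by the model. The F1-score provides a balanced summary of precision and recall.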
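For concreteness, the three count-based metrics can be computed as in the short sketch below. The function name and the example counts are hypothetical and are not results from this study.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to illustrate a high-precision, lower-recall detector.
m = detection_metrics(tp=60, fp=40, fn=90)
```

With these counts, precision is 0.6, recall is 0.4, and F1 is 0.48, which shows how F1 sits between the two when they diverge.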
In addition, Mean Average Precision at an IoU threshold of 0.50 (mAP50) and Mean Average Precision averaged over IoU thresholds from 0.50 to 0.95 with a step size of 0.05 (mAP50–95) are reported. These metrics jointly assess detection reliability, localization accuracy, and robustness across varying overlap thresholds.
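The averaging over IoU thresholds can be sketched as follows for the single-class case. The code uses an all-point area under the precision-recall envelope, which is an assumption: the evaluation framework may use a different interpolation (for example, a fixed recall grid). All names and the example flags are hypothetical.

```python
from typing import Dict, List, Sequence


def iou_thresholds() -> List[float]:
    """The ten thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.50 + 0.05 * k, 2) for k in range(10)]


def average_precision(is_tp: Sequence[bool], n_gt: int) -> float:
    """Area under the precision-recall envelope.

    is_tp lists detections in descending confidence order and marks
    whether each one matched a ground-truth box at a given IoU threshold.
    """
    if n_gt == 0:
        return 0.0
    tp = fp = 0
    recalls, precisions = [], []
    for hit in is_tp:
        tp += int(hit)
        fp += int(not hit)
        recalls.append(tp / n_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing in recall (the usual envelope step).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def map50_95(flags_by_threshold: Dict[float, Sequence[bool]], n_gt: int) -> float:
    """Mean of the per-threshold AP values (single class)."""
    aps = [average_precision(flags_by_threshold[t], n_gt) for t in iou_thresholds()]
    return sum(aps) / len(aps)


# Hypothetical flags: the third detection stops counting as a match at stricter IoU.
flags = {t: [True, False, True] if t < 0.75 else [True, False, False]
         for t in iou_thresholds()}
ap50 = average_precision(flags[0.5], n_gt=2)
m5095 = map50_95(flags, n_gt=2)
```

The example illustrates why mAP50–95 is typically lower than mAP50: detections that count as correct at a loose threshold are reclassified as false positives at stricter ones.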
3.5. Dataset Curation and Annotation Workflow
The dataset used in this study was derived from the Highway Incidents Detection (HWID12) dataset [
21], which was originally developed for video-based incident recognition in intelligent transportation systems. HWID12 contains more than 2780 short video clips, each lasting approximately 3–8 s. In total, the dataset includes over 500,000 frames across 11 incident categories and one normal-traffic class. Although originally designed for temporal incident classification, the dataset provides realistic traffic scenes suitable for frame-level traffic accident detection [
22,
23].
From this source, a focused subset of 250 accident videos was selected to construct a detection-oriented benchmark. The selection aimed to ensure diversity across accident types (e.g., head-on collisions, side impacts, rollovers, and motorcycle crashes), viewpoints, and environmental conditions such as motion blur and low visibility.
The selected videos represent these four visually distinct accident categories. Frames were extracted at 6 frames per second to preserve event progression and viewpoint diversity while reducing excessive redundancy between adjacent frames. This process yielded approximately 3000 frames for manual annotation and subsequent model development.
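Frame extraction at a reduced rate amounts to keeping a regularly spaced subset of frame indices. The sketch below shows one way to select them; the source frame rate of 30 fps and the 5 s clip length in the example are assumptions, since the native frame rate of the clips is not stated here, and the function name is hypothetical.

```python
from typing import List


def sample_frame_indices(n_frames: int, src_fps: float, target_fps: float = 6.0) -> List[int]:
    """Indices of the frames to keep so that about target_fps frames per second remain."""
    if target_fps >= src_fps:
        return list(range(n_frames))  # nothing to drop
    step = src_fps / target_fps       # source frames per kept frame (greater than 1)
    indices = []
    i = 0
    while int(i * step) < n_frames:
        indices.append(int(i * step))
        i += 1
    return indices


# Hypothetical 5 s clip recorded at 30 fps (150 frames), sampled at 6 fps.
idx = sample_frame_indices(n_frames=150, src_fps=30.0, target_fps=6.0)
```

Under these assumptions every fifth frame is kept, giving 30 frames for the clip.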
The extracted frames exhibit several characteristics that make traffic accident detection difficult in practice, including motion blur, sudden object deformation, partial occlusion, camera shake, long-range viewpoints, and illumination variation. These properties are important because they better reflect real deployment conditions in CCTV and dashcam environments.
All frames were manually annotated using two complementary tools. Roboflow [
24] was used to annotate approximately 2300 frames through its cloud-based interface, while the VGG Image Annotator (VIA) [
25] was used to annotate and refine approximately 700 frames offline. Annotations were exported in YOLO format, where each object instance is represented by normalized center coordinates and box dimensions $(x_c, y_c, w, h)$. A single class label, accident, was used throughout the dataset. After annotation, all images were resized to a fixed input resolution and partitioned into training, validation, and test sets using an 80–10–10 split.
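The annotation format and the split can be illustrated with a short sketch. It converts a pixel-space corner box into a YOLO label line and performs a seeded 80–10–10 partition. The function names, the image size in the example, and the seed are hypothetical; the sketch splits whatever items it is given, and whether the original split was made per frame or per video is not specified here.

```python
import random
from typing import List, Sequence, Tuple


def to_yolo_line(x1: float, y1: float, x2: float, y2: float,
                 img_w: int, img_h: int, cls: int = 0) -> str:
    """Convert a pixel-space corner box to a 'class xc yc w h' line with normalized values."""
    xc = (x1 + x2) / 2 / img_w
    yc = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{cls} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


def split_80_10_10(items: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle with a fixed seed, then cut into train/val/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# Hypothetical box in a 1000 x 800 image; class 0 stands for "accident".
line = to_yolo_line(100, 200, 300, 400, img_w=1000, img_h=800)
train, val, test = split_80_10_10(range(3000), seed=0)
```

For roughly 3000 frames, this yields about 2400 training, 300 validation, and 300 test items.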
Table 1 summarizes the key characteristics of the dataset used in this study.
Although formal inter-annotator agreement metrics were not computed, annotation quality was maintained through careful manual verification and cross-checking using both Roboflow and VIA tools. Difficult cases were reviewed and refined to improve consistency. Future work will include multi-annotator validation and quantitative agreement analysis.
Figure 2 illustrates representative examples of the annotation workflow. The use of both cloud-based and offline tools made it possible to cross-check difficult cases and refine annotations in frames affected by blur, occlusion, and low visibility.
3.6. Experimental Models and Comparison Protocol
This study primarily targets YOLOv9-based models, because YOLOv9 introduces the GELAN backbone together with an anchor-free detection design, both of which are well suited to visually complex accident scenes with scale variation and dense background clutter [
11]. The main experiments therefore compare YOLOv9-s, the baseline YOLOv9-t, YOLOv9-t with CSP integration, YOLOv9-t with deeper-ELAN expansion, and YOLOv9-t with both modifications combined. Among the standard YOLOv9 variants, YOLOv9-t was chosen as the principal reference model because it provides a favorable balance between computational efficiency and representational capacity, making it an appropriate base architecture for the proposed enhancements.
To place the YOLOv9 results in a broader comparative context, three widely used YOLO detectors were also included as baselines: YOLOv5-s, YOLOv5-n, and YOLOv8-n [
5,
6]. These models were selected because they are lightweight, practical for real-time deployment, and commonly used in traffic monitoring applications. All baseline models were initialized with official pretrained weights and fine-tuned on the accident dataset under the same experimental conditions to ensure a fair comparison.
To isolate the effect of architectural modifications, all models were trained using the same image resolution, dataset partition, optimization strategy, model-selection criterion, and evaluation metrics. This comparison protocol ensures that observed performance differences are attributable primarily to model design rather than to differences in preprocessing or training configuration.
Because this study focuses on the impact of the proposed backbone modifications, the default hyperparameters of all evaluated YOLO versions were left unchanged. All experiments followed a consistent training protocol to ensure reproducibility and fair comparison across model variants: every model was trained with the Ultralytics YOLO framework, initialized from pretrained weights, and evaluated on the dataset described in Table 1.
The AdamW optimizer was used with a cosine learning rate schedule. Standard data augmentation techniques provided by the Ultralytics YOLO framework were used during training. These include built-in transformations such as scaling, flipping, and color adjustments. No additional or customized augmentation strategies were applied in order to maintain a controlled comparison between architectural configurations. The batch size and other training parameters were kept consistent across all experiments to ensure a fair comparison of architectural modifications.
3.7. Base YOLOv9-t Detector
After selecting YOLOv9-t as the primary reference architecture, its backbone and detection pipeline were used as the starting point for the proposed modifications. At a high level, the detector consists of three functional stages: a backbone for hierarchical feature extraction, a feature-fusion stage for multiscale aggregation, and a detection head for box regression and confidence prediction. The backbone progressively transforms the input image into feature maps with increasing semantic richness and decreasing spatial resolution. The multiscale fusion stage combines information from different resolutions so that both coarse contextual cues and fine local details can contribute to detection. The final detection head predicts object locations and confidence scores over multiple feature scales.
This multiscale design is particularly important for traffic accident detection because accident regions can appear at widely varying sizes depending on camera distance, vehicle scale, and viewpoint. In addition, accident evidence is often visually subtle, such as slight vehicle deformation, partial overlap between vehicles, or small regions of smoke and dust. These characteristics motivate backbone-level modifications that can preserve discriminative detail while improving gradient propagation during optimization.
3.8. Proposed Backbone Enhancements
3.8.1. CSP-Enhanced YOLOv9-t
The first proposed modification introduces an explicit CSP-based feature partitioning mechanism at the backbone level of YOLOv9-t. Unlike the standard YOLOv9 architecture, where CSP is already integrated internally within blocks such as RepNCSPELAN4, our approach applies CSP externally by wrapping CSP around the existing RepNCSPELAN4.
Figure 3 illustrates the modified YOLOv9-t backbone, highlighting the replacement of the RepNCSPELAN4 block with the proposed CSPRepNCSPELAN4 module at the P4 stage. Specifically, we replace one RepNCSPELAN4 block at the P4 stage (corresponding to the 1/16-resolution feature map) with a custom module, referred to as CSPRepNCSPELAN4. This modification is implemented as a direct replacement rather than an additional branch, ensuring that the overall backbone structure remains unchanged while altering the feature processing at that stage.
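The proposed CSPRepNCSPELAN4 module first applies a convolution to align channel dimensions, followed by channel-wise partitioning of the feature map:
$$[F_1, F_2] = \mathrm{Split}\big(\mathrm{Conv}_{\mathrm{in}}(F)\big),$$
where $F$ is the input feature map and $F_1$ and $F_2$ represent two equal channel partitions. The first partition is passed through a RepNCSPELAN4 block (the existing YOLOv9-t block at the same position), while the second partition is preserved through an identity mapping (lightweight shortcut). The outputs are then concatenated and fused using a second convolution:
$$F_{\mathrm{out}} = \mathrm{Conv}_{\mathrm{out}}\big(\mathrm{Concat}(\mathrm{RepNCSPELAN4}(F_1), F_2)\big).$$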
This design introduces a higher-level feature partitioning mechanism compared to the internal CSP operations already present in YOLOv9. By applying CSP externally around the ELAN-based structure, the model preserves original feature information while enabling deeper transformation on a subset of channels, thereby improving feature reuse and gradient propagation.
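A minimal PyTorch sketch of this outer CSP wrapper is given below. It is illustrative rather than the implementation used in the experiments: the class name `CSPWrapper`, the 1×1 kernels of the input and output convolutions, the channel sizes, and the plain convolutional stand-in for the inner RepNCSPELAN4 block are all assumptions.

```python
from typing import Optional

import torch
import torch.nn as nn


class CSPWrapper(nn.Module):
    """Outer CSP split around an inner block (sketch of the CSPRepNCSPELAN4 idea)."""

    def __init__(self, c_in: int, c_out: int, inner: Optional[nn.Module] = None):
        super().__init__()
        assert c_out % 2 == 0, "c_out must be even for an equal channel split"
        half = c_out // 2
        # Input convolution aligns the channel count (1x1 kernel is an assumption).
        self.conv_in = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        # Stand-in for RepNCSPELAN4: any module mapping `half` -> `half` channels.
        self.inner = inner if inner is not None else nn.Sequential(
            nn.Conv2d(half, half, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(half),
            nn.SiLU(),
        )
        # Output convolution fuses the concatenated partitions (1x1 is an assumption).
        self.conv_out = nn.Conv2d(c_out, c_out, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1, f2 = self.conv_in(x).chunk(2, dim=1)    # two equal channel partitions
        y = torch.cat([self.inner(f1), f2], dim=1)  # transformed part + identity part
        return self.conv_out(y)


# Hypothetical channel sizes and a 40 x 40 feature map.
block = CSPWrapper(c_in=64, c_out=128).eval()
with torch.no_grad():
    out = block(torch.randn(1, 64, 40, 40))
```

Because the second partition bypasses the inner block, half of the channels reach the fusion convolution unchanged, which is the shortcut path described above.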
The modification is applied specifically at the P4 stage of the backbone. In YOLO-based architectures, different stages (P1–P5) represent feature maps at progressively lower spatial resolutions and higher semantic abstraction. Early stages such as P1 and P2 primarily capture low-level features (e.g., edges and textures), while deeper stages such as P5 focus on high-level semantic context but with reduced spatial detail. The P4 stage provides a balance between spatial resolution and semantic richness, making it particularly suitable for detecting medium-scale objects and localized patterns such as vehicle collisions and accident regions.
Applying the CSP-enhanced block at P4 allows the model to improve feature discrimination where both localization accuracy and contextual understanding are critical. Modifying earlier stages (P1–P2) may disrupt low-level feature extraction, while modifying deeper stages (P5) may limit spatial precision due to reduced resolution. Therefore, P4 represents an effective trade-off point for introducing architectural enhancements.
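The resolution trade-off between stages can be made concrete with a small calculation. The sketch assumes that stage P_k has stride 2^k, which is consistent with P4 being the 1/16-resolution feature map, and uses a hypothetical 640-pixel input side purely for illustration.

```python
def stage_resolution(input_size: int, level: int) -> int:
    """Side length of the P<level> feature map, assuming a stride of 2**level."""
    return input_size // (2 ** level)


# 640 is a hypothetical input size used only for this example.
sizes = {f"P{k}": stage_resolution(640, k) for k in range(1, 6)}
```

Under these assumptions P4 yields a 40 × 40 grid, compared with 20 × 20 at P5 and 80 × 80 at P3.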
3.8.2. Deeper-ELAN YOLOv9-t
The second modification investigates a deeper-ELAN-style aggregation strategy within the RepNCSPELAN4 module. ELAN-based designs improve representational diversity by aggregating features from multiple transformation depths [
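9]. Let $F^{(1)}, F^{(2)}, \ldots, F^{(k)}$ denote feature maps obtained from progressively deeper transformations of the same input feature map $F$. ELAN-style aggregation can be expressed as
$$F_{\mathrm{ELAN}} = \mathrm{Conv}\big(\mathrm{Concat}(F^{(1)}, F^{(2)}, \ldots, F^{(k)})\big),$$
where the concatenated features are fused into a unified representation.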
In this study, the RepNCSPELAN4 block was extended with one additional sequential transformation stage. Like the two existing sequential stages in the original RepNCSPELAN4, the added stage consists of a RepNCSP block followed by a Conv layer. It increases the effective depth of feature extraction while preserving the ELAN aggregation mechanism, because its output is concatenated with the other intermediate outputs.
The motivation is that accident scenes often contain subtle and heterogeneous patterns, such as vehicle edges, collision overlap regions, scattered debris, or dust clouds, which benefit from deeper feature refinement. By adding an additional transformation stage, the model is able to capture more complex feature representations and improve sensitivity to fine-grained visual patterns. However, increasing the depth also introduces additional computational complexity and may affect optimization stability, particularly when training data are limited.
Figure 4 illustrates the deeper-ELAN structure with the added sequential stage.
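The deeper-ELAN idea can be sketched in PyTorch as an ELAN-style block with a configurable number of sequential stages. The layout (a stem convolution whose output is split into two halves, followed by chained stages whose outputs are all concatenated) reflects a common reading of the public RepNCSPELAN4 design and should be treated as an assumption. The class name, channel sizes, and the plain convolutions standing in for "RepNCSP followed by Conv" are hypothetical.

```python
import torch
import torch.nn as nn


def conv_bn_act(c_in: int, c_out: int, k: int) -> nn.Sequential:
    """Convolution + batch norm + SiLU with 'same' padding."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )


class DeeperELAN(nn.Module):
    """ELAN-style aggregation; n_stages=3 adds one stage to a two-stage layout."""

    def __init__(self, c_in: int, c_out: int, c_mid: int, c_stage: int, n_stages: int = 3):
        super().__init__()
        assert c_mid % 2 == 0, "c_mid must be even so the stem output can be halved"
        self.stem = conv_bn_act(c_in, c_mid, 1)
        stages, prev = [], c_mid // 2
        for _ in range(n_stages):
            # Stand-in for "RepNCSP followed by Conv": two plain conv layers.
            stages.append(nn.Sequential(conv_bn_act(prev, c_stage, 3),
                                        conv_bn_act(c_stage, c_stage, 3)))
            prev = c_stage
        self.stages = nn.ModuleList(stages)
        # Fusion sees both stem halves plus every stage output.
        self.fuse = conv_bn_act(c_mid + n_stages * c_stage, c_out, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = list(self.stem(x).chunk(2, dim=1))
        for stage in self.stages:
            feats.append(stage(feats[-1]))  # each stage refines the previous output
        return self.fuse(torch.cat(feats, dim=1))


# Hypothetical channel sizes: a two-stage block versus the deeper three-stage block.
two_stage = DeeperELAN(64, 128, c_mid=64, c_stage=32, n_stages=2).eval()
deeper = DeeperELAN(64, 128, c_mid=64, c_stage=32, n_stages=3).eval()
with torch.no_grad():
    out = deeper(torch.randn(1, 64, 20, 20))
```

The extra stage widens the input of the fusion convolution by one stage's worth of channels, which is one source of the added computational cost mentioned above.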
3.8.3. Combined CSP and Deeper-ELAN Configuration
To examine whether the two modifications provide complementary benefits, a combined YOLOv9-t + ELAN + CSP variant was also evaluated. In this configuration, the CSP-based wrapper (CSPRepNCSPELAN4) and the deeper-ELAN modification were applied simultaneously at the P4 stage of the backbone. Specifically, the RepNCSPELAN4 block at P4 was replaced with the CSPRepNCSPELAN4 module, while its internal ELAN structure was extended by adding an additional sequential transformation stage. This experiment was designed to determine whether outer-level feature partitioning (CSP) and deeper sequential feature extraction (ELAN) reinforce each other or instead lead to conflicting optimization behavior.
It is important to note that YOLOv9 already incorporates CSPNet and ELAN design principles through the GELAN backbone. Therefore, introducing additional CSP-based partitioning and deeper-ELAN structures is not a trivial extension. These modifications may introduce redundant feature processing, increase architectural complexity, and create competing gradient pathways, particularly under limited training data conditions. Experimental results show that the combined configuration does not consistently outperform individual modifications, indicating potential interference between CSP-based partitioning and deeper-ELAN aggregation.
3.9. Training Objective, Optimization, and Implementation Details
All models were trained using transfer learning from official pretrained weights. During fine-tuning, the input resolution was fixed to match the preprocessing applied to the annotated dataset. The detection objective followed the standard YOLO training formulation, which combines box regression, classification, and localization-refinement terms:
$$\mathcal{L} = \lambda_{\mathrm{box}} \mathcal{L}_{\mathrm{box}} + \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{dfl}} \mathcal{L}_{\mathrm{dfl}},$$
where $\mathcal{L}_{\mathrm{box}}$ is the bounding-box regression loss, $\mathcal{L}_{\mathrm{cls}}$ is the classification loss, $\mathcal{L}_{\mathrm{dfl}}$ is the Distribution Focal Loss (DFL) term used to improve localization precision, and the $\lambda$ coefficients are the framework's default loss weights. Although only a single semantic class is predicted, the classification component remains necessary for distinguishing accident regions from background hypotheses during detection.
All experiments were conducted on a workstation equipped with an NVIDIA RTX A1000 GPU with 8 GB of VRAM. Models were trained for 100 to 150 epochs, depending on convergence behavior, using batch sizes between 8 and 16 as permitted by memory constraints. The AdamW optimizer was employed together with a cosine learning rate schedule and automatic warm-up, using the default hyperparameters provided by the Ultralytics YOLO framework. Standard data augmentation techniques inherent to the framework were applied during training. Model checkpoints were saved at each epoch, and the final model for testing was selected according to the best validation mAP50–95.
To ensure a fair comparison, the same training protocol was applied across all evaluated models, including the baseline YOLOv5 and YOLOv8 detectors and all YOLOv9 variants. Thus, differences in performance reflect architectural differences rather than differences in training schedule or evaluation setup.
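One way to keep the protocol identical across models is to build the training arguments in a single place. The sketch below assembles such a dictionary using argument names from the Ultralytics training interface (`data`, `imgsz`, `epochs`, `batch`, `optimizer`, `cos_lr`, `seed`). The helper name, the dataset file name `accident.yaml`, the image size of 640, and the seed value are hypothetical placeholders, not values reported in this study.

```python
def build_train_args(data_yaml: str, imgsz: int, epochs: int = 100,
                     batch: int = 16, seed: int = 0) -> dict:
    """Shared training arguments so every model variant uses the same protocol."""
    return {
        "data": data_yaml,      # dataset description file (hypothetical name below)
        "imgsz": imgsz,         # fixed input resolution, supplied by the caller
        "epochs": epochs,       # the text reports 100 to 150 epochs
        "batch": batch,         # the text reports batch sizes between 8 and 16
        "optimizer": "AdamW",
        "cos_lr": True,         # cosine learning-rate schedule
        "seed": seed,           # fixed seed for reproducibility
    }


args = build_train_args("accident.yaml", imgsz=640)

# With the Ultralytics package installed and the dataset prepared, the same
# arguments could be passed to each model, for example:
#   from ultralytics import YOLO
#   YOLO("yolov9t.pt").train(**args)
```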
3.10. Reproducibility
To support reproducibility, fixed random seeds were used across experiments. Training configurations were stored in YAML files, and training logs, loss curves, and model checkpoints were automatically preserved. All trained weights and evaluation reports were retained to facilitate independent verification and replication of the reported results. In addition, all models were evaluated on the same held-out test split using the same confidence-based detection framework and the same mAP computation protocol, ensuring consistent comparison across architectures.
4. Results
This section presents the performance of the evaluated models from three complementary perspectives. First, quantitative comparisons are reported for the baseline and modified YOLO architectures. Second, training behavior and qualitative detection outputs are analyzed to better understand how the proposed modifications affect model performance in challenging accident scenes. Third, computational efficiency is reported to assess the practical suitability of the best-performing model for deployment.
4.1. Quantitative Performance Comparison
Table 2 summarizes the performance of the evaluated YOLOv9 variants and the proposed backbone-level enhancements. The baseline YOLOv9-t achieved an mAP50 of approximately 0.35 and an mAP50–95 of approximately 0.15 on the traffic accident detection dataset. Relative to this baseline, both backbone modifications improved detection performance, although they emphasized different aspects of model behavior.
Among all evaluated variants, YOLOv9-t with CSP achieved the highest precision (0.601), mAP50 (0.50), and mAP50–95 (0.282), indicating the strongest localization quality and the most reliable high-confidence detections. This corresponds to a relative improvement of approximately 42.8% in mAP50 compared to the baseline YOLOv9-t model.
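As a rough check on this figure, the relative gain can be worked out from the rounded values quoted above. Because the baseline mAP50 is reported only approximately, the result may differ from the stated 42.8% in the last digit.

```latex
\frac{\mathrm{mAP50}_{\mathrm{CSP}} - \mathrm{mAP50}_{\mathrm{base}}}{\mathrm{mAP50}_{\mathrm{base}}}
= \frac{0.50 - 0.35}{0.35} \approx 0.429
```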
Although the improvements are moderate, they are consistent across multiple evaluation metrics and reflect the challenging nature of traffic accident detection. In practical traffic monitoring systems, even small improvements in detection accuracy can significantly enhance early traffic accident detection and reduce response time, making the proposed modifications practically meaningful.
In contrast, the deeper-ELAN version of YOLOv9-t achieved the highest recall (0.450) and F1-score (0.484), suggesting a more balanced trade-off between missed detections and false positives. However, its localization metrics remained below those of the CSP-enhanced model.
This contrast highlights a clear trade-off between localization accuracy and detection sensitivity across different backbone modifications. The observed difference between precision and recall indicates that the CSP-enhanced model tends to produce more conservative predictions, prioritizing high-confidence detections while potentially missing some accident instances. This behavior leads to higher precision but comparatively lower recall. In safety-critical applications such as traffic monitoring, this trade-off is common, as reducing false positives is often important to avoid unnecessary alerts. However, improving recall remains essential for comprehensive traffic accident detection and can be further addressed through enhanced data augmentation, threshold tuning, and hyperparameter optimization.
The moderate F1-score and mAP50–95 values reflect both the inherent difficulty of traffic accident detection in real-world scenarios, where visual cues are often subtle and affected by motion blur, occlusion, and viewpoint variation, and the limited dataset size. Thus, while the proposed modifications improve performance, achieving robust detection across varying IoU thresholds remains challenging.
The combined CSP + ELAN configuration does not outperform the individual enhancements and in some cases leads to reduced performance. This behavior suggests that simultaneously integrating multiple architectural modifications may introduce optimization conflicts or redundant feature representations within the GELAN-based YOLOv9 backbone, particularly under limited data conditions.
To place these results in a broader context, Table 3 compares the best-performing model in terms of localization, YOLOv9-t + CSP, against baseline YOLO detectors trained on the same dataset, including the recent YOLO26n model. YOLO26n is included as a representative of recent lightweight detection architectures designed for efficient deployment, providing a stronger and more modern baseline comparison. YOLO26n achieved the highest precision (0.632) and recall (0.437), indicating stronger detection sensitivity. However, the proposed CSP-enhanced YOLOv9-t achieved higher mAP50 (0.50) and mAP50–95 (0.282), demonstrating superior localization performance. These results highlight a trade-off between detection sensitivity and localization accuracy, and indicate that the proposed CSP-based backbone refinement provides a measurable advantage in spatial precision over both earlier YOLO generations and newer detection models.
4.2. Training Behavior and Qualitative Detection Analysis
The quantitative improvements are further supported by the model’s confidence-threshold behavior and training dynamics.
Figure 5 presents the F1–confidence curve (left) and the corresponding precision–recall curve (right) for the CSP-enhanced YOLOv9-t. Together, these plots illustrate the trade-off between precision and recall across confidence thresholds and confirm that the model maintains competitive detection quality over a useful operating range.
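To illustrate how an F1–confidence curve of this kind is obtained, the following minimal sketch sweeps a confidence threshold over a set of detections and reports the threshold with the highest F1-score. It is an assumption-laden illustration, not the evaluation pipeline used in this study: the detections are made-up toy values, matching to ground truth at a fixed IoU is assumed to have been done beforehand, and the function name `f1_confidence_curve` is hypothetical.

```python
def f1_confidence_curve(detections, num_ground_truth, thresholds):
    """Return a list of (threshold, precision, recall, f1) tuples.

    detections: list of (confidence, is_true_positive) pairs; the match
    against ground truth at a fixed IoU is assumed to be done already.
    """
    curve = []
    for t in thresholds:
        kept = [is_tp for conf, is_tp in detections if conf >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        # Convention assumed here: precision is 1.0 when nothing is predicted.
        precision = tp / (tp + fp) if kept else 1.0
        recall = tp / num_ground_truth
        if precision + recall > 0:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        curve.append((t, precision, recall, f1))
    return curve


# Toy, made-up detections (NOT results from this study).
toy_detections = [
    (0.92, True), (0.85, True), (0.70, False), (0.61, True),
    (0.40, False), (0.33, True), (0.15, False),
]

curve = f1_confidence_curve(
    toy_detections, num_ground_truth=5, thresholds=[0.1, 0.3, 0.5, 0.7, 0.9]
)
best = max(curve, key=lambda row: row[3])  # operating point with the highest F1

for t, p, r, f1 in curve:
    print(f"conf>={t:.1f}  precision={p:.3f}  recall={r:.3f}  F1={f1:.3f}")
print(f"best threshold: {best[0]}")
```

Raising the threshold discards low-confidence predictions, which tends to raise precision and lower recall; the F1 peak marks a balanced operating point between the two.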
Training stability is illustrated in Figure 6, which shows the loss curves for bounding-box regression, classification, and Distribution Focal Loss (DFL). All three loss components decrease steadily throughout training, indicating stable optimization and consistent convergence behavior.
Figure 7 shows two representative qualitative detection examples under both daylight and low-visibility conditions. These examples demonstrate that the CSP-enhanced YOLOv9-t can localize accident events under both clear daylight conditions (Figure 7a) and visually degraded dusty scenes (Figure 7b). The results suggest that the model remains effective despite the motion blur, partial visibility, and low-contrast characteristics commonly found in accident footage. Interested readers can refer to the repository of this project for further qualitative examples.
4.3. Computational Efficiency
In addition to improved detection accuracy, the proposed model retained practical inference speed. When evaluated on an NVIDIA RTX A1000 GPU with 8 GB of VRAM, the CSP-enhanced YOLOv9-t processed 640 × 640 frames at approximately 48–55 frames per second (FPS), roughly eight to nine times the frame extraction rate (6 FPS) used during dataset preparation. All evaluated variants exhibited similar inference speeds due to their shared YOLOv9-based architecture. Training time ranged from 1.3 to 1.5 h per experiment, and the final model contained approximately 23 million parameters. These results indicate that the proposed enhancement improves detection performance without sacrificing the real-time capability required for traffic monitoring applications.
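The throughput figures above translate directly into per-frame latency and headroom over the frame extraction rate. The short sketch below performs that arithmetic using only the values reported in the text (48–55 FPS and 6 FPS); the variable and function names are illustrative.

```python
def frame_latency_ms(fps: float) -> float:
    """Average time per frame in milliseconds at a given throughput."""
    return 1000.0 / fps


# Values reported in the text.
fps_low, fps_high = 48.0, 55.0   # measured inference throughput range
extraction_fps = 6.0             # frame extraction rate used for the dataset

latency_worst_ms = frame_latency_ms(fps_low)    # slowest reported case
latency_best_ms = frame_latency_ms(fps_high)    # fastest reported case
headroom_low = fps_low / extraction_fps         # throughput vs. extraction rate
headroom_high = fps_high / extraction_fps

print(f"latency per frame: {latency_best_ms:.1f} to {latency_worst_ms:.1f} ms")
print(f"headroom over 6 FPS: {headroom_low:.1f}x to {headroom_high:.1f}x")
```

At these rates each frame takes on the order of 18 to 21 ms, well within the roughly 167 ms budget implied by a 6 FPS input stream.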
5. Discussion
The results demonstrate that targeted backbone-level modifications can improve YOLOv9-t for traffic accident detection in challenging video frames. Among the evaluated variants, the CSP-enhanced YOLOv9-t achieved the highest precision and the strongest localization performance, as reflected by the best mAP50 and mAP50–95 scores. In contrast, the deeper-ELAN variant achieved the highest recall and F1-score, indicating improved detection sensitivity. The combined ELAN + CSP configuration did not outperform the individual variants, suggesting that excessive branching may interfere with optimization in the GELAN-based YOLOv9 backbone.
From an architectural perspective, these results are consistent with the design of the proposed modifications. CSP-based feature partitioning improves localization accuracy by preserving spatially relevant features while reducing redundancy. This is particularly beneficial for traffic accident detection, where informative visual cues are often sparse and degraded by motion blur, occlusion, and viewpoint variation. In contrast, the deeper-ELAN design enhances feature aggregation, improving recall and detection sensitivity, although with reduced localization precision compared to CSP. The weaker performance of the combined model further suggests that increased architectural complexity may introduce optimization conflicts under limited data conditions.
The qualitative results support these observations. The CSP-enhanced model remains robust under motion blur, dust interference, off-angle viewpoints, and partial occlusions, which are representative of real CCTV and dashcam footage. In addition, the model maintains real-time inference speeds of approximately 48–55 FPS, supporting its suitability for intelligent traffic monitoring, roadside surveillance, and early-warning systems.
Several factors provide additional context for interpreting these findings. First, the dataset is relatively modest in size, consisting of approximately 3000 annotated frames. Although limited, the dataset was carefully curated to cover challenging real-world conditions such as motion blur, occlusion, viewpoint variation, and low visibility. All frames were manually annotated and verified using both Roboflow and VIA to ensure consistency and annotation quality. Accordingly, the goal of this study is not to maximize absolute performance but to provide a controlled analysis of backbone-level modifications. Evaluation on larger and more diverse datasets remains an important direction for future work.
Second, manual annotation introduces some degree of subjectivity and limits dataset scalability. Third, the use of a single accident class provides a practical starting point for detection, although more fine-grained labels would enable severity analysis and multi-class reasoning. This simplification allows the study to focus on localization performance and backbone behavior under challenging conditions. Extending the framework to multi-class traffic accident detection is left for future work.
Fourth, hardware constraints associated with the NVIDIA RTX A1000 GPU limited batch size and the evaluation of larger YOLOv9 variants. In addition, no extensive hyperparameter tuning was performed. While consistent training settings and default augmentations were used to ensure fair comparison, further optimization may improve recall and overall performance.
Furthermore, the reported results are based on single-run evaluations and do not include statistical significance analysis. While consistent experimental settings were maintained, future work should include multiple runs with different random seeds, along with reporting the mean, standard deviation, and statistical validation to better assess robustness.
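The multi-seed reporting suggested above could be summarized as in the following minimal sketch. It is an illustration only: the scores are arbitrary placeholder numbers chosen to show the mechanics, not measurements from this study, and the helper name `summarize_runs` is hypothetical.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of a metric across runs."""
    if len(scores) < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


# Arbitrary placeholder values, one per random seed (NOT results from this study).
toy_scores = [0.10, 0.20, 0.30]

mean_score, std_score = summarize_runs(toy_scores)
print(f"metric = {mean_score:.3f} +/- {std_score:.3f} over {len(toy_scores)} seeds")
```

Reporting the spread alongside the mean makes it possible to judge whether a difference between two variants exceeds run-to-run variation.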
Finally, the current framework processes frames independently without modeling temporal continuity, which limits its ability to capture motion dynamics. Direct quantitative comparison with prior traffic accident detection methods is also limited due to differences in datasets and evaluation protocols. Most existing works focus on alternative benchmarks or video-level classification tasks, whereas this study targets frame-level detection.
The comparison with the recent YOLO26n model further supports these observations. YOLO26n achieves higher precision and recall, indicating stronger detection sensitivity, while the proposed CSP-enhanced YOLOv9-t achieves higher mAP50 and mAP50–95, reflecting better localization performance. This highlights a trade-off between detection sensitivity and spatial accuracy across different architectural designs.
The lightweight design of YOLOv9-t also makes it suitable for real-time deployment scenarios. While detailed benchmarking on edge devices is beyond the scope of this study, the observed inference speed suggests strong potential for practical deployment.
Taken together, these findings indicate that backbone-level refinement is a promising direction for improving traffic accident detection. Further improvements may be achieved through larger datasets, richer annotations, hyperparameter optimization, and the integration of temporal modeling.
6. Conclusions
This study investigated the impact of backbone-level architectural modifications on YOLOv9-t for traffic accident detection in challenging video-based scenarios. The experimental results demonstrate that targeted enhancements can improve detection performance, although different modifications emphasize different aspects of model behavior.
Among the evaluated models, the CSP-enhanced YOLOv9-t achieved the strongest localization performance, with a precision of 0.601, mAP50 of 0.50, and mAP50–95 of 0.282, corresponding to a relative improvement of approximately 42.8% in mAP50 compared to the baseline YOLOv9-t model. In contrast, the deeper-ELAN variant achieved higher recall (0.450) and F1-score (0.484), indicating improved detection sensitivity.
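The relative improvement quoted above follows from a simple ratio. The sketch below, with illustrative function names, back-computes the baseline mAP50 implied by the two reported numbers (mAP50 of 0.50 and a 42.8% relative gain). The baseline value is derived here rather than quoted, so it should be read as approximate given rounding in the reported figures.

```python
def relative_improvement_pct(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


def implied_baseline(new: float, improvement_pct: float) -> float:
    """Baseline value implied by a new value and its relative improvement."""
    return new / (1.0 + improvement_pct / 100.0)


# Values reported in the text for the CSP-enhanced YOLOv9-t.
csp_map50 = 0.50
reported_gain_pct = 42.8

# Derived from the two values above; not a figure quoted from the text.
baseline_map50 = implied_baseline(csp_map50, reported_gain_pct)

print(f"implied baseline mAP50: {baseline_map50:.3f}")
print(f"check: {relative_improvement_pct(csp_map50, baseline_map50):.1f}% gain")
```

Under these inputs the implied baseline mAP50 is about 0.35, corresponding to an absolute gain of roughly 0.15.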
The results further reveal that combining CSP and ELAN does not necessarily lead to improved performance, highlighting important trade-offs in backbone design. Overall, these findings show that backbone-level refinement can improve detection behavior, which is especially important in complex accident scenarios involving motion blur, occlusion, and varying viewpoints.
Future work will focus on evaluating the statistical robustness of the proposed models through multiple runs with different random seeds, reporting the mean and standard deviation of performance metrics, and conducting statistical significance analysis. In addition, inference speed, deployment performance on real-world edge devices, dataset expansion, temporal modeling, and hyperparameter optimization will be investigated to improve robustness and generalization.
Although the overall performance improvements are moderate, they are consistent and practically relevant for real-world traffic monitoring systems.