1. Introduction
Road traffic injuries are among the leading causes of preventable death and disability worldwide. The World Health Organization reports that road traffic accidents claim 1.19 million lives annually, making them a leading cause of death among children and young people aged 5–29 [1]. Traffic accidents also impose a substantial economic burden, with costs estimated at 3% of GDP in most countries. The international community has responded through the United Nations Decade of Action for Road Safety 2021–2030, which aims to reduce road traffic-related deaths by 50% by 2030.
Traffic violations, such as speeding and running red lights, are indicators of serious crashes. Enforcement today remains largely reactive: most violations are detected by cameras or patrols after they occur, with limited ability to predict when or where they might result in serious injuries.
Predictive modeling can aid in making our roads safer. Rather than simply issuing tickets after the fact, this technique helps identify patterns—where and when the most serious crashes are likely to occur—enabling intervention before a violation results in more tragic consequences. This approach shifts the focus from punishment to prevention and from reacting to protecting.
Machine learning has recently emerged as a promising method in road safety research because it can capture complex, non-linear patterns in large datasets [2]. Several recent studies show that ensemble models such as Random Forest, Gradient Boosting, XGBoost, and LightGBM outperform traditional algorithms in accident severity classification [3,4].
Qatar National Vision 2030 (QNV 2030) [5] is a national development framework that includes an explicit commitment to road safety. Qatar issued a National Road Safety Strategy [6] with revised, legally binding targets, including a 50% reduction in traffic deaths. Road traffic deaths decreased from 235 in 2013 to 178 in 2016, and Qatar has maintained the top rank among Arab states for road safety [1]. Despite this progress, more than 1 million accidents were recorded in the 2020–2025 period, highlighting the ongoing need for data-driven techniques.
This study is the first large-scale machine learning analysis of traffic accident severity using Qatar’s national crash database (1,000,500 raw records; 588,023 after cleaning), addressing a critical gap in GCC road safety research. Although several machine learning and statistical studies have been conducted in the region [3,4,7,8,9,10,11,12], the existing literature has three main methodological limitations. (1) Qatar-specific gap: despite growing global concern over traffic safety, research on traffic crash prediction in Qatar is limited. GCC studies exist, but they remain predominantly Saudi-based [3,4,8] and rely on aggregated or small-scale datasets. No published ML study has applied large-scale severity modeling to Qatar’s complete national accident database. Timmermans et al. [7] identified this gap, noting that their dataset made it impossible to provide statistical correlations or predictions and calling for in-depth investigation with forecasting analyses. (2) Target leakage: some studies include target variables as predictors [13], which inflates performance metrics and renders models operationally impractical for real-world prevention. (3) Improper handling of extreme imbalance: class imbalance is addressed in various ways [14], and some studies apply oversampling before data splitting, which leaks synthetic samples into the test set and compromises validity.
In order to fill the aforementioned gaps, we aim to examine whether it is feasible for machine learning models developed based only on pre-crash and on-crash data to predict crash severity in Qatar, with acceptable and repeatable performance. We further examine which classification strategy—multi-class, binary, or cascade—provides the highest balanced accuracy under real-world imbalance conditions and identify the factors most significantly influencing severity in Qatar through SHAP analysis. Lastly, we assess how the proposed framework compares to existing GCC studies in terms of methodological rigor and predictive performance. Through this work, we make four key contributions:
(1) We present the first large-scale machine learning study using Qatar’s complete accident database (588,023 cleaned records, 2020–2025), addressing a critical research gap in GCC countries where existing ML studies remain limited.
(2) We provide a systematic comparison of three classification frameworks across 36 configurations under extreme class imbalance, with explicit prevention of target leakage through formal exclusion of post-crash variables.
(3) We show that binary reformulation achieves a 61% relative improvement in balanced accuracy over the multi-class approach (71.04% vs. 44.12%), demonstrating that problem reformulation yields greater gains than algorithmic selection, a finding applicable beyond Qatar to global traffic safety datasets.
(4) We use SHAP-based interpretability to translate predictions into actionable policy recommendations for Qatar’s road safety strategy.
The remainder of this paper is organized as follows: in Section 2, we review related work; in Section 3, we describe the dataset and methodology; in Section 4, we present the results; in Section 5, we discuss the findings and policy implications; lastly, in Section 6, we present our conclusions.
3. Materials and Methods
Our research framework (Figure 1) was developed in Python 3.13, using several libraries, including scikit-learn [26], imbalanced-learn, XGBoost [27], and LightGBM [19]. All experiments were performed on a Windows 11 64-bit system (32 GB RAM, Intel Core i7-8850H). The methodology involved a series of interconnected steps: data preprocessing, feature engineering, addressing class imbalance, and training models across multi-class, binary, and cascaded frameworks. We then assessed performance using balanced metrics and incorporated SHAP for model interpretability.
3.1. Dataset Description and Preprocessing
The dataset, covering 2020–2025, was obtained from the Ministry of Interior of the State of Qatar (General Directorate of Traffic) and comprises 1,000,500 raw records. The class distribution of the final 588,023-record dataset is presented in Table 2.
The imbalance ratio of 2295:1 between SIMPLE and DEATH INJURY classes places this dataset among the most severely imbalanced in the traffic accident severity prediction literature [2,14].
Table 3 summarizes the four-stage preprocessing pipeline and the impact of each cleaning step on record retention. The process began with 1,000,500 raw accident records. First, four columns with missing rates >75% were removed. The post-crash variable Death Count was explicitly excluded to prevent target leakage, which is a common methodological error in traffic safety ML studies. Next, records classified as ‘NOT A TRAFFIC ACCIDENT’ (n = 448) and those with missing critical fields (n = 411,892) were removed. Finally, records with implausible driver ages (n = 137) were removed. The total removal rate was 41.2%.
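The record accounting above can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in the text; the dictionary keys and the function name `retention_summary` are illustrative labels, not names from the study's code.

```python
# Record accounting for the row-level cleaning steps, using counts from the text.
RAW_RECORDS = 1_000_500

# Rows removed at each step (key names are illustrative).
removed = {
    "not_a_traffic_accident": 448,
    "missing_critical_fields": 411_892,
    "implausible_driver_age": 137,
}


def retention_summary(raw: int, removed_by_step: dict[str, int]) -> tuple[int, float]:
    """Return (records kept, percentage of raw records removed)."""
    total_removed = sum(removed_by_step.values())
    kept = raw - total_removed
    removal_rate = 100.0 * total_removed / raw
    return kept, removal_rate


kept, removal_rate = retention_summary(RAW_RECORDS, removed)
print(f"kept={kept:,}  removal_rate={removal_rate:.1f}%")
```

The result matches the 588,023 retained records and the 41.2% removal rate stated above.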
Several variables were excluded due to extreme missingness and limited practical value. Weather condition, road condition, and accident cause had more than 98% missing values, while ‘nature of accident’ was both highly incomplete and largely redundant with the retained Accident Type variable. The final seven-item feature set is presented in Table 4. Accident Hour was extracted from raw time strings; Time Period was derived as a four-category ordinal variable; Driver Age was computed as Accident Year − Driver Birth Year; and categorical variables were label-encoded using scikit-learn [26]. All post-crash variables were excluded to prevent target leakage [13].
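A minimal sketch of the derived features described above, in plain Python. Several details are assumptions made for illustration only: the raw time format (`HH:MM` or `HH:MM:SS`), the bin edges of the four time periods, and the helper names. The `label_encode` function is a stand-in that mimics the sorted-class behaviour of scikit-learn's `LabelEncoder`; it is not the study's code.

```python
def accident_hour(raw_time: str) -> int:
    """Extract the hour (0-23) from a time string such as '18:45' (assumed format)."""
    hour = int(raw_time.strip().split(":")[0])
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {raw_time!r}")
    return hour


def time_period(hour: int) -> int:
    """Map an hour to a four-category ordinal code.

    The bin edges are assumptions: 0 = 00-05, 1 = 06-11, 2 = 12-17, 3 = 18-23.
    """
    return hour // 6


def driver_age(accident_year: int, birth_year: int) -> int:
    """Driver Age = Accident Year - Driver Birth Year, as stated in the text."""
    return accident_year - birth_year


def label_encode(values: list[str]) -> tuple[list[int], dict[str, int]]:
    """Encode categories as integers in sorted order (LabelEncoder-like stand-in)."""
    mapping = {cls: code for code, cls in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


codes, mapping = label_encode(["run-over", "collision", "run-over"])
print(accident_hour("18:45"), time_period(18), driver_age(2023, 1995), codes)
```

Integer codes produced this way carry an arbitrary order, which is one reason the study also compares label encoding against one-hot encoding.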
To assess potential selection bias, we performed chi-square tests comparing the distributions of seven variables before and after exclusion. The tests indicated statistically significant differences (p < 0.001), which is expected with samples of this size, but the practical magnitude of the differences was minimal (<3% for all categories). For example, driver age showed a negligible difference (0.07%), as did time_period (0.96%). The target variable (Severity) retained sufficient severe-case representation (n = 16,254, 2.76% of filtered data) to support 5-fold cross-validation.
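The contrast between statistical and practical significance can be illustrated with a small sketch. It computes a Pearson chi-square statistic for a 2 × k table and the largest shift in category share. The counts are hypothetical (only the row totals echo the dataset sizes), and the function names are illustrative; this follows the description in the text rather than reproducing the study's exact test.

```python
def chi_square_2xk(before: list[int], after: list[int]) -> float:
    """Pearson chi-square statistic for a 2 x k table of category counts."""
    col_totals = [b + a for b, a in zip(before, after)]
    n_before, n_after = sum(before), sum(after)
    n = n_before + n_after
    stat = 0.0
    for row, row_total in ((before, n_before), (after, n_after)):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    return stat


def max_share_shift(before: list[int], after: list[int]) -> float:
    """Largest absolute change in category share, in percentage points."""
    n_before, n_after = sum(before), sum(after)
    return max(
        abs(100 * b / n_before - 100 * a / n_after) for b, a in zip(before, after)
    )


# Hypothetical counts for a two-category variable (NOT taken from the dataset).
before = [600_000, 400_500]
after = [358_000, 230_023]

CRITICAL_DF1_P001 = 10.828  # chi-square critical value for df = 1 at p = 0.001

stat = chi_square_2xk(before, after)
shift = max_share_shift(before, after)
print(f"chi2={stat:.1f}  significant={stat > CRITICAL_DF1_P001}  shift={shift:.2f} pts")
```

With hundreds of thousands of records, a shift of under one percentage point is enough to exceed the critical value, which is why effect size is reported alongside the p-value.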
3.2. Train–Test Split, Temporal Validation, and Class Imbalance Mitigation
An 80/20 stratified split yielded 470,418 training and 117,605 test records. The training set was used for 5-fold stratified cross-validation during hyperparameter tuning. To assess model generalization, temporal validation was additionally performed by splitting the dataset by year: accidents from 2020 to 2024 (95.01%, n = 558,653) were used for model training and 5-fold stratified cross-validation, while accidents from 2025 (4.99%, n = 29,370) served as an independent temporal test set. The top 10 best-performing configurations identified through cross-validation were re-evaluated on the 2025 holdout set to detect potential temporal drift and verify model stability over time. Class-balancing techniques were applied exclusively to the training set after splitting, preventing synthetic sample leakage into the test set [14].
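The ordering described above (split first, then balance only the training part) can be sketched in plain Python. To keep the example dependency-free, simple random oversampling is used as a stand-in for SMOTE, ADASYN, and SMOTETomek; the toy labels and function names are assumptions for illustration.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), keeping class shares similar in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


def random_oversample(indices, labels, seed=42):
    """Duplicate minority-class indices until every class matches the largest one.

    Stand-in for SMOTE-style balancing: it resamples existing rows rather than
    synthesising new ones, but it is applied at the same point in the pipeline.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx in indices:
        by_class.setdefault(labels[idx], []).append(idx)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Toy labels with a 35:1 ratio (hypothetical data).
labels = ["Safe"] * 3500 + ["Severe"] * 100

train_idx, test_idx = stratified_split(labels)           # 1) split first
balanced_train_idx = random_oversample(train_idx, labels)  # 2) balance train only

train_counts = Counter(labels[i] for i in balanced_train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

The test set keeps the original class ratio, so reported metrics reflect real-world imbalance, while no resampled row can appear on both sides of the split.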
Table 5 summarizes the three balancing techniques evaluated. All classifiers were evaluated across three balancing strategies (SMOTE, ADASYN, and SMOTETomek) in multi-class, cascaded, and binary classifications.
To assess the impact of categorical encoding on model performance, two encoding strategies were evaluated: label encoding and one-hot encoding. Each encoding method was systematically combined with the three balancing techniques across all classifiers. The results of the two encoding strategies were broadly comparable, with a slight practical advantage for label encoding in the final selected binary configuration.
3.3. Classification Frameworks
Three classification frameworks were implemented to address the extreme class imbalance in the dataset.
The multi-class approach classified accidents into four severity levels (SIMPLE, LIGHT INJURY, HEAVY INJURY, and DEATH INJURY), and six classifiers were trained under each of the three balancing strategies, yielding 18 experimental configurations. This approach is marked by extreme imbalance (2295:1 between the most and least frequent classes). The classifiers and best parameters are summarized in Table 6.
The binary approach reformulated accidents into two categories, Safe (SIMPLE) vs. Severe (LIGHT/HEAVY/DEATH INJURY), reducing the imbalance ratio from 2295:1 to 35:1.
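As a small illustration of the reformulation, the sketch below maps the four severity labels to a binary target and recomputes the Safe-to-Severe ratio from the totals reported in Section 3.1 (588,023 records, 16,254 severe). The function names are illustrative.

```python
SEVERE_LABELS = {"LIGHT INJURY", "HEAVY INJURY", "DEATH INJURY"}


def to_binary(severity: str) -> int:
    """Map a four-level severity label to 0 (Safe) or 1 (Severe)."""
    if severity == "SIMPLE":
        return 0
    if severity in SEVERE_LABELS:
        return 1
    raise ValueError(f"unknown severity label: {severity!r}")


def imbalance_ratio(n_majority: int, n_minority: int) -> float:
    """Majority-to-minority class ratio."""
    return n_majority / n_minority


TOTAL_RECORDS = 588_023
SEVERE_RECORDS = 16_254

ratio = imbalance_ratio(TOTAL_RECORDS - SEVERE_RECORDS, SEVERE_RECORDS)
print(f"Safe:Severe is roughly {ratio:.0f}:1")
```

Pooling the three injury classes gives the minority class roughly 35 majority records per severe record, compared with 2295 per DEATH INJURY record in the four-class setting.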
We evaluated 36 configurations through a complete experimental matrix (two encoding methods, three balancing techniques, and six classifiers). Each configuration underwent systematic hyperparameter tuning using RandomizedSearchCV with 50 iterations and balanced accuracy scoring. Model selection prioritized performance on imbalanced-data metrics (PR-AUC, test generalization) over cross-validation scores, with statistical significance confirmed through Friedman tests (p < 0.002).
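The experimental matrix can be enumerated mechanically. In this sketch the classifier names are placeholders, since the six classifiers are listed in Table 6 rather than in this section; the dictionary keys are likewise illustrative.

```python
from itertools import product

ENCODINGS = ("label", "one_hot")
BALANCERS = ("SMOTE", "ADASYN", "SMOTETomek")
# Placeholder names: the actual six classifiers are given in Table 6.
CLASSIFIERS = tuple(f"classifier_{i}" for i in range(1, 7))

# Binary framework: 2 encodings x 3 balancers x 6 classifiers.
binary_configs = [
    {"encoding": enc, "balancer": bal, "classifier": clf}
    for enc, bal, clf in product(ENCODINGS, BALANCERS, CLASSIFIERS)
]

# Multi-class framework: 3 balancers x 6 classifiers.
multiclass_configs = list(product(BALANCERS, CLASSIFIERS))

print(len(binary_configs), len(multiclass_configs))
```

Each entry would then be tuned separately, for example with scikit-learn's `RandomizedSearchCV` using `n_iter=50` and `scoring="balanced_accuracy"`, in line with the settings stated above.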
The cascaded two-stage framework combined both approaches, and 18 configurations were evaluated. Stage 1 performs binary classification (Safe vs. Severe) at the reduced 35:1 ratio. In Stage 2, severe cases are classified into LIGHT/HEAVY/DEATH INJURY subcategories (imbalance 61:1 within the severe subset). SMOTE, ADASYN, and SMOTETomek were applied independently to each stage’s training partition.
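The inference logic of a two-stage cascade can be sketched as follows. The two stage models here are hypothetical threshold rules standing in for trained classifiers, and the feature layout is invented for the example; only the control flow reflects the framework described above.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def cascade_predict(
    x: Features,
    stage1: Callable[[Features], int],
    stage2: Callable[[Features], str],
) -> str:
    """Stage 1 decides Safe (0) vs. Severe (1); only Severe cases reach Stage 2."""
    if stage1(x) == 0:
        return "SIMPLE"
    return stage2(x)


# Hypothetical stand-ins for the trained stage models.
def toy_stage1(x: Features) -> int:
    return 1 if x[0] >= 0.5 else 0


def toy_stage2(x: Features) -> str:
    if x[1] >= 0.9:
        return "DEATH INJURY"
    if x[1] >= 0.6:
        return "HEAVY INJURY"
    return "LIGHT INJURY"


for sample in ([0.1, 0.95], [0.7, 0.95], [0.7, 0.2]):
    print(sample, "->", cascade_predict(sample, toy_stage1, toy_stage2))
```

One consequence of this design is that a severe case missed by Stage 1 can never be recovered by Stage 2, so Stage 1 sensitivity bounds the performance of the whole cascade.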
3.4. Evaluation Metrics
Model performance was evaluated using metrics suited to imbalanced classification. For binary classification under extreme imbalance, PR-AUC was designated as the primary selection criterion, supported by balanced accuracy (BA), AUC-ROC, sensitivity, specificity, Cohen’s Kappa, F1-Macro, average precision, and Brier Score. During cross-validation, BA was used for tuning, while final model selection prioritized PR-AUC and test generalization. For multi-class cascaded framework selection, Cohen’s Kappa was prioritized due to its sensitivity to minority class performance.
PR-AUC is calculated as the area under the precision–recall curve (Equation (1)), capturing the model’s performance across all recall levels, and is especially informative for imbalanced datasets.
$$\text{PR-AUC} = \int_{0}^{1} P(R)\, dR \qquad (1)$$
where $P$ is the precision, and $R$ is the recall.
Balanced accuracy is calculated as the macro-average of recall per class (Equation (2)), making it unaffected by the extreme imbalance between majority and minority classes.
$$\text{BA} = \frac{1}{K} \sum_{i=1}^{K} \text{Recall}_i \qquad (2)$$
where $K$ is the number of classes.
AUC-ROC (area under the receiver operating characteristic curve) measures the model’s discriminative ability across all classification thresholds (Equation (3)):
$$\text{AUC-ROC} = \int_{0}^{1} TPR \; d(FPR) \qquad (3)$$
where $TPR$ is the true-positive rate, and $FPR$ is the false-positive rate. This metric is robust to class imbalance and provides a threshold-independent assessment of model performance.
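Both area metrics can be computed directly from ranked scores. The sketch below is a dependency-free illustration under two assumptions: scores are distinct, and the precision–recall area is approximated step-wise as average precision (the sum of precision values at each newly recalled positive, divided by the number of positives). The toy labels and scores are hypothetical.

```python
def roc_auc(y_true: list[int], scores: list[float]) -> float:
    """AUC-ROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: list[int], scores: list[float]) -> float:
    """Step-wise approximation of the area under the precision-recall curve."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    true_positives = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos


# Hypothetical toy example: 1 = Severe, 0 = Safe.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
print(f"AUC-ROC={auc:.4f}  AP={ap:.4f}")
```

Unlike AUC-ROC, the precision-based area depends on the share of positives, so its chance-level baseline equals the prevalence of the Severe class rather than 0.5.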
Cohen’s Kappa measures agreement between predicted and actual classifications beyond chance (Equation (4)):
$$\kappa = \frac{p_o - p_e}{1 - p_e} \qquad (4)$$
where $p_o$ is the observed agreement (accuracy), and $p_e$ is the expected agreement based on class distributions. Kappa is particularly informative for imbalanced datasets as it accounts for correct classifications that would occur by chance.
F1-Macro provides a balanced measure of precision and recall across all classes (Equation (5)):
$$\text{F1-Macro} = \frac{1}{K} \sum_{i=1}^{K} F1_i \qquad (5)$$
where $F1_i$ is the harmonic mean of precision and recall for class $i$, and $K$ is the number of classes. This metric equally weights all classes, preventing majority class dominance in performance assessment.
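The class-averaged metrics above can all be derived from a confusion matrix. The sketch below assumes rows are actual classes and columns are predicted classes; the 3 × 3 matrix is a hypothetical toy example, not a result from the study.

```python
def _marginals(cm: list[list[int]]):
    """Return (k, total, row sums = actual counts, column sums = predicted counts)."""
    k = len(cm)
    row_sums = [sum(cm[i]) for i in range(k)]
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    return k, sum(row_sums), row_sums, col_sums


def balanced_accuracy(cm: list[list[int]]) -> float:
    """Mean of per-class recall."""
    k, _, row_sums, _ = _marginals(cm)
    return sum(cm[i][i] / row_sums[i] for i in range(k)) / k


def cohen_kappa(cm: list[list[int]]) -> float:
    """(p_o - p_e) / (1 - p_e) with p_e from the row and column marginals."""
    k, total, row_sums, col_sums = _marginals(cm)
    p_o = sum(cm[i][i] for i in range(k)) / total
    p_e = sum(row_sums[i] * col_sums[i] for i in range(k)) / total**2
    return (p_o - p_e) / (1 - p_e)


def f1_macro(cm: list[list[int]]) -> float:
    """Unweighted mean of per-class F1 scores."""
    k, _, row_sums, col_sums = _marginals(cm)
    f1_scores = []
    for i in range(k):
        tp = cm[i][i]
        denom = row_sums[i] + col_sums[i]  # equals 2*TP + FN + FP
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / k


# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
cm = [
    [80, 15, 5],
    [10, 8, 2],
    [2, 2, 1],
]
accuracy = sum(cm[i][i] for i in range(3)) / 125
print(f"acc={accuracy:.3f}  BA={balanced_accuracy(cm):.3f}  "
      f"kappa={cohen_kappa(cm):.3f}  F1-macro={f1_macro(cm):.3f}")
```

In this toy matrix plain accuracy is 0.712 because the majority class dominates, while the class-averaged metrics are much lower, which is the behaviour that motivates their use under extreme imbalance.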
For the binary classification task (Safe vs. Severe), two additional metrics were employed:
Sensitivity (also referred to as ‘recall’ or ‘true-positive rate’) measures the proportion of severe accidents correctly identified (Equation (6)):
$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (6)$$
where $TP$ represents true positives (correctly predicted severe accidents) and $FN$ represents false negatives (severe accidents misclassified as safe).
Specificity (true-negative rate) measures the proportion of safe accidents correctly classified (Equation (7)):
$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (7)$$
where $TN$ represents true negatives (correctly predicted safe accidents), and $FP$ represents false positives (safe accidents misclassified as severe).
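For the binary task, the two rates follow directly from the four confusion counts, and balanced accuracy reduces to their mean. The counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of severe accidents correctly flagged: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of safe accidents correctly cleared: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical confusion counts for a Safe vs. Severe classifier.
tp, fn = 60, 40      # severe accidents: caught vs. missed
tn, fp = 2800, 700   # safe accidents: cleared vs. falsely flagged

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
binary_ba = (sens + spec) / 2  # balanced accuracy with two classes
print(f"sensitivity={sens:.2f}  specificity={spec:.2f}  BA={binary_ba:.2f}")
```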
3.5. SHAP Interpretability Analysis
SHAP values were computed for the binary LightGBM classifier using TreeExplainer [28,29] on a stratified random sample of 10,000 test instances, consistent with prior studies [20,21,22]. Global feature importance was assessed using mean absolute SHAP values.
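The aggregation step (from per-instance SHAP values to a global ranking) is simple enough to sketch without the SHAP library. The matrix below is hypothetical; in practice it would be the output of the explainer, with one row per sampled test instance and one column per feature. Feature names are written in an illustrative snake_case form.

```python
def mean_abs_shap(
    shap_values: list[list[float]], feature_names: list[str]
) -> list[tuple[str, float]]:
    """Rank features by mean absolute SHAP value across instances."""
    n_rows = len(shap_values)
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP matrix: 3 instances x 3 features.
feature_names = ["time_period", "driver_age", "accident_type"]
shap_values = [
    [0.4, -0.2, 0.1],
    [-0.6, 0.3, -0.1],
    [0.5, -0.1, 0.0],
]

ranking = mean_abs_shap(shap_values, feature_names)
for name, value in ranking:
    print(f"{name:15s} {value:.3f}")
```

Taking absolute values before averaging matters: a feature that pushes some predictions up and others down would otherwise appear unimportant because its signed contributions cancel.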
5. Discussion
Our results demonstrate that machine learning can predict accident severity when class imbalance is appropriately addressed and models remain interpretable. This approach provides a practical tool for Qatar’s road safety operations and supports both QNV 2030 goals [5] and the National Road Safety Strategy [6].
5.1. Interpretation of Classification Performance
The multi-class BA of 44.12% is consistent with benchmarks for severely imbalanced four-class severity datasets [2,15,16]. The cascaded framework’s higher Kappa values suggest better agreement beyond chance and improved discrimination under class imbalance.
5.2. Policy Implications of SHAP Findings
SHAP improved the interpretability of the model, but its results should be interpreted with some caution. The importance values may be affected by how categorical variables are encoded, and they describe associations learned by the fitted model rather than causal effects. For this reason, SHAP is best viewed as a useful interpretive tool that can complement, rather than replace, other approaches such as cost-sensitive learning and spatial modeling.
5.2.1. Temporal Factors
Time Period has the highest SHAP contribution, followed by Driver Age, indicating that temporal patterns and demographic factors are key predictors [20,22,30,31]. Empirical analysis reveals two complementary intervention priorities: midnight-concentrated enforcement (Hour 0: 7.1% severity, low volume) and sustained evening risk management (18–23: 4.25% average, high volume). These findings show that model-derived feature importance, while valuable for identifying risk factors, must be validated against actual severity distributions before resources are allocated. Model emphasis on isolated peaks (Hour 0) should be balanced with empirical evidence of sustained high-risk periods (evening hours) to ensure comprehensive temporal coverage and effective policy impact.
5.2.2. Driver Age and Accident Type
Younger drivers were estimated to be at increased risk of more severe accidents [10], in line with global evidence on novice driver risk. Accident Type exhibited high positive SHAP values for run-over and fixed-object collision types [30,31], which supports investment in pedestrian safety infrastructure.
5.2.3. Road Type and Accident Year
External roads demonstrated higher severity rates (9.73%) compared to internal roads (2.51%). This pattern reflects the higher-speed nature of highways and main arterials, where crash energy and injury severity are inherently elevated.
These findings support the prioritization of infrastructure improvements and enforcement on major external corridors alongside internal road networks. The downward trend in recent accident years may indicate improvements in road safety infrastructure during the period 2020–2025.
5.2.4. Integration with Traffic Violation Systems
This framework could complement existing violation detection systems by helping to rank higher-risk cases. Severity predictions could support patrol deployment by highlighting high-severity periods, such as midnight and the evening, and severity information could be combined with violation heat maps in Command Center dashboards to improve resource allocation. Because calibration was limited, the model is more suitable for risk prioritization than for direct probability interpretation.
5.2.5. Comparison with Related Work
Overall, while previous studies have reported the strong performance of advanced machine learning models in accident severity prediction, our results extend this work by showing that reliability and robustness are better assessed through imbalance-aware metrics, systematic comparison across preprocessing strategies, and temporal validation on unseen future data. In particular, the binary LightGBM framework combines strong predictive performance with interpretability and temporal generalization, providing a more practically robust solution under extreme class imbalance than many earlier approaches.
Unlike prior Qatar-focused descriptive studies in which the authors relied on traditional statistical analysis [7,12], in this work, we deliver the first individual-level predictive framework with SHAP-based policy interpretability.
The present study’s multi-class BA of 44.12% is comparable to or exceeds results from similar studies [2,3,4,8,16]. The binary BA of 71.04% and AUC of 0.7722 represent competitive performance. The dataset size (588,023 records) substantially exceeds that in most comparable GCC studies, providing greater statistical power for minority class learning.
The present study’s binary classification performance (AUC = 0.7722, F1-Macro = 0.5186) aligns with international benchmarks for severely imbalanced severity prediction, exceeding Dong et al.’s [22] LightGBM result (AUC = 0.71) on Pakistani highway data while remaining conservative compared to studies reporting unusually high metrics on small datasets without validation details [3,4]. Three methodological advantages distinguish this work: (1) dataset scale (588,023 records) exceeding comparable GCC studies, (2) rigorous validation with SMOTE applied exclusively to training data and stratified train–test splitting, and (3) explicit target leakage prevention, ensuring operational validity. These standards are rarely enforced in the regional literature.
5.3. Key Findings and Practical Recommendations
The binary formulation outperformed the multi-class form by 61% in balanced accuracy (71.04% vs. 44.12%), confirming that problem reformulation is critical under extreme imbalance. SMOTE and ADASYN proved to be the most effective balancing strategies, while LightGBM provided the best trade-off between speed and performance, training in less than 15 s compared to over 700 s for Gradient Boosting. The cascaded approach improved minority class detection, with Cohen’s Kappa increasing by 70.7% (0.1217 vs. 0.0713), demonstrating better sensitivity to severe accidents. Temporal factors emerged as the dominant predictors, with a combined SHAP contribution of 0.692, revealing two distinct risk patterns: a midnight peak (Hour 0) with 7.1% severity affecting lower traffic volumes and sustained evening elevation (18–23), averaging 4.25% severity across higher traffic volumes. The model exhibited strong temporal generalization, with a minimal training–validation gap confirming no significant overfitting.
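The two relative improvements quoted above follow from the reported scores, as this short check shows. The helper name `relative_gain` is illustrative.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Balanced accuracy: binary vs. multi-class (values from the text).
ba_gain = relative_gain(71.04, 44.12)

# Cohen's Kappa: cascaded vs. baseline (values from the text).
kappa_gain = relative_gain(0.1217, 0.0713)

print(f"BA gain: {ba_gain:.1f}%   Kappa gain: {kappa_gain:.1f}%")
```

Both figures are relative changes; in absolute terms the balanced accuracy gain is 26.92 percentage points and the Kappa gain is 0.0504.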
These findings translate into several actionable recommendations. The midnight and evening risk periods support targeted enforcement strategies, with midnight requiring concentrated intervention for high-risk behaviors such as reckless driving and evening hours calling for comprehensive traffic management given their sustained risk and higher traffic exposure. Younger drivers exhibit elevated risk levels, supporting the implementation of graduated licensing systems and targeted training programs, particularly for high-risk nighttime driving conditions. Pedestrian-related accidents are associated with higher severity, underscoring the need for improved crossing infrastructure, enhanced traffic signal systems, and increased driver awareness in urban environments. The high computational efficiency of LightGBM makes it suitable for real-time deployment by traffic safety practitioners to support proactive risk monitoring and intervention. Finally, the elevated severity rates on external roads (9.73%) compared to internal roads (2.51%) highlight the importance of speed management, defensive driving practices, and targeted safety interventions in high-speed corridors.
6. Conclusions
This study examined machine learning frameworks for the prediction of traffic accident severity using Qatar’s national crash dataset under extreme class imbalance. The binary LightGBM framework achieved the most practical combination of performance (BA = 71.04%), interpretability, and computational efficiency, demonstrating that problem reformulation yields greater gains than algorithmic selection alone. Temporal factors emerged as the dominant predictors, with midnight and evening hours presenting distinct risk patterns that support differentiated enforcement strategies. The findings translate into actionable recommendations for targeted temporal enforcement, pedestrian safety infrastructure, graduated licensing for young drivers, and speed management on external roads.
Several limitations should be acknowledged. The binary formulation simplified the original four-level severity structure, temporal validation was based on a single future year, and the model showed limited probability calibration, making it more suitable for risk ranking than direct probability interpretation. Four high-value features were excluded due to extreme missingness, and geographic variables were unavailable. Furthermore, the study period overlaps with the COVID-19 pandemic (2020–2021), which may have influenced traffic patterns, and minority classes remain severely underrepresented. All experiments were conducted using Qatar-specific data, and generalizability to other GCC countries remains unvalidated. Future research should integrate environmental and sensor data, explore cost-sensitive and spatial modeling approaches, and develop forecasting tools for accident rates and types to support proactive road safety management.
While the current study focuses on predicting severity for individual accidents, understanding future accident rates and types remains a critical direction for proactive road safety management. Our results show an encouraging downward trend: accident severity decreased in 2024–2025 compared to earlier years, which may reflect the positive impact of Qatar’s road safety efforts. Building on this, future work will develop forecasting tools to help authorities anticipate accident patterns and plan preventive measures.