Prediction of Large-Scale Traffic Accident Severity in Qatar: A Binary Reformulation Approach for Extreme Class Imbalance with Interpretable AI

Alshriem, Mohammed; Yang, Yin

doi:10.3390/futuretransp6020088


by Mohammed Alshriem * and Yin Yang

College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar

* Author to whom correspondence should be addressed.

Future Transp. 2026, 6(2), 88; https://doi.org/10.3390/futuretransp6020088

Submission received: 1 March 2026 / Revised: 9 April 2026 / Accepted: 10 April 2026 / Published: 15 April 2026


Abstract

Road traffic injuries represent one of the most critical public health challenges in the Gulf region. Predicting traffic accident severity is therefore a critical component of evidence-based road safety management. In this study, we develop machine learning frameworks for predicting traffic accident severity using Qatar’s national dataset (2020–2025), addressing extreme class imbalance and interpretability. A dataset of 588,023 accident records was systematically preprocessed from 1,000,500 raw reports. We compare three approaches: multi-class (four severity levels), binary (Safe vs. Severe), and cascaded two-stage (combining both). Six classifiers were evaluated across two encoding methods and three balancing strategies. Systematic hyperparameter tuning with 5-fold stratified cross-validation was performed for all models. The binary LightGBM classifier achieved a balanced accuracy (BA) of 71.04%, AUC-ROC = 0.772, Sensitivity = 61.03%, and Specificity = 81.05%, demonstrating superior performance over multi-class approaches. Temporal validation on 2025 data (trained on 2020–2024 data) supported good temporal generalization. SHAP analysis of 10,000 test instances identified the time period as the dominant predictor of accident severity. The binary LightGBM framework provides an interpretable and effective approach for severe accident identification and risk prioritization, with SHAP findings supporting targeted temporal enforcement and pedestrian safety as evidence-based policy priorities.

Keywords:

traffic accident severity; machine learning; class imbalance; SMOTE; ADASYN; LightGBM; cascaded classification; SHAP; Qatar; road safety

1. Introduction

Road traffic injuries are one of the leading causes of preventable death and disability worldwide. The World Health Organization reports that road traffic accidents claim 1.19 million lives annually, making them a leading cause of death among children and young people aged 5–29 [1]. Traffic accidents also impose a substantial economic burden, with costs estimated at 3% of GDP in most countries. The international community has responded through the United Nations Decade of Action for Road Safety 2021–2030, which aims to reduce road traffic-related deaths by 50% by the year 2030.

Traffic violations, such as speeding and running red lights, are indicators of serious crashes. Enforcement today remains largely reactive: most violations are detected by cameras or patrols after they occur, with limited ability to predict when or where they might result in serious injuries.

Predictive modeling can aid in making our roads safer. Rather than simply issuing tickets after the fact, this technique helps identify patterns—where and when the most serious crashes are likely to occur—enabling intervention before a violation results in more tragic consequences. This approach shifts the focus from punishment to prevention and from reacting to protecting.

Machine learning has recently emerged as an innovative method in road safety research, with the ability to handle non-linear and complex patterns within large datasets [2]. In the literature, the results of several recent works demonstrate that ensemble models such as Random Forest, Gradient Boosting, XGBoost, and LightGBM outperform traditional algorithms in accident severity classification [3,4].

Qatar National Vision 2030 (QNV 2030) [5] sets out the country’s development agenda and includes an explicit commitment to road safety. Qatar issued a National Road Safety Strategy [6] with revised targets that are legally binding, including reducing traffic deaths by 50%. Road traffic deaths decreased from 235 fatalities in 2013 to 178 in 2016, with Qatar maintaining the top rank among all Arab states for road safety [1]. Despite these developments, more than 1 million accidents were recorded in the 2020–2025 period, highlighting the ongoing need for data-driven techniques.

This study is the first large-scale machine learning analysis of traffic accident severity using Qatar’s national crash database (1,000,500 records), addressing a critical gap in GCC road safety research. Although several machine learning and statistical studies have been conducted in the region [3,4,7,8,9,10,11,12], the existing literature has three main methodological limitations: (1) Qatar-specific gap: Despite growing global concern over traffic safety, research specifically investigating traffic crash prediction in Qatar is limited. GCC studies exist, but they remain predominantly Saudi-based [3,4,8] and rely on aggregated or small-scale datasets. No published ML study has applied large-scale severity modeling to Qatar’s complete national accident database. This gap was identified by Timmermans et al. [7], who noted that their dataset made it impossible to provide statistical correlations or predictions and called for in-depth investigation and forecasting analyses. (2) Target leakage: Some studies include target variables as predictors [13], which inflates performance metrics and renders models operationally impractical for real-world prevention. (3) Handling of extreme imbalance: Class imbalance is addressed in various ways [14], and some studies apply oversampling before data splitting, which leaks synthetic samples into the test set and compromises validity.

In order to fill the aforementioned gaps, we aim to examine whether it is feasible for machine learning models developed based only on pre-crash and on-crash data to predict crash severity in Qatar, with acceptable and repeatable performance. We further examine which classification strategy—multi-class, binary, or cascade—provides the highest balanced accuracy under real-world imbalance conditions and identify the factors most significantly influencing severity in Qatar through SHAP analysis. Lastly, we assess how the proposed framework compares to existing GCC studies in terms of methodological rigor and predictive performance. Through this work, we make four key contributions:

We present the first large-scale machine learning study using Qatar’s complete accident database (588,023 records, 2020–2025), addressing a critical research gap in GCC countries where existing ML studies remain limited.
We provide a systematic comparison of three classification frameworks across 36 configurations under extreme class imbalance, with explicit prevention of target leakage through formal exclusion of post-crash variables.
Binary reformulation achieves a 61% relative improvement in balanced accuracy over the best multi-class configuration (71.04% vs. 44.12%), demonstrating that problem reformulation yields greater gains than algorithmic selection, a finding applicable beyond Qatar to global traffic safety datasets.
SHAP-based interpretability translates predictions into actionable policy recommendations for Qatar’s road safety strategy.

The remainder of this paper is organized as follows: In Section 2, we review related work; in Section 3, we describe the dataset and methodology; in Section 4, we present the results; in Section 5, we discuss the findings and policy implications; lastly, in Section 6, we present our conclusions.

2. Related Work

2.1. Machine Learning for Accident Severity Prediction

The use of ML in predicting traffic accident severity has increased significantly in recent years because of its ability to model complex nonlinear relationships and its improved prediction accuracy [3,15]. Notably, Mostafa et al. [2] introduced a comprehensive AI-driven framework for crash severity prediction that tested SMOTE, ADASYN, and Borderline SMOTE with ensemble classifiers, including Extra Trees, on more than 2.26 million records, yielding an F1-macro of 95.28%. Similarly, Chen et al. [16] developed an MSCPO-XGBoost hybrid model with metaheuristic hyperparameter optimization and SMOTE balancing on the Chinese national accident dataset, aiming to improve on the XGBoost baseline. Furthermore, Pei et al. [17] presented an interpretable deep learning model (AISTGCN) that uses XGBoost and SHAP for feature selection on UK accident data, achieving an accuracy of 87.72%.

In their systematic review of 191 ML studies (2019–2024) on traffic accidents, Behboudi et al. [15] highlighted three key methodological challenges: class imbalance, spatiotemporal dependencies, and the interpretability of the models. In their evaluation, they discovered that Gradient Boosting and LightGBM consistently achieved better performance in comparison with other classifiers in various geographies and scales of data. Although a small number of studies have been conducted on predicting accidents using neural networks on small datasets [18], the generalizability of the neural network-based method has not been extensively studied because of the lack of data and the potential for overfitting.

2.2. LightGBM and SHAP in Traffic Safety

LightGBM [19] has become one of the most popular classifiers in recent traffic safety research because of its computational efficiency and competitive predictive performance. Yan et al. [20] used LightGBM combined with SHAP to analyze accident severity in hotspot regions of Changsha City. They identified accident type, visibility, and duration as key factors for estimating severity. Yang et al. [21] utilized LightGBM and SHAP to analyze driver injury severity in highway-rail grade crossings, demonstrating that road surface conditions are important in mediating temporal and behavioral risk factors.

Furthermore, Dong et al. [22] evaluated four boosting-based ensemble models (LightGBM, CatBoost, AdaBoost, and NGBoost) to predict road traffic injury severity on the N-5 highway in Pakistan. They reported that LightGBM achieved the best AUC (0.71). SHAP analysis showed that driver age, accident cause, and collision type were the most important predictors of road traffic injury. These findings are consistent with our Qatar-specific results, providing cross-national validation for the utility of the LightGBM + SHAP framework in road safety research.

2.3. Class Imbalance in Traffic Accident Datasets

Class imbalance remains the most cited methodological challenge in traffic accident severity prediction [2]. SMOTE [14] is the most widely applied balancing technique, generating synthetic minority instances by interpolating between nearest neighbors. ADASYN [2] extends SMOTE by adaptively concentrating synthetic generation near decision boundaries, whereas SMOTETomek combines oversampling with boundary cleaning.

Despite the proliferation of balancing studies, few works provide direct, systematic comparisons of multiple techniques across multiple classifiers on the same dataset. We address this gap in the present study by evaluating SMOTE, ADASYN, and SMOTETomek across six classifiers in multi-class, binary, and cascaded configurations, providing a comprehensive balancing comparison not previously conducted in the GCC traffic safety literature.

2.4. GCC and Qatar-Specific Studies

Traffic safety research in the GCC region has experienced rapid growth, though it remains concentrated in Saudi Arabia [3,4,5,7,8,9,10,11,12]. Aldhari et al. [3] applied machine learning to Saudi highway crash data, comparing LR, RF, and XGBoost with feature importance analysis. Alanazi and Okail [4] developed a data-driven crash severity prediction framework for Jeddah using ensemble methods. Alsoud et al. [8] applied Random Forest to predict pedestrian crash density in Riyadh school zones (R² = 0.88), focusing on spatial risk mapping rather than individual severity prediction. Qatar-specific studies have been primarily descriptive [7,11,12], spatiotemporal [9], or behavioral survey-based [10], analyzing accident patterns and driver tendencies without predictive ML modeling. This gap is addressed in the present study.

Analyzing fatal crashes in Abu Dhabi (2012–2017), Awadalla and Albuquerque [23] used multivariate logistic regression to identify contributing factors and recommend six safety measures, including enhanced weekend enforcement, preventing red-light running, pedestrian-friendly design, improving high-speed road standards, educational campaigns targeting national male drivers, and stricter penalties (jail terms or license suspension) for reckless driving.

Al-Eideh [24] used time series models to predict annual accident counts and economic impacts in Kuwait, focusing on the general trend rather than on individual accident severity prediction.

AlKheder et al. [25] benchmarked four classifiers on 5740 Abu Dhabi accident records, with Random Forest achieving the highest accuracy. In their study, however, they did not consider class imbalance, nor employ SHAP-based interpretability. In our work, we extend this study through a systematic comparison of 36 configurations on 588,023 instances with explicit imbalance mitigation and SHAP analysis.

In a recent medical-based study [11], researchers reported the clinical impact of desert off-road accidents: a 1.5% mortality rate, with a predominance of orthopedic injuries. However, despite the fact that these studies provide useful insights regarding post-crash outcomes, they do not consider pre-crash and on-crash severity prediction—the research focus of this work.

Earlier Qatar-based studies are limited by data availability. Timmermans et al. [7] examined RTC aggregated data (2010–2016), identifying temporal patterns while explicitly stating that their dataset did not include individual-level crash characteristics, and “statistical correlations or predictions were not possible”. They recommended that the authors of future studies incorporate “detailed crash pattern analysis” and “sophisticated forecasting analyses.” The current study directly fills this gap with individual-level predictive modeling of 588,023 records.

While machine learning has been widely applied in North American and European safety analysis, studies in the Middle East and GCC context remain rare. Qatar has one of the highest vehicle ownership rates globally, yet large-scale predictive ML modeling of its accident data is lacking. Collectively, the studies reviewed above highlight this absence, reinforcing the contribution of the present work as the first comprehensive machine learning analysis of Qatar’s national accident database.

Table 1 summarizes the key studies discussed in Section 2.1, Section 2.2, Section 2.3 and Section 2.4, highlighting their methodologies, datasets, main findings, and the research gaps that motivate the proposed approach.

3. Materials and Methods

Our research framework (Figure 1) was developed in Python 3.13, using several libraries, including scikit-learn [26], imbalanced-learn, XGBoost [27], and LightGBM [19]. All experiments were performed on a Windows 11 64-bit system (32 GB RAM, Intel Core i7-8850H). The methodology involved a series of interconnected steps: data preprocessing, feature engineering, addressing class imbalance, and training models across multi-class, binary, and cascaded frameworks. We then assessed performance using balanced metrics and incorporated SHAP for model interpretability.

3.1. Dataset Description and Preprocessing

The dataset, covering 2020–2025, was obtained from the Ministry of Interior of the State of Qatar—General Directorate of Traffic, comprising 1,000,500 raw records. The class distribution of the final 588,023-record dataset is presented in Table 2.

The imbalance ratio of 2295:1 between SIMPLE and DEATH INJURY classes places this dataset among the most severely imbalanced in the traffic accident severity prediction literature [2,14].

Table 3 summarizes the four-stage preprocessing pipeline and the impact of each cleaning step on record retention. The process began with 1,000,500 raw accident records. First, four columns with missing rates >75% were removed. The post-crash variable Death Count was explicitly excluded to prevent target leakage, which is a common methodological error in traffic safety ML studies. Next, records classified as ‘NOT A TRAFFIC ACCIDENT’ (n = 448) and those with missing critical fields (n = 411,892) were removed. Finally, records with implausible driver ages (n = 137) were removed. The total removal rate was 41.2%.
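The record counts in this paragraph can be cross-checked with simple arithmetic. The sketch below uses only the figures reported above; the dictionary keys are illustrative labels, not column names from the source data.

```python
# Record counts as reported in the preprocessing description.
raw_records = 1_000_500
removed = {
    "not_a_traffic_accident": 448,
    "missing_critical_fields": 411_892,
    "implausible_driver_age": 137,
}

total_removed = sum(removed.values())        # 412,477
final_records = raw_records - total_removed  # 588,023
removal_rate = total_removed / raw_records   # about 0.412

print(f"removed:      {total_removed:,}")
print(f"retained:     {final_records:,}")
print(f"removal rate: {removal_rate:.1%}")
```

The three row-level removals account exactly for the gap between the raw and final datasets, so the column-level removals do not change the record count.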

Several variables were excluded due to extreme missingness and limited practical value. Weather condition, road condition, and accident cause had more than 98% missing values, while ‘nature of accident’ was both highly incomplete and largely redundant with the retained Accident Type variable. The final seven-item feature set is presented in Table 4. Accident Hour was extracted from raw time strings; Time Period was derived as a four-category ordinal variable; Driver Age was computed as Accident Year − Driver Birth Year; and categorical variables were label-encoded using scikit-learn [26]. All post-crash variables were excluded to prevent target leakage [13].
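A minimal sketch of the derived features described above, assuming raw times arrive as 'HH:MM' or 'HH:MM:SS' strings and that the four periods are equal six-hour bins. The morning (6–11) and evening (18–23) bounds match those reported in Section 4.5; the night and afternoon bounds, the integer codes, and all function names are assumptions made for illustration. The `label_encode` helper mimics the sorted-class ordering of scikit-learn's `LabelEncoder` without depending on it.

```python
def accident_hour(time_string: str) -> int:
    """Extract the hour from a raw time string ('HH:MM' or 'HH:MM:SS' assumed)."""
    hour = int(time_string.strip().split(":")[0])
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {time_string!r}")
    return hour


def time_period(hour: int) -> int:
    """Map an hour to an ordinal period code (assumed coding).

    0 = night (0-5), 1 = morning (6-11), 2 = afternoon (12-17), 3 = evening (18-23)
    """
    return hour // 6


def driver_age(accident_year: int, birth_year: int) -> int:
    """Driver Age = Accident Year - Driver Birth Year."""
    return accident_year - birth_year


def label_encode(values):
    """Assign integer codes to the sorted unique categories."""
    classes = sorted(set(values))
    mapping = {category: code for code, category in enumerate(classes)}
    return [mapping[v] for v in values], mapping


hour = accident_hour("18:45")
print(hour, time_period(hour), driver_age(2024, 1990))
```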

To assess potential selection bias, we performed chi-square tests comparing distributions across seven variables before and after exclusion. While statistical tests indicated significant differences (p < 0.001) due to the large sample size, the practical magnitude of the differences was minimal (<3% for all categories). For example, driver age showed a negligible difference (0.07%), as did time_period (0.96%). The target variable (Severity) maintained essential characteristics with sufficient severe case representation (n = 16,254, 2.76% of filtered data), ensuring robust machine learning with 5-fold cross-validation.

3.2. Train–Test Split, Temporal Validation, and Class Imbalance Mitigation

An 80/20 stratified split yielded 470,418 training and 117,605 test records. The training set was used for 5-fold stratified cross-validation during hyperparameter tuning. To assess model generalization, temporal validation was additionally performed by splitting the dataset: accidents from 2020 to 2024 (95.01%, n = 558,653) were used for model training and 5-fold stratified cross-validation, while accidents from 2025 (4.99%, n = 29,370) served as an independent temporal test set. The top 10 best-performing configurations identified through cross-validation were re-evaluated on the 2025 holdout set to detect potential temporal drift and verify model stability over time. Class-balancing techniques were applied exclusively to the training set after splitting, preventing synthetic sample leakage into the test set [14].
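The ordering of operations matters more than the specific balancer. The following sketch, written in plain Python on toy data, splits first and then oversamples only the training partition. `smote_like` is a simplified SMOTE-style interpolation written for illustration, not the imbalanced-learn implementation used in the study, and the two numeric features are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def temporal_split(years, holdout_year=2025):
    """Train on earlier years, hold out the final year."""
    train = [i for i, yr in enumerate(years) if yr < holdout_year]
    test = [i for i, yr in enumerate(years) if yr == holdout_year]
    return train, test


def smote_like(minority_rows, n_new, k=5, seed=42):
    """Interpolate between a minority row and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        neighbours = sorted(
            (row for row in minority_rows if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Toy data: 970 safe (0) and 30 severe (1) records.
data_rng = random.Random(0)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(970)]
X += [[data_rng.gauss(2, 1), data_rng.gauss(2, 1)] for _ in range(30)]
y = [0] * 970 + [1] * 30

train_idx, test_idx = stratified_split(y)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]

# Balancing touches the training partition only; the test rows stay untouched.
minority = [row for row, label in zip(X_train, y_train) if label == 1]
n_new = y_train.count(0) - y_train.count(1)
X_bal = X_train + smote_like(minority, n_new)
y_bal = y_train + [1] * n_new
```

Because the synthetic rows are generated from training rows alone, the test partition keeps its original class distribution and contains no interpolated records.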

Table 5 summarizes the three balancing techniques evaluated. All classifiers were evaluated across three balancing strategies (SMOTE, ADASYN, and SMOTETomek) in multi-class, cascaded, and binary classifications.

To assess the impact of categorical encoding on model performance, two encoding strategies were evaluated: label encoding and one-hot encoding. Each encoding method was systematically combined with the three balancing techniques across all classifiers. The results of the two encoding strategies were broadly comparable, with a slight practical advantage for label encoding in the final selected binary configuration.

3.3. Classification Frameworks

Three classification frameworks were implemented to address the extreme class imbalance in the dataset.

The multi-class approach classified accidents into four severity levels (SIMPLE, LIGHT INJURY, HEAVY INJURY, and DEATH INJURY), and six classifiers were trained under each of the three balancing strategies, yielding 18 experimental configurations. This approach is marked by extreme imbalance (2295:1 between the most and least frequent classes). The classifiers and best parameters are summarized in Table 6.

In the binary approach, accidents were reformulated into two categories, Safe (SIMPLE) vs. Severe (LIGHT/HEAVY/DEATH INJURY), reducing the imbalance ratio from 2295:1 to 35:1.
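As an illustration, the binary target can be derived with a simple mapping, and the 35:1 ratio follows from the severe-case count reported in Section 3.1 (n = 16,254 of 588,023). The function name and the 0/1 coding are assumptions for this sketch.

```python
SEVERE_LEVELS = {"LIGHT INJURY", "HEAVY INJURY", "DEATH INJURY"}


def to_binary(severity: str) -> int:
    """Return 0 for Safe (SIMPLE) and 1 for Severe (any injury or death level)."""
    if severity == "SIMPLE":
        return 0
    if severity in SEVERE_LEVELS:
        return 1
    raise ValueError(f"unknown severity level: {severity!r}")


total_records = 588_023
severe_records = 16_254
safe_records = total_records - severe_records

ratio = safe_records / severe_records
print(f"Safe : Severe = {ratio:.1f} : 1")
```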

We evaluated 36 configurations through a complete experimental matrix (two encoding methods, three balancing techniques, and six classifiers). Each configuration underwent systematic hyperparameter tuning using RandomizedSearchCV with 50 iterations and balanced accuracy scoring. Model selection prioritized performance on imbalanced-data metrics (PR-AUC, test generalization) over cross-validation scores, with statistical significance confirmed through Friedman tests (p < 0.002).
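The experimental matrix can be enumerated as a Cartesian product. In the sketch below, the encoding and balancing names come from the text; the classifier list is partly assumed, since only four classifiers are named in the results, XGBoost is inferred from the library list in Section 3, and the sixth entry is a placeholder.

```python
from itertools import product

encodings = ["label", "one_hot"]
balancers = ["SMOTE", "ADASYN", "SMOTETomek"]
classifiers = [
    "LogisticRegression",
    "RandomForest",
    "GradientBoosting",
    "LightGBM",
    "XGBoost",       # assumed from the library list
    "classifier_6",  # placeholder: the sixth classifier is not named here
]

configurations = [
    {"encoding": enc, "balancing": bal, "classifier": clf}
    for enc, bal, clf in product(encodings, balancers, classifiers)
]

print(len(configurations))  # 2 x 3 x 6 = 36
```

Fixing one encoding leaves the 3 × 6 = 18 configurations used for the multi-class and cascaded frameworks.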

The cascaded two-stage framework combined both approaches. In total, 18 configurations were applied. Stage 1 involves binary classification (Safe vs. Severe), with the reduced 35:1 ratio. In Stage 2, severe cases are classified into LIGHT/HEAVY/DEATH INJURY subcategories (imbalance 61:1 within the severe subset). SMOTE, ADASYN, and SMOTETomek were applied independently to each stage’s training partition.
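The routing logic of the cascade can be sketched as follows. The two stage functions are hypothetical stand-in rules so that the example runs without fitted models; they do not reflect the study's classifiers or findings.

```python
def cascade_predict(row, stage1, stage2):
    """Route a record through the two-stage cascade.

    stage1(row) -> 0 (Safe) or 1 (Severe)
    stage2(row) -> 'LIGHT INJURY', 'HEAVY INJURY' or 'DEATH INJURY'
    """
    if stage1(row) == 0:
        return "SIMPLE"
    # Only records flagged as Severe reach the second stage.
    return stage2(row)


# Hypothetical stand-in rules, used only to exercise the routing.
def toy_stage1(row):
    return 1 if row["accident_type"] == "pedestrian" else 0


def toy_stage2(row):
    return "HEAVY INJURY" if row["driver_age"] < 25 else "LIGHT INJURY"


records = [
    {"accident_type": "collision", "driver_age": 40},
    {"accident_type": "pedestrian", "driver_age": 22},
    {"accident_type": "pedestrian", "driver_age": 51},
]
predictions = [cascade_predict(r, toy_stage1, toy_stage2) for r in records]
print(predictions)
```

One consequence of this design is that Stage 1 errors propagate: a severe accident labelled Safe in Stage 1 never reaches Stage 2.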

3.4. Evaluation Metrics

Model performance was evaluated using metrics suited to imbalanced classification. For binary classification under extreme imbalance, PR-AUC was designated as the primary selection criterion, supported by balanced accuracy (BA), AUC-ROC, sensitivity, specificity, Cohen’s Kappa, F1-Macro, average precision, and Brier Score. During cross-validation, BA was used for tuning, while final model selection prioritized PR-AUC and test generalization. For multi-class and cascaded framework selection, Cohen’s Kappa was prioritized due to its sensitivity to minority class performance.

PR-AUC is calculated as the area under the precision–recall curve (Equation (1)), capturing the model’s performance across all recall levels, and is especially informative for imbalanced datasets.

$$\mathrm{PR\text{-}AUC} = \int P(R)\, dR \tag{1}$$

where $P$ is the precision, and $R$ is the recall.
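In practice the integral in Equation (1) is approximated from a finite set of scored instances. The sketch below uses the step-wise sum commonly reported as average precision, assuming no tied scores; the function name and toy inputs are illustrative.

```python
def pr_auc_step(y_true, scores):
    """Step-wise area under the precision-recall curve (average precision).

    Each positive raises recall by 1 / n_pos; the precision at that rank
    is the height of the corresponding step.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    true_positives = 0
    area = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            area += true_positives / rank  # precision at this recall step
    return area / n_pos


y_toy = [1, 0, 1, 0, 0]
s_toy = [0.9, 0.8, 0.7, 0.3, 0.1]
print(round(pr_auc_step(y_toy, s_toy), 4))  # (1/1 + 2/3) / 2 = 0.8333
```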

Balanced accuracy is calculated as the macro-average of recall per class (Equation (2)), making it unaffected by the extreme imbalance between majority and minority classes.

$$\mathrm{BA} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{Recall}_{i} \tag{2}$$

where $k$ is the number of classes.

AUC-ROC (area under the receiver operating characteristic curve) measures the model’s discriminative ability across all classification thresholds (Equation (3)):

$$\mathrm{AUC\text{-}ROC} = \int \mathrm{TPR}\; d\,\mathrm{FPR} \tag{3}$$

where $\mathrm{TPR}$ is the true-positive rate, and $\mathrm{FPR}$ is the false-positive rate. This metric is robust to class imbalance and provides a threshold-independent assessment of model performance.
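Equation (3) has an equivalent rank-based reading: the AUC-ROC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise form follows; it is quadratic in sample size and intended only for small illustrative inputs.

```python
def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count one half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


print(round(roc_auc([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.3, 0.1]), 4))  # 5 of 6 pairs
```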

Cohen’s Kappa measures agreement between predicted and actual classifications beyond chance (Equation (4)):

$$\mathrm{Kappa} = \frac{P_{o} - P_{e}}{1 - P_{e}} \tag{4}$$

where $P_{o}$ is the observed agreement (accuracy), and $P_{e}$ is the expected agreement based on class distributions. Kappa is particularly informative for imbalanced datasets as it accounts for correct classifications that would occur by chance.

F1-Macro provides a balanced measure of precision and recall across all classes (Equation (5)):

$$\mathrm{F1\text{-}Macro} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{F1}_{i} \tag{5}$$

where $\mathrm{F1}_{i}$ is the harmonic mean of precision and recall for class $i$ and $k$ is the number of classes. This metric equally weights all classes, preventing majority class dominance in performance assessment.

For the binary classification task (Safe vs. Severe), two additional metrics were employed:

Sensitivity (also referred to as ‘recall’ or ‘true-positive rate’) measures the proportion of severe accidents correctly identified (Equation (6)):

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{6}$$

where $TP$ represents true positives (correctly predicted severe accidents) and $FN$ represents false negatives (severe accidents misclassified as safe).

Specificity (true-negative rate) measures the proportion of safe accidents correctly classified (Equation (7)):

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{7}$$

where $TN$ represents true negatives (correctly predicted safe accidents), and $FP$ represents false positives (safe accidents misclassified as severe).
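The threshold-based metrics in Equations (2) and (4)–(7) all follow from a single 2 × 2 confusion matrix. The sketch below rebuilds approximate cell counts from figures reported elsewhere in the paper (test size 117,605, severe share 16,254/588,023, sensitivity 61.03%, specificity 81.05%). These counts are a reconstruction for illustration, not values taken from the study's confusion matrix.

```python
def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, BA, Cohen's Kappa and F1-Macro from cell counts."""
    n = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    p_o = (tp + tn) / n
    # Chance agreement from the row (actual) and column (predicted) totals.
    p_e = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    f1_severe = 2 * tp / (2 * tp + fp + fn)
    f1_safe = 2 * tn / (2 * tn + fn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (p_o - p_e) / (1 - p_e),
        "f1_macro": (f1_severe + f1_safe) / 2,
    }


# Approximate counts rebuilt from the reported rates (an assumption, see lead-in).
n_test = 117_605
n_severe = round(n_test * 16_254 / 588_023)
n_safe = n_test - n_severe
tp = round(n_severe * 0.6103)
tn = round(n_safe * 0.8105)

metrics = binary_metrics(tp=tp, fn=n_severe - tp, tn=tn, fp=n_safe - tn)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions the reconstruction lands close to the reported balanced accuracy (71.04%), Cohen's Kappa (0.1037), and F1-Macro (0.5186), which suggests the reported metrics are mutually consistent.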

3.5. SHAP Interpretability Analysis

SHAP values were computed for the binary LightGBM classifier using TreeExplainer [28,29] on a stratified random sample of 10,000 test instances, consistent with prior studies [20,21,22]. Global feature importance was assessed using mean absolute SHAP values.
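Global importance as used here is the mean absolute SHAP value per feature. The sketch below shows that aggregation on a small hypothetical matrix (rows are instances, columns are features); the numbers are toy values, not the study's SHAP output, and no SHAP library call is involved.

```python
def global_importance(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value across instances."""
    n = len(shap_rows)
    means = [
        sum(abs(row[j]) for row in shap_rows) / n
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, means), key=lambda item: item[1], reverse=True)


# Toy SHAP matrix: 3 instances x 3 features (hypothetical values).
toy_features = ["Time_Period_enc", "Driver_Age", "Road_Type_enc"]
toy_shap = [
    [0.5, -0.2, 0.1],
    [-0.3, 0.4, 0.0],
    [0.4, -0.3, -0.2],
]

ranking = global_importance(toy_shap, toy_features)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

Taking absolute values before averaging means a feature that pushes predictions strongly in both directions still ranks as important, even though its signed mean may be near zero.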

4. Results

In this section, we present results across three analytical phases: multi-class classification, cascaded approach, and binary LightGBM classification with diagnostics.

4.1. Multi-Class Classification Results

Table 7 presents the performance of the six classifiers under three balancing strategies in the multi-class severity setting. Gradient Boosting achieved the strongest class-balanced performance, with the best result obtained using SMOTE (balanced accuracy = 44.12%, Cohen’s Kappa = 0.0713), followed closely by ADASYN and SMOTETomek. LightGBM delivered slightly lower balanced accuracy (39.55–41.18%) but showed the best computational efficiency, requiring only 12.3–16.4 s for training compared with 772.9–1029.7 s for Gradient Boosting. Random Forest produced the lowest average balanced accuracy across balancing methods (37.62%), despite relatively competitive F1 and Kappa values. Overall, SMOTE provided the best average results among the balancing strategies, although the differences across balancing methods were modest.

4.2. Cascaded Classification Approach

The cascaded results are presented in Table 8. The highest balanced accuracy was achieved by LightGBM with SMOTETomek (BA = 42.25%), with Cohen’s Kappa = 0.0884 and F1-Macro = 0.2621. In contrast, the highest Cohen’s Kappa and F1-Macro were obtained by Random Forest with SMOTETomek (Kappa = 0.1217, F1-Macro = 0.2758), although its balanced accuracy was slightly lower (41.30%). Logistic regression did not produce the best cascaded result; its best performance was obtained with SMOTETomek (BA = 38.61%, Kappa = 0.0544, F1-Macro = 0.2450). Compared with the best multi-class result in Table 7, the cascaded framework achieved a slightly lower balanced accuracy (42.25% vs. 44.12%) but a substantially higher Cohen’s Kappa (0.1217 vs. 0.0713), corresponding to an improvement of approximately 70.7%. This indicates that the cascaded formulation improved agreement and minority-class discrimination more than overall class-balanced accuracy.

4.3. Binary Classification: LightGBM + SMOTE

Table 9 presents the binary classification results. The binary reformulation substantially improved performance relative to the best multi-class configuration (Gradient Boosting with SMOTE). The best binary model, LightGBM with SMOTE, achieved BA = 71.04%, AUC-ROC = 0.7722, PR-AUC = 0.2989, Cohen’s Kappa = 0.1037, and F1-Macro = 0.5186, representing a 61% improvement in BA over the best multi-class result (44.12%). Given the extreme class imbalance, PR-AUC offers a more demanding evaluation than ROC-AUC.
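The relative gains quoted in Sections 4.2 and 4.3 are ratios of differences, not percentage-point differences. A short check using the reported values (the helper name is illustrative):

```python
def relative_improvement(new, baseline):
    """Relative change of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Binary vs. best multi-class balanced accuracy (percent values from the text).
ba_gain = relative_improvement(71.04, 44.12)

# Cascaded vs. multi-class Cohen's Kappa, as compared in Section 4.2.
kappa_gain = relative_improvement(0.1217, 0.0713)

print(f"BA gain:    {ba_gain:.1%}")     # about 61%
print(f"Kappa gain: {kappa_gain:.1%}")  # about 70.7%
```

In absolute terms the balanced accuracy gain is 26.92 percentage points.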

The binary LightGBM model was selected after a comprehensive evaluation of 36 configurations (two encoding methods, three balancing techniques, and six classifiers) using systematic hyperparameter tuning with RandomizedSearchCV (50 iterations). The final selection prioritized PR-AUC and holdout test generalization rather than cross-validation scores alone, while Friedman tests confirmed statistically significant overall differences among the compared models (p < 0.002) (Table 10).

4.4. Model Performance Analysis

Figure 2 presents the ROC curve for the LightGBM + SMOTE model, with an AUC of 0.7722. Using the default threshold of 0.5, the model achieved a true positive rate of 61.03% and a false positive rate of 18.95%.

Figure 3 presents the row-wise normalized confusion matrix for the best binary model (LightGBM with SMOTE). The model correctly classified 81.05% of safe accidents and 61.03% of severe accidents, corresponding to a balanced accuracy of 71.04%. Misclassification remained higher for the severe class, as 38.97% of severe accidents were predicted as safe, whereas 18.95% of safe accidents were predicted as severe. These results indicate that the model was more effective at identifying safe accidents than severe ones, although it still achieved meaningful detection of the minority severe class under extreme class imbalance.

Figure 4 shows the balanced accuracy of the LightGBM + SMOTE model across three evaluation stages: 5-fold cross-validation, holdout test set, and temporal validation on 2025 data. Temporal validation achieved the highest balanced accuracy (77.50%), outperforming both 5-fold cross-validation (73.78%) and the holdout test set (71.04%), supporting the good temporal generalization of the LightGBM + SMOTE model.

4.5. SHAP-Based Feature Importance Analysis

The global feature importance rankings are presented in Table 11. Time Period (|SHAP| = 0.408), Driver Age (0.334), Accident Type (0.323), and Accident Hour (0.284) were the dominant predictors, with nighttime hours exhibiting strongly positive SHAP values.

The actual crash severity patterns across temporal periods are compared in Figure 5. Hourly analysis (top panel) reveals Hour 0 (midnight) as the peak individual risk hour (7.1% severe accident rate), substantially exceeding all other hours. Morning hours (6–11 a.m.) consistently exhibit the lowest severity rates (1.5–2.0%). Period-level aggregation (bottom panel) reveals a critical pattern: evening hours (18–23) demonstrate the highest sustained severity rate (4.25%), exceeding nighttime (3.08%), afternoon (2.82%), and morning (1.73%). While Hour 0 exhibits peak individual severity (7.1%), the evening’s sustained elevation across six consecutive hours (average 4.25%) creates a greater aggregate severe accident burden due to higher traffic exposure. This finding indicates two distinct temporal risk patterns: (1) isolated midnight concentration (Hour 0: 7.1%) affecting low traffic volumes and (2) sustained evening elevation (18–23: average 4.25%) affecting high traffic volumes during peak commuting hours. The contrast between Hour 0’s exceptional individual severity and the evening’s sustained period-level risk has important policy implications. While the midnight time period requires concentrated enforcement targeting impaired driving, evening hours demand comprehensive traffic management addressing the aggregate accident burden across larger populations.

The SHAP feature importance for binary severity prediction is represented in Figure 6. Each point shows a single prediction’s SHAP value, color-coded by feature magnitude (red = high, blue = low). Time_Period_enc appears as the most influential feature, followed by Driver_Age and Accident_Type_enc, while Accident_Hour also shows a meaningful contribution. In contrast, Accident_Year, Road_Type_enc, and Nationality_Group_enc exhibit comparatively lower overall influence.

4.6. Performance Under Severe Class Imbalance: PR-AUC, Class-Based Recall, and Calibration

Under severe class imbalance, the precision–recall curve was more informative than ROC-AUC because severe accidents accounted for only 2.76% of the test set. As shown in Figure 7, the curve remained clearly above the random baseline (0.0276), and the model achieved a PR-AUC of 0.2989, or about 10.8 times the baseline level. This indicates that the model can rank severe cases meaningfully despite their rarity. Class-based recall showed that 61.03% of severe accidents were correctly identified, while 81.05% of safe accidents were correctly classified, giving a balanced accuracy of 71.04%. At the same time, the calibration results indicated that predicted probabilities were generally higher than the observed severe-accident frequencies, with a Brier Score of 0.1683. Overall, the model appears more suitable for identifying and prioritizing severe-accident risk than for direct probability interpretation.
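Two reference points help put these figures in context. The PR-AUC of a random ranker equals the positive-class prevalence, and the Brier Score of a forecast that always outputs the base rate p is p(1 − p). The sketch below computes both from the reported prevalence; the `brier_score` helper and its toy inputs are illustrative.

```python
def brier_score(y_true, probabilities):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probabilities)) / len(y_true)


prevalence = 0.0276  # severe share of the test set, as reported
pr_auc = 0.2989      # reported PR-AUC

lift = pr_auc / prevalence
reference_brier = prevalence * (1 - prevalence)

print(f"PR-AUC lift over random baseline: {lift:.1f}x")
print(f"Brier Score of a constant base-rate forecast: {reference_brier:.4f}")
print(f"Toy Brier Score: {brier_score([1, 0], [0.8, 0.3]):.3f}")
```

The reported Brier Score (0.1683) is well above the base-rate reference of roughly 0.027, which is in line with the observation that predicted probabilities overstate severe-accident frequency and supports using the model for ranking rather than probability interpretation.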

Figure 8 shows that the test set is highly imbalanced, with severe accidents accounting for only 2.76% of all cases, yet the model still achieved a recall of 61.03% for the severe class, compared with 81.05% for the safe class.

Figure 9 further examines probability reliability through the calibration curve of the selected binary LightGBM model.
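To put the reported Brier Score of 0.1683 in context, one useful reference point is a predictor that always outputs the base rate. The sketch below computes that reference on synthetic labels constructed to match the reported 2.76% prevalence; it is an illustration added here, not a result from the study.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

prevalence = 0.0276      # severe share of the test set, from the text
reported_brier = 0.1683  # reported for the binary LightGBM + SMOTE model

# Synthetic labels with the reported prevalence: 276 severe in 10,000 cases.
labels = [1] * 276 + [0] * 9724

# Reference predictor: always output the base rate. Its Brier score equals
# prevalence * (1 - prevalence).
constant_brier = brier_score([prevalence] * len(labels), labels)

print(f"Constant base-rate predictor: {constant_brier:.4f}")
print(f"Reported model:               {reported_brier:.4f}")
```

The reported score sits above this reference, which is consistent with the statement that predicted probabilities run higher than observed frequencies and with the recommendation to use the model for ranking rather than for direct probability interpretation.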

5. Discussion

Our results demonstrate that machine learning can predict accident severity when class imbalance is appropriately addressed and models remain interpretable. This approach provides a practical tool for Qatar’s road safety operations and supports both QNV 2030 goals [5] and the National Road Safety Strategy [6].

5.1. Interpretation of Classification Performance

The multi-class BA of 44.12% is consistent with benchmarks for severely imbalanced four-class severity datasets [2,15,16]. The cascaded framework’s higher Kappa values suggest better agreement beyond chance and improved discrimination under class imbalance.
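Cohen's Kappa measures agreement beyond what the marginal class frequencies would produce by chance. The sketch below implements the standard definition for any square confusion matrix. Because the multi-class and cascaded confusion matrices are not reproduced here, the example instead rebuilds the binary matrix (as proportions) from the sensitivity, specificity, and prevalence reported for the binary model; that reconstruction is an assumption made for illustration.

```python
def cohens_kappa(confusion):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k)) / total
    row_share = [sum(confusion[i]) / total for i in range(k)]
    col_share = [sum(confusion[i][j] for i in range(k)) / total for j in range(k)]
    expected = sum(r * c for r, c in zip(row_share, col_share))
    return (observed - expected) / (1 - expected)

# Reported binary rates; the cells below are proportions, not counts.
sensitivity, specificity, prevalence = 0.6103, 0.8105, 0.0276
tp = sensitivity * prevalence
fn = prevalence - tp
tn = specificity * (1 - prevalence)
fp = (1 - prevalence) - tn

binary_kappa = cohens_kappa([[tn, fp], [fn, tp]])
print(f"kappa = {binary_kappa:.4f}")
```

The result lands close to the reported binary Kappa of 0.1037, with the small difference attributable to rounding in the published rates. The same function applies unchanged to a four-class matrix.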

5.2. Policy Implications of SHAP Findings

SHAP improved the interpretability of the model, but its results should be interpreted with some caution. The importance values may be affected by how categorical variables are encoded, and they describe associations learned by the fitted model rather than causal effects. For this reason, SHAP is best viewed as a useful interpretive tool that can complement, rather than replace, other approaches such as cost-sensitive learning and spatial modeling.

5.2.1. Temporal Factors

Time Period has the highest SHAP contribution, followed by Driver Age, confirming that temporal patterns and demographic factors are key predictors [20,22,30,31]. Empirical analysis results reveal two complementary intervention priorities: midnight-concentrated enforcement (Hour 0: 7.1% severity, low volume) and evening sustained risk management (18–23: 4.25% average, high volume). These findings demonstrate that AI-driven feature importance, while valuable for identifying risk factors, must be validated against actual severity distributions before resource allocation. Model emphasis on isolated peaks (Hour 0) should be balanced with empirical evidence of sustained high-risk periods (evening rush hours) to ensure comprehensive temporal coverage and optimal policy impact.

5.2.2. Driver Age and Accident Type

Younger drivers were estimated to be at increased risk of more severe accidents [10], in line with global evidence on novice driver risk. Accident Type exhibited high positive SHAP values for run-over and fixed-object collision types [30,31], which supports investment in pedestrian safety infrastructure.

5.2.3. Road Type and Accident Year

External roads demonstrated higher severity rates (9.73%) compared to internal roads (2.51%). This pattern reflects the higher-speed nature of highways and main arterials, where crash energy and injury severity are inherently elevated.

These findings support the prioritization of infrastructure improvements and enforcement on major external corridors alongside internal road networks. The downward trend in recent accident years may indicate improvements in road safety infrastructure during the period 2020–2025.

5.2.4. Integration with Traffic Violation Systems

This framework could complement existing violation detection systems by helping to rank higher-risk cases. Severity predictions could support patrol deployment through highlighting high-severity periods, such as midnight and the evening, and severity information could be combined with violation heat maps in Command Center dashboards to improve resource allocation. Because calibration was limited, the model is more suitable for risk prioritization than for direct probability interpretation.

5.2.5. Comparison with Related Work

Overall, while previous studies have reported the strong performance of advanced machine learning models in accident severity prediction, our findings extend this work by showing that reliability and robustness are better assessed through imbalance-aware metrics, systematic comparison across preprocessing strategies, and temporal validation on unseen future data. In particular, the binary LightGBM framework combines strong predictive performance with interpretability and temporal generalization, providing a more practically robust solution under extreme class imbalance than many earlier approaches.

Unlike prior Qatar-focused descriptive studies that relied on traditional statistical analysis [7,12], this work delivers the first individual-level predictive framework with SHAP-based policy interpretability.

The present study’s multi-class BA of 44.12% is comparable to or exceeds results from similar studies [2,3,4,8,16]. The binary BA of 71.04% and AUC of 0.7722 represent competitive performance. The dataset size (588,023 records) substantially exceeds that in most comparable GCC studies, providing greater statistical power for minority class learning.

The present study’s binary classification performance (AUC = 0.7722, F1-Macro = 0.5186) aligns with international benchmarks for severely imbalanced severity prediction, exceeding Dong et al.’s [22] LightGBM result (AUC = 0.71) on Pakistani highway data while remaining conservative compared to studies reporting suspiciously high metrics on small datasets without validation details [3,4]. Three methodological advantages distinguish this work: (1) dataset scale (588,023 records) exceeding comparable GCC studies, (2) rigorous validation with SMOTE applied exclusively to training data and stratified train–test splitting, and (3) explicit target leakage prevention, ensuring operational validity—standards rarely enforced in the regional literature.

5.3. Key Findings and Practical Recommendations

The binary formulation outperformed the multi-class form by 61% in relative terms (71.04% vs. 44.12% balanced accuracy, a gain of 26.92 percentage points), confirming that problem reformulation is critical under extreme imbalance. SMOTE and ADASYN proved to be the most effective balancing strategies, while LightGBM provided the best trade-off between speed and performance, training in roughly 12–16 s in the multi-class experiments compared to over 700 s for Gradient Boosting (Table 7). The cascaded approach improved minority class detection, with Cohen’s Kappa increasing by 70.7% (0.1217 vs. 0.0713), demonstrating better sensitivity to severe accidents. Temporal factors emerged as the dominant predictors, with a combined SHAP contribution of 0.692 (Time Period, 0.408, plus Accident Hour, 0.284), revealing two distinct risk patterns: a midnight peak (Hour 0) with 7.1% severity affecting lower traffic volumes and sustained evening elevation (18–23), averaging 4.25% severity across higher traffic volumes. The model exhibited strong temporal generalization, with a minimal training–validation gap confirming no significant overfitting.
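The percentage improvements quoted in this paragraph are relative gains, not percentage-point differences. A short sketch of the arithmetic, using the values given in the text (the helper name is illustrative):

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100

# Balanced accuracy: binary LightGBM vs. best multi-class configuration.
ba_gain = relative_gain(71.04, 44.12)
ba_points = 71.04 - 44.12  # the same comparison in percentage points

# Cohen's Kappa: best cascaded vs. best multi-class configuration.
kappa_gain = relative_gain(0.1217, 0.0713)

print(f"BA: +{ba_gain:.0f}% relative ({ba_points:.2f} percentage points)")
print(f"Kappa: +{kappa_gain:.1f}% relative")
```

Reporting both forms avoids ambiguity: a 61% relative gain here corresponds to roughly 27 percentage points of balanced accuracy.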

These findings translate into several actionable recommendations. The midnight and evening risk periods support targeted enforcement strategies, with midnight requiring concentrated intervention for high-risk behaviors such as reckless driving and evening hours calling for comprehensive traffic management given their sustained risk and higher traffic exposure. Younger drivers exhibit elevated risk levels, supporting the implementation of graduated licensing systems and targeted training programs, particularly for high-risk nighttime driving conditions. Pedestrian-related accidents are associated with higher severity, underscoring the need for improved crossing infrastructure, enhanced traffic signal systems, and increased driver awareness in urban environments. The high computational efficiency of LightGBM makes it suitable for real-time deployment by traffic safety practitioners to support proactive risk monitoring and intervention. Finally, the elevated severity rates on external roads (9.73%) compared to internal roads (2.51%) highlight the importance of speed management, defensive driving practices, and targeted safety interventions in high-speed corridors.

6. Conclusions

This study examined machine learning frameworks for the prediction of traffic accident severity using Qatar’s national crash dataset under extreme class imbalance. The binary LightGBM framework achieved the most practical combination of performance (BA = 71.04%), interpretability, and computational efficiency, demonstrating that problem reformulation yields greater gains than algorithmic selection alone. Temporal factors emerged as the dominant predictors, with midnight and evening hours presenting distinct risk patterns that support differentiated enforcement strategies. The findings translate into actionable recommendations for targeted temporal enforcement, pedestrian safety infrastructure, graduated licensing for young drivers, and speed management on external roads.

Several limitations should be acknowledged. The binary formulation simplified the original four-level severity structure, temporal validation was based on a single future year, and the model showed limited probability calibration, making it more suitable for risk ranking than direct probability interpretation. Four high-value features were excluded due to extreme missingness, and geographic variables were unavailable. Furthermore, the study period overlaps with the COVID-19 pandemic (2020–2021), which may have influenced traffic patterns, and minority classes remain severely underrepresented. All experiments were conducted using Qatar-specific data, and generalizability to other GCC countries remains unvalidated. Future research should integrate environmental and sensor data, explore cost-sensitive and spatial modeling approaches, and develop forecasting tools for accident rates and types to support proactive road safety management.

While the current study focuses on predicting severity for individual accidents, understanding future accident rates and types remains a critical direction for proactive road safety management. Our results show an encouraging downward trend: accident severity decreased in 2024–2025 compared to earlier years, which may reflect the positive impact of Qatar’s road safety efforts. Building on this, future work will develop forecasting tools to help authorities anticipate accident patterns and plan preventive measures.

Author Contributions

Conceptualization, M.A. and Y.Y.; methodology, M.A. and Y.Y.; software, M.A.; validation, M.A. and Y.Y.; formal analysis, M.A.; data curation, M.A.; writing—original draft preparation, M.A.; writing—review and editing, M.A. and Y.Y.; visualization, M.A.; supervision, Y.Y.; project administration, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available on the Qatar Open Data Portal at https://www.moi.gov.qa (accessed on 1 April 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

GDP	Gross Domestic Product
GCC	Gulf Cooperation Council
BA	Balanced Accuracy
RTC	Road Traffic Crash
QNV	Qatar National Vision

References

1. World Health Organization. Global Status Report on Road Safety 2023; WHO Press: Geneva, Switzerland, 2023.
2. Mostafa, A.M.; Aldughayfiq, B.; Tarek, M.; Alaerjan, A.S.; Allahem, H.; Elbashir, M.K.; Ezz, M.; Hamouda, E. AI-based prediction of traffic crash severity for improving road safety. Sci. Rep. 2025, 15, 27468.
3. Aldhari, I.; Almoshaogeh, M.; Assi, K.; Alnedawi, A.; Alharbi, F. Severity Prediction of Highway Crashes in Saudi Arabia Using Machine Learning Techniques. Appl. Sci. 2022, 13, 233.
4. Alanazi, F.; Okail, M.A. Predicting Traffic Crash Severity in Jeddah Using Machine Learning Techniques: A Data-Driven Approach to Road Safety. Transp. Res. Rec. 2025, 2679, 324–343.
5. General Secretariat for Development Planning. Qatar National Vision 2030; Amiri Decision No. 44; Government of Qatar: Doha, Qatar, 2008.
6. Ministry of Interior Qatar. National Road Safety Strategy—Second Phase Action Plan; General Directorate of Traffic: Doha, Qatar, 2016.
7. Timmermans, C.; Alhajyaseen, W.; Al Mamun, A.; Wakjira, T.; Qasem, M.; Almallah, M.; Younis, H. Analysis of road traffic crashes in the State of Qatar. Int. J. Inj. Control Saf. Promot. 2019, 26, 242–250.
8. Alsoud, A.; Alomari, A.; Qrarah, M.; Ahmad, Z. Spatial Pedestrian Safety in Riyadh School Zones: A Data-Driven Approach: Evaluating Spatial Pedestrian Safety in Riyadh’s School Zones Using Multiple Linear Regression and Machine Learning: A Data-Driven Approach. Str. Art Urban Creat. 2025, 11, 231–257.
9. Mohammed, S.; Alkhereibi, A.H.; Abulibdeh, A.; Jawarneh, R.N.; Balakrishnan, P. GIS-based spatiotemporal analysis for road traffic crashes; in support of sustainable transportation Planning. Transp. Res. Interdiscip. Perspect. 2023, 20, 100836.
10. Tarlochan, F.; Ibrahim, M.I.M.; Gaben, B. Understanding Traffic Accidents among Young Drivers in Qatar. Int. J. Environ. Res. Public Health 2022, 19, 514.
11. Iftikhar, H.; Turkmen, S.; Azad, A.M.; Bhutta, Z.; Imamoglu, M.; Karakullukcu, S. Analysis of desert traffic accidents: A retrospective study. Qatar Med. J. 2024, 2024, 65.
12. Shaaban, K.; Siam, A.; Badran, A. Analysis of Traffic Crashes and Violations in a Developing Country. Transp. Res. Procedia 2021, 55, 1689–1695.
13. Kaufman, S.; Rosset, S.; Perlich, C.; Stitelman, O. Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM Trans. Knowl. Discov. Data 2012, 6, 1–21.
14. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
15. Behboudi, N.; Moosavi, S.; Ramnath, R. Recent Advances in Traffic Accident Analysis and Prediction. A Comprehensive Review of Machine Learning Techniques. arXiv 2024, arXiv:2406.13968.
16. Chen, F.; Liu, X.Q.; Yang, J.J.; Liu, X.K.; Ma, J.H.; Chen, J.; Xiao, H.Y. Traffic accident severity prediction based on an enhanced MSCPO-XGBoost hybrid model. Sci. Rep. 2025, 15, 25729.
17. Pei, Y.; Wen, Y.; Pan, S. Traffic accident severity prediction based on interpretable deep learning model. Transp. Lett. 2025, 17, 895–909.
18. Alfasi, B.A.; Mahmoud, K.R.M.; Matar, A.-H.; Abdelati, M.H. Neural Network-Based Prediction of Traffic Accidents and Congestion Levels Using Real-World Urban Road Data. Future Transp. 2025, 5, 138.
19. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 3146–3154.
20. Yan, R.; Hu, L.; Li, J.; Lin, N. Accident Severity Analysis of Traffic Accident Hot Spot Areas in Changsha City. Sustainability 2024, 6, 3054.
21. Yang, Z.; Zhang, C.; Li, G.; Xu, H. Analysis of the Impact of Different Road Conditions on Accident Severity at Highway-Rail Grade Crossings. Symmetry 2025, 17, 147.
22. Dong, S.; Khattak, A.; Ullah, I.; Zhou, J.; Hussain, A. Predicting and Analyzing Road Traffic Injury Severity Using Boosting-Based Ensemble Learning Models with SHAP. Int. J. Environ. Res. Public Health 2022, 19, 2925.
23. Awadalla, D.M.; de Albuquerque, F.D.B. Fatal Road Crashes in the Emirate of Abu Dhabi: Contributing Factors and Data-Driven Safety Recommendations. Transp. Res. Procedia 2021, 52, 260–267.
24. Al-Eideh, B.M. Statistical analytical study of traffic accidents and violations in the State of Kuwait and its social and economic impact on the Kuwaiti society. Am. J. Appl. Math. Stat. 2016, 4, 24–36.
25. AlKheder, S.; AlRukaibi, F.; Aiash, A. Support Vector Machine (SVM), Random Forest (RF), Artificial Neural Network (ANN) and Bayesian Network for prediction and analysis of GCC traffic accidents. J. Ambient Intell. Humaniz. Comput. 2023, 14, 7331–7339.
26. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
27. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794.
28. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 4768–4777.
29. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.-I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020, 2, 56–67.
30. Parsa, A.B.; Movahedi, A.; Taghipour, H.; Derrible, S.; Mohammadian, A. Toward safer highways: XGBoost and SHAP for real-time accident detection. Accid. Anal. Prev. 2020, 136, 105405.
31. Wen, X.; Xie, Y.; Wu, L.; Jiang, L. Quantifying and comparing the effects of key risk factors on various types of roadway segment crashes with LightGBM and SHAP. Accid. Anal. Prev. 2021, 159, 106261.

Figure 1. Research framework overview.

Figure 2. ROC curve.

Figure 3. Confusion matrix.

Figure 4. Model performance across evaluation stages.

Figure 5. Temporal severity patterns.

Figure 6. SHAP feature importance.

Figure 7. Precision–recall curve of the binary LightGBM + SMOTE model.

Figure 8. Class-based recall and test set class distribution.

Figure 9. Calibration curve of the binary LightGBM + SMOTE model.

Table 1. Summary of related work, methods, and research gaps.

Subsection	Study Focus	Methods Used	Dataset	Key Findings	Identified Gaps
ML for Severity Prediction	ML for Severity Prediction	General ML-based severity prediction	Large-scale	High predictive performance (macro up to 95%)	Limited focus on extreme class imbalance, lack of consistent evaluation frameworks, potential overfitting in deep learning
LightGBM + SHAP	LightGBM + SHAP	Interpretable ML in traffic safety	Medium-scale	Identification of key risk factors (e.g., visibility, road conditions, driver age)	Limited application for large-scale national datasets and lack of integration with imbalance handling strategies
Class Imbalance Handling	Class Imbalance Handling	Addressing imbalance in accident data	Variable	Improved minority class detection	Lack of systematic comparison across multiple classifiers and frameworks on the same dataset
GCC and Qatar Studies	GCC and Qatar Studies	Regional traffic safety research	Small to medium	Insights into regional patterns and risk factors	Absence of large-scale predictive ML studies in Qatar; limited use of imbalance handling and interpretable AI

Table 2. Class distribution of the final analytical dataset (n = 588,023).

Severity Class	Records	Percentage	Role
SIMPLE	571,794	97.24%	Majority (Negative)
LIGHT INJURY	15,165	2.58%	Minority (Positive)
HEAVY INJURY	815	0.14%	Minority (Positive)
DEATH INJURY	249	0.04%	Minority (Positive)
Total	588,023	100.0%	-

Table 3. Data preprocessing pipeline and record retention summary.

Step	Action	Records Removed	Remaining
1	Remove columns with >75% missing values	-	1,000,500
2	Remove ‘NOT A TRAFFIC ACCIDENT’ records	448	1,000,052
3	Remove records with missing critical fields	411,892	588,160
4	Remove records with invalid driver ages	137	588,023
Final	Clean dataset	412,477	588,023
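The record counts in Table 3 can be checked step by step. The sketch below replays the three row-removal steps (step 1 removes columns, not records) and derives the overall retention rate, which is not stated in the table and is computed here for illustration.

```python
raw_records = 1_000_500

# (description, records removed) for the row-removal steps in Table 3.
steps = [
    ("Remove 'NOT A TRAFFIC ACCIDENT' records", 448),
    ("Remove records with missing critical fields", 411_892),
    ("Remove records with invalid driver ages", 137),
]

remaining = raw_records
for description, removed in steps:
    remaining -= removed
    print(f"{description}: -{removed:,} -> {remaining:,}")

total_removed = raw_records - remaining
retention = remaining / raw_records
print(f"Total removed: {total_removed:,}; retained {retention:.1%} of raw records")
```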

Table 4. Final feature set (seven predictor variables).

#	Feature	Type	Range	Description
1	Accident Year	Numeric	2020–2025	Year the accident occurred
2	Accident Hour	Numeric	0–23	Hour of day extracted from accident time
3	Time Period	Categorical	0–3	Night (0), Morning (1), Afternoon (2), Evening (3)
4	Road Type	Binary	0–1	External (0), Internal (1)
5	Accident Type	Categorical	0–3	Collision (0), Coup (1), Run Over (2), Other (3)
6	Nationality Group	Categorical	0–9	10 nationality groups (label encoded)
7	Driver Age	Numeric	1–100	Derived: Accident Year − Driver Birth Year
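The Driver Age derivation in Table 4 can be sketched as below. The valid range of 1–100 is taken from the table's Range column; treating anything outside it as invalid is an assumption about how the "invalid driver ages" step in Table 3 was applied, and the function name is hypothetical.

```python
from typing import Optional

def driver_age(accident_year: int, birth_year: int,
               min_age: int = 1, max_age: int = 100) -> Optional[int]:
    """Return Accident Year - Driver Birth Year, or None if outside the valid range."""
    age = accident_year - birth_year
    return age if min_age <= age <= max_age else None

# Hypothetical (accident_year, birth_year) pairs for illustration.
examples = [(2023, 1990), (2023, 2023), (2020, 1900)]
ages = [driver_age(year, born) for year, born in examples]
print(ages)
```

Records for which the function returns `None` would be dropped before modeling under this assumption.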

Table 5. Class-balancing techniques evaluated.

Technique	Type	Description
SMOTE	Oversampling	Generates synthetic minority samples by interpolating between nearest neighbors [14]
ADASYN	Adaptive Oversampling	Adaptively generates more samples near decision boundaries
SMOTETomek	Hybrid	Combines SMOTE oversampling with Tomek Links undersampling

Table 6. Machine learning classifiers evaluated.

Model	Reference	Best Hyperparameters
Logistic Regression	scikit-learn	{'solver': 'liblinear', 'penalty': 'l1', 'max_iter': 1000, 'class_weight': 'balanced', 'C': 10}
Random Forest	scikit-learn	{'n_estimators': 150, 'min_samples_split': 20, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 20, 'class_weight': 'balanced'}
Gradient Boosting	scikit-learn	{'subsample': 0.6, 'n_estimators': 300, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_depth': 5, 'learning_rate': 0.1}
AdaBoost	scikit-learn	{'n_estimators': 150, 'learning_rate': 0.5, 'algorithm': 'SAMME.R'}
XGBoost	Chen and Guestrin [27]	{'subsample': 0.8, 'scale_pos_weight': 2, 'n_estimators': 200, 'min_child_weight': 1, 'max_depth': 6, 'learning_rate': 0.2, 'gamma': 0, 'colsample_bytree': 0.8}
LightGBM	Ke et al. [19]	{'subsample': 0.8, 'reg_lambda': 0.1, 'reg_alpha': 0.5, 'num_leaves': 31, 'n_estimators': 300, 'min_child_samples': 20, 'max_depth': 10, 'learning_rate': 0.05, 'colsample_bytree': 0.6, 'class_weight': 'balanced'}

Table 7. Multi-class classification performance (18 experimental configurations).

Balancer	Model	BA (%)	Kappa	F1	Time (s)
SMOTE	Logistic Regression	39.22	0.0258	0.2173	232.21
	Gradient Boosting	44.12	0.0623	0.2461	1004.25
	AdaBoost	37.46	0.0362	0.2433	99.76
	LightGBM	40.46	0.0668	0.2490	15.71
	XGBoost	39.24	0.0587	0.2432	29.68
	Random Forest	38.00	0.0699	0.2548	57.18
ADASYN	Logistic Regression	39.08	0.0256	0.2167	218.37
	Gradient Boosting	43.65	0.0529	0.2368	772.92
	AdaBoost	38.10	0.0252	0.2214	56.51
	LightGBM	41.18	0.0612	0.2441	12.30
	XGBoost	38.65	0.0525	0.2358	26.12
	Random Forest	37.48	0.0642	0.2492	33.16
SMOTETomek	Logistic Regression	38.93	0.0270	0.2180	221.35
	Gradient Boosting	42.71	0.0614	0.2451	1029.68
	AdaBoost	37.64	0.0291	0.2262	95.12
	LightGBM	39.55	0.0652	0.2478	16.35
	XGBoost	39.23	0.0596	0.2436	29.80
	Random Forest	37.38	0.0713	0.2554	55.25

Table 8. Cascaded approach classification results.

Balancer	Model	BA (%)	Kappa	F1	Time (s)
SMOTE	Logistic Regression	38.12	0.0545	0.2447	15.07
	Gradient Boosting	40.23	0.0822	0.2588	111.77
	AdaBoost	39.73	0.0603	0.2567	38.35
	LightGBM	40.69	0.0878	0.2617	3.10
	XGBoost	41.96	0.0921	0.2628	3.64
	Random Forest	41.19	0.1208	0.2751	25.65
ADASYN	Logistic Regression	38.41	0.0457	0.2361	13.61
	Gradient Boosting	39.52	0.0741	0.2528	94.43
	AdaBoost	37.85	0.0568	0.2531	28.50
	LightGBM	39.32	0.0768	0.2535	2.71
	XGBoost	40.78	0.0803	0.2548	3.44
	Random Forest	41.24	0.1077	0.2679	17.73
SMOTETomek	Logistic Regression	38.61	0.0544	0.2450	14.26
	Gradient Boosting	41.06	0.0823	0.2591	108.19
	AdaBoost	39.08	0.0591	0.2573	36.83
	LightGBM	42.25	0.0884	0.2621	3.00
	XGBoost	42.01	0.0923	0.2629	3.45
	Random Forest	41.30	0.1217	0.2758	24.34

Table 9. Binary approach classification results.

Balancer	Model	BA (%)	Kappa	PR-AUC	Specificity	AUC-ROC	Sensitivity	Brier Score	F1	Time (s)
SMOTE	LR	66.29	0.0749	0.2578	0.7910	0.7157	0.5348	0.1957	0.4985	10.25
	GB	71.03	0.1005	0.2903	0.8031	0.7732	0.6174	0.1703	0.5149	107.91
	AdaBoost	67.45	0.0818	0.2601	0.7964	0.7284	0.5527	0.2295	0.5035	48.77
	LGBM	71.04	0.1037	0.2989	0.8105	0.7722	0.6103	0.1683	0.5186	2.48
	XGBoost	70.39	0.1008	0.2806	0.8105	0.7610	0.5974	0.1623	0.5171	3.37
	RF	68.60	0.1090	0.2256	0.8456	0.7347	0.5265	0.1468	0.5304	23.79
ADASYN	LR	65.80	0.0615	0.2576	0.7460	0.7153	0.5699	0.2080	0.4785	10.94
	GB	70.79	0.0906	0.2822	0.7799	0.7716	0.6359	0.1844	0.5033	112.81
	AdaBoost	67.38	0.0788	0.2522	0.7881	0.7254	0.5595	0.2317	0.4997	39.58
	LGBM	70.83	0.0913	0.2929	0.7814	0.7712	0.6352	0.1823	0.5041	2.17
	XGBoost	69.97	0.0873	0.2713	0.7802	0.7578	0.6192	0.1763	0.5018	2.94
	RF	68.14	0.0938	0.2163	0.8196	0.7312	0.5431	0.1592	0.5160	17.08
SMOTETomek	LR	66.29	0.0749	0.2578	0.7910	0.7157	0.5348	0.1957	0.4985	13.36
	GB	70.88	0.1001	0.2904	0.8037	0.7732	0.6140	0.1703	0.5149	160.02
	AdaBoost	67.45	0.0818	0.2596	0.7964	0.7284	0.5527	0.2295	0.5035	54.85
	LGBM	70.95	0.1032	0.2974	0.8103	0.7719	0.6087	0.1680	0.5183	2.88
	XGBoost	70.32	0.1002	0.2840	0.8100	0.7602	0.5964	0.1623	0.5166	3.30
	RF	68.56	0.1096	0.2206	0.8469	0.7353	0.5243	0.1464	0.5310	24.69

Table 10. Binary LightGBM + SMOTE performance summary.

Metric	Value	Interpretation
BA	71.04%	61% improvement over the best multi-class configuration (44.12%, GB + SMOTE)
AUC-ROC	0.7722	Good discriminative ability (>0.70)
Cohen’s Kappa	0.1037	45.4% improvement over the best multi-class configuration (0.0713, RF + SMOTETomek)
F1-Macro	0.5186	103% improvement over the best multi-class configuration (0.2554, RF + SMOTETomek)
Sensitivity	61.03%	61% of severe accidents correctly detected
Specificity	81.05%	81% of safe accidents correctly classified
PR-AUC	0.2989	About 10.8× the random baseline (0.0276) under severe class imbalance
Average Precision	0.2990	Consistent with PR-AUC under imbalanced evaluation
Precision	8.38%	Approximately 3.0× enrichment over the severe-class prevalence (~2.76%)
Brier Score	0.1683	Limited probability calibration; more suitable for ranking than direct probability interpretation
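Several entries in Table 10 are linked by simple identities. As a consistency check, the sketch below rebuilds the confusion-matrix proportions from the reported sensitivity, specificity, and severe-class prevalence, then derives precision and macro-averaged F1. The reconstruction is an illustration under the assumption that all metrics were computed on the same test set at the same threshold.

```python
sensitivity, specificity, prevalence = 0.6103, 0.8105, 0.0276

# Confusion-matrix cells as proportions of the test set.
tp = sensitivity * prevalence
fn = prevalence - tp
tn = specificity * (1 - prevalence)
fp = (1 - prevalence) - tn

precision_severe = tp / (tp + fp)
precision_safe = tn / (tn + fn)
enrichment = precision_severe / prevalence

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Macro F1 is the unweighted mean of the per-class F1 scores.
f1_macro = (f1(precision_severe, sensitivity) + f1(precision_safe, specificity)) / 2

print(f"Severe-class precision: {precision_severe:.2%}")
print(f"Enrichment over prevalence: {enrichment:.1f}x")
print(f"F1-Macro: {f1_macro:.4f}")
```

The derived values agree with the reported precision (8.38%), enrichment (about 3.0×), and F1-Macro (0.5186) to within rounding.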

Table 11. SHAP feature importance ranking.

Rank	Feature	Mean |SHAP|	Interpretation
1	Time Period	0.408	The night period is the highest-risk temporal category
2	Driver Age	0.334	Younger drivers are associated with more severe outcomes
3	Accident Type	0.323	Run-over and fixed-object collisions are more severe
4	Accident Hour	0.284	Nighttime hours increase the probability of severity
5	Accident Year	0.246	Recent years show slightly reduced severity trends
6	Road Type	0.195	External roads show higher severity than internal
7	Nationality Group	0.159	Least influential predictor
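The ranking in Table 11 and the combined temporal contribution quoted in the discussion can be reproduced from the mean |SHAP| values alone. In the sketch below, grouping Time Period and Accident Hour as the "temporal" features is an assumption, chosen because their sum matches the 0.692 figure reported in the text; the share of total importance is computed here for illustration and is not reported in the article.

```python
# Mean |SHAP| values from Table 11.
mean_abs_shap = {
    "Time Period": 0.408,
    "Driver Age": 0.334,
    "Accident Type": 0.323,
    "Accident Hour": 0.284,
    "Accident Year": 0.246,
    "Road Type": 0.195,
    "Nationality Group": 0.159,
}

# Rank features by mean absolute contribution, largest first.
ranking = sorted(mean_abs_shap.items(), key=lambda item: item[1], reverse=True)

# Assumed temporal grouping: Time Period + Accident Hour.
temporal = mean_abs_shap["Time Period"] + mean_abs_shap["Accident Hour"]
total = sum(mean_abs_shap.values())
temporal_share = temporal / total

for rank, (feature, value) in enumerate(ranking, start=1):
    print(f"{rank}. {feature}: {value:.3f}")
print(f"Temporal contribution: {temporal:.3f} ({temporal_share:.1%} of total)")
```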

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

