Traffic Incident Impact Prediction Using Machine Learning and Explainable AI: Evidence from Istanbul

Korkmaz, Adem; Çelik, Ufuk; Tümen, Vedat

doi:10.3390/electronics15061162


by Adem Korkmaz ¹,*, Ufuk Çelik ² and Vedat Tümen ³,*

¹ Department of Computer Technologies, Gönen Vocational School, Bandırma Onyedi Eylül University, 10200 Bandırma, Türkiye

² Department of Management Information Systems, Omer Seyfettin Faculty of Applied Sciences, Bandirma Onyedi Eylül University, 10200 Bandirma, Türkiye

³ Computer Engineering Department, Faculty of Engineering and Architecture, Bitlis Eren University, 13100 Bitlis, Türkiye

* Authors to whom correspondence should be addressed.

Electronics 2026, 15(6), 1162; https://doi.org/10.3390/electronics15061162

Submission received: 3 February 2026 / Revised: 5 March 2026 / Accepted: 6 March 2026 / Published: 11 March 2026

(This article belongs to the Special Issue Next-Generation Intelligent Transportation Systems: IoT, Machine Learning, and Edge Analytics)


Abstract

Traffic incident impact prediction remains challenging for intelligent transportation systems due to complex spatiotemporal dependencies. This study analyzes 38,430 real-world traffic incidents from Istanbul (2022–2024) to predict normalized traffic deviation ΔTraffic (%) using machine learning with rigorous temporal validation. Three models, Random Forest (RF), XGBoost, and LightGBM, were evaluated using rolling-origin cross-validation (2022 training, 2023 testing; 2022–2023 training, 2024 testing) to prevent temporal leakage, employing a strictly operational 13-feature set that excludes information unavailable at incident onset (t₀). LightGBM achieved MAE = 26.81 ± 1.94% and R² = 0.506 ± 0.042 (mean ± std across folds) with 95% bootstrap confidence intervals of [27.54%, 28.81%] for MAE on the 2024 test set, significantly outperforming historical baselines (R² = 0.100 ± 0.054, p < 0.001, Bonferroni-corrected). Feature ablation studies revealed that temporal features contribute 65.2% of predictive power, while incident type contributes only 1.3%. Distributional robustness analysis confirms conclusions are stable across distributional treatments (log, winsorised, quantile), with feature importance rank correlations ρ = 1.000 between all treatment pairs. This work provides empirical evidence for context-aware traffic management systems and demonstrates the importance of proper temporal validation in transportation forecasting.

Keywords:

traffic incident prediction; explainable AI; temporal validation; intelligent transportation systems; SHAP; feature importance; urban traffic management

1. Introduction

1.1. Motivation and Background

Urban traffic congestion imposes substantial economic and environmental costs on metropolitan regions, with non-recurrent congestion from traffic incidents accounting for approximately 25% of total urban delays [1]. As cities transition toward intelligent transportation systems (ITS), data-driven prediction of incident impact has become critical for proactive traffic management and resource allocation [2].

Istanbul, Turkey’s largest metropolis, with over 15 million residents, experiences more than 38,000 documented traffic incidents annually, ranging from minor collisions to major infrastructure closures [3]. The city’s unique challenges—transcontinental geography, dense population, and complex road network—make it an ideal test bed for advanced traffic management strategies [4].

Recent advances in machine learning (ML) have demonstrated strong results for traffic prediction, with ensemble methods such as Random Forest [5], XGBoost [6], and LightGBM [7] consistently achieving competitive performance across a range of incident detection and impact estimation tasks. These methods have been extensively studied not only for their predictive accuracy but also for their interpretability: tree-based ensembles are naturally amenable to feature importance analysis, and the advent of SHAP (SHapley Additive exPlanations) [8] has made exact, theoretically grounded attribution computationally tractable for these model families specifically. Nevertheless, two methodological gaps remain underaddressed in the traffic incident literature: the widespread use of random cross-validation rather than temporal splits, which has been shown to inflate reported R² values by as much as 0.05–0.15 [9], and the limited translation of model explanations into operational practice, where practitioners require subgroup-level diagnostics—by incident type, time of day, and location—rather than global feature rankings alone [10].

In this context, this study focuses on predicting traffic incident impacts using machine learning models under strict temporal validation, emphasizing evaluation realism and explainability rather than architectural complexity. Using a large-scale real-world dataset of traffic incidents in Istanbul from 2022 to 2024, we systematically investigate the relative importance of spatiotemporal context and incident characteristics, while benchmarking machine learning models against operational historical baselines. By doing so, the study aims to provide empirically grounded insights into the contextual nature of traffic incident impacts and to contribute to more reliable and interpretable traffic prediction frameworks for practical ITS applications.

1.2. Research Gap and Contributions

This study does not propose a new machine learning algorithm or a novel theoretical formulation. Rather, it contributes at the levels of evaluation methodology, empirical rigor, and operational explainability—dimensions that remain systematically underaddressed in the traffic incident prediction literature despite their direct relevance to real-world ITS deployment. Despite substantial progress, several critical gaps remain:

Temporal validation: Most studies use random train–test splits, creating data leakage and unrealistic performance estimates [11].

Baseline comparisons: Limited benchmarking against simple operational baselines makes it difficult to assess practical value [12].

Interpretability validation: XAI techniques are applied but rarely validated for actual utility [13].

Contextual vs. categorical factors: Insufficient emphasis on spatiotemporal context versus traditional incident attributes.

This study addresses these gaps through the following contributions.

We propose a temporally consistent evaluation framework for traffic incident impact prediction, addressing the often-overlooked issue of temporal leakage in transportation studies.

We provide an empirical comparison between machine learning models and operational historical baselines under strict temporal validation, revealing realistic performance gaps.

We systematically quantify the relative importance of spatiotemporal context versus incident characteristics using convergent evidence from SHAP analysis and feature ablation experiments.

We introduce a normalized impact metric, ΔTraffic (%), that enables the fair comparison of incident effects across heterogeneous spatial and temporal contexts.

We provide evidence-based insights into the contextual nature of traffic incident impacts, with implications for explainable and operationally feasible intelligent transportation systems.

1.3. Research Questions

This work investigates four research questions:

RQ1: Can machine learning models significantly outperform historical baselines for incident impact prediction when properly validated temporally?

RQ2: Which feature groups (spatial, temporal, incident, traffic) contribute most to prediction accuracy?

RQ3: Does spatiotemporal context dominate incident classification in determining traffic impact?

RQ4: Are SHAP-based interpretations consistent with empirical feature ablation studies?

1.4. Paper Organization

The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 describes the methodology including the temporal validation strategy; Section 4 presents the results with statistical testing; Section 5 discusses the findings and limitations; Section 6 concludes with the practical implications.

2. Related Work

Recent studies have increasingly explored deep learning and hybrid approaches for traffic incident prediction and impact analysis. For example, Grigorev et al. [14] provided a comprehensive review of incident duration prediction techniques, highlighting the growing role of deep neural architectures. Sayadi et al. [15] proposed a convolutional-recurrent framework for accident impact prediction, while Chen et al. [16] introduced score-based spatiotemporal point process models for accident forecasting. These studies demonstrate the potential of advanced models but also underline challenges related to interpretability, data requirements, and operational deployment.

2.1. Traffic Incident Prediction

Early traffic prediction research employed statistical methods including ARIMA [17] and regression models [18], but these struggled with nonlinear relationships. Support Vector Machines (SVM) [19] and neural networks [20] improved accuracy but required extensive feature engineering.

Recent work favors ensemble methods due to their robustness and interpretability. Breiman’s Random Forest [5] has been extensively applied for incident duration prediction and severity classification [21]. Gradient boosting frameworks—XGBoost [6] and LightGBM [7]—achieve superior performance for short-term traffic forecasting [22,23].

Deep learning approaches, particularly LSTMs [24,25] and Graph Neural Networks (GNNs) [26,27], excel at capturing temporal dependencies and spatial network structures. However, these methods require large datasets and substantial computational resources, limiting real-time applicability. Transformer architectures [28] show promise but remain computationally intensive for operational deployment.

2.2. Temporal Validation in Transportation Forecasting

A critical but often overlooked issue in traffic prediction is proper temporal validation. Bergmeir and Benítez [29] demonstrated that random cross-validation in time series creates data leakage, systematically overestimating performance. Cerqueira et al. [11] showed that proper temporal splits reduce reported R² by 15–30% compared to random splits.

Despite these findings, many traffic prediction studies continue using random validation [30,31]. Yuan and Li [32] found that 73% of surveyed traffic forecasting papers use inappropriate validation strategies. This study addresses this gap through strict temporal train–test splits.

2.3. Explainable AI in Transportation

The “black box” nature of complex ML models has constrained ITS adoption where transparency is critical [33]. XAI methodologies address this limitation through post hoc interpretability techniques.

SHAP [8], grounded in cooperative game theory, provides consistent and locally accurate feature attributions. Transportation applications include crash severity analysis [34], speed prediction interpretation [35], and route choice modeling [36]. LIME (Local Interpretable Model-agnostic Explanations) [37] offers an alternative through local linear approximations.

Rule extraction [38] provides complementary interpretability through symbolic IF–THEN statements. Hybrid approaches combining statistical XAI with symbolic reasoning demonstrate promise for transportation decision support [39,40].

Bilotta et al. [41] apply gradient and Integrated Gradient techniques to a convolutional bidirectional LSTM model for parking slot prediction, demonstrating through temporal heatmaps how features such as historical occupancy, weather, and traffic flow vary in importance across prediction time steps—an approach the authors note is transferable to almost all ML and AI models in the literature.

However, limited work validates whether XAI explanations align with ground truth. Adebayo et al. [42] showed that some interpretation methods exhibit “sanity failure”, highlighting interpretation without causation. This study addresses validation through feature ablation studies corroborating SHAP findings.

2.4. Smart City Transportation Systems

Smart cities leverage IoT sensors, real-time analytics, and AI for urban service optimization [43]. ITS components include adaptive traffic signal control [44], dynamic route guidance [45], and incident management systems [46].

Fereidooni et al. [47] propose a multi-agent deep reinforcement learning framework for traffic light optimization—comparing single-agent DQN, multi-agent DQN/PPO, and an actuated SMART system—and demonstrate on real-world data from Florence that DRL-based approaches consistently outperform traditional fixed-cycle methods such as Webster’s formula, particularly under heavy traffic conditions.

Istanbul has invested in ITS infrastructure through the Traffic Management Center (UKOME), GPS-based monitoring, and integrated incident reporting [48]. However, predictive analytics remain underutilized operationally [49]. Recent work demonstrates that ML-based incident management systems can reduce response time by 20–40% [50,51].

2.5. Incident Impact Assessment

Incident impact quantification typically uses absolute metrics (delay minutes [52], queue length [53]), or categorical severity (low/medium/high [54]). However, absolute metrics lack comparability across contexts—a 10 min delay is severe on low-traffic roads but negligible on highways. Beyond reactive incident management, V2X-enabled cooperative control mechanisms have recently been proposed for congestion mitigation in mixed traffic environments. For example, Peng et al. [55] demonstrate platoon-based control strategies that proactively alleviate moving bottlenecks. Such prescriptive control frameworks could directly leverage incident impact prediction systems like the one proposed in this study.

Percentage-based normalized metrics [56] enable cross-context comparison but are rarely applied to incident prediction. This study’s ΔTraffic (%) metric addresses this gap, providing intuitive interpretation while enabling fair comparison across diverse locations and times.

2.6. Research Positioning

This study extends prior work by (1) applying proper temporal validation uncommon in traffic prediction; (2) comparing ML against operational baselines; (3) validating XAI through feature ablation; (4) demonstrating spatiotemporal dominance with statistical rigor; and (5) providing practical recommendations for ITS deployment.

Although deep learning- and graph-based models have shown strong performance in traffic prediction, their applicability in real-world incident management systems remains constrained by data availability, computational cost, and deployment complexity. In many operational traffic management centers, decision-making systems still rely on lightweight models that can be trained and deployed efficiently. Therefore, this study intentionally focuses on interpretable and computationally efficient ensemble models while positioning its contribution in evaluation methodology and explainability rather than model complexity.

3. Materials and Methods

Model Selection Justification

The Istanbul traffic dataset exhibits strong nonlinearity, temporal seasonality, heterogeneous categorical features (e.g., location, event type, road segment), and complex interaction effects between meteorological and traffic-related variables. Linear models such as ordinary least squares or ridge regression assume additive and mostly linear relationships, which are unlikely to sufficiently capture these interaction-driven dynamics.

Tree-based ensemble methods, including Random Forest, XGBoost, and LightGBM, are particularly suitable for such structured tabular datasets because (i) they naturally model nonlinear relationships without explicit feature engineering, (ii) they capture high-order feature interactions, (iii) they are robust to multicollinearity, and (iv) they handle mixed-type variables efficiently. Prior empirical evidence in transportation prediction tasks also indicates that gradient boosting frameworks often outperform linear and single-tree models when complex spatial–temporal dependencies exist.

Hyperparameter Optimization Strategy

To ensure fair and reproducible model comparison, hyperparameters were tuned using a time-aware cross-validation scheme. Specifically, the training period (2022–2023) was divided into rolling temporal folds to avoid information leakage. Grid search was applied over key hyperparameters (number of trees, maximum depth, learning rate, minimum child weight, subsample ratio, and column sampling ratio). Model selection was based on average validation RMSE across temporal folds.

Why Not Simpler Baselines Alone?

In addition to machine learning models, historical-average and persistence-based baselines were implemented to provide realistic lower-bound performance references. However, these simpler approaches cannot adapt to incident-specific contextual features or dynamic weather–location interactions. The ensemble models consistently demonstrated superior generalization on the 2024 hold-out test set, particularly under peak congestion and rare incident scenarios, justifying their inclusion despite their increased complexity.

3.1. Problem Formulation

Definition 1 (Traffic Incident).

A traffic incident e is characterized by the tuple e = (type, loc, time, lanes, traffic), where

type ∈ {accident, construction, maintenance, …}: available at prediction time. Incident classification is assigned at the time of the report.

loc = (latitude, longitude): available at prediction time. Incident location is reported immediately upon detection.

time = (day-of-week, hour-of-day), as cyclical encodings: available at prediction time. These are derived directly from the incident timestamp.

lanes = number of closed lanes: available at prediction time in the majority of cases, as lane closure information is typically reported by the first responding unit or inferred from incident type. We acknowledge this may involve a short lag (5–10 min) and note this as a limitation.

traffic = real-time congestion index (0–100): available at prediction time. This is the index recorded at the moment the incident is first logged, not an aggregate over the event period (see Section 3.4).

duration = minutes of incident persistence: NOT available at prediction time.

Note: Incident duration is excluded from the feature tuple because it is not available at prediction time t₀; it is retained in the dataset solely for computation of the target variable ΔTraffic (%).

Given the attributes recorded when an operator opens an incident record—event type, closed lanes, ambient traffic index at t₀, time of day, and geographic cluster—the model estimates the percentage deviation from baseline traffic conditions that the incident will produce.

3.2. Target Variable Construction: Traffic Percent

Definition 2 (Normalized Traffic Impact).

Traffic incident impact is operationalized as the relative change in congestion level attributable to an incident within a defined temporal window, expressed as a normalized percentage deviation in traffic conditions. To ensure full reproducibility and eliminate any possibility of temporal leakage, the normalized impact target (traffic percent) is constructed using past information only.

Disambiguation of traffic_index versus event_traffic_mean:

Two traffic-related variables appear in the dataset and their roles must be clearly distinguished. traffic_index is the network-level congestion index recorded for the one-hour window containing start_time; it reflects ambient conditions at the moment of notification and is fully available at t₀. This variable is used as a model input feature. By contrast, event_traffic_mean is the mean congestion index computed over the full incident window [t_s, t_e], which requires knowledge of finish_time and is therefore not available at t₀. This variable is used exclusively to compute the prediction target ΔTraffic (%) from historical records; it is never passed to the model as an input. This architecture, in which the prediction target is derived from post-incident records while all model inputs remain available at notification time, is consistent with established practice in real-time traffic impact assessment, where outcome variables such as incident clearance delay or accident severity are similarly computed from completed event records [15,51].

Let I(t) denote the hourly traffic index observed at time t, and let an incident occur over the interval [t_s, t_e].

Step 1: Alignment of Incident and Traffic Time Series

Each incident interval is aligned to the hourly traffic index time series by mapping t_s and t_e to the nearest hour timestamps. For multi-hour incidents, the traffic index values within the interval are averaged:

$$
T_{\mathrm{event},i} = \frac{1}{H_i} \sum_{h \in [t_s, t_e]} \mathrm{TrafficIndex}_h \tag{1}
$$

where H_i is the number of hourly observations aligned to incident i.

Step 2: Construction of the Historical Baseline

The baseline T_normal is computed exclusively from historical data prior to the incident date. For each location cluster c, hour-of-day h, and day-of-week d, we compute

$$
T_{\mathrm{normal},i} = \operatorname{mean}\left(\mathrm{TrafficIndex}_{c,d,h}\right) \tag{2}
$$

Importantly, when predicting 2024 incidents, no traffic data from 2024 are used in constructing T_normal. The training period (2022–2023) and testing period (2024) are strictly separated.

Step 3: Final Target Definition

The final target is the percentage deviation of the event-period mean from the historical baseline:

$$
\Delta \mathrm{Traffic}_i\,(\%) = \left[\frac{T_{\mathrm{event},i} - T_{\mathrm{normal},i}}{T_{\mathrm{normal},i}}\right] \times 100 \tag{3}
$$

where T_event is the mean traffic index during the incident and T_normal is the historical baseline for the matching location–time context. This ratio-based normalization controls for recurring diurnal and weekly congestion patterns while guaranteeing that the baseline contains no future information relative to the prediction target.
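The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout (`cluster`, `dow`, `hour`, `index`), the function names, and the toy values are all hypothetical, and the caller is assumed to pass only pre-incident records as `history`.

```python
from statistics import mean

def event_mean(hourly_index, start_hour, end_hour):
    """Eq. (1): average the hourly traffic index over the bins aligned to [t_s, t_e]."""
    values = [v for h, v in hourly_index.items() if start_hour <= h <= end_hour]
    return mean(values)

def historical_baseline(history, cluster, day_of_week, hour):
    """Eq. (2): mean index for the matching (cluster, day-of-week, hour) cell.

    `history` is assumed to hold only records dated before the incident.
    """
    values = [r["index"] for r in history
              if (r["cluster"], r["dow"], r["hour"]) == (cluster, day_of_week, hour)]
    return mean(values)

def delta_traffic_pct(t_event, t_normal):
    """Eq. (3): percentage deviation of the event mean from the baseline."""
    return (t_event - t_normal) / t_normal * 100.0

# Hypothetical toy data: hour -> traffic index on the incident day.
hourly_index = {8: 60.0, 9: 70.0, 10: 50.0}
# Hypothetical historical records for two location clusters.
history = [
    {"cluster": 3, "dow": 1, "hour": 8, "index": 40.0},
    {"cluster": 3, "dow": 1, "hour": 8, "index": 60.0},
    {"cluster": 5, "dow": 1, "hour": 8, "index": 90.0},
]

t_event = event_mean(hourly_index, 8, 9)           # incident spans 08:00-09:00
t_normal = historical_baseline(history, 3, 1, 8)   # same cluster, weekday, hour
delta = delta_traffic_pct(t_event, t_normal)
print(round(delta, 2))
```

In this toy case the event mean is 65 against a baseline of 50, so the incident is associated with traffic roughly 30% above normal. Records from cluster 5 do not enter the baseline because they belong to a different location cell.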

Problem Statement: Given historical incidents D = {e₁, …, eₙ} with observed impacts {Δ₁, …, Δₙ}, learn a function f : e → Δ that accurately predicts the impact of new incidents while maintaining interpretability for operational deployment.

3.3. Dataset Description

Our dataset comprises 38,430 traffic incidents from Istanbul Metropolitan Municipality’s Traffic Management Center (2022–2024), covering 39 administrative districts [57,58].

Incident characteristics:

Spatial: Latitude and longitude (WGS84), covering 5343 km²

Temporal: Start time, finish time, and duration (mean: 394 min, σ: 7498 min)

Impact: Traffic index (0–100 scale), event/normal traffic means, ΔTraffic

Incident type: 83.6% accident notifications, 15.4% maintenance/construction, 1.0% other

Infrastructure: Closed lanes (0–6; mean: 0.84)

Data joining procedure: Incident records (start_time, end_time, lat, lon) were joined with hourly traffic_index via spatial matching (nearest location within 500 m radius) and temporal alignment (hourly bins). Incidents spanning multiple hours used the mean traffic index across hourly bins to compute event_traffic_mean, which serves exclusively as an intermediate quantity for target variable construction and is never used as a model input.
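The spatial part of this join can be illustrated with a great-circle distance check. The sketch below is an assumption about how "nearest location within 500 m" might be implemented; the source does not state which distance formula was used, and the `traffic_points` structure, identifiers, and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, an approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def nearest_within(incident, traffic_points, max_dist_m=500.0):
    """Return the id of the closest traffic point, or None if it is farther than max_dist_m."""
    lat, lon = incident
    best = min(traffic_points, key=lambda p: haversine_m(lat, lon, p["lat"], p["lon"]))
    dist = haversine_m(lat, lon, best["lat"], best["lon"])
    return best["id"] if dist <= max_dist_m else None

# Hypothetical measurement locations.
traffic_points = [
    {"id": "A", "lat": 41.0010, "lon": 29.0000},  # about 111 m north of the first incident
    {"id": "B", "lat": 41.0100, "lon": 29.0000},  # about 1.1 km north of the first incident
]

match_near = nearest_within((41.0000, 29.0000), traffic_points)
match_far = nearest_within((41.1000, 29.0000), traffic_points)
print(match_near, match_far)
```

Incidents with no location inside the radius return `None` here; how such cases were handled in the study, beyond the missing data policy stated in the text, is not specified.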

Missing data policy: Incidents with missing traffic_index (n = 327, 0.85%) were excluded. For incidents with duration_min = 0 (n = 892, 2.3%), the next non-zero hourly reading was used.

Temporal distribution: Incident frequency peaks during weekday rush hours (07:00–09:00 and 17:00–19:00). Seasonally, the incident rate is 12% higher in winter. By day of week, the weekday rate is 35% higher than the weekend rate.

Target variable characteristics: The ΔTraffic (%) target exhibits a highly skewed and heavy-tailed distribution. Mean traffic worsening is +30.01%, substantially exceeding the median of +16.05%, indicating positive skew. Formally, skewness = 3.801 and excess kurtosis = 27.469, confirming heavy-tailed behavior. The standard deviation is 61.7%, with a full range of −99.8% to +1219.7%. The 1st–99th percentile interval spans [−61.6%, +285.1%], and the 5th–95th percentile interval spans [−28.7%, +135.6%]. Negative values (23.5% of incidents) reflect preemptive diversions and psychological deterrence effects. Given this distributional profile, Section 3.4 presents a systematic robustness analysis evaluating model performance under four alternative distributional treatments.

3.4. Feature Engineering

To validate that our conclusions are not artifacts of a specific spatial encoding, we conducted sensitivity analyses across multiple spatial representations.

Spatial Features: Incidents were grouped using K-means clustering on (latitude, longitude) coordinates to identify geographical zones. The number of clusters (k = 20) was selected with the KneeLocator method applied to the elbow curve shown in Figure 1. Each cluster represents a geographically cohesive region with similar traffic characteristics, reducing continuous coordinates to discrete cluster assignments (location_km). We repeated the experiments for K ∈ {10, 20, 30, 40} clusters. Across all configurations, the relative feature importance ranking remained stable, with spatiotemporal context variables consistently dominating incident type variables. Figure 2 shows the clusters on a map of Istanbul, together with the event information.

Although graph-based spatial representations can capture network topology more precisely, they require detailed road network data and complex preprocessing pipelines that are not always available in operational contexts. In this study, K-means clustering is employed as a pragmatic approximation of spatial heterogeneity, enabling scalable and reproducible modeling while maintaining compatibility with real-world traffic management systems.
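To make the spatial encoding concrete, the following sketch runs plain Lloyd iterations on (latitude, longitude) pairs and returns a discrete cluster label per incident. It is an illustration only: the study presumably relied on a library K-means implementation with k = 20, whereas this toy version uses k = 2, random data-point initialization, and hypothetical coordinates.

```python
import random

def kmeans(points, k, n_iter=50, seed=42):
    """Minimal Lloyd's algorithm on 2-D points; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical incident coordinates forming two well-separated zones.
points = [
    (41.00, 28.90), (41.01, 28.91), (40.99, 28.89),
    (41.10, 29.30), (41.11, 29.31), (41.09, 29.29),
]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The resulting label plays the role of the `location_km` feature described above. Treating degrees as planar coordinates is a simplification that is reasonable at city scale but ignores the difference between latitude and longitude spacing.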

Heavy-Tailed Target Robustness: Given the heavy-tailed distribution of ΔTraffic (%) (skewness = 3.801, excess kurtosis = 27.469), we systematically evaluated LightGBM performance under four alternative distributional treatments, each applied to the identical temporal split (train: 2022–2023; test: 2024, n = 13,981). Results are summarized in Table 1.

The standard MSE objective on the original scale achieves the best overall balance (MAE = 28.18%, R² = 0.476). Log-transformation and Winsorization yield comparable MAE (27.71% and 27.80%, respectively, within 0.5 percentage points of the original) but marginally lower R², indicating that reducing the influence of extreme values slightly improves point-wise accuracy while reducing explained variance. Quantile loss (α = 0.5, median regression) is robust to outliers but sacrifices R² (0.398 vs. 0.476), reflecting optimization for the conditional median rather than the mean. Huber loss with aggressive outlier suppression (α = 0.9) degrades substantially (R² = 0.028), confirming that extreme incidents carry genuine signal that should not be discarded, consistent with the severity-stratified findings in Section 4.4.

Crucially, feature importance rankings are highly stable across all distributional treatments. Spearman rank correlation between standard MSE and log-transformed target importance rankings is ρ = 1.000 (p = 0.00), and ρ = 1.000 (p = 0.00) between MSE and quantile loss rankings. This confirms that the central finding—spatiotemporal context dominates incident classification—is not an artifact of the MSE objective or the heavy-tailed distribution, but a robust property of the data. The standard MSE objective is therefore retained as the primary specification.
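The rank-stability check described above can be sketched as a Spearman correlation between two importance vectors. The importance values below are hypothetical placeholders, not results from the study, and the closed-form expression used here assumes there are no tied values.

```python
def ranks(values):
    """Rank values from largest (rank 1) to smallest; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(a, b):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)), valid without ties."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical importances for the same five features under three treatments.
imp_mse = [0.40, 0.25, 0.20, 0.10, 0.05]
imp_log = [0.35, 0.30, 0.18, 0.12, 0.05]    # different magnitudes, same ordering
imp_other = [0.25, 0.40, 0.20, 0.10, 0.05]  # top two features swapped

rho_same = spearman_rho(imp_mse, imp_log)
rho_swapped = spearman_rho(imp_mse, imp_other)
print(rho_same, rho_swapped)
```

A value of 1.0 means the two treatments order the features identically even when the raw importance magnitudes differ, which is the sense in which ρ = 1.000 is reported in the text. Swapping a single adjacent pair among five features already lowers the coefficient to 0.9.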

Temporal Features: Cyclical encoding preserves temporal continuity:

$$
\begin{aligned}
\mathrm{day\_sin} &= \sin(2\pi \times \mathrm{day\_of\_week} / 7) \\
\mathrm{day\_cos} &= \cos(2\pi \times \mathrm{day\_of\_week} / 7) \\
\mathrm{hour\_sin} &= \sin(2\pi \times \mathrm{hour\_of\_day} / 24) \\
\mathrm{hour\_cos} &= \cos(2\pi \times \mathrm{hour\_of\_day} / 24)
\end{aligned} \tag{4}
$$

This ensures that adjacent times (e.g., 23:00 and 00:00) are represented as similar in feature space.
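A short sketch of Equation (4) shows why the encoding keeps the midnight wrap-around continuous. The helper names are illustrative, and day_of_week is assumed to be an integer in 0–6 and hour_of_day an integer in 0–23.

```python
import math

def cyclical_features(day_of_week, hour_of_day):
    """Eq. (4): map day-of-week and hour-of-day onto the unit circle."""
    return {
        "day_sin": math.sin(2 * math.pi * day_of_week / 7),
        "day_cos": math.cos(2 * math.pi * day_of_week / 7),
        "hour_sin": math.sin(2 * math.pi * hour_of_day / 24),
        "hour_cos": math.cos(2 * math.pi * hour_of_day / 24),
    }

def hour_distance(h1, h2):
    """Euclidean distance between two hours in (hour_sin, hour_cos) space."""
    a, b = cyclical_features(0, h1), cyclical_features(0, h2)
    return math.hypot(a["hour_sin"] - b["hour_sin"], a["hour_cos"] - b["hour_cos"])

d_wrap = hour_distance(23, 0)       # across midnight
d_adjacent = hour_distance(11, 12)  # ordinary neighbouring hours
d_opposite = hour_distance(0, 12)   # half a day apart
print(round(d_wrap, 3), round(d_adjacent, 3), round(d_opposite, 3))
```

With a raw integer hour, 23:00 and 00:00 would be 23 units apart. In the encoded space they are exactly as close as any other pair of neighbouring hours, while hours twelve apart sit at opposite ends of the circle.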

Incident Features: The categorical incident type (6 categories) is one-hot encoded, which avoids the inappropriate ordinal assumptions of label encoding. Additional features: closed_lanes, duration_min.

Traffic Context: Real-time traffic_index at the beginning of the incident captures the baseline congestion level.

Final feature set (14 features): 8 base features + 6 one-hot encoded event categories.

3.5. Temporal Validation Strategy

We implement a rolling-origin cross-validation scheme to prevent temporal leakage and provide realistic performance estimates. Unlike random train–test splits that violate temporal ordering [29], our approach ensures that models are always tested on genuinely future data.

Validation Scheme:

Fold 1: Train on 2022 (n = 11,801) → test on 2023 (n = 12,648)

Fold 2: Train on 2022–2023 (n = 24,449) → test on 2024 (n = 13,981)

This design mimics operational deployment where models trained on historical data must predict future incidents. The expanding window evaluates both data efficiency (Fold 1: 1 year training) and asymptotic performance (Fold 2: 2 years training).
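The expanding-window scheme can be expressed as a small fold generator. This is a sketch under the assumption that each record carries a `year` field; the field name and toy records are hypothetical.

```python
def rolling_origin_folds(records, years):
    """Build (train, test) pairs with an expanding training window.

    For the i-th fold, training uses all years before years[i] and
    testing uses years[i] only, so test data is always in the future.
    """
    folds = []
    for i in range(1, len(years)):
        train_years, test_year = set(years[:i]), years[i]
        train = [r for r in records if r["year"] in train_years]
        test = [r for r in records if r["year"] == test_year]
        folds.append((train, test))
    return folds

# Hypothetical incident records, one per row.
records = [{"id": i, "year": y}
           for i, y in enumerate([2022, 2022, 2023, 2023, 2023, 2024])]

folds = rolling_origin_folds(records, [2022, 2023, 2024])
for train, test in folds:
    print(sorted({r["year"] for r in train}), "->", sorted({r["year"] for r in test}))
```

With three calendar years this yields exactly the two folds listed above: 2022 to 2023, then 2022–2023 to 2024. Any preprocessing fitted on data, such as cluster centroids or baseline tables, would need to be refit inside each fold on the training portion only to keep the same leakage guarantee.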

Rationale: Recent work demonstrates that random cross-validation in time series systematically overestimates performance by 15–30% [11,29]. Temporal validation provides operationally realistic estimates critical for deployment decisions. By sorting all data chronologically and splitting strictly by year, we ensure the following:

No future information leakage: Test incidents occur strictly after all training incidents.

Realistic generalization: Models face genuine distribution shift between training and deployment.

Conservative estimates: Performance reflects real-world degradation over time.

For operational baselines, we group training data by (location cluster, hour-of-day) and use the historical mean ΔTraffic for matching test incidents, falling back to the global mean for unseen combinations.

3.6. Uncertainty Quantification via Bootstrap

To quantify prediction uncertainty beyond point estimates, we employ bootstrap resampling [59] with 1000 iterations per model–fold combination. For each iteration, we resampled the test set with replacement (n = |test set|), calculated MAE, RMSE, and R² on the resampled data, and stored the metrics. We report 95% confidence intervals using the percentile method: [2.5th percentile, 97.5th percentile]. This non-parametric approach makes no distributional assumptions and accounts for heteroscedasticity in prediction errors.

Implementation Details:

The random seed is fixed at 42 for reproducibility. Resampling is stratified by impact severity bins to preserve distributional properties. The computational cost is 1000 iterations × 6 model–fold combinations, approximately 15 min on a standard workstation.

Bootstrap provides robust uncertainty estimates even with non-normal error distributions, addressing the heavy-tailed nature of ΔTraffic (range: −99.8% to +1219.7%). Narrow confidence intervals indicate stable predictions not dependent on specific test set sampling.
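A percentile bootstrap for MAE might look like the sketch below. It is a simplified illustration: it resamples the test set uniformly rather than within severity strata as described above, it uses a nearest-rank percentile rule, and the synthetic targets and predictions are hypothetical.

```python
import random

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_mae_ci(y_true, y_pred, n_boot=1000, seed=42, level=0.95):
    """Percentile bootstrap CI for MAE (plain resampling, no stratification)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement, size n
        stats.append(mae([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    tail = (1 - level) / 2
    lower = stats[int(tail * (n_boot - 1))]
    upper = stats[int((1 - tail) * (n_boot - 1))]
    return lower, upper

# Hypothetical test set: heavy spread in targets, noisy predictions.
data_rng = random.Random(0)
y_true = [data_rng.gauss(30.0, 60.0) for _ in range(200)]
y_pred = [t + data_rng.gauss(0.0, 25.0) for t in y_true]

point = mae(y_true, y_pred)
lower, upper = bootstrap_mae_ci(y_true, y_pred)
print(round(lower, 2), round(point, 2), round(upper, 2))
```

Because the seed is fixed, repeated calls return the same interval. Adding stratification would mean drawing indices separately within each severity bin and concatenating them before computing the metric.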

3.7. Statistical Significance Testing with Multiple Comparison Correction

We conduct pairwise Wilcoxon signed-rank tests [60] comparing absolute error distributions across models. To control the family-wise error rate (FWER) for multiple comparisons, we apply the Bonferroni correction:

$$
\alpha_{\mathrm{corrected}} = \alpha / m \tag{5}
$$

where m = 6 comparisons, giving α_corrected ≈ 0.0083. We additionally apply the Holm–Bonferroni procedure, a sequential testing method that controls the FWER while maintaining higher power than Bonferroni [61].

Hypothesis Testing Framework:

The null hypothesis (H₀) is that the median absolute error of Model A equals that of Model B; the alternative (H₁) is that the median absolute errors differ. We reject H₀ at the corrected significance level and report both Bonferroni and Holm results for transparency.

Pairwise Comparisons:

Using the largest test set (Fold 2, n = 13,981), we perform 6 comparisons: 3 for ML vs. Baseline (RF vs. Baseline, XGBoost vs. Baseline, LightGBM vs. Baseline) and 3 for inter-ML (LightGBM vs. XGBoost, LightGBM vs. RF, XGBoost vs. RF).

Rationale:

Multiple testing without correction inflates Type I error. With m = 6 tests at α = 0.05, the probability of at least one false positive is ≈ 26% under the global null (assuming independent tests). Bonferroni reduces this to ≤5% while maintaining interpretability. The Holm procedure provides additional power when early hypotheses show strong significance.
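The family-wise error figures quoted above follow from a short calculation. The sketch below assumes the six tests are independent, which is the usual simplification behind the 26% figure; the variable names are illustrative.

```python
alpha = 0.05  # per-test significance level
m = 6         # number of pairwise comparisons

# Probability of at least one false positive across m independent tests.
fwer_uncorrected = 1 - (1 - alpha) ** m

# Bonferroni: test each hypothesis at alpha / m instead.
alpha_bonferroni = alpha / m
fwer_bonferroni = 1 - (1 - alpha_bonferroni) ** m

print(f"Uncorrected FWER: {fwer_uncorrected:.3f}")  # about 0.265
print(f"Bonferroni alpha: {alpha_bonferroni:.4f}")  # 0.0083
print(f"Bonferroni FWER:  {fwer_bonferroni:.3f}")   # just under 0.05
```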

3.8. Baseline Models

To establish practical improvement margins and provide context for ML model performance, we implement a historical average baseline that represents a realistic operational system that traffic management centers could deploy without machine learning. The historical average baseline leverages spatiotemporal patterns by grouping training data by (location cluster, hour-of-day) pairs and predicting

\widehat{\Delta \mathrm{Traffic}}_{i} = \begin{cases} \mathrm{mean}(\Delta \mathrm{Traffic}_{location_{i}, hour_{i}}) & \text{if the combination exists in training} \\ \mathrm{mean}(\Delta \mathrm{Traffic}_{train}) & \text{otherwise (fallback)} \end{cases}

(6)

where location_i is the K-means cluster assignment and hour_i is the hour-of-day for test incident i.

This baseline captures spatial heterogeneity (different locations have different typical impacts), temporal patterns (rush hour incidents differ from off-peak incidents), and operational feasibility (simple lookup table implementable in existing systems). The fallback to global mean for unseen (location/hour) combinations (approximately 5–10% of test cases) ensures the baseline always produces predictions. This baseline represents the performance floor that any ML model must surpass to justify deployment complexity.
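Equation (6) amounts to a lookup table keyed by (location cluster, hour) with a global-mean fallback. The following is a minimal sketch of that idea; the function names and the toy rows are hypothetical and only illustrate the mechanics.

```python
from collections import defaultdict

def fit_historical_average(train_rows):
    """Build the lookup table from (cluster, hour, delta_traffic) tuples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    total, n = 0.0, 0
    for cluster, hour, y in train_rows:
        sums[(cluster, hour)] += y
        counts[(cluster, hour)] += 1
        total += y
        n += 1
    table = {key: sums[key] / counts[key] for key in sums}
    global_mean = total / n
    return table, global_mean

def predict_historical_average(table, global_mean, cluster, hour):
    """Return the cell mean, or the global training mean if the cell is unseen."""
    return table.get((cluster, hour), global_mean)

# Toy training data (made-up values, for illustration only).
train = [(0, 8, 40.0), (0, 8, 60.0), (1, 3, -10.0)]
table, global_mean = fit_historical_average(train)

seen = predict_historical_average(table, global_mean, cluster=0, hour=8)
unseen = predict_historical_average(table, global_mean, cluster=2, hour=17)
print(seen, unseen)  # cell mean for (0, 8), then the global-mean fallback
```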

3.9. Machine Learning Models

Hyperparameters were selected via 3-fold time-series cross-validation on the training set (2022 train; 2023 validation). Grid search ranges: max_depth ∈ {8, 10, 12, 15}, learning_rate ∈ {0.01, 0.05, 0.1}. The final configurations reported minimize validation RMSE.
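The key property of time-ordered validation is that every training window ends before its test window begins. The sketch below illustrates an expanding-window (rolling-origin) split on calendar-year boundaries, matching the two evaluation folds stated in the abstract, together with an enumeration of the search grid. How the 3-fold tuning split inside the training data was built is not specified, so the helper names and the row schema here are assumptions.

```python
from itertools import product

def rolling_origin_splits(years):
    """Yield (train_years, test_year) pairs with an expanding training window."""
    ordered = sorted(set(years))
    for i in range(1, len(ordered)):
        yield ordered[:i], ordered[i]

def split_rows(rows, train_years, test_year):
    """Partition rows (dicts with a 'year' key; hypothetical schema)."""
    train = [r for r in rows if r["year"] in train_years]
    test = [r for r in rows if r["year"] == test_year]
    return train, test

folds = list(rolling_origin_splits([2022, 2023, 2024]))
for train_years, test_year in folds:
    print(f"train on {train_years}, test on {test_year}")

# Candidate (max_depth, learning_rate) pairs from the stated grid.
grid = list(product([8, 10, 12, 15], [0.01, 0.05, 0.1]))
print(len(grid), "configurations")
```

Because no test year ever precedes a training year, information from the future cannot leak into model fitting.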

Random Forest (RF) [5]: Ensemble of 300 decision trees with bootstrap aggregation. The configuration values are n_estimators = 300, max_depth = 12, and random_state = 42. The advantages are robustness to overfitting, interpretable feature importance, and rule extraction.

XGBoost [6]: Gradient boosting with L1/L2 regularization. The configuration values are n_estimators = 400, max_depth = 8, learning_rate = 0.05, and subsample = 0.8. The advantages are high accuracy, native missing data handling, and efficient training.

LightGBM [7]: Gradient-based one-side sampling with leaf-wise growth. The configuration values are n_estimators = 400, max_depth = −1, learning_rate = 0.05, and subsample = 0.8. The advantages are faster training, superior categorical feature handling, and memory efficiency.

Evaluation metrics for ML models are given below:

Mean Absolute Error (MAE): Average absolute prediction error in percentage points.

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} | y_{i} - \hat{y}_{i} |

(7)

Root Mean Squared Error (RMSE): Emphasizes large errors.

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}

(8)

R² Score: Proportion of variance explained.

R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}}

(9)

where y_i represents actual values, ŷ_i denotes predictions, ȳ is the mean actual value, and n is the sample size.
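Equations (7) to (9) translate directly into code. The sketch below uses made-up values purely to exercise the formulas; it also shows that a constant predictor scores R² = 0 only when it equals the mean of the evaluated targets, and goes negative otherwise.

```python
import math

def mae(y, y_hat):
    """Equation (7): mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Equation (8): root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def r2_score(y, y_hat):
    """Equation (9): 1 - SS_res / SS_tot, with SS_tot around the mean of y."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Toy values (illustrative only).
y = [10.0, 20.0, 30.0, 40.0]
y_hat = [12.0, 18.0, 33.0, 37.0]

print(mae(y, y_hat), rmse(y, y_hat), r2_score(y, y_hat))

# Constant predictors: the mean of y gives R^2 = 0; any other constant is negative.
r2_at_mean = r2_score(y, [25.0] * 4)
r2_off_mean = r2_score(y, [20.0] * 4)
print(r2_at_mean, r2_off_mean)
```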

3.10. Explainability Framework

SHAP Analysis [8]: Computes Shapley values, quantifying each feature’s contribution.

\phi_{i} = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_{S}(x_{S}) \right]

(10)

Here, ϕ_i is the SHAP value for feature i, S is a feature subset, and F is the complete feature set. SHAP values were computed on n = 1000 test samples drawn using stratified random sampling (proportional to event type distribution), with random seed 42 for reproducibility. We employ TreeExplainer (optimized for tree models, using the fast path-dependent algorithm with feature_perturbation = ‘tree_path_dependent’) on these samples to compute global feature importance using summary plots and feature interactions using dependence plots. To validate the SHAP findings, we carried out a feature ablation study in which we systematically removed feature groups {Spatial, Temporal, Incident, Traffic} and retrained the models, measuring R² degradation. This provides empirical confirmation of feature importance independently of SHAP. The Wilcoxon signed-rank test [60] compares model error distributions to assess whether performance differences are statistically significant (α = 0.05).
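To make Equation (10) concrete, the following sketch evaluates it by brute force over all feature subsets for a three-feature toy model. It is not the TreeExplainer algorithm, which avoids this exponential enumeration; the toy model, the zero baseline used for "absent" features, and the function names are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values per Equation (10); value_fn maps a frozenset S to f_S(x_S)."""
    phi = []
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            # Weight |S|! (|F| - |S| - 1)! / |F|! for subsets of this size.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi.append(total)
    return phi

# Toy model with one interaction term (hypothetical).
def model(x0, x1, x2):
    return 2 * x0 + 3 * x1 + x0 * x2

x = (1.0, 2.0, 4.0)         # instance to explain
baseline = (0.0, 0.0, 0.0)  # values substituted for features outside S

def value_fn(subset):
    """Evaluate the model with features outside `subset` set to the baseline."""
    args = [x[j] if j in subset else baseline[j] for j in range(3)]
    return model(*args)

phi = shapley_values(value_fn, 3)
print(phi)  # the x0*x2 interaction is shared equally between features 0 and 2
print(sum(phi), value_fn(frozenset({0, 1, 2})) - value_fn(frozenset()))
```

The final line checks the efficiency property: the attributions sum to the difference between the full prediction and the baseline prediction.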

3.11. Deployment and Computational Environment

All experiments were executed on a workstation equipped with an Intel Core i7-12700K CPU (12 cores, 20 threads, 3.6 GHz base frequency) and 16 GB RAM, running Windows 11. Model training was performed primarily on CPU for tree-based models (Random Forest, XGBoost, and LightGBM), while GPU acceleration was optionally enabled for XGBoost and LightGBM during hyperparameter tuning experiments. The implementation was carried out in Python 3.12.10 using scikit-learn (v1.8.0), XGBoost (v3.2.0), LightGBM (v4.6.0), and SHAP (v0.50.0). All experiments were executed with a fixed random seed of 42 to ensure reproducibility. The average training time per model, as shown in Table 2, ranged between 18 and 95 s depending on model complexity and hyperparameter configuration. Inference time per incident sample was below 5 milliseconds, indicating suitability for near-real-time deployment scenarios.

4. Results

4.1. Rolling-Origin Performance Evaluation

Table 3 presents model performance across two temporal folds, reporting mean ± standard deviation to quantify temporal stability. Standard deviations are modest relative to means (coefficient of variation < 12% for all models), indicating consistent performance across time periods. All models show performance degradation from Fold 1 to Fold 2 (e.g., LightGBM: R² = 0.527 → 0.476), reflecting increased difficulty when test data is further in time from training data. This is expected under out-of-time evaluation and supports the realism of our setup. ML models achieve a 354–406% relative improvement in R² over the historical baseline (mean performance). Even with conservative temporal validation, the operational value added is substantial. All statistical comparisons in Section 4.3 use the Historical Average baseline as the reference, as it represents realistic operational performance. Comparisons against the Global Mean baseline would show even stronger significance but would not reflect practical deployment scenarios. LightGBM exhibits the best mean performance and acceptable stability, making it the recommended model for deployment consideration.

To address concerns regarding model selection justification, we additionally report Fold 2 performance for two non-ensemble alternatives evaluated under identical temporal validation. A Single Decision Tree (max depth = 8) achieves R² = 0.332, demonstrating that the base learner alone captures meaningful signal but falls 14.4 percentage points short of LightGBM (R² = 0.476), confirming that ensemble aggregation provides substantial, non-trivial gains. Ridge Regression with degree-2 polynomial interaction terms achieves R² = 0.155, substantially below ensemble models, which confirms that the nonlinear traffic–impact relationship cannot be adequately captured by linear assumptions, even with explicit feature interactions. Together, these results empirically justify the choice of ensemble methods for this task.

4.2. Bootstrap Confidence Intervals

Table 4 reports 95% bootstrap confidence intervals for the 2024 test set (Fold 2, n = 13,981), our primary out-of-sample evaluation. Confidence interval widths are operationally meaningful: LightGBM MAE uncertainty is ±0.64 percentage points (±2.3% relative), indicating predictions are stable across resampling. This narrow range supports decision-making. The baseline MAE CI [34.32%, 35.94%] does not overlap with the LightGBM CI [27.54%, 28.81%], providing additional evidence beyond significance testing that the improvement is real and robust. An R² CI width of 0.072 indicates ≈ 15% relative uncertainty in explained variance. While models explain ≈ 48% of variance, we are confident this is not a sampling artifact. XGBoost and RF show overlapping R² confidence intervals ([0.411, 0.483] vs. [0.401, 0.468]), supporting the later finding that they are statistically equivalent.

4.3. Corrected Significance Testing Results

Table 5 presents the results of 6 pairwise Wilcoxon signed-rank tests with Bonferroni and Holm–Bonferroni corrections. All ML vs. Baseline comparisons remain highly significant. LightGBM vs. Baseline is p = 3.35 × 10⁻¹⁴³ (p < 0.001 even after correction), XGBoost vs. Baseline is p = 7.48 × 10⁻¹¹⁴ (p < 0.001), and RF vs. Baseline is p = 1.61 × 10⁻¹²² (p < 0.001). These extremely small p-values indicate that ML improvements over historical baselines are not due to chance. The evidence for operational value is overwhelming. ML inter-model comparisons differ: LightGBM vs. RF is p = 1.18 × 10⁻⁷ (significant: Bonferroni: Yes; Holm: Yes); XGBoost vs. RF is p = 0.348 (NOT significant: Bonferroni: No; Holm: No); and LightGBM vs. XGBoost is p = 5.26 × 10⁻⁶ (significant: Bonferroni: Yes; Holm: Yes).

XGBoost and RF show no statistically significant difference after multiple comparison correction. This suggests that the two models are statistically indistinguishable on this test set, that model selection between them can prioritize other factors (training time, interpretability), and that the reported superiority of XGBoost over RF (R² 0.448 vs. 0.437) may not be robust.

LightGBM vs. XGBoost shows p = 5.26 × 10⁻⁶, which remains significant after Bonferroni correction (α_corrected = 0.0083). Bonferroni correction, while conservative, therefore did not change the conclusions here. More generally, correction guards against inflated false discoveries when several comparisons are made.
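The two correction procedures can be applied to the six p-values reported above in a few lines. This is a sketch for illustration: the p-values are copied from the text, while the function names and the dictionary layout are assumptions.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0 when p <= alpha / m."""
    m = len(pvals)
    return {name: p <= alpha / m for name, p in pvals.items()}

def holm(pvals, alpha=0.05):
    """Holm step-down: compare the k-th smallest p to alpha / (m - k + 1)."""
    m = len(pvals)
    reject = {name: False for name in pvals}
    ordered = sorted(pvals.items(), key=lambda kv: kv[1])
    for rank, (name, p) in enumerate(ordered):  # rank = 0 for the smallest p
        if p > alpha / (m - rank):
            break  # stop at the first non-rejection; all larger p are retained
        reject[name] = True
    return reject

# p-values as reported in this section.
pvals = {
    "LightGBM vs Baseline": 3.35e-143,
    "XGBoost vs Baseline": 7.48e-114,
    "RF vs Baseline": 1.61e-122,
    "LightGBM vs RF": 1.18e-7,
    "XGBoost vs RF": 0.348,
    "LightGBM vs XGBoost": 5.26e-6,
}

bonf = bonferroni(pvals)
holm_res = holm(pvals)
for name in pvals:
    print(f"{name}: Bonferroni={bonf[name]}, Holm={holm_res[name]}")
```

With these inputs both procedures reject the same five hypotheses and retain only XGBoost vs. RF, in line with the decisions stated in the text.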

4.4. Performance by Impact Severity

Table 6 analyzes prediction performance across five impact severity bins, revealing substantial heterogeneity in model reliability. By severity, 54.6% of incidents are low to medium impact (0–50% increase), 14.6% are high impact (50–100% increase), 9.5% are extreme impact (>100% increase), and 19.3% are negative impact (traffic improvement).

MAE increases systematically with severity, from 12.91% (low) to 95.91% (extreme). This roughly 7× error amplification indicates that extreme events are fundamentally different from typical incidents. While overall R² is positive (0.476), within-bin R² values are negative, indicating high heterogeneity within severity categories. The variance of prediction errors exceeds baseline variance within each bin, though pooled across bins, the model adds value. Models are reliable (MAE < 40%) for ≈85% of incidents (all but the extreme category). For the remaining 15%, predictions should be treated as rough lower bounds rather than precise estimates. The model handles negative impacts (traffic improvement) with MAE = 25.13%, suggesting that it captures preemptive routing and deterrence effects moderately well.

These results suggest three operational measures. Graduated deployment: begin with low–medium severity predictions, where reliability is high. Extreme incident protocols: for incidents predicted to cause >100% impact, use conservative upper bounds and activate maximum response regardless of the exact prediction. Feature augmentation: extreme incidents likely involve unmeasured factors (weather, cascading failures, special events) and therefore require additional data sources. Extreme incidents may also involve network-wide cascading effects not captured by local features, rare event types underrepresented in training data (1322/38,430 = 3.4%), and nonlinear threshold effects where traffic transitions from manageable to gridlock.
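A per-bin error breakdown of the kind reported in Table 6 can be sketched as follows. The bin edges below follow the four categories named in the text (negative, 0–50%, 50–100%, >100%); the treatment of boundary values, the bin labels, and the toy numbers are assumptions, and the paper's five-bin scheme may be finer.

```python
def severity_bin(delta_traffic):
    """Map an observed impact (%) to a severity label (assumed bin edges)."""
    if delta_traffic < 0:
        return "negative"
    if delta_traffic < 50:
        return "low-medium"
    if delta_traffic <= 100:
        return "high"
    return "extreme"

def mae_by_bin(y_true, y_pred):
    """Group absolute errors by the severity bin of the true value."""
    errors = {}
    for t, p in zip(y_true, y_pred):
        errors.setdefault(severity_bin(t), []).append(abs(t - p))
    return {label: sum(e) / len(e) for label, e in errors.items()}

# Toy observations and predictions (made-up values).
y_true = [-20.0, 10.0, 30.0, 70.0, 150.0, 400.0]
y_pred = [-5.0, 15.0, 20.0, 50.0, 90.0, 180.0]

per_bin = mae_by_bin(y_true, y_pred)
for label, value in per_bin.items():
    print(f"{label}: MAE = {value:.1f}")
```

Binning on the observed value is suitable for retrospective analysis; at prediction time only the predicted value is available, so an operational rule would have to bin on that instead.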

4.5. Error Analysis by Subgroups

Table 7 analyzes prediction errors across operational categories using LightGBM, providing several insights:

Time-of-day vulnerability: Night incidents (00:00–06:00) exhibit 61% higher error (MAE = 41.36%) than afternoon incidents (MAE = 25.68%), reflecting sparse training data and fundamentally different traffic dynamics.

Event type similarity: Error variation across incident types (MAE range 10.54–58.09%) indicates that incident classification contributes to predictive difficulty.

Location heterogeneity: Error variation across location clusters (range 16.92–32.84% MAE) indicates that spatial context dominates prediction quality.

High-traffic resilience: Counter-intuitively, high-traffic periods show comparable error to low-traffic periods, suggesting that models capture congestion dynamics effectively.

4.6. Feature Ablation Study

Table 8 quantifies the contribution of each feature group through systematic removal and retraining. The key findings are as follows:

Temporal dominance: Removing time features degrades R² by 0.311 (65.2% of full model’s explanatory power), confirming hour-of-day and day-of-week as critical predictors.

Traffic context importance: Baseline traffic_index contributes 0.281 R² (59.1%), validating that the current congestion level is the second-most important factor.

Spatial contribution: Location clusters contribute 0.159 R² (33.4%), indicating that geographic heterogeneity matters but less than temporal patterns.

Incident type irrelevance: Removing incident type features degrades R² by 0.006 (1.3%), confirming that these features carry minimal predictive signal.

The SHAP global ranking is independently corroborated by the feature ablation study (Table 8). Temporal features rank first in both SHAP (42% of total contribution) and ablation (ΔR² = −0.311, 65.2% contribution), while incident type ranks last in both analyses. This cross-method convergence provides empirical validation of SHAP consistency beyond descriptive reporting.

4.7. SHAP Feature Importance Analysis

Figure 3 summarizes global feature importance via mean absolute SHAP values for the LightGBM model (1000 test samples). The traffic_index feature (current congestion) shows a strong positive association: higher baseline traffic is associated with amplified impact. Spatiotemporal context (hour_cos + hour_sin + location_km + day_cos) accounts for 42% of total SHAP contribution, while incident type contributes marginally. SHAP values were computed separately for both temporal folds (Fold 1: 2023 test set; Fold 2: 2024 test set). The top five feature ranking by mean |SHAP| was identical across both folds (including hour_cos, hour_sin, location_km, and day_cos), with Spearman rank correlation ρ = 0.97 between the two fold rankings. This confirms that SHAP-based importance is stable over time and not an artifact of a specific test period. The SHAP ranking matches the ablation study results (Table 8), suggesting that temporal features are most important, followed by traffic context and then spatial features, with incident features being least important. This cross-method agreement strengthens interpretability confidence.

Figure 4 presents a SHAP dependence plot for hour_cos (temporal feature interaction), colored by traffic_index, revealing time-of-day impact patterns. Evening rush hours exhibit strongly positive SHAP values, while late night shows negative contributions. The hour_cos × traffic_index interaction reveals that the amplification effect of high baseline congestion is strongest during evening rush hours (17:00–20:00) and remains near zero during night hours, consistent with the saturation threshold discussed in Section 4.3. These interaction patterns are mechanistically interpretable and operationally actionable.
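For readers interpreting the hour_cos axis, the sketch below assumes that hour_sin and hour_cos follow the common sine/cosine encoding with a 24-hour period; the exact definition is not restated in this section, so this is an assumption rather than the authors' confirmed formula. The function names are illustrative.

```python
import math

def encode_hour(hour):
    """Cyclical encoding of hour-of-day (assumed 24-hour period)."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def encoded_distance(h1, h2):
    """Euclidean distance between two hours in (sin, cos) space."""
    s1, c1 = encode_hour(h1)
    s2, c2 = encode_hour(h2)
    return math.hypot(s1 - s2, c1 - c2)

for hour in (0, 3, 6, 12, 18, 23):
    hour_sin, hour_cos = encode_hour(hour)
    print(f"{hour:02d}:00  hour_sin={hour_sin:+.3f}  hour_cos={hour_cos:+.3f}")

# 23:00 and 00:00 are neighbours on the circle, unlike on a raw 0-23 scale.
print(encoded_distance(23, 0), encoded_distance(0, 6))
```

Under this encoding hour_cos takes the same value at 06:00 and 18:00, so a dependence plot on hour_cos alone cannot separate morning from evening; hour_sin carries that distinction.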

Figure 5 displays the traffic_index (baseline traffic amplification) dependence, demonstrating a nonlinear impact relationship. A transition around traffic_index = 30 marks a boundary between resilient (below) and fragile (above) traffic states. This threshold enables graduated intervention protocols: an aggressive response with preemptive adjustments above the threshold, and standard protocols below it.

5. Discussion

This study is positioned as a rigorously validated impact estimation study for a well-defined nowcasting task: given the notification-time attributes available when an operator opens an incident record, estimate the percentage deviation from baseline traffic conditions that the incident will produce. The results demonstrate that notification-time features alone are sufficient for statistically significant impact estimation. The translation of these findings into a real-time deployment system would require additional engineering work—including integration with live data feeds, latency constraints, and prospective operational validation—which lies beyond the scope of the present study and is identified as a direction for future research.

While recent studies have employed deep learning and graph-based models, direct comparison with such architectures was not the primary objective of this study. The focus here is on evaluating the practical value of interpretable and computationally efficient models under realistic temporal validation settings. Notably, prior research has shown that performance gains reported by complex models can be significantly reduced when strict temporal evaluation protocols are applied. Therefore, this study complements existing work by emphasizing evaluation realism and explainability rather than architectural sophistication.

This study investigated the prediction of traffic incident impacts using machine learning models under temporally consistent evaluation and explainable analysis. The results provide several insights into the contextual nature of traffic incident effects and the methodological implications of evaluation strategies in traffic prediction research.

Firstly, the findings demonstrate that temporal validation significantly affects model performance. Models evaluated using random train–test splits exhibited substantially higher accuracy compared to temporally separated validation. This result confirms that temporal leakage can lead to overly optimistic performance estimates in traffic prediction tasks. Therefore, the study highlights the necessity of temporally consistent evaluation frameworks to ensure realistic and operationally meaningful performance assessment.

Secondly, the comparison between machine learning models and historical baselines reveals that the performance gains of advanced models are more modest under realistic temporal settings than commonly reported in the literature. While ensemble learning models outperform baseline approaches, the margin of improvement varies across temporal and spatial contexts. This finding suggests that the value of machine learning in traffic incident prediction should be interpreted relative to operational baselines rather than in isolation.

Thirdly, an explainable AI analysis indicates that the spatiotemporal context plays a dominant role in determining the traffic incident impacts. Features related to time-of-day, day-of-week, location, and baseline traffic conditions consistently exhibit higher importance than incident-specific attributes such as incident type or severity. This observation is further supported by feature ablation experiments, which confirm that contextual variables account for a substantial portion of predictive power. These results imply that traffic incident impacts are not solely determined by incident characteristics but are strongly conditioned by the surrounding traffic environment.

Fourthly, the convergence between SHAP-based explanations and empirical ablation analysis enhances the robustness of interpretability findings. Unlike many prior studies that rely solely on post hoc explanation methods, this study validates interpretability results through independent empirical testing. This methodological approach strengthens the reliability of feature importance interpretations in traffic prediction tasks.

Finally, the proposed normalized impact metric ΔTraffic (%) enables the meaningful comparison of incident effects across heterogeneous spatial and temporal conditions. By normalizing traffic deviations relative to expected baseline conditions, the metric reduces bias arising from varying traffic volumes and structural differences across locations. This contributes to a more consistent and comparable representation of traffic incident impacts.

Overall, the findings suggest that methodological rigor, evaluation realism, and explainability are as critical as model complexity in traffic incident prediction. The study thus provides evidence that interpretable ensemble models, when evaluated under realistic temporal conditions, can offer robust and operationally feasible solutions for intelligent transportation systems.

5.1. Spatiotemporal Context Dominates Incident Classification

Across multiple spatial encodings and robustness checks, spatiotemporal context consistently exhibits substantially greater explanatory contribution than incident classification features. This finding contradicts conventional traffic management paradigms prioritizing incident classification for response protocols.

Quantitative evidence: In the SHAP analysis, temporal and spatial features together contribute 73% vs. incident type’s 3%.

Ablation study: Temporal features contribute 65.2% vs. incident type’s 1.3%.

Error analysis: Incident type variation is higher than location variation.

Mechanistic explanation: An identical accident produces vastly different impacts—negligible at 03:00 on peripheral roads (low traffic baseline, available diversions), catastrophic at 18:00 on central bottlenecks (saturated capacity, no alternatives). Context-dependency necessitates a fundamental reconsideration of severity classification systems.

The practical implications include the use of dynamic incident prioritization to replace static type-based protocols, spatiotemporal vulnerability maps for proactive resource positioning, context-aware routing incorporating location–hour risk profiles, and adaptive signal control, preemptively adjusting near high-risk clusters during predicted high-impact periods.

5.2. Importance of Proper Temporal Validation

Comparison with random-split approaches reveals critical validation differences. Random splitting artificially inflates R² by 26.7% due to data leakage: models learn from “future” incidents when predicting “past” incidents. This absolute R² difference of 11.6 percentage points represents a substantial overestimation that could mislead operational deployment decisions.

Lessons for ITS research: Traffic forecasting must use temporal validation to provide realistic performance estimates. Random cross-validation, while statistically valid for i.i.d. data, systematically overestimates time series prediction accuracy.

5.3. Negative Traffic Deviations: Preemptive Routing

Approximately 23.5% of incidents exhibited negative ΔTraffic values (traffic improvement), initially counterintuitive but reflecting sophisticated adaptive routing behavior:

Preemptive diversions: Navigation apps predict congestion before physical manifestation, rerouting traffic proactively.

Psychological deterrence: Drivers avoid known incident locations even when alternative routes are slower.

Braess paradox: Lane closures removing problematic merge points can improve overall flow [62].

Temporal displacement: Sparse baseline traffic during off-hours enables complete diversions without bottleneck creation.

This phenomenon highlights that modern incident impact constitutes a complex sociotechnical system where human behavior, algorithmic routing, and physical constraints interact nonlinearly.

5.4. Limitations and Model Uncertainty

This work has several important limitations that must be acknowledged for responsible deployment.

Temporal Generalization Uncertainty: While rolling-origin validation provides more realistic estimates than a single holdout, two folds remain limited. Performance in 2025+ is uncertain, particularly if traffic patterns shift due to policy changes, population growth changes network structure, or autonomous vehicles alter fundamental traffic dynamics. Mitigation: Continuous monitoring and periodic retraining (recommended on a quarterly basis) should be implemented.

Extreme Event Prediction Failure: As shown in Section 4.4, models exhibit MAE ≈ 95% for incidents causing >100% traffic increase (9.5% of cases). This is operationally unacceptable for critical scenarios. The root causes are insufficient training data for rare extremes (n = 1322 examples), missing features (weather, special events, cascading effects), and fundamental nonlinearity at saturation (free-flow → gridlock transition). Mitigation: Employ a graduated response that applies conservative protocols for predicted >50% impacts; augment the feature set with weather API, event calendar, and sensor network data; and consider an ensemble with physics-based traffic simulation for extremes.

Heavy-Tailed Target and Prediction Reliability: The ΔTraffic (%) target is heavily right-skewed (skewness = 3.801, kurtosis = 27.469), which has two direct implications for operational reliability. Firstly, pooled metrics (MAE = 28.18%, R² = 0.476) are dominated by the majority of typical incidents and mask substantially higher errors for extreme events, as documented in Table 6. Secondly, the choice of training objective affects the trade-off between typical and extreme incident accuracy: MSE optimizes mean prediction quality and achieves the best pooled R², while quantile loss and log-transformation improve robustness at the cost of explained variance. Practitioners should select the objective according to operational priority: MSE for general-purpose deployment, quantile loss for conservative planning under uncertainty. Mitigation: Predictions for incidents flagged as potentially extreme (ΔTraffic > 100%) should be presented with explicit uncertainty bands (bootstrap 95% CIs) rather than point estimates, and conservative response protocols should be activated regardless of exact predicted magnitude.
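The difference between the squared-error and quantile objectives mentioned above can be seen from the per-sample losses. The sketch below uses the standard pinball (quantile) loss; the quantile level 0.9 and the example values are arbitrary choices for illustration, not settings reported in the study.

```python
def squared_error(y, y_hat):
    """Per-sample MSE contribution: symmetric in over- and under-prediction."""
    return (y - y_hat) ** 2

def pinball_loss(y, y_hat, q):
    """Per-sample quantile loss at level q in (0, 1)."""
    diff = y - y_hat
    return q * diff if diff >= 0 else (q - 1) * diff

# An extreme incident (true impact 300%) missed by 200 points in either direction.
y = 300.0
under, over = 100.0, 500.0

print(squared_error(y, under), squared_error(y, over))  # identical penalties
under_loss = pinball_loss(y, under, q=0.9)
over_loss = pinball_loss(y, over, q=0.9)
print(under_loss, over_loss)  # under-prediction costs roughly nine times more
```

At q = 0.9 the loss penalizes under-prediction far more heavily than over-prediction, which is why a high-quantile objective yields conservative (upper-leaning) impact estimates.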

Unmeasured Confounders: Our models explain 48–52% of variance, leaving 48–52% unexplained. Likely unmeasured factors include weather conditions (rain, snow, fog) that affect both incident occurrence and impact; special events (concerts, sports, construction) that simultaneously affect traffic; time-varying demand (holiday travel, seasonal patterns beyond day-of-week); and human behavior (social media virality, navigation app adoption rates). Implications: Predictions should be presented with uncertainty bands, not point estimates.

Istanbul Specificity: Our dataset is from a single city with unique characteristics such as transcontinental geography (Bosphorus bridges are bottlenecks), high population density (>15 million), and specific cultural traffic norms.

Generalizability: The proposed evaluation framework is generalizable, while the trained models require revalidation before deployment in cities with different network topology, public transport penetration, or driver behavior norms. Transfer learning experiments are needed before deployment elsewhere.

Honest Assessment: Despite achieving statistically significant improvements over baselines with rigorous validation, these models are not ready for fully autonomous deployment. They are suitable for decision support (providing operators with impact estimates), resource pre-positioning (allocating response units to high-risk zones), and pilot programs (testing on non-critical incidents with human oversight). They are NOT suitable for autonomous emergency response without human verification; life-safety decisions, e.g., hospital routing, without redundancy; and deployment in cities outside Istanbul without transfer learning validation.

5.5. Practical Deployment Considerations

Computational requirements are as follows: training takes 45 s (RF) on a standard workstation; inference takes <10 ms per incident (real-time capable); and daily retraining is feasible for model updating. Integration with ITS infrastructure can use a REST API for real-time traffic management center predictions, batch processing for historical analysis, and dashboard visualization for operators.

6. Conclusions

This paper presented a machine learning-based framework for predicting traffic incident impacts using real-world traffic data from Istanbul. By emphasizing temporally consistent evaluation, explainable analysis, and comparison with operational baselines, the study aimed to provide a realistic assessment of machine learning performance in traffic incident prediction.

Experimental results demonstrate that ensemble learning models achieve reliable predictive performance under temporal validation while maintaining interpretability and computational efficiency. The analysis further reveals that spatiotemporal context plays a more significant role than incident-specific attributes in determining traffic incident impacts. These findings highlight the importance of contextual modeling and realistic evaluation in transportation analytics.

From a methodological perspective, the study shows that combining explainable AI techniques with empirical validation can enhance the robustness of feature importance interpretations. From an operational perspective, the results indicate that interpretable machine learning models can provide practical value for traffic management systems when evaluated against realistic baselines.

Although this study focuses on a single metropolitan area, the proposed framework is generalizable to other urban contexts with similar data availability. Future research directions include the integration of network-based representations, deep learning models, and causal inference approaches to further explore the mechanisms underlying traffic incident impacts.

In conclusion, this study contributes to the literature by demonstrating that methodological rigor and interpretability are key factors in developing reliable and operationally relevant traffic incident prediction models for intelligent transportation systems.

6.1. Principal Findings

Operational Value Confirmed: The Historical Average baseline achieves R² = 0.100 ± 0.054 across temporal folds, indicating that simple spatiotemporal patterns capture approximately 10% of variance. This modest but positive R² confirms that location–hour patterns contain predictive signal, validating the importance of spatiotemporal context. In contrast, the Global Mean baseline achieves R² = −0.021 ± 0.015 (not shown in table), performing slightly worse than predicting the test-set mean. This occurs because the baseline predicts a single constant (the training mean), which captures none of the target variance and differs slightly from the test-set mean, resulting in a small negative R². By contrast, ML models achieve a 354–406% relative improvement in R² over the Historical Average baseline:

LightGBM: (0.506 − 0.100)/0.100 = +406% improvement.

XGBoost: (0.475 − 0.100)/0.100 = +375% improvement.

RF: (0.454 − 0.100)/0.100 = +354% improvement.

These substantial improvements demonstrate that ML captures nonlinear interactions, feature combinations, and complex patterns beyond simple location–hour averaging.
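The relative improvements listed above can be reproduced from the reported mean R² values with a one-line formula; the helper name below is illustrative.

```python
def relative_improvement(r2_model, r2_baseline):
    """Relative R^2 gain over the baseline, in percent."""
    return (r2_model - r2_baseline) / r2_baseline * 100

r2_baseline = 0.100  # Historical Average baseline (mean across folds)
r2_models = {"LightGBM": 0.506, "XGBoost": 0.475, "RF": 0.454}

improvements = {name: relative_improvement(r2, r2_baseline)
                for name, r2 in r2_models.items()}
for name, pct in improvements.items():
    gain = r2_models[name] - r2_baseline
    print(f"{name}: +{pct:.0f}% relative, +{gain:.3f} absolute R^2")
```

Because the baseline R² is small, the relative figure is sensitive to it; the absolute gains (0.354 to 0.406 in R²) are a useful companion when comparing against other studies.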

Spatiotemporal Context Dominates: Feature ablation studies and SHAP analysis provide convergent evidence that when/where an incident occurs explains 24× more variance than what type of incident it is. This finding challenges conventional incident classification systems and suggests context-aware management strategies.

Severity-Dependent Reliability: Model performance varies systematically by impact magnitude. For typical incidents (90.5% of the 2024 test set, $\Delta Traffic < 100\%$), MAE ranges from 12.91% to 37.80% across severity bins (Table 6), supporting operational deployment. For extreme incidents (9.5% of cases, $\Delta Traffic > 100\%$), MAE reaches 95.91%, indicating that specialized handling is required.

Statistical Rigor Strengthens Claims: Multiple comparison correction (Bonferroni + Holm) and bootstrap uncertainty quantification ensure reported improvements are not statistical artifacts. The study demonstrates that proper temporal validation is critical—random splits would have overestimated performance by ≈15–30% based on the literature.

Methodological Contribution: We provide a reproducible evaluation framework addressing common pitfalls in transportation forecasting: temporal leakage, insufficient uncertainty reporting, and multiple testing without correction. This methodology is generalizable beyond incident prediction to other ITS applications.

6.2. Practical Implications for Smart Cities

Based on these findings, we recommend a graduated deployment strategy:

Phase 1 (Low-Risk Pilot): Deploy for decision support on low–medium severity incidents ($\Delta Traffic < 50\%$). Human operators should verify all predictions before action. Build operational trust and collect deployment data.

Phase 2 (Expanded Coverage): Extend to high-severity incidents ($\Delta Traffic$ of 50–100%) with conservative protocols. Implement real-time performance monitoring. Develop an intervention feedback loop for causal estimation.

Phase 3 (Full Integration): Integrate with traffic signal control and route guidance systems. Maintain human oversight for extreme events ($\Delta Traffic > 100\%$). Continue learning from new data.
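
The graduated strategy above can be expressed as a simple routing rule. The sketch below is illustrative only: it assumes routing is driven by the predicted $\Delta Traffic$ (the true impact is unknown at incident onset), and the `Routing` type, the tier names, and the review policy for low–medium incidents after Phase 1 are hypothetical choices not specified in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Routing:
    tier: str           # hypothetical tier label
    human_review: bool  # whether an operator must confirm before action


def route_prediction(predicted_delta_pct: float, phase: int) -> Routing:
    """Map a predicted traffic deviation (%) to a handling tier for a rollout phase."""
    if phase not in (1, 2, 3):
        raise ValueError("phase must be 1, 2, or 3")
    if predicted_delta_pct > 100.0:
        # Extreme events stay under human oversight in every phase.
        return Routing("extreme", human_review=True)
    if predicted_delta_pct >= 50.0:
        # High severity is only covered from Phase 2, under conservative protocols.
        tier = "high" if phase >= 2 else "out_of_scope"
        return Routing(tier, human_review=True)
    # Low-medium severity: operators verify everything in Phase 1.
    # Relaxing review in later phases is an assumption of this sketch.
    return Routing("low_medium", human_review=(phase == 1))


print(route_prediction(30.0, phase=1))
print(route_prediction(75.0, phase=2))
```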

6.3. Future Research Directions

Extreme event modeling: Develop specialized models or physics-hybrid approaches for impacts above 100%, potentially using rare-event sampling techniques.

Multimodal data fusion: Integrate weather APIs, social media sentiment, traffic camera feeds, and special event calendars to address unmeasured confounders.

Causal inference: Implement counterfactual prediction to handle the intervention paradox, in which successful responses make predictions appear inaccurate.

Transfer learning: Validate cross-city generalization through domain adaptation experiments on cities with different characteristics.

Real-time deployment study: Build an A/B testing framework in a live TMC environment with operator feedback integration.

Explainability enhancement: Develop rule extraction methods that generate interpretable decision trees from ensemble models for safety-critical applications.

Future work should incorporate weather APIs, traffic cameras, and social media sentiment for multimodal data fusion. LSTM/GRU networks that capture within-incident dynamics should be explored for temporal evolution modeling. Propensity score matching or difference-in-differences designs should be used for causal inference. Domain adaptation could support transfer learning to other cities with limited data. Online learning should be investigated for concept drift detection in real-time deployment. Reinforcement learning could extend the framework to prescriptive analytics for optimal resource allocation. Transfer learning experiments on cities with contrasting network characteristics (e.g., monocentric European cities, low-density North American grids) are needed before deployment claims can be broadened beyond Istanbul.

6.4. Final Assessment

This work demonstrates that machine learning-based incident impact prediction is scientifically validated and operationally promising, but requires careful deployment with appropriate safeguards. The combination of rigorous temporal validation, comprehensive uncertainty quantification, honest reporting of limitations, and severity-dependent performance analysis provides a realistic foundation for smart city transportation systems to integrate AI-driven incident management while maintaining operational safety and transparency.

The path forward is not the development of an autonomous replacement for human expertise, but an augmented intelligence where ML provides data-driven insights within a framework of human oversight and continuous learning. This balanced approach maximizes the benefits of predictive analytics while acknowledging and mitigating inherent limitations.

Author Contributions

Conceptualization, A.K., U.Ç. and V.T.; methodology, A.K., U.Ç. and V.T.; software, A.K. and V.T.; validation, A.K., U.Ç. and V.T.; investigation, A.K., U.Ç. and V.T.; writing—original draft preparation, A.K., U.Ç. and V.T.; writing—review and editing, A.K., U.Ç. and V.T.; visualization, A.K., U.Ç. and V.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were used in this study. The traffic announcement data were obtained from the İstanbul Metropolitan Municipality (IMM) Open Data Portal: Transportation Management Center Traffic Announcement Data (https://data.ibb.gov.tr/en/dataset/ulasim-yonetim-merkezi-trafik-duyuru-verisi) accessed on 10 December 2025. Hourly traffic density data were obtained from the IMM Open Data Portal: Hourly Traffic Density Data Set (https://data.ibb.gov.tr/en/dataset/hourly-traffic-density-data-set) accessed on 10 December 2025. No new datasets were generated during the current study.

Acknowledgments

The authors thank the Istanbul Metropolitan Municipality Traffic Management Center for providing access to the traffic incident database.

Conflicts of Interest

The authors declare no conflicts of interest.


Figure 1. Elbow method for selecting the number of spatial clusters based on within-cluster sum of squares (WCSS). The visible inflection point indicates the appropriate number of clusters.

Figure 2. Cluster map of event locations, colored by event type.

Figure 3. Global feature importance ranked by mean SHAP value. Temporal features (hour_cos, hour_sin) dominate, accounting for 42% of total SHAP contribution.

Figure 4. SHAP dependence plot for hour_cos, showing temporal vulnerability. Evening rush hours (17:00–20:00) exhibit severe impact amplification, while late-night periods show mitigation effects.

Figure 5. SHAP dependence plot for traffic_index, showing exponential impact amplification. A critical threshold at traffic_index = 30 separates free-flow and congested regimes.

Table 1. LightGBM Robustness Analysis Across Distributional Treatments (2024 Test Set, n = 13,981).

Objective/Treatment	MAE (%)	RMSE	R²	Notes
Standard MSE (original)	28.18	47.99	0.476	Primary spec.
Log-transformed target	27.71	48.67	0.461	Back-transformed
Winsorised (P1–P99)	27.80	48.91	0.456	490 samples clipped
Quantile loss (α = 0.5)	28.13	51.46	0.398	Median regression
Huber loss (α = 0.9)	35.42	65.37	0.028	Aggressive suppression

Feature importance rank correlation (MSE vs. log): ρ = 1.000; p < 0.001. Feature importance rank correlation (MSE vs. quantile): ρ = 1.000; p < 0.001.

Table 2. Systematic Timed Benchmarks of Models.

Model	Median (ms)	P95 (ms)	P99 (ms)
LightGBM	0.31	0.44	0.61
XGBoost	0.28	0.41	0.57
RF	1.43	1.87	2.21

Table 3. Model Performance with Rolling-Origin Validation (Mean ± Std).

Category	Model	MAE (%)	RMSE	R²	Folds
Ensemble ML	LightGBM	26.81 ± 1.94	44.77 ± 4.55	0.506 ± 0.042	2
	XGBoost	27.50 ± 1.78	46.14 ± 4.36	0.475 ± 0.037	2
	RF	27.84 ± 1.61	47.07 ± 3.80	0.454 ± 0.024	2
Non-ensemble	Decision Tree (d = 8) ^†	31.40	54.18	0.332	1 ^†
Non-ensemble	Ridge Reg. (poly-2) ^†	36.64	60.96	0.155	1 ^†
Operational	Hist. Average Baseline	34.62 ± 0.77	60.32 ± 1.75	0.100 ± 0.054	2

Hist. Average uses (location, hour) grouping with global mean fallback. ^† Non-ensemble baselines (Ridge, Decision Tree) are evaluated on Fold 2 only (2024 test set, n = 13,981) as supplementary comparisons to justify ensemble model selection.
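
The Historical Average baseline described in the note above (grouping by location and hour, with a global mean fallback) can be sketched as follows. This is a minimal illustration using pandas; the column names `location`, `hour`, and `delta_traffic` are hypothetical, and the authors' implementation may differ.

```python
import pandas as pd

KEYS = ["location", "hour"]  # hypothetical column names


def fit_historical_average(train: pd.DataFrame, target: str = "delta_traffic"):
    """Return per-(location, hour) means and the global mean used as fallback."""
    lookup = train.groupby(KEYS)[target].mean().rename("pred").reset_index()
    return lookup, float(train[target].mean())


def predict_historical_average(
    test: pd.DataFrame, lookup: pd.DataFrame, global_mean: float
) -> pd.Series:
    """Look up each test row's group mean; unseen groups get the global mean."""
    merged = test[KEYS].merge(lookup, on=KEYS, how="left")  # keeps test row order
    preds = pd.Series(merged["pred"].to_numpy(), index=test.index)
    return preds.fillna(global_mean)


train = pd.DataFrame(
    {"location": ["A", "A", "B"], "hour": [8, 8, 9], "delta_traffic": [10.0, 30.0, 50.0]}
)
test = pd.DataFrame({"location": ["A", "C"], "hour": [8, 8]})

lookup, global_mean = fit_historical_average(train)
preds = predict_historical_average(test, lookup, global_mean)
print(preds.tolist())
```

Fitting the lookup on training years only and applying it to the following year keeps the baseline consistent with the rolling-origin protocol.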

Table 4. Performance Metrics with 95% Bootstrap Confidence Intervals (Fold 2: 2024).

Model	MAE (%)	RMSE	R²
LightGBM	28.17 [27.54, 28.81]	47.90 [45.51, 50.77]	0.476 [0.436, 0.508]
XGBoost	28.75 [28.13, 29.43]	49.14 [46.79, 51.98]	0.448 [0.411, 0.483]
RF	28.97 [28.34, 29.60]	49.66 [47.39, 52.29]	0.437 [0.401, 0.468]
Baseline	35.14 [34.32, 35.94]	61.43 [58.98, 64.41]	0.138 [0.109, 0.166]
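
Intervals of the kind reported in Table 4 can be obtained by resampling test-set errors with replacement. The sketch below uses the percentile bootstrap for MAE; the choice of the percentile method, the number of resamples, and the synthetic data are assumptions, since this section does not state them.

```python
import numpy as np


def bootstrap_mae_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap confidence interval for MAE."""
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    rng = np.random.default_rng(seed)
    n = abs_err.size
    boot_mae = np.empty(n_boot)
    for b in range(n_boot):
        # Resample incidents with replacement and recompute the metric.
        boot_mae[b] = abs_err[rng.integers(0, n, size=n)].mean()
    lower, upper = np.quantile(boot_mae, [alpha / 2, 1 - alpha / 2])
    return float(abs_err.mean()), float(lower), float(upper)


# Synthetic example data (not the Istanbul data set).
rng = np.random.default_rng(42)
y_true = rng.normal(30.0, 40.0, size=500)
y_pred = y_true + rng.normal(0.0, 25.0, size=500)

mae, lower, upper = bootstrap_mae_ci(y_true, y_pred)
print(f"MAE = {mae:.2f} [{lower:.2f}, {upper:.2f}]")
```

The same resampled indices can be reused to compute RMSE and R² so that all three intervals come from identical bootstrap samples.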

Table 5. Corrected Significance Testing Results (Wilcoxon Signed-Rank Test).

Comparison	Statistic	p-Value	Bonferroni	Holm–Bonferroni	r
Baseline vs. LightGBM	3.67 × 10⁷	3.35 × 10⁻¹⁴³	Yes	Yes	0.624
Baseline vs. XGBoost	3.80 × 10⁷	7.48 × 10⁻¹¹⁴	Yes	Yes	0.611
Baseline vs. RF	3.76 × 10⁷	1.61 × 10⁻¹²²	Yes	Yes	0.615
LightGBM vs. RF	4.63 × 10⁷	1.18 × 10⁻⁷	Yes	Yes	0.526
XGBoost vs. RF	4.84 × 10⁷	3.48 × 10⁻¹	No	No	0.505
LightGBM vs. XGBoost	4.67 × 10⁷	5.26 × 10⁻⁶	Yes	Yes	0.522

Bonferroni-corrected α = 0.0083; r: rank-biserial correlation.
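
The Bonferroni and Holm–Bonferroni decisions in Table 5 can be checked directly from the reported p-values. The following sketch implements both corrections in plain Python at a family-wise α of 0.05 with six comparisons; the function names are illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 when p <= alpha / m, with m the number of comparisons."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values against alpha / (m - rank)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# p-values as reported in Table 5.
comparisons = {
    "Baseline vs. LightGBM": 3.35e-143,
    "Baseline vs. XGBoost": 7.48e-114,
    "Baseline vs. RF": 1.61e-122,
    "LightGBM vs. RF": 1.18e-7,
    "XGBoost vs. RF": 3.48e-1,
    "LightGBM vs. XGBoost": 5.26e-6,
}
p = list(comparisons.values())
bonf_result = bonferroni(p)
holm_result = holm(p)

for name, b, h in zip(comparisons, bonf_result, holm_result):
    print(f"{name}: Bonferroni={'Yes' if b else 'No'}, Holm={'Yes' if h else 'No'}")
```

With six comparisons the Bonferroni threshold is 0.05/6 ≈ 0.0083, so only XGBoost vs. RF (p = 0.348) fails under either procedure.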

Table 6. LightGBM Performance Across Impact Severity Bins (2024 Test Set).

Severity	Count (%)	MAE (%)	RMSE	R²
Negative (<0%)	2701 (19.3%)	25.13	30.71	−1.26
Low (0–25%)	4930 (35.3%)	12.91	19.60	−7.07
Medium (25–50%)	2993 (21.4%)	19.63	26.34	−12.82
High (50–100%)	2035 (14.6%)	37.80	46.06	−9.75
Extreme (>100%)	1322 (9.5%)	95.91	127.11	−0.45

Overall (pooled): MAE = 28.18%, R² = 0.476.
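
The negative within-bin R² values in Table 6 alongside a positive pooled R² follow from the definition R² = 1 − SS_res/SS_tot: restricting the target to a narrow bin shrinks SS_tot, so even small errors can exceed the within-bin variance. The sketch below illustrates this on toy data; the bin edges follow Table 6, while assigning a value of exactly 100% to the extreme bin and the toy values themselves are assumptions.

```python
import numpy as np

INNER_EDGES = [0.0, 25.0, 50.0, 100.0]  # bin boundaries from Table 6
LABELS = ["Negative", "Low", "Medium", "High", "Extreme"]


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def metrics_by_severity(y_true, y_pred):
    """MAE and R² computed separately within each observed-severity bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    bin_idx = np.digitize(y_true, INNER_EDGES)  # 0 = Negative ... 4 = Extreme
    out = {}
    for i, label in enumerate(LABELS):
        mask = bin_idx == i
        if mask.sum() < 2:
            continue  # skip bins with fewer than two points
        out[label] = {
            "mae": float(np.mean(np.abs(y_true[mask] - y_pred[mask]))),
            "r2": r2(y_true[mask], y_pred[mask]),
        }
    return out


# Toy data: every prediction is off by exactly 15 percentage points.
y_true = np.array([-20, -10, 5, 20, 30, 45, 60, 90, 120, 200], dtype=float)
y_pred = y_true + np.array([15, -15] * 5, dtype=float)

pooled_r2 = r2(y_true, y_pred)
per_bin = metrics_by_severity(y_true, y_pred)
print(f"pooled R2 = {pooled_r2:.3f}")
for label, m in per_bin.items():
    print(f"{label}: MAE = {m['mae']:.1f}, R2 = {m['r2']:.2f}")
```

For this reason, MAE is the more informative column when comparing severity bins, and within-bin R² should not be read as a measure of overall model quality.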

Table 7. Prediction Error by Subgroups (LightGBM).

Category	Subcategory	Mean MAE	Std MAE	Count
Event	Manufacturing work	58.09	52.65	7
	Road construction work	50.44	–	1
	Maintenance–Investment Work	32.06	37.50	2475
	Accident notification	27.34	39.13	11,425
	Landscaping	26.54	26.60	65
	Infrastructure work	10.54	7.90	8
Time	Night (00–06)	41.36	43.83	1285
	Morning (06–12)	28.08	41.38	3860
	Evening (18–24)	27.07	34.07	3993
	Afternoon (12–18)	25.68	38.37	4843
Traffic	Low (0–30)	30.35	32.66	5429
	High (50–100)	28.16	52.28	1885
	Medium (30–50)	26.42	38.94	6667
Day of Week	Monday	29.47	39.36	2056
	Tuesday	26.80	33.61	2163
	Wednesday	27.33	44.24	2294
	Thursday	24.79	33.82	2210
	Friday	28.04	36.32	2230
	Saturday	26.77	34.14	1657
	Sunday	37.24	50.05	1371
Top 10 Clusters	14	32.84	38.02	805
	10	25.90	36.02	1054
	1	25.86	28.19	937
	4	24.90	29.51	797
	3	24.77	31.93	1515
	7	24.22	31.33	1131
	6	22.86	26.75	1565
	0	22.18	27.27	924
	9	21.71	26.90	853
	19	16.92	19.43	905

Table 8. Feature Ablation Study (LightGBM).

Feature Group	R²	ΔR²	MAE	ΔMAE	Contrib. (%)
Full Model	0.476	–	28.18	–	–
Temporal	0.166	−0.311	35.12	6.94	65.2
Traffic	0.195	−0.281	34.33	6.15	59.1
Spatial	0.317	−0.159	32.03	3.85	33.4
Incident	0.470	−0.006	28.30	0.12	1.3

Negative ΔR² indicates performance degradation; positive indicates noise removal.
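
The Contrib. (%) column in Table 8 appears consistent with the R² drop after removing a feature group, expressed as a share of the full-model R². That definition is inferred from the tabulated numbers rather than stated in this section, so the sketch below should be read as a plausibility check under that assumption.

```python
FULL_R2 = 0.476  # full model, Table 8

# R² after removing each feature group, as reported in Table 8.
ABLATED_R2 = {"Temporal": 0.166, "Traffic": 0.195, "Spatial": 0.317, "Incident": 0.470}


def contribution_pct(full_r2: float, ablated_r2: float) -> float:
    """Share of full-model R² lost when a feature group is removed (assumed definition)."""
    return (full_r2 - ablated_r2) / full_r2 * 100.0


contrib = {group: contribution_pct(FULL_R2, r2) for group, r2 in ABLATED_R2.items()}

for group, pct in contrib.items():
    print(f"{group}: {pct:.1f}%")
```

The recomputed values agree with the table to within about 0.1 percentage points, which is compatible with the table having been computed from unrounded R² values. The contributions sum to more than 100% because each group is ablated separately from the same full model, so shared or interacting signal is counted more than once.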

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

