3. Materials and Methods
Model Selection Justification
The Istanbul traffic dataset exhibits strong nonlinearity, temporal seasonality, heterogeneous categorical features (e.g., location, event type, road segment), and complex interaction effects between meteorological and traffic-related variables. Linear models such as ordinary least squares or ridge regression assume additive and mostly linear relationships, which are unlikely to sufficiently capture these interaction-driven dynamics.
Tree-based ensemble methods, including Random Forest, XGBoost, and LightGBM, are particularly suitable for such structured tabular datasets because (i) they naturally model nonlinear relationships without explicit feature engineering, (ii) they capture high-order feature interactions, (iii) they are robust to multicollinearity, and (iv) they handle mixed-type variables efficiently. Prior empirical evidence in transportation prediction tasks also indicates that gradient boosting frameworks often outperform linear and single-tree models when complex spatial–temporal dependencies exist.
Hyperparameter Optimization Strategy
To ensure fair and reproducible model comparison, hyperparameters were tuned using a time-aware cross-validation scheme. Specifically, the training period (2022–2023) was divided into rolling temporal folds to avoid information leakage. Grid search was applied over key hyperparameters (number of trees, maximum depth, learning rate, minimum child weight, subsample ratio, and column sampling ratio). Model selection was based on average validation RMSE across temporal folds.
Why Not Simpler Baselines Alone?
In addition to machine learning models, historical-average and persistence-based baselines were implemented to provide realistic lower-bound performance references. However, these simpler approaches cannot adapt to incident-specific contextual features or dynamic weather–location interactions. The ensemble models consistently demonstrated superior generalization on the 2024 hold-out test set, particularly under peak congestion and rare incident scenarios, justifying their inclusion despite their increased complexity.
3.1. Problem Formulation
Definition 1 (Traffic Incident). A traffic incident e is characterized by a tuple e = (type, loc, time, lanes, traffic), where
type = incident category; available at prediction time. Incident classification is assigned at the time of report.
loc = (latitude, longitude); available at prediction time. Incident location is reported immediately upon detection.
time = (day-of-week, hour-of-day) cyclical encodings; available at prediction time. These are derived directly from the incident timestamp.
lanes = number of closed lanes; available at prediction time in the majority of cases, as lane closure information is typically reported by the first responding unit or inferred from incident type. We acknowledge this may involve a short lag (5–10 min) and note this as a limitation.
traffic = real-time congestion index (0–100); available at prediction time. This is the congestion index recorded at the moment the incident is first logged, not an aggregate over the event period (see Section 3.4).
duration = minutes of incident persistence; NOT available at prediction time.
Note: Incident duration is excluded from the feature tuple because it is not available at prediction time; it is retained in the dataset solely for computation of the target variable.
Given the attributes recorded when an operator opens an incident record—event type, closed lanes, ambient traffic index at t0, time of day, and geographic cluster—the model estimates the percentage deviation from baseline traffic conditions that the incident will produce.
3.2. Target Variable Construction: Traffic Percent
Definition 2 (Normalized Traffic Impact). Traffic incident impact is operationalized as the relative change in congestion level attributable to an incident within a defined temporal window, expressed as the normalized percentage deviation in traffic conditions from a historical baseline (the final definition is given in Step 3 below). To ensure full reproducibility and eliminate any possibility of temporal leakage, the baseline used to construct the normalized impact target, traffic percent, relies strictly on past information.
Disambiguation of traffic_index versus event_traffic_mean:
Two traffic-related variables appear in the dataset, and their roles must be clearly distinguished. traffic_index is the network-level congestion index recorded for the one-hour window containing start_time; it reflects ambient conditions at the moment of notification and is fully available at t0. This variable is used as a model input feature. By contrast, event_traffic_mean is the mean congestion index computed over the full incident window [ts, te], which requires knowledge of finish_time and is therefore not available at t0. This variable is used exclusively to compute the prediction target from historical records; it is never passed to the model as an input. This architecture, in which the prediction target is derived from post-incident records while all model inputs remain available at notification time, is consistent with established practice in real-time traffic impact assessment, where outcome variables such as incident clearance delay or accident severity are similarly computed from completed event records [15, 51].
Let I(t) denote the hourly traffic index observed at time t, and let an incident occur over the interval [ts, te].
Step 1: Alignment of Incident and Traffic Time Series
Each incident interval is aligned to the hourly traffic index time series by mapping ts and te to the nearest hourly timestamps. For multi-hour incidents, the traffic index values within the interval are averaged:
$$T_{event} = \frac{1}{H} \sum_{k=1}^{H} I(t_k)$$
where H is the number of aligned hourly observations and $t_1, \dots, t_H$ are the aligned hourly timestamps within [ts, te].
Step 2: Construction of the Historical Baseline
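The baseline Tnormal is computed exclusively from historical data prior to the incident date. For each location cluster c, hour-of-day h, and day-of-week d, we compute
$$T_{normal}(c, h, d) = \frac{1}{|\mathcal{T}_{c,h,d}|} \sum_{t \in \mathcal{T}_{c,h,d}} I(t)$$
where $\mathcal{T}_{c,h,d}$ denotes the set of historical hourly observations for cluster c at hour h on day-of-week d that precede the incident date.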
Importantly, when predicting 2024 incidents, no traffic data from 2024 are used in constructing Tnormal. The training period (2022–2023) and testing period (2024) are strictly separated.
Step 3: Final Target Definition
$$\text{traffic\_percent} = \frac{T_{event} - T_{normal}}{T_{normal}} \times 100$$
where Tevent is the mean traffic index during the incident and Tnormal is the historical baseline for the matching location–time combination. This ratio-based normalization controls for recurring diurnal and weekly congestion patterns while guaranteeing that the baseline contains no future information relative to the prediction target.
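Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names, the tuple layout of `history`, and the integer timestamps are hypothetical, and the baseline is assumed to be a simple mean over matching past observations.

```python
from statistics import mean

def event_mean(index_by_hour, aligned_hours):
    # Step 1: average I(t) over the H aligned hourly observations.
    return mean(index_by_hour[h] for h in aligned_hours)

def historical_baseline(history, cluster, hour, dow, before):
    # Step 2: mean index over past observations that match the incident's
    # (cluster, hour-of-day, day-of-week). Rows are hypothetical tuples
    # (timestamp, cluster, hour, dow, index); rows at or after `before`
    # are ignored so that no future data enters the baseline.
    matches = [idx for ts, c, h, d, idx in history
               if ts < before and (c, h, d) == (cluster, hour, dow)]
    return mean(matches)

def traffic_percent(t_event, t_normal):
    # Step 3: percentage deviation of the event mean from the baseline.
    return (t_event - t_normal) / t_normal * 100.0

# Toy data (illustrative values only).
index_by_hour = {17: 70.0, 18: 80.0}
history = [
    (1, 3, 17, 0, 40.0),
    (2, 3, 17, 0, 60.0),
    (9, 3, 17, 0, 90.0),   # later than the incident, so excluded
    (3, 4, 17, 0, 10.0),   # different cluster, so excluded
]
t_event = event_mean(index_by_hour, [17, 18])
t_normal = historical_baseline(history, cluster=3, hour=17, dow=0, before=5)
impact = traffic_percent(t_event, t_normal)
```

In this toy case the event mean is 75 against a baseline of 50, so the incident is scored as a 50% worsening. The `ts < before` filter is the part that enforces the no-leakage property described in Step 2.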
Problem Statement: Given historical incidents with observed impacts, $\{(e_i, y_i)\}_{i=1}^{n}$, learn a function $f: e \mapsto \hat{y}$ that accurately predicts the impact of new incidents while maintaining interpretability for operational deployment.
3.3. Dataset Description
Our dataset comprises 38,430 traffic incidents from Istanbul Metropolitan Municipality’s Traffic Management Center (2022–2024), covering 39 administrative districts [57, 58].
Incident characteristics: Spatial data comprise latitude and longitude (WGS84), covering 5343 km²; temporal data comprise start time, finish time, and duration (mean: 394 min; σ: 7498 min).
Impact: Traffic index (0–100 scale) and event/normal traffic means.
Incident type: 83.6% accident notifications, 15.4% maintenance/construction, 1.0% other.
Infrastructure: Closed lanes (0–6; mean: 0.84).
Data joining procedure: Incident records (start_time, end_time, lat, lon) were joined with hourly traffic_index via spatial matching (nearest location within 500 m radius) and temporal alignment (hourly bins). Incidents spanning multiple hours used the mean traffic index across hourly bins to compute event_traffic_mean, which serves exclusively as an intermediate quantity for target variable construction and is never used as a model input.
Missing data policy: Incidents with missing traffic_index (n = 327, 0.85%) were excluded. For incidents with duration_min = 0 (n = 892, 2.3%), the next non-zero hourly reading was used.
Temporal distribution: Incident frequency peaks during weekday rush hours (07:00–09:00, 17:00–19:00). Seasonally, the incident rate is 12% higher in winter. By day of week, the weekday incident rate is 35% higher than the weekend rate.
Target variable characteristics: The traffic percent target exhibits a highly skewed and heavy-tailed distribution. Mean traffic worsening is +30.01%, substantially exceeding the median of +16.05%, indicating positive skew. Formally, skewness = 3.801 and excess kurtosis = 27.469, confirming heavy-tailed behavior. The standard deviation is 61.7%, with a full range of −99.8% to +1219.7%. The 1st–99th percentile interval spans [−61.6%, +285.1%], and the 5th–95th percentile interval spans [−28.7%, +135.6%]. Negative values (23.5% of incidents) reflect preemptive diversions and psychological deterrence effects. Given this distributional profile, Section 3.4 presents a systematic robustness analysis evaluating model performance under four alternative distributional treatments.
3.4. Feature Engineering
To validate that our conclusions are not artifacts of a specific spatial encoding, we conducted sensitivity analyses across multiple spatial representations.
Spatial Features: Incidents were grouped using K-means clustering on (latitude, longitude) coordinates to identify geographical zones. The number of clusters (k = 20) was selected by applying the KneeLocator method to the elbow curve shown in Figure 1. Each cluster represents a geographically cohesive region with similar traffic characteristics, reducing continuous coordinates to a discrete cluster assignment (location_km). We repeated the experiments for K ∈ {10, 20, 30, 40} clusters. Across all configurations, the relative feature importance ranking remained stable, with spatiotemporal context variables consistently dominating incident type variables. Figure 2 shows the clusters on a map of Istanbul, together with the event information.
Although graph-based spatial representations can capture network topology more precisely, they require detailed road network data and complex preprocessing pipelines that are not always available in operational contexts. In this study, K-means clustering is employed as a pragmatic approximation of spatial heterogeneity, enabling scalable and reproducible modeling while maintaining compatibility with real-world traffic management systems.
Heavy-Tailed Target Robustness: Given the heavy-tailed distribution of the target (skewness = 3.801, excess kurtosis = 27.469), we systematically evaluated LightGBM performance under four alternative distributional treatments, each applied to the identical temporal split (train: 2022–2023; test: 2024, n = 13,981). Results are summarized in Table 1.
The standard MSE objective on the original scale achieves the best overall balance (MAE = 28.18%, R² = 0.476). Log-transformation and Winsorization yield comparable MAE (27.71% and 27.80%, respectively, within 0.5 percentage points of the original) but marginally lower R², indicating that reducing the influence of extreme values slightly improves point-wise accuracy while reducing explained variance. Quantile loss (α = 0.5, median regression) is robust to outliers but sacrifices R² (0.398 vs. 0.476), reflecting optimization for the conditional median rather than the mean. Huber loss with aggressive outlier suppression (α = 0.9) degrades substantially (R² = 0.028), confirming that extreme incidents carry genuine signal that should not be discarded, consistent with the severity-stratified findings in Section 4.4.
Crucially, feature importance rankings are highly stable across all distributional treatments. The Spearman rank correlation between the standard MSE and log-transformed target importance rankings is ρ = 1.000 (p < 0.001), and likewise ρ = 1.000 (p < 0.001) between the MSE and quantile loss rankings. This confirms that the central finding (spatiotemporal context dominates incident classification) is not an artifact of the MSE objective or the heavy-tailed distribution, but a robust property of the data. The standard MSE objective is therefore retained as the primary specification.
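To make the rank-stability check concrete, the following sketch computes a Spearman rank correlation between two importance rankings using the no-ties formula ρ = 1 − 6Σd² / (n(n² − 1)). The feature names and importance values are hypothetical placeholders, not the values from the study.

```python
def ranks(scores):
    # Rank 1 is the largest score; this sketch assumes there are no ties.
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: r for r, name in enumerate(order, start=1)}

def spearman_rho(scores_a, scores_b):
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((ra[k] - rb[k]) ** 2 for k in ra)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical importances under two training objectives.
imp_mse = {"hour": 0.40, "traffic_index": 0.30, "location": 0.20, "event_type": 0.10}
imp_log = {"hour": 0.90, "traffic_index": 0.50, "location": 0.30, "event_type": 0.05}
# Same features with the top two positions swapped.
imp_swapped = {"hour": 0.30, "traffic_index": 0.40, "location": 0.20, "event_type": 0.10}

rho_same = spearman_rho(imp_mse, imp_log)
rho_swapped = spearman_rho(imp_mse, imp_swapped)
```

Because only ranks enter the statistic, ρ equals 1 whenever the ordering is identical, even if the magnitudes of the importances differ substantially between objectives.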
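Temporal Features: Cyclical encoding preserves temporal continuity:
$$hour_{sin} = \sin\left(\frac{2\pi \cdot hour}{24}\right), \quad hour_{cos} = \cos\left(\frac{2\pi \cdot hour}{24}\right)$$
$$dow_{sin} = \sin\left(\frac{2\pi \cdot dow}{7}\right), \quad dow_{cos} = \cos\left(\frac{2\pi \cdot dow}{7}\right)$$
where hour ∈ {0, …, 23} is the hour-of-day and dow ∈ {0, …, 6} is the day-of-week. This ensures that adjacent times (e.g., 23:00 and 00:00) are represented as similar in feature space.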
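The continuity property can be checked numerically. The sketch below uses the standard sine/cosine encoding; the helper names are illustrative and not taken from the study's code.

```python
import math

def cyclical_encode(value, period):
    # Map a periodic quantity onto the unit circle.
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def feature_distance(a, b, period):
    # Euclidean distance between two encoded values.
    sa, ca = cyclical_encode(a, period)
    sb, cb = cyclical_encode(b, period)
    return math.hypot(sa - sb, ca - cb)

d_midnight = feature_distance(23, 0, 24)   # 23:00 vs 00:00
d_midday = feature_distance(11, 12, 24)    # 11:00 vs 12:00
d_opposite = feature_distance(0, 12, 24)   # half a day apart
```

With a raw integer hour, 23:00 and 00:00 would be 23 units apart; after encoding, they are exactly as close as any other pair of adjacent hours, and hours half a day apart are maximally distant.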
Incident Features: The categorical incident type (6 categories) is one-hot encoded, avoiding the inappropriate ordinal assumptions of label encoding. Additional features: closed_lanes, duration_min.
Traffic Context: Real-time traffic_index at the beginning of the incident captures the baseline congestion level.
Final feature set (14 features): 8 base features + 6 one-hot encoded event categories.
3.5. Temporal Validation Strategy
We implement a rolling-origin cross-validation scheme to prevent temporal leakage and provide realistic performance estimates. Unlike random train–test splits that violate temporal ordering [29], our approach ensures that models are always tested on genuinely future data.
Validation Scheme:
Fold 1: Train on 2022 (n = 11,801) → test on 2023 (n = 12,648)
Fold 2: Train on 2022–2023 (n = 24,449) → test on 2024 (n = 13,981)
This design mimics operational deployment where models trained on historical data must predict future incidents. The expanding window evaluates both data efficiency (Fold 1: 1 year training) and asymptotic performance (Fold 2: 2 years training).
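The expanding-window scheme above can be sketched as a small generator. The record layout (a dict with a `year` key) and the function name are assumptions made for illustration only.

```python
def rolling_origin_folds(records, years):
    """Yield (train, test) lists with an expanding training window."""
    ordered = sorted(years)
    for i in range(1, len(ordered)):
        train_years = set(ordered[:i])
        test_year = ordered[i]
        train = [r for r in records if r["year"] in train_years]
        test = [r for r in records if r["year"] == test_year]
        yield train, test

# Toy records (illustrative values only).
records = [
    {"year": 2022, "impact": 12.0},
    {"year": 2022, "impact": 35.0},
    {"year": 2023, "impact": -4.0},
    {"year": 2023, "impact": 80.0},
    {"year": 2024, "impact": 21.0},
]
folds = list(rolling_origin_folds(records, [2022, 2023, 2024]))
```

With three calendar years this yields exactly the two folds listed above: 2022 to 2023, then 2022–2023 to 2024. Every test record is later than every training record in its fold.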
Rationale: Recent work demonstrates that random cross-validation in time series systematically overestimates performance by 15–30% [11, 29]. Temporal validation provides operationally realistic estimates critical for deployment decisions. By sorting all data chronologically and splitting strictly by year, we ensure the following:
No future information leakage: Test incidents occur strictly after all training incidents.
Realistic generalization: Models face genuine distribution shift between training and deployment.
Conservative estimates: Performance reflects real-world degradation over time.
For operational baselines, we group the training data by (location cluster, hour-of-day) and use the historical mean of each group for matching test incidents, falling back to the global mean for unseen combinations.
3.6. Uncertainty Quantification via Bootstrap
To quantify prediction uncertainty beyond point estimates, we employ bootstrap resampling [59] with 1000 iterations per model–fold combination. In each iteration, we resample the test set with replacement (drawing as many observations as the test set contains), calculate MAE, RMSE, and R² on the resampled data, and store the metrics. We report 95% confidence intervals using the percentile method: [2.5th percentile, 97.5th percentile]. This non-parametric approach makes no distributional assumptions and accounts for heteroscedasticity in prediction errors.
Implementation Details:
The random seed is fixed at 42 for reproducibility. Resampling is stratified by impact severity bin to preserve distributional properties. The computational cost of 1000 iterations × 6 model–fold combinations is approximately 15 min on a standard workstation.
The bootstrap provides robust uncertainty estimates even with non-normal error distributions, addressing the heavy-tailed nature of the target (range: −99.8% to +1219.7%). Narrow confidence intervals indicate stable predictions that do not depend on the specific test set sample.
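A minimal sketch of the percentile bootstrap for MAE follows. It is a simplification under stated assumptions: resampling here is plain (unstratified) rather than stratified by severity bin, the data are synthetic, and the helper names are illustrative.

```python
import random

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def percentile(sorted_vals, q):
    # Linear interpolation between closest ranks, with q in [0, 100].
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def bootstrap_ci(y_true, y_pred, metric, n_iter=1000, seed=42):
    rng = random.Random(seed)          # fixed seed for reproducibility
    n = len(y_true)
    stats = []
    for _ in range(n_iter):
        # Resample test indices with replacement, same size as the test set.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx],
                            [y_pred[i] for i in idx]))
    stats.sort()
    return percentile(stats, 2.5), percentile(stats, 97.5)

# Synthetic test set (illustrative only).
data_rng = random.Random(0)
y_true = [data_rng.gauss(30.0, 60.0) for _ in range(300)]
y_pred = [y + data_rng.gauss(0.0, 25.0) for y in y_true]

point = mae(y_true, y_pred)
ci_low, ci_high = bootstrap_ci(y_true, y_pred, mae)
```

Because only the already-computed predictions are resampled, no model is retrained inside the loop, which is why 1000 iterations remain inexpensive. A stratified variant would draw indices separately within each severity bin and concatenate them.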
3.7. Statistical Significance Testing with Multiple Comparison Correction
We conduct pairwise Wilcoxon signed-rank tests [60] comparing absolute error distributions across models. To control the family-wise error rate (FWER) for multiple comparisons, we apply the Bonferroni correction:
$$\alpha_{corrected} = \frac{\alpha}{m}$$
where m = 6 comparisons and α = 0.05, giving αcorrected = 0.0083. We additionally apply the Holm–Bonferroni procedure, which tests hypotheses sequentially and controls the FWER while maintaining higher power than Bonferroni [61].
Hypothesis Testing Framework:
The null hypothesis (H0) is that the median absolute error of Model A equals that of Model B; the alternative (H1) is that the median absolute errors differ. We reject H0 at the corrected significance level and report both Bonferroni and Holm results for transparency.
Pairwise Comparisons:
Using the largest test set (Fold 2, n = 13,981), we perform 6 comparisons: 3 for ML vs. Baseline (RF vs. Baseline, XGBoost vs. Baseline, LightGBM vs. Baseline) and 3 for inter-ML (LightGBM vs. XGBoost, LightGBM vs. RF, XGBoost vs. RF).
Rationale:
Multiple testing without correction inflates Type I error. With m = 6 independent tests at α = 0.05, the probability of at least one false positive is 1 − (1 − 0.05)⁶ ≈ 26% under the global null. Bonferroni reduces this to ≤5% while maintaining interpretability. The Holm procedure provides additional power when early hypotheses show strong significance.
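The two corrections can be sketched as follows, using the six p-values reported in Section 4.3 as inputs. The function names are illustrative; the decision rules are the standard Bonferroni and Holm step-down procedures.

```python
def bonferroni(p_values, alpha=0.05):
    # Reject H0 when p <= alpha / m.
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def holm(p_values, alpha=0.05):
    # Step-down procedure: compare the k-th smallest p-value (k = 0, 1, ...)
    # with alpha / (m - k) and stop at the first non-rejection.
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

comparisons = ["LightGBM vs. Baseline", "XGBoost vs. Baseline", "RF vs. Baseline",
               "LightGBM vs. RF", "XGBoost vs. RF", "LightGBM vs. XGBoost"]
p_values = [3.35e-143, 7.48e-114, 1.61e-122, 1.18e-7, 0.348, 5.26e-6]

bonf = bonferroni(p_values)
holm_result = holm(p_values)

# Probability of at least one false positive for 6 independent uncorrected tests.
fwer_uncorrected = 1.0 - (1.0 - 0.05) ** 6
```

For these inputs both procedures reject every null except XGBoost vs. RF, so the extra power of Holm does not change any decision here.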
3.8. Baseline Models
To establish practical improvement margins and provide context for ML model performance, we implement a historical average baseline representing a realistic operational system that traffic management centers could deploy without machine learning. The historical average baseline leverages spatiotemporal patterns by grouping training data by (location cluster, hour-of-day) pairs and predicting
$$\hat{y}_i = \frac{1}{|\mathcal{D}(location_i, hour_i)|} \sum_{j \in \mathcal{D}(location_i, hour_i)} y_j$$
where location_i is the K-means cluster assignment and hour_i is the hour-of-day for test incident i, and $\mathcal{D}(location_i, hour_i)$ denotes the set of training incidents sharing that cluster and hour.
This baseline captures spatial heterogeneity (different locations have different typical impacts), temporal patterns (rush hour incidents differ from off-peak incidents), and operational feasibility (simple lookup table implementable in existing systems). The fallback to global mean for unseen (location/hour) combinations (approximately 5–10% of test cases) ensures the baseline always produces predictions. This baseline represents the performance floor that any ML model must surpass to justify deployment complexity.
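The lookup-table baseline with its global-mean fallback can be sketched as below. The tuple layout (cluster, hour, impact) and the function names are assumptions for illustration.

```python
from collections import defaultdict

def fit_historical_average(train):
    # Accumulate per-(cluster, hour) sums and counts plus the global mean.
    sums = defaultdict(lambda: [0.0, 0])
    total, count = 0.0, 0
    for cluster, hour, impact in train:
        cell = sums[(cluster, hour)]
        cell[0] += impact
        cell[1] += 1
        total += impact
        count += 1
    table = {key: s / n for key, (s, n) in sums.items()}
    return table, total / count

def predict(table, global_mean, cluster, hour):
    # Fall back to the global mean for unseen (cluster, hour) pairs.
    return table.get((cluster, hour), global_mean)

# Toy training incidents (illustrative values only).
train = [(3, 17, 40.0), (3, 17, 60.0), (5, 2, -10.0)]
table, global_mean = fit_historical_average(train)

seen = predict(table, global_mean, cluster=3, hour=17)
unseen = predict(table, global_mean, cluster=9, hour=9)
```

The fitted object is just a dictionary and one scalar, which is what makes this baseline straightforward to deploy in an existing system.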
3.9. Machine Learning Models
Hyperparameters were selected via 3-fold time-series cross-validation on the training set (2022 train; 2023 validation). Grid search ranges: , . The final configs reported minimize validation RMSE.
Random Forest (RF) [5]: Ensemble of 300 decision trees with bootstrap aggregation. The configuration values are , , and . The advantages are robustness to overfitting, interpretable feature importance, and rule extraction.
XGBoost [6]: Gradient boosting with L1/L2 regularization. The configuration values are , , , and . The advantages are high accuracy, native missing data handling, and efficient training.
LightGBM [7]: Gradient-based one-side sampling with leaf-wise growth. The configuration values are , , , . The advantages are faster training, superior categorical feature handling, and memory efficiency.
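Evaluation metrics for ML models are given below:
Mean Absolute Error (MAE): Average absolute prediction error in percentage points.
$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
Root Mean Squared Error (RMSE): Emphasizes large errors.
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
Coefficient of determination (R²): Proportion of variance explained.
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
where $y_i$ represents actual values, $\hat{y}_i$ denotes predictions, $\bar{y}$ is the mean actual value, and n is the sample size.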
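For reference, the three metrics can be written directly from their standard definitions. The toy values below are illustrative only.

```python
import math

def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def r2(y, y_hat):
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))   # residual sum of squares
    ss_tot = sum((a - y_bar) ** 2 for a in y)              # total sum of squares
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
y_hat = [2.0, 2.0, 3.0, 3.0]

scores = (mae(y, y_hat), rmse(y, y_hat), r2(y, y_hat))
```

Note that R² is not bounded below by zero: whenever the residual sum of squares exceeds the total sum of squares, the value is negative.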
3.10. Explainability Framework
SHAP Analysis [8]: Computes Shapley values, quantifying each feature’s contribution:
$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right]$$
Here, $\phi_i$ is the SHAP value for feature i, S is a feature subset, F is the complete feature set, and $f_S$ denotes the model output when only the features in S are known. SHAP values were computed on n = 1000 test samples drawn using stratified random sampling (proportional to the event type distribution), with the random seed set to 42 for reproducibility. We employ TreeExplainer (optimized for tree models, using the fast tree path-dependent algorithm with feature_perturbation = 'tree_path_dependent') on these 1000 sampled test instances to compute global feature importance using summary plots and feature interactions using dependence plots. In a feature ablation study carried out to validate the SHAP findings, we systematically removed the feature groups {Spatial, Temporal, Incident, Traffic} and retrained the models, measuring R² degradation. This provides empirical confirmation of feature importance independently of SHAP. The Wilcoxon signed-rank test [60] compares model error distributions to assess whether performance differences are statistically significant (α = 0.05).
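To illustrate what a Shapley value measures, the sketch below evaluates the classical Shapley definition exactly by enumerating all feature subsets of a toy value function. This brute-force enumeration is exponential in the number of features and is not the TreeExplainer algorithm used in the study; the feature names and the value function are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, features):
    # Exact Shapley values: weighted average of each feature's marginal
    # contribution over all subsets of the remaining features.
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi[i] = total
    return phi

def value_fn(subset):
    # Hypothetical model output when only the features in `subset` are known:
    # two main effects plus an hour x traffic interaction; event_type is inert.
    v = 0.0
    if "hour" in subset:
        v += 10.0
    if "traffic_index" in subset:
        v += 6.0
    if "hour" in subset and "traffic_index" in subset:
        v += 4.0
    return v

features = ["hour", "traffic_index", "event_type"]
phi = shapley_values(value_fn, features)
```

The interaction term of 4 is split equally between the two interacting features, the inert feature receives zero, and the values sum to the difference between the full and empty coalitions (the efficiency property).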
3.11. Deployment and Computational Environment
All experiments were executed on a workstation equipped with an Intel Core i7-12700K CPU (12 cores, 20 threads, 3.6 GHz base frequency) and 16 GB RAM, running Windows 11. Model training was performed primarily on CPU for the tree-based models (Random Forest, XGBoost, and LightGBM), while GPU acceleration was optionally enabled for XGBoost and LightGBM during hyperparameter tuning experiments. The implementation was carried out in Python 3.12.10 using scikit-learn (v1.8.0), XGBoost (v3.2.0), LightGBM (v4.6.0), and SHAP (v0.50.0). All experiments were executed with a fixed random seed of 42 to ensure reproducibility. The average training time per model, as shown in Table 2, ranged between 18 and 95 s depending on model complexity and hyperparameter configuration. Inference time per incident sample was below 5 milliseconds, indicating suitability for near-real-time deployment scenarios.
4. Results
4.1. Rolling-Origin Performance Evaluation
Table 3 presents model performance across the two temporal folds, reporting mean ± standard deviation to quantify temporal stability. Standard deviations are modest relative to means (coefficient of variation < 12% for all models), indicating consistent performance across time periods. All models show performance degradation from Fold 1 to Fold 2 (e.g., LightGBM: R² = 0.527 → 0.476), reflecting increased difficulty when test data are further in time from training data. This degradation is expected and supports the realism of our evaluation. ML models achieve a 354–406% relative improvement in R² over the historical baseline (mean performance). Even under conservative temporal validation, the operational value added is substantial. All statistical comparisons in Section 4.3 use the Historical Average baseline as the reference, as it represents realistic operational performance. Comparisons against the Global Mean baseline would show even stronger significance but would not reflect practical deployment scenarios. LightGBM exhibits the best mean performance and acceptable stability, making it the recommended model for deployment consideration.
To support the model selection, we additionally report Fold 2 performance for two non-ensemble alternatives evaluated under identical temporal validation. A single decision tree (max depth = 8) achieves R² = 0.332, demonstrating that the base learner alone captures meaningful signal but falls 0.144 short of LightGBM (R² = 0.476), confirming that ensemble aggregation provides substantial, non-trivial gains. Ridge regression with degree-2 polynomial interaction terms achieves R² = 0.155, substantially below the ensemble models, which confirms that the nonlinear traffic–impact relationship cannot be adequately captured by linear assumptions, even with explicit feature interactions. Together, these results empirically justify the choice of ensemble methods for this task.
4.2. Bootstrap Confidence Intervals
Table 4 reports 95% bootstrap confidence intervals for the 2024 test set (Fold 2, n = 13,981), our primary out-of-sample evaluation. Confidence interval widths are operationally meaningful: LightGBM MAE uncertainty is ±0.64 percentage points (±2.3% relative), indicating that predictions are stable across resampling. This narrow range supports decision-making. The baseline MAE CI [34.32%, 35.94%] does not overlap with the LightGBM CI [27.54%, 28.81%], providing additional evidence beyond significance testing that the improvement is real and robust. For R², a CI width of 0.072 indicates ≈15% relative uncertainty in explained variance. While the models explain ≈48% of variance, we are confident this is not a sampling artifact. XGBoost and RF show overlapping R² confidence intervals ([0.411, 0.483] vs. [0.401, 0.468]), supporting the later finding that they do not differ significantly.
4.3. Corrected Significance Testing Results
Table 5 presents the results of the 6 pairwise Wilcoxon signed-rank tests with Bonferroni and Holm–Bonferroni corrections. All ML vs. Baseline comparisons remain highly significant: LightGBM vs. Baseline, p = 3.35 × 10⁻¹⁴³; XGBoost vs. Baseline, p = 7.48 × 10⁻¹¹⁴; and RF vs. Baseline, p = 1.61 × 10⁻¹²² (all p < 0.001 even after correction). These extremely small p-values indicate that the ML improvements over the historical baseline are not due to chance. The evidence for operational value is overwhelming. The inter-model comparisons are mixed: LightGBM vs. RF, p = 1.18 × 10⁻⁷ (significant under both Bonferroni and Holm); LightGBM vs. XGBoost, p = 5.26 × 10⁻⁶ (significant under both); and XGBoost vs. RF, p = 0.348 (not significant under either).
XGBoost and RF show no statistically significant difference, with or without multiple comparison correction. This suggests that the two models are statistically indistinguishable on this test set, that model selection between them can prioritize other factors (training time, interpretability), and that the apparent superiority of XGBoost over RF (R² 0.448 vs. 0.437) may not be robust.
LightGBM vs. XGBoost yields p = 5.26 × 10⁻⁶, which remains significant after Bonferroni correction (αcorrected = 0.0083). Bonferroni correction, while conservative, therefore does not change any conclusion here; in general, however, correction prevents inflation of false discoveries.
4.4. Performance by Impact Severity
Table 6 analyzes prediction performance across five impact severity bins, revealing substantial heterogeneity in model reliability. By severity, 54.6% of incidents are low to medium impact (0–50% increase), 14.6% are high impact (50–100% increase), 9.5% are extreme impact (>100% increase), and 19.3% are negative impact (traffic improvement).
Performance degrades with severity: MAE increases systematically from 12.91% (low) to 95.91% (extreme). This roughly 7× error amplification indicates that extreme events are fundamentally different from typical incidents. While overall R² is positive (0.476), within-bin R² values are negative, indicating high heterogeneity within severity categories. Within each bin, the variance of prediction errors exceeds the variance of the observed values, although pooled across bins the model adds value. Models are reliable (MAE < 40%) for ≈85% of incidents (all but the extreme category). For the remaining 15%, predictions should be treated as rough lower bounds rather than precise estimates. The model handles negative impacts (traffic improvement) with MAE = 25.13%, suggesting that it captures preemptive routing and deterrence effects moderately well.
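The coexistence of a positive pooled R² with negative within-bin R² follows from how the bins are formed: binning on the observed outcome shrinks the within-bin variance of the target, so even moderate errors can exceed it. The toy numbers below are constructed purely to illustrate this effect and are not drawn from the study.

```python
def r2(y, y_hat):
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Two severity bins defined by the true impact (illustrative values).
y_low, pred_low = [0.0, 10.0, 20.0], [20.0, 10.0, 0.0]
y_high, pred_high = [100.0, 110.0, 120.0], [120.0, 110.0, 100.0]

# The model places each incident in the right range but misorders
# incidents inside each bin.
r2_low = r2(y_low, pred_low)
r2_high = r2(y_high, pred_high)
r2_pooled = r2(y_low + y_high, pred_low + pred_high)
```

Here each within-bin R² is strongly negative while the pooled R² is close to 0.9, because most of the total variance lies between bins, which the predictions capture well.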
We recommend a graduated deployment, starting with low–medium severity predictions, where reliability is high. For incidents predicted to cause >100% impact, extreme-incident protocols should use conservative upper bounds and activate maximum response regardless of the exact prediction. Feature augmentation is also needed, as extreme incidents likely involve unmeasured factors (weather, cascading failures, special events) that require additional data sources. Extreme incidents may involve network-wide cascading effects not captured by local features, rare event types underrepresented in the training data (1322/38,430 = 3.4%), and nonlinear threshold effects where traffic transitions from manageable to gridlock.
4.5. Error Analysis by Subgroups
Table 7 analyzes LightGBM prediction errors across operational categories, providing the following insights:
Time-of-day vulnerability: Night incidents (00:00–06:00) exhibit 61% higher error (MAE = 41.36%) than afternoon incidents (MAE = 25.68%), reflecting sparse training data and fundamentally different traffic dynamics.
Event type similarity: Error variation across incident types (10.54–58.09% MAE range) confirms that incident classification contributes, on average, to predictive difficulty.
Location heterogeneity: Error variation across location clusters (range 16.92–32.84% MAE) indicates that spatial context dominates prediction quality.
High-traffic resilience: Counter-intuitively, high-traffic periods show comparable error to low-traffic periods, suggesting that models capture congestion dynamics effectively.
4.6. Feature Ablation Study
Table 8 quantifies the contribution of each feature group through systematic removal and retraining. The key findings are as follows:
Temporal dominance: Removing time features degrades R² by 0.311 (65.2% of the full model’s explanatory power), confirming hour-of-day and day-of-week as critical predictors.
Traffic context importance: The baseline traffic index contributes 0.281 R² (59.1%), validating that the current congestion level is the second-most important factor.
Spatial contribution: Location clusters contribute 0.159 R² (33.4%), indicating that geographic heterogeneity matters, but less than temporal patterns.
Incident type irrelevance: Removing incident type features degrades R² by 0.006 (1.3%), confirming that these features carry minimal predictive signal.
The SHAP global ranking is independently corroborated by the feature ablation study (Table 8). Temporal features rank first in both SHAP (42% of total contribution) and ablation (ΔR² = 0.311, i.e., a 65.2% contribution), while incident type ranks last in both analyses. This cross-method convergence provides empirical validation of SHAP consistency beyond descriptive reporting.
4.7. SHAP Feature Importance Analysis
Figure 3 summarizes global feature importance via mean absolute SHAP values for the LightGBM model (1000 test samples). Current congestion (traffic_index) shows a strong positive association: higher baseline traffic is associated with amplified impact. Spatiotemporal context (the spatial and temporal features combined) accounts for 42% of total SHAP contribution, while incident type contributes marginally. SHAP values were computed separately for both temporal folds (Fold 1: 2023 test set; Fold 2: 2024 test set). The top-five feature ranking by mean |SHAP| was identical across both folds, with Spearman rank correlation ρ = 0.97 between the two fold rankings. This confirms that SHAP-based importance is stable over time and not an artifact of a specific test period. The SHAP ranking matches the ablation study results (Table 8), suggesting that temporal features are most important, followed by traffic context and then spatial features, with incident features being least important. This cross-method agreement strengthens interpretability confidence.
Extending the dependence analysis, Figure 4 presents a SHAP dependence plot for the hour-of-day feature (temporal feature interaction), colored by traffic_index, revealing time-of-day impact patterns. Evening rush hours exhibit strongly positive SHAP values, while late night shows negative contributions. The hour × traffic_index interaction reveals that the amplification effect of high baseline congestion is strongest during evening rush hours (17:00–20:00), while remaining near zero during night hours, consistent with the saturation threshold discussed in Section 4.3. These interaction patterns are mechanistically interpretable and operationally actionable.
Figure 5 displays the traffic_index (baseline traffic amplification) dependence, demonstrating a nonlinear impact relationship. A transition around traffic_index = 30 marks a boundary between resilient (below) and fragile (above) traffic states. This threshold enables graduated intervention protocols: an aggressive response with preemptive adjustments above the threshold, and standard protocols below it.
5. Discussion
This study is positioned as a rigorously validated impact estimation study for a well-defined nowcasting task: given the notification-time attributes available when an operator opens an incident record, estimate the percentage deviation from baseline traffic conditions that the incident will produce. The results demonstrate that notification-time features alone are sufficient for statistically significant impact estimation. The translation of these findings into a real-time deployment system would require additional engineering work—including integration with live data feeds, latency constraints, and prospective operational validation—which lies beyond the scope of the present study and is identified as a direction for future research.
While recent studies have employed deep learning and graph-based models, direct comparison with such architectures was not the primary objective of this study. The focus here is on evaluating the practical value of interpretable and computationally efficient models under realistic temporal validation settings. Notably, prior research has shown that performance gains reported by complex models can be significantly reduced when strict temporal evaluation protocols are applied. Therefore, this study complements existing work by emphasizing evaluation realism and explainability rather than architectural sophistication.
This study investigated the prediction of traffic incident impacts using machine learning models under temporally consistent evaluation and explainable analysis. The results provide several insights into the contextual nature of traffic incident effects and the methodological implications of evaluation strategies in traffic prediction research.
Firstly, the findings demonstrate that temporal validation significantly affects model performance. Models evaluated using random train–test splits exhibited substantially higher accuracy compared to temporally separated validation. This result confirms that temporal leakage can lead to overly optimistic performance estimates in traffic prediction tasks. Therefore, the study highlights the necessity of temporally consistent evaluation frameworks to ensure realistic and operationally meaningful performance assessment.
Secondly, the comparison between machine learning models and historical baselines reveals that the performance gains of advanced models are more modest under realistic temporal settings than commonly reported in the literature. While ensemble learning models outperform baseline approaches, the margin of improvement varies across temporal and spatial contexts. This finding suggests that the value of machine learning in traffic incident prediction should be interpreted relative to operational baselines rather than in isolation.
Thirdly, an explainable AI analysis indicates that the spatiotemporal context plays a dominant role in determining the traffic incident impacts. Features related to time-of-day, day-of-week, location, and baseline traffic conditions consistently exhibit higher importance than incident-specific attributes such as incident type or severity. This observation is further supported by feature ablation experiments, which confirm that contextual variables account for a substantial portion of predictive power. These results imply that traffic incident impacts are not solely determined by incident characteristics but are strongly conditioned by the surrounding traffic environment.
Fourthly, the convergence between SHAP-based explanations and empirical ablation analysis enhances the robustness of interpretability findings. Unlike many prior studies that rely solely on post hoc explanation methods, this study validates interpretability results through independent empirical testing. This methodological approach strengthens the reliability of feature importance interpretations in traffic prediction tasks.
Finally, the proposed normalized impact metric enables the meaningful comparison of incident effects across heterogeneous spatial and temporal conditions. By normalizing traffic deviations relative to expected baseline conditions, the metric reduces bias arising from varying traffic volumes and structural differences across locations. This contributes to a more consistent and comparable representation of traffic incident impacts.
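The normalization idea can be sketched as a percentage deviation from the expected baseline. This is a minimal illustration that assumes the metric takes the form 100 × (observed − baseline) / baseline; the function name and the example values are hypothetical, and the exact definition used in the study is the one given in the methods section.

```python
def normalized_impact(observed, baseline):
    """Percentage deviation of observed traffic from its expected baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (observed - baseline) / baseline


# The same relative disruption at a quiet and a busy location (hypothetical values).
quiet_location = normalized_impact(observed=45, baseline=30)
busy_location = normalized_impact(observed=150, baseline=100)

# A negative value indicates traffic below the expected baseline.
improvement = normalized_impact(observed=24, baseline=30)
```

Under this form, both locations receive the same impact value despite very different absolute volumes, which is the property that makes incidents comparable across heterogeneous conditions.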
Overall, the findings suggest that methodological rigor, evaluation realism, and explainability are as critical as model complexity in traffic incident prediction. The study thus provides evidence that interpretable ensemble models, when evaluated under realistic temporal conditions, can offer robust and operationally feasible solutions for intelligent transportation systems.
5.1. Spatiotemporal Context Dominates Incident Classification
Across multiple spatial encodings and robustness checks, spatiotemporal context consistently exhibits substantially greater explanatory contribution than incident classification features. This finding contradicts conventional traffic management paradigms prioritizing incident classification for response protocols.
Quantitative evidence: In the SHAP analysis, temporal and spatial features together contribute 73% vs. 3% for incident type.
Ablation study: Temporal features contribute 65.2% vs. incident type’s 1.3%.
Error analysis: Incident type variation is higher than location variation.
Mechanistic explanation: An identical accident produces vastly different impacts—negligible at 03:00 on peripheral roads (low traffic baseline, available diversions), catastrophic at 18:00 on central bottlenecks (saturated capacity, no alternatives). Context-dependency necessitates a fundamental reconsideration of severity classification systems.
The practical implications include the use of dynamic incident prioritization to replace static type-based protocols, spatiotemporal vulnerability maps for proactive resource positioning, context-aware routing incorporating location–hour risk profiles, and adaptive signal control, preemptively adjusting near high-risk clusters during predicted high-impact periods.
5.2. Importance of Proper Temporal Validation
Comparison with random-split approaches reveals critical validation differences. Random splitting artificially inflates R2 by 26.7% in relative terms due to data leakage: models learn from “future” incidents when predicting “past” incidents. This corresponds to an absolute R2 difference of 0.116 (11.6 percentage points), a substantial overestimation that could mislead operational deployment decisions.
Lessons for ITS research: Traffic forecasting must use temporal validation to provide realistic performance estimates. Random cross-validation, while statistically valid for i.i.d. data, systematically overestimates time series prediction accuracy.
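A rolling-origin split of the kind described here can be sketched as follows. This is a minimal illustration, not the study's implementation: the function name, the cutoff dates, and the incident dates are hypothetical. Each fold trains only on records that precede its cutoff, so no "future" incident can inform a "past" prediction.

```python
from datetime import date


def rolling_origin_splits(dates, cutoffs):
    """Yield (train_idx, test_idx) pairs with all training dates before the test dates.

    For each cutoff, the training set holds every record strictly before it and
    the test set holds the records from the cutoff up to the next cutoff.
    """
    bounds = list(cutoffs) + [None]
    for start, end in zip(bounds[:-1], bounds[1:]):
        train = [i for i, d in enumerate(dates) if d < start]
        test = [i for i, d in enumerate(dates)
                if d >= start and (end is None or d < end)]
        yield train, test


# Hypothetical incident dates and fold boundaries.
incident_dates = [
    date(2022, 3, 1), date(2022, 11, 15), date(2023, 2, 10),
    date(2023, 8, 5), date(2024, 1, 20), date(2024, 6, 30),
]
fold_cutoffs = [date(2023, 7, 1), date(2024, 1, 1)]

for train_idx, test_idx in rolling_origin_splits(incident_dates, fold_cutoffs):
    # Leakage check: the latest training date precedes the earliest test date.
    assert max(incident_dates[i] for i in train_idx) < min(incident_dates[i] for i in test_idx)
```

A random split, by contrast, would place incidents from the same period on both sides of the split, which is the source of the optimistic bias discussed above.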
5.3. Negative Traffic Deviations: Preemptive Routing
Approximately 23.5% of incidents exhibited negative values (traffic improvement), initially counterintuitive but reflecting sophisticated adaptive routing behavior:
Preemptive diversions: Navigation apps predict congestion before physical manifestation, rerouting traffic proactively.
Psychological deterrence: Drivers avoid known incident locations even when alternative routes are slower.
Braess paradox: Lane closures removing problematic merge points can improve overall flow [62].
Temporal displacement: Sparse baseline traffic during off-hours enables complete diversions without bottleneck creation.
This phenomenon highlights that modern incident impact constitutes a complex sociotechnical system where human behavior, algorithmic routing, and physical constraints interact nonlinearly.
5.4. Limitations and Model Uncertainty
This work has several important limitations that must be acknowledged for responsible deployment.
Temporal Generalization Uncertainty: While rolling-origin validation provides more realistic estimates than a single holdout, two folds remain limited. Performance in 2025 and beyond is uncertain, particularly if traffic patterns shift due to policy changes, population growth changes the network structure, or autonomous vehicles alter fundamental traffic dynamics. Mitigation: Continuous monitoring and periodic retraining (recommended on a quarterly basis) should be implemented.
Extreme Event Prediction Failure: As shown in Section 4.4, models exhibit MAE ≈ 95% for incidents causing a >100% traffic increase (9.5% of cases). This is operationally unacceptable for critical scenarios. The root causes are insufficient training data for rare extremes (n = 1322 examples), missing features (weather, special events, cascading effects), and fundamental nonlinearity at saturation (free-flow → gridlock transition). Mitigation: Employ a graduated response that applies conservative protocols for predicted >50% impacts; augment the feature set with weather API, event calendar, and sensor network data; and consider an ensemble with physics-based traffic simulation for extremes.
Heavy-Tailed Target and Prediction Reliability: The target is heavily right-skewed (skewness = 3.801, kurtosis = 27.469), which has two direct implications for operational reliability. Firstly, pooled metrics (MAE = 28.18%, R2 = 0.476) are dominated by the majority of typical incidents and mask substantially higher errors for extreme events, as documented in Table 6. Secondly, the choice of training objective affects the trade-off between typical and extreme incident accuracy: MSE optimizes mean prediction quality and achieves the best pooled R2, while quantile loss and log-transformation improve robustness at the cost of explained variance. Practitioners should select the objective according to operational priority:
MSE for general-purpose deployment, quantile loss for conservative planning under uncertainty. Mitigation: Predictions for incidents flagged as potentially extreme (
) should be presented with explicit uncertainty bands (bootstrap 95% CIs) rather than point estimates, and conservative response protocols should be activated regardless of exact predicted magnitude.
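The bootstrap procedure behind such intervals can be sketched with a percentile bootstrap. As a simplification, the example below computes a 95% interval for the MAE of a fixed set of predictions rather than a per-prediction uncertainty band, which would require refitting the model on each resample. The function name, the resample count, and the data are hypothetical.

```python
import random


def bootstrap_mae_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Return (MAE, lower, upper) using a percentile bootstrap over absolute errors."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(errors)
    resampled_means = []
    for _ in range(n_boot):
        # Resample the errors with replacement and record the resampled MAE.
        sample = [errors[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    lower = resampled_means[int((alpha / 2) * n_boot)]
    upper = resampled_means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(errors) / n, lower, upper


# Hypothetical impact values (%) with one extreme incident in the tail.
observed = [12.0, 35.0, 8.0, 150.0, 22.0, 5.0, 60.0, 18.0]
predicted = [15.0, 28.0, 10.0, 70.0, 25.0, 9.0, 45.0, 20.0]
mae, ci_low, ci_high = bootstrap_mae_ci(observed, predicted)
```

With a heavy-tailed target, a single extreme incident can widen the interval considerably, which is one reason to report the interval alongside the point estimate.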
Unmeasured Confounders: Our models explain 48–52% of variance, leaving 48–52% unexplained. Likely unmeasured factors include weather conditions (rain, snow, fog), which affect both incident occurrence and impact; special events (concerts, sports, construction), which simultaneously affect traffic; time-varying demand (holiday travel, seasonal patterns beyond day-of-week); and human behavior (social media virality, navigation app adoption rates). Implications: Predictions should be presented with uncertainty bands, not point estimates.
Istanbul Specificity: Our dataset is from a single city with unique characteristics such as transcontinental geography (the Bosphorus bridges are bottlenecks), a large population (>15 million), and specific cultural traffic norms.
Generalizability: The proposed evaluation framework is generalizable, while the trained models require revalidation before deployment in cities with different network topology, public transport penetration, or driver behavior norms. Transfer learning experiments are needed before deployment elsewhere.
Honest Assessment: Despite achieving statistically significant improvements over baselines with rigorous validation, these models are not ready for fully autonomous deployment. They are suitable for decision support (providing operators with impact estimates), resource pre-positioning (allocating response units to high-risk zones), and pilot programs (testing on non-critical incidents with human oversight). They are NOT suitable for autonomous emergency response without human verification; life-safety decisions, e.g., hospital routing, without redundancy; and deployment in cities outside Istanbul without transfer learning validation.
5.5. Practical Deployment Considerations
Computational requirements are modest: training takes 45 s (RF) on a standard workstation, inference takes <10 ms per incident (real-time capable), and daily retraining is feasible. Integration with ITS infrastructure would use a REST API for real-time predictions in traffic management centers, batch processing for historical analysis, and dashboard visualization for operators.
6. Conclusions
This paper presented a machine learning-based framework for predicting traffic incident impacts using real-world traffic data from Istanbul. By emphasizing temporally consistent evaluation, explainable analysis, and comparison with operational baselines, the study aimed to provide a realistic assessment of machine learning performance in traffic incident prediction.
Experimental results demonstrate that ensemble learning models achieve reliable predictive performance under temporal validation while maintaining interpretability and computational efficiency. The analysis further reveals that spatiotemporal context plays a more significant role than incident-specific attributes in determining traffic incident impacts. These findings highlight the importance of contextual modeling and realistic evaluation in transportation analytics.
From a methodological perspective, the study shows that combining explainable AI techniques with empirical validation can enhance the robustness of feature importance interpretations. From an operational perspective, the results indicate that interpretable machine learning models can provide practical value for traffic management systems when evaluated against realistic baselines.
Although this study focuses on a single metropolitan area, the proposed framework is generalizable to other urban contexts with similar data availability. Future research directions include the integration of network-based representations, deep learning models, and causal inference approaches to further explore the mechanisms underlying traffic incident impacts.
In conclusion, this study contributes to the literature by demonstrating that methodological rigor and interpretability are key factors in developing reliable and operationally relevant traffic incident prediction models for intelligent transportation systems.
6.1. Principal Findings
Operational Value Confirmed: The Historical Average baseline achieves R2 = 0.100 ± 0.054 across temporal folds, indicating that simple spatiotemporal patterns capture approximately 10% of variance. This modest but positive R2 confirms that location–hour patterns contain predictive signal, validating the importance of spatiotemporal context. In contrast, the Global Mean baseline achieves R2 = −0.021 ± 0.015 (not shown in table), performing slightly worse than predicting the test-period mean. A constant prediction has zero variance, so its R2 is zero only when the constant equals the test-period mean and is negative otherwise; the slightly negative value is therefore consistent with a shift in the mean between the training and test periods. By comparison, ML models achieve a 354–406% relative improvement in R2 over the Historical Average baseline:
LightGBM: (0.506 − 0.100)/0.100 = +406% improvement.
XGBoost: (0.475 − 0.100)/0.100 = +375% improvement.
RF: (0.454 − 0.100)/0.100 = +354% improvement.
These substantial improvements demonstrate that ML captures nonlinear interactions, feature combinations, and complex patterns beyond simple location–hour averaging.
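The baseline arithmetic above can be reproduced in a few lines. The sketch below shows why a constant predictor yields a non-positive R2 and recomputes the relative improvements from the reported R2 values; the function names and the toy test values are hypothetical, and only the R2 figures come from the text.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_improvement(model_r2, baseline_r2):
    """Relative R2 gain over a baseline, in percent."""
    return 100.0 * (model_r2 - baseline_r2) / baseline_r2


# Toy test-period targets with mean 25 (hypothetical values).
y_test = [10.0, 20.0, 30.0, 40.0]

# Predicting the test-period mean gives R2 = 0.
r2_test_mean = r_squared(y_test, [25.0] * len(y_test))

# Predicting a different constant (e.g., a training-period mean of 20) gives R2 < 0.
r2_train_mean = r_squared(y_test, [20.0] * len(y_test))

# Relative improvements over the Historical Average baseline (R2 = 0.100).
reported_r2 = {"LightGBM": 0.506, "XGBoost": 0.475, "RF": 0.454}
gains = {name: round(relative_improvement(r2, 0.100)) for name, r2 in reported_r2.items()}
```

Because the baseline R2 is small, the relative gains are large even though the absolute gains are roughly 0.35 to 0.41; both views are worth reporting.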
Spatiotemporal Context Dominates: Feature ablation studies and SHAP analysis provide convergent evidence that when/where an incident occurs explains 24× more variance than what type of incident it is. This finding challenges conventional incident classification systems and suggests context-aware management strategies.
Severity-Dependent Reliability: Model performance varies systematically by impact magnitude. For typical incidents (85% of cases, ), MAE ranges from 12.91 to 37.80%, supporting operational deployment. For extreme incidents (15% of cases, ), MAE reaches 95.91%, indicating that specialized handling is required.
Statistical Rigor Strengthens Claims: Multiple comparison correction (Bonferroni + Holm) and bootstrap uncertainty quantification ensure reported improvements are not statistical artifacts. The study demonstrates that proper temporal validation is critical—random splits would have overestimated performance by ≈15–30% based on the literature.
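The two corrections named above can be sketched as adjusted p-values. This is a minimal illustration of the standard Bonferroni and Holm step-down procedures; the function names and the example p-values are hypothetical and do not come from the study.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjustment, returned in the original order of the inputs."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical p-values from three pairwise model-vs-baseline comparisons.
raw_p = [0.01, 0.04, 0.03]
p_bonferroni = bonferroni(raw_p)  # approximately [0.03, 0.12, 0.09]
p_holm = holm(raw_p)              # approximately [0.03, 0.06, 0.06]
```

Holm's adjusted values are never larger than Bonferroni's, so a comparison that survives Bonferroni also survives Holm.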
Methodological Contribution: We provide a reproducible evaluation framework addressing common pitfalls in transportation forecasting: temporal leakage, insufficient uncertainty reporting, and multiple testing without correction. This methodology is generalizable beyond incident prediction to other ITS applications.
6.2. Practical Implications for Smart Cities
Based on these findings, we recommend a graduated deployment strategy:
Phase 1 (Low-Risk Pilot): Deploy for decision support on low–medium severity incidents (). Human operators should verify all predictions before action. Build operational trust and collect deployment data.
Phase 2 (Expanded Coverage): Extend to high-severity incidents ( 50–100%) with conservative protocols. Implement real-time performance monitoring. Develop intervention feedback loop for causal estimation.
Phase 3 (Full Integration): Integrate with traffic signal control and route guidance systems. Maintain human oversight for extreme events (). Continuous learning from new data.
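The phased strategy implies a simple mapping from predicted impact to handling tier, sketched below. The 50% and 100% boundaries follow the ranges quoted in this paper; treating everything below 50% as the low–medium tier, the handling of the exact boundary values, and the function name are assumptions made for illustration.

```python
def response_tier(predicted_impact_pct):
    """Map a predicted impact (%) to a handling tier.

    Assumed boundaries: below 50% is low-medium, 50-100% is high,
    and above 100% is extreme.
    """
    if predicted_impact_pct > 100:
        return "extreme"      # human oversight maintained in every phase
    if predicted_impact_pct >= 50:
        return "high"         # conservative protocols (Phase 2)
    return "low-medium"       # decision support with operator verification (Phase 1)


# Hypothetical predictions, including a negative deviation (traffic improvement).
tiers = [response_tier(p) for p in (-10.0, 20.0, 75.0, 140.0)]
```

Keeping the thresholds in one function makes it straightforward to tighten them as deployment data accumulates.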
6.3. Future Research Directions
Extreme event modeling: Develop specialized models or physics-hybrid approaches for >100% impacts, potentially using rare event sampling techniques.
Unmeasured confounders: Integrate weather APIs, social media sentiment, traffic camera feeds, and special event calendars.
Causal inference: Implement counterfactual prediction to handle the intervention paradox, in which successful responses make predictions appear inaccurate.
Transfer learning: Validate cross-city generalization through domain adaptation experiments on cities with different characteristics.
Real-time deployment study: Establish an A/B testing framework in a live TMC environment with operator feedback integration.
Explainability enhancement: Develop rule extraction methods to generate interpretable decision trees from ensemble models for safety-critical applications.
Future work should incorporate weather APIs, traffic cameras, and social media sentiment for multimodal data fusion. LSTM/GRU networks that capture within-incident dynamics should be used for temporal evolution modeling. Propensity score matching or difference-in-differences should be used for causal inference. Domain adaptation could enable transfer learning to other cities with limited data. Online learning with concept drift detection should be investigated for real-time deployment. Reinforcement learning could support prescriptive analytics for optimal resource allocation. Transfer learning experiments on cities with contrasting network characteristics (e.g., monocentric European cities, low-density North American grids) are needed before deployment claims can be broadened beyond Istanbul.
6.4. Final Assessment
This work demonstrates that machine learning-based incident impact prediction is scientifically validated and operationally promising, but requires careful deployment with appropriate safeguards. The combination of rigorous temporal validation, comprehensive uncertainty quantification, honest reporting of limitations, and severity-dependent performance analysis provides a realistic foundation for smart city transportation systems to integrate AI-driven incident management while maintaining operational safety and transparency.
The path forward is not the development of an autonomous replacement for human expertise, but an augmented intelligence where ML provides data-driven insights within a framework of human oversight and continuous learning. This balanced approach maximizes the benefits of predictive analytics while acknowledging and mitigating inherent limitations.