# Short-Term Segment-Level Crash Risk Prediction Using Advanced Data Modeling with Proactive and Reactive Crash Data


## Abstract


## 1. Introduction

## 2. Summary of the Literature Review

## 3. Methodological Background

#### 3.1. Random Effect Bayesian Logistic Regression

#### 3.2. Random Forest (RF)

#### 3.3. Gradient Boosting Machine (GBM)

#### 3.4. K-Nearest Neighbor (KNN)

#### 3.5. Gaussian Naïve Bayes (GNB)

#### 3.6. Model Performance Criteria

- TP = True positive value, defined as the number of crash cases in the crash likelihood model (or fatal/injury crashes in crash severity model) that are correctly predicted.
- TN = True negative value, defined as the number of non-crash cases in the crash likelihood model (or PDO crashes in the crash severity model) that are correctly predicted.
- FP = False positive value, defined as the number of non-crash cases that are falsely predicted as crash cases in the crash likelihood model (PDO crashes falsely predicted as fatal/injury crashes in the crash severity model).
- FN = False negative value, defined as the number of crash cases that are falsely predicted as non-crash cases in the crash likelihood model (fatal/injury crashes falsely predicted as PDO crashes in the crash severity model).
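
The four counts defined above feed the accuracy, sensitivity, and specificity values reported in the results tables. The sketch below uses the standard confusion-matrix definitions of those measures; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Return accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one case is required")
    return {
        # Share of all cases (both classes) that are predicted correctly.
        "accuracy": (tp + tn) / total,
        # Share of actual positives (crash, or fatal/injury) predicted correctly.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of actual negatives (non-crash, or PDO) predicted correctly.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical counts, for illustration only (not results from the study).
metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)
```

Reporting sensitivity and specificity alongside accuracy matters for imbalanced data, where a model can reach high accuracy while missing most cases of the rare class.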

## 4. Data Collection and Preparation

#### 4.1. Study Dataset

#### 4.2. Data Integration

#### 4.3. Explanatory Variables

#### 4.4. Generating Non-Crash Cases for the Crash Likelihood Modeling

#### 4.5. Correlation between the Explanatory Variables

#### 4.6. Determination of Significant Variables

#### 4.7. Dealing with the Data Imbalance Problem

- select ${y}^{*}={Y}_{j}$ with probability ${\pi}_{j}$;
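- select $({x}_{k},{y}_{k})$, $k=1,\dots ,n$, such that ${y}_{k}={y}^{*}$, with probability $\frac{1}{{n}_{j}}$, where ${n}_{j}$ is the number of observations in class ${Y}_{j}$;
- sample ${x}^{*}$ from the estimated kernel function centered at the selected ${x}_{k}$.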
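
The three steps above can be sketched as a smoothed bootstrap. This is a minimal illustration, not the implementation used in the study: it assumes a Gaussian kernel with a diagonal, per-class bandwidth chosen by a normal-reference rule, and the function name `rose_sample`, its parameters, and the demo data are all hypothetical.

```python
import numpy as np


def rose_sample(X, y, n_new, class_probs=None, seed=0):
    """Draw n_new synthetic (x*, y*) pairs with a ROSE-style smoothed bootstrap."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    d = X.shape[1]
    if class_probs is None:
        # Assumption: equal class probabilities pi_j, i.e. a balanced sample.
        class_probs = np.full(len(classes), 1.0 / len(classes))

    # Assumption: Gaussian kernel, diagonal bandwidth from a normal-reference rule.
    bandwidths = {}
    for c in classes:
        Xc = X[y == c]
        const = (4.0 / ((d + 2) * len(Xc))) ** (1.0 / (d + 4))
        bandwidths[c] = const * Xc.std(axis=0, ddof=1)

    # Step 1: select the class label y* with probability pi_j.
    y_new = rng.choice(classes, size=n_new, p=class_probs)
    X_new = np.empty((n_new, d))
    for i, c in enumerate(y_new):
        Xc = X[y == c]
        # Step 2: select one observation of that class with probability 1/n_j.
        center = Xc[rng.integers(len(Xc))]
        # Step 3: sample x* from the kernel centered at the selected observation.
        X_new[i] = rng.normal(loc=center, scale=bandwidths[c])
    return X_new, y_new


# Hypothetical two-feature data (e.g., average speed and v/c ratio), 3 vs. 2 cases.
X_demo = np.array([[60.0, 0.5], [62.0, 0.6], [58.0, 0.4], [30.0, 1.1], [28.0, 1.2]])
y_demo = np.array([0, 0, 0, 1, 1])
X_bal, y_bal = rose_sample(X_demo, y_demo, n_new=200)
```

Because new points are drawn around existing observations rather than copied, the minority class is enlarged without exact duplicates, at the cost of depending on the bandwidth choice.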

#### 4.8. Final Preparation of the Training and the Testing Datasets

## 5. Model Implementation

## 6. Results

#### 6.1. Crash Likelihood Models

#### 6.2. Crash Severity Models

#### 6.3. Application of Reactive Data in the Crash Severity Model

## 7. Practical Implications of the Presented Modeling Framework

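$$\text{Risk level}=\begin{cases}\text{Low risk} & \text{if } {P}_{i}\le 0.3\\ \text{Moderate risk} & \text{if } 0.3<{P}_{i}\le 0.6\\ \text{High risk} & \text{if } 0.6<{P}_{i}\le 0.75\\ \text{Extremely high risk} & \text{if } {P}_{i}>0.75\end{cases}$$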
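
The thresholds above translate directly into a lookup on the value P_i. A minimal sketch, assuming P_i is a probability in [0, 1]; the function name `risk_level` is hypothetical.

```python
def risk_level(p: float) -> str:
    """Map a value P_i in [0, 1] to the four risk categories listed above."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if p <= 0.3:
        return "Low risk"
    if p <= 0.6:
        return "Moderate risk"
    if p <= 0.75:
        return "High risk"
    return "Extremely high risk"


# Boundary values fall into the lower category because the upper bounds are inclusive.
levels = [risk_level(p) for p in (0.3, 0.45, 0.75, 0.9)]
```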

## 8. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Darban Khales, S.; Kunt, M.M.; Dimitrijevic, B. Analysis of the impacts of risk factors on teenage and older driver injury severity using random-parameter ordered probit. Can. J. Civ. Eng. **2020**, 47, 1249–1257.
2. Chen, C.; Zhang, G.; Yang, J.; Milton, J.C. An explanatory analysis of driver injury severity in rear-end crashes using a decision table/Naïve Bayes (DTNB) hybrid classifier. Accid. Anal. Prev. **2016**, 90, 95–107.
3. Iranitalab, A.; Khattak, A. Comparison of four statistical and machine learning methods for crash severity prediction. Accid. Anal. Prev. **2017**, 108, 27–36.
4. Reiman, T.; Pietikäinen, E. Leading indicators of system safety–monitoring and driving the organizational safety potential. Saf. Sci. **2012**, 50, 1993–2000.
5. Sarkar, S.; Pramanik, A.; Maiti, J.; Reniers, G. Predicting and analyzing injury severity: A machine learning-based approach using class-imbalanced proactive and reactive data. Saf. Sci. **2020**, 125, 104616.
6. Xu, C.; Tarko, A.P.; Wang, W.; Liu, P. Predicting crash likelihood and severity on freeways with real-time loop detector data. Accid. Anal. Prev. **2013**, 57, 30–39.
7. Theofilatos, A. Incorporating real-time traffic and weather data to explore road accident likelihood and severity in urban arterials. J. Saf. Res. **2017**, 61, 9–21.
8. Yu, R.; Abdel-Aty, M. Utilizing support vector machine in real-time crash risk evaluation. Accid. Anal. Prev. **2013**, 51, 252–259.
9. Wang, L.; Shi, Q.; Abdel-Aty, M. Predicting crashes on expressway ramps with real-time traffic and weather data. Transp. Res. Rec. **2015**, 2514, 32–38.
10. Theofilatos, A.; Chen, C.; Antoniou, C. Comparing machine learning and deep learning methods for real-time crash prediction. Transp. Res. Rec. **2019**, 2673, 169–178.
11. Guo, M.; Zhao, X.; Yao, Y.; Yan, P.; Su, Y.; Bi, C.; Wu, D. A study of freeway crash risk prediction and interpretation based on risky driving behavior and traffic flow data. Accid. Anal. Prev. **2021**, 160, 106328.
12. Wang, C.; Liu, L.; Xu, C.; Lv, W. Predicting future driving risk of crash-involved drivers based on a systematic machine learning framework. Int. J. Environ. Res. Public Health **2019**, 16, 334.
13. Sameen, M.I.; Pradhan, B. Severity prediction of traffic accidents with recurrent neural networks. Appl. Sci. **2017**, 7, 476.
14. Lin, C.; Wu, D.; Liu, H.; Xia, X.; Bhattarai, N. Factor identification and prediction for teen driver crash severity using machine learning: A case study. Appl. Sci. **2020**, 10, 1675.
15. Zhang, J.; Li, Z.; Pu, Z.; Xu, C. Comparing prediction performance for crash injury severity among various machine learning and statistical methods. IEEE Access **2018**, 6, 60079–60087.
16. Wahab, L.; Jiang, H. A comparative study on machine learning based algorithms for prediction of motorcycle crash severity. PLoS ONE **2019**, 14, e0214966.
17. Kim, S.; Lym, Y.; Kim, K.-J. Developing crash severity model handling class imbalance and implementing ordered nature: Focusing on elderly drivers. Int. J. Environ. Res. Public Health **2021**, 18, 1966.
18. Fiorentini, N.; Losa, M. Handling imbalanced data in road crash severity prediction by machine learning algorithms. Infrastructures **2020**, 5, 61.
19. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Wadsworth and Brooks: Monterey, CA, USA, 1984.
20. Breiman, L. Some Infinity Theory for Predictor Ensembles; Technical Report. Report No. 579; Statistics Department, University of California: Berkeley, CA, USA, 2000.
21. Nicodemus, K.K. Letter to the editor: On the stability and ranking of predictors from random forest variable importance measures. Brief. Bioinform. **2011**, 12, 369–373.
22. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin, Germany, 2006.
23. Cigdem, A.; Ozden, C. Predicting the severity of motor vehicle accident injuries in Adana-Turkey using machine learning methods and detailed meteorological data. Int. J. Intell. Syst. Appl. Eng. **2018**, 6, 72–79.
24. Shanthi, S.; Ramani, R.G. Classification of vehicle collision patterns in road accidents using data mining algorithms. Int. J. Comput. Appl. **2011**, 35, 30–37.
25. Stafford, B. Pysolar. 2021. Available online: https://media.readthedocs.org/pdf/pysolar/latest/pysolar.pdf (accessed on 8 November 2021).
26. Li, X.; Cai, B.Y.; Qiu, W.; Zhao, J.; Ratti, C. A novel method for predicting and mapping the occurrence of sun glare using Google Street View. Transp. Res. Part C Emerg. Technol. **2019**, 106, 132–144.
27. Ahmed, M.M.; Abdel-Aty, M.A. The viability of using automatic vehicle identification data for real-time crash prediction. IEEE Trans. Intell. Transp. Syst. **2011**, 13, 459–468.
28. Menardi, G.; Torelli, N. Training and assessing classification rules with imbalanced data. Data Min. Knowl. Discov. **2014**, 28, 92–122.
29. Tibshirani, R.J.; Efron, B. An Introduction to the Bootstrap. In Monographs on Statistics and Applied Probability; CRC Press: Boca Raton, FL, USA, 1993; Volume 57, pp. 1–436.
30. Bowman, A.W.; Azzalini, A. Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations; Oxford University Press: Oxford, UK, 1997.
31. Gelman, A. Objections to Bayesian statistics. Bayesian Anal. **2008**, 3, 445–449.
32. Lunn, D.; Jackson, C.; Best, N.; Thomas, A.; Spiegelhalter, D. The BUGS Book: A Practical Introduction to Bayesian Analysis; CRC Press: Boca Raton, FL, USA, 2012.
33. StataCorp, LLC. Stata Bayesian Analysis Reference Manual; StataCorp, LLC.: College Station, TX, USA, 2017.
34. Kuhn, M.; Wing, J.; Weston, S.; Williams, A.; Keefer, C.; Engelhardt, A.; Cooper, T.; Mayer, Z.; Benesty, M.; Lescarbeau, R.; et al. Package ‘caret’. R J. **2020**, 223, 7.
35. Opitz, D.; Maclin, R. Popular ensemble methods: An empirical study. J. Artif. Intell. Res. **1999**, 11, 169–198.
36. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2009.
37. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. **2001**, 29, 1189–1232.
38. Dingus, T.A.; Klauer, S.G.; Neale, V.L.; Petersen, A.; Lee, S.E.; Sudweeks, J.; Perez, M.A.; Hankey, J.; Ramsey, D.; Gupta, S.; et al. The 100-Car Naturalistic Driving Study, Phase II-Results of the 100-Car Field Experiment; Department of Transportation, National Highway Traffic Safety Administration: Washington, DC, USA, 2006.

**Figure 2.** (**a**) The geometric model of the relative position of the Sun and the vehicle, (**b**) vehicle’s position vs. sun azimuth, (**c**) vehicle’s position vs. sun elevation (Source: [26]).

**Figure 3.** RF variable importance plot for: (**a**) the crash likelihood model, (**b**) the crash severity model.

**Figure 4.** Standardized average speed vs. standardized v/c ratio in the crash injury severity dataset: before ROSE (**left**) and after ROSE (**right**).

| Characteristic | I-287 | I-80 | Total |
|---|---|---|---|
| Number of crashes (total) | 1267 | 8888 | 10,155 |
| Number of injury/fatal crashes | 236 | 1903 | 2139 |
| Number of PDO crashes | 1031 | 6985 | 8016 |
| Roadway length (miles) | 67.5 | 68.5 | 136 |
| Number of roadway segments (both ways) | 116 | 164 | 280 |
| Minimum length of a roadway segment (miles) | 0.020 | 0.100 | 0.020 |
| Maximum length of a roadway segment (miles) | 5.140 | 4.020 | 5.140 |
| Average length of a roadway segment (miles) | 1.218 | 0.936 | 1.053 |

| Variable | Min | Max | Mean | Median |
|---|---|---|---|---|
| CAPLINK | 3268 | 8570 | 6138 | 6856 |
| VC_RATIO | 0.032 | 1.599 | 0.600 | 0.577 |
| Vol16_Tr | 0.032 | 1.450 | 0.576 | 0.554 |
| HourlyPrecipitation | 0.000 | 0.720 | 0.002 | 0.000 |
| HourlyVisibility | 0.000 | 74.00 | 8.898 | 10.00 |
| speed_avg_1015 | 2.00 | 83.00 | 61.56 | 64.80 |
| speed_sd_1015 | 0.00 | 25.23 | 1.29 | 0.89 |
| speedup_dif_1015 | 0.00 | 63.00 | 8.72 | 6.20 |
| speeddown_dif_1015 | 0.00 | 63.00 | 8.23 | 5.80 |

| Variable | Type | Class | Source | Description |
|---|---|---|---|---|
| LANES | Categorical | Proactive | NJCMS | Number of lanes (the values are: 2, 3, 4, or 5) |
| Hour | Categorical | Proactive | NJCR | Time of the crash or non-crash event (hour of the day) |
| Month | Categorical | Proactive | NJCR | Time of the crash or non-crash event (month of the year) |
| MEDIAN_TY | Binary | Proactive | NJCR | Median type (protected or non-protected) |
| Weekend | Binary | Proactive | NJCR | Day of the week (weekend or weekday) |
| Sun glare | Binary | Proactive | NJCMS | The effect of sun glare (0 = no sun glare, 1 = sun glare present) |
| CAPLINK | Continuous | Proactive | NJCMS | Link capacity [vehicles/hour] |
| VC_RATIO | Continuous | Proactive | NJCMS | Volume-to-capacity ratio at the highway section during a given hour of the day and month [unitless] |
| Vol16_Tr | Continuous | Proactive | NJCMS | Hourly truck volume ratio |
| HourlyPrecipitation | Continuous | Proactive | NWS | Hourly precipitation at the highway section during the hour of the crash or non-crash event, obtained from the weather records for the closest weather station [inches/hour] |
| HourlyVisibility | Continuous | Proactive | NWS | Hourly visibility at the highway section during the hour of the crash or non-crash event, obtained from the weather records for the closest weather station [miles] |
| speed_avg_1015 | Continuous | Proactive | RITIS | Average speed at the highway section [miles/hour]. It is calculated for each crash and non-crash event as an average of 1 min prevailing speeds for the pertinent highway section over a 10 min period (5–15 min) prior to the crash or non-crash event. |
| speed_sd_1015 | Continuous | Proactive | RITIS | Standard deviation of speed at the crash location [miles/hour]. It is calculated as a standard deviation of 1 min prevailing speeds for the pertinent highway section over a 10 min period (5–15 min) prior to the crash or non-crash event. |
| speedup_sd_1015 | Continuous | Proactive | RITIS | Standard deviation of speed at the upstream highway section [miles/hour]. It is calculated as a standard deviation of 1 min prevailing speeds for the pertinent highway section over a 10 min period (5–15 min) prior to the crash or non-crash event. |
| speeddown_sd_1015 | Continuous | Proactive | RITIS | Standard deviation of speed at the downstream highway section [miles/hour]. It is calculated as a standard deviation of 1 min prevailing speeds for the pertinent highway section over a 10 min period (5–15 min) prior to the crash or non-crash event. |
| speedup_dif_1015 | Continuous | Proactive | RITIS | Deviation of speed from the speed limit [miles/hour] at the upstream roadway segment. Calculated as the difference between the average speed (speed_avg) and the speed limit (obtained for the upstream roadway segment from the NJCMS dataset). |
| speeddown_dif_1015 | Continuous | Proactive | RITIS | Deviation of speed from the speed limit [miles/hour] at the downstream roadway segment. Calculated as the difference between the average speed (speed_avg) and the speed limit (obtained for the downstream roadway segment from the NJCMS dataset). |
| Shape_Leng | Continuous | Proactive | RITIS | Length of the highway segment [miles] |
| Age | Categorical | Reactive | NJCR | Driver’s age in years (the classes are defined as: age ≤ 25, 25 < age ≤ 60, and age > 60) |
| Veh_age | Categorical | Reactive | NJCR | Vehicle age in years (the classes are defined as: 0 < age ≤ 5, 5 < age ≤ 10, and age > 10) |
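
The speed variables in the table above are built from 1 min prevailing speeds in the window 5 to 15 min before each event. The sketch below illustrates that windowing under stated assumptions: speeds are keyed by integer minute, the window covers minutes t−15 through t−6 (ten readings), and the sample standard deviation is used. The function name `speed_window_features` and the demo readings are hypothetical, not the study's data pipeline.

```python
from statistics import mean, stdev


def speed_window_features(speeds_by_minute, event_minute, start_offset=15, end_offset=5):
    """Return (average, standard deviation) of 1 min speeds in the pre-event window."""
    # Assumption: the window spans minutes [t - 15, t - 5), i.e. ten 1 min readings.
    window = [
        speeds_by_minute[m]
        for m in range(event_minute - start_offset, event_minute - end_offset)
        if m in speeds_by_minute  # skip minutes with no recorded speed
    ]
    if len(window) < 2:
        raise ValueError("need at least two speed readings in the window")
    # Assumption: sample (n - 1) standard deviation; the source does not specify which.
    return mean(window), stdev(window)


# Hypothetical readings alternating between 60 and 61 mph, for illustration only.
speeds = {m: 60.0 + (m % 2) for m in range(100, 131)}
speed_avg, speed_sd = speed_window_features(speeds, event_minute=130)
```

Ending the window 5 min before the event keeps the features free of speeds that may already reflect the crash itself or uncertainty in the reported crash time.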

| Models/Corresponding Classes | Training Dataset (Before ROSE) | Training Dataset (After ROSE) | Testing Dataset |
|---|---|---|---|
| Crash Severity Dataset | 7616 | 12,719 | 2539 |
| PDO Crashes | 6009 | 6614 | 2003 |
| Injury/Fatal Crashes | 1607 | 6105 | 536 |

| Variables | Mean | Std. Err | 95% BCI |
|---|---|---|---|
| speed_sd_1015 | 0.069 | 0.022 | (0.028, 0.111) |
| speed_avg_1015 | 0.32 | 0.024 | (0.415, 0.295) |
| HourlyPrecipitation | 0.125 | 0.033 | (0.082, 0.179) |
| HourlyVisibility | −0.118 | 0.026 | (−0.167, −0.071) |
| V_ratio | −0.138 | 0.027 | (−0.192, −0.080) |
| Constant | −0.148 | 0.026 | (−0.206, 0.103) |

| Model | Hyperparameters | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|---|
| RBLR | Not applicable | 0.67 | 0.53 | 0.77 | 0.67 |
| RF | mtry = 4, split rule = gini, node size = 1, sample size = full training set | 0.72 | 0.65 | 0.75 | 0.70 |
| GBM | ntree = 250, interaction.depth = 1, shrinkage = 0.1, n.minobsinnode = 10 | 0.64 | 0.53 | 0.75 | 0.66 |
| GNB | Not applicable | 0.63 | 0.52 | 0.74 | 0.64 |
| KNN | K = 13 | 0.58 | 0.50 | 0.65 | 0.61 |

| Variables | Mean | Std. Err | 95% BCI |
|---|---|---|---|
| Speed_avg_1015 | 0.2 | 0.08 | (0.05, 0.4) |
| HourlyVisibility | 0.02 | 0.009 | (0.001, 0.03) |
| Sunglare | 0.01 | 0.006 | (0.005, 0.02) |
| Constant | 0.02 | 0.001 | (0.004, 0.04) |

| Model | Hyperparameters | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|---|
| RBLR | Not applicable | 0.67 | 0.41 | 0.73 | 0.59 |
| RF | mtry = 16, split rule = gini, node size = 1, sample size = full training set | 0.68 | 0.46 | 0.72 | 0.61 |
| GBM | ntree = 250, interaction.depth = 5, shrinkage = 0.1, n.minobsinnode = 10 | 0.80 | 0.08 | 0.96 | 0.58 |
| GNB | Not applicable | 0.67 | 0.41 | 0.73 | 0.58 |
| KNN | K = 5 | 0.61 | 0.40 | 0.66 | 0.55 |

**Table 9.** Summary statistics of crash records considering driver age and vehicle age, and results of the driver injury severity models.

| Group # | Variable | N | % | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|---|---|---|
| 1 | DrAge ^{1} < 25 and VehAge ^{2} < 5 | 1096 | 8.72 | 0.61 | 0.60 | 0.61 | 0.62 |
| 2 | DrAge < 25 and 5 ≤ VehAge < 10 | 627 | 4.99 | 0.58 | 0.66 | 0.55 | 0.63 |
| 3 | DrAge < 25 and 10 ≤ VehAge | 716 | 5.69 | 0.62 | 0.55 | 0.63 | 0.66 |
| 4 | 25 ≤ DrAge < 70 and VehAge < 5 | 5927 | 47.17 | 0.68 | 0.54 | 0.79 | 0.68 |
| 5 | 25 ≤ DrAge < 70 and 5 ≤ VehAge < 10 | 1991 | 15.84 | 0.69 | 0.52 | 0.77 | 0.64 |
| 6 | 25 ≤ DrAge < 70 and 10 ≤ VehAge | 1832 | 14.57 | 0.63 | 0.55 | 0.72 | 0.64 |
| 7 | 70 ≤ DrAge and VehAge < 5 | 198 | 1.58 | 0.68 | 0.42 | 0.89 | 0.66 |
| 8 | 70 ≤ DrAge and 5 ≤ VehAge < 10 | 89 | 0.71 | 0.65 | 0.52 | 0.76 | 0.66 |
| 9 | 70 ≤ DrAge and 10 ≤ VehAge | 90 | 0.72 | 0.62 | 0.52 | 0.67 | 0.61 |
| Average | | | | 0.64 | 0.54 | 0.71 | 0.64 |

^{1}: DrAge = driver age; ^{2}: VehAge = vehicle age.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Dimitrijevic, B.; Khales, S.D.; Asadi, R.; Lee, J.
Short-Term Segment-Level Crash Risk Prediction Using Advanced Data Modeling with Proactive and Reactive Crash Data. *Appl. Sci.* **2022**, *12*, 856.
https://doi.org/10.3390/app12020856
