# The Prediction of Road-Accident Risk through Data Mining: A Case Study from Setubal, Portugal

## 1. Introduction

## 2. Related Work

#### 2.1. The Classical Approach

#### 2.2. The Deep Learning Approach

#### 2.3. Other Relevant Works

## 3. Theoretical Framework

## 4. Results and Discussion

#### 4.1. Dataset

#### 4.2. Selection of Attributes

#### 4.3. Data Mining

## 5. Conclusions

**Figure 1.** Box plots of the frequency of accidents that occurred in the different time intervals (according to the TIME field of Table 3).

**Figure 4.** Number of monthly accidents before COVID-19 (2019) and during the COVID-19 pandemic (2020 and 2021).

**Figure 5.** Cramér's V correlations and Kruskal-Wallis test for the most relevant pairs of variables.

Category | Factors |
---|---|
Weather conditions | Precipitation, temperature, wind force |
Human behavior | Seat belt use, cell phone use, alcohol consumption calendar |
Road conditions | Road networks, luminosity, road identification, traffic volume |

Values of Cramér's V Coefficient, ${\phi}_{\mathrm{c}}$ | Interpretation |
---|---|
[0.25; 1.00] | Very Strong |
[0.15; 0.25] | Strong |
[0.10; 0.15] | Moderate |
[0.05; 0.10] | Weak |
[0; 0.05] | Very Weak |
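As an illustration of how the coefficient and its interpretation bands fit together, the following minimal sketch computes Cramér's V from a contingency table using the standard formula $\phi_c = \sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}$ and maps the result to the labels in the table above. The function names, the example counts, and the choice to assign shared interval endpoints to the stronger band are assumptions made for this sketch, not details taken from the study.

```python
import math

def cramers_v(table):
    """Cramér's V for an r x c contingency table (list of rows of counts).

    Assumes every row total and column total is non-zero.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))

def interpret_cramers_v(v):
    # Assumption: a value on a shared endpoint goes to the stronger band.
    if v >= 0.25:
        return "Very Strong"
    if v >= 0.15:
        return "Strong"
    if v >= 0.10:
        return "Moderate"
    if v >= 0.05:
        return "Weak"
    return "Very Weak"

# Hypothetical counts (not from the case-study dataset):
# rows = weather (good, rain), columns = accident type (damage only, with injured).
example = [[60, 20], [15, 25]]
v = cramers_v(example)
print(round(v, 3), interpret_cramers_v(v))
```

With these thresholds, even a coefficient that looks small in absolute terms (for example 0.2) is read as a strong association, which is worth keeping in mind when comparing against other correlation scales.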

Attribute | Type/Format of Data |
---|---|
Identification of accident | Serial number |
Date | dd/mm/yyyy |
Time | {Morning, morning work, morning rush hours, lunch break, afternoon work, afternoon rush hours, night} |
Type of local | {Motorway, itineraries or national roads, village roads} |
Localization | {Urban location, non-urban location} |
Type of accident | {Damage only, with injured} |
Day of the week | {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday} |
Holiday | Boolean |
Alcohol | Numerical with 2 decimal digits (g/L) |
Administrative offenses | Numerical with 2 decimal digits |
Weather conditions | {Good weather, fog, rain, strong wind, hail, smoke cloud} |
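Table 4 lists individual category values (such as rain or afternoon rush hours) as features, which suggests that categorical attributes like these are expanded into one indicator column per value before modeling. The sketch below shows one way to do that for two of the attributes above. The `one_hot` helper and the sample record are hypothetical, and the encoding itself is an assumption rather than a documented step of the study.

```python
# Category values copied from the attribute table above.
TIME_SLOTS = [
    "Morning", "morning work", "morning rush hours", "lunch break",
    "afternoon work", "afternoon rush hours", "night",
]
WEATHER = ["Good weather", "fog", "rain", "strong wind", "hail", "smoke cloud"]

def one_hot(value, categories):
    """Return a dict with one 0/1 indicator per category (hypothetical helper)."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {category: int(category == value) for category in categories}

# Hypothetical accident record, not taken from the dataset.
record = {"Time": "afternoon rush hours", "Weather conditions": "rain"}

features = {
    **one_hot(record["Time"], TIME_SLOTS),
    **one_hot(record["Weather conditions"], WEATHER),
}
print(features)
```

Each record then contributes exactly one active indicator per categorical attribute, which is the form in which a feature-selection algorithm can rank single values such as "rain" or "Friday".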

**Table 4.** Relevance of features for the creation of predictive models obtained with RBA and SBS for incidents that occur on motorways.

Motorways | Considered Relevant by Both Algorithms | Considered Irrelevant by Both Algorithms |
---|---|---|
RBA & SBS | Rain, morning work, afternoon rush hours, Friday, Saturday, August, February | Sunday |

Classes | Range of the Number of Accidents |
---|---|
Low Risk | <1.5 |
Medium Risk | ≥1.5 and <2.5 |
High Risk | ≥2.5 |
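The class boundaries above translate directly into a small lookup. The following sketch encodes them; the function name `risk_class` is an assumption introduced here for illustration.

```python
def risk_class(n_accidents: float) -> str:
    """Map a (possibly fractional) number of accidents to a risk class."""
    if n_accidents < 1.5:
        return "Low Risk"
    if n_accidents < 2.5:
        return "Medium Risk"
    return "High Risk"

for value in (1.0, 1.5, 2.49, 2.5):
    print(value, risk_class(value))
```

Because the thresholds sit at 1.5 and 2.5, a continuous prediction is effectively rounded to the nearest of 1, 2, or 3 or more accidents.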

Algorithm | MAE (Distance) | Accuracy (%) |
---|---|---|
kNN | 0.74 | 56% |
Linear Regression | 0.63 | 57% |
Lasso Regression | 0.60 | 54% |
Ridge Regression | 0.61 | 52% |
Decision Tree | 0.69 | 56% |
Neural Network | 0.57 | 89% |
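The result tables report two metrics per algorithm: a mean absolute error (MAE) and an accuracy. One plausible reading, sketched below, is that MAE is computed on the predicted number of accidents, while accuracy is computed after both the observed and the predicted values are mapped to the risk classes defined earlier. This reading is an assumption, as are the function names and the sample values; the source does not spell out the exact procedure.

```python
def risk_class(n_accidents):
    # Thresholds taken from the risk-class table.
    if n_accidents < 1.5:
        return "Low Risk"
    if n_accidents < 2.5:
        return "Medium Risk"
    return "High Risk"

def mean_absolute_error(y_true, y_pred):
    """Average absolute distance between observed and predicted counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def class_accuracy(y_true, y_pred):
    """Share of cases whose predicted risk class matches the observed one."""
    hits = sum(risk_class(t) == risk_class(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Hypothetical observations and regression outputs (not from the study).
y_true = [1, 2, 3, 1]
y_pred = [1.2, 2.6, 2.9, 1.7]

print(mean_absolute_error(y_true, y_pred))
print(class_accuracy(y_true, y_pred))
```

Under this reading the two metrics can diverge: a model with a modest MAE can still score low on accuracy if many of its predictions fall just across a class boundary.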

Algorithm | MAE (Distance) | Accuracy (%) |
---|---|---|
kNN | 0.30 | 81% |
Linear Regression | 0.27 | 86% |
Lasso Regression | 0.28 | 86% |
Ridge Regression | 0.28 | 80% |
Decision Tree | 0.31 | 76% |
Neural Network | 0.55 | 87% |

Algorithm | MAE (Distance) | Accuracy (%) |
---|---|---|
kNN | 0.93 | 48% |
Linear Regression | 0.85 | 50% |
Lasso Regression | 0.80 | 51% |
Ridge Regression | 0.79 | 50% |
Decision Tree | 0.91 | 55% |
Neural Network | 0.52 | 88% |

Algorithm (Regression with Neural Network) | MAE (Distance) | Accuracy (%) |
---|---|---|
General model | 0.49 | 88% |
Motorways (9.3% of total accidents) | 0.57 | 89% |
Itineraries or national roads (30% of total accidents) | 0.55 | 87% |
Village roads (60.7% of total accidents) | 0.52 | 88% |
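To compare the general model with the three road-specific models on an equal footing, the per-road figures can be combined using the accident shares given in the table. The sketch below does this arithmetic. Weighting by accident share is an assumption made here for illustration, not a calculation reported in the source.

```python
# (share of total accidents, MAE, accuracy in %) copied from the table above.
per_road = {
    "Motorways": (0.093, 0.57, 89),
    "Itineraries or national roads": (0.300, 0.55, 87),
    "Village roads": (0.607, 0.52, 88),
}

weighted_mae = sum(share * mae for share, mae, _ in per_road.values())
weighted_acc = sum(share * acc for share, _, acc in per_road.values())

print(f"share-weighted MAE: {weighted_mae:.3f}")        # general model: 0.49
print(f"share-weighted accuracy: {weighted_acc:.1f}%")  # general model: 88%
```

The share-weighted MAE comes out at about 0.53 against 0.49 for the general model, while the weighted accuracy (about 87.8%) is essentially the same as the general model's 88%.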

Type of Road | Percentage of Accidents Involving Injuries or Deaths | No. of Injured/Dead per Accident |
---|---|---|
Motorway | 25.1% | 1.7 |
Village roads | 17.3% | 1.2 |
Itineraries or national roads | 29.8% | 1.42 |

