Comparison of Statistical and Machine-Learning Models on Road Traffic Accident Severity Classification

Abstract: Portugal has the sixth highest road fatality rate among European Union members. This is a problem of different dimensions with serious consequences in people’s lives. This study analyses daily data from police and government authorities on road traffic accidents that occurred between 2016 and 2019 in a district of Portugal. This paper looks for the determinants that contribute to the existence of victims in road traffic accidents, as well as the determinants for fatalities and/or serious injuries in accidents with victims. We use logistic regression models, and the results are compared to the machine-learning model results. For the severity model, where the response variable indicates whether only property damage or casualties resulted in the traffic accident, we used a large sample with a small imbalance. For the serious injuries model, where the response variable indicates whether or not there were victims with serious injuries and/or fatalities in the traffic accident with victims, we used a small sample with very imbalanced data. Empirical analysis supports the conclusion that, with a small sample of imbalanced data, machine-learning models generally do not perform better than statistical models; however, they perform similarly when the sample is large and has a small imbalance.


Introduction
In recent years, there has been a growing demand for public and freight transportation across the world, leading to an increase in the volume of road traffic. Studies on this matter also provide some valuable insights into the reasons for traffic accidents in many countries [1]. Road traffic crashes are one of the major social problems of modern societies, not only because of the high number of victims but also because of the high associated costs.
In 2016, road traffic injuries were the eighth leading cause of death worldwide and are predicted to become the seventh leading cause of death by 2030 [2]. Moreover, road traffic costs represent about 1-3% of gross domestic product (GDP) worldwide [2]. In 2019, Portugal recorded the sixth highest rate of road fatalities among the 27 members of the European Union (EU), with 16 more fatalities per million inhabitants than the EU as a whole [3].
Beyond the impact caused by fatalities, in Portugal road traffic accidents have an economic and social impact equivalent to 1.2% of GDP, i.e., EUR 2.3 billion [4]. A better understanding of factors affecting the injury severity is fundamental to implementing appropriate strategies to improve road safety. In recent years, several methodological approaches have been used to analyse road traffic accident data.
Some of the statistical models that have been proposed to study crash-injury severities include binary logit, binary probit, Bayesian ordered probit, Bayesian hierarchical binomial logit, generalized ordered logit, log-linear model, mixed generalized ordered logit, multinomial logit, multivariate probit, ordered logit and ordered probit [5,6]. Several authors indicated the limitations to statistical modelling since the models make assumptions about the distribution of data and predefine the relationship between the dependent variable and the explanatory variables [7,8].
With the advances in computing methods, machine-learning-based models have emerged as promising tools in road safety research to overcome the limitations of statistical methods [9], namely, having higher adaptability to process not only outliers but also noisy and absent data. Machine-learning methods are mostly used as prediction tools, while statistical models are more frequently used in crash severity modelling as explanatory models [10].
To overcome the disadvantage of not providing an explicit relation between dependent variables and the explanatory ones, machine-learning typically adopts feature-based sensitivity analysis. Several studies revealed that crash injury severity is influenced by the driver attributes, vehicle features, crash characteristics, circumstances, etc. [6,11-15].
This paper analyses the daily data from the Statistical Bulletin of Road Traffic Accidents (BEAV) about accidents that occurred between 1 January 2016 and 31 December 2019, in the district of Setúbal (Portugal) in the areas of jurisdiction of the Territorial Command of the Guarda Nacional Republicana (CT-GNR) of Setúbal. The data was collected and validated by the CT-GNR of Setúbal, complemented by Autoridade Nacional de Segurança Rodoviária (ANSR) for 30-day victims, by Infraestruturas de Portugal for road characteristics and by Instituto Português do Mar e da Atmosfera (IPMA) for meteorological data.
This paper is organized as follows. Section 2 presents the study area, data description and the statistical methods used in the paper. Section 3 presents the results of the logistic regression and machine-learning methods, for severity and serious injury models. Section 4 discusses the results obtained, and the main conclusions of the paper are presented in Section 5.

Study Area
Setúbal is the eighth largest district in Portugal with a land area of 5064 km² divided into 13 municipalities and six protected natural areas. It houses many residents who commute daily to Lisbon, creating a high-density population with high traffic flow, concentrated mainly in the upper part of the district, which contrasts with the rest of the district with more agricultural areas, lower population density and rural roads with low traffic flow. The district is crossed by important access roads to Lisbon, Algarve (South of Portugal) and the Alentejo coast, in addition to containing important tourist spots, such as Sesimbra and Costa da Caparica, which increase the traffic flow during the summer holidays and weekends.
This district contains approximately 293 km of National Road (EN-Estrada Nacional), 219 km of Highway (AE-Autoestrada), 19 km of Principal Itinerary (IP-Itinerário Principal), 90 km of Complementary Itinerary (IC-Itinerário Complementar) and the bridges Vasco da Gama and 25 de Abril that cross the Tagus River in Lisbon, the capital of Portugal. The CT-GNR of Setúbal has a jurisdiction area of approximately 96% of this territory, including responsibility for the Vasco da Gama Bridge.
Between 2016 and 2019, the district of Setúbal was one of the Portuguese districts with the highest number of fatalities as a consequence of road traffic accidents but was not among the ones with the highest number of road traffic accidents.

Data
In Portugal, whenever the police entities GNR and Public Security Police (PSP) become aware of the occurrence of a traffic accident, these entities fill out the BEAV. This is a statistical notation instrument that aims to characterize, as faithfully as possible, the circumstances in which the road traffic accidents occurred, as well as the people and vehicles involved in the accident [16].
The BEAV is divided into two distinct parts [16]: (1) to be filled in all accidents and (2) to be filled only in accidents with victims. The first part contains the essential elements to identify the road traffic accident and general information about vehicles, drivers and the number of victims. When a road traffic accident with only material damage occurs, only this part of the BEAV is filled in. The second part aims to describe the surroundings of road traffic accidents with victims, collecting detailed information on the nature of the accident, vehicles, drivers and people involved in the accident.
ANSR updates the information about the injuries of the victims 30 days after the road traffic accident. The severity of injuries of the victims, within 30 days of the occurrence of the accident, is classified as [16]:
• Fatal: victim who dies.
• Serious injury: victim whose bodily injury requires hospitalization for more than 24 h and who does not die within 30 days of the accident.
• Minor injury: victim whose bodily injury does not require hospitalization, or whose hospitalization is less than 24 h and who does not die within 30 days of the accident.
Through IPMA, it was possible to obtain meteorological information at the time and place of the accident, namely the temperature, wind velocity, precipitation volume and humidity measured at the meteorological station closest to the accident. The weather information was collected by the project team for the hour before and the hour after the accident.
A database was created with the 28,103 road traffic accidents reported between 2016 and 2019 in the district of Setúbal. These accidents involved 50,726 vehicles, 49,747 drivers and 8273 victims with injuries. The worst injury severity observed in the accident is distributed as follows: no injury (78.63%); minor injury (19.34%); serious injury (1.45%); and fatal injury (0.58%). The database contains different types of variables that were dispersed across the several data sources and are related to:
• Accidents: county, accident location, type of accident, type and name of the road, type of roadside, type of lane, road conservation state, the existence of works on the road, the existence of light signals, the existence of pavement marks, the existence and type of damage on the road, existence of nearby health facilities, total and type of victims, driver escaping from the location of the accident, causes of the accident and date and time of the accident.
The description of the accident location was checked against GPS coordinates. In the cases where differences were detected, the CT-GNR validated the information. In a few cases, the date of birth of the drivers and victims and the year of registration of the vehicles were incorrectly recorded. Whenever possible the information was corrected, for example, in cases where the year was registered with only two digits instead of four; in the remaining cases, the value was treated as missing (NA). No imputation of missing values was made.
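The two-digit year correction mentioned above can be illustrated with a minimal sketch. The source does not state the rule that was applied, so the pivot used here (a two-digit year is placed in the most recent century that does not fall after the accident year) and the lower bound of 1900 are assumptions, and `repair_year` is a hypothetical helper name.

```python
from typing import Optional


def repair_year(raw: str, accident_year: int) -> Optional[int]:
    """Return a four-digit year, or None (NA) when the value cannot be repaired.

    Assumption (not from the source): a two-digit year is mapped to the most
    recent century that does not place it after the accident year.
    """
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        return None  # non-numeric entries stay missing
    value = int(text)
    if len(text) == 4:
        # Assumed plausibility bounds: 1900 up to the accident year.
        return value if 1900 <= value <= accident_year else None
    if len(text) == 2:
        century = (accident_year // 100) * 100
        candidate = century + value
        if candidate > accident_year:
            candidate -= 100  # e.g. "85" seen in 2017 becomes 1985
        return candidate
    return None  # any other length is treated as NA, never imputed
```

Values that cannot be repaired are returned as `None`, mirroring the choice to leave them as NA rather than impute them.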
In the data processing stage, the data is cleaned and prepared so that it can be adequately used by the machine-learning algorithms. Such a stage involves handling null values, encoding values and assigning types to variables among other standard techniques.

Statistical and Machine-Learning Models
Facing a large dataset, not only in the number of road traffic accidents but also in the number of variables used, a spatial analysis of the accidents was performed first. The objective was to group the municipalities with the same vulnerability for the occurrence of a road traffic accident, thereby reducing the number of categories from the initial 13 municipalities. This spatial analysis was incorporated in the statistical models, reducing the number of coefficients needed for the model fitting.
Logistic regression was used to identify some determinants for the existence of: (i) injured victims, within 30 days, in road traffic accidents (severity model) and (ii) fatalities and/or serious injuries (excluding minor injuries), within 30 days, in road traffic accidents with victims (serious injuries model), in the district of Setúbal.
For the severity model, the response variable was defined as y = 1 if the road traffic accident had victims, and y = 0 if the road accident had only property damage. For the serious injuries model, the response variable was defined as y = 1 if the road traffic accident had victims with serious or fatal injuries, and y = 0 if the road accident had victims with minor injuries. The logistic regression models were fitted according to the approach suggested by Hosmer and Lemeshow [17].
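As a minimal sketch of the two response definitions above, the coding can be written as two small functions. The string labels for the worst injury observed in an accident are hypothetical; the source only defines the y = 1 and y = 0 cases.

```python
from typing import Optional

# Hypothetical labels for the worst injury observed in an accident.
NO_INJURY, MINOR, SERIOUS, FATAL = "none", "minor", "serious", "fatal"


def severity_response(worst: str) -> int:
    """Severity model: y = 1 if the accident had victims, y = 0 if property damage only."""
    return 0 if worst == NO_INJURY else 1


def serious_injuries_response(worst: str) -> Optional[int]:
    """Serious injuries model: defined only for accidents with victims.

    y = 1 for serious or fatal injuries, y = 0 for minor injuries;
    accidents without victims are outside this model and return None.
    """
    if worst == NO_INJURY:
        return None
    return 1 if worst in (SERIOUS, FATAL) else 0
```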
The explanatory variables considered were those defined in the previous section. To obtain a parsimonious multivariate model only the variables that were significant at 0.05 in the univariate analysis were considered and the interactions were considered significant at a 0.001 significance level (in order to obtain a simpler model and to avoid too many scenarios that are difficult to interpret and to understand).
In addition, we conducted an evaluation of the functional form of continuous variables through the LOWESS method and fractional polynomials, a residual analysis to search for influential observations and outliers and model validation through bootstrap. The goodness of fit was tested using the le Cessie-van Houwelingen and the Hosmer-Lemeshow tests.
Machine-learning techniques were applied to the same response and explanatory variables used in logistic regression. The following supervised learning algorithms were used: random forest, Naive Bayes, Support Vector Machine, K-Nearest Neighbours and decision trees with the C5.0 algorithm. Random forest is one of the most-used classification and regression methods, operating by building decision trees on different samples and taking the majority vote to provide the final prediction for classification problems. Naive Bayes is a simple classification algorithm based on Bayes' Theorem that assumes that the predictors are independent given the class. Support Vector Machine (SVM) is another supervised learning algorithm; it finds a hyperplane, in a space whose dimension equals the number of features, that separates the two classes of data points with the maximum margin. K-Nearest Neighbours (KNN) is one of the simplest machine-learning algorithms; it classifies a new case based on its similarity to the available cases. C5.0 is a decision tree algorithm that uses information entropy to determine the best rule to split the data at each node; C5.0 is an evolution of the popular C4.5 algorithm developed by Ross Quinlan [18,19].
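To make the nearest-neighbour idea concrete, the following is a minimal sketch of a KNN classifier with majority voting. It is not the implementation used in the study: the Euclidean distance and the function name `knn_predict` are assumptions, and the toy points in the usage note are illustrative only.

```python
import math
from collections import Counter
from typing import Sequence


def knn_predict(train_x: Sequence[Sequence[float]],
                train_y: Sequence[int],
                query: Sequence[float],
                k: int = 9) -> int:
    """Classify `query` by majority vote among its k nearest training points.

    Assumption: similarity is measured with the Euclidean distance.
    """
    # Pair each training label with its distance to the query, nearest first.
    neighbours = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```

With two well-separated toy groups, for instance points near (0, 0) labelled 0 and points near (5, 5) labelled 1, a query close to either group receives that group's label. Because the vote depends on distances, the scale of the predictors matters, which is one reason for normalizing the variables beforehand.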
The data was pre-processed to be used in the machine-learning models using normalization of the variables. Some observations were deleted due to missing values. No imputation of missing values was made. For each model, a random sample of 67% of observations was selected for the training data and the remaining 33% was used for validation. For the severity model, 18,791 observations were used for training and 9211 for testing, with a total of 25 variables.
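The pre-processing and partitioning steps can be sketched as follows. The source does not say which normalization was used, so min-max scaling to [0, 1] is an assumption here, as are the function names and the fixed seed.

```python
import random
from typing import List, Sequence, Tuple


def min_max_normalize(column: Sequence[float]) -> List[float]:
    """Rescale a numeric column to [0, 1] (assumed form of normalization)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in column]


def train_test_split(rows: Sequence, train_fraction: float = 0.67,
                     seed: int = 0) -> Tuple[list, list]:
    """Randomly assign about 67% of the rows to training and the rest to testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    cut = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test
```

Applied to 100 rows, this yields 67 training rows and 33 test rows, with every row appearing in exactly one of the two sets.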
For the serious injuries model, 3712 observations were used for training and 1885 for testing, with a total of 40 variables. Some variables were initially categorized in the univariate phase of the logistic regression models; however, others were included without any further merging of categories. Since the machine-learning models assume that all the data are numeric, factors need to be converted into dummy variables, resulting in a total of 44 predictors for the severity model and 112 predictors for the serious injuries model.
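The conversion of a factor into dummy variables can be sketched as below. The source does not state whether a reference level was dropped, so the `drop_first` option is an assumption, and `dummy_encode` and the `is_` column prefix are hypothetical names.

```python
from typing import Dict, List, Sequence


def dummy_encode(values: Sequence[str], drop_first: bool = False) -> List[Dict[str, int]]:
    """Turn one categorical variable into 0/1 indicator columns.

    With drop_first=True the first level (in sorted order) becomes the
    reference category and gets no column of its own.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    return [{f"is_{level}": int(v == level) for level in kept} for v in values]
```

For a road-type factor with levels AE and EN, each accident is then described by one indicator per kept level, which is how a modest number of variables expands into a larger number of numeric predictors.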
For the random forest model, the number of variables randomly sampled as candidates at each split was 2 for the severity model and 23 for the serious injuries model, with a Gini impurity split rule (usually used in classification and regression tree algorithms) and a minimum node size of 1. For the C5.0 algorithm, results were obtained using a tree model and 20 trials. For the KNN, the final model used k = 9. The SVM used a linear kernel with C = 1. All algorithms used 25 bootstrap repetitions. The Naive Bayes classifier used Laplace smoothing and 10-fold cross-validation.
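The Gini impurity split rule mentioned above can be illustrated with a short sketch. This shows only the standard impurity calculation for a candidate split, not the random forest implementation used in the study; the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def gini_impurity(labels: Sequence[int]) -> float:
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left: Sequence[int], right: Sequence[int]) -> float:
    """Impurity of a candidate split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
```

A tree-growing algorithm evaluates `split_gini` for the candidate variables at a node and keeps the split with the lowest value; a split that separates the classes perfectly has impurity 0.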
To compare the logistic regression model and the machine-learning models, the performance of each model was evaluated by its discrimination ability: accuracy, sensitivity, specificity and positive and negative predictive values. Sensitivity (also called recall) is the ability of the model to detect a true positive case, and specificity is the ability of the model to detect a true negative case.
The positive and negative predictive values are the proportions of predicted positive and predicted negative outcomes that are true positives and true negatives, respectively. The accuracy is then the proportion of correct predictions made by the model. According to [20], for imbalanced data, the sensitivity is more interesting than the specificity; however, they can be combined into a single score balancing both measures, called the geometric mean or G-Mean.
Another popular classification metric for imbalanced data is the F-score or F-measure, which combines, into a single measure, the balance between positive predictive value and sensitivity. The Matthews correlation coefficient (MCC) is used as a measure of the quality of binary classifications and is one of the best measures to use when the data is very imbalanced or in cases where the minority class is set as the positive class.
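The measures described above can all be derived from the four cells of a confusion matrix. The sketch below uses the standard definitions; the F-score is taken to be the equally weighted F1 (the harmonic mean of positive predictive value and sensitivity), which is an assumption because the source does not give the weighting. Undefined ratios are returned as NaN.

```python
import math
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the ratio is undefined."""
    return numerator / denominator if denominator else math.nan


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute the performance measures from true/false positive/negative counts."""
    sensitivity = _ratio(tp, tp + fn)          # recall on the positive class
    specificity = _ratio(tn, tn + fp)          # recall on the negative class
    ppv = _ratio(tp, tp + fp)                  # positive predictive value
    npv = _ratio(tn, tn + fn)                  # negative predictive value
    accuracy = _ratio(tp + tn, tp + fp + tn + fn)
    g_mean = math.sqrt(sensitivity * specificity)
    f_score = _ratio(2 * ppv * sensitivity, ppv + sensitivity)  # F1 (assumed)
    mcc = _ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "sensitivity": sensitivity, "specificity": specificity,
        "ppv": ppv, "npv": npv, "accuracy": accuracy,
        "g_mean": g_mean, "f_score": f_score, "mcc": mcc,
    }
```

When a classifier predicts no positive cases at all (tp = fp = 0), the positive predictive value, the F-score and the MCC are undefined and come back as NaN, while the accuracy can still look high on imbalanced data.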
The ROC curve (a graph showing the performance of a classification model at all classification thresholds) and the AUC (the area under the ROC curve) can also be obtained. For all models, the cut points used were selected in order to maximize both the sensitivity and specificity.
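A minimal sketch of the cut-point selection and the AUC follows. The source says only that the cut points maximize both sensitivity and specificity; maximizing their sum (Youden's index) is one common reading and is an assumption here. The AUC is computed from its rank interpretation, the probability that a randomly chosen positive case scores higher than a randomly chosen negative case. Both function names are hypothetical, and both functions assume that each class is present.

```python
from typing import Sequence


def best_cut_point(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Return the threshold maximizing sensitivity + specificity (assumed criterion).

    A case is classified as positive when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_cut, best_value = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cut and y == 0)
        value = tp / positives + tn / negatives
        if value > best_value:  # ties keep the lowest threshold
            best_cut, best_value = cut, value
    return best_cut


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the sample size, which is acceptable for a sketch but not for large datasets.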
For the logistic regression model, the discrimination ability can also be assessed. The logistic regression model also has the advantage of giving additional information about the significant and non-significant variables and allows measuring the size of their effects on the response variable. Using the odds ratio also allows measuring the strength of that relationship and whether those variables contribute to an increase or decrease in the probability of the occurrence of the event of interest (occurrence of a road traffic accident or severity of the road traffic accident). All statistical analyses were conducted using R version 4.0.4 [21].
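The link between a logistic regression coefficient and its odds ratio can be sketched as below. The odds ratio is the exponential of the coefficient; the Wald-type 95% interval built from the standard error is an assumption about how such intervals would be obtained, and the function name is hypothetical.

```python
import math
from typing import Tuple


def odds_ratio(beta: float, std_error: float, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (odds ratio, lower bound, upper bound) for a logit coefficient.

    Assumption: a Wald interval, exp(beta -/+ z * std_error), with z = 1.96 for 95%.
    """
    return (
        math.exp(beta),
        math.exp(beta - z * std_error),
        math.exp(beta + z * std_error),
    )
```

A positive coefficient gives an odds ratio above 1 (higher odds of the event), a negative coefficient gives an odds ratio below 1, and a coefficient of 0 gives exactly 1. An interval that contains 1 indicates that the effect is not significant at the corresponding level.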

Results
The main objective was to evaluate the performance of two different approaches to classify the severity of road traffic accidents. From the dataset available, two different response variables were considered that differ in the number of cases and also in the ratio of the imbalanced data.
First, the severity model was fitted, with the response variable being a road traffic accident resulting in property damage only (negative class) and a road traffic accident with victims (positive class). For these models, there were 22,097 road traffic accidents with property damage and 6006 accidents with victims. This is the case where we have a large sample of road accidents with victims and a slightly higher number of road accidents with property damage, resulting in an imbalance ratio of 4.7:1.
For the serious injuries model, the response variable is a road traffic accident resulting in victims with minor injuries (negative class) and a road traffic accident with victims with serious injuries and/or deaths (positive class). For these models, there are 5436 road traffic accidents with victims with minor injuries and 570 accidents with victims with serious injuries and/or deaths. This is the case where we have a small sample of road accidents with serious injuries and/or deaths and a considerably higher number of road accidents with victims with minor injuries, resulting in an imbalance ratio of 10:1.
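The class counts reported above can be checked with a few lines of arithmetic. The source does not state how the imbalance ratio is defined, so this sketch computes two common definitions from the reported counts; the dictionary layout and function names are hypothetical.

```python
# Class counts as reported in the text.
severity = {"negative": 22_097, "positive": 6_006}   # property damage vs. victims
serious = {"negative": 5_436, "positive": 570}       # minor vs. serious/fatal


def majority_to_minority(counts: dict) -> float:
    """Number of negative (majority) cases per positive (minority) case."""
    return counts["negative"] / counts["positive"]


def total_to_minority(counts: dict) -> float:
    """Number of cases overall per positive (minority) case."""
    return (counts["negative"] + counts["positive"]) / counts["positive"]


for name, counts in (("severity", severity), ("serious injuries", serious)):
    print(name,
          round(majority_to_minority(counts), 2),
          round(total_to_minority(counts), 2))
```

Under these definitions, the reported 4.7:1 for the severity model corresponds to the total-to-minority ratio, and the reported 10:1 for the serious injuries model is close to both values for that model. The counts are also internally consistent: the two classes of the serious injuries model add up to the 6006 accidents with victims.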

Severity Model
Using the spatial analysis, the 13 municipalities of the Setúbal district were categorized and clustered according to the severity of the accident: one cluster is composed of the municipalities of Alcochete, Almada, Barreiro, Montijo, Setúbal and Sines; and the other cluster is composed of the municipalities of Alcácer do Sal, Grândola, Moita, Palmela, Santiago do Cacém, Seixal and Sesimbra. Therefore, this spatial analysis allowed us to reduce the number of categories of the variable Municipality, and these categories were used from the beginning of the fitting of the logistic regression model.
For the other variables, their categories were merged, and the likelihood ratio test was used to evaluate the simplified model against the model where the categories were separated. The final model (Table 1) presents the coefficients of the logistic regression model, as well as the corresponding standard deviation values and the p-values obtained from the Wald statistic.
In the final model, two continuous (or modelled as continuous) variables needed to be transformed in order to satisfy the assumption of linearity in the logit. The variable total number of drivers was transformed into its square root (called transf. in Table 1). For the maximum age of the drivers, the model needed two transformations: one given by the cubic value of the maximum age of the drivers (called transf. 1 in Table 1) and another by the cubic value of the maximum age of the drivers multiplied by 1 plus the logarithm of the maximum age of drivers (called transf. 2 in Table 1). The goodness of fit test for the multiple regression model had a p-value of 0.09 for the Hosmer-Lemeshow test, a Nagelkerke R² = 0.34 and an AUC of 0.813 (95% CI = (0.806; 0.819)), supporting the goodness of fit of the model to the given data.
The adjusted severity model is presented in Table 1. Positive coefficients are associated with variables/categories with higher odds of an accident with victims with injuries (minor, serious and/or fatality), while negative coefficients are associated with lower odds. From the interpretation of the odds ratios, it can be concluded that the odds of the existence of victims in road traffic accidents are higher when:
• accidents involving a pedestrian occur on other roads (when compared with the ones that occur on a highway/bridges); and
• Accident-related factors: accidents where there was no escape; and the number of drivers increases.
The detailed results of the performance of the several methods are presented in Table 2. The C5.0 algorithm was unable to correctly classify any of the minority class observations (PPV) and also had a weak predictive capacity over the majority class (NPV). All other methods had a weak predictive capacity over the minority class (PPV) and high predictive accuracy over the majority class (NPV), with the logistic regression model performing better than the other models. The KNN algorithm presented the worst performance in all measures.

Serious Injuries Model
Using the spatial analysis, the 13 municipalities of Setúbal district were categorized according to the severity of the accidents, resulting in three clusters composed of the municipalities: (1) Alcácer do Sal, Alcochete and Palmela; (2) Almada, Moita, Montijo, Sesimbra and Setúbal; and (3) Barreiro, Grândola, Santiago do Cacém, Seixal and Sines. Therefore, this spatial analysis allowed us to reduce the number of categories of the variable Municipality, and these categories were used in the fitting of the logistic regression model.
Table 3 presents significant variables in the multiple logistic model for serious injuries and/or deaths in road traffic accidents with victims. Variables/categories with positive coefficients are associated with higher odds of an accident having fatalities and/or victims with serious injuries, while negative coefficients are linked to variables/categories with lower odds. The goodness of fit test for the multiple regression model had a p-value of 0.549 for the le Cessie-van Houwelingen test, a Nagelkerke R² = 0.15 and an AUC of 0.682 (95% CI = (0.648; 0.717)), supporting the goodness of fit of the model to the given data.
The odds for the existence of serious injuries and/or fatalities in road traffic accidents with victims are higher when:
• Geographical factors: the accidents occur in the municipalities of Alcochete, Alcácer do Sal and Palmela;
• Temporal factors: the accidents occur between Thursday and Monday; from 2 a.m. to 5 a.m. and from 6 a.m. to 7 a.m., or from 8 p.m. to 2 a.m., 5 a.m. to 6 a.m. and 7 a.m. to 8 a.m. (when compared to the ones that occur from 8 a.m. to 8 p.m.);
• Road characteristics: the accidents occur on an IC/IP or on an EN; the accidents occur on a road where the lanes do not have a central separator; the accidents occur inside urban areas when the roadside is not paved; and the accidents occur on a road with a paved roadside outside an urban area;
• Driver characteristics: the majority of drivers involved are male; and the age of the youngest driver involved in the accident increases;
• Victim characteristics: the age of the oldest victim involved in the accident increases; and
• Vehicle features: the median age of the vehicles involved in the accident increases; in collision accidents, those involving heavy vehicles and those not involving heavy vehicles but involving motorbikes (when compared to the ones involving only light vehicles); and in accidents involving only light vehicles, those that occur by pedestrian running over and those by crashing (when compared to those that occur by collision).
The detailed results of the performance of the several methods are presented in Table 4. For this model, there is a much larger number of predictors than for the severity model, since, when a road traffic accident with victims occurs, more information is collected, and more variables can be used in the analysis. All methods presented similar values in the performance measures; however, logistic regression presented the highest G-mean and F-score values.

Discussion
We analysed data from road traffic accidents in a district of Portugal under the jurisdiction of the CT-GNR of Setúbal. The initial challenge was to clean the original dataset and then add data from additional sources with different data structures and with information regarding the vehicles, drivers and victims of a given accident. It was possible to create a unified dataset with variables about the road traffic accident, vehicle, driver, victims, weather and road conditions. However, some of this information was only available in road traffic accidents with victims.
The main objective of the paper was to compare the performance of a statistical method and some machine-learning models for road traffic accident severity, mostly because severity data are usually imbalanced. We analysed a severity model where the response variable was not so imbalanced and where there existed a large number of observations in the minority class. For a dataset with these characteristics, the recommended measures are the accuracy (as the overall classification measure) and the sensitivity and the predictive value (for the ability of the model to correctly classify the highest road traffic accident severity).
Observing the results, it is possible to conclude that the C5.0 was unable to correctly classify any of the minority class observations and that the random forest presented a very poor classification ratio of this class as shown in Table 2. The Naive Bayes performed better in this requirement but with very low accuracy. The logistic regression model outperformed all the models in the G-mean, F-score and the overall measures, giving more information by indicating the variables with higher odds of having an accident with fatalities and/or victims with serious injuries.
For the serious injuries model, we had a small sample of positive cases with the most imbalanced data. In these scenarios, the sensitivity was more interesting than the accuracy; however, it is recommended to use measures, such as the G-mean or F-score. Observing these measures, the C5.0 algorithm and random forest had similar performance to the logistic regression model, which had higher performance than the Naive Bayes, the SVM and the KNN as shown in Table 4.
However, since the models have equivalent performance, one should not discard the statistical method in such analyses: the logistic regression model allows a researcher to identify the significant and non-significant variables and, through the odds ratio, to measure the increase or decrease in the risk of road traffic accident severity associated with a given variable.

Conclusions
In general, we conclude that, for road traffic accident datasets with very imbalanced data and a small sample size (the most severe road traffic accidents), machine-learning models are not very suitable because they require many observations for training. This result has already been highlighted in previous studies [22]. For the case of a dataset with a larger sample of the class with the highest severity and more balanced datasets, the machine-learning models presented very good performance.
Nevertheless, the statistical logistic regression model was able to achieve similar performance with the additional gain of having more information about the importance of the variables in explaining the risk factors. To better support and generalise our conclusions, we intend to extend and apply this study to other datasets. We will apply this methodology to data from other regions in Portugal and, in this way, further validate the conclusions obtained from this set of experiments.
For future work, we plan to study the impact of choosing different fractions of data for training and testing and different approaches to deal with imbalanced data. More specifically, we intend to explore the use of machine-learning algorithms for the detection of rare events. We also intend to apply and evaluate the use of neural network architectures, in particular, the use of deep-learning methodologies to identify outliers in time-series data.