Long-Term-Based Road Blackspot Screening Procedures by Machine Learning Algorithms

Screening procedures in road blackspot detection are essential tools for road authorities for quickly gathering insights on the safety level of each road site they manage. This paper suggests a road blackspot screening procedure for two-lane rural roads, relying on five different machine learning algorithms (MLAs) and real long-term traffic data. The network analyzed is the one managed by the Tuscany Region Road Administration, mainly composed of two-lane rural roads. A total of 995 road sites, where at least one accident occurred in 2012–2016, have been labeled as “Accident Case”. Accordingly, an equal number of sites where no accident occurred in the same period have been randomly selected and labeled as “Non-Accident Case”. Five different MLAs, namely Logistic Regression, Classification and Regression Tree, Random Forest, K-Nearest Neighbor, and Naïve Bayes, have been trained and validated. The output response of the MLAs, i.e., crash occurrence susceptibility, is a binary categorical variable. Therefore, such algorithms aim to classify a road site as likely safe (“Non-Accident Case”) or potentially susceptible to an accident occurrence (“Accident Case”) over five years. Finally, the algorithms have been compared by a set of performance metrics, including precision, recall, F1-score, overall accuracy, confusion matrix, and the Area Under the Receiver Operating Characteristic curve (AUROC). Outcomes show that the Random Forest outperforms the other MLAs with an overall accuracy of 73.53%. Furthermore, none of the MLAs shows overfitting issues. Road authorities could consider MLAs to draw up a priority list of on-site inspections and maintenance interventions.


Motivations
The World Health Organization indicates that, in 2018 alone, more than 1.35 million deaths were the consequence of road accidents [1]. Although the policy supported by the European Union aimed to reduce global road accident deaths by 50% by 2020, several countries are still far from this ambitious target. Indeed, road accidents are currently the eighth leading cause of death for all age groups and the leading cause for children and young adults aged 5-29 years. More people now die as a result of road traffic injuries than from HIV/AIDS, tuberculosis, or diarrheal diseases [1].
Considering the importance of the topic, we strive to offer increasingly refined tools for recognizing situations of potential risk, in order to prevent future occurrences and mitigate the still high number of accidents and deaths. The Highway Safety Manual [2] includes screening procedures as a crucial part of a sound road safety management cycle. Once the screening ends, a restricted number of sites can be prioritized for detailed analysis.

Purpose
Relying on these motivations, assumptions, and related works, the present study analyzes the field of road blackspot screening procedures employing real long-term data on Italian two-lane rural roads. The network investigated is located in the Tuscany Region, central Italy, and it extends for about 1200 km. The network involves 995 road sites in which at least one accident occurred from 2012 to 2016 (a total of 5094 accidents, 7437 injuries, and 113 deaths). The remaining road sites of the network did not experience any accident event. In order to deal with a balanced dataset (avoiding imbalance issues), an equal number of road sites (995) have been randomly extracted from this sample of "likely safe" road sites. Five different MLAs have been calibrated and validated, and their outcomes compared. These MLAs are heterogeneous: both parametric (Logistic Regression, LR, and Naïve Bayes classifier, NB) and non-parametric (K-Nearest Neighbor, KNN; Classification and Regression Tree, CART; and Random Forest, RF), operating either as single classifiers (LR, NB, KNN, CART) or as an ensemble of learners (RF). They are trained and tested with the same training and test set, respectively; therefore, the comparison between outcomes should be possible and coherent. The quality assessment comprises a broad set of performance metrics for recognizing the best model. In similar research, Recall is usually accompanied by Precision [22] and F1-Score [23].
Moreover, it seems useful to also provide the Confusion Matrix [24], which allows understanding the reliability of estimations in terms of the number of instances correctly or incorrectly classified for each output class. Therefore, in order to comprehensively evaluate the MLAs developed, a broad set of performance metrics, including Precision, Recall, F1-Score, overall Accuracy, Confusion Matrix, and AUROC, has been computed for evaluating both the Goodness-of-Fit (training phase) and the predictive performance (test phase). The aim is to assess whether MLAs can identify those road sites that experienced road safety criticalities over time. Once the procedure is shown to be reliable, it will enable Road Authorities to predict the dangerousness of new road sites and of those that do not yet have a long-term crash history.

Workflow
The workflow is shown in the following Figure 1. Firstly, the following data have been collected from the Tuscany Region Road Administration (TRRA): traffic flow data, geometric features, and built-up area information.
A GIS platform has been exploited in order to match the collected data and define the set of input features (independent variables of the MLAs). According to studies [26,27], the fixed length-based criterion has been chosen for defining the road sites. Therefore, the road network has been discretized into stretches of 500 m, over which each input factor has been computed. Road sites in which at least one accident occurred (995 in total) have been classified as "Accident Case". Subsequently, in order to avoid potential imbalance issues that may affect MLAs [19,28,29], an equal number of road sites where no accidents occurred in 2012-2016 has been randomly selected. These sites have been labeled as "Non-Accident Case".
The dataset has been randomly split into two different sets: the training set (70% of the data) and the test set (30%). The training set and a 10-fold Cross-Validation (CV) process have been used for training and evaluating the Goodness-of-Fit of the five MLAs (LR, CART, RF, KNN, NB). The test set has been used for verifying the predictive performance of the algorithms.
In order to evaluate and compare the MLAs, a set of performance metrics has been computed: Precision, Recall, F1-Score, Confusion Matrices, and AUROC. These metrics represent the Goodness-of-Fit of the MLAs when computed on the training data, and the predictive performance when computed on the test data.

Study Area and Input Factors
The network analyzed extends for 1190.136 km (Figure 2) and mainly consists of two-lane rural roads. TRRA provided traffic flow data, geometric features, and built-up area characteristics. These data have been employed for the definition of the input factors (or independent variables) of the MLAs. Built-up areas information: according to the Italian standards [25], the localization of the built-up areas has been collected in order to discretize the road network into three area types (Road Site Inside an urban area (RSI), Road Site Outside an urban area (RSO), and Road Site on the administrative Boundary of an urban area (RSB)).

Therefore, the environmental context is represented by a nominal variable that can assume three different values (1, 2, or 3).
Average Annual Daily Traffic:

AADT_j = (1/n) · Σ_{i=1}^{n} AADT_i,j

where:
• n is the period, in years, of the analysis, equal to 5 years;
• AADT_i,j is the average annual daily traffic for the i-th year evaluated for the j-th road site.
Average carriageway width:

W_c,j = ( Σ_{i=1}^{m} W_c,i · L_i ) / ( Σ_{i=1}^{m} L_i )

where:
• W_c,i is the average carriageway width of the i-th segment in the j-th road site;
• m is the number of segments in the j-th road site;
• L_i is the length of the i-th segment.
Average Slope:

i_j = ( Σ_{i=1}^{m} i_i · L_i ) / ( Σ_{i=1}^{m} L_i )

where:
• i_i is the average slope of the i-th segment in the j-th road site;
• m is the number of segments in the j-th road site;
• L_i is the length of the i-th segment.
Horizontal Tortuosity Index, computed as a function of the following quantities:
• R_i is the radius of the i-th circular curve in the j-th road site;
• r_p is the number of elements (circular curves) in the j-th road site;
• L_i is the length of the i-th segment.

Vertical Tortuosity Index, computed as a function of the following quantities:
• Rv_i is the radius of the i-th vertical curve segment in the j-th road site;
• r_v is the number of elements (vertical curves) in the j-th road site;
• L_i is the length of the i-th segment.
Driveway Density, i.e., the number of driveways per unit length of the road site, where:
• n_d,j is the number of driveways in the j-th road site.
Density of Road Junctions, where:
• α_i considers the type of the i-th junction; it can be α = 5 for linear signalized and unsignalized intersections, or α = 1 for roundabouts. The value of α is determined according to the Crash Modification Factors reported in Chapter 14, Table 14-3 and Table 14, of the Highway Safety Manual [2].

The following Table 2 reports the mean and standard deviation of each input factor belonging to the training set, divided by class.
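Although the input factors were computed in a GIS environment, the averaging steps involved can be sketched in a few lines. The following snippet (numpy, with made-up values rather than the TRRA data) computes the five-year mean AADT and a length-weighted average carriageway width for a hypothetical 500 m road site; the length-weighted form is our reading of the symbol definitions above.

```python
import numpy as np

# 5-year mean AADT for one road site (illustrative values, not TRRA data)
aadt_by_year = np.array([4200.0, 4350.0, 4100.0, 4500.0, 4280.0])  # AADT_i,j, i = 1..5
n = len(aadt_by_year)                       # n = 5 years of analysis
mean_aadt = aadt_by_year.sum() / n          # average over the analysis period

# Length-weighted average carriageway width over the segments of a 500 m site
seg_len = np.array([120.0, 200.0, 180.0])   # L_i, summing to the site length (500 m)
seg_width = np.array([7.0, 6.5, 7.5])       # W_c,i for each segment
mean_width = (seg_width * seg_len).sum() / seg_len.sum()

print(round(mean_aadt, 1), round(mean_width, 3))
```

The same weighting pattern applies to the average slope, with the per-segment slopes in place of the widths.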

Output Classes
The output response (crash occurrence susceptibility) of the MLAs is a binary categorical variable. Therefore, relying on the input factors mentioned above, MLAs aim to classify a road site as likely safe (by labeling the site as "Non-Accident Case") or potentially susceptible to an accident occurrence (by labeling the site as "Accident Case"). TRRA provided the crash reports of the Fatal and Injury crashes that occurred on the network over the period 2012-2016. Such crashes:

• Can concern any type of accident (e.g., head-on, run-off, rear-end, side collision, rollover, etc.);
• May have occurred in the daytime or nighttime;
• May have occurred on road segments or at road junctions;
• May have involved one or more vehicles;
• May have involved one or more casualties.
Property Damage Only crashes are not included in the dataset since they are not considered by the Italian standards concerning road safety analyses [30].
Crash reports also record the location of the event. Consequently, through a GIS platform, it was possible to assign to each road site the number of accidents that occurred in the five years of analysis (2012-2016). If no accident happened on a road site, it was classified as a "Non-Accident Case"; conversely, if at least one accident occurred, the road site was classified as an "Accident Case". Therefore, the temporal resolution of the crash occurrence predictions of the MLAs is five years.
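The labeling rule described above is simple enough to sketch directly. In the snippet below, the site identifiers and accident counts are hypothetical stand-ins for the GIS overlay output:

```python
# Map GIS-derived accident counts (2012-2016) to the two output classes.
# Site identifiers and counts below are invented for illustration only.
accidents_per_site = {"SR429_km12.0": 3, "SR429_km12.5": 0, "SP12_km4.0": 1}

labels = {
    site: ("Accident Case" if count >= 1 else "Non-Accident Case")
    for site, count in accidents_per_site.items()
}
print(labels["SR429_km12.5"])  # Non-Accident Case
```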

Machine Learning Algorithms
This section introduces the main characteristics, shortcomings, advantages, and theoretical relations of each MLA employed in this study. All of them are supervised MLAs for classification purposes. Indeed, they are trained with a dataset in which input factors and output classes are known in advance. The purpose of these models is the classification of instances into one of the possible output classes. In order to be compared, all algorithms used the same training and test sets. It is worth mentioning that the Waikato Environment for Knowledge Analysis (WEKA) software [31,32], version 3.8.4, has been employed for building the classifiers.

Logistic Regression
Historically, the LR classifier was defined and employed in the study of Berkson [33]. Firstly, LR exploits a linear multivariate regression for relating the output and the input factors. Then, a logit function converts the output of the multivariate linear regression into a value within the range [0, 1]. Equation (8) below defines the logit function:

P(z) = 1 / (1 + e^(−z))    (8)

where:
• P(z) is the probability that the analyzed event occurs.

Equation (9) below defines z, that is, the output of the regression:

z = b_0 + b_1·x_1 + b_2·x_2 + ... + b_n·x_n    (9)

where:
• n is the number of independent variables;
• x_i (i = 1, 2, 3, ..., n) represents the value of the i-th input factor;
• b_i (i = 1, 2, 3, ..., n) is the regression coefficient assigned to the i-th input factor, and b_0 is the intercept.
LR exploits the following decision rule for assigning a new unknown instance to the class ẑ (Equation (10)):

ẑ = 1 if P(z) ≥ 0.5; ẑ = 0 otherwise    (10)

LR provides several advantages to the user:
• It is easy to train, since no hyperparameters have to be tuned.
However, LR has some drawbacks:
• It is not able to solve non-linear problems, since its decision surface is linear;
• Since the outcome is discrete, LR can only predict categorical outcomes;
• LR is prone to overfitting issues.
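The classifiers in this paper were built in WEKA; as a rough illustration of the LR pipeline above (logit probabilities followed by a 0.5 decision threshold), here is a scikit-learn sketch on synthetic data. None of the values reflect the TRRA dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 8 input factors and the binary output class
# (1 = "Accident Case", 0 = "Non-Accident Case"); not the TRRA data.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
lr = LogisticRegression().fit(X_tr, y_tr)

proba = lr.predict_proba(X_te)[:, 1]   # P(z), as in Equation (8)
z_hat = (proba >= 0.5).astype(int)     # decision rule of Equation (10)
accuracy = (z_hat == y_te).mean()
print(round(accuracy, 3))
```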

Classification and Regression Tree
CART [34] is a non-parametric MLA used for building a hierarchical tree-based model. The tree starts with a root node, grows through branch nodes, and ends at leaf nodes. The root node and each branch node represent a different decision rule based on a specific input factor: the node is split into two or more homogeneous zones relying on the so-called cut point. The leaf nodes contain the final prediction: in a CART model used for classification, they contain the output class. Therefore, by repeatedly splitting the dataset node-by-node, a tree-based model grows. CART learns decision rules by inferring directly from the training data. It exploits the Recursive Partitioning algorithm [34,35] for identifying the decision rules: for each node, Recursive Partitioning finds the best input feature and the best cut point for splitting the node.
The CART model provides a non-black-box solution through a tree-graph visualization. Moreover, CART is not affected by the scale and linear transformations of the input factors, by outliers, or by insignificant input factors. Last but not least, CART can easily handle numerical and nominal input factors. However, CART models generally suffer from overfitting issues by growing over-complex and deep trees. Furthermore, slightly different training sets may lead to significantly different CARTs.
In order to alleviate these issues, a pruning process can be applied to the CART. Pruning is a procedure that leaves out a certain number of hierarchical levels and leaf nodes of the initial CART, making it able to generalize better and produce more reliable predictions. The Weka software provides an automatic pruning process [36,37]. At first, the CART developed had 256 decision rules and 257 leaf nodes; after the pruning procedure, it had 7 decision rules and 8 leaf nodes. Figure 3 below shows the pruned CART.
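WEKA's automatic pruning was used in the paper; a comparable idea can be sketched in scikit-learn through cost-complexity pruning. The ccp_alpha value below is arbitrary and the data are synthetic; the point is only that pruning shrinks an overgrown tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data so the unpruned tree overfits and grows deep.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)

# Cost-complexity pruning yields a subtree, i.e., the same or fewer leaves.
print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
```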

Random Forest
RF was introduced by Breiman [38]. RF is an ensemble classifier, i.e., it makes classifications relying on the predictions made by a set of individual classifiers: each of them makes a prediction, and the predictions are then aggregated. In the case of RF, the ensemble consists of a large number of uncorrelated CART models. In order to build uncorrelated classifiers, they are defined through a Bootstrap Aggregation (Bagging) process. Bagging consists of creating a set of different training sets by sampling with replacement. Each training set trains one CART of the forest. These CART models are not pruned. Furthermore, a Feature Randomness approach is used in growing the trees: each node is split into branch nodes considering a prefixed number of input factors, selected at random among the whole set [39]. Therefore, by employing Bootstrap Aggregation and Feature Randomness, the RF can exploit the predictions of a large set of uncorrelated CART models. Consequently, RF includes all the strengths of CART. Besides:
• The predictive performance of RF can compete with the best supervised learning algorithms;
• RF can provide a feature importance estimation by the computation of the Out-of-Bag Error;
• Through Bagging and Feature Randomness, RF offers an efficient solution against overfitting.
On the other hand, RF also has a few shortcomings:
• An ensemble model is inherently less interpretable than an individual CART;
• Training a large number of trees may require high computational costs and long training times;
• Predictions are slower than those of individual classifiers, which may create challenges for some applications.
To classify a new instance, the RF bases its decision rule on the number of times that the CART models assign each possible output class to that instance. The class with the maximum number of nominations is assigned by the RF to the new instance.
RF requires that the modeler tunes two hyperparameters: the number of CART models to grow, N_t, and the number of input factors randomly selected as candidates at each split, N_rs. The "trial and error" approach is widely used in Machine Learning modeling in order to identify the best set of hyperparameters of the algorithms [40-44]. N_t = 500 CART models has been chosen, since a higher number did not produce a significant increase in RF performance. Moreover, the values N_rs = 1, 2, ..., 8 have been tried, identifying N_rs = 8 as the best one.
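The trial-and-error search over N_rs can be sketched with scikit-learn's Random Forest, using the Out-of-Bag accuracy mentioned above as the selection criterion. The data are synthetic (the paper performed the search in WEKA on the TRRA dataset), and a smaller forest than the paper's N_t = 500 is used to keep the sketch fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 8 input factors and the binary output class.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

oob_by_n_rs = {}
for n_rs in range(1, 9):                 # candidate N_rs values, 1..8
    rf = RandomForestClassifier(
        n_estimators=200,                # the paper uses N_t = 500; 200 keeps this fast
        max_features=n_rs,               # input factors considered at each split
        oob_score=True,                  # Out-of-Bag accuracy as selection criterion
        random_state=0,
    ).fit(X, y)
    oob_by_n_rs[n_rs] = rf.oob_score_

best_n_rs = max(oob_by_n_rs, key=oob_by_n_rs.get)
print(best_n_rs, round(oob_by_n_rs[best_n_rs], 3))
```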

K-Nearest Neighbor
Cover and Hart defined and employed the KNN algorithm at the end of the 1960s [45]. KNN is a supervised instance-based MLA that classifies a new sample by considering its k closest instances (called neighbors). To evaluate how close or far an instance is, the KNN algorithm introduces a distance function into the input feature space. The class assigned to the new observation derives directly from the majority class among the k instances considered.
Therefore, KNN requires that the modeler tunes two hyperparameters: the number k of nearest neighbors and the type of distance function. By following the aforementioned "trial and error" approach, the optimal number of neighbors has been determined by trying different values of k and computing the Accuracy of the classifier. The values k = 1, 2, 5, 10, 15, 20, and 25 have been tried, and k = 10 has been chosen, considering the highest Accuracy computed. The Euclidean distance defines the distance function. The Euclidean distance d_ij between two samples (two points) i and j in the feature space is defined as Equation (11):

d_ij = sqrt( Σ_{k=1}^{m} (x_ik − x_jk)² )    (11)

where:
• m is the dimension of the samples (i.e., the number of independent variables); in this study, m = 8;
• x_ik and x_jk are the values of the k-th input factor for the observations i and j, respectively.
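The k selection procedure described above can be sketched as follows with scikit-learn, on synthetic data rather than the TRRA dataset; KNeighborsClassifier uses the Euclidean distance by default, matching Equation (11).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 8 input factors and the binary output class.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

scores = {}
for k in (1, 2, 5, 10, 15, 20, 25):            # candidate neighbor counts
    knn = KNeighborsClassifier(n_neighbors=k)  # Euclidean distance by default
    scores[k] = cross_val_score(knn, X, y, cv=10).mean()

best_k = max(scores, key=scores.get)           # keep the most accurate k
print(best_k, round(scores[best_k], 3))
```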

Naïve Bayes Classifier
NB is a probabilistic supervised classifier introduced by Maron [46] in the early 1960s. In order to label a new instance (or feature vector) x = (x_1, ..., x_i, ..., x_n) with one of the K possible output classes C_k of the algorithm, NB exploits the well-known Bayes' Theorem reported below (Equation (12)):

p(C_k | x_1, ..., x_n) = p(C_k) · p(x_1, ..., x_n | C_k) / p(x_1, ..., x_n)    (12)

where:
• p(C_k | x_1, ..., x_n) is the posterior probability, that is, the conditional probability of having the class C_k given the feature vector x = (x_1, ..., x_i, ..., x_n);
• p(C_k) is the prior probability of observing an instance belonging to the class C_k; since the dataset is balanced and composed of two classes, p(C_k) = 0.5;
• p(x | C_k) is the conditional probability of observing the feature vector x given the output class C_k;
• p(x) is the probability of observing the feature vector x, and it is common to all classes.
By assuming that all the features in x are mutually independent (Naïve condition), and by repeatedly applying the concept of conditional probability (chain rule), the numerator of Equation (12) becomes Equation (13):

p(C_k) · Π_{i=1}^{n} p(x_i | C_k)    (13)

Therefore, the decision rule of the NB classifier that assigns a class ẑ to a new instance is defined as follows (Equation (14)):

ẑ = argmax_{k ∈ {1, ..., K}} p(C_k) · Π_{i=1}^{n} p(x_i | C_k)    (14)

Equation (14) is called a Maximum a Posteriori decision rule: NB computes the probability that a new instance belongs to each of the output classes, then it assigns the class with the highest probability.
NB has several advantages:
• It is easy to train, since no hyperparameters have to be tuned;
• It is fast in predicting the classes of the samples belonging to the test set;
• It requires less training data than LR for training a reliable classifier.
However, NB has some limitations:
• If a categorical variable (e.g., Area Type) has a category in the test set which was not observed in the training set, then the model assigns a zero probability to the event occurrence and will be unable to make a reliable prediction. This issue is known as "Zero Frequency". Accordingly, the CV procedure is essential for ensuring that all the categories of the categorical variables have been considered in both the training and test sets;
• For numerical variables (e.g., AADT), NB makes the strong assumption that they follow a normal distribution;
• Generally, it is almost impossible that the predictors of a phenomenon are entirely independent.
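A Gaussian NB, which encodes the normality assumption discussed above for numerical inputs, can be sketched as follows in scikit-learn (synthetic data, not the TRRA dataset). The Maximum a Posteriori rule of Equation (14) corresponds to taking the argmax over the posterior probabilities.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the 8 numerical input factors and the binary class.
rng = np.random.default_rng(4)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

nb = GaussianNB().fit(X, y)
posterior = nb.predict_proba(X[:5])    # p(C_k | x) for the first 5 instances
z_hat = posterior.argmax(axis=1)       # Maximum a Posteriori rule, Equation (14)
print(z_hat)
```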

Modeling Settings
This section reports the main procedural steps common to all MLAs. Set-up of the dataset: the initial dataset contains an equal number of "Accident Case" and "Non-Accident Case" sites, considering the possible weakness in the prediction of the minority class shown by classifiers trained on unbalanced training sets. Therefore, once all the sites where accidents occurred had been taken into consideration, an equal number of "Non-Accident Case" road sites was randomly extracted from the network.
Definition of the training and test sets: the initial dataset has been randomly divided into two distinct sets, one for training the models and the other for testing them. As other authors did [47-50], we chose 70% of the initial dataset for training the MLAs and 30% for testing them.
K-fold CV approach: in order to verify that the models are robust and reliable in their predictions, they have been evaluated through a 10-fold CV process. CV was introduced by Larson [51], who defined the concept of splitting the dataset into two parts and then using one for training the model and the other one for judging it. Subsequently, Mosteller and Tukey [52] proposed the k-fold CV procedure adopted in this study. The process of k-fold CV consists of splitting the training set into k folds. Afterward, at each iteration, k − 1 folds are used for training the model and the left-out fold for evaluating it. After k iterations, all the samples belonging to the training set have been used both for training and for evaluating the MLA, ensuring the most representative evaluation of the algorithm. In this study, as recommended by Kohavi [53], a 10-fold CV has been followed.
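The 70/30 split followed by 10-fold CV on the training portion can be sketched as follows (scikit-learn on synthetic data; the paper ran this protocol in WEKA, and LR is used here only as a placeholder model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the 8 input factors and the binary output class.
rng = np.random.default_rng(5)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# 70/30 split, then 10-fold CV on the training portion only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
cv_scores = cross_val_score(LogisticRegression(), X_tr, y_tr, cv=10)
print(round(cv_scores.mean(), 3), len(cv_scores))
```

The held-out 30% (X_te, y_te) is touched only once, for the final assessment of predictive performance.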
Assessment of the training and test phases: both phases have been evaluated with the same type of metrics: overall Accuracy, Precision, Recall, F1-Score, Confusion Matrix, ROC, and AUROC. These metrics allow judging the quality of the models (e.g., whether they suffer from overfitting problems) and their predictive performance in classifying unknown instances.

Evaluation Metrics
A comprehensive set of performance metrics has been used for the evaluation of the MLAs. The following Equations (15)-(18) define the overall Accuracy of the classifiers, Precision, Recall, and F1-Score, respectively:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (15)
Precision = TP / (TP + FP)    (16)
Recall = TP / (TP + FN)    (17)
F1-Score = 2 · (Precision · Recall) / (Precision + Recall)    (18)

where:
• TP is the number of True Positive instances, i.e., the instances belonging to the class "Accident Case" classified into the same class;
• TN is the number of True Negative instances, i.e., the instances belonging to the class "Non-Accident Case" classified into the same class;
• FP is the number of False Positive instances, i.e., the instances belonging to the class "Non-Accident Case" misclassified into "Accident Case";
• FN is the number of False Negative instances, i.e., the instances belonging to the class "Accident Case" misclassified into the class "Non-Accident Case".
If the dataset is balanced, the overall Accuracy should represent the global performance of the classifier accurately. The Precision shows the goodness of the positive predictions: the higher the Precision, the lower the number of "False Alarms". The Recall, also called True Positive Rate (TPR), is the ratio of positive instances that are correctly detected by the classifier; therefore, the higher the Recall, the higher the quality of the classifier in detecting positive instances. The F1-Score is the harmonic mean of Precision and Recall, and it can be used for comparing classifiers, since the two metrics are combined into a single concise one. The harmonic mean is used instead of the arithmetic one since it is more sensitive to low values; therefore, a classifier has a satisfactory F1-Score only if it has both high Precision and high Recall. These parameters can be computed as specific metrics for each class or as overall metrics of the classifier.

Furthermore, the MLAs have been judged and compared by computing the Confusion Matrix and the AUROC. The Confusion Matrix reports the TP, TN, FP, and FN instances as a two-by-two matrix, in which the rows correspond to the observed classes and the columns to the predicted ones. An adequate Confusion Matrix has most of the instances on its main diagonal. Furthermore, by observing a Confusion Matrix, it is possible to compute the performance metrics mentioned above.

The Receiver Operating Characteristic (ROC) curve represents the performance of a classifier on a Cartesian plane: the TPR is reported on the ordinates, while the abscissas show the False Positive Rate (FPR). Equation (19) defines the FPR:

FPR = FP / (FP + TN)    (19)

Therefore, the FPR is the ratio of FP instances among all negative instances, i.e., the ratio of "False Alarms" given by the classifier. The ROC plot shows the relation between TPR and FPR at various classification thresholds. Once the ROC is plotted, the Area Under the ROC (AUROC) [54] can be computed.
It is the two-dimensional area underneath the ROC curve [16]. It can assume values between 0 and 1: a value of 0.5 represents a random classifier, while a value of 1 represents a perfect classifier that assigns each sample to the right class. Therefore, the higher the AUROC, the better the classifier.
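All of the metrics above can be computed from a handful of toy predictions; the values below are invented solely to exercise Equations (15)-(19) and the AUROC.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score)

# Toy labels and predictions; 1 = "Accident Case", 0 = "Non-Accident Case".
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_score = np.array([0.9, 0.4, 0.8, 0.2, 0.6, 0.1, 0.7, 0.3])  # P of class 1

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)          # Equation (15)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label=1, average="binary")  # Equations (16)-(18)
fpr = fp / (fp + tn)                                # Equation (19)
auroc = roc_auc_score(y_true, y_score)              # area under the ROC curve

print(accuracy, round(precision, 3), round(recall, 3), round(fpr, 3), auroc)
```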

Results and Discussion
Outcomes are presented and discussed in terms of Goodness-of-Fit and predictive performance. Goodness-of-Fit refers to the quality of the training phase, while predictive performance represents the ability of the MLAs to generalize (test phase). We provide detailed performance metrics for each class of the classification (Accident and Non-Accident class) and the weighted average values, i.e., the average value of each performance metric weighted by the number of samples of each class. These averaged metrics represent the overall performance of the classifiers.
Firstly, Precision, Recall, and F1-Score are introduced for the evaluation of both Goodness-of-Fit and predictive performance. Afterward, the Confusion Matrix of each classifier (in both the training and testing phases) and the Accuracy are shown. Finally, the ROC plots are reported, and the AUROC is computed (in both the training and testing phases). Table 3 below shows the Precision, Recall, and F1-Score of the MLAs computed for the Goodness-of-Fit assessment.

Precision, Recall, F1-Score
Both the weighted average metrics and the class-specific ones seem satisfactory. The highest Precision (0.786) is shown by the NB classifier in detecting Accident Case road sites, while LR reports the highest Recall (0.816) in recognizing Non-Accident Case road sites. Since the dataset is balanced, the weighted average metrics should provide an adequate representation of the real performance of a classifier. In this case, the highest Precision (0.724) and the highest Recall (0.721) have been observed for the CART algorithm; accordingly, the F1-Score also shows its best value (0.721) for the CART algorithm. Table 4 below reports the Precision, Recall, and F1-Score of the MLAs computed in the testing phase. As in the training phase, Table 4 demonstrates that in the testing phase the MLAs present adequate predictive capacities. Qualitatively, the whole set of metrics is similar to the one shown in the previous Table 3; therefore, these MLAs should not suffer from overfitting issues. Quantitatively, NB reports the highest Precision (0.758) in detecting Accident Case road sites and the highest Recall (0.833) in identifying Non-Accident Case road sites. Nonetheless, the classifier that presents the best weighted average metrics is the RF algorithm: indeed, the highest Precision (0.736), the highest Recall (0.735), and the highest F1-Score (0.735) have been observed for the RF.
By employing Precision, Recall, and F1-Score for evaluating the Goodness-of-Fit and the predictive performance, it is confirmed that RF is the most appropriate algorithm for predicting road blackspots. The other MLAs also showed satisfactory performance and, therefore, can be considered as alternatives to the RF. Table 5 below reports the Confusion Matrices computed after the training phase of the MLAs. At the bottom of each Confusion Matrix, the number of correctly classified instances (that is, the overall Accuracy of the classifier) and the number of incorrectly classified instances are reported.

ROC and AUROC
In order to judge the MLAs comprehensively, the ROC has been derived for each classifier, for both the training and testing phases (Figures 4 and 5, respectively). For each ROC, the corresponding AUROC (Table 7) has been computed. These outcomes are presented below. The dash-dot line in Figure 4 represents the performance of a hypothetical random classifier, which corresponds to an AUROC of 0.5; therefore, the farther a ROC lies from this line, the better the algorithm. Qualitatively, Figure 4 shows that, in both the Accident and Non-Accident Cases, all MLAs follow similar trends in the TPR-FPR space. Figure 5 below reports the ROC for the test phase of the MLAs.
Once again, all MLAs show trends similar to the ROC derived for the training phase, which suggests that the modeling procedure has been carried out correctly. For a quantitative assessment based on ROC curves, the AUROC values need to be computed. Table 7 below reports the AUROC of each classifier in both the training and test phases. Since the dataset is balanced, it is worth mentioning that the AUROC of the MLAs for detecting the Accident and Non-Accident classes assume the same value; therefore, Table 7 distinguishes the MLAs only between the training and test phases.
Table 7 shows that the AUROC values in the training and test phases are similar, which further confirms that the calibrated MLAs do not suffer from overfitting. For the data used in this work, RF is the most reliable and suitable algorithm for predicting road blackspots (test-phase AUROC = 0.795), followed by LR (0.773), NB (0.747), CART (0.742), and KNN (0.740). These values also indicate that the other MLAs used in this study can serve the same purpose. Moreover, the predictive performance of the MLAs shown in Table 7 for the test set is consistent with that presented in Tables 4 and 6.
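The AUROC also admits a simple probabilistic reading: it is the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (with ties counted as half). A short pure-Python sketch of this pairwise formulation, on illustrative susceptibility scores rather than the paper's data:

```python
def auroc(y_true, scores, positive=1):
    """AUROC via the pairwise (Mann-Whitney) formulation:
    fraction of positive-negative pairs ranked correctly, ties counting 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative crash-susceptibility scores for four road sites
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
print(auroc(y_true, scores))  # 0.75, above the 0.5 of a random classifier
```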

Conclusions
A road blackspot screening procedure based on real long-term traffic data and machine learning algorithms has been presented. This process aims to classify a road site as safe or potentially susceptible to an accident occurrence.
In order to fulfill the objective of this study, five different supervised classification algorithms have been calibrated and compared: an instance-based algorithm (KNN), two probabilistic models (LR and NB), a non-parametric tree-based algorithm (CART), and a non-parametric ensemble classifier (RF). Both the training and the test phases have been judged by computing a broad set of performance metrics: Precision, Recall, F1-Score, overall Accuracy, Confusion Matrix, and AUROC. Since both phases provided satisfactory and similar performance across all these metrics, the algorithms should not suffer from overfitting issues. Moreover, since the Random Forest classifier showed the highest performance in all the test-phase metrics (F1-Score = 0.735, overall Accuracy = 73.53%, and AUROC = 0.795), we recommend it as a reliable and suitable algorithm for road blackspot detection modeling. The other algorithms also offer adequate performance, and they may prove more accurate than Random Forest in other similar studies. Therefore, it is also essential to calibrate and compare different algorithms to ensure that one of the best possible solutions has been found.
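The comparison workflow described above can be sketched as follows. This is an illustrative outline only: it assumes scikit-learn is available, and a synthetic dataset stands in for the Tuscany Region traffic and accident features.

```python
# Sketch of the five-classifier comparison, assuming scikit-learn;
# synthetic data replaces the real road-site features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, roc_auc_score

# Balanced binary dataset, mimicking 995 + 995 labeled road sites
X, y = make_classification(n_samples=1990, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "CART": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    results[name] = (accuracy_score(y_test, model.predict(X_test)),
                     roc_auc_score(y_test, proba))

for name, (acc, auc) in results.items():
    print(f"{name}: Accuracy={acc:.3f}, AUROC={auc:.3f}")
```

The point of the sketch is the side-by-side evaluation on a held-out test set, mirroring the paper's recommendation to compare several calibrated algorithms before selecting one.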
Road Authorities should consider the use of Machine Learning classifiers for predicting the safety level of the road networks they manage. These procedures can be efficiently employed as a supporting tool in decision-making processes concerning road maintenance interventions and the planning of new road projects. Indeed, such algorithms allow predicting the dangerousness of new road sites, as well as of road sites that have not yet accumulated a long-term crash history. Once the screening procedure is completed, a restricted sample of road sites is highlighted as potentially susceptible to crash occurrence; these sites could be included by Road Authorities in inspection lists with higher priority.

Funding: This work is supported by the Regione Toscana (Tuscany Region) and the University of Pisa under the "CMRSS 2018" research project on the "Actuation of the Regional Monitoring Center of Road Safety" DGR n. 553, 29/05/2018.