A Crash Data Analysis through a Comparative Application of Regression and Neural Network Models

One way to reduce road crashes is to determine the main influential factors among a long list attributable to driver behavior, environmental conditions, vehicle features, road type, and traffic signs. Selecting the best modelling tool for extracting the relations between crash factors and their outcomes is therefore a crucial task. To analyze the road crash data of Milan City, Italy, gathered between 2014 and 2017, this study used artificial neural networks (ANNs), generalized linear mixed-effects models (GLME), multinomial regression (MNR), and general nonlinear regression (NLM) as the modelling tools. The data set contained 35,182 records of road crashes with injuries or fatalities. The findings showed that unbalanced and incomplete data sets affect outcome performance, and that data treatment methods can help overcome this problem. Age and gender were the most influential recurrent factors in crashes. Additionally, ANNs approximated the complicated relationships between inputs and outputs better than the other regression models. However, they cannot provide an analytical formulation, although they can serve as a baseline for other regression models. For this reason, GLME and MNR were used to gather information about the analytical framework of the model, with the aim of constructing a particular NLM.


Introduction
According to the World Health Organization (WHO), road traffic crashes cause approximately 1.35 million fatalities worldwide each year, leaving between 20 and 50 million people with non-fatal injuries (WHO, 2022) [1]. The WHO reported that road traffic crashes have multi-factorial causes, including human factors, roads and other infrastructure factors, insufficient policing, environmental factors, and vehicle characteristics. Road users are reportedly the main cause of crashes. According to estimates, road traffic injuries are the eighth leading cause of death worldwide for all age groups and the leading cause of death for children and young people aged 5-29 years (WHO, 2018) [1].
Another task is to create an appropriate crash data set for potential modelling tools. Obtaining proper results and reliable models depends on the availability of a high-quality data set, representing precise information about potential affecting factors.
A study from 2019 [13] suggested that, to eliminate potential mistakes, the ultimate goal of the road safety research community should be to develop a seamless crash database, where all components can be integrated automatically. In addition, the sensitivity of the modelling tools to a data set that is unbalanced in terms of the response variable categories may affect the outcomes. Another objective outlined in recent literature is to predict crash risk in real time or near real time in order to prevent crashes before they happen (Mehdizadeh et al., 2020; Dimitrijevic et al., 2022) [14,15].
One of the most important aspects of road crash analysis is the selection of the proper modelling tool. Previous research has modelled crash data with various methods and techniques (Fausett, 1994; McCullagh and Nelder, 1985; Hosmer et al., 1989) [16][17][18], based on approaches such as crash severity and crash type. In this study, four modelling tools were chosen to meet the predetermined objectives: artificial neural networks (ANNs) (Fausett, 1994) [16], generalized linear mixed effects (GLME) (Molenberghs et al., 2002) [19], a general nonlinear model (NLM) (McCullagh and Nelder, 1985) [17], and a multinomial regression model (MNR) (Hosmer et al., 1989) [18].
Recently, papers such as Dimitrijevic et al. (2022) [15] presented several crash risk and injury severity assessment models in order to compare their performance: random effects, Bayesian logistic regression, Gaussian naïve Bayes, k-nearest neighbor, random forest, and gradient boosting machine methods were trained and tested for this purpose. Iranitalab and Khattak (2017) [20] compared the performance of four statistical and machine learning methods, namely multinomial logit (MNL), nearest neighbor classification (NNC), support vector machines (SVM), and random forests (RF), in predicting traffic crash severity. Vector quantization was used in Mussone and Kim (2010) [21] to cluster crash data through a self-organizing map.
The differing performance and functionality of the available modelling options make the choice of a proper alternative more difficult; the way each tool functions must therefore be taken into account when comparing them.
Moreover, ANNs are a robust tool for investigating complex phenomena without assuming any preliminary hypotheses about the model. Nevertheless, they do not provide an analytical formula in terms of their mathematical functions. Thus, only a sensitivity analysis can be performed to determine how explanatory variables affect outcomes. GLME and NLM models require an analytical formulation of the input-output relationship. Hence, the actual effect of the significant independent variables on the results can be calculated. MNR models have a predefined modelling structure and are used to compare and assess the results with those obtained by ANNs.
In addition, in the field of road safety, survey research is frequently used to study traffic behavior and its underlying cognitive and motivational determinants (Goldenbeld and de Craen 2013) [22]. To understand the factors influencing road crashes, constructing a questionnaire could provide knowledge of road users' habits, perceptions, and beliefs. The road safety perception questionnaire should consider the singularities of the variables that cause road crashes (Espinoza-Molina et al., 2021) [23]. This study is multifaceted and focuses on three aims:

1. To study crash data collected between 2014 and 2017 through a comparison of modelling methodologies, in terms of their performance and results, using four paradigms, namely, artificial neural networks (ANNs), generalized linear mixed-effects (GLME), multinomial regression (MNR), and general nonlinear regression (NLM);
2. To find the analytical formulation that best describes the relationship between input and output;
3. To analyze the common variables of the models.
The data set contained 35,182 crash-related records. The primary distinction between the modelling tools is their categorization: ANNs are machine learning tools, whereas GLME, MNR, and NLM are regression models. The distinction between the modelling categories is based on their functionalities, but their similarities are also investigated.
The rest of this paper is organized as follows: Section 2 describes the methodology adopted in the research. Section 3 provides the information about the data set, such as explanatory and response variables and the preprocessing of data. Section 4 presents the modeling approaches used in this study. Section 5 describes the evaluation methods for modeling and then compares the outcomes. Finally, Section 6 discusses the outcomes and draws the conclusions of the study.

Methodology
The methodology adopted in the research includes the following steps:
1. Analysis of data;
2. Pre-processing and normalization of data;
3. Model building;
4. Check of model performance (if not satisfactory, we went back to step 3 for model building);
5. Analysis of results and discussion.
Steps 3 and 4 were reiterated until performance was considered satisfactory or could not be further improved. This loop holds for all four models; only the MNR model required a few iterations to reach its final configuration. This methodological structure explains why model performance is presented just after the theoretical description of the models.

Database Information
The crash data used in this study come from the Lombardy Region in Italy, the institutional body that collects information on crashes resulting in injuries or fatalities occurring in the region. These data are the same as those managed by the Italian National Institute of Statistics.
The definition of injured people and the distinction between serious and minor injuries are made on the basis of the AIS scale (AAAM, 2022) [24], which was adopted by the Italian Institute of Statistics in accordance with the European Commission directive.
The data refer to crashes that occurred in the city of Milan between 2014 and 2017. Each crash has been considered an independent observation with different characteristics. There are 35,182 records described by 16 variables, 14 of which are used as independent variables and 2 as distinct dependent variables:

1. Crash type, with three categories: crash between circulating vehicles, pedestrian hit, and isolated vehicle crash;
2. Crash effects, with two categories: injuries and fatalities.
Owing to their geo-referenced localization, the distribution of crashes by severity and type of crash in Milan City is presented in Figure 1a,b. In the figures, the density of crashes with an injury outcome is significantly higher than that of fatal crashes. Meanwhile, the categories of crash types are subdivided more homogenously among types and on the territory of the city. Only the first two vehicles (vehicles A and B) involved in a crash were considered in the research.

Data Set Variables
Fourteen explanatory variables were chosen among the different possible variables available in the original data set. The following is the set of variable types (see Table 1 for the details and the basic statistical information):

1. Variables referring to the road conditions;
2. A variable referring to the infrastructure;
3. Variables referring to the crash characteristics;
4. Variables for vehicle characteristics.

The distributions of the observations between the categories of the variables are presented in Table 1.
Unfortunately, no data about the socio-economic characteristics of the people involved in the crashes are available and, due to the anonymity of the data, they cannot be retrieved subsequently.

Data Oversampling and Normalization
One of the issues that can affect modelling is an imbalanced data set. As presented in Table 1, a noticeable difference exists between the frequencies of the two categories of output C15 (though not so critical for C16); this affects machine learning performance and, generally, all other (though nonlinear) regression models. For example, categories with a low number of counts in the training set are almost always ignored by artificial neural networks, and this affects model performance (references to these models can be found in the section Models and Their Performance). Different solutions have been suggested in previous studies to address this issue. Two methods are proposed in the literature: under-sampling and over-sampling. Under-sampling methods are generally used in conjunction with an over-sampling technique for the minority class, and this combination often results in better performance than using over-sampling or under-sampling alone on the training data set. The major drawback of under-sampling is that it can discard potentially useful data that could be important for the induction process. This is particularly true for crash data, which already suffer from the high heterogeneity of the possible combinations of values. Ling et al. (1998) [25] proposed a method in which a category with a low count in the data set was over-sampled to match the size of the other classes. Following their study, we decided to use over-sampling rather than under-sampling in order to lose no data information for either the majority or the minority class. A simple duplication of the minority class in output C15 was carried out to achieve a similar number of observations to the other class (leading to a data set of 70,087 records).
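The over-sampling by simple duplication described above can be sketched as follows (a minimal illustration with NumPy; the function and variable names are ours, not from the paper):

```python
import numpy as np

def oversample_minority(X, y, minority_label):
    """Duplicate minority-class rows (sampling with replacement) until
    the two classes have the same number of observations."""
    minority_idx = np.where(y == minority_label)[0]
    majority_count = int(np.sum(y != minority_label))
    # number of extra copies needed to match the majority class
    n_extra = majority_count - len(minority_idx)
    extra = np.random.choice(minority_idx, size=n_extra, replace=True)
    X_bal = np.vstack([X, X[extra]])
    y_bal = np.concatenate([y, y[extra]])
    return X_bal, y_bal
```

No record of either class is discarded, which is the stated reason for preferring over-sampling to under-sampling.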
Another issue is related to the different scales of the variables, which may also cause training distortions. To solve this issue, all numerical inputs were normalized to the range from 0 to 1 using the formula in Equation (1) (Dutka, 1988) [26]:

Xnorm = (X − Xmin) / (Xmax − Xmin), (1)

where Xnorm is the normalized value of variable X, and Xmax and Xmin are the maximum and minimum values of X, respectively. Binomial variables do not require normalization, while categorical variables are coded in a binary base (e.g., 6 is coded as "110" by using three independent inputs).

Several previous studies [27][28][29][30][31] focused on the application of NNs to crash data sets and demonstrated their power in predicting the proper outcomes. A back-propagation artificial neural network is a particular NN that computes the input-output function by propagating the values computed by the input neurons to the output neuron(s), using the weights of the links connecting neurons as intermediate parameters.
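The normalization of Equation (1) and the binary coding of categorical inputs can be sketched in a few lines (an illustrative sketch; names are ours):

```python
import numpy as np

def min_max_normalize(x):
    """Equation (1): Xnorm = (X - Xmin) / (Xmax - Xmin), mapping to [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def binary_code(value, n_bits):
    """Code a categorical value on n_bits independent 0/1 inputs,
    e.g. 6 -> [1, 1, 0] on three bits ("110" in the text)."""
    return [(value >> i) & 1 for i in range(n_bits - 1, -1, -1)]
```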

Models and Their Performance
Learning occurs by iteratively changing those weights over many input-output pairs so that more accurate predictions can be provided. As we decided to study the two dependent variables separately to make the interpretation of the outcomes easier, we developed two distinct NN structures. The training procedure was identical for both models. First, the data set was divided randomly into three different sets: training, test, and validation sets, with 60%, 20%, and 20% of the data, respectively.
The training set was used during the training process to calculate the weights of the links in the NN layers. At each iteration, the test set was used to calculate the network performance and to decide when to stop learning. The validation set then tested the model with data it had never seen before, giving the actual NN performance. The procedure was repeated a number of times to check whether data heterogeneity affected the outcomes.
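The random 60/20/20 partition can be sketched as an index-based split (our sketch, not the authors' code):

```python
import numpy as np

def split_60_20_20(n, seed=0):
    """Randomly partition n record indices into training (60%),
    test (20%) and validation (20%) index sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(0.6 * n)
    n_test = int(0.2 * n)
    return (idx[:n_train],
            idx[n_train:n_train + n_test],
            idx[n_train + n_test:])
```

Repeating the split with different seeds reproduces the paper's check that outcomes are not driven by subset composition.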
The disadvantage of using the back propagation artificial neural network (BPNN) approach is that the NN model is a black box, and the relationships between variables are not deducible from the inspection of weights.
Therefore, no analytical formulation between the input and output can be obtained directly. The effects of the input variables were analyzed through a sensitivity analysis of the model. Performance was evaluated according to the mean squared error (MSE) (Equation (2)):

MSE = (1/n) Σ (Yi − Y′i)², (2)

where n is the number of data points, Yi are the observed values, and Y′i are the predicted values. The BPNN is trained by updating the weight and bias values according to the Levenberg-Marquardt optimization algorithm (Kumaraswamy, 2021) [32]. The model was constructed using an input layer containing the 14 independent variables (C1-C14) reported in Table 1, hidden layers, and an output layer with only one output, corresponding to the C15 or C16 variable. As explained in the previous section, all categorical variables were coded in a binary format to reduce the connection between their values.
The data set was divided into three subsets according to the following percentages: 60%, 20%, and 20% for the training, test, and validation sets, respectively. This configuration was the outcome of many trials focused on optimizing performance. The tuning of internal parameters (such as the internal weights) was made by using an interface program which optimized them based on the performance achieved on the validation set. The Levenberg-Marquardt algorithm adaptively varies the parameter updates between the gradient descent and Gauss-Newton updates. Overfitting is controlled by the test and validation sets. To increase the reliability of the models, we repeated the division of the data set into the three subsets several times and checked whether the outcomes were similar or affected by the subset composition. The final structure of the ANN is, in turn, the result of many trials with different numbers of hidden layers and hidden neurons; this is an empirical activity. The best structures for the response variables were created with two hidden layers of 30 neurons each for C15 (Figure 2a) and one hidden layer with 40 neurons for C16 (Figure 2b). The networks had an MSE of 0.018 for the C15 model and 0.049 for the C16 model. The MSE for C15 implies that there are approximately two errors for every one hundred samples, because the maximum error in this case is one. For output C16, the errors may assume different values (because the output varies from one to three); however, considering the confusion matrix of this model (Appendix A, Table A1), it can be deduced that there are approximately five errors for every one hundred samples. It is worth noting that the MSE (or, equivalently, the RMSE) is generally used for NN training, though in some cases it may cause classification problems, which, fortunately, were not experienced with these models. In any case, the comparisons with the other regression models were made by considering all the calculated indices.
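The MSE of Equation (2), and its reading as an error rate when the output is binary (each wrong 0/1 prediction contributes exactly 1), can be checked with a small sketch (illustrative only):

```python
import numpy as np

def mse(y_obs, y_pred):
    """Equation (2): mean squared error over n data points."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_obs - y_pred) ** 2))
```

For a 0/1 output, two misclassified samples out of one hundred give MSE = 0.02, which is how the paper reads the 0.018 value for C15.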

Generalized Linear Mixed Effects (GLME)
For the analysis of multilevel data, random cluster and/or subject effects should be included in the regression model to account for the data correlation (Mussone et al., 2017) [30]. Generalized linear models (GLMs) extend classical linear models by allowing responses to follow distributions in the exponential family and by allowing nonlinear relationships between the response and covariates. The exponential family includes a wide range of commonly used distributions, such as the normal, binomial, and Poisson distributions (Wu, 2010) [33]. Fixed- and random-effect terms are the main components of generalized linear mixed effects (GLME) models (Molenberghs et al., 2002) [19]. Fixed-effect terms are the conventional linear regression parts of the model, while random-effect terms are associated with individual experimental units drawn at random from a population and account for variations between groups that might affect the response. Random effects have prior distributions, whereas fixed effects do not. The general formulation and standard forms of the GLME model (Molenberghs et al., 2002) [19] are expressed as follows:

yi | b ~ Distr(µi, σ²/wi),

g(µ) = Xβ + Zb + ε.

Here, yi is the outcome variable, Distr is a specified conditional distribution of y given b, µ is the conditional mean of y given b and µi is its ith element, σ² is the dispersion parameter, w is the effective observation weight vector, g(µ) is a link function, X is the matrix of the predictor variables, β is a column vector of the fixed-effect regression coefficients, Z plays the role of design matrix for the random effects, b is a vector of the random effects, and ε is a column vector of the residuals.
In this study, the "log" link function (Equation (7)) was used because a Poisson or binomial distribution is assumed for the output variables:

g(µ) = log(µ). (7)

Finally, the model for the mean response µ can be written as in Equation (8), where g⁻¹ is the inverse of the link function g(µ) and η is the linear predictor of the fixed and random effects of the GLME:

µ = g⁻¹(η) = exp(η). (8)

As modelling with the original data set by using GLME ignored the prediction of correct values for fatal crashes, an oversampled data set was also used for the response variable C15. It is worth noting that GLME models do not require independence of the observations. The log link function and a binomial distribution as a probability mass function were employed.
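The mean response of Equation (8) can be sketched as follows (toy design matrices of our choosing, not the fitted Milan model):

```python
import numpy as np

def glme_mean_response(X, beta, Z, b):
    """Equation (8) with a log link: mu = g^{-1}(eta) = exp(eta),
    where eta = X @ beta + Z @ b is the linear predictor combining
    fixed effects (X, beta) and random effects (Z, b)."""
    eta = X @ beta + Z @ b
    return np.exp(eta)
```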
The best performance was found by minimizing the log-likelihood index; other indices (i.e., Akaike's information criterion, AIC; the Bayesian information criterion, BIC) were also estimated to control the minimization process. For the C16 model, because the responses of the trials were acceptable without oversampling, the original data set was used. Therefore, according to the given formulations, after a long series of trials with different combinations of independent variables, the GLME models providing the best performance for C15 (Equation (9)) and C16 (Equation (10)) are formulated as follows using Wilkinson's notation: Here, Cx (x = 1, . . . , 14) are the independent variables, C2|C11 denotes that C2 is analyzed after grouping the data by C11, and C4²-C4 indicates that only second-order effects of C4 are considered. Table 2 shows the statistical analysis results of the GLME model for C15, where all p-values were lower than 0.001. The standard error of the estimates (SE) is generally much lower than the estimates, and the lower and upper bounds of the confidence interval (CI) never include zero. The estimated coefficients reveal the positive contributions of C8, C9, C4², and C9². Thus, to better understand the effect of the variables, a marginal effects analysis was applied through the analytical GLME model by computing the change in the output for a unitary increment of the considered variable. The same analysis was developed for C16, as presented in Table 3. The positive contributions of C5 and C9, in addition to the negative contributions of C4, C7, and C9², are evident in the model. The confusion matrices for C15 and C16 are in Tables A2 and A6 in Appendix A, respectively.

Multinomial Regression (MNR)
The MNR model is a classification model that generalizes binomial logistic regression to multiclass problems; in other words, more than two possible discrete outcomes are available as response variables. The MNR model indicates the probability of observation i choosing outcome k given the observation's measured characteristics. The outcome of a response variable can be a restricted set of possible values. Because no natural order exists among the response variable categories in the data used, the nominal response mode was chosen for the model. The mathematical representation of the model with k output categories can be written as the following system of relationships (Equation (11)):

ln(θj/θk) = αj + βj1 X1 + . . . + βjn Xn, j = 1, . . . , k − 1. (11)

Here, Xn is an input variable, θj = P(y = j) is the probability of an outcome being in category j, k is the number of response categories, and n is the number of predictor variables. The last (kth) category was used as the reference category. Further, the βjn are the coefficients of the model that express the effects of the predictor variables on the log odds of being in category j versus the reference category k. The most important assumption is to set the coefficients of the kth category to zero. Therefore, the probability of being in each category j is

θj = exp(αj + Σn βjn Xn) / (1 + Σl exp(αl + Σn βln Xn)), l = 1, . . . , k − 1.

Similar to the other modelling methods, MNR could not predict the fatal crash category in the C15 model; thus, the oversampled data set was applied. The α values in Tables 4 and 5 represent the intercepts of the model. Because some of the calculated p-values are greater than the threshold value, the related variables were removed from the model.
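The category probabilities implied by Equation (11), with the kth category as the zero-coefficient reference, can be computed as in this sketch (the array layout is our assumption):

```python
import numpy as np

def mnr_probabilities(X, betas):
    """Probabilities for a nominal multinomial model with k categories.
    `betas` holds the (k-1) coefficient vectors (intercept first); the
    k-th (reference) category has all coefficients fixed at zero."""
    X1 = np.column_stack([np.ones(len(X)), X])   # prepend intercept term
    eta = X1 @ np.asarray(betas).T               # log odds vs reference
    num = np.exp(eta)
    denom = 1.0 + num.sum(axis=1, keepdims=True)
    # columns: categories 1..k-1, then the reference category k
    return np.column_stack([num / denom, 1.0 / denom[:, 0]])
```

The highest-probability column is taken as the predicted category, as in the paper.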
Because the C16 output has three different categories, the nominal MNR model can be represented by two equations. The isolated vehicle crash category was selected as the model reference category, and the predicted values of the model depict the relative risk of being in one category versus being in the reference category. Therefore, the highest probability was selected as the predicted value for the model. The model's statistical representations for the C16 output variable are presented in Table 5 (non-significant variables are highlighted by a grey background).
The statistical analysis reveals the importance of the driver B information when the values of the coefficients are compared with each other. Because the driver B information has positive coefficients, the probability of being in the first category against the reference category increases as the components of driver B, such as C11 and C12, increase. Meanwhile, an increase in the C7 input variable decreases the probability of being in the first category and increases that of being in the second category.

General Nonlinear Regression (NLM)
Nonlinear regression is a form of regression analysis in which observational data are modeled by a function that is a nonlinear combination of model parameters, completely determined in its form by the researcher, and depends on one or more independent variables. The data were fitted using successive approximations. These models can make better predictions for unobserved data than other models whose analytical form is limited.
A general representation of the parametric nonlinear regression model is shown in Equation (15):

y = f(X, β) + ε, (15)

where y is the vector of response variables, X is the vector of input variables, β is the vector of unknown parameters to be estimated, ε is the vector of identically distributed random disturbances, and f is a function of X and β. The assumptions for a standard nonlinear regression model are as follows:

1. Errors are independent;
2. Errors have mean zero and constant variance;
3. Errors are normally distributed.
Some of these assumptions may be relaxed for more general models. This type of model attempts to find the parameter vector β that minimizes the MSE between the observed responses y and the model predictions f(X, β). To accomplish this, a starting value β0 is required, and the vector β is then iteratively modified until the MSE reaches a minimum.
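The iterative scheme just described can be sketched with Gauss-Newton steps and a numerical Jacobian (one possible implementation under our own assumptions, not the authors' code):

```python
import numpy as np

def fit_nlm(f, X, y, beta0, n_iter=100, h=1e-6):
    """Least-squares fit of y ~ f(X, beta) by Gauss-Newton iterations,
    starting from the user-supplied initial guess beta0."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iter):
        r = y - f(X, beta)                      # current residuals
        # forward-difference Jacobian of f with respect to beta
        J = np.column_stack([(f(X, beta + h * e) - f(X, beta)) / h
                             for e in np.eye(len(beta))])
        step, *_ = np.linalg.lstsq(J, r, rcond=None)
        beta = beta + step                      # Gauss-Newton update
    return beta
```

The choice of β0 matters: a poor starting value can send the iterations to a local minimum, which is why the paper builds the NLM form from the GLME/MNR results.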

Database Information Content
Principal component analysis (PCA) (Lebart et al., 1986) [34] was used to investigate the information content of the database and to identify the main variables explaining the variance in the data. In this study, PCA was applied to understand why, for the C15 models, only the NN achieved good performance. Hence, we considered the four subsets of data according to the classification made by the GLME model (represented by the confusion matrix in Table A2): true injuries (class 0); false injuries (false 0); true fatalities (class 1); false fatalities (false 1). A PCA was carried out for each of them, and their outcomes are plotted in Figure 3. The first four principal components explain more than 99%, and the first two 86%, of the overall variance; the related variables are C8, C10, C12, and C14 (age and years of driving license of drivers A and B, respectively). Considering that C10 and C14 are collinear with C8 and C12, we can assume that C8 and C12 are the actual crucial variables for modelling. It is worth noting that the mentioned variables are also related to the driver's licensing age.
In Figure 3, true injuries (red) and false fatalities (black) almost overlap. The same, though to a lesser extent, occurs for true fatalities (blue) and false injuries (green). This accounts for the difficulties in increasing the performance of the GLME, NLM, and MNR models, which, contrary to NN, are not capable of such a discriminant property for those types of data.
The PCA for the C16 model was applied to the subsets of data corresponding to those calculated by using the GLME model (Table A6). For class 1 (crash between circulating vehicles), the variance in the data (red points) is well represented by variables C8 (i.e., age of driver A) and C12 (i.e., age of driver B). For the other two classes (2, pedestrian hit; 3, isolated vehicle), the situation is less clear. For all subsets, the two main principal components are represented by variables C8 and C10. As shown in Figure 4, the subsets for the false cases (yellow and green points) are almost indistinguishable, as in the C15 model, but in this model these bad samples are far fewer.
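The explained-variance computation behind such a PCA can be sketched via the singular value decomposition of the centred data matrix (an illustrative sketch, not the original analysis):

```python
import numpy as np

def pca_explained_variance(X):
    """Fraction of total variance explained by each principal component,
    from the singular values of the column-centred data matrix."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2                    # variance along each component
    return var / var.sum()          # explained-variance ratios
```

Summing the leading ratios reproduces statements such as "the first two components explain 86% of the overall variance".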

Sensitivity Analysis
Neural networks do not provide an analytical formulation between the input and output variables. One way to address this issue is to conduct a sensitivity analysis, which is carried out by varying the input values individually or in a limited group (scenario) and by observing how the output varies. A number of representative scenarios were prepared from the large set of possible ones. These scenarios aim to cover typical crash situations involving male or female drivers, during daytime or nighttime, with rainy weather or dry road surfaces, or with elderly drivers.
The main outcomes can be synthesized as follows: For the C15 model, a fatality crash is more likely to occur when driver A's age is high. The difference between weather conditions in scenarios reveals that older drivers are prone to fatal crashes in unstable weather conditions. Conversely, the age of driver B does not significantly affect the output in most of the prepared scenarios. The time of crash and road conditions play a crucial role together with road typology. For example, if the road typology is a one-way carriageway in wet road conditions, a fatal crash can occur at night.
By applying sensitivity analysis to the C16 model, it was revealed that an increase in driver A's age during unstable weather conditions (such as rainy weather with wet road conditions) significantly affected the output classified as the pedestrian hit crash type, regardless of the time of crash. However, younger drivers, such as driver B, were prone to having the same output. In addition, driver B's licensing year variable is related to pedestrian hit crashes.
Another method to roughly assess the relative importance of the input variables is to cancel one variable at a time and then calculate the new best NN and its performance. The variable whose cancelation leads to the worst performance is the most important. This method does not consider second-order effects, but is remarkable for its simplicity. When applied to the C15 and C16 models, it confirms and completes the outcomes of the sensitivity analysis (as reported in Table 8).
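The cancel-one-variable-at-a-time procedure can be sketched generically; here `train_and_score` is a hypothetical callable standing in for the retrain-and-evaluate step (our abstraction, not the paper's code):

```python
def variable_importance(train_and_score, X, y, n_vars):
    """Rank input variables by how much the error grows when each one
    is removed and the model is retrained. `train_and_score(X, y, cols)`
    must return an error measure (e.g. MSE) for the given column subset."""
    baseline = train_and_score(X, y, list(range(n_vars)))
    importance = {}
    for v in range(n_vars):
        kept = [c for c in range(n_vars) if c != v]
        importance[v] = train_and_score(X, y, kept) - baseline
    return importance  # larger error increase => more important variable
```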

Marginal Effects Analysis
Another method to analyze the effect of an input when the analytical model is available, such as in GLME, is marginal effects analysis (MEA). MEA reveals the change in a model's response value when a specific input variable changes by one unit, while the other variables are held constant. Suppose the model is log(y) = β1 C1 + βx Cx and C1 is the variable under consideration. Its marginal effect, ME, is

ME = y′/y = exp(β1),

where y′ is the new value of the response variable after a unitary increase in the input variable, C1 → C1 + 1. The higher the ME, the stronger the effect of the variable: if β1 > 0, ME > 1, and if β1 < 0, ME < 1. If the formula involving the variable is not linear, the calculations are more complex, but the procedure is identical. The outcomes of this analysis are reported in the synoptic Table 8.
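For a log link, the derivation above reduces to ME = exp(β1), independent of the current values of C1 and Cx; a toy check (the coefficients are arbitrary):

```python
import math

def marginal_effect_log_link(beta1, beta_x, c1, cx):
    """ME for variable C1 in the model log(y) = beta1*C1 + beta_x*Cx:
    the ratio y'/y after the unit increase C1 -> C1 + 1."""
    y = math.exp(beta1 * c1 + beta_x * cx)
    y_new = math.exp(beta1 * (c1 + 1) + beta_x * cx)
    return y_new / y   # equals exp(beta1) regardless of c1 and cx
```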

Model Comparison
Finally, to compare the models' performance, a confusion matrix is considered (Powers, 2020) [35]. A confusion matrix, also known as an error matrix, is a table where each column represents the instances in a predicted class (if the output is continuous, it can be divided into classes) and each row represents the actual class instances. Precision, also called positive predictive value, is the fraction of relevant instances among the retrieved instances. Meanwhile, recall, also known as sensitivity, is the fraction of retrieved relevant instances among all relevant instances. The "a priori" rate (PR) (Equation (19)) corresponds to the percentage of correctly predicted crashes to the total to be predicted for each severity level, and the "a posteriori" rate (PO) (Equation (20)) is the percentage of correctly predicted crashes to the total of predicted crashes for each severity level (Powers, 2020) [35]:

PRi = aii / Σj aij, (19)

POi = aii / Σj aji, (20)
where aij represents the number of cases of actual class i predicted as class j, and n is the matrix dimension. Finally, the confusion matrix's overall accuracy can be calculated using Equation (21) (Powers, 2020) [35]:

Accuracy = Σi aii / Σi Σj aij. (21)

Accuracy is a more aggregate measure than PR and PO. The confusion matrices in Tables A1-A8 show the outcomes for the four model paradigms and for the two outputs, C15 and C16. In Table A1, the class "−1" is created by the ANN model in recall mode; it may be that ANNs extrapolate the output outside the interval used for training. As with other aggregated indices, accuracy is not a perfect measure of performance. In the case of the C15 model, its value is conditioned by the high level of imbalance in the data and by the oversampling procedure. Table 8 reports a summary of model performance (accuracy, and PR and PO by row and column, respectively).
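PR, PO, and accuracy from Equations (19)-(21) can be computed directly from a confusion matrix (rows = actual classes, columns = predicted classes; the matrix below is made up for illustration):

```python
import numpy as np

def confusion_rates(M):
    """PR (Eq. (19)): diagonal over row totals (actual class totals).
    PO (Eq. (20)): diagonal over column totals (predicted class totals).
    Accuracy (Eq. (21)): trace over the grand total."""
    M = np.asarray(M, dtype=float)
    d = np.diag(M)
    pr = d / M.sum(axis=1)          # per-class recall-like rate
    po = d / M.sum(axis=0)          # per-class precision-like rate
    accuracy = d.sum() / M.sum()
    return pr, po, accuracy
```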
The C15 output achieved the best performance (accuracy = 97.4%) with the ANN model, but not without issues. First, a class with C15 = −1 is created, likely due to the ANN's extrapolation capability (which must be considered a drawback here because it is not controlled by actual data). Second, the "perfect" prediction of class 1 is attributable to oversampling; therefore, overtraining cannot be excluded. The confusion matrices for the other models are similar. Although an oversampled data set is also used, no definitive benefit is achieved, and fatal crashes are still hardly discernible from those with injuries (for the reasons reported by the PCA in Section 4.1). For the GLME and NLM models, the crucial point is to find the optimal relationship between the input variables, considering not only first- and second-order effects but also their interactions. Because the MNR model does not require an analytical formulation to model the data, it does not appear to be a suitable model when the input data have complex relationships with the output.
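The oversampling discussed above can be illustrated by naive random duplication of the minority class; a minimal sketch of this kind of data treatment (a generic illustration, not the paper's actual procedure):

```python
import random

def oversample(records, labels, target_label, factor, seed=0):
    """Naive random oversampling: duplicate (with replacement) the records
    of the minority class `target_label` until its count is multiplied
    by `factor`. Returns new (records, labels) lists."""
    rng = random.Random(seed)
    minority = [(r, l) for r, l in zip(records, labels) if l == target_label]
    extra = [rng.choice(minority) for _ in range((factor - 1) * len(minority))]
    new_records = list(records) + [r for r, _ in extra]
    new_labels = list(labels) + [l for _, l in extra]
    return new_records, new_labels

# Hypothetical tiny data set: three non-fatal crashes (0) and one fatal (1)
recs, labs = oversample([1, 2, 3, 4], [0, 0, 0, 1], target_label=1, factor=3)
```

Duplicating minority cases balances the class counts but adds no new information, which is precisely why "perfect" predictions of an oversampled class, as observed for the ANN, may indicate overtraining rather than genuine discriminative power.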
For the C16 output, the ANN still achieved the best performance (accuracy = 93.4%), whereas the other models had similar outcomes (accuracy never below 91%). Differences between models are due to their different capabilities to distinguish between classes 2 and 3 (i.e., pedestrian hit and isolated-vehicle crash), because class 1 is perfectly forecasted. The PCA in Section 4.1 explains the reason for this; it may be that either further variables are needed or the events have a strong random component. Table 8 presents the synoptic table of model performance that synthesizes the previous analyses. For the most part, the most relevant variables are the age and gender of drivers A and B and the type of road infrastructure (section or intersection). For output C15, road typology and meteorological conditions are also relevant; for output C16, road conditions are.

Discussion and Conclusions
The comparison of model performance for crash risk evaluation has been addressed by several authors (e.g., Iranitalab and Khattak (2017) [20] and Dimitrijevic et al. (2022) [15]). A result common to those studies and the present one is that model performance varies with the model type, and with it, so do the most significant variables (see Table 8 for a synopsis of the variables used in the models developed in this research). A cross-comparison between results achieved with different data sets is hardly feasible, not least because of differences in the data sets themselves; this holds not only for the data distribution but also for the fields available in each data set.
In the review paper by Slikboer et al. (2020) [36], twenty studies on the prediction of road crashes were taken into consideration, and among them only one reported an independent variable (driver gender) that was also used in our research. In the paper by Zhang et al. (2018) [37], road and meteorological conditions were the only significant fields available for comparison. A comparison of performance between different models is of certain interest; however, it must be considered that performance depends greatly on the contents of the data set (mainly the distribution of cases and the relationship between input and output). In Zhang et al. (2018) [37], the overall accuracy for six machine learning and statistical models ranges from 44% to 53.9%; random forest performs best, but no NNs are used.
In Iranitalab and Khattak (2017) [20], the best overall performance in predicting the costs of crashes is achieved by NNC (nearest neighbor classification), followed at a distance by SVM (support vector machine), RF (random forest), and MNL (multinomial logit).
Apart from these issues in comparing the different experiences on the subject, one limitation of this study is the content of the data set: it does not include information about the socio-economic characteristics of the people involved in the crashes, the traffic flow when the crash occurred, or the speed of the involved vehicles. Another limitation is that not all possible types of models have been investigated, but only four. The GLME and NLM models require an analytical formulation in advance. Unlike for the ANN model, oversampling the data set did not improve their overall performance. However, because of the insertion of the analytical formulation, the actual effect of each input variable can be determined. Nevertheless, the many different ways of combining input variables make the analytical formulation a tricky exercise. Despite expectations and efforts, the performance of NLM did not surpass that of GLME, which indicates that the knowledge about the crash data contents is still incomplete. In addition, different analytical formulations may lead to models that appear different but have the same performance. In this context, MNR is relatively easier to apply but offers no way to improve performance.
Another critical issue is the use of regression models in classification problems; this can hinder their application if the structure of output data is too complex (e.g., with more than one field).
Future studies should still focus on finding the optimal analytical formulation for regression models. In addition, research can be conducted to discover suitable data treatment methods for cases where the data set includes incomplete information or an unbalanced data distribution. Notably, acquiring flow data as additional information for road crash analysis may positively affect the results of future studies. The current data set contains all crashes with injured people or fatalities; however, for a probabilistic measure of risk, crashes with only property damage should also be considered, and events that did not lead to any type of crash would also be of interest.
Hence, it is worth noting that the modelling of crash data is only the first step for extracting rules or indications to be applied on the road for traffic control or road geometry concerns, and those outcomes refer only to the limited, though most expensive, component of road safety.
This paper aims to indicate the variables to be investigated in further detail for the specific data set of Milan. Preprocessing of the data shows which type and structure of model to use; it also reveals whether the collected data are really sufficient for the analysis, in terms of both missing data and information content.
Some features of the drivers turned out to be relevant, sometimes referring to driver A and sometimes to driver B. This raises a question about the meaning of A and B. At present, it is very likely that A and B (when both are present) are assigned at random (except for pedestrians and cyclists, who are always labelled as B) and not according to the (even presumed) responsibility for the crash. The models highlight the significance of a driver's age, which always has a positive relationship with the output (that is, as age increases, crash severity increases). It therefore makes sense to extract crash records stratified by driver age and analyze the distribution of elderly drivers' crashes on the network. The main issues with elderly drivers are related to visibility conditions (not present in the crash data set) and the complexity of maneuvers.
These last two points are related to the type of infrastructure, which is another important variable in the models. They show that severity increases at intersections when crashes occur between two vehicles. None of this is really new, but it indicates that in Milan, for the years of data collection, the main concerns (in a statistical sense) lie at intersections, due to regulation or geometry. Of course, this must be investigated locally in detail.
The models give only limited suggestions about the role of weather conditions in crashes, partly due to the high percentage (87%) of crashes occurring in clear weather. However, it is possible to infer that when weather conditions worsen (for example, rain, fog, or snow), pedestrian or isolated-vehicle crashes may increase. In these cases, visibility and, more generally, geometry may be causes or contributing causes of crashes.
Among all the information present in crash records, the most operative (at least in the short term) is crash localization (by geo-referenced devices), followed by the environmental variables (meteorological, illumination, and road pavement conditions), which provide some initial hints about possible structural interventions. Data on the drivers and other people involved in a crash are limited to age and years of holding a driving license. No data are available about their psychophysical conditions (e.g., whether under the influence of alcohol or drugs, or their emotional profile) or about compliance with highway rules (e.g., use of seat belts, use of glasses if needed, maximum number of driving hours, and use of cellular phones). There are also no data on traffic conditions (e.g., flow, percentage of heavy vehicles, and average speed) at the time of a crash. To understand how the crash data set could be developed to improve the search for policy indications, one should refer to the concept of road safety in people's minds, that is, their interpretation of road safety.

Acknowledgments: The authors thank Regione Lombardia-Divisione Sicurezza Stradale for providing the data used in this research.

Conflicts of Interest:
The authors declare no conflict of interest.