A Rare Event Modelling Approach to Assess Injury Severity Risk of Vulnerable Road Users

Introduction
Road crashes are among the leading causes of death, disability and property loss, and impose costs on society that represent 1-3% of GDP worldwide [1]. More than one million people lose their lives every year in road crashes and 20 to 50 million people are injured [1]. Pedestrians and cyclists are vulnerable road users (VRUs) since they are unprotected and they represent the majority of people killed and injured on the European Union (EU) roads [2]. Despite long-term trends in reducing death and injury rates, in 2017, 21% of fatalities on EU roads were pedestrians and 8% were cyclists, and these fatalities are decreasing at a lower rate than other fatalities [3]. In Portugal, VRUs accounted for 25% of all road fatalities in 2017 (21% pedestrians and 4% cyclists) [4].
Transportation systems are becoming more sophisticated and face more risks, which makes it harder for regulators to ensure safety [5]. Many factors are related to road crash risk, such as human factors, environmental conditions, roadway infrastructure, traffic characteristics and vehicle conditions. However, the risk factors that contribute to injury severity may differ for each type of road user [6,7]. It is therefore particularly important to give special attention to VRU safety by providing a better understanding of the factors that affect the outcome in terms of injury severity [8].
Jiang et al. [59] introduced three methods to model unbalanced data (random forest, AdaBoost and Gradient Boost), showing that the latter two generate more balanced prediction accuracies.
The novelty of this study is the suggested methodological approach, which applies different resampling techniques to imbalanced pedestrian and cyclist motor vehicle crash datasets to perform a comparative evaluation of two commonly used but different classifiers: decision tree and logistic regression. For that purpose, road crash records involving a motor vehicle and pedestrians/cyclists from six years (2012-2017) and three different cities were used. This database was organized by injury severity level, which is classified into severe (serious injuries and fatalities) and non-severe (light injuries). Since the minority class is much smaller than the majority class, the dataset is imbalanced. Thus, different resampling techniques (undersampling, oversampling and synthetic oversampling) were applied. The best resampling method is selected based on classifier performance, assessed through receiver operating characteristic (ROC) curves. The developed models allow the identification of risk factors that can affect pedestrian or cyclist injury severity when involved in a motor vehicle crash. These findings can be used as a tool for local authorities to develop road safety strategies. Among the variables under evaluation, road conditions and road markings deserve particular attention, as they are sometimes neglected in the literature.

Methodology
This section describes the techniques used for data resampling, followed by a short description of the classifier methods: decision tree and logistic regression. The classifiers will be applied to identify the variables which are statistically significant in predicting the injury severity of a VRU involved in a crash. Lastly, data characteristics and pre-processing as well as case studies are described. The conceptual framework designed for this study is presented in Figure 1. This process was applied to each city individually and to a global dataset including all crash and injury information. To carry out the proposed methodology, R [60], an open-source software environment for statistical computing, was used to handle the crash dataset, with specific packages for resampling imbalanced data and for applying the two classifiers.

Resampling Techniques
Crash records can be considered an imbalanced dataset, since the target variable (injury severity) is predominantly imbalanced, with the majority of instances belonging to the non-severe class and only a small percentage of the instances in the severe class. Resampling is a commonly used data-level approach to deal with class imbalance [53,54]. Resampling methodologies are processes of repeatedly drawing samples from a dataset and refitting a given model. These methodologies can be divided into undersampling and oversampling [54].
Three resampling techniques are proposed for application to such a dataset in order to construct a more balanced one. In this study, the widely used random undersampling and random oversampling methods as well as a synthetic oversampling method were analysed. Because the analysis relies on open-source software, the ROSE R package [61] was explored, as it has many built-in resampling techniques. The best resampling method was then selected based on the classifier performance (see Section 2.2) through ROC curves.

Undersampling
The undersampling method consists of constructing a balanced dataset by randomly removing instances from the majority class until the desired ratio has been reached, thereby adjusting the class distribution of the dataset [41]. The main advantage of an undersampling method is the reduction in training data size when the original dataset is large. On the other hand, removing instances may imply a loss of valuable information about the majority class [52].
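As an illustration, random undersampling can be sketched in a few lines. The study itself used the ROSE R package; the Python function below, its name `random_undersample`, the `severity` key and the toy records are assumptions made only for this example. The class sizes mirror those reported for Lisbon cyclists in Table 2 (456 non-severe, 20 severe).

```python
import random

def random_undersample(records, label_key="severity", minority="severe", seed=0):
    """Randomly drop majority-class records until both classes have equal size.

    Hypothetical helper for illustration; not the ROSE package implementation.
    """
    rng = random.Random(seed)
    minority_rows = [r for r in records if r[label_key] == minority]
    majority_rows = [r for r in records if r[label_key] != minority]
    # Keep a random subset of the majority class, as large as the minority class.
    kept_majority = rng.sample(majority_rows, len(minority_rows))
    balanced = minority_rows + kept_majority
    rng.shuffle(balanced)
    return balanced

# Toy records: 456 non-severe and 20 severe cases.
toy = [{"id": i, "severity": "non-severe"} for i in range(456)]
toy += [{"id": 456 + i, "severity": "severe"} for i in range(20)]

balanced = random_undersample(toy)
print(len(balanced))  # 40
```

In this toy case, 436 of the 456 non-severe records are discarded, which makes the information-loss drawback mentioned above concrete.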

Oversampling
In the oversampling method, the balanced dataset is constructed by randomly duplicating instances from the minority class until the desired ratio has been reached [41]. The advantage of an oversampling technique is that it leads to no information loss [52]. However, although it is widely used, oversampling might be ineffective at improving recognition of the minority class and may lead to overfitting [54].
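A minimal sketch of random oversampling follows, again in Python for illustration only (the study used R). The function name `random_oversample`, the `severity` key and the toy records are hypothetical; the class sizes mirror the Lisbon cyclist figures in Table 2.

```python
import random

def random_oversample(records, label_key="severity", minority="severe", seed=0):
    """Duplicate random minority-class records until both classes have equal size.

    Hypothetical helper for illustration; not the ROSE package implementation.
    """
    rng = random.Random(seed)
    minority_rows = [r for r in records if r[label_key] == minority]
    majority_rows = [r for r in records if r[label_key] != minority]
    n_extra = len(majority_rows) - len(minority_rows)
    # Sampling with replacement: the same minority record can be copied many times.
    duplicates = [rng.choice(minority_rows) for _ in range(n_extra)]
    balanced = majority_rows + minority_rows + duplicates
    rng.shuffle(balanced)
    return balanced

# Toy records: 456 non-severe and 20 severe cases.
toy = [{"id": i, "severity": "non-severe"} for i in range(456)]
toy += [{"id": 456 + i, "severity": "severe"} for i in range(20)]

balanced = random_oversample(toy)
severe_ids = {r["id"] for r in balanced if r["severity"] == "severe"}
print(len(balanced))    # 912
print(len(severe_ids))  # 20
```

The balanced set holds 456 severe rows but still only 20 distinct severe cases, which is why a classifier trained on it can overfit to those few cases.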

ROSE
ROSE (random oversampling examples) is a bootstrap-based technique that generates a synthetic sample from the feature space around the minority class according to a smoothed-bootstrap approach. Accordingly, ROSE combines oversampling and undersampling by generating an augmented sample of the data (mainly belonging to the minority class). The ROSE methodology involves three steps: (1) resample the majority class using a bootstrap technique to remove instances, considering a ratio of 50% (undersampling); (2) repeat the same process for the minority class (oversampling); (3) generate new synthetic data in the neighbourhood of the resampled instances, where the shape of the neighbourhood is determined by a function provided by the ROSE R package. The result is a new synthetic training sample of approximately the same size as the original dataset [62].
Studies have shown that generating new synthetic data to balance a skewed dataset is an alternative to the above resampling techniques. It has been associated with a reduced risk of overfitting and with an improvement in the generalisation ability that oversampling methods compromise [63].
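The smoothed-bootstrap idea can be sketched as follows. This is a simplified Python illustration, not the ROSE R package code: it assumes numeric features only, a Gaussian kernel and a single fixed bandwidth `h` chosen by hand, whereas the package selects its own kernel widths. The function name `rose_like_sample` and the toy data are hypothetical.

```python
import random

def rose_like_sample(X, y, n_new, h=0.5, p_minority=0.5, minority=1, seed=0):
    """Draw n_new synthetic rows with a smoothed bootstrap (ROSE-style sketch).

    Assumptions: binary labels, numeric features, Gaussian kernel, and one
    fixed bandwidth h for every feature.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    majority = next(c for c in by_class if c != minority)

    X_new, y_new = [], []
    for _ in range(n_new):
        # 1) Choose the class of the new row (50/50 by default).
        label = minority if rng.random() < p_minority else majority
        # 2) Bootstrap one observed row of that class.
        centre = rng.choice(by_class[label])
        # 3) Draw the synthetic row from a kernel centred on that row.
        X_new.append([rng.gauss(v, h) for v in centre])
        y_new.append(label)
    return X_new, y_new

# Toy data: 94 non-severe (0) and 6 severe (1) rows with two numeric features.
X = [[30.0, 1.0]] * 94 + [[70.0, 0.0]] * 6
y = [0] * 94 + [1] * 6

X_new, y_new = rose_like_sample(X, y, n_new=len(X))
print(len(X_new))  # 100
```

Because every synthetic row is perturbed rather than copied, the balanced sample contains no exact duplicates of the few minority cases, which is the intuition behind the reduced overfitting risk.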

Supervised Learning Classifiers
In order to identify risk factors significantly affecting the VRU injury severity, two classification techniques were explored and the results compared. Classifiers can be trained using historical data with the known outcome to predict an associated class. A trained model aims to be able to classify unseen new data correctly.
In this study, the dependent variable (injury severity) presents a binary classification with two possible outcomes (non-severe or severe). Two widely used supervised classification techniques will be explored, namely decision tree and logistic regression.
A stratified holdout procedure was applied to split the data into training and testing sets, which ensures that each class is represented in both sets. The training set (70% of instances) is used to build the models, which are then evaluated on the testing set (the remaining 30% of instances) to assess their predictive accuracy [64]. Classifier performance is evaluated through receiver operating characteristic (ROC) curves [65,66]. The ROC curve plots the true positive rate (sensitivity, i.e., the share of severe injuries correctly classified) against the false positive rate (1 - specificity). Overall performance can be summarised by the area under the ROC curve (AUC), a measure that quantifies how accurately a model can discriminate between the two classes. An AUC of 0.50 corresponds to random guessing, so an AUC at or below 0.50 reflects a poor model, and a higher AUC represents a better classifier. Moreover, when comparing models, the best model is the one that yields the dominant ROC curve (largest AUC).
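The two evaluation steps can be sketched in Python as follows (the study used R; the function names `stratified_holdout` and `auc` and the toy labels are hypothetical). The AUC is computed here through its rank interpretation: the probability that a randomly chosen severe case receives a higher score than a randomly chosen non-severe case, with ties counted as one half.

```python
import random

def stratified_holdout(y, train_frac=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return train_idx, test_idx

def auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, t in zip(scores, y_true) if t == positive]
    neg = [s for s, t in zip(scores, y_true) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels: 20 severe (1) and 80 non-severe (0) cases.
y = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_holdout(y)
print(len(train_idx), len(test_idx))   # 70 30
print(sum(y[i] for i in test_idx))     # 6

# Three of the four (severe, non-severe) pairs are ranked correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Splitting each class separately is what guarantees that the rare severe class appears in both sets in the same proportion as in the full data.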
The following sections briefly describe the two classifiers used to develop the models.

Decision Tree
Decision tree methodology is a commonly used nonparametric data-mining method. It classifies instances by sorting them based on attribute values. Each node in a decision tree represents a feature in an instance to be classified and is a test on an attribute. Each branch represents a value that the node can assume and is an outcome of the test. Finally, a leaf node represents a class label. The feature that best divides the training data would be the root node of the tree. At each node, one attribute is chosen to split training examples into disjoint classes as much as possible. This procedure is repeated on each partition of the divided data, resulting in subtrees until the training data is divided into subsets of the same class. Thus, a decision tree classifies an instance as belonging to a specific class by following a suitable path from the root to a leaf node, which represents a classification rule [67]. The advantages of using a decision tree model are threefold: it requires minimal knowledge of the underlying data relationships, provides useful information regarding the most important variables in the dataset that are placed as top nodes and is less sensitive to missing data and outliers [68].
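To make the choice of the splitting attribute concrete, the sketch below scores each candidate feature by the weighted Gini impurity of the partitions it produces and picks the lowest. The Gini criterion is an assumption for this example, since the text does not state which splitting criterion was used; the feature names and toy records are hypothetical, and the code is Python rather than the R used in the study.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(rows, labels, feature):
    """Weighted Gini impurity after partitioning on a categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in groups.values())

def best_feature(rows, labels):
    """Feature whose split leaves the purest partitions."""
    return min(rows[0].keys(), key=lambda f: split_impurity(rows, labels, f))

# Toy records with two hypothetical categorical features.
rows = [
    {"luminosity": "night", "weekday": "mon"},
    {"luminosity": "night", "weekday": "tue"},
    {"luminosity": "day", "weekday": "mon"},
    {"luminosity": "day", "weekday": "tue"},
]
labels = ["severe", "severe", "non-severe", "non-severe"]

print(split_impurity(rows, labels, "luminosity"))  # 0.0
print(split_impurity(rows, labels, "weekday"))     # 0.5
print(best_feature(rows, labels))                  # luminosity
```

Applying `best_feature` to the full training data gives the root node; repeating it on each resulting partition grows the subtrees.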

Logistic Regression
Logistic regression is a linear, parametric method for binary classification. It is used to explain the relationship between the dependent variable and the independent variables and has been the most commonly used statistical method for studying injury severity risk factors [13]. The linear predictor can vary without limit, but the logistic function constrains the predictions of the dependent variable to values between 0 and 1. The multiple binary logistic regression model [69] is given by:

$$\pi(x) = \frac{\exp(\beta^{\top} x)}{1 + \exp(\beta^{\top} x)}$$

where x is the vector of the explanatory variables, β is the vector of the coefficients of the model and π(x) is the probability of a severe VRU injury. The logistic regression model can be used for continuous and/or categorical explanatory variables as well as interaction terms to investigate potential combined effects of the explanatory variables. In fitting the data, logistic regression divides the feature space into two regions with a linear decision boundary. A single linear boundary can sometimes be limiting.
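A short Python sketch shows how the logistic function turns an unbounded linear predictor into a probability between 0 and 1. The coefficient values, the separate `intercept` argument and the function name `severe_probability` are hypothetical and are not estimates from the study.

```python
import math

def severe_probability(x, beta, intercept=0.0):
    """Probability of a severe injury under a logistic model.

    The linear predictor eta is unbounded; the logistic function maps it
    into the open interval (0, 1).
    """
    eta = intercept + sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))

beta = [1.2, -0.4]  # hypothetical coefficients for two explanatory variables

# A zero linear predictor lies exactly on the decision boundary.
print(severe_probability([0.0, 0.0], beta))  # 0.5

# Raising the first variable (positive coefficient) raises the probability.
p_low = severe_probability([0.0, 1.0], beta)
p_high = severe_probability([1.0, 1.0], beta)
print(p_low < p_high)  # True
```

The set of points where the linear predictor equals zero is a straight line (or hyperplane), which is the single linear boundary referred to above.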

Data Description and Case Studies
A crash dataset involving pedestrians and cyclists from three different cities of Portugal was originally acquired from the National Authority of Road Safety (ANSR). Crash registrations correspond to six years (from 2012 to 2017), which gave a total of 6876 observations. These crashes yielded 7155 injured VRUs, 86% corresponding to injured pedestrians and 14% to cyclists. The original dataset contains specific information about the number and severity of injuries, gender and age of the injured, temporal information (year, month, day and hour), location/position by address and geocode, road characteristics, weather conditions, and luminosity information.
The dataset covers three different cities located in the north, centre, and centre-south of Portugal, namely Aveiro, Porto and Lisbon. These case studies were chosen based on their differences in terms of land use, transport demand and demographic contexts, and also due to their relatively high share of walking and cycling modes, which vary between 19% and 22% for pedestrians and 0.2% and 3% for cyclists [70].
The previously described methodology was applied to three different datasets, one for each city. Afterward, the same process was applied to a fourth dataset containing all recorded samples from the three cities, which yielded an overall perspective.

Pre-Processing Data
The analysis focused on the injury severity level, which is subdivided into two classes: non-severe (light injuries) and severe (including serious injuries and fatalities). In order to have a representative sample with common characteristics, records with missing information or uninjured VRUs were removed from the dataset. This preliminary step eliminated 1.5% of the records. Hence, the dataset used in this study contains a total of 7048 injured VRUs, 6% being categorised into severe injuries or fatalities and 94% into the non-severe injury class. The crash dataset therefore has an imbalance ratio of approximately 1/16 between severe and non-severe injuries.
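The figures in this paragraph can be checked with a few lines of arithmetic. The Python snippet below is only a worked check; it takes the severe counts (373 pedestrians and 39 cyclists) from the overall row of Table 2, and the variable names are arbitrary.

```python
injured_total = 7155   # injured VRUs in the original records
kept = 7048            # injured VRUs after pre-processing
severe = 373 + 39      # pedestrians + cyclists, overall row of Table 2
non_severe = kept - severe

removed_share = (injured_total - kept) / injured_total
severe_share = severe / kept
ratio = non_severe / severe

print(f"{removed_share:.1%}")  # 1.5%
print(f"{severe_share:.1%}")   # 5.8%
print(f"1/{ratio:.0f}")        # 1/16
```

The severe share of 5.8% rounds to the 6% quoted above, and there are about 16 non-severe cases for every severe one.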

Results
The results are presented considering two different aims:
1. To evaluate the most efficient prediction model based on three resampling techniques (undersampling, oversampling and ROSE);
2. To explore and compare the results of two supervised classification techniques in order to identify which variables can significantly affect pedestrian and cyclist injury severity when involved in a motor vehicle crash.
The three resampling techniques were applied to the datasets, resulting in six different datasets for each city and the overall perspective. Table 2 shows an overview of the dataset modifications and their distribution amongst the severity classes.
The two classifiers described in Section 2.2 were used to develop injury severity prediction models. Therefore, two models were developed for each dataset considering the different case studies (Aveiro, Porto, Lisbon and overall). For the developed models, the datasets were divided into training (70%) and test (30%) sets, as described in the methodology. A total of 64 models were developed. The performance of the models was examined based on the area under the ROC curve (AUC). Table 3 shows the AUC results for the developed models.
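One way to arrive at the total of 64 models is to enumerate every combination of case study, VRU type, dataset version and classifier. The Python enumeration below assumes that the original dataset and its three resampled versions are each modelled for pedestrians and cyclists separately, which is consistent with the layout of Table 2; the list names are arbitrary.

```python
from itertools import product

cases = ["Aveiro", "Porto", "Lisbon", "Overall"]
vru_types = ["pedestrians", "cyclists"]
datasets = ["original", "undersampling", "oversampling", "ROSE"]
classifiers = ["decision tree", "logistic regression"]

# One model per (case, VRU type, dataset version, classifier) combination.
models = list(product(cases, vru_types, datasets, classifiers))
print(len(models))  # 64
```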
In general, results showed that applying resampling methods to a class-imbalanced dataset tends to improve the power of the classifiers to discriminate between severe and non-severe VRU injuries when involved in a motor vehicle crash. Improvement in the classification power can be verified for oversampling techniques for both classifier models; however, this is not always the case regarding the undersampling technique and ROSE.
The best results (highlighted in Table 3) revealed that oversampling is the best resampling technique for Aveiro, independent of the classifier used and the VRU under study. Regarding Porto, for the pedestrian database, the oversampling technique was revealed to improve the classification power of a decision tree and ROSE yielded the best performance for logistic regression. On the other hand, the cyclist database of Porto revealed ROSE as the best technique when a decision tree is applied and oversampling when the logistic regression is applied. Regarding the Lisbon and overall cases, considering the pedestrian databases, oversampling is the best technique, except in the case of the Lisbon pedestrian database, where the decision tree classifier is applied. Considering the cyclist database for Lisbon, oversampling is the best resampling technique for both classifiers.
Table 2 (Lisbon and overall rows). Number of instances in each dataset by severity class. P = pedestrians; C = cyclists.

Case     Dataset        Total (P)  Total (C)  Non-severe (P)  Non-severe (C)  Severe (P)  Severe (C)
Lisbon   Original            3990        476            3713             456         277          20
Lisbon   Undersampling        554         40             277              20         277          20
Lisbon   Oversampling        7426        912            3713             456        3713         456
Lisbon   ROSE                3990        476            2060             257        1930         219
Overall  Original            6088        960            5715             921         373          39
Overall  Undersampling        746         78             373              39         373          39
Overall  Oversampling       11430       1842            5715             921        5715         921
Overall  ROSE                6088        960            3085             497        3003         463
The three resampling techniques were applied to the datasets, resulting in six different datasets ch city and the overall perspective. Table 2 shows an overview of the dataset modifications and distribution amongst the severity classes. The two classifiers described in Section 2.2 were used to develop injury severity prediction ls. Therefore, two models were developed for each dataset considering the different case studies iro, Porto, Lisbon and overall). For the developed models, the datasets were divided into training test sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were loped. The performance of the models was examined based on the area under the ROC curve ). Table 3 shows the AUC results for the developed models. In general, results showed that applying resampling methods in a class-imbalanced dataset to improve the classification power of the classifiers to discriminate between severe and none VRU injuries when involved in a motor vehicle crash. Improvement in the classification power e verified for oversampling techniques for both classifier models; however, this is not always ase regarding the undersampling technique and ROSE. The best results (highlighted in 2. To explore and compare the results of two supervised classification techniques in order to identify which variables can significantly affect pedestrian and cyclist injury severity when involved in a motor vehicle crash. The three resampling techniques were applied to the datasets, resulting in six different datasets for each city and the overall perspective. Table 2 shows an overview of the dataset modifications and their distribution amongst the severity classes. The two classifiers described in Section 2.2 were used to develop injury severity prediction models. Therefore, two models were developed for each dataset considering the different case studies (Aveiro, Porto, Lisbon and overall). 
For the developed models, the datasets were divided into training and test sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were developed. The performance of the models was examined based on the area under the ROC curve (AUC). Table 3 shows the AUC results for the developed models.
In general, results showed that applying resampling methods in a class-imbalanced dataset tends to improve the classification power of the classifiers to discriminate between severe and nonsevere VRU injuries when involved in a motor vehicle crash. Improvement in the classification power can be verified for oversampling techniques for both classifier models; however, this is not always the case regarding the undersampling technique and ROSE.
The best results (highlighted in Table 3) revealed that oversampling is the best resampling technique for Aveiro, independent of the classifier used and the VRU under study. Regarding Porto, for the pedestrians database, the oversampling technique was revealed to improve the classifier power of a decision tree and ROSE yielded the best performance for logistic regression. On the other hand, the cyclist database of Porto revealed ROSE as the best technique when a decision tree is applied and oversampling when the logistic regression is applied. Regarding the Lisbon and overall cases, considering the pedestrian databases, oversampling is the best technique, except in the case of the Lisbon pedestrian database, where the decision tree classifier is applied. Considering the cyclist database for Lisbon, oversampling is the best resampling technique for both classifiers. Besides, this To explore and compare the results of two supervised classification techniques in order to identify which variables can significantly affect pedestrian and cyclist injury severity when involved in a motor vehicle crash.
The three resampling techniques were applied to the datasets, resulting in six different datasets ch city and the overall perspective. Table 2 shows an overview of the dataset modifications and distribution amongst the severity classes. The two classifiers described in Section 2.2 were used to develop injury severity prediction ls. Therefore, two models were developed for each dataset considering the different case studies iro, Porto, Lisbon and overall). For the developed models, the datasets were divided into training test sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were loped. The performance of the models was examined based on the area under the ROC curve ). Table 3 shows the AUC results for the developed models. In general, results showed that applying resampling methods in a class-imbalanced dataset to improve the classification power of the classifiers to discriminate between severe and none VRU injuries when involved in a motor vehicle crash. Improvement in the classification power e verified for oversampling techniques for both classifier models; however, this is not always ase regarding the undersampling technique and ROSE.
The best results (highlighted in 2. To explore and compare the results of two supervised classification techniques in order to identify which variables can significantly affect pedestrian and cyclist injury severity when involved in a motor vehicle crash. The three resampling techniques were applied to the datasets, resulting in six different datasets for each city and the overall perspective. Table 2 shows an overview of the dataset modifications and their distribution amongst the severity classes. The two classifiers described in Section 2.2 were used to develop injury severity prediction models. Therefore, two models were developed for each dataset considering the different case studies (Aveiro, Porto, Lisbon and overall). For the developed models, the datasets were divided into training and test sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were developed. The performance of the models was examined based on the area under the ROC curve (AUC). Table 3 shows the AUC results for the developed models.
In general, results showed that applying resampling methods in a class-imbalanced dataset tends to improve the classification power of the classifiers to discriminate between severe and nonsevere VRU injuries when involved in a motor vehicle crash. Improvement in the classification power can be verified for oversampling techniques for both classifier models; however, this is not always the case regarding the undersampling technique and ROSE.
The best results (highlighted in Table 3) revealed that oversampling is the best resampling technique for Aveiro, independent of the classifier used and the VRU under study. Regarding Porto, for the pedestrians database, the oversampling technique was revealed to improve the classifier power of a decision tree and ROSE yielded the best performance for logistic regression. On the other hand, the cyclist database of Porto revealed ROSE as the best technique when a decision tree is applied and oversampling when the logistic regression is applied. Regarding the Lisbon and overall cases, considering the pedestrian databases, oversampling is the best technique, except in the case of the Lisbon pedestrian database, where the decision tree classifier is applied. Considering the cyclist database for Lisbon, oversampling is the best resampling technique for both classifiers. Besides, this To explore and compare the results of two supervised classification techniques in order to identify which variables can significantly affect pedestrian and cyclist injury severity when involved in a motor vehicle crash.
The three resampling techniques were applied to the datasets, resulting in six different datasets ch city and the overall perspective. Table 2 shows an overview of the dataset modifications and distribution amongst the severity classes. The two classifiers described in Section 2.2 were used to develop injury severity prediction ls. Therefore, two models were developed for each dataset considering the different case studies iro, Porto, Lisbon and overall). For the developed models, the datasets were divided into training test sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were loped. The performance of the models was examined based on the area under the ROC curve ). Table 3 shows the AUC results for the developed models. In general, results showed that applying resampling methods in a class-imbalanced dataset to improve the classification power of the classifiers to discriminate between severe and none VRU injuries when involved in a motor vehicle crash. Improvement in the classification power e verified for oversampling techniques for both classifier models; however, this is not always ase regarding the undersampling technique and ROSE.
The best results (highlighted in 2. To explore and compare the results of two supervised classification techniques in order to identify which variables can significantly affect pedestrian and cyclist injury severity when involved in a motor vehicle crash. The three resampling techniques were applied to the datasets, resulting in six different datasets for each city and the overall perspective. Table 2 shows an overview of the dataset modifications and their distribution amongst the severity classes. The two classifiers described in Section 2.2 were used to develop injury severity prediction models. Therefore, two models were developed for each dataset considering the different case studies (Aveiro, Porto, Lisbon and overall). For the developed models, the datasets were divided into training and test sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were developed. The performance of the models was examined based on the area under the ROC curve (AUC). Table 3 shows the AUC results for the developed models.
In general, results showed that applying resampling methods in a class-imbalanced dataset tends to improve the classification power of the classifiers to discriminate between severe and nonsevere VRU injuries when involved in a motor vehicle crash. Improvement in the classification power can be verified for oversampling techniques for both classifier models; however, this is not always the case regarding the undersampling technique and ROSE.
The best results (highlighted in Table 3) revealed that oversampling is the best resampling technique for Aveiro, independent of the classifier used and the VRU under study. Regarding Porto, for the pedestrians database, the oversampling technique was revealed to improve the classifier power of a decision tree and ROSE yielded the best performance for logistic regression. On the other hand, the cyclist database of Porto revealed ROSE as the best technique when a decision tree is applied and oversampling when the logistic regression is applied. Regarding the Lisbon and overall cases, considering the pedestrian databases, oversampling is the best technique, except in the case of the Lisbon pedestrian database, where the decision tree classifier is applied. Considering the cyclist database for Lisbon, oversampling is the best resampling technique for both classifiers. Besides, this To explore and compare the results of two supervised classification techniques in order to identify which variables can significantly affect pedestrian and cyclist injury severity when involved in a motor vehicle crash.
The three resampling techniques were applied to the datasets, resulting in six different datasets ch city and the overall perspective. Table 2 shows an overview of the dataset modifications and distribution amongst the severity classes. The two classifiers described in Section 2.2 were used to develop injury severity prediction ls. Therefore, two models were developed for each dataset considering the different case studies iro, Porto, Lisbon and overall). For the developed models, the datasets were divided into training test sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were loped. The performance of the models was examined based on the area under the ROC curve ). Table 3 shows the AUC results for the developed models. In general, results showed that applying resampling methods in a class-imbalanced dataset to improve the classification power of the classifiers to discriminate between severe and none VRU injuries when involved in a motor vehicle crash. Improvement in the classification power e verified for oversampling techniques for both classifier models; however, this is not always ase regarding the undersampling technique and ROSE. The best results (highlighted in 2. To explore and compare the results of two supervised classification techniques in order to identify which variables can significantly affect pedestrian and cyclist injury severity when involved in a motor vehicle crash. The three resampling techniques were applied to the datasets, resulting in six different datasets for each city and the overall perspective. Table 2 shows an overview of the dataset modifications and their distribution amongst the severity classes. The two classifiers described in Section 2.2 were used to develop injury severity prediction models. Therefore, two models were developed for each dataset considering the different case studies (Aveiro, Porto, Lisbon and overall). 
For the developed models, the datasets were divided into training and test sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were developed. The performance of the models was examined based on the area under the ROC curve (AUC). Table 3 shows the AUC results for the developed models.
In general, results showed that applying resampling methods in a class-imbalanced dataset tends to improve the classification power of the classifiers to discriminate between severe and nonsevere VRU injuries when involved in a motor vehicle crash. Improvement in the classification power can be verified for oversampling techniques for both classifier models; however, this is not always the case regarding the undersampling technique and ROSE.
The best results (highlighted in Table 3) revealed that oversampling is the best resampling technique for Aveiro, independent of the classifier used and the VRU under study. Regarding Porto, for the pedestrians database, the oversampling technique was revealed to improve the classifier power of a decision tree and ROSE yielded the best performance for logistic regression. On the other hand, the cyclist database of Porto revealed ROSE as the best technique when a decision tree is applied and oversampling when the logistic regression is applied. Regarding the Lisbon and overall To explore and compare the results of two supervised classification techniques in order to identify which variables can significantly affect pedestrian and cyclist injury severity when involved in a motor vehicle crash.
The three resampling techniques were applied to the datasets, resulting in six different datasets ch city and the overall perspective. Table 2 shows an overview of the dataset modifications and distribution amongst the severity classes. The two classifiers described in Section 2.2 were used to develop injury severity prediction ls. Therefore, two models were developed for each dataset considering the different case studies iro, Porto, Lisbon and overall). For the developed models, the datasets were divided into training test sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were loped. The performance of the models was examined based on the area under the ROC curve ). Table 3 shows the AUC results for the developed models. In general, results showed that applying resampling methods in a class-imbalanced dataset to improve the classification power of the classifiers to discriminate between severe and none VRU injuries when involved in a motor vehicle crash. Improvement in the classification power e verified for oversampling techniques for both classifier models; however, this is not always ase regarding the undersampling technique and ROSE. The best results (highlighted in 2. To explore and compare the results of two supervised classification techniques in order to identify which variables can significantly affect pedestrian and cyclist injury severity when involved in a motor vehicle crash. The three resampling techniques were applied to the datasets, resulting in six different datasets for each city and the overall perspective. Table 2 shows an overview of the dataset modifications and their distribution amongst the severity classes. The two classifiers described in Section 2.2 were used to develop injury severity prediction models. Therefore, two models were developed for each dataset considering the different case studies (Aveiro, Porto, Lisbon and overall). 
For the developed models, the datasets were divided into training and test sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were developed. The performance of the models was examined based on the area under the ROC curve (AUC). Table 3 shows the AUC results for the developed models.
In general, results showed that applying resampling methods in a class-imbalanced dataset tends to improve the classification power of the classifiers to discriminate between severe and nonsevere VRU injuries when involved in a motor vehicle crash. Improvement in the classification power can be verified for oversampling techniques for both classifier models; however, this is not always the case regarding the undersampling technique and ROSE.
The best results (highlighted in To explore and compare the results of two supervised classification techniques in order to identify which variables can significantly affect pedestrian and cyclist injury severity when involved in a motor vehicle crash.
The three resampling techniques were applied to the datasets, resulting in six different datasets ch city and the overall perspective. Table 2 shows an overview of the dataset modifications and distribution amongst the severity classes. The two classifiers described in Section 2.2 were used to develop injury severity prediction ls. Therefore, two models were developed for each dataset considering the different case studies iro, Porto, Lisbon and overall). For the developed models, the datasets were divided into training test sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were loped. The performance of the models was examined based on the area under the ROC curve ). Table 3 shows the AUC results for the developed models. In general, results showed that applying resampling methods in a class-imbalanced dataset to improve the classification power of the classifiers to discriminate between severe and none VRU injuries when involved in a motor vehicle crash. Improvement in the classification power e verified for oversampling techniques for both classifier models; however, this is not always ase regarding the undersampling technique and ROSE. The best results (highlighted in 2. To explore and compare the results of two supervised classification techniques in order to identify which variables can significantly affect pedestrian and cyclist injury severity when involved in a motor vehicle crash. The three resampling techniques were applied to the datasets, resulting in six different datasets for each city and the overall perspective. Table 2 shows an overview of the dataset modifications and their distribution amongst the severity classes. The two classifiers described in Section 2.2 were used to develop injury severity prediction models. Therefore, two models were developed for each dataset considering the different case studies (Aveiro, Porto, Lisbon and overall). 
For the developed models, the datasets were divided into training and test sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were developed. The performance of the models was examined based on the area under the ROC curve (AUC). Table 3 shows the AUC results for the developed models.
In general, results showed that applying resampling methods in a class-imbalanced dataset tends to improve the classification power of the classifiers to discriminate between severe and nonsevere VRU injuries when involved in a motor vehicle crash. Improvement in the classification power can be verified for oversampling techniques for both classifier models; however, this is not always the case regarding the undersampling technique and ROSE. To explore and compare the results of two supervised classification techniques in order to identify which variables can significantly affect pedestrian and cyclist injury severity when involved in a motor vehicle crash.
The three resampling techniques were applied to the datasets, resulting in six different datasets for each city and the overall perspective. Table 2 shows an overview of the dataset modifications and their distribution amongst the severity classes.
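The class counts produced by random undersampling and random oversampling can be illustrated with a minimal sketch. It uses only the Python standard library, and the function names (`undersample`, `oversample`, `severity`) are hypothetical, not the tooling used in the study. The example counts assume that the Lisbon pedestrian row of Table 2 reads as 3713 non-severe and 277 severe records. ROSE is not sketched here, since it generates synthetic records rather than resampling existing ones.

```python
import random
from collections import Counter


def _group_by_class(rows, label_of):
    """Group rows by their class label."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    return by_class


def undersample(rows, label_of, seed=0):
    """Randomly drop rows so every class shrinks to the minority-class size."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, label_of)
    n_min = min(len(members) for members in by_class.values())
    out = []
    for members in by_class.values():
        out.extend(rng.sample(members, n_min))
    return out


def oversample(rows, label_of, seed=0):
    """Randomly duplicate rows (with replacement) so every class grows to the majority-class size."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, label_of)
    n_max = max(len(members) for members in by_class.values())
    out = []
    for members in by_class.values():
        out.extend(members)
        out.extend(rng.choices(members, k=n_max - len(members)))
    return out


def severity(row):
    """Return the class label of a (record_id, label) tuple."""
    return row[1]


# Placeholder records carrying only an id and a severity label.
rows = [(i, "non-severe") for i in range(3713)]
rows += [(i, "severe") for i in range(3713, 3990)]

under = undersample(rows, severity)
over = oversample(rows, severity)
print(Counter(severity(r) for r in under), len(under))
print(Counter(severity(r) for r in over), len(over))
```

Under these assumptions the undersampled set holds 277 records per class (554 in total) and the oversampled set holds 3713 per class (7426 in total), which agrees with the corresponding figures in Table 2.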
The two classifiers described in Section 2.2 were used to develop injury severity prediction models. Therefore, two models were developed for each dataset considering the different case studies (Aveiro, Porto, Lisbon and overall). For the developed models, the datasets were divided into training and test sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were developed (four case studies, two VRU types, four dataset versions, namely the original and the three resampled ones, and two classifiers). The performance of the models was examined based on the area under the ROC curve (AUC). Table 3 shows the AUC results for the developed models.
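As an illustration of the evaluation step, the sketch below splits a dataset and computes the AUC from predicted scores. It is a standard-library sketch, not the implementation used in the study. The helper names are hypothetical, the toy labels and scores are invented for the example, and reading the "2/3 ratio" as two thirds for training is an assumption.

```python
import random


def split_train_test(rows, train_fraction=2 / 3, seed=0):
    """Shuffle rows and split them; the 2/3 training share is an assumed reading of the ratio."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def roc_auc(y_true, scores):
    """AUC as the share of (severe, non-severe) pairs ranked correctly; ties count as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented toy data: 1 = severe, 0 = non-severe, with made-up classifier scores.
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.1]
auc = roc_auc(y_true, scores)

train, test = split_train_test(list(range(9)))
print(auc, len(train), len(test))
```

An AUC of 0.5 corresponds to a classifier that ranks severe and non-severe cases no better than chance, and 1.0 to a perfect ranking. Because the measure depends only on ranking, it is not dominated by the majority class, which suits imbalanced data.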
In general, the results showed that applying resampling methods to a class-imbalanced dataset tends to improve the power of the classifiers to discriminate between severe and non-severe VRU injuries in motor vehicle crashes. This improvement can be verified for oversampling with both classifier models; however, it is not always the case for undersampling and ROSE.
The best results (highlighted in Table 3) show that oversampling is the best resampling technique for Aveiro, regardless of the classifier used and the VRU under study. For the Porto pedestrian database, oversampling improved the classification power of the decision tree, whereas ROSE yielded the best performance for logistic regression. Conversely, for the Porto cyclist database, ROSE was the best technique for the decision tree and oversampling for logistic regression. For the pedestrian databases of Lisbon and the overall case, oversampling is the best technique, except when the decision tree classifier is applied to the Lisbon pedestrian database. For the Lisbon cyclist database, oversampling is the best resampling technique for both classifiers. Oversampling also performs best for the overall cyclist database, except with logistic regression, where undersampling is the best approach.
From Table 3, it can be seen that the decision tree predicted VRU injury severity more accurately for Aveiro and Porto, with a predictive ability (AUC) of 77% to 80% for pedestrians and 84% to 97% for cyclists. The same holds for the cyclist database in Lisbon and the overall perspective, where the decision tree presents a predictive ability between 89% and 93%. For the pedestrian database, logistic regression presents a better predictive ability, of 66% to 70%.
The comparison between the two classifiers (decision tree and logistic regression) shows that the decision tree presents the best performance results.
The risk variables that can affect VRU injury severity were identified. To accomplish this goal, decision tree and logistic regression models were developed. In particular, the decision tree has a subprocess with an attribute weighting scheme; this weight (from 0 to 100) indicates the importance of each attribute for the occurrence of severe injury. Table 4 presents the results of the decision tree model, considering the three variables with the highest importance scores (numbers shown in brackets) for each case study and database approach (original, undersampling, oversampling and ROSE). Some models present only one significant variable, since the weight of this variable is 100.
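To make the idea of a 0 to 100 attribute weight concrete, the sketch below scores each attribute by how much a single split on it reduces Gini impurity and rescales the scores so that they sum to 100. Both the impurity measure and the rescaling are assumptions chosen for illustration, since the exact weighting scheme is not specified here. The function names and the toy records are hypothetical and are not data from the study.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure or empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def attribute_weights(records, attributes, target):
    """Weight each attribute by its impurity reduction, rescaled to sum to 100 (assumed scheme)."""
    base = gini([r[target] for r in records])
    gains = {}
    for attr in attributes:
        groups = {}
        for r in records:
            groups.setdefault(r[attr], []).append(r[target])
        weighted = sum(len(g) / len(records) * gini(g) for g in groups.values())
        gains[attr] = max(base - weighted, 0.0)
    total = sum(gains.values())
    if total == 0:
        return {attr: 0.0 for attr in attributes}
    return {attr: 100.0 * gain / total for attr, gain in gains.items()}


# Invented toy records: road markings separate the classes perfectly, gender not at all.
records = [
    {"road_markings": "absent", "gender": "F", "severity": "severe"},
    {"road_markings": "absent", "gender": "M", "severity": "severe"},
    {"road_markings": "present", "gender": "F", "severity": "non-severe"},
    {"road_markings": "present", "gender": "M", "severity": "non-severe"},
]

weights = attribute_weights(records, ["road_markings", "gender"], "severity")
top = sorted(weights.items(), key=lambda item: item[1], reverse=True)
print(top)
```

In this toy case a single attribute receives the full weight of 100 and the other receives 0, which mirrors the situation described above in which a model presents only one significant variable.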
The significant variables presented mainly reflect the results of the best-performing classifier. Regarding Aveiro, the significant variables with the highest weight under oversampling (the best classifier performance) are road conditions and road markings. The profile of cyclists (especially age) and temporal variables (such as month, weekday and time) play an important role in identifying severe injuries of cyclists. Regarding the Porto case study, road markings, age and month are the most significant variables for pedestrians' severe injury, whereas the most significant variables for cyclists are age, gender and month. For the Lisbon case study, road markings are the most important variable concerning pedestrians' severe injury, and age and month are also significant. Regarding cyclists, temporal variables (month, weekday and time) also play an important role. Lastly, the overall dataset results clearly show that road markings, gender and age are the main risk factors affecting the injury severity of a pedestrian involved in a motor vehicle crash, while for cyclists, age, road conditions and luminosity are the important variables for predicting injury severity.
Table 2. Overview of the dataset modifications and their distribution amongst the severity classes.

| Case study | Dataset | Total (ped.) | Total (cyc.) | Non-severe (ped.) | Non-severe (cyc.) | Severe (ped.) | Severe (cyc.) |
|---|---|---|---|---|---|---|---|
| [label missing] | Original | 1849 | 226 | 1780 | 224 | 69 | 2 |
| [label missing] | Undersampling | 138 | 4 | 69 | 2 | 69 | 2 |
| [label missing] | Oversampling | 3560 | 448 | 1780 | 224 | 1780 | 224 |
| [label missing] | ROSE | 1849 | 226 | 959 | 110 | 890 | 116 |
| Lisbon | Original | 3990 | 476 | 3713 | 456 | 277 | 20 |
| Lisbon | Undersampling | 554 | 40 | 277 | 20 | 277 | 20 |
| Lisbon | Oversampling | 7426 | 912 | 3713 | 456 | 3713 | 456 |
| Lisbon | ROSE | 3990 | 476 | 2060 | 257 | 1930 | 219 |
| Overall | Original | 6088 | 960 | 5715 | 921 | 373 | 39 |
| Overall | Undersampling | 746 | 78 | 373 | 39 | 373 | 39 |
| Overall | Oversampling | 11430 | 1842 | 5715 | 921 | 5715 | 921 |
| Overall | ROSE | 6088 | 960 | 3085 | 497 | 3003 | 463 |

Key: ped. represents pedestrians; cyc. represents cyclists.
The best results (highlighted in Table 3) revealed that oversampling is the best resampling technique for Aveiro, independent of the classifier used and the VRU under study. Regarding Porto, for the pedestrians database, the oversampling technique was revealed to improve the classifier power of a decision tree and ROSE yielded the best performance for logistic regression. On the other hand, the cyclist database of Porto revealed ROSE as the best technique when a decision tree is applied and oversampling when the logistic regression is applied. Regarding the Lisbon and overall cases, considering the pedestrian databases, oversampling is the best technique, except in the case of the Lisbon pedestrian database, where the decision tree classifier is applied. Considering the cyclist database for Lisbon, oversampling is the best resampling technique for both classifiers. Besides, this Month (51) Luminosity (15) Time (10) Age (36) Month (24) Road Markings (17) Road Markings (25) Age (24) Month (14) Month (33) Luminosity (19) Road Markings (19)  e two classifiers described in Section 2.2 were used to develop injury severity prediction . Therefore, two models were developed for each dataset considering the different case studies , Porto, Lisbon and overall). For the developed models, the datasets were divided into training t sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were ed. The performance of the models was examined based on the area under the ROC curve Table 3 shows the AUC results for the developed models. general, results showed that applying resampling methods in a class-imbalanced dataset o improve the classification power of the classifiers to discriminate between severe and non-VRU injuries when involved in a motor vehicle crash. Improvement in the classification power verified for oversampling techniques for both classifier models; however, this is not always regarding the undersampling technique and ROSE. 
e best results (highlighted in  (8) Road Conditions (30) Month (22) Age (17) Oversampling The three resampling techniques were applied to the datasets, resulting in six different datasets for each city and the overall perspective. Table 2 shows an overview of the dataset modifications and their distribution amongst the severity classes. The two classifiers described in Section 2.2 were used to develop injury severity prediction models. Therefore, two models were developed for each dataset considering the different case studies (Aveiro, Porto, Lisbon and overall). For the developed models, the datasets were divided into training and test sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were developed. The performance of the models was examined based on the area under the ROC curve (AUC). Table 3 shows the AUC results for the developed models.
In general, results showed that applying resampling methods in a class-imbalanced dataset tends to improve the classification power of the classifiers to discriminate between severe and nonsevere VRU injuries when involved in a motor vehicle crash. Improvement in the classification power can be verified for oversampling techniques for both classifier models; however, this is not always the case regarding the undersampling technique and ROSE.
The best results (highlighted in Table 3) revealed that oversampling is the best resampling technique for Aveiro, independent of the classifier used and the VRU under study. Regarding Porto, for the pedestrians database, the oversampling technique was revealed to improve the classifier power of a decision tree and ROSE yielded the best performance for logistic regression. On the other hand, the cyclist database of Porto revealed ROSE as the best technique when a decision tree is applied and oversampling when the logistic regression is applied. Regarding the Lisbon and overall cases, considering the pedestrian databases, oversampling is the best technique, except in the case of the Lisbon pedestrian database, where the decision tree classifier is applied. Considering the cyclist database for Lisbon, oversampling is the best resampling technique for both classifiers. Besides, this Road Conditions (19) Month (17) Age (14) Month (18) Road Markings (14) Age (14) Road Markings (42) Time (21) Age (19) Age (43) Gender (30) Road Markings (20) explore and compare the results of two supervised classification techniques in order to entify which variables can significantly affect pedestrian and cyclist injury severity when olved in a motor vehicle crash. e three resampling techniques were applied to the datasets, resulting in six different datasets city and the overall perspective. Table 2 shows an overview of the dataset modifications and stribution amongst the severity classes.  Original  249  258  222  241  27  17  Undersampling  54  34  27  17  27  17  Oversampling  444  482  222  241  222  241  ROSE  249  258  126  131 123 127 e two classifiers described in Section 2.2 were used to develop injury severity prediction . Therefore, two models were developed for each dataset considering the different case studies , Porto, Lisbon and overall). For the developed models, the datasets were divided into training t sets considering a 2/3 ratio, as described in the methodology. 
A total of 64 models were ed. The performance of the models was examined based on the area under the ROC curve Table 3 shows the AUC results for the developed models. general, results showed that applying resampling methods in a class-imbalanced dataset o improve the classification power of the classifiers to discriminate between severe and non-VRU injuries when involved in a motor vehicle crash. Improvement in the classification power verified for oversampling techniques for both classifier models; however, this is not always regarding the undersampling technique and ROSE. e best results (highlighted in Table 3) revealed that oversampling is the best resampling ue for Aveiro, independent of the classifier used and the VRU under study. Regarding Porto, pedestrians database, the oversampling technique was revealed to improve the classifier of a decision tree and ROSE yielded the best performance for logistic regression. On the other he cyclist database of Porto revealed ROSE as the best technique when a decision tree is and oversampling when the logistic regression is applied. Regarding the Lisbon and overall onsidering the pedestrian databases, oversampling is the best technique, except in the case of on pedestrian database, where the decision tree classifier is applied. Considering the cyclist e for Lisbon, oversampling is the best resampling technique for both classifiers. Besides, this Weekday (18) Month (17) Age (14) Time (14) Month (61) Time (17) Luminosity (9) Weekday (22) Time (22) Month (15) Age (21) Month (20)  2. To explore and compare the results of two supervised classification techniques in order to identify which variables can significantly affect pedestrian and cyclist injury severity when involved in a motor vehicle crash.
The three resampling techniques were applied to the datasets, resulting in six different datasets for each city and the overall perspective. Table 2 shows an overview of the dataset modifications and their distribution amongst the severity classes.
The two classifiers described in Section 2.2 were used to develop injury severity prediction models. Therefore, two models were developed for each dataset considering the different case studies (Aveiro, Porto, Lisbon and overall). For the developed models, the datasets were divided into training and test sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were developed. The performance of the models was examined based on the area under the ROC curve (AUC). Table 3 shows the AUC results for the developed models.
In general, results showed that applying resampling methods in a class-imbalanced dataset tends to improve the classification power of the classifiers to discriminate between severe and nonsevere VRU injuries when involved in a motor vehicle crash. Improvement in the classification power can be verified for oversampling techniques for both classifier models; however, this is not always the case regarding the undersampling technique and ROSE.
The best results (highlighted in Table 3) revealed that oversampling is the best resampling technique for Aveiro, independent of the classifier used and the VRU under study. Regarding Porto, for the pedestrians database, the oversampling technique was revealed to improve the classifier power of a decision tree and ROSE yielded the best performance for logistic regression. On the other hand, the cyclist database of Porto revealed ROSE as the best technique when a decision tree is applied and oversampling when the logistic regression is applied. Regarding the Lisbon and overall cases, considering the pedestrian databases, oversampling is the best technique, except in the case of the Lisbon pedestrian database, where the decision tree classifier is applied. Considering the cyclist database for Lisbon, oversampling is the best resampling technique for both classifiers. Besides, this Gender (41) Road Markings (16) Time (11) Road Markings (23) Gender (20) Age (18) Road Markings (42) Luminosity (36) Age (11) Gender (28) Road Markings (25) Age (20) FOR PEER REVIEW 8 of 17 explore and compare the results of two supervised classification techniques in order to entify which variables can significantly affect pedestrian and cyclist injury severity when olved in a motor vehicle crash.
e three resampling techniques were applied to the datasets, resulting in six different datasets city and the overall perspective. Table 2 shows an overview of the dataset modifications and stribution amongst the severity classes.  represents cyclists. e two classifiers described in Section 2.2 were used to develop injury severity prediction . Therefore, two models were developed for each dataset considering the different case studies , Porto, Lisbon and overall). For the developed models, the datasets were divided into training t sets considering a 2/3 ratio, as described in the methodology. A total of 64 models were ed. The performance of the models was examined based on the area under the ROC curve Table 3 shows the AUC results for the developed models.
general, results showed that applying resampling methods in a class-imbalanced dataset o improve the classification power of the classifiers to discriminate between severe and non-VRU injuries when involved in a motor vehicle crash. Improvement in the classification power verified for oversampling techniques for both classifier models; however, this is not always regarding the undersampling technique and ROSE.
[Table residue, row and column labels lost in extraction: (16), Month (30), Gender (18), Time (17), Luminosity (24), Weather (19), Weekday (13), Time (13), Road Conditions (19), Age (18), Luminosity (16)]
Considering the analysis for pedestrians (Table 5), oversampling is the best resampling technique for the logistic regression model for Aveiro, Lisbon and the overall dataset. Only Porto presented ROSE as the best resampling approach for this classifier. Based on this, we can conclude that weather, luminosity and road conditions are statistically significant variables to predict severe injuries of pedestrians for Aveiro. In the Porto case study, gender and age as well as time, luminosity, road conditions and road markings are the risk factors to predict the severity of these VRU injuries. For Lisbon, considering the oversampling approach, we can conclude that VRU profile (gender and age), month, time, weather, luminosity and road markings are the risk factors that contribute to predicting the injury severity of a pedestrian. For the overall case, considering oversampling as the best approach, only weekday does not appear to be a risk factor in predicting pedestrian injury severity.
Similarly, concerning cyclists, oversampling was the best approach when a logistic regression classifier is applied to the database, except for the overall perspective, where undersampling was shown to be the best approach. Based on this, for the Aveiro case study, only month and weather are variables that are not statistically significant in predicting injury severity of cyclists. Considering the Porto case study, month, weekday and road markings are the risk factors to predict the injury severity for these VRUs. Regarding Lisbon, the variables that were shown to be risk factors are month, weather and luminosity.
For the overall case, considering that undersampling is the best approach, we conclude that any variable seems to be a risk factor that contributes to predicting the injury severity of a cyclist; see Table 6.
Comparing the results obtained from the two models (decision tree and logistic regression), variables such as gender, luminosity and road conditions are considered statistically significant for more datasets when logistic regression is applied to the pedestrian datasets. On the other hand, gender and month are more representative variables considering the decision tree models when applied to the cyclist datasets.
Results highlight that an overall perspective is not always the best approach, for two main reasons: first, the seriousness of the road crash can be affected by the specificities of each city; second, the overall perspective results can be biased by the most prominent database (in this case, the Lisbon database).

Discussion
In this paper, the performance of two classifiers (decision tree and logistic regression) in identifying risk factors that can affect the severity of a VRU's injury was investigated. To deal with the imbalanced data problem, three resampling techniques were applied: undersampling, oversampling and ROSE. The effectiveness of each resampling method applied to each of the two classifiers was evaluated using AUC as the performance metric.
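As an illustration of the performance metric, the AUC can be read as the probability that a randomly chosen severe case receives a higher predicted score than a randomly chosen non-severe case, with ties counted as one half. The following is a minimal sketch of that definition, not the implementation used in the study; the function name `auc_from_scores` and the toy labels and scores are hypothetical.

```python
def auc_from_scores(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0 (non-severe) / 1 (severe)
    scores: predicted score for the severe class, same order as labels
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # severe case ranked above non-severe case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the crash database.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4]
toy_auc = auc_from_scores(toy_labels, toy_scores)
print(round(toy_auc, 3))
```

A classifier that assigns the same score to every record obtains an AUC of 0.5 under this definition, which is the usual reference level for a model with no discriminating power.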
The results showed that the performance of the classifier can be improved by processing the data with one of those resampling techniques. However, the small samples of VRU crash data may explain why the application of an undersampling technique does not improve the explanatory power of the studied classifiers for almost all the datasets, due to the loss of information associated with this resampling technique. This is particularly evident in the Porto case study, where the proportion of severe injuries presented the lowest values (4% for pedestrians and 1% for cyclists).
Regarding the overall performance of the two classifiers, the decision tree slightly outperformed the logistic regression. This can be explained by the fact that the correlation coefficients were not considered and all the variables were analysed. Nevertheless, it is well known that decision tree models are robust to outliers, so when a domain problem is given, the decision tree technique naturally captures the relationship between variables, leading to higher classification performance.
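The information loss mentioned above can be made concrete with the Porto pedestrian counts reported in Table 2 (1780 non-severe and 69 severe records). The sketch below assumes plain random undersampling (without replacement) and random oversampling (with replacement); the helper names are hypothetical, the study's own software is not specified here, and ROSE is not sketched.

```python
import random
from collections import Counter, defaultdict


def split_by_class(records, label_of):
    """Group records by their class label."""
    groups = defaultdict(list)
    for rec in records:
        groups[label_of(rec)].append(rec)
    return groups


def random_undersample(records, label_of, rng):
    """Keep a random subset of each class, sized to the smallest class."""
    groups = split_by_class(records, label_of)
    n_min = min(len(rows) for rows in groups.values())
    out = []
    for rows in groups.values():
        out.extend(rng.sample(rows, n_min))  # sampling without replacement
    return out


def random_oversample(records, label_of, rng):
    """Replicate records of smaller classes until every class matches the largest."""
    groups = split_by_class(records, label_of)
    n_max = max(len(rows) for rows in groups.values())
    out = []
    for rows in groups.values():
        out.extend(rows)
        out.extend(rng.choices(rows, k=n_max - len(rows)))  # with replacement
    return out


def get_label(rec):
    return rec[0]


# Placeholder records carrying only a class label and an index; the class
# sizes are the Porto pedestrian counts from Table 2.
porto_pedestrians = [("non-severe", i) for i in range(1780)] + \
                    [("severe", i) for i in range(69)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
under = random_undersample(porto_pedestrians, get_label, rng)
over = random_oversample(porto_pedestrians, get_label, rng)

severe_share = 100 * 69 / len(porto_pedestrians)
print(f"severe share: {severe_share:.1f}%")
print("undersampled:", Counter(get_label(r) for r in under))
print("oversampled:", Counter(get_label(r) for r in over))
```

Under these assumptions, undersampling retains 138 of the 1849 original records, whereas oversampling keeps every original record and grows the dataset to 3560, which matches the Porto pedestrian totals in Table 2.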
Considering the identification of risk factors that can affect the injury severity level of a VRU, having a joint overview of the two classifiers enables finding out some main results. Considering the pedestrian injury severity risk, we can conclude that:

• Gender and age factors seem to play an important role in this type of VRU;
• Road markings are a risk factor considering pedestrian injury severity, especially for bigger cities;
• The luminosity of the road seems to be more important than weather conditions.
On the other hand, considering cyclist databases, the main results allow us to conclude that cyclist age group and month are the main identified risk factors in predicting the injury severity of a cyclist. These results can be related to exposure values, namely the larger number of people of active cycling age and the fluctuation in the number of people cycling between summer and winter. Road conditions do not seem to affect the severity of either pedestrian or cyclist injuries in Lisbon, or of cyclist injuries in Porto.
Although these results are based on a crash database of three cities, the methodology and results can be generalized to small and medium-sized cities, since our results are similar to those reported in the literature review. For instance, age is important for the most severe outcomes involving pedestrians [19], and environmental factors, such as time and environmental conditions, are relevant categories to consider in motor vehicle-bicycle collisions [29,40]. Road conditions and surface markings, variables which are sometimes neglected in the literature, are essential factors to be taken into account.

Limitations and Future Research
Due to the complexity of reporting a road crash, our study presents some limitations: (1) more detailed information about vehicle characteristics and drivers is missing from our database; (2) unobserved heterogeneity of data was not considered in the analyses. Future research will try to address these issues. It would also be interesting to explore other methods to handle the class imbalance problem, to extend the analysis to a national database and to focus our research on some unobserved variables that may affect the prediction models.

Conclusions
In this paper, an approach to reveal the most significant risk factors that can affect VRU injury severity when involved in a motor vehicle crash is presented. Since prediction model performance can be biased when imbalanced data is used, three well-known resampling techniques were examined in an attempt to improve the model's classification performance. Two widely used supervised methods were applied: decision tree and logistic regression. The machine learning classifiers were able to correctly classify both the majority and the minority classes with relatively high accuracy. It is known that the performance of a resampling method depends on the classifier used, and no method always outperforms the others. Nevertheless, the decision tree proved to be the more accurate model for the crash severity data under evaluation.
Results showed that the oversampling technique (used to balance the dataset) always improves the effectiveness of both classifiers (decision tree and logistic regression) to identify risk factors.
The classifiers were applied to the original datasets and to the datasets generated by three different resampling techniques. Based on the attribute weighting scheme presented by decision tree models and on p-values < 0.1 for the logistic regression model, the risk variables that can significantly affect pedestrian and cyclist injury severity were identified. A joint analysis of the obtained results allows us to conclude that road markings, road conditions and luminosity significantly affect the severity of a pedestrian's injury when involved in a crash. On the other hand, age group and temporal variables (month, weekday and time period) are the risk factors that were revealed to be the most significant to predict the severity of a cyclist's injury when involved in a motor vehicle crash.
Furthermore, it should be emphasised that the identification of risk factors is relevant to the development of road safety measures that aim to reduce the injury severity of crashes between VRUs and motor vehicles, which is crucial information to help decision-makers in the definition of road safety policies and strategies.