Modeling Road Accident Severity with Comparisons of Logistic Regression, Decision Tree and Random Forest
Information 2020, 11, 270

Abstract: To reduce the damage caused by road accidents, researchers have applied different techniques to explore correlated factors and develop efficient prediction models. The main purpose of this study is to use one statistical and two nonparametric data mining techniques, namely, logistic regression (LR), classification and regression tree (CART), and random forest (RF), to compare their prediction capability, identify the significant variables (identified by LR) and important variables (identified by CART or RF) that are strongly correlated with road accident severity, and distinguish the variables that have significant positive influence on prediction performance. In this study, three prediction performance evaluation measures, accuracy, sensitivity and specificity, are used to find the best integrated method which consists of the most effective prediction model and the input variables that have higher positive influence on accuracy, sensitivity and specificity.


Introduction
Since road accidents occur frequently and result in property damage, injury, and even death, all of which impose a high cost on society, researchers have explored the correlated factors and built prediction models by utilizing different research techniques, so as to propose appropriate measures for prevention. The statistical model of logistic regression (LR) has been the most popular technique in accident severity research in the past, because the relationship between accidents and correlated factors can be clearly identified [1]. LR provides information on the parameter estimates and their standard errors, along with some notion of their significance and some interpretation of the model through odds ratios and their confidence intervals [2]. For instance, Kim et al. [3] developed a logistic model to explain the likelihood of motorists being at fault in collisions with cyclists. Al-Ghamdi [4] used logistic regression to estimate the relationship between correlated factors and accident severity. Besliu-Ionescu et al. [5] and Zhu et al. [6] discussed the prediction performance of logistic regression. However, the above-mentioned research lacks the discussion of how significant variables influence the prediction performance.
In recent years, there has been increasing interest in employing the nonparametric classification and regression tree (CART) technique to analyze transportation-related problems, for instance for modeling travel demand [7], driver behavior [8], and traffic accident analysis [9][10][11][12][13][14][15][16][17][18]. Furthermore, CART has been shown to be a powerful tool, especially in dealing with prediction and classification problems [19,20]. CART uses a nonparametric procedure to build a graphical model that provides information on parameter values, such as important variables, but provides no notion of their significance [2]. Harb et al. [21] explored the important factors associated with crash avoidance

Modeling Methods
In this study, accident severity analysis is performed by using IBM Modeler 18.0 software, which can run several models including LR, CART, and RF. In LR, the logit is the natural logarithm of the odds, or the likelihood ratio, that the dependent variable is 1 (serious accident) as opposed to 0 (minor accident). The probability p of a serious accident is given by
$$p = P(Y = 1 \mid X) = \frac{e^{\beta X}}{1 + e^{\beta X}}, \qquad \ln\left(\frac{p}{1 - p}\right) = \beta X,$$
where Y is the dependent variable (accident severity; Y = 1, if severity is serious; Y = 0, if severity is minor), β is a vector of parameters to be estimated, and X is a vector of independent variables [33]. The methodology of CART is outlined extensively by Breiman et al. [34], and the building of a CART model mainly consists of three steps: (1) tree growing; (2) tree pruning; and (3) selecting an optimal tree from the pruned trees [1]. The random forest algorithm was developed by Breiman [22]. If RF builds N trees (600 trees in this study), then each of the N iterations of the algorithm consists of four steps: (1) selection of training data using the bootstrap method; (2) growing the tree fully; (3) random attribute selection; and (4) overall prediction based on majority vote (classification) from all individually trained trees [35].
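The relationship between the logit and the probability of a serious accident can be illustrated with a minimal sketch. The coefficient values and the two indicator variables below are hypothetical and are not taken from the fitted model; an intercept, if present, can be treated as a coefficient on a constant input of 1.

```python
import math

def serious_probability(beta, x):
    """Return p = P(Y = 1 | x) for a logistic model with linear predictor beta . x."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients for two indicator variables (illustration only).
beta = [0.8, -1.2]
x = [1, 0]  # first factor present, second absent

p = serious_probability(beta, x)
# Applying the logit to p recovers the linear predictor (0.8 here).
linear_predictor = logit(p)
```

The two functions are inverses of each other, which is why a coefficient can be read as a change in log-odds.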
In each analysis employing LR, CART and RF, the accident data are randomly divided into two groups, a training data set and validation data set, with a specific ratio (70/30 in this study) [28]. The larger dataset (training dataset) is used for training the three models, while the smaller dataset (validation dataset) is used for model validation.
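A random 70/30 partition of this kind might be sketched as follows. The function name, the fixed seed, and the use of integer record IDs are illustrative assumptions; the study itself performs the split inside the modeling software. The record count corresponds to the balanced sample of 4736 serious and 4736 minor accidents described in the data preprocessing.

```python
import random

def split_train_validation(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into training and validation parts."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in record IDs for 4736 serious + 4736 minor accidents.
records = list(range(9472))
train, validation = split_train_validation(records)
```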
Based on the p-values of the t-tests, the significance of each variable is one of the outputs estimated by LR. The significance represents the degree of influence of each variable on accident severity. In other words, significant variables (p-value < 0.05 in this study) will influence accident severity in an obvious manner. A measure of variable importance given by CART can be obtained by observing the drop in the error rate when another variable is used instead of the primary split. In general, the more frequently a variable appears as a primary or surrogate split, the higher the importance score assigned [2]. Variable importance scores for RF can be computed by measuring the increase in prediction error if the values of a variable under question are permuted across the out-of-bag observations. This score is computed for each constituent tree, averaged across the entire ensemble, and divided by the standard deviation [36].
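The permutation idea behind the RF importance score can be sketched for a single, already trained classifier. This is a simplified illustration under stated assumptions: it permutes one column over a given set of rows and reports the increase in error rate, whereas the RF procedure described above does this per tree on the out-of-bag observations, averages over the ensemble, and divides by the standard deviation. The toy predictor and data are hypothetical.

```python
import random

def error_rate(predict, rows, labels):
    """Fraction of rows whose predicted class differs from the true label."""
    wrong = sum(1 for row, y in zip(rows, labels) if predict(row) != y)
    return wrong / len(rows)

def permutation_importance(predict, rows, labels, column, seed=0):
    """Increase in error rate after shuffling the values of one column."""
    rng = random.Random(seed)
    baseline = error_rate(predict, rows, labels)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    permuted = [row[:column] + [v] + row[column + 1:] for row, v in zip(rows, values)]
    return error_rate(predict, permuted, labels) - baseline

# Toy setting: the label equals column 0, and the predictor reads only column 0.
rows = [[0, 1], [1, 0], [0, 0], [1, 1], [1, 0], [0, 1]]
labels = [row[0] for row in rows]
predict = lambda row: row[0]

# Column 1 is never used by the predictor, so permuting it cannot change the error.
ignored = permutation_importance(predict, rows, labels, column=1)
```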

Evaluation of Prediction Performance
This study explores an integrated method for improving the prediction performance of models and reducing the cost of collecting data, by comparing the prediction performance of LR, CART and RF with different groups of factor variables as input. Three evaluation measures of prediction performance are used, namely, accuracy, sensitivity and specificity.
Accuracy measures the proportion of all cases in the validation subset, actual positives and actual negatives together, that are correctly identified. Sensitivity measures the proportion of actual positives that are correctly identified. Specificity measures the proportion of actual negatives that are correctly identified. Accuracy thus reflects the classification as a whole, sensitivity quantifies the avoidance of false negatives (positives wrongly classified as negatives), and specificity does the same for false positives (negatives wrongly classified as positives).
In this study, serious accidents are set as positives, and minor accidents are set as negatives. Mathematically, the accuracy, sensitivity and specificity of the test can respectively be written as
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP},$$
where TP (true positive) and TN (true negative) are the numbers of accidents that are correctly classified, and FP (false positive) and FN (false negative) are the numbers of accidents incorrectly classified [28][29][30][31].
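The three measures follow directly from the four confusion-matrix counts, as the short sketch below shows. The counts used are hypothetical and chosen only to make the arithmetic easy to follow; serious accidents are the positive class.

```python
def evaluate(tp, tn, fp, fn):
    """Compute accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,    # all correct classifications
        "sensitivity": tp / (tp + fn),    # serious accidents correctly identified
        "specificity": tn / (tn + fp),    # minor accidents correctly identified
    }

# Hypothetical validation result: 100 serious and 100 minor accidents.
metrics = evaluate(tp=75, tn=70, fp=30, fn=25)
```

With these counts the accuracy is 145/200, the sensitivity 75/100, and the specificity 70/100.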

Selecting Target Data
On Taiwan's highways, each road accident is recorded in detail by the police with digitized information. The recorded information includes accident severity (fatal, injury, and property damage only) and several correlated factors (individual and environmental attributes). For the purpose of exploring the correlated factors and building efficient prediction models, this study selects 18 target data items for each accident, namely severity and 17 correlated factors, from Taiwan's highway traffic accident investigation reports for the years 2015 to 2019 [33,37]. To ensure that the result of the LR model is not affected by multicollinearity of variables, the data of the 17 initial factors are checked for multicollinearity by the Spearman analysis method of SPSS software. The absolute value of the Spearman correlation coefficient is higher than 0.3 both between "major cause" and "collision type" and between "weather condition" and "surface condition"; thus, the variables in these two pairs are moderately or strongly correlated, respectively. Accordingly, "major cause" and "weather condition" are retained for analysis, "collision type" and "surface condition" are deleted, and severity together with the 15 remaining correlated factors is input into the following models for analysis. The multicollinearity metric is shown in Table 1.
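The screening step can be illustrated with a small, self-contained Spearman computation (Pearson correlation of tie-averaged ranks) and a 0.3 threshold on its absolute value. This is only a sketch: the study uses SPSS, and the integer category codes and the "light condition" column below are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; assumes neither input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    return pearson(average_ranks(a), average_ranks(b))

def correlated_pairs(columns, threshold=0.3):
    """Return (name, name, rho) for every pair with |rho| above the threshold."""
    names = list(columns)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            rho = spearman(columns[first], columns[second])
            if abs(rho) > threshold:
                flagged.append((first, second, rho))
    return flagged

# Hypothetical integer-coded categories for six accidents.
columns = {
    "weather condition": [1, 1, 2, 2, 3, 3],
    "surface condition": [1, 1, 2, 2, 3, 3],
    "light condition": [1, 2, 1, 2, 1, 2],
}
flagged = correlated_pairs(columns)
```

In this toy data only the weather and surface columns are flagged, so one member of that pair would be dropped before fitting LR.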

Preprocessing Data
Each variable in the above-mentioned target data is discrete (categorical) except for "driver age", which is therefore discretized into reasonable intervals. In the initial road accident records, the categories of road accident severity are fatality, injury and property damage, and a road accident may involve one or several of these categories. In this study, a road accident that includes a fatality is initially classified as a fatality accident; one that includes injury but no fatality is classified as an injury accident; and one that includes property damage but no fatality or injury is classified as property damage only. To overcome the small number of observations of fatal accidents, which may lead to unreasonable analytical results, fatality accidents (344 cases) and injury accidents (4392 cases) are grouped into "serious accidents" (4736 cases), while accidents with property damage only are categorized as "minor accidents" [37]. Then, to balance the numbers of accidents in the two severity classes, 4736 minor accidents are sampled at random to match the number of serious accidents. Therefore, in this study, road accident severity is categorized into two classes: "serious accidents" (fatality and injury) and "minor accidents" (property damage only). The categories of some other variables are likewise merged where appropriate.
After discretization, grouping and screening, the 16 variables (target data) are briefly described and preliminarily summarized in Table 2.
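The severity grouping and the class balancing described in this section could be sketched as below. The field layout (counts of fatalities and injuries per accident), the function names, and the toy records are assumptions made for illustration; only the grouping rule and the idea of sampling minor accidents to match the serious ones come from the text.

```python
import random

def severity_class(fatalities, injuries):
    """Fatality or injury accidents are 'serious'; property damage only is 'minor'."""
    if fatalities > 0 or injuries > 0:
        return "serious"
    return "minor"

def balance_by_downsampling(serious, minor, seed=0):
    """Randomly sample as many minor accidents as there are serious ones."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return serious, rng.sample(minor, len(serious))

# Hypothetical raw records: (accident id, fatalities, injuries).
raw = [(1, 1, 0), (2, 0, 2), (3, 0, 0), (4, 0, 0), (5, 0, 0), (6, 0, 1), (7, 0, 0)]
serious = [r for r in raw if severity_class(r[1], r[2]) == "serious"]
minor = [r for r in raw if severity_class(r[1], r[2]) == "minor"]
serious_kept, minor_kept = balance_by_downsampling(serious, minor)
```

Sampling without replacement requires at least as many minor as serious accidents, which holds both in this toy data and in the study.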

Results
After running several models of LR, CART, and RF by using IBM Modeler 18.0 software, the results of exploring significant variables identified by LR and important variables identified by CART or RF are illustrated in Section 3.1. In this study, the 15 original variables, the 10 to 4 most significant variables identified by LR, and the 10 to 4 most important variables identified by CART or by RF (22 sets of variables in total) are input into the LR, CART and RF models, respectively, for comparison. In Section 3.2, the accuracy, sensitivity and specificity obtained with each set of variables are summarized in order to compare the classification results of LR, CART, and RF. Section 3.3 presents the comparisons in terms of accuracy, sensitivity and specificity. Moreover, the odds ratios of LR with the 15 original variables are presented in Section 3.4. The odds ratio reveals the relative correlation of each value of a given variable with a specific road accident severity. The rules generated by CART with the 15 original variables are presented in Section 3.5. These rules indicate the relationships between variables and road accident severity.

Significant and Important Variables
After conducting LR, CART and RF with input of the 15 original factor variables using the whole dataset, ten significant variables are identified by LR and two groups of ten important variables are identified by CART and RF, respectively, as listed in Table 3.
With the 15 original variables as input (Table 4), the model developed by RF (73.38%) has higher accuracy than LR (73.07%) and CART (72.65%). In addition, the specificity of RF (72.95%) is higher than that of CART (72.43%) and LR (71.79%), while the sensitivity of LR (74.48%) is higher than that of RF (73.82%) and CART (72.87%). Furthermore, Table 5 and Figure 1 show the validation-set accuracy of the LR, CART and RF models when the 15 original variables are replaced by seven groups of significant variables identified by LR, keeping the 10 to 4 most significant variables.
As shown in Table 5 and Figure 1, when the number of input variables decreases from 15 to 10, the accuracy of LR and CART is unchanged and that of RF increases. Furthermore, when the number of input significant variables decreases from 10 to 5, the accuracy of RF is in each case higher than that with input of the 15 original variables; the accuracy (74.43%) of RF with input of 9 significant variables identified by LR is the highest accuracy in this study. This means that using only significant variables helps by omitting the noise variables: it retains the accuracy of the LR and CART models while improving the accuracy of RF. It also considerably reduces the cost of collecting accident data, since data on only ten or even five significant variables need to be collected. In addition, the accuracy of RF is always higher than that of LR and CART when inputting any group of significant variables.

Concerning sensitivity, Table 5 and Figure 2 show that the sensitivity of RF with input of only 5 to 10 significant variables identified by LR is higher than that with input of the 15 original variables; the sensitivity (75.6%) of RF with input of 9 significant variables identified by LR is the highest sensitivity in this study. In addition, the sensitivity of LR with input of only 9 or 10 significant variables identified by LR is also higher than that with input of the 15 original variables. Comparing the three models, the sensitivity of RF with input of nine, eight or six significant variables identified by LR is higher than that of LR and CART with input of the same variables, and the sensitivity of LR with input of ten, seven, five or four significant variables identified by LR is higher than that of RF and CART with input of the same variables.
As for specificity, Table 5 and Figure 3 show that the specificity of CART with input of only 4 to 10 significant variables identified by LR is higher than that with input of the 15 original variables, and the same holds for RF. Comparing the three models, the specificity of CART with input of eight, seven or four significant variables is higher than that of LR and RF with input of the same variables, and the specificity of RF with input of ten, nine, six or five significant variables is higher than that of LR and CART with input of the same variables.
The accuracy with input of important variables identified by CART is shown in Table 6 and Figure 4. Only the accuracy of RF increases slightly, while the accuracy of LR and CART decreases. Moreover, the accuracy of RF is higher than that of LR and CART when inputting any group of important variables identified by CART.
Concerning sensitivity, Table 6 and Figure 5 show that the sensitivity of LR, CART and RF with input of only 4 to 10 important variables identified by CART is lower than that with input of the 15 original variables, with the exception of RF with input of eight, seven, six or four important variables identified by CART. Comparing the three models, the sensitivity of LR with input of ten, nine or five important variables identified by CART is higher than that of CART and RF with input of the same variables, and the sensitivity of RF with input of eight, seven, six or four important variables identified by CART is higher than that of LR and CART with input of the same variables.
As for specificity, Table 6 and Figure 6 show that the specificity of CART with input of only 6 to 10 important variables identified by CART is 75.5%, which is higher than that with input of the 15 original variables and is the highest specificity in this study. Comparing the three models, the specificity of CART with input of ten, nine, eight, seven or six important variables identified by CART is higher than that of LR and RF with input of the same variables, and the specificity of RF with input of five or four important variables identified by CART is higher than that of LR and CART with input of the same variables.
The accuracy with input of important variables identified by RF is shown in Table 7 and Figure 7. It can be seen that the accuracy of the LR, CART and RF models mostly decreases. Even so, the model developed by RF is still the best one, with higher accuracy than LR and CART when inputting most groups of important variables.
Concerning sensitivity, Table 7 and Figure 8 show that the sensitivity of LR, CART and RF with input of only 4 to 10 important variables identified by RF is lower than that with input of the 15 original variables, with the exception of LR and CART with input of 10 important variables, and RF with input of 9 or 6 important variables. Comparing the three models, the sensitivity of LR with input of ten, eight or four important variables identified by RF is higher than that of CART and RF with input of the same variables, and the sensitivity of RF with input of nine, seven, six or five important variables identified by RF is higher than that of LR and CART with input of the same variables.
As to specificity, Table 7 and Figure 9 show that the specificity of LR, CART and RF with input of 10 important variables identified by RF is in each case higher than that with input of the 15 original variables. However, the specificity of LR, CART and RF with input of 4 to 9 important variables identified by RF is in each case lower than that with input of the 15 original variables. Comparing the three models, the specificity of CART with input of seven, six or five important variables identified by RF is higher than that of LR and RF with input of the same variables, and the specificity of RF with input of ten, nine, eight or four important variables identified by RF is higher than that of LR and CART with input of the same variables.

Accuracy of LR, CART and RF with Input of Significant or Important Variables
In Table 4 it is seen that the accuracy (73.38%) of RF is higher than that of LR and CART when inputting the 15 original variables. From another analytical point of view, the accuracy of the different LR, CART and RF models is shown in Figures 10-12, corresponding to input of different groups of significant variables identified by LR or important variables identified by CART or RF. The accuracy of RF increases as the number of input significant variables identified by LR is reduced from 10 to 5, and each of these values is higher than that with input of the 15 original variables. In particular, the accuracy (74.43%) of RF with input of nine significant variables identified by LR is the highest accuracy in this study. In addition, the accuracy of RF increases as the number of input important variables identified by CART decreases from 10 to 6, and all of these values are also higher than the accuracy of RF with input of the 15 original variables.
This means that the RF model is the most efficient prediction model for accuracy, and that inputting only the most significant variables identified by LR or the most important variables identified by CART clearly improves the accuracy of RF.

Figure 12. Accuracy of RF with input of significant or important variables.

Sensitivity of LR, CART and RF with Input of Significant or Important Variables
In Table 4 it is seen that the sensitivity (74.48%) of LR is higher than that of CART and RF when inputting the 15 original variables. From another analytical point of view, the sensitivity of the different LR, CART and RF models is shown in Figures 13-15, corresponding to input of different groups of significant variables identified by LR or important variables identified by CART or RF, respectively. The sensitivity (75.6%) of RF with input of nine significant variables identified by LR is the highest sensitivity in this study, and the sensitivity (74.54%) of the RF model with input of eight, seven or six important variables identified by CART is slightly higher than the highest sensitivity (74.48%) obtained with the 15 original variables. In addition, the sensitivity of the LR model with input of ten or nine significant variables identified by LR is also slightly higher than 74.48%.
This means that the RF and LR models are efficient prediction models for sensitivity, and that sensitivity can be improved when RF takes as input some significant variables identified by LR or important variables identified by CART, and when LR takes as input some significant variables identified by LR.

Specificity of LR, CART and RF with Input of Significant or Important Variables
In Table 4 it is seen that the specificity (72.95%) of RF is higher than that of LR and CART when inputting the 15 original variables. From another analytical point of view, the specificity of the different LR, CART and RF models is shown in Figures 16-18, corresponding to input of different groups of significant variables identified by LR or important variables identified by CART or RF, respectively. The specificity (75.5%) of CART with input of 6 to 10 important variables identified by CART is the highest specificity in this study, and the specificity of CART with input of four to eight significant variables identified by LR is higher than the highest specificity (72.95%) obtained with the 15 original variables. In addition, the specificity of RF with input of four to ten significant variables identified by LR, six to ten important variables identified by CART, or ten important variables identified by RF is in every case higher than 72.95%.
This means that the CART and RF models are efficient prediction models for specificity. In other words, specificity improves when CART takes as input either some important variables identified by CART or some significant variables identified by LR, and when RF takes as input either some significant variables identified by LR or some important variables identified by CART or RF.

Odds Ratio of Variables Estimated by LR
Besides significance, the output estimated by LR contains the odds ratios of the variables. An odds ratio reveals the relative correlation of each value of a given variable with a specific severity. For instance, the odds ratios of the values of the variables "major cause" and "action" are listed in Table 8. The odds ratio of "reverse driving" (5.931) is much higher than 1, which means that there is a strong correlation between "reverse driving" (driving in the direction opposite to the flow of traffic) and "serious accidents". On the other hand, the odds ratio of "failure to keep a safe distance" (0.157) is much less than 1, which means that there is a strong correlation between "failure to keep a safe distance" and "minor accidents". Another example is the variable "action", where the odds ratio of "overtaking" (4.810) is much higher than 1, meaning that there is a strong correlation between "overtaking" and "serious accidents". By contrast, the odds ratio of "abrupt deceleration" (0.991) is only marginally less than 1, which indicates at most a weak correlation between "abrupt deceleration" and "minor accidents". In Table 8, the reference base for severity is the minor accident, and "others" is the reference category for both "major cause" and "action".
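For reference, an odds ratio in logistic regression is the exponential of the corresponding coefficient, so values above 1 point toward the positive class (serious accident) and values below 1 toward the reference class (minor accident). The sketch below applies this reading to the four odds ratios quoted in the text. The `tolerance` band around 1 is a hypothetical device for illustration, not a statistical test; a proper judgment would rely on the confidence interval of each odds ratio.

```python
import math

def odds_ratio(coefficient):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(coefficient)

def direction(or_value, tolerance=0.05):
    """Classify an odds ratio by the severity class it points toward."""
    if or_value > 1 + tolerance:
        return "serious"
    if or_value < 1 - tolerance:
        return "minor"
    return "neutral"

# Odds ratios quoted in the text for values of "major cause" and "action".
reported = {
    "reverse driving": 5.931,
    "failure to keep a safe distance": 0.157,
    "overtaking": 4.810,
    "abrupt deceleration": 0.991,
}
readings = {name: direction(value) for name, value in reported.items()}

# The coefficient behind an odds ratio is its natural logarithm.
beta_reverse_driving = math.log(reported["reverse driving"])
```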

Rules Generated by CART
By analyzing the classification tree (displayed as a tree graph), the results of the CART model can be converted into rules [38]. Each terminal node of the tree represents a rule, with all the splits of the parent nodes being the antecedent and the class of the terminal node being the consequent. For each terminal node, the rules can be filtered by support, confidence, and lift, where support is the percentage of the entire data set covered by the rule, confidence is the proportion of examples that fit the right side (consequent) among those that fit the left side (antecedent), and lift is a measure of the statistical dependence of the rule.
In each actual road accident case, several variables occur together; in other words, there are many patterns of co-occurring variables in traffic accidents. Such a pattern is the antecedent of a rule, and the corresponding severity is the consequent. The lift of a rule reveals the tendency of the pattern to result in the corresponding severity. There were four rules with high lift (i.e., a value higher than 1), which are displayed in Table 9.
Table 9. Rules converted from the tree graph output by CART.
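The three rule measures can be computed from simple counts, as in the following sketch. Two assumptions are made here: support is taken as the share of all records that match both antecedent and consequent, and lift is taken in its common form, confidence divided by the overall share of the consequent class. The ten toy records are hypothetical.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    n = len(records)
    match_antecedent = [r for r in records if antecedent(r)]
    match_both = [r for r in match_antecedent if consequent(r)]
    match_consequent = [r for r in records if consequent(r)]
    support = len(match_both) / n
    confidence = len(match_both) / len(match_antecedent)
    lift = confidence / (len(match_consequent) / n)
    return support, confidence, lift

# Hypothetical records: 4 reverse-driving accidents (3 serious), 6 others (2 serious).
records = (
    [{"cause": "reverse driving", "severity": "serious"}] * 3
    + [{"cause": "reverse driving", "severity": "minor"}] * 1
    + [{"cause": "other", "severity": "serious"}] * 2
    + [{"cause": "other", "severity": "minor"}] * 4
)
support, confidence, lift = rule_metrics(
    records,
    antecedent=lambda r: r["cause"] == "reverse driving",
    consequent=lambda r: r["severity"] == "serious",
)
```

Here the rule covers 3 of 10 records, 3 of the 4 matching accidents are serious, and half of all accidents are serious, so the lift exceeds 1 and the pattern is associated with serious accidents more often than chance would suggest.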

Discussion
Summarizing the empirical results on prediction performance, the four main findings are as follows. First, regarding accuracy, RF is the most efficient tool for predicting severity among the three models, and the integrated method in which RF takes as input only some significant variables identified by LR or important variables identified by CART improves the accuracy; the accuracy (74.43%) of RF with input of nine significant variables identified by LR is the highest accuracy in this study. Second, regarding sensitivity, RF and LR achieve better performance for predicting severity; sensitivity improves with the integrated method in which RF takes as input only some significant variables identified by LR or important variables identified by CART, and also when LR takes as input only some significant variables identified by LR; the sensitivity (75.6%) of RF with input of nine significant variables identified by LR is the highest sensitivity in this study. Third, regarding specificity, CART and RF achieve better performance; specificity improves with the integrated method in which CART takes as input only some significant variables identified by LR or important variables identified by CART, and also when RF takes as input only some significant variables identified by LR or important variables identified by CART or RF; the specificity (75.5%) of CART with input of six to ten important variables identified by CART is the highest specificity in this study. Fourth, in general, the integrated method in which RF takes as input only some significant variables identified by LR or important variables identified by CART can simultaneously satisfy the dual purposes of improving prediction performance (accuracy, sensitivity and specificity) and reducing the considerable cost of collecting data in accident research.
Based on the above summary, if the primary concern is overall prediction performance, the integrated method in which RF takes as input only some significant variables identified by LR or important variables identified by CART should be selected to pursue higher accuracy. If the focus is on the prediction of serious accidents, the integrated method in which RF takes as input only some significant variables identified by LR or important variables identified by CART, or in which LR takes as input only some significant variables identified by LR, should be selected to pursue higher sensitivity. If the goal is the prediction of minor accidents, the integrated method in which CART takes as input only some significant variables identified by LR or important variables identified by CART, or in which RF takes as input only some significant variables identified by LR or important variables identified by CART or RF, should be selected to pursue higher specificity. In general, no matter whether the goal is prediction performance in its entirety, or just the prediction of serious or minor accidents, the integrated method in which RF takes as input only some significant variables identified by LR or important variables identified by CART can simultaneously satisfy the dual purposes of improving prediction performance and reducing the considerable cost of collecting data. There are 15 original variables for modeling road accident severity in this study. By using LR, CART and RF, significant variables or important variables are identified, and various numbers of them are taken as input variables to compare the classification performance for road accident severity. In addition, the management organization should focus more on the management of issues related to the significant or important variables.

Conclusions
In this paper, the empirical results demonstrate that the accuracy and specificity of RF are higher than those of LR and CART when the 15 original variables are input, and that inputting only some specific significant variables identified by LR or important variables identified by CART into RF can improve accuracy, sensitivity and specificity. Therefore, RF is the most effective prediction model among the three, which is consistent with the results of previous studies.
On the other hand, previous studies that discussed significant variables identified by LR [4] and important variables identified by CART [2] or RF [24] focused on the strength of their relationship with the accident. This study evaluates the influence of significant variables and important variables on the prediction performance of LR, CART and RF models. The results presented in Section 3 reveal that when the 15 original variables are replaced in RF, LR or CART models by some specific significant variables identified by LR or important variables identified by CART, the accuracy, sensitivity or specificity improves to a greater extent than when they are replaced by some important variables identified by RF. Therefore, significant variables identified by LR and important variables identified by CART help RF, LR and CART models achieve better prediction performance more efficiently than important variables identified by RF.
Based on the above two summaries, combining the most effective prediction model (RF) with significant variables identified by LR or important variables identified by CART should achieve better prediction performance. This conclusion is confirmed by two results. First, the accuracy, sensitivity and specificity of RF can all be improved by inputting only significant variables identified by LR or important variables identified by CART. Second, the accuracy (74.43%) and sensitivity (75.6%) of RF are both at their highest in this study when the nine significant variables identified by LR are input. In other words, the integrated method in which RF takes as input only significant variables identified by LR or important variables identified by CART can simultaneously serve the dual purposes of improving prediction performance and reducing the considerable cost of collecting data.
Besides the significant variables mentioned above, LR can generate the odds ratios of variables, which reveal the correlation between each value of a given variable and a specific severity level, and provide information to road authorities about preventing accidents [33]. For instance, in the results of this study, "reverse driving" tends strongly to be linked to "serious accidents", and "failure to keep a safe distance" has a strong correlation with "minor accidents". Corresponding measures to prevent serious and minor accidents are therefore worth considering as proposals to road authorities. In addition, besides the important variables mentioned above, CART can generate rules [38] that can help in preventing accidents. For instance, in this study, when the "vehicle type" is "bus" or "heavy truck or tractor-trailer", the "major cause" is "improper lane change" or "failure to keep a safe distance", and the "location" is "ramp", the severity strongly tends to be "minor accident". In other words, road authorities can adopt effective measures for buses, heavy trucks and tractor-trailers to reduce their improper lane changes and failures to keep a safe distance on ramps, so that property damage accidents can be reduced.
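To make these two kinds of output concrete, the following minimal sketch shows how an odds ratio is obtained and how the CART rule quoted above can be applied to a single accident record. It is an illustration under stated assumptions, not the analysis code of this study: the function names, the dictionary layout of a record and the 2x2 counts are hypothetical, and the counts are invented purely to show the arithmetic. The odds ratio of a binary indicator in a logistic regression is the exponential of its coefficient; the 2x2 version is its unadjusted analogue.

```python
import math


def odds_ratio_from_coefficient(beta: float) -> float:
    """Odds ratio implied by a logistic regression coefficient, exp(beta)."""
    return math.exp(beta)


def odds_ratio_2x2(exposed_serious: int, exposed_minor: int,
                   unexposed_serious: int, unexposed_minor: int) -> float:
    """Unadjusted odds ratio of a serious accident, exposed vs. unexposed cases."""
    counts = (exposed_serious, exposed_minor, unexposed_serious, unexposed_minor)
    if min(counts) <= 0:
        raise ValueError("all four cell counts must be positive")
    return (exposed_serious * unexposed_minor) / (exposed_minor * unexposed_serious)


# Values named in the rule quoted in the text.
RULE_VEHICLE_TYPES = {"bus", "heavy truck or tractor-trailer"}
RULE_MAJOR_CAUSES = {"improper lane change", "failure to keep a safe distance"}


def rule_predicts_minor(record: dict) -> bool:
    """Return True when the record satisfies all three conditions of the rule.

    The keys "vehicle type", "major cause" and "location" follow the variable
    names in the text; storing a record as a dict is an assumption.
    """
    return (
        record.get("vehicle type") in RULE_VEHICLE_TYPES
        and record.get("major cause") in RULE_MAJOR_CAUSES
        and record.get("location") == "ramp"
    )


# Invented counts, for illustration only: (30 * 40) / (10 * 20).
print(odds_ratio_2x2(30, 10, 20, 40))  # 6.0

# Hypothetical accident record matching the rule.
record = {
    "vehicle type": "bus",
    "major cause": "improper lane change",
    "location": "ramp",
}
print(rule_predicts_minor(record))  # True
```

An odds ratio above 1 indicates that the condition is associated with higher odds of a serious accident, and a value below 1 with lower odds; a rule, by contrast, is a conjunction of conditions that either applies to a record or does not.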
In this study, significant variables or important variables are identified using LR, CART and RF, and various numbers of them are taken as input variables to compare the classification performance for road accident severity. The impact of the category of variables on classification performance is an issue worth studying, and it is suggested as a future area of investigation.

Conflicts of Interest:
The authors declare no conflict of interest.