Crash Injury Severity Prediction Using an Ordinal Classification Machine Learning Approach

In many related works, nominal classification algorithms ignore the order between injury severity levels and make sub-optimal predictions. Existing ordinal classification methods suffer rank inconsistency and rank non-monotonicity. The aim of this paper is to propose an ordinal classification approach to predict traffic crash injury severity and to test its performance over existing machine learning classification methods. First, we compare the performance of the neural network, XGBoost, and SVM classifiers in injury severity prediction. Second, we utilize a severity category-combination method with oversampling to relieve the class-imbalance problem prevalent in crash data. Third, we take advantage of probability calibration and the optimal probability threshold moving to improve the prediction ability of ordinal classification. The proposed approach can satisfy the rank consistency and rank monotonicity requirement and is proved to be superior to other ordinal classification methods and nominal classification machine learning by statistical significance test. Important factors relating to injury severity are selected based on their permutation feature importance scores. We find that converting severity levels into three classes, minor injury, moderate injury, and serious injury, can substantially improve the prediction precision.


Introduction
The prediction and cause analysis of traffic crashes has always been an important topic for scholars in traffic safety. In the research of this subject, scholars often use statistical methods or machine learning methods to conduct research.
A statistical model usually specifies the mathematical relationship between explanatory variables and crash severity. Based on strict assumptions of uncertainty distribution and hypothesis tests, the statistical model can isolate the effects of explanatory variables on crash severity [1,2]. For example, Cerwick et al. [3] used the mixed logit model and the latent class multinomial logit model to predict crash severity. A large number of crash specific, temporal, roadway, vehicle, driver characteristics, and environmental factors were found significant. Haghighi et al. [4] used standard ordered logit (SOL) and Multilevel ordered logit (MOL) to analyze the effect of roadway geometric features on crash severity. However, statistical models are usually weaker in making predictions than machine learning methods. Iranitalab and Khattak [5] compared the performance of a statistical model, Multinomial Logit (MNL), with three machine learning methods including Nearest Neighbor Classification (NNC), Support Vector Machines (SVM), and Random Forests (RF) in predicting traffic crash severity, and found MNL has the worst prediction accuracy.
Machine learning models are designed to make the most accurate predictions possible. Chang et al. [6] used the Classification And Regression Tree (CART) to predict crash severity, where prediction accuracy is 90.8% for learning data and 91.7% for testing data. Abdel-Aty et al. [7] used a single-layer hidden layer Multi-layer Perceptron (MLP) to predict traffic crash severity with an average prediction accuracy of 73.5%. Delen et al. [8] used an artificial neural network (ANN) to predict the severity of a crash and improved the prediction accuracy. Alkheder et al. [9] used ANN, combined with k-means for clustering, to predict traffic crashes and then compared with probit algorithm to prove that ANN is better than probit in predicting the severity of crashes. Among many methods of crash severity prediction research, neural network methods have better performance and are more popular.
Most existing machine learning methods applied in crash severity prediction treat crash severity levels as nominal data without order information. This unrealistic simplification casts a shadow on machine learning methods' prediction ability. In general, the severity of traffic crashes is classified as fatal injury or killed, incapacitating injury, nonincapacitating, possible injury, property damage, or no injury. Moreover, the severity of the injury is ordered and increases from no injury to possible injury, to non-incapacitating, to incapacitating injury, and to fatal injury or killed. Between closely related adjacent categories (such as no injury and possible injury), there may be shared unobserved effects or correlations between their data [10]. To the authors' knowledge, in the field of crash severity study, less attention has been paid to ordinal classification machine learning, and information on natural ordering in injury severity is missed in conventional machine learning, including SVM, decision tree, and MLP. Some statistical models, such as ordered logit and ordered probit [4,11,12], can handle the ordinal severity labels. However, discrete choice models rely on statistical assumptions and pre-defined relationships between severity labels and input variables, which makes them good choices for factor analysis but restricts their prediction accuracy [13].
Gutierrez et al. [14] summarized ordinal classification machine learning algorithms developed to classify categorical variables that show a natural order between the labels. They confirmed that there is no clear winner that performs the best in all possible datasets and problem requirements. The main three categories of ordinal classification machine learning are: (1) Cost-sensitive classification: apply cost-sensitive loss function in the evaluation of the learned system with different costs for different types of misclassification errors. For example, Riccardi et al. [15] proposed cost-sensitive AdaBoost for ordinal regression. The problem of cost-sensitive classification is how to determine the cost matrix without priori knowledge of the ordinal classification. (2) Ordinal binary decomposition: decompose the ordinal target variable into several binary variables, which are then estimated by single or multiple models. Our new ordinal classification method falls in this category. The problem of existing ordinal binary decomposition methods is the violation of rank monotonicity or rank consistency. Related methods and their drawbacks are introduced in the method section later in more detail. (3) Threshold model: extension of the regression model in which distances among the ordered classes are not pre-defined but estimated by finding the optimal thresholds dividing classes [16]. Li and Lin [17] proposed a general reduction framework to transform ordinal regression as a series of binary classification sub-problems and demonstrated that many threshold models and ordinal binary decomposition methods are equivalent.
The number of crash cases in each category is often imbalanced. Usually, the sample size of fatal cases is several times smaller than that of cases in other categories. With imbalanced data, traditional classification algorithms incline to the category with a large amount of data, while the category with a small amount of data is neglected [9]. Many studies merged several minority categories of injury severity into one class and converted multi-class classification problems into two-class (no injury vs. injury) classification problems [8]. Another option is to turn the multi-class classification problem into a three-class (no injury, minor injury, and fatal injury) problem [3,6]. Some studies have tried to deal with imbalanced data by under-sampling majority class examples [18] and by oversampling minority class examples [19,20] and achieved good results.
Although some scholars have tried to combine several injury severity levels into fewer categories [8], there will still be a class-imbalance problem, and the predicted results will still incline towards the category of large proportion. This paper focuses on different combinations of severity categories that can relieve imbalance and keep the model's ability to predict crash injury severity. SMOTE-NC (Synthetic Minority Oversampling Technique for Nominal and Continuous) is applied to oversample the minority class. We compare the performance of three classifiers: MLP, XGBoost, and SVM. The best classifier is combined with an ordinal binary decomposition method to handle ordinal crash severity labels.
The aim of this paper is to propose an ordinal machine learning classification approach that overcomes the ordinal nature of crash severity data and class-imbalance problems. The contributions of this presented approach include: (1) To the authors' knowledge, this is the first paper applying ordinal classification machine learning to predict traffic crash injury severity using real-world crash data. (2) We propose an ordinal classification machine learning method that satisfies rank monotonicity and rank consistency and takes advantage of probability calibration and the movement of optimal probability threshold to generate superior classification results compared to existing ordinal classification algorithms. (3) We test six severity category-combination strategies and find the best three-class combination plan.
The rest of this paper is constructed as follows. The second section describes and analyzes the characteristics of crash data. The third section presents the research methods involved in this paper, including sampling, severity category-combination, machine learning, and ordinal classification. The fourth section shows the comparison and analysis of the results. The conclusions of this paper are included in the fifth section.

Data Description
The data were collected from the Highway Safety Information System (HSIS) for crashes that occurred in California in 2010. Variables in the crash dataset include those related to intersections, road segments, and historical traffic crashes. Several variables were dropped when the null value occurred too frequently in the dataset. The California traffic crash data contains three data files, the crash file, vehicle file, and occupant file. The crash file contains 52 variables such as time, location, crash severity, the total number of injuries, weather, etc. The vehicle file contains 42 variables such as vehicle model, whether the driver makes a phone call, whether the driver is drunk, etc. The occupant file contains ten variables such as age, gender, the severity of injury, type of collision, etc. These three data files were merged according to the crash number and the crash vehicle number. The data contains 104 variables observed in one crash, including the injury severity of the person involved. In this dataset, five crash injury severity levels are defined, namely non-injury crash (denoted as NIC), complaint of pain (denoted as COP), other visible injury (denoted as OVI), severe injury (denoted as SI), and killed (denoted as KSI). The severity level of the crash-related personnel injury is shown in Figure 1 We select 17 major influencing variables in this study: occupant type, seating position, type of collision, primary collision factor, first associated factor, roadway class, ejected, object struck, the total number of vehicles involved, alcohol involved, driver's gender, driver's age, occupant's age, vehicle year, motorcycle involved, driver's safety equipment, and occupant's safety equipment. Primary collision factor is the one element that best describes the cause of the collision or, if removed, would have prevented the collision from occurring. First associated collision factor is the most important one of factors or violations that contributed, but were not the main cause of the collision. There are 14 categorical variables, except for driver's age, occupant's age, and vehicle model year. The Appendix A provides the descriptive statistics of these variables. In total, there are 140,690 crash records, out of which 139,555 remained after cleaning missing data. Samples with missing values were simply removed because the proportion of samples with missing values is relatively small. Other errors were not found in the Highway Safety Information System (HSIS) dataset. equipment, and occupant's safety equipment. Primary collision factor is the one element that best describes the cause of the collision or, if removed, would have prevented the collision from occurring. First associated collision factor is the most important one of factors or violations that contributed, but were not the main cause of the collision. There are 14 categorical variables, except for driver's age, occupant's age, and vehicle model year. The Appendix A provides the descriptive statistics of these variables. In total, there are 140,690 crash records, out of which 139,555 remained after cleaning missing data. Samples with missing values were simply removed because the proportion of samples with missing values is relatively small. Other errors were not found in the Highway Safety Information System (HSIS) dataset.

Methodology
We summarize the research framework as a flowchart in Figure 2. After data cleaning, we try six methods of category combination that merge crash injury severity levels into fewer categories. SMOTE-NC oversampling is applied to relieve the class-imbalance problem. We compare SVM, XGBoost, and MLP, and choose the best one as the classifier used in the ordinal classification method. The permutation feature importance is also analyzed with the chosen classifier.
The proposed ordinal classification method contains five main steps. First, the label to predict (crash injury severity) is decomposed with one-vs-all binary decomposition. Second, a multiple-output classifier or multiple single-output classifiers (chosen from SVM, XGBoost, and MLP) are trained to predict crash injury severity. Third, the predicted probabilities are calibrated to remove the bias from the classifier and data sampling. Fourth, the cumulative probabilities are calculated based on calibrated probabilities, which satisfies both the rank monotonicity and rank consistency. Fifth, the threshold moving method can help to find the optimal threshold that converts probabilities into crash injury severity predictions.
All classifiers trained in this paper are established using Python programming language with supported libraries, including keras, tensorflow, xgboost, hyperopt, and scikit-learn. The computing platform is a desktop computer with an AMD Ryzen 1700X 8-core processor and Windows 10

Methodology
We summarize the research framework as a flowchart in Figure 2. method, along with other methods, is evaluated through cross-validation. The performance of each method is compared and tested by a statistical significance test. The computational cost of the proposed method is discussed to show its computation efficiency.

Imbalanced Data Preprocessing
One of the critical characteristics of the crash dataset is that the number of crashes leading to death and severe injury is always much less than those of trivial injury. The problem of imbalanced data is prevalent in the field of traffic accident studies. For example, in our data NIC accounts for 57.20% of all crashes. This imbalance of data means that a dummy classifier that classifies all instances to NIC would still achieve an accuracy score of 57.20%. This issue would have a detrimental effect on the training process. Classifiers After data cleaning, we try six methods of category combination that merge crash injury severity levels into fewer categories. SMOTE-NC oversampling is applied to relieve the class-imbalance problem. We compare SVM, XGBoost, and MLP, and choose the best one as the classifier used in the ordinal classification method. The permutation feature importance is also analyzed with the chosen classifier.
The proposed ordinal classification method contains five main steps. First, the label to predict (crash injury severity) is decomposed with one-vs-all binary decomposition. Second, a multiple-output classifier or multiple single-output classifiers (chosen from SVM, XGBoost, and MLP) are trained to predict crash injury severity. Third, the predicted probabilities are calibrated to remove the bias from the classifier and data sampling. Fourth, the cumulative probabilities are calculated based on calibrated probabilities, which satisfies both the rank monotonicity and rank consistency. Fifth, the threshold moving method can help to find the optimal threshold that converts probabilities into crash injury severity predictions.
All classifiers trained in this paper are established using Python programming language with supported libraries, including keras, tensorflow, xgboost, hyperopt, and scikitlearn. The computing platform is a desktop computer with an AMD Ryzen 1700X 8-core processor and Windows 10 operating system. The proposed ordinal classification method, along with other methods, is evaluated through cross-validation. The performance of each method is compared and tested by a statistical significance test. The computational cost of the proposed method is discussed to show its computation efficiency.

Imbalanced Data Preprocessing
One of the critical characteristics of the crash dataset is that the number of crashes leading to death and severe injury is always much less than those of trivial injury. The problem of imbalanced data is prevalent in the field of traffic accident studies. For example, in our data NIC accounts for 57.20% of all crashes. This imbalance of data means that a dummy classifier that classifies all instances to NIC would still achieve an accuracy score of 57.20%. This issue would have a detrimental effect on the training process. Classifiers such as artificial neural networks, support vector machines, and decision trees are designed for balanced data with a roughly equal sample size of each class. In the case of imbalanced data, classifiers tend to overly focus on the class with the largest proportion and ignore the minority class. However, accurately predicting the minority class, SI and KSI in this case, is the main purpose of machine learning training. Existing research has tried to combine the categories of severe injuries in traffic crashes and turn multi-class problems into a binary-class problem to make predictions. However, training a binary-class classifier limits the model's ability to distinguish different levels of injury severity and therefore reduces the model's practical value.
This research performs oversampling and category combination to solve the problem of imbalanced data. We combine the five crash severity levels into three classes in order to reduce the difficulty in severity prediction. We propose all six possible ways of category combination that can convert five crash severity levels into three classes while keeping an ordinal nature (illustrated in Figure 3). Each class contains at least one of the five crash severity levels, and the severity levels are exclusive over classes. For all combinations, NIC is always included in class 1, and KSI is always included in class 3. All instances in class 3 have higher severity levels than instances in class 2, and all instances in class 2 have higher severity levels than instances in class 1. The difference between combinations is how COP, OVI, and SI are assigned into classes. As shown in Figure 4, the proportion of each class in the traffic crash data is still uneven in each combination, but the imbalance is relieved compared to the original five categories. In addition, the 3-class classification problem has more explanatory ability compared to the 2-class problem. It can be interpreted that the five severity categories are combined into three new classes: minor injury crashes, moderate injury crashes, and serious injury crashes. sen to be oversampled. For all combinations except combination 3, class 3 is oversampled. For combination 6, class 2 is also oversampled. SMOTE-NC sampling is performed on class 3 in combination 1, 2, and 4-6, and on class 2 in combination 6. The sampling rate ensures that the minority class instance size equals one-fifth of the majority class instance size. The new proportion of classes in each combination after sampling is shown in Figure  5. Since classes 2 and 3 are more than one-fifth of class 1 in combination 3, no sample in combination 3 is oversampled.      Another important task we need to complete is oversampling of the minority class. At present, there are two main sampling methods to deal with the imbalance-data problem, namely under-sampling and oversampling. Machine learning algorithms are data-hungry and require extensive data for model training, making under-sampling unpreferable, especially when the minority class sample size is small. Because the dataset is a mix of categorical and continuous features, this research uses SMOTE-NC sampling, a variation of SMOTE sampling. SMOTE-NC creates synthetic data for categorical as well as continuous features in the data set. SMOTE-NC treats categorical features differently from continuous features. For continuous features, SMOTE-NC sampling is an interpolation algorithm that looks for features between the data sample and supplements data with similar characteristics for minority class instances. By contrast, the categorical variable's value of a newly generated sample is decided by picking the most frequent category of the nearest neighbors present during the generation [21]. The proportion of each class in each combination after oversampling is shown in Figure 5. Class 1 is the majority class in all six combinations. Classes with an instance number smaller than 20% of class 1 are chosen to be oversampled. For all combinations except combination 3, class 3 is oversampled. For combination 6, class 2 is also oversampled. SMOTE-NC sampling is performed on class 3 in combination 1, 2, and 4-6, and on class 2 in combination 6. The sampling rate ensures that the minority class instance size equals one-fifth of the majority class instance size. The new proportion of classes in each combination after sampling is shown in Figure 5. Since classes 2 and 3 are more than one-fifth of class 1 in combination 3, no sample in combination 3 is oversampled.

Cumulative Binary Decomposition
Given a dataset (1), the cumulative binary decomposition method en-   (1), the cumulative binary decomposition method encodes y i into K-1 binary labels y

Cumulative Binary Decomposition
A single multi-output model or K-1 single-output models can be trained with the binary decomposed dataset x i , y where k ∈ {1, 2, · · · , K − 1}. After training, the predicted probability of y (k) i is essentially the predicted cumulative probabilityp(y i > C k ). Frank's method and Cheng's method are both based onp(y i > C k ). Frank's method [22] first calculate the probability of each class based on (2), and the predicted class is given by (3).

One-vs-All Binary Decomposition
Different from cumulative binary decomposition, one-vs-all binary decomposition method encodes y i into K binary labels y A single multi-output model can be trained to predictp(y i = C k ) and guarantee that Beckham and Pal [24] propose a method that where β k is to be determined. One option is setting β k = k; another is to calculate β k by optimizing the following objective function (8). We denote these two options as Beckham1 and Beckham2.

Existing Drawbacks
Both cumulative binary decomposition and one-vs-all binary decomposition have drawbacks.
Rank monotonicity requires thatp(y i > C k ) ≤p y i > C j for any k > j. However, predicted cumulative probabilities based on cumulative binary decomposition do not guarantee rank monotonicity [25] since cumulative probabilitiesp(y i > C k ) andp y i > C j are predicted independently. Ifp(y i > C k ) >p y i > C j for any k > j, thenp C k ≥ y i > C j < 0, which is unrealistic and hurts model performance.
Predicted class probabilities based on one-vs-all binary decomposition do not guarantee rank consistency, which requires p i (k) =p(y i = C k ) to have a convex shape, illustrated in Figure 6.
 , which is unrealistic and hurts model performance.
Predicted class probabilities based on one-vs-all binary decomposition do not guarantee rank consistency, which requires     = i i k p k p y C  to have a convex shape, illustrated in Figure 6.

Proposed Method
We summarize all methods in Table 1. Frank's method and Cheng's method are both based on cumulative binary decomposition. Frank's method converts predicted cumulative probabilities into class probabilities, while Cheng's method applies cumulative probabilities directly to predict class labels without knowing class probabilities. Beckham1 and Beckham2 are based on one-vs-all binary decomposition and class probabilities.

Proposed Method
We summarize all methods in Table 1. Frank's method and Cheng's method are both based on cumulative binary decomposition. Frank's method converts predicted cumulative probabilities into class probabilities, while Cheng's method applies cumulative probabilities directly to predict class labels without knowing class probabilities. Beckham1 and Beckham2 are based on one-vs-all binary decomposition and class probabilities.

Class Prediction Based on Cumulative Binary Decomposition One-vs-All Binary Decomposition
Class Probability Frank's method [22] Beckham1 and Beckham2 [24] Cumulative Probability Cheng's method [23] This paper We propose a new method that uses one-vs-all binary decomposition and K single output models to predict class probabilities. The main difference between our method and Beckham1/Beckham2 is that we convert predicted class probabilities into cumulative probabilities:p The advantage of this method is that it generates predicted cumulative probabilities satisfying rank monotonicity. Then the class label is determined as follows: where T k is the optimal threshold of y (k) . Some machine learning algorithms, such as treebased learning, usually generate biased probabilities. Probabilities can also be distorted because of data imbalance and sampling [26]. Therefore, we find the F1-maximizing T k with validation data instead of simply setting T k to 0.5. Sincep y i = C j in (6) could be biased, the bias is delivered top(y i > C k ) and causes inaccurate estimation ofŷ i in (7). We perform probability calibration by isotonic regression to remove bias inp y i = C j . There are two main methods to calibrate probability: Platt scaling and isotonic regression. Platt [27] introduced Platt scaling, which trains a logistic regression to map the original output to the real class probability. Isotonic regression is a non-parametric approach introduced by Zadrozny and Elkan [28,29]. Isotonic regression is preferable to Platt scaling when the sample size is large enough. Isotonic regression fits a piecewise constant non-decreasing function, where predicted probabilities or scores in each bin are assigned the same calibrated probability that is monotonically increasing over bins. More formally, where M is the number of bins, a 1 , a 2 , · · · , a M+1 are the interval boundaries, θ 1 , θ 2 , . . . , θ M are the corresponding calibrated probabilities for that falls in each bin.

Machine Learning Algorithms
We test the performance of three classifiers, Multi-Layer Perceptron (MLP), eXtreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM), on 5-category classification. The winner of three candidates is kept for a 3-class problem and other analysis tasks after.
MLP is a pass-forward artificial neuron network that maps an n-dimensional input vector to an m-dimensional output vector. It has many successful applications in classification tasks such as MNIST handwriting digit number recognition by transforming high dimensional input of related elements to low dimensional discriminative representation. Several researchers have attempted to utilize deep learning frameworks to model potential factors that may lead to different injury severity levels [30,31]. They input the randomly shuffled dataset directly into the network to capture the feature of all input factors of a particular crash.
Back-propagation multi-layer perceptron (BP-MLP) applies weighted input from every previous layer to a non-linear function, evaluates the difference between network output and actual label, and optimizes the parameters in the network using optimizers such sophistic gradient decedent (SGD) or Adam optimizer. Thus, the characteristic of a particular MLP model can be defined by its depth, non-linear function of each layer, loss function, and optimizer. This research utilizes two hidden layers in the network. The numbers of neurons are 64 and 10, respectively. Both layers use a rectifier linear unit (ReLU) as the activation function. We then feed the mapped 3-dimension output learned representation vector to a softmax layer to compute the final predicted class and predict probabilities for the input variables.
Among the 17 variables used to predict injury severity, there are 14 categorical variables, except driver's age, occupant's age, and vehicle model year. When training a neural network, one-hot encoding is more appropriate for categorical data where no specific numerical relationship exists between categories. This involves representing each categorical variable with a group of binary vectors that has one {0,1} code for each unique variable value. However, One-hot encoding dramatically increases the dimension of data. For example, if a {0,1} code is used to represent every numerical code of object1, then the one-hot encoding will create 99 more dimensions than numerical encoding. One-hot encoding also leads to sparse data space, making it challenging to optimize neural networks.
XGBoost is a Gradient Tree Boosting-based algorithm that has been proven to be a powerful classifier. The advantage of XGBoost in this study is that decision tree-based machine learning has no issues with the numerical encoding of categorical variables. Moreover, XGBoost requires much less training time than neural network and often produce remarkable prediction results in crash-related studies [32][33][34].
SVM has been and still is a widely used classifier. Many studies of traffic crash injury prediction have applied SVM as a benchmark classifier [30,35,36]. Therefore, it is used as such in this study.
To extract the maximum performance out of classifiers, we need hyperparameter tuning to determine the optimal combination of hyperparameters. Hyperopt is one of the most popular hyperparameter tuning packages and implements the Tree of Parzen Estimators (TPE) algorithm to search the optimal value of hyperparameters efficiently in a search space described by the user [37]. We apply Hyperopt in this paper to optimize the main hyperparameters of MLP, XGBoost, and SVM. The hyperparameters of MLP include the number of layers/neurons, activation function in each layer, optimizer, learning rate, number of epochs, and batch size. The hyperparameters of XGBoost include the number of estimators, learning rate, maximum depth, subsample ratio, etc. The hyperparameters of SVM are the C parameter and gamma.

Cross-Validation and Evaluation Metrics
We use stratified 10-fold cross-validation to evaluate the classification algorithm's performance. Stratified 10-fold cross-validation divides the 139,555 records randomly into ten equal-sized subsets. Each subset has the same proportion of each class as the total dataset. At each time, eight subsets are used for sampling (if required) and training, and one subset is used for probability calibration and threshold optimization (only for the ordinal classification method proposed in this paper). The last subset is used to test the performance of the trained model. This process rotates through each subset, and the average precision, recall, and F1 score of each class represent the algorithm's performance.

Statistical Significance Test
Machine learning algorithms are commonly evaluated using k-fold cross-validation, and their evaluation metrics, such as mean accuracy scores, are compared directly. Statistically significance tests are designed to test whether the difference between evaluation metrics is statistically significant or the result of a statistical fluke. The null hypothesis is that metric scores observed from two algorithms were drawn from the same distribution. If this assumption is rejected, it suggests that the difference in metric scores is statistically significant. Otherwise, the two algorithms' performances are statistically equal. K-fold cross-validated paired Student's t-test is the most used statistical test for machine learning algorithms comparison. However, the calculation of the t-statistic in the test is misleading since the metric scores in each sample are not independent [38]. In k-fold cross-validation, a given observation will be used in the training dataset k-1 times. This means that the estimated metric scores are dependent.
Dietterich [38] recommended a resampling method called 5 × 2 cross-validation that involves five repeats of 2-fold cross-validation. Two-fold cross-validation can ensure that each observation appears only in the train or test dataset once. A paired Student's t-test is used on the results.
where: ∆ (1) i is the scores difference of two algorithms for the first fold of the i-th 2-fold crossvalidation; ∆ (2) i is the scores difference of two algorithms for the second fold of the i-th 2-fold crossvalidation; is the mean of scores difference for the first 2-fold cross-validation.
Under the null hypothesis that two algorithms are statistically equal, t is assumed to follow a Student's t-distribution with 5 degrees of freedom. If t stays close enough to 0, then the null hypothesis is satisfied. The threshold is 2.571 at the 95% confidence level. 5 × 2 cross-validation is used in this paper to compare algorithms' performance.

Comparison of Classifiers
The result of the 5-category classification problem is listed in Table 2. We compare each classifier's precision rate and find that XGBoost has the highest precision rate in COP, SI, and KSI categories. MLP only outperforms other classifiers in category OVI. The gap between the performance of XGBoost and MLP may be caused by the data characteristic that most variables are categorical. Therefore, we utilize XGBoost as the only classifier used for analysis in the following sections. The performance of SVM relies on marginal data that lies near the separating hyperplane. SVM yields poor performance when fed with data with ambiguous distinction or imbalanced class. Several modifications to the SVM kernel function and preprocessing methods have been used to improve SVM's capability to distinguish minority samples. This paper uses radial basis function (RBF) as SVM's kernel function and achieves significant improvement on minority class prediction compared to other kernel functions. Although SVM can identify 100% NIC instances, it is still the worst classifier and fails to identify any SI and KSI instances.
In general, the precision of injury severity decreases as the severity level rises. More than 99% of NIC cases can be correctly classified regardless of the classifier used. For COP, the precision is about 60%, but the precision of SI and KSI decreases dramatically to almost 0. As discussed in the Introduction, the poor performance on serious injury crashes is caused by class-imbalance.

Comparison of Category-Combination and Sampling
In order to overcome the shortcomings of the five-category problem, this research proposes six ways of category-combination and generates a 3-class problem through category-combination. As shown in Table 3, it is clear that the macro-average precision rate is improved for most combinations except combination 6. However, it can be seen that the precision rate of class 3 (KSI only) is still very low in combinations 1 and 4 because KSI is not combined with others to relieve the imbalance. In Table 4, after preprocessing by SMOTE-NC, the precision rate of each class is further improved. Among all combinations, combination 3 has the highest F1 score of 45.0% for class 3, but it combines OVI with SI and KSI and loses the ability to predict serious injury crashes. Combination 5 has the second-highest F1 score of 24.5% for class 3, but its F1 score for class 2 is only 19.4%. Combination 2 achieves acceptable F1 scores of 90.1%, 78.3%, and 24.1% for classes 1, 2, and 3, respectively. Moreover, combination 2 groups COP and OVI into class 2 and groups SI and KSI into class 3, which is a reasonable combination strategy. The three classes can be considered as minor injury crashes, moderate injury crashes, and serious injury crashes. Therefore, we apply combination 2 to convert 5-category into a 3-class classification problem.

Feature Importance
After category-combination and SMOTE-NC sampling, the classification model is more efficient in predicting the severity of crash injuries than the original five-category problem. We analyze the permutation feature importance of the classification models, as shown in Table 5. In combination 1-3, occupant type has the most significant impact on the injury severity to crash-injured individuals. Ejected from vehicle has the second-highest importance in combinations 3 and 5. In combination 1 and 2, the number of vehicles involved in the crash has the second-greatest impact on the severity of injuries to crashinjured individuals. Vehicle model year is also an important feature in combinations 3, 4, and 5. Based on repeated permutation feature importance calculation, we find that the differences between the most and less important input features are statistically significant.
It is worth noting that in each combination, features with high importance are basically the same. They are ejected from vehicle, number of vehicles, occupant type, and vehicle model year. As shown in Appendix A, ejected from vehicle and occupant type are highly related to the severity of the injury. The driver's injury severity is more considerable when the number of vehicles involved is one or two. The proportion of cases in which drivers were ejected from vehicles is not particularly large, accounting for only 2.45% of the data. In cases where drivers were ejected from vehicles, the proportion of fatal and severe injuries is high, accounting for 27-29%. In cases where drivers were not ejected from vehicles, the ratio of fatal and severe injury is only 1.7%.

Comparison of Ordinal Classifications
In total, we test the performance of 5 ordinal classification methods on injury severity data, including Frank's method, Cheng's method, Beckham1, Beckham2, and the method proposed by this paper. We also compared the results of ordinal classification methods with nominal classification to prove ordinal classification's advantage. In each method, XGBoost is used as the basic classifier.
In Section 1, we explained why ordinal classification methods are better than nominal classification when the labels are ordinal. In Section 3, we interpreted the drawbacks of ordinal classification benchmarks used in this paper and why our proposed ordinal classification method is more advanced theoretically. We believe that our proposed method should outperform other ordinal classification methods, which outperform nominal classification. Most results shown in Table 6 are consistent with our expectations. The ordinal classification method proposed in this paper achieves the highest precision rate for class 3, at 41.2%. The corresponding recall rate is acceptable, at 21.3%. The F1 score is 28.1%, which is also the highest among all methods. In the meantime, the proposed approach can still get high F1 scores for classes 1 and 2. This method also has the highest macro-average precision and the third-highest macro-average F1 score.
As expected, Frank's method and Cheng's method have the second highest and third highest F1 score for class 3. Cheng's method gets almost the same F1 scores for classes 1 and 2 as Frank's method. This shows that Frank's and Cheng's methods are superior to the traditional nominal classification method, although rank monotonicity is not satisfied.
Surprisingly, Beckham1 and Beckham2 perform worse than nominal classification. Beckham1 s F1 scores for class 2 and 3 are smaller than these of nominal classification. Beckham2 cannot even predict any cases in class 3, resulting in a 0% F1 score for class 3. A possible reason is that Beckham's method is adversely impacted by the class-imbalance issue. Beckham's method relies on the estimation of β k , which could be biased if the numbers of cases in classes are not equal-sized. 5 × 2 cross-validation and paired Student's t-test are used to test whether the ordinal classification method proposed in this paper is statistically better than other methods. As shown in Table 7, we compare this paper's method (method A) and other methods (method B) by testing whether their accuracy scores are from the same distribution. All p-values are smaller than 0.01, indicating that performance differences are statistically significant, and this paper's method is superior to the nominal classification method and other ordinal classification methods. The computational cost of the proposed method is not much higher than nominal classification and other ordinal classification methods. The main computational cost of classification, either categorical or ordinal, is training k or k-1 single output XGBoost classifiers, which takes 21.8 s on the computing platform. Compared to other methods, the extra computational work of the proposed method is probability calibration and threshold moving, which costs about 4.5 s and are much faster than XGBoost training. Therefore, the proposed method can improve model performance with minimal extra computational cost. Figures 7 and 8 present the predicted probabilities before and after calibration, respectively. The probability plot is a standard way to check how predicted probabilities fit empirical probabilities. Take class 1, for example, in which all samples are binned into groups based on their predicted probabilities of class 1. For each bin, we calculate the percentage of samples that are actually in class 1 (fraction of positives). The horizontal axis of Figures 7 and 8 are the mean predicted probabilities of each bin, and the vertical axis is the corresponding fraction of positives. Perfectly calibrated probabilities should have the mean predicted probability equal to the fraction of positives in each bin and should form a diagonal line in the probability plot.
Before calibration, the predicted probabilities of class 1 are very close to being perfectly calibrated. The predicted probabilities of class 2 are slightly underestimated. For example, when the predicted probability of class 2 is around 60%, the actual fraction of positive cases is 80%. This underestimation bias could be caused by XGBoost itself since decision treebased classifiers do not generate calibrated probabilities. The predicted probabilities of class 3 are obviously overestimated since the probability plot is below the perfectly calibrated line. This problem is due to class imbalance and oversampling, which distorts the class distribution in the original data. Therefore, if the biased and uncalibrated probabilities of classes 1, 2, and 3 are directly used to calculate the cumulative probabilities, the cumulative probabilities will also be wrong and unreliable. After isotonic regression, all predicted probabilities are perfectly calibrated, as shown in Figure 8. The thresholds Tk in (7) are determined by finding the F1-maximizing thresholds for the validation data. The optimal thresholds found for the data used in this paper are T1 = 0.43 and T2 = 0.33. Since 0.5 is the default threshold used in many studies and algorithms, we compare the optimal thresholds to the default threshold in Table 8. The default threshold leads to significantly worse results than the optimal thresholds. For the default threshold, the precision rate of class 3 is 15.8%, much smaller than 41.2% of the optimal thresholds, and the F1 score of class 3 is also smaller than that of the optimal thresholds. After isotonic regression, all predicted probabilities are perfectly calibrated, as shown in Figure 8. The thresholds Tk in (7) are determined by finding the F1-maximizing thresholds for the validation data. The optimal thresholds found for the data used in this paper are T1 = 0.43 and T2 = 0.33. Since 0.5 is the default threshold used in many studies and algorithms, we compare the optimal thresholds to the default threshold in Table 8. The default threshold leads to significantly worse results than the optimal thresholds. For the default threshold, the precision rate of class 3 is 15.8%, much smaller than 41.2% of the optimal thresholds, and the F1 score of class 3 is also smaller than that of the optimal thresholds. After isotonic regression, all predicted probabilities are perfectly calibrated, as shown in Figure 8.
The thresholds T k in (7) are determined by finding the F1-maximizing thresholds for the validation data. The optimal thresholds found for the data used in this paper are T 1 = 0.43 and T 2 = 0.33. Since 0.5 is the default threshold used in many studies and algorithms, we compare the optimal thresholds to the default threshold in Table 8. The default threshold leads to significantly worse results than the optimal thresholds. For the default threshold, the precision rate of class 3 is 15.8%, much smaller than 41.2% of the optimal thresholds, and the F1 score of class 3 is also smaller than that of the optimal thresholds.

Conclusions
This research proposed an ordinal classification machine learning method to improve the prediction of imbalanced traffic crash injury severity. SMOTE-NC oversampling and category-combination are applied to relieve the class imbalance problem. XGBoost, SVM, and multi-layer perceptron machine learning are utilized to predict the injury severity of traffic crashes. Based on the analysis results, the effects of ejected from vehicle, number of vehicles involved, occupant type, and vehicle model year on the severity of traffic crashes are found to be important. The experimental results suggest that the proposed ordinal classification method provides better prediction results than other existing ordinal classification methods and traditional nominal classification, especially in minority classes. It was shown from the results that probability calibration and optimal thresholds are helpful in injury severity prediction.
Future efforts should focus on the following aspects: (1) establish a more comprehensive ordinal classification that combines cost-sensitivity with the ordinal classification method proposed in this paper. (2) try to solve the 5-category classification problem without combining any two or more categories.

Conflicts of Interest:
The authors declare no conflict of interest.