Predicting Crash Injury Severity with Machine Learning Algorithm Synergized with Clustering Technique: A Promising Protocol

Predicting crash injury severity is a crucial constituent of reducing the consequences of traffic crashes. This study developed machine learning (ML) models to predict crash injury severity using 15 crash-related parameters. Separate ML models for each cluster were obtained using fuzzy c-means, which enhanced the predicting capability. Finally, four ML models were developed: feed-forward neural networks (FNN), support vector machine (SVM), fuzzy C-means clustering based feed-forward neural network (FNN-FCM), and fuzzy c-means based support vector machine (SVM-FCM). Features that were easily identified with little investigation on crash sites were used as an input so that the trauma center can predict the crash severity level based on the initial information provided from the crash site and prepare accordingly for the treatment of the victims. The input parameters mainly include vehicle attributes and road condition attributes. This study used the crash database of Great Britain for the years 2011–2016. A random sample of crashes representing each year was used considering the same share of severe and non-severe crashes. The models were compared based on injury severity prediction accuracy, sensitivity, precision, and harmonic mean of sensitivity and precision (i.e., F1 score). The SVM-FCM model outperformed the other developed models in terms of accuracy and F1 score in predicting the injury severity level of severe and non-severe crashes. This study concluded that the FCM clustering algorithm enhanced the prediction power of FNN and SVM models.


Background
The number of traffic crashes and their victims has been a rising trend globally due to increases in population and motorization. Different factors involved in traffic crashes have a substantial effect on each other, thus making it difficult to individually consider any of the parameters when explaining the severity of traffic crashes.
Traditionally, different statistical techniques have been employed to predict the severity of traffic crashes. Among those statistical models, ordered probit (OP) model [1][2][3][4][5], ordered logit (OL) model [6], multinomial logit (ML) model [7,8], and logistic regression (LR) model [9] are all widely used. Though these conventional techniques have been used for predicting crash severity, they suffer from a few inherent limitations. For example, the assumptions related to data distribution and a linear relationship Without the assumptions of statistical models, several ML models have been employed for traffic crash severity prediction. ML algorithms can model the non-linear relationship crash severity and related factors. Sameen and Pradhan [38] predicted the injury severity of traffic crashes using recurrent neural networks (RNN) in Malaysia. It was found that RNN is superior, with an accuracy of 71.77% followed by multilayer perceptron neural network (MLP) and Bayesian logistic regression (BLS) with an accuracy of 65.48% and 58.30%, respectively. Similarly, the performance of the ANN and OP model were compared [39]. Fuzzy adaptive resonance theory (ART) and MLP were used to examine crash severity. The study revealed that MLP performed better, with an accuracy of 73.6%, whereas ART showed a classification accuracy of 70.6%. The results of the OP model were the least accurate, having an accuracy of 61.7%. ANN was used for crash severity prediction using 6 years of crash data for Abu Dhabi. The results of ANN were compared to the OP model and the ANN outperformed the OP model in terms of accuracy [40]. Similarly, ANN was used to predict the crash severity at a signalized intersection in Central Florida, USA. MLP and ART neural networks were compared based on accuracy and MLP performed better than ART [17].
Along with neural networks, many other ML models have been used for crash severity prediction. A study adopted ANN, SVM, decision tree (DT), and LR models to predict severity [41]. SVM showed the most accurate prediction followed by DT, ANN, and LR, respectively. A multi-objective genetic algorithm was compared with ANN, SVM, and DT for predicting crash severity in the capital of Iran. The technique was more accurate compared to ANN, SVM, and DT [42]. Zhang et al. [43] compared ML and statistical techniques for crash severity prediction. The study concluded that the ML techniques, although suffering from over-fitting issues, outperformed the classical statistical techniques in terms of prediction accuracy.
Similarly, a study was conducted to predict the traffic crash severity using ANN, genetic algorithm (GA), combined genetic algorithm (CGA), and pattern search (PS) methods, and their performance was compared [44]. The study revealed that ANN outperformed the other three methods having an R-value of 0.87, while an R-value of 0.79 was calculated for GA and PS methods. Li et al. [45] used a combination of GA and ANN to examine the factors affecting the crash severity in Washington, USA. The authors combined the GA with ANN architecture to enhance the efficiency of searching for the significant variables. It was found that driver conduct, vehicle action, roadway surface condition, driver restraint, and age significantly affect the crash severity. A deep learning-based convolutional neural network (CNN) was employed for predicting traffic crash severity. CNN was compared with several statistical techniques like LR and ML techniques like SVM and ANN. The study revealed that the CNN model performed better than all of the other techniques [46].
In the preceding sections (Sections 1.2 and 1.3), we presented the studies related to traffic crash severity prediction. Many statistical techniques have been implemented by researchers followed by modern ML techniques like ANN, SVM, DT, and GA. As a general conclusion, we noticed that ML techniques outperformed conventional crash severity prediction techniques, which was also evident in most studies.

Artificial Neural Networks
First introduced in the 1960s, the ANN can solve many complex analytical problems. It is a biologically-motivated machine learning tool that captures and represents extremely complex non-linear relationships existing in real work data sets. It works by mimicking the neurological functions of the human brain, just as neurons stimulate and react to a situation in the human brain. It can predict the outcome of an observation based on the pattern caught from historical data after carrying put a training procedure [41]. An artificial neuron is the basic unit of a neural network. Input nodes transfer the knowledge to a neuron, which is processed internally to produce a response. ANN is processed in two steps: the first step is the linear combination of input values and then the obtained results are used as an argument for non-linear activation function. Each connection has a weight assigned to it and the activation function is differentiable [47].
The network architecture is defined by neurons organization. An example of a neural network is presented in Figure 1. The architecture consists of an input layer, a hidden layer, and an output layer. The input, hidden, and output layers consist of five, three, and one neuron, respectively. The output of one layer is utilized as input to the next layer. The activation function in the neuron combines the inputs by multiplying with the corresponding weights. There is also a bias component in each neuron. Optimization methods are used to estimate the weights of the input ("called the training of network") by minimizing the loss function. Several training algorithms are available; one of them is backpropagation, which is based on a gradient descent technique for parameter estimation [47].

Support Vector Machine
SVM is an extensively used ML technique. SVM works under the principle of supervised learning that uses labeled training data to deliver input and output functions, just like neural networks. Input and output functions are related to each other either by classification or regression type functions. Just like neural networks, both classification and regression prediction models can be handled by SVM [41,48].
SVM is a supervised ML technique that does not require any assumption about the data distribution. Labeled data sets are presented in the original formulation and the main aim of the SVM training algorithm is to look for a hyperplane, thus separating data set into discrete sets consistent with the training examples. The decision boundary is often referred by the term optimal separation hyperplane, which minimizes the misclassifications found in the training stage. Classifier with optimal decision boundary is found by the iterative process of learning, which separates the training pattern and the simulation data under the same configuration. The simplest form of SVM is a linear binary classifier in which a class from one of the two labels is assigned to a given test sample [49].
SVM is a comparatively strong machine learning technique in many fields due to its theoretical framework, which can perform more efficiently for noise mixed data compared to the conventional chaotic local models. This technique is also valid for small data sets along with global optimization and better generalization performance. Minimizing general error and maximizing the margin is the primary intent of SVM by separating hyperplane between two classes [50]. Figure 2 explains the general idea of a support vector machine. It has two classes of support vectors separated by a hyperplane. The points which are closest to the hyperplane are called support vectors. Support vectors are critical points that, if eliminated, would alter the position of the hyperplane. The hyperplane is a simple line classifying the data. Higher the margin between support vectors, more confident we are about the correct classification of data by the hyperplane.

Support Vector Machine
SVM is an extensively used ML technique. SVM works under the principle of supervised learning that uses labeled training data to deliver input and output functions, just like neural networks. Input and output functions are related to each other either by classification or regression type functions. Just like neural networks, both classification and regression prediction models can be handled by SVM [41,48].
SVM is a supervised ML technique that does not require any assumption about the data distribution. Labeled data sets are presented in the original formulation and the main aim of the SVM training algorithm is to look for a hyperplane, thus separating data set into discrete sets consistent with the training examples. The decision boundary is often referred by the term optimal separation hyperplane, which minimizes the misclassifications found in the training stage. Classifier with optimal decision boundary is found by the iterative process of learning, which separates the training pattern and the simulation data under the same configuration. The simplest form of SVM is a linear binary classifier in which a class from one of the two labels is assigned to a given test sample [49].
SVM is a comparatively strong machine learning technique in many fields due to its theoretical framework, which can perform more efficiently for noise mixed data compared to the conventional chaotic local models. This technique is also valid for small data sets along with global optimization and better generalization performance. Minimizing general error and maximizing the margin is the primary intent of SVM by separating hyperplane between two classes [50]. Figure 2 explains the general idea of a support vector machine. It has two classes of support vectors separated by a hyperplane. The points which are closest to the hyperplane are called support vectors. Support vectors are critical points that, if eliminated, would alter the position of the hyperplane. The hyperplane is a simple line classifying the data. Higher the margin between support vectors, more confident we are about the correct classification of data by the hyperplane.

Fuzzy C-Means Clustering
Fuzzy c-means clustering algorithm (FCM) was introduced by Dunn in 1973 [51] and enhanced by Bezdeq in 1981 [52], and it partitions a dataset into two or more clusters. A process of subdividing data set into subsets (clusters) is known as cluster analysis. Clusters are pairwise disjointed and reproduce the dataset upon merging. The FCM cluster technique is the extension of the K-means method. It repeatedly searches for a set of fuzzy clusters and associated clusters centers that provide the best representation of data set structure. The algorithm is user-dependent as the user has to state the number of clusters for cluster analysis. It divides the data set into clusters by reducing the sum of squared error within each group [53].
In this clustering algorithm, each cluster can be considered as a fuzzy set, and each training vector assigns membership grade to a cluster, which is measured by a membership function [54]. In FCM clustering, unlike hard C-means, each training vector belongs to multiple clusters [55]. In the FCM algorithm, the partitioning assigns each training vector to a cluster with the degree of belongingness varying from 0 to 1. It minimizes an objective function while partitioning the training vectors. The membership matrix U lies within 0 and 1, and the total membership grades of a training vector to all clusters is 1. The relevant expressions follow.
Summation of membership grades, Objective function, Calculated cluster center, and calculated membership grade,

Fuzzy C-Means Clustering
Fuzzy c-means clustering algorithm (FCM) was introduced by Dunn in 1973 [51] and enhanced by Bezdeq in 1981 [52], and it partitions a dataset into two or more clusters. A process of subdividing data set into subsets (clusters) is known as cluster analysis. Clusters are pairwise disjointed and reproduce the dataset upon merging. The FCM cluster technique is the extension of the K-means method. It repeatedly searches for a set of fuzzy clusters and associated clusters centers that provide the best representation of data set structure. The algorithm is user-dependent as the user has to state the number of clusters for cluster analysis. It divides the data set into clusters by reducing the sum of squared error within each group [53].
In this clustering algorithm, each cluster can be considered as a fuzzy set, and each training vector assigns membership grade to a cluster, which is measured by a membership function [54]. In FCM clustering, unlike hard C-means, each training vector belongs to multiple clusters [55]. In the FCM algorithm, the partitioning assigns each training vector to a cluster with the degree of belongingness varying from 0 to 1. It minimizes an objective function while partitioning the training vectors. The membership matrix U lies within 0 and 1, and the total membership grades of a training vector to all clusters is 1. The relevant expressions follow.
Summation of membership grades, Objective function, Calculated cluster center, where j is 1, . . . , N; c is the number of clusters; m is any real number greater than 1 known as fuzzy exponent; u ij is the membership value of x j for the cluster I; x j is the j th of d-dimensional measured data; c i is the d-dimension center of the cluster; and ||*|| is the Euclidean distance between any training vector and the center.

Study Objectives
The objective of this paper was to predict crash severity using information that could be easily identified with a little investigation on crash sites. Such knowledge enables trauma centers to predict injury severity accordingly, dispatch the properly equipped emergency vehicle, and then guide it to the nearest hospital. Empowering trauma centers with such ability can improve traffic safety especially in remote areas where emergency services are scarce. The parameters used in this paper mainly include vehicle attributes and road condition attributes.
The main goal of traffic crash injury severity modeling is to comprehend the relationship between crash injury severity and many contributing factors like human characteristics, vehicle characteristics, roadway characteristics, and environmental characteristics. The aim of this study is to investigate the role of clustering algorithms in the enhancement of the performance of machine learning models, such as ANN and SVM. It will create a new paradigm in developing machine learning models for crash severity prediction.

Outline
The rest of the paper is organized as follows: Section 2 explains the dataset used in model development. The methodology followed in models' development and validation is described in Section 3. Section 4 presents results and discussion. Section 5 outlines the main conclusions and explains the limitations and recommendations for future study.

Data Set Description
The crash dataset in Great Britain, UK, between 2011-2016 was used in this study. A random sample of crashes representing each year was used for analysis. A randomized class balancing procedure [56] was followed in selecting the sample. In this procedure, a sample was randomly selected in such a way that each class had an equal number of events in the dataset to make sure that the sample was balanced without any bias towards a specific severity level. The spatial distribution of crashes for the sample is illustrated in Figure 3, representing both severe and non-severe crashes in the Great Britain (GB). From this figure, it is evident that crashes were highly crowded in London. The selected input attributes that belonged to vehicle, road, and crash are shown in Table 1. Features that could be easily identified with a little investigation on crash sites were used as an input so that the trauma center could predict the injury severity level based on the initial information provided and prepare accordingly for the treatment of the victims. The output variable was the crash severity (i.e., severe or non-severe). A crash with at least one fatality or serious injury was considered severe. The severity was reported in three levels in the original data set: fatal, serious, and slight, with a proportion of 1.5%, 15.5%, and 83%, respectively. In this research, the fatal and serious crashes were merged under one severity class, i.e., severe crash. In contrast, slight crashes were considered a non-severe crash. It should be mentioned that the following types of injuries were considered serious injuries:

•
The injury that causes a person to be detained in hospital as an in-patient for an extended period and which may have required surgery.
• An injury that will have lasting or even permanent implications for the injured person and that will have an impact upon their ability to work or which involve a change to their level of independence.

•
An injury that causes death 30 or more days after the accident.


An injury that will have lasting or even permanent implications for the injured person and that will have an impact upon their ability to work or which involve a change to their level of independence.  An injury that causes death 30 or more days after the accident.
Before developing the ML models, the dataset was preprocessed to make sure that the crash dataset was clean and valid. The crash distribution with respect to area type and road type was illustrated in Figure 4. Figure 4 explains that a higher number of crashes in rural areas was severe, while urban areas mostly encountered non-severe crashes. This trend was obvious due to higher speeds in rural areas leading to severe crashes, while the speeds in urban areas are comparatively slower, minimizing the possibility of severe crashes. Most crashes occurred on a single carriageway since two-way traffic is not separated by a median, thus leading to higher chances of crashes.   Before developing the ML models, the dataset was preprocessed to make sure that the crash dataset was clean and valid. The crash distribution with respect to area type and road type was illustrated in Figure 4. Figure 4 explains that a higher number of crashes in rural areas was severe, while urban areas mostly encountered non-severe crashes. This trend was obvious due to higher speeds in rural areas leading to severe crashes, while the speeds in urban areas are comparatively slower, minimizing the possibility of severe crashes. Most crashes occurred on a single carriageway since two-way traffic is not separated by a median, thus leading to higher chances of crashes.

Model Development
In this study, the MATLAB Environment (MathWorks , Natick, MA, USA) was used throughout the analysis to develop and validate all models using the variables mentioned in Table 1 as inputs. Four ML models were considered in the study: feed-forward neural networks (FNN), support vector machine (SVM), fuzzy c-means clustering based feed-forward neural network (FNN-FCM), and fuzzy c-means based support vector machine (SVM-FCM). The models were trained and tested based on the same dataset for unbiased comparison. Description of the developed models and the resulted accuracies are provided in the following sections. For training purposes for all developed models, 70% of the dataset (7000 crashes) was used while the remaining 30% were used for testing (3000

Model Development
In this study, the MATLAB Environment (MathWorks, Natick, MA, USA) was used throughout the analysis to develop and validate all models using the variables mentioned in Table 1 as inputs. Four ML models were considered in the study: feed-forward neural networks (FNN), support vector machine (SVM), fuzzy c-means clustering based feed-forward neural network (FNN-FCM), and fuzzy c-means based support vector machine (SVM-FCM). The models were trained and tested based on the same dataset for unbiased comparison. Description of the developed models and the resulted accuracies are provided in the following sections. For training purposes for all developed models, 70% of the dataset (7000 crashes) was used while the remaining 30% were used for testing (3000 crashes).

Feedforward Neural Networks
The neural network model was developed in a systematic procedure for predicting crash severity. A systematic trial and error approach followed, starting with a small number of hidden layers and then built larger until an acceptable accuracy was achieved without causing overfitting. Different training algorithms were tested in building the FNN model such as Levenberg-Marquardt (LM), resilient backpropagation (RB), scaled conjugate gradient (SCG), BFGS quasi-Newton (BFG), Bayesian regularization backpropagation (BR), and variable learning rate backpropagation (GDX).
LM learning algorithm resulted in the highest classification accuracy compared to the other training algorithms. The optimum topology for the FNN model with the highest classification accuracy on the testing set was found to be two hidden layers with 32 and 2 neurons in the first and second layers respectively using hyperbolic tangent sigmoid and softmax activation functions for the first layer and second layer, respectively. The learning rate, goal, and number of epochs of the network were 0.00002, 0.000001, and 100, respectively. The loss function was the mean squared error.

Support Vector Machine
A SVM model was developed to predict crash severity using driver, roadway, and crash characteristics as input parameters. In developing any SVM model, the values of the penalty parameter C and insensitivity zone ε should be determined. The trade-off between training error and model complexity was controlled by parameter C. Insensitivity zone ε affected the complexity and generalization capability of the SVM model as it controlled the smoothness of SVM response and the number of support vectors.
The optimization routine of the classifier was based on the iterative single data algorithm proposed by Kecman et al. [57]. A systematic trial and error procedure was followed to determine the values of C and ε. After many trial and error experiments, the best values of C and ε were found to be 15 and 150, respectively. One of the important parameters that can affect the classification accuracy of the SVM model is the kernel function. Many kernel functions were tried in developing the SVM model, such as polynomial kernel, Gaussian kernel, Gaussian radial basis function, Laplace RBF kernel, hyperbolic tangent kernel, sigmoid kernel, and ANOVA radial basis kernel. In this study, the Gaussian radial basis function was adopted as it resulted in the highest classification accuracy compared to the other functions. The kernel scale parameter was 15. The classifier divided all elements of the predictor matrix by kernel scale value. A box constraint controlled the maximum penalty enacted on margin-violating observations, which prevented overfitting. The classifier generated fewer support vectors for a high value of box constraint. The value of the box constraint was 200 in this modeling exercise. The resulted accuracy of the developed SVM model found to be 73%.

FCM-Based FNN and SVM
The traffic crash data were divided into different clusters using the FCM clustering algorithm to investigate the effect of clustering on the classification accuracies of FNN models. The exponent, number of iterations, and desired improvement in the objective function for the FCM were 2, 100, and 0.0001, respectively. The learning algorithm of the FNN was LM. In all the cases, hyperbolic tangent sigmoid activation functions were used for both layers. The number of neurons, learning rate, goal, and number of epochs of the FNN models varied depending on clusters. The loss function was the mean squared error for all models.
The FCM clustering algorithm was also adopted to investigate the effect of clustering on the classification accuracies of SVM models. The exponent, number of iterations, and the desired improvement in the objective function for the FCM were 3, 100, and 0.0001, respectively. The optimization routine of the classifier was based on the iterative single data algorithm. All of the predictor variables were normalized using. The kernel scale parameter varied between 5 and 7 depending on clusters. The value of box constraint was 150 for all classifiers.
A systematic approach was followed to determine the optimal number of clusters based on prediction accuracy, as shown in Table 2. It can be observed from Table 2 that the optimal number of clusters of FNN-FCM and SVM-FCM were 2 and 4 clusters, respectively. For the FNN-FCM model, one ANN model was developed for each cluster with a total of 2 FNN models. The optimum topology for each model was found following the same procedure explained in Section 3.1. After training the models, all crashes in the testing dataset were distributed between the two clusters and the corresponding FNN model was executed to predict the severity of each crash. The overall testing accuracy for FNN combined with FCM found to be 71.8%, which represents a slight improvement compared to testing accuracy of FNN without clustering which was found to be 70.1%.
For the SVM-FCM model, one SVM model was developed for each cluster with a total of four SVM models and the same testing procedure was repeated. The overall testing accuracy for SVM combined with FCM found to be 74.2%, which indicates that combining FCM with SVM had higher accuracy when compared with SVM.

Results and Discussion
In this study, four machine-learning models were developed to predict the severity of traffic crashes. Two severity levels were considered in this study (severe and non-severe). To evaluate the developed models, confusion matrices for the training and testing data for the developed models were prepared, as shown in Figures 5-8. To evaluate the developed models, confusion matrices for the training, and testing data for the developed models were prepared (Figures 5-8). A confusion matrix is a simple table that is used usually to describe the performance of classification models on test data for which the true values are known. The diagonals in Figures 5-8 running from NW to SE represent true predictions. On the contrary, diagonals running NE to SW depict false predictions. Several statistics are calculated from the confusion matrices, namely classification accuracy, sensitivity, precision, and F1 score, which are explained in Equations (5)-(10).         The classification accuracy is defined as the number of correct predictions divided by the total number of observations and can be explained using Equation (5).
Classification Accuracy = Number of crashes correctly predicted as severe/non − sever Total number of crashes The classification accuracy results revealed that combining the clustering technique with FNN could enhance the prediction accuracy of FNN slightly. Combining FCM with SVM had higher accuracy when compared with SVM only. Moreover, based on the classification accuracy results, SVM-FCM models outperformed the other developed models. Nevertheless, using classification accuracy as the only metric to compare between the crash severity prediction models is misleading and not enough, as sometimes the developed model is biased towards a specific severity level.
In crash severity prediction models, prediction of crash severity is correctly critical and important as the actions required for sever crashes are different when compared with those of nonsever crashes. Hence, more metrics such as sensitivity and precision were considered in this study to investigate the capability of the developed models in predicting the crash severity. Sensitivity can be defined as the ratio of the crashes correctly predicted as sever/non-sever crashes to the total number of actual sever/non-sever crashes and can be expressed using Equations (6) and (7).
Sensitivity (severe crashes) = Number of crashes correctly predicted as severe Total actual severe crashes Number of crashes correctly predicted as non − severe The classification accuracy is defined as the number of correct predictions divided by the total number of observations and can be explained using Equation (5).
Classification Accuracy = Number of crashes correctly predicted as severe/non−sever Total number of crashes The classification accuracy results revealed that combining the clustering technique with FNN could enhance the prediction accuracy of FNN slightly. Combining FCM with SVM had higher accuracy when compared with SVM only. Moreover, based on the classification accuracy results, SVM-FCM models outperformed the other developed models. Nevertheless, using classification accuracy as the only metric to compare between the crash severity prediction models is misleading and not enough, as sometimes the developed model is biased towards a specific severity level.
In crash severity prediction models, prediction of crash severity is correctly critical and important as the actions required for sever crashes are different when compared with those of non-sever crashes. Hence, more metrics such as sensitivity and precision were considered in this study to investigate the capability of the developed models in predicting the crash severity. Sensitivity can be defined as the ratio of the crashes correctly predicted as sever/non-sever crashes to the total number of actual sever/non-sever crashes and can be expressed using Equations (6) and (7). Sensitivity (severe crashes) = Number of crashes correctly predicted as severe Total actual severe crashes (6) Sensitivity (non − severe crashes) = Number of crashes correctly predicted as non − severe Total actual non − severe crashes (7) Precision is the ratio of the crashes correctly predicted as sever/non-sever crashes to the total number of predicted sever/non-sever crashes and can be expressed using Equations (8) and (9).
Number of crashes correctly predicted as severe Total predicted severe crashes (8) Precision (non − severe) = Number of crashes correctly predicted as non − severe Total predicted non − severe crashes (9) The best model has the maximum values of both sensitivity and precision. Sometimes, choosing the best classifier based on maximizing two parameters is confusing and not an easy task. Hence, for convenience, the harmonic mean of sensitivity and precision (F1 score) was introduced, which is expressed in Equation (10) [58].
The values of accuracy, sensitivity, precision, and F1 score for all developed classifiers are shown in Figures 9 and 10 The values of accuracy, sensitivity, precision, and F1 score for all developed classifiers are shown in Figures 9 and 10.     The comparative analysis showed that the SVM-FCM model performed better than the other developed models based on accuracy and F1 score in predicting severe and non-severe accidents. Moreover, it can be observed that FNN had the least accuracy and F1 score values. Furthermore, the results revealed that introducing the FCM clustering algorithm enhanced the prediction power of FNN and SVM models. The prediction accuracies of clustering based models developed in this study were higher than ML models developed in previous studies for traffic crash severity prediction [38,39,43].

Conclusions
This study focused on predicting traffic crash severity by employing 15 crash-related parameters in four machine learning models: FNN, SVM, FCM clustering-based FNN, and FCM-based SVM. The models were developed based on the GB crash database data from 2011-2016. A random sample of crashes representing each year was used for analysis. Driver attributes, vehicle attributes, and road condition attributes were used as inputs when developing the models. Two severity levels were considered in this study: severe crashes and non-severe crashes. The severe crash was defined as the crash with at least one serious injury or a fatality. The severity prediction power of the developed models was evaluated using four measures of effectiveness: accuracy, sensitivity, precision, and the harmonic mean of sensitivity and precision (F1 score). The SVM-FCM model outperformed the other developed models in terms of accuracy and F1 score in predicting the severity level of severe and non-severe crashes. The FNN had the least accuracy and F1 score values. This study concluded that the FCM clustering algorithm enhanced the prediction power of FNN and SVM models.

Limitations and Future Study
The randomized class balancing procedure was used in this study to solve the problem of traffic crash imbalanced dataset. The authors could have used other advanced approaches to handle imbalance dataset issue. Moreover, the application of the developed models in developing countries might face challenges due to scarcity of the data.
This work inspired the authors to think of simplifying the developed model in the future by eliminating some of the predicting variables. Although this is expected to reduce the model accuracy, it might make it agile enough to be utilized in other countries especially developing ones where traffic crash data is usually scarce.