Analysis of Severe Injuries in Crashes Involving Large Trucks Using K-Prototypes Clustering-Based GBDT Model

Abstract: The unobserved heterogeneity in traffic crash data hides certain relationships between the contributory factors and injury severity. The literature has been limited in exploring different types of clustering methods for the analysis of the injury severity in crashes involving large trucks. Additionally, the variability of data type in traffic crash data has rarely been addressed. This study explored the application of the k-prototypes clustering method to counteract the unobserved heterogeneity in large truck-involved crashes that occurred in the United States between 2016 and 2019. The study segmented the entire dataset (EDS) into three homogeneous clusters. Four gradient boosted decision trees (GBDT) models were developed on the EDS and individual clusters to predict the injury severity in crashes involving large trucks. The list of input features included crash characteristics, truck characteristics, roadway attributes, time and location of the crash, and environmental factors. Each cluster-based GBDT model was compared with the EDS-based model. Two of the three cluster-based models showed significant improvement in their predicting performances. Additionally, feature analysis using the SHAP (Shapley additive explanations) method identified a few new important features in each cluster and showed that some features have different degrees of effect on severe injuries in the individual clusters. The current study concluded that the k-prototypes clustering-based GBDT model is a promising approach for revealing hidden insights, which can be used to improve safety measures, roadway conditions, and policies for the prevention of severe injuries in crashes involving large trucks.


Introduction
Large trucks play an indispensable role in freight logistics and the economic development of a country. According to a commodity flow survey of 2017, trucks alone moved 8.8 billion tons of goods. However, large trucks are also responsible for increasing the chances of severe injuries and fatalities in traffic crashes. Severe injuries and fatalities in crashes cause substantial social and economic loss for human society. From 2016 to 2018, the total fatalities in large truck-involved traffic crashes increased by almost 6 percent in the United States (US) [1]. Besides, the unique characteristics of large trucks such as long vehicle length, heavy vehicle weight and poor deceleration system may play key roles in increasing the probability of severe injuries [2,3]. The statistics and claims of past studies clearly signify the importance of researching the injury severity in crashes involving large trucks. Previous studies have explored different types of contributory factors (e.g., crash, location, roadway, weather, light and truck, etc.) that affect the injury outcomes of crashes involving large trucks [4][5][6][7][8][9][10][11][12]. However, the heterogeneous nature of traffic crash data significantly obscures the effects of contributory factors on injury outcomes. Several studies have recognized the unobserved heterogeneity as a major barrier for the analysis of traffic crash data [10,13,14].
Unobserved heterogeneity refers to the correlation between the unobserved factors and the observed variables. Leaving the issue of heterogeneity unaddressed might leave crucial insights hidden [13,15]. Besides, a subset of factors may have inconsistent effects on the severity of injury under different traffic crash scenarios [16,17]. For example, the severity of injury in crashes involving large trucks and under dark unlighted conditions may differ because of changes in the posted speed limit. To account for unobserved heterogeneity, it is possible to separate a certain type of crash based on domain knowledge and build an injury severity model for that type of crash only. Some past studies have analyzed the injury severity in certain types of truck-involved crashes based on the crash type [18,19], location [2,12,20], environmental factors [21], temporal characteristics [22], and truck characteristics [11]. However, such an approach will generate a large number of subgroups, and consequently, it would be infeasible to build a model for each subgroup. Additionally, some subgroups may include an insufficient number of observations to build a viable injury severity model. Segmenting the crash data according to domain knowledge or studying specific crash types may account for heterogeneity to a certain extent, but it does not guarantee homogeneity in the subgroups [13].
In the field of road safety, many studies have successfully employed cluster analysis, an unsupervised data mining method, to account for unobserved heterogeneity in different types of crash data (e.g., all road users, pedestrians, bicycles, motorcycles, and trucks) [23][24][25][26][27][28][29][30][31]. The objective of cluster analysis is to identify the latent homogeneous clusters in the data. Previous studies have reported that certain relationships between the contributory factors and injury severity cannot be uncovered without segmenting the aggregated crash data into homogeneous clusters. Additionally, some contributory factors have different magnitudes of effect under different crash scenarios [13,23,24,30]. It is a well-established fact that different crash scenarios require different types of preventive measures. Therefore, it is crucial for road safety authorities and traffic engineers to obtain comprehensive and accurate insights about the effects of contributory factors on injury outcomes. Such insights can be used to improve traffic laws, policies, road infrastructure, and road users' awareness for the reduction of severe injuries in crashes involving large trucks. Fewer severe injuries and fatalities will substantially reduce the social and economic costs caused by traffic crashes. Consequently, it will lead to a more sustainable transportation system.
A review of the past studies has revealed that the literature on the analysis of the injury severity in crashes involving large trucks has been limited in exploring an appropriate clustering approach that can account for the heterogeneous nature and the variability of data type in crash data. The current study has explored the application of k-prototypes clustering-based machine learning models for the analysis of severe injuries in crashes involving large trucks. To the best of the author's knowledge, none of the previous studies have applied both of these approaches together. A feature analysis with the help of the SHAP method was conducted to demonstrate the effectiveness of the application of k-prototypes cluster analysis. The rest of the study is organized into a literature review, methodology and materials, results, discussion and conclusion, and limitations and future research opportunities.

Related Work
In recent times, a wide variety of research efforts have been conducted for the analysis of the severity of injury in different types of crashes involving large trucks [5,8,10,18,28,29]. However, this study aims to explore the application of the k-prototypes clustering-based ensemble learning method for the analysis of severe injuries in crashes involving large trucks. Therefore, the current study has discussed the commonly applied methods that have been used to mitigate unobserved heterogeneity and investigate the factors influencing the injury outcomes of crashes involving large trucks.

Unobserved Heterogeneity in Crash Data
The random parameters logit model (mixed logit) and latent class clustering (LCC) have been the two commonly used approaches to account for the heterogeneous nature of different types of traffic crash data. A large number of studies have used the random parameter logit model for the analysis of large truck-involved crash injury severity [3,5,10,32]. Behnood and Mannering [10] used the random parameters logit model to explore the effects of time of day variations and temporal instability on the injury severity in large truck crashes. Rahimi et al. [32] investigated the injury severity of single vehicle truck crashes using the random parameter logit model. The random parameter logit model allows the parameters to vary across observations, as opposed to a single parameter representing all observations. However, cluster analysis is a better approach for the mitigation of unobserved heterogeneity because it identifies the latent homogeneous clusters within the crash data. Additionally, the characterization of those homogeneous clusters gives a clearer interpretation of the crash scenarios. Developing the injury severity model on the homogeneous clusters gives more accurate estimations about the effects of the contributory factors on injury outcomes. Several studies have demonstrated the benefits of cluster analysis for the mitigation of unobserved heterogeneity in traffic crash data [23][24][25]33].
Ona et al. [23] used LCC for the analysis of the injury severity in crashes on rural highways. The study showed that certain contributory factors are important only in the individual clusters. Sasidharan et al. [26] combined LCC with the binary logit model for the analysis of pedestrian crash injuries, and reported that certain relationships between the injury outcome and explanatory factors would have remained hidden without cluster analysis. LCC is a statistical model-based clustering approach, and the final class solution depends on the user [34]. If the model is user-specified, then its results may not be reproducible.
Another type of clustering is the similarity-based approach, which maximizes the similarity between observations within a cluster and the dissimilarity between clusters based on some distance measure. The k-means, k-modes, and hierarchical clustering methods are similarity-based approaches. One of the advantages of the k-means clustering method is its efficiency and simplicity in processing large datasets [35]. Sohn and Lee [36], and Iranitalab and Khattak [37] developed k-means clustering-based models for the prediction of injury outcomes. Nandurge and Dharwadkar [38] used the k-means clustering method as the primary tool for segmenting traffic crash data into homogeneous subgroups. However, k-means is appropriate only for datasets that contain exclusively numerical features, and road accident datasets rarely do. The k-modes clustering method is more appropriate for datasets that include only categorical features and uses the mismatches between observations as a dissimilarity measure. Kumar et al. [39] used the k-modes clustering method to improve classification performance. The hierarchical clustering method segments the data by building a tree of clusters [34]. Taamneh et al. [25] used hierarchical clustering before performing the classification of traffic crashes. This method is not user-defined, but it is inherently incapable of handling categorical features, and the memory space and time it requires are infeasible for large datasets.
There are some advantages of these clustering methods, and in some cases, they can produce satisfactory results. However, traffic crash datasets are often large and include both numerical and categorical features. For example, the total number of vehicles is a numerical feature, and the manner of collision is a categorical feature with five or six unique values (e.g., front-to-front, rear-end and sideswipe). Most of the clustering methods convert the unique values of the categorical features into dummy features (where the presence of a unique value in the observations is denoted by 1 and the absence as 0). This technique loses the information inherited by the categorical feature. Additionally, converting categorical features into dummy variables increases the total number of features, which will require more memory and time for processing the data.
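The dimensionality cost of dummy coding can be seen in a small sketch (the collision values are illustrative, not actual crash-data codes):

```python
# One categorical feature with several unique values becomes one binary
# column per unique value after dummy (one-hot) encoding.
manner_of_collision = ["front-to-front", "rear-end", "sideswipe",
                       "angle", "rear-end"]
categories = sorted(set(manner_of_collision))  # 4 unique values

# Each observation's single categorical value expands to 4 binary columns.
dummies = [[1 if value == c else 0 for c in categories]
           for value in manner_of_collision]
```

A dataset with many such features quickly multiplies its column count, which is the memory and processing burden noted above, and the 0/1 columns no longer carry the notion that the values belong to one attribute.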

Severity Modeling
Several research efforts on the severity of injury in truck-involved traffic crashes have applied logit-based or ordered probability statistical models [2,21,40,41]. The non-linear relationships between injury outcomes and contributory factors make such models inappropriate. Moreover, statistical models have model-specific assumptions and predefined underlying relationships between the dependent and input features. The violation of those assumptions may lead to erroneous results [24,42]. Many studies have shown that machine learning models are better at predicting injury outcomes [37,43-45]. A few studies on truck-involved crashes have also shown promising results using machine learning methods. Chang and Chien [42], and Eustace et al. [46] applied classification and regression trees (CART) to identify the important factors that influence the severity of truck-involved crashes. Zheng et al. [47] used the GBDT machine learning model for the analysis of commercial truck crash injury severity.
A limited number of studies have used the clustering approach for analysis of the injury severity in crashes involving large trucks. Rahimi et al. [28] used the block clustering method to identify patterns, and the conditions contributing to the patterns in large truck-involved traffic crashes. The study was limited to cluster analysis only. Another study combined LCC with the partial proportional odds model for the exploration of heterogeneities in truck-involved severities only at cross and T-intersections [29].
The current study has identified a few limitations in the existing literature. Firstly, a very limited number of studies have used the clustering approach for the mitigation of unobserved heterogeneity in crashes involving large trucks. Secondly, the variability of data type has rarely been addressed. Lastly, none of the previous studies have employed a clustering-based machine learning model for the analysis of severe injuries in crashes involving large trucks. In light of these gaps in the literature, the current study has proposed the k-prototypes clustering-based GBDT machine learning model for the analysis of severe injuries in crashes involving large trucks. The k-prototypes clustering method was proposed by Huang [35]. The method is capable of clustering datasets with both numerical and categorical type features. Further, the study has developed cluster-based GBDT machine learning models for predicting the injury outcomes of crashes involving large trucks. The GBDT model has shown significant success in the prediction of injury outcomes in traffic crashes [43,47]. Another distinguishable feature of the study is the application of the SHAP method to estimate the varying effects of the contributory factors on severe injuries across the EDS and each cluster. The SHAP method was proposed by Lundberg and Lee [48] for the interpretation of machine learning models.

Methodologies and Materials
This section of the study includes a description of the k-prototypes clustering method and the GBDT ensemble learning method. It also includes descriptions of the data used for this study, the evaluation metrics of predicting performance, and the details of the SHAP method.

The K-Prototypes Clustering Method
Here, we show the distance measure used by the k-prototypes clustering method. The k-prototypes algorithm uses the squared Euclidean distance for the numeric features and a simple matching coefficient for the categorical features. We have summarized the cost function in Equation (1) according to [35]. The cost function represents the total distance between the observations and their assigned prototypes:

E = \sum_{l=1}^{k} \sum_{i=1}^{n} y_{il} \, d(X_i, Q_l) (1)

where the objective of the algorithm is to minimize E (the cost function) and divide the dataset X into k clusters; Q_l is the center of prototype l; y_{il} is an indicator variable that equals 1 when data object i is assigned to prototype l and 0 otherwise; and d(X_i, Q_l) is the dissimilarity measure

d(X_i, Q_l) = \sum_{j=1}^{m_r} \left( x_{ij}^{r} - q_{lj}^{r} \right)^2 + \gamma_l \sum_{j=1}^{m_c} \delta\left( x_{ij}^{c}, q_{lj}^{c} \right) (2)

where the first term is the squared numerical distance of the m_r numeric attributes of data object i from the corresponding attributes of the center of prototype l; the second term is the simple matching dissimilarity of the m_c categorical attributes, with \delta(p, q) = 0 for p = q and \delta(p, q) = 1 otherwise; and \gamma_l is the weight for the categorical features in cluster l. In Equation (2), superscript r represents the numeric features and c represents the categorical features. Then, the complete cost function for prototype l is computed by the following equation:

E_l = \sum_{i=1}^{n} y_{il} \sum_{j=1}^{m_r} \left( x_{ij}^{r} - q_{lj}^{r} \right)^2 + \gamma_l \sum_{i=1}^{n} y_{il} \sum_{j=1}^{m_c} \delta\left( x_{ij}^{c}, q_{lj}^{c} \right) = E_l^{r} + E_l^{c} (3)

where the categorical term E_l^{c} can be rewritten as in Equation (4):

E_l^{c} = \gamma_l \, n_l \sum_{j=1}^{m_c} \left( 1 - p\left( q_{lj}^{c} \in C_j \mid l \right) \right) (4)

In Equation (4), n_l is the number of objects in cluster l, C_j is the set of all discrete values of the categorical attribute j, and p(q_{lj}^{c} \in C_j \mid l) is the probability of the discrete value q_{lj}^{c} from C_j occurring in prototype l; E_l^{c} is minimized by selecting the mode of each categorical attribute within cluster l as its prototype value.
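As a minimal sketch, the mixed dissimilarity of Equation (2) can be computed as follows. The feature values are invented for illustration; in practice, Huang [35] suggests relating the weight γ to the standard deviation of the numeric features so that neither feature type dominates:

```python
# Mixed dissimilarity of Equation (2): squared Euclidean distance on the
# numeric features plus a gamma-weighted simple matching count on the
# categorical features.
def mixed_dissimilarity(x_num, x_cat, q_num, q_cat, gamma):
    num_part = sum((x - q) ** 2 for x, q in zip(x_num, q_num))
    cat_part = sum(1 for x, q in zip(x_cat, q_cat) if x != q)  # delta terms
    return num_part + gamma * cat_part

# Example: 2 numeric features (vehicle count, speed limit) and
# 2 categorical features (collision type, light condition).
d = mixed_dissimilarity([2.0, 55.0], ["angle", "dark"],
                        [1.0, 60.0], ["angle", "daylight"], gamma=0.5)
# numeric part: (2-1)^2 + (55-60)^2 = 26; one categorical mismatch -> 26.5
```

Assigning each observation to the prototype with the smallest such dissimilarity, and then re-estimating numeric means and categorical modes, is the core iteration of the k-prototypes algorithm.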

Gradient Boosted Decision Trees
The GBDT is a type of ensemble learning method that combines many weak decision trees (base learners) to produce a more robust and accurate model. It uses the gradient boosting technique to extend and improve the weak decision trees [49]. In the field of machine learning, boosting means training multiple weak learners in a sequential manner, where each learner is adjusted based on the errors of its predecessors. We selected GBDT because traffic crash data usually come in tabular form, and GBDT is often the state-of-the-art model for tabular data [50]. The GBDT algorithm follows the steps below to reach the final model.

1.
Initialize the model with a constant value F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma); here, y_i are the observed values, L is the loss function, and \gamma is the prediction of the model that minimizes the loss function. In a classification task, \gamma is the value of the log(odds).

2.
For m = 1, 2, 3, . . . , M: (a) Calculate the pseudo-residuals r_{im} = -\left[ \partial L(y_i, F(x_i)) / \partial F(x_i) \right]_{F = F_{m-1}} for i = 1, . . . , n. (b) Fit a decision tree h_m to the pseudo-residuals and create the terminal regions, where each leaf of the tree is represented by R_{jm} for j = 1, 2, 3, . . . , J_m. (c) For each terminal region, compute the output value \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma). (d) Update the model: F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} \, \mathbb{1}(x \in R_{jm}), where \nu is the learning rate.
There are two major hyperparameters of GBDT: one is the number of sequential weak decision trees, and another is the learning rate, which determines the magnitude of the contribution of each tree in the right direction of prediction. In the training process, the algorithm keeps adding weak decision trees in a forward-stage-wise manner until it reaches the best fit. The optimal combination of the number of weak decision trees and learning rate can avoid over-fitting and increase the performance of the model.
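A toy version of this fitting loop can be sketched as follows. It uses regression with squared loss and one-split "decision stumps" as the weak learners, so the pseudo-residuals are simply the residuals; a real GBDT grows deeper trees and, for classification, works on the log(odds) as described above:

```python
def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals with two means."""
    best = None
    for threshold in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gbdt_fit(x, y, n_trees=50, learning_rate=0.1):
    f0 = sum(y) / len(y)              # step 1: constant initial model
    stumps, pred = [], [f0] * len(y)
    for _ in range(n_trees):          # step 2: sequential boosting rounds
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # pseudo-residuals
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # shrink each stump's contribution by the learning rate
        pred = [pi + learning_rate * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: f0 + learning_rate * sum(s(xi) for s in stumps)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gbdt_fit(x, y)
```

The two hyperparameters discussed above appear directly as `n_trees` and `learning_rate`: more trees with a smaller learning rate typically generalizes better, at the cost of longer training.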

Data Description
For this study, the data were collected from the Crash Report Sampling System (CRSS) of the National Highway Traffic Safety Administration (NHTSA). CRSS is a sample of police-reported crashes involving all types of motor vehicles, pedestrians, and cyclists. A crash record in CRSS includes information about the crash characteristics, location of the crash, roadway characteristics, involved vehicles, drivers, occupants, pedestrians and cyclists, and environmental factors. Our data include only large truck-involved traffic crashes that occurred in the United States from 2016 to 2019. Any medium or heavy truck, excluding buses and motorhomes, with a gross vehicle weight rating (GVWR) over 10,000 (10 k) lb is considered a large truck. The descriptions and codes for large trucks are given in Table 1. At first, we filtered the traffic crashes involving large trucks based on the vehicle body type and GVWR from the vehicle data file. Then, we merged the traffic crashes from the vehicle data file with the crash data file by the case number data element. Redundant features, duplicated crash records, and crash records involving values such as "reported as unknown" or "unknown" were removed from the data. The final data included 9534 traffic crashes involving large trucks, where the injured persons were drivers, occupants, pedestrians, and pedal cyclists.
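The filter-then-merge step can be sketched as follows. This is a hypothetical illustration: the record layout, the body-type code set, and the GVWR values are simplified stand-ins, not the actual CRSS coding (field names such as CASENUM, BODY_TYP, and GVWR follow CRSS naming conventions):

```python
# Simplified stand-ins for the CRSS vehicle and crash files.
vehicles = [
    {"CASENUM": 1, "BODY_TYP": 61, "GVWR": 2},   # single-unit truck
    {"CASENUM": 2, "BODY_TYP": 4,  "GVWR": 1},   # passenger car
    {"CASENUM": 3, "BODY_TYP": 66, "GVWR": 3},   # truck-tractor
]
crashes = {1: {"MAN_COLL": "angle"}, 2: {"MAN_COLL": "rear-end"},
           3: {"MAN_COLL": "sideswipe"}}

# Illustrative code set for large-truck body types (not the real CRSS list).
LARGE_TRUCK_BODY_TYPES = {60, 61, 62, 63, 64, 65, 66, 67, 71, 72, 78, 79}

# Keep only large-truck vehicle records (body type + GVWR over 10k lb),
# then merge each with its crash record by case number.
large_truck_crashes = [
    {**v, **crashes[v["CASENUM"]]}
    for v in vehicles
    if v["BODY_TYP"] in LARGE_TRUCK_BODY_TYPES and v["GVWR"] >= 2
]
```

In practice this merge would be done with a dataframe join on the case number, but the logic is the same: vehicle-level filtering first, then attachment of crash-level attributes.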

For traffic crashes with multiple injuries, the severity of the injury is determined by the most injured person. In the CRSS crash data, the severity of injuries follows the KABCO scale. This study has placed no injury (O), possible injury (C), and suspected minor injury (B) into one category as "non-severe injuries"; and suspected serious injury (A) and fatal injury or killed (K) into another category as "severe injuries". Table 2 describes the definitions and the distributions of severity of injury. The current study has selected 22 input features, and Table 3 describes the names, unique values, and distributions of the input features. Readers can consult the CRSS analytical user's manual of 2016-2019 for more details [51]. It should be noted that in this study, the terms contributory factors and input features are used interchangeably.
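The binary recoding of the KABCO scale described above amounts to a fixed mapping:

```python
# KABCO-to-binary severity mapping used in this study.
KABCO_TO_BINARY = {
    "O": "non-severe",  # no injury
    "C": "non-severe",  # possible injury
    "B": "non-severe",  # suspected minor injury
    "A": "severe",      # suspected serious injury
    "K": "severe",      # fatal injury (killed)
}

# For a multi-injury crash, the label comes from the most injured person.
labels = [KABCO_TO_BINARY[k] for k in ["B", "K", "O", "A"]]
```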

SHAP Method for Feature Analysis
Lundberg and Lee [48] proposed the SHAP (Shapley additive explanations) method for the interpretation of complex machine learning models. The concept of SHAP is based on the game theory of Štrumbelj and Kononenko [52] and the local explanations of Ribeiro et al. [53]. Initially, the SHAP method builds a model with all the features, and then builds another model without the feature of interest to see how the model performs without it. The SHAP value of a feature is the marginal contribution of the feature to the prediction of the desired outcome. The SHAP value of a feature is calculated using the following equation:

\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}\left( x_{S \cup \{i\}} \right) - f_S(x_S) \right]

In the above equation, \phi_i is the contribution of feature i as the SHAP value, S is a subset of all features, F represents the set of all features, and x_S represents the values of the input features in the set S. To determine the effects of the feature of interest, a model f_{S \cup \{i\}} is trained with the feature of interest, and another model f_S is trained without it. Then, the predictions of the two models are compared on the current input, f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S). The effect of the feature of interest also depends on the other features in the model. Therefore, the preceding differences are computed for all possible subsets [48].
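For a small number of features, this equation can be evaluated exactly by brute force. The sketch below uses a toy linear model (not the study's GBDT) and simulates "removing" a feature by substituting a baseline value, which is how practical SHAP implementations approximate retraining; for a linear model, the Shapley values reduce to each weight times the feature's deviation from the baseline:

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values; missing features take their baseline value."""
    n = len(x)
    def f(subset):
        vals = [x[j] if j in subset else baseline[j] for j in range(n)]
        return predict(vals)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):                      # |S| = 0 .. n-1
            for S in combinations(others, size):
                # |S|! (|F| - |S| - 1)! / |F|!
                weight = (factorial(len(S)) * factorial(n - len(S) - 1)
                          / factorial(n))
                total += weight * (f(set(S) | {i}) - f(set(S)))
        phi.append(total)
    return phi

# Toy linear model with weights (2, 3, -1); baseline of all zeros.
predict = lambda v: 2 * v[0] + 3 * v[1] - v[2]
phi = shapley_values(predict, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
# phi is approximately [2.0, 3.0, -1.0], and sums to f(x) - f(baseline)
```

This exact computation is exponential in the number of features; efficient algorithms such as TreeSHAP make the method tractable for tree ensembles like GBDT.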

Predicting Performance Evaluation Metrics
The predicting performances of the models were evaluated based on accuracy, precision, sensitivity (recall), specificity, and the precision-recall area under the curve (PR-AUC) score. These indicators are derived from the components of the confusion matrix, which are presented in Table 4. According to Saito and Rehmsmeier [54], the PR curve can be more informative than the traditionally used ROC (receiver operating characteristic) curve for evaluating a model trained on an imbalanced dataset where the desired outcome is dichotomous. The PR curve is a graphical representation of the precision (on the y-axis) against the recall (on the x-axis) at different probability thresholds. The PR-AUC score summarizes the curve over the range of threshold values as a single score, obtained by trapezoidal interpolation of the precision. The PR-AUC score ranges from 0 to 1, where a score of 1 indicates a model with perfect skill.
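As a quick illustration with invented confusion-matrix counts, treating "severe injury" as the positive class:

```python
# Confusion-matrix components (illustrative counts, not study results):
# tp = severe correctly predicted, fp = non-severe predicted severe,
# fn = severe missed, tn = non-severe correctly predicted.
tp, fp, fn, tn = 40, 10, 20, 130

accuracy    = (tp + tn) / (tp + fp + fn + tn)
precision   = tp / (tp + fp)
sensitivity = tp / (tp + fn)    # recall
specificity = tn / (tn + fp)
```

Note that with such imbalance, accuracy (0.85 here) looks strong even though a third of the severe injuries are missed (sensitivity about 0.67), which is why precision, recall, and the PR-AUC score are emphasized in this study.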

Results
The current study has demonstrated that k-prototypes clustering is a useful method for the identification of homogeneous clusters within the EDS (entire dataset) of crashes involving large trucks. Four GBDT models were developed using the 22 input features presented in Table 3: one on the EDS before cluster analysis, and one on each of the three identified clusters using the same input features as the EDS-based model. The dependent feature of each model includes two injury severity outcomes (severe and non-severe). To demonstrate the benefits of developing cluster-based models, the predicting performances of the EDS-based model were compared with those of each cluster-based model. Subsequently, we used the SHAP method to estimate and compare the varying effects of the contributory factors on the severity of injury in the EDS and the individual cluster-based models.

Cluster Analysis
For k-prototypes clustering, the optimal number of clusters (k-value) needs to be determined before applying the algorithm. In this study, the optimal number of clusters was determined by visualizing the within-cluster sum of squares (WSS) in a plot, a technique known as the elbow method. The idea of the elbow method is to run the clustering algorithm on the dataset for a range of k values and plot the WSS for each value of k in a line chart. If the line in the plot looks like an arm, then the k value corresponding to the "elbow" of the arm is the optimal number of clusters. Generally, an elbow occurs when adding more clusters does not reduce the WSS significantly. Figure 1 indicates a sharp elbow at three clusters, meaning that segmenting the dataset into more than three clusters does not reduce the WSS significantly. Therefore, we selected three as the optimal number of clusters.
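The elbow heuristic can be sketched as follows. The WSS values are invented for illustration, and the 0.2 cutoff is an arbitrary choice for this sketch; in practice the elbow is usually judged visually, as in Figure 1:

```python
# Pick the k after which the marginal WSS reduction drops off sharply.
def find_elbow(wss_by_k):
    ks = sorted(wss_by_k)
    # WSS reduction achieved by moving from k-1 to k clusters
    drops = {k: wss_by_k[ks[i - 1]] - wss_by_k[k]
             for i, k in enumerate(ks) if i > 0}
    for k in ks[1:-1]:
        # elbow: the next cluster adds far less improvement than this one
        if drops[k + 1] < 0.2 * drops[k]:
            return k
    return ks[-1]

# Illustrative WSS values with a sharp elbow at k = 3.
wss = {1: 1000.0, 2: 600.0, 3: 350.0, 4: 330.0, 5: 315.0}
```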
The study identified three clusters within the EDS using k-prototypes clustering and explored the cluster prototypes to characterize the clusters. The distribution of cluster prototypes presented in Table 5 indicates that the identified individual clusters are more homogeneous than the EDS.
For example, almost 99 percent of the crashes in CL2 occurred on non-interstate highways. On the other hand, almost 76 percent of the crashes in the EDS occurred on interstate highways. Therefore, analysis of the EDS alone would have given fewer compelling estimations about the crashes that occurred on the noninterstate highways. Similarly, the distribution of the trafficway types in the EDS indicates less homogeneity. In CL1, almost 78 percent and 75 percent of the crashes occurred on two-ways divided with positive median barrier and in high-speed (over 60 mph) limit zones, respectively. Thus, analysis of CL1 can reveal clearer insights about the crashes in those types of locations. Besides, more than 93 percent of crashes in CL3 involved large trucks without trailing units. In contrast, only close to 48 percent of the crashes in EDS included such characteristics. The skewed distribution of these features in the individual clusters indicates better homogeneity than the EDS. Based on the distribution of the cluster prototypes, this study characterized the individual clusters in the following manner: CL1: "crashes on two-way divided with positive median barrier and in high posted speed limit zone"; CL2: "non-interstate highway crashes involving large trucks weighing over 26,000 (lb)"; CL3: "non-interstate highway crashes involving large trucks without trailing unit".

Model Performance Evaluation
This study intended to show that the k-prototypes cluster-based models are better than the EDS-based model at predicting the severity of injury in crashes involving large trucks. All of the GBDT models were trained and tested on 70 and 30 percent of the sample data, respectively. Then, the EDS-based model was compared with the k-prototypes clustering-based models in terms of accuracy, precision, sensitivity, specificity, and PR-AUC score. Table 6 shows the predicting performances of all the models. The accuracy of both the CL2 and CL3-based models increased slightly on both the training set and test set compared to the accuracy of the EDS-based model. On the other hand, the accuracy of the CL1-based model decreased on both the training set and test set compared to the EDS-based model. Judging the predicting performance of a model trained on an imbalanced dataset by accuracy alone would be misleading. However, the difference in accuracy between the training set and test set can show whether or not a model is overfitting. The differences between the accuracies indicated that all the models have satisfactory generalization ability.
The precision, sensitivity, and PR-AUC score are better indicators of the predicting performance of a model when it is critical to predict the minority class accurately and there is skewness between the classes of the dependent feature. In this study, severe injuries are the minority class, and their social and economic effects are more damaging than those of other crash severities. The prediction results indicated that the CL1 and CL2-based models obtained better precision, sensitivity, and PR-AUC scores than the EDS-based model on both the training set and test set. The performances of the CL3-based model were lower than those of the EDS-based model in terms of sensitivity and PR-AUC score on the test set. This may be due to the lower proportion of severe injuries in CL3. When compared to the EDS-based model, the specificity of the CL1 and CL3-based models increased on the test set, but the increases were not noteworthy. Based on these results, it is fair to conclude that segmenting the EDS using the k-prototypes clustering method increased the predicting performances of the cluster-based models.

Feature Analysis
Feature analysis using the SHAP method demonstrated that the application of k-prototypes cluster analysis is an effective step for capturing latent features and their varying effects on the severity of injury under different crash scenarios. Figures 2-5 show the summary plots of the 15 most important features that influence severe injuries in the EDS, CL1, CL2 and CL3-based models, respectively. The suffixes added to the label of each feature represent the unique values of that feature. The x-axis of each figure displays the effect of a feature on the model output, and the y-axis on the right side indicates whether the value of the feature is low (blue) or high (red). Each dot in the figure represents an instance from the test set, and a cluster of dots indicates overlapping instances at a particular SHAP value. A feature increases the likelihood of severe injuries when a certain value of that feature results in higher SHAP values. On the contrary, a feature reduces the probability of severe injuries when a certain value of that feature results in lower SHAP values.
The figures show that features such as the number of vehicles, high posted speed limits, a.m. (0-5 a.m.) hours, angle collisions, and pre-crash movements such as stopping in the roadway increase the chances of severe injuries in crashes involving large trucks across all the models. The direction of the effects of these features is intuitively reasonable. It is plausible that an increase in the total number of vehicles compounds the effects of the crash, which can lead to severe injuries. Additionally, large trucks are more likely to travel at higher speeds in high-speed limit zones, and higher speeds are more likely to lead to severe injuries. On the other hand, sideswipe (same direction) collisions reduce the probability of severe injuries across all the models.
The figures indicate that although the direction of effects for some features is similar across all the models, their order of importance and magnitude of effects are not. For example, the SHAP value for the number of vehicles is close to or greater than 2 in the EDS, CL1, and CL3-based models, but it is less than 2 in the CL2-based model. Similarly, angle collisions have a lower SHAP value in the CL1-based model compared to the other models.
In Table 7, we have summarized the important relationships common to all combinations of models (including the EDS) and the relationships that were important only in the individual cluster-based models. Table 7 shows that front-to-rear collisions and dark, not-lighted conditions increase the probability of severe injuries in the crash scenarios of the EDS and CL1. Seventy percent of the crashes in CL1 occurred in high-speed limit zones, which explains the increase in injury severity for CL1 crashes. Except for CL1 crashes, front-to-front collisions and large trucks colliding with pedestrians significantly increase the chances of severe injuries across the crashes of the EDS, CL2, and CL3. Close to 57 and 50 percent of the crashes in CL2 and CL3, respectively, occurred on two-way, not-divided traffic ways. It is intuitively reasonable to assume that, due to the lack of a barrier between vehicles coming from opposite directions, front-to-front collisions are more likely to occur on two-way, not-divided traffic ways. Additionally, the majority of the crashes in CL2 and CL3 occurred on non-interstate highways, where pedestrians are more likely to be present. Large trucks colliding with pedestrians will almost certainly lead to severe injuries because pedestrians are the most vulnerable road users, lacking any protection. Large trucks colliding with a vehicle in transport was a common important feature across the EDS, CL1, and CL2-based models. Daylight conditions reduce the likelihood of severe injuries in the crash scenarios of CL1 and CL3. From Table 7, we can also observe that some features were important only in certain individual clusters. Roadway alignment curving left, a van/enclosed box cargo body, and two-way divided traffic ways with an unprotected median increase the chances of severe injuries only in the crashes of CL1. In contrast, p.m. peak hours and a roadway grade with unknown slope reduce the likelihood of severe injuries.
The rollover/overturn of large trucks increases the probability of severe injuries only in the crashes of CL2. The remaining important features exclusive to the CL2-based model include cargo body type (i.e., others, unknown), clear weather, and no collision with a vehicle in transport. Features such as a dump cargo body, urban and rural roadways, traffic control signals, and weekdays were important predictors of injury outcomes only in the CL3-based model. The results indicated that unless the EDS is segmented, the varying effects of some features under different crash scenarios cannot be captured. Additionally, the importance of some features remained hidden during the analysis of the EDS alone.

Discussion and Conclusions
Severe injuries and fatalities in crashes involving large trucks have serious impacts on human lives, the local economy, and trucking companies. Moreover, the characteristics of a large truck make the crashes more complex. The complex interactions of contributory factors obscure their effects on injury outcomes and create heterogeneity in crash data. The literature on the analysis of injury severity in crashes involving large trucks is limited in exploring novel methods for the mitigation of unobserved heterogeneity in crash data. Additionally, the variability of data types in crash data has not been addressed often.
This study has proposed the k-prototypes cluster analysis for the mitigation of unobserved heterogeneity in crash data of large trucks. The data used for this study include traffic crashes involving large trucks that occurred in the US from 2016 to 2019. Three homogeneous clusters were identified in the EDS using the k-prototypes cluster analysis. The distribution of the cluster prototypes indicated that the identified clusters were more homogeneous than the EDS. Homogeneous clusters can provide clearer interpretations for different types of crash scenarios involving large trucks. For example, almost 99 percent of the crashes in CL2 occurred on non-interstate highways. In contrast, only around 24 percent of the crashes in the EDS were non-interstate highway crashes. Therefore, the analysis of CL2 can reveal more accurate estimations about the effects of contributory factors on the injury outcome of large truck crashes that occurred on non-interstate highways. Road safety authorities can gain crucial insights by comparing the crashes on the interstate and non-interstate highways.
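The segmentation described above relies on the k-prototypes dissimilarity measure, which combines squared Euclidean distance on numeric features with a weighted mismatch count on categorical features, so mixed-type crash records can be clustered without one-hot encoding. The sketch below shows this measure under illustrative assumptions; the feature names and the weight gamma are hypothetical, not values from the study.

```python
import numpy as np

# Minimal sketch of the k-prototypes dissimilarity used to segment mixed-type
# crash records: squared Euclidean distance on numeric features plus a
# gamma-weighted count of categorical mismatches. Feature names and gamma
# are illustrative assumptions, not values from the paper.

def kproto_distance(num_x, num_proto, cat_x, cat_proto, gamma=0.5):
    numeric_part = float(np.sum((np.asarray(num_x) - np.asarray(num_proto)) ** 2))
    categorical_part = sum(a != b for a, b in zip(cat_x, cat_proto))
    return numeric_part + gamma * categorical_part

# One crash record vs. a cluster prototype (hypothetical values).
record_num = [2.0, 55.0]            # e.g., number of vehicles, posted speed limit
proto_num  = [2.0, 65.0]
record_cat = ["angle", "dark"]      # e.g., manner of collision, light condition
proto_cat  = ["angle", "daylight"]

d = kproto_distance(record_num, proto_num, record_cat, proto_cat, gamma=0.5)
# numeric part = 100.0, one categorical mismatch -> d = 100.5
```

The full algorithm assigns each record to its nearest prototype under this distance and then updates prototypes (means for numeric features, modes for categorical ones) until assignments stabilize.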
Further, the study developed four GBDT models on the EDS and the three clusters to predict the injury outcomes of crashes involving large trucks. The EDS-based model was compared with each cluster-based model in terms of predictive performance to demonstrate the superiority of the cluster-based models. The comparison showed that the CL1 and CL2-based GBDT models were better at predicting the injury severity in crashes involving large trucks. The results indicated that developing an injury severity model on homogeneous clusters improves the model's overall performance. Similar conclusions were reported in a few previous studies that developed cluster-based models [23,25,33,55].
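The segment-then-model workflow above can be sketched abstractly: fit one model per cluster, then route each new record to its cluster's model. To keep the sketch self-contained, a trivial majority-class "model" stands in for the GBDT; in the actual study, a gradient boosted classifier was fit on each cluster's records.

```python
from collections import Counter

# Sketch of the segment-then-model workflow: one model per cluster, with
# predictions routed by cluster membership. A majority-class predictor is a
# hypothetical stand-in for the paper's GBDT models.

def fit_per_cluster(records):
    """records: list of (cluster_id, label). Returns {cluster_id: majority label}."""
    by_cluster = {}
    for cid, label in records:
        by_cluster.setdefault(cid, []).append(label)
    return {cid: Counter(labels).most_common(1)[0][0]
            for cid, labels in by_cluster.items()}

def predict(models, cluster_id):
    # Route the record to the model trained on its own cluster.
    return models[cluster_id]

train = [("CL1", 1), ("CL1", 1), ("CL1", 0),
         ("CL2", 0), ("CL2", 0), ("CL3", 1)]
models = fit_per_cluster(train)
# models == {"CL1": 1, "CL2": 0, "CL3": 1}
```

The point of the structure is that each per-cluster model sees a more homogeneous training distribution, which is what the paper credits for the improved CL1 and CL2 performance.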
Additionally, the SHAP method was used to estimate and compare the varying effects of the contributory factors on severe injuries in the EDS and the individual clusters. The results showed that some features such as the number of vehicles, posted speed limit, angle collisions, a.m. hours (0-5 a.m.), pre-crash movements (i.e., stopped in roadway), and manner of collision (i.e., sideswipe, same direction) had the same direction of effects in all the models. These estimations are also consistent with some previous studies. Zheng et al. [47] and Islam and Hernandez [6] reported similar effects for an increase in the number of vehicles in large truck-involved crashes. Uddin and Huynh [9] and Islam and Hernandez [56] also reported that crashes occurring in high-speed limit zones and between midnight and 6 a.m., respectively, increase the probability of severe injuries in crashes involving large trucks. A few studies also found that sideswipe collisions are more likely to be associated with non-severe injuries [10,21]. The magnitude of the effects of a few features differed in the cluster-based models compared to the EDS-based model. Further, some relationships between the contributory factors and injury outcomes were important only in certain individual clusters. For example, large truck rollover/overturn was an important predictor of severe injuries only for the crash scenarios of CL2.
Based on the results of the study, it is fair to conclude that the k-prototypes clustering-based GBDT model is an effective approach for the analysis of severe injuries in crashes involving large trucks. The insights uncovered by the k-prototypes clustering-based model have substantial value from a road safety perspective. Road safety authorities can install warning signs and adjust traffic signals on the segments of non-interstate highways where large truck rollovers/overturns are more likely to happen. Traffic engineers may improve the road geometry to reduce the impacts of large truck rollovers/overturns. Such preventive measures can reduce severe injuries and fatalities caused by crashes involving large trucks. A reduction in severe injuries and fatalities will lessen the social and economic loss, improve road safety, and contribute to sustainable transportation.

Limitations and Future Study
In general, the proportion of severe injuries in traffic crash data is fairly low compared to minor injury or property-damage-only crashes. As a result, models tend to become biased towards the minor injury or property-damage-only crashes. This bias can reduce the accuracy of the estimated effects of the factors that influence severe injuries. The issue of imbalance in injury outcomes can be addressed by over-sampling or under-sampling methods. Future researchers can apply over-sampling or under-sampling measures to the clusters to overcome the issue of class imbalance. Additionally, the study observed that there is a lack of data about truck drivers' attributes (e.g., level of fatigue prior to the crash, driving experience). The inclusion of such human factors in the data could produce a better understanding of the causes of crashes involving large trucks. The k-prototypes cluster analysis can also be applied to other types of crash data, since crash data usually contain mixed types of features. Further, a comparative study between k-prototypes clustering and other clustering methods can be conducted to identify the superior clustering method.
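One of the remedies suggested above, random under-sampling, can be sketched in a few lines: keep every severe-injury record and draw an equal-sized random subset of the majority class. The record layout and sizes below are illustrative assumptions, not the study's data.

```python
import random

# Minimal sketch of random under-sampling for class imbalance: keep all
# minority-class records (label 1 = severe injury) and sample an equal number
# of majority-class records (label 0). Data below are hypothetical.

def undersample(records, seed=42):
    """records: list of (features, label); returns a class-balanced subset."""
    minority = [r for r in records if r[1] == 1]
    majority = [r for r in records if r[1] == 0]
    rng = random.Random(seed)  # fixed seed for reproducibility
    kept_majority = rng.sample(majority, k=min(len(minority), len(majority)))
    return minority + kept_majority

data = [({"x": i}, 0) for i in range(8)] + [({"x": i}, 1) for i in range(2)]
balanced = undersample(data)
# 2 minority + 2 sampled majority records -> 4 records, classes balanced
```

Over-sampling works in the opposite direction (replicating or synthesizing minority records); either way, applying the resampling within each cluster, as the text suggests, preserves the homogeneity that the clustering was meant to achieve.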