1. Introduction
The development of autonomous vehicles (AVs) is expected to transform the transportation system [1,2,3], owing to their potential to eliminate the need for human drivers and reduce associated costs. However, as AVs are increasingly being tested and operated, they pose significant crash risks. Many of these crashes are ultimately attributable to performance limitations in the AV system itself, such as perception errors, flawed decision-making algorithms, or inadequate control responses under complex driving scenarios [4,5]. It is therefore necessary to analyze the factors influencing crash outcomes using crash data to develop effective countermeasures.
Concurrently, significant research and development efforts are dedicated to enhancing the intrinsic safety of AV systems to prevent crashes proactively. Beyond incremental improvements in core perception, planning, and control modules [6,7,8], a particularly promising direction is the development of runtime-enabled active collision-avoidance techniques [9]. These systems operate as a parallel safety layer, continuously monitoring the vehicle’s internal state and external environment in real time. Using advanced risk prediction models, they can identify potentially hazardous situations that may not be adequately handled by the primary autonomous driving stack [10,11]. When an imminent risk is detected, these systems can execute safeguarded maneuvers or emergency interventions, such as constrained emergency braking or steering within stable dynamics limits, to avoid or mitigate collisions [12,13,14]. Other complementary approaches include formal methods for runtime assurance, fault-tolerant system architectures, and comprehensive validation using simulation [15,16,17,18]. While these technological advancements are crucial for raising the overall safety floor and handling edge cases, analyzing real-world crash data remains indispensable. Such analysis provides empirical evidence on the performance boundaries and residual failure modes of existing systems, even those equipped with advanced safety layers. It reveals which scenarios, system states, or environmental interactions continue to challenge AVs, thereby offering critical, evidence-based guidance for prioritizing future research, refining safety architectures, and validating the effectiveness of new mitigation strategies. Therefore, complementing proactive safety technology development with systematic forensic analysis of crash outcomes constitutes a vital, dual-path strategy for accelerating the safe maturation of AVs.
The risk factors related to traffic crashes include human, vehicle, road, and environmental factors. As the driver’s role is assumed by the vehicle itself [4], analyzing vehicle-related factors, which serve as proxies for the underlying performance of the perception, decision-making, and control systems, becomes paramount for diagnosing system failures and enhancing safety. Consequently, it is necessary to study the joint effects of different vehicle factors and the interactive effects of vehicle and environmental factors [19] on crash outcomes. This approach helps to identify which combinations of vehicle states and environmental conditions are most likely to overwhelm the current AV technological stack, leading to collisions. In addition, different types of collisions are attributed to different factors. Crashes involving vulnerable road users, including pedestrians and cyclists, are more likely to have serious consequences [20]. Therefore, the research objectives also include the influence of risk factors on crashes involving vulnerable road users.
Crash data are most frequently analyzed with traditional statistical models and machine learning methods. The application of traditional statistical models to causal inference is constrained by specific assumptions, including the independence of irrelevant alternatives (IIA) assumption and the parallel lines assumption [21]. To overcome these limitations, machine learning techniques have been introduced [22], such as decision tree algorithms [23]. The Gradient Boosting Decision Tree (GBDT) is a prominent machine learning algorithm, distinguished by its efficiency, accuracy, and interpretability [24]. eXtreme Gradient Boosting (XGBoost) is a representative GBDT implementation widely used in crash analysis [25]. LightGBM has demonstrated better outcomes than XGBoost on numerous standard classification benchmarks. Meanwhile, it directly supports categorical features without requiring the one-hot encoding needed by XGBoost, reducing the dimensionality and potential sparsity issues that arise from expanding categories into multiple binary features [26]. SHapley Additive exPlanations (SHAP) is a technique that provides interpretability for machine learning algorithms [25,27,28]. Unlike tree-based algorithms that merely indicate feature importance without elucidating directional effects on model outputs, SHAP values quantify both the magnitude and polarity of each variable’s contribution to individual predictions. In addition, SHAP delivers both local explanations for single predictions and global interpretations that highlight feature importance. Deriving their theoretical foundation from coalitional game theory, SHAP values provide more consistent and equitable feature contribution attributions than traditional tree-based importance measures like split gain or cover, resulting in more dependable explanatory capabilities. Therefore, LightGBM is employed together with SHAP to elucidate the influence of each independent variable on the outcome.
This study aims to evaluate the impact of vehicle factors on the severity and collision type of crashes involving AVs. An advanced machine learning algorithm, LightGBM, is introduced into the analysis of AV-involved crashes, and SHAP is used for interpretation [29]. The joint effects of different vehicle factors and the interactive effects of vehicle and environmental factors are studied, which are important to the safety of AVs. The main contributions of this paper are summarized as follows:
- (1) It introduces an advanced machine learning framework (LightGBM combined with SHAP) for a systematic analysis of AV-involved crash outcomes, addressing limitations of traditional statistical models.
- (2) It explicitly investigates and quantifies the joint effects of various vehicle factors and their interactions with environmental factors on crash severity and collision type, thereby identifying high-risk scenarios that challenge current AV capabilities.
- (3) It provides specific insights into the factors influencing crashes involving vulnerable road users, which pose significant perceptual and predictive challenges to AV systems.
- (4) By interpreting model outcomes via SHAP, it offers actionable insights that can inform the refinement of core AV technologies, including perception algorithms, decision-making logic, and control strategies, thereby contributing to the development of more robust autonomous driving systems.
The subsequent content is divided into four main sections. Commencing with an extensive review of relevant scholarship, the article then characterizes the research data, elaborates on the methodological framework, and concludes with an analysis of outcomes and their implications.
It is crucial to emphasize that this study operates within an observational, associational framework. We employ SHAP to explain the predictive model’s behavior and to uncover which factors are most strongly associated with adverse outcomes in the available data. The findings highlight potential risk indicators and high-priority scenarios for AV safety. However, establishing causal relationships between specific vehicle factors and crash outcomes requires different methodological approaches, which are discussed as important avenues for future research.
2. Literature Review
The analysis of crashes involving autonomous vehicles extends beyond traditional factor identification to encompass the complex interplay between the vehicle’s internal systems and the external operational environment. This systems-oriented perspective is crucial, as AV crashes often stem not from single points of failure, but from mismatches or performance boundaries within the entire automated driving stack, comprising sensing, perception, decision-making, and control execution [4,5,30].
Previous research has identified a range of factors influencing crash severity [31] and types [32,33,34,35]. These factors usually include road and environment factors [36,37,38,39,40,41,42], like crash location, land use, weather, lighting, speed limit, and road types, as well as some factors related to vehicles, such as precrash movements and driving mode, among others. Notably, factors like lighting and weather directly challenge an AV’s perception system, a cornerstone of its operational safety. Inclement weather, for instance, can degrade sensor (e.g., LiDAR, camera) performance, leading to inaccurate environment models and subsequent decision errors [43].
Vehicle factors assume greater importance for the safety of AVs as the driver’s role is assumed by the vehicle itself, with its movements controlled by algorithms. However, limited existing research has focused on the impact of these vehicle factors on crashes [36,37,38,39,40,41,42], particularly those that serve as proxies for the performance of the underlying autonomy stack. Specifically, factors like the novelty of the autopilot function are important, as such functions can significantly impact operational safety and be upgraded quickly, but current research lacks consideration of this factor.
In the analysis of crashes involving AVs, logit models are among the most frequently employed statistical techniques [41,42]; they have good interpretability but severe shortcomings [28]. Other researchers have employed machine learning techniques to circumvent these constraints, with decision trees [31,34] and their variants being the most widely used methods. A more sophisticated model, LightGBM [26], has also been employed for crash analysis, but only for crash types and severity of conventional vehicles; it has not been applied to the analysis of crashes involving autonomous vehicles.
Table 1 is a summary of the methods, factors, and outcomes of studies on AV-involved crashes using decision tree-based methods. Although machine learning techniques demonstrate remarkable predictive capabilities, their inherent opacity and limited explanatory capacity frequently raise concerns among stakeholders.
SHapley Additive exPlanations (SHAP) has been developed to achieve interpretability of results [44]. Integrating machine learning techniques with the SHAP framework [32,33] enables comprehensive elucidation of predictive outputs derived from algorithmic processing of collision-related datasets [25,27,28]. This is pivotal for a system-level safety analysis, as it allows researchers to move beyond predicting “what” will happen to understanding “why” it might happen, thereby linking statistical patterns to potential engineering flaws. While previous studies have investigated factors contributing to injury severity, the application of more advanced and interpretable machine learning methods like LightGBM combined with SHAP to AV crash data remains nascent. Therefore, it is necessary to introduce such methods to uncover the mechanisms associated with AV crashes, ultimately informing the advancement of more dependable autonomous transportation solutions.
4. Materials and Methods
In this research, the LightGBM (v4.4.0) model was trained for analysis. Boosting algorithms have distinctive capabilities for addressing datasets with small sample sizes. GBDT is a tree-based ensemble learning framework that exhibits high prediction accuracy. In comparison to basic GBDT, XGBoost applies a second-order Taylor expansion to the objective function, thereby establishing a more efficient and accurate framework.
LightGBM is an efficient ensemble machine learning method that employs Gradient-based One-Side Sampling (GOSS) to split internal nodes based on variance gain and Exclusive Feature Bundling (EFB) for input feature dimension reduction [26]. This approach was designed to overcome computational bottlenecks and enhance the scalability of the XGBoost framework. It has been demonstrated that LightGBM exhibits superior performance compared to other gradient boosting algorithms in terms of training speed and predictive accuracy. Categorical features are also supported without requiring one-hot encoding, which can cause dimensionality and sparsity problems [26]. A distinctive strength of LightGBM as a tree-based algorithm lies in its tolerance to multicollinearity, making it well-suited for safety analytics where predictors frequently demonstrate interdependence. This capability allows LightGBM [46] to accommodate correlated input variables without compromising model integrity.
In this gradient boosting architecture, the final prediction constitutes the cumulative output generated by an ensemble of decision trees:

\hat{y}_i = \sum_{k=1}^{K} f_k(x_i) \quad (1)

In this formulation, K specifies the quantity of constituent trees and f_k denotes the k-th tree. The GBDT training procedure aims to construct an approximator that optimizes the composite objective, mirroring XGBoost’s dual-component structure that combines a loss minimization term with regularization constraints, mathematically represented in Equation (2):

Obj = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \quad (2)

After a second-order Taylor expansion of the loss, the objective for a fixed tree structure reduces to

Obj \approx -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T, \qquad w_j^* = -\frac{G_j}{H_j + \lambda}

In this formulation, T denotes the total count of terminal nodes within the tree structure. For any given leaf j, the parameters G_j and H_j represent the aggregated first-order gradient statistics and cumulative second-order gradient statistics, respectively, computed across all training instances assigned to that leaf. The term w_j^* corresponds to the optimal weight value assigned to leaf j during the model fitting process. \lambda is the coefficient of the regularization penalty term. The hyperparameter \gamma serves to regulate the structural sophistication of the decision tree.
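The per-leaf gradient statistics, optimal leaf weights, and regularized objective described above can be sketched in a few lines of Python. This is an illustrative toy computation, not LightGBM’s internals; the gradients, hessians, and leaf assignments are made-up values.

```python
# Sketch: optimal leaf weights and structure score for one fixed tree,
# following the second-order form of the objective. All data are toy values.

def leaf_stats(g, h, leaf_of):
    """Aggregate first-order (G_j) and second-order (H_j) gradient
    statistics per leaf j."""
    G, H = {}, {}
    for gi, hi, j in zip(g, h, leaf_of):
        G[j] = G.get(j, 0.0) + gi
        H[j] = H.get(j, 0.0) + hi
    return G, H

def optimal_weights_and_objective(g, h, leaf_of, lam=1.0, gamma=0.1):
    """w_j* = -G_j / (H_j + lambda); Obj = -1/2 * sum_j G_j^2/(H_j+lam) + gamma*T."""
    G, H = leaf_stats(g, h, leaf_of)
    w = {j: -G[j] / (H[j] + lam) for j in G}
    obj = -0.5 * sum(G[j] ** 2 / (H[j] + lam) for j in G) + gamma * len(G)
    return w, obj

# Toy example: 4 training instances routed to 2 leaves.
g = [0.5, -0.3, 0.2, -0.4]    # first-order gradients
h = [0.25, 0.21, 0.24, 0.24]  # second-order gradients
leaves = [0, 0, 1, 1]
w, obj = optimal_weights_and_objective(g, h, leaves)
# Leaf 0: G = 0.2, H = 0.46, so w_0* = -0.2 / 1.46
```

The negative structure score (the summation term in `obj`) is what guides split selection: a split is kept only if the resulting gain exceeds the complexity penalty γ.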
LightGBM employs the Gradient-based One-Side Sampling (GOSS) technique for node splitting, departing from conventional entropy-based criteria used in standard GBDT. This approach prioritizes the top a \times 100\% of instances with substantial gradient magnitudes for inclusion in subset A, while randomly sampling a fraction b of the remaining lower-gradient observations to form subset B. The partitioning mechanism subsequently computes the variance improvement metric \tilde{V}_j(d) over the combined dataset A \cup B:

\tilde{V}_j(d) = \frac{1}{n} \left[ \frac{\left( \sum_{x_i \in A_l} g_i + \frac{1-a}{b} \sum_{x_i \in B_l} g_i \right)^2}{n_l^j(d)} + \frac{\left( \sum_{x_i \in A_r} g_i + \frac{1-a}{b} \sum_{x_i \in B_r} g_i \right)^2}{n_r^j(d)} \right] \quad (3)

where g_i is the gradient of instance x_i; A_l and B_l (A_r and B_r) denote the members of A and B falling to the left (right) of candidate split point d on feature j; n_l^j(d) and n_r^j(d) are the corresponding instance counts; and the factor (1-a)/b reweights the sampled low-gradient instances so that the gain estimate remains unbiased.
Complementing its GOSS methodology, LightGBM further integrates EFB to enhance computational efficiency during training while preserving model fidelity. Sparse high-dimensional feature sets often exhibit disjoint activation characteristics, with different features rarely assuming non-zero values concurrently within any given instance. The EFB technique aggregates mutually exclusive features into a composite feature representation.
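The GOSS sampling step can be sketched as follows. This is a self-contained illustration of the idea, not the library’s internal implementation; the gradient values are toy numbers.

```python
import random

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """GOSS sketch: keep the top a*100% of instances by |gradient|
    (subset A), randomly sample a fraction b of the remainder
    (subset B), and reweight B by (1 - a) / b so that gradient sums
    over A u B remain unbiased estimates of the full-data sums."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_k = int(a * n)
    A = order[:top_k]                       # large-gradient instances
    rest = order[top_k:]
    rng = random.Random(seed)
    B = rng.sample(rest, int(b * n))        # sampled small-gradient instances
    weight_B = (1 - a) / b                  # amplification factor for B
    return A, B, weight_B

grads = [0.9, -0.05, 0.4, 0.02, -0.7, 0.01, 0.3, -0.02, 0.6, 0.03]
A, B, wB = goss_sample(grads, a=0.2, b=0.2)
# A holds the two largest-|gradient| instances (indices 0 and 4).
```

The reweighting factor is what lets the variance gain in Equation (3) be computed over the much smaller set A ∪ B without biasing split decisions toward the retained large-gradient instances.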
Recall is the proportion of all actual positive samples that the model correctly predicts as positive. Precision measures the proportion of true positives among the samples the model predicts as positive. The F1-score constitutes a balanced performance metric derived from the harmonic mean of precision and recall. For all binary classification tasks in this study, class predictions were obtained by applying a fixed decision threshold of 0.5 to the predicted probabilities. This consistent threshold ensures a fair comparison focused on inherent model discriminative ability, while acknowledging that threshold tuning could optimize specific metrics (e.g., recall) for particular applications.
The term ‘TP’ represents instances correctly identified as positive by the model when the actual classification is positive, whereas ‘FP’ indicates cases erroneously classified as positive when the ground truth is negative.
The Brier Score was additionally computed to assess probability calibration, measuring the mean squared difference between predicted probabilities and actual outcomes. Lower Brier Score values indicate better-calibrated probabilities. The formula is:

BS = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2

The term p_i is the predicted probability for instance i, and o_i is the actual outcome (0 or 1).
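The threshold-based metrics and the Brier score can be computed directly from predicted probabilities, as in this minimal stdlib sketch (illustrative labels and probabilities):

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Precision, recall, and F1 at a fixed decision threshold, plus the
    Brier score (mean squared error of the predicted probabilities)."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    brier = sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)
    return {"precision": precision, "recall": recall, "f1": f1, "brier": brier}

m = binary_metrics([1, 0, 1, 1, 0], [0.9, 0.6, 0.4, 0.8, 0.2])
# At threshold 0.5: TP = 2 (0.9, 0.8), FP = 1 (0.6), FN = 1 (0.4),
# so precision = recall = F1 = 2/3; Brier = 0.81 / 5 = 0.162.
```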
ROC-AUC quantifies the overall discriminative capacity of a binary classifier by measuring the entire two-dimensional area beneath its ROC curve. Superior predictive capability is indicated by higher AUC measurements. The ROC-AUC is computed as:

AUC = \int_0^1 \mathrm{TPR}\big(\mathrm{FPR}^{-1}(x)\big)\, dx

where TPR characterizes the proportion of correctly identified positive instances across different threshold values, FPR quantifies the fraction of negative cases mistakenly classified as positive under varying threshold settings, and FPR^{-1} signifies the inverse function of the FPR function.
In addition to the standard ROC-AUC, we computed the Precision-Recall Area Under the Curve (PR-AUC) to specifically assess model performance on imbalanced datasets. PR-AUC is particularly informative for classification tasks with skewed class distributions, as it focuses on the performance of the positive (minority) class by evaluating the trade-off between precision and recall across different probability thresholds.
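One standard way to compute PR-AUC is the average-precision formulation, which sums precision weighted by recall increments while sweeping the decision threshold down through the scores. The sketch below is a stdlib illustration with toy data (it assumes distinct scores; tied scores would need to be processed together):

```python
def average_precision(y_true, y_score):
    """PR-AUC via average precision: AP = sum_n (R_n - R_{n-1}) * P_n,
    taking each descending score as a threshold."""
    pairs = sorted(zip(y_score, y_true), reverse=True)  # highest score first
    n_pos = sum(y_true)
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for _, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision  # step in recall * precision
        prev_recall = recall
    return ap

ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
# Steps: (R=0.5, P=1) then (R=1.0, P=2/3), so AP = 0.5*1 + 0.5*(2/3) = 5/6.
```

Unlike ROC-AUC, this quantity is anchored to the positive-class prevalence, which is why the PR-AUC values reported later are numerically lower than the corresponding ROC-AUC values on the imbalanced crash data.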
Bayesian optimization was employed to determine the optimal hyperparameters for the LightGBM, XGBoost, and SVM models. This approach has gained prominence for its effectiveness in optimizing model parameters, attributed to its systematic exploration of the hyperparameter domain [47]. The fundamental principle of Bayesian optimization involves constructing a probabilistic model to approximate the objective function, which is then used to guide the selection of subsequent sampling points. Initially, a prior distribution is defined to encapsulate the initial assumptions about the objective function. An acquisition function provides a criterion for selecting the next evaluation point, after which the probabilistic model is updated, and the process is iteratively repeated until a predefined stopping condition is satisfied.
A list of potential values for each hyperparameter is put forward to define the search space. The search space of the number of estimators is between 50 and 300, and the maximum depth limit is set between 3 and 5. The learning rate, denoted by η, was set to the values {0.01, 0.05, 0.1, 0.2}. A ten-fold cross-validation was conducted for each combination of values to identify the optimal values. The iteration process was terminated when the metric on the test set did not decrease for 10 consecutive rounds.
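The search space and stopping rule just described can be sketched as follows. Note this is a simplified stand-in: the study uses Bayesian optimization with 10-fold cross-validation, whereas the sketch shuffles candidates and uses a mock objective (`mock_cv_loss` is a hypothetical placeholder) purely to illustrate the stated search space and the 10-rounds-without-improvement termination rule.

```python
import itertools
import random

# Search space as stated in the text (all values illustrative of the ranges).
search_space = {
    "n_estimators": list(range(50, 301, 50)),
    "max_depth": [3, 4, 5],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
}

def mock_cv_loss(params, rng):
    """Hypothetical placeholder: a real run would train the model with
    `params` on each of 10 folds and average the validation loss."""
    return rng.random()

def tune(space, patience=10, seed=0):
    rng = random.Random(seed)
    candidates = [dict(zip(space, vals))
                  for vals in itertools.product(*space.values())]
    rng.shuffle(candidates)
    best_loss, best_params, stale = float("inf"), None, 0
    for params in candidates:
        loss = mock_cv_loss(params, rng)
        if loss < best_loss:
            best_loss, best_params, stale = loss, params, 0
        else:
            stale += 1
            if stale >= patience:  # stop after 10 non-improving rounds
                break
    return best_params, best_loss

best_params, best_loss = tune(search_space)
```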
Once the optimal model had been obtained through hyperparameter tuning, the prediction results were subjected to interpretation, that is, ascertaining the contribution of each feature to the predictions made by the model. Traditional feature importance measures in tree models (split count, cover, and gain) are typically employed, but these methods might not accurately reflect the true contribution of each feature and can sometimes lead to misleading conclusions [48]. By contrast, the aggregate of all feature importance values provided by SHAP [44] represents the locally specific output of the model, thus guaranteeing “local accuracy”. SHAP operates by measuring the additive importance of each feature in shaping the final prediction, thereby offering dual-level interpretability that encompasses both overall model behavior and individual instance predictions.
The model’s output is constituted by the aggregation of individual feature contributions combined with a baseline intercept term, expressed mathematically as:

g(z') = \phi_0 + \sum_{j=1}^{M} \phi_j z'_j

In this framework, g serves as an interpretive function that provides intuitive insights into the complex model f. The binary indicator z'_j assumes a value of 1 when feature j is present and 0 when absent. The parameter \phi_j quantifies the attribution value assigned to feature j, while \phi_0 represents the baseline prediction from a null model without any features. Mathematical derivation establishes that the attribution term \phi_j must satisfy the following unique representation:

\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{j\}}(x_{S \cup \{j\}}) - f_S(x_S) \right]

In this formulation, \phi_j represents the Shapley additive explanation assigned to feature j, quantifying its specific attribution; F corresponds to the complete feature set undergoing interpretation, with |F| indicating its total element count; S refers to any feature subset derived from F that excludes feature j, where |S| specifies the number of features in this subset; and f_{S \cup \{j\}} and f_S indicate the model’s predictive outputs when feature j is incorporated and omitted, respectively.
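The Shapley attribution can be computed by brute-force subset enumeration for a handful of features, which makes the formula concrete (SHAP’s tree explainer computes the same quantity efficiently for tree ensembles). The model `f` below is a hypothetical additive toy, not the study’s LightGBM model:

```python
from itertools import combinations
from math import factorial

def shapley_values(model, n_features):
    """Exact Shapley attributions phi_j by enumerating every subset S of
    the other features (feasible only for small n_features).
    `model(present)` must return the prediction when only the feature
    indices in `present` are 'switched on'."""
    F = list(range(n_features))
    phi = [0.0] * n_features
    for j in F:
        others = [k for k in F if k != j]
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                # Coalitional weight |S|! (|F| - |S| - 1)! / |F|!
                w = (factorial(size) * factorial(n_features - size - 1)
                     / factorial(n_features))
                phi[j] += w * (model(set(S) | {j}) - model(set(S)))
    return phi

# Hypothetical additive model f(x) = 2*x0 + 3*x1 at x = (1, 1);
# an 'absent' feature falls back to a baseline value of 0.
def f(present):
    return (2.0 if 0 in present else 0.0) + (3.0 if 1 in present else 0.0)

phi = shapley_values(f, 2)
# For an additive model the attributions recover each term: phi = [2.0, 3.0].
```

The attributions also satisfy local accuracy: they sum to the difference between the full-model prediction and the empty-coalition baseline.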
5. Experiment
To quantify uncertainty in performance estimates, 95% confidence intervals were calculated for all metrics using stratified bootstrapping with 1000 iterations on the test set, preserving the original class distribution in each bootstrap sample (for detailed information, please refer to the Supplementary Material). The performance of each model is illustrated in Table 3. In addition to LightGBM and XGBoost, our experimental framework incorporates support vector machines (SVMs) as a benchmark algorithm for comparative analysis. Comparative analysis confirms LightGBM’s superiority across all evaluation dimensions. XGBoost shows competitive performance with an ROC-AUC of 0.773 [0.745–0.801] and recall of 0.562 [0.505–0.619], achieving an F1-score of 0.608 [0.560–0.656]. However, LightGBM maintains a clear lead with higher ROC-AUC (0.797 [0.770–0.824]), PR-AUC (0.589 [0.545–0.633]), recall (0.625 [0.570–0.680]), precision, and F1-score (0.860 [0.824–0.896]). Therefore, the LightGBM method is used for prediction.
The PR-AUC analysis is particularly informative given the class imbalance. LightGBM’s PR-AUC of 0.589 [0.545–0.633] substantially outperforms XGBoost (0.512 [0.468–0.556]) and SVM (0.385 [0.342–0.428]), demonstrating its superior ability to identify the minority class. The Brier Score further confirms LightGBM’s advantage in probability calibration (0.168 [0.158–0.178]), compared to XGBoost (0.187 [0.176–0.198]) and SVM (0.215 [0.202–0.228]). For LightGBM, the optimal hyperparameters were determined as: maximum depth = 4, minimum child samples = 5, and number of estimators = 200.
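The stratified bootstrap used for the reported confidence intervals can be sketched as below. The labels, probabilities, and the Brier metric are illustrative; the real procedure resamples the held-out test-set predictions for each metric in the same way:

```python
import random

def stratified_bootstrap_ci(y_true, y_prob, metric, n_boot=1000, seed=0):
    """95% percentile CI for `metric` via stratified bootstrap: positives
    and negatives are resampled separately so every replicate keeps the
    original class distribution."""
    rng = random.Random(seed)
    pos = [(t, p) for t, p in zip(y_true, y_prob) if t == 1]
    neg = [(t, p) for t, p in zip(y_true, y_prob) if t == 0]
    scores = []
    for _ in range(n_boot):
        sample = ([pos[rng.randrange(len(pos))] for _ in pos]
                  + [neg[rng.randrange(len(neg))] for _ in neg])
        ts, ps = zip(*sample)
        scores.append(metric(ts, ps))
    scores.sort()
    return scores[int(0.025 * n_boot)], scores[int(0.975 * n_boot) - 1]

def brier(y_true, y_prob):
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

y = [1, 1, 1, 0, 0, 0, 0, 0]           # imbalanced toy labels
p = [0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.6, 0.2]
lo, hi = stratified_bootstrap_ci(y, p, brier, n_boot=1000)
```

Stratification matters here because an unstratified resample of a small, imbalanced test set can occasionally contain very few (or zero) minority-class instances, which would make metrics like recall or PR-AUC undefined or wildly unstable.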
For collision type prediction (Table 4), LightGBM again provides the best overall performance. It achieves the highest ROC-AUC (0.837 [0.795–0.879]), PR-AUC (0.523 [0.471–0.575]) and F1-score (0.632 [0.548–0.716]), along with the lowest Brier Score (0.142 [0.129–0.155]). While XGBoost matches LightGBM’s recall, it shows substantially lower F1-score and PR-AUC. The lower absolute values of PR-AUC compared to ROC-AUC across all models reflect the challenge of predicting the sparse “vulnerable road user involved” class. LightGBM’s superior PR-AUC (0.523 [0.471–0.575]) compared to XGBoost (0.485 [0.431–0.539]) and SVM (0.418 [0.362–0.474]) demonstrates its particular robustness in this highly imbalanced classification scenario. For the collision type model, the optimal LightGBM hyperparameters were: maximum depth = 4, minimum child samples = 44, and number of estimators = 200.
As expected, this resampling technique improved the recall for the minority class compared to training on the raw imbalanced data (e.g., recall increased from 0.51 to 0.625 for the LightGBM severity prediction model), demonstrating enhanced sensitivity. However, final evaluation metrics and all model interpretations (SHAP) are reported on the original test set to provide a realistic assessment of model performance and avoid over-optimism.
SHAP analysis quantifies the influence of predictor variables on both crash severity levels and collision types. The magnitude of SHAP values directly reflects the relative influence of features on model predictions.
Figure 1 summarizes the results of the SHAP analysis of serious crashes. It can be observed that operator type, mileage, contact area, year since crash, and vehicle pre-crash speed contribute most significantly to the model’s predictions. This indicates that it is crucial to investigate the vehicle characteristics of AVs. Simultaneously, a comprehensive investigation of the interdependencies among these variables becomes crucial, particularly regarding the impact of contextual elements like road type and lighting.
The predicted severity level varies with the operator type, as depicted in Figure 2. Autonomous driving functions of low SAE levels, such as Lane Centering Control (LCC), are already in use in a considerable number of vehicles. Some crashes may occur when consumers utilize these functions, and others occur when developers operate test vehicles. For consumer-operated vehicles, higher mileage is associated with a higher predicted probability of serious crashes. According to the model, for test-operator vehicles, lower mileage is associated with a higher predicted probability of serious crashes. This may coincide with the deployment of less mature software versions during early testing phases, although the exact causality cannot be established.
Next, Figure 3 illustrates the impact of mileage. As illustrated in the dependence plot, most of the vehicles involved in crashes that occurred more than eight years ago had mileage exceeding 80,000. These higher-mileage vehicles show a positive association with serious crashes. For more recently developed vehicles, higher mileage is linked to a greater increase in the predicted probability of serious crashes.
Highways and streets are associated with increased predicted severity, whereas intersections are associated with decreased predicted severity (Figure 4). This result is consistent with previous research [42,49]. The stronger association for streets may be linked to the diversity of traffic participants there, as vehicles may have difficulty coping with complex and changing traffic conditions, or more vulnerable road users may be present. For highways, the higher contribution may be related to higher speeds, as more kinetic energy causes more damage. It is notable that newer vehicles show a weaker association with serious crashes on highways and at intersections in the model. However, in parking lots and on streets, newer models are associated with a higher proportion of serious crashes in the dataset. One potential explanation for this discrepancy is that there are still unresolved corner-case problems for AVs on streets.
The conjecture about speed can be verified by another dependence plot (Figure 5). Streets and highways, which are associated with higher predicted severity, also tend to have crashes occurring at higher speeds.
Furthermore, among the precrash movements, proceeding straight and changing lanes are associated with increased predicted crash severity (Figure 6). Higher speeds exhibit a stronger positive association with crash severity when the vehicle is proceeding straight. A more nuanced and counterintuitive pattern is observed for turning vehicles: higher precrash speeds show a weaker association with increased severity in the model. This finding appears to contradict conventional traffic safety wisdom. While intriguing, it must be interpreted with caution, as it may reflect specific, unobserved contextual factors in the dataset rather than a generalizable safety principle. Several non-mutually exclusive hypotheses could explain this pattern within the AV-specific context: (1) AVs may be programmed to engage, or may only engage, higher speeds during turns in well-structured, low-risk environments (e.g., highway ramps with clear visibility, controlled test areas) where conflict probability is inherently low; (2) the dataset may lack granular details about turn geometry (e.g., radius, bank) or concurrent environmental conditions that fully define the risk scenario; (3) this statistical association may be influenced by confounding factors not included in the model. This result highlights the complexity of interpreting AV behavior from aggregate crash data and underscores the need for more detailed investigations to unpack the underlying conditions.
Vulnerable road users, like cyclists and pedestrians, are more likely to be injured in the absence of vehicle protection. At the same time, AVs are less capable of sensing small moving objects, posing more of a threat to them [4]. Therefore, vulnerable users deserve attention in the safety research of AVs. The following analysis of factors associated with crashes involving vulnerable road users (VRUs) is based on a substantially smaller subset of data compared to the overall crash severity model. The trends and associations reported below, while derived from our interpretable ML framework, should be regarded as preliminary and exploratory due to the limited sample size. They highlight potential signals that merit investigation with larger, dedicated datasets rather than offering definitive conclusions.
Environmental factors appear to be more prominent in the prediction of injuries sustained by vulnerable road users (Figure 7). The type of road is the most significant influencing factor, followed by precrash speed, mileage, year since crash, and then lighting and operator type.
At intersections, newer vehicles are associated with a higher predicted probability of injuries to vulnerable road users, while on highways, older vehicles show a stronger association with such injuries (Figure 8). This may be because newer vehicles have sophisticated algorithms that perform better in less complex environments like freeways, while older vehicles tend to take more cautious action at intersections.
Under dark-but-lighted and dusk conditions, older vehicles are associated with a higher likelihood of injuries to vulnerable road users in our model, whereas newer vehicles show a stronger association under full darkness and daylight conditions (Figure 9). This indicates that older vehicles exhibit driving behaviors more akin to those of humans, as dusk is an environment in which human drivers are particularly prone to collisions [19]. It is therefore important to include hazardous road conditions in future studies of crash injury severity involving AVs.
To assess the sensitivity of our interpretations to the use of SMOTE-NC, we compared the SHAP-derived feature importance rankings from the LightGBM model (trained with SMOTE-NC) with those from an auxiliary model trained without oversampling. While absolute SHAP values differed, the relative order of the top five most influential features remained consistent for both crash severity and collision type predictions. This suggests that the core interpretative insights regarding key risk factors are stable and not an artifact of the sampling strategy.
To assess the temporal stability and generalizability of the identified predictive patterns, we performed a temporal hold-out validation. The dataset was split chronologically, with crashes occurring before 2021 (approximately 70%) used for training and validation, and crashes from 2021 onward (approximately 30%) held out as a temporal test set. This split approximates a scenario where a model trained on past data is used to predict future incidents. The LightGBM model, configured with the previously determined optimal hyperparameters, demonstrated robust temporal generalizability for both prediction tasks. For crash severity, the model achieved an AUC of 0.781 on the temporal test set, compared to 0.797 on the random split. For collision type (VRU-involved vs. others), it attained an AUC of 0.821, versus 0.837 in the main analysis. The slight and comparable decreases in performance indicate that the feature importance patterns and associations learned by the model are reasonably stable over time and are not merely capturing transient historical artifacts. Detailed performance metrics for both tasks in this temporal validation are provided in Supplementary Tables S3.1 and S3.2.
6. Conclusions and Future Work
6.1. Conclusions
This research analyzed the effects of vehicle factors on crash severity and collision types using the AVOID dataset. LightGBM was chosen for its superior performance on metrics including PR-AUC, recall, and F1-score, and SHAP was adopted to interpret the associations between crash outcomes and vehicle factors.
The SHAP analysis revealed that operator type, mileage, contact area, years since the crash, and vehicle pre-crash speed were the most important predictive factors for crash severity in our model, indicating that these vehicle characteristics warrant close investigation. Operator type, which indicates whether the vehicle is operated by consumers or tested by businesses, is the most important factor. For consumer-operated vehicles, higher mileage is associated with a higher probability of serious crashes; overall, higher mileage is a prominent factor linked to serious crashes. Regarding road types, highways and streets are associated with increased severity, while intersections are associated with decreased severity; crashes on highways and streets tend to occur at higher speeds. Pre-crash movement is another important predictive factor: proceeding straight and changing lanes are associated with higher predicted severity, and the contribution of higher speed to severity is greater for vehicles proceeding straight than for turning vehicles.
Vulnerable road users deserve attention in safety research. In our exploratory analysis of the limited VRU-involved crash data, road type emerged as the most influential feature in the model, followed by pre-crash speed, mileage, time since the crash, lighting, and operator type. On highways, newer vehicles show a weaker association with injuries to vulnerable road users. Lighting is another important environmental factor that may interact with vehicle characteristics: under dark-but-lighted and dusk conditions, newer vehicles are associated with a lower likelihood of injuries to vulnerable road users, whereas the association is stronger under full darkness and daylight conditions.
The current study has provided insight into the vehicle factors that influence the outcomes of AVs-involved crashes. The results can also help manufacturers develop safer vehicles [
50] and provide a reference for transportation management agencies in regulating autonomous vehicles. For example, manufacturers can improve sensors and vehicle-control algorithms for crash-prone scenarios, and agencies can restrict the permitted level of automation based on the operating environment.
6.2. Limitations and Future Research
This study identified key vehicle, roadway, and environmental factors associated with the severity and type of AV-involved crashes using an interpretable machine-learning framework. It is important to emphasize that the reported relationships are statistical associations derived from the model and the historical dataset. They highlight potential risk indicators and areas for further investigation but do not confirm direct causation. These associations could be influenced by unmeasured or residual confounding.
Unmeasured confounding variables (e.g., specific software versions, detailed sensor configurations, traffic density) may influence both the predictor variables and the outcomes. Although the variable “year since crash” serves as a proxy for technological evolution, it does not reflect the precise software state at the time of the incident. Future research could benefit from richer datasets containing direct measures of AV system maturity (e.g., software version logs, disengagement reports) and employ quasi-experimental or causal inference frameworks to move beyond association towards causation.
Future research should prioritize study designs more conducive to causal inference, such as natural experiments or meticulously matched observational studies. Building on such designs, established methods from confounding research—including sensitivity analysis (e.g., the E-value [
51,
52]) and quantitative bias analysis [
53]—can be systematically employed to evaluate the potential impact of unmeasured confounding. Through the application of these methods, a more rigorous and quantified measure of confidence can be provided regarding the causal plausibility of the risk factors identified by the predictive model in this study.
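The E-value sensitivity analysis referenced above has a closed form: for an observed risk ratio RR ≥ 1, E = RR + sqrt(RR × (RR − 1)), i.e., the minimum strength of association an unmeasured confounder would need with both exposure and outcome to explain the estimate away. A minimal sketch (the example risk ratio is illustrative, not a result from this study):

```python
import math

# Sketch of the E-value computation for a point-estimate risk ratio.
# E = RR + sqrt(RR * (RR - 1)) for RR >= 1; protective estimates
# (RR < 1) are inverted before applying the formula.

def e_value(rr):
    """E-value for a point-estimate risk ratio (rr > 0)."""
    if rr < 1:
        rr = 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1.0))

# Illustrative example: an observed risk ratio of 2.0 would require an
# unmeasured confounder associated with both exposure and outcome by a
# risk ratio of at least ~3.41 to fully explain it away.
print(round(e_value(2.0), 2))  # 3.41
```

Larger E-values thus indicate greater robustness of an identified risk factor to unmeasured confounding, which is the quantified confidence measure the text refers to.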
Furthermore, the binary categorization of crash severity and collision type, while necessary and justified for our specific analytical objectives given the dataset, represents a simplification of real-world complexity. It does not capture gradients within injury severity (e.g., separating fatal from severe injury) nor the distinct crash dynamics associated with different non-VRU collision partners (e.g., fixed object vs. truck). Future research with larger, more detailed datasets would enable more granular, multi-class analyses to uncover these finer-grained relationships.
Our investigation into crashes involving vulnerable road users (VRUs) is constrained by a very sparse dataset. The insights from this part of the analysis are inherently preliminary and serve primarily for hypothesis generation. A critical future direction is to compile a significantly larger, multi-source dataset dedicated to VRU-AV interactions to support more reliable and stable SHAP-based interpretation.
Additionally, our analysis surfaced specific findings, such as the association between higher turning speeds and lower predicted severity, that challenge conventional expectations. These points of divergence are not shortcomings but rather valuable outcomes of applying an interpretable framework to novel AV data. They serve as precise hypotheses for future research. We recommend that subsequent studies employ high-resolution data (e.g., simulation logs, detailed telemetry) to investigate the specific operational contexts and vehicle behaviors that give rise to such statistical patterns, moving from correlation towards a mechanistic understanding.