From Prediction to Explanation: Explainable Machine Learning for Motor Vehicle–Involved Pedestrian and Cyclist Crash Risk

Elsayed, Ahmed; Abdel-Rahim, Ahmed; Prescott, Logan

doi:10.3390/infrastructures11030077


by Ahmed Elsayed ^*, Ahmed Abdel-Rahim and Logan Prescott ^†

Department of Civil and Environmental Engineering, College of Engineering, University of Idaho, Moscow, ID 83844, USA

^* Author to whom correspondence should be addressed.

^† Present address: Kimley-Horn, Everett, WA 98201, USA.

Infrastructures 2026, 11(3), 77; https://doi.org/10.3390/infrastructures11030077

Submission received: 23 January 2026 / Revised: 10 February 2026 / Accepted: 17 February 2026 / Published: 26 February 2026

(This article belongs to the Special Issue Advances in Smart Infrastructures: Converging IoT, AI, and Digital Twins for Resilient Futures)


Abstract

Pedestrian and cyclist safety at urban intersections remains a critical challenge for transportation agencies, as vulnerable road users are significantly exposed to crash risks in complex traffic environments. Identifying high-risk locations and factors that contribute to crashes is essential for improving road safety. This study developed an explainable machine learning framework to predict motor vehicle-involved pedestrian and cyclist crash occurrence at urban intersections using five years of crash, geometric, operational, and socioeconomic data from a large set of urban intersections. Five supervised machine learning algorithms were trained and evaluated, including Binary Logistic Regression, K-Nearest Neighbors, Support Vector Machine, Decision Tree, and Random Forest. The evaluated models demonstrated strong predictive performance overall, with accuracies approaching 91% and high discriminative capability. In particular, the Binary Logistic Regression and Random Forest models achieved the highest area under the receiver operating characteristic curve (AUC) values of 0.961 and 0.964, respectively. To enhance transparency, SHAP values were used to quantify the contribution of predictors and examine feature effects at both the global and local levels. The results indicate that roadway hierarchy, intersection markings, and total entering volume are among the most influential determinants of crash likelihood, while socioeconomic variables exhibit weaker but interpretable effects.

Keywords: explainable machine learning; crash prediction; pedestrian safety; cyclist safety; vulnerable road users; SHAP values; risk factor; urban intersections; traffic safety analytics

1. Introduction

Walking and cycling play significant roles in sustainable transportation systems because of their contributions to environmental quality, public health, and urban accessibility [1]. In response, transportation agencies and policymakers across many countries have dedicated increasing focus toward promoting active transportation to reduce emissions, support healthier travel behavior, and improve overall mobility efficiency [2,3]. Despite these advantages, safety outcomes for non-motorized users have not improved at the same pace. Road traffic injuries continue to represent a substantial global public health problem, particularly in younger populations. As reported by the World Health Organization [4], traffic-related injuries remain the leading cause of death for individuals aged 5–29 years worldwide, with approximately 1.2 million fatalities annually. Pedestrians and cyclists represent a substantial share of these fatalities, underscoring persistent safety deficiencies in roadway design and operation and the need for targeted interventions to reduce exposure and conflict risk.

In the U.S., the demand for active transportation has been increasing gradually, with pedestrian and cyclist volumes estimated to grow by approximately 3% annually [5]. This growth, coupled with sustainability and transportation planning efforts, presents new challenges for road safety management. Recent national trends indicate rising pedestrian fatalities resulting from motor vehicle-involved crashes, reinforcing the need for infrastructure-focused safety analyses that account for traffic and contextual factors affecting vulnerable road users [6]. Intersections are consistently identified as high-risk locations where multiple traffic streams converge and conflict potential is elevated. National and state-level crash statistics show that a substantial proportion of pedestrian and cyclist crashes involving motor vehicles occur at intersections, regardless of the control type [7]. Although numerous safety countermeasures have been implemented, intersections remain among the most critical locations for nonmotorized user safety. Therefore, identifying the environmental, geometric, and social factors associated with crash occurrence at intersections is essential for developing effective and targeted safety strategies.

Beyond localized safety concerns, improving pedestrian and cyclist safety is a global challenge faced by transportation agencies across diverse geographic and institutional contexts. Although crash patterns and exposure levels vary by region, the fundamental relationships linking roadway geometry, traffic exposure, land use, and socio-demographic characteristics to the risk of vulnerable road users are widely shared. This has created a growing need for transferable, data-driven analytical frameworks that can support infrastructure safety assessments and decision-making using commonly available data sources. In this context, explainable machine learning approaches offer a promising opportunity to balance predictive performance with interpretability, enabling practitioners to better understand and act upon complex crash risk patterns across different settings. Traditionally, these relationships have been examined using statistical modeling approaches, which have provided valuable insights into crash occurrence and the contributing factors. However, there is increasing recognition of the limitations imposed by model-specific assumptions, particularly when addressing nonlinear interactions and high-dimensional data [8,9,10]. To overcome these constraints, machine learning (ML) techniques have recently emerged as effective alternatives for crash prediction and safety analysis, offering greater flexibility in capturing complex relationships within transportation systems [11,12,13,14]. To provide context for the proposed framework, the following subsections review the existing literature on traditional statistical models and machine learning techniques applied to pedestrian and cyclist crash prediction.

1.1. Traditional Statistical Approaches to Predict Vulnerable User-Involved Crashes

Traditional statistical modeling has long been used to investigate pedestrian and cyclist-involved crash occurrences, particularly at intersections, where conflict points between road users are concentrated. Early safety studies mainly relied on regression-based approaches, including Poisson, Negative Binomial (NB), and probit models, to examine how roadway design, traffic characteristics, and environmental conditions influence crash outcomes. For example, Pljakić et al. [15] analyzed pedestrian crashes across traffic analysis zones in Novi Sad, Serbia, and found that factors such as street network length, transit stop density, and paid parking were associated with higher crash frequency. Similarly, Ma et al. [16] applied probit models to intersection crash data in Cook County, Illinois, showing that pedestrian age, vehicle characteristics, and weather conditions play important roles in injury severity outcomes. Other studies have reported that higher operating speeds, frequent pedestrian–vehicle interactions, wider cross-sections, and on-street parking are associated with elevated crash risks [17]. Evidence from Beijing further suggests that refuge islands at signalized intersections can substantially reduce severe conflicts by approximately 30%, emphasizing the safety benefits of two-stage crossing designs [18].

Comparable modeling frameworks have also been applied to cyclist safety analyses. A comprehensive review by Salmon et al. [19] reported that intersection design deficiencies are a dominant contributor to bicycle crashes, which aligns with the findings of subsequent studies [20]. Empirical analyses using regional crash data have shown that a substantial share of cyclist crashes occur at intersections, including 58.5% of reported crashes in Victoria, Australia, with three-leg intersections being particularly prevalent [21]. Similar patterns were observed in Western Australia, where over half of bicycle crashes were linked to intersections, especially during non-commuting trips and at locations where shared paths transitioned to roadways [22].

Although these traditional statistical approaches have generated important insights into crash mechanisms and contributing factors, their applicability is limited by underlying assumptions such as linear relationships, fixed parameter effects, and independence among observations. These constraints can restrict their ability to represent complex nonlinear interactions among roadway geometry, traffic exposure, and sociodemographic characteristics. In addition, many prior studies have focused primarily on roadway and traffic variables, with relatively limited consideration of community-level factors, such as age composition, school presence, and commuting behavior, which may influence pedestrian and cyclist exposure. These methodological and data-related limitations have encouraged the exploration of more flexible, data-driven approaches—particularly machine learning techniques—that can better accommodate high-dimensional data and complex interaction structures in crash predictions.

1.2. Machine Learning Approaches in Crash Prediction

Recent advances in data availability and computational methods have accelerated the adoption of ML techniques in traffic crash modeling. Unlike traditional statistical approaches, ML algorithms can capture complex nonlinear relationships and higher-order interactions among predictors without relying on strict parametric assumptions. Tree-based methods, such as Decision Trees (DT) and Random Forests (RF), have been widely applied in crash prediction studies because of their ability to model nonlinear effects and interactions among roadway, traffic, and environmental variables [23,24,25]. Support Vector Machine (SVM) models have also been used to classify crash occurrences by identifying optimal separating boundaries in high-dimensional feature spaces, often achieving strong predictive performance [26,27,28,29]. In addition, instance-based approaches, such as the K-Nearest Neighbor (KNN) algorithm, have been applied to crash classification and short-term prediction tasks, particularly in contexts with well-defined similarity structures [30,31]. Although Binary Logistic Regression (BLR) is more limited in flexibility, it remains a commonly used baseline classifier in crash studies because of its clarity and ease of implementation [32]. Table 1 summarizes the representative applications of these ML techniques, including the types of predictors used and the corresponding modeling objectives.

A growing body of literature has focused on comparing the predictive performances of multiple ML models across different crash types and spatial scales. These comparative studies generally report that ensemble-based methods, including RF and gradient boosting models, achieve superior discrimination performance, whereas kernel-based and distance-based classifiers, such as SVM and KNN, can perform competitively under balanced data conditions and rich feature sets [10,33]. However, much of the existing research has concentrated on vehicle-only crashes or network-level safety analyses, with relatively limited attention given to vulnerable road users, particularly pedestrians and cyclists, at the intersection level. To address this gap, this study applies and compares five supervised ML algorithms to model crash occurrences at urban intersections, with a specific focus on interpretability and infrastructure-related risk factors.

1.3. Feature Importance and Machine Learning Model Explainability

Recent developments in machine learning have shifted attention beyond predictive accuracy to understanding how input variables influence model behavior and outcomes. In the context of traffic safety analysis, this has led to increasing interest in feature importance and model interpretation techniques that help identify the factors associated with elevated crash risk. A range of post hoc interpretability methods have been applied in prior safety studies to examine how trained models respond to changes in individual predictors [34,35,36]. One commonly used approach is Local Sensitivity Analysis (LSA), which evaluates changes in the model output by varying one predictor at a time while holding the other inputs constant, thereby providing elasticity and sensitivity measures [37,38,39]. Partial Dependence Plots (PDPs) have also been widely used to visualize the average marginal relationship between a predictor and the model output across the observed data distribution [40,41]. However, PDPs rely on an implicit independence assumption among predictors, which can lead to misleading interpretations when strong correlations exist between input variables, a frequent condition in transportation datasets [42].

To address these limitations, the Shapley Additive exPlanations (SHAP) framework was introduced as a unified game-theoretic approach for interpreting machine learning predictions [43]. SHAP quantifies the contribution of each predictor to an individual model prediction by considering all possible feature coalitions, allowing both the main and interaction effects to be examined consistently. Consequently, SHAP has been increasingly adopted in recent traffic safety studies, including applications focused on crash occurrence [44,45] and crash severity outcomes [46,47]. Importantly, the SHAP values characterize model sensitivities and associations learned from the data rather than causal effects, making them well-suited for exploratory analysis and risk pattern identification in complex infrastructure systems.

This study contributes to the growing literature on active transportation safety in several ways. First, it proposes an explainable machine learning framework for assessing pedestrian and cyclist motor vehicle-involved crash risk at urban intersections, integrating prediction and interpretation to support infrastructure-level safety decision-making. Second, this study demonstrates how multiple commonly available data sources, including roadway geometry, traffic exposure, and sociodemographic indicators, can be combined to capture both the physical and contextual influences on crash risk. Third, the performance of several supervised machine learning models, including BLR, KNN, SVM, DT, and RF, was systematically compared to evaluate their suitability for intersection-level safety analysis. Finally, the framework is illustrated through a large-scale empirical case study of urban intersections in Idaho, U.S., highlighting its practical applicability and transferability to other regions with similar data. From a local perspective, the analysis provides the first intersection-level, explainable machine learning assessment of motor vehicle–involved pedestrian and cyclist crash risk in Idaho, offering actionable insights for prioritizing infrastructure improvements in a state characterized by a mix of urban, suburban, and semi-rural travel environments. From a broader perspective, the proposed framework was intentionally designed to rely on commonly available data sources and transferable modeling components, enabling its application to other regions seeking interpretable, data-driven approaches for vulnerable road user safety assessment.

Table 1. Summary of literature on machine learning models applied to crash prediction.

ML Model	Author	Variables Used	Model Output
DT/RF	Gu et al. [48]	Roadway characteristics, traffic exposure, vehicle dynamics (speed and acceleration), and intersection control features	Crash occurrence
	Wu et al. [49]	Roadway characteristics, traffic exposure, and socio-economic attributes	Crash occurrence
	Yan and Shen [50]	Roadway characteristics, temporal and weather attributes, and intersection control type	Crash severity
	Sum et al. [51]	Roadway characteristics, traffic exposure, environmental conditions, and vehicle involvement factors	Crash severity
SVM	Yu and Abdel-Aty [29]	Traffic states (speed, occupancy, flow variation) and roadway characteristics	Crash occurrence
	You et al. [52]	Traffic states, roadway characteristics, and weather conditions	Crash occurrence
	Basso et al. [53]	Traffic composition, vehicle mix, and roadway characteristics	Crash occurrence
KNN	Yang et al. [54]	Socio-economic attributes, traffic exposure, crash time, and roadway characteristics	Crash severity
	Madushani et al. [55]	Roadway characteristics, pavement condition, lighting environment, and weather conditions	Crash severity
	Santos et al. [56]	Roadway characteristics, intersection control, temporal attributes, and weather conditions	Crash severity
	Haghshenas et al. [57]	Roadway characteristics, pavement condition, traffic exposure, and weather attributes	Crash severity
BLR	Wang et al. [58]	Roadway characteristics, construction zone attributes, traffic states, and temporal factors	Crash occurrence
	Shiran et al. [59]	Roadway geometry, pavement condition, lighting, weather, and traffic exposure	Crash severity
	Najafi Moghaddam Gilani et al. [60]	Roadway characteristics, traffic exposure, lighting conditions, weather, and socio-economic characteristics	Crash severity
	Wang et al. [61]	Roadway and traffic characteristics, environmental conditions, and rider demographics	Crash severity

2. Methods

2.1. Data Preparation

2.1.1. Data Sources

This study constructed a comprehensive multi-source dataset to analyze motor vehicle-involved pedestrian and cyclist crashes at intersections in urban areas of Idaho, U.S. The fundamental spatial layer was obtained from the Idaho intersection database developed in [62], which includes intersection coordinates and estimates of total entering vehicles (TEV) across Idaho. Historical crash records were obtained from the Idaho Transportation Department (ITD) online crash database, which also provides roadway classification information for each crash. Urban area boundaries were defined using the 2020 U.S. Census, and socio-demographic attributes at the census tract level were extracted from the American Community Survey (ACS). The ACS is an annual nationwide survey conducted by the U.S. Census Bureau that provides updated demographic, economic, and housing information for the United States. The variables selected for this study included population, racial composition, median household income, housing occupancy rate, vehicle availability, travel mode, travel time to work, and school enrollment (see Table 2). These variables were integrated to represent the contextual and exposure-related factors for each intersection.

2.1.2. Study Area and Period

The analysis focused on intersections located within federally defined urbanized areas of Idaho, as delineated by the 2020 U.S. Census Bureau boundaries. Intersection-level data were restricted to locations falling within these urbanized boundaries to ensure consistency in demographic characteristics, traffic exposure, and built-environment context across the study area. The study period covered five consecutive years (2015–2019), following Federal Highway Administration (FHWA) recommendations for crash trend analysis to establish a stable pre-COVID baseline that excludes disruptions in travel behavior associated with the COVID-19 pandemic [63].

2.1.3. Crash Data and Intersection Classification

In this study, pedestrian and cyclist crashes were defined as crashes occurring at intersections that involved at least one pedestrian or cyclist and a motor vehicle. Crashes involving only pedestrians or cyclists without motor vehicle involvement were excluded from the analysis. This definition was consistently applied across all crash records used in this study. Pedestrian and cyclist crashes were jointly analyzed under the broader category of vulnerable road users (VRUs). Although pedestrians and cyclists represent distinct travel modes with different exposure patterns, this aggregation was adopted to support infrastructure-level safety analysis and mitigate data sparsity resulting from relatively low crash frequencies at the intersection level. The collected dataset included 1549 intersections. Among these, 973 intersections recorded at least one pedestrian or cyclist crash during the study period, whereas 576 intersections within the same spatial extent had no crashes. The crash distribution across the intersections was as follows: 784 intersections with one crash, 135 with two crashes, 34 with three crashes, 15 with four crashes, four with five crashes, and one with six crashes. The crash records were spatially matched to the intersection points using the geographic coordinates. Each intersection was further marked with a roadway functional class and a TEV. All roadway, traffic, and pedestrian- and cyclist-related variables were treated as static representations of the average conditions over the study period, consistent with the objective of intersection-level safety screening rather than time-specific crash modeling.
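The counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated in this subsection; the total number of crashes and the share of crash intersections are derived here for illustration and are not values reported by the authors.

```python
# Intersections grouped by number of VRU crashes over the study period,
# as quoted in the paragraph above.
crash_distribution = {1: 784, 2: 135, 3: 34, 4: 15, 5: 4, 6: 1}
no_crash_intersections = 576
total_intersections = 1549

# Intersections with at least one crash (the positive class).
crash_intersections = sum(crash_distribution.values())

# Derived value: total crashes implied by the distribution.
total_crashes = sum(k * n for k, n in crash_distribution.items())

# Derived value: share of intersections in the positive class.
positive_share = crash_intersections / total_intersections

print(crash_intersections, total_crashes, round(positive_share, 3))
```

The per-count groups add up to the 973 crash intersections, and together with the 576 no-crash intersections they account for all 1549 locations.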

2.1.4. Geometric and Environmental Attributes

Because detailed geometric data were not publicly available, the intersection geometry and design features were manually extracted using Google Maps (Google LLC, Mountain View, CA, USA) based on the latitude and longitude coordinates provided. The following intersection attributes were identified and coded for analysis: control type (signalized or stop-controlled), roadway lighting (presence or absence of illumination), intersection markings (including marked crosswalks and lane delineation), lane configuration on both major and minor approaches, and pedestrian and cyclist activity levels at the intersection. Pedestrian and cyclist activity levels were represented using a three-level categorical proxy (low, medium, and high) developed to reflect the relative pedestrian and cyclist demands at intersections under data availability constraints. Classification was performed using a standardized and predefined rubric based on observable surrounding land use and infrastructure characteristics, including the presence of sidewalks and crosswalks and proximity to pedestrian and cyclist attractors, such as schools, commercial centers, transit stops, and recreational facilities. All intersections were evaluated using Google Maps and Street View imagery, and the same criteria were consistently applied across the study area to reduce subjectivity and ensure comparability. Because pedestrian and cyclist activity is inherently related to land use and infrastructure provision, this variable may be correlated with other exposure-related features, such as intersection markings and traffic volume. Accordingly, the pedestrian and cyclist activity levels are intended to represent relative exposure conditions rather than exact pedestrian and cyclist counts, and their influence in the model reflects sensitivity to co-located pedestrian and cyclist demand rather than a direct causal effect.

2.1.5. Socio-Demographic Variables

Socio-economic and demographic variables from the ACS were aggregated at the census tract level and geographically joined to each intersection point. These included population density, age distribution, race composition, median income, vehicle ownership, commuting mode share, average travel time to work, and student enrollment. Taken together, these variables provide a geographical representation of exposure and community-level risk factors affecting the safety of non-motorized users. Although ACS data include a margin of error (MOE) for each estimate, these values were excluded from the analysis to improve model efficiency and interpretability. Including the MOE as a model input would introduce noise rather than useful variability, because ML algorithms treat input variables as precise values.

2.2. Methodological Framework

This study developed a data-driven framework to predict motor vehicle-involved pedestrian and cyclist crashes and identify the key contributing factors at urban intersections in Idaho, U.S. The framework illustrated in Figure 1 comprises three main stages: (1) data collection and preparation, (2) model development and optimization, and (3) feature importance and interpretability analyses.

In the first stage, crash data, roadway geometric features, traffic volumes, and demographic information were obtained from multiple sources, including the Idaho Transportation Department (ITD), American Community Survey (ACS), and Google Maps, and spatially integrated into an intersection-level database. In the second stage, several machine learning models, including RF, DT, SVM, KNN, and BLR, were trained to predict crash occurrence. Model hyperparameters were optimized using a grid search with 5-fold cross-validation, while Recursive Feature Elimination (RFE) was used to identify the most relevant predictors for each model. In the final stage, SHAP [43] was applied to interpret the trained models. SHAP values quantify each variable’s contribution to model predictions, providing both global and local interpretability to better understand how roadway, traffic, and sociodemographic factors influence crash likelihood.

Five supervised machine learning algorithms were applied to model motor vehicle-involved pedestrian and cyclist crash occurrences: BLR, KNN, SVM, DT, and RF. These models were selected to represent both linear and nonlinear predictive frameworks and allow performance comparisons across different learning paradigms. The following subsections illustrate the mathematical concepts for the models used, with symbols listed in Table 3.

2.2.1. Binary Logistic Regression (BLR)

For classification tasks, the BLR was used to estimate the probability of a crash. The model applies a logistic transformation to the linear predictor as follows:

\hat{p}(X_{i}) = \operatorname{expit}(X_{i} w + w_{0}) = \frac{1}{1 + e^{-(X_{i} w + w_{0})}}

(1)

Although linear in parameters, BLR effectively handles binary outcomes and provides interpretable coefficients for examining how predictor variables influence crash likelihood [55].
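As a minimal sketch of Equation (1), the snippet below evaluates the logistic transformation for a single intersection using only the Python standard library. The feature values, weights, and intercept are hypothetical placeholders chosen for illustration; they are not coefficients estimated in this study.

```python
import math

def expit(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def crash_probability(x, w, w0):
    """Equation (1): apply expit to the linear predictor x.w + w0."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + w0
    return expit(z)

# Hypothetical standardized predictors and weights (illustration only).
x = [1.2, 0.0, -0.5]
w = [0.8, 1.5, 0.3]
w0 = -0.4

p = crash_probability(x, w, w0)
print(f"Estimated crash probability: {p:.3f}")
```

Because the linear predictor enters through a monotone function, the sign of each weight indicates whether a larger value of the corresponding predictor raises or lowers the estimated probability.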

2.2.2. K-Nearest Neighbors (KNN)

KNN is an instance-based, non-parametric model that classifies observations based on the majority class of their k-nearest neighbors, which is typically determined by the Euclidean distance:

d (x_{1}, x_{2}) = \sqrt{\sum_{i = 1}^{n} {(x_{1 i} - x_{2 i})}^{2}}

(2)

The model performance depends on the choice of the number of neighbors, which is optimized via cross-validation to balance bias and variance.
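The classification rule can be sketched in a few lines, assuming a small set of already scaled numeric features. The toy training points and labels below are invented for illustration; the function names are hypothetical and do not correspond to the implementation used in the study.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Equation (2): Euclidean distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data: two scaled features per intersection, label 1 = crash observed.
train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
train_y = [0, 0, 1, 1, 1]

label = knn_predict(train_X, train_y, (0.8, 0.9), k=3)
print("Predicted class:", label)
```

Since the distance in Equation (2) sums squared differences across features, predictors measured on large scales would dominate unless the inputs are standardized first.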

2.2.3. Support Vector Machine (SVM)

The SVM constructs a hyperplane that maximizes the margin between classes in the feature space as follows:

\max_{w, b} \min_{x_{i} \in D_{N}} \frac{y_{i} (w^{T} x_{i} + b)}{\lVert w \rVert}

(3)

For nonlinear patterns, the kernel trick maps the input data into higher dimensions, thereby enabling flexible decision boundaries [64,65].
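To make the objective in Equation (3) concrete, the sketch below evaluates the inner term, the smallest signed distance from any training point to a candidate hyperplane, for two hand-picked hyperplanes. It does not solve the optimization problem; the data and the candidate (w, b) pairs are hypothetical, and labels are coded as +1 and -1.

```python
import math

def geometric_margin(w, b, X, y):
    """Smallest value of y_i * (w.x_i + b) / ||w|| over the dataset."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        yi * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, yi in zip(X, y)
    )

# Toy linearly separable data (illustration only).
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]

margin_a = geometric_margin((1.0, 1.0), -2.0, X, y)  # diagonal boundary
margin_b = geometric_margin((1.0, 0.0), -1.0, X, y)  # vertical boundary

print(f"margin_a = {margin_a:.3f}, margin_b = {margin_b:.3f}")
```

Both candidates separate the classes, but the diagonal boundary leaves a wider margin, so it would be preferred under the max-min criterion.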

2.2.4. Decision Tree (DT)

DTs partition data repeatedly into subsets based on features that maximize uniformity (e.g., information gain or Gini index). Each leaf node represents a prediction outcome, and the model structure can be described as shown in Equation (4). DTs are interpretable and suitable for both classification and regression, but they can overfit small datasets [66].

y = f (x) = \sum_{l} c_{l} I (x \in R_{l})

(4)
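The two ideas in this subsection, choosing a split by impurity and predicting with the piecewise-constant form of Equation (4), can be sketched for a single numeric feature. The Gini index is used as the uniformity measure here as one of the options mentioned above; the feature values, labels, and regions are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def split_gini(values, labels, threshold):
    """Size-weighted Gini impurity after splitting at value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def tree_predict(x, regions):
    """Equation (4): sum of c_l * I(x in R_l) over disjoint regions (lo, hi]."""
    return sum(c * (lo < x <= hi) for lo, hi, c in regions)

# Toy feature (e.g., an exposure measure) and crash labels.
values = [2, 4, 6, 8, 10, 12]
labels = [0, 0, 0, 1, 1, 1]

# A one-split tree: leaf constant 0 below the threshold, 1 above it.
regions = [(float("-inf"), 6, 0), (6, float("inf"), 1)]

print(split_gini(values, labels, 6), split_gini(values, labels, 4))
print(tree_predict(3.0, regions), tree_predict(9.0, regions))
```

The split at 6 yields pure child nodes, so it has lower weighted impurity than the split at 4 and would be selected first.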

2.2.5. Random Forests (RF)

RF is an ensemble of multiple DTs, each trained on a bootstrap sample of the data with a random subset of features considered at each split. The final predictions are obtained using majority voting (hard voting). This bagging approach reduces overfitting and enhances model robustness by increasing the diversity of trees [67].
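The three ingredients named above (bootstrap sampling, random feature subsets, and hard voting) can be sketched independently of any tree-fitting routine. The helper names, the seed, and the sizes below are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct candidate features to consider at one split."""
    return sorted(rng.sample(range(n_features), k))

def hard_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability

sample = bootstrap_indices(10, rng)
candidates = random_feature_subset(n_features=9, k=3, rng=rng)
forest_label = hard_vote([1, 0, 1, 1, 0])  # votes from five hypothetical trees

print(sample, candidates, forest_label)
```

Because each tree sees a different resample and a different set of candidate features, individual trees disagree on some cases, and the vote averages out their errors.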

2.2.6. Model Evaluation Metrics

The performance of the developed models was evaluated using a set of assessment metrics that provide quantitative measures of predictive accuracy, robustness, and generalization capability. Table 4 lists the key performance metrics used to evaluate the predictive models. For the classification tasks, accuracy, precision, recall, and F1-score were applied to assess the models’ ability to correctly identify crash occurrence and nonoccurrence. Additionally, the Receiver Operating Characteristic (ROC) curve and the area under it (AUC) indicate the model’s overall ability to distinguish between positive and negative classes, with higher AUC values indicating better discriminatory power.
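The sketch below computes these metrics from their standard textbook definitions on a small invented example. The labels, scores, and the 0.5 threshold are hypothetical; the AUC is computed as the fraction of (crash, non-crash) pairs in which the crash case receives the higher score, which is equivalent to the area under the ROC curve.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels (1 = crash)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

def auc(y_true, scores):
    """Share of positive-negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: true labels and predicted crash probabilities.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.55]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
print(metrics, round(auc_value, 4))
```

Accuracy, precision, recall, and F1 all depend on the chosen threshold, whereas the AUC depends only on how the scores rank the cases.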

2.2.7. Feature Importance Analysis

Understanding the relative influence of input variables is essential for interpreting model behavior and identifying the key factors that contribute to the occurrence of crashes. In this study, feature importance was analyzed using the SHAP framework proposed by Lundberg and Lee [43], which provides a unified approach to explaining the predictions made by any machine learning model. SHAP values quantify the importance of the model features and can also reveal interactions and unobserved heterogeneity among features [68,69]. For a given instance i, the SHAP value ϕ_{j} for feature j represents the average change in the model output when the feature is included compared to when it is excluded. Mathematically, the SHAP value is defined as

ϕ_{j} = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|! (|F| - |S| - 1)!}{|F|!} [f_{S \cup \{j\}} (x_{S \cup \{j\}}) - f_{S} (x_{S})]

(5)

where F denotes the set of all features, S is a subset of features that does not contain j, and f_{S}(x_{S}) is the model prediction using only the features in subset S. This formulation ensures a fair distribution of contributions among all features and provides local and global interpretability.

Global feature importance was derived by averaging the absolute SHAP values across all observations, allowing the identification of the most influential predictors in the models used. In contrast, local SHAP values were used to interpret how specific features affected individual intersection predictions, thereby offering insights into case-specific variations in crash risk.
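Equation (5) can be evaluated exactly for a small model by enumerating every feature subset. The sketch below does this for a hypothetical three-feature function, under the assumption that a feature absent from a subset is replaced by a fixed baseline value; this is one common convention and is not necessarily how the study's SHAP values were computed. Practical SHAP implementations rely on faster, model-specific algorithms rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values per Equation (5), by enumerating all subsets."""
    n = len(x)

    def value(subset):
        # Assumption: features outside the subset take their baseline value.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for j in range(n):
        others = [i for i in range(n) if i != j]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                total += weight * (value(set(S) | {j}) - value(set(S)))
        phi.append(total)
    return phi

def global_importance(phi_rows):
    """Mean absolute SHAP value per feature across observations."""
    n = len(phi_rows[0])
    return [sum(abs(row[j]) for row in phi_rows) / len(phi_rows)
            for j in range(n)]

# Hypothetical model with one additive term and one interaction term.
def model(z):
    return 2 * z[0] + z[1] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]

phi = shapley_values(model, x, baseline)
print([round(v, 6) for v in phi])
```

In this toy case the additive term is credited entirely to the first feature, the interaction term is shared equally between the other two, and the values sum to the difference between the prediction and the baseline output.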

3. Results

3.1. Models’ Performance

For all five ML models, the dataset was randomly split into a training subset (65%) and a test subset (35%), and a 5-fold cross-validation technique was used to ensure stable model evaluation. Five metrics were used to evaluate model performance: accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC).
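The partitioning can be sketched with the standard library alone. The snippet assumes that the 5-fold cross-validation is carried out within the training subset, which the text does not state explicitly; the seed, the rounding rule, and the function names are likewise hypothetical, so the subset sizes printed here are illustrative rather than the study's exact counts.

```python
import random

def train_test_split_indices(n, train_fraction, seed):
    """Shuffle row indices and cut them into training and test subsets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]

def k_fold_indices(indices, k):
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]

# 1549 intersections, 65% for training (seed chosen arbitrarily).
train_idx, test_idx = train_test_split_indices(1549, 0.65, seed=42)
folds = list(k_fold_indices(train_idx, 5))

print(len(train_idx), len(test_idx), [len(val) for _, val in folds])
```

Each training row is used for validation exactly once across the five folds, while the test subset stays untouched until the final evaluation.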

Table 5 summarizes the predictive performances of the machine learning classification models considered in this study. Overall, the evaluated models exhibited strong and comparable predictive performance, with accuracies ranging from 88.47% to 90.99%. The BLR model achieved the highest accuracy (90.99%) and F1-score (92.77%), whereas the RF model demonstrated the highest discriminatory capability among the evaluated models, as reflected by the highest AUC value (0.9639). The SVM and KNN models also produced competitive results, each achieving an accuracy of 90.45% with balanced precision and recall values. Although the DT model yielded the highest recall (92.92%) and correctly identified a large share of crash-prone intersections, its lower AUC value (0.8681) indicates reduced robustness in distinguishing between crash and non-crash cases compared with the other models.

Further insight into the model discrimination is provided by the ROC curves shown in Figure 2. The RF and BLR models demonstrated strong overall classification performance, with ROC curves that consistently remained above those of the other models across a wide range of false-positive rates. The SVM and KNN models exhibited slightly lower—but still competitive—discrimination capabilities, whereas the DT model showed a comparatively weaker separation between crash and non-crash intersections, consistent with its lower AUC value. These results indicate that no single model uniformly outperformed the others across all evaluation metrics, with BLR excelling in overall accuracy and F1-score, DT achieving the highest recall, and RF providing the strongest discrimination capability, as measured by the AUC.

Considering the multi-metric performance results and the objectives of this study, the RF model was selected for the subsequent explainability and feature importance analyses. Although several models achieved comparable accuracy and F1-scores, the RF model attained the highest AUC value, indicating a stronger ability to distinguish between crash and non-crash intersections across a range of classification thresholds. The AUC metric is particularly important in this context because pedestrian- and cyclist-involved crashes are relatively rare events, and effective safety screening requires a reliable ranking of high-risk locations rather than optimization at a single decision threshold. This threshold-independent property makes the AUC a suitable criterion for evaluating model performance in intersection-level crash risk analysis. In addition to its strong discriminatory capability, the RF model is well suited to capturing the nonlinear relationships and complex feature interactions that are commonly present in roadway, traffic, and sociodemographic data. Tree-based ensemble models, such as RF, are also highly compatible with SHAP-based interpretation, enabling stable and consistent estimation of both global and local feature contributions. The following subsection presents the explainability results, focusing on the relative influence of the input variables on crash occurrence across urban intersections. The implications of these performance differences and the interpretation of the most influential predictors are discussed in Section 4.
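The ranking interpretation of the AUC can be made concrete with a small Python sketch. It uses the pairwise definition of the AUC (the share of crash and non-crash pairs in which the crash site receives the higher score); the function name `auc_by_ranking` and the four example scores are hypothetical.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the share of (crash, non-crash) pairs in which the crash site scores higher.

    Ties count as half a correctly ordered pair.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one crash and one non-crash case")
    ordered = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                ordered += 1.0
            elif p == n:
                ordered += 0.5
    return ordered / (len(positives) * len(negatives))


# Hypothetical predicted crash probabilities for four intersections.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
auc = auc_by_ranking(labels, scores)

# Halving every score preserves the ordering, so the AUC is unchanged,
# whereas accuracy at a fixed 0.5 threshold would change.
rescaled_auc = auc_by_ranking(labels, [s / 2 for s in scores])
print(auc, rescaled_auc)
```

Because only the ordering of the scores enters the calculation, the metric rewards a model that places crash-prone intersections above the others, regardless of where a decision threshold is later set.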

3.2. Feature Importance and Model Interpretation

The influence of individual predictors on crash occurrence was examined using SHAP values, as defined in Equation (5). SHAP analysis was employed to decompose the model output into feature-level contributions, enabling a transparent examination of how the input variables influenced the RF model predictions. This section presents the global, semiglobal, and local interpretability results.

3.2.1. Global Feature Importance

Figure 3 presents the global importance of the model features based on the mean absolute SHAP values shown on the left side of the figure. Larger SHAP values indicate a greater contribution of a feature to the prediction model. The most influential predictors identified by the model were Major Road Classification, Minor Road Classification, Intersection Marking, and TEV, with average absolute SHAP values of 0.12, 0.10, 0.06, and 0.05, respectively. In contrast, socioeconomic variables (e.g., median household income for the 45–64 age group) exhibited comparatively lower SHAP values, indicating smaller contributions to the predicted crash likelihood.

The SHAP beeswarm plot (Figure 3, right side) provides a detailed visualization of how the top predictors influence the model predictions. Each point represents the SHAP value for a single intersection, that is, the marginal contribution of a specific feature to the predicted likelihood of a crash at that intersection. The features are ordered along the y-axis according to their global importance, with higher-ranked predictors appearing at the top, in the same order as the global importance bars on the left side. The color gradient represents the magnitude of each feature value, ranging from low (blue) to high (red), enabling the simultaneous interpretation of the feature magnitude and direction of influence. The horizontal spread of points reflects the variability of the SHAP values across intersections, indicating the extent to which a given feature increased (positive SHAP value) or decreased (negative SHAP value) the predicted crash likelihood. For example, lower major and minor road classification variable codes, corresponding to higher functional roadway classes as defined in Table 2, were associated with positive SHAP values, indicating higher predicted crash likelihoods assigned by the model. Intersection Marking and TEV also exhibited predominantly positive SHAP values at higher feature levels, whereas lower values of these variables were associated with negative SHAP contributions. The substantial spread observed for these predictors highlights their heterogeneous influence on the model output across intersections.

3.2.2. Local Feature Contributions: SHAP Force Plot

To illustrate how individual feature values contribute to specific model predictions, two SHAP force plots were generated for representative intersections with contrasting predicted risk levels (Figure 4 and Figure 5). The force plots visualize how the feature contributions shift the model output relative to the baseline (expected) prediction, with positive and negative SHAP values indicating increased or decreased predicted crash likelihood, respectively.
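The additive logic behind a force plot can be sketched as follows: the prediction for one intersection equals the baseline plus the sum of that intersection's SHAP values. The baseline, feature names, and contributions below are hypothetical and are not read from Figure 4 or Figure 5.

```python
# Hypothetical values for one intersection; none are taken from the study.
base_value = 0.42  # expected model output across the dataset
contributions = {
    "major_class": 0.21,
    "tev": 0.12,
    "marking": 0.08,
    "median_income": -0.03,
}


def explain(base, shap_values):
    """Return the reconstructed prediction and the features sorted by absolute push."""
    prediction = base + sum(shap_values.values())
    ordered = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return prediction, ordered


prediction, ordered = explain(base_value, contributions)
print(f"baseline {base_value:.2f} -> prediction {prediction:.2f}")
for name, value in ordered:
    direction = "raises" if value > 0 else "lowers"
    print(f"  {name} {direction} the prediction by {abs(value):.2f}")
```

In a force plot, the positive terms are drawn pushing the output above the baseline and the negative terms pushing it below, with bar lengths proportional to the absolute contributions.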

Figure 4 presents a high-risk intersection, where features such as roadway classification, traffic exposure (TEV), and the presence of pedestrian and cyclist-related infrastructure collectively push the model prediction toward a higher crash likelihood. In contrast, Figure 5 shows a low-risk intersection, where lower traffic exposure, lower roadway hierarchy, and the absence of intersection markings contribute to a reduced predicted crash likelihood relative to the baseline. These local explanations demonstrate that although the same variables may appear as dominant predictors in the global analysis, their direction and magnitude of influence vary substantially across individual intersections.

3.2.3. Pairwise SHAP Dependence Analysis of the Top Predictors

Pairwise SHAP dependence plots were generated for the four most influential predictors to examine how these variables jointly affected the model’s output. The selected predictors were the major and minor approach classifications, intersection marking, and TEV. Each subplot displays the SHAP value of a primary feature along the x-axis, whereas the point color represents the interacting feature, as shown in the 4 × 3 matrix in Figure 6. The SHAP dependence plots for the Major Road Classification (Figure 6a–c) show a clear pattern in the model output across the different functional classes of the major approach. When higher functional classes of the major road (lower numerical values) were paired with higher functional classes of the minor road, the model assigned mostly positive SHAP values, whereas combinations involving lower functional classes on both approaches were generally associated with SHAP values close to or below zero. This pattern indicates that intersections formed by lower-ranked major and minor roads were assigned lower predicted crash likelihoods by the model. A similar trend was observed for the interaction between major road classification and intersection markings. Across most higher-ranked major road classes, the presence of intersection markings corresponded to positive SHAP values, whereas lower-ranked major roads without intersection markings were primarily associated with negative SHAP contributions. For the interaction between Major Road Classification and TEV, higher-ranked major roads combined with larger traffic volumes exhibited positive SHAP values, whereas lower-ranked major roads paired with lower TEV values were associated with lower SHAP contributions.

The SHAP dependence plots for Minor Road Classification (Figure 6d–f) exhibit interaction patterns that closely mirror those observed for Major Road Classification. Across interactions with roadway classification, intersection marking, and Total Entering Volume (TEV), higher functional classes of the minor approach were generally associated with positive SHAP values, whereas lower functional classes corresponded to SHAP values close to or below zero. The similarity in these interaction patterns indicates that the model responds in a comparable manner to the roadway hierarchy on both the major and minor approaches when combined with traffic exposure and infrastructure-related features.

The SHAP dependence plots for Intersection Marking (Figure 6g–i) show a clear interaction pattern with the traffic exposure and roadway classification variables. Intersections without markings were generally associated with negative SHAP values, particularly when combined with lower TEV. In contrast, intersections with marked crossings generally exhibited positive SHAP values, with the highest SHAP contributions observed when the presence of markings coincided with higher TEV values. This pattern indicates that the model assigns higher predicted crash likelihoods to intersections characterized by the combined presence of intersection markings and elevated traffic volumes, whereas unmarked intersections with lower traffic exposure are associated with lower predicted values.

Figure 7 presents the SHAP dependence plot for TEV, colored by pedestrian and cyclist traffic level (labeled PED_Traffic_Level). The plot shows a nonlinear relationship between the TEV and the SHAP values assigned by the model. At lower TEV levels, the SHAP values are typically negative, whereas the SHAP values increase rapidly as the TEV rises from low to moderate ranges before stabilizing at higher TEV levels. The color gradient indicates that intersections with higher pedestrian and cyclist traffic levels are more frequently associated with positive SHAP values at comparable TEV levels, whereas lower pedestrian and cyclist activity levels tend to cluster around lower or approximately zero SHAP values. These patterns illustrate that the sensitivity of the model to traffic exposure varies with pedestrian and cyclist activity levels, reflecting the interaction effects captured by the RF model.

3.3. Socioeconomic Influence on Crash Occurrence

Socioeconomic variables exhibited lower overall contributions to model predictions than geometric and traffic-related factors. Among these variables, the median household income for the 45–64 age group displayed a nonlinear association with the model output (Figure 8). The smoothed SHAP curve shows that the SHAP values decrease as income increases from lower levels to approximately $70,000, at which point the SHAP values reach a minimum. Beyond this income level, the SHAP values remain near zero, indicating a diminished contribution of this variable to the predicted crash likelihood. Overall, the model assigned higher SHAP values to intersections located in areas with lower median household incomes, indicating a stronger model sensitivity to lower income levels.

4. Discussion

This study presents an explainable machine learning framework for analyzing motor vehicle-involved pedestrian and cyclist crash risk at urban intersections. Pedestrian and cyclist crashes were modeled jointly at the intersection level, and the framework integrates prediction accuracy with a transparent interpretation of model outputs. By combining multiple data sources related to roadway geometry, traffic exposure, surrounding area population activity, and socioeconomic context, the framework moves beyond black-box prediction and provides insights into how different factors contribute to the crash likelihood across heterogeneous intersection environments. The results demonstrate that intersection-level crash risk is primarily shaped by infrastructure and traffic exposure characteristics, whereas contextual variables contribute secondary but interpretable effects. The following discussion interprets these findings, explains their practical implications for intersection safety analysis, and clarifies how the explainability results should be understood in the context of data-driven safety decision-making.

4.1. Model Performance Comparison and Selection

The comparative evaluation of the machine learning models revealed that several classifiers achieved strong and comparable predictive performance; however, important differences emerged when considering their ability to discriminate between crash-prone and non-crash-prone intersections. While BLR achieved the highest accuracy and F1-score, the RF model demonstrated superior discriminatory capability, as reflected by the highest area under the receiver operating characteristic curve (AUC). This distinction is particularly important in the context of motor vehicle–involved pedestrian and cyclist crashes, where crash events are relatively rare and effective safety screening depends on the reliable ranking of high-risk locations rather than optimization at a single classification threshold.

The observed performance differences can be attributed to the underlying characteristics of the dataset and the model structures. Tree-based ensemble models, such as RF, are well suited for capturing nonlinear relationships and complex interactions among roadway geometry, traffic exposure, and contextual variables, which are common in intersection safety data. In contrast, linear models, such as BLR, provide stable global performance but may be less flexible in representing heterogeneous risk patterns across diverse intersection environments. The decision tree (DT) model exhibited high recall but lower robustness, as indicated by its reduced AUC, suggesting sensitivity to local patterns but limited generalization ability.

Based on these findings, this study recommends that model selection for intersection-level crash risk analysis should prioritize discrimination capability, in addition to traditional accuracy-based metrics. In particular, AUC-based evaluation is well-suited for rare-event safety applications, as it reflects a model’s ability to rank locations by relative risk across a wide range of decision thresholds. Accordingly, the RF model was selected for the subsequent explainability analysis, as it provided a balance between predictive performance, robustness, and interpretability.

4.2. Infrastructure and Traffic Exposure Effects on Crash Risk

As noted in the Methodology Section, traffic volume, surrounding area population activity, and geometric characteristics were treated as static representations of average conditions during the study period; therefore, the results reflect relative crash risk patterns suitable for network-level screening rather than time-specific crash dynamics. The explainable machine learning analysis revealed clear and consistent relationships between intersection infrastructure characteristics, traffic exposure, and the predicted likelihood of motor vehicle–involved pedestrian and cyclist crashes in the study area. Roadway functional classification and total entering volume (TEV) emerged as the most influential contributors across all model interpretation analyses, indicating that intersection-level crash risk is strongly influenced by exposure-related and geometric factors. Intersections located on higher-class roadways generally accommodate greater traffic demand, higher operating speeds, and more complex turning movements, all of which increase the frequency and severity of the interactions between motor vehicles and vulnerable road users.

Traffic exposure, as represented by TEV, showed a strong nonlinear association with the predicted crash likelihood. The model assigned a higher risk to intersections as traffic volumes increased from low to moderate levels, reflecting the increasing probability of motor vehicle–pedestrian and vehicle–cyclist conflicts. At higher traffic volumes, the increase in the predicted risk began to stabilize, suggesting that congestion effects may limit further growth in conflict opportunities. These patterns are consistent with prior intersection safety research, which has repeatedly shown that crash occurrence is closely linked to traffic volume through increased exposure rather than isolated design deficiencies [70,71].

The interaction effects identified through SHAP dependence analyses further highlight the importance of jointly considering infrastructure and exposure. Intersections combining higher roadway functional classifications on major and minor approaches with elevated TEV consistently exhibited the highest predicted crash likelihoods. This indicates that crash risk is not driven by individual features in isolation but rather by the combined influence of roadway hierarchy and traffic demand. These findings emphasize the need for safety screening approaches that consider interaction effects when prioritizing high-risk locations.

Intersection markings were also associated with an increased predicted crash likelihood in the model. This relationship should be interpreted cautiously, as markings are typically implemented at locations with higher pedestrian and cyclist activity and crossing demand. Consequently, the model likely captures increased exposure to pedestrian and cyclist–vehicle interactions rather than a direct safety deficiency related to the presence of the markings, which aligns with prior studies [72,73]. Overall, the infrastructure-related results underscore the central role of traffic exposure and intersection complexity in shaping the crash risk for vulnerable road users.

4.3. Contextual and Socioeconomic Influences on Crash Risk

Although socioeconomic variables contributed less to the overall model performance than geometric and traffic-related factors, their influence on the predicted crash likelihood was not negligible. The SHAP-based analysis revealed a nonlinear relationship between median household income and model predictions, with higher SHAP values concentrated in lower-income areas, aligning with previous studies [74,75]. This pattern suggests that intersections located in lower-income neighborhoods are assigned a higher predicted crash risk by the model, particularly when combined with unfavorable infrastructure or traffic exposure conditions.

The observed income-related patterns are likely indirect and reflect broader contextual differences rather than individual socio-economic characteristics. Lower-income areas may experience disparities in infrastructure quality, population facility provision, land use patterns, and travel behavior, all of which can influence exposure to traffic conflict. As income levels increased, the SHAP values declined and stabilized, indicating that the model became less sensitive to income variations beyond a certain threshold. This stabilization suggests a diminishing marginal influence of income at higher levels, consistent with the idea that the built environment and traffic conditions play a more dominant role once baseline socioeconomic needs are met.

It is important to emphasize that the socioeconomic variables in this study function as contextual indicators rather than direct causal factors. The SHAP values reflect how the model responds to these variables in combination with roadway and traffic characteristics, rather than implying that income causes crashes. Nonetheless, the results highlight the importance of incorporating contextual information into intersection safety analyses, as socioeconomic conditions may influence where and how vulnerable road users interact with transportation networks. Including such variables allows for a more comprehensive and equity-aware assessment of crash risk patterns in urban environments.

4.4. Practical Implications for Intersection Safety Management

The accurate identification of crash-prone intersections and transparent interpretation of contributing factors are essential for effective data-driven safety management. The proposed explainable machine learning framework supports this objective by enabling transportation agencies to screen large intersection networks and prioritize locations where the motor vehicle-involved pedestrian and cyclist crash risk is high. By identifying high-risk sites, practitioners can implement targeted safety countermeasures, such as enhanced pedestrian and cyclist crossing treatments, traffic-calming strategies, improved signal timing, access management, and geometric design modifications.

Beyond site-level interventions, the framework provides valuable support for strategic planning and resource allocation at the national level. Transportation agencies and funding bodies can use the predicted risk rankings to allocate limited resources more efficiently, focusing on investments in locations where safety improvements are most likely to reduce the crash risk for vulnerable road users. This targeted approach improves the cost-effectiveness of safety programs and supports equity-aware decision-making by highlighting locations where exposure and risk are concentrated.

At the policy level, explainable crash prediction models offer insights into how infrastructure characteristics and traffic exposure jointly shape safety outcomes. Policymakers can use these insights to evaluate existing design standards, assess the effectiveness of implemented countermeasures, and inform future infrastructure investment. Unlike traditional black-box models, the SHAP-based interpretation framework enables transparent communication of model results to both technical and non-technical stakeholders, facilitating evidence-based decision-making.

Finally, the framework provides a foundation for iterative safety evaluations. By comparing the predicted risk patterns with the observed crash outcomes over time, agencies can assess whether the implemented interventions achieve the intended safety benefits and adjust their strategies accordingly. In this context, reliable model discrimination and interpretability are critical, as they directly influence the credibility and usefulness of predictive tools in real-world safety management applications.

4.5. Limitations and Scope of the Analysis

Several limitations should be considered when interpreting the findings of this study. First, the crash records used for model development may contain reporting inconsistencies and underreporting, particularly for minor incidents, which can influence both the predictive accuracy and SHAP-based interpretation. In addition, socioeconomic and demographic variables were derived from census block group-level data, which may have introduced spatial aggregation bias and limited the ability to capture fine-grained local variation at individual intersections.

Second, key roadway, traffic, and population-related variables, including traffic volume, pedestrian and cyclist activity levels, and geometric features, were treated as static inputs. This modeling choice reflects the availability of data at the intersection level and is appropriate for network-wide screening applications; however, it limits the ability to capture short-term temporal fluctuations in exposure and operational conditions. Accordingly, the results should be interpreted as representing average risk patterns over the study period rather than time-specific crash dynamics. Future research should incorporate time-varying exposure measures to better represent the temporal variability in traffic, pedestrian, and cyclist activities.

Third, motor vehicle-involved pedestrian and cyclist crashes represent a relatively small proportion of total roadway incidents at urban intersections, resulting in low event frequencies when considered separately. To mitigate data sparsity and improve the model’s stability and reliability, motor vehicle crashes involving pedestrians and cyclists were analyzed jointly under the broader category of vulnerable road users. This aggregation allows the model to identify infrastructure- and exposure-related patterns common to both modes; however, it does not capture mode-specific behavioral differences in these patterns. Future extensions of this framework could explore separate or hierarchical modeling approaches as larger datasets become available.

Finally, the analysis was based on pre-COVID crash and exposure data covering the period from 2015 to 2019. Although travel behavior and traffic patterns have evolved following the COVID-19 pandemic, the primary objective of this study was to develop and demonstrate an explainable machine learning framework for intersection-level infrastructure safety analysis. Core risk factors related to roadway hierarchy, intersection design, and traffic exposure are expected to remain relevant for safety assessment even as absolute traffic volumes change. The proposed framework can be readily updated and re-evaluated using post-pandemic data as they become available, enabling its continued applicability in evolving travel contexts.

Author Contributions

Conceptualization, A.A.-R.; methodology, A.A.-R., L.P. and A.E.; software, L.P. and A.E.; validation, A.A.-R., L.P. and A.E.; formal analysis, L.P. and A.E.; investigation, L.P. and A.E.; resources, L.P.; data curation, L.P.; writing—original draft preparation, A.E. and L.P.; writing—review and editing, A.A.-R. and A.E.; visualization, A.E.; supervision, A.A.-R.; project administration, A.A.-R.; funding acquisition, A.A.-R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the U.S. Department of Transportation’s University Transportation Center Program, Grant 69A3552348310, and in part by the Pacific Northwest Regional University Transportation Center (PacTrans).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw datasets used in this study are publicly available and are described in detail in Section 2.1.1. The refined datasets used for model development, along with the code employed for training and evaluation of the machine learning models, are available from the corresponding author upon reasonable request, as they are also used in ongoing and related studies. No additional data were generated in this study.

Acknowledgments

Generative artificial intelligence tools including ChatGPT (OpenAI, San Francisco, CA, USA; version GPT-4) and Grammarly (Grammarly Inc., San Francisco, CA, USA) were used to assist with the language editing and structural refinement of the manuscript. All methodological descriptions, analyses, results, and conclusions were conceived, implemented, and verified by the authors. No data, figures, numerical results, or scientific content were generated or modified using generative AI.

Conflicts of Interest

Author Logan Prescott was employed by the company Kimley-Horn (Present address). The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript.

ML	Machine Learning
BLR	Binary Logistic Regression
KNN	K-Nearest Neighbors
SVM	Support Vector Machine
DT	Decision Tree
RF	Random Forest
PDPs	Partial Dependence Plots
SHAP	Shapley Additive exPlanations
TEV	Total Entering Volume
ITD	Idaho Transportation Department
GIS	Geographic Information System
ACS	American Community Survey
FHWA	Federal Highway Administration
MOE	Margin of Error
AWSC	All-Way Stop Control
TWSC	Two-Way Stop Control
ROC	Receiver Operating Characteristic
AUC	Area Under the Receiver Operating Characteristic Curve

References

Gerike, R.; de Nazelle, A.; Wittwer, R.; Parkin, J. Special issue “walking and cycling for better transport, health and the environment”. Transp. Res. Part A Policy Pract. 2019, 123, 1–6. [Google Scholar] [CrossRef]
Bleviss, D.L. Transportation is critical to reducing greenhouse gas emissions in the United States. WIREs Energy Environ. 2021, 10, e390. [Google Scholar] [CrossRef]
Woodcock, J.; Edwards, P.; Tonne, C.; Armstrong, B.G.; Ashiru, O.; Banister, D.; Beevers, S.; Chalabi, Z.; Chowdhury, Z.; Cohen, A. Public health benefits of strategies to reduce greenhouse-gas emissions: Urban land transport. Lancet 2009, 374, 1930–1943. [Google Scholar] [CrossRef] [PubMed]
World Health Organization. Road Traffic Injuries; Technical Report; World Health Organization: Geneva, Switzerland, 2023. [Google Scholar]
Le, H.T.; Buehler, R.; Hankey, S. Have walking and bicycling increased in the US? A 13-year longitudinal analysis of traffic counts from 13 metropolitan areas. Transp. Res. Part D Transp. Environ. 2019, 69, 329–345. [Google Scholar] [CrossRef]
Jackson, S.; Raymond, P.; Taylor, S. Bicycle and Pedestrian Safety Research Project; Technical Report; Idaho Transportation Department: Boise, ID, USA, 2023.
Mahmoudi, J.; Xiong, C.; Yang, M.; Luo, W. Modeling the Frequency of Pedestrian and Bicyclist Crashes at Intersections: Big Data-driven Evidence From Maryland. Transp. Res. Rec. J. Transp. Res. Board 2023, 2677, 1245–1260. [Google Scholar] [CrossRef]
Anastasopoulos, P.C.; Mannering, F.L. An empirical assessment of fixed and random parameter logit models using crash-and non-crash-specific injury data. Accid. Anal. Prev. 2011, 43, 1140–1147. [Google Scholar] [CrossRef]
Mannering, F.L.; Shankar, V.; Bhat, C.R. Unobserved heterogeneity and the statistical analysis of highway accident data. Anal. Methods Accid. Res. 2016, 11, 1–16. [Google Scholar] [CrossRef]
Yuan, J.; Abdel-Aty, M. Approach-level real-time crash risk analysis for signalized intersections. Accid. Anal. Prev. 2018, 119, 274–289. [Google Scholar] [CrossRef]
Abdel-Aty, M.; Haleem, K. Analyzing angle crashes at unsignalized intersections using machine learning techniques. Accid. Anal. Prev. 2011, 43, 461–470. [Google Scholar] [CrossRef] [PubMed]
Iranitalab, A.; Khattak, A. Comparison of four statistical and machine learning methods for crash severity prediction. Accid. Anal. Prev. 2017, 108, 27–36. [Google Scholar] [CrossRef]
Osman, O.A.; Hajij, M.; Bakhit, P.R.; Ishak, S. Prediction of Near-Crashes from Observed Vehicle Kinematics using Machine Learning. Transp. Res. Rec. J. Transp. Res. Board 2019, 2673, 463–473. [Google Scholar] [CrossRef]
Theofilatos, A.; Chen, C.; Antoniou, C. Comparing Machine Learning and Deep Learning Methods for Real-Time Crash Prediction. Transp. Res. Rec. J. Transp. Res. Board 2019, 2673, 169–178. [Google Scholar] [CrossRef]
Pljakić, M.; Jovanović, D.; Matović, B. The influence of traffic-infrastructure factors on pedestrian accidents at the macro-level: The geographically weighted regression approach. J. Saf. Res. 2022, 83, 248–259. [Google Scholar] [CrossRef]
Ma, Z.; Lu, X.; Chien, S.I.J.; Hu, D. Investigating factors influencing pedestrian injury severity at intersections. Traffic Inj. Prev. 2018, 19, 159–164. [Google Scholar] [CrossRef]
Mukherjee, D.; Mitra, S. A comprehensive study on identification of risk factors for fatal pedestrian crashes at urban intersections in a developing country. Asian Transp. Stud. 2020, 6, 100003. [Google Scholar] [CrossRef]
Li, L.; Yang, X.; Yin, L. Exploration of Pedestrian Refuge Effect on Safety Crossing at Signalized Intersection. Transp. Res. Rec. J. Transp. Res. Board 2010, 2193, 44–50. [Google Scholar] [CrossRef]
Salmon, P.M.; Naughton, M.; Hulme, A.; McLean, S. Bicycle crash contributory factors: A systematic review. Saf. Sci. 2022, 145, 105511. [Google Scholar] [CrossRef]
Prati, G.; Marín Puchades, V.; De Angelis, M.; Fraboni, F.; Pietrantoni, L. Factors contributing to bicycle–motorised vehicle collisions: A systematic literature review. Transp. Rev. 2018, 38, 184–208. [Google Scholar] [CrossRef]
Boufous, S.; De Rome, L.; Senserrick, T.; Ivers, R. Risk factors for severe injury in cyclists involved in traffic crashes in Victoria, Australia. Accid. Anal. Prev. 2012, 49, 404–409. [Google Scholar] [CrossRef]
Meuleners, L.B.; Fraser, M.; Johnson, M.; Stevenson, M.; Rose, G.; Oxley, J. Characteristics of the road infrastructure and injurious cyclist crashes resulting in a hospitalisation. Accid. Anal. Prev. 2020, 136, 105407. [Google Scholar] [CrossRef]
Abellán, J.; López, G.; De Oña, J. Analysis of traffic accident severity using decision rules via decision trees. Expert Syst. Appl. 2013, 40, 6047–6054. [Google Scholar] [CrossRef]
Das, A.; Abdel-Aty, M.; Pande, A. Using conditional inference forests to identify the factors affecting crash severity on arterial corridors. J. Saf. Res. 2009, 40, 317–327. [Google Scholar] [CrossRef]
Harb, R.; Yan, X.; Radwan, E.; Su, X. Exploring precrash maneuvers using classification trees and random forests. Accid. Anal. Prev. 2009, 41, 98–107. [Google Scholar] [CrossRef]
Li, Z.; Liu, P.; Wang, W.; Xu, C. Using support vector machine models for crash injury severity analysis. Accid. Anal. Prev. 2012, 45, 478–486. [Google Scholar] [CrossRef]
Dong, N.; Huang, H.; Zheng, L. Support vector machine in crash prediction at the level of traffic analysis zones: Assessing the spatial proximity effects. Accid. Anal. Prev. 2015, 82, 192–198. [Google Scholar] [CrossRef]
Li, X.; Lord, D.; Zhang, Y.; Xie, Y. Predicting motor vehicle crashes using support vector machine models. Accid. Anal. Prev. 2008, 40, 1611–1618. [Google Scholar] [CrossRef] [PubMed]
Yu, R.; Abdel-Aty, M. Utilizing support vector machine in real-time crash risk evaluation. Accid. Anal. Prev. 2013, 51, 252–259. [Google Scholar] [CrossRef] [PubMed]
Lv, Y.; Tang, S.; Zhao, H. Real-time highway traffic accident prediction based on the k-nearest neighbor method. In Proceedings of the 2009 International Conference on Measuring Technology and Mechatronics Automation; IEEE: Piscataway, NJ, USA, 2009; Volume 3, pp. 547–550. [Google Scholar]
Zhang, L.; Liu, Q.; Yang, W.; Wei, N.; Dong, D. An improved k-nearest neighbor model for short-term traffic flow prediction. Procedia-Soc. Behav. Sci. 2013, 96, 653–662. [Google Scholar] [CrossRef]
Lu, T.; Dunyao, Z.H.U.; Lixin, Y.; Pan, Z. The traffic accident hotspot prediction: Based on the logistic regression method. In Proceedings of the 2015 International Conference on Transportation Information and Safety (ICTIS); IEEE: Piscataway, NJ, USA, 2015; pp. 107–110. [Google Scholar]
Rahman, R.; Bhowmik, T.; Eluru, N.; Hasan, S. Assessing the crash risks of evacuation: A matched case-control approach applied over data collected during Hurricane Irma. Accid. Anal. Prev. 2021, 159, 106260. [Google Scholar] [CrossRef] [PubMed]
Gill, N.; Hall, P.; Montgomery, K.; Schmidt, N. A responsible machine learning workflow with focus on interpretable models, post-hoc explanation, and discrimination testing. Information 2020, 11, 137. [Google Scholar] [CrossRef]
Guerra-Manzanares, A.; Nõmm, S.; Bahsi, H. Towards the integration of a post-hoc interpretation step into the machine learning workflow for IoT botnet detection. In Proceedings of the 2019 18th IEEE International Conference On Machine Learning and Applications (ICMLA); IEEE: Piscataway, NJ, USA, 2019; pp. 1162–1169. [Google Scholar]
Vieira, C.P.; Digiampietri, L.A. Machine Learning post-hoc interpretability: A systematic mapping study. In Proceedings of the XVIII Brazilian Symposium on Information Systems, Curitiba, Brazil, 16–19 May 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 1–8. [Google Scholar] [CrossRef]
Delen, D.; Tomak, L.; Topuz, K.; Eryarsoy, E. Investigating injury severity risk factors in automobile crashes with predictive analytics and sensitivity analysis methods. J. Transp. Health 2017, 4, 118–131. [Google Scholar] [CrossRef]
Jiang, L.; Xie, Y.; Wen, X.; Ren, T. Modeling highly imbalanced crash severity data by ensemble methods and global sensitivity analysis. J. Transp. Saf. Secur. 2022, 14, 562–584. [Google Scholar] [CrossRef]
Wen, X.; Xie, Y.; Jiang, L.; Li, Y.; Ge, T. On the interpretability of machine learning methods in crash frequency modeling and crash modification factor development. Accid. Anal. Prev. 2022, 168, 106617. [Google Scholar] [CrossRef]
Toran Pour, A.; Moridpour, S.; Tay, R.; Rajabifard, A. Modelling pedestrian crash severity at mid-blocks. Transp. A Transp. Sci. 2017, 13, 273–297. [Google Scholar] [CrossRef]
Danesh, T.; Ouaret, R.; Floquet, P. Interpretability in machine learning predictions: Case of Random Forest regression using Partial Dependence Plots. In Proceedings of the 18ème Congrès de la Société Française de Génie des Procédés, Toulouse, France, 7–10 November 2022. [Google Scholar]
Apley, D.W.; Zhu, J. Visualizing the effects of predictor variables in black box supervised learning models. J. R. Stat. Soc. Ser. B Stat. Methodol. 2020, 82, 1059–1086. [Google Scholar] [CrossRef]
Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
Parsa, A.B.; Movahedi, A.; Taghipour, H.; Derrible, S.; Mohammadian, A.K. Toward safer highways, application of XGBoost and SHAP for real-time accident detection and feature analysis. Accid. Anal. Prev. 2020, 136, 105405. [Google Scholar] [CrossRef] [PubMed]
Li, X.; Shi, L.; Shi, Y.; Tang, J.; Zhao, P.; Wang, Y.; Chen, J. Exploring interactive and nonlinear effects of key factors on intercity travel mode choice using XGBoost. Appl. Geogr. 2024, 166, 103264. [Google Scholar] [CrossRef]
Dong, S.; Khattak, A.; Ullah, I.; Zhou, J.; Hussain, A. Predicting and analyzing road traffic injury severity using boosting-based ensemble learning models with SHAPley Additive exPlanations. Int. J. Environ. Res. Public Health 2022, 19, 2925. [Google Scholar] [CrossRef]
Hasan, A.S.; Jalayer, M.; Das, S.; Kabir, M.A.B. Application of machine learning models and SHAP to examine crashes involving young drivers in New Jersey. Int. J. Transp. Sci. Technol. 2024, 14, 156–170. [Google Scholar] [CrossRef]
Gu, Y.; Liu, D.; Arvin, R.; Khattak, A.J.; Han, L.D. Predicting intersection crash frequency using connected vehicle data: A framework for geographical random forest. Accid. Anal. Prev. 2023, 179, 106880. [Google Scholar] [CrossRef]
Wu, D.; Zhang, Y.; Xiang, Q. Geographically weighted random forests for macro-level crash frequency prediction. Accid. Anal. Prev. 2024, 194, 107370. [Google Scholar] [CrossRef]
Yan, M.; Shen, Y. Traffic accident severity prediction based on random forest. Sustainability 2022, 14, 1729. [Google Scholar] [CrossRef]
Sum, S.; Se, C.; Champahom, T.; Jomnonkwao, S.; Sinha, S.; Ratanavaraha, V. A Random Forest and SHAP-based analysis of motorcycle crash severity in Thailand: Urban-Rural and Day-Night perspectives. Transp. Eng. 2025, 21, 100369. [Google Scholar] [CrossRef]
You, J.; Wang, J.; Guo, J. Real-time crash prediction on freeways using data mining and emerging techniques. J. Mod. Transp. 2017, 25, 116–123. [Google Scholar] [CrossRef]
Basso, F.; Basso, L.J.; Pezoa, R. The importance of flow composition in real-time crash prediction. Accid. Anal. Prev. 2020, 137, 105436. [Google Scholar] [CrossRef]
Yang, L.; Aghaabbasi, M.; Ali, M.; Jan, A.; Bouallegue, B.; Javed, M.F.; Salem, N.M. Comparative analysis of the optimized KNN, SVM, and ensemble DT models using Bayesian optimization for predicting pedestrian fatalities: An advance towards realizing the sustainable safety of pedestrians. Sustainability 2022, 14, 10467. [Google Scholar] [CrossRef]
Madushani, J.S.; Sandamal, R.K.; Meddage, D.P.P.; Pasindu, H.R.; Gomes, P.A. Evaluating expressway traffic crash severity by using logistic regression and explainable & supervised machine learning classifiers. Transp. Eng. 2023, 13, 100190. [Google Scholar] [CrossRef]
Santos, D.; Saias, J.; Quaresma, P.; Nogueira, V.B. Machine learning approaches to traffic accident analysis and hotspot prediction. Computers 2021, 10, 157. [Google Scholar] [CrossRef]
Haghshenas, S.S.; Guido, G.; Vitale, A.; Astarita, V. Assessment of the level of road crash severity: Comparison of intelligence studies. Expert Syst. Appl. 2023, 234, 121118. [Google Scholar] [CrossRef]
Wang, J.; Song, H.; Fu, T.; Behan, M.; Jie, L.; He, Y.; Shangguan, Q. Crash prediction for freeway work zones in real time: A comparison between Convolutional Neural Network and Binary Logistic Regression model. Int. J. Transp. Sci. Technol. 2022, 11, 484–495. [Google Scholar] [CrossRef]
Shiran, G.; Imaninasab, R.; Khayamim, R. Crash severity analysis of highways based on multinomial logistic regression model, decision tree techniques, and artificial neural network: A modeling comparison. Sustainability 2021, 13, 5670. [Google Scholar] [CrossRef]
Najafi Moghaddam Gilani, V.; Hosseinian, S.M.; Ghasedi, M.; Nikookar, M. Data-Driven Urban Traffic Accident Analysis and Prediction Using Logit and Machine Learning-Based Pattern Recognition Models. Math. Probl. Eng. 2021, 2021, 9974219. [Google Scholar] [CrossRef]
Wang, Z.; Huang, S.; Wang, J.; Sulaj, D.; Hao, W.; Kuang, A. Risk factors affecting crash injury severity for different groups of e-bike riders: A classification tree-based logistic regression model. J. Saf. Res. 2021, 76, 176–183. [Google Scholar] [CrossRef]
Lowry, M.B.; Ward, C.R. Development of a Methodology to Evaluate the Highway Safety Improvement Program; Technical Report; Idaho Transportation Department: Boise, ID, USA, 2023.
Elsayed, A.; Smith, S.; Abdel-Rahim, A.; Chang, K. Impact of the COVID-19 Pandemic on Travel Mode Choices and Fatal Crash Rates; Technical Report; Center for Safety Equity in Transportation: Fairbanks, AK, USA, 2025. [Google Scholar]
Awad, M.; Khanna, R. Support Vector Machines for Classification. In Efficient Learning Machines; Apress: Berkeley, CA, USA, 2015; pp. 39–66. [Google Scholar] [CrossRef]
Ghosh, S.; Dasgupta, A.; Swetapadma, A. A study on support vector machine based linear and non-linear pattern classification. In Proceedings of the 2019 International Conference on Intelligent Sustainable Systems (ICISS); IEEE: Piscataway, NJ, USA, 2019; pp. 24–28. [Google Scholar]
Luna, J.M.; Gennatas, E.D.; Ungar, L.H.; Eaton, E.; Diffenderfer, E.S.; Jensen, S.T.; Simone, C.B.; Friedman, J.H.; Solberg, T.D.; Valdes, G. Building more accurate decision trees with the additive tree. Proc. Natl. Acad. Sci. USA 2019, 116, 19887–19893. [Google Scholar] [CrossRef]
Syam, N.; Kaul, R. Random forest, bagging, and boosting of decision trees. In Machine Learning and Artificial Intelligence in Marketing and Sales: Essential Reference for Practitioners and Data Scientists; Emerald Publishing Limited: Leeds, UK, 2021; pp. 139–182. [Google Scholar]
Chang, I.; Park, H.; Hong, E.; Lee, J.; Kwon, N. Predicting effects of built environment on fatal pedestrian accidents at location-specific level: Application of XGBoost and SHAP. Accid. Anal. Prev. 2022, 166, 106545. [Google Scholar] [CrossRef]
Yuan, C.; Li, Y.; Huang, H.; Wang, S.; Sun, Z.; Wang, H. Application of explainable machine learning for real-time safety analysis toward a connected vehicle environment. Accid. Anal. Prev. 2022, 171, 106681. [Google Scholar] [CrossRef] [PubMed]
Liu, C.; Zhao, M.; Li, W.; Sharma, A. Multivariate random parameters zero-inflated negative binomial regression for analyzing urban midblock crashes. Anal. Methods Accid. Res. 2018, 17, 32–46. [Google Scholar] [CrossRef]
Wu, P.; Chen, T.; Wong, Y.D.; Meng, X.; Wang, X.; Liu, W. Exploring key spatio-temporal features of crash risk hot spots on urban road network: A machine learning approach. Transp. Res. Part A Policy Pract. 2023, 173, 103717. [Google Scholar] [CrossRef]
Koepsell, T.; McCloskey, L.; Wolf, M.; Moudon, A.V.; Buchner, D.; Kraus, J.; Patterson, M. Crosswalk markings and the risk of pedestrian–motor vehicle collisions in older pedestrians. JAMA 2002, 288, 2136–2143. [Google Scholar] [CrossRef]
Deliali, A.; Fournier, N.; Christofa, E.; Knodler, M. Investigating the Safety Impact of Segment- and Intersection-Level Bicycle Treatments on Bicycle–Motorized Vehicle Crashes. Transp. Res. Rec. J. Transp. Res. Board 2023, 2677, 1315–1330. [Google Scholar] [CrossRef]
Younes, H.; Noland, R.B.; Von Hagen, L.A.; Meehan, S. Pedestrian-and bicyclist-involved crashes: Associations with spatial factors, pedestrian infrastructure, and equity impacts. J. Saf. Res. 2023, 86, 137–147. [Google Scholar] [CrossRef] [PubMed]
Roll, J.; McNeil, N. Race and income disparities in pedestrian injuries: Factors influencing pedestrian safety inequity. Transp. Res. Part D Transp. Environ. 2022, 107, 103294. [Google Scholar] [CrossRef]

Figure 1. Research Framework.

Figure 2. ROC curves comparing the performance of the five classification models.

Figure 3. Global SHAP feature importance and SHAP beeswarm plot for the top ten ranked predictors.

Figure 4. Local SHAP force plot illustrating feature contributions for a high predicted crash likelihood intersection (Intersection 10).

Figure 5. Local SHAP force plot illustrating feature contributions for a low predicted crash likelihood intersection (Intersection 106).

Figure 6. SHAP dependence plots for the four most influential predictors.

Figure 7. SHAP dependence plot for TEV colored by pedestrian and cyclist traffic level.

Figure 8. Smoothed curve SHAP values for median household income (45–64 years).

Table 2. Descriptive statistics of variables used in the crash prediction models.

Variable Category and Name	Description	Mean	SD	Min	Max
Target Variable: Crash Occurrence	1 = Crash; 0 = No crash	0.63	0.48	0	1
Traffic Volume: Total Entering Vehicles	Estimated total vehicles entering the intersection	42,608	47,358	400	352,360
Roadway and Geometric Design Variables:
Major Road Type	1 = Interstate; 2 = Freeway; 3 = Principal Arterial; 4 = Minor Arterial; 5 = Major Collector; 6 = Minor Collector; 7 = Local	5.10	1.71	1	7
Minor Road Type	Same classification as major approach	5.36	1.52	2	7
Major Road Lanes	Number of lanes on major approach	1.41	0.66	0	5
Minor Road Lanes	Number of lanes on minor approach	1.08	0.31	0	4
Major Right-Turn Lane	1 = Yes; 0 = No	0.10	0.30	0	1
Major Left-Turn Lane	1 = Yes; 0 = No	0.26	0.44	0	1
Minor Right-Turn Lane	1 = Yes; 0 = No	0.14	0.34	0	1
Minor Left-Turn Lane	1 = Yes; 0 = No	0.20	0.40	0	1
Major Crosswalk	1 = Present; 0 = Absent	0.38	0.49	0	1
Minor Crosswalk	1 = Present; 0 = Absent	0.41	0.49	0	1
Intersection Control	1 = Signal; 2 = AWSC; 3 = TWSC; 4 = Roundabout; 5 = Uncontrolled	2.78	1.19	1	5
Pavement Marking	1 = Yes; 0 = No	0.63	0.48	0	1
Intersection Lighting	1 = Yes; 0 = No	0.89	0.32	0	1
Pedestrian and Cyclist Activity Variables:
Pedestrian and Cyclist Volume Level	1 = Low; 2 = Medium; 3 = High	1.97	0.71	1	3
Socio-Demographic and Economic Variables:
Population ≤ 18 years (%)	Population under 18 years old	21.45	7.54	1.0	42.7
Population ≥ 65 years (%)	Population aged 65 or older	14.75	6.30	0.7	47.3
Dependent-Age Population (%)	Population ≤ 18 or ≥ 65 years	36.19	9.44	2.2	63.6
Median Household Income (USD)	Household income over the past 12 months	65,648	24,064	19,500	192,802
Total Households	Total households per census tract	1726	542	505	3847
Housing and Household Characteristics:
Renter-Occupied Units (%)	Housing units occupied by renters	36.96	19.93	0	87.1
Owner-Occupied Units (%)	Housing units occupied by owners	55.42	22.00	2.9	99.1
Households with No Vehicle (%)	Households reporting no vehicle ownership	6.07	6.28	0	27.9
Commuting and Travel Behavior Variables: *
Drive Alone to work (%)	Workers commuting alone by car	73.10	10.32	33.7	93.3
Carpool to work (%)	Workers commuting by shared carpool	8.50	4.74	0	33.5
Public Transit to work (%)	Workers using public transit	0.84	1.48	0	10.2
Walk or Bike to work (%)	Workers walking or cycling to work	5.75	6.42	0	36.3
Work at Home (%)	Workers working remotely	10.70	6.48	0	48.3
Commute Time Variables:
Commute 5–9 min (%)	Workers commuting 5–9 min	18.19	9.48	0.6	53.1
Commute 10–14 min (%)	Workers commuting 10–14 min	22.16	9.35	0.2	63.2
Commute 15–19 min (%)	Workers commuting 15–19 min	19.52	7.55	0	49.8
Commute 20–24 min (%)	Workers commuting 20–24 min	12.97	7.61	0.1	45.7
Education and School Enrollment Variables:
Private School Enrollment (K–12) (%)	Students enrolled in private schools (K–12)	11.01	10.39	0	86.0
Private School Enrollment (5–8) (%)	Students enrolled in private schools (grades 5–8)	9.51	13.80	0	100
Private School Enrollment (9–12) (%)	Students enrolled in private schools (grades 9–12)	10.18	14.16	0	100

*: Commuting and travel behavior variables represent percentages of the resident population within the census tract and are derived from ACS data reflecting population-level exposure.

Table 3. Notation and definitions of symbols used in the machine learning model equations.

Symbol	Description	Equation No.
$w_{i}$	Model coefficient (weight) for feature i.	Equation (1)
$w_{0}$	Intercept or bias term in the linear models.	Equation (1)
$p (X_{i})$	Predicted probability of observation i in the logistic regression.	Equation (1)
k	Number of nearest neighbors in the KNN model.	Equation (2)
$d (x_{1}, x_{2})$	Euclidean distance between the two observations.	Equation (2)
n	Number of input features in the dataset.	Equation (2)
$w, b$	Weight vector and bias term defining the separating hyperplane in SVM.	Equation (3)
$I (x \in R_{l})$	Indicator function (1 if x belongs to region $R_{l}$ , 0 otherwise).	Equation (4)
$c_{l}$	Constant prediction value within region $R_{l}$ in the Decision Tree.	Equation (4)
$R_{l}$	Partitioned region l in the feature space of the Decision Tree.	Equation (4)
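The symbols in Table 3 can be made concrete with a short sketch. The article's Equations (1)–(4) are not reproduced in this part of the text, so the functions below use the standard textbook forms of each model as an assumption; the function names and the toy inputs are hypothetical and are not taken from the study.

```python
import math

def logistic_probability(x, w, w0):
    """Assumed form of Equation (1): p(X) = 1 / (1 + exp(-(w0 + sum_i w_i * x_i)))."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def euclidean_distance(x1, x2):
    """Assumed form of Equation (2): d(x1, x2) over the n input features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def svm_side(x, w, b):
    """Assumed form of Equation (3): side of the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def tree_predict(x, regions):
    """Assumed form of Equation (4): f(x) = sum_l c_l * I(x in R_l).

    `regions` is a list of (membership_function, c_l) pairs, where the
    membership function plays the role of the indicator I(x in R_l).
    """
    return sum(c_l * (1 if in_region(x) else 0) for in_region, c_l in regions)

# Toy one-feature tree with two regions split at 0.5 (hypothetical values).
toy_regions = [
    (lambda x: x[0] < 0.5, 0.0),   # R_1: predict no crash
    (lambda x: x[0] >= 0.5, 1.0),  # R_2: predict crash
]
```

Because the regions of a decision tree partition the feature space, exactly one indicator equals 1 for any input, so the sum in `tree_predict` returns the constant of the single region that contains x.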

Table 4. Summary of model evaluation metrics used for classification and regression tasks.

Metric	Description	Equation
Accuracy	Overall proportion of correctly classified instances among all predictions.	$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Precision	Proportion of correctly predicted positive cases among all predicted positives.	$\mathrm{Precision} = \frac{TP}{TP + FP}$
Recall (sensitivity)	Proportion of actual positives correctly identified by the model.	$\mathrm{Recall} = \frac{TP}{TP + FN}$
F1-Score	Harmonic mean of precision and recall, balancing both metrics under class imbalance.	$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
ROC	Graphical representation of the trade-off between the true-positive rate (TPR) and the false-positive rate (FPR) across classification thresholds.	$\mathrm{TPR} = \frac{TP}{TP + FN}, \quad \mathrm{FPR} = \frac{FP}{FP + TN}$
AUC	Scalar summary of the ROC curve; represents the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance. Values range from zero to one.	$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}) \, d(\mathrm{FPR})$
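As an illustration of the definitions in Table 4, the following minimal sketch computes each metric from confusion-matrix counts and approximates the AUC integral with the trapezoidal rule. The counts and the three-point ROC curve are hypothetical values chosen for easy arithmetic; they are not results from the study, and the function names are illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the Table 4 metrics from confusion-matrix counts.

    Assumes every denominator is nonzero (at least one predicted positive,
    one actual positive, and one actual negative).
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # identical to the true-positive rate
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "tpr": recall,
        "fpr": fpr,
    }

def auc_trapezoid(roc_points):
    """Approximate the integral of TPR over FPR with the trapezoidal rule.

    `roc_points` is a list of (fpr, tpr) pairs that includes (0, 0) and (1, 1).
    """
    pts = sorted(roc_points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical counts, not taken from the article.
metrics = classification_metrics(tp=8, tn=8, fp=2, fn=2)

# A single threshold gives one interior ROC point; real curves use many thresholds.
roc = [(0.0, 0.0), (metrics["fpr"], metrics["tpr"]), (1.0, 1.0)]
auc = auc_trapezoid(roc)
```

A single operating point yields only a coarse AUC estimate; sweeping the classification threshold over the predicted probabilities adds more (FPR, TPR) points and tightens the approximation.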

Table 5. Performance comparison of classification models for crash occurrence prediction.

Model	Accuracy %	Precision %	Recall %	F1-Score %	AUC
BLR	90.99	94.69	90.93	92.77	0.9609
KNN	90.45	93.35	91.50	92.42	0.9107
SVM	90.45	94.64	90.09	92.31	0.9492
DT	88.47	89.37	92.92	91.11	0.8681
RF	89.37	92.00	91.22	91.61	0.9639
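The F1-scores in Table 5 can be cross-checked against the reported precision and recall using the harmonic-mean formula from Table 4. The sketch below does this for all five models; the precision, recall, and F1 values are copied from Table 5, while the variable names and the 0.01 tolerance (to allow for rounding of the published inputs) are assumptions of this example.

```python
# (precision %, recall %, reported F1 %) per model, copied from Table 5.
table5 = {
    "BLR": (94.69, 90.93, 92.77),
    "KNN": (93.35, 91.50, 92.42),
    "SVM": (94.64, 90.09, 92.31),
    "DT": (89.37, 92.92, 91.11),
    "RF": (92.00, 91.22, 91.61),
}

def f1_from(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# Recompute F1 and record the absolute gap to the reported value.
recomputed = {}
for model, (precision, recall, reported_f1) in table5.items():
    f1 = f1_from(precision, recall)
    recomputed[model] = (f1, abs(f1 - reported_f1))

for model, (f1, gap) in recomputed.items():
    print(f"{model}: recomputed F1 = {f1:.2f}, gap to reported = {gap:.4f}")
```

A recomputed value that agrees with the reported one to within rounding indicates that the precision, recall, and F1 columns of the table are internally consistent.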
