Explainable Machine Learning for Cyclist Injury Severity in Bicycle–Vehicle Crashes in Poland: Association Patterns and Implications for Sustainable Road Safety

Budzyński, Artur; Cieśla, Maria

doi:10.3390/su18115501

by Artur Budzyński * and Maria Cieśla *

Department of Transport Systems, Traffic Engineering and Logistics, Faculty of Transport and Aviation Engineering, Silesian University of Technology, 8 Krasińskiego St., 40-019 Katowice, Poland

* Authors to whom correspondence should be addressed.

Sustainability 2026, 18(11), 5501; https://doi.org/10.3390/su18115501 (registering DOI)

Submission received: 9 April 2026 / Revised: 13 May 2026 / Accepted: 21 May 2026 / Published: 1 June 2026

(This article belongs to the Special Issue Sustainable and Smart Transportation Systems)

Abstract

Road safety is a prerequisite for sustainable mobility, yet cyclists remain disproportionately exposed to severe outcomes in mixed traffic. Using police-reported bicycle–vehicle crashes from the national SEWIK registry in Poland (152,567 cyclist-involved records; 2015–2024), this study modeled five ordered injury-severity classes with a CatBoost gradient-boosting classifier, evaluated performance with quadratic weighted kappa and complementary class-sensitive metrics under extreme imbalance (including benchmark comparisons and calendar-based walk-forward stress tests), and interpreted predictions with SHAP to summarize transparent, feature-level association patterns. The results indicate modest overall ordinal discrimination (hold-out QWK ≈ 0.20), while highlighting elevated recall for rare fatal outcomes together with low precision, implying a substantial false-positive trade-off if outputs were used as deterministic classifiers. Global and local explanations point to stronger associations for cyclist age, shorter offender licensure tenure (a registry proxy for experience-related factors), regional context, and built-up versus non-built-up settings consistent with higher kinetic-energy environments; these variables should be interpreted cautiously because registry data are observational and omit key exposures (e.g., measured impact speed and cycling volume). Overall, the study contributes a nationwide, explainable severity-profiling workflow for prioritizing cyclist protection: combining benchmarked ML, multi-metric reporting, and XAI diagnostics can support monitoring and evaluation of speed management, infrastructure, and licensing-system improvements—without overstating causal effects from administrative records alone.

Keywords:

bicycle–vehicle crashes; cyclist injury severity; injury severity prediction; explainable artificial intelligence; SHAP; CatBoost; vulnerable road users; driving experience; cyclist age; sustainable mobility policy

1. Introduction

Cycling is increasingly promoted as a cornerstone of sustainable and healthy mobility, yet its expansion is repeatedly constrained by safety concerns, in particular the risk of severe injury in collisions with motor vehicles. The cyclist-safety literature has grown rapidly over the last decade, mapping a broad landscape of determinants spanning behavior, infrastructure, and vehicle–road interactions [1]. At the same time, large differences in reporting practices, exposure measurement, and crash typologies (e.g., single-bicycle vs. bicycle–vehicle events) continue to produce non-trivial uncertainty about what matters most for preventing death and disabling trauma [2].

However, evidence remains mixed on the relative emphasis of infrastructure and speed management versus behavioral, licensing, and enforcement-oriented measures [3,4,5]. In practice, the policy-relevant signal is often context dependent: built-environment characteristics, vehicle mix, and speed environments can be strongly associated with severe outcomes in some samples, while demographic and behavioral factors appear prominent in others [3,4]. For sustainable road safety, the key implication is that single-factor narratives are risky: analytical designs should allow non-linearities and interactions while reporting performance transparently, especially when severe and fatal outcomes are rare [1,5].

Within Europe, recent sustainability-oriented crash severity studies reinforce that cyclist outcomes are associated with multi-level interactions among road environment, time/lighting, and participant attributes, but they disagree on which factors remain robust once heterogeneity is accounted for [6,7,8]. For example, logistic-regression-based evidence from large cyclist crash samples highlights strong age gradients and higher fatal odds outside built-up areas, consistent with the well-known speed–severity relationship [6]. In parallel, regional studies focusing on bicyclist–vehicle collisions identify driver and cyclist alcohol involvement, vehicle type (e.g., trucks), and time-of-day as prominent correlates of severe injury [7,8].

To address heterogeneity, modern injury-severity research increasingly combines segmentation ideas (latent class or clustering) with flexible statistical models, showing that predictors can change direction and magnitude across latent crash contexts [4,9]. Similarly, recent cyclist-safety work integrates machine-learning (ML) pipelines with interpretation layers to discover non-linear effects and interactions that conventional models may miss [10,11,12]. For instance, AutoML frameworks paired with SHAP explanations have been used to quantify how visual street-scene features relate to cycling crash rates at intersections, offering a bridge from smart-mobility sensing to actionable infrastructure insights [10]. Hybrid approaches that combine XGBoost–SHAP screening with random-parameters or mixed logit models demonstrate that interpretability and statistical rigor can be complementary rather than competing goals [11,12].

A central theme in the cyclist-injury literature is that fatal and severe outcomes are strongly associated with impact energy, which is often proxied in observational data by speed environment, rurality, and road class [13,14]. Although many speed–risk curves are estimated for pedestrians, cyclist studies commonly report analogous monotonic relationships, where higher impact speeds are biomechanically consistent with a higher probability of severe trauma relative to tolerance thresholds [14]. Consistent with this evidence base, cyclist-specific studies repeatedly identify higher posted speeds and non-built-up (rural) settings as strong correlates of severe injury and death [6].

Cyclist vulnerability is not uniform across age groups. A large body of observational research reports higher fatality and serious-injury risk among older cyclists, even when comparing broadly similar crash contexts [6,15,16,17]. This pattern is commonly discussed using age-related fragility language; however, in police-registry studies, age should be interpreted primarily as a marker that can correlate with physiological vulnerability, exposure contexts, trip characteristics, and other unobserved factors rather than as direct proof of a specific biological mechanism [15,16]. From a sustainability perspective, population aging alongside policies encouraging cycling increases the salience of equity-sensitive safety monitoring: severity models should therefore combine interpretable learning with explicit uncertainty and limitation framing [1,16].

Another rapidly evolving dimension is technology-driven mode change, including the growth of e-bikes, which may alter exposure, speeds, and rider demographics [18,19,20]. Empirical evidence suggests that e-bike and bicycle crashes exhibit distinct patterns of casualties and that injuries remain strongly associated with human and environmental factors (e.g., visibility, vehicle type, and speed context). These trends reinforce the need for analytical approaches that remain valid under shifting mobility regimes—an explicit priority within sustainable and smart transportation systems, where digitalized infrastructures and AI-supported safety management are increasingly used to support road safety practice.

A second focus of this paper is driver experience, commonly proxied by years since licensure in administrative data. Traffic-safety research documents learning-curve patterns in crash involvement among young drivers as a function of age and licensure experience [21]. Yet for injury severity conditional on a cyclist–vehicle collision having occurred, the incremental value of experience measures—and their interpretation—is less standardized: experience may correlate with crash contexts and risk profiles, and registry variables may partially reflect measurement and selection processes rather than a clean behavioral construct [22]. Accordingly, we treat experience-related findings as associational evidence compatible with a hypothesized mitigation channel (e.g., anticipatory behavior reducing impact severity), rather than as causal identification of an “experience buffer.” Related engineering-oriented work models subjective driving risk from cognitive spatiotemporal features [23] and driver fatigue from on-road signals with recurrent architectures [24]; these strands illustrate ML applied to human-state and risk perception in traffic safety, but they rely on signals and tasks outside sparse police-tabular cyclist severity coding.

Methodologically, national crash registries require models that accommodate high-dimensional categorical fields, severe class imbalance [25], and ordinal severity scales, while still supporting policy-relevant transparency. Recent work pairs ensemble learners with explainability tools such as SHAP to summarize non-linear patterns and heterogeneous effects [26,27,28,29,30,31]. At the same time, reviews highlight recurring pitfalls in severity modeling—especially optimistic reporting based on a single metric and limited comparability across studies—calling for multi-metric evaluation and benchmarking against credible baselines [32,33,34]. Building on these insights, we implement an explainable gradient-boosting framework (CatBoost) with SHAP-based diagnostics, and we report quadratic weighted kappa, macro- and weighted-F1, class-wise precision/recall/F1, confusion matrices, and comparisons against additional benchmark models to make trade-offs between sensitivity to rare fatal classes and false-positive risk explicit. Despite growing international use of explainable ML in road-safety research, nationwide cyclist-specific evidence for Poland remains comparatively limited, particularly studies that jointly incorporate cyclist attributes, motor-vehicle participant characteristics (including licensure tenure proxies), and crash context within a single modeling pipeline [35,36,37]. Prior Polish ML applications often target broader road-user populations or related classification tasks, which can obscure cyclist-specific severity heterogeneity [35,36]. Nationwide SEWIK studies have linked injury outcomes to participant roles, fault-related indicators, and registry road- and environment-context fields [35,36,37], including participant-level ensemble analyses [38] and, separately, child severity work with explainable boosting and counterfactual perspectives on protective equipment [39]. 
This paper contributes a cyclist-centered national analysis with an emphasis on reproducible evaluation and interpretable outputs aligned with data-driven accident analysis within sustainable and smart mobility agendas, where AI is used to support monitoring, prioritization, and evaluation—without overstating causal conclusions from observational records. Beyond sample size alone, the revision adds calendar-based out-of-time stress testing (walk-forward test years and complementary expanding-window splits), transparent quantification of missingness for predictors that condition interpretation of offender- and time-related signals, an ordinal logit stress test that clarifies limits of linear ordinal specifications on the same sparse encoding as the classical baselines, multi-baseline ensemble and sensitivity checks beyond the primary split, and a compact tabular summary of mean absolute SHAP contributions computed on a capped stratified hold-out subsample for tractability. Section 2 and Section 3 set out the estimators, data handling, evaluation designs, and numerical results.

This study examines association patterns between cyclist injury severity (five ordered police-reported classes) and crash, participant, and context variables in bicycle–vehicle collisions in Poland (2015–2024). We estimated a CatBoost severity model and interpreted predictions using SHAP, complemented by benchmark comparisons and class-wise performance reporting under extreme imbalance. Our aims were as follows: (i) to summarize which factors are most strongly associated with higher severity in this registry; (ii) to compare CatBoost against credible baselines using complementary metrics that separate ordinal agreement, overall discrimination, and rare-class sensitivity; and (iii) to translate explainable outputs into exploratory policy hypotheses for cyclist protection and speed-risk management—subject to external validation—while clearly stating data limitations (including proxy variables such as administrative region and built-up indicators) and the non-causal nature of the evidence.

2. Materials and Methods

2.1. Data Source and Characteristics

The primary data source is the National Road Traffic Accident and Collision Registry (SEWIK), which the police maintain as the official national record of reported crashes. SEWIK provides standardized information on crash circumstances, infrastructure descriptors, environmental conditions, and participant attributes for each reported event.

For this study, we constructed an analytical sample covering 2015–2024 (10 calendar years). Crashes were retained if the record corresponded to a bicycle-involved event in the sense used for cyclist safety analysis (i.e., incidents where a cyclist participant is present in the extracted cyclist-focused analytical dataset). The final modeling file contains 152,567 cyclist-involved crash records after dataset construction and filtering steps applied in the preparation pipeline (including linkage of cyclist and motor-vehicle participant attributes where available).

Outcome variable. Injury severity was coded as an ordinal five-class outcome based on the police-reported injury categories: (0) no injury, (1) slight injury, (2) serious injury, (3) fatal within 30 days, and (4) fatal at the scene. We retained the full ordinal scale (rather than collapsing categories) to preserve policy-relevant distinctions along the injury spectrum while acknowledging that class frequencies are highly imbalanced (with a large majority of lower-severity outcomes).

Predictors (illustrative grouping). Each record includes variables describing the following:

temporal context (year, month, weekday, hour);
environmental conditions (e.g., lighting, weather, road-surface state);
infrastructure and location context (e.g., road type, traffic control, built-up indicator, intersection-related descriptors);
participant attributes for the cyclist and the motor-vehicle party where recorded (e.g., age, sex, and—when available—license-holding tenure as a proxy for driving experience).

Missingness and registry limitations. Administrative registries commonly contain item-level missingness and selective reporting. Table 1 summarizes item-level missing percentages in the processed modeling file (n = 152,567 cyclist-level records). Offender-side fields are the most incomplete: offender_years_of_driving is missing in about 53% of records and offender_age in about 30%. Cyclist age is missing in about 23%, and the co-recorded temporal fields month, hour, and weekday each show about 19% missing (identical rates reflect joint gaps in time-of-crash coding). The cyclist sex indicator female is missing in about 4%. These patterns motivate cautious interpretation of offender-related SHAP signals and any contrasts involving time-of-day when reading the association results below.

2.2. Model Specification, Training, and Benchmark Comparisons

Severity prediction was implemented using CatBoost, a gradient boosting framework based on oblivious trees that is well suited to high-cardinality categorical fields and mixed-type crash-registry features. Relative to conventional one-hot expansion, CatBoost reduces the risk of sparse high-dimensional encodings for categorical predictors while retaining non-linear capacity for interactions commonly present in injury-severity tasks [30]. In this registry setting, CatBoost also provides a practical route to global and local explainability via SHAP-based diagnostics (detailed under model interpretation).

XGBoost remains a strong default for many tabular problems [29], but it typically relies on explicit encodings for high-cardinality categorical fields, which can explode dimensionality in national registries or require additional engineering to keep sparsity under control. CatBoost’s ordered boosting and native categorical treatment target exactly those registry columns without rebuilding a separate one-hot matrix for the primary learner [30]. Paired with the same ShapValues interface used in our figure pipeline, it also gives a stable implementation path from training to class-specific explanations on unchanged feature definitions. We therefore retained CatBoost as the primary model and did not add a fresh full-sample XGBoost retrain in this revision, because the marginal insight for the stated aims was judged small relative to the already-reported one-hot baselines and temporal comparators.

Software and computing environment. Analyses were implemented in Python 3.11. The primary severity model used CatBoost 1.2.10. Class-wise SHAP summaries used CatBoost’s native ShapValues implementation (get_feature_importance). Benchmark models used scikit-learn 1.8.0 (random forest and multinomial logistic regression with median/mode imputation and one-hot encoding) and statsmodels 0.14.6 (proportional-odds ordered logit). Data processing relied on pandas 3.0.2 and NumPy 2.4.4. No specialized experimental equipment was used; computations were performed on standard workstations at the Silesian University of Technology, Katowice, Poland.

Training used the multi-class objective with quadratic weighted kappa (QWK) as the guiding performance criterion during model development. QWK is appropriate for ordinal severity scales because misclassifications are penalized more heavily when the predicted class is farther from the true ordered class [34]. Because the severity distribution is highly skewed (the no-injury class comprises about 72% of observations), we applied class reweighting using CatBoost’s auto_class_weights = Balanced mechanism to maintain sensitivity to rare serious and fatal outcomes, while recognizing that this typically increases false positives for minority classes (evaluated explicitly in Results using per-class precision/recall/F1 and confusion matrices).
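To make the ordinal penalty concrete, the following minimal sketch computes QWK from scratch in plain Python. It assumes integer labels 0 to K − 1 and the standard quadratic weight (i − j)² / (K − 1)²; the function name is illustrative and is not taken from the study's code.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Cohen's kappa with quadratic weights for ordinal labels 0..n_classes-1."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    # Observed confusion counts: rows are true classes, columns are predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    row_tot = [sum(row) for row in observed]
    col_tot = [sum(observed[i][j] for i in range(n_classes))
               for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Penalty grows with the squared distance between ordered classes.
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += w * observed[i][j]
            # Expected count under independence of true and predicted labels.
            weighted_expected += w * row_tot[i] * col_tot[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when only one class occurs")
    return 1.0 - weighted_observed / weighted_expected
```

For example, with true labels [0, 1, 2, 3, 4, 0], misreading the last case as class 1 yields a higher kappa than misreading it as class 4, because the second error lies four ordered steps from the truth.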

Hyperparameter search (transparent grid). Hyperparameters were selected using 5-fold stratified cross-validation over four pre-specified configurations (i.e., four distinct hyperparameter tuples evaluated by CV mean QWK):

(1): depth = 6, learning rate = 0.10, L2 leaf regularization = 3
(2): depth = 8, learning rate = 0.05, L2 leaf regularization = 3
(3): depth = 8, learning rate = 0.05, L2 leaf regularization = 5
(4): depth = 10, learning rate = 0.03, L2 leaf regularization = 5

The best-performing configuration was depth = 10, learning rate = 0.03, and L2 leaf regularization = 5.
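The selection rule can be sketched as follows. The four dictionaries reproduce the grid listed above using CatBoost's parameter names (depth, learning_rate, l2_leaf_reg); the `fold_scores_fn` callable is a hypothetical placeholder for whatever routine returns one QWK value per cross-validation fold, and is not part of the study's published code.

```python
from statistics import mean

# The four pre-specified configurations from the grid above.
GRID = [
    {"depth": 6, "learning_rate": 0.10, "l2_leaf_reg": 3},
    {"depth": 8, "learning_rate": 0.05, "l2_leaf_reg": 3},
    {"depth": 8, "learning_rate": 0.05, "l2_leaf_reg": 5},
    {"depth": 10, "learning_rate": 0.03, "l2_leaf_reg": 5},
]


def select_by_mean_cv(grid, fold_scores_fn):
    """Pick the configuration with the highest mean per-fold score.

    fold_scores_fn(config) is a hypothetical callable that returns a
    sequence of fold-level scores (for example, five QWK values).
    Ties are resolved in favor of the earlier configuration.
    """
    scored = [(mean(fold_scores_fn(cfg)), idx) for idx, cfg in enumerate(grid)]
    best_mean, best_idx = max(scored, key=lambda t: (t[0], -t[1]))
    return grid[best_idx], best_mean
```

Keeping the grid as explicit data, rather than generating it in nested loops, makes it easy to report exactly which tuples were evaluated.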

Final model fitting and reporting used a stratified holdout split (80%/20%, random seed 42), yielding 122,053 training records and 30,514 test records. The final CatBoost model training used early stopping [40] on the holdout evaluation signal; training terminated when validation loss did not improve for 100 consecutive iterations, with the best iteration retained (best iteration 648 in the fitted model used here).
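The stopping rule described above can be illustrated with a short, library-independent sketch. It assumes zero-based iteration indices and a strict-improvement criterion on the validation loss; the function name is illustrative and the logic is a simplification, not CatBoost's internal implementation.

```python
def best_iteration_with_patience(val_losses, patience=100):
    """Return (best_iteration, last_iteration_run), both zero-based.

    Training is treated as stopped once `patience` consecutive iterations
    have passed without a strictly lower validation loss.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")

    best_iter, best_loss = 0, val_losses[0]
    for it in range(1, len(val_losses)):
        if val_losses[it] < best_loss:
            best_iter, best_loss = it, val_losses[it]
        elif it - best_iter >= patience:
            # No improvement for `patience` iterations: stop and keep the best.
            return best_iter, it
    # The loss sequence ended before the patience budget was exhausted.
    return best_iter, len(val_losses) - 1
```

If iteration indices are zero-based, as assumed here, a best iteration of 648 corresponds to 649 retained trees, which would be consistent with the 649 boosting iterations quoted later for the temporal comparator.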

Benchmark models (reviewer-requested comparators). To contextualize CatBoost performance, we report additional statistical and machine-learning baselines evaluated on the same holdout test set:

Multinomial logistic regression with median imputation for numeric predictors, mode imputation for categoricals, one-hot encoding (with a minimum frequency threshold to control sparsity), and class weighting;
Random forest with the same preprocessing pipeline and class-balanced subsampling.

For computational feasibility given the high-dimensional one-hot expansion, these two baselines were estimated on a stratified random subsample of 30,000 observations drawn from the training partition (seed 42), while performance is still reported on the full 30,514 holdout observations to preserve an unbiased test comparison. The primary CatBoost model, in contrast, was trained on all 122,053 records in the training partition (same stratified split). As a transparency check on how much the subsample constraint moves the random-forest comparator, we additionally fit a random forest on the full training partition with identical preprocessing and test evaluation; that auxiliary run achieved hold-out QWK ≈ 0.301, compared with QWK = 0.2353 for the subsampled random forest in Table 2. Repeating the full-training random forest for three independent stratified splits (random seeds 42–44) yielded mean ± standard deviation QWK of 0.307 ± 0.010 on the hold-out, indicating moderate variability with the partition draw but consistently higher ordinal agreement than the Table 2 subsampled row. We also fitted a proportional-odds (ordered logit) specification in statsmodels as an ordinal statistical baseline on the full training split using the same one-hot feature matrix as the logistic benchmark; hold-out QWK was about 0.024, far below both gradient boosting and random forest, so we treated it as a lower-bound comparator illustrating the limits of a linear ordinal model on this sparse registry encoding rather than a competitive alternative.

We did not rely on synthetic oversampling (e.g., SMOTE-type resampling) in the primary pipeline; imbalance was addressed primarily through class weighting and multi-metric evaluation [41], with sensitivity analyses left for robustness discussion where applicable.

Beyond QWK, Section 3 reports macro- and weighted-average F1, class-wise precision/recall/F1, and confusion matrices (counts and row-normalized) to make trade-offs between rare-class sensitivity and false-positive risk explicit [32,33,34].

Temporal and stress-test evaluation (out-of-time generalization). In addition to the stratified hold-out benchmark above, we assessed how models behave when training and test periods differ systematically by calendar year. We implemented a walk-forward scheme: for each calendar test year from 2019 through 2024, models were trained on all records with crash year strictly before that test year and evaluated on records with crash year equal to the test year. This mimics a retrospective deployment where past years inform prediction on a future calendar year and provides a stronger stress test than a single random split because severity mixes, reporting practices, and exposure contexts can shift over time.
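A minimal sketch of the walk-forward fold construction is given below. It works at the level of calendar years only; the function name is illustrative, and mapping the years back to registry rows is left to the caller.

```python
def walk_forward_splits(years, first_test_year, last_test_year):
    """Yield (train_years, test_year) pairs for a walk-forward evaluation.

    Each fold trains on every available year strictly before the test year
    and evaluates on the test year alone.
    """
    available = sorted(set(years))
    for test_year in range(first_test_year, last_test_year + 1):
        train_years = [y for y in available if y < test_year]
        # Skip folds with no training history or no data in the test year.
        if train_years and test_year in available:
            yield train_years, test_year


# Study period 2015-2024 with test years 2019 through 2024.
folds = list(walk_forward_splits(range(2015, 2025), 2019, 2024))
```

With these settings the generator produces six folds: the first trains on 2015–2018 and tests on 2019, and the last trains on 2015–2023 and tests on 2024.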

We used the same preprocessing pipeline as for the random-forest benchmark (median/mode imputation and one-hot encoding for the RF comparator; categorical handling as for CatBoost in the primary pipeline). The random forest used 300 trees with class_weight = balanced_subsample (other settings as in the benchmark script). The CatBoost comparator used the same fixed hyperparameters as the manuscript’s selected configuration (depth 10, learning rate 0.03, L2 leaf regularization 5, 649 boosting iterations, multi-class loss, auto_class_weights = Balanced, random seed 42), fitted on the training-year subset only without fold-specific early stopping on a withheld validation slice (the temporal exercise isolates calendar drift rather than replicating the hold-out tuning protocol). For numerical stability on large training folds, CatBoost training additionally used used_ram_limit = 8 GB and thread_count = 2.

As complementary stress tests, we evaluated three expanding-window “block” splits: (i) train on crash years through 2018 and test on years from 2019 onward; (ii) train through 2020 and test from 2021 onward; (iii) train through 2022 and test from 2023 onward. These splits emphasize performance under larger temporal gaps and shifting covariate mixes. Full fold-wise metrics (including macro- and weighted-F1 and fit timings) are summarized in Section 3 below.
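The three block splits can be expressed as a single partition rule applied at different anchor years. The sketch below assumes each record is a dictionary with a "year" field; that record layout and the function name are illustrative assumptions rather than the study's actual data structures.

```python
# Anchor years for the three expanding-window splits described above.
ANCHOR_YEARS = (2018, 2020, 2022)


def block_split(records, anchor_year, year_key="year"):
    """Split records into (train, test) around an anchor year.

    Train holds crashes up to and including the anchor year; test holds
    every later crash, so the test pool spans several calendar years.
    """
    train = [r for r in records if r[year_key] <= anchor_year]
    test = [r for r in records if r[year_key] > anchor_year]
    return train, test
```

Unlike the walk-forward folds, each test pool here mixes all years after the anchor, so a single weak year affects every split whose test pool contains it.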

Quantitative SHAP export. For a compact tabular complement to the SHAP figures, we summarized global mean absolute native ShapValues (CatBoost get_feature_importance with ShapValues, using the same tensor layout as in the analysis notebook for the main SHAP figures). This summary is neither “SHAP on the full 30,514-row hold-out” nor “SHAP on the training rows”: the CatBoost model was fitted on the full 122,053-row training partition with the manuscript configuration (649 boosting iterations; depth 10; learning rate 0.03; L2 leaf regularization 5; auto_class_weights = Balanced; same stratified 80/20 split and seed 42 as the reported benchmarks), while ShapValues for the tabulation were computed only on a stratified subsample of the hold-out test partition (here n = 2500, with a configurable cap on the number of hold-out rows evaluated). Thus, the global mean-|SHAP| ranking is tied to that subsample and is not an exhaustive average over all hold-out predictions. The tabulated ranking reflects the same fully trained CatBoost model as in the benchmarks above, not a smaller or faster-training variant; the only shortcut is the evaluation of ShapValues on a capped subset of hold-out rows.
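As a rough illustration of the aggregation step, the sketch below ranks features by mean absolute SHAP value. It assumes a multi-class layout of rows × classes × (features + 1), with the final slot of each class vector holding the expected value rather than a feature contribution, and it averages over both rows and classes. The layout and the pooling across classes are assumptions made for this example, and the function name is illustrative.

```python
def mean_abs_shap(shap_values, feature_names):
    """Rank features by mean |SHAP| over all rows and classes.

    shap_values[i][k][j] is the contribution of feature j to class k for
    row i; the last entry of each class vector is assumed to be the
    expected value and is excluded from the ranking.
    """
    n_features = len(feature_names)
    totals = [0.0] * n_features
    count = 0
    for row in shap_values:
        for class_vec in row:
            if len(class_vec) != n_features + 1:
                raise ValueError("expected n_features + 1 values per class")
            for j in range(n_features):
                totals[j] += abs(class_vec[j])
            count += 1
    if count == 0:
        raise ValueError("shap_values must not be empty")

    means = {name: totals[j] / count for j, name in enumerate(feature_names)}
    # Largest mean absolute contribution first.
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)
```

Because the result is an average over whichever rows are supplied, running it on a capped subsample gives a ranking specific to that subsample, as noted above.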

3. Results

3.1. Exploratory Distribution of Injury Severity

Figure 1 summarizes the distribution of the five-level cyclist injury severity outcome. The sample is strongly dominated by no-injury events (severity = 0), which comprise about 72.4% of all records. The frequency declines sharply with increasing severity: slight injuries (severity = 1) account for about 17.0%, serious injuries (severity = 2) for about 8.9%, while fatal outcomes remain rare (fatal within 30 days, severity = 3—about 0.75%; fatal at the scene, severity = 4—about 0.98%). This pattern is typical of police-reported crash registries and motivates class-imbalance-aware modeling, multi-metric evaluation, and cautious interpretation of rare-class performance, as reported in the following subsections.
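To give a sense of what this imbalance implies for class reweighting, the sketch below derives per-class weights from the approximate shares quoted above. It assumes the form that the CatBoost documentation gives for the Balanced option, in which the weight for class k equals the largest class total divided by the total for class k. The function name is illustrative, and the weights actually used in the study would have been computed on the training partition rather than from these rounded percentages.

```python
def balanced_class_weights(class_totals):
    """Weight each class by (largest class total) / (its own total)."""
    if not class_totals or min(class_totals) <= 0:
        raise ValueError("every class needs a positive total")
    largest = max(class_totals)
    return [largest / total for total in class_totals]


# Approximate shares (%) for severity classes 0..4, as quoted in the text.
SHARES = [72.4, 17.0, 8.9, 0.75, 0.98]
weights = balanced_class_weights(SHARES)
```

Under these assumptions the weights are roughly 1.0, 4.3, 8.1, 96.5, and 73.9 for classes 0 to 4, so a single fatal record counts about as much as 74 to 97 no-injury records during training. That is consistent with a model that favors recall over precision on the fatal classes.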

3.2. Model Development and Performance

Hyperparameter tuning (cross-validation). We first tuned CatBoost using 5-fold stratified cross-validation over four pre-specified configurations, using quadratic weighted kappa (QWK) as the primary selection criterion for the ordinal severity scale. Table 3 reports the mean CV QWK for each configuration. Performance increased with deeper trees and a refined learning rate, with the best mean CV QWK (0.1991) obtained for tree depth = 10, learning rate = 0.03, and L2 leaf regularization = 5.

Using the selected hyperparameters, we trained the final CatBoost model with early stopping (monitoring validation loss; best iteration 648, minimum validation loss 1.2972 as reported in the training log). On the stratified hold-out test set (n = 30,514), the model achieved QWK = 0.2038 and macro-averaged F1 = 0.2911 (with weighted F1 = 0.5515). Overall, these values indicate modest ordinal discrimination; we therefore interpreted the model as useful primarily for screening-style risk stratification and pattern discovery rather than as a high-accuracy standalone classifier.

To contextualize CatBoost performance, we evaluated two additional baselines on the same hold-out split: a multinomial logistic regression with one-hot encoded inputs (estimated on a stratified 30,000-record training subsample for computational feasibility) and a random forest trained under the same subsampling constraint (Table 2). On that table, random forest achieved the highest QWK (0.2353) and weighted F1 (0.6452) among the subsampled comparators, whereas CatBoost remained competitive on class-sensitive behavior (see per-class metrics below) and was retained as the primary model for SHAP-based interpretation given its native handling of categorical registry fields and the stability of its explanation pipeline in this setting. Under equal training size, a full-training random forest reaches materially higher QWK (≈0.301; Section 2.2), which reinforces that Table 2 should be read as a like-for-like subsampled benchmark row for the one-hot pipelines, not as the ceiling performance of tree ensembles on this registry.

Confusion structure and rare-class trade-offs. Figure 2 presents the row-normalized confusion matrix (i.e., recall by true class). On the hold-out test set, recall is 0.414 for class 3 (fatal within 30 days) and 0.600 for class 4 (fatal at scene). However, precision is low for these rare fatal classes (0.055 for class 3 and 0.094 for class 4), indicating a substantial false-positive rate when the model flags fatal risk. This pattern is expected under severe imbalance and class reweighting and must be interpreted alongside per-class precision/recall/F1 (Table 4).
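The relationship between the confusion matrix and the class-wise metrics can be sketched as follows. The code is a generic illustration with hypothetical function names; it does not reproduce the study's confusion counts, which are not listed in the text.

```python
def per_class_metrics(confusion):
    """Compute precision, recall, and F1 per class.

    confusion[i][j] is the number of cases with true class i that were
    predicted as class j.
    """
    k = len(confusion)
    metrics = []
    for c in range(k):
        tp = confusion[c][c]
        actual = sum(confusion[c])                         # row total
        predicted = sum(confusion[i][c] for i in range(k))  # column total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


def false_alarms_per_hit(precision):
    """Number of false positives expected for each true positive."""
    if not 0 < precision <= 1:
        raise ValueError("precision must lie in (0, 1]")
    return (1 - precision) / precision
```

Applied to the reported precisions, `false_alarms_per_hit(0.055)` is about 17.2 and `false_alarms_per_hit(0.094)` is about 9.6: each correctly flagged class 3 case comes with roughly 17 false alarms, and each correctly flagged class 4 case with roughly 10.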

Temporal walk-forward (2019–2024). Table 5 summarizes quadratic weighted kappa on each calendar-year test fold under the walk-forward protocol described in Section 2 Materials and Methods (train on all years strictly before the test year). Mean QWK across the six folds was 0.1451 for random forest and 0.1894 for CatBoost—below the single stratified hold-out values in Table 3, as expected when evaluation is conditioned on forward calendar shifts rather than a random 80/20 mix. Both models showed a pronounced deterioration in 2023 (RF QWK = 0.0526; CatBoost QWK = 0.0910), followed by partial recovery in 2024, which is consistent with year-specific changes in injury severity composition or reporting, rather than stable ordinal discrimination at a constant level across the decade.

Expanding-window block splits. Table 6 reports QWK for the three expanding-window splits (train on all years up to an anchor, test on all later years). Averaging QWK over these splits yielded 0.1056 for random forest and 0.1728 for CatBoost—lower than the walk-forward calendar-year averages, reflecting the larger temporal separation and broader future-oriented test pools in these stress tests. The weakest block-split performance occurred for the split trained through 2022 and evaluated on 2023 onward (RF 0.0536; CatBoost 0.0968), aligning with the weak 2023 calendar-year performance in Table 5.

3.3. Model Interpretation

CatBoost global importance percentages, classwise SHAP beeswarms, and Table 7 (mean |SHAP| on a capped hold-out subsample; Section 2.2) describe associative contribution magnitudes only. We did not compute bootstrap confidence intervals for feature rankings or global diagnostics, so Table 7 and plot-based orderings must not be read as formal tests of “statistical significance” for the relative ordering of predictors.

Figure 3 reports global feature importance from CatBoost (PredictionValuesChange, expressed as percentage contribution). The largest contributions are observed for voivodeship (15.97%), cyclist_age (9.17%), and offender_years_of_driving (6.84%), followed by crash-context variables such as crash_location_type (6.34%) and temporal descriptors (year, offender_age, hour, month, weekday; each in the ~5–6% range), with additional contributions from lighting (4.19%) and crash_type (4.17%) (see Figure 3 for the full ranking).

Importantly, high global importance does not imply a direct causal effect. Administrative region can aggregate unmeasured contextual differences (infrastructure quality, traffic composition, reporting practices, emergency care access, and exposure intensity). Likewise, age and licensure tenure should be interpreted as markers that may correlate with multiple underlying mechanisms in observational police data. Therefore, Figure 3 should be read as a priority map for associations learned by the model, guiding hypothesis generation rather than definitive mechanism identification.

Figure 4 summarizes directional SHAP contributions for non-fatal severity classes (0–2). Across panels, older cyclist age is generally associated with higher predicted severity within the non-fatal classes (consistent with a fragility-oriented interpretation as an association pattern, not a causal proof of physiology). The offender_years_of_driving variable shows patterns consistent with lower predicted severity when tenure is higher in several panels; however, this should be interpreted cautiously because licensure tenure may proxy for driver age cohorts, risk exposure, and reporting/linkage completeness in the registry.

The built_up indicator frequently aligns with less severe predicted outcomes in these panels, which is consistent with speed–risk narratives if built-up areas proxy lower kinetic-energy environments on average. We treated built_up primarily as a contextual proxy rather than a precise measure of impact speed. Overall, Figure 4 supports the view that human- and context-related variables jointly shift predicted severity within the non-fatal range, but it does not establish that any single factor deterministically prevents fatal outcomes.

Figure 5 presents SHAP summaries for fatal outcomes (classes 3 and 4). In both panels, higher cyclist age is associated with higher predicted probability mass on fatal classes, aligning with prior epidemiological patterns of higher lethality among older cyclists in comparable crash records [6,15,16,17]. Lower offender_years_of_driving is associated with higher predicted fatal severity in these plots as well.

We avoided interpreting these SHAP gradients as evidence that novice drivers fail to execute specific emergency maneuvers in a causal sense. A more conservative interpretation is that lower tenure correlates with fatal severity outcomes in this dataset, potentially reflecting unobserved crash contexts, collision-type mixtures, and missingness structures tied to how offender attributes are recorded. For class 4, crashes outside built-up areas tend to show patterns consistent with higher predicted on-scene fatality, plausibly consistent with higher-speed rural exposure proxies—again without claiming measured impact speeds.

Taken together, Figure 4 and Figure 5 provide transparent, class-specific association profiles that complement global importance. They are most appropriately used to prioritize follow-up analyses and policy questions (e.g., where speed management and vulnerable-user protections matter most), while remaining explicitly non-causal.
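
One simple way to turn a beeswarm's visual direction into a number is to correlate a feature's values with its own SHAP contributions for a given class. The sketch below does this with a hand-written Pearson correlation; it is not the authors' procedure, the ages and SHAP values are invented for illustration, and shap_direction is a hypothetical helper. Records with a missing feature value are skipped, which is itself an assumption given the missingness reported in Table 1.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def shap_direction(feature_values, shap_values):
    """Correlate a feature with its SHAP contributions, skipping missing values."""
    pairs = [(x, s) for x, s in zip(feature_values, shap_values) if x is not None]
    xs, ss = zip(*pairs)
    return pearson(xs, ss)


# Hypothetical cyclist ages and SHAP contributions toward a fatal class.
ages = [22, 35, None, 48, 61, 74]
shap_fatal = [-0.30, -0.12, 0.02, 0.05, 0.21, 0.40]

direction = shap_direction(ages, shap_fatal)
print(direction)
```

A positive value means higher feature values tend to push the prediction toward that class. A linear correlation can hide non-monotone patterns that a beeswarm plot would show, so it complements the plots rather than replacing them.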

4. Discussion

The results indicate that cyclist injury severity in Poland is most strongly associated with cyclist age, motor-vehicle participants’ licensure tenure (offender_years_of_driving), and regional and crash-setting variables, alongside temporal and lighting-related context. This broad pattern aligns with road-safety syntheses emphasizing heterogeneous severity risk across user attributes, traffic context, and roadway environment, and with evidence that flexible learners can reveal non-linear structure that is difficult to capture parsimoniously in conventional regression specifications [1,11,30,31,42].

At the same time, high model reliance on a predictor is not equivalent to a causal effect. Administrative region can aggregate unobserved differences in infrastructure quality, traffic composition, emergency care access, reporting practices, and cycling exposure. Age and licensure tenure should likewise be interpreted as markers that may correlate with crash contexts, injury tolerance, and data completeness, rather than as uniquely identified behavioral or biomechanical mechanisms.

The hold-out evaluation shows modest overall ordinal discrimination combined with a familiar imbalance trade-off: the model attains relatively higher sensitivity for rare fatal severity classes, but precision for those classes remains low, meaning many predicted high-risk cases will be false positives if the output were treated as a deterministic classifier. In operational terms, that false-positive burden would imply substantial follow-up cost for screening, hot-spot ranking, or targeted interventions unless prevalence, utilities, and calibration are explicitly modeled. This cautions against over-interpreting any single global ranking or local explanation as operational truth without external validation, richer covariates (including speed and exposure), and calibration-oriented checks.
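
The size of this false-positive burden follows directly from the precision values in Table 4. The short sketch below recomputes F1 and the expected number of false positives per true positive; the table4 dictionary and the two helper names are illustrative, and the inputs are the rounded values printed in the table.

```python
# Per-class (precision, recall) for CatBoost, copied from Table 4.
table4 = {
    0: (0.854, 0.533),
    1: (0.276, 0.387),
    2: (0.160, 0.343),
    3: (0.055, 0.414),
    4: (0.094, 0.600),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def false_alarms_per_hit(precision):
    """Expected false positives per true positive: FP/TP = (1 - precision) / precision."""
    return (1 - precision) / precision


for severity_class, (precision, recall) in table4.items():
    print(severity_class,
          round(f1_score(precision, recall), 3),
          round(false_alarms_per_hit(precision), 1))
```

For class 3, a precision of 0.055 corresponds to about 17 false alarms for every correctly flagged case, and for class 4 (precision 0.094) to about 10. The recomputed F1 values agree with Table 4 to within the rounding of the tabulated inputs.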

The temporal walk-forward and block-split benchmarks reinforce this caution: ordinal agreement (QWK) for both CatBoost and random forest drops when evaluation respects calendar time, with the weakest performance concentrated around the 2023 test horizon. Such patterns are plausibly driven by evolving crash mixes, enforcement or coding practices, and mobility trends that are not captured as explicit covariates, rather than by a single stable “severity surface” that transfers identically across years. Accordingly, metrics from the stratified random hold-out split should be read as a single same-period reference in which training and test records are drawn from the same years, whereas calendar-year walk-forward folds and broader expanding-window test pools stress out-of-time generalization more stringently.

Similarly, the strong age gradient in predicted severity is consistent with widely reported higher lethality among older cyclists in observational studies [15,43] and biomechanically motivated risk discussions [16], but it does not establish a specific “biological fragility mechanism” from police-registry fields alone. For sustainable mobility, the practical implication is nevertheless clear in policy design terms: if cycling promotion spans age groups, safety investments should explicitly consider greater severity risk among older riders in comparable crash records, including speed management and forgiving infrastructure on higher-risk corridors and conflict points [5,6].

The contrasted role of built-up versus non-built-up contexts is consistent with speed–severity narratives documented in the biomechanics and road-safety literature [13,14,44] and with cyclist studies linking higher-speed rural contexts to worse outcomes [6]. Here, built-up status should be read primarily as a proxy for kinetic-energy environments, not as a substitute for measured impact speed.

Overall, the contribution is best understood as AI-enabled accident analysis for prioritization: combining benchmarked learning, multi-metric evaluation, and explainable diagnostics can help translate large registries into auditable, feature-level hypotheses for where speed-risk and infrastructure treatments may matter most and which human-factor interventions merit trials—while remaining explicit that such tools do not replace causal identification strategies when policy claims require them [1,10,31,42].

Several limitations remain. The study is observational and restricted to police-reported SEWIK records in Poland, without granular information on impact speed, cycling exposure, helmet use, distraction, or detailed infrastructure quality; these omissions can shift the learned associations onto proxy variables. Missing data, especially for offender attributes (offender_years_of_driving is missing in 52.77% of records; Table 1), complicate interpretation and may interact with severity reporting. A proportional-odds ordinal logit on the same sparse one-hot encoding achieved negligible hold-out QWK (≈0.024; Section 2.2), underscoring that not all ordinal specifications are informative here. The ShapValues underlying the compact global ranking were computed only on a stratified hold-out test subsample (n = 2500 of 30,514) for tractability, whereas the fitted ensemble used the full training partition at the manuscript iteration budget (Section 2.2). Finally, while we include logistic-type and tree-based baselines for context, those one-hot pipelines were estimated on a training subsample for computational feasibility; full-training random-forest metrics and multi-seed variability are documented in Section 2.2. Future work should broaden benchmarks, strengthen temporal and external validation, and integrate exposure and network data to move from association patterns toward more decision-ready risk estimates [10,18,19].

The temporal exercises reported here use a fixed CatBoost iteration budget per fold (mirroring the selected manuscript configuration) rather than re-running fold-specific early stopping; they therefore isolate calendar drift under a controlled training budget but do not re-optimize stopping within each training window. Year-to-year QWK variability—especially the 2023 trough—underscores that nationwide registry models can be sensitive to period-specific factors that are only partially observed, so external validation on independent regions or updated cohorts remains essential before any operational use.

5. Conclusions

This study modeled five-level cyclist injury severity using 152,567 police-reported bicycle–vehicle crashes in Poland (2015–2024) with a CatBoost classifier and SHAP-based explanations, complemented by benchmark comparisons and multi-metric evaluation suited to ordinal, imbalanced outcomes. The approach provides transparent, feature-level association patterns that extend purely linear, coefficient-only narratives, while remaining explicit about the non-causal nature of registry-based evidence.

Overall, the analysis highlights pronounced association patterns between higher predicted severity and older cyclist age, shorter offender licensure tenure (a registry proxy for experience-related factors), and contextual variables that plausibly reflect higher kinetic-energy environments (notably patterns consistent with non-built-up settings), alongside substantial regional heterogeneity. These results should be interpreted as risk markers for screening-style prioritization and hypothesis generation, not as identification of unique biological or behavioral mechanisms. In particular, overall ordinal discrimination remains modest on the stratified hold-out, and calendar-based walk-forward evaluation shows additional degradation relative to that single random-split benchmark—especially around 2023—so performance should not be extrapolated as time-invariant without ongoing monitoring. High recall for rare fatal classes coexists with low precision, limiting standalone deployment without external validation and richer covariates (including exposure and speed).

From a sustainable road safety perspective, the practical implication is to prioritize systemic prevention alongside enforcement: protecting older cyclists through safer junction designs and separation on higher-risk corridors, speed management where rural and high-speed mixing is common, and novice-driver training content that emphasizes cyclist-relevant conflicts [45]—while evaluating such interventions with prospective outcome metrics rather than inferring causal impacts directly from this observational severity model.

Future work should integrate cycling exposure and network infrastructure data, link police records to medical/in-depth crash information where possible, broaden ordinal and statistical baselines, and test temporal stability and transferability as mobility technologies and fleets evolve.

Author Contributions

Conceptualization, A.B. and M.C.; methodology, A.B. and M.C.; formal analysis, A.B. and M.C.; investigation, A.B. and M.C.; writing—original draft preparation, A.B. and M.C.; writing—review and editing, A.B. and M.C.; visualization, A.B. and M.C.; supervision, A.B. and M.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. The data were obtained from the General Police Headquarters of Poland (SEWIK registry) and are available from the corresponding authors with the permission of the General Police Headquarters of Poland for scientific research purposes. Analysis code is available from the corresponding authors upon reasonable request.

Acknowledgments

The authors would like to express their gratitude to the Opinion and Analysis Department of the Road Traffic Bureau at the General Police Headquarters of Poland. During the preparation of this work, the authors used Google Gemini 3.1. Pro to improve the English language quality of the manuscript. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the published article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Scarano, A.; Aria, M.; Mauriello, F.; Riccardi, M.R.; Montella, A. Systematic Literature Review of 10 Years of Cyclist Safety Research. Accid. Anal. Prev. 2023, 184, 106996.
2. Utriainen, R.; O’Hern, S.; Pöllänen, M. Review on Single-Bicycle Crashes in the Recent Scientific Literature. Transp. Rev. 2023, 43, 159–177.
3. Moore, D.N.; Schneider, W.H.; Savolainen, P.T.; Farzaneh, M. Mixed Logit Analysis of Bicyclist Injury Severity Resulting from Motor Vehicle Crashes at Intersection and Non-Intersection Locations. Accid. Anal. Prev. 2011, 43, 621–630.
4. Lin, Z.; Fan, W.D. Exploring Bicyclist Injury Severity in Bicycle-Vehicle Crashes Using Latent Class Clustering Analysis and Partial Proportional Odds Models. J. Saf. Res. 2021, 76, 101–117.
5. Asgarzadeh, M.; Verma, S.; Mekary, R.A.; Courtney, T.K.; Christiani, D.C. The Role of Intersection and Street Design on Severity of Bicycle-Motor Vehicle Crashes. Inj. Prev. 2017, 23, 179–185.
6. Filipović, F.; Mladenović, D.; Lipovac, K.; Das, D.K.; Todosijević, B. Determining Risk Factors That Influence Cycling Crash Severity, for the Purpose of Setting Sustainable Cycling Mobility. Sustainability 2022, 14, 13091.
7. Macioszek, E.; Granà, A. The Analysis of the Factors Influencing the Severity of Bicyclist Injury in Bicyclist-Vehicle Crashes. Sustainability 2021, 14, 215.
8. Jaber, A.; Juhász, J.; Csonka, B. An Analysis of Factors Affecting the Severity of Cycling Crashes Using Binary Regression Model. Sustainability 2021, 13, 6945.
9. Dong, X.; Zhang, D.; Wang, C.; Zhang, T. Analysis of Factors Influencing the Degree of Accidental Injury of Bicycle Riders Considering Data Heterogeneity and Imbalance. PLoS ONE 2024, 19, e0301293.
10. Xue, H.; Guo, P.; Li, Y.; Ma, J. Integrating Visual Factors in Crash Rate Analysis at Intersections: An AutoML and SHAP Approach towards Cycling Safety. Accid. Anal. Prev. 2024, 200, 107544.
11. Scarano, A.; Sadeghi, M.; Mauriello, F.; Riccardi, M.R.; Aghabayk, K.; Montella, A. Cyclist Crash Severity Modeling: A Hybrid Approach of XGBoost-SHAP and Random Parameters Logit with Heterogeneity in Means and Variances. J. Saf. Res. 2025, 93, 373–398.
12. Scarano, A.; Riccardi, M.R.; Mauriello, F.; D’Agostino, C.; Montella, A. Mixed Logit Model and Classification Tree to Investigate Cyclists Crash Severity. Traffic Saf. Res. 2025, 9, e000094.
13. Rosén, E.; Sander, U. Pedestrian Fatality Risk as a Function of Car Impact Speed. Accid. Anal. Prev. 2009, 41, 536–542.
14. Tefft, B.C. Impact Speed and a Pedestrian’s Risk of Severe Injury or Death. Accid. Anal. Prev. 2013, 50, 871–878.
15. Chong, S.; Poulos, R.; Olivier, J.; Watson, W.L.; Grzebieta, R. Relative Injury Severity among Vulnerable Non-Motorised Road Users: Comparative Analysis of Injury Arising from Bicycle–Motor Vehicle and Bicycle–Pedestrian Collisions. Accid. Anal. Prev. 2010, 42, 290–296.
16. Schubert, A.; Campolettano, E.T.; Scanlon, J.M.; McMurry, T.L.; Unger, T. Bridging the Gap: Mechanistic-Based Cyclist Injury Risk Curves Using Two Decades of Crash Data. Traffic Inj. Prev. 2024, 25, S105–S115.
17. Swedler, D.I.; Ali, B.; Hoffman, R.; Leonardo, J.; Romano, E.; Miller, T.R. Injury and Fatality Risks for Child Pedestrians and Cyclists on Public Roads. Inj. Epidemiol. 2024, 11, 15.
18. Zhou, N.; Zeng, H.; Xie, R.; Yang, T.; Kong, J.; Song, Z.; Zhang, F.; Liao, X.; Chen, X.; Miao, Q.; et al. Analysis of Road Traffic Accidents and Casualties Associated with Electric Bikes and Bicycles in Guangzhou, China: A Retrospective Descriptive Analysis. Heliyon 2024, 10, e29961.
19. Zhu, T.; Zhu, Z.; Zhang, J.; Yang, C. Electric Bicyclist Injury Severity during Peak Traffic Periods: A Random-Parameters Approach with Heterogeneity in Means and Variances. Int. J. Environ. Res. Public Health 2021, 18, 11131.
20. Hyman, S.C.; Ignacio, R. Road Traffic Injury Prevention: Bicycle. Curr. Trauma Rep. 2024, 10, 53–60.
21. McCartt, A.T.; Mayhew, D.R.; Braitman, K.A.; Ferguson, S.A.; Simpson, H.M. Effects of Age and Experience on Young Driver Crashes: Review of Recent Literature. Traffic Inj. Prev. 2009, 10, 209–219.
22. Fisher, D.L.; Agrawal, R.; Divekar, G.; Hamid, M.A.; Krishnan, A.; Mehranian, H.; Muttart, J.; Pradhan, A.; Roberts, S.; Romoser, M.; et al. Novice Driver Crashes: The Relation between Putative Causal Factors, Countermeasures, Real World Implementations, and Policy—A Case Study in Simple, Scalable Solutions. Accid. Anal. Prev. 2024, 198, 107397.
23. Song, D.; Zhao, J.; Zhu, B.; Han, J.; Jia, S. Subjective Driving Risk Prediction Based on Spatiotemporal Distribution Features of Human Driver’s Cognitive Risk. IEEE Trans. Intell. Transp. Syst. 2024, 25, 16687–16703.
24. Li, Z.; Cai, J.; Chen, Q.; Chen, L.; Qing, M.; Yang, S.X. An LSTM Network with Neural Plasticity for Driver Fatigue Recognition on Real Roads. IEEE Trans. Ind. Electron. 2025, 72, 14668–14676.
25. Johnson, J.M.; Khoshgoftaar, T.M. Survey on Deep Learning with Class Imbalance. J. Big Data 2019, 6, 27.
26. Wang, C.; Serre, T. A Hybrid Approach to Investigating Factors Associated with Crash Injury Severity: Integrating Interpretable Machine Learning with Logit Model. Appl. Sci. 2025, 15, 10417.
27. Benfaress, I.; Bouhoute, A.; Zinedine, A. Enhancing Traffic Accident Severity Prediction Using ResNet and SHAP for Interpretability. AI 2024, 5, 2568–2585.
28. Alotaibi, J. Enhancing Traffic Accident Severity Prediction: Feature Identification Using Explainable AI. Vehicles 2025, 7, 38.
29. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2016; pp. 785–794.
30. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased Boosting with Categorical Features. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31, pp. 6638–6648.
31. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 4765–4774.
32. Kotsyubynska, Y.; Kozan, N.; Chadiuk, V.; Kostyshyn, A.; Kotsyubynsky, A.; Fentsyk, V. Machine Learning and Deep Learning for Predicting Traffic Crash Injury Severity: A Systematic Review and Meta-Analysis (2014–2025). J. Road Saf. 2026, 37, 156042.
33. Johnson, P.M.; Barbour, W.; Camp, J.V.; Baroud, H. Using Machine Learning to Examine Freight Network Spatial Vulnerabilities to Disasters: A New Take on Partial Dependence Plots. Transp. Res. Interdiscip. Perspect. 2022, 14, 100617.
34. Cohen, J. Weighted Kappa: Nominal Scale Agreement Provision for Scaled Disagreement or Partial Credit. Psychol. Bull. 1968, 70, 213–220.
35. Budzyński, A.; Czerepicki, A. Towards Sustainable Road Safety: Feature-Level Interpretation of Injury Severity in Poland (2015–2024) Using SHAP and XGBoost. Sustainability 2025, 17, 8026.
36. Budzyński, A. Interpretable Machine Learning for Driver Fault Attribution in Road Traffic Crashes: Evidence from a Nationwide Police Dataset. Transp. Res. Interdiscip. Perspect. 2025, 34, 101713.
37. Glowinski, S.; Rzepczyk, S.; Obst, M. Trends in Bicycle Accidents and Injury Analysis in Poland: Insights from 2016 to 2023. Safety 2025, 11, 32.
38. Budzyński, A.; Sładkowski, A. Participant-Level Injury Outcome Prediction in Road Traffic Incidents Using Machine Learning: A Case Study in Poland. In Problems of Logistics, Management and Operation in the East-West Transport Corridor; Abbasov, A., Sładkowski, A., Babayev, T., Eds.; Communications in Computer and Information Science; Springer: Cham, Switzerland, 2026; Volume 2767, pp. 82–93.
39. Budzyński, A. Evaluating the Impact of Protective Equipment on Child Injury Severity in Road Traffic Crashes: An Explainable Machine Learning and Counterfactual Analysis Approach. Int. J. Inj. Control Saf. Promot. 2025, 1–12.
40. Prechelt, L. Early Stopping—But When? In Neural Networks: Tricks of the Trade; Montavon, G., Orr, G.B., Müller, K.-R., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7700, pp. 53–67. ISBN 978-3-642-35288-1.
41. He, H.; Ma, Y. (Eds.) Imbalanced Learning: Foundations, Algorithms, and Applications; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2013; ISBN 978-0-470-62609-0.
42. Dong, S.; Khattak, A.; Ullah, I.; Zhou, J.; Hussain, A. Predicting and Analyzing Road Traffic Injury Severity Using Boosting-Based Ensemble Learning Models with SHapley Additive exPlanations. Int. J. Environ. Res. Public Health 2022, 19, 2925.
43. Boufous, S.; De Rome, L.; Senserrick, T.; Ivers, R. Risk Factors for Severe Injury in Cyclists Involved in Traffic Crashes in Victoria, Australia. Accid. Anal. Prev. 2012, 49, 404–409.
44. Rosén, E.; Stigson, H.; Sander, U. Literature Review of Pedestrian Fatality Risk as a Function of Car Impact Speed. Accid. Anal. Prev. 2011, 43, 25–33.
45. Fisher, D.L.; Pollatsek, A.P.; Pradhan, A. Can Novice Drivers Be Trained to Scan for Information That Will Reduce Their Likelihood of a Crash? Inj. Prev. 2006, 12, i25–i29.

Figure 1. Distribution of cyclist injury severity levels in the analyzed dataset.

Figure 2. Row-normalized confusion matrix for the five-class injury severity model (CatBoost, hold-out evaluation).

Figure 3. Global feature importance (CatBoost PredictionValuesChange), expressed as percentage contribution.

Figure 4. Directional SHAP impact analysis for non-lethal outcomes: (a) Class 0 (No injury); (b) Class 1 (Slight injury); (c) Class 2 (Serious injury).

Figure 5. SHAP beeswarm plots for lethal outcomes: (a) Class 3 (Fatal within 30 days); (b) Class 4 (Fatal at scene).

Table 1. Percentage of records with missing values (selected predictors, full modeling file, n = 152,567).

Predictor	Missing (%)	Missing (n)	Total (n)
offender_years_of_driving	52.77	80,508	152,567
offender_age	29.71	45,328	152,567
cyclist_age	23.21	35,426	152,567
month	19.13	29,186	152,567
hour	19.13	29,186	152,567
weekday	19.13	29,186	152,567
female	4.08	6227	152,567
speed_limit	0.03	46	152,567

Table 2. Hold-out benchmark summary (QWK, macro-F1, weighted-F1).

Model	QWK	Macro-F1	Weighted-F1
Random forest	0.2353	0.3027	0.6452
CatBoost	0.2038	0.2911	0.5515
Multinomial logistic regression	0.0704	0.0727	0.0495

Table 3. Hyperparameter search results using 5-fold stratified cross-validation (mean QWK).

Configuration	Tree Depth	Learning Rate	L2 Leaf Reg	Mean QWK Score
1	6	0.1	3	0.1869
2	8	0.05	3	0.1923
3	8	0.05	5	0.1923
4 (Optimal)	10	0.03	5	0.1991

Table 4. Per-class precision, recall, and F1 on the hold-out test set (CatBoost).

Class	Precision	Recall	F1
0	0.854	0.533	0.656
1	0.276	0.387	0.323
2	0.160	0.343	0.218
3	0.055	0.414	0.097
4	0.094	0.600	0.162

Table 5. Walk-forward evaluation by test year (QWK). Train n and test n are cyclist-level analytical record counts in each fold.

Test Year	Train n	Test n	RF QWK	CatBoost QWK
2019	63,580	16,241	0.1626	0.2014
2020	79,821	14,770	0.1535	0.2354
2021	94,591	15,164	0.1816	0.2160
2022	109,755	13,630	0.1609	0.2107
2023	123,385	13,983	0.0526	0.0910
2024	137,368	15,199	0.1592	0.1822

Table 6. Expanding-window block splits (QWK).

Split (Train/Test Years)	Train n	Test n	RF QWK	CatBoost QWK
through 2018/from 2019 onward	63,580	88,987	0.1308	0.2133
through 2020/from 2021 onward	94,591	57,976	0.1325	0.2082
through 2022/from 2023 onward	123,385	29,182	0.0536	0.0968

Table 7. Global mean absolute SHAP for the top ten predictors (CatBoost ShapValues; stratified hold-out subsample n = 2500).

Rank	Feature	Mean abs SHAP
1	cyclist_age	0.241
2	voivodeship	0.141
3	crash_location_type	0.125
4	offender_years_of_driving	0.116
5	female	0.099
6	crash_type	0.098
7	offender_female	0.081
8	built_up	0.078
9	offender_age	0.062
10	month	0.054
