1. Introduction
Cycling is increasingly promoted as a cornerstone of sustainable and healthy mobility, yet its expansion is repeatedly constrained by safety concerns, in particular the risk of severe injury in collisions with motor vehicles. The cyclist-safety literature has grown rapidly over the last decade, mapping a broad landscape of determinants spanning behavior, infrastructure, and vehicle–road interactions [
1]. At the same time, large differences in reporting practices, exposure measurement, and crash typologies (e.g., single-bicycle vs. bicycle–vehicle events) continue to produce non-trivial uncertainty about what matters most for preventing death and disabling trauma [
2].
However, evidence remains mixed on the relative emphasis of infrastructure and speed management versus behavioral, licensing, and enforcement-oriented measures [
3,
4,
5]. In practice, the policy-relevant signal is often context dependent: built-environment characteristics, vehicle mix, and speed environments can be strongly associated with severe outcomes in some samples, while demographic and behavioral factors appear prominent in others [
3,
4]. For sustainable road safety, the key implication is that single-factor narratives are risky: analytical designs should allow non-linearities and interactions while reporting performance transparently, especially when severe and fatal outcomes are rare [
1,
5].
Within Europe, recent sustainability-oriented crash severity studies reinforce that cyclist outcomes are associated with multi-level interactions among road environment, time/lighting, and participant attributes, but they disagree on which factors remain robust once heterogeneity is accounted for [
6,
7,
8]. For example, logistic-regression-based evidence from large cyclist crash samples highlights strong age gradients and higher fatal odds outside built-up areas, consistent with the well-known speed–severity relationship [
6]. In parallel, regional studies focusing on bicyclist–vehicle collisions identify driver and cyclist alcohol involvement, vehicle type (e.g., trucks), and time-of-day as prominent correlates of severe injury [
7,
8].
To address heterogeneity, modern injury-severity research increasingly combines segmentation ideas (latent class or clustering) with flexible statistical models, showing that predictors can change direction and magnitude across latent crash contexts [
4,
9]. Similarly, recent cyclist-safety work integrates machine-learning (ML) pipelines with interpretation layers to discover non-linear effects and interactions that conventional models may miss [
10,
11,
12]. For instance, AutoML frameworks paired with SHAP explanations have been used to quantify how visual street-scene features relate to cycling crash rates at intersections, offering a bridge from smart-mobility sensing to actionable infrastructure insights [
10]. Hybrid approaches that combine XGBoost–SHAP screening with random-parameters or mixed logit models demonstrate that interpretability and statistical rigor can be complementary rather than competing goals [
11,
12].
A central theme in the cyclist-injury literature is that fatal and severe outcomes are strongly associated with impact energy, which is often proxied in observational data by speed environment, rurality, and road class [
13,
14]. Although many speed–risk curves are estimated for pedestrians, cyclist studies commonly report analogous monotonic relationships, where higher impact speeds are biomechanically consistent with a higher probability of severe trauma relative to tolerance thresholds [
14]. Consistent with this evidence base, cyclist-specific studies repeatedly identify higher posted speeds and non-built-up (rural) settings as strong correlates of severe injury and death [
6].
Cyclist vulnerability is not uniform across age groups. A large body of observational research reports higher fatality and serious-injury risk among older cyclists, even when comparing broadly similar crash contexts [
6,
15,
16,
17]. This pattern is commonly discussed using age-related fragility language; however, in police-registry studies, age should be interpreted primarily as a marker that can correlate with physiological vulnerability, exposure contexts, trip characteristics, and other unobserved factors rather than as direct proof of a specific biological mechanism [
15,
16]. From a sustainability perspective, population aging alongside policies encouraging cycling increases the salience of equity-sensitive safety monitoring: severity models should therefore combine interpretable learning with explicit uncertainty and limitation framing [
1,
16].
Another rapidly evolving dimension is technology-driven mode change, including the growth of e-bikes, which may alter exposure, speeds, and rider demographics [
18,
19,
20]. Empirical evidence suggests that e-bike and bicycle crashes exhibit distinct patterns of casualties and that injuries remain strongly associated with human and environmental factors (e.g., visibility, vehicle type, and speed context). These trends reinforce the need for analytical approaches that remain valid under shifting mobility regimes—an explicit priority within sustainable and smart transportation systems, where digitalized infrastructures and AI-supported safety management are increasingly used to support road safety practice.
A second focus of this paper is driver experience, commonly proxied by years since licensure in administrative data. Traffic-safety research documents learning-curve patterns in crash involvement among young drivers as a function of age and licensure experience [
21]. Yet for injury severity conditional on a cyclist–vehicle collision having occurred, the incremental value of experience measures—and their interpretation—is less standardized: experience may correlate with crash contexts and risk profiles, and registry variables may partially reflect measurement and selection processes rather than a clean behavioral construct [
22]. Accordingly, we treat experience-related findings as associational evidence compatible with a hypothesized mitigation channel (e.g., anticipatory behavior reducing impact severity), rather than as causal identification of an “experience buffer.” Related engineering-oriented work models subjective driving risk from cognitive spatiotemporal features [
23] and driver fatigue from on-road signals with recurrent architectures [
24]; these strands illustrate ML applied to human-state and risk perception in traffic safety, but they rely on signals and tasks outside sparse police-tabular cyclist severity coding.
Methodologically, national crash registries require models that accommodate high-dimensional categorical fields, severe class imbalance [
25], and ordinal severity scales, while still supporting policy-relevant transparency. Recent work pairs ensemble learners with explainability tools such as SHAP to summarize non-linear patterns and heterogeneous effects [
26,
27,
28,
29,
30,
31]. At the same time, reviews highlight recurring pitfalls in severity modeling—especially optimistic reporting based on a single metric and limited comparability across studies—calling for multi-metric evaluation and benchmarking against credible baselines [
32,
33,
34]. Building on these insights, we implement an explainable gradient-boosting framework (CatBoost) with SHAP-based diagnostics, and we report quadratic weighted kappa, macro- and weighted-F1, class-wise precision/recall/F1, confusion matrices, and comparisons against additional benchmark models to make trade-offs between sensitivity to rare fatal classes and false-positive risk explicit. Despite growing international use of explainable ML in road-safety research, nationwide cyclist-specific evidence for Poland remains comparatively limited, particularly studies that jointly incorporate cyclist attributes, motor-vehicle participant characteristics (including licensure tenure proxies), and crash context within a single modeling pipeline [
35,
36,
37]. Prior Polish ML applications often target broader road-user populations or related classification tasks, which can obscure cyclist-specific severity heterogeneity [
35,
36]. Nationwide SEWIK studies have linked injury outcomes to participant roles, fault-related indicators, and registry road- and environment-context fields [
35,
36,
37], including participant-level ensemble analyses [
38] and, separately, child severity work with explainable boosting and counterfactual perspectives on protective equipment [
39]. This paper contributes a cyclist-centered national analysis with an emphasis on reproducible evaluation and interpretable outputs aligned with data-driven accident analysis within sustainable and smart mobility agendas, where AI is used to support monitoring, prioritization, and evaluation—without overstating causal conclusions from observational records. Beyond sample size alone, the revision adds calendar-based out-of-time stress testing (walk-forward test years and complementary expanding-window splits), transparent quantification of missingness for predictors that condition interpretation of offender- and time-related signals, an ordinal logit stress test that clarifies limits of linear ordinal specifications on the same sparse encoding as the classical baselines, multi-baseline ensemble and sensitivity checks beyond the primary split, and a compact tabular summary of mean absolute SHAP contributions computed on a capped stratified hold-out subsample for tractability.
Section 2 and
Section 3 set out the estimators, data handling, evaluation designs, and numerical results.
This study examines association patterns between cyclist injury severity (five ordered police-reported classes) and crash, participant, and context variables in bicycle–vehicle collisions in Poland (2015–2024). We estimated a CatBoost severity model and interpreted predictions using SHAP, complemented by benchmark comparisons and class-wise performance reporting under extreme imbalance. Our aims were as follows: (i) to summarize which factors are most strongly associated with higher severity in this registry; (ii) to compare CatBoost against credible baselines using complementary metrics that separate ordinal agreement, overall discrimination, and rare-class sensitivity; and (iii) to translate explainable outputs into exploratory policy hypotheses for cyclist protection and speed-risk management—subject to external validation—while clearly stating data limitations (including proxy variables such as administrative region and built-up indicators) and the non-causal nature of the evidence.
2. Materials and Methods
2.1. Data Source and Characteristics
The primary data source is the National Road Traffic Accident and Collision Registry (SEWIK), the official police-maintained national road traffic accident and collision registry. SEWIK provides standardized information on crash circumstances, infrastructure descriptors, environmental conditions, and participant attributes for each reported event.
For this study, we constructed an analytical sample covering 2015–2024 (10 calendar years). Crashes were retained if the record corresponded to a bicycle-involved event in the sense used for cyclist safety analysis (i.e., incidents where a cyclist participant is present in the extracted cyclist-focused analytical dataset). The final modeling file contains 152,567 cyclist-involved crash records after dataset construction and filtering steps applied in the preparation pipeline (including linkage of cyclist and motor-vehicle participant attributes where available).
Outcome variable. Injury severity was coded as an ordinal five-class outcome based on the police-reported injury categories: (0) no injury, (1) slight injury, (2) serious injury, (3) fatal within 30 days, and (4) fatal at the scene. We retained the full ordinal scale (rather than collapsing categories) to preserve policy-relevant distinctions along the injury spectrum while acknowledging that class frequencies are highly imbalanced (with a large majority of lower-severity outcomes).
Predictors (illustrative grouping). Each record includes variables describing the following:
temporal context (year, month, weekday, hour);
environmental conditions (e.g., lighting, weather, road-surface state);
infrastructure and location context (e.g., road type, traffic control, built-up indicator, intersection-related descriptors);
participant attributes for the cyclist and the motor-vehicle party where recorded (e.g., age, sex, and—when available—license-holding tenure as a proxy for driving experience).
Missingness and registry limitations. Administrative registries commonly contain item-level missingness and selective reporting.
Table 1 summarizes item-level missing percentages in the processed modeling file (n = 152,567 cyclist-level records as in
Section 2.1). Offender-side fields are the most incomplete: offender_years_of_driving is missing in about 53% of records and offender_age in about 30%. Cyclist age is missing in about 23%, and the co-recorded temporal fields month, hour, and weekday each show about 19% missing (identical rates reflect joint gaps in time-of-crash coding). The cyclist sex indicator female is missing in about 4%. These patterns motivate cautious interpretation of offender-related SHAP signals and any contrasts involving time-of-day when reading the association results below.
2.2. Model Specification, Training, and Benchmark Comparisons
Severity prediction was implemented using CatBoost, a gradient boosting framework based on oblivious trees that is well suited to high-cardinality categorical fields and mixed-type crash-registry features. Relative to conventional one-hot expansion, CatBoost reduces the risk of sparse high-dimensional encodings for categorical predictors while retaining non-linear capacity for interactions commonly present in injury-severity tasks [
30]. In this registry setting, CatBoost also provides a practical route to global and local explainability via SHAP-based diagnostics (detailed under model interpretation).
XGBoost remains a strong default for many tabular problems [
29], but it typically relies on explicit encodings for high-cardinality categorical fields, which can explode dimensionality in national registries or require additional engineering to keep sparsity under control. CatBoost’s ordered boosting and native categorical treatment target exactly those registry columns without rebuilding a separate one-hot matrix for the primary learner [
30], and—paired with the same ShapValues interface used in our figure pipeline—it gives a stable implementation path from training to class-specific explanations on unchanged feature definitions. We therefore retained CatBoost as the primary model and did **not** add a fresh full-sample XGBoost retrain in this revision, because the marginal insight for the stated aims was judged small relative to the already-reported one-hot baselines and temporal comparators.
Software and computing environment. Analyses were implemented in Python 3.11. The primary severity model used CatBoost 1.2.10. Class-wise SHAP summaries used CatBoost’s native ShapValues implementation (get_feature_importance). Benchmark models used scikit-learn 1.8.0 (random forest and multinomial logistic regression with median/mode imputation and one-hot encoding) and statsmodels 0.14.6 (proportional-odds ordered logit). Data processing relied on pandas 3.0.2 and NumPy 2.4.4. No specialized experimental equipment was used; computations were performed on standard workstations at the Silesian University of Technology, Katowice, Poland.
Training used the multi-class objective with quadratic weighted kappa (QWK) as the guiding performance criterion during model development. QWK is appropriate for ordinal severity scales because misclassifications are penalized more heavily when the predicted class is farther from the true ordered class [
34]. Because the severity distribution is highly skewed (the no-injury class comprises about 72% of observations), we applied class reweighting using CatBoost’s auto_class_weights = Balanced mechanism to maintain sensitivity to rare serious and fatal outcomes, while recognizing that this typically increases false positives for minority classes (evaluated explicitly in Results using per-class precision/recall/F1 and confusion matrices).
Hyperparameter search (transparent grid). Hyperparameters were selected using 5-fold stratified cross-validation over four pre-specified configurations (i.e., four distinct hyperparameter tuples evaluated by CV mean QWK):
- (1)
depth = 6, learning rate = 0.10, L2 leaf regularization = 3
- (2)
depth = 8, learning rate = 0.05, L2 leaf regularization = 3
- (3)
depth = 8, learning rate = 0.05, L2 leaf regularization = 5
- (4)
depth = 10, learning rate = 0.03, L2 leaf regularization = 5
The best-performing configuration was depth = 10, learning rate = 0.03, and L2 leaf regularization = 5.
Final model fitting and reporting used a stratified holdout split (80%/20%, random seed 42), yielding 122,053 training records and 30,514 test records. The final CatBoost model training used early stopping [
40] on the holdout evaluation signal; training terminated when validation loss did not improve for 100 consecutive iterations, with the best iteration retained (best iteration 648 in the fitted model used here).
Benchmark models (reviewer-requested comparators). To contextualize CatBoost performance, we report additional statistical and machine-learning baselines evaluated on the same holdout test set:
Multinomial logistic regression with median imputation for numeric predictors, mode imputation for categoricals, one-hot encoding (with a minimum frequency threshold to control sparsity), and class weighting;
Random forest with the same preprocessing pipeline and class-balanced subsampling.
For computational feasibility given the high-dimensional one-hot expansion, these two baselines were estimated on a stratified random subsample of 30,000 observations drawn from the training partition (seed 42), while performance is still reported on the full 30,514 holdout observations to preserve an unbiased test comparison. The primary CatBoost model, in contrast, was trained on all 122,053 records in the training partition (same stratified split). As a transparency check on how much the subsample constraint moves the random-forest comparator, we additionally fit a random forest on the full training partition with identical preprocessing and test evaluation; that auxiliary run achieved hold-out QWK ≈ 0.301 (see
Section 2.2), compared with QWK = 0.2353 for the subsampled random forest in
Table 2. Repeating the full-training random forest for three independent stratified splits (random seeds 42–44) yielded mean ± standard deviation QWK of 0.307 ± 0.010 on the hold-out (see
Section 2.2), indicating moderate variability with the partition draw but consistently higher ordinal agreement than the
Table 2 subsampled row. We also fitted a proportional-odds (ordered logit) specification in statsmodels as an ordinal statistical baseline on the full training split using the same one-hot feature matrix as the logistic benchmark; hold-out QWK was about 0.024—far below both gradient boosting and random forest—so we treated it as a lower-bound comparator illustrating the limits of a linear ordinal model on this sparse registry encoding rather than a competitive alternative.
We did not rely on synthetic oversampling (e.g., SMOTE-type resampling) in the primary pipeline; imbalance was addressed primarily through class weighting and multi-metric evaluation [
41], with sensitivity analyses left for robustness discussion where applicable.
Beyond QWK, the
Section 3 reports macro- and weighted-average F1, class-wise precision/recall/F1, and confusion matrices (counts and row-normalized) to make trade-offs between rare-class sensitivity and false-positive risk explicit [
32,
33,
34].
Temporal and stress-test evaluation (out-of-time generalization). In addition to the stratified hold-out benchmark above, we assessed how models behave when training and test periods differ systematically by calendar year. We implemented a walk-forward scheme: for each calendar test year from 2019 through 2024, models were trained on all records with crash year strictly before that test year and evaluated on records with crash year equal to the test year. This mimics a retrospective deployment where past years inform prediction on a future calendar year and provides a stronger stress test than a single random split because severity mixes, reporting practices, and exposure contexts can shift over time.
We report the same preprocessing pipeline as for the random-forest benchmark (median/mode imputation and one-hot encoding for the RF comparator; categorical handling as for CatBoost in the primary pipeline). The random forest used 300 trees with class_weight = balanced_subsample (other settings as in the benchmark script). The CatBoost comparator used the same fixed hyperparameters as the manuscript’s selected configuration (depth 10, learning rate 0.03, L2 leaf regularization 5, 649 boosting iterations, multi-class loss, auto_class_weights = Balanced, random seed 42), fitted on the training-year subset only without fold-specific early stopping on a withheld validation slice (the temporal exercise isolates calendar drift rather than replication of the hold-out tuning protocol). For numerical stability on large training folds, CatBoost training additionally used used_ram_limit = 8 GB and thread_count = 2 (as stated in this subsection).
As complementary stress tests, we evaluated three expanding-window “block” splits: (i) train on crash years through 2018 and test on years from 2019 onward; (ii) train through 2020 and test from 2021 onward; (iii) train through 2022 and test from 2023 onward. These splits emphasize performance under larger temporal gaps and shifting covariate mixes. Full fold-wise metrics (including macro- and weighted-F1 and fit timings) are summarized in
Section 3 below.
Quantitative SHAP export. For a compact tabular complement to thank you for the lesson, we summarized global mean absolute native ShapValues (CatBoost get_feature_importance with ShapValues, using the same tensor layout as in the analysis notebook for the main SHAP figures). This summary is not “SHAP on the full 30,514-row hold-out” nor “SHAP on the training rows”: the CatBoost model was fitted on the full 122,053-row training partition with the manuscript configuration (649 boosting iterations; depth 10; learning rate 0.03; L2 leaf regularization 5; auto_class_weights = Balanced; same stratified 80/20 split and seed 42 as the reported benchmarks), while ShapValues for the tabulation were computed only on a stratified subsample of the hold-out test partition (here n = 2500, with a configurable cap on the number of hold-out rows evaluated). Thus, the global mean-|SHAP| ranking is tied to that subsample and is not an exhaustive average over all hold-out predictions. The tabulated ranking therefore reflects the same fully trained CatBoost model as in the benchmarks above—not a smaller or faster-training variant. The only shortcut is the evaluation of ShapValues on a capped subset of hold-out rows.
4. Discussion
The results indicate that cyclist injury severity in Poland is most strongly associated with cyclist age, motor-vehicle participants’ licensure tenure (offender_years_of_driving), and regional and crash-setting variables, alongside temporal and lighting-related context. This broad pattern aligns with road-safety syntheses emphasizing heterogeneous severity risk across user attributes, traffic context, and roadway environment, and with evidence that flexible learners can reveal non-linear structure that is difficult to capture parsimoniously in conventional regression specifications [
1,
11,
30,
31,
42].
At the same time, high model reliance on a predictor is not equivalent to a causal effect. Administrative region can aggregate unobserved differences in infrastructure quality, traffic composition, emergency care access, reporting practices, and cycling exposure. Age and licensure tenure should likewise be interpreted as markers that may correlate with crash contexts, injury tolerance, and data completeness, rather than as uniquely identified behavioral or biomechanical mechanisms.
The hold-out evaluation shows modest overall ordinal discrimination combined with a familiar imbalance trade-off: the model attains relatively higher sensitivity for rare fatal severity classes, but precision for those classes remains low, meaning many predicted high-risk cases will be false positives if the output were treated as a deterministic classifier. In operational terms, that false-positive burden would imply substantial follow-up cost for screening, hot-spot ranking, or targeted interventions unless prevalence, utilities, and calibration are explicitly modeled. This cautions against over-interpreting any single global ranking or local explanation as operational truth without external validation, richer covariates (including speed and exposure), and calibration-oriented checks.
The temporal walk-forward and block-split benchmarks reinforce this caution: ordinal agreement (QWK) for both CatBoost and random forest drops markedly when evaluation respects calendar time, with the weakest performance concentrated around the 2023 test horizon. Such patterns are plausibly driven by evolving crash mixes, enforcement or coding practices, and mobility trends that are not captured as explicit covariates, rather than by a single stable “severity surface” that transfers identically across years. Accordingly, metrics from the stratified random hold-out split should be read as a single in-sample reference, whereas calendar-year walk-forward folds and broader expanding-window test pools stress out-of-time generalization more stringently.
Similarly, the strong age gradient in predicted severity is consistent with widely reported higher lethality among older cyclists in observational studies [
15,
43] and biomechanically motivated risk discussions [
16], but it does not establish a specific “biological fragility mechanism” from police-registry fields alone. For sustainable mobility, the practical implication is nevertheless clear in policy design terms: if cycling promotion spans age groups, safety investments should explicitly consider greater severity risk among older riders in comparable crash records, including speed management and forgiving infrastructure on higher-risk corridors and conflict points [
5,
6].
The contrasted role of built-up versus non-built-up contexts is consistent with speed–severity narratives documented in the biomechanics and road-safety literature [
13,
14,
44] and with cyclist studies linking higher-speed rural contexts to worse outcomes [
6]. Here, built-up status should be read primarily as a proxy for kinetic-energy environments, not as a substitute for measured impact speed.
Overall, the contribution is best understood as AI-enabled accident analysis for prioritization: combining benchmarked learning, multi-metric evaluation, and explainable diagnostics can help translate large registries into auditable, feature-level hypotheses for where speed-risk and infrastructure treatments may matter most and which human-factor interventions merit trials—while remaining explicit that such tools do not replace causal identification strategies when policy claims require them [
1,
10,
31,
42].
Several limitations remain. The study is observational and restricted to police-reported SEWIK records in Poland, without granular information on impact speed, cycling exposure, helmet use, distraction, and detailed infrastructure quality—omissions that can steer the estimated effects toward proxies. Missing data, especially for offender attributes, complicates interpretation and may interact with severity reporting. A proportional-odds ordinal logit on the same sparse one-hot encoding achieved negligible hold-out QWK (≈0.024;
Section 2.2), underscoring that not all ordinal specifications are informative here. ShapValues underlying the compact global ranking were computed only on a stratified hold-out test subsample (n = 2500 of 30,514) for tractability, whereas the fitted ensemble used the full training partition at the manuscript iteration budget (
Section 2.2). Finally, while we include logistic-type and tree-based baselines for context, those one-hot pipelines were estimated on a training subsample for computational feasibility; full-training random-forest metrics and multi-seed variability are documented in
Section 2.2. Future work should broaden benchmarks, strengthen temporal and external validation, and integrate exposure and network data to move from association patterns toward more decision-ready risk estimates [
10,
18,
19].
The temporal exercises reported here use a fixed CatBoost iteration budget per fold (mirroring the selected manuscript configuration) rather than re-running fold-specific early stopping; they therefore isolate calendar drift under a controlled training budget but do not re-optimize stopping within each training window. Year-to-year QWK variability—especially the 2023 trough—underscores that nationwide registry models can be sensitive to period-specific factors that are only partially observed, so external validation on independent regions or updated cohorts remains essential before any operational use.
5. Conclusions
This study modeled five-level cyclist injury severity using 152,567 police-reported bicycle–vehicle crashes in Poland (2015–2024) with a CatBoost classifier and SHAP-based explanations, complemented by benchmark comparisons and multi-metric evaluation suited to ordinal, imbalanced outcomes. The approach provides transparent, feature-level association patterns that extend purely linear, coefficient-only narratives, while remaining explicit about the non-causal nature of registry-based evidence.
Overall, the analysis highlights pronounced association patterns between higher predicted severity and older cyclist age, shorter offender licensure tenure (a registry proxy for experience-related factors), and contextual variables that plausibly reflect higher kinetic-energy environments (notably patterns consistent with non-built-up settings), alongside substantial regional heterogeneity. These results should be interpreted as risk markers for screening-style prioritization and hypothesis generation, not as identification of unique biological or behavioral mechanisms. In particular, overall ordinal discrimination remains modest on the stratified hold-out, and calendar-based walk-forward evaluation shows additional degradation relative to that single random-split benchmark—especially around 2023—so performance should not be extrapolated as time-invariant without ongoing monitoring. High recall for rare fatal classes coexists with low precision, limiting standalone deployment without external validation and richer covariates (including exposure and speed).
From a sustainable road safety perspective, the practical implication is to prioritize systemic prevention alongside enforcement: protecting older cyclists through safer junction designs and separation on higher-risk corridors, speed management where rural and high-speed mixing is common, and novice-driver training content that emphasizes cyclist-relevant conflicts [
45]—while evaluating such interventions with prospective outcome metrics rather than inferring causal impacts directly from this observational severity model.
Future work should integrate cycling exposure and network infrastructure data, link police records to medical/in-depth crash information where possible, broaden ordinal and statistical baselines, and test temporal stability and transferability as mobility technologies and fleets evolve.