Figure 1.
End-to-end pipeline of the FDTM framework for athlete injury risk assessment.
Figure 2.
Detailed neural architecture of the FDTM model.
Figure 3.
Injury-risk prediction performance (AUC-ROC) across the three public datasets, FDTM vs. the nine original baselines. Error bars represent 95% confidence intervals from 5-fold stratified cross-validation.
Figure 4.
Extended baseline comparison (16 methods) on the three public datasets, including the seven recent state-of-the-art time-series baselines added per Reviewer 1, comment 5 (TimesNet, PatchTST, FEDformer, Autoformer, Informer, DLinear, N-BEATS). The dashed vertical line in each panel marks the strongest baseline (FEDformer). FDTM (red, bold border) outperforms every method on every dataset.
Figure 5.
Seed-variance robustness analysis: 30 independent training runs per method on each dataset. FDTM’s advantage is consistent and statistically robust across all three datasets (paired Wilcoxon test, n = 30; Cohen’s d = 2.58/3.86/4.15). Added in response to Reviewer 1, comment 15.
Figure 6.
Receiver-operating characteristic (top row) and precision–recall (bottom row) curves on the three public datasets. PR curves are particularly informative given the severe class imbalance (3.8% positive rate).
Figure 7.
Comprehensive ablation study isolating the marginal contribution of each FDTM component. Panel (a) absolute performance per variant; panel (b) percentage-point change relative to the full FDTM model.
Figure 8.
Interpretability analysis of FDTM. Panel (a) top-15 SHAP feature attributions; panel (b) frequency-band importance across the three datasets.
Figure 9.
Empirical validation of the spectral-physiology hypothesis. Top row: mean power spectral density (PSD) of the per-game load signal in injury-preceding vs. injury-free windows, with 95% CI shading. Bottom row: band-integrated power with Mann–Whitney U test p-values and rank-biserial effect sizes (rb). Injury-preceding windows exhibit significantly elevated weekly-band power across all three datasets, providing model-independent evidence for the SHAP-based ranking shown in Figure 8. Asterisks denote statistical significance (* ; ****); ns denotes not significant. Added in response to Reviewer 1, comments 4 and 10.
Figure 10.
Hyperparameter sensitivity analysis on the three public datasets. Optimal values (dashed vertical lines) coincide across datasets, supporting cross-sport transferability of the configuration. The asterisk marks the selected optimum in each panel.
Figure 11.
Temporal self-attention weight visualization for two representative NBA players. Player A (left) exhibits attention concentration on games 22–26, corresponding to a workload spike preceding the injury; Player B (right) shows a uniformly distributed pattern.
Figure 12.
Calibration reliability of FDTM injury-risk scores. (a–c) Reliability diagrams across the three public datasets (NBA/AFL/SoccerMon), extended in response to Reviewer 2; FDTM stays closest to the identity line on every dataset with ECE values of 0.014, 0.018, and 0.021 respectively, an order of magnitude smaller than baselines (LSTM-only, random forest).
Figure 13.
Decision Curve Analysis (DCA) of FDTM against the strongest baselines (FEDformer, TCN) and the Treat-all/Treat-none reference strategies across the three public datasets. The shaded red band marks the clinically relevant threshold range (0.10–0.30). FDTM exhibits the highest net benefit on every dataset across this range. Added in response to Reviewer 2, second-round comment 1.
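The net-benefit quantity plotted in a decision curve can be sketched as follows. This is a minimal illustration of the standard DCA formula, net benefit = TP/N − (FP/N) · p_t/(1 − p_t), not the authors' evaluation code; the function names and the toy labels and scores are hypothetical.

```python
def net_benefit(y_true, y_score, threshold):
    """Net benefit of alerting whenever the risk score reaches `threshold`."""
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: alert on every observation."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data (hypothetical, for illustration only): 2 injuries in 10 observations.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
y_score = [0.80, 0.10, 0.20, 0.05, 0.40, 0.35, 0.10, 0.02, 0.15, 0.07]

for pt in (0.10, 0.20, 0.30):  # thresholds inside the 0.10-0.30 band
    print(pt, net_benefit(y_true, y_score, pt), net_benefit_treat_all(y_true, pt))
```

The Treat-none strategy has a net benefit of zero at every threshold, so a model is useful at a given threshold only if its curve lies above both zero and the Treat-all curve.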
Figure 14.
Leave-one-sport-out external validation of FDTM (formerly Figure 10, framed as “cross-dataset generalization” in the first-round manuscript). (a) Transfer matrix showing AUC-ROC when training on one dataset and evaluating on another with no fine-tuning. (b) Zero-shot transfer drops 3.1–4.9 pp relative to in-domain training.
Figure 15.
Clinical-subgroup-stratified performance analysis. (a) AUC-ROC stratified by injury type across the three datasets. (b) NBA performance heatmap stratified by anatomical region and severity.
Figure 16.
Computational efficiency and deployment feasibility. (a) Trade-off between parameter count and NBA AUC-ROC: FDTM occupies the Pareto frontier with only 2.4 M parameters. (b) Single-sample inference latency across cloud, edge, and mobile hardware.
Figure 17.
Capacity and overfitting analysis of FDTM. (a) Parameter breakdown across modules (total ≈ 2.4 M). (b) NBA learning curves over 30 independent seeds: train-vs-validation AUC-ROC, with a final gap of 1.3 pp. (c) NBA focal-loss curves: the validation loss does not turn upward, indicating no late-stage overfitting. (d) AFL learning curves (train–val gap 1.8 pp). (e) SoccerMon learning curves (train–val gap 2.1 pp). (f) Data-ablation curve: test AUC-ROC as a function of training-data fraction, exhibiting a clear plateau between 75% and 100% on all three datasets (newly added in response to Reviewer 2, second-round comment 3).
Figure 18.
Quantitative clinical impact projection. (a) Projected avoidable injuries per team-season as a function of intervention efficacy across the three sports, with vertical reference lines at 20% (conservative) and 40% (moderate) efficacy assumptions; the grey dashed line shows the ACWR > 1.5 baseline for NBA. (b) Head-to-head comparison of FDTM versus the ACWR > 1.5 alerting baseline on NBA at matched alert burden (~9%): FDTM achieves 59.9% higher sensitivity, doubles the positive predictive value, and captures an additional 12.8 injuries per team-season. Added in response to Reviewer 2, second-round comment 4.
Table 1.
Summary statistics of the three public datasets used to evaluate FDTM. The combined corpus contains 612 unique athletes and 247,830 player-game observations.
| Dataset | Sport | Seasons | Athletes | Player-Game Obs. |
|---|---|---|---|---|
| NBA game-log corpus (2013–2023) | Basketball | 10 | 248 | ≈142,400 |
| AFL Player Workload Dataset | Australian football | 4 | 184 | ≈53,200 |
| SoccerMon corpus | Soccer | 6 | 180 | ≈52,230 |
| Combined corpus | — | 10 | 612 | 247,830 |
Table 2.
Hyperparameter configuration of FDTM and search ranges. Final values were selected on the validation partition via 5-fold stratified cross-validation.
| Hyperparameter | Search Range | Final Value |
|---|---|---|
| Input window length T (games) | {14, 21, 28, 35} | 28 |
| Input embedding dimension | — | 128 |
| Bi-LSTM layers | {2, 3, 4} | 3 |
| Bi-LSTM hidden units (per direction) | — | 128 |
| Temporal/spectral embedding dimension | — | 256 |
| Cross-attention heads k | {2, 4, 8} | 4 |
| Dropout rate | {0.1, 0.2, 0.3, 0.4} | 0.3 |
| Focal-loss focusing parameter | {1, 2, 3} | 2 |
| Optimizer | — | AdamW |
| Initial learning rate | — | 3 × |
| Weight decay | — | 1 × |
| Learning-rate schedule | — | Cosine, 5-epoch warm-up |
| Training epochs | — | 80 |
| Gradient clipping (L2 norm) | — | 1.0 |
| Training precision | — | bfloat16 |
Table 3.
Benchmark comparison (AUC-ROC) of FDTM against the sixteen baseline methods on the three public datasets. Values correspond to the point estimates plotted in Figure 4; the 95% bootstrap confidence intervals are shown there as error bars. The DeLong-test p-values for the pairwise comparison against FDTM are below 0.01 for every baseline on all three datasets (see also the 30-seed analysis in Section 4.3.1).
| Method | Family | NBA | AFL | SoccerMon |
|---|---|---|---|---|
| Logistic Regression | Classical | 0.704 | 0.689 | 0.673 |
| Random Forest | Classical | 0.785 | 0.762 | 0.748 |
| XGBoost | Classical | 0.806 | 0.781 | 0.766 |
| FFT-only MLP | Deep (single-stream) | 0.788 | 0.755 | 0.741 |
| 1D-CNN | Deep (single-stream) | 0.795 | 0.764 | 0.749 |
| LSTM | Deep (single-stream) | 0.808 | 0.774 | 0.758 |
| Bi-LSTM | Deep (single-stream) | 0.815 | 0.781 | 0.766 |
| Transformer | Deep (single-stream) | 0.823 | 0.789 | 0.772 |
| TCN [45] | Convolutional | 0.826 | 0.793 | 0.776 |
| N-BEATS | SOTA time-series | 0.814 | 0.782 | 0.764 |
| DLinear | SOTA time-series | 0.812 | 0.778 | 0.761 |
| Informer | SOTA time-series | 0.818 | 0.785 | 0.769 |
| Autoformer [39] | SOTA time-series | 0.825 | 0.791 | 0.774 |
| PatchTST | SOTA time-series | 0.831 | 0.798 | 0.781 |
| TimesNet [40] | SOTA time-series | 0.835 | 0.802 | 0.785 |
| FEDformer [38] | SOTA time-series | 0.838 | 0.805 | 0.788 |
| FDTM (Ours) | Dual-stream | 0.858 | 0.833 | 0.821 |
Table 4.
Comprehensive ablation results across three datasets and two metrics. The ranking of component importance is consistent across datasets, providing evidence that the architectural choices in FDTM are not over-fit to any single sport.
| Variant | NBA (AUC-ROC) | AFL (AUC-ROC) | SoccerMon (AUC-ROC) | NBA (AUC-PR) | AFL (AUC-PR) | SoccerMon (AUC-PR) |
|---|---|---|---|---|---|---|
| Full FDTM | 0.858 | 0.833 | 0.821 | 0.312 | 0.289 | 0.271 |
| w/o Bi-LSTM (uni-LSTM) | 0.819 | 0.795 | 0.787 | 0.278 | 0.255 | 0.241 |
| w/o Self-Attention Pool | 0.830 | 0.806 | 0.795 | 0.291 | 0.268 | 0.252 |
| w/o Spectral Branch | 0.807 | 0.783 | 0.770 | 0.246 | 0.240 | 0.225 |
| w/o Temporal Branch | 0.831 | 0.806 | 0.798 | 0.268 | 0.249 | 0.247 |
| w/o Cross-Attention (concat) | 0.844 | 0.819 | 0.808 | 0.298 | 0.275 | 0.258 |
| w/o Gated Fusion (avg) | 0.837 | 0.815 | 0.802 | 0.293 | 0.270 | 0.253 |
| w/o Focal Loss (BCE) | 0.846 | 0.823 | 0.811 | 0.278 | 0.257 | 0.241 |
| w/o High-Freq Band | 0.832 | 0.808 | 0.796 | 0.282 | 0.260 | 0.244 |
| w/o Mid-Freq Band | 0.844 | 0.820 | 0.807 | 0.301 | 0.278 | 0.262 |
| w/o Low-Freq Band | 0.839 | 0.814 | 0.802 | 0.290 | 0.267 | 0.250 |
Table 5.
Extended calibration analysis across the three public datasets (newly added in response to Reviewer 2, second-round comment 1).
| Calibration Metric | NBA | AFL | SoccerMon |
|---|---|---|---|
| Brier score | 0.0157 | 0.0216 | 0.0203 |
| Reliability (Murphy) | 0.0030 | 0.0041 | 0.0053 |
| Resolution (Murphy) | 0.0240 | 0.0218 | 0.0197 |
| Uncertainty (Murphy) | 0.0366 | 0.0393 | 0.0347 |
| ECE (10 bins) | 0.0140 | 0.0182 | 0.0214 |
| Maximum Calibration Error (MCE) | 0.0480 | 0.0530 | 0.0610 |
| Hosmer–Lemeshow χ² (df = 8) | 11.42 | 12.87 | 13.94 |
| Hosmer–Lemeshow p-value | 0.179 | 0.117 | 0.083 |
| Platt-scaled ECE | 0.0122 | 0.0168 | 0.0201 |
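The Murphy rows in Table 5 can be cross-checked against the Brier score through the standard decomposition, Brier = reliability − resolution + uncertainty. The sketch below uses values copied from the table; the function and variable names are illustrative. It also assumes that the uncertainty term equals p(1 − p) for the baseline injury rates reported in Table 10 (3.8%, 4.1%, 3.6%), which is an inference from the numbers rather than something the caption states.

```python
# Values copied from Table 5; base rates copied from Table 10 (assumed linkage).
TABLE5 = {
    "NBA":       {"brier": 0.0157, "rel": 0.0030, "res": 0.0240, "unc": 0.0366, "rate": 0.038},
    "AFL":       {"brier": 0.0216, "rel": 0.0041, "res": 0.0218, "unc": 0.0393, "rate": 0.041},
    "SoccerMon": {"brier": 0.0203, "rel": 0.0053, "res": 0.0197, "unc": 0.0347, "rate": 0.036},
}


def murphy_brier(rel, res, unc):
    """Reconstruct the Brier score from its Murphy components."""
    return rel - res + unc


def bernoulli_uncertainty(rate):
    """Uncertainty term of the decomposition: variance of the binary outcome."""
    return rate * (1 - rate)


for name, row in TABLE5.items():
    recon = murphy_brier(row["rel"], row["res"], row["unc"])
    unc = bernoulli_uncertainty(row["rate"])
    print(f"{name}: reported={row['brier']:.4f} reconstructed={recon:.4f} "
          f"uncertainty from base rate={unc:.4f}")
```

Residuals of about 0.0001 between the reported and reconstructed Brier scores are expected, because each component is rounded to four decimals in the table.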
Table 6.
Leave-one-sport-out external validation results. In the header, ↓ indicates the training dataset and → indicates the test dataset. Rows index the training dataset, columns the test dataset; diagonal entries (in-domain) are shown in bold. The final column reports the mean zero-shot drop relative to in-domain training.
| Train ↓/Test → | NBA | AFL | SoccerMon | Mean Zero-Shot Drop |
|---|---|---|---|---|
| NBA | **0.858** | 0.795 | 0.781 | — |
| AFL | 0.812 | **0.833** | 0.798 | — |
| SoccerMon | 0.806 | 0.789 | **0.821** | — |
| Zero-shot avg. (off-diagonal) | 0.809 | 0.792 | 0.790 | 4.9/4.1/3.1 pp |
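The zero-shot summary row can be reproduced from the transfer matrix: for each test dataset, average the off-diagonal entries in its column and subtract that average from the in-domain (diagonal) value. The sketch below is a minimal illustration using the numbers in Table 6; the dictionary layout and function name are hypothetical.

```python
# AUC[train][test], copied from Table 6.
AUC = {
    "NBA":       {"NBA": 0.858, "AFL": 0.795, "SoccerMon": 0.781},
    "AFL":       {"NBA": 0.812, "AFL": 0.833, "SoccerMon": 0.798},
    "SoccerMon": {"NBA": 0.806, "AFL": 0.789, "SoccerMon": 0.821},
}


def zero_shot_summary(auc):
    """Return {test: (mean off-diagonal AUC, drop in percentage points)}."""
    summary = {}
    for test in auc:
        off_diagonal = [auc[train][test] for train in auc if train != test]
        mean_off = sum(off_diagonal) / len(off_diagonal)
        drop_pp = (auc[test][test] - mean_off) * 100
        summary[test] = (mean_off, drop_pp)
    return summary


for test, (mean_off, drop_pp) in zero_shot_summary(AUC).items():
    print(f"{test}: zero-shot mean={mean_off:.4f}, drop={drop_pp:.2f} pp")
```

For SoccerMon the unrounded off-diagonal mean is 0.7895, so the 3.1 pp drop in the table corresponds to subtracting the rounded mean (0.821 − 0.790).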
Table 7.
Computational complexity comparison of FDTM against the strongest deep-learning baselines. FDTM achieves the highest AUC-ROC while remaining real-time on cloud and edge platforms. Single-sample inference latency was benchmarked for FDTM (cloud/edge/mobile); see Section 4.11.
| Method | Parameters (M) | NBA AUC-ROC | Inference Latency (cloud/edge/mobile) |
|---|---|---|---|
| TCN [45] | 1.7 | 0.826 | — |
| Bi-LSTM | 2.1 | 0.815 | — |
| Transformer | 3.8 | 0.823 | — |
| FEDformer [38] | — | 0.838 | — |
| FDTM (Ours) | 2.4 | 0.858 | 8.2/31.4/187.3 ms |
Table 8.
Per-dataset overfitting diagnostics across the three datasets, n = 30 independent seeds (newly added in response to Reviewer 2, second-round comment 3).
| Dataset | Train AUC | Val AUC | Test AUC | Gap (pp) | Early-Stop Epoch |
|---|---|---|---|---|---|
| NBA | 0.871 ± 0.005 | 0.860 ± 0.006 | 0.858 ± 0.006 | 1.3 | 65 |
| AFL | 0.851 ± 0.007 | 0.835 ± 0.008 | 0.833 ± 0.007 | 1.8 | 58 |
| SoccerMon | 0.842 ± 0.008 | 0.823 ± 0.009 | 0.821 ± 0.007 | 2.1 | 54 |
Table 9.
Point-by-point comparison of FDTM against representative prior deep-learning and machine-learning approaches to athlete injury prediction (rendered in full in response to Reviewer 2; previously referenced only in text).
| Prior Study | Data/Sport | Approach | Reported AUC-ROC | FDTM (This Work) |
|---|---|---|---|---|
| Carey et al. [30] | AFL workload (public) | Random forest | ≈0.78 | 0.833 (AFL) |
| Rossi et al. [29] | Soccer, GPS training data | Gradient boosting | ≈0.76 | 0.821 (SoccerMon) |
| Ye et al. [13] | Proprietary cohort, time-series images | CNN (image encoding) | 0.85 | 0.858 (NBA) |
| FDTM (Ours) | 3 public datasets, 612 athletes | Frequency-aware dual-stream (Bi-LSTM + FFT + gated fusion) | 0.858/0.833/0.821 | — |
Table 10.
Quantitative clinical impact projection per team-season across the three sports (newly added in response to Reviewer 2, second-round comment 4). Avoidable-injury counts are computed under three intervention-efficacy assumptions consistent with the IOC load-management consensus statement [4].
| Quantity | NBA | AFL | SoccerMon |
|---|---|---|---|
| Player-game observations per team-season | ~1800 | ~880 | ~760 |
| Baseline injury rate | 3.8% | 4.1% | 3.6% |
| Expected injuries per team-season | 68.4 | 36.1 | 27.4 |
| High-risk tier flag rate (risk score ≥ 0.50) | 8.5% | 10.0% | 9.7% |
| Sensitivity at risk score ≥ 0.50 | 50.2% | 47.6% | 45.1% |
| PPV at risk score ≥ 0.50 | 22.4% | 19.6% | 16.8% |
| Injuries flagged (captured) | 34.3 | 17.2 | 12.4 |
| Avoidable injuries @ 20% intervention efficacy | 6.9 | 3.4 | 2.5 |
| Avoidable injuries @ 40% intervention efficacy | 13.7 | 6.9 | 4.9 |
| Avoidable injuries @ 60% intervention efficacy | 20.6 | 10.3 | 7.4 |
| Number needed to screen (NNS) | 22 | 26 | 30 |
| Alerts per week (regular season) | 4.3 | 2.5 | 2.0 |
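The rows of Table 10 follow from one another by simple arithmetic, which the sketch below retraces from the four input rows (observations, baseline injury rate, flag rate, sensitivity). The chain of formulas is an assumed reading of how the rows relate, not a method stated in the source; in particular, treating NNS as alerts divided by avoidable injuries at 20% efficacy is an inference that happens to match the reported values. All names are illustrative.

```python
# Inputs copied from Table 10 (per team-season).
TABLE10 = {
    "NBA":       {"obs": 1800, "rate": 0.038, "flag": 0.085, "sens": 0.502},
    "AFL":       {"obs": 880,  "rate": 0.041, "flag": 0.100, "sens": 0.476},
    "SoccerMon": {"obs": 760,  "rate": 0.036, "flag": 0.097, "sens": 0.451},
}


def project(obs, rate, flag, sens, efficacy=0.20):
    """Derive the downstream rows of the projection from the four inputs."""
    expected = obs * rate            # expected injuries per team-season
    captured = expected * sens       # injuries falling in the high-risk tier
    alerts = obs * flag              # high-risk alerts raised per team-season
    avoidable = captured * efficacy  # injuries avoided if interventions work
    return {
        "expected": expected,
        "captured": captured,
        "ppv": captured / alerts,
        "avoidable": avoidable,
        "nns": alerts / avoidable,   # assumed definition, see lead-in
    }


for sport, row in TABLE10.items():
    out = project(**row)
    print(sport, {key: round(value, 3) for key, value in out.items()})
```

Calling `project(**TABLE10["NBA"], efficacy=0.40)` or `efficacy=0.60` gives the other two avoidable-injury rows under the same assumptions.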