Author Contributions
Conceptualisation, R.X. and S.C.; methodology, R.X.; software, R.X.; validation, R.X.; formal analysis, R.X.; investigation, R.X.; resources, S.K.; data curation, R.X.; writing—original draft preparation, R.X.; writing—review and editing, S.C., M.B.-C., E.I., E.N.-S. and S.K.; visualisation, R.X.; supervision, S.C.; project administration, S.C.; funding acquisition, S.C. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Overview of the analytical pipeline of six stages, from flight experiment through validation output, highlighting the fold-safe two-way residual label construction and leave-one-subject-out (LOSO) cross-validation design.
Figure 2.
In-cabin experimental setup. (a) Placement of the Polar H10 chest strap (ECG) and the Empatica Embrace Plus wristband (EDA and skin temperature). (b) Photograph taken during data collection in the Cessna 172 cockpit; the participant’s face is obscured to protect privacy.
Figure 3.
Representative real-flight physiological timeline for one pilot (all models were fit on all pilots and trials). (a) Flight-segment timeline with per-segment raw workload/stress ratings and residual labels; the background shading distinguishes the successive flight segments (grey Baseline, blue Takeoff, green Steep Turn, orange Stall, purple Landing); Baseline is excluded from the machine learning analysis. (b) Normalised heart rate with ECG-unusable windows shaded red and the 30 s/10 s windowing indicated. (c) EDA decomposed into tonic SCL and phasic SCR (NeuroKit2 cvxEDA); red points mark SCR peaks (scipy find_peaks, prominence 0.01 μS, minimum spacing 1 s). (d) Wrist skin temperature (robust z), which drifts slowly across the flight. (e) Workload and stress residual label strips. Signals smoothed only by a 5-sample rolling median; segments concatenated back-to-back.
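The SCR peak picking in panel (c) can be sketched with `scipy.signal.find_peaks`, using the caption's prominence of 0.01 μS and minimum spacing of 1 s. This is a minimal sketch, not the authors' code: it assumes the phasic component is sampled at 4 Hz (the EDA rate listed in Table 1), so that 1 s maps to `distance=4` samples, and the helper name `detect_scr_peaks` and the synthetic trace are illustrative only.

```python
import numpy as np
from scipy.signal import find_peaks

FS_HZ = 4  # assumed EDA sampling rate (Table 1 lists 4 Hz)

def detect_scr_peaks(phasic, fs_hz=FS_HZ, prominence_us=0.01, min_spacing_s=1.0):
    """Return sample indices of SCR peaks in a phasic EDA trace (in μS)."""
    # find_peaks expects the minimum spacing in samples, not seconds.
    distance = max(1, int(round(min_spacing_s * fs_hz)))
    peaks, _ = find_peaks(np.asarray(phasic, dtype=float),
                          prominence=prominence_us, distance=distance)
    return peaks

# Synthetic phasic trace: three clear responses, one too close, one too small.
phasic = np.zeros(80)
phasic[[10, 30, 50]] = 0.05   # clear SCRs
phasic[12] = 0.03             # within 1 s of the larger peak at index 10
phasic[70] = 0.005            # below the 0.01 μS prominence threshold
scr_idx = detect_scr_peaks(phasic)
```

On this toy trace only the three clear responses survive: the spike at index 12 is dropped by the spacing rule and the one at index 70 by the prominence rule.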
Figure 4.
Self-reported workload and stress landscape across real-flight manoeuvres. (a,b) Raincloud distributions of raw workload and stress by segment: half-violin density, interquartile box, individual pilot points, and faint within-pilot connecting lines; mean ± SD annotated. (c) Workload–stress coupling: hexbin density, dotted identity line, and red linear fit, with Pearson and Spearman correlations. Points are coloured by segment throughout (Takeoff blue, Steep Turn green, Stall orange, Landing purple). (d,e) Two-way residual distributions by segment, centred on zero (high to the right, low to the left). (f) Resulting two-way residual binary label balance.
Figure 5.
Fold-safe two-way residual label construction (workload shown; one example held-out pilot outlined). (a) Raw 0–10 rating matrix (pilots sorted by mean rating). (b) Pilot-centred matrix (raw minus pilot mean). (c) Two-way residual matrix, with task reference means computed from training pilots only. (d) Binary high/low labels. (e) Label-switching alluvial from raw median split, through task-residual-only, to the two-way residual labels, for workload and stress, annotated with the corresponding LOSO macro F1. The held-out pilot’s task reference means use training pilots only; test labels are retrospective.
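The construction in panels (a) to (d) can be sketched as follows. This is an illustrative reading of the caption rather than the authors' implementation: it assumes the residual is the pilot-centred rating minus a task reference mean averaged over training pilots only, and that the binary label is "high" when the residual exceeds zero (an assumption suggested by the residuals being centred on zero). The function names and toy ratings are hypothetical.

```python
from statistics import mean

def pilot_centre(ratings):
    """Subtract each pilot's own mean rating (raw minus pilot mean)."""
    return {p: {t: r - mean(by_task.values()) for t, r in by_task.items()}
            for p, by_task in ratings.items()}

def two_way_residual_labels(ratings, held_out):
    """Return {(pilot, task): (residual, label)} for one LOSO fold."""
    centred = pilot_centre(ratings)
    train = [p for p in centred if p != held_out]
    tasks = sorted({t for by_task in centred.values() for t in by_task})
    # Task reference means use training pilots only (fold-safe).
    task_ref = {t: mean(centred[p][t] for p in train if t in centred[p])
                for t in tasks}
    out = {}
    for p, by_task in centred.items():
        for t, c in by_task.items():
            resid = c - task_ref[t]
            out[(p, t)] = (resid, int(resid > 0))  # assumed threshold at zero
    return out

# Toy 0-10 ratings for three pilots and two tasks (made-up values).
ratings = {
    "P1": {"Takeoff": 2, "Stall": 6},
    "P2": {"Takeoff": 4, "Stall": 6},
    "P3": {"Takeoff": 5, "Stall": 5},
}
labels = two_way_residual_labels(ratings, held_out="P3")
```

The held-out pilot's own ratings still enter through pilot centring, which is why the test labels are retrospective; only the task reference means are restricted to the training pilots.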
Figure 6.
LOSO model-by-modality performance landscape. (a,b) Macro F1 heatmaps for stress and workload across the five classifiers and four modalities (factorial analysis); bold outlines mark the prespecified configurations, * marks FDR-significant cells, and ▲ marks nested-stable cells (|fixed − nested| ≤ 0.05). (c) Forest plot of the ten prespecified cells with 95% cluster bootstrap confidence intervals (chance = 0.5; * FDR-significant; filled circle = nested-stable, open diamond = not). (d) Fixed-hyperparameter versus nested-CV macro F1 for every cell, with the identity line.
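A cluster bootstrap of the kind used for panel (c) resamples whole pilots rather than individual trials, so that the within-pilot dependence of the trials is preserved. The sketch below is one plausible form, under stated assumptions: a percentile interval, 2000 resamples, a fixed seed, and a two-class macro F1 in which a class with no support scores zero. None of these details are taken from the paper, and the toy predictions are made up.

```python
import random

def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1; a class with no support scores 0."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def cluster_bootstrap_ci(by_pilot, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for macro F1, resampling whole pilots with replacement.

    by_pilot maps pilot id -> list of (true_label, predicted_label) trials.
    """
    rng = random.Random(seed)
    pilots = list(by_pilot)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(pilots) for _ in pilots]  # clusters, not trials
        pairs = [pair for p in sample for pair in by_pilot[p]]
        y_true, y_pred = zip(*pairs)
        stats.append(macro_f1(y_true, y_pred))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy predictions: ten hypothetical pilots with four trials each.
toy = {f"P{i}": [(0, 0), (1, 1), (0, 1), (1, 1)] if i % 2
       else [(0, 0), (1, 0), (0, 0), (1, 1)] for i in range(10)}
ci_low, ci_high = cluster_bootstrap_ci(toy)
```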
Figure 7.
Personalised calibration inflates apparent performance relative to unseen-pilot LOSO validation. (a) Dumbbell plot of each model–modality cell (S = stress, warm red; W = workload, blue): LOSO (filled circle), within-subject window (open square), and within-pilot leave-one-trial-out (grey triangle), ordered by the LOSO-to-within gap; the grey band marks the chance region. (b) Distribution of the within-subject minus LOSO gap by target (mean gap 0.220). (c) Per-pilot LOSO prediction stability for the primary configurations: number of correctly classified segments out of four. Within-subject window validation is optimistic because windows from the same pilot and segment can appear in both training and test folds.
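The leakage named in the caption's final sentence can be made concrete with a small sketch. It contrasts a leave-one-subject-out split, in which no pilot appears on both sides, with a window-level split inside a single pilot, in which windows from the same segment land in both training and test sets. The window tuples, the 20% hold-out, and the helper names are hypothetical and chosen only for illustration.

```python
import random

def loso_folds(groups):
    """Yield (held_out, train_idx, test_idx), one fold per unique pilot."""
    for held_out in sorted(set(groups)):
        test = [i for i, g in enumerate(groups) if g == held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train, test

def shared_segments(windows, train_idx, test_idx):
    """(pilot, segment) pairs that contribute windows to both train and test."""
    train_keys = {windows[i][:2] for i in train_idx}
    test_keys = {windows[i][:2] for i in test_idx}
    return train_keys & test_keys

# Hypothetical windows: (pilot, segment, window_number)
windows = [(p, s, w) for p in ("P1", "P2", "P3")
           for s in ("Takeoff", "Stall") for w in range(5)]
pilots = [w[0] for w in windows]

# LOSO: no pilot, and hence no segment, is shared across the split.
loso_overlap = [shared_segments(windows, tr, te)
                for _, tr, te in loso_folds(pilots)]

# Window-level split within one pilot: shuffle that pilot's windows, hold out 20%.
rng = random.Random(0)
p1 = [i for i, w in enumerate(windows) if w[0] == "P1"]
rng.shuffle(p1)
test_idx, train_idx = p1[:2], p1[2:]
window_overlap = shared_segments(windows, train_idx, test_idx)
```

Every LOSO fold yields an empty overlap, whereas the window-level split always shares at least one pilot-segment pair between training and test, which is the mechanism behind the optimistic within-subject scores.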
Figure 8.
Robustness to feature count, label definition, signal reliability, and quality-control choices. (a) Trial-level LOSO macro F1 of the EDA classifier versus the number of selected features k; performance peaks at the selected k and degrades when more features are admitted. (b) Feature-selection stability: fraction of the 35 LOSO folds in which each top feature was selected. (c) Between-segment intraclass correlation (ICC) of cardiac features; frequency-domain indices (†) are near zero on the 56–158 s segments, whereas time-domain indices are far more reliable. (d) Label-definition sensitivity: LOSO macro F1 under raw median split, task-residual-only, and two-way residual labels. (e) QC and ablation deltas (Δ macro F1) for the primary configurations (independent analyses; upper bar stress, lower bar workload).
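The stability measure in panel (b) is a simple selection frequency: the number of folds in which a feature was selected, divided by the number of folds. A minimal sketch, with hypothetical feature names and only four folds for brevity (the paper uses 35 LOSO folds):

```python
from collections import Counter

def selection_stability(selected_per_fold):
    """Fraction of folds in which each feature was selected."""
    n_folds = len(selected_per_fold)
    # set(fold) guards against a feature being counted twice within one fold.
    counts = Counter(f for fold in selected_per_fold for f in set(fold))
    return {f: c / n_folds for f, c in counts.items()}

# Hypothetical per-fold selections; feature names are made up.
folds = [
    ["scl_mean", "scr_count", "scr_amp"],
    ["scl_mean", "scr_count"],
    ["scl_mean", "scl_slope"],
    ["scl_mean", "scr_count", "scl_slope"],
]
stability = selection_stability(folds)
```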
Figure 9.
Directional physiological signatures of workload and stress from SHAP values (exploratory all-data SHAP; LightGBM on combined features). (a,b) SHAP beeswarm for stress and workload: each point is one window, positioned by its SHAP value (right = pushes towards high) and coloured by the within-participant z-scored feature value; modality tags precede each feature and † marks low-reliability frequency-domain ECG features. (c) Modality-level model-attribution share (mean |SHAP|). (d) Feature-category attribution contrast between stress and workload. SHAP reflects model attribution, not causal physiological mechanism.
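The modality-level share in panel (c) can be sketched as follows. The aggregation rule is an assumption, since the caption states only "mean |SHAP|": here each feature's mean absolute SHAP value is summed within its modality and normalised by the grand total. The feature names, SHAP values, and modality mapping are invented for the example.

```python
def modality_share(shap_values, feature_names, modality_of):
    """Share of total mean |SHAP| attributed to each modality."""
    n = len(shap_values)
    # Mean absolute SHAP per feature (one row per window, one column per feature).
    mean_abs = [sum(abs(row[j]) for row in shap_values) / n
                for j in range(len(feature_names))]
    totals = {}
    for name, value in zip(feature_names, mean_abs):
        m = modality_of[name]
        totals[m] = totals.get(m, 0.0) + value
    grand = sum(totals.values())
    return {m: v / grand for m, v in totals.items()}

# Hypothetical SHAP matrix: two windows, three features.
features = ["eda_scr", "ecg_hr", "temp_mean"]
modality_of = {"eda_scr": "EDA", "ecg_hr": "ECG", "temp_mean": "Temp"}
shap_rows = [[0.2, -0.1, 0.0],
             [-0.4, 0.1, 0.1]]
share = modality_share(shap_rows, features, modality_of)
```

As the caption notes, such shares describe how the model distributes attribution across inputs and say nothing about causal physiological mechanism.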
Table 1.
Cohort and protocol summary.
| Field | Value |
|---|---|
| Parent cohort [31] | pilots (mixed PPL/CPL/SPL/SPP) |
| Analysis segments | Segments 2–5 (Takeoff, Steep Turn, Stall, Landing); Baseline excluded |
| Trial-level observations | 135 (after label construction) |
| Label balance (high:low; stress/workload) | 66:69/62:73 |
| Aircraft/regime | Cessna 172, VFR, certified flight instructor present |
| Self-report | 0–10 Likert (verbal anchors at 0, 5, 10) for stress + workload |
| Sensors | Polar H10 (ECG); Empatica Embrace Plus (EDA, 4 Hz; temperature) |
Table 2.
Participant characteristics (N = 35). Flight hours are summarised by median, interquartile range, and range owing to their skewed distribution.
| Characteristic | Value |
|---|---|
| Age, years—mean ± SD (range) | 29.3 ± 16.3 (18–78) |
| Sex—n (male/female) | 31/4 |
| Licence—n | PPL 20, CPL 10, SPL 2, SPP 2 (1 unspecified) |
| Flight hours—median [IQR] (range) | 160 [104–322] (20–1800) |
| Additional rating (IFR/Multi/Night)—n | 23 |
| Pre-flight stress (0–10)—mean ± SD | 1.9 ± 1.3 |
Table 3.
LOSO classification performance (N = 135 trials, 35 pilots). Ranked by macro F1. p: permutation test; p (FDR): Benjamini–Hochberg FDR-adjusted across all ten cells. Significance (based on p (FDR)): † FDR-significant.
| Target | Model (Modality) | Acc | F1 | F1 95% CI | p | p (FDR) |
|---|---|---|---|---|---|---|
| Stress | LightGBM (EDA) | 0.615 | 0.611 | [0.521, 0.698] | 0.008 | 0.033 † |
| Stress | Linear SVC (Comb.) | 0.607 | 0.607 | [0.536, 0.680] | 0.010 | 0.033 † |
| Stress | Random Forest (ECG) | 0.570 | 0.569 | [0.481, 0.652] | 0.068 | 0.113 |
| Stress | XGBoost (EDA) | 0.556 | 0.556 | [0.463, 0.643] | 0.128 | 0.160 |
| Stress | KNN (Comb.) | 0.526 | 0.394 | [0.344, 0.449] | 0.310 | 0.310 |
| Workload | XGBoost (EDA) | 0.600 | 0.598 | [0.526, 0.668] | 0.008 | 0.033 † |
| Workload | LightGBM (EDA) | 0.585 | 0.581 | [0.507, 0.650] | 0.045 | 0.113 |
| Workload | Random Forest (ECG) | 0.563 | 0.561 | [0.478, 0.639] | 0.109 | 0.156 |
| Workload | Linear SVC (Comb.) | 0.548 | 0.548 | [0.463, 0.629] | 0.156 | 0.173 |
| Workload | KNN (Comb.) | 0.578 | 0.465 | [0.386, 0.548] | 0.063 | 0.113 |
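The Benjamini–Hochberg adjustment across the ten cells can be checked directly from the permutation p-values in Table 3. The sketch below is a plain-Python version of the standard step-up procedure, not the authors' code. Because it starts from the rounded p-values printed in the table, its output should agree with the reported adjusted column only to within rounding.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Permutation p-values from Table 3, keyed by (target, model).
p_perm = {
    ("Stress", "LightGBM (EDA)"): 0.008,
    ("Stress", "Linear SVC (Comb.)"): 0.010,
    ("Stress", "Random Forest (ECG)"): 0.068,
    ("Stress", "XGBoost (EDA)"): 0.128,
    ("Stress", "KNN (Comb.)"): 0.310,
    ("Workload", "XGBoost (EDA)"): 0.008,
    ("Workload", "LightGBM (EDA)"): 0.045,
    ("Workload", "Random Forest (ECG)"): 0.109,
    ("Workload", "KNN (Comb.)"): 0.063,
    ("Workload", "Linear SVC (Comb.)"): 0.156,
}
cells = list(p_perm)
p_fdr = dict(zip(cells, benjamini_hochberg([p_perm[c] for c in cells])))
```

Three cells share the smallest adjusted value (0.010 × 10 / 3 ≈ 0.033), which is why the two cells with p = 0.008 and the cell with p = 0.010 all carry the same adjusted p-value in the table.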
Table 4.
LOSO versus subject-dependent within-subject window validation, trial-level macro F1. Gap = within-subject − LOSO. The within-subject analysis is included for comparison with prior personalised validation studies and should not be interpreted as unseen-pilot generalisation.
| Target | Model (Modality) | LOSO F1 | Within F1 | Gap | Within N |
|---|---|---|---|---|---|
| Stress | Linear SVC (Comb.) | 0.607 | 0.853 | +0.246 | 132 |
| Stress | KNN (Comb.) | 0.394 | 0.804 | +0.410 | 132 |
| Stress | LightGBM (EDA) | 0.611 | 0.801 | +0.190 | 132 |
| Stress | Random Forest (ECG) | 0.569 | 0.779 | +0.210 | 132 |
| Stress | XGBoost (EDA) | 0.556 | 0.727 | +0.172 | 132 |
| Workload | Linear SVC (Comb.) | 0.548 | 0.791 | +0.243 | 134 |
| Workload | KNN (Comb.) | 0.465 | 0.775 | +0.310 | 134 |
| Workload | LightGBM (EDA) | 0.581 | 0.763 | +0.181 | 134 |
| Workload | XGBoost (EDA) | 0.598 | 0.697 | +0.099 | 134 |
| Workload | Random Forest (ECG) | 0.561 | 0.697 | +0.136 | 132 |
| Mean | | 0.549 | 0.769 | +0.220 | 133 |
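The gap column and its mean follow directly from the two F1 columns. The short sketch below recomputes them from the rounded values printed in Table 4; individual gaps can therefore differ from the table by 0.001 where the authors evidently subtracted unrounded scores.

```python
table4 = {  # (target, model): (LOSO F1, within-subject F1)
    ("Stress", "Linear SVC (Comb.)"): (0.607, 0.853),
    ("Stress", "KNN (Comb.)"): (0.394, 0.804),
    ("Stress", "LightGBM (EDA)"): (0.611, 0.801),
    ("Stress", "Random Forest (ECG)"): (0.569, 0.779),
    ("Stress", "XGBoost (EDA)"): (0.556, 0.727),
    ("Workload", "Linear SVC (Comb.)"): (0.548, 0.791),
    ("Workload", "KNN (Comb.)"): (0.465, 0.775),
    ("Workload", "LightGBM (EDA)"): (0.581, 0.763),
    ("Workload", "XGBoost (EDA)"): (0.598, 0.697),
    ("Workload", "Random Forest (ECG)"): (0.561, 0.697),
}

# Gap = within-subject minus LOSO, as defined in the caption.
gaps = {cell: within - loso for cell, (loso, within) in table4.items()}
mean_gap = sum(gaps.values()) / len(gaps)
largest_gap_cell = max(gaps, key=gaps.get)
```

Every gap is positive, and the largest belongs to the stress KNN cell, whose LOSO score is the lowest in the table.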
Table 5.
Full factorial LOSO macro F1 (5 classifiers × 4 modalities × 2 targets). Bold: best modality per row.
| Model | Stress: EDA | Stress: ECG | Stress: Temp | Stress: Comb | Workload: EDA | Workload: ECG | Workload: Temp | Workload: Comb |
|---|---|---|---|---|---|---|---|---|
| LightGBM | 0.611 | 0.541 | 0.510 | 0.499 | 0.581 | 0.526 | 0.429 | 0.510 |
| XGBoost | 0.533 | 0.533 | 0.481 | 0.511 | 0.515 | 0.529 | 0.419 | 0.505 |
| Random Forest | 0.558 | 0.569 | 0.513 | 0.577 | 0.533 | 0.561 | 0.325 | 0.525 |
| Linear SVC | 0.503 | 0.518 | 0.554 | 0.606 | 0.473 | 0.510 | 0.503 | 0.481 |
| KNN | 0.355 | 0.351 | 0.379 | 0.338 | 0.432 | 0.345 | 0.391 | 0.414 |
Table 6.
Robustness analyses. (A) Nested LOSO cross-validation with an inner subject-grouped hyperparameter search, versus the fixed-hyperparameter results, for all ten model–target cells. (B) Sensitivity of the primary configurations to feature and quality-control choices. Δ is nested minus fixed macro F1 in (A) and variant minus baseline macro F1 in (B).
(A) Nested vs. fixed-hyperparameter macro F1

| Target | Model (Modality) | Fixed | Nested | Δ |
|---|---|---|---|---|
| Stress | Linear SVC (Comb.) | 0.607 | 0.606 | −0.001 |
| Stress | LightGBM (EDA) | 0.611 | 0.506 | −0.106 |
| Stress | XGBoost (EDA) | 0.556 | 0.548 | −0.007 |
| Stress | Random Forest (ECG) | 0.569 | 0.518 | −0.052 |
| Stress | KNN (Comb.) | 0.394 | 0.394 | 0.000 |
| Workload | XGBoost (EDA) | 0.598 | 0.561 | −0.037 |
| Workload | LightGBM (EDA) | 0.581 | 0.556 | −0.025 |
| Workload | Linear SVC (Comb.) | 0.548 | 0.556 | +0.008 |
| Workload | Random Forest (ECG) | 0.561 | 0.519 | −0.042 |
| Workload | KNN (Comb.) | 0.465 | 0.465 | 0.000 |

(B) Sensitivity of primary configurations

| Analysis | Target (Model) | Base | Variant | Δ |
|---|---|---|---|---|
| Drop frequency-domain ECG | Stress (RF/ECG) | 0.569 | 0.518 | −0.051 |
| Drop frequency-domain ECG | Workload (RF/ECG) | 0.561 | 0.495 | −0.066 |
| Drop 4 Hz shape features | Stress (LGBM/EDA) | 0.611 | 0.611 | 0.000 |
| Drop 4 Hz shape features | Workload (XGB/EDA) | 0.581 | 0.581 | 0.000 |
| Drop temperature (Comb.) | Stress (SVC) | 0.607 | 0.488 | −0.119 |
| Drop temperature (Comb.) | Workload (SVC) | 0.548 | 0.540 | −0.008 |
| Strict all-channel QC | Stress (LGBM/EDA) | 0.611 | 0.569 | −0.042 |
| Strict all-channel QC | Workload (XGB/EDA) | 0.598 | 0.533 | −0.065 |
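The nested-stability criterion used in Figure 6, |fixed − nested| ≤ 0.05, can be applied to part (A) of this table as a quick check. The sketch below works from the rounded values printed above, so a cell lying very close to the 0.05 boundary could in principle be classified differently from the authors' unrounded calculation.

```python
table6a = {  # (target, model): (fixed, nested) macro F1
    ("Stress", "Linear SVC (Comb.)"): (0.607, 0.606),
    ("Stress", "LightGBM (EDA)"): (0.611, 0.506),
    ("Stress", "XGBoost (EDA)"): (0.556, 0.548),
    ("Stress", "Random Forest (ECG)"): (0.569, 0.518),
    ("Stress", "KNN (Comb.)"): (0.394, 0.394),
    ("Workload", "XGBoost (EDA)"): (0.598, 0.561),
    ("Workload", "LightGBM (EDA)"): (0.581, 0.556),
    ("Workload", "Linear SVC (Comb.)"): (0.548, 0.556),
    ("Workload", "Random Forest (ECG)"): (0.561, 0.519),
    ("Workload", "KNN (Comb.)"): (0.465, 0.465),
}

TOLERANCE = 0.05  # threshold stated in the Figure 6 caption

def nested_stable(fixed, nested, tol=TOLERANCE):
    """True when nested and fixed-hyperparameter scores agree within tol."""
    return abs(fixed - nested) <= tol

unstable = sorted(cell for cell, (fixed, nested) in table6a.items()
                  if not nested_stable(fixed, nested))
```

On the printed values, two stress cells fall outside the tolerance, LightGBM (EDA) and Random Forest (ECG), while all five workload cells remain within it.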