Article

Leakage-Aware Federated Learning for ICU Sepsis Early Warning: Fixed Alert-Rate Evaluation on PhysioNet/CinC 2019 and MIMIC-IV

1 School of Software, College of Computer Science, Kookmin University, Seoul 02707, Republic of Korea
2 School of Industrial and Management Engineering, College of Engineering, Korea University, Seoul 02841, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(6), 2735; https://doi.org/10.3390/app16062735
Submission received: 2 February 2026 / Revised: 7 March 2026 / Accepted: 11 March 2026 / Published: 12 March 2026

Abstract

Sepsis early warning is hindered by data silos, temporal leakage, and threshold choices that obscure operational performance. We present a leakage-aware federated-learning evaluation pipeline that enforces group/temporal separation and compares models at a fixed alert workload. Stage-1 benchmarks local, FedAvg, and FedProx LSTM/Transformer models on PhysioNet/CinC 2019 using the official A/B partitions in bidirectional cross-hospital evaluation (A→B/B→A) after removing ICULOS. Stage-2 constructs a Sepsis-3-aligned MIMIC-IV task using full SOFA-component features and simulated clients to emulate institutional heterogeneity. Federated training improves out-of-hospital generalization for LSTM models on PhysioNet, whereas Transformer models remain robust across 3–12 h horizons. On MIMIC-IV, fixed alert-rate evaluation (α = 5%) clarifies workload–timeliness trade-offs, and centralized XGBoost achieves the strongest stay-level detection with clinically meaningful lead times. Supplementary privacy and security stress tests further contextualize residual deployment risks. Overall, leakage control and workload-matched evaluation are essential for trustworthy, operationally actionable sepsis early warning.

1. Introduction

Sepsis is a life-threatening syndrome characterized by organ dysfunction due to a dysregulated host response to infection, and its timely recognition remains challenging in the intensive care unit (ICU) because physiological deterioration can be gradual, heterogeneous, and partially observed in routine measurements [1,2]. Machine learning (ML) early warning models trained on electronic health records (EHRs) have shown promise, but their clinical translation depends on (i) generalization across institutions, (ii) robust evaluation that avoids temporal leakage [3], and (iii) operational decision-making that controls alert workload rather than optimizing a single threshold-dependent metric. Recent multicenter external validation and deployment studies have underscored that sepsis prediction performance can drop across sites and that alert workload and clinical utility should be considered alongside discrimination metrics [4,5,6,7,8].
Federated learning (FL) is a natural paradigm for multi-site model development because it enables collaborative training without direct sharing of patient-level data [9]. However, healthcare FL is challenged by non-IID distributions (e.g., different patient mixes and practice patterns), unequal client sizes, and vulnerabilities to privacy leakage and adversarial manipulation. Moreover, for sepsis early warning benchmarks, naïve feature sets may inadvertently encode time-to-event information (e.g., explicit time counters), which can inflate discrimination and obscure real-world performance.
To address these issues, we present a leakage-aware FL pipeline and evaluate it in two stages. In Stage-1, we benchmark LSTM and Transformer sequence models on the PhysioNet/CinC Challenge 2019 dataset [10] under a bidirectional cross-hospital evaluation scheme based on the official training sets A and B: train on A and test on B (A→B), and train on B and test on A (B→A), while comparing local training, FedAvg, and FedProx across prediction horizons (H = 3/6/12 h). In Stage-2, we construct a Sepsis-3-aligned early warning task from MIMIC-IV ICU data [11] using full-SOFA features and client partitions that emulate institutional heterogeneity, and we emphasize operational evaluation through fixed alert-rate thresholding and lead-time analysis. We designed a two-stage evaluation framework with distinct generalization scopes. Stage-1 assesses cross-hospital generalization across the predefined institutional partitions in the PhysioNet/CinC 2019 dataset using the bidirectional A→B/B→A evaluation scheme. Stage-2, conducted on MIMIC-IV, does not constitute independent institutional external validation; rather, it simulates client heterogeneity within a single-center database to assess robustness under distributed training conditions. Therefore, Stage-2 results should be interpreted as internal robustness evidence rather than cross-institution transportability. This distinction is critical for appropriately interpreting the scope of generalizability supported by our results. Here, A and B are the two official hospital-system partitions provided by the Challenge organizers.
Our main contributions are:
  • A two-stage, leakage-aware evaluation framework for sepsis early warning that (Stage-1) benchmarks cross-hospital generalization on PhysioNet/CinC 2019 via a bidirectional cross-hospital evaluation scheme based on the official training sets A and B (A→B/B→A) and (Stage-2) evaluates a Sepsis-3-aligned task derived from MIMIC-IV with simulated federated clients.
  • A workload-matched operational evaluation strategy using fixed alert-rate thresholding (α = 5%) with stay-level detection rates and lead-time distributions to improve clinical interpretability beyond AUROC/AUPRC.
  • Supplementary trustworthiness analyses, including leakage stress tests (proper-vs-leaky splitting and contamination sweeps), calibration/decision-curve analysis, and privacy/security stress tests (membership inference, DP-inspired clipping/noise, and model poisoning).

2. Related Work

2.1. Sepsis Early Warning from EHR Time Series

Sepsis prediction models span rule-based scores, classical ML with engineered features, and deep sequence models. Sequence architectures such as LSTMs [12] and attention-based Transformers [13] are well-suited to irregular ICU time series and have been widely adopted in recent early warning systems. The PhysioNet/CinC 2019 Challenge [10] established a common benchmark for early prediction of sepsis from multivariate clinical time series, but subsequent analyses have highlighted the importance of leakage controls, robust splitting, and evaluation at clinically meaningful horizons. Recent ICU sepsis prediction studies have increasingly emphasized rigorous external validation across hospitals and countries [4,6] and deployment-aware evaluation of alert burden, clinical workflow, and patient outcomes [5,7,8].

2.2. Federated Learning in Healthcare

FL enables collaborative learning across sites by exchanging model updates rather than raw data. FedAvg [9] is the canonical algorithm, while FedProx [14] introduces a proximal term to stabilize training under heterogeneous client distributions. In healthcare, FL has been explored for imaging and EHR modeling [15], but ICU tasks are particularly sensitive to distribution shifts and to operational constraints such as alert fatigue. For early warning systems, comparing models at matched alert workloads is essential for meaningful cross-site assessment. Recent sepsis-focused FL approaches (e.g., FedSepsis) provide additional context for this setting [16]. We also note broader FL surveys [17,18], secure aggregation mechanisms [19], and privacy-preserving medical FL applications [20].

2.3. Operational Evaluation, Calibration, and Robustness

Beyond AUROC and AUPRC, clinical deployment requires calibrated risk estimates and decision-analytic evaluation. Calibration metrics (e.g., Brier score) and decision curve analysis (DCA) quantify whether a model provides net benefit over default strategies at relevant threshold probabilities [21]. From a security perspective, FL models may leak membership information [22,23] or unintended feature information [24] and can be degraded by poisoned updates or Byzantine clients; robust aggregation methods such as coordinate-wise median, trimmed mean, or Krum have been proposed to mitigate such attacks [25,26,27,28]. These considerations motivate our inclusion of privacy/security stress tests and leakage contamination experiments to complement standard discrimination metrics. Recent guidance also clarifies that leakage can arise in subtle ways in clinical prediction (e.g., cadence, perspective, and applicability mismatches) [29] and proposes broader taxonomies for leakage mitigation [30].

3. Materials and Methods

3.1. Study Overview and Leakage-Aware Evaluation Pipeline

A scheme-based presentation is helpful for communicating FL workflows and evaluation safeguards. Our end-to-end pipeline (Scheme 1) consists of (i) cohort construction and feature extraction, (ii) leakage-aware data splitting, (iii) local training or federated optimization across clients, and (iv) evaluation with both discrimination metrics and operational fixed alert-rate analysis. For the PhysioNet 2019 benchmark, clients correspond to the two hospital partitions (A and B), and we report cross-hospital generalization. For MIMIC-IV, we emulate heterogeneity by partitioning ICU stays into clinically meaningful clients (e.g., care units) while keeping ICU-stay identities disjoint across splits. Because MIMIC-IV is a single-center dataset, we refer to these client partitions as simulated federated learning (SimFL). To enhance transparency, we report key items commonly required by prediction-model reporting checklists (e.g., data sources, participants, outcome and predictor definitions, and evaluation procedures), following TRIPOD+AI guidance and related evaluation recommendations [31,32].

3.2. Datasets and Cohort Construction

Stage-1 (PhysioNet/CinC 2019). We used the publicly available PhysioNet/CinC 2019 Challenge dataset for early prediction of sepsis [10]. The dataset provides multivariate ICU time series and sepsis onset labels. We use the two official hospital partitions (training set A and training set B) and construct training/validation/test splits within each partition. In total, the benchmark includes 40,336 patient records across A + B (Table 1). We evaluate cross-hospital generalization using a bidirectional cross-hospital evaluation scheme: train on A and test on B (A→B), and train on B and test on A (B→A). We evaluate multiple prediction horizons (H = 3, 6, 12 h) and remove ICULOS (ICU length of stay, hours) to mitigate temporal leakage [10].
Stage-2 (MIMIC-IV). We additionally construct a Sepsis-3-aligned early warning task from MIMIC-IV ICU data [11]. To focus on clinical interpretability and reduce feature ambiguity, we use features derived from the components of the Sequential Organ Failure Assessment (SOFA) score [2], including laboratory extrema and vasoactive drug dose summaries within a fixed observation window. All cohorts are de-identified public datasets; access to MIMIC-IV requires completion of the associated credentialing and data use agreement. The overall two-stage evaluation design is summarized in Table 1. Specifically, the full-SOFA input uses 12 raw variables required to compute the SOFA components (PaO2, FiO2, platelets, total bilirubin, mean arterial pressure, dopamine, dobutamine, epinephrine, norepinephrine, Glasgow Coma Scale, creatinine, and urine output); see Appendix A for a grouped list and units.
Stage-2 cohort statistics and leakage-controlled evaluation-window counts are summarized in Table 2. Throughout Stage-2, we distinguish between (i) eligible pre-onset evaluation windows used for discrimination metrics (AUROC/AUPRC) and (ii) a monitored stream that additionally includes post-onset windows for realized alert-burden calculations (Section 4.3). Because consecutive hourly windows within an ICU stay are highly correlated, we use a window-sampling strategy to limit redundancy; therefore, alert-rate results should be interpreted relative to this monitored stream. As reported in Table 1 and Table 2, the labeled cohort sizes (stays and windows) are large relative to model capacity; we additionally use regularization and early stopping to support stable parameter estimation.

3.3. Prediction Task, Windowing, and Label Definitions

We formulate sepsis early warning as a binary classification problem over sliding windows. Each sample consists of an observation window of length T = 6 h, and the model predicts whether sepsis will occur within a prediction horizon H (H ∈ {3, 6, 12} h). Stage-1 follows the PhysioNet 2019 label definition tied to sepsis onset in the Challenge dataset [10]. Stage-2 follows a Sepsis-3-aligned definition (suspected infection with an acute increase in SOFA score of ≥2 points) and uses a full-SOFA feature set computed from ICU charted data [1,2].
Model inputs and outputs. For each ICU stay, we extract a multivariate observation window x(t) covering the prior T = 6 h. In Stage-1, x(t) is a T × d time series with d = 39 variables (all PhysioNet/CinC 2019 variables except ICULOS), sampled hourly. In Stage-2, x(t) is a d = 12 tabular vector summarizing the worst (most abnormal) SOFA-related measurements/drug doses within the window (Appendix A). The model outputs a calibrated risk score p(t) ∈ [0,1] interpreted as the probability of sepsis onset within the next H hours; alerts are generated when p(t) ≥ τ_α, where τ_α is selected on the validation set to achieve a target window-level alert-rate α (α = 0.05).
Illustrative example (hypothetical). Stage-1: For an observation window ending at t = 36 h, x(t) is a 6 × 39 matrix of hourly measurements (ICULOS removed); for example, the most recent hour could include HR = 92 bpm, O2Sat = 97%, Temp = 37.2 °C, SBP = 118 mmHg, MAP = 78 mmHg, DBP = 62 mmHg, and Resp = 18/min (with other variables as in Table A1). A model might output p(t) = 0.12. Stage-2: For a window ending at t = 36 h, a Stage-2 feature vector may include PaO2 = 78 mmHg, FiO2 = 0.50, platelets = 120 × 10³/µL, bilirubin = 1.8 mg/dL, MAP = 65 mmHg, norepinephrine = 0.08 µg/kg/min, GCS = 13, creatinine = 1.6 mg/dL, and urine output = 0.4 mL/kg/h (with other vasopressor variables = 0 if not administered). A model might output p(t) = 0.23; if τ_0.05 = 0.20 (chosen on validation), this window would trigger an alert (ŷ(t) = 1).
Stage-1 label shift and horizon definition (PhysioNet/CinC 2019). The official Challenge label is intentionally shifted 6 h earlier than the adjudicated sepsis time; specifically, the label transitions to 1 at t_{onset,shift} = t_sepsis − 6 h. For a window ending at time t, our horizon-H target is y_{t,H} = 1 if t_{onset,shift} ∈ (t, t + H]. Equivalently, with respect to the true onset time, this corresponds to predicting whether t_sepsis ∈ (t, t + H + 6]. We report H with respect to the shifted label time to remain consistent with the official benchmark definition. Scheme 2 illustrates this mapping.
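For concreteness, the shifted-label mapping can be sketched as a small helper (illustrative only; the function name is ours, not the Challenge reference code; times are in hours):

```python
def horizon_label(t, t_sepsis, H):
    """Horizon-H target with respect to the shifted Challenge label.

    The official label switches to 1 at t_onset_shift = t_sepsis - 6 h,
    so y_{t,H} = 1 iff t_onset_shift lies in the half-open interval (t, t + H].
    Equivalently, this predicts t_sepsis in (t, t + H + 6]."""
    t_onset_shift = t_sepsis - 6
    return int(t < t_onset_shift <= t + H)
```

For example, with true onset at 42 h and H = 6, a window ending at t = 30 h is positive (shifted onset 36 h falls in (30, 36]), while a window ending at t = 36 h is not, because the interval is open on the left.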
We additionally align our leakage controls with recent clinical ML discussions of label leakage and prediction perspective [29].
Stage-2 label operationalization (Sepsis-3 proxy). We approximate Sepsis-3 using a retrospective EHR operationalization based on (i) suspected infection and (ii) an acute increase in organ dysfunction. Suspected infection is defined as the first qualifying pair of blood-culture sampling and antibiotic administration: if antibiotics occur first, the culture must be obtained within 24 h; if the culture occurs first, antibiotics must be started within 72 h, and the infection onset is the earlier event. Organ dysfunction is defined by an acute SOFA increase (ΔSOFA ≥ 2) relative to baseline (baseline assumed SOFA_base = 0 when pre-existing organ dysfunction is not documented) and evaluated within a 24 h window centered at the infection onset. We define sepsis onset as the earliest time at which both a suspected infection has occurred and ΔSOFA ≥ 2 is satisfied. To avoid post-onset leakage, windows ending at or after the onset are excluded, and labels are assigned only when the prediction horizon (t, t + H] contains the onset. SOFA component missingness is handled by carry-forward within the stay; remaining missing SOFA components are set to 0 (normal organ function), consistent with SOFA scoring conventions. All splits are group-disjoint by ICU stay to prevent within-stay leakage.
Because Sepsis-3 operationalization from retrospective EHR data can vary across implementations (e.g., how “suspected infection” is instantiated and how baseline SOFA is handled), we report our exact labeling and feature-construction workflow in the Supplementary Materials for transparency and reproducibility. We treat these labels as imperfect proxies and therefore emphasize leakage-aware splitting and workload-matched evaluation rather than optimizing a single threshold on the test set.
Algorithm 1 summarizes the proposed sepsis labeling and windowing procedure used in Stage-2.
Algorithm 1. Stage-2 Sepsis-3 proxy labeling and windowing rules (summary)
(1) Suspected infection time t_inf: identify the first qualifying culture↔antibiotics pair. If antibiotics occur first, require the culture within 24 h; if the culture occurs first, require antibiotics within 72 h. Set t_inf to the earlier event time.
(2) Baseline SOFA_base: assume SOFA_base = 0 when pre-existing organ dysfunction is not documented.
(3) Compute hourly SOFA(t) from ICU charted variables and labs.
(4) Sepsis onset t_sepsis: earliest time t within [t_inf − 12 h, t_inf + 12 h] such that SOFA(t) − SOFA_base ≥ 2.
(5) Window labeling: for each window ending at time t, set y_{t,H} = 1 if t_sepsis ∈ (t, t + H]; exclude windows with end time ≥ t_sepsis.
(6) Missingness: carry forward within stay; remaining missing SOFA components are set to 0 (normal organ function).
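Step (1) of Algorithm 1 can be sketched as follows (an illustrative helper of ours, not the exact pipeline code; the list-based inputs are an assumption, with times in hours from ICU admission):

```python
def suspected_infection_time(culture_times, abx_times):
    """First qualifying culture<->antibiotics pair (Algorithm 1, step 1).

    If antibiotics come first, the culture must follow within 24 h;
    if the culture comes first, antibiotics must start within 72 h.
    Returns the earlier event time of the first qualifying pair, else None."""
    qualifying = []
    for c in culture_times:
        for a in abx_times:
            abx_first = a <= c <= a + 24      # antibiotics first, culture within 24 h
            culture_first = c <= a <= c + 72  # culture first, antibiotics within 72 h
            if abx_first or culture_first:
                qualifying.append(min(c, a))
    return min(qualifying) if qualifying else None
```

For example, antibiotics at 5 h followed by a culture at 10 h qualifies and sets t_inf = 5 h, whereas a culture at 5 h with antibiotics only at 100 h does not qualify.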
Notation. For each ICU stay, let x_{t−T:t} denote the multivariate measurements within the observation window of length T ending at time t. The prediction target is y_t = 1 if sepsis onset occurs within the future horizon (t, t + H] and y_t = 0 otherwise. Models output a risk score p_t = f_θ(x_{t−T:t}) between 0 and 1. Candidate times t are generated on an hourly grid in principle; in this retrospective evaluation, we score a sampled subset of candidate times per stay (mean ≈ 2.74 windows/stay). Stay-level partitioning ensures that no windows from the same ICU stay appear in multiple splits.
We conduct leakage sanity checks by contrasting proper vs. leaky splitting and by performing a contamination stress test (see Section 4.3).

3.4. Models

We benchmark two sequence models and three classical ML baselines. (i) LSTM: a recurrent sequence model suited to multivariate temporal dynamics [12]. (ii) Transformer: an attention-based sequence model that can capture long-range dependencies [13]. For Stage-2 tabular SOFA-window features, we also evaluate (iii) logistic regression (LogReg) and (iv) gradient-boosted decision trees (XGBoost) [33], as well as random forests (RF) [34]. Related supporting tables and the audit figure are provided in Supplementary File (ZIP; Supplementary Tables S1–S4 and Supplementary Figure S1). Experiments were run with Python 3.10.12 and PyTorch 2.7.1+cu118 (CUDA 11.8) on a single NVIDIA A100 80 GB GPU; preprocessing and evaluation were performed using NumPy 1.26.4, pandas 2.3.3, and scikit-learn 1.7.2.

3.5. Federated Learning Algorithms

Let K denote the number of clients, D_k the local dataset at client k with n_k samples, and N = Σ_{k=1}^{K} n_k. Define the local empirical loss ℓ_k(θ) = (1/n_k) Σ_{(x,y)∈D_k} ℓ(f_θ(x), y). FL minimizes the weighted empirical risk across clients:
min_θ Σ_{k=1}^{K} (n_k/N) ℓ_k(θ)
In FedAvg [9], each round t broadcasts global parameters θ^t to clients; each client performs E steps of local optimization and returns updated parameters θ_k^{t+1}; the server aggregates by a weighted average:
θ^{t+1} = Σ_{k=1}^{K} (n_k/N) θ_k^{t+1}
FedProx [14] modifies the local objective by adding a proximal term to reduce client drift under heterogeneity:
min_θ ℓ_k(θ) + (μ/2) ‖θ − θ^t‖_2^2
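The server-side FedAvg update and the FedProx local objective can be sketched in NumPy as a minimal illustration of the two formulas above (the dict-of-arrays parameter format and function names are assumptions of ours, not the study's training code):

```python
import numpy as np

def fedavg_aggregate(client_params, client_sizes):
    """Server-side FedAvg step: size-weighted average of client parameters.

    client_params: list of dicts {name: np.ndarray}, one per client,
    returned after E local optimization steps. Weights are n_k / N."""
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()
    return {name: sum(w * p[name] for w, p in zip(weights, client_params))
            for name in client_params[0]}

def fedprox_local_loss(base_loss, theta, theta_global, mu):
    """FedProx local objective: base loss + (mu/2) * ||theta - theta_t||_2^2."""
    diff = np.concatenate([(theta[n] - theta_global[n]).ravel() for n in theta])
    return base_loss + 0.5 * mu * float(diff @ diff)
```

With two clients holding parameters w = 1 and w = 3 and sizes n_1 = 1, n_2 = 3, the aggregate is 0.25·1 + 0.75·3 = 2.5, matching the weighted-average update.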

3.6. Evaluation Metrics and Fixed Alert-Rate Lead-Time Analysis

We report the area under the receiver operating characteristic curve (AUROC) and the area under the precision–recall curve (AUPRC) on the held-out test set, with bootstrap confidence intervals where appropriate. Because sepsis prevalence is low, AUPRC is emphasized as a more informative metric for imbalanced data [35]. Calibration is summarized via the Brier score and reliability diagrams, and DCA is used to quantify net benefit across threshold probabilities [21]. This combination of discrimination, calibration, and clinical-utility reporting aligns with recent guidance on performance evaluation for predictive AI models intended for clinical decision support [36].
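To make the reported metrics concrete, AUROC and the Brier score reduce to simple estimators; the following dependency-free helpers are our own illustrative sketches (in practice, standard library implementations such as scikit-learn's would be used):

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic: the probability that a random
    positive window is scored above a random negative one (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(scores, labels):
    """Brier score: mean squared error between predicted risk and outcome."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)
```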
Fixed alert-rate thresholding. To compare models at a matched alert workload, we select a decision threshold τ_α on the validation set such that approximately a fraction α of validation windows triggers an alert (i.e., Pr(p_i ≥ τ_α) ≈ α). In practice, τ_α is chosen as the (1 − α) quantile of the validation risk scores {p_i}_{i∈Val}; the resulting threshold is then frozen for final test evaluation. No post hoc threshold tuning was performed on the test set.
τ_α = Q_{1−α}({p_i}_{i∈Val}),  ŷ_i = 1[p_i ≥ τ_α],  α = 0.05
We use α = 0.05 as a representative, moderate alert workload for benchmarking; importantly, α is an operational knob that can be tuned to local staffing capacity and acceptable alert burden.
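The quantile-based thresholding rule can be sketched as follows (illustrative; the function name is ours, and NumPy's default linear quantile interpolation is assumed):

```python
import numpy as np

def fixed_alert_rate_threshold(val_scores, alpha=0.05):
    """Select tau_alpha as the (1 - alpha) quantile of validation risk scores,
    so that roughly a fraction alpha of validation windows triggers an alert.
    The returned threshold is then frozen for test evaluation."""
    return float(np.quantile(np.asarray(val_scores, dtype=float), 1.0 - alpha))
```

On a validation stream with uniformly spread scores, the realized validation alert rate Pr(p ≥ τ_α) is close to α by construction; on the test stream it can drift slightly, which is exactly the behavior discussed above.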
Lead time and detection. For each sepsis-positive ICU stay s with onset time t_{onset,s}, we define the first alert time t_{alert,s} as the earliest window time at which the thresholded prediction ŷ_t = 1 under the frozen τ_α. A stay is considered detected if t_{alert,s} < t_{onset,s}; the lead time is L_s = t_{onset,s} − t_{alert,s} for detected stays. We report the detection rate and the distribution (median/mean) of L_s.
Note on horizon vs. lead time: the horizon-H window label used for AUROC/AUPRC differs from the stay-level lead time computed under the fixed alert-rate alarm policy. The alarm policy operates on the continuous risk score and records the first threshold crossing per stay; thus, observed lead time is not bounded by H because alerts can occur earlier under a workload constraint. Such early alerts are false positives under horizon labeling, but they remain operationally relevant. We therefore interpret lead time alongside realized alert burden and decision-analytic utility, rather than as a standalone objective.
To quantify the uncertainty of stay-level operational metrics, we performed stay-level cluster bootstrapping (B = 1000). In each bootstrap replicate, ICU stays were resampled with replacement, and all window-level predictions within each sampled stay were retained. This cluster bootstrap preserves within-stay correlation and avoids treating windows as independent samples. The stay-level first-crossing detection rate among sepsis-positive stays was recalculated for each replicate. Two-sided 95% confidence intervals were obtained from the 2.5th and 97.5th percentiles of the bootstrap distribution. For reproducibility, we fixed the bootstrap random seed (seed = 2026) for all reported confidence intervals. For sensitivity analysis, we augmented the feature set with binary missingness indicators (added for 10 of 12 variables) using scikit-learn’s SimpleImputer with missing-indicator augmentation (add_indicator = True) and repeated the same train/validation/test split design.
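The stay-level cluster bootstrap for the detection rate can be sketched as follows (illustrative; it assumes window-level predictions have already been collapsed to one first-crossing detection flag per sepsis-positive stay, so resampling stays preserves within-stay correlation):

```python
import numpy as np

def cluster_bootstrap_ci(stay_detected, B=1000, seed=2026):
    """Two-sided 95% CI for the stay-level detection rate.

    stay_detected: one 0/1 detection indicator per sepsis-positive ICU stay.
    Stays (clusters) are resampled with replacement B times; the CI is taken
    from the 2.5th and 97.5th percentiles of the bootstrap distribution."""
    rng = np.random.default_rng(seed)
    stays = np.asarray(stay_detected, dtype=float)
    rates = [rng.choice(stays, size=stays.size, replace=True).mean()
             for _ in range(B)]
    return tuple(np.percentile(rates, [2.5, 97.5]))
```

Fixing the seed (2026 in the paper's reporting) makes the reported intervals reproducible across reruns.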
Scheme 3 provides an at-a-glance summary of fixed alert-rate thresholding, and Scheme 4 summarizes the stay-level lead-time computation used throughout Stage-2 evaluations.
Algorithm 2 summarizes the fixed alert-rate evaluation and stay-level lead-time computation used in Stage-2.
Algorithm 2. Fixed alert-rate evaluation and stay-level lead-time metrics
Input: validation monitored scores {p_i^{val}}; target alert rate α; test scores {p_{s,t}^{test}}; sepsis onset times {t_{onset,s}}.
1. Select τ_α = Q_{1−α}({p_i^{val}}).
2. For each sepsis-positive stay s, define t_{alert,s} = min{t : ŷ_{s,t} = 1}; if no alert exists, mark the stay as undetected.
3. Detection indicator: I_s = 1[t_{alert,s} < t_{onset,s}].
4. Lead time (defined only if detected): L_s = t_{onset,s} − t_{alert,s}.
5. Report detection rate and the distribution of L_s (median, IQR).
6. Capped timeliness over detected stays: P(L_s ≤ 24 h) and P(L_s ≤ 48 h).
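Steps 2-4 of Algorithm 2 can be sketched per stay as follows (an illustrative helper of ours; times in hours, alerts already thresholded at the frozen τ_α):

```python
def stay_lead_time(times, alerts, t_onset):
    """First-crossing alert time, detection flag, and lead time for one
    sepsis-positive stay (Algorithm 2, steps 2-4).

    times: window end times; alerts: 0/1 thresholded predictions per window.
    Returns (t_alert, detected, lead): t_alert is None if no alert fired,
    and lead is defined only when the first alert precedes onset."""
    alert_times = [t for t, a in zip(times, alerts) if a == 1]
    if not alert_times:
        return None, False, None          # undetected: no threshold crossing
    t_alert = min(alert_times)            # first-crossing rule (one alert/stay)
    detected = t_alert < t_onset
    lead = t_onset - t_alert if detected else None
    return t_alert, detected, lead
```

For example, with alerts at 2 h and 3 h and onset at 10 h, the first crossing is at 2 h and the lead time is 8 h; an alert only at or after onset counts as undetected.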

3.7. Privacy and Security Stress Tests (Supplementary)

These experiments do not provide a formal differential privacy guarantee [37] because privacy accounting is not performed; we report them as qualitative stress tests of the utility–leakage trade-off under clipping/noise and adversarial updates. Accordingly, we do not claim any (ε, δ)-DP guarantee in this study.

4. Results

4.1. Stage-1 Cross-Hospital Benchmarking on PhysioNet 2019

Table 3 summarizes cross-hospital performance for H = 6 h under a leakage-aware feature set (ICULOS removed). Across both evaluation directions, federated training generally improves the LSTM’s generalization compared with purely local training, and FedProx is competitive with FedAvg under client heterogeneity. Transformers achieve strong performance across settings, and centralized training on pooled data provides an upper bound that is typically not feasible under data-sharing constraints.
To visualize horizon sensitivity beyond a single operating point, Figure 1 provides a heatmap of AUPRC across prediction horizons (H = 3, 6, 12 h) for key model/learning settings.
To complement Stage-1, we further quantify leakage sensitivity in Stage-2 via proper-vs-leaky splitting and a contamination sweep (see Section 4.3).

4.2. Stage-2 Full-SOFA Early Warning on MIMIC-IV with Operational Fixed Alert-Rate Evaluation

This section reports Stage-2 results in three steps: (i) discrimination (AUROC/AUPRC) on leakage-controlled pre-onset windows, (ii) operational performance under a fixed alert workload (α = 5%) using stay-level detection and lead-time distributions, and (iii) sensitivity/leakage stress tests to contextualize robustness.
Stage-2 statistics are reported for the monitored test stream (9927 windows sampled from the test split; 4022 ICU stays; 187 sepsis-positive stays), while operational early-warning metrics are computed on the eligible pre-onset evaluation subset (8743 windows; 3620 ICU stays; 110 sepsis-positive stays). The reductions in test stays (4022 → 3620) and sepsis-positive stays (187 → 110) reflect exclusions introduced by the eligible pre-onset definition and windowing constraints (e.g., no fully observed pre-onset window of length T before the assigned onset time). Unless otherwise stated, AUROC/AUPRC are computed on the eligible pre-onset evaluation subset, whereas alert-burden and α-sweep analyses use the monitored test stream.
Before reporting Stage-2 discrimination and operational results, we verify leakage control by comparing proper stay-level splitting with a leaky random split; leaky splits can substantially inflate performance (Figure 2).
Table 4 reports discrimination results for the MIMIC-IV full-SOFA task (T = 6 h, H = 6 h) under the proper stay-level split (pre-onset evaluation subset; 8743 evaluation windows). We report point estimates for centralized baselines (LogReg/RF/XGBoost) and simulated-FL (SimFL) FedAvg-LogReg variants under careunit and admission-type client partitions, along with a pooled-1 FedAvg baseline.
As shown in Figure 3, Stage-2 discrimination performance is summarized by AUROC and AUPRC with bootstrap uncertainty. Calibration and decision-curve analysis are presented in the corresponding figure in this section. Table 4 reports the test-set AUROC/AUPRC for centralized baselines and key SimFL configurations.
Operational evaluation at a fixed alert workload is summarized in Table 5. Thresholds are selected to yield a 5% alert rate on the validation set.
Operational stability under alert-rate sweeps. In the Stage-2 monitored test stream (9927 windows across 4022 ICU stays, including 187 sepsis-positive stays), we assessed robustness under varying fixed alert rates (α = 1–10%). At α = 5%, the stay-level first-crossing detection rate among sepsis-positive stays was 12.8% (95% CI: 8.0–17.6%) based on stay-level cluster bootstrapping (B = 1000, seed = 2026). Detection increased monotonically with α, indicating threshold-dependent operational scaling without instability. Because the α-sweep is computed on the monitored stream (including sepsis-positive stays without eligible pre-onset windows), it should not be compared directly to Table 5a detection rates computed on the leakage-controlled eligible pre-onset subset (110 sepsis-positive stays).
Missing-indicator sensitivity. Adding binary missingness indicators (added for 10 of 12 variables) improved discrimination (ΔAUROC = +0.0373; ΔAUPRC = +0.0137) under the same train/validation/test split design. Because missingness may reflect measurement and care processes rather than purely physiological severity, we report both versions and interpret these gains with caution (Supplementary Table S3).
To contextualize Stage-2 operating points, we report the stay-level sepsis prevalence in the monitored test stream (187/4022 = 4.65%). Under the fixed alert-rate policy, the target alert workload is α (e.g., 5% of validation windows triggering an alarm), and the realized window-level alert rate on the test monitored stream equals the proportion of monitored windows with p ≥ τ_α. Because τ_α is selected on the validation set and then frozen, the realized test alert rate can differ slightly from α. Moreover, because the monitored stream is window-subsampled for computational tractability (mean 2.47 monitored windows per stay), these workload quantities should be interpreted as proxies rather than as alerts per ICU-day under continuous hourly monitoring. This workload is computed at the window level (threshold crossings) prior to any per-stay alert suppression; for timeliness, we use a first-crossing rule (at most one pre-onset alert per stay), consistent with common deployment suppression policies. For window-level discrimination, we report performance over leakage-controlled eligible (pre-onset) evaluation windows (n_eval; Table 2), whereas lead time is computed per stay using the first pre-onset alert. Realized alert rates may differ across partitions because τ_α is fixed on validation rather than tuned on test.
Operationally, alert burden should be interpreted in the context of alarm fatigue and prior mixed evidence from deployed sepsis prediction systems, which motivates workload-matched evaluation and careful operational calibration in practice [38,39,40].
We note that the careunit FedAvg-LogReg row yields an extreme threshold τ_α because the fixed alert-rate rule selects an upper quantile of predicted probabilities. Under strong class imbalance and heterogeneous careunit distributions, logistic models can produce highly skewed or near-bimodal score distributions (quasi-separation), which pushes τ_α toward 1.0. This behavior reflects calibration/scale differences rather than a labeling bug; calibration diagnostics are reported in Section 4.4 and in the Supplementary Materials. Figure 4 visualizes the lead-time distribution at the same 5% alert-rate operating point via its cumulative distribution function (CDF); capped timeliness statistics are reported in Table 5b. The legend in Figure 4 denotes the number of detected sepsis-positive stays plotted for each model in the monitored test stream.
Baseline SOFA sensitivity was evaluated in two complementary ways: (i) we replaced the baseline definition with the minimum partial SOFA (excluding the respiratory component due to feature availability) within the first 6–12 h after ICU admission and recomputed Sepsis-3 onset accordingly (Table S1); and (ii) at α = 5%, we compared baseline SOFA = 0 (Sepsis-3 default) versus baseline defined as the first-24h minimum proxy SOFA_total and reported stay-level any-alert detection (18.2%, 95% CI: 12.8–23.5% vs. 24.1%, 95% CI: 18.3–29.9%; Table S4).
Figure 5 presents the corresponding careunit-stratified results.

4.3. Stage-2 Sensitivity Analyses and Leakage Stress Tests (SOFA-Proxy Label; Reference CIs for the Main Full-SOFA Task)

To further quantify how evaluation-design choices can inflate performance estimates, we conduct an additional sensitivity analysis on MIMIC-IV using a SOFA-based proxy label (acute ΔSOFA ≥ 2) and a controlled contamination sweep. We inject a fraction of windows originating from held-out test sets into training (leak fraction in {0, 5, 10, 30, 50}%) and report mean ± std over three random seeds. This stress test complements the main full-SOFA/Sepsis-3 evaluation by directly demonstrating the impact of leakage contamination under otherwise identical preprocessing. The contamination trend is shown in Figure 6 and summarized in part (a) of Table 6.
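One plausible reading of this contamination injection can be sketched as follows (hypothetical function and index names; we assume the held-out test set itself stays fixed, so the leaked windows appear on both sides of the split):

```python
import numpy as np

def contaminate_training(train_idx, test_idx, leak_fraction, seed=0):
    """Inject a fraction of held-out test windows into the training set.

    The test set is left unchanged; the leaked windows thus appear in
    both training and test, simulating a split-integrity violation.
    """
    rng = np.random.default_rng(seed)
    test_idx = np.asarray(test_idx)
    n_leak = int(round(leak_fraction * len(test_idx)))
    leaked = rng.choice(test_idx, size=n_leak, replace=False)
    return np.concatenate([np.asarray(train_idx), leaked])
```

Sweeping `leak_fraction` over {0, 0.05, 0.10, 0.30, 0.50} and retraining per seed reproduces the shape of a contamination stress test under otherwise identical preprocessing.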
The contamination sweep uses a centralized XGBoost classifier on the SOFA-proxy label with the same windowing (T = 6 h, H = 6 h) and feature pipeline as the main task. Table 6b reports stay-level cluster-bootstrap 95% confidence intervals for Centralized-LogReg and SimFL FedAvg-LogReg (careunit/admission-type) under the proper stay-level split.
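The stay-level cluster bootstrap behind such intervals can be sketched as follows. Function and variable names are hypothetical; the key design choice, resampling whole ICU stays rather than individual windows so that within-stay correlation is respected, follows the description above.

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC via the Mann-Whitney U statistic (assumes no score ties)."""
    y_true = np.asarray(y_true)
    ranks = np.argsort(np.argsort(y_score)) + 1  # 1-based ranks of scores
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def stay_level_bootstrap_ci(stay_ids, y_true, y_score, n_boot=500, seed=0):
    """95% CI for window-level AUROC, resampling whole ICU stays."""
    stay_ids, y_true, y_score = map(np.asarray, (stay_ids, y_true, y_score))
    stays = np.unique(stay_ids)
    idx_by_stay = {s: np.flatnonzero(stay_ids == s) for s in stays}
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        sample = rng.choice(stays, size=len(stays), replace=True)
        idx = np.concatenate([idx_by_stay[s] for s in sample])
        if 0 < y_true[idx].sum() < len(idx):  # both classes present
            stats.append(auroc(y_true[idx], y_score[idx]))
    return tuple(np.percentile(stats, [2.5, 97.5]))
```

A naive window-level bootstrap would understate uncertainty here, because windows from the same stay are highly correlated; the cluster bootstrap keeps each sampled stay's windows together.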

4.4. Summary of Privacy and Security Stress Tests (Supplementary)

Table 7 summarizes the compact main-text privacy/security overview for the 100,000-window audit sample. Membership inference AUC values remain close to chance in these stress tests, DP-inspired clipping/noise illustrates a utility–privacy trade-off, and poisoning experiments show that simple clipping offers only partial mitigation. Because these audits run on an aggregate audit subset optimized for repeated attack/defense evaluation, the absolute AUROC/AUPRC values should not be compared directly with the main evaluation results; the focus is on relative changes across audit conditions. Detailed per-setting attack/defense sweeps are provided in the Supplementary File (Supplementary Table S2 and Supplementary Figure S1).
Figure 7 summarizes probability calibration and threshold-dependent net-benefit patterns for the three centralized baseline models under the same Stage-2 evaluation design. Panel (a) presents binned probability calibration together with the predicted-probability distribution, and panel (b) presents decision-curve analysis as an additional utility diagnostic.
The corresponding trustworthiness audit panel accompanies these sweeps in the Supplementary File (Supplementary Table S2 and Supplementary Figure S1), so that the main manuscript can remain focused on the compact privacy/security summary (Table 7) and the calibration/utility diagnostic overview in Figure 7.

5. Discussion

5.1. Interpreting the Cross-Hospital FL Benchmark

Stage-1 results show that cross-hospital evaluation can materially differ from within-hospital testing, supporting the need for explicit A→B/B→A cross-hospital evaluation schemes rather than single-site splits. Stage-1 follows the official PhysioNet/CinC A/B benchmark as a minimal two-client setting; the Stage-2 partition-sensitivity analyses partially bridge the gap to richer multi-client heterogeneity, while true multi-hospital external validation remains an important direction for future work. Across horizons, FL tends to benefit the LSTM more than the Transformer, consistent with the notion that more expressive sequence models may partially compensate for limited local diversity. FedProx is competitive with FedAvg, suggesting that proximal regularization can help stabilize training under client heterogeneity without sacrificing performance.

5.2. Why Fixed Alert-Rate Evaluation Matters

Standard metrics (AUROC/AUPRC) quantify ranking quality but do not directly determine how many alarms a system would generate. Because alert fatigue is a major barrier to clinical adoption, evaluating models at a matched alert workload is critical. Our fixed alert-rate approach selects a validation threshold to cap alerts (e.g., 5% of windows) and then measures detection and lead time at the ICU-stay level. In Stage-2, the centralized XGB model yields the highest detection rate and the longest median lead time (Table 5a,b), while federated LogReg models trade lower detection for earlier/noisier alerts depending on the client partition. Very long lead times can arise from the window-based formulation and onset definitions; therefore, clinical actionability should be interpreted alongside false-positive burden and threshold-dependent net-benefit patterns, rather than lead time alone. Clinical guideline recommendations emphasize timely recognition and escalation, motivating 24–48 h reporting windows for early warning [41]. In deployment, long-tail early alerts can be managed via persistence rules (e.g., requiring k consecutive windows above the decision threshold), escalation windows, or service-line–specific alert-rate calibration to reduce spurious early alerts while preserving actionable timeliness.
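The first-crossing rule with an optional persistence requirement (k consecutive supra-threshold windows) described above could be sketched as follows; function names and the hourly time grid are illustrative assumptions, not the study's code.

```python
import numpy as np

def first_crossing_lead_time(times, scores, tau, onset_time, k=1):
    """Lead time (hours) from the first qualifying alert to sepsis onset.

    Only pre-onset windows are considered; with k > 1, an alert requires
    k consecutive windows at or above the threshold (persistence rule).
    Returns None if no alert fires before onset.
    """
    times = np.asarray(times, dtype=float)
    scores = np.asarray(scores, dtype=float)
    run = 0
    for t, score in zip(times, scores):
        if t >= onset_time:
            break  # post-onset windows do not count
        run = run + 1 if score >= tau else 0
        if run >= k:
            return onset_time - t  # at most one pre-onset alert per stay
    return None
```

Raising k trades earlier, possibly spurious alerts for later but more persistent ones, which is one concrete way to manage the long-tail early alerts noted above.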

5.3. Calibration, Decision-Analytic Utility, and Non-IID Heterogeneity

Calibration and decision-curve analysis provide complementary information to discrimination. In this study, Figure 7 is intended as a diagnostic summary for the three centralized baseline models rather than as definitive evidence of sustained clinical benefit across the examined threshold range. The Stage-2 analyses show that calibration behavior and threshold-dependent net-benefit patterns can vary by model family. Moreover, careunit-stratified results (Figure 5) highlight non-IID heterogeneity that can be masked by pooled metrics; this supports stratified reporting and motivates personalization or hierarchical FL extensions in future work.

5.4. Leakage, Privacy, Security, and Limitations

Leakage controls are essential for trustworthy evaluation. Our proper-vs-leaky and contamination stress tests (Figure 2) demonstrate that even modest violations of split integrity can inflate performance, reinforcing that early warning evaluation must enforce strict temporal and patient-level separation. Beyond leakage, FL introduces privacy and security considerations. While the membership inference AUC values in our stress tests are near chance [42], this does not constitute a formal privacy guarantee. DP-inspired clipping/noise can reduce membership signal but may reduce discrimination; rigorous privacy accounting and threat modeling are required before deployment. Poisoning experiments further show that adversarial clients can degrade model performance, and simple defenses (e.g., clipping) offer partial protection, underscoring the need for robust aggregation and monitoring in real-world FL systems.
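A minimal sketch of the DP-inspired clipping-and-noise step discussed above, assuming per-client update clipping before FedAvg-style averaging (parameter names are illustrative; as the text cautions, without privacy accounting this only resembles DP-SGD aggregation and carries no formal guarantee):

```python
import numpy as np

def clip_and_noise(updates, clip_norm=1.0, noise_std=0.1, seed=0):
    """Clip each client update to an L2 ball, average, add Gaussian noise.

    Illustrates the utility-privacy knob only; clipping also provides
    partial (not robust) mitigation against large poisoned updates.
    """
    rng = np.random.default_rng(seed)
    clipped = []
    for u in updates:
        u = np.asarray(u, dtype=float)
        norm = np.linalg.norm(u)
        scale = min(1.0, clip_norm / max(norm, 1e-12))
        clipped.append(u * scale)
    avg = np.mean(clipped, axis=0)
    return avg + rng.normal(0.0, noise_std, size=avg.shape)
```

Larger `noise_std` suppresses membership signal at the cost of discrimination, matching the trade-off observed in the stress tests.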
Limitations and future directions. This study is not designed to claim that FL outperforms centralized training; rather, Stage-2 quantifies workload-matched operational performance and trustworthiness trade-offs under realistic client heterogeneity. First, Stage-2 client partitions (careunits/admission-type) emulate heterogeneity but do not fully substitute for true multi-hospital external validation. Second, labels derived from retrospective EHR data are inherently noisy, and Sepsis-3 operationalization can vary across institutions and implementations. In addition, some full-SOFA laboratory components are sampled irregularly and may be delayed; prospective deployment should explicitly model measurement availability/latency (e.g., masks or time-since-last-measurement features) to avoid optimistic assumptions. Third, our DP-inspired experiments do not provide a formal differential privacy guarantee [37] because privacy accounting and composition were not performed. Fourth, retrospective evaluation cannot capture prospective workflow effects such as changes in clinician behavior. Finally, in Stage-2 we assume baseline SOFA_base = 0 when pre-existing organ dysfunction is not documented; alternative baseline definitions (e.g., minimum SOFA over an initial stabilization period) may shift inferred onset times and derived lead-time statistics, and imputing missing SOFA components as 0 is a pragmatic choice that may understate risk if missingness is informative. Future work should evaluate explicit missingness modeling and baseline sensitivity analyses, alongside prospective validation, richer personalization strategies, and end-to-end privacy/security audits with explicit adversary models.
More generally, calibration is often a critical weakness in predictive analytics and should be monitored in deployment [43]; systematic reviews highlight substantial heterogeneity in sepsis prediction model performance across settings [44], and classic ICU sepsis early-warning models provide complementary baselines for interpreting performance trade-offs [45]. The pragmatic Stage-2 client partitions also reflect data-sharing constraints that often preclude raw multi-hospital pooling; future studies should validate the framework on true multi-institution cohorts. Finally, because α is an operational knob, sensitivity analyses across multiple alert-rate settings (e.g., 1–10%) are important for tailoring deployment to staffing capacity. Although Stage-1 evaluates cross-hospital performance across the PhysioNet/CinC 2019 A/B partitions, Stage-2 is confined to a single-center database (MIMIC-IV). While MIMIC-IV spans multiple care units and clinical contexts, our client-partitioning strategy represents internal heterogeneity rather than independent institutional external validation. Therefore, the transportability of the proposed pipeline across geographically and operationally distinct healthcare systems should be interpreted with caution, and prospective multi-institutional validation remains necessary to establish real-world deployment readiness. Importantly, our primary contribution lies in the leakage-aware evaluation framework and workload-matched operational analysis, rather than in claiming universal cross-institutional performance superiority. Future work should (i) report sensitivity across multiple alert-rate settings (e.g., α ∈ {1%, 2%, 5%, 10%}) to explicitly map the workload–timeliness trade-off, (ii) include purely local (per-client) baselines under the same fixed alert-rate policy for Stage-2 partitions, and (iii) benchmark against simple rule-based scores (e.g., SOFA/qSOFA/NEWS) where feasible.
Although we enforced group-disjoint splits at the ICU-stay level, patients with multiple admissions (subject_id) may contribute ICU stays across different splits; future work should evaluate subject-disjoint splitting and quantify its impact on performance.
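Subject-disjoint splitting of the kind proposed above can be sketched as follows (hypothetical names; the idea is simply that the split is drawn over unique `subject_id` values, so a patient's repeated admissions can never straddle train and test):

```python
import numpy as np

def subject_disjoint_split(subject_ids, test_frac=0.2, seed=0):
    """Split window indices so no subject_id appears in both sets."""
    subject_ids = np.asarray(subject_ids)
    subjects = np.unique(subject_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(subjects)
    n_test = max(1, int(round(test_frac * len(subjects))))
    test_subjects = set(subjects[:n_test].tolist())
    mask = np.array([s in test_subjects for s in subject_ids])
    return np.flatnonzero(~mask), np.flatnonzero(mask)
```

The same pattern generalizes to stay-level grouping by substituting stay identifiers for `subject_ids`; scikit-learn's `GroupShuffleSplit` offers an equivalent off-the-shelf alternative.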
It is important to note that the present Stage-2 evaluation was conducted on a sampled window-level monitored test subset (mean 2.47 windows per stay), rather than on continuously monitored ICU time series. Therefore, stay-level first-crossing detection rates in this subset should not be directly interpreted as full-monitoring deployment performance. Instead, these results demonstrate threshold stability and internal operational consistency under constrained evaluation sampling.

6. Conclusions

We presented a leakage-aware federated learning evaluation pipeline for sepsis early warning and benchmarked sequence models under cross-hospital generalization on PhysioNet 2019, followed by a full-SOFA early warning task on MIMIC-IV with operational fixed alert-rate lead-time analysis. Across settings, explicit leakage control and workload-matched evaluation were critical for credible comparisons. Supplementary privacy and security stress tests further contextualize deployment risks and mitigation strategies. Overall, our results support FL as a practical approach for collaborative early warning model development while highlighting the methodological safeguards needed for trustworthy clinical translation.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/app16062735/s1, Supplementary File (ZIP) contains aggregate-only supporting items referenced in the main text: Supplementary Table S1 (baseline SOFA sensitivity using the 6–12 h partial-SOFA alternative), Supplementary Table S2 (detailed privacy/security stress-test sweep summary moved from the main text), Supplementary Table S3 (missing-indicator sensitivity analysis), Supplementary Table S4 (baseline SOFA sensitivity at α = 5% under alternative baseline definitions), and Supplementary Figure S1 (trustworthiness audit panel on the 100,000-window sample). No patient-level records or per-stay predictions are redistributed.

Author Contributions

Conceptualization, H.J. and H.L.; methodology, H.J.; software, H.J.; validation, H.J. and H.L.; formal analysis, H.J.; investigation, H.J.; resources, H.J.; data curation, H.J.; writing—original draft preparation, H.J.; writing—review and editing, H.J. and H.L.; visualization, H.J.; supervision, H.L.; project administration, H.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study because it uses de-identified datasets. The PhysioNet/CinC 2019 dataset is publicly available; access to MIMIC-IV requires credentialed access under a data use agreement.

Informed Consent Statement

Patient consent was waived because this study used de-identified datasets and involved no direct interaction with human participants.

Data Availability Statement

Restrictions apply to the availability of these data. MIMIC-IV is a credentialed dataset available via PhysioNet. To support reproducibility without redistributing individual-level derived outputs, Supplementary File provides aggregate summary tables (CSV) and a supporting audit figure (PNG) referenced in Section 4.2 and Section 4.4. Supplementary File contains only aggregate outputs and does not include patient-level records or per-stay predictions. Analysis code can be shared upon reasonable request and/or in a suitable repository under the same access restrictions.

Acknowledgments

The authors thank the PhysioNet and MIMIC-IV teams for providing access to de-identified critical care datasets and the open-source community for software used in this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Feature Sets and Model I/O Details

Stage-1 (PhysioNet/CinC 2019) input features. The PhysioNet/CinC 2019 dataset provides 40 variables measured hourly, including vital signs, laboratory measurements, demographics, and unit indicators. For clarity and to avoid temporal leakage, we exclude the explicit time counter ICULOS and use the remaining 39 variables as the Stage-1 model input (d = 39).
Stage-2 (MIMIC-IV) full-SOFA input features. To reduce feature ambiguity and align with Sepsis-3, Stage-2 uses the 12 raw variables needed to compute full SOFA components within each observation window (PaO2, FiO2, platelets, total bilirubin, MAP, dopamine, dobutamine, epinephrine, norepinephrine, GCS, creatinine, and urine output). These variables are summarized within the window (worst/most abnormal value) and used as the Stage-2 feature vector (d = 12).
Table A1. Stage-1 (PhysioNet/CinC 2019) feature set (ICULOS removed).
Feature Group | Variables (Hourly Unless Noted)
Vital signs | HR, O2Sat, Temp, SBP, MAP, DBP, Resp, EtCO2
Laboratory + demographics | BaseExcess, HCO3, FiO2, pH, PaCO2, SaO2, AST, BUN, Alkalinephos, Calcium, Chloride, Creatinine, Bilirubin_direct, Glucose, Lactate, Magnesium, Phosphate, Potassium, Bilirubin_total, TroponinI, Hct, Hgb, PTT, WBC, Fibrinogen, Platelets; Age, Gender, Unit1, Unit2, HospAdmTime
Table A2. Stage-2 (MIMIC-IV) full-SOFA variables used as model inputs (d = 12).
SOFA Component | Raw Variables Used | Example Window Summary
Respiratory/Coagulation/Liver/CNS/Renal | PaO2, FiO2; Platelets; Bilirubin_total; GCS; Creatinine, UrineOutput | Worst value within window (e.g., PaO2 min, FiO2 max, platelets min, bilirubin max, GCS min, creatinine max, urine output min)
Cardiovascular | MAP; Dopamine; Dobutamine; Epinephrine; Norepinephrine | MAP min; maximum infusion rate per drug within window (0 if not administered)
Model output. For both stages, models output a window-level risk score p(t) ∈ [0,1]. Operational alerts are generated by comparing p(t) to a validation-selected fixed alert-rate threshold τ_α (Section 3.6).
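The worst-value window summarization in Table A2 can be illustrated as follows. This is a hedged sketch: the variable names and per-variable min/max directions are assumptions matching Table A2, not the study's exact pipeline.

```python
# Direction of "worst" per variable: some worsen downward (min), others upward (max).
WORST = {
    "PaO2": "min", "FiO2": "max", "Platelets": "min", "Bilirubin_total": "max",
    "GCS": "min", "Creatinine": "max", "UrineOutput": "min", "MAP": "min",
}

def summarize_window(window):
    """Reduce each variable's in-window measurements to its worst value.

    `window` maps variable name -> list of measurements (possibly empty).
    Variables with no measurement yield NaN here; downstream SOFA scoring
    handles missing components (imputed as 0 in the study's pragmatic choice).
    """
    out = {}
    for var, direction in WORST.items():
        values = window.get(var, [])
        if len(values) == 0:
            out[var] = float("nan")
        else:
            out[var] = float(min(values) if direction == "min" else max(values))
    return out
```

Vasopressor rates (dopamine, dobutamine, epinephrine, norepinephrine) follow the same pattern with `max` and a default of 0 when not administered.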

References

  1. Singer, M.; Deutschman, C.S.; Seymour, C.W.; Shankar-Hari, M.; Annane, D.; Bauer, M.; Bellomo, R.; Bernard, G.R.; Chiche, J.-D.; Coopersmith, C.M.; et al. The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA 2016, 315, 801–810. [Google Scholar] [CrossRef] [PubMed]
  2. Vincent, J.-L.; Moreno, R.; Takala, J.; Willatts, S.; De Mendonça, A.; Bruining, H.; Reinhart, C.K.; Suter, P.M.; Thijs, L.G. The SOFA (Sepsis-related Organ Failure Assessment) Score to Describe Organ Dysfunction/Failure. Intensive Care Med. 1996, 22, 707–710. [Google Scholar] [CrossRef]
  3. Sherman, E.; Szolovits, P.; Ieva, A.; Patel, V.; Ghassemi, M. Leveraging Clinical Time-Series Data for Prediction: A Cautionary Tale. arXiv 2018, arXiv:1811.12520. [Google Scholar] [CrossRef]
  4. Moor, M.; Bennett, N.; Plečko, D.; Horn, M.; Rieck, B.; Meinshausen, N.; Bühlmann, P.; Borgwardt, K. Predicting Sepsis Using Deep Learning across International Sites: A Retrospective Development and Validation Study. EClinicalMedicine 2023, 62, 102124. [Google Scholar] [CrossRef]
  5. Boussina, A.; Shashikumar, S.P.; Malhotra, A.; Owens, R.L.; El-Kareh, R.; Longhurst, C.A.; Quintero, K.; Donahue, A.; Chan, T.C.; Nemati, S.; et al. Impact of a Deep Learning Sepsis Prediction Model on Quality of Care and Survival. npj Digit. Med. 2024, 7, 14. [Google Scholar] [CrossRef]
  6. Valan, B.; Prakash, A.; Ratliff, W.; Gao, M.; Muthya, S.; Thomas, A.; Eaton, J.L.; Gardner, M.; Nichols, M.; Revoir, M.; et al. Evaluating Sepsis Watch Generalizability through Multisite External Validation of a Sepsis Machine Learning Model. npj Digit. Med. 2025, 8, 350. [Google Scholar] [CrossRef]
  7. Gupta, A.; Chauhan, R.; G, S.; Shreekumar, A. Improving Sepsis Prediction in Intensive Care with SepsisAI: A Clinical Decision Support System with a Focus on Minimizing False Alarms. PLoS Digit. Health 2024, 3, e0000569. [Google Scholar] [CrossRef]
  8. Rich, R.L.; Montero, J.M.; Dillon, K.E.; Condon, P.; Vadaparampil, M. Evaluation of an Intensive Care Unit Sepsis Alert in Critically Ill Medical Patients. Am. J. Crit. Care 2024, 33, 212–216. [Google Scholar] [CrossRef]
  9. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; Aguera y Arcas, B. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282. [Google Scholar]
  10. Reyna, M.A.; Josef, C.S.; Jeter, R.; Shashikumar, S.P.; Westover, M.B.; Nemati, S.; Clifford, G.D.; Sharma, A. Early Prediction of Sepsis From Clinical Data: The PhysioNet/Computing in Cardiology Challenge. Crit. Care Med. 2020, 48, 210–217. [Google Scholar] [CrossRef] [PubMed]
  11. Johnson, A.E.W.; Bulgarelli, L.; Shen, L.; Gayles, A.; Shammout, A.; Horng, S.; Pollard, T.J.; Hao, S.; Moody, B.; Gow, B.; et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 2023, 10, 1. [Google Scholar] [CrossRef] [PubMed]
  12. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems; The Neural Information Processing Systems Foundation: San Diego, CA, USA, 2017; pp. 5998–6008. [Google Scholar]
  14. Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450. [Google Scholar]
  15. Rieke, N.; Hancox, J.; Li, W.; Milletarì, F.; Roth, H.R.; Albarqouni, S.; Bakas, S.; Galtier, M.N.; Landman, B.A.; Maier-Hein, K.; et al. The Future of Digital Health with Federated Learning. npj Digit. Med. 2020, 3, 119. [Google Scholar] [CrossRef]
  16. Alam, M.U.; Rahmani, R. FedSepsis: A Federated Multi-Modal Deep Learning-Based Internet of Medical Things Application for Early Detection of Sepsis from Electronic Health Records Using Raspberry Pi and Jetson Nano Devices. Sensors 2023, 23, 970. [Google Scholar] [CrossRef]
  17. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and Open Problems in Federated Learning. arXiv 2019, arXiv:1912.04977. [Google Scholar]
  18. Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Process. Mag. 2020, 37, 50–60. [Google Scholar] [CrossRef]
  19. Bonawitz, K.; Ivanov, V.; Kreuter, B.; Marcedone, A.; McMahan, H.B.; Patel, S.; Ramage, D.; Segal, A.; Seth, K. Practical Secure Aggregation for Privacy-Preserving Machine Learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Dallas, TX, USA, 30 October–3 November 2017; pp. 1175–1191. [Google Scholar]
  20. Kaissis, G.A.; Makowski, M.R.; Rückert, D.; Braren, R.F. Secure, Privacy-Preserving and Federated Machine Learning in Medical Imaging. Nat. Mach. Intell. 2020, 2, 305–311. [Google Scholar] [CrossRef]
  21. Vickers, A.J.; Elkin, E.B. Decision Curve Analysis: A Novel Method for Evaluating Prediction Models. Med. Decis. Mak. 2006, 26, 565–574. [Google Scholar] [CrossRef]
  22. Shokri, R.; Stronati, M.; Song, C.; Shmatikov, V. Membership Inference Attacks against Machine Learning Models. In 2017 IEEE Symposium on Security and Privacy (SP); IEEE: New York, NY, USA, 2017; pp. 3–18. [Google Scholar]
  23. Yeom, S.; Giacomelli, I.; Fredrikson, M.; Jha, S. Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF); IEEE: New York, NY, USA, 2018; pp. 268–282. [Google Scholar]
  24. Melis, L.; Song, C.; De Cristofaro, E.; Shmatikov, V. Exploiting Unintended Feature Leakage in Collaborative Learning. In 2019 IEEE Symposium on Security and Privacy (SP); IEEE: New York, NY, USA, 2019. [Google Scholar]
  25. Bhagoji, A.N.; Chakraborty, S.; Mittal, P.; Calo, S. Model Poisoning Attacks in Federated Learning. arXiv 2019, arXiv:1811.02645. [Google Scholar]
  26. Bagdasaryan, E.; Veit, A.; Hua, Y.; Estrin, D.; Shmatikov, V. How to Backdoor Federated Learning. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), Virtual, 26–28 August 2020; pp. 2938–2948. [Google Scholar]
  27. Blanchard, P.; El Mhamdi, E.M.; Guerraoui, R.; Stainer, J. Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent. In Advances in Neural Information Processing Systems; The Neural Information Processing Systems Foundation: San Diego, CA, USA, 2017. [Google Scholar]
  28. Yin, D.; Chen, Y.; Ramchandran, K.; Bartlett, P. Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates. In Proceedings of the 35th International Conference on Machine Learning; PMLR: Stockholm, Sweden, 2018; pp. 5650–5659. [Google Scholar]
  29. Davis, S.E.; Matheny, M.E.; Balu, S.; Sendak, M.P. A Framework for Understanding Label Leakage in Machine Learning for Health Care. J. Am. Med. Inform. Assoc. 2024, 31, 274–280. [Google Scholar] [CrossRef] [PubMed]
  30. Kapoor, S.; Narayanan, A. Leakage and the Reproducibility Crisis in Machine-Learning-Based Science. Patterns 2023, 4, 100804. [Google Scholar] [CrossRef]
  31. Collins, G.S.; Moons, K.G.M.; Dhiman, P.; Riley, R.D.; Beam, A.L.; Van Calster, B.; Ghassemi, M.; Liu, X.; Reitsma, J.B.; van Smeden, M.; et al. TRIPOD+AI Statement: Updated Guidance for Reporting Clinical Prediction Models that Use Regression or Machine Learning Methods. BMJ 2024, 385, e078378. [Google Scholar] [CrossRef]
  32. Collins, G.S.; Dhiman, P.; Ma, J.; Schlussel, M.M.; Archer, L.; Van Calster, B.; Harrell, F.E.; Martin, G.P.; Moons, K.G.M.; van Smeden, M.; et al. Evaluation of Clinical Prediction Models (Part 1): From Development to External Validation. BMJ 2024, 384, e074819. [Google Scholar] [CrossRef] [PubMed]
  33. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In KDD’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794. [Google Scholar]
  34. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  35. Saito, T.; Rehmsmeier, M. The Precision–Recall Plot Is More Informative than the ROC Plot when Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef]
  36. Van Calster, B.; Collins, G.S.; Vickers, A.J.; Wynants, L.; Kerr, K.F.; Barreñada, L.; Varoquaux, G.; Singh, K.; Moons, K.G.M.; Hernandez-Boussard, T.; et al. Evaluation of Performance Measures in Predictive Artificial Intelligence Models to Support Medical Decisions: Overview and Guidance. Lancet Digit. Health 2025, 7, e100916. [Google Scholar] [CrossRef]
  37. Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep Learning with Differential Privacy. In CCS’16: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security; Association for Computing Machinery: New York, NY, USA, 2016; pp. 308–318. [Google Scholar]
  38. Wong, A.; Otles, E.; Donnelly, J.P.; Krumm, A.; McCullough, J.; DeTroyer-Cooley, O.; Pestrue, J.; Phillips, M.; Konye, J.; Penoza, C.; et al. External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients. JAMA Intern. Med. 2021, 181, 1065–1070. [Google Scholar] [CrossRef]
  39. Cvach, M. Monitor Alarm Fatigue: An Integrative Review. Biomed. Instrum. Technol. 2012, 46, 268–277. [Google Scholar] [CrossRef]
  40. Sendelbach, S.; Funk, M. Alarm Fatigue: A Patient Safety Concern. AACN Adv. Crit. Care 2013, 24, 378–386. [Google Scholar] [CrossRef] [PubMed]
  41. Evans, L.; Rhodes, A.; Alhazzani, W.; Antonelli, M.; Coopersmith, C.M.; French, C.; Machado, F.R.; McIntyre, L.; Ostermann, M.; Prescott, H.C.; et al. Surviving Sepsis Campaign: International Guidelines for Management of Sepsis and Septic Shock 2021. Intensive Care Med. 2021, 47, 1181–1247. [Google Scholar] [CrossRef] [PubMed]
  42. Carlini, N.; Chien, S.; Nasr, M.; Song, S.; Terzis, A.; Tramèr, F. Membership Inference Attacks from First Principles. In 2022 IEEE Symposium on Security and Privacy (SP); IEEE: New York, NY, USA, 2022; pp. 1897–1914. [Google Scholar] [CrossRef]
  43. Van Calster, B.; McLernon, D.J.; van Smeden, M.; Wynants, L.; Steyerberg, E.W. Calibration: The Achilles Heel of Predictive Analytics. BMC Med. 2019, 17, 230. [Google Scholar] [CrossRef] [PubMed]
  44. Fleuren, L.M.; Klausch, T.L.T.; Zwager, C.L.; Schoonmade, L.J.; Guo, T.; Roggeveen, L.F.; Swart, E.L.; Girbes, A.R.J.; Thoral, P.; Ercole, A.; et al. Machine Learning for the Prediction of Sepsis: A Systematic Review and Meta-Analysis. Intensive Care Med. 2020, 46, 383–400. [Google Scholar] [CrossRef] [PubMed]
  45. Nemati, S.; Holder, A.; Razmi, F.; Stanley, M.D.; Clifford, G.D.; Buchman, T.G. An Interpretable Machine Learning Model for Accurate Prediction of Sepsis in the ICU. Sci. Transl. Med. 2018, 10, eaas9557. [Google Scholar] [CrossRef] [PubMed]
Scheme 1. Overview of the leakage-aware federated learning (FL) pipeline and evaluation workflow used in this study. Arrows indicate the workflow progression, and different colors distinguish the major functional stages of the pipeline.
Scheme 2. Timeline for the PhysioNet/CinC 2019 label shift and horizon definition. Arrows indicate the temporal progression from the observation window to the shifted label transition and prediction horizon, and the shaded blocks mark the corresponding time intervals.
Scheme 3. Fixed fraction α of monitored windows that trigger an alarm. Fixed alert-rate thresholding on the validation set and lead-time evaluation at the ICU-stay level. Alert-rate α is defined as the proportion of monitored windows in the monitored stream that trigger an alarm. Arrows indicate the sequence of threshold selection, threshold freezing, first-alert determination, and stay-level lead-time computation.
Scheme 4. Fixed alert-rate operational rule and stay-level lead-time computation. The validation threshold τ_α is selected by quantile thresholding to match an alert workload α and is then frozen for test evaluation. For stay-level timeliness, we apply a first-crossing rule (at most one pre-onset alert per stay). Alert-rate α is defined as the proportion of monitored windows in the monitored stream that trigger an alarm. Arrows indicate the sequence from validation-based threshold selection to first-threshold crossing and stay-level lead-time computation.
Figure 1. Horizon sensitivity heatmap (AUPRC; area under the precision–recall curve) for PhysioNet/CinC 2019 Stage-1 cross-hospital evaluation. Heatmaps report AUPRC across prediction horizons (H = 3, 6, 12 h) for key model/learning settings under (a) A-test and (b) B-test evaluation, using a shared color scale (updated for improved text–background contrast).
Figure 2. Stage-2 leakage sanity check on MIMIC-IV. (a) Performance under proper (stay-level) versus leaky (row-random) splitting with AUROC and AUPRC (95% CI). (b) Contamination sweep showing AUROC and AUPRC as mean ± std over three seeds across leak fractions.
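The proper versus leaky splitting contrasted in Figure 2a can be sketched as follows. This is a minimal numpy-only illustration; `stay_level_split` and `row_random_split` are hypothetical names standing in for the paper's actual pipeline.

```python
import numpy as np

def stay_level_split(stay_ids, test_frac=0.1, seed=0):
    """Proper split: assign whole ICU stays to train or test, so no
    stay contributes windows to both sides (prevents identity leakage)."""
    rng = np.random.RandomState(seed)
    stays = np.unique(stay_ids)
    test_stays = set(rng.choice(stays, size=int(len(stays) * test_frac),
                                replace=False))
    test_mask = np.array([s in test_stays for s in stay_ids])
    return ~test_mask, test_mask  # train_mask, test_mask

def row_random_split(n_rows, test_frac=0.1, seed=0):
    """Leaky split: windows are shuffled individually, so adjacent,
    highly correlated windows from one stay can straddle the boundary
    and inflate apparent test performance."""
    rng = np.random.RandomState(seed)
    test_mask = rng.rand(n_rows) < test_frac
    return ~test_mask, test_mask

# a stay-level split never places the same stay on both sides
stay_ids = np.repeat(np.arange(100), 10)   # 100 stays x 10 windows each
tr, te = stay_level_split(stay_ids, 0.1)
assert not set(stay_ids[tr]) & set(stay_ids[te])
```

Under the row-random variant, the same stay almost always appears on both sides of the boundary, which is exactly the contamination the sanity check is designed to expose.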
Figure 6. Stage-2 leakage contamination stress test on MIMIC-IV (SOFA-proxy label): effect of leakage contamination as test-stay windows leak into training (mean ± std over three seeds). Results use a centralized XGBoost classifier with T = 6 h and H = 6 h on the SOFA-proxy label.
Figure 7. Calibration and decision-curve analysis for the three centralized baseline models on the target-hospital test set. Panel (a) shows binned probability calibration, and the lower subpanel shows the distribution of predicted probabilities, illustrating the concentration of samples in the low-risk region under outcome imbalance. Panel (b) presents decision-curve analysis across threshold probabilities from 0.01 to 0.50 as an additional utility diagnostic, complementing the discrimination results reported in Table 4 and Figure 3. Colored lines denote different baseline models, and the dashed reference lines indicate the calibration and default-policy baselines used for interpretation.
Figure 3. Stage-2 (MIMIC-IV) test AUROC and AUPRC under the proper stay-level split (pre-onset evaluation subset; 8743 evaluation windows; T = 6 h, H = 6 h). Point estimates correspond to the values reported in Table 4, and stay-level cluster bootstrap 95% confidence intervals are shown where available in the corresponding summary table.
Figure 4. Stage-2 (MIMIC-IV): stay-level lead-time CDF at a fixed alert workload (α = 5%), where τ is selected on the validation set and then frozen for test evaluation. Vertical dashed lines indicate 24 h and 48 h. Detection requires at least one alert before sepsis onset. Actionability is summarized by capped timeliness over detected sepsis-positive stays (Table 5b). Legend counts (n) denote detected sepsis-positive stays plotted for each model in the monitored test stream.
Figure 5. Partition sensitivity at the fixed alert-rate (5%) operating point in Stage-2 (full SOFA). Cluster bootstrap 95% confidence intervals are shown for (a) AUROC and (b) AUPRC across centralized baselines and SimFL FedAvg-LogReg under admission-type and careunit client definitions. Abbreviations: SimFL, simulated federated learning; FedAvg-LogReg, FedAvg with logistic regression.
Table 1. Cohort and evaluation design summary for the two-stage evaluation.
| Stage | Dataset | Records | Split Unit | Client/Domain | Evaluation Target |
|---|---|---|---|---|---|
| Stage-1 | PhysioNet/CinC 2019 (https://doi.org/10.13026/v64v-d857) | Training sets A + B: 40,336 patients (A: 20,336; B: 20,000) | Patient (group split) | Hospital systems (A/B) | Fixed-horizon benchmark (main) |
| Stage-2 | MIMIC-IV v3.1 (https://doi.org/10.13026/kpb9-mt58) | Total: 36,193 stays; 87,460 windows (T = 6 h). Test: 3620 stays; 8743 windows (n_eval). Monitored test stream: 4022 stays; 9927 windows (n_monitored). | Stay (group split) | Care unit (multi-client) + pooled-1 (single-client control) | Operational evaluation (fixed alert-rate α = 5%) |
Note: In Stage-2, leakage-controlled eligible pre-onset evaluation windows are used for discrimination metrics (AUROC/AUPRC), whereas the monitored test stream contains all monitored windows used for realized alert burden and may include post-onset periods.
Table 2. Stage-2 cohort statistics and leakage-controlled evaluation-window counts (MIMIC-IV; observation window length T = 6 h; prediction horizon H = 6 h). Window counts reflect the (subsampled) hourly window stream used in this study.
| Split | ICU Stays (n) | Evaluation Windows n_eval (n) | Positive Evaluation Windows (n) | Prevalence |
|---|---|---|---|---|
| Train | 28,954 | 70,012 | 3387 | 4.84% |
| Validation | 3619 | 8705 | 486 | 5.58% |
| Test | 3620 | 8743 | 500 | 5.72% |
| Total | 36,193 | 87,460 | 4373 | 5.00% |
Note: T is the observation window length, and H is the prediction horizon. “Evaluation windows” are leakage-controlled, pre-onset windows used for discrimination evaluation; windows ending at or after sepsis onset are excluded to prevent post-onset leakage. “Prevalence” is computed per window. For operational alert-burden analysis, we additionally monitor the full test-window stream under the fixed alert-rate policy (9927 windows across 4022 ICU stays; 187 sepsis-positive stays), which includes windows excluded from the leakage-controlled evaluation set. For completeness, the leakage-controlled evaluation subset contains 110 eligible sepsis-positive test stays (computed as unique stays with at least one positive evaluation window). In the test evaluation subset, 3620 ICU stays correspond to 3620 hospital admissions (hadm_id) and 3553 unique patients (subject_id).
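The leakage-controlled eligibility rule described in the note above can be sketched as follows. This is a minimal illustration under assumed semantics (a window ending at time t is evaluable only if t precedes onset, and is labeled positive when onset falls within the next H hours); `label_evaluation_windows` is a hypothetical helper, not the paper's code.

```python
import numpy as np

def label_evaluation_windows(window_end, onset, H=6.0):
    """Label hourly windows for the leakage-controlled evaluation set.
    A window ending at time t is eligible only if t is strictly before
    onset (windows ending at or after onset are excluded to prevent
    post-onset leakage); an eligible window is positive when onset falls
    within the next H hours. onset=inf encodes a non-sepsis stay."""
    window_end = np.asarray(window_end, dtype=float)
    eligible = window_end < onset
    positive = eligible & ((onset - window_end) <= H)
    return eligible, positive

# one stay with onset at hour 11 and H = 6 h:
elig, pos = label_evaluation_windows([1, 2, 3, 10, 12], onset=11.0, H=6.0)
# the window ending at t = 12 is post-onset and excluded;
# only the window ending at t = 10 falls inside the 6 h horizon
```

Per-window prevalence in Table 2 then follows as `positive.sum() / eligible.sum()` over the eligible stream.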
Table 3. Main PhysioNet 2019 benchmark results for H = 6 h. AUROC/AUPRC are reported with bootstrap 95% confidence intervals. H is defined with respect to the shifted onset time per the official PhysioNet/CinC 2019 benchmark.
| Method | A-Test AUROC (95% CI) | A-Test AUPRC (95% CI) | B-Test AUROC (95% CI) | B-Test AUPRC (95% CI) |
|---|---|---|---|---|
| LSTM Local(A) | 0.769 [0.726, 0.806] | 0.255 [0.188, 0.331] | 0.715 [0.667, 0.766] | 0.121 [0.083, 0.190] |
| LSTM Local(B) | 0.712 [0.670, 0.756] | 0.193 [0.150, 0.253] | 0.810 [0.766, 0.851] | 0.170 [0.124, 0.247] |
| LSTM-FedAvg(r10) | 0.738 [0.697, 0.780] | 0.185 [0.143, 0.243] | 0.774 [0.724, 0.824] | 0.207 [0.144, 0.315] |
| LSTM FedProx(r8) | 0.736 [0.694, 0.777] | 0.207 [0.159, 0.271] | 0.779 [0.736, 0.825] | 0.171 [0.113, 0.247] |
| TF Local(A) | 0.783 [0.742, 0.821] | 0.300 [0.231, 0.376] | 0.818 [0.778, 0.861] | 0.199 [0.144, 0.294] |
| TF Local(B) | 0.666 [0.628, 0.704] | 0.121 [0.099, 0.161] | 0.806 [0.752, 0.854] | 0.219 [0.155, 0.316] |
| TF-FedAvg(r4) | 0.748 [0.705, 0.785] | 0.226 [0.178, 0.303] | 0.850 [0.806, 0.887] | 0.312 [0.225, 0.421] |
Note: AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision–recall curve; CI, confidence interval. H denotes the prediction horizon (relative to the shifted onset definition in PhysioNet/CinC 2019).
Table 4. Stage-2 MIMIC-IV early warning results for the main task (full-SOFA features; Sepsis-3 proxy label; T = 6 h, H = 6 h). AUROC/AUPRC are reported on the test set.
| Setting | Training | Test AUROC | Test AUPRC |
|---|---|---|---|
| Centralized-XGB | Centralized | 0.6592 | 0.1443 |
| Centralized-RF | Centralized | 0.6646 | 0.0899 |
| Centralized-LogReg | Centralized | 0.6693 | 0.1174 |
| SimFL FedAvg-LogReg (client = careunit) | SimFL | 0.6150 | 0.1160 |
| SimFL FedAvg-LogReg (client = admission-type) | SimFL | 0.6319 | 0.1230 |
| Pooled-1 FedAvg-LogReg (client = pooled-1) | Pooled | 0.5802 | 0.0708 |
Note: AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision–recall curve. T denotes the observation window length, and H denotes the prediction horizon.
Table 6. (a) Contamination sweep results (mean ± std over three random seeds). Setting: centralized XGBoost; SOFA-proxy label; T = 6 h; H = 6 h. (b) Proper stay-level split results (stay-level cluster bootstrap 95% CI) for Centralized-LogReg and SimFL FedAvg-LogReg under careunit and admission-type client partitions, corresponding to the main full-SOFA task reported in Table 4. (Pre-onset evaluation windows; n_eval = 8743.) Point estimates match the corresponding LogReg/SimFL entries in Table 4.
(a) Contamination sweep (mean ± std over three seeds)

| Leak Fraction | AUROC (Mean ± std) | AUPRC (Mean ± std) |
|---|---|---|
| 0.00 | 0.6410 ± 0.0063 | 0.0888 ± 0.0018 |
| 0.05 | 0.6313 ± 0.0265 | 0.0880 ± 0.0058 |
| 0.10 | 0.6474 ± 0.0113 | 0.0925 ± 0.0073 |
| 0.30 | 0.6493 ± 0.0166 | 0.0916 ± 0.0032 |
| 0.50 | 0.6514 ± 0.0081 | 0.0984 ± 0.0034 |

(b) Proper stay-level split (95% stay-level cluster bootstrap CI)

| Method | Test AUROC [95% CI] | Test AUPRC [95% CI] | n Test Windows | n Test Stays |
|---|---|---|---|---|
| Centralized-LogReg | 0.6693 [0.6203, 0.7161] | 0.1174 [0.0854, 0.1569] | 8743 | 3620 |
| SimFL FedAvg-LogReg (client = careunit) | 0.6150 [0.5612, 0.6649] | 0.1160 [0.0816, 0.1596] | 8743 | 3620 |
| SimFL FedAvg-LogReg (client = admission-type) | 0.6319 [0.5798, 0.6816] | 0.1230 [0.0851, 0.1688] | 8743 | 3620 |

Note: In (a), values are reported as mean ± standard deviation over three random seeds. In (b), brackets denote 95% stay-level cluster bootstrap confidence intervals.
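The stay-level cluster bootstrap behind the intervals in Table 6b can be sketched as follows. This is a minimal illustration in which whole ICU stays are resampled with replacement so that within-stay correlation is preserved; the rank-based AUROC here is a stand-in for whatever metric implementation the paper actually uses.

```python
import numpy as np

def auroc(y_true, y_score):
    """Rank-based AUROC (equivalent to the Mann-Whitney U statistic;
    assumes effectively continuous scores, so ties are not averaged)."""
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def cluster_bootstrap_ci(stay_ids, y_true, y_score, metric,
                         n_boot=1000, seed=0, level=0.95):
    """Stay-level cluster bootstrap: resample whole ICU stays with
    replacement (keeping each stay's windows together), recompute the
    metric on each resample, and report percentile confidence bounds."""
    rng = np.random.RandomState(seed)
    stays = np.unique(stay_ids)
    by_stay = {s: np.flatnonzero(stay_ids == s) for s in stays}
    stats = []
    for _ in range(n_boot):
        resampled = rng.choice(stays, size=len(stays))  # with replacement
        idx = np.concatenate([by_stay[s] for s in resampled])
        if len(np.unique(y_true[idx])) < 2:
            continue  # skip degenerate resamples containing a single class
        stats.append(metric(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * (1 - level) / 2,
                                   100 * (1 + level) / 2])
    return float(lo), float(hi)
```

Resampling stays rather than windows is what makes the intervals "cluster" bootstrap intervals: windows from the same stay always enter or leave a resample together.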
Table 7. Compact privacy/security summary on a 100,000-window sample. Extended per-setting attack/defense sweep details are provided in Supplementary File (Supplementary Table S2 and Supplementary Figure S1). Audit subset definition: 100,000 windows sampled from the Stage-2 monitored stream (including post-onset windows) across train/validation/test.
| Setting | Test AUROC | Test AUPRC | MIA AUC (Attack Model) | MIA AUC (Loss Score) |
|---|---|---|---|---|
| Pooled LogReg (central) | 0.5571 | 0.0598 | 0.4979 | 0.5069 |
| Careunit LogReg (FedAvg) | 0.4873 | 0.0625 | 0.4942 | 0.4973 |
| FedAvg + clip (C = 1.0), σ = 0.01 | 0.5450 | 0.0594 | 0.4924 | 0.4987 |
| FedAvg poisoning (scale = 20) | 0.4467 | 0.0490 | n/a | n/a |
Note: LogReg, logistic regression; FedAvg, federated averaging; MIA, membership inference attack; AUC, area under the receiver operating characteristic curve. σ denotes the Gaussian noise standard deviation; clip (C) denotes update clipping with bound C.
Table 5. (a) Stage-2 fixed alert-rate (5%) stay-level summary. Detection is computed among eligible sepsis-positive stays in the leakage-controlled pre-onset subset (n = 110). Detection requires at least one alert before sepsis onset. Lead time is not capped by horizon H under the fixed alert-rate alarm policy. Capped timeliness metrics are summarized in part (b) of this table. (b) Stage-2 fixed alert-rate (5%) capped timeliness metrics computed over detected sepsis-positive stays: P (L ≤ 24 h), P (L ≤ 48 h), and median (L|L ≤ 48 h).
(a)

| Model | Training | Detection Rate | Median Lead Time (h) | IQR (25–75, h) | Threshold τ (α = 5%) |
|---|---|---|---|---|---|
| XGB | Centralized | 54.8% | 44.4 | 19.7–105.9 | 0.632 |
| Pooled-1 LogReg | FedAvg | 37.2% | 33.7 | 8.1–77.8 | 0.240 |
| Careunit LogReg | FedAvg | 31.4% | 23.5 | 4.5–62.2 | 0.999 |

(b)

| Model | Training | P (L ≤ 24 h) | P (L ≤ 48 h) | Median (L \| L ≤ 48 h) (h) |
|---|---|---|---|---|
| XGB | Centralized | 34.0% | 52.4% | 21.4 |
| Pooled-1 LogReg | FedAvg | 40.0% | 60.0% | 14.1 |
| Careunit LogReg | FedAvg | 50.8% | 69.5% | 6.5 |
Note: α, target alert workload (proportion of monitored windows triggering an alarm); τ, threshold selected on the validation set to match α; L, lead time (hours) for detected sepsis-positive stays. P (L ≤ 24 h) and P (L ≤ 48 h) are computed over detected stays; median (L|L ≤ 48 h) reports the median among stays with lead time ≤ 48 h.
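The capped timeliness metrics in Table 5b can be sketched as follows. This is a minimal illustration over a hypothetical list of lead times for detected stays; `capped_timeliness` is our name for the summary, not the paper's code.

```python
import numpy as np

def capped_timeliness(lead_times_h):
    """Summarize lead times L (hours) over detected sepsis-positive
    stays: P(L <= 24 h), P(L <= 48 h), and the median of L among stays
    with L <= 48 h (the 'capped' median, which is robust to a few very
    long lead times inflating the uncapped median)."""
    L = np.asarray(lead_times_h, dtype=float)
    p24 = float(np.mean(L <= 24))
    p48 = float(np.mean(L <= 48))
    med48 = float(np.median(L[L <= 48])) if np.any(L <= 48) else float("nan")
    return p24, p48, med48

# hypothetical lead times (hours) for five detected stays
p24, p48, med48 = capped_timeliness([2, 10, 30, 50, 120])
# P(L<=24) = 0.4, P(L<=48) = 0.6, median over {2, 10, 30} = 10
```

Because all three quantities condition on detection, a model with a lower detection rate can still show better capped timeliness, which is exactly the trade-off visible between XGB and the Careunit LogReg rows.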

Share and Cite

MDPI and ACS Style

Jin, H.; Lee, H. Leakage-Aware Federated Learning for ICU Sepsis Early Warning: Fixed Alert-Rate Evaluation on PhysioNet/CinC 2019 and MIMIC-IV. Appl. Sci. 2026, 16, 2735. https://doi.org/10.3390/app16062735
