Electronics
  • Article
  • Open Access

13 November 2025

Deployable Machine Learning on MIMIC-IV: Leakage-Safe Prediction and Calibration for Incident Diabetes

Dipartimento di Matematica e Fisica, Università Cattolica del Sacro Cuore, via Garzetta 48, 25121 Brescia, Italy
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Advancements in Cross-Disciplinary AI: Theory and Application, 3rd Edition

Abstract

We present a leakage-safe, deployment-oriented pipeline for predicting incident diabetes within 365 days of an index hospital admission, using the MIMIC-IV (Medical Information Mart for Intensive Care) hospital module. The process enforces temporal and patient-level guards, separates model-specific preprocessing, and evaluates models with a calibration-first lens: reliability diagrams and Brier score, alongside discrimination (ROC/PR) and decision-analytic utility via decision curves. Two standard baselines—logistic regression and random forest—are tuned with inner cross-validation and calibrated post hoc (Platt or isotonic). On the held-out test set, calibrated logistic regression attains modest yet consistent performance (e.g., AUROC ≈ 0.70, AUPRC ≈ 0.31) and competitive net benefit across clinically plausible thresholds; the calibrated random forest is consistently lower. Global explanations (permutation importance; SHAP for logistic regression) yield clinically coherent reliance patterns (age, length of stay, sex, ethnicity). The primary contribution is procedural and implementation-oriented: a transparent, reproducible, leak-proof pipeline that surfaces calibrated risks, threshold-aware decision support, and auditability artifacts suitable for clinical review and site adaptation. We outline a practical path to external validation and lightweight site-specific recalibration, positioning the work as a reference implementation for reliable EHR risk stratification rather than an algorithmic innovation.

1. Introduction

Diabetes mellitus and cardiovascular disease (CVD) are leading contributors to morbidity and mortality and share overlapping pathophysiology and risk profiles [,]. Early identification of individuals at elevated risk—particularly at the time of hospital admission—can facilitate timely prevention and targeted follow-up. Electronic Health Records (EHRs) enable such risk stratification at scale. In particular, the MIMIC-IV hospital module provides structured demographics, diagnoses, procedures, laboratory tests, medications, microbiology, and administrative events suitable for building and evaluating prognostic models.
While prior work has explored increasingly complex architectures, clinical deployment often benefits from compact models that are transparent, calibration-aware, and easy to maintain. Moreover, downstream decisions consume probabilities at task-specific thresholds, making probability calibration and threshold-dependent utility analysis as important as raw discrimination. In this study we therefore focus on a leakage-safe, end-to-end pipeline for predicting incident diabetes within 365 days of an index admission, using two strong, interpretable baselines: logistic regression and random forest. The pipeline couples admission-time features with patient-level splitting, post hoc calibration (Platt and isotonic), decision curve analysis, and model explanations.
Our contributions are threefold. First, we develop and release a reproducible pipeline that enforces temporal and patient-level leakage guards, standardizes preprocessing, and attaches post hoc probability calibration and decision-analytic evaluation. Second, we benchmark calibrated logistic regression against a calibrated random forest on MIMIC-IV for the task of predicting incident diabetes at 365 days, reporting ROC/PR discrimination, reliability, and threshold-wise net benefit. Third, we provide transparent explanations via SHAP summaries for logistic regression and permutation importance analyses, highlighting clinically plausible risk factors.
Guided by the data and methods used here, we reformulate the research questions as follows.
RQ1: Within a leakage-safe pipeline using only information available at or before admission, which calibrated baseline—logistic regression or random forest—achieves superior discrimination for predicting incident diabetes at 365 days (AUROC/AUPRC)?
RQ2: To what extent does post hoc calibration improve probability quality and translate into higher net benefit across clinically plausible decision thresholds, as assessed by reliability diagrams and decision curve analysis?
RQ3: Which predictors most strongly influence risk estimates, and are these influences clinically coherent, as quantified by SHAP summaries for logistic regression and permutation importance for both models?
Beyond this introduction, Section 2 reviews relevant background. Section 3 describes the pipeline and presents the empirical results on MIMIC-IV. Section 4 interprets the findings and their clinical implications, and Section 5 presents the related work. Finally, Section 6 outlines limitations and future work.

2. Background

We formalize the leakage-safe prediction problem, discrimination and calibration metrics, decision-analytic evaluation, and the global explanation methods used in Section 3. Throughout, $\mathbb{1}\{\cdot\}$ denotes the indicator function (it returns 1 when the stated condition holds and 0 otherwise).

2.1. Leakage-Safe Problem Formulation

Let $U$ be the set of patients. For each $u \in U$, let $\tau_u$ denote the index (admission) time, and let $\{z_u(t)\}_{t \le \tau_u}$ be the multivariate EHR process up to $\tau_u$ (events, codes, labs, medications). Fix a prediction horizon $H > 0$ and define the incident outcome
\[ y_u = \mathbb{1}\{\text{incident diabetes occurs in } (\tau_u, \tau_u + H]\}. \]
A feature map $\Phi$ is temporal-guarded if it ignores any information after $\tau_u$. Formally, for any two trajectories with identical histories up to $\tau_u$,
\[ \{z_u(t)\}_{t \le \tau_u} = \{z'_u(t)\}_{t \le \tau_u} \;\Longrightarrow\; \Phi\big(\{z_u(t)\}_{t \le \tau_u}\big) = \Phi\big(\{z'_u(t)\}_{t \le \tau_u}\big). \]
We write $x_u = \Phi(\{z_u(t)\}_{t \le \tau_u}) \in \mathbb{R}^p$ for the feature vector.
Let $S_{\mathrm{tr}}, S_{\mathrm{val}}, S_{\mathrm{te}} \subset U$ be disjoint patient-level splits. Imputation $I$ and preprocessing $P$ operators are fit on training only,
\[ (\hat{I}, \hat{P}) \in \arg\min_{I, P} \; \mathcal{L}_{\mathrm{prep}}\big(\{(x_u, y_u) : u \in S_{\mathrm{tr}}\}\big), \]
and applied to validation/test without refitting,
\[ \tilde{x}_u = \hat{P}\big(\hat{I}(x_u)\big), \qquad u \in S_{\mathrm{val}} \cup S_{\mathrm{te}}. \]
Equations (2)–(4) jointly prevent temporal leakage (using only $t \le \tau_u$) and split leakage (no cross-partition fitting).
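To make the two guards concrete, the following minimal Python sketch (illustrative only, not the released pipeline; the column names subject_id and charttime are assumptions in the spirit of the MIMIC-IV tables) performs a disjoint 60/20/20 patient-level split and drops every event recorded after the index time.

```python
import numpy as np
import pandas as pd

def patient_level_split(subject_ids, seed=42):
    """Disjoint 60/20/20 train/val/test partitions over unique patients (S_tr, S_val, S_te)."""
    rng = np.random.default_rng(seed)
    ids = np.unique(np.asarray(subject_ids))
    rng.shuffle(ids)
    n_tr, n_val = int(0.6 * len(ids)), int(0.2 * len(ids))
    return ids[:n_tr], ids[n_tr:n_tr + n_val], ids[n_tr + n_val:]

def temporal_guard(events: pd.DataFrame, index_times: pd.Series) -> pd.DataFrame:
    """Keep only rows whose timestamp is at or before the patient's index time tau_u."""
    tau = events["subject_id"].map(index_times)  # index_times: Series indexed by subject_id
    return events[events["charttime"] <= tau]
```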

2.2. Predictors and Discrimination

A probabilistic predictor is a measurable map $f : \mathbb{R}^p \to [0,1]$ that produces raw scores $\hat{s}_u = f(\tilde{x}_u)$. When $f$ natively outputs probabilities we also write $\hat{p}_u = \hat{s}_u$. For a threshold $t \in [0,1]$ define the true- and false-positive rates
\[ \mathrm{TPR}(t) = \mathbb{P}(\hat{s}_u \ge t \mid y_u = 1), \]
\[ \mathrm{FPR}(t) = \mathbb{P}(\hat{s}_u \ge t \mid y_u = 0). \]
The ROC curve is $\{(\mathrm{FPR}(t), \mathrm{TPR}(t)) : t \in [0,1]\}$ and its area is
\[ \mathrm{AUROC} = \int_0^1 \mathrm{TPR}\big(\mathrm{FPR}^{-1}(x)\big)\, dx. \]
Precision and recall at threshold $t$ are
\[ \mathrm{Precision}(t) = \mathbb{P}(y_u = 1 \mid \hat{s}_u \ge t), \]
\[ \mathrm{Recall}(t) = \mathrm{TPR}(t). \]
The PR curve is $\{(\mathrm{Recall}(t), \mathrm{Precision}(t))\}$ with
\[ \mathrm{AUPRC} = \int_0^1 \mathrm{Precision}\big(\mathrm{Recall}^{-1}(r)\big)\, dr. \]
In practice, (7)–(10) are computed via trapezoidal integration (ROC) and average precision (PR).
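These estimators correspond to standard scikit-learn calls; a minimal sketch with toy labels and scores (not study results) is:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy labels and raw scores standing in for (y_u, s_u) on a held-out split.
y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1])
s_hat = np.array([0.10, 0.80, 0.30, 0.20, 0.60, 0.90, 0.40, 0.55])

auroc = roc_auc_score(y_true, s_hat)            # trapezoidal area under the ROC curve
auprc = average_precision_score(y_true, s_hat)  # average precision as the AUPRC estimate
print(f"AUROC = {auroc:.3f}, AUPRC = {auprc:.3f}")
```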

2.3. Probability Calibration and Reliability

We use post hoc calibration maps $g : [0,1] \to [0,1]$ to transform raw scores into calibrated probabilities $\tilde{p}_u = g(\hat{s}_u)$. Perfect calibration requires
\[ \mathbb{E}[y_u \mid \tilde{p}_u = p] = p \quad \text{for all } p \in [0,1]. \]
Probability quality is measured by the Brier score on a set $S$,
\[ \mathrm{Brier}(S) = \frac{1}{|S|} \sum_{u \in S} (\tilde{p}_u - y_u)^2, \]
with the understanding that we may also report the same quantity for raw $\hat{p}_u$ when $f$ is probabilistic. We consider two calibrators fitted on the validation set: Platt scaling,
\[ g_{\mathrm{platt}}(s) = \sigma(a s + b), \qquad \sigma(u) = \frac{1}{1 + e^{-u}}, \]
where $(a, b)$ minimize the negative log-likelihood on $\{(\hat{s}_u, y_u)\}_{u \in S_{\mathrm{val}}}$, and isotonic regression,
\[ g_{\mathrm{iso}} \in \arg\min_{g \in \mathcal{G}} \sum_{u \in S_{\mathrm{val}}} \big(g(\hat{s}_u) - y_u\big)^2, \]
where $\mathcal{G}$ is the set of nondecreasing step functions. We select
\[ g^\star \in \{g_{\mathrm{platt}}, g_{\mathrm{iso}}\} \quad \text{such that} \quad \mathrm{Brier}\big(g^\star \circ f;\, S_{\mathrm{val}}\big) = \min_{g \in \{g_{\mathrm{platt}}, g_{\mathrm{iso}}\}} \mathrm{Brier}\big(g \circ f;\, S_{\mathrm{val}}\big), \]
and apply $\tilde{p}_u = g^\star(\hat{s}_u)$ on the test set. Reliability diagrams visualize the empirical calibration function
\[ \hat{r}(p) = \mathbb{E}\big[y_u \mid q_u \in \mathrm{bin}(p)\big], \]
where $q_u$ denotes either raw probabilities $\hat{p}_u$ or calibrated probabilities $\tilde{p}_u$, depending on the plot (we use raw for precalibration and calibrated for postcalibration).
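A minimal sketch of the validation-side selection in Eq. (15), assuming raw scores s_val and labels y_val from the validation split (scikit-learn is used here for illustration; the released code may differ):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

def select_calibrator(s_val: np.ndarray, y_val: np.ndarray):
    """Fit Platt scaling and isotonic regression on validation scores and
    return the mapping with the lower validation Brier score (Eq. (15))."""
    # Platt scaling: a sigmoid of a*s + b, approximated by a near-unpenalized logistic fit.
    platt = LogisticRegression(C=1e6, max_iter=1000).fit(s_val.reshape(-1, 1), y_val)
    iso = IsotonicRegression(out_of_bounds="clip").fit(s_val, y_val)

    g_platt = lambda s: platt.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]
    g_iso = lambda s: iso.predict(np.asarray(s))

    brier_platt = brier_score_loss(y_val, g_platt(s_val))
    brier_iso = brier_score_loss(y_val, g_iso(s_val))
    return g_platt if brier_platt <= brier_iso else g_iso
```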

2.4. Decision-Analytic Evaluation

For a decision threshold $t \in (0,1)$ define the classified label
\[ \hat{y}_{u,t} = \mathbb{1}\{\tilde{p}_u \ge t\}. \]
With $N = |S_{\mathrm{te}}|$, true positives $\mathrm{TP}_t$ and false positives $\mathrm{FP}_t$ on test, the net benefit at $t$ ([]) is
\[ \mathrm{NB}(t) = \frac{\mathrm{TP}_t}{N} - \frac{\mathrm{FP}_t}{N} \cdot \frac{t}{1 - t}. \]
Curves $\mathrm{NB}(t)$ are compared to treat-all and treat-none references to summarize threshold-dependent clinical utility without fixing a single operating point.
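The decision curve itself is a direct translation of Eq. (18); a small sketch (not the authors' exact implementation) is:

```python
import numpy as np

def net_benefit(y_true, p_cal, t: float) -> float:
    """NB(t) = TP/N - FP/N * t/(1-t) for calibrated probabilities thresholded at t."""
    y_true, y_hat = np.asarray(y_true), (np.asarray(p_cal) >= t).astype(int)
    n = len(y_true)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    return tp / n - (fp / n) * t / (1.0 - t)

def net_benefit_treat_all(y_true, t: float) -> float:
    """Treat-all reference: every patient is classified positive."""
    prev = float(np.mean(y_true))
    return prev - (1.0 - prev) * t / (1.0 - t)

# A decision curve sweeps t over a grid, e.g.:
# curve = [net_benefit(y_te, p_te, t) for t in np.linspace(0.01, 0.30, 30)]
```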

2.5. Global Explanation: Permutation Importance and SHAP

Let $m(f; S)$ be a scalar performance functional (e.g., AUPRC) on a dataset $S$. For feature $j$, let $\Pi_j$ randomly permute column $j$ of the validation design matrix (holding all else fixed). The permutation importance of feature $j$ is the expected performance drop
\[ I_j^{\mathrm{perm}} = m\big(f;\, S_{\mathrm{val}}\big) - \mathbb{E}\big[m\big(f;\, \Pi_j(S_{\mathrm{val}})\big)\big], \]
approximated via Monte Carlo over permutations.
SHAP assigns an attribution vector $(\phi_{u,1}, \dots, \phi_{u,p})$ and a base value $\phi_{u,0}$ to each instance $u$, satisfying the Shapley additivity identity
\[ f(\tilde{x}_u) = \phi_{u,0} + \sum_{j=1}^{p} \phi_{u,j}. \]
Global importance is summarized by the mean absolute attribution
\[ G_j = \mathbb{E}_{u \in S_{\mathrm{val}}}\big[|\phi_{u,j}|\big], \qquad j = 1, \dots, p, \]
and visualized via summary and bar plots indicating magnitude and directionality.
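Both diagnostics map directly onto library calls; the sketch below uses synthetic data purely to show the shapes involved (the real analysis uses the tuned models and the validation design matrix), and assumes a recent shap release whose LinearExplainer accepts background data positionally.

```python
import numpy as np
import shap
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the fitted model and the preprocessed validation design matrix.
rng = np.random.default_rng(0)
X_val = rng.normal(size=(300, 5))
y_val = (X_val[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)
model = LogisticRegression().fit(X_val, y_val)

# Permutation importance: mean drop in average precision when one column is shuffled.
perm = permutation_importance(model, X_val, y_val, scoring="average_precision",
                              n_repeats=20, random_state=0)
print(perm.importances_mean)

# SHAP attributions for the linear model; G_j is the mean absolute attribution per feature.
explainer = shap.LinearExplainer(model, X_val)
phi = explainer.shap_values(X_val)
print(np.abs(phi).mean(axis=0))
```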

2.6. Reproducible Pipeline as Operator Composition

Let $\hat{\Theta} = (\hat{I}, \hat{P}, \hat{f}, \hat{g})$ denote the fitted operators. The predictor is learned on training data only,
\[ \hat{f} \in \arg\min_{f \in \mathcal{F}} \mathcal{L}_{\mathrm{train}}\big(f;\, \hat{P}(\hat{I}(X_{\mathrm{tr}})),\, y_{\mathrm{tr}}\big), \]
the calibrator $\hat{g}$ is selected on validation as in (15), and test-time probabilities are the composition
\[ \tilde{p}_u = \hat{g}\Big(\hat{f}\big(\hat{P}(\hat{I}(x_u))\big)\Big), \qquad u \in S_{\mathrm{te}}. \]
All evaluations in Section 3 (AUROC/AUPRC, reliability/Brier, and decision curves) are computed on (23), preserving the independence between fitting, model selection, and final assessment.
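In code, the composition in Eq. (23) can be read as a fit-once, apply-forward chain; a hedged sketch (the names X_tr, X_val, X_te, y_tr, y_val and select_calibrator come from the earlier sketches, not from the released repository):

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Fitted operators (I-hat, P-hat, f-hat) expressed as one scikit-learn Pipeline;
# the calibrator g-hat (Section 2.3) is selected on S_val and applied last.
lr_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),                              # I-hat
    ("scale", StandardScaler()),                                             # P-hat
    ("model", LogisticRegression(class_weight="balanced", max_iter=5000)),   # f-hat
])
# lr_pipeline.fit(X_tr, y_tr)                                                # fit on S_tr only
# g_star = select_calibrator(lr_pipeline.predict_proba(X_val)[:, 1], y_val)  # chosen on S_val
# p_tilde_test = g_star(lr_pipeline.predict_proba(X_te)[:, 1])               # Eq. (23) on S_te
```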

3. Model and Experiments

This study uses the MIMIC-IV database [,,], a large, freely accessible critical care resource containing anonymized demographics, laboratory results, prescriptions, diagnoses, procedures, microbiology, and administrative events from Beth Israel Deaconess Medical Center. We operated on the hospital module and used the tables patients, admissions, transfers, diagnoses_icd, procedures_icd, pharmacy, prescriptions, microbiologyevents, labevents, and the dictionaries d_icd_diagnoses, d_icd_procedures, d_labitems. The code used in the models and experiments is available at https://github.com/EBarbierato/Deployable-Machine-Learning-on-MIMIC-IV (accessed on 7 November 2025).
  • Patient-level split and leakage guards.
All preprocessing and modeling were conducted with strict temporal and subject-level guards. We first created patient-level partitions ($S_{\mathrm{tr}}/S_{\mathrm{val}}/S_{\mathrm{te}}$) in proportions 60/20/20% to avoid cross-partition contamination; no subject appears in more than one split. The index (admission) time $\tau_u$ was harmonized per patient, and all feature construction used only data with timestamps $t \le \tau_u$ (temporal guard), as formalized in Section 2. All learned transformers were fit on training only and applied forward to validation and test without refitting.

3.1. Feature Engineering and Temporal Filtering

We constructed admission-time features covering: demographics; admission characteristics and prior-encounter aggregates (e.g., cumulative length of stay before index); counts of diagnoses/procedures; summaries of common lab measurements; coarse medication-exposure indicators; and microbiology flags. Categorical variables were one-hot encoded (rare levels pooled). For numeric features we applied light outlier handling (winsorization at the 1st/99th percentiles) and $\log(1+x)$ transforms for heavy-tailed counts. Missingness was handled by SimpleImputer—mean for numeric features and most-frequent for categoricals—fit on training only and applied to validation/test.
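As an illustration of the outlier handling and count transform, a minimal sketch under the stated percentile assumptions (not the exact project code):

```python
import numpy as np
import pandas as pd

def winsorize_log1p(train_col: pd.Series, other_col: pd.Series):
    """Clip a heavy-tailed count at the training 1st/99th percentiles, then apply log(1 + x).
    The clipping bounds are learned on the training split only and reused unchanged elsewhere."""
    lo, hi = train_col.quantile(0.01), train_col.quantile(0.99)
    transform = lambda s: np.log1p(s.clip(lower=lo, upper=hi))
    return transform(train_col), transform(other_col)
```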
  • Model-Specific Preprocessing
Preprocessing consists of distinct pipelines: for logistic regression we standardized all numeric features with $\mu/\sigma$ computed on training only, then concatenated them with the one-hot encoded categoricals. For random forest we used the same imputed one-hot design but no scaling (tree-based, impurity-driven models do not require normalization). This separation avoids unnecessary transformations for RF while preserving well-conditioned inputs for LR.
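One way to express this split is a ColumnTransformer with a model-specific numeric branch; the sketch below assumes scikit-learn ≥ 1.1 (for min_frequency) and column lists num_cols/cat_cols defined elsewhere, and is not claimed to be the released code.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_preprocessor(numeric_cols, categorical_cols, scale_numeric: bool) -> ColumnTransformer:
    """Shared imputation for both models; standardization only for the LR branch."""
    numeric_steps = [("impute", SimpleImputer(strategy="mean"))]
    if scale_numeric:
        numeric_steps.append(("scale", StandardScaler()))
    categorical_steps = [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore", min_frequency=0.01)),  # pools rare levels
    ]
    return ColumnTransformer([
        ("num", Pipeline(numeric_steps), numeric_cols),
        ("cat", Pipeline(categorical_steps), categorical_cols),
    ])

# lr_prep = make_preprocessor(num_cols, cat_cols, scale_numeric=True)   # logistic regression
# rf_prep = make_preprocessor(num_cols, cat_cols, scale_numeric=False)  # random forest
```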

3.2. Model Selection, Training, and Calibration

We trained two single-task baselines for diab_incident_365d.
  • Hyperparameter tuning (nested on training).
To ensure a fair comparison and avoid under/overfitting, each model was selected by an inner cross-validation conducted on training data only (validation and test were held out until after selection):
  • Logistic Regression (LR). Search over $\lambda = 1/C \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}$, penalty = L2, solver = lbfgs/liblinear (chosen by convergence), with class_weight = balanced.
  • Random Forest (RF). Search over n_estimators $\in$ {200, 400, 600}, max_depth $\in$ {6, 8, 10, 12, None}, min_samples_leaf $\in$ {1, 2, 5}, max_features $\in$ {sqrt, log2, 0.5}, bootstrap $\in$ {True, False}, with class_weight = balanced_subsample.
For both models we used stratified 5-fold CV on $S_{\mathrm{tr}}$ with the selection metric set to AUROC (primary), recording AUPRC as a secondary criterion (tie-breaker). After selecting hyperparameters, we refit the model on $S_{\mathrm{tr}}$ and evaluated it on $S_{\mathrm{val}}$ to choose a probability calibrator.
  • Post hoc calibration and thresholding.
On $S_{\mathrm{val}}$ we fit Platt scaling and isotonic regression and selected the mapping that minimized the Brier score (Section 2). The chosen calibrator was then frozen and applied to $S_{\mathrm{te}}$. Threshold-dependent analyses (decision curves) are reported directly on calibrated probabilities without committing to a single operating point.
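A hedged sketch of this selection and calibration protocol (the grids mirror the text above; the variable names X_tr, y_tr and the select_calibrator helper come from the earlier sketches, not from the repository):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = {"auroc": "roc_auc", "auprc": "average_precision"}  # primary and tie-breaker metrics

lr_search = GridSearchCV(
    LogisticRegression(penalty="l2", class_weight="balanced", max_iter=5000),
    param_grid={"C": [1e4, 1e3, 1e2, 1e1, 1.0, 0.1]},  # C = 1/lambda over the stated lambda grid
    scoring=scoring, refit="auroc", cv=inner_cv,
)
rf_search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced_subsample", random_state=42, n_jobs=-1),
    param_grid={
        "n_estimators": [200, 400, 600],
        "max_depth": [6, 8, 10, 12, None],
        "min_samples_leaf": [1, 2, 5],
        "max_features": ["sqrt", "log2", 0.5],
        "bootstrap": [True, False],
    },
    scoring=scoring, refit="auroc", cv=inner_cv,
)
# lr_search.fit(X_tr, y_tr); rf_search.fit(X_tr, y_tr)   # inner CV on training data only
# Calibrator selection on S_val then follows the Section 2.3 sketch (Platt vs. isotonic by Brier score).
```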

3.3. Evaluation Figures

Test-set discrimination is summarized by the calibrated ROC overlay in Figure 1 and by the calibrated precision–recall overlay in Figure 2.
Figure 1. Calibrated ROC on the test set; LR shows higher TPR across FPR ranges compared to RF.
Figure 2. Calibrated precision–recall on the test set; LR maintains higher precision over most recall values in this imbalanced setting.
Decision-curve analysis indicates modest, threshold-dependent utility for both calibrated models. To avoid duplicating nearly identical curves, Figure 3 reports the delta net benefit
\[ \Delta\mathrm{NB}(t) = \mathrm{NB}_{\mathrm{LR}}(t) - \mathrm{NB}_{\mathrm{RF}}(t). \]
Figure 3. Delta net benefit on the test set, defined as $\Delta\mathrm{NB}(t) = \mathrm{NB}_{\mathrm{LR}}(t) - \mathrm{NB}_{\mathrm{RF}}(t)$. Positive values indicate thresholds where calibrated logistic regression offers higher clinical utility than calibrated random forest; negative values indicate the opposite. The panel is zoomed to $t \in [0, 0.30]$, the range most relevant for screening.
The delta stays close to zero across clinically plausible thresholds, with RF slightly higher at very low and mid thresholds (e.g., $t = 0.05$ and $t \in [0.20, 0.30]$) and LR competitive elsewhere. The small absolute magnitudes are consistent with a compact admission-time feature set and underscore that threshold selection should be guided by local costs and prevalence rather than assuming a uniform winner.
Probability quality appears in the validation reliability diagram before calibration (Figure 4) and in the test reliability diagrams after calibration (Figure 5 and Figure 6). Global feature reliance is reported as permutation importance for LR and RF (Figure 7 and Figure 8); model-agnostic explainability for the logistic model is shown via SHAP bar and summary plots (Figure 9 and Figure 10).
Figure 4. Validation reliability before calibration (LR); overconfidence at higher bins motivates post hoc calibration.
Figure 5. Test reliability after calibration (LR); closer adherence to the diagonal, especially in mid-probability bins.
Figure 6. Test reliability after calibration (RF); under-prediction improves, though dispersion remains at low probabilities.
Figure 7. Permutation importance on validation for LR (drop in average precision). Age and white ethnicity dominate, with smaller contributions from length of stay and sex.
Figure 8. Permutation importance on validation for RF; age and index length of stay lead, followed by ethnicity indicators.
Figure 9. SHAP (LR): mean absolute attributions; age carries the largest global impact, followed by white ethnicity and sex.
Figure 10. SHAP summary (LR): higher age values push risk upward; other features exert smaller, bidirectional effects consistent with permutation ranking.

3.4. Pipeline Overview

  • Operational Summary
Raw CSV extracts are converted to Parquet; timestamps define index admissions. A patient-level cohort is assembled and labels for diab_incident_365d are generated with a look-ahead window, enforcing temporal guards at cohorting and feature extraction so that only pre-index information enters the design matrix. Features are imputed and (for LR only) standardized; the dataset is split into disjoint train/val/test partitions at the patient level. Models are tuned via inner CV on training only, calibrated on validation, and evaluated on the untouched test set via ROC/PR, decision curves, and reliability diagrams. Model reliance combines permutation importance (both models) and SHAP (LR) on validation samples. The pipeline overview is shown in Figure 11.
Figure 11. End-to-end pipeline with temporal guards before cohorting and feature aggregation, and a patient-level split guard prior to partitioning. Calibrated probabilities feed ROC/PR metrics and decision-curve analysis; permutation importance and SHAP summarize reliance.
This is a single-center retrospective analysis; we therefore emphasize calibration quality and decision utility alongside discrimination and provide a deployment-oriented baseline under strict leakage control. External validation and richer temporal representations are left to future work (Section 6).

4. Discussion

Taken together, the figures provide a consistent view across discrimination, calibration, clinical utility, and model reliance under a leakage-safe pipeline with tuned baselines. After inner cross-validation on training data and post hoc calibration on validation, the calibrated logistic regression dominates the calibrated random forest on the held-out test set in both ROC and precision–recall space (Figure 1 and Figure 2). In an imbalanced endpoint such as incident diabetes at one year, the precision–recall overlay is especially informative and shows a persistent precision advantage for logistic regression over most recall levels.
Calibration materially improves probability quality. Before calibration, the logistic model exhibits overconfidence at higher bins on validation (Figure 4). Applying the selected mapping to test increases alignment with the identity line (Figure 5). The calibrated random forest also improves, but its predicted probabilities are concentrated in a narrower range (Figure 6), limiting granularity for downstream risk stratification. Because clinical workflows consume probabilities rather than ranks, these gains are consequential for thresholding and risk communication.
Decision-curve analysis integrates discrimination, prevalence, and threshold choice into net benefit as a function of the decision threshold. The per-model curves show that calibrated logistic regression achieves consistently higher net benefit across most clinically plausible thresholds, while calibrated random forest remains positive but generally lower.
Model understanding is concordant across permutation importance and SHAP for the logistic model. Age is the dominant predictor in both views, with additional reliance on index length of stay and selected demographic covariates (Figure 7, Figure 8, Figure 9 and Figure 10). The SHAP summary shows a monotone, clinically sensible pattern whereby higher age pushes risk upward, while other variables exert smaller, sometimes bidirectional, effects. These are statements of model reliance rather than causal claims, but they furnish an auditable trail to accompany performance and calibration diagnostics.
Several aspects delimit current utility and suggest improvements. The precalibration reliability for logistic regression departs from the identity line at higher probabilities (Figure 4), illustrating why uncalibrated scores should not be operationalized. Postcalibration, random-forest probabilities remain compressed (Figure 6), which hampers nuanced prioritization even when discrimination is acceptable. Net-benefit magnitudes are small (Figure 3).
With tuned baselines, strict leakage control, and explicit calibration and decision analysis, a compact calibrated logistic model already delivers stable, threshold-aware utility on MIMIC-IV. The modest absolute gains and residual calibration limitations set the agenda for external and temporal validation, longitudinal feature enrichment, and stronger calibrated baselines (e.g., gradient-boosted trees), rather than undermining the value of a deployable, auditable reference.

4.1. Interpreting Net Benefit in Practice

Net benefit $\mathrm{NB}(t)$, plotted against the decision threshold $t$, quantifies benefit-adjusted true positives per patient. A value $\mathrm{NB}(t) = 0.012$ means that, at threshold $t$, the model yields the equivalent of 1.2 additional true positives per 100 patients, after accounting for the relative harm of false positives implied by $t$. The threshold encodes the risk–benefit trade-off via
\[ \frac{\text{harm of FP}}{\text{benefit of TP}} = \frac{t}{1 - t}, \]
so, for example, $t = 0.20$ corresponds to judging a false alarm as one-quarter as harmful as a missed case. Two operational translations are useful: “benefit-adjusted TPs per 100” equals $100 \times \mathrm{NB}(t)$; and the net reduction in unnecessary interventions per 100 patients versus a treat-all policy is
\[ \mathrm{NRUI}_{100}(t) = 100 \times \frac{\mathrm{NB}_{\mathrm{model}}(t) - \mathrm{NB}_{\mathrm{treat\text{-}all}}(t)}{t/(1-t)}. \]
In Table 1, these quantities illustrate how calibrated models can deliver positive benefit at thresholds where a treat-all approach is counterproductive.
Table 1. Threshold-wise net benefit (test set) for calibrated models. Treat-none is zero at all thresholds.
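To make the two translations concrete, a short numeric sketch with illustrative values (the prevalence matches the cohort figure quoted later in Section 4, but the net-benefit value is the hypothetical 0.012 used in the worked example above, not a Table 1 entry):

```python
def nrui_per_100(nb_model: float, nb_treat_all: float, t: float) -> float:
    """Net reduction in unnecessary interventions per 100 patients versus treat-all."""
    return 100.0 * (nb_model - nb_treat_all) / (t / (1.0 - t))

# Illustrative numbers only: threshold t = 0.20, prevalence 0.19, model net benefit 0.012.
t, prevalence, nb_model = 0.20, 0.19, 0.012
nb_treat_all = prevalence - (1.0 - prevalence) * t / (1.0 - t)   # = -0.0125 (treat-all reference)
print(nrui_per_100(nb_model, nb_treat_all, t))                   # ≈ 9.8 avoided interventions per 100
```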

4.2. Implementation and Generalization in Clinical Settings

The workflow is designed to translate into a clinical decision support (CDS) service with minimal changes. On the data side, the same temporal guards and patient-level partitioning used in the study become pre-ingestion rules enforced in the hospital’s data platform: only events with timestamps at or before the current admission time are admitted to the feature builder, and the feature map is frozen and versioned to ensure parity with the validated model. Integration with the electronic health record can be achieved via standard interfaces (e.g., FHIR resources for demographics, encounters, observations, medications) and a lightweight inference endpoint that accepts a patient identifier and returns a calibrated probability along with the model version and a timestamp. Because downstream decisions consume probabilities, the calibrator selected on internal validation is exported and attached to the model so that the deployment returns calibrated risks rather than raw scores.
Generalization to new clinical settings follows a site-adaptation protocol that preserves the leakage-safe design while accommodating local prevalence and coding differences. The receiving site first reproduces the cohort and feature map from the site’s own tables and code systems, runs the trained model in “shadow mode” for a short prospective window, and evaluates discrimination, reliability, and net benefit against local labels without exposing outputs to clinicians. If reliability drifts, a lightweight recalibration step is performed on a site-specific validation split, leaving the core model weights unchanged; this preserves interpretability while aligning probabilities to local outcome frequencies. Thresholds are then selected using local decision-curve analysis to reflect resource constraints and clinical policy, and only after this prospective silent period and documentation of performance are alerts turned on.
Operationally, the service is instrumented for continuous monitoring of data drift, calibration drift, and outcome prevalence. The same reliability diagrams, Brier score, ROC–PR overlays, and decision curves used in the study are computed on a rolling window and logged alongside model and calibrator version identifiers. Any material change triggers a scheduled recalibration or a formal review. Governance artifacts—data cards for the cohort and feature space, model cards that report discrimination, calibration, decision-curve ranges, and limitations, and an audit log of versions and thresholds—are kept under change control to support clinical risk management. Clinician-facing displays present the calibrated probability with a short, plain-language explanation, a small set of global reliance cues (e.g., the top permutation-importance features), and links to guidance on follow-up actions defined by the local service line.
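As one possible shape for that rolling computation, the sketch below assumes a simple prediction log with hypothetical column names ('timestamp', 'p_cal', 'y'); it is not part of the published pipeline.

```python
import pandas as pd
from sklearn.metrics import brier_score_loss, roc_auc_score

def rolling_monitoring(log: pd.DataFrame, window: str = "30D", step: str = "7D") -> pd.DataFrame:
    """Rolling-window Brier score, AUROC, and outcome prevalence from a prediction log.
    Assumed columns: 'timestamp', 'p_cal' (calibrated probability), 'y' (observed label)."""
    log = log.set_index("timestamp").sort_index()
    rows = []
    for end in pd.date_range(log.index.min() + pd.Timedelta(window), log.index.max(), freq=step):
        win = log.loc[end - pd.Timedelta(window): end]
        if win["y"].nunique() < 2:
            continue  # AUROC is undefined when only one class is present in the window
        rows.append({
            "window_end": end,
            "brier": brier_score_loss(win["y"], win["p_cal"]),
            "auroc": roc_auc_score(win["y"], win["p_cal"]),
            "prevalence": win["y"].mean(),
        })
    return pd.DataFrame(rows)
```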
To contextualize our discrimination results on MIMIC-IV, we contrast them with recent EHR-based diabetes-risk studies. On our held-out test set, calibrated logistic regression achieved AUROC 0.705 and AUPRC 0.308 at a prevalence of $\pi \approx 0.19$. In a primary-care EHR cohort (age > 50 years) using 53 routinely collected variables, Stiglic et al. reported incident T2D prediction with AUROC 0.70–0.75 (external validations varying by site and covariate set) []. Early EHR prediction efforts for gestational diabetes (a distinct endpoint but informative for EHR screening difficulty) also commonly fall in the AUROC 0.70–0.80 band []. While AUPRC is not directly comparable across studies due to varying prevalence [], our AUPRC $\approx 0.31$ at $\pi \approx 0.19$ is consistent with reports in adult EHR settings where prevalence typically ranges from 10 to 25%. Taken together, these references situate our calibrated baselines within the expected range for compact, admission-time feature sets, and reinforce the need to couple discrimination with calibration and decision-analytic evaluation [,].

5. Related Work

Manco et al. [] introduce HEALER, a zone-based healthcare data-lake architecture (Transient Landing, Raw, Process, Refined, Consumption) implemented with Apache NiFi for ingestion and HDFS for storage, plus Python 3.13.5 components for waveform processing and NLP. A proof-of-concept on MIMIC-III demonstrates increasing ingestion throughput with larger batch sizes, identifies the processing layer as the bottleneck, and argues for governance/security layers and streaming ingestion as future work—highlighting the operational, pipeline-centric nature rather than modeling advances.
Prior studies often frame comorbidity prediction through multitask learning (MTL), where shared representations are learned across related targets. Benchmark work on clinical time series demonstrated consistent, albeit modest, gains for MTL over single-task models under standardised tasks and data from a single health system []. While such results motivate cross-task sharing, they less frequently emphasize probability calibration, leakage control at the patient level, and threshold-dependent decision utility—factors that are central when predictions are consumed as probabilities in clinical workflows.
A complementary line of research focuses on deployment-oriented, tightly scoped pipelines that favor transparent models, explicit leakage guards, and task-appropriate evaluation. For example, data-driven diagnostic support for pediatric appendicitis demonstrates how careful variable curation, interpretable modeling, and clinically salient metrics can reduce misdiagnoses and unnecessary procedures in a real decision context []. In parallel, systems-engineering studies of healthcare analytics characterize big-data workflows and performance using multiformalism to guide capacity planning and reliable operation of end-to-end pipelines []. These perspectives underscore that success in practice hinges not only on raw discrimination but also on stability, calibration, and operational fit.
Beyond modeling architectures, several strands explicitly address reporting quality, calibration, decision-analytic evaluation, and design considerations for thresholded use. Collins et al. [] outline TRIPOD-AI and PROBAST-AI, emphasizing transparent reporting and risk-of-bias assessment tailored to AI-based prediction models; this foregrounds calibration, data leakage prevention, and intended use. Van Calster et al. [] argue that miscalibration is the “Achilles’ heel” of predictive analytics, recommending routine assessment and correction, and cautioning against relying on discrimination alone. Vickers and Elkin [] formalize decision curve analysis (DCA), which compares net benefit across thresholds against treat-all and treat-none policies, aligning model evaluation with clinical trade-offs rather than single operating points. Whittle et al. [] extend sample-size calculations for threshold-based evaluation, showing how planned operating thresholds and prevalence shape the precision of performance estimates and decision metrics.
Riley et al. [] discuss a tutorial that bridges the gap between conventional model performance reporting and actual clinical usefulness. Specifically, it stresses that discrimination alone is insufficient and places calibration, decision thresholds, and net benefit at the center of evaluation. The paper provides practical guidance on reading reliability plots, decomposing performance, and linking metrics to decisions, which directly motivates our use of calibrated probabilities, reliability diagrams, and decision-curve analysis in Section 3.
Vickers et al. [] formalize net benefit and its inference, showing how to quantify clinical utility with confidence intervals and hypothesis tests. They clarify the interpretation of thresholds as harm–benefit trade-offs and how to compare models against treat-all/none strategies. This supports our threshold-wise reporting, the interpretation of net benefit magnitudes, and the recommendation to select operating points via decision curves rather than a single fixed cutoff.
Moreover, Corbin  et al. [] present a pragmatic framework for taking ML models from development to production in healthcare, covering integration, versioning, monitoring, and drift management. The emphasis on transparent artifacts, shadow deployment, calibration checks, and governance aligns with our deployment-oriented discussion: exporting the chosen calibrator with the model, monitoring calibration and prevalence over time, and performing lightweight site-specific recalibration prior to activation.
As per Table 2, this paper positions itself at that deployment-oriented intersection. We construct a leakage-safe, patient-level pipeline on MIMIC-IV hospital data; we train two single-task baselines (logistic regression and random forest) for incident diabetes at 365 days; we calibrate probabilities post hoc (Platt and isotonic) and evaluate with ROC/PR curves, reliability diagrams, and decision-curve analysis; and we summarize model reliance with permutation importance and SHAP for the calibrated baselines. In doing so, the work shows that, under strict temporal and subject-level guards with calibration and decision-analytic reporting, a compact calibrated baseline can offer competitive performance and clearer operational traceability on a widely used critical-care dataset. Relative to prior literature, our contribution is methodological and reporting-centric: we assemble an end-to-end, reproducible workflow that surfaces discrimination, calibration, decision utility, and model reliance in a single dossier suitable for clinical review, aligning explicitly with contemporary guidance on reporting, calibration, and threshold-based evaluation [,,,].
Table 2. Positioning relative to representative strands in the literature.
It should be noted that prior MTL studies underscore the potential of shared representation learning across related clinical tasks but often underreport calibration and threshold-dependent clinical utility. Our findings complement that literature by demonstrating that a calibrated, transparent baseline—with explicit decision-analytic reporting—can provide a strong, readily deployable reference. By tying the pipeline to emerging reporting guidance and decision-focused evaluation, we offer a pragmatic template for external validation and, where justified by incremental utility, the integration of more expressive models.

6. Conclusions and Future Work

This study delivers a leakage-safe, calibration-first pipeline for incident diabetes prediction on MIMIC-IV, with strict temporal and patient-level guards, model-specific preprocessing, tuned baselines, and post hoc calibration.
For RQ1 (which calibrated baseline discriminates best), the tuned and calibrated logistic regression consistently exceeded the tuned and calibrated random forest on the held-out test set. Calibrated ROC and PR overlays indicate higher AUROC/AUPRC for logistic regression across operating regions (Figure 1 and Figure 2), suggesting that in this admission–time, tabular setting, a compact linear model is a strong and parsimonious choice relative to a deeper ensemble.
For RQ2 (does calibration improve probability quality and translate to decision utility), reliability diagrams show that post hoc calibration materially improves probability accuracy. The logistic model aligns more closely with the identity line from validation (uncalibrated) to test (calibrated) (Figure 4 and Figure 5); the calibrated random forest improves as well, albeit with compressed probability ranges (Figure 6). Decision-curve analysis shows modest but non-negative net benefit across clinically plausible thresholds, with calibrated logistic regression dominating random forest over most of the range. These findings support calibrated, threshold-aware deployment when probabilities are consumed for triage.
For RQ3 (which predictors most influence risk and are they clinically coherent), explanations are concordant with established risk profiles. SHAP summaries for logistic regression and permutation importance for both models highlight age as the dominant contributor, with additional reliance on index length of stay, sex, and ethnicity indicators (Figure 7, Figure 8, Figure 9 and Figure 10). These are associative statements of model reliance rather than causal effects, but they provide transparent context to accompany discrimination, calibration, and decision utility.
The contribution is implementation and reproducibility rather than methodological innovation: we consolidate best practices—leakage control, calibration, decision curves, and transparent reliance summaries—into a single, documented workflow that returns calibrated probabilities ready for thresholded use in clinical decision support. All artifacts necessary for clinical review and governance (metrics, reliability, decision curves, model/explainer versions, and data/model cards) are surfaced to enable auditing, monitoring, and safe integration.
Limitations include single-center retrospective data, no external or temporal validation in this release, absence of groupwise fairness analyses, and limited longitudinal features. Future work will (i) perform multi-site external and temporal validation with lightweight recalibration; (ii) report groupwise discrimination, calibration, and net benefit; (iii) enrich features with longitudinal trajectories while preserving leakage guards; and (iv) benchmark calibrated yet still interpretable reference learners (e.g., gradient-boosted trees) under the same selection and calibration protocol. These steps aim to strengthen transportability and clinical usefulness while retaining the transparency required for adoption.

Author Contributions

Conceptualization, E.B.; methodology, E.B.; software, A.D.C. and E.B.; validation, E.B. and A.G.; formal analysis, A.D.C.; investigation, E.B.; resources, E.B.; data curation, E.B.; writing—original draft preparation, A.D.C.; writing—review and editing, E.B. and A.G.; visualization, A.D.C.; supervision, E.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The collection of patient information and creation of the research resource was reviewed by the Institutional Review Board at the Beth Israel Deaconess Medical Center, who granted a waiver of informed consent and approved the data sharing initiative.

Data Availability Statement

MIMIC IV datasets are available at https://physionet.org/content/mimiciv/3.1/, accessed on 1 November 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. American Diabetes Association Professional Practice Committee. 10. Cardiovascular Disease and Risk Management: Standards of Care in Diabetes—2024. Diabetes Care 2024, 47, S179–S218. [Google Scholar] [CrossRef] [PubMed]
  2. Low Wang, C.C.; Hess, C.N.; Hiatt, W.R.; Goldfine, A.B. Clinical Update: Cardiovascular Disease in Diabetes Mellitus—Atherosclerotic Cardiovascular Disease and Heart Failure in Type 2 Diabetes Mellitus: Mechanisms, Management, and Clinical Considerations. Circulation 2016, 133, 2459–2502. [Google Scholar] [CrossRef] [PubMed]
  3. Vickers, A.J.; Elkin, E.B. Decision Curve Analysis: A Novel Method for Evaluating Prediction Models. Med Decis. Mak. 2006, 26, 565–574. [Google Scholar] [CrossRef] [PubMed]
  4. Johnson, A.E.W.; Bulgarelli, L.; Pollard, T.; Gow, B.; Moody, B.; Horng, S.; Celi, L.A.; Mark, R. MIMIC-IV, version 3.1; RRID:SCR_007345, PhysioNet resource; PhysioNet: Cambridge, MA, USA, 2024. [CrossRef]
  5. Johnson, A.E.W.; Bulgarelli, L.; Shen, L.; Gayles, A.; Shammout, A.; Horng, S.; Pollard, T.J.; Hao, S.; Moody, B.; Gow, B.; et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 2023, 10, 1. [Google Scholar] [CrossRef] [PubMed]
  6. Goldberger, A.L.; Amaral, L.A.N.; Glass, L.; Hausdorff, J.M.; Ivanov, P.C.; Mark, R.G.; Mietus, J.E.; Moody, G.B.; Peng, C.K.; Stanley, H.E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 2000, 101, e215–e220. [Google Scholar] [CrossRef] [PubMed]
  7. Kopitar, L.; Kocbek, S.; Cilar, L.; Sheikh, A.; Stiglic, G. Early detection of type 2 diabetes mellitus using machine learning-based prediction models. Sci. Rep. 2020, 10, 11981. [Google Scholar] [CrossRef] [PubMed]
  8. Germaine, M.; O’Higgins, A.C.; Egan, B.; Healy, G. Label Accuracy in Electronic Health Records and Its Impact on Machine Learning Models for Early Prediction of Gestational Diabetes: A Three-Step Evaluation. JMIR Med. Inform. 2025, 13, e72938. [Google Scholar] [CrossRef] [PubMed]
  9. Riley, R.D.; Archer, L.; Snell, K.I.; Ensor, J.; Dhiman, P.; Martin, G.P.; Bonnett, L.J.; Collins, G.S. Evaluation of clinical prediction models (part 2): How to undertake an external validation study. BMJ 2024, 384, e074820. [Google Scholar] [CrossRef] [PubMed]
  10. Vickers, A.J.; Van Calster, B.; Steyerberg, E.W. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. Med Decis. Mak. 2021, 41, 780–791. [Google Scholar] [CrossRef] [PubMed]
  11. Manco, C.; Dolci, T.; Azzalini, F.; Barbierato, E.; Gribaudo, M.; Tanca, L. HEALER: A Data Lake Architecture for Healthcare. In Proceedings of the EDBT/ICDT Workshops, Ioannina, Greece, 28–31 March 2023. [Google Scholar]
  12. Harutyunyan, H.; Khachatrian, H.; Kale, D.C.; Ver Steeg, G.; Galstyan, A. Multitask learning and benchmarking with clinical time series data. Sci. Data 2019, 6, 96. [Google Scholar] [CrossRef] [PubMed]
  13. Maffezzoni, D.; Barbierato, E.; Gatti, A. Data-Driven Diagnostics for Pediatric Appendicitis: Machine Learning to Minimize Misdiagnoses and Unnecessary Surgeries. Future Internet 2025, 17, 147. [Google Scholar] [CrossRef]
  14. Covioli, T.; Dolci, T.; Azzalini, F.; Piantella, D.; Barbierato, E.; Gribaudo, M. Workflow characterization of a big data system model for healthcare through multiformalism. In European Workshop on Performance Engineering; Springer: Cham, Switzerland, 2023; pp. 279–293. [Google Scholar]
  15. Collins, G.S.; Dhiman, P.; Andaur Navarro, C.L.; Ma, J.; Hooft, L.; Reitsma, J.B.; Logullo, P.; Beam, A.L.; Peng, L.; Van Calster, B.; et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open 2021, 11, e048008. [Google Scholar] [CrossRef] [PubMed]
  16. Van Calster, B.; McLernon, D.J.; van Smeden, M.; Wynants, L.; Steyerberg, E.W. Calibration: The Achilles heel of predictive analytics. BMC Med. 2019, 17, 230. [Google Scholar] [CrossRef] [PubMed]
  17. Whittle, R.; Ensor, J.; Archer, L.; Collins, G.S.; Dhiman, P.; Denniston, A.; Alderman, J.; Legha, A.; van Smeden, M.; Moons, K.G.; et al. Extended Sample Size Calculations for Evaluation of Prediction Models Using a Threshold for Classification. arXiv 2024. [Google Scholar] [CrossRef] [PubMed]
  18. Corbin, C.K.; Maclay, R.; Acharya, A.; Mony, S.; Punnathanam, S.; Thapa, R.; Kotecha, N.; Shah, N.H.; Chen, J.H. DEPLOYR: A technical framework for deploying custom real-time machine learning models into the electronic medical record. J. Am. Med. Inform. Assoc. 2023, 30, 1532–1542. [Google Scholar] [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
