Article

Identifying Pre-Existing Diabetes at ICU Admission with Machine Learning on Public GOSSIS Data

by
Lily Popova Zhuhadar
Department of Analytics & Information Systems, Center for Applied Data Analytics, Western Kentucky University, 410 Regents Avenue, CHAN Building, Room 3049, Bowling Green, KY 42101, USA
Diabetology 2026, 7(5), 100; https://doi.org/10.3390/diabetology7050100
Submission received: 26 March 2026 / Revised: 19 May 2026 / Accepted: 19 May 2026 / Published: 21 May 2026

Abstract

Background: Pre-existing diabetes mellitus is prevalent among critically ill adults and can influence initial glycemic targets, therapeutic decisions, and early risk stratification in the intensive care unit (ICU). However, diabetes status may be distributed across heterogeneous electronic health record (EHR) sources and may be incomplete at the time of ICU admission, particularly for inter-facility transfers. Methods: Using the public WiDS Datathon 2021 tabular release derived from the Global Open-Source Severity of Illness Score (GOSSIS) initiative, we conducted a retrospective machine-learning benchmarking study for admission-time identification of documented diabetes status in ICU patients. Candidate predictors included demographics, admission characteristics, anthropometrics, day-1 physiologic and laboratory summaries, APACHE-related variables, comorbidity indicators, and site descriptors. We compared CatBoost, random forest, tuned XGBoost, tuned LightGBM, histogram-based gradient boosting, and a soft-voting ensemble combining XGBoost, LightGBM, and histogram-based gradient boosting. Because class imbalance was a central concern, the final workflow emphasized model-intrinsic class weighting and threshold-aware evaluation rather than synthetic oversampling. Results: In the primary leakage-mitigated random validation split, the voting ensemble achieved the highest overall balance, with AUROC 0.8539, precision 0.5671, recall 0.6690, and F1-score 0.6138. Tuned LightGBM was the most sensitivity-oriented individual model, achieving recall 0.7677 and AUROC 0.8537, although with lower precision and a less favorable Brier score. 
Ablation analyses clarified the source of this performance: removing leakage-prone and APACHE-related variables caused only modest decreases in discrimination, whereas the strict reduced model that also excluded glucose-like predictors produced a marked decline, with LightGBM AUROC falling to 0.7432 and the voting ensemble AUROC falling to 0.7448. These findings, together with SHAP analyses identifying day-1 glucose maximum, day-1 glucose minimum, BMI, age, hemoglobin, and related clinical variables as major contributors, indicate that glucose-related admission variables remained the dominant predictive signal. In grouped hospital validation, tuned LightGBM maintained recall of 0.7684 while AUROC decreased modestly to 0.8443, indicating preserved case detection under stricter site separation but reduced precision. Precision–recall analysis further showed that average precision decreased from 0.622 under random validation to 0.551 under grouped validation; at a high-sensitivity grouped-site operating point, a probability threshold of 0.4537 achieved recall of 0.8001 with precision of 0.4314. Calibration curves and Brier scores showed that predicted probabilities were imperfectly calibrated. Conclusions: Although the dominance of glucose-related predictors is clinically plausible for identifying documented diabetes status, early glycemic measurements in critically ill patients may also partly capture acute stress physiology, treatment-related effects, monitoring intensity, or other forms of acute dysglycemia rather than chronic diabetes status alone. Therefore, these findings support gradient-boosted and ensemble models as reproducible tools for ICU admission-time phenotyping of documented diabetes status, but the proposed system should be interpreted primarily as a screening-oriented phenotyping aid for chart review, cohort enrichment, or workflow support, not as a stand-alone diagnostic tool. 
Further external validation, recalibration, threshold selection matched to intended use, and clinical review are needed before deployment.

1. Introduction

Diabetes mellitus and dysglycemia are highly prevalent in critical care, where acute illness frequently disrupts glucose homeostasis through counter-regulatory hormone release, inflammatory signaling, medication exposure, nutritional support, and other iatrogenic factors. Both hyperglycemia and hypoglycemia, as well as increased glycemic variability, have been associated with adverse outcomes in critically ill populations, including infection, longer ICU stay, and increased mortality risk.
Contemporary evidence has shifted ICU glycemic management away from intensive glucose-normalization strategies after large multicenter randomized trials demonstrated net harm, including increased mortality and hypoglycemia risk, under intensive glucose targets compared with more moderate control [1]. Reflecting this evidence base, major consensus recommendations for most critically ill adults favor initiating insulin for persistent hyperglycemia, commonly at or above approximately 180 mg/dL, and then targeting a moderate glucose range centered around approximately 140–180 mg/dL rather than pursuing tighter normalization [2,3].
Accurate recognition of diabetes status at or near ICU admission is clinically meaningful because early hyperglycemia is etiologically heterogeneous. Elevated glucose may reflect documented pre-existing diabetes, previously unrecognized chronic dysglycemia, stress hyperglycemia caused by acute physiologic stress, medication-related hyperglycemia, nutrition-associated hyperglycemia, or treatment effects during the early ICU course. This distinction influences how clinicians interpret admission glucose values, balance insulin titration against hypoglycemia risk, and decide whether elevated inpatient glucose should trigger outpatient diabetes evaluation after discharge. In the present study, the target label is therefore interpreted as documented diabetes status available in the source records, not as de novo diagnosis of diabetes during ICU stay, and not as direct prediction of stress hyperglycemia or incident diabetes.
Increasing attention has been directed toward contextualizing acute hyperglycemia relative to baseline glycemic status. Measures such as the stress hyperglycemia ratio, commonly defined using admission glucose divided by an HbA1c-derived estimated average glucose, have been proposed as pragmatic markers of relative hyperglycemia and have been associated with illness severity and mortality across cohorts [4,5].
Related constructs, including the glycemic gap, have also been explored for risk stratification and outcome prediction in ICU populations [6]. Importantly, observational work suggests that the prognostic meaning of hyperglycemia and the optimal glucose exposure range may differ between patients with and without diabetes, reinforcing the importance of identifying baseline diabetes status when studying ICU glycemic control and safety tradeoffs [7].
Despite its importance, diabetes status at ICU admission is often incomplete, delayed, or fragmented across heterogeneous electronic health record (EHR) sources. Relevant evidence may be distributed across problem lists, billing or diagnosis codes, prior encounter histories, medication records, laboratory archives, and transfer documentation.
These sources may be unavailable when patients arrive from another facility, when longitudinal history is not accessible, or when documentation practices differ across hospitals. Under these real-world constraints, clinicians and researchers often need a computable method to infer likely documented diabetes status using structured variables available at or near admission, while acknowledging uncertainty from incomplete capture and heterogeneous documentation.
Computable phenotyping operationalizes clinical definitions using EHR data through rule-based algorithms, statistical models, or machine-learning approaches. Large collaborative efforts have disseminated phenotype definitions and workflows, including PheKB and eMERGE-derived algorithms, demonstrating the feasibility of diabetes identification using combinations of diagnosis codes, medications, and laboratory criteria [8,9]. However, phenotype portability across institutions remains challenging because of differences in coding practices, data models, missingness patterns, laboratory availability, and clinical workflows. Even when the same clinical definition is intended, implementation burden and site-specific customization can be substantial [10,11].
These challenges have motivated calls for clearer reporting of phenotype complexity, implementation requirements, validation design, and maintenance considerations so that end users can judge whether an algorithm is fit for purpose in their target setting and timeframe [12]. In parallel, reporting and appraisal frameworks for prediction modeling and artificial intelligence methods emphasize transparency in dataset construction, label definition, preprocessing order, feature selection, validation strategy, calibration, interpretability, and intended use case [13,14].
These principles are particularly important for ICU phenotyping, where apparent model performance can be inflated by leakage-prone variables, site identifiers, duplicated preprocessing across train and validation data, or validation designs that allow patients from the same hospital to appear in both training and testing folds. A clinically useful diabetes phenotype should therefore be evaluated not only by discrimination, but also by threshold-dependent performance, probability calibration, robustness to feature exclusions, and transportability across hospitals.
Public critical-care datasets have created important opportunities to develop reproducible phenotyping and prediction pipelines at scale. The WiDS Datathon 2021 tabular release, derived from the Global Open-Source Severity of Illness Score (GOSSIS) initiative, provides a large public multi-center ICU dataset suitable for benchmarking structured EHR machine-learning workflows [15,16]. In this study, we use the labeled WiDS 2021 training file to evaluate admission-time identification of documented diabetes status among adult ICU stays. The separate unlabeled challenge file is not used to fit preprocessing parameters, select thresholds, or estimate validation performance.
Prior work has demonstrated the broad utility of machine-learning and deep learning approaches for diabetes prediction, clinical disease classification, ICU outcome modeling, and healthcare operations. Recent reviews have highlighted the use of tree-based learners, neural networks, recurrent architectures, convolutional models, and Transformers across EHR, sensor, wearable, administrative, and operational datasets, while also emphasizing persistent limitations related to fragmented data sources, bias, privacy, validation, and reproducibility [17]. In contrast to these broader applications, the present study focuses specifically on admission-time phenotyping of documented diabetes status in a public multi-center ICU cohort. Rather than optimizing only discrimination, we evaluate leakage mitigation, feature-scenario robustness, grouped hospital transportability, precision–recall behavior under class imbalance, probability calibration, and SHAP-based interpretability.
A closely related GOSSIS-based study by Sánchez-Gómez et al. [18] applied an AdaBoost ensemble classifier to diabetes-related prediction in ICU patients using more than 90 structured clinical features, including demographic variables, vital signs, laboratory values, and comorbidities. Their model reported an AUROC of 0.83 and accuracy of 83.28%, suggesting good overall discrimination, but the threshold-dependent results showed an important imbalance between specificity and sensitivity, with high specificity of 93.95% and lower recall of 41.95%. The authors identified glucose, body mass index, age, creatinine, and bicarbonate as important predictors, supporting the clinical relevance of metabolic, renal, and demographic features in ICU diabetes-related modeling [18]. Their findings are consistent with the present study in showing that structured ICU variables can support machine-learning-based diabetes classification in public critical-care data. However, our study differs in objective and evaluation scope: rather than focusing on a single AdaBoost model, we benchmark multiple tree-based classifiers for admission-time phenotyping of documented diabetes status, evaluate leakage-mitigated feature scenarios, assess grouped hospital transportability, examine precision-recall behavior under class imbalance, evaluate probability calibration, and use SHAP-based interpretation to characterize model behavior.
We benchmarked several tree-based machine-learning classifiers selected for their suitability to structured tabular clinical data with nonlinear associations, mixed feature types, and clinically meaningful missingness patterns. The evaluated models included CatBoost, random forest, tuned XGBoost, tuned LightGBM, histogram-based gradient boosting, and a soft-voting ensemble combining tuned XGBoost, tuned LightGBM, and histogram-based gradient boosting [19,20,21,22].
The workflow emphasized split-aware preprocessing, including training-derived imputation, one-hot encoding based only on training categories, correlation pruning based only on the training matrix, removal of high-frequency one-hour variables, removal of explicit site identifiers from the model matrix, and harmonization of admission-source variables.
Because class imbalance is a central concern in diabetes phenotyping, the final workflow emphasized model-intrinsic class weighting and threshold-aware evaluation rather than synthetic oversampling. This choice is consistent with recent concerns that SMOTE-generated synthetic medical records may introduce clinically questionable examples and distort model interpretation, and with recent clinical machine-learning approaches that combine class weighting with threshold optimization to improve recall while limiting artifacts from oversampling [23].
Model performance was assessed using complementary fixed-threshold metrics, including accuracy, precision, recall, and F1-score; probability-based metrics, including AUROC and the Brier score; and precision–recall analysis, which is especially informative for evaluating binary classifiers under class imbalance [24,25,26,27]. To evaluate robustness and possible leakage dependence, the full modeling framework was repeated across four prespecified feature scenarios: full feature set, leakage-mitigated original, exclude all APACHE, and strict reduced model. This scenario-based ablation design was used to determine whether model performance depended on leakage-prone predictors, APACHE-derived variables, site-related structure, or glucose-like proxies, consistent with recommendations to detect and mitigate data leakage, assess predictor-related risk of bias, and transparently report validation and feature-handling decisions in clinical machine-learning prediction studies [13,28,29,30,31]. To estimate transportability to unseen clinical environments, we also performed grouped site-stratified validation using hospital identifiers so that no hospital appeared in both training and validation folds. Finally, calibration plots and Brier scores were used to assess whether the predicted probabilities were reliable as risk estimates, while SHAP-based interpretability was used to examine whether model behavior was clinically coherent, transparent, and explainable at both global and patient-specific levels [32,33,34,35,36,37,38].
Accordingly, the objective of this study was to develop and evaluate a leakage-aware, reproducible machine-learning benchmark for ICU admission-time phenotyping of documented diabetes status using the public WiDS 2021/GOSSIS release. By combining model benchmarking with ablation analysis, grouped hospital validation, calibration assessment, and SHAP-based interpretation, this work aims to clarify the performance, robustness, and implementation tradeoffs of machine-learning-based diabetes identification in public multi-center critical-care EHR data, consistent with recommended practices for benchmarking critical-care models, preventing leakage, transparently reporting clinical prediction models, assessing probability reliability, and explaining tree-based model behavior [14,30,33,36,39,40,41].

2. Materials and Methods

2.1. Experimental Setup and Environment

All computational analyses and modeling were conducted using Python 3.9.6. Data manipulation, cleaning, and structuring were performed using the pandas and numpy libraries. The core machine learning pipeline was constructed using scikit-learn, which provided the modules for data splitting, missing value imputation, and performance evaluation metrics. For predictive modeling, we used the gradient boosting frameworks CatBoost, LightGBM, and XGBoost, alongside random forest and histogram-based gradient boosting classifiers. Model interpretability and feature importance were assessed using the SHAP (SHapley Additive exPlanations) library, while interactive and static visualizations were rendered using plotly and matplotlib.

2.2. Data Source and Study Design

We conducted a retrospective machine-learning benchmarking study using the public WiDS Datathon 2021 tabular release derived from MIT’s Global Open-Source Severity of Illness Score (GOSSIS) initiative. The raw datasets used in this study were ingested into the computational environment using the pandas library. The primary dataset was segmented into a training cohort consisting of 130,157 patient records with 181 features, and an unlabeled testing cohort comprising 10,234 records with 180 features. This initial data ingestion phase provided a quick structural inspection of the data and served as the foundation for all subsequent preprocessing and predictive modeling [15,16].

2.3. Outcome Definition

The primary outcome was a binary indicator of pre-existing diabetes mellitus documented at or prior to ICU admission. This label reflects historical diabetes status as represented in the source database (e.g., comorbidity fields and/or admission documentation) rather than a de novo diagnosis made during the ICU stay. Accordingly, the modeling task is framed as phenotyping/identification of an existing condition, not detection of stress hyperglycemia or prediction of incident diabetes.

2.4. Candidate Predictors and Feature-Scenario Design

Candidate predictors were drawn from the public WiDS 2021 release and included demographic variables, admission characteristics, anthropometrics, day-1 physiologic and laboratory summaries, APACHE-related variables, comorbidity indicators, and site descriptors. The analysis is designed to address indirect leakage and site memorization by evaluating four prespecified feature scenarios:
  • Full feature set: routine cleaning only, with grouping identifiers removed from the model matrix.
  • Leakage-mitigated original: removal of variables most likely to encode label-adjacent information, including APACHE diagnosis fields and selected chronic comorbidity indicators, while retaining other clinically relevant predictors.
  • Exclude all APACHE: removal of all APACHE-related variables in addition to the leakage-mitigated exclusions.
  • Strict reduced model: a conservative predictor set that further excluded APACHE-related variables, site identifiers, and glucose-like predictors.

2.5. Data Preprocessing and Feature Engineering

To prepare the raw WiDS 2021/GOSSIS files for predictive modeling, we implemented a split-aware preprocessing and feature-engineering pipeline in which all data-driven transformations were learned only from the training portion of the labeled dataset.
The unlabeled challenge file was not used to fit imputers, define categorical encoding levels, derive engineered variables, or determine correlation-pruning rules. This prevented information from validation or challenge records from influencing the preprocessing parameters used during model development.
Formally, after initial row filtering, the adult labeled cohort was denoted by $D_{\mathrm{lab}} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the predictor vector for ICU stay $i$ and $y_i \in \{0, 1\}$ is the binary diabetes label. The labeled data were partitioned into an 80/20 stratified random split using a fixed seed of 40:
$$X_{\mathrm{train}},\ X_{\mathrm{val}},\ y_{\mathrm{train}},\ y_{\mathrm{val}} = \mathrm{StratifiedSplit}(X, y;\ \text{test\_size} = 0.20,\ \text{stratify} = y,\ \text{seed} = 40)$$
All preprocessing parameters were then estimated from the training split alone and applied without refitting to the validation and unlabeled challenge sets:
$$\Theta = \mathrm{fit}_{\mathrm{train}}(\mathrm{preprocessing}), \qquad X_{\mathrm{val}}^{*} = \mathrm{transform}_{\Theta}(X_{\mathrm{val}}), \qquad X_{\mathrm{test}}^{*} = \mathrm{transform}_{\Theta}(X_{\mathrm{test}})$$
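The fit-on-train, transform-everywhere pattern described above can be sketched as follows. This is a minimal illustration, not the study code: the hand-rolled `stratified_split`, the helper names `fit_mean_imputer` and `transform`, and the toy matrix are all assumptions introduced here, and mean imputation stands in for the full set of preprocessing parameters $\Theta$.

```python
import numpy as np

def stratified_split(y, test_size=0.20, seed=40):
    """Return (train_idx, val_idx) that preserve the class ratio of y."""
    rng = np.random.RandomState(seed)
    y = np.asarray(y)
    train_idx, val_idx = [], []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        n_val = int(round(test_size * len(idx)))
        val_idx.extend(idx[:n_val].tolist())
        train_idx.extend(idx[n_val:].tolist())
    return np.array(sorted(train_idx)), np.array(sorted(val_idx))

def fit_mean_imputer(X_train):
    """Theta: per-column means over observed (non-NaN) training values."""
    return np.nanmean(X_train, axis=0)

def transform(theta, X):
    """Fill NaNs with the training-derived means; nothing is refit on X."""
    X = X.copy()
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = theta[cols]
    return X

# Hypothetical toy data: 20 stays, 2 features, 25% positives.
X = np.arange(40, dtype=float).reshape(20, 2)
X[[0, 3, 7], 0] = np.nan
X[[5, 11], 1] = np.nan
y = np.array([1, 0, 0, 0] * 5)

train_idx, val_idx = stratified_split(y)
theta = fit_mean_imputer(X[train_idx])      # learned from training rows only
X_train_star = transform(theta, X[train_idx])
X_val_star = transform(theta, X[val_idx])   # reuses theta, no refitting
```

The key design point is that `theta` is computed once from the training rows and then reused unchanged, so validation records cannot influence the imputation values.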
The first row-level transformation restricted the analysis to adult ICU stays aged 16 years or older:
$$D_{\mathrm{adult}} = \{(x_i, y_i) \in D_{\mathrm{lab}} : \mathrm{age}_i \ge 16\}$$
Schema-level harmonization then standardized admission-source variables. Multiple hospital and ICU admission categories were recoded to a smaller and more consistent vocabulary to reduce spelling inconsistencies and ontology fragmentation. Missing hospital_admit_source values were filled with “Other,” and missing icu_admit_source values were backfilled from hospital_admit_source whenever that variable was available.
Logical consistency checks were next applied to all paired day-level summary variables ending in _min and _max. Whenever the recorded minimum exceeded the corresponding maximum, the two values were exchanged according to the following rule:
$$\text{if } x_{ij}^{\min} > x_{ij}^{\max}: \qquad (x_{ij}^{\min},\ x_{ij}^{\max}) := (x_{ij}^{\max},\ x_{ij}^{\min})$$
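A minimal pandas sketch of this swap rule is shown below. The function name `swap_inverted_min_max` and the toy column names are illustrative assumptions, not the study code; rows with a missing member of the pair are left untouched because comparisons against NaN evaluate to False.

```python
import pandas as pd

def swap_inverted_min_max(df):
    """Swap values wherever a *_min column exceeds its paired *_max column."""
    df = df.copy()
    for min_col in [c for c in df.columns if c.endswith("_min")]:
        max_col = min_col[: -len("_min")] + "_max"
        if max_col not in df.columns:
            continue
        inverted = df[min_col] > df[max_col]  # False when either value is NaN
        if inverted.any():
            df.loc[inverted, [min_col, max_col]] = (
                df.loc[inverted, [max_col, min_col]].to_numpy()
            )
    return df

# Hypothetical toy frame: the second row has its minimum and maximum inverted.
toy = pd.DataFrame({
    "d1_glucose_min": [90.0, 250.0, None],
    "d1_glucose_max": [180.0, 110.0, 140.0],
})
fixed = swap_inverted_min_max(toy)
```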
A limited engineered physiologic feature was added for the $P_i^{\mathrm{AO_2}} / F_i^{\mathrm{IO_2}}$ ratio. Let $R_i^{PF}$ denote d1_pao2fio2ratio_max, $P_i^{\mathrm{AO_2}}$ denote pao2_apache, and $F_i^{\mathrm{IO_2}}$ denote fio2_apache. When the day-1 ratio was missing, both source variables were observed, and the denominator was nonzero, the ratio was computed as:
$$R_i^{PF} = \frac{P_i^{\mathrm{AO_2}}}{F_i^{\mathrm{IO_2}}}, \qquad F_i^{\mathrm{IO_2}} \neq 0$$
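The three conditions on this fill (ratio missing, both sources observed, nonzero denominator) can be expressed as a single boolean mask. The sketch below uses the column names given in the text; the function name `fill_pf_ratio` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def fill_pf_ratio(df):
    """Fill a missing day-1 PF ratio from pao2_apache / fio2_apache."""
    df = df.copy()
    can_fill = (
        df["d1_pao2fio2ratio_max"].isna()
        & df["pao2_apache"].notna()
        & df["fio2_apache"].notna()
        & (df["fio2_apache"] != 0)
    )
    df.loc[can_fill, "d1_pao2fio2ratio_max"] = (
        df.loc[can_fill, "pao2_apache"] / df.loc[can_fill, "fio2_apache"]
    )
    return df

# Hypothetical rows: observed ratio, fillable, zero denominator, missing PaO2.
toy = pd.DataFrame({
    "d1_pao2fio2ratio_max": [300.0, np.nan, np.nan, np.nan],
    "pao2_apache": [90.0, 80.0, 75.0, np.nan],
    "fio2_apache": [0.3, 0.5, 0.0, 0.5],
})
filled = fill_pf_ratio(toy)
```

Only the second row is filled here; the observed ratio in the first row is never overwritten, and the last two rows stay missing for the general imputation step to handle.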
All h1_-prefixed variables were removed to eliminate extremely granular one-hour measures. The code also dropped near-duplicate APACHE variables whenever an equivalent day-1 representation was available, thereby prioritizing directly observed day-1 measurements over redundant APACHE summaries. Potentially non-predictive identifiers, including encounter_id, readmission_status, and the unnamed index column, were excluded from the predictor matrix. The grouping variables hospital_id and icu_id were retained separately for grouped validation and robustness analyses but were excluded from the model feature space under all feature scenarios.
Missing-data handling was fully split-aware. For continuous predictors, let $o_{ij} = 1$ indicate that the value was observed in the raw training-derived matrix and $o_{ij} = 0$ indicate that it was missing. Missing continuous values were replaced with the training-set mean:
$$\mu_j^{\mathrm{train}} = \frac{1}{n_{j,\mathrm{obs}}^{\mathrm{train}}} \sum_{i \in \mathrm{train}:\ o_{ij} = 1} x_{ij}$$
For categorical predictors, missing values were replaced by the most frequent training-set category. If $c_{ij}$ denotes the categorical value for record $i$ and categorical predictor $j$, and $m_j^{\mathrm{train}}$ denotes the training-set mode, the imputation rule was:
$$m_j^{\mathrm{train}} = \mathrm{mode}\{c_{ij} : i \in \mathrm{train},\ o_{ij} = 1\}$$
A more specific imputation rule was implemented for anthropometric variables. For height, weight, and body mass index (BMI), the pipeline first computed sex-specific means within the training split and then fell back to the overall training mean if the sex-stratified mean was unavailable. Let $g_i$ denote the recorded gender for patient $i$, let $A = \{\mathrm{height}, \mathrm{weight}, \mathrm{BMI}\}$, and let $a_{j g_i} = 1$ indicate that a training-derived gender-specific mean existed for anthropometric variable $j$ and gender group $g_i$:
$$\tilde{x}_{ij} = \begin{cases} x_{ij}, & o_{ij} = 1 \\ \mu_{j g_i}^{\mathrm{train}}, & o_{ij} = 0,\ a_{j g_i} = 1 \\ \mu_j^{\mathrm{train}}, & o_{ij} = 0,\ a_{j g_i} = 0 \end{cases} \qquad j \in A$$
This two-stage strategy preserved more physiologically plausible imputations for anthropometric variables than a single global mean. In addition, ethnicity and gender themselves were filled using training-derived modal values, with code fallback defaults of “Other/Unknown” and “M,” respectively, when no empirical mode was available.
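The two-stage anthropometric rule can be sketched in pandas as below. The column names (`gender`, `height`, `weight`, `bmi`), the helper names, and the toy values are assumptions for illustration; the point is that both the sex-specific and overall means come from the training frame and are merely looked up for other data.

```python
import numpy as np
import pandas as pd

ANTHRO = ["height", "weight", "bmi"]

def fit_anthro_means(train):
    """Learn sex-specific and overall means from the training split only."""
    return {
        "by_gender": train.groupby("gender")[ANTHRO].mean(),
        "overall": train[ANTHRO].mean(),
    }

def impute_anthro(df, params):
    df = df.copy()
    for col in ANTHRO:
        # Stage 1: gender-specific training mean (NaN if gender is missing
        # or was never seen with an observed value in training).
        by_gender = df["gender"].map(params["by_gender"][col])
        df[col] = df[col].fillna(by_gender)
        # Stage 2: overall training mean as the fallback.
        df[col] = df[col].fillna(params["overall"][col])
    return df

# Hypothetical toy data.
train = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "height": [160.0, 170.0, 180.0, np.nan],
    "weight": [60.0, 70.0, 80.0, 90.0],
    "bmi": [23.0, 24.0, 25.0, 26.0],
})
val = pd.DataFrame({
    "gender": ["F", "M", np.nan],
    "height": [np.nan, np.nan, np.nan],
    "weight": [65.0, np.nan, np.nan],
    "bmi": [np.nan, np.nan, np.nan],
})
params = fit_anthro_means(train)
val_imputed = impute_anthro(val, params)
```

In this toy case the third validation row has no recorded gender, so it receives the overall training means, while the first two rows receive the means of their own gender group.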
After imputation, categorical variables were one-hot encoded using only the levels observed in the training data. For a categorical predictor $x_j$ with training categories $C_j^{\mathrm{train}} = \{c_1, \ldots, c_K\}$, the dummy representation was defined as:
$$z_{ik}^{(j)} = \mathbb{1}(x_{ij} = c_k), \qquad c_k \in C_j^{\mathrm{train}}$$
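As a rough sketch of training-only encoding, the helpers below (hypothetical names `fit_categories` and `one_hot`, with an illustrative `icu_type` column) build one indicator column per training level. A level that appears only in validation data produces an all-zero row instead of a new column.

```python
import pandas as pd

def fit_categories(train, col):
    """Collect the category levels observed in the training split."""
    return sorted(train[col].dropna().unique())

def one_hot(df, col, categories):
    """One indicator column per training category c_k; unseen levels map to 0."""
    out = pd.DataFrame(index=df.index)
    for c in categories:
        out[f"{col}_{c}"] = (df[col] == c).astype(int)
    return out

# Hypothetical toy data: "Neuro ICU" never appears in training.
train = pd.DataFrame({"icu_type": ["MICU", "SICU", "MICU"]})
val = pd.DataFrame({"icu_type": ["SICU", "Neuro ICU"]})

levels = fit_categories(train, "icu_type")
train_z = one_hot(train, "icu_type", levels)
val_z = one_hot(val, "icu_type", levels)
```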
Validation and challenge matrices were then aligned to the resulting training-derived column set. To reduce redundancy and instability, the code computed pairwise absolute Pearson correlations on the training design matrix only. For features $a$ and $b$, the training-set Pearson correlation was:
$$r_{ab} = \frac{\sum_{i=1}^{n_{\mathrm{train}}} (x_{ia} - \bar{x}_a)(x_{ib} - \bar{x}_b)}{\sqrt{\sum_{i=1}^{n_{\mathrm{train}}} (x_{ia} - \bar{x}_a)^2}\ \sqrt{\sum_{i=1}^{n_{\mathrm{train}}} (x_{ib} - \bar{x}_b)^2}}$$
A feature was flagged for removal whenever it appeared as the second member of any upper-triangular feature pair whose absolute correlation exceeded the threshold of 0.80:
$$C_{\mathrm{drop}} = \{b : a < b,\ |r_{ab}| > 0.80\}$$
The same dropped columns were removed from the validation and challenge matrices. Finally, feature names were sanitized by replacing non-alphanumeric characters with underscores, and both validation and test matrices were realigned one final time to the exact training-derived column set. This sequence ensured a stable and identical feature space across all downstream models and validation scenarios. Refer to Table 1 and Table 2 for details.
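The upper-triangular pruning rule and the subsequent column alignment might look like the following in pandas. The function name `fit_correlation_drop` and the toy features are assumptions; the drop list is derived from the training matrix and then applied verbatim to other matrices.

```python
import numpy as np
import pandas as pd

def fit_correlation_drop(X_train, threshold=0.80):
    """Return columns b for which some earlier column a has |r_ab| > threshold."""
    corr = X_train.corr().abs()
    # Keep only entries strictly above the diagonal (row a before column b).
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Hypothetical toy features: f2 is nearly a multiple of f1; f3 is unrelated.
X_train = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 11.0],
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
})
X_val = pd.DataFrame({"f1": [6.0], "f2": [12.0], "f3": [2.0]})

to_drop = fit_correlation_drop(X_train)
X_train_pruned = X_train.drop(columns=to_drop)
# Apply the training-derived decision, then align to the training column set.
X_val_pruned = X_val.drop(columns=to_drop).reindex(
    columns=X_train_pruned.columns, fill_value=0
)
```

Because only the second member of each correlated pair is dropped, the result depends on column order; that is a property of the rule as stated, not of this sketch.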

2.6. Model Development

We compared five individual tree-based classifiers selected for their strong performance on structured tabular clinical data: CatBoost, random forest, tuned XGBoost, tuned LightGBM, and Histogram-based Gradient Boosting (HGB). These algorithms were trained on the final split-aware, one-hot encoded, correlation-pruned feature matrices described in Section 2.5. For each ICU stay $i$, the transformed predictor vector was denoted by $x_i$ and the binary outcome was $y_i \in \{0, 1\}$, where $y_i = 1$ indicated documented diabetes status. Each classifier learned a mapping from the final predictor space to an estimated probability of diabetes:
$$p_i^{(m)} = \Pr(y_i = 1 \mid x_i, M_m) = f_m(x_i) \tag{12}$$
where $M_m$ denotes the $m$-th trained model and $f_m$ is the model-specific probability function. Predicted class labels were obtained by applying a probability threshold $t$ to the estimated probabilities:
$$\hat{y}_i^{(m)}(t) = I\left(p_i^{(m)} \ge t\right), \qquad \text{with default } t = 0.50 \text{ unless otherwise specified} \tag{13}$$
This formulation allowed the analysis pipeline to evaluate both fixed-threshold classification performance and threshold-aware behavior in the later precision-recall analysis. The modeling strategy intentionally emphasized probability estimation, not only hard classification, because subsequent sections evaluate AUROC, Brier score, precision-recall behavior, and calibration.
CatBoost and random forest served as robust baseline models. CatBoost was implemented with 100 boosting iterations and a fixed random seed, while random forest was implemented with 100 trees, parallel processing, and the same study-level random state. The random forest model can be represented as an average over B independently trained decision trees:
$$p_i^{RF} = \frac{1}{B} \sum_{b=1}^{B} T_b(x_i), \qquad B = 100$$
where $T_b(x_i)$ is the probability estimate from the $b$-th tree. This averaging structure reduces variance relative to a single decision tree and provides a conservative benchmark against the boosting-based models.
The gradient-boosted tree models used additive ensembles in which each new tree updated the current prediction function. In general form, the boosted scoring function after K iterations can be written as:
$$F_K(x_i) = F_0(x_i) + \sum_{k=1}^{K} \eta\, h_k(x_i)$$
where $h_k$ is the $k$-th regression tree and $\eta$ is the learning rate. XGBoost and LightGBM represented tuned gradient-boosted tree approaches optimized for this benchmark. The tuned XGBoost configuration used 200 estimators, maximum depth 8, learning rate 0.05, subsample 0.8, minimum child weight 10, colsample_bytree = 0.8, a fixed random state, and log-loss evaluation. The tuned LightGBM configuration used gradient boosting decision trees with binary objective, AUC metric, learning rate 0.007, subsample 1.0, colsample_bytree = 0.2, reg_alpha = 3, reg_lambda = 1, scale_pos_weight = 4, 10,000 boosting iterations, unrestricted tree depth, fixed random state 100, and force_col_wise enabled.
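For reference, the stated hyperparameters can be collected as plain dictionaries. This is a sketch under assumptions: the keyword names follow the scikit-learn-style wrappers of XGBoost and LightGBM, mapping "boosting iterations" to `n_estimators` and "unrestricted tree depth" to `max_depth = -1` is an interpretation, and the XGBoost random state is omitted because its value is not given in this section.

```python
# Tuned XGBoost settings as listed in the text (random state value not stated).
xgb_params = {
    "n_estimators": 200,
    "max_depth": 8,
    "learning_rate": 0.05,
    "subsample": 0.8,
    "min_child_weight": 10,
    "colsample_bytree": 0.8,
    "eval_metric": "logloss",
}

# Tuned LightGBM settings as listed in the text.
lgbm_params = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": "auc",
    "learning_rate": 0.007,
    "subsample": 1.0,
    "colsample_bytree": 0.2,
    "reg_alpha": 3,
    "reg_lambda": 1,
    "scale_pos_weight": 4,
    "n_estimators": 10_000,  # assumed mapping of "boosting iterations"
    "max_depth": -1,         # LightGBM convention for no depth limit
    "random_state": 100,
    "force_col_wise": True,
}
```

The small LightGBM learning rate paired with a large iteration count means each tree contributes only a small step $\eta\, h_k(x_i)$ to the additive score.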
Because class imbalance was a central methodological concern, the final analysis pipeline emphasized model-intrinsic class weighting and threshold-aware evaluation rather than synthetic oversampling. In the LightGBM model, the positive diabetes class was weighted more heavily through scale_pos_weight = 4. Conceptually, the weighted binary learning objective can be expressed as:
$$L = \sum_{i=1}^{n} w_{y_i}\, \ell(y_i, p_i) + \Omega(F)$$
where $\ell(y_i, p_i)$ is the binary classification loss, $w_{y_i}$ is the class-specific observation weight, and $\Omega(F)$ represents regularization terms that penalize excessive model complexity. This weighting scheme increased the cost of misclassifying diabetes-positive cases during training while preserving the original empirical distribution of the data. Therefore, the workflow avoided synthetic generation of minority-class records and instead evaluated the sensitivity-precision trade-off directly through model probabilities and threshold-dependent metrics.
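To make the effect of the class weight concrete, the sketch below evaluates the data term of this objective with log loss as $\ell$, a weight of 4 on positives, and a weight of 1 on negatives. It is an illustration only: the function name is hypothetical, the regularizer $\Omega(F)$ is omitted, and the assumption is that the weight acts as a simple multiplier on positive-class loss terms.

```python
import math

def weighted_log_loss(y_true, p_pred, pos_weight=4.0, eps=1e-15):
    """Sum of class-weighted binary cross-entropy terms (no regularizer)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        w = pos_weight if y == 1 else 1.0
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total

# One positive and one negative stay, both predicted at p = 0.2.
y_toy = [1, 0]
p_toy = [0.2, 0.2]
unweighted = weighted_log_loss(y_toy, p_toy, pos_weight=1.0)
weighted = weighted_log_loss(y_toy, p_toy, pos_weight=4.0)
```

With the weight applied, the under-predicted positive case contributes four times as much loss as before, while the negative case is unchanged; no rows are duplicated or synthesized.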
In addition to the five individual models, we trained a soft-voting ensemble that combined tuned XGBoost, tuned LightGBM, and histogram-based gradient boosting. The ensemble assigned weights of 1:3:1 to XGBoost, LightGBM, and HGB, respectively, giving greater influence to the sensitivity-oriented LightGBM model while still incorporating the more conservative probability estimates from XGBoost and HGB. The ensemble probability was calculated as the weighted average:
$$p_i^{\mathrm{ens}} = \frac{1 \cdot p_i^{\mathrm{XGB}} + 3 \cdot p_i^{\mathrm{LGBM}} + 1 \cdot p_i^{\mathrm{HGB}}}{1 + 3 + 1}$$
Final ensemble labels were then obtained by applying the same thresholding rule in Equation (13). This design provided a more balanced operating point than relying on LightGBM alone: the ensemble retained much of LightGBM’s sensitivity while improving precision by tempering extreme probability estimates through the additional boosted-tree models. Table 3 summarizes the model-development choices.
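The weighted average and thresholding step can be illustrated with a few lines of plain Python. The function name `soft_vote` and the example probabilities are hypothetical; only the 1:3:1 weights and the 0.50 default threshold come from the text.

```python
def soft_vote(p_xgb, p_lgbm, p_hgb, weights=(1, 3, 1)):
    """Weighted average of three model probabilities (XGBoost, LightGBM, HGB)."""
    w_x, w_l, w_h = weights
    return (w_x * p_xgb + w_l * p_lgbm + w_h * p_hgb) / (w_x + w_l + w_h)

# Case A: LightGBM is confident; the ensemble still flags the stay.
p_a = soft_vote(0.40, 0.70, 0.30)   # (0.40 + 2.10 + 0.30) / 5, about 0.56
label_a = int(p_a >= 0.50)

# Case B: LightGBM alone would flag at 0.55, but the ensemble does not.
p_b = soft_vote(0.30, 0.55, 0.35)   # (0.30 + 1.65 + 0.35) / 5, about 0.46
label_b = int(p_b >= 0.50)
```

Case B shows the tempering effect described above: a borderline LightGBM probability is pulled below the threshold by the two more conservative models.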

2.7. Evaluation, Ablation Analysis, and Grouped Site-Stratified Validation

Model performance was evaluated on held-out validation data generated by the split-aware preprocessing workflow described in Section 2.5. For a given validation set $V = \{(x_i, y_i),\ i = 1, \ldots, n_V\}$, $x_i$ denotes the final transformed predictor vector and $y_i \in \{0, 1\}$ denotes the observed diabetes label. For model $m$, the predicted probability $p_i^{(m)}$ was obtained from the fitted probability function defined in Equation (12), and thresholded class labels were obtained according to Equation (13).
The primary model-comparison tables used the model-default binary prediction, corresponding to a probability threshold of t = 0.50 for the probabilistic binary classifiers unless a separate threshold analysis was explicitly reported.
For each model and validation fold, the threshold-specific confusion-matrix counts were computed from the observed and predicted labels as follows:
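$$\begin{aligned} \mathrm{TP}_t &= \sum_{i \in V} I(y_i = 1)\, I(\hat{y}_i^{(t)} = 1), & \mathrm{FP}_t &= \sum_{i \in V} I(y_i = 0)\, I(\hat{y}_i^{(t)} = 1), \\ \mathrm{TN}_t &= \sum_{i \in V} I(y_i = 0)\, I(\hat{y}_i^{(t)} = 0), & \mathrm{FN}_t &= \sum_{i \in V} I(y_i = 1)\, I(\hat{y}_i^{(t)} = 0) \end{aligned} \tag{18}$$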
These four quantities were used to calculate the fixed-threshold classification metrics reported in the primary and ablation tables:
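$$n_V^{(t)} = \mathrm{TP}_t + \mathrm{TN}_t + \mathrm{FP}_t + \mathrm{FN}_t, \qquad \mathrm{Accuracy}_t = \frac{\mathrm{TP}_t + \mathrm{TN}_t}{n_V^{(t)}}, \qquad \mathrm{Precision}_t = \frac{\mathrm{TP}_t}{\mathrm{TP}_t + \mathrm{FP}_t}, \qquad \mathrm{Recall}_t = \frac{\mathrm{TP}_t}{\mathrm{TP}_t + \mathrm{FN}_t} \tag{19}$$
$$F1_t = \frac{2 \cdot \mathrm{Precision}_t \cdot \mathrm{Recall}_t}{\mathrm{Precision}_t + \mathrm{Recall}_t} \tag{20}$$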
In the implementation, precision, recall, and F1-score were computed using the scikit-learn metric functions with zero_division = 0, so undefined ratios caused by an empty denominator were assigned a value of zero rather than producing unstable estimates. Recall was interpreted as sensitivity for the diabetes-positive class, whereas precision represented the positive predictive value among ICU stays flagged as likely diabetes-positive.
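To make the confusion-matrix counts and the zero-denominator convention concrete, the sketch below recomputes them by hand in plain Python instead of calling scikit-learn. The helper names (`confusion_counts`, `safe_div`, `threshold_metrics`) and the eight toy records are hypothetical, and the thresholding rule "positive when p >= t" is an assumption.

```python
def confusion_counts(y_true, p_hat, t=0.50):
    """Count TP, FP, TN, FN after thresholding probabilities at t."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, p_hat):
        y_pred = 1 if p >= t else 0  # assumed rule: positive when p >= t
        if y == 1 and y_pred == 1:
            tp += 1
        elif y == 0 and y_pred == 1:
            fp += 1
        elif y == 0 and y_pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def threshold_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from the four counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Toy labels and probabilities (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
p_hat = [0.90, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10, 0.05]

at_default = threshold_metrics(*confusion_counts(y_true, p_hat, t=0.50))
at_strict = threshold_metrics(*confusion_counts(y_true, p_hat, t=0.95))
print(at_default)
print(at_strict)
```

At t = 0.95 no record is predicted positive, so the precision denominator is empty and the sketch reports 0.0, mirroring the zero_division = 0 behavior described above.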
Probability-based performance was evaluated separately from fixed-threshold classification. Probability accuracy was summarized using the Brier score, which measures the mean squared deviation between the predicted probability and the observed binary label:
$$\mathrm{Brier} = \frac{1}{n_V} \sum_{i \in V} \left(p_i^{(m)} - y_i\right)^2 \tag{21}$$
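As a small worked example of the Brier score, the following plain-Python sketch evaluates the mean squared deviation on four made-up records. The function name `brier_score` is hypothetical, and the values are not taken from the study.

```python
def brier_score(y_true, p_hat):
    """Mean squared difference between predicted probability and 0/1 label."""
    if len(y_true) != len(p_hat) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)


# Toy records: squared errors are 0.04, 0.09, 0.36, and 0.01.
y_true = [1, 0, 1, 0]
p_hat = [0.8, 0.3, 0.4, 0.1]
score = brier_score(y_true, p_hat)
print(round(score, 4))
```

Lower values are better; a model that always predicts 0.5 scores 0.25 regardless of the labels, which gives a rough reference point for reading the reported values.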
Discrimination was assessed with the receiver operating characteristic curve and the area under that curve (AUROC). At each threshold t, the true-positive rate and false-positive rate were defined as:
$$\mathrm{TPR}_t = \frac{\mathrm{TP}_t}{\mathrm{TP}_t + \mathrm{FN}_t}, \qquad \mathrm{FPR}_t = \frac{\mathrm{FP}_t}{\mathrm{FP}_t + \mathrm{TN}_t} \tag{22}$$
The AUROC was therefore interpreted as a threshold-independent ranking measure. Equivalently, for n1 positive cases and n0 negative cases in the validation fold, AUROC estimates the probability that a randomly selected diabetes-positive ICU stay receives a higher predicted probability than a randomly selected diabetes-negative ICU stay:
$$\mathrm{AUROC} = \frac{1}{n_1 n_0} \sum_{i : y_i = 1} \ \sum_{j : y_j = 0} S_{ij}, \qquad S_{ij} = I(p_i > p_j) + 0.5\, I(p_i = p_j) \tag{23}$$
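The pairwise definition can be evaluated directly on a small example. The sketch below is a brute-force illustration with a hypothetical function name and toy data; it compares every positive with every negative, so it is only practical for small samples, and library implementations typically use a faster rank-based computation.

```python
def auroc_pairwise(y_true, p_hat):
    """Fraction of positive/negative pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    score = 0.0
    for p_i in pos:
        for p_j in neg:
            if p_i > p_j:
                score += 1.0
            elif p_i == p_j:
                score += 0.5
    return score / (len(pos) * len(neg))


# Toy data: one positive ties with one negative at 0.4.
y_true = [1, 1, 0, 0, 0]
p_hat = [0.9, 0.4, 0.4, 0.3, 0.1]
auc = auroc_pairwise(y_true, p_hat)
print(round(auc, 4))
```

Here five of the six pairs are ordered correctly and one is tied, giving 5.5 / 6.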
Because the positive class represented a minority of the cohort, receiver operating characteristic analysis was supplemented with precision-recall analysis. The prevalence of the positive class in a validation fold provides the no-skill precision baseline:
$$\pi_V = \frac{1}{n_V} \sum_{i \in V} y_i \tag{24}$$
Precision-recall curves were generated by sweeping the classification threshold over the predicted probabilities. Average precision (AP) summarized the area under the precision-recall curve using the stepwise interpolation implemented by scikit-learn:
$$\mathrm{AP} = \sum_{k=1}^{K} \left(R_k - R_{k-1}\right) P_k \tag{25}$$
where $P_k$ and $R_k$ are the precision and recall values at the $k$-th threshold operating point. To support clinically interpretable threshold selection, we also examined operating points that preserved high sensitivity. For a prespecified recall target $r = 0.80$, the selected threshold was defined as the largest threshold that still achieved at least the target recall:
$$t_r = \max\{t : \mathrm{Recall}_t \geq r\}, \quad \text{with } r = 0.80 \tag{26}$$
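The threshold sweep, the stepwise average-precision sum, and the recall-targeted threshold rule can be sketched together in plain Python. This is an illustrative reconstruction under stated assumptions, not the study's code: the function names are hypothetical, the ten records are made up, a record is treated as positive when its probability is at least the threshold, and recall before the first operating point is taken to be zero.

```python
def pr_points(y_true, p_hat):
    """(threshold, precision, recall) at each distinct score, highest first."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    points = []
    for t in sorted(set(p_hat), reverse=True):
        tp = sum(1 for y, p in zip(y_true, p_hat) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, p_hat) if p >= t and y == 0)
        points.append((t, tp / (tp + fp), tp / n_pos))
    return points


def average_precision(points):
    """Stepwise sum of (recall increment) x (precision at that point)."""
    ap, prev_recall = 0.0, 0.0
    for _, precision, recall in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def threshold_for_recall(points, target=0.80):
    """Largest threshold whose recall meets the target, or None."""
    # Thresholds descend and recall never decreases, so the first hit is the max.
    for t, precision, recall in points:
        if recall >= target:
            return t, precision, recall
    return None


# Toy validation fold with five positives (illustrative only).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
p_hat = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

points = pr_points(y_true, p_hat)
ap = average_precision(points)
selected = threshold_for_recall(points, target=0.80)
print(round(ap, 4), selected)
```

In the toy data the largest threshold reaching recall 0.80 is 0.50, where precision is 4/6; lowering the threshold further keeps recall at or above the target but only adds false positives until the last positive is captured.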
The primary evaluation was conducted under the leakage-mitigated original feature scenario using the 80/20 stratified random validation split. For each trained model m in the set of evaluated classifiers, the analysis pipeline stored the predicted class labels, predicted probabilities, and metric vector:
$$q_{s,m} = \left(\mathrm{Accuracy},\ \mathrm{Precision},\ \mathrm{Recall},\ F1,\ \mathrm{AUROC},\ \mathrm{Brier}\right)_{s,m} \tag{27}$$
where s indexes the feature scenario and m indexes the model. This notation reflects the implementation in which the same train_eval_predict routine was applied to CatBoost, random forest, tuned XGBoost, tuned LightGBM, histogram-based gradient boosting, and the soft-voting ensemble. The leakage-mitigated original scenario served as the primary benchmark because it removed variables most likely to encode label-adjacent information while retaining clinically interpretable admission and day-1 predictors.
To quantify robustness to potential leakage and feature-set dependence, the full modeling framework was repeated across the four prespecified feature scenarios: full feature set, leakage-mitigated original, exclude all APACHE, and strict reduced model. For a given metric component k, the scenario-specific change relative to the primary leakage-mitigated benchmark was defined as:
$$\Delta_{s,m}^{(k)} = q_{s,m}^{(k)} - q_{\mathrm{primary},m}^{(k)} \tag{28}$$
where $q_{\mathrm{primary},m}^{(k)}$ is the value of metric $k$ for model $m$ under the leakage-mitigated original scenario. This ablation framework allowed the analysis to determine whether performance was driven primarily by APACHE-derived variables, site-related structure, glucose-like predictors, or more general clinical and physiologic information.
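As a worked instance of the scenario-specific change, the sketch below applies the difference to the tuned LightGBM AUROC values reported in Section 3.3. The dictionary keys and the function name `scenario_deltas` are hypothetical labels chosen for this illustration; only the four AUROC values come from the article.

```python
# Tuned LightGBM AUROC by feature scenario, as reported in Section 3.3.
auroc_lgbm = {
    "full_feature_set": 0.8565,
    "leakage_mitigated_original": 0.8537,
    "exclude_all_apache": 0.8508,
    "strict_reduced": 0.7432,
}

PRIMARY = "leakage_mitigated_original"  # hypothetical key for the benchmark


def scenario_deltas(metric_by_scenario, primary=PRIMARY):
    """Change in a metric relative to the primary scenario, rounded to 4 places."""
    base = metric_by_scenario[primary]
    return {s: round(v - base, 4) for s, v in metric_by_scenario.items()}


deltas = scenario_deltas(auroc_lgbm)
for scenario, delta in deltas.items():
    print(f"{scenario}: {delta:+.4f}")
```

The full-feature and exclude-all-APACHE scenarios differ from the benchmark by about 0.003 in either direction, whereas the strict reduced scenario differs by about 0.11.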
In addition to the row-level random split, a grouped site-stratified validation analysis was conducted to estimate transportability to unseen hospitals. The grouped experiment reconstructed the processed adult labeled cohort under the same leakage-mitigated feature scenario and then applied GroupShuffleSplit with one split, test_size = 0.20, random_state = 40, and hospital_id as the grouping variable. Let $h_i$ denote the hospital identifier for ICU stay $i$, and let $G_{\mathrm{train}}$ and $G_{\mathrm{val}}$ denote the selected training and validation hospital sets. The grouped split enforced hospital-level separation:
$$G_{\mathrm{train}} \cap G_{\mathrm{val}} = \varnothing, \qquad D_{\mathrm{train}}^{\mathrm{group}} = \{i : h_i \in G_{\mathrm{train}}\}, \qquad D_{\mathrm{val}}^{\mathrm{group}} = \{i : h_i \in G_{\mathrm{val}}\} \tag{29}$$
Thus, no hospital contributed records to both the grouped training and grouped validation partitions. The grouped analysis used the tuned LightGBM model because it was the sensitivity-oriented model emphasized for minority-class detection and because it provided the basis for subsequent calibration and SHAP-based interpretability analyses. To compare random-split and grouped-site behavior, the grouped validation metric change was summarized as:
$$\Delta_{\mathrm{group}}^{(k)} = q_{\mathrm{group},\mathrm{LGBM}}^{(k)} - q_{\mathrm{random},\mathrm{LGBM}}^{(k)} \tag{30}$$
A decrease in AUROC, precision, or F1-score, or an increase in Brier score after site separation, would indicate reduced transportability to unseen hospitals, whereas stable recall would indicate that the model continued to identify diabetes-positive ICU stays despite a stricter validation design.
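The hospital-level separation can be illustrated with a small standard-library sketch. It is a stand-in for, not a reimplementation of, scikit-learn's GroupShuffleSplit: the shuffling differs, so the same seed will not reproduce the study's partition. The function names and toy hospital identifiers are hypothetical, and rounding the number of held-out hospitals upward is an assumption of this sketch; with 204 hospitals it yields 41, the validation count reported in Section 3.1.

```python
import math
import random


def n_validation_groups(n_groups, test_size=0.20):
    """Number of held-out groups, rounding up (an assumption of this sketch)."""
    return math.ceil(test_size * n_groups)


def grouped_split(hospital_ids, test_size=0.20, seed=40):
    """Assign whole hospitals to validation; return (train_idx, val_idx)."""
    hospitals = sorted(set(hospital_ids))
    random.Random(seed).shuffle(hospitals)
    n_val = n_validation_groups(len(hospitals), test_size)
    val_hospitals = set(hospitals[:n_val])
    train_idx = [i for i, h in enumerate(hospital_ids) if h not in val_hospitals]
    val_idx = [i for i, h in enumerate(hospital_ids) if h in val_hospitals]
    return train_idx, val_idx


# Toy cohort: 10 hypothetical hospitals, hospital h contributing h + 1 stays.
hospital_ids = [h for h in range(10) for _ in range(h + 1)]
train_idx, val_idx = grouped_split(hospital_ids)

train_hospitals = {hospital_ids[i] for i in train_idx}
val_hospitals = {hospital_ids[i] for i in val_idx}
print(sorted(val_hospitals), len(train_idx), len(val_idx))
```

Because whole hospitals are assigned to one side, the share of rows in validation depends on the sizes of the hospitals drawn and generally differs from the requested 20%.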
Together, as shown in Table 4, the random split, scenario ablations, precision-recall analysis, and grouped hospital validation provided complementary evidence about discrimination, probability reliability, class-imbalance behavior, robustness to leakage-prone predictors, and site-level generalizability.

2.8. Calibration Assessment and SHAP-Based Interpretability

We complemented discrimination-based evaluation with probability calibration assessment and SHAP-based interpretability. Calibration was examined for the tuned LightGBM model under both the primary random stratified validation split and the stricter grouped site validation split. This distinction was important because AUROC and average precision evaluate ranking behavior, whereas calibration evaluates whether predicted probabilities are numerically reliable as risk estimates.
Let $s \in \{\mathrm{random}, \mathrm{group}\}$ index the validation strategy, and let $V_s$ denote the corresponding validation set. For ICU stay $i$ in $V_s$, the tuned LightGBM model produced an estimated probability of documented diabetes status:
$$p_i^{(s)} = \Pr\left(y_i = 1 \mid x_i, M_{\mathrm{LGBM}}^{(s)}\right), \quad i \in V_s \tag{31}$$
A perfectly calibrated probabilistic classifier would satisfy the condition that, among patients assigned predicted risk $p$, the observed event frequency is also $p$:
$$E\left[y_i \mid p_i^{(s)} = p\right] = p, \quad 0 \leq p \leq 1 \tag{32}$$
Calibration curves were generated with the scikit-learn calibration_curve function using 10 bins and strategy = ‘uniform’, which divides the interval [0, 1] into 10 equal-width probability bins. For bin $b$, the validation records assigned to that bin were defined as:
$$B_b^{(s)} = \begin{cases} \left\{i \in V_s : \frac{b-1}{10} \leq p_i^{(s)} < \frac{b}{10}\right\}, & b = 1, \ldots, 9, \\ \left\{i \in V_s : 0.90 \leq p_i^{(s)} \leq 1\right\}, & b = 10. \end{cases} \tag{33}$$
For each nonempty bin, the analysis pipeline computed the mean predicted probability and the observed fraction positive:
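$$\bar{p}_b^{(s)} = \frac{1}{|B_b^{(s)}|} \sum_{i \in B_b^{(s)}} p_i^{(s)}, \qquad \bar{o}_b^{(s)} = \frac{1}{|B_b^{(s)}|} \sum_{i \in B_b^{(s)}} y_i \tag{34}$$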
The plotted calibration curve was therefore the set of bin-level pairs displayed against the diagonal perfect-calibration reference line:
$$C_s = \left\{\left(\bar{p}_b^{(s)}, \bar{o}_b^{(s)}\right) : |B_b^{(s)}| > 0\right\}, \qquad \text{Perfect calibration: } \bar{o}_b^{(s)} = \bar{p}_b^{(s)} \tag{35}$$
Points below the diagonal indicate probability overestimation, whereas points above the diagonal indicate probability underestimation. The Brier score defined in Equation (21) was retained as the numerical summary of probability accuracy for both validation settings. No post hoc recalibration model, such as Platt scaling or isotonic regression, was fitted; the purpose of this step was to diagnose probability reliability rather than to recalibrate the model.
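The binning procedure can be sketched without scikit-learn. The code below follows the bin definition given in the text (ten equal-width bins, with a probability of exactly 1 placed in the last bin) and returns only nonempty bins. The function name `calibration_bins` and the eight toy records are hypothetical, and the handling of values that fall exactly on an interior bin edge may differ from the library routine.

```python
def calibration_bins(y_true, p_hat, n_bins=10):
    """Per-bin (mean predicted probability, observed fraction positive, count)."""
    sum_p = [0.0] * n_bins
    sum_y = [0] * n_bins
    count = [0] * n_bins
    for y, p in zip(y_true, p_hat):
        b = min(int(p * n_bins), n_bins - 1)  # p = 1.0 falls in the last bin
        sum_p[b] += p
        sum_y[b] += y
        count[b] += 1
    return [
        (sum_p[b] / count[b], sum_y[b] / count[b], count[b])
        for b in range(n_bins)
        if count[b] > 0  # empty bins are skipped, as in the text
    ]


# Toy validation records (illustrative only).
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
p_hat = [0.05, 0.15, 0.12, 0.55, 0.52, 0.58, 0.91, 0.97]

curve = calibration_bins(y_true, p_hat)
for mean_pred, frac_pos, n in curve:
    side = "below" if frac_pos < mean_pred else "on or above"
    print(f"mean p = {mean_pred:.3f}, observed = {frac_pos:.3f}, n = {n}, {side} diagonal")
```

In the toy output the highest bin has a mean predicted probability of 0.94 but an observed fraction of 0.50, which is the "below the diagonal" pattern that signals overestimation.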
Interpretability was assessed using SHAP (SHapley Additive exPlanations) [33] rather than relying only on model-native feature-importance rankings. SHAP analysis was anchored to the grouped-site LightGBM model because that model was trained and evaluated under hospital-level separation and therefore provided the more conservative explanation target. For computational efficiency, SHAP values were computed on the grouped validation set, with random subsampling applied only when the grouped validation set exceeded 3000 records:
$$S_{\mathrm{SHAP}} = \begin{cases} V_{\mathrm{group}}, & |V_{\mathrm{group}}| \leq 3000, \\ \text{a 3000-record sample from } V_{\mathrm{group}} \text{ without replacement}, & \text{otherwise}. \end{cases} \tag{36}$$
For transformed predictor vector $x_i$, the explanation takes the additive form:
$$F_G(x_i) = \phi_0 + \sum_{j=1}^{p} \phi_{ij} \tag{37}$$
where $F_G(x_i)$ is the grouped-site LightGBM output on the TreeExplainer model-output scale, $\phi_0$ is the expected baseline output, $p$ is the number of final predictors, and $\phi_{ij}$ is the contribution of predictor $j$ for ICU stay $i$. Conceptually, each SHAP value is a weighted average of the marginal contribution of feature $j$ across possible feature subsets:
$$\phi_{ij} = \sum_{S \subseteq F \setminus \{j\}} w(S) \left[ f_G\left(x_{i, S \cup \{j\}}\right) - f_G\left(x_{i, S}\right) \right], \qquad w(S) = \frac{|S|!\,(p - |S| - 1)!}{p!} \tag{38}$$
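The subset-weighted average can be checked by brute force on a tiny model. The sketch below is purely conceptual: it enumerates all subsets, which is feasible only for a handful of features, and it evaluates a subset by replacing the absent features with fixed baseline values, which is an assumption of this illustration and not how TreeExplainer handles missing features. The toy function, inputs, and names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, with absent features set to baseline."""
    p = len(x)

    def value(subset):
        z = [x[j] if j in subset else baseline[j] for j in range(p)]
        return f(z)

    phi = []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        total = 0.0
        for size in range(p):
            # Weight |S|! (p - |S| - 1)! / p! for subsets of this size.
            w = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = set(subset)
                total += w * (value(s | {j}) - value(s))
        phi.append(total)
    return phi, value(set())


def toy_model(z):
    """Hypothetical model: one additive term plus one two-way interaction."""
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi, phi0 = shapley_values(toy_model, x, baseline)
print(phi, phi0, phi0 + sum(phi), toy_model(x))
```

The additive term is credited entirely to the first feature, the interaction is shared equally between the other two, and the baseline plus the contributions reproduces the model output, which is the additive property stated in Equation (37).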
Global feature influence was summarized by the mean absolute SHAP value, which measures the average magnitude of a predictor’s contribution across the SHAP explanation sample:
$$I_j = \frac{1}{|S_{\mathrm{SHAP}}|} \sum_{i \in S_{\mathrm{SHAP}}} |\phi_{ij}| \tag{39}$$
The global mean absolute SHAP bar plot ranked predictors by $I_j$, while the SHAP beeswarm summary plot displayed the distribution and direction of feature contributions across patients. Higher or lower feature values could therefore be interpreted in terms of whether they pushed the grouped-site LightGBM output toward or away from the diabetes-positive class.
To examine nonlinear predictor effects, dependence plots were generated for the most influential predictors. For a selected feature $j$, the dependence plot consisted of the observed transformed feature value paired with its SHAP contribution across the explanation sample:
$$D_j = \{(x_{ij}, \phi_{ij}) : i \in S_{\mathrm{SHAP}}\} \tag{40}$$
The top SHAP features were identified by sorting Equation (39), and dependence plots were generated for the six highest-ranked predictors. These plots were used to assess whether dominant predictors contributed monotonically, showed threshold-like changes, or saturated at high or low values.
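Ranking by mean absolute SHAP value reduces to a column-wise average followed by a sort. The sketch below illustrates this on a made-up three-record SHAP matrix; the function names are hypothetical, the feature labels are used only as examples, and the numbers are invented, not study output.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Average absolute SHAP contribution per feature across records."""
    n = len(shap_rows)
    return {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }


def top_features(importance, k=6):
    """Feature names sorted by decreasing mean absolute SHAP value."""
    return sorted(importance, key=importance.get, reverse=True)[:k]


# Invented SHAP values: rows are ICU stays, columns follow feature_names.
feature_names = ["d1_glucose_max", "bmi", "age"]
shap_rows = [
    [0.8, -0.2, 0.1],
    [-0.6, 0.3, -0.1],
    [1.0, -0.1, 0.1],
]

importance = mean_abs_shap(shap_rows, feature_names)
ranked = top_features(importance, k=2)
print(importance, ranked)
```

Taking absolute values before averaging matters: the first column contains both positive and negative contributions that would partly cancel in a plain mean, yet it still ranks first by magnitude.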
Finally, local SHAP waterfall plots were generated for representative cases selected from the grouped validation confusion-matrix categories. Using the grouped LightGBM predicted labels, the case sets were defined as:
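$$\begin{aligned} \mathrm{TP} &= \{i : y_i = 1,\ \hat{y}_i = 1\}, & \mathrm{FP} &= \{i : y_i = 0,\ \hat{y}_i = 1\}, \\ \mathrm{TN} &= \{i : y_i = 0,\ \hat{y}_i = 0\}, & \mathrm{FN} &= \{i : y_i = 1,\ \hat{y}_i = 0\} \end{aligned} \tag{41}$$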
We selected the first available example from each nonempty set and plotted the cumulative SHAP decomposition for that patient. If features are ordered by decreasing absolute local contribution, the waterfall trajectory after the first k displayed features is:
$$W_i^{(k)} = \phi_0 + \sum_{r=1}^{k} \phi_{i, j_r}, \qquad |\phi_{i, j_1}| \geq |\phi_{i, j_2}| \geq \cdots \tag{42}$$
This local decomposition showed how individual predictors moved the grouped-site LightGBM output away from the baseline value and toward either the diabetes-positive or diabetes-negative class. Together, the calibration diagnostics and SHAP analyses addressed two complementary concerns: whether the model probabilities were reliable enough to interpret, and whether the model’s behavior could be explained at both global and patient-specific levels using the leakage-mitigated, site-separated modeling workflow, as shown in Table 5.
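The waterfall trajectory is a running sum over contributions ordered by absolute size. The following sketch illustrates the idea with a hypothetical function name and invented values for one patient; the baseline and contributions are on an arbitrary model-output scale and are not taken from the study.

```python
def waterfall(phi0, contributions):
    """Running model output after adding features by decreasing |contribution|."""
    ordered = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    trajectory = []
    running = phi0
    for name, phi in ordered:
        running += phi
        trajectory.append((name, running))
    return trajectory


# Invented baseline and local SHAP contributions for a single ICU stay.
phi0 = -1.0
contributions = {"d1_glucose_max": 1.5, "age": -0.25, "bmi": 0.5}

steps = waterfall(phi0, contributions)
for name, level in steps:
    print(f"after {name}: {level:+.2f}")
```

The final level equals the baseline plus all contributions, so the last step of the trajectory coincides with the model output for that patient.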

2.9. Reporting and Reproducibility

Methods and results reporting were structured to align with core elements of TRIPOD, including explicit definitions of outcome and predictors, transparent preprocessing, clear separation of model development from evaluation, and complete reporting of performance metrics. To support full reproducibility, the complete preprocessing pipeline (including feature lists and imputation rules), the final tuned hyperparameters, and the software and package versions are provided.

3. Results

3.1. Cohort Composition and Validation Design

The analytic cohort was derived from the labeled WiDS 2021 training file, which contained 130,157 ICU stays. This labeled file should be distinguished from the model-training partition generated after splitting. The separate unlabeled challenge file contained 10,234 ICU stays and was not used to estimate validation performance, select thresholds, or fit preprocessing parameters. After applying the adult eligibility restriction (age ≥ 16 years), 125,139 labeled ICU stays remained for model development and evaluation. Thus, 5018 records were excluded during cohort definition, and 96.1% of the labeled cohort was retained in the adult analytic cohort.
The primary internal validation design used the 80/20 stratified random partition described in the Methods. This split allocated 100,111 adult records to training and 25,028 records to validation, corresponding almost exactly to 80.0% and 20.0% of the adult analytic cohort. Because the split was stratified by diabetes_mellitus, the validation partition preserved the minority-class structure of the outcome and provided the principal row-level benchmark for comparing CatBoost, random forest, tuned XGBoost, tuned LightGBM, histogram-based gradient boosting, and the soft-voting ensemble.
A second validation design was used to evaluate transportability across clinical sites rather than performance on randomly sampled patient rows alone. In this grouped site-stratified experiment, hospital_id was used as the grouping variable, yielding 95,233 records for model training and 29,906 records for validation. The resulting row allocation was 76.1% training and 23.9% validation, rather than exactly 80/20, because the split operated at the hospital level and therefore had to assign all records from a given hospital to the same fold.
Hospital-level separation was complete. A total of 163 hospitals were assigned to the grouped training partition, and 41 hospitals were assigned to the grouped validation partition, for a total of 204 hospitals with no overlap between folds. Thus, approximately 79.9% of hospitals were used for grouped training and 20.1% were held out for grouped validation. This design intentionally created a stricter generalizability assessment than the row-level random split by reducing the possibility that performance estimates were inflated by hospital-specific coding patterns, admission practices, laboratory ordering behavior, or site-level population structure appearing in both training and validation data. Consequently, the random split was interpreted as the primary internal benchmark, whereas the grouped split was interpreted as a more conservative assessment of performance at previously unseen hospitals, as shown in Table 6.
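The cohort and split percentages quoted in this section follow directly from the reported counts, as the short check below shows. Every count is taken from the text above; the variable names are chosen for this illustration.

```python
# Counts reported in Section 3.1.
labeled_stays = 130_157
adult_stays = 125_139
random_train, random_val = 100_111, 25_028
group_train, group_val = 95_233, 29_906
train_hospitals, val_hospitals = 163, 41


def pct(part, whole):
    """Percentage of whole represented by part, rounded to one decimal."""
    return round(100 * part / whole, 1)


excluded = labeled_stays - adult_stays
total_hospitals = train_hospitals + val_hospitals

print("excluded records:", excluded)
print("adult cohort retained (%):", pct(adult_stays, labeled_stays))
print("random split (%):", pct(random_train, adult_stays), pct(random_val, adult_stays))
print("grouped split rows (%):", pct(group_train, adult_stays), pct(group_val, adult_stays))
print("grouped split hospitals (%):", pct(train_hospitals, total_hospitals),
      pct(val_hospitals, total_hospitals))
```

Both partitions sum to the 125,139-record adult cohort, and the grouped split departs from 80/20 at the row level while staying close to it at the hospital level.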

3.2. Primary Model Performance in the Leakage-Mitigated Random Split

Table 7 summarizes the primary model-comparison results in the leakage-mitigated scenario. Two distinct operating profiles were apparent. The tuned LightGBM model achieved the highest recall (0.7677) and nearly the highest AUROC (0.8537), indicating that it was the most aggressive minority-class detector. However, this sensitivity came at the expense of precision (0.5017) and overall accuracy (0.7815), and the model had the least favorable Brier score (0.1508), indicating poorer probability reliability. In contrast, the voting ensemble preserved essentially the same AUROC (0.8539), improved precision to 0.5671, and achieved the highest F1-score (0.6138) among the evaluated models. Tuned XGBoost showed the lowest Brier score (0.1197), suggesting the best raw probability fit of the individual models, but its recall remained moderate at 0.4642. Random forest produced the highest precision (0.6645) but the lowest recall (0.3225), illustrating the cost of using a comparatively conservative classifier in an imbalanced screening task.
Figure 1, Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6 provide the model-specific confusion matrices and top-15 feature-importance summaries for the six primary models. Taken together, these panels show that performance differences were driven less by fundamentally different predictor families than by how each algorithm traded sensitivity against false-positive burden. Across all models, day-1 glucose maxima dominated the feature rankings, and day-1 glucose minima, BMI, age, creatinine, pre-ICU length of stay, hemoglobin, and blood urea nitrogen repeatedly appeared among the most influential variables. The consistency of these rankings across algorithms supports the clinical plausibility of the pipeline and shows that the leakage mitigation step succeeded in removing hospital_id and icu_id from the dominant predictor set.
In Figure 1A, CatBoost correctly classified 18,046 non-diabetes stays and 2545 diabetes stays, while producing 1484 false positives and 2953 false negatives. This pattern is consistent with a balanced but not highly sensitive classifier. Figure 1B shows that d1_glucose_max clearly dominated the CatBoost ranking, followed by d1_glucose_min, age, BMI, and d1_creatinine_max, indicating that CatBoost relied primarily on metabolic and renal markers rather than on any residual site proxy.
Figure 2 illustrates the conservative operating profile of the random forest model. It produced the fewest false positives among the primary models (895) but missed a large number of diabetes cases (3725 false negatives), which explains its very low recall. Its feature ranking remained consistent with the other models, with d1_glucose_max by far the strongest signal and d1_glucose_min, age, BMI, pre_icu_los_days, and d1_creatinine_max forming the next tier of contributors.
The tuned XGBoost model in Figure 3 delivered a stronger balance between specificity and sensitivity than CatBoost or random forest. It correctly identified 2552 diabetes stays while keeping false positives to 1340, which explains its comparatively strong accuracy and the lowest Brier score among the individual tuned models. Its importance profile again centered on d1_glucose_max, together with APACHE-related variables such as ventilated_apache, arf_apache, gcs_unable_apache, and intubated_apache, along with BMI and age.
Figure 4 highlights why LightGBM emerged as the most sensitivity-oriented primary model. It correctly identified 4221 diabetes stays and reduced false negatives to 1277, but this gain was accompanied by 4192 false positives. Panel B shows that the same core predictors remained dominant, with d1_glucose_max and d1_glucose_min well above the other variables, followed by BMI, d1_creatinine_max, pre_icu_los_days, age, d1_wbc_max, and several vital-sign summaries.
As shown in Figure 5, the histogram-based gradient boosting model behaved similarly to CatBoost and tuned XGBoost. It delivered moderate sensitivity, with 2498 true positives and 3000 false negatives, while maintaining false positives at 1378. Its feature ranking was again led by d1_glucose_max, followed by age, BMI, d1_creatinine_max, and selected laboratory and respiratory features, reinforcing the stability of the signal structure across boosting methods.
Figure 6 demonstrates why the voting ensemble provided the most balanced overall operating point. The ensemble correctly identified 3678 diabetes stays and misclassified 2808 non-diabetes stays as positive, thereby improving precision relative to tuned LightGBM while retaining substantially higher recall than the more conservative baseline models. Its aggregated importance ranking remained anchored by d1_glucose_max, with d1_glucose_min, BMI, d1_creatinine_max, age, and pre_icu_los_days forming the next strongest contributors.

3.3. Ablation Analyses

Table 8 reports the full ablation results across all four feature scenarios. The most important observation is that performance fell only modestly when leakage-prone or APACHE-related variables were removed, but deteriorated sharply under the strict reduced model that also excluded glucose-like predictors. For LightGBM, AUROC declined only from 0.8565 in the full-feature scenario to 0.8537 in the leakage-mitigated scenario and to 0.8508 after removing all APACHE variables. The voting ensemble showed a similar pattern, decreasing from 0.8566 to 0.8539 and then to 0.8514. These changes were small enough to support the claim that the models retained meaningful discriminative ability after reasonable leakage mitigation. In contrast, the strict reduced model caused a much larger collapse in performance. LightGBM AUROC fell to 0.7432, and its Brier score worsened to 0.1986; the voting ensemble dropped to 0.7448 AUROC and 0.1656 Brier. Thus, the ablation analysis suggests that the model was not primarily dependent on explicit site or APACHE identifiers, but it was strongly dependent on glucose-related admission information, which is clinically plausible given the task of identifying documented diabetes.
Figure 7 complements Table 8 by making the ablation trends visually explicit. The full-feature, leakage-mitigated, and exclude-all-APACHE scenarios form a tight cluster for most models, whereas the strict reduced model produces an obvious downward shift in recall-weighted performance and AUROC. Importantly, Brier score is the only metric in Figure 7 for which lower values are better; the strict reduced bars rise because probability accuracy worsened materially once glucose-like predictors were removed. The figure therefore supports a nuanced conclusion: modest variable exclusions designed to reduce leakage do not materially destabilize performance, but removing the dominant glycemic signal substantially changes the task and sharply reduces discriminative utility.

3.4. Grouped Site-Stratified Validation

Table 9 compares the tuned LightGBM model under the original random stratified split and the stricter grouped site split. The grouped split reduced accuracy from 0.7815 to 0.7727, precision from 0.5017 to 0.4546, F1-score from 0.6069 to 0.5712, and AUROC from 0.8537 to 0.8443, and it increased the Brier score from 0.1508 to 0.1596 (a higher Brier score indicates less accurate probabilities). Recall, however, remained essentially unchanged (0.7677 versus 0.7684). This pattern is important: it suggests that some of the precision and ranking advantage seen in the random split depended on within-hospital regularities that do not fully transfer to unseen sites, but the model’s ability to find likely diabetes cases remained stable even after hospital-level separation.
Figure 8A shows the corresponding grouped-site confusion matrix: 18,581 true negatives, 5432 false positives, 1365 false negatives, and 4528 true positives. Because the grouped validation fold was larger than the random validation fold, the raw counts are not directly comparable, but the error pattern is informative. The grouped-site model continued to identify many diabetes cases while producing a relatively larger false-positive burden, which is consistent with the drop in precision documented in Table 9. Figure 8B shows that the grouped-site feature ranking remained clinically coherent. The leading variables were d1_glucose_max, d1_glucose_min, BMI, d1_creatinine_max, pre_icu_los_days, and age.
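The threshold metrics for tuned LightGBM can be recomputed from the confusion-matrix counts reported in Sections 3.2 and 3.4, which makes the random-versus-grouped comparison easy to verify. In the sketch below the function name is hypothetical, and the random-split true-negative count is not stated in the text; it is derived here by subtracting the other three cells from the 25,028 validation records.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Random split: TP, FP, FN as reported; TN derived from the fold size.
random_split = metrics_from_counts(
    tp=4221, fp=4192, tn=25_028 - 4221 - 4192 - 1277, fn=1277
)
# Grouped hospital split: all four counts as reported for Figure 8A.
grouped_split = metrics_from_counts(tp=4528, fp=5432, tn=18_581, fn=1365)

# Grouped minus random, matching the direction of the change defined in Methods.
delta = {k: round(grouped_split[k] - random_split[k], 4) for k in grouped_split}

for name in ("accuracy", "precision", "recall", "f1"):
    print(f"{name}: random {random_split[name]:.4f}, "
          f"grouped {grouped_split[name]:.4f}, change {delta[name]:+.4f}")
```

The recomputed values round to the figures quoted for Table 9, with precision showing the largest decline and recall changing by less than 0.001.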

3.5. Precision–Recall Behavior Under Class Imbalance

Because documented diabetes represented a minority outcome in the validation cohorts, precision–recall analysis was used to complement AUROC-based discrimination. Precision–recall curves directly summarize the trade-off between sensitivity and positive predictive value and are therefore useful for evaluating classifier behavior when the positive class is less frequent. In this analysis, the tuned LightGBM model was evaluated under both the primary random stratified validation split and the grouped hospital validation split. Horizontal reference lines were added to represent the diabetes-positive prevalence in the corresponding validation fold, which serves as the no-skill precision baseline.
The random-split LightGBM model achieved an average precision of 0.622, whereas the grouped-site LightGBM model achieved an average precision of 0.551. The decrease in average precision under grouped validation is consistent with the lower grouped-site precision reported in Table 9 and indicates that hospital-level separation increased the false-positive burden. However, the grouped-site precision–recall curve remained above the prevalence baseline across much of the recall range, showing that the model retained useful minority-class ranking ability even when evaluated on hospitals not seen during training.
The precision–recall curves also clarify the clinical implications of threshold selection. In the grouped-site validation set, a probability threshold of 0.4537 achieved recall of 0.8001 with precision of 0.4314. Thus, the model can be tuned to preserve high sensitivity for diabetes-positive ICU stays at previously unseen hospitals, but this operating point requires accepting a substantial number of false positives. This trade-off may be appropriate for a screening-oriented phenotyping workflow in which the model prompts chart review, medication reconciliation, or cohort enrichment. However, it would be less appropriate for automated diagnostic labeling or interruptive clinical alerts without additional confirmation. These findings reinforce that the model should be interpreted as a computable phenotyping aid rather than a stand-alone diagnostic tool.
In Figure 9, the blue curve shows the primary random stratified validation split, and the orange curve shows the grouped hospital validation split. Average precision was 0.622 for the random split and 0.551 for the grouped split. Horizontal reference lines indicate the diabetes-positive prevalence in the corresponding validation fold and represent the no-skill precision baseline. The grouped curve shows reduced precision relative to the random split but remains above the baseline over much of the recall range, indicating preserved minority-class ranking performance under hospital-level separation.

3.6. Calibration Assessment

Calibration analysis further evaluated whether the tuned LightGBM probabilities were numerically reliable as risk estimates rather than only useful for ranking patients. Figure 10 shows the calibration curves for the random stratified and grouped hospital validation splits. In both validation settings, the model curve fell below the diagonal reference line across much of the probability range, indicating that predicted probabilities tended to overestimate the observed frequency of documented diabetes, especially in the mid-to-high predicted-risk ranges. The random-split LightGBM model had a Brier score of 0.1508, whereas the grouped-site LightGBM model had a Brier score of 0.1596. This increase under grouped validation indicates that probability reliability worsened slightly when the model was evaluated on hospitals not observed during training. These findings are important because AUROC and average precision measure ranking behavior, but they do not establish whether the predicted probabilities are clinically reliable as absolute risk estimates. Therefore, although tuned LightGBM preserved strong recall and reasonable discrimination under both validation strategies, its probabilities would likely require post hoc recalibration and prospective monitoring before use in a decision-support workflow where the predicted probability itself is interpreted clinically.
Figure 10A shows calibration on the random stratified validation split; Figure 10B shows calibration on the grouped hospital validation split. The diagonal reference line represents perfect calibration. In both panels, the model curve falls below the diagonal over much of the probability range, indicating probability overestimation. Calibration worsened slightly under grouped-site validation, consistent with the higher grouped-site Brier score.

3.7. SHAP-Based Interpretability

Figure 11, Figure 12, Figure 13, Figure 14 and Figure 15 present SHAP-based interpretability analyses for the grouped-site LightGBM model. This choice is important because the grouped-site model provides a more conservative estimate of generalizability. The SHAP outputs allow global, nonlinear, and patient-level interpretation of the model’s behavior.
Figure 11A quantifies the overall influence of each feature on the grouped-site LightGBM output. d1_glucose_max was by far the dominant feature, with a mean absolute SHAP value substantially larger than all other predictors. The next strongest contributors were BMI, age, d1_glucose_min, d1_hemoglobin_max, and urineoutput_apache. Figure 11B adds directional information. High values of d1_glucose_max and BMI were generally associated with positive SHAP values that pushed the model toward diabetes, whereas lower values tended to push predictions downward. Age showed a more graded pattern, and lower-ranked variables clustered closer to zero, indicating more modest global influence. Importantly, site identifiers were absent from the SHAP ranking, which supports the success of the leakage-mitigation strategy.
Figure 12 shows that the effects of the dominant predictors were nonlinear rather than purely linear. In Figure 12A, the contribution of d1_glucose_max rose sharply as maximum glucose increased from approximately 100 to 250 mg/dL, then plateaued at very high values, indicating saturation rather than unbounded growth. In Figure 12B, BMI crossed from negative to positive contribution around the high-20s and then climbed steadily through the obese range, again with some flattening at the upper extreme. In Figure 12C, younger age was associated with negative SHAP values; the contribution increased through middle and older adulthood and then flattened or slightly tapered in the oldest age bands. These dependence plots show that the model captured clinically plausible thresholds rather than relying on simple monotonic linear rules.
Figure 13 reveals a more complex interaction structure in the secondary predictors. The d1_glucose_min plot in Figure 13A displayed an inverse and clearly non-monotonic pattern: lower glucose minima tended to increase the model output, whereas higher minima progressively reduced it. This behavior likely reflects interaction with d1_glucose_max and treatment response rather than a simple isolated glucose effect. Figure 13B shows that d1_hemoglobin_max contributed positively at lower-to-mid values and then became increasingly negative at higher values, suggesting that this variable acted as a contextual severity marker rather than a direct diabetes proxy. Figure 13C indicates that urineoutput_apache had a more diffuse contribution with relatively small SHAP magnitudes overall; it functioned as a secondary contextual feature rather than a dominant driver.
Figure 14 and Figure 15 show four local SHAP waterfall explanations generated from representative cases drawn from the true-positive, false-positive, true-negative, and false-negative sets in the analysis pipeline. The most striking recurring pattern is the dominant role of d1_glucose_max. In the positively scored cases, high d1_glucose_max created the single largest upward shift in model output, often overwhelming smaller downward contributions from hemoglobin, age, or glucose-minimum features. In the negatively scored cases, low or less extreme d1_glucose_max exerted the largest downward pull, often accompanied by lower BMI or other context variables that further reduced the diabetes score. These local explanations are valuable because they show that the model’s final decision for an individual patient was rarely driven by one feature in isolation; instead, a strong glycemic signal was modulated by anthropometrics, age, hematologic markers, renal markers, and pre-ICU context. These local explanations provide patient-level transparency beyond global feature-importance rankings alone.
Overall, the results lead to three main conclusions. First, the ensemble provided the most balanced overall operating profile, whereas tuned LightGBM remained the most sensitivity-oriented model. Second, reasonable leakage mitigation and grouped site validation reduced performance only modestly, supporting the robustness of the pipeline. Third, the calibration and SHAP analyses show that the final model is interpretable and still useful as a screening-oriented phenotyping tool, but that its probabilities remain imperfectly calibrated and would require further refinement before deployment.

4. Discussion

In this public, multi-center ICU cohort derived from the WiDS Datathon 2021/GOSSIS release, leakage-aware tree-based machine-learning models showed strong performance for admission-time identification of documented diabetes status. The analysis was intentionally designed as a benchmarking and phenotyping study rather than a stand-alone diagnostic model. The workflow used split-aware preprocessing, avoided fitting preprocessing steps on validation or challenge data, excluded explicit site identifiers from the model matrix, evaluated prespecified feature-ablation scenarios, used model-intrinsic class weighting rather than synthetic oversampling, and assessed performance under both random and grouped hospital validation. This design strengthens the credibility of the results because it reduces the risk that apparent discrimination was driven by direct leakage, site memorization, or preprocessing choices informed by held-out data.
The primary leakage-mitigated random-split results showed two clinically meaningful operating profiles. The soft-voting ensemble achieved the most balanced overall performance, with an AUROC of 0.8539, a precision of 0.5671, a recall of 0.6690, and an F1-score of 0.6138. Tuned LightGBM was the most sensitivity-oriented individual model, achieving recall of 0.7677 and AUROC of 0.8537, although this came with a lower precision of 0.5017 and a less favorable Brier score of 0.1508. These findings indicate that model choice should depend on intended use. LightGBM may be preferable when the goal is high-sensitivity screening and missed diabetes-positive cases are costly, whereas the ensemble may be preferable when the burden of false positives, chart review, or alert fatigue is a major concern.
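As a quick consistency check on the ensemble figures above, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The sketch below is illustrative only; the function name `f1_from_pr` is an assumption and is not taken from the study's code.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the soft-voting ensemble (random split).
ensemble_f1 = f1_from_pr(0.5671, 0.6690)
print(round(ensemble_f1, 4))
```

The recomputed value agrees with the reported F1-score of 0.6138 to four decimal places.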
The ablation analyses clarify the source of model performance and support a cautious interpretation. Removing leakage-prone and APACHE-related variables caused only modest reductions in discrimination. For LightGBM, AUROC declined from 0.8565 in the full-feature scenario to 0.8537 in the leakage-mitigated scenario and to 0.8508 after excluding all APACHE variables. The voting ensemble showed a similar pattern, with AUROC decreasing from 0.8566 to 0.8539 and then to 0.8514. These small changes suggest that the models were not primarily dependent on explicit APACHE-derived variables, site identifiers, or obvious leakage-prone predictors. In contrast, the strict reduced model that also removed glucose-like predictors produced a much larger decline in performance: LightGBM AUROC fell to 0.7432, and the ensemble AUROC fell to 0.7448. Thus, glucose-related admission variables remained the dominant predictive signal. This finding is clinically plausible for identifying documented diabetes status, but it is also the central interpretive limitation of the study. Early glycemic values in critically ill patients can reflect chronic diabetes, but they can also capture acute stress physiology, medication exposure, insulin or dextrose treatment, nutritional support, illness severity, and monitoring intensity. Therefore, the model should not be interpreted as learning chronic diabetes status alone.
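To make the scale of these changes concrete, the AUROC differences can be tabulated directly from the values quoted above. The short sketch below does so for tuned LightGBM; the scenario keys are illustrative labels chosen here, not names from the study's code.

```python
# AUROC values quoted in the text for tuned LightGBM, by feature scenario.
auroc = {
    "full": 0.8565,
    "leakage_mitigated": 0.8537,
    "no_apache": 0.8508,
    "strict_no_glucose": 0.7432,
}

# Absolute AUROC drop of each scenario relative to the full-feature model.
drops = {name: round(auroc["full"] - value, 4) for name, value in auroc.items()}

for name, drop in drops.items():
    print(f"{name:>20s}  drop = {drop:.4f}")
```

Under these numbers, removing leakage-prone and APACHE variables costs less than 0.01 AUROC, whereas also removing glucose-like predictors costs about 0.11.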
Grouped hospital validation provided a stricter assessment of generalizability to unseen clinical sites. When tuned LightGBM was evaluated under hospital-level separation, AUROC decreased from 0.8537 in the random split to 0.8443, precision decreased from 0.5017 to 0.4546, and Brier score worsened from 0.1508 to 0.1596. Recall, however, remained essentially stable at 0.7684. This pattern suggests that some precision and ranking advantage in the random split may reflect within-hospital regularities that do not fully transfer to unseen hospitals. At the same time, the model retained useful case-finding ability after site separation. For multi-center ICU phenotyping, this distinction is important: row-level random validation estimates internal performance, whereas grouped hospital validation provides a more conservative estimate of transportability across sites.
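The difference between row-level and hospital-level splitting can be illustrated with a minimal sketch. It assumes a per-row array of hospital identifiers and assigns whole hospitals to either training or validation; the function `grouped_split` and the toy identifiers are hypothetical and do not reproduce the study's pipeline. Library utilities such as scikit-learn's `GroupKFold` or `GroupShuffleSplit` serve the same purpose.

```python
import numpy as np


def grouped_split(groups, val_fraction=0.2, seed=0):
    """Return boolean (train, validation) masks with no group on both sides."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_groups)
    # Hold out whole groups, at least one, rather than individual rows.
    n_val = max(1, int(round(val_fraction * len(shuffled))))
    val_mask = np.isin(groups, shuffled[:n_val])
    return ~val_mask, val_mask


# Hypothetical hospital identifiers for ten ICU stays.
hospital_id = np.array([1, 1, 2, 2, 2, 3, 4, 4, 5, 5])
train_mask, val_mask = grouped_split(hospital_id, val_fraction=0.4, seed=42)

train_hospitals = set(hospital_id[train_mask].tolist())
val_hospitals = set(hospital_id[val_mask].tolist())
print(sorted(train_hospitals), sorted(val_hospitals))
```

Because every stay from a held-out hospital lands in validation, the model cannot benefit from within-hospital regularities seen during training.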
The precision–recall analysis further emphasizes that the system is best interpreted as a screening-oriented phenotyping aid. Tuned LightGBM achieved an average precision of 0.622 under random validation and 0.551 under grouped hospital validation, showing that hospital-level separation increased the false-positive burden. At a grouped-site probability threshold of 0.4537, the model achieved recall of 0.8001 but precision of only 0.4314. This high-sensitivity operating point may be acceptable when the model is used to prompt chart review, support medication reconciliation, enrich research cohorts, or identify patients needing confirmation of diabetes history. However, it would be inappropriate for automated diagnostic labeling, treatment decisions, or interruptive clinical alerts without additional clinical confirmation. The model should therefore be framed as a computable phenotyping aid, not as a stand-alone diagnostic tool.
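One way an operating point such as the 0.4537 threshold could be chosen is to scan candidate thresholds and keep the highest one that still meets a target recall. The sketch below shows that idea on toy data; it is an assumption about the general procedure, not the study's exact selection rule, and `threshold_for_recall` is a hypothetical helper.

```python
import numpy as np


def threshold_for_recall(y_true, y_prob, target_recall):
    """Highest threshold whose recall is at least target_recall.

    Predicts positive when y_prob >= threshold and returns
    (threshold, precision, recall) at that operating point.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    n_pos = int(y_true.sum())
    if n_pos == 0:
        raise ValueError("no positive cases in y_true")
    # Candidate thresholds: each distinct predicted probability, high to low.
    for thr in np.sort(np.unique(y_prob))[::-1]:
        pred = y_prob >= thr
        tp = int(np.sum(pred & (y_true == 1)))
        recall = tp / n_pos
        if recall >= target_recall:
            precision = tp / int(pred.sum())
            return float(thr), precision, recall
    raise ValueError("target recall is not attainable")


# Toy labels and scores, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
thr, precision, recall = threshold_for_recall(y_true, y_prob, target_recall=0.75)
print(thr, precision, recall)
```

Lowering the threshold further would raise recall at the cost of precision, which is the trade-off described above.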
The clinical relevance of this distinction is especially important in critical care. Admission-time diabetes status can help clinicians interpret early hyperglycemia and contextualize glycemic targets, but hyperglycemia in the ICU is etiologically heterogeneous. In a patient with established diabetes, elevated glucose may reflect chronic dysglycemia or baseline metabolic disease. In a patient without known diabetes, the same glucose value may reflect stress hyperglycemia, acute inflammatory and hormonal responses, sepsis physiology, glucocorticoid or vasopressor exposure, enteral or parenteral nutrition, dextrose-containing fluids, or early treatment effects. Thus, although glucose-related variables are informative for identifying documented diabetes status, their dominance means that model predictions may partly encode acute dysglycemia and clinical management patterns rather than chronic diabetes alone. This reinforces the need for cautious use and clinical review.
Calibration results also show why discrimination alone is insufficient. Although tuned LightGBM maintained strong recall and reasonable AUROC under both random and grouped validation, calibration curves and Brier scores showed that predicted probabilities were not fully reliable as absolute risk estimates. In both validation settings, predicted probabilities tended to overestimate the observed frequency of documented diabetes, particularly in the mid-to-high probability range, and the grouped-site Brier score was worse than the random-split Brier score. These findings indicate that the model can rank patients usefully while still producing probabilities that are not deployment-ready. If predicted probabilities are used clinically, post hoc recalibration, prospective monitoring, and periodic site-specific reassessment would be necessary.
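For readers less familiar with these diagnostics, the following sketch computes a Brier score and a simple binned reliability table on toy data. It is a generic illustration under assumed inputs, not the study's evaluation code; the helper names and the five equal-width bins are choices made here.

```python
import numpy as np


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))


def reliability_table(y_true, y_prob, n_bins=5):
    """Per non-empty bin: (mean predicted probability, observed frequency, count)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0..n_bins-1; 1.0 falls in the last bin.
    idx = np.digitize(y_prob, edges[1:-1])
    rows = []
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            rows.append((float(y_prob[in_bin].mean()),
                         float(y_true[in_bin].mean()),
                         int(in_bin.sum())))
    return rows


# Toy outcomes and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.3, 0.3, 0.7, 0.7, 0.9]
brier = brier_score(y_true, y_prob)
rows = reliability_table(y_true, y_prob)
for mean_pred, observed, count in rows:
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={count}")
```

A bin whose mean predicted probability exceeds its observed frequency corresponds to the overestimation pattern described above; in this toy example that happens in the bin around 0.7.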
The SHAP-based interpretability analysis provides additional support for both clinical plausibility and cautious interpretation. In the grouped-site LightGBM model, day-1 maximum glucose was the dominant predictor, followed by BMI, age, day-1 minimum glucose, hemoglobin, urine output, and other physiologic or laboratory variables. These predictors are clinically coherent because glycemic extrema, anthropometrics, age, renal and hematologic markers, and pre-ICU context may all relate to documented diabetes status or to the conditions under which diabetes is recognized and recorded. Importantly, hospital_id and icu_id were absent from the dominant predictor set because they were removed from the model matrix by design, supporting the leakage-mitigation strategy. However, SHAP results also reinforce the same limitation identified by ablation analysis: the dominant glycemic signal is clinically meaningful but not specific to chronic diabetes. Early glucose values may be shaped by both baseline disease and acute ICU physiology.
The SHAP dependence and local waterfall plots further showed that model behavior was nonlinear and context-dependent. The contribution of maximum glucose increased sharply across clinically meaningful glucose ranges and then plateaued at high values. BMI and age showed graded or threshold-like effects, whereas secondary predictors such as minimum glucose, hemoglobin, and urine output showed more complex patterns that may reflect interactions with illness severity, renal function, treatment response, or monitoring intensity. Local explanations demonstrated that individual predictions were rarely determined by one variable alone; rather, strong glycemic signals were modulated by anthropometric, renal, hematologic, and clinical-context variables. This improves auditability and interpretability, but it does not eliminate the need for clinical confirmation when distinguishing chronic diabetes from stress hyperglycemia or treatment-related dysglycemia.
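The additive logic behind a waterfall plot can be sketched in a few lines. The example assumes, as is common for gradient-boosted classifiers, that SHAP values are expressed in log-odds space, so the base value plus the feature contributions gives the model's raw output, which a sigmoid converts to a probability. All numbers below are made up for illustration and do not come from the study's figures; the feature key `bmi` is likewise an assumed name.

```python
import math


def probability_from_shap(base_value, contributions):
    """Sum additive contributions in log-odds space, then apply the sigmoid."""
    log_odds = base_value + sum(contributions.values())
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical waterfall for one ICU stay; every value here is invented.
base_value = -1.4  # assumed average model output in log-odds
contributions = {
    "d1_glucose_max": 1.1,      # large upward shift from high maximum glucose
    "bmi": 0.3,
    "age": 0.2,
    "d1_hemoglobin_max": -0.2,  # smaller downward contribution
}
prob = probability_from_shap(base_value, contributions)
print(round(prob, 3))
```

In this invented case the glucose term accounts for most of the upward movement, while the remaining features shift the score by smaller amounts in either direction.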
The use of model-intrinsic class weighting and threshold-aware evaluation was appropriate for this imbalanced clinical phenotyping task. Rather than generating synthetic minority-class records, class weighting preserved the empirical patient distribution while increasing the training penalty for misclassifying diabetes-positive cases. In this study, positive-class weighting helped LightGBM achieve high sensitivity, while the soft-voting ensemble moderated the false-positive burden by combining sensitivity-oriented and more conservative boosted-tree models. These results support class weighting and explicit threshold selection as transparent, reproducible approaches for imbalanced ICU phenotyping. Nevertheless, threshold selection should be matched to the intended workflow rather than treated as a universal operating point.
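The effect of positive-class weighting on the training objective can be shown with a small worked example. The sketch assumes the commonly used negatives-to-positives ratio as the weight (the heuristic usually suggested for the `scale_pos_weight` parameter in XGBoost and LightGBM); the weight actually used in this study is not stated here, and both helper functions are hypothetical.

```python
import math


def positive_class_weight(labels):
    """Ratio of negatives to positives, a common default positive-class weight."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive cases")
    return n_neg / n_pos


def weighted_log_loss(labels, probs, pos_weight):
    """Mean binary cross-entropy with positive cases up-weighted by pos_weight."""
    total = 0.0
    for y, p in zip(labels, probs):
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(labels)


# Toy cohort: one diabetes-positive stay among five (illustration only).
labels = [1, 0, 0, 0, 0]
probs = [0.5, 0.5, 0.5, 0.5, 0.5]
weight = positive_class_weight(labels)
unweighted = weighted_log_loss(labels, probs, pos_weight=1.0)
weighted = weighted_log_loss(labels, probs, pos_weight=weight)
print(weight, round(unweighted, 4), round(weighted, 4))
```

With the weight applied, the single positive case contributes as much to the loss as the four negatives combined, and no synthetic records are added to the data.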
Several limitations should be considered. First, the outcome label represents documented diabetes status in the source data, not adjudicated chronic diabetes confirmed by longitudinal outpatient records, HbA1c, medication history, or manual chart review. As a result, the model may learn correlates of diabetes documentation as well as correlates of true chronic disease. Second, although grouped hospital validation is stronger than a row-level random split, it remains internal validation within the same public data release. External validation in independent ICU datasets is necessary before general clinical use. Third, the strong influence of glucose-like predictors creates a clinical ambiguity: early glucose values are highly informative for diabetes phenotyping, but they may also reflect stress hyperglycemia, treatment exposure, acute illness severity, or monitoring patterns. Fourth, calibration was diagnostic only; no recalibration model was fitted. Therefore, predicted probabilities should not be interpreted as deployment-ready clinical risk estimates without recalibration and prospective assessment.
Future work should prioritize external validation in independent ICU datasets, prospective evaluation across hospitals and time periods, recalibration when probabilities are used clinically, and threshold selection matched to specific workflow goals. Improved reference labeling is also needed. Ideally, future studies should distinguish documented chronic diabetes from previously undiagnosed diabetes, stress hyperglycemia, medication-related hyperglycemia, nutrition-associated dysglycemia, and treatment-related glucose changes using longitudinal EHR data, HbA1c, outpatient medication history, problem lists, diagnosis codes, and manual adjudication where feasible. Future extensions could also compare the present leakage-aware tree-based benchmark with deep tabular architectures, including TabNet, TabTransformer, FT-Transformer, or other Transformer-based models, under the same grouped validation, calibration, ablation, and interpretability framework.
Overall, this study provides a reproducible and clinically interpretable benchmark for ICU admission-time phenotyping of documented diabetes status in public multi-center EHR data. The findings show that leakage-aware gradient-boosted and ensemble tree models can achieve strong discrimination, that reasonable leakage mitigation and grouped-site validation do not eliminate useful case-finding performance, and that model evaluation should extend beyond AUROC to include precision–recall behavior, calibration, ablation analysis, and interpretability. At the same time, the dominance of glucose-related predictors requires careful clinical interpretation. The proposed system should be used primarily as a screening-oriented phenotyping aid for chart review, cohort enrichment, workflow support, or research classification, not as a stand-alone diagnostic tool. Clinical deployment would require external validation, recalibration, workflow-specific thresholding, and confirmatory clinical review to distinguish chronic diabetes from acute stress physiology and treatment-related dysglycemia.

5. Conclusions

Using the public WiDS Datathon 2021 tabular release derived from the GOSSIS initiative, this study developed and evaluated leakage-aware machine-learning models for identifying documented diabetes status among adult ICU stays at or near admission. The workflow emphasized split-aware preprocessing, removal of explicit site identifiers from the model matrix, prespecified feature-scenario ablations, model-intrinsic class weighting, threshold-aware evaluation, grouped hospital validation, calibration assessment, and SHAP-based interpretability. Together, these design choices provide a reproducible benchmark for ICU admission-time diabetes phenotyping in public multi-center EHR data.
In the primary leakage-mitigated random validation split, gradient-boosted and ensemble tree models achieved strong discrimination. The soft-voting ensemble provided the most balanced operating profile, with AUROC 0.8539, precision 0.5671, recall 0.6690, and F1-score 0.6138. Tuned LightGBM was the most sensitivity-oriented individual model, achieving recall 0.7677 and AUROC 0.8537, although with lower precision and a less favorable Brier score. Grouped hospital validation showed that tuned LightGBM retained case-finding ability at unseen hospitals, with recall remaining stable at 0.7684, while AUROC decreased to 0.8443 and precision declined to 0.4546. Precision–recall analysis further confirmed the intended-use trade-off: a grouped-site threshold of 0.4537 preserved high recall of 0.8001 but reduced precision to 0.4314. These findings indicate that LightGBM is most appropriate for high-sensitivity screening, whereas the ensemble may be preferable when false-positive burden is a major concern.
Ablation and interpretability analyses clarify the source and limitations of model performance. Removing leakage-prone and APACHE-related variables caused only modest performance reductions, supporting the robustness of the leakage-aware pipeline. In contrast, the strict reduced model that also excluded glucose-like predictors produced a marked decline in discrimination, confirming that glucose-related admission variables remained the dominant predictive signal. SHAP analysis was consistent with this finding, identifying day-1 maximum glucose, BMI, age, day-1 minimum glucose, hemoglobin, urine output, and related physiologic or laboratory variables as major contributors. Although these predictors are clinically plausible for identifying documented diabetes status, early glycemic measurements in critically ill patients may also partly capture acute stress physiology, treatment-related effects, monitoring intensity, nutritional support, or other forms of acute dysglycemia rather than chronic diabetes status alone.
Calibration results further emphasize that the model should not be interpreted as a deployment-ready diagnostic instrument. Calibration curves and Brier scores showed that predicted probabilities were not fully reliable as clinical risk estimates without further recalibration. Therefore, the proposed system should be interpreted primarily as a screening-oriented phenotyping aid for chart review, medication reconciliation, cohort enrichment, or workflow support, not as a stand-alone diagnostic tool.
Future work should prioritize external validation in independent ICU datasets, prospective evaluation across hospitals and time periods, workflow-specific threshold selection, post hoc recalibration when probabilities are used clinically, and improved reference labeling that better distinguishes chronic diabetes from stress hyperglycemia and treatment-related dysglycemia. Clear documentation of intended use remains essential for safe and reproducible application in critical care research and decision-support workflows.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable. This study analyzed de-identified, publicly available ICU data distributed under the data use terms of the eICU Collaborative Research Database and the GOSSIS initiative [16].

Informed Consent Statement

Not applicable.

Data Availability Statement

The WiDS Datathon 2021 tabular dataset used in this study is publicly available through Kaggle. The dataset was derived from the Global Open-Source Severity of Illness Score (GOSSIS) initiative and the eICU Collaborative Research Database. Code and analytic artifacts necessary to reproduce the results are publicly available from the author at: https://colab.research.google.com/drive/1UDqeBo6Hv1HrPIGTVYNWCpIoUtvpnLkd?usp=sharing (accessed on 12 March 2026).

Acknowledgments

Lily Popova Zhuhadar is the sole author of this manuscript. The author acknowledges the use of publicly available resources, including the eICU Collaborative Research Database and the GOSSIS-1-eICU dataset, which support reproducible critical care research. The dataset creators are cited in reference [16].

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
Abbreviation: Full term
ADA: American Diabetes Association
AI: Artificial intelligence
AP: Average precision
APACHE: Acute Physiology and Chronic Health Evaluation
AUROC: Area under the receiver operating characteristic curve
BMI: Body mass index
EHR: Electronic health record
eICU-CRD: eICU Collaborative Research Database
eMERGE: Electronic Medical Records and Genomics Network
F1-score: Harmonic mean of precision and recall
FN: False negative
FP: False positive
FPR: False-positive rate
GBDT: Gradient boosting decision tree
GOSSIS: Global Open-Source Severity of Illness Score
HbA1c: Hemoglobin A1c
HGB: Histogram-based gradient boosting
ICU: Intensive care unit
LGBM: Light Gradient Boosting Machine
ML: Machine learning
PaO2/FiO2: Ratio of arterial oxygen partial pressure to fractional inspired oxygen
PheKB: Phenotype KnowledgeBase
PR: Precision–recall
PROBAST: Prediction model Risk Of Bias ASsessment Tool
RF: Random forest
ROC: Receiver operating characteristic
SHAP: SHapley Additive exPlanations
SMOTE: Synthetic Minority Oversampling Technique
TN: True negative
TP: True positive
TPR: True-positive rate
TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis
TRIPOD+AI: TRIPOD extension for prediction models using artificial intelligence
WiDS: Women in Data Science
XGBoost: Extreme Gradient Boosting

References

  1. Finfer, S.; Chittock, D.R.; Su, S.Y.; Blair, D.; Foster, D.; Dhingra, V.; Bellomo, R.; Cook, D.; Dodek, P.; Henderson, W.R.; et al. Intensive versus conventional glucose control in critically ill patients. N. Engl. J. Med. 2009, 360, 1283–1297.
  2. Honarmand, K.; Sirimaturos, M.; Hirshberg, E.L.; Bircher, N.G.; Agus, M.S.; Carpenter, D.L.; Downs, C.R.; Farrington, E.A.; Freire, A.X.; Grow, A. Society of Critical Care Medicine guidelines on glycemic control for critically ill children and adults 2024. Crit. Care Med. 2024, 52, e161–e181.
  3. Evans, L.; Rhodes, A.; Alhazzani, W.; Antonelli, M.; Coopersmith, C.M.; French, C.; Machado, F.R.; McIntyre, L.; Ostergaard, L.; Prescott, H.C.; et al. Surviving Sepsis Campaign: International guidelines for management of sepsis and septic shock 2021. Intensive Care Med. 2021, 47, 1181–1247.
  4. Roberts, G.W.; Quinn, S.J.; Valentine, N.; Alhawassi, T.; O’Dea, H.; Stranks, S.N.; Burt, M.G.; Doogue, M.P. Relative Hyperglycemia, a Marker of Critical Illness: Introducing the Stress Hyperglycemia Ratio. J. Clin. Endocrinol. Metab. 2015, 100, 4490–4497.
  5. Xie, H.; Hao, T.; Qi, R.; Zhang, L.; An, Y.; Jia, D.; Wang, H.; Niu, W.; Han, X.; Sha, Y.; et al. Association between stress hyperglycemia ratio and all-cause mortality among ICU patients with sepsis: A systematic review and meta-analysis. Front. Med. 2025, 12, 1741993.
  6. Xia, W.; Li, C.; Kuang, M.; Wu, Y.; Xu, L.; Hu, H. Predictive value of glycemic gap and stress glycemia ratio among critically ill patients with acute kidney injury: A retrospective analysis of the MIMIC-III database. BMC Nephrol. 2023, 24, 227.
  7. Lu, Z.; Tao, G.; Sun, X.; Zhang, Y.; Jiang, M.; Liu, Y.; Ling, M.; Zhang, J.; Xiao, W.; Hua, T.; et al. Association of Blood Glucose Level and Glycemic Variability With Mortality in Sepsis Patients During ICU Hospitalization. Front. Public Health 2022, 10, 857368.
  8. Kirby, J.C.; Speltz, P.; Rasmussen, L.V.; Basford, M.; Gottesman, O.; Peissig, P.L.; Pacheco, J.A.; Tromp, G.; Pathak, J.; Carrell, D.S.; et al. PheKB: A catalog and workflow for creating electronic phenotype algorithms for transportability. J. Am. Med. Inform. Assoc. 2016, 23, 1046–1052.
  9. Newton, K.M.; Peissig, P.L.; Kho, A.N.; Bielinski, S.J.; Berg, R.L.; Choudhary, V.; Basford, M.; Chute, C.G.; Kullo, I.J.; Li, R.; et al. Validation of electronic medical record-based phenotyping algorithms: Results and lessons learned from the eMERGE network. J. Am. Med. Inform. Assoc. 2013, 20, e147–e154.
  10. Pacheco, J.A.; Rasmussen, L.V.; Kiefer, R.C.; Campion, T.R.; Speltz, P.; Carroll, R.J.; Stallings, S.C.; Mo, H.; Ahuja, M.; Jiang, G.; et al. A case study evaluating the portability of an executable computable phenotype algorithm across multiple institutions and electronic health record environments. J. Am. Med. Inform. Assoc. 2018, 25, 1540–1546.
  11. Shang, N.; Liu, C.; Rasmussen, L.V.; Ta, C.N.; Carroll, R.J.; Benoit, B.; Lingren, T.; Dikilitas, O.; Mentch, F.D.; Carrell, D.S.; et al. Making work visible for electronic phenotype implementation: Lessons learned from the eMERGE network. J. Biomed. Inform. 2019, 99, 103293.
  12. Wei, W.-Q.; Rowley, R.; Wood, A.; MacArthur, J.; Embi, P.J.; Denaxas, S. Improving reporting standards for phenotyping algorithm in biomedical research: 5 fundamental dimensions. J. Am. Med. Inform. Assoc. 2024, 31, 1036–1041.
  13. Wolff, R.F.; Moons, K.G.M.; Riley, R.D.; Whiting, P.F.; Westwood, M.; Collins, G.S.; Reitsma, J.B.; Kleijnen, J.; Mallett, S. PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies. Ann. Intern. Med. 2019, 170, 51–58.
  14. Collins, G.S.; Moons, K.G.M.; Dhiman, P.; Riley, R.D.; Beam, A.L.; Van Calster, B.; Ghassemi, M.; Liu, X.; Reitsma, J.B.; van Smeden, M.; et al. TRIPOD+AI statement: Updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024, 385, e078378.
  15. Raffa, J.D.; Johnson, A.E.; O’Brien, Z.; Pollard, T.J.; Mark, R.G.; Celi, L.A.; Pilcher, D.; Badawi, O. The Global Open Source Severity of Illness Score (GOSSIS). Crit. Care Med. 2022, 50, 1040–1050.
  16. Raffa, J.; Johnson, A.; Pollard, T.; Badawi, O. GOSSIS-1-eICU, the eICU-CRD subset of the Global Open Source Severity of Illness Score (GOSSIS-1) dataset (version 1.0.0). PhysioNet 2022. RRID:SCR_007345.
  17. Haidari, A.; Bahman, N.; Qauomi, N.A.; Eti, M.; Kamalesh, M.D.; Sharma, P.; Oza, A.D. AI-Driven approaches for sustainable urban development: A PRISMA systematic review of machine and deep learning applications in occupant health and facility management. Int. J. Sustain. Build. Technol. Urban Dev. 2025, 16, 461–476.
  18. Sánchez-Gómez, J.S.; Bravo, J.T.; Sobrino, I.C.A. Predicting Diabetes Outcomes in ICU Patients Using AdaBoost Machine Learning. In Proceedings of the 2025 IEEE 4th Colombian BioCAS Workshop, Armenia Quindio, Colombia, 27–29 August 2025; pp. 1–6.
  19. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  20. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  21. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 3149–3157.
  22. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 6639–6649.
  23. Gholampour, S. Impact of Nature of Medical Data on Machine and Deep Learning for Imbalanced Datasets: Clinical Validity of SMOTE Is Questionable. Mach. Learn. Knowl. Extr. 2024, 6, 827–841.
  24. Ozkerim, U.; Isik, D.; Kinikoglu, O.; Oksuz, S.; Altintas, Y.E.; Akdag, G.; Yildirim, S.; Basoglu, T.; Surmeli, H.; Odabas, H.; et al. Development of a Machine Learning-Based Prognostic Model Using Systemic Inflammation Markers in Patients Receiving Nivolumab Immunotherapy: A Real-World Cohort Study. J. Pers. Med. 2026, 16, 8.
  25. Tedesco, S.; Andrulli, M.; Larsson, M.Å.; Kelly, D.; Alamäki, A.; Timmons, S.; Barton, J.; Condell, J.; O’Flynn, B.; Nordström, A. Comparison of Machine Learning Techniques for Mortality Prediction in a Prospective Cohort of Older Adults. Int. J. Environ. Res. Public Health 2021, 18, 12806.
  26. Saito, T.; Rehmsmeier, M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 2015, 10, e0118432.
  27. Van Calster, B.; McLernon, D.J.; van Smeden, M.; Wynants, L.; Steyerberg, E.W. Calibration: The Achilles heel of predictive analytics. BMC Med. 2019, 17, 230.
  28. Amanatidis, A.; Egan, K.; Nio, K.; Toma, M. Data-Leakage-Aware Preoperative Prediction of Postoperative Complications from Structured Data and Preoperative Clinical Notes. Surgeries 2025, 6, 87.
  29. Chiavegatto Filho, A.; Batista, A.F.d.M.; dos Santos, H.G. Data Leakage in Health Outcomes Prediction With Machine Learning. Comment on “Prediction of Incident Hypertension Within the Next Year: Prospective Study Using Statewide Electronic Health Records and Machine Learning”. J. Med. Internet Res. 2021, 23, e10969.
  30. Kapoor, S.; Narayanan, A. Leakage and the Reproducibility Crisis in Machine-Learning-Based Science. Patterns 2023, 4, 100804.
  31. Ramadan, B.; Parker, W.F.; Beaulieu-Jones, B.K. Diagnostic Codes in AI Prediction Models and Label Leakage of Same-Admission Clinical Outcomes. JAMA Netw. Open 2025, 8, e2550463.
  32. Huang, Y.; Li, W.; Macheret, F.; Gabriel, R.A.; Ohno-Machado, L. A Tutorial on Calibration Measurements and Calibration Models for Clinical Prediction Models. J. Am. Med. Inform. Assoc. 2020, 27, 621–633.
  33. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.-I. From Local Explanations to Global Understanding with Explainable AI for Trees. Nat. Mach. Intell. 2020, 2, 56–67.
  34. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
  35. Ponce-Bobadilla, A.V.; Schmitt, V.; Maier, C.S.; Mensing, S.; Stodtmann, S. Practical Guide to SHAP Analysis: Explaining Supervised Machine Learning Model Predictions in Drug Development. Clin. Transl. Sci. 2024, 17, e70056.
  36. Steyerberg, E.W.; Vickers, A.J.; Cook, N.R.; Gerds, T.; Gonen, M.; Obuchowski, N.; Pencina, M.J.; Kattan, M.W. Assessing the Performance of Prediction Models: A Framework for Traditional and Novel Measures. Epidemiology 2010, 21, 128–138.
  37. Thongprayoon, C.; Kaewput, W.; Hansrivijit, P.; Kovvuru, K.; Kanduri, S.R.; Bathini, T.; Cheungpasitporn, W. Explainable Preoperative Automated Machine Learning Prediction Model for Cardiac Surgery-Associated Acute Kidney Injury. J. Clin. Med. 2022, 11, 6264.
  38. Van Calster, B.; Nieboer, D.; Vergouwe, Y.; De Cock, B.; Pencina, M.J.; Steyerberg, E.W. A Calibration Hierarchy for Risk Models Was Defined: From Utopia to Empirical Data. J. Clin. Epidemiol. 2016, 74, 167–176.
  39. Atallah, L.; Klerings, I.; Balzer, F.; Celi, L.A. Machine Learning for Benchmarking Critical Care Outcomes. Crit. Care 2023, 27, 315.
  40. Johnson, A.E.W.; Bulgarelli, L.; Shen, L.; Gayles, A.; Shammout, A.; Horng, S.; Pollard, T.J.; Hao, S.; Moody, B.; Gow, B.; et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 2023, 10, 1.
  41. Pollard, T.J.; Johnson, A.E.; Raffa, J.D.; Celi, L.A.; Mark, R.G.; Badawi, O. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci. Data 2018, 5, 180178.
Figure 1. Panel (A) shows the CatBoost confusion matrix; Panel (B) shows the CatBoost top-15 feature-importance ranking.
Figure 2. Panel (A) shows the random forest confusion matrix; Panel (B) shows the random forest top-15 feature-importance ranking.
Figure 3. Tuned XGBoost results. Panel (A) shows the tuned XGBoost confusion matrix; Panel (B) shows the tuned XGBoost top-15 feature-importance ranking.
Figure 4. Tuned LightGBM results. Panel (A) shows the tuned LightGBM confusion matrix; Panel (B) shows the tuned LightGBM top-15 feature-importance ranking.
Figure 5. Histogram-based gradient boosting results. Panel (A) shows the HGB confusion matrix; Panel (B) shows the HGB top-15 feature-importance ranking.
Figure 6. Voting-ensemble results. Panel (A) shows the voting-ensemble confusion matrix; Panel (B) shows the ensemble top-15 feature-importance ranking.
Figure 7. Model performance across ablation scenarios. Accuracy, precision, recall, F1 score, AUC, and Brier score are shown for each algorithm and scenario.
Figure 8. Tuned LightGBM under grouped site-stratified validation. Panel (A) shows the grouped-site confusion matrix; Panel (B) shows the grouped-site top-15 feature-importance ranking.
Figure 9. Precision–recall curve comparison for tuned LightGBM under random and grouped validation.
Figure 10. LightGBM calibration curves. (A) shows calibration on the random stratified validation split; (B) shows calibration on the grouped hospital validation split.
Figure 11. Global SHAP interpretation for the grouped-site LightGBM model. Panel (A) shows mean absolute SHAP values; Panel (B) shows the SHAP beeswarm summary plot.
Figure 12. SHAP dependence plots for three dominant predictors. Panel (A): d1_glucose_max. Panel (B): BMI. Panel (C): age.
Figure 13. Additional SHAP dependence plots. Panel (A): d1_glucose_min. Panel (B): d1_hemoglobin_max. Panel (C): urineoutput_apache.
Figure 14. Representative local SHAP waterfall plots for two cases selected from the grouped-site validation set.
Figure 15. Additional representative local SHAP waterfall plots for two cases selected from the grouped-site validation set.
Table 1. Preprocessing operations implemented directly.
Component | Rule Implemented in Code
Age restriction | Keep records with age ≥ 16
Split strategy | 80/20 stratified train/validation split
Random seed | random_state = 40
Grouping variables | hospital_id and icu_id preserved for robustness checks but removed from the predictor matrix
Identifiers dropped | Unnamed: 0, encounter_id, readmission_status
Row-consistency correction | Swap any _min and _max pair when _min > _max
Engineered feature | Fill d1_pao2fio2ratio_max using pao2_apache/fio2_apache when eligible
Explicit column removal | All h1_ variables removed
APACHE duplicate handling | Drop near-duplicate APACHE variables when equivalent day-1 variables exist
Numeric imputation | Training-set mean
Categorical imputation | Training-set mode
Anthropometric imputation | Sex-stratified training mean for height, weight, and BMI, then global training mean fallback
Categorical encoding | One-hot encoding using training categories only
Correlation pruning | Drop columns with absolute Pearson correlation > 0.80, determined from the training matrix only
Final column handling | Sanitize feature names and align validation/test matrices to training columns
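Two of the rules in Table 1 can be illustrated in isolation. The following is a minimal, dependency-free Python sketch, not the study's code: it shows the _min/_max row-consistency swap and correlation pruning computed on training data only. The function names, the dictionary-based data layout, and the choice to keep the earlier column of a highly correlated pair are assumptions, because the table does not state which member of a pair is dropped.

```python
from math import sqrt


def swap_min_max(row, suffix_min="_min", suffix_max="_max"):
    """Return a copy of `row` in which every *_min/*_max pair is swapped when min > max."""
    fixed = dict(row)
    for name, low in row.items():
        if not name.endswith(suffix_min):
            continue
        partner = name[: -len(suffix_min)] + suffix_max
        high = row.get(partner)
        if low is None or high is None:
            continue  # missing values are left for the imputation step
        if low > high:
            fixed[name], fixed[partner] = high, low
    return fixed


def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 for a constant column."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_columns_to_drop(train_columns, threshold=0.80):
    """Greedy pruning on TRAINING columns only (assumed rule: keep the earlier column).

    `train_columns` maps column name -> list of numeric training values.
    """
    kept, dropped = [], []
    for name, values in train_columns.items():
        if any(abs(pearson(values, train_columns[k])) > threshold for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return dropped


# Hypothetical toy data for illustration only.
row = {"d1_glucose_min": 180, "d1_glucose_max": 95, "age": 60}
train = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.1], "c": [4, 1, 3, 2]}
fixed_row = swap_min_max(row)
to_drop = correlated_columns_to_drop(train)
```

The list returned by `correlated_columns_to_drop` would then be removed from the training, validation, and test matrices alike, so that the pruning decision never depends on validation data.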
Table 2. Scenario-dependent exclusion rules applied after preprocessing but before model fitting.
Scenario | Feature-Exclusion Rule
Full feature set | Basic cleaning only; group identifiers removed from predictor matrix
Leakage-mitigated original | Remove obvious leakage-prone comorbidity and diagnosis fields; retain other APACHE variables
Exclude all APACHE | Remove all APACHE-related variables in addition to the leakage-mitigated exclusions
Strict reduced model | Remove APACHE-related variables, group identifiers, and glucose-like predictors
Table 3. Model-development configuration.
Model | Role in Analysis | Key Code-Level Configuration
CatBoost | Boosting baseline | 100 estimators; random_state = 40; verbose output suppressed
Random Forest | Bagging baseline | 100 trees; random_state = 40; parallel processing enabled
XGBoost (Tuned) | Tuned gradient-boosted tree model | 200 estimators; max_depth = 8; learning_rate = 0.05; subsample = 0.8; min_child_weight = 10; colsample_bytree = 0.8; eval_metric = logloss
LightGBM (Tuned) | Sensitivity-oriented tuned model | GBDT binary objective; AUC metric; learning_rate = 0.007; colsample_bytree = 0.2; reg_alpha = 3; reg_lambda = 1; scale_pos_weight = 4; n_estimators = 10,000
HistGradientBoosting | Additional tree-boosting comparator | scikit-learn histogram-based gradient boosting; random_state = 40
Soft-voting ensemble | Combined probability model | XGBoost, LightGBM, and HGB combined with soft voting and weights 1:3:1
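Soft voting with weights 1:3:1 amounts to a weighted average of the positive-class probabilities from the three base models. The sketch below illustrates that arithmetic in plain Python under stated assumptions: the weights are read as XGBoost:LightGBM:HGB in the order listed in Table 3, the model keys and probability values are hypothetical, and the 0.5 cut-off is the conventional default rather than a value taken from the study's code.

```python
def soft_vote(prob_by_model, weights):
    """Weighted average of positive-class probabilities across models."""
    lengths = {len(p) for p in prob_by_model.values()}
    if len(lengths) != 1:
        raise ValueError("all models must score the same number of records")
    total_weight = sum(weights[name] for name in prob_by_model)
    n_records = lengths.pop()
    return [
        sum(weights[name] * probs[i] for name, probs in prob_by_model.items())
        / total_weight
        for i in range(n_records)
    ]


# Hypothetical probabilities for two ICU stays.
probabilities = {
    "xgboost": [0.2, 0.9],
    "lightgbm": [0.6, 0.7],
    "hgb": [0.3, 0.8],
}
weights = {"xgboost": 1, "lightgbm": 3, "hgb": 1}  # assumed mapping of 1:3:1

ensemble_prob = soft_vote(probabilities, weights)
ensemble_label = [int(p >= 0.5) for p in ensemble_prob]
```

Because LightGBM carries three of the five weight units, the ensemble probability stays close to the LightGBM output unless both other models disagree strongly with it.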
Table 4. Evaluation, ablation, and grouped-validation workflow.
Evaluation Component | Code-Level Implementation | Purpose in Analysis
Primary feature scenario | leakage_mitigated_original with 80/20 stratified split | Main benchmark after removing obvious leakage-prone fields
Model set | CatBoost, random forest, tuned XGBoost, tuned LightGBM, HGB, and soft-voting ensemble | Compare baseline, boosted, and ensemble classifiers under identical feature processing
Fixed-threshold metrics | Accuracy, precision, recall, and F1-score from predicted labels | Summarize discrete classification performance at the default operating point
Probability metrics | AUROC and Brier score from predicted probabilities | Measure ranking discrimination and probability accuracy
Precision-recall analysis | precision_recall_curve and average_precision_score | Characterize minority-class behavior under class imbalance
Target-recall threshold | Largest threshold satisfying Recall(t) ≥ 0.80 | Identify a high-sensitivity operating point for screening-oriented use
Ablation analysis | Repeat full model set across four prespecified feature scenarios | Assess dependence on leakage-prone, APACHE-related, site, and glucose-like predictors
Grouped validation | GroupShuffleSplit by hospital_id with test_size = 0.20 and random_state = 40 | Estimate generalizability to hospitals not observed during training
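The target-recall rule in Table 4 selects the largest threshold t for which Recall(t) ≥ 0.80. A minimal sketch of that selection is given below, assuming a record is predicted positive when its score is at least t; the function name and the toy labels and scores are hypothetical, and the quadratic scan favors clarity over speed.

```python
def largest_threshold_for_recall(y_true, y_score, target_recall=0.80):
    """Return the largest score threshold t with recall(t) >= target_recall.

    A record is called positive when score >= t, so recall can only fall
    as t rises; scanning candidate thresholds from high to low and
    stopping at the first one that meets the target gives the largest t.
    """
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("recall is undefined without positive records")
    for t in sorted(set(y_score), reverse=True):
        true_pos = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        if true_pos / positives >= target_recall:
            return t
    return min(y_score)  # only reached if target_recall exceeds 1


# Hypothetical validation labels and predicted probabilities.
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.4, 0.2, 0.7, 0.3, 0.1]

threshold = largest_threshold_for_recall(y_true, y_score)
predicted = [int(s >= threshold) for s in y_score]
recall = sum(p for p, y in zip(predicted, y_true) if y == 1) / sum(y_true)
precision = sum(p for p, y in zip(predicted, y_true) if y == 1) / sum(predicted)
```

Choosing the largest qualifying threshold keeps as few records flagged as possible while still meeting the sensitivity target, which limits the precision cost of a screening-oriented operating point.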
Table 5. Calibration and SHAP-based interpretability workflow.
Component | Code-Level Implementation | Purpose in Analysis
Calibration target | Tuned LightGBM evaluated under both random stratified validation and grouped site validation | Assess probability reliability beyond threshold-independent discrimination
Calibration curve | calibration_curve with n_bins = 10 and strategy = 'uniform' | Compare observed diabetes frequency with mean predicted probability across risk bins
Calibration reference | Diagonal line representing observed fraction positive = mean predicted probability | Identify systematic probability overestimation or underestimation
Probability score | Brier score reported for random and grouped LightGBM probability outputs | Summarize probability accuracy as a lower-is-better proper scoring rule
Interpretability model | Grouped-site LightGBM model | Explain the model evaluated under the stricter hospital-separated design
SHAP explainer | shap.TreeExplainer applied to the grouped-site LightGBM model | Decompose predictions into additive feature-level contributions
SHAP sample | Grouped-validation records, sampled to at most 3000 rows with RANDOM_STATE if needed | Limit computational burden while keeping the explanation sample reproducible
Global outputs | Mean absolute SHAP bar plot and SHAP beeswarm summary plot | Rank influential predictors and show directional contribution patterns
Dependence plots | Top six predictors ranked by mean absolute SHAP value | Assess nonlinear, threshold-like, or saturated predictor effects
Local outputs | Waterfall plots for representative TP, FP, TN, and FN cases | Explain individual correct and incorrect predictions
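The two probability diagnostics in Table 5 reduce to simple arithmetic. The sketch below computes a Brier score and a ten-bin uniform-width reliability summary in plain Python. It is an illustration, not the scikit-learn implementation: the function names and toy data are hypothetical, and the handling of probabilities that fall exactly on a bin edge may differ from calibration_curve.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def uniform_calibration_bins(y_true, y_prob, n_bins=10):
    """Summarize calibration over equal-width probability bins on [0, 1].

    Returns one (mean_predicted, observed_fraction, count) tuple per
    non-empty bin, ordered from the lowest to the highest bin.
    """
    counts = [0] * n_bins
    prob_sums = [0.0] * n_bins
    positive_sums = [0] * n_bins
    for y, p in zip(y_true, y_prob):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[b] += 1
        prob_sums[b] += p
        positive_sums[b] += y
    return [
        (prob_sums[b] / counts[b], positive_sums[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


# Hypothetical outcomes and predicted probabilities.
y_true = [0, 1, 1, 0]
y_prob = [0.12, 0.18, 0.83, 0.87]

score = brier_score(y_true, y_prob)
bins = uniform_calibration_bins(y_true, y_prob)
```

A bin whose observed fraction lies below its mean predicted probability falls under the diagonal reference line and indicates overestimated risk in that range; the reverse indicates underestimation.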
Table 6. Cohort retention and validation-design summary.
Component | Count | Percentage/Interpretation
Labeled WiDS 2021 file | 130,157 ICU stays | Initial labeled source cohort used for model development and validation
Unlabeled challenge file | 10,234 ICU stays | Held out from all validation-performance estimation and threshold selection
Excluded by adult restriction | 5018 ICU stays | 3.9% of labeled cohort excluded because age < 16 years or did not meet the adult eligibility rule
Adult analytic cohort | 125,139 ICU stays | 96.1% of labeled cohort retained after age restriction
Primary random split | 100,111 training/25,028 validation records | 80.0%/20.0% row-level stratified split of the adult analytic cohort
Grouped site split | 95,233 training/29,906 validation records | 76.1%/23.9% row allocation; proportions differ from 80/20 because hospitals, not individual rows, were split
Grouped hospitals | 163 training/41 validation hospitals | 79.9%/20.1% hospital allocation with no hospital overlap between training and validation folds
Note. Percentages are rounded to one decimal place.
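The counts and percentages in Table 6 can be cross-checked with a few lines of arithmetic. The snippet below uses only the figures reported in the table; the variable names and the `pct` helper are introduced here for illustration.

```python
def pct(part, whole):
    """Percentage rounded to one decimal place, as in Table 6."""
    return round(100 * part / whole, 1)


labeled = 130_157           # labeled WiDS 2021 ICU stays
excluded = 5_018            # removed by the adult restriction
adult = labeled - excluded  # adult analytic cohort

random_train, random_valid = 100_111, 25_028
grouped_train, grouped_valid = 95_233, 29_906
hospitals_train, hospitals_valid = 163, 41

summary = {
    "excluded_pct": pct(excluded, labeled),
    "retained_pct": pct(adult, labeled),
    "random_split_pct": (pct(random_train, adult), pct(random_valid, adult)),
    "grouped_row_pct": (pct(grouped_train, adult), pct(grouped_valid, adult)),
    "grouped_hospital_pct": (
        pct(hospitals_train, hospitals_train + hospitals_valid),
        pct(hospitals_valid, hospitals_train + hospitals_valid),
    ),
}
```

Both splits partition the same 125,139 adult stays; only the grouped design departs from an 80/20 row allocation, because whole hospitals of unequal size are assigned to one side.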
Table 7. Primary model metrics in the leakage-mitigated feature scenario.
Model | Accuracy | Precision | Recall | F1 Score | AUC | Brier Score
Voting Ensemble | 0.8151 | 0.5671 | 0.6690 | 0.6138 | 0.8539 | 0.1300
LightGBM (Tuned) | 0.7815 | 0.5017 | 0.7677 | 0.6069 | 0.8537 | 0.1508
XGBoost (Tuned) | 0.8288 | 0.6557 | 0.4642 | 0.5436 | 0.8505 | 0.1197
HistGradientBoosting | 0.8251 | 0.6445 | 0.4543 | 0.5330 | 0.8458 | 0.1214
CatBoost | 0.8227 | 0.6317 | 0.4629 | 0.5343 | 0.8403 | 0.1236
Random Forest | 0.8154 | 0.6645 | 0.3225 | 0.4342 | 0.8252 | 0.1286
Note. Brier score is a lower-is-better measure of probability accuracy. All other metrics are higher-is-better.
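Because F1 is the harmonic mean of precision and recall, the F1 column in Table 7 can be recomputed from the two columns beside it. The short check below does this for the six reported rows; the dictionary layout and function name are illustrative, and small differences in the fourth decimal place are expected because the published precision and recall are already rounded.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as published in Table 7.
table7 = {
    "Voting Ensemble": (0.5671, 0.6690, 0.6138),
    "LightGBM (Tuned)": (0.5017, 0.7677, 0.6069),
    "XGBoost (Tuned)": (0.6557, 0.4642, 0.5436),
    "HistGradientBoosting": (0.6445, 0.4543, 0.5330),
    "CatBoost": (0.6317, 0.4629, 0.5343),
    "Random Forest": (0.6645, 0.3225, 0.4342),
}

recomputed = {
    model: f1_from_precision_recall(p, r) for model, (p, r, _) in table7.items()
}
max_gap = max(abs(recomputed[m] - table7[m][2]) for m in table7)
```

The comparison also makes the trade-off visible: random forest has the highest precision in the table but the lowest F1, since the harmonic mean is pulled toward its low recall.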
Table 8. Ablation metrics across the four feature scenarios.
Scenario | Model | Accuracy | Precision | Recall | F1 Score | AUC | Brier
Full feature set | CatBoost | 0.8250 | 0.6371 | 0.4720 | 0.5423 | 0.8435 | 0.1227
Full feature set | Random Forest | 0.8178 | 0.6784 | 0.3247 | 0.4392 | 0.8270 | 0.1280
Full feature set | XGBoost (Tuned) | 0.8300 | 0.6605 | 0.4653 | 0.5459 | 0.8525 | 0.1189
Full feature set | LightGBM (Tuned) | 0.7842 | 0.5058 | 0.7672 | 0.6097 | 0.8565 | 0.1491
Full feature set | HistGradientBoosting | 0.8270 | 0.6496 | 0.4616 | 0.5397 | 0.8486 | 0.1203
Full feature set | Voting Ensemble | 0.8159 | 0.5685 | 0.6721 | 0.6159 | 0.8566 | 0.1286
Leakage-mitigated original | CatBoost | 0.8227 | 0.6317 | 0.4629 | 0.5343 | 0.8403 | 0.1236
Leakage-mitigated original | Random Forest | 0.8154 | 0.6645 | 0.3225 | 0.4342 | 0.8252 | 0.1286
Leakage-mitigated original | XGBoost (Tuned) | 0.8288 | 0.6557 | 0.4642 | 0.5436 | 0.8505 | 0.1197
Leakage-mitigated original | LightGBM (Tuned) | 0.7815 | 0.5017 | 0.7677 | 0.6069 | 0.8537 | 0.1508
Leakage-mitigated original | HistGradientBoosting | 0.8251 | 0.6445 | 0.4543 | 0.5330 | 0.8458 | 0.1214
Leakage-mitigated original | Voting Ensemble | 0.8151 | 0.5671 | 0.6690 | 0.6138 | 0.8539 | 0.1300
Exclude all APACHE | CatBoost | 0.8228 | 0.6335 | 0.4591 | 0.5324 | 0.8377 | 0.1247
Exclude all APACHE | Random Forest | 0.8163 | 0.6616 | 0.3354 | 0.4451 | 0.8239 | 0.1288
Exclude all APACHE | XGBoost (Tuned) | 0.8256 | 0.6447 | 0.4587 | 0.5360 | 0.8478 | 0.1210
Exclude all APACHE | LightGBM (Tuned) | 0.7788 | 0.4978 | 0.7634 | 0.6026 | 0.8508 | 0.1529
Exclude all APACHE | HistGradientBoosting | 0.8232 | 0.6377 | 0.4523 | 0.5293 | 0.8438 | 0.1224
Exclude all APACHE | Voting Ensemble | 0.8111 | 0.5587 | 0.6661 | 0.6077 | 0.8514 | 0.1314
Strict reduced model | CatBoost | 0.7830 | 0.5179 | 0.1737 | 0.2601 | 0.7252 | 0.1533
Strict reduced model | Random Forest | 0.7831 | 0.5748 | 0.0489 | 0.0902 | 0.7036 | 0.1558
Strict reduced model | XGBoost (Tuned) | 0.7911 | 0.5999 | 0.1470 | 0.2361 | 0.7402 | 0.1491
Strict reduced model | LightGBM (Tuned) | 0.6868 | 0.3786 | 0.6641 | 0.4823 | 0.7432 | 0.1986
Strict reduced model | HistGradientBoosting | 0.7896 | 0.5998 | 0.1268 | 0.2093 | 0.7362 | 0.1498
Strict reduced model | Voting Ensemble | 0.7634 | 0.4597 | 0.4394 | 0.4493 | 0.7448 | 0.1656
Note. Brier score is lower-is-better. The strict reduced model excluded glucose-like predictors in addition to APACHE-related variables and site descriptors.
Table 9. Tuned LightGBM performance under random stratified versus grouped site-stratified validation.
Validation Strategy | Accuracy | Precision | Recall | F1 Score | AUC | Brier Score
Random Stratified Split | 0.7815 | 0.5017 | 0.7677 | 0.6069 | 0.8537 | 0.1508
Grouped Site Split | 0.7727 | 0.4546 | 0.7684 | 0.5712 | 0.8443 | 0.1596
Note. The grouped split used hospital_id so that no hospital appeared in both training and validation folds.
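The property that matters in a grouped split is that every hospital falls entirely on one side. The sketch below shows the idea in plain Python under stated assumptions: it shuffles the distinct hospital identifiers and assigns roughly 20% of hospitals to validation. It is a stand-in for illustration, not a reimplementation of scikit-learn's GroupShuffleSplit; the function name and the synthetic hospital labels are hypothetical, and the same seed will not reproduce the study's exact partition.

```python
import random


def grouped_split(group_ids, test_fraction=0.20, seed=40):
    """Split row indices so that no group appears in both folds."""
    groups = sorted(set(group_ids))  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(test_fraction * len(groups)))
    test_groups = set(groups[:n_test])
    train_idx = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train_idx, test_idx


# Synthetic example: 50 rows spread over 10 hypothetical hospitals.
hospital_id = ["h%d" % (i % 10) for i in range(50)]
train_idx, test_idx = grouped_split(hospital_id)

train_hospitals = {hospital_id[i] for i in train_idx}
test_hospitals = {hospital_id[i] for i in test_idx}
```

Since hospitals contribute different numbers of stays, fixing the share of hospitals does not fix the share of rows, which is why the grouped row allocation in Table 6 is 76.1%/23.9% rather than 80/20.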
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Popova Zhuhadar, L. Identifying Pre-Existing Diabetes at ICU Admission with Machine Learning on Public GOSSIS Data. Diabetology 2026, 7, 100. https://doi.org/10.3390/diabetology7050100