1. Introduction
Diabetes mellitus and dysglycemia are highly prevalent in critical care, where acute illness frequently disrupts glucose homeostasis through counter-regulatory hormone release, inflammatory signaling, medication exposure, nutritional support, and other iatrogenic factors. Both hyperglycemia and hypoglycemia, as well as increased glycemic variability, have been associated with adverse outcomes in critically ill populations, including infection, longer ICU stay, and increased mortality risk.
Contemporary evidence has shifted ICU glycemic management away from intensive glucose-normalization strategies after large multicenter randomized trials demonstrated net harm, including increased mortality and hypoglycemia risk, under intensive glucose targets compared with more moderate control [1]. Reflecting this evidence base, major consensus recommendations for most critically ill adults favor initiating insulin for persistent hyperglycemia, commonly at or above approximately 180 mg/dL, and then targeting a moderate glucose range of approximately 140–180 mg/dL rather than pursuing tighter normalization [2,3].
Accurate recognition of diabetes status at or near ICU admission is clinically meaningful because early hyperglycemia is etiologically heterogeneous. Elevated glucose may reflect documented pre-existing diabetes, previously unrecognized chronic dysglycemia, stress hyperglycemia caused by acute physiologic stress, medication-related hyperglycemia, nutrition-associated hyperglycemia, or treatment effects during the early ICU course. This distinction influences how clinicians interpret admission glucose values, balance insulin titration against hypoglycemia risk, and decide whether elevated inpatient glucose should trigger outpatient diabetes evaluation after discharge. In the present study, the target label is therefore interpreted as documented diabetes status available in the source records, not as de novo diagnosis of diabetes during ICU stay, and not as direct prediction of stress hyperglycemia or incident diabetes.
Increasing attention has been directed toward contextualizing acute hyperglycemia relative to baseline glycemic status. Measures such as the stress hyperglycemia ratio, commonly defined using admission glucose divided by an HbA1c-derived estimated average glucose, have been proposed as pragmatic markers of relative hyperglycemia and have been associated with illness severity and mortality across cohorts [4,5].
Related constructs, including the glycemic gap, have also been explored for risk stratification and outcome prediction in ICU populations [6]. Importantly, observational work suggests that the prognostic meaning of hyperglycemia and the optimal glucose exposure range may differ between patients with and without diabetes, reinforcing the importance of identifying baseline diabetes status when studying ICU glycemic control and safety tradeoffs [7].
Despite its importance, diabetes status at ICU admission is often incomplete, delayed, or fragmented across heterogeneous electronic health record (EHR) sources. Relevant evidence may be distributed across problem lists, billing or diagnosis codes, prior encounter histories, medication records, laboratory archives, and transfer documentation.
These sources may be unavailable when patients arrive from another facility, when longitudinal history is not accessible, or when documentation practices differ across hospitals. Under these real-world constraints, clinicians and researchers often need a computable method to infer likely documented diabetes status using structured variables available at or near admission, while acknowledging uncertainty from incomplete capture and heterogeneous documentation.
Computable phenotyping operationalizes clinical definitions using EHR data through rule-based algorithms, statistical models, or machine-learning approaches. Large collaborative efforts have disseminated phenotype definitions and workflows, including PheKB and eMERGE-derived algorithms, demonstrating the feasibility of diabetes identification using combinations of diagnosis codes, medications, and laboratory criteria [8,9]. However, phenotype portability across institutions remains challenging because of differences in coding practices, data models, missingness patterns, laboratory availability, and clinical workflows. Even when the same clinical definition is intended, implementation burden and site-specific customization can be substantial [10,11].
These challenges have motivated calls for clearer reporting of phenotype complexity, implementation requirements, validation design, and maintenance considerations so that end users can judge whether an algorithm is fit for purpose in their target setting and timeframe [12]. In parallel, reporting and appraisal frameworks for prediction modeling and artificial intelligence methods emphasize transparency in dataset construction, label definition, preprocessing order, feature selection, validation strategy, calibration, interpretability, and intended use case [13,14].
These principles are particularly important for ICU phenotyping, where apparent model performance can be inflated by leakage-prone variables, site identifiers, duplicated preprocessing across train and validation data, or validation designs that allow patients from the same hospital to appear in both training and testing folds. A clinically useful diabetes phenotype should therefore be evaluated not only by discrimination, but also by threshold-dependent performance, probability calibration, robustness to feature exclusions, and transportability across hospitals.
Public critical-care datasets have created important opportunities to develop reproducible phenotyping and prediction pipelines at scale. The WiDS Datathon 2021 tabular release, derived from the Global Open-Source Severity of Illness Score (GOSSIS) initiative, provides a large public multi-center ICU dataset suitable for benchmarking structured EHR machine-learning workflows [15,16]. In this study, we use the labeled WiDS 2021 training file to evaluate admission-time identification of documented diabetes status among adult ICU stays. The separate unlabeled challenge file is not used to fit preprocessing parameters, select thresholds, or estimate validation performance.
Prior work has demonstrated the broad utility of machine-learning and deep learning approaches for diabetes prediction, clinical disease classification, ICU outcome modeling, and healthcare operations. Recent reviews have highlighted the use of tree-based learners, neural networks, recurrent architectures, convolutional models, and Transformers across EHR, sensor, wearable, administrative, and operational datasets, while also emphasizing persistent limitations related to fragmented data sources, bias, privacy, validation, and reproducibility [17]. In contrast to these broader applications, the present study focuses specifically on admission-time phenotyping of documented diabetes status in a public multi-center ICU cohort. Rather than optimizing only discrimination, we evaluate leakage mitigation, feature-scenario robustness, grouped hospital transportability, precision–recall behavior under class imbalance, probability calibration, and SHAP-based interpretability.
A closely related GOSSIS-based study by Sánchez-Gómez et al. [18] applied an AdaBoost ensemble classifier to diabetes-related prediction in ICU patients using more than 90 structured clinical features, including demographic variables, vital signs, laboratory values, and comorbidities. Their model reported an AUROC of 0.83 and accuracy of 83.28%, suggesting good overall discrimination, but the threshold-dependent results showed an important imbalance between specificity and sensitivity, with high specificity of 93.95% and lower recall of 41.95%. The authors identified glucose, body mass index, age, creatinine, and bicarbonate as important predictors, supporting the clinical relevance of metabolic, renal, and demographic features in ICU diabetes-related modeling [18]. Their findings are consistent with the present study in showing that structured ICU variables can support machine-learning-based diabetes classification in public critical-care data. However, our study differs in objective and evaluation scope: rather than focusing on a single AdaBoost model, we benchmark multiple tree-based classifiers for admission-time phenotyping of documented diabetes status, evaluate leakage-mitigated feature scenarios, assess grouped hospital transportability, examine precision–recall behavior under class imbalance, evaluate probability calibration, and use SHAP-based interpretation to characterize model behavior.
We benchmarked several tree-based machine-learning classifiers selected for their suitability to structured tabular clinical data with nonlinear associations, mixed feature types, and clinically meaningful missingness patterns. The evaluated models included CatBoost, random forest, tuned XGBoost, tuned LightGBM, histogram-based gradient boosting, and a soft-voting ensemble combining tuned XGBoost, tuned LightGBM, and histogram-based gradient boosting [19,20,21,22].
The workflow emphasized split-aware preprocessing, including training-derived imputation, one-hot encoding based only on training categories, correlation pruning based only on the training matrix, removal of high-frequency one-hour variables, removal of explicit site identifiers from the model matrix, and harmonization of admission-source variables.
Because class imbalance is a central concern in diabetes phenotyping, the final workflow emphasized model-intrinsic class weighting and threshold-aware evaluation rather than synthetic oversampling. This choice is consistent with recent concerns that SMOTE-generated synthetic medical records may introduce clinically questionable examples and distort model interpretation, and with recent clinical machine-learning approaches that combine class weighting with threshold optimization to improve recall while limiting artifacts from oversampling [23].
Model performance was assessed using complementary fixed-threshold metrics, including accuracy, precision, recall, and F1-score; probability-based metrics, including AUROC and the Brier score; and precision–recall analysis, which is especially informative for evaluating binary classifiers under class imbalance [24,25,26,27]. To evaluate robustness and possible leakage dependence, the full modeling framework was repeated across four prespecified feature scenarios: full feature set, leakage-mitigated original, exclude all APACHE, and strict reduced model. This scenario-based ablation design was used to determine whether model performance depended on leakage-prone predictors, APACHE-derived variables, site-related structure, or glucose-like proxies, consistent with recommendations to detect and mitigate data leakage, assess predictor-related risk of bias, and transparently report validation and feature-handling decisions in clinical machine-learning prediction studies [13,28,29,30,31]. To estimate transportability to unseen clinical environments, we also performed grouped site-stratified validation using hospital identifiers so that no hospital appeared in both training and validation folds. Finally, calibration plots and Brier scores were used to assess whether the predicted probabilities were reliable as risk estimates, while SHAP-based interpretability was used to examine whether model behavior was clinically coherent, transparent, and explainable at both global and patient-specific levels [32,33,34,35,36,37,38].
Accordingly, the objective of this study was to develop and evaluate a leakage-aware, reproducible machine-learning benchmark for ICU admission-time phenotyping of documented diabetes status using the public WiDS 2021/GOSSIS release. By combining model benchmarking with ablation analysis, grouped hospital validation, calibration assessment, and SHAP-based interpretation, this work aims to clarify the performance, robustness, and implementation tradeoffs of machine-learning-based diabetes identification in public multi-center critical-care EHR data, consistent with recommended practices for benchmarking critical-care models, preventing leakage, transparently reporting clinical prediction models, assessing probability reliability, and explaining tree-based model behavior [14,30,33,36,39,40,41].
2. Materials and Methods
2.1. Experimental Setup and Environment
All computational analyses and modeling were conducted using Python 3.9.6. Data manipulation, cleaning, and structuring were performed using the pandas and numpy libraries. The core machine-learning pipeline was built with scikit-learn, which provided the modules for data splitting, missing-value imputation, and performance metrics. For predictive modeling, we used the gradient-boosting frameworks CatBoost, LightGBM, and XGBoost, alongside standard random forest and histogram-based gradient boosting classifiers. Model interpretability and feature-attribution analyses used the SHAP (SHapley Additive exPlanations) library, and interactive and static visualizations were rendered using plotly and matplotlib.
2.2. Data Source and Study Design
We conducted a retrospective machine-learning benchmarking study using the public WiDS Datathon 2021 tabular release derived from MIT’s Global Open-Source Severity of Illness Score (GOSSIS) initiative. The raw files were loaded with the pandas library. The release comprises a labeled training cohort of 130,157 patient records with 181 features and an unlabeled testing cohort of 10,234 records with 180 features. This ingestion step was used for an initial structural inspection of the data and served as the foundation for all subsequent preprocessing and predictive modeling [15,16].
2.3. Outcome Definition
The primary outcome was a binary indicator of pre-existing diabetes mellitus documented at or prior to ICU admission. This label reflects historical diabetes status as represented in the source database (e.g., comorbidity fields and/or admission documentation) rather than a de novo diagnosis made during the ICU stay. Accordingly, the modeling task is framed as phenotyping/identification of an existing condition, not detection of stress hyperglycemia or prediction of incident diabetes.
2.4. Candidate Predictors and Feature-Scenario Design
Candidate predictors were drawn from the public WiDS 2021 release and included demographic variables, admission characteristics, anthropometrics, day-1 physiologic and laboratory summaries, APACHE-related variables, comorbidity indicators, and site descriptors. The analysis was designed to address indirect leakage and site memorization by evaluating four prespecified feature scenarios:
Full feature set: routine cleaning only, with grouping identifiers removed from the model matrix.
Leakage-mitigated original: removal of variables most likely to encode label-adjacent information, including APACHE diagnosis fields and selected chronic comorbidity indicators, while retaining other clinically relevant predictors.
Exclude all APACHE: removal of all APACHE-related variables in addition to the leakage-mitigated exclusions.
Strict reduced model: a conservative predictor set that further excluded APACHE-related variables, site identifiers, and glucose-like predictors.
2.5. Data Preprocessing and Feature Engineering
To prepare the raw WiDS 2021/GOSSIS files for predictive modeling, we implemented a split-aware preprocessing and feature-engineering pipeline in which all data-driven transformations were learned only from the training portion of the labeled dataset.
The unlabeled challenge file was not used to fit imputers, define categorical encoding levels, derive engineered variables, or determine correlation-pruning rules. This prevented information from validation or challenge records from influencing the preprocessing parameters used during model development.
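Formally, after initial row filtering, the adult labeled cohort was denoted by $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i$ is the predictor vector for ICU stay $i$ and $y_i \in \{0, 1\}$ is the binary diabetes label. The labeled data were partitioned into an 80/20 stratified random split using a fixed seed of 40:
$$\mathcal{D} = \mathcal{D}_{\mathrm{train}} \cup \mathcal{D}_{\mathrm{val}}, \qquad \mathcal{D}_{\mathrm{train}} \cap \mathcal{D}_{\mathrm{val}} = \varnothing, \qquad |\mathcal{D}_{\mathrm{train}}| : |\mathcal{D}_{\mathrm{val}}| \approx 80 : 20$$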
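The split described above can be sketched with scikit-learn's `train_test_split`. This is a minimal illustration on synthetic data, not the study code: the array shapes, the roughly 20% positive rate, and the variable names are assumptions, while the 80/20 ratio, stratification, and seed of 40 come from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 40  # fixed seed reported in the text

# Synthetic stand-in for the labeled cohort (assumed ~20% positives).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.2).astype(int)

# 80/20 split that preserves the class ratio in both partitions.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=SEED
)

print(len(X_train), len(X_val))
print(round(y_train.mean(), 3), round(y_val.mean(), 3))
```

Stratifying on the label keeps the diabetes prevalence nearly identical in the two partitions, which matters when precision and recall are later compared against a prevalence baseline.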
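All preprocessing parameters were then estimated from the training split alone and applied without refitting to the validation and unlabeled challenge sets:
$$\hat{\theta} = \operatorname{fit}(\mathcal{D}_{\mathrm{train}}), \qquad \tilde{\mathbf{x}}_i = g(\mathbf{x}_i; \hat{\theta}) \quad \text{for every record } i \text{ in } \mathcal{D}_{\mathrm{train}},\ \mathcal{D}_{\mathrm{val}},\ \text{and } \mathcal{D}_{\mathrm{challenge}}$$
where $\hat{\theta}$ collects the training-derived preprocessing parameters and $g$ is the resulting transformation.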
The first row-level transformation restricted the analysis to adult ICU stays aged 16 years or older:
$$\mathcal{D} = \{(\mathbf{x}_i, y_i) : \mathrm{age}_i \ge 16\}$$
Schema-level harmonization then standardized admission-source variables. Multiple hospital and ICU admission categories were recoded to a smaller and more consistent vocabulary to reduce spelling inconsistencies and ontology fragmentation. Missing hospital_admit_source values were filled with “Other,” and missing icu_admit_source values were backfilled from hospital_admit_source whenever that variable was available.
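Logical consistency checks were next applied to all paired day-level summary variables ending in _min and _max. Whenever the recorded minimum exceeded the corresponding maximum, the two values were exchanged according to the following rule:
$$\left(x_{ij}^{\min},\, x_{ij}^{\max}\right) \leftarrow \left(\min\{x_{ij}^{\min}, x_{ij}^{\max}\},\ \max\{x_{ij}^{\min}, x_{ij}^{\max}\}\right)$$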
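A minimal pandas sketch of this swap rule follows. The function name `swap_inverted_min_max` and the demo column names are illustrative assumptions; only the `_min`/`_max` suffix convention is taken from the text.

```python
import numpy as np
import pandas as pd

def swap_inverted_min_max(df: pd.DataFrame) -> pd.DataFrame:
    """Swap paired *_min / *_max columns wherever min > max."""
    out = df.copy()
    for min_col in [c for c in out.columns if c.endswith("_min")]:
        max_col = min_col[: -len("_min")] + "_max"
        if max_col not in out.columns:
            continue
        # Comparison is False when either side is missing, so NaNs are untouched.
        inverted = out[min_col] > out[max_col]
        old_max = out.loc[inverted, max_col].copy()
        out.loc[inverted, max_col] = out.loc[inverted, min_col]
        out.loc[inverted, min_col] = old_max
    return out

demo = pd.DataFrame({
    "d1_heartrate_min": [60.0, 120.0, np.nan],
    "d1_heartrate_max": [90.0, 80.0, 100.0],
})
fixed = swap_inverted_min_max(demo)
print(fixed)
```

Only the second row is inverted, so only that row is exchanged; the row with a missing minimum is left as recorded.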
A limited engineered physiologic feature was added for the $\mathrm{PaO_2/FiO_2}$ ratio. Let $r_i$ denote d1_pao2fio2ratio_max, $a_i$ denote pao2_apache, and $f_i$ denote fio2_apache. When the day-1 ratio was missing, both source variables were observed, and the denominator was nonzero, the ratio was computed as:
$$r_i = \frac{a_i}{f_i}$$
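The conditional fill can be sketched as below. The three column names are those given in the text; the function name `fill_pf_ratio` and the demo values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def fill_pf_ratio(df: pd.DataFrame) -> pd.DataFrame:
    """Fill a missing day-1 PaO2/FiO2 ratio from the APACHE PaO2 and FiO2 fields."""
    out = df.copy()
    can_fill = (
        out["d1_pao2fio2ratio_max"].isna()   # ratio is missing
        & out["pao2_apache"].notna()         # numerator observed
        & out["fio2_apache"].notna()         # denominator observed
        & (out["fio2_apache"] != 0)          # and nonzero
    )
    out.loc[can_fill, "d1_pao2fio2ratio_max"] = (
        out.loc[can_fill, "pao2_apache"] / out.loc[can_fill, "fio2_apache"]
    )
    return out

demo = pd.DataFrame({
    "d1_pao2fio2ratio_max": [300.0, np.nan, np.nan, np.nan],
    "pao2_apache": [90.0, 100.0, 75.0, np.nan],
    "fio2_apache": [0.3, 0.5, 0.0, 0.5],
})
filled = fill_pf_ratio(demo)
print(filled["d1_pao2fio2ratio_max"].tolist())
```

Observed ratios are never overwritten, and rows with a zero or missing denominator stay missing so that the later imputation step handles them.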
All h1_-prefixed variables were removed to eliminate extremely granular one-hour measures. The code also dropped near-duplicate APACHE variables whenever an equivalent day-1 representation was available, thereby prioritizing directly observed day-1 measurements over redundant APACHE summaries. Potentially non-predictive identifiers, including encounter_id, readmission_status, and the unnamed index column, were excluded from the predictor matrix. The grouping variables hospital_id and icu_id were retained separately for grouped validation and robustness analyses but were excluded from the model feature space under all feature scenarios.
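Missing-data handling was fully split-aware. For continuous predictors, let $m_{ij} = 1$ indicate that the value of predictor $j$ for record $i$ was observed in the raw training-derived matrix and $m_{ij} = 0$ indicate that it was missing. Missing continuous values were replaced with the training-set mean:
$$\tilde{x}_{ij} = \begin{cases} x_{ij}, & m_{ij} = 1 \\ \bar{x}_j^{\mathrm{train}}, & m_{ij} = 0 \end{cases}$$
where $\bar{x}_j^{\mathrm{train}}$ is the mean of the observed training values of predictor $j$.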
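For categorical predictors, missing values were replaced by the most frequent training-set category. If $c_{ij}$ denotes the categorical value for record $i$ and categorical predictor $j$, and $\operatorname{mode}_j^{\mathrm{train}}$ denotes the training-set mode, the imputation rule was:
$$\tilde{c}_{ij} = \begin{cases} c_{ij}, & c_{ij} \text{ observed} \\ \operatorname{mode}_j^{\mathrm{train}}, & c_{ij} \text{ missing} \end{cases}$$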
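The two imputation rules can be sketched as a fit/apply pair, which makes the split-awareness explicit. The helper names `fit_imputer` and `apply_imputer` and the demo columns are assumptions for illustration, not the study's functions.

```python
import numpy as np
import pandas as pd

def fit_imputer(train, num_cols, cat_cols):
    """Learn fill values from the training split only."""
    means = {c: train[c].mean() for c in num_cols}  # mean skips NaN by default
    modes = {c: train[c].mode(dropna=True).iloc[0] for c in cat_cols}
    return means, modes

def apply_imputer(df, means, modes):
    """Apply training-derived fill values without refitting."""
    out = df.copy()
    for col, value in {**means, **modes}.items():
        out[col] = out[col].fillna(value)
    return out

train = pd.DataFrame({"d1_bun_max": [10.0, 20.0, np.nan],
                      "icu_type": ["MICU", "MICU", None]})
val = pd.DataFrame({"d1_bun_max": [np.nan, 100.0],
                    "icu_type": [None, "SICU"]})

means, modes = fit_imputer(train, ["d1_bun_max"], ["icu_type"])
val_imputed = apply_imputer(val, means, modes)
print(val_imputed)
```

The validation value of 100.0 never enters the fill statistic: the missing validation entry receives the training mean of 15.0, which is the behavior the split-aware design is meant to guarantee.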
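A more specific imputation rule was implemented for anthropometric variables. For height, weight, and body mass index (BMI), the pipeline first computed sex-specific means within the training split and then fell back to the overall training mean if the sex-stratified mean was unavailable. Let $g_i$ denote the recorded gender for patient $i$, let $a \in \{\text{height}, \text{weight}, \text{BMI}\}$, and let $\delta_{a,g} = 1$ indicate that a training-derived gender-specific mean existed for anthropometric variable $a$ and gender group $g$:
$$\tilde{x}_{ia} = \begin{cases} x_{ia}, & x_{ia} \text{ observed} \\ \bar{x}_{a,g_i}^{\mathrm{train}}, & x_{ia} \text{ missing and } \delta_{a,g_i} = 1 \\ \bar{x}_{a}^{\mathrm{train}}, & \text{otherwise} \end{cases}$$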
This two-stage strategy preserved more physiologically plausible imputations for anthropometric variables than a single global mean. In addition, ethnicity and gender themselves were filled using training-derived modal values, with code fallback defaults of “Other/Unknown” and “M,” respectively, when no empirical mode was available.
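A minimal sketch of the two-stage anthropometric rule is shown below. The column names `gender`, `height`, `weight`, and `bmi` and both helper names are assumptions. As a simplification, a record with missing gender falls straight through to the overall training mean here, whereas the text describes gender itself being mode-filled.

```python
import numpy as np
import pandas as pd

ANTHRO = ["height", "weight", "bmi"]  # assumed column names

def fit_anthro_means(train):
    """Gender-specific and overall means, learned from the training split only."""
    by_gender = train.groupby("gender")[ANTHRO].mean()
    overall = train[ANTHRO].mean()
    return by_gender, overall

def impute_anthro(df, by_gender, overall):
    out = df.copy()
    for col in ANTHRO:
        # NaN when the gender is missing, unseen, or has no usable group mean.
        group_mean = out["gender"].map(by_gender[col])
        out[col] = out[col].fillna(group_mean).fillna(overall[col])
    return out

train = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "height": [160.0, 164.0, 178.0, 182.0],
    "weight": [60.0, 64.0, 80.0, 84.0],
    "bmi": [23.0, 24.0, 25.0, 26.0],
})
val = pd.DataFrame({
    "gender": ["F", "M", None],
    "height": [np.nan, np.nan, np.nan],
    "weight": [70.0, np.nan, np.nan],
    "bmi": [np.nan, 30.0, np.nan],
})

by_gender, overall = fit_anthro_means(train)
val_imputed = impute_anthro(val, by_gender, overall)
print(val_imputed)
```

Chaining the two `fillna` calls encodes the fallback order directly: the group mean is tried first, and the overall mean is used only where the group mean is unavailable.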
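After imputation, categorical variables were one-hot encoded using only the levels observed in the training data. For a categorical predictor $j$ with training categories $\mathcal{C}_j^{\mathrm{train}} = \{c_{j1}, \dots, c_{jK_j}\}$, the dummy representation was defined as:
$$z_{ijk} = \mathbb{1}\left[\tilde{c}_{ij} = c_{jk}\right], \qquad k = 1, \dots, K_j$$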
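Training-derived one-hot encoding can be sketched with `pandas.get_dummies` followed by a column alignment. The helper names and demo columns are assumptions; the key point illustrated is that a category seen only outside the training split produces no new column.

```python
import pandas as pd

def fit_one_hot(train, cat_cols):
    """Encode the training frame and record its column set."""
    encoded = pd.get_dummies(train, columns=cat_cols, dtype=int)
    return encoded, list(encoded.columns)

def apply_one_hot(df, cat_cols, train_columns):
    """Encode another frame, then align it to the training-derived columns."""
    encoded = pd.get_dummies(df, columns=cat_cols, dtype=int)
    # Unseen levels are dropped; training levels absent here become all-zero columns.
    return encoded.reindex(columns=train_columns, fill_value=0)

train = pd.DataFrame({"age": [50, 60, 70],
                      "icu_type": ["MICU", "SICU", "MICU"]})
val = pd.DataFrame({"age": [40, 55],
                    "icu_type": ["SICU", "Neuro ICU"]})

train_enc, train_cols = fit_one_hot(train, ["icu_type"])
val_enc = apply_one_hot(val, ["icu_type"], train_cols)
print(val_enc)
```

The validation row with the unseen level is encoded as all zeros across the `icu_type` dummies, so the validation matrix has exactly the training columns.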
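Validation and challenge matrices were then aligned to the resulting training-derived column set. To reduce redundancy and instability, the code computed pairwise absolute Pearson correlations on the training design matrix only. For features $j$ and $k$, the training-set Pearson correlation was:
$$\rho_{jk} = \frac{\sum_{i \in \mathcal{D}_{\mathrm{train}}} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}{\sqrt{\sum_{i \in \mathcal{D}_{\mathrm{train}}} (x_{ij} - \bar{x}_j)^2}\ \sqrt{\sum_{i \in \mathcal{D}_{\mathrm{train}}} (x_{ik} - \bar{x}_k)^2}}$$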
A feature was flagged for removal whenever it appeared as the second member of any upper-triangular feature pair whose absolute correlation exceeded the threshold of 0.80:
$$\mathcal{R} = \left\{ k : \exists\, j < k \text{ such that } |\rho_{jk}| > 0.80 \right\}$$
The same dropped columns were removed from the validation and challenge matrices. Finally, feature names were sanitized by replacing non-alphanumeric characters with underscores, and both validation and test matrices were realigned one final time to the exact training-derived column set. This sequence ensured a stable and identical feature space across all downstream models and validation scenarios. Refer to Table 1 and Table 2 for details.
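The correlation-pruning and name-sanitizing steps described in this section could look roughly like the sketch below. The function names are assumptions, and whether underscores themselves are preserved during sanitization is also an assumption; the 0.80 threshold and the upper-triangular rule follow the text.

```python
import re
import numpy as np
import pandas as pd

def correlated_columns_to_drop(train_X: pd.DataFrame, threshold: float = 0.80):
    """Flag the second member of each upper-triangular pair with |r| > threshold."""
    corr = train_X.corr().abs()
    # Keep only entries above the diagonal, so each pair (j, k) with j < k appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

def sanitize(name: str) -> str:
    """Replace characters other than letters, digits, and underscores."""
    return re.sub(r"[^0-9a-zA-Z_]", "_", name)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
train_X = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=200),  # near-duplicate of "a"
    "c": rng.normal(size=200),             # independent feature
})

to_drop = correlated_columns_to_drop(train_X)
print(to_drop, sanitize("icu_type_Neuro ICU"))
```

Because the drop list is computed once on the training matrix, the same list can be applied to any other matrix with `df.drop(columns=to_drop)`, keeping the feature space identical across splits.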
2.6. Model Development
We compared five individual tree-based classifiers selected for their strong performance on structured tabular clinical data: CatBoost, random forest, tuned XGBoost, tuned LightGBM, and Histogram-based Gradient Boosting (HGB). These algorithms were trained on the final split-aware, one-hot encoded, correlation-pruned feature matrices described in
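Section 2.5. For each ICU stay $i$, the transformed predictor vector was denoted by $\tilde{\mathbf{x}}_i$ and the binary outcome was $y_i \in \{0, 1\}$, where $y_i = 1$ indicated documented diabetes status. Each classifier learned a mapping from the final predictor space to an estimated probability of diabetes:
$$\hat{p}_i^{(m)} = f_m(\tilde{\mathbf{x}}_i) \approx \Pr\left(y_i = 1 \mid \tilde{\mathbf{x}}_i\right) \tag{12}$$
where $m$ denotes the $m$-th trained model and $f_m$ is the model-specific probability function. Predicted class labels were obtained by applying a probability threshold $\tau$ to the estimated probabilities:
$$\hat{y}_i^{(m)} = \mathbb{1}\left[\hat{p}_i^{(m)} \ge \tau\right] \tag{13}$$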
This formulation allowed the analysis pipeline to evaluate both fixed-threshold classification performance and threshold-aware behavior in the later precision-recall analysis. The modeling strategy intentionally emphasized probability estimation, not only hard classification, because subsequent sections evaluate AUROC, Brier score, precision-recall behavior, and calibration.
CatBoost and random forest served as robust baseline models. CatBoost was implemented with a fixed number of boosting iterations and a fixed random seed, while random forest was implemented with $B$ trees, parallel processing, and the same study-level random state. The random forest model can be represented as an average over $B$ independently trained decision trees:
$$\hat{p}_i^{\mathrm{RF}} = \frac{1}{B} \sum_{b=1}^{B} q_b(\tilde{\mathbf{x}}_i)$$
where $q_b(\tilde{\mathbf{x}}_i)$ is the probability estimate from the $b$-th tree. This averaging structure reduces variance relative to a single decision tree and provides a conservative benchmark against the boosting-based models.
The gradient-boosted tree models used additive ensembles in which each new tree updated the current prediction function. In general form, the boosted scoring function after $T$ iterations can be written as:
$$F_T(\tilde{\mathbf{x}}) = F_0(\tilde{\mathbf{x}}) + \eta \sum_{t=1}^{T} h_t(\tilde{\mathbf{x}})$$
where $h_t$ is the $t$-th regression tree and
$\eta$ is the learning rate. XGBoost and LightGBM represented tuned gradient-boosted tree approaches optimized for this benchmark. The tuned XGBoost configuration specified the number of estimators, maximum depth, learning rate, subsample ratio, minimum child weight, and further tuned hyperparameters, together with a fixed random state and log-loss evaluation. The tuned LightGBM configuration used gradient boosting decision trees with a binary objective, the AUC metric, a tuned learning rate and subsample ratio, further tuned hyperparameters, a fixed number of boosting iterations, unrestricted tree depth, a fixed random state, and force_col_wise enabled.
Because class imbalance was a central methodological concern, the final analysis pipeline emphasized model-intrinsic class weighting and threshold-aware evaluation rather than synthetic oversampling. In the LightGBM model, the positive diabetes class was weighted more heavily through the model's class-weighting parameter. Conceptually, the weighted binary learning objective can be expressed as:
$$\mathcal{L} = \sum_{i \in \mathcal{D}_{\mathrm{train}}} w_{y_i}\, \ell\left(y_i, \hat{p}_i\right) + \sum_{t=1}^{T} \Omega(h_t)$$
where $\ell$ is the binary classification loss, $w_{y_i}$ is the class-specific observation weight, and $\Omega$ represents regularization terms that penalize excessive model complexity. This weighting scheme increased the cost of misclassifying diabetes-positive cases during training while preserving the original empirical distribution of the data. Therefore, the workflow avoided synthetic generation of minority-class records and instead evaluated the sensitivity-precision trade-off directly through model probabilities and threshold-dependent metrics.
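To make the effect of class weighting concrete, the sketch below evaluates a class-weighted log loss with NumPy and omits the regularization term. The positive-class weight of 3.0 is a purely hypothetical value chosen for illustration; the weight used in the study is not stated in this section.

```python
import numpy as np

def weighted_log_loss(y, p, pos_weight=1.0):
    """Sum of per-record log losses, with positives scaled by pos_weight."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    w = np.where(y == 1, pos_weight, 1.0)          # class-specific weight
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(np.sum(w * loss))

y = [1, 0, 0, 0]
p = [0.3, 0.3, 0.3, 0.3]  # the single positive is under-predicted

unweighted = weighted_log_loss(y, p, pos_weight=1.0)
weighted = weighted_log_loss(y, p, pos_weight=3.0)  # hypothetical weight
print(round(unweighted, 4), round(weighted, 4))
```

The missed positive contributes three times as much to the weighted objective, which is the mechanism that pushes a weighted model toward higher sensitivity without altering the data themselves.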
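In addition to the five individual models, we trained a soft-voting ensemble that combined tuned XGBoost, tuned LightGBM, and histogram-based gradient boosting. The ensemble assigned weights of $(w_{\mathrm{XGB}}, w_{\mathrm{LGBM}}, w_{\mathrm{HGB}})$ to XGBoost, LightGBM, and HGB, respectively, giving greater influence to the sensitivity-oriented LightGBM model while still incorporating the more conservative probability estimates from XGBoost and HGB. The ensemble probability was calculated as the weighted average:
$$\hat{p}_i^{\mathrm{ens}} = \frac{w_{\mathrm{XGB}}\, \hat{p}_i^{\mathrm{XGB}} + w_{\mathrm{LGBM}}\, \hat{p}_i^{\mathrm{LGBM}} + w_{\mathrm{HGB}}\, \hat{p}_i^{\mathrm{HGB}}}{w_{\mathrm{XGB}} + w_{\mathrm{LGBM}} + w_{\mathrm{HGB}}}$$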
Final ensemble labels were then obtained by applying the same thresholding rule in Equation (13). This design provided a more balanced operating point than relying on LightGBM alone: the ensemble retained much of LightGBM’s sensitivity while improving precision by tempering extreme probability estimates through the additional boosted-tree models.
Table 3 summarizes the model-development choices.
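The soft-voting step can be sketched as a weighted average of per-model probabilities. The weights (1, 2, 1) and the probability values below are hypothetical; they only mirror the stated design in which LightGBM receives the largest weight.

```python
import numpy as np

def soft_vote(probas, weights):
    """Weighted average of model probabilities; probas has shape (n_models, n_samples)."""
    probas = np.asarray(probas, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ probas / weights.sum()

p_xgb = [0.2, 0.6]
p_lgbm = [0.5, 0.9]   # the more sensitivity-oriented model
p_hgb = [0.3, 0.7]

ens = soft_vote([p_xgb, p_lgbm, p_hgb], weights=[1, 2, 1])  # hypothetical weights
labels = (ens >= 0.5).astype(int)  # default-style threshold
print(ens, labels)
```

Dividing by the weight sum keeps the ensemble output on the probability scale regardless of how the weights are chosen.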
2.7. Evaluation, Ablation Analysis, and Grouped Site-Stratified Validation
Model performance was evaluated on held-out validation data generated by the split-aware preprocessing workflow described in Section 2.5. For a given validation set $\mathcal{D}_{\mathrm{val}}$, $\tilde{\mathbf{x}}_i$ denotes the final transformed predictor vector and $y_i$ denotes the observed diabetes label. For model $m$, the predicted probability $\hat{p}_i^{(m)}$ was obtained from the fitted probability function defined in Equation (12), and thresholded class labels were obtained according to Equation (13).
The primary model-comparison tables used the model-default binary prediction, corresponding to a probability threshold of $\tau = 0.5$ for the probabilistic binary classifiers unless a separate threshold analysis was explicitly reported.
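For each model and validation fold, the threshold-specific confusion-matrix counts were computed from the observed and predicted labels as follows:
$$\mathrm{TP} = \sum_{i} \mathbb{1}[y_i = 1, \hat{y}_i = 1], \quad \mathrm{TN} = \sum_{i} \mathbb{1}[y_i = 0, \hat{y}_i = 0], \quad \mathrm{FP} = \sum_{i} \mathbb{1}[y_i = 0, \hat{y}_i = 1], \quad \mathrm{FN} = \sum_{i} \mathbb{1}[y_i = 1, \hat{y}_i = 0]$$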
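These four quantities were used to calculate the fixed-threshold classification metrics reported in the primary and ablation tables:
$$\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}, \qquad \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$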
In the implementation, precision, recall, and F1-score were computed using the scikit-learn metric functions with zero_division = 0, so undefined ratios caused by an empty denominator were assigned a value of zero rather than producing unstable estimates. Recall was interpreted as sensitivity for the diabetes-positive class, whereas precision represented the positive predictive value among ICU stays flagged as likely diabetes-positive.
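The fixed-threshold metrics and the zero-denominator convention can be reproduced in a few lines of plain Python. This is an illustrative re-implementation rather than the scikit-learn code; the helper names and the toy labels are assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def safe_ratio(num, den):
    """Return 0.0 when the denominator is empty, mirroring a zero-division default of 0."""
    return num / den if den else 0.0

def threshold_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": safe_ratio(tp + tn, tp + tn + fp + fn),
        "precision": safe_ratio(tp, tp + fp),
        "recall": safe_ratio(tp, tp + fn),
        "f1": safe_ratio(2 * tp, 2 * tp + fp + fn),  # equivalent to 2PR / (P + R)
    }

y_true = [1, 1, 0, 0, 0, 1]
m = threshold_metrics(y_true, [1, 0, 0, 1, 0, 1])
m_all_negative = threshold_metrics(y_true, [0, 0, 0, 0, 0, 0])
print(m, m_all_negative)
```

When a model predicts no positives at all, precision has an empty denominator; returning zero keeps such a model visibly penalized in the comparison tables instead of raising an error.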
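Probability-based performance was evaluated separately from fixed-threshold classification. Probability accuracy was summarized using the Brier score, which measures the mean squared deviation between the predicted probability and the observed binary label:
$$\mathrm{BS} = \frac{1}{n_{\mathrm{val}}} \sum_{i=1}^{n_{\mathrm{val}}} \left(\hat{p}_i - y_i\right)^2 \tag{21}$$
where $n_{\mathrm{val}}$ is the number of validation records.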
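Discrimination was assessed with the receiver operating characteristic curve and the area under that curve (AUROC). At each threshold $t$, the true-positive rate and false-positive rate were defined as:
$$\mathrm{TPR}(t) = \frac{\mathrm{TP}(t)}{\mathrm{TP}(t) + \mathrm{FN}(t)}, \qquad \mathrm{FPR}(t) = \frac{\mathrm{FP}(t)}{\mathrm{FP}(t) + \mathrm{TN}(t)}$$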
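The AUROC was therefore interpreted as a threshold-independent ranking measure. Equivalently, for $n_1$ positive cases and $n_0$ negative cases in the validation fold, AUROC estimates the probability that a randomly selected diabetes-positive ICU stay receives a higher predicted probability than a randomly selected diabetes-negative ICU stay:
$$\mathrm{AUROC} = \frac{1}{n_1 n_0} \sum_{i : y_i = 1}\ \sum_{j : y_j = 0} \left( \mathbb{1}\left[\hat{p}_i > \hat{p}_j\right] + \tfrac{1}{2}\, \mathbb{1}\left[\hat{p}_i = \hat{p}_j\right] \right)$$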
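The two probability-based summaries can be checked by hand on a small example. The sketch below computes the Brier score and the pairwise-ranking form of AUROC with NumPy; the function names and the four-record toy data are assumptions, and the pairwise loop is meant for illustration rather than for cohort-sized data.

```python
import numpy as np

def brier_score(y, p):
    """Mean squared difference between predicted probability and label."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return float(np.mean((p - y) ** 2))

def auroc_pairwise(y, p):
    """Share of positive/negative pairs ranked correctly, with ties counted as 1/2."""
    y, p = np.asarray(y), np.asarray(p, dtype=float)
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]   # all positive-minus-negative differences
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]

bs = brier_score(y, p)
auc = auroc_pairwise(y, p)
print(bs, auc)
```

Three of the four positive/negative pairs are ordered correctly, giving an AUROC of 0.75, while the Brier score penalizes the positive record that received a probability of only 0.35.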
Because the positive class represented a minority of the cohort, receiver operating characteristic analysis was supplemented with precision-recall analysis. The prevalence of the positive class in a validation fold provides the no-skill precision baseline:
$$\pi = \frac{n_1}{n_1 + n_0}$$
Precision-recall curves were generated by sweeping the classification threshold over the predicted probabilities. Average precision (AP) summarized the area under the precision-recall curve using the stepwise interpolation implemented by scikit-learn:
$$\mathrm{AP} = \sum_{k} \left(R_k - R_{k-1}\right) P_k$$
where $P_k$ and $R_k$ are the precision and recall values at the $k$-th threshold operating point. To support clinically interpretable threshold selection, we also examined operating points that preserved high sensitivity. For a prespecified recall target $R^{*}$, the selected threshold was defined as the largest threshold that still achieved at least the target recall:
$$\tau^{*} = \max\left\{ t : \mathrm{Recall}(t) \ge R^{*} \right\}$$
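Average precision and recall-targeted threshold selection can be sketched directly from their definitions. This NumPy version is an illustrative re-implementation, not the scikit-learn routine; the function names and toy data are assumptions.

```python
import numpy as np

def pr_points(y, p):
    """Precision and recall at each distinct score, from highest threshold to lowest."""
    y, p = np.asarray(y), np.asarray(p, dtype=float)
    thresholds = np.unique(p)[::-1]
    n_pos = np.sum(y == 1)
    precision, recall = [], []
    for t in thresholds:
        flagged = p >= t
        tp = np.sum(flagged & (y == 1))
        precision.append(tp / flagged.sum())
        recall.append(tp / n_pos)
    return thresholds, np.array(precision), np.array(recall)

def average_precision(y, p):
    """Step-wise sum of precision weighted by the increase in recall."""
    _, precision, recall = pr_points(y, p)
    recall_prev = np.concatenate([[0.0], recall[:-1]])
    return float(np.sum((recall - recall_prev) * precision))

def threshold_for_recall(y, p, target):
    """Largest threshold whose recall is at least the target, or None if unattainable."""
    thresholds, _, recall = pr_points(y, p)
    ok = thresholds[recall >= target]
    return float(ok.max()) if ok.size else None

y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]

ap = average_precision(y, p)
tau_star = threshold_for_recall(y, p, target=0.9)
baseline = float(np.mean(y))  # no-skill precision equals prevalence
print(ap, tau_star, baseline)
```

Choosing the largest qualifying threshold keeps the number of flagged stays as small as possible while still meeting the sensitivity requirement.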
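The primary evaluation was conducted under the leakage-mitigated original feature scenario using the 80/20 stratified random validation split. For each trained model m in the set of evaluated classifiers, the analysis pipeline stored the predicted class labels, predicted probabilities, and metric vector:
$$\left\{ \hat{\mathbf{y}}^{(s,m)},\ \hat{\mathbf{p}}^{(s,m)},\ \mathbf{M}^{(s,m)} \right\}, \qquad \mathbf{M}^{(s,m)} = \left(\mathrm{Accuracy}, \mathrm{Precision}, \mathrm{Recall}, F_1, \mathrm{AUROC}, \mathrm{BS}\right)$$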
where s indexes the feature scenario and m indexes the model. This notation reflects the implementation in which the same train_eval_predict routine was applied to CatBoost, random forest, tuned XGBoost, tuned LightGBM, histogram-based gradient boosting, and the soft-voting ensemble. The leakage-mitigated original scenario served as the primary benchmark because it removed variables most likely to encode label-adjacent information while retaining clinically interpretable admission and day-1 predictors.
To quantify robustness to potential leakage and feature-set dependence, the full modeling framework was repeated across the four prespecified feature scenarios: full feature set, leakage-mitigated original, exclude all APACHE, and strict reduced model. For a given metric component k, the scenario-specific change relative to the primary leakage-mitigated benchmark was defined as:
$$\Delta_k^{(s,m)} = M_k^{(s,m)} - M_k^{(s_0,m)}$$
where $M_k^{(s_0,m)}$ is the value of metric k for model m under the leakage-mitigated original scenario. This ablation framework allowed the analysis to determine whether performance was driven primarily by APACHE-derived variables, site-related structure, glucose-like predictors, or more general clinical and physiologic information.
In addition to the row-level random split, a grouped site-stratified validation analysis was conducted to estimate transportability to unseen hospitals. The grouped experiment reconstructed the processed adult labeled cohort under the same leakage-mitigated feature scenario and then applied GroupShuffleSplit with one split, test_size = 0.20, random_state = 40, and hospital_id as the grouping variable. Let $H_i$ denote the hospital identifier for ICU stay $i$, and let $\mathcal{H}_{\mathrm{train}}$ and $\mathcal{H}_{\mathrm{val}}$ denote the selected training and validation hospital sets. The grouped split enforced hospital-level separation:
$$\mathcal{H}_{\mathrm{train}} \cap \mathcal{H}_{\mathrm{val}} = \varnothing$$
Thus, no hospital contributed records to both the grouped training and grouped validation partitions. The grouped analysis used the tuned LightGBM model because it was the sensitivity-oriented model emphasized for minority-class detection and because it provided the basis for subsequent calibration and SHAP-based interpretability analyses. To compare random-split and grouped-site behavior, the grouped validation metric change was summarized as:
$$\Delta_k^{\mathrm{group}} = M_k^{\mathrm{group}} - M_k^{\mathrm{random}}$$
A decrease in AUROC, precision, or F1-score, or an increase in Brier score after site separation, would indicate reduced transportability to unseen hospitals, whereas stable recall would indicate that the model continued to identify diabetes-positive ICU stays despite a stricter validation design.
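The hospital-level separation can be sketched with scikit-learn's `GroupShuffleSplit`, using the settings named in the text (one split, test_size = 0.20, random_state = 40). The synthetic cohort, the 50 hypothetical hospital identifiers, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in: 2000 ICU stays spread over 50 hypothetical hospitals.
rng = np.random.default_rng(0)
n = 2000
hospital_id = rng.integers(0, 50, size=n)
X = rng.normal(size=(n, 4))
y = (rng.random(n) < 0.2).astype(int)

# One grouped split: whole hospitals are assigned to either side.
gss = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=40)
train_idx, val_idx = next(gss.split(X, y, groups=hospital_id))

train_hospitals = set(hospital_id[train_idx])
val_hospitals = set(hospital_id[val_idx])
print(len(train_hospitals), len(val_hospitals))
print(train_hospitals.isdisjoint(val_hospitals))
```

Here `test_size` refers to the share of groups rather than rows, so the validation partition may hold somewhat more or less than 20% of the stays depending on hospital size.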
Together, as shown in Table 4, the random split, scenario ablations, precision-recall analysis, and grouped hospital validation provided complementary evidence about discrimination, probability reliability, class-imbalance behavior, robustness to leakage-prone predictors, and site-level generalizability.
2.8. Calibration Assessment and SHAP-Based Interpretability
We complemented discrimination-based evaluation with probability calibration assessment and SHAP-based interpretability. Calibration was examined for the tuned LightGBM model under both the primary random stratified validation split and the stricter grouped site validation split. This distinction was important because AUROC and average precision evaluate ranking behavior, whereas calibration evaluates whether predicted probabilities are numerically reliable as risk estimates.
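Let s ∈ {random, group} index the validation strategy, and let $\mathcal{D}_{\mathrm{val}}^{(s)}$ denote the corresponding validation set. For ICU stay $i$ in $\mathcal{D}_{\mathrm{val}}^{(s)}$, the tuned LightGBM model produced an estimated probability of documented diabetes status:
$$\hat{p}_i^{(s)} = f_{\mathrm{LGBM}}^{(s)}\left(\tilde{\mathbf{x}}_i\right)$$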
A perfectly calibrated probabilistic classifier would satisfy the condition that, among patients assigned predicted risk $p$, the observed event frequency is also $p$:
$$\Pr\left(y = 1 \mid \hat{p} = p\right) = p \qquad \text{for all } p \in [0, 1]$$
Calibration curves were generated using the scikit-learn calibration_curve function using 10 equal-width bins and strategy = ‘uniform’. Thus, the interval [0, 1] was divided into 10 equal-width probability bins. For bin
, the validation records assigned to that bin were defined as:
For each nonempty bin, the analysis pipeline computed the mean predicted probability and the observed fraction positive:
The plotted calibration curve was therefore the set of bin-level pairs displayed against the diagonal perfect-calibration reference line:
Points below the diagonal indicate probability overestimation, whereas points above the diagonal indicate probability underestimation. The Brier score defined in Equation (21) was retained as the numerical summary of probability accuracy for both validation settings. No post hoc recalibration model, such as Platt scaling or isotonic regression, was fitted; the purpose of this step was to diagnose probability reliability rather than to recalibrate the model.
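The binning and Brier computation described above can be illustrated with a small plain-Python sketch. It is intended to mirror the idea behind `calibration_curve` with `strategy='uniform'`, but it is not the library code, and its handling of values that fall exactly on a bin edge may differ from scikit-learn's. The toy labels and probabilities are hypothetical.

```python
def calibration_bins(y_true, y_prob, n_bins=10):
    """Return (mean predicted, fraction positive) for each nonempty equal-width bin."""
    sums = [0.0] * n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins
    for y, p in zip(y_true, y_prob):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        sums[b] += p
        positives[b] += y
        counts[b] += 1
    return [(sums[b] / counts[b], positives[b] / counts[b])
            for b in range(n_bins) if counts[b] > 0]

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Hypothetical validation labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.12, 0.18, 0.62, 0.68, 0.91, 0.99]

curve = calibration_bins(y_true, y_prob)
# Bins whose observed fraction lies below the mean prediction (overestimation).
overestimated = [(mp, fp) for mp, fp in curve if fp < mp]
brier = brier_score(y_true, y_prob)
```

In this toy case the two lower bins sit below the diagonal (overestimation) and the top bin sits above it (underestimation), which is the reading rule stated in the text.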
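Interpretability was assessed using SHAP (SHapley Additive exPlanations) [33] rather than relying only on model-native feature-importance rankings. SHAP analysis was anchored to the grouped-site LightGBM model because that model was trained and evaluated under hospital-level separation and therefore provided the more conservative explanation target. For computational efficiency, SHAP values were computed on the grouped validation set, with random subsampling applied only when the grouped validation set exceeded 3000 records. Writing $\mathcal{D}_{\text{val}}^{(\text{group})}$ for the grouped validation set, the SHAP explanation sample was:
$$\mathcal{D}_{\text{SHAP}} = \begin{cases} \mathcal{D}_{\text{val}}^{(\text{group})}, & |\mathcal{D}_{\text{val}}^{(\text{group})}| \le 3000, \\ \text{a random subsample of } \mathcal{D}_{\text{val}}^{(\text{group})}, & \text{otherwise.} \end{cases}$$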
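For transformed predictor vector $\mathbf{x}_i$, the explanation takes the additive form:
$$f(\mathbf{x}_i) = \phi_0 + \sum_{j=1}^{M} \phi_{ij}$$
where $f(\mathbf{x}_i)$ is the grouped-site LightGBM output on the TreeExplainer model-output scale, $\phi_0$ is the expected baseline output, $M$ is the number of final predictors, and $\phi_{ij}$ is the contribution of predictor $j$ for ICU stay $i$. Conceptually, each SHAP value is a weighted average of the marginal contribution of feature $j$ across possible feature subsets:
$$\phi_{ij} = \sum_{S \subseteq \{1, \dots, M\} \setminus \{j\}} \frac{|S|!\,(M - |S| - 1)!}{M!} \left[ f_{S \cup \{j\}}(\mathbf{x}_i) - f_{S}(\mathbf{x}_i) \right]$$
where $f_S$ denotes the model output when only the features in subset $S$ are known.
Global feature influence was summarized by the mean absolute SHAP value, which measures the average magnitude of a predictor’s contribution across the SHAP explanation sample $\mathcal{D}_{\text{SHAP}}$:
$$\bar{\phi}_j = \frac{1}{|\mathcal{D}_{\text{SHAP}}|} \sum_{i \in \mathcal{D}_{\text{SHAP}}} |\phi_{ij}| \tag{39}$$
The global mean absolute SHAP bar plot ranked predictors by $\bar{\phi}_j$, while the SHAP beeswarm summary plot displayed the distribution and direction of feature contributions across patients. Higher or lower feature values could therefore be interpreted in terms of whether they pushed the grouped-site LightGBM output toward or away from the diabetes-positive class.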
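The additive form and the subset-weighted definition described above can be checked on a toy model. The sketch below computes exact Shapley values by brute-force enumeration against a single reference point. It is an illustrative stand-in, not the TreeExplainer algorithm used in the study, and the toy `model`, `background`, and `samples` are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, background):
    """Exact Shapley values for one input; absent features take background values."""
    m = len(x)

    def value(subset):
        z = [x[j] if j in subset else background[j] for j in range(m)]
        return model(z)

    phi = []
    for j in range(m):
        others = [k for k in range(m) if k != j]
        total = 0.0
        for size in range(m):  # subset sizes 0 .. m-1
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {j}) - value(s))
        phi.append(total)
    return phi

def model(z):
    # Toy model: one additive term plus one two-way interaction.
    return 2.0 * z[0] + z[1] * z[2]

background = [0.0, 0.0, 0.0]
samples = [[1.0, 2.0, 3.0], [3.0, 1.0, 1.0]]

phi0 = model(background)  # baseline output
phis = [shapley_values(model, x, background) for x in samples]

# Mean absolute contribution per feature, then a descending ranking.
mean_abs = [sum(abs(p[j]) for p in phis) / len(phis) for j in range(3)]
ranking = sorted(range(3), key=lambda j: -mean_abs[j])
```

For each sample, the baseline plus the summed contributions reproduces the model output, and the interaction term is shared equally between the two interacting features. Brute-force enumeration scales exponentially in the number of predictors, which is why tree-specific algorithms are used in practice.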
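To examine nonlinear predictor effects, dependence plots were generated for the most influential predictors. For a selected feature $j$, the dependence plot consisted of the observed transformed feature value $x_{ij}$ paired with its SHAP contribution $\phi_{ij}$ across the explanation sample $\mathcal{D}_{\text{SHAP}}$:
$$\left\{ \left(x_{ij}, \phi_{ij}\right) : i \in \mathcal{D}_{\text{SHAP}} \right\}$$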
The top SHAP features were identified by sorting Equation (39), and dependence plots were generated for the six highest-ranked predictors. These plots were used to assess whether dominant predictors contributed monotonically, showed threshold-like changes, or saturated at high or low values.
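Finally, local SHAP waterfall plots were generated for representative cases selected from the grouped validation confusion-matrix categories. Using the grouped LightGBM predicted labels $\hat{y}_i$ and the documented labels $y_i$, the case sets were defined as:
$$\begin{aligned} \mathrm{TP} &= \{ i : y_i = 1,\ \hat{y}_i = 1 \}, & \mathrm{TN} &= \{ i : y_i = 0,\ \hat{y}_i = 0 \}, \\ \mathrm{FP} &= \{ i : y_i = 0,\ \hat{y}_i = 1 \}, & \mathrm{FN} &= \{ i : y_i = 1,\ \hat{y}_i = 0 \}. \end{aligned}$$
We selected the first available example from each nonempty set and plotted the cumulative SHAP decomposition for that patient. If features are ordered by decreasing absolute local contribution, so that $\phi_{i(1)}, \phi_{i(2)}, \dots$ are the ordered contributions added to the baseline $\phi_0$, the waterfall trajectory after the first $k$ displayed features is:
$$W_{ik} = \phi_0 + \sum_{r=1}^{k} \phi_{i(r)}$$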
This local decomposition showed how individual predictors moved the grouped-site LightGBM output away from the baseline value and toward either the diabetes-positive or diabetes-negative class. Together, the calibration diagnostics and SHAP analyses addressed two complementary concerns: whether the model probabilities were reliable enough to interpret, and whether the model’s behavior could be explained at both global and patient-specific levels using the leakage-mitigated, site-separated modeling workflow, as shown in Table 5.
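The case selection and cumulative decomposition used for the local waterfall plots can be sketched as follows. This is a minimal plain-Python illustration with hypothetical labels and contribution values; the helper names are assumptions, and the plotting step itself is omitted.

```python
def confusion_sets(y_true, y_pred):
    """Group record indices into TP, TN, FP, and FN by true and predicted label."""
    sets = {"TP": [], "TN": [], "FP": [], "FN": []}
    for i, (y, y_hat) in enumerate(zip(y_true, y_pred)):
        key = ("T" if y == y_hat else "F") + ("P" if y_hat == 1 else "N")
        sets[key].append(i)
    return sets

def waterfall(baseline, contributions):
    """Running output after adding contributions in decreasing |phi| order."""
    order = sorted(range(len(contributions)), key=lambda j: -abs(contributions[j]))
    path, running = [], baseline
    for j in order:
        running += contributions[j]
        path.append((j, running))
    return path

# Hypothetical documented labels and predicted labels.
y_true = [1, 0, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0]
sets = confusion_sets(y_true, y_pred)

# First available example from each nonempty category.
representatives = {name: idx[0] for name, idx in sets.items() if idx}

# Hypothetical baseline and per-feature contributions for one patient.
path = waterfall(-1.0, [0.25, -0.5, 1.5])
```

The last running value in `path` equals the baseline plus the sum of all contributions, so the trajectory ends at the model output for that patient.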
2.9. Reporting and Reproducibility
Methods and results reporting were structured to align with core elements of TRIPOD, including explicit definitions of outcome and predictors, transparent preprocessing, clear separation of model development from evaluation, and complete reporting of performance metrics. For full reproducibility, the complete preprocessing pipeline (including feature lists and imputation rules), the final tuned hyperparameters, and software/package versions are provided.
4. Discussion
In this public, multi-center ICU cohort derived from the WiDS Datathon 2021/GOSSIS release, leakage-aware tree-based machine-learning models showed strong performance for admission-time identification of documented diabetes status. The analysis was intentionally designed as a benchmarking and phenotyping study rather than a stand-alone diagnostic model. The workflow used split-aware preprocessing, avoided fitting preprocessing steps on validation or challenge data, excluded explicit site identifiers from the model matrix, evaluated prespecified feature-ablation scenarios, used model-intrinsic class weighting rather than synthetic oversampling, and assessed performance under both random and grouped hospital validation. This design strengthens the credibility of the results because it reduces the risk that apparent discrimination was driven by direct leakage, site memorization, or preprocessing choices informed by held-out data.
The primary leakage-mitigated random-split results showed two clinically meaningful operating profiles. The soft-voting ensemble achieved the most balanced overall performance, with an AUROC of 0.8539, a precision of 0.5671, a recall of 0.6690, and an F1-score of 0.6138. Tuned LightGBM was the most sensitivity-oriented individual model, achieving recall of 0.7677 and AUROC of 0.8537, although this came with a lower precision of 0.5017 and a less favorable Brier score of 0.1508. These findings indicate that model choice should depend on intended use. LightGBM may be preferable when the goal is high-sensitivity screening and missed diabetes-positive cases are costly, whereas the ensemble may be preferable when the burden of false positives, chart review, or alert fatigue is a major concern.
The ablation analyses clarify the source of model performance and support a cautious interpretation. Removing leakage-prone and APACHE-related variables caused only modest reductions in discrimination. For LightGBM, AUROC declined from 0.8565 in the full-feature scenario to 0.8537 in the leakage-mitigated scenario and to 0.8508 after excluding all APACHE variables. The voting ensemble showed a similar pattern, with AUROC decreasing from 0.8566 to 0.8539 and then to 0.8514. These small changes suggest that the models were not primarily dependent on explicit APACHE-derived variables, site identifiers, or obvious leakage-prone predictors. In contrast, the strict reduced model that also removed glucose-like predictors produced a much larger decline in performance: LightGBM AUROC fell to 0.7432, and the ensemble AUROC fell to 0.7448. Thus, glucose-related admission variables remained the dominant predictive signal. This finding is clinically plausible for identifying documented diabetes status, but it is also the central interpretive limitation of the study. Early glycemic values in critically ill patients can reflect chronic diabetes, but they can also capture acute stress physiology, medication exposure, insulin or dextrose treatment, nutritional support, illness severity, and monitoring intensity. Therefore, the model should not be interpreted as learning chronic diabetes status alone.
Grouped hospital validation provided a stricter assessment of generalizability to unseen clinical sites. When tuned LightGBM was evaluated under hospital-level separation, AUROC decreased from 0.8537 in the random split to 0.8443, precision decreased from 0.5017 to 0.4546, and Brier score worsened from 0.1508 to 0.1596. Recall, however, remained essentially stable at 0.7684. This pattern suggests that some precision and ranking advantage in the random split may reflect within-hospital regularities that do not fully transfer to unseen hospitals. At the same time, the model retained useful case-finding ability after site separation. For multi-center ICU phenotyping, this distinction is important: row-level random validation estimates internal performance, whereas grouped hospital validation provides a more conservative estimate of transportability across sites.
The precision–recall analysis further emphasizes that the system is best interpreted as a screening-oriented phenotyping aid. Tuned LightGBM achieved an average precision of 0.622 under random validation and 0.551 under grouped hospital validation, showing that hospital-level separation increased the false-positive burden. At a grouped-site probability threshold of 0.4537, the model achieved recall of 0.8001 but precision of only 0.4314. This high-sensitivity operating point may be acceptable when the model is used to prompt chart review, support medication reconciliation, enrich research cohorts, or identify patients needing confirmation of diabetes history. However, it would be inappropriate for automated diagnostic labeling, treatment decisions, or interruptive clinical alerts without additional clinical confirmation. The model should therefore be framed as a computable phenotyping aid, not as a stand-alone diagnostic tool.
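One way to arrive at a high-sensitivity operating point of this kind is to scan candidate thresholds and keep the highest one that meets a target recall. The sketch below illustrates that rule in plain Python. How the reported threshold was actually selected is not stated here, so the selection rule, the 0.80 target, and the toy data are assumptions.

```python
def threshold_for_recall(y_true, y_prob, target_recall=0.80):
    """Return (threshold, precision, recall) for the highest threshold meeting the target."""
    n_pos = sum(y_true)
    for t in sorted(set(y_prob), reverse=True):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        recall = tp / n_pos
        if recall >= target_recall:
            return t, tp / (tp + fp), recall
    return None  # only reached if there are no scored records

# Hypothetical labels and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]

threshold, precision, recall = threshold_for_recall(y_true, y_prob)
```

Lowering the threshold to reach the recall target admits more false positives, which is the precision cost described above. With scikit-learn available, `precision_recall_curve` returns the same precision and recall pairs across all thresholds.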
This distinction is especially relevant in critical care. Admission-time diabetes status can help clinicians interpret early hyperglycemia and contextualize glycemic targets, but hyperglycemia in the ICU is etiologically heterogeneous. In a patient with established diabetes, elevated glucose may reflect chronic dysglycemia or baseline metabolic disease. In a patient without known diabetes, the same glucose value may reflect stress hyperglycemia, acute inflammatory and hormonal responses, sepsis physiology, glucocorticoid or vasopressor exposure, enteral or parenteral nutrition, dextrose-containing fluids, or early treatment effects. Thus, although glucose-related variables are informative for identifying documented diabetes status, their dominance means that model predictions may partly encode acute dysglycemia and clinical management patterns rather than chronic diabetes alone. This reinforces the need for cautious use and clinical review.
Calibration results also show why discrimination alone is insufficient. Although tuned LightGBM maintained strong recall and reasonable AUROC under both random and grouped validation, calibration curves and Brier scores showed that predicted probabilities were not fully reliable as absolute risk estimates. In both validation settings, predicted probabilities tended to overestimate the observed frequency of documented diabetes, particularly in the mid-to-high probability range, and the grouped-site Brier score was worse than the random-split Brier score. These findings indicate that the model can rank patients usefully while still producing probabilities that are not deployment-ready. If predicted probabilities are used clinically, post hoc recalibration, prospective monitoring, and periodic site-specific reassessment would be necessary.
The SHAP-based interpretability analysis provides additional support for both clinical plausibility and cautious interpretation. In the grouped-site LightGBM model, day-1 maximum glucose was the dominant predictor, followed by BMI, age, day-1 minimum glucose, hemoglobin, urine output, and other physiologic or laboratory variables. These predictors are clinically coherent because glycemic extrema, anthropometrics, age, renal and hematologic markers, and pre-ICU context may all relate to documented diabetes status or to the conditions under which diabetes is recognized and recorded. Importantly, hospital_id and icu_id were absent from the dominant predictor set because they were removed from the model matrix by design, supporting the leakage-mitigation strategy. However, SHAP results also reinforce the same limitation identified by ablation analysis: the dominant glycemic signal is clinically meaningful but not specific to chronic diabetes. Early glucose values may be shaped by both baseline disease and acute ICU physiology.
The SHAP dependence and local waterfall plots further showed that model behavior was nonlinear and context-dependent. The contribution of maximum glucose increased sharply across clinically meaningful glucose ranges and then plateaued at high values. BMI and age showed graded or threshold-like effects, whereas secondary predictors such as minimum glucose, hemoglobin, and urine output showed more complex patterns that may reflect interactions with illness severity, renal function, treatment response, or monitoring intensity. Local explanations demonstrated that individual predictions were rarely determined by one variable alone; rather, strong glycemic signals were modulated by anthropometric, renal, hematologic, and clinical-context variables. This improves auditability and interpretability, but it does not eliminate the need for clinical confirmation when distinguishing chronic diabetes from stress hyperglycemia or treatment-related dysglycemia.
The use of model-intrinsic class weighting and threshold-aware evaluation was appropriate for this imbalanced clinical phenotyping task. Rather than generating synthetic minority-class records, class weighting preserved the empirical patient distribution while increasing the training penalty for misclassifying diabetes-positive cases. In this study, positive-class weighting helped LightGBM achieve high sensitivity, while the soft-voting ensemble moderated the false-positive burden by combining sensitivity-oriented and more conservative boosted-tree models. These results support class weighting and explicit threshold selection as transparent, reproducible approaches for imbalanced ICU phenotyping. Nevertheless, threshold selection should be matched to the intended workflow rather than treated as a universal operating point.
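The effect of positive-class weighting on the training penalty can be shown with a weighted log-loss sketch. The negative-to-positive ratio used as the weight is a common convention and an assumption here, since the exact weighting configuration is not restated in this section. The toy labels are hypothetical.

```python
from math import log

def positive_class_weight(y):
    """Ratio of negative to positive records (a common weighting convention)."""
    n_pos = sum(y)
    return (len(y) - n_pos) / n_pos

def record_loss(y, p, pos_weight):
    """Weighted log loss for one record; positives are scaled by pos_weight."""
    w = pos_weight if y == 1 else 1.0
    return -w * (y * log(p) + (1 - y) * log(1 - p))

# Hypothetical imbalanced labels: 2 positives among 10 records.
y = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
w_pos = positive_class_weight(y)

# A confidently missed positive versus an equally confident false alarm.
missed_positive = record_loss(1, 0.1, w_pos)
false_alarm = record_loss(0, 0.9, w_pos)
ratio = missed_positive / false_alarm
```

With this weighting, a missed diabetes-positive record is penalized `w_pos` times as heavily as an equally confident false alarm, without adding any synthetic records to the training set. LightGBM exposes this kind of weighting through parameters such as `scale_pos_weight`.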
Several limitations should be considered. First, the outcome label represents documented diabetes status in the source data, not adjudicated chronic diabetes confirmed by longitudinal outpatient records, HbA1c, medication history, or manual chart review. As a result, the model may learn correlates of diabetes documentation as well as correlates of true chronic disease. Second, although grouped hospital validation is stronger than a row-level random split, it remains internal validation within the same public data release. External validation in independent ICU datasets is necessary before general clinical use. Third, the strong influence of glucose-like predictors creates a clinical ambiguity: early glucose values are highly informative for diabetes phenotyping, but they may also reflect stress hyperglycemia, treatment exposure, acute illness severity, or monitoring patterns. Fourth, calibration was diagnostic only; no recalibration model was fitted. Therefore, predicted probabilities should not be interpreted as deployment-ready clinical risk estimates without recalibration and prospective assessment.
Future work should prioritize external validation in independent ICU datasets, prospective evaluation across hospitals and time periods, recalibration when probabilities are used clinically, and threshold selection matched to specific workflow goals. Improved reference labeling is also needed. Ideally, future studies should distinguish documented chronic diabetes from previously undiagnosed diabetes, stress hyperglycemia, medication-related hyperglycemia, nutrition-associated dysglycemia, and treatment-related glucose changes using longitudinal EHR data, HbA1c, outpatient medication history, problem lists, diagnosis codes, and manual adjudication where feasible. Future extensions could also compare the present leakage-aware tree-based benchmark with deep tabular architectures, including TabNet, TabTransformer, FT-Transformer, or other Transformer-based models, under the same grouped validation, calibration, ablation, and interpretability framework.
Overall, this study provides a reproducible and clinically interpretable benchmark for ICU admission-time phenotyping of documented diabetes status in public multi-center EHR data. The findings show that leakage-aware gradient-boosted and ensemble tree models can achieve strong discrimination, that reasonable leakage mitigation and grouped-site validation do not eliminate useful case-finding performance, and that model evaluation should extend beyond AUROC to include precision–recall behavior, calibration, ablation analysis, and interpretability. At the same time, the dominance of glucose-related predictors requires careful clinical interpretation. The proposed system should be used primarily as a screening-oriented phenotyping aid for chart review, cohort enrichment, workflow support, or research classification, not as a stand-alone diagnostic tool. Clinical deployment would require external validation, recalibration, workflow-specific thresholding, and confirmatory clinical review to distinguish chronic diabetes from acute stress physiology and treatment-related dysglycemia.
5. Conclusions
Using the public WiDS Datathon 2021 tabular release derived from the GOSSIS initiative, this study developed and evaluated leakage-aware machine-learning models for identifying documented diabetes status among adult ICU stays at or near admission. The workflow emphasized split-aware preprocessing, removal of explicit site identifiers from the model matrix, prespecified feature-scenario ablations, model-intrinsic class weighting, threshold-aware evaluation, grouped hospital validation, calibration assessment, and SHAP-based interpretability. Together, these design choices provide a reproducible benchmark for ICU admission-time diabetes phenotyping in public multi-center EHR data.
In the primary leakage-mitigated random validation split, gradient-boosted and ensemble tree models achieved strong discrimination. The soft-voting ensemble provided the most balanced operating profile, with AUROC 0.8539, precision 0.5671, recall 0.6690, and F1-score 0.6138. Tuned LightGBM was the most sensitivity-oriented individual model, achieving recall 0.7677 and AUROC 0.8537, although with lower precision and a less favorable Brier score. Grouped hospital validation showed that tuned LightGBM retained case-finding ability at unseen hospitals, with recall remaining stable at 0.7684, while AUROC decreased to 0.8443 and precision declined to 0.4546. Precision–recall analysis further confirmed the intended-use trade-off: a grouped-site threshold of 0.4537 preserved high recall of 0.8001 but reduced precision to 0.4314. These findings indicate that LightGBM is most appropriate for high-sensitivity screening, whereas the ensemble may be preferable when false-positive burden is a major concern.
Ablation and interpretability analyses clarify the source and limitations of model performance. Removing leakage-prone and APACHE-related variables caused only modest performance reductions, supporting the robustness of the leakage-aware pipeline. In contrast, the strict reduced model that also excluded glucose-like predictors produced a marked decline in discrimination, confirming that glucose-related admission variables remained the dominant predictive signal. SHAP analysis was consistent with this finding, identifying day-1 maximum glucose, BMI, age, day-1 minimum glucose, hemoglobin, urine output, and related physiologic or laboratory variables as major contributors. Although these predictors are clinically plausible for identifying documented diabetes status, early glycemic measurements in critically ill patients may also partly capture acute stress physiology, treatment-related effects, monitoring intensity, nutritional support, or other forms of acute dysglycemia rather than chronic diabetes status alone.
Calibration results further emphasize that the model should not be interpreted as a deployment-ready diagnostic instrument. Calibration curves and Brier scores showed that predicted probabilities were not fully reliable as clinical risk estimates without further recalibration. Therefore, the proposed system should be interpreted primarily as a screening-oriented phenotyping aid for chart review, medication reconciliation, cohort enrichment, or workflow support, not as a stand-alone diagnostic tool.
Future work should prioritize external validation in independent ICU datasets, prospective evaluation across hospitals and time periods, workflow-specific threshold selection, post hoc recalibration when probabilities are used clinically, and improved reference labeling that better distinguishes chronic diabetes from stress hyperglycemia and treatment-related dysglycemia. Clear documentation of intended use remains essential for safe and reproducible application in critical care research and decision-support workflows.