A Calibrated Multi-Task Ensemble Architecture for Biomedical Risk Prediction

Khamitova, Zhainagul; Omarova, Gulmira; Akhmetzhanov, Madi; Burganova, Roza; Orynbassar, Maksym; Sabirova, Umida; Bukatayeva, Almagul; Barakova, Aliya; Jiyanmuratova, Gulnoz; Yuldasheva, Dilchekhra

doi:10.3390/computers15040244

by Zhainagul Khamitova ¹, Gulmira Omarova ¹,*, Madi Akhmetzhanov ²,*, Roza Burganova ³, Maksym Orynbassar ⁴, Umida Sabirova ⁵, Almagul Bukatayeva ⁶, Aliya Barakova ⁷, Gulnoz Jiyanmuratova ⁵ and Dilchekhra Yuldasheva ⁸

¹ Department of Information Systems, Faculty of Information Technology, L.N. Gumilyov Eurasian National University, Astana 010000, Kazakhstan

² Department of Applied Informatics and Programming, M. Kh. Dulaty Taraz University, Taraz 080000, Kazakhstan

³ Department of Social Work and Tourism, Esil University, Astana 010000, Kazakhstan

⁴ Department of Computer Science, Sh. Yessenov Caspian University of Technology and Engineering, Aktau 130000, Kazakhstan

⁵ Department of Sociology, National University of Uzbekistan named after Mirzo Ulugbek, Tashkent 100174, Uzbekistan

⁶ Department of Information Technology, Semey Medical University, Semey 071400, Kazakhstan

⁷ S.D. Asfendiyarov Kazakh National Medical University, Almaty 050000, Kazakhstan

⁸ Tashkent State Medical University, Tashkent 100174, Uzbekistan

* Authors to whom correspondence should be addressed.

Computers 2026, 15(4), 244; https://doi.org/10.3390/computers15040244

Submission received: 10 March 2026 / Revised: 10 April 2026 / Accepted: 11 April 2026 / Published: 15 April 2026

Abstract

Risk stratification of impaired glycemic control remains a major challenge in biomedical data analysis due to heterogeneous metabolic, behavioral, and therapeutic factors observed in large-scale populations. This study proposes a calibrated and interpretable decision–support framework, termed Calibrated Multi-Task Stacking Ensemble (CMSE), for joint modeling of clinically related glycemic outcomes. The framework integrates demographic variables, lipid profiles, renal and inflammatory biomarkers, dietary and smoking indicators, and therapy-related features within a unified predictive architecture. Robust modeling is ensured through leakage-aware preprocessing, quantile-based Winsorization, out-of-fold stacking, and isotonic calibration of probabilistic outputs. The physiological coherence between short-term and long-term glycemic markers is investigated using an explicit intertask coupling mechanism based on the estimated average glucose (eAG) ratio. Model interpretability is supported using SHAP analysis, mutual information, distance correlation, and feature importance metrics. In the primary medication-free screening configuration, the framework is evaluated on the NHANES 2017–March 2020 dataset, achieving ROC-AUC of 0.865 for diabetes classification and R² values of 0.385 and 0.366 for plasma glucose and HbA1c prediction, respectively. These results indicate that CMSE provides a reliable and explainable approach for calibrated glycemic risk assessment and clinical decision support.

Keywords: multitask stacking ensemble; probability calibration; explainable AI (XAI); SHAP analysis; biomedical risk prediction; clinical decision-support system; machine learning; biomedical data mining

1. Introduction

Machine learning has become an essential methodological paradigm for extracting knowledge from complex biomedical and population health datasets. Modern healthcare data are inherently multidimensional and heterogeneous, integrating demographic characteristics, laboratory biomarkers, physiological measurements, lifestyle factors, and therapeutic indicators. Analyzing such data requires computational models capable of capturing nonlinear dependencies, high-dimensional feature interactions, and hidden patterns that cannot be identified through traditional statistical approaches. Consequently, machine learning techniques are increasingly applied in large-scale biomedical prediction tasks, enabling the development of automated decision–support systems for risk assessment and disease prediction [1].

Among chronic metabolic conditions, impaired glucose regulation and type 2 diabetes mellitus (T2DM) represent major global health challenges. The worldwide prevalence of diabetes continues to increase, generating significant clinical, economic, and societal burdens and emphasizing the need for reliable predictive modeling tools capable of supporting early detection and risk stratification [2]. In real-world clinical and epidemiological settings, glycemic dysregulation does not depend on a single biomarker but arises from complex interactions among demographic factors, anthropometric indicators, lipid metabolism variables, renal biomarkers, inflammatory markers, dietary behaviors, and treatment-related variables. Such multidomain interactions introduce substantial nonlinearities and heterogeneity in the underlying data, which makes classical regression-based modeling approaches insufficient for accurate risk prediction.

Recent advances in explainable artificial intelligence (XAI) have further strengthened the role of machine learning in biomedical analytics by enabling interpretable modeling of complex predictive systems. Unlike traditional “black-box” predictive models, XAI approaches provide mechanisms for understanding the influence of individual variables on model predictions and for identifying the underlying drivers of risk estimates. Among these methods, SHAP (SHapley Additive exPlanations) has become one of the most widely adopted techniques for interpreting machine learning models, as it provides theoretically consistent local and global feature attribution explanations [3]. Moreover, empirical studies have demonstrated that interpretability is a crucial requirement for deploying predictive models in clinical environments, where decision–support systems must provide transparent and clinically meaningful explanations rather than purely numerical predictions [4].

Large-scale biomedical datasets provide a valuable foundation for developing and evaluating such predictive systems. One of the most comprehensive publicly available population health datasets is the National Health and Nutrition Examination Survey (NHANES), which integrates laboratory measurements, clinical examination data, dietary intake information, and behavioral indicators [5]. The Continuous NHANES 2017–March 2020 pre-pandemic cycle represents a nationally representative dataset suitable for population-scale modeling of metabolic health and glycemic risk. Previous studies have shown that machine learning techniques can effectively exploit the multidimensional structure of NHANES data to identify metabolic risk factors and predict chronic disease outcomes [6]. However, despite the richness of this dataset, most existing predictive studies treat glycemic indicators independently and rarely provide integrated modeling frameworks capable of jointly analyzing multiple related metabolic targets.

Recent research in machine learning for metabolic disorder prediction confirms that nonlinear models significantly outperform traditional statistical approaches when analyzing large epidemiological cohorts. Studies on metabolic syndrome classification demonstrate that incorporating lifestyle and behavioral factors into machine learning models can substantially improve predictive performance, particularly when combined with interpretable modeling techniques [7,8]. Similarly, large-scale biomedical resources such as the UK Biobank have demonstrated that multivariate biomarker profiles can be effectively used for predicting disease risk and metabolic abnormalities [9]. Furthermore, epidemiological studies indicate that lipid metabolism markers, inflammatory indicators, and renal biomarkers are strongly associated with impaired glycemic regulation and cardiometabolic risk [10]. In diabetes-related cardiovascular risk modeling, explainable machine learning approaches have also been used to identify key metabolic and lifestyle predictors, highlighting the importance of interpretable feature attribution for clinical insight generation [11].

Despite these advances, several methodological limitations remain in current machine learning approaches for glycemic risk modeling. A common limitation is the fragmented treatment of clinically related prediction targets. In clinical practice, plasma glucose reflects short-term glycemic status, whereas glycated hemoglobin (HbA1c) captures long-term glycemic exposure. Their combined interpretation is essential for assessing metabolic stability and identifying dysglycemic states. However, many predictive models treat these indicators independently, ignoring their physiological coupling. Multitask learning (MTL) has emerged as a promising strategy for addressing this limitation by enabling simultaneous prediction of multiple related outcomes within a shared learning framework. By exploiting shared representations across correlated targets, multitask models can improve generalization performance and predictive robustness [12]. Recent multitask approaches have been applied in glucose forecasting and glycemic event prediction, demonstrating improved stability when multiple metabolic indicators are modeled jointly [13]. In addition, attention-based architectures and transformer-based models have been shown to enhance predictive performance in glucose-related prediction tasks by capturing complex temporal dependencies in biomedical signals [14]. External validation studies in continuous glucose monitoring further emphasize the importance of model generalization and reproducibility in real-world healthcare environments [15].

Another major challenge in machine learning-based risk prediction systems concerns the reliability of predicted probabilities. Ensemble learning algorithms such as random forests, gradient boosting, and stacked architectures have demonstrated high predictive performance in heterogeneous biomedical datasets [16]. However, high classification accuracy does not necessarily imply reliable probability estimates. Many machine learning models produce poorly calibrated predictions, leading to overconfident or distorted probability outputs. Such calibration errors may significantly reduce the reliability of decision–support systems, particularly in clinical screening scenarios where risk thresholds determine intervention strategies. Previous methodological studies have highlighted calibration as one of the critical weaknesses of predictive models in medical applications [17,18]. Therefore, ensuring probabilistic reliability has become an essential requirement for machine learning-based clinical decision–support systems.

To address the limitations of fragmented outcome modeling, insufficient probability calibration, and limited interpretability, this study proposes a unified computational framework termed the Calibrated Multi-Task Stacking Ensemble (CMSE). The proposed framework jointly models three clinically relevant outcomes: (i) laboratory-defined diabetes status (binary classification), (ii) plasma glucose concentration (regression), and (iii) HbA1c percentage (regression). The CMSE architecture integrates heterogeneous predictors including demographic characteristics, anthropometric indicators, lipid metabolism markers, renal and inflammatory biomarkers, behavioral factors, and therapy-related variables. Robustness is achieved through leakage-aware preprocessing procedures and quantile-based Winsorization, combined with out-of-fold stacking to ensure stable model generalization across resampling protocols. Probabilistic reliability is further enhanced through isotonic regression calibration applied to classifier outputs, enabling well-calibrated risk estimates suitable for decision–support scenarios. Additionally, physiological coherence between glucose and HbA1c predictions is supported through an explicit cross-task coupling mechanism based on the estimated average glucose (eAG) relationship, allowing structured information exchange between short-term and long-term glycemic indicators.

Model transparency is addressed through a multi-perspective interpretability framework that integrates SHAP-based feature attribution with complementary dependency analysis methods, including mutual information estimation, distance correlation analysis, and model-based feature importance evaluation. This interpretability layer enables detailed analysis of the factors influencing model predictions and supports the identification of dominant metabolic drivers associated with impaired glycemic control. By combining predictive performance, calibration reliability, and interpretability within a unified computational architecture, the proposed framework aims to provide a transparent and robust machine learning system for population-level metabolic risk modeling.

The objective of this study is therefore to develop an interpretable and calibrated multitask machine learning framework capable of accurately modeling glycemic risk indicators in heterogeneous biomedical datasets. Specifically, the research aims to construct a unified multitask architecture integrating diabetes classification with regression-based prediction of glucose and HbA1c levels, incorporate multidomain biomedical predictors, implement leakage-aware preprocessing and robust learning strategies suitable for real-world health data, ensure probabilistic reliability through explicit calibration mechanisms, and provide interpretable explanations through multi-method XAI analysis. The central methodological contribution of this study lies in the joint modeling of short-term and long-term glycemic metrics within a single multi-task architecture. Specifically, the proposed framework explicitly accounts for the physiological relationship between plasma glucose and HbA1c through cross-task design based on eAG, combining this mechanism with probability calibration and interpretable ensemble learning.

The main contributions of this work can be summarized as follows. First, a novel CMSE is proposed for the joint prediction of diabetes status, plasma glucose, and HbA1c. Second, a leakage-aware preprocessing and stacking pipeline is implemented to improve generalization in heterogeneous population health data. Third, an eAG-based inter-task coupling mechanism that links short-term and long-term glycemic markers is introduced and evaluated as an additional component of the framework. Fourth, a comprehensive interpretability framework is developed by integrating SHAP explanations with complementary dependency-based relevance analysis methods. Finally, extensive experiments on the NHANES 2017–March 2020 dataset demonstrate strong predictive performance and reliable probability calibration, supporting the applicability of the proposed framework for interpretable biomedical decision-support systems.

Overall, this work contributes to the development of calibrated and interpretable machine learning frameworks for metabolic risk modeling by integrating multitask prediction, probability calibration, and explainable artificial intelligence within a unified computational architecture.

2. Materials and Methods

2.1. Data Preparation

This study utilized data from the NHANES 2017–March 2020 pre-pandemic release, which integrates laboratory measurements with linked demographic, dietary, and clinical examination components. The dataset represents the official combination of the complete 2017–2018 cycle and the partially collected 2019–March 2020 cycle, which was discontinued due to the COVID-19 pandemic. The resulting merged cohort preserves national representativeness and provides a comprehensive population-level benchmark for analyzing metabolic and glycemic health indicators. In total, the NHANES 2017–March 2020 release includes 15,560 interview records and 14,300 clinical examinations, forming a large-scale basis for predictive modeling in heterogeneous real-world conditions.

The analytic dataset was created by merging relevant NHANES modules using the SEQN participant identifier. Data preparation involved harmonization of coding across cycles, unit conversion and normalization, selection of diabetes-relevant predictors, and rigorous quality control procedures. Rows with missing target values were excluded to ensure consistency across the multitask learning setting. The final dataset contains complete target information and includes only clinically meaningful features required for glycemic risk modeling. The predictive framework was designed to jointly model three clinically important outcomes: a binary indicator of laboratory-defined diabetes status (is_diabetes_labs_only), plasma glucose concentration (mg/dL), and HbA1c level (%). For descriptive analysis, continuous glycemic targets were additionally stratified into clinically interpretable categories (normal, prediabetic, and diabetic) based on widely accepted diagnostic thresholds. This stratification is used exclusively for cohort characterization and visualization and does not influence the training targets.

The distribution of the binary laboratory-defined diabetes label is shown in Figure 1. The majority of participants belong to the non-diabetic class (n = 3715), while n = 774 participants (approximately 17%) meet laboratory criteria consistent with diabetes. This class imbalance matters for model development: accuracy-based metrics can be biased toward the dominant class, and models trained without imbalance-aware evaluation may underestimate diabetes risk probabilities. For hold-out evaluation, the dataset was randomly divided into training and test subsets using an 80/20 split, yielding N_train = 3694 training samples and N_test = 924 test samples. Stratified sampling with respect to the binary diabetes label was used to preserve class proportions in both subsets and to support stable estimation under class imbalance.
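A stratified 80/20 split of this kind can be sketched with scikit-learn as follows. The data here are synthetic stand-ins with roughly 17% positives, and the seed value is an assumption for illustration; neither reflects the actual NHANES cohort or the seed used in the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the analytic dataset (NOT the NHANES cohort).
rng = np.random.default_rng(0)
n_samples = 1000
X = rng.normal(size=(n_samples, 5))
y = (rng.random(n_samples) < 0.17).astype(int)  # ~17% positive class

# 80/20 split, stratified on the binary diabetes label so that
# both subsets keep approximately the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # seed is illustrative
)

print(f"train prevalence: {y_train.mean():.3f}")
print(f"test prevalence:  {y_test.mean():.3f}")
```

Because the split is stratified, the two printed prevalences differ by at most the effect of rounding a single sample per class.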

The distribution of plasma glucose values across clinical ranges is illustrated in Figure 2. Most participants exhibit glucose concentrations within the physiological range (<100 mg/dL; n = 2729), followed by the prediabetic interval (100–125 mg/dL; n = 1207). The smallest subgroup corresponds to glucose values within the diabetic range (≥126 mg/dL; n = 553). This pattern reflects a typical glycemic spectrum and emphasizes the importance of accurate modeling of the transitional prediabetic stage, which represents a clinically meaningful window for early intervention and risk prevention.

The categorical distribution of HbA1c levels is shown in Figure 3. The largest subgroup corresponds to normal HbA1c values (<5.7%; n = 3158), followed by the prediabetic range (5.7–6.4%; n = 758) and diabetic-range HbA1c (≥6.5%; n = 573). Compared with plasma glucose, HbA1c reflects longer-term glycemic exposure over approximately 8–12 weeks and therefore exhibits greater stability. This property supports the clinical motivation for jointly modeling plasma glucose and HbA1c within a unified multitask prediction framework.
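The descriptive stratification of both glycemic targets can be illustrated with a small helper. The thresholds are the ones quoted above (100 and 126 mg/dL for plasma glucose; 5.7% and 6.5% for HbA1c); the function and variable names are hypothetical, and the sketch assumes values are recorded at the resolution implied by the stated ranges.

```python
import numpy as np

LABELS = np.array(["normal", "prediabetic", "diabetic"])
GLUCOSE_CUTS = [100.0, 126.0]  # mg/dL: <100, 100-125, >=126
HBA1C_CUTS = [5.7, 6.5]        # %:     <5.7, 5.7-6.4, >=6.5

def categorize(values, cuts):
    """Map continuous values to clinical categories.

    np.digitize returns 0 below the first cut, 1 between the cuts
    (lower bound inclusive), and 2 at or above the second cut.
    """
    return LABELS[np.digitize(values, cuts)]

glucose = np.array([92, 100, 125, 126, 180])
hba1c = np.array([5.2, 5.7, 6.4, 6.5, 8.1])

glucose_cat = categorize(glucose, GLUCOSE_CUTS)
hba1c_cat = categorize(hba1c, HBA1C_CUTS)
print(glucose_cat.tolist())
print(hba1c_cat.tolist())
```

As in the study, such labels would serve only for cohort description and plotting; the regression targets remain the continuous values.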

The final analytical dataset includes a heterogeneous set of predictors spanning demographic variables, anthropometric and physical measurements, clinical examination indicators, lipid profile biomarkers (HDL, LDL, total cholesterol, triglycerides), inflammatory and iron-related markers (e.g., hs-CRP, ferritin), renal biomarkers (serum creatinine, urine albumin-to-creatinine ratio), electrolyte measurements (sodium, potassium, chloride), behavioral exposure variables (cotinine and its logarithmic transformation), psychological screening aggregates, dietary intake summaries, and therapy-related indicators. This multidomain feature space provides a realistic and clinically meaningful foundation for constructing a decision–support framework capable of capturing nonlinear metabolic dependencies and complex risk interactions.

2.2. Hyperparameter Architecture and Configuration of the CMSE

The proposed CMSE framework is implemented as a structured ensemble pipeline that combines leakage-aware preprocessing, heterogeneous base learners, out-of-fold (OOF) stacking, probability calibration, and meta-level aggregation for multitask prediction. The configuration was designed to ensure reproducibility and stable performance across resampling protocols, while maintaining a clear separation between training and validation data. All analyses were performed using Python (v3.9). The following libraries were used: NumPy (v1.23), Pandas (v1.5), Scikit-learn (v1.2), and TensorFlow (v2.10). All hyperparameters were fixed prior to evaluation and are summarized in Table 1, Table 2, Table 3 and Table 4.

Table 1 reports the global configuration settings that control preprocessing, OOF stacking, calibration, and computational reproducibility. Outlier sensitivity is reduced through Winsorization, i.e., quantile clipping to the 0.5–99.5% range, applied only to continuous predictors. Feature scaling is performed using a StandardScaler fitted exclusively on the training split and then reused for validation and test partitions, preventing information leakage. Numerical stability is ensured through deterministic sanitization of missing and infinite values. For stacking, StratifiedKFold is used for the classification task to preserve class proportions, while standard KFold is used for regression tasks. Classification probabilities are calibrated using isotonic regression trained on OOF scores, producing a monotonic mapping to improve reliability of risk estimates. A fixed decision threshold of 0.5 is used for threshold-based metrics to maintain comparability across models. The optional eAG coupling component is parameterized through a coefficient λ; in the default configuration λ = 0.0, meaning that coupling is disabled unless explicitly activated. Additional ablation experiments evaluating λ > 0 are presented in Table 5. A unified random seed and controlled parallelism were used across all experiments to ensure full reproducibility.

Missing values in continuous biomedical and anthropometric measures were not imputed with zero values. Instead, a median imputation strategy was used for truly missing data, ensuring biological plausibility and robustness. In the core set of laboratory measures, median imputation was required only for a limited number of variables, including UACR (n = 77), hs-CRP (n = 17), and LDL (n = 11). In the expanded set of measures, missing values were more pronounced for individual variables, such as dpq_total (n = 824), systolic blood pressure (n = 454), and diastolic blood pressure (n = 454). It is important to note that the NaN = 0.0 rule specified in Table 1 applies only to the technical cleaning step applied after preprocessing to handle rare residual numerical artifacts (e.g., infinities or undefined values) and does not affect the statistical imputation of missing clinical data. Therefore, biologically implausible zero values were not introduced into continuous characteristics before standardization. The distribution of median imputation values by main laboratory characteristics is presented in Table 2.
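The two-stage handling described above (median imputation for genuinely missing clinical values, followed by a purely technical cleanup of residual non-finite numbers) might look like the sketch below. It assumes that medians are estimated on the training split only and that infinities are mapped to 0.0 alongside NaN; both are assumptions for illustration, and the toy arrays are not study data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrices with missing entries (two hypothetical lab features).
X_train = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [100.0, 40.0],
])
X_test = np.array([[np.nan, np.nan]])

# Stage 1: median imputation, with medians learned from training data only.
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

# Stage 2: technical sanitization of any residual non-finite values
# (assumed mapping: NaN and +/-inf -> 0.0). After stage 1 this is a
# no-op for ordinary missing data, so no artificial zeros are introduced.
X_train_clean = np.nan_to_num(X_train_imp, nan=0.0, posinf=0.0, neginf=0.0)
X_test_clean = np.nan_to_num(X_test_imp, nan=0.0, posinf=0.0, neginf=0.0)

print(imputer.statistics_)  # per-feature training medians
print(X_test_clean)
```

Using the median rather than the mean keeps the imputed value insensitive to extreme observations such as the 100.0 in the first column.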

The overall picture of missing data in the extended feature set before imputation is presented in Table 3.

As shown in Table 3, missing values are distributed unevenly across the extended feature set. The highest number of missing values is observed in dpq_total (n = 824), blood pressure measurements (bp_sys and bp_dia, n = 454 each), and dietary variables (n = 377), while most laboratory parameters have full or near-full data coverage. This pattern indicates that missing values are primarily concentrated in behavioral, questionnaire-based, and lifestyle-related variables rather than in core biochemical measurements, supporting the feasibility of using a consistent median imputation strategy without introducing bias into clinically important features.

Table 4 summarizes the heterogeneous set of baseline classifiers used in the CMSE classification branch. The design intentionally combines stochastic tree ensembles, gradient boosting, and linear models to increase diversity and reduce correlated errors. ExtraTreesClassifier provides a high-variance ensemble with a large number of randomized trees, improving robustness on high-dimensional tabular biomedical data. HistGradientBoostingClassifier offers computationally efficient boosting with strong generalization. LogisticRegression serves as an interpretable linear component and is configured with class balancing to reduce bias under class imbalance. Finally, XGBoost (v1.7.6) and CatBoost (v1.2) introduce high-capacity boosted decision tree learners that improve sensitivity to nonlinear feature interactions. Together, these models form a diverse classification ensemble suitable for stacking.

Table 5 presents the regression models used for plasma glucose and HbA1c prediction. Similar to the classification branch, model diversity is emphasized to capture both linear and nonlinear response patterns. A basic linear benchmark model using ordinary least squares (OLS) was included in the regression branch. This model was implemented explicitly without regularization (α = 0), ensuring full equivalence to the classical OLS model and avoiding ambiguities with the ridge regression formulation. Extra Trees Regressor and Hist Gradient Boosting Regressor provide strong performance on heterogeneous biomedical predictors and improve robustness under noisy distributions. XGBRegressor and CatBoostRegressor further enhance nonlinear approximation capacity and improve accuracy through boosted tree learning. This mixture of regression learners forms a stable foundation for multitask stacking.

Table 6 describes the meta-level configuration and calibration stage used to merge base learner outputs into the final multitask predictions. Classification channels are first calibrated using isotonic regression trained on OOF scores. The calibrated outputs are then aggregated into a meta-feature representation that includes raw calibrated probabilities, distribution statistics (mean, standard deviation, min, max), and cross-task signals derived from regression outputs. The final classification head is implemented using logistic regression to preserve interpretability at the meta-decision level. For glucose and HbA1c regression, the meta-heads are implemented using Ridge regression with α = 0.5. Meta-features include base regressor outputs, squared terms, and quantile-based aggregates, as well as conditioning terms based on the final diabetes probability.
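As a rough illustration of the regression meta-head, the sketch below builds meta-features from base regressor outputs, their squares, row-wise quantile aggregates, and a diabetes probability, and then fits a Ridge model with α = 0.5. The specific quantiles (25th, 50th, 75th), the synthetic data, and the helper name are assumptions; the exact feature set is the one given in Table 6.

```python
import numpy as np
from sklearn.linear_model import Ridge

def regression_meta_features(R, p_diabetes):
    """Stack meta-features for one regression task.

    R          : (n_samples, K) OOF outputs of K base regressors
    p_diabetes : (n_samples,) diabetes probability used for conditioning
    """
    # Row-wise quantiles across base regressors (assumed choice of levels).
    q25, q50, q75 = np.quantile(R, [0.25, 0.5, 0.75], axis=1)
    return np.column_stack([R, R ** 2, q25, q50, q75, p_diabetes])

# Synthetic example: 4 noisy base regressors around a glucose-like target.
rng = np.random.default_rng(0)
n_samples, n_base = 200, 4
y = rng.normal(100.0, 15.0, size=n_samples)
R = y[:, None] + rng.normal(0.0, 5.0, size=(n_samples, n_base))
p = rng.random(n_samples)

M = regression_meta_features(R, p)
meta_head = Ridge(alpha=0.5).fit(M, y)
y_hat = meta_head.predict(M)
print(M.shape, y_hat.shape)
```

A linear meta-head keeps the final combination step inspectable through its coefficients, while nonlinearity enters only through the base learners and the squared terms.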

Overall, the selected hyperparameter configuration provides a reproducible template for calibrated multitask stacking under heterogeneous biomedical feature distributions. The combination of leakage-aware preprocessing, OOF stacking, isotonic probability calibration, and interpretable meta-heads ensures that predictive performance, probability reliability, and computational stability remain balanced across tasks. This design also supports transparent comparison with baseline learners and enables systematic extension of CMSE to other population-specific cohorts and multitask clinical prediction scenarios.

2.3. Description of the Calibrated Multitask Stacking Ensemble

The proposed Calibrated Multitask Stacking Ensemble (CMSE) is designed as a unified decision–support model that jointly predicts three clinically relevant outputs: (i) the probability of laboratory-defined diabetes, (ii) plasma glucose concentration (mg/dL), and (iii) HbA1c level (%). The complete computational workflow of CMSE is illustrated in Figure 4, where the flow of information is shown from raw features to final multitask predictions.

At the input stage, preprocessing is performed in a strict train-only regime. Continuous predictors are Winsorized using extreme quantile clipping and standardized using scaling parameters computed exclusively from the training split. Binary features are transferred without scaling. This strategy stabilizes heavy-tailed biomedical variables, preserves the meaning of indicator features, and prevents information leakage during model evaluation. The resulting transformed feature vector is then processed by three parallel model branches: one classification branch and two regression branches corresponding to plasma glucose and HbA1c prediction.

Each branch is implemented as a heterogeneous ensemble of models including randomized tree ensembles, gradient boosting algorithms, and linear baselines. Such diversity is intentionally introduced to reduce error correlation between learners and improve robustness under heterogeneous population-level data. During training, OOF predictions are generated using K-fold splitting, ensuring that each meta-level training instance is computed from models that were not fitted on that instance. For the classification task, the OOF outputs of each base classifier are subsequently passed through independent isotonic regression calibrators, producing monotonic probability mappings and improving the reliability of predicted risk scores.
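Out-of-fold score generation for the classification branch can be sketched with scikit-learn's `cross_val_predict`. Only two base learners are shown, on synthetic data, with small illustrative hyperparameters; the actual model set and settings are those listed in the configuration tables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic imbalanced data standing in for the training split.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.83, 0.17], random_state=0
)

base_classifiers = {
    "extra_trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
}

# Stratified folds preserve the class ratio inside every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each column holds one channel's OOF score: every row is predicted by a
# model that never saw that row during fitting.
oof_scores = np.column_stack([
    cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
    for clf in base_classifiers.values()
])
print(oof_scores.shape)
```

These OOF columns are what the per-channel calibrators and the meta-classifier would be trained on, which is what keeps the meta-level free of training-set leakage.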

At the stacking level, CMSE constructs task-specific meta-feature representations. For the classification output, the meta-vector contains calibrated probabilities from all base classification channels, statistical aggregates of regression predictions, and an inter-task residual term derived from the physiological relationship between plasma glucose and HbA1c expressed through the estimated average glucose formulation. This residual captures consistency deviations between short-term glucose prediction and long-term glycemic exposure, enabling the meta-classifier to incorporate clinically meaningful cross-task information.

For the regression tasks, meta-features include the base regressor outputs, quadratic expansions, and dispersion statistics. In addition, regression meta-vectors incorporate the diabetes meta-probability, which enables directed conditioning of continuous glycemic estimates on the predicted diabetes risk. An eAG-based coupling mechanism is included as an additional regularization component, enabling controlled information exchange between glucose and HbA1c predictions. Its practical impact is assessed separately using ablation analysis.

Final multitask predictions are produced by interpretable meta-heads: logistic regression is used for diabetes probability estimation, while ridge regression is used for plasma glucose and HbA1c prediction. This choice preserves stability and interpretability at the decision layer, while the nonlinear modeling capacity remains captured within the base ensembles and meta-feature construction. After OOF-based training is completed, all base learners are retrained on the full training data, and inference is performed using fixed calibration mappings and trained meta-heads. A decision threshold of 0.5 is applied for binary evaluation to ensure consistency across comparative baselines.

To formalize the CMSE pipeline, a mathematical formulation is provided below, describing preprocessing, OOF stacking, calibration, meta-feature construction, and multitask prediction. Let the dataset be defined as (1):

$$D = \{(x_i, y_i^{(c)}, y_i^{(g)}, y_i^{(a)})\}_{i=1}^{N} \tag{1}$$

where $x_i \in \mathbb{R}^{p}$ is the feature vector, $y_i^{(c)} \in \{0, 1\}$ is the diabetes status, $y_i^{(g)} \in \mathbb{R}$ is the plasma glucose concentration (mg/dL), and $y_i^{(a)} \in \mathbb{R}$ is the HbA1c level (%). Rows with missing target values are removed.

The features are partitioned into a continuous set $C$ and a binary set $B$. For each continuous feature $j \in C$, the clipping quantiles are computed on the training split (2):

$$q_j^{lo} = \mathrm{Quantile}_{0.005}(x_{\cdot j}), \quad q_j^{hi} = \mathrm{Quantile}_{0.995}(x_{\cdot j}) \tag{2}$$

Winsorization is applied as (3):

$$\tilde{x}_{ij} = \min\{\max(x_{ij}, q_j^{lo}),\, q_j^{hi}\} \tag{3}$$

Standardization is then performed using training statistics (4)–(6):

$$z_{ij} = \frac{\tilde{x}_{ij} - \mu_j}{\sigma_j} \tag{4}$$

where

$$\mu_j = \frac{1}{n} \sum_{i \in \mathrm{train}} \tilde{x}_{ij} \tag{5}$$

$$\sigma_j = \sqrt{\frac{1}{n} \sum_{i \in \mathrm{train}} (\tilde{x}_{ij} - \mu_j)^2} \tag{6}$$

and $n$ is the number of training observations. For binary features $j \in B$, $z_{ij} = x_{ij}$. The final transformation is $\phi(x_i) = [z_{i1}, \dots, z_{ip}]$. The same $(q, \mu, \sigma)$, estimated on the training split, are applied unchanged to the validation and test data. A set of base models is trained using K-fold OOF evaluation.
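The leakage-aware preprocessing in (2)–(6) can be sketched for a single continuous feature as follows. This is a minimal illustration, not the authors' implementation: the helper names (`quantile`, `fit_preprocessor`, `transform`) and the linear-interpolation quantile rule are assumptions, and the data are synthetic.

```python
import math

def quantile(values, q):
    """Empirical quantile with linear interpolation (an assumed convention)."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def fit_preprocessor(train_col, q_lo=0.005, q_hi=0.995):
    """Estimate clipping bounds, mean and std on the training split only."""
    lo, hi = quantile(train_col, q_lo), quantile(train_col, q_hi)   # Eq. (2)
    clipped = [min(max(x, lo), hi) for x in train_col]              # Eq. (3)
    mu = sum(clipped) / len(clipped)                                # Eq. (5)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in clipped) / len(clipped))  # Eq. (6)
    return {"lo": lo, "hi": hi, "mu": mu, "sigma": sigma}

def transform(col, params):
    """Apply the frozen training parameters to any split (Eq. 3, then Eq. 4).
    A constant training column (sigma == 0) would need separate handling."""
    return [
        (min(max(x, params["lo"]), params["hi"]) - params["mu"]) / params["sigma"]
        for x in col
    ]

# Synthetic training column with one extreme outlier, plus unseen test values.
train = [float(v) for v in range(1, 201)] + [5000.0]
test = [-100.0, 100.0, 9999.0]
params = fit_preprocessor(train)
z_train = transform(train, params)
z_test = transform(test, params)
```

Because the bounds are fitted on the training column alone, the extreme test value 9999.0 is clipped to the same upper bound as the training outlier and receives the same standardized value.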

For regression tasks (glucose and HbA1c), OOF predictions are (7):

$$\hat{r}_{ik}^{(g)} = f_k^{(g)}(\phi(x_i)), \quad \hat{r}_{ik}^{(a)} = f_k^{(a)}(\phi(x_i)) \tag{7}$$

and for classification, each channel $k$ produces an OOF score $\hat{s}_{ik}$.

To obtain calibrated probabilities, isotonic regression is trained per channel (8):

$$\gamma_k = \arg\min_{\gamma \uparrow} \sum_{i \in \mathrm{oof}} \left(y_i^{(c)} - \gamma(\hat{s}_{ik})\right)^2 \tag{8}$$

yielding calibrated outputs $\hat{p}_{ik} = \gamma_k(\hat{s}_{ik})$.
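The monotone least-squares problem in (8) is commonly solved with the pool-adjacent-violators algorithm (PAVA). The sketch below is a pure-Python illustration for one channel, under assumptions: the function names are hypothetical, the fitted mapping is treated as a step function with constant extrapolation, and the scores and labels are synthetic. The source does not state which solver or interpolation rule was used.

```python
from itertools import groupby

def fit_isotonic(scores, labels):
    """Least-squares non-decreasing step fit via pool-adjacent-violators."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block: [label_sum, count, largest_score_in_block]
    for score, group in groupby(pairs, key=lambda t: t[0]):
        ys = [y for _, y in group]  # tied scores share one fitted value
        blocks.append([float(sum(ys)), len(ys), score])
        # Pool backwards while adjacent block means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]

def calibrate(score, steps):
    """Map a raw score to its calibrated probability (step-function lookup)."""
    for top, value in steps:
        if score <= top:
            return value
    return steps[-1][1]

# Synthetic OOF scores and labels for a single classification channel.
oof_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
oof_labels = [0, 1, 0, 0, 1, 1]
steps = fit_isotonic(oof_scores, oof_labels)
p_mid = calibrate(0.25, steps)   # falls in the pooled block covering 0.2-0.4
p_high = calibrate(0.9, steps)   # beyond the largest training score
```

In this toy example the scores 0.2, 0.3 and 0.4 are pooled into one block because their labels (1, 0, 0) violate monotonicity, so every score in that range is mapped to the block mean of 1/3.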

Isotonic regression was selected because it provides a flexible, non-parametric, monotonic calibration mapping and is well suited to the present setting, where sufficient out-of-fold predictions are available from a relatively large cohort. Compared with Platt scaling, isotonic calibration imposes fewer parametric assumptions and is therefore more appropriate for heterogeneous biomedical probability distributions.

Aggregated meta-statistics are defined as (9) and (10):

$$\bar{p}_i = \frac{1}{K_c} \sum_{k} \hat{p}_{ik}, \quad \mathrm{std}_p(i) = \mathrm{std}\{\hat{p}_{i1}, \dots, \hat{p}_{iK_c}\}, \quad p_i^{\min}, \quad p_i^{\max} \tag{9}$$

$$\bar{r}_i^{(g)} = \frac{1}{K_g} \sum_{k} \hat{r}_{ik}^{(g)}, \quad \bar{r}_i^{(a)} = \frac{1}{K_a} \sum_{k} \hat{r}_{ik}^{(a)} \tag{10}$$

The relationship between HbA1c and eAG is defined according to the clinical standard proposed by Nathan et al. [19], which provides a validated physiological mapping between long-term glycemic control and average glucose levels: eAG = 28.7 × HbA1c − 46.7.

The eAG residual is defined as (11):

$$\delta_i^{eAG} = \bar{r}_i^{(g)} - \left(L\,\bar{r}_i^{(a)} + B\right) \tag{11}$$

where $L = 28.7$ and $B = -46.7$ are the slope and intercept of the eAG relationship above.
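As a small worked example of the residual in (11), the sketch below assumes that L and B are the slope and intercept of the stated eAG relationship (eAG = 28.7 × HbA1c − 46.7). The function names and the input values are illustrative only.

```python
EAG_SLOPE = 28.7       # assumed value of L, taken from the eAG formula
EAG_INTERCEPT = -46.7  # assumed value of B, taken from the eAG formula

def eag_from_hba1c(hba1c_percent):
    """Estimated average glucose (mg/dL) implied by an HbA1c value (%)."""
    return EAG_SLOPE * hba1c_percent + EAG_INTERCEPT

def eag_residual(mean_glucose_pred, mean_hba1c_pred):
    """Eq. (11): predicted glucose minus the glucose implied by predicted HbA1c."""
    return mean_glucose_pred - eag_from_hba1c(mean_hba1c_pred)

implied = eag_from_hba1c(6.0)        # about 125.5 mg/dL
residual = eag_residual(110.0, 6.0)  # about -15.5 mg/dL
```

A negative residual means the averaged glucose prediction is lower than the level implied by the averaged HbA1c prediction; a value near zero indicates that the two regression channels agree.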

The classification meta-feature vector is then (12):

$$h_i = [\hat{p}_{i1}, \dots, \hat{p}_{iK_c}, \bar{p}_i, \mathrm{std}_p(i), p_i^{\min}, p_i^{\max}, \bar{r}_i^{(g)}, \bar{r}_i^{(a)}, \delta_i^{eAG}] \tag{12}$$

while regression meta-features for $t \in \{g, a\}$ are (13):

$$u_i^{(t)} = [\hat{r}_{i1}^{(t)}, \dots, \hat{r}_{iK_t}^{(t)}, (\hat{r}_{i1}^{(t)})^2, \dots, (\hat{r}_{iK_t}^{(t)})^2, \bar{r}_i^{(t)}, \mathrm{std}_r^{(t)}(i), q_{0.10}^{(t)}(i), q_{0.90}^{(t)}(i), p_i, p_i^2] \tag{13}$$
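The two meta-vectors in (12) and (13) can be assembled per sample as sketched below. Several details are assumptions rather than statements from the source: the population standard deviation is used for the dispersion terms, the minimum and maximum are taken across calibrated channels, the quantiles are taken across base regressors with linear interpolation, and all function names and inputs are hypothetical.

```python
import math
import statistics

def quantile(values, q):
    """Empirical quantile with linear interpolation (an assumed convention)."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def classification_meta(p_cal, r_glucose, r_hba1c, slope=28.7, intercept=-46.7):
    """Assemble h_i (Eq. 12) from calibrated channels and base regressor outputs."""
    r_g = statistics.fmean(r_glucose)
    r_a = statistics.fmean(r_hba1c)
    delta = r_g - (slope * r_a + intercept)  # eAG residual, Eq. (11)
    return [*p_cal, statistics.fmean(p_cal), statistics.pstdev(p_cal),
            min(p_cal), max(p_cal), r_g, r_a, delta]

def regression_meta(r_task, p_meta):
    """Assemble u_i^(t) (Eq. 13) for one regression task t."""
    return [*r_task, *(r * r for r in r_task),
            statistics.fmean(r_task), statistics.pstdev(r_task),
            quantile(r_task, 0.10), quantile(r_task, 0.90),
            p_meta, p_meta ** 2]

# Illustrative per-sample inputs: 3 classification channels, 2-3 base regressors.
h = classification_meta([0.2, 0.3, 0.4], [100.0, 110.0], [5.5, 5.7])
u = regression_meta([100.0, 110.0, 120.0], p_meta=0.3)
```

With K_c calibrated channels the classification vector has K_c + 7 entries, and with K_t base regressors each regression vector has 2·K_t + 6 entries, which matches the term counts in (12) and (13).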

The meta-classifier is defined as logistic regression (14):

$$p_i = \Pr(y_i^{(c)} = 1 \mid h_i) = \sigma(w_c^{\top} h_i + b_c), \quad \sigma(z) = \frac{1}{1 + e^{-z}} \tag{14}$$

and the regression meta-heads are ridge models (15):

$$\hat{y}_i^{(t)} = w_t^{\top} u_i^{(t)} + b_t \tag{15}$$

where the parameters are estimated by regularized least squares.
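For reference, the regularized least-squares estimate of a ridge meta-head is conventionally written as below. The penalty weight $\alpha$, the unpenalized intercept, and the use of the OOF meta-features as the fitting set are assumptions made for illustration; the source does not specify them.

```latex
(\hat{w}_t, \hat{b}_t) = \arg\min_{w_t,\, b_t} \sum_{i \in \mathrm{oof}} \left(y_i^{(t)} - w_t^{\top} u_i^{(t)} - b_t\right)^2 + \alpha \lVert w_t \rVert_2^2, \qquad t \in \{g, a\}
```

Setting $\alpha = 0$ recovers ordinary least squares, while larger values shrink the meta-weights toward zero, which can help when the base regressor outputs are strongly correlated.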

Optionally, a soft reconciliation step can be applied to enforce glucose-HbA1c consistency (16):

$$\min_{g, a} \; \frac{1}{2}\left(g - \hat{y}_i^{(g)}\right)^2 + \frac{1}{2}\left(a - \hat{y}_i^{(a)}\right)^2 + \frac{\lambda}{2}\left(g - L a - B\right)^2 \tag{16}$$
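Problem (16) is an unconstrained convex quadratic in (g, a), so it has a closed-form minimizer. Setting both partial derivatives to zero gives a residual r = r0 / (1 + λ(1 + L²)), where r0 is the eAG residual of the unreconciled predictions. The sketch below implements that solution; it is a derivation from (16) rather than the authors' code, and it assumes L = 28.7 and B = −46.7 from the eAG formula.

```python
def reconcile(g_hat, a_hat, lam=0.0, slope=28.7, intercept=-46.7):
    """Closed-form minimizer of Eq. (16) for one sample.

    g_hat, a_hat: unreconciled glucose (mg/dL) and HbA1c (%) predictions.
    lam: coupling strength; lam = 0 leaves both predictions unchanged.
    """
    r0 = g_hat - slope * a_hat - intercept            # residual before reconciliation
    shrink = lam / (1.0 + lam * (1.0 + slope ** 2))
    g = g_hat - shrink * r0                            # move glucose toward consistency
    a = a_hat + shrink * slope * r0                    # move HbA1c toward consistency
    return g, a

g0, a0 = reconcile(120.0, 6.0, lam=0.0)   # coupling disabled
g1, a1 = reconcile(120.0, 6.0, lam=0.5)   # moderate coupling
residual_after = g1 - 28.7 * a1 + 46.7
```

Because HbA1c enters the penalty scaled by L, even a moderate λ removes most of the residual, with the adjustment split between the two predictions according to the first-order conditions.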

In the default setting, λ = 0.0, meaning that the reconciliation step is disabled unless explicitly activated. Overall, CMSE provides a consistent multitask learning framework that integrates leakage-aware preprocessing, heterogeneous base learners, OOF stacking, isotonic calibration, and physiologically meaningful inter-task coupling. The architecture is designed to improve predictive robustness under population-level heterogeneity while maintaining interpretability at the meta-decision level, making it suitable for deployment-oriented glycemic risk stratification and decision–support applications.

Unless otherwise noted, all key performance results presented in Tables 5–10 correspond to the default configuration with λ = 0.0. Non-zero values of λ were evaluated only in the dedicated ablation analysis.

The proposed CMSE introduces several methodological elements that distinguish it from conventional ensemble learning pipelines. First, the framework integrates multitask learning and stacking-based ensemble modeling within a unified predictive architecture, enabling simultaneous modeling of multiple clinically related glycemic outcomes. Unlike traditional pipelines that independently model classification and regression targets, CMSE jointly predicts diabetes status, plasma glucose concentration, and HbA1c levels, thereby exploiting the physiological coupling between short- and long-term glycemic indicators. Second, the framework incorporates explicit probability calibration using isotonic regression to improve the reliability of predicted risk probabilities. While ensemble models often achieve strong predictive performance, their probability estimates may be poorly calibrated, which can reduce their reliability in decision–support scenarios. Third, the proposed architecture introduces a physiologically motivated cross-task coupling mechanism based on the eAG relationship between plasma glucose and HbA1c. This mechanism enables structured information transfer between regression outputs while preserving model flexibility. Finally, the CMSE framework integrates a multi-perspective interpretability layer combining SHAP-based explanations with complementary dependency analysis techniques, including mutual information and distance correlation analysis. This design enhances transparency of model predictions and supports the identification of dominant metabolic predictors influencing glycemic outcomes. The proposed CMSE framework therefore combines multitask ensemble learning, probability calibration, and cross-task physiological coupling within a unified decision–support architecture.

3. Results

3.1. Descriptive Analysis of Glycemic Indicators

This subsection presents an exploratory descriptive analysis of the two primary glycemic biomarkers in the NHANES 2017–March 2020 cohort, namely plasma glucose and glycated hemoglobin (HbA1c). The objective is to provide an initial statistical characterization of the target distributions and to visually contrast normoglycemic individuals with participants identified as diabetic according to laboratory-based criteria. Such analysis provides an intuitive foundation for subsequent predictive modeling and highlights the separability, variability, and transition regions of the studied glycemic outcomes. The comparisons are reported using box-and-whisker plots, distribution density overlays, and empirical cumulative distribution functions (ECDFs).

Figure 5 illustrates the distribution of plasma glucose concentrations across the normoglycemic and diabetic groups. The box plot reveals a pronounced shift in central tendency for the diabetic subgroup, where glucose values are substantially elevated compared with the normoglycemic population. The median glucose level in the normal group remains near the physiological baseline (~90 mg/dL), whereas the diabetic group demonstrates markedly higher values, with a broader interquartile range and a substantially extended upper tail. Extreme values exceed 400 mg/dL, reflecting considerable interindividual heterogeneity.

This widening of the distribution in the diabetic cohort is consistent with the heterogeneous nature of diabetes progression, which is influenced by disease duration, metabolic compensation mechanisms, and variability in treatment response. The observed separation confirms the clinical discriminative value of plasma glucose and motivates its inclusion as a primary regression target within the proposed multitask modeling framework.

Figure 6 provides an overlay of plasma glucose density distributions for the two cohorts. The normoglycemic population exhibits a compact density peak concentrated approximately within the 80–110 mg/dL interval, indicating stable regulation of fasting glucose. In contrast, the diabetic population demonstrates a clear rightward shift and a heavy-tailed distribution extending into ranges above 300 mg/dL. The overlap observed near the 100–130 mg/dL interval corresponds to a transitional region that is clinically associated with impaired fasting glucose and early dysglycemia.

To further quantify distributional separation, Figure 7 reports ECDF curves for plasma glucose. The ECDF of the normoglycemic group approaches saturation rapidly, reaching near-complete probability mass below approximately 120 mg/dL. In contrast, the diabetic group exhibits a substantially slower cumulative rise and reaches saturation only beyond approximately 400 mg/dL, demonstrating both elevated glucose levels and increased dispersion. This divergence confirms that glucose exhibits strong group-level separability while still maintaining a clinically meaningful overlap region that may correspond to borderline or prediabetic metabolic states.
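For readers unfamiliar with ECDFs, the quantity plotted is simply the fraction of observations at or below each value. A minimal sketch follows; the helper name is hypothetical and the glucose values are invented for illustration, not taken from NHANES.

```python
import bisect

def make_ecdf(sample):
    """Return F(x) = fraction of observations <= x for a non-empty sample."""
    xs = sorted(sample)
    n = len(xs)

    def ecdf(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect.bisect_right(xs, x) / n

    return ecdf

# Invented plasma glucose values (mg/dL) for two small groups.
normo = [82, 88, 90, 91, 95, 99, 104, 110]
diab = [105, 126, 140, 155, 180, 230, 310, 410]
F_normo, F_diab = make_ecdf(normo), make_ecdf(diab)
at_120 = (F_normo(120), F_diab(120))
```

In this toy data the normoglycemic curve has already reached 1.0 at 120 mg/dL while the diabetic curve is still at 0.125, which is the kind of separation the ECDF comparison is meant to reveal.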

A similar analysis was performed for HbA1c. Figure 8 compares HbA1c distributions using box plots. The normoglycemic group is centered around approximately 5.2%, which is consistent with stable long-term glycemic regulation. The diabetic group shows a clear upward shift, with median HbA1c values exceeding 6.8%, which lies above the commonly applied diagnostic threshold of 6.5%. A wide interquartile range and multiple outliers extending to approximately 14% indicate substantial heterogeneity in chronic glycemic exposure, potentially reflecting differences in therapy adherence, metabolic compensation, and disease severity.

Figure 9 shows the density overlay of HbA1c distributions. The normoglycemic cohort forms a distinct peak within the 4.5–5.5% range, whereas the diabetic cohort exhibits a right-shifted density concentrated above 6.5% with a long heavy tail reaching beyond 12%. The region between approximately 5.7% and 6.4% forms a transitional zone that overlaps both distributions, consistent with the clinical interpretation of prediabetes.

Figure 10 presents ECDF curves for HbA1c. The normoglycemic curve saturates rapidly below approximately 6%, while the diabetic curve accumulates more gradually and extends substantially further, indicating both elevated HbA1c and increased variability. Compared with plasma glucose, HbA1c exhibits a smoother distribution and reduced short-term fluctuation, consistent with its role as an integrated long-term biomarker of glycemic exposure.

Overall, the descriptive analysis demonstrates strong statistical distinguishability between normoglycemic and diabetic subgroups for both plasma glucose and HbA1c, while also highlighting clinically meaningful overlap regions corresponding to early dysglycemia. These findings justify the multitask modeling formulation adopted in CMSE, where short-term and long-term glycemic markers are treated as coupled outcomes and jointly modeled to support robust risk stratification and biomarker prediction.

3.2. Epidemiological Patterns in Obesity and Age

An epidemiological analysis of participants provides additional insight into how demographic and anthropometric factors shape the laboratory-defined prevalence of diabetes. In particular, body mass index (BMI) and age represent two of the most influential population-level predictors associated with type 2 diabetes risk, as they reflect long-term exposure to insulin resistance, metabolic dysregulation, and lifestyle-related factors. Examining diabetes prevalence across BMI and age strata therefore supports a clearer interpretation of the underlying cohort structure and highlights clinically meaningful risk gradients relevant for predictive modeling.

Figure 11 reports diabetes prevalence across standard BMI categories. A monotonic increase in diabetes burden is observed as BMI rises, demonstrating a clear obesity–diabetes gradient. Underweight individuals (BMI < 18.5) exhibit the lowest prevalence, remaining close to 6%. In the normal weight group (BMI = 18.5–25.0), prevalence increases to approximately 11%. A further increase is observed in overweight individuals (BMI = 25.0–30.0) and obese class I participants (BMI = 30.0–35.0), where diabetes prevalence reaches approximately 16–17%. The highest burden is observed in individuals with severe obesity (BMI ≥ 40.0), where prevalence exceeds 30%.

Figure 12 presents diabetes prevalence stratified by age. The results demonstrate a clear age-dependent increase in diabetes burden, consistent with cumulative exposure to metabolic risk factors. Diabetes prevalence remains low in participants younger than 30 years, staying below 4%. A noticeable rise is observed in middle adulthood: the 30–40 group shows an increase toward approximately 8%, while the 40–50 group reaches values near 16%. After 50 years of age, prevalence rises more sharply, reaching approximately 17–18% in the 50–60 group and exceeding 20% among participants older than 60 years. The highest prevalence is observed in individuals above 70 years, where diabetes affects more than 22% of the subgroup.

This age-dependent trend reflects the cumulative effect of cardiometabolic risk factors such as insulin resistance, obesity, and reduced physical activity. The 40–60 age range represents a key transition period where early screening may prevent progression to overt diabetes. Overall, diabetes prevalence increases monotonically with both BMI and age, forming a clear risk gradient in the NHANES cohort. These findings justify incorporating demographic and anthropometric predictors into CMSE, as glycemic risk is driven by nonlinear and cumulative interactions.

3.3. Correlation Structures Across Biochemical, Lifestyle, and Therapeutic Features

Characterizing dependencies among biochemical, anthropometric, dietary, behavioral, and therapeutic variables is critical for interpretable glycemic modeling. In this study, both linear and nonlinear relationships were examined using Pearson correlation, mutual information, distance correlation, and model-based gain importance. This multi-view analysis helps identify stable feature clusters and potential confounding patterns affecting plasma glucose, HbA1c, and diabetes status.

Pearson Correlation Patterns. Figure 13 illustrates the Pearson correlation heatmap between clinical predictors and three outcomes (glucose, HbA1c, and diabetes label). A consistent metabolic signature is observed. HDL shows the strongest inverse association with all endpoints (glucose r ≈ −0.43; HbA1c r ≈ −0.44; diabetes r ≈ −0.34), supporting its protective role. Conversely, LDL (r ≈ 0.29–0.31) and triglycerides (r ≈ 0.16–0.20) exhibit positive correlations, indicating co-occurrence of dyslipidemia and impaired glycemic regulation.

Renal stress indicators follow clinically expected trends. Urine albumin-to-creatinine ratio (UACR) shows moderate positive correlations (r ≈ 0.19–0.22), whereas serum creatinine exhibits weaker associations (r ≈ 0.09–0.10). Electrolytes show mixed behavior: potassium demonstrates a negative association (up to r ≈ −0.26), while chloride exhibits weak positive alignment with HbA1c (r ≈ 0.21). Anthropometric factors contribute moderately, where weight and BMI correlate positively with glycemic endpoints, whereas height shows a mild negative relationship. Fasting duration is weakly but consistently associated with lower glucose and HbA1c. Behavioral and inflammation markers show limited but non-negligible correlations. Cotinine (log-scale) indicates weak positive association with adverse profiles, and inflammatory variables such as hs-CRP and ferritin contribute small positive effects. Medication indicators exhibit an “indication effect”: metformin and insulin flags correlate strongly with diabetes status, while aggregated therapy variables may show inverse relationships with continuous glycemic targets, reflecting treatment-induced glucose reduction rather than baseline risk.

Mutual Information Analysis. While Pearson coefficients capture only linear dependencies, mutual information (MI) evaluates nonlinear associations without assuming monotonicity. Figure 14 reports MI values between features and targets. The results indicate that HbA1c demonstrates stronger nonlinear dependence on biochemical markers than plasma glucose, consistent with HbA1c being a stable long-term marker.

Electrolytes and lipid parameters show the highest informativeness for HbA1c, with chloride exhibiting the strongest MI value (MI ≈ 0.81), followed by LDL, triglycerides, total cholesterol, hs-CRP, and UACR. For plasma glucose, chloride remains the most informative (MI ≈ 0.59), while systolic blood pressure, HDL, and triglycerides also show high relevance. Fasting hours contribute notably for both regression tasks, reflecting the physiological sensitivity of glucose measurements to pre-test conditions. Anthropometric variables demonstrate moderate MI values, while psychological scores contribute weak but detectable information. Dietary variables exhibit heterogeneous effects; notably, alcohol consumption shows high MI values, suggesting nonlinear threshold-driven relationships. In contrast, the binary diabetes label produces lower MI values overall, as expected due to discretization effects and loss of continuous information.

Model-Based Gain Importance. To complement correlation-based analysis, feature gain importance was extracted from XGBoost models trained separately for glucose regression, HbA1c regression, and diabetes classification. Figure 15 summarizes the gain distribution.

In regression tasks, alcohol consumption emerges as the dominant contributor, suggesting that its relationship with glycemic dynamics is nonlinear and interaction-driven. Therapeutic aggregates (“any antidiabetic”) and insulin markers follow, while HDL and electrolytes remain consistently important. In the diabetes classification task, medication indicators dominate the ranking. Metformin, sulfonylurea, and other therapy flags provide stronger separation than single biochemical measurements, reflecting prescription assignment as an indirect proxy for diagnosed disease status. Among laboratory variables, HDL remains the most prominent biochemical contributor.

Importantly, gain-based rankings reflect model-specific split utility and should not be interpreted as causal effects. However, the consistency of key clusters across tasks supports the inclusion of heterogeneous metabolic domains in the proposed stacking framework.

Distance Correlation Analysis. Figure 16 reports distance correlation coefficients between features and targets, capturing general statistical dependence beyond linearity. For continuous targets, HDL again appears as the most strongly dependent marker, followed by alcohol intake and aggregated therapy variables. Electrolytes (chloride, potassium, sodium), lipid fractions, and blood pressure parameters form the next relevance group, while renal and inflammation indicators contribute moderately. For the binary diabetes label, distance correlations are systematically lower, which is expected given that a discrete endpoint reduces dependence resolution. In this setting, therapeutic variables remain among the strongest predictors, again reflecting clinical assignment effects rather than purely physiological measurement patterns. This analysis reinforces the observation that a substantial portion of informative variation exists in continuous glycemic measurements rather than binary labels, supporting the multitask formulation adopted in CMSE.
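To illustrate why distance correlation captures dependence that Pearson correlation can miss, the sketch below implements the standard sample estimator based on double-centered pairwise distance matrices. It is an O(n²) pure-Python illustration on synthetic data; the source does not state which implementation was used.

```python
import math

def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row = [sum(r) / n for r in d]   # row means equal column means (symmetric matrix)
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D samples of equal length."""
    a, b = _centered_distances(x), _centered_distances(y)
    n = len(x)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(v * v for r in a for v in r) / n ** 2
    dvar_y = sum(v * v for r in b for v in r) / n ** 2
    denom = math.sqrt(dvar_x * dvar_y)
    return math.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = distance_correlation(x, [2 * v + 1 for v in x])   # exact linear dependence
quadratic = distance_correlation(x, [v * v for v in x])    # Pearson r is 0 here
```

For the symmetric quadratic example the Pearson coefficient is exactly zero, yet the distance correlation is clearly positive, which is the property that motivates reporting it alongside the linear measure.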

Summary of Dependency Clusters. Across Pearson, MI, gain importance, and distance correlation, three stable feature clusters consistently emerge: (i) lipid-related metabolic indicators (HDL, LDL, triglycerides), (ii) electrolyte and renal stress markers (chloride, potassium, UACR), and (iii) behavioral and therapeutic factors (alcohol intake, antidiabetic medications). The convergence of linear and nonlinear dependency measures indicates that glycemic abnormalities are shaped by multi-domain interactions, and that substantial predictive information is distributed across biochemical, lifestyle, and therapy-related variables. These findings provide an interpretable statistical foundation for subsequent SHAP-based explanations and justify the hybrid feature integration strategy employed in the CMSE architecture.

3.4. SHAP-Based Interpretability

Model transparency is a critical requirement for clinical decision support systems, where predictive performance must be accompanied by physiologically meaningful explanations. In this study, interpretability of the proposed CMSE framework was investigated using SHapley Additive exPlanations (SHAP), which provides consistent local and global attributions for complex ensemble models. SHAP analysis is conducted exclusively on the medication-free configuration so that feature attributions reflect physiological signals that existed before diagnosis rather than indirect treatment-related indicators. This keeps the focus on the key metabolic and demographic factors relevant for early screening.

Figure 17 shows the SHAP summary plot for the medication-free plasma glucose regression model. The results show that predictions are primarily determined by physiological factors, including lipid markers, kidney function, age, and BMI. Higher triglyceride levels, UACR, and age contribute to elevated glucose levels, while HDL-C has a protective (negative) effect. The absence of treatment-related factors reduces confounding and provides a more realistic representation of metabolic risk before diagnosis, making the model suitable for predictive screening.

Figure 18 shows the summary SHAP plot for the HbA1c regression model excluding medication use. The results indicate that long-term glycemic levels are primarily influenced by lipid markers, kidney function indicators, and demographic factors. Elevated triglyceride, LDL, and UACR levels positively predict HbA1c, while HDL demonstrates a consistent negative (protective) effect. Age and BMI also demonstrate consistent positive contributions, reflecting cumulative metabolic risk. Excluding treatment-related variables ensures that the model accounts for intrinsic physiological dependencies rather than postdiagnostic treatment effects, confirming its relevance for prognostic screening and early risk assessment.

Figure 19 presents the SHAP summary plot for the medication-free diabetes classification model. The results indicate that the predicted risk of developing diabetes is primarily determined by demographic, metabolic, and renal factors. Age is the most influential predictor, with higher values significantly increasing the likelihood of developing diabetes. Lipid markers, particularly triglycerides and LDL, positively influence risk, while HDL demonstrates a consistent protective (negative) effect. Renal parameters, such as the urine albumin-to-creatinine ratio (UACR), also show a positive association with an increased likelihood of developing diabetes. Anthropometric characteristics, including BMI and weight, demonstrate a moderate positive contribution, reflecting the role of obesity in the development of diabetes. Other variables, such as blood pressure, electrolytes, and dietary factors, show smaller and more heterogeneous effects. Importantly, excluding treatment-related variables eliminates post-diagnostic bias, resulting in a more balanced and physiologically based attribution structure. This makes the model more suitable for predictive screening, as it relies on intrinsic risk factors rather than treatment-related cues.

Across all three outcomes, the medication-free SHAP analysis consistently demonstrates that the CMSE framework captures stable and physiologically meaningful patterns of glycemic risk. For both the regression and classification tasks, predictions are driven by consistent metabolic domains, including lipid metabolism disorder, renal stress, and demographic factors, rather than treatment-related factors. This confirms that the revised interpretability analysis is consistent with the core screening configuration used for the study’s main claims.

3.5. Predictive Performance: Holdout Evaluation

This section reports the predictive performance of the proposed CMSE framework under an 80/20 holdout protocol. Results are presented for the three targets: (i) diabetes status (binary classification), (ii) plasma glucose (mg/dL), and (iii) HbA1c (%). For classification, both discrimination and calibration are evaluated using ROC-AUC, PR-AUC, F1, MCC, and the Brier score. For regression tasks, accuracy is assessed using R², RMSE, MAE, and MAPE. All baselines are trained and tested under the same split, enabling a direct comparison. To study the dependence of classification performance on the threshold value, several performance scenarios were evaluated by adjusting the probability threshold. The resulting tradeoffs between accuracy, precision, recall, and overall classification quality are presented in Table 7.

Optimizing F1 shifts the threshold to 0.67, improving F1 to 0.808927 and MCC to 0.744756, primarily by increasing precision (0.825 → 0.873) with a slight decrease in recall (0.765 → 0.753), making this setting suitable for scenarios requiring higher confidence in positive predictions. In contrast, a sensitivity-focused threshold of 0.35 increases the recall to 0.879 but reduces the precision to 0.689, the accuracy (ACC) to 0.933660, and the Matthews correlation coefficient (MCC) to 0.710247, which is preferable for screening where minimizing false negatives is critical. A threshold of 0.5 was used as the standard reference value, but alternative thresholds reflecting clinically relevant trade-offs between sensitivity and accuracy were also evaluated.
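The threshold analysis above amounts to recomputing confusion-matrix metrics at different cut-offs. The sketch below shows one way to do this; the function name, the convention that a probability equal to the threshold counts as positive, and the toy labels and probabilities are all assumptions for illustration.

```python
import math

def threshold_metrics(y_true, y_prob, threshold=0.5):
    """Precision, recall, F1 and MCC after binarizing probabilities at a cut-off."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0   # assumed tie convention
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

# Toy labels and calibrated probabilities (not study data).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.2, 0.4, 0.45, 0.4, 0.6, 0.9, 0.7, 0.55, 0.05]
reference = threshold_metrics(y_true, y_prob, 0.5)
sensitive = threshold_metrics(y_true, y_prob, 0.35)
```

Lowering the cut-off in this toy example raises recall from 0.75 to 1.0 while precision falls from 0.75 to 4/7, mirroring the direction of the trade-off reported in Table 7.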

The final analytical dataset was split 80/20, resulting in a training set of 3694 observations and a test set of 924. To make probabilistic performance more interpretable, a naive baseline based only on prevalence was additionally introduced, assigning each test instance a constant probability equal to the proportion of the positive class in the training data (0.164591). Using this minimal information strategy, the Brier score is 0.137441. In the primary configuration without medication, the CMSE achieves a Brier score of 0.143726 on the validation dataset, reflecting realistic predictive performance under screening conditions. The full information configuration yields a lower Brier score (0.059521), representing a secondary upper bound scenario influenced by treatment-related variables.
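The prevalence-only baseline can be reproduced in a few lines. For a constant prediction p on a test set with positive fraction π, the Brier score equals π(1 − p)² + (1 − π)p², which reduces to p(1 − p) when the test prevalence matches the training prevalence. For p = 0.164591 this gives roughly 0.1375, close to the reported 0.137441. The sketch below uses hypothetical function names and a synthetic test vector, since the actual test labels are not given in the text.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def prevalence_baseline_brier(train_prevalence, y_test):
    """Brier score of a constant prediction equal to the training prevalence."""
    return brier_score(y_test, [train_prevalence] * len(y_test))

p_train = 0.164591                 # positive-class proportion stated in the text
y_test = [1] * 16 + [0] * 84       # synthetic test labels with 16% positives
baseline = prevalence_baseline_brier(p_train, y_test)
matched = p_train * (1 - p_train)  # value when test prevalence equals p_train
```

Any model whose Brier score is not below this reference adds no probabilistic information beyond the base rate, which is why the baseline is a useful anchor when reading the reported scores.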

Table 8 presents a quantitative comparison of the classification performance of all models in two configurations: medication-free and full-information. In all tables, starred values (*) correspond to the medication-free configuration (primary screening setting), while non-starred values represent the medication-inclusive full-information scenario (secondary setting). The medication-free configuration is considered the primary evaluation configuration because it reflects predictive screening conditions without access to treatment-related variables, while the full-information configuration is considered a secondary scenario. Quantitatively, the CMSE model demonstrates superior performance in both configurations. Although performance metrics such as ROC-AUC (0.865 vs. 0.975) and MCC (0.469 vs. 0.735) are lower in the medication-free configuration, they provide a more clinically meaningful assessment for predictive screening because they are not influenced by treatment information after diagnosis. The full-information configuration should therefore be interpreted as an upper-bound scenario influenced by post-diagnostic treatment effects rather than a valid screening setting.

Overall, the quantitative analysis shows that medication-inclusive models benefit from using information obtained after diagnosis, while the medication-free configuration provides a more realistic and unbiased estimate of predictive performance for screening scenarios. To visually assess the quality of the calibration of the predicted probabilities, a reliability diagram was constructed. The results are presented in Figure 20.

The reliability diagram shows the relationship between predicted probabilities and observed event rates across various probability ranges. As shown in Figure 20, the CMSE model demonstrates the closest fit to the ideal calibration line, especially in the medium-probability range, where accurate risk assessment is crucial. The transition between low and high predicted risks is smooth and monotonic, indicating stable probabilistic behavior. Among the baseline models, CatBoost demonstrates relatively competitive calibration but exhibits deviations in intermediate ranges, suggesting local risk overestimation. HistGradientBoosting exhibits the greatest variability, with noticeable fluctuations and wider confidence intervals. Logistic regression provides more stable estimates but tends to underestimate risk in higher-probability regions.
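The points of a reliability diagram are obtained by binning predictions and comparing the mean predicted probability with the observed event rate in each bin. The sketch below assumes equal-width bins, which is a common choice but is not specified in the source; the function name and the data are illustrative.

```python
def reliability_bins(y_true, y_prob, n_bins=10):
    """Return (mean predicted probability, observed rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes into the last bin
        bins[idx].append((p, y))
    points = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            points.append((mean_p, rate, len(members)))
    return points

# Toy predictions and outcomes (not study data).
y_prob = [0.05, 0.15, 0.12, 0.55, 0.58, 0.95, 1.0]
y_true = [0, 0, 1, 1, 0, 1, 1]
points = reliability_bins(y_true, y_prob, n_bins=5)
```

A well-calibrated model produces points close to the diagonal, where the mean predicted probability in a bin equals the observed rate; bins with few members give noisy rates, which is one source of the wider intervals noted for some baselines.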

In the primary medication-free configuration, CMSE achieves R² = 0.385 for plasma glucose prediction, whereas in the full-information configuration R² reaches 0.681, reflecting the contribution of treatment-related variables and representing an upper-bound scenario. The decrease in R² when medication variables are excluded is accompanied by an increase in the root mean square error (RMSE) (22.47 → 26.65 mg/dL) and mean absolute error (MAE) (11.42 → 14.36 mg/dL), indicating lower glucose estimation accuracy. A similar deterioration is observed across all models; for example, ExtraTrees shows a decrease in R² from 0.612 to 0.301 and an increase in RMSE from 24.25 to 28.42 mg/dL (Table 9). Interestingly, although most error metrics increase in the medication-free configuration, the linear model exhibits a decrease in MAPE (0.200 → 0.143), indicating improved relative error stability despite lower overall explanatory power (R²: 0.477 → 0.254). Among the nonlinear models, XGB exhibits a comparatively smaller decrease in R² (0.579 → 0.397), indicating greater stability under feature reduction.
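The four regression metrics compared here can be computed as sketched below. MAPE is returned as a fraction rather than a percentage, which is an assumption inferred from the magnitude of the reported values (for example 0.200 and 0.143); the function name and data are illustrative.

```python
import math

def regression_metrics(y_true, y_pred):
    """R^2, RMSE, MAE and MAPE (as a fraction) for paired true/predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # MAPE is undefined if any true value is zero; glucose and HbA1c are positive.
        "mape": sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
    }

# Toy glucose values in mg/dL (not study data).
metrics = regression_metrics([100.0, 120.0, 140.0, 160.0], [110.0, 115.0, 150.0, 150.0])
```

Because MAPE divides each error by the true value, it can fall even when RMSE and MAE rise, which is consistent with the behavior noted for the linear model.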

Overall, the quantitative analysis confirms that excluding medication-related variables leads to a consistent decrease in predictive performance, reflecting the removal of information obtained after diagnosis. At the same time, the medication-free configuration provides a more realistic estimate of model performance in pre-diagnosis screening scenarios.

Table 10 presents a quantitative comparison of the HbA1c regression performance of various models in the full-information and medication-free configurations. Quantitatively, the CMSE model demonstrates the best overall performance in both configurations, with R² decreasing from 0.752 to 0.366 after excluding treatment-related features. This decrease is accompanied by an increase in RMSE (0.676 → 0.866) and MAE (0.353 → 0.490), indicating lower forecasting accuracy in the medication-free scenario. Similar trends are observed across all models; for example, ExtraTrees shows a decrease in R² from 0.700 to 0.297 and an increase in RMSE from 0.712 to 0.912. Among the compared methods, XGB demonstrates a relatively smaller performance degradation (R²: 0.663 → 0.355), indicating greater robustness to feature reduction. The linear model again demonstrates a reduction in MAPE (0.117 → 0.088), despite a decrease in explanatory power (R²: 0.511 → 0.251), indicating improved relative error stability.

Overall, the quantitative analysis confirms that excluding medication-related variables leads to a consistent decrease in predictive performance, as these features encode information obtained after diagnosis. At the same time, the medication-free configuration provides a more realistic and unbiased assessment of long-term glycemia prediction in a screening setting.

Table 11 presents the results of 5-fold cross-validation for the binary task is_diabetes_labs_only. The CMSE stacking ensemble achieves the highest threshold-based metrics: ACC = 0.9198 ± 0.0111 (95% CI [0.9100; 0.9295]) and F1 = 0.7456 ± 0.0401 (95% CI [0.7104; 0.7808]), indicating stable performance across folds. CMSE also attains the highest MCC (0.7000 ± 0.0451), a metric that remains informative under class imbalance, and the lowest Brier score (0.0666 ± 0.0062), indicating close agreement between predicted probabilities and observed frequencies.
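The reported confidence intervals appear consistent with a normal approximation over the five folds, mean ± 1.96·SD/√5. This is an inference from the reported numbers rather than a formula stated in the text, and the helper name below is hypothetical.

```python
import math

def normal_ci(mean, sd, k, z=1.96):
    """Normal-approximation CI for a mean over k folds: mean ± z * sd / sqrt(k)."""
    half_width = z * sd / math.sqrt(k)
    return mean - half_width, mean + half_width

# Fold-level summary reported for CMSE accuracy (5-fold cross-validation).
low, high = normal_ci(mean=0.9198, sd=0.0111, k=5)
print(f"ACC 95% CI: [{low:.4f}; {high:.4f}]")
```

The result agrees with the reported interval [0.9100; 0.9295] to within rounding. With only five folds, a Student t interval (critical value 2.776 for 4 degrees of freedom) would be noticeably wider.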

Some baseline algorithms yield slightly higher ROC_AUC/PR_AUC values (e.g., HistGB/CatBoost). Still, CMSE’s advantage is evident in the combination of threshold metrics and calibration: the F1 gain over the closest ensembles is approximately 2–3 percentage points, and the Brier score is 8–15% lower than that of XGB/ExtraTrees/LogReg. Narrow confidence intervals for ACC and Brier further indicate stable generalization. The CMSE architecture offers an explanation for this gain: out-of-fold (OOF) channels, per-channel isotonic calibration, and meta-level aggregation of signals (including inter-task context) reduce the bias of single models and even out their variability. Overall, according to Table 11, CMSE provides the best balance of accuracy, calibration, and robustness and is the preferred classifier for clinically relevant scenarios.

To assess the robustness of the proposed CMSE framework to data partitioning strategies, performance on a held-out validation set was compared with 5-fold cross-validation. The results, along with the identified performance differences, are presented in Table 12.

On the classification task, there is a noticeable discrepancy between the evaluation protocols: ROC-AUC = 0.974510 on the validation set compared with 0.9324 ± 0.0144 under 5-fold cross-validation, and F1 = 0.794468 compared with 0.7456 ± 0.0401. This indicates that the single validation split yields a more optimistic estimate, while cross-validation reflects a more conservative, averaged performance. Given that this discrepancy is most pronounced on the classification task and is not equally reflected in the other tasks, it is more plausibly attributed to the composition of the specific data split than to systematic overfitting. Glucose regression also shows a moderate discrepancy: R² = 0.681382 on the validation set versus 0.6254 ± 0.0623 in cross-validation (difference = +0.055982). This difference remains within a plausible range for biomedical data and can be explained by variability in sample composition across splits. In contrast, HbA1c regression demonstrates high consistency between evaluation protocols: R² = 0.752465 on the validation set versus 0.7401 ± 0.0309 in cross-validation (difference = +0.012365), indicating robust generalization.

Table 13 shows the 5-fold cross-validation results for the plasma glucose (glucose_mgdl) regression.

The CMSE multi-task stacking model shows the highest proportion of explained variance: R² = 0.6254 ± 0.0623 (95% CI [0.5708; 0.6799]), exceeding strong single ensembles (CatBoost: 0.6150 ± 0.0370, ExtraTrees: 0.5935 ± 0.0272) and clearly exceeding the linear baseline (Linear: 0.4445 ± 0.0269). In terms of error, CMSE achieves RMSE = 25.72 ± 4.93 mg/dL, MAE = 12.26 ± 1.41 mg/dL, and MAPE = 0.1039 ± 0.0057. These errors are close to the best of the baselines: ExtraTrees has a comparable MAPE = 0.1039 ± 0.0044 and a similar MAE = 12.31 ± 0.999, while CatBoost is slightly inferior (RMSE = 26.02 ± 2.48, MAE = 12.59 ± 0.77, MAPE = 0.1084 ± 0.0049). HistGB and XGB exhibit higher errors and lower R² values. Taken together, Table 13 indicates that the advantage of CMSE lies primarily in a more complete explanation of the target variable’s variation, while error levels remain comparable to those of the best tree-based models.

Table 14 presents the results of 5-fold cross-validation for the hba1c_pct regression.

Among the compared models, CMSE multi-task stacking explains the most variance: R² = 0.7401 ± 0.0309 (95% CI [0.7130; 0.7671]). It exceeds strong single ensembles (HistGB: 0.7257 ± 0.0322, ExtraTrees: 0.7056 ± 0.0197, XGB: 0.6973 ± 0.0224) and clearly exceeds the linear baseline (Linear: 0.5228 ± 0.0126). CMSE also leads in error metrics: RMSE = 0.6933 ± 0.0962, MAE = 0.3599 ± 0.0302, and MAPE = 0.0580 ± 0.0038, delivering the smallest values or tying for them across the comparisons. The nearest rival, CatBoost, shows a similar level of accuracy (R² = 0.7343 ± 0.0281; RMSE = 0.7005 ± 0.0462; MAPE = 0.0595 ± 0.0021); CMSE’s mean values still favor it, although the confidence intervals partly overlap.

The superiority of CMSE can be attributed to the combination of heterogeneous base models with OOF stacking (bias reduction), per-channel probability calibration (stable meta-signals), and Ridge meta-regression, which leverages inter-task information (including the diabetes meta-probability) to suppress noise and match clinical patterns. Small standard deviations in R² and narrow 95% CIs indicate that the gain is reproducible across folds. Taken together, Table 14 shows that CMSE provides the most balanced combination of high explanatory power and low error for HbA1c, confirming the value of calibrated multi-task stacking in clinically relevant modeling.

The combined results demonstrate the stability and accuracy of CMSE on the hold-out set: the model maintains high ROC-AUC and MCC values in classification and achieves substantial R² and low RMSE/MAE in regression. The consistency of the metrics across tasks supports the effectiveness of the multi-task approach and the ensemble’s ability to exploit complementary signals among glucose, HbA1c, and associated risk factors. These results indicate robust generalization of CMSE and support its applicability to clinical risk stratification and objective metabolic monitoring in real-world settings.

To quantitatively evaluate the impact of the eAG-based inter-task coupling mechanism, an ablation study was conducted by varying the coupling coefficient λ ∈ {0.00, 0.10, 0.25}. Performance was evaluated on the hold-out dataset for the classification and regression tasks. The results are presented in Table 15.

As shown in Table 15, the classification performance remains very stable across all tested λ values. At λ = 0.00, the model achieves ACC = 0.9496, F1 = 0.7945, ROC-AUC = 0.9745, and Brier score = 0.0595. Increasing the coupling strength to λ = 0.10 and λ = 0.25 results in only minor changes (ΔACC < 0.0001, ΔROC-AUC < 0.0003), indicating that the CMSE classification branch does not benefit from the eAG-based coupling mechanism. A similar pattern is observed for plasma glucose regression. The baseline configuration (λ = 0.00) yields R² = 0.681 and RMSE = 22.47 mg/dL, while higher λ values yield negligible changes (ΔR² < 0.001, ΔRMSE < 0.03 mg/dL). These results confirm that the eAG coupling does not provide measurable improvements in glucose prediction accuracy. In contrast, the HbA1c regression is highly sensitive to the coupling parameter. The best results are achieved at λ = 0.00 (R² = 0.752, RMSE = 0.676), and increasing λ leads to a marked deterioration, with R² decreasing to approximately 0.55 and RMSE increasing to approximately 0.84. This corresponds to a relative reduction in explained variance of more than 25% and a substantial increase in prediction error.

Overall, the quantitative analysis shows that the eAG-based coupling mechanism does not improve prediction accuracy in its current configuration. While diabetes classification and glucose regression remain essentially unchanged, HbA1c prediction deteriorates with increasing coupling strength. This suggests that imposing a fixed, linear physiological relationship may inadequately capture the complex and heterogeneous interactions between short-term glucose levels and long-term glycemic exposure.
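To make the role of λ concrete, the sketch below applies the eAG formula from Table 1 (eAG = 28.7 × HbA1c − 46.7) inside a simple convex blend. The blending rule itself is a hypothetical illustration: this section does not spell out the exact coupling used in CMSE, so `soft_couple` should be read as one plausible form under that assumption, not as the authors' method.

```python
EAG_L = 28.7   # slope of the eAG formula (Table 1)
EAG_B = -46.7  # intercept of the eAG formula (Table 1)

def eag_from_hba1c(hba1c_pct):
    """Estimated average glucose (mg/dL) implied by an HbA1c value (%)."""
    return EAG_L * hba1c_pct + EAG_B

def hba1c_from_eag(glucose_mgdl):
    """Inverse mapping: HbA1c (%) implied by an average glucose (mg/dL)."""
    return (glucose_mgdl - EAG_B) / EAG_L

def soft_couple(glucose_pred, hba1c_pred, lam):
    """HYPOTHETICAL coupling rule: shrink each prediction toward the value
    implied by the other task, with strength lam in [0, 1]."""
    glucose_out = (1.0 - lam) * glucose_pred + lam * eag_from_hba1c(hba1c_pred)
    hba1c_out = (1.0 - lam) * hba1c_pred + lam * hba1c_from_eag(glucose_pred)
    return glucose_out, hba1c_out

# Illustrative predictions for one subject (made-up values).
for lam in (0.00, 0.10, 0.25):
    print(lam, soft_couple(glucose_pred=110.0, hba1c_pred=7.0, lam=lam))
```

With λ = 0.00 the predictions pass through unchanged. Under this assumed rule, larger λ pulls the HbA1c estimate toward the value implied by a single glucose measurement, which is one way a fixed linear relationship could degrade HbA1c accuracy.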

To analyze the contribution of individual CMSE components to classification performance, an ablation study was conducted by selectively removing or modifying key elements, including probability calibration, inter-task coupling, and medication-related features. The results of this analysis are summarized in Table 16.

Removing the posterior probability calibration (CMSE_no-calibration) results in a noticeable deterioration in probabilistic reliability: the Brier score increases from 0.0595 to 0.0640. Furthermore, ACC decreases from 0.9496 to 0.9441 and MCC from 0.7351 to 0.7292, while ROC-AUC remains virtually unchanged. This indicates that calibration primarily improves the quality of the predicted probabilities rather than ranking performance. Introducing eAG-based coupling (λ = 0.10 and λ = 0.25) does not improve performance: all classification metrics remain virtually unchanged relative to the baseline configuration, with deviations of less than 0.001 for ACC, F1, and ROC-AUC. This confirms that the inter-task coupling mechanism does not improve classification performance in the current configuration. In contrast, excluding medication-related features (CMSE_no-medication) leads to a marked deterioration across all metrics. ACC decreases from 0.9496 to 0.8610, F1 from 0.7945 to 0.6502, and ROC-AUC from 0.9745 to 0.9050. The Brier score increases to 0.0978, indicating a loss of both prediction accuracy and calibration. This result highlights the strong contribution of treatment-related variables, which carry information not captured by the other feature groups.

To assess the impact of CMSE components on regression performance, an additional ablation analysis was conducted for the prediction of plasma glucose and HbA1c levels. The results are summarized in Table 17.
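The observation that calibration changes the Brier score but leaves ROC-AUC almost untouched follows from the fact that ROC-AUC depends only on the ranking of scores. The synthetic sketch below illustrates this with a strictly monotone distortion of well-calibrated probabilities; the data and the cubic distortion are illustrative assumptions, not part of the study.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(42)
p_true = rng.uniform(0.0, 1.0, size=5000)  # well-calibrated probabilities
y = rng.binomial(1, p_true)                # outcomes drawn from them
p_distorted = p_true ** 3                  # same ranking, wrong probability scale

auc_cal = roc_auc_score(y, p_true)
auc_dis = roc_auc_score(y, p_distorted)
brier_cal = brier_score_loss(y, p_true)
brier_dis = brier_score_loss(y, p_distorted)

print(f"ROC-AUC: {auc_cal:.4f} vs {auc_dis:.4f}")
print(f"Brier:   {brier_cal:.4f} vs {brier_dis:.4f}")
```

Isotonic regression is monotone non-decreasing rather than strictly increasing, so it can introduce ties between scores; this is why ROC-AUC is virtually, rather than exactly, unchanged after calibration.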

For plasma glucose prediction, removing the probability calibration results in only minor changes: R² decreases slightly from 0.6814 to 0.6782, and RMSE increases from 22.47 to 22.55 mg/dL. This confirms the minimal impact of calibration on regression accuracy. Similarly, introducing the eAG coupling (λ = 0.10 and λ = 0.25) does not lead to meaningful improvements: R² remains virtually unchanged (≈0.681), and the change in RMSE is less than 0.03 mg/dL. In contrast, the HbA1c regression is strongly sensitive to the eAG coupling. While the baseline model achieves R² = 0.7525 and RMSE = 0.6763, introducing the coupling results in a marked deterioration: R² decreases to approximately 0.55, and RMSE increases to approximately 0.84. This represents a substantial loss in prediction accuracy, suggesting that the imposed linear physiological relationship inadequately captures the complex variability of HbA1c. The most pronounced deterioration is observed when medication-related features are removed. In this case, glucose prediction performance drops sharply (R² = 0.3854, RMSE = 26.65 mg/dL), and HbA1c prediction performance deteriorates even more (R² = 0.3660, RMSE = 0.8662). Compared with the baseline model, this corresponds to a reduction in explained variance of approximately 43% for glucose and more than 50% for HbA1c. These results demonstrate that medication-related variables provide substantial predictive information and contribute substantially to modeling glycemic outcomes.

The results of the sensitivity analysis for the classification task are presented in Table 18.
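The relative reductions in explained variance quoted above follow directly from the reported R² values, as this short check shows (the helper name is illustrative):

```python
def relative_drop(full, reduced):
    """Relative reduction (in %) when a metric falls from `full` to `reduced`."""
    return 100.0 * (full - reduced) / full

# R² values reported for CMSE (full-information -> medication-free).
glucose_drop = relative_drop(0.6814, 0.3854)
hba1c_drop = relative_drop(0.7525, 0.3660)
print(f"glucose: {glucose_drop:.1f}%  HbA1c: {hba1c_drop:.1f}%")
# prints: glucose: 43.4%  HbA1c: 51.4%
```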

The corresponding results of the regression tasks are shown in Table 19.

Sensitivity analysis shows that excluding medication-related features leads to a marked deterioration in model performance across all tasks. In classification, ACC decreases from 0.949615 to 0.861032, F1 from 0.794468 to 0.650214, and ROC-AUC from 0.974510 to 0.904956, while the Brier score increases from 0.059521 to 0.097826, indicating a loss of both discriminatory power and probabilistic reliability. Nevertheless, the model retains substantial predictive ability (ROC-AUC ≈ 0.905), suggesting that laboratory and anthropometric characteristics still contain informative diagnostic signals. At the same time, the superior performance of the full-information CMSE model is largely due to treatment-related variables. This effect is even more pronounced in the regression tasks: for glucose_mgdl, R² decreases from 0.681382 to 0.385423 and RMSE increases from 22.473540 to 26.649910, while for hba1c_pct, R² decreases from 0.752465 to 0.366000 and RMSE increases from 0.676296 to 0.866235. These results indicate that treatment-related features contribute a substantial additional signal that cannot be obtained from laboratory, physiological, and demographic variables alone. Their inclusion therefore cannot be seen as a neutral extension of the feature space: in a population study, they partly reflect clinical decisions made after diagnosis and thus pose a risk of semantic leakage.

4. Discussion

This study presents CMSE, a framework for jointly modeling three clinically related glycemic targets (diabetes status, plasma glucose, and HbA1c) using heterogeneous population-level data. The primary evaluation focuses on a medication-free configuration to reflect the intended use case as a predictive screening tool, while a secondary full-information configuration incorporating treatment-related variables is retained as an upper-bound scenario. In the primary medication-free configuration, CMSE demonstrates consistent predictive performance, achieving an ROC-AUC of 0.865 for diabetes classification and R² values of 0.385 and 0.366 for predicting plasma glucose and HbA1c, respectively. Although these values are lower than those observed in the full-information scenario, they provide a more realistic estimate of performance in a pre-diagnosis setting, when treatment-related variables are not yet available. Therefore, the full-information configuration should be viewed as an upper-bound scenario that incorporates post-diagnosis treatment effects, rather than as a valid setting for predictive screening.

Importantly, CMSE consistently outperforms baseline models across all tasks, confirming the effectiveness of multi-task ensemble learning even with a limited feature set. The full-information configuration, which includes drug-related variables, provides superior performance due to the additional predictive signal encoded in the treatment features. However, such variables, by their nature, reflect information obtained after diagnosis and therefore should not be interpreted as part of a predictive screening model. Instead, this configuration represents an upper bound on the performance estimate under full clinical information. Distinguishing between these two evaluation conditions is crucial for correctly interpreting the model’s capabilities and aligning performance claims with realistic clinical scenarios. In addition to discrimination performance, CMSE demonstrates improved probabilistic calibration. Applying isotonic regression to out-of-sample predictions results in lower Brier scores, indicating a better match between predicted probabilities and observed outcomes. This emphasizes that the contribution of CMSE lies not only in prediction accuracy but also in generating reliable probability estimates, which is crucial for clinical decision support systems, where risk calibration directly impacts decision thresholds.

Ablation analysis further clarifies the role of model components. The eAG-based inter-task coupling mechanism, which is disabled (λ = 0) in the default configuration, does not provide a consistent performance improvement and can degrade HbA1c regression when enabled (λ > 0). This suggests that the primary improvements achieved with CMSE are due to ensemble diversity, out-of-fold stacking, and calibration, rather than explicit physiological coupling. Therefore, the eAG component should be considered optional and primarily relevant to ensuring physiological consistency rather than improving prediction accuracy.

Interpretability analysis conducted using SHAP in the medication-free configuration supports the clinical plausibility of the learned relationships. The model consistently identifies lipid markers (HDL, LDL, triglycerides), renal parameters (albumin-to-creatinine ratio), inflammatory markers (C-reactive protein), and demographic factors as key determinants of glycemic risk. These results are consistent with existing data on cardiometabolic and inflammatory pathways in glucose dysregulation. Importantly, the absence of treatment-related variables ensures that feature attribution reflects underlying physiological processes rather than indirect indicators of diagnosed disease, thereby enhancing the model’s validity for use in early screening.

The observed differences between the validation-set and cross-validation results, particularly for classification, highlight the sensitivity of performance estimates to cohort composition. The higher ROC-AUC observed on the validation set is more plausibly explained by sampling variability than by systematic overfitting, as the regression results remain relatively stable across evaluation protocols. This underscores the importance of reporting multiple evaluation strategies when working with heterogeneous population datasets.

Despite these advantages, several limitations should be noted. First, the NHANES dataset is cross-sectional, limiting the ability to observe temporal disease progression and causal relationships. Second, external validation on independent cohorts is necessary to assess the generalizability of the results to different populations and medical settings. Third, although the medication-free configuration reduces post-diagnostic leakage, residual confounding due to unobserved variables cannot be ruled out.

Overall, the results indicate that calibrated multi-task ensemble learning provides a robust and clinically relevant framework for glycemic risk modeling. By clearly distinguishing between screening-focused and full-information scenarios, the proposed approach ensures that performance statements are both methodologically sound and relevant to practical clinical scenarios.

5. Conclusions

This study presents the Calibrated Multi-Task Stacking Ensemble (CMSE) for jointly classifying diabetes risk and regressing plasma glucose and HbA1c levels, with a primary focus on predictive screening. In a medication-free configuration excluding all treatment-related variables, the model achieves ROC-AUC = 0.865 for classification and R² = 0.385 and 0.366 for predicting plasma glucose and HbA1c levels, respectively, on the validation dataset. These results indicate that the proposed model is capable of identifying clinically relevant patterns of glycemic dysregulation based solely on physiological, biochemical, and demographic characteristics. Across all tasks, CMSE remains competitive with state-of-the-art baseline models and provides a stable balance between predictive accuracy and interpretability. SHAP analysis indicates that the model’s predictions are driven by consistent metabolic domains, including lipid profile markers, renal parameters, and demographic factors, supporting the biological plausibility of the identified associations. In comparison, a full-information configuration including medication-related variables provides superior predictive performance, reflecting the contribution of treatment signals after diagnosis. However, this configuration represents an upper-bound scenario and is not suitable for assessing screening performance. Overall, the proposed framework offers an interpretable and clinically relevant tool for the early detection and risk stratification of glycemic disorders, with results based on a realistic pre-diagnosis setting. Future work can focus on external validation and prospective evaluation to assess the generalizability and clinical utility of the approach.

Author Contributions

Conceptualization, Z.K., G.O. and A.B. (Aliya Barakova); methodology, Z.K., M.A. and M.O.; software, Z.K., D.Y. and M.O.; validation, M.A., Z.K. and U.S.; formal analysis, Z.K. and M.A.; investigation, Z.K., G.O. and R.B.; data curation, M.O., Z.K., U.S., G.J. and A.B. (Almagul Bukatayeva); writing (original draft preparation), Z.K.; writing (review and editing), G.O., M.A., U.S. and A.B. (Aliya Barakova); visualization, Z.K., D.Y., M.O. and A.B. (Almagul Bukatayeva); supervision, G.O.; project administration, G.O.; funding acquisition, G.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset used in this study was obtained from the Continuous NHANES 2017–March 2020 Pre-Pandemic cycle, publicly released by the CDC/NCHS. Access is provided via the official NHANES data portal: https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2017 (accessed on 7 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.


Figure 1. Distribution of laboratory-defined diabetes status (is_diabetes_labs_only).

Figure 2. Distribution of subjects by plasma glucose clinical categories (mg/dL).

Figure 3. Distribution of subjects by HbA1c clinical categories (%).

Figure 4. CMSE architecture with inter-task communication and probability calibration.

Figure 5. Distribution of plasma glucose concentrations (mg/dL) across normoglycemic and diabetic groups. Orange bars represent the diabetic group.

Figure 6. Overlay of plasma glucose density distributions (mg/dL) for normoglycemic and diabetic subjects.

Figure 7. Empirical cumulative distribution functions (ECDFs) of plasma glucose (mg/dL) in normoglycemic and diabetic groups.

Figure 8. Comparative distribution of HbA1c (%) in normoglycemic and diabetic groups.

Figure 9. Overlay of HbA1c density distributions (%) for normoglycemic and diabetic subjects.

Figure 10. ECDFs of HbA1c (%) in normoglycemic and diabetic groups.

Figure 11. Diabetes prevalence across BMI categories. The label “inf” denotes positive infinity (∞).

Figure 12. Diabetes prevalence across age ranges.

Figure 13. Pearson correlation heatmap between clinical features and glycemic targets.

Figure 14. Mutual information heatmap between features and glycemic outcomes.

Figure 15. XGBoost gain importance across glucose, HbA1c, and diabetes tasks.

Figure 16. Distance correlation heatmap between features and glycemic outcomes.

Figure 17. SHAP summary plot for plasma glucose regression (mg/dL), medication-free configuration.

Figure 18. SHAP summary plot for HbA1c regression (%), medication-free configuration.

Figure 19. SHAP summary plot for diabetes classification (logit scale), medication-free configuration.

Figure 20. Reliability diagram (calibration curve) comparing CMSE and baseline classifiers.

Table 1. Preprocessing and stacking settings.

Block	Parameter	Value/Setting	Note
Train-only preprocessing	Winsor quantile clip	q_low = 0.005, q_high = 0.995 (0.5–99.5%)	For continuous features only
	StandardScaler	fit on train, apply on test	Binary features pass without scaling
Missing/inf	Sanitization	NaN → 0.0 (post-imputation only), ±inf → ±1 × 10⁶	Applied after median imputation to handle residual numerical artifacts; does not affect missing data handling
Stacking (OOF)	CLS Splits	StratifiedKFold (n_splits = 5, shuffle = True, random_state = 42)	For binary classification
	REG Splits	KFold (n_splits = 5, shuffle = True, random_state = 42)	For regressions
CLS Calibration	Isotonic Regression	out_of_bounds = “clip”, per channel, on OOF	For each baseline Classifier
CLS	Decision threshold	0.5	For ACC/F1/MCC
eAG	Formula	eAG = 28.7 * HbA1c − 46.7	Constants: EAG_L = 28.7, EAG_B = −46.7
	Intensity	λ = EAG_LAM = 0.0	>0.0 soft glucose and HbA1c coupling
Other	Random seed	42	Same everywhere
	Parallelism	n_jobs = max(1, cpu_count − 1)	For ExtraTrees
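The train-only preprocessing settings in Table 1 can be sketched as follows. The quantiles, the sanitization constants, and the fit-on-train rule come from the table; the order of imputation and clipping, the helper names, and the synthetic data are assumptions made for illustration. Binary features, which the table says bypass scaling, are omitted here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def impute_and_clip(X, medians, low, high):
    """Median-impute missing values, then Winsorize with train-derived limits."""
    X = np.where(np.isnan(X), medians, X)
    return np.clip(X, low, high)

def preprocess(X, medians, low, high, scaler):
    """Apply train-derived statistics to any split, then sanitize residuals."""
    X = scaler.transform(impute_and_clip(X, medians, low, high))
    return np.nan_to_num(X, nan=0.0, posinf=1e6, neginf=-1e6)

# Synthetic continuous features (not NHANES).
rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 3))
X_test = rng.normal(size=(100, 3))
X_train[0, 0] = 50.0   # extreme outlier
X_test[0, 1] = np.nan  # missing value

# All statistics are estimated on the training split only (leakage-aware).
medians = np.nanmedian(X_train, axis=0)
low = np.nanquantile(X_train, 0.005, axis=0)
high = np.nanquantile(X_train, 0.995, axis=0)
scaler = StandardScaler().fit(impute_and_clip(X_train, medians, low, high))

X_train_p = preprocess(X_train, medians, low, high, scaler)
X_test_p = preprocess(X_test, medians, low, high, scaler)
```

Estimating the medians, clipping limits, and scaling parameters on the training split alone keeps information from the test split out of the fitted pipeline.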

Table 2. Median imputation applied to core laboratory features.

Core Laboratory Feature	Median Imputations, n
fasting_hours	0
hdl_mgdl	3
ldl_mgdl	11
tc_mgdl	3
tg_mgdl	3
hscrp_mg_l	17
ferritin_ng_ml	3
scr_mg_dl	0
uacr_mg_g	77
na_mmol_l	0
k_mmol_l	6
cl_mmol_l	0
cotinine_ng_ml	1

Table 3. Summary of missing data in the extended feature set before median imputation.

Feature	Missing Before Median Imputation
age	0
height_cm	76
weight_kg	69
bmi	81
bp_sys	454
bp_dia	454
fasting_hours	0
hdl_mgdl	0
ldl_mgdl	0
tc_mgdl	0
tg_mgdl	0
hscrp_mg_l	0
ferritin_ng_ml	0
scr_mg_dl	0
uacr_mg_g	0
na_mmol_l	0
k_mmol_l	0
cl_mmol_l	0
cotinine_ng_ml	0
cotinine_ln	0
dpq_total	824
diet_kcal	377
diet_carb_g	377
diet_sugar_g	377
diet_fiber_g	377
diet_protein_g	377
diet_fat_total_g	377
diet_satfat_g	377
diet_sodium_mg	377
diet_potassium_mg	377
diet_caffeine_mg	377
diet_alcohol_g	377

Table 4. Base models (classification branch, CLS).

Model	Key Hyperparameters	Library
ExtraTreesClassifier	n_estimators = 700, max_features = “sqrt”, n_jobs ≈ (#cores − 1), random_state = 42	scikit-learn
HistGradientBoostingClassifier	max_iter = 450, random_state = 42	scikit-learn
LogisticRegression	solver = “lbfgs”, max_iter = 4000, class_weight = “balanced”, random_state = 42	scikit-learn
XGBClassifier*	n_estimators = 400, max_depth = 5, subsample = 0.9, colsample_bytree = 0.9, tree_method = “hist”, eval_metric = “logloss”, random_state = 42	XGBoost
CatBoostClassifier*	depth = 6, iterations = 750, learning_rate = 0.05, loss_function = “Logloss”, auto_class_weights = “Balanced”, verbose = 0, random_seed = 42	CatBoost

#cores denotes the number of available CPU cores used for parallel computation.

Table 5. Base models (regression branches: glucose and HbA1c).

Model	Key Hyperparameters	Library
Ordinary Least Squares (OLS)	no regularization (α = 0)	scikit-learn
ExtraTreesRegressor	n_estimators = 900, max_features = “sqrt”, n_jobs ≈ (#cores − 1), random_state = 42	scikit-learn
HistGradientBoostingRegressor	max_iter = 650, random_state = 42	scikit-learn
XGBRegressor*	n_estimators = 500, max_depth = 6, subsample = 0.9, colsample_bytree = 0.9, tree_method = “hist”, objective = “reg:squarederror”, random_state = 42	XGBoost
CatBoostRegressor*	depth = 6, iterations = 900, learning_rate = 0.05, loss_function = “RMSE”, verbose = 0, random_seed = 42	CatBoost

#cores denotes the number of available CPU cores used for parallel computation.

Table 6. Meta-level configuration and calibration.

Node	Model/ Operation	Hyperparameters	Meta-Feature Input
Calibration (CLS)	IsotonicRegression	out_of_bounds = “clip”	OOF scores of each base classifier
Meta-CLS head	LogisticRegression	solver = “lbfgs”, max_iter = 4000, class_weight = “balanced”, random_state = 42	Concatenation of calibrated classifier outputs, summary statistics (mean, std, min, max), and cross-task regression-derived statistics.
Meta-REG head (Glucose)	Ridge	alpha = 0.5, random_state = 42	Meta-features include Z_g, p*, summary statistics (mean, std, q10, q90), and polynomial terms p* and (p*)².
Meta-REG head (HbA1c)	Ridge	alpha = 0.5, random_state = 42	Similar for Z_a
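A minimal sketch of the regression meta-head in Table 6 is given below. The Ridge settings (alpha = 0.5, random_state = 42) and the list of meta-features follow the table; the reading of p as the calibrated diabetes meta-probability p*, the function name, and the synthetic inputs are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

def build_meta_features(Z_reg, p_star):
    """Stack OOF base-regressor predictions with row-wise summary statistics
    and the diabetes meta-probability p* together with its square."""
    stats = np.column_stack([
        Z_reg.mean(axis=1),
        Z_reg.std(axis=1),
        np.quantile(Z_reg, 0.10, axis=1),
        np.quantile(Z_reg, 0.90, axis=1),
    ])
    return np.column_stack([Z_reg, stats, p_star, p_star ** 2])

# Synthetic stand-ins: 5 base-regressor OOF channels and a toy meta-probability.
rng = np.random.default_rng(42)
n = 300
y = rng.normal(100.0, 25.0, size=n)                    # e.g. glucose in mg/dL
Z_g = y[:, None] + rng.normal(0.0, 15.0, size=(n, 5))  # noisy base predictions
p_star = 1.0 / (1.0 + np.exp(-(y - 126.0) / 10.0))     # toy diabetes probability

M = build_meta_features(Z_g, p_star)
meta_reg = Ridge(alpha=0.5, random_state=42).fit(M, y)
print(M.shape, round(meta_reg.score(M, y), 3))
```

The same construction would apply to the HbA1c head with Z_a in place of Z_g. The score printed here is in-sample and serves only to confirm that the pieces fit together.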

Table 7. Sensitivity of CMSE to the choice of decision threshold.

Scenario	Threshold	ACC	F1	Precision	Recall	MCC
Base (0.50)	0.50	0.949615	0.794468	0.825	0.765	0.735072
Maximum F1	0.67	0.951212	0.808927	0.873	0.753	0.744756
High sensitivity	0.35	0.933660	0.772083	0.689	0.879	0.710247
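
The trade-off in Table 7 (a lower threshold raises recall at the cost of precision) can be reproduced on toy data. The sketch below uses only the standard definitions of accuracy, precision, recall, F1, and MCC; the labels and probabilities are invented for illustration and are not the study's predictions.

```python
import math


def threshold_metrics(y_true, probs, threshold):
    """Binarize probabilities at `threshold` and compute summary metrics."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "ACC": (tp + tn) / len(y_true),
        "F1": f1,
        "Precision": precision,
        "Recall": recall,
        "MCC": mcc,
    }


# Toy labels and calibrated probabilities (hypothetical values).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.36, 0.55, 0.70, 0.90]

for t in (0.35, 0.50, 0.67):  # thresholds taken from Table 7
    print(t, threshold_metrics(y_true, probs, t))
```

On this toy set, recall falls and precision rises as the threshold moves from 0.35 to 0.67, the same direction as the three scenarios in the table.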

Table 8. Classification performance on the hold-out dataset (primary: medication-free; secondary: full-information).

Model	ACC/ACC *	F1/F1 *	ROC-AUC/ROC-AUC *	PR-AUC/PR-AUC *	MCC/MCC *	Brier/Brier *
CMSE	0.801/0.950	0.558/0.794	0.865/0.975	0.568/0.838	0.469/0.735	0.144/0.060
LogReg	0.751/0.919	0.500/0.721	0.838/0.920	0.483/0.785	0.398/0.685	0.161/0.067
ExtraTrees	0.850/0.919	0.232/0.743	0.848/0.942	0.541/0.787	0.272/0.696	0.105/0.074
HistGB	0.858/0.896	0.413/0.735	0.848/0.953	0.544/0.821	0.376/0.684	0.119/0.092
XGB	0.855/0.911	0.396/0.720	0.860/0.941	0.559/0.778	0.358/0.668	0.103/0.072
CatBoost	0.852/0.922	0.542/0.757	0.865/0.946	0.574/0.813	0.453/0.711	0.103/0.063

* Values marked with * correspond to the secondary (full-information) configuration; unmarked values correspond to the primary (medication-free) configuration.
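
The Brier column in Table 8 is the only threshold-free measure of probability quality in the table (lower is better). A minimal sketch of the standard definition, the mean squared difference between predicted probability and the 0/1 outcome, with invented toy values:

```python
def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


y_true = [0, 0, 0, 1]
overconfident = [0.0, 0.0, 1.0, 1.0]  # one confident mistake
hedged = [0.25, 0.25, 0.25, 0.25]     # predicts the base rate for everyone

print(brier_score(y_true, overconfident))
print(brier_score(y_true, hedged))
```

A single confident error is penalized more heavily than uniformly hedged predictions, which is why a model can have higher accuracy and still a worse Brier score.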

Table 9. Plasma glucose regression performance (primary: medication-free; secondary: full-information).

Model	R²/R² *	RMSE/RMSE *	MAE/MAE *	MAPE/MAPE *
CMSE	0.385/0.681	26.65/22.47	14.36/11.42	0.121/0.101
Linear	0.254/0.477	29.37/28.16	16.62/18.57	0.143/0.200
ExtraTrees	0.301/0.612	28.42/24.25	15.09/12.26	0.127/0.109
HistGB	0.308/0.574	28.28/25.43	16.37/12.66	0.142/0.111
XGB	0.397/0.579	26.40/25.27	15.02/12.87	0.129/0.112
CatBoost	0.375/0.615	26.88/24.17	14.88/12.36	0.127/0.112

* Values marked with * correspond to the secondary (full-information) configuration; unmarked values correspond to the primary (medication-free) configuration.
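
The four regression columns in Table 9 follow standard definitions, sketched below in plain Python. The MAPE values in the table (for example 0.121) appear to be reported as fractions rather than percentages; that reading is an inference from their magnitude. The glucose values used here are invented for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE, MAE and MAPE (as a fraction) for paired values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "MAPE": sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / n,
    }


# Hypothetical plasma glucose values in mg/dL.
glucose_true = [90.0, 100.0, 110.0, 200.0]
glucose_pred = [95.0, 100.0, 105.0, 170.0]
print(regression_metrics(glucose_true, glucose_pred))
```

RMSE is dominated by the single large miss on the 200 mg/dL case, while MAE and MAPE weight it linearly, which helps explain why the RMSE and MAE rankings of models can differ.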

Table 10. HbA1c regression performance (unmarked values: full-information; values marked *: medication-free).

Model	R²/R² *	RMSE/RMSE *	MAE/MAE *	MAPE/MAPE *
CMSE	0.752/0.366	0.676/0.866	0.353/0.490	0.057/0.078
Linear	0.511/0.251	0.910/0.942	0.620/0.547	0.117/0.088
ExtraTrees	0.700/0.297	0.712/0.912	0.371/0.506	0.062/0.080
HistGB	0.693/0.305	0.721/0.907	0.386/0.532	0.064/0.086
XGB	0.663/0.355	0.754/0.874	0.405/0.507	0.066/0.082
CatBoost	0.712/0.352	0.698/0.876	0.374/0.503	0.062/0.081

* Values marked with * correspond to the primary (medication-free) configuration.

Table 11. Classification (is_diabetes_labs_only), 5-fold CV.

Model	ACC (Mean ± Std; 95% CI)	F1 (Mean ± Std; 95% CI)	ROC_AUC (Mean ± Std; 95% CI)	PR_AUC (Mean ± Std; 95% CI)	MCC (Mean ± Std; 95% CI)	Brier (Mean ± Std; 95% CI)
CMSE	0.9198 ± 0.0111; 95%CI [0.9100, 0.9295]	0.7456 ± 0.0401; 95%CI [0.7104, 0.7808]	0.9324 ± 0.0144; 95%CI [0.9198, 0.9450]	0.7672 ± 0.0392; 95%CI [0.7328, 0.8016]	0.7000 ± 0.0451; 95%CI [0.6604, 0.7395]	0.0666 ± 0.0062; 95%CI [0.0612, 0.0720]
LogReg	0.9121 ± 0.0078; 95%CI [0.9052, 0.9189]	0.6999 ± 0.0317; 95%CI [0.6721, 0.7277]	0.9090 ± 0.0151; 95%CI [0.8958, 0.9223]	0.7482 ± 0.0261; 95%CI [0.7253, 0.7711]	0.6585 ± 0.0337; 95%CI [0.6290, 0.6880]	0.0725 ± 0.0057; 95%CI [0.0675, 0.0775]
ExtraTrees	0.9150 ± 0.0121; 95%CI [0.9043, 0.9256]	0.7233 ± 0.0494; 95%CI [0.6800, 0.7666]	0.9314 ± 0.0167; 95%CI [0.9168, 0.9460]	0.7739 ± 0.0375; 95%CI [0.7410, 0.8069]	0.6773 ± 0.0530; 95%CI [0.6309, 0.7237]	0.0784 ± 0.0118; 95%CI [0.0681, 0.0888]
HistGB	0.8924 ± 0.0057; 95%CI [0.8874, 0.8974]	0.7111 ± 0.0144; 95%CI [0.6985, 0.7237]	0.9395 ± 0.0091; 95%CI [0.9315, 0.9475]	0.7909 ± 0.0379; 95%CI [0.7577, 0.8242]	0.6524 ± 0.0178; 95%CI [0.6367, 0.6680]	0.0914 ± 0.0051; 95%CI [0.0869, 0.0959]
XGB	0.9094 ± 0.0124; 95%CI [0.8985, 0.9203]	0.7062 ± 0.0472; 95%CI [0.6648, 0.7476]	0.9292 ± 0.0152; 95%CI [0.9159, 0.9425]	0.7687 ± 0.0489; 95%CI [0.7258, 0.8115]	0.6568 ± 0.0514; 95%CI [0.6117, 0.7018]	0.0767 ± 0.0111; 95%CI [0.0670, 0.0864]
CatBoost	0.9137 ± 0.0093; 95%CI [0.9055, 0.9218]	0.7158 ± 0.0384; 95%CI [0.6821, 0.7494]	0.9359 ± 0.0146; 95%CI [0.9231, 0.9487]	0.7783 ± 0.0441; 95%CI [0.7396, 0.8169]	0.6700 ± 0.0403; 95%CI [0.6347, 0.7053]	0.0675 ± 0.0092; 95%CI [0.0594, 0.0756]
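
The intervals in Table 11 appear consistent with a normal-approximation interval over the five folds, mean ± 1.96 · std / √5. This section does not state the formula, so the sketch below is an inference checked against the reported numbers, not a description of the authors' code. Whether the reported standard deviation uses n or n − 1 in the denominator is also not stated; the sample version is assumed for the raw-score helper.

```python
import math
import statistics


def ci_from_summary(mean, std, n_folds=5, z=1.96):
    """Normal-approximation confidence interval from a reported mean and std."""
    half_width = z * std / math.sqrt(n_folds)
    return mean - half_width, mean + half_width


def mean_std_ci(scores, z=1.96):
    """Mean, sample std (n - 1 denominator) and CI from per-fold scores."""
    m = statistics.mean(scores)
    s = statistics.stdev(scores)
    return m, s, ci_from_summary(m, s, n_folds=len(scores), z=z)


# CMSE accuracy row of Table 11: 0.9198 ± 0.0111.
print(ci_from_summary(0.9198, 0.0111))

# Hypothetical per-fold scores, for illustration only.
print(mean_std_ci([0.91, 0.93, 0.92, 0.90, 0.94]))
```

With only five folds, a Student-t multiplier (about 2.78 for 4 degrees of freedom) would give wider intervals than the 1.96 used here.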

Table 12. Comparison of hold-out and 5-fold cross-validation results for CMSE, with the per-task gap (hold-out minus CV mean).

Task	Metric	Hold-Out	5-Fold CV (Mean)	Gap
is_diabetes_labs_only classification	ROC-AUC	0.974510	0.9324	+0.042110
is_diabetes_labs_only classification	F1	0.794468	0.7456	+0.048868
glucose_mgdl regression	R²	0.681382	0.6254	+0.055982
hba1c_pct regression	R²	0.752465	0.7401	+0.012365

Table 13. Regression (glucose_mgdl), 5-fold CV.

Model	R² (Mean ± Std; 95% CI)	RMSE (Mean ± Std; 95% CI)	MAE (Mean ± Std; 95% CI)	MAPE (Mean ± Std; 95% CI)
CMSE	0.6254 ± 0.0623; 95%CI [0.5708, 0.6799]	25.7153 ± 4.9299; 95%CI [21.3940, 30.0366]	12.2576 ± 1.4144; 95%CI [11.0178, 13.4973]	0.1039 ± 0.0057; 95%CI [0.0990, 0.1089]
Linear	0.4445 ± 0.0269; 95%CI [0.4210, 0.4680]	31.3222 ± 3.3380; 95%CI [28.3963, 34.2481]	19.0658 ± 0.8155; 95%CI [18.3510, 19.7806]	0.1950 ± 0.0048; 95%CI [0.1909, 0.1992]
ExtraTrees	0.5935 ± 0.0272; 95%CI [0.5697, 0.6174]	26.8075 ± 3.1366; 95%CI [24.0582, 29.5569]	12.3101 ± 0.9991; 95%CI [11.4343, 13.1858]	0.1039 ± 0.0044; 95%CI [0.1000, 0.1078]
HistGB	0.5978 ± 0.0420; 95%CI [0.5610, 0.6346]	26.5452 ± 2.0908; 95%CI [24.7125, 28.3779]	13.0074 ± 0.7128; 95%CI [12.3826, 13.6322]	0.1113 ± 0.0055; 95%CI [0.1065, 0.1161]
XGB	0.5494 ± 0.0605; 95%CI [0.4964, 0.6024]	28.1303 ± 3.0767; 95%CI [25.4335, 30.8272]	13.5327 ± 0.7848; 95%CI [12.8448, 14.2206]	0.1133 ± 0.0054; 95%CI [0.1085, 0.1180]
CatBoost	0.6150 ± 0.0370; 95%CI [0.5825, 0.6474]	26.0157 ± 2.4782; 95%CI [23.8435, 28.1879]	12.5895 ± 0.7686; 95%CI [11.9158, 13.2632]	0.1084 ± 0.0049; 95%CI [0.1041, 0.1127]

Table 14. Regression (hba1c_pct), 5-fold CV.

Model	R² (Mean ± Std; 95% CI)	RMSE (Mean ± Std; 95% CI)	MAE (Mean ± Std; 95% CI)	MAPE (Mean ± Std; 95% CI)
CMSE	0.7401 ± 0.0309; 95%CI [0.7130, 0.7671]	0.6933 ± 0.0962; 95%CI [0.6089, 0.7776]	0.3599 ± 0.0302; 95%CI [0.3334, 0.3864]	0.0580 ± 0.0038; 95%CI [0.0547, 0.0613]
Linear	0.5228 ± 0.0126; 95%CI [0.5117, 0.5338]	0.9396 ± 0.0373; 95%CI [0.9069, 0.9723]	0.6248 ± 0.0105; 95%CI [0.6156, 0.6340]	0.1143 ± 0.0020; 95%CI [0.1125, 0.1160]
ExtraTrees	0.7056 ± 0.0197; 95%CI [0.6883, 0.7229]	0.7382 ± 0.0471; 95%CI [0.6969, 0.7796]	0.3649 ± 0.0213; 95%CI [0.3462, 0.3835]	0.0585 ± 0.0029; 95%CI [0.0560, 0.0611]
HistGB	0.7257 ± 0.0322; 95%CI [0.6975, 0.7539]	0.7114 ± 0.0487; 95%CI [0.6688, 0.7541]	0.3746 ± 0.0209; 95%CI [0.3563, 0.3930]	0.0605 ± 0.0028; 95%CI [0.0581, 0.0630]
XGB	0.6973 ± 0.0224; 95%CI [0.6777, 0.7169]	0.7475 ± 0.0268; 95%CI [0.7240, 0.7711]	0.3900 ± 0.0103; 95%CI [0.3810, 0.3990]	0.0628 ± 0.0014; 95%CI [0.0616, 0.0640]
CatBoost	0.7343 ± 0.0281; 95%CI [0.7097, 0.7589]	0.7005 ± 0.0462; 95%CI [0.6599, 0.7410]	0.3667 ± 0.0170; 95%CI [0.3518, 0.3815]	0.0595 ± 0.0021; 95%CI [0.0577, 0.0614]

Table 15. Ablation study of the eAG coupling coefficient λ on hold-out performance.

λ	ACC	F1	ROC_AUC	Brier	Glucose R²	Glucose RMSE	HbA1c R²	HbA1c RMSE
0.00	0.949615	0.794468	0.974510	0.059521	0.681382	22.473540	0.752465	0.676296
0.10	0.949602	0.794211	0.974378	0.059620	0.680912	22.492008	0.548109	0.842163
0.25	0.949587	0.793990	0.974246	0.059714	0.680817	22.503274	0.545321	0.845127
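
To make the role of λ in Table 15 concrete, the sketch below uses the widely cited linear relation between HbA1c and estimated average glucose, eAG (mg/dL) = 28.7 × HbA1c (%) − 46.7. The coupling rule itself is a loud assumption: this section does not give the article's exact formula, so `couple_hba1c` is a hypothetical convex blend chosen only so that λ = 0 leaves the prediction unchanged, matching the first row of the table.

```python
def eag_mgdl(hba1c_pct):
    """Estimated average glucose (mg/dL) implied by an HbA1c value (%)."""
    return 28.7 * hba1c_pct - 46.7


def hba1c_from_glucose(glucose_mgdl):
    """Inverse of the eAG relation: HbA1c (%) implied by a glucose level."""
    return (glucose_mgdl + 46.7) / 28.7


def eag_ratio(glucose_mgdl, hba1c_pct):
    """Ratio of a glucose value to the eAG implied by HbA1c (1.0 = coherent)."""
    return glucose_mgdl / eag_mgdl(hba1c_pct)


def couple_hba1c(hba1c_pred, glucose_pred, lam):
    """HYPOTHETICAL coupling: pull the HbA1c prediction toward the value
    implied by the glucose prediction with weight `lam` (not the article's
    formula)."""
    return (1.0 - lam) * hba1c_pred + lam * hba1c_from_glucose(glucose_pred)


for lam in (0.00, 0.10, 0.25):  # coefficients from Table 15
    print(lam, couple_hba1c(hba1c_pred=6.0, glucose_pred=160.0, lam=lam))
```

A single fasting glucose value is a noisy proxy for the multi-week average that HbA1c reflects, so any rule that pulls HbA1c predictions toward glucose-implied values can degrade HbA1c accuracy, which is the direction of the effect reported for λ > 0.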

Table 16. Ablation study of CMSE for the classification task.

Variant	ACC	F1	ROC_AUC	PR_AUC	MCC	Brier
CMSE (base)	0.949615	0.794468	0.974510	0.838077	0.735072	0.059521
CMSE_no-calibration	0.944118	0.792031	0.974358	0.835629	0.729154	0.064029
CMSE_λ = 0.10	0.949602	0.794211	0.974378	0.838004	0.734981	0.059620
CMSE_λ = 0.25	0.949587	0.793990	0.974246	0.837882	0.734822	0.059714
CMSE_no-medication	0.861032	0.650214	0.904956	0.675082	0.552187	0.097826

Table 17. Ablation study of CMSE for regression tasks (glucose and HbA1c).

Variant	Glucose R²	Glucose RMSE	HbA1c R²	HbA1c RMSE
CMSE (base)	0.681382	22.473540	0.752465	0.676296
CMSE_no-calibration	0.678214	22.552311	0.752823	0.676994
CMSE_λ = 0.10	0.680912	22.492008	0.548109	0.842163
CMSE_λ = 0.25	0.680817	22.503274	0.545321	0.845127
CMSE_no-medication	0.385423	26.649910	0.366000	0.866235

Table 18. Sensitivity analysis of CMSE classification performance without medication-related features.

Variant	ACC	F1	ROC-AUC	PR-AUC	MCC	Brier
CMSE (base)	0.949615	0.794468	0.974510	0.838077	0.735072	0.059521
CMSE_no-medication	0.861032	0.650214	0.904956	0.675082	0.552187	0.097826

Table 19. Sensitivity analysis of CMSE regression performance without medication-related features.

Variant	Glucose R²	Glucose RMSE	HbA1c R²	HbA1c RMSE
CMSE (base)	0.681382	22.473540	0.752465	0.676296
CMSE_no-medication	0.385423	26.649910	0.366000	0.866235

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Khamitova, Z.; Omarova, G.; Akhmetzhanov, M.; Burganova, R.; Orynbassar, M.; Sabirova, U.; Bukatayeva, A.; Barakova, A.; Jiyanmuratova, G.; Yuldasheva, D. A Calibrated Multi-Task Ensemble Architecture for Biomedical Risk Prediction. Computers 2026, 15, 244. https://doi.org/10.3390/computers15040244
