Article
Peer-Review Record

Beyond Glycemic Control: Precision Medicine in Type 2 Diabetes Using Multi-Output Explainable Artificial Intelligence for Personalized SGLT2 and DPP-4 Therapy Selection

by Anusha Ihalapathirana 1,*, Piia Lavikainen 2, Pekka Siirtola 1, Satu Tamminen 1, Gunjan Chandra 1, Tiina Laatikainen 3, Janne Martikainen 2 and Juha Röning 1,4
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 15 April 2026 / Revised: 12 May 2026 / Accepted: 16 May 2026 / Published: 22 May 2026
(This article belongs to the Special Issue Digital Health: AI-Driven Personalized Healthcare and Applications)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors
  1. The final test set contains only 101 samples (52 SGLT2-i, 49 DPP4-i), and after further outlier removal the concordant/discordant subgroups shrink to as few as 9–10 patients. With such small cell sizes, the reported treatment effect differences are highly susceptible to random variation, and confidence intervals (which are absent) would likely be very wide. The authors should report confidence intervals or bootstrap estimates for all treatment effect comparisons, and discuss the statistical power limitations more explicitly. Consider whether a larger held-out set or repeated random splits could provide more stable estimates.
  2. The authors impute missing outcome values in the training set using predictive models, yet the LDL imputation model achieved only an $R^2$ of 0.213, essentially explaining very little variance. Treating these imputed outcomes as ground truth during training risks injecting systematic noise. Moreover, the imputation models themselves use the drug class as a predictor, which could create circular dependencies when the imputed outcomes are later used to evaluate drug-specific treatment effects. The paper would benefit from a sensitivity analysis comparing results with and without imputed outcomes, or from adopting multiple imputation to properly propagate uncertainty.
  3. The best multi-output model (LightGBM) was selected based on $R^2$ and RMSE, metrics that evaluate outcome prediction accuracy. However, the actual goal is treatment selection, which depends on the accuracy of predicted differences between therapies rather than the accuracy of individual predictions. A model with moderate $R^2$ could still rank treatments correctly, while a model with higher $R^2$ might systematically misestimate treatment contrasts. The authors should evaluate and select models using treatment-selection-specific metrics (e.g., the concordant-vs-discordant treatment effect gap) rather than generic regression performance, or at least demonstrate that $R^2$ and RMSE correlate well with downstream selection quality in their setting.

Author Response

Comment 1: 

The final test set contains only 101 samples (52 SGLT2-i, 49 DPP4-i), and after further outlier removal the concordant/discordant subgroups shrink to as few as 9–10 patients. With such small cell sizes, the reported treatment effect differences are highly susceptible to random variation, and confidence intervals (which are absent) would likely be very wide. The authors should report confidence intervals or bootstrap estimates for all treatment effect comparisons, and discuss the statistical power limitations more explicitly. Consider whether a larger held-out set or repeated random splits could provide more stable estimates.

Response 1:

We thank the reviewer for this important comment. We have now reported 95% bootstrap confidence intervals for all treatment effect comparisons (Tables 2, 4, and 5). As expected, several estimates exhibit wide confidence intervals, reflecting the small sample size within concordant and discordant subgroups.
We agree that this limits the statistical power and stability of the estimated treatment effects. This limitation has now been discussed in the manuscript (Section 4). While expanding the test set was not feasible due to data constraints, future work will explore larger cohorts to obtain more stable estimates.
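For readers unfamiliar with the approach, the kind of interval described in this response can be sketched as a percentile bootstrap on the difference in mean outcome between two small subgroups. This is only an illustration under stated assumptions: the function name `bootstrap_ci_mean_diff`, the number of resamples, and the outcome values are hypothetical and are not taken from the manuscript, and the authors may have used a different bootstrap variant.

```python
import random
import statistics


def bootstrap_ci_mean_diff(group_a, group_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each subgroup independently, with replacement.
        a = rng.choices(group_a, k=len(group_a))
        b = rng.choices(group_b, k=len(group_b))
        diffs.append(statistics.fmean(a) - statistics.fmean(b))
    diffs.sort()
    # Empirical alpha/2 and 1 - alpha/2 quantiles of the resampled differences.
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    point = statistics.fmean(group_a) - statistics.fmean(group_b)
    return point, lo, hi


# Hypothetical 12-month outcome changes; illustrative values only.
concordant = [-12.0, -8.5, -15.0, -6.0, -10.5, -9.0, -13.5, -7.0, -11.0, -5.5]
discordant = [-4.0, -9.0, -2.5, -7.5, 1.0, -6.0, -3.0, -8.0, -0.5]

point, lo, hi = bootstrap_ci_mean_diff(concordant, discordant)
print(f"difference = {point:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With subgroups of about ten patients each, an interval of this kind is typically wide, which is consistent with the authors' remark about limited statistical power.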

Comment 2: The authors impute missing outcome values in the training set using predictive models, yet the LDL imputation model achieved only an $R^2$ of 0.213, essentially explaining very little variance. Treating these imputed outcomes as ground truth during training risks injecting systematic noise. Moreover, the imputation models themselves use the drug class as a predictor, which could create circular dependencies when the imputed outcomes are later used to evaluate drug-specific treatment effects. The paper would benefit from a sensitivity analysis comparing results with and without imputed outcomes, or from adopting multiple imputation to properly propagate uncertainty.

Response 2: 

We thank the reviewer for this insightful comment. We would like to clarify that LDL outcome values were not imputed in the final analysis. Due to the low predictive performance of the LDL imputation model, all samples with missing LDL outcomes were removed from the training dataset to avoid introducing noise. This has been clarified in the manuscript (Section 2.4.1 paragraph 3).

For the remaining outcomes, we acknowledge the concern regarding potential bias introduced by including drug class as a predictor in the imputation models. To address this, we have removed the drug class variable from the imputation models and updated all corresponding results, tables, and figures accordingly.

We agree that model-based imputation may introduce uncertainty that is not fully propagated in the current framework. This limitation has been acknowledged in the manuscript (Section 4). Future work will explore multiple imputation to better account for uncertainty in missing outcome values.

Comment 3: 

The best multi-output model (LightGBM) was selected based on $R^2$ and RMSE, metrics that evaluate outcome prediction accuracy. However, the actual goal is treatment selection, which depends on the accuracy of predicted differences between therapies rather than the accuracy of individual predictions. A model with moderate $R^2$ could still rank treatments correctly, while a model with higher $R^2$ might systematically misestimate treatment contrasts. The authors should evaluate and select models using treatment-selection-specific metrics (e.g., the concordant-vs-discordant treatment effect gap) rather than generic regression performance, or at least demonstrate that $R^2$ and RMSE correlate well with downstream selection quality in their setting.

Response 3: 

We thank the reviewer for this important observation. We agree that model selection should be aligned with the downstream objective of treatment selection rather than relying solely on predictive performance metrics.

In our study, the final model was selected based on treatment-selection-specific evaluation, specifically its ability to discriminate between concordant and discordant subgroups. Predictive performance metrics (R² and RMSE) were not the primary criteria for model selection. Although the MLPR model achieved the highest predictive performance (R² = 0.49, RMSE = 5.11; Appendix A), it did not yield the best treatment-selection performance.

We have clarified this in the manuscript (Section 3, paragraphs 1 and 2) by explicitly describing the model selection strategy in the Results section.
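One way to picture a treatment-selection-specific criterion of the kind discussed in this exchange is the gap in observed outcome between patients who received the drug a model would have recommended (concordant) and those who did not (discordant). The sketch below is an assumption-laden illustration, not the authors' procedure: the function `concordance_gap`, the drug labels "A" and "B", the lower-is-better convention, and all numbers are hypothetical.

```python
from statistics import fmean


def concordance_gap(pred_a, pred_b, received, observed):
    """Mean observed outcome of concordant minus discordant patients.

    pred_a, pred_b: predicted outcome under drug "A" / "B" (lower is better).
    received: drug each patient actually received ("A" or "B").
    observed: observed outcome on the received drug.
    """
    concordant, discordant = [], []
    for pa, pb, drug, y in zip(pred_a, pred_b, received, observed):
        recommended = "A" if pa <= pb else "B"
        (concordant if drug == recommended else discordant).append(y)
    if not concordant or not discordant:
        return None  # the gap is undefined when either subgroup is empty
    return fmean(concordant) - fmean(discordant)


# Hypothetical data for six patients.
received = ["A", "A", "A", "B", "B", "B"]
observed = [-10.0, -3.0, -9.0, -11.0, -4.0, -8.0]

# Model 1 produces patient-specific recommendations.
gap_1 = concordance_gap(
    [-9.0, -2.0, -8.0, -5.0, -6.0, -4.0],
    [-6.0, -7.0, -5.0, -10.0, -3.0, -9.0],
    received,
    observed,
)
# Model 2 recommends drug "A" for every patient.
gap_2 = concordance_gap([-8.0] * 6, [-5.0] * 6, received, observed)

print(gap_1, gap_2)
```

Under the lower-is-better convention assumed here, a more negative gap means concordant patients fared better than discordant ones, so a model can be preferred on this criterion even when its R² is not the highest.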

Comment on Tables and Figures  - Figures and tables can be improved

Response - We thank the reviewer for this comment. The figures were revised to improve their quality, readability, and overall visibility throughout the manuscript.

 

Reviewer 2 Report

Comments and Suggestions for Authors

This study develops an explainable AI framework for personalized treatment selection between SGLT2 inhibitors and DPP-4 inhibitors in Type 2 diabetes, using multi-output regression to simultaneously predict four health outcomes — HbA1c, LDL cholesterol, HDL cholesterol, and BMI — twelve months post-drug initiation. A LightGBM-based multi-output model (R²≈0.44) is combined with SHAP for interpretability and an aggregation strategy for single-treatment recommendations.

 

However, I have the following concerns:

 

(1) The multi-output model demonstrates only moderate performance (R² ≈ 0.44), and the associated prediction uncertainty may limit the clinical reliability of its treatment recommendations. While improvements could be achieved with additional data and more robust modeling approaches, the current level of performance makes it difficult to draw meaningful conclusions or provide reliable guidance.

 

(2) The authors apply random oversampling to address a moderate treatment imbalance (637 vs. 440). While the test set is reportedly left unmodified, it is important to clarify whether oversampling was performed after the train–test split to avoid data leakage. In particular, the authors should confirm that no duplicated samples appear in the test set. Additionally, since the multi-output LightGBM model includes the drug class variable as a predictive feature, the authors should justify this design choice and discuss whether it may introduce bias or trivialize treatment effect estimation.

 

(3) The manuscript uses imputation models for missing outcome variables (HbA1c, HDL, BMI) trained on the training dataset, where “drug class” is included as a predictor (Table 1), and these imputed values are subsequently used as training labels for the main treatment selection model. This design raises a concern of circularity: because drug class is used both in imputing outcomes and as a predictor in the downstream model, the imputation step may inadvertently encode treatment effects into the labels, leading to biased estimates and inflated predictive performance. To mitigate this risk, the authors should re-train imputation models excluding the drug class variable or adopt multiple imputation approaches based only on pre-treatment covariates. Additionally, a sensitivity analysis comparing model performance with and without imputed outcomes would help quantify the extent of this potential bias.

 

(4) No comparison is provided to standard reference models such as predicting the cohort mean, using each patient’s baseline value, or applying a regularized linear regression. Without these null-model benchmarks, it is unclear whether the reported R² represents meaningful predictive improvement, as it may only marginally exceed a naive mean-based prediction given the inherent variability in EHR outcomes. To strengthen the evaluation, the model should be explicitly compared against at least a mean-prediction baseline and a regularized linear regression.

Author Response

Comment 1: 

The multi-output model demonstrates only moderate performance (R² ≈ 0.44), and the associated prediction uncertainty may limit the clinical reliability of its treatment recommendations. While improvements could be achieved with additional data and more robust modeling approaches, the current level of performance makes it difficult to draw meaningful conclusions or provide reliable guidance.

Response 1:

We thank the reviewer for this comment. We acknowledge that the predictive performance of the multi-output model is moderate and that prediction uncertainty may affect the reliability of individualized treatment recommendations. However, the primary objective of the framework is treatment selection rather than precise prediction of absolute outcome values. In this setting, the ability to correctly differentiate relative treatment benefits between therapies is more important than maximizing predictive metrics such as R² alone.
Accordingly, the model was evaluated based on treatment-selection-specific performance, particularly the observed differences between concordant and discordant subgroups (Section 3, paragraph 1). We agree that larger datasets and more advanced modeling approaches may improve the robustness and clinical applicability of the framework, and this has now been further emphasized in the manuscript limitations and discussion sections.

Comment 2:

The authors apply random oversampling to address a moderate treatment imbalance (637 vs. 440). While the test set is reportedly left unmodified, it is important to clarify whether oversampling was performed after the train–test split to avoid data leakage. In particular, the authors should confirm that no duplicated samples appear in the test set. Additionally, since the multi-output LightGBM model includes the drug class variable as a predictive feature, the authors should justify this design choice and discuss whether it may introduce bias or trivialize treatment effect estimation.

Response 2:

We thank the reviewer for this important comment. Oversampling was performed only on the training dataset after the train–test split, and the test dataset remained completely untouched throughout the analysis. Therefore, no duplicated samples from the oversampling process were present in the test set, avoiding potential data leakage and ensuring unbiased evaluation. We have clarified this more explicitly in the manuscript (Section 2.4.1 paragraph 2).
Regarding the inclusion of the drug class variable, this design follows an S-learner style treatment effect estimation framework, where the treatment assignment variable is included as an input feature to enable estimation of potential outcomes under different treatments. In this setting, the treatment variable is essential for modeling counterfactual predictions rather than representing a conventional predictive covariate. We have further clarified this in the manuscript (Section 2.4.1 paragraph 5).
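The S-learner idea mentioned in this response can be sketched as follows: a single model is trained with the treatment indicator as an input feature, and counterfactual predictions are obtained by scoring the same patient twice, once under each treatment. The example is a minimal sketch under stated assumptions. A small linear model with a treatment interaction stands in for the authors' LightGBM model, the data are synthetic, the 0/1 treatment coding is hypothetical, and a lower predicted outcome is assumed to be better.

```python
def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (Gauss-Jordan solve)."""
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(n)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][n] for i in range(n)]


def features(x, t):
    # The treatment indicator t is an ordinary model input (S-learner),
    # with an interaction so the treatment contrast can vary with x.
    return [1.0, x, float(t), x * t]


def predict(coef, x, t):
    return sum(c * f for c, f in zip(coef, features(x, t)))


def recommend(coef, x):
    # Score the same patient under both treatments and pick the lower outcome.
    y0, y1 = predict(coef, x, 0), predict(coef, x, 1)
    return (0 if y0 <= y1 else 1), y0, y1


# Synthetic training data: outcome = 2 + x + t * (1 - 2x), no noise.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
train = [(x, t, 2.0 + x + t * (1.0 - 2.0 * x)) for t in (0, 1) for x in xs]
coef = fit_ols([features(x, t) for x, t, _ in train], [y for *_, y in train])

for x in (0.2, 0.9):
    drug, y0, y1 = recommend(coef, x)
    print(f"x={x}: predicted {y0:.2f} under t=0, {y1:.2f} under t=1 -> t={drug}")
```

The point of the design is that the treatment variable is what makes the two counterfactual predictions differ; removing it from the inputs would leave nothing to contrast.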

Comment 3:

The manuscript uses imputation models for missing outcome variables (HbA1c, HDL, BMI) trained on the training dataset, where “drug class” is included as a predictor (Table 1), and these imputed values are subsequently used as training labels for the main treatment selection model. This design raises a concern of circularity: because drug class is used both in imputing outcomes and as a predictor in the downstream model, the imputation step may inadvertently encode treatment effects into the labels, leading to biased estimates and inflated predictive performance. To mitigate this risk, the authors should re-train imputation models excluding the drug class variable or adopt multiple imputation approaches based only on pre-treatment covariates. Additionally, a sensitivity analysis comparing model performance with and without imputed outcomes would help quantify the extent of this potential bias.

Response 3: 

We thank the reviewer for this important observation. We agree that including the drug class variable in the imputation models could potentially introduce bias by encoding treatment-related information into the imputed outcome values.
To address this concern, we re-trained all outcome imputation models after excluding the drug class variable from the predictor set. All corresponding results, tables, and figures have been updated accordingly. In addition, LDL outcome values were not imputed in the final analysis due to the low performance of the LDL imputation model, and samples with missing LDL outcomes were removed from the training dataset.
We agree that model-based imputation may introduce uncertainty that is not fully propagated in the current framework. This limitation has been acknowledged in the manuscript (Section 4). Future work will explore multiple imputation to better account for uncertainty in missing outcome values.

Comment 4:

No comparison is provided to standard reference models such as predicting the cohort mean, using each patient’s baseline value, or applying a regularized linear regression. Without these null-model benchmarks, it is unclear whether the reported R² represents meaningful predictive improvement, as it may only marginally exceed a naive mean-based prediction given the inherent variability in EHR outcomes. To strengthen the evaluation, the model should be explicitly compared against at least a mean-prediction baseline and a regularized linear regression.

Response 4:

We thank the reviewer for this valuable suggestion. To better contextualize the predictive performance of the proposed framework, we added comparisons against a cohort mean prediction baseline and regularized linear regression models (Elastic Net and Ridge regression).

The mean baseline achieved an R² of -0.007 and an RMSE of 6.59, while Elastic Net and Ridge regression achieved R² scores of 0.195 and 0.464, respectively. These results demonstrate that the evaluated machine learning models substantially outperform naive mean-based prediction.

Although Ridge regression achieved slightly higher predictive performance metrics than the selected LightGBM model, it assigned all patients to the SGLT2-i treatment group and therefore failed to capture treatment heterogeneity. In contrast, the LightGBM model demonstrated higher treatment-selection-specific performance, particularly in discriminating between concordant and discordant subgroups, and was therefore selected as the final model. These additions and clarifications have now been added to the manuscript (Section 3, paragraphs 1 and 2).
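A slightly negative R² for a mean-prediction baseline, as reported in this response, is what one would expect if the baseline predicts the training-set mean and is scored on a held-out test set whose own mean differs a little. The sketch below illustrates this with made-up numbers; the helper names and values are hypothetical, and the assumption that the baseline uses the training mean is not confirmed by the response.

```python
from math import sqrt
from statistics import fmean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(fmean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


# Hypothetical outcome values; illustrative only.
y_train = [50.0, 55.0, 60.0, 48.0, 62.0]
y_test = [52.0, 58.0, 49.0, 63.0]

# Mean baseline: predict the training mean for every test patient.
baseline = [fmean(y_train)] * len(y_test)

r2 = r2_score(y_test, baseline)
err = rmse(y_test, baseline)
print(f"baseline R2 = {r2:.4f}, RMSE = {err:.2f}")
```

Because SS_tot is computed around the test mean while the prediction is the training mean, SS_res slightly exceeds SS_tot and R² falls just below zero. Any model with clearly positive R² on the same test set therefore improves on this reference.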

Comment on Tables and Figures - Figures and tables must be improved

Response - We thank the reviewer for this comment. The figures and tables were revised to improve their quality, readability, and overall visibility throughout the manuscript.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have addressed my concerns. I agree to accept the paper.

Reviewer 2 Report

Comments and Suggestions for Authors

Revision looks good to me.
