Article

Interpretable Machine Learning Framework for Diabetes Prediction: Integrating SMOTE Balancing with SHAP Explainability for Clinical Decision Support

by
Pathamakorn Netayawijit 1, Wirapong Chansanam 2 and Kanda Sorn-In 3,*
1 Department of Information Systems, Faculty of Business Administration and Information Technology, Rajamangala University of Technology Isan, Khon Kaen Campus, Khon Kaen 40000, Thailand
2 Department of Information Science, Faculty of Humanities and Social Sciences, Khon Kaen University, Khon Kaen 40002, Thailand
3 Department of Technology and Engineering, Faculty of Interdisciplinary Studies, Khon Kaen University, Nong Khai Campus, Nong Khai 43000, Thailand
* Author to whom correspondence should be addressed.
Healthcare 2025, 13(20), 2588; https://doi.org/10.3390/healthcare13202588
Submission received: 3 September 2025 / Revised: 10 October 2025 / Accepted: 12 October 2025 / Published: 14 October 2025

Abstract

Background: Class imbalance and limited interpretability remain major barriers to the clinical adoption of machine learning for diabetes prediction, often leading to poor sensitivity for high-risk cases and reduced trust in AI-based decision support. This study addresses both limitations by integrating SMOTE-based resampling with SHAP-driven explainability, aiming to enhance predictive performance and clinical transparency for real-world deployment. Objective: To develop and validate an interpretable machine learning framework that addresses class imbalance through advanced resampling while providing clinically meaningful explanations for enhanced decision support. The study is a methodologically rigorous proof-of-concept that prioritizes analytical integrity over scale: it analyzes a computationally feasible subset of 1500 records from the publicly available, de-identified Diabetes Prediction Dataset hosted on Kaggle, a synthetic/derivative resource rather than a clinically curated cohort. Future work will extend to the full 100,000-patient dataset to evaluate scalability and external validity. Methods: We implemented a seven-stage pipeline integrating the Synthetic Minority Oversampling Technique (SMOTE) with SHapley Additive exPlanations (SHAP) to address class imbalance and enhance model interpretability. Five machine learning algorithms were comparatively evaluated: Random Forest, Gradient Boosting, Support Vector Machine (SVM), Logistic Regression, and XGBoost, each trained on a stratified random sample of 1500 patient records drawn from the Diabetes Prediction Dataset (n = 100,000). To ensure methodological rigor and prevent data leakage, all preprocessing steps, including SMOTE, were performed within the training folds of a 5-fold stratified cross-validation framework that preserved the original class distribution in each fold. Model performance was assessed using accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, specificity, F1-score, and precision. Statistical significance was determined with McNemar’s test, with p-values adjusted via the Bonferroni correction to control for multiple comparisons. Results: The Random Forest-SMOTE model achieved the best performance, with 96.91% accuracy (95% CI: 95.4–98.2%), AUC of 0.998, sensitivity of 99.5%, and specificity of 97.3%, significantly outperforming recent benchmarks (p < 0.001). SHAP analysis identified glucose (SHAP value: 2.34) and BMI (SHAP value: 1.87) as the primary predictors, in strong concordance with clinical knowledge. Feature interaction analysis revealed synergistic effects between glucose and BMI, providing actionable insights for personalized intervention strategies. Conclusions: The framework bridges algorithmic performance and clinical applicability, achieving high cross-validated performance on the Kaggle dataset (Random Forest: 96.9% accuracy, 0.998 AUC). These results are dataset-specific and should not be interpreted as clinical performance; the study should be regarded as a methodological proof-of-concept rather than a clinically generalizable evaluation. External, prospective validation in real-world cohorts is required before any consideration of clinical deployment, particularly for personalized risk assessment in healthcare systems.
Keywords: diabetes prediction; machine learning; SMOTE; SHAP; explainable AI; Random Forest; clinical decision support; healthcare AI; model interpretability
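
The Methods workflow summarized in the abstract (SMOTE applied only inside the training folds of a stratified 5-fold cross-validation, followed by SHAP attribution of the fitted Random Forest) can be illustrated with a minimal sketch. The snippet below is not the authors' code: the file name diabetes_prediction_dataset.csv, the "diabetes" target column, and the Random Forest hyperparameters are assumptions based on the public Kaggle dataset, and only the best-performing model reported (Random Forest) is shown.

```python
# Minimal sketch of the leak-free SMOTE + cross-validation + SHAP workflow
# described in the Methods. Illustrative only: the file name, the "diabetes"
# target column, and the Random Forest settings are assumptions based on the
# public Kaggle Diabetes Prediction Dataset, not the authors' implementation.
import numpy as np
import pandas as pd
import shap
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # resampling-aware pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("diabetes_prediction_dataset.csv")       # hypothetical path
X = pd.get_dummies(df.drop(columns=["diabetes"]), drop_first=True)
y = df["diabetes"].to_numpy()

# SMOTE sits inside the pipeline, so it is fitted on the training fold only
# and never touches the held-out fold (no data leakage).
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", RandomForestClassifier(n_estimators=300, random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = {"accuracy": [], "auc": [], "sensitivity": [],
          "specificity": [], "f1": [], "precision": []}

for train_idx, test_idx in cv.split(X, y):
    X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]

    pipe.fit(X_tr, y_tr)                 # oversampling happens here only
    pred = pipe.predict(X_te)
    prob = pipe.predict_proba(X_te)[:, 1]

    scores["accuracy"].append(accuracy_score(y_te, pred))
    scores["auc"].append(roc_auc_score(y_te, prob))
    scores["sensitivity"].append(recall_score(y_te, pred))               # recall of class 1
    scores["specificity"].append(recall_score(y_te, pred, pos_label=0))  # recall of class 0
    scores["f1"].append(f1_score(y_te, pred))
    scores["precision"].append(precision_score(y_te, pred))

for metric, values in scores.items():
    print(f"{metric}: {np.mean(values):.3f} +/- {np.std(values):.3f}")

# Global SHAP attributions for the positive (diabetes) class, computed on a
# model refit to the full sample purely for illustration.
pipe.fit(X, y)
explainer = shap.TreeExplainer(pipe.named_steps["clf"])
sv = explainer.shap_values(X)
sv_pos = sv[1] if isinstance(sv, list) else sv[..., 1]    # handles both SHAP output formats
shap.summary_plot(sv_pos, X)
```

The pairwise model comparisons reported in the abstract (McNemar’s test with Bonferroni correction) could be layered on top of the per-fold predictions, for example with statsmodels.stats.contingency_tables.mcnemar applied to the 2x2 agreement table of two models' correct/incorrect predictions.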
