Machine Learning Pipeline for Early Diabetes Detection: A Comparative Study with Explainable AI

Yas Barzegar; Atrin Barzegar; Francesco Bellini; Fabrizio D’Ascenzo; Irina Gorelova; Patrizio Pisani

doi:10.3390/fi17110513

¹ Department of Management, Banking and Commodity Sciences, Sapienza University, 00161 Rome, Italy

² Mathematics, Physics and Applications to Engineering Department, Università Degli Studi Della Campania “Luigi Vanvitelli”, Viale Lincoln n°5, 81100 Caserta, Italy

³ Unidata S.p.A., Viale A. G. Eiffel, 00148 Roma, Italy

* Author to whom correspondence should be addressed.

Future Internet 2025, 17(11), 513; https://doi.org/10.3390/fi17110513

This article belongs to the Special Issue The Future Internet of Medical Things, 3rd Edition

Abstract

The use of Artificial Intelligence (AI) in healthcare has significantly advanced early disease detection, enabling timely diagnosis and improved patient outcomes. This work proposes an end-to-end, data-quality-driven machine learning (ML) pipeline for predicting diabetes. Its key steps are advanced preprocessing with K-nearest-neighbors (KNN) imputation, intelligent feature selection, class-imbalance handling with the hybrid SMOTEENN resampling method, and multi-model classification. We rigorously compared nine ML classifiers, including ensemble approaches (Random Forest, CatBoost, XGBoost), Support Vector Machines (SVM), and Logistic Regression (LR). We evaluated performance using specificity, accuracy, recall, precision, and F1-score to assess generalizability and robustness. We employed SHapley Additive exPlanations (SHAP) for explainability, using it to identify and rank the most influential clinical risk factors. SHAP analysis identified glucose level as the dominant predictor, followed by BMI and age, yielding clinically interpretable risk factors that align with established medical knowledge. The results indicate that ensemble models outperformed the other classifiers; CatBoost performed best, achieving an ROC-AUC of 0.972, an accuracy of 0.968, and an F1-score of 0.971. The model was also successfully validated on two larger datasets (CDC BRFSS and a 130-hospital dataset), confirming its generalizability. This data-driven design provides a reproducible platform for applying useful and interpretable ML models in clinical practice, with future Internet-of-Things-based smart healthcare systems as a primary application.
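As a minimal sketch of how the threshold-based metrics named in the abstract (specificity, accuracy, recall, precision, and F1-score) follow from confusion-matrix counts, the Python snippet below computes them for a binary diabetic/non-diabetic labeling. The names `ConfusionCounts`, `confusion_counts`, and `classification_metrics`, as well as the toy labels, are illustrative assumptions and not the authors' code or data. ROC-AUC is left out because it requires predicted scores rather than hard labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a binary problem (hypothetical helper, positive = diabetic)."""
    tp: int  # diabetic, predicted diabetic
    fp: int  # non-diabetic, predicted diabetic
    tn: int  # non-diabetic, predicted non-diabetic
    fn: int  # diabetic, predicted non-diabetic


def confusion_counts(y_true, y_pred, positive=1):
    """Tally the four confusion-matrix cells from paired labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionCounts(tp=tp, fp=fp, tn=tn, fn=fn)


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(c):
    """Compute the five label-based metrics listed in the abstract."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)  # also called sensitivity
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision": precision,
        "recall": recall,
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "f1": _ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # Toy labels for illustration only; these are not results from the study.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    for name, value in classification_metrics(confusion_counts(y_true, y_pred)).items():
        print(f"{name}: {value:.3f}")
```

Reporting specificity alongside recall matters in a screening setting, because a classifier tuned on resampled data can gain recall on the diabetic class while producing more false alarms on the non-diabetic class.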

Keywords:

AI; Smart Healthcare; ML; Diagnosis; Hybrid Resampling; Interpretability; Feature Selection; Future Internet
