Future Internet
  • This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
  • Article
  • Open Access

10 November 2025

Machine Learning Pipeline for Early Diabetes Detection: A Comparative Study with Explainable AI

1 Department of Management, Banking and Commodity Sciences, Sapienza University, 00161 Rome, Italy
2 Mathematics, Physics and Applications to Engineering Department, Università Degli Studi Della Campania “Luigi Vanvitelli”, Viale Lincoln n°5, 81100 Caserta, Italy
3 Unidata S.p.A., Viale A. G. Eiffel, 00148 Roma, Italy
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue The Future Internet of Medical Things, 3rd Edition

Abstract

The use of Artificial Intelligence (AI) in healthcare has significantly advanced early disease detection, enabling timely diagnosis and improved patient outcomes. This work proposes an end-to-end machine learning (ML) pipeline for predicting diabetes that emphasizes data quality through key steps including advanced preprocessing with KNN imputation, intelligent feature selection, class-imbalance handling with the hybrid SMOTEENN approach, and multi-model classification. We rigorously compared nine ML classifiers, including ensemble approaches (Random Forest, CatBoost, XGBoost), Support Vector Machines (SVM), and Logistic Regression (LR), for diabetes prediction. We evaluated performance in terms of specificity, accuracy, recall, precision, and F1-score to assess generalizability and robustness. We employed SHapley Additive exPlanations (SHAP) for explainability, to rank and identify the most influential clinical risk factors. SHAP analysis identified glucose level as the dominant predictor, followed by BMI and age, providing clinically interpretable risk factors that align with established medical knowledge. The results indicate that ensemble models outperformed the other classifiers; CatBoost performed best, achieving an ROC-AUC of 0.972, an accuracy of 0.968, and an F1-score of 0.971. The model was successfully validated on two larger datasets (CDC BRFSS and a 130-hospital dataset), confirming its generalizability. This data-driven design provides a reproducible platform for applying useful and interpretable ML models in clinical practice, with future Internet-of-Things-based smart healthcare systems as a primary application.
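As an illustration of the evaluation metrics named in the abstract, the following minimal sketch derives accuracy, precision, recall, specificity, and F1-score from the confusion-matrix counts of a binary classifier. It is not the authors' code: the function name `binary_metrics`, the `BinaryMetrics` container, and the toy labels are assumptions made for this example, and the labels are not data from the study.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float       # sensitivity, true positive rate
    specificity: float  # true negative rate
    f1: float


def _safe_div(num: float, den: float) -> float:
    # Return 0.0 instead of raising when a denominator is zero.
    return num / den if den else 0.0


def binary_metrics(y_true, y_pred) -> BinaryMetrics:
    """Compute metrics for labels coded as 1 (diabetic) and 0 (non-diabetic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)
    return BinaryMetrics(
        accuracy=_safe_div(tp + tn, len(pairs)),
        precision=precision,
        recall=recall,
        specificity=_safe_div(tn, tn + fp),
        f1=_safe_div(2 * precision * recall, precision + recall),
    )


# Toy labels, purely illustrative (hypothetical, not from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting specificity alongside recall is useful in a screening setting because accuracy alone can hide poor performance on one class when the classes are imbalanced.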
