- Article
Towards Robust Risk-Based Screening of Early-Stage Diabetes: Machine Learning Models with Union Features Selection and External Validation
- Pasa Sukson,
- Watcharaporn Cholamjiak and
- Nontawat Eiamniran
- + 1 author
Background/Objectives: Early-stage diabetes often presents with subtle symptoms, making timely screening challenging. This study aimed to develop an interpretable and robust machine learning framework for early-stage diabetes risk prediction using integrated statistical and machine learning–based feature selection, and to evaluate its generalizability using real-world hospital data.
Methods: A Union Feature Selection approach was constructed by combining logistic regression significance testing with ReliefF and MRMR feature importance scores. Five machine learning models (Decision Tree, Naïve Bayes, SVM, KNN, and Neural Network) were trained on the UCI Early Stage Diabetes dataset (N = 520) under multiple feature-selection scenarios. External validation was performed using retrospective hospital records from the University of Phayao (N = 60). Model performance was assessed using accuracy, precision, recall, and F1-score.
Results: The union feature-selection approach identified four core predictors (polyuria, polydipsia, gender, and irritability), with additional secondary features providing only marginal improvements. Among the evaluated models, Naïve Bayes demonstrated the most stable external performance, achieving 85% test accuracy with balanced precision, recall, and F1-score, along with a moderate AUC of 0.838, indicating reliable discriminative ability in real-world hospital data. In contrast, the SVM, KNN, and Neural Network models exhibited very high internal validation performance (>96%) under optimally selected ML features but declined markedly during external validation, highlighting their sensitivity to distributional shifts between public and clinical datasets.
Conclusions: The combined statistical–ML feature selection method improved interpretability and stability in early-stage diabetes prediction. Naïve Bayes demonstrated the strongest generalizability and is well suited for real-world screening applications. The findings support the use of integrated feature selection to develop efficient and clinically relevant risk assessment tools.
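The union step described in the Methods can be illustrated with a minimal Python sketch. It assumes that each of the three criteria (logistic regression p-values, ReliefF scores, MRMR scores) has already been computed elsewhere. The function names, the significance threshold `alpha`, the top-`k` cutoff, the agreement count, and every numeric value below are hypothetical placeholders chosen for illustration; they are not taken from the study, and the authors' actual selection rule may differ.

```python
# Illustrative sketch only. All p-values and scores are made-up placeholders,
# NOT results from the study; alpha and k are assumed thresholds.

def top_k(scores, k):
    """Return the set of k feature names with the highest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def union_feature_selection(p_values, relieff_scores, mrmr_scores,
                            alpha=0.05, k=3):
    """Union of features chosen by any of the three criteria.

    Also returns, for each selected feature, how many criteria chose it
    (an added convenience, not something the abstract describes).
    """
    significant = {f for f, p in p_values.items() if p < alpha}
    relieff_top = top_k(relieff_scores, k)
    mrmr_top = top_k(mrmr_scores, k)

    selected = significant | relieff_top | mrmr_top
    votes = {
        f: (f in significant) + (f in relieff_top) + (f in mrmr_top)
        for f in selected
    }
    return sorted(selected), votes


# Placeholder inputs keyed by feature name (values are hypothetical).
p_values = {"polyuria": 0.001, "polydipsia": 0.002, "gender": 0.010,
            "irritability": 0.030, "itching": 0.400, "obesity": 0.700}
relieff_scores = {"polyuria": 0.30, "polydipsia": 0.28, "itching": 0.15,
                  "gender": 0.12, "irritability": 0.05, "obesity": 0.01}
mrmr_scores = {"polyuria": 0.25, "polydipsia": 0.20, "irritability": 0.10,
               "gender": 0.09, "obesity": 0.03, "itching": 0.02}

selected, votes = union_feature_selection(p_values, relieff_scores, mrmr_scores)
print(selected)
print(votes)
```

Taking the union rather than the intersection keeps any feature that at least one criterion considers informative. The agreement count then gives one possible way to separate features favoured by every method from those supported by only one.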
26 December 2025




