Diabetes Prediction Using Feature Selection Algorithms and Boosting-Based Machine Learning Classifiers
Abstract
1. Introduction
- Developed a preprocessing pipeline comprising mean imputation, outlier removal with the interquartile range (IQR) method, and oversampling, to improve data quality and support reliable model performance.
- Systematically evaluated five feature selection algorithms (Recursive Feature Elimination (RFE), Grey Wolf Optimizer (GWO), Particle Swarm Optimizer (PSO), Genetic Algorithm (GA), and Boruta) and validated feature importance with SHAP analysis, yielding an interpretable and clinically meaningful feature set.
- Demonstrated that LightGBM trained on the Boruta-selected features achieves the best predictive performance among the evaluated combinations (accuracy: 85.16%, F1-score: 85.41%) while reducing the number of features from 8 to 5.
- Reduced model training time by nearly 55% through feature reduction, improving computational efficiency without compromising predictive accuracy.
- Showed that reducing the number of features lowers computational cost and accelerates training and inference, making the framework a cost-effective, practical tool for early diabetes detection and faster clinical decision-making.
- Validated the framework on an independent, recently released clinical dataset (DiaHealth), demonstrating robustness and generalizability and indicating its potential for reliable early diabetes prediction on new clinical datasets with few prior studies.
2. Related Works
3. Methodology
3.1. Dataset Description
3.1.1. Pima Indian Diabetes Dataset (PIDD)
3.1.2. DiaHealth: A Bangladeshi Dataset for Type 2 Diabetes Prediction
3.2. Data Preprocessing
1. The values of each numeric feature are first sorted in ascending order.
2. The first quartile (Q1), the median of the lower half of the data, and the third quartile (Q3), the median of the upper half, are calculated.
3. The IQR is computed as the difference between Q3 and Q1 (IQR = Q3 − Q1).
4. Outliers are defined as data points lying below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
5. These outliers are removed from the dataset to improve data quality.
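
The IQR steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function names and the `rows`/`numeric_keys` layout are hypothetical, and the quartiles follow the median-of-halves definition given in the text (excluding the middle value when the sample size is odd), which can differ slightly from the interpolated quantiles used by libraries such as pandas.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the sorted data."""
    ordered = sorted(values)                      # step 1: ascending order
    half = len(ordered) // 2
    lower = ordered[:half]
    # Assumption: for an odd sample size the middle value belongs to neither half.
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return median(lower), median(upper)           # step 2: Q1 and Q3


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1                                 # step 3: IQR = Q3 - Q1
    return q1 - k * iqr, q3 + k * iqr             # step 4: outlier fences


def remove_outlier_rows(rows, numeric_keys, k=1.5):
    """Drop every row that falls outside the fences in at least one numeric column."""
    bounds = {key: iqr_bounds([row[key] for row in rows], k) for key in numeric_keys}
    return [
        row for row in rows
        if all(bounds[key][0] <= row[key] <= bounds[key][1] for key in numeric_keys)
    ]                                             # step 5: keep only inliers
```

For example, calling `remove_outlier_rows` on a single hypothetical `glucose` column with values 1 to 7 plus 100 gives fences of −3.5 and 12.5, so only the row with value 100 is dropped.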
3.3. Feature Selection
- Using SHAP analysis;
- Using feature selection algorithms.
- SHAP feature selection
- Feature selection algorithms
- Choice of feature selection algorithms and justification
- Recursive Feature Elimination (RFE)
- Genetic Algorithm (GA)
- Particle Swarm Optimization (PSO)
- Grey Wolf Optimizer (GWO)
- Boruta algorithm
- Cross-validation
3.4. Machine Learning Models for Classification
- Light Gradient Boosting Machine (LGBM) algorithm:
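- $\tilde{V}_j(d)$ = estimated variance gain over the subset $A \cup B$ when feature $j$ is split at point $d$, which in the Gradient-based One-Side Sampling (GOSS) formulation of LightGBM is

  $$\tilde{V}_j(d) = \frac{1}{n}\left(\frac{\left(\sum_{x_i \in A_l} g_i + \frac{1-a}{b}\sum_{x_i \in B_l} g_i\right)^2}{n_l^j(d)} + \frac{\left(\sum_{x_i \in A_r} g_i + \frac{1-a}{b}\sum_{x_i \in B_r} g_i\right)^2}{n_r^j(d)}\right);$$

- $A_l = \{x_i \in A : x_{ij} \le d\}$, $A_r = \{x_i \in A : x_{ij} > d\}$;
- $B_l = \{x_i \in B : x_{ij} \le d\}$, $B_r = \{x_i \in B : x_{ij} > d\}$;
- $n$ = total number of training samples, $g_i$ = gradient of the loss for sample $x_i$, and $n_l^j(d)$, $n_r^j(d)$ = number of samples falling to the left and right of the split;
- $A$ = subset of samples with the largest $|g_i|$, $B$ = subset of randomly sampled remaining samples, $a$ = top fraction of samples with the largest gradients, $b$ = fraction of remaining samples randomly sampled, and $\frac{1-a}{b}$ is the coefficient that rescales the gradient sum over $B$.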
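
The sampling behind the subsets A and B can be illustrated with a short sketch. It is a simplified, standard-library rendering of gradient-based one-side sampling under stated assumptions, not LightGBM's internal code: the function name, the `seed` argument, and the returned weight dictionary are hypothetical, and the coefficient applied to B is taken to be (1 − a)/b, as in the original GOSS description.

```python
import random


def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return (A, B, weights) for one gradient-based one-side sampling draw.

    A holds the indices of the top a-fraction of samples by |gradient|;
    B holds b * n indices drawn at random from the remaining samples.
    """
    n = len(gradients)
    # Rank sample indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = int(a * n)
    rand_n = int(b * n)

    subset_a = order[:top_n]
    rng = random.Random(seed)
    subset_b = rng.sample(order[top_n:], rand_n)

    # Samples in A keep weight 1; samples in B are up-weighted by (1 - a) / b
    # so that their gradient sum stands in for all small-gradient samples.
    weights = {i: 1.0 for i in subset_a}
    weights.update({i: (1 - a) / b for i in subset_b})
    return subset_a, subset_b, weights
```

With `a = 0.25` and `b = 0.25`, for instance, each sample in B carries a weight of 3, so a quarter of the data represents the three quarters that were not among the largest gradients.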
- Extreme Gradient Boosting (XGBoost):
3.5. Performance Analysis
- True Positives (TP): Correctly predicted positive cases;
- True Negatives (TN): Correctly predicted negative cases;
- False Positives (FP): Incorrectly predicted positive cases;
- False Negatives (FN): Incorrectly predicted negative cases.
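
The four confusion-matrix counts above are enough to derive the usual evaluation metrics. The sketch below uses the standard textbook definitions for the positive class; the function name and the example counts are hypothetical, and the macro-averaged figures reported in the results tables would additionally average these values over both classes.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, used only to exercise the formulas.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print({name: round(value, 4) for name, value in metrics.items()})
```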
4. Results
4.1. Baseline Model Performance on Raw Dataset
4.2. Feature Selection Strategy
4.2.1. SHAP-Based Feature Importance Analysis
4.2.2. Algorithm-Based Feature Selection
4.2.3. Comparative Analysis of Feature Selection Approaches
- Feature 1 (Glucose)—selected by all algorithms;
- Feature 5 (BMI)—selected by four algorithms;
- Feature 6 (Diabetes Pedigree Function)—selected by four algorithms;
- Feature 0 (Pregnancies)—selected by three algorithms;
- Feature 2 (Blood Pressure)—selected by three algorithms;
- Feature 3 (Skin Thickness)—selected by three algorithms;
- Feature 7 (Age)—selected by three algorithms.
4.3. Final Model Performance
4.4. Validation on an Additional Dataset
- Numeric conversion with imputation for missing values;
- Outlier removal using the interquartile range (IQR) method;
- Class balancing via the Random Over-Sampling method;
- Feature selection using Boruta across 5-fold cross-validation;
- The final model performance was evaluated using LightGBM.
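
Two of the steps listed above, imputation and random over-sampling, can be sketched with the standard library alone. This is an illustrative assumption-laden sketch rather than the authors' code: the helper names are hypothetical, missing values are represented as `None`, and mean imputation is assumed because it is the strategy named in the contributions. Boruta selection and LightGBM training are not shown here.

```python
import random
from statistics import mean


def impute_mean(column):
    """Replace None entries with the mean of the observed values in the column."""
    observed = [value for value in column if value is not None]
    fill = mean(observed)
    return [fill if value is None else value for value in column]


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample with replacement from the class to fill the gap to the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels
```

Because over-sampling creates exact duplicates, a common precaution is to apply it to the training split only, so that copies of the same record do not appear on both sides of a train/test split.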
5. Discussion, Limitation, and Future Work
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| RFE | Recursive Feature Elimination |
| GWO | Grey Wolf Optimizer |
| PSO | Particle Swarm Optimizer |
| GA | Genetic Algorithm |
| CV | Cross-Validation |
| LGBM | Light Gradient Boosting Machine Algorithm |
| XGBoost | Extreme Gradient Boosting Algorithm |
References
- Iftikhar, K.; Javaid, N.; Ahmed, I.; Alrajeh, N. A Novel Explainable Deep Learning Framework for Accurate Diabetes Mellitus Prediction. Appl. Sci. 2025, 15, 9162.
- Ganie, S.M.; Pramanik, P.K.D.; Malik, M.B.; Mallik, S.; Qin, H. An ensemble learning approach for diabetes prediction using boosting techniques. Front. Genet. 2023, 14, 1252159.
- Sai, M.J.; Chettri, P.; Panigrahi, R.; Garg, A.; Bhoi, A.K.; Barsocchi, P. An Ensemble of Light Gradient Boosting Machine and Adaptive Boosting for Prediction of Type-2 Diabetes. Int. J. Comput. Intell. Syst. 2023, 16, 14.
- Kopitar, L.; Kocbek, P.; Cilar, L.; Sheikh, A.; Stiglic, G. Early detection of type 2 diabetes mellitus using machine learning-based prediction models. Sci. Rep. 2020, 10, 11981.
- Fang, M.; Wang, D.; Coresh, J.; Selvin, E. Trends in Diabetes Treatment and Control in U.S. Adults, 1999–2018. N. Engl. J. Med. 2021, 384, 2219–2228.
- Centers for Disease Control and Prevention (CDC). National Center for Chronic Disease Prevention and Health Promotion, Division of Population Health. BRFSS Prevalence & Trends Data. 2018. Available online: https://www.cdc.gov/brfss/brfssprevalence (accessed on 16 October 2025).
- World Health Organization (WHO). Diabetes. 2024. Available online: https://www.who.int/news-room/fact-sheets/detail/diabetes (accessed on 16 October 2025).
- International Diabetes Federation (IDF). Facts & Figures. 2024. Available online: https://idf.org/about-diabetes/diabetes-facts-figures/ (accessed on 16 October 2025).
- Chowdhury, M.M.; Ayon, R.S.; Hossain, M.S. Diabetes Diagnosis through Machine Learning: Investigating Algorithms and Data Augmentation for Class Imbalanced BRFSS Dataset. medRxiv 2023.
- Zhang, Z. Comparison of Machine Learning Models for Predicting Type 2 Diabetes Risk Using the Pima Indians Diabetes Dataset. J. Innov. Med. Res. 2025, 4, 65–71.
- Salih, M.S. Diabetic Prediction based on Machine Learning Using PIMA Indian Dataset. Commun. Appl. Nonlinear Anal. 2024, 31, 138–156.
- Shuvo, M.H.; Ahmed, N.; Islam, H.; Alaboud, K.; Cheng, J.; Mosa, A.S.M.; Islam, S.K. Machine Learning Embedded Smartphone Application for Early-Stage Diabetes Risk Assessment. In Proceedings of the 2022 IEEE International Symposium on Medical Measurements and Applications (MeMeA), Messina, Italy, 22–24 June 2022; pp. 1–6.
- Ganie, S.M.; Malik, M.B. Comparative analysis of various supervised machine learning algorithms for the early prediction of type-II diabetes mellitus. Int. J. Med. Eng. Inform. 2022, 14, 473.
- Chang, V.; Bailey, J.; Xu, Q.A.; Sun, Z. Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms. Neural Comput. Appl. 2022, 35, 16157–16173.
- Chaki, J.; Ganesh, S.T.; Cidham, S.; Theertan, S.A. Machine learning and artificial intelligence based Diabetes Mellitus detection and self-management: A systematic review. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 3204–3225.
- Perveen, S.; Shahbaz, M.; Keshavjee, K.; Guergachi, A. Metabolic Syndrome and Development of Diabetes Mellitus: Predictive Modeling Based on Machine Learning Techniques. IEEE Access 2018, 7, 1365–1375.
- Sisodia, D.; Sisodia, D.S. Prediction of Diabetes using Classification Algorithms. Procedia Comput. Sci. 2018, 132, 1578–1585.
- Benbelkacem, S.; Atmani, B. Random Forests for Diabetes Diagnosis. In Proceedings of the 2019 International Conference on Computer and Information Sciences (ICCIS), Sakaka, Saudi Arabia, 3–4 April 2019; pp. 1–4.
- Sneha, N.; Gangil, T. Analysis of diabetes mellitus for early prediction using optimal features selection. J. Big Data 2019, 6, 13.
- Bania, R.K.; Halder, A. R-HEFS: Rough set based heterogeneous ensemble feature selection method for medical data classification. Artif. Intell. Med. 2021, 114, 102049.
- Li, M.; Fu, X.; Li, D. Diabetes Prediction Based on XGBoost Algorithm. IOP Conf. Ser. Mater. Sci. Eng. 2020, 768, 072093.
- Roshi, S.; Sharma, S.K.; Gupta, M. Role of K-nearest neighbour in detection of Diabetes Mellitus. Turk. J. Comput. Math. Educ. 2021, 12, 373–376.
- Ismail, L.; Materwala, H.; Tayefi, M.; Ngo, P.; Karduck, A.P. Type 2 Diabetes with Artificial Intelligence Machine Learning: Methods and Evaluation. Arch. Comput. Methods Eng. 2022, 29, 313–333.
- Tasin, I.; Nabil, T.U.; Islam, S.; Khan, R. Diabetes prediction using machine learning and explainable AI techniques. Health Technol. Lett. 2023, 10, 1–10.
- Iparraguirre-Villanueva, O.; Espinola-Linares, K.; Castañeda, R.O.F.; Cabanillas-Carbonell, M. Application of Machine Learning Models for Early Detection and Accurate Classification of Type 2 Diabetes. Diagnostics 2023, 13, 2383.
- Ahmed, A.; Khan, J.; Arsalan, M.; Ahmed, K.; Shahat, A.A.; Alhalmi, A.; Naaz, S. Machine Learning Algorithm-Based Prediction of Diabetes Among Female Population Using PIMA Dataset. Healthcare 2024, 13, 37.
- Mansouri, S.; Boulares, S.; Chabchoub, S. Machine Learning for Early Diabetes Detection and Diagnosis. J. Wirel. Mob. Networks, Ubiquitous Comput. Dependable Appl. 2024, 15, 216–230.
- Rezki, M.K.; Mazdadi, M.I.; Indriani, F.; Muliadi, M.; Saragih, T.H.; Athavale, V.A. Application Of SMOTE To Address Class Imbalance In Diabetes Disease Classification Utilizing C5.0, Random Forest, And SVM. J. Electron. Électroméd. Eng. Med. Inform. 2024, 6, 343–354.
- Joseph, L.P.; Joseph, E.A.; Prasad, R. Diabetes Datasets. Mendeley Data V1. 2022. Available online: https://doi.org/10.17632/7zcc8v6hvp.1 (accessed on 16 October 2025).
- Scott, D.W. Histogram. WIREs Comput. Stat. 2010, 2, 44–48.
- Asuero, A.G.; Sayago, A.; González, A.G. The Correlation Coefficient: An Overview. Crit. Rev. Anal. Chem. 2006, 36, 41–59.
- Prama, T.T.; Zaman, M.; Sarker, F.; Mamun, K.A. DiaHealth: A Bangladeshi Dataset for Type 2 Diabetes Prediction. Mendeley Data V1. 2024. Available online: https://doi.org/10.17632/7m7555vgrn.1 (accessed on 16 October 2025).
- García, S.; Luengo, J.; Herrera, F. Data Preprocessing in Data Mining; Springer International Publishing: Cham, Switzerland, 2015; Volume 72.
- Alabrah, A. An Improved CCF Detector to Handle the Problem of Class Imbalance with Outlier Normalization Using IQR Method. Sensors 2023, 23, 4406.
- Joseph, V.R. Optimal ratio for data splitting. Stat. Anal. Data Min. ASA Data Sci. J. 2022, 15, 531–538.
- Marcilio, W.E.; Eler, D.M. From explanations to feature selection: Assessing SHAP values as feature selection mechanism. In Proceedings of the 2020 33rd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Virtual, 7–10 November 2020; pp. 340–347.
- Agrawal, P.; Abutarboush, H.F.; Ganesh, T.; Mohamed, A.W. Metaheuristic Algorithms on Feature Selection: A Survey of One Decade of Research (2009-2019). IEEE Access 2021, 9, 26766–26791.
- Tan, Y.; Chen, H.; Zhang, J.; Tang, R.; Liu, P. Early Risk Prediction of Diabetes Based on GA-Stacking. Appl. Sci. 2022, 12, 632.
- Azad, C.; Bhushan, B.; Sharma, R.; Shankar, A.; Singh, K.K.; Khamparia, A. Prediction model using SMOTE, genetic algorithm and decision tree (PMSGD) for classification of diabetes mellitus. Multimedia Syst. 2022, 28, 1289–1307.
- Choubey, D.K.; Kumar, P.; Tripathi, S.; Kumar, S. Performance evaluation of classification methods with PCA and PSO for diabetes. Netw. Model. Anal. Health Inform. Bioinform. 2020, 9, 5.
- Le, T.M.; Vo, T.M.; Pham, T.N.; Dao, S.V.T. A Novel Wrapper–Based Feature Selection for Early Diabetes Prediction Enhanced With a Metaheuristic. IEEE Access 2021, 9, 7869–7884.
- Sabitha, E.; Durgadevi, M. Improving the Diabetes Diagnosis Prediction Rate Using Data Preprocessing, Data Augmentation and Recursive Feature Elimination Method. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 529–536.
- Zhou, H.; Xin, Y.; Li, S. A diabetes prediction model based on Boruta feature selection and ensemble learning. BMC Bioinform. 2023, 24, 1–34.
- Brownlee, J. Recursive Feature Elimination (RFE) for Feature Selection in Python. Mach. Learn. Mastery 2020. Available online: https://machinelearningmastery.com/rfe-feature-selection-in-python/ (accessed on 16 October 2025).
- GeeksforGeeks. Genetic Algorithms. 2024. Available online: https://www.geeksforgeeks.org/genetic-algorithms/ (accessed on 16 October 2025).
- GeeksforGeeks. Particle Swarm Optimization (PSO)—An Overview. 2025. Available online: https://www.geeksforgeeks.org/particle-swarm-optimization-pso-an-overview/ (accessed on 16 October 2025).
- GeeksforGeeks. Grey Wolf Optimization—Introduction. 2025. Available online: https://www.geeksforgeeks.org/machine-learning/grey-wolf-optimization-introduction/ (accessed on 16 October 2025).
- Shivam. Boruta Feature Selection in R. GeeksforGeeks. 2021. Available online: https://www.geeksforgeeks.org/boruta-feature-selection-in-r/ (accessed on 16 October 2025).
- Berrar, D. Cross-Validation. In Encyclopedia of Bioinformatics and Computational Biology; Elsevier: Amsterdam, The Netherlands, 2019; pp. 542–545.
- Rufo, D.D.; Debelee, T.G.; Ibenthal, A.; Negera, W.G. Diagnosis of Diabetes Mellitus Using Gradient Boosting Machine (LightGBM). Diagnostics 2021, 11, 1714.
- Nagassou, M.; Mwangi, R.W.; Nyarige, E. A Hybrid Ensemble Learning Approach Utilizing Light Gradient Boosting Machine and Category Boosting Model for Lifestyle-Based Prediction of Type-II Diabetes Mellitus. J. Data Anal. Inf. Process. 2023, 11, 480–511.
- Noviandy, T.R.; Nainggolan, S.I.; Raihan, R.; Firmansyah, I.; Idroes, R. Maternal Health Risk Detection Using Light Gradient Boosting Machine Approach. Infolitika J. Data Sci. 2023, 1, 48–55.
- Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
- Zafar, M.M.; Khan, Z.A.; Javaid, N.; Aslam, M.; Alrajeh, N. From Data to Diagnosis: A Novel Deep Learning Model for Early and Accurate Diabetes Prediction. Healthcare 2025, 13, 2138.

| Ref. No. | Dataset | Method | Key Findings | No. of Features Used (N.F.U.) | Year |
|---|---|---|---|---|---|
| [16] | Canadian Primary Care Sentinel Surveillance Network (CPCSSN) | DT, NB | NB with K-medoids undersampling outperformed random undersampling, oversampling, and no sampling, reaching a receiver operating characteristic (ROC) performance of 79%. | -- | 2018 |
| [17] | PIDD | DT, SVM, and NB | Naïve Bayes achieved the highest accuracy (76.30%) among the compared algorithms. | 8/8 | 2018 |
| [18] | PIDD | RF | Random forest performed best, with an error rate of 0.21 at 40 trees. | 8/8 | 2019 |
| [19] | UCI Diabetes Dataset | SVM, RF, NB, DT, and KNN | Naïve Bayes gave the best accuracy (82.30%). | 11/15 | 2019 |
| [20] | PIDD | Parameter-free greedy ensemble approach and RF | Accuracy of 73.04% under tenfold cross-validation. | 3/8 | 2021 |
| [21] | PIDD | Feature extraction with XGBoost | The improved XGBoost algorithm with feature combination reached 80.2%. | F. E. | 2020 |
| [22] | PIDD | Ensemble of multiple classifiers, including DT, NB, KNN, LR, etc. | The proposed algorithm raised accuracy from 70.1% to 78.58%, an increase of 8.48 percentage points. | 8/8 | 2021 |
| [23] | PIDD | K-means feature selection and Bagging (LR) | Accuracy of 82.00% under tenfold cross-validation. | 8/8 | 2022 |
| [24] | PIDD and RTML | XGBoost, RF, SVM, KNN, LR, DT, AdaBoost, Bagging, Voting | XGBoost with ADASYN achieved the highest accuracy of 81% (AUC 0.84) on the merged dataset; 96% accuracy on the private RTML dataset with domain adaptation. Mutual information identified Glucose, BMI, Age, and Insulin as key features. | 8/8 | 2022 |
| [25] | PIDD | KNN, BNB | KNN achieved the highest accuracy in detecting diabetes (79.6%), surpassing the BNB model’s 77.2%. | 8/8 | 2023 |
| [26] | PIDD | RF, DT, NB, and LR | RF performed best, with an accuracy of 80%, precision of 82%, error rate of 20%, and sensitivity of 88%, compared with DT, NB, and LR. | 8/8 | 2024 |
| [27] | PIDD | KNN | The KNN model achieved an accuracy of 76%, with a precision of 0.80, a recall of 0.85, and an F1-score of 0.83 on the test set (support: 167). | 8/8 | 2024 |
| [28] | PIDD | SMOTE with C5.0, RF, SVM | SMOTE had minimal impact across the three classification models, possibly because of overfitting on the dataset. | 8/8 | 2024 |
| Feature No. | Feature Name | Count | Min | Max | Mean | Std Dev |
|---|---|---|---|---|---|---|
| 0 | Pregnancies | 768 | 0.0 | 17.0 | 3.84 | 3.36 |
| 1 | Glucose | 768 | 0.0 | 199.0 | 120.89 | 31.97 |
| 2 | BloodPressure | 768 | 0.0 | 122.0 | 69.10 | 19.35 |
| 3 | SkinThickness | 768 | 0.0 | 99.0 | 20.53 | 15.95 |
| 4 | Insulin | 768 | 0.0 | 846.0 | 79.79 | 115.24 |
| 5 | BMI | 768 | 0.0 | 67.1 | 31.99 | 7.88 |
| 6 | DiabetesPedigreeFunction | 768 | 0.078 | 2.42 | 0.47 | 0.33 |
| 7 | Age | 768 | 21.0 | 81.0 | 33.24 | 11.76 |
| Feature No. | Feature Name | Description | Data Type | 
|---|---|---|---|
| 0 | age | Patient’s age in years | int64 | 
| 1 | gender | Patient’s biological indicator | object | 
| 2 | pulse_rate | Heart rate | int64 | 
| 3 | systolic_bp | Systolic blood pressure | int64 | 
| 4 | diastolic_bp | Diastolic blood pressure | int64 | 
| 5 | glucose | Blood glucose concentration | float64 | 
| 6 | height | Patient’s height | float64 | 
| 7 | weight | Patient’s weight | float64 | 
| 8 | bmi | Patient’s body mass index | float64 | 
| 9 | family_diabetes | Family history of diabetes | int64 | 
| 10 | hypertensive | Presence of hypertension | int64 | 
| 11 | family_hypertension | Family history of hypertension | int64 | 
| 12 | cardiovascular_disease | Presence of cardiovascular diseases | int64 | 
| 13 | stroke | History of stroke | int64 | 
| 14 | diabetic | Target variable | object | 
| Feature Name | Missing Values per Column | 
|---|---|
| Pregnancies | 0 | 
| Glucose | 5 | 
| BloodPressure | 35 | 
| SkinThickness | 227 | 
| Insulin | 374 | 
| BMI | 11 | 
| DiabetesPedigreeFunction | 0 | 
| Age | 0 | 
| Model | Accuracy (Baseline) | Accuracy (After Preprocessing) | Recall (Baseline) | Recall (After Preprocessing) | F1-Score (Baseline) | F1-Score (After Preprocessing) | 
|---|---|---|---|---|---|---|
| XGBoost | 0.7532 | 0.8242 | 0.5679 | 0.8242 | 0.6174 | 0.8242 | 
| LightGBM | 0.7359 | 0.8242 | 0.5556 | 0.8352 | 0.5960 | 0.8253 | 
| Algorithm | Hyperparameters | 
|---|---|
| Boruta | n_estimators = auto, random_state = 42, cross-validation = 5-fold StratifiedKFold | 
| RFE | step = 1, cv = 5-fold, min_features_to_select = 1 | 
| GA | population_size = 20, generations = 15, mutation_rate = 0.01, selection = tournament (size = 3), crossover = single-point, cv = 5-fold | 
| PSO | population = 10, max_iter = 15, w = 0.72, c1 = 1.5, c2 = 1.5, encoding = binary (0/1), cv = 5-fold | 
| GWO | population = 10, max_iter = 15, selection threshold > 0.5, fitness = accuracy, cv = 5-fold | 
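
The GA row of the table above names its operators (tournament selection of size 3, single-point crossover, mutation rate 0.01), and these can be sketched as follows. This is a minimal standard-library illustration, not the authors' implementation: the function names are hypothetical, each individual is assumed to be a binary mask over the candidate features, and the fitness values are passed in as a plain list, whereas the study would obtain them from 5-fold cross-validation.

```python
import random

# Settings taken from the GA row of the hyperparameter table.
POPULATION_SIZE = 20
GENERATIONS = 15
MUTATION_RATE = 0.01
TOURNAMENT_SIZE = 3


def tournament_select(population, fitness, rng, k=TOURNAMENT_SIZE):
    """Pick k individuals at random and return the fittest of them."""
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitness[i])
    return population[best]


def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two parent masks at one random cut point."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]


def mutate(mask, rng, rate=MUTATION_RATE):
    """Flip each bit of the feature mask independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in mask]


def next_generation(population, fitness, rng):
    """Build a new population of the same size by selection, crossover, and mutation."""
    children = []
    while len(children) < len(population):
        parent_a = tournament_select(population, fitness, rng)
        parent_b = tournament_select(population, fitness, rng)
        for child in single_point_crossover(parent_a, parent_b, rng):
            children.append(mutate(child, rng))
    return children[:len(population)]
```

A full run would call `next_generation` once per generation (15 times under the table's settings), re-evaluating the fitness of every mask in between.
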
| Algorithm | Avg Accuracy | Avg F1-Score | Avg Recall | Number of Features (Most Frequent) | Most Frequent Feature Indices | 
|---|---|---|---|---|---|
| Boruta | 0.7964 | 0.8058 | 0.8478 | 5 | [1, 3, 5, 6, 7] | 
| RFE | 0.7169 | 0.7079 | 0.6952 | 5 | [0, 1, 2, 5, 6] | 
| GA | 0.8179 | 0.8235 | 0.8546 | 7 | [0, 1, 2, 3, 4, 6, 7] | 
| PSO | 0.7981 | 0.8037 | 0.8313 | 6 | [0, 1, 2, 3, 5, 7] | 
| GWO | 0.7948 | 0.8032 | 0.8445 | 6 | [1, 2, 3, 5, 6, 7] | 
| Feature Selection Method | No. of Features | Model | Accuracy (%) | Precision (Macro Avg) | Recall (Macro Avg) | F1-Score (Macro Avg) |
|---|---|---|---|---|---|---|
| Boruta | 5 | XGBoost | 84.07 | 0.8370 | 0.8462 | 0.8415 |
| Boruta | 5 | LightGBM | 85.16 | 0.8404 | 0.8681 | 0.8541 |
| RFECV | 5 | XGBoost | 81.32 | 0.8138 | 0.8132 | 0.8131 |
| RFECV | 5 | LightGBM | 82.42 | 0.8244 | 0.8242 | 0.8242 |
| GA | 7 | XGBoost | 80.77 | 0.8111 | 0.8022 | 0.8066 |
| GA | 7 | LightGBM | 82.97 | 0.8191 | 0.8462 | 0.8324 |
| PSO | 6 | XGBoost | 76.26 | 0.7400 | 0.7100 | 0.7200 |
| PSO | 6 | LightGBM | 77.70 | 0.7600 | 0.7300 | 0.7400 |
| GWO | 6 | XGBoost | 76.26 | 0.7400 | 0.7100 | 0.7200 |
| GWO | 6 | LightGBM | 77.70 | 0.7600 | 0.7300 | 0.7400 |
| Model | Features Used | Preprocessing | Accuracy (%) | F1-Score | Training Time (s) | 
|---|---|---|---|---|---|
| XGBoost | 8 (all) | NO | 75.32 | 0.6174 | 0.2246 | 
| LightGBM | 8 (all) | NO | 73.59 | 0.5960 | 0.0564 | 
| XGBoost | 5 (Boruta) | YES | 84.07 | 0.8415 | 0.0406 | 
| LightGBM | 5 (Boruta) | YES | 85.16 | 0.8541 | 0.0254 | 
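
The training-time saving implied by the table above can be checked with a short calculation. The times are copied from the table; the helper name is hypothetical. Note that the compared rows differ in preprocessing as well as in feature count, so the reduction reflects both changes.

```python
def relative_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100.0


# Training times in seconds: (8 features, no preprocessing) vs. (5 Boruta features, preprocessed).
TRAINING_TIME_S = {
    "XGBoost": (0.2246, 0.0406),
    "LightGBM": (0.0564, 0.0254),
}

for model, (baseline, reduced) in TRAINING_TIME_S.items():
    print(f"{model}: {relative_reduction(baseline, reduced):.1f}% less training time")
# XGBoost: 81.9% less training time
# LightGBM: 55.0% less training time
```

The LightGBM figure matches the reduction of nearly 55% stated in the contributions.
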
| Model | Framework Used | No. of Features Used (U.F.N.) | Accuracy | Recall | F1-Score |
|---|---|---|---|---|---|
| [54] | TIPNet Deep Model | 14 | 87% | 88% | 86% |
| Baseline | LGBM classifier on raw (no preprocessing) dataset | 14 | 94.49% | 29.13% | 40.00% |
| Final | LGBM classifier on Boruta-selected features from preprocessed dataset | 10 | 99.39% | 100.00% | 99.39% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Rahman, F.; Hossain, S.; Tiang, J.-J.; Nahid, A.-A. Diabetes Prediction Using Feature Selection Algorithms and Boosting-Based Machine Learning Classifiers. Diagnostics 2025, 15, 2622. https://doi.org/10.3390/diagnostics15202622