A Comparative Analysis of Machine Learning Models for the Detection of Undiagnosed Diabetes Patients
Abstract
1. Introduction
2. Methods
2.1. Data Source
2.2. End Points
2.3. Variables and Selection
2.4. Model Development
2.4.1. Random Forest Model
2.4.2. AdaBoost
2.4.3. RUSBoost
2.4.4. LogitBoost
2.4.5. Neural Network
2.5. Model Assessment
3. Results
4. Discussion
4.1. Comparison to Other Related Work
4.2. Strengths and Limitations
4.3. Future Directions
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
References
Characteristic | No Diabetes | Prediabetes | Undiagnosed Diabetes | Diabetes | Significance p < 0.05 |
---|---|---|---|---|---|
n | 29,806 | 9556 | 1297 | 4772 | |
Age, years | 36 (19.3) | 54.2 (18.4) | 57.7 (15) | 61.9 (13.9) | NPD |
Male, % | 48.1 | 50 | 52.5 | 50.5 | N |
BMI, kg/m² | 26.8 (6.5) | 30.1 (7.2) | 33.1 (7.8) | 32.4 (7.7) | NPD |
Height, cm | 166.9 (10.1) | 166.1 (10.2) | 166.1 (9.9) | 165.6 (10.7) | N |
Weight, kg | 75.1 (20.6) | 83.5 (22.5) | 91.8 (24.2) | 89.5 (24.5) | NPD |
Systolic BP, mmHg | 117.3 (16.5) | 127.5 (19) | 134.1 (20.9) | 131.6 (20.5) | NPD |
Diastolic BP, mmHg | 67 (13.5) | 70.4 (14.4) | 72.3 (15.3) | 67.7 (14.6) | NPD |
Smoking, % | 12.6 | 16.5 | 17.3 | 11.7 | ND |
Physically active, % | 36.1 | 18.7 | 10.7 | 8.5 | NPD |
Drinking alcohol, days/year | 12.6 (53.2) | 10.9 (51.9) | 9 (53.8) | 10 (54) | N |
Family income to poverty ratio | 2.5 (1.6) | 2.4 (1.6) | 2.2 (1.5) | 2.3 (1.5) | NP |
Sleep, h | 7.2 (3.1) | 7 (1.6) | 7 (3.1) | 7.4 (4.8) | ND |
Hispanic-Mexican American, % | 18.3 | 15.3 | 21.6 | 17.2 | NPD |
Hispanic-Other Hispanic, % | 9.4 | 9.9 | 10.4 | 9.9 | |
Non-Hispanic White, % | 41.9 | 35.5 | 27.4 | 34.3 | NPD |
Non-Hispanic Black, % | 19 | 28.2 | 28.5 | 27.6 | N |
Model | ROC AUC | Sensitivity | Specificity | PPV | NPV |
---|---|---|---|---|---|
Undiagnosed diabetes | | | | | |
RF | 0.786 (0.765; 0.810) | 0.855 | 0.603 | 0.083 | 0.99 |
AdaBoost | 0.776 (0.750; 0.797) | 0.742 | 0.674 | 0.087 | 0.984 |
RUSBoost | 0.792 (0.767; 0.812) | 0.824 | 0.657 | 0.091 | 0.989 |
LogitBoost | 0.799 (0.775; 0.823) | 0.871 | 0.615 | 0.086 | 0.991 |
Neural network | 0.806 (0.782; 0.827) | 0.848 | 0.628 | 0.087 | 0.99 |
Diabetes + Undiagnosed diabetes | | | | | |
RF | 0.800 (0.788; 0.815) | 0.814 | 0.637 | 0.290 | 0.949 |
AdaBoost | 0.787 (0.775; 0.799) | 0.819 | 0.628 | 0.287 | 0.95 |
RUSBoost | 0.796 (0.782; 0.809) | 0.818 | 0.631 | 0.288 | 0.95 |
LogitBoost | 0.802 (0.789; 0.814) | 0.816 | 0.645 | 0.295 | 0.95 |
Neural network | 0.800 (0.787; 0.810) | 0.821 | 0.64 | 0.294 | 0.952 |
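The combination of low PPV and high NPV in the table above is typical for a condition that is rare in the screened population. As a rough illustration only (this is not the authors' evaluation code), the sketch below derives PPV and NPV from sensitivity, specificity, and prevalence via Bayes' rule. The function name `predictive_values` and the prevalence of 0.04 are assumptions made for this example; the test-set prevalence is not stated here, so the output is not expected to reproduce the tabulated values exactly.

```python
import math


def predictive_values(sensitivity: float, specificity: float, prevalence: float) -> tuple[float, float]:
    """Return (PPV, NPV) for a binary test, given as proportions in [0, 1].

    Hypothetical helper for illustration; not taken from the article.
    """
    for name, value in (("sensitivity", sensitivity), ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")

    # Expected fraction of the population in each confusion-matrix cell.
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)

    # PPV is undefined if nothing is flagged positive; NPV if nothing is flagged negative.
    flagged_pos = true_pos + false_pos
    flagged_neg = true_neg + false_neg
    ppv = true_pos / flagged_pos if flagged_pos > 0 else math.nan
    npv = true_neg / flagged_neg if flagged_neg > 0 else math.nan
    return ppv, npv


# Sensitivity and specificity from the RF row (undiagnosed diabetes);
# the prevalence of 0.04 is an ASSUMED value, chosen only for illustration.
ppv, npv = predictive_values(sensitivity=0.855, specificity=0.603, prevalence=0.04)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Under this assumed prevalence, most positive screens are false positives even though the model flags the large majority of true cases, which is why a positive result from such a model would normally be followed by a confirmatory test.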
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Cichosz, S.L.; Bender, C.; Hejlesen, O. A Comparative Analysis of Machine Learning Models for the Detection of Undiagnosed Diabetes Patients. Diabetology 2024, 5, 1–11. https://doi.org/10.3390/diabetology5010001