A Comparative Analysis of Machine Learning Models for the Detection of Undiagnosed Diabetes Patients

Introduction: Early detection of type 2 diabetes is essential for preventing long-term complications. However, screening the entire population for diabetes is not cost-effective, so identifying individuals at high risk for this disease is crucial. The aim of this study was to compare the performance of five diverse machine learning (ML) models in classifying undiagnosed diabetes using large heterogeneous datasets. Methods: We used data from several years of the National Health and Nutrition Examination Survey (NHANES), from 2005 to 2018, to identify people with undiagnosed diabetes using machine learning. The dataset included 45,431 participants, and biochemical confirmation of glucose control (HbA1c) was used to identify undiagnosed diabetes. The predictors were based on simple and clinically obtainable variables, which could be feasible for prescreening for diabetes. We included five ML models for comparison: random forest, AdaBoost, RUSBoost, LogitBoost, and a neural network. Results: The prevalence of undiagnosed diabetes was 4%. For the classification of undiagnosed diabetes, the area under the ROC curve (AUC) values were between 0.776 and 0.806. The positive predictive values (PPVs) were between 0.083 and 0.091, the negative predictive values (NPVs) were between 0.984 and 0.99, and the sensitivities were between 0.742 and 0.871. Conclusion: We have demonstrated that several types of classification models can accurately classify undiagnosed diabetes from simple and clinically obtainable variables. These results suggest that the use of machine learning for prescreening for undiagnosed diabetes could be a useful tool in clinical practice.


Introduction
The prevalence of type 2 diabetes is on the rise, leading to increased illness and mortality and escalating healthcare expenditures. The prevalence of type 2 diabetes varies across regions such as the UK, the U.S., China, and the United Arab Emirates, ranging from 7% to 34% of the respective population [1,2]. Of individuals in the United States, 9.7% have received a formal diagnosis of diabetes, while an additional 4.3% are living with diabetes but remain undiagnosed. Notably, approximately 30% of those who eventually receive a diabetes diagnosis exhibit associated complications [3].
The timely identification of type 2 diabetes matters because rigorous diabetes management can substantially mitigate long-term complications. Nevertheless, screening the entire population for diabetes is not cost-effective, which emphasizes the need to prioritize the recognition of individuals with a heightened susceptibility to the condition [4,5]. Numerous investigations of diabetes screening have been conducted within the previous ten years. Risk prediction or stratification models can identify individuals at an elevated risk of diabetes, allowing for subsequent targeted testing. Typically, these models combine variables such as weight, lifestyle, family history, and clinical measurements, and are built using multivariable statistical techniques [6][7][8].
Diabetology 2024, 5
Nevertheless, many of these models are not widely used in clinical practice, primarily because they were built on data gathered for other purposes. This can decrease the relevance of their findings when applied to a broader population [9]. Additionally, attempts are often made to create models that are easy to use in clinical practice. This is often accomplished by condensing continuous variables into distinct categories or by choosing predictors subjectively. However, such approaches can result in excessive simplification and a consequent decrease in the models' overall efficacy [10,11].
Analyzing data on diabetes can be difficult because medical data often exhibit nonlinear, nonnormal, correlated, and complex characteristics [12]. Machine learning (ML) methods have the potential to structure and utilize these complex patterns to classify diseases. It has previously been reported that ML could be utilized in diabetes for different purposes [13][14][15][16][17][18][19].
Others have reported ML approaches for the detection of diabetes and prediabetes [20][21][22][23][24][25]. However, it is still unclear which ML methods are best at capturing the complexity of the data to aid in selecting people at high risk of undiagnosed diabetes.
The objective of this study was to compare the performance of five diverse ML models for classifying undiagnosed diabetes using a large heterogeneous dataset.

Data Source
To identify individuals with undiagnosed diabetes using machine learning, we used data from multiple years of the National Health and Nutrition Examination Survey (NHANES) from 2005 to 2018 [26], which included HbA1c (glycated hemoglobin) data. HbA1c is recommended for the diagnosis of diabetes in most patient groups by the American Diabetes Association [27]. The NHANES study was executed by the National Center for Health Statistics, a division of the Centers for Disease Control and Prevention. The survey uses a complex sampling design to represent the demographic composition of the U.S. population, including the oversampling of subpopulations such as elderly individuals and various racial and ethnic minorities. Over the period spanning 2005 to 2018, a total of 70,190 participants were enrolled in the NHANES.
The present investigation involved individuals aged >20 years, excluding pregnant individuals and those with a documented diabetes diagnosis. A participant's diabetes diagnosis was ascertained by an affirmative response to the survey question "Have you ever been informed by a medical professional that you have diabetes?"
Using these data, we developed and compared ML models for diabetes prescreening in patients with undiagnosed diabetes.

End Points
Our objective was to compare five machine learning models for the detection of undiagnosed diabetes (prevalence) in the NHANES cohort.
We included two binary end points for classification: The primary endpoint (ap1) for the classification of undiagnosed diabetes was defined as an HbA1c ≥ 6.5% (48 mmol/mol) without a previous diagnosis of diabetes.
The secondary endpoint (ap2) was for the classification of undiagnosed diabetes (defined by an HbA1c ≥ 6.5% (48 mmol/mol) without a previous diagnosis of diabetes) or known diabetes.
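The two end point definitions above can be written as a small labeling rule. The following is a minimal sketch, not the authors' code; the function name `endpoint_labels` and the argument names are hypothetical, and only the HbA1c cutoff of 6.5% is taken from the text.

```python
HBA1C_THRESHOLD = 6.5  # percent (48 mmol/mol), the cutoff stated in the text


def endpoint_labels(hba1c_percent: float, told_has_diabetes: bool) -> dict:
    """Return the two binary end points for one participant.

    told_has_diabetes is the self-reported answer to the survey question
    about a previous diabetes diagnosis (hypothetical argument name).
    """
    elevated = hba1c_percent >= HBA1C_THRESHOLD
    undiagnosed = elevated and not told_has_diabetes
    return {
        "ap1": undiagnosed,                       # undiagnosed diabetes only
        "ap2": undiagnosed or told_has_diabetes,  # undiagnosed or known diabetes
    }
```

For example, a participant with an HbA1c of 7.0% and no previous diagnosis is positive for both end points, whereas a participant with known diabetes is positive for ap2 only.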

Variables and Selection
We included simple variables commonly associated with the risk of diabetes that could be used in a practical prescreening procedure. The variables included age, sex, ethnicity, weight, height, waist circumference, sleep duration, BMI, blood pressure (BP), physical activity, smoking, alcohol use, education, and the ratio of family income to poverty. Variable selection was performed according to an automatic approach using the training data, 3-fold cross-validation, and receiver operating characteristic (ROC) area under the curve (AUC) improvements as criteria for the inclusion of variables. Missing data among variables used for classification and prediction are common both in studies and during clinical usage. However, the chosen ML methods implemented in this study can incorporate missing values into the modeling approach without the need for imputation or case deletion [28].
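The automatic variable selection described above can be illustrated with a greedy forward-selection loop. This is a sketch under stated assumptions rather than the authors' implementation: `cv_auc` is a hypothetical callback that would return the 3-fold cross-validated ROC AUC for a given list of variables, and `min_gain` is an assumed stopping tolerance that the text does not specify.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(candidates: Sequence[str],
                   cv_auc: Callable[[List[str]], float],
                   min_gain: float = 0.0) -> Tuple[List[str], float]:
    """Greedily add the variable that most improves the cross-validated AUC."""
    selected: List[str] = []
    best = 0.5  # AUC of a chance-level classifier with no predictors
    remaining = list(candidates)
    while remaining:
        # Score every candidate when added to the current selection.
        scored = [(cv_auc(selected + [c]), c) for c in remaining]
        top_score, top_var = max(scored)
        if top_score - best <= min_gain:
            break  # no remaining variable improves the AUC enough
        selected.append(top_var)
        remaining.remove(top_var)
        best = top_score
    return selected, best
```

In practice, `cv_auc` would train the chosen model on two folds of the training data and evaluate it on the third, averaging over the three folds.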

Model Development
We included five ML models for comparison: random forest, AdaBoost, RUSBoost, LogitBoost, and a neural network. These specific models were compared because previous studies have shown high performance with ensemble and neural network models in general disease classification [29,30]. The rationale behind selecting these models is rooted in the collective strengths they bring to the task of disease classification, which aims to provide a comprehensive comparison across diverse approaches. By including models with different underlying mechanisms (boosting, bagging, or neural networks), we aim to identify the most suitable model for our specific dataset and research objectives.
The models were trained/developed using a sample of 80% (training data) of the individuals in each group and tested on the remaining 20% (test data). This process was conducted in such a manner that 20% of the data were saved for testing the final models; hence, the test data were not used to optimize the models further. The training data were used to select variables through forward selection and to optimize and train the models; cross-validation was used to minimize overfitting of the models. Due to a class imbalance in the dataset, the optimization was conducted with an evaluation criterion based on the precision-recall curve area under the curve (PR-AUC). A schematic of the procedure is illustrated in Figure 1. All the models were developed and implemented using MATLAB R2021b (MathWorks, Natick, MA, USA).
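The 80/20 split "of the individuals in each group" can be read as a stratified hold-out split. The sketch below illustrates that reading in Python rather than in the MATLAB environment the authors used; the function name, the seed, and the rounding rule are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(labels: Sequence[int], test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split row indices into train/test while keeping the class proportions."""
    rng = random.Random(seed)
    train_idx: List[int] = []
    test_idx: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)  # per-class test size
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

Splitting per class keeps the low prevalence of the positive class roughly equal in the training and test data, which matters when only a few percent of participants are positive.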

Random Forest Model
The random forest algorithm is a machine learning method [31] that uses a group of decision trees to make predictions. During the training process, many decision trees are constructed, and each tree in the forest is based on a random sample of the data. The final prediction is made by combining the predictions of all the individual trees through a majority vote. As the number of trees in the forest increases, the accuracy of the predictions tends to improve. Hyperparameter estimation was performed using a grid search strategy. We optimized the number of trees, the depth of the trees, and the minimum number of samples required to perform a split.
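Two ingredients of the description above, random sampling of the data for each tree and majority voting across trees, can be sketched as follows. The helper names are hypothetical, and the tree-fitting step itself is omitted.

```python
import random
from collections import Counter
from typing import List, Sequence


def bootstrap_indices(n: int, rng: random.Random) -> List[int]:
    """Draw n row indices with replacement; each tree gets its own sample."""
    return [rng.randrange(n) for _ in range(n)]


def majority_vote(tree_predictions: Sequence[int]) -> int:
    """Return the class predicted by most trees (ties go to the first seen)."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

With predictions `[1, 0, 1, 1, 0]` from five trees, `majority_vote` returns class 1.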

AdaBoost
Adaptive boosting (AdaBoost) [32] is an ensemble learning algorithm that is used to improve the accuracy of a weak learner (such as a decision tree). This process involves iteratively training the weak learner and adjusting the weights of the training data at each iteration so that the misclassified examples are given higher weights. The final model is a combination of all the weak learners, with each weak learner contributing a weight to the final prediction. One of the main benefits of AdaBoost is that it is simple to implement and relatively resistant to overfitting problems, making it a good choice for situations where the training data are limited. Hyperparameter estimation was performed using a grid search strategy. We optimized the number of weak learners and the learning rate.
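The reweighting step described above can be made concrete with one round of the classic discrete AdaBoost update. This is a textbook sketch, not the MATLAB implementation used in the study; it assumes labels coded as -1/+1, weights that sum to 1, and no learning-rate shrinkage.

```python
import math
from typing import List, Sequence, Tuple


def adaboost_round(weights: Sequence[float], y_true: Sequence[int],
                   y_pred: Sequence[int]) -> Tuple[float, List[float]]:
    """Return the learner weight (alpha) and the updated sample weights."""
    # Weighted error of the current weak learner.
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (y * h == -1) are scaled up, correct ones down.
    new_w = [w * math.exp(-alpha * y * h)
             for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]
```

After the update, the misclassified samples together carry half of the total weight, so the next weak learner is pushed toward the examples the current one got wrong.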

RUSBoost
Random undersampling boosting (RUSBoost) [33] is a variant of the AdaBoost algorithm that is designed to handle imbalanced datasets. Imbalanced datasets are datasets in which one class (the minority class; in our case, individuals with undiagnosed diabetes) has significantly fewer examples than the other class (the majority class; in our case, individuals without undiagnosed diabetes). In such cases of imbalance, AdaBoost can be prone to bias toward the majority class, leading to poor performance for the minority class. RUSBoost addresses this issue by randomly undersampling the majority class at each iteration. By undersampling the majority class, RUSBoost ensures that each weak learner is trained on a balanced dataset. Hyperparameter estimation was performed using a grid search strategy. We optimized the number of weak learners and the learning rate.
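The undersampling step can be sketched as below. The 1:1 class ratio, the function name, and the seed handling are assumptions for illustration; in RUSBoost this sampling would be repeated with a fresh draw at every boosting iteration before the weak learner is fitted.

```python
import random
from typing import List, Sequence


def random_undersample(labels: Sequence[int], minority_label: int = 1,
                       seed: int = 0) -> List[int]:
    """Return row indices with the majority class cut down to the minority size."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Keep a random subset of the majority class, without replacement.
    kept = rng.sample(majority, min(len(minority), len(majority)))
    return sorted(minority + kept)
```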

LogitBoost
LogitBoost [34] is a popular boosting modification that can be applied to binary classification problems. From a statistical standpoint, LogitBoost can be seen as an additive tree regression that minimizes the logistic loss. One of the benefits of LogitBoost is that it is relatively easy to implement, and it can often achieve good performance with relatively little hyperparameter tuning. It is also resistant to overfitting, which makes it a good fit for use on noisy or high-dimensional data. Hyperparameter estimation was performed using a grid search strategy. We optimized the number of weak learners and the learning rate.
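To make the link to the logistic loss more concrete, the sketch below computes the quantities used in one round of the standard two-class LogitBoost formulation: class probabilities from the current additive score, then the working response and weights to which the next regression learner is fitted by weighted least squares. It follows the textbook algorithm, not the MATLAB implementation used in the study; the function name and the clipping constant are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def logitboost_working_response(scores: Sequence[float],
                                y01: Sequence[int]
                                ) -> Tuple[List[float], List[float]]:
    """Return (z, w): working responses and weights for the next weak learner.

    scores holds the current additive model F(x); y01 holds labels coded 0/1.
    """
    z: List[float] = []
    w: List[float] = []
    for f, y in zip(scores, y01):
        p = 1.0 / (1.0 + math.exp(-2.0 * f))  # P(y = 1 | x) under the model
        p = min(max(p, 1e-10), 1 - 1e-10)     # avoid division by zero
        wi = p * (1.0 - p)
        w.append(wi)
        z.append((y - p) / wi)
    return z, w
```

At the start of training all scores are zero, so every probability is 0.5, every weight is 0.25, and the working response is +2 for positive and -2 for negative examples.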

Neural Network
A neural network is a machine learning model inspired by the structure and function of the human brain. It is composed of layers of interconnected nodes, or neurons, that process and transmit information. We implemented a feedforward neural network with the following architecture: an input layer, three fully connected hidden layers [3, 2, 4 neurons], a softmax layer, and a classification layer. In the training process, 30% of the training dataset was used as the validation dataset to minimize overfitting of the model. Hyperparameter estimation was performed using a grid search strategy. We optimized the number of neurons in the hidden layers.
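The architecture described above can be illustrated with a forward pass in plain Python. Several details are assumptions because the text does not state them: the ReLU activation, a final fully connected layer with two outputs feeding the softmax, and the random weight initialization. Training is omitted.

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight matrix, bias vector)


def init_network(n_inputs: int, hidden: Sequence[int] = (3, 2, 4),
                 n_classes: int = 2, seed: int = 0) -> List[Layer]:
    """Create randomly initialized layers: input -> 3 -> 2 -> 4 -> classes."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden, n_classes]
    layers: List[Layer] = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def dense(x: Sequence[float], layer: Layer) -> List[float]:
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(v: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities that sum to 1."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x: Sequence[float], layers: Sequence[Layer]) -> List[float]:
    """Run the input through the hidden layers (ReLU assumed) and the softmax."""
    h = list(x)
    for layer in layers[:-1]:
        h = [max(0.0, a) for a in dense(h, layer)]
    return softmax(dense(h, layers[-1]))
```

Calling `forward(features, init_network(len(features)))` returns two class probabilities; the classification layer then corresponds to picking the class with the larger probability or applying a decision threshold.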

Model Assessment
Test datasets were used to assess the performance of the five models. Receiver operating characteristic (ROC) curves and precision-recall curves were used to compare the performance of the models for classifying undiagnosed diabetes (ap1) and for classifying undiagnosed or known diabetes (ap2). Ninety-five percent confidence intervals (CIs) for the ROC AUC were estimated using bootstrap replicates (n = 1000). Furthermore, a specific threshold (based on the maximized Youden index) was used to compare the sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) to better understand the capabilities of the models for usage in clinical practice during a prescreening procedure.
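The assessment steps above can be sketched with standard-library Python. This is an illustrative sketch, not the authors' code: the AUC uses the rank (Mann-Whitney) formulation, the confidence interval uses the percentile bootstrap, which is an assumption because the text does not say which bootstrap variant was used, and all function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def metrics_at(y_true: Sequence[int], scores: Sequence[float],
               threshold: float) -> Dict[str, float]:
    """Sensitivity, specificity, PPV, and NPV when score >= threshold is positive."""
    pairs = list(zip(y_true, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if tp + fp else nan,
        "npv": tn / (tn + fn) if tn + fn else nan,
    }


def youden_threshold(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Threshold that maximizes sensitivity + specificity - 1."""
    best_t, best_j = min(scores), -1.0
    for t in sorted(set(scores)):
        m = metrics_at(y_true, scores, t)
        j = m["sensitivity"] + m["specificity"] - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t


def bootstrap_auc_ci(y_true: Sequence[int], scores: Sequence[float],
                     n_boot: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Percentile 95% CI for the AUC from resampled test sets."""
    rng = random.Random(seed)
    n = len(y_true)
    aucs: List[float] = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both classes
            aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    return aucs[int(0.025 * n_boot)], aucs[int(0.975 * n_boot) - 1]
```

The pairwise AUC is quadratic in the sample size, so a rank-based implementation would be preferable for a test set with thousands of participants.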

Results
A total of 45,431 participants were included in the analysis, and 36,162 participants were excluded from the analysis due to missing HbA1c measurements, age criteria, or pregnancy. Among the included participants, 1297 had undiagnosed diabetes (the prevalence of undiagnosed diabetes was 3.2%), 4772 had known diabetes, and 9556 had prediabetes. The characteristics of the included participants are presented in Table 1. Table 2 shows the ROC AUCs for the classifiers along with a selected cutoff, which included sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Figure 2 shows the ROC curves (left) and the precision-recall curves (right) for the five classifiers.
Table 2. The ROC AUC (95% confidence interval) for the classifiers, along with the sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) at a selected cutoff based on the maximized Youden index.

Primary endpoint (ap1): For the classification of undiagnosed diabetes (no diabetes or prediabetes vs. undiagnosed diabetes), the area under the ROC curve (AUC) was between 0.776 and 0.806. The PPV was between 0.083 and 0.091, the NPV was between 0.984 and 0.99, and the sensitivity was between 0.742 and 0.871. Figure 3 shows the selected predictors for each model using forward selection and cross-validation. Age and ethnicity (non-Hispanic white) were selected for all models, and the economic ratio was selected for four out of five models.
Secondary endpoint (ap2): For the classification of undiagnosed diabetes + known diabetes (no diabetes or prediabetes vs. undiagnosed diabetes or known diabetes), the area under the ROC curve (AUC) was between 0.787 and 0.802. The PPV was between 0.287 and 0.295, the NPV was between 0.949 and 0.952, and the sensitivity was between 0.787 and 0.802.

Discussion
This study aimed to compare the performance of five ML models in classifying undiagnosed diabetes using a large heterogeneous dataset and simple, obtainable clinical information. For the classification of undiagnosed diabetes, the comparison did not reveal large differences in performance among the five models. All the included models performed well and could be utilized in a clinical prescreening program to identify people for subsequent diabetes testing. The PPV was approximately 8-9%, which is low but expected for this type of prescreening. Other risk score studies have reported PPVs between 4 and 8% [6,35]. This means that for every 1000 people screened, if a sensitivity of 80% was selected, ~392 people would be eligible for subsequent testing, and out of those people, ~32 would have undiagnosed diabetes. Furthermore, ~8 people with undiagnosed diabetes would be missed by the prescreening. A substantial portion of the people selected for subsequent testing who did not have undiagnosed diabetes (false positives) had prediabetes. Identifying people with prediabetes could lead to health-promoting initiatives for the group to slow or stop the progression from prediabetes to diabetes.
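The per-1000 figures in the paragraph above can be reproduced with simple arithmetic. In the sketch below, the 4% prevalence is the value reported in the abstract and the 80% sensitivity is taken from the text, while the specificity of 0.625 is an assumption back-calculated so that about 392 people are referred; the function name is hypothetical.

```python
from typing import Dict


def screening_yield(n_screened: int, prevalence: float,
                    sensitivity: float, specificity: float) -> Dict[str, float]:
    """Expected outcome counts when a prescreening model is applied."""
    cases = n_screened * prevalence
    non_cases = n_screened - cases
    true_pos = cases * sensitivity            # cases correctly referred
    missed = cases - true_pos                 # false negatives
    false_pos = non_cases * (1.0 - specificity)
    referred = true_pos + false_pos
    return {
        "referred": referred,
        "true_positives": true_pos,
        "missed_cases": missed,
        "ppv": true_pos / referred,
    }


result = screening_yield(1000, prevalence=0.04, sensitivity=0.80,
                         specificity=0.625)
```

Under these assumptions the function gives about 392 referrals, 32 detected cases, 8 missed cases, and a PPV of roughly 8%, in line with the figures quoted above.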
The exact cutoff for such a prescreening procedure also needs to be considered in a cost-benefit analysis, which is beyond the scope of this paper.
For the classification of undiagnosed diabetes + known diabetes, similar trends were observed: the choice of model did not significantly change the performance. However, the PPV was much greater than that for undiagnosed diabetes alone. This is also expected, as the prevalence of the combined outcome (undiagnosed plus known diabetes) is much greater.
The predictors included in this study can be categorized into three groups: demographic, clinical, and lifestyle predictors.These predictors were used to develop machine learning models to prescreen undiagnosed diabetes patients.
Demographic predictors have been shown to be associated with diabetes incidence [36]. In our study, age and ethnicity were included as predictors in all the proposed models. Clinical variables have been consistently associated with diabetes risk. For example, higher BMI and waist circumference have been shown to be strongly associated with diabetes risk, with individuals with a BMI of 30 or higher being at a greater risk of developing diabetes. Waist circumference and systolic blood pressure were also included as predictors in most of the models. BMI was only selected for one of the compared models; however, studies have shown that waist circumference may be a more specific predictor of dangerous overweight [37]. Lifestyle predictors have also been shown to be associated with diabetes risk [36]. For example, physical activity has been shown to lower diabetes risk, while smoking and alcohol usage have been shown to increase diabetes risk [38]. In our study, alcohol usage and indirect measures of lifestyle, such as education level and economic status, were included as predictors. Surprisingly, physical activity and smoking were not included as predictors. The explanation could be that it might be difficult to capture the discriminative information in these predictors using a questionnaire-based approach or that the information is captured indirectly by other predictors.

Comparison to Other Related Work
Over the past few decades, several machine learning approaches and classic statistical predictive models have been published on the topic of screening for undiagnosed diabetes. Baan et al. [35] developed three predictive models (logistic regression) based on a sample of participants from the Rotterdam Study (n = 1016) aged 55 to 75 years who were not known to have diabetes. The authors reported ROC AUCs of up to 0.74. Bang et al. [36] developed a simple scoring system (based on logistic regression) using the Korea National Health and Nutrition Examination Survey (KNHANES) and compared it with previous scoring systems. Bang et al. reported ROC AUCs of up to 0.73. Moreover, Cichosz et al. [23] suggested an extended predictive feature search strategy to model a logistic regression for the prediction of undiagnosed diabetes. They reported an ROC AUC of 0.78.
Yu et al. [24] used a support vector machine (SVM) approach to identify undiagnosed and known diabetes in the 1999-2004 sample of the NHANES with successful performance (AUC = 0.83). However, Yu et al. did not predict undiagnosed diabetes separately, which makes comparison difficult.

Strengths and Limitations
An important advantage of this research lies in our utilization of a substantial and diverse dataset from the NHANES. This dataset is distinctive because it comprises nationally representative survey data that have been weighted to reflect the composition of the entire U.S. population. As a result, the findings are likely to have a reasonable degree of applicability to the broader U.S. population when used in a screening process. Nevertheless, the application of these models in other regions of the world requires careful consideration, and it is imperative to validate their effectiveness in these populations before adopting them on a larger scale.
The approach introduced in this research was rooted in data-driven analysis. We carefully chose variables and refined our models to achieve optimal performance. Although the chosen variables were all characterized as readily available or easily obtainable clinical data, certain pieces of information are more practical to collect than others, particularly in the context of large-scale population screening. Should these models be considered for practical clinical use, it becomes important to assess the significance of each variable, with an emphasis on selecting those that are the easiest to obtain.
A limitation of this study is the definition of undiagnosed diabetes, as it was based on a single lab value of HbA1c above 6.5%. The American Diabetes Association (ADA) recommends that at least two HbA1c levels be measured to fully establish a diabetes diagnosis. Furthermore, known diabetes diagnoses rely on participant self-reports, which are subject to misclassification bias.
Additionally, we explored five distinct, robust machine learning algorithms known for their effective predictive capabilities in healthcare settings for comparative analysis. Numerous alternative methods and implementations, including support vector machines, XGBoost, and K-nearest neighbor methods, are also available. We believe that further exploration and comparison of additional methods could be pertinent, particularly when dealing with more intricate datasets containing extensive additional and complex information for the identification of undiagnosed diabetes.

Future Directions
In a recent study, Katsimpris et al. [39] demonstrated the potential of leveraging nutritional data for predicting type 2 diabetes mellitus through a logistic regression approach. An avenue for future exploration in the development of a classification model for identifying individuals with undiagnosed diabetes involves integrating dietary information with other pertinent factors. This strategic combination of variables aims to enhance the predictive capabilities of the model, potentially yielding more accurate and comprehensive insights into the identification of undiagnosed diabetes patients.

Conclusions
We have demonstrated that several types of classification models can accurately classify undiagnosed diabetes from simple and clinically obtainable variables. Small differences in performance were observed among the compared models, but no one model outperformed the others in terms of classifying undiagnosed diabetes or prediabetes. These results suggest that the use of machine learning for prescreening for undiagnosed diabetes could be a useful tool in clinical practice.

Figure 1. Illustration of model development and performance testing.

Figure 3. The selected predictors/features for the model(s). Gray indicates that the predictor was selected in the forward feature selection using 3-fold cross-validation.

Table 1. The baseline characteristics of people with prediabetes, undiagnosed diabetes, or diabetes. Significance (p < 0.05) is indicated between undiagnosed diabetes and no diabetes (N), prediabetes (P), and diabetes (D).