Predicting the Risk of Incident Type 2 Diabetes Mellitus in Chinese Elderly Using Machine Learning Techniques

Early identification of individuals at high risk of diabetes is crucial for implementing early intervention strategies. However, algorithms specific to elderly Chinese adults are lacking. The aim of this study was to build effective prediction models based on machine learning (ML) for the risk of type 2 diabetes mellitus (T2DM) in Chinese elderly. A retrospective cohort study was conducted using the health screening data of adults older than 65 years in Wuhan, China from 2018 to 2020. After strict data filtering, 127,031 records from eligible participants were used. Overall, 8298 participants were diagnosed with incident T2DM during the 2-year follow-up (2019–2020). The dataset was randomly split into a training set (n = 101,625) and a test set (n = 25,406). We developed prediction models based on four ML algorithms: logistic regression (LR), decision tree (DT), random forest (RF), and extreme gradient boosting (XGBoost). Using LASSO regression, 21 prediction features were selected. Random under-sampling (RUS) was applied to address the class imbalance, and Shapley Additive Explanations (SHAP) were used to calculate and visualize feature importance. Model performance was evaluated by the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and accuracy. The XGBoost model achieved the best performance (AUC = 0.7805, sensitivity = 0.6452, specificity = 0.7577, accuracy = 0.7503). Fasting plasma glucose (FPG), education, exercise, gender, and waist circumference (WC) were the top five predictors. This study showed that the XGBoost model can be applied to screen individuals at high risk of T2DM in the early phase, which has strong potential for intelligent prevention and control of diabetes. The key features could also be useful for developing targeted diabetes prevention interventions.


Introduction
Diabetes mellitus (DM) is a chronic metabolic disease characterized by hyperglycemia, which can lead to serious complications such as chronic kidney disease, acute kidney injury, cardiovascular disease, ischemic heart disease, stroke, or even death [1]. Type 2 diabetes mellitus (T2DM) is the most common type of diabetes, accounting for around 90% of all diabetes cases. According to the 2021 report of the International Diabetes Federation (IDF), about 537 million people worldwide have diabetes, and the figure is projected to rise to 643 million by 2030 and 783 million by 2045 [2]. In China, an estimated 140.9 million adults were living with diabetes, accounting for 25% of patients with diabetes worldwide [3]. The rising incidence of diabetes imposes a heavy burden on individuals, health systems, and society as a whole [4,5].
T2DM is an irreversible but preventable disease [6]. Early diagnosis and effective screening of high-risk populations can prevent or delay the occurrence or development of T2DM.

Study Design and Participants
A retrospective cohort study was conducted using the health screening data of adults older than 65 years from 17 districts in Wuhan, China. The Wuhan Municipal Government provides free physical examinations for the elderly aged 65 and above as a routine, standardized public benefit program. A total of 388,420 elderly people participated in the health screening in 2018. The protocol was approved by the Ethics Committee of Wuhan Center for Disease Control and Prevention (protocol code WHCDCIRB-K-2018023), and written informed consent was obtained from each participant. Baseline data were collected in 2018, and follow-up data were collected in 2019 and 2020. For the longitudinal analysis of incident T2DM, the exclusion criteria were: (1) participants with prevalent T2DM at baseline (a fasting plasma glucose ≥7.0 mmol/L or a self-reported previous diagnosis by a health care professional at baseline); (2) those lost to follow-up; (3) those with duplicate data; (4) those with missing laboratory values; (5) those with outliers. After applying the exclusion criteria, a total of 127,031 participants were included in this study. The study flow chart is depicted in Figure 1.

Candidate Predictors
The health screening data were collected and recorded at the local community health service centers in Wuhan by well-trained research staff. The data comprised three parts: a health status questionnaire, anthropometric measures, and laboratory measures. The questionnaire covered age, gender, education, marital status, medical history (hypertension, myocardial infarction, coronary heart disease, angina pectoris, fatty liver), exercise, current smoking, and current drinking. Anthropometric measures were taken by trained medical staff using standardized procedures and included weight, height, waist circumference (WC), systolic blood pressure (SBP), and diastolic blood pressure (DBP). Body mass index (BMI) was calculated as weight (kg) divided by height squared (m²). Laboratory measures were performed at the central laboratory and included fasting plasma glucose (FPG), total cholesterol (TC), triglyceride (TG), high-density lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol (LDL-C), alanine aminotransferase (ALT), aspartate transaminase (AST), total bilirubin (TBIL), serum creatinine (Scr), blood urea nitrogen (BUN), and serum uric acid (SUA). The 27 candidate predictors from the health screening baseline data (Table 1) were selected based on the available variables in our dataset, clinical expertise, and prior literature evidence of their associations with T2DM [30][31][32].

Outcome
Incident type 2 diabetes mellitus (T2DM) was diagnosed if at least one of the following two criteria was satisfied according to the American Diabetes Association (ADA): (1) a self-reported diagnosis that was determined previously by a health care professional, or (2) fasting plasma glucose (FPG) ≥ 126 mg/dL (7.0 mmol/L) [33]. In this study, self-reported T2DM was determined by asking participants whether a health care professional had ever told them that they had diabetes. Fasting blood samples were collected after at least 8 h of overnight fasting and were analyzed by trained research staff at the central laboratory. FPG levels were measured using the glucose oxidase procedure.
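The two-criterion outcome definition above can be written as a small labelling rule. The following is only an illustrative sketch: the function and argument names are hypothetical and are not taken from the study's code.

```python
# ADA fasting plasma glucose cut-off quoted in the text (7.0 mmol/L = 126 mg/dL).
FPG_THRESHOLD_MMOL_L = 7.0


def has_incident_t2dm(fpg_mmol_l: float, self_reported_diagnosis: bool) -> bool:
    """Return True if at least one of the two criteria from the text is met."""
    return self_reported_diagnosis or fpg_mmol_l >= FPG_THRESHOLD_MMOL_L
```

Either criterion alone is sufficient, so a participant with a normal FPG but a self-reported diagnosis is still labelled as a case.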

Logistic Regression (LR)
LR is a classic classification algorithm that measures the relationship between a categorical dependent variable and one or more independent variables based on the sigmoid function [34]. This algorithm is a simple method for prediction which provides baseline accuracy scores to compare with other non-parametric machine learning models [14,35].
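To make the role of the sigmoid function concrete, here is a minimal sketch of how a fitted logistic regression turns a feature vector into a predicted probability. The function names and the example coefficients in the usage note are hypothetical; they are not the fitted values from this study.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(features, coefficients, intercept):
    """Predicted probability of the positive class for one standardized sample."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)
```

For example, `predict_probability([1.0, 2.0], [0.5, -0.25], 0.0)` has a linear predictor of zero and therefore returns a probability of 0.5.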

Decision Tree (DT)
DT is a supervised learning technique used for a classification task. A decision tree is a class discrimination tree structure, with each internal node representing an attribute (or independent variable), each branch reflecting an outcome of the test, and each leaf node corresponding to a class label (or dependent variable) [11]. The purpose of DT is to generate a decision tree with strong generalization capability [36].

Random Forest (RF)
RF is a typical ensemble learning algorithm that consists of multiple decision trees [37]. It can be applied to deal with regression and classification tasks. The algorithm is based on the idea of incorporating multiple decision tree classifiers to obtain the final classification result by majority voting and make accurate predictions [38]. RF can analyze complex interactions between characteristics, and is extremely adept at handling noisy and missing data [29].
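The majority-voting step described above can be sketched in a few lines. This is a simplified illustration, not the Sklearn implementation; the function name is hypothetical, and ties are broken in favour of the label that appears first in the list.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label
```

With five trees voting `[1, 0, 1, 1, 0]`, the forest predicts class 1 because three of the five trees agree.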

Extreme Gradient Boosting (XGBoost)
XGBoost is an advanced ensemble algorithm proposed by Chen and Guestrin in 2016 [39]. It is a scalable machine learning technique for tree boosting that combines a series of weak classifiers to construct a stronger classifier. It is an optimized implementation of the gradient boosting decision tree (GBDT) and offers high training speed, excellent performance, and the ability to handle large-scale data.

Model Development
The dataset was randomly split into two parts: the training set accounted for 80% (n = 101,625) and the test set accounted for 20% (n = 25,406). Since the outcome classes (incident T2DM versus non-T2DM) were imbalanced, random under-sampling (RUS) was applied to the training set to mitigate the effect of class imbalance. The input features were standardized using the Python Sklearn library [40]: the training set was standardized to mean 0 and variance 1 using the StandardScaler function from the Sklearn preprocessing library, and the test set was standardized using the mean and standard deviation of the training set. Least Absolute Shrinkage and Selection Operator (LASSO) regression was used for feature selection in the training set. LASSO is a regression model that penalizes the absolute sizes of the coefficients, which shrinks some regression coefficients to exactly zero [41]. The candidates with non-zero coefficients are retained. We applied LASSO regression to all candidate variables to select the final input features for the prediction models.
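The under-sampling and standardization steps can be sketched without any third-party library. This is a stand-in illustration rather than the study's Sklearn-based code: the function names are hypothetical, and the 1:1 class ratio after under-sampling is an assumption, since the text does not state the ratio used.

```python
import random
import statistics


def random_under_sample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes are equally frequent.

    Assumption: a 1:1 ratio is targeted; the source does not specify the ratio.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


def fit_scaler(train_rows):
    """Per-feature mean and population standard deviation from the training set only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]  # guard constant columns
    return means, stds


def apply_scaler(rows, means, stds):
    """Standardize rows with previously fitted training-set statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]
```

Fitting the scaler on the training set and reusing its statistics on the test set, as the text describes, keeps information from the test set out of the preprocessing step.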
We trained the logistic regression (LR), decision tree (DT), and random forest (RF) models using the Python Sklearn package [42]. Extreme gradient boosting (XGBoost) was implemented using the Xgboost package [39]. The input variables were the 21 features selected by LASSO regression (Table 2). For the DT, RF, and XGBoost algorithms, Bayesian optimization with 10-fold cross-validation was performed on the training set to tune the hyperparameters. Bayesian optimization for machine learning was proposed by Snoek et al. [43] and has been shown to outperform most global optimization algorithms on benchmark functions. It has become extremely popular for tuning hyperparameters in machine learning algorithms [44]. Bayesian optimization keeps track of the previous evaluation results of the objective function and uses them to build a surrogate model, such as a Gaussian process, which is then used to find the most promising hyperparameters [45]. After enough evaluations of the objective function, up to the maximum number of iterations, the surrogate function becomes an accurate model of the actual objective function and the selected set of hyperparameters is taken as optimal [46]. After 500 iterations, we obtained the final hyperparameters of DT, RF, and XGBoost. The best hyperparameters for DT were as follows: max_depth = 19, max_features = 7, min_samples_leaf = 55, min_samples_split = 10, min_weight_fraction_leaf = 0.031159281996108103. The best hyperparameters for RF were as follows: max_depth = 68, max_features = 8, n_estimators = 80, min_samples_leaf = 5, min_samples_split = 69, min_weight_fraction_leaf = 0.0009215045821160297. The best hyperparameters for XGBoost were as follows: colsample_bytree = 0.6907621204231386, gamma = 0.6991315172625473, learning_rate = 0.093311071904797607, max_depth = 3, min_child_weight = 30, reg_alpha = 0.9430563747862351, reg_lambda = 0.7001632991135449, subsample = 0.5957497121054272.

Model Evaluation
The performance of the prediction models was evaluated on the test set using the tuned hyperparameters. The area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and accuracy were used to evaluate the classification performance. Sensitivity is the proportion of positive cases that are predicted correctly, and specificity is the proportion of negative cases that are predicted correctly. Accuracy is the proportion of all cases, positive and negative, that are predicted correctly. A receiver operating characteristic (ROC) curve was drawn with the true positive rate (sensitivity) as the ordinate and the false positive rate (1 - specificity) as the abscissa, which indicates the overall performance of a binary classifier. AUC was calculated from the ROC curve. The performance metrics were calculated as follows:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$

Here, TP, FN, FP, and TN represent true positive, false negative, false positive, and true negative, respectively.
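The evaluation metrics described above can be computed directly from the confusion-matrix counts and the predicted scores. The sketch below is illustrative and uses hypothetical function names; the AUC is computed through its rank interpretation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case), which equals the area under the ROC curve. The pairwise loop is quadratic and is meant for explanation, not for a test set of this size.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy


def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Unlike sensitivity, specificity, and accuracy, the AUC does not depend on the probability threshold used to turn scores into class labels.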

Model Interpretation
For further model interpretation, Shapley Additive Explanations (SHAP) were used. SHAP is a method proposed by Lundberg and Lee in 2017 that is widely used to interpret classification and regression models [47]. In this method, the features are ranked by their contribution to the model, and the relationship between features and the outcome can be visualized. The model produces a predicted value for each sample, and the SHAP value is the portion of that prediction allocated to each feature in the sample. Its absolute value reflects the influence of the feature, and its sign reflects whether the feature raises or lowers the predicted risk of incident T2DM. A SHAP value > 0 indicates that the feature contributed to a higher risk of incident T2DM; conversely, a SHAP value < 0 indicates that the feature contributed to a lower risk of incident T2DM [48].
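The idea of allocating a prediction to individual features can be illustrated with an exact Shapley value computation on a toy model. This brute-force sketch is an assumption-laden illustration, not the SHAP library and not the procedure used in the study: it enumerates every feature coalition (exponential cost), and it represents an "absent" feature by a fixed, hypothetical baseline value.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, sample, baseline):
    """Exact Shapley values for one sample by enumerating all feature coalitions.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(sample)

    def value(coalition):
        mixed = [sample[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi
```

For a linear toy model such as `lambda x: 2 * x[0] - 3 * x[1] + 1`, each feature's value is its coefficient times its distance from the baseline, and the values sum to the difference between the prediction for the sample and the prediction for the baseline. A positive value pushes the prediction up and a negative value pushes it down, which matches the sign interpretation given above.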

Statistical Analysis
Data analyses were performed using SAS version 9.4 and Python version 3.10. Baseline characteristics were summarized as mean ± standard deviation (SD) for normally distributed continuous variables, as median and interquartile range (IQR) for non-normally distributed continuous variables, and as number and percentage for categorical variables. Student's t-test and the Wilcoxon test were used to compare normally and non-normally distributed continuous variables, respectively, and the Chi-square test or Fisher's exact test was used to compare categorical variables between subgroups. The statistical significance level was set at p < 0.05 (two-sided). To implement the ML algorithms, we used the Python sklearn package [42] and the Xgboost package [39].
Table 1 shows the participants' baseline characteristics. A total of 127,031 eligible participants were included in this study, of whom 8298 developed incident T2DM and 118,733 did not. The mean age of study participants was 71.94 ± 5.10 years. Age, gender, education, marital status, hypertension, fatty liver, exercise, current smoking, BMI, WC, SBP, DBP, FPG, TC, TG, HDL-C, LDL-C, ALT, AST, TBIL, Scr, BUN, and SUA were all significantly associated with incident T2DM (p < 0.05).
Table 2 presents the results of the LASSO regression. In total, 21 features were selected as predictors of incident T2DM, namely age, gender, education, marital status, hypertension, exercise, current smoking, current drinking, WC, SBP, FPG, TC, TG, HDL-C, LDL-C, ALT, AST, TBIL, Scr, BUN, and SUA.
Table 3 presents the performance of the four machine learning models. The ROC curves on the training set and test set are shown in Figure 2. Overall, the XGBoost model performed best, with the highest AUC value of 0.7805 on the test set and a sensitivity, specificity, and accuracy of 0.6452, 0.7577, and 0.7503, respectively. The confusion matrices of the four machine learning models are presented in Figure 3.
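Because accuracy is a prevalence-weighted average of sensitivity and specificity, the reported XGBoost figures can be cross-checked against each other. The sketch below assumes that the random split left the test set with the same incidence as the whole cohort (8298 of 127,031); this assumption and the function name are not stated in the source.

```python
def accuracy_from_rates(sensitivity, specificity, prevalence):
    """Accuracy implied by sensitivity, specificity, and outcome prevalence."""
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


# Cohort counts and XGBoost test-set metrics reported in the text.
incidence = 8298 / 127031  # assumed to hold in the test set as well
implied_accuracy = accuracy_from_rates(0.6452, 0.7577, incidence)
```

Under that assumption the implied accuracy is close to 0.750, in line with the reported 0.7503. The calculation also shows that, with an incidence of roughly 6.5%, accuracy is dominated by specificity, which is why sensitivity and AUC are the more informative figures for a screening model.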

Feature Importance
In this study, XGBoost performed the best out of the four models. Figure 4 presented the contributions of the 21 features on the XGBoost model output ranked by the average absolute SHAP value. FPG, education, exercise, gender, and WC were the top five important features. The SHAP values of FPG, WC, ALT, marital status, SBP, TG, hypertension, TBIL, age, smoking, Scr, and LDL-C were greater than 0, which suggested that these features were significant risk factors for incident T2DM.

Discussion
In this retrospective study, we applied four machine learning algorithms to build prediction models for the risk of incident T2DM among Chinese elderly. We found that the XGBoost model with 21 features demonstrated the best performance for predicting T2DM. This suggests that the prediction model derived in the present study could be applied to identify individuals at high risk of T2DM, which could benefit the prevention and control of diabetes.
To date, research on diabetes prediction models has tended to focus on white populations [49][50][51][52], and Asian populations, especially the elderly, have received relatively little attention. This study used a large longitudinal dataset obtained from Chinese elderly to establish prediction models for T2DM. The results confirmed that the XGBoost model performed best, with the highest AUC value of 0.7805 in predicting the probability that an individual develops T2DM, and provide a further example of the successful application of XGBoost in diabetes risk prediction. This finding was consistent with earlier studies [14,21,27,53], which identified the good predictive power of the XGBoost model, with AUC values ranging from 0.8300 to 0.9680. In contrast to this study, a previous Korean population-based cohort study demonstrated that ensemble models (e.g., a stacking classifier) performed better than single models including XGBoost [54]. A rural cohort study in Henan province of China showed good predictive efficiency for prediction models of T2DM, with AUC values ranging from 0.811 to 0.872 using laboratory data [55]. Compared with previous research, the AUC value in this study was relatively modest. A potential reason is the difference in study population and input features, which could affect predictive performance to some extent. Unlike our study, the study populations of prior studies [14,21,27,53] were middle-aged adults, and fewer predictors were used in the prediction of diabetes. To our knowledge, this was the first study targeting the elderly population (≥65 years) in China to build predictive models for diabetes using machine learning techniques, which has important implications for designing diabetes prevention focused on the elderly.
With the development of artificial intelligence, machine learning techniques have been widely applied in the medical field, especially in prediction models for diabetes [49,51,53,56–58]. It is worth noting that the advantages of machine learning models over traditional statistical methods are well documented empirically, but their disadvantage is the lack of model interpretability [13]. XGBoost is often considered a black box model, because it tends to predict more accurately than linear models while losing model interpretability [39]. Thus, we applied the Shapley Additive Explanations (SHAP) method developed by Lundberg and Lee [47] to better explain the contribution of each feature to the model. This is crucial for helping healthcare workers overcome the interpretability barrier to applying predictive models in clinical practice.
Notably, the results of the feature importance analysis indicated the contribution of each feature to the model. Features such as FPG, education, exercise, gender, and WC made substantial contributions to the prediction model. This was in accordance with the results observed in prior similar research [14,53,59]. Early identification of key risk factors has important implications for the risk assessment and prevention of diabetes. Our model results identified FPG as the most significant predictor of T2DM. Individuals with higher blood glucose had a greater likelihood of developing diabetes. One explanation is that hyperglycemia is correlated with insulin resistance [60]. Blood glucose is the main traditional predictor of diabetes and is also widely used for the diagnosis of diabetes [61]. This indicates that blood glucose control plays a key role in the prevention of T2DM, especially for the elderly.
In the present study, education and exercise showed negative associations with the risk of incident T2DM. Several studies have suggested that diabetes is associated with a low level of education [62][63][64][65][66]. A cohort study among American adults confirmed that educational level was linked to the onset of diabetes [66]. Individuals with less than a high school educational level (hazard ratio [HR] 1.58; 95% CI, 1.26-1.97) were more likely to develop diabetes. It is possible that people with higher education have better health literacy and therefore pay more attention to health management to prevent diabetes [65]. Prior studies have also noted the key role of exercise [67,68] and found that exercise intervention could decrease the risk of developing diabetes by 46% [68]. The China Da Qing Diabetes Prevention Study identified the long-term effects of exercise interventions in reducing the incidence of T2DM [67]. It showed that exercise intervention groups had a 49% decreased incidence of T2DM (hazard rate ratio [HRR], 0.51; 95% CI, 0.31-0.83) over the past two decades. There is a need to implement diabetes prevention programs that emphasize the importance of regular exercise and focus particularly on populations with lower education. Another interesting finding of our study was that men were more likely to develop T2DM than women, which agreed with results from earlier studies [69,70]. A previous meta-analysis also demonstrated that gender was a dependent risk factor of T2DM in mainland China [71]. It found that female gender (odds ratio [OR], 0.87; 95% CI, 0.78-0.97) was significantly negatively associated with the risk of T2DM. This could be explained by the fact that most risk factors (e.g., smoking, alcohol consumption, and physical inactivity) were more prevalent in men than in women [72]. Therefore, more attention should be paid to men.
As a measure of central/abdominal obesity, WC also proved to be a strong predictor of T2DM. The significance of WC has been illustrated in other studies [17,73]. A 13-year prospective cohort study reported that a higher WC was linked to an increased risk of diabetes, with age-adjusted relative risks (RRs) across quintiles of WC of 1.0, 2.0, 2.7, 5.0, and 12.0, respectively [74]. Our findings further support the view that routine measurement of waist circumference would help clinicians make preventive recommendations for individuals at high risk of diabetes.
Diabetes has become a major human health challenge and a global health burden because of its high morbidity and mortality rates [75,76]. The XGBoost prediction model established in this study showed promising performance. It had important public health implications, which could help clinicians screen out populations with a high risk of diabetes.
The key features identified in this study not only captured each person's socio-demographic variables, but also medical history, anthropometric and clinical laboratory variables, which could be effective for formulating and implementing targeted diabetes prevention strategies to reduce the disease burden.
Despite the above encouraging findings, the current study has several limitations. First, only the participants who attended both the baseline survey and the 2-year follow-up were included in this study, which might introduce selection bias and limit the generalizability of the results. Second, some important risk factors for T2DM, such as HbA1c and insulin, were not accounted for in the prediction models due to a lack of relevant data. Third, some diabetes cases may have been misclassified as non-T2DM because the oral glucose tolerance test (OGTT) was not included in the diagnosis of T2DM. However, the high cost and large sample size made it infeasible to perform oral glucose tolerance tests for all participants. Fourth, we only performed internal validation, and these prediction models need to be further validated in an external validation set in future work. Moreover, further work is warranted to consider autoencoders for extracting T2DM features automatically, which may improve the classification efficiency for T2DM to some extent.

Conclusions
The current study developed four predictive models based on ML algorithms for the risk of incident T2DM in Chinese elderly. Our findings demonstrated that the XGBoost model achieved the best predictive performance for T2DM. Additionally, FPG, education, exercise, gender, and WC were the strongest predictors in the prediction model, which would benefit clinical practice in developing targeted diabetes prevention and control interventions.

Institutional Review Board Statement:
The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of Wuhan Center for Disease Control and Prevention (protocol code WHCDCIRB-K-2018023).
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study. Written informed consent has been obtained from the patient(s) to publish this paper.

Data Availability Statement:
The data presented in this study are available from the corresponding author upon reasonable request.