Machine Learning Prediction of Prediabetes in a Young Male Chinese Cohort with 5.8-Year Follow-Up

The identification of risk factors for future prediabetes in young men remains largely unexamined. This study enrolled 6247 young ethnic Chinese men with normal fasting plasma glucose at baseline (FPGbase), and used machine learning (Mach-L) methods to predict prediabetes after 5.8 years. The study had two aims: 1. Evaluate whether Mach-L outperformed traditional multiple linear regression (MLR). 2. Identify the most important risk factors. The baseline data included demographic, biochemistry, and lifestyle information. Two models were built, where Model 1 included all variables and Model 2 excluded FPGbase, since it had the most profound effect on prediction. Random forest, stochastic gradient boosting, eXtreme gradient boosting, and elastic net were used, and the model performance was compared using different error metrics. All the Mach-L errors were smaller than those for MLR, indicating that Mach-L provided more accurate predictions. In descending order of importance, the key factors for Model 1 were FPGbase, body fat (BF), creatinine (Cr), thyroid stimulating hormone (TSH), white blood cell count (WBC), and age, while those for Model 2 were BF, WBC, age, TSH, triglyceride (TG), and low-density lipoprotein cholesterol (LDL-C). We concluded that FPGbase was the most important factor for predicting future prediabetes. After removing FPGbase, BF, WBC, age, TSH, TG, and LDL-C were the key factors after 5.8 years.


Introduction
Globally, type 2 diabetes (T2D) is the most common type of diabetes, and its prevalence has increased drastically in recent years. In 2022, according to the American Diabetes Association, over 11% of Americans were diabetic, with type 2 accounting for 95% of all cases [1]. The prevalence and ratios of type 1 and type 2 diabetes in Taiwan are similar.
Diagnostics 2024, 14, 979
According to Taiwan Biobank, in 2020, Taiwan had 2.18 million diabetic patients (11.1% of the population). Again, type 1 diabetes only accounted for 0.51% of these patients [2]. From 2001 to 2017, the number of T2D cases among subjects younger than 20 years old nearly doubled [3], while the number of cases in subjects under the age of 35 increased 2.8-fold [4]. These reports indicate that the age of initial diabetes onset has been decreasing. Since the severity of diabetes complications is related to the time of onset, patients who develop diabetes at a younger age will suffer more extensive and severe complications [5]. This raises an urgent need for early diagnosis and management among younger people susceptible to T2D.
Many risk factors have been identified for susceptibility to diabetes, including being overweight, smoking, alcohol consumption, income, low physical activity, marital status, and educational level [6]. Most previous studies of diabetes susceptibility relied on traditional statistical methods such as multiple linear regression (MLR). In recent years, machine learning (Mach-L) techniques have been widely applied in many fields, including medicine [7,8]. As defined by Mitchell, Mach-L uses computer algorithms that improve automatically through experience [9]. Mach-L can capture nonlinear relationships in the data and complex interactions among multiple predictors, allowing it to potentially outperform conventional multiple logistic regression in disease prediction [10]. Several large-cohort studies have focused on the prediction of prediabetes, but have failed to account for factors including lifestyle, income, education level, and marital status. The present study enrolled subjects under the age of 36, with a follow-up of 5.8 years. Four different Mach-L methods were applied to achieve the following: 1. Compare Mach-L and MLR performance in predicting future prediabetes. 2. Identify and rank the six most important risk factors for prediabetes.

Subject Selection
The data for this study were sourced from the Taiwan MJ Cohort, an ongoing prospective cohort of health examinations conducted by the MJ Health Screening Centers in Taiwan [11]. These examinations cover more than 100 important biological indicators, including anthropometric measurements, blood tests, and imaging tests. Each participant completed a self-administered questionnaire, covering personal and family medical history, current health status, lifestyle, physical exercise, sleep habits, and dietary habits [12]. All participants provided informed consent. All or part of the data used in this research were authorized by and received from the MJ Health Research Foundation (Authorization Code: MJHRF2023007A). Any interpretations or conclusions described in this paper do not represent the views of the MJ Health Research Foundation [13]. The study protocol was approved by the Institutional Review Board of the Kaohsiung Armed Forces General Hospital (IRB No.: KAFGHIRB 112-006). An initial sample of 23,462 subjects under the age of 36 was selected based on the standards of care published by the American Diabetes Association [14], which notes that most T2D diagnoses occur after this age. Excluding subjects who did not fit our inclusion criteria left a total sample of 6247 male subjects for further analysis (Figure 1).
The exclusion criteria included the following: taking any medications known to affect blood pressure, blood glucose, or blood lipids; and an abnormal plasma glucose level at the time of the study.
The following methods were published in our previous study [15]. On the day of the study, senior nursing staff recorded the subject's medical history, including current medications, and a physical examination was performed. Body fat percentage (BF) was measured using bioelectrical impedance analysis. WBC, hemoglobin levels, and the platelet count (Plt) were measured using standard laboratory techniques, typically performed on automated hematology analyzers. Creatinine (Cr), uric acid (UA), and C-reactive protein (CRP) levels were measured through blood tests using a biomedical analyzer to assess the concentration of these substances in the blood [16].
Following previously published protocols, demographic and biochemical data were collected as follows. After fasting for 10 h, blood samples were collected for biochemical analysis. Plasma was separated from blood within 1 h of collection and stored at −30 °C until the analysis of the fasting plasma glucose (FPG) and lipid profiles. The FPG was measured using the glucose oxidase method (YSI 203 glucose analyzer; Yellow Springs Instruments, Yellow Springs, OH, USA). The total cholesterol and triglyceride (TG) levels were measured using the dry multilayer analytical slide method with a Fuji Dri-Chem 3000 analyzer (Fuji Photo Film, Tokyo, Japan). The serum high-density lipoprotein cholesterol (HDL-C) and low-density lipoprotein cholesterol (LDL-C) concentrations were analyzed using an enzymatic cholesterol assay, following dextran sulfate precipitation. A Beckman Coulter AU 5800 biochemical analyzer (Indianapolis, IN, USA) was used to determine the urine albumin-to-creatinine ratio (ACR) via turbidimetry.
Table 1 shows the 25 baseline variables, including the participants' age, body fat, complete blood cell count, biochemistries, thyroid stimulating hormone, C-reactive protein, education level, marital status, and income level. Alcohol consumption was defined as the product of the total consumption duration, frequency, and alcohol percentage. Similarly, smoking was the product of the smoking duration, frequency, and number of cigarettes. The sport area was the product of the exercise duration, frequency, and type. All of these parameters were used as independent variables, while the dependent variable was the fasting plasma glucose at the end of follow-up (FPGend), after an average of 5.8 years.
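The three lifestyle scores above are plain products of three questionnaire items. The following is a minimal Python sketch of that calculation; the field names, units, and numeric codings are hypothetical, since the questionnaire's actual coding is not given here.

```python
from dataclasses import dataclass


@dataclass
class Habit:
    """Hypothetical numeric coding of one lifestyle habit."""
    duration: float   # assumed unit: years of the habit
    frequency: float  # assumed unit: occasions per week
    intensity: float  # alcohol percentage, cigarettes smoked, or exercise type code


def composite_score(habit: Habit) -> float:
    """Product of duration, frequency, and intensity, as described in the text."""
    return habit.duration * habit.frequency * habit.intensity


# Made-up values for illustration only.
drinking = Habit(duration=5.0, frequency=2.0, intensity=0.05)
abstainer = Habit(duration=0.0, frequency=0.0, intensity=0.0)
print(composite_score(drinking), composite_score(abstainer))
```

One consequence of a multiplicative score is that any zero component (for example, a participant who never drinks) gives a total score of zero.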

Traditional Statistics
Two models were built in the present study. Model 1 included all 25 variables. In our preliminary evaluation of Model 1, FPGbase had an importance of 100%, compared with 28.3% for the second most important factor (BF). To further evaluate the interactions among the remaining factors that this dominance might hide, Model 2 was built without FPGbase.
Data are presented as means ± standard deviations. The Student's t test was used to evaluate the differences in the continuous data between married and unmarried participants. Education and income levels were used as ordinal variables for analysis of variance (ANOVA). Pearson's correlation was used to analyze the relationships between all the continuous risk factors and FPGend (Table 2). All statistical tests were two sided, and p < 0.05 was considered statistically significant. Statistical analysis was performed using SPSS 10.0 for Windows (SPSS, Chicago, IL, USA).
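For readers who want to see what the two main statistics compute, the following is a minimal Python sketch of Pearson's correlation coefficient and the pooled-variance Student's t statistic. It is an illustration only: the study used SPSS, the function names are ours, the numbers are made up, and p-values (which require the t distribution) are not computed.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def student_t(group_a, group_b):
    """Two-sample t statistic with pooled variance, plus degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    mean_a, mean_b = sum(group_a) / na, sum(group_b) / nb
    ss_a = sum((v - mean_a) ** 2 for v in group_a)
    ss_b = sum((v - mean_b) ** 2 for v in group_b)
    df = na + nb - 2
    pooled_var = (ss_a + ss_b) / df
    t = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


# Made-up values: body fat (%) against follow-up glucose (mg/dL).
body_fat = [18.0, 20.0, 22.0, 24.0, 26.0]
fpg_end = [88.0, 90.0, 91.0, 94.0, 97.0]
print(round(pearson_r(body_fat, fpg_end), 3))

# Made-up follow-up glucose values for two marital-status groups.
married = [90.0, 92.0, 95.0]
unmarried = [94.0, 97.0, 99.0]
print(student_t(married, unmarried))
```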

Proposed Machine Learning Scheme
Building on our group's previous work, models were constructed using four different Mach-L methods to predict prediabetes and to rank risk factors [15].
Random forest (RF) is an ensemble learning decision tree algorithm that combines bootstrap resampling and bagging [17]. RF randomly generates many different and unpruned CART decision trees, using the decrease in Gini impurity as the splitting criterion. The trees in the forest are then averaged or voted on to generate output probabilities and a final, robust model [18].
As described in our group's previous work [15,19], stochastic gradient boosting (SGB) is a tree-based gradient boosting learning algorithm that combines bagging and boosting techniques to minimize the loss function and solve the overfitting problem of traditional decision trees [20]. In SGB, many stochastic weak learners of trees are sequentially generated through multiple iterations, in which each tree concentrates on correcting or explaining the errors of the tree generated in the previous iteration. That is, the residual of the previous tree iteration is used as the input for the newly generated tree. This iterative process is repeated until a convergence condition is met or a stopping criterion, such as the maximum number of iterations, is reached. Finally, the cumulative results of many trees are used to produce a robust model.
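The residual-fitting loop described for SGB can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the implementation used in the study: it handles a single predictor, uses one-split trees (stumps) as weak learners, and the learning rate, subsample fraction, and data are invented for the example.

```python
import random


def fit_stump(x, residual):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residual) if xi <= threshold]
        right = [r for xi, r in zip(x, residual) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees=50, learning_rate=0.1, subsample=0.75, seed=0):
    """Sequentially fit stumps to the current residuals on random subsamples."""
    rng = random.Random(seed)
    base = sum(y) / len(y)              # initial prediction: the mean
    pred = [base] * len(y)
    stumps = []
    k = max(2, int(subsample * len(y)))
    for _ in range(n_trees):            # stopping rule: maximum iterations
        idx = rng.sample(range(len(y)), k)          # the "stochastic" part
        residual = [y[i] - pred[i] for i in idx]    # errors of the model so far
        stump = fit_stump([x[i] for i in idx], residual)
        stumps.append(stump)
        pred = [p + learning_rate * stump(xi) for p, xi in zip(pred, x)]

    def predict(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    return predict


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up predictor and outcome values.
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_demo = [85.0, 86.0, 88.0, 90.0, 93.0, 95.0, 98.0, 101.0]
model = boost(x_demo, y_demo)
baseline = [sum(y_demo) / len(y_demo)] * len(y_demo)
print(mse(y_demo, baseline), mse(y_demo, [model(v) for v in x_demo]))
```

Each new stump sees only what the previous stumps failed to explain, which is why the training error shrinks as trees accumulate; the small learning rate and the subsampling are the usual guards against overfitting.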
The third method used in this study is eXtreme gradient boosting (XGBoost), a gradient boosting technique based on an optimized extension of SGB [21]. XGBoost sequentially trains multiple weak models and combines their outputs through gradient boosting to improve prediction performance. XGBoost uses a second-order Taylor expansion to approximate the objective function for arbitrary differentiable loss functions, which accelerates the convergence of model construction [22]. In addition, XGBoost applies regularized boosting techniques to penalize the complexity of the model and correct overfitting, thus increasing model accuracy [21].
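For reference, the Taylor approximation and complexity penalty mentioned above can be written as follows. The notation is that commonly used in the XGBoost literature and is an assumption here, since the text itself introduces no symbols: $l$ is a differentiable loss, $\hat{y}_i^{(t-1)}$ is the prediction after $t-1$ trees, $f_t$ is the tree added at step $t$, $T$ is its number of leaves, and $w$ is its vector of leaf weights.

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}
  \left[ l\!\left(y_i, \hat{y}_i^{(t-1)}\right)
       + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right]
  + \Omega(f_t),
\qquad
g_i = \frac{\partial\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^{2} l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \big(\hat{y}_i^{(t-1)}\big)^{2}},
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^{2}
```

Because only $g_i$ and $h_i$ depend on the loss, the same tree-building procedure works for any twice-differentiable loss, and the $\Omega$ term is what penalizes model complexity.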
Finally, elastic net (EN) is a hybrid of L1 and L2 regularization, integrating the penalty terms of both. EN combines the Ridge penalty, to achieve effective regularization, with the Lasso penalty, to select variables. Like Lasso, it can therefore learn a sparse model in which only a small number of coefficients are non-zero, while retaining some of Ridge's regularization properties. This provides certain advantages, as follows: 1. EN encourages group effects in the case of highly correlated variables, rather than setting some of them to 0, like Lasso. 2. EN is useful when multiple features are correlated with one another. 3. Where Lasso tends to choose one of two correlated features at random, EN tends to choose both [23].
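The combined penalty can be made concrete with one common parameterization of the elastic net objective (the form used by the glmnet package; whether the study used exactly this form is an assumption). Here $\lambda \ge 0$ sets the overall penalty strength and $\alpha \in [0, 1]$ sets the mix between the two penalties.

```latex
\hat{\beta} = \arg\min_{\beta_0,\, \beta}\;
  \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top} \beta \right)^{2}
  + \lambda \left[ \alpha \lVert \beta \rVert_{1}
  + \frac{1 - \alpha}{2} \lVert \beta \rVert_{2}^{2} \right]
```

Setting $\alpha = 1$ recovers Lasso and $\alpha = 0$ recovers Ridge; intermediate values keep Lasso's ability to set coefficients exactly to zero while the squared term spreads weight across correlated predictors.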
Figure 2 presents the proposed prediction and important variable identification scheme that combines the four Mach-L methods. First, patient data were collected to prepare the dataset, which was then randomly divided into a training dataset (80%) for model building and a testing dataset (20%) for model testing. In the training process, the hyperparameters of each Mach-L method must be tuned to construct an effective model. In this study, a 10-fold cross-validation technique was used for hyperparameter tuning.
The training dataset was further randomly divided into a training dataset to build the model with a different set of hyperparameters, and a validation dataset for model validation. All possible combinations of hyperparameters were investigated via grid search. The model with the lowest root mean square error on the validation dataset was taken as the best model for each Mach-L method. The best models for RF, SGB, XGBoost, and EN were generated to obtain the corresponding variable importance ranking information.
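The tuning procedure above (k-fold cross-validation inside a grid search, keeping the setting with the lowest validation RMSE) can be sketched in Python as follows. This is a simplified illustration, not the study's R code: the model is a one-predictor ridge regression with a single hyperparameter `lam`, whereas the study searched combinations of several hyperparameters per method, and the data are simulated.

```python
import math
import random


def fit_ridge_1d(x, y, lam):
    """One-predictor ridge regression; lam shrinks the slope toward zero."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / (sxx + lam)
    return lambda v: mean_y + slope * (v - mean_x)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices and deal them into k validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def grid_search(x, y, grid, k=10):
    """Return the grid value with the lowest mean validation RMSE, plus all scores."""
    folds = k_fold_indices(len(x), k)
    scores = {}
    for lam in grid:
        fold_rmse = []
        for fold in folds:
            held_out = set(fold)
            train = [i for i in range(len(x)) if i not in held_out]
            model = fit_ridge_1d([x[i] for i in train], [y[i] for i in train], lam)
            fold_rmse.append(rmse([y[i] for i in fold], [model(x[i]) for i in fold]))
        scores[lam] = sum(fold_rmse) / k
    best = min(scores, key=scores.get)
    return best, scores


# Simulated data: a linear trend plus Gaussian noise.
rng = random.Random(1)
x_sim = [float(i) for i in range(40)]
y_sim = [80.0 + 0.5 * xi + rng.gauss(0.0, 2.0) for xi in x_sim]
best_lam, cv_scores = grid_search(x_sim, y_sim, grid=[0.0, 1.0, 10.0, 100.0])
print(best_lam, cv_scores)
```

The held-out testing dataset plays no part in this loop; it is used only once, after the best setting has been chosen.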
During the testing phase, the performance of the best machine learning models was evaluated using the testing dataset. Since the target variable in this study is a numerical variable, the model performance was compared using different metrics, including symmetric mean absolute percentage error (SMAPE), relative absolute error (RAE), root relative squared error (RRSE), and root mean squared error (RMSE). The definitions of these metrics are listed in Table 3.
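The four error metrics can be computed as in the following Python sketch. The formulas are the standard textbook definitions and are assumed to match those in Table 3; in particular, SMAPE is returned here as a percentage, and conventions for its scaling vary. The example values are made up.

```python
import math


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent."""
    terms = [abs(p - t) / ((abs(t) + abs(p)) / 2) for t, p in zip(y_true, y_pred)]
    return 100 * sum(terms) / len(terms)


def rae(y_true, y_pred):
    """Relative absolute error: absolute error relative to predicting the mean."""
    mean_t = sum(y_true) / len(y_true)
    num = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    den = sum(abs(t - mean_t) for t in y_true)
    return num / den


def rrse(y_true, y_pred):
    """Root relative squared error: squared error relative to predicting the mean."""
    mean_t = sum(y_true) / len(y_true)
    num = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    den = sum((t - mean_t) ** 2 for t in y_true)
    return math.sqrt(num / den)


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the outcome."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up observed and predicted glucose values (mg/dL).
observed = [90.0, 100.0, 110.0]
predicted = [92.0, 99.0, 107.0]
for metric in (smape, rae, rrse, rmse):
    print(metric.__name__, round(metric(observed, predicted), 4))
```

RAE and RRSE equal 1 for a model that always predicts the observed mean, so values below 1 indicate an improvement over that naive baseline; all four metrics are lower for better models.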

SMAPE: symmetric mean absolute percentage error. RMSE: root mean squared error, $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$, where $y_i$ and $\hat{y}_i$ are the observed and predicted values and $n$ is the number of observations.
To ensure a more reliable and stable comparison, the training and testing processes were each repeated 10 times. The performance metrics of the four machine learning models were then averaged for comparison against the performance of the benchmark MLR model using the same training and testing datasets. A model with an average metric lower than that of the MLR model was considered to be a more convincing model.
Because all of the machine learning methods used can rank the importance of each predictor variable, we defined the variable ranked 1 in each model as the most critical risk factor and the variable ranked 25 as the least important risk factor. The machine learning methods used in this study may produce different variable importance rankings due to their unique modeling characteristics. To maximize the stability and reliability of our findings, we integrated the variable importance rankings of the more convincing machine learning models. In the final stage of our proposed scheme, we summarize and discuss our significant findings based on these machine learning methods.
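Integrating the rankings can be illustrated with a short Python sketch that averages each variable's importance percentage across methods and then ranks the averages, which is how the rightmost column of the importance tables is described in the Results. The function name and all numbers below are invented for illustration and are not the study's values.

```python
def average_importance(importance_by_method):
    """Average each variable's importance across methods, then rank descending."""
    methods = list(importance_by_method.values())
    variables = list(methods[0].keys())
    averaged = {
        var: sum(method[var] for method in methods) / len(methods)
        for var in variables
    }
    ranking = sorted(averaged, key=averaged.get, reverse=True)
    return averaged, ranking


# Made-up importance percentages for three variables and four methods.
toy_importance = {
    "RF":      {"FPGbase": 100.0, "BF": 30.0, "Age": 10.0},
    "SGB":     {"FPGbase": 100.0, "BF": 25.0, "Age": 15.0},
    "XGBoost": {"FPGbase": 100.0, "BF": 35.0, "Age": 5.0},
    "EN":      {"FPGbase": 100.0, "BF": 20.0, "Age": 30.0},
}
averaged, ranking = average_importance(toy_importance)
print(averaged)
print(ranking)
```

Averaging across methods dampens the quirks of any single algorithm: in the toy data, EN rates age higher than BF, but the integrated ranking still places BF second.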
All methods were performed using R software version 4.0.5 and RStudio version 1.1.453, with the required packages installed [24,25].

Results
A total of 2789 study participants developed prediabetes, with age, BF, WBC, FPGbase, γ-GT, LDH, UA, TG, and LDL-C as the most important factors for the total 5.8-year follow-up period, while HDL-C, TSH, and sport area also displayed significance in the earlier follow-up stages. Unmarried subjects were found to be more susceptible to developing prediabetes, while the educational level was found to have no significant impact. Subjects without income were also more susceptible (Table 1). Table 4 compares the performance of MLR and the four Mach-L methods. For both models, the four Mach-L methods produced lower values for SMAPE, RAE, RRSE, and RMSE, indicating that they outperformed MLR. Table 5 shows the importance percentages from the four Mach-L methods. The rightmost column averages the four methods, indicating that the most important factors for predicting FPGend in Model 1 were FPGbase, BF, Cr, TSH, WBC, and age. As previously noted, the importance percentage for FPGbase was 100%, which is substantially higher than that of the second most important factor, BF (28.32%). Table 6 shows the results for Model 2, which excluded FPGbase. Apart from FPGbase and Cr, the leading factors resemble those of Model 1: BF, WBC, age, TSH, TG, and LDL-C. Finally, Figures 3 and 4 illustrate the results in Tables 5 and 6, respectively, allowing for closer observation of the risk factor rankings.

Discussion
The present study followed 6247 young ethnically Chinese men for an average of 5.8 years. The subject data included lifestyle information, allowing for a more comprehensive view of the predictors of glucose change. Using four different Mach-L methods in Model 1, we found that FPGbase, BF, Cr, TSH, WBC, and age were the six most important factors for FPGend. Given the disproportionate importance of FPGbase relative to the second most important factor (100% versus 28.3% for BF), Model 2 was built excluding FPGbase, and the same methods were repeated, finding only minor differences in terms of the key variables.
Consistent with other studies, FPGbase was found to be the leading determinant of an increased FPGend. In 2021, We et al. found that FPG was the most important predictor for prediabetes in a 3.35-year follow-up period among 551 Chinese subjects, aged 40 to 70 years [26]. However, that study used multiple logistic regression and provided a hazard ratio (HR: 2.284; 95% confidence interval: 1.556, 3.352; p < 0.001). For a continuous outcome, logistic regression is less informative than MLR because it does not quantify how much the dependent variable changes with each independent variable. A review article published by Abdul-Ghani et al. also supported the role of FPG. They reported the development of a variety of multivariate models, all of which were useful for predicting future T2D. The underlying pathophysiology may be that FPG is related to the decline of β-cell function with increasing age [27]. Our results further confirm that even a mild elevation of FPG might lead to the further dysregulation of glucose metabolism.
In both Models 1 and 2, BF was among the two most important risk factors: second in Model 1 and first in Model 2. While the present study accounts for BMI, BF is more accurate and was thus used to build the models [28]. As noted in the Methods section, the impact of BF was much less significant than that of FPG. To demonstrate the effects of BF on glucose metabolism, Jo et al. [29] classified 6335 participants from the National Health and Nutrition Examination Survey into four groups as follows: (1) normal weight with normal %BF, (2) normal weight with high %BF, (3) overweight with normal %BF, and (4) overweight with high %BF. The most important finding was that the prevalence of abnormal glucose in the normal weight group with a high %BF (13.5%) was significantly higher than that of the overweight group with a low %BF (10.5%, p < 0.001). This finding is compatible with our result and further supports the importance of BF in glucose metabolism. BF is positively related to plasma levels of free fatty acid [30], which has a significantly negative impact on glucose metabolism via an increased hepatic glucose output and decreased skeletal muscle glucose disposal, thus producing inflammatory proteins and increasing insulin resistance [31][32][33]. These effects could explain the present findings.
WBC was the fifth and second most important factor in Models 1 and 2, respectively. Many studies have shown that this relationship exists [34][35][36][37]. For example, Jiang et al. showed that WBC was positively correlated with glycated hemoglobin and 2 h postprandial glucose in 9697 Chinese subjects [38]. It is well known that WBC is closely related to oxidative stress, and it could even be used in the clinical care of type 2 diabetes [39,40]. This relationship is easily understood, since a high WBC, which is a marker of inflammation, is related to high TG, low LDL-C, and hypertension [41,42]. All these derangements are hallmarks of insulin resistance [43].
The impact of aging on glucose metabolism has been studied extensively [44]. In the present study, age was the sixth and third most important factor in Models 1 and 2, respectively. Chia et al. found that the incidence of several important impairments related to glucose metabolism increases with age, including confounding impacts on insulin secretion [45,46], pulsatile insulin secretion [47], reduced β-cell response to incretin [48], and even insulin resistance [49]. The results of the present study are consistent with these findings.
TSH was the fourth most important risk factor for predicting FPGend in the present study. While this relationship is less widely known, many studies have shown that both hyper- and hypothyroidism are related to T2D [50][51][52][53]. Thyroid hormone levels affect glucose metabolism through the following mechanisms: increased glucose absorption, increased gluconeogenesis and glycogenolysis, and increased free fatty acids through the promotion of lipolysis [54]. All these impacts could explain our present findings.
Finally, in Model 2, higher TG and LDL-C levels were positively correlated with FPGend. Insulin resistance is one of the main causes of T2D [55], while major changes to the lipid profile include increased TG and LDL-C [56]. Therefore, our results are consistent with previous findings.
It is interesting to note that the plasma Cr level was selected in Model 1, but not in Model 2. This could be explained by the interplay between the plasma Cr level and FPGbase. Yoshida et al. reported that a lower Cr level is associated with a higher chance of prediabetes [57]. When the baseline FPG was removed, the position of Cr moved from 3rd to 18th in the present study. This suggests that the predictive contribution of Cr depends on its interplay with FPGbase.
Other less obvious but important findings should also be pointed out. In our study, the follow-up interval, income, education level, sleep hours, drinking status, and the presence of a spouse were all unimportant factors for determining FPGend.
The present study is subject to certain limitations. First, none of the subjects were smokers, thus the impact of tobacco consumption cannot be determined. Second, the MJ Health Screening cohort generally excludes those with lower socio-economic statuses who cannot afford the company's services, thus the sample may be subject to selection bias. Finally, our study was limited to ethnic Chinese subjects, and caution should be taken in extrapolating the findings to other ethnic groups.

Conclusions
Mach-L was found to outperform traditional MLR in terms of capturing non-linear relationships. FPGbase, BF, WBC, age, TSH, TG, and LDL-C were the most important determinants of FPGend after 5.8 years in a group of Chinese men aged 18 to 35 years.

Figure 2. Proposed machine learning prediction scheme.

Table 2. The results of correlation between risk factors and fasting plasma glucose at the end of the follow-up.

Table 3. Four performance metrics used to compare stochastic gradient boosting, random forest, eXtreme gradient boosting, and elastic net.

Table 4. The average performance of linear regression and the four machine learning methods.

Table 5. Importance percentages of risk factors predicting future fasting plasma glucose using four different machine learning methods in Model 1.

Table 6. Importance percentages of risk factors predicting future fasting plasma glucose using four different machine learning methods in Model 2.