A Hybrid Risk Factor Evaluation Scheme for Metabolic Syndrome and Stage 3 Chronic Kidney Disease Based on Multiple Machine Learning Techniques

With the rapid development of medicine and technology, machine learning (ML) techniques are extensively applied in medical informatics and the suboptimal health field to identify critical predictor variables and risk factors. Metabolic syndrome (MetS) and chronic kidney disease (CKD) are important risk factors for many comorbidities and complications. Existing studies that use statistical or ML algorithms for CKD data analysis mostly analyze early-stage subjects directly, and few have discussed predictive models and important risk factors for the high-risk stage 3 CKD health screening population. The middle stages of CKD, 3a and 3b, indicate moderate renal failure. This study aims to construct an effective hybrid important risk factor evaluation scheme for subjects with MetS and stage 3 CKD based on ML predictive models. Six well-known ML techniques, namely random forest (RF), logistic regression (LGR), multivariate adaptive regression splines (MARS), extreme gradient boosting (XGBoost), gradient boosting with categorical features support (CatBoost), and a light gradient boosting machine (LightGBM), were used in the proposed scheme. The data were sourced from the Taiwan health examination indicators and questionnaire responses of 71,108 members between 2005 and 2017. In total, 375 stage 3a CKD and 50 stage 3b CKD patients were enrolled, and 33 different variables were used to evaluate potential risk factors. Based on the results, the top five important variables, namely BUN, SBP, Right Intraocular Pressure (R-IOP), RBCs, and T-Cho/HDL-C (C/H), were identified as significant variables for evaluating subjects with MetS and CKD stage 3a or 3b.


Introduction
Suboptimal health status is a dynamic, intermediate bodily condition between health and disease. Various indicators of suboptimal health must be considered in the prevention of chronic diseases to achieve better health protection. Metabolic syndrome (MetS) is a collection of suboptimal health risk indicators. According to the definition provided by the Health Promotion Administration (HPA) and Ministry of Health and Welfare (MOHW) [...] MetS and CKD are important risk factors for many comorbidities and complications [9,10]. Many studies have reported a positive correlation between MetS and CKD [11,12]. Furthermore, a MetS diagnosis is an effective predictor of CKD [12]. Studies on CKD prediction have identified four major categories of risk factors: demographic variables (e.g., age, education level), anthropometric parameters (e.g., body mass index, body fat), blood examination indicators (e.g., blood urea nitrogen, uric acid), and lifestyle habits (e.g., smoking status, alcohol consumption) [13][14][15][16]. Thus, these factors are often used to construct CKD analytical models through machine learning (ML)-based data analysis methods [16][17][18][19][20].
ML techniques are extensively applied in studies on medical informatics and suboptimal health status [3,16,21,22,23,24]. They are often used to identify critical predictor variables or risk factors because they can effectively capture the complex relationships between risk factors and outcomes and have shown promising predictive performance on large volumes of medical data [16,22,23,25].
Although individual ML techniques can identify important predictor variables, relying on a single technique to select predictor variables and risk factors may yield a locally optimal result that reflects only one ranking of the variables. A variable ensemble is therefore often used to integrate the variables selected by different techniques [26]. Relevant studies have also demonstrated that, compared to a single variable selection technique, variable ensembles improve the robustness of the selected variables and reduce the bias and variance of the results [27][28][29][30].
Existing studies that use ML methods for CKD data analysis mostly analyze CKD patients directly [16,17,20,31,32,33,34], and few have discussed predictive models and important risk factors for CKD patients with MetS. Several studies have constructed predictive models for MetS patients and examined their risk factors. Although patients in stages 3a and 3b of CKD differ in disease progression and mortality risk [35,36], they share highly similar clinical presentations. Thus, this study examines ML predictive models and important risk factors for patients with MetS and CKD stage 3a or 3b.
This study aims to construct an effective hybrid important risk factor evaluation scheme for CKD stage 3a and 3b patients with MetS, based on ML predictive models. We used six well-known and effective ML techniques to develop the predictive models: random forest (RF), logistic regression (LGR), multivariate adaptive regression splines (MARS), extreme gradient boosting (XGBoost), gradient boosting with categorical features support (CatBoost), and light gradient boosting machine (LightGBM) [16,18,19,37,38,39]. The identified important risk factors can provide valuable information for CKD prevention and health promotion.
The rest of this paper is organized as follows: Section 2 describes the materials used and the proposed scheme. Section 3 presents the experimental results. Section 4 discusses the findings of the study. Finally, the study is concluded in Section 5.

Data
This study used a large database of sub-health groups in Taiwan, the MJ Health Checkup-based Population Database (MJPD, http://www.mjhrf.org/main/page/resource/en/#resource07, accessed on 1 August 2022). Dozens of international journal papers based on this database have been published, including 2 JAMA and 6 Lancet journal papers. This study was approved by the institutional review board of Far Eastern Memorial Hospital (FEMH-IRB) (No. IRB-110027-E; approval date: 15 February 2022) and the MJ Health Research Foundation, and was registered on ClinicalTrials.gov (ID: NCT05225454). Figure 1 shows the subject identification process; all data were collected from the MJPD. The data comprised the health examination indicators and questionnaire responses of 71,108 members from 2005 to 2017. Table 2 shows the 34 health examination indicators and questionnaire variables. Among the 34 variables, CKD is the target variable and the other 33 are predictor variables. Because a member might have multiple examination records, only the latest record of each member was analyzed. In addition, subjects with missing values in any variable were excluded. After data processing, 30,255 subjects were eligible. We applied the MOHW's references and definitions of MetS and CKD to identify 423 MetS patients who were also diagnosed with CKD stage 3a or 3b. Table 3 presents the statistical analysis results of the participants' demographic data. A total of 375 patients (88.65%) were diagnosed with CKD stage 3a, while the remaining patients had CKD stage 3b.
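The two record-level filtering steps described above (keep only each member's latest examination, then exclude records with missing variables) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the field names `member_id` and `exam_date`, the function name, and all record values are hypothetical.

```python
from datetime import date

def latest_complete_records(records, required_fields):
    """Keep each member's most recent record, then drop incomplete ones."""
    latest = {}
    for rec in records:
        member = rec["member_id"]  # hypothetical identifier field
        if member not in latest or rec["exam_date"] > latest[member]["exam_date"]:
            latest[member] = rec
    # Exclude subjects whose latest record has any missing required variable.
    return [
        rec for rec in latest.values()
        if all(rec.get(field) is not None for field in required_fields)
    ]

# Toy records; identifiers, dates, and values are invented for illustration.
records = [
    {"member_id": "A", "exam_date": date(2010, 5, 1), "BUN": 14.0, "SBP": 120},
    {"member_id": "A", "exam_date": date(2015, 3, 9), "BUN": 18.0, "SBP": 131},
    {"member_id": "B", "exam_date": date(2012, 7, 2), "BUN": None, "SBP": 140},
    {"member_id": "C", "exam_date": date(2017, 1, 20), "BUN": 22.0, "SBP": 150},
]
eligible = latest_complete_records(records, ["BUN", "SBP"])
```

In this toy input, member A contributes only the 2015 record and member B is excluded because of a missing BUN value.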

Proposed Hybrid Risk Factor Evaluation Scheme
On the basis of six ML methods, namely RF, LGR, MARS, XGBoost, CatBoost, and LightGBM, this study developed a hybrid important risk factor identification scheme for subjects with MetS and CKD stage 3a or 3b. The six methods build classification models on different concepts and characteristics [40][41][42][43][44][45]. RF, XGBoost, CatBoost, and LightGBM are tree-based algorithms, whereas LGR and MARS are regression-based methods: LGR is a parametric generalized linear model, and MARS is nonparametric. Because the methods rely on different principles to build models and identify important risk factors in medical data, their variable importance results are integrated to provide more stable and robust results. Figure 2 shows the proposed hybrid risk factor evaluation scheme.
As shown in Figure 2, the first step was to sample the MetS subjects who were diagnosed with CKD stage 3a or 3b from the MJPD health examination database. Next, we defined the predictor variables and the target variable: 33 risk factors served as predictor variables and CKD as the target variable. After consolidating the data, we built the RF, LGR, MARS, XGBoost, CatBoost, and LightGBM predictive models.
RF is an ensemble learning approach based on decision trees [40]. It constructs many unpruned decision trees, aggregates them into a forest, and produces the final prediction by taking the majority vote or the average value of the trees.
LGR is one of the most commonly used ML methods; it is a generalized linear model with a canonical link function [41]. It models the class probability with a logistic function, and its parameters are fitted by maximizing the likelihood, which is equivalent to minimizing the corresponding cost function.
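For reference, the standard textbook form of the logistic regression model and its log-likelihood can be written as below. The symbols are notation introduced here for illustration and do not appear in the original: x_i is the vector of predictor variables for subject i, y_i in {0, 1} is the class label, beta_0 and beta are the coefficients, and n is the number of training subjects.

```latex
\[
p_i = P(y_i = 1 \mid \mathbf{x}_i)
    = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_i)\bigr)},
\qquad
\ell(\beta_0, \boldsymbol{\beta})
    = \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr].
\]
```

Maximizing the log-likelihood is the same as minimizing its negative, the cross-entropy cost.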
MARS is a nonparametric, nonlinear statistical method that uses several linear segments with different slopes to automatically capture the nonlinearity and dependency between multidimensional input and output variables, and then generates the final optimal nonlinear prediction model [42].
XGBoost is a decision tree-based approach that applies gradient boosting to generate multiple weak models. Each new weak model corrects the errors of the previous ones. Finally, accurate classification is achieved by aggregating all the generated weak models [43].
LightGBM is a decision tree-based distributed gradient boosting framework that uses histogram-based algorithms. In each iteration, it fits a decision tree to the negative gradients (residuals) of the current model, using gradient-based one-side sampling to approximate them efficiently [44].
CatBoost is a gradient-boosting decision tree technique that combines sequential (ordered) boosting with built-in handling of categorical features [45]. The trees generated through gradient boosting, together with the encoded categorical features, are aggregated in sequence to produce the final model.
To construct each ML model, we randomly divided the whole dataset into a training set (80%) and a testing set (20%). Ten-fold cross-validation (CV) was used for hyperparameter tuning, and the model with the best hyperparameter configuration was selected as the final model. This process was repeated ten times.
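The data-splitting protocol above can be sketched with index lists alone. This is a minimal illustration under stated assumptions: the split is a plain (unstratified) random split, the function names and the seed are hypothetical, and the study itself does not state whether stratification was used.

```python
import random

def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Randomly split the indices 0..n-1 into training and testing sets."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(train_indices, k=10):
    """Yield (fitting, validation) index lists for k-fold cross-validation."""
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        fitting = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fitting, validation

# 423 subjects, as in the study; the seed is an arbitrary illustrative choice.
train_idx, test_idx = train_test_split_indices(423, test_fraction=0.2, seed=0)
splits = list(k_fold_indices(train_idx, k=10))
```

Each candidate hyperparameter configuration would be scored on the ten validation folds, and the whole procedure could be repeated with ten different seeds to mirror the ten repetitions.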
Balanced accuracy (BA), sensitivity, specificity, and the area under the receiver operating characteristic (ROC) curve (AUC) are four well-known metrics [46][47][48] that were used to assess the performance of the six ML models. To identify convincing ML models, the widely used LGR was taken as the baseline model in this study. An ML model whose performance is greater than or equal to that of the LGR model is considered a convincing model.
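The four metrics named above follow standard definitions, sketched here in plain Python. The labels, scores, and the 0.5 decision threshold are invented for illustration, and coding the positive class as 1 is an assumption, since the text does not state which CKD stage is treated as positive.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def sensitivity_specificity(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), tn / (tn + fp)

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity."""
    sens, spec = sensitivity_specificity(y_true, y_pred)
    return (sens + spec) / 2

def auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted scores, invented for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.55]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # illustrative threshold

sens, spec = sensitivity_specificity(y_true, y_pred)
ba = balanced_accuracy(y_true, y_pred)
area = auc(y_true, scores)
```

Balanced accuracy is useful here because the two classes are imbalanced (88.65% of subjects are stage 3a), so plain accuracy would reward always predicting the majority class.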
To rank the importance of each predictor variable, we applied the "caret" R package (version 6.0-90) [49] to each of the six methods to obtain each variable's importance value. In each model, the most important predictor variable is ranked 1 and the least important is ranked 33, because 33 predictor variables were used in this study. Different ML methods may produce different importance rankings for the same predictor variable because of their specific characteristics. To obtain more stable, integrated ranking results, we hybridized the importance of each variable by averaging its rankings across the convincing ML models.
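The rank-averaging step described above can be sketched as follows. The study obtained importance values with the caret R package; this Python version only illustrates the averaging logic. The importance scores are invented, only three models and four variables are shown, and the tie handling (ties keep dictionary order) is an assumption.

```python
def ranks_from_importance(importance):
    """Map each variable to its rank (1 = most important) within one model."""
    ordered = sorted(importance, key=importance.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}

def average_rank(rank_tables):
    """Average each variable's rank across models with equal weights."""
    variables = rank_tables[0].keys()
    return {v: sum(t[v] for t in rank_tables) / len(rank_tables) for v in variables}

# Invented importance scores for illustration only; not results from the study.
importance_by_model = {
    "RF": {"BUN": 100.0, "SBP": 62.0, "R-IOP": 40.0, "RBCs": 35.0},
    "LGR": {"BUN": 80.0, "SBP": 100.0, "R-IOP": 10.0, "RBCs": 55.0},
    "XGBoost": {"BUN": 100.0, "SBP": 45.0, "R-IOP": 70.0, "RBCs": 20.0},
}
rank_tables = [ranks_from_importance(s) for s in importance_by_model.values()]
hybrid = average_rank(rank_tables)
final_order = sorted(hybrid, key=hybrid.get)  # smallest average rank first
```

Averaging ranks rather than raw importance values avoids having to put scores from different models on a common scale.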
In the last step, the research findings regarding the identified important risk factors were discussed to present the conclusions of our study.

Results
This study applied six ML techniques, namely RF, LGR, MARS, XGBoost, CatBoost, and LightGBM, to build predictive models for patients with MetS and CKD stage 3. Table 4 reports the prediction performance of the six models over ten learning cycles as the means and standard deviations (SDs) of the four performance metrics. Figure 3 shows the ROC curves of the six models. Table 4 shows that the prediction performances of the six models were similar, with every AUC at least 0.657: LGR had the highest AUC (0.670) and RF the lowest (0.657). To evaluate the six methods, DeLong's test was used, since it is an effective test for assessing whether the difference between two models' AUC values is statistically significant [55]. We used DeLong's test to compare the AUC of the model with the highest AUC (the LGR model in this study) with that of each of the remaining five ML methods. Table 5 presents the results of DeLong's test. The performance difference between the LGR model and each of the other ML methods is not significant, since all p-values are greater than 0.05. Therefore, the six models' prediction performances were alike, and all six can be viewed as convincing models. It is still worth noting that, according to Table 4, LGR was relatively the best ML model in this study because it achieved the highest mean balanced accuracy, specificity, and AUC values of 0.719, 0.761, and 0.670, respectively.
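For readers unfamiliar with the AUC comparison used above, the following is a minimal sketch of the standard DeLong procedure for two models scored on the same test subjects. It is not the authors' code: the function names, labels, and scores are invented for illustration, and the two-sided p-value uses the usual normal approximation.

```python
import math

def _psi(x, y):
    """Kernel comparing a positive score x with a negative score y."""
    return 1.0 if x > y else 0.5 if x == y else 0.0

def _components(pos, neg):
    """Structural components: one value per positive and one per negative."""
    v10 = [sum(_psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(_psi(x, y) for x in pos) / len(pos) for y in neg]
    return v10, v01

def _sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def delong_test(y_true, scores_a, scores_b):
    """Return (AUC_a, AUC_b, z, two-sided p) for two correlated ROC curves."""
    pos_idx = [i for i, t in enumerate(y_true) if t == 1]
    neg_idx = [i for i, t in enumerate(y_true) if t == 0]
    m, n = len(pos_idx), len(neg_idx)
    v10_a, v01_a = _components([scores_a[i] for i in pos_idx],
                               [scores_a[i] for i in neg_idx])
    v10_b, v01_b = _components([scores_b[i] for i in pos_idx],
                               [scores_b[i] for i in neg_idx])
    auc_a, auc_b = sum(v10_a) / m, sum(v10_b) / m
    # Variance of (AUC_a - AUC_b), computed from per-subject differences.
    d10 = [a - b for a, b in zip(v10_a, v10_b)]
    d01 = [a - b for a, b in zip(v01_a, v01_b)]
    variance = _sample_variance(d10) / m + _sample_variance(d01) / n
    if variance == 0.0:
        raise ValueError("zero variance: the test statistic is undefined")
    z = (auc_a - auc_b) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return auc_a, auc_b, z, p_value

# Toy labels and scores from two hypothetical models, invented for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
scores_b = [0.8, 0.5, 0.9, 0.4, 0.7, 0.3, 0.6, 0.2]
auc_a, auc_b, z, p_value = delong_test(y_true, scores_a, scores_b)
```

A p-value above 0.05, as reported for every comparison in Table 5, means the AUC difference between the two models cannot be distinguished from zero at that significance level.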
Because all six methods are considered convincing models, we used the variable importance generated by all six as the basis for our risk factor ensemble. Table 6 shows the overall importance ranking of each predictor variable based on the six convincing models. Note that only the first 15 variables of Table 2 are shown. The columns "Average ranking of RF" to "Average ranking of LightGBM" give each variable's average ranking over the ten repetitions of the corresponding model. The models produced different variable importance rankings because of their different modeling rules. To hybridize the findings, the proposed scheme weights the ranking results of the six models equally: the "Average ranking of the six models" is the simple average of the six per-model rankings.
To clarify the ranking, Figure 4 shows the top ten important variables in increasing order of their average ranking values across the six models. For a compact discussion, and based on physicians' recommendations, the top five important predictor variables, namely BUN, SBP, R-IOP, RBCs, and C/H, were identified as significant variables for assessing subjects with MetS and CKD stage 3a or 3b.
In recent years, three related studies have been published that use different analytical tools and subjects to determine the risk factors for CKD in Taiwan [16,17,59]. Chang et al. (2020) consulted the Elderly Health Examination Database and used 2006-2012 data from 297,603 elderly people aged 65 years and older in Taipei City, Taiwan. Employing the non-CKD criteria with the G1 and G2 stages (e-GFR > 60 mL/min/1.73 m²), their results showed a 29.7% e-GFR reduction in the likelihood of CKD diagnosis. The study found smoking to be significantly associated with an elevated risk of reduced e-GFR, and physical exercise and healthy lifestyle habits to be significantly associated with increased e-GFR. Additionally, it found CVD, hypertension, obesity, and diabetes-related indicators to be linked to an increased risk of developing CKD [59]. Another study, published in China, used the same criteria (e-GFR > 60 mL/min/1.73 m²) to detect CKD among 15,229 subjects (mean age: 62.8 years) from the Dongfeng-Tongji examination dataset (2008-2013). It found that BMI and MetS are potential indicators of CKD risk among elderly people [60]. Shih et al. (2020) analyzed data from an adult health examination dataset, as well as data on elderly adults collected from three physical examination centers and 32 clinics in Taiwan (2015-2019). The study included a total of 19,270 subjects with effective records: 14,169 non-CKD subjects (63.37 ± 11.56 years) and 5101 CKD subjects (69.19 ± 10.74 years). However, it has a notable limitation: non-CKD was indicated by the G1 stage alone (e-GFR ≥ 90 mL/min/1.73 m²), so the use of the G2 stage to represent CKD subjects was not rigorous. The study found the UP-Cr. ratio, proteinuria (PRO), RBCs, FPG, TG, T-Cho, age, and gender to be important risk factors for early CKD prediction [17].
Interestingly, they identified RBCs, in addition to UP, as an important factor, though they did not elaborate on it. Previous research on UP features supports data on the correlation with RBCs; in fact, some studies show that it may be a risk factor for hypertension [48].
This study is a follow-up to the study by Chiu et al. (2021). Their datasets were collected from four major health screening centers in the northern, central, and southern parts of Taiwan (2010-2015). A total of 65,394 subjects from the MJPD database were included in the analysis of 18 risk indicators, and CKD was determined by using the criteria with the G2 stage (e-GFR > 60 mL/min/1.73 m²). The MJPD datasets covered the sub-health population and included more young subjects, aged around 30 to 50 years. The results showed that BUN and UA were the first and second most important indicators, and that SBP, SGPT, SGOT, and LDL-C were also related risk factors. Interestingly, socioeconomic status (SES)-related education was found to be the third most important indicator in that study [16].
From the perspective of preventive medicine, knowledge of risk factors facilitates early detection and, in turn, allows relevant lifestyle habits to be targeted and improved, enabling people to avoid serious chronic diseases. In this study, we continued to use MJPD datasets [16], though notably with a younger sample. However, unlike the three most prominent previous CKD-related studies in Taiwan [16,57,59], we raised the criteria for CKD, asserting that CKD is stage 3b in the earliest stage of end-stage renal disease (e-GFR > 45 mL/min/1.73 m²). At the same time, we increased the number of years covered by the data (2005-2017) to increase the sample size. The MJPD dataset excluded the subjects' records related to anything but MetS, CKD stage 3a, and CKD stage 3b. Out of a total of 423 subjects, 88.65% were diagnosed with stage 3a CKD, and 11.35% were diagnosed with stage 3b CKD. BUN, SBP, R-IOP, RBCs, and C/H were identified as the five most important variables for evaluating subjects with MetS and CKD stage 3a or 3b.

Limitations
This study included both variables identified in related studies and variables of interest to the researchers. Because analyzing too many variables may affect the area under the curve (AUC) of the algorithms, follow-up studies are recommended to reduce the number of variables analyzed appropriately, or to integrate closely related variables, such as L-IOP and R-IOP, or the T-Cho, HDL-C, LDL-C, and C/H indicators. In addition, given the small sample size, follow-up research could further analyze the two risk factor measures, relative importance value (RIV) and ordinal ranking value (ORV).

Conclusions
This study proposed innovative algorithms for the analysis of health-screening data pertaining to the third stage of CKD, the earliest stage of ESKD. It contributed 33 relevant research variables, including R-IOP, RBCs, and T-Cho/HDL-C, and outlined their varied associations with risk indicators identified in previous studies. The results suggest that some combinations of factors could potentially be used to separate individuals with stage 3a CKD from those with stage 3b CKD, facilitating the design of prospective studies in the future. We believe that this study has made several valuable contributions to the literature, including some that will aid in the prevention and treatment of CKD and the evaluation of high-risk groups in the third stage.