Prediction of Metabolic Syndrome in a Mexican Population Applying Machine Learning Algorithms

Abstract: Metabolic syndrome is a health condition that increases the risk of heart disease, diabetes, and stroke. The prognostic variables that identify this syndrome have already been defined by the World Health Organization (WHO), the National Cholesterol Education Program Third Adult Treatment Panel (ATP III), and the International Diabetes Federation. According to these guidelines, there is some symmetry among the anthropometric prognostic variables used to classify abdominal obesity in people with metabolic syndrome. However, some variables appear to be more sensitive than others, and the proposed definitions have failed to appropriately classify specific populations or ethnic groups. In this work, we used the ATP III criteria as the framework to rank the health parameters (clinical and anthropometric measurements, lifestyle data, and blood tests) from a data set of 2942 participants of the Mexico City Tlalpan 2020 cohort, applying machine learning algorithms. We aimed to find the most appropriate prognostic variables to classify Mexicans with metabolic syndrome. Sensitivity, specificity, and balanced accuracy were used for validation. The ATP III criteria using the Waist-to-Height-Ratio (WHtR) as the anthropometric index for the diagnosis of abdominal obesity achieved better classification performance than those using waist circumference or body mass index. Further work is needed to assess its precision as a classification tool for metabolic syndrome in a Mexican population.


Introduction
Metabolic Syndrome (MetS) encompasses a group of cardiovascular risk factors that increase the likelihood of suffering heart and other metabolic illnesses, such as cerebrovascular stroke and diabetes.
MetS was first described by Kylin in 1920 as the coexistence of hypertension, hyperglycemia, and gout [1]. In 1940, the central obesity component was added [2]; since then, several definitions have been used to describe it, and it has been given different names, such as the X syndrome, the insulin resistance syndrome [3], or the deadly quartet [4].
Due to the controversy regarding a worldwide definition, an international initiative was convened in 1998 in an attempt to reach an agreement on this matter. The World Health Organization (WHO 1999) proposed a set of criteria [5], followed by the National Cholesterol Education Program's Adult Treatment Panel III (NCEP: ATP III-2004) [6] and the European Group on the Study of Insulin Resistance (IDF) [7].

Data
The data set used in this research was obtained from the Tlalpan 2020 cohort, a study conducted by the Instituto Nacional de Cardiología Ignacio Chávez in Mexico City [26]. The data were collected from 2942 subjects, 1869 women (64%) and 1073 men (36%), aged 20-50 years. The data set included different cardiovascular risk factors (clinical and anthropometric measurements, lifestyle habits, and biomedical evaluation).
Clinical and anthropometric measurements. Systolic and diastolic blood pressure were measured according to the JNC 7 standard procedure [33]. Waist Circumference (WAIST), height, and weight were measured following the International Society for the Advancement of Kinanthropometry (ISAK) [34]. Body Mass Index (BMI) was estimated as weight/height², and the Waist-to-Height-Ratio (WHtR) was calculated by dividing the waist by the height (waist/height, cm/cm).
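The two derived indexes described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the units for BMI (kilograms and meters) are the conventional ones and are assumed here, and the example participant is invented, not taken from the Tlalpan 2020 data.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight divided by squared height (assumed kg and m)."""
    return weight_kg / height_m ** 2


def whtr(waist_cm: float, height_cm: float) -> float:
    """Waist-to-Height Ratio: both lengths in the same unit (cm/cm)."""
    return waist_cm / height_cm


# Hypothetical participant, used only to exercise the two formulas.
example_bmi = bmi(weight_kg=70.0, height_m=1.65)
example_whtr = whtr(waist_cm=90.0, height_cm=165.0)
print(round(example_bmi, 2), round(example_whtr, 3))
```

Because WHtR divides two lengths in the same unit, it is dimensionless, which is why it can be compared across people of different heights.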
Lifestyle habits. Variables related to lifestyle habits, such as alcohol consumption, smoking, and physical activity (measured with the long version of the International Physical Activity Questionnaire (IPAQ)), were obtained with validated questionnaires [35].

Methods
Random Forest. It was introduced by Breiman and Adele Cutler [36] as a predictive algorithm that builds a set of CART classification trees and assigns each instance the class chosen by majority vote; this is known as a classifier ensemble. This algorithm can be applied to a wide range of prediction problems and can achieve better prediction accuracy than individual classification trees [37]. Random Forest has two important parameters: mtry and ntree. The mtry parameter is the size of the random subsets of variables considered for splitting; its default value is √p for classification and p/3 for regression, where p is the number of variables in the data set [38]. The ntree parameter is the number of trees in the forest. In this work, we used the randomForest library [38], available in R [39].
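As a small illustration of the mtry defaults, the sketch below reproduces the rule documented for the R randomForest package, floor(√p) for classification and max(floor(p/3), 1) for regression. The function name `default_mtry` is hypothetical and the code is written in Python only for illustration; it is not part of the analysis pipeline used in this work.

```python
import math


def default_mtry(p: int, task: str = "classification") -> int:
    """Default number of candidate variables per split for p predictors."""
    if p < 1:
        raise ValueError("p must be a positive number of predictors")
    if task == "classification":
        # floor(sqrt(p)); isqrt avoids floating-point rounding issues
        return max(math.isqrt(p), 1)
    if task == "regression":
        # max(floor(p / 3), 1)
        return max(p // 3, 1)
    raise ValueError("task must be 'classification' or 'regression'")


print(default_mtry(16), default_mtry(10, "regression"))
```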
Random Forest provides a method called Variable Importance Measures (VIMs) to rank the importance of variables in cases of regression or classification.
Variable Importance Measures. [40] Variable importance measures for Random Forest can be used to rank variables by their relevance in regression or classification cases. This method has been successfully applied in many applications [23,37,41]. There are two ways to identify relevant features or perform variable selection: (1) mean decrease of impurity (MDI), which is based on the Gini index [42], and (2) mean decrease of accuracy (MDA), which is based on permutation importance. MDI is typically used in classification [42] and is more robust than MDA [43,44]. MDA is more suitable for regression problems.
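To make the impurity-based measure concrete, the following sketch computes the Gini impurity of a node and the decrease obtained by one split. It is a simplified single-split illustration under stated assumptions: the helper names are hypothetical, the class labels are invented, and a real Random Forest implementation accumulates such decreases over every node and tree in which the variable is used.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Impurity before the split minus the size-weighted impurity after it."""
    n = len(parent)
    after = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - after


# Invented labels: a split that separates the two classes only partially.
parent = ["MetS"] * 4 + ["noMetS"] * 4
left = ["MetS"] * 3 + ["noMetS"] * 1
right = ["MetS"] * 1 + ["noMetS"] * 3
print(impurity_decrease(parent, left, right))
```

A variable whose splits yield large decreases across many nodes ends up with a high importance value.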
In this study, we use MDI, which typically relies on the Gini index (the measure commonly chosen for classification-type cases [42]). The importance of a variable $X_k$ is given by
$$\mathrm{MDI}(X_k) = \sum_{i \in N_k} \left( \mathrm{impbn}_i - \mathrm{impan}_i \right)$$
where $\mathrm{impbn}_i$ is the impurity before node $i$, $\mathrm{impan}_i$ is the impurity after node $i$, and $N_k$ represents the set of nodes in which a split based on $X_k$ is made. This method is implemented in R by the function importance (type 2), which is part of the randomForest package [45].
JRip. It is a version of the RIPPER (Repeated Incremental Pruning to Produce Error Reduction) algorithm [46]. JRip generates a rule set to identify the categories of the instances while minimizing the classification error. The syntax of a rule is as follows:
if attribute1 <relational operator> value1 <logical operator> attribute2 <relational operator> value2 <...> then decision-value
C4.5. [47] It creates a classification tree from training data through repeated splits. In each repetition, the most relevant predictor variable is identified using the gain ratio as a measure, and the tree is split on this variable. This process is repeated until all training instances are classified. In the end, only the most important predictors are used to create the classification tree, which results in a simpler tree.
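The gain ratio used by C4.5 to choose a splitting variable can be illustrated with a minimal sketch for a categorical attribute. The function names and the toy records are assumptions made for this example; C4.5 itself also handles continuous attributes through thresholds, which is not shown here.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split information."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(values)
    # An attribute with a single value cannot split the data.
    return info_gain / split_info if split_info > 0 else 0.0


# Toy records: the attribute separates the classes perfectly.
values = ["high", "high", "low", "low"]
labels = ["MetS", "MetS", "noMetS", "noMetS"]
print(gain_ratio(values, labels))
```

Dividing by the split information penalizes attributes with many distinct values, which plain information gain tends to favor.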
Correlation-Based Feature Selection (CFS). [48] It assesses both the capacity of a feature subset to predict the class and the correlation between the features in the subset, aiming to maximize the former and minimize the latter. As a result, the subset with the best predictive capacity and the least correlation between its members is found. Given a feature subset $S$ with $k$ features, CFS computes the goodness of $S$, denoted $M_S$, with the equation:
$$M_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$
where $\overline{r_{ff}}$ is the average correlation of all feature-feature pairs and $\overline{r_{cf}}$ is the average correlation of all feature-class pairs.
Chi-Squared. This filter calculates the chi-squared statistic of each variable, taken individually, with respect to the class [49]. It gives a feature ranking as a result. Given a feature $f$ and the class $c$, the chi-squared test is computed with the equation:
$$\chi^2(f, c) = \frac{N \left[ P(f, c)\,P(\bar{f}, \bar{c}) - P(f, \bar{c})\,P(\bar{f}, c) \right]^2}{P(f)\,P(\bar{f})\,P(c)\,P(\bar{c})}$$
where $N$ is the number of records in the data set, $P(x, y)$ is the joint probability of $x$ and $y$, and $P(x)$ is the marginal probability of $x$. For example, $P(f, c)$ is the joint probability of $f$ and $c$, and $P(f)$ is the marginal probability of $f$; $\bar{f}$ is the complement of $f$ and $\bar{c}$ is the complement of $c$.
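Both filters can be sketched numerically. The example below assumes the standard CFS merit, k·r_cf / sqrt(k + k(k−1)·r_ff), and the usual chi-squared statistic for a 2×2 contingency table of a binary feature against a binary class, written with counts instead of probabilities. The function names and input values are hypothetical, and continuous variables would first need to be discretized, which is not shown.

```python
import math


def cfs_merit(k, mean_r_cf, mean_r_ff):
    """CFS merit of a subset of k features.

    mean_r_cf: average feature-class correlation.
    mean_r_ff: average feature-feature correlation.
    """
    return (k * mean_r_cf) / math.sqrt(k + k * (k - 1) * mean_r_ff)


def chi_squared_2x2(n_f_c, n_f_notc, n_notf_c, n_notf_notc):
    """Chi-squared statistic from the four cell counts of a 2x2 table."""
    n = n_f_c + n_f_notc + n_notf_c + n_notf_notc
    denominator = (
        (n_f_c + n_f_notc)
        * (n_notf_c + n_notf_notc)
        * (n_f_c + n_notf_c)
        * (n_f_notc + n_notf_notc)
    )
    if denominator == 0:
        return 0.0  # a constant feature or class carries no information
    return n * (n_f_c * n_notf_notc - n_f_notc * n_notf_c) ** 2 / denominator


# Invented values: redundancy between features lowers the merit.
print(cfs_merit(4, 0.5, 0.0), cfs_merit(4, 0.5, 1.0))
# Invented counts: perfect association versus independence.
print(chi_squared_2x2(10, 0, 0, 10), chi_squared_2x2(5, 5, 5, 5))
```

The first pair of values shows why CFS prefers subsets whose members are predictive but weakly correlated with each other.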

Metrics
To evaluate classifier performance, sensitivity (SENS), specificity (SPC), and balanced accuracy (BACC) were computed from the confusion matrix, together with the Kappa index [50]:
$$\mathrm{SENS} = \frac{TP}{TP + FN} = \frac{TP}{P}, \qquad \mathrm{SPC} = \frac{TN}{TN + FP} = \frac{TN}{N}, \qquad \mathrm{BACC} = \frac{\mathrm{SENS} + \mathrm{SPC}}{2}$$
where P = Positive, N = Negative, TP = True Positive, FN = False Negative, TN = True Negative and FP = False Positive, respectively.
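The metrics can be computed directly from the four confusion-matrix counts, as in the sketch below. The function names and the counts are invented for illustration, and the Kappa index is assumed to be Cohen's kappa in its usual two-class form.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity."""
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2


def cohen_kappa(tp, fn, tn, fp):
    """Agreement beyond chance between predicted and true classes."""
    n = tp + fn + tn + fp
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    if expected == 1.0:
        return 0.0  # degenerate case: only one class present
    return (observed - expected) / (1 - expected)


# Invented confusion matrix, not a result from this study.
tp, fn, tn, fp = 90, 10, 160, 40
print(sensitivity(tp, fn), specificity(tn, fp))
print(balanced_accuracy(tp, fn, tn, fp), cohen_kappa(tp, fn, tn, fp))
```

Balanced accuracy is useful here because the classes are imbalanced: a classifier that always predicts the majority class would reach a high plain accuracy but a balanced accuracy of only 0.5.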

Statistical Analysis
The statistical analysis was performed with the Stata package, version 13.0. The distribution of numerical variables was tested with the Shapiro-Francia test (P > 0.05). The Mann-Whitney U test and the Chi-squared or Fisher's exact tests were used to compare the studied groups (with and without MetS). A p-value of ≤0.05 was considered statistically significant.

Results
In this study, we used a data set from a cohort study called Tlalpan 2020 (the study protocol for this cohort was published elsewhere [26]). The ATP III criteria were applied (see Table 1) to identify influential cardiovascular risk factors and classify participants with or without MetS. Figure 1 shows a general diagram of our proposed model. The first step was to identify the variable importance across the whole data set by applying Random Forest, chi.squared, and CFS. The most important variables (features) obtained in this step were then used to train different models using Random Forest, C4.5, and JRip. Models were created using 30 independent iterations, as this is the typical number used in the literature for fair comparisons among experiments [51,52]. Their performance was then compared in terms of balanced accuracy, sensitivity, and specificity.
The prevalence of MetS according to ATP III criteria was 20.5% (603 participants), and no significant differences were identified between sexes (with MetS: women 20.6% vs. men 20.4%; without MetS: women 79.4% vs. men 79.6%).
The median and interquartile range (IQR) of anthropometric, clinical, and biochemical parameters are shown in Table 2. Participants with MetS were significantly older than those without MetS and showed higher values of all anthropometric and clinical parameters. Concerning biochemical parameters, MetS participants had substantially higher values than those without MetS.
Table 2 notes: * numerical data are expressed as the median (interquartile range (IQR)); ** categorical data are expressed as the number of cases and the corresponding percentage (n (%)). BMI: body mass index, WC: waist circumference, WHtR: Waist-to-Height-Ratio, SBP: systolic blood pressure, DBP: diastolic blood pressure, HR: heart rate, FPG: fasting plasma glucose, CHOL: total cholesterol, LDL-C: low-density lipoprotein cholesterol, HDL-C: high-density lipoprotein cholesterol, TGs: triglycerides.

Variable Importance and Prediction Model
As a first step, we identified the most important variables of the data set using the Random Forest algorithm. To construct the corresponding model, the number of trees (ntree) was varied from 100 to 1000 (ntree = 100, 200, 300, 500, 800, and 1000) and the mtry value was varied from 1 to 10, applying the grid search method proposed by Hsu et al. [53]. In addition, 10-fold cross-validation with ten repeats was used to train the model, so that every record was used for both training and validation. Once the training process was finished and the best parameters were found and applied, the variable importance was obtained. The best values were mtry = 10 and ntree = 1000, which achieved a balanced accuracy of 0.9675 with a standard deviation (SD) of 0.0006.
Figure 2 shows the resulting variable importance. FPG displayed the highest importance, followed by TGs, WHtR, and HDL.C; then came a second group (BMI, DBP, and waist), while SEX, CREA, and SBP showed the lowest values. Three of the four variables in the first group are indicators used by ATP III to identify MetS. However, the anthropometric index that ATP III uses to identify abdominal obesity is the waist, which in our results showed a lower importance than WHtR and even than BMI. Therefore, considering the role of abdominal obesity as a cardiometabolic risk factor, the importance of each of the three anthropometric indexes (WHtR, BMI, and waist) was tested. We performed experiments in which WEIGHT, HEIGHT, and, depending on the case, BMI, waist, or WHtR were removed, and a separate model was built applying Random Forest and chi.squared.
Table 3 shows the variable importance of BMI, WHtR, and WAIST using Random Forest. WHtR was placed in the second position with a higher value (146.4299) than BMI (118.7353) and WAIST (118.6575), which were placed in the third position. Table 4 shows the variable importance of BMI, WHtR, and WAIST using chi.squared.
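The search space and resampling scheme described above can be sketched as follows. This is an illustrative outline only: the names `parameter_grid` and `repeated_kfold` are hypothetical, the folds are not stratified by class (an assumption, since the text does not say whether stratification was used), and the model fitting and scoring step is left out.

```python
import itertools
import random

NTREE_GRID = [100, 200, 300, 500, 800, 1000]  # values listed in the text
MTRY_GRID = list(range(1, 11))                # mtry from 1 to 10


def parameter_grid():
    """All (ntree, mtry) combinations to be evaluated."""
    return list(itertools.product(NTREE_GRID, MTRY_GRID))


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[i::k] for i in range(k)]
        for fold_id, test in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != fold_id for i in fold]
            yield repeat, fold_id, train, test


grid = parameter_grid()
splits = list(repeated_kfold(n_samples=50))
print(len(grid), len(splits))
```

Each of the 60 parameter combinations would be scored on all 100 train/test splits, and the combination with the best average balanced accuracy would be retained.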
Even though the three anthropometric indexes were placed in the third position, WHtR achieved a higher value (0.5118) than WAIST (0.5068) and BMI (0.4975). The last six variables had an importance of 0, which means that they are not important for diagnosing MetS according to the chi.squared filter; they can therefore be discarded from the models.
Once the importance of the variables was obtained with Random Forest (see Table 3) and chi.squared (Table 4), the models for each anthropometric index (WHtR, WAIST, and BMI) were developed using Random Forest, C4.5, and JRip as classifiers.
Table 5 shows the performance of the 30 models developed for each anthropometric index (WHtR, WAIST, and BMI) using Random Forest, C4.5, and JRip as classifiers. Random Forest was the best-performing classifier for all three anthropometric indexes, with WAIST showing the highest performance. C4.5 and JRip, on the other hand, obtained lower performance for the three indexes, with the highest values observed for WHtR.
Since ATP III is one of the most used guidelines in Latin America to define MetS, we constructed three models: one to measure the performance of the variables used by ATP III (see Table 6), another using the same ATP III variables but replacing WAIST with WHtR (see Table 7), and a last one using the same ATP III variables with BMI instead of WAIST (see Table 8). Tables 6-8 show the sensitivity, specificity, and balanced accuracy, together with their standard deviations, averaged over the 30 models generated for each case.
In the case of the model using ATP III variables (Table 6), the best performance was attained by Random Forest, with a balanced accuracy of 0.8754 and an SD of 0.0036, followed by JRip (0.8723, 0.0203). The worst average performance was attained by linear SVM with cost = 100, where the balanced accuracy was 0.7561 and the SD was 0.0136. In the model in which WAIST was replaced with WHtR, the best performance was achieved by JRip, with a balanced accuracy of 0.8926 and an SD of 0.0142, followed by Random Forest (0.8905, 0.0022); the worst average performance was attained by linear SVM with cost = 50 (0.7812, 0.0154). Finally, in the model using ATP III variables with BMI instead of WAIST, the best performance was achieved by JRip, with a balanced accuracy of 0.8691 and an SD of 0.0168, followed by Random Forest (0.8650, 0.0033); the worst average performance was attained by linear SVM with cost = 50 (0.7694, 0.0153).
These experiments showed that the model using the ATP III variables with WHtR instead of WAIST achieved the best performance. WHtR could therefore be a useful index for the identification of MetS in a Mexican population, along with the variables already proposed by ATP III.

Discussion
In this study, a set of health parameters was ranked by applying Random Forest, and the ranking was compared with the chi.squared and CFS filter methods to obtain the variable importance. The results showed that the main prognostic variables of MetS in our cohort of the Mexico City population were, in order of importance, FPG, TGs, WHtR, HDL-C, and BMI. Four of these five variables are among those proposed by the WHO, IDF, and ATP III criteria for the classification of people with MetS, although those criteria do not take their predictive importance into consideration. Other studies have also found these prognostic variables using different classification methods [15,20,54].
An interesting result was that WHtR ranked third in order of importance. This is an important finding given the obesity epidemic in our country [55] and its relationship with cardiovascular disease, which is the leading cause of morbidity and mortality worldwide and in Mexico.
Abdominal obesity has become an indicator of cardiometabolic risk. Therefore, significant efforts have been made to find the proper anthropometric measurement that reflects the accumulation of fat tissue in the abdominal area and can be easily obtained without high technology equipment.
It is also true that anthropometric indexes are strongly influenced by age, gender, and ethnicity, among other factors; therefore, finding the appropriate one could be an overwhelming task. BMI has been used as an indicator of body fatness; however, it does not reflect abdominal obesity. Furthermore, BMI might scale to height with a power other than 2, and therefore erroneous conclusions might be drawn regarding the adipose composition of people with different heights [56].
In recent years, anthropometric indexes such as BMI, WAIST, and, more recently, WHtR have been proposed as indicators of high cardiometabolic risk [57,58].
A systematic review that included seventy-eight cross-sectional and prospective studies analyzed the capability of WHtR, WAIST, and BMI to predict the risk of diabetes and CVD and found that all three are useful predictors; furthermore, balance and adjusted data suggested that WHtR and WAIST are stronger predictors than BMI [59]. Browning et al. [59] suggest that the message "Keep your waist circumference to less than half your height" could provide a suitable cutoff point for all ethnic groups.
In a more recent systematic review and meta-analysis, Ashwell et al. [57] aimed to differentiate the screening potential of WHtR and WAIST for adult cardiometabolic risk (hypertension, diabetes, dyslipidemia, MetS, and overall cardiovascular outcomes) and found that WHtR had significantly higher discriminatory power than BMI. Most importantly, the within-study statistical analysis showed that WHtR was a better predictor than WAIST for hypertension, diabetes, cardiovascular disease, and all outcomes in both genders.

Comparing WHtR, WAIST, and BMI
To compare the importance of WHtR, WAIST, and BMI separately in the complete data set, we applied Random Forest and chi.squared. The results showed that WHtR is the most important of the three, since it obtained the highest values with both Random Forest (see Table 3) and chi.squared (Table 4). Likewise, in the results shown in Table 5, WHtR achieved the best performance in balanced accuracy, sensitivity, and specificity with C4.5 and JRip, using CFS as a feature selection method, even though WAIST achieved better performance with Random Forest.
The ATP III guidelines are the most used to diagnose MetS; however, ethnic and regional characteristics need to be recognized to adjust the parameters for the diagnosis of abdominal obesity. Thus, the performance of WHtR and BMI was tested using the variables of ATP III except for WAIST. The results in Table 6 show the performance of the model using only ATP III variables, where Random Forest achieves the highest value (0.8754). In Table 8, BMI reached its best performance with JRip (0.8691); however, it fails to reach the value obtained by WAIST. The values attained by WHtR showed the best performance using all classifiers, highlighting Random Forest with the highest values. This shows that, for our study using data from the Mexico City Tlalpan 2020 cohort participants, WHtR in combination with the variables of ATP III (except for the waist) achieves better classification performance than WAIST and BMI.

Conclusions
Machine learning algorithms have become a useful prognostic tool in medicine [64], for example to predict medical outcomes such as treatment response to (chemo)radiotherapy [65], to study metabolomics [66], and to identify the association between microbes, metabolites, and abdominal pain in children with irritable bowel syndrome [67]. In our case, we used Random Forest to rank health parameters, evaluating the prediction performance of the algorithm by accuracy (97%), sensitivity (97%), and specificity (93%). The results of Worachartcheewan et al. [25] are similar to ours: they obtained an accuracy of 98% using Random Forest to determine MetS prevalence. However, results obtained with other algorithms, such as SVM, have shown considerable variability; for instance, Karimi Alavijeh et al. [20] achieved an accuracy of 75%, while Barakat et al. [21] achieved an accuracy of 97%. Similar results were published using ANN; Hirose et al. [13] reported a sensitivity of 93% and a specificity of 91%. Lin et al. [17] achieved a lower accuracy (88.3%) using the same technique and 83.6% applying logistic regression. It is necessary to emphasize, however, that adequate feature selection and feature ranking significantly impact the performance and computational burden of machine learning algorithms [11].
In this study, we only included a population living in Mexico City. Nevertheless, MetS encompasses chronic degenerative diseases with a significant genetic burden. Also, Mexico is a country with a wide variety of ethnic groups. Therefore, it will be essential to include populations from other regions of Mexico to have these ethnicities, cultures, customs, lifestyles, diet, and anthropometric characteristics represented and to develop an algorithm that can be applied throughout Mexico to detect and predict the MetS.
Finally, machine learning algorithms have potential applicability in medicine for diagnosis, with Random Forest being the most useful algorithm for prediction and for ranking variables. In our Tlalpan 2020 cohort, FPG, TGs, WHtR, HDL-C, BMI, DBP, and WAIST were the most important variables to diagnose (or predict) MetS; these results were similar to those found in other cohorts [15,25,54]. Likewise, results using JRip, C4.5, kNN, SVM, and Random Forest showed that WHtR could be a useful index for the identification of MetS, along with other variables proposed by ATP III.