Building a Nomogram for Metabolic Syndrome Using Logistic Regression with a Complex Sample—A Study with 39,991,680 Cases

Metabolic syndrome can cause complications, such as stroke and cardiovascular disease. We aimed to identify the risk factors related to metabolic syndrome and to propose a nomogram that visualizes and predicts the probability of its occurrence, to support prevention and recognition. This paper presents, for the first time, a nomogram for metabolic syndrome. We analyzed data from the Korea National Health and Nutrition Examination Survey VII. A total of 17,584 participants were included in this study, and the weighted sample population was 39,991,680, which was 98.1% of the actual Korean population in 2018. We identified 14 risk factors affecting metabolic syndrome using the Rao-Scott chi-squared test. Next, logistic regression analysis was performed to build a model for metabolic syndrome, and 11 risk factors were finally obtained: BMI, Marriage, Employment, Education, Age, Stroke, Sex, Income, Smoking, Family history, and Age*Sex. A nomogram was constructed to predict the occurrence of metabolic syndrome using these risk factors. Finally, the nomogram was verified using a receiver operating characteristic (ROC) curve and a calibration plot.


Introduction
Metabolic syndrome is a condition in which obesity, hyperlipidemia, low levels of high-density lipoprotein (HDL) cholesterol, high blood pressure, and hyperglycemia coincide in one individual due to chronic metabolic disorders. Metabolic syndrome was first named Syndrome X in 1988 [1]; however, in 1999, the World Health Organization (WHO) renamed it metabolic syndrome. The WHO definition of metabolic syndrome has not been consistently used because it requires the measurement of serum insulin and urinary microalbumin levels [2,3]. Therefore, we used the diagnostic criteria for metabolic syndrome published by the National Cholesterol Education Program [4]. Metabolic syndrome is diagnosed when three or more of the following criteria are met: obesity (waist circumference ≥ 90 cm for men; waist circumference ≥ 85 cm for women), hyperlipidemia (triglyceride levels ≥ 150 mg/dL), low levels of HDL cholesterol (men < 40 mg/dL; women < 50 mg/dL), high blood pressure (systolic BP ≥ 130 mmHg and diastolic BP ≥ 85 mmHg, or if a patient is on hypertension medication), and hyperglycemia (fasting glucose ≥ 100 mg/dL). In the United States, 32.8% and 36.6% of male and female individuals aged 20 years or older, respectively, were diagnosed with metabolic syndrome in 2012 [5]. In Korea, approximately 30.8% and 26.3% of male and female individuals over 20 years of age, respectively, were diagnosed with metabolic syndrome in 2013 [6]. Furthermore, 28.1% of men and 18.7% of women were diagnosed with metabolic syndrome in 2017 [7]. Since metabolic syndrome causes complications, such as stroke and cardiovascular diseases, identifying strategies to prevent its occurrence is essential [8].
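The diagnostic rule above can be sketched as a small function. This is only an illustration of the "three or more of five criteria" logic as worded in the text; the class name, field names, and the example values are hypothetical and are not taken from the KNHANES variable list.

```python
from dataclasses import dataclass


@dataclass
class Measurements:
    """Hypothetical container for one participant's measurements."""
    sex: str  # "male" or "female"
    waist_cm: float
    triglycerides_mg_dl: float
    hdl_mg_dl: float
    systolic_mmhg: float
    diastolic_mmhg: float
    on_bp_medication: bool
    fasting_glucose_mg_dl: float


def count_criteria(m: Measurements) -> int:
    """Count how many of the five criteria listed in the text are met."""
    male = m.sex == "male"
    criteria = [
        # Abdominal obesity: waist >= 90 cm (men) or >= 85 cm (women)
        m.waist_cm >= (90 if male else 85),
        # Hyperlipidemia: triglycerides >= 150 mg/dL
        m.triglycerides_mg_dl >= 150,
        # Low HDL cholesterol: < 40 mg/dL (men) or < 50 mg/dL (women)
        m.hdl_mg_dl < (40 if male else 50),
        # High blood pressure, combined exactly as worded in the text,
        # or current use of hypertension medication
        (m.systolic_mmhg >= 130 and m.diastolic_mmhg >= 85) or m.on_bp_medication,
        # Hyperglycemia: fasting glucose >= 100 mg/dL
        m.fasting_glucose_mg_dl >= 100,
    ]
    return sum(criteria)


def has_metabolic_syndrome(m: Measurements) -> bool:
    """Metabolic syndrome is diagnosed when three or more criteria are met."""
    return count_criteria(m) >= 3


# Made-up example: waist, triglyceride, and glucose criteria are met.
example = Measurements("male", 92, 160, 45, 120, 80, False, 105)
print(count_criteria(example), has_metabolic_syndrome(example))
```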
The Pearson chi-squared test is a statistical analysis method used to identify risk factors for diseases. Logistic regression and Cox proportional hazards models are statistical models used to predict disease incidence. Many studies have been conducted to identify the risk factors for metabolic syndrome; however, interpreting and understanding the results is challenging for medical practitioners and individuals without statistical knowledge. Therefore, we aimed to develop a nomogram to compensate for this limitation.
A nomogram graphically represents the numerical relationships between diseases and risk factors without the need for complex calculations [9,10]. Nomograms have previously been developed for dyslipidemia and hypertension [11,12]. Although many studies have identified risk factors for metabolic syndrome, there has been no attempt to create a nomogram for metabolic syndrome. In this study, we used the Rao-Scott chi-squared test instead of the Pearson chi-squared test to identify risk factors from the Korea National Health and Nutrition Examination Survey (KNHANES) data. After fitting a multiple logistic regression model, we constructed a nomogram that can predict the incidence of metabolic syndrome. Finally, the constructed nomogram was verified using an ROC curve and a calibration plot.
Section 2 describes the materials used in this study and the Rao-Scott chi-squared test. Next, we explain how to build and verify a nomogram using the coefficients estimated from the logistic regression of complex sample data. In Section 3, we present the results of the Rao-Scott chi-squared test and the logistic regression analysis. Additionally, the nomogram for metabolic syndrome is constructed and verified. Finally, Sections 4 and 5 present the discussion and conclusions of this study.

Complex Sampling Design Method
In data collected using a simple random sampling method, each element has the same probability of being selected and is independent of the others. However, the KNHANES data used in this study were designed to be representative of the Korean population, using census data as the sampling frame and a two-stage stratified cluster sampling method. Complex sample data account for the design effect, which involves stratification, clustering, and individual sample weights corrected for inclusion error, imbalanced extraction rate, and non-response error of the target population.

Materials
The 2016-2018 data used in this study were obtained from KNHANES. KNHANES uses a two-stage stratified cluster sampling method, which is a complex sampling design, to improve the estimation accuracy and representativeness of the sample [13]. This study was conducted with subjects aged 20 years or older. Of the 24,269 individuals surveyed over the 3 years, 5072 who were under 20 years of age were excluded. A further 1613 individuals who did not participate in the health or examination surveys were also excluded, leaving a sample of 17,584 individuals. Missing values were replaced with the mode. We also compared the complex sample population with the actual Korean population surveyed in the 2018 census [14]. The weighted sample population was 39,991,680. According to the Statistics Korea 2018 census, 40,762,796 Korean individuals were older than 20 years of age. The difference between the complex sample and the actual Korean population was 1.9% of the actual population. Therefore, the weighted sample and the actual Korean population were considered to be similar. The data were divided into a training set and a test set at a ratio of 8:2. Thus, the weighted sample population of the training set was 32,043,914 and that of the test set was 7,947,766. The model was fitted using the training set, and its predictive power was verified using the test set.
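The sample-size and coverage figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in this section; the variable names are illustrative.

```python
# Sample derivation: individuals surveyed, minus those excluded.
surveyed = 24_269
excluded_under_20 = 5_072
excluded_non_participants = 1_613
sample_size = surveyed - excluded_under_20 - excluded_non_participants

# Weighted sample population versus the 2018 census count of adults.
weighted_sample = 39_991_680
census_adults = 40_762_796
coverage_pct = 100 * weighted_sample / census_adults
difference_pct = 100 - coverage_pct

# Weighted sizes of the training and test sets (8:2 split).
train_weighted = 32_043_914
test_weighted = 7_947_766
train_share_pct = 100 * train_weighted / weighted_sample

print(sample_size)                  # individuals retained for analysis
print(round(coverage_pct, 1))       # weighted sample as % of census adults
print(round(difference_pct, 1))     # difference as % of census adults
print(round(train_share_pct, 1))    # training share of the weighted population
```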
The criteria for the diagnosis of metabolic syndrome included three or more of the following factors: abdominal obesity, hyperlipidemia, low HDL cholesterol, hypertension, and hyperglycemia. We used the metabolic syndrome diagnostic criteria published by the National Cholesterol Education Program (NCEP ATP III) in 2001. However, among the five diagnostic criteria, the abdominal obesity criterion followed the definition issued by the Korean Society for the Study of Obesity in 2014 to reflect the characteristics of Koreans. In this study, 4850 of 17,584 individuals were diagnosed with metabolic syndrome.
Fourteen risk factors with an important effect on the incidence of diabetes, dyslipidemia, and hypertension have been selected in several prior studies [12,15-17]. These risk factors include body mass index (BMI), Marriage, Employment, Education, Age, Stroke, Sex, Income, Heart attack, Exercise, Alcohol, Angina, Smoking, and Family history. BMI was categorized into three groups: <25, ≥25 and <30, and ≥30 [12]. Education was categorized as "low" for individuals who had not completed high school and "high" for those who had. Age was categorized as 20-34 years, 35-64 years, and ≥65 years. Participants were categorized into lowest, low, high, and highest groups based on income. Regarding exercise status, participants were classified into the physical activity group if they walked for more than 30 min a day on more than 5 days per week. Participants were categorized into the high drinking group if they had drunk alcohol more than once a month in the past year, and into the low drinking group otherwise.
Smoking habits were categorized as present, past, and nonsmoking. The other variables were categorized as yes or no.

Rao-Scott χ² Test
It is important to test the independence between the incidence of metabolic syndrome and each candidate factor when selecting risk factors that affect metabolic syndrome incidence. The Pearson chi-squared test has been used to identify the risk factors for metabolic syndrome; however, this test assumes that the frequency of each cell in the contingency table is independent and follows a multinomial distribution. The data used in this study are complex sample data, with different individual weights for each stratum and cluster, and therefore do not satisfy the assumption that each cell in the table is independent [12,18]. Consequently, the Rao-Scott chi-squared test, which considers design effects such as stratification, clustering, and individual sample weights, was used in this study.
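To make the idea of a design-adjusted test concrete, the sketch below shows one common form, the first-order Rao-Scott correction, in which a Pearson statistic computed from weighted cell proportions is divided by a mean generalized design effect. The text does not state which order of correction was used, and estimating the design effect from strata and clusters is not shown here: `mean_deff` is simply an assumed input, and the function names and toy numbers are hypothetical.

```python
import numpy as np


def weighted_table(rows, cols, weights, n_rows, n_cols):
    """Build a contingency table of summed sampling weights."""
    table = np.zeros((n_rows, n_cols))
    np.add.at(table, (np.asarray(rows), np.asarray(cols)), weights)
    return table


def rao_scott_first_order(table_w, n, mean_deff):
    """Pearson statistic from weighted proportions, divided by a design effect.

    table_w   : weighted contingency table (sums of sampling weights)
    n         : number of sampled individuals (not the weighted total)
    mean_deff : mean generalized design effect, assumed to be estimated
                elsewhere from the stratified cluster design
    """
    p = table_w / table_w.sum()                      # weighted cell proportions
    expected = np.outer(p.sum(axis=1), p.sum(axis=0))  # under independence
    pearson = n * ((p - expected) ** 2 / expected).sum()
    return pearson / mean_deff


# Toy example with made-up weights; with mean_deff = 1 the statistic
# reduces to the ordinary Pearson chi-squared statistic.
toy = weighted_table([0, 0, 1, 1], [0, 1, 0, 1], [10, 20, 20, 10], 2, 2)
print(rao_scott_first_order(toy, n=60, mean_deff=1.0))
print(rao_scott_first_order(toy, n=60, mean_deff=2.0))
```

A design effect greater than 1, which is typical for clustered samples, shrinks the statistic and so makes the test more conservative than the unadjusted Pearson test.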

Logistic Regression Model with a Complex Sample
When estimating the regression coefficients and their variances, logistic regression analysis of a complex sample considers design effects such as stratification, clustering, and individual weights [19][20][21]. First, the population was divided into $K$ strata, and each stratum $k$ was divided into $M_k$ clusters. The first unit of collection within a stratum is called the primary sample unit (PSU). Each cluster had $i = 1, 2, \ldots, N_{kj}$ observations. Supposing that $n_{kj}$ observations exist in each PSU selected in the $k$-th stratum, the total number of observations was $n = \sum_{k=1}^{K} \sum_{j=1}^{M_k} n_{kj}$. The weight of each observation is denoted as $\omega_{kji}$. This study assumed that the dependent variable $y_{kji}$ is a binary variable: $y_{kji}$ represents the binary value of the $i$-th observation in the $j$-th PSU within the $k$-th stratum. $X_{kji} = (x_{kji1}, x_{kji2}, \ldots, x_{kjip})$ represents the independent variable vector of the $i$-th observation in the $j$-th PSU within the $k$-th stratum. We added interaction terms by considering the association between the independent variables. Let $a$ be the number of interaction terms. $\beta = (\beta_0, \beta_1, \ldots, \beta_p, \beta_{p+1}, \ldots, \beta_{p+a})$ is the $(p + a + 1) \times 1$ regression coefficient vector of the logistic model. The logistic regression model was as follows:

$$\log \frac{P(y_{kji} = 1 \mid X_{kji})}{1 - P(y_{kji} = 1 \mid X_{kji})} = \beta_0 + \sum_{l=1}^{p} \beta_l x_{kjil} + \sum_{m=1}^{a} \beta_{p+m} z_{kjim},$$

where $z_{kjim}$ denotes the $m$-th interaction term, formed from the corresponding independent variables. The maximum pseudo-likelihood method, in which the contribution of each observation is weighted by $\omega_{kji}$, was used to estimate the regression coefficient vector $\beta$ for the complex sample.
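As an illustration of weighted (pseudo-likelihood) estimation, the sketch below maximizes the weighted log-likelihood, in which each observation's Bernoulli log-likelihood term is multiplied by its sampling weight, using Newton-Raphson steps. It is a minimal sketch, not the analysis code used in the study: it returns point estimates only, does not compute the design-based variances that require stratum and PSU information, and the function name and toy data are hypothetical.

```python
import numpy as np


def fit_weighted_logistic(X, y, w, n_iter=50, tol=1e-10):
    """Weighted logistic regression by Newton-Raphson.

    X : (n, p) array of independent variables (no intercept column)
    y : (n,) array of 0/1 outcomes
    w : (n,) array of sampling weights
    Returns the coefficient vector (intercept first).
    """
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))     # fitted probabilities
        score = X.T @ (w * (y - pi))               # weighted score vector
        info = X.T @ (X * (w * pi * (1.0 - pi))[:, None])  # weighted information
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Toy data (made up): one binary predictor, unequal weights.
x = np.array([[0], [0], [1], [1]])
y = np.array([0, 1, 0, 1])
w = np.array([3.0, 1.0, 1.0, 3.0])
beta_hat = fit_weighted_logistic(x, y, w)
print(beta_hat)
```

In this toy case the weighted event proportions are 1/4 for x = 0 and 3/4 for x = 1, so the fitted intercept and slope should equal the logit of 1/4 and the difference of the two logits, which gives a quick sanity check on the weighting.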

Nomogram Construction Method
Several medical studies have aimed to predict and reduce the risk of disease or death. Statistical techniques are used to select the relevant risk factors for disease or death and to calculate the extent of their effects for risk prediction. Logistic regression analysis is the most widely used method in medical research; however, it is difficult to interpret. Therefore, a nomogram was proposed to estimate the probability of an event [9,22]. A nomogram is easy to understand because it is simple to build and is composed entirely of lines. A nomogram comprises the point, risk factor, probability, and total point lines. The construction of the lines that form the nomogram has been explained previously [9,23,24].
(a) Point line: A point line comprising 0-100 points is constructed.
(b) Risk factor line: The $LP_{ij}$ (linear predictor) value is calculated from the coefficients of the fitted logistic regression model. If the independent variable $X$ is a categorical variable with $j$ categories, $j - 1$ dummy variables are generated.

$$LP_{ij} = \beta_{ij} \times X_{ij}$$

Using this, we calculated $Points_{ij}$ for each risk category and aligned them to each risk factor line.

$$Points_{ij} = \frac{LP_{ij} - \min_j LP_{ij}}{\max_j LP_{*j} - \min_j LP_{*j}} \times 100$$

In this case, $\beta_{ij}$ is the regression coefficient of the $j$-th category of the $i$-th risk factor and $X_{ij}$ is the attribute value of the $j$-th category of the $i$-th risk factor. $LP_{*j}$ represents the LP value of the risk factor with the largest range of estimated regression coefficients across its attribute values.
(c) Probability line: The probability line represents the probability value corresponding to the total point and ranges from 0 to 1.
(d) Total point line: The total point can be expressed as the cumulative sum of $Points_{ij}$.

$$Total\ points = \sum_{i,j} Points_{ij} = \frac{100}{\max_j LP_{*j} - \min_j LP_{*j}} \sum_{i,j} \left( LP_{ij} - \min_j LP_{ij} \right)$$

The logistic regression model can be solved for $\sum_{i,j} LP_{ij}$, and by substituting this into the above equation, the total points corresponding to each value of the probability line can be obtained.

$$P(Y = 1 \mid X = x) = \frac{1}{1 + \exp\left( -\left( \beta_0 + \sum_{i,j} LP_{ij} \right) \right)}$$

The value of the probability line is substituted for $P(Y = 1 \mid X = x)$ to construct the total point line.
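The construction in steps (a) to (d) can be sketched in code. The sketch assumes the usual scaling in which each category's points are its coefficient, shifted so the lowest category of that factor scores 0 and divided by the coefficient range of the widest-ranging factor, times 100; the total points are then mapped back to a probability through the logistic model. The function names are hypothetical, and the coefficients below are made up for illustration and are not the values in Table 2.

```python
import math


def widest_range(coefs):
    """Coefficient range of the risk factor with the largest range."""
    return max(max(c.values()) - min(c.values()) for c in coefs.values())


def nomogram_points(coefs):
    """Points (0-100) for every category of every risk factor.

    coefs: {factor: {category: coefficient}}, with reference categories at 0.
    """
    span = widest_range(coefs)
    return {
        factor: {
            cat: 100.0 * (beta - min(cats.values())) / span
            for cat, beta in cats.items()
        }
        for factor, cats in coefs.items()
    }


def probability_from_total(total_points, intercept, coefs):
    """Map a total point score back to a predicted probability."""
    span = widest_range(coefs)
    # Linear predictor when every factor sits at its lowest-scoring category
    baseline = intercept + sum(min(c.values()) for c in coefs.values())
    lp = baseline + total_points * span / 100.0
    return 1.0 / (1.0 + math.exp(-lp))


# Hypothetical coefficients, for illustration only.
coefs = {
    "BMI": {"<25": 0.0, "25-30": 1.0, ">=30": 2.0},
    "Sex": {"female": 0.0, "male": 0.5},
}
intercept = -2.0

points = nomogram_points(coefs)
total = points["BMI"][">=30"] + points["Sex"]["male"]
print(points)
print(total, probability_from_total(total, intercept, coefs))
```

With these made-up values, the highest BMI category receives 100 points because BMI has the widest coefficient range, and the probability recovered from the total score matches the probability obtained by plugging the same profile directly into the logistic model.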

ROC Curve and Calibration Plot for Nomogram Validation
After constructing the nomogram, ROC curves and calibration plots were used to validate it [25,26]. In the ROC curve, 1 − specificity and sensitivity were plotted on the x-axis and y-axis, respectively. The diagonal line corresponds to an AUC (area under the curve) of 0.5. The model is considered good when the ROC curve lies above this diagonal line, that is, when the AUC is between 0.5 and 1, and better the closer the AUC is to 1. A calibration plot was used to determine the closeness of the actual probabilities to the predicted probabilities calculated using the nomogram. If the predicted probability is the same as the actual probability, the points fall on the 45° line. The closer the predicted probability is to the actual probability, the closer the points are to the 45° line [9,27]. Therefore, we validated the nomogram using R², a goodness-of-fit indicator of the regression line between the predicted and actual probabilities. All analyses were performed using R software version 4.1.2 (R Core Team, Vienna, Austria). We also used SAS 9.4 to draw the nomogram with a suitable appearance [24].
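Both validation measures can be sketched briefly. The AUC below is computed as the proportion of (case, non-case) pairs in which the case receives the higher predicted probability, with ties counted as one half. For the calibration R², the text does not specify how predicted and actual probabilities were paired, so the sketch assumes equal-sized groups of observations sorted by predicted probability; that grouping, the function names, and the toy data are assumptions, and sampling weights are ignored for simplicity.

```python
import numpy as np


def auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison of scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))


def calibration_r2(y_true, y_prob, n_bins=10):
    """R^2 of the regression line between predicted and observed probabilities.

    Observations are sorted by predicted probability and split into n_bins
    groups (an assumed grouping); each group contributes one point:
    (mean predicted probability, observed event proportion).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    groups = np.array_split(np.argsort(y_prob), n_bins)
    predicted = np.array([y_prob[g].mean() for g in groups])
    observed = np.array([y_true[g].mean() for g in groups])
    # For simple linear regression, R^2 is the squared correlation.
    return np.corrcoef(predicted, observed)[0, 1] ** 2


# Toy data (made up).
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
toy_prob = [0.25] * 4 + [0.75] * 4
toy_true = [1, 0, 0, 0, 1, 1, 1, 0]
print(calibration_r2(toy_true, toy_prob, n_bins=2))
```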

14 Risk Factors Associated with Metabolic Syndrome by Rao-Scott Chi-Squared Test
We used the Rao-Scott chi-squared test to select risk factors related to metabolic syndrome. Table 1 shows the weighted frequency and results of the Rao-Scott chi-squared test. As shown in Table 1, the prevalence of metabolic syndrome increased with increasing BMI. For example, the incidence of metabolic syndrome was 12.1%, 45.1%, and 66.4% for BMI < 25, ≥25 and <30, and ≥30, respectively. Additionally, older individuals had a higher incidence of metabolic syndrome. The incidence rates were 8.2%, 26.4%, and 45.1% for 20-34, 35-64, and ≥65 years, respectively. Further, 41.6% and 19.9% of individuals from the low and high education level categories were diagnosed with metabolic syndrome, respectively. The incidence rates were 29.0% and 10.7% for married and unmarried individuals, respectively. Moreover, 54.2% of patients diagnosed with stroke and 24.5% of those not diagnosed with stroke developed metabolic syndrome. The incidence rates in the non-smoking, past smoking, and present smoking groups were 21.7%, 29.2%, and 28.9%, respectively. Men and women had metabolic syndrome incidence rates of 28.3% and 21.5%, respectively. Furthermore, 45.1% and 24.7% of patients diagnosed with and without angina developed metabolic syndrome, respectively. The lower the income, the higher was the incidence of metabolic syndrome. Metabolic syndrome was also associated with unemployment, physical inactivity, a history of heart attack, low drinking, and a family history of metabolic syndrome. The results of the Rao-Scott chi-squared test (χ²_{R−S}) showed that all factors were statistically significant at the 0.05 level. Therefore, 14 risk factors were found to be important for predicting the incidence of metabolic syndrome.

Multiple Logistic Regression Results for Metabolic Syndrome
A logistic regression analysis was performed using the risk factors listed in Table 1. The results, which considered only the main effects of the 14 risk factors, showed that the regression coefficients for Heart attack, Exercise, Alcohol, and Angina were not significant. Thus, Heart attack, Exercise, Alcohol, and Angina had no significant effect on the prediction of metabolic syndrome, and the remaining ten risk factors were selected as the final main effects. Further, the interactions among the risk factors were considered to improve the predictive power, and the BMI*Marriage, Marriage*Age, Age*Sex, and Sex*Smoking interactions were significant. However, a model including the ten main effects and only the Age*Sex interaction was selected as the final logistic regression model after considering the parsimony of the model and the likelihood ratio, Wald, and score test results. The likelihood ratio, Wald, and score tests were performed to assess the goodness-of-fit of the logistic regression model, and the p-value of each test was less than 0.0001. Table 2 shows the logistic regression results with the ten main effects and one interaction. "95% CI" represents the 95% confidence interval of the odds ratio, and "Points_ij" indicates the nomogram points for each category of the risk factors. Among the risk factors, BMI had the greatest effect on the incidence of metabolic syndrome (Table 2). Further, the incidence of metabolic syndrome was increased in older males with a history of stroke. The interaction variable showed that being male and aged 35-64 years had the greatest impact on metabolic syndrome, whereas being male and aged ≥65 years had the smallest impact because the incidence of metabolic syndrome was lower in males aged ≥65 years than in females. The interaction variable complements the odds ratios for age and sex.

11 Risk Factors with a Proposed Nomogram for Metabolic Syndrome
A nomogram was constructed using the logistic regression model (Table 2). The nomogram proposed to predict the prevalence of metabolic syndrome is shown in Figure 1. Figure 1 shows that BMI and Age had the greatest impact on metabolic syndrome. The higher the BMI, the higher is the incidence of metabolic syndrome. In addition, the incidence of metabolic syndrome increased with age. After BMI and Age, Age*Sex had the greatest impact, followed by Stroke. Individuals with a history of stroke had higher scores than those without; therefore, the former were more likely to have metabolic syndrome. Subsequently, metabolic syndrome was influenced by Marriage, followed by Education, Smoking, Family history, Income, and Employment. After multiple logistic regression analysis, the nomogram model for metabolic syndrome consisted of 11 risk factors. For example, a 50-year-old man with a BMI of 30 who is married, employed, a college graduate, and a smoker, who has a history of stroke, belongs to the high-income level, and has no family history, would obtain the following nomogram points: 100, 12, 0, 0, 43, 21, 14, 28, 1, 11, and 0 points for the BMI, Marriage, Employment, Education, Age, Stroke, Sex, Age*Sex, Income, Smoking, and Family history lines, respectively, leading to a summed score of 230. The probability of metabolic syndrome corresponding to this total number of points is 90.7%.
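The total score in the worked example is simply the sum of the points read from each risk factor line. The short check below uses the point values quoted in the example; the dictionary keys are shortened labels. The final step, reading 90.7% from the probability line at this total, depends on the fitted coefficients in Table 2 and is not reproduced here.

```python
# Points read from each line of the nomogram for the example patient.
example_points = {
    "BMI": 100,
    "Marriage": 12,
    "Employment": 0,
    "Education": 0,
    "Age": 43,
    "Stroke": 21,
    "Sex": 14,
    "Age*Sex": 28,
    "Income": 1,
    "Smoking": 11,
    "Family history": 0,
}

total_points = sum(example_points.values())
print(total_points)
```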

Validation of a Nomogram for Metabolic Syndrome
The nomogram was verified using an ROC curve and a calibration plot. The results are shown in Figures 2 and 3, respectively. Figure 2a shows the ROC curve of the training data, and Figure 2b shows the ROC curve of the test data. The AUC values were 0.8205 and 0.8123 for the training and test data ROC curves, respectively. Figure 3a shows the calibration plot of the training data, and Figure 3b shows the calibration plot of the test data. The R² values for the calibration plots were 0.945 and 0.8762 for the training and test data, respectively. Therefore, it can be concluded that the nomogram had sufficient predictive power.

Discussion
This study used data from 2016 to 2018 for 17,584 individuals from the KNHANES, which surveys the health behavior of Koreans. When the individual weights assigned to each stratum and cluster were applied, the complex sample population was 39,991,680, which is 98.1% of the actual Korean population surveyed in the 2018 Korean Census [14]. Thus, a complex sample analysis is more reasonable than a raw data analysis. The Rao-Scott chi-squared test was thus used to screen for risk factors of metabolic syndrome, and all 14 risk factors were found to be statistically significant. In the multiple logistic regression analysis starting from the main effects of these 14 factors, a total of 11 factors were finally selected: BMI, Marriage, Employment, Education, Age, Stroke, Sex, Income, Smoking, Family history, and Age*Sex. Heart attack, Exercise, Alcohol, and Angina were not significant. They were important as single factors (Table 1), but because of their relatively small chi-squared statistics (Heart attack = 26.427, Exercise = 29.7132, Alcohol = 24.9774, Angina = 45.1976), they were not selected in the logistic regression analysis. In particular, for the exercise variable, people who walked for more than 30 min a day on more than 5 days a week were defined as the physical activity group, and the boundary between physical activity and non-physical activity was considered somewhat ambiguous. In this case, the Rao-Scott chi-squared test statistic (29.7132) was not large, and the variable was not significant in the logistic regression analysis. Analysis of the model considering all interactions of risk factors showed that BMI*Marriage, Marriage*Age, Age*Sex, and Sex*Smoking were significant. However, considering the parsimony of the model and the results of the likelihood ratio, Wald, and score tests, only the interaction between age and sex was included. Therefore, ten main effects and the Age*Sex interaction were finally selected, giving a total of 11 risk factors for metabolic syndrome.
The findings for BMI, Age, Sex, Marriage, Education, Smoking, Income, and Employment were similar to those of previous studies on metabolic syndrome [6,8]. However, in our study, Stroke and Family history were newly identified risk factors. The influence of Stroke ranked after that of BMI, Age, and Age*Sex (Figure 1). Nomogram studies on dyslipidemia [11], hypertension [12], and diabetes [15] showed that BMI, Employment, Education, Age, Sex, Income, and Smoking were common significant factors. On the other hand, the Stroke factor was not present in the studies of dyslipidemia [11] and diabetes [15] but was found to be a significant factor in this study of metabolic syndrome.
The nomogram was composed of the following risk factors: BMI, Marriage, Employment, Education, Age, Stroke, Sex, Income, Smoking, Family history, and Age*Sex (Figure 1). As shown in Figure 1, the risk factors for metabolic syndrome were important in the order of BMI, Age, Age*Sex, Sex, and Stroke. In other words, the higher the BMI, the greater was the effect on incidence. In addition, the older the patient, the higher was the prevalence of metabolic syndrome. Age*Sex was also important. Further, men were found to be more likely to have metabolic syndrome than women. Furthermore, a history of stroke, lower educational level, smoking history, family history of metabolic disease, lower income level, and unemployment increased the risk of developing metabolic syndrome. The accuracy of the proposed nomogram was evaluated using the area under the curve (AUC) and the R² of the calibration plot. The AUC was 0.8205 for the training data ROC curve and 0.8123 for the test data ROC curve. The R² values of the calibration plot were 0.945 and 0.8762 for the training and test datasets, respectively. In other words, the proposed nomogram had sufficient reliability. However, when an interaction term is added to the logistic model, interpretation may be somewhat difficult [21], and the same difficulty applies to the nomogram. Meanwhile, the nomogram has the great advantage of expressing the influence of each risk factor as a score when diagnosing metabolic syndrome [10].

Conclusions
Metabolic syndrome is increasing in incidence and can lead to stroke, cardiovascular disease, and an increased death rate. We analyzed a complex sample population of 39,991,680 individuals, constituting 98.1% of the actual Korean population surveyed in the 2018 Korean census. In the proposed nomogram (Figure 1) for predicting the incidence of metabolic syndrome, the most relevant risk factor for metabolic syndrome was BMI, followed by Age, Sex, Stroke, Marriage, Education, Smoking, Family history, Income, and Employment. A nomogram helps medical practitioners diagnose a disease. Thus, the metabolic syndrome nomogram proposed in this study will assist in establishing future medical treatment plans.

Data Availability Statement:
The data that support the findings of this study are available from the Korea National Health and Nutrition Examination Survey.