Prediction Model for Hypertension and Diabetes Mellitus Using Korean Public Health Examination Data (2002–2017)

Hypertension and diabetes mellitus are major chronic diseases that are important factors in the management of cardiovascular disease. In order to prevent the occurrence of chronic diseases, proper health management through periodic health check-ups is necessary. The purpose of this study is to determine the incidence of hypertension and diabetes mellitus according to the health check-up, and to develop a predictive model for hypertension and diabetes according to the health check-up. We used the National Health Insurance Corporation database of Korea and checked whether hypertension or diabetes occurred from that date according to the number of health check-ups over the past 10 years. Compared to those who underwent five health check-ups, those who participated in the first screening had hypertension (OR = 2.18, 95% CI = 2.14–2.22), diabetes mellitus (OR = 1.33, 95% CI = 1.30–1.35) and both diseases (OR = 2.46, 95% CI = 2.39–2.53); individuals who underwent 10 screenings had hypertension (OR = 0.86, 95% CI = 0.83–0.88), diabetes mellitus (OR = 0.83, 95% CI = 0.81–0.85) and both diseases (OR = 0.83, 95% CI = 0.79–0.87). Individuals who attended fewer than five screenings compared with individuals who attended five or more screenings had hypertension (OR = 1.61, 95% CI = 1.59–1.62; AUC = 0.66), diabetes mellitus (OR = 1.21, 95% CI = 1.20–1.22; AUC = 0.59) and both diseases (OR = 1.75, 95% CI = 1.72–1.78, AUC = 0.63). The machine learning-based prediction model using XGBoost showed higher performance in all datasets than the conventional logistic regression model in predicting hypertension (accuracy, 0.828 vs. 0.628; F1-score, 0.800 vs. 0.633; AUC, 828 vs. 0.630), diabetes mellitus (accuracy, 0.707 vs. 0.575; F1-score, 0.663 vs. 0.576; AUC, 0.710 vs. 0.575) and both diseases (accuracy, 0.950 vs. 0.612; F1-score, 0.950 vs. 0.614; AUC, 0.952 vs. 0.612). It was found that health check-up had a great influence on the occurrence of hypertension and diabetes, and screening frequency was more important than other factors in the variable importances.


Introduction
High blood pressure and diabetes are major modifiable risk factors of cardiovascular disease (CVD), the leading cause of death globally [1]. In rapidly aging societies, greater burden from CVD is expected, making the control and management of these modifiable risk factors especially important [2]. Research indicates that while the prevalence of high blood pressure has decreased over the past few decades, especially among high-income Western and Asia Pacific countries, global mean blood pressure levels have remained virtually unchanged [3]. Moreover, the prevalence of diabetes has been on the rise: although increases have been relatively gradual in the European region, several countries have recorded drastic increases in prevalence, especially in the Oceania region [4]. In Korea, the age-standardized prevalence of hypertension among Korean adults over the years 2001 to 2011 did not rise over 24% and was lowest in 2007 at 19.7%. Meanwhile, trends of increasing prevalence for diabetes and impaired fasting glucose were recorded among Korean adults aged 30 years or older, with a diabetes prevalence of 9.9% in 2007-2009. The prevalence of hypertension and diabetes both increased with age [5].
Studies in the literature suggest a need for more stringent management for uncontrolled risk factors related with CVD [5][6][7]. Currently, the US Preventive Services Task Force recommends yearly screening for hypertension among high-risk individuals and adults 40 years and older and less frequent screening for younger adults with low risk [8].
Overweight or obese adults of ages 35 to 70 years are also recommended to receive regular screening for diabetes [9]. In Korea, all adults older than 19 years are eligible for general health screening every 2 years via the National Health Screening Program, which from 2009 began to focus on reducing cardiovascular and cerebrovascular diseases via health risk assessment and lifestyle modification [10,11]. The program consists of seven categories, including history taking and measurements of height, weight, body mass index (BMI), blood pressure, hemoglobin, glucose, lipids, liver function, and renal function. In 2009, 63.3% of about 16 million people who were eligible for the general health screening participated [12]. Participation rates have climbed steadily since, reaching 72.3% in 2014 and 74.1% in 2019.
Several studies have examined the relationship between health check-up participation and cardiovascular risk factors [13][14][15], finding that receiving health test interventions are associated with lowered risk scores and better life expectancy [15], lower risks of all-cause mortality and cardiovascular events [14], and lower prevalence of several diseases, including metabolic syndrome [13], and that screening interval can affect health outcomes [16,17]. Meanwhile, however, some studies have indicated that screening participation has no effect on changes in disease prevalence in comparison to only receiving usual care [18]. Nevertheless, there is not yet enough data on whether the frequency of participation in optional health examinations can affect cardiometabolic risk factors, and few longitudinal studies have explored the effects of health examinations over an expansive time frame.
Therefore, in this study, we aimed to determine the relationship between screening frequency and new incidences of hypertension and diabetes utilizing national screening data collected over a 10-year period. We also set out to build a model based on these findings that can accurately predict the incidence of hypertension and diabetes.

Study Design
The data of this study are longitudinal and were captured over a baseline period from 2002 to 2011. Depending on the number of health check-ups over the past 10 years, longterm follow-up was performed to determine whether hypertension or diabetes occurred from the date of the last checkup to 2017. To accurately determine whether hypertension or diabetes occurred, all subjects with hypertension or diabetes before the last examination were removed. In consideration of the fact that national health check-ups can be received once every 2 years from the age of 19 years or older and the 10-year baseline period, the age of the subjects was set at 30 years or older, and subjects who underwent more than five check-ups were grouped separately from those with fewer than five check-ups to compare the incidences of hypertension and diabetes ( Figure 1). tion were removed. In consideration of the fact that national health check-ups can be received once every 2 years from the age of 19 years or older and the 10-year baseline period, the age of the subjects was set at 30 years or older, and subjects who underwent more than five check-ups were grouped separately from those with fewer than five check-ups to compare the incidences of hypertension and diabetes ( Figure 1).

Dataset
The dataset used in this study was extracted from the National Health Insurance Service as a customized research database from 2002 to 2017, and the data reflected subjects who received health check-ups from 2002 to 2011 (n = 2,229,319). Among them, subjects with hypertensive disease codes (I10-I15) or who were taking antihypertensive drugs (Table S1) and diabetes disease codes (E11-14) or who were taking diabetes medications (Table S2) before the last screening were excluded (n = 1,502,179). Considering that national health check-ups are provided every 2 years for all people over the age of 19, subjects under the age of 30 were removed (n = 59,048), along with subjects with missing values in the database (n = 58,517) and subjects who died before 2011 (n = 162,431). In consideration of the fact that national health check-ups are provided every 2 years, the subjects were divided into those who received more than five health check-ups (n = 330,517) and those who received fewer than five health check-ups (n = 983,976). After propensity score matching (PSM), the final dataset consisted of 330,517 subjects who received more than five health check-ups or fewer than five health check-ups ( Figure 2).

Dataset
The dataset used in this study was extracted from the National Health Insurance Service as a customized research database from 2002 to 2017, and the data reflected subjects who received health check-ups from 2002 to 2011 (n = 2,229,319). Among them, subjects with hypertensive disease codes (I10-I15) or who were taking antihypertensive drugs (Table S1) and diabetes disease codes (E11-14) or who were taking diabetes medications (Table S2) before the last screening were excluded (n = 1,502,179). Considering that national health check-ups are provided every 2 years for all people over the age of 19, subjects under the age of 30 were removed (n = 59,048), along with subjects with missing values in the database (n = 58,517) and subjects who died before 2011 (n = 162,431). In consideration of the fact that national health check-ups are provided every 2 years, the subjects were divided into those who received more than five health check-ups (n = 330,517) and those who received fewer than five health check-ups (n = 983,976). After propensity score matching (PSM), the final dataset consisted of 330,517 subjects who received more than five health check-ups or fewer than five health check-ups ( Figure 2).

Measurements and Definition
The independent variables used to determine the characteristics related to the occurrence of hypertension and diabetes mellitus according to the number of examinations were as follows: age, sex, income, BMI, diastolic blood pressure, systolic blood pressure, fasting blood sugar, total cholesterol, alcohol consumption, smoking, and physical activ-

Measurements and Definition
The independent variables used to determine the characteristics related to the occurrence of hypertension and diabetes mellitus according to the number of examinations were as follows: age, sex, income, BMI, diastolic blood pressure, systolic blood pressure, fasting blood sugar, total cholesterol, alcohol consumption, smoking, and physical activity. Income level was calculated based on the insurance owner's income level to claim health insurance premiums and was classified into quartiles. Body mass index (BMI) was estimated as body weight (kg) divided by height squared (m 2 ). Blood samples were taken during the health examination after an overnight fast of at least 8 h and total cholesterol was enzymatically assessed. Alcohol consumption was divided into four categories grouped according to the frequency for weekly consumption of alcohol. Smoking was divided into 3 categories according to current smoking status. Physical activity was assessed in accordance with Metabolic Equivalent of Task definitions [19].

Study Outcomes
The outcomes of this study were hypertension, diabetes mellitus, and both diseases (hypertension and diabetes mellitus). Hypertension was defined as a blood pressure ≥ 140/90 mmHg or at least one claim per year for antihypertensive medication prescription under ICD-10 codes I10-I15. Diabetes mellitus was defined as type 2 diabetes with an FPG level ≥ 126 mg/dL or at least one claim per year for the prescription of hypoglycemic drugs under ICD-10 codes E11-14 [20]. Having both diseases was defined as the occurrence of both hypertension and diabetes mellitus within the follow-up period. The observation period was set from 1 January 2002 to 31 December 2011, and the follow-up period was set from 1 January 2012 to 31 December 2017. However, within the observation period, the final examination date may vary for each subject. This period can range from a minimum of 6 years to a maximum of 16 years, as the study design is followed-up from the date of final screening.

Statistical Analysis
Categorical variables are described as numbers and percentages. Continuous variables that followed normal distribution are summarized as means and standard deviations. To compare the baseline characteristics of the participants, an independent t-test was used for parametric continuous variables, and the Wilcoxon rank-sum test was used for nonparametric variables. The chi-square test was used for categorical variables. To reduce the effect of selection bias and potential confounders, we adjusted for significant differences in the baseline characteristics of subjects using 1:1 PSM with caliper set to 0.25 [21]. As variables for PSM, age, sex, systolic blood pressure, diastolic blood pressure, and fasting blood glucose, which are variables directly related to hypertension and diabetes mellitus, were used. A value corresponding to 1⁄4 of the standard error of the estimated propensity score was designated as a range and used for pairing. Only when the difference in propensity score between paired subjects falls within this range, pairing is included in the analysis. All excluded subjects were excluded from the analysis. All PSM procedures were performed, and we compared baseline covariates between the groups. Continuous variables were compared using paired t tests or Wilcoxon signed-rank tests, as appropriate, and categorical variables were compared using McNemar's test. Univariate logistic regression and multiple logistic regression analyses were used to evaluate the relationship between screening frequency and the main outcomes and the relationship between explanatory variables and the main outcomes. All p values less than 0.05 were considered statistically significant, and standardized differences of covariates used in PSM analysis were considered significant when less than 0.1.
Logistic regression, random forest, and XGBoost were used to develop the predictive model. Logistic regression is a probability model that can be explained by a model using relationships between independent and dependent variables. Random forest creates a large number of decision trees by randomly sampling training data and then uses the results of the decision trees to derive a final result [22]. XGBoost was developed using the negative slope of loss function as the residual value of the current fitting to achieve an accurate classification effect. XGBoost reduces overfitting by performing a quadratic Taylor extension of the loss function and adding a regular term outside the loss function to balance the reduction of the loss function with the complexity of the model [23].
We divided the full dataset into a training dataset (70%) and a test dataset (30%) using random sampling. Since the range for each independent variable differed, all of the variables were normalized with the minimum and maximum values of each variable in the training dataset. We used a grid search to determine the optimal hyper-parameters and performed 5-fold cross validation to prevent overfitting. Since all datasets comprise imbalanced data, the proportion of unbalanced data was made the same using the synthetic minority oversampling technique [24]. In order to evaluate the performance of each prediction model, we used receiver operating characteristics curves, area under the curve (AUC), F1 score, and accuracy. The software used for the analyses were as follows: SAS 9.

Results
The baseline characteristics of the study subjects are presented in Table 1. We found that participation rates in the health screening program were lower among women than men. Among subjects who underwent screening less than five times, systolic blood pressure, fasting blood glucose, total cholesterol, rate of drinking more than half a week, and the proportion of subjects currently smoking were high; physical activity was very low; and the incidences of hypertension alone, diabetes alone, and both diseases were high. Data are presented as a n (%) or mean (SD). Figure 3 presents the odds ratios and 95% confidence intervals for relationships between screening frequency and hypertension and diabetes. Individuals who participated in screening once were associated with a significantly increased odds for having hypertension (OR = 2.18, 95% CI = 2.14-2.22), diabetes mellitus (OR = 1.33, 95% CI = 1.30-1.35), and both diseases (OR = 2.46, 95% CI = 2.39-2.53) compared to the nationally recommended screening frequency of five. Individuals who underwent screening 10 times were associated with a significantly lower odds for having hypertension (OR = 0.86, 95% CI = 0.83-0.88), diabetes mellitus (OR = 0.83, 95% CI = 0.81-0.85), and both diseases (OR = 0.83, 95% CI = 0.79-0.87) compared to the nationally recommended screening frequency of five. We performed sensitivity analyses according to the presence of hypertension and diabetes mellitus and screening frequency ( Figure S1). Figure S1 displays age groups stratified by 10 years and sex, and when the adjusted odds ratio and odds ratio values were compared, the same trends noted above were observed without any difference.
The ORs and confidence intervals for hypertension (OR = 1.61, 95% CI = 1.59-1.62; AUC = 0.66), diabetes mellitus (OR = 1.21, 95% CI = 1.20-1.22; AUC = 0.59), and both diseases (OR = 1.75, 95% CI = 1.72-1.78; AUC = 0.63) among individuals who attended screening less than five times and individuals who attended screening five times or more as a reference are shown in Table 2.  Table 3 shows the results of the evaluation of each model. Random forest and XGBoost performed better than logistic regression models for all metrics in all datasets, and XGBoost performed the best. In particular, the model for predicting hypertension and diabetes showed the highest performance (accuracy, 0.950; f 1 -score, 0.950; AUC, 0.952 [0.951-0.953]). Figure 4 comprises a comparison of receiver operating characteristics curves for the logistic regression model, random forest, and XGBoost for each dataset and supports the results shown in Table 3. The variables with the highest importance in most models included age, screening frequency, sex, smoking, and BMI. Of these, age and screening frequency had the greatest effect on model performance.
in screening once were associated with a significantly increased odds for having hypertension (OR = 2.18, 95% CI = 2.14-2.22), diabetes mellitus (OR = 1.33, 95% CI = 1.30-1.35), and both diseases (OR = 2.46, 95% CI = 2.39-2.53) compared to the nationally recommended screening frequency of five. Individuals who underwent screening 10 times were associated with a significantly lower odds for having hypertension (OR = 0.86, 95% CI = 0.83-0.88), diabetes mellitus (OR = 0.83, 95% CI = 0.81-0.85), and both diseases (OR = 0.83, 95% CI = 0.79-0.87) compared to the nationally recommended screening frequency of five. We performed sensitivity analyses according to the presence of hypertension and diabetes mellitus and screening frequency ( Figure S1). Figure S1 displays age groups stratified by 10 years and sex, and when the adjusted odds ratio and odds ratio values were compared, the same trends noted above were observed without any difference.

Discussion
Previous studies on the impact of participation in health screening programs have offered varying results. One randomized trial reported that health test participation was associated with reduced cardiovascular risk score [15], while another showed that population screening increased the detection of CVD risk factors, such as hypertension [25]. General screening has been shown to be associated with increased detection of chronic disease in several other studies [14,18,[26][27][28][29][30]. A review article also reported that general health checks were associated with increased detection and treatment of chronic diseases, but had little impact on mortality or cardiovascular event reductions [31]. However, these results remain controversial as other studies have reported that health check participation is associated with fewer cardiovascular risk factors [13], risk factor value reductions [32], and mortality reductions [33]. Additional studies have been conducted to identify social elements that can affect participation, such as perceived susceptibility and health knowledge [34,35], socioeconomic status [36,37], medical history [38][39][40][41], health behavior factors, and sociodemographic factors [42,43]. Despite the above reasons, there have not been many studies on chronic disease prediction models according to health check-ups, so it is necessary to confirm new facts using large-scale data.
A hypothesis that the incidences of hypertension and diabetes would differ in two groups divided according to the number of health examinations could be established due to the lack of studies on the relationship between health check-ups and chronic diseases and the fact that a lot of information was available in the NHIS data. However, since an accurate verification method was needed to prove our hypothesis, we applied two research methods: confirmatory and exploratory. In order to conduct a confirmatory study, age, sex, systolic blood pressure, diastolic blood pressure, and fasting blood glucose, which may act as confounding variables, were examined to evaluate the hypothesis. In logistic regression analysis to compare the incidences of hypertension and diabetes ac-

Discussion
Previous studies on the impact of participation in health screening programs have offered varying results. One randomized trial reported that health test participation was associated with reduced cardiovascular risk score [15], while another showed that population screening increased the detection of CVD risk factors, such as hypertension [25]. General screening has been shown to be associated with increased detection of chronic disease in several other studies [14,18,[26][27][28][29][30]. A review article also reported that general health checks were associated with increased detection and treatment of chronic diseases, but had little impact on mortality or cardiovascular event reductions [31]. However, these results remain controversial as other studies have reported that health check participation is associated with fewer cardiovascular risk factors [13], risk factor value reductions [32], and mortality reductions [33]. Additional studies have been conducted to identify social elements that can affect participation, such as perceived susceptibility and health knowledge [34,35], socioeconomic status [36,37], medical history [38][39][40][41], health behavior factors, and sociodemographic factors [42,43]. Despite the above reasons, there have not been many studies on chronic disease prediction models according to health check-ups, so it is necessary to confirm new facts using large-scale data.
A hypothesis that the incidences of hypertension and diabetes would differ in two groups divided according to the number of health examinations could be established due to the lack of studies on the relationship between health check-ups and chronic diseases and the fact that a lot of information was available in the NHIS data. However, since an accurate verification method was needed to prove our hypothesis, we applied two research methods: confirmatory and exploratory. In order to conduct a confirmatory study, age, sex, systolic blood pressure, diastolic blood pressure, and fasting blood glucose, which may act as confounding variables, were examined to evaluate the hypothesis. In logistic regression analysis to compare the incidences of hypertension and diabetes according to the two groups, age and sex, as well as the number of health examinations, were adjusted, and sensitivity analysis was used to increase the reliability of the results. In order to proceed with the exploratory study, a classification model was developed using various variables, and the importance of the top five variables among the variables used in the model was extracted to confirm the relationships between the variables used in our studies.
There are many previous studies that predict diabetes or hypertension. People with a family history of diabetes are known to have a higher risk of diabetes than people without a family history of diabetes [44]; most of the family history variables are used in the diabetes predictive model [45,46]. However, family history variables were not considered in our study for the following reasons. Since the main purpose of this study is to explore whether or not health check-ups have a significant effect on the occurrence of chronic diseases, the predictive model requires a large number of subjects. Using family history of diabetes as a variable in the NHIS data has the disadvantage of a significant decrease in subjects (Table S3). However, if it is simply for the above reasons, there may be a prejudice that it is to show that the results of our study are good. Therefore, logistic regression analysis was performed, and there was no difference between the results of adding a family history of diabetes and the results obtained in the study, AUC, and odds ratio (Tables S4 and S5). The results of logistic regression analysis confirmed that the probability of high blood pressure and diabetes decreased as the number of health examinations increased. Similar results have been observed in previous studies [14,15,18,[26][27][28][29][30]. Upon extracting the importance of the variables used in the analyses and the variables used in the modeling, we confirmed that the number of health examinations and age were the most important variables in several models.
In terms of general health screening, the Korean government has not only continued expanding the eligible population and facilitating participation but has also implemented policies to ensure improvements in quality, equity, and result management in the national health screening program [47]. One of these policies is the third comprehensive National Health Screening Program plan [48], which was announced in 2021. In this, the government planned to promote self-health management by providing more in-depth information on individual health screening results, regional screening center locations, and reservation notifications through mobile applications. Other laws are also in place to encourage follow-up hospital visits after general check-ups, subsidizing medical expenses incurred by examinations for definite diagnosis in people who have taken national health examinations [49]. Such efforts of the Korean government and the medical community have led to a rise in health screening follow-up management rates from 6.0% in 2008 to 13.9% in 2018 [50].
The results of this study showed that an increased participation in health screening was significantly related to the lowered occurrence of hypertension and diabetes. This suggests that encouraging the general population to participate diligently in health screenings may be an important factor to preventing disease and improving public health. Previous studies on the impact of screening on future health-related behaviors support this claim [51,52], stating that receiving the screening results may be a teachable moment for the participants, which then can lead to a change in lifestyle factors. In Korea, 76.9% of the insured population and 41.9% of recipients of medical care assistance participated in the 2018 national health check-up [53,54]. While these participation rates cover most of the population, there remain people who do not receive adequate screening for whichever reason. Therefore, it is important that national health check-ups be publicized and made easily accessible to all persons eligible for participation, regardless of socio-economic status or location. Encouraging health check-ups using the result and model from our study is a simple step that will help prevent chronic disease and will also help achieve the health goals set by the government.
There are some limitations to this study that should be taken into consideration. First is the possibility of self-selection bias, as it is entirely up to the individual to choose whether they participated in the screening program. Since we do not know the reasons for nonparticipation, the results of this study should be approached with caution. Second is the possibility of response bias in the self-report part of the screening program. Further randomized controlled trials are needed to clarify the results of health screening participation. Finally, in order to know how much predictive power depends on the number of health check-ups, a predictive model was constructed using logistic regression and machine learning technique. Better performance results could be obtained with using deep learning methods, such as long short-term memory and recurrent neural networks which can use risk factors that affect outcomes over time [46].

Conclusions
It was confirmed that the health check-up had a great influence on the occurrence of hypertension and diabetes. Periodic health check-ups are necessary to manage the occurrence of chronic diseases.

Supplementary Materials:
The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/diagnostics12081967/s1, Figure S1: Sensitivity analysis results; Table S1: Pharmaceutical ingredient codes for hypertension medication; Table S2: Pharmaceutical ingredient codes for diabetes medication; Table S3: Baseline characteristics after Propensity Score Matching (Addition of family history variables of diabetes and hypertension; Table S4: Logistic regression of hypertension and diabetes mellitus according to screening frequency; Table S5: Logistic regression of hypertension and diabetes mellitus according to screening frequency.