A Predictive Model of Ischemic Heart Disease in Middle-Aged and Older Women Using Data Mining Technique

This study was conducted to identify factors related to ischemic heart disease and the groups vulnerable to it among Korean middle-aged and older women, using data from the Korea National Health and Nutrition Examination Survey (KNHANES). Among the 24,229 people who participated in the 2017–2019 survey, 7249 women aged 40 and over were included in the final analysis. The data were analyzed with IBM SPSS and SAS Enterprise Miner using chi-squared analysis, logistic regression analysis, and decision tree analysis. The prevalence of ischemic heart disease, defined as a diagnosis of myocardial infarction or angina, was 2.77%. The factors associated with ischemic heart disease in middle-aged and older women were age, family history, hypertension, dyslipidemia, stroke, arthritis, and depression. The group most vulnerable to ischemic heart disease consisted of menopausal women who had hypertension and a family history of ischemic heart disease. Based on these results, effective management should be pursued by applying customized medical and health management services for each relevant factor, taking into account the characteristics of the groups at potential risk. The results can serve as basic data for national policy decision making on the management of chronic diseases.


Introduction
Ischemic heart disease carries a high social burden and causes considerable death and disability [1]. The prevalence of cardiovascular disease increases rapidly in women after the age of 40 because of menopause-related hormonal changes, physical changes related to aging, and increased fat accumulation [2]. Previous studies reported that women with risk factors for cardiovascular disease had a 19% increase in the incidence of myocardial infarction after 10 years and that the quality of life of middle-aged women with cardiovascular disease was poor [3,4]. The life expectancy of Korean women in 2015 was 85.2 years [5], so the prevention and management of cardiovascular disease in middle-aged women are very important in preparing for a healthy old age. According to the 2017 Statistical Annual Report of Causes of Death in Korea, cardiovascular disease is the second-highest cause of death after cancer, and mortality from cardiovascular disease tends to increase sharply with age. In particular, mortality from hypertensive disease (2.3 times) and heart disease (1.1 times) was higher among women than among men [6]. Cardiovascular disease is a major chronic disease. Chronic diseases have various contributing causes but no single direct cause, which makes early diagnosis difficult. In addition, the time of disease onset is unclear and the latent period is long [7]. Because prevention is therefore emphasized more than treatment in chronic diseases, identifying the characteristics of groups at high risk of cardiovascular disease and providing interventions tailored to those characteristics should make prevention and management more effective.
The risk factors for cardiovascular disease identified in previous studies included gender, age, marital status, income, education, diabetes, hypercholesterolemia, family history, smoking, drinking, obesity, lack of physical activity, and stress [8–19]. However, these studies investigated the incidence of cardiovascular disease in relation to only a few factors and focused on specific groups, such as young men and the elderly, as study subjects [8,9,12–16,18,19]. Studies that have confirmed a broad range of characteristics in the groups vulnerable to cardiovascular disease are lacking. Therefore, to understand the characteristics of the groups vulnerable to ischemic heart disease, research using data mining techniques is required.
Data mining technology allows for the exploration, identification, and modeling of the relationships and rules that exist in big data [20]. Recently, data mining methods have been applied in the medical field to areas such as medical research, diagnosis, quality control, hospital management, and customer relationship management [21–23]. Decision tree analysis, one such data mining technique, is an effective tool for classification and prediction and is therefore useful for discovering hidden patterns in data [24]. Predicting cardiovascular disease risk using decision support systems can play an important role in disease prevention [24].
In this study, we aimed to analyze the factors related to ischemic heart disease in middle-aged and older women using data from the Korea National Health and Nutrition Examination Survey (KNHANES), which are representative of the Korean population of middle-aged and older women, and to develop a prediction model for ischemic heart disease. The specific study purposes were as follows.

• Identify the sociodemographic characteristics and health-related behavior characteristics of the study subjects.
• Identify the differences in the prevalence of ischemic heart disease according to sociodemographic characteristics, health-related behaviors, and the presence of chronic diseases.
• Identify the factors that affected ischemic heart disease in middle-aged women.
• Utilizing data mining techniques, develop a predictive model for ischemic heart disease in middle-aged women.
The results of this study can be used as important foundational data for regional and national health policy decisions on the prevention and management of ischemic heart disease.

Study Population
This study used the raw data of the Korea National Health and Nutrition Examination Survey (KNHANES) (2017–2019), a statutory survey based on Article 16 of the National Health Promotion Law. Data use was approved according to the procedures of the Korea Disease Control and Prevention Agency (KDCA). The target population of the KNHANES comprised people aged one year or older residing in Korea, and a two-step stratified cluster sampling method was applied, with the survey district and the household as the primary and secondary extraction units, respectively. The number of survey districts was 192 and there were 23 sampled households. Within the sampled households, all household members aged one year or older who satisfied the appropriate household size were selected as survey subjects. Among the 24,229 people who participated in the 2017–2019 survey, 7249 women aged 40 and over were included in the final analysis after excluding missing data (Figure 1). The age groups of the study subjects were 40–49 years old (1822 people), 50–59 years old (1966 people), 60–69 years old (1761 people), 70–79 years old (1257 people), and 80 years old and over (443 people).
The study was conducted in accordance with the Declaration of Helsinki. Ethical review and approval were waived for this study because it used anonymous public open data and not an individual's personal data.

Variable Definitions
The dependent variable, the presence or absence of ischemic heart disease, was taken from the answer to the question "Have you ever been diagnosed with myocardial infarction or angina by your doctor?" The characteristics of the study subjects were classified into sociodemographic factors, health behavior factors, and clinical factors. The independent variables for sociodemographic characteristics were age, marital status, education, household income, subjective health status, and stress. Marital status was divided into married and unmarried, and education levels were classified as less than elementary school, middle school, high school, and college or higher. Household income levels were divided into four categories (lower, lower middle, upper middle, and upper) based on quartiles of equivalized household income. The independent variables for health behavior characteristics were smoking, alcohol drinking, and physical activity. Smoking status was divided into daily smoker, occasional smoker, past smoker, and non-smoker. Drinking was classified, following the raw data, by whether or not the respondent had ever consumed alcohol in her lifetime. The physical activity variable used the response to the question "Does your work or leisure activity involve moderate-intensity physical activity with a slight shortness of breath or moderately rapid heart rate for at least 10 min?" The clinical characteristic variables were body mass index (BMI), menopause, family history, hypertension, stroke, arthritis, diabetes mellitus, depression, renal failure, and dyslipidemia, selected with reference to previous studies [7,10,24–26]. BMI was classified as underweight for a value of less than 18.5 kg/m², normal for 18.5 kg/m² or more and less than 25.0 kg/m², and obesity for 25.0 kg/m² or more. Family history was defined as at least one parent or sibling having a history of ischemic heart disease.
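The BMI categories defined above can be expressed as a small helper. This is a minimal sketch in Python, not the authors' SPSS/SAS procedure; the function names `classify_bmi` and `bmi_from_height_weight` are hypothetical and chosen only for illustration.

```python
def bmi_from_height_weight(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / (height_m ** 2)


def classify_bmi(bmi: float) -> str:
    """Map a BMI value to the three categories stated in the text.

    underweight: BMI < 18.5
    normal:      18.5 <= BMI < 25.0
    obesity:     BMI >= 25.0
    """
    if bmi <= 0:
        raise ValueError("BMI must be positive")
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal"
    return "obesity"
```

The boundaries are half-open so that a value of exactly 18.5 kg/m² is classified as normal and exactly 25.0 kg/m² as obesity, matching the wording of the definitions.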

Statistical Analysis
The data were analyzed using IBM SPSS version 25.0 (IBM Co., Armonk, NY, USA) and SAS Enterprise Miner 9.4. A chi-squared analysis was conducted to examine differences in the prevalence of ischemic heart disease according to sociodemographic characteristics, health-related behaviors, and the presence of chronic diseases. Logistic regression analysis was performed to identify the factors influencing the prevalence of ischemic heart disease. Statistical significance was assessed with two-sided tests at p < 0.05. An interactive decision tree analysis and a random forest analysis were performed to develop a predictive model of ischemic heart disease.
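To make the chi-squared comparison and the odds-ratio scale of logistic regression concrete, the following Python sketch computes a Pearson chi-squared statistic, its two-sided p-value (1 degree of freedom), and a crude odds ratio with a Wald 95% confidence interval for a single 2×2 table. It is an illustration under stated assumptions: it ignores the KNHANES complex sampling design, the odds ratio is unadjusted (the study's logistic regression included multiple covariates), and the function names are hypothetical.

```python
import math


def chi_squared_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-squared (no continuity correction) for the table
    [[a, b], [c, d]], rows = factor present/absent, columns = disease yes/no.
    Returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if 0 in margins:
        raise ValueError("every row and column total must be non-zero")
    stat = n * (a * d - b * c) ** 2 / math.prod(margins)
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


def odds_ratio_wald_ci(a: int, b: int, c: int, d: int,
                       z: float = 1.959964) -> tuple[float, float, float]:
    """Crude odds ratio (a*d)/(b*c) with a Wald confidence interval
    computed on the log-odds scale. Returns (or, lower, upper)."""
    if 0 in (a, b, c, d):
        raise ValueError("all cells must be non-zero for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical counts for illustration only (not taken from the study):
# 30 of 100 women with the factor and 10 of 100 without it have the disease.
stat, p_value = chi_squared_2x2(30, 70, 10, 90)
or_, lower, upper = odds_ratio_wald_ci(30, 70, 10, 90)
```

An interval that excludes 1 corresponds to a statistically significant association at the chosen level, which is how the confidence intervals reported for the logistic regression are read.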

General Characteristics of the Study Subjects
The distribution of ischemic heart disease cases according to general characteristics is described in Table 1. The prevalence of ischemic heart disease was high among women aged 80 years and over (7.7%) and among those with an educational attainment of less than elementary school (5.6%). It was also high in women with low household incomes (5.3%), women who reported very poor subjective health (9.1%), and women who experienced a lot of stress (4.8%). The differences in prevalence according to age, education level, household income, subjective health status, and stress awareness were statistically significant (p < 0.05) (Table 1).

Health Behavior and Clinical Characteristics of the Study Subjects
The prevalence of ischemic heart disease according to health behavior characteristics is shown in Table 2. Among smoking, drinking, and physical activity, only drinking experience showed a statistically significant difference in the prevalence of ischemic heart disease (Table 2). The distribution of ischemic heart disease cases according to clinical characteristics is described in Table 3. The prevalence of ischemic heart disease was high among obese women (4.1%), menopausal women (3.8%), and those with a family history of ischemic heart disease (5.0%). Moreover, the prevalence was statistically significantly higher in the presence of comorbidities including hypertension, dyslipidemia, stroke, arthritis, diabetes mellitus, depression, and renal failure (p < 0.05) (Table 3).

Predictive Factors of Ischemic Heart Disease
Logistic regression analysis was performed to identify the factors related to ischemic heart disease in middle-aged women (Table 4). The analysis showed that ischemic heart disease in middle-aged women was significantly associated with age, physical leisure activity, family history, hypertension, dyslipidemia, stroke, arthritis, and depression. The odds of ischemic heart disease were 16.73 times higher in women aged 80 years and over than in those aged 40–49 years. The odds were 3.29 times (95% confidence interval (CI): 2.03–5.32) higher in those with a family history than in those without, and 1.42 times (95% CI: 1.01–2.00) higher in those with hypertension than in those without. The odds were 1.70 times (95% CI: 1.24–2.33) higher in women with dyslipidemia, 1.93 times (95% CI: 1.18–3.18) higher in those with a previous stroke, 1.43 times (95% CI: 1.05–1.94) higher in those with arthritis, and 1.66 times (95% CI: 1.09–2.51) higher in those with depression (Table 4).
Decision tree analysis was performed to identify the groups at risk of ischemic heart disease among the study subjects. The trees were grown with the classification and regression tree (CRT) method, which splits each node so that the resulting child nodes are as homogeneous as possible (Figure 2). At the researchers' discretion, the interactive decision tree analysis focused on health behavior and clinical characteristics and excluded the age variable, which was too strongly associated with the outcome in the logistic regression analysis. The analysis produced a total of eight terminal nodes, and the seventh node (16.67%) was found to be the most vulnerable to ischemic heart disease (Figure 2).
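The homogeneity criterion behind a CRT-style split can be sketched as follows. This Python example assumes the Gini index as the impurity measure (the text does not state which criterion SAS Enterprise Miner was configured to use) and works on binary predictors only; the function names and the toy records are hypothetical and are not study data.

```python
def gini_impurity(n_pos: int, n_neg: int) -> float:
    """Gini impurity of a node with n_pos cases and n_neg non-cases.
    0.0 means the node is perfectly homogeneous."""
    n = n_pos + n_neg
    if n == 0:
        return 0.0
    p = n_pos / n
    return 2.0 * p * (1.0 - p)  # equals 1 - p**2 - (1 - p)**2


def count_classes(rows: list[dict], target: str) -> tuple[int, int]:
    """Return (number of cases, number of non-cases) in a set of rows."""
    pos = sum(1 for r in rows if r[target] == 1)
    return pos, len(rows) - pos


def best_binary_split(rows: list[dict], features: list[str], target: str):
    """Pick the binary feature whose split gives the largest drop in
    weighted Gini impurity. Returns (feature_name, impurity_decrease)."""
    parent_impurity = gini_impurity(*count_classes(rows, target))
    best_feature, best_gain = None, 0.0
    for feature in features:
        yes = [r for r in rows if r[feature] == 1]
        no = [r for r in rows if r[feature] != 1]
        if not yes or not no:
            continue  # this feature does not separate the rows
        weighted = (
            len(yes) / len(rows) * gini_impurity(*count_classes(yes, target))
            + len(no) / len(rows) * gini_impurity(*count_classes(no, target))
        )
        gain = parent_impurity - weighted
        if gain > best_gain:
            best_feature, best_gain = feature, gain
    return best_feature, best_gain


# Toy records for illustration only.
toy_rows = [
    {"hypertension": 1, "menopause": 1, "ihd": 1},
    {"hypertension": 1, "menopause": 0, "ihd": 1},
    {"hypertension": 1, "menopause": 1, "ihd": 1},
    {"hypertension": 1, "menopause": 0, "ihd": 0},
    {"hypertension": 0, "menopause": 1, "ihd": 0},
    {"hypertension": 0, "menopause": 0, "ihd": 0},
    {"hypertension": 0, "menopause": 1, "ihd": 0},
    {"hypertension": 0, "menopause": 0, "ihd": 0},
]
split_feature, split_gain = best_binary_split(
    toy_rows, ["hypertension", "menopause"], "ihd"
)
```

A full tree repeats this selection recursively within each child node until a stopping rule is met; an interactive analysis additionally lets the analyst override or restrict the candidate variables at each step.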
The seventh node consisted of women with hypertension, a family history of ischemic heart disease, and menopause. The group with hypertension had a higher risk of ischemic heart disease than the group without hypertension; within the group without hypertension, the risk was highest at the eleventh node (10.47%), which consisted of women with diabetes and arthritis (Figure 2).
An interactive decision tree analysis and a random forest analysis were performed to develop a predictive model of ischemic heart disease. The results of the random forest algorithm used to predict the presence or absence of ischemic heart disease are shown in Table 5. In the random forest analysis, the important variables for predicting ischemic heart disease were, in order, age, dyslipidemia, education level, arthritis, hypertension, diabetes, depression, family history, menopause, and stroke (Table 5). For model comparison, prediction models built with logistic regression, decision tree, and random forest algorithms were compared. The sensitivity, specificity, and accuracy of each model were confirmed, and the models were evaluated using the area under the ROC curve (AUC). The closer the AUC is to 1, the better the performance of the model. A model with an AUC of 0.8 or more is evaluated as stable, and all three prediction models presented in this study had an AUC of 0.8 or more. The AUC was highest for the random forest, at 0.872. Accuracy, sensitivity, and specificity were also highest for the random forest (Table 6).
Table 5. Prediction of ischemic heart disease in middle-aged and older women using a random forest.

Variables | Gini Importance | Gini Importance (Out of Bagging)
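The evaluation measures used for the model comparison (sensitivity, specificity, accuracy, and AUC) can be computed as in the following Python sketch. It uses the pairwise-comparison formulation of the AUC, which is equivalent to the area under the empirical ROC curve; the function names, the 0.5 classification threshold, and the toy labels and scores are assumptions for illustration and do not reproduce the values in Table 6.

```python
def classification_metrics(y_true: list[int], y_pred: list[int]) -> dict:
    """Sensitivity, specificity, and accuracy from binary labels (1 = case)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / len(y_true),
    }


def auc(y_true: list[int], scores: list[float]) -> float:
    """Probability that a randomly chosen case receives a higher score
    than a randomly chosen non-case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present in y_true")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities for illustration only.
toy_truth = [1, 1, 0, 0, 0, 1]
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
toy_pred = [1 if s >= 0.5 else 0 for s in toy_scores]  # assumed threshold
toy_metrics = classification_metrics(toy_truth, toy_pred)
toy_auc = auc(toy_truth, toy_scores)
```

Because the AUC is computed from the ranking of the scores, it does not depend on the threshold, whereas sensitivity, specificity, and accuracy do. With an outcome as rare as in this study, accuracy alone can be high even for a model that identifies few cases, which is one reason to report all four measures together.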

Discussion
With the development of medical technology, life expectancy has increased, and women now spend more than a third of their lives after middle age. Middle age in women spans the period just before and just after the onset of menopause, and because health management after middle age is closely related to quality of life, active health management is necessary [27]. This study was therefore performed to contribute to the prevention and management of ischemic heart disease by identifying the factors related to ischemic heart disease in middle-aged and older Korean women and by identifying the vulnerable groups with a high prevalence of the disease.
The prevalence of ischemic heart disease in this study was 2.77%, including those diagnosed with myocardial infarction or angina. This was slightly higher than the figure from a previous study [28], which suggested that about 1.72% of the world's population is affected by ischemic heart disease. When the prevalence of ischemic heart disease was compared by age, it increased rapidly after 60 years of age relative to the 40–49 age group. Previous studies have also shown that cardiovascular disease increases rapidly after 50 years of age [29,30]. In particular, it is known that as women transition from middle age to old age, the incidence of cardiovascular disease increases because of changes in female hormones, physical changes with aging, and an increase in body fat accumulation [2,27].
In this study, family history, hypertension, dyslipidemia, stroke, arthritis, and depression were statistically significant clinical factors associated with ischemic heart disease, whereas smoking, drinking, and physical activity were not related factors. Because the association was investigated in middle-aged and older women, the results differed from those of previous studies [9,11,29,31,32] in which smoking, drinking, and physical activity were associated with ischemic heart disease. Those studies were conducted on both men and women with cardiovascular disease [9] and on women in their 30s or older [11], and the results may have differed because their data sets included more cardiovascular disease patients than the one used in this study. According to a previous study by Lim [33] using machine learning, the major risk factors affecting the occurrence of myocardial infarction and angina were age, hypertension, dyslipidemia, family history, low educational background, and gender, which is consistent with the results of this study. The diseases identified as risk factors for cardiovascular disease in this study were hypertension, dyslipidemia, stroke, arthritis, and depression. However, because it is difficult to clearly identify a causal relationship in a cross-sectional study, it is also possible that individuals with ischemic heart disease had a high prevalence of comorbidities such as hypertension and dyslipidemia because of more frequent health care encounters and screening opportunities. Diabetes and renal failure did not show a statistically significant association. These results were similar to those of previous studies [12,14,26,34,35] reporting that, among cardiovascular disease risk factors, depression and rheumatoid arthritis were significantly higher in women than in men, whereas diabetes was statistically significantly higher in men. In a study by Seo et al. [36], Korean adults with depression had a higher prevalence of cardiovascular disease than those without depression, and a previous study confirmed depression as a more significant cardiovascular disease risk factor in women than in men [35]. Decreased renal function may increase the prevalence of cardiovascular disease and increase mortality [37]. However, in this study, it was not a risk factor in middle-aged and older women.
In the decision tree analysis used to identify the groups vulnerable to ischemic heart disease, hypertension and family history emerged as the most relevant factors, consistent with the regression analysis. Focusing on hypertension, the most influential factor, the groups most vulnerable to ischemic heart disease were those who had hypertension, a family history, and were menopausal (16.67%), and those who had hypertension, no family history, and a previous stroke (15.44%). Taken together, the risk of ischemic heart disease in middle-aged and older women increased when related factors such as hypertension, a family history of ischemic heart disease, menopause, and stroke were combined. These results are consistent with previous studies [38–40] reporting that the risk of cardiovascular disease increases significantly in postmenopausal women.
The study results indicate that, for the prevention and effective management of ischemic heart disease in middle-aged and older women, customized programs that consider the characteristics of the subjects are strongly needed. The importance of women's health care after middle age is emphasized, but in most cases a uniform program is applied that lumps together the factors affecting cardiovascular disease [7]. In previous studies [8–19], risk factors for ischemic heart disease were selected based on socioeconomic characteristics, some comorbidities, and clinical test results. In this study, by contrast, most of the comorbidities, socioeconomic characteristics, and lifestyle behaviors suggested to be related in previous studies were included in the analysis. In addition, few previous studies have identified both the factors affecting ischemic heart disease and the risk groups in middle-aged women. According to the results of this study, family history, vascular disease, and depression, rather than menopause and lifestyle, appeared to be the biggest risk factors for cardiovascular disease in middle-aged women, which differs from previous studies. Based on these results, for the prevention and management of ischemic heart disease in middle-aged and older women, it is necessary first to classify the subjects according to the risk level of each vulnerable group and then to establish a customized prevention and management strategy according to the characteristics of the relevant factors in each group. Utilization of healthcare big data can contribute to substantial cost savings in the healthcare field by enabling data-driven, patient-customized medical services.
E-health and m-health devices that combine technologies such as big data, data mining, and deep learning are bringing innovation to disease prevention, diagnosis, and treatment by providing more effective and personalized solutions. If a function for identifying and managing high-risk and low-risk patients in advance with a predictive model is added to such systems, we believe it can contribute to the management of ischemic heart disease in middle-aged women. This study is significant in that it identified the characteristics of middle-aged and older women who are vulnerable to ischemic heart disease using large-scale data representing the entire Korean population. However, it had the following limitations. First, the study was cross-sectional, which makes it difficult to establish cause-effect relationships between the risk factors and ischemic heart disease. Second, data on clinical examinations related to ischemic heart disease were lacking. Third, because the ischemic heart disease variable was based on self-reported doctors' diagnoses, recall bias is possible. In the future, it will be necessary to apply big data mining techniques in more depth with more data and to conduct a prospective cohort study on the relationship between risk factors and ischemic heart disease that addresses the limitations of this study.

Conclusions
This study was conducted to identify factors related to ischemic heart disease and the vulnerable groups among Korean middle-aged and older women using data from the Korea National Health and Nutrition Examination Survey (KNHANES). The factors associated with ischemic heart disease in middle-aged and older women in this study were age, family history, hypertension, dyslipidemia, stroke, arthritis, and depression. The group most vulnerable to ischemic heart disease consisted of menopausal women with hypertension and a family history of ischemic heart disease. The study is meaningful in that a predictive model for ischemic heart disease, a disease with a high social burden, developed from healthcare big data can serve as basic data for national policy decision making on the prevention and management of chronic diseases. Based on these results, effective management should be pursued by applying customized medical and health management services for each relevant factor, taking into account the characteristics of the groups at potential risk. In addition, health policies should incorporate active screening programs and chronic disease management education programs that consider the comorbidities of patients with ischemic heart disease.

Conflicts of Interest:
The authors declare no conflict of interest.