Deep Learning Improves Prediction of Cardiovascular Disease-Related Mortality and Admission in Patients with Hypertension: Analysis of the Korean National Health Information Database

Objective: The aim of this study was to develop, compare, and validate models for predicting cardiovascular disease (CVD) mortality and hospitalization in patients with hypertension using a conventional statistical model and a deep learning model. Methods: Using the database of the Korean National Health Insurance Service, 2,037,027 participants with hypertension were identified. Among them, CVD (myocardial infarction or stroke) death and/or hospitalization that occurred within one year after the last visit were analyzed. Oversampling was performed using the synthetic minority oversampling technique to resolve the imbalance in the number of samples between case and control groups. Logistic regression and a deep neural network (DNN) were used to train models for assessing the risk of mortality and hospitalization. Findings: The deep learning-based prediction model showed higher performance than the logistic regression model in all datasets in predicting CVD hospitalization (accuracy, 0.863 vs. 0.655; F1-score, 0.854 vs. 0.656; AUC, 0.932 vs. 0.655) and CVD death (accuracy, 0.925 vs. 0.780; F1-score, 0.924 vs. 0.783; AUC, 0.979 vs. 0.780). Interpretation: The deep learning model could accurately predict CVD hospitalization and death within a year in patients with hypertension. The findings of this study could allow for prevention and monitoring by allocating resources to high-risk patients.


Introduction
Cardiovascular diseases (CVDs) such as coronary heart disease, stroke, and peripheral artery disease are well-known leading causes of mortality worldwide, with approximately 18 million deaths in 2019, making them a global health burden [1][2][3]. Hypertension, one of the strong, modifiable, and preventable risk factors for CVDs, shows a rapidly increasing prevalence, with more than 1.2 billion people diagnosed with hypertension in 2019 [4][5][6]. Since lowering blood pressure is a promising method to reduce CVDs and related hospitalization and mortality [7,8], various types of antihypertensive drugs, surgical intervention, and nonpharmacological therapy have been advanced over the last decade [3]. Nevertheless, more than half of men and women with hypertension have not received suitable treatment, and few models can predict the risk of complications [4]. Repeated hospitalization due to CVDs can reduce the quality of life, increase long-term mortality, and eventually increase socio-economic costs. To alleviate the burden of CVDs, the early detection or prediction of individuals with hypertension who are more likely to develop CVD-related events is necessary for prevention and to make appropriate decisions.
To improve the prognosis of patients, several CVD risk prediction models such as the Framingham risk score (FRS) [9], thrombolysis in myocardial infarction (TIMI) [10], systematic coronary risk evaluation (SCORE) [11], and QRISK [12] have been developed through regression-based methods in recent decades. These traditional statistical predictive models might be useful for an association analysis using a small number of variables. However, when a novel biomarker related to outcome is developed, it is necessary to analyze its accuracy in predicting outcome [13]. In addition, since large-scale, time-dependent datasets such as electronic health records (EHR) are available, an advanced model for maximizing risk stratification through repetitive measurements of explanatory variables and validation is needed [14].
Deep learning is a subset of machine learning that uses a multi-layered structure of algorithms called artificial neural networks (ANNs) inspired by human neural networks to perform automated feature learning [14]. To date, several studies in the cardiovascular field have shown the efficacy and potential of deep learning through high-accuracy disease prediction based on complex big data [15,16]. However, there is little research using deep learning models to predict prognosis related to CVD events in clinical patients with hypertension.
Thus, the aim of this study was to evaluate the discriminative accuracy of a deep learning-based prediction model for CVD mortality and hospitalization of patients with hypertension using the database of Korean National Health Insurance Service (NHIS) and to compare it with a conventional logistic regression model.

Study Design and Population
This study utilized the Korean National Health Information Database (NHID), a nationwide claims database produced by the NHIS. Approximately 97% of Koreans are enrolled in the NHIS. Mandatory health insurance is provided to enrolled citizens. The NHID includes sociodemographic data, lifestyle questionnaires, laboratory results, diagnoses, prescriptions, and data on death. Additional details about the database have been described elsewhere [17]. Diagnoses recorded in the NHID were based on the International Classification of Diseases, Tenth Revision (ICD-10) codes.
The overall study design is shown in Figure 1. The cohort included 3,196,373 individuals aged 19 years or older who were diagnosed with hypertension (I10-15 and O10-16) between 2002 and 2017. After excluding 159,346 individuals whose data on baseline variables were missing or who received two or more health checkups, 2,037,027 individuals were finally enrolled. Among them, we analyzed CVD-related death and/or hospitalization that occurred within one year after the last visit. CVD-related hospitalization was further subdivided into MI-related hospitalization (I21) and stroke-related hospitalization (I60-69).
This study was approved by the Kangbuk Samsung Hospital Institutional Review Board. Informed consent was waived for this retrospective analysis. The authors are restricted from sharing the datasets underlying this study because the Korean NHIS owns the data. There are legal or ethical restrictions to sharing this data publicly. Data are available from the NHIS through a formal application process (https://nhiss.nhis.or.kr/bd/ab/bdaba021eng.do).

Study Variables
In this study, we analyzed data from the NHID to identify factors affecting CVD hospitalization and CVD death. We used the most recent age, sex, status of taking statin medications, and diabetes status for individuals who had the last visit with ICD-10 codes for hypertension (I10-15 and O10-16). Body mass index (BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP), fasting plasma glucose (FPG), total cholesterol, income level, smoking status, and physical activity variables used in the national health screening database (DB) and eligibility DB were obtained. In addition, considering numerical change over time, we applied the value obtained by subtracting the last visit value from the first visit value of each variable. Change in the amount of smoking was defined as change in the number of cigarettes. Change in the amount of alcohol consumed per day and change in the amount of physical activity were defined as changes in results obtained using the formula according to the International Physical Activity Questionnaire (IPAQ) guidelines. To reflect information on history, the number of hospital visits, prescriptions, and length of hospitalization within the last two years from the last visit were utilized.
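The change-over-time variables described above (first-visit value minus last-visit value) can be illustrated with a minimal sketch. The function name `change_features`, the `delta_` prefix, the dictionary layout, and the example values are hypothetical and are not taken from the NHID or from the study code.

```python
def change_features(first_visit, last_visit, variables):
    """Return, for each variable, the first-visit value minus the last-visit value.

    The sign convention follows the text: a positive result means the
    measurement was lower at the last visit than at the first visit.
    """
    return {
        f"delta_{name}": first_visit[name] - last_visit[name]
        for name in variables
    }


# Hypothetical measurements for one participant (illustrative values only).
first = {"sbp": 150, "dbp": 95, "bmi": 27.0}
last = {"sbp": 138, "dbp": 88, "bmi": 26.0}

deltas = change_features(first, last, ["sbp", "dbp", "bmi"])
print(deltas)
```

With this convention, a participant whose SBP fell between the two visits receives a positive `delta_sbp`, which matters when reading the direction of the odds ratios for the change variables.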

Clinical Outcomes
The primary outcome of this study was defined as CVD death (all I-code-related deaths in ICD-10), CVD hospitalization (all I-code-related hospitalizations in ICD-10), and a composite of CVD death and hospitalization within one year after the last visit.
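As an illustration of how such an outcome label might be derived from claims records, the sketch below flags a participant when any ICD-10 I-code event occurs within one year after the last visit. It rests on assumptions not stated in the source: the one-year window is taken as 365 days, events on the day of the last visit are excluded, and the function name `cvd_event_within_one_year` and the `(date, code)` tuple layout are hypothetical.

```python
from datetime import date


def cvd_event_within_one_year(last_visit, events):
    """Return True if any I-code event falls within 365 days after last_visit.

    `events` is an iterable of (event_date, icd10_code) pairs.
    Assumption: "within one year" means 1 to 365 days after the last visit.
    """
    for event_date, icd10_code in events:
        days_after = (event_date - last_visit).days
        if icd10_code.upper().startswith("I") and 0 < days_after <= 365:
            return True
    return False


# Hypothetical records for illustration.
last_visit = date(2016, 3, 1)
print(cvd_event_within_one_year(last_visit, [(date(2016, 9, 15), "I21")]))  # event inside the window
print(cvd_event_within_one_year(last_visit, [(date(2017, 6, 1), "I63")]))   # event after the window
```

The same function could be applied separately to death records and hospitalization records to obtain the three outcomes (death, hospitalization, and their composite).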

Algorithm Development and Statistical Analysis
For statistical analysis, the chi-square test and the independent-samples t-test were conducted to compare baseline characteristics according to CVD hospitalization and CVD death using data from the NHIS. Multiple logistic regression analysis was performed to confirm the association of each variable with CVD hospitalization and CVD death. Logistic regression and deep neural network (DNN) methods were used to build predictive models. An overview of the data processing of the DNN model is shown in Figure 2. Data were split 7:3 into training and validation sets. Because the data for CVD hospitalization and CVD death prediction were imbalanced, oversampling was performed using the synthetic minority oversampling technique (SMOTE); because the units of the variables differed, min-max scaling was applied. The DNN model consisted of one hidden layer with 64 nodes. The rectified linear unit (ReLU) activation function was used to apply nonlinearity in the hidden layer, and the dropout ratio was set to 0.2 to prevent overfitting. To assess the predictive performance of the developed models, we calculated accuracy, precision (positive predictive value), recall (sensitivity), F1-score (harmonic mean of recall and precision), and area under the ROC curve (AUC). The optimal cutoff was found using the Youden index. All analyses were performed using SAS 9.4 (SAS Institute Inc., Cary, NC, USA) and Python 3.7.0. All p values were two-tailed, and p values less than 0.05 were considered statistically significant.
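Three of the steps named above (min-max scaling, SMOTE-style oversampling, and cutoff selection with the Youden index) can be sketched in plain Python. This is an illustration under assumptions, not the study's implementation: the source does not state which library, which number of neighbours, or which order of operations was used, so the function names `min_max_scale`, `smote_sample`, and `youden_cutoff`, the value of `k`, and the toy data are all hypothetical.

```python
import random


def min_max_scale(column):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def smote_sample(minority, k, rng):
    """Create one synthetic minority point in the style of SMOTE.

    A random minority point is picked, one of its k nearest minority
    neighbours is chosen, and the new point is placed at a random position
    on the segment between them. Requires at least two minority points.
    """
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment base -> neighbour
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


def youden_cutoff(scores, labels):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1.

    A case is predicted positive when its score is >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_j, best_t = -1.0, None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j


# Toy data for illustration only.
scaled_sbp = min_max_scale([120, 140, 160])
synthetic = smote_sample([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], k=2, rng=random.Random(0))
cutoff, j_stat = youden_cutoff([0.2, 0.3, 0.7, 0.9], [0, 0, 1, 1])
```

A common design choice, assumed here rather than confirmed by the source, is to fit the scaler and generate synthetic samples on the training split only, so that the validation split contains no synthetic cases.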

Patient Characteristics
The study analyzed a total of 2,037,027 patients with hypertension, including 163,686 participants with CVD hospitalization and 31,634 participants with CVD death. Baseline characteristics of participants are shown in Table 1. All variables showed statistically significant differences except for the variance of 0 income level for those with CVD hospitalization.
Participants with CVD hospitalization were older (66.3 ± 13.4 vs. 62.8 ± 12.9 years), more likely to be female (50.12% vs. 49.88%), had a longer hospitalization period (32.22 ± 90.66 vs. 25.90 ± 69.61 days), a higher proportion of history of diabetes (43.40% vs. 33.56%), a higher proportion of myocardial infarction (MI) (2.16% vs. 0.87%), and a higher proportion of history of stroke (12.52% vs. 3.98%). Meanwhile, they were less likely to take statins and showed a decreased cholesterol level, reduced smoking, increased physical activity, increased prescriptions, and increased hospitalizations.
Table 2 shows the results of the multiple logistic regression analysis performed to evaluate the risk of variables affecting CVD hospitalization and CVD death using odds ratios (ORs) and 95% confidence intervals (CIs). Age, variance of BMI, variance of SBP, variance of FPG, variance of alcohol consumption, variance of physical activity, and history of diabetes increased the risk of CVD hospitalization and CVD death, irrespective of the presence of missing values. In contrast, variance in DBP, number of prescriptions, and taking statins decreased the risk of CVD hospitalization and CVD death.
Table 3 shows the results after evaluating each model. For all metrics in all datasets, the DNN model showed higher performance than the logistic regression model. In the ROC comparison, a statistically significant difference (p < 0.0001) was confirmed in all datasets. The DNN model showed a performance of 0.8 or higher in all metrics for the CVD hospitalization group (accuracy, 0.863; F1-score, 0.854; precision, 0.912; recall, 0.803; AUC, 0.932) and 0.9 or higher in all metrics for the CVD death group (accuracy, 0.925; F1-score, 0.924; precision, 0.935; recall, 0.912; AUC, 0.979).
Figure 3 shows a comparison of the ROC curves of the logistic regression model and the DNN model for each dataset. Recall (sensitivity) was relatively low in all panels, but it was consistently higher for the DNN model than for the logistic regression (LR) model.
In a sub-analysis of CVD hospitalization, the DNN model showed higher performance than the logistic regression model (Table 4, Figure 4). The DNN model showed outstanding performance for both the MI and stroke groups (AUC: 0.972 for MI and 0.951 for stroke). For both groups, all metrics of the DNN model were higher than those of the LR model in the test set.
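Because the F1-score is the harmonic mean of precision and recall, the reported DNN values can be checked against each other. The sketch below does this for the figures quoted above; the helper name `f1_score` and the dictionary layout are hypothetical, and the inputs are the published three-decimal values, so small rounding differences are expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the DNN model, as quoted in the text.
reported = {
    "CVD hospitalization": (0.912, 0.803, 0.854),
    "CVD death": (0.935, 0.912, 0.924),
}

for outcome, (precision, recall, reported_f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{outcome}: recomputed {recomputed:.3f}, reported {reported_f1:.3f}")
```

The recomputed values agree with the reported F1-scores to within the rounding of the published precision and recall.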

Discussion
In this retrospective cohort study based on a nationwide claims dataset, we developed a deep learning model to evaluate the future risk of CVD mortality and hospitalization in patients with hypertension, and then compared its performance with that of a traditional statistical model. Among artificial intelligence studies using deep learning for the prediction of CVD-related outcomes in patients with pre-existing hypertension, ours is, to the best of our knowledge, the first conducted in a real-world setting to predict mortality and admission to hospital for hypertensive patients. We found that the prognostic efficacy of the deep learning method was higher than that of logistic regression in all classification tasks using five commonly used evaluation metrics. In addition, the deep learning model showed excellent performance in the sub-analysis performed by subcategorizing hospitalization due to MI and stroke.
Globally, the prevalence of hypertension in adults aged 30-79 is over 30%. The control rate of hypertension is only about 20%, even though various mechanisms of hypertension have been elucidated over the past century and effective drugs and interventions have been developed [3,4,18]. Among the risk factors of CVD, the most important global burden on health care systems, hypertension is crucial because of its high prevalence and because it is a preventable cause of premature death worldwide [19]. However, there are few risk assessment models or tools to predict the risk of CVD in hypertensive patients, and existing ones tend to overestimate risk in Korean and other Asian populations because they were mainly developed in United States (US) or European populations [20].
Regression analysis is the most widely used analysis for medical data. Various models for predicting CVD in hypertensive patients have been proposed using regression analysis [21]. However, the predictors considered in the previous study [21] were only a few traditional, strong risk factors that were already established. It is known that blood pressure control might have a limited effect on overall CVD risk due to the complexity and variability of individual cardiovascular risk. Therefore, novel risk factors related to CVD, including C-reactive protein (CRP), coronary artery calcium (CAC), and proprotein convertase subtilisin/kexin type 9 (PCSK9), have been proposed [22,23]. Although some newly discovered biomarkers have been confirmed to be associated with increased risk, a CVD prediction model using these markers failed to show superior performance to a traditional predictive model [24]. Logistic regression, used as a control in this study, is a conventional statistical approach frequently used to develop risk prediction models. The strength of this analysis lies in the determination and use of several variables to predict prognosis by expressing the predictive effect of predictor variables through simple, easily explained parameters. Although a regression model with a small number of predictors, built on the premise that the effect of each variable on the result is linear and homogeneous, can be useful in association analysis, these assumptions can act as a limitation in predictive analysis studies that focus on the result instead of the predictor variable [25]. Such a model cannot account for the multicollinearity or spatially varying correlations that may arise from the complexity and variability of an individual's overall cardiovascular risk.
As large-scale cohort studies involving from tens to millions of participants have been established in various regions around the world, the accumulation of large datasets in the medical field has been accelerating as well. Analyses using these big data have provided great potential for research on complex problems beyond the scope of existing clinical and observational studies due to the vast amount of available data and ability to easily extract mortality data and perform disease registration. As a result, it is possible to analyze previously unknown risk factors that could have a statistically significant association with disease incidence using medical big data from a nationwide population-based cohort close to real world data, thereby tracking various disease mechanisms [26][27][28]. As an example, the incidence of CVD is influenced by, and often intertwined with, various risk factors such as race, ethnicity, age, sex, body mass index, and laboratory test results. Previous studies using a cohort with a limited population and traditional statistical analysis often could not fully reflect all the complex causal relationships between these various risk factors, leading to many limitations in the interpretation of results. Recently, the standardization and systematization of healthcare big data has enabled the analysis of previously unknown risk factors that might play a statistically significant role in disease prevalence, providing a good opportunity to more accurately characterize the quality of prevention and treatment according to CVD predictions [14]. The NHID used in this study is one of the largest claims datasets, with a population of over 52 million. It includes data for all age groups and all regions, thus reducing the possibility of selection bias [17]. 
In addition, it includes health-screening information, such as detailed lifestyle questionnaires, laboratory test results, and anthropometric measurements, that is not included in other claims databases. Furthermore, since the NHID is linked to other data, such as mortality data from the National Statistical Office, it enables a more detailed analysis.
On the other hand, machine learning has been introduced to overcome the limitations of risk prediction using traditional regression analysis [25]. In particular, research using artificial intelligence (AI) and big data has recently attracted attention, and disease prediction research using AI has already shown high value in other studies [29][30][31]. Deep learning models are trained by repeatedly updating and adjusting neuron weights and biases as relevant patterns are identified in the data. Thus, they can detect patterns or potential risk factors that humans cannot detect, utilizing more computational power to handle all possible variables. The DNN model utilized in this study had a relatively simple structure with one hidden layer. Nevertheless, the strength of our study is that, with the variables used in this analysis, it is possible to identify people who are highly likely to be hospitalized within one year due to CVD, enabling an immediate preventive response that is not possible with existing predictive models. In future studies, we will investigate interpretable and more complex DNN structures with multiple hidden layers and evaluate other novel deep learning models.

Study Limitations
The results of our study should be interpreted in the context of several potential limitations. First, in this study, hypertensive patients were identified based on ICD codes in the claims data, which may contain errors, and not by medical records or anthropometric measurements, as in other large-data studies. However, since previous studies reported that ICD codes showed acceptable reliability in the NHID in Korea [32], this selection method was considered reasonable.
Second, external validation was not conducted to determine reproducibility or generalizability. Although we validated the model within a single cohort by splitting it into development and validation datasets, the accuracy of prediction might be lower when the model is applied to cohorts from different regions, races, countries, and other types of care settings. However, in another study in which predictive models built with NHIS data for Koreans were validated through the Rotterdam Study for Europeans, the deep learning approach showed significantly improved predictive power compared to traditional regression analysis, suggesting that it could be generalized without geographic or racial influence [29]. Finally, with deep learning models it can be difficult to accurately understand and explain the individual algorithms inside the model. The complexity of a deep learning model enables its distinguished predictive power. At the same time, it makes the model more difficult to understand and trust. Furthermore, it is not possible to provide specific recommendations for controlling risk factors because the risk factors affecting event occurrence are unknown. Therefore, to find ways to explain individual risk factors, it is necessary to train models using both interpretable and deep learning methods and to evaluate whether there is a trade-off between accuracy and interpretability in a specific case in practice.

Conclusions
Using experimental machine learning techniques based on high-quality big data, we showed that a machine learning approach using a DNN could lead to more accurate predictions of the risk of mortality and hospitalization in patients with hypertension than a conventional statistical approach. Further research should use prediction models developed with various deep-learning-based techniques to identify patients at high risk of hospitalization or death from CVD after a diagnosis of hypertension, so that customized treatment and monitoring can be achieved by allocating resources to the patients at highest risk.