Developing and Validating Risk Scores for Predicting Major Cardiovascular Events Using Population Surveys Linked with Electronic Health Insurance Records

A risk prediction model for major cardiovascular events was developed using population survey data linked to National Health Insurance (NHI) claim data and the death registry. A second set of population survey data was used to validate the model. The model was built using the Nutrition and Health Survey in Taiwan (NAHSIT), collected from 1993–1996 and linked with 10 years of events from NHI data. Major adverse cardiovascular events (MACEs) were identified based on hospital admission or death from coronary heart disease or stroke. The Taiwanese Survey on Hypertension, Hyperglycemia, and Hyperlipidemia (TwSHHH), conducted in 2002, was used for external validation. The NAHSIT data consisted of 1658 men and 1652 women aged 35–70 years. The incidence rates for MACE per 1000 person-years were 13.77 for men and 7.76 for women. The corresponding incidence rates for the TwSHHH were 7.27 for men and 3.58 for women. The model had reasonable discrimination (C-indexes: 0.76 for men; 0.75 for women) and thus can be used to predict MACE risks in the general population. NHI data can be used to identify disease statuses if the definition and algorithm are clearly defined. Precise preventive health services in Taiwan can be based on this model.


Int. J. Environ. Res. Public Health 2022, 19, 1319

Introduction
Precise preventive health services emphasize managing chronic diseases and developing individualized risk prediction to improve the quality and effectiveness of health care [1]. Ideally, disease prevention should start with screening risk factors to predict risk and then promote health throughout a person's entire life. Cardiovascular disease is a major cause of death and disability worldwide. The World Health Organization (WHO) recognizes the seriousness of cardiovascular disease (https://www.who.int/cardiovascular_diseases/guidelines/Pocket_GL_information/en/, accessed on 30 December 2021) and suggests that most cardiovascular diseases (CVDs) are preventable. A pocket guide can be used to identify people at high risk and provide guidance to prevent heart attacks or strokes. In the U.S., CVD risk is estimated using the Framingham Risk Score (FRS), which was developed when the country was facing a high CVD prevalence [2,3]. The ability of the FRS to accurately predict CVD risk for different ethnicities is uncertain. Individual studies examining FRS performance have been conducted in Europe [4–7] and Asia [8–12]. Knowing the limitations of the original FRS, the American Heart Association (AHA) and the American College of Cardiology (ACC) released a new risk calculator [13]. The calculator, known as the ASCVD (atherosclerotic cardiovascular disease) risk calculator, was developed by pooling data from several cohorts, including the Atherosclerosis Risk in Communities [14], Cardiovascular Health [15], Coronary Artery Risk Development in Young Adults [16], and Framingham Original [17] and Offspring [18] studies. In November 2013, the AHA and ACC updated the clinical guidelines for managing lipids [19], in which the risk score was an important feature. The International Atherosclerosis Society provided a calibration factor for each country with different risks [20].
Following their instructions, we generated and submitted our own calibration factor [21]. Our experience in generating the calibration factor, together with observations from Asian countries [8], suggested that risk factors for coronary heart disease (CHD) may differ in Asian populations. Liu et al. recalibrated the Framingham prediction function using data from the Chinese Multi-Provincial Cohort study (CMCS) [11]. We applied their new coefficients to our nationally representative data from the 2002 Taiwan Survey of Hypertension, Hyperglycemia, and Hyperlipidemia (TwSHHH) and found that the prediction did not fit the Taiwanese population [22]. Thus, we developed our own model.
Most of the existing scores were based on longitudinal follow-up, and the diseases were ascertained either from self-reports or doctors' diagnoses. Periodic follow-up of patients nationwide would be very costly. As the use of electronic medical records (EMRs) increases, insurance claim records, one form of EMR, can be used to ascertain events. Since 1995, the National Health Insurance (NHI) program has provided universal coverage to more than 99% of the population in Taiwan. To be reimbursed, medical facilities (either clinics or hospitals) must upload the information for each patient's clinic or hospital visits. Thus, the NHI database contains personal characteristics (sex, date of birth, and insurance information) and clinical information (date, expenditures, diagnosis, prescription details, and operations) [23]. We conducted this study to construct and validate our own MACE risk prediction model using national surveys linked to claim data and the death registry.

Data
Data for developing the model were obtained from a nationwide survey, the 1993-1996 Nutrition and Health Survey in Taiwan (NAHSIT), which collected information on disease history, food intake, and health-related behaviors. Participants were asked to fast for 8 h before the physical examination. Blood samples were centrifuged immediately. Serum samples were frozen on dry ice, delivered to Academia Sinica, and stored at −70 °C on the same day. Frozen serum samples were analyzed in a certified laboratory. Coefficients of variation, derived from 5% split blood samples, were within acceptable ranges [22]. No written consent form was required during the survey; however, we obtained participants' oral consent before the survey. The 1993-1996 NAHSIT was the first representative survey with biomarkers. It was conducted around the time the National Health Insurance (NHI) was implemented, which marked the beginning of electronic follow-up of individuals using nationwide insurance claim data.
Validation data were obtained from another nationwide survey, the TwSHHH, conducted in 2002 and followed up in 2007 [24]. The health examination was similar to that of the NAHSIT; however, nutrition intake was not recorded. Participants were informed about the survey, and data were linked only from those who signed the consent form.
The following baseline variables were considered for the model: log(age), systolic blood pressure (SBP), diastolic blood pressure (DBP), fasting glucose, total cholesterol, high-density lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol (LDL-C), ratio of total cholesterol to HDL-C, triglycerides, uric acid, body mass index (BMI; kg/m²), waist circumference (cm), waist-to-hip ratio, and smoking status. The Institutional Review Board of the National Health Research Institutes approved this study.
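Several of the variables listed above are transformations or ratios of raw measurements. The following is a minimal sketch of how they could be derived for one participant. The class and field names are hypothetical, and the use of the natural logarithm for age is an assumption, since the text does not state the base.

```python
import math
from dataclasses import dataclass


@dataclass
class Baseline:
    """Raw baseline measurements for one participant (field names are hypothetical)."""
    age_years: float
    weight_kg: float
    height_cm: float
    waist_cm: float
    hip_cm: float
    total_chol: float  # must be in the same units as hdl_c, e.g. mg/dL
    hdl_c: float


def derived_predictors(b: Baseline) -> dict:
    """Compute the transformed and ratio variables named in the text."""
    height_m = b.height_cm / 100.0
    return {
        "log_age": math.log(b.age_years),        # natural log is an assumption
        "bmi": b.weight_kg / height_m ** 2,      # kg/m^2
        "waist_hip_ratio": b.waist_cm / b.hip_cm,
        "tc_hdl_ratio": b.total_chol / b.hdl_c,
    }


# Illustrative values only, not survey data.
example = derived_predictors(
    Baseline(age_years=50, weight_kg=70, height_cm=170,
             waist_cm=85, hip_cm=100, total_chol=200, hdl_c=50)
)
```

Keeping the derived values in one place makes it easier to apply exactly the same transformations to the development and validation surveys.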

Events
MACEs are a group of disorders of the heart and blood vessels (https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds), accessed on 30 December 2021). Heart attacks and strokes are usually acute events and are mainly caused by a blockage that prevents blood from flowing to the heart or brain. MACEs, which include CHD and stroke, were extracted from the NHI databank or death registry. Only hospitalization or death due to CHD or stroke was considered. CHD was defined as ICD-9: 410-414 or ICD-10: I20-I25; stroke was defined as ICD-9: 430-438 or ICD-10: I60-I69. Time-to-event was calculated from the date of the survey to the date of the event.
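The event definition above can be sketched in code. The example below maps a diagnosis code to CHD or stroke using the ICD ranges given in the text, computes follow-up time in years, and computes an incidence rate per 1000 person-years. The function names, the handling of codes stored with or without a decimal point, and the 365.25-day year are assumptions made for illustration and do not describe the actual NHI extraction algorithm.

```python
from datetime import date
from typing import List, Optional, Tuple


def classify_mace(code: str) -> Optional[str]:
    """Map an ICD-9 or ICD-10 code to 'CHD', 'stroke', or None.

    Ranges follow the text: CHD = ICD-9 410-414 / ICD-10 I20-I25,
    stroke = ICD-9 430-438 / ICD-10 I60-I69.
    """
    head = code.strip().upper().split(".")[0]   # category part before any dot
    if head.startswith("I") and len(head) >= 3 and head[1:3].isdigit():
        block = int(head[1:3])                  # ICD-10, e.g. I21 -> 21
        if 20 <= block <= 25:
            return "CHD"
        if 60 <= block <= 69:
            return "stroke"
        return None
    if len(head) >= 3 and head[:3].isdigit():
        block = int(head[:3])                   # ICD-9, e.g. 41071 -> 410
        if 410 <= block <= 414:
            return "CHD"
        if 430 <= block <= 438:
            return "stroke"
    return None


def years_between(survey: date, end: date) -> float:
    """Follow-up time in years from the survey date to the event or censoring date."""
    return (end - survey).days / 365.25


def incidence_per_1000_py(follow_up: List[Tuple[float, bool]]) -> float:
    """follow_up holds (years at risk, event observed) for each participant."""
    events = sum(1 for _, had_event in follow_up if had_event)
    person_years = sum(years for years, _ in follow_up)
    return 1000.0 * events / person_years
```

For example, `classify_mace("I63.9")` returns `"stroke"`, while a hypertension code such as `"I10"` returns `None` because it lies outside both ranges.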

Statistical Methods
The model's discrimination ability was evaluated using Harrell's C [25], which evaluates the proportion of concordant pairs over all possible pairs. The formula is given in Equation (1), in which the numerator (Equation (2)) counts the concordant pairs and the denominator (Equation (3)) counts all pairs. If a variable increased the C value by 0.002, the variable was kept in the model. The Akaike information criterion (AIC) was used to assess the goodness of the model fit. The model with the lowest AIC was selected. In other words, the model was selected based on a higher C value and lower AIC. The Hosmer-Lemeshow test was used to evaluate the calibration [26]. This test divided the predicted risk into ten groups and then compared the observed risk to the predicted risk. The χ² test was also used. The same method was applied to the external calibration. All analyses were conducted using R and SAS version 9.4 (SAS Institute Inc., Cary, NC, USA).
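To make the concordance calculation concrete, the following is a minimal sketch of Harrell's C for right-censored data. It follows the common textbook convention: a pair is usable when the participant with the shorter follow-up time had an event, ties in predicted risk count as one half, and ties in follow-up time are ignored. These conventions are assumptions; the study itself used R and SAS, whose tie handling may differ.

```python
from typing import Sequence


def harrell_c(time: Sequence[float], event: Sequence[bool],
              risk: Sequence[float]) -> float:
    """Harrell's C: concordant pairs divided by all usable pairs.

    A pair (i, j) is usable when participant i had an event and a shorter
    follow-up time than participant j. It is concordant when i also has the
    higher predicted risk. Ties in predicted risk count as 0.5.
    """
    concordant = 0.0
    usable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored participant cannot be the earlier failure
        for j in range(n):
            if time[i] < time[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


# Four illustrative participants: 5 usable pairs, 4 of them concordant.
c_example = harrell_c(time=[2, 4, 6, 8],
                      event=[True, True, False, True],
                      risk=[0.9, 0.4, 0.5, 0.1])
```

The double loop is quadratic in the number of participants, which is acceptable for a sample of a few thousand but would need a faster algorithm for much larger data.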

Results
To build the model, we used data from 1658 men and 1652 women aged between 35 and 70 years. For each disease, we excluded those who reported having the disease at baseline. The MACE incidence rates were 13.77 per 1000 person-years for men and 7.76 per 1000 person-years for women. We constructed separate models for CHD and stroke, and then combined them as MACE. Table 1 compares the baseline variables between patients who developed MACE, CHD, or stroke and those who did not. Men who developed MACEs were significantly older, with higher blood pressure, higher waist-hip ratios, higher waist circumferences, and higher proportions of hypertension than their counterparts at baseline. Among women, almost all variables differed except the glucose level. The 10-year event-free probabilities S₀(10) for MACEs were 0.82 for men and 0.90 for women. Thus, approximately 82% of the men and 90% of the women remained MACE-free during the 10-year period.
Table 2 presents the final models. The C values for CHD were 0.73 for men and 0.82 for women. The AIC values were the lowest in both models, at 760 for men and 474 for women. Agreement between the predicted and observed probabilities was classified into deciles and tested via a χ² test. The χ² statistics were 22.91 for men and 21.68 for women. The C values for stroke were 0.80 for men and 0.79 for women, and those for MACEs were 0.76 for men and 0.75 for women.
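The reported S₀(10) values can be turned into an individual 10-year risk with the usual Cox-type formula, risk = 1 − S₀(10)^exp(LP − mean LP), where LP is the linear predictor. The sketch below assumes that S₀(10) is the event-free probability at the mean linear predictor, which the text does not state explicitly. The coefficients and covariate values are placeholders and are not the values in Table 2.

```python
import math

# 10-year MACE-free probabilities reported in the text.
S0_10 = {"men": 0.82, "women": 0.90}


def ten_year_risk(sex: str, linear_predictor: float,
                  mean_linear_predictor: float = 0.0) -> float:
    """Cox-type 10-year risk: 1 - S0(10) ** exp(LP - mean LP)."""
    relative_hazard = math.exp(linear_predictor - mean_linear_predictor)
    return 1.0 - S0_10[sex] ** relative_hazard


# Placeholder coefficients and centered covariates, NOT the values in Table 2.
beta = {"sbp_per_10mmHg": 0.15, "smoker": 0.40}
x = {"sbp_per_10mmHg": 1.0, "smoker": 1.0}
lp = sum(beta[k] * x[k] for k in beta)
risk_example = ten_year_risk("men", lp)
```

With a linear predictor equal to its mean, the function returns 1 − S₀(10), that is, 0.18 for men and 0.10 for women, which matches the interpretation given in the text.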
Models were validated using TwSHHH data linked to the NHI data. The MACE incidence rates in the TwSHHH sample were 7.27 per 1000 person-years for men and 3.58 per 1000 person-years for women. The 10-year MACE-free probabilities were 0.90 for men and 0.95 for women. Figure 1 shows the validation. The χ² statistics for comparing the predicted to the observed values were <20 for CHD and MACEs, but slightly higher for stroke (28.67 for men; 20.93 for women).
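The decile-based comparison of predicted and observed values can be sketched as follows. This is a simplified Hosmer-Lemeshow-type statistic that treats the 10-year outcome as binary. That is an assumption: with censored follow-up, the observed risk in each group is often estimated with a Kaplan-Meier method instead, and the text does not say which variant was used. The function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def decile_calibration(predicted: Sequence[float],
                       observed: Sequence[int],
                       groups: int = 10) -> Tuple[float, List[Tuple[float, float]]]:
    """Return a Hosmer-Lemeshow-type chi-square and (observed, expected) per group."""
    # Sort participants by predicted risk, then cut into equal-sized groups.
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    chi2 = 0.0
    table: List[Tuple[float, float]] = []
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue
        expected = sum(predicted[i] for i in idx)        # expected events
        events = float(sum(observed[i] for i in idx))    # observed events
        mean_p = expected / len(idx)
        variance = len(idx) * mean_p * (1.0 - mean_p)
        if variance > 0:
            chi2 += (events - expected) ** 2 / variance
        table.append((events, expected))
    return chi2, table


# Illustrative data: two risk groups, the second with no observed events.
chi2_example, table_example = decile_calibration(
    predicted=[0.1] * 10 + [0.5] * 10,
    observed=[1] + [0] * 19,
    groups=2,
)
```

In this illustration the low-risk group is well calibrated (1 event observed, 1 expected), while the high-risk group is overpredicted (0 observed, 5 expected), so nearly all of the statistic comes from the second group.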

Discussion
In this study, we used data from national surveys linked to the NHI and death registry and extracted events over a 10-year period to develop a risk prediction model. Millions of records were processed. Manuel [...] survey to validate the model. The predictors were the self-reported risk behaviors, and the events were obtained from either the hospital admission for the disease or the death registry. We used a similar method to identify events, but chose surveys with biomarkers. Because the NHI data were for insurance claims, the data contained only the date of the claim and no information on disease confirmation. We were advised by experts in the fields of hypertension, heart disease, and cerebrovascular disease to use the records from hospitalization or death to ensure that real events were selected.
The Framingham score and CMCS used categorical data to calculate the risk scores (Table 3). The Framingham score used blood pressure, total cholesterol, HDL-C, diabetes (yes/no), and smoking (yes/no) as predictors. The CMCS was developed using Chinese data [9]. Figure 2 shows the comparisons of the two models fitting the NAHSIT 1993-1996 data for CHD events. The C statistics were lower in men than in women, whereas the χ² statistic was much higher in women when using the Framingham score.
Chien et al. developed a point-based prediction model for CHD in Taiwan [28] using data collected in northern Taiwan. CHD was ascertained by physicians. The authors developed three models: a clinical model, a total cholesterol-based model, and an LDL-C-based model. The clinical model included age, sex, BMI, SBP, smoking status, and family history of CHD. The total cholesterol-based model was similar to the LDL-C-based model, except that it used total cholesterol. However, the study lacked a nationally representative sample. Our sample was selected using a probability sampling scheme and covered the entire population of Taiwan. Therefore, our model can be used for risk prediction for the population in Taiwan.
The first global estimate of the burden of 135 diseases listed cerebrovascular diseases as the second leading cause of death after ischemic heart disease [29]. The WHO reported that 15 million people suffer a stroke worldwide annually (http://www.emro.who.int/health-topics/stroke-cerebrovascular-accident/index.html, accessed on 30 December 2021). Approximately one-third of these remain disabled for long periods, resulting in heavy burdens on their family and community. Taiwan is no exception to this. The earliest risk prediction model for stroke was developed using the Framingham study in 1991 [30]. A risk prediction model was developed using Taiwanese community data in 2010 [31]. The incidence was approximately 6.8% in the 16-year follow-up. Two models have been developed based on these data [31]. One was a clinical model that included measures of blood pressure and disease history. The other was a biomedical model that included total cholesterol, white blood cell counts, and fasting glucose in addition to items in the clinical model. A model was developed for Chinese populations using the China Health and Nutrition Survey [32], based on the incidence between 2009 and 2015. Separate models were developed for ischemic stroke and hemorrhagic stroke. Each population has different risk factors. Thus, we developed our model using Taiwan population data. We focused on severe stroke that resulted in hospitalization or death. Our model selected SBP, triglycerides, glucose, and uric acid for men and SBP, waist-hip ratio, smoking, hypertension, and diabetes for women.
Because CHD and stroke are both major cardiovascular events, we combined them in one model. Our final model included age, sex, SBP, waist-hip ratio, HDL-C, and uric acid for men. The weight (coefficient) was heavy for the waist-hip ratio, implying that obesity may contribute largely to MACEs in men. Waist circumference, SBP, total cholesterol/HDL ratio, and smoking were used in the model for women. Lipid profiles (total cholesterol/HDL ratio) played a relatively important role in MACE development in women. The C statistics reached 0.76 for men and 0.75 for women. When another national survey with lower incidence rates was used to validate the models, the C statistics were higher than those in the original population, reaching 0.78 for men and 0.79 for women. Overestimation of risk has been observed in many models built for the same purpose. The WHO CVD Risk Chart Working Group suggested that this might occur because models were developed using incidence at the population level, which might include recurrent cases [33]. We used an event-history model and the National Health Insurance data, in which each event is counted only once. The overestimation was high in the highest decile of predicted risk. It was possibly caused by the linear function in the model. We tried other functions, but they did not improve the model. In the end, the purpose was risk prevention. A higher estimate might alert individuals to modify their lifestyle in order to lower their risks.
This study had some limitations. First, no behavioral variables other than obesity-related variables were selected; the effects of behavioral variables may have been expressed through participants' blood pressure or biomarkers. Mediation models may be one solution. Second, we did not stratify stroke into different subtypes. However, our purpose was primary prevention in the general population. We hope this model can be applied to government-funded health check-ups for people aged ≥40 years. Taiwanese residents get free health check-ups when they are ≥40 years old. Those aged between 40 and 64 years get free health check-ups every 3 years; those aged ≥65 years get free health check-ups annually. The health check-up could incorporate our model into the report and inform people about their 10-year risk of MACE. There is a website developed for risk prediction as well as guidelines for prevention (https://cdrc.hpa.gov.tw/index.jsp, accessed on 1 January 2022).

Conclusions
In conclusion, linking national surveys to health insurance data enabled us to generate a MACE risk prediction model. Our model was validated using a dataset from another survey conducted a few years later and with a lower incidence. The models performed well, suggesting that our model remained valid across survey periods.