Predicting Australian Adults at High Risk of Cardiovascular Disease Mortality Using Standard Risk Factors and Machine Learning

Effective cardiovascular disease (CVD) prevention relies on timely identification of, and intervention for, individuals at risk. Conventional formula-based techniques have been demonstrated to over- or under-predict the risk of CVD in the Australian population. This study assessed the ability of machine learning models to predict CVD mortality risk in the Australian population and compared their performance with the well-established Framingham model. Data were drawn from three Australian cohort studies: the North West Adelaide Health Study (NWAHS), the Australian Diabetes, Obesity, and Lifestyle (AusDiab) study, and the Melbourne Collaborative Cohort Study (MCCS). Four machine learning models for predicting 15-year CVD mortality risk were developed and compared to the 2008 Framingham model. The machine learning models performed significantly better than the Framingham model when applied to the three Australian cohorts, improving prediction by 2.7% to 5.2% across the three cohorts. In an aggregated cohort, machine learning models improved prediction by up to 5.1% (area-under-curve (AUC) 0.852, 95% CI 0.837–0.867). Net reclassification improvement (NRI) was up to 26% with the machine learning models. The machine learning-based models also showed improved performance when stratified by sex and diabetes status. These results suggest a potential for improving CVD risk prediction in the Australian population using machine learning models.


Introduction
Cardiovascular disease (CVD) is the leading cause of death in Australia [1]. Many cardiovascular disease risk factors are modifiable and, with early diagnosis and intervention of individuals at higher risk, CVD mortality and morbidities are largely preventable [2]. Risk prediction models that combine known CVD predictors, such as hypertension, cholesterol, age, smoking, and diabetes, have traditionally been used to identify those at greatest risk. The Framingham, Systematic COronary Risk Evaluation (SCORE), and QRISK models are commonly used in the UK, US, Australia, and New Zealand to inform public policy and clinical guidelines [3,4].
Two of the most pertinent limitations of established risk prediction models are: (1) traditional predictive models based on personal health information use simple regression fitting approaches that cannot assume nonlinear relationships between the predictors and outcome measures, which oversimplifies the associations between CVD risk factors and outcomes, thus reducing the accuracy of predictions [5], and (2) there is a limited generalizability of models to accurately predict the risk of CVD in diverse populations and across countries [3,4]. For example, the Framingham Risk Score, one of the most commonly used and widely validated models worldwide, is derived from a largely Caucasian population of European descent, and may be less accurate for some high-risk groups, such as individuals with diabetes, socio-economically disadvantaged populations [6], and Australian females [7].
Machine learning (ML) is a widely accepted computational technique that can address the nonlinear relationships between the risk factors and outcome measures [8]. It also presents an opportunity to improve the robustness and generalizability of prediction models for CVD by constructing phenotypical cohort-based risk models [9]. The potential of improved accuracy in predicting CVD risk using machine learning approaches, compared to the Framingham Risk Score, has been investigated in several international cohorts [5,[10][11][12]. Using large UK cohorts, Weng et al. [5] utilized four machine learning models (logistic regression, random forest, gradient boosting machines, neural networks) to predict CVD events, and Alaa et al. [10] tested the potential of an automated machine learning framework (AutoPrognosis) for predicting CVD events. In the US, Ambale-Venkatesh et al. [11] and Kakadiaris et al. [12] also used random forest and support vector machine, respectively, to predict CVD events and mortality in US populations. A 2020 meta-analysis assessing the predictive ability of machine learning algorithms for cardiovascular diseases found promising potential in ML approaches [13]. The Framingham Risk Score is recommended for use in Australia to predict CVD risk but has been found to have limited accuracy for some Australian sub-populations [7,14]. A recent Australian study based on 5453 participants showed that the widely accepted 2008 Framingham model has overestimated the CVD risk by 29.7% in men and 7.2% in women [14].
This investigation aims to improve CVD risk prediction for the Australian population by applying different ML techniques to the risk factors used by the 2008 Framingham Risk Score. To our knowledge, these ML-based CVD risk prediction models have not previously been applied to Australian population cohorts.
ML is mainly classified into two categories: supervised and unsupervised. If a set of training data is available and the classifier is designed based on that prior information, the approach is known as supervised learning; in unsupervised learning, no prior training information is available [15]. The performance of four supervised ML techniques used to derive risk prediction models for cardiovascular deaths was compared across three Australian sub-populations, individually and in combination, in male and female sub-cohorts, and in a diabetes cohort. This study takes an applied public health epidemiological approach using the tools of computational modelling (machine learning models), and represents a novel contribution to public health.

Study Sample
Data from the North West Adelaide Health Study (NWAHS) [16], the Australian Diabetes, Obesity, and Lifestyle (AusDiab) study [17], and the Melbourne Collaborative Cohort Study (MCCS) [18] were used in this analysis. Detailed descriptions of the NWAHS [16], AusDiab [17], and the MCCS [18] cohorts, recruitment, response rates, and data collection procedures have been previously published.

Risk Factors and CVD Mortality
Eight core baseline variables (age, sex, total cholesterol, high-density lipoprotein (HDL) cholesterol, systolic blood pressure, hypertension medication, diabetes, and smoking status; Table 1) were used to derive all the CVD risk prediction models. The outcome measure used was CVD mortality. Non-fatal CVD events were excluded from the outcome measure as that information was not available in all three datasets. CVD mortality was defined as deaths that occurred within 15 years of baseline, with CVD listed as the primary or secondary cause of death based on the International Classification of Diseases (ICD), 9th revision (390-459) and 10th revision (I00-I99).

Participant Numbers and Missing Values
The study population characteristics are reported in Table 2. Out of 4056 NWAHS participants, we excluded 326 people with a previous CVD history, 6 with missing CVD history data, and 70 with missing CVD outcome data, leaving a sample of 3654 participants. For the AusDiab study, out of 11,247 participants, 938 with a previous CVD history, 142 with missing CVD history data, and 17 with missing CVD outcome data were excluded, leaving 10,150 participants for the analysis. Of the 41,513 MCCS participants, 7035 with a previous CVD history and 1867 with missing CVD outcome data were excluded, resulting in 32,611 participants for the analysis.

Table 2. Missing numbers and summary data (mean ± standard deviation) for the three study cohorts and the combined cohort. The values for n, age, male, female, total cholesterol, HDL cholesterol, systolic blood pressure, hypertension medication, diabetes, and smoker were derived after the exclusions described above and imputation of other missing risk factor variables.

The missing values in the risk factor variables were imputed using the missRanger algorithm [19], which uses random-forest-trained imputations on observed data to predict continuous and categorical missing values. Random forest-based imputations perform better than traditional imputation methods for epidemiologic datasets with missing data [20]: imputation models that treat continuous variables as linear are less able to account for complex interactions and non-linear relationships between the variables.
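The study imputed missing risk-factor values with the R missRanger package; a comparable random-forest-based iterative imputation can be sketched in Python with scikit-learn. This is a minimal illustration on synthetic data, not the study's code, and the column meanings are purely illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Toy data: three correlated-looking continuous variables (illustrative: age, SBP, cholesterol)
rng = np.random.default_rng(0)
X = rng.normal(loc=[50.0, 130.0, 5.2], scale=[10.0, 15.0, 1.0], size=(200, 3))

# Knock out ~10% of values at random to simulate missingness
mask = rng.random(X.shape) < 0.10
X_missing = X.copy()
X_missing[mask] = np.nan

# Iterative imputation with a random-forest learner, analogous in spirit to missRanger:
# each variable with missing values is modelled from the observed values of the others.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)
print("remaining NaNs:", np.isnan(X_imputed).sum())
```

Unlike a linear imputation model, the forest-based learner can capture non-linear relationships and interactions among the predictors, which is the advantage the text attributes to missRanger.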

Framingham Risk Prediction Model
For the Framingham model, the CVD risk score was calculated using the eight baseline variables (mentioned previously) included in the 2008 Framingham model [21]. The Framingham model assigns a person to the low-risk group if the score is < 20 and to the high-risk group if the score is ≥ 20. As the Framingham equation was designed to estimate 10-year CVD risk, while the follow-up data in this study cover 15 years, we linearly transformed the 10-year risk of the Framingham model into a 15-year risk [13]. Thus, the Framingham score risk threshold became 30 instead of 20. Figure 1 shows an overview of the machine learning approach used. The algorithm starts with input of the cohort data (NWAHS, AusDiab, or MCCS). The input variables (the eight baseline variables mentioned previously) were normalized to zero mean and unit variance within each dataset to ensure each variable had the same influence on the cost function in designing the machine learning models. This was done separately on the training and testing data.
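The 10-to-15-year rescaling of the risk threshold and the zero-mean, unit-variance standardization described above can be illustrated as follows. The inputs are synthetic and `to_15_year` is a hypothetical helper for the linear transformation, not the published Framingham equation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def to_15_year(score_10yr):
    """Linear extrapolation of a 10-year risk score to 15 years (hypothetical helper)."""
    return score_10yr * 1.5

# The 10-year high-risk cut-off of 20 maps to 30 at 15 years, as described above.
print(to_15_year(20))  # 30.0

# Zero-mean, unit-variance standardization of the eight input variables
X_train = np.random.default_rng(1).normal(size=(100, 8))
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
print("means ~0:", np.allclose(X_train_std.mean(axis=0), 0.0, atol=1e-8))
print("stds  ~1:", np.allclose(X_train_std.std(axis=0), 1.0, atol=1e-8))
```

Standardization matters here because variables on very different scales (e.g. age in years vs. HDL in mmol/L) would otherwise contribute unequally to distance- and margin-based models such as the SVM.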

Machine Learning Risk Prediction Model
The three cohort datasets were severely imbalanced. The number of participants who had died due to CVD on or before 15 years of follow-up (minority class) was much smaller than the number of participants alive at 15 years of follow-up (majority class). The minority class percentages were 3.3%, 3.4%, and 1.6% for the NWAHS, AusDiab, and the MCCS, respectively (shown in Table 1). As this imbalance affects the decision boundary of the machine learning models and results in poor performance, the Synthetic Minority Over-sampling Technique (SMOTE) algorithm [22] was used to oversample the minority class and balance the data in the training set.

Four popular machine learning models were applied to each cohort: logistic regression (LR), linear discriminant analysis (LDA), support vector machine with linear kernel (SVM), and random forest (RF) [15,23]. The performance of each model was measured using the testing data. To maximize the models' robustness and generalizability, two-fold cross validation was used. For this approach, the original data was randomly split into two equal-sized subsets: a training set to train the models, and a testing set to evaluate them. Then the sets were swapped and the process was repeated, and the two results were averaged. To ensure stable classification results, the overall process was repeated 10 times for each of the four models and the results were averaged. In addition, to test the generalizability of the machine learning models, another experiment was conducted by taking AusDiab and the MCCS as the training set and the NWAHS as the external validation set.
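The model suite and the repeated two-fold cross-validation scheme can be sketched with scikit-learn. The data are synthetic and only 3 repetitions are run for brevity (the study used 10); this is an illustration of the procedure, not the study's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

# Synthetic imbalanced cohort with 8 predictors
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.95, 0.05], random_state=0
)

# The four model families named in the text
models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "SVM": SVC(kernel="linear", probability=True),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Two-fold CV: train on one half, test on the other, swap, average; repeat with reshuffles
results = {name: [] for name in models}
for rep in range(3):  # 10 repetitions in the study
    cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=rep)
    for train_idx, test_idx in cv.split(X, y):
        for name, model in models.items():
            model.fit(X[train_idx], y[train_idx])
            prob = model.predict_proba(X[test_idx])[:, 1]
            results[name].append(roc_auc_score(y[test_idx], prob))

for name, aucs in results.items():
    print(f"{name}: mean AUC {np.mean(aucs):.3f}")
```

The external-validation experiment follows the same pattern, except that the split is fixed: `fit` on the pooled AusDiab+MCCS rows and score on the held-out NWAHS rows.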

Software
The programming for the Framingham score calculation and preprocessing of the data (participants exclusion process) was completed in MATLAB R2018b [24]. Missing value imputation was done in the R 3.6.1 using the Ranger package (R Foundation for Statistical Computing, Vienna, Austria). Standardization of features and machine learning algorithms were implemented using the Scikit-learn library in Python (Python Software Foundation, Wilmington, United States) [25].

Statistical Analysis
The performance of the Framingham model was evaluated using the area-under-curve (AUC) score, sensitivity (Sen), specificity, and precision, based on the prediction equation and the risk threshold described previously. The performance of the machine learning models was then analyzed and compared with that of the Framingham model, and the categorical net reclassification improvement (NRI) for the paired models was calculated. The optimal threshold for classification was found from the receiver operating characteristic (ROC) curve, as the point with the maximum difference between sensitivity and specificity. Variable importance was assessed using a random forest technique to rank the features according to their contributions to the predictions; this random forest variable ranking method has been successfully used in similar studies [12,26]. The dependent variable for the models was CVD mortality.
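The ROC-based threshold selection and the two-category NRI can be sketched as below. The threshold shown maximizes TPR − FPR (Youden's J, a standard and closely related formulation of the criterion described above), the scores are synthetic, and `categorical_nri` is a hypothetical helper implementing the standard two-category NRI formula, not code from the study.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Synthetic risk scores: 50 "deaths" scored higher on average than 500 "survivors"
y_true = np.concatenate([np.zeros(500), np.ones(50)]).astype(int)
scores = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.5, 1.0, 50)])

# Optimal classification threshold from the ROC curve (Youden's J = sens + spec - 1)
fpr, tpr, thresholds = roc_curve(y_true, scores)
best = np.argmax(tpr - fpr)
print(f"threshold {thresholds[best]:.3f}: sens {tpr[best]:.2f}, spec {1 - fpr[best]:.2f}")

def categorical_nri(y, old_high, new_high):
    """Two-category net reclassification improvement (hypothetical helper)."""
    y = np.asarray(y, dtype=bool)
    up, down = new_high & ~old_high, old_high & ~new_high
    nri_events = (up[y].sum() - down[y].sum()) / y.sum()          # deaths moved up - down
    nri_nonevents = (down[~y].sum() - up[~y].sum()) / (~y).sum()  # survivors moved down - up
    return nri_events + nri_nonevents

# Compare a fixed "baseline" cut-off against the ROC-derived threshold
nri = categorical_nri(y_true, scores > 1.0, scores > thresholds[best])
print(f"NRI {nri:.3f}")
```

A positive NRI means the new classification moves events toward the high-risk category and non-events toward the low-risk category on balance, which is how the NRI figures in the Results tables should be read.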

Ethics Approval
The NWAHS was approved by the Human Research Committee of the Queen Elizabeth Hospital in South Australia, the AusDiab study was approved by the Alfred Human Research Ethics Committee, and the MCCS was approved by the Cancer Council Victoria's Human Research Ethics Committee.

Results
The prediction accuracy of all models, for the individual and combined cohorts, according to the AUC performance measure, is shown in Table 3. For the NWAHS and AusDiab cohorts, all four of the ML models achieved significantly better performance than the Framingham model for predicting CVD deaths. For the MCCS, except for the Logistic Regression model, all other ML models achieved slightly better performance than the Framingham model. When all three study populations were combined (46,315 participants, 982 CVD deaths) the Logistic Regression and Linear Discriminant Analysis models performed significantly better than the Framingham model for predicting CVD deaths.
The classification analysis outcomes can be found in Table 4. For the machine learning models, NRIs of up to 29%, 24%, and 22% were achieved for the NWAHS, AusDiab, and the MCCS, respectively, when compared with the Framingham model. For the aggregated cohort, the machine learning models achieved an NRI of up to 26%. A random forest technique [26] was used to rank variable importance for the three individual datasets and the combined cohort. For all three individual datasets and in the combined dataset, age appeared to be the most important predictor linked to a higher CVD risk, followed by systolic blood pressure.
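The random-forest variable ranking can be reproduced in outline with scikit-learn's impurity-based importances. The data below are synthetic and the feature names are illustrative stand-ins for the eight Framingham predictors; the printed ranking will not match the study's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative names for the eight predictors used throughout the study
features = ["age", "sex", "total_chol", "hdl_chol",
            "sbp", "htn_med", "diabetes", "smoker"]

X, y = make_classification(n_samples=2000, n_features=8,
                           n_informative=4, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ sums to 1; higher values contribute more to the splits
ranking = sorted(zip(features, rf.feature_importances_), key=lambda t: -t[1])
for name, imp in ranking:
    print(f"{name}: {imp:.3f}")
```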

Sex Stratification
An analysis of the prediction accuracy of all models when applied to the combined cohort stratified by sex found that the machine learning models returned higher AUC scores than the Framingham model for both the male and female populations (Table 6). The classification performance of the Framingham model was lower for females than for males, correctly predicting 75 out of 481 CVD deaths for females (Sen 15.6, PPV 15.7), compared to 333 out of 501 deaths for males (Sen 66.3, PPV 8.0). The ML models performed significantly better than the Framingham model at predicting both male and female CVD deaths. In the male cohort, the Linear Discriminant Analysis and Support Vector Machine models predicted 382 out of 501 CVD deaths, while in the female cohort the Logistic Regression and Support Vector Machine models correctly predicted 402 out of 481 CVD deaths. NRIs were up to 6.5% and 48.7% for the male and female cohorts, respectively, compared to the Framingham model. Details of this classification analysis can be found in Tables 6 and 7.

Table 6. Two-fold cross validation: Comparison of the performance of the Framingham Score (baseline model (BL)) and four machine learning (ML) models predicting 15-year risk of CVD mortality using combined data based on sex stratification.

Diabetes Stratification
Among the 46,315 participants in the combined cohort, a total of 3791 reported a diagnosis of diabetes at baseline. All machine learning models achieved a significant improvement in prediction accuracy compared to the Framingham model for both the diabetes and non-diabetes cohorts (Table 8). Additionally, the four ML models performed significantly better in classification performance than the Framingham model for both cohorts (Table 9). The Framingham model correctly predicted only 163 CVD deaths out of 231 in the diabetes cohort (Sen 70.1%, PPV 11.1) and 245 CVD deaths out of 751 in the non-diabetes cohort (Sen 32.6%, PPV 7.7). In comparison, the Linear Discriminant Analysis model performed best in both cohorts, correctly predicting 185 out of 231 CVD deaths (Sen 80.0%, PPV 16.1) in the diabetes cohort and 629 out of 751 CVD deaths (Sen 84.0%, PPV 5.6) in the non-diabetes cohort. For the ML models, NRIs were up to 18.7% and 31.2% for the diabetes and non-diabetes cohorts, respectively, compared to the Framingham model.

Table 8. Two-fold cross validation: Comparison of the performance of the Framingham Score (baseline model (BL)) and four machine learning (ML) models predicting 15-year risk of CVD mortality using combined data based on diabetes stratification.

Table 9. Two-fold cross validation: Comparison of the classification (Sensitivity, Specificity, Precision) and NRI performance of the Framingham Score (baseline model (BL)) and four machine learning (ML) models predicting 15-year risk of CVD mortality using combined data based on diabetes stratification.

External Validation
To evaluate the performance of the ML models on unseen data, prediction models were developed using a combined AusDiab and MCCS dataset as the training set and the NWAHS as the external validation set. The comparison of AUC score, classification results (sensitivity, specificity, and precision), and NRI is shown in Table 10. All four machine learning models achieved a significant improvement in performance (AUC score, sensitivity, precision) compared to the Framingham model when trained on the combined AusDiab and MCCS data and tested on the NWAHS data. The support vector machine achieved an AUC score of 0.880 and a sensitivity of 72.5, much higher than the Framingham model (AUC 0.837, Sen 41.3). The highest NRI was achieved by the linear discriminant analysis model (29.4). Even when the data were stratified by sex and diabetes diagnosis, the machine learning models performed better than the Framingham model.

Table 10. External validation: Comparison of the classification and NRI performance of the Framingham Score (baseline model (BL)) and four machine learning (ML) models predicting 15-year risk of CVD mortality using the combined AusDiab and MCCS dataset as the training set and the NWAHS as the external validation set.

Discussion
This study evaluated the potential of four machine learning CVD risk prediction models for predicting CVD mortality risk in Australian population cohorts, compared with the Framingham model, using eight traditional risk factors. We validated the ML models both internally (two-fold cross validation) and externally (training on combined AusDiab and MCCS data and testing on unseen NWAHS data). To our knowledge, this is the first multiple-dataset and multiple-sub-cohort study applying machine learning to the Australian population, demonstrating improved performance in predicting CVD risk with machine learning models.
All four machine learning models performed significantly better than the Framingham model at identifying individuals at very high risk of CVD in the Australian population in terms of discrimination, risk classification, and decision curve analysis. Machine learning models improved prediction (AUC score) by up to 5.1% in the aggregated cohort (NWAHS, AusDiab, and MCCS combined cohort), 1.9% in the male cohort, 3.5% in the female cohort, 9.1% in the diabetes cohort, and 5.5% in the non-diabetes cohort (See Tables 3, 6 and 8).
Additionally, this study found that machine learning models detected up to 68% more 'true positive' female cases than the Framingham model and identified 49% net reclassification improvement with the ML models (See Table 7). Recent investigations have shown disparities in the care received by Australian women with CVD compared to Australian men [27]. This can in part be attributed to underdiagnosis or delay in diagnosis of women, resulting from sex differences in CVD pathophysiological mechanisms, clinical presentation, and course of disease [27], and a higher prevalence of comorbid conditions in female CVD patients [28]. Framingham models have been found to underestimate CVD risk for women [27]. Machine learning models to specifically target females may reduce the risk of sex disparities in CVD care in Australia.
Machine learning models may also improve the accuracy of risk identification for individuals with Type 2 diabetes, a group with an elevated risk of CVD [29], compared to the non-diabetic population. The 10% increase in the sensitivity of the risk assessment for subgroups with diabetes found in this analysis suggests an opportunity to optimize and individualize cardiovascular risk reduction interventions for individuals with diabetes.
A Synthetic Minority Oversampling Technique (SMOTE) was used to address the class imbalance. The sample used in the analysis was sufficiently powered for machine learning modelling approaches, and SMOTE is an accepted method for treating imbalanced data [22].
With the growing number of electronic health record datasets, there is an opportunity to use machine learning techniques to improve the accuracy of models by enabling a more nuanced account of the complex relationships between multiple, correlated, and nonlinear risk factors and outcomes [10] and supporting an adaptive approach for risk predictor revisions [30]. Incorporated into decision making tools in general practice, machine learning models of CVD risk may offer more accurate information to guide clinicians' recommendations for treatment for high risk individuals. Intensive risk factor management can potentially lead to a reduction in CVD events and, particularly, of nonfatal myocardial infarction, stroke, and CVD death [2].
This analysis combined data collected across three prospective cohort studies. One limitation of this approach is that there may be unknown inaccuracies in these data, in the recorded cause of death, and in self-reported variables (smoking status, diabetes, and use of medication). There is also known missing data in these datasets (32% of HDL cholesterol data were missing from the MCCS dataset). Missing HDL cholesterol data were imputed using a random forest-based imputation method, which can perform well even with a high amount of missing data [31], but imputing large proportions of missing data risks biasing the model. Additionally, although the study cohorts are broadly representative of the wider Australian population, in all cohorts non-English speakers who did not have access to support from an English speaker were excluded, and the MCCS participants were more likely to be older, female, and European-born than other Australians of the same age range [11].
For the purposes of comparison, the analysis approach utilized in this investigation included only the eight key health parameters identified in the Framingham model, developed in 2008, as these factors are routinely included in databases. This may limit the predictive accuracy of our models. Recently established CVD predictors, particularly those associated with elevated CVD risk in females or individuals with history of diabetes, should be included in future databases and investigations of machine learning models. In addition, the Framingham risk model is used to assess cardiovascular disease risk, while in this study we assessed only CVD mortality risk, not CVD incidence because CVD incidence information was not available in all of the three included datasets. Lastly, we did not recalibrate the Framingham model to the Australian dataset as we wanted to compare the machine learning model with the exact same model as recommended by the Framingham 2008 equation [21].

Conclusions
In this study, we developed machine learning risk prediction models for CVD mortality based on data from three popular Australian cohort studies, using the same eight risk variables used by the Framingham 2008 model. The machine learning risk prediction models were significantly better than the traditional Framingham risk model for predicting CVD mortality risk in the Australian population. Machine learning models outperformed Framingham in each of the individual study cohorts and in the combined cohort. Machine learning models also outperformed Framingham when stratified by sex and by diabetes diagnosis. Our findings suggest that machine learning models should be considered in the development of standard CVD risk assessment scales in future.

Informed Consent Statement: Not applicable (secondary data analysis).

Data Availability Statement:
Restrictions apply to the availability of these data. Data were obtained from the North West Adelaide Health Study; the Australian Diabetes, Obesity, and Lifestyle study; and the Melbourne Collaborative Cohort Study.