Supervised Analysis for Phenotype Identification: The Case of Heart Failure Ejection Fraction Class

Artificial Intelligence is creating a paradigm shift in health care, with phenotyping patients through clustering techniques being one of the areas of interest. Objective: To develop a predictive model to classify heart failure (HF) patients according to their left ventricular ejection fraction (LVEF), by using available data from Electronic Health Records (EHR). Subjects and methods: 2854 subjects over 25 years old with a diagnosis of HF and LVEF, measured by echocardiography, were selected to develop an algorithm to predict patients with reduced EF using supervised analysis. The performance of the developed algorithm was tested in heart failure patients from Primary Care. To select the most influential variables, the LASSO algorithm was used, and to tackle the issue of one class exceeding the other by a large amount, we used the Synthetic Minority Oversampling Technique (SMOTE). Finally, Random Forest (RF) and XGBoost models were constructed. Results: The full XGBoost model obtained the maximum accuracy, a high negative predictive value, and the highest positive predictive value. Gender, age, unstable angina, atrial fibrillation and acute myocardial infarction are the variables that most influence EF value. Applied to the EHR dataset, with a total of 25,594 patients with an ICD code of HF and no regular follow-up in cardiology clinics, 6170 (21.1%) were identified as belonging to the reduced EF group. Conclusion: The obtained algorithm was able to identify a number of HF patients with reduced ejection fraction, who could benefit from a treatment protocol with a strong possibility of success. Furthermore, the methodology can be used for studies using data extracted from Electronic Health Records.


Introduction
Artificial intelligence (AI), an interdisciplinary science with multiple approaches, is a wide-ranging branch of computer science. Advancements in machine learning and deep learning are creating a paradigm shift in virtually every sector, including medicine, with phenotyping patients through clustering techniques being one of the areas of interest [1]. The goal of phenotyping patients is to allow for the identification of patient subgroups with similar presentation, prognosis and response to therapy.
Heart failure (HF) is a major health care problem worldwide, for which left ventricular ejection fraction (LVEF) has established clinically useful phenotypes for guiding treatment to reduce associated mortality and morbidity [2,3]. Classically, two LVEF phenotypes are recognized in heart failure: reduced LVEF (HFrEF) and preserved LVEF (HFpEF) [4,5]. However, the European Society of Cardiology recently added a third, intermediate LVEF phenotype. Although ejection fraction (EF) class is an important predictor of treatment response, quantitative EF values are frequently missing from the data available in electronic health records (EHR), limiting their usefulness in clinical and health service research [6]. An immediate next step is to develop algorithms and strategies to identify the distinct phenotypes in the absence of EF measured by echocardiography.
When looking for a proxy to identify HF phenotypes, AI methods could be useful in combining information that is usually collected in EHRs. Machine learning, an application of artificial intelligence that focuses on how computers learn from data, provides methods and techniques that are increasingly applied in medicine. Disease identification [7] and pathology and image diagnosis [8,9], as well as clinical research, are some of the main applications of machine learning in epidemiology and clinical medicine, among others [10]. To explore phenotypes of patients with chronic HF, prior studies have already used hierarchical clustering to classify HFpEF patients [11,12]. These studies mainly focused on the definition of phenotypes rather than on predicting the EF class by using a proxy that makes the decision based on information available in the EHR.
In this current study, we developed a predictive model that classifies HF patients according to their LVEF by using available features such as age, gender and present diseases. The goal was to overcome the performance of previous studies by using a new approach, supervised analysis, a subfield of machine learning where models can be trained to predict the class of the target variable with earlier knowledge of the output values from prior data [13].

Data Source and Study Population
Subjects older than 25 years with a diagnosis of heart failure, ICD-9 codes (402.X1, 404.X1, 404.X3, 428 and 398.91), were selected from the EHR system of a community of people over 25 years in 2012 (Supplementary Material, Figure S1). From this database, we selected a group with available LVEF values measured by echocardiography, classified as HFrEF or HFpEF according to the EF measurement (LVEF < 40% and ≥ 40%, respectively). A second group of patients with an HF diagnosis but no LVEF values in the EHR was collected, with or without regular follow-up by Cardiology Departments (Figure 1). The variables to be tested were selected from those codified in the ICD-9. The study was approved by the Ethical Committee of the Hospital Clinico of Valencia in the scope of BigData@Better Heart, a project funded under the IMI2 programme (IMI2-FPP116074-2). Consent forms were obtained from the patients who took part in the echocardiography study, and the data of the second group were pseudo-anonymized, making it impossible to use this information to identify the patients, since the only link between the data and the patient is a code not available to the researchers.

Selection of Variables and Analytical Procedure
Variables included demographic information (age and gender) and ICD-9-codified diseases. The original dataset was split into two partitions, corresponding to 80% and 20% of the original dataset. The first, a training set, was used to train the different models developed in the study. The second, a test set, was used to measure the performance of the models developed with the training set. This partition was performed using the caret package in R, which allows the data to be split while maintaining the proportion of classes in both partitions. The most influential variables were identified using feature selection methods.
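As an illustration, the 80/20 class-preserving split described above can be sketched as follows. The study used R's caret package; the scikit-learn call below is an assumed Python stand-in, run on made-up data with a minority-class prevalence similar to the study's.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))                 # hypothetical features
y = (rng.random(n) < 0.23).astype(int)      # ~23% minority class (HFrEF-like)

# stratify=y keeps the class proportions equal in both partitions,
# mirroring what caret's createDataPartition does in R.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

print(round(y_tr.mean(), 2), round(y_te.mean(), 2))
```

Without stratification, a rare class can end up under-represented in the test set by chance; stratifying removes that source of variance from the performance estimates.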

Feature Selection
The least absolute shrinkage and selection operator (LASSO) was used. This method creates a regression model where the estimated coefficients β_i for each variable are penalized [14,15] or set to zero. The general formula of the regression model, expressed in vector notation, is

Y = Xβ + ε,    (1)

where Y is the end-point vector (our target), X is the matrix of covariates in our model, β is the vector of coefficients for these covariates, and ε is a random error. The β parameters are typically estimated by minimizing the sum of squares of the residuals; this is called the Ordinary Least Squares (OLS) approach, and the loss function being minimized is

L_OLS(β) = Σ_{i=1}^{n} (y_i − x_i^T β)².    (2)

When LASSO is used, a penalization term is added to this formula, resulting in the following equation:

L_LASSO(β) = L_OLS(β) + λ Σ_{j=1}^{m} |β_j|.    (3)

Here, λ is known as the regularization penalty. If λ is set to zero, Equation (3) reduces to Equation (2), so minimizing L_LASSO(β) means minimizing L_OLS(β). Conversely, as λ increases, the penalty term λ Σ_{j=1}^{m} |β_j| dominates, and minimizing L_LASSO(β) drives the coefficient values much lower than in the λ = 0 case. To choose an optimum λ value, we defined a set of λ values and, for each λ, estimated the β̂ that minimizes L_LASSO(β), obtaining two paired sets of λ and β̂ values. The covariates whose coefficients are not set to zero are usually the final ones selected; discarding the zeroed features also reduces the complexity of the model. We performed the LASSO algorithm setting the EF class as the endpoint (y_i) and introducing the rest of the covariables into the model (x_i).
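The selection step can be sketched as follows. The study used an R implementation; this scikit-learn version is an illustrative stand-in on synthetic data, using an L1-penalized logistic model for a binary endpoint (in scikit-learn, C = 1/λ, so a small C means strong penalization).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, m = 500, 10
X = rng.normal(size=(n, m))
# Only features 0 and 3 truly drive this hypothetical binary outcome.
logit = 1.5 * X[:, 0] - 2.0 * X[:, 3]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Strong L1 penalty (large lambda): irrelevant coefficients shrink to zero.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_[0])   # indices of non-zero coefficients
print(selected)
```

The covariates surviving at the chosen λ become the predictors of the reduced model, exactly the role LASSO plays in the workflow above.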

Imbalanced Data Distribution
To tackle the issue of one class exceeding the other by a large proportion, we used the Synthetic Minority Oversampling Technique (SMOTE) [16,17] included in the DMwR package. This algorithm creates new minority class examples by interpolating between existing ones. Although balancing seems to be a convenient procedure to perform before building any classification model, we also created a predictive model using the original database. To avoid the problem of data leakage, the different techniques applied to the data, such as SMOTE, feature selection and hyperparameter tuning, were applied only to the training set, never to the test set.
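The core SMOTE idea, creating a synthetic minority example by interpolating between a minority point and one of its nearest minority neighbours, can be sketched in a few lines. This is a simplified toy version on made-up data, not the DMwR implementation used in the study.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic rows from minority-class matrix X_min."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from point i to every other minority point.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]     # k nearest, skipping itself
        j = rng.choice(neighbours)
        gap = rng.random()                      # uniform in [0, 1)
        # New point lies on the segment between the two minority points.
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

rng = np.random.default_rng(2)
minority = rng.normal(size=(40, 3))             # hypothetical HFrEF rows
synthetic = smote_like(minority, n_new=100)
print(synthetic.shape)                          # (100, 3)
```

Because each synthetic row is a convex combination of two real minority rows, the new examples stay inside the region the minority class already occupies, unlike plain replication, which only duplicates existing points.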

Model Development
We created several models based on two different machine learning algorithms, Random Forest (RF) and XGBoost [18], to compare their overall performance. We constructed RF and XGBoost models under four possible preprocessing configurations of the dataset: balanced data with feature selection (LASSO) or with all of the variables, and unbalanced data with either feature selection or all of the variables. Then, we defined a set of values for the model hyperparameters. A grid was used, where the following hyperparameters were introduced: (a) Random Forest: number of trees and number of candidate variables at each split; (b) XGBoost: subsample ratio, ratio of subsampled columns per tree, maximum tree depth, learning rate, regularization terms and partition threshold. We used k = 5 cross-validation in the training set to test all possible combinations and find the most convenient tuning.
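The grid search with 5-fold cross-validation can be sketched as follows. The two Random Forest hyperparameters named above correspond to scikit-learn's n_estimators and max_features; the grid values and the toy data are illustrative, not the study's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

grid = {
    "n_estimators": [50, 100],   # number of trees
    "max_features": [2, 3],      # candidate variables at each split
}
# cv=5 evaluates every grid combination with 5-fold cross-validation
# on the training data only, then refits the best one.
search = GridSearchCV(
    RandomForestClassifier(random_state=0), grid, cv=5, scoring="roc_auc"
)
search.fit(X, y)
print(search.best_params_)
```

Restricting the tuning to the training partition is what keeps the test-set performance estimates honest, as noted in the data-leakage remark above.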

Performance Measurements
In each model, several performance measurements were calculated, optimizing sensitivity and specificity measurements, but also taking Negative Predictive Value (NPV), Positive Predictive Value (PPV) and Accuracy into account. We made this choice because we aimed to isolate the HFrEF class. In this way, we could be sure that those predicted to be HFrEF (or Positive) would truly be HFrEF.
When comparing between models with similar sensitivity and specificity values, we selected the best by focusing on the other metrics and the Precision-Recall Curve (PR Curve), since it gives a more informative picture of the model's performance than the ROC Curve when the datasets are highly skewed [19,20].
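How the reported metrics relate to the confusion matrix, and how the PR-curve area is computed, can be shown with a short sketch. The counts and scores below are made up purely to illustrate the formulas; they are not the study's results.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical confusion-matrix counts (positive class = HFrEF).
tp, fp, tn, fn = 60, 20, 400, 90
ppv = tp / (tp + fp)                          # positive predictive value (precision)
npv = tn / (tn + fn)                          # negative predictive value
sensitivity = tp / (tp + fn)                  # recall
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(round(ppv, 2), round(npv, 2), round(accuracy, 2))

# PR-AUC (average precision) on hypothetical scores for a skewed class;
# its no-skill baseline equals the class prevalence, unlike the ROC AUC.
rng = np.random.default_rng(4)
y_true = (rng.random(1000) < 0.2).astype(int)
scores = y_true * 0.5 + rng.random(1000)      # informative but noisy scores
print(round(average_precision_score(y_true, scores), 2))
```

Because the PR curve's baseline depends on prevalence, it separates models much more visibly than the ROC curve when, as here, the positive class is a small minority.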
A diagram showing the full process is displayed in Figure 1.

Characteristics of the Study Population
A total of 2854 subjects with HF diagnoses and LVEF measurement were included. Mean age was 74 years old and 47% were females. Diabetes was present in 53.4% of the participants and hypertension in 82.3%, the highest percentage of all the comorbidities. In the complete dataset, HFrEF was present in 23.4%. A total of 2284 patients were used to train the models, while 570 were used to test them. This partition maintains the proportion of HFrEF and HFpEF registers of the complete dataset. From the variables contained in the EHR, 13 were candidates for the models. Age and sex distribution, as well as the prevalence of the relevant variables in the study population, are shown in Table 1.

Models Developed
We constructed two types of models: reduced and full. To obtain the reduced models, we defined a set of different λ values. For each λ value, the algorithm built a single model, adding the penalization (λ) to the coefficients. The most relevant variables are those that appeared in many models, i.e., those whose associated coefficients remained non-zero. Gender, age, unstable angina, atrial fibrillation (AF) and acute myocardial infarction (AMI) were the variables that most influenced the EF class. In Figure 2, the x-axis represents logarithmic lambda values (λ).
To define the most convenient λ value, the λ that maximizes the area under the curve (AUC) was selected. In Figure 3, the two vertical lines indicate two optimum log(λ) values: the first one from the left corresponds to log(λmin), the value that maximizes the model's AUC, while the second corresponds to log(λse). Afterwards, we built a model setting λ at λmin, and the variables whose coefficients were greater or lower than 0 were the predictor variables selected for the final model.
After the feature selection process, we constructed two new datasets: the first one (Balance 1) resulted from oversampling the minority class in the original dataset until it reached the majority class size. In the second one (Balance 2), the minority class was oversampled, maintaining a reasonable balance between both classes without equalizing their sizes. Original and new dataset sizes and proportions are shown in Table 2 (HFpEF, heart failure with preserved ejection fraction; HFrEF, heart failure with reduced ejection fraction). Table 3 summarizes the performance of the candidate models. All the results were obtained by testing these models in the testing dataset. While NPV was high among all models (ranging between 0.84 and 0.88), PPV presented a higher variance (0.44 as the lowest value vs. 0.75 as the highest).
Note that the models built with the original dataset reached a higher accuracy, varying from 0.80 to 0.84, compared to models built with balanced datasets. C-statistics were around 0.70, and the area under the Precision-Recall Curve (AUCpr) obtained the highest values with the original datasets, with its maximum value obtained with the Random Forest model (0.51).

Model Performance
The full model from the original dataset, obtained with XGBoost, was applied to a large dataset of 79,057 HF patients, among whom 26,376 were treated in primary care in the absence of a routine cardiology consultation and without available LVEF (Table 4). Applying the algorithm identified patients with HFrEF: 19,060 among all patients with HF and 6359 among those without a regular cardiology consultation.

Discussion
Left ventricular ejection fraction phenotypes guide the management of HF patients but are frequently not recorded in the EHR from primary care. Having alternatives to help estimate the HF class could be helpful, not only for research in health care services but also for physicians choosing the best treatment. In the present study, a machine learning algorithm to predict the phenotype category based on the main characteristics and diseases of the patient was developed. The full XGBoost model excelled because it offered better modelling, maximized the sensitivity, and reached a high NPV. This approach could be applied to other clinical conditions.
Few studies have attempted to develop methods to predict left ventricular ejection fraction class in patients with heart failure. Some used administrative claims from Medicare [21][22][23] or a specific database such as the Swedish Heart Failure Registry [24]. Those using administrative claims included a large number of variables in the training sample, identified by ICD code. On the other hand, the Swedish Heart Failure Registry study used a restricted number of variables but included laboratory parameters and treatments. Those studies differ from the present one in that this study includes EF-measured patients and uses a different methodological approach. Lee et al. [23] identified atrial fibrillation, obesity, pulmonary hypertension and valvular disease as being significantly associated with the development of HFpEF, while male gender, history of cardiomyopathy, and myocardial infarction were significantly associated with the risk of HFrEF. Overall, and despite limitations, routine clinical characteristics could potentially be used to identify different EF subphenotypes in databases.
Previous studies have also developed statistical and unsupervised learning algorithms to classify LVEF phenotypes. The Desai study [22] analysed 11,073 patients, a much larger sample than ours. Furthermore, the proportion of HFrEF and HFpEF individuals was well balanced, making the distinction between classes easier. Despite this, the overall accuracy of their selected binomial logistic model did not surpass the measures we obtained with our final model. Our analysis was based on supervised analysis and, although machine learning techniques are far from being emergent technologies, their application to LVEF class prediction is certainly innovative. In this study, we used SMOTE, LASSO and two powerful algorithms: Random Forest and XGBoost. The synthetic minority oversampling technique (SMOTE) is one of the most commonly used oversampling methods to address class imbalance in the training dataset. It aims to balance the class distribution by generating synthetic minority class examples. Although the concept is very promising when dealing with extremely skewed data, it does not always improve the model results, as was the case here; nevertheless, in this kind of study it can be very useful.
Continuing with a brief description of these algorithms: Random Forest is a combination of the Decision Tree algorithm and Bagging, both of which belong to supervised analysis. Together, they train a model to predict the class of the target variable with earlier knowledge of the output values deduced from prior data [13]. The other technique used, XGBoost, combines the Boosting and Gradient Boosting algorithms. Boosting sequentially corrects the errors committed by the previous models, which wrongly classified some elements, while Gradient Boosting tries to model the residuals, that is, to transform the errors into a function, helping to avoid overfitting [13]. We chose these algorithms because they were the most suitable for the dataset and for the binary class of the target. In addition, they achieved the best results among models based on alternative machine learning algorithms, such as Naive Bayes, Support Vector Machine and Artificial Neural Network. In particular, XGBoost is becoming popular in machine-learning competitions and among data scientists, as it has been battle-tested for production on large-scale problems [20].
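The "model the residuals" idea behind gradient boosting can be shown with a minimal sketch: each new shallow tree is fit to the residual errors of the current ensemble. This is a simplified squared-error version on made-up data, not XGBoost itself.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=400)

pred = np.zeros_like(y)
learning_rate = 0.3
for _ in range(50):
    residual = y - pred                       # errors of the current ensemble
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learning_rate * stump.predict(X)  # correct the previous mistakes

mse = float(np.mean((y - pred) ** 2))
print(round(mse, 3))
```

Each round reduces the remaining error, and the learning rate damps each correction, which is one of the mechanisms (along with XGBoost's explicit regularization terms) that guards against overfitting.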
Applying the algorithm to the large dataset of patients with HF allowed around 24% of patients to be recognized as HFrEF, who would benefit from more precise treatment. Future research will include time variables, such as time to inclusion from diagnosis dates, medication and hospital admissions. Furthermore, exploring other balancing techniques, such as generating synthetic data based on the individual characteristic distributions, could lead to analytical improvement.
There are some limitations to our research which should be mentioned. As we collected the information from the EHR system, LVEF measurements were not available for a large number of patients and, furthermore, we dealt with an unbalanced dataset, as HFrEF represents a minority of the total HF patients. In addition, there is a wide variety of performance measurements that could be used to evaluate the models; depending on the characteristics and the goal of the problem, some metrics will perform better than others. The selection of the optimum λ was based on the AUC metric, which is the most intuitive and most typically used. Finally, our goal was to maximize the PPV, which entails a relatively lower sensitivity.
Heart Failure Guidelines stratify patients based on the LVEF into HFrEF or HFpEF [4,25]. In those with HFrEF, well-defined treatment strategies reduce the risk of hospitalization and improve survival and, therefore, a clear treatment algorithm is recommended. In those with HFpEF, no treatment has demonstrated an improvement in outcomes to date. Identifying patients with HFrEF in the absence of a measured LVEF can help to introduce treatments which have been proven successful. In addition, it can help to retrieve real-world data from large databases for epidemiological, health care burden and cost studies.
In conclusion, the presented step-by-step AI approach, in the case of the HF phenotype, is a methodology that can help to obtain phenotypes from partially completed databases for different diseases, a common scenario in the EHRs.