Comparison of Unsupervised Machine Learning Approaches for Cluster Analysis to Define Subgroups of Heart Failure with Preserved Ejection Fraction with Different Outcomes

Heart failure with preserved ejection fraction (HFpEF) is a heterogeneous condition affecting nearly half of all patients with heart failure (HF). Artificial intelligence methodologies can be useful for identifying patient subclassifications with important clinical implications. We compared the capabilities of different machine learning (ML) clustering techniques in defining meaningful subsets of patients with HFpEF. Three unsupervised clustering strategies, hierarchical clustering, K-prototype, and partitioning around medoids (PAM), were used to identify distinct clusters in patients with HFpEF based on a wide range of demographic, laboratory, and clinical parameters. The study population had a median age of 77 years, a female majority, and moderate diastolic dysfunction. Hierarchical clustering produced six groups, but two were too small (two and seven cases) to be clinically meaningful. The K-prototype method produced clusters in which several clinical and biochemical features did not show statistically significant differences, and there was significant overlap between the clusters. The PAM methodology provided the best group separation, identifying six mutually exclusive groups (HFpEF1-6) with statistically significant differences in patient characteristics and long-term outcomes, including mortality. By comparison, the two other clustering algorithms, hierarchical clustering and K-prototype, were less optimal.


Introduction
The recognized heterogeneity in various clinical conditions has led to calls for the use of data analytic techniques to more precisely define patient groups and thereby customize or personalize treatment [1]. Cluster analysis has been championed as a methodology to aid in the identification of patient subtypes in complex datasets, which may overstretch a human's capacity to evaluate such data. One condition that has challenged the field of cardiovascular disease is heart failure with preserved ejection fraction (HFpEF). This condition affects nearly half of all patients with heart failure (HF) [2,3] and appears to have an extremely heterogeneous pathophysiology [4,5]. HFpEF has been resistant to conventional therapies that have been successful in the treatment of other kinds of HF. The absence of an array of effective therapies accounts, in part, for a reported 50% mortality over 5 years for HFpEF and an annual mortality of 29% after an acute decompensated HF admission [6][7][8]. The challenges in treating HFpEF underscore the importance of translating the heterogeneity of its pathophysiology into clinically identifiable phenotypes.
Novel methods, such as machine learning (ML) and specifically cluster analysis, have been proposed to aid in the understanding of this cardiovascular condition [9]. Unsupervised ML seeks inherent patterns in large complex datasets without prior knowledge of the outcome [10]. Cluster analyses, such as hierarchical clustering, K-prototype, and partitioning around medoids (PAM), are important unsupervised ML techniques [10][11][12]. Each of these commonly used approaches has its own strengths and weaknesses, necessitating comparative studies for optimal algorithm selection. However, comparative analyses are rarely published in healthcare, and HF studies in particular usually apply only one clustering algorithm [13]. The objective of this study is to compare these three different unsupervised ML approaches on a single mixed dataset of outpatients with HFpEF in order to determine which analytical methodology provides a better informatic approach to the identification of patient subsets with different outcomes.

Study Population
The study population has been described in detail previously [14]. Briefly, it was a retrospective study of 196 consecutive patients with HFpEF assessed at an ambulatory cardiology clinic. The study was approved by the ethics department at the University of British Columbia. The inclusion criteria were adults over the age of 18, in whom HFpEF was confirmed on a transthoracic echocardiogram (TTE). HFpEF was defined based on the 2016 European Society of Cardiology criteria that specified a left ventricular ejection fraction (LVEF) ≥50% [15].

Machine Learning
Cluster analysis identifies similar subjects based on their combined features. Similar to other studies applying artificial intelligence to HFpEF, we identified variables that were highly correlated, with a Pearson's correlation coefficient > 0.6, and removed them from the analysis [16,17]. We removed medications from the cluster analysis because they were strongly correlated with the medical conditions already in the dataset. Variables with more than 1% missing data were also removed from the clustering analysis. Imputation using a stochastic gradient descent algorithm [18] replaced a total of 10 missing values in the entire dataset. Three different approaches to cluster analysis were applied to the remaining 47 candidate variables. Normally distributed continuous variables were described as means and standard deviations; others were expressed as medians and interquartile ranges. A p-value < 0.05 was considered statistically significant. All analyses were performed with RStudio version 1.2 (RStudio Inc., Boston, MA, USA).
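The correlation-based feature pruning described above can be sketched as follows. This is an illustrative Python/NumPy sketch, not the study's R code; the variable names (`age`, `bmi`, `weight`) and data are hypothetical, and the greedy keep-first rule is one reasonable way to drop one member of each highly correlated pair.

```python
import numpy as np

def prune_correlated(X, names, threshold=0.6):
    """Greedily keep a column only if its absolute Pearson correlation
    with every already-kept column is at or below the threshold."""
    r = np.corrcoef(X, rowvar=False)  # feature-by-feature Pearson matrix
    keep = []
    for j in range(X.shape[1]):
        if all(abs(r[j, k]) <= threshold for k in keep):
            keep.append(j)
    return keep, [names[j] for j in keep]

# Synthetic example: weight is nearly collinear with BMI, so it is dropped.
rng = np.random.default_rng(0)
age = rng.normal(77, 8, 200)
bmi = rng.normal(29, 4, 200)
weight = bmi * 2.9 + rng.normal(0, 1, 200)
X = np.column_stack([age, bmi, weight])
kept_idx, kept_names = prune_correlated(X, ["age", "bmi", "weight"])
```

In practice the study kept the more clinically meaningful member of each correlated pair, which a manual review (rather than the keep-first rule above) would decide.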
Hierarchical clustering utilizes a tree-like structure, the dendrogram, to identify clusters closest to each other and thus distinct subgroups [10]. The Euclidean distance, a common metric in hierarchical clustering, was used to determine the separation between the subgroups [19][20][21]. The second method, K-prototype, is a non-hierarchical iterative clustering algorithm. It integrates the K-means algorithm, for numerical variables, with the K-modes algorithm, for categorical variables, to cluster data with mixed numeric and categorical values [11]. It uses an iterative process to create 'k' centroids, assigns each data point to its closest centroid, and thereby defines groups with high internal similarity [11]. The third method, partitioning around medoids (PAM), is another non-hierarchical iterative method that assigns 'k' random entities to be medoids, the most centrally located objects in each cluster. PAM is more robust than K-prototype [12,22]. Another difference between K-prototype and PAM is the distance metric used: PAM uses the Gower distance to measure the similarity between observations. The Gower distance selects a distance metric suited to each variable type and scales the results to fall between zero and one [12].
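A minimal sketch of the Gower distance for mixed data, as used by PAM, is shown below. It assumes the common convention that numeric variables contribute a range-normalized absolute difference and categorical variables a 0/1 mismatch, with the per-variable distances averaged; the patient records and ranges are hypothetical.

```python
def gower(a, b, num_ranges, cat_idx):
    """Gower distance between two records.
    num_ranges: {column index: observed range} for numeric columns.
    cat_idx: set of categorical column indices."""
    d = []
    for j, (x, y) in enumerate(zip(a, b)):
        if j in cat_idx:
            d.append(0.0 if x == y else 1.0)      # simple matching
        else:
            d.append(abs(x - y) / num_ranges[j])  # range-normalized
    return sum(d) / len(d)                        # average, falls in [0, 1]

# Two hypothetical patients: (age, systolic BP, sex, atrial fibrillation)
p1 = (62, 118, "F", "no")
p2 = (80, 145, "M", "no")
ranges = {0: 40.0, 1: 60.0}  # assumed observed ranges for age and SBP
dist = gower(p1, p2, ranges, cat_idx={2, 3})
# age: 18/40 = 0.45; SBP: 27/60 = 0.45; sex mismatch: 1; AF match: 0
# average = (0.45 + 0.45 + 1.0 + 0.0) / 4 = 0.475
```

Because each variable's contribution is already scaled to [0, 1], no single variable type dominates the similarity measure, which is why Gower suits mixed clinical datasets.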
As part of the internal validation, the commonly used silhouette width, connectivity index, and Dunn index were evaluated for each clustering technique to arrive at the optimal number of clusters. A silhouette plot was used to determine the optimal number of clusters. This tool employs an aggregated measure of similarity between observations within a cluster and compares it to observations in neighboring clusters [23]. The number of clusters resulting in the largest silhouette width is typically the recommended choice. Silhouette width is an index reflecting the compactness of clusters and their separation from each other [23]. The Dunn index identifies clusters that are compact, with small variance between members of the cluster, and the connectivity index evaluates the interconnectedness of the members of each cluster. T-distributed stochastic neighbor embedding (t-SNE) was used to visualize the high-dimensional data by giving each data point a location on a two-dimensional plot.
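The silhouette width used above can be sketched directly from its definition: for each point, a is the mean distance to its own cluster, b is the mean distance to the nearest other cluster, and s = (b − a) / max(a, b), so widths near 1 indicate compact, well-separated clusters. The one-dimensional toy data below are illustrative, not study data.

```python
import numpy as np

def silhouette_widths(X, labels):
    """Per-point silhouette widths for 1-D data using absolute distance."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    D = np.abs(X[:, None] - X[None, :])  # pairwise distance matrix
    s = np.empty(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                   # exclude the point itself
        a = D[i, same].mean() if same.any() else 0.0
        b = min(D[i, labels == c].mean()  # nearest other cluster
                for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Two well-separated toy clusters -> mean width close to 1.
X = [0.0, 0.2, 0.4, 10.0, 10.3, 10.1]
widths = silhouette_widths(X, [0, 0, 0, 1, 1, 1])
mean_width = widths.mean()
```

In the study this statistic was computed across candidate values of k for each algorithm, and the k with the largest average width was taken as optimal.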

Results
The hierarchical clustering method revealed the underlying architecture of the dataset and generated six different clusters. The optimal number of clusters was determined by the dendrogram and the Euclidean distance between groups (Figure 1). This solution also achieved better connectivity and a higher Dunn index. Three larger clusters were found with 46 (Cluster 3), 49 (Cluster 4), and 69 (Cluster 6) members. Three smaller clusters with 23 (Cluster 1), two (Cluster 2), and seven (Cluster 5) members were also identified. There were significant differences among the clusters in features such as age, gender, diabetes, dyslipidemia, and coronary artery disease (CAD) (Table 1). Although the outcomes demonstrated statistical significance, they are not applicable clinically given the small size of two of the clusters.

In the K-prototype method, utilizing the optimal silhouette width, four different clusters were found. This was also signaled by the slightly better connectivity and Dunn index compared to the other cluster sizes using this method. Figure 2 demonstrates the t-SNE plot for the clusters produced by K-prototype. Cluster 2 was the smallest cluster, with nine members. The age and gender of subjects in these clusters were significantly different, with Cluster 4 having the youngest members (Table 2). Clusters 1 and 3 consisted of predominantly females, while Clusters 2 and 4 were predominantly male. Comparing comorbidities, only atrial fibrillation (AF) and chronic kidney disease (CKD), defined as estimated GFR < 60 mL/min/1.73 m², were significantly different among the clusters. Systolic blood pressure (SBP) and serum low-density lipoprotein cholesterol (LDL-c) showed significant differences: Cluster 1 had the highest SBP, and Cluster 4 had the highest LDL-c. Serum BNP was significantly higher in Cluster 2 compared to the others.
There were significant differences among the clusters on TTE, specifically in LVEF, mitral valve E/A ratio, average E/e' ratio, and pulmonary artery pressure. The presence of elevated filling pressure was significantly higher in Cluster 2 and lowest in Cluster 4. The degree of diastolic dysfunction was significantly higher, and more likely to be moderate or severe, in Clusters 2 and 1, while most likely to be mild in Cluster 4. The Meta-Analysis Global Group in Chronic Heart Failure (MAGGIC) risk score was significantly higher in Cluster 3 compared to the other clusters. With respect to outcomes, HF exacerbation was highest in Cluster 1, whereas cardiovascular mortality, all-cause mortality, and the composite of endpoints were highest in Cluster 2. The PAM method generated six significantly different clusters, which were well-separated as demonstrated by the optimal silhouette width, connectivity index, and Dunn index. The features of these six subgroups were compared in detail [14]. Briefly, there were three groups of women: those with a low proportion of vascular risk factors (HFpEF1), individuals with a high proportion of hypertension and diabetes (HFpEF3), and older individuals with high rates of atrial fibrillation and chronic kidney disease (HFpEF4). The other clusters were mostly men: those with a high proportion of coronary artery disease, dyslipidemia, and diastolic dysfunction (HFpEF2), those with the highest BMI, obstructive sleep apnea, and poorly controlled diabetes (HFpEF5), and individuals with high rates of AF, elevated BNP, and biventricular remodeling (HFpEF6) [14]. Figure 3 demonstrates the t-SNE plot for the clusters produced by PAM.
Figure 2. T-distributed stochastic neighborhood embedding used to show the local structures of the clusters produced by K-prototype on a two-dimensional plot. Each point represents one study subject.

Discussion
This is the first study, to our knowledge, to compare different clustering algorithms using the same dataset of outpatient subjects with HFpEF. Hierarchical clustering was the first method used to separate patients with HFpEF. Hierarchical clustering, visualized by dendrograms, produces a single nested hierarchy from which a partition can be obtained for any possible choice of the number of clusters [24]. While this feature makes hierarchical clustering popular, it is also a challenge to interpret whether the identified clusters represent an important underlying structure or are artifacts of natural sampling variation [24]. This method is limited by its experimental approach and the arbitrary determination of the number of clusters based on the resulting dendrogram, which is consistent with the contention of others [16]. In our study, in addition to encountering this limitation, the cluster sizes differed widely from each other. The two smallest clusters had only two (Cluster 2) and seven (Cluster 5) members. Having such small clusters could indicate that there was a significant separation distance between each of these clusters and the larger ones. Comparing the clinical and echocardiographic features of these two clusters with their neighbors, however, fails to show any important features that justify having these two groups. In fact, the more likely explanation is that Cluster 2 and Cluster 5 contain members that lie on the border of their neighboring clusters or are borderline outliers. Thus, the hierarchical algorithm, which works well when large separation distances exist between clusters [19][20][21], may be artificially forming these smaller subgroups. Indeed, if a slightly larger cut-off of the Euclidean distance had been used, by moving the horizontal line on the dendrogram higher, these smaller clusters would have been joined with larger groups.
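The cut-height effect described above can be illustrated with a toy single-linkage example on synthetic one-dimensional data (not study data): in one dimension, cutting a single-linkage dendrogram at a given height is equivalent to splitting the sorted points wherever a gap exceeds that height, so raising the threshold merges a borderline singleton into its neighboring cluster.

```python
def single_linkage_clusters(points, cut):
    """1-D single linkage: sort, then start a new cluster wherever the
    gap between consecutive points exceeds the cut distance."""
    pts = sorted(points)
    clusters, current = [], [pts[0]]
    for p in pts[1:]:
        if p - current[-1] > cut:
            clusters.append(current)
            current = [p]
        else:
            current.append(p)
    clusters.append(current)
    return clusters

data = [1.0, 1.2, 1.4, 2.9, 5.0, 5.1, 5.3]  # 2.9 is a borderline point
low = single_linkage_clusters(data, cut=1.0)   # 2.9 becomes a singleton
high = single_linkage_clusters(data, cut=2.0)  # 2.9 joins its neighbor
```

With the lower cut, the borderline point forms its own tiny "cluster", analogous to the two- and seven-member groups above; with the higher cut, it is absorbed into the adjacent larger group.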
The challenges posed by identifying the optimal number of clusters with the hierarchical method, together with our clinical interpretation of the results, lead us to conclude that this method is not optimal.
The K-prototype method produced four different clusters. We found some similarities between these clusters and the four groups of HFpEF identified in the outpatient setting by Harada et al. using a similar algorithm, K-means [25]. In that study, similar demographic and clinical information was used, with the difference that clinical outcomes were also applied to assist in forming the clusters. Categorical variables had to be converted to numerical ones due to the inherent limitations of the K-means algorithm. Another limitation of this method is that the K-means and K-prototype algorithms require an estimate of the number of clusters that naturally exist in the data and cannot identify the optimal number of clusters [25]. This limitation opens the door for random errors and biases in this clustering approach. Group 1, with younger patients and an LV relaxation abnormality, had the lowest mean mitral E/e' ratio and was most similar to Cluster 4. Group 2, with older patients with renal dysfunction, was similar to Cluster 2. Group 3, with AF and advanced biventricular diastolic dysfunction, was similar to Cluster 1. Group 4, with older patients, renal dysfunction, and a high PASP, was similar to Cluster 3. The K-prototype method discovered some underlying data structure; however, several clinical and biochemical features did not show statistically significant differences. Additionally, there was significant overlap between the clusters, as demonstrated graphically (Figure 2). Thus, this approach did not satisfy the aim of identifying optimized HFpEF clusters.
Utilizing the PAM method, we discovered six distinct clusters in our database [14]. The clustering results using the PAM method not only showed significant clinical features, but also had better overall internal validity, including increased cluster compactness and separation distance. In contrast, the hierarchical and K methods produced a subgroup that consists of only 5% or less of the population thereby suggesting a less meaningful group. We used measures of internal validity as mentioned in the methods section to arrive at the optimal number of clusters within each clustering technique and did not utilize these to compare the different methods.
Few studies have conducted head-to-head comparison of different clustering algorithms on the same dataset. In a study by Bose and Radhakrishnan, three different unsupervised ML algorithms (hierarchical, K-means, and PAM) were used to identify subgroups of outpatients with HF [26]. They concluded that hierarchical clustering was the best technique using internal validation methods. There are several limitations to that study. First, from the perspective of the study of HFpEF, it did not distinguish HFpEF from HFrEF, and no laboratory or echocardiographic features were used for clustering. Instead, focus was placed on medical history, symptoms and quality of life. By performing feature selection, the number of variables used for identifying the potential clusters was reduced from nearly 300 to only seven. This was performed using a software package to find the "best fit" features for the data [26]. The drastic reduction in the number of features and use of internal validity instead of clinical outcomes for comparing the clustering techniques limit the reliability of their conclusion.
In another study, Preud'homme et al. used three unsupervised ML algorithms (hierarchical clustering, PAM, and K-prototype) to compare their efficacy [27]. These methods were first applied to a dataset of virtual populations and subsequently to the dataset from the EPHESUS randomized clinical trial, which only included patients with HFrEF [27]. For the latter analysis, the number of clusters was fixed at four for all the different algorithms, allowing minimal flexibility. The authors concluded that K-prototype was dominant among the unsupervised ML methods compared. The conclusion was based on comparing the clusters from different methods using the adjusted Rand index, which is a measure of similarity between clusters [27]. We believe arbitrarily setting the number of clusters to four is a major limitation of this study because it introduces significant bias and restricts the algorithms from identifying the optimal number of clusters. Outcome data were also not used to compare how the clusters differ from each other. Furthermore, that study is likely not relevant to HFpEF since all subjects had HFrEF.
Our analysis supports the contention that multiple relevant partitions can be found in a population, and cluster analysis can be considered successful if it makes sense to the practitioner specialized in the field [27]. Our study is the first, to our knowledge, to compare different clustering algorithms using the same dataset of outpatient subjects with HFpEF to identify the most suitable algorithm yielding clinically relevant results in this population.
While some might argue that heart failure is a clinical entity in which the clinician's role is critical in the diagnosis, and there is no need for the application of machine learning to define subsets of patients, it is important to underscore the heterogeneous etiology of HFpEF which demands characterization. HFpEF has been attributed to a diverse range of abnormalities of cardiomyocyte structure and function, cardiac fibrosis, myocardial extra-cellular matrix, and vascular function [4,5]. Each of these distinct pathophysiologic entities may have different manifestations and their definition will eventually permit a tailored approach to the treatment or implementation of a precision medicine approach to the treatment of HFpEF.
Several studies have applied machine learning approaches to search for subtypes or different phenogroups in patients with HFpEF. Segar et al. examined a subset of 654 participants in the TOPCAT (Treatment of Preserved Cardiac Function Heart Failure with an Aldosterone Antagonist) study [17]. They also removed variables with a correlation coefficient >0.6, keeping the variable that was most clinically meaningful. They identified three mutually exclusive phenogroups of HFpEF participants using penalized finite mixture model-based clustering analysis on 61 mixed-data phenotypic variables [17]. Kao et al. examined data from patients with HFpEF in the I-PRESERVE (Irbesartan in Heart Failure With Preserved Systolic Function) study and considered eleven prospectively selected clinical features [28]. Data analysis used the poLCA library in the R statistical package, and latent class analysis (LCA) definitions were derived using maximum-likelihood estimation to determine the most likely subgroup for each patient [28]. They identified six phenogroups [28]. Hedman et al. evaluated data from 320 patients with HFpEF, clustering 32 echocardiographic and 11 clinical or laboratory variables from the Karolinska-Rennes cohort (KaRen study) [29]. Model-based clustering of standardized variables was performed using the Mclust function, and the optimal model and number of clusters were determined by the maximum BIC with three multinomial classification methods in the R statistical package [29]. They identified six composite phenogroups [29].

Limitations
There are several limitations of the study that have been previously discussed in detail [14], and some warrant further comment. ML algorithms handling small sample sizes with a large number of variables can run into problems of dimensionality and overfitting [13]. Thus, we reduced the number of highly correlated variables to minimize the effects of dimensionality. Second, the number of patients studied is small. However, the data collection allowed for a detailed description of each person. Third, we used only three different ML methodologies so that we could readily compare them. Other approaches exist and warrant comparison, but increasing the number of methods would make the comparison more challenging. Lastly, defining the best ML method for defining subsets of patient phenotypes can be challenging. However, we contend that a method that defines a subgroup consisting of only 5% or less of the population does not identify a meaningful group. We selected the PAM method not only because it avoided small and less meaningful subgroups, but also because it identified subgroups with significant clinical features and had better overall internal validity, including increased cluster compactness and separation distance. The computational times of the different methods were not recorded and could not be used as an index for comparison.

Conclusions
This is the first study, to our knowledge, to conduct cluster analysis on outpatients with chronic HFpEF, an entity with a high mortality that is resistant to most current therapies, utilizing and comparing three different unsupervised ML approaches. The PAM method performed robustly on this mixed dataset, identifying six different subtypes of HFpEF (HFpEF1-6) that were found to be discrete when assessed on independent clinical outcomes. In contrast, hierarchical clustering and the K-prototype method did not yield clusters that were as clinically distinct. This study demonstrates the need for careful evaluation of data analytic techniques in machine learning when applied to clinical medicine.

Funding: This study received no external funding.
Informed Consent Statement: Patient consent was waived because this was a retrospective study; however, patients had given consent for this kind of research at enrolment in the clinic.