Identifying Heart Attack Risk in Vulnerable Population: A Machine Learning Approach

Chattopadhyay, Subhagata; Chattopadhyay, Amit K

doi:10.3390/info16040265

Open AccessArticle

Identifying Heart Attack Risk in Vulnerable Population: A Machine Learning Approach

by

Subhagata Chattopadhyay

¹

and

Amit K Chattopadhyay

^2,*

¹

Institute of Health Management Research, Electronic City, Bangalore 560105, Karnataka, India

²

School of Business, National College of Ireland, Mayor Street Lower, D01 K6W2 Dublin, Ireland

^*

Author to whom correspondence should be addressed.

Information 2025, 16(4), 265; https://doi.org/10.3390/info16040265

Submission received: 19 February 2025 / Revised: 11 March 2025 / Accepted: 24 March 2025 / Published: 26 March 2025

(This article belongs to the Special Issue Transformative Technologies in Healthcare: Harnessing Machine Learning, Deep Learning and Large Language Models in Health Informatics)

Download

Browse Figures

Versions Notes

Abstract

The COVID-19 pandemic has significantly increased the incidence of post-infection cardiovascular events, particularly myocardial infarction, in individuals over 40. While the underlying mechanisms remain elusive, this study employs a hybrid machine learning approach to analyze epidemiological data in assessing 13 key heart attack risk factors and their susceptibility. Based on a unique dataset that combines demographic, biochemical, ECG, and thallium stress tests, this study aims to design, develop, and deploy a clinical decision support system. Assimilating outcomes from five clustering techniques applied to the ‘Kaggle heart attack risk’ dataset, the study categorizes distinct subpopulations against varying risk profiles and then divides the population into ‘at-risk’ (AR) and ‘not-at-risk’ (NAR) groups using clustering algorithms. The GMM algorithm outperforms its competitors (with clustering accuracy and Silhouette coefficient scores of 84.24% and 0.2623, respectively). Subsequent analyses, employing Pearson correlation and linear regression as descriptors, reveal a strong association between the likelihood of experiencing a heart attack and the 13 risk factors studied, and these are statistically significant (p < 0.05). Our findings provide valuable insights into the development of targeted risk stratification and preventive strategies for high-risk individuals based on heart attack risk scores. The aggravated risk for postmenopausal patients indicates compromised individual risk factors due to estrogen depletion that may be further compromised by extraneous stress impacts, like anxiety and fear, aspects that have traditionally eluded data modeling predictions. The model can be repurposed to analyze the impact of COVID-19 on vulnerable populations.

Keywords:

heart attack; risk prediction; CDSS; clustering; COVID-19; linear regressions; Pearson’s correlation

Graphical Abstract

1. Introduction

A growing body of evidence suggests a significant increase in post-COVID-19 cardiovascular events, particularly myocardial infarction, with a disproportionate impact on younger populations. While cardiovascular diseases remain a leading cause of global mortality, claiming approximately 20.5 million lives annually [1], the post-COVID-19 era has witnessed a concerning rise in cardiac events among individuals under 40 [2].

This phenomenon may be attributed to a complex interplay of factors, including post-acute COVID-19 sequelae such as persistent infection, post-COVID-19 syndrome (long COVID), and long-haul COVID-19 [3]. Furthermore, potential contributions from vaccine-related side effects, including myocarditis, dysrhythmias, and thromboembolic events, cannot be excluded [4].

The pathophysiological mechanisms linking COVID-19 to cardiovascular complications are multifaceted and include (i) direct viral damage to the vascular endothelium, leading to impaired blood flow; (ii) persistent viral reservoirs in various tissues; and (iii) dysregulation of the immune system. These factors may collectively contribute to an increased risk of adverse cardiovascular outcomes across all age groups [3]. The virus has a known affinity towards the cardiovascular system with a biochemical pathway connecting the pulmonary system. This pathway connection is perceived as a key player in exacerbating heart attacks and is seen as one of the many manifestations of COVID-19 impact on cardiovascular diseases (CVD) [4]. A recent study showed acute coronary disease, myocardial infarction (heart attack), angina, and ischemic cardiomyopathy (due to vasculopathy) to be the predominant causes of heart attack irrespective of age group [5].

This analyzes temporal data patterns to predict the risk of heart attacks, irrespective of COVID-19 infections. It is interesting to note that despite having vulnerabilities due to exposure to well-established causative factors, not all patients are equally susceptible to heart attacks. Profiling patients according to their CVD susceptibility is a challenging area in healthcare analytics which is the premise of this study. It is therefore important and pertinent to understand the electrophysiology of the heart. The heart is an autonomous organ, a metronome that beats seamlessly in its way and is influenced by several physiological factors, e.g., autonomic stress (high norepinephrine drive), oxidative stress (superoxide-led destruction of vascular tissues), emotional stress (anxiety disorders) leading to sympathetic overdrive, and many others [6]. The heart relies on a delicate balance between sympathetic and parasympathetic activity, modulated by the vagus nerve [7] to maintain a state of physiological equilibrium. This dynamic is reflected in several heart rate variability (HRV) metrics such as SDNN, pNN50, RMSSD, and others [8]. Electrophysiological abnormalities, including atrial fibrillation, ectopic beats, and missed beats, contribute to HRV patterns, with a threshold beyond which they signify pathological conditions [9]. Consequently, cardiac function is intricately linked to electrophysiological factors, influencing the HRV outcomes. This complex interplay contributes to the variability in individual responses to cardiac risk factors, making accurate prediction of myocardial infarction a significant clinical challenge. Given this background, this article presents a novel machine learning (ML)-based analytical framework for cardiovascular risk assessment to analyze heart attack risk from longitudinal data. This framework has the potential to significantly improve patient outcomes by enabling early diagnosis and personalized risk stratification.

Unfortunately, the literature on heart attack risk prediction is not particularly well formed. Sporadic studies have reported (a) longitudinal risk predictions over a span of ten years where the prediction methodology is largely rule-based and hence rigid [10], (b) data mining methods for exploratory data analysis to identify the key predictors of heart attack risk [11], and (c) risk scoring to prioritize the high-risk sub-population within a population with cardiovascular diseases (CVDs) [12]. Thus, a key objective of this article is to propose a hybrid machine learning approach to design, develop, and deploy a clinical decision support system (CDSS) to (i) identify the CVD-vulnerable population using a combination of unsupervised machine learning (clustering) models and (ii) supervised model, such as linear regressions for scoring the risk. The study also compares the performances of different clustering techniques on heart attack risk datasets, drawn from the Kaggle public data source [13]. It is important to note that the Kaggle heart risk data were made available in the public domain in 2021. From the data and its background information, it is unclear whether the impact of COVID-19 on the cardiovascular system has been considered. This is a lateral target of this study, namely, to associate COVID-19 connection with Kaggle data based on statistical modeling.

The aim of this study is to identify significant predictors within the high-risk clusters (identifying the best subjective clustering technique as an inclusive challenge), using Pearson’s correlation coefficient values as the key descriptor, and leverage these values to form a linear regression model for scoring the risk level for each case. On the other hand, the objective of the study is to design, develop, and deploy a hybrid ML model using a combination of unsupervised and supervised learning algorithms to not only predict the risk but also its associated score. The predicted ML scores are expected to complement human evaluation by clinicians in identifying the high-risk population, which will both be non-invasive and preemptive. This would help clinicians and nurses to prioritize high-risk cases.

2. Materials and Methods

The details of the Kaggle heart data (303 × 13) are available in [13]. Table 1 shows the first 10 rows of the sample dataset.

The dataset was uploaded around 2021 when the second wave of COVID-19 and vaccination drives were in full swing. Hence, the influence of COVID-19 on heart health risk has been assumed in this work despite no direct link having been found between the COVID-19 pandemic and/or COVID-19 vaccination and the heart health dataset. It is also worth mentioning that the attributes of the dataset display a unique combination of demographic, biochemical, ECG, and thallium stress test data which have been considered in this work. For running clustering algorithms, initially, the dependent predictor (called ‘outcome’) with the class labels (‘0’ as not at risk or NAR and ‘1’ as at risk or AR) is removed from the dataset. However, later, the performances of the algorithms are tested matching the AR and NAR labels.

Summary of Kaggle heart risk dataset: The shape of the dataset is (303 × 13) meaning 303 cases each having 13 predictors or attributes and one target variable, i.e., the outcome label (AR or NAR). As mentioned, initially, to run the clustering algortihm, the outcome labels are removed from the dataset. The features within the dataset are (i) Age (true value); (ii) Sex (male and female); (iii) Chest pain or cp (‘0’ typical angina, ‘1’ atypical angina, ‘2’ non-anginal pain, ‘3’ asymptomatic); (iv) Resting blood pressure in mmHg (trtbps); (v) Cholesterol in mg% (chol); (vi) Fasting blood sugar (FBS) in mg%, >120 mg% is ‘1’, otherwise, it is ‘0’; (vii) Resting ECG (restecg), presented as ‘0’ is normal, ‘1’ is normal ST-T wave, and ‘2’ denotes left ventricular hypertrophy; (viii) Maximum heart rate achieved (thalachh) while on Thallium stress test; (ix) Previous peak (oldpeak); (x) Slope (slp); (xi) Number of major vessels involved (caa); (xii) Thallium stress test results (thall); distributed between 0 and 3 based on the severity level; and (xiii) Exercise induced angina (exng) presented as ‘0’ no and ‘1’ yes.

The dataset’s two clusters correspond to the AR group (labeled ‘1’, with 165 instances) and the NAR group (labeled ‘0’, with 138 instances), respectively.

Prior to clustering, the dataset underwent tests for internal consistency and normality. Cronbach’s alpha [14] was calculated to assess the reliability of the data. An alpha value above 0.8 indicates high internal consistency. The Shapiro–Wilk test [15] was used to evaluate the normality of the data, with a p-values greater than 0.05 (CI 95%) suggesting a normal distribution. In this case, the alpha value was found to be 0.08, indicating poor internal consistency, while the Shapiro–Wilk test returned a p-value of 0, indicating that the data were not normally distributed. Such results are often encountered with clinical and biological datasets.

This section presents an analysis of five clustering techniques employed to identify AR populations, along with their performance in correctly classifying the data. The ML models have been implemented using a five-fold cross-validation technique with all variations in heart attack risk scores duly noted.

Clustering technique-1: The k-means clustering (k-MC) algorithm [16] is an unsupervised machine learning technique that requires the predefined number of clusters, denoted as ‘k’. The algorithm operates without the need for training and iteratively identifies cluster centroids until convergence. Convergence is achieved when the algorithm has processed the entire dataset, ensuring that no data point remains unassigned to a cluster and the loss function is the minimum. Once the centroids are determined, the algorithm computes the distances between the centroids and other data points, assigning each data point to the cluster with the nearest centroid to minimize the sum of distances. This process is iterated to form all clusters. In the present study, the number of clusters, k, was predefined as 2, representing the AR and NAR groups.

Clustering technique-2: The Gaussian Mixture Model (GMM) technique operates similarly to k-means clustering (k-mc); however, it offers three distinct advantages. First, the GMM accounts for variance, which corresponds to the horizontal spread or width of the normal distribution curve, thereby providing a more nuanced representation of the data. Second, the GMM is well suited for non-circular datasets, where data points are not measured in circular units such as radians [17]. Third, the GMM employs a soft clustering approach, unlike the hard clustering approach of k-mc. In k-means, each data point is assigned exclusively to a single cluster, with no consideration for its likelihood or belonging to other plausible clusters. By contrast, GMM assigns probabilities to data points for inclusion in multiple clusters, enhancing its flexibility and robustness in cluster analysis [18].

Clustering technique-3: The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm classifies data points based on density variations, effectively distinguishing regions of higher density from those of lower density [19]. This algorithm is built on the foundational concepts of clusters and noise, requiring a minimum number of data points within a specified radius around each point to form a cluster. Unlike partition-based clustering methods, which generate spherical-shaped clusters and are highly sensitive to noise and outliers, DBSCAN excels in handling datasets with irregular cluster shapes and noise—a common characteristic of real-world data. DBSCAN relies on two critical parameters: (a) ε (epsilon), which defines the neighborhood radius, and (b) MinPts, the minimum number of points required to form a dense region referring to a cluster. The selection of these parameters significantly impacts clustering performance. If ε is set too large, data points may collapse into a single cluster, while an excessively small ε results in most points being categorized as outliers. To optimize ε, a k-distance graph can be employed to identify an appropriate threshold. The value of MinPts depends on the data dimensionality; for two-dimensional datasets, a typical choice is MinPts = 2 + 1 = 3 or higher. This parameter tuning is crucial for accurately capturing the underlying structure of the dataset.

Clustering technique-4: The Balance iterative reducing and clustering using hierarchies (BIRCH) algorithm segregates the data points into summaries which are then clustered instead of clustering the data points [20]. Hence, it works better on large datasets compared to others. Each summary possesses as much information of the summaries as necessary and that is why it is incorporated in this work. It works well with numerical data. Hence, categorical data values need to be transformed into numerical ones to run the BIRCH algorithm.

Clustering technique-5: The Affinity Propagation (AP) clustering algorithm identifies clusters based on the “affinity” or similarity between data points, selecting specific points, termed ‘exemplars’ to represent each cluster. Exemplars are determined through an iterative process that maximizes the overall similarity within clusters. Notably, akin to the DBSCAN algorithm, Affinity Propagation does not require prior specification of the number of clusters, offering flexibility in analyzing datasets with unknown cluster structures [20].

It is important to note that the work proposes designing, developing, and deploying a CDSS where clusters are chosen for grouping the sample into ‘AR’ and ‘NAR’; the maximally overlapping cluster is chosen as the best cluster. Other performance metrics, e.g., compactness and distances between a pair of clusters, have thus been disregarded.

In the following section, Figure 1 displays cluster plots followed by Figure 2 depicting clustering performance. The performance is measured on the percentage of samples, correctly clustered. The high-risk cluster information is then engineered to find the most common predictors of heart attack risk and finally, they are correlated to compute the correlation coefficients using Pearson’s correlation technique [7] (refer to Equation (1)) as a mechanism of exploratory data analytics (EDA),

r = \sum (x_{i} - x^{'}) (y_{i} - y^{'}) / \sqrt{\sum {(x_{i} - x^{'})}^{2} \sum {(y_{i} - y^{'})}^{2}}

(1)

In Equation (1), the parameter ‘r’ represents the correlation coefficient, quantifying the strength and direction of the linear relationship between two predictors. The coefficient r ranges from −1 to +1, where ‘positive’ values indicate direct correlation, ‘negative’ values represent an inverse correlation, and ‘zero’ denotes no correlation. The terms x_i and y_i correspond to the ith sample values (with i ranging from 1 to N), while x’ and y’ denote the mean values of the respective variables. Positive ‘r’ values, particularly those on the AR spectrum, are interpreted as significant predictor coefficients. The corresponding p-values < 0.05 indicate statistically significant correlations and ensure linearity in the model. These high ‘r’ values are then leveraged to form a linear regression equation [21] for heart attack risk scoring (Equation (2)) as cluster information is not necessarily easy to translate into numbers:

y = \sum x_{i} b_{i} + c

(2)

In this equation ‘y’ is the predicted output (the heart attack risk scores), ‘b_i’ is the ith coefficient value (where ‘i’ varies between 1 and N), and ‘c’ is the constant which is set at ‘0’ (straight line passing through the origin in a least square fit). It is important to note that ‘b’ values are replaced by high ‘r’ values in this work to compute the respective ‘y’. Here ‘x_i’ denotes the ith attribute for which ‘y’ is predicted for any given case.

Therefore, a hybrid ML model (clustering plus linear regressions) for heart attack risk scoring with EDA (Pearson’s correlation coefficients, i.e., the ‘r’ values) to understand the significant predictors of heart attack risk is proposed, developed, and deployed in this work. All coding was performed with Python 3.11.1 on a Windows 64-bit system with 8 GB RAM and Intel(R) Core (TM) i5-6300U CPU @ 2.40GHz.

3. Results

Descriptive statistics and data distributions for these features can be found in Table 2. It shows the description of the whole dataset, its central tendency, and its distribution.

Table 2 shows how each parameter is distributed and the central tendency of the sample. Here ‘Std’, ‘Min’, ‘Max’, ‘25%’, ‘50%’, and ‘75%’ refer to the standard deviation, minimum value, maximum value, first, second, and third quartile, respectively. Skewness and Kurtosis values, respectively, determine the skewed deviation of the distribution curves and the thickness of the tail of the curves, predictor-wise.

Below, the relevant performances of clustering techniques are shown and discussed. The performance metrics are (a) classification accuracy (Equation (3)) and (b) Silhouette score (Equation (4)). The score is a measure of the compactness of data points within a cluster and how well the clusters are separated from each other.

A = \frac{1}{N} \sum v a l (A R) + v a l (N A R)

(3)

In this equation, ‘N’ is the sample size, and ‘val(AR)’ and ‘val(NAR)’ refer to validated AR and NAR cases, respectively, which are accurately clustered.

S (i) = \frac{b (i) - a (i)}{m a x (a (i), b (i))}

(4)

Here, a(i) denotes the average distance between an observation and the data points within the same cluster, and b(i) refers to the average distance between an observation and other data points in the nearest cluster. In a good clustering algorithm, we expect the distances to be appreciable enough to separate the data points as well as the clusters.

The k-mc clustering technique (Silhouette score = 0.1782) returns the following result: Out of 165 samples under the AR cluster, 55 are correctly clustered, which is 33.33%; 78 samples are correctly clustered under NAR, which is 56.52% with an average accuracy of 44.92%. Hence, 55.08% are misclassified.

The GMM clustering technique (Silhouette score = 0.2623) shows the following result (Figure 1a): Out of 165 samples under the AR cluster, 139 are correctly clustered, which is 84.24%, and misclassified samples (15.76%) are removed from further study; 123 samples are correctly clustered under NAR (N = 138), which is 89.13% with an average accuracy of 86.68%. Hence, 13.53% are treated as outliers. Figure 1a shows the cluster plots, where green dots (cluster 1) refer to the AR group. It is also evident that the clusters are quite distinctly separated.

Figure 1. (a). GMM-based cluster plots. (b). Parametric study to find the best ‘ε’ value.

The DBSCAN clustering technique (Silhouette score = NA) shows the following result after parametrizing the most appropriate ε value by computing k-distances (see Figure 1b), where the ‘x’ and ‘y’-axes represent the data points and k-distances, respectively. The highlighted zone indicates ‘ε’ at 2.6, where the elbow is crooked at the first instance.

The parameter ‘MinPts’ is chosen as twice the number of data dimensions, as recommended in prior studies [22]. For the Kaggle heart attack risk dataset, which comprises 13 predictors, ‘MinPts’ is calculated as 13 × 2 = 26, corresponding to the 26 nearest neighbors. Subsequently, the distance to the (MinPts−1) nearest neighbors is computed for each data point within the dataset. These distances are then ranked in ascending order to generate a characteristic “elbow” plot [22]. The optimal value of ‘ε’ is shown along the y-axis value (Figure 1b). The evaluation is based on the mean of the five nearest neighbor distances at the most prominent inflection point of the elbow (marked zone in Figure 1b), which, in this case, is 2.6. With ‘ε’ and ‘MinPts’ values of 2.6 and 26, respectively, the DBSCAN algorithm is run to obtain the best clusters. In DBSCAN, all the data points are considered outliers. Hence, it is excluded from cluster comparisons.

The BIRCH clustering technique (Silhouette score = 0.1509) shows the following result: Out of 165 samples under the AR cluster, 40 are correctly clustered, which is 24.24%, while 80 samples are correctly clustered under NAR, which is 57.97% with an average accuracy of 41.1%. Hence, 58.89% are misclassified.

The Affinity propagation clustering algorithm (Silhouette score = 0.1738) shows the following results: Out of 165 samples under the AR cluster, 104 are correctly clustered, which is 63.03%, while 69 samples are correctly clustered under NAR, which is 50% with an average accuracy of 56.5%. Hence, 43.5% are misclassified.

A summary of the clustering performance can be seen in Table 3.

The key observations and advantages of clustering are as follows:

The GMM technique demonstrated superior performance based on both the accuracy and Silhouette score coefficient measures, identifying 84.24% of high-risk cases with a score of 0.2623. As a result, GMM-based clusters were selected for visualization and further study.
In clinical decision support systems, accurate identification of AR cases is critical, as failure to detect such cases could lead to severe consequences, including loss of life. Among the five clustering techniques evaluated, the GMM emerged as the most clinically reliable approach though its Silhouette score indicates some chances of misclassifications and thus overlapping between AR and NAR groups.
The cluster information derived from GMM provides significant predictors with high correlation coefficients (‘r’ values). These predictors were subsequently utilized to develop a linear regression model for calculating heart risk scores for individual cases.
Computed heart risk scores prioritize vulnerable patients, facilitating effective management and timely intervention strategies.

The results of Pearson’s correlation on the GMM-based high-risk cluster (see Table 4 and Figure 2):

From Table 4, it is evident that the predictors ‘slp’ and ‘thalachh’ have the maximum correlation value of 0.38, hence either ‘slp’ or ‘thalachh’ can be chosen for the model. Here, ‘thalachh’ has been chosen, and ‘slp’ is ignored as it has no other meaningful correlations with any of the remaining predictors, unlike ‘thalachh’, which also shows a high positive correlation with ‘cp’ (0.29). All the above ‘r’ values are now leveraged to form the linear regression equation (see Equation (2)).

Figure 2. Pearson’s correlation coefficient heatmap.

Since the ‘r’ values are not sufficiently high for independent validation, these are tested for statistical significance. The p-values fall much below 0.05 and hence the linearity among the variables is statistically significant.

A detailed hand calculation is provided below for Case 1, while the heart attack risk scores for the top three high-risk cases are summarized in Table 5.

(i): Predictor values, including Age (x₁), Sex (x₂), Chest Pain Type (x₃), …, and Thalassemia (x₁₂) are extracted from the Kaggle heart attack risk dataset.
(ii): Corresponding regression coefficients (r’ values) are obtained from Table 4.
(iii): Each predictor value is multiplied by its respective ‘r’ value (refer to Table 4).
(iv): The resulting products are added to calculate the predictors.
(v): A constant term, assumed to be zero in this analysis, is added to the computed sum, yielding the final heart attack risk score, in accordance with Equation (2).
(vi): For Case 1, the calculated risk score (y) is 90.89.

This procedure is systematically applied to all cases in the dataset to compute their respective heart attack risk scores.

In the AR cluster, heart risk scores are distributed as in Table 5. In this context, it is important to note that the remaining 13.53% of cases that have been misclassified by the GMM technique are also scored for heart attack risks. However, their values are less than the top three cases and therefore have not been presented in Table 5.

From the values of skewness and kurtosis, it is evident that the high-risk cases are symmetrically distributed as these values lie between [−2,2] and [−7,7] [23].

The individual heart attack risk score is computed using Equation (2) and showcased in Table 6.

To obtain a more comprehensive picture of the influence of each predictor on the top three AR cases, the value of each has been presented in Table 7 below. We note that the average variation in the risk scores using five-fold cross-validation is 0.0015, which is clinically negligible.

The values of predictors for Case No. 86, 29, and 97 are as follows (refer to Table 7):

Table 7. Distribution of high-risk predictors in the top three AR cases.

No	Predictors	Top Three at-Risk Cases
No	Predictors	86	29	97
1	Age	67	65	62
2	Sex	0	0	0
3	cp	2	2	0
4	trtbps	115	140	140
5	chol	564	417	394
6	fbs	0	1	0
7	restecg	0	0	0
8	thalachh	160	157	157
9	exng	0	0	0
10	oldpeak	1.6	0.8	1.2
11	caa	0	1	0
12	thall	3	2	2

4. Discussion

This section examines the rationale behind the high-risk scores computed for the aforementioned cases, emphasizing the plausible influence of key predictors. Table 6 establishes that females aged 62–67 years (mean age 64.66 years) with hypercholesterolemia (mean level 458.33 mg%), recorded a high maximum heart rate (‘thalachh’ with a mean of 158 beats per minute) and elevated values in the thallium stress test (‘thall’ with a mean of 2.33). This set is at the highest risk of heart attack for this population. Notably, these individuals exhibit no significant changes in resting ECG, no exercise-induced angina, and no apparent coronary vessel involvement or blockades. Furthermore, the ‘old peak’ parameter, indicating ST depression induced by exercise relative to rest, remains unaltered in these cases. These findings are clinically intriguing and warrant further investigation.

The objective of this study is to design, develop, and deploy a clinical decision support system and to statistically measure the performance of algorithms. The Silhouette score or Silhouette coefficient, measuring the inter-cluster distance and compactness of data points grouped inside any cluster for each algorithm, is evaluated using Equation 4 (refer to Table 3). Here, squared Euclidean distance is applied as the measure of distance. The Silhouette coefficient score varies between +1 and −1, where −1 refers to ‘misclassifications’ and +1 is considered the best classification, respectively. The score ‘0’ or close to ‘0’ indicates a chance of overlapping clusters. In the table, the Silhouette score DBSCAN method cannot be calculated as it failed to cluster the dataset into AR and NAR groups. The GMM, K-means, BIRCH, and Affinity Propagation algorithms show comparable scores, where the coefficient score of the GMM is the best. Clinically, the GMM can also cluster the data points more accurately when validated based on the class labels of the working dataset. However, the Silhouette score statistically indicates some overlapping of the classes and some degrees of dispersed data points, which need to be revisited. We note that any Silhouette score over 0.25 can be accepted as reasonable [24]. For this study, it is 0.2623, which is greater than the acceptable threshold. Hence, the chance of some degrees of overlapping and non-coherence among the data points within the clusters is accepted. Although the accuracy measure has been given more importance than the Silhouette coefficient score as the goal of any CDSS for clinical consistency, technically, the algorithm must also be sound. In this case, the GMM shows both clinical and technical soundness, which is the ideal condition.

The higher prevalence of heart attacks in postmenopausal women compared to men of a similar age is well documented [25] and as mentioned above, women with a mean age of 66.64 years usually fall into the postmenopausal group. This corroborates that the cardioprotective role of estrogen in premenopausal women is attributed to its ability to reduce oxidative stress by facilitating the production of antioxidative enzymes and clearing reactive oxygen species [26], which is affected in postmenopausal women. Elevated blood cholesterol levels increase heart attack risk by promoting atherosclerotic plaque formation and coronary vessel blockage [27]. However, postmenopausal women often experience heart attacks without significant vessel involvement [28], which could be linked to the lingering protective effects of estrogen on vascular endothelial cells. These effects may involve the clearance of deposit granules through autophagy, mediated by estrogen receptor-alpha activation [29]. A plausible mechanism for these heart attacks is coronary vasospasm rather than atherosclerotic blockades, as the loss of estrogen’s vasodilatory effect may trigger sudden coronary spasms in vulnerable cases [28,30].

Anxiety and stress, which are known to increase after menopause, contribute to elevated resting heart rates, frequent palpitations, and heightened heart attack risk in middle-aged and elderly women [31]. Between 15% and 50% of postmenopausal women experience anxiety disorders and depression [32], potentially due to the abrupt depletion of estrogen receptors in the brain [33]. This decline occurs more dramatically in cases of artificial menopause (e.g., surgical, chemical, or radiation-induced) compared to natural menopause [34]. The resulting mismatch between myocardial oxygen demand and supply may lead to early-stage heart muscle fatigue. However, many at-risk individuals initially present with normal stress test results and no detectable ECG abnormalities [30]. However, the decline of estrogen receptors and thus lower estrogen response in the body and mind (the physiological intrinsic factor) is still not conclusive as not all postmenopausal women experience similar degrees of anxiety and stress during menopause. We also feel that other intrinsic factors, such as genetic predispositions and family history are some important components behind this variation. The difference could also be related to biological variations in the rate of estrogen decline and/or influenced by various extrinsic modulators (i.e., environmental factors), such as unemployment or fear of job loss, severe illness, hospitalization and its respective financial and social consequences, loss of life, and so forth. Any or all of these may have played critical roles during the COVID-19 pandemic, with the backdrop of emotional stress and anxiety. Thus, the inclusion of the extrinsic stress factors as predictors could be appropriate to model heart attack risk in postmenopausal women. It is also worth noting that during the pandemic period, emergency economic policies were rolled out to curb the financial stress and anxiety in society [35,36]. The inclusion of these factors may provide a more holistic and balanced heart attack risk model in the context of COVID-19.

Applications of artificial intelligence (AI) and ML are being used to provide quality healthcare to a larger population in line with ‘health for all’. These technologies are used for predicting diseases and conditions with higher precision, typically complementing, or at times, superseding human intervention. This is an evolving field of research [37,38] with key inroads already made in the fields of pathology and radiology [39], as well as in oncology [40] research. Unsupervised machine learning techniques, particularly clustering, offer valuable tools for identifying patterns in unlabeled patient populations, as demonstrated in this study. This approach is especially beneficial in subjective clinical decision-making scenarios where patient conditions cannot be readily classified during early stages, such as in psychiatric and psychological disorders. Clustering methods have been widely applied to clinical studies based on patterns [20] and to evaluate clustering performance across domains [24]. Among these methods, Gaussian Mixture Model (GMM) clustering has distinct advantages over approaches such as k-means clustering. As a Gaussian probabilistic method, the GMM provides a reliable density estimation, accounts for uncertainty within groupings, yields the opportunity for a data point’s measurable inclination to be grouped into other clusters as well, and, as a result, allows for overlapping clusters. By assigning a probability distribution to each cluster, the GMM offers greater flexibility and granularity in cases where cluster boundaries are ambiguous [41], which is a natural occurrence with clinical and biological datasets. This clustering method is meaningfully versatile and has been effectively utilized in stratifying clinical data into high-risk and low-risk groups for heart attack [12] and population segmentation based on genomic datasets [42]. The unequivocal conclusion that can be drawn from this study using a hybrid machine learning method is that menopause in women is a critical physiological threshold aggravating the risk of heart attack. While this was not unknown previously, the precise dependence of heart attacks on menopause is an altogether new finding from this study, one that can be an efficient guideline in clinical decision making.

Major outcomes:

(a): We introduce unsupervised learning techniques, specifically clustering methods, to effectively analyze patterns in “unlabeled” data on heart attack risk, enabling the identification of high-risk subpopulations. The aim is to design, develop, and deploy a CDSS on heart risk data.
(b): The study identifies the optimal clustering technique by accurately grouping similar members within clusters while excluding dissimilar members, ensuring robust categorization. In this work, the GMM stands superior to its competitors.
(c): Cross-correlation coefficients (‘r’ values) are computed for the features of the high-risk clusters to quantify and validate the strength of their positive relationships and these are statistically significant.
(d): High correlation values (‘r’ values) are utilized as coefficients in the construction of a linear regression model (refer to Equation (2)), where the predictors correspond to the identified variables.
(e): Using the regression model, heart attack risk scores are computed on a case-by-case basis, providing individualized risk assessment.
(f): The distributions of the predictors are explained, and the characteristics of high-scoring cases are thoroughly discussed to provide deeper insights into the risk factors.
(g): Computed risk scores inform targeted care management strategies aimed at preventing heart attacks within vulnerable populations. The lateral outcome of this study is the extraction of importance scores associated with all risk factors. For a practicing clinician, this could provide precise leads, e.g., how a 10% increase in cholesterol for males in the age range of 50–55 could exacerbate heart attack risk compared to the age range of 60–65 for females.

Limitations of the study:

(a): The sample size (N = 303) is small.
(b): Not all features, such as lifestyle-related predictors, e.g., tobacco smoking, sedentary lifestyle, alcoholism, and other substance abuse, etc., could be considered due to the lack of appropriate data, but within the remit of clean data, machine learning has made groundbreaking contributions to cardiovascular research [43].
(c): Risk validation by medical doctors must involve data from ‘real-world’ cardiac risk datasets, which opens the scope of future work.
(d): The study does not establish any direct link between heart health, the COVID-19 pandemic, and vaccination effects. Plausible associations are assumed based on the published documents.
(e): The model could be more robust if other extrinsic and intrinsic factors could be associated.
(f): Although the GMM can cluster the dataset into AR and NAR classes most accurately, its Silhouette coefficient value (0.2623) indicates some overlapping between the classes. This needs to be revisited.

5. Conclusions

This study aimed to identify high-risk heart attack cases from the publicly available Kaggle heart attack risk dataset using five clustering techniques and to evaluate their comparative performance. Amongst these, the GMM demonstrated superior clustering accuracy, identifying 84.24% of high-risk cases, with an overall accuracy rate of 86.68%. Although, technically, there was some overlapping between the clusters, accuracy in classifying has been given priority from the clinical perspective. However, such a clinical–technical discrepancy may warrant a rethink in the future design, development, and deployment of any CDSS using clustering algorithms. The analysis further assessed heart attack risk by calculating Pearson’s correlation coefficients (‘r’ values) among the predictors for each cluster. The correlation values were statistically significant. Notably, the maximum heart rate achieved post-exercise (‘thalachh’) and the slope of the ECG (‘slp’) exhibited the strongest positive correlation (r = 0.38). These high ‘r’ values were incorporated into a multiple linear regression model, where they served as the predictors whose coefficients were computed to fit information into Equation (2).

The findings reveal that middle-aged and elderly women (falling supposedly into the postmenopausal group, at least physiologically speaking) are at the highest risk of heart attack, potentially due to the loss of estrogen, which has a ‘cardioprotective’ as well as ‘vasoprotective’ effect. Based on this result, the study highlights the importance of increased vigilance in identifying menopausal and perimenopausal women as high-risk groups for heart attacks, paralleling the focus and emphasis that is typically put on men. Estrogen could be the key player behind the elevated risk of heart attacks in postmenopausal females as stated above. Also, extrinsic factors, e.g., adverse socioeconomic situations during the pandemic, relevant financial policies, and emergency measures encumbering all these elements, may contribute to the alleviation of socioeconomic stress and need to be considered for a more robust model of heart attacks.

Author Contributions

Conceptualization, S.C.; methodology, S.C. and A.K.C.; software, S.C.; validation, S.C. and A.K.C.; formal analysis, S.C.; investigation, S.C.; resources, S.C.; data curation, A.K.C.; writing—original draft preparation, S.C.; writing—review and editing, A.K.C.; visualization, S.C.; supervision, S.C. and A.K.C.; project administration, S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

The study used freely available (open access) Kaggle data that can be accessed from [13]. Data for the plots and tables can be made available by the corresponding author on request.

Acknowledgments

S.C. acknowledges the open access data archive of Kaggle.

Conflicts of Interest

Authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial intelligence
AR	At risk
CDSS	Clinical decision support system
COVID	Corona virus disease
CVD	Cardiovascular disease
GMM	Gaussian Mixture Model
HRV	Heart rate variability
ML	Machine learning
NAR	Not at risk

References

“Cardiovascular Diseases”, World Health Organization, 26 June 2023. Available online: https://www.who.int/health-topics/cardiovascular-diseases (accessed on 19 January 2024).
Marimuthu, Y.; Nair, G.C.; Nagesh, U.; Anand, A.; Chopra, K.K.; Nagappa, B.; Sharma, N.; Sivashankar, G.; Nagaraj, N. Prevalence of probable post-COVID cardiac sequelae and its health seeking behaviour among health care workers: A cross-sectional analytical study. Indian J. Tuberc. 2023; online ahead of print. [Google Scholar] [CrossRef] [PubMed]
AAleksova, A.; Fluca, A.L.; Gagno, G.; Pierri, A.; Padoan, L.; Derin, A.; Moretti, R.; Noveska, E.A.; Azzalini, E.; D’Errico, S.; et al. Gagno and Long-term effect of SARS-CoV-2 infection on cardiovascular outcomes and all-cause mortality. Life Sci. 2022, 10, 121018. [Google Scholar] [CrossRef] [PubMed]
Xie, Y.; Xu, E.; Bowe, B.; Al-Aly, Z. Long-term cardiovascular outcomes of COVID-19. Nat. Med. 2022, 28, 583–590. [Google Scholar] [CrossRef] [PubMed]
Núñez-Gil, I.J.; Feltes, G.; Viana-Llamas, M.C.; Raposeiras-Roubin, S.; Romero, R.; Alfonso-Rodríguez, E.; Uribarri, A.; Santoro, F.; Becerra-Muñoz, V.; Pepe, M.; et al. HOPE-2 Investigators. Post-COVID-19 Symptoms and Heart Disease: Incidence, Prognostic Factors, Outcomes and Vaccination: Results from a Multi-Center International Prospective Registry (HOPE 2). J. Clin. Med. 2023, 12, 706. [Google Scholar] [CrossRef]
Shah, A.S.; Vaccarino, V.; Moazzami, K.; Almuwaqqat, Z.; Garcia, M.; Ward, L.; Elon, L.; Ko, Y.-A.; Sun, Y.V.; Pearce, B.D.; et al. Autonomic reactivity to mental stress is associated with cardiovascular mortality. Eur. Heart J. Open 2024, 4, oeae086. [Google Scholar] [CrossRef]
Pérez-Alcalde, A.I.; Galán-del-Río, F.; Fernández-Rodríguez, F.J.; de la Plaza San Frutos, M.; García-Arrabé, M.; Giménez, M.-J.; Ruiz-Ruiz, B. The Effects of a Single Vagus Nerve’s Neurodynamics on Heart Rate Variability in Chronic Stress: A Randomized Controlled Trial. Sensors 2024, 24, 6874. [Google Scholar] [CrossRef]
Shaffer, F.; Ginsberg, J.P. An Overview of Heart Rate Variability Metrics and Norms. Front. Public Health 2017, 5, 258. [Google Scholar] [CrossRef]
Chattopadhyay, S. The Importance of Time-domain HRV Analysis in Cardiac Health Predictions. Ser. Cardiol. Res. 2022, 4, 19–23. [Google Scholar] [CrossRef]
Sofogianni, A.; Stalikas, N.; Antza, C.; Tziomalos, K. Cardiovascular Risk Prediction Models and Scores in the Era of Personalized Medicine. J. Pers. Med. 2022, 12, 1180. [Google Scholar] [CrossRef]
Satapathy, S.; Chattopadhyay, S. Observation-Prevention Framework of Cardiac Risk Factors: An Indian Study. J. Med. Imaging Health Inform. 2012, 2, 102–113. [Google Scholar] [CrossRef]
Chattopadhyay, S. MLMI: A Machine Learning Model for Estimating Risk of Myocardial Infarction. Artif. Intell. Evol. 2024, 5, 11–23. [Google Scholar]
Manchanda, N. “Heart Attack-EDA+Prediction (90% Accuracy)”, Kaggle. Available online: https://www.kaggle.com/code/namanmanchanda/heart-attack-eda-prediction-90-accuracy (accessed on 19 January 2024).
Cronbach, L.J. Coefficient alpha and the internal structure of tests. Psychometrika 1951, 16, 297–334. [Google Scholar]
Shapiro, S.S.; Wilk, M.B. An analysis of variance test for normality (complete samples). Biometrika 1965, 52, 591–611. [Google Scholar] [CrossRef]
K-Means Clustering Algorithm, IBM. Available online: https://www.ibm.com/think/topics/k-means-clustering (accessed on 21 January 2024).
Maklin, C. “Gaussian Mixture Models Clustering Algorithm Explained”, Towards Data Science. Available online: https://medium.com/data-science/gaussian-mixture-models-d13a5e915c8e (accessed on 21 January 2024).
Cremers, J.; Klugkist, I. One Direction? A Tutorial for Circular Data Analysis Using R with Examples in Cognitive Psychology. Front. Psychol. 2018, 9, 396689. [Google Scholar]
“DBSCAN Clustering in ML|Density Based Clustering”, Geeksforgeeks. Available online: https://www.geeksforgeeks.org/dbscan-clustering-in-ml-density-based-clustering/ (accessed on 21 January 2024).
McGregor, M. “8 Clustering Algorithms in Machine Learning that All Data Scientists Should Know”, Freecodecamp, 21 September 2020. Available online: https://www.freecodecamp.org/news/8-clustering-algorithms-in-machine-learning-that-all-data-scientists-should-know/ (accessed on 21 January 2024).
Chattopadhyay, A.K.; Chattopadhyay, S. VIRDOCD: A VIRtual DOCtor to Predict Dengue Fatality. Expert Syst. 2021, 39, e12796. [Google Scholar]
Maklin, C. “DBSCAN Python Example: The Optimal Value for Epsilon (EPS)”, Towards Data Science. Available online: https://www.coryjmaklin.com/2019-06-30_DBSCAN-Python-Example--The-Optimal-Value-For-Epsilon--EPS--3100091cfbc/ (accessed on 23 March 2025).
Byrne, B.M. Strucrtural Equation Modeling with AMOS: Basic Concepts, Applications, and Programming; Routledge: New York, NY, USA, 2010. [Google Scholar]
Benaya, R.; Sibaroni, Y.; Ihsan, A.F. Clustering Content Types and User Roles Based on Tweet Text Using K-Medoids Partitioning Based. J. Comput. Syst. Inform. 2023, 4, 749–756. [Google Scholar] [CrossRef]
Postmenopausal Women at Higher Risk of Heart Attacks Than Men Their Age”, Inside Precision Medicine, 11 May 2023. Available online: https://www.insideprecisionmedicine.com/news-and-features/postmenopausal-women-at-higher-risk-of-heart-attacks-than-men-their-age/ (accessed on 31 January 2024).
Pignatelli, P.; Menichelli, D.; Pastori, D.; Violi, F. Oxidative stress and cardiovascular disease: New insights. Kardiol. Pol. 2018, 76, 713–722. [Google Scholar] [CrossRef]
Lu, Y.; Cui, X.; Zhang, L.; Wang, X.; Xu, Y.; Qin, Z.; Liu, G.; Wang, Q.; Tian, K.; Lim, K.S.; et al. The Functional Role of Lipoproteins in Atherosclerosis: Novel Directions for Diagnosis and Targeting Therapy. Aging Dis. 2022, 13, 491–520. [Google Scholar]
“Uncommon Heart Attack, Found More Often in Women, Needs a Second Look”, American Heart Association News, 28 June 2018. Available online: https://www.heart.org/en/news/2018/07/13/uncommon-heart-attack-found-more-often-in-women-needs-a-second-look (accessed on 31 January 2024).
Meng, Q.; Li, Y.; Ji, T.; Choo, Y.; Li, J.; Fu, Y.; Wong, S.; Chen, O.; Chen, W.; Huang, F.; et al. Estrogen prevent atherosclerosis by attenuating endothelial cell pyroptosis via activation of estrogen receptor α-mediated autophagy. J. Adv. Res. 2021, 28, 149–164. [Google Scholar]
Perez, I.E.; Menegus, M.A.; Taub, C.C. Myocardial Infarction in a Premenopausal Woman on Leuprolide Therapy. Case Rep. Med. 2015, 2015, 390642. [Google Scholar]
“High Resting Heart Rate Predicts Heart Risk in Women at Midlife”, Harvard Health Publishing, 14 May 2022. Available online: https://www.health.harvard.edu/heart-health/in-the-journals-high-resting-heart-rate-predicts-heart-risk-in-women-at-midlife (accessed on 31 January 2024).
Huang, S.; Wang, Z.; Zheng, D.; Liu, L. Anxiety disorder in menopausal women and the intervention efficacy of mindfulness-based stress reduction. Am. J. Transl. Res. 2023, 15, 2016–2024. [Google Scholar]
Renczés, E.; Borbélyová, V.; Steinhardt, M.; Höpfner, T.; Stehle, T.; Ostatníková, D.; Celec, P. The Role of Estrogen in Anxiety-Like Behavior and Memory of Middle-Aged Female Rats. Front. Endocrinol. 2020, 11, 570560. [Google Scholar]
Ko, S.H.; Kim, H.S. Menopause-associated lipid metabolic disorders and foods beneficial for postmenopausal women. Nutrients 2020, 12, 202. [Google Scholar] [CrossRef] [PubMed]
Cortes, G.S.; Gao, G.P.; Silva FB, G.; Song, Z. Unconventional Monetary Policy and Disaster Risk: Evidence from the Subprime and COVID-19 Crises. J. Int. Money Financ. 2022, 122, 102543. [Google Scholar]
Campello, M.; Kankanhalli, G.; Muthukrishnan, P. Corporate Hiring Under COVID-19: Financial Con-straints and the Nature of New Jobs. J. Financ. Quant. Anal. 2024, 59, 1541–1585. [Google Scholar] [CrossRef]
Topol, E.J. High-performance medicine: The convergence of human and artificial intelligence. Nat. Med. 2019, 25, 44–56. [Google Scholar] [CrossRef]
Topol, E.J. As artificial intelligence goes multimodal, medical applications multiply. Science 2023, 381, eadk6139. [Google Scholar] [CrossRef]
Jha, S.; Topol, E.J. Adapting to Artificial Intelligence: Radiologists and Pathologists as Information Specialists. JAMA 2016, 316, 2353–2354. [Google Scholar] [CrossRef]
Topol, E.J.; Douglas, F. The State of Artificial Intelligence in Precision Oncology: An Interview with Eric Topol. AI Precis. Oncol. 2024, 1, 3–10. [Google Scholar]
Beheshti, A. “Unsupervised Learning: Clustering Using Gaussian Mixture Model (GMM)”, Beheshti.medium, 11 March 2023. Available online: https://behesht.medium.com/unsupervised-learning-clustering-using-gaussian-mixture-model-gmm-c788b280932b#:~:text=One%20of%20the%20advantages%20of,GMM%20allows%20for%20overlapping%20clusters (accessed on 3 February 2024).
Budiarto, A.; Mahesworo, B.; Hidayat, A.A.; Nurlaila, I.; Pardamean, B. Gaussian Mixture Model Implementation for Population Stratification Estimation from Genomics Data. Procedia Comput. Sci. 2021, 179, 202–210. [Google Scholar]
He, X.; Matam, R.; Bellary, S.; Ghosh, G.; Chattopadhyay, A.K. CHD Risk Minimization through Lifestyle Control: Machine Learning Gateway. Sci. Rep. 2020, 10, 4090. [Google Scholar]

Table 1. A sample dataset with target and attributes (N = 10).

No.	Age	Sex	cp	trtbps	chol	fbs	restecg	thalachh	exng	oldpeak	slp	thall	Output
1.	63	1	3	145	233	1	0	150	0	2.3	0	1	1
2.	37	1	2	130	250	0	1	187	0	3.5	0	2	1
3.	41	0	1	130	204	0	0	172	0	1.4	2	2	1
4.	56	1	1	120	236	0	1	178	0	0.8	2	2	1
5.	57	0	0	120	354	0	1	163	1	0.6	2	2	1
6.	57	1	0	140	192	0	1	148	0	0.4	1	1	1
7.	56	0	1	140	294	0	0	153	0	1.3	1	2	1
8.	44	1	1	120	263	0	1	173	0	0	2	3	1
9.	52	1	2	172	199	1	1	162	0	0.5	2	3	1
10.	57	1	2	150	168	0	1	174	0	1.6	2	2	1

Table 2. Descriptive statistics of all variables (count 303, M: 207, F: 96).

Parameter	Age	cp	trtbps	chol	fbs	restecg	thalachh	exng	oldpeak	slp	caa	thall
Mean	54.3	0.68	0.96	131.62	246.26	0.14	0.52	149.64	0.32	1.03	1.39	0.72
Std	9.08	0.46	1.03	17.53	51.83	0.35	0.52	22.90	0.46	1.16	0.61	1.02
Min	29	0	0	94	126	0	0	71	0	0	0	0
25%	47.5	0	0	120	211	0	0	133.5	0	0	1	0
50%	55	1	1	130	240	0	1	153	0	0.8	1	0
75%	61	1	2	140	274.5	0	1	166	1	1.6	2	1
Max	77	1	3	200	564	1	2	202	1	6.2	2	4
Skewness	−0.2	−0.8	0.5	0.7	1.14	2.0	0.16	−0.5	0.7	1.26	−0.5	1.3
Kurtosis	−0.5	−1.4	−1.2	0.9	4.5	2.0	−1.36	−0.06	−1.46	1.58	−0.63	0.84

Table 3. Summary of the clustering performance (average).

No.	Technique	Correctly Classified AR Cases (%)	Silhouette Score
1	k-mc	33.33	0.1782
2	GMM	84.24	0.2623
3	DBSCAN	NA	NA
4	BIRCH	24.24	0.1509
5	Affinity propagation	63.03	0.1738

Table 4. Significant coefficients (the highest ‘r’ values).

No.	Predictor 1	Predictor 2	r	p-Value
1	Age	trtbps	0.28	<0.05
2	Sex	Thall	0.21
3	cp	thalachh	0.30
4	trtbps	oldpeak	0.19
5	chol	Age	0.21
6	fbs	caa	0.14
8	thalachh	slp	0.39
9	exng	oldpeak	0.29
10	oldpeak	caa	0.22
11	caa	Age	0.27
12	thall	oldpeak	0.21

Table 5. Distribution of heart risk scores across high-risk cluster of GMM technique.

Parameter	Value
Count	139
Min	90.89
Max	168.12
Mean	121.17
Median	122.02
Standard deviation	10.97
Skewness	0.45
Kurtosis	2.27

Table 6. Top three cases with predicted at risk scores.

No.	Case Number	Heart Risk Score
1	86	168.12
3	29	150.5
5	97	145.73

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chattopadhyay, S.; Chattopadhyay, A.K. Identifying Heart Attack Risk in Vulnerable Population: A Machine Learning Approach. Information 2025, 16, 265. https://doi.org/10.3390/info16040265

AMA Style

Chattopadhyay S, Chattopadhyay AK. Identifying Heart Attack Risk in Vulnerable Population: A Machine Learning Approach. Information. 2025; 16(4):265. https://doi.org/10.3390/info16040265

Chicago/Turabian Style

Chattopadhyay, Subhagata, and Amit K Chattopadhyay. 2025. "Identifying Heart Attack Risk in Vulnerable Population: A Machine Learning Approach" Information 16, no. 4: 265. https://doi.org/10.3390/info16040265

APA Style

Chattopadhyay, S., & Chattopadhyay, A. K. (2025). Identifying Heart Attack Risk in Vulnerable Population: A Machine Learning Approach. Information, 16(4), 265. https://doi.org/10.3390/info16040265

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Identifying Heart Attack Risk in Vulnerable Population: A Machine Learning Approach

Abstract

1. Introduction

2. Materials and Methods

3. Results

4. Discussion

5. Conclusions

Author Contributions

Funding

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

No.	Age	Sex	cp	trtbps	chol	fbs	restecg	thalachh	exng	oldpeak	slp	thall	Output
1.	63	1	3	145	233	1	0	150	0	2.3	0	1	1
2.	37	1	2	130	250	0	1	187	0	3.5	0	2	1
3.	41	0	1	130	204	0	0	172	0	1.4	2	2	1
4.	56	1	1	120	236	0	1	178	0	0.8	2	2	1
5.	57	0	0	120	354	0	1	163	1	0.6	2	2	1
6.	57	1	0	140	192	0	1	148	0	0.4	1	1	1
7.	56	0	1	140	294	0	0	153	0	1.3	1	2	1
8.	44	1	1	120	263	0	1	173	0	0	2	3	1
9.	52	1	2	172	199	1	1	162	0	0.5	2	3	1
10.	57	1	2	150	168	0	1	174	0	1.6	2	2	1

No.	Age	Sex	cp	trtbps	chol	fbs	restecg	thalachh	exng	oldpeak	slp	thall	Output
1.	63	1	3	145	233	1	0	150	0	2.3	0	1	1
2.	37	1	2	130	250	0	1	187	0	3.5	0	2	1
3.	41	0	1	130	204	0	0	172	0	1.4	2	2	1
4.	56	1	1	120	236	0	1	178	0	0.8	2	2	1
5.	57	0	0	120	354	0	1	163	1	0.6	2	2	1
6.	57	1	0	140	192	0	1	148	0	0.4	1	1	1
7.	56	0	1	140	294	0	0	153	0	1.3	1	2	1
8.	44	1	1	120	263	0	1	173	0	0	2	3	1
9.	52	1	2	172	199	1	1	162	0	0.5	2	3	1
10.	57	1	2	150	168	0	1	174	0	1.6	2	2	1

No.	Age	Sex	cp	trtbps	chol	fbs	restecg	thalachh	exng	oldpeak	slp	thall	Output
1.	63	1	3	145	233	1	0	150	0	2.3	0	1	1
2.	37	1	2	130	250	0	1	187	0	3.5	0	2	1
3.	41	0	1	130	204	0	0	172	0	1.4	2	2	1
4.	56	1	1	120	236	0	1	178	0	0.8	2	2	1
5.	57	0	0	120	354	0	1	163	1	0.6	2	2	1
6.	57	1	0	140	192	0	1	148	0	0.4	1	1	1
7.	56	0	1	140	294	0	0	153	0	1.3	1	2	1
8.	44	1	1	120	263	0	1	173	0	0	2	3	1
9.	52	1	2	172	199	1	1	162	0	0.5	2	3	1
10.	57	1	2	150	168	0	1	174	0	1.6	2	2	1