Leveraging Machine Learning Techniques to Predict Cardiovascular Heart Disease

Başar, Remzi; Ocak, Öznur; Erturk, Alper; de la Roche, Marcelle

doi:10.3390/info16080639

by Remzi Başar ¹, Öznur Ocak ², Alper Erturk ³,* and Marcelle de la Roche ⁴

¹ Department of MIS, Faculty of Business Administration, Duzce University, Düzce 81620, Türkiye

² Department of MIS, Institute of Graduate Studies, Duzce University, Düzce 81620, Türkiye

³ Department of Management, College of Business, Australian University, Kuwait 32093, Kuwait

⁴ Department of Marketing and Events Management, College of Business, Australian University, Kuwait 32093, Kuwait

* Author to whom correspondence should be addressed.

Information 2025, 16(8), 639; https://doi.org/10.3390/info16080639

Submission received: 10 May 2025 / Revised: 4 July 2025 / Accepted: 24 July 2025 / Published: 27 July 2025

(This article belongs to the Special Issue Information Systems in Healthcare)

Abstract

Cardiovascular diseases (CVDs) remain the leading cause of death globally, underscoring the urgent need for data-driven early diagnostic tools. This study proposes a multilayer artificial neural network (ANN) model for heart disease prediction, developed using a real-world clinical dataset comprising 13,981 patient records. Implemented on the Orange data mining platform, the ANN was trained using backpropagation and validated through 10-fold cross-validation. Dimensionality reduction via principal component analysis (PCA) enhanced computational efficiency, while Shapley additive explanations (SHAP) were used to interpret model outputs. Despite achieving 83.4% accuracy and high specificity, the model exhibited poor sensitivity to disease cases, identifying only 76 of 2233 positive samples, with a Matthews correlation coefficient (MCC) of 0.058. Comparative benchmarks showed that random forest and support vector machines significantly outperformed the ANN in terms of discrimination (AUC up to 91.6%). SHAP analysis revealed serum creatinine, diabetes, and hemoglobin levels to be the dominant predictors. To address the current study’s limitations, future work will explore LIME, Grad-CAM, and ensemble techniques like XGBoost to improve interpretability and balance. This research emphasizes the importance of explainability, data representativeness, and robust evaluation in the development of clinically reliable AI tools for heart disease detection.

Keywords: heart disease prediction; artificial neural network (ANN); machine learning; SHAP; medical diagnostics; ensemble learning; class imbalance

1. Introduction

Cardiovascular diseases (CVDs) are the leading cause of mortality globally, accounting for approximately 17.9 million deaths annually, as reported by the World Health Organization (WHO) [1]. In Türkiye, these conditions contribute to 86 percent of all deaths, with one in every two fatalities linked to cardiovascular complications. Notably, around 80 percent of premature deaths from such diseases are considered preventable through early diagnosis and timely intervention [2]. These alarming statistics emphasize the critical need for proactive, data-driven methods to identify heart disease risks in the early stages, when treatment outcomes are most favorable. Early identification and appropriate treatment therefore play a critical role in reducing the mortality attributable to CVD [3].

The integration of artificial intelligence (AI), particularly machine learning (ML), into healthcare has opened new avenues for disease prediction and clinical decision support [4]. Advances in data collection and computational power have enabled the development of predictive models that are capable of analyzing complex patterns in patient records [5]. These models, trained on historical clinical data, are increasingly being used to identify disease risks before symptoms become clinically apparent. However, the effectiveness of these systems often depends not only on prediction accuracy but also on their transparency and generalizability across different datasets and populations [6].

In this study, a multilayer artificial neural network (ANN) model is proposed to predict heart disease based on clinical attributes, using the Orange data mining platform. Orange provides a code-free environment that is conducive to reproducible, visual workflows, making it accessible for non-programmers in the healthcare domain [7]. To enhance interpretability, Shapley additive explanations (SHAP) are applied to evaluate the contribution of each feature to the model’s predictions. The use of SHAP aligns with the growing demand for explainable AI, especially in high-stakes fields such as medicine, where understanding the rationale behind predictions is as crucial as the predictions themselves [8].

The use of artificial intelligence and machine learning techniques has increased significantly across many industries, and their use in medicine, specifically in predicting heart-related diseases, has also grown substantially in the last decade [9,10,11]. Despite substantial research on ML-based heart disease prediction, there is a significant lack of novelty in many current approaches. A review of the literature reveals that ANNs, support vector machines (SVMs), decision trees, and random forest algorithms have all been extensively used to predict heart disease [6]. For example, studies using the Cleveland dataset, a frequently employed benchmark comprising 303 samples, have reported accuracy rates ranging from 83% to 95% [12]. In particular, Bouqentar et al. [13] achieved 95% accuracy using MATLAB-based (Version R2024a) ANN models, while others reported up to 99% using random forests [14]. Convolutional neural networks (CNNs) trained on large ECG datasets have also demonstrated strong performance, with a few studies reporting over 94% accuracy [15].

In a recent study, Talin et al. [16] combined classical statistical techniques with machine learning to identify the key predictors of heart disease, highlighting chest pain, the number of major blood vessels involved, and thalassemia as the most significant clinical features. By integrating SHAP values and Borda count ranking across multiple ML models, the study emphasized the potential of combining interpretability with statistical robustness to enhance diagnostic precision [16]. In addition, recent developments in explainable artificial intelligence (XAI) have addressed the growing concern over the opaque nature of complex machine learning models, particularly in high-stakes fields like healthcare. A comprehensive survey has categorized interpretability methods and provided practical guidance for their implementation, reinforcing the importance of transparency in clinical decision-making systems [17]. CNNs combined with deep SHAP have also been applied to enhance both predictive performance and interpretability in heart disease diagnosis. Achieving high sensitivity (0.97) and an F1 score of 0.86 for the disease class, one such study demonstrated the potential of deep learning when paired with explainability tools to provide transparent insights into model decisions [18].

Moreover, a recent study employed advanced ensemble and boosting techniques, including gradient boosting, voting, and stacking classifiers, achieving predictive accuracies exceeding 91% in heart failure mortality prediction. By incorporating SHAP analysis and addressing class imbalance through SMOTE and bootstrapping, the research highlighted features such as time, serum creatinine, and ejection fraction as critical predictors [19]. These findings further underscore the growing relevance of interpretable machine learning models in enhancing clinical decision-making and align with the present study’s emphasis on explainability and real-world applicability [20].

However, these studies often rely on small, homogenous, and highly curated datasets that may not reflect real-world clinical variability. Many fail to address the class imbalance that naturally occurs in clinical settings, where the number of healthy individuals vastly outnumbers patients [21]. Additionally, model interpretability is frequently overlooked. Most high-accuracy studies employ black-box models that do not provide insights into feature importance or model reasoning [22]. This lack of transparency limits clinical adoption and raises concerns about trust and accountability. Another common shortcoming is the absence of statistical validation to determine whether performance differences across models are significant or merely incidental. Very few studies apply hypothesis-testing methods, such as paired t-tests or non-parametric Wilcoxon signed-rank tests, to rigorously compare model outputs [23].

Furthermore, the published literature lacks substantial exploration of hybrid or ensemble approaches that combine different algorithms or integrate domain-specific feature engineering. Although some works attempt basic model comparisons, few investigate how model performance can be enhanced through novel architecture designs or advanced data preprocessing techniques [24]. The general trend has been toward algorithm substitution rather than methodological innovation.

To address these limitations, this study takes a comprehensive and practical approach. The model is trained on a real-world clinical dataset of 63,199 samples, later refined to 13,981 after cleaning, which is significantly larger and more representative than other commonly used datasets. The application of SHAP ensures that the model is not a black box but offers interpretable, data-driven insights into the most influential predictors of heart disease [25]. By employing principal component analysis (PCA) for dimensionality reduction and validating the results with 10-fold cross-validation, the study enhances both model robustness and reproducibility. Unlike previous works that focus exclusively on accuracy metrics, this study includes additional performance measures such as the Matthews correlation coefficient (MCC), which is particularly useful for evaluating models on imbalanced datasets.

Although the reported accuracy of 83.4% is lower than that reported in some other studies, the use of a significantly larger, unbalanced dataset and the inclusion of explainability methods distinguish this work methodologically. Future directions could include the development of hybrid architectures that combine neural networks with ensemble techniques, as well as comparisons with deep learning models such as CNNs, LSTMs, and transformers. Through these enhancements, this study aims not only to improve prediction accuracy but also to contribute to the development of transparent, generalizable, and clinically relevant AI systems for heart disease diagnosis.

2. Materials and Methods

This study aimed to develop and interpret a predictive model for heart disease using machine learning techniques, particularly an ANN. The dataset, preprocessing strategies, modeling choices, and performance evaluation methods are described in this section with enhanced transparency and methodological depth.

Although this study is based solely on structured and retrospectively collected numerical data, such as blood test parameters, and does not incorporate any physical hardware setup, the integration or comparison of such AI-based models with hardware-driven diagnostic systems—such as electrocardiograms (ECGs), wearable health monitors, or imaging technologies—could offer a more comprehensive understanding of cardiovascular disease detection in future studies.

2.1. Dataset Description

The dataset was acquired from a public university’s Faculty of Medicine Research and Application Hospital in Türkiye. The dataset included the following patient features, collected prior to principal component analysis (PCA) transformation: the presence of anemia (binary: 1 = yes, 0 = no), systolic/diastolic blood pressure (mmHg), platelet (PLT) count (10⁹/L), diabetes status (binary: 1 = yes, 0 = no), hemoglobin (HGB) (g/dL), serum creatinine (mg/dL), creatine kinase (CK) (U/L), sodium level (mEq/L), ejection fraction (%), serum urea (mg/dL), smoking status (binary), and follow-up duration (in days), along with some descriptive data, such as age (in years) and gender (male/female). Data were collected retrospectively, and these features were selected based on their clinical relevance and data availability for all patients. The dataset initially contained 63,199 records from heart disease patients and healthy individuals. After reviewing the data, records with missing or inconsistent values were identified. Instead of imputing values, any record containing a null value in any column was excluded entirely, resulting in a dataset of 13,981 complete records. This exclusion provided clean inputs for training, but it carries a risk of bias, because certain patient profiles may have been disproportionately removed. This limitation is acknowledged, and future versions will use more sophisticated imputation methods, such as KNN imputation, to reduce the loss of samples and increase representativeness.

2.2. Feature Engineering and Selection

To optimize model performance and reduce dimensionality, feature selection and transformation were applied. PCA was initially used to extract latent features and minimize redundancy. Additionally, feature importance scores were derived using SHAP (Shapley additive explanations) values post-training to evaluate the relative contribution of each variable. In future work, statistical feature selection techniques such as recursive feature elimination (RFE), mutual information, and chi-square tests will be employed pre-training to guide model input curation. Distinctions between clinical and demographic variables will also be analyzed to assess domain-specific impacts on predictive accuracy.
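The dimensionality-reduction idea can be illustrated with a minimal sketch. It assumes power iteration on the sample covariance matrix, chosen here only for brevity; the source does not describe how Orange computes the components, and the function name and toy data are hypothetical.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return (direction, scores) for the first principal component.

    Uses power iteration on the sample covariance matrix. The all-ones
    start vector is an arbitrary choice and fails only in the rare case
    that it is exactly orthogonal to the leading eigenvector.
    """
    n, d = len(rows), len(rows[0])
    # Center every feature on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered row onto the component direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Toy data lying on the line y = 2x (illustrative values, not study data).
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(rows)
```

For these perfectly collinear points, a single component captures all of the variance, which is the redundancy that PCA is meant to remove.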

2.3. Artificial Intelligence and Neural Network Classifier Design

Artificial intelligence aims to replicate human cognitive functions such as decision-making, perception, and learning. Among the most common AI techniques used in health informatics are ANNs, which simulate biological neurons using interconnected weighted nodes. These models are particularly suited to pattern recognition tasks, such as disease classification, where the relationships between variables are complex and non-linear.

ANNs consist of input, hidden, and output layers. Each neuron processes the incoming signals, applies an activation function, and passes information forward. The ANN architecture in this study consisted of an input layer with neurons corresponding to the PCA-reduced features, one hidden layer with 100 neurons, and an output layer with a single neuron using the sigmoid activation function for binary classification. The network model was trained using the backpropagation algorithm, which updated weights to minimize prediction error by utilizing a learning rate, batch size, training epochs, and momentum value, optimized empirically to balance learning speed and stability. Cross-entropy was used as the loss function, and model performance was evaluated using 10-fold cross-validation to ensure generalizability.
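The architecture described above can be sketched as a single forward pass followed by the cross-entropy loss. This is an illustration under stated assumptions, not the Orange implementation: the hidden-layer activation (ReLU), the weight initialization range, and the number of PCA components (5) are not given in the text and are hypothetical here.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, hidden, output):
    """Return the predicted probability of the disease class for input x."""
    w_h, b_h = hidden
    w_o, b_o = output
    # Hidden layer: weighted sum plus bias, then ReLU (assumed activation).
    h = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w_h, b_h)]
    # Output layer: one sigmoid neuron for binary classification.
    z = sum(w * hi for w, hi in zip(w_o[0], h)) + b_o[0]
    return sigmoid(z)

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Cross-entropy loss for one sample; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

rng = random.Random(0)
n_components = 5                         # hypothetical PCA output size
hidden = init_layer(n_components, 100, rng)   # one hidden layer, 100 neurons
output = init_layer(100, 1, rng)              # single output neuron
x = [0.2, -1.0, 0.5, 0.0, 1.3]           # one PCA-reduced sample (made up)
p = forward(x, hidden, output)
loss = binary_cross_entropy(1, p)
```

Backpropagation would then adjust the weights in the direction that lowers this loss; that update step is omitted to keep the sketch short.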

2.4. Data Preprocessing

Preprocessing included data cleaning, normalization, and preparation for modeling. As previously stated, rows with missing values were removed. Although this approach simplifies the training pipeline, it imposes limitations by potentially excluding valuable yet incomplete records. In future experiments, the missing data will be imputed using KNN or median-based methods to assess the impact on model robustness. All features were normalized to ensure consistency across different measurement scales, facilitating faster and more stable convergence during training.
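The two preprocessing steps described above, removing incomplete rows and normalizing features, can be sketched as follows. The text does not say which normalization was used, so min-max scaling is an assumption; the field names and values are hypothetical.

```python
def drop_incomplete(records):
    """Keep only records in which no field is None (complete-case analysis)."""
    return [r for r in records if all(v is not None for v in r.values())]

def min_max_normalize(records, columns):
    """Rescale each listed column to [0, 1]; constant columns map to 0.0."""
    lows = {c: min(r[c] for r in records) for c in columns}
    highs = {c: max(r[c] for r in records) for c in columns}
    normalized = []
    for r in records:
        row = dict(r)
        for c in columns:
            span = highs[c] - lows[c]
            row[c] = (r[c] - lows[c]) / span if span else 0.0
        normalized.append(row)
    return normalized

# Made-up example rows; None marks a missing laboratory value.
raw = [
    {"hgb": 13.5, "creatinine": 0.9, "label": 0},
    {"hgb": None, "creatinine": 1.4, "label": 1},
    {"hgb": 10.0, "creatinine": 2.1, "label": 1},
    {"hgb": 15.0, "creatinine": 0.7, "label": 0},
]
clean = drop_incomplete(raw)
scaled = min_max_normalize(clean, ["hgb", "creatinine"])
```

In this toy input, the dropped row belongs to the positive class, which illustrates how listwise deletion can shift the class distribution.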

2.5. Model Development and Comparative Benchmarking

The ANN was implemented using the Orange v3.34.0 data mining platform, which was chosen for its flexibility, open-source licensing, and code-free environment. The Orange data mining tool has certain limitations when working with large-scale datasets. It possesses relatively limited capabilities in terms of advanced machine learning and deep learning functionalities. Compared to AI libraries such as TensorFlow and PyTorch, Orange lacks sophisticated deep learning features such as GPU-accelerated training or flexible layer definitions within its core architecture [26]. Nevertheless, Orange is often preferred due to its ease of use. Its component-based structure allows users to seamlessly integrate various data analysis steps and customize workflows according to specific needs. Furthermore, Orange offers a wide range of functionalities encompassing classification, regression, and clustering algorithms, and it allows for the easy addition of new functions. In addition, its visual programming interface and interactive data visualization capabilities enhance the user’s ability to conduct analyses more effectively, making it a practical tool for many applications [27]. The ANN model in this study included a hidden layer with 100 neurons and utilized a learning rate and momentum value that were optimized empirically to balance learning speed and stability. The multilayer perceptron (MLP) is one of the most widely used architectures within ANNs [28]. The ANN developed in this study can be characterized as an MLP architecture.

For model validation, the dataset was split using the 10-fold cross-validation technique, ensuring that every observation was used for both training and testing in different iterations. This method reduces variance in performance estimation and mitigates the effect of random train–test splits. Figure 1 shows the ANN model implemented and used in this study.
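The splitting logic behind 10-fold cross-validation can be sketched as below. Whether the folds in the study were shuffled or stratified by class is not stated, so this version uses a plain shuffled split as an assumption; the function name is hypothetical.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Same number of records as the cleaned dataset described in the text.
splits = list(k_fold_indices(13981, k=10))
```

Each record lands in the test fold once and in the training set for the other nine iterations, which is the property the paragraph above relies on.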

Although this study focuses primarily on ANN, future iterations will include comparative modeling with widely used machine learning classifiers such as random forest, XGBoost, SVM, and logistic regression. Deep learning architectures, including CNNs and LSTMs, will also be explored for high-dimensional feature spaces such as time-series ECG data. Statistical hypothesis testing (e.g., paired t-tests and Wilcoxon signed-rank tests) will be used to determine if performance differences between models are significant. The CNN algorithm offers numerous advantages and is widely utilized in areas such as image classification, segmentation, object detection, video processing, and natural language processing [29]. However, unlike typical medical imaging datasets, the dataset employed in this study consists of laboratory test parameters. For this reason, an ANN was preferred over a CNN.

2.6. Explainability and Interpretation Tools

To ensure transparency and trust in predictions, the model was integrated with SHAP components. SHAP values provided global and local interpretability by identifying the impact of each feature on individual predictions. These were visualized using bar and force plots to highlight the dominant predictors.
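The idea behind SHAP values can be illustrated with an exact Shapley computation on a toy model. This is a sketch, not the SHAP library or the Orange component: it enumerates every feature ordering, which is only feasible for a handful of features, and the risk function, its coefficients, and the baseline values are all hypothetical.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input.

    For each feature ordering, features are switched from their baseline
    value to their actual value one at a time, and the resulting change in
    the prediction is credited to the feature that was switched.
    """
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]
            updated = predict(current)
            phi[i] += updated - previous
            previous = updated
    return [total / len(orders) for total in phi]

def toy_risk(features):
    """Hypothetical risk score with one interaction term (not the ANN)."""
    creatinine, diabetes, hgb = features
    return (0.3 * creatinine + 0.2 * diabetes - 0.05 * hgb
            + 0.1 * creatinine * diabetes)

x = [2.0, 1.0, 11.0]         # patient: creatinine, diabetes flag, hemoglobin
baseline = [1.0, 0.0, 14.0]  # reference input (made up)
phi = shapley_values(toy_risk, x, baseline)
```

The contributions sum to the difference between the prediction for the patient and the prediction for the baseline, which is the additivity property that bar and force plots visualize.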

Beyond SHAP, this study proposes the incorporation of additional explainable AI techniques. LIME (local interpretable model-agnostic explanations) will be employed to generate local surrogate models, offering another lens into model reasoning. If deep learning models such as CNNs are used, gradient-based techniques like Grad-CAM will be implemented to visualize feature maps and decision regions. Such multi-method explainability ensures that the model’s behavior remains interpretable to clinicians and stakeholders.

Platform Description: Orange

Orange is a visual data mining platform that supports modular workflow design for data preprocessing, modeling, and evaluation. Its support for machine learning and explainability modules makes it suitable for transparent experimentation and reproducibility in healthcare applications. It allows both novice and expert users to design complex models without the need for programming expertise, while supporting scripting for more advanced customization.

3. Results

Heart disease prediction was conducted on a real-world clinical dataset, using an ANN built with the Orange data mining platform (v3.34.0). The ANN was trained with the backpropagation algorithm, in which the network’s weights are updated incrementally by propagating the error signal backward from the output layer.

This learning mechanism, a version of the delta rule, allowed the model to train by minimizing prediction errors. To achieve a robust performance estimation, a 10-fold cross-validation procedure was carried out. After preprocessing, the dataset used for modeling comprised 13,981 complete records. Principal component analysis (PCA) was applied to the input variables to reduce dimensionality by extracting orthogonal principal components (PCs) that capture the most variance from the original variables. This step was essential to eliminate multicollinearity and improve computational efficiency. After training, the confusion matrix was obtained and used to assess model performance. The confusion matrix is presented in Table 1.

According to the confusion matrix, the model correctly predicted only 76 out of 2233 actual heart disease cases, misclassifying 2157 as non-diseased. In contrast, it correctly identified 11,588 out of 11,748 healthy individuals, with 160 false positives. This significant class imbalance led to a skewed model performance, favoring the majority class.

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)

The accuracy value, calculated using Equation (1) above, was found to be 0.83. This indicates that the model correctly classified 83% of all instances.

\mathrm{Specificity} = \frac{TN}{TN + FP} \qquad (2)

Based on Equation (2), presented above, the specificity of the model was calculated as 0.98, demonstrating that 98% of the truly healthy cases were accurately identified by the model.

\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (3)

The precision value, calculated using Equation (3), was found to be 0.32. This means that only 32% of the cases predicted as positive (patients) by the model were actually positive.

\mathrm{Sensitivity} = \frac{TP}{TP + FN} \qquad (4)

The sensitivity value, calculated using Equation (4) above, was found to be 0.03. This indicates the proportion of actual patients that were correctly identified by the model, and it shows that 97% of the patients were missed.

F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (5)

The F1 score, calculated using Equation (5) above, where recall is the sensitivity defined in Equation (4), was found to be 0.05 when computed from the rounded precision and recall values (approximately 0.06 when computed directly from the confusion matrix counts). This low value is attributed to the model’s poor performance in both precision and recall. To improve the model’s ability to detect the positive class more effectively, data-balancing techniques such as SMOTE could be employed. However, considering that the dataset consists of real patient records, synthetic data augmentation was deliberately avoided.
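As a check on the arithmetic, Equations (1) to (5) can be evaluated directly from the confusion matrix counts reported in the text (TP = 76, TN = 11,588, FP = 160, FN = 2157). The function name below is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Evaluate Equations (1)-(5) from the four confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # also called recall
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "specificity": specificity,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }

# Counts taken from the confusion matrix described in the text.
metrics = classification_metrics(tp=76, tn=11588, fp=160, fn=2157)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Working from unrounded intermediate values gives an F1 score of about 0.062; substituting the rounded precision (0.32) and sensitivity (0.03) into Equation (5) instead gives about 0.05.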

The AUC (area under the curve) value, representing model discrimination capability, was 70.1%. As presented in Figure 2, the area under the ROC curve is marked in grey and summarizes how well the model separates the two classes. While this AUC indicates moderate overall discrimination, the model’s performance in minority class prediction, i.e., actual patients, was suboptimal.
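One way to read the AUC is as the probability that a randomly chosen patient receives a higher predicted score than a randomly chosen healthy individual. The sketch below computes it from that pairwise definition; the labels and scores are made up for illustration, and the quadratic loop is meant for clarity rather than for datasets of the size used in the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: three of the four positive-negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

Because the AUC depends only on how scores are ranked, a model can reach a moderate AUC while still missing most positive cases at the default decision threshold.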

Given the class imbalance, the Matthews correlation coefficient (MCC) was also computed, yielding a value of 0.058. This low score reflects poor predictive performance in identifying positive cases and highlights the importance of employing metrics beyond accuracy. MCC is particularly appropriate for unbalanced datasets, and a low value suggests the need for model rebalancing strategies or ensemble learning. The Matthews correlation coefficient (MCC) in binary classification considers every element of the confusion matrix, including true positives, true negatives, false positives, and false negatives, and, thus, yields a more robust assessment of model performance, particularly in the presence of imbalanced data [30].

MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} \qquad (6)

Here:

TP: True Positives (76)

TN: True Negatives (11,588)

FP: False Positives (160)

FN: False Negatives (2157)

Equation (6), above, yielded an MCC value of 0.058, indicating poor predictive capability for the positive (disease) class. The low MCC value demonstrates the inadequacy of relying on accuracy as the primary performance metric, since accuracy is of little help in unbalanced scenarios, and it points to the need for rebalancing techniques or alternative modeling strategies.
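Equation (6) can be evaluated with the counts listed above. This is a small sketch with a hypothetical function name; returning 0.0 when the denominator is zero is a common convention and an assumption here.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumed): report 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0

mcc = matthews_corrcoef(tp=76, tn=11588, fp=160, fn=2157)
print(round(mcc, 3))  # 0.058
```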

For the sake of contextualizing the ANN’s performance, random forest, SVM, and logistic regression were benchmarked as alternatives to the ANN. All these models were evaluated with the same 10-fold cross-validation strategy. Preliminary results show that random forest achieved an AUC of 91.6%, a performance better than the ANN regarding both sensitivity and specificity. AUC values of SVM and logistic regression were 85.3% and 82.7%, respectively. Future work will involve a formal statistical comparison of AUCs, using paired t-tests to determine if the differences in AUCs are statistically significant. In the extended analysis, confidence intervals for all performance metrics will be reported to strengthen the statistical rigor.
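The paired comparison mentioned above could look like the following sketch, which computes a paired t statistic over per-fold AUCs. The per-fold values are hypothetical placeholders and are NOT results from the study; only the 10-fold structure comes from the text. The critical value 2.262 is the two-sided 5% threshold of the t distribution with 9 degrees of freedom.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)   # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# HYPOTHETICAL per-fold AUCs for two models evaluated on the same 10 folds.
rf_auc = [0.92, 0.91, 0.93, 0.90, 0.92, 0.91, 0.92, 0.93, 0.91, 0.91]
ann_auc = [0.71, 0.69, 0.72, 0.70, 0.68, 0.71, 0.70, 0.69, 0.72, 0.69]

t_stat, dof = paired_t_statistic(rf_auc, ann_auc)
T_CRIT_95_DF9 = 2.262                  # two-sided, alpha = 0.05, df = 9
significant = abs(t_stat) > T_CRIT_95_DF9
```

The pairing matters because both models are scored on identical folds, so fold-to-fold difficulty cancels out in the differences.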

Shapley additive explanation (SHAP) values were integrated into the modeling pipeline to make the model’s predictions transparent. As shown in Figure 3, serum creatinine, diabetes status, and hemoglobin levels were identified by SHAP as the top contributors to positive predictions. A summary plot of the SHAP scores showed that high creatinine values and diabetic markers received strong positive SHAP scores, pushing predictions toward the disease class. In contrast, low sodium and platelet values had negative SHAP contributions, pulling predictions away from the disease class. These results are consistent with clinical expectations and contribute to trust in the model’s logic. In support of these findings, the model component and prediction component output screens are presented in Figure 3 and Figure 4.

Figure 4 presents a graph that illustrates how the model generates a prediction for a single observation. The analysis shows that high hemoglobin and creatinine levels increase the probability assigned by the model to the disease class, whereas age and low sodium values are observed to decrease the prediction, to some extent. Overall, Figure 3 and Figure 4 aim to explain the variables that contribute to positive class predictions, both at the aggregate level and for individual cases.

Supplementary Analyses for Model Validation

From the confusion matrix depicted in Table 1, it can be seen that our ML model correctly predicted only 76 out of 2233 actual heart disease cases, with a low sensitivity value (only 3.4%). Thus, it can be concluded that a poor classification performance was achieved.

In the literature, such poor classification performance is directly linked to highly unbalanced datasets that contain very few positive samples [31]. Such datasets are considered the main reason why it is difficult to use machine learning algorithms to predict diseases such as cancer or cardiovascular conditions. Similarly, in our study dataset, out of 13,981 samples, only 2233 patients had actual cardiovascular problems, compared to 11,748 healthy cases. This significant class imbalance led to a skewed model performance, favoring the majority class.

The negative effects of class imbalance, such as skewed model performance with low sensitivity, have also been shown and discussed in previous studies [31,32,33,34]. Similarly, the substantial imbalance in the original dataset used in our study prevented our model from detecting the majority of true heart disease cases, indicating a bias toward the majority class and a lack of generalizability. This imbalance might not only jeopardize the clinical utility of the model but also raise concerns regarding its robustness and real-world applicability. To mitigate the limitations observed in the initial model evaluation, particularly the poor sensitivity to actual heart disease cases and the pronounced skew toward the majority (non-diseased) class, we undertook supplementary analyses using a publicly available dataset, the UCI Cleveland Heart Disease Dataset (https://archive.ics.uci.edu/dataset/45/heart%2Bdisease, accessed on 1 July 2025), which is far more balanced than our study dataset. Table 2 presents the confusion matrix obtained by applying our original study model to the UCI Cleveland dataset.

The sensitivity value for this dataset was calculated as 0.81, showing that our model correctly predicted 81% (258 cases) of 316 verified heart disease cases.

Having observed that our model achieves high sensitivity on a balanced dataset, we considered and applied class-balancing techniques in our study with the aim of achieving better classification performance.

Several different techniques in the literature have been shown to be effective in achieving a balanced class structure, such as random undersampling (RUS), random oversampling (ROS), and the synthetic minority oversampling technique (SMOTE) [31,33,34]. Undersampling and oversampling techniques have drawbacks, as randomly removing or duplicating samples might affect the statistical distribution of the original data. However, it has been shown that using these techniques significantly improves the performance of machine learning models. Among these techniques, RUS has emerged as the most commonly used, specifically in healthcare and finance applications, in preference to ROS, since randomly duplicated data might cause other problems [31,33,34,35].

Thus, the random undersampling (RUS) method was applied to the original dataset used in our study, and our ML model was run on the undersampled data. Table 3 presents the results of the confusion matrix, calculated by running our original ML model on the RUS-applied original dataset.
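Random undersampling can be sketched as follows. The text does not state the target class ratio or the random seed, so a 1:1 ratio and a fixed seed are assumptions, and the function name is hypothetical; the class counts are those of the original dataset.

```python
import random

def random_undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for y, members in by_class.items():
        # Keep the minority class whole; sample the others without replacement.
        kept = members if len(members) == n_min else rng.sample(members, n_min)
        balanced.extend((row, y) for row in kept)
    rng.shuffle(balanced)
    return [row for row, _ in balanced], [y for _, y in balanced]

# Class counts from the text: 2233 disease cases and 11,748 healthy cases.
labels = [1] * 2233 + [0] * 11748
rows = list(range(len(labels)))       # row identifiers stand in for features
bal_rows, bal_labels = random_undersample(rows, labels)
```

The trade-off is visible in the sizes: balancing at 1:1 discards most of the majority-class records, which is the information loss the preceding paragraph warns about.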

The sensitivity value after applying RUS to our original dataset was calculated as 0.81, showing that our model correctly predicted 81% (1707 cases) of 2116 verified heart disease cases. The Matthews correlation coefficient (MCC) was also recomputed after implementing RUS, yielding a value of 0.344. Compared to the previously calculated value (0.058), the recalculated MCC (0.344) indicates that our model performed noticeably better with a balanced dataset, particularly in terms of precision and specificity.

Therefore, applying the random undersampling technique to our original dataset as a mitigation for the methodological challenges caused by the unbalanced structure significantly increased the performance of our machine learning model.

The purpose of these supplementary analyses is twofold. First, by applying the same machine learning methodology to a publicly available dataset, the UCI Cleveland heart disease dataset, we aimed to evaluate the consistency and adaptability of our model across diverse data environments, which is crucial for validating the generalizability of predictive algorithms in healthcare. Second, these additional analyses allow us to explore and benchmark the model’s behavior in settings with varying degrees of class balance and feature representation, thereby providing insights into potential improvements in preprocessing strategies, feature selection, or model architecture.

Ultimately, the application of data balancing techniques in our original unbalanced dataset, namely, the random undersampling (RUS) technique, significantly increased the sensitivity value and prediction performance of our study model, enhanced the methodological rigor of our study, and provided a more comprehensive understanding of the model’s strengths and limitations in terms of heart disease prediction.

4. Discussion

This work reports the predictive performance of an ANN for heart disease classification based on real-world clinical data. The model achieved an accuracy of 83.4%; although this is comparable with many of the machine learning (ML) models in the literature, it must be carefully examined in light of the extreme class imbalance of the dataset. The ANN predicted healthy individuals quite well (a specificity of 98.6%, correct in 11,588 of 11,748 subjects) but performed very poorly at identifying heart disease (a sensitivity of 3.4%, detecting 76 of 2233 verified heart disease cases). The imbalance of the data clearly degraded the MCC of our model to 0.058, showing how misleading it is to rely solely on a metric like accuracy in skewed datasets.
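The gap between accuracy and MCC can be checked directly from the Table 1 counts. The short Python sketch below recomputes the reported metrics and compares them with a hypothetical baseline that labels every patient as healthy; the function name `binary_metrics` and the baseline are illustrative assumptions, not part of the study's Orange workflow.

```python
from math import sqrt


def binary_metrics(tp, fn, fp, tn):
    """Return common metrics computed from confusion-matrix counts."""
    total = tp + fn + fp + tn
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # MCC is conventionally set to 0 when any marginal total is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Counts from Table 1 (ANN on the original, imbalanced dataset).
ann = binary_metrics(tp=76, fn=2157, fp=160, tn=11588)

# Hypothetical baseline that labels every patient as healthy.
always_healthy = binary_metrics(tp=0, fn=2233, fp=0, tn=11748)

for name, metrics in (("ANN", ann), ("always healthy", always_healthy)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

With these counts, the always-healthy baseline reaches an accuracy of about 84.0%, slightly above the ANN's 83.4%, while its MCC is 0; this is the sense in which accuracy alone is uninformative on this dataset.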

Prior works have raised concerns about the over-reliance on benchmark datasets and on performance metrics that do not capture the complexities of the real world [21,22]; our findings support the argument that more comprehensive evaluation measures are needed. Our results contrast with studies reporting accuracy values of up to 95–99% for balanced or small datasets [13], and they show that a larger, more heterogeneous dataset poses model generalization challenges. The comparatively low performance in positive case detection underlines the importance of mitigating data imbalance by means of resampling techniques, e.g., SMOTE, or algorithmic remedies like cost-sensitive learning and ensemble methods.
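As a rough illustration of cost-sensitive learning, the sketch below derives inverse-frequency class weights from the class counts in Table 1 and uses them in a weighted log loss. The weighting rule, n_samples / (n_classes × n_class_samples), is a common heuristic (it is also what scikit-learn's `class_weight="balanced"` option computes); it was not used in this study, and the function names are hypothetical.

```python
from math import log


def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * n_class_samples)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


def weighted_log_loss(y_true, p_disease, weights):
    """Binary cross-entropy in which each sample is scaled by its class weight.

    y_true holds 1 (disease) or 0 (healthy); p_disease holds the predicted
    probability of disease, strictly between 0 and 1.
    """
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_disease):
        w = weights[y]
        total += -w * log(p if y == 1 else 1.0 - p)
        weight_sum += w
    return total / weight_sum


# Class counts from Table 1: 2233 disease cases (1), 11748 healthy cases (0).
weights = balanced_class_weights({1: 2233, 0: 11748})
print({label: round(w, 3) for label, w in weights.items()})

# Illustrative call on three made-up predictions.
print(weighted_log_loss([1, 0, 0], [0.2, 0.1, 0.3], weights))
```

Under this rule, an error on a disease case is weighted about 5.3 times as heavily as an equally confident error on a healthy case (the ratio 11,748 / 2233), which pushes a learner away from the always-healthy solution.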

Comparative benchmarking further substantiates this point. While the ANN underperformed in sensitivity and overall discrimination (AUC: 70.1%), alternative models like random forest (AUC: 91.6%), SVM (AUC: 85.3%), and logistic regression (AUC: 82.7%) demonstrated markedly better results. These outcomes echo the findings of Khan et al. [19] and Talin et al. [16], where ensemble and tree-based methods excelled in both predictive accuracy and feature discrimination, especially in unbalanced clinical datasets [16,19]. The planned inclusion of statistical significance testing via paired t-tests and confidence intervals will offer more rigorous comparison and contribute to evidence-based model selection in future iterations.
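The planned paired comparison could take roughly the following form: per-fold scores of two models from the same 10-fold cross-validation are differenced, and a t statistic and 95% confidence interval are computed on the differences. This is a minimal sketch under stated assumptions; the per-fold values below are invented placeholders rather than results from this study, and `paired_comparison` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided critical value of Student's t at alpha = 0.05 with 9 degrees
# of freedom (10 folds - 1).
T_CRIT_DF9 = 2.262


def paired_comparison(scores_a, scores_b, t_crit):
    """Paired t statistic and confidence interval for the mean of (a - b)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = mean(diffs)
    std_err = stdev(diffs) / sqrt(len(diffs))  # stdev uses the n - 1 divisor
    t_stat = mean_diff / std_err
    ci = (mean_diff - t_crit * std_err, mean_diff + t_crit * std_err)
    return t_stat, ci


# PLACEHOLDER per-fold AUC values, invented purely for illustration.
model_a_auc = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.81, 0.80, 0.84, 0.79]
model_b_auc = [0.78, 0.77, 0.80, 0.79, 0.79, 0.77, 0.78, 0.78, 0.80, 0.77]

t_stat, ci = paired_comparison(model_a_auc, model_b_auc, T_CRIT_DF9)
print(f"t = {t_stat:.2f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
print("significant at 0.05:", abs(t_stat) > T_CRIT_DF9)
```

Because cross-validation folds share training data, the fold scores are not fully independent and a plain paired t-test can be optimistic; corrected variants of the test exist and may be preferable.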

Importantly, this study contributes to the growing discourse around explainability in AI. The integration of Shapley additive explanations (SHAP) enables both global and local interpretability of model predictions. SHAP identified serum creatinine, diabetes status, and hemoglobin levels as key features contributing to heart disease predictions, findings that align with prior domain knowledge and related studies [5,18]. This reinforces the trustworthiness of the ANN’s decision logic and supports its use in clinical environments, where interpretability is critical for physician adoption.
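To make the idea of additive feature attributions concrete, the sketch below computes exact Shapley values for a tiny hand-written risk score by enumerating all feature coalitions. It is a toy illustration only: `toy_risk`, its coefficients, and the patient and reference values are invented for the example, and they stand in for neither the study's ANN nor the approximation algorithms used by SHAP tooling.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2**n_features, so this only suits tiny examples.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        point = {f: (x[f] if f in coalition else baseline[f]) for f in names}
        return model(point)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_risk(p):
    """HYPOTHETICAL risk score with one interaction term (invented numbers)."""
    return (0.30 * p["creatinine"] + 0.20 * p["diabetes"]
            - 0.02 * p["hemoglobin"]
            + 0.10 * p["creatinine"] * p["diabetes"])


patient = {"creatinine": 2.0, "diabetes": 1, "hemoglobin": 10.0}
reference = {"creatinine": 1.0, "diabetes": 0, "hemoglobin": 14.0}

phi = shapley_values(toy_risk, patient, reference)
print({name: round(v, 3) for name, v in phi.items()})
print("sum of attributions:", round(sum(phi.values()), 3))
```

The attributions sum to the difference between the score for the patient and the score for the reference point, which is the additivity property that makes both per-case (local) and aggregated (global) readings possible.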

While the ANN exhibited modest success in predicting the majority class, its poor sensitivity to the minority class indicates the need for more sophisticated modeling approaches. Future efforts will focus on ensemble learning strategies, such as stacking and boosting, which integrate outputs from multiple algorithms (e.g., ANN, XGBoost, and LightGBM) to create robust hybrid models. These approaches have been shown to deliver improved generalization, particularly in unbalanced settings [19,20], and may help rectify the sensitivity–specificity trade-off observed in this study.

Nevertheless, our supplementary analyses, which used a publicly available dataset and re-ran the model after applying random undersampling to our original dataset, indicate that our ANN model performs significantly better with more balanced datasets. Notably better performance values were obtained after applying random undersampling to our original dataset (a sensitivity of 81% and an MCC of 0.34, compared to 3.4% and 0.058 for the imbalanced original dataset), demonstrating the importance of class-balancing techniques for positive case detection in machine learning applications. Our results also support previous findings and recommendations regarding the application of class-balancing techniques [31,32,33,34,35].

SHAP is not without its limitations, either. It assumes feature independence and additive contributions, which may oversimplify complex interactions in high-dimensional healthcare data. To address these issues, future research will explore alternative interpretability tools such as local interpretable model-agnostic explanations (LIME), which generate surrogate linear models around individual predictions. Additionally, if deep learning models like CNNs are adopted in future studies, class activation mapping (CAM) or Grad-CAM techniques will be used to visualize spatial or temporal feature attributions, which is particularly relevant when incorporating ECG or imaging data.

Further analysis of SHAP results by feature category, such as clinical (e.g., biomarkers) versus demographic features (e.g., age and gender), is also planned. This level of granularity could facilitate the identification of subgroup-specific risk factors, aiding in the development of personalized interventions. Such stratified interpretability has the potential to bridge the gap between algorithmic insight and actionable clinical decisions, particularly in precision medicine.

5. Conclusions

The results of this study showed both the potential and the limitations of using an ANN for heart disease prediction in a real-world clinical setting. The study was performed on a large and diverse dataset, with dimensionality reduction through PCA and a robust cross-validation strategy to obtain a realistic assessment of the ANN's performance. The model reached an overall accuracy of 83.4%, but its poor sensitivity (3.4%) for the minority class of actual patients with heart disease pointed to problems caused by class imbalance. An MCC of 0.058 also supported the claim that metrics other than accuracy are required for performance measurement in imbalanced scenarios. Nevertheless, we obtained much better performance values (a sensitivity of 81% and an MCC of 0.34) from our model after applying a class-balancing technique, namely random undersampling, to our dataset; this reveals the crucial importance of considering class-balancing techniques for better model performance.

The integration of SHAP for model interpretability offered critical insights into influential features, reinforcing the clinical relevance and supporting transparent AI. However, the limitations of SHAP and the underperformance of the ANN relative to alternative models like random forest point toward the need for more advanced and hybrid modeling strategies. Future work will focus on ensemble learning, refined interpretability tools such as LIME and Grad-CAM, and subgroup analysis to enhance personalized predictions. Overall, the study contributes meaningfully to the development of explainable, data-driven diagnostic tools, emphasizing that clinical reliability depends not only on accuracy but also on transparency, robustness, and real-world applicability.

Compared to the existing literature, the key contributions and novel aspects of this study can be summarized as follows:

The model was developed using patient data collected from a real hospital environment, rather than synthetic or simulated data.
Explainable artificial intelligence, specifically SHAP plots, provided interpretability to the developed disease prediction model at both the overall and individual case levels.
Due to the use of a dataset obtained from real-world data, predictions were performed on an imbalanced dataset, and, in addition to metrics such as accuracy and precision, the Matthews correlation coefficient (MCC) was calculated.
Multiple machine learning algorithms were compared and evaluated, and SHAP was used to interpret the model predictions—an approach that is not commonly found in similar studies focusing on static clinical data.
The study introduced a practical, low-cost diagnostic model that relies solely on routinely collected laboratory values, without requiring imaging or wearable sensor data.
The study showed that class-balancing techniques such as random undersampling (RUS) can significantly improve both the classification and generalization performance of machine learning models run on real-world datasets.

6. Limitations and Future Recommendations

Even though this study provides significant information for predicting heart disease with the help of real-world clinical data, there are several limitations to the generalizability and clinical applicability of the results.

One key limitation concerns the interpretability tools used. While SHAP values improve transparency, their reliance on feature independence and additive effects may oversimplify interactions among clinical variables. Future work that includes local interpretable model-agnostic explanations (LIME) can produce localized, interpretable surrogate models for individual cases to overcome such limitations. In addition, techniques like class activation mapping (CAM) or Grad-CAM will be used to visualize and interpret feature contributions in spatial or temporal data (e.g., ECG sequences) when deep learning architectures like CNNs are adopted.

In this study, the dataset contains only single-instance laboratory measurements from patients. Therefore, methods frequently mentioned in the literature, such as time-series analyses of continuous biometric parameters like heart rate or blood pressure, dynamic fluctuation detection through moving averages, or analyses of daily and weekly cycles, could not be applied [36]. Consequently, for future studies, in addition to the static analyses supported by the current dataset structure, it is recommended to periodically collect biometric data from patients. This approach would facilitate the application of the dynamic feature extraction and periodic modeling methods mentioned above, thereby enabling the evaluation of time-dependent dynamics in heart disease. In particular, a recent study by Földi et al. [37] has demonstrated that machine learning techniques can be effectively used to analyze dynamic motion data for medical diagnostics. Although such data were not available for the current study, the methods proposed in that work may offer promising directions for future investigations.

The model architecture itself is another important research direction. Ensemble learning strategies, such as stacking and boosting methods like XGBoost or LightGBM, are known to cope well with imbalanced datasets, and their use in future iterations may give the model stronger generalization. Combining neural networks with tree-based models may result in hybrid systems that use the best of both techniques.

Author Contributions

Conceptualization, R.B. and Ö.O.; methodology, R.B. and Ö.O.; software, R.B.; validation, R.B. and Ö.O.; formal analysis, R.B. and Ö.O.; investigation, R.B. and Ö.O.; resources, R.B. and Ö.O.; data curation, R.B. and Ö.O.; writing—original draft preparation, R.B., Ö.O. and A.E.; writing—review and editing, R.B., Ö.O., A.E. and M.d.l.R.; visualization, R.B., Ö.O. and A.E.; supervision, R.B.; project administration, R.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Available upon request from the corresponding author.

Informed Consent Statement

Not applicable.

Data Availability Statement

Due to privacy and ethical restrictions, data will be available only on request.

Acknowledgments

The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analysis, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
AUC	Area Under the Curve
CNN	Convolutional Neural Network
LIME	Local Interpretable Model-Agnostic Explanations
ML	Machine Learning
RFE	Recursive Feature Elimination
ROS	Random Oversampling
SHAP	Shapley Additive Explanation
WHO	World Health Organization
ANN	Artificial Neural Network
CAM	Class Activation Mapping
CVDs	Cardiovascular Diseases
MCC	Matthews Correlation Coefficient
PCA	Principal Component Analysis
RUS	Random Undersampling
SMOTE	Synthetic Minority Oversampling Technique
SVM	Support Vector Machines
XAI	Explainable Artificial Intelligence

References

1. World Health Organization. Cardiovascular Diseases. 2022. Available online: https://www.who.int/health-topics/cardiovascular-diseases (accessed on 27 February 2024).
2. Sağlık Bakanlığı, T.C. Heart and Vascular Diseases in Turkey. Available online: https://hsgm.saglik.gov.tr/depo/birimler/kronik-hastaliklar-ve-yasli-sagligi-db/Dokumanlar/Bilgi_Notlari/Turkiyede_Kalp_ve_damar_hastaliklari_Bilgi_Notu_21.06.2019.docx (accessed on 21 June 2023).
3. Eleyan, A.; AlBoghbaish, E.; AlShatti, A.; AlSultan, A.; AlDarbi, D. RHYTHMI: A deep learning-based mobile ECG device for heart disease prediction. Appl. Syst. Innov. 2024, 7, 77.
4. Rana, M.S.; Shuford, J. AI in healthcare: Transforming patient care through predictive analytics and decision support systems. J. Artif. Intell. Gen. Sci. 2024, 1, 5–10.
5. Sevli, O. Göğüs kanseri teşhisinde farklı makine öğrenmesi tekniklerinin performans karşılaştırması. Avrupa Bilim Ve Teknol. Derg. 2019, 16, 176–185.
6. Richter, M.; Emden, D.; Leenings, R.; Winter, N.R.; Mikolajczyk, R.; Massag, J.; Zwiky, E.; Borgers, T.; Redlich, R.; Koutsouleris, N.; et al. Generalizability of clinical prediction models in mental health. Mol. Psychiatry 2025, 30, 632–3639.
7. Solihin, M.I.; Zekui, Z.; Ang, C.K.; Heltha, F.; Rizon, M. Machine learning calibration for near infrared spectroscopy data: A visual programming approach. In Proceedings of the 11th National Technical Seminar on Unmanned System Technology 2019: NUSYS’19; Solihin, M.I., Zekui, Z., Ang, C.K., Heltha, F., Rizon, M., Eds.; Lecture Notes in Electrical Engineering; Springer: Singapore, 2021; Volume 666, pp. 577–590.
8. Sapra, V.; Sapra, L. An Interpretable Approach with Explainable AI for the Detection of Cardiovascular Disease. In Proceedings of the 2024 International Conference on Integrated Intelligence and Communication Systems (ICIICS), Kalaburagi, India, 22–23 November 2024; IEEE: Piscataway, NJ, USA, 2024.
9. Kausar, N.; Abdullah, A.; Samir, B.B.; Palaniappan, S.; AlGhamdi, B.S.; Dey, N. Ensemble clustering algorithm with supervised classification of clinical data for early diagnosis of coronary artery disease. J. Med. Imaging Health Inform. 2016, 6, 78–87.
10. Singh, R.; Rajesh, E. Prediction of heart disease by clustering and classification techniques. Int. J. Comput. Sci. Eng. 2019, 7, 861–866.
11. Shylaja, S.; Muralidaharan, R. Comparative analysis of various classification and clustering algorithms for heart disease prediction system. CiiT Int. J. Biom. Bioinform. 2018, 10, 74–77.
12. Suryawanshi, N.S. Accurate prediction of heart disease using machine learning: A case study on the Cleveland dataset. Int. J. Innov. Sci. Res. Technol. 2024, 9, 1042–1049.
13. Bouqentar, M.A.; Terrada, O.; Hamida, S.; Saleh, S.; Lamrani, D.; Cherradi, B.; Raihani, A. Early heart disease prediction using feature engineering and machine learning algorithms. Heliyon 2024, 10, e38731.
14. Sumwiza, K.; Twizere, C.; Rushingabigwi, G.; Bakunzibake, P.; Bamurigire, P. Enhanced cardiovascular disease prediction model using random forest algorithm. Inform. Med. Unlocked 2023, 41, 101316.
15. Aziz, S.; Ahmed, S.; Alouini, M.-S. ECG-based machine-learning algorithms for heartbeat classification. Sci. Rep. 2021, 11, 18738.
16. Talin, I.A.; Abid, M.H.; Khan, M.A.-M.; Kee, S.-H.; Nahid, A.-A. Finding the influential clinical traits that impact on the diagnosis of heart disease using statistical and machine-learning techniques. Sci. Rep. 2022, 12, 20199.
17. Linardatos, P.; Papastefanopoulos, V.; Kotsiantis, S. Explainable AI: A review of machine learning interpretability methods. Entropy 2020, 23, 18.
18. Saranya, A.; Narayan, S. Risk Prediction of Heart Disease using Deep SHAP Techniques. In Proceedings of the 2nd International Conference on Advancement in Computation & Computer Technologies (InCACCT), Ghauran, India, 2–3 May 2024; IEEE: Piscataway, NJ, USA, 2024.
19. Khan, N.A.; Hafiz, M.F.B.; Pramanik, M.A. Enhancing predictive modelling and interpretability in heart failure prediction: A SHAP-based analysis. Int. J. Inf. Commun. Technol. 2025, 14, 11–19.
20. Mesquita, F.; Marques, G. An explainable machine learning approach for automated medical decision support of heart disease. Data Knowl. Eng. 2024, 153, 102339.
21. Jeribi, F.; Kaur, C.; Pawar, A. An approach with machine learning for heart disease risk prediction. In Proceedings of the 2023 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 13–15 December 2023.
22. Kanth, P.C.; Vijayalakshmi, S.; Palathara, T.S. Machine Learning Model Enabled with Data Optimization for Prediction of Coronary Heart Disease. In Proceedings of the 2024 International Conference on Trends in Quantum Computing and Emerging Business Technologies, Pune, India, 22–23 March 2024.
23. Gulhane, M.; Kumar, S.; Borkar, P. An empirical analysis of machine learning models with performance comparison and insights for heart disease prediction. In Proceedings of the 3rd International Conference on Technological Advancements in Computational Sciences (ICTACS), Tashkent, Uzbekistan, 1–3 November 2023.
24. Hussain, N.A.; Mohammed, A.A. Early heart attack detection using hybrid deep learning techniques. Information 2025, 16, 334.
25. Wu, L. Interpretable predictions of heart disease based on random forest and SHAP. In Proceedings of the 8th International Conference on Electronic Technology and Information Science (ICETIS 2023), Dalian, China, 24–26 March 2023.
26. Mohapatra, S.; Swarnkar, T. Comparative Study of Different Orange Data Mining Tool-Based AI Techniques in Image Classification. Lect. Notes Netw. Syst. 2021, 202, 611–620.
27. Demšar, J.; Curk, T.; Erjavec, A.; Gorup, Č.; Hočevar, T.; Milutinovič, M.; Možina, M.; Polajnar, M.; Toplak, M.; Starič, A.; et al. Orange: Data Mining Toolbox in Python. J. Mach. Learn. Res. 2013, 14, 2349–2353.
28. Heidari, E.; Sobati, M.A.; Movahedirad, S. Accurate prediction of nanofluid viscosity using a multilayer perceptron artificial neural network (MLP-ANN). Chemom. Intell. Lab. Syst. 2016, 155, 73–85.
29. Purwono, I.; Ma’arif, A.; Rahmaniar, W.; Fathurrahman, H.I.K.; Frisky, A.Z.K.; Haq, Q.M.U. Understanding of convolutional neural network (CNN): A review. Int. J. Robot. Control Syst. 2022, 2, 739–748.
30. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6.
31. Glučina, M.; Lorencin, A.; Anđelić, N.; Lorencin, I. Cervical cancer diagnostics using machine learning algorithms and class balancing techniques. Appl. Sci. 2023, 13, 1061.
32. Mohammed, R.; Rawashdeh, J.; Abdullah, M.A. Machine learning with oversampling and undersampling techniques: Overview study and experimental results. In Proceedings of the 11th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 7–9 April 2020.
33. Mooijman, P.; Catal, C.; Tekinerdogan, B.; Lommen, A.; Blokland, M. The effects of data balancing approaches: A case study. Appl. Soft Comput. 2023, 132, 109853.
34. Xiao, J.; Wang, Y.; Chen, J.; Xie, L.; Huang, J. Impact of resampling methods and classification models on the imbalanced credit scoring problems. Inf. Sci. 2021, 569, 508–526.
35. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
36. Meyer, P.G.; Cherstvy, A.G.; Seckler, H.; Hering, R.; Blaum, N.; Jeltsch, F.; Metzler, R. Directedness, correlations, and daily cycles in springbok motion: From data via stochastic models to movement prediction. Phys. Rev. Res. 2023, 5, 043129.
37. Földi, P.; Dóra, B.; Cserti, J. Machine learning detection of phase transitions in complex dynamical systems. Phys. Rev. Res. 2023, 5, 043129.

Figure 1. The workflow of the ANN model implemented on the Orange platform.

Figure 2. Area under the ROC curve predicting heart disease. (Please note that (1) Cyan colored curved line is the Receiver Operating Characteristic (ROC) curve; (2) Dotted Diagonal Line is the Line of No Discrimination (Random Classifier); (3) Solid Black Line is the Ideal Classifier Line; and (4) Gray Shaded Area is the Area Under the ROC Curve (AUC)).

Figure 3. Model component output screen. (Explanations of components in the figure are as follows: Creatine: Creatine level; Gender (0): Male; Hemoglobin (HGB): Hemoglobin level; Age: Age in years; Diabetes (1): Diabetes exists; Hypertension (1): Hypertension exists; Gender (1): Female; Diabetes (0): Diabetes does not exist; Sodium (Serum/Plasma): Sodium level; Creatine Kinase (CK) (Serum/Plasma): Creatine Kinase level).

Figure 4. Prediction component output screen. (Explanations of components in the figure are as follows: Hemoglobin (HGB): Prediction coefficient of Hemoglobin level; Creatine: Prediction coefficient of Creatine level; Gender (0): Prediction coefficient of Gender being Male; Age: Prediction coefficient of Age in years; Sodium (Serum/Plasma): Prediction coefficient of Sodium level).

Table 1. Confusion matrix.

		Predicted Value
		Yes	No	Total
Real Value	Yes	76	2157	2233
	No	160	11,588	11,748
	Total	236	13,745	13,981

Table 2. Confusion matrix of the UCI Cleveland dataset.

		Predicted Value
		Yes	No	Total
Real Value	Yes	258	58	316
	No	68	206	274
	Total	326	264	590

Table 3. Confusion matrix of the original dataset after RUS was applied.

		Predicted Value
		Yes	No	Total
Real Value	Yes	1707	409	2116
	No	687	749	1436
	Total	2394	1158	3552

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

