Proceeding Paper

Heart Attack Prediction Using Machine Learning Models: A Comparative Study of Naive Bayes, Decision Tree, Random Forest, and K-Nearest Neighbors †

by Makhdoma Haider 1,*, Manzoor Hussain 2 and Gina Purnama Insany 3
1 Department of Software Engineering, University of Sialkot, Sialkot 51040, Pakistan
2 Department of Computing, Indus University, Karachi 75500, Pakistan
3 Department of Informatics Engineering, Nusa Putra University, Sukabumi 43152, West Java, Indonesia
* Author to whom correspondence should be addressed.
† Presented at the 7th International Global Conference Series on ICT Integration in Technical Education & Smart Society, Aizuwakamatsu City, Japan, 20–26 January 2025.
Eng. Proc. 2025, 107(1), 121; https://doi.org/10.3390/engproc2025107121
Published: 28 September 2025

Abstract

Heart disease is the leading cause of death worldwide. Early prediction of heart attacks can save lives if clinical data are used to predict them accurately. To this end, we use four machine learning models, Naive Bayes, Decision Tree, Random Forest, and K-Nearest Neighbors (KNN), to predict heart attacks from patient data. The best-performing model, Random Forest, achieved an accuracy of 65.08%; this paper explores the performance of these models with a view to real-world healthcare applications. Our focus is on improving model performance through better data quality, feature selection, and hyperparameter tuning. Future directions include combining deep learning techniques with larger datasets for more accurate prediction.

1. Introduction

Cardiovascular diseases, including heart attacks (myocardial infarction), are a major public health issue and are associated with a high global mortality rate. Preventive intervention and better patient outcomes require early detection and accurate heart attack prediction. Until now, clinical tools and tests such as electrocardiograms (ECGs), blood pressure measurements, and cholesterol levels have been used to determine heart attack risk. These methods, however, rely on static diagnostic criteria and do not fully capture the complexity of patient health data or the nonlinear relationships between risk factors. In recent years, machine learning (ML) has become a powerful tool in healthcare, helping to increase diagnostic accuracy and to give early warning of heart attacks. ML models can discover hidden patterns in large sets of patient data, which can yield more accurate and more personalized predictions than conventional methods. These models can incorporate many risk factors, such as age, gender, medical history, lifestyle habits, and laboratory results, so they can assess heart attack risk more dynamically and comprehensively. Specifically, this paper explores the performance of four well-known machine learning models, Naive Bayes, Decision Tree, Random Forest, and KNN, in predicting the risk of a heart attack. Each model handles data in its own way, which may affect its predictive accuracy and its applicability in clinical settings.

2. Literature Review

Machine learning and deep learning have been applied to clinical data to predict heart attacks by identifying patterns and supporting early diagnosis. Many researchers have recently tried different models for heart attack prediction, exploiting different aspects of patient health data, and have discussed their efficiency and limits. Ref. [1] used logistic regression, Decision Tree, and Random Forest for predicting heart disease. That work showed that Decision Trees reached an accuracy of 75%, and that logistic regression, although interpretable, underperformed less interpretable ensemble methods such as Random Forests, which reached an accuracy of 82%. The authors also observed that Decision Trees tended to overfit, especially on smaller datasets, which reduced generalizability. Random Forests required more computational resources but handled nonlinear interactions among features better. Ref. [2] used KNN for heart attack prediction on a publicly available heart disease dataset and reported 72% accuracy. The authors attributed KNN’s performance to tuning the ‘k’ parameter and choosing an appropriate distance metric. However, as the dimensionality of the dataset increased, performance degraded and execution time and computational cost grew. Ref. [3] applied Naive Bayes and KNN to the Cleveland Heart Disease dataset and reported accuracies of 70% for Naive Bayes and 71.9% for KNN. Naive Bayes is computationally efficient and easy to implement, but it performs poorly when its assumption of feature independence is violated, which is common in medical datasets, where variables such as cholesterol and blood pressure are correlated. The authors further suggested that combining feature selection techniques with Naive Bayes reduces the impact of correlated features.
Ref. [4] used a Convolutional Neural Network (CNN) to predict heart attacks from ECG data and achieved an accuracy of 85%. CNNs were effective at automatically and accurately extracting important features from ECG signals, which are themselves medical time series [5]. However, as in many healthcare studies, a limiting factor was the requirement for a large labeled dataset, which is often difficult to acquire because of privacy issues and data heterogeneity. Ref. [5] predicted heart attacks with a Support Vector Machine (SVM) and with neural networks, achieving accuracies of 79% and 85%, respectively. The SVM model worked well on high-dimensional data and outperformed many traditional models, but clinicians could not readily interpret what its predictions were based on. Neural networks could learn complex patterns in the data accurately, but they demanded substantial computational and data resources to avoid overfitting. Deep learning models reduce the need for conventional feature engineering and manual feature tuning, but only when large amounts of training data are available. The performance of these models can therefore suffer in resource-limited settings, where larger samples and computational infrastructure are unavailable. In [6], the authors used recurrent neural network (RNN) and long short-term memory (LSTM) models to predict cardiovascular events from time series data. These models, which capture temporal dependencies in patient health records, achieved an accuracy of 88%. The ability of LSTM networks to retain long-term dependencies makes them well suited to analyzing sequences of events in patient data, such as blood pressure or heart rate measured over time.
However, these models remain computationally expensive and reliant on large datasets, which makes them less practical in smaller clinical settings or when data collection is constrained. More broadly, several challenges arise when using machine learning and deep learning models to predict heart attacks. One of the biggest limitations is the size of the available datasets. For example, many studies have used the Cleveland Heart Disease dataset, which consists of just 303 samples [6]. With small datasets, models often overfit the training data and fail to generalize to unseen data. In addition, class imbalance (far fewer heart attack patients than patients without heart attacks) can bias models towards predicting the most common class and reduce their sensitivity to heart attacks. A second limitation common to many studies [7,8,9] is a lack of model interpretability. Complex models such as neural networks and support vector machines can often be more accurate, but because they are “black box” in nature, clinicians cannot readily understand why decisions are being made.

3. Methodology

The proposed methodology is designed to provide a structured framework for predicting heart disease using machine learning techniques. As illustrated in Figure 1, the framework consists of several sequential stages, including data collection, preprocessing, feature selection, model training, and evaluation. Each step is carefully organized to ensure data integrity, optimal model performance, and reliable results.

3.1. Data Overview

The dataset used in this study for heart attack prediction consists of patients’ clinical data, with 14 features and 303 instances. It is sourced from Kaggle. The features include critical risk factors such as age, cholesterol level, blood pressure, and lifestyle factors, which often form the main predictors of heart disease. The target variable is binary, indicating the presence or absence of a heart attack. There are no missing values in the dataset, which simplifies the preprocessing step.

3.2. Data Preprocessing

Before the machine learning models were applied, the dataset was cleaned, standardized, and prepared for analysis.
Handling Imbalanced Data: The dataset has no missing values; however, we examined it for class imbalance in the target variable, which indicates the presence or absence of a heart attack. Class imbalance skews model performance towards the majority class and leaves the minority class poorly predicted. The dataset exhibited a moderately skewed distribution, with heart attack instances less frequent than instances without a heart attack. We considered addressing this with a sampling technique such as the Synthetic Minority Oversampling Technique (SMOTE). However, because the dataset is small, and because the models trained on the imbalanced dataset were not strongly biased towards either positive or negative instances, SMOTE was not applied.
Feature Selection: Feature selection identifies the features most relevant to the problem at hand. Correlation analysis was used to determine the features most strongly associated with the target variable (presence of heart attack). This approach was chosen for its simplicity and because it is generally easier to interpret on small datasets than methods such as Recursive Feature Elimination (RFE) or mutual information. Selecting the most strongly associated features reduced dimensionality while adding little computational time. The selected features included:
  • Age;
  • Chest Pain Type (cp);
  • Maximum Heart Rate (thalach);
  • ST Depression (oldpeak);
  • Number of Major Vessels (ca).
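
The correlation-based ranking described above can be sketched in a few lines. This is a minimal illustration, assuming Pearson correlation against the binary target; the paper does not state which correlation coefficient or threshold was used, and the helper names (`pearson`, `rank_features`), the `noise` column, and all values below are made up for illustration rather than taken from the study's data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def rank_features(columns, target):
    """Return (name, |r|) pairs sorted by absolute correlation with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy values for illustration only (assumption: NOT the study's data).
columns = {
    "age":     [63, 37, 41, 56, 57, 44, 52, 54],
    "thalach": [150, 187, 172, 178, 163, 120, 110, 115],
    "oldpeak": [2.3, 3.5, 1.4, 0.8, 0.6, 0.1, 0.0, 0.2],
    "noise":   [1, 2, 1, 2, 1, 2, 1, 2],
}
target = [1, 1, 1, 1, 1, 0, 0, 0]

ranking = rank_features(columns, target)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Features at the top of such a ranking would be kept and the rest dropped; where to cut the list is a judgment call that the sketch leaves open.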

3.3. Model Selection

Four machine learning algorithms were used in this study: KNN, Random Forest, Decision Tree, and Naive Bayes, all of which are widely used for classification tasks and for modeling clinical data. Each model has its merits and limitations, which are explained below.

3.3.1. K-Nearest Neighbors (KNN)

KNN was chosen because of its ease of use and its potential to capture complex patterns between the feature variables and the target. It is a non-parametric, instance-based learning algorithm that classifies a data point according to the majority class of its nearest neighbors. The number of neighbors, k, was tuned using grid search and set to five for this experiment, and the distance between data points was measured using Euclidean distance. One reason KNN can have lower accuracy, especially on high-dimensional data, is that it is sensitive to irrelevant features. The accuracy obtained by the model was modest, at about 53%. This suggests that further improvement, such as additional tuning or reduced model complexity, may be necessary in future work.
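
To make the classification rule concrete, the following is a minimal from-scratch sketch of KNN with Euclidean distance and k = 5, the settings reported above. It is not the authors' implementation; the function names and the two-dimensional toy points are assumptions chosen for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training points by distance to the query and keep the k closest.
    neighbors = sorted(
        zip(train_X, train_y),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy, well-separated points (assumption: illustrative values only).
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.3),
           (3.0, 3.0), (3.2, 2.9), (2.8, 3.1), (3.1, 3.3)]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]

pred_low = knn_predict(train_X, train_y, (1.0, 0.9), k=5)
pred_high = knn_predict(train_X, train_y, (3.0, 3.2), k=5)
print(pred_low, pred_high)
```

Because every feature contributes equally to the distance, an irrelevant or unscaled feature can dominate the neighbor ranking, which is the sensitivity noted above.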

3.3.2. Random Forest

This model was selected for two reasons: Random Forest is an ensemble learning method that is robust to overfitting, and it works with both small and large datasets. It is formed from many Decision Trees, each built from a random sample of the data, and the final prediction is made by aggregating the predictions of the trees. It also copes well with missing values and with nonlinear interactions between features. Hyperparameter tuning was applied to optimize key parameters: n_estimators = 150 (the number of trees), max_depth = 10, and min_samples_split = 4. The model performed relatively well in this study, attaining an accuracy of 65.08%.
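
The two ingredients described above, trees built on random samples and a prediction aggregated over the trees, can be illustrated with a deliberately simplified sketch. It uses one-feature threshold rules ("stumps") in place of full Decision Trees and omits per-split feature subsampling, so it is a bagging illustration under stated assumptions rather than a complete Random Forest; all names and data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a threshold rule: predict right_label if x > threshold, else left_label."""
    majority = Counter(ys).most_common(1)[0][0]
    best = (len(ys) + 1, None, majority, majority)  # (errors, threshold, left, right)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_label, right_label = stump
    if threshold is None:  # the sample had a single distinct value
        return left_label
    return right_label if x > threshold else left_label

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]

# Toy one-dimensional data (assumption: illustrative values only).
xs = [0, 1, 2, 3, 4, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
forest = fit_forest(xs, ys)
print(predict_forest(forest, -1), predict_forest(forest, 11))
```

In this sketch `n_trees` plays the role of `n_estimators`; `max_depth` and `min_samples_split` have no counterpart because a stump is a tree of depth one.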

3.3.3. Decision Tree

The Decision Tree model was chosen for its interpretability and usability. It partitions the dataset according to attribute values, producing a tree-like structure. Despite their interpretability, Decision Trees often overfit, especially when the data sample is small. To reduce overfitting, the maximum depth of the tree was set to eight, a value obtained from grid search. With these measures, the model performed almost as well as the Random Forest, with a test accuracy of 65.03%.

3.3.4. Naïve Bayes

Naive Bayes, a probabilistic classifier based on Bayes’ Theorem, was chosen because it is fast to compute. It assumes that all features are conditionally independent given the class, which is not the case for most clinical datasets, as many variables are correlated (for example, cholesterol and blood pressure). Nonetheless, because it is a simple model, it required little time for training and evaluation. The variance smoothing parameter was tuned, mainly to minimize overfitting. Naive Bayes scored slightly higher than KNN but achieved only 53.40%, likely because the independence assumption does not hold well for this dataset.
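
The independence assumption and the role of variance smoothing are easiest to see in code. The sketch below assumes a Gaussian likelihood per feature and a smoothing term added directly to each variance; the paper does not specify the variant or implementation used, so the function names, the `smoothing` value, and the toy cholesterol and blood pressure readings are all assumptions.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def fit_gaussian_nb(X, y, smoothing=1e-9):
    """Estimate a prior plus per-feature mean and variance for each class."""
    model = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        columns = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(X),
            "means": [mean(col) for col in columns],
            # Smoothing keeps variances strictly positive (assumed form).
            "vars": [variance(col) + smoothing for col in columns],
        }
    return model

def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict_gaussian_nb(model, row):
    scores = {}
    for label, params in model.items():
        # Independence assumption: per-feature log-likelihoods are simply summed,
        # so any correlation between features is ignored.
        score = math.log(params["prior"])
        for x, mu, var in zip(row, params["means"], params["vars"]):
            score += log_gaussian(x, mu, var)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy (cholesterol, systolic pressure) pairs; illustrative values only.
X = [(200, 120), (210, 125), (190, 118), (260, 150), (270, 155), (250, 148)]
y = [0, 0, 0, 1, 1, 1]

nb_model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(nb_model, (205, 122)), predict_gaussian_nb(nb_model, (265, 152)))
```

Note that the two toy features rise and fall together; the model still treats them as separate pieces of evidence, which is the double counting that correlated clinical variables can cause.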

3.4. Model Evaluation

The main performance measure applied to each model was accuracy, the proportion of all instances that are correctly classified (true positives plus true negatives, divided by the total number of instances). Accuracy measures the overall correctness of a model's predictions and is most informative when the classes are of approximately the same size. Table 1 presents a comparison of heart disease prediction models from the existing literature, covering traditional and deep learning approaches. Table 2 shows the accuracy of the machine learning models applied in this study.
The models considered in this work are KNN, Random Forest, Decision Tree, and Naive Bayes. Each model was evaluated on a 70:30 train-test partition, in which 70% of the data was used for training and 30% for testing.
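
As a rough illustration of this evaluation protocol, the sketch below shuffles the instance indices, holds out 30% for testing, and computes accuracy as the fraction of matching labels. The shuffling, the seed, and the rounding rule are assumptions, since the paper does not describe how the partition was drawn.

```python
import random

def train_test_split(n_samples, test_fraction=0.3, seed=42):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# With 303 instances, this rounding rule holds out 91 and trains on 212.
train_idx, test_idx = train_test_split(303)
print(len(train_idx), len(test_idx))

# Three of four toy predictions are correct.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

Accuracy alone hides which class the errors fall on, so under class imbalance it is usually read alongside per-class measures such as sensitivity.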

3.5. Proposed Workflow

The workflow for the heart attack prediction system is summarized as follows and also shown in Figure 2.
  • Data Collection: We collect 14 features related to cardiovascular health from the Kaggle heart attack data.
  • Data Preprocessing: During preprocessing, we handle the categorical variables, normalize the continuous features, and find the most relevant features using correlation analysis.
  • Model Training: The preprocessed dataset is trained on four machine learning models (KNN, Random Forest, Decision Tree, and Naive Bayes).
  • Model Evaluation: We evaluate the accuracy of the models. To ensure robustness and to counteract overfitting, we use cross-validation.
  • Comparison of Results: Performance of the models is compared, and the best models are determined.
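
The cross-validation step mentioned in the workflow can be sketched as follows. The number of folds is not given in the paper, so five folds are assumed here, and the `majority_baseline` scorer is a hypothetical stand-in for fitting and scoring any of the four models.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, validation_indices) for each contiguous fold."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def cross_validate(n_samples, n_folds, fit_and_score):
    """Average the score returned by fit_and_score over all folds."""
    scores = [fit_and_score(train, val)
              for train, val in k_fold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)

# Toy labels and a hypothetical baseline that always predicts the
# majority class of the training fold.
labels = [0, 1] * 10

def majority_baseline(train_idx, val_idx):
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in val_idx) / len(val_idx)

mean_score = cross_validate(len(labels), 5, majority_baseline)
print(mean_score)
```

The folds here are contiguous and unshuffled for readability; in practice the indices would normally be shuffled, and often stratified by class, before splitting.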

4. Result Analysis

The models fell into two groups: Random Forest and Decision Tree reached accuracies of about 65% (65.08% and 65.03%, respectively), while Naive Bayes and KNN reached about 53% (53.40% and 53.00%, respectively). The Decision Tree model performed slightly below Random Forest. A known problem with Decision Trees is that they overfit the training data on small datasets, which may explain this gap [6,7]. However, the simplicity and interpretability of Decision Trees allow them to be used to understand the major features in heart attack prediction. Naive Bayes is a fast and efficient algorithm that is often used for classification tasks. However, it assumes that all features are conditionally independent of each other, whereas in medical datasets pivotal features such as cholesterol and blood pressure are often correlated [8]. When this independence assumption is violated, the model may perform poorly because it cannot capture the interrelationships among clinical features. This likely contributed to the lower accuracy of Naive Bayes relative to the tree-based models in this study, despite its low computational cost. KNN is in principle more flexible in modeling nonlinear relationships among features, since it classifies data points by the majority class of their nearest neighbors. However, the performance of KNN is highly dependent on two factors: the hyperparameter ‘k’ (number of neighbors) and the distance metric. In this study, KNN with k = 5 and the Euclidean distance metric performed better than the other settings tried, but it may be improved with further optimization of these parameters [9]. Tuning ‘k’ is very important: too small a value makes the model vulnerable to noise, and too large a value leads to over-smoothing, where the model can no longer exploit the local structure of the data.
Proper tuning of these parameters is therefore necessary for good KNN performance. This study has several drawbacks. First, the dataset contains only 303 instances, which limits the ability of the models to generalize. With a larger dataset of patient records, the models would be expected to be more reliable and to predict outcomes more accurately. Furthermore, the Naive Bayes model assumes that features are independent, whereas in medical datasets features such as cholesterol and blood pressure are interdependent. Finally, although KNN and Decision Tree models can in principle achieve reasonable accuracy, they may not fully capture the complexity and nonlinearity of the underlying data. Further gains in accuracy and insight might come from deep learning methods or from further-tuned ensembles such as Random Forests.

5. Conclusions

In this study, we assessed four machine learning algorithms for heart attack prediction: K-Nearest Neighbors (KNN), Random Forest, Decision Tree, and Naive Bayes. Random Forest and Decision Tree performed best, at about 65%, while Naive Bayes and KNN reached about 53%. Although the models gave baseline performance for heart attack prediction, each has drawbacks, such as overfitting in Decision Trees and the independence assumption in Naive Bayes, which tend to limit accuracy on medical datasets. Nonetheless, the results indicate that heart attack prediction with machine learning models is feasible. There is an opportunity to increase their predictive value, keeping in mind the current issues regarding the accuracy and reliability of these approaches. We expect that more advanced methods, combined with larger and more heterogeneous data, would improve model performance and clinical utility. Hence, future studies should investigate more advanced techniques to improve the precision and dependability of heart attack prediction models. More specifically, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) could increase predictive strength while addressing the temporal structure of ECG signals. More patients should be included in the dataset to reduce reliance on fine-tuning and to improve generalization across wider patient populations. Another direction is the family of ensemble methods that combine the strengths of multiple models, for example, XGBoost. These approaches might help develop better and more accurate heart attack risk assessment models.

Author Contributions

M.H. (Makhdoma Haider) contributed to conceptualization, methodology, and writing—original draft. M.H. (Makhdoma Haider) and M.H. (Manzoor Hussain) carried out validation and supervision. G.P.I. was responsible for investigation, resources, and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the first author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rajdhan, A.; Agarwal, A.; Sai, M.; Ravi, D.; Ghuli, D.P. Heart disease prediction using machine learning. Int. J. Eng. Res. Technol. 2020, 9, 659–662.
  2. Khateeb, N.; Usman, M. Efficient heart disease prediction system using K-nearest neighbor classification technique. In Proceedings of the international conference on big data and internet of thing, London, UK, 20–22 December 2017; pp. 21–26.
  3. Fatima, M.; Pasha, M. Survey of Machine Learning Algorithms for Disease Diagnostic. J. Intell. Learn. Syst. Appl. 2017, 9, 1–16.
  4. Shen, D.; Wu, G.; Suk, H.-I. Deep Learning in Medical Image Analysis. Annu. Rev. Biomed. Eng. 2017, 19, 221–248.
  5. Mondal, A.; Mondal, B.; Chakraborty, A.; Kar, A.; Biswas, A.; Majumder, A.B. Heart disease prediction using support vector machine and artificial neural network. Artif. Intell. Appl. 2023, 2, 45–51.
  6. Rajkomar, A.; Oren, E.; Chen, K.; Dai, A.M.; Hajaj, N.; Hardt, M.; Liu, P.J.; Liu, X.; Marcus, J.; Sun, M.; et al. Scalable and Accurate Deep Learning with Electronic Health Records. npj Digit. Med. 2018, 1, 18.
  7. Diwaker, C.; Tomar, P.; Solanki, A.; Nayyar, A.; Jhanjhi, N.Z.; Abdullah, A.; Supramaniam, M. A New Model for Predicting Component-Based Software Reliability Using Soft Computing. IEEE Access 2019, 7, 147191–147203.
  8. Kok, S.H.; Abdullah, A.; Jhanjhi, N.Z.; Supramaniam, M. A Review of Intrusion Detection System Using Machine Learning Approach. Int. J. Eng. Res. Technol. 2019, 12, 8–15.
  9. Airehrour, D.; Gutierrez, J.; Kumar Ray, S. GradeTrust: A Secure Trust Based Routing Protocol for MANETs. In Proceedings of the 25th International Telecommunication Networks and Applications Conference (ITNAC), Sydney, Australia, 18–20 November 2015; pp. 65–70.
Figure 1. Framework.
Figure 2. Methodology.
Table 1. Comparison of Heart Disease Prediction Models from Literature.
| Authors | Models | Accuracy |
| --- | --- | --- |
| Kumar [1] | Logistic Regression, Decision Tree, Random Forest | 75% (Decision Tree), 82% (Random Forest) |
| Gupta [2] | K-Nearest Neighbors (KNN) | 72% |
| Fatima [3] | Naive Bayes, K-Nearest Neighbors | 70% (Naive Bayes), 72% (KNN) |
| Shen [4] | Convolutional Neural Networks (CNN) | 85% |
| Rajkomar [6] | Recurrent Neural Networks (RNN), LSTM | 88% (LSTM) |
Table 2. Accuracy of Various Machine Learning Models.
| Model | Accuracy |
| --- | --- |
| k-NN | 53.00% |
| Random Forest | 65.08% |
| Decision Tree | 65.03% |
| Naïve Bayes | 53.40% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
