# Enhanced Heart Disease Prediction Based on Machine Learning and χ^{2} Statistical Optimal Feature Selection Model


## Abstract

A χ^{2} statistical optimum feature selection technique was used. The suggested model’s performance was then validated by comparing it with traditional models across several performance measures. The proposed model increased accuracy from 85.29% to 89.7%. Additionally, the computational load was reduced by half. These results indicate that our system outperformed other state-of-the-art methods in predicting heart disease.

## 1. Introduction

- To develop a new heart disease (HD) classification model based on the ML (SVM) algorithm to improve the detection of heart disease;
- To implement an optimal feature selection model using the χ^{2} statistical method for the extraction of the most informative attributes to improve prediction accuracy;
- To validate the proposed heart disease diagnosis model’s accuracy by comparing it with traditional models through the analysis of several performance metrics.

## 2. Methodology

#### 2.1. Support Vector Machine (SVM) Model

The χ^{2} statistical model selected the same feature set for the training and testing data. Next, the training data with the reduced feature set were fed into the SVM model for training. Finally, the trained SVM model was evaluated on the testing data. The proposed model utilized 14 features from the University of California Irvine (UCI) Heart Disease Repository’s Statlog and Cleveland datasets [17,18,19,20]. These features were examined using approaches that successfully predict heart disease. The heart disease prediction system was developed and evaluated using Python and the scikit-learn library [21].
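The workflow described above can be sketched roughly as follows with scikit-learn. This is a minimal illustration rather than the authors' code: the synthetic data stands in for the UCI datasets, and the min-max scaling, RBF kernel, and the function name `train_chi2_svm` are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC


def train_chi2_svm(X, y, k=6, test_size=0.25, seed=0):
    """Split, select k features with chi-square on the training data,
    apply the same selection to the test data, then train and score an SVM."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # chi2 requires non-negative inputs; min-max scaling is an assumed choice.
    scaler = MinMaxScaler().fit(X_train)
    X_train_s = scaler.transform(X_train)
    X_test_s = scaler.transform(X_test)

    # Fit the selector on training data only, reuse it for the test data.
    selector = SelectKBest(chi2, k=k).fit(X_train_s, y_train)
    X_train_sel = selector.transform(X_train_s)
    X_test_sel = selector.transform(X_test_s)

    model = SVC(kernel="rbf").fit(X_train_sel, y_train)  # kernel is assumed
    return model, selector, model.score(X_test_sel, y_test)


# Synthetic stand-in for a 13-feature dataset (not the UCI data).
rng = np.random.default_rng(0)
X_demo = rng.random((200, 13))
y_demo = (X_demo[:, 0] + X_demo[:, 3] > 1.0).astype(int)

model, selector, test_acc = train_chi2_svm(X_demo, y_demo)
print("selected feature indices:", np.flatnonzero(selector.get_support()))
print(f"test accuracy: {test_acc:.3f}")
```

Fitting the selector on the training split and only transforming the test split is what guarantees that both splits end up with the same feature set.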

#### 2.1.1. Heart Disease Dataset Features

#### 2.1.2. Data Pre-Processing

χ^{2} statistical model to overcome this issue.

#### 2.2. Enhanced SVM Model with Feature Selection

chi-square (χ^{2}) statistical model [30,31] to select the essential features before applying the SVM model.

The chi-square (χ^{2}) test is a correlation-based feature selection method that determines the correlation between the features and the predicted class. A chi-square statistic is computed for each non-negative feature (X_{i}) to determine which features depend on the predicted attribute. The higher the chi-square score, the more dependent the feature is on the predicted class [24]. First, the 13 commonly used features are ranked according to their χ^{2} test score. The χ^{2} test ranks features for a binary classification problem as follows: assume there are (t) instances and two classes, positive and negative. To determine the χ^{2} test score, we construct Table 2.

In Table 2, (m) represents the number of instances that include the feature (X_{i}), (t − m) represents the number of instances that do not include the feature (X_{i}), (p) represents the number of positive instances, and (t − p) represents the number of instances that are not positive.

The χ^{2} test examines the difference between the expected count (E) and the observed count (O). The observed counts (O) are the observed data (α, b, λ, and y), and each expected count (E) is calculated from the row total, column total, and overall total. If the feature and the class are independent, the observed count and the expected count are close. The α, b, λ, and y represent the observed values, and E_{α}, E_{b}, E_{λ}, and E_{y} represent the expected values. Assuming that the two occurrences are unrelated, the expected value (E_{α}) is calculated using Equation (3). E_{b}, E_{λ}, and E_{y} are calculated similarly. Finally, based on the general χ^{2} test form shown in Equation (4), we calculate the χ^{2} score as shown in Equation (5) [27].
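As a rough illustration of this calculation, the sketch below computes the score for the contingency table of Table 2 using the standard Pearson chi-square forms, which are assumed here to correspond to Equations (3) to (5): each expected count is (row total × column total) / t, and the score sums (O − E)² / E over the four cells. The function names and the example counts are hypothetical.

```python
def chi2_score(alpha, b, lam, y):
    """Pearson chi-square score for the 2x2 table of Table 2.

    alpha, b : feature occurs, positive / negative class
    lam, y   : feature does not occur, positive / negative class
    """
    t = alpha + b + lam + y   # all instances
    m = alpha + b             # instances where the feature occurs
    p = alpha + lam           # positive instances

    observed = {"alpha": alpha, "b": b, "lam": lam, "y": y}
    # Expected count = row total * column total / overall total.
    expected = {
        "alpha": m * p / t,
        "b": m * (t - p) / t,
        "lam": (t - m) * p / t,
        "y": (t - m) * (t - p) / t,
    }
    return sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)


def chi2_score_closed_form(alpha, b, lam, y):
    """Equivalent closed form for a 2x2 table."""
    t = alpha + b + lam + y
    numerator = t * (alpha * y - b * lam) ** 2
    denominator = (alpha + b) * (lam + y) * (alpha + lam) * (b + y)
    return numerator / denominator


# Hypothetical counts: the feature occurs far more often in the positive class.
dependent = chi2_score(30, 10, 20, 40)
# Hypothetical counts with identical class proportions in both rows.
independent = chi2_score(10, 10, 20, 20)
print(round(dependent, 4), independent)
```

A feature whose occurrence is unrelated to the class yields observed counts equal to the expected counts and therefore a score of zero, which is why higher scores indicate stronger dependence.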

χ^{2} score. Initially, we used a subset of size n = 1, i.e., the feature with the highest χ^{2} score. This subset was applied to the SVM model, and the performance results were recorded as we experimented with various hyperparameters. Next, we selected the two highest-scoring attributes (n = 2), applied this subset to the SVM model, and saved the results. We iterated this process until we obtained the optimum subset of ranked features (n = 6), which gave the best performance.
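The incremental search over subset sizes can be sketched as a loop over n. This is an illustrative approximation only: it scores each subset with 5-fold cross-validation, uses default SVM hyperparameters, and runs on synthetic data, none of which is stated in the text; `best_subset_size` is a hypothetical helper name.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC


def best_subset_size(X, y, cv=5):
    """Evaluate the top-n chi-square features for n = 1..n_features
    and return the n with the highest mean cross-validated accuracy."""
    scores = {}
    for n in range(1, X.shape[1] + 1):
        # Scaling and selection live inside the pipeline so that each
        # fold ranks features using its own training portion only.
        pipe = make_pipeline(MinMaxScaler(), SelectKBest(chi2, k=n), SVC())
        scores[n] = cross_val_score(pipe, X, y, cv=cv).mean()
    best_n = max(scores, key=scores.get)
    return best_n, scores


# Synthetic stand-in data with 13 features (not the UCI data).
rng = np.random.default_rng(1)
X_demo = rng.random((150, 13))
y_demo = (X_demo[:, 2] + X_demo[:, 7] > 1.0).astype(int)

best_n, scores = best_subset_size(X_demo, y_demo)
print("best n:", best_n)
```

A hyperparameter search per subset size, as mentioned in the text, could be added by wrapping the pipeline in a grid search; it is left out here to keep the sketch short.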

The χ^{2} statistical method recognized six notable features that can be selected for model training. As shown in Figure 5, for the Cleveland dataset, the algorithm selected the following features: thalach, oldpeak, ca, cp, exang, and chol. As shown in Figure 6, for the Statlog dataset, the algorithm selected the following features: maximum heart rate, number of major vessels, thallium stress result, exercise-induced ST depression, cholesterol, and exercise-induced angina.

## 3. Results and Discussion

χ^{2} statistical feature selection method. The feature selection method was used to select the six most important features for the prediction of heart disease. The χ^{2}-based SVM heart disease prediction model was developed and evaluated using Python and the scikit-learn library [21]. The two datasets (i.e., the Cleveland and Statlog (Heart) datasets) were each partitioned into training and test sets using a 75:25 split ratio. The training data were used to train the model, whereas the testing data were used to evaluate its performance [32]. The following four primary parameters were assessed: true negative (T_{N}), which means that the algorithm prediction output for persons with no heart disease is correct; true positive (T_{P}), which means that the algorithm prediction output for heart disease patients is correct; false positive (F_{P}), which means that patients who have no heart disease are incorrectly classified as having heart disease; and false negative (F_{N}), which refers to patients who actually have heart disease but are incorrectly categorized as healthy [24]. The proposed χ^{2}-based SVM model was evaluated based on the following metrics:

**Accuracy (Acc):** defined as the proportion of correctly classified instances (true positives and true negatives) to the total number of instances, as shown in Equation (6).

**Specificity (Spe):** the percentage of true negatives out of all healthy individuals, calculated by Equation (7). It measures how well the model classifies individuals without the disease.

**Sensitivity (Sen):** the percentage of true positives out of all individuals with the disease, as illustrated in Equation (8). It measures how well the model classifies individuals who have the disease.

**F1-score:** defined as the harmonic mean of the precision and the sensitivity (recall). It can be computed as shown in Equation (9).
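The four metrics can be computed directly from the confusion-matrix counts. The sketch below uses the standard textbook definitions, which are assumed here to match Equations (6) to (9); the function name and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, specificity, sensitivity and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    specificity = tn / (tn + fp)      # true negatives among all healthy
    sensitivity = tp / (tp + fn)      # true positives among all diseased
    precision = tp / (tp + fp)        # true positives among predicted diseased
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "specificity": specificity,
        "sensitivity": sensitivity,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Note that averaged variants (for example, class-weighted sensitivity as reported by some libraries) can differ from these single-class values.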

The χ^{2}-based feature selection method was applied to the normalized dataset to choose the six features that are the most significant for heart disease detection before applying the SVM classifier.

χ^{2} feature selection method.

The χ^{2} feature selection method played a critical role in enhancing the accuracy of the SVM model, while also improving sensitivity and specificity, which shows the model’s ability to correctly identify people with and without heart disease. The proposed χ^{2}-based SVM model improves classification accuracy by 6.25% for the Cleveland dataset and 5.17% for the Statlog dataset (relative to the SVM trained on the full feature set), which is important for providing a correct diagnosis and decreasing the rate of false predictions.

χ^{2}-based SVM diagnostic model and its ability to identify heart disease occurrence. The ROC chart is a 2D graph that relates sensitivity to specificity and is used to evaluate the validity of a diagnostic model. The true positive rate (sensitivity, Y axis) is plotted against the false positive rate (1 − specificity, X axis). The optimal ROC curve lies in the plot’s upper left corner. An ROC chart with a larger AUC is better, which indicates that a diagnostic model can correctly identify people with heart issues [27].

χ^{2} method, but it was 0.91 when using the full set of features. Figure 8b depicts the same comparison for the Statlog dataset. Before using the χ^{2} feature selection method, the SVM model’s AUC was 0.94; after using it, the AUC dropped to 0.91. This suggests that the influence of the χ^{2} feature selection approach was minimal in terms of AUC.
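For readers who want to reproduce this kind of ROC analysis, a minimal sketch with scikit-learn is given below. It uses synthetic data and an RBF-kernel SVM as stand-ins (both are assumptions, not the authors' setup) and ranks test samples by the SVM decision function.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with six features (not the UCI data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
noise = rng.normal(scale=0.5, size=300)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = SVC(kernel="rbf").fit(X_train, y_train)

# Continuous scores are needed for a ROC curve; for an SVM the signed
# distance to the separating surface serves this purpose.
scores = model.decision_function(X_test)

# fpr is the false positive rate (1 - specificity), tpr the sensitivity.
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f"AUC = {auc:.2f}")
```

The `fpr` and `tpr` arrays can be passed to any plotting library to draw the curve; the AUC summarizes it as a single number between 0 and 1.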

χ^{2}-based SVM model in detecting heart disease, another evaluation tool, the confusion matrix, was utilized. In a confusion matrix, the values of the four parameters (T_{P}, T_{N}, F_{P}, and F_{N}) are laid out in tabular form, summarizing the number of correct and incorrect predictions. Figure 9b illustrates the confusion matrix produced by the proposed χ^{2}-based SVM model on the Cleveland dataset. It shows that the proposed model correctly detects 37 patients with heart disease (predicted 1 and actual 1) and identifies 31 out of 35 healthy subjects (predicted 0 and actual 0). Figure 10b presents the resulting confusion matrix for the Statlog dataset. It shows that the proposed model correctly identifies 31 heart disease patients and 30 out of 33 healthy subjects. Both Figure 9 and Figure 10 show that applying the χ^{2} feature selection method reduced the number of incorrect predictions made by the SVM model (actual 1 but predicted 0, and actual 0 but predicted 1) from 12 to 8 for the Cleveland dataset, and from 10 to 7 for the Statlog dataset.
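As a worked check, the counts quoted above can be reconciled with the reported accuracies. The false-negative counts are not stated explicitly in the text, so the sketch infers them as the total number of errors minus the false positives; that inference and the helper name are assumptions.

```python
def accuracy_from_counts(tp, tn, healthy_total, total_errors):
    """Derive accuracy from the counts quoted for Figures 9b and 10b."""
    fp = healthy_total - tn    # healthy subjects predicted as diseased
    fn = total_errors - fp     # inferred, not stated in the text
    total = tp + tn + fp + fn
    return (tp + tn) / total, total


cleveland_acc, cleveland_n = accuracy_from_counts(
    tp=37, tn=31, healthy_total=35, total_errors=8
)
statlog_acc, statlog_n = accuracy_from_counts(
    tp=31, tn=30, healthy_total=33, total_errors=7
)

print(f"Cleveland: {cleveland_n} test records, accuracy {100 * cleveland_acc:.2f}%")
print(f"Statlog:   {statlog_n} test records, accuracy {100 * statlog_acc:.2f}%")
```

Under this reading, the test sets contain 76 and 68 records, which is consistent with a 25% hold-out of 303 and 270 records, and the resulting accuracies agree with the reported 89.47% and 89.70% up to rounding.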

## 4. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Das, S.; Sharma, R.; Gourisaria, M.; Rautaray, S.; Pandey, M. Heart disease detection using core machine learning and deep learning techniques: A comparative study. Int. J. Emerg. Technol. **2020**, 11, 531–538.
2. Hasan, T.T.; Jasim, M.H.; Hashim, I.A. FPGA Design and Hardware Implementation of Heart Disease Diagnosis System Based on NVG-RAM Classifier. In Proceedings of the 2018 Third Scientific Conference of Electrical Engineering (SCEE), Baghdad, Iraq, 19–20 December 2018; pp. 33–38.
3. Rahman, A.U.; Saeed, M.; Mohammed, M.A.; Jaber, M.M.; Garcia-Zapirain, B. A novel fuzzy parameterized fuzzy hypersoft set and riesz summability approach based decision support system for diagnosis of heart diseases. Diagnostics **2022**, 12, 1546.
4. Javid, I.; Khalaf, A.; Ghazali, R. Enhanced accuracy of heart disease prediction using machine learning and recurrent neural networks ensemble majority voting method. Int. J. Adv. Comput. Sci. Appl. **2020**, 11, 540–551.
5. Muhsen, D.K.; Khairi, T.W.A.; Alhamza, N.I.A. Machine Learning System Using Modified Random Forest Algorithm. In Intelligent Systems and Networks, Singapore; Tran, D.-T., Jeon, G., Nguyen, T.D.L., Lu, J., Xuan, T.-D., Eds.; Springer: Singapore, 2021; pp. 508–515.
6. Wah, T.Y.; Mohammed, M.A.; Iqbal, U.; Kadry, S.; Majumdar, A.; Thinnukool, O. Novel DERMA fusion technique for ECG heartbeat classification. Life **2022**, 12, 842.
7. Mohammed, M.A.; Abdulkareem, K.H.; Al-Waisy, A.S.; Mostafa, S.A.; Al-Fahdawi, S.; Dinar, A.M.; Alhakami, W.; Abdullah, B.A.Z.; Al-Mhiqani, M.N.; Alhakami, H.; et al. Benchmarking methodology for selection of optimal COVID-19 diagnostic model based on entropy and TOPSIS methods. IEEE Access **2020**, 8, 99115–99131.
8. Dinar, A.M.; Zain, A.M.; Salehuddin, F. Utilizing of CMOS ISFET sensors in DNA applications detection: A systematic review. J. Adv. Res. Dyn. Control Syst. **2018**, 10, 569–583.
9. Soni, M.; Gomathi, S.; Kumar, P.; Churi, P.P.; Mohammed, M.A.; Salman, A.O. Hybridizing Convolutional Neural Network for Classification of Lung Diseases. Int. J. Swarm Intell. Res. (IJSIR) **2022**, 13, 1–15.
10. Nasser, A.R.; Hasan, A.M.; Humaidi, A.J.; Alkhayyat, A.; Alzubaidi, L.; Fadhel, M.A.; Santamaría, J.; Duan, Y. IoT and Cloud Computing in Health-Care: A New Wearable Device and Cloud-Based Deep Learning Algorithm for Monitoring of Diabetes. Electronics **2021**, 10, 2719. Available online: https://www.mdpi.com/2079-9292/10/21/2719 (accessed on 1 May 2022).
11. Diwakar, M.; Tripathi, A.; Joshi, K.; Memoria, M.; Singh, P. Latest trends on heart disease prediction using machine learning and image fusion. Mater. Today Proc. **2021**, 37, 3213–3218.
12. Rahman, A.U.; Saeed, M.; Mohammed, M.A.; Krishnamoorthy, S.; Kadry, S.; Eid, F. An Integrated Algorithmic MADM Approach for Heart Diseases’ Diagnosis Based on Neutrosophic Hypersoft Set with Possibility Degree-Based Setting. Life **2022**, 12, 729.
13. Hu, G.; Root, M.M. Building prediction models for coronary heart disease by synthesizing multiple longitudinal research findings. Eur. J. Prev. Cardiol. **2005**, 12, 459–464.
14. Deo, R.C. Machine learning in medicine. Circulation **2015**, 132, 1920–1930.
15. Mythili, T.; Mukherji, D.; Padalia, N.; Naidu, A. A heart disease prediction model using SVM-decision trees-logistic regression (SDL). Int. J. Comput. Appl. **2013**, 68, 0975–8887.
16. Elhoseny, M.; Mohammed, M.A.; Mostafa, S.A.; Abdulkareem, K.H.; Maashi, M.S.; Garcia-Zapirain, B.; Mutlag, A.A.; Maashi, M.S. A new multi-agent feature wrapper machine learning approach for heart disease diagnosis. Comput. Mater. Contin. **2021**, 67, 51–71.
17. Detrano, R.; Janosi, A.; Steinbrunn, W.; Pfisterer, M.; Schmid, J.J.; Sandhu, S.; Guppy, K.H.; Lee, S.; Froelicher, V. International application of a new probability algorithm for the diagnosis of coronary artery disease. Am. J. Cardiol. **1989**, 64, 304–310.
18. Gennari, J.H.; Langley, P.; Fisher, D. Models of incremental concept formation. Artif. Intell. **1989**, 40, 11–61.
19. Janosi, A.; Steinbrunn, W.; Pfisterer, M.; Detrano, R. UCI Machine Learning Repository: Heart Disease Dataset [Online]. Available online: https://archive-beta.ics.uci.edu/ml/datasets/heart+disease (accessed on 1 March 2022).
20. Machine Learning Repository: Statlog (Heart) [Online]. Available online: http://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29 (accessed on 1 March 2022).
21. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. **2011**, 12, 2825–2830.
22. Sajja, T.K.; Kalluri, H.K. A Deep Learning Method for Prediction of Cardiovascular Disease Using Convolutional Neural Network. Rev. D’intelligence Artif. **2020**, 34, 601–606.
23. Guo, C.; Zhang, J.; Liu, Y.; Xie, Y.; Han, Z.; Yu, J. Recursion enhanced random forest with an improved linear model (RERF-ILM) for heart disease detection on the internet of medical things platform. IEEE Access **2020**, 8, 59247–59256.
24. Ali, S.A.; Raza, B.; Malik, A.K.; Shahid, A.R.; Faheem, M.; Alquhayz, H.; Kumar, Y.J. An optimally configured and improved deep belief network (OCI-DBN) approach for heart disease prediction based on Ruzzo–Tompa and stacked genetic algorithm. IEEE Access **2020**, 8, 65947–65958.
25. Vijayashree, J.; Parveen Sultana, H. Heart disease classification using hybridized Ruzzo-Tompa memetic based deep trained Neocognitron neural network. Health Technol. **2020**, 10, 207–216.
26. Bharti, R.; Khamparia, A.; Shabaz, M.; Dhiman, G.; Pande, S.; Singh, P. Prediction of Heart Disease Using a Combination of Machine Learning and Deep Learning. Comput. Intell. Neurosci. **2021**, 2021, 8387680.
27. Ali, L.; Rahman, A.; Khan, A.; Zhou, M.; Javeed, A.; Khan, J.A. An Automated Diagnostic System for Heart Disease Prediction Based on χ2 Statistical Model and Optimally Configured Deep Neural Network. IEEE Access **2019**, 7, 34938–34945.
28. Aliyar Vellameeran, F.; Brindha, T. A new variant of deep belief network assisted with optimal feature selection for heart disease diagnosis using IoT wearable medical devices. Comput. Methods Biomech. Biomed. Eng. **2021**, 25, 387–411.
29. Yue, W.; Wang, Z.; Chen, H.; Payne, A.; Liu, X. Machine Learning with Applications in Breast Cancer Diagnosis and Prognosis. Designs **2018**, 2, 13.
30. Ali, L.; Zhu, C.; Zhou, M.; Liu, Y. Early diagnosis of Parkinson’s disease from multiple voice recordings by simultaneous sample and feature selection. Expert Syst. Appl. **2019**, 137, 22–28.
31. Liu, H.; Setiono, R. Chi2: Feature selection and discretization of numeric attributes. In Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence, Herndon, VA, USA, 5–8 November 1995; pp. 388–391.
32. Maldonado, S.; Pérez, J.; Weber, R.; Labbé, M. Feature selection for support vector machines via mixed integer linear programming. Inf. Sci. **2014**, 279, 163–175.
33. Shorewala, V. Early detection of coronary heart disease using ensemble techniques. Inform. Med. Unlocked **2021**, 26, 100655.
34. Ali, F.; El-Sappagh, S.; Islam, S.R.; Kwak, D.; Ali, A.; Imran, M.; Kwak, K.S. A smart healthcare monitoring system for heart disease prediction based on ensemble deep learning and feature fusion. Inf. Fusion **2020**, 63, 208–222.
35. Latha, C.B.C.; Jeeva, S.C. Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques. Inform. Med. Unlocked **2019**, 16, 100203.
36. Haq, A.U.; Li, J.P.; Memon, M.H.; Nazir, S.; Sun, R. A Hybrid Intelligent System Framework for the Prediction of Heart Disease Using Machine Learning Algorithms. Mob. Inf. Syst. **2018**, 2018, 3860146.
37. Vijayashree, J.; Sultana, H.P. A Machine Learning Framework for Feature Selection in Heart Disease Classification Using Improved Particle Swarm Optimization with Support Vector Machine Classifier. Program. Comput. Softw. **2019**, 44, 388–397.
38. Tuli, S.; Basumatary, N.; Gill, S.S.; Kahani, M.; Arya, R.C.; Wander, G.S.; Buyya, R. HealthFog: An ensemble deep learning based Smart Healthcare System for Automatic Diagnosis of Heart Diseases in integrated IoT and fog computing environments. Future Gener. Comput. Syst. **2020**, 104, 187–200.

**Table 1.** Attribute descriptions of the Cleveland dataset [19].

Name | Type | Description |
---|---|---|
Age | Numeric | Age in years |
Sex | Categorical | 0 = female, 1 = male |
Cp | Categorical | Type of chest pain (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic) |
Trestbps | Numeric | Resting blood pressure (mm Hg) |
Chol | Numeric | Serum cholesterol (mg/dL) |
Fbs | Categorical | Fasting blood sugar > 120 mg/dL (0 = false, 1 = true) |
Restecg | Categorical | Resting electrocardiography results (0 = normal, 1 = ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy) |
Thalach | Numeric | Maximum heart rate achieved during thallium stress test |
Exang | Categorical | Exercise-induced angina (1 = yes, 0 = no) |
Oldpeak | Numeric | ST depression induced by exercise relative to rest |
Slope | Categorical | Slope of peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping) |
Ca | Categorical | Number of major vessels colored by fluoroscopy |
Thal | Categorical | Thallium stress test result (3 = normal, 6 = fixed, 7 = reversible defect) |
Num | Categorical | Heart disease status (0 = < 50% diameter narrowing, 1 = > 50% diameter narrowing) |

| | Positive Class | Negative Class | Total |
|---|---|---|---|
| Feature X_{i} occurs | α | b | α + b = m |
| Feature X_{i} does not occur | λ | y | λ + y = t − m |
| Total | α + λ = p | b + y = t − p | t |

Dataset | Total Records | Total Features | Acc (%) | Spe (%) | Sen (%) | F1 (%) |
---|---|---|---|---|---|---|
Cleveland | 303 | 14 | 84.21 | 84.13 | 67.45 | 84.16 |
Statlog | 270 | 14 | 85.29 | 85.36 | 68.29 | 85.29 |

Dataset | Total Records | Total Features | Acc (%) | Spe (%) | Sen (%) | F1 (%) |
---|---|---|---|---|---|---|
Cleveland | 303 | 6 | 89.47 | 89.40 | 89.40 | 89.40 |
Statlog | 270 | 6 | 89.70 | 89.70 | 89.74 | 89.70 |

Study | Feature Selection Method | Dataset | Classifier | Total Features | Acc (%) with All Features | Selected Features | Acc (%) with Selected Features |
---|---|---|---|---|---|---|---|
[33] | Lasso ^{1} | Cleveland | Ensemble | 13 | - | 8 | 75.1 |
[34] | Information Gain | Cleveland | Ensemble | 27 | 72.2 | 16 | 83.5 |
[26] | Lasso | Cleveland | SVM | 13 | 84.09 | - | 84.26 |
[35] | Randomly Generated Feature Set | Cleveland | Ensemble | 13 | - | 9 | 85.48 |
[24] | Ruzzo-Tompa | Cleveland | ANN ^{2} | 13 | - | 7 | 86.20 |
[36] | Lasso | Cleveland | SVM (RBF) | 13 | 86 | 6 | 88 |
[37] | PSO-SVM ^{3} | Cleveland | SVM | 13 | 79.35 | 6 | 88.22 |
[28] | PS-GWO | Statlog and Cleveland | DBN ^{4} | 13 | - | - | 88.8 |
[38] | PCA ^{5} | Cleveland | DL ^{6} | 13 | - | - | 89 |
Proposed | Chi-squared | | SVM | 13 | 85.29 | 6 | 89.47 |

^{1} Lasso: least absolute shrinkage and selection operator; ^{2} ANN: artificial neural network; ^{3} PSO: particle swarm optimization; ^{4} DBN: deep belief network; ^{5} PCA: principal component analysis; ^{6} DL: deep learning.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Sarra, R.R.; Dinar, A.M.; Mohammed, M.A.; Abdulkareem, K.H.
Enhanced Heart Disease Prediction Based on Machine Learning and χ^{2} Statistical Optimal Feature Selection Model. *Designs* **2022**, *6*, 87.
https://doi.org/10.3390/designs6050087
