A Heart Disease Prediction Model Based on Feature Optimization and Smote-Xgboost Algorithm

Abstract: In today’s world, heart disease is the leading cause of death globally. Researchers have proposed various methods aimed at improving the accuracy and efficiency of the clinical diagnosis of heart disease. Auxiliary diagnostic systems based on machine learning are designed to learn and predict the disease status of patients from a large amount of pathological data. Practice has shown that such systems have the potential to save more lives. Therefore, this paper proposes a new framework for predicting heart disease using the Smote-Xgboost algorithm. First, we propose a feature selection method based on information gain, which aims to extract key features from the dataset and prevent model overfitting. Second, we use the Smote-Enn algorithm to process unbalanced data and obtain sample data with roughly the same number of positive and negative samples. Finally, we test the prediction performance of the Xgboost algorithm and five other baseline algorithms on the sample data. The results show that our proposed method achieves the best performance on the five indicators of accuracy, precision, recall, F1-score and AUC, and that the proposed framework has significant advantages in heart disease prediction.


Introduction
Cardiovascular disease, most of which is heart disease, causes over one-third of all annual deaths globally [1]. Many researchers have developed heart disease diagnostic systems that use machine learning to extract useful information from existing medical databases in an effort to change this situation. Such diagnostic tools can assist clinical decisions, speed up the diagnosis process, and uncover disease-related knowledge that will save more lives.
When diagnosing a disease, a great deal of information about the patient's pathology is available, which is expressed in the dataset as a large number of features [2]. Each feature has its own effect on the diagnosis result, and often only a few major features determine the diagnosis of the presence or absence of disease. Applying a feature selection method before training the model helps to select these key features, leading to good prediction results in a shorter time. Most disease datasets also have an imbalanced distribution, with more samples in the negative category and fewer in the positive category. Data processing techniques can rebalance the distribution of the dataset and improve the validity of the model.
Machine learning algorithms have a significant advantage in dealing with problems with complex and nonlinear features. A number of disease classification and prediction challenges, including early warning for electrocardiogram (ECG) detection [3] and prediction related to congenital heart disease [4], have been successfully addressed by algorithms such as logistic regression (LR), support vector machine (SVM), and k-nearest neighbor (KNN).
Ensemble learning [5] is the basis of many machine learning algorithms. The fundamental idea is to combine several weak classifiers to produce a model with superior overall performance. Bagging and boosting are the two primary combination techniques: bagging trains its base classifiers independently and aggregates them, which mainly reduces variance (overfitting), whereas boosting trains weak classifiers sequentially, which mainly reduces bias (underfitting). Xgboost is an efficient implementation of ensemble learning; its main idea is boosting, with regularization terms introduced in the objective function to prevent overfitting.
We propose a heart disease prediction model based on the Smote-Xgboost algorithm. The model was trained using real pathological data from cardiac patients. The prediction target is major adverse cardiovascular and cerebrovascular events (MACCE), whose occurrence is a key indicator for evaluating the success of coronary heart disease surgery. In summary, our main contributions are as follows.

• To extract the crucial features from the dataset, an information-gain-based feature selection method is used.
• A technique that combines undersampling and oversampling is used to handle the imbalanced data in the selected dataset.
• Using the preprocessed dataset, we validate the efficacy of Xgboost and compare it with five baseline methods using metrics derived from the confusion matrix.

The remainder of the paper is structured as follows. Section 2 summarizes recent research on heart disease prediction. Section 3 briefly describes the dataset and introduces the algorithms applied in the framework. Section 4 gives a statistical description of the dataset and compares and evaluates the experimental models. Section 5 provides a conclusion and outlook.

Related Work
The prediction of heart disease has been one of the major applications of machine learning in recent years and has had some success. Some scholars have concentrated on innovating data processing techniques such as feature selection, while others have focused on the prediction algorithms.
Modepalli et al. [6] used a hybrid model (DT + RF) to predict the occurrence (or non-occurrence) of heart disease. They chose the UCI dataset to validate the reliability of the hybrid model, comparing its predictions with those of each single algorithm in the hybrid. The hybrid model had a clear advantage over the single algorithms in accuracy, with an improvement of 7% to 9%.
Joo et al. [7] trained their models on a cardiovascular disease dataset with the same features but different years of return visits. The authors selected 25 features from the dataset by combining health examination results and questionnaire responses, and used four machine learning models to predict the 2-year and 10-year cardiovascular disease risk. In particular, they found that the accuracy of each model improved somewhat when physician medication information was taken into account during feature selection, and that medication information had a strong effect on short-term prediction.
Li et al. [8] proposed fast conditional mutual information (FCMIM), a feature selection approach based on conditional mutual information. They applied four common feature selection algorithms and FCMIM to the Cleveland dataset and used six machine learning algorithms to train the model. The results supported the use of this novel feature selection method, with the highest accuracy of 92.37% obtained by the combination of FCMIM and SVM.
Ali et al. [9] used a feature fusion technique to process low-dimensional data extracted from medical records and sensor data. They then employed a feature selection strategy based on information gain and feature ranking to obtain the dataset, and achieved a prediction accuracy of 98.5% by applying an ensemble deep learning algorithm.
Rahim et al. [10] applied an oversampling technique to balance the data, filled in missing values with the mean, and used a feature importance method for feature selection. They selected three datasets (including the Framingham dataset and the Cleveland dataset). After preprocessing each of the three datasets, they compared the predictive effectiveness of a new ensemble model (KNN and LR) with and without feature selection. The results validated the advantages of the new ensemble model, whose accuracy with feature selection was as high as 99.1%.
Ishaq et al. [11] used random forest feature importance to rank the features and select those with higher scores, and employed the SMOTE technique to balance the data. They compared the prediction performance of nine commonly used algorithms on SMOTE-balanced data and on untreated imbalanced data, and found that the prediction accuracy of each model improved significantly on the balanced data.
Khurana et al. [12] applied five feature selection techniques and found that SVM outperformed all other machine learning algorithms on the Cleveland dataset. The prediction accuracy of each machine learning algorithm improved to a different extent after feature selection. Combined with SVM, both Chi-Square and information gain reached an accuracy of 83.41%.
Ashri et al. [13] applied a genetic-algorithm-based feature selection method, the Simple Genetic Algorithm (SGA), and trained models on the UCI dataset. The two algorithms with the highest accuracy were selected to build a hybrid ensemble learning model based on decision trees and random forests, whose accuracy reached 98.18%.
Bashir et al. [14] proposed a new ensemble learning approach based on combinatorial voting. Four datasets were selected from the UCI database to validate six machine learning algorithms and five ensemble models that combine these six algorithms. They found that the accuracy of the ensemble models was generally higher than that of the individual algorithms, with the average accuracy of the five ensemble models reaching 83%. The proposed combination can be extended to bagging and boosting to further improve accuracy.
In conclusion, data preprocessing, such as data standardization and feature selection, can effectively raise the value of the dataset and greatly enhance the accuracy of a model. Additionally, ensemble learning models perform well in heart disease prediction. The main point of this study is to apply the ensemble learning algorithm Xgboost to a heart disease dataset after performing feature selection and imbalance processing. Finally, by comparing Xgboost with other standard algorithms, we confirm the effectiveness and accuracy of the proposed framework in predicting heart disease.

Method
Figure 1 shows the heart disease prediction framework proposed in this paper.

Dataset
This paper uses the return visit data of real patients in a hospital as the research sample. We named this the Heart Disease Dataset (HDD). The dataset has a total of 4232 samples and 37 features, both numeric and categorical. The prediction target is major adverse cardiovascular and cerebrovascular events (MACCE), where zero indicates no occurrence and one indicates occurrence.

Data Preprocessing
Data preprocessing is a vital stage before training, since the quality of the data directly affects the predictions made by the model. We handle missing values as follows. For categorical variables, we create a new class to represent the null values. For numeric variables, we discard as invalid the feature columns whose missing-value rate exceeds 70%, and fill the missing values in the remaining columns with the column mean. We also normalize the data using min-max normalization to put all features on a common scale. The formula is as follows.

$$H = \frac{H_o - H_{min}}{H_{max} - H_{min}} \left( NH_{max} - NH_{min} \right) + NH_{min} \tag{1}$$

where $H$ refers to the normalized value, $H_o$ refers to the original value, $H_{min}$ and $H_{max}$ refer to the minimum and maximum of the original values, and $NH_{max}$ and $NH_{min}$ bound the range of values taken by the transformed dataset, usually $NH_{max} = 1$ and $NH_{min} = 0$.
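As a minimal sketch of the min-max normalization described above, the following Python function rescales one numeric column. The function name, the handling of constant columns, and the sample values are illustrative assumptions, not the paper's implementation.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers to the interval [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no spread, so map it to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


ages = [40, 55, 70]  # made-up values for illustration
normalized = min_max_normalize(ages)
```

With the default range, the smallest value maps to 0 and the largest to 1, so every feature ends up on the same scale regardless of its original units.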

Feature Selection Based on Information Gain
An information-gain-based feature selection (IGFS) method [15] is used on HDD to remove redundant and useless features and to keep those that have a significant effect on the result. Algorithm 1 gives the IGFS pseudocode, whose output is the selected feature set FS. To decrease feature dimensionality and enhance model prediction, the features in the selected set are used as the input features for prediction. It should be emphasized that feature selection must retain the key task-related features. The method quantifies a feature's importance in terms of information gain: the more information a feature contributes to classification, the higher its information gain. The gain is determined by the formulas below.

$$H(X) = -\sum_{i} p(x_i) \log_2 p(x_i) \tag{2}$$

$$H(Y \mid X) = -\sum_{i} p(x_i) \sum_{j} p(y_j \mid x_i) \log_2 p(y_j \mid x_i) \tag{3}$$

$$IG(X) = H(Y) - H(Y \mid X) \tag{4}$$

where Equation (2) denotes the information entropy of feature X, Equation (3) denotes the conditional entropy of the prediction column Y when feature X is known, and Equation (4) denotes the information gain of feature X, that is, the difference between the information entropy of Y (Equation (2) applied to Y) and the conditional entropy of Y given X. The information gain is computed for every feature in the dataset, and the values are sorted. The features with gains larger than the threshold are regarded as essential features and are selected.
After the above preprocessing and feature selection, we obtain 3527 samples with 15 features and 1 prediction label. Table 1 describes the preprocessed HDD.
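The information-gain selection described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions: features are treated as discrete (continuous columns would need binning first), and the function names, the toy columns, and the threshold value are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, target):
    """Entropy of the target minus its conditional entropy given the feature."""
    n = len(target)
    groups = {}
    for x, y in zip(feature, target):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(target) - conditional


def select_features(columns, target, threshold):
    """Keep features whose gain exceeds the threshold, highest gain first."""
    gains = {name: information_gain(col, target) for name, col in columns.items()}
    kept = [(name, g) for name, g in gains.items() if g > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy data: "a" determines the target, "b" is unrelated to it.
target = [0, 0, 1, 1]
columns = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
selected = select_features(columns, target, threshold=0.1)
```

In this toy example only feature "a" survives the threshold, because knowing "b" leaves the entropy of the target unchanged.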

Imbalance Data Processing Based on Smote-Enn
Due to the low prevalence of most diseases, medical datasets are typically imbalanced, with large differences in the number of samples per category. A model trained on an imbalanced dataset loses performance and reliability. Table 2 shows the category distribution of the target MACCE in the imbalanced HDD used in this paper, where the ratio between 0 (not occurring) and 1 (occurring) reaches 9:1. There are three basic strategies for obtaining balanced data: (1) increasing the number of samples in the minority class (oversampling); (2) decreasing the number of samples in the majority class (undersampling); and (3) combining undersampling and oversampling. Undersampling removes samples from the majority class at random, which may discard crucial information that has a considerable impact on the learning task. Oversampling directly resamples the minority class, which may result in overfitting. Furthermore, several researchers have shown that hybrid methods are superior to single methods when processing datasets [16,17].
In this research, a hybrid technique called SMOTE-ENN [18] is used to handle the imbalanced data. SMOTE is an oversampling algorithm that generates new minority-class samples by interpolating between existing ones. ENN is an undersampling algorithm that removes every sample whose class differs from the class of the majority of its k nearest neighbors, which decreases the number of samples from the majority class. In this paper, the SMOTE algorithm is first used to oversample the MACCE category one (occurring) until the majority and minority classes are balanced. Then, the ENN algorithm is applied to remove the overlapping samples in each of the two categories. With this hybrid technique, the proportion of the minority class of HDD increases from 9.16 percent to 61.67 percent. Algorithm 2 gives the SMOTE-ENN pseudocode, whose output is the balanced dataset HDD.
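The two stages of SMOTE-ENN can be sketched with the standard library alone. This is a simplified illustration, not the authors' implementation: it assumes a binary problem, numeric feature vectors, Euclidean distance, and at least two minority samples, and all function names and default values are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding itself."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: sq_dist(points[idx], points[j]))
    return others[:k]


def smote(X, y, minority=1, k=3, seed=0):
    """Add interpolated minority samples until both classes have equal size."""
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_new = (len(X) - len(minority_pts)) - len(minority_pts)
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        i = rng.randrange(len(minority_pts))
        j = rng.choice(nearest(i, minority_pts, k))
        delta = rng.random()  # interpolation factor in [0, 1)
        s, nb = minority_pts[i], minority_pts[j]
        X_out.append([a + delta * (b - a) for a, b in zip(s, nb)])
        y_out.append(minority)
    return X_out, y_out


def enn(X, y, k=3):
    """Drop every sample whose label disagrees with most of its k neighbours."""
    keep = []
    for i in range(len(X)):
        votes = [y[j] for j in nearest(i, X, k)]
        majority = max(set(votes), key=votes.count)
        if y[i] == majority:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]
```

Running `smote` followed by `enn` mirrors the order described above: synthetic minority samples are created first, and samples that sit inside the other class's neighbourhood are cleaned afterwards.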

XGBoost
Xgboost is an implementation of the boosting family of ensemble learning algorithms [19]. The fundamental principle of Xgboost is to train the model on residuals. Each new tree is fitted to the residuals left by the trees trained so far, and the error is progressively decreased over numerous serial iterations. Finally, all weak learners are linearly weighted to produce the ensemble learner.
Additionally, when training an Xgboost tree, the effective splitting point is chosen using a gain-based greedy algorithm. To better optimize the objective function, Xgboost approximates it with a second-order Taylor expansion, so that the optimal solution is that of a quadratic function. Furthermore, a regularization term is added to control the tree's complexity, lowering the possibility of overfitting. The objective function is as follows.

$$Obj = \sum_{i} L(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

where $L$ is the loss between the true label $y_i$ and the prediction $\hat{y}_i$, $f_k$ is the kth tree, $w_j$ stands for the weight of leaf node $j$, $T$ stands for the total number of leaf nodes, and $\lambda$ and $\gamma$ are hyperparameters that control the tree complexity.
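For readers who want the intermediate step, the standard second-order derivation from the Xgboost literature can be sketched as follows. It is not spelled out in this paper, and the symbols $g_i$, $h_i$, $I_j$, $G_j$, and $H_j$ are notation introduced only for this sketch: $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the previous prediction, and $I_j$ is the set of samples that fall into leaf $j$. Terms that do not depend on the new tree $f_t$ are dropped.

```latex
\begin{aligned}
Obj^{(t)} &\approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Big]
            + \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2, \\
g_i &= \frac{\partial L(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}}, \qquad
h_i = \frac{\partial^2 L(y_i, \hat{y}_i^{(t-1)})}{\partial \big(\hat{y}_i^{(t-1)}\big)^2}, \\
G_j &= \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i, \\
w_j^{*} &= -\frac{G_j}{H_j + \lambda}, \qquad
Obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T .
\end{aligned}
```

Because the approximated objective is quadratic in each leaf weight, the optimal weight has a closed form, and $Obj^{*}$ serves as the score used to compare candidate splits.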
Xgboost uses the shrinkage strategy [20] when combining weak learners to decrease the likelihood of overfitting. The ensemble takes the form shown below.

$$F_m(X) = F_{m-1}(X) + \eta f_m(X)$$

where $f_m(X)$ denotes the weak learner generated in the mth iteration, $F_m(X)$ denotes the ensemble learner after the mth iteration, and $\eta$ is the shrinkage parameter (learning rate). The parameter $\eta$ is strongly negatively correlated with the number of iterations required: a smaller $\eta$ needs more iterations, but the model often generalizes better.
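The effect of shrinkage can be illustrated with a toy boosting loop. This sketch is plain residual boosting with squared error and one-feature decision stumps, not Xgboost's regularized second-order procedure; the function names and the toy data are assumptions made for the example, and the input needs at least two distinct feature values.

```python
def fit_stump(xs, residuals):
    """Best single-split regressor on one feature (least squared error)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, n_rounds=50, eta=0.1):
    """Build F_m = F_{m-1} + eta * f_m, with each f_m fitted to the residuals."""
    learners = []
    preds = [0.0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        f = fit_stump(xs, residuals)
        learners.append(f)
        preds = [p + eta * f(x) for p, x in zip(preds, xs)]
    return lambda x: sum(eta * f(x) for f in learners)


xs, ys = [1, 2, 3, 4], [0, 0, 1, 1]  # toy data
slow = boost(xs, ys, n_rounds=5, eta=0.1)
fast = boost(xs, ys, n_rounds=5, eta=0.5)
```

After the same five rounds, the model with the smaller learning rate is still further from the targets than the one with the larger rate, which is why a small $\eta$ has to be paired with more iterations.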
Moreover, the tree-structured Parzen estimator (TPE) strategy is adopted to automatically optimize the hyperparameters of the Xgboost model for optimal prediction, and the block technique of Xgboost enhances the capability of the model to handle large amounts of data and improves its training efficiency.

Baseline Algorithms
Five machine learning methods are chosen as the baseline algorithms in this paper. We compare the prediction performance of these baselines with that of Xgboost to illustrate the utility of Xgboost in predicting heart disease. The following is an overview of the baseline algorithms.

Random Forest
RF [21] is a typical bagging algorithm. Unlike a conventional decision tree, RF trains each classifier on a randomly chosen subset of the samples and a randomly chosen subset of the features, so the trained classifiers produce different predictions for the same input. The final prediction is obtained by combining the outputs of the trained classifiers, typically by plurality voting or averaging. Because samples and features are randomly drawn, the diversity of the classifiers increases, which enhances the model's capacity for generalization.

K-Nearest Neighbor
KNN [22] is a form of lazy learning: the algorithm defers all computation until it receives the test samples, so its training time overhead is zero. For each test sample, the algorithm uses a distance metric to find the k training samples closest to it and uses their category information as the basis for the decision. In binary classification, the category that accounts for the largest share of the k neighbors is typically assigned to the test sample.
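A minimal sketch of this voting rule is shown below. It assumes numeric feature vectors and Euclidean distance; the function name and the toy points are illustrative only.

```python
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` with the most common class among its k nearest points."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
label = knn_predict(train_X, train_y, [0.5, 0.5])
```

All of the work happens at prediction time, which is what makes the method "lazy": there is no training step beyond storing the samples.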

Logistic Regression
LR [23] is a variant of the linear regression algorithm. For binary classification, logistic regression applies a logistic function to convert the value z predicted by a linear model into a probability between zero and one, which is then turned into a discrete label: one if z is larger than zero (that is, if the probability exceeds 0.5), and zero otherwise. The logistic function is as follows.

$$y = \frac{1}{1 + e^{-z}}$$
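The mapping from a linear score to a class label can be sketched as follows. The weights, bias, and inputs are made-up values for illustration; fitting them from data is outside the scope of this sketch.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, x, threshold=0.5):
    """Return (label, probability) for one sample under a linear model."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    p = sigmoid(z)
    return (1 if p > threshold else 0), p


label, prob = predict([2.0, -1.0], 0.5, [1.0, 1.0])  # z = 1.5
```

Thresholding the probability at 0.5 is equivalent to checking whether the linear score z is positive.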

Decision Tree
DT [24] is a widely used classification algorithm, which can be categorized into three types according to the criterion used to generate the tree: the ID3 tree, based on information gain; the C4.5 tree, based on the gain ratio; and the CART tree, based on the Gini index. Decision tree algorithms also employ pre-pruning and post-pruning to prevent overfitting and enhance the capacity for generalization.

Naïve Bayes
NB [25] is a classification algorithm based on event probability and misclassification loss. Its main advantage is the attribute conditional independence assumption, which avoids the combinatorial explosion that occurs when estimating the joint class-conditional probability. Under this assumption, the posterior probability is recast as follows.

$$P(c \mid x) = \frac{P(c)\,P(x \mid c)}{P(x)} = \frac{P(c)}{P(x)} \prod_{i=1}^{d} P(x_i \mid c)$$

where $c$ is the class, $x = (x_1, \ldots, x_d)$ is the sample, and $d$ is the number of attributes. The test samples are then classified according to the resulting probabilities.
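Under the independence assumption, a classifier only needs per-class counts for each attribute. The sketch below assumes categorical attributes and uses add-one (Laplace) smoothing, which the paper does not mention; the function names and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class value frequencies per attribute."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    feature_values = defaultdict(set)    # attribute index -> distinct values seen
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, c)][v] += 1
            feature_values[i].add(v)
    return class_counts, value_counts, feature_values


def predict_nb(model, row):
    """Return the class maximising P(c) * prod_i P(x_i | c), plus all scores."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total
        for i, v in enumerate(row):
            # Add-one smoothing (an assumption of this sketch).
            score *= (value_counts[(i, c)][v] + 1) / (n_c + len(feature_values[i]))
        scores[c] = score
    return max(scores, key=scores.get), scores


rows = [["high", "yes"], ["high", "no"], ["low", "no"], ["low", "no"]]
labels = [1, 1, 0, 0]
model = train_nb(rows, labels)
predicted, scores = predict_nb(model, ["high", "yes"])
```

The evidence term P(x) is omitted because it is the same for every class and does not change which class has the highest score.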

Performance Evaluation

Result of Exploratory Data Analysis
Exploratory data analysis was carried out on the dataset to better understand its characteristics. This subsection describes the observations.
The frequency distribution histogram provides a rapid overview of the data's dispersion and central tendency. Figure 2 shows the distribution of the various features, where the height of each rectangle represents the frequency of occurrence of the corresponding values. Additionally, the ability of the model to predict outcomes is affected by the degree of correlation between features.
In this paper, we use Pearson correlation coefficients to measure the correlation between features and a heat map to show the level of correlation. Each cell in Figure 3 represents the correlation coefficient between the features of the corresponding row and column. Since every pairwise correlation coefficient in the figure is less than 0.5, it can be inferred that the chosen features have largely independent effects on the prediction column MACCE.
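The pairwise Pearson coefficients behind such a heat map can be computed as in the sketch below. The function names and the toy columns are illustrative assumptions, and a constant column (zero variance) is not handled.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Map every ordered pair of column names to its Pearson coefficient."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


columns = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}  # toy data
matrix = correlation_matrix(columns)
```

Values near 1 or -1 indicate strongly related features, while values near 0 indicate little linear relationship.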

Cross Validation
In this paper, the training and test sets are produced using five-fold cross-validation. The dataset is first split by stratified sampling into 5 subsets (D1-D5) that are mutually exclusive, equal in size, and consistent in class distribution. In each round, one subset is used as the test set and the remaining subsets as the training set. The average of the five sets of results is then taken as the final result. Figure 4 shows a schematic of the five-fold cross-validation approach.
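The stratified split and the averaging step can be sketched as follows. The helper names and the `evaluate` callback are assumptions made for this illustration; in practice `evaluate` would train a model on the training indices and return a metric on the test indices.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Split sample indices into folds that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)  # deal indices out round-robin
    return folds


def cross_validate(labels, evaluate, n_folds=5, seed=0):
    """Average evaluate(train_idx, test_idx) over all folds."""
    folds = stratified_folds(labels, n_folds, seed)
    scores = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores)


labels = [0] * 10 + [1] * 5  # toy labels with a 2:1 class ratio
folds = stratified_folds(labels)
```

Each fold in this toy example receives two samples of class 0 and one of class 1, so every test set has the same class ratio as the full dataset.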

Performance Measure
The prediction performance of the algorithms is evaluated using five performance measures. Figure 5 illustrates the structure of the confusion matrix for binary classification. The combinations of predicted and true values fall into four cases: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). From the confusion matrix, four of the evaluation indicators are calculated using the formulas below.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

The ROC curve is a tool for examining the generalization capacity of an algorithm. Its horizontal axis is the False Positive Rate (FPR) and its vertical axis is the True Positive Rate (TPR), which are calculated as follows.

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$

Moreover, the Area Under the ROC Curve (AUC), the fifth measure, reflects how well a model predicts heart disease.
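These measures follow directly from the four confusion-matrix counts, as the sketch below illustrates. The function names and toy labels are assumptions; the AUC is computed through its rank interpretation (the probability that a random positive sample is scored above a random negative one), and zero denominators are not handled.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


def auc(y_true, scores):
    """Share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # toy labels
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(*confusion_counts(y_true, y_pred))
```

Note that recall and TPR are the same quantity, so the ROC curve plots recall against the false positive rate as the decision threshold varies.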

The Performance of Algorithms
In this section, the six algorithms are trained using five-fold cross-validation, and the proposed framework is validated on the preprocessed HDD. Table 3 displays the confusion matrices of the six algorithms, showing in detail the percentage of the four cases TP, FP, TN, and FN in the prediction results of each algorithm. Table 4 reports the average performance of the six algorithms on five metrics: Accuracy, Precision, Recall, F1-score, and AUC. Accuracy is the most crucial metric for evaluating how well a model predicts, and Xgboost achieves 93.44%. Random Forest and K-Nearest Neighbor achieve 91.15% and 91.77%, respectively. Decision Tree comes in at 83.35%, while Naïve Bayes and Logistic Regression underperform at 75.85% and 74.81%, respectively.
Precision and Recall tend to be inversely related: when precision is high, recall is typically lower, and vice versa. If only the most confident samples are predicted as MACCE in order to raise the share of correct positive predictions, the number of FN increases and recall falls. Conversely, if more samples are predicted as positive in order to capture the occurrence of MACCE to the greatest extent possible, the number of FP rises and precision falls. Owing to the balanced distribution of positive and negative samples in the dataset and the preference for a model that does not miss occurrences of MACCE, the Precision of the models in this experiment is generally lower than their Recall. Xgboost had the best F1-score at 94.86%, while the weakest performers were Naïve Bayes and Logistic Regression, at 71.05% and 70.59%, respectively.
Figure 6 shows the ROC curves of these algorithms, which correspond to the AUC metric in the table. The higher the AUC, the stronger the generalization ability of the model, which appears in the ROC plot as a curve closer to the upper-left corner. Xgboost had the highest AUC value of 92.44%, whereas Naïve Bayes had the lowest, with an AUC value of 70.59%.
When all evaluation metrics are considered, the Xgboost algorithm has a significant advantage in predicting the occurrence of MACCE, and K-Nearest Neighbor is the second best. The comparison also shows that Naïve Bayes and Logistic Regression both fared poorly. The 15 features in HDD each have a different impact on the predictions. Each model prefers different features, and the scores of these features also vary. Figure 7 shows the feature importance ranking of Xgboost, RF, LR, and DT only, since the KNN and NB algorithms lack an internal evaluation of feature importance.
From the feature ranking in Table 5, it can be seen that total cholesterol (TC) is an important feature for predicting whether MACCE occurs. Hemoglobin (HBG) and age appear among the top five features of all four algorithms and are also influential factors that cannot be ignored in prediction. In addition, the table shows that the feature importance rankings of Decision Tree and Xgboost are similar, since both algorithms build tree structures during training.
In summary, we used multiple evaluation methods to present the prediction results of the Xgboost algorithm and the selected baseline algorithms on the processed dataset. The confusion matrix displays the four prediction outcome cases, TN, TP, FN, and FP. Accuracy, Precision, Recall, and F1-score were computed from the confusion matrix, and the ROC curve was also plotted. The Xgboost algorithm performed well on all of these evaluation metrics, demonstrating that the proposed Smote-Xgboost-based framework has a significant advantage in predicting heart disease. In addition, we estimated the feature importance and related scores of the four algorithms to provide ideas for further optimization.

Conclusions
In this research, we present a Smote-Xgboost-based methodology for heart disease prediction. First, an information-gain-based approach to feature selection is proposed, and then the hybrid Smote-Enn algorithm is used to process the imbalanced dataset. Finally, the processed HDD is used for model training. In the experimental evaluation, we compare the Xgboost algorithm with five baseline algorithms. The outcomes demonstrate that the proposed model performs best across all five evaluation indicators, with a prediction accuracy of 93.44%. In addition, we compute the feature importance of the selected algorithms, which has important implications for heart disease prediction.
In future work, we will combine multiple effective machine learning techniques with cutting-edge data processing techniques to build real-time and reliable heart disease diagnosis models.

Figure 1. The process of the proposed framework.

Algorithm 1: The pseudo code of IGFS.
Input: Heart Disease Dataset HDD; information gain threshold ga
Output: Feature set FS
Process:
1 FS = ∅; // records the features with the highest gain
2 foreach feature f_i in HDD do
3     Calculate the information gain g of f_i;
4     if g > ga then
5         Add feature f_i and g to FS;

Figure 3. The correlation matrix of features.
Note on the min-max normalization of Equation (1): with NH_max = 1 and NH_min = 0, the experimental dataset H lies in the interval [0, 1].

Table 1. Description of features.

Table 2. The distribution of MACCE in HDD.

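Algorithm 2: The pseudo code of Smote-Enn.
Input: Heart Disease Dataset HDD
Output: Balanced dataset HDD
Process:
1 foreach sample s_i in the minority class of HDD do
2     Calculate the K-nearest neighbor samples ks_i of s_i;
3     Construct a new data sample ns = s_i + δ (ks_i − s_i), where δ is a random number in [0, 1];
4     Add the generated sample ns to HDD;
5 foreach sample h_i in HDD do
6     if the class of h_i ≠ the majority class of its k-nearest neighbors then
7         Remove h_i;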

Table 3. The confusion matrix of six algorithms.

Table 4. The results of six algorithms.
Figure 6. The ROC curve of baseline algorithms and Xgboost.
Figure 7. The feature importance of 4 algorithms.

Table 5. Features ranking of 4 algorithms.