Improvement of the Performance of Models for Predicting Coronary Artery Disease Based on XGBoost Algorithm and Feature Processing Technology

Abstract: Coronary artery disease (CAD) is one of the diseases with the highest morbidity and mortality in the world. In 2019, the number of deaths caused by CAD reached 9.14 million. The detection and treatment of CAD in the early stage is crucial to saving lives and improving prognosis. Therefore, the purpose of this research is to develop a machine-learning system that can be used to help diagnose CAD accurately in the early stage. In this paper, two classical ensemble learning algorithms, namely, the XGBoost algorithm and the Random Forest algorithm, were used as the classification models. In order to improve the classification accuracy and performance of the models, we applied four feature processing techniques. In addition, synthetic minority oversampling technique (SMOTE) and adaptive synthetic sampling (ADASYN) were used to balance the dataset, which includes 71.29% CAD samples and 28.71% normal samples. The four feature processing technologies improved the performance of the classification models in terms of classification accuracy, precision, recall, F1 score and specificity. In particular, the XGBoost algorithm achieved the best prediction performance on the dataset processed by feature construction and the SMOTE method. The best classification accuracy, recall, specificity, precision, F1 score and AUC were 94.7%, 96.1%, 93.2%, 93.4%, 94.6% and 98.0%, respectively. The experimental results show that the proposed method can accurately and reliably identify CAD patients among suspected patients in the early stage and can be used by medical staff for auxiliary diagnosis.


Introduction
Cardiovascular diseases (CVDs) are the main cause of death in the world. In 2019, the number of deaths caused by cardiovascular diseases reached 18.5 million, accounting for about one third of all deaths worldwide [1,2]. Nearly half of these deaths are caused by coronary artery disease (CAD), which is regarded as one of the most common types of cardiovascular disease. In 2019, there were 197 million CAD patients worldwide [1,3].
CAD refers to the stenosis or occlusion of the coronary arteries due to atherosclerotic changes, which prevents oxygen-rich blood from reaching the heart, leading to ischemic heart attacks. According to the anatomy of the coronary arteries, there are three main blood vessels supplying blood to the myocardium, namely, (1) the left anterior descending artery (LAD), (2) the left circumflex artery (LCX), and (3) the right coronary artery (RCA). CAD occurs when any one of these blood vessels is narrowed by 50% or more.
In this paper, two classical ensemble learning algorithms, namely, the XGBoost algorithm and the Random Forest algorithm, are used as the classification models. Simultaneously, synthetic minority oversampling technique (SMOTE) and adaptive synthetic sampling (ADASYN) are used to balance the dataset, which includes 71.29% CAD samples and 28.71% normal samples. In addition, 10-fold cross-validation is used to test the stability and accuracy of the models. Accuracy, recall, specificity, precision, F1 score and AUC are applied to assess the power of the proposed method. The schematic diagram of the proposed method for CAD prediction is shown in Figure 1.
The structure of the rest of this paper is arranged as follows. An introduction to the proposed method is given in Section 2. Section 3 provides details about the experiments and results. The classification performance of the proposed methods and our future work are discussed in Section 4. Finally, the conclusion is stated in Section 5.

Proposed Method

Feature Smoothing
Feature smoothing is used to smooth the abnormal data contained in an input feature to a specific range. Since feature smoothing does not filter out or delete any records, the numbers of input features and samples remain unchanged after feature smoothing. Feature smoothing can be divided into Z-score smoothing, percentile smoothing and threshold smoothing. This paper uses Z-score smoothing to process the continuous features in the dataset used in this study. If a feature follows a normal distribution, noise is generally concentrated outside µ ± 3σ. Z-score smoothing handles outliers in a feature according to the following rule:
x_j = µ + 3σ, if x_noise,j > µ + 3σ; x_j = µ − 3σ, if x_noise,j < µ − 3σ; x_j = x_noise,j, otherwise,
where x_noise,j is the noisy data of feature j, µ is the mean of feature j, and σ is the standard deviation of feature j.
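The clipping rule above can be sketched in a few lines of NumPy; this is a minimal illustration of Z-score smoothing under our own naming (the function `zscore_smooth` and the threshold parameter `k` are not from the paper, which fixes k = 3):

```python
import numpy as np

def zscore_smooth(x, k=3.0):
    """Clip values lying outside mu +/- k*sigma back to the boundary.

    Sketch of the Z-score smoothing described above; values inside
    the band are returned unchanged, so the number of samples and
    features is preserved.
    """
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return np.clip(x, mu - k * sigma, mu + k * sigma)
```

For example, in a feature with twenty values of 1.0 and one value of 100.0, the outlier is pulled back to µ + 3σ while the remaining values stay as they are.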

Feature Encoding
Feature frequency encoding calculates the frequency with which each feature value appears and replaces the feature value with this frequency. Feature frequency encoding can express the probability information of a feature's appearance without changing the dimension of the dataset or losing feature information. At the same time, feature frequency encoding prevents features with larger values from dominating the model. The original dataset contains both continuous features and categorical features. The value ranges of different continuous features are quite different, and the categorical features include two-category and multi-category features. These characteristics of the dataset will affect the stability and convergence speed of the models. In addition, the dataset used in this paper has a small sample size and many categorical features. For categorical features, the most commonly used encoding method is one-hot encoding. However, for this dataset, one-hot encoding would significantly increase the dimension of the dataset and may lead to high dimensionality and multicollinearity. Therefore, this paper adopts feature frequency encoding to deal with the continuous features and multi-category features in the dataset used in this study. The intuitive expression of feature frequency encoding is shown in Figure 2, where n1, n2, n3, ..., nm are the frequencies corresponding to the values x1j, x2j, x3j, ..., xnj of feature j in the original dataset, respectively. Taking x1j as an example, each occurrence of the value x1j of feature j in the original dataset is replaced by the corresponding frequency n1, which is used as the new feature value for training and testing.
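The value-to-frequency replacement described above can be sketched with pandas; the function name and the example column are illustrative, not from the paper:

```python
import pandas as pd

def frequency_encode(df, cols):
    """Replace each value in `cols` by its occurrence count (frequency).

    Sketch of feature frequency encoding: the dataset keeps the same
    shape, and each categorical value is mapped to how often it
    appears in its column.
    """
    out = df.copy()
    for col in cols:
        counts = out[col].value_counts()
        out[col] = out[col].map(counts)
    return out
```

Dividing the counts by the number of rows would give relative frequencies instead; either variant preserves the dataset's dimension.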

Feature Construction
Feature construction refers to the artificial formation of features that are valuable for prediction from the original data. We calculate the sum, mean and standard deviation of continuous features to form new features. Considering the influence of categorical features such as sex and age on continuous features, we calculate the sum, mean and standard deviation of continuous features separately for the different values of the categorical features. For example, we generate triglyceride (TG) features based on the sex feature; that is, the sum, mean and standard deviation of TG are calculated separately in the male and female groups to form new features. The same calculation is also carried out for continuous features after a bucketing operation on age. Finally, the newly calculated features are added to the raw dataset to form a new dataset. The intuitive expression of feature construction is shown in Figure 3. The two columns named feature j and Sex on the left side of the figure represent two features in the original dataset; xnj is the value of the nth sample on feature j. In the middle part of the figure, the values of feature j are grouped according to Sex, where male = 0 means that male samples are coded with 0, and m is the number of males in the sample; similarly, female = 1 means that female samples are coded with 1, and n − m is the number of females in the sample. The three columns on the right side of the figure, named 0_j_sum, 0_j_mean and 0_j_std, are the sum, mean and standard deviation of the values of feature j in the male group. The corresponding values of these three columns (e.g., xmale,j,sum, xmale,j,mean, xmale,j,std) are the new feature values after feature construction. The operation for the female group is the same as that for the male group. Finally, the new feature columns are obtained by merging the columns with the same names from the male group and the female group.
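The group-wise sum/mean/std construction in Figure 3 maps naturally onto a pandas groupby-transform; this is a sketch with our own column-naming convention (mimicking the 0_j_sum / 0_j_mean / 0_j_std pattern) and illustrative column names:

```python
import pandas as pd

def construct_group_features(df, cat_col, num_col):
    """Add per-group sum, mean and std of `num_col` grouped by `cat_col`.

    Sketch of the feature construction described above (e.g. TG
    grouped by Sex): every row receives the statistics of its own
    group as three new feature columns.
    """
    out = df.copy()
    grp = out.groupby(cat_col)[num_col]
    out[f"{cat_col}_{num_col}_sum"] = grp.transform("sum")
    out[f"{cat_col}_{num_col}_mean"] = grp.transform("mean")
    out[f"{cat_col}_{num_col}_std"] = grp.transform("std")
    return out
```

Using `transform` (rather than `agg`) keeps the number of rows unchanged, which matches the paper's statement that feature construction only increases the feature dimension.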


Feature Selection
Feature selection, a subset of feature engineering, acts as a pivotal part in enhancing the capacity of machine learning and data mining algorithms [15]. The major goal of feature selection is to select better features for prediction from the raw data. That is, for n features, k (k < n) features are selected from them to enhance the capacity of the machine learning algorithm. Through feature selection, the most relevant, most important and least redundant feature subsets are identified. Feature selection is the most widely used feature processing technology on the Z-Alizadeh Sani dataset in the existing literature; the most frequently used feature selection methods are information gain, weight by SVM, PCA and the Gini coefficient. In our study, the GBDT (Gradient Boosting Decision Tree) algorithm, which has been used for feature selection in many prediction tasks and has achieved good results, is adopted [19,20]. GBDT is an iterative decision tree algorithm composed of multiple decision trees; the final prediction is obtained by accumulating the prediction results of the individual trees. According to the principle of the GBDT algorithm [21], it can be used for feature combination and feature selection. When used for feature selection, the global importance of feature j is measured by its average importance over the M decision trees:
J_j^2 = (1/M) Σ_{m=1}^{M} J_j^2(T_m),
where M represents the number of decision trees. The importance of feature j in a single decision tree T is calculated by the following formula:
J_j^2(T) = Σ_{t=1}^{L−1} i_t^2 · I(v_t = j),
where L represents the number of leaf nodes of the decision tree, so that L − 1 is the number of non-leaf (internal) nodes; v_t is the feature associated with node t; i_t^2 represents the reduction in squared loss after node t is split; and I is the indicator function.
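In practice, GBDT-based feature selection amounts to fitting a gradient-boosted tree model and keeping the top-k features by impurity-based importance. The sketch below uses scikit-learn's GradientBoostingClassifier on synthetic data; the hyperparameters and k = 12 are illustrative, not the paper's exact settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the Z-Alizadeh Sani data (assumed shapes).
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# Fit a GBDT and rank features by its accumulated importance scores.
gbdt = GradientBoostingClassifier(n_estimators=100, random_state=0)
gbdt.fit(X, y)

k = 12
top_k = np.argsort(gbdt.feature_importances_)[::-1][:k]
X_selected = X[:, top_k]
```

The `feature_importances_` attribute implements exactly the tree-averaged importance described above (normalized to sum to 1), so selecting its largest entries keeps the most relevant, least redundant subset.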

Processing Method of Unbalanced Dataset
The original dataset has a certain imbalance, which may affect the classification accuracy of the algorithms. Consequently, it is essential to balance the original dataset. In this paper, two sampling algorithms are used to oversample the minority class, namely, synthetic minority oversampling technique (SMOTE) and adaptive synthetic sampling (ADASYN). The SMOTE algorithm [22-26] is an improvement on the random oversampling algorithm. The fundamental idea of SMOTE is to synthesize new samples artificially by analyzing the minority samples and then adding the synthesized samples to the dataset. The synthetic samples are generated as follows: (1) calculate the distance from a minority sample x to all samples in the minority class set S_min, and obtain its k-nearest neighbors; (2) set a sampling ratio according to the sample imbalance proportion, then randomly select several samples from the k-nearest neighbors of each minority sample x, denoting a selected neighbor by x_n; (3) calculate the distance from the minority sample x to each selected neighbor x_n, denoted by |x − x_n|, multiply this distance by a random number between 0 and 1, and add the product to the sample x to produce a new sample x_new. This method selects a random point on the line connecting two specific samples as the new sample. By increasing the number of minority samples, this method effectively makes the minority class decision region more general.
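Steps (1)-(3) above can be sketched directly in NumPy. This is a minimal SMOTE-style generator under our own naming, not the implementation the paper used (the imbalanced-learn package provides a production version as `imblearn.over_sampling.SMOTE`):

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate `n_new` synthetic minority samples, SMOTE-style.

    For a randomly chosen minority sample x, pick one of its k
    nearest minority neighbours x_n and interpolate
    x_new = x + delta * (x_n - x) with delta drawn from [0, 1].
    """
    rng = np.random.default_rng(rng)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        # Distances to all minority samples; drop x itself (index 0
        # after sorting) and keep the k nearest neighbours.
        d = np.linalg.norm(X_min - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        x_n = X_min[rng.choice(neighbours)]
        delta = rng.random()
        new.append(x + delta * (x_n - x))
    return np.array(new)
```

Because every synthetic point lies on a segment between two real minority samples, the generated data stay inside the minority class's convex hull, which is what makes the decision region "more general" rather than merely duplicated.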
The calculation formula is as follows:
x_new = x + rand(0, 1) × |x − x_n|,
where rand(0, 1) is a random number between 0 and 1. Different from the SMOTE algorithm, which generates an equal number of synthetic samples for every minority class example, the key idea of the ADASYN algorithm [26-28] is to automatically determine how many new samples should be resampled for each minority class sample, using the density distribution as the criterion. The dataset generated by the ADASYN algorithm will not only show a balanced distribution of data, but will also require the classification algorithm to devote more attention to those samples that are hard to learn [27].

Classification Algorithm
This paper applies two classification models, namely, Random Forest and XGBoost. The Random Forest algorithm, introduced by Breiman, is a highly effective and widely used model which can be applied to both classification and regression problems [29,30]. It is an ensemble learning technique based on bagging. Its basic idea is to train a set of base classifiers, usually decision trees, and then aggregate the results of the base classifiers by hard voting or weighted voting to obtain the final prediction output. Therefore, Random Forest usually performs better than a single classifier. In addition, to improve the performance of Random Forest, some strategies need to be adopted, such as introducing greater randomness, which makes the base classifiers as independent as possible during the process of creating the forest. In view of these advantages, the Random Forest algorithm has been widely used in disease prediction and system development.
XGBoost is an optimized implementation of gradient boosting. Different from Random Forest, the base classifiers of XGBoost are interrelated: each base classifier is generated based on the previous one. Specifically, each new base classifier fits the prediction residuals of the previous base classifiers. Based on this ensemble strategy, machine learning techniques have shown high performance in solving various disease prediction and risk stratification tasks in recent years [31-34].
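The bagging-versus-boosting contrast can be demonstrated on synthetic data. The sketch below uses scikit-learn's GradientBoostingClassifier as a dependency-free stand-in for XGBoost (the paper itself uses the xgboost package); hyperparameters are defaults, not the paper's tuned settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (illustrative shapes only).
X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

# Bagging: independent trees, aggregated by voting.
bagging = RandomForestClassifier(n_estimators=200, random_state=0)
# Boosting: each tree fits the residuals of the ensemble so far.
boosting = GradientBoostingClassifier(n_estimators=200, random_state=0)

acc = {}
for name, model in [("rf", bagging), ("gbdt", boosting)]:
    model.fit(X_tr, y_tr)
    acc[name] = model.score(X_te, y_te)
```

Swapping `GradientBoostingClassifier` for `xgboost.XGBClassifier` gives the paper's actual boosted model; the fit/score interface is the same.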

Experimental Dataset
The Z-Alizadeh Sani dataset, downloaded from the UCI Machine Learning Repository, consists of the medical records of 303 patients who visited Shaheed Rajaei Hospital due to chest pain. Each record contains 54 features. According to medical knowledge, each feature is an indicator for CAD diagnosis; that is, each feature is relevant to CAD prediction. In the medical literature, these features can be separated into four categories, namely: demographic; symptom and examination; ECG; and laboratory and echo features. The specific information about the features of the Z-Alizadeh Sani dataset is shown in Table 1. The 303 samples of the Z-Alizadeh Sani dataset can be divided into two classes: the CAD patient class and the normal class. When the diameter of at least one of the three arteries is narrowed by 50% or more, the patient is classified as CAD; otherwise, the patient is considered normal [15].

Confusion Matrix
The confusion matrix is a comprehensive evaluation tool used to describe a classifier's performance. In the confusion matrix, the rows represent the real classes y(i) and the columns represent the predicted classes ŷ(i). As shown in Table 2, TP (true positive) denotes instances that are actually CAD and are correctly predicted as CAD; FP (false positive) denotes instances that are actually normal but are mistakenly predicted as CAD; FN (false negative) denotes instances that are actually CAD but are mistakenly predicted as normal; and TN (true negative) denotes instances that are actually normal and are correctly predicted as normal.
The original dataset has a certain imbalance, so accuracy alone cannot measure the models' performance effectively. Thus, besides accuracy, we also calculated other evaluation metrics, namely recall, specificity, precision, F1 score and AUC, to evaluate the performance of the classification models.

Accuracy
Accuracy measures the proportion of all samples, both positive and negative, that are predicted correctly:
Accuracy = (TP + TN) / (TP + TN + FP + FN).
Accuracy is an evaluation index that is regularly used and easily understood. The influence of positive and negative samples on accuracy is the same. However, in the medical domain, doctors and patients actually pay more attention to the positive samples, namely the CAD samples. At this time, the costs brought on by the missed diagnosis of positive samples and misdiagnosis of negative samples are different. In these circumstances, only using accuracy to assess the performance of a classifier is insufficient.

Precision
Precision measures the proportion of true positive samples among the instances that are predicted to be positive:
Precision = TP / (TP + FP).
From the perspective of a positive sample, precision tends to measure how likely it is that an instance predicted to be a positive sample is indeed a true positive sample.

Recall
Recall represents the proportion of positive samples that are correctly predicted among all positive samples:
Recall = TP / (TP + FN).
Recall is an important evaluation index that measures the classifier's ability to recognize positive samples.

F1 Score
Precision and recall often restrict each other. Therefore, the F1 score, the harmonic mean of recall and precision, is introduced:
F1 = 2 × Precision × Recall / (Precision + Recall).
A higher F1 score indicates that the method is more effective.

Specificity
Specificity represents the proportion of negative samples that are correctly predicted among all negative samples:
Specificity = TN / (TN + FP).

AUC
The receiver operating characteristic (ROC) curve plots the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis, and thus reflects the relationship between recall and specificity. The ROC curve visually displays the classification capability of a classifier model: the closer the ROC curve of a classification model for CAD prediction is to the upper left corner, the stronger the prediction performance of the classifier. The area under the ROC curve (AUC) is usually used to quantify the ROC curve. In other words, the closer the AUC is to 1, the more accurately the classifier predicts CAD, that is, the higher its prediction performance.
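The threshold-based metrics above follow directly from the four confusion-matrix counts; the helper below is a sketch under our own naming. (AUC is the exception: it needs the classifier's continuous scores rather than counts, e.g. via scikit-learn's `roc_auc_score`.)

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, specificity and F1
    from confusion-matrix counts, as defined in the text above."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}
```

For instance, with TP = 90, FP = 10, FN = 5, TN = 95, accuracy is 0.925 and precision is 0.9, while recall (90/95) exceeds both, illustrating how the metrics diverge on imbalanced errors.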

Experimental Results
In this part, the experimental results of the XGBoost algorithm and Random Forest algorithm combined with four feature processing technologies and two dataset balancing methods are reported. Firstly, the four feature processing technologies, namely feature smoothing, feature encoding, feature construction and feature selection, are applied to the original dataset; together with the original dataset, a total of five datasets are obtained. Meanwhile, the two dataset balancing methods, synthetic minority oversampling technique (SMOTE) and adaptive synthetic sampling (ADASYN), are applied to balance the classes in each of these datasets. In total, we have 15 datasets. The performances of the XGBoost algorithm and Random Forest algorithm are then evaluated on these datasets separately. A 10-fold cross-validation technique is also used for model development.
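The evaluation protocol, 10-fold cross-validation over several scorers, can be sketched with scikit-learn's `cross_validate`. The data here are a synthetic stand-in with the paper's sample count and class proportions (303 samples, 54 features, roughly 71%/29% split); the classifier and its settings are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in mirroring the paper's dataset shape and imbalance.
X, y = make_classification(n_samples=303, n_features=54,
                           weights=[0.29, 0.71], random_state=0)

scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
cv = cross_validate(RandomForestClassifier(random_state=0),
                    X, y, cv=10, scoring=scoring)

# One score per fold and per metric; the paper reports fold averages.
mean_acc = cv["test_accuracy"].mean()
```

Each balanced or feature-processed dataset would simply replace `X, y` in the same loop, giving the 15 result sets described above.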

Results Obtained on Original Dataset and Two Balanced Datasets
The classification performance results of the XGBoost algorithm and Random Forest algorithm for CAD prediction in terms of accuracy, recall, precision, F1 score, specificity and AUC on three datasets, which include the original dataset and the two datasets balanced by SMOTE and ADASYN, respectively, are reported in this section. The original dataset has a certain imbalance: among 303 samples, 216 samples are CAD (accounting for 71.29%) and 87 samples are normal (accounting for 28.71%). The dataset balanced by the SMOTE method contains 432 samples; at this point, the numbers of samples in the CAD class and the normal class are equal, that is, each class consists of 216 samples. The dataset balanced using ADASYN includes 426 samples, among which 216 samples are CAD and 210 samples are normal. The average testing results in terms of accuracy, precision, recall, F1 score, specificity and AUC obtained by the XGBoost algorithm and Random Forest algorithm for the three datasets are reported in Table 3. The experimental results show that the XGBoost model on the dataset balanced by the SMOTE method achieves the best performance, with a classification accuracy of 94.0%, F1 score of 94.3%, recall of 94.0%, precision of 95.3%, specificity of 94.0% and AUC of 0.97. From these results, it can be inferred that the two dataset balancing methods can enhance the capability of the XGBoost and Random Forest models for predicting CAD. The best results appear in the combination of the XGBoost classification model with the SMOTE method. In addition, it can be found that the XGBoost algorithm performs better than the Random Forest algorithm on the datasets used in this section.

Results Obtained on Datasets Processed by Feature Smoothing and Two Dataset Balancing Methods
The performance results of the XGBoost algorithm and Random Forest algorithm for CAD prediction in respect of classification accuracy, recall, precision, F1 score, specificity and AUC for the datasets processed by feature smoothing technology and the two dataset balancing methods are discussed in this section. Feature smoothing technology only processes the outliers in the dataset and does not change the size of the dataset. Therefore, the dataset processed by feature smoothing technology still has a certain imbalance: among 303 samples, 216 samples are CAD and 87 samples are normal. The dataset balanced by the SMOTE method contains 432 samples, of which the numbers of samples classified as CAD and normal are both 216. The dataset balanced using the ADASYN method consists of 427 samples, among which the numbers of samples classified as CAD and normal are 216 and 211, respectively. The average testing results in terms of accuracy, precision, recall, F1 score, specificity and AUC obtained by the XGBoost algorithm and Random Forest algorithm for the three datasets are reported in Table 4. The experimental results show that, compared with Table 3, the performance results of the XGBoost model and the Random Forest model for the datasets processed by feature smoothing are generally reduced. In detail, feature smoothing technology only improves the performance results of the XGBoost algorithm on the original dataset and the performance results of the Random Forest algorithm on the dataset balanced by ADASYN; the classification performance of the two models on the other datasets is degraded.
Moreover, it can be found from the experimental results that the performance of the XGBoost algorithm is still better than that of the Random Forest algorithm on the datasets in this section. The two dataset balancing methods still improve the prediction performance of the models. Different from the original dataset, the best result on the dataset processed by feature smoothing technology comes from the combination of the ADASYN method and the XGBoost algorithm. The best recall, F1 score, accuracy, precision, specificity and AUC are 94.1%, 94.2%, 93.9%, 94.8%, 93.7% and 0.97, respectively.

Results Obtained on Datasets Processed by Feature Encoding and Two Dataset Balancing Methods
The performance results of the XGBoost algorithm and Random Forest algorithm for CAD prediction in respect of classification accuracy, recall, precision, F1 score, specificity and AUC for the datasets processed by feature encoding technology and the two dataset balancing methods are discussed in this section. The dataset processed by feature encoding technology contains 216 CAD samples and 87 normal samples. The dataset balanced by the SMOTE method consists of 432 samples, of which 216 samples are CAD and 216 samples are normal. The dataset balanced using the ADASYN method includes 421 samples, among which 216 samples are CAD and 205 samples are normal. The average testing results in terms of accuracy, precision, recall, F1 score, specificity and AUC obtained by the XGBoost algorithm and Random Forest algorithm for the three datasets are reported in Table 5. It can be seen from the experimental results that, compared with Tables 3 and 4, the recall of the XGBoost and Random Forest models on the dataset processed by feature encoding and on the dataset processed by feature encoding and SMOTE is slightly improved. However, all performance results of the two models on the other datasets are degraded.
It should be noted that in this section the performance results of the Random Forest model on the dataset processed by feature encoding and SMOTE are better than those of the XGBoost model. Additionally, different from the above two parts, on the dataset processed by feature encoding technology, the best result comes from the combination of the SMOTE method and the Random Forest model. The best recall, F1 score, accuracy, precision, specificity and AUC are 94.9%, 93.3%, 93.3%, 92.5%, 91.7% and 0.98, respectively. Likewise, the experimental results show that the two dataset balancing methods still improve the prediction performance of the models.

Results Obtained on Datasets Processed by Feature Selection and Two Dataset Balancing Methods
The performance results of the XGBoost algorithm and Random Forest algorithm for CAD prediction in respect of classification accuracy, recall, precision, F1 score, specificity and AUC for the datasets processed by feature selection technology and the two dataset balancing methods are discussed in this section. Twelve features that are considered to be the most relevant to CAD prediction are selected by the feature selection technology based on the GBDT algorithm. The details of the 12 selected features are shown in Table 6 and Figure 4: Table 6 lists the names of the 12 important features and the corresponding feature importances, and Figure 4 expresses the information in Table 6 visually. As can be seen from Table 6 and Figure 4, the 12 selected features are typical chest pain, age, TG, region RWMA, BMI, EF-TTE, ESR, weight, HTN, LDL, nonanginal and T inversion. The significant correlation and indicative relationship between these 12 medical features and CAD diagnosis have also been confirmed by medical experts and the medical literature [4], which shows the effectiveness of our feature selection method. The dataset processed by feature selection technology contains 216 CAD samples and 87 normal samples. The dataset balanced by the SMOTE method consists of 432 samples, of which 216 samples are CAD and 216 samples are normal. The dataset balanced using the ADASYN method includes 428 samples, among which the CAD samples and normal samples are 216 and 212, respectively.
The average testing results in terms of accuracy, precision, recall, F1 score, specificity and AUC obtained by the XGBoost algorithm and Random Forest algorithm for the three datasets are reported in Table 7. From these results, it can be inferred that feature selection based on the GBDT algorithm can improve the performance of the XGBoost model and Random Forest model for predicting CAD. The performance results of the two algorithms on the datasets processed by GBDT-based feature selection are significantly better than their results on the datasets processed by feature smoothing technology and feature encoding technology. At the same time, compared with Table 3, it can be seen that GBDT-based feature selection improves the performance results of the XGBoost model and Random Forest model on the original dataset and on the dataset balanced by the ADASYN method. However, on the dataset balanced by the SMOTE method, GBDT-based feature selection only improves the recall of the two models. It can therefore be concluded that feature selection based on the GBDT algorithm can improve the performance of the models for CAD prediction by identifying the most relevant, most important and least redundant features. In this section, the best results appear in the combination of the XGBoost classification model with the ADASYN method, with a classification accuracy of 94.2%, F1 score of 94.3%, recall of 94.6%, precision of 94.4%, specificity of 93.8% and AUC of 0.97. In addition, it can again be found that the XGBoost algorithm performs better than the Random Forest algorithm on the datasets used in this section.

Results Obtained on Datasets Processed by Feature Construction and Two Dataset Balancing Methods
The performance results of the XGBoost algorithm and Random Forest algorithm for CAD prediction in terms of classification accuracy, precision, recall, F1 score, specificity and AUC on the datasets processed by feature construction technology and the two dataset balancing methods are discussed in this section. Feature construction technology increases the feature dimension of the samples to 120 without changing the number of samples. Therefore, the dataset processed by feature construction technology still contains 303 samples, of which 216 are CAD and 87 are normal. The dataset balanced by the SMOTE method consists of 432 samples, with 216 CAD and 216 normal samples. The dataset balanced by the ADASYN method includes 420 samples, of which 216 are CAD and 204 are normal. The average testing results in terms of accuracy, precision, recall, F1 score, specificity and AUC obtained by the XGBoost algorithm and Random Forest algorithm on the three datasets are reported in Table 8. The experimental results show that the best performance for CAD prediction in terms of classification accuracy, recall and F1 score is obtained by the XGBoost model trained on the dataset processed by feature construction and the SMOTE method. The best recall, F1 score, accuracy, precision, specificity and AUC are 96.1%, 94.6%, 94.7%, 93.4%, 93.2% and 0.98, respectively. From this result, it can be inferred that the XGBoost model combined with feature construction technology and the SMOTE method has a significant ability to identify CAD patients. The high F1 score and AUC show that the performance of the model is stable and effective.
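The core idea of SMOTE, used above to grow the 87 normal samples to 216, is to synthesize new minority samples by linear interpolation between a minority point and one of its k nearest minority neighbours. A minimal numpy sketch of that interpolation step (illustrative only; libraries such as imbalanced-learn implement the full algorithm with efficient neighbour search):

```python
# Minimal SMOTE-style oversampling sketch: new minority samples are drawn
# on the line segment between a minority point and a random one of its
# k nearest minority neighbours. Data here is synthetic.
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every other minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]    # k nearest, excluding itself
        j = rng.choice(neighbours)
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# 87 normal samples oversampled up to the 216 CAD samples, as in the text
X_normal = np.random.default_rng(0).normal(size=(87, 12))
X_new = smote_oversample(X_normal, n_new=216 - 87, rng=0)
print(X_new.shape)  # (129, 12)
```

ADASYN follows the same interpolation scheme but allocates more synthetic points to minority samples that are harder to learn, i.e. those surrounded by majority-class neighbours.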

Discussion
Feature engineering refers to a range of techniques that apply a sequence of engineering methods to extract more relevant features from raw data in order to improve the training result of the classifier. Its function is to remove redundant features and noise in the raw data and to select or construct features that more effectively describe the relationship between the problem to be solved and the prediction model. In our work, to explore methods for detecting CAD quickly and accurately, the performance of the XGBoost algorithm and Random Forest algorithm was evaluated on datasets processed by four feature processing technologies and two dataset balancing methods. The four feature processing techniques are feature smoothing, feature encoding, feature selection and feature construction. The two dataset balancing methods are based on the SMOTE algorithm and the ADASYN algorithm, respectively. Moreover, 10-fold cross-validation was used to test the stability and accuracy of the models. The experimental results demonstrate that the four feature processing technologies have different effects on the performance of the classification model on different datasets. Among them, the impact of feature construction technology is the most prominent. Figure 5a-d shows the effects of the four feature processing technologies and the two dataset balancing methods on the performance of the XGBoost algorithm in terms of accuracy, recall, F1 score and AUC, respectively. It can be seen from Figure 5 that on the dataset processed by feature construction technology and the SMOTE algorithm, the XGBoost classification model produces the best performance in terms of classification accuracy, recall, F1 score and AUC. In this setting, the XGBoost classification model has the strongest ability to recognize positive samples.
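The 10-fold cross-validation protocol used throughout can be sketched as follows with scikit-learn. Random Forest (one of the paper's two classifiers) stands in here; `xgboost.XGBClassifier` plugs into `cross_validate` the same way. The data is synthetic, with the ~71%/29% class ratio of the real dataset; the hyperparameters are not the paper's.

```python
# Sketch of the 10-fold cross-validation evaluation protocol.
# StratifiedKFold preserves the CAD/normal ratio in every fold,
# which matters for an imbalanced dataset like this one.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# 303 samples with roughly 71% positive (CAD) class, as in the dataset
X, y = make_classification(n_samples=303, weights=[0.29, 0.71], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(RandomForestClassifier(random_state=0), X, y, cv=cv,
                        scoring=("accuracy", "precision", "recall", "f1", "roc_auc"))

# per-metric average over the 10 folds, as reported in the tables
for metric in ("accuracy", "recall", "roc_auc"):
    print(metric, scores["test_" + metric].mean())
```

Averaging each metric over the ten folds yields the table entries; the fold-to-fold standard deviation is a direct read-out of the stability the text refers to.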
Secondly, the feature selection technology based on the GBDT algorithm also significantly improves the classification model. On the dataset processed by feature selection technology and the ADASYN algorithm, the XGBoost classification model likewise achieves good classification performance. It is worth noting that these results are achieved using only the 12 features selected by the GBDT algorithm. By contrast, feature smoothing technology and feature encoding technology have little effect on enhancing the capability of the classification model. Furthermore, it can also be seen from Figure 5 that the two dataset balancing methods significantly improve the performance of the classification model.
In addition, Table 9 compares the performance of our proposed method with that of previous studies on the Z-Alizadeh Sani dataset reported in the literature. It can be seen from Table 9 that the proposed method achieves better performance than existing research. It should be noted that many values in Table 9 are marked as NR, which indicates that the corresponding classification metrics were not reported in the literature. However, these metrics are vital for evaluating the performance of medical models, especially classification models trained on imbalanced datasets. Additionally, in Table 9, the column "FeatureNums" refers to the number of features used for model training and testing. According to this column, feature selection technology has been applied in almost all studies reported in the literature. In our study, the 12 features considered most relevant to CAD prediction were selected by the GBDT algorithm, and the performance of the classification model trained on these 12 features, the smallest number of features among the compared studies, is very promising. Although some performance results reported in the literature [16,17] are better than ours, the following points should be noted: (1) the accuracy and recall in [16] were obtained on 500 samples; (2) the numbers of features used in [16,17] were larger than in our study; and (3) our proposed method also achieves very competitive results in terms of specificity, precision, F1 score and AUC, in particular the best AUC. Overall, the experimental results clearly demonstrate the robustness and stability of our proposed method for CAD diagnosis and prediction, and Table 9 shows that it provides better results than other existing studies.
In consideration of the above, the major advantages of our proposed method are as follows: (1) application of the XGBoost algorithm to the Z-Alizadeh Sani dataset for earlier and more effective diagnosis of CAD; (2) a series of feature processing techniques, namely feature smoothing, feature frequency encoding, feature construction and feature selection, were applied to the Z-Alizadeh Sani dataset to reduce feature redundancy and improve the accuracy of classification models for CAD prediction; (3) application of the feature selection method based on the GBDT algorithm to the Z-Alizadeh Sani dataset; (4) two classical dataset balancing methods were applied to the Z-Alizadeh Sani dataset to solve the problem of dataset imbalance; and (5) classification metrics such as accuracy, recall, specificity, precision, F1 score and AUC were used to validate model performance. Of course, our research also has some shortcomings: (1) the dataset used was small; and (2) additional ensemble learning techniques were not tried in this research. These will be directions for future work.

Conclusions
CAD is one of the diseases with the highest morbidity and mortality in the world. Rapid and accurate CAD detection is a goal pursued by many researchers, scholars and doctors around the world. In this study, four different feature processing techniques, namely feature smoothing, feature encoding, feature construction and feature selection, were applied to the Z-Alizadeh Sani dataset to explore methods that can improve the performance of classification models for CAD detection. We used the XGBoost algorithm and Random Forest algorithm as classifiers and applied a 10-fold cross-validation technique to test the stability of the models. The SMOTE algorithm and ADASYN algorithm were used to balance the imbalanced dataset. Model evaluation metrics such as accuracy, recall, specificity, precision, F1 score and AUC were used to evaluate the performance of the classification models. Experimental results show that, compared with the most advanced algorithms in the literature, our method is very competitive and can be used by medical staff for clinical auxiliary diagnosis.

Conflicts of Interest:
The authors declare no conflict of interest.