Classification Comparison of Machine Learning Algorithms Using Two Independent CAD Datasets

In the last few decades, statistical methods and machine learning (ML) algorithms have become efficient tools in medical decision-making. Coronary artery disease (CAD) is a common type of cardiovascular disease that causes many deaths each year. In this study, two CAD datasets from different countries (TRNC and Iran) are tested to understand the classification efficiency of different supervised machine learning algorithms. The Z-Alizadeh Sani dataset contained 303 individuals (216 patients, 87 controls), while the Near East University (NEU) Hospital dataset contained 475 individuals (305 patients, 170 controls). The study was conducted in three stages: (1) each dataset, as well as their merged version, was analyzed separately with a random sampling method to obtain train-test subsets; (2) the NEU Hospital dataset was assigned as the training data and the Z-Alizadeh Sani dataset as the test data; (3) the Z-Alizadeh Sani dataset was assigned as the training data and the NEU Hospital dataset as the test data. Among all ML algorithms, Random Forest showed successful classification performance at each stage. The least successful ML method was kNN, which underperformed at every stage. The other methods, including logistic regression, showed varying classification performance from stage to stage.


Introduction
In the past few decades, the incidence and mortality of cardiovascular diseases in developing countries have increased year by year [1]. The WHO 2020 report underlines that the number of people who die from heart disease will continue to rise in the coming years. CAD is a common disease of the heart and blood vessels and one of the most frequent causes of death [2,3]. In CAD, layers of fat accumulate in the coronary arteries; this narrows the arteries and restricts blood flow, resulting in hypoxia of the heart muscle. In extreme cases, this lack of oxygen to the heart can lead to a heart attack, which can be fatal [4].
Machine learning (ML) is a relatively new and efficient data analysis approach for scientific studies. Evaluating how well various ML techniques classify individuals with and without certain health conditions is of great interest. As many researchers have suggested, ML techniques are likely to provide better accuracy in data classification. Achieving noticeable accuracy in the predicted result is vital, as it reflects high classification performance.
In recent years, many studies have applied ML algorithms to different health problems, including CAD. In 2016, Dwivedi et al. applied six ML techniques to CAD data: artificial neural network (ANN), support vector machine (SVM), LR, k-nearest neighbor (kNN), classification tree, and naïve Bayes. The results of the techniques were cross-checked with receiver operating characteristic (ROC) curves; LR and ANN achieved the highest classification accuracy [5].
In a different study, Ayatollahi et al. [6] used a dataset of 1324 individuals from AJA University; after normalizing the data, they applied SVM and ANN algorithms. Abdar et al. (2019) tested linear SVM (Lin SVM), SVC, and nuSVM methods and reported that the N2 Genetic nuSVM algorithm achieved the best accuracy and F1 score on the Z-Alizadeh Sani dataset [7].
The study of Akella et al. (2021) applied six different ML algorithms to predict whether the patients included in the Cleveland Dataset have CAD. The study pointed out that all six ML algorithms had high accuracy, while the "neural network" had the highest among all [8].
In 2018, Cuvitoglu et al. began with feature selection, followed by principal component analysis (PCA) to reduce the dimensionality of a small sample. Ten-fold cross-validation was used to evaluate the classification success of the ML methods; ANN achieved the best area-under-the-ROC-curve (AUC) among the six methods [9]. Kutrani et al. tested several ML techniques on the Benghazi Heart Center dataset, which had a sample size of 1770 and contained 11 attributes; the SVM and kNN algorithms had the highest rates of correct classification [10]. Tougui et al. used heart disease data with 13 features and 303 cases, employing six data mining software packages along with six ML algorithms [11]. Naushad et al. conducted a study with a dataset of 648 subjects, applying algorithms to detect the risk factors associated with CAD; the AUC, sensitivity, and specificity of each algorithm were reported [12]. Most of the ML work conducted thus far has drawn upon the UCI Heart Disease datasets [2,5-9,11]. In this study, the ML algorithms that researchers have suggested perform better in classification were tested on two different datasets across distinct stages.
The current study compares the classification performance of several ML algorithms using two independent CAD datasets obtained from two countries: data collected from the Near East University Hospital Cardiology Clinic in TRNC, and the open-access Z-Alizadeh Sani CAD dataset from the UCI Machine Learning Repository. These two datasets were run through a multi-stage analysis approach, including a novel cross-validation of the discovered classification rules.

Database
In this study, there are two independent CAD datasets: a dataset of 475 individuals (305 CAD patients, 170 healthy controls) from the Department of Cardiology, NEU Hospital, Nicosia, TRNC, and a dataset of 303 individuals (216 CAD patients, 87 healthy controls) from the Z-Alizadeh Sani dataset of UCI ("UCI Machine Learning Repository: Z-Alizadeh Sani Data Set", 2020) [13]. The NEU Hospital dataset was gathered from the hospital information management system and contained patient data from 2016 to 2020. Data collection was carried out between November 2019 and April 2020.
The number of variables in the two datasets was not equal. Hence, the study included the 29 variables common to both datasets to ensure the study's integrity, while the remaining variables were excluded.
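The selection of mutual variables and the merging of the two datasets can be sketched in Python with pandas; the column names and values below are hypothetical stand-ins, not the study's actual 29 variables.

```python
import pandas as pd

# Hypothetical sketch of aligning the two CAD datasets on their shared
# ("mutual") variables; the columns below are illustrative stand-ins,
# not the study's actual 29 variables.
neu = pd.DataFrame({"Age": [54, 61], "Sex": [1, 0], "BMI": [27.1, 30.2], "CAD": [1, 0]})
sani = pd.DataFrame({"Age": [58, 49], "Sex": [0, 1], "FBS": [110, 95], "CAD": [1, 1]})

# Keep only the variables present in both datasets.
mutual = [c for c in neu.columns if c in sani.columns]
neu_mutual, sani_mutual = neu[mutual], sani[mutual]

# The merged version stacks both datasets over the shared variables.
combined = pd.concat([neu_mutual, sani_mutual], ignore_index=True)
print(mutual)         # the shared variable names
print(len(combined))  # total number of individuals after merging
```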
Descriptive analysis of the data is presented in Tables 1 and 2.

Classification Methods
Artificial intelligence methods facilitated the classification and the determination of the validity of the results. There were six different ML (supervised learning) methods in this study. These methods are k-nearest neighbors (kNN), support vector machine (SVM), random forest (RF), artificial neural network (ANN), naïve Bayes (NB), and logistic regression (LR).

k-Nearest Neighbors (kNN)
This algorithm classifies objects based on the proximity relations between them. It works in the coordinate plane, identifying an object's neighbors by the Euclidean distance between data points [14,15].
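As a minimal sketch (using scikit-learn rather than the Orange toolkit used in the study), kNN classification with Euclidean distance on toy points might look like this:

```python
from sklearn.neighbors import KNeighborsClassifier

# Minimal kNN sketch with Euclidean distance; the points are toy values,
# not study data, and scikit-learn here stands in for Orange.
X_train = [[1, 1], [1, 2], [8, 8], [9, 8]]
y_train = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)

# A new point receives the majority label of its 3 nearest neighbors:
# (2, 1) lies closest to the two class-0 points, so it is labeled 0.
pred = knn.predict([[2, 1]])
print(pred)  # [0]
```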

Support Vector Machine (SVM)
SVM overcomes the problem of overfitting by applying the concept of structural risk minimization and seeks the optimal hyperplane between the two classes [5]. The algorithm models the relationship between dependent and independent variables [16-18].

Random Forest (RF)
An RF is an ensemble of decision trees that are created and tested [19,20]. RF can handle large datasets with automatic variable selection and many estimators, and is reported to provide unbiased estimates [21].
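A hedged sketch of an RF classifier with scikit-learn (not the Orange implementation used in the study), on toy data:

```python
from sklearn.ensemble import RandomForestClassifier

# Sketch of an RF: an ensemble of decision trees fit on bootstrap samples,
# with a random feature subset considered at each split. Toy data only.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Each tree votes; the forest returns the majority class for each point.
print(rf.predict([[0.5, 0.5], [5.5, 5.5]]))  # [0 1]
```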

Artificial Neural Network (ANN)
ANN models carry the functional characteristics of biological neural systems. The challenges of using ANNs are the time required to train networks on complex data and the fact that they behave like black boxes: the model can change, but the user cannot interfere with the final decision-making process [22,23].

Naïve Bayes (NB)
The naïve Bayes classifier re-scans the entire dataset for each new classification operation, which can make it operate relatively slowly [24].

Logistic Regression (LR)
LR iteratively estimates the linear combination of variables most likely to determine the observed outcome [25].

Statistical Analysis
Descriptive statistics were computed: mean, standard deviation, minimum, and maximum values for quantitative variables, and frequencies and percentages for qualitative variables (Tables 1 and 2). The datasets were tested for normality with the Kolmogorov-Smirnov and Shapiro-Wilk tests, where applicable. Since the datasets did not fulfil the parametric assumptions, the Mann-Whitney U test was applied to quantitative variables and the Chi-square test to qualitative variables (Tables 3 and 4). IBM SPSS software (Demo Version 21.0 for Windows) was used for statistical analysis. To classify the data, six different ML algorithms were applied with the Orange program (Version 3-3.29.3). All analyses were performed on a laptop with an Intel(R) Core(TM) i5-7200U CPU @ 2.50 GHz, 4.00 GB of RAM, and a 64-bit operating system. Figure 1 provides a schematic description of the analysis performed at each step. All variables were included to find out how accurately the ML methods performed the classification. Each step of the study produced a different output and was evaluated separately. In Step 1, the ML algorithms were applied to the separate and combined datasets with random sampling settings (train/test repeated 10 times, training set size 66%) (Table 5, Figure 2). In Step 2, the NEU Hospital dataset was assigned as the training dataset and the Z-Alizadeh Sani dataset as the test dataset (Table 6, Figure 3). In Step 3, the NEU Hospital data were assigned as the test data, the Z-Alizadeh Sani data were the training data, and the ML algorithms were applied accordingly (Table 7, Figure 4). The aim was to apply the rule obtained from the training dataset to the test dataset to see which ML algorithms perform better for classification.
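The Step 1 sampling scheme (10 random train/test splits with a 66% training share, averaging the results) could be sketched as follows with scikit-learn rather than Orange; the data here are synthetic, not the CAD datasets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Sketch of the Step 1 sampling scheme: 10 random train/test splits with
# a 66% training share, averaging the accuracy. Synthetic data only.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary target

scores = []
for seed in range(10):  # train/test repeated 10 times, as in the study
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.66, random_state=seed)
    model = LogisticRegression().fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

print(f"mean CA over 10 random samples: {sum(scores) / len(scores):.3f}")
```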
The Orange software provided the average classification success measures for the target class. The performances of the applied algorithms are summarized with five measures: AUC, classification accuracy (CA), F1 score (the harmonic mean of precision and recall), precision, and recall. AUC results are shown with ROC curves (Figures 2-4). Table 3 shows the bivariate statistical hypothesis test results for each dataset, highlighting the variables that differed between CAD patients and the control group. The PR (p = 0.002) and EF-TTE (p = 0.001) variables were statistically significant in the NEU Hospital data. The variables showing statistically significant differences in the Z-Alizadeh Sani data were age, systolic BP, PR, FBS, TG, K, lymph, Neut, and EF-TTE (p < 0.05). The age range of people with CAD was 32-89 in the NEU Hospital data and 36-86 in the Z-Alizadeh Sani data. The average BP of people with the disease was 124.65 mm/Hg in the NEU Hospital data and 132.41 mm/Hg in the Z-Alizadeh Sani data.
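The five reported measures can be reproduced with scikit-learn on a small hypothetical set of predictions (not the study's actual outputs):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# The five performance measures, computed on hypothetical predictions.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]                    # 1 = CAD, 0 = control
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.95]   # predicted P(CAD)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]      # 0.5 decision threshold

print("AUC      :", roc_auc_score(y_true, y_prob))   # ranking quality
print("CA       :", accuracy_score(y_true, y_pred))  # classification accuracy
print("F1       :", f1_score(y_true, y_pred))        # harmonic mean of P and R
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
```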
Considering the mean values of the PR and EF-TTE variables, which were statistically significant in both datasets, in people with CAD the mean PR was 75.90 ppm in the NEU Hospital data and 76.09 ppm in the Z-Alizadeh Sani data, while the mean EF-TTE was 56.81 in the NEU Hospital data and 45.91 in the Z-Alizadeh Sani data. Table 4 shows the Chi-square test results for each of the two datasets. There are 148 women and 327 men in the NEU Hospital dataset; 68.2% of the women and 62.4% of the men have CAD. There are 127 women and 176 men in the Z-Alizadeh Sani dataset; 67.7% of the women and 73.9% of the men have CAD. The variables statistically associated with the presence of CAD in the NEU Hospital dataset were current smoker (p = 0.004), systolic murmur (p = 0.019), chest pain (p < 0.001), dyspnea (p < 0.001), and region RWMA (p < 0.001). In the Z-Alizadeh Sani dataset, the significantly associated variables were DM (p < 0.001), HT (p < 0.001), chest pain (p < 0.001), dyspnea (p = 0.029), and region RWMA (p < 0.001). In the NEU dataset, 139 people actively smoked, compared with only 63 in the Z-Alizadeh Sani data. Of the 139 active smokers in the NEU Hospital dataset, 103 had CAD, and 187 of the 206 people with chest pain in the NEU Hospital data were CAD patients. In the Z-Alizadeh Sani data, 154 of the 164 people with chest pain were CAD patients. The dyspnea variable showed a statistically significant association with CAD in both datasets: 63 of the 75 affected people in the NEU Hospital dataset had CAD, as did 87 of the 134 in the Z-Alizadeh Sani dataset. The region RWMA variable was also statistically significant in both datasets: 61 of the 70 affected people in the NEU Hospital dataset were CAD patients, as were 82 of the 86 in the Z-Alizadeh Sani dataset.

Results
At the first step of the analysis, the ML algorithms were applied to each dataset and to the combined dataset one by one. Then, in Steps 2 and 3, the rule obtained in one dataset was tested in the other dataset for cross-validation. In Step 1, the training and testing settings were applied to each dataset as well as to the combined dataset (Figure 1). All the tables and figures below contain the classification results for the target variable (CAD). Six ML classification techniques were applied to each dataset and to the combined dataset separately. First, each dataset was split into training and test sets, and the ML algorithms were applied. According to the AUC results in Step 1, SVM reached 81.1%, ANN 79.8%, and LR 81.3% in classifying the NEU data. In the Z-Alizadeh Sani data, SVM reached 90.8%, naïve Bayes 91.4%, and LR 92.4% (Figure 2). When both datasets were combined and the ML algorithms repeated, SVM showed a classification success of 82.6%, ANN 83.4%, and LR 85.1%. For the CA results, in the NEU Hospital dataset SVM reached 81.1%, ANN 75.4%, and LR 76.5%. The CA results for the second dataset, the Z-Alizadeh Sani dataset, were 83.2% for RF, 84.4% for ANN, and 86.5% for LR. In the combined dataset, SVM reached 78.6%, ANN 78.2%, and LR 79.5% (Table 5).
The results of the first step are shown in Table 5, and the AUC results for each dataset and the combined dataset are shown in Figure 2 as ROC graphics. According to the ROC graphs, in the NEU Hospital dataset the SVM (81.1%) and LR (81.3%) algorithms cover the largest area, and the algorithm covering the smallest area is kNN (52.7%). In the Z-Alizadeh Sani dataset, the largest areas belong to the naïve Bayes (91.4%) and LR (92.4%) algorithms, and the smallest to kNN (46.8%). In the combined dataset, the ANN (83.4%) and LR (85.1%) algorithms covered the largest area, while kNN (52.2%) covered the smallest (Figure 2).
In the second and third steps, the classification successes of the ML algorithms were observed comparatively.
In the second step (Figure 1, Table 6), the program used in this study facilitates classification by applying the rule learned from the training dataset to the test data. As shown in the second step of Figure 1, the rule learned from the NEU dataset was tested on the Z-Alizadeh Sani dataset. The CA results were kNN 65.7%, SVM 71.3%, RF 77.6%, ANN 28.7%, naïve Bayes 75.6%, and LR 28.7%.
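The Step 2 design, learning the rule on one dataset and evaluating it on a fully independent one, can be sketched as follows; both datasets below are synthetic stand-ins (only the sample sizes, 475 and 303, match the study).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Sketch of the cross-dataset design: fit ("learn the rule") on one dataset
# and evaluate on a fully independent one. Synthetic stand-in data only.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(475, 5))        # stands in for the NEU Hospital data
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(303, 5))         # stands in for the Z-Alizadeh Sani data
y_test = (X_test[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)                # the rule is learned on training data only

ca = accuracy_score(y_test, model.predict(X_test))
print(f"cross-dataset CA: {ca:.3f}")
```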
At this step, the best AUC results were RF 79.5% and naïve Bayes 86.1% (Table 6, Figure 3). The ROC graphs of the AUC results in Figure 3 show the classification success of each ML algorithm. The LR algorithm covered the smallest area, with 47.9%.
In the third and final stage, the Z-Alizadeh Sani dataset was the training data and the NEU Hospital dataset was the test data. The AUC results of the applied ML algorithms were SVM 76.3%, RF 77.7%, and ANN 76.1% (Table 7). When the learned rule was applied to NEU data, the CA results were SVM and LR 71.6%, RF 73.7%, and ANN 71.8% (Table 7, Figure 4).
In the first stage, where the ML algorithms were applied to each dataset separately, SVM showed the highest CA in the NEU data, while LR showed the highest CA in the Z-Alizadeh Sani data and the combined dataset. Considering the CA results overall, SVM was the most successful algorithm. In the second and third stages, where the rule learned in one dataset was tested in the other, the two ML algorithms with the highest AUC in the second stage (Table 6, Figure 3) were naïve Bayes (86.1%) and RF (79.5%), and the two most successful in the third stage were RF (77.7%) and SVM (76.3%). In summary, LR provided good results within a single dataset but did not perform well when its learned rule was applied to another dataset.

Discussion
The study aims to contribute to previous work and to develop a new approach for CAD, a disease that causes a great number of deaths worldwide. It examined whether statistically significant variables distinguish diseased from healthy individuals in two separate datasets. After the statistically significant variables were determined, the classification stage began, with the goal of observing the results of applying established classification methods to two different datasets. Unlike the majority of current studies, the two independent datasets were used as training and test data for each other, and the classification success of the ML algorithms was evaluated in a cross-classification manner. This approach provided better insight into ML classification performance for detecting CAD patients by validating the rules discovered in one dataset through testing on another, independent dataset. A low recall (high incidence of false negatives) in disease prediction would misdiagnose individuals with CAD as healthy, which could have disastrous repercussions.
Similar studies conducted in 2021 are given in Table 8. As a contribution to the listed studies, the current study showed how successful ML algorithms are in the problem of CAD classification. The studies in Table 8 mostly used one or, on rare occasions, two datasets and the classification performances were measured by applying ML algorithms within the single datasets rather than cross-validating over different independent data, as was done in our current study.
Table 9 summarizes the overall results of the research, showing the algorithms that achieved the best classification at each of the three stages. Our findings highlight that the RF algorithm is the most successful in classification, as it scored an AUC above 75% at each step of the study (Table 9). In the first stage, when the ML algorithms were applied to each dataset separately, the most successful algorithm was LR, although RF showed similar performance: LR showed the highest success in the Z-Alizadeh Sani data (92.4%), and in the same dataset RF reached 89.6%. Table 9. Classification success of the research.

AUC             | Step 1                        | Step 2            | Step 3
Lower than 60%  | kNN                           | kNN, SVM, ANN, LR | kNN
Higher than 75% | SVM, RF, ANN, Naïve Bayes, LR | RF, Naïve Bayes   | SVM, RF, ANN, LR

The kNN algorithm failed to classify individuals successfully at each stage of the research. The LR algorithm, which was successful in the first stage (AUC of 0.813 to 0.924), failed in the second step (AUC = 0.479). Considering overall success, the RF algorithm achieved a successful classification result at every step. Other studies conducted in 2021 have also shown that the RF algorithm achieves high classification performance.
In a study from 2016 by Z-Alizadeh Sani et al., artificial intelligence classification algorithms were applied to detect CAD on a single dataset. They applied training and test analysis to a single dataset, defining 90% of the data as training and 10% as test data; thus the obtained rule was tested on the same data, so a high classification performance might be expected [32].
In Steps 2 and 3, one dataset serves as training data while the other serves as test data, allowing the cross-classification results to be observed. In Step 2, this research defined the NEU Hospital dataset as the training data and the Z-Alizadeh Sani dataset as the test data; the aim was to test the rule learned in the NEU Hospital dataset on the other dataset. LR was not successful in this analysis: when the number of observations is not sufficiently larger than the number of features, LR may overfit. Similarly, since a neural network loosely mimics the human brain, it may not yield good results when there are not enough samples. The naïve Bayes and RF classification methods gave the best results [33,34].
In Step 3, the Z-Alizadeh Sani data were the training data and the NEU Hospital data the test data. Classification methods other than kNN gave very close results, with the RF method achieving the best correct classification. The rule learned in the Z-Alizadeh Sani dataset was tested in the NEU Hospital dataset and gave a successful classification result. In a different study conducted in 2016, seven ML methods were compared on a single dataset using the TOPSIS method, and naïve Bayes (79.1%) gave the best result [1]. Naïve Bayes showed similar success in our rule-transfer tests; a significant feature of the present study is that the two datasets include people from two different countries with different genetic backgrounds, which increases the value of the results. In a similar study conducted in 2013, ML techniques were used for classification over two datasets and the results were interpreted using the AUC of the ROC curve, which underlines how appropriate the AUC is for evaluating classification; LR gave the best results there. Although the number of patients in that study was considerably higher than in our data, it also achieved better classification results [35].
In the study of Chen et al. (2020) [36], as in this study, ML classification techniques were used for CAD detection and proved to be very useful and practical in the field of health. Our corresponding results are given in Tables 6 and 7: testing the rule learned in the Z-Alizadeh Sani data on the NEU Hospital data gave a very successful result. It is important that the classification is performed by the ML techniques without user interference. Many studies test ML techniques on a single dataset, or classify separately even when two different datasets are used. Among the single-dataset AUC results, the most successful result in the combined dataset was LR with 85.1%; in the Z-Alizadeh Sani dataset, LR reached 92.4% and naïve Bayes 91.4%; and the NEU Hospital dataset was classified successfully with SVM (81.1%) and LR (81.3%).
In this study, the rule learned in the Z-Alizadeh Sani data gave the more successful results. When the rule learned in the NEU data was tested on the other dataset, only two ML algorithms scored above 70%; when the rule learned in the Z-Alizadeh Sani data was tested on the other dataset, five of the six algorithms exceeded 70%. The success of ML techniques in disease detection has been tested repeatedly and shown to improve day by day.
In particular, the cross-classification approach used in the current study underlines that it is beneficial to use independent datasets from different geographies to evaluate the performance of techniques more rigorously. Beyond the success of ML techniques in classification, their contribution to health problems is quite significant: ML techniques adapted for diagnostic purposes can identify relevant variables for a disease, and their performance can be compared. This paper shows that applying ML techniques for predictive analysis can provide advantages in diagnosing diseases earlier, which will help children, young adults, and the elderly with treatment and decision-making.

Conclusions
Today, especially in health care, data mining is a necessity, and data need to be transformed into information with the help of ML algorithms. The ML classifiers applied here demonstrated their performance in terms of accuracy: the results of all ML algorithms applied across the three steps show the classification success for CAD patients, the target variable. This remains a great challenge in the medical field, one that pushes efforts to develop ML methods, to take advantage of information intelligently, and to extract the best knowledge.
The assumption is that the outputs of the standard models will be simple to comprehend and explainable to non-machine-learning readers.
This study aims to help researchers make the right choices in the future. Despite the progress made in recent years, there are significant shortcomings in ML-based detection of CAD that need to be addressed in the coming years. In future work, this study will be extended with deep learning, focusing on what happens if the training dataset becomes ambiguous.

Data Availability Statement:
The Z-Alizadeh Sani dataset, one of the datasets supporting this article, is available at https://archive.ics.uci.edu/ml/datasets/Z-Alizadeh+Sani.

Conflicts of Interest:
The authors declare no conflict of interest.