Comparison of Support Vector Machine, Naïve Bayes and Logistic Regression for Assessing the Necessity for Coronary Angiography

(1) Background: Coronary angiography is considered to be the most reliable method for the diagnosis of cardiovascular disease. However, angiography is an invasive procedure that carries a risk of complications; hence, it would be preferable for an appropriate method to be applied to determine the necessity for angiography. The objective of this study was to compare support vector machine, naïve Bayes and logistic regressions to determine the diagnostic factors that can predict the need for coronary angiography. These models are machine learning algorithms. Machine learning is considered to be a branch of artificial intelligence. Its aims are to design and develop algorithms that allow computers to improve their performance on data analysis and decision making. The process involves the analysis of past experiences to find practical and helpful regularities and patterns, which may also be overlooked by a human. (2) Materials and Methods: This cross-sectional study was performed on 1187 candidates for angiography referred to Ghaem Hospital, Mashhad, Iran from 2011 to 2012. A logistic regression, naive Bayes and support vector machine were applied to determine whether they could predict the results of angiography. Afterwards, the sensitivity, specificity, positive and negative predictive values, AUC (area under the curve) and accuracy of all three models were computed in order to compare them. All analyses were performed using R 3.4.3 software (R Core Team; Auckland, New Zealand) with the help of other software packages including receiver operating characteristic (ROC), caret, e1071 and rminer. (3) Results: The area under the curve for logistic regression, naïve Bayes and support vector machine were similar—0.76, 0.74 and 0.75, respectively. Thus, in terms of the model parsimony and simplicity of application, the naïve Bayes model with three variables had the best performance in comparison with the logistic regression model with seven variables and support vector machine with six variables. (4) Conclusions: Gender, age and fasting blood glucose (FBG) were found to be the most important factors to predict the result of coronary angiography. The naïve Bayes model performed well using these three variables alone, and they are considered important variables for the other two models as well. According to an acceptable prediction of the models, they can be used as pragmatic, cost-effective and valuable methods that support physicians in decision making.


Introduction
Cardiovascular disease (CVD) is the most common cause of death in most countries, including Iran, and the most important cause of disability. Approximately 17.5 million people die of CVD each year [1]. In spite of the improvement in diagnosis and treatment, one-third of patients suffering from myocardial infarction die, and two-thirds of those who survive will never recover completely. This imposes a significant cost on health systems [2].
CVD is a disease that affects the heart or blood vessels. Coronary artery disease (CAD) is the most common type of this class. CAD results when the arteries that supply blood to heart muscle become hardened and narrowed. As a result, the heart muscle cannot receive the blood or oxygen it needs. This can lead to chest pain (angina), or a heart attack (myocardial infarct). A heart attack occurs when at least one coronary artery is completely blocked [3].
Angiography is the most reliable method among the various methods of diagnosis for CAD. However, angiography is a costly and invasive procedure [3]. Moreover, various studies have shown that angiography may result in complications [4]. Thus, it is important to use methods that may increase the accuracy of diagnostic and therapeutic tests such as angiography.
Machine learning is considered to be a branch of artificial intelligence. Its aims are the design and development of algorithms that give permission to computers for improving their performances with data. The process involves the analysis of past experiences to find practical and helpful regularities and patterns that a human might overlook. The development of automatic models is a central focus of machine learning research, for instance, rules and patterns extracted from data. Learning patterns contribute to machine learning and are derived from animal and human learning patterns, but can now be used to teach us how animals and humans learn [5].
Medical and computer scientists have worked together for several years, and data mining has allowed the extraction of worthwhile information from large data sets. It is possible that data mining tools could be applied as an alternative to using expensive and invasive medical procedures [5].
In clinical management, tests are often applied to establish the presence or absence of disease. Therefore, for confirming the presence of cardiovascular disease such as coronary heart disease and stroke, the patients may undergo several tests (biochemical tests, rest electrocardiogram (ECG), echocardiography, stress tests or angiography). Some are invasive, uncomfortable for patients, expensive and time consuming. Data mining techniques are capable of identifying high-risk patients to define the most important variables in cardiovascular patients. They also have the ability to allow the building of a risk model [5].
Presently, data mining and machine learning have attracted attention in the prediction and diagnosis of disease, however, these methods are very complex, and require a high level of experience and knowledge. However, their application may reduce medical errors and improve the quality of diagnosis. Logistic regression, support vector machine and naïve Bayes are the most widely used models.
A logistic regression is a generalized linear model with canonical link functions. A logistic regression is used when the dependent variable is categorical, which is the most common model for binary response data. This model calculates the probability of events by using a linear function of the values of explanatory variables. Furthermore, having no limitation for these explanatory variables is considered a benefit of this model [6][7][8].
A support vector machine is considered to be an extension of the maximum margin classification, because it is able to classify nonlinear classifications by the selection of a line with maximum-margin with the help of some points called support vectors. The support vector machine model allows us to increase the feature space using the kernel functions so that the computations are efficient and manageable.
The naïve Bayes method is based on Bayes' theorem. In this method, the probability that a multiplicity X (x 1 , x 2 , . . . , x n ) belongs to a category is measured. Bayes' theorem makes it possible to calculate the posterior probability based on the prior probabilities.
The computation of this model is simple since it is assumed that the value of a particular feature is independent of the value of any other feature [9].
Therefore, the aim of this study was to investigate and compare the support vector machine, naïve Bayes and logistic regression to determine the diagnostic factors for the necessity of angiography.

Materials and Methods
This cross-sectional study was performed on 1187 candidates referred for angiography to Ghaem Hospital in Mashhad, Iran from 2011 to 2012. The study was approved by the Mashhad University of Medical Sciences Ethics Committee (reference number 900671). A positive coronary angiogram is usually defined as having an obstruction greater than 50% of at least one coronary artery, and a negative angiogram indicates an obstruction of less than 50% of any coronary artery.
Patients over 18 years of age, who were candidates for the angiography according to a specialist cardiologist, and who agreed to participate in the study, were enrolled.
Patients with the following conditions were excluded from the study: illness such as kidney or chronic liver disease, rheumatism, immune deficiency disease, any type of cancer, inflammatory disease such as inflammatory bowel disease, infectious disease in the last three months, history of any type of surgery in the last three months, history of coronary angioplasty, taking drugs such as steroids, penicillin, oral contraceptives, hormone replacement therapy, pregnancy and lactation.
Two checklists were used to assess variables that included information on patients' medical records and laboratory results. The variables included in this study were: age, gender, marital status, levels of education, smoking, high fat, history of cardiovascular disease, history of myocardial infarction, family history of cardiovascular disease, family history of kidney disease, family history of hypertension, fasting blood glucose (FBG), serum TGs (triglycerides), high blood pressure, serum HDL (high-density lipoprotein), and the results of angiography.

Statistical Models
Logistic regression, support vector machine and naïve Bayes models were used in this research to predict the necessity of angiography. A logistic regression is a generalized linear model with a canonical link function. In terms of easy computation, a logistic regression is the best model among generalized linear models for binary response data. Interpreting the result of this model is very simple and logical in medical studies because of using odds instead of risk in the link function [6,10].
The success probability at value x b = Determines the rate of change; when b > 0, p increases as x increases and when b < 0, p decreases as x increases a = Value of p when b = 0 A support vector machine is one of the supervised learning models used to classify data. The aim of an Support Vector Machine (SVM) is to perform a linear classification and also select a line with a maximum-margin.
The SVM attempts to select a decision boundary to maximize the minimum distance to each of the categories. SVM is robust in the presence of noise. The process of selecting the best decision boundary is based on some points called support vectors.
The aim of an SVM is to maximize the margin. The following formulae help us to reach this aim. Margin = 2 ||w||2 W: normal vector of the hyperplane; the normal vector is perpendicular to the both hyperplanes which are parallel to each other; therefore, with the help of a normal vector we can calculate the distance between the two hyperplanes. In SVMs, the aim is to choose w in a way that it can maximize the distance between hyperplanes.
Maximizing the margin is equivalent to minimizing the following formula: Moreover, an SVM could efficiently perform a nonlinear classification by using the kernel functions that map data into high-dimensional feature spaces. The radial kernel function used in this paper takes the form training observation x i in terms of the Euclidean distance, then p j=1 x * j − x ij 2 will be large, and so will be very small. This means that in the nonlinear function x i will have no effect on f(x*). The predicted class label for X* (test observation) is based on the sign of f(x*), so the training observations that are far from x* will not have any effect on predicting the class label for x*. In other words, the radial kernel exhibits local behavior because merely nearby training observations play a role in the predicted class label for a test observation [8].
The naïve Bayes method is based on Bayes' theorem. This theorem makes it possible to calculate the posterior probability based on the prior probabilities. P(c) and P(A) are the probabilities of observing A and C independently of each other; this is known as the marginal probability. P (A|c) is a condition probability: the likelihood of event A occurring given that C is true . Firstly, p(C|A1,A2,A3, . . . , An) is calculated for all categories with the help of Bayes' theorem.
. The category with the maximum value of p(C|A1,A2,A3, . . . , An) is selected. Since the denominator of the mentioned formula is the same for all categories, we aim to find the category with the maximum value of the numerator. On the other hand, the value that should be computed for each set of data is p (A1, A2, A3, . . . , An|C).
A sensitivity analysis was used to select the most important variables in naïve Bayes and SVM models. The variables that optimize the value of the maximum area under the curve were selected as final variables. In addition, since there was no significance difference between the SVM with polynomial and radial kernel functions, the latter that had fewer parameters was selected. Regarding the logistic regression, the backward method was used to select the variables.
The sensitivity, specificity, positive and negative predictive values, area under the curve and accuracy of all three models were computed in order to compare them. Two-thirds of the data were randomly chosen as the training set and the model sees and learns from them. The remaining one-third of data were considered as the test set. It is only used once a model is completely trained, and the test set is generally used to evaluate the competing model. All analyses were performed using software R 3.4.3 with the help of software packages including pROC, caret, e1071 and rminer. In this study, p < 0.05 was considered as the significance level.

Results
Of the 1141 participants, 590 (51.7%) were male and 551 (48.3%) were female. The result of angiography was that 750 (65.7%) were positive and 391 (34.3%) were negative. The frequency distribution of the demographic characteristics, other risk factors and the result of angiography are shown in Table 1. The results of fitting logistic regression model with backward method and also the results of the sensitivity analysis for naïve Bayes and SVM with radial kernel functions were used to determine the effective variables (Table 2). Seven, three and six variables were selected for logistic regression, naïve Bayes and SVM, respectively. For all three models, gender, age, and fast blood sugar (FBS) were common.
In Table 3, the sensitivity, specificity, positive and negative predictive value, and area under the curve, are reported in order to compare the three models in their ability to predict the results of angiography.  According to Table 3 and Figure 1, we compared the area under the curve of three models and found no significant difference for these values. Hence, a naïve Bayes model was selected as the best model because it can predict the result of the angiography correctly only with three variables. The SVM model with six predictor variables was second to the naïve Bayes model (Table 2).

Discussion
The objective of this study was firstly to use the naïve Bayes, SVM and regression models to predict the result of the angiography and then compare the performance of these models. Secondly, determining the factors that may predict the need for angiography. Determining these diagnostic factors may not only help in the management of cardiovascular disease, but may also reduce the need to perform an unnecessary coronary angiography by identifying people whose angiography result would be predicted to be negative, because coronary angiography is costly, invasive and accompanied by complications.
According to the results of the logistic regression, the age, gender, family history of kidney disease, history of myocardial infarction, history of heart disease, TG and FBG were statistically significant (p < 0.05) in predicting the results of an angiography. Nasab et al. have previously used a logistic regression model to investigate the important factors of cardiovascular disease for people over 35 years of age. Some variables that are the same as the present study in terms of affecting coronary artery disease are: sex, triglyceride, fasting blood glucose and age. While in both studies the family history of cardiovascular disease is not significant. Moreover, the serum HDL is not significant in the present study but it is significant in the report by Nasab et al., which may be due to the differences in age [11]. Furthermore, regarding naïve Bayes, gender, age and FBG, and for SVM gender, age, family history of kidney disease, history of myocardial disease, hypertension, and FBS were the most effective factors in determining the necessity of an angiography. In a study carried out by Heravi et al. to determine the risk factors of coronary arteries disease, some factors such as age, gender and diabetes presented a significant relationship with the number of involved vessels. These statistically significant factors being is consistent with the present study [12].
Miranda et al. used a naïve Bayes model to detect CVD risk in adults. The model exhibits a good performance for risk level detection of cardiovascular disease. Our results are compatible with this [13].
The SVM model using six predictor variables (gender, age, family history of kidney disease, history of myocardial infarction, high blood pressure, FBG) compared to the logistic regression model with seven predictor variables (gender-age-FBG-history of myocardial infarction-history of heart disease-TG-family history of kidney disease) predicts the result of angiography more accurately.
Mahmoodi used a combination of a fuzzy system and support vector machine classifiers to diagnose patients with heart disease. In this study, data of 270 people with 13 features (age, gender, chest pain, rest blood pressure, cholesterol, FBG, the result of angiography, old peak, slope, Thal, maximum heart rate, exercise-induced angina, number of major vessels colored by fluoroscopy) were used. In this study, the rates of categorization and sensitivity were considered as evaluation criteria

Discussion
The objective of this study was firstly to use the naïve Bayes, SVM and regression models to predict the result of the angiography and then compare the performance of these models. Secondly, determining the factors that may predict the need for angiography. Determining these diagnostic factors may not only help in the management of cardiovascular disease, but may also reduce the need to perform an unnecessary coronary angiography by identifying people whose angiography result would be predicted to be negative, because coronary angiography is costly, invasive and accompanied by complications.
According to the results of the logistic regression, the age, gender, family history of kidney disease, history of myocardial infarction, history of heart disease, TG and FBG were statistically significant (p < 0.05) in predicting the results of an angiography. Nasab et al. have previously used a logistic regression model to investigate the important factors of cardiovascular disease for people over 35 years of age. Some variables that are the same as the present study in terms of affecting coronary artery disease are: sex, triglyceride, fasting blood glucose and age. While in both studies the family history of cardiovascular disease is not significant. Moreover, the serum HDL is not significant in the present study but it is significant in the report by Nasab et al., which may be due to the differences in age [11]. Furthermore, regarding naïve Bayes, gender, age and FBG, and for SVM gender, age, family history of kidney disease, history of myocardial disease, hypertension, and FBS were the most effective factors in determining the necessity of an angiography. In a study carried out by Heravi et al. to determine the risk factors of coronary arteries disease, some factors such as age, gender and diabetes presented a significant relationship with the number of involved vessels. These statistically significant factors being is consistent with the present study [12].
Miranda et al. used a naïve Bayes model to detect CVD risk in adults. The model exhibits a good performance for risk level detection of cardiovascular disease. Our results are compatible with this [13].
The SVM model using six predictor variables (gender, age, family history of kidney disease, history of myocardial infarction, high blood pressure, FBG) compared to the logistic regression model with seven predictor variables (gender-age-FBG-history of myocardial infarction-history of heart disease-TG-family history of kidney disease) predicts the result of angiography more accurately.
Mahmoodi used a combination of a fuzzy system and support vector machine classifiers to diagnose patients with heart disease. In this study, data of 270 people with 13 features (age, gender, chest pain, rest blood pressure, cholesterol, FBG, the result of angiography, old peak, slope, Thal, maximum heart rate, exercise-induced angina, number of major vessels colored by fluoroscopy) were used. In this study, the rates of categorization and sensitivity were considered as evaluation criteria with the percentage of 85% and 85.8%, respectively. According to the results, the combined system diagnosed patients with cardiovascular diseases with a high accuracy. This study shows that the use of SVM has as high as sensitivity compared to our research in which it is not combined with any other models [14].
Gonsalves et al. conducted research about the prediction of coronary heart disease (CHD). This research used the historical medical data to predict CHD by means of a machine learning (ML) algorithm: naïve Bayes, support vector machine and decision tree (DT), to find correlations in CHD data that might improve the prediction rate, and the data were extracted from the South African Heart Disease dataset of 462 cases. The empirical results report that the NB model is promising in terms of detecting CHD which is compatible with our results [15].
Ramalingam et al., used several machine learning algorithms such as support vector machine and naïve Bayes, to name but a few. It shows that the naïve Bayes model was computationally very fast and both the SVM and naïve Bayes performed extremely well [16].
Unnikrishnan et al., in order to predict the risk of CVD, used an SVM and showed that the SVM performed better than a logistic regression model [17]. In addition, Shafie et al. predicted coronary artery disease using bioinformatics algorithms. The results showed that SVM with 100% accuracy can distinguish between individuals with and without disease, which can be due to the ability of SVM to classify nonlinear classifications [18].
In terms of cardiovascular disease, several studies have been undertaken using SVM, naïve Bayes and logistic regression, but it should be taken into consideration that the performance of these models depends on numerous factors, including the size of the database, the number and type of predictor variables and the missing. Thus, selecting a model as the most efficient one is not reasonable.

Conclusions
As there was no significant difference among the models, gender, age, and FBG variables are the most important variables for predicting the result of angiography, because the naïve Bayes model was good at predicting it only with these three variables. In addition, these three variables were important in the other two models.
In addition, it may be necessary to give priority to patients with these characteristics that are more likely to have positive result for angiography so they can be treated more urgently.