Classification of Diseases Using Machine Learning Algorithms: A Comparative Study

Abstract: Machine learning in the medical area has become a very important requirement. Healthcare professionals need useful tools to diagnose medical illnesses, and classifiers can provide such tools. However, questions arise: which classifier should be used? What metrics are appropriate to measure the performance of the classifier? How can a good distribution of the data be determined so that the classifier does not bias the medical patterns to be classified toward a particular class? Then, the most important question: does a classifier perform well for a particular disease? This paper presents some answers to the questions mentioned above, making use of classification algorithms widely used in machine learning research, with datasets relating to medical illnesses, under the supervised learning scheme. In addition to state-of-the-art algorithms in pattern classification, we introduce a novelty: the use of meta-learning to determine, a priori, which classifier would be ideal for a specific dataset. The results obtained show numerically and statistically that there are reliable classifiers to suggest medical diagnoses. In addition, we provide some insights about the expected performance of classifiers for such a task.


Introduction
Supervised learning is one of the most common and important paradigms in pattern recognition, with pattern classification being one of its most important tasks [1]. In this context, the state of the art offers pattern classification methods that are useful in different application areas [2].
Pattern classification has become very important for decision making in many areas of human activity, and the medical area is no exception. Researchers in machine learning have been designing new classification algorithms for this purpose, seeking a classification efficiency close to 100%. It is important to emphasize that there is no perfect classifier. This fact is guaranteed by the No-Free-Lunch theorem, which governs the effectiveness of classifiers [3,4]. This theorem has motivated machine learning researchers to design novel classification algorithms with the property of exhibiting the fewest possible errors [5,6].
This work focuses on classification algorithms that are useful for the effective diagnosis of medical diseases. This area is of utmost importance because a good diagnosis will significantly improve the life of the patient. An example: based on a chest radiograph, a classifier can correctly decide whether it corresponds to a patient suffering from pneumonia or to a healthy person [7], assuming that only these two classes exist. Obviously, the correct classification depends on the classification algorithm, that is, how good it is, and on the complexity of the database. In the medical area, it is very important to have computer tools that help the health professional to diagnose diseases in a timely manner. In addition, it is very important to have data whose quality is guaranteed. In this regard, over time various international dataset repositories have been formed, which are very useful to the scientific community in machine learning and related areas. Fortunately for our work, these repositories contain a number of medical datasets, which are the raw material for studies such as the one reported in this article.
In this research work, three widely used repositories have been chosen: Kaggle (https://www.kaggle.com/, accessed on 30 May 2021), the University of California Machine Learning Repository [8], and the KEEL repository [9]. These repositories contain datasets of patterns of medical diseases, and they offer both balanced and unbalanced data, a very important fact, because in the unbalanced situation classification algorithms have a marked bias towards the majority class and practically ignore the minority class [10,11]. In this article we use 23 datasets grouped into five categories of medical diseases: heart diseases with six datasets, cancer-related diseases with seven datasets, diabetes-related diseases with two datasets, thyroid diseases with two datasets, and, finally, other diseases with six datasets.
The classification algorithms used in the present work are: Multilayer Perceptron (MLP), Naïve Bayes (NB), K Nearest Neighbors (KNN), decision trees (C4.5), logistic regression (Logistic), Support Vector Machines (SVM) and Deep Learning (DL). In addition, we tested several measures of data complexity, in order to a priori determine the expected performance of the compared classifiers for medical datasets [2,11].
In machine learning, researchers in this area have platforms on which they can test classification algorithms or develop their pattern classification algorithms. One of these platforms is WEKA [12], which due to its usefulness and easy handling is a well-known machine learning platform. WEKA was developed in New Zealand at the University of Waikato and can be downloaded at www.cs.waikato.ac.nz/ml/weka/, accessed on 30 May 2021. It contains a comprehensive collection of predictive models and data analysis algorithms that include methods that address regression problems, feature selection, clustering, and classification. WEKA's flexibility allows the preprocessing and management of a data set in a learning schema and then analysis of the performance of the classifier in use. WEKA was programmed in Java. The classification algorithms used in this paper are part of the vast set of algorithms that WEKA includes. Another platform for experimentation is KEEL [9,13], developed by the University of Granada in Spain, which also includes data complexity measures.
The paper is structured as follows: Section 2 includes important state-of-the-art works dealing with the classification of medical patterns. In Section 3 the experimental setup is explained, including the selected datasets and classifiers, as well as the data complexity and performance measures. Section 4 is very important because it describes and discusses several highly relevant aspects: first, the numerical and statistical behavior of the classifiers is widely described and discussed, and then meta-learning techniques are applied. The results obtained allow us to crystallize the purpose of this article: with the results obtained from the meta-learning techniques, we will be able to propose the best classifiers for the diagnosis of specific diseases. Finally, Section 5 presents the conclusions derived from the present research.

Previous Works
The No-Free-Lunch theorem guarantees that there is no perfect classifier. Therefore, machine learning researchers now seek to minimize the errors of their algorithms. For example, in [14], the authors augmented the K-NN algorithm with a cost-sensitive distance function, through a careful selection of the K parameter. Another example of performance improvement in classifiers is the case of the multilayer perceptron, for which the appropriate number of hidden units must be found [15].
A framework for classifying lung problems is described in [16]. In this work, a tuberculosis dataset and different configurations were used for semi-supervised learning algorithms such as co-training, tri-training and self-taught learning. Another example of the use of classifiers in the medical area is found in [17]. Here, the authors manually extracted characteristics from Magnetic Resonance Imaging (MRI) and used them in a linear regression algorithm to classify brain damage in patients.
Deep Learning methods, with special emphasis on Convolutional Neural Networks (CNN), are widely used for the classification and segmentation of medical images. Here are some notable recent examples. A detailed description of how computer-aided diagnosis is helpful in improving clinical care is included in [18]. The detection of glaucoma through the classification of processed images is shown in [19]. A method for classifying lesions in hemorrhagic stroke using a CNN is shown in [20]. An automatic system based on a CNN to detect discs of the lumbar spine is proposed in [21]. A model based on a CNN to improve the classification of Papanicolaou smears is proposed in [22]. An automatic model called online transfer learning is proposed in [23] for the differential diagnosis of benign and malignant thyroid nodules from ultrasound images. Reference [24] does not involve CNNs; in that research work, a new associative pattern classification algorithm called Lernmatrix tau 9 is introduced and applied to medical datasets.
As can be seen, the classification of patterns of medical diseases is of great interest to the scientific community in machine learning and to health professionals. Interest is great in classification algorithms, not only in the creation of new algorithms that reduce the classification error, but also in existing ones. It is important to know the behavior of state-of-the-art pattern classifiers, because knowing their performance can be very helpful in diagnosing clinical diseases. It is evident and undeniable how valuable it can be for a medical team to know in advance, with scientifically substantiated reasons, which classifier or group of classifiers is the most appropriate for the diagnosis of a specific disease. Hence the relevance of this work, whose results can benefit the quality of people's lives.

Experimental Setup
We wanted to determine a priori whether some of the well-established state-of-the-art classifiers will perform well or poorly for specific diseases. To do so, we tested the classifiers over 23 medical datasets, and we computed 12 data complexity measures for such datasets. Then, for each classifier, we calculated three performance measures (Balanced Accuracy, Sensitivity and Specificity) and converted these results (by discretization) into three categorical values: Good, Regular, and Poor.
At this point, we were able to create a new dataset for each classifier, whose nature is totally different from the initial datasets. The patterns of this new dataset for each classifier are made up of the 12 complexity measures already calculated as input, and the discretized performance as output.
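As a sketch, the construction of one meta-pattern might look as follows. The Good/Regular/Poor thresholds below are illustrative assumptions, not values taken from this paper:

```python
def discretize_performance(score, good=0.8, regular=0.6):
    """Map a performance score in [0, 1] to a categorical label.

    The thresholds (0.8 and 0.6) are illustrative assumptions; the
    paper does not fix specific cut-off values in this section.
    """
    if score >= good:
        return "Good"
    if score >= regular:
        return "Regular"
    return "Poor"

# A meta-pattern pairs the complexity measures of a dataset (inputs)
# with the discretized classifier performance (output).
complexity_measures = [0.12, 0.45, 0.33]  # truncated example; 12 in the paper
balanced_accuracy = 0.83
meta_pattern = (complexity_measures, discretize_performance(balanced_accuracy))
print(meta_pattern[1])  # -> Good
```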
Finally, the meta-learning process comes into action, which will allow medical teams to obtain the greatest social benefit and impact from this research work on machine learning. To perform this meta-learning process, we train a decision tree.
The experimental setup is shown in Figure 1. The first three stages of the experimental setup are shown in the following Sections, while the remaining five stages are explained in the Results section.

Datasets
We selected 23 datasets from three international repositories. These 23 datasets belong to five of the most important subgroups of human diseases: heart diseases, cancer-related diseases, diabetes, thyroid diseases and, finally, other diseases. In the following, we provide a brief description of the selected datasets.

Datasets for Heart Diseases
Cleveland dataset: this is a heart disease dataset provided by the Medical Center, Long Beach, and the Cleveland Clinic Foundation, located in the Keel repository. It has 13 attributes, five classes, and 303 instances.
Heart Statlog dataset: this dataset, taken from the Keel repository, is intended to detect the absence (class 1) or the presence (class 2) of heart disease in patients. It is made up of 13 attributes, 270 instances and two classes.
Heart 2 dataset: this dataset has 14 attributes, 303 instances, and two classes. The class refers to the presence of heart disease in the patient; in the original data the target takes integer values from 0 to 4, where 0 means no heart problems. This dataset was taken from the Kaggle repository.
Heart failure dataset: this dataset is designed for machine learning and contains information that allows the survival of patients with heart failure to be predicted from serum creatinine and ejection fraction alone. This dataset was taken from the Kaggle repository and comprises 13 attributes, 299 instances and two classes.
Saheart dataset: this is the South African Heart dataset. Taken from the Keel repository, it contains information on men at high risk for coronary heart disease from a region of the Western Cape, South Africa. The result of the classification should indicate whether the patient has coronary disease. The class values are negative (0) or positive (1). It is made up of nine attributes, 462 instances and two classes.
SPECT cardiac dataset: this dataset describes cardiac imaging based on single-photon emission computed tomography (SPECT). To use this dataset in machine learning, the original SPECT images were processed to extract the characteristics that define cardiac problems in patients. The dataset was taken from the UCI repository and is made up of 22 attributes, 267 instances and two classes. The classes indicate whether the patient has (1) or does not have (0) heart problems.

Datasets for Cancer
Breast dataset: this dataset consists of nine attributes, two classes, and 286 instances. The attributes are age, menopause, tumor-size, inv-nodes, node-caps, deg-malig, breast, breast-quad, and irradiated. This dataset is in the Keel repository.
Haberman's Survival dataset. The information in this dataset is used to determine if the patient survived breast cancer for periods greater than 5 years (positive) or if the patient died within 5 years (negative). The information comes from a study that was conducted at the University of Chicago Billings Hospital between 1958 and 1970. It consists of three attributes, 306 instances, and two classes. Haberman's dataset was taken from the Keel repository.
Lymphography dataset: this dataset is widely used in machine learning. The information in the dataset is intended to detect a lymphoma and its current state. This dataset was taken from the Keel repository. It is made up of 18 attributes, 148 instances, and four classes.
Mammographic dataset. Data were obtained between 2003 and 2006 at the Institute of Radiology of the University of Erlangen-Nuremberg. The information is often used to predict the severity of a mammographic mass lesion from the BI-RADS attributes and the age of the patient. The classification is benign or malignant. This dataset was taken from the Keel repository and is made up of five attributes, 961 instances, and two classes.
Primary tumor dataset: this dataset contains information on primary tumors in people. The intent is to classify the patterns or records into the class of "metastasized" or "not metastasized" to a part of the body other than where the tumor first appeared. The dataset was taken from the UCI repository and contains 339 instances, 17 attributes, and 21 classes.
Wisconsin dataset: this is the original breast cancer dataset from a study conducted by the University of Wisconsin Hospitals. The information contained in this dataset makes it possible to classify whether the detected tumor is benign (2) or malignant (4) for patients who underwent surgery for breast cancer. The dataset was taken from the Keel repository and contains nine attributes, 699 instances, and two classes.
Wisconsin diagnosis for breast cancer 2 dataset (BCWD2). The attributes of this dataset are obtained from digitized images of breast masses generated by a fine needle aspiration (FNA) process. The extracted characteristics define the cell nuclei present in the image. The dataset was taken from the Kaggle repository. It is made up of 32 attributes, 569 instances and two classes. The distribution of the classes in the dataset are 357 benign and 212 malignant.

Datasets for Diabetes
Diabetes dataset: the purpose of this dataset is to classify information into two possible classes, negative test and positive test, i.e., whether or not a patient has diabetes. The dataset has 578 instances, 20 attributes, and two classes. This dataset was taken from the UCI repository and was adapted in: https://github.com/renatopp/arff-datasets/blob/master/classification/diabetes.arff, accessed on 30 May 2021.
Pima Indians Diabetes dataset: this dataset contains information to classify or predict whether women of Pima Indian descent, at least 21 years of age, have diabetes or not, i.e., tested negative or tested positive. This dataset was taken from the Keel repository and is made up of eight attributes, 768 instances and two classes.

Datasets for Thyroid Diseases
Newthyroid dataset: this is a new dataset on thyroid disease. It is taken from the Keel repository and is also available from the UCI repository. The information contained in this dataset is used to predict or classify whether a patient is normal (1) or suffers from hyperthyroidism (2) or hypothyroidism (3). It is formed from five attributes, 215 instances, and three classes.
Thyroid diseases dataset: this dataset is taken from the Keel repository and is also available from the UCI repository. The information contained is used to predict or classify whether a patient is normal (1) or suffers from hyperthyroidism (2) or hypothyroidism (3). It is formed from 21 attributes, 7200 instances, and three classes.

Datasets for Other Diseases
Appendicitis dataset: this is a dataset taken from the Keel repository that consists of seven attributes, 106 patient instances or patterns and two classes (0,1), which represent whether the patient has appendicitis or not.
Audiology dataset (standardized): this dataset, extracted from the UCI repository, contains information on hearing problems in patients. The dataset is made up of 226 instances, 69 attributes, and 22 classes.
Contraceptive method choice dataset: this dataset is based on the 1987 Indonesian national survey. The samples are from single or married women who do not know if they are pregnant when interviewed. With this dataset, an attempt is made to predict which contraceptive method a woman would use according to her demographic and socioeconomic characteristics. The categories for prediction are no use, long-term methods or short-term methods. It is made up of nine attributes, 1473 instances, and three classes. This dataset was taken from the Keel repository.
Dermatology dataset: this was taken from the Keel repository. The information contained in this dataset is derived from the differential diagnosis of erythemato-squamous diseases. It is formed of 34 attributes, 366 instances, and six classes.
Ecoli dataset. The goal of this dataset is to predict the localization site of proteins using measurements about the cell; the sites include, for example, cytoplasm, inner membrane, periplasm, outer membrane, outer membrane lipoprotein, inner membrane lipoprotein, and inner membrane with cleavable signal sequence. The dataset is formed of seven attributes, 336 instances, and eight classes. It was taken from the Keel repository.
Hepatitis dataset. This was taken from the Keel repository and is intended to predict whether patients affected by hepatitis will die (class 1) or survive (class 2). The dataset is made up of 19 attributes, 155 instances, and two classes.
In classification problems, a very important aspect concerning classes is knowing whether they are balanced or unbalanced. Ideally, classes should have the same number of instances; nevertheless, the most interesting datasets are unbalanced. For example, in the classification of diseases, the sick class is usually the minority class and the healthy class the majority class. This imbalance affects the way the performance of the classifiers is measured [40].
Given that most medical datasets exhibit a certain degree of imbalance in their classes, it is necessary to choose performance measures appropriate for this type of dataset, such as Balanced Accuracy, Sensitivity and Specificity. It is necessary to clarify that, in these cases, one of the most popular performance measures is not useful: accuracy [11]. Table 1 summarizes the characteristics of the 23 medical disease datasets described above.

Multilayer Perceptron (MLP)
This neural network attempts to solve classification problems in which the classes are not linearly separable. An MLP typically consists of three types of layers: the input layer, the hidden layers, and the output layer [41]. The output layer contains the neurons whose output values determine the corresponding class. As a propagation rule, the neurons of the hidden layer compute the weighted sum of the inputs with the synaptic weights, and a sigmoid-type transfer function is applied to this sum. The backpropagation algorithm uses the mean square error as its cost function. Researchers in machine learning consider the MLP to be a very good pattern classifier.
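The propagation rule described above (weighted sum plus sigmoid transfer function) can be sketched as a single forward pass. The network size and weights below are toy illustrations; the paper does not specify the MLP configuration at this point:

```python
import math

def sigmoid(x):
    """Sigmoid transfer function applied to the weighted sum."""
    return 1.0 / (1.0 + math.exp(-x))

def mlp_forward(x, hidden_weights, hidden_bias, out_weights, out_bias):
    """One forward pass of a single-hidden-layer MLP.

    Each hidden neuron computes a weighted sum of the inputs plus a
    bias and applies the sigmoid, as described in the text above.
    """
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_weights, hidden_bias)]
    outputs = [sigmoid(sum(w * h for w, h in zip(ws, hidden)) + b)
               for ws, b in zip(out_weights, out_bias)]
    return outputs

# Toy network: 2 inputs, 2 hidden neurons, 1 output neuron.
y = mlp_forward([0.5, -1.0],
                hidden_weights=[[0.1, 0.4], [-0.3, 0.2]],
                hidden_bias=[0.0, 0.1],
                out_weights=[[0.7, -0.5]],
                out_bias=[0.2])
print(y)
```

Training (backpropagation with the mean square error) adjusts these weights iteratively; only the inference step is shown here.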

C4.5 Classifier
This is a decision-tree classification algorithm, one of the most commonly used types of pattern classifier. C4.5 [42] is derived from an earlier decision tree algorithm called ID3 [43]. Among the parameters of the C4.5 classifier, the confidence level for pruning the generated tree stands out, because it significantly influences the size and predictive power of the created tree. The algorithm can be explained as follows: at iteration n, the predictor variable for the split is sought, together with the exact cut-off point where the error is lowest according to a pre-established criterion, as long as the confidence level exceeds the previously established threshold. Once the split is made, the procedure is repeated until all predictor variables fall below the established confidence level. Working with the confidence level is very important because, with too many subjects and variables, the tree would grow too large. Another way to limit the size of the tree is to specify a minimum number of instances per node.
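The split criterion C4.5 uses, the gain ratio, can be sketched as follows. This is a simplified illustration of scoring one candidate split on categorical data, not the full tree-building or pruning algorithm:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(labels, split_groups):
    """Gain ratio of a candidate split: the criterion C4.5 uses to
    choose the predictor variable and cut-off point at each node."""
    total = len(labels)
    groups = [g for g in split_groups if g]      # ignore empty branches
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    gain = entropy(labels) - remainder
    split_info = -sum((len(g) / total) * math.log2(len(g) / total)
                      for g in groups)
    return gain / split_info if split_info else 0.0

labels = ["sick", "sick", "healthy", "healthy"]
# A perfect binary split separates the two classes completely.
print(gain_ratio(labels, [["sick", "sick"], ["healthy", "healthy"]]))  # -> 1.0
```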

Naïve Bayes (NB)
The Naïve Bayes classifier [44] is based on Bayes' theorem. It is a special class of machine learning classification algorithm. Bayes argued that the world is neither certain nor probabilistic, but rather that we learn about the world through approximations, getting closer and closer to the truth the more evidence we have. The Naïve Bayes classifier assumes that the presence or absence of an attribute is not probabilistically related to the presence or absence of another attribute, contrary to what happens in the real world. Due to its simplicity, Naïve Bayes allows probability-based models to be built easily, with excellent performance. The algorithm converts the dataset into a frequency table and then builds a probability table for the various events. Naïve Bayes calculates the posterior probability of each class, and the predicted class is the one with the highest probability.
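A minimal sketch of the frequency-table approach described above, assuming categorical attributes. The Laplace smoothing is an addition for robustness against unseen values, not something stated in the text:

```python
from collections import Counter, defaultdict

def train_nb(patterns, labels):
    """Build the frequency tables of a Naive Bayes model for
    categorical attributes (a simplified sketch)."""
    class_counts = Counter(labels)
    # feature_counts[class][attribute_index][value] -> frequency
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for x, y in zip(patterns, labels):
        for i, v in enumerate(x):
            feature_counts[y][i][v] += 1
    return class_counts, feature_counts

def predict_nb(x, class_counts, feature_counts):
    """Return the class with the highest posterior probability."""
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, n in class_counts.items():
        p = n / total  # prior probability of class c
        for i, v in enumerate(x):
            counts = feature_counts[c][i]
            # Laplace smoothing avoids zero probabilities.
            p *= (counts[v] + 1) / (n + len(counts) + 1)
        if p > best_p:
            best, best_p = c, p
    return best

X = [("high", "yes"), ("high", "no"), ("low", "no"), ("low", "yes")]
y = ["sick", "sick", "healthy", "healthy"]
model = train_nb(X, y)
print(predict_nb(("high", "yes"), *model))  # -> sick
```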

The K Nearest Neighbors (K-NN)
The K Nearest Neighbors (K-NN) classifier is a supervised learning algorithm [45]. The idea of the classifier is very intuitive: the algorithm calculates the distance from a new pattern to each of the existing ones, orders these distances from smallest to largest, and assigns the new pattern to the class with the highest frequency among its K nearest neighbors. The K-NN algorithm is widely used for pattern classification [46-49].
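The idea can be sketched in a few lines, using Euclidean distance and a majority vote; the toy data and the value of K are illustrative:

```python
import math
from collections import Counter

def knn_classify(x, patterns, labels, k=3):
    """Assign x to the most frequent class among its K nearest
    neighbors by Euclidean distance, as described above."""
    dists = sorted((math.dist(x, p), y) for p, y in zip(patterns, labels))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["healthy", "healthy", "healthy", "sick", "sick", "sick"]
print(knn_classify((0.5, 0.5), X, y))  # -> healthy
print(knn_classify((5.5, 5.5), X, y))  # -> sick
```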

Logistic Classifier (Logistic)
This classifier is based on logistic regression [50]. It takes real-valued inputs and predicts the probability of the input belonging to a certain class. The probability is calculated with a sigmoid function, in which the exponential function is involved. Logistic regression is widely used in machine learning because it is very efficient and does not require many computational resources. The most common logistic regression models produce a binary result, i.e., values like true or false, yes or no. Another model is multinomial logistic regression, which can handle scenarios with more than two possible outcomes.
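A sketch of the binary prediction step described above, assuming already-trained illustrative weights (training the weights, e.g. by gradient descent, is not shown):

```python
import math

def logistic_predict(x, weights, bias, threshold=0.5):
    """Binary logistic prediction: the sigmoid of a weighted sum
    gives the probability of the positive class."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    p = 1.0 / (1.0 + math.exp(-z))  # sigmoid involves the exponential
    return ("positive" if p >= threshold else "negative"), p

# Illustrative weights; in practice these are learned from data.
label, p = logistic_predict([2.0, 1.0], weights=[1.5, -0.5], bias=-1.0)
print(label, round(p, 3))  # -> positive 0.818
```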

Support Vector Machines (SVM)
This model originates from the well-known statistical learning theory. The optimization of analytical functions serves as the theoretical basis for the design and operation of SVM models, which attempt to find the maximum-margin hyperplane separating the classes in the attribute space [6,12,40].

Deep Learning (DL)
Deep learning involves the use of MLP with many layers. In this type of algorithm, the use of backpropagation is intensive, along with other types of operation such as convolution and pooling. In this paper we use the WekaDeeplearning4j package for deep learning [21,37,38].
A detailed description of the computed complexity measures can be found in [51]. Here we include only the names and a few simple ideas related to the complexity measures:
L2: This measure is a kind of complement to L1, since it measures the error rate of a linear classifier on the training set in a specific experiment.
L3: This measure captures the non-linearity of a linear classifier for a specific dataset. From a training set, a test set is created by linear interpolation between pairs of patterns of the same class chosen at random; L3 measures the error rate of a linear classifier on this test set.
N1: Mixture Identifiability 1. A minimum spanning tree (MST) is constructed connecting all the patterns (points in the representation space) of the dataset. The edges of the MST that connect opposing classes are then counted; N1 is the fraction of such edges over the total number of patterns in the dataset.
N2: Mixture Identifiability 2. To estimate this measure, the Euclidean distance from each pattern to its nearest neighbor is calculated. Two values are then computed: the average of the intraclass nearest-neighbor distances and the average of the interclass nearest-neighbor distances. N2 is the ratio of these two values.
N3: Mixture Identifiability 3. This measure corresponds to the error rate of the nearest-neighbor classifier under leave-one-out cross-validation.
N4: This measure captures the non-linearity of a classifier for a specific dataset. From a training set, a test set is created by linear interpolation between pairs of patterns of the same class chosen at random; N4 measures the error rate of a nearest-neighbor classifier on this test set.
T1: Space Covering by Neighborhoods. This measure involves topological concepts of the datasets.
T2: This measure is the average number of samples per dimension.
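As an illustration, measures N2 and T2 can be computed on a toy dataset as follows. This is a simplified sketch following the descriptions above; in practice the KEEL software computes these measures:

```python
import math

def n2_measure(patterns, labels):
    """Ratio of the summed intraclass nearest-neighbor distances to
    the summed interclass nearest-neighbor distances (measure N2).
    Assumes every class has at least two patterns."""
    intra, inter = [], []
    for i, (p, y) in enumerate(zip(patterns, labels)):
        same = [math.dist(p, q)
                for j, (q, z) in enumerate(zip(patterns, labels))
                if j != i and z == y]
        other = [math.dist(p, q)
                 for q, z in zip(patterns, labels) if z != y]
        intra.append(min(same))
        inter.append(min(other))
    return sum(intra) / sum(inter)

def t2_measure(patterns):
    """Average number of samples per dimension (measure T2)."""
    return len(patterns) / len(patterns[0])

X = [(0, 0), (0, 1), (5, 5), (5, 6)]
y = ["a", "a", "b", "b"]
print(n2_measure(X, y))  # small value: the classes are well separated
print(t2_measure(X))     # -> 2.0
```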
As noted above, reference [51] includes detailed discussions of these 12 measures of complexity. Table 2 shows the values of the complexity measures for the selected datasets. However, for some datasets, the KEEL software obtained invalid values (NaN, Not a Number) for measures F1 and F2. Therefore, we include such values as missing (?).

Performance Measures
Supervised classification has two phases: the learning phase and the classification phase [52]. The classifier requires one dataset for training, called the training set, and another for testing, called the test set. Once the classifier has learned from the training set, the test set is presented to it, resulting in the assignment of its patterns to the corresponding classes; it should be noted that some patterns will not be classified correctly, as implied by the No-Free-Lunch theorem [3,4].
The partition of the total dataset is done by a validation method. The cross-validation method partitions the total dataset into k folds, where k is a positive integer; the most popular values in the literature are k = 5 and k = 10. Stratified cross-validation ensures that the classes are proportionally distributed in each fold [53,54].
For this article, the k-fold cross-validation method is used with k = 10. Figures 2 and 3 exemplify its behavior, showing schematic diagrams with 10 folds and a dataset divided into three classes. To form the 10 folds, the first pattern of class 1 is placed in fold 1, the second pattern in fold 2, and so on, until the tenth pattern of class 1 is placed in fold 10; the process then continues for the remaining patterns and classes. The 10-fold cross-validation then operates over 10 executions, as shown in Figure 3.
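The round-robin fold assignment described above can be sketched as:

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Round-robin assignment of pattern indices to k folds, class by
    class, so that every class is proportionally represented in each
    fold, as in the scheme of Figures 2 and 3."""
    folds = defaultdict(list)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return [folds[i] for i in range(k)]

# 30 patterns in three classes of 10: each fold gets one per class.
labels = ["c1"] * 10 + ["c2"] * 10 + ["c3"] * 10
folds = stratified_folds(labels)
print([len(f) for f in folds])  # -> [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
```

In each of the 10 executions, one fold serves as the test set and the remaining nine folds form the training set.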

In a classification problem, for example in binary classification, the performance of the classifier can be measured based on the patterns correctly classified in their corresponding class, that is, the true positives (TP) and true negatives (TN). However, the classifier can make mistakes when classifying the patterns, and these classification errors are called false positives (FP) and false negatives (FN). Graphically, TP, TN, FP and FN are represented by the confusion matrix. When there are more than two classes (k > 2), the confusion matrix takes the form shown in Table 3. From the confusion matrix in Table 3, the i-th class (1 ≤ i ≤ k) is considered in order to define the meaning of some symbols. With these definitions it is possible to define, in turn, the three performance measures applied to the experimental data of this article: Sensitivity, Specificity and Balanced Accuracy [55].
Note first that N_i is the total number of patterns that belong to class i, and that n_ii represents the number of patterns of class i that were classified correctly. With this information, the performance measure Sensitivity for class i is defined as follows:

Sensitivity_i = n_ii / N_i    (1)

Now a second performance measure will be defined for class i. To do this, consider any class j that is different from class i; that is, 1 ≤ j ≤ k and j ≠ i.
While N_j is the total number of patterns that belong to class j, the symbol n_ji represents the number of patterns that are classified as class i, although in reality they belong to class j. With this, the total number of patterns of class j that are correctly classified as not belonging to class i is:

N_j − n_ji    (2)

If all classes different from class i are considered, the total number of patterns that are correctly classified as not belonging to class i is:

Σ_{j≠i} (N_j − n_ji)    (3)

It can clearly be seen that the total number of patterns that do not belong to class i is calculated as follows:

Σ_{j≠i} N_j    (4)

With expressions (3) and (4), it is now possible to define the performance measure Specificity for class i, as follows:

Specificity_i = Σ_{j≠i} (N_j − n_ji) / Σ_{j≠i} N_j    (5)

Balanced Accuracy for class i is defined as the average of Sensitivity_i and Specificity_i:

Balanced Accuracy_i = (Sensitivity_i + Specificity_i) / 2    (6)
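The three measures can be computed directly from a k × k confusion matrix. The sketch below assumes the usual convention that entry cm[i][j] counts the patterns of true class i assigned to class j; `per_class_metrics` is a hypothetical helper name.

```python
def per_class_metrics(cm):
    """Sensitivity, Specificity and Balanced Accuracy for each class of a
    k x k confusion matrix, where cm[i][j] is the number of patterns of
    true class i assigned to class j (convention assumed here)."""
    k = len(cm)
    class_totals = [sum(row) for row in cm]  # N_i, total patterns per true class
    metrics = []
    for i in range(k):
        sens = cm[i][i] / class_totals[i]    # n_ii / N_i
        # patterns of every other class j correctly kept out of class i
        tn = sum(class_totals[j] - cm[j][i] for j in range(k) if j != i)
        spec = tn / sum(class_totals[j] for j in range(k) if j != i)
        metrics.append((sens, spec, (sens + spec) / 2))
    return metrics
```

For a binary matrix [[8, 2], [1, 9]], class 0 obtains Sensitivity 0.8, Specificity 0.9 and Balanced Accuracy 0.85, matching expressions (1), (5) and (6).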

Results
In this section, we present the performance of the selected classifiers over the datasets (Section 4.1). In addition, we compare the classifiers by means of statistical analysis (Section 4.2), and we build the datasets used for meta-learning (Section 4.3).

Classification Results
Since the datasets were taken from three different repositories, and only the Keel repository provides the files already partitioned under the 10-fold cross-validation method, a Python program implementing the 10-fold cross-validation algorithm described in Figures 2 and 3 was developed.
Once the learning and test sets were generated, the Weka software was applied. Table 4 shows the behavior of the classifiers according to Balanced Accuracy. The best results are highlighted in bold and indicate that a particular classifier performed better than the other classifiers. As shown in Table 4, the classifier with the best performance according to the Balanced Accuracy measure is Naïve Bayes with seven wins, followed by SVM and Deep Learning, with six and five wins, respectively.
However, the variation in the results is extremely high, ranging from 0.50 to 0.97 for Naïve Bayes, 0.072 to 0.97 for Deep Learning and 0.443 to 0.979 for SVM.
It should be noted at this stage of the research work that this heavy variation supports one of the most important contributions of the present paper: the a priori estimation of the performance of the classifiers by means of meta-learning procedures. Table 5 shows the behavior of the classifiers according to Sensitivity. According to Sensitivity (Table 5), the classifier with the best performance is Naïve Bayes, with seven wins, followed by Deep Learning and SVM, with six wins.
However, as with Balanced Accuracy, the variation in the results is extremely high, ranging from 0.493 to 0.975 for Naïve Bayes, 0.51 to 0.979 for SVM, and 0.0743 to 0.979 for Deep Learning. Table 6 shows the behavior of the classifiers according to Specificity. According to Specificity (Table 6), the classifier with the best performance is SVM (six wins), followed by Naïve Bayes (five wins). It is interesting that Deep Learning showed poor behavior regarding Specificity, with only two wins.
Again, the variation in the results is extremely high, ranging from 0.304 to 0.979 for SVM.

Statistical Analysis
Despite the previous results, which support the idea that the best-performing classifiers are Naïve Bayes and Support Vector Machines, there is a need to establish whether the differences in performance among the classifiers are significant or not. To do so, several authors suggest the use of non-parametric statistical tests [56].
For statistical analysis, we used the Friedman test for the comparison of multiple related samples [57] and the Holm test for post hoc analysis [58]. The application of the Friedman test implies the creation of a block for each of the samples analyzed in such a way that each block contains an observation from the application of each of the different contrasts or treatments. In terms of matrices, the blocks correspond to rows and the treatments to columns.
The null hypothesis establishes that the performances obtained by different treatments are equivalent, while the alternative hypothesis proposes that there is a difference between these performances, which would imply differences in the central tendency.
If k is defined as the number of treatments, then within each block a rank between 1 and k is assigned to each entry: 1 to the best result and k to the worst. In case of ties, the average rank is assigned. Next, the variable R_j (j = 1, . . . , k) is assigned the value of the sum of the ranks corresponding to each treatment. If the performances obtained from the different treatments are equivalent, then R_i = R_j for all i ≠ j. Thus, from this procedure it is possible to determine when an observed disparity among the R_j is sufficient to reject the null hypothesis. Let n be the number of blocks and k the number of treatments; then the Friedman statistic S is given by:

S = [12 / (n k (k + 1))] Σ_{j=1}^{k} R_j² − 3 n (k + 1)

For values of n ≥ 10 and k ≥ 4, the S statistic approximates a chi-square random variable with k − 1 degrees of freedom. The critical region of size α is the right tail of the distribution of that variable. The null hypothesis is rejected when the value of S is greater than the critical value.
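The ranking and statistic computation described above can be sketched as follows. This is a minimal, self-contained illustration (higher performance values are treated as better, and ties receive the average rank); `friedman_statistic` is a hypothetical helper name.

```python
def friedman_statistic(blocks):
    """Friedman statistic S for a list of blocks (rows), each holding the
    performance of the k treatments; rank 1 = best (highest value here),
    with average ranks assigned on ties."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        # treatment indices ordered from best to worst performance
        order = sorted(range(k), key=lambda j: block[j], reverse=True)
        ranks = [0.0] * k
        i = 0
        while i < k:
            # group consecutive ties and give each the average position
            j = i
            while j + 1 < k and block[order[j + 1]] == block[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for t in range(i, j + 1):
                ranks[order[t]] = avg
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    # S = 12 / (n k (k+1)) * sum(R_j^2) - 3 n (k+1)
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
```

With three blocks that agree completely over three treatments, S reaches its maximum of n(k − 1) = 6, as expected for perfect agreement.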
In the case that the Friedman test determines the existence of significant differences in the performance of the algorithms, it is recommended to use a post hoc test to determine between which of the algorithms compared in the Friedman test there are such differences. Holm's post hoc test is designed to reduce type I errors when analyzing phenomena that include several hypotheses, and consists of adjusting the rejection criterion for each one of them.
The procedure begins with the ascending ordering of the probability values of each hypothesis. Once ordered, each of these values is compared with the quotient obtained by dividing the level of significance by the number of hypotheses whose p-values have not yet been compared. Upon finding a p-value that exceeds this quotient, the procedure stops: the null hypotheses associated with the previously compared p-values are rejected, and the remaining ones are retained.
Let H1, . . . , Hk be a group of k hypotheses and p1, . . . , pk the corresponding probability values. By ordering these p-values in ascending order, a new nomenclature is established: p(1), p(2), . . . , p(k) for the ordered p-values and H(1), H(2), . . . , H(k) for the hypotheses associated with each of them. If α is the level of significance and j is the minimum index for which p(j) > α / (k − j + 1), then the null hypotheses H(1), . . . , H(j − 1) are rejected.
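Holm's step-down procedure can be sketched in a few lines. The function `holm_rejections` below is a hypothetical helper that returns the indices of the rejected hypotheses.

```python
def holm_rejections(p_values, alpha=0.05):
    """Holm's step-down procedure: return the indices of the hypotheses
    rejected at family-wise significance level alpha."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])  # ascending p-values
    rejected = []
    for step, idx in enumerate(order):
        # threshold alpha / (k - j + 1) for the j-th ordered p-value (j = step + 1)
        if p_values[idx] > alpha / (k - step):
            break  # stop: this and all larger p-values are retained
        rejected.append(idx)
    return rejected
```

For example, with p-values (0.01, 0.04, 0.03) and α = 0.05, only the first hypothesis is rejected: 0.01 ≤ 0.05/3, but the next ordered p-value, 0.03, exceeds 0.05/2.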
For both the Friedman and Holm tests, a significance level α = 0.05 was established, for 95% confidence. We begin by establishing the following hypotheses:

H0: There are no significant differences in the performance of the algorithms.

H1: There are significant differences in the performance of the algorithms.
The Friedman test obtained significance values of 0.01758, 0.017996 and 0.152972 for the Balanced Accuracy, Sensitivity and Specificity measures, respectively. Therefore, the null hypothesis for both the Balanced Accuracy and Sensitivity measures is rejected, showing that there are significant differences in the performance of the compared algorithms. Table 7 shows the ranking obtained by the Friedman test. As can be seen in Table 7, the first algorithm in the ranking for Balanced Accuracy was Naïve Bayes, for Sensitivity it was SVM, and for Specificity the best algorithm was Logistic. Holm's test compares the performance of the best-ranked algorithm with the remaining ones. Table 8 shows the results of Holm's test for Balanced Accuracy, and Table 9 shows the results of Holm's test for Sensitivity. For Balanced Accuracy, Holm's procedure rejects the hypotheses that have an unadjusted p-value ≤ 0.01. The results of the Holm test show that there are no significant differences in the performance of the Naïve Bayes algorithm with respect to the compared algorithms, apart from 3-NN, which showed significantly worse behavior according to Balanced Accuracy.
For Sensitivity, in addition to 3-NN, which maintained a significantly worse behavior, the Multilayer Perceptron algorithm was also significantly worse than the SVM classifier.

Meta-Learning
After obtaining the results for the compared classifiers, we discretized the performance values into three categories of performance: Good, Regular and Poor. Then, for each classifier, we obtained a new dataset, having as conditional attributes the values of the 12 complexity measures of Table 2, and as decision (class) attribute the discretized performance.
We used the Balanced Accuracy measure as decision attribute, due to the fact that it integrates the results of both sensitivity and specificity.
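The construction of this meta-dataset can be sketched as follows. The Good/Regular/Poor thresholds below are illustrative assumptions (the paper's exact cut-points are not reproduced here), and both function names are hypothetical helpers.

```python
def discretize_performance(balanced_accuracy, good=0.85, regular=0.70):
    """Map a Balanced Accuracy value to a performance category.
    The thresholds are illustrative assumptions, not the paper's cut-points."""
    if balanced_accuracy >= good:
        return "Good"
    if balanced_accuracy >= regular:
        return "Regular"
    return "Poor"

def build_meta_dataset(complexity_rows, balanced_accuracies):
    """Pair each dataset's 12 complexity measures (conditional attributes)
    with the discretized performance as the decision (class) attribute."""
    return [row + [discretize_performance(ba)]
            for row, ba in zip(complexity_rows, balanced_accuracies)]
```

Each resulting row has 13 attributes: the 12 complexity measures of Table 2 plus the discretized performance class, which is what the decision tree is trained on.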
With such information, we were able to train a decision tree to a priori determine the performance of the classifiers. The decision tree is shown in Figure 4.
In the following, we show the performance results of our proposed meta-learning decision tree (in the form of a confusion matrix), as well as the obtained tree, for each classifier. In the decision trees, G stands for Good, P for Poor and R for Regular.
For the MLP classifier, the proposed meta-learning algorithm made only two errors: a dataset with Regular performance and a dataset with Poor performance, both classified as having Good performance, for a Balanced Accuracy of 0.9144. The corresponding confusion matrix is shown in Table 10. The resulting decision tree (Figure 4a) has only six leaves, with size = 11, and considers only the complexity measures L1, N2, N4 and T2 to make the decision.
As for the MLP classifier, for the Naïve Bayes classifier the proposed meta-learning algorithm made only two errors (Table 11): a dataset with Regular performance and a dataset with Poor performance, both classified as having Good performance, for a Balanced Accuracy of 0.6410.
Table 11. Confusion matrix of the proposed meta-learning for the Naïve Bayes classifier.

The dataset with Poor performance, assigned a Good performance, was the BCWD2, in which the classifier obtained the last place, with 0.9261 Balanced Accuracy, which is not a bad result per se.
The resulting decision tree (Figure 4b) has only seven leaves, with size = 13. The decision tree only considers complexity measures F2, L3, N1, N3 and T1 to make the decision.
For the 3-NN classifier, the proposed meta-learning algorithm again made only two errors: a dataset with Regular performance predicted as having Poor performance and a dataset with Poor performance classified as having Regular performance, for a Balanced Accuracy of 0.8929, as shown in Table 12. The resulting decision tree (Figure 4c) has only seven leaves, with size = 13, and considers only the complexity measures F2, L1, L2, N1 and T1 to make the decision. For the C4.5 classifier, the proposed meta-learning algorithm made only one error (Table 13): a dataset with Good performance predicted as having Poor performance, for a Balanced Accuracy of 0.9333.
Table 13. Confusion matrix of the proposed meta-learning for the C4.5 classifier.

The resulting decision tree (Figure 4d) has only six leaves, with size = 11, and considers only the complexity measures F1, F3, N1 and N2 to make the decision.
For the Logistic classifier, the proposed meta-learning algorithm did not obtain good results. It misclassified the two datasets with Poor performance, assigning them to the Regular and Good classes.
In addition, it misclassified a dataset with Good performance, predicting it as Regular (Table 14); such results correspond to a Balanced Accuracy of 0.5744. The resulting decision tree (Figure 4e) again has six leaves, for a tree size of 11, and includes only the measures N2, N3, F3 and T1 of data complexity. For the Deep Learning classifier, the proposed meta-learning algorithm misclassified three datasets (two of them with Good performance, assigned to the Regular and Poor classes, and another with Poor performance, assigned to the Regular class).
The corresponding confusion matrix is shown in Table 15, with a Balanced Accuracy of 0.8561. The resulting decision tree (Figure 4f) is quite small, with only five leaves for a tree size of nine, and it includes only the measures L1, L2, N2 and N4 of data complexity.
Last but not least, for the Support Vector Machine classifier (one of the best-performing algorithms for medical datasets), the proposed meta-learning decision tree was the best, with all datasets correctly classified, for a perfect Balanced Accuracy of 1.0. Such results were obtained with a very small decision tree (Figure 4g), of five leaves and tree size of nine, using only three complexity measures: L1, F2 and T2.
In our opinion, such results represent a breakthrough for medical dataset classification, because they make it possible to determine, a priori, the expected performance of the seven analyzed classifiers, six of them with a Balanced Accuracy over 0.85, which is very promising.

Conclusions
After having selected a considerable number of datasets from the main available repositories, the authors of this paper evaluated the performance of some of the most relevant classifiers in the state of the art of machine learning and related areas.
However, the proposal was not limited to calculating the Sensitivity, Specificity and Balanced Accuracy values; a statistical analysis was also performed, with the support of the Friedman and Holm statistical tests.
One of the main contributions was a meta-learning process, whose usefulness to medical teams is undeniable. From the results of this paper, teams of doctors and human health researchers will have a valuable tool that can support them in making decisions about which classifier, or group of classifiers, could help them in pre-diagnoses of specific diseases.
A generic conclusion points out that the SVM model is one of the best-performing algorithms for medical datasets. This is supported by facts such as the perfect Balanced Accuracy of 1.0 obtained by its decision tree during the meta-learning process. Such results were obtained with a very small decision tree of five leaves, using only three complexity measures: L1, F2 and T2. In our opinion, such results represent a breakthrough for medical dataset classification, since the a priori determination of the expected performance of the seven analyzed classifiers could be a valuable aid to medical teams.
As future work, we plan to include more datasets from medical disease repositories and additional classification algorithms, such as convolutional neural networks, associative classifiers, and other deep learning algorithms, seeking to obtain datasets of diseases that are of interest in hospitals, in order to test the performance of the studied classifiers with regard to current needs.