A Comparative Performance Evaluation of Classification Algorithms for Clinical Decision Support Systems

Classification algorithms are widely used in clinical decision support systems. However, it is not always straightforward to understand how such algorithms behave on a multiple-disease prediction task. When a new classifier is introduced, we usually ask whether it performs well on a particular clinical dataset. The decision to use a classifier mostly relies on the type of data and classification task, and is therefore often made arbitrarily. In this study, we conduct a comparative evaluation of a wide array of classifiers from six different families, i.e., tree, ensemble, neural, probability, discriminant, and rule-based classifiers. Several real-world, publicly available datasets covering different diseases are included in the experiment in order to assess the generalizability of the classifiers across multiple disease prediction tasks. A total of 25 classifiers, 14 datasets, and three different resampling techniques are explored. This study reveals that the classifier most likely to be the best performer is the conditional inference tree forest (cforest), followed by linear discriminant analysis, the generalized linear model, random forest, and the Gaussian process classifier. This work contributes to the existing literature by providing a thorough benchmark of classification algorithms for multiple-disease prediction.


Introduction
Artificial intelligence (AI) has changed almost all aspects of our lives dramatically. Given the rapid development of AI, it would not be surprising if some human labor were replaced by AI in the near future. AI techniques, e.g., deep learning and other machine learning (ML) algorithms, have been employed in clinical applications to support intelligent systems for the early detection and diagnosis of disease [1]. Furthermore, they assist physicians by providing a second opinion on clinical diagnosis, prognosis, and other clinical decision tasks, helping to avoid potential human errors that might put a patient's life at risk [2,3]. Figure 1 illustrates how ML algorithms are employed in a clinical decision support system (CDSS).
With the emergence of new technological advancements, a large amount of clinical data has been stored and is ready to be analyzed by clinical researchers. For instance, large-scale clinical data about patients admitted to critical care units at a large tertiary care hospital are publicly available [4]. Nevertheless, many physicians still suffer from inaccurate prediction of disease outcomes due to a lack of knowledge about the available data analytics approaches. This condition creates a pressing need to improve disease prediction using advanced ML techniques. ML techniques have accordingly grown into a well-known tool for discovering and characterizing complex patterns and relationships in large datasets [5]. A prediction task can be either classification or regression, depending on the type of the target variable, i.e., categorical or numerical. More broadly, ML tasks can be categorized into two main natures, i.e., predictive and descriptive [6-8]. A predictive task infers a function from a set of labeled training samples by mapping input-output pairs; such a task is also known as supervised learning. Neural networks, classification/regression trees, and support vector classifiers/regressors are examples of supervised learning algorithms. In contrast, descriptive task algorithms, e.g., clustering and association techniques, attempt to make inferences from unlabeled input data. The goal of these approaches is to group objects into clusters or to discover interesting patterns among variables in databases. Examples of these techniques include K-means clustering, hierarchical clustering, and frequent itemset rule extractors.
Among the approaches mentioned above, the classification of clinical data is a particularly problematic task, since it can be confusing to choose among the many available classifiers. One cause is that classifiers come from different families, i.e., ensembles, decision trees, rule-based, Bayesian, and neural networks, to name a few. A researcher might choose classifiers erratically due to limited knowledge within his/her competence or field of interest. Moreover, selection is challenging because datasets are rarely uniform: disease types and other clinical context domains vary immeasurably in practice. A particular prediction method does not always achieve a high level of accuracy across all clinical application domains, because performance depends heavily on the context. More precisely, no one can guarantee that a proposed classifier will perform well on all clinical datasets unless an empirical benchmark is conducted [9].
Instead of simply conducting a qualitative analysis through a systematic mapping study of previously published works [2,6,10-12], this study focuses on a quantitative analysis of classification techniques for disease prediction. This empirical study helps researchers and practitioners decide on the best classification algorithms in a clinical application setting. Most researchers in medical informatics are familiar only with specific ML techniques; thus, picking the best-performing classifier is, in many cases, a resource-intensive task. In addition, the performance of a newly proposed classifier for clinical data analysis is often justified only against classifiers within a restricted group, excluding classifiers belonging to other groups. Hence, a cross-domain comparison of ML algorithms from different groups over different diseases remains unexplored. To the best of our knowledge, no other quantitative benchmark of ML algorithms focusing on clinical data has been carried out to date.
While some classification algorithms achieve superior results on a given dataset, their performance may differ markedly on other datasets. This behavior is consistent with the no-free-lunch theorem [9]: no single classifier or individual method can be a panacea for all problems. By providing a classifier benchmark for multiple diseases, this empirical study aims to find the best-performing classifiers across several clinical datasets. It is meant to help researchers and practitioners make a reasoned decision when picking among the available classifiers for clinical prediction.
According to the above-mentioned issues, this paper attempts to address the two following research questions (RQs):
• RQ1: What is the relative performance of classification algorithms with respect to different resampling strategies?
• RQ2: Among the various families, is there a best choice of classification algorithm for clinical decision support systems?

Related Work
There currently exists huge research interest in systematic mapping studies [2,13,14], which aim at identifying and categorizing prior published works in order to give a visual summary of their results. However, this approach is an unreliable barometer when taken as a guideline for finding the best-performing classifiers, because it only provides a literature review of the most frequently used ML techniques for clinical data analytics. For instance, Idri et al. [11] provided a systematic map of studies on the application of data processing in the healthcare domain, while a similar literature review of data mining methods for traditional medicine was reported in [15]. They recognized that support vector machines and neural networks have frequently been used to solve disease prediction tasks.
Jothi et al. [16] reviewed various papers in the healthcare field with respect to methods, algorithms, and results. Garciarena and Santana [17] investigated the relationship between the type of missing data, the choice of imputation method, and the effectiveness of classification algorithms that employed the imputed data. The performance of several classification algorithms for early detection of liver disease was explored in [18]; the assessment showed that the C5.0 and CHAID algorithms were able to produce rules for liver disease prediction. Kadi et al. [6] presented a systematic literature review of data mining techniques in cardiology, while Jain and Singh [19] focused their survey on the use of feature selection and classification techniques for the diagnosis and prediction of chronic diseases.
More recently, Moreira et al. [20] analyzed and summarized the current literature on smart decision support systems for healthcare according to taxonomy, application area, year of publication, and the approaches and technologies used. Sohail et al. [21], who overviewed previous research applied in the healthcare industry, concluded that there is no exclusive classifier or technique available to predict all kinds of diseases. A particular ML technique, i.e., ensemble (meta) learning for breast cancer diagnosis, was discussed in [10]; that work used a systematic mapping study to examine ensemble techniques applied to breast cancer data. Lastly, Nayar et al. [22] explored various applications that combine swarm intelligence with data mining in healthcare with respect to methods and results obtained.
It is worth mentioning that some existing works have performed comparative studies of classification techniques for the effective diagnosis of diseases. However, those studies are limited to either a particular disease or a particular ML technique, which makes the generalizability of the proposed methods to other clinical data with a different context questionable. For instance, Das [23] employed multiple classification techniques, i.e., neural networks, regression, and decision trees, for the diagnosis of Parkinson's disease; a neural network classifier was found to be the best performer with an accuracy of 92%. A disease detection model called 'HMV' was proposed to overcome the drawbacks of traditional heterogeneous ensembles [24]. A similar approach, called 'HM-BagMoov,' was proposed to solve the limitations of conventional heterogeneous ensembles [25]. These two approaches were evaluated on a number of clinical datasets, i.e., five heart datasets, four breast cancer datasets, two diabetes datasets, two liver disease datasets, and one hepatitis dataset.
To sum up, the aforementioned existing studies possess the following limitations: • Most studies simply reviewed previous publications using a mapping study; thus, they cannot serve as a support tool for making a more informed choice of the best-performing classifier for disease prediction.
• When multiple algorithms were compared on multiple datasets, statistical significance tests were lacking. Hence, the performance differences among the classification algorithms remained unrevealed.
In this study, a broad spectrum of classification algorithms covering different groups, i.e., meta, tree, rule-based, neural, and Bayesian, is taken into consideration. A total of 25 classifiers are involved in the comparison. In addition, 14 real-world clinical datasets with different peculiarities are included in the benchmark. In order to see the impact of different sampling strategies on classifier performance, we incorporate three different sampling techniques, i.e., subsampling, cross-validation, and multiple rounds of cross-validation. Finally, the results of statistical significance tests are reported in order to show that the performances of the classification algorithms differ significantly, i.e., at least one algorithm does not perform equally to the others.

Materials and Methods
In this section, the datasets employed in the experiment are presented, followed by a brief review of the classification algorithms. Finally, significance tests are discussed.

Datasets
We obtained most of the datasets from the UCI machine learning repository [26]. Twelve real-world datasets were downloaded from the UCI website, while two datasets, i.e., RSMH [8] and Tabriz Iran [27], are private (available upon request). The datasets cover seven different diseases: diabetes (4 datasets), breast cancer (3 datasets), heart attack (3 datasets), and one dataset each for thoracic, seizure, liver, and chronic kidney disease. Furthermore, some datasets hold several classes in their class label attribute and thus pose a multi-class classification problem. This is the case for Tabriz Iran [27], where the instances are classified into three classes, i.e., negative, positive, and old patient; Z-Alizadeh Sani [28], which possesses four class labels, i.e., normal, stenosis of the left anterior descending (LAD) artery, left circumflex (LCX) artery, and right coronary artery (RCA); Cleveland, whose target is an integer value ranging from 0 (no heart disease) to 4 (severe heart disease); and Epileptic Seizure Recognition, where instances labeled class 1 are diagnosed with seizure and instances in any of classes 2, 3, 4, and 5 are specified as non-seizure subjects. The remaining datasets have a binary class in their response attribute. In this study, all datasets having multi-class targets were transformed into binary class targets with a specific criterion, i.e., subjects labeled class 1 are diagnosed with the disease and labeled 0 otherwise.
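The label transformation described above can be sketched as follows. This is an illustrative Python snippet, not the authors' code; the example labels are hypothetical.

```python
# Collapse a multi-class disease target into the binary scheme used in
# this study: class 1 means "diseased", every other class maps to 0.
def binarize_labels(labels, positive_class=1):
    """Map a multi-class target to {1, 0}: 1 if diseased, 0 otherwise."""
    return [1 if y == positive_class else 0 for y in labels]

# e.g., Epileptic Seizure Recognition: class 1 = seizure, classes 2-5 = non-seizure
print(binarize_labels([1, 3, 5, 1, 2]))  # -> [1, 0, 0, 1, 0]
```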
In the experiment, each dataset undergoes a simple pre-processing step to ensure that the response attribute is a categorical variable with two categories. Other pre-processing steps, e.g., feature selection, are not carried out, for the following reasons: (i) our aim is not to achieve the best possible performance of each classifier on each dataset but to benchmark the algorithms on each dataset; (ii) the performance of a classifier on a subset of features might be random; and (iii) feature selection would usually be made per dataset, which would significantly increase the scope of this work. Any missing values in the datasets are treated using a "do-nothing" strategy, meaning that we let classification algorithms that tolerate missing values (e.g., the gradient boosting machine (GBM) and the generalized linear model (GLM)) learn the best imputation values for the missing data (e.g., mean imputation is typically used). For algorithms that do not tolerate missing values, any observation with one or more missing values is simply dropped. Finally, a simple transformation is applied to each dataset so that it is ready to be processed by Weka and R. Table 1 summarizes the collection of 12 datasets from the UCI repository and the two datasets from private clinical domains.
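A minimal sketch of the two missing-value policies just described: tolerant algorithms receive the rows unchanged ("do-nothing"), while for intolerant algorithms any observation with one or more missing values is dropped. Function and variable names are illustrative, not from the paper.

```python
# "Do-nothing" vs. drop-incomplete-rows handling of missing values
# (None stands in for a missing entry here).
def prepare_rows(rows, tolerates_missing):
    if tolerates_missing:          # e.g., GBM, GLM impute internally
        return rows
    return [r for r in rows if all(v is not None for v in r)]

rows = [[5.1, 2.3, 1], [4.7, None, 0], [6.0, 3.1, 1]]
print(prepare_rows(rows, tolerates_missing=False))  # drops the row with None
```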

Classification Algorithms
Twenty-five classification algorithms implemented in R and Weka are included in this study. These classifiers were chosen with respect to their performance behavior reported previously in the CDSS domain; previous works have used a variety of classifiers, ranging from tree-based learners [18] to ensemble learners [10]. All classifiers implemented in R are accessible through the mlr package [39], while classifiers implemented in Weka [40] are run from the command line via the Java class of each classifier. We briefly explain all classifiers below, grouped according to the family they belong to. All default learning parameters are used in the experiment. For the sake of reproducibility, the learning parameters of each classifier are listed in Appendix A.

C50 decision tree (C50)
The classifier is an extension of the C4.5 algorithm presented in [41], with extra improvements such as boosting, generating smaller trees, and unequal costs for different types of errors. Tree pruning is performed by a final global pruning strategy in which costly and complex sub-trees are removed whenever the error rate exceeds the baseline, e.g., the standard error rate of a decision tree without pruning. C50 can generate a set of rules as well as a classification tree.

ii. Credal decision tree (CDT)
It takes into account imprecise probabilities and uncertainty measures as split criteria [42]. This procedure differs slightly from C4.5 or C50, where information gain is used as the split criterion to choose the split attribute at each branching node of the tree.

iii. Classification and regression tree (CART)
It is trained in a recursive binary splitting manner to generate the tree. Binary splitting is a numerical procedure in which all the values are ordered and different split points are tested using a cost function; the split with the lowest cost is chosen [43].

iv. Random tree (RT)
It grows a tree using K randomly selected input features at each node, without pruning. The cost function (error) is estimated during training; thus, there is no separate accuracy estimation operation, i.e., cross-validation or a train-test split, to obtain an estimate of the training error [44].
v. Forest-PA (FPA)
The classifier generates bootstrap samples from the training set and trains a CART classifier on each bootstrap sample using the weights of the attributes. The weights of the attributes that are present in the latest tree are then updated incrementally, while the weights of applicable attributes that are not present in the latest tree are updated using their respective weight increment values [45].
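The binary splitting step used by CART above can be sketched in pure Python: every candidate threshold on one numeric feature is scored with the weighted Gini impurity as the cost function, and the lowest-cost split is kept. This is an illustrative sketch for binary labels, not the paper's implementation.

```python
# CART-style binary split search on a single numeric feature.
def gini(labels):
    """Gini impurity of a list of binary labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n           # fraction of positives
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(values, labels):
    """Return (threshold, cost) of the cheapest binary split x <= t."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        cost = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if cost < best[1]:
            best = (t, cost)
    return best

print(best_split([1, 2, 8, 9], [0, 0, 1, 1]))  # -> (2, 0.0): a perfect split
```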

Random forest (RF)
It uses decision trees as base classifiers; each tree is grown to a depth of one, and the same procedure is then replicated for all other nodes in the tree until the specified depth of the tree is reached [44].

ii. Extra trees (XT)
It works similarly to the random forest classifier, but the features and splits are chosen at random; thus, it is also known as extremely randomized trees. Because splits are chosen at random, the computational cost (and variance) of extra trees is lower than that of random forest and decision tree [46].

iii. Rotation forest (RoF)
The classifier randomly generates M feature subsets, and principal component analysis (PCA) is applied to each subset in order to restore a full feature set (i.e., using M axis rotations) for each base learner (e.g., a decision tree) in the ensemble [47].

iv. Gradient boosting machine (GBM)
The classifier was proposed to improve the performance of the classification and regression tree (CART). It constructs an ensemble serially, where each new tree in the sequence is in charge of rectifying the prior trees' prediction errors [48].

v. Extreme gradient boosting machine (XGB)
The classifier is a state-of-the-art implementation of the gradient boosting algorithm. It shares a similar principle with GBM, but lower computational complexity is one of its advantages. In addition, XGB uses a more regularized model, which reduces model complexity while improving prediction accuracy [49].

vi. CForest (CF)
The classifier differs from random forest in terms of the base classifier employed and the aggregation scheme implemented. It utilizes conditional inference trees as base learners, and its aggregation scheme averages the weights obtained from each tree rather than averaging predictions directly as random forest does [50].

vii. Adaboost (AB)
The classifier attempts to improve the performance of a weak classifier, e.g., a decision tree. The weak learner is trained sequentially on several bootstrap resamples of the learning set. Such a sequential scheme feeds the results of the previous classifier into the next one to improve the final prediction, with each later classifier focusing more on the mistakes of the earlier ones [51].

Deep learning (DL)
It derives from a multilayer feed-forward neural network trained by back-propagation with stochastic gradient descent. The primary distinction from a conventional neural network is its large number of hidden layers, e.g., four or more. In addition, some hyper-parameters need to be fine-tuned properly, and a grid search is one option for obtaining the best parameter settings [52].

ii. Multilayer perceptron (MLP)
The classifier is a fully connected feed-forward network, trained by the error back-propagation method [53].

iii. Deep neural network with a stacked autoencoder (SAEDNN)
It is a deep learning classifier in which the weights are initialized by a stacked autoencoder [54]. Similar to a deep belief network, it is trained with a greedy layer-wise algorithm, with reconstruction error as the objective function.

iv. Linear support vector machine (SVM)
A support vector machine works on the principle of a hyperplane that separates the data in a higher-dimensional space [55]. In this study, a linear implementation [56] is used with an L2-regularized, L2-loss descent method, because the linear implementation is computationally efficient compared to, for instance, LibSVM [57].

Naive Bayes (NB)
A Naive Bayes classifier performs classification based on the conditional probability of a categorical class variable, assuming that each variable contributes independently to that probability [58]. In many application domains, the maximum likelihood method is commonly used for parameter estimation. The classifier can be trained very efficiently in a supervised learning setting.

ii. Gaussian process (GP)
A Gaussian process is defined by a mean and a covariance function. The function in any data modeling problem is considered to be a single sample from a Gaussian distribution. In the classification task, the Gaussian process uses the Laplace approximation for parameter estimation [59].

iii. Generalized linear model (GLM)
The classifier can be used for both classification and regression tasks. In this study, a multinomial family generalization is used as we deal with multi-class response variables. It models the probability of an observation belonging to an output category given the data [60].

Linear discriminant analysis (LDA)
The classifier assumes that the data in each class are Gaussian and that each attribute has the same variance. It estimates the mean and the variance for each class, and prediction is made by estimating the probability that a test instance belongs to each class; the output class is the one with the highest probability, estimated using Bayes' theorem [61].

ii. Mixture discriminant analysis (MDA)
This classifier is an extension of linear discriminant analysis. It performs classification based on mixture models, where a mixture of normals is employed to obtain a density estimate for each class [62].

iii. K-nearest neighbor (K-NN)
The classifier predicts each row of the test set by finding the k nearest training set vectors (measured by Euclidean distance). Classification is then made by majority voting, with ties broken at random [63].
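The K-NN prediction rule described above can be sketched in pure Python: find the k training vectors closest in Euclidean distance and take a majority vote. Random tie-breaking is omitted here for determinism; the data are illustrative.

```python
# Minimal k-nearest-neighbor prediction by Euclidean distance + majority vote.
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    dists = sorted((math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

X = [(0, 0), (0, 1), (5, 5), (6, 5)]
y = [0, 0, 1, 1]
print(knn_predict(X, y, (5, 6), k=3))  # -> 1 (two of the three nearest are class 1)
```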

Repeated incremental pruning (RIP)
This classifier was originally developed to improve the performance of the IREP algorithm. It constructs a rule with the following two procedures: (1) the data samples are randomly divided into two subsets, i.e., a growing set and a pruning set, and (2) a rule is grown using the FOIL algorithm. After a rule is generated, it is immediately pruned by eliminating any final sequence of conditions from the rule [64]. In this study, we employ a Java implementation of the RIP algorithm, called JRIP.

ii. Partial decision tree (PART)
This classifier is a rule-induction procedure that avoids global optimization. It combines two major rule generation techniques, i.e., the decision tree (C4.5) and RIP. PART produces rule sets that are as accurate as and of a similar size to those produced by C4.5, and more accurate than those of the RIP classifier [65].
iii. One rule (1-R)
This is a straightforward classifier that produces one rule for each predictor in the training set and selects the rule with the minimum total error as its final rule. The rule for a predictor is produced by generating a frequency table of that predictor against the target feature [66].
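The one-rule procedure just described can be sketched in a few lines: for each predictor, build a frequency table mapping each of its values to the most frequent class, score the resulting rule by its total error, and keep the predictor with the lowest error. This is an illustrative sketch, not Weka's OneR implementation; the toy records are hypothetical.

```python
# One-rule (1-R) classifier: one frequency-table rule per predictor,
# keep the rule with the minimum total error.
from collections import Counter

def one_r(rows, target):
    """rows: list of dicts; target: name of the class attribute."""
    best_feature, best_rule, best_err = None, None, float("inf")
    features = [f for f in rows[0] if f != target]
    for f in features:
        table = {}                                    # value -> class counts
        for r in rows:
            table.setdefault(r[f], Counter())[r[target]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in table.items()}
        err = sum(r[target] != rule[r[f]] for r in rows)
        if err < best_err:
            best_feature, best_rule, best_err = f, rule, err
    return best_feature, best_rule, best_err

rows = [
    {"bp": "high", "age": "old", "ill": 1},
    {"bp": "high", "age": "young", "ill": 1},
    {"bp": "low", "age": "old", "ill": 0},
    {"bp": "low", "age": "young", "ill": 0},
]
print(one_r(rows, "ill"))  # -> ('bp', {'high': 1, 'low': 0}, 0)
```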

Resampling Procedures
Several resampling procedures, i.e., subsampling, cross-validation, and multiple rounds of cross-validation, are included in this study. The objective of using different resampling methods is to ensure that the performance of the classifiers is not obtained by chance. A generic resampling procedure is illustrated in Algorithm 1 [67]. Subsampling is a repeated hold-out in which the original dataset D is split into two disjoint parts with a specified proportion, in this study 80% for the training set and 20% for the testing set; the procedure is replicated ten times. Cross-validation divides the dataset into k (10 in our case) equally sized disjoint parts (subsets) and employs k-1 parts to build the model, while the remaining part is used for validation. This step is repeated for k rounds, with a different test subset in each round. Lastly, multiple-round cross-validation is carried out by repeating twofold cross-validation five times; this yields the same number of performance values as the two other resampling procedures. We take the average value for each resampling procedure.

Algorithm 1 General resampling strategy
Input: A dataset D of n observations d_1 to d_n, the number of subsets k, and a loss function L.
Process:
  1. Split D into k disjoint subsets.
  2. For each subset i = 1, ..., k: train the classifier on D without subset i, evaluate the loss L on subset i, and store the validation statistic s_i in S.
  3. Aggregate S, i.e., mean(S).
Output: Summary of validation statistics.
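Algorithm 1 can be sketched in Python as follows. This is a hedged illustration: the dataset is split into k disjoint subsets, a model is fitted on k-1 of them, a loss L is evaluated on the held-out subset, and the collected statistics S are aggregated by their mean. The learner and loss below are trivial stand-ins, not classifiers from the study.

```python
# Generic resampling loop: k disjoint subsets, train on k-1, validate on 1.
import random

def resample(D, k, fit, loss, seed=0):
    idx = list(range(len(D)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]            # k disjoint subsets
    S = []
    for i in range(k):
        test = [D[j] for j in folds[i]]
        train = [D[j] for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit(train)
        S.append(loss(model, test))                  # validation statistic s_i
    return sum(S) / len(S)                           # aggregate: mean(S)

# Toy usage with stand-in learner (majority class) and loss (error rate):
D = [(x, int(x > 4)) for x in range(10)]             # ten labeled observations
fit = lambda train: round(sum(y for _, y in train) / len(train))
loss = lambda model, test: sum(y != model for _, y in test) / len(test)
print(resample(D, k=5, fit=fit, loss=loss))
```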

Significance Tests
In order to provide an extensive empirical evaluation of classifiers, it is essential to use statistical tests to verify that the performance differences between classifiers are measurable [68]. The tests are briefly discussed as follows.
• A non-parametric Friedman test [69] is used to inspect whether there exist significant differences between the classifiers with respect to the performance metric mentioned earlier [70]. The null hypothesis (H_0) is that no performance differences between classifiers exist, i.e., the expected difference µ_d equals zero; it is tested against the alternative hypothesis (H_A) that at least one classifier does not have the same performance, i.e., the expected difference µ_d is not equal to zero. The statistic of the Friedman test is calculated according to Equation (1):

χ²_F = (12v / (w(w+1))) [ Σ_{j=1..w} R_j² − w(w+1)²/4 ],    (1)

where v denotes the number of datasets (14 in our case), w denotes the number of classifiers (25 in our case) to be compared, and R_j is the average rank of classifier algorithm j.
• The Finner test [71] is a p-value adjustment in a step-down manner. Let p_1, p_2, ..., p_{w−1} be the p-values ordered increasingly; each p_i is adjusted as 1 − (1 − p_i)^{(w−1)/i}.
• The Nemenyi test [72] works by calculating the average rank of each benchmarked algorithm and taking their differences. If such average differences are larger than or equal to a critical difference (CD), the performances of the algorithms are significantly different. The CD can be obtained using the following formula:

CD = q* √( w(w+1) / (6v) ),

where q* is the Studentized range statistic divided by √2.
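A numerical sketch of the Friedman chi-square statistic and the Nemenyi critical difference, assuming their standard textbook forms with v datasets and w classifiers (the critical value q_alpha must be supplied for the chosen significance level and number of classifiers):

```python
# Friedman chi-square from average ranks, and Nemenyi critical difference.
import math

def friedman_stat(avg_ranks, v):
    """avg_ranks: average rank R_j of each of the w classifiers over v datasets."""
    w = len(avg_ranks)
    return (12 * v / (w * (w + 1))) * (
        sum(R ** 2 for R in avg_ranks) - w * (w + 1) ** 2 / 4
    )

def nemenyi_cd(w, v, q_alpha):
    """Critical difference; q_alpha is the Studentized range statistic / sqrt(2)."""
    return q_alpha * math.sqrt(w * (w + 1) / (6 * v))

# Toy check: w = 3 classifiers over v = 4 datasets with identical rankings
# everywhere, i.e., the maximal spread of average ranks.
print(friedman_stat([1.0, 2.0, 3.0], v=4))  # -> 8.0
```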
More specifically, the significance-testing procedure can be broken down as follows.
• Calculate the classifiers' ranks on each dataset using the Friedman rank with respect to the area under the ROC curve (AUC) metric, in increasing order from the best performer to the worst performer.
• Calculate the average rank of each classifier over all datasets. The best-performing classifier is the one with the lowest Friedman rank; note that merit is inversely proportional to the numeric rank value.
• If the Friedman test demonstrates significant results (p-value < 0.05 in our case), run Finner's method. It is carried out as a pairwise comparison in which the best-performing algorithm serves as a control algorithm and is compared with the remaining algorithms.
• Perform a Nemenyi test to compare the performance of the classifiers by family.
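The Finner step-down adjustment used in this procedure can be sketched as follows, assuming its standard form: given the m = w − 1 ordered p-values of the pairwise comparisons, each p_i is adjusted as 1 − (1 − p_i)^(m/i), with monotonicity enforced down the list. The p-values below are hypothetical.

```python
# Finner step-down p-value adjustment for m pairwise comparisons.
def finner_adjust(p_values):
    m = len(p_values)
    ordered = sorted(p_values)
    adjusted, running_max = [], 0.0
    for i, p in enumerate(ordered, start=1):
        adj = 1.0 - (1.0 - p) ** (m / i)
        running_max = max(running_max, adj)           # keep step-down monotone
        adjusted.append(min(1.0, running_max))
    return adjusted

print(finner_adjust([0.01, 0.04, 0.03]))
```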

Overall Analysis
In this research, we run 25 algorithms over 14 datasets and three different validation strategies, yielding 1050 algorithm-dataset-validation combinations. Three different validation procedures are taken into account to counteract the poor bias and variance that may result from the small sample size of each dataset. Furthermore, using different validation procedures ensures that the experimental results were not obtained by chance. The test results are the averages of the 10 values produced by each resampling method. All classifiers' performances are assessed in terms of the AUC metric. Referring to the contingency table shown in Figure 2, the AUC value of a classification algorithm can be calculated as the average of sensitivity and specificity, i.e., AUC = (TP/(TP + FN) + TN/(TN + FP))/2. To maintain the readability of this paper, all performance results are provided in Appendix B. Figures 3-5 first show the AUC value of each classification algorithm for each validation method. The boxplots show the distributions of the AUC values obtained for each dataset; more specifically, they indicate the performance variability of the classification algorithms relative to each dataset. The mean AUC values are grouped by resampling technique. It can be observed that there is greater variability for FPA, RoF, RF, GBM, SVM, LDA, AB, and 1-R, regardless of the validation method considered. On the other hand, MLP and SAEDNN show less variability, meaning that their performance is consistent regardless of the clinical dataset used. Overall, the six best-performing algorithms (in descending order of mean AUC value) are CF, RF, FPA, RoF, DL, and GBM. Furthermore, since simply taking the average performance value might lead to bias, a Friedman ranking is adopted for assessing classifier performance (see Figure 6). Instead of analyzing the raw performance, the Friedman rank analysis is based on the rank of each classifier on each dataset.
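The AUC computation from a single contingency table can be sketched as follows. The single-threshold balanced form AUC = (sensitivity + specificity)/2 is assumed here; the counts are hypothetical.

```python
# AUC from one contingency table (TP, FN, TN, FP), single-threshold form.
def auc_from_table(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return (sensitivity + specificity) / 2

print(auc_from_table(tp=30, fn=10, tn=45, fp=15))  # -> 0.75
```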
To address RQ1, we analyze the relative performance of the classifiers under the different resampling strategies using the Friedman rank. With reference to 10cv, the Friedman rank confirms that the six top-performing classifiers, in ascending rank order from best to worst, are LDA, CF, GLM, RoF, RF, and GBM, with average ranks 5.04, 5.21, 5.32, 6.32, 6.46, and 6.47, respectively.
According to the Friedman rank under 10ho, the top-5 performers are CF, followed by DL, GLM, GP, and LDA, with average ranks 5.61, 5.82, 5.86, 5.93, and 5.96, respectively. Subsequently, with respect to 5 × 2cv, CF achieves the topmost performance with an average rank of 4.82, followed by GLM, RF, GP, and LDA, with average ranks 5.11, 5.14, 5.71, and 5.89, respectively. For the sake of an inclusive evaluation, the behavior of the top-performing classifiers can be discussed as follows:
• Overall, it is worth mentioning that, over the three resampling techniques, CF performed best, with an average AUC of 0.857 and an average rank of 5.16. The result is rather unexpected, since a conditional inference tree ensemble outperformed gradient-based ensemble algorithms such as XGB.
• CF performs comparably to RF, since the two algorithms work in a similar way [73]. Therefore, it is not surprising that CF and RF are not significantly different.
• The ensemble learners RF and RoF have performed better than other ensemble models such as FPA and XGB.
• Regarding the high performance of RF, recall that RF is built as an ensemble of decision trees; the randomness of each tree split usually provides better prediction performance. In addition, RF is resilient to imbalanced datasets [74]. Note that several datasets employed in this experiment are highly imbalanced (see Table 1).
• LDA is listed among the top-5 best-performing classifiers. LDA is known as a simple but robust predictor when the dataset is linearly separable.

Version October 1, 2020 submitted to Mathematics

Figure 5. Performance AUC for each classifier w.r.t. mean 5 × 2cv.

Based on our experimental results, the worst performer over the three resampling techniques is SAEDNN, with average rank 22.74 and average AUC 0.538. This is not surprising because a deep neural network requires a large number of training samples to build its model. Moreover, neural-based classifiers are nonlinear classifiers, meaning that they are more sensitive to hyperparameter tuning, e.g., the learning rate, number of epochs, and number of hidden layers. The low AUC might also result from an insufficient number of training samples when constructing the classification model. The next worst models are 1-R, SVM, and MLP.
A post-hoc test, i.e., the Finner test, is carried out after an omnibus Friedman test: if the Friedman test rejects the null hypothesis (i.e., that all classifiers perform equally), the post-hoc test is applied. The results of the statistical significance tests for each resampling technique are given in Tables 2-4. Concerning the post-hoc test, several options are commonly available, such as a pairwise comparison, a comparison with a control classifier, and all pairwise comparisons. In this study, all pairwise comparisons are adopted, with p-value < 0.01 set as the significance threshold. The results in Tables 2-4 are further discussed below. According to the Finner test, the low-performing classifiers, i.e., RIP, SAEDNN, SVM, and MLP, are significantly different from the remaining algorithms. In order to inspect how the three resampling procedures impact the classifiers' performance, we extend our comparative analysis in the following section.
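The Finner correction referenced above is a step-down procedure over the m pairwise p-values: with p-values sorted ascending, the i-th adjusted value is 1 − (1 − p_(i))^(m/i), made monotone by a running maximum. A minimal sketch (the function name is ours, not from the paper's toolchain):

```python
def finner_adjust(pvalues):
    """Finner step-down adjustment of m p-values.
    With p-values sorted ascending, adj_(i) = max_{j<=i} [1 - (1 - p_(j))**(m/j)],
    capped at 1. Returns adjusted p-values in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order, start=1):
        adj = 1 - (1 - pvalues[idx]) ** (m / rank)
        running_max = max(running_max, adj)  # enforce monotonicity
        adjusted[idx] = min(running_max, 1.0)
    return adjusted

# Two hypothetical pairwise p-values
print(finner_adjust([0.01, 0.04]))  # ≈ [0.0199, 0.04]
```

A rejection is then declared for each pair whose adjusted p-value falls below the chosen threshold (0.01 in this study).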

Analysis by Each Family
In this section, in order to answer RQ2, we focus on the benchmark of the best-performing classifier in each family, as well as the effect of different resampling strategies on the classifiers' performance. Accordingly, six top performers, corresponding to the six families, are included in the analysis. Among the tree-based classification algorithms, FPA is the best classifier, whilst CF and DL are the leading classifiers among the ensemble and neural-based algorithms, respectively. GLM has performed best among the probability-based classifiers, whilst LDA is superior to the other discriminant methods. Finally, PART is the outstanding classifier in the rule-based family. Figures 7-9 present the CD plots using the Nemenyi test for each resampling technique. It is obvious that, with respect to 10cv, PART and DL are significantly different from the rest of the algorithms, since their rank difference exceeds the CD. The results of 10ho and 5 × 2cv are quite similar, where PART is the only algorithm that shows a significant performance difference compared to the remaining algorithms.

Table 3. Results of a post-hoc test using Finner correction w.r.t. 10ho (bold indicates significance at p-value < 0.01).
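The critical difference (CD) used in the Nemenyi plots follows the standard formula CD = q_alpha * sqrt(k(k+1)/(6N)) for k classifiers over N datasets, where q_alpha is a tabulated Studentized-range-based critical value. A minimal sketch, with the function name and the example q_0.05 ≈ 2.850 for k = 6 taken from the usual tables rather than from this paper:

```python
import math

def nemenyi_cd(q_alpha, k, n_datasets):
    """Critical difference for the Nemenyi post-hoc test: two classifiers
    differ significantly if their average Friedman ranks differ by more
    than this value. q_alpha is the tabulated critical value."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

# e.g., k = 6 classifiers over N = 14 datasets, q_0.05 ≈ 2.850
print(round(nemenyi_cd(2.850, 6, 14), 3))  # 2.015
```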

Conclusions
Rather than simply carrying out a mapping study, this study provided a more informed option for choosing the best classifier for disease prediction. It demonstrated a thorough benchmark of 25 classification techniques, corresponding to six families, over 14 real-world disease datasets and three different resampling procedures. Based on the experimental results, CF showed its superiority over the other classifiers, achieving an average AUC of 0.857 over all resampling techniques with an average Friedman rank of 5.16. The two worst performers, however, were from the neural-based classifier family, i.e., SAEDNN and MLP. These classifiers were not competitive since, in particular, SAEDNN requires a sufficiently large training set to build its model. In general, the other top-4 classifiers that were very powerful for clinical decision support systems were LDA, GLM, RF, and GP. The following presents the two RQs and the answers to them.
• RQ1: What is the relative performance of the classification algorithms with respect to different resampling strategies? According to the Friedman rank, the different resampling techniques had no significant impact on several classifiers.
• RQ2: Among the various families, is there a best choice when selecting a classification algorithm for a clinical decision support system? This study revealed that choosing a classification algorithm for disease prediction highly depends on the nature of the practical problem, e.g., imbalanced datasets, linearly or nonlinearly separable data, and expert knowledge regarding the data and domain. Nevertheless, it can be concluded that CF, LDA, GLM, RF, and GP were the best choices so far in the clinical decision support system field, since they are resilient to imbalanced datasets.
Among the potential ways to extend this study, we believe that including more clinical datasets would be the most interesting, since this might help researchers and clinical practitioners select suitable classifiers in different application domains. Future work could also address the main limitation of this study, namely the low AUC performance of state-of-the-art classifiers such as XGB and SAEDNN; with a larger amount of training data, the performance of deep structured learning might improve. Given that deep learning has played a significant role in classifying medical images, acoustic signals, and biosignals from medical devices, it would also be meaningful to compare the performance of machine learning and deep learning on such clinical datasets.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. List of Learning Parameters Used in the Experiment
In the following, the tuned parameters of the 25 classifiers are briefly listed. Their names and implementations are specified as name_implementation, where the implementation can be r (in R, using mlr) or w (Weka).

• RIP_w
The amount of data used for pruning: 3; minimum total weight: 2; number of optimizations: 2; use pruning: yes.