Performance Assessment of Supervised Classifiers for Designing Intrusion Detection Systems: A Comprehensive Review and Recommendations for Future Research

Supervised learning and pattern recognition are crucial areas of research in information retrieval, knowledge engineering, image processing, medical imaging, and intrusion detection. Numerous algorithms have been designed to address such complex application domains. Despite an enormous array of supervised classifiers, researchers have yet to identify a robust classification mechanism that accurately and quickly classifies the target dataset, especially in the field of intrusion detection systems (IDSs). Most of the existing literature considers only the accuracy and false-positive rate for assessing the performance of classification algorithms. The absence of other performance measures, such as model build time, misclassification rate, and precision, should be considered the main limitation of classifier performance evaluation. This paper's main contribution is to analyze the current state of the literature in the field of network intrusion detection, highlighting the number of classifiers used, dataset sizes, performance outputs, inferences, and research gaps. To this end, fifty-four state-of-the-art classifiers from different groups, i.e., Bayes, functions, lazy, rule-based, and decision tree, have been analyzed and explored in detail, considering the sixteen most popular performance measures. This research work aims to identify a robust classifier that is suitable for consideration as the base learner while designing a host-based or network-based intrusion detection system. The NSLKDD, ISCXIDS2012, and CICIDS2017 datasets have been used for training and testing purposes. Furthermore, a widespread multi-criteria decision-making algorithm, the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS), allocated ranks to the classifiers based on the observed performance on the concerned datasets. J48Consolidated provided the highest accuracy of 99.868%, a misclassification rate of 0.1319%, and a Kappa value of 0.998.
Therefore, this classifier has been proposed as the ideal classifier for designing IDSs.


Introduction
The growing footprint of artificial intelligence-enabled Internet of Things (IoT) devices [1] in our day-to-day life attracts hackers and potential intrusions. In 2017, the WannaCry ransomware, a self-propagating malware, devastatingly impacted computing resources by infecting more

Related Works
Supervised classifiers are extensively used in the field of network security. The most promising applications of machine learning techniques are in risk assessment after the deployment of various security apparatus [25], in identifying risks associated with various network attacks, and in predicting the extent of damage a network threat can cause. Apart from these, supervised classification techniques have been explored and analyzed by numerous researchers in a variety of application areas. Most of those studies focused on a detailed exploration to validate a theory or on a performance evaluation to identify a versatile classifier [26][27][28]. The performance of supervised classifiers has been explored in intrusion detection [29], robotics [18], the semantic web [19], human posture recognition [30], face recognition [20], biomedical data classification [31], handwritten character recognition [22], and land cover classification [21]. Furthermore, an innovative semi-supervised heterogeneous ensemble classifier called Multi-train [32] was proposed, where a justifiable comparison was made with supervised classifiers such as k-Nearest Neighbour (kNN), J48, Naïve Bayes, and random tree. Multi-train also improved prediction accuracy on unlabeled data, thereby reducing the risk of incorrectly labeling such data. A study that exclusively deals with classifiers' accuracy measures using multiple standard datasets was proposed by Labatut et al. [33]. An empirical analysis of supervised classifiers was carried out by Caruana et al. [34] using eleven datasets and eight performance measures, where calibrated boosted trees appeared as the best learning algorithm. In addition, a systematic analysis of supervised classifiers was carried out by Amancio et al. [35] under varying classifier settings.
The focus of this paper is to analyze the performance of various supervised classifiers using IDS datasets. Therefore, the authors have decided to review related articles in the literature that examined different classifiers using IDS datasets. The classifier analysis is expected to provide a platform for researchers to devise state-of-the-art IDSs and quantitative risk assessment schemes for various cyber defense systems. Numerous studies and their detailed analytical findings related to supervised classifiers have been outlined in Table 1, which summarizes the taxonomy of the analyzed articles. In the last column, an attempt has been made to outline the inferences, limitations, or research gaps encountered. The summarization of these analyses provides scope for a meta-analysis of the supervised classifiers, which ultimately shows the direction or justification for further investigation in the field of supervised classification using intrusion detection datasets. From Table 1, it has been observed that the decision tree and function-based approaches are the most explored. The usage statistics of supervised classifiers are presented in Figure 1.
According to Figure 1, J48 (C4.5) and Random Forest from the decision tree group, and the function-based SVM and Multilayer Perceptron (Neural Network), have been analyzed considerably by numerous researchers. In this work, the authors have tried to understand the reason behind the popularity of decision trees and function-based approaches. Therefore, the authors have summarized the performance metric results used to explore those classifiers in the analyzed papers. Most researchers focused on accuracy scores; therefore, the authors used the accuracy score as a base measure to understand the preference for decision trees and function-based classifiers.
Therefore, in this study, the authors have calculated the minimum, maximum, and average accuracy of the Bayes, Decision trees, Functions, Lazy, and Rules groups of classifiers with respect to the literature outlined in Table 1. The calculated detection accuracy of the surveyed research papers is presented in Figure 2. In Figure 2, almost all groups of classifiers show a maximum accuracy rate of more than 99%.

Table 1. Detailed findings and analysis of supervised classifiers.

Inferences/Observations/ Limitations/Research Gaps
With 20 features, BayesNet shows the highest accuracy of 99.3% for classifying DDoS attacks, and PART shows 98.9% for classifying Probe attacks. No class imbalance issue was found. Tested on an older dataset, which is now obsolete. U2R and R2L attacks were completely ignored; hence, classifier performance may vary with the inclusion of U2R and R2L instances.
The Gaussian classifier seems effective for R2L and Probe attacks, with the highest detection rates of 0.136 and 0.874, respectively. Naïve Bayes proved suitable for U2R attacks with the highest detection rate of 0.843, while Decision Tree and Random Forest classified DoS attacks with the highest detection rate of 0.972. Considering only the highest detection rate across the three training sets is not convincing; instead, the average detection rate would have better highlighted the suitable classifiers for the given scenario.
A decent number of performance measures were used to analyze the classifiers, but other state-of-the-art classifiers are missing from the comparison, and the dataset sample size and number of features considered are not precisely specified. Although Naïve Bayes proved to be the better classifier in terms of FP Rate, ID3 performs far better than Naïve Bayes overall. Class imbalance issues are not considered during evaluation.
The accuracy of the induction tree is promising, with an overall rate of 99.839%. Although it is appreciable that the induction tree performs well on the class-imbalanced KDD'99 dataset, the size of the training set and the class-wise breakup of training instances are not precisely specified. The reason for considering different training instances for three different classifiers is not clear. Considering the ROC area, it is evident that the induction tree correctly classified Neptune, Smurf, pod, teardrop, port sweep, and back attack instances.
C4.5 scores the highest average accuracy of 64.94% as compared to 62.7% for SVM. Considering attack-wise accuracy, C4.5 seems suitable for detecting Probe, DoS, and U2R attacks, whereas SVM classifies R2L threats better. The class imbalance issue is not addressed.
J48 (C4.5) proved to be an accurate classifier for classifying test instances. The data extraction and preprocessing procedure is not clearly defined. The training set has a high class imbalance, so evaluating the classifiers only in terms of accuracy and detection rate is not sufficient. [58]

Similarly, considering the average accuracy, the Lazy classifiers are far ahead of the other groups of classifiers. Despite this impressive accuracy rate, the Lazy group classifiers were deeply analyzed by only a handful of researchers [48][49][50]. On the other hand, decision trees and function-based classifiers were the center point of many research papers. Consequently, in this paper, the authors have decided to explore multiple classifiers from all the classifier groups. In this work, fifty-four state-of-the-art classifiers of six different classifier groups were analyzed. The classifier groups were created based on their functionality and the guidelines presented by Frank et al. [59].
The classifiers under evaluation and their groups are presented in Tables 2-7 under six different classifier groups.

Materials and Methods
The authors used Weka 3.8.1 [59] software on a CentOS platform on the Param Shavak supercomputing facility provided by the Centre for Development of Advanced Computing (CDAC), India. The system consists of 64 GB RAM and two multicore CPUs, each with 12 cores, delivering a performance of 2.3 Teraflops. To evaluate all the classifiers in Tables 2-7, the authors considered samples of the NSLKDD [118][119][120], ISCXIDS2012 [121], and CICIDS2017 [122] datasets. The training and testing sample size for each dataset is outlined in Table 8. The training and testing samples were generated with a 66%/34% split of the total sample size. All three datasets (NSLKDD, ISCXIDS2012, and CICIDS2017) have a high class imbalance. Additionally, NSLKDD and CICIDS2017 are multi-class, whereas the ISCXIDS2012 dataset contains binary class information. The performance of a classifier cannot be explored only through its accuracy and detection rate. Therefore, the authors have considered a variety of performance measures, such as training time, testing time, model accuracy, misclassification rate, Kappa, mean absolute error, root mean squared error, relative absolute error, root relative squared error, true positive rate, false-positive rate, precision, and the receiver operating characteristic (ROC) curve. The ROC value reveals the real performance on class-imbalanced datasets such as CICIDS2017 and NSL-KDD. Similarly, the Matthews correlation coefficient (MCC) and precision-recall curve (PRC) are useful for evaluating binary classification on the ISCXIDS2012 dataset.
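The authors computed these measures in Weka; as a purely illustrative sketch (not the authors' tooling), the same measures can be reproduced in Python with scikit-learn on a toy imbalanced dataset and a 66%/34% split:

```python
# Illustrative sketch: computing several of the performance measures listed
# above with scikit-learn. The dataset is a synthetic stand-in, not an IDS sample.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             matthews_corrcoef, roc_auc_score)

# Toy imbalanced binary dataset (about 90%/10% class split).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.66,
                                          stratify=y, random_state=42)

clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

acc = accuracy_score(y_te, y_pred)        # model accuracy
miscls = 1.0 - acc                        # misclassification rate
kappa = cohen_kappa_score(y_te, y_pred)   # chance-corrected agreement
mcc = matthews_corrcoef(y_te, y_pred)     # robust under class imbalance
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # ROC area
print(acc, miscls, kappa, mcc, auc)
```

As the paper notes, accuracy alone is uninformative on imbalanced data, which is why Kappa, MCC, and the ROC area are computed alongside it.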
The experiment for evaluating classifiers covers five essential steps [123]: dataset selection, classification, weight calculation using multi-criteria decision making, weight-to-rank transformation, and finally, global rank generation. Figure 3 shows the methodology used by the authors, who conducted all five steps iteratively for all datasets and classifiers under evaluation. In the initial step, a dataset is selected from the pool of datasets. Each dataset initially contains several tuples with variable class densities. From each dataset, the requisite number of random samples was generated; the output of this step is presented in Table 8. This procedure was conducted deliberately to ensure that no classifier was biased toward a specific dataset. The second step began by classifying each dataset using each classifier in the classifier pool. The performance of each classifier was tabulated for future reference. The process was conducted recursively for each dataset.
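The first two steps above (sampling a dataset, then running every classifier in the pool over it and tabulating results) can be sketched as a simple evaluation loop. This is our reconstruction under stated assumptions, not the authors' Weka workflow; the dataset and classifier names are hypothetical stand-ins:

```python
# Rough sketch of the iterative dataset-by-classifier evaluation loop.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-ins for NSLKDD / ISCXIDS2012 / CICIDS2017 samples.
datasets = {
    name: make_classification(n_samples=1000, weights=[0.85, 0.15],
                              random_state=seed)
    for seed, name in enumerate(["NSLKDD", "ISCXIDS2012", "CICIDS2017"])
}
classifiers = {"DecisionTree": DecisionTreeClassifier(random_state=0),
               "NaiveBayes": GaussianNB(),
               "kNN": KNeighborsClassifier()}

results = {}  # (dataset, classifier) -> accuracy, tabulated for later ranking
for dname, (X, y) in datasets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.66,
                                              stratify=y, random_state=0)
    for cname, clf in classifiers.items():
        clf.fit(X_tr, y_tr)
        results[(dname, cname)] = accuracy_score(y_te, clf.predict(X_te))
```

In the paper, each cell of such a results table holds not one but thirteen performance readings, which later feed the TOPSIS weighting stage.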
The third and fourth steps jointly work to achieve the research objectives. In this process, the average performance score of each group of classifiers has been analyzed. Additionally, each group's ranking has also been calculated to retrieve the best classifier group specific to the dataset. All the group's classifiers with better results were considered to evaluate their consistent performance across the three datasets. Furthermore, considering the performances of the best performing group's classifiers, the authors have calculated the weight and rank of each classifier of that group, specific to each dataset. The authors aimed to provide a reliable evaluation of the best classifier for each dataset.
The final step involved global weight and rank calculation. At this stage, the global weight of each classifier of the best performing group was calculated based on the ranks it received for each dataset, i.e., each classifier's individual scores across the three datasets were averaged. The resulting scores were then sorted to provide a clear presentation of the best performing classifier.
The five-step methodology embodies a two-stage procedure: first, the best classifier group was selected, and second, the best classifier within that group was proposed. Both selections were based on an extensively used conventional multiple-criteria decision-making (MCDM) method named TOPSIS. Before applying TOPSIS, the performance outcome of each classifier and each classifier group was calculated. Therefore, the authors have calculated 13 performance metrics for the classifiers.
Furthermore, the authors considered only eight performance measures, i.e., testing time per instance, accuracy, Kappa value, mean absolute error, false-positive rate, precision, Matthews correlation coefficient, and receiver operating characteristic curve value, for weighting and ranking purposes. On the one hand, these eight measures are in line with the aim of this research. On the other hand, all the other performance metrics can be derived from one of the measures considered in this study. Consequently, excluding the remaining measures did not affect the weighting and ranking process. The algorithmic method for weighting each classifier and classifier group based on TOPSIS is demonstrated in Table 9.
The algorithm begins by constructing a decision matrix M_d, in which the entry for the nth classifier (or classifier group) and the kth performance measure is the observed performance outcome. The decision matrix is the basis for the evaluation of the best classifier: it enables the decision-making module (TOPSIS) to calculate a weight for each classifier.
In the second stage, a weighted normalized decision matrix is calculated, in which each normalized entry r_ij is multiplied by W_j, the weight of the jth performance measure.
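This normalization and weighting step can be sketched with numpy (our illustration with hypothetical values, not the paper's code):

```python
# TOPSIS step 2 sketch: vector normalization of the decision matrix,
# followed by column-wise weighting.
import numpy as np

# Rows = classifiers, columns = performance measures (hypothetical readings).
M = np.array([[0.95, 0.02, 0.90],
              [0.90, 0.05, 0.85],
              [0.99, 0.01, 0.97]], dtype=float)
W = np.array([0.4, 0.3, 0.3])        # weight per measure, summing to 1

R = M / np.linalg.norm(M, axis=0)    # r_ij = m_ij / sqrt(sum_i m_ij^2)
V = W * R                            # v_ij = W_j * r_ij
print(V)
```

Vector normalization makes measures with different scales (e.g., seconds of testing time versus a 0-1 Kappa value) comparable before the weights are applied.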
The idea behind allocating an appropriate weight to each performance measure lies in its ability to rank classifiers specific to the domain area and learning environment. For instance, in high class-imbalance learning, the Matthews correlation coefficient (MCC), Kappa, and receiver operating characteristic (ROC) value should be given more weight than other performance measures. The datasets used here are class-imbalanced in nature; therefore, more emphasis has been given to performance measures suitable for a class imbalance environment. In this regard, eight performance measures have been shortlisted, and corresponding weights have been allocated for TOPSIS processing. The weights for the eight performance measures are presented in Table 10. Another reason for not considering all the performance measures is that the remaining measures can themselves be derived from those presented in Table 10. For instance, detection accuracy can be calculated from the True Positives (TP) and True Negatives (TN). Therefore, the True Positive Rate (TPR) and True Negative Rate (TNR) have been dropped from the classifier weight calculation. In this way, out of the 13 performance measures, only eight have been selected.

Table 9. The algorithm algoWeighting.

Step 3. Formation of the weighted normalized matrix: V_ij = W_j · r_ij, where W_j is the weight allocated to performance measure j.
Step 4. Estimation of the positive (A+) and negative (A−) ideal solutions.
Step 5. Estimation of the separation measure of each classifier/classifier group.
Step 6. Weight estimation of classifiers.

The algorithm uses the positive and negative ideal solutions to calculate the separation measure of each classifier/classifier group, which in turn yields each classifier's or group's score. The scores are used to rank the classifiers. The procedure followed here for calculating the rank of classifiers is presented in Table 11.
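Steps 3-6 can be sketched end-to-end as a compact TOPSIS routine. This is our reconstruction of the standard TOPSIS procedure under stated assumptions (hypothetical readings and weights), not the authors' exact implementation:

```python
# Illustrative TOPSIS ranking sketch: normalization, weighting, ideal
# solutions, separation measures, and closeness scores.
import numpy as np

def topsis(M, W, benefit):
    """Score alternatives (rows of M) against criteria (columns of M).
    benefit[j] is True if larger values of criterion j are better."""
    R = M / np.linalg.norm(M, axis=0)          # vector-normalize columns
    V = W * R                                  # weighted normalized matrix
    A_pos = np.where(benefit, V.max(axis=0), V.min(axis=0))  # ideal A+
    A_neg = np.where(benefit, V.min(axis=0), V.max(axis=0))  # anti-ideal A-
    S_pos = np.linalg.norm(V - A_pos, axis=1)  # separation from A+
    S_neg = np.linalg.norm(V - A_neg, axis=1)  # separation from A-
    return S_neg / (S_pos + S_neg)             # closeness score in [0, 1]

# Hypothetical readings: [accuracy, FP rate, ROC area] for three classifiers.
M = np.array([[0.998, 0.001, 0.999],
              [0.950, 0.040, 0.970],
              [0.900, 0.080, 0.930]])
W = np.array([0.4, 0.3, 0.3])
scores = topsis(M, W, benefit=np.array([True, False, True]))
ranks = scores.argsort()[::-1].argsort() + 1   # rank 1 = highest score
```

Note that the false-positive rate is a cost criterion (smaller is better), so its role is inverted when forming the ideal and anti-ideal solutions; the classifier closest to the ideal and farthest from the anti-ideal receives rank 1.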

Results and Discussion
The presented analysis to reach the best classifier was conducted through a top-to-bottom approach. Firstly, the best classifier group has been identified through intergroup analysis. Secondly, the best performing classifier within that group has been identified through intragroup analysis.

Intergroup Performance Analysis
Under intergroup performance analysis, the authors have calculated the performance of each classifier group as a whole. The group performances for the NSLKDD, ISCXIDS2012, and CICIDS2017 datasets are listed in Tables 12-14, respectively. According to Table 12, decision tree classifiers present reliable results on all performance metrics except training and testing time, consuming 4.18 s for training and 0.03 s for testing. The Bayes group of classifiers, by contrast, responds quickly in training and testing but presents low-quality performance metrics. The ROC and MCC values are suitable for evaluating classifier group performance under class imbalance learning. Therefore, observing the average ROC and MCC of the classifier groups on the NSL-KDD dataset, the authors found that the decision tree group behaves far better than the other classifier groups. A similar observation holds for the ISCXIDS2012 dataset. Table 13 shows the group performance of supervised classifiers for the ISCXIDS2012 dataset. The decision tree classifiers showed the highest average accuracy of 97.3519%, with an average testing time per instance that was low and on par with the Bayes and Miscellaneous classifiers. Moreover, decision tree classifiers were far ahead of their peer classifier groups, with a higher average ROC value of 0.985. The authors also conducted an intergroup performance analysis on CICIDS2017; the average, maximum, and minimum performance readings are outlined in Table 14. The decision tree classifiers reveal impressive accuracy and ROC values of 99.635% and 0.999, respectively.
Furthermore, the decision tree classifiers present consistent performance metrics across all three intrusion detection datasets: NSLKDD, ISCXIDS2012, and CICIDS2017. However, before concluding that decision trees are best for these datasets on the basis of a limited number of parameters, the authors have decided to determine the actual weight and rank of all the classifier groups through TOPSIS. The classifier group with the highest weight and rank will be identified as the best group for these IDS datasets. This strengthens the study's basis for finding the best classifier within the winning classifier group. Figure 4 presents the weights and ranks of the classifier groups for all three IDS datasets. The decision tree group presents the highest performance. Moreover, the decision trees present consistent performance across all the IDS datasets. Therefore, the decision tree can be considered the best approach for the development of reliable IDSs.


Intragroup Performance Analysis
In the intergroup analysis, the authors concluded that decision tree classifiers reveal the best performance for imbalanced IDS datasets. The authors have therefore decided to conduct an intragroup analysis of decision trees for the NSLKDD, ISCXIDS2012, and CICIDS2017 datasets. The intragroup study aims to identify the best decision tree within the decision tree group of classifiers for the concerned datasets. Several performance outcomes of decision tree classifiers for the NSLKDD, ISCXIDS2012, and CICIDS2017 datasets have been analyzed through Figures 5-7. The J48Consolidated classifier shows better accuracy for the NSL-KDD dataset. The NSLKDD sample used here is imbalanced in nature; therefore, the imbalance-aware measures play a significant role in finding the best classifier. Considering the ROC value, ForestPA performs better than J48Consolidated. Additionally, both ForestPA and J48Consolidated show similar performance in terms of the MCC value. Consequently, the authors did not find sufficient grounds for deciding on an ideal decision tree classifier for the NSLKDD dataset.
Furthermore, the decision tree classifiers' performance on a sample of the ISCXIDS2012 dataset is presented in Figure 6. The Functional Trees (FT), J48Consolidated, NBTree, and SysFor classifiers consumed a significant amount of computational time, whereas the rest of the decision trees consumed 0.001 s of testing time per instance. The J48Consolidated algorithm took the longest time to detect an anomalous instance. However, this computational cost is offset by the fact that J48Consolidated provides the highest accuracy of 98.5546%, which corresponds to the lowest misclassification rate of 1.4454%. Moreover, J48Consolidated leads the decision tree group with the best Kappa value (0.9711).
The test results of decision trees on the CICIDS2017 dataset are presented in Figure 7. The J48Consolidated algorithm provides high-quality results on the class-imbalanced instances of the CICIDS2017 dataset. J48Consolidated scores the highest accuracy with a low misclassification rate. However, considering the ROC and MCC values, J48 performs better than J48Consolidated. Therefore, it remains unclear which classifier should be considered as the base learner for a future IDS.
In the case of ISCXIDS2012, J48Consolidated also presents consistent results on all performance measures. However, in the case of NSL-KDD and CICIDS2017, it was not possible to single out the best classifier. Therefore, the authors have also used TOPSIS to allocate weights and ranks to the individual decision tree classifiers. The average weight and rank of the decision tree classifiers across all datasets have also been calculated to find the best classifier overall. The average weight and rank across all the datasets are of limited use in identifying a suitable classifier, because an IDS is designed for a specific dataset or environment. However, they play a relevant role in concluding which classifier is the most versatile in this study. The average ranks and weights of all the classifiers for all three IDS datasets are represented in Figure 8.
The J48Consolidated classifier has the highest rank across all the datasets. Moreover, J48Consolidated presents the highest weight of 0.964 for the ISCXIDS2012 dataset. The J48Consolidated decision tree classifier is best for the highly class-imbalanced NSLKDD, CICIDS2017, and ISCXIDS2012 datasets. Therefore, J48Consolidated will be a suitable base learner for designing IDSs using the NSLKDD, ISCXIDS2012, or CICIDS2017 datasets.

Detailed Performance Reading of All the Classifiers
Tables 15-17 provide a detailed insight into all the supervised classifiers across the six distinct groups. These tables outline thirteen performance metrics. The authors have identified the best classifier group (decision tree) and the best classifier (J48Consolidated). Nevertheless, other classifiers may perform differently on other datasets. Therefore, while designing IDSs, the authors suggest further evaluation of supervised classifiers based on the specific computing and network environment.
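Most of the measures reported in Tables 15-17 derive from the confusion matrix. The sketch below shows how accuracy, misclassification rate, precision, recall, FPR, Kappa, and MCC are computed for a binary case; the counts are hypothetical, not taken from the tables:

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Core performance measures from a binary confusion matrix."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    misclassification = (fp + fn) / n       # equals 1 - accuracy
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # detection rate / TPR
    fpr = fp / (fp + tn)                    # false-positive rate

    # Cohen's Kappa: agreement beyond what chance alone would produce.
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    kappa = (accuracy - p_e) / (1 - p_e)

    # Matthews correlation coefficient: informative under class imbalance.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(accuracy=accuracy, misclassification=misclassification,
                precision=precision, recall=recall, fpr=fpr,
                kappa=kappa, mcc=mcc)

# Hypothetical skewed test set: 990 attacks among 10,000 instances.
m = binary_metrics(tp=950, fp=10, fn=40, tn=9000)
```

Note how accuracy (0.995 here) can look excellent while recall and MCC reveal the cost of the 40 missed attacks, which is why this study reports multiple measures rather than accuracy alone.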

J48Consolidated: A Consolidated Tree Classifier Based on C4.5
J48Consolidated emerged as the best classifier within the decision tree group. Therefore, this section provides an in-depth analysis of J48Consolidated.


Detection Capabilities of J48Consolidated
In this section, the J48Consolidated classifier is analyzed with respect to the attack detection process. The classification threshold and the percentage of detection were taken into consideration while analyzing the attack classes. The attack-wise classification output for the NSLKDD, ISCXIDS2012, and CICIDS2017 datasets is presented in Figures 9-11, respectively. The detection output for the NSLKDD dataset remains consistently good for the DoS, Probe, R2L, U2R, and Normal classes as the detection threshold increases. The U2R attack class shows low false positives, although a few regular instances are misclassified during the classification process. Overall, the J48Consolidated classifier exhibited satisfactory performance on the NSLKDD dataset.
On the binary-class ISCXIDS2012 dataset, J48Consolidated generates some false alarms. However, their number is low compared to the number of correctly classified instances (true positives and true negatives).
Finally, the individual evaluation of J48Consolidated shows effective classification across the six attack groups of the CICIDS2017 dataset. The classifier also differentiates regular instances from attack instances during the classification process.
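The threshold analysis above amounts to sweeping a score cut-off and recording how many attack instances are caught (detection rate) against how many benign instances raise alarms (false-alarm rate). A minimal sketch with hypothetical classifier scores:

```python
def rates(scores, labels, threshold):
    """Detection rate (TPR) and false-alarm rate (FPR) at a score threshold.

    scores : classifier confidence that each instance is an attack
    labels : 1 for attack, 0 for benign
    """
    tp = sum(s >= threshold and l == 1 for s, l in zip(scores, labels))
    fp = sum(s >= threshold and l == 0 for s, l in zip(scores, labels))
    attacks = sum(labels)
    return tp / attacks, fp / (len(labels) - attacks)

# Hypothetical scores for 4 attack and 4 benign instances.
scores = [0.95, 0.90, 0.70, 0.40, 0.60, 0.30, 0.10, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
for t in (0.25, 0.50, 0.75):
    tpr, fpr = rates(scores, labels, t)
    print(f"threshold={t:.2f}  detection={tpr:.2f}  false-alarm={fpr:.2f}")
```

Raising the threshold trades missed attacks for fewer false alarms; the per-class curves in Figures 9-11 summarize exactly this trade-off for each attack group.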

Classification Output of J48Consolidated
Each of the three IDS datasets reflects a specific environment, and the correlation of attributes, attacks, and benign instances varies from dataset to dataset. Therefore, J48Consolidated shows different classification performance on the different IDS datasets. The classification output of J48Consolidated for the NSLKDD, ISCXIDS2012, and CICIDS2017 datasets is outlined in Figures 12-14, respectively. Figure 12 shows that the J48Consolidated classifier provides reliable classification on the NSLKDD dataset. Nevertheless, J48Consolidated also produced false alarms for positive and negative instances. Therefore, the authors recommend incorporating filter components such as data standardization and effective feature selection while designing IDSs using J48Consolidated. A filter component not only smooths the underlying data but will also improve classification performance.
For the ISCXIDS2012 dataset, J48Consolidated showed a dramatic improvement in classification, producing very few false alarms and successfully detecting almost all the instances of this binary dataset. Accordingly, the classifier achieved the highest TOPSIS score of 0.964 (Figure 8), contributing to its highest average rank.
Finally, for the CICIDS2017 dataset, the J48Consolidated classifier presented a low number of false alarms. The six attack groups of the CICIDS2017 dataset were classified consistently, with a detection accuracy of 99.868% (Table 17) and a low false-positive rate of 0.000011. A reliable IDS benchmark dataset must fulfill 11 criteria [122], namely complete network configuration, attack diversity, overall traffic, thorough interaction, labeled dataset, full capture, existing protocols, heterogeneity, feature set, anonymity, and metadata. The CICIDS2017 [124] dataset fulfills these criteria. Furthermore, CICIDS2017 is recent and focuses on the latest attack scenarios. The J48Consolidated classifier presented the best results for the CICIDS2017 dataset, with an accuracy of 99.868%. Consequently, J48Consolidated can be assumed to form an effective IDS with the CICIDS2017 dataset. Nevertheless, the authors recommend incorporating feature selection procedures at the preprocessing stage to extract the most relevant features of the dataset and improve system performance.
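The recommended preprocessing can be prototyped as a standardization plus feature-selection pipeline placed in front of the classifier. Since J48Consolidated is a Weka implementation, the sketch below uses scikit-learn's CART decision tree as a stand-in, trained on synthetic class-imbalanced data; all parameters (feature count, `k`, class weights) are illustrative assumptions, not values from this study:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an IDS dataset: 2000 flows, 40 features,
# heavy class imbalance (~95% benign / ~5% attack).
X, y = make_classification(n_samples=2000, n_features=40, n_informative=8,
                           weights=[0.95], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),                        # data standardization
    ("select", SelectKBest(mutual_info_classif, k=10)), # keep 10 best features
    ("tree", DecisionTreeClassifier(random_state=42)),  # CART as a stand-in
])
pipeline.fit(X_tr, y_tr)
print(f"held-out accuracy: {pipeline.score(X_te, y_te):.3f}")
```

Fitting the scaler and selector inside the pipeline ensures they learn only from training folds, avoiding the leakage that arises when preprocessing is fitted on the full dataset.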


Conclusions
This paper analyzed fifty-four widely used classifiers spanning six different groups. These classifiers were evaluated on the three most popular intrusion detection datasets, i.e., NSLKDD, ISCXIDS2012, and CICIDS2017. The authors extracted a sufficient number of random samples from these datasets, retaining the class imbalance property of the original datasets. Multi-criteria decision-making was then used to allocate a weight to each classifier for each dataset, and the classifiers were ranked using those weights. First, an intergroup analysis was conducted to find the best classifier group. Second, an intragroup analysis of the best classifier group was performed to find the best classifier for the intrusion detection datasets. The authors analyzed thirteen performance metrics, so the best classifier was selected impartially. The intergroup analysis identified the decision tree group as the best classifier group, followed by the rule-based classifiers, whereas the intragroup analysis identified J48Consolidated as the best classifier for the highly class-imbalanced NSLKDD, CICIDS2017, and ISCXIDS2012 datasets. The J48Consolidated classifier provided the highest accuracy of 99.868%, a misclassification rate of 0.1319%, and a Kappa value of 0.998.
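The imbalance-preserving sampling described above can be sketched as a per-class (stratified) draw; the sketch below is stdlib-only, and the class mix is hypothetical, merely mimicking an NSLKDD-like skew:

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, labels, fraction, seed=0):
    """Draw a random subset that preserves the original class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec, lab in zip(records, labels):
        by_class[lab].append(rec)
    sample = []
    for lab, items in by_class.items():
        k = max(1, round(len(items) * fraction))  # keep rare classes non-empty
        sample.extend((rec, lab) for rec in rng.sample(items, k))
    rng.shuffle(sample)
    return sample

# Hypothetical class mix with a rare U2R class, as in NSLKDD.
labels = ["Normal"] * 6000 + ["DoS"] * 3000 + ["Probe"] * 800 + ["U2R"] * 20
records = list(range(len(labels)))
subset = stratified_sample(records, labels, fraction=0.1)
print(Counter(lab for _, lab in subset))  # proportions preserved: 600/300/80/2
```

Sampling each class separately, rather than the dataset as a whole, guarantees that minority attack classes such as U2R survive the reduction in the same proportion as in the original data.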
This study presented an in-depth analysis that provides numerous outcomes for IDS designers. Comparing fifty-four classifiers on intrusion detection datasets through thirteen performance metrics and ranking them is the main contribution of this article. Nevertheless, the present study has limitations. Further investigation is required considering other datasets and other specific application domains. Moreover, the number of classes, class-wise performance, and classifier performance under varying sample sizes should be studied to understand the detailed behavior of the classifiers. The scalability and robustness of the classifiers were not tested. As future work, many other IDS datasets can be used to ascertain the performance of the classifiers, and recent ranking algorithms can be used as a voting principle to obtain exact classifier ranks. Several recent rule-based and decision forest classifiers were not covered in this article; those classifiers can be analyzed to understand the real performance of the classifiers and classifier groups. Finally, J48Consolidated, which emerged as an ideal classifier from this analysis, can be used along with a suitable feature selection technique to design robust intrusion detection systems.

Data Availability Statement: Publicly available datasets were analyzed in this study. These data can be found here: NSL-KDD-https://www.unb.ca/cic/datasets/nsl.html (accessed on 1 February 2021), ISCXIDS2012-https://www.unb.ca/cic/datasets/ids.html (accessed on 1 February 2021), CICIDS2017-https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 1 February 2021).

Conflicts of Interest:
The authors declare no conflict of interest.