Ensemble Classifiers for Network Intrusion Detection Using a Novel Network Attack Dataset

Abstract: Due to the extensive use of computer networks, new risks have arisen, and improving the speed and accuracy of security mechanisms has become a critical need. Although new security tools have been developed, the fast growth of malicious activities continues to be a pressing issue that creates severe threats to network security. Classical security tools such as firewalls are used as a first-line defense against security problems. However, firewalls do not entirely or perfectly eliminate intrusions. Thus, network administrators rely heavily on intrusion detection systems (IDSs) to detect such network intrusion activities. Machine learning (ML) is a practical approach to intrusion detection that, based on data, learns how to differentiate between abnormal and regular traffic. This paper provides a comprehensive analysis of some existing ML classifiers for identifying intrusions in network traffic. It also produces a new reliable dataset called GTCS (Game Theory and Cyber Security) that matches real-world criteria and can be used to assess the performance of the ML classifiers in a detailed experimental evaluation. Finally, the paper proposes an ensemble and adaptive classifier model composed of multiple classifiers with different learning paradigms to address the issue of the accuracy and false alarm rate in IDSs. Our classifiers show high precision and recall rates and use a comprehensive set of features compared to previous work.


Introduction
Cyberattacks have become more widespread as intruders take advantage of system vulnerabilities for intellectual property theft, financial gain, or even the destruction of entire network infrastructures [1]. In the past few months, the Federal Bureau of Investigation released a high-impact cybersecurity warning in response to the increasing number of attacks on government targets. Government officials have warned major cities that such hacks are a disturbing trend that is likely to continue. The window for detecting a security breach can be measured in days, as attackers are aware of existing security controls and are continually improving their attacks. In many cases, a security breach is inevitable, which makes early detection and mitigation the best plan for surviving an attack.
Security professionals use different prevention and detection techniques to reduce the risk of security breaches. Prevention techniques such as applying complex configurations and establishing a strong security policy aim to make it more difficult to carry out such attacks. All security policies should maintain the three principles of the CIA triad: confidentiality, integrity, and availability.

• It presents a newly generated IDS dataset called GTCS (Game Theory and Cyber Security) that overcomes most shortcomings of existing datasets and covers most of the necessary criteria for common updated attacks, such as botnet, brute force, distributed denial of service (DDoS), and infiltration attacks. The generated dataset is completely labeled, and about 84 network traffic features have been extracted and calculated for all benign and intrusive flows.
• It analyzes the GTCS dataset using Weka (open source software that provides tools for data preprocessing and implementations of several ML algorithms) to select the best feature sets for detecting different attacks, and it provides a comprehensive analysis of six of the most well-known ML classifiers for identifying intrusions in network traffic. Specifically, it analyzes the classifiers in terms of accuracy, true positive rates, and false positive rates.
• It proposes an adaptive ensemble learning model that integrates the advantages of different ML classifiers for different types of attacks to achieve optimal results through ensemble learning. The advantage of ensemble learning is its ability to combine the predictions of several base estimators to improve generalizability and robustness over a single estimator.
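The three evaluation measures named in the contributions above (accuracy, true positive rate, and false positive rate) can be illustrated with a minimal sketch. The function name `binary_metrics`, the label strings, and the toy predictions are assumptions made for illustration; they are not taken from the GTCS experiments or from Weka.

```python
def binary_metrics(y_true, y_pred, positive="attack"):
    """Compute accuracy, true positive rate, and false positive rate.

    `positive` is the label treated as an intrusion; every other label
    is treated as benign traffic (an assumption of this sketch).
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # attack correctly flagged
            else:
                fp += 1  # benign flow flagged as attack (false alarm)
        else:
            if truth == positive:
                fn += 1  # attack missed
            else:
                tn += 1  # benign flow correctly passed
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Toy labels, invented for this example.
y_true = ["attack", "attack", "benign", "benign", "benign", "attack"]
y_pred = ["attack", "benign", "benign", "attack", "benign", "attack"]
metrics = binary_metrics(y_true, y_pred)
```

In this toy case, one missed attack lowers the true positive rate and one false alarm raises the false positive rate, which is why the two rates are reported alongside overall accuracy.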

Background
In this section, we introduce the concepts and technologies on which our problem and methodology are based to illuminate the motivation for this work. This includes: (1) a brief overview of IDSs and their limitations; (2) ML, hyperparameter optimization, and ensemble learning; and (3) datasets, problems related to evaluative datasets, and data processing considerations.

Intrusion Detection Systems
An intrusion is a malicious activity that aims to compromise the confidentiality, integrity, and availability of network components in an attempt to disrupt the security policy of a network [11]. The National Institute of Standards and Technology (NIST) defines the intrusion detection process as "the process of monitoring the events occurring in a computer system or network and analyzing them for signs of intrusions, defined as attempts to compromise the confidentiality, integrity, availability, or to bypass the security mechanisms of a computer or network" [12]. An IDS is a tool that scans network traffic for any harmful activity or policy breaches. It is a system for monitoring network traffic for malicious activities and alerting network administrators to such abnormal activities [13]. IDSs achieve this by gathering data from several systems and network sources and analyzing those data for possible threats.
Unlike firewalls, which are used at the perimeter of the network and play the role of gatekeeper by monitoring incoming network traffic and determining whether it can be allowed into the network or endpoint at all, IDSs monitor internal network traffic and mark suspicious and malicious activities. Consequently, an IDS can identify not only attacks that pass the firewall but also attacks that originate from within the network.

Limitations of Intrusion Detection Systems
Although IDSs are considered a key component of computer network security, they have some limitations that should be noted before deploying intrusion detection products [14]. Some of these limitations include:
• Most IDSs generate a high false positive rate, which wastes the time of network administrators and in some cases causes damaging automated responses.
• Although most IDSs are marketed as real-time systems, it may in fact take them some time to automatically report an attack.
• IDSs' automated responses are sometimes inefficient against advanced attacks.
• Many IDSs lack user-friendly interfaces that allow users to operate them.
• To obtain the maximum benefit from a deployed IDS, skilled IT security staff are needed to monitor IDS operations and respond as needed.
• Numerous IDSs are not failsafe, as they may not be well protected from attacks or destruction.

Why Do We Need Machine Learning?
There are a number of reasons for researchers to consider the use of ML in network intrusion detection. One such reason is the ability of ML to find similarities within a large amount of data. The main assumption of ML-based approaches is that an intrusion creates distinguishable patterns within the network traffic and that these patterns can be efficiently detected using ML approaches [15]. These approaches promise automated data-driven detection that infers information about malicious network traffic from the vast amount of available traffic traces. Furthermore, ML can be used to discover anomalies in data without the need for prior knowledge about those data. By combining the characteristics of ML techniques and capable computing units, we can create a powerful weapon with which to respond to network intrusion threats. Moreover, we have a great deal of data nowadays, but human expertise is limited and expensive. Therefore, using ML, we can automatically discover patterns which humans may not be able to recognize due to the scale of the data. In the absence of ML, human experts would need to define manually crafted rules, which would not be scalable.

Machine Learning
ML is the field of computer science that studies algorithms that automatically improve through training and experience, enabling a computer to make accurate predictions when fed data without being explicitly programmed [16]. ML algorithms use a subset of a larger dataset, known as training data or sample data, to build a mathematical model to make predictions or decisions based on a given problem. ML is applied in a wide variety of domains (including spam filtering, Internet search engines, recommendation systems, voice recognition, and computer vision) where conventional algorithms cannot accomplish the required tasks. Similarly, in the field of cybersecurity, several ML methods have been proposed to monitor and analyze network traffic to recognize different anomalies. Most of these methods identify anomalies by looking for deviations from a basic regular traffic model. Usually, these models are trained on a set of attack-free traffic data that are collected over a long time period.

Hyperparameter Optimization
Model parameters are the set of configuration variables that are internal to a model and can be learned from the historical training data. The value of these parameters is estimated from the input data. Model hyperparameters are the set of configuration variables external to the model. They are the properties that govern the entire training process and cannot be directly learned from the input data. A model's parameters specify how input data are transformed into the desired output, while the hyperparameters define the structure of the model.
Hyperparameter optimization (also known as hyperparameter tuning) is the process of finding the optimal hyperparameters for a learning algorithm in ML. A single ML model may need different hyperparameter settings to generalize to different data patterns. These settings form the hyperparameter set, which should be tuned so that the ML model solves the assigned problem as well as possible. The optimization process searches for the hyperparameter tuple that yields a model minimizing a predefined loss function on the given data. The objective function takes a hyperparameter tuple and returns the associated loss [17]. The generalization performance is often estimated using cross-validation [18]. Hyperparameter optimization techniques typically use one of several search strategies: grid search, random search, or Bayesian optimization.
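As a rough illustration of grid search with a cross-validated objective, the sketch below tunes the neighborhood size `k` of a tiny one-dimensional nearest-neighbor classifier. The classifier, the toy data, and the names `knn_predict`, `cv_loss`, and `grid_search` are all assumptions made for this example; the experiments in this paper were run in Weka, not with this code.

```python
from itertools import product
from statistics import mean


def knn_predict(train, x, k):
    """Majority label (0 or 1) among the k training points nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0


def cv_loss(data, k, folds=3):
    """Objective function: mean misclassification rate over the folds."""
    losses = []
    for f in range(folds):
        test = data[f::folds]  # every `folds`-th row is held out
        train = [row for i, row in enumerate(data) if i % folds != f]
        errors = sum(knn_predict(train, x, k) != y for x, y in test)
        losses.append(errors / len(test))
    return mean(losses)


def grid_search(data, grid):
    """Try every hyperparameter tuple in the grid; keep the lowest loss."""
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        loss = cv_loss(data, **params)
        if best is None or loss < best[1]:
            best = (params, loss)
    return best


# Toy data: (feature value, class label), invented for illustration.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.35, 0), (0.4, 0), (0.45, 0),
        (0.6, 1), (0.65, 1), (0.7, 1), (0.8, 1), (0.9, 1), (0.95, 1)]
best_params, best_loss = grid_search(data, {"k": [1, 3, 5]})
```

Random search would replace the exhaustive `product` loop with sampled tuples, and Bayesian optimization would choose the next tuple based on the losses observed so far; the objective function stays the same in all three cases.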

Ensemble Learning
Ensemble approaches utilize collections of ML algorithms to produce higher predictive performance than could be obtained from a single ML classifier [19]. The main idea of an ensemble approach is to combine several ML algorithms to exploit the strengths of each employed algorithm to obtain a more powerful classifier. Ensemble approaches are particularly helpful if the problem can be split into subproblems such that each subproblem can be assigned to one module of the ensemble. Depending on the structure of the ensemble approach, each module can include one or more ML algorithms. In the world of network attacks, since the signatures of different attacks are quite distinct from one another, it is normal to have different sets of features as well as different ML algorithms to detect different types of attacks. It is thus obvious that a single IDS cannot cover all types of input data or identify different types of attacks [20,21]. Many researchers have shown that classification problems can be solved with high accuracy when using ensemble models instead of single classifiers [22].
Ensemble models can be thought of as using various approaches to solve a specific problem. This resembles the process according to which patients diagnosed with a dangerous medical condition, such as a tumor, usually see more than one doctor to solicit different opinions on their case. This is a kind of cross-validation that increases the probability of receiving an accurate diagnosis. Likewise, the outputs of various ML classifiers in an intrusion detection problem can be combined to enhance the accuracy of the overall IDS. The main challenge in using ensemble approaches is how to choose the best set of classifiers to constitute the ensemble model and which decision function to use to combine the results of those algorithms [23]. Bagging, boosting, and stacking are the main three methods used to combine ML algorithms in ensemble models [24].
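The "second opinion" idea can be sketched with the simplest decision function, plain majority voting. This is only an illustrative assumption: it is not necessarily the combination rule used by the model proposed in this paper, and the three prediction lists below are invented so that each hypothetical classifier errs on a different flow.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    `predictions` holds one list of labels per base classifier, all of the
    same length. With an odd number of voters and two classes no tie can
    occur, which is the only case this sketch is meant to cover.
    """
    combined = []
    for sample_votes in zip(*predictions):
        label, _count = Counter(sample_votes).most_common(1)[0]
        combined.append(label)
    return combined


# Ground truth and three hypothetical base classifiers, each wrong once.
y_true = ["attack", "attack", "benign", "benign", "attack"]
clf_a = ["attack", "attack", "benign", "attack", "attack"]  # wrong on flow 3
clf_b = ["attack", "benign", "benign", "benign", "attack"]  # wrong on flow 1
clf_c = ["benign", "attack", "benign", "benign", "attack"]  # wrong on flow 0

ensemble = majority_vote([clf_a, clf_b, clf_c])
```

Because the individual errors fall on different flows, the vote corrects all of them here. When base classifiers make the same mistakes, voting gives little benefit, which is why the choice of classifiers matters as much as the decision function.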

Datasets
One of the main challenges that researchers face in the area of intrusion detection is finding a suitable dataset that they can use to train and evaluate their proposed IDSs [25]. Although there exists a group of datasets used by researchers to train, test, and evaluate their IDS approaches, as mentioned in Section 1, most of those datasets are outdated, suffer from a lack of attack diversity, and do not reflect current trends and traffic variety. In this section, we first discuss general security issues related to finding a suitable dataset for evaluating IDSs. Next, we analyze and evaluate some of the most widely used datasets and identify deficiencies that indicate the need for a comprehensive and reliable dataset.

Problems with Evaluative Datasets
Researchers have reported many security issues regarding the datasets used to evaluate IDSs [26,27]. Some of these issues are as follows:
• Data privacy issues and security policies may prevent corporate entities from sharing realistic data with users and the research community.
• Getting permission from a dataset's owner is frequently delayed. Moreover, it usually requires the researcher to agree to an acceptable use policy (AUP) that includes limitations on the time of usage and the data that can be published about the dataset.
• The limited scope of most datasets does not fit various network intrusion detection researchers' aims and objectives.
• Most of the available datasets in the IDS field suffer from a lack of proper documentation describing the network environment, simulated attacks, and dataset limitations.
• Many of the accessible datasets were labeled manually.
To overcome these problems, we present a new dataset that is automatically and completely labeled. The dataset and its proper documentation are publicly available to researchers and do not require any AUP to use.

Data Preprocessing
Data preprocessing is the first step applied to collected data before they are used by any ML model. It transforms the raw data into a structure that the ML model can handle and also helps improve the quality of the model.

Normalization
Normalization is a preprocessing method that is usually applied as part of data preparation for ML models. The goal of normalization is to adjust the values of numeric data in a dataset to a range (usually 0-1) using a common scale without distorting differences in the ranges of actual values or losing information. If a dataset contains a column with values ranging from 0 to 1 and a second column with values ranging from 100,000 to 1,000,000, the enormous variance in scale among the numbers in the two columns could cause problems when attempting to combine the values as features during modeling. Normalization can help bypass these problems by creating new values that maintain the general distribution and ratios of the source data while keeping values within a scale applied across all numeric columns used in the model.
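The two-column example above can be worked through with min-max scaling, one common way to map values into the range 0-1. The function name and the sample values are assumptions for illustration; the paper does not state which normalization formula was applied.

```python
def min_max_normalize(column):
    """Rescale a numeric column linearly so its values fall in [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no spread; map it to 0.0 throughout.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


# One column already in 0-1 and one in 100,000-1,000,000 (toy values).
small = [0.0, 0.25, 1.0]
large = [100_000, 550_000, 1_000_000]

small_scaled = min_max_normalize(small)
large_scaled = min_max_normalize(large)
```

After scaling, both columns share the same range, and the relative spacing of the values inside each column is preserved.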

One-Hot Encoding
Most ML models require all input and output data to be in the form of numeric values. This means that if the dataset includes categorical data, it must be encoded as numbers before it can be used by ML models. One-hot encoding is the process of converting categorical data into a form that ML algorithms can use to improve prediction [28,29].
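A minimal sketch of one-hot encoding follows, assuming a categorical protocol column; the column values and the function name `one_hot_encode` are chosen for illustration only and are not taken from the GTCS feature list.

```python
def one_hot_encode(values):
    """Map each category to a binary vector with a single 1.

    Returns the sorted category list (the column order) and one encoded
    row per input value.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded


protocols = ["tcp", "udp", "icmp", "tcp"]
categories, encoded = one_hot_encode(protocols)
```

Each category gets its own column, so no artificial ordering is introduced between, say, `tcp` and `udp`, as would happen if the categories were simply numbered 0, 1, 2.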

Feature Selection
Feature selection (also called attribute selection) is the process of selecting attributes of the dataset that are most relevant to the predictive modeling problem on which one is working [30]. The definition of relevance varies from method to method. Based on its understanding of significance, a feature selection technique mathematically formulates a criterion to evaluate a set of features generated by a scheme that searches over the feature space. [31] defined two degrees of relevance: strong and weak. A feature s is strongly relevant if its removal deteriorates the performance of a classifier. A feature s is weakly relevant if it is not strongly relevant and the removal of a subset of features containing s deteriorates the performance of the classifier. A feature is irrelevant if it is neither strongly nor weakly relevant. Feature selection techniques assist in creating an accurate predictive model by choosing features that will provide better accuracy and less complexity while requiring fewer data.
Feature selection techniques are generally divided into three categories: filter, wrapper, and embedded [32]. The filter method operates without engaging any information about the induction algorithm. Using some prior knowledge (for example, that a feature should be strongly correlated with the target class or that features should be uncorrelated with one another), the filter method selects the best subset of features by measuring the statistical properties of the subset to be evaluated. Alternatively, the wrapper method employs a predetermined induction algorithm to find the subset of features with the highest evaluation by searching through the space of feature subsets and evaluating the quality of the selected features. The process of feature selection "wraps around" an induction algorithm. Since the wrapper approach includes a specific induction algorithm to optimize feature selection, it often provides better classification accuracy than the filter approach. However, the wrapper method is more time-consuming than the filter method because it is strongly coupled with an induction algorithm and repeatedly calls the algorithm to evaluate the performance of each subset of features. It thus becomes impractical to apply a wrapper method to select features from a large dataset that contains numerous features and instances [33]. Furthermore, the wrapper approach must be re-executed to select features whenever its induction algorithm is replaced with a different one. Some researchers [34-36] have used a hybrid feature selection method. In supervised ML, an induction algorithm is typically presented with a set of training instances wherein each instance is described by a vector of feature (or attribute) values and a class label.
For example, in medical diagnosis problems, features might include a patient's age, weight, and blood pressure, and the class label might indicate whether a physician had determined that the patient was suffering from heart disease. The task of the induction algorithm is to induce a classifier that will be useful in classifying future cases. The classifier is a mapping from the space of feature values to the set of class values. More information about the wrapper methods and inductive algorithm can be found in [37]. Finally, the embedded approaches include feature selection in the training process, thus reducing computational costs due to the classification process needed for each subset [38].
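The filter approach described above can be sketched by ranking features on the absolute Pearson correlation between each feature and the class label, without calling any induction algorithm. The feature names and values are hypothetical, and correlation is only one possible filter criterion; other criteria, such as information gain, are used in the studies surveyed in this paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def filter_select(features, labels, top_k):
    """Score each feature independently and keep the top_k by |correlation|."""
    scores = {name: abs(pearson(column, labels))
              for name, column in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


# Hypothetical flow features; labels: 1 = attack, 0 = benign.
features = {
    "flow_duration": [1, 2, 8, 9, 1, 9],
    "packet_count": [5, 3, 4, 5, 4, 3],
    "constant_flag": [1, 1, 1, 1, 1, 1],
}
labels = [0, 0, 1, 1, 0, 1]
selected, scores = filter_select(features, labels, top_k=1)
```

A wrapper method would instead train and evaluate a classifier on each candidate subset, which is why it tends to be more accurate but far more expensive than this single statistical pass.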

Dataset Imbalance
Imbalanced datasets are a common problem in ML classification, where there exists a disproportionate ratio of examples in each class. This problem is especially relevant in the context of supervised ML, which involves two or more classes. The imbalance means that the distribution of the dataset examples across the recognized classes is biased or skewed. This distribution may vary from a trivial bias to a significant imbalance where there are only a few examples in the minority class and many examples in the majority class or classes.
An imbalanced dataset poses a challenge for predictive modeling, since the majority of ML classification algorithms are designed based on the hypothesis that there are almost an equal number of examples for each class. This results in an inaccurate prediction model with poor performance, particularly for the minority class. Therefore, a balanced dataset is essential for creating a good prediction model [39].
Real-world data are usually imbalanced, which may be one of the main causes of the decrease in ML algorithms' generalization. If the imbalance is heavy, it will be difficult to develop efficient classifiers with conventional learning algorithms. For many class-imbalanced datasets, the cost of misclassifying the minority class is greater than that of misclassifying the majority class. This is particularly true in the IDS domain, where malicious traffic tends to be the minority class. Consequently, there is a need for sampling methods that can handle imbalanced datasets.
Sampling techniques can be used to overcome the dataset imbalance dilemma by either excluding some data from the majority class (known as under-sampling) or by adding artificially generated data to the minority class (known as oversampling) [40]. Oversampling methods boost the number of samples in the minority class in the training dataset. The main advantage of such methods is that no data from the original training dataset are lost, as all samples from the minority and majority classes are kept. However, the drawback is that the size of the training dataset is significantly increased. Random oversampling is the simplest oversampling method, in which randomly chosen samples from the minority class are duplicated and added to the new dataset [41]. The synthetic minority oversampling technique (SMOTE) is another oversampling method, proposed by Chawla et al. [42], wherein synthetic data are created and added to the minority class rather than simply duplicating existing examples. SMOTE blindly generates synthetic data without studying the majority class data, which may lead to an overgeneralization problem [43]. Under-sampling methods, on the other hand, are used to decrease the number of examples in the majority class. They reduce the size of the majority class data to balance the class distribution in the dataset. Random under-sampling is an example of an under-sampling method that randomly selects a subset of majority class samples and merges them with the minority class samples, generating a new balanced dataset [44]. However, under-sampling a dataset by reducing the majority class results in a loss of data and overly general rules.
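The two random sampling strategies described above can be sketched as follows for a two-class problem. The helper names, the fixed seed, and the toy data are assumptions for illustration; SMOTE is not implemented here because it requires interpolating between neighboring minority samples.

```python
import random


def split_by_class(samples, labels, minority):
    """Separate minority rows from (row, label) pairs of all other classes."""
    minority_rows = [s for s, y in zip(samples, labels) if y == minority]
    majority_rows = [(s, y) for s, y in zip(samples, labels) if y != minority]
    return minority_rows, majority_rows


def random_oversample(samples, labels, minority, seed=0):
    """Duplicate random minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority_rows, majority_rows = split_by_class(samples, labels, minority)
    deficit = max(0, len(majority_rows) - len(minority_rows))
    extra = [rng.choice(minority_rows) for _ in range(deficit)]
    return samples + extra, labels + [minority] * len(extra)


def random_undersample(samples, labels, minority, seed=0):
    """Keep all minority rows and an equally sized random majority subset."""
    rng = random.Random(seed)
    minority_rows, majority_rows = split_by_class(samples, labels, minority)
    keep = min(len(minority_rows), len(majority_rows))
    kept = rng.sample(majority_rows, keep)
    new_samples = minority_rows + [s for s, _ in kept]
    new_labels = [minority] * len(minority_rows) + [y for _, y in kept]
    return new_samples, new_labels


# Toy imbalanced data: 8 benign flows and 2 attack flows.
samples = [[i] for i in range(10)]
labels = ["benign"] * 8 + ["attack"] * 2

over_x, over_y = random_oversample(samples, labels, "attack")
under_x, under_y = random_undersample(samples, labels, "attack")
```

Oversampling keeps every original row but grows the dataset from 10 to 16 rows, while under-sampling shrinks it to 4 rows and discards six benign flows, which mirrors the trade-off discussed above.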

Literature Survey and Related Work
Anomaly-based IDS was first introduced by Anderson in 1980 [45]. Since then, the topic of anomaly detection has been the subject of many surveys and review articles. Various researchers have applied ML algorithms and used different publicly available datasets for their research in order to achieve better detection results [46]. Hodo et al. [47] reviewed ML algorithms and their performance in terms of anomaly detection and discussed and explained the role of feature selection in ML-based IDSs. Chandola et al. [48] presented a structured review of the research on anomaly detection in different research areas and application fields, including network IDSs. G. Meera Gandhi [49] used the DARPA-Lincoln dataset to evaluate and compare the performance of four supervised ML classifiers in detecting four categories of attacks: denial of service (DoS), remote-to-local, probe, and user-to-root. Their results showed that the J48 classifier outperformed the other three classifiers (IBK, MLP, and naïve Bayes [NB]) in prediction accuracy. In [50], Nguyen et al. conducted an empirical study to evaluate a comprehensive set of ML classifiers on the KDD99 dataset to detect attacks from the four attack classes. Abdeljalil et al. [51] tested the performance of three ML classifiers (J48, NN, and support vector machine [SVM]) using the KDD99 dataset and found that the J48 algorithm outperformed the other two algorithms. L. Dhanabal et al. [46] analyzed the NSL-KDD dataset and used it to measure the effectiveness of ML classifiers in detecting anomalies in network traffic patterns. In their experiment, 20% of the NSL-KDD dataset was used to compare the accuracy of three classifiers. Their results showed that when correlation-based feature selection was used for dimensionality reduction, J48 outperformed SVM and NB in terms of accuracy. In [52], Belavagi et al. checked the performance of four supervised ML classifiers (SVM, random forest [RF], logistic regression [LR], and NB) on intrusion detection over the NSL-KDD dataset. The results showed that the RF classifier had 99% accuracy.
In terms of feature selection, Hota et al. [53] utilized different feature selection techniques to remove irrelevant features in a proposed IDS model. Their experiment indicated that the highest accuracy result could be achieved with only 17 features from the NSL-KDD dataset by using the C4.5 algorithm along with information gain. In [54], Khammassi et al. applied a wrapper approach based on a genetic algorithm as a search strategy and logistic regression as a learning algorithm to select the best subset of features of the KDD99 and UNSW-NB15 datasets. They used three different DT classifiers to measure and compare the performance of the selected subsets of features. Their results showed that they could achieve a high detection rate with only 18 features for KDD CUP 99 and 20 features for UNSW-NB15. Abdullah et al. [55] also proposed an IDS framework with selection of features within the NSL-KDD dataset that were based on dividing the input dataset into different subsets and combining them using the information gain filter. The optimal set of features was then generated by adding the list of features obtained for each attack. Their experimental results showed that, with fewer features, the researchers could improve system accuracy while decreasing complexity.
Besides selecting the relevant features that can represent intrusion patterns, the choice of ML classifier can also lead to better accuracy. Moreover, the literature suggests that assembling multiple classifiers can reduce false positives and produce more accurate classification results than single classifiers [56]. Gaikwad et al. [57] used REPTree as a base classifier and proposed a bagging ensemble method that provides higher classification accuracy and lowers false positives for the NSL-KDD dataset. Jabbar et al. [58] suggested a cluster-based ensemble IDS model based on the ADTree and KNN algorithms. The experimental results showed that their model outperformed most existing classifiers in accuracy and detection rates. Similarly, Paulauskas et al. [59] used an ensemble model of four different base classifiers to build a stronger learner and showed that the ensemble model produced more accurate results for an IDS.
Generally speaking, previous studies have mainly focused on comparing different ML algorithms and selecting those with the best accuracy results to improve the overall detection effect. The main optimization methods are feature selection and ensemble learning. However, there is still room to improve the results of these studies. Unlike the above studies, this paper concentrates on evaluating and comparing the performance of a group of well-known supervised ML classifiers over the full NSL-KDD dataset for intrusion detection along the following dimensions: feature selection, sensitivity to hyperparameter tuning, and class imbalance. Moreover, most of the datasets used by researchers to evaluate the performance of their proposed intrusion detection approaches are out-of-date and unreliable. Some of these datasets suffer from a lack of traffic diversity and volume or do not cover a variety of attacks, while others anonymize packet information and payload (and therefore cannot reflect current trends) or lack feature sets and metadata. This paper produces a reliable dataset that contains benign network flows and flows from four common attack types and that meets real-world criteria. This study evaluates the performance of a comprehensive set of network traffic features and ML algorithms to indicate the best set of features for detecting certain attack categories.

GTCS Dataset Collection
In this section, we describe: (1) how our novel dataset was collected and stored, (2) the dataset collection testbed implementation and key design decisions, and (3) dataset features selection and extraction. Finally, we provide a dataset statistical summary to show the quality of the data collected.

Lab Setup
To generate the new dataset, benign network traffic along with malicious traffic were extracted, labeled, and stored. To mimic typical network traffic flow, we designed a complete testbed (Figure 1) composed of several normal and attacking virtual machines (VMs) that were distributed between two separate networks. The victim network consisted of a set of VMs running different versions of the most common operating systems, namely Windows Server and/or PC, Linux, and Android. The attack network was a completely separate infrastructure containing Kali 1.1 and Kali 2.0 VMs.
To generate a large amount of realistic benign traffic, we used Ostinato [60], a flexible packet generator tool that generates normal traffic with given IPs and ports. Malicious traffic was generated using Kali Linux [61], an enterprise-ready security-auditing Linux distribution based on Debian GNU/Linux [62]. The process of generating both benign and malicious traffic took eight days. All of the attacks considered for the experiments in this paper were new. In our attack scenarios, four of the most common and up-to-date attack families were considered and are briefly described below.

• Botnet attacks [63]: This attack type can be defined as a group of compromised network systems and devices that execute different harmful network attacks, such as sending spam, granting backdoor access to compromised systems, stealing information via keyloggers, performing phishing attacks, and so on.
• Brute force attacks [64]: This refers to a well-known network attack family in which intruders try every key combination in an attempt to guess passwords or use fuzzing methods to obtain unauthorized access to certain hidden webpages (e.g., an admin login page).
• DDoS attacks [65]: DoS attacks are a very popular type of network attack in which an attacker sends an overwhelming number of false requests to a target service or network in order to prevent legitimate users from accessing that service. DDoS attacks are a modern form of DoS attack wherein attackers use thousands of compromised systems to flood the bandwidth or resources of the target system.
• Infiltration attacks [66]: These attacks are usually executed from inside the compromised system by exploiting vulnerabilities in software applications such as Internet browsers, Adobe Acrobat Reader, and the like.

Data Collection and Feature Extraction
We implemented four attack scenarios: botnet, brute force, DDoS, and infiltration. For each attack, we defined a scenario based on the implemented network topology and executed the attacks using Kali Linux machines located in a separate network from the target machines. The attacking machines were Kali 1.1 and Kali 2.0. The Graphical Network Simulator-3 (GNS3) and Emulated Virtual Experience (EVE) cloud components are network software emulators that allow a combination of virtual and real devices to simulate complex networks.
To perform the botnet attack, we used Zbot [67], a trojan horse malware package that runs on Microsoft Windows operating systems and can execute many malicious tasks. Zbot spreads mainly through drive-by downloads and phishing schemes. Because it uses stealth techniques to hide from security tools, Zbot has become the largest botnet on the Internet [68]. The victim machines in our botnet attack scenario were running Windows 7 and Windows 8. It is important to note that firewalls, Windows Defender, and automatic updating were disabled on all Windows victim machines to enable a wide spectrum of interesting cases to be captured. In addition, password complexity checks were not active, and all passwords were set to a minimum of three characters.
For the brute force attack, we used the FTP module on the Kali 2.0 Linux machine to attack the machine in the victim network running Ubuntu 16.4.
To carry out the DDoS attack, we used the High Orbit Ion Cannon (HOIC) tool [69], a popular free tool for performing DDoS attacks. HOIC works by flooding a target web server with junk HTTP GET and POST requests and can open up to 256 simultaneous attack sessions.
For the infiltration attack, we sent a vulnerable application to the Metasploitable 2 Linux machine on the victim network. We then used the vulnerable application to open a backdoor and perform our infiltration attack.
To capture the raw network traffic data (in pcap format), we used Wireshark and tcpdump [70,71]. Wireshark is a network packet analyzer that can capture network traffic data in as much detail as possible, while tcpdump is a command line utility that helps capture and analyze network traffic. After collecting the raw network packets (i.e., pcap files), we used CICFlowMeter [72,73] to process those files and extract the features of the network flows. CICFlowMeter is an open source tool that generates bidirectional flows (bi-flows) from pcap files and extracts features from these flows. The extracted features are shown in Table 1. The full dataset is available on ResearchGate for researchers and practitioners [74].
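To make the idea of a bidirectional flow concrete, the following is a minimal Python sketch of how packets might be grouped into bi-flows and summarized. It is not CICFlowMeter's implementation: the packet dictionary keys (`ts`, `src_ip`, `src_port`, `dst_ip`, `dst_port`, `proto`, `length`), the function names, and the handful of features computed are all assumptions chosen for illustration, and flow timeouts and TCP termination handling are ignored.

```python
from collections import defaultdict
from statistics import mean


def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Return a direction-independent key for a packet's 5-tuple."""
    # Sorting the two endpoints maps both directions to the same key.
    low, high = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (low, high, proto)


def extract_flow_features(packets):
    """Group packets into bi-flows and compute a few per-flow statistics.

    Each packet is a dict with the hypothetical keys
    ts, src_ip, src_port, dst_ip, dst_port, proto, and length.
    """
    flows = defaultdict(list)
    for p in packets:
        key = flow_key(p["src_ip"], p["src_port"],
                       p["dst_ip"], p["dst_port"], p["proto"])
        flows[key].append(p)

    features = {}
    for key, pkts in flows.items():
        pkts.sort(key=lambda p: p["ts"])
        first = pkts[0]
        # Assumption: the first packet seen defines the forward direction.
        origin = (first["src_ip"], first["src_port"])
        fwd = [p for p in pkts if (p["src_ip"], p["src_port"]) == origin]
        features[key] = {
            "duration": pkts[-1]["ts"] - first["ts"],
            "fwd_packets": len(fwd),
            "bwd_packets": len(pkts) - len(fwd),
            "total_bytes": sum(p["length"] for p in pkts),
            "mean_packet_length": mean(p["length"] for p in pkts),
        }
    return features
```

Each resulting feature dictionary corresponds to one record of the kind stored in a flow-based dataset; the real tool computes a much larger feature set per flow.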

Statistical Summary of GTCS Dataset
Each record in the GTCS dataset describes a traffic flow using 83 attributes plus an assigned label that classifies the record as either normal or an attack. The attack types in the dataset can be grouped into four main classes (botnet, brute force, DDoS, and infiltration). The number of records associated with each class is shown in Table 2.

Experiments
In this section, we present the experimental setup and results of comparing six ML classifiers from various classifier families in terms of classification accuracy, true positive rate (TPR), false positive rate (FPR), precision, recall, F-measure, and receiver operating characteristic (ROC) area. The selected classifiers included NB, logistic, multilayer perceptron (neural network), SMO (SVM), IBK (k-nearest neighbor), and J48 (decision tree). We compare the performance of the classifiers in identifying intrusions in network traffic using GTCS, the newly generated dataset. Next, we present a holistic approach for detecting network intrusions using an ensemble of the best-performing ML algorithms.
The experiment in this section was carried out using Weka on a PC with an Intel® Core™ i5-8265U x64-based CPU running at 3.50 GHz, 16.0 GB of RAM, and a 64-bit Windows 10 OS.

Machine Learning Algorithm Performance Comparison
The experiment in this section had two phases. In the first phase, we compared the performance of the classifiers using the GTCS dataset with the full set of extracted features. To evaluate the performance of the ML classifiers, we used precision, recall, and F1 score, which are the most common measures for evaluating the performance of anomaly detection models. Precision is the proportion of relevant instances among the retrieved instances. Recall is the proportion of relevant instances that were retrieved out of the total number of relevant instances. F1 score is the harmonic mean of precision and recall. These three measures are derived from the confusion matrix, which defines four possible outcomes (true positives, false positives, true negatives, and false negatives), as shown in Table 3. The results of this phase are summarized in Table 4.
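As an illustration of how these measures follow from the four confusion-matrix counts, here is a minimal Python sketch. The function name and the example counts in the usage note are hypothetical and are not taken from the paper's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common measures from confusion-matrix counts.

    tp, fp, fn, tn are the numbers of true positives, false positives,
    false negatives, and true negatives for one class.
    """
    def safe_div(numerator, denominator):
        # Return 0.0 instead of failing when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)          # identical to the TPR
    fpr = safe_div(fp, fp + tn)
    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "fpr": fpr,
        "f1": f1,
    }
```

For example, with made-up counts of 90 true positives, 10 false positives, 30 false negatives, and 870 true negatives, precision is 90/100 = 0.90 and recall is 90/120 = 0.75.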
In the second phase, we applied the InfoGainAttributeEval algorithm with Ranker [75] to reduce the dimensions of the dataset. InfoGainAttributeEval evaluates the worth of an attribute by measuring the information gain with respect to the class.
InfoGain(Class, Attribute) = H(Class) − H(Class | Attribute)
where H denotes entropy. The Ranker search method ranks the attributes according to their individual evaluations, after which it is possible to specify the number of attributes to retain. Using the InfoGainAttributeEval algorithm, we ranked the attributes of the GTCS dataset by their evaluations, which resulted in selecting 44 out of 84 features of the GTCS dataset. The final selected features are summarized and ranked in Figure 2. Next, we replicated the phase 1 experiment using the reduced and normalized GTCS dataset with those 44 features. The results of this phase are summarized in Table 5. Figure 3 compares the classification accuracy of each classifier in the two phases.
Tables 4 and 5 present a comprehensive comparison of the classifiers in terms of classification accuracy, precision, recall, TPR, FPR, F-measure, and ROC area. Table 6 presents the accuracy of each classifier in classifying different classes in the GTCS dataset in the two phases. The experimental results show that IBK outperformed other classifiers and had the best accuracy in both phases. Moreover, the results shown in Table 6 indicate that IBK outperformed other classifiers in classifying normal, botnet, and brute force classes, while MLP and J48 performed best at classifying DDoS and infiltration classes, respectively. These results are illustrated in Figure 4.
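The information-gain ranking used in the second phase can be sketched in a few lines of Python. This is an illustrative approximation rather than Weka's implementation: the function names are hypothetical, and the sketch assumes every attribute has already been discretized into categorical values (numeric flow features would need to be binned first).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(attribute_values, labels):
    """InfoGain(Class, Attribute) = H(Class) - H(Class | Attribute)."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # H(Class | Attribute): entropy of each group weighted by its size.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_attributes(columns, labels, keep):
    """Rank attributes by information gain and retain the top `keep`.

    `columns` maps an attribute name to its list of (discrete) values.
    """
    scores = {name: info_gain(values, labels)
              for name, values in columns.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:keep]
```

An attribute that separates the classes perfectly receives a gain equal to H(Class), whereas an attribute that is independent of the class receives a gain of zero, which is why low-ranked attributes can be dropped with little loss of information.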


A Holistic Approach for IDSs Using Ensemble ML Classifiers
In this section, we propose an ensemble classifier model (shown in Figure 5) composed of multiple classifiers with different learning paradigms to address the issue of the accuracy and false alarm rates in IDSs. The proposed model is composed of three ML classifiers from different classifier families: J48 (DT-C4.5), IBK (KNN), and MLP (NN). These classifiers were selected based on the results of the previous section, shown in Table 6. In the proposed model, the three classifiers work in parallel, and each classifier builds a different model of the data based on the preprocessed dataset. The outputs of the three classifiers are combined using the majority voting method, a traditional and common way to combine classifiers, to obtain the final output of the ensemble model.
To build the models, each classifier was evaluated using the 10-fold cross-validation technique, wherein the dataset is divided into 10 folds or subsets. In each round, nine subsets were used as the training set and the remaining subset was used as the test set, so that each subset served as the test set once. The scores from the 10 rounds were then averaged to obtain the overall performance.
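The two mechanisms described above, majority voting and k-fold splitting, can be sketched in Python as follows. This is a simplified illustration and not the Weka configuration used in the experiments: the function names are hypothetical, the classifiers are represented by arbitrary callables standing in for J48, IBK, and MLP, the tie-breaking rule (the earliest classifier in the list wins) is an assumption, and the fold split is neither shuffled nor stratified.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    Assumption: ties are broken in favor of the label proposed by the
    earliest classifier in the list.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def ensemble_predict(classifiers, record):
    """Run all classifiers on one record and combine their outputs."""
    return majority_vote([clf(record) for clf in classifiers])


def k_fold_indices(n_records, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Record i is assigned to fold i mod k.
    folds = [list(range(i, n_records, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test
```

With three voters and five possible classes, a three-way disagreement is possible, so any implementation needs some tie-breaking rule; the one used here is only one reasonable choice.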

Experimental Results
The experimental results show that the proposed ensemble IDS model was able to outperform all single classifiers in terms of classification accuracy, as shown in Figure 6 and Table 7.


Conclusions and Future Work
In this paper, we presented the GTCS dataset, which overcomes the shortcomings of most existing datasets and covers most of the necessary criteria for common up-to-date attacks such as botnet, brute force, DDoS, and infiltration attacks. The generated dataset is completely labeled, and about 84 network traffic features have been extracted and calculated for all benign and intrusive flows. We also compared the performance of six of the most well-known ML classifiers on the new dataset and demonstrated that an ensemble of different learning paradigms can improve detection accuracy, improve TPR, and decrease FPR.