A Novel Framework for Generating Personalized Network Datasets for NIDS Based on Traffic Aggregation

In this paper, we address the problem of dataset scarcity for the task of network intrusion detection. Our main contribution is a framework that provides a complete process for generating network traffic datasets based on the aggregation of real network traces. In addition, we propose a set of tools for attribute extraction and labeling of traffic sessions. A new dataset with botnet network traffic was generated by the framework to assess our proposed method with machine learning algorithms suitable for unbalanced data. The performance of the classifiers was evaluated in terms of the macro-average of the F1-score (0.97) and the Matthews Correlation Coefficient (0.94), showing good overall performance.


Introduction and Motivation
Nowadays, cybersecurity plays a fundamental role in ensuring the usability and integrity of information technology and telecommunication infrastructure. These technologies are fundamental to the everyday activities of organizations, enterprises, and individuals. As an example, perimeter security makes it possible to implement a layered protection approach, with the objective of identifying and stopping cyber attacks or network anomalies in inbound and outbound traffic using network monitoring techniques over a network segment. In recent years, anomaly-based Network Intrusion Detection Systems (NIDS) have incorporated machine learning (ML) and deep learning models to detect malicious network traffic patterns with excellent results [1,2].
NIDS and Intrusion Prevention Systems (IPS) are part of the defense strategies against cybercriminal tactics and attacks. However, for this task to be efficient, a NIDS should exhibit several desirable features: (1) fault tolerance; (2) minimal human intervention in the administration of the devices; (3) no excessive load on system resources; (4) detection of significant deviations from acceptable behavior; (5) high precision to minimize false positives and false negatives; (6) detection of all kinds of patterns and sophisticated attacks; and (7) quick intrusion detection and effective response to reduce possible damage [3,4].

Despite their wide use, current public datasets present several limitations:

1. The time between when a dataset is created and when it is updated is quite long, which is a critical issue given the speed at which the number and complexity of attacks evolve;
2. Datasets created through simulation are not accurate representations of real environments, since the user and attack behavior models they exhibit were defined by the dataset designer;
3. Datasets generated from a production network usually do not include full network dumps due to privacy issues and sensitive content. In practice, only files with features extracted by diverse tools, such as Argus [7], are shared in the form of CSV files. Without the full network dump, it is not possible to define other features at the packet or network flow level;
4. There is a lack of tools in the research community for off-line labeling of network dumps. The available tools only extract network traffic features without offering a classification label. Thus, there is still a strong dependency on the current public datasets;
5. The available tools for feature extraction from network traffic have efficiency problems on files over 100 MB in size [8]. Feature extraction occurs at the session level; thus, if a network session is not complete with all of its packets, the statistics are imprecise. We verified that some tools do not check for the full network session.
The main motivation of this work is to overcome the limitations of public datasets, which the research community started to use more than twenty years ago. Those datasets are considered a static alternative because they become obsolete and do not reflect current threats. Having only this alternative limits the possibility of studying other specific attack scenarios on production networks. On the other hand, personalized and dynamic datasets generated from real-world network traffic and actual malicious traffic help to improve the state of the art and allow ML techniques to be used for developing better NIDSs that fight the increasing cybercrime threat. Therefore, in this paper, we propose a novel framework to generate personalized network datasets that helps address the limitations presented before. In this sense, the proposed framework is a suitable option to generate datasets that include specific attack scenarios over a production network, without the attack actually taking place. Thus, our proposal is three-fold:

• For the obsolescence and lack of updates of network traffic datasets, we propose the use of public repositories with real malicious network traffic, such as malware-traffic-analysis.net, which is updated daily with network dumps and malware samples;
• For the lack of real environment scenarios, we propose a dynamic and extensible framework to generate network traffic datasets based on the off-line aggregation of malicious network traffic into the traffic of a production network;
• To process the network traffic dumps generated by our framework, we propose novel and efficient tools for feature extraction and label assignment for the network traffic sessions.
In summary, our contributions are aligned with the main proposal:

1. We developed a tool to generate network traffic from real traces, sanitized and mixed with malware traffic, which can produce different realistic scenarios;
2. Another tool takes the generated PCAP file and produces a list of features and labels that can be used by ML techniques;
3. We generated a network dataset containing botnet activity and performed a deep analysis using ML techniques, discovering a relation between the macro-average of F1 and the MCC.
The rest of the paper is organized as follows: Section 2 offers a revision of the related work. Section 3 presents the framework architecture and describes it in detail. The dataset, evaluation metrics, experiments, and results of this work are presented in Section 4. Finally, the conclusions are given in Section 5.

Related Work
The research field of network intrusion detection systems has been invigorated by the application of artificial intelligence techniques to intrusion detection. However, the main difficulty still remains: the scarcity of public network traffic datasets with the right features for the task. Several efforts have been made to provide the research community with network datasets, and some of the resulting datasets are considered emblematic in the evaluation of NIDS.
The oldest and most popular dataset is DARPA 1998, generated by the MIT Lincoln Laboratory with DARPA funding. From 1998 to 2001, this laboratory ran a series of large-scale simulations to create these synthetic data, which MIT made publicly available for download from its website [9]. The dataset is composed of a training part and a testing part, which include seven and two weeks of network traffic, respectively. It includes five class patterns: Normal, Probing Attack (Probe), Remote to Local (R2L), Denial of Service (DoS), and User to Root (U2R). However, it has some negative aspects [10,11]: the models used for the network traffic generator are very simple and obsolete, and they do not represent the distribution of network attacks in real scenarios (for example, 79.2% DoS versus 19.7% normal traffic and only 1% for the remaining attacks). In 1999, a new, improved version was released as DARPA 1999. This dataset contains five weeks of network traffic, of which only the second week includes a selected subset of attacks from its predecessor: 200 instances of 56 different network attacks [12]. Because of their limitations, these datasets are no longer used in current research [13].
On the other hand, the KDD family of synthetic datasets includes KDD CUP 99 and NSL-KDD. The first [14] is based on the DARPA dataset and contains around five million instances, each one representing a TCP/IP session described by 42 features. It contains 22 attacks grouped into four categories (DoS, Probe, U2R, and R2L) and presents a class imbalance problem, since only approximately 20% of the instances are classified as normal traffic patterns. The imbalance also appears among the attacks; for example, U2R represents 0.15% of the training set and 0.7% of the test set, while, at the opposite extreme, DoS has a distribution of 79.2% in the training set and 73.9% in the test set. Another problem is the huge number of redundant records in the training set and duplicated records in the test set, which negatively impacts ML algorithms [15]. The second dataset [6] is an improved and corrected version of KDD CUP 99 produced by the University of New Brunswick. It is made up of two training subsets and one test subset. The distribution of normal and attack traffic is nearly balanced in the three subsets. However, the distribution of the attacks is strongly skewed: more than 30% of the attacks are DoS, while U2R does not even reach 1%. One of the improvements was the elimination of duplicate instances, about 78% and 75% in the training and test sets, respectively.
The Kyoto 2006+ dataset was built with three years of real network traffic (from November 2006 to August 2009) using honeypots, darknet sensors, an email server, and web crawlers from the University of Kyoto [16]. Kyoto 2006+ dataset contains 14 statistical features derived from the KDD Cup 99 dataset and 10 additional features. This dataset was prepared to eliminate redundant instances and irrelevant features. One of its disadvantages is that there is no specific information about the attacks it contains.
Another recognized dataset is ISCX 2012, created by the University of New Brunswick in 2012 [5]. It is considered a dynamic dataset because its synthetic traffic is organized into a series of profiles that allow its selective use for different detection tests. The alpha profiles describe the attack scenarios and are used to generate malicious traffic for the evaluation of the systems. The dataset presents four scenarios: network infiltration from the inside, denial of service via HTTP, distributed denial of service via an IRC botnet, and a brute-force SSH attack.
The CTU-13 dataset [17] was captured from a production network at the Czech Technical University (CTU) in 2011 and is part of the Stratosphere IPS project [18]. It consists of a group of 13 different malware captures made in a real network environment. These network captures include malware, normal, and background traffic samples. The normal part of the dataset was captured using a Windows 7 host running in a virtual machine. Normal traffic uses known web browsers, while the malware probably uses its own libraries to communicate with the Internet. Because it contains real network traffic, the full network traces are not publicly available.
All the previously described datasets are now considered static alternatives, since they do not change over their lifespan. Recently, new solutions to the limitations of public datasets have been proposed. As an example, new software tools have emerged, such as Silk [19]. The authors proposed a method to produce network traffic datasets for NIDS research by extracting meta-information from the network packets, combined with the logs from an IDS to label the traffic. The result is labeled network flows compatible with NetFlow. That work provides access to recent network flow datasets with associated labels. However, this proposal does not generate PCAP files, only flow data, i.e., text files with a summary of specific features. Because it does not offer PCAP files, the payload is missing and there is no possibility of generating new and improved features. Additionally, the authors do not mention which IDS software was used.
Another work proposed the software tool ID2T [20] to generate datasets for NIDS by injecting synthetic attacks mixed with background traffic. This approach produces files in both PCAP and XML formats, the latter used for labeling. The architecture has the ability to inject new and advanced attacks. However, it cannot guarantee that the background traffic is attack-free; that is, it lacks a traffic sanitization stage. As a consequence, there is a risk of generating traces containing unknown malicious traffic, which would affect the training of an ML model. Another limitation of the work is that it does not support IPv6.
Wilaux and Ngamsuriyaroj [21] proposed a framework to generate bidirectional flow data with 20 features based on the combination of background and malicious traffic. For background traffic, they use 15-min captures of real traffic from the MawiLab project [22]. They then model the statistical properties of the traffic and, based on the adjusted model, generate synthetic traffic using D-ITG (Distributed Internet Traffic Generator) [23] to use as background traffic. On the other hand, malicious traffic is generated with tools that are part of the Kali Linux distribution [24]. Some issues with this proposal concern the way the background traffic is generated: since the models used are memoryless, they cannot represent the self-similarity and long-term dependence properties of real traffic [25]. Moreover, the MawiLab traffic is sanitized in a simplistic way; only single-packet sessions with a duration of 0 milliseconds are eliminated, because they are related to scanning activity.
We identified some gaps in previous works related to the tools and data used to generate synthetic traffic. For example, when statistical models are used, the generated traffic does not adjust to real network traffic. Additionally, the sanitization of MawiLab traffic is very simplistic, in some proposals the PCAP files are not available, and malicious traffic generated with Kali Linux tools is synthetic. Another weakness is the output format of one tool, which only generates traffic in a NetFlow-compatible format, discarding any payload and limiting the features that can be used to analyze that traffic. Our proposed framework includes the capability of mixing real, sanitized network traffic with real malicious traffic to generate PCAP files as datasets in which features and patterns can be investigated. Our feature extraction tool supports IPv6. Additionally, traffic labeling can not only identify malicious and benign sessions; each malicious session can also be labeled with one of the 13 priority 1 attack classifications defined by Snort in /etc/snort/classification.config.

Framework Architecture
In this section, we present the general architecture of the proposed framework. First, we offer an overview of the architecture, followed by a description of the phases that compose it. The production network used in this work is also described.
The architecture proposed in this article is shown in Figure 1. The workflow consists of four phases. The first phase is data collection and cleansing, in which raw traffic from a production network is captured. The cleansing process takes a raw network trace in pcap format and removes the session packets that correspond to an alert generated by the Snort software [26]. Once all the alerts have been processed, a new network traffic trace that is attack-free according to Snort is obtained. This trace is also referred to as the sanitized trace. In the second phase, the aggregation of malicious traffic is carried out by replaying the sanitized trace together with a trace of malicious traffic. In the third phase, the dataset is created in CSV format: the aggregated trace is processed to extract the features of the traffic sessions. Additionally, Snort is used as an expert to help tag sessions as either attack or normal traffic. Optionally, for malicious traffic, those sessions can be sub-classified as one of the 38 attack types defined by Snort in the file classification.config; with this information, one more variable can be included with the type of attack provided by Snort. Finally, in phase four, the dataset is used by ML algorithms to select a predictive model that detects malicious traffic sessions. The details of each phase are explained in the following subsections.

Collection and Cleansing Phase
As can be seen in Figure 2, this phase is made up of two parts: collection and cleansing. In the first part, network traffic is captured with the TCPDUMP software, scheduling the capture task with cron at a set time. The real traces used in this research were captured in the campus network of the Autonomous University of Nayarit (UAN), described next. Figure 2 shows the architecture of the UAN network. It is a network with wired and wireless connections and approximately 7000 internal nodes organized into segments through the use of VLANs. These VLANs efficiently separate traffic and allow better use of resources through the logical segmentation of the infrastructure into different subnets; hence, packets are switched only between ports within the same virtual network. The ER16 multilayer switch is responsible for directing traffic to the FORTIGATE 1000 appliance and to the Demilitarized Zone (DMZ). The DMZ contains the WEB, DNS, EMAIL, and SQL servers. When a LAN node of the UAN requests access to an external host, the ER16 directs this request to the FORTIGATE 1000, which is in charge of managing the privileges, restrictions, and priority of the traffic leaving the UAN network. The capture sensor has the IP address 192.100.162.12 and sits inside the DMZ on one of the email servers. Figure 3 shows the diagram where the collection of traffic and the filtering of malicious sessions are carried out. A traffic session is defined as a set of packets that share a five-tuple: source IP address, destination IP address, source port, destination port, and protocol. The cleansing operation consists of eliminating all sessions involved in malicious activity. This process is automated but requires an expert component; for this reason, the Snort software, version 2.9.17 GRE (Build 199), was used. Snort is an open-source signature-based IDS/IPS.
It uses rules to define malicious activity in network traffic and, based on them, searches for matches in the monitored traffic; when a match occurs, an alert is triggered. Alerts are stored in /var/log/snort/ and can be written in two binary formats: PCAP dump or unified2. The latter is an extensible format formed by a header that defines the type of record and its length, and it stores 21 types of information about the event, such as source and destination IP addresses, source and destination ports, alert priority, and the classification of the attack, among others [27]. The alert file is used by our tool splitraffic, written in C++, to perform the cleansing operation on the raw trace, that is, to eliminate the traffic sessions that correspond to the alerts. In the case of a PCAP alert file, the mapping is performed only on the five-tuple basis. With unified2 alerts, in addition to the five-tuple information, we have greater control over the sessions to be eliminated. Specifically, network sessions can be removed based on the three priority levels defined by Snort in the file /etc/snort/classification.config. Priority 1 is assigned to alerts of the highest severity, 2 to intermediate severity, and 3 to the lowest severity. Filtering by priority is controlled with the max_priority option, whose value represents the maximum alert priority level considered for filtering: with max_priority = 1, only priority 1 alerts are taken into account; with max_priority = 2, priority 1 and 2 alerts are used; and similarly for level 3. With level 3, more alerts will be matched across priorities 1, 2, and 3, which could lead to some false positives. For this work, only priority 1 was used. Algorithm 1 performs traffic cleansing based on Snort: once the raw trace is loaded into memory, the type of alert to be used is read; in the case of unified2 alerts, the maximum priority level is also read.
Then, a while loop reads the trace to be sanitized packet by packet, looking for a match in the alert file. If the packet is not related to any alert, it is written to the cleaned trace file. Otherwise, the priority level of the alert is compared with max_priority; if the priority is higher (less severe) than max_priority, the packet is still written to the cleaned trace file, trace_out. Table 1 shows the summary of UAN-12, our dataset of sanitized traces for this work, which is composed of 12 network traces. The total size is 47.8 GB, containing 78.7 million TCP/UDP packets. All network traces were collected on weekdays at the same time of day to minimize errors derived from daytime effects on network use.
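The session matching and filtering decision described above can be sketched compactly. The following Python fragment is an illustrative approximation of the Algorithm 1 logic; the data structures and names are hypothetical and not those used internally by splitraffic:

```python
def session_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Direction-independent five-tuple key: packets from A->B and B->A
    map to the same bidirectional session."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (proto,) + (a + b if a <= b else b + a)

def keep_packet(key, snort_alerts, max_priority):
    """Return True if a packet survives cleansing.

    snort_alerts -- map from five-tuple keys to alert priority (1 = highest
                    severity); mirrors the unified2-based alert map.
    max_priority -- alerts with priority <= max_priority trigger removal.
    """
    if key not in snort_alerts:
        return True                      # no alert: write to sanitized trace
    # A larger priority number means lower severity, so the packet is kept.
    return snort_alerts[key] > max_priority

alerts = {session_key("10.0.0.5", 51000, "192.100.162.12", 443, "TCP"): 1,
          session_key("10.0.0.6", 51001, "192.100.162.12", 80, "TCP"): 3}

# Reply packets of the priority 1 session are removed as well:
assert not keep_packet(session_key("192.100.162.12", 443,
                                   "10.0.0.5", 51000, "TCP"), alerts, 1)
# With max_priority = 1, the priority 3 session survives:
assert keep_packet(session_key("10.0.0.6", 51001,
                               "192.100.162.12", 80, "TCP"), alerts, 1)
```

The canonical ordering of the two endpoints ensures that both directions of a session share a single key, which is what allows an alert raised on one direction to remove the whole session.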

Traffic Aggregation Phase
The objective of this phase is to generate a mixed network trace with attack-free traffic and malicious traffic. The malicious traffic dataset was created using NETRESEC [28], which offers a list of public repositories of PCAP files. Two resources from that list are malware-traffic-analysis [29] and the Stratosphere project [30], which offer public traces of network traffic in the presence of malware. In this work, a dataset from the Stratosphere IPS Project was used to represent malicious network traffic, specifically the CTU-13 dataset, from which only the botnet network traffic traces of twelve scenarios were used. Table 2 summarizes the malicious dataset used in this work.
Algorithm 2 shows the procedure for adding malicious network traffic to the sanitized trace. First, the duration of the sanitized trace, duration_b, is obtained; in this case, it acts as the background traffic on which malicious traffic will be mixed. The duration of the malicious traffic trace is also obtained, since these measures determine the time at which the attack will be launched, replay_delay_m. Then, a process is started to capture traffic on a network interface, ethernet_interface (e.g., eth0), for which the TCPDUMP tool is used. After that, a process is executed to replay the background traffic, using tcpreplay on the same interface used for the capture. When the replay_delay_m time arrives, the replay of the malicious trace begins. During the replay of the background traffic, the traffic heard on ethernet_interface is captured. When the replay ends, the capture is stopped and the mixed trace is obtained in mixed_traffic. Table 3 summarizes the personalized network traffic traces generated by our framework, which consist of twelve network traffic traces made up of a mixture of normal and botnet traffic. Each trace represents a scenario where a specific botnet is spreading. In the next phase, the 3Fex tool is used to extract the session features, label the sessions, and generate a CSV file with this information.
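The timing logic of the aggregation phase can be sketched as follows. The paper does not state how replay_delay_m is chosen from the two durations, so the centring policy below is an assumption for illustration only; likewise, the command lines are illustrative invocations of tcpdump and tcpreplay, not the exact ones used by the framework:

```python
def injection_delay(duration_b, duration_m):
    """Delay (seconds) before replaying the malicious trace.

    Assumption: the malicious trace is centred within the background replay
    window. Any delay with 0 <= delay and delay + duration_m <= duration_b
    keeps the attack fully inside the background traffic."""
    if duration_m > duration_b:
        raise ValueError("malicious trace longer than background trace")
    return (duration_b - duration_m) / 2.0

def replay_commands(background_traffic, malicious_traffic, iface="eth0"):
    """Build illustrative command lines for the aggregation phase:
    capture on iface, replay background, then replay malicious traffic."""
    return [["tcpdump", "-i", iface, "-w", "mixed_traffic.pcap"],
            ["tcpreplay", "-i", iface, background_traffic],
            ["tcpreplay", "-i", iface, malicious_traffic]]

# A one-hour background trace and a ten-minute botnet trace:
assert injection_delay(3600, 600) == 1500.0
cmds = replay_commands("sanitized.pcap", "botnet.pcap")
assert cmds[0][0] == "tcpdump" and cmds[1][0] == "tcpreplay"
```

In a real run, the first command would be launched as a background process, the second started immediately, and the third started after sleeping for the computed delay, mirroring the wait/replay/stop steps of Algorithm 2.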

Feature Extraction and Labeling Phase
The session extraction and labeling phase is carried out using the 3Fex (Fast Flow Feature Extractor) tool, which we developed in the C++ programming language. 3Fex is designed to handle network traces in libpcap 2.4 format and efficiently extracts attributes from TCP/UDP sessions for both IPv4 and IPv6. Table 4 shows the 52 session features that our tool extracts. Additionally, 3Fex has the option of generating an output file with the inter-arrival times of the packets that make up each session. This allows other types of study to be performed, such as time series analysis or the study of heavy-tailed distributions, long-term dependence, or self-similarity.
Tools that extract session features from network traces, such as CICFlowMeter [31] or Flowmeter [32], have shown deficiencies when reconstructing network sessions, since they do not extract all the packets that form a session. This behavior produces incorrect values in the features of that session. Another limitation is the long processing time needed to produce the feature file for large traces. In addition, none of them offers the option of labeling the traffic.
Algorithm 3 shows the mechanism that our tool uses to handle network traces, extract the network traffic sessions, and optionally carry out their labeling and/or attack classification. To speed up the process and avoid hard disk accesses, the traffic trace is loaded into RAM using a buffer. Once it is in memory, the type of Snort alert to use, PCAP or unified2, is identified and a snort_logs map is created, which functions as an associative container that stores the information of the alerts in the form of a five-tuple and an object, the latter representing the alert attributes. Subsequently, a loop identifies the network sessions and extracts the set of selected features, features_session. The labeling depends on the type of alert: with PCAP alerts it is only possible to perform binary labeling, i.e., 0 for the negative class and 1 for the positive class. With unified2-based alerts, we can additionally define the highest priority level to tag and/or the type of attack classification given by Snort, i.e., when the unified2_logs && snort_logs condition is met. Finally, a file in CSV format contains the features of the sessions and/or their labels and/or the classification of the type of attack. We selected CSV because it is a common data interchange format when working with open source languages such as R or Python.
An outstanding feature is that, once a session has been extracted and processed, the search space in the trace is reduced, since packets that were extracted in previous sessions are marked to be ignored in subsequent sessions.
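The labeling step of Algorithm 3 can be illustrated with a small sketch. The field names and map layout below are hypothetical, intended only to show how a binary label and an optional Snort attack class could be attached to each extracted session:

```python
def label_session(key, snort_logs, unified2=False):
    """Return (label, attack_class) for one session (Algorithm 3 sketch).

    key        -- the session's five-tuple
    snort_logs -- map from five-tuples to alert attributes (illustrative)
    unified2   -- True when unified2 alerts are available; only then can the
                  Snort attack classification be added as an extra feature.
    """
    alert = snort_logs.get(key)
    if alert is None:
        return 0, None                   # negative class: benign session
    # PCAP alerts allow only binary labeling; unified2 alerts additionally
    # carry the Snort classification of the attack.
    return 1, (alert["classification"] if unified2 else None)

logs = {("10.0.0.5", 51000, "1.2.3.4", 6667, "TCP"):
        {"priority": 1, "classification": "trojan-activity"}}

assert label_session(("10.0.0.5", 51000, "1.2.3.4", 6667, "TCP"),
                     logs, unified2=True) == (1, "trojan-activity")
assert label_session(("9.9.9.9", 1, "8.8.8.8", 53, "UDP"), logs) == (0, None)
```

Each (label, attack_class) pair would then be appended to the session's feature vector before the row is written to the CSV file.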

Classification Phase
Network traffic classification is a specialized solution and a valuable tool used to effectively tackle network planning, management, and monitoring, as well as attack detection and forensic analysis. Network attack detection can be accomplished through supervised ML.
ML algorithms learn from data. Specifically, in supervised ML, it is assumed that we have access to a dataset D assembled from n labeled, independent, and identically distributed (i.i.d.) training examples, (X_1, Y_1), ..., (X_n, Y_n). Each instance or observation is a pair formed by a feature vector X belonging to a feature space or input space X ⊆ R^p, together with the system's result (class label) Y belonging to a label space or output space Y. The fundamental assumption of statistical learning theory is that there exists a joint probability distribution over the feature-label space X × Y, denoted P_XY(x, y), where (X, Y) denotes a pair of random variables distributed according to P_XY(x, y) and (x, y) denotes a realization. The training dataset D is expressed as follows:

D = {(x_i, y_i)}_{i=1}^{n}, with x_i ∈ R^p and y_i ∈ {1, ..., k},

where k is the number of classes. Supervised ML can be described as the problem of approximating a target function, y = f(x), that maps feature vectors x_i to outputs y_i. Because the objective function is unknown, ML algorithms try to find a hypothesis function, h : X → Y, that approximates f(x). The hypothesis function is also usually written as h(x, θ) or h_θ(x), where θ is the parameter vector of the model learned from D. The set of all possible hypotheses is known as the hypothesis space, H = {h_θ(·), θ ∈ Θ}, where Θ is the parameter space.
A learning algorithm is a procedure A : (X × Y)^n → H that takes the training set and produces the model that best approximates the unknown objective function, that is, h_θ(x) = A(D) = A((X_1, Y_1), ..., (X_n, Y_n)). Note that h_θ(x) is a function of random variables, so it is also a random variable.
The loss function, L(h_θ(x), y), measures the divergence or error, e, between the prediction made by the model and the correct value of the observation used during learning, resulting in a loss value:

e = L(h_θ(x), y).

As an example, for k = 2, in binary classification, the 0/1 loss function is frequently used:

L(h_θ(x), y) = 1{h_θ(x) ≠ y},

where 1{A} is an indicator function that takes the value 1 if the logical condition A is true and 0 otherwise; L counts the number of misclassifications. Other loss functions for classification are binary cross-entropy, Huber loss, ε-insensitive loss, hinge loss, logistic loss, and exponential loss, among others. The risk or generalization error for h is written as the expected value of the loss function, where the expectation is taken with respect to the distribution P_XY(x, y):

R(h_θ) = E_{(X,Y) ∼ P_XY}[L(h_θ(X), Y)].

The ideal estimator or objective function is the minimizer of:

h* = argmin_{h_θ ∈ H} R(h_θ),

where H is the hypothesis space over which R(h_θ) is defined. In practice, h_θ cannot be found in this way because P_XY(x, y) is unknown; the only information available is the training set, D. A natural estimator of the risk function is the empirical risk,

R_emp(θ) = (1/n) Σ_{i=1}^{n} L(h_θ(x_i), y_i),

constructed from the training set, D. Learning h_θ by minimizing Equation (6) is known as the empirical risk minimization (ERM) principle. ERM states that it is possible to minimize R_emp(θ) with respect to θ ∈ Θ. For n → ∞, the empirical risk R_emp(θ) converges uniformly to the risk function R(θ) [33,34]. Figure 4 shows how the concepts mentioned above relate to the learning of a supervised classification model h_θ(x_i) by an ML algorithm, A. Additionally, an input to the ML algorithm has been integrated to modify it into a cost-sensitive strategy, which is effective for solving the unbalanced classification problem. Class imbalance in binary datasets occurs when there is a majority or negative class with normal data and a minority or positive class with abnormal or important data, which generally has the highest misclassification cost [35].
One way to quantify the level of class imbalance is the Imbalance Ratio (IR), the ratio of the number of instances of the negative class to the number of instances of the positive class. For example, the distribution may be slightly skewed, e.g., 4:6, or severe, with an IR of 1:100, 1:1000, or more. Traditional ML algorithms often assume that the training set is balanced. On unbalanced datasets, such as intrusion detection datasets [36], classical classification algorithms can be biased towards the majority classes, and metrics such as accuracy often give misleading values.
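The point about accuracy being misleading is easy to demonstrate numerically; the class counts below are made up for illustration:

```python
def imbalance_ratio(n_negative, n_positive):
    """IR = majority (negative) class count over minority (positive) count."""
    return n_negative / n_positive

# A severely skewed dataset: a degenerate classifier that always predicts
# "benign" reaches 99% accuracy while detecting no attacks at all.
n_neg, n_pos = 99_000, 1_000
assert imbalance_ratio(n_neg, n_pos) == 99.0

always_benign_accuracy = n_neg / (n_neg + n_pos)
assert always_benign_accuracy == 0.99
```

This is why the evaluation in this work relies on macro-averaged F1 and the MCC rather than accuracy.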
Cost-sensitive learning is introduced to overcome the limitations of traditional classification algorithms on unbalanced datasets. Minority class oversampling and majority class subsampling can also be used to handle this problem. When working with an unbalanced binary classification problem, the minority (positive) class is usually of the greatest interest. In our case, it corresponds to malicious traffic sessions, and since there are few samples, it is usually more difficult to predict. In this work, to evaluate the generated datasets, we use two approaches: data sampling and cost-sensitive algorithms.
Data sampling is a set of techniques that transform a training set to balance or improve the distribution between classes. Once balanced, traditional ML algorithms can be trained directly on the transformed dataset without modification, which helps to address the unbalanced classification problem. Cost-sensitive learning, on the other hand, considers the costs of prediction errors (and potentially other costs) when training an ML model. Instead of each instance simply being classified correctly or incorrectly, each class (or instance) receives a misclassification cost; thus, rather than trying to optimize accuracy, the problem becomes minimizing the total cost of misclassifications [37].
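The cost-sensitive idea can be illustrated with a small Python sketch. The cost values below are hypothetical, not taken from the paper; they only show why total cost is more informative than accuracy on skewed data:

```python
# Hypothetical per-class misclassification costs (illustrative only).
FN_COST = 10.0  # missing a malicious session is expensive
FP_COST = 1.0   # flagging a benign session is cheap by comparison

def total_cost(y_true, y_pred):
    """Sum of misclassification costs instead of plain error count."""
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            cost += FN_COST  # false negative
        elif t == 0 and p == 1:
            cost += FP_COST  # false positive
    return cost

# A classifier that always predicts "benign" looks accurate on skewed data...
y_true = [0] * 98 + [1] * 2
always_benign = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, always_benign)) / 100
print(accuracy)                           # 0.98
print(total_cost(y_true, always_benign))  # 20.0 -- both malicious sessions missed
```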

Experiments and Results
In this paper, we conducted experiments with the dataset generated by our framework. The dataset comprises 12 scenarios that contain both legitimate activity and one type of botnet. For each scenario, the associated CSV file contains the traffic sessions with their corresponding 51 features plus the binary class label. Class 0, or negative, corresponds to a benign network traffic session, while class 1, or positive, corresponds to a botnet traffic session. The goal is to build a predictive model with an ML workflow in Python 3.8.5. The problem was addressed as an unbalanced binary classification, since the positive and negative classes are not uniformly distributed, presenting a severe bias towards the negative class, i.e., benign traffic sessions. The attack vector included in the dataset is botnet network traffic. A botnet is a network of computers, called bots, infected with malware that allows them to be remotely controlled by an attacker, called a botmaster. These computers can be used together to carry out malicious activities without their owners' knowledge [38]. The life cycle of a botnet comprises several steps, starting with the botmaster infecting the victim with malware. The infected bot connects to command and control (C&C) channels via HTTP, IRC, or other protocols. The botmaster sends orders to the bots through a C&C server and gradually builds an army of bots [39].
This article presents five types of classifiers based on ML algorithms to distinguish between benign and botnet network traffic by classifying the corresponding session. A set of modified, i.e., cost-sensitive, algorithms was used. Additionally, traditional ML algorithms were used together with sampling techniques to balance the class distribution in the dataset.
The first classifier is W-LR (Weighted Logistic Regression), a cost-sensitive logistic regression algorithm. The second classifier is LR-SMOTE, in which the traditional logistic regression algorithm is applied together with one of the best-known oversampling algorithms, SMOTE (Synthetic Minority Over-sampling Technique), which randomly generates synthetic objects between two objects of the minority class [40]. The third classifier is W-DT (Weighted Decision Tree), a cost-sensitive decision tree algorithm.
The fourth classifier is SVM + OSS, based on the Support Vector Machine (SVM) algorithm applied together with a subsampling technique called One-Sided Selection (OSS) [41], which combines Tomek Links and the Condensed Nearest Neighbor (CNN) rule [42]. The last algorithm is XGB (eXtreme Gradient Boosting, better known as XGBoost), implemented as an optimized distributed library under the gradient boosting framework, designed to be highly efficient, flexible, and portable [43].
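The interpolation step at the heart of SMOTE, mentioned above for the LR-SMOTE classifier, can be sketched as follows. This is a simplified illustration: a real implementation (e.g., the `SMOTE` class in imbalanced-learn) additionally restricts the second sample to the k nearest minority-class neighbors of the first:

```python
# Simplified sketch of SMOTE's interpolation step (illustrative only).
# A synthetic minority sample is placed at a random point on the segment
# between two existing minority-class samples x_a and x_b.
import random

def smote_sample(x_a, x_b):
    lam = random.random()  # lambda drawn uniformly from [0, 1)
    return [a + lam * (b - a) for a, b in zip(x_a, x_b)]

random.seed(0)
synthetic = smote_sample([1.0, 2.0], [3.0, 6.0])
# Each coordinate of the synthetic point lies between its two parents.
assert all(min(a, b) <= s <= max(a, b)
           for s, a, b in zip(synthetic, [1.0, 2.0], [3.0, 6.0]))
print(synthetic)
```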
The ML pipeline used in this work is as follows:
1. Split the data into train and test sets;
2. Fit the data preparation on the training dataset;
3. Apply the data transformation to obtain prepared train and test datasets;
4. Find a model by running a grid search only on the prepared training set, using cross-validation;
5. Evaluate the model on the prepared test set using four performance metrics.
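The first three steps can be sketched in plain Python, assuming a simple standardization as the "data preparation" step; the data are toy values, and the grid-search and evaluation steps are omitted. The point of the ordering is that statistics are fitted on the training split only and then applied to both splits, avoiding leakage from the test set:

```python
# Sketch of steps 1-3 with standardization as the preparation step.
from statistics import mean, stdev

def fit_scaler(train_column):
    """Learn scaling statistics from the training data only."""
    return mean(train_column), stdev(train_column)

def transform(column, mu, sigma):
    return [(v - mu) / sigma for v in column]

# 1. Split the data into train and test sets (toy single-feature data).
data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
train, test = data[:4], data[4:]

# 2. Fit the data preparation on the training dataset only.
mu, sigma = fit_scaler(train)

# 3. Apply the transformation to obtain the prepared sets.
train_prep = transform(train, mu, sigma)
test_prep = transform(test, mu, sigma)
print(round(mean(train_prep), 6))  # 0.0 -- centred on training statistics
```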

Performance Metrics
Typically, the following criteria are used to evaluate the performance of an ML predictive model in detecting botnets:

1. True Positive (TP): indicates that the botnet was successfully detected in the traffic session.
2. True Negative (TN): indicates that a benign traffic session was correctly identified.
3. False Positive (FP): indicates that a benign traffic session was falsely detected as a botnet session.
4. False Negative (FN): indicates that a botnet session was not detected and was identified as a benign traffic session.
Based on the previous criteria, different metrics in terms of macro-averages, specifically macro-precision, macro-recall, and macro-F1, were used to assess the performance of the models proposed in this research. Recently, macro-averaging has been proposed in the literature to quantify the botnet detection rate [38,44-46]. To define a macro-average, we consider m_c as a performance measure for class c that depends on the previous criteria. The precision measures for class c = 0 and class c = 1 are p_0 = TN/(TN + FN) and p_1 = TP/(TP + FP), respectively. These measures are also known as the Negative Predictive Value (NPV) and the Positive Predictive Value (PPV), and are interpreted as the proportion of correct predictions among everything predicted as class c.
On the other hand, the recall for class 0 and class 1 is r_0 = TN/(TN + FP) and r_1 = TP/(TP + FN), respectively. These measures are also known as the True Negative Rate (TNR) and the True Positive Rate (TPR), and are interpreted as the proportion of correct predictions with respect to the size of each class c. The F1-scores for classes 0 and 1 are f_0 = 2 p_0 r_0 /(p_0 + r_0) and f_1 = 2 p_1 r_1 /(p_1 + r_1), and correspond to the harmonic means of p_c and r_c, respectively.
A macro-averaged metric computes the measure for each class independently and then averages the results, treating all classes as equal, including minority ones. As mentioned in [47], the macro-averages of precision, recall, and F1-score are more sensitive to the class imbalance problem than their respective micro-averages.
Therefore, if we want to know the effectiveness in identifying the minority class, i.e., the malicious traffic sessions, the macro-averages must be calculated. A macro-average M of a measure m_c can be calculated as follows:

$$M = \frac{1}{n} \sum_{c} m_c,$$

where n is the total number of classes; in our case of binary classification, n = 2. The macro-averages for precision, recall, and F1-score are P = (p_0 + p_1)/2, R = (r_0 + r_1)/2, and F1 = (f_0 + f_1)/2, respectively. In this work, we also use the Matthews Correlation Coefficient (MCC), developed by Matthews in 1975 and proposed by Baldi et al. in 2000 as a standard performance metric for ML algorithms, with a natural extension to the multiclass case [48]. Because it incorporates the four categories of a confusion matrix, several works consider the MCC metric more reliable than the F1 metric for assessing binary classifiers under class imbalance [49,50]. However, this contrasts with the results in [51], where it is argued that the MCC seriously deteriorates when a dataset is unbalanced. Under this scenario, we opted to consider the macro-averages of precision, recall, and F1-score, as well as the MCC, to compare the behavior of the classifiers under these metrics.
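A worked sketch of these macro-averaged metrics, computed directly from hypothetical confusion-matrix counts (the four values below are illustrative, not results from the paper):

```python
# Worked sketch of per-class and macro-averaged metrics (toy counts).
TP, TN, FP, FN = 90, 950, 50, 10

p0, p1 = TN / (TN + FN), TP / (TP + FP)   # NPV and PPV
r0, r1 = TN / (TN + FP), TP / (TP + FN)   # TNR and TPR
f0 = 2 * p0 * r0 / (p0 + r0)              # per-class F1-scores
f1 = 2 * p1 * r1 / (p1 + r1)

P = (p0 + p1) / 2    # macro-precision
R = (r0 + r1) / 2    # macro-recall
F1 = (f0 + f1) / 2   # macro-F1
print(round(F1, 3))  # 0.86
```

Note that the macro-F1 (0.86) is much lower than accuracy would be on these counts, because the minority class's weaker scores are weighted equally.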
The MCC is defined as:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$

This metric has some useful properties: when the classifier is perfect (FP = FN = 0), the MCC value is 1, indicating a perfect positive correlation. When the classifier always misclassifies (TP = TN = 0), the MCC value is -1, representing a perfect negative correlation (in this case, it is enough to invert the classifier's output to obtain the ideal classifier). In fact, the MCC value always lies between -1 and 1, and a value of 0 means that the classifier is random.
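A small Python sketch of the MCC formula and its two limiting cases described above (the counts are illustrative):

```python
import math

def mcc(TP, TN, FP, FN):
    """Matthews Correlation Coefficient from the four confusion-matrix counts."""
    num = TP * TN - FP * FN
    den = math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    return num / den if den else 0.0

assert mcc(50, 950, 0, 0) == 1.0    # perfect classifier: FP = FN = 0
assert mcc(0, 0, 50, 950) == -1.0   # always wrong: TP = TN = 0
print(round(mcc(90, 950, 50, 10), 3))  # 0.733
```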

Classification Algorithms Performance
Next, we show the experimental results using the twelve botnet attack scenarios represented by the dataset generated in this work. Five classification algorithms were used to perform the detection of botnet attacks. The chosen classifiers combine cost-sensitive algorithms and sampling techniques, adapting the ML methodology to the context of unbalanced data. Figure 5 shows the first six scenarios, while Figure 6 shows the remaining ones. Each scenario is identified by a title and presents a set of bar graphs showing the performance of each classifier using the macro-averages of precision, recall, and F1-score, as well as the MCC. According to the results obtained on the UAN-12 dataset, the macro-averaged F1-score and the MCC behave consistently, presenting no discrepancies in the evaluation of the classifiers. In the comparison among the different methods selected to address the unbalanced classification problem, the SVM + OSS and XGB classifiers showed the highest performance, while W-DT, W-LR, and LR-SMOTE presented low performance in all scenarios, especially in scenarios 3, 4, 5, 9, and 11, where W-LR and LR-SMOTE performed the worst. Table 6 summarizes the performance of the two best classifiers, SVM + OSS and XGB. Based on the macro-averaged F1 and MCC results, both classifiers obtained an average performance of 0.97 and 0.94, respectively. Comparing the classifiers in each scenario, we can see that in scenarios 1, 3, and 7 the classifiers had the same performance, while SVM + OSS outperformed the rest by a few hundredths in scenarios 4, 5, 8, 9, and 10, and XGB did so in scenarios 2, 6, 11, and 12. Regarding computational processing times, the SVM + OSS classifier consumed 3.75 h for all the scenarios, considering only the undersampling times and the model fitting. XGB, on the other hand, required only 0.45 h for model fitting.
It is important to highlight that the XGB classifier did not require applying any technique for class imbalance. The models generated for botnet detection were trained on a computer with an AMD Ryzen 9 3950X 16-core processor, 64 GB of DDR4 RAM at 3600 MHz, and an NVMe M.2 SSD on a PCIe 4.0 interface. An interesting observation of this work was the behavior of the macro-averaged F1 and MCC metrics: in all 12 scenarios, these two metrics followed a correlated behavior. Figure 7 shows the variation of these measures for the SVM + OSS and XGB classifiers, which can be described by a Pearson correlation coefficient of r = 0.99 for both classifiers, verifying a strong relationship between the two metrics and showing that both measures give consistent results.
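The Pearson correlation used for this comparison can be sketched in plain Python. The two series below are made-up stand-ins for illustration, not the paper's measured per-scenario values:

```python
# Pure-Python Pearson correlation between two metric series (toy data).
from statistics import mean

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

f1_scores  = [0.99, 0.97, 0.95, 0.98, 0.96]  # hypothetical macro-F1 per scenario
mcc_values = [0.98, 0.94, 0.90, 0.95, 0.92]  # hypothetical MCC per scenario
print(round(pearson_r(f1_scores, mcc_values), 3))  # strongly correlated
```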
The literature consulted in this research shows that previous studies only compared the MCC measure against precision, recall, and F1-score, even though these last three measures consider only the subset of the dataset concerning the positive class. In contrast, we use macro-averages over both classes, and the results show that the performance of our classifiers was very similar in terms of the F1-score and MCC measures.

Conclusions
The main goal of this work was to propose a framework for generating network datasets from real traffic traces that can be used in NIDS research. The generation of personalized datasets opens up new opportunities in NIDS research by providing dynamic scenarios that allow the representation of the most recent threats. In this sense, we can use network traffic threats that have been made publicly available or captured in honeypot networks.
Our proposed framework offers realistic scenarios that were not available before. Traffic aggregation offers the advantage of using network traffic captures from our own networks as background traffic, combinable with malicious traffic of interest. Another advantage is that the generated network traces offer full information, because they have not undergone any transformation related to data anonymization for protecting the privacy of the information. On the other hand, the tools developed during this research provide significant help by facilitating the reliable and efficient extraction of session features. Similarly, the traffic-session labeling component supports the use of supervised ML algorithms that require labeled data.
Finally, the usefulness of the proposed dataset-generation framework was assessed using a generated unbalanced dataset. Five ML predictive models were used to detect botnet attacks. The performance of these algorithms was estimated using the macro-averaged F1-score and the MCC, identifying that these measures show similar classification performance with unbalanced data.
Future work related to the proposed framework focuses on three aspects:

1. Usability: develop a GUI for the end user of the tool.
2. Portability: make use of container technology to easily deploy the tool in different environments.
3. Performance: use a preprocessing stage to handle larger amounts of traffic, not limited by the RAM of the computer.

Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: The dataset UAN-12 generated by our platform is available at https://securitylab.uan.mx/dataset-uan12.htm (accessed on 30 December 2021). A virtual machine with a running version of the platform is also available at https://securitylab.uan.mx/dataset-uan12.htm (accessed on 30 December 2021). The source code of the platform is available at https://github.com/OliverITT/3FEx (accessed on 30 December 2021). Finally, the source code of the models for this paper is available at https://github.com/pvelardea/botnet-detection (accessed on 30 December 2021).
Acknowledgments: The authors thank the students Oliver G. Rodríguez and M. Karin Leyva for their technical support during this research.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: