Towards a Reliable Comparison and Evaluation of Network Intrusion Detection Systems Based on Machine Learning Approaches

Abstract: Presently, we live in a hyper-connected world where millions of heterogeneous devices continuously share information in different application contexts for wellness, improved communications, digital business, etc. However, the larger the number of devices and connections, the higher the risk of security threats in this scenario. To counteract malicious behaviours and preserve essential security services, Network Intrusion Detection Systems (NIDSs) are the most widely used line of defence in communication networks. Nevertheless, there is no standard methodology to evaluate and fairly compare NIDSs. Most proposals omit crucial steps of NIDS validation, which makes their comparison hard or even impossible. This work first includes a comprehensive study of recent NIDSs based on machine learning approaches, concluding that almost none of them carry out what the authors of this paper consider mandatory steps for a reliable comparison and evaluation of NIDSs. Second, a structured methodology is proposed and assessed on the UGR'16 dataset to test its suitability for addressing network attack detection problems. The recommended guideline and steps will definitely help the research community to fairly assess NIDSs, although reaching a definitive framework is not a trivial task and, therefore, some extra effort should still be made to further improve its understandability and usability.


Introduction
Presently, the implementation of digital technologies in all areas and business processes has driven a continuous growth in global connectivity, both in communications and in the number of interconnected devices. In this scenario, systems and communication networks are continually threatened by attackers; thus, monitoring and detecting security events play an essential role in preserving vital security services such as confidentiality, integrity, availability (CIA), and privacy. For instance, it is worth mentioning two recent attacks, Dyn (2016) [1,2] and VPNFilter (2018) [3], where thousands of Internet of Things (IoT) devices were compromised, causing not only a high economic impact but also, even worse, personal costs. In this sense, Intrusion Detection Systems (IDSs) are the most widely used line of defence in information and communication systems. Finally, the proposed framework is tested by comparing classical supervised approaches against semi-supervised and unsupervised techniques using the recently built UGR'16 dataset [14].
The rest of the paper is organized as follows. Section 2 studies recent works and methods involved in developing ML-based NIDSs. Section 3 introduces the recently built UGR'16 dataset as an alternative to the most used and widely accepted datasets. After that, Section 4 describes the proposed methodology and stages for ML-based NIDS evaluation and comparison. The experimentation and results for network attack classification are discussed in Section 5. Finally, the main conclusions and future work are presented in Section 6.

Related Work
Throughout this section, recent works addressing network attack detection using ML-based NIDS techniques are described in detail, especially the methodology and steps followed to address the problem of network attack classification.
All works studied here are summarized in Table 1, which shows the dataset involved and the proposed solution according to what we consider should be the standard steps in network attack detection using ML-based solutions. This way, the whole detection procedure should comprise, at least, the following steps: Feature Engineering (FE), Data Pre-processing (DP), Feature Selection (FS), Hyper-parameter Selection (HS), Machine Learning method (ML) and Performance Metrics (PM). In the table, the symbol '-' denotes that the corresponding step could not be found in the referred work, which does not necessarily mean that the authors did not use any technique or procedure to accomplish the related task. Finally, the specific steps used in this work to test the proposed methodology are highlighted in bold in the last row of Table 1.

Recent Works and Methods
The study published in [13] highlighted the deficiencies found in many previous works addressing network attack classification when they rely on an outdated dataset such as KDDCup'99 for performance evaluation. Works of this kind usually achieve high classification performance rates, although this is mainly caused by the flaws found in the KDDCup'99 dataset. This assumption is confirmed by comparing the accuracy results obtained in [15] through classical ML algorithms against the ones obtained by using the same ML methods on a recent and realistic dataset: NGIDS-DS [16]. Although the same ML algorithms were used for comparison, no specific details were given about the methodology followed by the authors.
The authors in [15] proposed an IDS architecture addressing new challenges in managing vast volumes of network traffic in real time. To evaluate the system, attack classification performance was tested by using well-known ML algorithms on the old-fashioned KDDCup'99 dataset. Additionally, an interesting study about the impact of specific feature selection techniques on the classification performance is provided. Finally, a proposed FS method combining manual exploratory feature analysis with two of the considered FS techniques, FSR (Forward Selection Ranking) and BER (Backward Selection Ranking), obtained the best results in terms of modelling and detection time. However, the authors did not mention anything about FE, DP or HS, making the reported results hard to reproduce and compare.
A realistic network dataset comprising up-to-date attack and normal traffic patterns from an emulated network environment is introduced by Sharafaldin et al. [17]. They corroborated the feasibility of the dataset for detecting network-based attacks with up to seven classical ML algorithms. To this end, they extracted several network traffic flow-based characteristics employing a specific tool. Then, they performed an FS procedure called Random Forest Regressor to weight feature relevance depending on the attack. However, neither HS nor DP methods were described. A deep learning approach for network anomaly detection is introduced by Li et al. [18]. In this work, the authors evaluated an LSTM (Long Short-Term Memory) neural network and BLS (Broad Learning System) deep learning-based techniques for anomaly detection. In this sense, two network datasets were used: NSL-KDD and a BGP (Border Gateway Protocol)-based dataset. A manual HS process was carried out to select the optimal neural network architecture. In addition, dataset normalization and a dummy-variable FE approach were performed as part of their methodology. Finally, they supported the BLS technique as a feasible alternative to sophisticated deep learning methods since it offered similar performance rates with less training time. Although this solution covered almost all the stages of our recommended procedure, an important step such as FS was not taken into account and should not have been omitted.
A novel greedy FS approach was proposed in [19] to improve the attack detection classification performance in communication networks. NSL-KDD and ICSX12 datasets were considered to evaluate their proposal together with several deep learning and ML classical approaches. The authors concluded the feasibility of the approach in terms of classification performance and memory consumption. However, no FE, DP and HS steps were mentioned nor proposed in this work.
A tool to enhance previously generated network datasets was introduced by Cordero et al. [20]: the I2DT tool. I2DT can add diverse types of attacks to existing datasets, thus providing a useful tool that, as the authors claim, greatly reduces the time spent in building an adequate dataset for NIDS evaluation. This way, I2DT was used to enhance the MAWI dataset, later used in an RNN (Replicator Neural Networks)-based approach for DoS and port scan attack detection [28]. RNN-based solutions provide an anomaly score such that values exceeding a certain threshold are marked as anomalies. As with other works, there was no mention of DP and FS techniques.
The authors in [12] improved MSNM (Multivariate Statistical Network Monitoring) [29] for anomaly detection. In this sense, a semi-supervised variant of MSNM relying on PLS (Partial Least Squares) optimization techniques was proposed to dynamically adjust feature weights according to the targeted attack. MSNM is based on statistical ML models (namely PCA). The FaaC (Feature as a Counter) approach, which transforms original variables into counters over a specific time interval, is validated here as a promising FE method (see Section 4 for a detailed explanation). Altogether, it improves the MSNM detection capabilities on the diversity of attacks provided in the UGR'16 dataset [14].
An Optimal Allocation Least Square SVM (OA-LS-SVM) method was proposed in [21] for incremental datasets. When working with incremental datasets, the size of the dataset may eventually be too large to train the models in a reasonable time. To handle this problem, the proposed method first used the Optimal Allocation technique to select representative samples from the dataset, and then applied an LS-SVM model to learn the classification problem using only the selected samples. This technique was tested on anomaly detection over the KDDCup'99 dataset [6]. The authors compared two versions of the method: one using all features available in the dataset, and one using just a subset of them obtained through PCA [30]. Omitting critical stages, such as FE, DP and HS, is not recommended for a fair NIDS comparison.
In [22], the authors proposed a hybrid approach for anomaly-based IDS consisting of two different evolutionary algorithms, namely Artificial Bee Colony (ABC) and Artificial Fish Swarm (AFS). The hybrid algorithm was used to create rules for anomaly detection, based on a reduced subset of features generated with a preliminary correlation method. In this sense, bio-inspired alternatives may outperform classical and deep learning-based techniques for network attack detection, but the lack of details concerning the methodology or steps used makes comparisons harder to perform.
Divekar et al. [23] pointed out some flaws in KDDCup'99 [6] for the evaluation of modern NIDSs and studied the performance of several classical ML models on two alternatives to that benchmark, as an attempt to analyze their suitability for modern NIDSs. The Mean Decrease Impurity technique was used to discard irrelevant features (the string values of some features were mapped onto integer numbers). Besides, SMOTE was used for oversampling, followed by a random undersampling method, in order to generate a dataset of reasonable size with a balanced number of samples of every class. The ML models' hyper-parameters were configured using randomized grid search and 5-fold cross-validation. Once more, no mention of FE methods was provided.
In [24], the authors proposed the application of several classical ML models for NIDS on the UNSW-NB15 dataset, considering all available features. The performance of the considered models was evaluated using Apache Spark. Naïve Bayes was the fastest method to train, while Random Forest provided the most accurate predictions. The authors provided some numbers on the obtained run times, although they did not provide any details about the computation platform used or most of the recommended steps (FE, FS, etc.).
The authors in [25] evaluated an extensive set of classical ML-based algorithms for anomaly detection in networks. To this end, well-known network datasets were used: KDDCup'99 and NSL-KDD. Besides, the influence of adding noise to the data in NSL-KDD was also tested. On the one hand, almost all algorithms performed better with more realistic datasets, i.e., with NSL-KDD and its noisy versions. On the other hand, every algorithm had its own relevant set of features leading to better results; for that reason, a Performance-based Method of Ranking (PMR) FS algorithm was also proposed. As with previous works, FE, DP and HS were not considered nor mentioned.
The CTU-13 dataset for botnet anomaly detection was created and introduced by Garcia et al. [26]. The authors compared three NIDS methods in their work: Cooperative Adaptive Mechanism for Network Protection (CAMNEP), BClus and BotHunter. Besides, a methodology was proposed for comparing the previous methods and NIDS solutions in general, which are typically fed with NetFlow-based data sources. Such a methodology simply compares the predictions obtained by the methods under evaluation for each observation in terms of standard, well-known metrics such as F1, precision, accuracy, etc., plus one proposed by the authors: the error metric. However, a common way to perform crucial tasks in network attack detection, such as FE, DP, or FS, is again not proposed.
In [27], the authors consider all the steps we claim as mandatory for network attack detection and remark the importance of feature extraction and feature selection for DoS, DDoS and network attack classification in general. First, they chose and extracted a vast number of network traffic features from previous works in the field. Second, they proposed a new feature selection method that mixes Random Forest and Pearson's correlation coefficient, called the Random Forest and Pearson wrap (RFPW) algorithm. Their approach was tested on the MAWI dataset with a Gradient Boosted Decision Tree (GBDT) algorithm and some other classical ML algorithms, showing improvements not only in the performance of the algorithms but also in the time required for the training phase.
In summary, almost all studied proposals avoid, elude or do not mention critical steps needed to evaluate ML-based algorithms for network attack detection. Therefore, the authors of this paper claim that it is hard and, sometimes, impossible to reproduce the reported performance. Additionally, carrying out a fair comparison between different NIDS solutions with only the performance results available is questionable. It is also worth noting that among the most used datasets are the obsolete KDDCup'99 and its derived and improved version NSL-KDD, which could be a handicap when evaluating NIDSs. In conclusion, a standard procedure or methodology comprising all necessary steps, from the considered raw data to the outcome, should be proposed, allowing honest and equitable comparisons.

Network Datasets for NIDSs Evaluation
The network security research community is aware of the lack of suitable network datasets for training and evaluating NIDSs. Because of that, some efforts have been made to generate them. Among other characteristics, they differ in data format (mainly flow- and packet-based), recording environment (synthetic, emulated or real), duration and freshness (up-to-date network traffic) [14,31].
As the authors claim in [31], there is no perfect dataset. Two reasons support such an affirmation: (a) the ever-increasing number and variety of attacks [1,3], which makes it very difficult to keep datasets up to date; and (b) specific applications will probably need specific datasets, making it unnecessary to consider all the previous characteristics at once.
Even if no ideal network dataset exists, the authors are confident that some of them are more appropriate than others, according to the previous premises. For example, the use of the famous and obsolete KDDCup'99 dataset [6] can be expected to lead a NIDS to erroneous results. In this sense, it is recommended to consider updated and more realistic datasets, such as the recently released UGR'16 [14].
The rest of the section briefly describes the UGR'16 dataset, which will be later evaluated in Section 5 through the proposed methodology for NIDS evaluation and comparison (see Section 4).

UGR'16 Dataset
The UGR'16 dataset [14] consists of 4 months of anonymized network traffic flows (NetFlow) gathered from the facilities of a Spanish Internet Service Provider (ISP). The authors of UGR'16 split it into two parts: CAL and TEST. The first one, obtained from March to June 2016, contains incoming and outgoing ISP network traffic. The second one, obtained from July to August 2016, mostly contains synthetically generated attack flows added to the normal ISP network traffic. These attacks were launched in a controlled environment within the ISP facilities in order to guarantee correct ISP operation. Most of them were synthetically generated with up-to-date tools, covering harmful and difficult-to-detect attacks such as low- and high-rate DoS, port scanning or botnet traffic. Apart from the previous ones, the authors labelled real attacks after an in-depth inspection of the traffic flows where several state-of-the-art anomaly detectors suggested anomalous behaviours; UDP port scanning, SSH scanning and spam campaigns were identified this way. Last but not least, some flows were labelled as Blacklist if their IP address was found in well-known open-source blacklists.
The dataset is publicly available as a total of 23 files, one per week: 17 for CAL and 6 for TEST. Each file is 14 GB (compressed), and all of them are available in nfcapd and CSV format.
A preliminary descriptive analysis of the class distribution in the TEST part of the dataset is shown numerically in Table 2. As expected in the context of the problem addressed here, the number of traffic flows corresponding to attacks and to regular background traffic (Background) differs substantially. The attack with the highest number of flows (Spam) accounts for only 1.96% of the total flows in TEST. Besides, the TEST part contains all of the mentioned attacks, i.e., both the synthetically generated ones and those discovered from anomalous behaviours. Because of that, the authors decided to use this part of the whole UGR'16 dataset to perform supervised classification through several ML-based techniques in this work.

Methodology
Throughout this section, the recommended methodology followed in this work to evaluate ML-based NIDSs and detect different kinds of network attacks is described; it is sketched in Figure 1. As can be seen there, the available raw data, with a total of N observations and P features, is first processed using an FE technique called the Feature as a Counter (FaaC) method, which will be introduced in Section 4.1. FaaC generates a new Derived Dataset (DD) with a reduced number of observations N′ < N and an extended feature space P′ ≥ P whose variables are counters of the original ones (C_{n,p}). Afterwards, the dataset is used to train and test the considered ML models. For the training phase, we perform k-fold cross-validation by splitting the data into k disjoint sets of similar size (k − 1 of them used for training and the remaining one for testing). It is important to note that each fold should be generated in such a way that the class distribution observed in the whole dataset is maintained within each fold. Training ML models is a significant phase, no matter the problem tackled. A reliable methodology that automates this process can reduce the related issues (e.g., underfitting or overfitting) and, consequently, increase the trust in the calibrated system. As observed in Figure 1, the training phase (upper thicker black line in the figure) covers almost all the proposed stages of the methodology.

Raw Dataset
After splitting the dataset into two main parts, train and test sets, the FS mechanism (see Section 4.2) is applied in order to select the most meaningful features, P* ≤ P, out of all available ones. Although the FS algorithm only considers the train set, thus not being polluted by the test set, the non-selected features are removed from both train and test sets. Afterwards, the data are pre-processed (see Section 4.3) to adjust the features' value ranges. Once these two steps are accomplished, the HS method is executed to find the best configuration for the considered ML method (see Section 4.4). As with the FS stage, HS does not consider the test set.
Then, the test set is used to evaluate the performance of the considered ML-based NIDS model (see Section 4.5). Depending on the problem and the underlying dataset, choosing one performance metric or another could lead the system to hide its real behaviour, as explained in Section 4.6. In that section, the authors recommend the metrics they consider most appropriate for NIDS evaluation, especially in scenarios with unbalanced classes, an inherent characteristic of network traffic data for attack classification.
Finally, as is common in data science problems regardless of the field of application, the procedure should be repeated a certain number of times to obtain a statistically sound representation and confidence in the results. For this purpose, an iterative process of r repetitions of k folds each is performed, for a total of r × k executions.
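The r repetitions of stratified k-fold validation described above can be sketched with scikit-learn's RepeatedStratifiedKFold; the toy data here merely stands in for the derived dataset, and the small r and k are illustrative only:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy stand-in for the derived dataset (N observations, P counter features).
rng = np.random.default_rng(0)
X = rng.random((200, 8))
y = rng.integers(0, 2, size=200)

# r repetitions of stratified k-fold: class proportions are preserved in
# every fold, and the whole procedure yields r * k train/test splits.
r, k = 3, 5
rskf = RepeatedStratifiedKFold(n_splits=k, n_repeats=r, random_state=42)

n_splits = 0
for train_idx, test_idx in rskf.split(X, y):
    n_splits += 1
    # FS, DP, HS and model fitting would run here, using only train_idx data.

# n_splits == r * k executions of the methodology in this toy setting.
```

Stratification matters here because, as noted above, attack classes are heavily unbalanced, so an unstratified split could leave a fold with almost no attack samples.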

Feature Engineering (FE)
Information from network traffic usually comes in the shape of huge (binary) files containing highly heterogeneous information. This issue makes it impossible to directly apply ML techniques to identify the different kinds of attacks. However, it is possible to overcome this issue by processing the data in order to build a more suitable input for automatic classifiers.
In this sense, the application of a feature engineering technique is proposed to build a well-structured dataset suitable for ML-based NIDSs. Thus, Feature as a Counter (FaaC) [32] is used as a functional solution to the problem of learning from large heterogeneous datasets. It consists of merging and transforming different data sources (structured or not) and their variables into new variables, which are counters of the original ones over a given time interval. For instance, it could be interesting to count the number of accesses to port 22 in a given time lapse, since a high number might indicate a brute-force SSH attack. The FaaC approach has been successfully tested in unsupervised ML-based NIDS approaches [12,29,33].
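The counting idea can be illustrated with a minimal stdlib-only sketch; the record fields and window length below are illustrative assumptions, not the actual UGR'16/FaaC schema:

```python
from collections import Counter, defaultdict
from datetime import datetime

# Toy flow records (timestamp, destination port); values are illustrative.
flows = [
    ("2016-07-27 13:43:21", 22),
    ("2016-07-27 13:43:50", 22),
    ("2016-07-27 13:43:59", 80),
    ("2016-07-27 13:44:10", 22),
]

def faac_counters(records, interval_s=60):
    """Aggregate raw flows into one observation per time window, where each
    feature counts occurrences of an original value (the FaaC idea)."""
    windows = defaultdict(Counter)
    for ts, dport in records:
        epoch = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").timestamp()
        bucket = int(epoch) // interval_s  # window index
        windows[bucket][f"dport_{dport}"] += 1
    return windows

obs = faac_counters(flows)
# Two 60 s observations: the first window counts two flows to port 22
# (candidate SSH activity) and one to port 80.
```

Each window becomes one row of the derived dataset, which is why FaaC reduces the number of observations while expanding the feature space to one counter per observed value.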

Feature Selection (FS)
As the application of the FaaC method considerably increases the number of variables per sample compared to the original dataset, a feature selection step is required to keep the most meaningful variables and remove irrelevant ones. In this work, a Least Absolute Shrinkage and Selection Operator (LASSO) [35] model was used to perform feature selection. LASSO is a simple linear model typically used in problems where the number of input variables P is high. Let us suppose a dataset D = {(x_i, y_i)}_{i=1}^{N} consisting of N samples, where x_i ∈ R^P is a P-dimensional vector of input variables and y_i is the response variable for the i-th sample. In a binary classification problem, where the response variable takes two values (e.g., 0/1, or negative/positive), LASSO adds an ℓ1-penalty term to the minimization problem solved in logistic regression to find the optimal β coefficients, as shown in Equation (1), where f(βX^T) is the logistic or sigmoid function and λ is the regularization rate, which controls how many input features are selected. In this case, the problem addressed is a multi-class classification problem with K = 5 labels, i.e., y_i ∈ {0, 1, ..., K − 1}. In this sense, K LASSO models are trained in parallel as described before, but with an extra grouped-lasso penalty added on all K coefficients of a particular input feature, which makes them all zero or nonzero together, i.e., the retained variables are those which are significant for predicting all K labels. The final response of this multinomial LASSO is obtained through the softmax function depicted in Equation (2):
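A minimal sketch of this selection step with scikit-learn follows. Note one assumption: plain ℓ1-penalised multinomial logistic regression is used as a stand-in for the grouped-lasso penalty described above (which ties the K coefficients of a feature together and is not available out of the box in scikit-learn); the toy data is illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy multi-class data: 5 labels, 20 candidate features, few informative ones.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

# l1-penalised multinomial logistic regression; C = 1/lambda controls how
# aggressively coefficients are driven to zero (smaller C -> fewer features).
lasso = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
lasso.fit(X, y)

# Keep features whose coefficient is nonzero for at least one class; the
# grouped penalty of the paper would instead zero all K coefficients jointly.
selected = np.flatnonzero(np.abs(lasso.coef_).max(axis=0) > 0)
```

In the methodology, this fit would use only the train folds, and the discarded columns would then be dropped from both train and test sets.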

Data Pre-processing (DP)
Machine learning models usually require normalized input features in order to account for differences among variable distributions. In this work, a standard normalization procedure is applied to each numeric input variable, as depicted in Equation (3): z_j = (x_j − μ_j) / σ_j, in such a way that normalized features have zero mean and unit variance, where x_j is the raw input variable, μ_j and σ_j are the mean and standard deviation of that variable, and z_j is the standardized feature.
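Equation (3) corresponds to scikit-learn's StandardScaler. A small sketch on toy data, stressing the point (implicit in Figure 1) that μ_j and σ_j must be estimated on the train set only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_train = rng.normal(loc=10.0, scale=3.0, size=(100, 4))
X_test = rng.normal(loc=10.0, scale=3.0, size=(40, 4))

# Fit mu_j and sigma_j on the train set only, then apply the same transform
# to the test set so the evaluation is not polluted by test statistics.
scaler = StandardScaler().fit(X_train)
Z_train = scaler.transform(X_train)
Z_test = scaler.transform(X_test)
# Z_train columns now have zero mean and unit variance; Z_test columns are
# close to, but not exactly, standardized, which is the expected behaviour.
```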

Hyper-parameters Selection (HS)
Machine Learning (ML) models need their hyper-parameters to be tuned in order to perform accurately on the addressed problem. Two well-known strategies have traditionally been used for this purpose: Grid Search [23,36] and Random Search [37]. The former explores several combinations of certain values provided for each of the hyper-parameters to be tuned, while the latter simply tries a given number of random settings drawn from a specified range of values for each hyper-parameter. On the one hand, Grid Search has the disadvantage of only exploring a subset of the entire search space, limited by the values provided for the hyper-parameters. Additionally, this strategy is very costly, in terms of computational time, for ML models with more than 2-3 hyper-parameters, since the search space grows exponentially. On the other hand, a Random Search strategy may be faster, and independent of the number of hyper-parameters to be tuned, but its main drawback is that some irrelevant settings may be tested (e.g., two hyper-parameters may be linked in such a way that either both have high or both have low values, but not high/low or low/high at the same time).
Even though a Grid Search strategy was applied in [38] with acceptable results, a more recent and powerful hyper-parameter selection strategy, which has provided excellent results and overcomes the limitations mentioned above, has been used in this study: Bayesian optimization [39]. In this strategy, the hyper-parameter search is modelled by an underlying Gaussian process in such a way that, at each step of the iterative search, the next hyper-parameter setting to be tested is the one with the most uncertainty (highest variance) under this Gaussian process. Therefore, this strategy ensures that a pseudo-optimal hyper-parameter setting will be reached in a small number of iterations. Bayesian optimization is performed within the train set to obtain the optimal hyper-parameter setting; an ML model is then fitted to the train set using this setting, and finally the performance of the fitted model is evaluated on unseen test data (see Figure 1).
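The loop described above can be sketched in a few lines of scikit-learn. This is a one-dimensional toy with a pure-uncertainty acquisition (pick the candidate with the highest predictive variance), exactly as the text describes; the objective function, its optimum and the candidate grid are all illustrative assumptions, and production Bayesian optimization (e.g., scikit-optimize, used in Section 5) adds more refined acquisition functions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Toy stand-in for the cross-validated score of a model as a function of a
# single hyper-parameter h (shape and peak location are illustrative only).
def cv_score(h):
    return -(h - 0.3) ** 2  # unknown to the search; best setting at h = 0.3

candidates = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
tried = [[0.0], [1.0]]                      # two initial settings
scores = [cv_score(h[0]) for h in tried]

for _ in range(10):
    # Model the score surface with a Gaussian process...
    gp = GaussianProcessRegressor(random_state=0)
    gp.fit(np.array(tried), np.array(scores))
    _, std = gp.predict(candidates, return_std=True)
    # ...and test next the setting the process is most uncertain about.
    nxt = float(candidates[int(np.argmax(std))][0])
    tried.append([nxt])
    scores.append(cv_score(nxt))

best = tried[int(np.argmax(scores))][0]  # pseudo-optimal setting, 12 trials
```

Because already-tried settings have near-zero predictive variance, the search naturally spreads over the range and homes in on good settings with few evaluations.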

Machine Learning (ML) Models
Several well-known ML models have been used in this work to detect security attacks in network communications. Next, a brief description of each ML model used is provided:
• Multinomial Logistic Regression (LR). It is the simplest linear model [40] and has been widely applied to many diverse tasks. For a binary classification problem, LR models the dependent variable y_i ∈ {0, 1} as a linear combination of the independent variables x_i, as shown in Equation (4), where P is the number of features describing one sample in the dataset and f(βX^T) is the logistic or sigmoid function. The parameter vector β is learned by solving the minimization problem depicted in Equation (5). Since the problem addressed comprises several (> 2) labels or classes, a One-vs-All approach was used, in which K logistic regression models are trained (K being the number of classes), each one solving the corresponding binary classification problem; the overall prediction of the LR model is computed using the softmax function depicted in Equation (2).
• Support Vector Classifier (SVC). It is a kernel-based method that uses a kernel function (radial basis, linear, polynomial, or any other) to map the original input space into a new space where predictions can be made more accurately. In this sense, the performance of an SVC is determined by the type of kernel function. In this work, the Linear (SVC-L) and Radial Basis Function (SVC-RBF) kernels were tested; they are depicted in Equations (6) and (7), respectively.
• Random Forest (RF). It is a tree-based non-linear bagging model in which multiple decision trees are fitted to different views of the observed data [41]. Each decision tree is fitted to a randomly sampled subset of the N samples. Moreover, a random subset of the P input features is considered at each node of a tree to determine which feature is used to expand it further. Overall RF predictions are computed as the average (or weighted average, according to the performance of each single decision tree on the out-of-bag samples) of the individual predictions provided by the multiple decision trees.
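The four model families above can be instantiated directly in scikit-learn. A minimal sketch on toy multi-class data follows; the default hyper-parameters used here are illustrative placeholders for the settings actually found through Bayesian optimization:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Toy multi-class dataset standing in for the derived UGR'16 data.
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# The models from the text: One-vs-All logistic regression, SVC with linear
# and RBF kernels, and a Random Forest.
models = {
    "LR": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "SVC-L": SVC(kernel="linear"),
    "SVC-RBF": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(random_state=0),
}

# Mean 5-fold cross-validated score per model.
cv_scores = {name: cross_val_score(m, X, y, cv=5).mean()
             for name, m in models.items()}
```

In the full methodology, each fit would be preceded by the FS, DP and HS stages, applied inside every fold as described in Section 4.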

Performance Metrics (PM)
The choice of performance metric used to evaluate and compare NIDSs can lead to misinterpretation, especially by practitioners. For instance, the accuracy metric is not recommended with unbalanced data, which is, in fact, a common characteristic of network traffic data for attack classification. Because of that, the authors recommend using Recall (R), Precision (P), F1 or the Area Under the Curve (AUC) to evaluate ML-based NIDSs when dealing with this kind of data. Next, a brief description is provided for each of the performance metrics considered.

• Recall (R). Also known as sensitivity or TPR (True Positive Rate), it represents the ability of the classifier to detect all the positive cases, as depicted in Equation (8): R = TP / (TP + FN), where TP (True Positive) is the number of times a positive sample is correctly classified and FN (False Negative) counts how many times a positive sample is misclassified as negative.
• Precision (P). It evaluates the ability of the classifier to avoid labelling negative samples as positive and is defined in Equation (9): P = TP / (TP + FP), where FP (False Positive) counts how many times negative samples are classified as positive.
• F1 score. It is the harmonic mean of the previous two values, as depicted in Equation (10): F1 = 2 · P · R / (P + R). A high F1 score (close to 1) means an excellent performance of the classifier.
• AUC. The AUC is a quantitative measurement of the area under the Receiver Operating Characteristic (ROC) curve, which is widely used as a performance metric in NIDSs in particular and IDSs in general [10,42]. The ROC curve compares the evolution of the TP rate versus the FP rate for different values of the classification threshold. Consequently, the AUC is a performance indicator such that a classifier with AUC = 1 behaves perfectly, i.e., it correctly classifies all the observations, while a random classifier gets an AUC value around 0.5.

• Weighted average. For each class i = 1, ..., C, weighted_avg(PM) computes the average of each metric PM_i previously introduced, weighted by the corresponding support q_i (the number of true observations of each class), with Q being the total number of observations. These weighted metrics are defined as shown in Equation (11): weighted_avg(PM) = (1/Q) · Σ_{i=1}^{C} q_i · PM_i.
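The per-class metrics and their support-weighted average are available directly in scikit-learn; the short sketch below, on an illustrative unbalanced toy labelling, also checks that Equation (11) matches the library's built-in weighted average:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy unbalanced multi-class ground truth and predictions.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 2])

# Per-class precision, recall and F1 (Equations (8)-(10)), plus supports q_i.
p, r, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2])

# Weighted average (Equation (11)): per-class metric weighted by its support
# q_i and divided by the total number of observations Q.
Q = support.sum()
weighted_f1 = (f1 * support).sum() / Q

# scikit-learn computes the same quantity with average="weighted".
_, _, f1_w, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")
```

(Note that AUC, unlike the metrics above, requires class scores or probabilities rather than hard labels, e.g., via `roc_auc_score`.)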

Experimentation: UGR'16 as a Case Study
To validate the suitability of the proposed methodology to evaluate and compare ML-based NIDS approaches, the experimental environment and the obtained results are described and discussed in the following.

Experimental Environment
The experimental environment was devised to run the proposed methodology. All the methods and functions were developed in the Python language using the well-known scikit-learn framework for data science. Additionally, scikit-optimize, a set of tools also developed in Python, was used to perform hyper-parameter selection through Bayesian optimization. All of them are included in the ff4ml project (Free Framework for Machine Learning, see Supplementary Materials), which implements the proposed methodology [43].
The UGR'16 dataset, previously introduced in Section 3.1, has been selected to test ML algorithms for classifying network attacks. Such algorithms must initially learn from a training dataset, and their accuracy is later tested on unseen data. For this reason, the dataset is transformed into a derived form of the original one (DD) and split into two different parts: 80% of DD is used for training, and the remaining 20% for testing. Different disjoint train and test sets are selected by performing a k = 5-fold cross-validation process, which is repeated r = 20 times. As a result, a total of r × k = 100 executions of the methodology steps have been carried out in order to obtain statistically significant results.
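The repeated cross-validation scheme (k = 5 folds, r = 20 repetitions, yielding 100 runs) can be sketched as follows; the toy data merely stands in for the derived dataset DD:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy data and labels standing in for the derived dataset DD
X = np.arange(50).reshape(-1, 1)
y = np.array([0, 1] * 25)

# k = 5 folds (80%/20% train/test per split) repeated r = 20 times
rkf = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
splits = list(rkf.split(X, y))
print(len(splits))  # → 100 executions of the methodology steps
```

Stratified folds preserve the class proportions in every split, which matters with the unbalanced classes discussed below.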
Although DD includes up to seven different classes of malicious traffic, the authors decided to use only Botnet, DoS, Scan and Spam. Blacklist was removed due to its appearance in all the observations, showing no differences from regular traffic observations at all. SSH scan and UDP scan were also removed because of the low number of samples representing these attacks, which would be hard for ML models to learn and might reduce the efficacy of the overall NIDS. Since the problem addressed is a multi-label classification task, every ML model has been calibrated and tested following a One-Vs-All approach, where a model per class is built in such a way that each specific model identifies its assigned class and considers the rest of the samples as negative instances.
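The One-Vs-All scheme can be sketched with scikit-learn's OneVsRestClassifier; the four-class toy data below merely stands in for the Botnet, DoS, Scan and Spam classes, and the base estimator is an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy 4-class problem standing in for Botnet/DoS/Scan/Spam
X, y = make_classification(n_samples=300, n_features=8, n_informative=6,
                           n_classes=4, random_state=0)

# One binary model per class: each treats its assigned class as the
# positive one and all remaining samples as negative instances
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ova.fit(X, y)
print(len(ova.estimators_))  # → 4, one fitted model per class
```

Any estimator with the scikit-learn interface (e.g., RF or SVC as used in the paper) can replace the logistic regression here.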
All the experiments were executed on the supercomputing centre facilities at the University of Cádiz, a cluster composed of 48 nodes, each containing two Intel Xeon E5 2670 2.6 GHz octa-core processors and 128 GB of RAM.

Results and Discussion
In this subsection, the obtained results are introduced both numerically, in Table 5, and graphically, in Figure 3. To gain statistical confidence in our conclusions, tests were performed to assess the statistical significance of the results achieved by the ML models tested. After analyzing the recommendations given in [44], the authors decided to apply pairwise Wilcoxon signed-rank test comparisons among all algorithms for all attack classes and accuracy metrics. The obtained p-values were corrected with the Holm post-hoc method for multiple comparisons, and a 95% confidence level was considered. The differences in the presented accuracy of the models are statistically significant in all cases, with only a few exceptions. We also applied the Friedman test to corroborate with 95% statistical confidence that there are significant differences in the accuracy of the models for all considered attacks and metrics.

In general terms, all models offer good classification accuracy for those attacks synthetically generated (DoS, Botnet, Scan). However, a lower performance rate is observed when classifying real attacks (Spam) and real network traffic (Background). This behaviour was expected, since real-network and real-attack traffic patterns are hard to model; existing or intentionally developed tools are not able to fully mimic such real behaviours. Specifically, RF provides the best results for almost all classes, followed by SVC-RBF. Note that RF also outperforms the rest of the ML models tested when considering the weighted average of each performance metric. In this sense, RF is also a robust model for NIDSs when dealing with unbalanced classification, where attack samples are considerably fewer than those found in the majority class, which might be a consequence of its internal ensemble classification procedure.
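The Holm step-down correction applied to the pairwise Wilcoxon p-values can be implemented in a few lines; the p-values below are hypothetical and only illustrate the procedure:

```python
def holm_correction(p_values):
    """Holm step-down correction for m pairwise comparisons.

    Returns the adjusted p-values in the original order.
    """
    m = len(p_values)
    # Sort p-values ascending, remembering their original positions
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k),
        # enforce monotonicity, and cap at 1
        adj = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, adj)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values from pairwise Wilcoxon tests
raw = [0.01, 0.04, 0.03, 0.20]
print(holm_correction(raw))
```

An adjusted p-value below 0.05 then corresponds to a statistically significant difference at the 95% confidence level used in the paper.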
The high performance achieved for Botnet attacks is also remarkable. Note that this attack is inherently devised to hide its network traffic. Despite its synthetic nature, the authors can conclude that supervised ML-based NIDS approaches are suitable to fight against this kind of attack.
Regarding the Background class, all algorithms behave similarly: they offer less accurate results than for the other classes. This behaviour is unexpected, and the authors envision that it might be due to the reduction in the number of observations with respect to the original ones as a consequence of the FaaC approach. Consequently, it is highly likely to find more than one class per observation. Moreover, the higher the number of original class samples, the higher the probability of finding them in every new observation. In particular, Table 5 and Figure 3 show how the accuracy of the algorithms degrades for those classes with a high number of samples (e.g., Spam). As a conclusion, our approach can be applied to ML-based NIDSs dealing with unbalanced data, which is, on the other hand, a typical problem found in network attack detection.

Comparison of the Proposed ML-based NIDS with a Previously Published One
The ML-based NIDS proposed in this paper is compared to one of the state-of-the-art NIDS proposals previously introduced in Table 1. In particular, the authors considered the approach proposed by Camacho et al. [12], which used the same UGR'16 dataset, as well as the FaaC approach analyzed in this work. In [12], the anomaly detection problem in network communications is addressed by applying multivariate statistical techniques, in a method called Multivariate Statistical Network Monitoring (MSNM). Their approach considered all the steps suggested in our methodology, but it differs from ours in the methods used for the FS, HS and ML stages.
The comparison results are numerically and graphically shown in Table 6 and Figure 4, respectively. The work published by Camacho et al. introduced unsupervised (MSNM-MC and MSNM-AS) and semi-supervised (MSNM-R2R-PLS) solutions based on the MSNM approach. These approaches were also compared to supervised ML models, namely SVC with RBF (MSNM-SVC-RBF) and linear (MSNM-SVC-L) kernels. Finally, they evaluated and compared all these methods in terms of the AUC metric. According to the results published in [12], only the performance of the ML methods for the DoS, Botnet and Scan attacks can be compared.

Table 6. Comparison of the classification performance between MSNM approaches in a state-of-the-art publication and the ML-based ones proposed in this work.

Regarding the numerical performance results, the authors can conclude that the Botnet attack is hard to detect by unsupervised approaches due to its inherent characteristic of hiding its own traffic. However, they perform well for the DoS and Scan attacks, probably because the behaviour of these attacks highly differs from regular traffic due to their synthetic nature. Semi-supervised approaches outperform the unsupervised ones in all cases, but they never perform better than the supervised ones.
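The AUC values used for such a comparison can be obtained with scikit-learn; the labels and scores below are illustrative only, not results from either study:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative binary labels (1 = attack) and classifier scores
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.3])

# Every attack here scores above every non-attack, so the ranking is
# perfect and the AUC is 1.0; a random classifier would score ~0.5
auc = roc_auc_score(y_true, scores)
print(auc)  # → 1.0
```

Since the AUC is threshold-independent, it allows supervised, semi-supervised and unsupervised detectors to be compared on equal footing.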

Focusing on the performance of supervised ML-based solutions, and especially on the SVC-[RBF,L] and MSNM-SVC-[RBF,L] models, one can see remarkable differences between them. However, one would expect SVC models using the same type of kernel to perform similarly, probably with minor differences due to the random nature of the experimentation followed by the authors. Although the same dataset and FE approach were considered in both studies, it is not clear which techniques were used in [12] for the rest of the steps proposed in the methodology. This issue demonstrates that the accuracy of the models highly depends on the methodology followed, and all the critical steps must be described in detail to allow reproducibility and fair comparison of the models. Therefore, it is hard to compare ML-based NIDSs, and even more so considering the heterogeneity of all the involved techniques, their dependence on the nature of the classification method (unsupervised, semi-supervised or supervised) and the final objective of the NIDS solution. However, new frameworks or methodologies, such as the one proposed in this paper, may help the research community to provide novel and relevant approaches for a fair comparison and evaluation of ML-based NIDSs among state-of-the-art solutions.

Conclusions and Future Work
This work introduced a comprehensive study of current NIDS proposals, pointing out the lack of standard ways to perform a reliable ML-based NIDS assessment. Most state-of-the-art approaches omit, or do not appropriately describe, necessary steps in the methodology used to evaluate their proposals, making it hard to perform a fair comparison and evaluation of ML-based NIDSs and to be confident about the results published by different authors addressing similar types of problems. To overcome this issue, a methodology comprising what the authors consider mandatory steps in NIDS evaluation has been introduced, covering a wide range of aspects, from the raw network dataset to the recommended performance metrics. The suitability of the framework has been tested with classical ML algorithms and a real and updated network dataset. Moreover, a comparison to state-of-the-art ML-based NIDS approaches has also been performed.
The results showed that the proposed methodology can help the research community to build better NIDS solutions and to allow honest comparisons. However, defining a definitive evaluation guide or framework is not a trivial task, as there could be more than one valid methodology per application field or objective, e.g., to fight against one kind of attack in specific network environments. Besides, not all the steps proposed in the methodology of this paper may always be necessary; indeed, some of them can be counter-productive depending on the application context. Furthermore, the impact of the network dataset on NIDS performance is not clear, although real and updated datasets may improve the performance and feasibility of the solutions so that, later on, they can be deployed in real production environments.
In summary, extra work should be done in order to propose common and standard frameworks to compare and evaluate ML-based NIDS solutions that are entirely accepted by the research community. In this sense, future work will focus on determining the impact of the network dataset and the considered methodology steps on the ML-based or DL-based NIDSs performance, with particular attention on the feature engineering procedures allowing NIDS generalization and, consequently, its real deployment in production environments.
Supplementary Materials: The source code of the proposed framework is available online at https://github.com/ucadatalab/ff4ml. We encourage readers to contribute to the project by adding new functionalities and improvements.

Acknowledgments:
The authors acknowledge support given by the supercomputing center at the University of Cádiz as well as José Camacho (University of Granada) whose support on anomaly detection methods based on the use of multivariate statistical techniques was greatly appreciated.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations