You are currently viewing a new version of our website. To view the old version click .
Information
  • Article
  • Open Access

30 April 2019

Improving Intrusion Detection Model Prediction by Threshold Adaptation

and
1
Centre of Information Systems, Sultan Qaboos University, Al-Khoud, P.O. Box 40, P.C. 123, Sultanate of Oman
2
School of Computer Science, University of St. Andrews, St. Andrews KY16 9AJ, UK
*
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Machine Learning for Cyber-Security

Abstract

Network traffic exhibits a high level of variability over short periods of time. This variability impacts negatively on the accuracy of anomaly-based network intrusion detection systems (IDS) that are built using predictive models in a batch learning setup. This work investigates how adapting the discriminating threshold of model predictions, specifically to the evaluated traffic, improves the detection rates of these intrusion detection models. Specifically, this research studied the adaptability features of three well known machine learning algorithms: C5.0, Random Forest and Support Vector Machine. Each algorithm’s ability to adapt their prediction thresholds was assessed and analysed under different scenarios that simulated real world settings using the prospective sampling approach. Multiple IDS datasets were used for the analysis, including a newly generated dataset (STA2018). This research demonstrated empirically the importance of threshold adaptation in improving the accuracy of detection models when training and evaluation traffic have different statistical properties. Tests were undertaken to analyse the effects of feature selection and data balancing on model accuracy when different significant features in traffic were used. The effects of threshold adaptation on improving accuracy were statistically analysed. Of the three compared algorithms, Random Forest was the most adaptable and had the highest detection rates.

1. Introduction

In the current digital age, numerous research papers and applications have been written and have developed proposed solutions to combat network based threats and to protect information systems. As a result, various security systems have emerged, which aim to ensure that the key goals of cybersecurity are met [1]. However, every day these stated security goals are flagrantly violated by breaches and security incidents, which raises questions about the capability of existing security systems.
Intrusion detection systems (IDS) are one of the many tools used in the cyber security field. Their main purpose is to detect security attacks targeting the critical networks, systems or data that they monitor, and to report any violation by an external intruder or system insider.
With the rapid advancement in technology many new challenges and threats are evolving. As most of these technologies share the same communication networks, many challenges have emerged; extensive data, traffic diversity and encryption. Such challenges made the identification of threats to develop the right protective measure a very difficult task.
There are many areas being explored to address some of the many cyber security requirements; artificial intelligence (AI), machine learning (ML) and data mining (DM) methods are some of the current key research topics in this field, particularly in the area of anomaly-based intrusion detection (ID). These methods aim to overcome the limitation of human capabilities and conventional technologies in handling the very large amounts and existing diversity of exchanged traffic.
As network traffic evolves over time, due to changes in services and users and their behaviours, the capability of these methods to adapt to such changes is being challenged. Ever evolving traffic makes the process of building ID models a particularly challenging task, as learning all possible variations of traffic patterns for all different kinds of traffic and users is an impossible quest. Therefore, there is a pressing need to make intelligent detection methods adaptable to traffic variability.
The remainder of this paper is organised as follows. In Section 2, we describe the problem that we address in this paper. In Section 3, we discuss related work for threshold adaptation techniques, applications and main research gap. Section 4 presents the proposed solution, which has been empirically investigated. Section 5 describes the experimental setting and data sets used. In Section 6, we present and thoroughly discuss the results of the first set of experiments that aimed to serve as a proof of concept. In Section 7, we discuss the results of the second set of experiments that investigated threshold adaptation under different feature sets and data balance scenarios. Finally, Section 8 concludes this work and lists future work and directions.

2. Problem Statement

In a typical (batch-based) scenario, a network-based anomaly ID model would be built to protect specific environments from attackers. The model building phase would require some training data that were previously captured from old traffic to generate the ID model, which would be tuned and set to detect anomalous behaviours. However, as such a model is used to analyse new, real traffic, it will suffer from high false alarms and low detection accuracy. These phenomena are usually caused by the changes in network patterns, and lead to an early phasing out of such a model and a triggering of model regeneration or updating phase. This could be linked to the inefficiency of using a fixed discriminating threshold for such ID models. For example, a network under high volume attacks, such as denial of service (DoS) or scan attacks, would have different class (normal to attack) distributions than when under low volume, but stealthy attacks such as SQL injection and command-and-control (C&C).
Most of the learning and classification methods used in building such ID models are based on a number of key assumptions [2,3], such as: (i) the equal representation of classes, (ii) the equal representation of sub concepts for a specific class, (iii) the similar class conditional distributions of all classes, and (iv) the pre defining and knowledge of all the values of the attributes for all records in the dataset. Due to the traffic evolution, most, if not all, of these assumptions are violated in real environments, as new traffic will start to exhibit different statistical properties to those of the training data.
Unpredictable differences between the training and evaluated (tested) data can be introduced over time because of such traffic evolution, known as concept drift. These differences can take various forms; for example, class distributions might differ in the new data from those used to build the ID model, and even new classes might emerge over time. In addition, class balance (also known as data balance) can play an important role in the accuracy of constructed models, which could be affected as a result of pattern changes. Traffic variability might also bring about differences in feature importance. These effects (collectively or individually) might render the learnt model outdated sooner than anticipated. However, the current methods to deal with these effects (in a batch-based setup) will attempt to generate a new model, which may consume additional resources in collecting and labelling new data to be used to learn that new model.
Many studies have attempted to address some of these issues in real time setups by tuning the detection parameters of the ID models, while others have introduced ensemble methods for data stream setups. However, there is insufficient empirical work to analyse the threshold adaptation of model predictions in binary batch learning (offline learning) setups [4].
The low detection accuracy of such score-based anomaly ID models in a batch learning setup, could be linked to the use of a fixed discriminating threshold, which in turn could result in an inaccurate reading of the accuracy that is far lower than the actual optimal accuracy. This might explain the early termination of such ID models. As a result, adapting the discriminating threshold to the predictions of the evaluated network traffic would provide an accurate reading of the actual accuracy of the ID model. Understanding this will lead to an improvement in detection accuracy, and hence an extension in the lifespan of the ID models.
Therefore, in this paper we address this problem by investigating the effect of adapting the discriminating threshold (specifically to the evaluated network traffic) on the accuracy (i.e., the geometric mean (G-Mean) of accuracy) of multiple models and comparing the results with the use of a fixed threshold. This investigation was done by comparing the effects on traffic collected at different times with existing variability. Further, the ability of different types of ML algorithms to adapt to traffic changes was analysed.

4. Threshold Adaptation

As noted earlier, the adaptation capability of prediction models under the batch learning setups is the least investigated area in comparison to other methods, although the batch-based ID models are important to detect novel attacks that cannot usually be detected by other techniques. Some kinds of attacks are better detected in a batch mode to increase the detection rate, rather than attempting faster detection in real-time with a higher failure rate. With this approach, there is no need to change or tune any of a model’s parameters as long as its predictions are in the form of a probability score. In this sense, threshold adaptation does not require any modification to the anomaly detection model. The detection model is thus treated as a black-box, as the adaptation is performed to its predictions and not to its detection parameters.

5. Experimental Settings

In this section, the experimental settings used in all conducted experiments are presented. Three ML algorithms were used and their main settings are explained. All the experiments were evaluated in terms of detection accuracy using the geometric mean of accuracy (gAcc) [40] measure, as the normal accuracy measure can be a very misleading measure due to its sensitivity to class imbalance. Similarly to other performance assessments of classification models in a supervised learning task, the gAcc uses the computed basic counts of a table known as a confusion (error) matrix [41] (see Table 1).
Table 1. Structure of confusion matrix for n classes.
The gAcc computes the classification accuracies of every class separately, and then computes their geometric mean. Equation (1) shows the general formula used to compute this measure, where Cj,i is the number of class i instances that were predicted as j, and n is the total number of classes.
gAcc = i = 1 n c i , j j = 1 n c j , i n
Although this measure was first proposed by Kubat and Matwin [40], few studies have used it to assess and compare the performance of different models. However, a number of recent studies in network ID domain have started to use it [42,43,44].

5.1. Overview of Classification/Machine Learning Algorithms

For all of the experiments conducted in this paper, three common classification algorithms that are widely used for batch learning were analysed, evaluated and compared to address the anomaly network detection. These algorithms were C5.0, Random Forest (RF) and Support Vector Machine (SVM); this section provides an overview of each of these algorithms.

5.1.1. Decision Trees (C5.0)

C5.0 is a classification algorithm [45] based on decision trees, which are used in classification problems to build a deterministic data structure that is formed out of decision rules for a particular domain [46]. It has a lower error rate due to its use of ‘boosting’ [47]. Additionally, as C5.0 generates smaller trees, it consumes fewer resources, such as memory, and performs faster executions. It also avoids overfitting noisy data [48]. The C5.0 algorithm uses the information gain ratio to perform its splits, aiming to reduce the bias towards features with a large number of distinct values by penalising the selection of a feature based on the number and size of its branches. However, this criterion might result in favouring features with very low information values [49]. The final classification decision is based on the path traversed from root to leaf; these decisions can be either a ‘class’ (label), or ‘probabilities’ (score) of classes.
C5.0 performs tree pruning by removing parts of the tree that are predicted to have a high error rate [50]. In this pruning process, every subtree is evaluated to determine whether it will be replaced with a leaf or a node.

5.1.2. Random Forest (RF)

Random Forest (RF) is basically formed out of multiple decision trees (prediction models) that are grown using a combination of ‘bagging’ and the random selection of features (subspace). Bagging (bootstrap aggregating) is a technique that aims to improve the performance (accuracy and stability) of ML algorithms and to reduce variances and the chance of overfitting [51,52]. This is performed by generating nTree bootstrap samples, which are randomly sampled from the main training data. Each of these bootstraps will then be used to build a prediction model, resulting in a total of nTree models (decision trees).
After a bootstrap sample is produced, a decision tree is generated. In RF, only a random selection of features (subspace), with no replacement, are evaluated at every node to decide on the best split, rather than using the full features set as in the C5.0 algorithm. The number of these random features, mtry, is usually far less than the original number of features.
Out-of-bag (OOB) data are used in the internals of RF to estimate and monitor the errors of the decision tree and its strength, as well as the correlation between different trees and to measure feature importance [46,53].
The final prediction of the forest is performed by running each instance down all decision trees in the forest. The results of all these trees are then aggregated to form the final decision. For numerical predictions, the average or the weighted average of the results of all trees is returned, whereas for classification problems, the majority vote or the probability of the classes is returned.
The RF algorithm has many advantages, such as low training time complexity and fast prediction time [54,55], efficient handling of missing data, no required pre-processing (scaling or normalisation) of data, and efficient handling of imbalanced data and rare cases (due to the bootstrapping feature) [56]. However, this algorithm has some drawbacks, such as slow runtime as the number of its trees increase, and difficulty to interpret its models due to their high complexity (caused by randomisation) [57]. The key stages of the RF algorithm are illustrated in Figure 1.
Figure 1. Main phases of Random Forest algorithm.

5.1.3. Support Vector Machine (SVM)

The Support Vector Machine (SVM) [58] is one of the most popular classification algorithms used for supervised learning tasks in ML. Its development is based on the structural risk minimisation principle [57,59]. In SVM, each data instance is represented geometrically as a vector ( p ) in p-dimensional space— x = ( x 1 , , x p ) Χ p . SVM attempts to find a linear surface (hyperplane)—or a line in 2D space—that separates the instances into two classes y ∈ {–1, 1}, where this separating hyperplane has the largest distance between the edge points of each class. These edge points define the border lines for each class as per Equation (2):
N e g a t i v e   ( )   l i n e   e q u a t i o n ,             |   w . x + b = 1 P o s i t i v e     ( + )   l i n e   e q u a t i o n ,             |   w . x + b = + 1
where x is an edge point in the training data that lies on the border line of a class, and b is (offset) the distance from the origin to the decision boundary ( w . x + b = 0 ) [58]. The edge points also define the width of the margin between those border lines. These points (vectors) are used to define and outline (support) the separating hyperplane and are called the support vectors. The minimum number required of these points is (p + 1).
As there could be many separating hyperplanes that might separate positive cases from negative cases, the SVM algorithm searches for a decision boundary with the maximum margin. The width of this margin is the sum of the distances from that decision boundary to the parallel hyperplanes that contain the closest positive and negative training points (support vectors) [60].
The SVM classifier depends on computing w, which is a normal vector perpendicular to the separating hyperplane (decision boundary). This normal vector is precomputed as Equation (3) presents:
w = i = 1 N λ i y i x i
where λ i are Lagrange multipliers produced at the training phase using data with N training samples. SVM classification is performed by evaluating which side of the hyperplane a test instance (vector) will fall into, as Equation (4) shows:
S V M ( x ^ ) = { 1 ,         w . x ^ + b < 0 + 1 ,         w . x ^ + b 0
where x ^ is a test instance, and b is (offset) the distance from the origin to the decision boundary ( w . x + b = 0 ) [58], which is precomputed at the training phase.
SVM has the capability to find a separating hyperplane with soft margins, which allows some violation of the boundary by permitting some levels of mixing between classes. This is usually done by tuning some cost value (C), which has an effect on the variance [61].
One of the main advantages of SVM is that it does not suffer from the “curse of dimensionality,” as many other ML algorithms do. As a result, feature reduction is not required by SVM [62]. SVM also has many limitations, such as the required pre-processing phase of the data (dealing with missing data, data transformation, scaling and/or normalisation) [63].
For non-linearly separable problems, SVM might require the use of kernel methods or functions to transform the data from input (data) space into higher dimensional (feature) space, where the data can be made linearly separable. Hence, the resultant separating hyperplane can be expressed using the inner products of the vectors [64]. However, using kernels will incur optimisation costs, as all their tuning parameters need to be taken into account [65]. As a result, SVM processing speed is affected by the kernel used, as some kernels will perform more operations in the transformation phase, which will slow the SVM’s speed [66].

5.2. Parameter Setting for the ML Algorithms

All of the implementations of the analysed algorithms within this study utilized packages of the R environment [67]. Default parameters of these algorithms were used to make these experiments reproducible. Adjusting parameters to improve detection would require further investigation, which is outside the scope of this paper.

5.2.1. C5.0 Algorithm

The “c50” package (version 0.1.0-24) [68] was used in this study. All experiments used the default settings of this algorithm, with the 10 trials option (trials = 10) set to return the results of the classification as a probability score (type = “prob”) when the model was used to predict the evaluation (test) data.

5.2.2. Random Forest

The “ranger” package (version 0.8.0) [69,70] was used over the course of this research. This package was selected because of its fast implementation of RF in C++. All experiments used the default settings of 500 trees (nTree) to grow, with the number of features to evaluate at every node being the square root of the total number of features in the dataset ( m t r y = p ) , where p is the number of features. The algorithm was instructed to return results in the form of classification probabilities (probability = TRUE).

5.2.3. Support Vector Machine (SVM)

The open source SVM package (LiblineaR) (version 2.10-8) [71,72] was used in these experiments. This package executes an optimized linear version of SVM. All experiments used the default settings of L2-regularized logistic regression linear model type (type = 0) with the cost set to one (cost = 1).
The choice to use the linear version of SVM was driven by the very large differences in the runtime of experiments between its linear and nonlinear kernel versions. Some preliminary experimentations were conducted to compare the two versions. As a result of the large difference of runtime between the two versions, the linear version of SVM was selected, as it was much faster. These preliminary experiments also showed that the runtime of the kernel SVM grows exponentially as the number of instances increase. With all these differences, the nonlinear kernel SVM was not tractable to be introduced as a solution in a domain like IDS [73].
Data were pre-processed by converting all categorical (nominal) features into dummy attributes, as SVM can only handle numerical data [74]. A data were also standardised, where the standardisation parameters (the mean and standard deviation) of the training data were used to standardise the features of the test data before being classified by the model [71,75].

5.3. Performance Assessment Techniques

K-folds cross-validation (CV) technique is the most widely used performance assessment method of different ML algorithms due to many reasons, such as data shortage [76,77,78], avoiding overfitting problems [79], and to identify and fine tune the model’s parameters [80]. In this technique, the dataset is randomly divided into K parts. A model is then trained using K-1 parts and tested on the remaining part. This process is repeated K times, so that each one of the K parts is only used once as test data. The model’s overall performance is estimated by aggregating the performance of the K models (through averaging or a majority vote). However, it requires a long time to process as larger values of K are used. It could also provide overly optimistic results due to the random division of datasets, which could be a result of partitions that are statistically similar to each other. Therefore, the K-folds CV technique was used in all experiments at every model building (training) stage to estimate the prediction thresholds for every developed model, as per the recommendation of Ambroise and McLachlan [81].
In this paper we adopted the prospective sampling [82] method, which obtains new sample data after the model generation phase is over. This method is not a commonly used evaluation practice in anomaly-based detection. This evaluation method aimed to mirror real life, given that models are usually trained on data that have been collected in the past to predict future data.

5.4. Datasets Description

This section provides an overview of the datasets used in the experiments outlined in this study. Two synthetic datasets (SEA and AGR) were generated randomly, alongside two domain specific datasets (gureKDD and STA2018). The first three datasets (SEA, AGR and gureKDD) were used in the first experiment, and STA2018 in the second one.

5.4.1. SEA

A streaming ensemble algorithm (SEA) generator [83] in the MOA framework [31] was used to generate a data stream with three continuous features (X1, X2, X3). Each feature had a range between 0 and 10, although only features X1 and X2 influenced the class value. Instances were produced by randomly generating points (X1, X2) in a two dimensional space. Instances were labelled as groupA if X1 + X2 > θ, and as groupB if X1 + X2 ≤ θ, where X1 and X2 were the first two features and θ was a threshold. There were four functions, which would label the instances differently based on their threshold values between the two classes (function 1 sets θ = 8, function 2 uses θ = 9, function 3 sets θ = 7, and function 4 sets θ = 9.5) [84]. The SEA generator’s default setting was used to add 10% noise classes. Six different data streams (files) were produced: function 1 was used to generate two streams (file 1 and file 2); function 2 was used to generate two other streams (file 3 and file 4); and a combination of function 1 and function 2 was used to generate two streams (file 5 and file 6). For every file, calls to these functions used different seed values to set the seed of the random generator function to generate new random instances. Figure 2 presents an example of the command line call to generate File 1 with the SEA stream generator.
Figure 2. Command used to generate File 1 of streaming ensemble algorithm (SEA) dataset.
Each stream consisted of 200,000 instances. This dataset was used to analyse the effect of different statistical properties (concept drift) between training and testing data on the model’s performance. Table 2 lists the number of instances of each class in every file in this dataset.
Table 2. Instances’ classes in every file in the SEA dataset.

5.4.2. AGR

The AGRAWAL generator [85] in the MOA framework [31] was used to generate a data stream with nine features (X1, …, X9), six of which were nominal (factor) and three of which were continuous. This generator had ten different functions to assign the produced instances to one of two different classes, based on the values of their different features. The following examples illustrate the labelling rules of the two functions that were used in generating this dataset:
Function 1:
-
if (age < 40 OR age ≥ 60) then groupA else groupB,
Function 2:
-
if (age < 40){    if (50K ≤ salary ≤ 100K) then groupA else groupB },
-
else if (age < 60){ if (75K ≤ salary ≤ 125K) then groupA else groupB },
-
else{         if (25K ≤ salary ≤ 75K) then groupA else groupB },
Each function increases the level of complexity as it uses additional features and complex rules to label the instances [86]. The generator’s default setting was used to add 10% noise classes by introducing a disturbance factor that added a deviation value (following uniform random distribution) to the original feature’s values. Similar to the SEA dataset generation, six different data streams (each with 200,000 instances) were generated using function 1 and function 2 of the AGR data stream generator. Figure 3 provides an example of the command line used to generate the data of File 1 in this dataset. Table 3 presents a summary of the label frequencies in every file for this dataset.
Figure 3. Command used to generate File 1 of AGR dataset.
Table 3. Instances’ classes in every file in the AGR dataset.

5.4.3. gureKDDcup

gureKDDcup [87,88,89] (referred to throughout this paper as gureKDD) is a transformation of the raw network traffic of the DARPA 1998 dataset [90] into a suitable format for ML tasks, where every connection is described using a set of features. This transformation is similar to the KDD 1999 dataset [91], but much richer and cleaner. The KDD 1999 dataset was not used in this paper due to its many limitations, identified by Al Tobi and Duncan [92]. Every connection in the gureKDD dataset has a unique ID that helps identify the chronological order of all connections. Therefore, all connections in this dataset are chronologically separable and can be divided by day, week, etc.
For the first experiments, all traffic (over a seven week period) was segregated into a time window of a week, which resulted in seven files. Every file contained the network traffic of that week (Monday–Friday). Every connection in these files was profiled using 41 features: 3 of which were nominal (protocol_type, service and flag), 6 were binary features, 15 were continuous (real) features and 17 were integer features. These features were divided into four main groups: intrinsic (basic) features {1–9}, content based features {10–22}, time based features {23–31} and connection based features {32-41}.
Each connection was labelled either as normal or as one of the 35 different attacks. These attacks were grouped into four main classes: DOS, Probing, Remote to Local or User to Root. In these experiments, the data were pre-processed so all different attack types were grouped and labelled as ‘attack’ to produce binary classes. Table 4 presents a statistical summary of the connection class types for each of the seven weeks, which were clearly shown to have different class balances.
Table 4. Number of connection classes in every file in the gureKDD dataset.

5.4.4. STA2018

The STA2018 dataset (The full data set can be found at: https://doi.org/10.17630/c5f31888-9db5-4ac0-a990-3fd17dcfe865) [73] was generated by transforming the network traffic of the UNB ISCX Intrusion Detection Evaluation DataSet 2012 [93] into a suitable format for ML tasks. This dataset profiles every connection using 193 basic features, where part of Onut’s feature classification schema [94] was used to extend these features to a total of 550 features (549 independent variables plus 1 dependent (class) variable).
The STA2018 dataset contains the profiled sessions (connections) of the network traffic of seven simulation days, where data records are grouped by day so that every data file aggregated all of the connections within that simulation day. The transformation process of this dataset went into five main stages: basic features extraction, validation and connection labelling, extend the basic features, balance and clean up.
Due to the balancing stage, this dataset can be used into two modes: first with the original imbalanced version, second with a balanced version where synthetic instances of the attack connections (minority class) were generated using the Synthetic Minority Over-sampling Technique (SMOTE) algorithm [95]. Table 5 sets out the number of connections for each class for each day (original and balanced versions).
Table 5. Number of classes of instances for each day’s file of the STA2018 dataset.
In the second set of experiments outlined in Section 7, only days 2 to 7 were used, as the first day was attack free.
Originally, the file for each day consisted of 550 features (549 features + 1 class). Two features (synthetic and origOrder) were omitted from any analysis, as their only purpose was to distinguish the original data from the balanced (synthetic) data and to identify the connection order. Three further features were removed from the analysis (start_time, src_ip and dst_ip), both to avoid any possibility of overfitting and because of the large number of levels. Removing these five features resulted in a total of 545 features (544 features + 1 class). Any reference to the Full set of features thus refers to these 545 features.

5.6. Hardware Specifications

All experiments were performed on a “Dell C5220 PowerEdge Rack Servers” cluster, which had 12 micro servers. Each micro server ran Scientific Linux 7 on dual quad-core Intel Xeon 3.4GHz CPUs, 16GB RAM, two 500GB SATA disks and two Gigabit Ethernet interfaces. The large data files of the STA2018 dataset {Day 5 (15/Jun) and Day 6 (16/Jun)}, in the second set of experiments, were run on a Hyper-V virtual machine with 8 Virtual Processors, 20 GB RAM and 32 GB Swap space. This VM was used to host the Ubuntu 16.04 (64-bit) operating system. It was hosted on a server with the following hardware specifications: 2U Supermicro chassis; 8x host-swap 2.5" SAS/SATA disk bays; Supermicro X8DTU-LN4F+ motherboard; Dual Intel Xeon E5620 (quad core); 24GB RAM (6 x 4GB DDR3 ECC RDIMM); 4x 1TB SATA (RAID10); and 4x 1Gb Ethernet. This machine used a Windows Server 2012 R2 Datacentre (64-bit) operating system.

6. First Experiment

In the first set of experiments, we examined the effect of threshold adaptation on the overall performance of a detection model. We aimed to provide a proof of concept (PoC) by comparing three well known ML algorithms (C5.0, RF and SVM) to determine which was the most adaptable to variations and concept drifts. In this set of experiments, we conducted two different experimental setups (see Figure 4). Both setups used the same datasets and the same ML algorithms, however, different sampling approaches were performed. We analysed the effect of different sampling approaches on individual detection model accuracy using a real life setup (prospective sampling), and compared this to the usual experimental setups reported in academic publications (K-folds cross-validation). Another part of this experiment was to examine the effect of threshold adaptation in improving model overall detection accuracy. The choice to use the synthetic datasets (SEA and AGR) was driven by the need to control the degree of variability between different data files. The gureKDD dataset was used to make this study comparable to other studies in the field, as its comparator datasets (KDD1999 and NSL-KDD) are widely used in this domain.
Figure 4. The phase diagram of the experiments.

6.1. Results and Discussion

This section presents the results of the first set of experiments and discusses their main findings.

6.1.1. 10-folds Cross-Validation on Full Data

In the first setup (Figure 4), we started these experiments by comparing the detection performances of the three ML algorithms (C5.0, RF and SVM) on the three different datasets (gureKDD, SEA and AGR). The conventional method of 10-folds CV technique was performed on the merged files of each dataset. The maximum gAccs of these models and the best cut-off values were reported. Due to the minimal variability between results, each experiment was repeated only ten times (see Table 6).
Table 6. Average model accuracy (G-mean accuracy), the average optimal cut-off value (at which maximum G-mean accuracy was reached) and their standard deviation of the 10-folds cross-validation (10 repetitions).
Table 6 presents the average of the gAcc values of the ten trials of the 10-folds CV in terms of the gAcc of the three algorithms (C5.0, RF and SVM). It also shows the mean of the optimal cut-off values of the ten runs at which the maximum gAccs were reached.
In general, all algorithms reported similar accuracies for their respective datasets. However, in the artificial dataset AGR, SVM failed to perform anywhere close to C5.0 or RF (showing a difference of almost 15%—see Table 5). This could be related to the nature of the dataset, which could be non-linearly separable, as a linear version of SVM was used in this analysis. Generally, RF was capable of improving detection accuracy on all datasets.
Generally, the performance of all algorithms on gureKDD was the highest, followed by those on the SEA dataset. The AGR dataset was the worst in reaching high detection accuracy. This fact is clearly illustrated by the plots in Figure 5, which show the gAcc curve against the cut-off values for all datasets. These plots show the ten runs in a lighter colour and the means of these runs in solid colour. They also show the optimal threshold values for each dataset under the tested algorithm.
Figure 5. The gAcc curves of the 10 runs of the 10-folds cross-validation experiments for the three datasets (gureKDD, SEA and AGR) using three classification algorithms. (a) C5.0. (b) Random Forest. (c) SVM.
Friedman’s test [96,97] was used to analyse whether the differences between the accuracies of all runs of the 10-folds CV on the full datasets for these algorithms were significant. The tested hypothesis was, “there is no statistically significant difference in model gAccs between the different algorithms.” This test revealed that there was a significant difference between the different algorithms applied to these datasets under the 10-folds CV approaches, χ2(2) = 26.7, p = 0.000 < 0.05. The follow up Nemenyi post-hoc test [98] revealed that the algorithms were all different from each other, as illustrated in Figure 6, which shows that the differences between the algorithms were statistically significant.
Figure 6. Critical differences plot of the pairwise Nemenyi comparison test for the full datasets 10-folds cross-validation experiment.

6.1.2. Subset-to-Subset (File-to-File)

In the second setup (Figure 4), we used the same datasets and algorithms to generate detection models, but in scenarios that were similar to natural settings we applied the prospective sampling technique [82]. In these experiments, models were generated on a subset of the dataset using the 10-folds CV technique to set these models’ parameters, i.e., the prediction threshold (cut-off). These models were then used to evaluate the remaining files in the dataset. Two gAcc values were computed for every combination of prediction model and evaluation data. The first gAcc was obtained when the model’s pre-set prediction threshold value, which was calculated using the 10-folds cross-validation, was used to predict the test data file. The second gAcc value was calculated based on the maximum accuracy reached when the prediction threshold value was adapted to the evaluated data file. This section shows the results obtained under this setup.
Plots of the gAcc (in Figure A1, Figure A2 and Figure A3 in Appendix A) present the performance of the prediction model (MDLk) that was trained using Filek on the files in the dataset (Filei≠k) that were not used in producing that model. In these figures, each model’s performance, based on the CV technique, is illustrated with a solid line; other individual performance evaluations are depicted with dotted lines. As the SEA and AGR datasets are composed of only six files each, there is no illustration of model 7 for these datasets in these figures.
C5.0 algorithm:
Algorithm C5.0 had the worst performance on the first file in the gureKDD dataset, even at CV evaluation during the model generation stage (Figure A1 in Appendix A). This is due to the fact that this file has the least number of attacks (21 attacks) and is the most imbalanced of the files. Therefore, the generated model using this file was not able to predict any instances in other files. Where the number of attacks in other files increased with a proportionate balance, the model performances improved under this algorithm.
Generally, applying this algorithm under the prospective sampling approach followed the same pattern as the first experiment (10-folds cross-validation), where performance on gureKDD resulted in the highest accuracy, followed by the SEA dataset; the worst performing dataset was the AGR.
For both the SEA and AGR datasets, the generated models performed best when files exhibited the same statistical properties, denoted in these experiments by the same generating functions. For example, where MDL1 used File 1 as training data, it predicted instances in File 2 with a high performance and vice versa, as both files were generated using the same function. This was also true for Files 3 and 4. Where files contained mixed behaviours, the prediction performance dropped sharply.
Table A1, Table A2 and Table A3—in Appendix A—present the results of each model on every file generated by each of the different algorithms. These tables show that the performance of all of these models improved when the threshold (cut-off) was adapted for the evaluation dataset, rather than using a pre-calculated one.
Random Forest (RF) algorithm:
Unlike C5.0, it was expected that RF would perform well on the first file of the gureKDD dataset despite its low number of attack connections. This was linked to the bootstrap stage, where instances were sampled from the population with replacement. This means that duplicates of the 21 attack connections were sampled many times, which increased the predictability of the built trees (Figure A2 in Appendix A).
After careful examination of the results, as presented by Table A2 in Appendix A, one can see, especially in the synthetic data (SEA and AGR), that when a testing file has similar statistical properties to the model, its performance will not increase much, even after cut-off adaptation. However, when it has different statistical properties, the adaptation process boosts the prediction, leading to an accurate evaluation of a model’s performance.
Furthermore, the effect of the adaptation process was more tangible in gureKDD than in the synthetic data, as this dataset exhibited both different patterns and varying statistical properties between files. For example, Table A2—in Appendix A—under gureKDD data, shows that MDL1, which was trained on File 1, reached a gAcc of 67.33% on File 5 when the original cut-off (threshold) of the model was used, but applying the adaptation process to this threshold increased its performance to 99.37%.
SVM algorithm:
SVM performed the worst on the AGR dataset in comparison to the other algorithms (Figure A3 in Appendix A). This could have been the result of the non-linear nature of this dataset, which was not picked up by the SVM linear implementation used in these experiments. In general, the cut-off (threshold) adaptation showed a similar effect in improving the models’ performances compared to using the model’s optimal threshold.

6.2. Discussion

The findings of the experiments in this section illustrate the importance of the adapted cut-off value to the data–model pairs in achieving an accurate reading of each model’s performance. Friedman’s test [96,97] was used to assess whether the difference between the different algorithms was significant before and after threshold adaptation. The tested hypothesis was, “there is no statistically significant difference in model gAccs before and after cut-off (threshold) adaptation between the different algorithms.” This test revealed that there was a significant difference between the different algorithms before and after threshold adaptation, χ2(5) = 217.7, p = 0.000 < 0.05.
To identify which algorithms were different, a Nemenyi post-hoc test was carried out to calculate the pairwise comparisons. Figure 7 presents the critical differences between the different algorithms before and after cut-off adaptation as a plot. The plot shows that when the cut-off was adapted for the evaluated dataset, the SVM and C5.0 algorithms were no different to each other. They showed the same behaviour even when cut-off adaptation was not performed, but the cut-off adaptation increased their gAccs. In general, all algorithms were ranked higher when cut-off adaptation was performed, with the RF algorithm always outperforming the other two.
Figure 7. Critical differences plot of the pairwise Nemenyi comparison test for the cut-off (threshold) adaptation experiment.

7. Second Experiment

In this set of experiments, we used the STA2018 dataset to investigate the capability of various ML algorithms in adapting their predictions to the variability of network traffic. We investigated this new approach (prediction threshold adaptation) in the IDS domain.
Typical model development would be governed by decisions made to improve some performance measures, e.g., speed or detection rate. Such decisions, which might involve executing a feature selection and/or a data balancing stage, are usually based on the analysis that will be conducted on the training data. As such, when new evaluation data are used, the performance of the models may not be satisfactory, leading to a phasing out of those models and the generation of new ones. However, such models may still be able to maintain high performances if they are adapted to the new concept that is introduced in the new data.
There are many techniques to perform feature selection, which aims to select a subset of salient features to build a prediction model. Bi et al. [99] have attempted feature selection through introducing a probe to the data by adding three randomly generated variables (fake features/columns) to the dataset. These fake features are randomly drawn from a Gaussian distribution [100]. They use a linear SVM to model the subsets at every iteration of a K-folds cross-validation, where variables with nonzero weights are selected. Any variable (feature) with an average weight below that of the fake variables is then rejected. This approach does not address weight variability, as it only compares averages.
Similarly, Kursa et al. [101,102] have proposed a similar approach in which the information system (training data) is doubled, so that every feature has a shadow feature that is basically a shuffled version of the original one. Feature importance evaluation is then performed on the extended system using the RF algorithm. A K-folds CV—of at least 10-folds—is performed at every iteration, so that every feature is compared to its shadow using statistical tests to evaluate the highest performing features. The main drawbacks of this approach are scalability and speed. Therefore, in this paper a new approach has been proposed and executed that combines the core ideas of the two approaches above.
In this approach, as illustrated in Figure 8, the information system (training data) is extended by adding three randomly generated variables (fake features/columns) to the dataset, where these fake variables are drawn randomly from a Gaussian distribution. A feature importance evaluation—using the RF algorithm—was performed on the newly extended system, and the importance measures of these random variables were then used as a threshold to reject any features with a lower importance value than those of the fake variables. In other words, any feature that performed worse than a random guess was rejected. This comparison was performed using statistical measures.
Figure 8. Critical differences plot of the pairwise Nemenyi comparison test for the cut-off (threshold) adaptation experiment.
As equal variance between compared groups (feature versus fake variables) is not guaranteed, and due to the unbalanced design (number of compared importance measures) of these comparisons, which would have small sample sizes, Welch’s two sample t-test [103,104] was used. Comparisons were performed to evaluate the statistical significance of the mean difference between every feature and the fake variables. The aim of this approach was to speed up the feature selection stage and to make it independent of human evaluation or fixed thresholds, so that it would be more adaptive to the true nature of the dataset. This study adapts the approach of Bi et al. [99] to address the limitation of the Kursa et al. [101] method.
Every fake feature was formed of N random values drawn from a Gaussian distribution with a mean of zero and a standard deviation of one, where N was the number of records in the training data. These random features were combined with the original dataset and were processed by the RF algorithm to compute its features’ importance, using a 3-folds CV technique. A Welch’s t-test statistical [103,104] comparison was then performed to evaluate whether the mean of the importance measures of every feature, Fi,—from the three folds—was statistically significantly greater than the mean importance of the fake features (with a significance level of α = 0.05). All features with a mean importance statistically significantly greater than that of the fake features were selected. The steps of the feature selection stage are illustrated in Algorithm 1.
As RF can return importance score of every feature, it has been used to select the salient features using its two measures {mean decrease of accuracy (MDA) and mean decrease in Gini (MDG)} [70,105].
Algorithm 1: Feature Selection with Fake Features (pseudo code)
Input: dataFile, ftrType
Result: Selected Important Features
1dataFile <- filename,    // Name of the data file to be processed
2ftrType <- ftrMsr,    // Features importance measure {MDA or MDG}
3
4ftrImprtance <- {},    // Initialize list to contain the computed
5                 // importance value of every feature
6ftrSelected <- {},    // Initialize list to contain the selected features
7
8DS <- load file (fileName),    // Load the content of the data file
9ftrSet <- getDataFeatures(DS),    // Get the list of features in the data file
10N <- num_rows(DS),    // Get number of records in the training data
11
12 F K 1 <- rand(sample=N, mean=0, sd=1),   // Generate 3 lists of random variables where
13 F K 2 <- rand(sample=N, mean=0, sd=1),   // each list contains N random numbers with
14 F K 3 <- rand(sample=N, mean=0, sd=1),   // mean=0 and standard deviation=1
15
16
17newDS <- [ D S ( N × p ) F K ( N × 1 ) 1 F K ( N × 1 ) 2 F K ( N × 1 ) 3 ], // Append the fake features to the original data
18partsDS <- create K partitions of newDS, // Create K partitions to calculate features
19                 // importance measures using K-folds Cross-Validation
20
21// Compute the importance of every feature using K-folds
22// Cross-Validation and save them in ftrImprtance
23For fold in K-folds, do
24 trainRcrds <- partsDS[-c(fold)]
25 ftrImprtance[fold, ] <- featre_importance(data=newDS[trainRcrds, ], measure=ftrMsr)
26done
27
28// Evaluate every feature in the data file by comparing its performance
29// to the performances of the 3 fake features. If the mean importance of
30// that feature is statistically higher than the mean importance of the
31// fake features, then add that feature to the selection set.
32For Fi in ftrSet, do
33 if( ftrImprtance[,Fi] > ftrImprtance[,c( F K 1 , F K 2 , F K 3 )] with t.test probability > 0.05 ){
34  ftrSelected <- ftrSelected ∪ {Fi},
35 }
36
37done
38
39return( ftrSelected ),     // Return the list of selected features
In the feature importance evaluation, 15 categorical (factor) features were eliminated from the STA2018 dataset, as they had been added to all the experiments’ model building designs and evaluation process by default. These features are listed in Table 7.
Table 7. Categorical (factor) features eliminated from the feature importance evaluation phase.
The experiments were executed in three different phases, as explained below, and presented with the pseudo code in Algorithm 2.
Algorithm 2: Experiment Phases (pseudo code)
Input: Dataset
Result: Performance results
1For Fi in Dataset, do    // Process every file Fi in the STA2018 dataset
2 Ftrs.Set[Full] <- {Full.Ftrs} // 544 features
3 Mdls.Set <- {}
4 Rslt.Set <- {}
5
6 Fi.bal <- Balance(Fi)   // Generate/get a balanced version of data file Fi with balanced
7               // instances’ classes by generating synthetic instances of
8               // minority class using SMOTE algorithm.
9
10// Phase 1: features selection...
11 Ftrs.Set[MDA] <- getImportantFtrs(data=Fi, ftrType=MDA) ,
12 Ftrs.Set[MDG] <- getImportantFtrs(data=Fi, ftrType=MDG) ,
13 Ftrs.Set[MDABal.] <- getImportantFtrs(data=Fi.bal, ftrType=MDA) ,
14 Ftrs.Set[MDGBal.] <- getImportantFtrs(data=Fi.bal, ftrType=MDG) ,
15
16// Phase 2: models generation...
17// Generate five predictive models using original data with five different sets of features.
18 For ftrsa in Ftrs.Set, do
19  Mdls.Set[Fi, ftrsa] <- generate.Model(data=Fi, features= ftrsa)
20 done
21
22// Generate five predictive models using balanced data with five different sets of features.
23 For ftrsa in Ftrs.Set, do
24  Mdls.Set[Fi.bal, ftrsa] <- generate.Model(data=Fi.bal, features= ftrsa)
25 done
26
27// Phase 3: models evaluation...
28// Perform total of 50 evaluations (5 testing files X 10 predictive models)
29 For Fj≠Fi in Dataset, do
30  // Test every file other than Fi on every one of the 10 prediction models
31  // trained on Fi or Fi.bal
32  For Mdlb in Mdls.Set, do
33   // Get the following results:
34   //  1) G-Mean Accuracy using model’s cutoff (threshold) value,
35   //  2) G-Mean Accuracy using adapted cutoff (threshold) value,
36   Rslt.Set[Fj, Mdlb] <- evaluate(data=Fj, model=Mdlb)
37  done
38 done
39
40done
As the STA2018 dataset distinguishes between original and synthetic records, every day’s traffic file (subset) was pre-processed in order to be used into two modes [imbalanced and balanced] (line 6 in Algorithm 2). As explained earlier, the SMOTE algorithm [95] was used by the STA2018 dataset [73] to generate synthetic instances of the minority class until the number of instances in both classes were equal to each other.
In the first phase (lines 10–14 in Algorithm 2), every file in the STA2018 dataset (which was used to generate the models) was evaluated to select two subsets of features (see Algorithms 1) using the Mean Decrease of Accuracy and the Mean Decrease Gini, resulting in the formation of the MDA and MDG sets, respectively. The same feature selection criteria were used on the balanced data file to generate another two sets of features, referred to in this paper as MDABalanced and MDGBalanced. By the end of this phase, there were four feature sets (see Table 8) along with the Full features set for each training day.
Table 8. Number of selected features under every feature importance measure.
In the second phase (lines 16–25 in Algorithm 2), each day’s traffic used each of the five feature sets (including the Full features set) to generate a binary classification (prediction) model, which resulted in five different models. The same process was repeated using the balanced data. Each model generation step used the 3-folds CV technique to establish the model’s optimal (CV) prediction threshold. The final prediction threshold was computed by aggregating all the fold’s predictions for each model to find the point (threshold) of the maximum gAcc. By the end of this phase, there were ten different binary prediction models for each day’s traffic.
In the final phase (lines 27–38 in Algorithm 2), every generated model was evaluated against each day’s traffic from the dataset that had not been used in any of the feature selection, or in the model generation processes. In this phase, to test the data file for each evaluation, the gAcc was computed using the model’s optimal (CV) threshold and the adapted cut-off.
The whole process was repeated for each of the algorithms being evaluated: C5.0, RF and SVM.

7.1. Results and Discussion

As every generated model was evaluated using all of the files (subsets) in the dataset except the one that had been used to generate that model, two gAcc values were computed for every combination of prediction model and evaluation data. The first gAcc ( gAcc T h r C V ) was the one obtained after the model’s optimal (CV) cut-off value had been calculated using 3-folds CV to predict the data file. The other gAcc value ( gAcc T h r O p t ) was calculated based on the maximum accuracy achieved after the prediction cut-off value had been specifically adapted for the evaluated data file.
As stated earlier, this set of experiments aimed to investigate the effect of the cut-off adaptation by determining the statistical significance in the gAcc of the models through comparing their optimal threshold with the adaptive cut-off. The analysis compared the difference between the two approaches by conducting four Friedman’s tests [96,97] (with a significance level of α = 0.05). The decision to use the non-parametric Friedman’s test was based on the fact that the data did not follow a normal distribution, as confirmed by the normality test (Shapiro–Wilk test) [106] W = 0.7, p-value = 0.000. The following list shows the hypotheses that were tested and the results returned by the Friedman tests.
Threshold-H0: “there are no statistically significant differences in model gAccs before and after cut-off (threshold) adaptation has been applied.
χ2(1) = 873.0, p = 0.000 < 0.05 (differences were statistically significant)
ML-H0: “there are no statistically significant differences in model gAccs between the different ML algorithms (C5.0, RF and SVM) before and after cut-off (threshold) adaptation has been applied.
χ2(5) = 747.5, p = 0.000 < 0.05 (differences were statistically significant)
Features-H0: “there are no statistically significant differences in model gAccs between the different feature sets (Full, MDA, MDG, MDABal. and MDGBal.) before and after cut-off (threshold) adaptation has been applied.
χ2(9) = 742.8, p = 0.000 < 0.05 (differences were statistically significant)
Balance-H0: “there are no statistically significant differences in model gAccs between the different data balances (Original and Balanced data) before and after cut-off (threshold) adaptation has been applied.
χ2(3) = 761.3, p = 0.000 < 0.05 (differences were statistically significant)
As all of these tests showed significant differences, a Nemenyi post-hoc test [107,108,109] was conducted to perform pairwise comparisons on the different effects of each test to distinguish which differences were statistically significant. The results of these pairwise comparisons are illustrated in Figure 9 through critical difference plots.
Figure 9. Graphical illustration of pairwise comparisons from the Friedman Test results for different threshold effects (optimal or adaptive cut-off) after applying the Nemenyi test (95% confidence level) (a) Fixed vs. adaptive thresholds. (b) ML algorithms under threshold adaptation effect. (c) Feature sets under threshold adaptation effect. (d) Training data balances under threshold adaptation effect.
All of the plots in Figure 9 show that the cut-off adaptation effect was significantly different from the fixed model’s optimal (CV) threshold. They also show that different treatments (ML algorithm, feature sets and/or data balance) with the adaptive cut-off always ranked higher. Any insignificant differences fall within the same effect (cut-off adaptation or model’s fixed optimal threshold).
Having shown that the models’ performance was ranked significantly higher when the adaptive cut-off approach was used rather than the fixed optimal (CV) threshold (see results in Table A4, Table A5 and Table A6 in Appendix A), all subsequent analyses focus on the results obtained using the adaptive cut-off. For every analysed algorithm, a Friedman’s test (with a significance level of α=0.05) was performed to test the hypothesis, “there are no statistically significant differences in the gAccs of models built with different feature sets and different data balances after a cut-off (threshold) adaptation has been applied.” The results of this hypothesis are discussed under every algorithm.
C5.0 algorithm
Results in Table A4—in Appendix A—for the C5.0 models show different patterns and behaviours from one training day to another. For example, models trained on Day 2 (12/Jun) failed to perform well on Day 5 (15/Jun), whereas Day 5 models predicted Day 2 traffic with a high degree of accuracy. They also showed inconsistent behaviour towards different feature sets across the days. For example, Day 2 models performed best when the Full feature set was used, but this pattern was not consistent across all days. This can clearly be seen from the results of Day 5, when MDG features were used, and the results of Day 7 (17/Jun) when MDA or MDABal. feature sets were used with the balanced training data. One important observation to make is the poor accuracy of Day 6 (16/Jun) models when the original training data were used. These models showed the worst accuracy, due to the low number of attacks in this data file. When a balanced version of the Day 6 data file was used to build the prediction models, accuracy improved. This supports the finding discussed in the previous experiments regarding the behaviour of C5.0 algorithm with imbalanced data. It can also be clearly observed from these results that data balancing had a minor effect in improving the accuracy of models developed using the C5.0 algorithm, which was further investigated using statistical analysis.
The results of Friedman’s test—stated above—revealed that there was not enough evidence to support this hypothesis, χ2(9) = 16.0, p = 0.067 ≮ 0.05. These tests showed that there was no significant effect of one feature set over another when the C5.0 algorithm was used. In addition, data balancing did not lead to a significant improvement in a model’s accuracy.
Results in Table A4 (see Appendix A) support this conclusion, as the C5.0 models show unstable behaviours. Many factors could be behind the volatile behaviour of the C5.0 algorithm. For example, selected feature sets might not be the best sets for this algorithm. This algorithm was also executed within its default parameters, in particular the number of trials, which was set at ten. In addition, C5.0 algorithms carry out random sampling by following the boosting technique (which randomly samples weighted instances). This might have caused C5.0 to overfit the training data, which could be one of the reasons for its overall poor accuracy in predicting new traffic. Overall, based on the statistical results returned using Friedman’s test, the C5.0 models ranked low, as illustrated by Figure 9b.
Random Forest algorithm
Another Friedman’s test was performed to assess the above hypothesis for the RF algorithm’s models. This test aimed to determine how these models performed when using different feature sets with different data balances, and whether the difference in accuracy was significant after applying the threshold adaptation.
This test revealed that for the RF algorithm, there were significant differences between these features after applying the cut-off (threshold) adaptation, χ2(9) = 38.0, p = 0.000 < 0.05. To distinguish which of these effects were statistically significant, a Nemenyi post-hoc test was conducted to perform a pairwise comparison, as illustrated in Figure 10.
Figure 10. Nemenyi test (95% confidence level) on the RF algorithm models using different feature sets and different data balances after applying the adaptive cut-off approach.
Overall, there were no significant differences in the RF’s accuracy when the Full, MDA and MDABal. feature sets were used. However, the Full features set showed a significant difference over the MDG and MDGBal. feature sets, which ranked lowest among the feature sets. This could be due to the nature of the mean decrease Gini metric in selecting local features, which have low generalisation power. However, even with these low accuracies, RF had the highest overall accuracy. As Figure 10 shows, the data balance had no significant effect on the accuracy of RF. On the contrary, it sometimes negatively affected the accuracy of models using the Full feature set with balanced data, as their difference to the MDG and MDGBal. feature sets became insignificant. This was also evident in the results of Day 6 in Table A5—in Appendix A—which showed a lower accuracy for all models for that day as the balanced version of the data was used. Although that day only had 11 attacks, RF was able to build good predictive models with good evaluation accuracy, except for Day 4’s traffic. The ability of RF to learn from Day 6 traffic was linked to its bagging technique. In contrast to C5.0, this sampling technique prevented RF from overfitting, which in turn produced models with good generalisation capabilities. This gave RF more chance of detecting novel attacks, as demonstrated in these experiments.
The RF algorithm showed the best results of the evaluated ML algorithms. As illustrated by the results in Table A5, RF’s accuracy would not have been better than that of the other algorithms if the fixed optimal (CV) threshold of its models had been used to assess their accuracy. However, with the cut-off adaptation approach, RF’s accuracy improved significantly.
The RF algorithm can take longer to train, depending on the complexity of the training data. However, once the model is built, its evaluation of a new instance is reasonably fast.
As expected, it consumed a lot of resources (memory) at the model building phase, and this consumption increased with the size of the training data. This was a result of the number of bootstrap samples it generated, which were used to build trees in parallel threads. The resulting models were quite large compared to the SVM and C5.0 models, and their sizes increased as the complexity of the training data increased.
SVM algorithm
Similar to C5.0 and RF algorithms, Friedman’s test was used to assess the above hypothesis for the SVM models. This test revealed that there was not enough evidence to support this hypothesis, χ2(9) = 13.1, p = 0.158 ≮ 0.05.
The SVM algorithm exhibited similar behaviour to the C5.0 algorithm. All of its statistical tests revealed insignificant effects between one feature set and another, and there was no sign that the improved accuracy of its models was influenced by any of the data balancing effects. As with the C5.0 algorithm, different behaviours were exhibited on different days, as presented in Table 6 (see Appendix A), so no consistent pattern could be deduced.
Although the SVM algorithm showed some overall improvement on days when the reduced feature sets were used instead of the Full features set, this behaviour was not consistent. As a linear version of SVM was used, this effect could have been caused by the non-linear nature of the data on those days where SVM failed to perform well.
Figure 11 summarises all of the accuracy readings in the tables (Table A4, Table A5 and Table A6) after the threshold adaptation process was applied. It compares the average accuracy of all the C5.0, RF and SVM models. This plot shows the average accuracy for each day’s model for all of the tested ML algorithms. The standard error of the average accuracy for each model is illustrated by vertical bars. For each algorithm, the mean accuracy of all models across all days for every combination of feature sets and data balance type is represented by a horizontal dashed line. As this plot shows, RF was always the highest performing of the ML algorithms evaluated. Unlike C5.0 and SVM, RF showed the most stable results with the least variability.
Figure 11. Comparison plot of the average accuracy of every C5.0, RF and SVM model for every feature set and data balance combination.
Although Figure 11 shows that the highest average accuracy of the RF models was attained when the Full features set was used, the difference in the average accuracy of its models was very small, unlike the accuracy of the C5.0 and SVM models, which showed higher variations in accuracy. Therefore, RF models could be generated using a reduced feature set without any significant decrease in their average accuracy, but with a significant gain in speed. Figure 11 also shows that there was only a high variation in the accuracy of models for Day 6; however, given that this day was problematic, with its skewed balance, this level of accuracy is more than acceptable. Moreover, in a real life scenario it would not be sensible to build a model using such data, hence this example is an extreme case, which is presented here merely to demonstrate that the RF algorithm performed reasonably well.
In general, although a linear SVM implementation was used in these experiments, it showed some good results. For example, on average, the accuracy of models trained on the original and balanced version of Day 4’s traffic, using MDG features, was above 90% (see Figure 11). The accuracy of models trained using the MDA features on the original version of Day 6 traffic (which only had 11 attacks), was also above 89%. With such results, more analysis is required to identify the right combination of fast kernel function and parameter tuning to improve the overall SVM results. This would make it an attractive solution for IDS problems.

8. Conclusions

In this work, we have presented the effect of prediction threshold (cut-off) adaptation on improving detection accuracy of binary-based prediction (IDS) models. We also presented how such an approach will benefit the IDS domain is maintaining detection models for long periods of time. The results of our experiments show that the adapted threshold provided a more accurate reading of a model’s true accuracy in comparison to the use of a fixed threshold. From these experiments, we highlight the following characteristics of threshold adaptation:
  • An adaptive cut-off (threshold) approach results in better classification performance than a fixed threshold.
  • Using a single cut-off (threshold) will lead to misleading results, which could result in a decision to terminate a good prediction model that merely required some tuning.
  • Threshold adaptation approach may not show significant improvement to a model’s accuracy when the testing data exhibits the same statistical properties as the training data.
The results of these analyses showed that RF outperformed the other algorithms (C5.0 and SVM) in its ability to predict new traffic and the detection of novel anomalies. It also showed that, before cut-off adaptation, all of the ML algorithms performed as poorly as each other, but that the adaptive cut-off approach increased their overall accuracy, with RF performing the best. Moreover, RF suffered no significant loss in accuracy when the reduced feature sets were used, and its predictions did not improve when the data was balanced, given that the prediction threshold is constantly adapted. This gives RF the advantage of being able to build models using original data with a reduced feature set, which will save a considerable amount of time in training and testing, which makes this algorithm more attractive for such problems.
In these analyses, the gAcc measures were used as the model assessment criteria to avoid issues with imbalanced data. The accuracy of all models was assessed using the non-parametric Friedman test to identify any significant differences. Cut-off adaptation and the algorithm used were the most important effects that contributed to any significant difference in a model’s accuracy.
Furthermore, this study also showed that the K-folds cross-validation (CV) analysis on the entire dataset reports over-optimistic results that would not reflect the true capability of detection models in real life setups. This technique failed to reveal and assess the true power of the detection capability of different ML models. For example, the C5.0 ranked higher than SVM when this technique was applied on the entire datasets (as in the first setup of the first experiment), however, it was no better when more a natural setup (prospective sampling technique) was in effect. As a result, research results using the K-folds CV technique should be carefully addressed in domains such as IDS.
In future work, we will investigate new approaches in identifying the optimal adaptive prediction threshold (cut-off) based on a small randomly selected sample of an evaluated traffic. Another potential avenue for further investigation is including larger, real industrial datasets with real world attacks and unknown labels, to determine if this research can be generalised and to compare the detection accuracy to some production IDS.

Author Contributions

Conceptualization, A.M.A.T. and I.D.; methodology, A.M.A.T. and I.D.; software, A.M.A.T.; validation, A.M.A.T.; formal analysis, A.M.A.T.; investigation, A.M.A.T.; resources, A.M.A.T. and I.D.; data curation, A.M.A.T.; writing—original draft preparation, A.M.A.T.; writing—review and editing, A.M.A.T. and I.D.; visualization, A.M.A.T.; supervision, I.D.; funding acquisition, A.M.A.T.

Funding

This research was supported and funded by the Government of the Sultanate of Oman represented by the Ministry of Higher Education and the Sultan Qaboos University.

Acknowledgments

We thank the editor and the anonymous reviewers for their constructive comments and suggestion, which provided a great help in improving this paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

This section presents the results of the experiments conducted in this paper. Table A1, Table A2 and Table A3 present the results of the second setup of the first set of experiments (discussed in Section 6) for every algorithm (C5.0, RF and SVM) with different datasets (gureKDD, SEA and AGR).
Table A4, Table A5 and Table A6 present the results of the second set of experiments (discussed in Section 7) for every algorithm (C5.0, RF and SVM) on the STA2018 dataset. These tables show the results of each model under different feature sets (Full, MDA, MDG, MDABal. and MDGBal.) and data balances (original and balanced).
Each shaded cell—of these tables—contains the maximum G-mean accuracy (gAcc) achieved at the K-folds cross-validation stage, where the model’s threshold was set. Every other cell contains two performance measures. The top measure is the model’s gAcc on the test file when its fixed optimal (CV) cut-off was used, and the second measure is the model’s gAcc when the cut-off was adapted for the test data. The measure in bold is the greater of the two measures.
Table A1. C5.0 model’s performance on different datasets with various effects (before and after threshold adaptation).
Table A1. C5.0 model’s performance on different datasets with various effects (before and after threshold adaptation).
File 1File 2File 3File 4File 5File 6File 7
gureKDDModel 10.39040.0000
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
Model 20.2181
0.8125
0.99810.2108
0.9740
0.7154
0.9248
0.2162
0.9715
0.5786
0.9646
0.1850
0.9805
Model 30.0000
0.8884
0.7127
0.9150
0.99950.8198
0.8997
0.9874
0.9956
0.9849
0.9928
0.9981
0.9994
Model 40.8727
0.9770
0.4109
0.8656
0.9948
0.9949
0.99810.9862
0.9965
0.9740
0.9944
0.9988
0.9995
Model 50.8448
0.9740
0.3315
0.7145
0.9963
0.9965
0.9107
0.9453
0.99980.8525
0.9977
0.9995
0.9995
Model 60.8451
0.9997
0.9610
0.9948
0.9986
0.9989
0.9093
0.9128
0.9996
0.9997
0.99980.9994
0.9994
Model 70.8161
0.9504
0.9894
0.9903
0.8435
0.9908
0.8454
0.9308
0.9321
0.9959
0.9802
0.9939
0.9998
SEAModel 10.87310.8740
0.8744
0.8046
0.8517
0.8052
0.8522
0.8361
0.8502
0.8362
0.8503
No results as there were only 6 files.
Model 20.8726
0.8736
0.87310.8086
0.8486
0.8074
0.8493
0.8373
0.8471
0.8372
0.8468
Model 30.8320
0.8574
0.8319
0.8586
0.88980.8896
0.8901
0.8592
0.8599
0.8593
0.8600
Model 40.8317
0.8612
0.8319
0.8617
0.8906
0.8906
0.89020.8599
0.8603
0.8599
0.8603
Model 50.8387
0.8700
0.8394
0.8704
0.8781
0.8821
0.8775
0.8821
0.2959
0.8567
0.8568
0.8569
Model 60.8391
0.8686
0.8395
0.8691
0.8762
0.8819
0.8759
0.8821
0.8563
0.8570
0.8559
AGRModel 10.94490.9443
0.9445
0.4844
0.4932
0.4850
0.4930
0.6873
0.6888
0.6873
0.6885
No results as there were only 6 files.
Model 20.9447
0.9448
0.94480.4835
0.4938
0.4829
0.4934
0.6871
0.6882
0.6867
0.6882
Model 30.4925
0.4932
0.4925
0.4929
0.93410.9341
0.9341
0.6968
0.6984
0.6976
0.6990
Model 40.4907
0.4916
0.4900
0.4911
0.9328
0.9334
0.93390.6956
0.6977
0.6964
0.6985
Model 50.7114
0.7492
0.7114
0.7484
0.7382
0.7623
0.7386
0.7624
0.70590.7079
0.7081
Model 60.7147
0.7463
0.7140
0.7459
0.7360
0.7626
0.7376
0.7628
0.7082
0.7085
0.7101
Figure A1. The gAcc curves for the C5.0 algorithm (see Table A1).
Table A2. Random Forest (RF) model’s performance on different datasets with various effects (before and after threshold adaptation).
Table A2. Random Forest (RF) model’s performance on different datasets with various effects (before and after threshold adaptation).
File 1File 2File 3File 4File 5File 6File 7
gureKDDModel 10.99870.9752
0.9777
0.8538
0.9914
0.7948
0.9423
0.6733
0.9937
0.7410
0.9929
0.9673
0.9952
Model 20.3085
0.9930
0.99840.9807
0.9851
0.9103
0.9359
0.9657
0.9699
0.9531
0.9563
0.9859
0.9963
Model 30.2182
0.9205
0.6430
0.9902
0.99960.5304
0.9324
0.9898
0.9951
0.8262
0.9930
0.9815
0.9994
Model 40.8448
0.9970
0.7060
0.9894
0.9953
0.9966
0.99830.9862
0.9990
0.9747
0.9947
0.9987
0.9995
Model 50.8165
0.9736
0.6326
0.8836
0.9968
0.9969
0.9311
0.9418
0.99990.9980
0.9981
0.9996
0.9996
Model 60.8863
0.9981
0.9542
0.9965
0.9989
0.9991
0.9082
0.9486
0.9998
0.9998
0.99990.9994
0.9996
Model 70.8448
0.9754
0.9841
0.9908
0.9884
0.9986
0.9300
0.9352
0.9914
0.9970
0.9961
0.9974
0.9999
SEAModel 10.87500.8758
0.8758
0.8696
0.8764
0.8027
0.8053
0.8358
0.8364
0.8358
0.8364
No results as there were only 6 files.
Model 20.8752
0.8752
0.87570.8026
0.8053
0.8680
0.8759
0.8357
0.8363
0.8357
0.8363
Model 30.8536
0.9113
0.8323
0.8338
0.89200.8924
0.8926
0.8609
0.8610
0.8604
0.8607
Model 40.8319
0.8333
0.8636
0.9113
0.8921
0.8921
0.89250.8609
0.8612
0.8606
0.8609
Model 50.8389
0.8685
0.8393
0.8695
0.8782
0.8832
0.8786
0.8839
0.85760.8572
0.8576
Model 60.8369
0.8691
0.8368
0.8695
0.8806
0.8829
0.8815
0.8835
0.8579
0.8579
0.8574
AGRModel 10.94830.9482
0.9484
0.4774
0.5042
0.4812
0.5040
0.6869
0.6895
0.6871
0.6893
No results as there were only 6 files.
Model 20.9488
0.9490
0.94860.4788
0.5059
0.4762
0.5054
0.6858
0.6894
0.6856
0.6895
Model 30.4900
0.4939
0.4933
0.4940
0.93870.9387
0.9395
0.6989
0.7007
0.6995
0.7019
Model 40.4928
0.4939
0.4902
0.4942
0.9390
0.9391
0.93980.6995
0.7011
0.6997
0.7016
Model 50.7208
0.7620
0.7212
0.7620
0.7401
0.7785
0.7402
0.7779
0.71270.7144
0.7149
Model 60.7248
0.7607
0.7243
0.7608
0.7351
0.7788
0.7367
0.7805
0.7139
0.7140
0.7129
Figure A2. The gAcc curves for Random Forest algorithm (see Table A2).
Table A3. SVM model’s performance on different datasets with various effects (before and after threshold adaptation).
Table A3. SVM model’s performance on different datasets with various effects (before and after threshold adaptation).
File 1File 2File 3File 4File 5File 6File 7
gureKDDModel 10.92500.6471
0.6545
0.8171
0.9727
0.2401
0.3504
0.8974
0.9695
0.8250
0.9123
0.9665
0.9715
Model 20.0000
0.8253
0.98690.1701
0.5024
0.3076
0.4525
0.1176
0.5163
0.3457
0.5285
0.1116
0.6131
Model 30.8092
0.8303
0.7206
0.9022
0.99770.7544
0.9028
0.9699
0.9793
0.9583
0.9636
0.9791
0.9878
Model 40.9195
0.9196
0.2794
0.6683
0.9678
0.9941
0.95910.9766
0.9958
0.9783
0.9840
0.9867
0.9986
Model 50.8724
0.9757
0.2233
0.6778
0.9865
0.9922
0.9172
0.9339
0.99920.8503
0.8507
0.9983
0.9985
Model 60.8443
0.9531
0.2804
0.6929
0.9894
0.9907
0.9145
0.9270
0.9976
0.9979
0.99700.9986
0.9986
Model 70.8165
0.8518
0.3163
0.8853
0.9944
0.9944
0.9107
0.9366
0.9960
0.9962
0.8476
0.9434
0.9994
SEAModel 10.87630.8771
0.8771
0.8018
0.8936
0.8016
0.8941
0.8358
0.8617
0.8360
0.8615
No results as there were only 6 files.
Model 20.8759
0.8760
0.87650.8021
0.8928
0.8018
0.8933
0.8356
0.8613
0.8359
0.8610
Model 30.8319
0.8763
0.8320
0.8770
0.89330.8939
0.8940
0.8615
0.8617
0.8613
0.8615
Model 40.8319
0.8763
0.8321
0.8769
0.8933
0.8933
0.89380.8615
0.8616
0.8612
0.8613
Model 50.8325
0.8759
0.8326
0.8765
0.8924
0.8928
0.8929
0.8932
0.86140.8610
0.8611
Model 60.8331
0.8756
0.8332
0.8760
0.8914
0.8923
0.8918
0.8927
0.8612
0.8612
0.8609
AGRModel 10.55290.5614
0.5615
0.4695
0.5106
0.4676
0.5079
0.5148
0.5211
0.5112
0.5178
No results as there were only 6 files.
Model 20.5494
0.5498
0.54790.4829
0.5045
0.4813
0.5032
0.5148
0.5174
0.5125
0.5161
Model 30.4879
0.4995
0.4877
0.5004
0.64400.6460
0.6462
0.5656
0.5659
0.5676
0.5685
Model 40.4862
0.4991
0.4861
0.5005
0.6450
0.6453
0.64670.5652
0.5664
0.5673
0.5688
Model 50.4867
0.4990
0.4862
0.5003
0.6338
0.6348
0.6352
0.6365
0.55980.5620
0.5623
Model 60.4892
0.4996
0.4889
0.5006
0.6357
0.6374
0.6362
0.6384
0.5615
0.5617
0.5632
Figure A3. The gAcc curves for the SVM algorithm (see Table A3).
Table A4. The gAcc of models for the fixed optimal (CV) and adapted cut-off (threshold) for the C5.0 algorithm.
Table A4. The gAcc of models for the fixed optimal (CV) and adapted cut-off (threshold) for the C5.0 algorithm.
OriginalBalance
Day 2Day 3Day 4Day 5Day 6Day 7Day 2Day 3Day 4Day 5Day 6Day 7
FullMDL 20.99960.00990.01760.00530.80620.05680.99990.01000.02410.00530.89950.0562
0.58820.50450.01480.99990.99340.90440.63870.01441.00000.9879
MDL 30.59670.98340.81060.01050.00000.94770.90140.98230.86610.01300.63040.7753
0.97450.91370.02110.56500.97760.94870.90430.01340.96520.8440
MDL 40.03750.93730.98130.00000.00000.03160.99160.92490.98150.01290.99850.7109
0.99370.95070.01300.95140.06160.99310.92710.01760.99890.9900
MDL 50.99820.00000.02160.99770.90450.08750.88200.01000.36100.99750.85230.7759
0.99930.19500.49560.99990.98950.99340.14190.62420.99660.9907
MDL 60.00000.00000.00000.00000.46060.00000.99100.02000.02500.04791.00000.7458
0.00000.00000.00000.00000.00000.99880.14850.02780.04800.9980
MDL 70.99010.91510.43570.00430.99980.99990.99560.38260.39160.00920.99991.0000
0.99450.94130.76290.01320.99980.99620.92100.67030.02581.0000
MDAMDL 20.99980.00000.02790.04790.85280.05681.00000.00000.02790.04790.85280.0568
0.09400.02790.04790.85280.05680.09400.02790.04790.85280.0568
MDL 30.98140.98380.50620.21460.73270.14030.52780.98260.85040.88070.83440.8673
0.98450.57200.73320.98900.45310.75200.85580.94200.83950.9405
MDL 40.11270.89990.98020.04020.30130.02830.79050.91930.98000.07900.99850.0647
0.66530.94760.34470.85020.11080.98820.92990.36590.99920.4175
MDL 50.99020.01000.09830.99770.85280.05500.90870.01000.66970.99740.94780.8538
0.99900.45680.49440.95040.14380.98570.03780.82770.98960.9923
MDL 60.00000.00000.00000.00000.46060.00000.98080.01410.01280.04731.00000.0532
0.00000.00000.00000.00000.00001.00000.80350.41430.74860.9999
MDL 70.99230.89800.43480.04850.99980.99980.94690.12640.02490.78371.00000.9998
0.99760.92860.44980.98371.00000.99400.92360.81370.99071.0000
MDGMDL 20.99980.00000.02790.04790.85280.05681.00000.00000.02790.04790.85280.0568
0.09400.02790.04790.85280.05680.09400.02790.04790.85280.0568
MDL 30.13090.97020.81380.59490.00000.69220.40310.96560.89760.55220.51200.7539
0.82790.89790.79170.41710.77960.89780.90920.68950.71270.8859
MDL 40.64960.91600.94310.99060.00000.99760.56880.88350.91650.90490.00000.9666
0.97990.92460.99120.98150.99770.86390.90260.91000.00000.9792
MDL 50.92080.02000.01760.99690.00000.96110.99270.01410.32490.99720.85190.9978
0.98960.25230.81860.95290.98360.99610.81600.88210.99650.9982
MDL 60.00000.00000.00000.00000.46060.00000.06140.01000.00000.00921.00000.0142
0.00000.00000.00000.00000.00000.99350.80940.41540.08220.9993
MDL 70.99430.89100.43420.04850.99990.99990.98670.90680.43900.54400.95250.9997
0.99760.92530.44230.98371.00000.99440.91420.65590.98530.9962
MDABal.MDL 20.99980.00000.02790.04790.85280.05681.00000.00000.02790.04790.85280.0568
0.09400.02790.04790.85280.05680.09400.02790.04790.85280.0568
MDL 30.98490.98330.50690.87820.51560.47360.70770.98210.60740.50150.71240.0816
0.98760.59670.92960.98430.84840.88530.83780.86070.77310.4364
MDL 40.13370.94350.98020.05800.30130.02450.98520.62900.98050.06270.99580.4256
0.66460.94920.98240.84960.10620.99060.92320.14550.99710.6606
MDL 50.99840.11420.09930.99770.79770.05680.81560.01000.02780.99740.73680.3558
0.99950.45860.72360.93670.67290.98880.15690.67890.99540.9915
MDL 60.00000.00000.00000.00000.46060.00000.06140.01000.01250.00921.00000.0142
0.00000.00000.00000.00000.00000.99850.80620.41150.74650.9996
MDL 70.99230.89520.43480.04850.99980.99980.98730.19780.08920.78950.99990.9998
0.99760.92850.44450.98291.00000.99400.92360.81370.99071.0000
MDGBal.MDL 20.99980.00000.02790.04790.85280.05681.00000.00000.02790.04790.85280.0568
0.09400.02790.04790.85280.05680.09400.02790.04790.85280.0568
MDL 30.29100.98270.61450.14150.00000.42290.88190.98200.81890.92220.53120.6688
0.44770.83080.75890.00000.80310.88960.83880.93000.57830.8254
MDL 40.64990.94850.98110.97640.30110.05670.98240.92650.98030.76250.85100.0615
0.93420.94970.97990.99180.35650.98260.92650.93260.93410.0756
MDL 50.96730.01580.10480.99760.79770.05310.82010.00000.07360.99730.85100.4139
0.99850.43390.49770.95220.06800.94850.15620.76460.90050.8068
MDL 60.00000.00000.00000.00000.46060.00000.06140.01000.00000.00921.00000.0142
0.00000.00000.00000.00000.00000.99350.80940.41540.08220.9993
MDL 70.99290.18010.07290.78140.99970.99980.98830.00640.01390.05330.99980.9999
0.99540.92670.48130.97781.00000.99000.90540.54120.99131.0000
Table A5. The gAcc of models for the fixed optimal (CV) and adapted cut-off (threshold) for the Random Forest (RF) algorithm.
Table A5. The gAcc of models for the fixed optimal (CV) and adapted cut-off (threshold) for the Random Forest (RF) algorithm.
OriginalBalance
Day 2Day 3Day 4Day 5Day 6Day 7Day 2Day 3Day 4Day 5Day 6Day 7
FullMDL 21.00000.00420.02790.04821.00000.06021.00000.01000.02160.04760.95350.0559
0.92720.94780.90941.00000.99870.92930.94130.92571.00000.9987
MDL 30.95210.98490.92130.68840.85600.94110.96630.98480.91910.74980.95300.9758
0.99320.97390.95770.98700.99760.99250.96880.93400.97930.9987
MDL 40.94280.93010.98270.84250.88320.94040.97620.92370.98290.89910.92680.9688
0.99450.95600.99200.99980.99780.99480.94710.98580.99970.9882
MDL 50.99970.01330.06480.99780.95350.05850.98880.22590.68080.99810.99340.9925
0.99990.92050.93951.00000.99760.99710.91560.93020.99980.9926
MDL 60.99120.00000.02500.04791.00000.05680.02170.01000.01250.00531.00000.0142
1.00000.91200.57370.96710.99980.99990.86850.68370.10480.9998
MDL 70.99640.91400.43140.78641.00001.00000.02170.01000.01250.03410.52791.0000
0.99790.93510.93400.99111.00000.99980.93810.93190.97551.0000
MDAMDL 21.00000.04630.02790.04821.00000.05851.00000.01000.02160.04760.85280.0564
0.93080.94170.93221.00000.99830.92810.95040.79991.00000.9984
MDL 30.97900.98480.96360.90890.95700.98900.96830.98480.91970.81580.87810.9282
0.98950.97380.94900.98440.99630.99270.96900.93360.98390.9983
MDL 40.94930.93740.98260.89130.89560.94480.94630.92690.98270.88820.90240.9467
0.99470.95570.99170.99710.99840.99480.94490.98890.99930.9942
MDL 50.99970.01410.08720.99780.95350.06190.98910.24390.66200.99810.99350.9933
0.99990.91620.93881.00000.99510.99960.92180.94110.99810.9935
MDL 60.99830.01000.02500.04791.00000.05680.02170.01000.01760.00531.00000.0142
1.00000.85530.46490.33260.99991.00000.93200.49560.17311.0000
MDL 70.99330.93090.44590.98751.00001.00000.02170.01000.01250.03670.40531.0000
0.99710.93600.93700.99111.00000.99980.93550.92800.96991.0000
MDGMDL 21.00000.04520.02790.04821.00000.05851.00000.01000.02160.04730.85280.0561
0.92960.93170.84251.00000.99820.92870.94570.94301.00000.9984
MDL 30.50350.97800.91710.63070.00000.78680.53080.96980.90510.56220.00000.6543
0.83400.91850.81950.78870.94710.80000.91120.72040.70870.8924
MDL 40.70320.91600.94320.98560.00000.98840.68000.89960.89060.98290.00000.9531
0.77030.92420.99130.82360.99740.86760.90000.98290.74990.9538
MDL 50.13900.01000.02540.99780.67420.03750.98970.02230.54500.99810.99990.3540
0.99920.85180.91380.99960.99760.99260.80160.88171.00000.9982
MDL 60.99820.00000.02500.04791.00000.05680.02170.01000.01250.00531.00000.0142
1.00000.85730.43490.64490.99981.00000.87100.74460.07671.0000
MDL 70.99260.93530.44930.99110.99991.00000.02170.01000.01250.01360.41161.0000
0.99720.93570.93540.99121.00001.00000.93760.93770.97001.0000
MDABal.MDL 21.00000.04180.02790.04821.00000.05861.00000.01000.02160.04730.85280.0559
0.93310.94340.97391.00000.99850.92770.94600.96031.00000.9985
MDL 30.97580.98480.95580.87550.88840.97770.97420.98490.91220.85150.89800.9522
0.99330.97470.96040.98260.99650.99190.97150.91030.98330.9983
MDL 40.95050.93490.98240.88010.89030.94640.95080.92750.98270.88700.89120.9422
0.99470.95600.99130.99600.99820.99450.94420.98970.99910.9864
MDL 50.99960.01410.06710.99780.95350.06190.98680.23310.67780.99810.99330.9934
0.99990.91630.94020.99990.99840.99730.91410.93440.99690.9941
MDL 60.99470.00000.02500.04791.00000.05680.02170.01000.01250.00531.00000.0142
1.00000.86100.43670.94820.99981.00000.87530.45260.35821.0000
MDL 70.99550.91480.43290.90831.00001.00000.02170.01000.01250.04650.95351.0000
0.99760.93660.93490.99071.00000.99970.93350.93640.96801.0000
MDGBal.MDL 21.00000.04570.02790.04821.00000.05851.00000.01000.02160.04730.85280.0560
0.92570.94090.94391.00000.99830.92720.95120.80911.00000.9984
MDL 30.97950.98500.93460.82270.82500.89900.97100.98500.89500.78240.79850.8641
0.98490.95560.82570.93090.92310.98920.95170.85830.92150.9974
MDL 40.95710.93390.98270.88790.88890.94220.94380.93130.98260.89910.90960.9547
0.99230.95400.96960.93460.99620.99470.94110.97520.95210.9949
MDL 50.99750.01410.03300.99780.95350.05310.98810.09030.61630.99810.99310.9366
0.99930.86930.90490.99950.99540.99410.91330.93400.99510.9927
MDL 60.99520.01410.02500.04791.00000.05680.02170.01000.01250.00531.00000.0142
1.00000.86420.44490.96110.99991.00000.86750.59070.07471.0000
MDL 70.99700.90360.43030.05051.00001.00000.02170.01000.01250.00530.42731.0000
0.99740.93910.93650.99081.00001.00000.93390.94120.98931.0000
Table A6. The gAcc of models for the fixed optimal (CV) and adapted cut-off (threshold) for the SVM algorithm.
Table A6. The gAcc of models for the fixed optimal (CV) and adapted cut-off (threshold) for the SVM algorithm.
OriginalBalance
Day 2Day 3Day 4Day 5Day 6Day 7Day 2Day 3Day 4Day 5Day 6Day 7
FullMDL 21.00000.16300.36950.05730.98570.96810.99960.90760.52940.60400.99970.8786
0.17560.39770.18910.98750.97030.91370.56220.90110.99990.9966
MDL 30.51330.98090.51840.36240.78000.82790.49390.97540.46410.40420.40230.4473
0.93120.52000.80530.95170.86610.82350.61120.74560.82890.8026
MDL 40.35450.91610.98140.05380.91820.49420.12420.73120.98000.01600.00000.0200
0.93970.94570.38520.95890.75600.91540.94050.75630.96770.8836
MDL 50.00000.10260.70300.99610.00000.73850.00000.13870.51760.99130.42490.7042
0.86740.17610.80650.95310.84290.94900.14740.52080.92420.7166
MDL 60.98060.87500.62960.05750.99940.99850.33460.02820.02160.10271.00000.5148
0.99130.89310.64110.51010.99900.99780.14680.23670.95750.8563
MDL 70.98200.10700.48820.05990.95320.99990.98600.04890.17970.96670.99950.9999
0.99060.21430.60210.74170.99970.99310.16110.54370.98430.9995
MDAMDL 21.00000.92060.48120.05420.95340.05850.99970.61430.43740.21120.95340.0568
0.92240.74700.19770.99990.99110.69450.78890.85071.00000.9921
MDL 30.00000.98090.85530.02370.83810.10510.93300.97650.87370.43060.71850.8755
0.24430.86580.03820.93370.69970.96210.87430.84870.91430.9184
MDL 40.61770.94470.98100.46650.90380.07220.14660.93970.97920.92260.00000.0317
0.97250.94540.58710.99780.73300.87070.94130.93600.97490.7675
MDL 50.00000.03070.52240.99570.00000.00000.91780.12680.52030.99140.00000.0000
0.97430.14440.61060.95290.72130.98650.18120.52850.94330.4924
MDL 60.99530.85350.74740.96850.99980.05680.99920.89660.37350.96861.00000.0568
0.99930.91340.89510.98060.99080.99920.92420.47740.98210.9402
MDL 70.26220.06910.56730.87810.95150.99990.97950.12870.28200.96690.95260.9995
0.99310.13080.60900.91170.99710.99220.18910.64850.99020.9989
MDGMDL 21.00000.77060.60990.04290.99980.06020.99970.67130.46380.04800.95340.0568
0.79120.87360.19530.99990.99620.74000.48880.41821.00000.9905
MDL 30.00000.93910.61410.00000.00000.00000.00000.93570.65550.00000.00000.0000
0.51360.68470.56390.37260.35310.54810.66850.59000.39300.5643
MDL 40.82590.74800.88510.98700.98310.99100.97730.86560.90920.98340.98880.9888
0.96750.80620.98790.99270.99560.98330.86650.98840.99280.9958
MDL 50.00000.03310.76490.99500.95280.00000.08030.06950.47300.98490.94790.0647
0.93780.14500.77270.99670.36570.77910.11760.48390.99350.2488
MDL 60.99940.08950.02500.81810.99990.05850.99160.00000.02100.78531.00000.0550
0.99960.87180.57030.97100.88490.99920.88280.45950.96730.9852
MDL 70.00000.91200.45100.68010.95190.99980.85790.90960.50990.96550.95170.9995
0.97420.92010.62050.75800.99740.99030.91000.63280.98490.9982
MDABal.MDL 21.00000.83730.68230.07270.99990.05850.99940.92020.47440.04800.99990.0778
0.85900.86700.29760.99990.99430.93070.53840.72131.00000.9972
MDL 30.00000.98070.75620.02370.82470.20690.93660.97610.87810.48060.74600.8900
0.31850.77650.03820.94150.73260.95930.88040.83010.92350.9226
MDL 40.87490.92980.98100.46130.95120.08370.13650.93840.97910.91230.00000.0245
0.97270.94160.55780.99570.67910.85780.93990.91600.98180.7552
MDL 50.00000.02230.47240.99570.00000.00000.91510.17540.57670.99240.00000.0000
0.97450.14430.54770.95310.71370.98680.21390.60230.94100.6056
MDL 60.97890.04790.56010.93361.00000.12910.59560.73990.23730.15661.00000.0492
0.99400.33770.76030.96400.97490.99840.91350.44650.95620.9925
MDL 70.07950.05820.60140.39420.99820.99990.67210.13560.40070.96940.95290.9996
0.99020.15130.64180.57820.99860.99250.21040.63280.98490.9992
MDGBal.MDL 21.00000.92630.71970.04740.95340.05850.99950.82790.45800.04840.99980.0776
0.93110.91640.56520.99990.99740.91400.55880.82780.99990.9976
MDL 30.87270.97800.64100.35830.75860.86080.53330.96950.48300.46160.62260.7089
0.98000.75870.53330.90020.89950.89540.48340.46540.68940.7755
MDL 40.12250.90350.98010.76840.00000.03160.12380.94460.97810.89690.00000.0245
0.98230.94350.95350.98510.48720.89830.94490.89830.89810.6145
MDL 50.03070.01410.62890.99580.73840.00000.88770.05180.08890.99000.00000.0000
0.96940.16870.66890.95210.66160.98270.14260.30740.97140.5188
MDL 60.99780.01000.70010.58541.00000.16700.73350.87460.37460.96501.00000.0531
0.99960.28160.79270.95950.99080.99610.91990.46830.96650.9930
MDL 70.00000.68300.84420.55960.95270.99980.85090.12880.27950.96920.99900.9996
0.93200.71890.89910.72600.99780.99210.25760.68700.98730.9994

References

  1. Cherdantseva, Y.; Hilton, J. A Reference Model of Information Assurance & Security. In Proceedings of the 2013 International Conference on Availability, Reliability and Security, Regensburg, Germany, 2–6 September 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 546–555. [Google Scholar]
  2. Das, S.; Datta, S.; Chaudhuri, B.B. Handling data irregularities in classification: Foundations, trends, and future challenges. Pattern Recognit. 2018, 81, 674–693. [Google Scholar] [CrossRef]
  3. Kotsiantis, S.B. Supervised Machine Learning: A Review of Classification Techniques. Informatica 2007, 31, 249–268. [Google Scholar]
  4. Burlutskiy, N.; Petridis, M.; Fish, A.; Chernov, A.; Ali, N. An Investigation on Online Versus Batch Learning in Predicting User Behaviour. In Research and Development in Intelligent Systems XXXIII; Springer International Publishing: Cham, Switzerland, 2016; pp. 135–149. [Google Scholar]
  5. Chen, J.J.; Tsai, C.-A.; Moon, H.; Ahn, H.; Young, J.J.; Chen, C.-H. Decision threshold adjustment in class prediction. SAR QSAR Environ. Res. 2006, 17, 337–352. [Google Scholar] [CrossRef]
  6. Catania, C.A.; Garino, C.G. Automatic network intrusion detection: Current techniques and open issues. Comput. Electr. Eng. 2012, 38, 1062–1072. [Google Scholar] [CrossRef]
  7. Freeman, E.A.; Moisen, G.G. A comparison of the performance of threshold criteria for binary classification in terms of predicted prevalence and kappa. Ecol. Modell. 2008, 217, 48–58. [Google Scholar] [CrossRef]
  8. Buczak, A.L.; Guven, E. A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection. IEEE Commun. Surv. Tutor. 2016, 18, 1153–1176. [Google Scholar] [CrossRef]
  9. Beguería, S. Validation and Evaluation of Predictive Models in Hazard Assessment and Risk Management. Nat. Hazards 2006, 37, 315–329. [Google Scholar] [CrossRef]
  10. Yang, Y. A study of thresholding strategies for text categorization. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 01), New Orleans, LA, USA, 9–13 September 2001; ACM Press: New York, NY, USA, 2001; pp. 137–145. [Google Scholar]
  11. Lakhina, A.; Crovella, M.; Diot, C. Diagnosing network-wide traffic anomalies. In Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM ’04), Portland, OR, USA, 30 August–3 September 2004; ACM Press: New York, NY, USA, 2004; Volume 34, pp. 219–230. [Google Scholar]
  12. Fan, R.-E.; Lin, C.-J. A Study on Threshold Selection for Multi-Label Classification; Technical Report; National Taiwan University: Taipei, Taiwan, 2007. [Google Scholar]
  13. Pillai, I.; Fumera, G.; Roli, F. Threshold optimisation for multi-label classifiers. Pattern Recognit. 2013, 46, 2055–2065. [Google Scholar] [CrossRef]
  14. Koyejo, O.O.; Natarajan, N.; Ravikumar, P.K.; Dhillon, I.S. Consistent Binary Classification with Generalized Performance Metrics. In Proceedings of the Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 2744–2752. [Google Scholar]
  15. Yan, B.; Koyejo, O.; Zhong, K.; Ravikumar, P. Binary Classification with Karmic, Threshold-Quasi-Concave Metrics. arXiv 2018, arXiv:1806.00640. [Google Scholar]
  16. Eskin, E.; Miller, M.; Zhong, Z.-D.; Yi, G.; Lee, W.-A.; Stolfo, S. Adaptive Model Generation for Intrusion Detection Systems. In Proceedings of the ACMCCS Workshop on Intrusion Detection and Prevention, Athens, Greece, 1 November 2000; pp. 1–14. [Google Scholar]
  17. Honig, A.; Howard, A.; Eskin, E.; Stolfo, S. Adaptive Model Generation: An Architecture for Deployment of Data Mining-based Intrusion Detection Systems. In Applications of Data Mining in Computer Security; Springer: Boston, MA, USA, 2002; pp. 153–193. [Google Scholar]
  18. Hossain, M.; Bridges, S.M. A Framework for an Adaptive Intrusion Detection System with Data Mining. In Proceedings of the 13th Annual Canadian Information Technology Security Symposium, Ottawa, ON, Canada, 11–15 June 2001; pp. 1–8. [Google Scholar]
  19. Hossain, M.; Bridges, S.M.; Vaughn, R.B. Adaptive intrusion detection with data mining. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Washington, DC, USA, 8 October 2003; IEEE: Piscataway, NJ, USA, 2003; Volume 4, pp. 3097–3103. [Google Scholar]
  20. Jung, J.; Paxson, V.; Berger, A.W.; Balakrishnan, H. Fast portscan detection using sequential hypothesis testing. In Proceedings of the IEEE Symposium on Security and Privacy, Berkeley, CA, USA, 12 May 2004; IEEE: Piscataway, NJ, USA, 2004; pp. 211–225. [Google Scholar]
  21. Ali, M.Q.; Al-Shaer, E.; Khan, H.; Khayam, S.A. Automated Anomaly Detector Adaptation using Adaptive Threshold Tuning. ACM Trans. Inf. Syst. Secur. 2013, 15, 1–30. [Google Scholar] [CrossRef]
  22. Idé, T.; Kashima, H. Eigenspace-based anomaly detection in computer systems. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; ACM Press: New York, NY, USA, 2004; pp. 440–449. [Google Scholar]
  23. Yu, Z.; Tsai, J.J.P.; Weigert, T. An Automatically Tuning Intrusion Detection System. IEEE Trans. Syst. Man Cybern. Part B 2007, 37, 373–384. [Google Scholar] [CrossRef]
  24. Yu, Z.; Tsai, J.J.P.; Weigert, T. An adaptive automatically tuning intrusion detection system. ACM Trans. Auton. Adapt. Syst. 2008, 3, 10:1–10:25. [Google Scholar] [CrossRef]
  25. Chou, H.-H.; Wang, S.-D. An adaptive network intrusion detection approach for the cloud environment. In Proceedings of the International Carnahan Conference on Security Technology (ICCST), Taipei, Taiwan, 21–24 September 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1–6. [Google Scholar]
  26. Agosta, J.M.; Diuk-Wasser, C.; Chandrashekar, J.; Livadas, C. An Adaptive Anomaly Detector for Worm Detection. In Proceedings of the 2nd USENIX Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SYSML’07), Cambridge, MA, USA, 10 April 2007; USENIX Association: Berkeley, CA, USA, 2007; pp. 3:1–3:6. [Google Scholar]
  27. Gu, G.; Fogla, P.; Dagon, D.; Lee, W.; Skoric, B. Towards an Information-Theoretic Framework for Analyzing Intrusion Detection Systems. In Proceedings of the European Symposium on Research in Computer Security (ESORICS 2006), Hamburg, Germany, 18–20 September 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 527–546. [Google Scholar]
  28. Strasburg, C.; Basu, S.; Wong, J.S. S-MAIDS: A Semantic Model for Automated Tuning, Correlation, and Response Selection in Intrusion Detection Systems. In Proceedings of the the IEEE 37th Annual Computer Software and Applications Conference (COMPSAC), Kyoto, Japan, 22–26 July 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 319–328. [Google Scholar]
  29. Jyothsna, V.; Rama Prasad, V.V. Assessing degree of intrusion scope (DIS): a statistical strategy for anomaly based intrusion detection. CSI Trans. ICT 2018, 6, 99–127. [Google Scholar] [CrossRef]
  30. Bifet, A.; Holmes, G.; Pfahringer, B.; Kirkby, R.; Gavaldà, R. New ensemble methods for evolving data streams. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009; ACM Press: New York, NY, USA, 2009; pp. 139–148. [Google Scholar]
  31. Bifet, A.; Holmes, G.; Kirkby, R.; Pfahringer, B. MOA: Massive Online Analysis. J. Mach. Learn. Res. 2010, 11, 1601–1604. [Google Scholar]
  32. Masud, M.; Gao, J.; Khan, L.; Han, J.; Thuraisingham, B.M. Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints. IEEE Trans. Knowl. Data Eng. 2011, 23, 859–874. [Google Scholar] [CrossRef]
  33. Farid, D.M.; Zhang, L.; Hossain, A.; Rahman, C.M.; Strachan, R.; Sexton, G.; Dahal, K. An adaptive ensemble classifier for mining concept drifting data streams. Expert Syst. Appl. 2013, 40, 5895–5906. [Google Scholar] [CrossRef]
  34. Masud, M.M.; Chen, Q.; Khan, L.; Aggarwal, C.; Gao, J.; Han, J.; Thuraisingham, B. Addressing Concept-Evolution in Concept-Drifting Data Streams. In Proceedings of the the IEEE International Conference on Data Mining (ICDM), Sydney, NSW, Australia, 13–17 December 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 929–934. [Google Scholar]
  35. Masud, M.M.; Chen, Q.; Khan, L.; Aggarwal, C.C.; Gao, J.; Han, J.; Srivastava, A.; Oza, N.C. Classification and Adaptive Novel Class Detection of Feature-Evolving Data Streams. IEEE Trans. Knowl. Data Eng. 2013, 25, 1484–1497. [Google Scholar] [CrossRef]
  36. Cretu-Ciocarlie, G.F.; Stavrou, A.; Locasto, M.E.; Stolfo, S.J. Adaptive Anomaly Detection via Self-calibration and Dynamic Updating. In Recent Advances in Intrusion Detection (RAID 2009); Springer: Berlin/Heidelberg, Germany, 2009; Volume 5758, pp. 41–60. [Google Scholar]
  37. Chen, S.; Wang, H.; Zhou, S.; Yu, P.S. Stop Chasing Trends: Discovering High Order Models in Evolving Data. In Proceedings of the IEEE 24th International Conference on Data Engineering, Cancun, Mexico, 7–12 April 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 923–932. [Google Scholar]
  38. Gomes, H.M.; Bifet, A.; Read, J.; Barddal, J.P.; Enembreck, F.; Pfharinger, B.; Holmes, G.; Abdessalem, T. Adaptive random forests for evolving data stream classification. Mach. Learn. 2017, 106, 1469–1495. [Google Scholar] [CrossRef]
  39. Kotłowski, W.; Dembczyński, K. Surrogate regret bounds for generalized classification performance metrics. Mach. Learn. 2017, 106, 549–572. [Google Scholar] [CrossRef]
  40. Kubat, M.; Matwin, S. Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. In Proceedings of the 14th International Conference on Machine Learning (ICML97), Nashville, TN, USA, 8–12 July 1997; pp. 179–186. [Google Scholar]
  41. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
  42. Kuncheva, L.I.; Arnaiz-González, Á.; Díez-Pastor, J.-F.; Gunn, I.A.D. Instance selection improves geometric mean accuracy: a study on imbalanced data classification. Prog. Artif. Intell. 2019, 1–14. [Google Scholar] [CrossRef]
  43. Rao, C.M.; Naidu, M.M. A Model for Generating Synthetic Network Flows and Accuracy Index for Evaluation of Anomaly Network Intrusion Detection Systems. Indian J. Sci. Technol. 2017, 10, 1–16. [Google Scholar] [CrossRef]
  44. Rao, C.M.; Naidu, M.M. Acceptance Sampling for Network Intrusion Detection. J. Theor. Appl. Inf. Technol. 2017, 95, 6707–6718. [Google Scholar]
  45. Bujlow, T.; Riaz, T.; Pedersen, J.M. A method for classification of network traffic based on C5.0 Machine Learning Algorithm. In Proceedings of the International Conference on Computing, Networking and Communications (ICNC), Maui, HI, USA, 30 January–2 February 2012; pp. 237–241. [Google Scholar]
  46. Raghav Aggiwal Introduction to Random Forest. Available online: https://dimensionless.in/tag/random-forest/ (accessed on 21 February 2019).
  47. Aporras. What Is the Difference between Bagging and Boosting? Available online: https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/ (accessed on 21 February 2019).
  48. Rulequest Research Is C5.0 Better Than C4.5? Available online: https://rulequest.com/see5-comparison.html (accessed on 21 February 2019).
  49. Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kauffmann Publishers, Inc.: Burlington, MA, USA, 1993. [Google Scholar]
  50. Rulequest Research C5.0: An Informal Tutorial. Available online: https://www.rulequest.com/see5-unix.html (accessed on 21 February 2019).
  51. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; Chapman & Hall, Inc.: Boca Raton, FL, USA, 1993. [Google Scholar]
  52. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar]
  53. Li, C. Probability Estimation in Random Forests. Master’s Thesis, Department of Mathematics and Statistics, Utah State University, Logan, UT, USA, 2013. [Google Scholar]
  54. Witten, I.H.; Frank, E.; Hall, M.A.; Pal, C.J. Data Mining: Practical Machine Learning Tools and Techniques; Morgan Kaufmann: Burlington, MA, USA, 2016; ISBN 9780128043578. [Google Scholar]
  55. Resende, P.A.A.; Drummond, A.C. A Survey of Random Forest Based Methods for Intrusion Detection Systems. ACM Comput. Surv. 2018, 51, 48:1–48:36. [Google Scholar] [CrossRef]
  56. Khoshgoftaar, T.M.; Golawala, M.; Hulse, J. Van An Empirical Study of Learning from Imbalanced Data Using Random Forest. In Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007), Patras, Greece, 29–31 October 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 310–317. [Google Scholar]
  57. Lin, S.-W.; Ying, K.-C.; Lee, C.-Y.; Lee, Z.-J. An intelligent algorithm with feature selection and decision rules applied to anomaly intrusion detection. Appl. Soft Comput. 2012, 12, 3285–3290. [Google Scholar] [CrossRef]
  58. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  59. Vapnik, V. Statistical Learning Theory; John Wiley and Sons, Inc.: New York, NY, USA, 1998. [Google Scholar]
  60. Golmah, V. An Efficient Hybrid Intrusion Detection System based on C5.0 and SVM. Int. J. Database Theory Appl. 2014, 7, 59–70. [Google Scholar] [CrossRef]
  61. Kelsey, T. Lecture Notes—ID5059: Knowledge Discovery and Data Mining, Lecture 21—Support Vector Machines (SVMs). 2015. Available online: https://tom.host.cs.st-andrews.ac.uk/ID5059/L21-slides.pdf (accessed on 28 April 2019).
  62. Marsupial, D. SVM, Overfitting, Curse of Dimensionality. Available online: https://stats.stackexchange.com/questions/35276/svm-overfitting-curse-of-dimensionality (accessed on 21 February 2019).
  63. Lin, C.-J. Chih-Jen Lin’s Home Page. Available online: https://www.csie.ntu.edu.tw/~cjlin/ (accessed on 21 February 2019).
  64. Kelsey, T. Lecture Notes—ID5059: Knowledge Discovery and Data Mining, Lecture 22—Support Vector Machines (SVMs) (2). 2015. Available online: https://tom.host.cs.st-andrews.ac.uk/ID5059/L22-slides.pdf (accessed on 28 April 2019).
  65. Schölkopf, B.; Tsuda, K.; Vert, J.-P. Kernel Methods in Computational Biology, Chapter 2: A Primer on Kernel Methods; MIT Press: Cambridge, MA, USA, 2004; ISBN 9780262195096. [Google Scholar]
  66. Gwardys, G. Why Is Kernelized SVM Much Slower Than Linear SVM? Available online: https://www.quora.com/Why-is-kernelized-SVM-much-slower-than-linear-SVM (accessed on 21 February 2019).
  67. Team, R.D.C. R: A Language and Environment for Statistical Computing.; R Foundation for Statistical Computing: Vienna, Austria, 2008; ISBN 3-900051-07-0. Available online: https://www.r-project.org/ (accessed on 21 February 2019).
  68. Kuhn, M.; Weston, S.; Coulter, N.; Culp, M.; C code for C5.0 by R. Quinlan C50: C5.0 Decision Trees and Rule-Based Models. Available online: https://cran.r-project.org/package=C50 (accessed on 21 May 2018).
  69. Wright, M.N.; Ziegler, A. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. J. Stat. Softw. 2017, 77, 1–17. [Google Scholar] [CrossRef]
  70. Wright, M.N.; Wager, S.; Probst, P. CRAN—Package Ranger: A Fast Implementation of Random Forests. Available online: https://cran.r-project.org/package=ranger (accessed on 21 May 2018).
  71. Helleputte, T.; Gramme, P.; Paul, J. Linear Predictive Models Based on the LIBLINEAR C/C++ Library. Available online: https://cran.r-project.org/package=LiblineaR (accessed on 21 May 2018).
  72. Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; Lin, C.-J. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res. 2008, 9, 1871–1874. [Google Scholar]
  73. Al Tobi, A.M. Anomaly-Based Network Intrusion Detection Enhancement by Prediction Threshold Adaptation of Binary Classification Models. Ph.D. Thesis, School of Computer Science, University of St Andrews, St Andrews, UK, 2018. [Google Scholar]
  74. Garavaglia, S.; Sharma, A. A Smart Guide to Dummy Variables: Four Applications and a Macro. In Proceedings of the Northeast SAS Users Group Conference, Pittsburgh, PA, USA, 4–6 October 1998; pp. 46–55. [Google Scholar]
  75. Chang, C.-C.; Lin, C.-J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 27:1–27:27. [Google Scholar] [CrossRef]
  76. Geisser, S. Predictive Inference; Chapman and Hall: New York, NY, USA, 1993; ISBN 9780203742310. [Google Scholar]
  77. Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Montreal, QC, Canada, 20–25 August 1995; pp. 1137–1145. [Google Scholar]
  78. Devijver, P.A.; Kittler, J. Pattern Recognition: A Statistical Approach; Prentice Hall: London, UK, 1982. [Google Scholar]
  79. Seni, G.; Elder, J.F. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Synth. Lect. Data Min. Knowl. Discov. 2010, 2, 1–126. [Google Scholar] [CrossRef]
  80. Gupta, P. Cross-Validation in Machine Learning. Available online: https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f (accessed on 21 February 2019).
  81. Ambroise, C.; McLachlan, G.J. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. USA 2002, 99, 6562–6566. [Google Scholar] [CrossRef]
  82. Fielding, A.H.; Bell, J.F. A review of methods for the assessment of prediction errors in conservation presence/absence models. Environ. Conserv. 1997, 24, 38–49. [Google Scholar] [CrossRef]
  83. Street, W.N.; Kim, Y. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 26–29 August 2001; ACM Press: New York, NY, USA, 2001; pp. 377–382. [Google Scholar]
  84. Bifet, A. SEAGenerator.java. Available online: https://github.com/Waikato/moa/blob/master/moa/src/main/java/moa/streams/generators/SEAGenerator.java (accessed on 28 April 2019).
  85. Agrawal, R.; Imielinski, T.; Swami, A. Database Mining: A Performance Perspective. IEEE Trans. Knowl. Data Eng. 1993, 5, 914–925. [Google Scholar] [CrossRef]
  86. Kirkby, R. AgrawalGenerator.java. Available online: https://github.com/Waikato/moa/blob/master/moa/src/main/java/moa/streams/generators/AgrawalGenerator.java (accessed on 28 April 2019).
  87. Perona, I.; Gurrutxaga, I.; Arbelaitz, O.; Martín, J.I.; Muguerza, J.; Pérez, J.M. gureKddcup Database. Available online: http://www.sc.ehu.es/acwaldap/gureKddcup/galdetegia_jaso.php (accessed on 21 February 2019).
  88. Perona, I.; Gurrutxaga, I.; Arbelaitz, O.; Martín, J.I.; Muguerza, J.; Pérez, J.M. Service-independent payload analysis to improve intrusion detection in network traffic. In Proceedings of the 7th Australasian Data Mining Conference, Glenelg, Australia, 27–28 November 2008; Volume 87, pp. 171–178. [Google Scholar]
  89. Perona, I.; Arbelaiz Gallego, O.; Gurrutxaga, I.; Martín, J.I.; Muguerza Rivero, J.F.; Pérez, J.M. Generation of the Database Gurekddcup. Universidad del País Vasco. 2008. Available online: http://hdl.handle.net/10810/20608 (accessed on 28 April 2019).
  90. Lincoln Laboratory, Massachusetts Institute of Technology. 1998 DARPA Intrusion Detection Evaluation Data Set. Available online: http://www.ll.mit.edu/ideval/data/1998data.html (accessed on 24 May 2015).
  91. UCI KDD Archive KDD Cup 1999 Data. Available online: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (accessed on 21 February 2019).
  92. Al Tobi, A.M.; Duncan, I. KDD 1999 generation faults: A review and analysis. J. Cyber Secur. Technol. 2018, 2, 164–200. [Google Scholar] [CrossRef]
  93. Shiravi, A.; Shiravi, H.; Tavallaee, M.; Ghorbani, A.A. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Comput. Secur. 2012, 31, 357–374. [Google Scholar] [CrossRef]
  94. Onut, I.-V.; Ghorbani, A.A. A Feature Classification Scheme for Network Intrusion Detection. Int. J. Netw. Secur. 2007, 5, 1–15. [Google Scholar]
  95. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  96. Friedman, M. The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. J. Am. Stat. Assoc. 1937, 32, 675–701. [Google Scholar] [CrossRef]
  97. Friedman, M. A Comparison of Alternative Tests of Significance for the Problem of m Rankings. Ann. Math. Stat. 1940, 11, 86–92. [Google Scholar] [CrossRef]
  98. Hollander, M.; Wolfe, D.A.; Chicken, E. Nonparametric Statistical Methods, 3rd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2013; ISBN 9780470387375. [Google Scholar]
  99. Bi, J.; Bennett, K.; Embrechts, M.; Breneman, C.; Song, M. Dimensionality Reduction via Sparse Support Vector Machines. J. Mach. Learn. Res. 2003, 3, 1229–1243. [Google Scholar]
  100. Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
  101. Kursa, M.B.; Rudnicki, W.R. Feature Selection with the Boruta Package. J. Stat. Softw. 2010, 36, 1–13. [Google Scholar] [CrossRef]
  102. Rudnicki, W.R.; Wrzesień, M.; Paja, W. All Relevant Feature Selection Methods and Applications. In Feature Selection for Data and Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2015; Volume 584, pp. 11–28. [Google Scholar]
  103. Welch, B.L. The Generalization of ‘Student’s’ Problem when Several Different Population Variances are Involved. Biometrika 1947, 34, 28–35. [Google Scholar] [CrossRef] [PubMed]
  104. Ruxton, G.D. The unequal variance t-test is an underused alternative to Student’s t-test and the Mann–Whitney U test. Behav. Ecol. 2006, 17, 688–690. [Google Scholar] [CrossRef]
  105. Liaw, A. randomForest: Breiman and Cutler’s Random Forests for Classification and Regression. Available online: https://cran.r-project.org/package=randomForest (accessed on 21 May 2018).
  106. Shapiro, S.S.; Wilk, M.B. An Analysis of Variance Test for Normality (Complete Samples). Biometrika 1965, 52, 591–611. [Google Scholar] [CrossRef]
  107. Nemenyi, P.B. Distribution-free multiple comparisons. Biometrics 1962, 18, 263. [Google Scholar]
  108. Nemenyi, P.B. Distribution-Free Multiple Comparisons. Ph.D. Thesis, Princeton University, Princeton, NJ, USA, 1963. [Google Scholar]
  109. Hollander, M.; Wolfe, D.A. Nonparametric Statistical Methods, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 1999; ISBN 0471190454. [Google Scholar]

Article Metrics

Citations

Article Access Statistics

Multiple requests from the same IP address are counted as one view.