Improving Intrusion Detection Model Prediction by Threshold Adaptation

: Network tra ﬃ c exhibits a high level of variability over short periods of time. This variability impacts negatively on the accuracy of anomaly-based network intrusion detection systems (IDS) that are built using predictive models in a batch learning setup. This work investigates how adapting the discriminating threshold of model predictions, speciﬁcally to the evaluated tra ﬃ c, improves the detection rates of these intrusion detection models. Speciﬁcally, this research studied the adaptability features of three well known machine learning algorithms: C5.0, Random Forest and Support Vector Machine. Each algorithm’s ability to adapt their prediction thresholds was assessed and analysed under di ﬀ erent scenarios that simulated real world settings using the prospective sampling approach. Multiple IDS datasets were used for the analysis, including a newly generated dataset (STA2018). This research demonstrated empirically the importance of threshold adaptation in improving the accuracy of detection models when training and evaluation tra ﬃ c have di ﬀ erent statistical properties. Tests were undertaken to analyse the e ﬀ ects of feature selection and data balancing on model accuracy when di ﬀ erent signiﬁcant features in tra ﬃ c were used. The e ﬀ ects of threshold adaptation on improving accuracy were statistically analysed. Of the three compared algorithms, Random Forest was the most adaptable and had the highest detection rates.


Introduction
In the current digital age, numerous research papers and applications have been written and have developed proposed solutions to combat network based threats and to protect information systems.As a result, various security systems have emerged, which aim to ensure that the key goals of cybersecurity are met [1].However, every day these stated security goals are flagrantly violated by breaches and security incidents, which raises questions about the capability of existing security systems.
Intrusion detection systems (IDS) are one of the many tools used in the cyber security field.Their main purpose is to detect security attacks targeting the critical networks, systems or data that they monitor, and to report any violation by an external intruder or system insider.
With the rapid advancement in technology many new challenges and threats are evolving.As most of these technologies share the same communication networks, many challenges have emerged; extensive data, traffic diversity and encryption.Such challenges made the identification of threats to develop the right protective measure a very difficult task.
There are many areas being explored to address some of the many cyber security requirements; artificial intelligence (AI), machine learning (ML) and data mining (DM) methods are some of the current key research topics in this field, particularly in the area of anomaly-based intrusion detection (ID).These methods aim to overcome the limitation of human capabilities and conventional technologies in handling the very large amounts and existing diversity of exchanged traffic.
As network traffic evolves over time, due to changes in services and users and their behaviours, the capability of these methods to adapt to such changes is being challenged.Ever evolving traffic makes the process of building ID models a particularly challenging task, as learning all possible variations of traffic patterns for all different kinds of traffic and users is an impossible quest.Therefore, there is a pressing need to make intelligent detection methods adaptable to traffic variability.
The remainder of this paper is organised as follows.In Section 2, we describe the problem that we address in this paper.In Section 3, we discuss related work for threshold adaptation techniques, applications and main research gap.Section 4 presents the proposed solution, which has been empirically investigated.Section 5 describes the experimental setting and data sets used.In Section 6, we present and thoroughly discuss the results of the first set of experiments that aimed to serve as a proof of concept.In Section 7, we discuss the results of the second set of experiments that investigated threshold adaptation under different feature sets and data balance scenarios.Finally, Section 8 concludes this work and lists future work and directions.

Problem Statement
In a typical (batch-based) scenario, a network-based anomaly ID model would be built to protect specific environments from attackers.The model building phase would require some training data that were previously captured from old traffic to generate the ID model, which would be tuned and set to detect anomalous behaviours.However, as such a model is used to analyse new, real traffic, it will suffer from high false alarms and low detection accuracy.These phenomena are usually caused by the changes in network patterns, and lead to an early phasing out of such a model and a triggering of model regeneration or updating phase.This could be linked to the inefficiency of using a fixed discriminating threshold for such ID models.For example, a network under high volume attacks, such as denial of service (DoS) or scan attacks, would have different class (normal to attack) distributions than when under low volume, but stealthy attacks such as SQL injection and command-and-control (C&C).
Most of the learning and classification methods used in building such ID models are based on a number of key assumptions [2,3], such as: (i) the equal representation of classes, (ii) the equal representation of sub concepts for a specific class, (iii) the similar class conditional distributions of all classes, and (iv) the pre defining and knowledge of all the values of the attributes for all records in the dataset.Due to the traffic evolution, most, if not all, of these assumptions are violated in real environments, as new traffic will start to exhibit different statistical properties to those of the training data.
Unpredictable differences between the training and evaluated (tested) data can be introduced over time because of such traffic evolution, known as concept drift.These differences can take various forms; for example, class distributions might differ in the new data from those used to build the ID model, and even new classes might emerge over time.In addition, class balance (also known as data balance) can play an important role in the accuracy of constructed models, which could be affected as a result of pattern changes.Traffic variability might also bring about differences in feature importance.These effects (collectively or individually) might render the learnt model outdated sooner than anticipated.However, the current methods to deal with these effects (in a batch-based setup) will attempt to generate a new model, which may consume additional resources in collecting and labelling new data to be used to learn that new model.
Many studies have attempted to address some of these issues in real time setups by tuning the detection parameters of the ID models, while others have introduced ensemble methods for data stream setups.However, there is insufficient empirical work to analyse the threshold adaptation of model predictions in binary batch learning (offline learning) setups [4].
The low detection accuracy of such score-based anomaly ID models in a batch learning setup, could be linked to the use of a fixed discriminating threshold, which in turn could result in an inaccurate reading of the accuracy that is far lower than the actual optimal accuracy.This might explain the early termination of such ID models.As a result, adapting the discriminating threshold to the predictions of the evaluated network traffic would provide an accurate reading of the actual accuracy of the ID model.Understanding this will lead to an improvement in detection accuracy, and hence an extension in the lifespan of the ID models.
Therefore, in this paper we address this problem by investigating the effect of adapting the discriminating threshold (specifically to the evaluated network traffic) on the accuracy (i.e., the geometric mean (G-Mean) of accuracy) of multiple models and comparing the results with the use of a fixed threshold.This investigation was done by comparing the effects on traffic collected at different times with existing variability.Further, the ability of different types of ML algorithms to adapt to traffic changes was analysed.

Related Work
Security researchers have been aware that the performance of IDS were tightly related to the behavioural patterns of users, as well as to the characteristics of various underlying services and protocols.Anomaly-based methods were introduced to address possible deviations from normal behaviours in order to flag intrusions.These anomaly-based methods suffered from high false alarms, the key reason being their inability to adapt themselves to changes in data patterns (new data) over time.As a result, many proposals have been put forward to address this issue, including methods that adapt to such changes, such as model updating and rule tuning techniques.Other research has looked into the benefits of using adaptive or tuneable thresholds for the IDS measures to flag anomalies, rather than relying on fixed thresholds.The following section presents the key works in this area.
Chen et al. [5] suggested performing threshold tuning for the predictions of classification methods that generate a quantitative output (score), so that the threshold can be set at different values to assign class labels.Catania and Garino [6] suggested performing tuning on statistical based models whenever a change in network traffic patterns is detected by making adjustments to the normal model.
In an attempt to understand the importance of the right threshold selection on the performance of prediction models, Freeman and Moisen [7] investigated 11 optimisation criteria of threshold selection, and concluded that sensitivity to threshold selection demonstrates a low prevalence or a poor model quality.However, many anomaly detection methods have been developed assuming that anomalous traffic forms a minority compared to normal traffic, and, due to the evolving nature of traffic, the quality of these detection models tends to deteriorate over time.Therefore, a key consideration is that threshold adaptation should help improve the quality of these models in terms of accuracy before they are phased out.
To address model tuning, the conventional (batch learning) modelling process usually has two main phases: training and testing.At the modelling stage, training (learning) data are used to build a prediction model, which is then used to predict the test (evaluation) data.Buczak and Guven [8] stressed the importance of having three phases, in which they suggested that the training data be used to build multiple models using different ML/DM algorithms with different parameters.The validation data, which is used in the second phase, could then be used to select the best model(s) and to estimate their errors before they are used to predict or classify the testing data.Buczak and Guven [8] recommended that the selected model should not be fine-tuned (model parameter tuning) based on how it performed on the test data, to avoid reporting overly optimistic results, i.e., reporting accuracy rates that might not be true for another test dataset.However, many of the recommended adaptive real time systems (see Section 3.2) perform tuning on detection rules.Therefore, threshold tuning based on the prediction scores of a model could provide a tool to tune the system over time.However, the single fine-tuning recommendation may not be appropriate, as it may be based on the assumption that test datasets, including future unseen data, have similar statistical properties; which is not a valid assumption given the variable nature of network traffic.
In an attempt to find the right discriminating threshold for the detection model, Beguería [9] suggested the use of validation data.The selected threshold is then used to classify the records in the evaluation/test data, based on the scores returned by the prediction model.However, Beguería [9] does not appear to take into account the variability of behaviour in input (traffic) data over time.
Overall, there are three main themes in model tuning and adaptability to traffic pattern changes, and these are outlined below.

Batch Learning
Yang [10] proposed score based local optimisation (SCut) as a strategy to select a threshold based on optimising a performance measure, such as accuracy.SCut is therefore the threshold at which a performance measure would be maximised or minimised.To the best of our knowledge, no studies have explored model adaptation for changes in network traffic by tuning the threshold of the predictions of a model within a batch learning setup.
Lakhina et al. [11] used principal component analysis (PCA) to separate a high-dimensional space of network traffic measurements into disjoint subspaces.Each subspace corresponded to normal or anomalous network settings.They used a fixed threshold (3σ deviation from the mean) to separate the principal axes into normal and anomalous sets, and found that the first four principal components represented the normal subspace for the cases they analysed.This study did not address the variability of traffic over time, and so requires further analysis of its performance when traffic conditions vary.
In an attempt to investigate the effect of threshold tuning on multi class predictions, Fan and Lin [12] concluded the effectiveness of tuning approaches on the performance of classification techniques.They used the 5-folds cross-validation (CV) technique to evaluate these effects.However, the CV technique may not maintain the statistical differences between the training and the testing data, leading to overly optimistic results.The authors analysed the effect of different optimisation metrics (macro average F-measure, micro average F-measure and exact match ratio) on the overall performance of the selected threshold.They then investigated this tuning approach using validation data, without considering whether such tuning was required for every independent evaluation process or whether the selected threshold could be used for future evaluations performed by the prediction model.Pillai et al. [13] also investigated the issue of threshold selection for multi label classification problems by optimising the F-measure and precision-recall curve.They used 5-folds CV on five datasets to validate their results.The results were compared to the evaluation/testing data by using the optimal threshold that had been selected on the basis of the validation data.However, the authors did not extend their analysis to comparing their results with those where the threshold had been tuned for the testing data.They concluded that selecting an optimal threshold based on maximising the micro F-measure can lead to overfitting.
Koyejo et al. [14] investigated the optimisation of a binary classifier using different metrics where they identified the optimal threshold based on the conditional probability of the positive (normal) class using training and validation data.Yan et al. [15] pointed out that this search requires prior knowledge of the optimal classifier, which is usually unknown in reality.As a result, Yan et al. [15] identified two key properties (the Karmic property and the Threshold Quasi Concavity property), and they theoretically demonstrated that the Bayes optimal classifier is a threshold function of the conditional probability of a positive class.Again, these works do not seem to assume a change in data over time (concept drift), as the threshold is only set once using the validation set.In general, nearly all approaches in the batch learning methods adopt the recommendations of using a single validation dataset to select the right threshold.

Real-Time Learning
In an early study, Eskin et al. [16] proposed an adaptive host-based ID model generation.Their framework, which is similar to that of Honig et al. [17], recommends the aggregation of all data, i.e., system calls, into a single data warehouse.This data can then be used to train detection models, which can in turn be distributed to hosts to detect intrusions.The adaptability of this framework is in the deployment of models on the hosts.However, this framework uses a fixed threshold to flag anomalies without addressing the variability between the hosts.There is a scalability limitation, as storing such large amounts of data will become a serious issue over time.
Hossain and Bridges [18] proposed a framework for adaptive IDS using fuzzy data mining.This framework aims to minimise the human intervention in the adjustments of the profiles used to describe normal traffic by the IDS.The tuning process is designed to operate on a real-time IDS.Hossain et al. [19] evaluated this framework by using a sliding window to update the profile, so that the updating process used the data that fell within that time window.It appears that they considered all traffic, other than simulated portscans, as benign.The system produced results that the authors could not explain, which could be attributed to the lack of controls over the traffic that was analysed.
Jung et al. [20] developed a threshold random walk (TRW) algorithm to detect random portscan attacks in a real-time setup, based on the observations of the state (successful or unsuccessful) of connection attempts from a remote host to newly-visited local addresses.However, this model assumed that all distinct connection attempts had the same likelihood of success, while no correlation between these attempts was assumed.Ali et al. [21] pointed out that threshold adaptation was only performed on the upper boundary of the likelihood ratio, based on previously observed instances, while the lower boundary was fixed.
Idé and Kashima [22] investigated the development of an IDS to detect anomalies in multi-tier systems, such as web-based systems.They used a weighted graph to extract a feature vector of service activities.As this IDS models service activities in the system, where the directions of these activities are assumed to be stable, services that are rarely used may not benefit from its detection capabilities.As a result, services run by careful adversaries, such as command and control (C&C) might not be flagged up.
Yu et al. [23] proposed an automatically tuning IDS (ATIDS) system, which used feedback from the security officer about encountered false predictions to automatically tune the threshold of the rule sets in real-time.This system is dependent on the human resources available, so Yu et al. [24] proposed an extension that adjusts the number of alarms flagged to security operators based on their abilities.Although this extension minimised the burden on security officers, the overall performance of the system was limited by the time it took to provide feedback.This system also failed to cope with drastic changes in system behaviour, as the tuning process was performed on the rules level of the detection model and these rules set might not be representative of new behaviour due to concept or feature drift.
Ali et al. [21] proposed a generic threshold tuning algorithm so that the detection threshold of any score-based anomaly detection systems (ADS) could be adapted.In their approach, statistical and information theoretical analyses were undertaken on the anomaly scores produced by multiple network-based and host-based ADSs.These analyses aimed to reveal consistent structures of time correlation during periods of normal activity.This approach targeted anomalies that cause a detectable variability in traffic patterns due to their high volume, such as UDPFlood, TCP SYN Flood and TCP SYN portscans attacks.This is designed for score-based real-time detectors (not batch), as they quantify the anomaly score based on a comparison between the learned profile and the run-time profile.
Chou and Wang [25] proposed an adaptive network IDS for cloud environments.They claimed that their system had the capability to perform automatic labelling of raw network traffic (normal and anomalous).They used a spectral clustering algorithm (unsupervised learning) to cluster the unlabelled network traffic so that the clusters could later be used as labels to construct a decision-tree-based detection model.These clusters (labelled data) were used to improve the original detector and to adapt it to the network environment.However, the authors used DARPA 2000 and KDD 1999 datasets in their experiments, without any justification as to why such old data had been selected for this scenario.They also proposed an experimental design that overlooked any DDoS attacks in DARPA 2000, claiming that this type of attack would generate lots of connections.This decision calls into question how their system would perform in a real life setup.Agosta et al. [26] introduced a distributed anomaly detection system (ADS) to detect worm threats.This system employed a threshold adaptation technique, to compare it with the performance of a fixed threshold.This study concluded that the adaptive threshold technique was far superior.However, these techniques were specifically designed for this type of attack, and the ability to generalise these results to other classes of threat is debatable.
Gu et al. [27] devised a framework to measure the effectiveness of IDS quantitatively.This method is based on quantifying the feature representation capability, classification information loss and overall intrusion detection capability of an IDS using a set of information-theoretic metrics.The authors discussed the importance of dynamic fine tuning over static fine tuning to address the issue of traffic variability over time.Their framework introduced dynamic fine tuning by dividing the time series into a number of intervals.However, Strasburg et al. [28] have raised concerns about the practical effectiveness of such a model in IDS development.
Jyothsna and Rama Prasad [29] studied a meta-heuristic assessment model, which aimed to set a threshold for random normal behaviour in real-time by estimating the degree of intrusion scope threshold from a given network transaction.Their model also aimed to identify any new intrusions in the network, and feature correlation methods were performed to reduce processing and time costs.This approach did not cater for the effect of concept drift on the selected features over time, and hence on a model's performance.

Data Stream Learning
In the data stream learning methods, the concept drift is a core feature that is considered in the modelling process.Bifet et al. [30] proposed a new data stream framework which aimed to address concept drift by employing ensemble methods using various bagging techniques.They later developed this framework into an open source software known as Massive Online Analysis (MOA) [31].
Masud et al. [32] proposed a classification method to address concept drift in data classes, that is, the emergence of unseen classes (labels).Usually, new class labels require a longer time to be provided with new training data to rebuild the base detection models.Therefore, Masud et al. applied some clustering concepts to measure the distance between known classes and new data instances, so that this technique could flag up these new instances as anomalies.Farid et al. [33] stated that such models would need to gather a large number of test instances to determine their similarities and differences in order to identify any novel classes.
In an earlier study, Masud et al. [34] proposed another detection approach for novel classes that used an adaptive threshold and the Gini coefficient for outlier detection.However, the proposed approach is unable to distinguish between the novel classes if multiple new classes have emerged, and it also does not cater for other types of evolution, such as feature drift [35].
In order to automatically determine the optimal parameters of an anomaly detector (AD) Cretu-Ciocarlie et al. [36] enhanced the training phase by introducing a self-calibration stage.Their method consisted of applying ensemble methods to unsupervised learning techniques to build micro-models.A weighted voting scheme on labels returned by these micro-models was used to compute a final class decision.However, this approach could result into an AD that might be subject to attack, as an adversary could train it.This approach may fail to differentiate between a real change in traffic patterns and an ongoing crafted attack aimed at skewing the majority votes of the micro-models.
Chen et al. [37] suggested the offline mining of an old data stream to build high-quality models for every recurrent concept.When concept drift is later detected in a data stream, it could then be evaluated to identify the type of concept, so that the traffic could be passed to the most suitable pre-built model to classify the traffic in that stream.This technique claims to achieve high rates of accuracy because of the high-order models, but it assumes that there is a finite number of concepts to be modelled.This assumption is challenged by the high volume and diversity of network traffic.In addition, there are scalability issues.
In a more recent work, Gomes et al. [38] proposed an adaptive random forest (ARF) algorithm that was suitable for evolving data streams.This algorithm has the potential to address concept drift by adapting itself to any changes.The adaptation is performed by replacing any outdated trees in the forest with new trees that have been grown (trained) in the background.

Research Gaps
As presented, the importance of adaptation to pattern variability has mainly been addressed in the context of real time and data stream problems.Most of the adaptation and tuning approaches for real-time-based systems target certain classes of attack which are formed of abrupt patterns, such as DoS attacks.As these attacks introduce high variability into traffic patterns, much research has attempted to detect them and fine tune the system accordingly.In most cases, these tuning approaches would aim to adapt the IDS detection parameters to increase or decrease the thresholds of these parameters.However, Catania and Garino [6] pointed out that most of the adaptation approaches are aware of the high network variability, and the proposed methods provide the required adaptability features to adjust for the targeted anomalies.Similarly, in the data stream field, most of the proposed approaches suggest building new detection models to adapt to such changes [21].
As for the batch learning tasks, in an ideally designed experiment, adaptation is undertaken only once for the prediction model, using validation data [8,39].Validation data is used to estimate class distributions in order to calculate the optimal threshold for the prediction model.However, in a real life setup, these distributions are not fixed, which renders such approaches ineffective.Furthermore, using a fixed threshold for predictive models could result in an inaccurate reading of the model's performance, which could in turn lead to the selection of weaker models or an early phasing out of good models.However, no study exists to investigate continuous adaptation for every evaluation/test datum in a batch-based setup.Therefore, in this paper we investigate such an approach.
Moreover, in batch learning approaches, there is a reliance on the K-folds cross-validation (CV) technique to evaluate models, and, when attempting to address the pattern change problem, validation data is the alternative suggested approach.Such an approach is used to select the best threshold based on the optimisation of some measure, such as the accuracy, for the prediction model.However, no study has investigated how a fixed threshold will behave under different setups.Additionally, as model development is based on various decisions taken in relation to the training data (such as feature selection and data balancing), it is important to analyse how such decisions might affect the model performance when traffic changes over time and causes concept or feature drift.It is also important to address whether the threshold (tuning) adaptation of model predictions have any effect on eliminating or mitigating such limitations.

Threshold Adaptation
As noted earlier, the adaptation capability of prediction models under the batch learning setups is the least investigated area in comparison to other methods, although the batch-based ID models are important to detect novel attacks that cannot usually be detected by other techniques.Some kinds of attacks are better detected in a batch mode to increase the detection rate, rather than attempting faster detection in real-time with a higher failure rate.With this approach, there is no need to change or tune any of a model's parameters as long as its predictions are in the form of a probability score.In this sense, threshold adaptation does not require any modification to the anomaly detection model.The detection model is thus treated as a black-box, as the adaptation is performed to its predictions and not to its detection parameters.

Experimental Settings
In this section, the experimental settings used in all conducted experiments are presented.Three ML algorithms were used and their main settings are explained.All the experiments were evaluated in terms of detection accuracy using the geometric mean of accuracy (gAcc) [40] measure, as the normal accuracy measure can be a very misleading measure due to its sensitivity to class imbalance.Similarly to other performance assessments of classification models in a supervised learning task, the gAcc uses the computed basic counts of a table known as a confusion (error) matrix [41] (see Table 1).Similarly to other performance assessments of classification models in a supervised learning task, the gAcc uses the computed basic counts of a table known as a confusion (error) matrix [41] (see Table 1).
The gAcc computes the classification accuracies of every class separately, and then computes their geometric mean.Equation 1shows the general formula used to compute this measure, where Cj,i is the number of class i instances that were predicted as j, and n is the total number of classes.Although this measure was first proposed by Kubat and Matwin [40], few studies have used it to assess and compare the performance of different models.However, a number of recent studies in network ID domain have started to use it [42][43][44].

Overview of Classification/Machine Learning Algorithms
For all of the experiments conducted in this paper, three common classification algorithms that are widely used for batch learning were analysed, evaluated and compared to address the anomaly network detection.These algorithms were C5.0, Random Forest (RF) and Support Vector Machine (SVM); this section provides an overview of each of these algorithms.5.1.1.Decision Trees (C5.0)C5.0 is a classification algorithm [45] based on decision trees, which are used in classification problems to build a deterministic data structure that is formed out of decision rules for a particular domain [46].It has a lower error rate due to its use of 'boosting' [47].Additionally, as C5.0 generates smaller trees, it consumes fewer resources, such as memory, and performs faster executions.It also avoids overfitting noisy data [48].The C5.0 algorithm uses the information gain ratio to perform its splits, aiming to reduce the bias towards features with a large number of distinct values by penalising the selection of a feature based on the number and size of its branches.However, this criterion might result in favouring features with very low information values [49].The final classification decision is based on the path traversed from root to leaf; these decisions can be either a 'class' (label), or 'probabilities' (score) of classes.
C5.0 performs tree pruning by removing parts of the tree that are predicted to have a high error rate [50].In this pruning process, every subtree is evaluated to determine whether it will be replaced with a leaf or a node.

Random Forest (RF)
Random Forest (RF) is basically formed out of multiple decision trees (prediction models) that are grown using a combination of 'bagging' and the random selection of features (subspace).Bagging (bootstrap aggregating) is a technique that aims to improve the performance (accuracy and stability) of ML algorithms and to reduce variances and the chance of overfitting [51,52].This is performed by generating nTree bootstrap samples, which are randomly sampled from the main training data.Each

Forma Forma
The gAcc computes the classification accuracies of every class separately, and then computes their geometric mean.Equation (1) shows the general formula used to compute this measure, where Cj,i is the number of class i instances that were predicted as j, and n is the total number of classes.
Although this measure was first proposed by Kubat and Matwin [40], few studies have used it to assess and compare the performance of different models.However, a number of recent studies in network ID domain have started to use it [42][43][44].

Overview of Classification/Machine Learning Algorithms
For all of the experiments conducted in this paper, three common classification algorithms that are widely used for batch learning were analysed, evaluated and compared to address the anomaly network detection.These algorithms were C5.0, Random Forest (RF) and Support Vector Machine (SVM); this section provides an overview of each of these algorithms.5.1.1.Decision Trees (C5.0)C5.0 is a classification algorithm [45] based on decision trees, which are used in classification problems to build a deterministic data structure that is formed out of decision rules for a particular domain [46].It has a lower error rate due to its use of 'boosting' [47].Additionally, as C5.0 generates smaller trees, it consumes fewer resources, such as memory, and performs faster executions.It also avoids overfitting noisy data [48].The C5.0 algorithm uses the information gain ratio to perform its splits, aiming to reduce the bias towards features with a large number of distinct values by penalising the selection of a feature based on the number and size of its branches.However, this criterion might result in favouring features with very low information values [49].The final classification decision is based on the path traversed from root to leaf; these decisions can be either a 'class' (label), or 'probabilities' (score) of classes.
C5.0 performs tree pruning by removing parts of the tree that are predicted to have a high error rate [50].In this pruning process, every subtree is evaluated to determine whether it will be replaced with a leaf or a node.

Random Forest (RF)
Random Forest (RF) is basically formed out of multiple decision trees (prediction models) that are grown using a combination of 'bagging' and the random selection of features (subspace).Bagging (bootstrap aggregating) is a technique that aims to improve the performance (accuracy and stability) of Information 2019, 10, 159 9 of 42 ML algorithms and to reduce variances and the chance of overfitting [51,52].This is performed by generating nTree bootstrap samples, which are randomly sampled from the main training data.Each of these bootstraps will then be used to build a prediction model, resulting in a total of nTree models (decision trees).
After a bootstrap sample is produced, a decision tree is generated.In RF, only a random selection of features (subspace), with no replacement, are evaluated at every node to decide on the best split, rather than using the full features set as in the C5.0 algorithm.The number of these random features, mtry, is usually far less than the original number of features.
Out-of-bag (OOB) data are used in the internals of RF to estimate and monitor the errors of the decision tree and its strength, as well as the correlation between different trees and to measure feature importance [46,53].
The final prediction of the forest is performed by running each instance down all decision trees in the forest.The results of all these trees are then aggregated to form the final decision.For numerical predictions, the average or the weighted average of the results of all trees is returned, whereas for classification problems, the majority vote or the probability of the classes is returned.
The RF algorithm has many advantages, such as low training time complexity and fast prediction time [54,55], efficient handling of missing data, no required pre-processing (scaling or normalisation) of data, and efficient handling of imbalanced data and rare cases (due to the bootstrapping feature) [56].However, this algorithm has some drawbacks, such as slow runtime as the number of its trees increase, and difficulty to interpret its models due to their high complexity (caused by randomisation) [57].
The key stages of the RF algorithm are illustrated in Figure 1.
Information 2019, 10, x FOR PEER REVIEW 9 of 44 of these bootstraps will then be used to build a prediction model, resulting in a total of nTree models (decision trees).After a bootstrap sample is produced, a decision tree is generated.In RF, only a random selection of features (subspace), with no replacement, are evaluated at every node to decide on the best split, rather than using the full features set as in the C5.0 algorithm.The number of these random features, mtry, is usually far less than the original number of features.
Out-of-bag (OOB) data are used in the internals of RF to estimate and monitor the errors of the decision tree and its strength, as well as the correlation between different trees and to measure feature importance [46,53].
The final prediction of the forest is performed by running each instance down all decision trees in the forest.The results of all these trees are then aggregated to form the final decision.For numerical predictions, the average or the weighted average of the results of all trees is returned, whereas for classification problems, the majority vote or the probability of the classes is returned.
The RF algorithm has many advantages, such as low training time complexity and fast prediction time [54,55], efficient handling of missing data, no required pre-processing (scaling or normalisation) of data, and efficient handling of imbalanced data and rare cases (due to the bootstrapping feature) [56].However, this algorithm has some drawbacks, such as slow runtime as the number of its trees increase, and difficulty to interpret its models due to their high complexity (caused by randomisation) [57].The key stages of the RF algorithm are illustrated in Figure 1.

Support Vector Machine (SVM)
The Support Vector Machine (SVM) [58] is one of the most popular classification algorithms used for supervised learning tasks in ML.Its development is based on the structural risk minimisation principle [57,59].In SVM, each data instance is represented geometrically as a vector ( ℛ ) in p-dimensional space- =  , ⋯ ,  ∈ Χ ⊂ ℛ .SVM attempts to find a linear surface (hyperplane)-or a line in 2D space-that separates the instances into two classes y ∈ {-1, 1}, where this separating hyperplane has the largest distance between the edge points of each class.These edge points define the border lines for each class as per Equation 2: where x is an edge point in the training data that lies on the border line of a class, and b is (offset) the distance from the origin to the decision boundary (. +  = 0) [58].The edge points also define the width of the margin between those border lines.These points (vectors) are used to define and

Support Vector Machine (SVM)
The Support Vector Machine (SVM) [58] is one of the most popular classification algorithms used for supervised learning tasks in ML.Its development is based on the structural risk minimisation principle [57,59].In SVM, each data instance is represented geometrically as a vector (R p ) in p-dimensional space-x = x 1 , • • • , x p ∈ X ⊂ R p .SVM attempts to find a linear surface (hyperplane)-or a line in 2D space-that separates the instances into two classes y ∈ {-1, 1}, where this separating hyperplane has the largest distance between the edge points of each class.These edge points define the border lines for each class as per Equation (2): where x is an edge point in the training data that lies on the border line of a class, and b is (offset) the distance from the origin to the decision boundary (w.x + b = 0) [58].The edge points also define the width of the margin between those border lines.These points (vectors) are used to define and outline (support) the separating hyperplane and are called the support vectors.The minimum number required of these points is (p + 1).
As there could be many separating hyperplanes that might separate positive cases from negative cases, the SVM algorithm searches for a decision boundary with the maximum margin.The width of this margin is the sum of the distances from that decision boundary to the parallel hyperplanes that contain the closest positive and negative training points (support vectors) [60].
The SVM classifier depends on computing w, which is a normal vector perpendicular to the separating hyperplane (decision boundary).This normal vector is precomputed as Equation (3) presents: where λ i are Lagrange multipliers produced at the training phase using data with N training samples.SVM classification is performed by evaluating which side of the hyperplane a test instance (vector) will fall into, as Equation ( 4) shows: where x is a test instance, and b is (offset) the distance from the origin to the decision boundary (w.x + b = 0) [58], which is precomputed at the training phase.SVM has the capability to find a separating hyperplane with soft margins, which allows some violation of the boundary by permitting some levels of mixing between classes.This is usually done by tuning some cost value (C), which has an effect on the variance [61].
One of the main advantages of SVM is that it does not suffer from the "curse of dimensionality," as many other ML algorithms do.As a result, feature reduction is not required by SVM [62].SVM also has many limitations, such as the required pre-processing phase of the data (dealing with missing data, data transformation, scaling and/or normalisation) [63].
For non-linearly separable problems, SVM might require the use of kernel methods or functions to transform the data from input (data) space into higher dimensional (feature) space, where the data can be made linearly separable.Hence, the resultant separating hyperplane can be expressed using the inner products of the vectors [64].However, using kernels will incur optimisation costs, as all their tuning parameters need to be taken into account [65].As a result, SVM processing speed is affected by the kernel used, as some kernels will perform more operations in the transformation phase, which will slow the SVM's speed [66].

Parameter Setting for the ML Algorithms
All of the implementations of the analysed algorithms within this study utilized packages of the R environment [67].Default parameters of these algorithms were used to make these experiments reproducible.Adjusting parameters to improve detection would require further investigation, which is outside the scope of this paper.

C5.0 Algorithm
The "c50" package (version 0.1.0-24)[68] was used in this study.All experiments used the default settings of this algorithm, with the 10 trials option (trials = 10) set to return the results of the classification as a probability score (type = "prob") when the model was used to predict the evaluation (test) data.

Random Forest
The "ranger" package (version 0.8.0) [69,70] was used over the course of this research.This package was selected because of its fast implementation of RF in C++.All experiments used the default settings of 500 trees (nTree) to grow, with the number of features to evaluate at every node being the square root of the total number of features in the dataset mtry = √ p , where p is the number of features.The algorithm was instructed to return results in the form of classification probabilities (probability = TRUE).

Support Vector Machine (SVM)
The open source SVM package (LiblineaR) (version 2.10-8) [71,72] was used in these experiments.This package executes an optimized linear version of SVM.All experiments used the default settings of L2-regularized logistic regression linear model type (type = 0) with the cost set to one (cost = 1).
The choice to use the linear version of SVM was driven by the very large differences in the runtime of experiments between its linear and nonlinear kernel versions.Some preliminary experimentations were conducted to compare the two versions.As a result of the large difference of runtime between the two versions, the linear version of SVM was selected, as it was much faster.These preliminary experiments also showed that the runtime of the kernel SVM grows exponentially as the number of instances increase.With all these differences, the nonlinear kernel SVM was not tractable to be introduced as a solution in a domain like IDS [73].
Data were pre-processed by converting all categorical (nominal) features into dummy attributes, as SVM can only handle numerical data [74].A data were also standardised, where the standardisation parameters (the mean and standard deviation) of the training data were used to standardise the features of the test data before being classified by the model [71,75].

Performance Assessment Techniques
K-folds cross-validation (CV) technique is the most widely used performance assessment method of different ML algorithms due to many reasons, such as data shortage [76][77][78], avoiding overfitting problems [79], and to identify and fine tune the model's parameters [80].In this technique, the dataset is randomly divided into K parts.A model is then trained using K-1 parts and tested on the remaining part.This process is repeated K times, so that each one of the K parts is only used once as test data.The model's overall performance is estimated by aggregating the performance of the K models (through averaging or a majority vote).However, it requires a long time to process as larger values of K are used.It could also provide overly optimistic results due to the random division of datasets, which could be a result of partitions that are statistically similar to each other.Therefore, the K-folds CV technique was used in all experiments at every model building (training) stage to estimate the prediction thresholds for every developed model, as per the recommendation of Ambroise and McLachlan [81].
In this paper we adopted the prospective sampling [82] method, which obtains new sample data after the model generation phase is over.This method is not a commonly used evaluation practice in anomaly-based detection.This evaluation method aimed to mirror real life, given that models are usually trained on data that have been collected in the past to predict future data.

Datasets Description
This section provides an overview of the datasets used in the experiments outlined in this study.Two synthetic datasets (SEA and AGR) were generated randomly, alongside two domain specific datasets (gureKDD and STA2018).The first three datasets (SEA, AGR and gureKDD) were used in the first experiment, and STA2018 in the second one.

SEA
A streaming ensemble algorithm (SEA) generator [83] in the MOA framework [31] was used to generate a data stream with three continuous features (X 1 , X 2 , X 3 ).Each feature had a range between 0 and 10, although only features X 1 and X 2 influenced the class value.Instances were produced by randomly generating points (X 1 , X 2 ) in a two dimensional space.Instances were labelled as groupA if X 1 + X 2 > θ, and as groupB if X 1 + X 2 ≤ θ, where X 1 and X 2 were the first two features and θ was a threshold.There were four functions, which would label the instances differently based on their threshold values between the two classes (function 1 sets θ = 8, function 2 uses θ = 9, function 3 sets θ = 7, and function 4 sets θ = 9.5) [84].The SEA generator's default setting was used to add 10% noise classes.Six different data streams (files) were produced: function 1 was used to generate two streams (file 1 and file 2); function 2 was used to generate two other streams (file 3 and file 4); and a combination of function 1 and function 2 was used to generate two streams (file 5 and file 6).For every file, calls to these functions used different seed values to set the seed of the random generator function to generate new random instances.Figure 2 presents an example of the command line call to generate File 1 with the SEA stream generator.Each stream consisted of 200,000 instances.This dataset was used to analyse the effect of different statistical properties (concept drift) between training and testing data on the model's performance.Table 2 lists the number of instances of each class in every file in this dataset.

AGR
The AGRAWAL generator [85] in the MOA framework [31] was used to generate a data stream with nine features (X1, …, X9), six of which were nominal (factor) and three of which were continuous.This generator had ten different functions to assign the produced instances to one of two different classes, based on the values of their different features.The following examples illustrate the labelling rules of the two functions that were used in generating this dataset: Each function increases the level of complexity as it uses additional features and complex rules to label the instances [86].The generator's default setting was used to add 10% noise classes by introducing a disturbance factor that added a deviation value (following uniform random distribution) to the original feature's values.Similar to the SEA dataset generation, six different data streams (each with 200,000 instances) were generated using function 1 and function 2 of the AGR data stream generator.Figure 3 provides an example of the command line used to generate the data of File 1 in this dataset.Table 3 presents a summary of the label frequencies in every file for this dataset.Each stream consisted of 200,000 instances.This dataset was used to analyse the effect of different statistical properties (concept drift) between training and testing data on the model's performance.Table 2 lists the number of instances of each class in every file in this dataset.

AGR
The AGRAWAL generator [85] in the MOA framework [31] was used to generate a data stream with nine features (X 1 , . . ., X 9 ), six of which were nominal (factor) and three of which were continuous.This generator had ten different functions to assign the produced instances to one of two different classes, based on the values of their different features.The following examples illustrate the labelling rules of the two functions that were used in generating this dataset: Each function increases the level of complexity as it uses additional features and complex rules to label the instances [86].The generator's default setting was used to add 10% noise classes by introducing a disturbance factor that added a deviation value (following uniform random distribution) to the original feature's values.Similar to the SEA dataset generation, six different data streams (each with 200,000 instances) were generated using function 1 and function 2 of the AGR data stream generator.Figure 3 provides an example of the command line used to generate the data of File 1 in this dataset.Table 3 presents a summary of the label frequencies in every file for this dataset.

gureKDDcup
gureKDDcup [87][88][89] (referred to throughout this paper as gureKDD) is a transformation of the raw network traffic of the DARPA 1998 dataset [90] into a suitable format for ML tasks, where every connection is described using a set of features.This transformation is similar to the KDD 1999 dataset [91], but much richer and cleaner.The KDD 1999 dataset was not used in this paper due to its many limitations, identified by Al Tobi and Duncan [92].Every connection in the gureKDD dataset has a unique ID that helps identify the chronological order of all connections.Therefore, all connections in this dataset are chronologically separable and can be divided by day, week, etc.
For the first experiments, all traffic (over a seven week period) was segregated into a time window of a week, which resulted in seven files.Every file contained the network traffic of that week (Monday-Friday).Every connection in these files was profiled using 41 features: 3 of which were nominal (protocol_type, service and flag), 6 were binary features, 15 were continuous (real) features and 17 were integer features.These features were divided into four main groups: intrinsic (basic) features {1-9}, content based features {10-22}, time based features {23-31} and connection based features {32-41}.
Each connection was labelled either as normal or as one of the 35 different attacks.These attacks were grouped into four main classes: DOS, Probing, Remote to Local or User to Root.In these experiments, the data were pre-processed so all different attack types were grouped and labelled as 'attack' to produce binary classes.Table 4 presents a statistical summary of the connection class types for each of the seven weeks, which were clearly shown to have different class balances.[87][88][89] (referred to throughout this paper as gureKDD) is a transformation of the raw network traffic of the DARPA 1998 dataset [90] into a suitable format for ML tasks, where every connection is described using a set of features.This transformation is similar to the KDD 1999 dataset [91], but much richer and cleaner.The KDD 1999 dataset was not used in this paper due to its many limitations, identified by Al Tobi and Duncan [92].Every connection in the gureKDD dataset has a unique ID that helps identify the chronological order of all connections.Therefore, all connections in this dataset are chronologically separable and can be divided by day, week, etc.
For the first experiments, all traffic (over a seven week period) was segregated into a time window of a week, which resulted in seven files.Every file contained the network traffic of that week (Monday-Friday).Every connection in these files was profiled using 41 features: 3 of which were nominal (protocol_type, service and flag), 6 were binary features, 15 were continuous (real) features and 17 were integer features.These features were divided into four main groups: intrinsic (basic) features {1-9}, content based features {10-22}, time based features {23-31} and connection based features {32-41}.
Each connection was labelled either as normal or as one of the 35 different attacks.These attacks were grouped into four main classes: DOS, Probing, Remote to Local or User to Root.In these experiments, the data were pre-processed so all different attack types were grouped and labelled as 'attack' to produce binary classes.Table 4 presents a statistical summary of the connection class types for each of the seven weeks, which were clearly shown to have different class balances.The STA2018 dataset (The full data set can be found at: https://doi.org/10.17630/c5f31888-9db5-4ac0-a990-3fd17dcfe865)[73] was generated by transforming the network traffic of the UNB ISCX Intrusion Detection Evaluation DataSet 2012 [93] into a suitable format for ML tasks.This dataset profiles every connection using 193 basic features, where part of Onut's feature classification schema [94] was used to extend these features to a total of 550 features (549 independent variables plus 1 dependent (class) variable).
The STA2018 dataset contains the profiled sessions (connections) of the network traffic of seven simulation days, where data records are grouped by day so that every data file aggregated all of the connections within that simulation day.The transformation process of this dataset went into five main stages: basic features extraction, validation and connection labelling, extend the basic features, balance and clean up.
Due to the balancing stage, this dataset can be used into two modes: first with the original imbalanced version, second with a balanced version where synthetic instances of the attack connections (minority class) were generated using the Synthetic Minority Over-sampling Technique (SMOTE) algorithm [95].Table 5 sets out the number of connections for each class for each day (original and balanced versions).In the second set of experiments outlined in Section 7, only days 2 to 7 were used, as the first day was attack free.
Originally, the file for each day consisted of 550 features (549 features + 1 class).Two features (synthetic and origOrder) were omitted from any analysis, as their only purpose was to distinguish the original data from the balanced (synthetic) data and to identify the connection order.Three further features were removed from the analysis (start_time, src_ip and dst_ip), both to avoid any possibility of overfitting and because of the large number of levels.Removing these five features resulted in a total of 545 features (544 features + 1 class).Any reference to the Full set of features thus refers to these 545 features.

Hardware Specifications
All experiments were performed on a "Dell C5220 PowerEdge Rack Servers" cluster, which had 12 micro servers.Each micro server ran Scientific Linux 7 on dual quad-core Intel Xeon 3.4GHz CPUs, 16GB RAM, two 500GB SATA disks and two Gigabit Ethernet interfaces.The large data files of the STA2018 dataset {Day 5 (15/Jun) and Day 6 (16/Jun)}, in the second set of experiments, were run on a Hyper-V virtual machine with 8 Virtual Processors, 20 GB RAM and 32 GB Swap space.This VM was used to host the Ubuntu 16.04 (64-bit) operating system.It was hosted on a server with the following hardware specifications: 2U Supermicro chassis; 8x host-swap 2.5" SAS/SATA disk bays; Supermicro X8DTU-LN4F+ motherboard; Dual Intel Xeon E5620 (quad core); 24GB RAM (6 x 4GB DDR3 ECC RDIMM); 4x 1TB SATA (RAID10); and 4x 1Gb Ethernet.This machine used a Windows Server 2012 R2 Datacentre (64-bit) operating system.

First Experiment
In the first set of experiments, we examined the effect of threshold adaptation on the overall performance of a detection model.We aimed to provide a proof of concept (PoC) by comparing three well known ML algorithms (C5.0,RF and SVM) to determine which was the most adaptable to variations and concept drifts.In this set of experiments, we conducted two different experimental setups (see Figure 4).Both setups used the same datasets and the same ML algorithms, however, different sampling approaches were performed.We analysed the effect of different sampling approaches on individual detection model accuracy using a real life setup (prospective sampling), and compared this to the usual experimental setups reported in academic publications (K-folds cross-validation).Another part of this experiment was to examine the effect of threshold adaptation in improving model overall detection accuracy.The choice to use the synthetic datasets (SEA and AGR) was driven by the need to control the degree of variability between different data files.The gureKDD dataset was used to make this study comparable to other studies in the field, as its comparator datasets (KDD1999 and NSL-KDD) are widely used in this domain.

First Experiment
In the first set of experiments, we examined the effect of threshold adaptation on the overall performance of a detection model.We aimed to provide a proof of concept (PoC) by comparing three well known ML algorithms (C5.0,RF and SVM) to determine which was the most adaptable to variations and concept drifts.In this set of experiments, we conducted two different experimental setups (see Figure 4).Both setups used the same datasets and the same ML algorithms, however, different sampling approaches were performed.We analysed the effect of different sampling approaches on individual detection model accuracy using a real life setup (prospective sampling), and compared this to the usual experimental setups reported in academic publications (K-folds crossvalidation).Another part of this experiment was to examine the effect of threshold adaptation in improving model overall detection accuracy.The choice to use the synthetic datasets (SEA and AGR) was driven by the need to control the degree of variability between different data files.The gureKDD dataset was used to make this study comparable to other studies in the field, as its comparator datasets (KDD1999 and NSL-KDD) are widely used in this domain.

Results and discussion
This section presents the results of the first set of experiments and discusses their main findings.

Results and Discussion
This section presents the results of the first set of experiments and discusses their main findings.

10-folds Cross-Validation on Full Data
In the first setup (Figure 4), we started these experiments by comparing the detection performances of the three ML algorithms (C5.0,RF and SVM) on the three different datasets (gureKDD, SEA and AGR).The conventional method of 10-folds CV technique was performed on the merged files of each dataset.The maximum gAccs of these models and the best cut-off values were reported.Due to the minimal variability between results, each experiment was repeated only ten times (see Table 6).
Table 6.Average model accuracy (G-mean accuracy), the average optimal cut-off value (at which maximum G-mean accuracy was reached) and their standard deviation of the 10-folds cross-validation (10 repetitions).Table 6 presents the average of the gAcc values of the ten trials of the 10-folds CV in terms of the gAcc of the three algorithms (C5.0,RF and SVM).It also shows the mean of the optimal cut-off values of the ten runs at which the maximum gAccs were reached.
In general, all algorithms reported similar accuracies for their respective datasets.However, in the artificial dataset AGR, SVM failed to perform anywhere close to C5.0 or RF (showing a difference of almost 15%-see Table 5).This could be related to the nature of the dataset, which could be non-linearly separable, as a linear version of SVM was used in this analysis.Generally, RF was capable of improving detection accuracy on all datasets.
Generally, the performance of all algorithms on gureKDD was the highest, followed by those on the SEA dataset.The AGR dataset was the worst in reaching high detection accuracy.This fact is clearly illustrated by the plots in Figure 5, which show the gAcc curve against the cut-off values for all datasets.These plots show the ten runs in a lighter colour and the means of these runs in solid colour.They also show the optimal threshold values for each dataset under the tested algorithm.
Information 2019, 10, x FOR PEER REVIEW 16 of 44 Table 6 presents the average of the gAcc values of the ten trials of the 10-folds CV in terms of the gAcc of the three algorithms (C5.0,RF and SVM).It also shows the mean of the optimal cut-off values of the ten runs at which the maximum gAccs were reached.
In general, all algorithms reported similar accuracies for their respective datasets.However, in the artificial dataset AGR, SVM failed to perform anywhere close to C5.0 or RF (showing a difference of almost 15%-see Table 5).This could be related to the nature of the dataset, which could be nonlinearly separable, as a linear version of SVM was used in this analysis.Generally, RF was capable of improving detection accuracy on all datasets.
Generally, the performance of all algorithms on gureKDD was the highest, followed by those on the SEA dataset.The AGR dataset was the worst in reaching high detection accuracy.This fact is clearly illustrated by the plots in Figure 5, which show the gAcc curve against the cut-off values for all datasets.These plots show the ten runs in a lighter colour and the means of these runs in solid colour.They also show the optimal threshold values for each dataset under the tested algorithm.
Table 6.Average model accuracy (G-mean accuracy), the average optimal cut-off value (at which maximum G-mean accuracy was reached) and their standard deviation of the 10-folds crossvalidation (10 repetitions).Friedman's test [96,97] was used to analyse whether the differences between the accuracies of all runs of the 10-folds CV on the full datasets for these algorithms were significant.The tested hypothesis was, "there is no statistically significant difference in model gAccs between the different algorithms."This test revealed that there was a significant difference between the different algorithms applied to these datasets under the 10-folds CV approaches, χ 2 (2) = 26.7,p = 0.000 < 0.05.The follow up Nemenyi post-hoc test [98] revealed that the algorithms were all different from each other, as illustrated in Figure 6, which shows that the differences between the algorithms were statistically Friedman's test [96,97] was used to analyse whether the differences between the accuracies of all runs of the 10-folds CV on the full datasets for these algorithms were significant.The tested hypothesis was, "there is no statistically significant difference in model gAccs between the different algorithms."This test revealed that there was a significant difference between the different algorithms applied to these datasets under the 10-folds CV approaches, χ 2 (2) = 26.7,p = 0.000 < 0.05.The follow up Nemenyi post-hoc test [98] revealed that the algorithms were all different from each other, as illustrated in Figure 6, which shows that the differences between the algorithms were statistically significant.In the second setup (Figure 4), we used the same datasets and algorithms to generate detection models, but in scenarios that were similar to natural settings we applied the prospective sampling technique [82].In these experiments, models were generated on a subset of the dataset using the 10folds CV technique to set these models' parameters, i.e., the prediction threshold (cut-off).These models were then used to evaluate the remaining files in the dataset.Two gAcc values were computed for every combination of prediction model and evaluation data.The first gAcc was obtained when the model's pre-set prediction threshold value, which was calculated using the 10folds cross-validation, was used to predict the test data file.The second gAcc value was calculated based on the maximum accuracy reached when the prediction threshold value was adapted to the evaluated data file.This section shows the results obtained under this setup.

G-Mean Accuracy
Plots of the gAcc (in Figure A1, Figure A2 and Figure A3 in Appendix A) present the performance of the prediction model (MDLk) that was trained using Filek on the files in the dataset (Filei≠k) that were not used in producing that model.In these figures, each model's performance, based on the CV technique, is illustrated with a solid line; other individual performance evaluations are depicted with dotted lines.As the SEA and AGR datasets are composed of only six files each, there is no illustration of model 7 for these datasets in these figures.C5.0 algorithm: Algorithm C5.0 had the worst performance on the first file in the gureKDD dataset, even at CV evaluation during the model generation stage (Figure 12 in Appendix A).This is due to the fact that this file has the least number of attacks (21 attacks) and is the most imbalanced of the files.Therefore, the generated model using this file was not able to predict any instances in other files.Where the number of attacks in other files increased with a proportionate balance, the model performances improved under this algorithm.
Generally, applying this algorithm under the prospective sampling approach followed the same pattern as the first experiment (10-folds cross-validation), where performance on gureKDD resulted in the highest accuracy, followed by the SEA dataset; the worst performing dataset was the AGR.
For both the SEA and AGR datasets, the generated models performed best when files exhibited the same statistical properties, denoted in these experiments by the same generating functions.For example, where MDL1 used File 1 as training data, it predicted instances in File 2 with a high performance and vice versa, as both files were generated using the same function.This was also true for Files 3 and 4.Where files contained mixed behaviours, the prediction performance dropped sharply.
Table A1, Table A2 and Table A3-in Appendix A-present the results of each model on every In the second setup (Figure 4), we used the same datasets and algorithms to generate detection models, but in scenarios that were similar to natural settings we applied the prospective sampling technique [82].In these experiments, models were generated on a subset of the dataset using the 10-folds CV technique to set these models' parameters, i.e., the prediction threshold (cut-off).These models were then used to evaluate the remaining files in the dataset.Two gAcc values were computed for every combination of prediction model and evaluation data.The first gAcc was obtained when the model's pre-set prediction threshold value, which was calculated using the 10-folds cross-validation, was used to predict the test data file.The second gAcc value was calculated based on the maximum accuracy reached when the prediction threshold value was adapted to the evaluated data file.This section shows the results obtained under this setup.
Plots of the gAcc (in Figures A1-A3 in Appendix A) present the performance of the prediction model (MDL k ) that was trained using File k on the files in the dataset (File i k ) that were not used in producing that model.In these figures, each model's performance, based on the CV technique, is illustrated with a solid line; other individual performance evaluations are depicted with dotted lines.As the SEA and AGR datasets are composed of only six files each, there is no illustration of model 7 for these datasets in these figures.C5.0 algorithm: Algorithm C5.0 had the worst performance on the first file in the gureKDD dataset, even at CV evaluation during the model generation stage (Figure A1 in Appendix A).This is due to the fact that this file has the least number of attacks (21 attacks) and is the most imbalanced of the files.Therefore, the generated model using this file was not able to predict any instances in other files.Where the number of attacks in other files increased with a proportionate balance, the model performances improved under this algorithm.
Generally, applying this algorithm under the prospective sampling approach followed the same pattern as the first experiment (10-folds cross-validation), where performance on gureKDD resulted in the highest accuracy, followed by the SEA dataset; the worst performing dataset was the AGR.
For both the SEA and AGR datasets, the generated models performed best when files exhibited the same statistical properties, denoted in these experiments by the same generating functions.For example, where MDL 1 used File 1 as training data, it predicted instances in File 2 with a high performance and vice versa, as both files were generated using the same function.This was also true for Files 3 and 4.Where files contained mixed behaviours, the prediction performance dropped sharply.
Tables A1-A3-in Appendix A-present the results of each model on every file generated by each of the different algorithms.These tables show that the performance of all of these models improved when the threshold (cut-off) was adapted for the evaluation dataset, rather than using a pre-calculated one.Random Forest (RF) algorithm: Unlike C5.0, it was expected that RF would perform well on the first file of the gureKDD dataset despite its low number of attack connections.This was linked to the bootstrap stage, where instances were sampled from the population with replacement.This means that duplicates of the 21 attack connections were sampled many times, which increased the predictability of the built trees (Figure A2 in Appendix A).
After careful examination of the results, as presented by Table A2 in Appendix A, one can see, especially in the synthetic data (SEA and AGR), that when a testing file has similar statistical properties to the model, its performance will not increase much, even after cut-off adaptation.However, when it has different statistical properties, the adaptation process boosts the prediction, leading to an accurate evaluation of a model's performance.
Furthermore, the effect of the adaptation process was more tangible in gureKDD than in the synthetic data, as this dataset exhibited both different patterns and varying statistical properties between files.For example, Table A2-in Appendix A-under gureKDD data, shows that MDL 1 , which was trained on File 1, reached a gAcc of 67.33% on File 5 when the original cut-off (threshold) of the model was used, but applying the adaptation process to this threshold increased its performance to 99.37%.

SVM algorithm:
SVM performed the worst on the AGR dataset in comparison to the other algorithms (Figure A3 in Appendix A).This could have been the result of the non-linear nature of this dataset, which was not picked up by the SVM linear implementation used in these experiments.In general, the cut-off (threshold) adaptation showed a similar effect in improving the models' performances compared to using the model's optimal threshold.

Discussion
The findings of the experiments in this section illustrate the importance of the adapted cut-off value to the data-model pairs in achieving an accurate reading of each model's performance.Friedman's test [96,97] was used to assess whether the difference between the different algorithms was significant before and after threshold adaptation.The tested hypothesis was, "there is no statistically significant difference in model gAccs before and after cut-off (threshold) adaptation between the different algorithms."This test revealed that there was a significant difference between the different algorithms before and after threshold adaptation, χ 2 (5) = 217.7,p = 0.000 < 0.05.
To identify which algorithms were different, a Nemenyi post-hoc test was carried out to calculate the pairwise comparisons.Figure 7 presents the critical differences between the different algorithms before and after cut-off adaptation as a plot.The plot shows that when the cut-off was adapted for the evaluated dataset, the SVM and C5.0 algorithms were no different to each other.They showed the same behaviour even when cut-off adaptation was not performed, but the cut-off adaptation increased their gAccs.In general, all algorithms were ranked higher when cut-off adaptation was performed, with the RF algorithm always outperforming the other two.There are many techniques to perform feature selection, which aims to select a subset of salient features to build a prediction model.Bi et al. [99] have attempted feature selection through introducing a probe to the data by adding three randomly generated variables (fake features/columns) to the dataset.These fake features are randomly drawn from a Gaussian distribution [100].They use a linear SVM to model the subsets at every iteration of a K-folds crossvalidation, where variables with nonzero weights are selected.Any variable (feature) with an average weight below that of the fake variables is then rejected.This approach does not address weight variability, as it only compares averages.
Similarly, Kursa et al. [101,102] have proposed a similar approach in which the information system (training data) is doubled, so that every feature has a shadow feature that is basically a shuffled version of the original one.Feature importance evaluation is then performed on the extended system using the RF algorithm.A K-folds CV-of at least 10-folds-is performed at every iteration, so that every feature is compared to its shadow using statistical tests to evaluate the highest performing features.The main drawbacks of this approach are scalability and speed.Therefore, in this paper a new approach has been proposed and executed that combines the core ideas of the two approaches above.
In this approach, as illustrated in Figure 8, the information system (training data) is extended by adding three randomly generated variables (fake features/columns) to the dataset, where these fake variables are drawn randomly from a Gaussian distribution.A feature importance evaluation-using the RF algorithm-was performed on the newly extended system, and the importance measures of these random variables were then used as a threshold to reject any features with a lower importance value than those of the fake variables.In other words, any feature that performed worse than a random guess was rejected.This comparison was performed using statistical measures.As equal variance between compared groups (feature versus fake variables) is not guaranteed, and due to the unbalanced design (number of compared importance measures) of these comparisons, which would have small sample sizes, Welch's two sample t-test [103,104] was used.Comparisons were performed to evaluate the statistical significance of the mean difference between every feature and the fake variables.The aim of this approach was to speed up the feature selection stage and to make it independent of human evaluation or fixed thresholds, so that it would be more adaptive to

Second Experiment
In this set of experiments, we used the STA2018 dataset to investigate the capability of various ML algorithms in adapting their predictions to the variability of network traffic.We investigated this new approach (prediction threshold adaptation) in the IDS domain.
Typical model development would be governed by decisions made to improve some performance measures, e.g., speed or detection rate.Such decisions, which might involve executing a feature selection and/or a data balancing stage, are usually based on the analysis that will be conducted on the training data.As such, when new evaluation data are used, the performance of the models may not be satisfactory, leading to a phasing out of those models and the generation of new ones.However, such models may still be able to maintain high performances if they are adapted to the new concept that is introduced in the new data.
There are many techniques to perform feature selection, which aims to select a subset of salient features to build a prediction model.Bi et al. [99] have attempted feature selection through introducing a probe to the data by adding three randomly generated variables (fake features/columns) to the dataset.These fake features are randomly drawn from a Gaussian distribution [100].They use a linear SVM to model the subsets at every iteration of a K-folds cross-validation, where variables with nonzero weights are selected.Any variable (feature) with an average weight below that of the fake variables is then rejected.This approach does not address weight variability, as it only compares averages.
Similarly, Kursa et al. [101,102] have proposed a similar approach in which the information system (training data) is doubled, so that every feature has a shadow feature that is basically a shuffled version of the original one.Feature importance evaluation is then performed on the extended system using the RF algorithm.A K-folds CV-of at least 10-folds-is performed at every iteration, so that every feature is compared to its shadow using statistical tests to evaluate the highest performing features.The main drawbacks of this approach are scalability and speed.Therefore, in this paper a new approach has been proposed and executed that combines the core ideas of the two approaches above.
In this approach, as illustrated in Figure 8, the information system (training data) is extended by adding three randomly generated variables (fake features/columns) to the dataset, where these fake variables are drawn randomly from a Gaussian distribution.A feature importance evaluation-using the RF algorithm-was performed on the newly extended system, and the importance measures of these random variables were then used as a threshold to reject any features with a lower importance value than those of the fake variables.In other words, any feature that performed worse than a random guess was rejected.This comparison was performed using statistical measures.
adding three randomly generated variables (fake features/columns) to the dataset, where these fake variables are drawn randomly from a Gaussian distribution.A feature importance evaluation-using the RF algorithm-was performed on the newly extended system, and the importance measures of these random variables were then used as a threshold to reject any features with a lower importance value than those of the fake variables.In other words, any feature that performed worse than a random guess was rejected.This comparison was performed using statistical measures.As equal variance between compared groups (feature versus fake variables) is not guaranteed, and due to the unbalanced design (number of compared importance measures) of these comparisons, which would have small sample sizes, Welch's two sample t-test [103,104] was used.Comparisons were performed to evaluate the statistical significance of the mean difference between every feature and the fake variables.The aim of this approach was to speed up the feature selection stage and to make it independent of human evaluation or fixed thresholds, so that it would be more adaptive to As equal variance between compared groups (feature versus fake variables) is not guaranteed, and due to the unbalanced design (number of compared importance measures) of these comparisons, which would have small sample sizes, Welch's two sample t-test [103,104] was used.Comparisons were performed to evaluate the statistical significance of the mean difference between every feature and the fake variables.The aim of this approach was to speed up the feature selection stage and to make it independent of human evaluation or fixed thresholds, so that it would be more adaptive to the true nature of the dataset.This study adapts the approach of Bi et al. [99] to address the limitation of the Kursa et al. [101] method.
Every fake feature was formed of N random values drawn from a Gaussian distribution with a mean of zero and a standard deviation of one, where N was the number of records in the training data.These random features were combined with the original dataset and were processed by the RF algorithm to compute its features' importance, using a 3-folds CV technique.A Welch's t-test statistical [103,104] comparison was then performed to evaluate whether the mean of the importance measures of every feature, F i ,-from the three folds-was statistically significantly greater than the mean importance of the fake features (with a significance level of α = 0.05).All features with a mean importance statistically significantly greater than that of the fake features were selected.The steps of the feature selection stage are illustrated in Algorithm 1.
As RF can return importance score of every feature, it has been used to select the salient features using its two measures {mean decrease of accuracy (MDA) and mean decrease in Gini (MDG)} [70,105].
In the feature importance evaluation, 15 categorical (factor) features were eliminated from the STA2018 dataset, as they had been added to all the experiments' model building designs and evaluation process by default.These features are listed in Table 7.The experiments were executed in three different phases, as explained below, and presented with the pseudo code in Algorithm 2.
As the STA2018 dataset distinguishes between original and synthetic records, every day's traffic file (subset) was pre-processed in order to be used into two modes [imbalanced and balanced] (line 6 in Algorithm 2).As explained earlier, the SMOTE algorithm [95] was used by the STA2018 dataset [73] to generate synthetic instances of the minority class until the number of instances in both classes were equal to each other.
In the first phase (lines 10-14 in Algorithm 2), every file in the STA2018 dataset (which was used to generate the models) was evaluated to select two subsets of features (see Algorithms 1) using the Mean Decrease of Accuracy and the Mean Decrease Gini, resulting in the formation of the MDA and MDG sets, respectively.The same feature selection criteria were used on the balanced data file to generate another two sets of features, referred to in this paper as MDA Balanced and MDG Balanced .By the end of this phase, there were four feature sets (see Table 8) along with the Full features set for each training day.In the second phase (lines 16-25 in Algorithm 2), each day's traffic used each of the five feature sets (including the Full features set) to generate a binary classification (prediction) model, which resulted in five different models.The same process was repeated using the balanced data.Each model generation step used the 3-folds CV technique to establish the model's optimal (CV) prediction threshold.The final prediction threshold was computed by aggregating all the fold's predictions for each model to find the point (threshold) of the maximum gAcc.By the end of this phase, there were ten different binary prediction models for each day's traffic.
In the final phase (lines 27-38 in Algorithm 2), every generated model was evaluated against each day's traffic from the dataset that had not been used in any of the feature selection, or in the model generation processes.In this phase, to test the data file for each evaluation, the gAcc was computed using the model's optimal (CV) threshold and the adapted cut-off.
The whole process was repeated for each of the algorithms being evaluated: C5.0, RF and SVM.

Results and Discussion
As every generated model was evaluated using all of the files (subsets) in the dataset except the one that had been used to generate that model, two gAcc values were computed for every combination of prediction model and evaluation data.The first gAcc (gAcc Thr CV ) was the one obtained after the model's optimal (CV) cut-off value had been calculated using 3-folds CV to predict the data file.The other gAcc value (gAcc Thr Opt ) was calculated based on the maximum accuracy achieved after the prediction cut-off value had been specifically adapted for the evaluated data file.
As stated earlier, this set of experiments aimed to investigate the effect of the cut-off adaptation by determining the statistical significance in the gAcc of the models through comparing their optimal threshold with the adaptive cut-off.The analysis compared the difference between the two approaches by conducting four Friedman's tests [96,97] (with a significance level of α = 0.05).The decision to use the non-parametric Friedman's test was based on the fact that the data did not follow a normal distribution, as confirmed by the normality test (Shapiro-Wilk test) [106] W = 0.7, p-value = 0.000.The following list shows the hypotheses that were tested and the results returned by the Friedman tests.
Threshold-H 0 : "there are no statistically significant differences in model gAccs before and after cut-off (threshold) adaptation has been applied." χ 2 (1) = 873.0,p = 0.000 < 0.05 (differences were statistically significant) ML-H 0 : "there are no statistically significant differences in model gAccs between the different ML algorithms (C5.0,RF and SVM) before and after cut-off (threshold) adaptation has been applied." χ 2 (5) = 747.5,p = 0.000 < 0.05 (differences were statistically significant) Features-H 0 : "there are no statistically significant differences in model gAccs between the different feature sets (Full, MDA, MDG, MDA Bal. and MDG Bal. ) before and after cut-off (threshold) adaptation has been applied."χ 2 (9) = 742.8,p = 0.000 < 0.05 (differences were statistically significant) Balance-H 0 : "there are no statistically significant differences in model gAccs between the different data balances (Original and Balanced data) before and after cut-off (threshold) adaptation has been applied." χ 2 (3) = 761.3,p = 0.000 < 0.05 (differences were statistically significant) As all of these tests showed significant differences, a Nemenyi post-hoc test [107][108][109] was conducted to perform pairwise comparisons on the different effects of each test to distinguish which differences were statistically significant.The results of these pairwise comparisons are illustrated in Figure 9 through critical difference plots.
All of the plots in Figure 9 show that the cut-off adaptation effect was significantly different from the fixed model's optimal (CV) threshold.They also show that different treatments (ML algorithm, feature sets and/or data balance) with the adaptive cut-off always ranked higher.Any insignificant differences fall within the same effect (cut-off adaptation or model's fixed optimal threshold).
Having shown that the models' performance was ranked significantly higher when the adaptive cut-off approach was used rather than the fixed optimal (CV) threshold (see results in Tables A4-A6 in Appendix A), all subsequent analyses focus on the results obtained using the adaptive cut-off.For every analysed algorithm, a Friedman's test (with a significance level of α=0.05) was performed to test the hypothesis, "there are no statistically significant differences in the gAccs of models built with different feature sets and different data balances after a cut-off (threshold) adaptation has been applied."The results of this hypothesis are discussed under every algorithm.

C5.0 algorithm
Results in Table A4-in Appendix A-for the C5.0 models show different patterns and behaviours from one training day to another.For example, models trained on Day 2 (12/Jun) failed to perform well on Day 5 (15/Jun), whereas Day 5 models predicted Day 2 traffic with a high degree of accuracy.They also showed inconsistent behaviour towards different feature sets across the days.For example, Day 2 models performed best when the Full feature set was used, but this pattern was not consistent across all days.This can clearly be seen from the results of Day 5, when MDG features were used, and the results of Day 7 (17/Jun) when MDA or MDABal.feature sets were used with the balanced training data.One important observation to make is the poor accuracy of Day 6 (16/Jun) models when the original training data were used.These models showed the worst accuracy, due to the low number of attacks in this data file.When a balanced version of the Day 6 data file was used to build the prediction models, accuracy improved.This supports the finding discussed in the previous experiments regarding the behaviour of C5.0 algorithm with imbalanced data.It can also be clearly observed from these results that data balancing had a minor effect in improving the accuracy of models developed using the C5.0 algorithm, which was further investigated using statistical analysis.
The results of Friedman's test-stated above-revealed that there was not enough evidence to support this hypothesis, χ 2 (9) = 16.0,p = 0.067 ≮ 0.05.These tests showed that there was no significant effect of one feature set over another when the C5.0 algorithm was used.In addition, data balancing did not lead to a significant improvement in a model's accuracy.
Results in Table A4 (see Appendix A) support this conclusion, as the C5.0 models show unstable behaviours.Many factors could be behind the volatile behaviour of the C5.0 algorithm.For example, selected feature sets might not be the best sets for this algorithm.This algorithm was also executed within its default parameters, in particular the number of trials, which was set at ten.In addition, C5.0 algorithms carry out random sampling by following the boosting technique (which randomly samples weighted instances).This might have caused C5.0 to overfit the training data, which could be one of the reasons for its overall poor accuracy in predicting new traffic.Overall, based on the

C5.0 algorithm
Results in Table A4-in Appendix A-for the C5.0 models show different patterns and behaviours from one training day to another.For example, models trained on Day 2 (12/Jun) failed to perform well on Day 5 (15/Jun), whereas Day 5 models predicted Day 2 traffic with a high degree of accuracy.They also showed inconsistent behaviour towards different feature sets across the days.For example, Day 2 models performed best when the Full feature set was used, but this pattern was not consistent across all days.This can clearly be seen from the results of Day 5, when MDG features were used, and the results of Day 7 (17/Jun) when MDA or MDA Bal.feature sets were used with the balanced training data.One important observation to make is the poor accuracy of Day 6 (16/Jun) models when the original training data were used.These models showed the worst accuracy, due to the low number of attacks in this data file.When a balanced version of the Day 6 data file was used to build the prediction models, accuracy improved.This supports the finding discussed in the previous experiments regarding the behaviour of C5.0 algorithm with imbalanced data.It can also be clearly observed from these results that data balancing had a minor effect in improving the accuracy of models developed using the C5.0 algorithm, which was further investigated using statistical analysis.
The results of Friedman's test-stated above-revealed that there was not enough evidence to support this hypothesis, χ 2 (9) = 16.0,p = 0.067 ≮ 0.05.These tests showed that there was no significant effect of one feature set over another when the C5.0 algorithm was used.In addition, data balancing did not lead to a significant improvement in a model's accuracy.
Results in Table A4 (see Appendix A) support this conclusion, as the C5.0 models show unstable behaviours.Many factors could be behind the volatile behaviour of the C5.0 algorithm.For example, selected feature sets might not be the best sets for this algorithm.This algorithm was also executed within its default parameters, in particular the number of trials, which was set at ten.In addition, C5.0 algorithms carry out random sampling by following the boosting technique (which randomly samples weighted instances).This might have caused C5.0 to overfit the training data, which could be one of the reasons for its overall poor accuracy in predicting new traffic.Overall, based on the statistical results returned using Friedman's test, the C5.0 models ranked low, as illustrated by Figure 9b.Random Forest algorithm Another Friedman's test was performed to assess the above hypothesis for the RF algorithm's models.This test aimed to determine how these models performed when using different feature sets with different data balances, and whether the difference in accuracy was significant after applying the threshold adaptation.
This test revealed that for the RF algorithm, there were significant differences between these features after applying the cut-off (threshold) adaptation, χ 2 (9) = 38.0,p = 0.000 < 0.05.To distinguish which of these effects were statistically significant, a Nemenyi post-hoc test was conducted to perform a pairwise comparison, as illustrated in Figure 10.
models.This test aimed to determine how these models performed when using different feature sets with different data balances, and whether the difference in accuracy was significant after applying the threshold adaptation.
This test revealed that for the RF algorithm, there were significant differences between these features after applying the cut-off (threshold) adaptation, χ 2 (9) = 38.0,p = 0.000 < 0.05.To distinguish which of these effects were statistically significant, a Nemenyi post-hoc test was conducted to perform a pairwise comparison, as illustrated in Figure 10.
Overall, there were no significant differences in the RF's accuracy when the Full, MDA and MDABal.feature sets were used.However, the Full features set showed a significant difference over the MDG and MDGBal.feature sets, which ranked lowest among the feature sets.This could be due to the nature of the mean decrease Gini metric in selecting local features, which have low generalisation power.However, even with these low accuracies, RF had the highest overall accuracy.As Figure 10 shows, the data balance had no significant effect on the accuracy of RF.On the contrary, it sometimes negatively affected the accuracy of models using the Full feature set with balanced data, as their difference to the MDG and MDGBal.feature sets became insignificant.This was also evident in the results of Day 6 in Table A5-in Appendix A-which showed a lower accuracy for all models for that day as the balanced version of the data was used.Although that day only had 11 attacks, RF was able to build good predictive models with good evaluation accuracy, except for Day 4's traffic.The ability of RF to learn from Day 6 traffic was linked to its bagging technique.In contrast to C5.0, this sampling technique prevented RF from overfitting, which in turn produced models with good generalisation capabilities.This gave RF more chance of detecting novel attacks, as demonstrated in these experiments.The RF algorithm showed the best results of the evaluated ML algorithms.As illustrated by the results in Table A5, RF's accuracy would not have been better than that of the other algorithms if the fixed optimal (CV) threshold of its models had been used to assess their accuracy.However, with the cut-off adaptation approach, RF's accuracy improved significantly.
The RF algorithm can take longer to train, depending on the complexity of the training data.However, once the model is built, its evaluation of a new instance is reasonably fast.Overall, there were no significant differences in the RF's accuracy when the Full, MDA and MDA Bal.feature sets were used.However, the features set showed a significant difference over the MDG and MDG Bal.feature sets, which ranked lowest among the feature sets.This could be due to the nature of the mean decrease Gini metric in selecting local features, which have low generalisation power.However, even with these low accuracies, RF had the highest overall accuracy.As Figure 10 shows, the data balance had no significant effect on the accuracy of RF.On the contrary, it sometimes negatively affected the accuracy of models using the Full feature set with balanced data, as their difference to the MDG and MDG Bal.feature sets became insignificant.This was also evident in the results of Day 6 in Table A5-in Appendix A-which showed a lower accuracy for all models for that day as the balanced version of the data was used.Although that day only had 11 attacks, RF was able to build good predictive models with good evaluation accuracy, except for Day 4's traffic.The ability of RF to learn from Day 6 traffic was linked to its bagging technique.In contrast to C5.0, this sampling technique prevented RF from overfitting, which in turn produced models with good generalisation capabilities.This gave RF more chance of detecting novel attacks, as demonstrated in these experiments.
The RF algorithm showed the best results of the evaluated ML algorithms.As illustrated by the results in Table A5, RF's accuracy would not have been better than that of the other algorithms if the fixed optimal (CV) threshold of its models had been used to assess their accuracy.However, with the cut-off adaptation approach, RF's accuracy improved significantly.
The RF algorithm can take longer to train, depending on the complexity of the training data.However, once the model is built, its evaluation of a new instance is reasonably fast.
As expected, it consumed a lot of resources (memory) at the model building phase, and this consumption increased with the size of the training data.This was a result of the number of bootstrap samples it generated, which were used to build trees in parallel threads.The resulting models were quite large compared to the SVM and C5.0 models, and their sizes increased as the complexity of the training data increased.

SVM algorithm
Similar to C5.0 and RF algorithms, Friedman's test was used to assess the above hypothesis for the SVM models.This test revealed that there was not enough evidence to support this hypothesis, χ 2 (9) = 13.1, p = 0.158 ≮ 0.05.
The SVM algorithm exhibited similar behaviour to the C5.0 algorithm.All of its statistical tests revealed insignificant effects between one feature set and another, and there was no sign that the improved accuracy of its models was influenced by any of the data balancing effects.As with the C5.0 algorithm, different behaviours were exhibited on different days, as presented in Table 6 (see Appendix A), so no consistent pattern could be deduced.
Although the SVM algorithm showed some overall improvement on days when the reduced feature sets were used instead of the Full features set, this behaviour was not consistent.As a linear version of SVM was used, this effect could have been caused by the non-linear nature of the data on those days where SVM failed to perform well.
Figure 11 summarises all of the accuracy readings in the tables (Tables A4-A6) after the threshold adaptation process was applied.It compares the average accuracy of all the C5.0, RF and SVM models.This plot shows the average accuracy for each day's model for all of the tested ML algorithms.The standard error of the average accuracy for each model is illustrated by vertical bars.For each algorithm, the mean accuracy of all models across all days for every combination of feature sets and data balance type is represented by a horizontal dashed line.As this plot shows, RF was always the highest performing of the ML algorithms evaluated.Unlike C5.0 and SVM, RF showed the most stable results with the least variability.
Although Figure 11 shows that the highest average accuracy of the RF models was attained when the Full features set was used, the difference in the average accuracy of its models was very small, unlike the accuracy of the C5.0 and SVM models, which showed higher variations in accuracy.Therefore, RF models could be generated using a reduced feature set without any significant decrease in their average accuracy, but with a significant gain in speed.Figure 11 also shows that there was only a high variation in the accuracy of models for Day 6; however, given that this day was problematic, with its skewed balance, this level of accuracy is more than acceptable.Moreover, in a real life scenario it would not be sensible to build a model using such data, hence this example is an extreme case, which is presented here merely to demonstrate that the RF algorithm performed reasonably well.
In general, although a linear SVM implementation was used in these experiments, it showed some good results.For example, on average, the accuracy of models trained on the original and balanced version of Day 4's traffic, using MDG features, was above 90% (see Figure 11).The accuracy of models trained using the MDA features on the original version of Day 6 traffic (which only had 11 attacks), was also above 89%.With such results, more analysis is required to identify the right combination of fast kernel function and parameter tuning to improve the overall SVM results.This would make it an attractive solution for IDS problems.In general, although a linear SVM implementation was used in these experiments, it showed some good results.For example, on average, the accuracy of models trained on the original and balanced version of Day 4's traffic, using MDG features, was above 90% (see Figure 11).The accuracy of models trained using the MDA features on the original version of Day 6 traffic (which only had 11 attacks), was also above 89%.With such results, more analysis is required to identify the right combination of fast kernel function and parameter tuning to improve the overall SVM results.This would make it an attractive solution for IDS problems.

Conclusion
In this work, we have presented the effect of prediction threshold (cut-off) adaptation on improving detection accuracy of binary-based prediction (IDS) models.We also presented how such an approach will benefit the IDS domain is maintaining detection models for long periods of time.The results of our experiments show that the adapted threshold provided a more accurate reading of a model's true accuracy in comparison to the use of a fixed threshold.From these experiments, we highlight the following characteristics of threshold adaptation: • An adaptive cut-off (threshold) approach results in better classification performance than a fixed threshold.

•
Using a single cut-off (threshold) will lead to misleading results, which could result in a decision to terminate a good prediction model that merely required some tuning.

•
Threshold adaptation approach may not show significant improvement to a model's accuracy when the testing data exhibits the same statistical properties as the training data.The results of these analyses showed that RF outperformed the other algorithms (C5.0 and SVM) in its ability to predict new traffic and the detection of novel anomalies.It also showed that, before cut-off adaptation, all of the ML algorithms performed as poorly as each other, but that the adaptive cut-off approach increased their overall accuracy, with RF performing the best.Moreover, RF suffered no significant loss in accuracy when the reduced feature sets were used, and its predictions did not improve when the data was balanced, given that the prediction threshold is constantly adapted.This gives RF the advantage of being able to build models using original data with a reduced

Conclusions
In this work, we have presented the effect of prediction threshold (cut-off) adaptation on improving detection accuracy of binary-based prediction (IDS) models.We also presented how such an approach will benefit the IDS domain is maintaining detection models for long periods of time.The results of our experiments show that the adapted threshold provided a more accurate reading of a model's true accuracy in comparison to the use of a fixed threshold.From these experiments, we highlight the following characteristics of threshold adaptation:

•
An adaptive cut-off (threshold) approach results in better classification performance than a fixed threshold.

•
Using a single cut-off (threshold) will lead to misleading results, which could result in a decision to terminate a good prediction model that merely required some tuning.

•
Threshold adaptation approach may not show significant improvement to a model's accuracy when the testing data exhibits the same statistical properties as the training data.
The results of these analyses showed that RF outperformed the other algorithms (C5.0 and SVM) in its ability to predict new traffic and the detection of novel anomalies.It also showed that, before cut-off adaptation, all of the ML algorithms performed as poorly as each other, but that the adaptive cut-off approach increased their overall accuracy, with RF performing the best.Moreover, RF suffered no significant loss in accuracy when the reduced feature sets were used, and its predictions did not improve when the data was balanced, given that the prediction threshold is constantly adapted.This gives RF the advantage of being able to build models using original data with a reduced feature set, which will save a considerable amount of time in training and testing, which makes this algorithm more attractive for such problems.
In these analyses, the gAcc measures were used as the model assessment criteria to avoid issues with imbalanced data.The accuracy of all models was assessed using the non-parametric Friedman test to identify any significant differences.Cut-off adaptation and the algorithm used were the most important effects that contributed to any significant difference in a model's accuracy.
Furthermore, this study also showed that the K-folds cross-validation (CV) analysis on the entire dataset reports over-optimistic results that would not reflect the true capability of detection models in real life setups.This technique failed to reveal and assess the true power of the detection capability of different ML models.For example, the C5.0 ranked higher than SVM when this technique was applied on the entire datasets (as in the first setup of the first experiment), however, it was no better when more a natural setup (prospective sampling technique) was in effect.As a result, research results using the K-folds CV technique should be carefully addressed in domains such as IDS.
In future work, we will investigate new approaches in identifying the optimal adaptive prediction threshold (cut-off) based on a small randomly selected sample of an evaluated traffic.Another potential avenue for further investigation is including larger, real industrial datasets with real world attacks and unknown labels, to determine if this research can be generalised and to compare the detection accuracy to some production IDS.A1). Figure A1.The gAcc curves for the C5.0 algorithm (see Table A1).A3). Figure A3.The gAcc curves for the SVM algorithm (see Table A3).
Table A4.The gAcc of models for the fixed optimal (CV) and adapted cut-off (threshold) for the C5.0 algorithm.

Figure 1 .
Figure 1.Main phases of Random Forest algorithm.

Figure 1 .
Figure 1.Main phases of Random Forest algorithm.

Information 2019 ,
10, x FOR PEER REVIEW 12 of 44threshold.There were four functions, which would label the instances differently based on their threshold values between the two classes (function 1 sets θ = 8, function 2 uses θ = 9, function 3 sets θ = 7, and function 4 sets θ = 9.5)[84].The SEA generator's default setting was used to add 10% noise classes.Six different data streams (files) were produced: function 1 was used to generate two streams (file 1 and file 2); function 2 was used to generate two other streams (file 3 and file 4); and a combination of function 1 and function 2 was used to generate two streams (file 5 and file 6).For every file, calls to these functions used different seed values to set the seed of the random generator function to generate new random instances.Figure2presents an example of the command line call to generate File 1 with the SEA stream generator.

Figure 2 .
Figure 2. Command used to generate File 1 of streaming ensemble algorithm (SEA) dataset.

Figure 2 .
Figure 2. Command used to generate File 1 of streaming ensemble algorithm (SEA) dataset.

Figure 3 .
Figure 3. Command used to generate File 1 of AGR dataset.

Figure 3 .
Figure 3. Command used to generate File 1 of AGR dataset.

Figure 4 .
Figure 4.The phase diagram of the experiments.

6. 1
.1.10-folds Cross-validation on Full DataIn the first setup (Figure4), we started these experiments by comparing the detection

Figure 4 .
Figure 4.The phase diagram of the experiments.

Information 2019 , 44 Figure 6 .
Figure 6.Critical differences plot of the pairwise Nemenyi comparison test for the full datasets 10folds cross-validation experiment.

Figure 7 .
Figure 7. Critical differences plot of the pairwise Nemenyi comparison test for the cut-off (threshold) adaptation experiment.

Figure 8 .
Figure 8. Critical differences plot of the pairwise Nemenyi comparison test for the cut-off (threshold) adaptation experiment.

Figure 7 .
Figure 7. Critical differences plot of the pairwise Nemenyi comparison test for the cut-off (threshold) adaptation experiment.

Figure 8 .
Figure 8. Critical differences plot of the pairwise Nemenyi comparison test for the cut-off (threshold) adaptation experiment.

Figure 8 .
Figure 8. Critical differences plot of the pairwise Nemenyi comparison test for the cut-off (threshold) adaptation experiment.

Figure 9 .
Figure 9. Graphical illustration of pairwise comparisons from the Friedman Test results for different threshold effects (optimal or adaptive cut-off) after applying the Nemenyi test (95% confidence level) (a) Fixed vs. adaptive thresholds.(b) ML algorithms under threshold adaptation effect.(c) Feature sets under threshold adaptation effect.(d) Training data balances under threshold adaptation effect.

Figure 9 .
Figure 9. Graphical illustration of pairwise comparisons from the Friedman Test results for different threshold effects (optimal or adaptive cut-off) after applying the Nemenyi test (95% confidence level) (a) Fixed vs. adaptive thresholds.(b) ML algorithms under threshold adaptation effect.(c) Feature sets under threshold adaptation effect.(d) Training data balances under threshold adaptation effect.

Figure 10 .
Figure 10.Nemenyi test (95% confidence level) on the RF algorithm models using different feature sets and different data balances after applying the adaptive cut-off approach.

Figure 10 .
Figure 10.Nemenyi test (95% confidence level) on the RF algorithm models using different feature sets and different data balances after applying the adaptive cut-off approach.

Figure 11 .
Figure 11.Comparison plot of the average accuracy of every C5.0, RF and SVM model for every feature set and data balance combination.

Figure 11 .
Figure 11.Comparison plot of the average accuracy of every C5.0, RF and SVM model for every feature set and data balance combination.

Table 1 .
Structure of confusion matrix for n classes.

Table 1 .
Structure of confusion matrix for n classes.

Table 2 .
Instances' classes in every file in the SEA dataset.

Table 2 .
Instances' classes in every file in the SEA dataset.

Table 3 .
Instances' classes in every file in the AGR dataset.

Table 4 .
Number of connection classes in every file in the gureKDD dataset.

Table 3 .
Instances' classes in every file in the AGR dataset.

Table 4 .
Number of connection classes in every file in the gureKDD dataset.

Table 5 .
Number of classes of instances for each day's file of the STA2018 dataset.

Table 7 .
Categorical (factor) features eliminated from the feature importance evaluation phase.

Table 8 .
Number of selected features under every feature importance measure.

Table A1 .
C5.0 model's performance on different datasets with various effects (before and after threshold adaptation).

Table 1 .
C5.0 model's performance on different datasets with various effects (before and after threshold adaptation).

Table A2 .
Random Forest (RF) model's performance on different datasets with various effects (before and after threshold adaptation).

Table 2 .
Random Forest (RF) model's performance on different datasets with various effects (before and after threshold adaptation).

Table A3 .
SVM model's performance on different datasets with various effects (before and after threshold adaptation).

Table 3 .
SVM model's performance on different datasets with various effects (before and after threshold adaptation).

Table A4 .
The gAcc of models for the fixed optimal (CV) and adapted cut-off (threshold) for the C5.0 algorithm.

Table A5 .
The gAcc of models for the fixed optimal (CV) and adapted cut-off (threshold) for the Random Forest (RF) algorithm.

Table A5 .
The gAcc of models for the fixed optimal (CV) and adapted cut-off (threshold) for the Random Forest (RF) algorithm.

Table A6 .
The gAcc of models for the fixed optimal (CV) and adapted cut-off (threshold) for the SVM algorithm.

Table A6 .
The gAcc of models for the fixed optimal (CV) and adapted cut-off (threshold) for the SVM algorithm.