Add-On Anomaly Threshold Technique for Improving Unsupervised Intrusion Detection on SCADA Data

: Supervisory control and data acquisition (SCADA) systems monitor and supervise our daily infrastructure systems and industrial processes. Hence, the security of the information systems of critical infrastructures cannot be overstated. The effectiveness of unsupervised anomaly detection approaches is sensitive to parameter choices, especially when the boundaries between normal and abnormal behaviours are not clearly distinguishable. Therefore, the current approach in detecting anomaly for SCADA is based on the assumptions by which anomalies are deﬁned; these assumptions are controlled by a parameter choice. This paper proposes an add-on anomaly threshold technique to identify the observations whose anomaly scores are extreme and signiﬁcantly deviate from others, and then such observations are assumed to be ”abnormal”. The observations whose anomaly scores are signiﬁcantly distant from ”abnormal” ones will be assumed as ”normal”. Then, the ensemble-based supervised learning is proposed to ﬁnd a global and efﬁcient anomaly threshold using the information of both ”normal”/”abnormal” behaviours. The proposed technique can be used for any unsupervised anomaly detection approach to mitigate the sensitivity of such parameters and improve the performance of the SCADA unsupervised anomaly detection approaches. Experimental results conﬁrm that the proposed technique achieved a signiﬁcant improvement compared to the state-of-the-art of two unsupervised anomaly detection algorithms. were trained with question datasets to produce an ensemble-based decision-making model that can be integrated into both unsupervised anomaly scoring and clustering-based intrusion detection approaches. In the former, GATUD is used to mitigate the sensitivity of anomaly threshold, while in the latter, it is used to efﬁciently label the produced clusters as either normal or abnormal. Experiments show that GATUD demonstrates signiﬁcant and promising results when it was integrated into a clustering-based intrusion detection approach as a labelling technique for the produced clusters.


Introduction
Supervisory control and data acquisition (SCADA) systems control and monitor the information systems of industrial and critical infrastructure (such as electricity, gas, and water). Recently, there has been an increase in attacks targeting these systems. Compromising the information systems of SCADA can lead to large financial losses and serious impacts on public safety and the environment. The attack on a sewage treatment system in Maroochy Shire, Queensland, is an obvious example of the seriousness of cyber attacks on critical infrastructure [1]. Therefore, securing and protecting these systems is extremely important [2][3][4]. The new generation of intrusion detection systems (IDSs) will need to detect ad-hoc SCADA-specific attacks that cannot be detected by existing security technologies [5]. Due to the differences between the nature and the characteristics of traditional IT and SCADA systems, there is a need for the development of SCADA-specific IDSs, and in recent years this has become an interesting research area [6][7][8].
A SCADA system monitors and controls a series of process parameters. The values of these supervised parameters can reflect the internal representation of the SCADA system. The values are called "SCADA data" which are found in [6,[8][9][10][11][12][13]] to be a good information source to monitor the internal behaviour of the given system and protect it from malicious actions that are intended to sabotage or disturb the proper functionality of the targeted system. Therefore, the monitoring of the behaviour of SCADA systems through the evolution of SCADA data has attracted the attention of researchers. In [9,10], a predefined threshold (e.g, minimum, maximum) is proposed to monitor each process parameter individually, and any reading that is not inside a prescribed threshold is considered as an anomaly. These approaches are good for monitoring one single process parameter. However, the value of an individual process parameter may not be abnormal, but in combination with other process parameters, may produce an abnormal observation, which very rarely occurs. These types of parameters are called multivariate parameters, and are assumed to be directly (or indirectly) "correlated".
In [11], the authors proposed an analytical approach to manually identify the range of critical states for multivariate process parameters, and the identified ranges are then used to monitor the critical state of the analysed process parameters. However, analytical approaches require expert involvement, and this results in time-intensive processing that is prone to human error. To avoid the aforementioned issues, purely "normal" SCADA data are used to model the normal behaviours. For example, Rrushi [12] applied probabilistic models to estimate the normalcy of the evolution of multivariate process parameters. Zhanwei et al. [6] proposed a combination of a normal control behaviour model and a normal process behaviour model to build SCADA data-driven detection models to monitor abnormal behaviours in industrial equipment. Gao et al. [8] proposed a neural network-based model to learn the normal behaviours for water tank control systems. Similarly, Zaher et al. [13] proposed the same technique to build the normal behaviours for a wind turbine to identify faults or unexpected behaviours (anomalies).
However, it is difficult to build the "normal" behaviours of a given system using observations of the raw SCADA data because, firstly, it cannot be guaranteed that all observations represent one behaviour as either "normal" or "abnormal", and therefore domain experts are required for the labelling of each observation, and this process is prohibitively expensive; secondly, in order to obtain purely "normal" observations that comprehensively represent "normal" behaviours, this requires a given system to be run for a long period under normal conditions, and this is not practical; and finally, it is challenging to obtain observations that will cover all possible abnormal behaviours that can occur in the future. Therefore, we strongly believe that the development of a SCADA-specific IDS that uses SCADA data and operates in unsupervised mode, where the labelled data is not available, has great potential as a means of addressing the aforementioned issues.
The unsupervised IDS can be a time and cost-efficient means of building detection models from unlabelled data; however, this requires an efficient and accurate technique to differentiate between the normal and abnormal observations without the involvement of experts, which is costly and prone to human error. Then, from observations of each behaviour, either normal or abnormal, the detection models can be built. Two assumptions must be made in unsupervised anomaly detection approaches [14][15][16]: (i) the number of normal observations in the dataset vastly outperforms the abnormal observations, and (ii) the abnormal observations must be statistically different from normal ones. Therefore, the performance of the proposed detection models relies mainly on the two aforementioned to distinguish between normal and abnormal behaviours. The reporting of anomalies in the unsupervised mode can be done either by scoring-based or binary-based techniques [17].
Anomaly scoring techniques are categorised into two broad classes: the distance-based and the density-based scoring techniques. The basic idea of the distance-based technique is to distinguish an observation as outlier on the base of the distance to its nearest neighbours, while in the density-based one, an observation is considered as outlier when it lies in a low-density area of its nearest neighbours [17]. All observations in a dataset are given an anomaly score, and therefore actual anomalies are assumed to have the highest scores. The key problem is how to find the best cut-off threshold that minimises the false positive rate while maximising the detection rate. On the one hand, binary-based techniques [14] group similar observations together into a number of clusters. Abnormal observations are identified by making use of the fact that abnormal observations will be considered as outliers, and therefore will not be assigned to any cluster, or they will be grouped into small clusters that have some characteristics which are different from normal clusters. However, labelling an observation as an outlier or a cluster as anomalous is controlled through some parameter choices within each detection technique. For instance, given the top 50% of the observations which have the highest anomaly scores, these are assumed as outliers. In this case, both detection and false positive rates will be higher. Similarly, labelling a low percentage of largest clusters as normal in clustering-based intrusion detection techniques, will result in higher detection and false positive rates. Therefore, the effectiveness of unsupervised intrusion approaches is sensitive to parameter choices, especially when the boundaries between normal and abnormal behaviours are not clearly distinguishable.
This paper proposes the global anomaly threshold to unsupervised detection (GATUD) that is based on anomaly density-based technique because it is considered as one of the anomaly scoring techniques in network anomaly detection [18]. The proposed GATUD approach can be used as an add-on threshold technique to allow unsupervised anomaly scoring-based techniques to set the value of the cut-off threshold parameter at a satisfactory level to guarantee a high detection rate, while minimising the resulting high false positive rate. In addition, it can be used as a robust technique for labelling clusters to improve the accuracy of clustering-based intrusion detection systems. Figure 1 shows that GATUD involves two steps: (i) establishing two small most-representative datasets, where each dataset represents one-class problem (normal or abnormal) with high-confidence; and (ii) using the established datasets to build an ensemble-based decision-making model using a set of supervised classifiers.
This paper is organised as follows. Section 2 presents an overview of related work. Section 3 introduces GATUD. Section 4 presents the experimental set-up, followed by results and discussion in Section 5. Section 6 concludes the paper. Step 1 Step 2 Support Anomaly Score Calculation Labelling process Figure 1. Overview of GATUD.

Related Work
An intrusion detection system (IDS) is a main component in securing computer systems and networks. In the case of SCADA systems, a number of tailored IDSs have been proposed (refer to a survey paper [19]). There are two categories of IDSs: signature-based and anomaly-based. The former detects only known attacks because it monitors the system against specific attack patterns. The latter attempts to build models from the normal behaviour of the systems, and any deviation from this behaviour is assumed to be a malicious activity. Both approaches have advantages and disadvantages. The former achieves good accuracy, but fails to detect attacks that are new or the patterns of which are not learned. Although the latter is able to detect novel attacks, the overall detection accuracy of this approach is low. This paper focuses on the anomaly detection techniques since they are able to address the problem of the zero-day attacks. Rrushi [12] applied statistics and probability theory to estimate the normality of the evolution of values of correlated process parameters. In the work of [20], the authors assumed that communication patterns among SCADA components are well-behaved, and combined the normal behaviour of SCADA network traffic with artificial intrusion observations to learn the boundaries of the normal behaviour using the neural network technique. There are two types of anomaly detection techniques: supervised and unsupervised modes. In the former mode, training data are labelled, while in the latter, data are not labelled. In contrast to conventional information technology (IT), the unsupervised mode has not been used much in SCADA systems. This is because SCADA security research is relatively new compared with IT. In addition, the security requirements of such systems require a high detection accuracy which this mode lacks.
Recently, machine learning techniques have been successfully applied in unsupervised IDS for industrial control systems [21,22]. Bhatia et al [21] proposed an unsupervised model that uses deep learning autoencoders based on artificial neural networks not for dimension reduction only, but for classification to learn the benign network traffic. The key contribution of their proposed model is to identify the minimal latent subspace that contains the essential characteristics of the benign traffic. Similarly, in [22], the authors proposed unsupervised model that utilises the sparse and denoising auto-encoder (DAE) to obtain the robust latent representations by introducing a stochastic noise to the original data. In [23][24][25], unsupervised anomaly scoring approaches based on clustering techniques were proposed to detect normal and abnormal behaviours of industrial control systems at network and environmental parameters levels. Each observation is given an anomaly score, and therefore actual anomalies are assumed to have the highest scores. However, the key problem is how to find the near-optimal cut-off threshold that minimises the false positive rate while maximising the detection rate. In order to overcome this issue, this paper proposes an approach inspired by the work proposed in [26]. The authors assign an anomaly-scoring score for each observation in the unlabelled data; the observations that have the highest anomaly scores are labelled as outliers, while the rest are labelled as normal. They randomly selected a subset of normal data and combined it with outliers to create labelled data. Afterwards, a supervised technique was trained with the labelled data to build an outlier filtering rule that differentiates outliers from normal data. However, our approach differs in that we learn the labelled data from data about which we have no prior knowledge, and a set of supervised classifiers are used to build a robust decision-maker because each classifier can capture different knowledge [27]. Finally, our approach is proposed as an add-on component (not an independent technique) for unsupervised learning algorithms in order to benefit from the inherent characteristics of each algorithm.

The Proposed GATUD Approach
We focus on improving the detection accuracy of unsupervised anomaly detection approaches. This is because such approaches are able to detect (unknown) zero-day attacks. However, they suffer from low accuracy. In this section, we present GATUD that is intended to address this problem. We outline the various steps below.

Learning of Most-Representative Datasets
In this step, two small, most-representative datasets are established from the unlabelled data, where the first and second datasets approximately represent the normal and abnormal behaviours respectively. In order to choose the most-representative datasets, two steps are followed: Step 1: Anomaly Scoring Since we have no prior knowledge about the normal and abnormal data, the k-nearest neighbour notion is adapted to assign an anomaly score to each observation. The k-nearest neighbour notion is chosen because it has produced significant results in anomaly scoring as proposed in [25], in cases where normal data in n-dimension space form dense areas, and the abnormal data are sparsely distributed. Unlike the previous approaches, we are concerned with the most relevant normal and abnormal data rather than with the detection of all anomalies. However, the k-nearest neighbour algorithm is computationally expensive and this issue has been addressed by the proposed the kNNVWC approach in [28], and therefore, it is used to efficiently find k-nearest neighbours for each observation in a dataset. Let S be unlabelled dataset of SCADA data with a multi-dimensional space, m × n matrix, where m and n represent the numbers of observations and attributes in S respectively. Each dimension represents a distinct data point (e.g., temperature, motor speed or humidity), while each observation x i is represented by values of a set of attributes A = {a 1 , a 2 , a 3 , . . . , a n }. Let d be the Euclidean distance between two observations x 1 = {x 1,1 , x 1,2 , . . . , x 1,n } and x 2 = {x 2,1 , x 2,2 , . . . , x 2,n }, where n is the number of the attributes. Given B = {b 1 , b 2 , . . . , b k } be a set of k-nearest neighbours of the observation x i where B ⊂ S, x i ∈ S, and x i / ∈ B. Then the anomaly score of x i is computed as follows: The fast k-nearest neighbour algorithm in [28] is used here to find k-nearest neighbours. Algorithm 1 summarises the calculation steps of anomaly scores for each an observation x i .

Algorithm 1:
Anomaly scoring calculation. Input: S /* A matrix of unlabelled SCADA measurement data consisting of m observations and n attributes */ Input: k /* A positive integer that specifies the number of nearest neighbours */ Output: AnomalyScoresList /* list of Anomaly Scores sorted by rank in descending order ; /* Compute anomaly score as Equation (2) */

Step 2: Selection of Candidate Sets
From the list of anomaly scores, which are produced by Algorithm 1, we establish two small, most-representative datasets, where each dataset represents normal or abnormal behaviours with high confidence. Based on the two previously-mentioned assumptions of normal and abnormal behaviours in the unsupervised mode, we group the list of anomaly scores into three categories, as illustrated in Figure 2: confidence area of anomalies, uncertain area, and confidence area of normality. As shown in this figure, the extent of these areas is determined by the confidence thresholds β, α, and λ. For instance, the smaller threshold β of the confidence area of anomalies, the greater the confidence that the observations falling into this area are abnormal. This is true for the confidence area of the normality. Therefore, the thresholds (β and λ) should be kept at a distance from the uncertain area because this area requires a best cut-off threshold in order to judge an observation as either normal or abnormal, especially when some actual anomalous observations have anomaly scores that are close to some normal ones. Therefore, the most-representative datasets for normal and abnormal behaviours are established from the following two categories: confidence area of normality and confidence area of anomalies respectively. The most-relevant anomalies, AbnormalData, are defined by selecting observations whose indices correspond to the top n of AnomalyScoresList, where n = β × |AnomalyScoresList|. The most relevant normal observations, MostNormal, are defined by selecting observations whose indices correspond to the bottom n of AnomalyScoresList, where n = λ × |AnomalyScoresList|.  Again, if the two assumptions [14] about the unlabelled data are met, the thresholds β and λ are not difficult to determine. According to these assumptions, the anomalies are assumed to constitute a small portion of the data, where this percentage is assumed to not exceed 5%. Since we are not interested in finding all anomalies more than finding the fraction of anomalies with high confidence, and also we are not supposed to approach the uncertain area, the value of β will be set to a value that is smaller than 5%. As opposed to anomalies, the normal data is assumed to constitute a large portion of the data; therefore, setting λ to a small value will result in a small dataset of the most-relevant normal observations. However, this dataset might not approximately represent the large portion of the normal data. To address this problem, we propose to set λ to a large value providing this value does overlap with the uncertain area by, for example, 80%. This will result in a large dataset that is the most-relevant normal. However, the computation time in the ensemble-based decision-making model will be substantially higher.
In order to resolve the previous problem, we extract a small set of representative observations from the most relevant normal dataset. In this step, we group the similar observations together in terms of Euclidean distance, and take their mean as a representative observation for each group. k-means clustering technique [29] is a candidate algorithm for this process because of its simplicity, low computation time, and fast convergence. Moreover, the main disadvantages of k-means of determining the appropriate number of clusters and forcing an outlier observation to be assigned to the closest cluster, even if it is dissimilar to its members, will not be problematic in this step. This is because we are not interested in finding specific clusters more than chopping the data into a number of groups, and also the clustering data (the most-relevant normal dataset) are assumed to be outlier-free. Therefore, the number of clusters k will be set to a small value, where k |D|. Algorithm 2 summarises the steps involved in learning a small representative dataset of the most relevant normal observations.
The two small datasets (the learned normal and anomalous datasets) are combined in order to form a labelled compressed representation of the unlabelled data. The concept of this compressed dataset is slightly similar to the concept of the set of support vectors built by a support vector machine (SVM) [30].

Decision-Making Model
This section introduces the ensemble-based decision-making model (EDMM) used to calculate the support anomaly score for each testing observation. As shown in Figure 1, EDMM is composed of a set of supervised classifiers whose individual decisions are combined to form an ensemble decision. This is because the combining of classifiers promised to be effective [31]. Each classifier c i is trained with the labelled dataset to build a decision model m i . In GATUD, the number and the type of involved supervised classifiers have been left open because the choice of a specific algorithm is a critical step.
Let C = {c i |1 ≤ i ≥ n} be a set of candidate supervised classifiers that build a set of decision models M = {m i |1 ≤ i ≥ n}. Each decision model m i assigns binary-decision value (either "1" or "0") to a testing observation x i , m i (x i ) : v i . When the binary value v i is "1", the observation x i is judged as anomalous, and otherwise is judged as normal. Then the calculation of the support anomaly score is defined as follows: where n is the number of decision models involved in the calculation of the support anomaly score. The observation type (class), whether abnormal or normal, is defined by the following equation: where τ is the percentage of the accepted vote of the decision-models to judge a testing observation x i as anomalous. For instance, when the threshold τ is set to 1, the testing observation will not be considered as anomalous unless all involved decision models agree that the observation is an anomaly.

Illustrative Example
A simple example is given below to illustrate the process of the EDMM. Given five supervised classifiers are selected, C = {c 1 , c 2 , c 3 , c 4 , c 5 }, and trained with labelled datasets that have been learned at Section 3.1.2, to build the decision models, M = {m 1 , m 2 , m 3 , m 4 , m 5 }, and given a testing observation x i whose status predicted by these in question models as shown in Table 1, then the support anomaly score is computed as follows: Given the threshold τ set to 0.6, where the observation x i is considered as an anomaly, at least three decision models have to assign it as anomalous. Therefore, from this example, the observation x i will be considered anomalous. Is observation x i anomalous?

The Experimental Setup
To provide quantitative results for GATUD, we use seven labelled datasets, and the normalisation is applied to all these datasets to improve the accuracy and efficiency of GATUD.

Datasets
Five datasets are publicly available [32,33] and two are generated by the proposed SCADA testbed called SCADAVT in [25]. The simulated datasets consist of 12,000 objects, each being described by 113 features. Each feature represents one sensor or actuator reading in the water distribution systems. Each dataset has 100 abnormal observations, where abnormal observations are generated by different attacks. The simulated datasets will be denoted as SimData1 and SimData2.
The first real dataset comes from the daily measures of sensors in an urban waste water treatment plant (referred to as DUWWTP), and it consists of 38 data points (attributes) [33]. This dataset consists of approximately 527 observations, while 14 observations are labelled as abnormal. For more quantitative results, we evaluated our approach on four real datasets that are collected from a real wireless sensor network [32]. Each dataset consists of two attributes (e.g., temperature and humidity). Each dataset has more than 4000 observations, and a tiny portion of abnormal observations. For simplicity, we refer to these datasets as: multi-hop outdoor real data, multi-hop indoor real data, single-hop outdoor real data, and single-hop indoor real data as MORD, MIRD, SORD, and SIRD.

Normalisation
To improve the accuracy and efficiency of the proposed approach, the normalisation technique is applied to all testing datasets to scale features by a range 0.0 of 1.0. This will prevent features with a large scale from outweighing features with a small scale. As the actual minimum/maximum of features are already known, and because the identification process is performed in static mode, min-max normalisation technique is used to map the values of features. A given feature A will have values in [0.0, 1.0]. Let us denote by min A and max A the minimum value and maximum value of A respectively. Then, to produce the normalised value of v (v ∈ A) using the min-max normalisation method, which we denote asv, the following formula is used:

Choice of Parameters
As discussed in Section 3.1, four parameters are required in order to learn the most representative labelled datasets.

•
The k-nearest neighbours parameter is the influencing factor for the anomaly scoring technique. However, this value is insensitive and can be heuristically determined based on the assumption that anomalies constitute a tiny portion of the data. Therefore, the value of k-nearest is set to be 1% of the representative dataset, because this value is assumed to discriminate between abnormal observations and normal ones in terms of the density-based distance.

•
There are three parameters used for learning the most-representative datasets: (i) The extent of confidence area of normality λ. (ii) The extent of confidence area of abnormity β. We set the parameters λ and β to 70% and 1% respectively. Even though the assumption was that the normal and abnormal data constitute larger percentages than the ones we have chosen, we want to maintain some distance from the uncertain area. (iii) The number of clusters k required for k-means in the candidate step. The purpose of this parameter is to reduce the number of representative normal observations, not to discover specific clusters. Experimentally, the value of k for several values such as 0.01%, 0.02%, 0.03%, and 0.04% demonstrated similar results, while the larger the value of k, the longer the computation time in the anomaly decision-making model of GATUD. Therefore, the value of this parameter is set to be 0.01% of the representative dataset.

The Candidate Classifiers
As previously discussed, the type and the number of the supervised classifiers that are involved in GATUD are left for the implementer. In this paper, a thorough investigation has been conducted of a number of classifiers. We concluded with the most five efficient classifiers. Two are decision-tree based, best-first decision tree (BFTree) [34] and J48 [35]; another two are rule-based, non-nested generalised exemplars (NNge) [36] and projective adaptive resonance theory (PART) [37]; the fifth is a probabilistic based, naive Bayes [38]. When using classifiers, we kept the default parameters of WEKA data mining software [39].

Results and Discussion
Clearly, GATUD is intended to improve the accuracy of unsupervised anomaly detection systems in general and our proposed SDAD approach [25] in particular. In this evaluation, we demonstrate how GATUD can address the limitations that have been discussed in [25], where a global anomaly threshold is required to work with all datasets that vary in distribution, the number of abnormal observations, and the application domain, when the scoring-based technique is adapted. Furthermore, as mentioned earlier, GATUD can be used as an add-on component to help improve the accuracy of the unsupervised anomaly detection approach. We demonstrate its performance when it is used with k-means algorithm [29] that is considered as one of the most useful and promising techniques that can be adapted to build an unsupervised clustering-based anomaly detection technique [40,41]. The proposed approach is intended to improve the accuracy of unsupervised intrusion detection systems. Therefore, we evaluate the accuracy of the proposed approach on two types of unsupervised modes: scoring-based and clustering-based intrusion detection techniques. In this evaluation, the precision, recall, and F-measure metrics are used to quantitatively measure the performance of the proposed approach, because these metrics are not dependent on the size of the training and testing datasets. The metrics used are defined as follows: where TP is the number of anomalies that are correctly detected, FN is the number of anomalies that have occurred but not detected, and FP is the normal observations that are incorrectly flagged as anomalies. The recall (detection rate) is the proportion of number of anomalies correctly detected to the actual number of anomalies in the testing dataset. The precision metric is used to demonstrate the robustness of the IDS in minimising the false positive rate. However, the system can obtain a high precision score while a number of anomalies are being missed. Similarly, the system can obtain a high recall score, while the false positive rate is higher. Therefore, the F-measure, which is the harmonic mean of precision and recall, would be a more an appropriate metric to demonstrate the accuracy of the proposed approach in this paper and ease the comparison with the other results, because it is the weighted average of the precision and recall rates. Therefore, the F-measure metric takes both false positives and false negatives into account.

Integrating GATUD into SDAD
The separation of the most relevant abnormal observations from normal ones in order to extract proximity detection rules for a given system, is the initial part of the proposed data-driven clustering technique (SDAD) in [25]. However, a cut-off threshold parameter η is required to be given, and in fact this parameter plays a major role in separating the most relevant abnormal observations. The demonstrated results were significant; however, various cut-off thresholds η for a number of datasets have been used, where some datasets work with a small value of η, while others work with large values. Therefore, we evaluate how the integration of GATUD into SDAD can help to find a global and efficient anomaly threshold η that can work with all datasets, regardless of their variant characteristics, such as distribution, the number of abnormal observations, and the application domain, and meanwhile produces significant results.
It is well-known that the larger the value of the cut-off threshold η, the higher the detection rate and the higher the false positive rate as well. This, however, will result in poor performance. The determination of an appropriate cut-off threshold η that maximises and minimises the detection rate and the false positive rate respectively, is the challenging problem. GATUD addresses this problem by allowing the anomaly scoring technique to choose a large value of cut-off thresholds η in order to ensure that the detection rate is higher, while minimising the false positive rate without degrading the detection rate.
In the following, we demonstrate the separation accuracy results with/without the integration of GATUD into SDAD. Then, we demonstrate how this integration has a significant impact on the accuracy of the generation step of proximity detection rules, which is the second phase following the separation process. Tables 2-8 show the separation accuracy results with/without the integration of GATUD. Clearly, as shown in the result tables, the larger the value of the cut-off threshold η, the higher the detection rate of abnormal observations. This is because the observations are sorted by their anomaly scores in ascending order and the consideration of the top large portion of the sorted list increases the chance to obtain the actual abnormal observations. However, this will result in a large number of normal observations existing in this portion, and this definitely increases the false positive rate. On the another hand, it can be seen that the use of GATUD can benefit from the larger value of the cut-off threshold η in order to maximise the detection rate of abnormal observations and the obtained list of the assumed abnormal observations is passed through the decision-making model to rejudge whether each observation is abnormal or normal. This, as can be seen in the result tables, nearly sustains the detection rate, and meanwhile minimises the false positive rate. Table 2 shows that the better accuracy results of separation of abnormal observations without the integration of GATUD are achieved when the cut-off threshold η is set to 2.0%, 2.50%, or 3.0%. Moreover, it can be observed that the setting of η with larger values, such as 4.5% and 5.0%, has not demonstrated any better results, even though the detection rates of the abnormal observations were high. This is because, as demonstrated, the false positive rates were relatively high. On the other hand, when GATUD was integrated, the high false positive rates were significantly reduced, and meanwhile, the detection rates were sustained at significantly acceptable levels. Similarly, the remaining results for each dataset (as shown in Tables 3-8) demonstrated that the integration of GATUD significantly reduced the high false positive rates when larger values of η were used, and meanwhile maintained the detection rates at a satisfactory and significant level. However, the results for the dataset MORD in Table 7 were not significant whether GATUD was integrated or not. We would have expected the integration of GATUD to produce significant results, if the cut-off threshold η was set to a value that is greater than 0.05%. However, this value is assumed as the maximum percentage of abnormal observations in an unlabelled dataset. Therefore, this dataset is considered to be an exceptional case.

The Results of Proximity Detection Rules with/without GATUD
As mentioned previously, the generation process of proximity detection rules comes after and relies on the separation process. Therefore, the robustness of these proximity detection rules is influenced by the accuracy of the separation process, and as shown earlier, the integration of GATUD demonstrated significant results in the separation process even with large cut-off thresholds η. Therefore, the detection accuracy results of the proximity detection rules, which are extracted from the abnormal and normal observations that were separated using such these large cut-off thresholds η, are expected to be significant. Tables 9-15 show the detection accuracy results.
Each table represents the results of the detection accuracy results for each individual dataset, and also they are divided into two parts: the first part shows the results of the proximity-detection rules that were extracted by separating abnormal from normal observations where GATUD was not integrated in the separation process. The second part shows the results obtained after the integration of GATUD. The result tables show that the integration of GATUD into the separation process helps to generate robust proximity-detection rules even with large cut-off thresholds η.
Overall, Table 16 highlights the acceptable thresholds η through which the extracted proximitydetection rules demonstrated significant detection accuracy results, where GATUD was integrated into the separation process of abnormal and normal observations. From this table, the determination of the near-optimal value of a cut-off threshold η will not be problematic because the value of η can be set to 0.05, which is assumed as the maximum percentage of abnormal observations in an unlabelled dataset. The resultant high positive rate that might result from this large value can significantly be reduced by the integration of GATUD. Table 9. The detection accuracy of the proximity-detection rules that have been extracted with/without the integration of GATUD in the separation process on DUWWTP.

Without GATUD
With GATUD    Table 11. The detection accuracy of the proximity-detection rules that have been extracted with/without the integration of GATUD in the separation process on SimData2.  Table 13. The detection accuracy of the proximity-detection rules that have been extracted with/without the integration of GATUD in the separation process on SORD.

Without GATUD
(1) η: top percentage of observations, which are sorted by anomaly scores in ascending order, are assumed as abnormal.
(2) # of agreement: the number of datasets that agree on each separation threshold η, where the agreement is judged by better accuracy results when GATUD was intergraded.

Integrating GATUD into Clustering-Based Technique
Here we show how GATUD can be integrated not only with the scoring-based intrusion detection technique, but also with the clustering-based technique. The k-means algorithm, which is considered as one of the most useful and promising techniques for building an unsupervised clustering-based intrusion detection model [40,41], is chosen to demonstrate the integration effectiveness of GATUD with an unsupervised clustering-based intrusion detection technique. This is because this algorithm already has been adapted by Almalawi et al. [25] to build the unsupervised anomaly detection model to detect abnormal observations, and the results were compared with SDAD [25]. Therefore, it is interesting to demonstrate detection accuracy results with/without the integration of GATUD.
For more details about how this algorithm can be adapted to build an unsupervised anomaly detection model, see [25].
In the adaptation of k-means for building an unsupervised anomaly detection model, anomalies are assumed to be grouped in clusters that contain percentages, θ, of the data. Let C = C 1 , C 2 , . . . C n be the sets of clusters that have been created. Then, the anomalous clusters are defined as follows: The remaining clusters C −Ć are labelled as normal. In this evaluation, we assume the real anomalous cluster is the cluster where the majority of its members are actual abnormal observations. Assume that the number of abnormal observations ≥ |Ć i |/2. Therefore, the labelling accuracy of the assumed percentage θ of the data in anomalous clusters is measured by the Labelling Error Rate (LER) for clusters: When integrating GATUD to label the clusters, the members of each individual cluster pass through the decision-making model to be labelled as either normal or abnormal. Then the cluster is labelled according to the label of the majority of its members. The labelling of clusters by GATUD is given as follows: Class(x j ) (11) where L(C i ) is the number of abnormal observations, which are judged by the decision-making model, in the cluster C i . Then the anomalous clusters are defined as follows: where ε is the percentage of the abnormal observations in a cluster C i to be labelled as abnormal. In this evaluation, it is set to 0.5. We evaluate the integration of GATUD as an add-in component with k-means, where this component is only used to label the produced clusters as either normal or abnormal. The k-means requires two user-specified parameters k and θ to build the unsupervised anomaly detection model from unlabelled data. k is the number of clusters, and θ is the percentage of the data in a cluster to be assumed as malicious. However, the parameter θ is not required when GATUD is integrated. In this evaluation, we demonstrate the detection accuracy of k-means as an independent/dependent algorithm. In the independent use, k-means is used to cluster the training dataset and labels each cluster using an assumption of the percentage of the data in a cluster to be assumed as malicious [40,41]. While in the dependent use, GATUD is used as a labelling technique for the produced clusters by k-means. The parameters k and θ are set to the same values that have been used in paper [25]. This is because the same datasets are used. Tables 17-23 show the detection accuracy results of two parts: The first part (without GATUD) shows the detection accuracy results of k-means. As shown, the the detection accuracy results of five values of θ are demonstrated. For example, in the Table 17, the dataset, which was referred to as DUWWTP, was clustered into 50, 60, 70, 80, 90, and 100 clusters using k-means algorithm; then, when θ was set to 0.01 the generated clusters that constitute an overwhelmingly large portion (≥99%) of the training dataset are labelled as normal clusters, and otherwise labelled as malicious ones. As see in the Table 17, the labelling error rate (LER) when 50 clusters generated is 12.40% which means the average percentage of the clusters that were incorrectly labelled in 10 fold cross validation. This is because their respective labels are not the same as the labels of the majority of their respective members. In the second part (with GATUD), the detection accuracy results of k-means when GATUD were integrated to label the clusters instead of the assumption that assumes normal clusters constitute an overwhelmingly large portion, as shown. We show only the results of F-measure, as they are the interesting results to compare. Clearly, the detection accuracy results of k-means in detecting abnormal observations were very poor for all datasets when GATUD was not integrated. On the other hand, significant results for some datasets are obtained when GATUD is integrated to label the produced clusters. It is obvious from the results that GATUD can be a promising technique to improve the accuracy of an unsupervised anomaly detection approaches, not only with our SDAD approach proposed in [25], but also, it can be integrated with unsupervised clustering-based anomaly detection models.

Conclusions
This paper proposed an innovative approach, called global anomaly threshold to unsupervised detection (GATUD), which is used as an add-on component to improve the accuracy of unsupervised intrusion detection techniques. This has been done by initially learning two labelled small datasets from the unlabelled data, where each dataset represents either normal or abnormal behaviour. Then, a set of supervised classifiers were trained with question datasets to produce an ensemble-based decision-making model that can be integrated into both unsupervised anomaly scoring and clustering-based intrusion detection approaches. In the former, GATUD is used to mitigate the sensitivity of anomaly threshold, while in the latter, it is used to efficiently label the produced clusters as either normal or abnormal. Experiments show that GATUD demonstrates significant and promising results when it was integrated into a clustering-based intrusion detection approach as a labelling technique for the produced clusters.

Conflicts of Interest:
The authors declare no conflict of interest.