Utilising Flow Aggregation to Classify Benign Imitating Attacks

Cyber-attacks continue to grow, both in terms of volume and sophistication. This is aided by an increase in available computational power, expanding attack surfaces, and advancements in the human understanding of how to make attacks undetectable. Unsurprisingly, machine learning is utilised to defend against these attacks. In many applications, the choice of features is more important than the choice of model. A range of studies have, with varying degrees of success, attempted to discriminate between benign traffic and well-known cyber-attacks. The features used in these studies are broadly similar and have demonstrated their effectiveness in situations where cyber-attacks do not imitate benign behaviour. To overcome this barrier, in this manuscript, we introduce new features based on a higher level of abstraction of network traffic. Specifically, we perform flow aggregation by grouping flows with similarities. This additional level of feature abstraction benefits from cumulative information, thus qualifying the models to classify cyber-attacks that mimic benign traffic. The performance of the new features is evaluated using the benchmark CICIDS2017 dataset, and the results demonstrate their validity and effectiveness. This novel proposal will improve the detection accuracy of cyber-attacks and also build towards a new direction of feature extraction for complex ones.


Introduction
Internet traffic pattern analysis is an established and mature discipline. A key application of this research domain is the detection of malicious network intrusions. Intrusion Detection Systems (IDS) research started in the late 1980s. Statistical models were the first to be introduced. Later, signature-based IDSs were proposed [1]. Their detection depended on known patterns (signatures) that were used to distinguish between malicious and benign traffic flows. More recently, the volume and sophistication of these attacks has increased, and that has motivated the use of machine learning (ML) techniques to counteract them [2].
In many machine learning applications, it is well known that the choice of features, i.e., the inputs fed into the model, is more important than the choice of the model [3]. Indeed, Ghaffarian and Shahriari state that features play a vital role in the development of IDS [4]. When Internet traffic is analysed on a flow-by-flow basis, the choice of features is quite self-evident: are particular flags set or reset in the internet packet headers, what is the average size of a packet in a flow, what is the standard deviation of packet sizes, what is the average time between packets in a flow, etc. In simple terms, values are extracted from packet headers, and various statistics are extracted from the packet lengths and inter-arrival times [5]. These features have proved adequate for identifying previous generations of cyber-attacks using a range of ML models; however, they have proven to be inadequate to the latest, most sophisticated attacks.
To address this problem of detecting complex cyber-attacks, specifically the benign imitating ones, we propose, in this paper, an additional level of feature abstraction, named 'Flow Aggregation'. With this approach, we look at flows at a higher level of abstraction by bundling similar flows and extracting features across them. This form of aggregation permits a deeper representation of network traffic, and increases the performance of classification models, particularly for the more sophisticated attacks that attempt to closely resemble benign network traffic and hence evade detection. The proposed features are evaluated using the benchmark CICIDS2017 [6] dataset with a particular focus on the attacks that have proven most difficult to detect using well-adopted features.
The contributions of this paper are as follows: • The introduction of a higher level of abstraction for network traffic analysis by proposing novel features to describe bundles of flows. • Performance improvements in binary classification of cyber-attacks when these novel features are utilised, particularly for attacks that mimic benign network traffic. • Performance improvements in multi-class classification of cyber-attacks when these novel features are utilised. • Performance improvements in zero-day attack detection when these novel features are utilised.
The remainder of the paper is organised as follows; Section 2 outlines the related work and the required background. The methodology is explained in Section 3. Section 4 discusses the proposed feature abstraction and the conduction of experiments alongside the results. Finally, Section 5 concludes the paper.

Background and Related Work
As mentioned, the detection of cyber-attacks is an established research area that has leveraged a range of technologies as it has evolved over the years [7] to cope with the exponential growth of cyber-attacks [8]. A range of ML-based models have been applied to the problem, including support vector machines, artificial neural networks, and k-means clustering [9]. Despite the discriminative power of these models, many cyber-attacks still remain undetected or have low rates of detection.
Older, more well-recognised attacks have been captured in KDD Cup'99 and the NSL-KDD datasets. These can be used to train ML-based models, and in many cases achieve good results. More contemporary cyber-attacks have been recorded in the CICIDS2017 dataset [10]. Much of the research involving this dataset has considered a subset of attacks that are particularly distinctive from benign (i.e., DDoS, Port Scan, and Web attacks), and good results have been achieved. However, some classes are either (a) left undetected, due to their similar behaviour to normal traffic, thus difficult to detect, or (b) when used in detection models, their low detection accuracy is concealed in the overall accuracy due to the class imbalance problem of this dataset [11]. In this paper, we focus on those attacks that have thus far proven elusive for researchers to identify reliably, viz. DoS Slowloris and DoS SlowHTTPTest.
Many studies use performance metrics that do not take into consideration the imbalance between relatively few examples of cyber-attacks and overwhelming examples of benign traffic. For example, an 'always zero' classifier (that always predicts benign traffic) would appear to be 99% accurate if the dataset it was tested on had 99 examples of benign traffic to each example of cyber-attack traffic. Clearly, this has the potential to provide grossly misleading results since it would never detect a single attack. To address this shortcoming, in this paper, we will fully disclose the results for precision, recall, and F1-score for each class independently. Tables 1 and 2 provide a comprehensive list of recent studies in which the CICIDS2017  dataset has been used. The tables present the published papers, the models/techniques  applied, the metrics utilised to assess performance, and the concomitant results. By  observing Table 1, two findings are highlighted. Firstly, some classes (SSH, DDoS, and Port Scan, for example) have received significant attention from researchers, whilst others have been largely ignored due to their poor results and their benign-like behaviour that makes their classification difficult. Secondly, the overall accuracy is much higher than the accuracy for individual classes. For example, in [12], when the authors use 1-layer Deep Neural Network (DNN), the overall multi-class classification accuracy is 96% (Table 1) while the individual classes detection accuracies are 55.9%, 95.9%, 85.4%, and 85.2% for normal, SSH, DDoS, and port scan classes, respectively (Table 1). Similar behaviour was seen when the authors used 5-layers DNN, as demonstrated in Table 1. This indicates the misleading effect of reporting the overall accuracy when dealing with imbalanced datasets. Furthermore, Vinayakumar et al. [12] highlighted in their recent research that by observing the saliency map for the CICIDS2017 dataset, it is shown that "the dataset requires few more additional features to classify the connection record correctly" [12]. The authors' observations highlighted this need for the Denial of Service (DoS) class specifically. As later discussed in Section 4, this concurs with our findings regarding the attack classes that require the additional abstraction level of features to be differentiated from benign traffic and other attacks.

Feature Engineering
In the ML domain, features are used to represent a measurable value, a characteristic or an observed phenomenon [17]. Features are informative: they are usually represented numerically; however, some can be in categorical or string format. When categorical features are used for ML, encoding is used to transform them into a numerical (MLfriendly) format.
Obtaining features can be done by construction, extraction, or selection processes, or a combination of them [18]. Feature construction creates new features by mining existing ones by finding missing relations within features. While extraction works on raw data and/or features and apply mapping functions to extract new ones. Selection works on getting a significant subset of features. This helps reduce the feature space and reduce the computational power. Feature selection can be done through three approaches, as shown in Table 3: filter, wrapper, and embedded. Investigate interaction more thoroughly than Wrapper Result in an optimal subset of variables -Law et al. [22] highlight the importance of feature selection for ML models. The authors discuss its effect on boosting performance and reducing the effect of noisy features, specifically when training using small datasets. Furthermore, non-uniform class distributions should be considered to avoid misleading results when applying a supervised feature selection [23]. Alternatively, when the dataset labels are not available, an unsupervised feature selection is used. Mitra et al. [24] categorise unsupervised feature selection techniques into (a) clustering performance maximisation and (b) feature dependencies and relevance.

Artificial Neural Network
ANNs are inspired by the human biological brain. McCulloch and Pitts [25] proposed the first ANN in 1943. Later in 1986, Rumelhart and McClelland [26] introduced the back propagation concept. ANNs are used to estimate a complex function by learning to generalise using the given input values and the corresponding output values.
An ANN is generally composed of an input layer, zero or more hidden layers, and an output layer. Each layer is composed of one or more neurons. Neurons in layer i are connected to the ones in layer j, j = i + 1. This connection is called weight and is represented as w ij . During the training process, the input values are propagated forward, the error is calculated (based on the expected output), then the error is propagated, and the weights are adjusted accordingly. The weight of a connection implies the significance of the input.
Formally, the output of a single neuron is calculated as shown in Equation (1).
where n represents the number of inputs to this node, x i is the i th input value, w i is the weight value, b is a bias value. Finally, f is the activation function, which squashes the output. Activation function can be, but not limited to, Tanh, Sigmoid, and Rectified Linear Unit (RELU).
The error E is calculated at the final layer using the difference between the expected output and the predicted output (which is, as aforementioned, a result of propagating the input signal). Finally, the weights are updated based on Equation (2) where w t is the old weight and w t+1 is the new weight. η is the learning rate to control the gradient decent steps. The weight of a neuron is directly proportional to the significance of the node's input. This is because the output of any neuron is calculated by multiplying the weights by the input values.
The next section will discuss the proposed features, then Section 4 outlines the experiments where ANNs are used for the classification purpose.

Methodology
Given a raw capture of internet traffic in the form of a raw "pcap" file, two levels of features are traditionally extracted. As illustrated in Figure 1, at the lower level, individual packets are inspected and packet-based features are extracted. These features include flags, packet size, payload data, source and destination address, protocol, etc. At the higher level, flow features (unidirectional and bidirectional) are extracted that consider all the individual packets in a particular communication.
In this paper, a novel additional (third) level of abstraction is proposed where bidirectional flows are grouped into bundles and flow aggregation features are derived. Flow aggregation features aim at representing information about the whole communication between network hosts. These features provide additional traffic characteristics by grouping individual flows. In this setup, it is assumed that legitimate hosts establish secure communication using any of the well-known authentication mechanisms [27,28]. After these aggregated features are calculated, they are propagated back to each bidirectional flow in the bundle/group. This is represented with the superscript + sign in Figure 1. The two proposed flow aggregation features in this manuscript are (a) the number of flows and (b) the source ports delta.
The first feature, 'Number of flows', represents the number of siblings in a flow bundle. Given the communication between a host, A, and one or more hosts, all flows initiated by A are counted. This feature represents the communication flow. The advantage of this feature is that it is significant for attacks that intentionally spread their associated requests over time when targeting a single host. However, when grouped, the bundled flow will have additional information about how many flows are in the same group that can resemble the communication pattern. Moreover, it can represent patterns when an attacker targets many hosts, but each with a few communications, when grouped, a pattern can be identified.

Raw Packets
Packet-based   (Figure 2 top) represents a node/host in the network. Each double arrow represents a bidirectional flow with the notation XY i , such that X is the source node, Y is the destination node, and i is the communication counter. Finally, the colours in Figure 2 represent the grouping of flows into bundles. With reference to Figure 2, the first bundle (blue) has 4 flows, thus AB 1 , AB 2 , AC 1 , and AD 1 will have the 'number of flows' feature set to 4. Similarly, the second bundle (green), BC 1 and BC 2 will have the value 2 and so on.
The second feature 'source ports delta' demonstrates the ports delta. This feature is calculated using all the port numbers used in a bundle communication flow. Algorithm 1 illustrates the feature calculation. The advantage of this feature is to capture the level and variation pattern of used ports in legitimate traffic. This feature adds this piece of information to each flow, which then enhances the learning and classification as further discussed in Section 4. To validate the significance of the newly added features proposed in this manuscript, a feature selection algorithm is used. Recursive Feature Elimination (RFE) [29] is used to select the best k features to use for classification. Over the various experiments discussed in Section 4, RFE demonstrates that the two features are important for identifying classes that mimic benign behaviour. This emphasises the case in which the additional level of feature abstraction is needed. When attackers attempt to mimic benign behaviour, the attacks are in-distinctive using flow features solely. Therefore, new features are needed.

Experiments and Results
In this section, the conducted experiments are explained and the results are discussed. For a dataset to be appropriate to evaluate the proposed features, it has to (a) include both benign and cyber-attack traffic, (b) cover benign mimicking cyber-attacks, and (c) involve multiple attackers to guarantee the aggregation logic. Based on these criteria, the CICIDS2017 dataset [6] is selected. The dataset contains a wide range of insider and outsider attacks alongside benign activity. Sarker et al. [8] highlight in their survey that the CICIDS2017 dataset is suitable for evaluating ML-based IDS including zero-day attacks [8]. As aforementioned, the focus of this manuscript is on attacks that mimic benign behaviour.
Therefore, the attacks of interest from the CICIDS2017 dataset are (1) DoS Slowloris and (2) DoS SlowHTTPTest. These two attacks implement low-bandwidth DoS attacks in the application layer. This is done by draining concurrent connections pool [30]. Since these two attacks are performed slowly, they are typically hard to detect. Alongside the DoS SlowHTTPTest and Slowloris, two other attacks are of interest for comparative purposes: (3) PortScan and (4) DoS Hulk. These two attacks resemble the case where attacks are easy to discriminate from benign traffic.
First, each of these four attacks and benign pcap files is processed. The output of this process is bidirectional flow features and aggregation features. Second, RFE is used to select the best k = 5 features. Third, the selected features are used as input to an ANN that acts as the classifier. The ANN architecture is composed of 5 input neurons, 1 hidden layer with 3 neurons, and an output layer. It is important to mention that since the focus is on the evaluation of the additional level of features and not the classifier complexity, the number of chosen features is small, and a straightforward ANN is used.
Three experiments are performed. The first experiment is a binary classification of each of the attacks of interest (Section 4.2). The second experiment is a 3-class classification (Section 4.3). This experiment evaluates the classification of benign, a benign-mimicking attack, and a distinctive attack (i.e., do not mimic benign behaviour).

Evaluation Metrics
In this section, the used evaluation metrics are discussed. The evaluation is performed in a 10-fold cross-validation manner. Recall, precision, and F1-score are reported for each experiment. Their formulas are shown in Equations (3)

Binary Classification Results
The first experiment is a binary classifier. Each of the attacks of interest is classified against benign behaviour. Tables 4 and 5 show the precision, recall, and F1-score for DoS Slowloris and DoS SlowHTTPTest, respectively. Moreover, the table lists the five features picked by RFE. By observing the recall of the attack class with the flow aggregation features included, it can be seen that the recall rose from 83.69% to 91.31% for the Slowloris attack class and from 65.94% to 70.03% for the SlowHTTPTest attack class. Unlike benign mimicking attacks, aggregation features did not provide benefit when classifying attacks that do not mimic benign traffic such as PortScan and DoS Hulk, as shown in Table 6 and  Table 7. Precision and recall were high (99%) for both of these attacks without utilising the aggregation flow features. This is coherent with the discussion in Section 2.
Finally, Figure 3 additionally visualises the effect flow aggregation features have on the recall of attack classes. It is observed that the two attack classes that mimic benign behaviour (DoS, SlowHTTPTest and Slowloris) had an increase in recall. However, for the other classes (PortScan and DoS Hulk) the aggregation did not impact the classification performance. This is due to the strength of bidirectional flow features to discriminate classes.

Three-Classes Classification Results
In the second experiment, a three-class classification is performed. Benign and PortScan classes (which act as the discriminative class) are used with each of the other attack classes. Similarly, experimental results demonstrate the high recall of both benign and PortScan with and without the use of flow aggregation features and the rise in the other attack recall when flow aggregation features are used.
By observing Table 8, the recall of DoS Slowloris class increased from 78.25% to 99.09% when flow aggregation was used. Similarly, the DoS SlowHTTPTest rose from 0% to 58.97% in Table 9. Finally, DoS Hulk showed similar behaviour to the binary classification as shown in Table 10. Figure 4 visualises the effect flow aggregation features have on the recall of attack classes in a three-class classification problem. The recall for DoS SlowHTTPTest without using the flow aggregation was 0%.

Five-Classes Classification Results
The third experiment combines all the classes of interest in a five-class classification problem. Similarly, the experiment is performed twice-with and without the use of flow aggregation features. Table 11 summarises the performance per class for this experiment. A few observations are as follows; (a) the recall of DoS Slowloris rose from 1.40% to 67.81%. (b) The recall of DoS SlowHTTPTest rose from 0% to 4.64% only. This is not because the new features were not significant, but because the model classified DoS SlowHTTPTest as DoS Slowloris. In this case, flow aggregation features serve to discriminate benign-mimicking attacks from benign traffic but not to discriminate them from each other. Without the aggregation features, 82.84% of DoS SlowHTTPTest attack instances were classified as benign; however, this dropped to 58% when the flow aggregation features were used.
To overcome this low recall, extra features were chosen. Five features were added and the hidden layer neurons were 8. The results of this classification experiment are summarised in Table 12. The rise in the recall for the attack classes with and without flow aggregation features was as follows; from 33.94% to 80.39% and 21.45% to 64.91% for DoS Slowloris and DoS SlowHTTPTest, respectively.
This behaviour is recognised in Figure 5. It is important to mention that there was a rise in the recall of all classes. This rise was more significant for the attack classes that mimic benign behaviour than others.

Zero-Day Attack Detection Revaluation
The authors' previous work in [31] proposed an autoencoder model to detect zeroday attacks. The model relies on the encoding-decoding capabilities of autoencoders to flag unknown (zero-day) attacks. The model performance was evaluated using all the CICIDS2017 dataset attack classes using three threshold values (0.15, 0.1, and 0.05). The published results demonstrate the ability of the autoencoder to effectively detect zero-day attacks; however, the attacks that mimic benign behaviour experienced very low detection rates. In this section, the published model [31] was re-evaluated using the proposed higher level of feature abstraction. The aim is to assess the impact of the new features on zero-day attack detection, specifically benign mimicking ones, whose detection rate is low. Table 13 lists the zero-day detection accuracies when flow aggregation features are used alongside the bidirectional flow ones, which were used solely in the previously published work. The results show a high detection rate of all attacks, including the attacks that were detected with low accuracy without the flow aggregation features in [31]. To visualise the impact of flow aggregation features on zero-day attack detection, Figure 6 shows the effectiveness of flow aggregation features by contrasting the results that are discussed in this section versus the ones in [31]. It is observed that all attacks experienced a rise in detection accuracy when the flow aggregation features were used.
In summary, flow aggregation features prove their effectiveness in providing a deeper representation of a network traffic, thus improving classification performance. This is reflected in the evaluation of both zero-day attacks detection and multi-class attack classifier. The proposed approach has the following three limitations. (a) The unavailability of traffic flow data affects the flow aggregation features computation. (b) More features can be derived from the aggregated flows that can further improve the classification accuracy. Lastly, (c) the proposed features are evaluated using the CICIDS2017, real-time evaluation will provide additional insights.

Conclusions
Traditional traffic features have proven powerful when combined with sufficient training examples to train ML-based classifiers, and the trained models are capable of classifying some cyber-attacks. However, some cyber-attacks are left undetected. To resolve this limitation, there are two alternatives: (a) gather huge amounts of data to be able to build even more complex models, which is difficult and impractical in some cases, and (b) represent the data using more powerful features. In this paper we appoint the second approach.
This paper presents an additional abstraction level of network flow features. The objective is to improve the cyber-attack classification performance for attacks that mimic benign behaviour. Cyber-attacks are becoming more complex, and attackers utilise the available knowledge to tailor attacks that can bypass detection tools by acting like benign traffic.
The idea is to aggregate bidirectional flows to bundles and compute bundle-specific features. Once the features are computed, the values are populated back to the bidirectional flows. The advantage of these additional features is that the bidirectional flows have some additional knowledge/information about their sibling flows.
The proposal is evaluated using the CICIDS2017 dataset. A group of attacks are used to assess the significance of the new features as well as the performance gain. Four cyber-attack classes are used beside benign class: DoS Slowloris, DoS SlowHTTPTest, DoS Hulk, and PortScan. ANN is used as the classifier. Three experiments are conducted: binary, three-class, and five-class classification. The experiments confirm the need for this additional level of features. The results further demonstrate the significance of the added features for classes that are hard to discriminate from benign, such as DoS Slowloris and DoS SlowHTTPTest. The recall of the cyber-attack classes experiences a high rise when the additional features are used. For example, it is observed that the recall of the DoS Slowloris class rises from 83.97% with the bi-directional features to 91.31 in binary classification, from 78.28% to 99.09% for 3-class classification and from 33.94% to 80.39% for 5-class classification.
Furthermore, the additional features prove significant when reassessing the authors' previously published work on detecting zero-day attacks. Benign-mimicking attacks suffered low detection accuracy; however, with the use of flow aggregation features, zeroday attack detection experience a high rise in accuracy.
Future work involves evaluating the significance of the flow aggregation against other operations alongside extracting other high-level features based on needs.

Data Availability Statement:
The 'CICIDS2017' dataset supporting the conclusions of this article is available in the Canadian Institute for Cybersecurity (CIC) repository, http://www.unb.ca/ cic/datasets/ids-2017.html (accessed on: 7 October 2019). The code will be available through a GitHub repository.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: