AppCon: Mitigating Evasion Attacks to ML Cyber Detectors

Abstract: Adversarial attacks represent a critical issue that prevents the reliable integration of machine learning methods into cyber defense systems. Past work has shown that even proficient detectors are highly affected by small perturbations to malicious samples, and that existing countermeasures are immature. We address this problem by presenting AppCon, an original approach to harden intrusion detectors against adversarial evasion attacks. Our proposal leverages the integration of ensemble learning into realistic network environments, by combining layers of detectors devoted to monitoring the behavior of the applications employed by the organization. Our proposal is validated through extensive experiments performed in heterogeneous network settings simulating botnet detection scenarios, and considers detectors based on distinct machine- and deep-learning algorithms. The results demonstrate the effectiveness of AppCon in mitigating the dangerous threat of adversarial attacks in over 75% of the considered evasion attempts, while not being affected by the limitations of existing countermeasures, such as performance degradation in non-adversarial settings. For these reasons, our proposal represents a valuable contribution to the development of more secure cyber


Introduction
Adversarial attacks represent a dangerous menace for real world implementations of machine learning (ML) algorithms [1][2][3][4]. This threat involves the production of specific samples that induce the machine learning model to generate an output that is beneficial to an attacker. Literature has identified two categories of adversarial attacks [2]: those occurring at training-time (also known as poisoning attacks [5]), and those occurring at test-time (often referred to as evasion attacks [6]). Our paper focuses on this latter category due to its relevance for cyber detection scenarios.
The topic of adversarial machine learning has been thoroughly studied by computer vision literature [7][8][9]. However, surprisingly, proper analyses and efficient solutions to this menace are scarce in the cybersecurity domain. The field of network intrusion detection is poorly investigated [10,11], while multiple works exist in the areas of malware, phishing, and spam detection [6,12,13,14,15,16,17]. In particular, although several studies have shown the effectiveness of adversarial evasion attacks against botnet detectors [10,11,18,19], there is a lack of proposals to counter this menace that are feasible for real-world environments. Some defensive strategies have been proposed and evaluated by existing literature, but they are affected by critical limitations, such as reduced performance in non-adversarial scenarios [19,20], or high maintenance and deployment costs [2,21].
In this paper, we propose AppCon, an original approach that is focused on mitigating the impact of adversarial evasion attacks against ML-based network intrusion detection systems (NIDS), while preserving detection performance in the absence of adversarial attacks. Furthermore, our proposal is specifically addressed to real-world environments, thus favoring its integration into existing defensive schemes. Our solution is based on the idea of restricting the range of samples that an adversary may create to evade the detector, and also leverages the adoption of ensemble models that are shown to produce more robust classifiers [22]. As a practical implementation, we integrate our solution in botnet detectors, due to the high rate of botnet activities in modern organizations [23]; these detectors are based on different supervised machine- and deep-learning algorithms. An extensive experimental campaign conducted on a large and public dataset of millions of labeled network flows [24] is used to evaluate AppCon. The results confirm the efficacy of our solution, which is able to decrease the rate of successful evasion attempts against state-of-the-art botnet detectors by nearly 50%, while retaining similar performance in the absence of adversarial attacks. This symmetric benefit, paired with its simple integration into existing defensive schemes, highlights that the proposed approach represents a valid contribution towards the development of more secure cyber detectors.
The remainder of this paper is structured as follows. Section 2 compares this paper against related work. Section 3 presents the proposed countermeasure and describes the evaluation methodology. Section 4 discusses the experimental results. Section 5 concludes the paper with final remarks and possible extensions for future work.

Related Work
We first provide some background on the machine learning methods employed in this work and on adversarial attacks. Then, we compare our proposal against related work on countermeasures against evasion attacks.

Machine Learning for Cyber Detection
Machine learning methods are becoming increasingly popular in several domains, such as image and speech processing, social media marketing, healthcare, and also in cybersecurity [25][26][27]. These techniques can be separated into two main categories: supervised algorithms must undergo a training phase with a proper labeled dataset, where each sample is associated with a specific label (or class); on the other hand, unsupervised algorithms do not require a labeled dataset. These characteristics make unsupervised methods more suitable for data clustering and rule mining, whereas supervised techniques can be adopted for actual classification tasks [25,26]. In the context of cybersecurity, supervised algorithms can be readily employed as cyber detectors, where the (trained) model is used to determine whether a sample is benign or malicious, thus resembling a binary classification problem [28].
Among the dozens of existing supervised algorithms, this paper focuses on those classifiers that have been found to be particularly effective for scenarios of Network Intrusion Detection [1,3,28]: Decision Tree (DT), Random Forest (RF), AdaBoost (AB), and Multi-Layer Perceptron (MLP); we also consider a fifth method based on deep learning and proposed by Google, Wide and Deep (WnD). A brief description of each of these methods is provided below, alongside some notable examples in cybersecurity:
• Decision Tree: these algorithms are conditional classifiers composed of several nodes. The tree is inspected from top to bottom, where a given condition is checked at each node by analyzing the features of the input sample, leading to the following node [1,3,27,28].
• Random Forest: these are ensemble methods consisting of several Decision Trees, in which the output is computed after evaluating the prediction of each individual tree composing the "forest" [1,3,27,28,29].
• AdaBoost: similar to Random Forests, these algorithms are able to improve their final performance by putting more emphasis on the "errors" committed during their training phase [30].
• Multi-Layer Perceptron: also known as "neural networks", these models are based on sets of processing units (the neurons) that are organized in multiple layers and communicate with each other. The output is provided in the last layer of the architecture [1,3,27,28].
• Wide and Deep: this technique is a combination of a linear "wide" model and a "deep" neural network. The idea is to jointly train these two models and foster the effectiveness of both.
To the best of our knowledge, WnD has not been tested for Cyber Detection yet, but its promising results in other fields motivate our decision to include this deep learning technique into our experiments [31].
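To make the first two descriptions concrete, the following minimal sketch walks a hand-written Decision Tree from root to leaf and combines several trees as a forest. It is illustrative only: the feature names, thresholds, and the use of a simple majority vote as the aggregation rule are assumptions, not details of the detectors evaluated in this paper.

```python
from collections import Counter

# A hand-written decision tree: each internal node checks one feature
# against a threshold; leaves carry the predicted class.
# Feature names and thresholds are hypothetical, chosen only for illustration.
TREE = {
    "feature": "duration", "threshold": 5.0,
    "left": {"label": "benign"},
    "right": {
        "feature": "src_bytes", "threshold": 1000,
        "left": {"label": "malicious"},
        "right": {"label": "benign"},
    },
}

def predict_tree(node, sample):
    """Walk the tree from the root to a leaf and return the leaf label."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

def predict_forest(trees, sample):
    """Assumed aggregation rule: majority vote over the individual trees."""
    votes = Counter(predict_tree(tree, sample) for tree in trees)
    return votes.most_common(1)[0][0]

sample = {"duration": 10.0, "src_bytes": 500}
single = predict_tree(TREE, sample)
forest = predict_forest([TREE, TREE, {"label": "benign"}], sample)
```

In practice such trees are learned from labeled data rather than written by hand; the sketch only shows how a prediction is read off once the structure exists.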

Adversarial Attacks
Security operators often adopt machine learning techniques [3,32], which allow for detecting anomalies and may even reveal novel attacks that are not easily recognizable through traditional signature-based approaches [33,34], thus increasing protection against advanced threats. However, the integration of these methods into cyber defence platforms still presents several issues [28]: among these, one of the most critical problems is that of adversarial attacks.
Adversarial attacks leverage the generation of specific samples that induce a machine learning model to generate an output that is favorable to the attacker, who exploits the high sensitivity of machine learning models to their internal properties [3,35,36]. Early examples of adversarial attacks are the ones proposed in [37,38]: these papers studied the problem in the context of spam filtering, where linear classifiers could be easily tricked by a few carefully crafted changes in the text of spam emails without significantly affecting the readability of the spam message. Another typical example of adversarial attack is the one proposed in [39], which targets neural network classifiers. Here, the authors apply imperceptible perturbations to test images, which are then submitted to the classifiers: the results highlight that it is possible to change the algorithm's predictions arbitrarily.
Adversarial perturbations affect all integrations of machine learning, but in the cyber security sphere the problem is further aggravated by several intrinsic characteristics of this domain. Among these, we cite: the constantly evolving arms race between attackers and defenders; and the continuous modifications that affect the system and network behavior of an organization [28]. These unavoidable and unpredictable changes are known as "concept drift" [40], which is often responsible for decreasing the performance of anomaly detection models. Possible mitigations involve periodic re-calibrations and adjustment processes that can identify behavioral modifications and recent related threats. However, performing such operations is a costly and challenging task in itself [28], and it also facilitates the execution of adversarial attacks [41].
The authors of [4] propose a taxonomy of adversarial attacks that has now become widely accepted by the scientific community [2]. It considers the following properties:
• Influence, which denotes whether an attack is performed at training-time or test-time.
  - Training-time: it is possible to thwart the algorithm by manipulating the training-set before its training phase, for example by the insertion or removal of critical samples (also known as data poisoning).
  - Test-time: here, the model has been deployed and the goal is subverting its behavior during its normal operation.
• Violation, which identifies the security violation, targeting the availability or integrity of the system.
  - Integrity: these attacks have the goal of increasing the model's rate of false negatives. In cybersecurity, this involves having malicious samples being classified as benign, and these attempts are known as evasion attacks.
  - Availability: the aim is to generate excessive amounts of false alarms that prevent or limit the use of the target model.
This work focuses on adversarial attacks performed at test-time that generate integrity violations, that is, evasion attacks.

Existing Defenses
Despite the proven effectiveness of evasion attacks against cyber detectors [1,10,11,18,42,43], proposals involving countermeasures that are suitable for realistic environments are scarce. The authors of [44] propose an approach exclusive to neural networks that hardens these classifiers against evasion attacks, but detectors based on these algorithms underperform in cyber detection scenarios [3,10,19,21,28,45]; furthermore, the authors of [46] showed that the method in [44] can be thwarted by skilled adversaries, while the results in [20] show an increased rate of false positives in the baseline performance of the "hardened" classifier. Other methods to defend against evasion attacks may involve adversarial training [21,47,48]; the problem is that such strategies require (continuously) enriching the training dataset with (a sufficient amount of) samples representing all the possible variations of attacks that may be conceived, which is an unfeasible task for real organizations. A known line of defense leverages the adoption of an altered feature set [2,49], but these approaches are likely to cause significant performance drops when the modifications involve features that are highly important for the decision-making process of the considered model [11,21]. Solutions based on game theory are difficult to deploy [50] and evaluate in practice [2,51], and their true efficacy in real contexts is yet to be proven [52]. Finally, defenses conforming to the security-by-obscurity principle [2,19] are not reliable by definition, as the attacker only needs to learn the underlying mechanism to thwart them [53].
Within this landscape, we propose a countermeasure that (i) can be applied to any supervised ML algorithm, (ii) does not cause a performance drop in non-adversarial scenarios, (iii) can be easily integrated into existing defensive systems, and (iv) does not rely on security-by-obscurity.

Materials and Methods
This section presents the considered threat model and the proposed solution, AppCon; then, it describes the experimental settings adopted for the evaluation.

Threat Model
To propose a valid countermeasure against adversarial attacks, it is necessary to define a threat model that states the characteristics of both the target system and the considered attacker. Our solution assumes a threat model similar to the one in [11], which we briefly summarize.
The defensive model, represented in Figure 1, consists of a large enterprise environment, whose network traffic is inspected by a flow-based [54] NIDS that leverages machine learning classifiers to identify botnet activities.
On the offensive side, an attacker has already established a foothold in the internal network by compromising some hosts and deploying botnet malware communicating with a Command and Control (CnC) infrastructure. The adversary is described by following the notation [2,4] of modeling its goal, knowledge, capabilities, and strategy.

• Goal: the main goal of the attacker is to evade detection so that he can maintain his access to the network, compromise more machines, or exfiltrate data.
• Knowledge: the attacker knows that network communications are monitored by an ML-based NIDS. However, he does not have any information on the model integrated in the detector, but he (rightly) assumes that this model is trained over a dataset containing samples generated by the same or a similar malware variant deployed on the infected machines. Additionally, since he knows that the data-type used by the detector is related to network traffic, he knows some of the basic features adopted by the machine learning model.
• Capabilities: we assume that the attacker can issue commands to the bot through the CnC infrastructure; however, he cannot interact with the detector.
• Strategy: the strategy to avoid detection is through a targeted exploratory integrity attack [6] performed by inserting tiny modifications in the communications between the bot and its CnC server (e.g., [55]).

We thus consider a "gray-box" threat model, where the adversary has (very) limited knowledge of the defenses. Our assumption portrays a realistic scenario of modern attacks [56]: machines that do not present administrative privileges are more vulnerable to botnet malware, and skilled attackers can easily assume control of them. On the other hand, the NIDS is usually one of the most protected devices in an enterprise network, and can be accessed only through a few selected secured hosts.

Proposed Countermeasure
We now describe AppCon, short for "application constraints", which is the main proposal of this paper. We recall that the attacker plans to evade detection by modifying the network communications between the bots and CnC server, without changing the logic of the adopted malware variant. Our solution is based on the idea of restricting the freedom with which an attacker can create malicious adversarial samples that evade detection. The intuition is that an attacker who is subject to additional constraints when devising his samples is less likely to evade detection. Since our threat model assumes that the attacker cannot change the underlying functionality of the botnet, he is already limited in the amount of alterations that he can perform; however, past literature has shown that even small modifications (like extending the duration of network communications by a few seconds, or adding a few bytes of junk data to the transmitted packets) can lead to successful evasion [10,11,55]. We aim to further restrict the attacker's range of possibilities by having the detection mechanism focus only on a (set of) specific web-application(s). This approach allows the detector to monitor only the network traffic with characteristics that are similar to those of the considered applications.
To apply our solution to existing detectors based on supervised ML algorithms, AppCon leverages the paradigm of ensemble learning, which has already been adopted to devise countermeasures against adversarial attacks [22]. In particular, the idea is to transform the "initial" detector into an ensemble of detectors, each devoted to a specific web-application. Formally, let $D$ be the considered detector; let $A$ be the set of web-applications employed by an enterprise, and let $a \in A$ be a specific web-application within $A$. Then, we split $D$ into $|A|$ sub-instances (where $|\cdot|$ is the cardinality operator), each devoted to a specific web-application $a$ and denoted with $D_a$. The union of all the sub-instances then represents our "hardened" detector, denoted as $D_A$ (thus, $\bigcup_{a \in A} D_a = D_A$). Therefore, $D_A$ only accepts as input those samples that conform to the network communications generated by at least one web-application $a$. Conversely, those network flows that do not fall within the accepted ranges are either blocked (e.g., through a firewall) or analyzed by other defensive mechanisms (which are out of the scope of this paper). We illustrate the entire workflow of AppCon in Figure 2: the generated network flows are first checked to determine their compatibility with the flows of the accepted web-applications $A$, and those that are compliant (with at least one) are then forwarded to the ensemble of detectors (represented by $[D_{a_1}, D_{a_2}, \ldots, D_{a_n}]$) composing $D_A$.

Let us provide an example to facilitate the comprehension of our proposal. Assume an organization adopting a NIDS that uses some machine learning to detect malicious network flows, and assume that this organization employs the web-applications $a_1$ and $a_2$: the network flows generated by $a_1$ have durations that vary between 1 and 5 s, whereas those generated by $a_2$ vary between 10 and 30 s. Let us assume an attacker that has infiltrated the organization network and controls some machines through a botnet variant whose communications with the CnC server generate flows lasting 3 s. In such a scenario, if the attacker plans to evade detection by increasing the length of network communications through small latencies, he will only be able to apply increments of either +[0, 2] s (to remain within the $a_1$ range of 1 to 5 s) or +[7, 27] s (to fall within the $a_2$ range of 10 to 30 s), thus considerably limiting his options.
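The arithmetic of this example can be recomputed with a short sketch. The function name and data layout are hypothetical; the only inputs taken from the text are the 3 s flow and the two duration ranges. The sketch also assumes, in line with the threat model, that the attacker can only lengthen a flow.

```python
def admissible_increments(flow_duration, app_ranges):
    """For each application, return the (min, max) increment that places the
    flow duration inside that application's accepted range, or None if no
    non-negative increment can do so."""
    result = {}
    for app, (low, high) in app_ranges.items():
        if flow_duration > high:
            # Assumption: increments can only lengthen the flow.
            result[app] = None
        else:
            result[app] = (max(0, low - flow_duration), high - flow_duration)
    return result

# Duration ranges (in seconds) and the 3 s botnet flow from the example.
APP_RANGES = {"a1": (1, 5), "a2": (10, 30)}
increments = admissible_increments(3, APP_RANGES)
print(increments)
```

Any increment outside the returned intervals yields a flow that no application-specific sub-instance accepts, so it would be blocked or routed to other defensive mechanisms.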
We highlight that the proposed method is suited to modern enterprise networks that generate network data through a finite set of web-applications: in such a scenario, a potential attacker cannot apply his perturbations arbitrarily because, if an adversarial sample does not conform to the traffic generated by the considered applications, then it will automatically trigger other defensive mechanisms or be completely blocked.
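The workflow described in this section (compatibility check first, application-specific detection second) could be organized as in the sketch below. It is not the implementation evaluated in this paper: the class and function names are hypothetical, the classifiers are stand-in callables, and the rule that a flow is flagged when any compliant sub-instance flags it is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Flow = Dict[str, float]

@dataclass
class AppDetector:
    """One application-specific sub-instance: accepted ranges plus a classifier."""
    ranges: Dict[str, Tuple[float, float]]   # feature -> (min, max)
    classify: Callable[[Flow], bool]         # True means "malicious"

    def accepts(self, flow: Flow) -> bool:
        """Check that every constrained feature lies inside its range."""
        return all(lo <= flow[f] <= hi for f, (lo, hi) in self.ranges.items())

def hardened_detect(flow: Flow, detectors: Dict[str, AppDetector]) -> str:
    """Return 'malicious', 'benign', or 'out_of_scope' for a single flow."""
    compliant = [d for d in detectors.values() if d.accepts(flow)]
    if not compliant:
        # Non-compliant flows are blocked or handed to other defences.
        return "out_of_scope"
    # Assumption: flag the flow if any compliant sub-instance flags it.
    return "malicious" if any(d.classify(flow) for d in compliant) else "benign"

# Toy configuration: ranges from the duration example, placeholder classifiers.
detectors = {
    "a1": AppDetector({"duration": (1, 5)}, lambda f: f["src_bytes"] > 1000),
    "a2": AppDetector({"duration": (10, 30)}, lambda f: f["src_bytes"] > 5000),
}
```

With this layout, a perturbed flow lasting 7 s never reaches a classifier at all, which is the constraint the approach relies on.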

Experimental Settings
We present the dataset adopted for our experiments; the development procedures of the detectors; the formulation of appropriate adversarial attacks; the definition of the application constraints that represent the core of our proposal; and the performance metrics to evaluate the cyber detectors.

Dataset
Our experimental campaign is based on the CTU-13 dataset [24], which consists of a large collection of labeled network traffic data, in the format of network flows, containing both benign and malicious samples belonging to seven different botnet families. Overall, this dataset contains over 15M network flows generated in a network of hundreds of hosts over multiple days. These important characteristics make the CTU-13 a valid representation of a realistic and modern enterprise network, and many studies have adopted it for their experiments [57,58]. To facilitate the understanding of our testbed, we now describe the main characteristics of the CTU-13 dataset.
The CTU-13 includes network data captured at the Czech Technical University in Prague, and contains labeled network traffic generated by various botnet variants and mixed with normal and background traffic. It contains 13 distinct data collections (called scenarios) of different botnet activity.
The network traffic of each scenario is contained in a specific packet-capture (PCAP) file, which has been converted by the authors into network flows [54]. A network flow (or netflow) is essentially a sequence of records, each one summarizing a connection between two endpoints (that is, IP addresses). A typical representation of netflow data is given in Table 1. The inspection of network flows allows administrators to easily pinpoint important information about the communication between two endpoints (e.g., source and destination of traffic, the class of service, and the size of transmitted data). Network flows provide several advantages over traditional full packet captures, such as: reduced amount of required storage space; faster computation; and reduced privacy issues due to the lack of content payloads [59]. To convert the raw network packets into network flows, the authors of the CTU-13 rely on Argus [60], a network audit system with a client-server architecture: the processing of the packets is performed by the server, and the output of this computation is a detailed status report of all the netflows, which is provided to the final clients. After inspecting the CTU-13, we can safely assume that the client used by the authors to generate the netflows from each individual PCAP file is ra, which has been invoked with the following command:

Command 1: CTU-13 netflow generation through Argus.
ra -L 0 -c , -s saddr daddr sport dport stime ltime flgs dur proto stos dtos pkts sbytes dbytes dir state -r inputFile > outputFile.csv

where the -L option prints headers once, -c specifies the field separator, -s chooses the fields to extract, and -r specifies the file to read the data from. The output is redirected to a CSV file. After this conversion process, the authors proceeded to manually label each individual network flow. Indeed, the CTU-13 is made available as a collection of 13 CSV files (one for each scenario) presenting the fields specified in Command 1 alongside the added "Label" field, which separates legitimate from illegitimate flows. In particular, benign flows correspond to the normal and background labels, whereas the botnet and CnC-channel labels denote malicious samples. Table 2 shows the meaningful metrics of each scenario in the CTU-13. This table also shows the botnet-specific piece of malware used to create the capture, alongside the number of infected machines. This table highlights the massive amount of included data, which can easily represent the network behavior of a medium-to-large real organization. Nevertheless, we remark that, in our evaluation, the Sogou botnet is not considered because of the limited amount of its malicious samples.

The detectors are based on the following algorithms [3,11,28,45,61,62,63]: Random Forest (RF), Multi-Layer Perceptron (MLP), Decision Tree (DT), AdaBoost (AB), alongside the recent "Wide and Deep" (WnD) technique proposed by Google [31]. Each detector presents multiple instances, each focused on identifying a specific botnet variant within the adopted dataset; in our case, each detector has six instances (we do not consider the Sogou malware variant due to its low number of available samples). This design idea is motivated by the fact that ML detectors show superior performance when they are used as ensembles in which each instance addresses a specific problem, instead of as "catch-all" solutions [2,18,62,63].
Each model is trained through sets of features adopted by related work on flow-based classifiers [11,64,65], reported in Table 3. To compose the training and test datasets for each instance, we rely on the common 80:20 split (in terms of overall malicious samples); benign samples are randomly chosen to compose sets with a benign-to-malicious samples ratio of 90:10. Each classifier is fine-tuned through multiple grid-search operations and validated through 3-fold cross-validation.
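The composition of the training and test sets can be sketched as follows. This is one possible reading of the procedure, under stated assumptions: the function name is hypothetical, flows are treated as opaque items, and benign flows are sampled without replacement so that the training and test sets do not share benign samples.

```python
import random

def build_sets(malicious, benign_pool, seed=0):
    """Split malicious flows 80:20 and pad each part with benign flows so
    that benign:malicious is 90:10 (nine benign flows per malicious one)."""
    rng = random.Random(seed)
    malicious = list(malicious)
    rng.shuffle(malicious)
    cut = int(0.8 * len(malicious))
    mal_train, mal_test = malicious[:cut], malicious[cut:]

    needed = 9 * len(malicious)
    if needed > len(benign_pool):
        raise ValueError("benign pool too small for a 90:10 ratio")
    # Assumption: sample without replacement, so the two sets stay disjoint.
    benign = rng.sample(benign_pool, needed)
    split = 9 * len(mal_train)
    ben_train, ben_test = benign[:split], benign[split:]

    # Label 1 marks malicious flows, label 0 marks benign flows.
    train = [(x, 1) for x in mal_train] + [(x, 0) for x in ben_train]
    test = [(x, 1) for x in mal_test] + [(x, 0) for x in ben_test]
    return train, test

# Toy data: integers stand in for flow records.
train, test = build_sets(list(range(100)), list(range(1000, 3000)))
```

With 100 malicious flows, this produces 800 training and 200 test samples, each containing 10% malicious flows.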
The entire procedure followed to prepare the data used to train each detector is displayed in Figure 3. First, the CTU-13 dataset is pre-processed to compute the derived features (such as the bytes_per_packet). Next, we merge the collections pertaining to the same botnet family, and finally create the pool of benign flows by extracting all the non-botnet traffic from each collection and including them in a dedicated collection. At the end of these operations, we thus obtain eight datasets: seven containing malicious flows of each botnet family (we do not consider the Sogou malware in the evaluation), denoted as $X_m$, and one containing only legitimate flows, denoted as $X_l$.
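The pre-processing step that computes derived features might look like the sketch below. The input keys reuse the field names of Command 1 (dur, pkts, sbytes, dbytes); the exact definitions of the derived features are assumptions, since only the names bytes_per_packet and packets_per_second appear in the text, and bytes_per_second is added purely for illustration.

```python
def add_derived_features(flow):
    """Return a copy of a flow record with derived features added.
    The definitions below are assumptions made for illustration."""
    out = dict(flow)
    tot_bytes = flow["sbytes"] + flow["dbytes"]
    pkts = flow["pkts"]
    dur = flow["dur"]
    out["tot_bytes"] = tot_bytes
    # Guard against empty or zero-duration flows to avoid division by zero.
    out["bytes_per_packet"] = tot_bytes / pkts if pkts else 0.0
    out["packets_per_second"] = pkts / dur if dur else 0.0
    out["bytes_per_second"] = tot_bytes / dur if dur else 0.0
    return out

flow = {"dur": 2.0, "pkts": 4, "sbytes": 300, "dbytes": 100}
enriched = add_derived_features(flow)
```

Keeping the derived features in a single function makes it straightforward to recompute them whenever a base feature is later modified.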

Design and Implementation of AppCon
Our proposed countermeasure is implemented by devising our detectors on the basis of five different web applications that are widely used by modern enterprises [66][67][68][69]: WhatsApp (WhatsApp: https://www.whatsapp.com/), Teams (Microsoft Teams: https://teams.microsoft.com/), Skype (Skype: https://www.skype.com/), OneNote (Microsoft OneNote: https://www.onenote.com/), OneDrive (Microsoft OneDrive: https://office.live.com/). We consider this specific set of applications because their popularity makes them a suitable example for a practical use-case: our approach can be easily expanded by considering more network services. To this purpose, each application is deployed on several dedicated machines which have their network behavior monitored over the course of several days; we also distinguish between active and passive use of each application. This allows us to identify the samples of network flows in the CTU-13 dataset that are "compliant" with each of these applications, which are then used to train (and test) our hardened detectors by developing application-specific sub-instances.
As an example, consider the case of the application Teams. After monitoring its network behavior, we determine that, when it is actively used, it generates flows transmitting between 71 and 1488 bytes per packet, whereas it transmits 50-1050 bytes per packet during its passive use. We then take these two ranges (alongside the accepted ranges of other features used by our detectors; see Table 3) and use them to filter all the flows in the CTU-13 dataset: only those flows (both malicious and benign) compatible with these ranges will be used to train each (instance of the) detector. Thus, by taking the Random Forest detector as an example, its hardened version will be composed of six instances (each corresponding to a specific malware variant); each of these instances is, in turn, split into five sub-instances, each devoted to monitoring a specific application. A schematic representation of this implementation is provided in Figure 4, where the initial ensemble(s) of application-specific detectors (denoted with $(D_{WhatsApp}, D_{Teams}, D_{Skype}, D_{OneNote}, D_{OneDrive})$) is used as input to another "layer" of botnet-specific classifiers (denoted with $(D_A^{b_1}, \ldots, D_A^{b_n})$, with $n = 6$ in our case, where $b$ is a specific botnet family), whose output is then combined to generate the final detection.
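The filtering step for Teams can be illustrated with a short sketch. Only the two bytes-per-packet ranges come from the text; the function names are hypothetical, and checking a single feature is a simplification, since the actual constraints also cover the other features of Table 3.

```python
# Accepted bytes-per-packet ranges for Teams, as reported in the text.
TEAMS_BYTES_PER_PACKET = {"active": (71, 1488), "passive": (50, 1050)}

def is_compliant(flow, ranges=TEAMS_BYTES_PER_PACKET):
    """True if the flow's bytes_per_packet falls in the active or passive range."""
    bpp = flow["bytes_per_packet"]
    return any(lo <= bpp <= hi for lo, hi in ranges.values())

def filter_flows(flows, ranges=TEAMS_BYTES_PER_PACKET):
    """Keep only the flows (malicious or benign) compatible with the application."""
    return [f for f in flows if is_compliant(f, ranges)]

# Toy flows: only the constrained feature is shown.
flows = [{"bytes_per_packet": v} for v in (40, 60, 1200, 1500)]
teams_flows = filter_flows(flows)
```

Repeating this filter for each of the five applications and each of the six botnet families yields the 6 × 5 = 30 training subsets implied by the layered design, one per sub-instance of a given algorithm.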

Generation of Adversarial Samples
The attack scenario considered in this paper is reproduced by inserting small perturbations in the malicious network flow samples contained in the CTU-13 dataset. These perturbations are obtained by adopting the same procedure described in [18], which involves altering (through increments) combinations of the (malicious) flow-based features by small amounts (the corresponding derived features are also updated). This procedure allows for generating multiple adversarial datasets, each corresponding to a specific malware variant, a specific set of altered features, and a specific increment value(s). We anticipate that the generated adversarial samples will be used to evaluate our solution, hence only those botnet flows included in the datasets used for testing the detectors are altered, which allows us to avoid performing the validation phase on samples included in the training sets. To facilitate the reproduction of our experiments, we now provide a more detailed discussion of the adversarial samples generation process.
Attackers may try to evade detection by inserting small latencies to increase the flow duration; another possibility is appending insignificant bytes to the transmitted packets, thus increasing the number of exchanged bytes or packets. These operations can be easily achieved by botmasters, who only need to modify the network communications of the controlled bots [10]; at the same time, these simple strategies comply with our assumption that the underlying application logic of the botnet malware remains unchanged. To devise similar attack strategies, the adversarial samples are crafted by modifying the malicious flows contained in the CTU-13 through manipulations of up to four features: flow_duration, exchanged_packets, src_bytes, dst_bytes (listed in Table 4 as Duration, Tot_pkts, Src_bytes, and Dst_bytes). The 15 different groups of feature manipulations, which we denote as G, are reported in Table 4. In practical terms, the adversarial samples in group 1b alter only the flow Src_bytes, while those of group 3b include modifications to the Duration, Src_bytes, and Tot_pkts features.

Table 4. Groups of altered features, Source: [18].

| Group (g) | Altered Features |
|---|---|
| 1a | Duration (in seconds) |
| 1b | Src_bytes |
| 1c | Dst_bytes |
| 1d | Tot_pkts |
| 2a | Duration, Src_bytes |
| 2b | Duration, Dst_bytes |
| 2c | Duration, Tot_pkts |
| 2d | Src_bytes, Tot_pkts |
| 2e | Src_bytes, Dst_bytes |
| 2f | Dst_bytes, Tot_pkts |
| 3a | Duration, Src_bytes, Dst_bytes |
| 3b | Duration, Src_bytes, Tot_pkts |
| 3c | Duration, Dst_bytes, Tot_pkts |
| 3d | Src_bytes, Dst_bytes, Tot_pkts |
| 4a | Duration, Src_bytes, Dst_bytes, Tot_pkts |

The alterations of these features are obtained by increasing their values through nine fixed steps, which we denote as S and which are reported in Table 5. To provide an example, samples obtained through the V step of the group 1c have the values of their Dst_bytes increased by 64. The adversarial datasets obtained through the I step of the group 3c have the values of their Duration, Dst_bytes, and Tot_pkts increased by 1. We put more emphasis on small increments because they are not only easier to introduce, but they also yield samples that are more similar to their "original" variant (which is a typical characteristic of adversarial perturbations [9]). Furthermore, increasing some of these features by higher amounts may trigger external defensive mechanisms based on anomaly detection [59], or may generate incorrect flows (e.g., some flow collectors [70] have upper limits on flow duration of 120 s [71]).

Table 5. Increment steps of each feature for generating realistic adversarial samples, Source: [18].

| Step (s) | Duration | Src_bytes | Dst_bytes | Tot_pkts |
|---|---|---|---|---|
The complete breakdown of the operations performed to generate our adversarial datasets is provided in Algorithm 1, in which an adversarially manipulated input is denoted through the A(·) operator. It should be noted that some features are mutually dependent: for instance, changing the flow duration also requires updating other time-related features (such as packets_per_second). These operations are addressed in line 19 of Algorithm 1.
After applying all these procedures, we generate a total of 135 adversarial datasets for each botnet family (15 groups of altered features × 9 increment steps = 135), where each dataset represents a different type of evasion attempt. All these attack patterns are compatible with our threat model and can be easily achieved by botmasters [10,55].
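The generation procedure described above can be sketched as follows. This is a minimal illustration rather than the Algorithm 1 used in the experiments: the dictionary-based flow representation, the `pkts_per_second` key, and the helper names are assumptions, and the increments in `PLACEHOLDER_STEPS` are placeholders that do not reproduce the values of Table 5. The group enumeration order produced by `combinations` also does not necessarily match the labels of Table 4.

```python
from itertools import combinations

# Feature names follow Table 4; "pkts_per_second" is a hypothetical derived feature.
FEATURES = ("duration", "src_bytes", "dst_bytes", "tot_pkts")

# Every non-empty subset of the four features: 4 + 6 + 4 + 1 = 15 groups (the set G).
GROUPS = [g for size in range(1, len(FEATURES) + 1)
          for g in combinations(FEATURES, size)]

# Nine increment steps (the set S). PLACEHOLDER values, not those of Table 5:
# step i simply adds i to every altered feature.
PLACEHOLDER_STEPS = [{f: i for f in FEATURES} for i in range(1, 10)]


def perturb_flow(flow, group, step):
    """Return a copy of `flow` with the features in `group` increased by `step`."""
    adv = dict(flow)
    for feature in group:
        adv[feature] += step[feature]
    # Mutually dependent features must stay consistent with the altered ones.
    if adv["duration"] > 0:
        adv["pkts_per_second"] = adv["tot_pkts"] / adv["duration"]
    return adv


def generate_adversarial_datasets(flows, groups=GROUPS, steps=PLACEHOLDER_STEPS):
    """Build one adversarial dataset per (group, step) pair for one botnet family."""
    return {
        (group, step_index): [perturb_flow(f, group, step) for f in flows]
        for group in groups
        for step_index, step in enumerate(steps, start=1)
    }


# A single made-up malicious flow, used only to exercise the functions.
botnet_flows = [
    {"duration": 2.0, "src_bytes": 100, "dst_bytes": 200, "tot_pkts": 4,
     "pkts_per_second": 2.0},
]
datasets = generate_adversarial_datasets(botnet_flows)
print(len(GROUPS), len(PLACEHOLDER_STEPS), len(datasets))  # prints: 15 9 135
```

Enumerating the groups as all non-empty subsets of the four features makes the count of 15 explicit, and recomputing the derived rate inside `perturb_flow` keeps each adversarial flow internally consistent.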

Performance Metrics
The machine learning community usually relies on one or more of the following metrics to measure the performance of machine learning models: Accuracy, Recall, Precision, and F1-score. These metrics are based on the concept of Confusion Matrix. We stress that all problems pertaining to cyber detection can be identified as binary classification problems (that is, a data sample can be either malicious or benign), hence the Confusion Matrix in these contexts is represented as a 2 × 2 matrix, in which rows represent the output of the detector, and columns represent the true class of the input data. An example of such a matrix is reported in Table 6, where TP, FP, TN, and FN denote True Positives, False Positives, True Negatives, and False Negatives, respectively. As is common in cybersecurity settings, a True Positive represents a correct prediction of a malicious sample [18].

Algorithm 1 [72]
Input: List of datasets of malicious flows X_m, divided into botnet-specific sets X_b; list of altered feature groups G; list of feature increment steps S.
Output: List of adversarial datasets A(X_m).
8 // Function for creating a single adversarial dataset A^g_s(X_b) corresponding to a botnet-specific dataset X_b, a specific altered feature group g, and a specific increment step s.
15 // Function for creating a single adversarial sample A^g_s(x_b) corresponding to a botnet-specific sample x_b, a specific altered feature group g, and a specific increment step s.

An explanation of each performance metric is provided below.

• Accuracy: This metric denotes the percentage of correct predictions out of all the predictions made. It is computed as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

In Cybersecurity contexts, and most notably in Network Intrusion Detection [28], the amount of malicious samples is several orders of magnitude lower than that of benign samples; that is, malicious actions can be considered as "rare events". Consider for example a detector that is validated on a dataset with 990 benign samples and 10 malicious samples: if the detector predicts that all samples are benign, it will achieve an Accuracy of 99% despite being unable to recognize any attack. Thus, this metric is often neglected in cybersecurity [3,73].

• Precision: This metric denotes the percentage of correct detections out of all "positive" predictions made. It is computed as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Models that obtain a high Precision have a low rate of false positives, which is an appreciable result in Cybersecurity. However, this metric does not tell anything about false negatives.

• Recall: This metric, also known as Detection Rate or True Positive Rate, denotes the percentage of correct detections with respect to all possible detections, and is computed as follows:

$$\text{Recall} = \frac{TP}{TP + FN}$$

In Cybersecurity contexts, it is particularly important due to its ability to reflect how many malicious samples were correctly identified.

• F1-score: This metric combines Precision and Recall, and is used to summarize them in a single value. It is computed as follows:

$$\text{F1-score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
To evaluate the quality of the developed detectors, we thus rely on the combination of three different metrics: Recall, Precision, and F1-score; the Accuracy metric is not considered due to the reasons provided above.
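As a brief illustration of why Accuracy is set aside, the following sketch computes the four metrics from the confusion-matrix counts and applies them to the example of a detector that labels every sample as benign. The function name and the convention of reporting undefined ratios (zero denominator) as 0.0 are assumptions made for this example.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute Accuracy, Precision, Recall, and F1-score from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 (a convention assumed here).
    """
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Detector that labels every sample as benign on 990 benign + 10 malicious samples:
# no true positives, no false positives, 990 true negatives, 10 false negatives.
always_benign = detection_metrics(tp=0, fp=0, tn=990, fn=10)
print(always_benign)
```

In this case the Accuracy is 0.99 while Recall, Precision, and F1-score are all 0, which makes the failure of the detector visible.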
In addition, we measure the effectiveness of the considered adversarial evasion attacks through a derived metric denoted as Attack Severity (AS), which is computed as follows:

$$AS = 1 - \frac{\text{Detection Rate (after the attack)}}{\text{Detection Rate (before the attack)}}$$

This metric (which has also been used in [11,19]) allows for quickly determining whether an attack family is effective by taking into account the Detection Rate of the targeted detector before and after the submission of adversarial samples. It considers only the Detection Rate because our focus is on evasion attacks, which modify malicious samples so that they are classified as benign. Higher (lower) values of AS denote attacks with higher (lower) amounts of evaded samples.
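A small numerical sketch may help in reading the AS values. It assumes that AS is defined as one minus the ratio between the Detection Rate after the attack and the Detection Rate before the attack; the function name and the error handling are illustrative choices.

```python
def attack_severity(recall_before, recall_after):
    """Attack Severity, assumed here to be 1 - DR(after) / DR(before)."""
    if recall_before <= 0:
        raise ValueError("the baseline Detection Rate must be positive")
    return 1.0 - recall_after / recall_before


# A detector whose Detection Rate is halved by the attack (0.92 -> 0.46).
print(attack_severity(0.92, 0.46))
```

Under this assumption, a detector whose Detection Rate drops from 0.92 to 0.46 yields an AS of 0.5, whereas an attack that leaves the Detection Rate unchanged yields an AS of 0.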

Experimental Results
Our experimental campaign has the twofold objective of: (i) showing that the proposed countermeasure is effective in mitigating evasion attacks; and (ii) showing that its integration has negligible impact in non-adversarial settings. To this purpose, we conduct our evaluation by following this outline:
1. determine the performance of the "baseline" detectors in non-adversarial settings;
2. assess the effectiveness of the considered evasion attacks against the "baseline" detectors;
3. measure the performance of the "hardened" detectors in non-adversarial settings;
4. gauge the impact of the considered evasion attacks against the "hardened" detectors.
We address each of these points in the following sections and then provide some final considerations.

Baseline Performance in Non-Adversarial Settings
It is important that the considered detectors achieve performance compliant with real-world requirements: the crafted adversarial attacks must be effective on detectors that are ready for production environments. Thus, we train and test each detector with the procedure described in Section 3.3.2, and report the results in Table 7, which shows the average (and standard deviation) values of the chosen performance metrics computed across all botnet-specific instances of the detectors. The boxplot representation of these results is provided in Figure 5.  We observe that all detectors achieve performance scores suitable for real-world environments [11], thus representing a valid baseline for the remaining evaluations.

Adversarial Samples against Baseline Detectors
The impact of the considered adversarial attacks against the baseline detectors is now evaluated. This step is critical because our goal is showing that our countermeasure is capable of mitigating attacks that are highly effective. Hence, we generate the adversarial samples with the procedure described in Section 3.3.4 and submit them to the baseline detectors. The results are displayed in Table 8, which reports the Recall obtained by all the detectors before and after the attack, alongside the Attack Severity metric; in particular, each cell denotes the average (and standard deviation) values achieved by all botnet-specific instances against all the generated adversarial samples; the boxplot diagrams of Table 8 are also presented in Figure 6.
We note that all detectors are significantly affected by our evasion attacks, which is particularly evident when comparing Figure 6a with Figure 6b. As an example, consider the results obtained by the Deep Learning algorithm (WnD): its Detection Rate goes from an appreciable 92% to a clearly unacceptable 46%. Having established a solid baseline, we can now proceed to evaluate the quality of the proposed countermeasure.

Hardened Performance in Non-Adversarial Settings
The first goal is to determine whether AppCon has a negative effect on the detectors when they are not subject to adversarial attacks. Thus, we devise all the application-specific sub-instances of our detectors through the methodology explained in Section 3.3.3 and evaluate them on the same test dataset used for the "baseline" detectors. The results are reported in Table 9 and their boxplot representation in Figure 7. From Table 9 and Figure 7, it is evident that our solution has a negligible impact on the performance of the detectors in non-adversarial settings: indeed, its scores are very similar to the ones obtained by the baseline (see Table 7 and Figure 5). These results are critical because several existing countermeasures against adversarial attacks reduce performance on samples that are not adversarially manipulated.

As machine learning-based detectors must undergo a training phase, we report in Table 10 the duration (in minutes) of the training operations for the baseline and hardened detectors, which are performed on an Intel Core i5-4670 CPU with 16 GB of RAM, a 512 GB SSD, and an Nvidia GTX1070 GPU. AppCon is computationally more demanding to train because of the additional layer of classifiers integrated in the detectors. However, these operations need to be executed only periodically.

Countermeasure Effectiveness
Finally, we measure the effectiveness of the proposed solution at mitigating evasion attacks. Hence, we submit all the generated adversarial samples to the hardened detectors.
We first measure how many of the crafted adversarial samples are immediately blocked by AppCon. Indeed, it is safe to assume that all samples that do not conform to the network traffic generated by our set of applications cannot evade detection, as they are not classified as accepted traffic for any of the known applications. In the ideal case in which the set of applications A covers all the web applications used by the protected enterprise, these samples will be blocked. These results are outlined in Table 11, which displays the percentage of adversarial samples that are immediately blocked, alongside the number of samples that will be forwarded to the hardened detectors. From this table, we observe that the simple integration of our method allows for blocking over 75% of the generated adversarial samples, which is a promising result.
Next, the performance of the hardened detectors on the remaining adversarial samples is reported in Table 12; as usual, this table is paired with its corresponding boxplots displayed in Figure 8. By comparing the values in Table 12 with those obtained by the baselines in Table 8, we observe that AppCon allows for devising detectors that are less affected by the considered evasion attacks. For example, let us inspect the performance of the RF classifier: its adversarial Recall goes from ∼0.31 to ∼0.44, which is a relative improvement of roughly 40%. We stress that the complete quality of our solution is shown through both Tables 11 and 12: in the considered scenario [11], AppCon can immediately prevent about 75% of adversarial samples, and it improves the Detection Rate of detectors based on different supervised ML algorithms on the remaining (about 25%) attack samples.
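The two-stage behavior discussed in this section (immediate blocking of traffic that fits no known application, followed by application-specific detection) can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual AppCon implementation: the `(conforms, is_malicious)` callables stand in for trained models, the port- and size-based rules are toy placeholders, and flagging a flow when any matching application detector flags it is an assumed combination rule.

```python
def appcon_decision(flow, app_models):
    """Classify a flow with application-specific detectors.

    `app_models` maps an application name to a pair of callables
    (conforms, is_malicious); both are hypothetical stand-ins for trained models.
    """
    matched = [name for name, (conforms, _) in app_models.items() if conforms(flow)]
    if not matched:
        # Traffic that fits no known application is blocked immediately.
        return "blocked"
    # Assumption: the flow is flagged if any matching application detector flags it.
    if any(app_models[name][1](flow) for name in matched):
        return "malicious"
    return "benign"


# Toy rules standing in for the per-application models (illustrative only).
app_models = {
    "web": (lambda f: f["dst_port"] in (80, 443), lambda f: f["src_bytes"] > 10_000),
    "dns": (lambda f: f["dst_port"] == 53, lambda f: f["tot_pkts"] > 50),
}

flows = [
    {"dst_port": 443, "src_bytes": 500, "tot_pkts": 10},
    {"dst_port": 443, "src_bytes": 50_000, "tot_pkts": 40},
    {"dst_port": 6667, "src_bytes": 300, "tot_pkts": 5},
]
decisions = [appcon_decision(f, app_models) for f in flows]
print(decisions)  # prints: ['benign', 'malicious', 'blocked']
```

The share of "blocked" decisions over a set of adversarial flows corresponds to the kind of percentage reported in Table 11, while the remaining flows are those evaluated in Table 12.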

Considerations
We provide some considerations on the effectiveness of the proposed solution, with regard to (i) its improvement over existing defensive techniques; and (ii) possible strategies that attackers may adopt to evade our system. The authors of [44] propose a method to harden neural network classifiers, but the application of this approach to neural network-based malware detectors (described in [20]) increased the false positive rate in non-adversarial contexts by nearly 2%, which is not acceptable in modern Network Intrusion Detection scenarios in which NIDS analyze thousands of events every second [59]. Moreover, we highlight the study performed in [11], which considers a large array of ML-based botnet detectors, also tested on the CTU-13 dataset against adversarial samples: their results show that techniques such as feature removal may cause drops in Precision from 96% to a worrying 81%. Another possible defense against adversarial evasion attacks relies on re-training the algorithm on perturbed samples [21,47,74]; however, the authors in [47,74] do not evaluate the efficacy of such strategies in non-adversarial settings, while the detection improvements in [21] are very small (only 2%). In contrast to all these past efforts, the values of Precision and F1-score achieved by AppCon in non-adversarial settings differ by less than 1% from our baseline, while the performance in adversarial circumstances improves significantly.
A skilled adversary may still be able to thwart AppCon. To do this, the following three Conditions must be met:
1. the attacker must know (fully or partially) the set of web applications considered by AppCon, which we denote as A;
2. the attacker must know the characteristics of the traffic that the applications in A generate in the targeted organization; we denote this piece of knowledge as C_A;
3. the attacker must be able to modify its malicious botnet communications so as to conform to C_A.
An attacker that meets all three of these conditions does not conform to the threat model that is considered in this paper and used to devise AppCon (see Section 3.1). Regardless, we stress that, while it may be feasible for a persistent attacker to learn A (Condition #1), obtaining C_A (Condition #2) would require far more effort, as the attacker would need to gain access to systems that monitor the behavior of the entire network to acquire such information. Finally, concerning Condition #3, the attacker may be able to modify the malicious CnC communications to comply with C_A, but this may raise alarms by other detection mechanisms [59]. We conclude by highlighting that, as evidenced by our experiments (see Table 11), AppCon allows protection against over 75% of the considered evasion attempts, regardless of the attacker's capabilities.

Conclusions
The application of machine learning algorithms to cybersecurity must face the problem posed by adversarial attacks. In this paper, we propose AppCon, a novel approach that aims to improve the resilience of cyber detectors against evasion attacks. Our solution is particularly suited to strengthen machine learning-based network intrusion detection systems deployed in realistic environments. The proposal combines the effectiveness of ensemble learning with the intuition that modern network environments generate traffic from a finite set of applications; the goal is to limit the options that attackers can use to craft their malicious adversarial samples by tailoring the NIDS to the set of applications used in the protected network. We evaluate the quality of AppCon through an extensive experimental campaign in a botnet detection scenario. The results provide evidence that our solution achieves the symmetric quality of mitigating evasion attacks while not affecting the detection performance in non-adversarial settings, and that it is effective on multiple supervised ML algorithms. These improvements represent a meaningful step towards the development of more secure cyber detectors relying on machine learning. This work leaves room for future improvements: an enticing idea consists of evaluating the synergy of the proposed AppCon approach with other defensive strategies, with the goal of further improving the detection rate against evasion attacks.