PeerAmbush: Multi-Layer Perceptron to Detect Peer-to-Peer Botnet

Due to emerging internet technologies that rely on decentralization, such as cryptocurrencies, cyber attackers have also adopted the decentralization concept to develop P2P botnets. P2P botnets are considered one of the most serious and challenging threats to internet infrastructure security, and several open issues still need to be addressed, such as improving botnet intrusion detection systems, because botnet detection is essentially an adversarial problem. This paper presents PeerAmbush, a novel approach for detecting P2P botnets that, for the first time, applies one of the most effective deep learning techniques, the Multi-Layer Perceptron (MLP), with certain parameter settings, unlike most current research, which is based entirely on machine learning techniques. The reason for employing machine learning/deep learning techniques, alongside data analysis, is that bots within the same botnet exhibit symmetrical behavior, which makes them recognizable against benign network traffic. PeerAmbush also takes on the challenge of detecting P2P botnets with fewer selected features than the existing related works by proposing a novel feature engineering method called Best First Union (BFU). The proposed approach showed considerable results, with a very high detection accuracy of 99.9% and no FPR. The experimental results show that PeerAmbush is a promising approach, and we look forward to building on it to develop better security defenses.


Introduction
Typically, a bot is a compromised machine remotely controlled by a botmaster. A network of such infected machines under the command of a botmaster is called a botnet [1]. The compromised end-hosts are exploited in order to steal data or to launch Distributed Denial of Service (DDoS) attacks [2]. Through their various attack scenarios, botnets have recently posed increasing threats to internet infrastructure security. Botnets are the reason behind many malware attacks, such as cryptocurrency mining, click fraud, and DDoS attacks [3]. What distinguishes a botnet is its ability to launch large-scale, stealthy, and highly coordinated attacks [1]. Thus, taking down a botnet, or even detecting it, is very challenging.
Traditionally, a botnet starts when the attacker, called the botmaster, infects many machines, called bots, over the network with various viruses, worms, and trojan horses, and then commands and controls them remotely to coordinate a large-scale attack [1]. Accordingly, we can summarize the three main components of a botnet: the bots, the command-and-control mechanism (C2), and the malware (malicious software), operated by the botmaster [4]. Intuitively, the larger the botnet, the more challenging it is to counter. In a centralized botnet, however, defenders can target the central command point (the master) to weaken the whole threat. Decentralized botnets pose a harder problem because there is no central point to counter. It is important to consider that botmasters also evolve their command-and-control mechanisms; consequently, some botnets adopt decentralization precisely to avoid a single point of failure [5].

Relevant Research and Critical Analysis
This section discusses the state-of-the-art of IDS solutions against P2P botnets using ML and DL techniques. This paper presents the findings and limitations of the relevant research in order to highlight the open issues, which may help other researchers address the existing gaps. This section critically analyses those gaps to develop a solution that contributes to the current body of knowledge.

Relevant Research
In the last decade, there has been growing interest in techniques for detecting and preventing botnets. These techniques include monitoring the botnet and learning how a bot infects benign machines, but the challenge starts with detecting whether a machine is infected at all. Once a machine is detected as infected, many post-procedures must be taken to prevent that machine from being exploited to enlarge the botnet. Many approaches have been developed to detect botnets. Based on a set of factors, these approaches can be classified as either signature-based or anomaly-based IDSs. In addition, there are three classes of botnet detection systems based on location: host-based, network-based, and hybrid detection systems [21]. This paper covers only research on detecting P2P botnets using ML and DL techniques and then critically discusses that research.
PeerRush, proposed by Rahbarinia et al. [22], mines unwanted traffic in P2P networks and is meant to detect the Storm, Zeus, and Waledac P2P botnets. The proposed solution was evaluated with cross-validation and showed a low misclassification rate of 0.68% and a low false positive rate of 0.1%.
Garg et al. [23] applied several ML techniques, such as nearest neighbor, Naive Bayes, and J48, to detect P2P botnets. In this experiment, the classifiers nearest neighbor and J48 performed better than other classifiers.
Jiang and Shao [24] proposed an approach to detect P2P botnets using an unsupervised ML technique. The proposed approach focuses more on the characteristics of command-and-control traffic. The authors applied a clustering technique to distinguish between benign and P2P botnet flows.
Liao and Chang [25] proposed a methodology to distinguish between legitimate P2P traffic and P2P botnet traffic using the packet size. The authors discovered that P2P botnets frequently try to update connection information for other bots rather than staying idle. Moreover, P2P botnets usually transmit data with a minimum rate of connection. The authors applied Bayesian networks, Naïve Bayes, and J48 as classifiers to classify the network traffic. The J48 technique showed the highest detection rate compared to the other classifiers.
Zhao and Traore [26] proposed an ML-based approach to detect P2P botnets by classifying the captured fast flux network flows. The authors applied a decision tree as a classifier to detect P2P botnets.
Alauthaman et al. [6] proposed a method to detect P2P botnets using a multilayer feed-forward neural network together with decision trees. A regression tree is applied as the feature selection technique, and the selected features then feed the training of the feed-forward neural network. The proposed method achieved a high detection accuracy of 99% and a low FPR of 0.7%.
Yang and Wang [27] implemented an ML technique to detect P2P botnets used in DDoS attacks. The authors also proposed a feature extraction method based on the graph symmetry concept. Their work treated packets as subdivisions of a signal, with the time interval and data packet as the corresponding two-dimensional features.
Yin proposed a node-based detection approach to detect P2P botnet characteristics [28]. Yin focused more on the network characteristics of an individual node by examining the node flows in order to extract the significant features. The author then utilized an ML classifier to detect the bots.
Priyanka presented a two-tier detection scheme to detect parasite P2P botnets in their waiting stage [17]. The authors considered only two essential behaviors: the intensity of search requests and long-living peers. However, that does not reflect the whole network of compromised machines. Moreover, the authors used the same dataset as [22], which is small and limited in terms of traffic type. Table 1 summarizes the findings, limitations, and other details of all the relevant research.

Critical Analysis
We critically analyze the relevant research. We did our best to exhaustively cover all the works that detect P2P botnets using ML and DL techniques, searching the top scientific research repositories, namely ScienceDirect, WoS, SpringerLink, Wiley, and IEEE, as well as several flagship conferences in the security domain according to the CORE2021 ranking: ACM, IEEE SP, NDSS, and USENIX Security. In fairness, some interesting works detected P2P botnets and achieved a high detection rate. However, we still derived some concerns from these otherwise good works. In general, we cluster the concerns into three groups, as follows.

Dataset Concern
As cyber security researchers, we know that compiling a real network dataset is difficult for many reasons, such as anonymity and privacy considerations. For this reason, we found that most of the existing datasets are simulated. The argument is not about whether a dataset is real or simulated, but rather about the way the dataset is constructed. Constructing a dataset for IDS experiments is important because the dataset is the crucial norm by which the effectiveness of the proposed IDS is evaluated. For example, a proposed IDS may have been wisely and carefully built, but if we look back at the dataset used to evaluate it and find that it was hastily constructed, we can no longer trust the evaluation. Some issues in the data are reflected in the final evaluation; an imbalanced dataset, for instance, ultimately leads to an overfitting problem [29]. This paper will not discuss other problems, such as small datasets, datasets of unknown origin, and unavailable datasets.
The problem we found in most of the existing works is that the datasets used were incomplete or did not include real network features. It is agreed that a dataset must contain attack traffic mixed with background traffic so that the proposed model/approach/system learns about both normal and abnormal traffic. For example, some existing works evaluated their experiments using the CTU-13 dataset, such as [30], and after analyzing this dataset, we found that it does not contain background traffic. We therefore conducted a simple assessment survey of the existing datasets that have P2P botnet traffic by checking Google data [31], Mendeley data [32], and Kaggle [33]. Based on a checklist inspired by Susan McGregor [34], we chose a dataset to evaluate our proposed approach. The quality assessment checked certain points, such as: Is it of known pedigree? Is it complete? Is it high-volume? Is it consistent? Is it dimensionally structured? The dataset was also assessed for data fit via validity, reliability, and representativeness. Table 2 summarizes the existing datasets that contain botnet traffic. It should be noted that we did not cover the IoT and Android botnet datasets. Moreover, we did not consider the datasets provided only as CSV files, because a CSV file reflects a limited image of the network traffic; accordingly, we listed providing only CSV files as a limitation. Another concern is that most of the relevant works did not state whether the dataset used was balanced. Ignoring such norms might produce good-looking experimental results, but those results would be misleading. Overall, we had to construct a new and solid dataset (Section 3.1) to avoid the existing concerns.
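The balance concern raised above can be made concrete with a small check before training. The following sketch uses only the standard library; the labels are illustrative, not the paper's actual data.

```python
# Sketch: checking class balance before training an IDS model.
from collections import Counter

def class_balance(labels):
    """Return each class's fraction of the total sample count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# A strongly skewed ratio (here 95:5) warns that raw accuracy would be
# misleading and resampling or class weighting may be needed.
labels = ["benign"] * 95 + ["p2p_botnet"] * 5
ratios = class_balance(labels)
```

Running such a check on any candidate dataset makes the balance property explicit instead of leaving it unreported, which is the gap noted in the relevant works.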

Feature Engineering Concern
First, most of the authors of the relevant research did not engineer the features: they kept all the features and fed that large volume to their models/approaches/systems, which cost resources and time and increased complexity. Second, even the authors who did perform feature engineering did not evaluate the utility of their methods before and after applying them. Finally, a feature engineering method must consider certain matters, such as the simplicity, applicability, and efficiency of the proposed solution. In this paper, we propose a novel, efficient, and applicable feature selection method to reduce the data dimensionality and speed up the whole data processing (Section 3.3). Moreover, feature selection facilitates targeted attack detection for the predictors in the next (detection) stage [42].

Detection Concern
All the classifiers utilized in the relevant works are ML techniques. Although some works achieved good detection rates, they still cannot prove their effectiveness because of differences in data preparation and because major evaluation metrics were missing. In addition, some researchers repeatedly applied the same ML techniques, i.e., the existing works did not apply many different classifiers to understand each technique's performance and cover more possibilities. For example, [26-28] applied a Decision Tree as a classifier to detect P2P botnets, whereas [22] and [27] applied SVM, and [23] and [25] applied Naïve Bayes. Noticeably, the same ML classifiers are applied repeatedly. Furthermore, DL techniques have not been fully leveraged in detecting P2P botnets. Another concern is that not all authors mentioned the testing approach used in their experiments, whether Percentage-split or Cross-validation. For these reasons, our proposed approach is based on one of the most effective DL techniques, the MLP, and it is tested with both testing approaches: Percentage-split and Cross-validation.

PeerAmbush
In this paper, we propose a novel approach, PeerAmbush, to detect one of the most dangerous attacks, the P2P botnet, using a DL technique. PeerAmbush addresses some of the limitations derived from the relevant research, as discussed in Section 2.2, regarding the datasets used to evaluate proposed solutions, the influence of feature engineering, and detection performance. PeerAmbush consists of five stages: Data Construction, Data Preparation, Feature Engineering, MLP-based P2P Botnet Detection, and Evaluation Results. Each stage includes some substages before feeding its output as input to the following stage. Figure 1 shows an overview of PeerAmbush.
The key contributions of our proposed approach are as follows:
• Constructing a new dataset that includes P2P botnet traffic and background flow;
• Proposing a novel feature engineering method based on mathematical union theory to select the most significant features: Best First Union (BFU);
• Adapting the MLP as a classifier to detect P2P botnets.

Data Construction
As mentioned previously, generating a new dataset is challenging due to privacy matters. For this reason, we constructed a new dataset based on the reliable existing ones. One of the norms to assess the dataset quality was whether it was of a known pedigree. After that, we selected the CTU-13 dataset [36] for three reasons: i) it is reliable, and many researchers have used this dataset to evaluate their solutions, ii) it contains the P2P botnet scenario, and iii) it was provided as PCAP files not CSV files, and that gives more understanding of the network traffic and the behavior of P2P botnets. However, this dataset was missing benign network flows. Hence, we merged the CTU-13 dataset with a recent network background flow collected from the HIKARI dataset [43]. As a result, the newly constructed dataset contains both the abnormal behavior of the P2P botnet and the normal flow of the network to let the trained model learn about both the normal and abnormal and then avoid the problem of overfitting. Figure 2 simplifies the process of this stage.
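The merge step described above can be sketched as follows. This assumes both captures were already exported to per-flow records with a shared schema; the field and label names are illustrative, not the paper's actual feature names.

```python
# Sketch: labelling and merging P2P botnet flows with benign background
# flows, then shuffling so training batches see both classes.
import random

def build_dataset(botnet_flows, background_flows, seed=42):
    """Label each flow record, merge the two sources, and shuffle."""
    merged = ([{**f, "label": "p2p_botnet"} for f in botnet_flows] +
              [{**f, "label": "benign"} for f in background_flows])
    random.Random(seed).shuffle(merged)
    return merged
```

In practice the paper works from PCAP files, so a flow-export step would precede this; the sketch only illustrates the labelling-and-union logic of the Data Construction stage.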


CTU-13 Dataset
This dataset is considered one of the most reliable datasets in the IDS community. CTU-13 has a large capture of real botnet traffic, including 13 different scenarios of different botnet types. This dataset contains many protocols, such as ICMP, TCP, DNS, etc. One of its useful features is that it was provided as PCAP files, allowing a realistic and deeper understanding of the network traffic. The PCAP files of the CTU-13 dataset also contain other types of information, such as NetFlow, Weblogs, etc. [36].


HIKARI Dataset
This dataset was recently captured by Ferriyan [43], and we decided to merge it with CTU-13 to supply what the first dataset was missing. The HIKARI dataset captures the network traffic completely, including communication between hosts, broadcast messages, and domain lookup queries. We chose the ground-truth from this dataset because it provides realistic benign traffic from a real production network, not synthetic traffic as found in some datasets [43].

Data Preparation
Data preparation means preparing the selected dataset for the next stages through some substages, so that it fits the purpose of this research and is readable by the subsequent methods, i.e., the feature engineering method and the detection stage. The first procedure is filtration because, as mentioned previously, the CTU-13 dataset contains 13 different scenarios, and this approach concerns only P2P botnets (Scenario no. 12) [36]. Consequently, we exclude the other scenarios and keep the P2P botnet scenario as well as the ground-truth from the HIKARI dataset. After filtration, we label the filtered dataset so it can be imported into a supervised learning-based model. We then convert the labelled dataset into numeric data to make it readable by the next methods/algorithms (numericalization). Finally, we normalize the dataset before taking it as input for the third stage. Figure 3 shows the process of the data preparation stage.
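The numericalization and normalization substages can be sketched as follows; the field names are illustrative, and the labelling step is assumed already done.

```python
# Sketch: mapping categorical flow fields to integers ("numericalization")
# and min-max normalizing numeric fields to [0, 1], over a list of
# per-flow record dicts.

def numericalize(records, field):
    """Replace string values of `field` with integer codes."""
    mapping = {v: i for i, v in enumerate(sorted({r[field] for r in records}))}
    for r in records:
        r[field] = mapping[r[field]]
    return records

def normalize(records, field):
    """Min-max scale `field` so all values fall in [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for constant fields
    for r in records:
        r[field] = (r[field] - lo) / span
    return records
```

In a real pipeline the encoder and scaler would be fit on the training split only and reused on the test split, to avoid leaking test statistics into training.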


Feature Engineering
Recently, researchers have widely applied feature engineering because it plays a vital role in their proposed solutions. There are many feature engineering methods, such as feature selection. Feature selection reduces the data dimensionality, saving time and reducing the complexity of the proposed solutions [29]. Unfortunately, most of the relevant works had no feature engineering method; accordingly, those proposed solutions were more complex and more resource- and time-consuming.
In this paper, we propose a novel feature engineering method based on mathematical union theory to select the most significant features that reflect the influence of the whole feature set and improve the predictor's performance. We name the proposed method Best First Union (BFU). BFU starts with a feature evaluator to select the most important features as the best first ones via two different methods: CFS Subset Evaluation and Consistency Subset Evaluation. The next subsections explain the two methods (Sections 3.3.1 and 3.3.2).
Each method evaluates the features differently and eventually provides a final shortlist of features. The resulting two shortlists will then be united to generate one shortlist that includes the most significant features and leverage two different evaluators. Later, only the selected features will go through the detection stage as inputs. Figure 4 simplifies the process of our novel feature engineering method, BFU.
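The union idea behind BFU can be sketched as follows. Weka's CFS and Consistency subset evaluators (the methods the paper uses) have no direct scikit-learn equivalent, so two stand-in univariate selectors are used here purely to illustrate uniting two shortlists; this is a sketch of the idea, not the paper's exact implementation.

```python
# Sketch of the BFU idea: run two independent feature evaluators, then
# take the set union of their selected-feature shortlists.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

def bfu_select(X, y, k=5):
    """Return the sorted union of the top-k features from two evaluators."""
    a = SelectKBest(mutual_info_classif, k=k).fit(X, y)
    b = SelectKBest(f_classif, k=k).fit(X, y)
    idx_a = set(np.flatnonzero(a.get_support()))
    idx_b = set(np.flatnonzero(b.get_support()))
    return sorted(idx_a | idx_b)  # union of the two shortlists
```

The union keeps any feature that either evaluator deems significant, so the final shortlist has between k and 2k features, leveraging both evaluation criteria.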


CFS Subset Evaluation
This method evaluates the worth of features by grouping them into subsets and then considering the individual predictive ability of each feature along with the degree of redundancy among features. Consequently, features that are highly correlated with the class while having low intercorrelation with each other are preferred as the best first [44].

Consistency Subset Evaluation
This method works on evaluating the worth of features by grouping them into subsets and then measuring the level of consistency in the class values. Each subset has a consistency, and that consistency can never be lower than that of the full dataset [45].

MLP-based P2P Botnet Detection
According to the relevant research, the behavior of P2P botnets is distinguishable from normal network traffic, so detecting P2P botnets can be modelled as a classification task. We earlier prepared our dataset through some substages, such as labelling. Accordingly, we adopted a typical supervised deep learning technique, the MLP, as a classifier to detect the P2P botnet. There are two reasons for choosing this DL technique in this work: (i) no work has yet applied DL techniques to detect P2P botnets, and (ii) this technique has proved its effectiveness in detection systems, i.e., several researchers have applied it to detect different types of attacks [46-49], and their experiments show the high efficiency of MLP in network intrusion detection systems.

Multi-Layer Perceptron (MLP)
Multi-Layer Perceptron is a deep learning technique; it is considered one of the most efficient neural network techniques for classification in IDS [50]. MLP is a feed-forward, fully connected neural network. Figure 5 shows a simple hypothetical example of the architectural design of MLP. The MLP takes the numeric and normalized values of the selected features, as prepared in the previous stages.
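A minimal sketch of such a classifier using scikit-learn's MLPClassifier is shown below; the layer sizes and iteration count are illustrative assumptions, not the paper's reported parameter settings.

```python
# Sketch: a feed-forward, fully connected MLP classifier for flow records.
from sklearn.neural_network import MLPClassifier

def make_detector(hidden=(32, 16)):
    """Build an MLP with two hidden layers (sizes are illustrative)."""
    return MLPClassifier(hidden_layer_sizes=hidden,
                         activation="relu",
                         solver="adam",
                         max_iter=300,
                         random_state=42)
```

The detector is then fit on the numeric, normalized feature matrix produced by the earlier stages, e.g. `make_detector().fit(X_train, y_train)`.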

Most of the relevant works that applied MLP as a classifier did not mention their parameter settings. Some of them merely increase the number of hidden layers, which might cause overfitting; i.e., adding hidden layers may look beneficial while the model performs well on the training dataset, yet testing the trained model on a new dataset may show disappointing performance. The number of hidden layers should be tuned along with the number of nodes and the dataset volume. The chosen numbers of hidden layers and nodes reflect the performance/quality trade-off, minimizing the total error due to bias and variance. Often, a complex model leads to overfitting, while an overly simple one fails to capture the relationship between the input and the output [51].
In this paper, we utilize both testing approaches, Percentage-split and Cross-validation, and we provide the parameter settings of the MLP.

Percentage-Split
In this testing approach, the dataset is split 80%/20%: the 80% portion is used to train the MLP, while the remaining 20% is used to test the effectiveness of the MLP in detecting intrusions on previously unseen data.

Cross-Validation
In this testing approach, the dataset is divided into 10 folds, where nine folds are used to train the MLP and the remaining fold is used to test the effectiveness of the MLP in detecting intrusions, rotating so that each fold serves once as the test set. Figure 6 shows the difference between the two testing approaches.
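The two testing approaches can be sketched with scikit-learn as follows; the classifier and data are placeholders for the trained MLP and the prepared dataset.

```python
# Sketch: the two testing approaches used to evaluate the detector.
from sklearn.model_selection import train_test_split, cross_val_score

def percentage_split(clf, X, y):
    """80/20 hold-out: train on 80%, report accuracy on the unseen 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    return clf.fit(X_tr, y_tr).score(X_te, y_te)

def ten_fold_cv(clf, X, y):
    """10-fold cross-validation: mean accuracy over the ten held-out folds."""
    return cross_val_score(clf, X, y, cv=10).mean()
```

Reporting both scores guards against a lucky (or unlucky) single split: cross-validation averages over ten different train/test partitions, while the percentage split mirrors deployment on one unseen portion.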
In this paper, all the major metrics are used to evaluate the performance of employing the MLP to detect the P2P botnet, such as Detection Accuracy, FPR, Precision, Recall, and F-Score. Providing all the major metrics reflects the quality and effectiveness of the proposed approach.

Evaluation Metrics
Several key metrics were used to evaluate the performance of different models/approaches/systems. The evaluation metrics serve as an accurate reading of the performance of the proposed solution, and some standard metrics should be calculated, especially for comparison purposes, such as Accuracy, False Positive Rate, and Precision. This paper considers all the major metrics to evaluate PeerAmbush against other works. True Positive (TP) is the number of correctly predicted attacks, while True Negative (TN) is the number of correctly predicted normal traffic instances. False Positive (FP) and False Negative (FN) are, respectively, the number of normal instances predicted as attacks and the number of attacks predicted as normal instances [20,52]. The False Positive Rate is another major metric; it is the proportion of normal instances that are predicted as attacks. Equation (1) calculates the FPR, Equation (2) calculates the Precision, Equation (3) calculates the Recall, and Equation (4) calculates the F-Score, as follows [53].
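The four referenced equations did not survive extraction; their standard definitions, consistent with the TP/TN/FP/FN descriptions above, are:

```latex
\begin{align}
\mathrm{FPR} &= \frac{FP}{FP + TN} \tag{1}\\[4pt]
\mathrm{Precision} &= \frac{TP}{TP + FP} \tag{2}\\[4pt]
\mathrm{Recall} &= \frac{TP}{TP + FN} \tag{3}\\[4pt]
\text{F-Score} &= \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}
\end{align}
```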

Implementation and Experimental Results
This section describes the design, implementation, and experimental results of each stage in PeerAmbush through three subsections. PeerAmbush as a security solution against P2P botnets was thoroughly explained in the previous section. However, this

Evaluation Metrics
Several key metrics were used to evaluate the performance of different models, approaches, and systems; they provide an accurate reading of the performance of the proposed solution. Some standard metrics should be calculated, especially for comparison purposes, such as Accuracy, False Positive Rate, and Precision. This paper considers all the major metrics to evaluate PeerAmbush against other works. True Positive (TP) is the percentage of correctly predicted attacks, while True Negative (TN) is the percentage of correctly predicted normal traffic instances. False Positive (FP) and False Negative (FN) are, respectively, the percentage of normal instances predicted as attacks and the percentage of attacks predicted as normal instances [20,52]. The False Positive Rate (FPR), the percentage of normal instances predicted as attacks, is another major metric for evaluating the proposed approach. Equation (1) calculates the FPR, Equation (2) the Precision, Equation (3) the Recall, and Equation (4) the F-Score, as follows [53].
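Consistent with the TP/TN/FP/FN definitions above, the four metrics take their standard forms:

```latex
\mathrm{FPR}       = \frac{FP}{FP + TN} \tag{1}

\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}

\mathrm{Recall}    = \frac{TP}{TP + FN} \tag{3}

\mathrm{F\text{-}Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}
```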

Implementation and Experimental Results
This section describes the design, implementation, and experimental results of each stage in PeerAmbush through three subsections. PeerAmbush as a security solution against P2P botnets was thoroughly explained in the previous section; here, we show the output of each stage, starting with constructing a new dataset and ending with the evaluation results. Figure 7 shows the roadmap of PeerAmbush in detail.

Data Construction and Preparation (Stages 1-2)
As previously discussed in Section 2.2.1, there is a need to construct a new dataset that contains both P2P botnet traffic and benign network traffic. Scenario no. 12 was selected for further analysis in our work; in it, we found flows from the botmaster to the infected machines and vice versa. In addition, this scenario contains flows from infected machines to non-infected machines. Suppose we experimentally block the IP addresses of the botmaster and the infected machines: in that case, no traffic remains, i.e., there is no normal traffic purely among benign nodes (between non-infected machines). Therefore, we need to differentiate the flows of infected and non-infected machines by their traffic. For additional information regarding CTU-13, Table 3 shows the IP addresses of the botmaster and the infected machines; the dataset can be found through [36]. Noticeably, Table 3 shows that there was no traffic flow from non-infected to non-infected machines. As a complementary measure, we extracted background normal network traffic from HIKARI [43]. The number of packets extracted from CTU-13 is 352,266, while the number of complementary packets from the HIKARI dataset is 533,848. Accordingly, the newly constructed dataset contains 886,114 packets. The new dataset is labelled into three classes: botmaster flow, infected-machine flow (bots), and benign flow (i.e., multiclass labelling). After labelling, filtration keeps only the flows of Scenario no. 12 and drops the rest (Scenarios 1-11 and 13). Finally, before converting the dataset to numeric data and applying normalization, we checked whether the dataset was balanced before proceeding to the next stage, and it was. Table 4 describes the newly constructed dataset. The prepared dataset has 30 features.
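The multiclass labelling step can be sketched as below. The IP addresses here are hypothetical stand-ins; the real botmaster and bot addresses come from Table 3 (CTU-13, Scenario no. 12):

```python
import pandas as pd

# Hypothetical IPs standing in for Table 3.
BOTMASTER_IPS = {"10.0.2.1"}
BOT_IPS = {"10.0.2.15", "10.0.2.16"}

flows = pd.DataFrame({
    "src_ip": ["10.0.2.1", "10.0.2.15", "192.168.1.5"],
    "dst_ip": ["10.0.2.15", "10.0.2.16", "192.168.1.9"],
})

def label_flow(row):
    # Three classes: botmaster flow, bot (infected machine) flow, benign flow.
    if row["src_ip"] in BOTMASTER_IPS or row["dst_ip"] in BOTMASTER_IPS:
        return "botmaster"
    if row["src_ip"] in BOT_IPS or row["dst_ip"] in BOT_IPS:
        return "bot"
    return "benign"

flows["label"] = flows.apply(label_flow, axis=1)
print(flows["label"].tolist())  # ['botmaster', 'bot', 'benign']
```

A flow touching the botmaster on either end is labelled first, so botmaster-to-bot traffic is not double-counted as a bot flow.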
In our novel feature engineering method (BFU), all features are evaluated by a feature evaluator using two different methods (Sections 3.3.1 and 3.3.2) to select the Best First subset. Each method then yields a different shortlist representing its Best First selection. Mathematically, A represents the Best First list of the CFS Subset Evaluation, and B represents the Best First list of the Consistency Subset Evaluation. Equation (5) calculates the union of A and B, where x represents a feature [54].
Considering the two lists A and B, the number of features in their union can be calculated as in Equation (6). Table 5 summarizes the evaluation methods (Best First Evaluators) and shows that four features compose the final shortlist; only these four features are fed to the detection stage. Comparatively, BFU selects fewer features than the relevant research, as explained in Table 6. Selecting only four of the 30 original features (approximately 13%) saves time and resources for the detection system and, indirectly, makes the process less complex and smoother. Furthermore, the next section (P2P botnet detection) shows the positive influence of using the BFU method compared to using the full dataset without it.
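The union operation behind Equations (5) and (6) is plain set arithmetic. A minimal sketch, with hypothetical feature names standing in for the actual evaluator outputs:

```python
# Hypothetical shortlists; the real Best First outputs come from the CFS and
# Consistency subset evaluators (Sections 3.3.1 and 3.3.2).
A = {"dur", "sbytes", "dport"}   # Best First list from CFS Subset Evaluation
B = {"dur", "proto"}             # Best First list from Consistency Subset Evaluation

# Equation (5): A ∪ B = {x : x ∈ A or x ∈ B}
union = A | B

# Equation (6): |A ∪ B| = |A| + |B| - |A ∩ B|
assert len(union) == len(A) + len(B) - len(A & B)
print(sorted(union))  # four features feed the detection stage
```

Features picked by both evaluators (here, `dur`) are counted once, which is why the inclusion–exclusion identity of Equation (6) subtracts the intersection.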

Evaluation Results of MLP-Based P2P Botnet Detection
This section presents the parameter settings and experimental results of PeerAmbush in detecting P2P botnets, together with a comparison against the relevant works. As mentioned previously, we tested PeerAmbush using two testing approaches: percentage split and cross-validation. The parameter settings for the MLP are as follows. The number of training instances utilized in one iteration is 100 (the batch size). There are ten hidden layers in our neural network. Experimentally, we slightly increased or decreased the number of hidden layers and the nodes in each layer until we achieved the highest detection rate, also considering the time taken to build a model; as mentioned previously, it is not recommended to keep increasing the number of hidden layers, to avoid overfitting [51]. Furthermore, we set the learning rate for updating the node weights to 0.5, and the momentum applied to weight updates to 0.2. Last but not least, training time is measured by the number of epochs, which is 500. Our parameter settings achieved better results than the relevant research: PeerAmbush reached a detection accuracy of 99.9% with no FPR, while the default parameter settings achieved a detection accuracy of 96.5% with more false positive alarms. Table 7 shows the parameter settings of the MLP. Table 8 summarizes the experimental results of our proposed PeerAmbush approach compared to the most recent works (last five years). No relevant works have yet leveraged DL techniques to detect P2P botnets; consequently, there was a vital need to employ efficient techniques such as the MLP to detect one of the most serious threats, P2P botnets. In addition, none of the relevant works tested their proposed solutions with two different testing approaches to show effectiveness.
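The parameter settings above can be mapped onto scikit-learn's `MLPClassifier` as a sketch. This is not the authors' implementation: the per-layer node count (10 here) and the toy data are assumptions, while batch size, learning rate, momentum, and epochs follow the values stated in the text:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 4))                 # four selected features (toy values)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

clf = MLPClassifier(
    hidden_layer_sizes=(10,) * 10,  # ten hidden layers; 10 nodes each is an assumption
    solver="sgd",                   # plain SGD so learning rate and momentum apply directly
    batch_size=100,                 # training instances per iteration
    learning_rate_init=0.5,         # learning rate for weight updates
    momentum=0.2,                   # momentum applied to weight updates
    max_iter=500,                   # training time measured in epochs
    random_state=0,
)
clf.fit(X, y)
preds = clf.predict(X)
```

In practice, a learning rate as high as 0.5 only works with careful normalization of the inputs, which is why the data preparation stage normalizes the dataset before this point.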
Moreover, most of the relevant works did not report many of the major evaluation metrics, which casts doubt on those solutions. Thus, we provide all the major evaluation metrics, which show the superiority of PeerAmbush in Accuracy, FPR, Precision, Recall, and F-Score using two different testing approaches: percentage split and cross-validation. Table 9 comparatively shows the performance of the MLP with our parameter settings against the best ML techniques applied in the relevant research on the same dataset (our newly constructed dataset).
To conclude, some ML techniques performed well in detecting P2P botnets in terms of detection accuracy, as shown in the relevant research. However, IDSs are not judged by detection accuracy alone; other matters, such as time and complexity, should be considered to improve overall performance. In this work, we proposed a novel feature engineering method to select the most significant features. The proposed feature selection method ultimately passed only four features to the predictor, which reduced the data dimensionality and, in turn, the process complexity.
Experimentally, we also used the full-feature dataset to show the positive influence of our feature engineering method (BFU, Stage 3). Comparatively, the results of using the BFU method are better for several reasons: higher detection accuracy, lower FPR, higher Precision and Recall, and less time taken to build a model than with the full dataset. Table 10 comparatively shows the results with and without the BFU method, demonstrating its power. Some features may mislead or hinder the classification process; our feature engineering therefore selected only the most significant features, which reflect the worth of the whole set and help the predictor classify better.
In general, this approach achieved higher detection accuracy and no FPR with fewer selected features compared to the relevant works. Last but not least, this paper shows the performance of one of the most effective DL classifiers, the MLP, with certain parameter settings, in detecting P2P botnets. Finally, the experimental results are promising for building and improving new IDSs to detect P2P botnets.

Conclusions and Future Work
In conclusion, we proposed PeerAmbush as a novel DL-based approach to detect one of the most serious attacks, P2P botnets. The proposed approach addresses some of the limitations of the relevant research, such as dataset issues and feature engineering, by constructing a new dataset and proposing a novel feature engineering method, respectively. The novel feature engineering method, which we named BFU, is based on Best First Union and selected only four features as the best for detecting P2P botnets. PeerAmbush employs the MLP as a DL classifier to classify the network traffic, because the relevant works have not yet leveraged DL techniques, and to benefit from the effectiveness of DL in this case. The proposed approach consists of five main stages; each stage has substages to solve a certain issue and prepare the dataset for the next stage. PeerAmbush showed impressive results compared to the relevant research in terms of detection accuracy, FPR, Precision, Recall, and F-Score, using only four features, the fewest among the relevant works. In the future, we look forward to detecting newer P2P botnet types, such as the DDG P2P botnet or the very sophisticated FritzFrog P2P botnet. We will also work on employing more DL techniques to detect more types of P2P botnets, and we plan to complete this work by developing a prevention technique against the detected traffic.