Transport and Application Layer DDoS Attacks Detection to IoT Devices by Using Machine Learning and Deep Learning Models

From smart homes to industrial environments, the IoT is an ally in easing daily activities, some of which are critical. More and more devices are connected to and through the Internet, which, given the large number of different manufacturers, may lead to a lack of security standards. Denial of service attacks (DDoS, DoS) represent the most common and critical attacks against and from these networks, and in the third quarter of 2021, there was an increase of 31% (compared to the same period of 2020) in the total number of advanced targeted DDoS attacks. This work uses the Bot-IoT dataset, addressing its class imbalance problem, to build a novel Intrusion Detection System based on Machine Learning and Deep Learning models. In order to evaluate how the record timestamps affect the predictions, we used three different feature sets for binary and multiclass classification; this helped us avoid feature dependencies, as produced by the Argus flow data generator, whilst achieving an average accuracy >99%. We then conducted comprehensive experimentation, including a time performance evaluation, matching and exceeding the results of the current state-of-the-art for identifying denial of service attacks, where the Decision Tree and Multi-layer Perceptron models were the best-performing methods for identifying DDoS and DoS attacks over IoT networks.


Introduction
Distributed Denial of Service (DDoS) attacks are one of the main threats to network systems, affecting the applications and devices that rely on them. DDoS attacks consist of grouping multiple devices against one target, preventing legitimate users from accessing services such as email and websites. They represent the most common and critical attack against and from Internet of Things (IoT) devices, Cloud Computing, and fifth-generation (5G) communication networks [1].
DDoS attacks can be categorized into two main classes according to the size of the traffic they generate: high-rate and low-rate attacks. The behavior of low-rate attacks is extremely inconspicuous, since they behave similarly to legitimate traffic and can account for about 10-20% of the total normal network traffic [2]. Although the average traffic of low-rate attacks is small, they can potentially not only reduce the quality of service of the target but also stop the service completely [2]. This is achieved by an attacker periodically sending pulses of data instead of continuous flows [3]. Examples of low-rate denial of service attacks are GoldenEye, Slowloris, and R.U.D.Y. (R-U-Dead-Yet?) [4].
In contrast to low-rate attacks, high-rate DDoS attacks employ an approach of high-rate packet transmission, where the statistical changes in the behavior can be used to distinguish them from the normal data flows [3]. High-rate attacks violently exhaust the resources and the capacity of the network, making the victim unresponsive in a short period of time [2]. Examples of high-rate denial of service attacks are SYN Flood and UDP Flood [2].
According to the Kaspersky Q2 2020 DDoS Attacks Report [5], in the second quarter of 2020, the number of DDoS attacks slightly increased compared to the first quarter of the year. The main contributions of this work are the following:
• Anomaly detection models that match and exceed the performance of the current state-of-the-art for identifying specific denial of service attack categories, using three different feature sets;
• A comprehensive evaluation of the classification and time performance of several Machine Learning and Deep Learning models with three different feature sets, which led to the discovery that it is not necessary to use the Argus flow data generator for any future online implementation based on the Bot-IoT dataset;
• A suitable way to address the Bot-IoT dataset bias problem without adding class weights and without generating synthetic data.
This paper is structured as follows: the literature review is presented in Section 2. Section 3 describes the methodology for the creation of the novel Artificial-Intelligence-based Intrusion Detection System proposed here. The analysis and evaluation of the results are shown in Section 4. Finally, in Section 5, we present the conclusions and future work.

Related Work
In the original publication of the Bot-IoT dataset [11], its authors evaluated their work by training three different models: one Machine Learning model based on a Support Vector Machine (SVM) with a linear kernel, and two Deep Learning models using a simple Recurrent Neural Network (RNN) and a Long Short-Term Memory (LSTM) architecture.
The performance of the models was evaluated with two different sets of features: The first of them used the 10 best features (selected from a filter with Correlation Coefficient and Joint Entropy), and the second one used all 35 features.
For multiclass classification of all the attacks in the Bot-IoT dataset, the best accuracy was 99.988%, obtained with an SVM using all 35 features. In terms of exclusively DDoS and DoS attacks, the work only reports binary classifications (e.g., Normal flows vs. DDoS HTTP), obtaining a maximum accuracy of 99.999% for Normal flows vs. DDoS UDP with an RNN.
Nevertheless, the dataset was unbalanced [23], which may have positively affected the identification of attacks (i.e., the majority class) due to data bias. This is one of the opportunities we address in this research.
The work in [12] also used 5% of the Bot-IoT dataset and presented 7 different Deep Learning models, including RNNs, achieving a maximum sensitivity of 96.868% for Normal flows vs. DoS HTTP. With respect to DDoS, the maximum sensitivity was obtained for Normal flows vs. DDoS UDP, with an accuracy of 96.666%.
In [13], the authors applied different Random Forest configurations, tuning the depth and the number of trees. The authors proposed 6 different feature sets (of 4 to 8 features, such as IP, port, and timestamp) and compared their accuracies with those of the 10-best-features set and the SVM in the Bot-IoT paper [11]. The accuracy of the SVM with the 10-best-features set is 88.372%, while the accuracy of the proposed Random Forest (with the 6 different feature sets) is 100%. Nevertheless, it is important to note that the experiments in [13] did not consider only DDoS and DoS attacks but also included other types of attacks, such as data exfiltration and service scanning. No other models were presented by the authors, only Random Forest with a small number of features, which might lead to a loss of information. With respect to time performance, the authors only evaluated the effects of the Random Forest sizes on the run-time overhead of classifying a single data packet.
In [14], a packet-level model based on Deep Learning was proposed, using Feed-Forward Neural Networks (FFNN) for binary and multiclass classification on the Bot-IoT dataset. The four attack categories, DoS, DDoS, reconnaissance, and information theft, were classified in order to differentiate them from normal traffic. Confusion matrices were generated, and the accuracy, precision, recall, and F1 score metrics were used for performance evaluation. With respect to only DoS and DDoS attacks, the proposed model presented accuracies above 99% in all binary classifications (e.g., Normal flows vs. DDoS TCP) and an accuracy of 99.414% for multiclass classification. In order to deal with the unbalanced nature of the dataset, class weights were introduced into the training data, so that the class with a smaller number of samples received a higher weight value. However, this technique could introduce the risk of over-tuning, resulting in weights that may not generalize optimally [14].
In [15], the Bot-IoT dataset was used to validate a new feature-selection algorithm based on the Area Under the Curve (AUC) metric. A feature set of five variables was selected as the best one, and the mean and the standard deviation of the duration of the aggregated records were two of those features. Only four Machine Learning models were applied: Decision Tree, Naive Bayes, Random Forest, and SVM. The accuracy, precision, recall, and specificity metrics were used for performance evaluation. In terms of results, Random Forest and Decision Tree showed an accuracy of 100% for HTTP, TCP, and UDP denial of service attack detection. This paper presented a solution for the problem of selecting effective features for accurate attack detection in IoT networks. The AUC metric is useful for dealing with imbalanced datasets [24]; nevertheless, that work neither evaluates Deep Learning models nor presents a time performance metric, such as the average number of flows per second each model can process, which is relevant for evaluating the feasibility of a real-time implementation of their proposed best models.
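The AUC-based selection idea can be sketched by scoring each feature with the AUC it achieves on its own as a one-dimensional classifier. The helper below, `auc_feature_ranking`, is a hypothetical illustration of this idea, not the exact algorithm published in [15]:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_feature_ranking(X, y):
    """Rank features by how well each one alone separates two classes,
    scoring a feature by the AUC of its raw values (illustrative sketch
    of AUC-based selection, not the algorithm of [15])."""
    scores = []
    for j in range(X.shape[1]):
        auc = roc_auc_score(y, X[:, j])
        scores.append(max(auc, 1.0 - auc))  # direction-invariant score
    return np.argsort(scores)[::-1]         # indices, best feature first
```

A feature that separates the classes regardless of direction scores close to 1, while an uninformative feature scores close to 0.5, so sorting by this score yields a simple, threshold-free ranking.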
The work in [16] presented a novel use of Gated Recurrent Units (GRU) on the Bot-IoT dataset. GRUs aim to solve the vanishing gradient problem of a standard RNN [16] by using update and reset gates. The proposed model used only 125,971 samples from the original Bot-IoT dataset, in order to conduct a fair comparison at the same size as the NSL-KDD dataset [25], obtaining an accuracy of 99.76% for Normal vs. Attack traffic identification, without focusing exclusively on either DDoS or DoS attacks.
In [17], the Bot-IoT dataset was used for conducting binary and multiclass classification tasks, with balanced and unbalanced representations of it, where the class balancing technique used was based on weights, as seen in [14]. As mentioned by the authors in [17], they used the default values of the hyperparameters for each classifier, as provided by Scikit-Learn [26] and Keras [27]. In terms of performance metrics, they present indicators such as accuracy and F1 score; however, the authors did not present an evaluation of the models' feasibility in a real-time scenario (e.g., by evaluating time performance). From the original Bot-IoT dataset, which has 35 variables, the authors removed columns with missing values, as well as columns containing text and columns they considered irrelevant. Their complete dataset had 19 variables, where features such as the timestamps and the Argus sequence number remained. For training and testing, they applied a data split of 80% and 20%, respectively, with no percentage reported for a validation set. For the weighted datasets, the Artificial Neural Network (ANN) was the most outstanding model, with a stable accuracy of 99% for binary classification of DDoS and DoS attack protocols. For the multiclass classification, the authors presented an overall accuracy considering all the attack types contained in the Bot-IoT dataset, where the ANN remained in first place with an accuracy of 97%. The authors stated that they did not train Deep Learning models.
In [23], the authors recognized the need for class balancing in the Bot-IoT dataset. This study showed that the majority classes belong to the attack types, while the normal traffic is part of the minority classes with only 9515 samples (accompanied by information theft, which has 1587 samples), resulting in a ratio of normal to malicious traffic of 1:7687 [23]. An imbalanced dataset may lead to problems such as poor accuracy and/or bias towards the majority class in the results obtained. Specifically, regarding DDoS and DoS attacks, the normal-to-attack traffic ratio for DoS is 1:459 (i.e., 9515 to 33,005,194 flows), and the ratio for DDoS is 1:4038 (i.e., 9515 to 38,532,480 flows) [23]. Thus, the Bot-IoT dataset seems better suited to distinguish between a DoS and a DDoS attack [23], since these categories have a similar number of samples (i.e., about 38 million for DDoS and 33 million for DoS).
In order to deal with imbalanced datasets, resampling techniques can be applied to ameliorate this problem. When oversampling, minority class instances are created, either by duplicating elements or by creating new ones synthetically from a similar distribution. The latter technique can be achieved using the Synthetic Minority Oversampling Technique (SMOTE), where, depending on the amount of oversampling required, neighbors from the k nearest neighbors are randomly chosen, with one sample generated in each one's direction [28]. When undersampling, samples from the majority class are removed, which can cause loss of information. We propose to tackle the data bias problem of the Bot-IoT dataset by selecting random consecutive flows per each DDoS/DoS attack type to preserve the temporal behavior of the attacks whilst not altering the network traffic collected from the realistic testbed configuration used to design the Bot-IoT dataset.
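The SMOTE mechanism just described, generating a synthetic point along the direction of a randomly chosen member of the k nearest neighbors, can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the reference implementation of [28]:

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=None):
    """Generate synthetic minority-class samples by interpolating from a
    random base point towards one of its k nearest neighbours (SMOTE-style
    oversampling sketch)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude each point itself
    nn = np.argsort(d, axis=1)[:, :k]         # k nearest neighbour indices
    base = rng.integers(0, n, size=n_new)     # random base points
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))              # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])
```

Because every synthetic sample is a convex combination of two real minority samples, the generated points always stay within the per-feature range of the original minority data.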
In addition, we carry out comprehensive experimentation specialized in normal flows vs. DDoS and DoS attacks, in binary and multiclass classifications, with three feature sets, to evaluate how different flow processors could be used in a real-world scenario. Dividing our experiments into binary and multiclass classifications allows us to evaluate the detection and identification of network traffic, respectively, leveraging the categories and subcategories present in the Bot-IoT dataset. Likewise, we compare seven Machine Learning and Deep Learning models, using popular metrics such as accuracy and precision. Furthermore, we include time performance to analyze the feasibility of implementing our smart IDS in a real-time environment. Table 1 presents a summary comparing the related work with our approach. We present this comparison across six relevant criteria to describe the position of our work and how it stands out from the current state of the art. As can be seen, our work is one of only two that evaluate time performance and is the only one that also tackles the class balancing problem of the Bot-IoT dataset whilst using different feature sets and evaluating both ML and DL models, all at flow-level detection.

Methodology
In this section, we describe the methodology we followed to create our balanced dataset, as well as the feature standardization process we applied to help convergence in the different classifiers. Likewise, we present a summary of the ML and DL models' parameters used to conduct our experiments with each of the three feature sets, and we define the different performance metrics used to evaluate our results.
The labeled CSV files were downloaded from [22]. A total of 9085 samples were extracted for the normal class. We selected items in the majority class (the attacks) by randomly choosing sections of consecutive flows for each DDoS/DoS attack type, in the same proportion as the normal samples, to keep a balanced ratio. Figure 1 shows that we achieved the same number of flows for each of the classes for the multiclass classification, where UDP, TCP, and HTTP are samples from both DDoS and DoS attacks. See Figure 2 for the distribution for binary classification. In the end, the complete dataset size was 36,340 samples.

In order to design our models, we selected three different feature sets from the original Bot-IoT dataset, which has 35 variables. We followed this approach in order to evaluate how the record timestamps affect the models' predictions and to avoid dependencies produced by the Argus flow data generator [20] (so that more flow processors could be used either in a simulated or in a real network implementation, such as CICFlowMeter [29] or Flowtbag [30]).
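The consecutive-flow undersampling described above can be sketched as follows. `sample_consecutive_flows` is a hypothetical helper written for illustration (the paper does not publish its exact sampling code): for each class it takes one randomly positioned window of consecutive rows, preserving the temporal ordering of the flows.

```python
import numpy as np
import pandas as pd

def sample_consecutive_flows(df, label_col, target_per_class, rng=None):
    """Balance a flow dataset by taking, for each class, a randomly
    positioned window of consecutive rows of the desired size, thus
    preserving the temporal behavior of each attack."""
    rng = np.random.default_rng(rng)
    parts = []
    for _, group in df.groupby(label_col):
        n = min(target_per_class, len(group))
        start = rng.integers(0, len(group) - n + 1)
        parts.append(group.iloc[start:start + n])   # consecutive slice
    return pd.concat(parts).reset_index(drop=True)
```

Unlike uniform random undersampling, this keeps each sampled attack segment contiguous in time, so burst patterns in the traffic are not broken up.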
As seen in Table 2, all the feature sets share the same statistical variables (i.e., rates, mean, maximum, minimum, etc.). The first feature set was selected to evaluate the impact of the timestamps and the Argus sequence number on the classification results. The second feature set removed the timestamps, because we argue that the models could memorize these features, which may lead to poor generalization in a real-time scenario; likewise, we removed the Argus sequence number to avoid dependencies on this parser. Finally, in the third feature set, we kept the Argus sequence number, in agreement with the current state-of-the-art (which uses this feature), to evaluate how it affects the classifications when excluding only the timestamps. The three feature sets ranged between 15 and 18 variables, which were selected after dropping columns with missing values and choosing statistical features that capture the traffic behavior. No further feature removal was applied, in order to capture the greatest amount of information possible. See Table 3 for the description of the variables. It is relevant to note that 8 of the variables in the 10-best feature set identified in [11] were included in the first and third feature sets we propose, and 7 of those 10-best variables were in the second feature set. The correlation matrix for the multiclass classification task is shown in Figure 3, and that for the binary classification in Figure 4; here, the subcategory represents the class to predict. Since we wanted to see the linear relation between our variables (all of which are numerical), we calculated these matrices using Pearson's correlation coefficient, resulting in values between −1 and 1, where positive values indicate a pair of features that increase or decrease together, and negative values indicate that an increase in one variable implies a decrease in the other (and vice versa).
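Since all the variables are numerical, Pearson matrices like those in Figures 3 and 4 can be obtained directly with pandas. The column names below are toy placeholders, not the actual Bot-IoT feature names:

```python
import numpy as np
import pandas as pd

# Toy flow records, only to illustrate how the correlation matrices
# are computed; "bytes" is constructed to be perfectly linear in "pkts".
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pkts": rng.poisson(20, 500).astype(float),  # packets per flow
    "dur":  rng.random(500),                     # flow duration (s)
})
df["bytes"] = df["pkts"] * 64                    # 64 bytes per packet

# Pearson's r for every pair of features: values in [-1, 1],
# with 1.0 on the diagonal (each feature vs. itself)
corr = df.corr(method="pearson")
```

A value near 1 (as for `pkts` vs. `bytes` here) flags a redundant feature pair, which is exactly the kind of dependency the matrices in Figures 3 and 4 are meant to expose.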
To help convergence, the features were standardized by subtracting the mean (centering) and dividing by the standard deviation (scaling), resulting in a set of values whose mean was 0 and whose standard deviation was 1, as in Equation (1):

z = (x − μ) / σ	(1)

where μ is the mean and σ is the standard deviation of each feature. The dataset split for all the Machine Learning and Deep Learning models was 80% for training, 10% for validation (tuning hyperparameters), and 10% for testing. Given our total number of samples, we decided to create separate sets for training, validation, and testing instead of using other alternatives such as k-fold cross-validation.
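An 80/10/10 split with standardization fitted on the training portion only can be sketched as follows (synthetic data; the feature count of 16 mirrors the third feature set, and is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((1000, 16))              # placeholder feature matrix
y = rng.integers(0, 2, 1000)            # placeholder binary labels

# 80% train, then split the remaining 20% evenly into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)

# Fit z = (x - mu) / sigma on the training split only, then apply it
# everywhere, so no validation/test statistics leak into training.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s, X_test_s = scaler.transform(X_val), scaler.transform(X_test)
```

Fitting the scaler on the training set alone is the detail that keeps the evaluation honest: the validation and test sets are transformed with training statistics only.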
All the ML models (i.e., Support Vector Machines, Decision Trees, and Random Forests) were built using Scikit-Learn [26], and the DL models (i.e., RNN, GRU, LSTM, and Multilayer Perceptron [MLP]) using PyTorch [31]. Confusion matrices were generated, and the accuracy, precision, recall, and F1 score metrics, in addition to time performance as proposed in [32], were used for evaluation and model benchmarking. See Equations (2)-(5) for these metrics' definitions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)	(2)
Precision = TP / (TP + FP)	(3)
Recall = TP / (TP + FN)	(4)
F1 score = 2 × (Precision × Recall) / (Precision + Recall)	(5)

where TP, TN, FP, and FN denote the true positives, true negatives, false positives, and false negatives, respectively.

Tables 4-6 show the parameters for the ML models using the three feature sets. With respect to the DL models, all of them share the characteristics presented in Table 7, where the input size varies according to the feature set (i.e., 18 for the first set, 15 for the second, and 16 for the third). The hyperparameters for both the Machine Learning and Deep Learning models were chosen after a process of systematic tuning. In this regard, the best max depth obtained for the Decision Tree, in both binary and multiclass classification, was correspondingly used as the max depth for the Random Forest sub-estimators. Likewise, we report the number of trees that led us to the optimal balance between accuracy and run-time. It should be noted that the Decision Tree implementation Scikit-Learn uses is an optimized version of the CART (Classification and Regression Trees) algorithm [33].
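A minimal Scikit-Learn sketch of this setup, reusing the tuned Decision Tree depth as the max depth of the Random Forest sub-estimators and reporting the four metrics plus the confusion matrix, might look as follows. The depth, tree count, and data are placeholders, not the tuned values from Tables 4-6:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

BEST_DEPTH, N_TREES = 8, 50            # placeholder hyperparameters
rng = np.random.default_rng(0)
X = rng.random((2000, 16))
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # synthetic binary labels
X_tr, y_tr, X_te, y_te = X[:1800], y[:1800], X[1800:], y[1800:]

dt = DecisionTreeClassifier(max_depth=BEST_DEPTH, random_state=0).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=N_TREES, max_depth=BEST_DEPTH,
                            random_state=0).fit(X_tr, y_tr)

for model in (dt, rf):
    pred = model.predict(X_te)
    print(confusion_matrix(y_te, pred))
    print(accuracy_score(y_te, pred), precision_score(y_te, pred),
          recall_score(y_te, pred), f1_score(y_te, pred))
```

Sharing the tuned depth between the single tree and the forest sub-estimators keeps the two models comparable while leaving the forest its usual variance-reduction advantage.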
Next, the results and discussion for the experiments are presented.

Experimental Results and Discussion
The results for multiclass and binary classification (for the first feature set) are presented in Tables 8 and 9, respectively; for the second feature set, we display the results  in Tables 10 and 11; finally, Tables 12 and 13 show the results for the third feature set. From these results, it can be seen that Decision Tree and Random Forest have the best performance for both classification tasks in the three distinct feature sets, outperforming the DL models. On the other hand, SVM is the poorest-performing model (in agreement with previous work in [11] with the 10-best feature set).
Our results show that Machine Learning models such as Random Forest and Decision Tree deliver strong performance, marginally better than that of the Deep Learning models. Since all the results show a similar order of magnitude, we argue that, given the relatively small number of features in all our datasets, Decision Tree methods show a robust performance that does not learn to depend on one particular feature, thus generalizing better. This also shows that traditional ML models are reliable and should not be discarded without proper evaluation, such as the one we carried out here, particularly when using tabular data.
Our sampling methodology allowed us to use standard cost functions without weighting techniques, whilst addressing the balancing problem in the Bot-IoT dataset, in contrast to what is commonly performed in the current state-of-the-art (e.g., in [14]), which may lead to over-tuning. Likewise, our feature sets included more characteristics compared to [13], capturing more information whilst reducing the amount of manual feature engineering, in agreement with the ethos of current Machine Learning practice. With this, we presented comprehensive tests with both Machine Learning and Deep Learning models (compared, for instance, to [17], where only Machine Learning models were presented by the authors).
Software Defined Networks (SDNs) [34] represent one of the best options to implement smart Intrusion Detection Systems due to their relevance for data centers, 5G technology, and the ease of integrating IoT devices into these types of networks. SDNs are capable of achieving higher system flexibility and scalability, separating the data plane and the control plane to provide a dynamic network structure [35]. All the network control functions, such as traffic monitoring, take place in a software-based controller [1], which can be either physically centralized or distributed but logically centralized [36]. This flexibility in global network monitoring and network configuration enables the implementation of detection and mitigation mechanisms against cyberattacks [1].

Given the importance of real-time hardware implementations, we consider it relevant to evaluate the time performance of each model for classifying network traffic. As proposed in [32], we calculate the average number of flows per second our anomaly detection methods can classify. This experimentation was conducted on a MacBook Pro with an Apple M1 chip and 16 GB of RAM for both the multiclass and the binary classification models, with the three feature sets. See Tables 14 and 15 for the first feature set, Tables 16 and 17 for the second feature set, and Tables 18 and 19 for the third feature set. In the real-world scenario tested in [32], around 500 flows/s passed through the network collector on regular days, while in dense traffic situations it reached peaks of up to 1681 flows/s. Then, for the first feature set (Tables 14 and 15), all the models, except for Random Forest in binary classification, are capable of analyzing the number of flows/s required on heavy-traffic days, whilst for the second feature set (Tables 16 and 17), all the models, except for SVM in both classification tasks, achieve the maximum peak.
Finally, the results for the third feature set (Tables 18 and 19) show that all the models, except for SVM and Random Forest in multiclass and binary classification, achieve the maximum amount of flows/s discussed.
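The flows-per-second measurement can be approximated with a simple wall-clock benchmark over a fitted classifier. `flows_per_second` below is an illustrative sketch of the metric proposed in [32], not the authors' exact harness:

```python
import time
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def flows_per_second(model, X, n_repeats=5):
    """Estimate classifier throughput in flows/s, taking the best
    wall-clock time over several runs to reduce scheduling noise."""
    best = float("inf")
    for _ in range(n_repeats):
        t0 = time.perf_counter()
        model.predict(X)                       # classify every flow in X
        best = min(best, time.perf_counter() - t0)
    return len(X) / best

rng = np.random.default_rng(0)
X = rng.random((10_000, 16))                   # 10k synthetic flows
y = rng.integers(0, 2, 10_000)
model = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X, y)
rate = flows_per_second(model, X)
```

Comparing `rate` against the 500 flows/s regular-day load and the 1681 flows/s peak reported in [32] gives a quick feasibility check for a real-time deployment.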
From our results, we can see that Decision Tree is the best anomaly detection method for the proposed IDS, as shown in the results for accuracy, precision, recall, F1 score, and time performance, outperforming all the other models in the three feature sets (see Figure 5). We consider the third feature set the most appropriate one for our novel IDS, since it shows stable results for the Machine Learning and Deep Learning models (similar to results in the state of the art), both in multiclass and binary classification, whilst not using timestamps as learnable features (which can lead to poor performance in a real-time, real-world scenario). In addition, results in the literature use the Argus seq as one of the features they feed into their models, as our third feature set does. See Figures 6 and 7 for the Decision Tree confusion matrices using this feature set.
Using neither timestamps nor the Argus sequence number (as in the second feature set) caused the Deep Learning models to have accuracies around 96% and 97% for both binary and multiclass classification, which is lower than the performance achieved by the standard ML models. Although this result may initially appear surprising, we argue it is due to the fact that the DL models learn to depend heavily on these particular features. Given the recurrent nature of the neural networks we assessed, these features (as provided in the original dataset) may display strong temporal dependencies (with strong correlations to the categories our models are classifying), once again strengthening the network dependence on these features and leading to poor generalization when implemented with online data. Nonetheless, Random Forest and Decision Tree still show the strongest performance when trained on this feature set, achieving results above 99.8%, which can be explained by the random nature of these models that allows them to overcome dependencies on temporal data. It should also be noted that we did not find other studies that use a set of features similar to our second one, so we cannot establish a fair comparison with other works. In addition, the models trained on this feature set are totally independent of temporal characteristics such as timestamps and, particularly, Argus-generated sequence numbers, which makes them strong generalization models suitable for online IDS implementations.

In addition to all these experiments, binary classification for Normal flows vs. each DDoS/DoS protocol was performed. Tables 20-22 show the best anomaly detection models regarding accuracy, precision, recall, and F1 score for each of those combinations. It can be seen that Decision Tree and Random Forest are the strongest models, achieving 100% across all the metrics in several combinations.

Comparison with Previous Works
Unlike previous works, this study addresses the class imbalance problem of the Bot-IoT dataset without adding class weights (which can lead to poor generalization, as seen in [14,17]) and without generating synthetic data. With this, we carried out extensive experimentation on normal flows vs. denial of service attacks, in binary and multiclass classifications. Under these analyses, three different feature sets were selected from the original dataset (larger in size, and solving the problem of missing information, compared to [13]). We discussed why different flow processors could be used in a real-world scenario and the importance of learning different features.
Likewise, our work shows a comprehensive evaluation of seven distinct Machine Learning and Deep Learning models, in contrast to [13,15-17], where only either ML or DL models are assessed by the authors. In addition, we applied a systematic tuning of our models' hyperparameters and dedicated 10% of our data to validation, in contrast to the process followed in [17]. With respect to performance evaluation, we not only presented confusion matrices and popular metrics (i.e., accuracy, precision, recall, and F1 score), but we also added the time performance measurement to show the feasibility of implementing the IDS in production networks, demonstrating that the best resulting models presented here are a realistic solution: this is in contrast to all the related works reviewed in Section 2 that use the Bot-IoT dataset (except for [13]).
Our results match and exceed the current state-of-the-art, with an average accuracy >99% across our three different feature sets, and 100% across several combinations of Normal flows vs. the DDoS/DoS subcategories. These results do not present bias towards a majority class. Compared to the works in our review that deal with class balancing: in [14], the accuracy for multiclass classification of normal flows vs. DDoS and DoS attacks is 99.414%, whilst our best results for the same classification are 99.945% with the first feature set, 99.89% with the second feature set, and 99.917% with the third feature set. In addition, compared to [17], where the stable accuracy was 99% for binary classification of DDoS and DoS attack protocols and 97% for multiclass classification, we obtain accuracies >99.85% for binary classification using our three different feature sets and a best multiclass accuracy of 99.945%.

Conclusions and Future Work
This work uses the Bot-IoT dataset, a state-of-the-art collection of data for protecting IoT networks. The methodology proposed addresses the class imbalance problem of the original dataset (by adding neither synthetic data nor class weights), leading to the creation of a novel IDS based on AI models which focuses on DDoS and DoS attacks. The proposed IDS presents results without bias towards a majority class, achieving an average accuracy >99% with our three distinct feature sets, where the Decision Tree is the outstanding anomaly detection model, whilst being feasible for implementation in real-time production environments, with a remarkable time performance for heavy-traffic days (classifying more than 1681 flows/s). In addition, we achieved 100% across the accuracy, precision, recall, and F1 score metrics with the Decision Tree and the Random Forest for several combinations of Normal flows vs. the DDoS/DoS protocols.
As future work, due to the importance of SDN in data centers and in 5G technology, and the integration of IoT devices into these networks, we are working on the implementation of both a simulated and a real Software Defined Network infrastructure, using the Open Network Operating System (ONOS) as the network controller [4,37], to integrate the IDS developed in this research as a detection service; we will also integrate a mitigation strategy. Installing and running this IDS in the SDN controller will solve the issue pointed out in [38], where anomaly detection in heterogeneous sensor networks (such as the Bot-IoT testbed) is difficult to achieve directly on the sensor nodes, which have light computing power and limited memory.
Another future direction worth mentioning is the interconnection of different IoT environments, considering them as nodes linked by a given relationship and forming a graph, as in a smart city [39]. For this Multiple IoT (MIoT) paradigm [40], it would be interesting to evaluate our proposed IDS, because anomalies are less evident in an MIoT than in a single IoT scenario [39].

Funding: This work was partially supported by FRIDA (Fondo Regional para la Innovación Digital en América Latina y el Caribe) and partially supported by the project "Red temática Ciencia y Tecnología para el Desarrollo (CYTED) 519RT0580" by the Ibero-American Science and Technology Program for Development CYTED.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: A publicly available dataset was analyzed in this study. This data can be found here: https://research.unsw.edu.au/projects/bot-iot-dataset (accessed on 26 January 2021).