Detecting IoT Attacks Using an Ensemble Machine Learning Model

Abstract: Malicious attacks are becoming more prevalent due to the growing use of Internet of Things (IoT) devices in homes, offices, transportation, healthcare, and other locations. By incorporating fog computing into IoT, attacks can be detected in a short amount of time, as the distance between IoT devices and fog devices is smaller than the distance between IoT devices and the cloud. Machine learning is frequently used for the detection of attacks due to the huge amount of data available from IoT devices. However, the problem is that fog devices may not have enough resources, such as processing power and memory, to detect attacks in a timely manner. This paper proposes an approach to offload the machine learning model selection task to the cloud and the real-time prediction task to the fog nodes. Using the proposed method, based on historical data, an ensemble machine learning model is built in the cloud, followed by the real-time detection of attacks on fog nodes. The proposed approach is tested using the NSL-KDD dataset. The results show the effectiveness of the proposed approach in terms of several performance measures, such as execution time, precision, recall, accuracy, and ROC (receiver operating characteristic) curve.


Introduction
Historically, only computers, mobile phones, and tablets were connected to the Internet. The Internet of Things (IoT) today enables many kinds of devices and appliances (e.g., televisions, air conditioners, washing machines) to be connected to the Internet. IoT is being used in several fields today, including healthcare, agriculture, traffic monitoring, energy saving, water supply, unmanned aerial vehicles, and automobiles.
A three-layer IoT architecture is illustrated in Figure 1; from left to right: (1) thing layer, (2) fog layer, and (3) cloud layer. The thing layer includes IoT devices from several domains, including smart-homes, eHealth, smart vehicles, smart drones, and smart-cities. This layer enables data collection while having limited resources such as bandwidth, processing, energy, and memory. Next comes the fog layer, which is closer to the thing layer and may contain some operational resources to manage real-time operations and rapid decision making. Finally, the cloud layer facilitates the collection, processing, and storage of data in various data centers. However, as it is far away from the thing layer, it may take a long time to incorporate decisions in the thing layer.
According to a recent report from the International Data Corporation (IDC) (https://www.idc.com/, accessed on 20 March 2022), the amount of data generated by IoT devices will reach 73 zettabytes by 2025, up from 18 zettabytes in 2019. Such a massive influx of data opens up many potential threats [1]. The problem is that IoT devices and their networks tend to be insecure, since they typically have too little power, memory, or bandwidth to perform basic security functions such as encryption. IBM X-Force (https://securityintelligence.com/posts/internet-of-threats-iot-botnets-network-attacks/, accessed on 20 March 2022) reported in 2020 that attacks on IoT grew five-fold over the previous year. Currently, IoT-enabled networks are at risk of losing privacy and confidentiality due to malware and botnet attacks [2]. For the IoT, several security solutions have been proposed, such as authentication [3], detection, and prevention [4]. Introducing machine learning (ML) algorithms into the IoT may alleviate concerns about security and privacy [5,6]. Today, it is crucial to decide which algorithms to run in which layer (cloud, fog, or thing) for fast decision making. When all ML decisions are made in the cloud, IoT decisions may be delayed. In other layers, such as the thing or fog layer, it may be difficult to apply ML solutions due to their limited resources, such as bandwidth, processing, and energy.
Current research [7–12] indicates that deep learning algorithms are capable of detecting IoT attacks more effectively than traditional machine learning algorithms. However, only the cloud layer may have the resources to run these algorithms. In addition, these algorithms are not always effective in situations such as remote live operations (e.g., remote surgery), where the system must make decisions in real time. Previous work on IoT attacks [9,13] has shown that a machine learning technique such as support vector machine (SVM) can only provide meaningful results if it is combined with a feature extraction/reduction algorithm or optimization algorithm. This combination of algorithms fails to meet the low resource requirement. ML techniques such as decision trees, naïve Bayes, K-nearest neighbors (KNN), and others are extremely robust for applications such as offline or non-interactive predictions on small datasets. These models, however, are considered weak when applied to real-time predictions. State-of-the-art studies [14–16] report that the detection rate is quite low when using these classifiers to detect IoT attacks.
The paper proposes an ensemble model for an IoT system with limited bandwidth, processing power, energy, and memory (e.g., in the fog layer) to detect IoT attacks. Denial of service (DoS), authentication attacks, and probe attacks are taken into account. Moreover, no additional feature extraction or dimensionality reduction algorithm is used to increase detection rates. This model is best suited to the real-time, quick detection of IoT attacks. In the proposed approach, there are two important steps: (1) selecting the best ensemble model that has a short execution time and high performance (e.g., accuracy), and (2) running the best model to achieve a short delay when applying the decision. The first step is performed in the cloud, as selecting the best ensemble model requires more resources, and the second step is performed in the fog layer, which has a low delay for real-time applications.
In this paper, extensive data analysis experiments are performed on the NSL-KDD dataset (https://www.unb.ca/cic/datasets/nsl.html, accessed on 20 March 2022). The dataset represents IoT attacks on a network in real time, and it is an upgraded version of the original KDD-99 dataset. The results show a high level of accuracy in a minimum amount of time with the fewest possible resources needed. The paper is organized as follows: Section 2 presents the related work and the background, Section 3 presents our proposed approach, Section 4 presents simulation scenarios, Section 5 provides the results and, finally, Section 6 concludes the paper.

IoT-Specific Attacks Overview
From IoT devices, data can be collected which can then be processed and monitored, depending on an application (e.g., e-healthcare or industrial) located in a cloud or fog layer. There are several attacks related to the IoT in the literature. Denial of service (DoS) attacks, authentication attacks, and probe attacks are presented below:
1. A denial of service (DoS) attack poses the greatest threat to IoT devices and servers with open ports [17,18]. There are several types of DoS attacks such as Smurf, Neptune, and Teardrop.
2. An authentication attack is an attack against privileged access. A remote to the user (R2U) attack (such as HTTPtunnel and FTP_write) occurs when an intruder sends malformed packets to a computer or server to which he/she does not have access. User-to-root (U2R) attacks (such as Rootkit) occur when a malicious intruder attempts to gain access to a network resource by posing as a normal user and then accessing it using full permission.
3. In a probe attack, an intruder runs a scan of a network device to determine potential vulnerabilities in the design of its topology or port settings and then exploits those in the future to gain illegal access to confidential information. There are several types of probe attacks, such as IPsweep, Nmap, and Portsweep.

ML-Specific Related Work on Security and Privacy
A comparison of related work on ML-specific attack detection can be seen in Table 1, including the ML (machine learning)/DL (deep learning) used, the pre-processing features, and performance analysis performed. During the pre-processing step, encoding (E), scaling (S), normalization (N), and dimensionality reductions (D) are taken into account. Furthermore, as part of the performance analysis, accuracy, receiver operating characteristic (ROC) curve, F-score, Matthews correlation coefficient (MCC), and detection rate (DR) are considered.
In [13,19,20], decision trees and rule induction are used to explain under what conditions a specific type of attack (DoS, authentication attacks, and probe attacks) occurs on a network. In this approach, encoding is used as a pre-processing technique, while accuracy is used to evaluate the effectiveness of the method. Although this is a valuable state-of-the-art approach, it cannot guarantee that the rules derived from decision trees will be applicable to large sets of data, because overfitting poses the greatest risk. Further, in [21], principal component analysis (PCA) is utilized with a decision tree to detect anomalies and investigate their causes.
The previous works of [7,8,13,22,23] show that attacks can be predicted with high accuracy by using deep learning neural networks, either as a standalone technology [7,8] or in combination with optimization [22,23] or machine learning algorithms [9,13]. More precisely, [9,13] combine artificial neural networks (ANNs) with support vector machines (SVMs), which provide significantly higher detection rates than standalone deep learning or machine learning algorithms. In particular, [13] extends this hybridization by combining the fusion of SVM and ANN with a genetic algorithm (GA) and particle swarm optimization (PSO). This hybridization achieves a 99.3% accuracy rate.
Dimensionality reduction is also explored in a wide variety of works. The studies of [10–12] used principal component analysis (PCA) with ANN and reported F1-scores of 91 percent. The researchers in [28] also explored dimensionality reduction with a one-hot encoder combined with outlier analysis, which improved performance by 2.96 percent and 4.12 percent over CNN and RNN, respectively. This approach to dimensionality reduction with machine learning yields a mix of high and average results. In addition, it is still unclear how many dimensionality reduction algorithms should be combined within a single model to provide an optimal outcome. A combination of latent Dirichlet allocation (LDA) and a genetic algorithm is used in [24], which provides a below-average accuracy rate of 88.5 percent and a false positive rate of 6 percent.
The results are improved even more by techniques such as logistic regression and autoencoders. The study of [25] uses an autoencoder with LSTM and carries out experiments on a number of autoencoders, reaching an AUC score of 96 percent. Multinomial logistic regression provided a 99 percent ROC for finding anomalies in [26]. The idea of ensemble learning has also been explored by several authors. One of the appealing results, with an AUC of 99.6, is obtained by using XGBoost in [27].
The literature review covered almost all taxonomies of machine learning, from decision trees to neural networks, and from regression (logistic) techniques to ensemble learning. Following an extensive assessment, it was determined that a deep neural network with some optimization algorithm or ensemble learning could provide an impressive detection rate and the least false alarm rate of attacks. Additionally, feature engineering is also required to improve this model.

Voting and Stacking Techniques
The voting process, as its name suggests, combines the results of a number of weak classifiers by choosing the class label predicted by the greatest number of classifiers as the final one. The advantage of this method is that errors made by individual classifiers can be outvoted. As an example, to solve a classification problem through voting, a range of weak classifiers is selected, including K-nearest neighbor (KNN), decision tree, and naïve Bayes classifiers. Suppose that the K-nearest neighbor and decision tree classifiers yield the same class label, which differs from the label produced by naïve Bayes. The final output is then the label with the maximum number of common votes, i.e., the one shared by the K-nearest neighbor classifier and the decision tree.
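The majority-vote rule described above can be illustrated with a minimal Python sketch. The classifier outputs below are hypothetical values chosen for illustration, and the tie-breaking rule (first label encountered) is an assumption of this sketch rather than part of the approach described in the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of base classifiers.

    `predictions` maps a classifier name to its predicted label. If two
    labels receive the same number of votes, the one seen first wins; this
    tie-breaking rule is an arbitrary choice made for this sketch.
    """
    counts = Counter(predictions.values())
    return counts.most_common(1)[0][0]


# Hypothetical outputs of three base classifiers for one network connection.
votes = {"KNN": "attack", "DT": "attack", "NB": "normal"}
final_label = majority_vote(votes)
print(final_label)
```

Here KNN and the decision tree agree, so their shared label overrides the differing naïve Bayes output.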
Stacking is a method of ensemble learning that takes into account heterogeneous weak classifiers, which means that different machine learning algorithms are combined. In addition, stacking introduces a meta-layer, in which a meta-layer model combines the classifier results from the base layers. For instance, to solve a classification problem through stacking, a range of weak classifiers, such as K-nearest neighbor classifiers, decision trees, and naïve Bayes classifiers, are selected at the base layers, and their results are combined through a neural network classifier as a meta-layer model. In the meta-layer, the neural network takes the outputs of these three weak classifiers as its inputs and produces a final prediction.
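To make the role of the meta-layer concrete, the following is a toy sketch in Python. It is not the setup used in the paper: the three base classifiers are hypothetical hand-written rules, the feature names (`duration`, `failed_logins`, `error_rate`) are invented for illustration, and the meta-layer is a simple lookup table standing in for the neural network mentioned above. A realistic implementation would also train the meta-layer on held-out base-layer predictions.

```python
from collections import Counter, defaultdict


# Three hypothetical base classifiers; each maps a record to a class label.
def rule_duration(record):
    return "attack" if record["duration"] == 0 else "normal"


def rule_failed_logins(record):
    return "attack" if record["failed_logins"] > 2 else "normal"


def rule_error_rate(record):
    return "attack" if record["error_rate"] > 0.5 else "normal"


BASE_CLASSIFIERS = [rule_duration, rule_failed_logins, rule_error_rate]


def base_outputs(record):
    """Collect the base-layer predictions for one record."""
    return tuple(clf(record) for clf in BASE_CLASSIFIERS)


def fit_meta_layer(records, labels):
    """Learn the most frequent true label for each pattern of base outputs."""
    table = defaultdict(Counter)
    for record, label in zip(records, labels):
        table[base_outputs(record)][label] += 1
    return {pattern: c.most_common(1)[0][0] for pattern, c in table.items()}


def stacked_predict(meta_table, record, default="normal"):
    """Predict with the meta-layer; unseen patterns fall back to `default`."""
    return meta_table.get(base_outputs(record), default)


# Made-up training data, for illustration only.
train_records = [
    {"duration": 0, "failed_logins": 0, "error_rate": 0.9},
    {"duration": 0, "failed_logins": 0, "error_rate": 0.1},
    {"duration": 5, "failed_logins": 4, "error_rate": 0.0},
    {"duration": 7, "failed_logins": 0, "error_rate": 0.2},
]
train_labels = ["attack", "normal", "attack", "normal"]

meta_table = fit_meta_layer(train_records, train_labels)
new_record = {"duration": 3, "failed_logins": 5, "error_rate": 0.1}
print(stacked_predict(meta_table, new_record))
```

In this toy example only one base rule flags the new record, so a majority vote would label it normal, whereas the meta-layer has learned from the training data that this particular pattern of base outputs corresponds to an attack.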

Ensemble Machine Learning-Based Attack Detection
The authors of [29] demonstrate how ensemble machine learning, neural networks, and kernel methods can be used to detect abnormal behavior in an IoT intrusion detection system. In this study, ensemble methods outperform kernel and neural networks in terms of accuracy and error detection rates.
To detect webshell-based attacks, ensemble machine learning is used in [30]. In webshell attacks, a malicious script installed on a web server for remote administration executes malicious code written in popular web programming languages. Ensemble techniques, including random forest and extremely randomized trees, are applied in this work, and voting is used in order to improve their performance. The study concluded that random forests and extremely randomized trees are best for IoT scenarios involving moderate resources (CPU, memory, etc.). Nevertheless, voting proved to be most effective in scenarios requiring heavy resources. In [31], cyberattacks are detected using ensemble methods for IoT-based smart cities. Ensemble methods were found to be more accurate than other machine learning algorithms, including linear regression, support vector machines, decision trees, and random forests.
Further, anomalies are detected using ensemble methods applied to software-defined networking (SDN) in IoT in [32]. In SDN, IoT networks can be controlled from a central server called a controller [33,34]. Further, in [35], DDoS attacks are detected by using an ensemble method that uses traffic flow metrics to classify attacks. The applied approach yields fewer false alarms and a high degree of accuracy. Moreover, in [36], cyberattacks are detected using ensemble machine learning in a cloud-fog architecture for the Internet of Medical Things (IoMT). In this work, decision trees, naïve Bayes, and random forest machine learning techniques are used as base classifiers, and XGBoost is used at the next level. This method achieved a high detection rate of 99.98% on the NSL-KDD dataset.
The detection of anomalies in the smart home is carried out by ensemble machine learning rather than binary classification in [37]. Ensemble machine learning was able to detect anomalies in categorical datasets with minimal false positives. In [38], adaptive learning is used to boost the intelligence of ensemble machine learning for the Internet of Industrial Networks. This approach proved effective under ROC curve calculations.

IoT System with Cloud and Fog
Figure 1 illustrates the benefits of using the cloud for data processing because it may have the resources necessary to perform complex computations. The cloud, however, has several inherent weaknesses, including high costs, long latency, and limited bandwidth [39]. Further, due to proximity to IoT devices, fog is well suited for solving a variety of issues including long latency, communication, control, and computation [40]. With fog computing, time-sensitive data can be stored and analyzed locally [41]. Furthermore, by reducing the amount and distance of data sent to the cloud, IoT applications can be made more secure and private [42,43].
Researchers have employed a number of approaches and techniques to overcome data transfer challenges in fog, including encryption-based data transfer, as described in [44,45]. Furthermore, several researchers have proposed methods to improve security in fog, including game-based security [46]. However, these works do not have the advantage of functioning in real time. Currently, researchers are developing methods for predicting real-time scenarios and minimizing the overall time by balancing cloud computing with fog computing and optimizing the trade-off between the two (e.g., [47]). Likewise, our paper uses this approach to move resource-intensive and time-consuming tasks to the cloud and real-time tasks to the fog layer.

Proposed Approach
Our objective is to use ensemble machine learning techniques for detecting attacks in an IoT system, because deep neural networks require substantial resources, such as memory. The goal is to come up with the best ensemble method and to apply it for real-time attack detection. Figure 2 outlines the proposed approach with three layers: thing, fog, and cloud. It involves the following three steps (also shown in Figure 2): (1) data collection at the cloud layer, (2) running the ensemble algorithms on the cloud and selecting the best model, and (3) running the selected model in the fog layer. These steps are described below.

Data Collection at the Cloud Layer
This step involves collecting data from the thing layer and passing it to the cloud layer.
To accomplish this, data from the thing layer can first be transported to the fog layer. The fog layer can then transport it to the cloud layer. While transporting the data, the fog layer can also filter it to decide which data are sent to the cloud. IoT attacks can be predicted using the following attributes: (1) login details, (2) the fields of network data packets, such as fragment details, protocol type, source and destination address, (3) service type, (4) flags, and (5) duration. We provide detailed information about the data used in our simulation in the next section.

Selecting the Best Model on the Cloud
The objective of this step is to combine various basic machine learning classifiers (such as naïve Bayes, KNN, and decision trees) with ensemble techniques (such as stacking, bagging, and voting) to obtain optimal results (accuracy, precision, execution time). As this is a time-consuming step, we recommend running it in the cloud. In addition, we apply only basic machine learning classifiers, as they require a short execution time.
The proposed cloud-fog/edge architecture is derived from the analysis of an NGIAtlantic EU project [48], in which cross-Atlantic experimental validation is proposed for intelligent SDN-controlled IoT networks. In this project, IoT devices transmit data to an IoT application in the cloud over the Internet via a gateway (located at edge/fog devices) whose security and latency are enhanced by running secure network functions. Our approach is a practical real-time solution for such a scenario since, in production IoT networks, fog/edge nodes do not have enough resources to run heavyweight algorithms. Therefore, if only the trained model is run in the fog layer (step 3, above), the fog node's resource requirements will be lowered, which is practical. Furthermore, since the cloud layer has plenty of resources, it makes sense to train the model there, as described in steps 1 and 2.
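The model selection step on the cloud can be sketched as follows. This is a minimal illustration under stated assumptions, not the selection procedure used in the experiments: the `CandidateResult` type, the accuracy threshold of 0.99, and all numbers are hypothetical. The rule shown here keeps the candidates that meet a performance threshold and picks the one with the shortest execution time.

```python
from dataclasses import dataclass


@dataclass
class CandidateResult:
    """Evaluation summary of one ensemble model (hypothetical structure)."""
    name: str
    accuracy: float      # fraction in [0, 1]
    exec_time_s: float   # execution time in seconds


def select_best_model(results, min_accuracy=0.99):
    """Pick the fastest model among those meeting the accuracy threshold.

    If no model meets the threshold, fall back to the most accurate one.
    Both the threshold and the fallback are assumptions of this sketch.
    """
    eligible = [r for r in results if r.accuracy >= min_accuracy]
    if not eligible:
        return max(results, key=lambda r: r.accuracy)
    return min(eligible, key=lambda r: r.exec_time_s)


# Illustrative values only; these are not results from the paper.
candidates = [
    CandidateResult("model_a", accuracy=0.995, exec_time_s=30.0),
    CandidateResult("model_b", accuracy=0.992, exec_time_s=10.0),
    CandidateResult("model_c", accuracy=0.950, exec_time_s=2.0),
]
best = select_best_model(candidates)
print(best.name)
```

Only the selected model would then be shipped to the fog node, which keeps the expensive comparison of candidates on the cloud side.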

Simulation Environment
This section presents the simulation environment in terms of server configuration, dataset description, cloud and fog data separation, and simulated base classifier and ensemble methods.

Server Configuration
The proposed framework with fog and cloud nodes is tested on a server with a Core E7400 CPU running at 2.80 GHz, 3.00 GB of RAM, and a 32-bit operating system. The proposed ensemble algorithm is implemented on the cloud node and the best model is run on the fog node. The Weka platform is used to run the experimentation at the cloud layer and the real-time detection of IoT attacks at the fog layer.

Dataset Description
The NSL-KDD dataset (https://www.unb.ca/cic/datasets/nsl.html, accessed on 20 March 2022) is used for the simulation of this work. It contains 41 features to describe each specific entity in an IoT network. Details on network intrusions with these 41 features can be segmented into computational information (service, flag, land, etc.), content-based information (login information, root shell information, etc.), duration-based (such as duration from host to destination transfer, error rates), and host-based information (host and destination ports and counts information).
In Figure 4, the NSL-KDD dataset is represented by two layers: (1) the inner layer represents different types of IoT attacks in the dataset, such as Probe, DoS, U2R, and R2L; (2) the outer layer represents examples of attacks within each category. Attacks such as Saint, Satan, Nmap, and portsweep, which can be found in Figure 4, come under the Probe IoT attack category. In these attacks, the attacker scans a network device to determine potential weaknesses in its design, which are subsequently exploited in order to gain access to confidential information, as described in Section 2. Likewise, attacks such as Neptune, Teardrop, Worm, and Smurf fall into the category of DoS attacks. These attacks cause a denial of service when an attacker consumes resources unnecessarily, making the service unavailable for legitimate users. Moreover, Sendmail, Multihop, and phf belong to R2L (remote-to-local) attacks, while Perl, text, and sqlattack belong to U2R (user-to-root) attacks. In Figure 4, variables are underlined according to their segment. Most variables in this dataset are nominal. There are three basic protocol types in the dataset: TCP (transmission control protocol), UDP (user datagram protocol), and ICMP (Internet control message protocol).
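The two-layer grouping described above amounts to a lookup from individual attack names to their categories. A minimal Python sketch is given below; it lists only attack names mentioned in this paper, is not a complete inventory of the NSL-KDD labels, and the `categorize` helper and its lower-casing convention are assumptions made for illustration.

```python
# Attack names mentioned in this paper, grouped by category (not exhaustive).
ATTACK_CATEGORY = {
    "saint": "Probe", "satan": "Probe", "nmap": "Probe",
    "portsweep": "Probe", "ipsweep": "Probe",
    "neptune": "DoS", "teardrop": "DoS", "worm": "DoS", "smurf": "DoS",
    "sendmail": "R2L", "multihop": "R2L", "phf": "R2L",
    "httptunnel": "R2L", "ftp_write": "R2L",
    "perl": "U2R", "sqlattack": "U2R", "rootkit": "U2R",
}


def categorize(label):
    """Map a raw connection label to its attack category.

    Labels are compared case-insensitively. "normal" is passed through, and
    any name missing from the table is reported as "unknown".
    """
    key = label.strip().lower()
    if key == "normal":
        return "normal"
    return ATTACK_CATEGORY.get(key, "unknown")


for raw in ["Smurf", "Nmap", "phf", "Rootkit", "normal"]:
    print(raw, "->", categorize(raw))
```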

Data Separation for the Cloud and Fog Layers
Our proposed scheme uses the cloud layer to keep track of historical data about network connections associated with IoT attacks, while the fog layer analyzes real-time data. Furthermore, the cloud layer data include the target variable and its associated labels, whereas in the fog layer this variable must be predicted for new entries. Training and testing data segments are provided in the NSL-KDD dataset source. For experimentation, training data is used as cloud data, and testing data as fog data. That is, a significant subset of the NSL-KDD dataset is used in the cloud layer for training and validation, while the rest of the data is treated as unlabeled and used for real-time processing in the fog layer for testing. Moreover, K-fold cross-validation is used with an 80:20 ratio at the cloud layer.
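The data separation at the cloud layer can be sketched in plain Python as shown below. The experiments themselves were run in Weka, so this is only an illustration; reading the 80:20 ratio as 5-fold cross-validation (each fold holds out roughly 20% of the records) is an assumption of this sketch, as are the function names and the fixed random seed.

```python
import random


def holdout_split(records, train_fraction=0.8, seed=0):
    """Shuffle the records and split them into training and validation parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n_records, k=5):
    """Yield (train_indices, validation_indices) for each of the k folds.

    With k=5, every fold validates on about 20% of the records and trains
    on the remaining 80%.
    """
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_records))
        yield train, validation
        start += size


cloud_records = list(range(100))  # stand-in for labeled cloud-layer records
train_part, validation_part = holdout_split(cloud_records)
folds = list(k_fold_indices(len(cloud_records), k=5))
print(len(train_part), len(validation_part), len(folds))
```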

Simulated Base Classifiers and Ensemble Methods
Simulating the proposed approach included the use of five machine learning classifiers and two ensemble methods. The classifiers used are: (1) decision tree (DT), (2) random forest (RF), (3) K-nearest neighbors (KNN), (4) logistic regression (LR), and (5) naïve Bayes (NB), while the ensemble techniques are voting and stacking. Table 2 shows the details of each combination of base classifiers in the base layer. A total of 10 different model combinations are tested: we selected five base classifiers and created combinations of two, which gives C(5, 2) = 10 models.
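The count of 10 models can be checked by enumerating the combinations directly, as in the short Python sketch below. Note that the models discussed in the results (for example, model 8 with KNN, NB, and DT) each contain three base classifiers; choosing two of five classifiers to leave out is equivalent to choosing three to keep, so both readings give 10 combinations. The enumeration order here is arbitrary and does not reproduce the model numbering of Table 2.

```python
from itertools import combinations
from math import comb

BASE_CLASSIFIERS = ["DT", "RF", "KNN", "LR", "NB"]

# Choosing 2 of 5 and choosing 3 of 5 both give 10 combinations.
pairs = list(combinations(BASE_CLASSIFIERS, 2))
triples = list(combinations(BASE_CLASSIFIERS, 3))

print(len(pairs), len(triples), comb(5, 2))
for members in triples:
    print(" + ".join(members))
```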

Results and Analysis
Here, we evaluate the results of the proposed approach for the cloud and fog layers using three factors: (1) execution time, (2) performance measures, and (3) error associated with the final model. On the cloud layer, a larger amount of data (training) is used to build models and conduct experiments. Testing data is considered new data and is tested on the fog layer. In the cloud layer, the best model is selected, and in the fog layer, it is evaluated using real-time data. Our first objective is to summarize the results for the cloud layer and the method by which model 8 (described in Table 2), with an ensemble method, was selected to be applied to the fog layer. Following that, we show the results obtained from the real-time data in the fog layer.
The X-axis in Figure 5 refers to the duration in seconds to execute a model, while the Y-axis refers to the model number. Compared to the voting ensemble method, stacking takes a much longer execution time. According to our results, model number 8, with the voting technique, shows the minimal execution time (9.96 s), with KNN, NB, and DT used as base classifiers.
Figure 6 shows overall performance as measured by kappa, F-measure, and the ROC area. It shows that all the models have values greater than 0.99, with model 8 providing the kappa value 0.991, the F-measure value 0.995, and the ROC area 0.999.
Figure 7 shows the errors with voting as an ensemble method in terms of mean absolute error, root mean square error, relative absolute error, and root-relative squared error. Model 1, with voting, exhibits significantly fewer errors than any other model. In this model, DT, RF, and KNN are used as base classifiers, and voting is used as an ensemble technique. In spite of this, we selected model 8 with voting to run in the fog layer, as it performed well in terms of execution time and other performance parameters, as shown in Figure 6.
Based on Figure 7, for model 8 with voting, the root-relative squared error is the largest error, at 27.94 percent, and the mean absolute error is the smallest, at 0.6 percent. To verify further, we calculate the performance of model 8 in terms of precision, F-measure, MCC, and PRC area (Figure 8), in addition to all other metrics. The values on the Y-axis are given to three decimal places. The most significant performance metric is MCC, which indicates how random or real the prediction is. It ranges from −1 to 1. Model 8's values in the experiment are typically close to 99.99 percent. In general, model 8 with voting is well suited to run on the fog layer, given the requirements of real-time execution and excellent performance.
We found that model number 8, using K-nearest neighbor, naïve Bayes, and decision trees as the base classifiers, outperforms all other models with respect to execution time and performance metrics (such as kappa, F-measure, ROC, and MCC). Since time is an important factor in the selection of any model, it is notable that, on the fog node, model 8 with the voting ensemble technique takes the least time: 1.15 s. Additionally, kappa, F-measure, ROC, and MCC have maximum values of 6.39, 98.20, 99.60, and 96.40, respectively. There is also a mean absolute error of 7.78 percent, a root mean square error of 17.64 percent, a relative absolute error of 15.87 percent, and a root-relative squared error of 35.63 percent. In fact, model 8 is the most time-efficient and least resource-intensive model, which is why it has the greatest impact.
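For reference, the kappa, F-measure, and MCC values discussed above can all be derived from a confusion matrix. The Python sketch below uses the standard textbook definitions for the binary case (attack versus normal); the multi-class, weighted averages that Weka reports are not reproduced, the confusion-matrix counts are hypothetical, and degenerate cases with zero denominators are not handled.

```python
from math import sqrt


def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F-measure, kappa, and MCC from counts.

    tp/fp/fn/tn are the true-positive, false-positive, false-negative, and
    true-negative counts. Zero denominators are not handled in this sketch.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / total
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (observed - expected) / (1 - expected)

    # Matthews correlation coefficient, ranging from -1 to 1.
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

An MCC or kappa near 0 indicates predictions no better than chance, while values near 1 indicate near-perfect agreement with the true labels.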

Fog Layer Result Analysis
We now measure the performance of model 8, which has KNN, NB, and DT as the base classifiers and voting as the ensemble method, on the new data.

Performance Measures
Performance measures such as kappa, F-measure, and ROC indicate how well the model performs in the fog layer. Figure 10 presents the mean absolute error (MAE), root mean square error (RMSE), relative absolute error (RAE), and root-relative squared error (RRSE). Our experiment yielded MAE, RMSE, RAE, and RRSE values of 7.78, 17.64, 15.87, and 35.63 percent, respectively.
Figure 10. Associated errors on the fog node (using a model with KNN, NB, and DT as the base classifiers as well as voting as an ensemble method). Here, MAE stands for mean absolute error, RMSE stands for root mean square error, RAE stands for relative absolute error, and RRSE stands for root-relative squared error.
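The four error measures can be written out as in the following Python sketch. It uses the common textbook definitions, in which RAE and RRSE compare the model's errors with those of a baseline that always predicts the mean of the actual values. How Weka applies these measures to class probability estimates in a classification task is not reproduced here, and the input values are made up for illustration.

```python
from math import sqrt


def error_measures(actual, predicted):
    """Return MAE, RMSE, RAE, and RRSE for paired numeric values.

    RAE and RRSE are relative to a baseline that always predicts the mean
    of `actual`; they are undefined if all actual values are identical.
    """
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    abs_base = [abs(a - mean_actual) for a in actual]
    sq_base = [(a - mean_actual) ** 2 for a in actual]
    return {
        "MAE": sum(abs_err) / n,
        "RMSE": sqrt(sum(sq_err) / n),
        "RAE": sum(abs_err) / sum(abs_base),
        "RRSE": sqrt(sum(sq_err) / sum(sq_base)),
    }


# Made-up values: 1 denotes "attack", 0 denotes "normal".
actual = [0, 1, 0, 1]
predicted = [0, 1, 1, 1]
errors = error_measures(actual, predicted)
for name, value in errors.items():
    print(f"{name}: {100 * value:.2f} percent")
```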

Execution Time and CPU Usage
Along with the previously discussed performance metrics, we also calculated the execution time on the fog node of the chosen model, as well as of all other models (not selected at the cloud layer), using voting as an ensemble method. This execution time is shown in Figure 11. The purpose is to determine whether we selected the correct model in terms of execution time. The fog node execution time of model 8 with voting was the fastest of all models. Additionally, we calculated the CPU consumption within the fog layer, which is less than 10% of the CPU. Therefore, our method does not require additional resources from fog nodes. Moreover, our approach has a low execution time. This shows that our approach is highly cost-effective.
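A simple way to take this kind of measurement is sketched below in Python. The paper does not state how execution time and CPU consumption were recorded, so this is only one possible approach: `dummy_predict` is a hypothetical stand-in for the trained model, and `time.process_time` captures the CPU time of the current process only, not system-wide CPU usage.

```python
import time


def timed_batch_predict(predict, records):
    """Run `predict` on every record and report wall-clock and CPU time."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    labels = [predict(record) for record in records]
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return labels, wall_seconds, cpu_seconds


def dummy_predict(record):
    """Hypothetical stand-in for the model deployed on the fog node."""
    return "attack" if record["failed_logins"] > 2 else "normal"


records = [{"failed_logins": n % 5} for n in range(1000)]
labels, wall_seconds, cpu_seconds = timed_batch_predict(dummy_predict, records)
print(f"{len(labels)} predictions, {wall_seconds:.4f} s wall, {cpu_seconds:.4f} s CPU")
```

Running the same function for each candidate model on the fog node would give comparable execution times of the kind plotted in Figure 11.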

Conclusions
This study proposes an approach to offload the ensemble machine learning model selection task to the cloud and the real-time prediction task to fog nodes. Using this technique, the cloud can handle more resource-intensive tasks and the fog nodes can handle real-time computations, which simplifies real-time attack detection and reduces the time it takes. The proposed approach has been tested on the NSL-KDD dataset. Using a range of performance indicators, such as kappa, F-measure, ROC, and MCC, our results showed that the model selected in the cloud layer performed well in the fog layer. Moreover, the selected model in the fog node took a minimum of 1.15 s in the experiments. The research also shows that the ensemble method with voting takes less time to execute than stacking.
Our study used the NSL-KDD dataset. Our future plans are to collect data from real testbed emulation. Currently, there are several testbeds available in the EU and the US [49,50], such as Fed4Fire (https://www.fed4fire.eu/, accessed on 20 March 2022), COSMOS (https://cosmos-lab.org/, accessed on 20 March 2022) (Cloud-Enhanced Open Software-Defined Mobile Wireless Testbed for City-Scale Deployment), and POWDER (https://powderwireless.net/, accessed on 20 March 2022) (Platform for Open Wireless Data-Driven Experimental Research). We will create an edge/fog computing use case on these testbeds and run our proposed approach in an IoT scenario presented in an NGIAtlantic project [48].