Towards Near-Real-Time Intrusion Detection for IoT Devices using Supervised Learning and Apache Spark

Abstract: In the field of Internet of Things (IoT) infrastructures, attack and anomaly detection are rising concerns. With the increased use of IoT infrastructure in every domain, threats and attacks on these infrastructures are also growing proportionally. In this paper, the performance of several machine learning algorithms in identifying cyber-attacks (namely SYN-DOS attacks) on IoT systems is compared, both in terms of application performance and in terms of training and application times. We use supervised machine learning algorithms included in the MLlib library of Apache Spark, a fast and general engine for big data processing. We show the implementation details and the performance of those algorithms on public datasets using a training set of up to 2 million instances. We adopt a Cloud environment, emphasizing the importance of scalability and elasticity of use. Results show that all the Spark algorithms used achieve a very good identification accuracy (> 99%). One of them, Random Forest, achieves an accuracy of 1. We also report a very short training time (23.22 sec for Decision Tree with 2 million rows). The experiments also show a very low application time (0.13 sec for more than 600,000 instances for Random Forest) using Apache Spark in the Cloud. Furthermore, the explicit model generated by Random Forest is very easy to implement using high- or low-level programming languages. In light of the results obtained, both in terms of computation times and identification performance, a hybrid approach for the detection of SYN-DOS cyber-attacks on IoT devices is proposed: the application of an explicit Random Forest model, implemented directly on the IoT device, along with a second-level analysis (training) performed in the Cloud.


Introduction
In recent years, a significant spread of Internet of Things (IoT) devices has been noted. Gartner estimates that the IoT will reach 26 billion units by 2020 [1,2] and a study by Statista reveals that this number will become 75.44 billion worldwide by 2025 [3].
The use of these devices is growing in several applications, such as mobile health, Internet of Vehicles, smart home, industrial control, and environmental monitoring, extending the scope of mobile communications from interpersonal communications to smart interconnection between things and people, and also between things and things [4]. These devices are increasingly part of the everyday life of billions of people: consider smart TVs, smartwatches, and IP cameras. Through sensors and actuators they interact with people and can open doors, monitor houses, and record heartbeats. However, these devices are almost always connected to the Internet, and are therefore exposed to cyber-attacks.
Thus, on the one hand IoT devices improve the productivity of companies and enhance the quality of people's lives, but on the other hand the IoT will increase the potential attack surfaces for
- to evaluate the performance of several machine learning algorithms in identifying SYN-DOS attacks on IoT systems in a Cloud environment, both in terms of application performance and in terms of training/application times. Namely, we use several general-purpose machine learning algorithms included in the MLlib library of Apache Spark [22], one of the most interesting and widely used technologies in the big data field, available with an open source license and present in the cloud computing facilities of the main world players [23].
- by using the previous results, to propose a strategy for the sustainable implementation of machine learning algorithms for the detection of SYN-DOS cyber-attacks on IoT devices. Our purpose is to create a hybrid architecture in which machine learning models for protecting against DDoS attacks are trained in the cloud and the resulting models are applied directly on the IoT devices.
While there are several applications of machine learning algorithms against cyber-attacks in a Cloud environment [24][25][26][27][28][29], or in a local one, such as Kitsune [30], there seems to be no specific integrated application for IoT.
The remainder of this paper is organized as follows. In Section 2, after a brief introduction to the SYN-DOS attack, we introduce the datasets used, the Apache Spark framework and the Spark MLlib library for machine learning. In Section 3, we describe the selected cloud environment, the datasets used, the measured parameters and the experimental results. In the last section, we summarize the results and discuss the work. Some details are reported in Appendix A.

Brief Description of a SYN-DOS Attack
The SYN-DOS (or TCP SYN-DOS or SYN flood) attack is a type of Distributed Denial of Service (DDoS) attack that exploits the normal three-way handshake of the Transmission Control Protocol (TCP), and can be used to make server processes incapable of answering a legitimate client application's requests for new TCP connections. Any service that binds to and listens on a TCP socket is potentially vulnerable to TCP SYN flooding attacks [10].
According to RFC 793, the normal mechanism of TCP three-way handshake exchanges the following sequence of packets (see Figure 1):

1. Client requests connection by sending SYN (synchronize) message to the server.
2. Server acknowledges by sending SYN-ACK (synchronize-acknowledge) message back to the client.
3. Client responds with an ACK (acknowledge) message, and the connection is established.
Electronics 2020, 9, x FOR PEER REVIEW

In a typical SYN flood attack, a series of SYN packets is sent to the targeted server. The server is unaware of the attack, so it receives multiple, apparently legitimate requests to establish communication, and responds to each attempt with a SYN-ACK packet.
The malicious client either does not send the expected ACK or, if the IP address is spoofed, never receives the SYN-ACK. In both cases, the server under attack waits for an acknowledgement for some time (timeout). During this time the connection remains open. Before the connection times out, another SYN packet arrives. This behavior creates a very large number of half-open connections. Eventually, the server's connection tables overflow and service to legitimate clients is denied. Finally, the server may even malfunction or crash [31].
Some variations of the attack have been observed. A comprehensive description is presented in [10].
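The exhaustion of the connection table described in this section can be illustrated with a small simulation. The following is a minimal sketch in plain Python, not a model of any real TCP stack: the function name `simulate_syn_flood`, the fixed backlog size, and the assumption that a legitimate handshake completes instantly when a slot is free are illustrative choices that do not come from the cited works.

```python
from collections import deque


def simulate_syn_flood(backlog_size, timeout, attack_times, legit_times):
    """Return how many legitimate SYNs are dropped because the backlog is full.

    Assumptions (illustrative only): attack SYNs never complete the handshake,
    so each one occupies a slot until its timeout expires; legitimate SYNs
    complete immediately if a slot is free and therefore hold no slot.
    """
    half_open = deque()  # expiry times of half-open connections, oldest first
    events = sorted([(t, "attack") for t in attack_times]
                    + [(t, "legit") for t in legit_times])
    dropped_legit = 0
    for now, kind in events:
        # Free the slots whose timeout has already expired.
        while half_open and half_open[0] <= now:
            half_open.popleft()
        if len(half_open) >= backlog_size:
            # Table full: the incoming SYN is discarded.
            if kind == "legit":
                dropped_legit += 1
            continue
        if kind == "attack":
            half_open.append(now + timeout)
    return dropped_legit


# Six attack SYNs fill a 4-slot table; the legitimate SYN at t=6 is dropped,
# while the one at t=20 arrives after the half-open entries have expired.
dropped = simulate_syn_flood(4, 10, [0, 1, 2, 3, 4, 5], [6, 20])
```

In this toy setting, raising the backlog size or shortening the timeout reduces the number of dropped legitimate requests, which is why the half-open connection count is a natural quantity to monitor.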

Attack Data
As attack data, we refer to a known data collection [30] containing traffic data of IoT devices, namely surveillance IP cameras, assembled in a surveillance network. Several attacks that affect the availability and integrity of the video uplinks were conducted. Specifically, the collection contains 9 different datasets, one for each kind of attack. For each of these 9 attacks, a dataset of extracted feature vectors was compiled. The features consist of statistics on network traffic, which are used to implicitly describe the current state of the channel. These statistics are extracted by a feature extractor module in the processing chain. For further details, please refer to [30].
The full dataset contains a total of 2,771,276 instances, of which 2,764,238 contain regular traffic and 20,000 malicious traffic. Each row of the dataset has 115 features in numeric double format, describing the state of the channel.

Apache Spark
Apache Spark is a high-performance, general-purpose distributed computing system. It enables the processing of large quantities of data, beyond what can fit on a single machine, through high-level APIs accessible from the Java, Scala, Python and R programming languages. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Spark allows users to write programs on a cluster computing system that can perform operations on very large amounts of data in parallel. A large dataset is represented using a distributed data structure called an RDD (Resilient Distributed Dataset), which is stored in a distributed way in the executors (i.e., slave nodes). The objects that comprise RDDs are called partitions. They may, but need not, be computed on different nodes of a distributed system.
Spark evaluates RDDs lazily: RDD transformations are computed only when the final RDD data needs to be computed. Spark can keep an RDD loaded in memory on executor nodes throughout the life of a Spark application for faster access in case of repeated computations. RDDs are immutable: transforming an RDD returns a new RDD, and the old one can be reclaimed by the garbage collector. The paradigms of lazy evaluation, in-memory storage, and immutability make Spark fault-tolerant, scalable, efficient and easy to use [23]. In more detail, Spark can guarantee resilience: when a node crashes in the middle of an operation, another node holds a reference to the crashed one, thanks to a mechanism called lineage. In case of a crash, the cluster manager assigns the job to another node, which operates on the affected partition of the RDD and performs the pending operations without data loss [32].
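The interplay of lazy evaluation, immutability and lineage can be sketched with a toy class. The code below is plain Python and deliberately does not use the Spark API: `LazyDataset` is a hypothetical stand-in for an RDD that lives on a single machine and ignores partitioning. It only shows that transformations are recorded rather than executed, and that a result can be rebuilt from the recorded lineage.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable, lazily evaluated, with lineage."""

    def __init__(self, source, lineage=()):
        self._source = source            # callable returning the base records
        self._lineage = tuple(lineage)   # recorded transformations, in order

    def map(self, fn):
        # Return a NEW dataset; the current one is left untouched.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        # Only here is the source read and the lineage replayed.
        records = list(self._source())
        for kind, fn in self._lineage:
            if kind == "map":
                records = [fn(r) for r in records]
            else:
                records = [r for r in records if fn(r)]
        return records


load_calls = []


def load():
    load_calls.append("load")
    return [1, 2, 3, 4]


base = LazyDataset(load)
large = base.map(lambda x: x * 2).filter(lambda x: x > 4)
calls_before_action = len(load_calls)   # no data has been read yet
result = large.collect()                # replays load -> map -> filter
```

Because every derived dataset carries the full list of steps needed to produce it, a lost result can be recomputed by calling `collect()` again, which is the idea behind lineage-based recovery.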

Machine learning algorithms
MLlib is Spark's machine learning (ML) library [23]. It provides machine learning algorithms for classification, regression, clustering, and collaborative filtering, as well as other tools (featurization, pipelines, persistence, and utilities).
In choosing the machine learning algorithms, we consider that our data are labeled (attack or not), so we face a supervised learning problem where the expected output of the model is a binary classification. To this aim we consider the following algorithms from the Apache Spark MLlib standard library: Logistic Regression (LR) [33]; Decision Tree (DT) [34]; Random Forest (RF) [35]; Gradient Boosted Tree (GBT) [36]; Linear Support Vector Machine (SVM) [37].
Since the aim of the work is to find algorithms that can be easily implemented on IoT devices, preference has been given to tree algorithms, which can be easily implemented with IF-THEN-ELSE programming structures even on devices with little processing capacity. Logistic Regression and Support Vector Machine have been included in the comparison because they have been used in related works.
Logistic Regression (LR) is a linear method commonly used for classification. The method combines the individual inputs (i.e., the features) with specific weights, generated during the training process, to obtain the probability of belonging to a particular class. The weights represent the feature importance: if a feature has a large weight, it can be assumed that a variation in that feature has a significant effect on the outcome [38]. In Spark.ML, logistic regression can be used to predict a binary outcome by using binomial logistic regression, or it can be used to predict a multiclass outcome by using multinomial logistic regression [39].
Decision Trees (DTs) are among the most friendly and interpretable models for classification because they are quite similar to the decision models that humans often use. In the training phase, the Decision Tree creates a structure with all the inputs and follows a set of branches to make a prediction. This behavior makes the algorithm a good starting-point model, because it is rather easy to reason about and to inspect; furthermore, it makes very few assumptions about the structure of the data. In other words, rather than trying to train coefficients in order to model a function, it creates a big tree of decisions to follow at prediction time. This model supports both binary and multi-class classification. The main issue with Decision Trees is that they can overfit data extremely quickly, because they create a pathway from the start based on every single training example, although there are ways to limit this issue, such as limiting the height of the tree [38].
In Spark.MLlib, decision trees are supported for binary and multiclass classification and for regression, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions of instances [40]. This feature makes the Spark implementation able to work efficiently on very big datasets.
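The two prediction mechanisms described above, a weighted combination of features for logistic regression and a path through branches for a decision tree, can be contrasted in a few lines. This is a minimal sketch in plain Python rather than the Spark implementation; the function names, the nested-dictionary tree layout, and all weights and thresholds are hypothetical values chosen for illustration.

```python
import math


def logistic_probability(weights, bias, features):
    """Probability of the positive class from a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict_tree(node, features):
    """Follow the branches of a binary tree until a leaf label is reached."""
    while "label" not in node:
        go_left = features[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


# Hypothetical tree of height 2 over two features; "F" = normal, "T" = attack.
TOY_TREE = {
    "feature": 0, "threshold": 0.5,
    "left": {"label": "F"},
    "right": {
        "feature": 1, "threshold": 10.0,
        "left": {"label": "F"},
        "right": {"label": "T"},
    },
}

p_neutral = logistic_probability([2.0, -1.0], 0.0, [1.0, 2.0])  # z = 0
label = predict_tree(TOY_TREE, [0.9, 20.0])
```

The logistic model always evaluates every feature, whereas the tree only inspects the features found along one root-to-leaf path, so limiting the height of the tree also bounds the number of comparisons per prediction.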
Random Forest (RF) and Gradient-Boosted Tree (GBT) are both extensions of the Decision Tree. Rather than training one tree on all the data, multiple trees are trained on varying subsets of the data. Thus, the various decision trees become "experts" in specific domains. By combining these experts, a "wisdom of the crowd" effect is obtained, where the group's performance exceeds that of any individual. This method can also help to prevent overfitting, a big problem for Decision Trees. Random Forest and Gradient-Boosted Tree use two different methods for combining decision trees. In Random Forest, several trees are simply trained and their responses are averaged to make a prediction; in Gradient-Boosted Tree, each tree makes a weighted prediction, so some trees have more predictive power for some classes than others. The decision trees are trained iteratively in order to minimize a loss function [38].
In a Linear Support Vector Machine (SVM), new examples, in the application stage, are mapped into the same space as the training examples and predicted to belong to a category based on which side of the gap they fall on. More formally, an SVM constructs a hyperplane to separate data points belonging to two class labels in feature space. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class. Maximizing the separability between the two classes reduces the generalization error of the classifier [42]. The Spark.ML implementation in the current version of Spark (2.4.4) supports only binary classification [43].
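The two ways of combining trees can be made concrete with a short sketch. It is written in plain Python and is not the Spark code: it assumes that each tree has already produced its output, that a forest combines class labels by majority vote (one common form of averaging for classification), and that a boosted ensemble sums per-tree outputs in {-1, +1} scaled by per-tree weights. The function names and the numbers are hypothetical.

```python
from collections import Counter


def random_forest_vote(tree_labels):
    """Majority vote over the class labels predicted by the individual trees."""
    return Counter(tree_labels).most_common(1)[0][0]


def boosted_prediction(tree_outputs, tree_weights):
    """Sign of the weighted sum of per-tree outputs, each in {-1, +1}."""
    total = sum(w * o for w, o in zip(tree_weights, tree_outputs))
    return 1 if total > 0 else -1


# Three trees disagree; the forest follows the majority.
forest_label = random_forest_vote(["T", "F", "T"])

# One heavily weighted tree outvotes two lightly weighted ones.
boosted_label = boosted_prediction([+1, -1, -1], [2.0, 0.5, 0.5])
```

Using an odd number of trees avoids ties in the vote for a binary problem; in the boosted case the weights, not the tree count, decide the outcome.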

Execution Environment
The execution environment used is Databricks Community Edition [44]. It provides a just-in-time platform on top of Apache Spark that enables users to build and deploy advanced analytics solutions. It is orchestrated with open source Spark Core with underlying general execution engine which supports a

Datasets
The datasets used in this work are subsets of the data collection defined in [22]. We focus on the dataset called "SYN DoS", related to a SYN flood attack able to create a DoS (Denial of Service).
We created a Python script, based on [45], to extract 5 datasets containing different numbers of instances.
The characteristics of the datasets (dimension, number of training instances, number of testing instances, total number of instances, and ratio between regular traffic instances and malicious traffic instances) are described in Table 1. The datasets are split using a ratio of 70% for the training set and 30% for the testing set. Each row of the dataset has 115 features in numeric double format, describing the state of the channel, and one label, containing "F" for a normal packet and "T" for a malicious packet.
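The row layout and the 70/30 split described above can be sketched as follows. This is an illustrative plain-Python version, not the extraction script based on [45]: the function names, the fixed random seed, and the assumption that the label is stored in the last column are all hypothetical.

```python
import random

N_FEATURES = 115  # number of numeric features per row, as stated in the text


def parse_row(fields):
    """Convert one row of strings into (features, label), with label 0 or 1.

    Assumes the label ("F" or "T") is the last field; this column position is
    an assumption of the sketch.
    """
    if len(fields) != N_FEATURES + 1:
        raise ValueError("expected 115 features plus one label")
    features = [float(value) for value in fields[:N_FEATURES]]
    label = {"F": 0, "T": 1}[fields[N_FEATURES]]
    return features, label


def split_train_test(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows reproducibly and split them into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


train, test = split_train_test(range(10000))
```

Shuffling before splitting keeps the ratio of regular to malicious rows roughly the same in both parts when the attack rows are clustered in time, which matters for a dataset as unbalanced as this one.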

Evaluation Parameters
For each algorithm we evaluate the following parameters [46]:

Classification Performance Evaluation
To evaluate the classification performance, the k-fold cross-validation technique was used. Cross-validation splits the dataset into a set of folds, which are used as separate training and test datasets. In the experiments we used k = 10. Thus, 10 (training, test) dataset pairs were generated. The validation was supported by the Spark.MLlib class CrossValidator, which helps to automate the process and offers tools for tuning the hyperparameters [47].
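The generation of the (training, test) pairs can be sketched in a few lines. The code below is a plain-Python illustration of k-fold splitting and is not how CrossValidator is implemented: the function name is hypothetical, and rows are assigned to folds by index in an interleaved, deterministic way, whereas a library would typically assign them at random.

```python
def k_fold_indices(n_rows, k=10):
    """Yield k (train_indices, test_indices) pairs covering all rows.

    Row i is assigned to fold i % k, so every row appears in exactly one
    test fold and in the training part of the other k - 1 pairs.
    """
    folds = [list(range(start, n_rows, k)) for start in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for j, fold in enumerate(folds) if j != held_out
                     for i in fold]
        yield train_idx, test_idx


pairs = list(k_fold_indices(100, k=10))
```

With k = 10 each model is trained on 90% of the rows and tested on the remaining 10%, and the reported score is usually the average over the ten pairs.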

Results
In Table 2 we report the accuracy on the testing set (ACC) of the algorithms adopted in this study. In Table 3 we report the error rate on the testing set (ERR), and in Table 4 the absolute number of errors on the testing set. The training time reported in Table 5 refers to the training set, while the application time reported in Table 6 refers to inference on the test set.
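The three classification figures reported in the tables are related by simple arithmetic. The sketch below uses the standard definitions (accuracy as the fraction of correctly classified test rows, error rate as its complement); these are assumed here to match the parameters defined in [46], and the function name and the sample labels are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return accuracy (ACC), error rate (ERR) and absolute number of errors."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    errors = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    total = len(y_true)
    return {
        "ACC": (total - errors) / total,
        "ERR": errors / total,
        "errors": errors,
    }


# Five test rows, one of which is misclassified.
metrics = evaluate(["F", "T", "F", "T", "F"], ["F", "T", "T", "T", "F"])
```

On a strongly unbalanced test set the absolute number of errors is worth reading alongside ACC, since a handful of misclassified attack rows barely changes the accuracy.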

Hybrid Architecture
Our purpose is to create a hybrid architecture in which machine learning models for protecting against DDoS attacks are trained in the cloud and the resulting models are applied directly on the IoT devices.
From the previous analysis, Random Forest turns out to be the best performer, reporting an accuracy of 1. It is worth noting that the resulting RF model (see Appendix A) is composed of a chain of IF-THEN-ELSE instructions. This chain can be easily implemented in any high- or low-level programming language, resulting in very fast run-time code thanks to the simplicity of the instructions. In particular, this code could be easily implemented on any IoT device with very limited CPU and memory resources.
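One way to picture the hand-over from cloud to device is a small generator that turns a trained tree into explicit IF-THEN-ELSE source code. The sketch below is plain Python and purely illustrative: the nested-dictionary tree format, the function names, and the feature index and threshold are hypothetical, and it emits Python source for convenience, whereas a real device would more likely receive equivalent C or firmware code.

```python
def tree_to_source(node, indent=0):
    """Return the lines of a nested if/else block equivalent to the tree."""
    pad = "    " * indent
    if "label" in node:
        return [f"{pad}return {node['label']!r}"]
    lines = [f"{pad}if x[{node['feature']}] <= {node['threshold']!r}:"]
    lines += tree_to_source(node["left"], indent + 1)
    lines.append(f"{pad}else:")
    lines += tree_to_source(node["right"], indent + 1)
    return lines


def compile_tree(node, name="classify"):
    """Build the source of a classifier function and load it for local testing."""
    source = "\n".join([f"def {name}(x):"] + tree_to_source(node, indent=1))
    namespace = {}
    exec(source, namespace)  # acceptable here: the source is generated locally
    return source, namespace[name]


# Hypothetical one-split tree: feature 3 separates normal ("F") from attack ("T").
toy_tree = {"feature": 3, "threshold": 0.5,
            "left": {"label": "F"}, "right": {"label": "T"}}
source, classify = compile_tree(toy_tree)
```

A forest would be exported as one such function per tree plus a vote over their results; the generated code needs only comparisons and branches, which is what makes it suitable for devices with little CPU and memory.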

Discussion
In this work we have applied some general-purpose supervised machine learning algorithms included in the MLlib library of Apache Spark to the problem of identifying a SYN flood DDoS attack on IoT devices (webcams). For the experiments we used 5 different datasets extracted from a public dataset. The datasets have a cardinality from 10,000 up to 2 million elements. A cloud environment, Databricks, was used for training the models.
The analysis of the results shows that all the MLlib algorithms achieved a very high level of accuracy (up to 1) with both a very short training time (23.22 seconds for Decision Tree on the SYNDOS2M dataset) and a minimal application time (less than 0.14 seconds for all the algorithms). The best performing algorithm was Random Forest, which achieved an accuracy of 1 in all the experiments, a training time of 215.82 seconds with the SYNDOS2M dataset and an application time of 0.13 seconds.
These results appear consistent with, and improve on, the results in the literature. Othman et al. [48] tested four algorithms of Apache Spark MLlib, Support Vector Machine (SVM), Naïve Bayes, Decision Tree and Random Forest, on the UNSW-NB15 dataset. Random Forest was the best performer, with an accuracy of 97.5%.
A very similar result was achieved by Belouch et al. [49]. They evaluated the performance of four well-known classification algorithms, SVM, Naïve Bayes, Decision Tree and Random Forest, using Apache Spark on the UNSW-NB15 dataset for network intrusion detection. The paper shows an important advantage for the Random Forest classifier.
Gupta et al. [26] implemented a Spark-based intrusion detection framework with two feature selection algorithms, correlation-based feature selection and Chi-squared feature selection, based on Spark's batch processing features. They used five machine learning algorithms, Logistic Regression, SVM, Naïve Bayes, Random Forest and GB Tree, on the NSL-KDD and DARPA 1999 datasets. The best performing algorithm was Random Forest, while the fastest in the application phase was Naïve Bayes.
Furthermore, Random Forest and Decision Tree generate explicit models consisting of a chain of simple IF-THEN-ELSE statements. These conditions can be easily implemented on IoT devices, even if they have limited memory and CPU resources.
The short training times in a cloud environment and the possibility of applying the inferred rules directly on the IoT device, thanks to a simple and fast code implementation, lead us to propose a novel approach to SYN-DOS attack mitigation: an architecture in which machine learning models are trained and retrained in the Cloud, and the resulting models are applied directly on the IoT devices to protect against DDoS attacks, leveraging the simple implementation of the Random Forest algorithm on low-resource IoT devices.
This kind of approach seems to be supported by a recent report [50] showing that the major cloud service vendors offer IoT services, that exchange protocols are consolidated, and that attention to security has increased.
We currently plan to define a Cloud-based hybrid architecture in a more general context, extending the experiments to other types of attacks, and this will be the subject of future work.