1. Introduction
In recent years, Internet of Things (IoT) devices have spread significantly. Gartner estimates that the IoT will reach 26 billion units by 2020 [1, 2], and a study by Statista projects that this number will grow to 75.44 billion worldwide by 2025 [3].
The use of these devices is growing in several applications such as mobile health, Internet of Vehicles, smart home, industrial control, and environmental monitoring, extending the scope of mobile communications from interpersonal communications to smart interconnection between things and people, and also between things and things [4]. These devices are increasingly part of the everyday life of billions of people: consider smart TVs, smartwatches, and IP cameras. They interact with people through sensors and actuators: they can open doors, monitor houses, and record the heartbeat. However, these devices are almost always connected to the Internet, so they are exposed to cyber-attacks.
Thus, on the one hand IoT devices improve the productivity of companies and enhance the quality of people’s lives, but on the other hand the IoT increases the potential attack surface for cyber criminals [1]. A study by Hewlett Packard revealed that 70% of the most commonly used IoT devices contain serious vulnerabilities [5]. IoT devices have vulnerabilities due to lack of transport encryption, insecure Web interfaces, inadequate software protection, and insufficient authorization. On average, each device contains 25 holes, or risks of compromising the home network [1].
Identifying the number of cyber-attacks on IoT devices and their economic impact is challenging, given the continuous and rapid technological changes [6]. However, the Irdeto Global Connected Industries Cybersecurity Survey revealed that cyberattacks targeting IoT devices could cost the U.S. economy $8.8 billion per year. The research highlights that IoT-focused cyberattacks are alarmingly widespread (80% of the respondents claimed to have experienced an attack in the past 12 months). About 55% of the attacks caused operational downtime as a result [7]. Gartner forecasts that worldwide spending on IoT security will reach 2.4 billion in 2020 and 3.1 billion in 2021 [8].
In the general Web scenario, according to the Cisco Cybersecurity Reports, even though global Web traffic is made more secure by encryption techniques, 42% of organizations have faced a DDoS attack [9]. In particular, the SYN-DOS attack is one of the most popular types of DDoS attack, widely used because SYN packets are not likely to be rejected by default [10, 11]. Therefore, in this work we focus on the SYN-DOS attack, which undermines the availability of the network interfaces of devices by exploiting the normal functioning of the TCP/IP protocol.
Machine learning has proven to be very important and effective in identifying and protecting against cyber-attacks [12, 13, 14, 15, 16], and specifically against DDoS attacks [17]. For example, D’Angelo et al. [18] obtained a 97.4% identification success rate on real traffic data using U-BRAIN [19]. In the specific field of anomaly and attack detection for IoT devices, Hasan et al. obtained up to 99.4% identification success using Decision Tree, Random Forest, and Artificial Neural Networks [20].
The application limits of state-of-the-art machine learning algorithms are mostly related to the computational requirements of large datasets [21]. This is especially important for IoT devices, which generally have reduced processing capabilities. However, they are often connected, or at least connectable, to the Internet; it is therefore possible to use an approach based on technologies operating in a Cloud environment.
In this work we have a dual purpose:
- to evaluate the performance of several machine learning algorithms in identifying SYN-DOS attacks on IoT systems in a Cloud environment, in terms of both classification performance and training/application times. Namely, we use several general-purpose machine learning algorithms included in the MLlib library of Apache Spark [22], one of the most interesting and widely used technologies in the big data field, available under an open source license and offered by the cloud computing facilities of the main world players [23];
- using the previous results, to propose a strategy for the sustainable implementation of machine learning algorithms for the detection of SYN-DOS cyber-attacks on IoT devices. Our purpose is to create a hybrid architecture in which machine learning models for protecting against DDoS attacks are trained on the cloud and the resulting models are applied directly on the IoT devices.
While there are several applications of machine learning algorithms against cyber-attacks in a Cloud environment [24, 25, 26, 27, 28, 29], or in a local one, such as Kitsune [30], there appears to be no specific integrated application for IoT.
The remainder of this paper is organized as follows. In Section 2, after a brief introduction to the SYN-DOS attack, we introduce the datasets used, the Apache Spark framework and the MLlib Spark library for machine learning. In Section 3, we describe the selected cloud environment, the datasets used, the measured parameters and the experimental results. In the last section, we summarize the results and discuss the work. Some details are reported in Appendix A.
2. Materials and Methods
2.1. Brief Description of a SYN-DOS Attack
The SYN-DOS (or TCP SYN-DOS or SYN flood) attack is a type of Distributed Denial of Service (DDoS) attack that exploits the normal three-way handshake of the Transmission Control Protocol (TCP) and can be used to make server processes incapable of answering a legitimate client application’s requests for new TCP connections. Any service that binds to and listens on a TCP socket is potentially vulnerable to TCP SYN flooding attacks [10].
According to RFC 793, the normal TCP three-way handshake exchanges the following sequence of packets (see Figure 1):
The client requests a connection by sending a SYN (synchronize) message to the server.
The server acknowledges by sending a SYN-ACK (synchronize-acknowledge) message back to the client.
The client responds with an ACK (acknowledge) message, and the connection is established.
In a typical SYN flood attack, a series of SYN packets is sent to the targeted server. The server is unaware of the attack, so it receives multiple, apparently legitimate requests to establish communication and responds to each attempt with a SYN-ACK packet.
The malicious client either does not send the expected ACK or, if the IP address is spoofed, never receives the SYN-ACK. In both cases, the server under attack waits for an acknowledgement for some time (timeout), during which the connection remains half-open. Before the connection times out, another SYN packet arrives. This behavior creates a very large number of half-open connections. Eventually, the server’s connection tables overflow and service to legitimate clients is denied. Finally, the server may even malfunction or crash [31].
Some variations of the attack have been observed. A comprehensive description is presented in [10].
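The mechanism described above can be illustrated with a minimal sketch, assuming a server that keeps a fixed-size table of half-open connections and drops entries after a timeout. The function name `simulate_backlog`, the event format and the parameter values are hypothetical and serve only as an illustration; they do not model any specific TCP stack.

```python
def simulate_backlog(events, backlog_size, timeout):
    """Simulate a server's table of half-open TCP connections.

    events: list of (time, kind, client) tuples sorted by time, where kind is
            "SYN" (connection request) or "ACK" (handshake completion).
    Returns (established, refused): two lists of client identifiers.
    """
    half_open = {}              # client -> time at which its SYN was received
    established, refused = [], []
    for time, kind, client in events:
        # Drop half-open entries whose timeout has expired.
        expired = [c for c, t0 in half_open.items() if time - t0 >= timeout]
        for c in expired:
            del half_open[c]
        if kind == "SYN":
            if len(half_open) < backlog_size:
                half_open[client] = time    # server would reply with SYN-ACK
            else:
                refused.append(client)      # table full: request is dropped
        elif kind == "ACK" and client in half_open:
            del half_open[client]           # handshake completed
            established.append(client)
    return established, refused


# Three spoofed SYNs fill a table of size 3 before the legitimate client arrives.
attack = [(t, "SYN", "spoofed-%d" % t) for t in range(3)]
attack += [(3, "SYN", "legit"), (4, "ACK", "legit")]
print(simulate_backlog(attack, backlog_size=3, timeout=10))
```

In this toy run the spoofed requests never complete the handshake, so the legitimate client is refused; if the same client arrives after the spoofed entries have timed out, its connection is established.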
2.2. Attack Data
As attack data we refer to a known data collection [30] containing traffic data of IoT devices, namely surveillance IP cameras, assembled in a surveillance network. Several attacks affecting the availability and integrity of the video uplinks were conducted.
Specifically, the collection contains 9 different datasets, one for each kind of attack. For each of these 9 attacks, a dataset of extracted feature vectors was compiled. The features consist of statistics on network traffic, which implicitly describe the current state of the channel. These statistics are extracted by a feature extractor module in the processing chain. For further details please refer to [30].
The full dataset contains a total of 2,771,276 instances, of which 2,764,238 contain regular traffic and 20,000 malicious traffic. Each row of the dataset has 115 features in numeric double format, describing the state of the channel.
2.3. Apache Spark
Apache Spark is a high-performance, general-purpose distributed computing system. It enables the processing of large quantities of data, beyond what can fit on a single machine, through high-level APIs accessible in the Java, Scala, Python and R programming languages. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Spark allows users to write programs on a cluster computing system that can perform operations on very large amounts of data in parallel. A large dataset is represented using a distributed data structure called RDD (Resilient Distributed Dataset), which is stored in a distributed way in the executors (i.e., slave nodes). The objects that comprise RDDs are called partitions. They may (but need not) be computed on different nodes of a distributed system.
Spark evaluates RDDs lazily. Thus, RDD transformations are computed only when the final RDD data needs to be computed. Spark can keep an RDD loaded in-memory on executor nodes throughout the life of a Spark application for faster access in case of repeated computations. RDDs are immutable: transforming an RDD returns a new RDD, and the old one can be reclaimed by the garbage collector. The paradigms of lazy evaluation, in-memory storage, and immutability make Spark fault-tolerant, scalable, efficient and easy to use [23]. In more detail, Spark guarantees resilience: when a node crashes in the middle of an operation, another node holds a reference to the crashed one, thanks to a mechanism called lineage. In case of a crash, the cluster manager assigns the job to another node, which operates on the particular partition of the RDD and performs the required operations without data loss [32].
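The ideas of lazy evaluation, immutability and lineage can be sketched with a toy, single-machine class. This is only an illustration under simplifying assumptions: `LazyDataset` is a hypothetical name and is not part of the Spark API, and no distribution across nodes is modeled.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable, lazily evaluated, with lineage."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # original input data, never modified
        self._lineage = lineage         # recorded (kind, function) steps

    def map(self, fn):
        # Transformation: returns a NEW dataset; nothing is computed yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: also only recorded in the lineage.
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


base = LazyDataset([1, 2, 3, 4])
derived = base.map(lambda x: 2 * x).filter(lambda x: x > 4)   # no work done yet
print(derived.collect())
```

Because each dataset keeps the full list of steps that produced it, a lost result can be rebuilt by replaying those steps on the source data, which is the intuition behind lineage-based recovery.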
2.4. Machine Learning Algorithms
MLlib is Spark’s machine learning (ML) library [23]. It provides tools such as machine learning algorithms for classification, regression, clustering, and collaborative filtering, as well as other tools (featurization, pipelines, persistence, and utilities).
In choosing the machine learning algorithms, we consider that our data are labeled (attack or not), so we face a supervised learning problem where the expected output of the model is a binary classification. To this aim, we consider the following algorithms from the Apache Spark MLlib standard library: Logistic Regression (LR) [33]; Decision Tree (DT) [34]; Random Forest (RF) [35]; Gradient Boosted Tree (GBT) [36]; Linear Support Vector Machine (SVM) [37].
Since the aim of the work is to find algorithms that can be easily implemented on IoT devices, preference has been given to tree algorithms, which can be easily implemented with IF-THEN-ELSE programming structures even on devices with little processing capacity. Logistic Regression and Support Vector Machine have been included in the comparison because they have been used in related works.
Logistic Regression (LR) is a linear method commonly used for classification. The method associates each individual input (i.e., each feature) with a specific weight, generated during the training process, and combines the weighted inputs to obtain the probability of belonging to a particular class. The weight represents the feature importance. Thus, if a feature has a large weight, it can be assumed that a variation in that feature has a significant effect on the outcome [38]. In Spark.ML, logistic regression can be used to predict a binary outcome by using binomial logistic regression, or a multiclass outcome by using multinomial logistic regression [39].
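As a minimal sketch of how a trained binomial logistic regression model produces a prediction, the following code combines the weighted features and maps the result to a probability with the logistic function. The function names, weights and threshold are hypothetical values chosen for illustration, not results of this work.

```python
import math


def predict_probability(features, weights, bias):
    """Probability of the positive class (attack) for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable form of the logistic function 1 / (1 + exp(-z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(features, weights, bias, threshold=0.5):
    """Binary decision: 1 (attack) if the probability reaches the threshold."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0


# Hypothetical model with two features; the first has the larger weight,
# so variations in it move the predicted probability the most.
weights, bias = [2.0, -1.0], -1.0
print(classify([3.0, 0.1], weights, bias), classify([0.0, 1.0], weights, bias))
```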
Decision Trees (DTs) are among the most friendly and interpretable models for performing classification because they closely resemble decision models that humans use quite often. In the training phase, the Decision Tree builds a structure from all the inputs; at prediction time, it follows a set of branches to make a prediction. This behavior makes the algorithm a good starting point, because it is rather easy to reason about and to inspect; furthermore, it makes very few assumptions about the structure of the data. In other words, rather than trying to train coefficients in order to model a function, it creates a big tree of decisions to follow at prediction time. This model supports both binary and multi-class classification. The main issue with Decision Tree is that it can overfit the data extremely quickly, because it creates a pathway from the root based on every single training example, although there are some ways to mitigate this issue, such as limiting the height of the tree [38]. In Spark.MLlib, decision trees are supported for binary and multiclass classification and for regression, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions of instances [40]. This feature makes the Spark implementation able to work efficiently on very big datasets.
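To make the idea of following branches at prediction time concrete, a trained tree of height 2 can be written by hand as nested conditions. The feature names (`syn_rate`, `mean_packet_size`) and the thresholds below are hypothetical and are not taken from the dataset used in this work.

```python
def classify_window(syn_rate, mean_packet_size):
    """Hand-written stand-in for a decision tree of height 2.

    Returns 1 for attack traffic and 0 for regular traffic.
    """
    if syn_rate <= 100.0:               # root split
        return 0                        # leaf: regular traffic
    else:
        if mean_packet_size <= 80.0:    # second-level split
            return 1                    # leaf: attack
        else:
            return 0                    # leaf: regular traffic


print(classify_window(syn_rate=250.0, mean_packet_size=60.0))
```

Each prediction costs at most one comparison per level of the tree, which is why limiting the height bounds both the risk of overfitting and the work done per prediction.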
Random Forest (RF) and Gradient-Boosted Tree (GBT) are both extensions of Decision Tree. Rather than training one tree on all the data, multiple trees are trained on varying subsets of the data. Thus, the various decision trees become “experts” in a specific domain. By combining these various experts, a “wisdom of the crowd” effect is obtained, where the group’s performance exceeds that of any individual. This method can also help to prevent overfitting, a big problem for Decision Tree. Random Forest and Gradient-Boosted Tree use two different methods for combining decision trees. In Random Forest, several trees are simply trained and their responses are averaged to make a prediction; in Gradient-Boosted Tree, each tree makes a weighted prediction, so some trees have more predictive power for some classes than others, and the decision trees are trained iteratively in order to minimize a loss function [38].
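The two ways of combining trees can be contrasted with a short sketch. It is a simplification under stated assumptions: the trees are hypothetical one-split functions, the forest averages 0/1 votes, and the boosted ensemble sums weighted outputs in {-1, +1} and takes the sign. Training of the trees and of the weights is not shown.

```python
def random_forest_predict(trees, x):
    """Average the 0/1 votes of the trees and return the majority class."""
    votes = [tree(x) for tree in trees]
    return 1 if sum(votes) / len(votes) >= 0.5 else 0


def boosted_trees_predict(trees, tree_weights, x):
    """Sum the weighted tree outputs in {-1, +1}; the sign gives the class."""
    score = sum(w * tree(x) for tree, w in zip(trees, tree_weights))
    return 1 if score > 0 else 0


# Hypothetical one-split trees over a feature vector x = [x0, x1].
forest = [
    lambda x: 1 if x[0] > 100 else 0,
    lambda x: 1 if x[1] < 80 else 0,
    lambda x: 1 if x[0] > 150 else 0,
]
boosted = [
    lambda x: 1 if x[0] > 100 else -1,
    lambda x: 1 if x[1] < 80 else -1,
]
boosted_weights = [1.0, 0.4]    # the first tree carries more predictive power

sample = [120, 90]
print(random_forest_predict(forest, sample),
      boosted_trees_predict(boosted, boosted_weights, sample))
```

In the forest every tree counts equally, whereas in the boosted ensemble a tree with a larger weight can outvote the others.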
The Spark.ML implementation supports Random Forest for binary and multiclass classification and for regression, using both continuous and categorical features. Gradient-Boosted Trees are supported for binary classification and for regression, using both continuous and categorical features. In the current version of Spark (2.4.4), multiclass classification is not supported for Gradient-Boosted Trees. Both Random Forest and Gradient-Boosted Tree in Spark.MLlib use the Decision Tree implementation, therefore the same considerations on efficiency apply [41].
Linear Support Vector Machine (SVM) is the MLlib Spark implementation of Support Vector Machine, a class of algorithms widely used for classification and regression analysis. Given a set of training examples, each belonging to one of two class labels, an SVM algorithm builds a model that assigns new examples to one label or the other. A linear SVM finds linear boundaries in the input feature space. The SVM model resulting from the training stage is a representation of the examples belonging to the training set as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples, in the application stage, are mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. More formally, an SVM constructs a hyperplane to separate data points belonging to two class labels in feature space. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class. Maximizing the separability between the two classes reduces the generalization error of the classifier [42]. In the current version of Spark (2.4.4), the Spark.ML implementation supports only binary classification [43].
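The application stage of a linear SVM can be sketched as follows, assuming that training has already produced a weight vector `w` and an offset `b` defining the hyperplane w·x + b = 0. The names and the numeric values are hypothetical and chosen only to keep the arithmetic easy to check.

```python
import math


def decision_value(x, w, b):
    """Signed score w.x + b: its sign tells on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x, w, b):
    """Assign class 1 on the non-negative side of the hyperplane, class 0 otherwise."""
    return 1 if decision_value(x, w, b) >= 0 else 0


def distance_to_hyperplane(x, w, b):
    """Geometric distance |w.x + b| / ||w|| of a point from the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(x, w, b)) / norm


w, b = [3.0, 4.0], -5.0     # hypothetical trained parameters, ||w|| = 5
print(predict([3.0, 4.0], w, b), distance_to_hyperplane([3.0, 4.0], w, b))
```

The training stage, not shown here, chooses `w` and `b` so that the smallest such distance over the training points is as large as possible.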
4. Discussion
In this work we have applied some general-purpose supervised machine learning algorithms included in the MLlib library of Apache Spark to the problem of identifying a SYN flood DDoS attack on IoT devices (webcams). For the experiments we used 5 different datasets extracted from a public dataset. The datasets range in cardinality from 10,000 to 2 million elements. A cloud environment, Databricks, has been used for training the models.
The analysis of the results shows that all the MLlib algorithms achieved a very high level of accuracy (up to 1), with both a very short training time (23.22 seconds for Decision Tree on dataset SYNDOS2M) and a minimal application time (less than 0.14 seconds for all the algorithms). The best performing algorithm was Random Forest, which achieved an accuracy of 1 in all the experiments, a training time of 215.82 seconds with the SYNDOS2M dataset and an application time of 0.13 seconds.
These results appear consistent with, and improve on, the results in the literature. Othman et al. [48] tested four algorithms of Apache Spark MLlib, namely Support Vector Machine (SVM), Naïve Bayes, Decision Tree and Random Forest, on the UNSW-NB15 dataset. Random Forest was the best performer, with an accuracy of 97.5%.
A very similar result was achieved by Belouch et al. [49]. They evaluated the performance of four well-known classification algorithms, SVM, Naïve Bayes, Decision Tree and Random Forest, using Apache Spark on the UNSW-NB15 dataset for network intrusion detection. The paper shows an important advantage for the Random Forest classifier.
Gupta et al. [26] implemented a Spark-based intrusion detection framework with two feature selection algorithms, correlation-based feature selection and Chi-squared feature selection, based on Spark’s batch processing features. They used five machine learning algorithms, Logistic Regression, SVM, Naïve Bayes, Random Forest and GB Tree, on the NSL-KDD and DARPA 1999 datasets. The best performing algorithm was Random Forest, while the fastest in the application phase was Naïve Bayes.
Furthermore, Random Forest and Decision Tree generate explicit models consisting of a chain of simple IF-THEN-ELSE statements. These conditions can be easily implemented on IoT devices, even if they have limited memory and CPU resources.
The short training times in a cloud environment, together with the possibility of applying the inferred rules directly on the IoT device through a simple and fast code implementation, lead us to propose a novel approach to SYN-DOS attack mitigation. The proposed architecture trains and retrains machine learning models on the Cloud and applies the resulting models for protecting against DDoS attacks directly on the IoT devices, leveraging the simple implementation of the Random Forest algorithm on low-resource IoT devices.
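One possible shape of this cloud-to-device hand-off is sketched below, under explicit assumptions: the cloud side is assumed to export a trained tree as a flat node table in JSON, and the device side walks that table with a simple loop. The table layout, the names `MODEL_JSON`, `load_model` and `apply_model`, and all thresholds are hypothetical; this is not the export format of Spark MLlib nor the architecture implemented in this work.

```python
import json

# "Cloud side": a trained tree flattened into a node table (hypothetical values).
# Internal node: [feature_index, threshold, left_child, right_child]
# Leaf node:     [-1, predicted_label, -1, -1]
MODEL_JSON = json.dumps({
    "nodes": [
        [0, 100.0, 1, 2],    # node 0: is feature 0 <= 100.0 ?
        [-1, 0, -1, -1],     # node 1: leaf, regular traffic
        [1, 80.0, 3, 4],     # node 2: is feature 1 <= 80.0 ?
        [-1, 1, -1, -1],     # node 3: leaf, attack
        [-1, 0, -1, -1],     # node 4: leaf, regular traffic
    ]
})


def load_model(text):
    """'Device side': parse the node table received from the cloud."""
    return json.loads(text)["nodes"]


def apply_model(nodes, features):
    """Walk the node table iteratively: one comparison per tree level."""
    index = 0
    while nodes[index][0] != -1:
        feature, threshold, left, right = nodes[index]
        index = left if features[feature] <= threshold else right
    return nodes[index][1]


nodes = load_model(MODEL_JSON)
print(apply_model(nodes, [120.0, 60.0]))
```

Retraining on the cloud would then only require sending a new node table to the device, while the evaluation loop on the device stays unchanged. A Random Forest could be handled in the same way by shipping one table per tree and taking the majority of the results.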
This kind of approach seems to be supported by a recent report [50] showing that the major cloud service vendors offer IoT services, that exchange protocols are consolidated, and that attention to security has increased.
We currently plan to define a Cloud-based hybrid architecture in a more general context, extending the experiments to other types of attacks, and this will be the subject of future work.