An Experimental Analysis of Attack Classification Using Machine Learning in IoT Networks

In recent years, there has been a massive increase in the number of Internet of Things (IoT) devices as well as in the data generated by such devices. The devices participating in IoT networks can be problematic due to their resource-constrained nature, and integrating security on these devices is often overlooked. This has given attackers an increased incentive to target IoT devices. As the number of possible attacks on a network increases, it becomes more difficult for traditional intrusion detection systems (IDS) to cope with these attacks efficiently. In this paper, we highlight several machine learning (ML) methods, namely k-nearest neighbour (KNN), support vector machine (SVM), decision tree (DT), naive Bayes (NB), random forest (RF), artificial neural network (ANN), and logistic regression (LR), that can be used in IDS. The ML algorithms are compared experimentally for both binary and multi-class classification on the Bot-IoT dataset, using accuracy, precision, recall, F1 score, and log loss as metrics. In the case of the HTTP distributed denial-of-service (DDoS) attack, the accuracy of RF is 99%. Furthermore, the precision, recall, F1 score, and log loss results show that RF outperforms the other algorithms on all types of attacks in binary classification. However, in multi-class classification, KNN outperforms the other ML algorithms with an accuracy of 99%, which is 4% higher than RF.


Introduction
The Internet of Things (IoT) offers a vision in which devices can understand their context with the help of sensors and connect with each other through networking functions [1]. The devices in an IoT network can be employed to collect information depending on the use case. Use cases include the retail, healthcare, and manufacturing industries, which use IoT devices for tasks such as tracking purchased items, remote patient monitoring, and fully autonomous warehouses. The number of IoT devices is reported to be growing every year, with a predicted 75.44 billion devices by 2025 [2]. Such a massive surge of
1. We conduct an in-depth and comprehensive survey on the role of various ML methods and attack detection, specifically with regard to IoT networks.
2. We evaluate and compare the state-of-the-art ML algorithms in terms of various performance metrics such as confusion matrix, accuracy, precision, recall, F1 score, log loss, ROC AUC, and Cohen's kappa coefficient (CKC).
3. We compare the results of binary-class testing and examine the results of multi-class testing.
The rest of the paper is organized as follows: Table 1 lists all the abbreviations used in the paper. Section 2 is devoted to a literature review involving investigating IoT intrusion detection techniques as well as ML methods and how they are being used to aid intrusion detection efforts specifically in regards to IoT networks. Details of various attacks that can occur in IoT networks are also showcased with an explanation of how the various ML methods and performance metrics work. Section 3 explains the performance evaluation, which also includes an in-depth examination of the data used in the datasets. The models are compared against each other for both binary and multi-class classification with an overall best model being selected. Finally, Section 4 draws a conclusion.

Background and Related Work
This section presents the background and examines the current literature to clarify for the reader the design of the experiments conducted in this paper. First, we discuss IDS, including the use of ML in attack detection, and the related work that informed the selection of the algorithms and the identification of datasets that could be used for testing the models. Each algorithm is then explored, with further research into its suitability for use in an IDS. Finally, the IoT is described, including the attacks present in the selected dataset.

Intrusion Detection System
An IDS is a tool that allows a network to be monitored for potentially harmful traffic. An IDS can be implemented using one of two distinct approaches: signature-based detection and anomaly-based detection. A signature-based IDS compares incoming traffic against a database of existing attack signatures, meaning that an attack can be detected only if its signature is already available in the database. An anomaly-based IDS monitors network traffic and attempts to identify any traffic that is abnormal with regard to the normal network traffic.
The signature-based detection approach has a major flaw: a signature-based IDS will always be susceptible to a zero-day attack or to an attacker who modifies the attack to evade the signature database. Anomaly-based IDS are much better suited to ML, as the IDS can be trained to detect the difference between normal traffic and attack traffic. However, integrating ML with IDS is not a silver bullet and may introduce problems of its own. Research conducted by Sommer and Paxson [9] identified several such problems, an important one being that models can produce false positives, which can render the IDS unusable because normal data trigger alerts. Even though that research is dated, this is still a major problem when using ML with IDS. As a result, it is of paramount importance to identify models that produce the lowest number of false positives.

IoT Intrusion Detection Using Machine Learning
ML is a subset of AI in which an algorithm, or in this case a model, is given a dataset that it uses to identify patterns, which can then be used to make predictions on future data. There has been limited research devoted to IDS using ML on IoT networks. One recent study used the Defense Advanced Research Projects Agency (DARPA) ML datasets to test models such as support vector machine (SVM), naive Bayes (NB), random forest (RF), and multi-layer perceptron [10]. The results of this research were presented in terms of root mean squared error, mean absolute percentage error, receiver operating characteristic curve, and accuracy, yielding good results with RF being one of the top models. However, this research has two main limitations. Firstly, it used the DARPA datasets, which were over 20 years old at the time of writing. Secondly, it did not perform multi-class testing using the datasets.
Other research used the Bot-IoT dataset with the models k-nearest neighbour (KNN), quadratic discriminant analysis, iterative dichotomiser 3, RF, adaptive boosting, multi-layer perceptron, and NB [11]. The research yielded very good results in terms of accuracy, precision, recall, F1 score, and time. This study used an up-to-date dataset as well as a wide variety of ML models. However, it did not include multi-class testing for any of the models.
In regards to multi-class classification, the authors of [12] used several ML methods. This research compared the algorithms such as logistic regression (LR), decision tree (DT), RF, and artificial neural network (ANN) using a dataset created by the researchers which was not available for public use. It was concluded in the study that RF was the best model for multi-class classification. This research shows that with multi-class classification it is possible to achieve high results. Testing with additional algorithms could help bolster the results of the research.
Overall, there is currently a lack of research into intrusion detection within the area of IoT networks. This could be due to the lack of datasets as well as lack of real hardware with all datasets being comprised of simulated IoT devices on regular computers. There is also a lack of research into multi-class classification, which could be due to the lack of a dedicated multi-class dataset. With all available datasets being created with binary classification in mind, performing multi-class testing requires the datasets to be merged into one with proper labelling for each class.
Various ML models can be utilized to perform ML tasks, each with its own mathematical equations powering the analysis of the data presented. In the next subsections, we discuss the ML algorithms used in our analysis: (i) KNN; (ii) SVM; (iii) DT; (iv) RF; (v) NB; (vi) ANN; and (vii) LR.

K-Nearest Neighbor
KNN is a supervised learning model that is considered to be one of the simplest ML models available [13]. KNN is referred to as a lazy learner because no training is done; instead, the training data are used at prediction time to classify the data [13]. KNN operates under the assumption that similar data points lie close together, and it classifies a data point according to its K closest data points, where K can be set to any positive integer [14]. KNN is a suitable model for intrusion detection, as several pieces of research have shown. The authors of [15] examined the effectiveness of KNN at distinguishing between attack and normal data. The results show that KNN was an effective model for detecting attack data and had a low false-positive rate. Moreover, recent research also examined the effectiveness of KNN [16] and reached a similar conclusion, showing that KNN was an effective model that beat SVM and DT.
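The lazy-learning behaviour described above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a majority vote among the K nearest points; the function name `knn_predict` and the two-feature toy data are hypothetical and are not taken from the experiments in this paper.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # No model is fitted: the training data are consulted at prediction time.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy two-feature records (values are made up for illustration).
train_points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
                (0.9, 0.8), (0.8, 0.9), (0.85, 0.95)]
train_labels = ["normal", "normal", "normal", "attack", "attack", "attack"]

prediction = knn_predict(train_points, train_labels, (0.82, 0.88), k=3)
```

Because every prediction scans the whole training set, the cost of this approach grows with the amount of stored data, which is the trade-off of having no training phase.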

Support Vector Machine
Support Vector Machine (SVM) is a supervised learning algorithm that uses a hyperplane to separate the training data in order to classify future predictions. A hyperplane divides a dataset into two classes and acts as a decision boundary that helps classify the data points. It can be represented as a line or a plane in a multi-dimensional space and separates the data based on the class they belong to. It does this by finding the maximum margin between the support vectors. SVM is a suitable model for intrusion detection, as is evident from the large amount of research conducted over the years. One older piece of research created an enhanced SVM model for intrusion detection [17]. The research was successful at creating the model, but it proved to be only a slight improvement over regular SVM, showing that the model is capable of accurately classifying attack data even without enhancements or augmentation. Other more recent research compared the ability of SVM and ANN to classify attack data [18].

As previously mentioned, SVM relies on placing a hyperplane to separate the data, which can be expressed as follows:

$$a x - b = 0$$

where $a$ is a vector of the same dimensions as the input feature vector $x$ and $b$ is the bias. In this case, $a x$ can be written as $a_1 x_1 + a_2 x_2 + \dots + a_n x_n$, where $n$ is the number of dimensions of the feature vector $x$. When making predictions, the following expression is used:

$$y = \operatorname{sign}(a x - b)$$

where sign is a function that returns +1 if its input is positive and −1 if it is negative. This value determines the class to which the feature vector is predicted to belong. During training, $x_i$ is the feature vector of example $i$ and $y_i$ is its label, which can be either +1 or −1, and the constraints can be written as follows:

$$a x_i - b \geq +1 \ \text{ if } y_i = +1, \qquad a x_i - b \leq -1 \ \text{ if } y_i = -1$$

SVMs use kernels, which are mathematical functions that take the data as input and transform them into the form required for processing.
The kernels can be linear or nonlinear, for example polynomial, Gaussian radial basis function (RBF), or sigmoid.
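The prediction rule and the margin constraints of a linear SVM can be illustrated as follows. This is only a sketch: it assumes the hyperplane is written as $a x - b = 0$ and that $a$ and $b$ have already been learned. The values of `a` and `b` and the helper names are hypothetical, and mapping an input of exactly zero to +1 is an arbitrary choice made here.

```python
def dot(a, x):
    """Inner product a_1*x_1 + ... + a_n*x_n."""
    return sum(a_j * x_j for a_j, x_j in zip(a, x))

def svm_predict(a, b, x):
    """Return +1 or -1 from the sign of a.x - b (zero is mapped to +1 here)."""
    return 1 if dot(a, x) - b >= 0 else -1

def satisfies_margin(a, b, x_i, y_i):
    """Check the training constraint y_i * (a.x_i - b) >= 1 for one example."""
    return y_i * (dot(a, x_i) - b) >= 1

# Hypothetical hyperplane parameters, not learned from any dataset.
a, b = (1.0, 1.0), 1.0

label_far_side = svm_predict(a, b, (2.0, 1.0))   # a.x - b = 2.0
label_near_side = svm_predict(a, b, (0.0, 0.0))  # a.x - b = -1.0
```

A point such as (0.5, 0.7) is classified as +1 but lies inside the margin, so `satisfies_margin(a, b, (0.5, 0.7), 1)` is false; training adjusts $a$ and $b$ so that such violations are avoided or penalized.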

Decision Tree
DT is a supervised learning algorithm that has the advantage of providing a visual representation of the model. A DT uses a hierarchical model that resembles a flow chart with several connected nodes. Each node represents a test on an attribute in the dataset, with a branch that leads either to another node or to a decision on the data being classified [19]. The training data are used to build the tree, and the prediction data are run through the nodes until they can be classified. DT is a suitable model for intrusion detection based on the research conducted. One fairly recent piece of research compared DT with several other models including NB and KNN [20]. The results show that DT was one of the better models along with NB when compared to ANNs, which dominate IDS research. Other research created an IDS for connected vehicles in smart cities [21]. This research showed that the model that used DT was the best model, with high accuracy and a low false positive rate.

As previously mentioned, DT creates a hierarchical model using the training data to create nodes that act as tests for making predictions. When building a DT, the root node needs to be selected, as well as the nodes that make up the rest of the tree. There are many ways to do this, with entropy being used in this case. Entropy measures the impurity of the class labels in a set of data points and is expressed as follows:

$$E = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where $p_i$ is the probability of the data being classified to a given class $i$ and $c$ is the number of classes. The attribute with the lowest entropy would be used for the root node.
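Entropy-based selection of the root attribute can be sketched as below. The example assumes the standard Shannon entropy with a base-2 logarithm and scores each attribute by the weighted entropy of the groups it produces; the attribute names (`proto`, `flag`) and the four toy records are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def split_entropy(rows, labels, attribute):
    """Weighted entropy of the labels after splitting on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    total = len(labels)
    return sum(len(group) / total * entropy(group) for group in groups.values())

# Toy records: `proto` separates the classes perfectly, `flag` does not.
rows = [
    {"proto": "tcp", "flag": "syn"},
    {"proto": "tcp", "flag": "ack"},
    {"proto": "udp", "flag": "syn"},
    {"proto": "udp", "flag": "ack"},
]
labels = ["attack", "attack", "normal", "normal"]

root_attribute = min(["proto", "flag"],
                     key=lambda attr: split_entropy(rows, labels, attr))
```

Here the unsplit labels have an entropy of 1 bit, splitting on `proto` reduces it to 0, and splitting on `flag` leaves it at 1, so `proto` would be chosen for the root node.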

Random Forest
RF is a supervised learning algorithm that is seen as an improvement on the DT model. The random aspect of the model comes from two key concepts. The first is that, when training the model, each tree is given a random sample of the data drawn with replacement, which can result in some trees using the same data points multiple times. The reason behind this is to lower the variance of the model, which lowers the difference in the predicted result scores [22]. The second concept involves using only a small subset of the features when splitting the nodes in the trees [23]. This is done to prevent overfitting, in which the model fits the training data so closely that its predictions do not generalize to new data [13]. When making predictions with RF, the average of the trees' predictions is used to determine the overall class of the data; this process is called bootstrap aggregating [13]. The reason RF is seen as an improvement on DT is that, instead of relying on one tree to make the classification, multiple trees with different training data and a different selection of features are used to give predictions. This allows for a fairer analysis of the data when making predictions. RF has been shown to be a suitable model for intrusion detection. To this end, the authors of [24] compared RF to other frameworks used in intrusion detection. They found that the RF model outperformed the other frameworks with increased accuracy, precision, recall, and F1 score.
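The two sources of randomness and the aggregation step can be sketched as follows. This is an illustration of the idea only, not a full random forest: tree construction is omitted, the feature names are hypothetical, and a majority vote is assumed as the classification counterpart of averaging the trees' predictions.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows can repeat."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(feature_names, size, rng):
    """Pick the small subset of features considered at one node split."""
    return rng.sample(feature_names, size)

def majority_vote(tree_predictions):
    """Combine the per-tree class predictions into one overall class."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(42)                      # fixed seed for repeatability
rows = list(range(10))                       # stand-ins for training records
features = ["pkts", "bytes", "dur", "rate"]  # hypothetical feature names

sample = bootstrap_sample(rows, rng)
subset = random_feature_subset(features, 2, rng)
overall = majority_vote(["attack", "normal", "attack"])
```

Each tree would be trained on its own `sample` and would consider a fresh `subset` at every split, which is what makes the individual trees differ from one another.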

Naive Bayes
NB is a probabilistic algorithm that works by computing the probability of all the feature vectors and their outcomes. The algorithm is used to determine the probability of an event occurring based on previous events, which is called the posterior probability and is expressed as follows:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

where $P(A \mid B)$ is the posterior probability, $P(A)$ is known as the prior probability, $P(B)$ is the marginal likelihood (evidence), and $P(B \mid A)$ is referred to as the likelihood. This formula can be applied to datasets in the following way:

$$P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)}$$

where $y$ is the class variable and $X$ is the feature vector of size $n$, given by:

$$X = (x_1, x_2, \dots, x_n)$$
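A worked example may help to show how the posterior is obtained for a feature vector. The sketch below assumes categorical features and the usual "naive" conditional-independence assumption, under which the likelihood of the feature vector is the product of the per-feature likelihoods. All probabilities and feature values are made up for illustration.

```python
def naive_bayes_posteriors(priors, likelihoods, x):
    """Return P(y | x) for every class y, given priors and per-feature likelihoods."""
    scores = {}
    for y, prior in priors.items():
        score = prior
        for j, value in enumerate(x):
            score *= likelihoods[y][j][value]  # P(x_j | y), features assumed independent
        scores[y] = score                      # P(x | y) * P(y)
    evidence = sum(scores.values())            # P(x)
    return {y: score / evidence for y, score in scores.items()}

# Hypothetical prior and likelihood tables for two features (protocol, TCP flag).
priors = {"attack": 0.2, "normal": 0.8}
likelihoods = {
    "attack": [{"tcp": 0.7, "udp": 0.3}, {"syn": 0.9, "ack": 0.1}],
    "normal": [{"tcp": 0.5, "udp": 0.5}, {"syn": 0.1, "ack": 0.9}],
}

posteriors = naive_bayes_posteriors(priors, likelihoods, ("tcp", "syn"))
predicted_class = max(posteriors, key=posteriors.get)
```

For this record the unnormalized scores are 0.2 × 0.7 × 0.9 = 0.126 for attack and 0.8 × 0.5 × 0.1 = 0.04 for normal, so the attack class wins despite its smaller prior.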

Artificial Neural Network
An ANN is an ML model that is based on how the human brain operates and can be used to perform supervised learning. An ANN consists of neurons, or nodes, that make up the layers of the network [25]. The three types of layers in an ANN are the input, hidden, and output layers. The input layer takes the information provided and passes it on to the hidden layer. The hidden layer performs computations and transfers the data to the output layer. The output layer also performs computations and presents the output of the ANN [26]. When performing supervised learning, the network is given the inputs and expected outputs for training. The connections between the nodes in the network have numbers assigned to them called weights. When an error is made by the network, the error is propagated back through the network and the weights are adjusted. This process occurs repeatedly until the error is minimized, and then the test data can be fed through the network [27].

Training an ANN proceeds as follows. The first step involves multiplying the input values $x_i$ by the weights $w_i$ and then summing the values:

$$\sum_{i=1}^{n} x_i w_i = x_1 w_1 + x_2 w_2 + \dots + x_n w_n$$

The second step involves adding the bias $b$ of the hidden layer node to the summed values:

$$z = \sum_{i=1}^{n} x_i w_i + b$$

The third step is to pass the value $z$ through an activation function such as ReLU or Softmax. ReLU $R(z)$ can be defined as follows:

$$R(z) = \max(0, z)$$

where $z$ is the input to a neuron. When $z$ is smaller than zero, the function outputs zero, and, when $z$ is greater than or equal to zero, the output is simply the input. Softmax can be defined as follows:

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

where $e$ is the base of the natural logarithm, $z$ is the vector of inputs, and $i$ and $j$ index the input and output units, respectively. To train the ANN, the loss needs to be calculated so that the network can effectively evaluate its performance and make the appropriate changes.

Once the loss has been calculated, the next step is to minimize this loss by changing the weights and the biases. How the cost function $C$ (a measure of how well a neural network did with respect to its given training sample and the expected output) changes in relation to the weights $w_i$ can be determined using gradients. Using the chain rule, the gradient of the cost function in relation to the weights can be calculated as follows:

$$\frac{\partial C}{\partial w_i} = \frac{\partial C}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w_i}$$

where $\partial C / \partial \hat{y}$ is the gradient of the cost function, $\partial \hat{y} / \partial z$ is the gradient of the predicted value, and $\partial z / \partial w_i$ is the gradient of $z$ with regard to $w_i$.

ANN is a widely used model for IoT attack detection and has had many implementations. Recently, the authors of [28] implemented an ANN-based model for detecting IoT-based attacks. The model was successful and can be used on IoT networks to perform intrusion detection. In [29], ANNs were also used to implement intrusion detection. This research had very good results, with the model having near-perfect accuracy and a very low false positive rate.
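The forward steps and the chain rule can be traced for a single neuron. The sketch below is an illustration under stated assumptions: a single neuron with a ReLU activation, and a squared-error cost $C = (\hat{y} - y)^2$, which is chosen here only because the text does not fix a particular cost function. The input values, weights, and bias are made up.

```python
import math

def weighted_sum(x, w, b):
    """Steps 1 and 2: z = sum_i x_i * w_i + b."""
    return sum(x_i * w_i for x_i, w_i in zip(x, w)) + b

def relu(z):
    """ReLU activation: zero for negative inputs, the input itself otherwise."""
    return max(0.0, z)

def softmax(z):
    """Softmax over a list of values (shifted by the maximum for stability)."""
    exps = [math.exp(v - max(z)) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def neuron_gradients(x, w, b, y):
    """dC/dw_i by the chain rule, assuming C = (y_hat - y)^2 and y_hat = relu(z)."""
    z = weighted_sum(x, w, b)
    y_hat = relu(z)
    dC_dyhat = 2.0 * (y_hat - y)      # gradient of the assumed cost
    dyhat_dz = 1.0 if z > 0 else 0.0  # gradient of ReLU
    return [dC_dyhat * dyhat_dz * x_i for x_i in x]  # dz/dw_i = x_i

x, w, b = (1.0, 2.0), (0.5, -0.25), 0.1  # hypothetical inputs, weights, bias
z = weighted_sum(x, w, b)
gradients = neuron_gradients(x, w, b, y=1.0)
probabilities = softmax([1.0, 2.0, 3.0])
```

A gradient-descent update would then move each weight a small step against its gradient, for example `w_i - learning_rate * gradient_i`, and repeat until the cost stops decreasing.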

Logistic Regression
LR is a supervised learning algorithm that uses the logistic function, also known as the Sigmoid function. Logistic regression is similar to linear regression except that, instead of predicting continuous data, it is used for classifying data as either true or false. Linear regression can output any value, whereas LR outputs values between 0 and 1 [30]. Logistic regression is less represented in intrusion detection than other models, and its suitability for use in intrusion detection is not as well established as that of the previous models. However, some research has examined a logistic-regression-based intrusion detection model [31]. This model was tested using multi-class classification and was able to outperform the other models.
As previously mentioned, logistic regression can be thought of as linear regression for classification problems. Logistic regression is used because, with linear regression, the hypothesis $h_\theta(x)$ can be greater than one or less than zero. With logistic regression, the hypothesis is between zero and one, i.e., $0 \leq h_\theta(x) \leq 1$, where $h_\theta$ is a single hypothesis that maps inputs to outputs and can be evaluated and used to make predictions.
To get a value between zero and one, the Sigmoid function is used, which is represented as follows:

$$S(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is the input to the function. This function returns a number between 0 and 1, which can be mapped to a particular class of data by using a decision boundary, which can be expressed as follows:

$$\text{class} = \begin{cases} 1 & \text{if } S(z) \geq 0.5 \\ 0 & \text{if } S(z) < 0.5 \end{cases}$$

Once the threshold is set, predictions can be made using the Sigmoid function to determine the likelihood that the data belong to class 1 as follows:

$$P(\text{class} = 1) = \frac{1}{1 + e^{-z}}$$

This function gives back a number that represents the probability that the data should be classified as class 1. With the previously defined threshold, if the number is 0.5 or above, the data will be classified as class 1, and anything less than 0.5 will be classified as class 0.
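The mapping from a score to a class can be sketched as follows, assuming the standard Sigmoid function and the 0.5 threshold described above. The input `z` stands for the linear combination of the features computed by the model; how that value is learned is outside this sketch.

```python
import math

def sigmoid(z):
    """Map any real-valued z to a number strictly between 0 and 1."""
    # Note: for very large negative z, math.exp(-z) can overflow.
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return class 1 if the predicted probability reaches the threshold, else class 0."""
    return 1 if sigmoid(z) >= threshold else 0

probability_at_zero = sigmoid(0.0)
class_for_positive = classify(3.0)
class_for_negative = classify(-2.0)
```

Since the Sigmoid of zero is exactly 0.5, the 0.5 threshold corresponds to a decision boundary at $z = 0$: positive scores are assigned to class 1 and negative scores to class 0.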
The following subsection provides some details on IoT including the attacks that are used in the dataset for this paper.

Internet of Things Attacks
As previously discussed, the IoT is considered a network of devices/objects communicating through wired or wireless communication technologies [32]. The protocols used by IoT devices are designed for devices with limited computation, storage, and communication capabilities that need to conserve as much battery power as possible. Such protocols include ZigBee, radio-frequency identification (RFID), and smart Bluetooth. The relatively quick increase in the use of IoT devices has outpaced standardization activities, resulting in a massive influx of unsecured devices being connected to networks [33]. This in turn creates a large attack surface, leaving a large number of vulnerable devices open to exploitation by attackers. In the following subsections, we describe relevant threats and attacks faced by the IoT.

Data Exfiltration
A data exfiltration attack involves attackers gaining access to a private network and stealing data stored on the network [34]. This type of attack can result in the theft of data such as credit card information and personal data. Several studies have been conducted in the field of detecting data exfiltration attacks using methods such as partially observable Markov decision process [35] and a method that involves capturing metadata at the file system level [36].

DoS and DDoS
Denial of service (DoS) and distributed denial of service (DDoS) attacks are very similar in execution. The primary difference involves the scale of the attack. A DoS attack involves a single system and Internet connection being used to attack the victim, whereas a DDoS attack involves multiple systems and Internet connections on a global scale being used to attack the victim, which are typically referred to as botnets [37].
There are many different ways to perform either of these attacks depending on what protocol is used in the attack. These methods include HTTP flood, TCP SYN, and UDP flood attacks, as identified by Mahjabin et al. [38]. An HTTP flood attack involves altering either the GET or POST requests sent via HTTP. A GET request is used when a client wishes to receive information from the server, whereas a POST request is used to send information to the server, such as uploading a file. Sending thousands of these requests to a server or cluster of servers at once greatly increases the workload on the server side, slowing the entire system down or preventing legitimate users from accessing the server(s).
A TCP SYN attack exploits the three-way handshake that occurs when a TCP connection is established, in which the client sends a SYN packet and the server responds with a SYN-ACK packet. During the attack, the source address in the SYN packet is spoofed. As a result, the server sends its SYN-ACK messages to an address that never completes the handshake and retransmits them repeatedly. Each half-open connection stores an entry in the server's connection table, which eventually becomes full and prevents legitimate users from accessing the server. A UDP flood attack involves sending UDP packets to ports on the server, sometimes with a spoofed IP address as well. Once the server receives such a packet, it checks for any application listening on the port given in the UDP packet and, if none is found, sends back a "Destination Unreachable" packet. As more and more packets are received, the system becomes unresponsive to other clients.
Moreover, attackers are able to turn on devices such as webcams and digital video recorders (DVRs). One such example was the Mirai botnet in 2016, which was able to make use of up to 400,000 devices and take down large websites such as Twitter and GitHub [39]. Due to the lack of security on IoT devices, considerable research has been conducted into detecting DoS and DDoS traffic [40,41]. However, these approaches do not make use of ML techniques.

Keylogging
The basic function of a keylogger is to store the keystrokes made by a user on their keyboard. Keyloggers can be both hardware and software based [42,43]. Software keylogging is typically done by installing malware on the victim machine that saves the key strokes and relays this to the attacker. Some research has been devoted to keylogging detection methods (see, e.g., [44,45]).

OS Scan and Service Scan
Operating system (OS) and service scans are similar in nature and can be grouped into the attack category of probing. Probing can be done either passively, in which the attacker gathers packets from the network, or actively, in which the attacker sends traffic and records the responses. Since passive scanning generates no traffic, active scanning is needed to produce traffic for testing. OS scans allow the attacker to discover the OS being used by the victim machine. This information can help an attacker identify the type of device, e.g., server, computer, or IoT device. It can also help the attacker identify the version of the OS being used, which in turn can help the attacker find vulnerabilities related to the OS.
A plethora of research has been conducted into using OS scans to identify whether a device is an IoT device. One study used neural networks to identify whether the scanned device was an IoT device [46]. Another study used deep learning techniques to identify Raspberry Pi devices that were acting as IoT devices [47]. Both studies show that it is possible to identify IoT devices using OS scanning techniques.
Service scans, more commonly referred to as port scans, involve the attacker probing a network in order to identify open ports on the network [48]. This is commonly used by an attacker to gain a better insight into the types of activity on the network as well as showcasing any open ports that are vulnerable to being exploited. A port scan works by having the software used send a request to a port on another network to set up a connection. The software will then wait for a response from the network.
Due to the fact that IoT devices can range from printers to heating controllers, the ports that can be used by devices can vary. To this end, the authors of [49] conducted a study performing a scan on printers to identify vulnerable ports. The results showcase that port 9100 was a commonly opened port on printers. The port is used to carry data to and from printers over TCP. It was also noted that gaining access to the network using this port was a simple process.
Port scanning can also be used to identify if a device is an IoT device. An analysis by Sivanathan et al. [50] showed that by scanning for a small number of TCP ports it could be determined whether a device was an IoT device including information on the device itself, such as identifying a device as an HP printer. Since IoT devices are generally more vulnerable than other devices, this could be used to identify an entry point to a network. A study using an approach based on Dempster-Shafer evidence theory produced a solid groundwork for detecting port scan traffic [51]. Another study proposed a new evaluation metric for IDS, which was reported to take less time to identify port scan data than previous metrics [52]. Neither of these studies included IoT devices, and there is currently a lack of research into OS scans in regards to IoT devices.
Recently, several efforts have been devoted to ML in IoT networks [32,53–55]. However, in most of the existing works, the performance is checked for specific types of ML algorithms, such as ANN, J48 DT, and NB, without detailed performance evaluation. Although some work is based on various ML algorithms such as LR, SVM, DT, RF, ANN, and KNN, most of them are used to mitigate IoT cybersecurity threats in special environments such as a smart city. In contrast to existing works, our study provides a comprehensive evaluation on both real attack and simulated attack data that were created by simulating a realistic network at the University of New South Wales, where real attacks on IoT networks were recorded.

Benchmark Data
Our evaluation involves using several datasets with several ML models to identify the best model for correctly classifying IoT attack data. When selecting the datasets, the two most important factors were the amount of variety in the attack data and how up-to-date the datasets are. The datasets chosen were the Bot-IoT datasets [56] because they met both of these criteria.

Performance Evaluation Metrics
For evaluation, we consider the following metrics.

Confusion Matrix
A confusion matrix shows the predictions made by the model. It is designed to show where the model has correctly and incorrectly classified the data.
The confusion matrix for binary and multi-class classification is different. With binary classification, the matrix shows the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) results, as shown in Table 2. The columns represent the correct classification of the data and the rows represent the available classifications. TP and TN are when the data are correctly classified as either attack or no attack. FP and FN are when data are incorrectly predicted as the other class. When using a confusion matrix for multi-class problems, the same principles apply. However, the matrix shows all the classes, which allows for observing where the misclassification is occurring in the classes, as shown in Table 3.

Table 3. Multi-class confusion matrix example.

                   Actual label
Predicted label    Class 1    Class 2    Class 3
Class 1            C          W          W
Class 2            W          C          W
Class 3            W          W          C

In Table 3, C represents where the correct classifications are located and W represents incorrect classifications. Note that the correct classifications create a diagonal path through the table from the top left corner to the bottom right corner.
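A multi-class confusion matrix of this kind can be built by counting label pairs. The sketch below follows the layout described above, with rows for the predicted label and columns for the actual label; note that some libraries use the transposed layout. The class names and label lists are hypothetical.

```python
def confusion_matrix(actual, predicted, classes):
    """Count (predicted, actual) pairs; rows = predicted label, columns = actual label."""
    index = {name: position for position, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[p]][index[a]] += 1
    return matrix

classes = ["class1", "class2", "class3"]
actual = ["class1", "class1", "class2", "class2", "class3", "class3"]
predicted = ["class1", "class2", "class2", "class2", "class3", "class1"]

matrix = confusion_matrix(actual, predicted, classes)
# The diagonal holds the correct classifications (the C cells).
correct = sum(matrix[i][i] for i in range(len(classes)))
```

In this toy example four of the six records fall on the diagonal, and the two off-diagonal entries show exactly which classes were confused with each other.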

Accuracy
Accuracy is a metric that can be used to identify the percentage of predictions that were classified correctly and is expressed as follows:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

This can be expanded upon by utilizing the results of a confusion matrix, including TP, TN, FP, and FN, and can be defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision
Precision is used to determine the ratio of correctly predicted positive outcomes against the total number of predicted positive outcomes and can be defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall
Recall is used to determine the ratio of correctly predicted positive outcomes to all the outcomes in the given class and can be defined as follows:

$$\text{Recall} = \frac{TP}{TP + FN}$$

F1 Score
The F1 score is the harmonic mean of precision and recall, which produces a number between 0 and 1, and can be defined as follows:

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Whether to select the F1 score or accuracy depends on how the data are distributed. The F1 score is a better performance metric than accuracy when the classes are highly unbalanced, because it takes into account how the data are distributed. Since imbalanced class distributions exist in most real-life classification problems, the F1 score is often the better metric to use. Accuracy is appropriate when the class distribution is similar; because it does not take into account how the data are distributed, it may otherwise lead to wrong conclusions.
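The difference between accuracy and the F1 score on unbalanced data can be seen in a small worked example. The counts below are made up for illustration (140 attack records and 860 normal records) and are not results from this paper; the function names are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Share of predicted positives that were truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical confusion-matrix counts for an unbalanced binary problem.
tp, tn, fp, fn = 90, 850, 10, 50

acc = accuracy(tp, tn, fp, fn)
f1 = f1_score(tp, fp, fn)
```

With these counts the accuracy is 0.94 while the F1 score is only 0.75, because 50 of the 140 attack records are missed; the large number of correctly classified normal records hides this in the accuracy figure.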

Log Loss
Log loss measures the performance of a model using the probability it assigns to the expected outcome. The higher the probability assigned to the actual class, the lower the log loss will be. A lower score indicates that the model has performed better.
For binary classification, where the number of possible classes (M) = 2, log loss can be expressed as follows:
$$-\left(y \log(p) + (1 - y)\log(1 - p)\right)$$
For multi-class classification, where M > 2, a separate loss for each class label is calculated and the results are summed, which is expressed as follows:
$$-\sum_{i=1}^{M} y_i \log(p_i)$$
where M is the number of possible classes (0, 1, 2), log is the natural logarithm, $y_i$ is a binary indicator of whether class label i is the correct classification for the observation, and $p_i$ is the model's predicted probability that the observation belongs to class i. In the binary case, y and p are the indicator and the predicted probability for the positive class.
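The two log loss expressions can be sketched in plain Python as follows. The function names are hypothetical. Averaging over observations and clipping probabilities away from 0 and 1 are assumptions added here, since the text gives only the per-observation loss.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log loss; p_pred holds probabilities of the positive class."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def multiclass_log_loss(y_true, p_pred, eps=1e-15):
    """Mean multi-class log loss; y_true holds class indices."""
    total = 0.0
    for label, probs in zip(y_true, p_pred):
        # y_i is 1 only for the true class, so the sum over classes
        # reduces to the log of the probability given to that class.
        p = min(max(probs[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)


confident = binary_log_loss([1, 0], [0.9, 0.1])   # high probability on the truth
unsure = binary_log_loss([1, 0], [0.6, 0.4])      # lower probability on the truth
wrong = binary_log_loss([1], [0.0])               # certain, but incorrect
three_class = multiclass_log_loss([2], [[0.1, 0.2, 0.7]])
print(confident, unsure, wrong, three_class)
```

The confident predictions give the lowest loss, and a single prediction that puts zero probability on the true class is penalised very heavily.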

ROC AUC
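The receiver operating characteristic (ROC) curve is a graph that plots the results of the model at various thresholds when making predictions. The graph uses the true positive rate (TPR) and the false positive rate (FPR), which are expressed as follows:
$$TPR = \frac{TP}{TP + FN}$$
$$FPR = \frac{FP}{FP + TN}$$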
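As a minimal sketch of how the points of a ROC curve arise, the following computes TPR and FPR at a few thresholds. The function name, labels, scores, and thresholds are hypothetical values chosen for illustration.

```python
def roc_points(y_true, scores, thresholds):
    """Return (threshold, FPR, TPR) for each threshold.

    A sample is predicted positive when its score is >= the threshold.
    """
    points = []
    for t in thresholds:
        tp = fp = tn = fn = 0
        for y, s in zip(y_true, scores):
            predicted = 1 if s >= t else 0
            if predicted == 1 and y == 1:
                tp += 1
            elif predicted == 1 and y == 0:
                fp += 1
            elif predicted == 0 and y == 0:
                tn += 1
            else:
                fn += 1
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        points.append((t, fpr, tpr))
    return points


# Hypothetical labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores, [0.0, 0.3, 0.5, 1.0])
for threshold, fpr, tpr in points:
    print(threshold, fpr, tpr)
```

Lowering the threshold moves the point towards the top right of the graph (more true positives, but also more false positives).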

Cohen's Kappa Coefficient
Cohen's kappa coefficient (CKC), also referred to as the kappa statistic, is used to test the inter-rater reliability of prediction and can be expressed as follows:
$$\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$$
where Pr(a) is the observed agreement and Pr(e) is the expected agreement. This metric is useful because it compares the model against one that guesses based on the frequency of the classes. This allows the disparity in a dataset to be evaluated, particularly in multi-class testing, where the dataset has varying numbers of data points per attack.
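A minimal sketch of the kappa calculation from a confusion matrix is given below. The function name and both matrices are hypothetical. The second matrix mimics a model that always predicts the majority class.

```python
def cohen_kappa(matrix):
    """Cohen's kappa from a square confusion matrix.

    matrix[i][j] is the number of samples of true class i predicted as class j.
    Undefined (division by zero) when the expected agreement equals 1.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(n)) / total            # Pr(a)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2  # Pr(e)
    return (observed - expected) / (1 - expected)


# Hypothetical balanced example: 70% observed agreement, 50% expected.
balanced = cohen_kappa([[20, 5], [10, 15]])

# Hypothetical majority-class guesser on a 10:90 dataset:
# 90% accuracy, but no agreement beyond what class frequency explains.
majority_guess = cohen_kappa([[0, 10], [0, 90]])
print(balanced, majority_guess)
```

The second case shows why the statistic is informative on imbalanced data: the accuracy is 0.9, yet kappa is 0.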

Dataset Description
The dataset named Bot-IoT was submitted to the IEEE website on 16/10/19 and was created by the University of New South Wales (UNSW). The dataset consists of ten CSV files containing records for the following attacks on IoT networks: (i) Data exfiltration; (ii) DoS HTTP; (iii) DoS TCP; (iv) DoS UDP; (v) DDoS HTTP; (vi) DDoS TCP; (vii) DDoS UDP; (viii) Keylogging; (ix) OS Scan; and (x) Service Scan. The dataset comprises both real attack and simulated attack data and was created by simulating a realistic network at the UNSW [56].
Table 4 shows the features used in the experiments. There are 35 columns in the dataset; however, only the ones in Table 4 were used. When deciding which features to use, the contents of the columns are examined, and columns that have no values, columns that contain text, and columns deemed irrelevant to the overall classification of the data are removed. One important part of examining the dataset is checking the representation of the classes, i.e., whether one class is over- or under-represented, as this can have a detrimental effect on the experiments. Table 5 shows the amount of attack data and no-attack data for each dataset used in the experiments.
To conduct multi-class testing, a new CSV file is created from the binary classification datasets: the datasets are collected, randomized, and written to a new file. Due to the large size of the dataset, only a selected percentage of the data is used to prevent excessive run times. Table 6 shows the class representation of the training and test data in the multi-class dataset. It is observable in both the binary and multi-class datasets that not all classes have equal representation. Testing with weighted classes can be done to see the effects of having equal representation among the classes.
The models SVM, DT, RF, ANN, and LR are able to use the balanced weighted classes option, which assigns the class weights as follows:
$$\text{Weight} = \frac{\text{Samples}}{\text{Classes} \times Y}$$
where Samples is the number of rows in the dataset, Classes is the number of classes in the dataset, and Y is the number of labels belonging to the class being weighted.
We use the Python programming language, version 3.7.4, for the implementation of the ML algorithms. The two main modules used for the implementation of the models are sklearn (also referred to as scikit-learn) and Keras. Keras is used to implement the ANN, while sklearn is used to implement the other models. Note that, for comparison purposes, we used the default hyperparameter values for each classifier. Table 7 lists the modules used, with a brief description of each.
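The balanced class weighting described above can be sketched in plain Python as follows. The function name is hypothetical, and the label counts are illustrative values that mirror a small, attack-heavy test set rather than any configuration used in the study.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight per class: Samples / (Classes * rows carrying that class label)."""
    counts = Counter(labels)
    samples = len(labels)
    classes = len(counts)
    return {label: samples / (classes * count) for label, count in counts.items()}


# Illustrative labels: 24 attack rows (1) and 5 no-attack rows (0).
labels = [1] * 24 + [0] * 5
weights = balanced_class_weights(labels)
print(weights)
```

The under-represented class receives the larger weight, so errors on it count for more during training.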

Feature Extraction
The dataset contains features that either contain no information or have information that is irrelevant in helping the model classify the data. The unwanted features can be removed during the preprocessing stage using the pandas module. Several features, such as flgs, proto, dir, state, saddr, daddr, srcid, smac, dmac, soui, doui, sco, record, category, and subcategory, were removed from the dataset.
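The removal of unwanted features can be sketched as follows. The study performs this step with pandas, where `DataFrame.drop(columns=...)` would play the same role; the version below is a standard-library stand-in so that it runs without extra dependencies. The list of removed columns is taken from the text, while the helper name and the sample row with its kept column names are hypothetical.

```python
import csv
import io

# Columns listed in the text as removed during preprocessing.
UNUSED_COLUMNS = {
    "flgs", "proto", "dir", "state", "saddr", "daddr", "srcid", "smac",
    "dmac", "soui", "doui", "sco", "record", "category", "subcategory",
}


def drop_columns(rows, unused=UNUSED_COLUMNS):
    """Return copies of the rows without the unused columns."""
    return [{k: v for k, v in row.items() if k not in unused} for row in rows]


# Hypothetical one-row CSV standing in for a Bot-IoT file.
sample_csv = "pkts,bytes,proto,saddr,attack\n10,620,tcp,192.168.100.3,1\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
kept = drop_columns(rows)
print(kept)
```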

Feature Scaling
The features in the dataset contain large numbers that vary in size. Therefore, it is important to normalize the data in the features. This is done by re-scaling the values of the features to within a defined scale, such as −1 to 1, and can be defined as follows:
$$x' = a + \frac{(x - \min(x))(b - a)}{\max(x) - \min(x)}$$
where x' is the normalized value, x is the original value, and a and b are the minimum and maximum values of the target scale. The result is a number between −1 and 1. This can be done in Python using the MinMaxScaler in sklearn's preprocessing module.
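A minimal sketch of min-max scaling to the range −1 to 1 is shown below, written in plain Python rather than with MinMaxScaler. The function name and input values are hypothetical, and mapping a constant feature to the lower bound is an assumption made here to avoid a division by zero. With sklearn, the range would be set through `MinMaxScaler(feature_range=(-1, 1))`, since the scaler's default range is 0 to 1.

```python
def min_max_scale(values, a=-1.0, b=1.0):
    """Rescale values linearly so the smallest maps to a and the largest to b."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, map it to a.
        return [a for _ in values]
    return [a + (x - lo) * (b - a) / (hi - lo) for x in values]


# Hypothetical feature column.
scaled = min_max_scale([10, 20, 40, 50])
print(scaled)
```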

Multi-Class Dataset
The multi-class dataset is created by collecting all the rows of all the datasets and then randomizing the rows using the random Python module. The random module contains the shuffle method, which allows an array, in this case the rows of the dataset, to be randomized. Because the full dataset is too large to test with in a reasonable time, only roughly 25% of it is used, which is 1,500,000 rows.
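The shuffle-and-subsample step can be sketched with the random module as follows. The function name, the placeholder rows, and the fixed seed are assumptions for illustration; the text does not state whether a seed was used for this step.

```python
import random


def shuffle_and_sample(rows, fraction, seed=None):
    """Shuffle a copy of the rows and keep the leading fraction of them."""
    rng = random.Random(seed)      # seeded generator for a repeatable shuffle
    shuffled = list(rows)          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    keep = int(len(shuffled) * fraction)
    return shuffled[:keep]


# Placeholder rows standing in for the combined binary datasets.
rows = [("row", i) for i in range(1000)]
subset = shuffle_and_sample(rows, fraction=0.25, seed=0)
print(len(subset))
```

Shuffling before truncating matters because the source files are grouped by attack type; taking the first 25% of an unshuffled file would leave some classes out.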

Training Data
The data used by the model to learn are called the training data. Data can be split into training and test data in various ratios. For this study, a split of 80:20 was used, with 80% of the data being used for training the models. This ratio follows the Pareto principle, which states that 80% of results come from 20% of the effort.

Test Data
Twenty percent of the data is used for testing, which is typically a good amount of data. However, if the dataset is small, this can result in a low amount of test data and in the illusion that the model has done extremely well when in fact it has not had enough data to be properly tested. To split the dataset into training and test data, the train_test_split function from sklearn's model_selection module can be used. The function's random_state parameter sets the seed of the pseudo-random number generator; in this case, the value 121 was used.
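The 80:20 split with a fixed seed can be sketched as follows. This is a standard-library stand-in, not the train_test_split function itself, so it will not reproduce the exact rows that sklearn selects for the same seed. The function name and placeholder rows are hypothetical. In the study's toolchain the corresponding call would be along the lines of `train_test_split(X, y, test_size=0.2, random_state=121)`, where `X` and `y` are assumed names for the features and labels.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=121):
    """Randomly assign a fixed fraction of the rows to the test set."""
    rng = random.Random(seed)          # fixed seed gives a repeatable split
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


# Placeholder rows standing in for one of the binary datasets.
rows = list(range(100))
train, test = split_train_test(rows)
print(len(train), len(test))
```

Fixing the seed means every model is trained and tested on the same rows, so differences in the scores come from the models rather than from the split.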

Results and Discussion
To test several ML algorithms and to identify which are the best and worst for classifying attack data on IoT networks, this section provides all the results and analysis based on several performance metrics including binary and multi-class testing.

Binary Classification
Data Exfiltration: Table 8 shows the results for data exfiltration data, where RF has the best scores for all the performance metrics, including log loss. Although DT also has perfect scores, it has a high log loss, indicating that the RF model is more confident in making predictions. Table 9 shows the confusion matrix for RF and reveals two noteworthy pieces of information: the amount of data tested is very low, and the classes do not have equal representation. It is possible that the low amount of test data is having an impact on the results. However, the other models, apart from DT, have relatively poor scores compared to RF.

Table 9. Data exfiltration RF confusion matrix.
Predicted label  No Attack  Attack
No Attack        5          0
Attack           0          24

Table 10 shows that increasing the test data to 30% decreases the log loss, indicating that the model performs better with more data, although only marginally. Once the test data reaches 40% and beyond, the results begin to get worse, although the model is able to maintain perfect recall with up to a 50% split between the training and test data. Because the class representation is imbalanced, the weighted classes parameter can be used. This allows the disparity between the classes to be rectified, the results of which are shown in Table 11. This option is not available when using the KNN and NB models. It is observable in Table 11 that SVM performs better with weighted classes, with all metrics increasing and log loss decreasing. ANN is unaffected by weighted classes, and LR is marginally affected, with the model having perfect precision but a lower recall. DT loses its perfect scores, while RF is able to keep perfect scores but slightly increases its log loss.

Algorithms Used  Accuracy  Precision  Recall  F1 Score  Log Loss  ROC AUC
KNN [14]         n/a       n/a        n/a     n/a       n/a       n/a
SVM [57]         0.93      1.0        0.91    0.95      0.25      0.95
DT [19]          0.93      1.0        0.91    0.95      0.12      0.95
NB [58]          n/a       n/a        n/a     n/a       n/a       n/a
RF [24]          1.0

Without using weighted classes, RF is the best model due to its low log loss when compared to DT. When weighted classes are applied, RF is still the best model with perfect scores and a low log loss, indicating that the model is confident in making predictions.
DDoS HTTP: Table 12 shows the results of DDoS HTTP data. DT has perfect performance scores but a high log loss of 7.25. This dataset does not suffer from a lack of data; rather, it suffers from a large imbalance, since attack data are far more prevalent in the dataset, as shown in Table 13. This confusion matrix shows a large disparity in the data, with a ratio of 3:1319 in favor of attack data. A large disparity in the dataset can affect the log loss: log loss is based on probability, and, because the data are more likely to be attack data, the log loss can be skewed. Table 14 shows the results of weighted classes on the DDoS HTTP data. With weighted classes, both SVM and LR have a sizeable decrease in performance across all metrics except log loss, which has decreased for both, and ROC AUC, which has increased for both. ANN is unaffected by the weighted classes and retains its perfect recall, whereas RF loses its perfect recall. DT loses its perfect scores but has a large decrease in its log loss.

Algorithms Used  Accuracy  Precision  Recall  F1 Score  Log Loss  ROC AUC
KNN [14]         n/a       n/a        n/a     n/a       n/a       n/a
SVM [57]         0.89      0.99       0.89    0.94      0.013     0.83
DT [19]          0.99      0.99       0.99    0.99      0.018     0.88
NB [58]          n/a       n/a        n/a     n/a       n/a       n/a
RF [24]          0

Without using weighted classes, DT is the best model due to its perfect scores, although the high log loss is a factor to consider. RF would be the second best, as it has perfect recall as well as the lowest log loss and the highest ROC AUC. When weighted classes are applied, ANN is the best model, as it has perfect recall and a low log loss.
DDoS TCP: Table 15 shows the results of the DDoS TCP data. The models DT and RF both have perfect scores except for log loss, which is high for both. Table 16 shows the confusion matrix for RF, and once again the matrix shows a very large disparity in the data represented. Table 17 shows the results of DDoS TCP data with weighted classes enabled. With weighted classes enabled, SVM has lost its perfect precision but lowered its log loss significantly. DT and ANN are unaffected by the weighted classes, while RF retains its perfect scores and lowers its log loss slightly. LR has lost its perfect recall and increased its log loss and ROC AUC.

Algorithms Used  Accuracy  Precision  Recall  F1 Score  Log Loss  ROC AUC
KNN [14]         n/a       n/a        n/a     n/a       n/a       n/a
SVM [57]         0.99      0.99       0.99    0.99      0.00040   0.83
DT [19]          1.0       1.0        1.0     1.0       9.99      1.0
NB [58]          n/a       n/a        n/a     n/a       n/a       n/a
RF [24]          1.0

Both with and without weighted classes, RF is the best model, as it has perfect scores. With weighted classes, the log loss is lowered but is still quite high when compared to LR, which has a very low log loss.
DDoS UDP: Table 18 shows the results of the DDoS UDP data, where both KNN and DT have perfect scores, but KNN is the better model as it has a lower log loss. Although the log loss is still high, this is the case for all the models apart from NB. Table 19 shows the confusion matrix for RF, which shows the disparity in the class representation. Table 20 shows the results of DDoS UDP data with weighted classes enabled. The table shows that SVM has gained perfect scores and lowered its log loss, while DT has lost its perfect scores and lowered its log loss substantially. RF has gained perfect scores and lowered its log loss, while ANN is unaffected. LR has lost perfect recall but gained perfect precision, lowered its log loss, and increased its ROC AUC.

Algorithms Used  Accuracy  Precision  Recall  F1 Score  Log Loss  ROC AUC
KNN [14]         n/a       n/a        n/a     n/a       n/a       n/a
SVM [57]         1.0       1.0        1.0     1.0       2.84      1.0
DT [19]          0.99      1.0        0.99    0.99      0.000011  0.99
NB [58]          n/a       n/a        n/a     n/a       n/a       n/a
RF [24]          1.0

Without weighted classes, KNN is the best model, as it has perfect scores, but the log loss is high. NB would be second best, as it has perfect precision and a low log loss. With weighted classes, RF is the best model, as it has perfect scores and a low log loss.
Key logging: Table 21 shows the results of key logging data. DT is the best model, as it has the best log loss and ROC AUC scores combined with perfect precision and high scores on the other metrics. Table 22 shows the confusion matrix for DT, where it is observable that the dataset has a low amount of data and that the data are imbalanced. Just as with data exfiltration, the amount of test data can be increased to observe the effect on the scores of the DT model. Table 23 shows the results of increasing the test data for key logging data. Increasing the test data to 30% gives the model perfect recall instead of perfect accuracy. Once the data are increased to 50%, the model no longer has perfect recall or precision. Based on the changes in the results, it is observable that the low amount of data has a significant impact on the results of the model. Table 24 shows the results of key logging data with weighted classes enabled. SVM shows an overall decrease in performance, with the model no longer having perfect recall. DT and RF also show a drop in performance, with the models losing their perfect precision and recall, respectively. ANN is unaffected, while LR has a large decrease in the model's recall, leading to the worst performance of all the models.

Algorithms Used  Accuracy  Precision  Recall  F1 Score  Log Loss  ROC AUC
KNN [14]         n/a       n/a        n/a     n/a       n/a       n/a
SVM [57]         0.87      0.98       0.87    0.92      0.17      0.88
DT [19]          0.98      0.99       0.98    0.99      0.038     0.97
NB [58]          n/a       n/a        n/a     n/a       n/a       n/a
RF [24]          0

Without weighted classes, DT is the best model, with the lowest log loss and highest ROC AUC as well as perfect precision. With weighted classes, all the models tested had a decrease in performance except for ANN, which was unchanged. Apart from its perfect recall, ANN still has comparatively worse scores than DT and RF. Unless perfect recall is a factor, DT should be used, as it will correctly classify more data than ANN.
OS Scan: Table 25 shows the results for OS Scan data. All of the models have good scores, with RF, ANN, and LR having perfect recall, indicating that the models made no false negatives. RF has a higher precision than LR and ANN, as well as a lower log loss and a higher ROC AUC. This would suggest that RF is the best model. However, inspection of the confusion matrix shows a large imbalance of data in the dataset, as shown in Table 26. Table 27 shows the results of OS scan data with weighted classes enabled. SVM shows a decrease in accuracy, recall, F1 score, log loss, and ROC AUC. The table also shows that the performance of the models decreased overall. DT shows a decrease in log loss and ROC AUC, marking a slight increase in the model's confidence but a lower ability to perform well at different thresholds. RF has lost its perfect recall and has an increased log loss and ROC AUC. ANN has seen no change to its results, whereas LR has a large performance decrease, with only ROC AUC having improved.

Algorithms Used  Accuracy  Precision  Recall  F1 Score  Log Loss  ROC AUC
KNN [14]         n/a       n/a        n/a     n/a       n/a       n/a
SVM [57]         0.89      0.99       0.89    0.94      0.013     0.83
DT [19]          0.99      0.99       0.99    0.99      0.025     0.88
NB [58]          n/a       n/a        n/a     n/a       n/a       n/a
RF [24]          0

Without weighted classes, RF is the best model, as it has perfect recall, the lowest log loss, and the highest ROC AUC. With weighted classes, ANN is the only model with perfect recall, but DT and RF both have better accuracy, precision, log loss, and ROC AUC. If having no false negatives is needed, then ANN is the best, but DT is better at classifying data in general.
Service Scan: Table 28 shows the results for service scan data. The models SVM, RF, and ANN have perfect recall but have poor ROC AUC scores. DT has the highest ROC AUC and the lowest log loss, but RF could be considered the best due to its perfect recall.  Table 29 shows the confusion matrix for RF as well as the imbalanced data.  Table 30 shows the results of service scan data with weighted classes enabled. SVM was not tested due to excessive running times. DT, RF, and LR have increased their ROC AUC but all other metrics have been negatively affected. ANN is unaffected, being the only model to keep its perfect recall.

Algorithms Used  Accuracy  Precision  Recall  F1 Score  Log Loss  ROC AUC
KNN [14]         n/a       n/a        n/a     n/a       n/a       n/a
SVM [57]         n/a       n/a        n/a     n/a       n/a       n/a
DT [19]          0.97      0.99       0.97    0.98      0.079     0.97
NB [58]          n/a       n/a        n/a     n/a       n/a       n/a
RF [24]          0

Without weighted classes, of the models with perfect recall, RF is the best, as it has the lowest log loss and highest ROC AUC. However, DT has the best log loss overall but does not have perfect recall. With weighted classes, ANN is the best, as it is the only model to retain perfect recall, but its ROC AUC is the poorest of all the models.
DoS HTTP: Table 31 shows the results for DoS HTTP data; DT and RF both have perfect scores and a low log loss, with DT narrowly beating RF. Table 32 shows the confusion matrix for RF, which showcases the disparity in the dataset.

Predicted label  No attack  Attack
No attack        11         0
Attack           0          5942

Table 33 shows the results of DoS HTTP data with weighted classes enabled. SVM shows decreased performance in all metrics except for ROC AUC. DT and RF have lost their perfect scores and have an increased log loss. ANN is unaffected, whereas LR has seen a decrease in all performance metrics apart from ROC AUC, which has increased.

Algorithms Used  Accuracy  Precision  Recall  F1 Score  Log Loss  ROC AUC
KNN [14]         n/a       n/a        n/a     n/a       n/a       n/a
SVM [57]         0.90      0.99       0.90    0.95      0.0067    0.90
DT [19]          0.99      0.99       0.99    0.99      0.0098    0.95
NB [58]          n/a       n/a        n/a     n/a       n/a       n/a
RF [24]          0

Without weighted classes, DT is the best model, as it has perfect scores and the lowest log loss. With weighted classes, ANN is the best model, as it has perfect recall. With regard to the model's ability to classify data, ANN comes out on top due to having perfect recall.
DoS TCP: Table 34 shows the results for DoS TCP data, where all the models apart from NB have perfect recall. DT and RF have the best ROC AUC scores, but both have high log losses when compared to the other models. KNN has the lowest log loss and a ROC AUC almost as good as RF. Table 35 shows the confusion matrix for DT and shows the imbalance of the data in the dataset. Table 36 shows the results of DoS TCP data with weighted classes enabled. SVM was not recorded due to excessively long running times. With weighted classes, both DT and RF have lost their perfect recall, but DT has gained perfect precision. Both models have also seen an improvement in log loss and ROC AUC. ANN is unaffected, and LR has had a performance decrease in almost all metrics.

Algorithms Used  Accuracy  Precision  Recall  F1 Score  Log Loss  ROC AUC
KNN [14]         n/a       n/a        n/a     n/a       n/a       n/a
SVM [57]         n/a       n/a        n/a     n/a       n/a       n/a
DT [19]          0.99      1.0        0.99    0.99      0.018     0.99
NB [58]          n/a       n/a        n/a     n/a       n/a       n/a
RF [24]          0

Without weighted classes, KNN could be considered the best model, as it has the lowest log loss and a reasonably high ROC AUC. DT and RF have a higher ROC AUC but also have a considerably higher log loss than KNN. With weighted classes, both DT and ANN could be considered the best, with DT having perfect precision and ANN having perfect recall. Both models also have a low log loss, but ANN has a poorer ROC AUC score.
DoS UDP: Table 37 shows the results for DoS UDP data. NB is the best model with perfect precision, low log loss, and high ROC AUC, as well as having high metrics across all categories. All the other models have perfect recall but have either a high log loss or a low ROC AUC, or both.  Table 38 shows the confusion matrix for NB which shows the disparity between the data in the dataset.  Table 39 shows the results of DoS UDP data with weighted classes enabled. ANN is unaffected and maintains poor log loss and ROC AUC scores. SVM has gained perfect precision but lost perfect recall with an increase in log loss and ROC AUC. DT has also swapped its precision and recall scores with an increase in both log loss and ROC AUC scores. RF has lost its perfect recall and increased its log loss and ROC AUC. LR has improved its log loss, ROC AUC, and gained perfect precision while losing perfect recall.

Algorithms Used  Accuracy  Precision  Recall  F1 Score  Log Loss  ROC AUC
KNN [14]         n/a       n/a        n/a     n/a       n/a       n/a
SVM [57]         0.99      1.0        0.99    0.99      0.00053   0.99
DT [19]          0.99      1.0        0.99    0.99      5.24      0.99
NB [58]          n/a       n/a        n/a     n/a       n/a       n/a
RF [24]          0

Without weighted classes, NB is the best model, having perfect precision with a low log loss and high ROC AUC. With weighted classes, both SVM and LR perform very well, but SVM is the better model as it has the lower log loss of the two.
Table 40 shows the best models for each of the datasets, both with and without weighted classes. DT and RF are the models that appear the most in the table, with ANN appearing frequently in the weighted classes column. Without the use of weighted classes, RF achieves the best performance. With weighted classes, ANN achieves the best performance. However, using weighted classes generally decreases the overall performance of the model.

Multi-Class Classification
Table 41 shows the results for multi-class classification. KNN has the best performance metrics, including the lowest log loss and the highest Cohen's kappa score (CKS). LR is the worst model, with the lowest metrics, including the lowest CKS, and a high log loss exceeded only by SVM. Table 42 shows the results with weighted classes. KNN and NB cannot use weighted classes, and SVM was not tested because of its excessively long running time. Weighted classes have reduced the performance metrics for all models apart from ANN, which has had a small decrease in log loss, making it the best model with weighted classes.

Algorithms Used  Accuracy  Precision  Recall  F1 Score  Log Loss  CKS
KNN [14]         n/a       n/a        n/a     n/a       n/a       n/a
SVM [57]         n/a       n/a        n/a     n/a       n/a       n/a
DT [19]          0.92      0.92       0.92    0.92      0.46      0.90
NB [58]          n/a       n/a        n/a     n/a       n/a       n/a
RF [24]          0

Table 43 shows that KNN performs very well with the multi-class dataset, with all the classes having low amounts of incorrectly classified data. Table 44 shows that SVM performs poorly with the multi-class dataset, with data exfiltration (1), DDoS HTTP (2), and key logging (5) data all being incorrectly classified. These classes are ones featuring low amounts of data, which could be the reason for the low accuracy.

True  0   1  2  3      4      5  6    7    8   9      10
0     10  0  0  2      3      0  183  111  6   15     5
1     0   0  0  0      4      0  0    0    2   1      0
2     0   0  0  296    6      0  0    0    79  630    4
3     0   0  0  19626  17561  0  0    0    55  17778  1357
4     0   0  0  429    54506  0  0    0    0   2

Table 45 shows the confusion matrix for DT multi-class classification. It can be observed that the model performs very well; however, the model appears to have difficulty in correctly classifying the data that are imbalanced in the dataset. This is evident in Table 45, with data exfiltration (1), DDoS HTTP (2), and key logging (5) being incorrectly classified. Table 46 shows the confusion matrix for DT with weighted classes enabled. Using weighted classes has resulted in an overall decrease in the model's performance but has improved the correct classification of data for normal traffic (0), data exfiltration (1), and key logging (5). This has also resulted in DoS HTTP having all its data incorrectly classified.

True  0    1  2  3  4  5   6  7   8  9   10
0     297  2  0  1  3  10  0  21  0  10  3
1     0    6

Table 47 shows the confusion matrix for NB multi-class classification, which performs quite well, with no classes having all the data incorrectly classified. The model is also able to handle the data disparity in the classes, with the low-data classes having good classification results.
Table 48 shows the results for RF multi-class classification, which has good classification accuracy for the classes that have lots of data. The classes with low data have no correctly classified data. Table 49 shows the results of having weighted classes. Although the model has fewer correct classifications overall, it performs better on the low-data classes and classifies them correctly. Table 50 shows the results for ANN multi-class classification. The model performs well except for data exfiltration (1) and key logging (5), which have incorrectly classified data. Table 51 shows the results with weighted classes enabled. It is observable that the model is much better at classifying most classes, with OS scan (6) and service scan (7) having the most incorrectly classified data. The model is also unable to correctly classify any data for normal data (0) and data exfiltration (1). Table 52 shows the results for LR multi-class classification, which has poor performance overall, with the low-data classes having no correctly classified data. Table 53 shows the results of having weighted classes. It is evident that the accuracy of overall classification has decreased; however, the model shows improvement in classifying the low-data classes.

Conclusions
In this paper, state-of-the-art ML algorithms are compared in terms of accuracy, precision, recall, F1 score, and log loss on both the weighted and non-weighted Bot-IoT dataset. It is shown that the performance of RF in terms of accuracy and precision is the best with the non-weighted dataset. However, in a weighted dataset, ANN has higher accuracy for binary classification. In multi-class classification, KNN and ANN are highly accurate for non-weighted and weighted datasets, respectively. From the results, it is evident that, when all types of attack have weighted datasets, ANN predicts the type of attack with higher accuracy.
In the future, we intend to adopt the models explored in this research into an IDS prototype for testing using diverse data including a mix of attacks to validate the multi-class functionality of models.