Sensors
Article | Open Access

10 January 2021

An Experimental Analysis of Attack Classification Using Machine Learning in IoT Networks

1 School of Computing, Edinburgh Napier University, Edinburgh EH10 5DT, UK
2 School of Electronics, Electrical Engineering and Computer Science, Queen’s University, Belfast BT9 5BN, UK
3 Department of Computer Science, Namal Institute, Mianwali 42250, Pakistan
4 College of Information Engineering, Yangzhou University, Yangzhou 225127, China
This article belongs to the Special Issue AI for IoT

Abstract

In recent years, there has been a massive increase in the number of Internet of Things (IoT) devices and in the data they generate. The devices participating in IoT networks can be problematic due to their resource-constrained nature, and integrating security on these devices is often overlooked. As a result, attackers have an increasing incentive to target IoT devices. As the number of possible attacks on a network grows, it becomes more difficult for traditional intrusion detection systems (IDS) to cope with them efficiently. In this paper, we highlight several machine learning (ML) methods, such as k-nearest neighbour (KNN), support vector machine (SVM), decision tree (DT), naive Bayes (NB), random forest (RF), artificial neural network (ANN), and logistic regression (LR), that can be used in IDS. We experimentally compare these ML algorithms for both binary and multi-class classification on the Bot-IoT dataset, using accuracy, precision, recall, F1 score, and log loss as metrics. In the case of the HTTP distributed denial-of-service (DDoS) attack, the accuracy of RF is 99%. Furthermore, the precision, recall, F1 score, and log loss results reveal that RF outperforms the other algorithms on all attack types in binary classification. However, in multi-class classification, KNN outperforms the other ML algorithms with an accuracy of 99%, which is 4% higher than that of RF.

1. Introduction

The Internet of Things (IoT) offers a vision in which devices, with the help of sensors, can understand their context and, through networking functions, connect with each other [1]. Devices in an IoT network can be employed to collect information for a wide range of use cases: the retail, healthcare, and manufacturing industries use IoT devices for tasks such as tracking purchased items, remote patient monitoring, and fully autonomous warehouses. The number of IoT devices is reported to be growing every year and is predicted to reach 75.44 billion by 2025 [2]. Such a massive surge of IoT devices ultimately attracts more attackers to IoT networks. Reports state that most of the attack traffic generated on IoT networks is automated through means such as scripts and malware [3]. The increase in attacks, combined with their autonomous nature, is a problem for IoT networks, as the devices are mostly deployed in a fire-and-forget fashion for years without any human interaction. Together with the limitations of IoT devices, including limited processing power and bandwidth, this means that providing adequate security can be difficult, which can result in network layer attacks such as denial of service (DoS). It is therefore important to research ways of identifying this kind of traffic, which can then be used in intrusion detection and prevention systems.
Machine learning (ML) methods can be exploited to detect malicious traffic in intrusion detection and prevention systems. ML is a subset of artificial intelligence (AI) that involves using algorithms to learn from data and make predictions based on the data provided [4]. ML has many applications, including retail, healthcare, and finance, where algorithms may be applied to predict customer spending habits, predict medical problems in patients, and detect bank fraud, respectively [5].
Due to the large year-on-year increase in cyberattacks, ML methods are being incorporated to help tackle the growing threat. ML has several uses within the field of cybersecurity, such as network threat analysis, which can be defined as the act of analyzing threats to the network [6]. ML can be beneficial in this task, as it is able to monitor incoming and outgoing traffic to identify potentially suspicious flows [7]. This area of research is known as intrusion detection and is widely studied. ML can be applied to intrusion detection systems (IDS) to improve a system’s ability to run autonomously and to increase its accuracy when raising the alarm on a suspected attack [8]. To this end, our primary goal is to identify the best ML methods for detecting attacks on IoT networks, using a state-of-the-art dataset and both binary and multi-class classification testing.
The main contributions of this paper can be summarized as follows:
  • We conduct an in-depth and comprehensive survey on the role of various ML methods in attack detection, specifically with regard to IoT networks.
  • We evaluate and compare the state-of-the-art ML algorithms in terms of various performance metrics such as confusion matrix, accuracy, precision, recall, F1 score, log loss, ROC AUC, and Cohen’s kappa coefficient (CKC).
  • We evaluate the results comparing binary class testing as well as examining the results of the multi-class testing.
The rest of the paper is organized as follows: Table 1 lists all the abbreviations used in the paper. Section 2 is devoted to a literature review investigating IoT intrusion detection techniques as well as ML methods and how they are being used to aid intrusion detection efforts, specifically with regard to IoT networks. Details of various attacks that can occur in IoT networks are also showcased, along with an explanation of how the various ML methods and performance metrics work. Section 3 explains the performance evaluation, including an in-depth examination of the data used in the datasets. The models are compared against each other for both binary and multi-class classification, and an overall best model is selected. Finally, Section 4 draws a conclusion.
Table 1. Abbreviations and their explanations.

3. Performance Evaluation

3.1. Benchmark Data

Our evaluation uses several datasets and several ML models to identify the best model for correctly classifying IoT attack data. When selecting the datasets, the two most important factors were the variety of attack data and how up-to-date the datasets are. The Bot-IoT datasets [56] were chosen because they meet both of these criteria.

3.2. Performance Evaluation Metrics

For evaluation, we consider the following metrics.

3.2.1. Confusion Matrix

A confusion matrix shows the predictions made by the model. It is designed to show where the model has correctly and incorrectly classified the data.
The confusion matrix for binary and multi-class classification is different. With binary classification, the matrix shows the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) results, as shown in Table 2. The columns represent the correct classification of the data and the rows represent the available classifications.
Table 2. Confusion matrix example.
TP and TN are when the data are correctly classified as attack or no attack, respectively. FP and FN are when data are incorrectly predicted as the other class. When using a confusion matrix for multi-class problems, the same principles apply. However, the matrix shows all the classes, which allows for observing where misclassification occurs, as shown in Table 3.
Table 3. Multi-class confusion matrix example.
In Table 3, C represents where the correct classifications are located and W represents incorrect classifications. It is to be noted that correct classifications create a diagonal path through the table from the top left corner to the bottom right corner.
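As a minimal illustrative sketch (not the authors’ code), a binary confusion matrix can be produced with scikit-learn, the library used for most models in this paper; the label vectors below are hypothetical. Note that scikit-learn’s convention places the true classes on the rows and the predicted classes on the columns:

    from sklearn.metrics import confusion_matrix

    # Hypothetical binary labels: 1 = attack, 0 = no attack
    y_true = [1, 1, 1, 0, 0, 1, 0, 1]
    y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

    # Layout under scikit-learn's convention: [[TN, FP], [FN, TP]]
    print(confusion_matrix(y_true, y_pred))  # [[2 1]
                                             #  [1 4]]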

3.2.2. Accuracy

Accuracy is a metric that can be used to identify the percentage of predictions that were classified correctly and is expressed as follows:
\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \]
This can be expanded upon by utilizing the results of a confusion matrix including TP, TN, FP, and FN and can be defined as follows:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

3.2.3. Precision

Precision is used to determine the ratio of correctly predicted positive outcomes against the total number of predicted positive outcomes and can be defined as follows:
\[ \text{Precision} = \frac{TP}{TP + FP} \]

3.2.4. Recall

Recall is used to determine the ratio of correctly predicted positive outcomes to all the outcomes in the given class and can be defined as follows:
\[ \text{Recall} = \frac{TP}{TP + FN} \]

3.2.5. F1 Score

The F1 score is the harmonic mean of precision and recall, which produces a number between 0 and 1. The F1 score is often seen as a better performance metric than accuracy and can be defined as follows:
\[ F1\,\text{score} = \frac{2 \times (\text{recall} \times \text{precision})}{\text{recall} + \text{precision}} \]
It is to be noted that the choice between the F1 score and accuracy depends on how the data are distributed. Accuracy is appropriate when the class distribution is roughly even, but it does not take the distribution into account and may therefore lead to wrong conclusions. Because most real-life classification problems exhibit highly imbalanced class distributions, the F1 score is usually the better metric to use.
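To see this concretely, consider a hypothetical sketch in which a degenerate model predicts “attack” for every row of a highly imbalanced dataset: accuracy looks strong while the F1 score on the minority class collapses to zero.

    from sklearn.metrics import accuracy_score, f1_score

    # Hypothetical imbalanced ground truth: 95 attack rows, 5 normal rows
    y_true = [1] * 95 + [0] * 5
    y_pred = [1] * 100  # a model that blindly predicts "attack"

    # 0.95: looks strong despite the model detecting nothing
    print(accuracy_score(y_true, y_pred))
    # 0.0: the minority (normal) class is never detected
    print(f1_score(y_true, y_pred, pos_label=0, zero_division=0))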

3.2.6. Log Loss

Log loss measures the performance of a model using the probability it assigns to the expected outcome. The higher the probability assigned to the actual class, the lower the log loss will be; a lower score indicates that the model has performed better.
For binary classification, where the number of possible classes (M) is 2, log loss can be expressed as follows:
\[ -\left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right) \]
For multi-class classification, where M > 2, a separate loss is calculated for each class label and the results are summed, which is expressed as follows:
\[ -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c}) \]
where M is the number of possible classes, log is the natural logarithm, y_{o,c} is a binary indicator of whether class label c is the correct classification for observation o, and p_{o,c} is the model’s predicted probability that observation o belongs to class c.
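As a hedged sketch of the computation (the probabilities are invented for illustration), scikit-learn’s log_loss shows how confidence, not just correctness, drives the score:

    from sklearn.metrics import log_loss

    y_true = [1, 0, 1, 1]
    # Hypothetical predicted probabilities of the positive (attack) class
    confident = [0.95, 0.05, 0.90, 0.99]
    hesitant = [0.60, 0.40, 0.55, 0.65]

    # Both prediction sets are "correct" at a 0.5 threshold, but the
    # confident model earns a much lower (better) log loss
    print(log_loss(y_true, confident))  # ~0.05
    print(log_loss(y_true, hesitant))   # ~0.51

This mirrors a pattern seen later in the results, where DT sometimes has perfect classification scores yet a high log loss.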

3.2.7. ROC AUC

ROC is a graph used to plot the results of the model at various classification thresholds. The graph uses the true positive rate (TPR) and false positive rate (FPR), which are expressed as follows:
\[ TPR = \frac{TP}{TP + FN} \]
\[ FPR = \frac{FP}{FP + TN} \]
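The area under this curve (AUC) condenses the ROC graph into a single value between 0 and 1, where 1.0 indicates a perfect ranking of attacks above non-attacks and 0.5 is no better than random guessing. As a minimal sketch with hypothetical scores:

    from sklearn.metrics import roc_auc_score

    y_true = [0, 0, 1, 1]
    y_score = [0.10, 0.40, 0.35, 0.80]  # hypothetical attack probabilities

    print(roc_auc_score(y_true, y_score))  # 0.75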

3.2.8. Cohen’s Kappa Coefficient

Cohen’s kappa coefficient (CKC), also referred to as the kappa statistic, is used to test the inter-rater reliability of predictions and can be expressed as follows:
\[ k = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)} \]
where Pr(a) is the observed agreement and Pr(e) is the expected agreement. This metric is useful as it compares the model against a model that guesses based on the frequency of the classes. This allows for the disparity in a dataset to be evaluated particularly with multi-class testing as the dataset has varying numbers of data points per attack.
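A minimal sketch using scikit-learn (the labels are hypothetical):

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical imbalanced labels: mostly attack (1), a little normal (0)
    y_true = [1, 1, 1, 1, 0, 0, 1, 1]
    y_pred = [1, 1, 1, 1, 1, 0, 1, 1]

    # Raw agreement is 7/8 = 0.875, but kappa discounts the agreement
    # expected from the class frequencies alone
    print(cohen_kappa_score(y_true, y_pred))  # 0.6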

3.3. Dataset Description

The dataset named Bot-IoT was submitted to the IEEE website on 16 October 2019 and was created by the University of New South Wales (UNSW). The dataset consists of ten CSV files containing records for the following attacks on IoT networks: (i) Data exfiltration; (ii) DoS HTTP; (iii) DoS TCP; (iv) DoS UDP; (v) DDoS HTTP; (vi) DDoS TCP; (vii) DDoS UDP; (viii) Keylogging; (ix) OS Scan; and (x) Service Scan. The dataset comprises both real attack and simulated attack data and was created by simulating a realistic network at the UNSW [56].
Table 4 shows the features used in the experiments. The dataset has 35 columns; however, only those in Table 4 were used. When deciding which features to use, the contents of the columns were examined, and any columns with no values were removed, along with columns containing text and columns deemed irrelevant to the overall classification of the data.
Table 4. Dataset features and description.
One important part of examining the dataset involves checking the representation of the classes, i.e., whether one class is over- or under-represented, as this can have a detrimental effect on the experiments. Table 5 shows the amount of attack and non-attack data for each dataset used in the experiments.
Table 5. Dataset label distribution.
To conduct multi-class testing, a new CSV file is created from the binary classification datasets. The datasets were collected, randomized, and put into a new file. Due to the large size of the dataset, only a selected percentage of the data is used to prevent excessive run times. Table 6 shows the class representation of the training and test data in the multi-class dataset. It is observable in both the binary and multi-class datasets that not all classes have equal representation. Testing with weighted classes can be done to see the effects of giving the classes equal representation. The SVM, DT, RF, ANN, and LR models are able to use the balanced weighted classes option, which computes the class weights as follows:
\[ W = \frac{Samples}{Classes \times Y} \]
where Samples is the number of rows in the dataset, Classes is the number of classes in the dataset, and Y is the number of rows carrying the given label.
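This is the same “balanced” heuristic implemented by scikit-learn; a sketch with hypothetical class counts:

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    # Hypothetical imbalanced labels: 1000 attack rows, 50 normal rows
    y = np.array([1] * 1000 + [0] * 50)

    # Each weight is Samples / (Classes * number of rows with that label)
    weights = compute_class_weight(class_weight="balanced",
                                   classes=np.array([0, 1]), y=y)
    print(weights)  # [10.5, 0.525]: the rare class is up-weighted

In practice, passing class_weight="balanced" to the constructor of the supported scikit-learn estimators applies these weights automatically.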
Table 6. Multi-class data representation.

3.4. Implementation

3.4.1. Tools Used

We use the Python (version 3.7.4) programming language for the implementation of the ML algorithms. The two main modules used to implement the models are sklearn (also referred to as scikit-learn) and Keras. Keras is used to implement the ANN, while sklearn is used to implement the other models. It is to be noted that, for comparison purposes, we used the default values of the hyperparameters for each classifier. Table 7 lists the modules used along with a brief description of each.
Table 7. Modules used and description.
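As a sketch of this setup, the seven classifiers can be instantiated as follows. The use of default hyperparameters, scikit-learn for six models, and Keras for the ANN is stated above; the ANN architecture, the Gaussian variant of NB, and the probability=True flag are our assumptions, as they are not specified here.

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    # Six of the seven classifiers at their default hyperparameters;
    # probability=True (an assumption) lets SVC emit the probabilities
    # that the log loss metric requires
    models = {
        "KNN": KNeighborsClassifier(),
        "SVM": SVC(probability=True),
        "DT": DecisionTreeClassifier(),
        "NB": GaussianNB(),
        "RF": RandomForestClassifier(),
        "LR": LogisticRegression(),
    }

    # The ANN is built in Keras; this small two-layer architecture and the
    # input width of 16 features are hypothetical
    ann = Sequential([
        Dense(16, activation="relu", input_shape=(16,)),
        Dense(1, activation="sigmoid"),  # binary attack / no-attack output
    ])
    ann.compile(optimizer="adam", loss="binary_crossentropy",
                metrics=["accuracy"])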

3.4.2. Feature Extraction

The dataset contains features that either hold no information or hold information that is irrelevant to classifying the data. These unwanted features can be removed during the preprocessing stage using the pandas module. Several features, namely flgs, proto, dir, state, saddr, daddr, srcid, smac, dmac, soui, doui, sco, record, category, and subcategory, were removed from the dataset.
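A sketch of this preprocessing step with pandas (the CSV file name is hypothetical):

    import pandas as pd

    df = pd.read_csv("bot_iot.csv")  # hypothetical file name

    # Drop the empty, textual, and irrelevant columns listed above
    drop_cols = ["flgs", "proto", "dir", "state", "saddr", "daddr", "srcid",
                 "smac", "dmac", "soui", "doui", "sco", "record",
                 "category", "subcategory"]
    df = df.drop(columns=drop_cols, errors="ignore")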

3.4.3. Feature Scaling

The features in the dataset contain large values that vary widely in magnitude. Therefore, it is important to normalize the data. This is done by re-scaling the feature values into a defined range such as −1 to 1, which can be expressed as follows:
\[ x' = a + \frac{(x - \min(x))(b - a)}{\max(x) - \min(x)} \]
where x' is the normalized value, x is the original value, and a and b are the minimum and maximum of the target range. The result maps every value into the interval between −1 and 1. This can be done in Python using the MinMaxScaler in the preprocessing module.
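A sketch of this step using the scaler named above (the small feature matrix is hypothetical):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X = np.array([[10.0, 200.0],
                  [5.0, 800.0],
                  [7.5, 50.0]])  # hypothetical feature matrix

    # Re-scale every feature column into the range [-1, 1]
    scaler = MinMaxScaler(feature_range=(-1, 1))
    X_scaled = scaler.fit_transform(X)
    print(X_scaled)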

3.4.4. Multi-Class Dataset

The multi-class dataset is created by collecting the rows of all the datasets and then randomizing them using the random Python module. The random module contains the shuffle method, which allows an array, in this case the rows of the dataset, to be randomized. Due to the large size of the combined dataset, only roughly 25% of it is used for testing, which amounts to 1,500,000 rows.
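A sketch of the approach described above; the list of rows is a hypothetical stand-in for the pooled contents of the ten CSV files:

    import random

    rows = list(range(100))  # hypothetical stand-in for the pooled rows

    random.shuffle(rows)     # randomize the row order in place
    subset = rows[:25]       # keep roughly 25%, analogous to the
                             # 1,500,000 rows used in the experiments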

3.4.5. Training Data

The data used by the model to learn are called the training data. Data can be split into training and test data in multiple ratios. For this study, a split of 80:20 was used, with 80% used for training the models, a choice loosely motivated by the Pareto principle, which states that 80% of results come from 20% of the effort.

3.4.6. Test Data

Twenty percent of the data is used for testing, which is typically a sufficient amount. However, if the dataset is small, this can leave too little test data and create the illusion that the model has done extremely well when in fact it has not had enough data to be properly tested. To split the dataset into training and test data, train_test_split can be used from the Python module named model_selection. When using this function, the random_state parameter sets the seed of the pseudo-random number generator; in this case, the value 121 was used.
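A sketch of the split as described (the feature matrix X and label vector y are hypothetical placeholders):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(20).reshape(10, 2)  # hypothetical features
    y = np.array([0, 1] * 5)          # hypothetical labels

    # 80:20 split with the fixed seed used in this study
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=121)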

3.5. Results and Discussion

To test several ML algorithms and identify the best and worst for classifying attack data on IoT networks, this section provides all the results and analysis based on several performance metrics for both binary and multi-class testing.

3.5.1. Binary Classification

Data Exfiltration: Table 8 shows the results for the data exfiltration data, where RF has the best scores across all the performance metrics, including log loss. Although DT also has perfect scores, it has a high log loss, indicating that the RF model is more confident in making predictions.
Table 8. Data exfiltration results.
Table 9 shows the confusion matrix for RF and reveals two noteworthy points: the amount of data tested is very low, and the classes do not have equal representation. It is possible that the small amount of test data is influencing the results. However, the other models, except for DT, have relatively poor scores compared to RF.
Table 9. Data exfiltration RF confusion matrix.
Table 10 shows that increasing the test data to 30% decreases the log loss, indicating that the model performs better with more data, although only marginally. Once the test data reaches 40% and beyond, the results begin to worsen, although the model maintains perfect recall with up to a 50% split between training and test data.
Table 10. Data exfiltration RF test data amounts.
Because the class representation is imbalanced, the weighted classes parameter can be used. This rectifies the disparity between the classes, with the results shown in Table 11. This option is not available for the KNN and NB models. It is observable in Table 11 that weighted classes improve SVM’s performance, with all metrics increasing and log loss decreasing. ANN is unaffected by weighted classes, and LR is marginally affected, gaining perfect precision but losing some recall. DT loses its perfect scores, while RF keeps perfect scores but slightly increases its log loss.
Table 11. Data exfiltration weighted classes results.
Without using weighted classes, RF is the best model due to its low log loss when compared to DT. When weighted classes are applied, RF is still the best model with perfect scores and a low log loss, indicating that the model is confident in making predictions.
DDoS HTTP: Table 12 shows the results for the DDoS HTTP data. DT has perfect performance scores but a high log loss of 7.25. This dataset does not suffer from a lack of data; rather, it suffers from a large imbalance, as attack data dominate the dataset, as shown in Table 13.
Table 12. DDoS HTTP results.
Table 13. DDoS HTTP DT confusion matrix.
This confusion matrix shows a large disparity in the data, with a ratio of 3:1319 in favor of attack data. A large disparity in the dataset can affect the log loss: because log loss is based on probability, and the data are much more likely to be attack data, the score can become skewed.
Table 14 shows the results of weighted classes on the DDoS HTTP data. With weighted classes, both SVM and LR show a sizeable decrease in performance across all metrics except log loss, which decreased for both, and ROC AUC, which increased for both. ANN is unaffected by the weighted classes and retains its perfect recall, whereas RF loses its perfect recall. DT loses its perfect scores but shows a large decrease in log loss.
Table 14. DDoS HTTP weighted classes results.
Without using weighted classes, DT is the best model due to the perfect scores, although the high log loss is a factor to consider. RF would be the second best as it has perfect recall as well as the lowest log loss and the highest ROC AUC. When weighted classes are applied, ANN is the best model as it has perfect recall and a low log loss.
DDoS TCP: Table 15 shows the results for the DDoS TCP data. The DT and RF models both have perfect scores except for log loss, which is high for both. Table 16 shows the confusion matrix for RF, and once again the matrix shows a very large disparity in the data represented.
Table 15. DDoS TCP results.
Table 16. DDoS TCP RF confusion matrix.
Table 17 shows the results of DDoS TCP data with weighted classes enabled. With weighted classes enabled, SVM has lost its perfect precision but lowered its log loss significantly. DT and ANN are unaffected by the weighted classes but RF retains its perfect scores and lowers its log loss slightly. LR has lost its perfect recall and increased its log loss and ROC AUC.
Table 17. DDoS TCP weighted classes results.
Both with and without weighted classes, RF is the best model, as it has perfect scores. With weighted classes, the log loss is lowered but remains quite high compared to LR, which has a very low log loss.
DDoS UDP: Table 18 shows the results for the DDoS UDP data, where both KNN and DT have perfect scores, but KNN is the better model, as it has a lower log loss. Although the log loss is still high, this is the case for all the models apart from NB. Table 19 shows the confusion matrix for KNN, which shows the disparity in the class representation.
Table 18. DDoS UDP results.
Table 19. DDoS UDP KNN confusion matrix.
Table 20 shows the results of the DDoS UDP data with weighted classes enabled. The table shows that SVM has gained perfect scores and lowered its log loss, while DT has lost its perfect scores and lowered its log loss substantially. RF has gained perfect scores and lowered its log loss, while ANN is unaffected. LR has lost perfect recall but gained perfect precision, lowered its log loss, and increased its ROC AUC.
Table 20. DDoS UDP weighted classes results.
Without weighted classes, KNN is the best model as it has perfect scores but the log loss is high. NB would be second best as it has perfect precision and a low log loss. With weighted classes, RF is the best model as it has perfect scores and a low log loss.
Key logging: Table 21 shows the results for the key logging data. DT is the best model, combining the best log loss and ROC AUC scores with perfect precision and high scores across the remaining metrics.
Table 21. Key logging results.
Table 22 shows the confusion matrix for DT, where it is observable that the dataset contains little data and that the data are imbalanced.
Table 22. Key logging DT confusion matrix.
Just as with data exfiltration, the amount of test data can be increased to observe the effect on the DT model’s scores. Table 23 shows the results of increasing the test data for the key logging data. Increasing the test data to 30% gives the model perfect recall instead of perfect precision. Once the data are increased to 50%, the model no longer has perfect recall or precision. Based on the changes in the results, it is observable that the small amount of data has a significant impact on the model’s results.
Table 23. Key logging DT test data amounts.
Table 24 shows the results for the key logging data with weighted classes enabled. SVM shows an overall decrease in performance, with the model no longer having perfect recall. DT and RF also show a drop in performance, losing their perfect precision and recall, respectively. ANN is unaffected, while LR shows a large decrease in recall, leading to the worst performance of all the models.
Table 24. Key logging weighted classes results.
Without weighted classes, DT is the best model, with the lowest log loss and highest ROC AUC as well as perfect precision. With weighted classes, all the models tested show a decrease in performance except for ANN, which is unchanged. Apart from its perfect recall, however, ANN still has comparatively worse scores than DT and RF. Unless perfect recall is required, DT should be used, as it will correctly classify more data than ANN.
OS Scan: Table 25 shows the results for the OS scan data. All of the models have good scores, with RF, ANN, and LR having perfect recall, indicating that these models produced no false negatives. RF has higher precision than LR and ANN, as well as a lower log loss and higher ROC AUC, which would suggest that RF is the best model. However, inspection of the confusion matrix shows a large imbalance in the dataset, as shown in Table 26.
Table 25. OS Scan results.
Table 26. OS Scan RF confusion matrix.
Table 27 shows the results of the OS scan data with weighted classes enabled. Overall, weighted classes decreased the models’ performance. SVM shows a decrease in accuracy, recall, F1 score, log loss, and ROC AUC. DT shows a decrease in log loss and ROC AUC, marking a slight increase in the model’s confidence but a lower ability to perform well at different thresholds. RF has lost its perfect recall and shows an increased log loss and ROC AUC. ANN shows no change in its results, whereas LR shows a large performance decrease, with only ROC AUC improving.
Table 27. OS scan weighted classes results.
Without weighted classes, RF is the best model, as it has perfect recall, the lowest log loss, and the highest ROC AUC. With weighted classes, ANN is the only model with perfect recall, but DT and RF both have better accuracy, precision, log loss, and ROC AUC. If having no false negatives is essential, then ANN is the best choice, but DT is better at classifying data in general.
Service Scan: Table 28 shows the results for the service scan data. The SVM, RF, and ANN models have perfect recall but poor ROC AUC scores. DT has the highest ROC AUC and the lowest log loss, but RF could be considered the best due to its perfect recall.
Table 28. Service Scan results.
Table 29 shows the confusion matrix for RF, which again reveals the imbalance in the data.
Table 29. Service Scan RF confusion matrix.
Table 30 shows the results of service scan data with weighted classes enabled. SVM was not tested due to excessive running times. DT, RF, and LR have increased their ROC AUC but all other metrics have been negatively affected. ANN is unaffected, being the only model to keep its perfect recall.
Table 30. Service scan weighted classes results.
Without weighted classes, of the models with perfect recall, RF is the best, as it has the lowest log loss and highest ROC AUC among them; DT has the best log loss overall but does not have perfect recall. With weighted classes, ANN is the best, as it is the only model to retain perfect recall, but its ROC AUC is the poorest of all the models.
DoS HTTP: Table 31 shows the results for the DoS HTTP data. DT and RF both have perfect scores and a low log loss, with DT narrowly beating RF.
Table 31. DoS HTTP results.
Table 32 shows the confusion matrix for RF, which showcases the disparity in the dataset.
Table 32. DoS HTTP RF confusion matrix.
Table 33 shows the results of the DoS HTTP data with weighted classes enabled. SVM shows decreased performance in all metrics except ROC AUC. DT and RF have lost their perfect scores and show an increased log loss. ANN is unaffected, whereas LR shows a decrease in all performance metrics apart from ROC AUC, which has increased.
Table 33. DoS HTTP weighted classes results.
Without weighted classes, DT is the best model, as it has perfect scores and the lowest log loss. With weighted classes, ANN is the best model, as it has perfect recall. With regard to the models’ ability to classify data, ANN comes out on top due to its perfect recall.
DoS TCP: Table 34 shows the results for the DoS TCP data, where all the models apart from NB have perfect recall. DT and RF have the best ROC AUC scores, but both have high log losses compared to the other models. KNN has the lowest log loss and a ROC AUC almost as good as that of RF.
Table 34. DoS TCP results.
Table 35 shows the confusion matrix for DT, which reveals the imbalance of the data in the dataset.
Table 35. DoS TCP DT confusion matrix.
Table 36 shows the results of the DoS TCP data with weighted classes enabled. SVM was not recorded due to excessively long running times. With weighted classes, both DT and RF have lost their perfect recall, but DT has gained perfect precision. Both models also show an improvement in log loss and ROC AUC. ANN is unaffected, and LR shows a performance decrease in almost all metrics.
Table 36. DoS TCP weighted classes results.
Without weighted classes, KNN could be considered the best model as it has the lowest log loss and a reasonably high ROC AUC. DT and RF have a higher ROC AUC but also have a considerably higher log loss than KNN. With weighted classes, both DT and ANN could be considered the best with DT having perfect precision and ANN having perfect recall. Both models also have a low log loss, but ANN has a poorer ROC AUC score.
DoS UDP: Table 37 shows the results for the DoS UDP data. NB is the best model, with perfect precision, a low log loss, a high ROC AUC, and high scores across all other metrics. All the other models have perfect recall but have a high log loss, a low ROC AUC, or both.
Table 37. DoS UDP results.
Table 38 shows the confusion matrix for NB, which shows the disparity between the classes in the dataset.
Table 38. DoS UDP NB confusion matrix.
Table 39 shows the results of the DoS UDP data with weighted classes enabled. ANN is unaffected and maintains poor log loss and ROC AUC scores. SVM has gained perfect precision but lost perfect recall, with an increase in log loss and ROC AUC. DT has likewise swapped its precision and recall scores, with an increase in both log loss and ROC AUC. RF has lost its perfect recall and increased its log loss and ROC AUC. LR has improved its log loss and ROC AUC and gained perfect precision while losing perfect recall.
Table 39. DoS UDP weighted classes results.
Without weighted classes, NB is the best model having perfect precision with a low log loss and high ROC AUC. With weighted classes, both SVM and LR perform very well but SVM is the better model as it has the lower log loss of the two models.

3.5.2. Model Comparison

Table 40 shows the best models for each of the datasets, both with and without weighted classes. DT and RF appear most often in the table, with ANN appearing frequently in the weighted classes column. Without weighted classes, RF achieves the best performance; with weighted classes, ANN does. However, using weighted classes generally decreases a model’s overall performance.
Table 40. Model comparison.

3.5.3. Multiclass Classification

Table 41 shows the results for multi-class classification, where KNN has the best performance metrics, including the lowest log loss and the highest CKC. LR is the worst model, with the lowest metrics, including the lowest CKC, and a high log loss beaten only by SVM.
Table 41. Multi-class results.
Table 42 shows the results with weighted classes. KNN and NB cannot use weighted classes and SVM was not tested because of its excessively long running time. Weighted classes have reduced the performance metrics for all models apart from ANN, which has had a small decrease in log loss, making it the best model with weighted classes.
Table 42. Multi-class weighted classes results.
Table 43 shows that KNN performs very well with the multi-class dataset with all the classes having low amounts of incorrectly classified data.
Table 43. KNN confusion matrix.
Table 44 shows that SVM performs poorly with the multi-class dataset with data exfiltration (1), DDoS HTTP (2), and key logging (5) data all being incorrectly classified. These classes are ones featuring low amounts of data, which could be the reason for the low accuracy.
Table 44. SVM confusion matrix.
Table 45 shows the confusion matrix for DT multi-class classification. It can be observed that the model performs very well; however, it appears to have difficulty correctly classifying the under-represented classes. This is evident in Table 45, with data exfiltration (1), DDoS HTTP (2), and key logging (5) being incorrectly classified.
Table 45. DT confusion matrix.
Table 46 shows the confusion matrix for DT with weighted classes enabled. Using weighted classes has resulted in an overall decrease in the model’s performance but has improved the correct classification of normal traffic (0), data exfiltration (1), and key logging (5). It has also resulted in DoS HTTP having all of its data incorrectly classified.
Table 46. DT weighted classes confusion matrix.
Table 47 shows the confusion matrix for NB multi-class classification, which performs quite well, with no class having all of its data incorrectly classified. The model is also able to handle the data disparity between the classes, with the low-data classes achieving good classification results.
Table 47. NB confusion matrix.
Table 48 shows the results for RF multi-class classification, which has good classification accuracy for the classes with plenty of data. The classes with little data have no correctly classified data.
Table 48. RF confusion matrix.
Table 49 shows the results with weighted classes enabled. Despite fewer correct classifications overall, the model performs better on the low-data classes, classifying them correctly more often.
Table 49. RF weighted classes confusion matrix.
Table 50 shows the results for ANN multi-class classification. The model performs well except for data exfiltration (1) and key logging (5), which have incorrectly classified data.
Table 50. ANN confusion matrix.
Table 51 shows the results with weighted classes enabled. It is observable that the model is much better at classifying most classes, with OS scan (6) and service scan (7) having the most incorrectly classified data. The model is also unable to correctly classify any data for normal traffic (0) and data exfiltration (1).
Table 51. ANN weighted classes confusion matrix.
Table 52 shows the results for LR multi-class classification, which performs poorly overall, with the low-data classes having no correctly classified data.
Table 52. LR confusion matrix.
Table 53 shows the results with weighted classes enabled. It is evident that the overall classification accuracy has decreased; however, the model shows improvement in classifying the low-data classes.
Table 53. LR weighted classes confusion matrix.

4. Conclusions

In this paper, state-of-the-art ML algorithms are compared in terms of accuracy, precision, recall, F1 score, and log loss on the Bot-IoT dataset, both with and without class weighting. The results show that RF performs best in terms of accuracy and precision on the non-weighted dataset, whereas ANN achieves higher accuracy for binary classification on the weighted dataset. In multi-class classification, KNN and ANN are the most accurate on the non-weighted and weighted datasets, respectively. From the results, it is evident that, when all types of attack are given weighted datasets, ANN predicts the type of attack with the highest accuracy.
In the future, we intend to adopt the models explored in this research into an IDS prototype for testing using diverse data including a mix of attacks to validate the multi-class functionality of models.

Author Contributions

Conceptualization, A.C.; methodology, A.C.; validation, A.C., J.A., and R.U.; formal analysis, A.C.; investigation, A.C., J.A., R.U., and B.N.; resources, A.C.; writing—original draft preparation, A.C. and J.A.; writing—review and editing, A.C., J.A., R.U., B.N., M.G., F.M., S.u.R., F.A., and W.J.B.; supervision, J.A. and W.J.B.; funding acquisition, F.A. and J.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dorsemaine, B.; Gaulier, J.P.; Wary, J.P.; Kheir, N.; Urien, P. Internet of Things: A Definition & Taxonomy. In Proceedings of the 2015 9th International Conference on Next Generation Mobile Applications, Services and Technologies, Cambridge, UK, 9–11 September 2015. [Google Scholar] [CrossRef]
  2. Statista. IoT: Number of Connected Devices Worldwide 2012–2025; Statista: Hamburg, Germany, 2019. [Google Scholar]
  3. Doffman, Z. Cyberattacks On IOT Devices Surge 300% In 2019, ‘Measured in Billions’. Available online: https://www.forbes.com/sites/zakdoffman/2019/09/14/dangerous-cyberattacks-on-iot-devices-up-300-in-2019-now-rampant-report-claims/?sh=24e245575892 (accessed on 10 November 2020).
  4. Furbush, J. Machine Learning: A Quick and Simple Definition. Available online: https://www.oreilly.com/content/machine-learning-a-quick-and-simple-definition/ (accessed on 10 November 2020).
  5. Jmj, A. 5 Industries That Heavily Rely on Artificial Intelligence and Machine Learning. Available online: https://medium.com/datadriveninvestor/5-industries-that-heavily-rely-on-artificial-intelligence-and-machine-learning-53610b6c1525 (accessed on 10 November 2020).
  6. Dosal, E. 3 Advantages of a Network Threat Analysis. Compuquip, 4 September 2018. [Google Scholar]
  7. Groopman, J. Understand the Top 4 Use Cases for AI in Cybersecurity. Available online: https://searchsecurity.techtarget.com/tip/Understand-the-top-4-use-cases-for-AI-in-cybersecurity (accessed on 10 November 2020).
  8. Mohammad, A.; Maen, A.; Szilveszter, K.; Mouhammd, A. Evaluation of machine learning algorithms for intrusion detection system. In Proceedings of the IEEE 15th International Symposium on Intelligent Systems and Informatics (SISY), Subotica, Serbia, 14–16 September 2017; pp. 000277–000282. [Google Scholar]
  9. Sommer, R.; Paxson, V. Outside the Closed World: On Using Machine Learning for Network Intrusion Detection. In Proceedings of the 2010 IEEE Symposium on Security and Privacy, Oakland, CA, USA, 16–19 May 2010; pp. 305–316. [Google Scholar]
  10. Foley, J.; Moradpoor, N.; Ochenyi, H. Employing a Machine Learning Approach to Detect Combined Internet of Things Attacks against Two Objective Functions Using a Novel Dataset. Secur. Commun. Netw. 2020, 2020. [Google Scholar] [CrossRef]
  11. Alsamiri, J.; Alsubhi, K. Internet of Things Cyber Attacks Detection using Machine Learning. Int. J. Adv. Comput. Sci. Appl. 2019, 10, 628–634. [Google Scholar] [CrossRef]
  12. Hasan, M.; Islam, M.M.; Zarif, M.I.I.; Hashem, M. Attack and anomaly detection in IoT sensors in IoT sites using machine learning approaches. Internet Things 2019, 7, 100059. [Google Scholar] [CrossRef]
  13. Ali, N.; Neagu, D.; Trundle, P. Evaluation of k-nearest neighbour classifier performance for heterogeneous data sets. SN Appl. Sci. 2019, 1, 1559. [Google Scholar] [CrossRef]
  14. Harrison, O. Machine Learning Basics with the K-Nearest Neighbors Algorithm. Towards Data Science, 10 September 2019. [Google Scholar]
  15. Liao, Y.; Vemuri, V. Use of K-Nearest Neighbor classifier for intrusion detection. Comput. Secur. 2002, 21, 439–448. [Google Scholar] [CrossRef]
  16. Nikhitha, M.; Jabbar, M. K Nearest Neighbor Based Model for Intrusion Detection System. Int. J. Recent Technol. Eng. 2019, 8, 2258–2262. [Google Scholar] [CrossRef]
  17. Yao, J.; Zhao, S.; Fan, L. An Enhanced Support Vector Machine Model for Intrusion Detection. Rough Sets Knowl. Technol. Lect. Notes Comput. Sci. 2006, 538–543. [Google Scholar] [CrossRef]
  18. Cahyo, A.N.; Hidayat, R.; Adhipta, D. Performance comparison of intrusion detection system based anomaly detection using artificial neural network and support vector machine. AIP Conf. Proc. 2016. [Google Scholar] [CrossRef]
  19. Sharma, H.; Kumar, S. A Survey on Decision Tree Algorithms of Classification in Data Mining. Int. J. Sci. Res. (IJSR) 2016, 5, 2094–2097. [Google Scholar] [CrossRef]
  20. Stampar, M.; Fertalj, K. Artificial intelligence in network intrusion detection. In Proceedings of the 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 25–29 May 2015; pp. 1318–1323. [Google Scholar] [CrossRef]
  21. Aloqaily, M.; Otoum, S.; Al Ridhawi, I.; Jararweh, Y. An Intrusion Detection System for Connected Vehicles in Smart Cities. Ad. Hoc. Netw. 2019. [Google Scholar] [CrossRef]
  22. Koehrsen, W. An Implementation and Explanation of the Random Forest in Python. Available online: https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76 (accessed on 10 November 2020).
  23. Dubey, A. Feature Selection Using Random forest. Available online: https://towardsdatascience.com/feature-selection-using-random-forest-26d7b747597f (accessed on 10 November 2020).
  24. Farnaaz, N.; Jabbar, M. Random Forest Modeling for Network Intrusion Detection System. Procedia Comput. Sci. 2016, 89, 213–217. [Google Scholar] [CrossRef]
  25. Saritas, M.M.; Yasar, A. Performance analysis of ANN and Naive Bayes classification algorithm for data classification. Int. J. Intell. Syst. Appl. Eng. 2019, 7, 88–91. [Google Scholar] [CrossRef]
  26. Ujjwalkarn. A Quick Introduction to Neural Networks. The Data Science Blog, 9 August 2016.
  27. Maind, S.B.; Wankar, P. Research paper on basic of artificial neural network. Int. J. Recent Innov. Trends Comput. Commun. 2014, 2, 96–100. [Google Scholar]
  28. Anitha, A.A.; Arockiam, L. ANNIDS: Artificial Neural Network based Intrusion Detection System for Internet of Things. Int. J. Innov. Technol. Explor. Eng. Regul. Issue 2019, 8, 2583–2588. [Google Scholar] [CrossRef]
  29. Shenfield, A.; Day, D.; Ayesh, A. Intelligent intrusion detection systems using artificial neural networks. ICT Express 2018, 4, 95–99. [Google Scholar] [CrossRef]
  30. Rajput, H. MachineX: Simplifying Logistic Regression. Knoldus Blogs, 28 March 2018. [Google Scholar]
  31. Ghosh, P.; Mitra, R. Proposed GA-BFSS and logistic regression based intrusion detection system. In Proceedings of the 2015 Third International Conference on Computer, Communication, Control and Information Technology (C3IT), Hooghly, India, 7–8 February 2015; pp. 1–6. [Google Scholar]
  32. Hussain, F.; Hussain, R.; Hassan, S.A.; Hossain, E. Machine Learning in IoT Security: Current Solutions and Future Challenges. IEEE Commun. Surv. Tutor. 2020, 22, 1686–1721. [Google Scholar] [CrossRef]
  33. Saleem, J.; Hammoudeh, M.; Raza, U.; Adebisi, B.; Ande, R. IoT standardisation. In Proceedings of the 2nd International Conference on Future Networks and Distributed Systems—ICFNDS 18, Amman, Jordan, 26–27 June 2018. [Google Scholar] [CrossRef]
  34. Ullah, F.; Edwards, M.; Ramdhany, R.; Chitchyan, R.; Babar, M.A.; Rashid, A. Data exfiltration: A review of external attack vectors and countermeasures. J. Netw. Comput. Appl. 2017, 101, 18–54. [Google Scholar] [CrossRef]
  35. Carthy, S.M.M.; Sinha, A.; Tambe, M.; Manadhata, P. Data Exfiltration Detection and Prevention: Virtually Distributed POMDPs for Practically Safer Networks. Lect. Notes Comput. Sci. Decis. Game Theory Secur. 2016, 39–61. [Google Scholar] [CrossRef]
  36. Fadolalkarim, D.; Bertino, E. A-PANDDE: Advanced Provenance-based ANomaly Detection of Data Exfiltration. Comput. Secur. 2019, 84, 276–287. [Google Scholar] [CrossRef]
  37. Malik, M.; Singh, Y. A Review: DoS and DDoS Attacks. Int. J. Comput. Sci. Mob. Comput. 2015, 4, 260–265. [Google Scholar]
  38. Mahjabin, T.; Xiao, Y.; Sun, G.; Jiang, W. A survey of distributed denial-of-service attack, prevention, and mitigation techniques. Int. J. Distrib. Sens. Networks 2017, 13, 2–33. [Google Scholar] [CrossRef]
  39. Kolias, C.; Kambourakis, G.; Stavrou, A.; Voas, J. DDoS in the IoT: Mirai and Other Botnets. Computer 2017, 50, 80–84. [Google Scholar] [CrossRef]
  40. Galeano-Brajones, J.; Carmona-Murillo, J.; Valenzuela-Valdés, J.F.; Luna-Valero, F. Detection and Mitigation of DoS and DDoS Attacks in IoT-Based Stateful SDN: An Experimental Approach. Sensors 2020, 20, 816. [Google Scholar] [CrossRef] [PubMed]
  41. Ul, I.; Bin, M.; Asif, M.; Ullah, R. DoS/DDoS Detection for E-Healthcare in Internet of Things. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 297–300. [Google Scholar] [CrossRef]
  42. Olzak, T. Keystroke Logging (Keylogging). Available online: https://www.researchgate.net/publication/228797653_Keystroke_logging_keylogging (accessed on 12 November 2020).
  43. Abukar, Y.; Maarof, M.; Hassan, F.; Abshir, M. Survey of Keylogger Technologies. Int. J. Comput. Sci. Telecommun. 2014, 5, 25–31. [Google Scholar]
  44. Ortolani, S.; Giuffrida, C.; Crispo, B. Bait Your Hook: A Novel Detection Technique for Keyloggers. Lect. Notes Comput. Sci. Recent Adv. Intrusion Detect. 2010, 198–217. [Google Scholar] [CrossRef]
  45. Wajahat, A.; Imran, A.; Latif, J.; Nazir, A.; Bilal, A. A Novel Approach of Unprivileged Keylogger Detection. In Proceedings of the 2019 2nd International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), Sukkur, Pakistan, 30–31 January 2019. [Google Scholar] [CrossRef]
  46. Yang, K.; Li, Q.; Sun, L. Towards automatic fingerprinting of IoT devices in the cyberspace. Comput. Netw. 2019, 148, 318–327. [Google Scholar] [CrossRef]
  47. Aneja, S.; Aneja, N.; Islam, M.S. IoT Device Fingerprint using Deep Learning. In Proceedings of the 2018 IEEE International Conference on Internet of Things and Intelligence System (IOTAIS), Bali, Indonesia, 1–3 November 2018. [Google Scholar] [CrossRef]
  48. Bhuyan, M.H.; Bhattacharyya, D.K.; Kalita, J.K. Surveying Port Scans and Their Detection Methodologies. Comput. J. 2011, 54, 1565–1581. [Google Scholar] [CrossRef]
  49. Markowsky, L.; Markowsky, G. Scanning for vulnerable devices in the Internet of Things. In Proceedings of the 2015 IEEE 8th International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), Warsaw, Poland, 24–26 September 2015. [Google Scholar] [CrossRef]
  50. Sivanathan, A.; Gharakheili, H.H.; Sivaraman, V. Can We Classify an IoT Device using TCP Port Scan? In Proceedings of the 2018 IEEE International Conference on Information and Automation for Sustainability (ICIAfS), Colombo, Sri Lanka, 21–22 December 2018. [Google Scholar] [CrossRef]
  51. Shao, G.L.; Chen, X.S.; Yin, X.Y.; Ye, X.M. A fuzzy detection approach toward different speed port scan attacks based on Dempster-Shafer evidence theory. Secur. Commun. Netw. 2016, 9, 2627–2640. [Google Scholar] [CrossRef]
  52. Lopez-Vizcaino, M.; Novoa, F.J.; Fernandez, D.; Carneiro, V.; Cacheda, F. Early Intrusion Detection for OS Scan Attacks. In Proceedings of the 2019 IEEE 18th International Symposium on Network Computing and Applications (NCA), Cambridge, MA, USA, 26–28 September 2019. [Google Scholar] [CrossRef]
  53. Rashid, M.M.; Kamruzzaman, J.; Hassan, M.M.; Imam, T.; Gordon, S. Cyberattacks Detection in IoT-Based Smart City Applications Using Machine Learning Techniques. Int. J. Environ. Res. Public Health 2020, 17, 9347. [Google Scholar] [CrossRef]
  54. Soe, Y.N.; Feng, Y.; Santosa, P.I.; Hartanto, R.; Sakurai, K. Machine Learning-Based IoT-Botnet Attack Detection with Sequential Architecture. Sensors 2020, 20, 4372. [Google Scholar] [CrossRef] [PubMed]
  55. Ioannou, C.; Vassiliou, V. Classifying Security Attacks in IoT Networks Using Supervised Learning. In Proceedings of the 2019 15th International Conference on Distributed Computing in Sensor Systems (DCOSS), Santorini Island, Greece, 29–31 May 2019; pp. 652–658. [Google Scholar] [CrossRef]
  56. Koroniotis, N.; Moustafa, N.; Sitnikova, E.; Turnbull, B. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst. 2019, 100, 779–796. [Google Scholar] [CrossRef]
  57. Noble, W.S. What is a support vector machine? Nat. Biotechnol. 2006, 24, 1565–1567. [Google Scholar] [CrossRef] [PubMed]
  58. Rish, I. An empirical study of the naive Bayes classifier. In Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, Seattle, WA, USA, 4 August 2001; Volume 3, pp. 41–46. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
