Intelligent Detection of IoT Botnets Using Machine Learning and Deep Learning

Abstract: As the number of Internet of Things (IoT) devices connected to the network rapidly increases, network attacks such as flooding and Denial of Service (DoS) are also increasing. These attacks cause network disruption and denial of service to IoT devices. However, the large number of heterogeneous devices deployed in the IoT environment makes it difficult to detect IoT attacks using traditional rule-based security solutions. It is challenging to develop optimal security models for each type of device. Machine learning (ML) is an alternative technique that allows one to develop optimal security models based on empirical data from each device. We employ the ML technique for IoT attack detection. We focus on botnet attacks targeting various IoT devices and develop ML-based models for each type of device. We use the N-BaIoT dataset generated by injecting botnet attacks (Bashlite and Mirai) into various types of IoT devices, including a Doorbell, Baby Monitor, Security Camera, and Webcam. We develop a botnet detection model for each device using numerous ML models, including deep learning (DL) models. We then analyze the effective models with a high detection F1-score by carrying out multiclass classification, as well as binary classification, for each model.


Introduction
At the 2016 World Economic Forum (WEF, also known as the Davos Forum), The Fourth Industrial Revolution by Klaus Schwab became a turning point in transforming our society from an information society into an intelligent information society. The Fourth Industrial Revolution represents a fundamental change in the way we live, work, and relate to one another [1]. Key technologies leading the fourth industrial revolution include the Internet of Things (IoT), the cloud, big data, mobile technology, and artificial intelligence (AI). These intelligent information technologies are creating new industries and revolutionizing the ecosystem of existing manufacturing industries. The term "Internet of Things" was coined in 1999 by Kevin Ashton to describe how data collection through sensor technology has unlimited potential [2]. The IoT was included in the Gartner Top 10 Strategic Technology Trends for 2020, which projected that by 2023 there will be more than 20 times as many smart devices as in conventional IT roles [3]. According to Gartner, the overall usage of IoT in various areas, such as utilities, healthcare, the government, physical security, and vehicles, is expected to increase [4].
As the IoT develops, cyber threats targeting IoT devices are also increasing. Most IoT devices are connected to the internet, which facilitates abuse and a lack of security control. The fact that IoT manufacturers failed to implement proper security controls to protect these types of devices from

IoT Security Threats
As the IoT evolves, attacks on the IoT and using IoT are becoming more diverse. This section describes IoT security threats. First, we look at studies that categorized IoT security threats. Next, we explore the types of IoT security attacks and the papers that studied those attacks.

Classification of Attacks
The classification that categorizes attacks can be largely divided into three groups. One is divided based on the layers of the IoT. Another includes independently suggested classifications based on attacks, while the other is based on the attacks themselves. In this section, the classification presented in each study is divided into these three groups.

• IoT layer: The IoT can be divided into three layers: the perception layer, the network layer, and the application layer. The perception layer is the bottom layer of the IoT, which connects physical devices to the network. The network layer is the middle layer of the IoT, which determines the route based on information received from the perception layer. The application layer is the top layer of the IoT. It receives data from the network layer and sends it to the appropriate service. The potential attacks on each layer can be explained according to this taxonomy based on IoT layers. Lin et al. [6] classified the security challenges that can occur on each layer of the IoT. Sonar et al. [7] focused on DDoS attacks that can occur in the IoT. They categorized the DDoS scenarios that can occur in each layer of the IoT.
• Attacks: Abomhara et al. [8] classified attacks into eight common attack types: physical attacks, reconnaissance attacks, DoS, access attacks, attacks on privacy, cyber-crimes, destructive attacks, and supervisory control and data acquisition (SCADA) attacks. Physical attacks are attacks related to hardware components. Reconnaissance attacks involve the unauthorized mapping of systems or services using packet sniffers or network port scanning. DoS is an attack that disables network resources or devices. Access attacks are attacks that allow unauthorized users to gain access to a network or device. Attacks on privacy compromise privacy protection. Cyber-crimes involve using the Internet or smart devices to take user data to commit a crime. Destructive attacks are attacks that harm assets and lives (e.g., terrorism). SCADA systems are vulnerable to many cyber-attacks; such a system can be attacked using DoS, a Trojan, or a virus. Andrea et al. [9] divided attacks into four categories based on the type of attack: physical attacks, network attacks, software attacks, and encryption attacks. Deogirikar et al. [10] used this taxonomy.
• Vulnerabilities: Neshenko et al. [11] categorized the many IoT-related studies from 2005 to 2018 into five classes: layers, security impacts, attacks, countermeasures, and situational awareness capability (SAC). Layers is a class that determines how IoT components affect IoT vulnerabilities. Security impact is a class that represents vulnerabilities based on security elements such as confidentiality, integrity, and availability. Attacks describes IoT vulnerabilities that can be exploited. Countermeasures is a class for countermeasures that can improve IoT weaknesses. SAC is a class of technology used for malicious activities using the IoT. Ronen et al. [12] presented a new classification standard divided into four different categories depending on how the attackers deviated from the specific functions of the IoT. The four categories are ignoring functionality, reducing functionality, misusing functionality, and extending functionality. Ignoring functionality is an attack that ignores the intended physical function of the IoT and considers the IoT device as a normal computer device that is connected to the internet. Reducing functionality involves limiting or eliminating the original functions of the IoT. Misusing functionality entails using a function in an incorrect or unauthenticated way without destruction. Extending functionality means expanding a given function to produce different or unexpected physical effects. Alaba et al. [13] presented four groups of classification: application, architecture, communication, and data. Table 1 provides a summary of classification research presented in Section 2.1.1.
Table 1. Summary of classification works.

Attack Works
By reviewing the various classifications used in each study in Section 2.1.1, we decided to use IoT layer classification to categorize the possible attacks on the IoT. Based on the aforementioned IoT layers, the most frequently mentioned attacks can be divided into nine types of attacks. We organize these nine types of attacks into the proposed taxonomy and see if they violate the three elements of information protection: confidentiality, integrity, and availability. We also look at studies related to each attack.

• Perception: The Perception class includes attacks that usually deplete the resources of the devices.
Jamming: A jamming attack is an attack that interferes with the radio frequency of the sensor node. Integrity is violated here because the device's frequency is changed. When someone cannot obtain service due to a jamming attack, availability is also violated. Namvar et al. [24] presented a mechanism that can solve the jamming problem in an IoT system. Dou et al. [25] proposed an ACA model that provides adequate resources for anti-jamming on the IoT. Through simulation, they showed that the model is suitable for low-power, band-limited IoT networks.
DoS: A DoS attack is an attack that consumes all available resources and causes the IoT system to function abnormally. Here, availability is violated. DDoS is an attack in which many attackers from many points attack one target using a very large network. The malicious code that allows an attacker to carry out a DDoS attack is called a bot, and a botnet that carries out a DDoS attack consists of these bots. Angrishi et al. [26] described the structures of IoT botnets and introduced DDoS events using IoT botnets. The authors also presented improved measures that can alleviate the risks related to the IoT. Kolias et al. [27] described botnets that can trigger DDoS attacks. They focused on the Mirai botnet and also mentioned other botnets, such as Hajime and BrickerBot.

• Network: The Network class mainly categorizes attacks that send or receive incorrect information or deplete network resources so that the user cannot obtain normal service.
Man in the middle attack (MitM): A MitM attack is carried out by a malicious device that an attacker inserts between two normal devices communicating in an IoT network environment. Data can be stolen or stored on the malicious device. Integrity is violated in this case because the data can be changed. Because an unauthorized party is in the middle, confidentiality is also violated. Li et al. [28] demonstrated the weakness of the OpenFlow control channel by conducting an experiment based on an MitM attack. The authors also suggested a countermeasure based on Bloom filters and indicated the efficiency of the Bloom filter monitoring system. Cekerevac et al. [29] showed a variety of attacks using MitM, including MIT-Cloud (MITC), MIT-Browser (MITB), and MIT-mobile (MITMO). They showed that MitM is not uncommon and can damage the IoT, even though it is an old attack.
Routing attack: A routing attack is an attack based on the routing protocol of the IoT system. It causes a delay in the IoT network by manipulating routing to create a routing loop. Integrity is violated because information can be manipulated. Wallgren et al. [30] presented several well-known routing attacks and emphasized the importance of security in IoT networks based on the Routing Protocol for Low-Power and Lossy Networks (RPL). Yavuz et al. [31] presented a deep learning-based method that can detect IoT routing attacks with high accuracy.
Sinkhole: A sinkhole attack occurs when an abnormal device makes an exceptional request that is forwarded to another abnormal device. By requiring two abnormal devices continuously, confidentiality is compromised. Shiranzaei et al. [32] proposed an intrusion detection system (IDS) that can defend a 6LowPAN network from a sinkhole or forwarding attack. Soni et al. [33] explored the trends in the countermeasure technology against sinkhole attacks.
Wormhole: A wormhole is an attack that can occur when two malicious devices exchange routing information over a private link so that other devices think that there is only one hop between the two. When someone is present in the middle, it can be said that confidentiality is violated. Integrity can also be violated when other devices have the wrong information due to a wormhole attack. Palacharla et al. [34] examined various mechanisms for detecting wormholes and suggested new methods. The proposed method uses cryptography to detect a wormhole attack. Because the path is dynamically checked, the transfer between the two can be detected without looking at all the nodes. Lee et al. [35] studied the effects of wormhole attacks on network flow and proposed a passivity-based control-theoretical framework to model a wormhole attack.
Flooding: Flooding is a type of DoS attack. DoS can exhaust not only perception resources but also network resources in the IoT environment. Flooding depletes bandwidth by sending massive or abnormal traffic and results in service disruption at the network layer. Availability is violated because users cannot obtain a connection when desired. Rizal et al. [36] conducted network forensic research that can detect IoT flooding attacks. The authors proposed a network forensics model and successfully detected attacks. Campus et al. [37] described how flooding attacks affect routing on the IoT.

• Application: The application class classifies attacks that can occur on the application side.
Virus: A virus replicates and infects other files but does not propagate by itself. Confidentiality, integrity, and availability are all violated by a virus, because information can be disclosed to an unauthorized user and data can be changed or made unavailable to an authorized user. Azmoodeh et al. [38] proposed a machine learning-based approach to detect ransomware attacks, which are a type of virus. Dash et al. [39] used network traffic flow analysis to assess ransomware. They showed that machine learning is a very efficient approach to detect ransomware.
Worm: Unlike a virus, a worm can spread itself. When someone destroys their victim with a worm, integrity is violated. When service is denied due to a worm, availability is violated. Wang et al. [40] proposed a new worm detection method based on mining dynamic program execution. The authors showed that the proposed method has a high detection rate and a low false positive rate. Yu et al. [41] studied a new type of worm called a self-disciplinary worm. This worm changes its propagation patterns to become less detectable and allow more computers to become infected. The authors used a game-theoretic formulation to propose the corresponding countermeasures. Table 2 shows a summary of these attacks and their relevant research.

Existing AI-Based IoT Studies
Next, we briefly review the trends of IoT studies based on deep learning and machine learning.

IoT Using AI
Mohammed et al. [42] suggested that basic deep learning could provide services like image recognition, voice recognition, localization, detection, and security. The authors summarized the areas in which the IoT is used and what studies have been done in each area. Xiaofeng et al. [43] showed how data are collected through smart cities, sensors, and humans. They suggested using anomaly detection model-based machine learning with annual power data, loops, and land sensor data. They used long short-term memory-neural network (LSTM-NN) and MLP models. Among the models, LSTM-G-NB has the highest accuracy. Furqan Alam et al. [44] suggested a classifier based on eight algorithms such as support vector machine (SVM), K-nearest neighbor (KNN), naïve Bayes (NB), latent Dirichlet allocation (LDA) using IoT device data. NB and LDA offer better accuracy, while LDA provides quicker processing speed. The DLANN algorithm requires the longest time because it features a complex structure and requires many system resources. Mohammadi et al. [45] also applied deep learning and machine learning to the IoT. They introduced the concepts of security threats and artificial intelligence techniques in many fields. In particular, 60% of devices in healthcare are considered Internet of Medical Things (IoMT), which is expected to grow to 20 to 30 billion devices in 2020.

• Malware Classifier: As IoT malware is widespread and a major source of DDoS traffic, IoT security has become increasingly important. Since most IoT devices have no existing mechanism to automatically update themselves, malware detection in the network layer is necessary [46]. Hamed et al. [47]
• Network Anomaly Detection: Network malware detection has also been studied. McDermott et al. [18] proposed a description of deep learning and a network botnet attack detection model based on deep learning. The Mirai botnet was used in this study. The authors proposed a detection model based on bidirectional long short-term memory (BLSTM) using a recurrent neural network (RNN) that consists of an Adam optimizer and a sigmoid function. With the dataset made by capturing packets, the LSTM-based model provided 99.571% accuracy, and the BLSTM-based model offered 99.998% accuracy. Olivier et al. [19] suggested an attack detection model based on Dense RNN. This model can detect UDP flooding, TCP SYN flooding, sleep deprivation attacks, barrage attacks, and broadcast attacks. Statistical sequence data were extracted from the captured packets. In the study, the gateway was connected to the Internet via 3G SIM cards. Several IoT devices and Wi-Fi connections were also connected to the gateway. The proposed model showed similar performance to the threshold method. Yair et al. [20] used packet-captured data with port mirroring in a network including IoT devices. The IoT devices used in the study were of nine types: a baby monitor, motion sensor, refrigerator, security camera, smoke detector, socket, thermostat, TV, and a watch. They introduced a random forest-based classifier model for unauthorized IoT devices. On average, the model shows an accuracy of 94% and has higher accuracy than the white list method. This study shows the same result with network changes. Doshi et al. [21] suggested a DoS attack traffic detection model for IoT devices. They used KNN, Lagrangian support vector machine (LSVM), decision tree (DT), random forest (RF), and neural network (NN) for model training. The dataset was generated with captured packets and grouped by device and time zone. The extracted features were divided into stateless and stateful features. The stateless category includes packet size, the inter-packet interval, and the protocol features. The stateful category includes bandwidth and destination IP address cardinality and novelty features. As a result, using all features was more accurate than using only stateless features. Hodo et al. [48] described the characteristics of host-based IDS and network-based IDS. They suggested a DDoS/DoS attack detection model using ANN. The proposed model provided 99.4% accuracy.
• Network Anomaly Detection using N-BaIoT: Meidan et al. [49] used the N-BaIoT dataset, which is described in Section 3. The authors proposed an anomaly detection model based on a deep autoencoder. They separated the model for each IoT device and user datum during training. After training, the anomaly threshold was set. The Mirai and Bashlite botnet environments were built and used. The proposed model was compared to the local outlier factor (LOF), one-class SVM, and isolation forest. The proposed model provided the best performance of the four models. Shorman et al. [50] proposed an IoT botnet detection model. The authors used data preprocessing in four levels: data cleaning, data migration, and data rescaling and optimizing. With the pre-processed N-BaIoT, the authors trained the intrusion detection model based on the Grey Wolf optimization one-class support vector machine (GWO-OCSVM) algorithm. Compared to OCSVM, isolation forest (IF), and LOF, the proposed model provides a faster detection time and higher accuracy. Table 3 shows a summary of Section 2.2 (i.e., which dataset(s) and algorithm(s) are used in each reference).

Methodology
We built a framework for developing an IoT botnet detection model. Our framework includes the entire process from defining the botnet dataset to detecting botnets. In this section, we describe the N-BaIoT dataset used in our framework and design the proposed framework.

N-BaIoT Dataset
The N-BaIoT dataset was generated by Mohammed et al. [42] and consists of data samples with 115 features. The datasets were collected through the port mirroring of IoT devices. The benign data were captured immediately after setting the network to ensure that the data was benign. For two types of packet sizes (only outbound/both outbound and inbound), packet counts, and packet jitters, the times between packet arrival were extracted for each statistical value. A total of 23 features were extracted for each of the 5 time windows (100 ms, 500 ms, 1.5 s, 10 s, and 1 min), for a total of 115 features. We use all of the 115 features in our framework. Table 4 shows the detailed features of the dataset. The datasets were collected by injecting two types of attacks into various types of IoT devices, as shown in Table 5.  [54] suggested a framework to identify Mirai and Bashlite C&C servers by combining 4 heuristic algorithms. Table 6 shows the 10 specific attack types of Bashlite and Mirai.

Proposed Framework
Our framework comprises a botnet dataset, botnet training models, and botnet detection models. The botnet dataset consists of four subdatasets of N-BaIoT. We select devices in N-BaIoT that include all 10 attack samples described in Table 6, such as a doorbell (Ennio), baby monitor (Philips B120N/10), security camera (Provision PT-838), and webcam (Samsung SNH 1011 N). Table 7 shows the number of samples in the four datasets according to the device type we used. As botnet training models, we use the most widely used ML and DL algorithms. We employ not only five types of ML models (naïve Bayes (NB), K-nearest neighbors (KNN), logistic regression (LR), decision tree (DT), and random forest (RF)) but also three types of DL models (convolutional neural network (CNN), recurrent neural network (RNN), and long short-term memory (LSTM)). There are two types of botnet detection models: binary classification and multiclass classification. The binary classification classifies the N-BaIoT dataset into two categories: attack and benign. This classification does not consider the different types of protocols that can be used for botnet attacks, while the multiclass classification distinguishes each protocol used by Bashlite and Mirai. Figure 1 shows our framework for developing ML- and DL-based IoT botnet detection models.

Experimental Evaluation
In this section, we identify the most effective model for IoT botnet detection by analyzing performance differences depending on the type of IoT device as well as the type of ML and DL model. We first develop an IoT botnet detection model based on the proposed framework. We randomly divide the samples of the N-BaIoT dataset into training and testing sets at a ratio of 70 to 30 using a dataset split function of Scikit-learn, an open source ML library for supervised and unsupervised learning, so the training and testing sets are independent of each other. Furthermore, to prevent overfitting, we use 20% of the training set as a validation set. We calculate the validation loss during training to check that it does not increase while the training loss decreases.
We carry out multiclass classification as well as binary classification. Multiclass classification distinguishes not only benign traffic but also the fine-grained attack types, while binary classification categorizes N-BaIoT only into benign and attack. We then verify our ML and DL models using the testing sets.
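The split described above can be sketched as follows. This is a minimal standard-library illustration rather than the authors' code: the function name `split_indices`, the seed, and the sample count of 1000 are assumptions made for the example, and in practice Scikit-learn's `train_test_split` could be applied twice to the same effect. The sketch shows that a 70/30 split followed by a 20% validation hold-out leaves 56% of all samples for training, 14% for validation, and 30% for testing.

```python
import random


def split_indices(n_samples, test_frac=0.30, val_frac=0.20, seed=0):
    """Shuffle sample indices, hold out test_frac of them for testing, then
    reserve val_frac of the remaining training part for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_test = round(n_samples * test_frac)
    test = indices[:n_test]
    train_full = indices[n_test:]

    n_val = round(len(train_full) * val_frac)
    val = train_full[:n_val]
    train = train_full[n_val:]
    return train, val, test


# Hypothetical dataset of 1000 samples.
train, val, test = split_indices(1000)
print(len(train), len(val), len(test))
```

Because the three index lists are disjoint, no test sample can leak into training or validation.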

Binary Classification
The binary classification model considers the 10 different detailed Bashlite and Mirai attacks injected into IoT devices as one attack. It distinguishes between attack and benign states, the latter of which means that no attack is injected. We train our model using the dataset collected from each device based on the ML and DL models. We design these models using Keras, as well as Scikit-learn. Table 8 describes the design of our models.
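As an illustration of how the ten attacks collapse into one class, the following sketch maps fine-grained labels to binary labels. The attack names follow the Bashlite and Mirai attacks discussed in this paper, but the exact label strings and the helper `to_binary` are hypothetical and may differ from the labels used in the N-BaIoT files.

```python
# Attack names follow the ten Bashlite and Mirai attacks named in the text;
# the exact label strings are hypothetical.
BASHLITE = ["scan", "junk", "udp", "tcp", "combo"]
MIRAI = ["scan", "ack", "syn", "udp", "plain_udp"]
ATTACK_LABELS = [f"bashlite_{a}" for a in BASHLITE] + [f"mirai_{a}" for a in MIRAI]


def to_binary(label):
    """Collapse any of the ten attack labels into a single 'attack' class."""
    if label == "benign":
        return "benign"
    if label in ATTACK_LABELS:
        return "attack"
    raise ValueError(f"unknown label: {label}")


samples = ["benign", "mirai_syn", "bashlite_junk", "benign", "mirai_plain_udp"]
binary = [to_binary(s) for s in samples]
print(binary)
```

Rejecting unknown labels, instead of silently treating them as attacks, keeps labeling mistakes visible.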

We then analyze the performance of the models through their F1-score measurements. The F-score is an index that expresses both precision and recall as a single value, and the F1-score is the F-score computed with a beta value of 1, which weights precision and recall equally. The F1-score can be expressed by the following equation:
$$F_1 = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
where $\mathrm{precision} = \frac{TP}{TP + FP}$ and $\mathrm{recall} = \frac{TP}{FN + TP}$.
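As a worked illustration of this formula, the following sketch computes precision, recall, and the F1-score from the raw counts TP, FP, and FN. The function name and the example counts (90, 10, 30) are made up for illustration and are not results from this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts.

    Returns 0.0 for a quantity whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(p, 3), round(r, 3), round(f1, 3))
```

With these counts, precision is 0.9 and recall is 0.75, so the F1-score is 9/11, or roughly 0.818. It lies between the two values and closer to the lower one.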
True positive (TP) is the number of benign samples that are properly classified as benign. False negative (FN) is the number of benign samples that are falsely detected as a botnet. False positive (FP) is the number of actual botnet samples that are incorrectly predicted as benign. True negative (TN) is the number of botnet samples that are properly detected as a botnet. Table 9 shows the detailed detection results of precision, recall, and F1-score for each ML model.
Appl. Sci. 2020, 10, 7009 12 of 22
In Table 9, all models except logistic regression (LR) are able to classify benign and botnet samples with very high performance. For LR, the precision, recall, and F1-score of benign samples are significantly lower than those of attack samples on all the devices. Thus, errors occur frequently in benign classification.
In addition, the naïve Bayes (NB) classification in Table 9 corresponds to multinomial NB, which has the best performance out of Gaussian NB, Bernoulli NB, and multinomial NB. As shown in Table 10, Gaussian and Bernoulli NB have a lower detection F1-score than multinomial NB. Therefore, we also used multinomial NB for multiclass classification because it provides a high detection F1-score.
Figure 2 shows the results of binary classification based on the three DL models. The CNN model achieves an F1-score of more than 0.99 on all devices and offers higher performance than the RNN and LSTM models. LSTM, which has the second highest performance, yields an F1-score of more than 0.99 for the baby monitor, security camera, and webcam, but only 80% for the doorbell. In the doorbell results, all but 2 of the 290,000 botnet samples were accurately detected. However, for benign samples, 5000 samples (comprising about 61% of the 15,000 samples) were incorrectly detected as botnets. This is because the number of benign samples for learning was significantly less than the number of botnet samples, thereby producing several false positives in benign classification. Using RNN, the F1-score was 0.57 for the baby monitor, about 0.815 for the webcam, and 0.905 for the doorbell and security camera, thus offering the lowest average F1-score compared to CNN and LSTM. Notably, the RNN result for the baby monitor showed a high F1-score in benign classification, but about 80% of the botnet samples were incorrectly classified as benign during botnet detection. In Section 4.2, using the results of multiclass classification, we determine which specific botnet attacks produce the highest false positive rates.

Multiclass Classification
The multiclass classification model considers 10 attacks injected into each device as individual attacks and classifies them into 11 groups, including benign. The F1-score of each model as a result of training each device dataset using the five ML models and performing multiple classifications is shown in Table 11.
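To make the per-class F1-score concrete, the following sketch computes it for a multiclass prediction by treating each class in turn as the positive class. It uses the equivalent form F1 = 2TP / (2TP + FP + FN). The function name, the label strings, and the six toy samples are assumptions for illustration only; they are not taken from the N-BaIoT results.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {class: F1}, treating each class in turn as the positive class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[t] += 1  # the true class gains a false negative

    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


# Toy example: every 'bashlite_tcp' sample is predicted as 'bashlite_udp'.
y_true = ["benign", "benign", "bashlite_tcp", "bashlite_tcp", "bashlite_udp", "bashlite_udp"]
y_pred = ["benign", "benign", "bashlite_udp", "bashlite_udp", "bashlite_udp", "bashlite_udp"]
scores = per_class_f1(y_true, y_pred)
print(scores)
```

In this toy case the class that is never predicted correctly receives an F1-score of 0, and the class that absorbs its samples is penalized through its false positives.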

Multiclass Classification
The multiclass classification model treats the 10 attacks injected into each device as individual classes and classifies traffic into 11 groups, including benign. Table 11 shows the F1-score of each model after training on each device dataset with the five ML models and performing multiclass classification. Compared to the binary classification in Table 8, DT and RF still provide F1-scores close to 1 for all devices, but NB and KNN show lower F1-scores. To determine why the F1-scores decreased in the NB model, we analyzed the F1-scores by attack type; the results are shown in Figure 3.
As shown in Figure 3, NB-based botnet detection has low F1-scores for junk, scan, and TCP under Bashlite attacks and for ACK, SYN, and Plain UDP under Mirai attacks. As shown in Table 12, this occurs because the junk of Bashlite was mis-detected as the COMBO (SYN+UDP) of Bashlite, the scan of Bashlite was mis-detected as the scan of Mirai, and the TCP of Bashlite was mis-detected as the UDP of Bashlite. Several samples were also mis-detected as the UDP of Mirai instead of the ACK of Mirai, as the COMBO of Bashlite or the scan of Mirai instead of the SYN of Mirai, and as the UDP of Mirai instead of the Plain UDP of Mirai.
In KNN-based detection, the F1-score was consistently high for Bashlite attacks, as shown in Figure 4, but for Mirai attacks the F1-score was about 0.7 to 0.9 for every attack type except scan. The detection results by attack type are shown in Table 13. The ACK of Mirai was mis-detected as the UDP of Mirai, the SYN of Mirai was mis-detected as the COMBO of Bashlite and the scan of Mirai, and the UDP of Mirai was mis-detected as the Plain UDP of Mirai. Because of this, false positives occurred frequently.
Table 14 shows the average F1-score of each model after training on the dataset for each device with the three DL models and performing multiclass classification. According to Table 14, CNN has the highest F1-score, while RNN and LSTM have lower F1-scores. For the CNN, although most individual attacks were accurately detected, the F1-score for the TCP attack of Bashlite was 0%, as shown in Figure 5. According to the confusion matrix (the table showing whether the class predicted by the model matches the original class of the target), the CNN model consistently detected the TCP attack of Bashlite as the UDP attack of Bashlite on all devices. In addition, for the security camera and webcam, the model detected the Plain UDP attack of Mirai as the UDP attack of Mirai. The model also mis-detected the ACK and scan of Mirai.
For RNN, the F1-score of each specific attack is shown in Table 15. The model correctly detected benign traffic and the UDP of Mirai. For Bashlite attacks, the model mis-detected COMBO as benign (security camera). It mis-detected junk as the scan of Bashlite (doorbell), the UDP of Mirai (baby monitor), benign (security camera), and the COMBO of Bashlite (webcam). It mis-detected scan as benign (baby monitor, security camera) and TCP as the UDP of Mirai (doorbell), the UDP of Bashlite (baby monitor), benign (security camera), and the UDP of Bashlite (webcam). It also mis-detected UDP as the UDP of Mirai (doorbell) and benign (security camera). For Mirai attacks, the model mis-detected ACK as the COMBO of Bashlite (webcam) and scan as the ACK of Mirai (doorbell) and the scan of Mirai (baby monitor). It mis-detected SYN as the scan of Mirai (doorbell), the COMBO of Bashlite (baby monitor), and benign (security camera, webcam). It also mis-detected Plain UDP as the UDP of Mirai (doorbell, security camera) and benign (webcam).
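The per-attack analysis above rests on two quantities: the per-class F1-score and, for each true class, the class it is most often mistaken for. The following is a minimal sketch of how both can be derived from true and predicted labels. It is not the evaluation code used in this study; the function names and the toy labels are hypothetical and are not taken from N-BaIoT.

```python
from collections import Counter, defaultdict


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, treating each label one-vs-rest."""
    # Each key is a (true_label, predicted_label) cell of the confusion matrix.
    counts = Counter(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = counts[(label, label)]
        fp = sum(n for (t, p), n in counts.items() if p == label and t != label)
        fn = sum(n for (t, p), n in counts.items() if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def top_confusions(y_true, y_pred):
    """For each true label with errors, return its most frequent wrong prediction."""
    wrong = defaultdict(Counter)
    for t, p in zip(y_true, y_pred):
        if t != p:
            wrong[t][p] += 1
    return {t: c.most_common(1)[0][0] for t, c in wrong.items()}


# Hypothetical toy labels: every TCP sample is predicted as UDP.
y_true = ["benign"] * 4 + ["bashlite_tcp"] * 3 + ["bashlite_udp"] * 3
y_pred = ["benign"] * 4 + ["bashlite_udp"] * 3 + ["bashlite_udp"] * 3

f1 = per_class_f1(y_true, y_pred)
confused = top_confusions(y_true, y_pred)
```

In this toy case the F1-score of the TCP class is 0 because it is never predicted, and the UDP class also loses precision because it absorbs the TCP samples. This is the same pattern reported for the CNN on the TCP attack of Bashlite.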

The F1-score of each specific attack for LSTM is shown in Table 16. The model correctly detected benign traffic and the COMBO attack of Bashlite. For Bashlite attacks, the model mis-detected junk as the COMBO of Bashlite (doorbell), the SYN of Mirai (baby monitor), and the COMBO of Bashlite (security camera). It mis-detected scan as the SYN of Mirai (baby monitor) and benign, and TCP as the ACK of Mirai (doorbell, webcam), the UDP of Bashlite (baby monitor), and benign (security camera). It also mis-detected UDP as the ACK of Mirai (doorbell), benign (security camera), and the SYN of Mirai (webcam). For Mirai attacks, the model mis-detected ACK as benign (security camera, webcam) and the UDP of Bashlite (baby monitor), benign (security camera), and the Plain UDP of Mirai (webcam). It mis-detected SYN as benign (security camera) and UDP as the Plain UDP of Mirai (webcam); it also mis-detected the ACK of Mirai (doorbell) and benign (security camera). The F1-score of the LSTM was slightly higher than that of the RNN.
In summary, the experimental evaluation showed that ML models such as decision tree and random forest yield better performance in detecting Mirai and Bashlite botnets in N-BaIoT. Among the DL models, CNN showed the best performance in our framework. The Bashlite and Mirai botnets, which emerged in 2014 and 2016, respectively, mainly targeted IP cameras and home routers. Our experimental results using the N-BaIoT dataset showed that botnet detection performance depends mostly on the type of training model rather than the type of IoT device. We believe that developing IoT botnet detection models based on decision tree, random forest, and CNN would be an effective way of improving botnet detection performance for various types of IoT devices.
In multiclass classification, the models tended to mis-detect TCP attacks as UDP more often than as SYN or benign. In a production IoT environment, botnet attacks can use various types of protocols. Thus, various protocols, including TCP and UDP, should be considered when collecting traffic and training models in order to improve IoT botnet detection performance.
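A protocol-level confusion such as TCP detected as UDP lowers multiclass scores even when the sample is still recognized as an attack. The sketch below illustrates this by collapsing multiclass labels into benign versus attack and comparing the two views. It is an illustration under stated assumptions, not the evaluation procedure of this study; the helper names and the toy labels are hypothetical.

```python
def to_binary(labels, benign="benign"):
    """Collapse every non-benign label into a single 'attack' class."""
    return ["benign" if label == benign else "attack" for label in labels]


def binary_f1(y_true, y_pred, positive="attack"):
    """F1-score of the positive class for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy labels: all TCP samples are predicted as UDP.
y_true = ["benign"] * 4 + ["bashlite_tcp"] * 3 + ["bashlite_udp"] * 3
y_pred = ["benign"] * 4 + ["bashlite_udp"] * 6

# Fraction of samples whose exact attack type is predicted correctly.
multiclass_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Attack-versus-benign F1 after collapsing the labels.
attack_f1 = binary_f1(to_binary(y_true), to_binary(y_pred))
```

Here only 70% of the samples receive the correct attack type, yet the attack-versus-benign F1-score is 1.0, because every error stays within the attack classes. Errors of this kind affect attack attribution rather than attack detection.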
Our study contributes a framework that makes it easy to compare various ML and DL models for IoT botnet detection. In the future, we will develop an integrated IoT security framework that detects a variety of IoT attacks, as well as botnet attacks, based on various ML and DL models.