Anomaly Detection Using Deep Neural Network for IoT Architecture

Abstract: The revolutionary idea of the internet of things (IoT) architecture has gained enormous popularity over the last decade, resulting in exponential growth in IoT networks, connected devices, and the data processed therein. Since IoT devices generate and exchange sensitive data over the traditional internet, security has become a prime concern due to the emergence of zero-day cyberattacks. A network-based intrusion detection system (NIDS) can provide the much-needed efficient security solution to the IoT network by protecting the network entry points through constant network traffic monitoring. Recent NIDS have a high false alarm rate (FAR) in detecting anomalies, including novel and zero-day anomalies. This paper proposes an efficient anomaly detection mechanism using mutual information (MI), considering a deep neural network (DNN) for an IoT network. A comparative analysis of different deep-learning (DL) models, namely DNN, Convolutional Neural Network, Recurrent Neural Network, and its variants Gated Recurrent Unit and Long Short-Term Memory, is performed on the IoT-Botnet 2020 dataset. Experimental results show an improvement of 0.57–2.6% in the model's accuracy and a reduction of the FAR by 0.23–7.98%, demonstrating the effectiveness of the DNN-based NIDS model compared to the well-known deep learning models. It was also observed that using only the 16–35 best numerical features selected using MI, instead of the 80 features of the dataset, results in almost negligible degradation in the model's performance while decreasing the model's overall complexity. In addition, the detection accuracy of the DL-based models is further improved by 0.99–3.45% when considering only the top five categorical and numerical features.


Introduction
IoT is a revolutionary computing paradigm that has evolved rapidly over the last decade in almost every technological domain, such as smart homes, smart industries, smart transportation, smart healthcare [1][2][3][4], use of sensors [5][6][7][8], smart cities, and satellites [9], to name a few [10]. It comprises many IoT devices (Things) equipped with different sensors, actuators, storage, computational and communicational capabilities to collect and exchange [...] solved the fast computation requirement of the DL algorithms [50]. This has motivated researchers to use DL algorithms to propose efficient security solutions for IoT networks that process vast amounts of raw data. DL can learn complex patterns through its deep structure and help in classifying benign and anomaly traffic.
Researchers in the context of NIDS widely utilize ML algorithms. For instance, Ali et al. proposed an IDS using a fast-learning network with the particle swarm algorithm [51]. Although efficient enough to predict most attacks, their model's performance in detecting the minority class label was not very promising. Similarly, Shen et al. proposed their methodology using the ensemble approach considering multiple extreme learning machines by utilizing the BAT optimization algorithm during the ensemble pruning stage [52]. In another notable work, a multilevel semi-supervised ML model is proposed by Yao et al. by combining the clustering concept with the Random Forest (RF) algorithm [53]. Their methodology performed well in detecting the attack classes due to multiple levels.
Researchers also adopt hybrid approaches that combine ML and DL methods to develop efficient NIDS solutions. In those methodologies, the DL methods are used for feature reduction and complexity reduction, followed by an ML predictor. For instance, a hybrid approach combining an autoencoder (AE) with RF is adopted by Shone et al., utilizing only the encoder part of the AE [54]. Their nonsymmetric solution performed well in detecting anomalies, except for a few labels with fewer instances. Similarly, another hybrid idea is given by Yan et al., combining a sparse AE with a support vector machine (SVM) [55]. This methodology also struggled to detect the minority anomaly labels. Another hybrid approach is proposed by Marir et al., using an ensemble approach with voting to combine the deep belief network (DBN) with SVM [56].
Researchers have also used standalone DL algorithms [57,58] such as AE, recurrent neural network (RNN), DBN, convolutional neural network (CNN), Morlet wavelet neural network (MWNN) [59], etc., to propose efficient NIDS models. For instance, an RNN-based NIDS is proposed by Xu et al., utilizing Gated Recurrent Units (GRUs) as the memory unit [60]. Similarly, a CNN-based solution is presented by Xiao et al., using principal component analysis and AE for feature extraction followed by a CNN for prediction [61]. Their proposed methodology only performed well for the class labels with a larger number of instances. Another highly complex NIDS solution is provided by Jiang et al., combining the CNN with the bidirectional Long Short-Term Memory (LSTM) [62]. Wei et al. provided a complex solution based on the combination of different optimization algorithms, such as particle swarm, fish swarm, and genetic algorithms, with the DBN.
Many researchers have also proposed NIDS solutions using the DNN approach. For instance, an efficient DNN-based NIDS consisting of four hidden layers is proposed by Jia et al. [63]. The model achieved high performance in classifying the KDD Cup'99 and NSL-KDD datasets. However, it did not perform efficiently in detecting User to Root (U2R) attack instances. Another notable work is that of Wang, who studied a DNN-based IDS in an adversarial setting, examining the role of each feature in producing adversarial examples [64]. Similarly, a hybrid scalable DNN framework is proposed by Vinayakumar et al. for host and network-level intrusion detection, implemented on the Apache Spark cluster computing platform [65]. They evaluated their methodology on many new and old datasets to show the superiority of their proposed solution.
Based on the analysis of the related literature, it is observed that most of the proposed solutions struggled to detect the minority class labels efficiently. This is because DL methods require a considerable amount of data for training. When a dataset contains very few samples for a certain class, the DL algorithm cannot learn enough complex patterns and produces incorrect predictions for those labels. Moreover, for an IoT network, the research on DL-based IDS is still in its early days, and there is plenty of room for more research in this domain. To this end, we propose a DNN-based NIDS solution for an IoT network. Notably, we examine the importance of the features for the performance of DL methods in an IoT network.

Proposed Methodology
This section details the essential concepts and methodology adopted for implementing and evaluating the DL-based anomaly detection solutions.

Deep Neural Network (DNN)
DNN belongs to the family of supervised learning algorithms and trains a model using multiple layers. The DNN adopted in this study is based on the feed-forward artificial neural network with multiple hidden layers, which enhance feature abstraction for increased capability [66]. The DNN structure consists of an input layer, multiple hidden layers, and an output layer, as shown in Figure 1. Let $X = \{x_1, x_2, \cdots, x_n\}$ be the input vector with $n = 86$ features. Similarly, $Y = \{y_1, y_2\}$ is the output vector containing the probability values in the range $[0, 1]$ for classifying anomaly and benign traffic. The output of each hidden layer $H_i$ is given as:

$$H_i = A(w_i H_{i-1} + b_i), \quad H_0 = X,$$

where $A(\cdot)$ represents the nonlinear activation function, and $w_i$ and $b_i$ represent the weight and bias of hidden layer $i$. The activation functions used in this study are 'ReLU' for the hidden layers and 'sigmoid' for the output layer, calculated as:

$$\mathrm{ReLU}(x) = \max(0, x), \qquad \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}.$$
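The layer computation described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`dense`, `forward`, `random_layer`), the tiny layer sizes, and the random weights are all assumptions chosen for illustration only.

```python
import math
import random


def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid activation: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One layer: activation(w . h_prev + b), computed neuron by neuron."""
    return [
        activation(sum(w * h for w, h in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, layers):
    """Feed the input through every layer in order."""
    h = x
    for weights, biases, activation in layers:
        h = dense(h, weights, biases, activation)
    return h


def random_layer(n_in, n_out, activation, rng):
    """Hypothetical helper: a layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases, activation


rng = random.Random(0)
layers = [
    random_layer(4, 3, relu, rng),     # hidden layer with ReLU (toy size)
    random_layer(3, 2, sigmoid, rng),  # output layer with two sigmoid outputs
]
y = forward([0.2, -1.0, 0.5, 0.0], layers)
print(y)  # two values, each strictly between 0 and 1
```

In practice such a network would be built and trained with a DL framework; the sketch only shows how the hidden-layer equation and the two activation functions fit together.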

The DNN structure used in this study consists of an input layer with 80, 32, 16, or 8 neurons, matching the size of the numerical feature set under test. It is followed by four dense layers with $2^{10}$, $2^{9}$, $2^{8}$, and $2^{7}$ neurons and a sigmoid classification layer with two outputs representing the anomaly and benign traffic classes. Similarly, for the experiment considering both the numerical and categorical features, the input layer has only five neurons, followed by two dense layers with $2^{8}$ and $2^{7}$ neurons and an output layer with a sigmoid activation function that classifies the network traffic as benign or anomaly.
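To make the effect of the input width on model size concrete, the following sketch counts the trainable parameters of the fully connected stacks described above. It assumes plain dense layers with one bias per neuron and no other trainable components; the function and variable names are illustrative and do not come from the paper.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(
        (n_in + 1) * n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


HIDDEN_NUMERICAL = [2**10, 2**9, 2**8, 2**7]  # four dense layers
HIDDEN_MIXED = [2**8, 2**7]                   # two dense layers
N_OUTPUTS = 2                                 # anomaly and benign

# Numerical-feature experiments: 80, 32, 16, or 8 input neurons.
counts = {
    n: dense_param_count([n] + HIDDEN_NUMERICAL + [N_OUTPUTS])
    for n in (80, 32, 16, 8)
}
# Categorical-numerical experiment: five input neurons.
mixed_count = dense_param_count([5] + HIDDEN_MIXED + [N_OUTPUTS])

for n, total in counts.items():
    print(f"{n:>2} inputs: {total} parameters")
print(f" 5 inputs: {mixed_count} parameters")
```

Under these assumptions, only the first layer shrinks when the input feature set is reduced, so most of the parameters remain in the hidden layers; the five-feature network is much smaller because it also uses fewer hidden layers.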

IoT Architecture
The IoT has revolutionized future networks in recent years, with the potential to improve the overall quality of life. It contains a vast network of interconnected internet-enabled devices, called IoT devices, equipped with a wide range of sensors, storage, computational, and communication capabilities. These devices generate a massive amount of critical data that is shared over the internet and needs to be secured.
A typical three-layer IoT architecture [67] is depicted in Figure 2. It consists of a perception layer, a network layer, and an application layer. The perception layer is the lowest layer, also called the physical layer. It involves the different devices, sensors, actuators, etc., that constantly gather information and then exchange it using different communication standards and protocols such as Bluetooth, Wi-Fi, ZigBee, and 6LoWPAN. The network layer, also called the transport layer, is responsible for the smooth transmission of the packets using different communication standards such as 4G, 5G, Wi-Fi, ZigBee, and IPv6. The final layer is the application layer, which processes the data for visualization in end users' applications, e.g., smart health monitoring. Some of the protocols used in this layer are the Constrained Application Protocol (CoAP) and the Data Distribution Service (DDS) [68].
The most suitable position to deploy the NIDS in the three-layered IoT architecture is at the entry points of the network, e.g., an edge router as shown in Figure 2, to provide the needed protection against anomalies. In this study, the considered NIDS is a two-stage solution comprising the data capturing and preparation stage (Stage-1 in Figure 2), followed by the DL-based anomaly detection stage (Stage-2 in Figure 2). In Stage-1, the data is intercepted and collected either within the network, such as from IoT devices, or from outside the IoT network through the internet. Useful features are then extracted, and the data is prepared for the DL stage. In Stage-2, the prepared data is processed by the DL-based anomaly detection model to detect anomaly traffic and protect the IoT network.

Methodology
This study considered a two-stage IDS solution to protect the IoT network from possible intrusions, as depicted in Figure 3. The stages of the considered model are (1) the Data Capturing and Preparation stage and (2) the Deep Neural Network-based Anomaly Detection stage. The steps followed to implement and evaluate the DL models are:
Step-1: The IoT network traffic is intercepted using network sniffing tools. For this purpose, openly available tools such as tcpdump and Wireshark can be used [69]. The main task is to capture and then analyze the network packets by examination and visualization [70].
Step-2: The features are extracted from the network packets and stored in a dataset.
Step-3: The extracted features are then preprocessed to remove the redundant flows, normalize the continuous features, and encode the categorical features using one-hot encoding.
Step-4: The dataset is labeled with Benign and Anomaly records to prepare for the binary classification scenario.
Step-5: The dataset is then split into a 75% Train dataset and a 25% Test dataset.
Step-6: The DNN is then trained on the Train dataset with the Benign and Anomaly labels as the target for binary classification. This step results in a trained DNN model.
Step-7: The trained model is then tested using the Test dataset to predict the records as either Benign or Anomaly flows. If Benign traffic is predicted, it is allowed to pass through without any action. If an Anomaly is predicted, an alarm signal is sent to the network administrator to take further action.
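Steps 3 to 5 above can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the text does not name the normalization method, so min-max scaling is assumed; the column names (`flow_duration`, `pkt_count`, `protocol`, `label`) and the toy records are hypothetical.

```python
import random


def preprocess(rows, numeric_cols, categorical_cols, label_col):
    """Deduplicate flows, scale numeric columns, one-hot encode, and label."""
    # Step-3a: drop exact duplicate flows, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # Step-3b: min-max bounds per numeric column (assumed normalization).
    bounds = {
        c: (min(r[c] for r in unique), max(r[c] for r in unique))
        for c in numeric_cols
    }
    # Step-3c: fixed category order per categorical column for one-hot encoding.
    categories = {c: sorted({r[c] for r in unique}) for c in categorical_cols}

    features, labels = [], []
    for r in unique:
        vec = []
        for c in numeric_cols:
            lo, hi = bounds[c]
            vec.append((r[c] - lo) / (hi - lo) if hi > lo else 0.0)
        for c in categorical_cols:
            vec.extend(1.0 if r[c] == v else 0.0 for v in categories[c])
        features.append(vec)
        # Step-4: binary label, 0 for Benign and 1 for Anomaly.
        labels.append(0 if r[label_col] == "Benign" else 1)
    return features, labels


def train_test_split(features, labels, test_fraction=0.25, seed=0):
    """Step-5: shuffled 75%/25% split."""
    idx = list(range(len(features)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(len(idx) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    def pick(items, ids):
        return [items[i] for i in ids]

    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Hypothetical toy flows; the second record duplicates the first.
rows = [
    {"flow_duration": 10.0, "pkt_count": 2, "protocol": "tcp", "label": "Benign"},
    {"flow_duration": 10.0, "pkt_count": 2, "protocol": "tcp", "label": "Benign"},
    {"flow_duration": 30.0, "pkt_count": 8, "protocol": "udp", "label": "Anomaly"},
    {"flow_duration": 20.0, "pkt_count": 5, "protocol": "tcp", "label": "Anomaly"},
    {"flow_duration": 50.0, "pkt_count": 4, "protocol": "udp", "label": "Benign"},
]
X, y = preprocess(rows, ["flow_duration", "pkt_count"], ["protocol"], "label")
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

The sketch follows the step order given in the text, where preprocessing precedes the split. A common alternative is to compute the scaling bounds on the Train dataset only, so that no information from the Test dataset reaches the model.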

Experimental Results and Analysis
This section provides the details about the dataset and the evaluation metrics considered for evaluation purposes, followed by the experimental setup and the discussion of the results.

Dataset Description
For evaluating the performance of the DL methodologies considered in this study, we used the publicly available IoT-Botnet 2020 dataset [71]. This dataset is available in comma-separated values (CSV) format and is derived from the pcap files of the BoT-IoT dataset [72] by generating many more network and flow-based attributes. A detailed description of the number of records for the Benign and Anomaly labels in the original dataset and in this study is given in Table 1. The original dataset contains samples of different types of attacks, such as Denial of Service, Distributed Denial of Service, Reconnaissance, and information theft attacks. We selected the benign samples from the original dataset, while for the anomaly class, we considered random samples from each anomaly class for fair model evaluation. The original dataset contains a total of 85 features of different data types such as integer, float, and categorical, as detailed in Table 2. Among those 85 features, we perform experiments to evaluate the performance of different DL algorithms, considering the 80, 32, 16, and 8 best numerical features and the 5 best categorical-numerical features, ranked using the mutual information (MI).
The MI is an important concept in information theory that gives the average reduction in uncertainty about one random variable given knowledge of another. Mathematically, MI is given as [73]:

$$I(U; V) = \sum_{u \in U} \sum_{v \in V} p(u, v) \log \frac{p(u, v)}{p(u)\, p(v)},$$

where $I(U; V)$ is the MI between two discrete random variables $U$ and $V$, such that $U = \{u_1, u_2, \ldots, u_k\}$ and $V = \{v_1, v_2, \ldots, v_k\}$ with $k$ samples each. $p(u, v)$ represents the joint probability mass function, while $p(u)$ and $p(v)$ represent the marginal probabilities.
In the context of feature selection from a dataset, the relevant features contain useful information about a particular class. The MI is chosen in this study for finding the relevant features due to its ability to quantify the amount of information shared between a feature and the class. Figure 4 shows the best numerical features selected based on the MI value, arranged in descending order. We selected the top 80, 32, 16, and 8 features based on the highest MI scores to find the optimal set of useful and important features that can be used to train the DL model with a negligible loss in detection accuracy.
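The MI-based ranking can be sketched for discrete variables as follows. This is a minimal illustration of the standard MI definition using empirical frequencies, not the estimator used in the study, which the text does not specify; the feature names and toy values are hypothetical.

```python
import math
from collections import Counter


def mutual_information(u, v):
    """Empirical I(U;V) = sum p(u,v) * log(p(u,v) / (p(u) p(v))), in nats."""
    n = len(u)
    joint = Counter(zip(u, v))
    pu, pv = Counter(u), Counter(v)
    return sum(
        (c / n) * math.log((c / n) / ((pu[a] / n) * (pv[b] / n)))
        for (a, b), c in joint.items()
    )


def top_k_features(columns, labels, k):
    """Rank features by MI with the class label and keep the k best."""
    scores = {
        name: mutual_information(values, labels)
        for name, values in columns.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: one informative feature, two uninformative ones.
labels = [0, 0, 1, 1]
columns = {
    "perfect": [0, 0, 1, 1],   # identical to the label
    "noise": [0, 1, 0, 1],     # independent of the label
    "constant": [5, 5, 5, 5],  # carries no information
}
selected, scores = top_k_features(columns, labels, k=1)
print(selected, scores)
```

Continuous features would first need to be discretized, or a dedicated estimator such as scikit-learn's `mutual_info_classif` could be used instead; either choice would be an assumption beyond what the text states.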

Evaluation Metrics
The performance evaluation of the considered DL models is performed using the Accuracy, Precision, Recall, F1-Score, False Alarm Rate, True Negative Rate, and False Negative Rate. These evaluation metrics are based on the entries of the confusion matrix shown in Table 3. The TP and TN instances in the confusion matrix represent the correct predictions of Anomaly and Benign instances, respectively. Similarly, the FN and FP instances are the incorrect predictions of a classifier as Benign and Anomaly instances, respectively. The different evaluation metrics considered in this study are detailed in [74][75][76].

Accuracy: It is the ratio of the correctly classified records to the total number of records.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision: It is the ratio of correctly predicted Anomaly instances to all the instances predicted as Anomaly.

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall: It is the ratio of all the correctly predicted Anomaly instances to all the actual Anomaly instances.

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

F1 Score: It is the harmonic mean of the Precision and Recall, summarizing both in a single score.

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

False alarm rate (FAR): It is the ratio of wrongly predicted Anomaly instances to all the Benign instances.

$$\mathrm{FAR} = \frac{FP}{FP + TN}$$

True-negative rate (TNR): It is the ratio of the correctly predicted Benign instances to all the instances that are Benign.

$$\mathrm{TNR} = \frac{TN}{TN + FP}$$

False-negative rate (FNR): It denotes the miss rate and shows the possibility of the classifier missing the anomaly instances. It is the ratio of wrongly predicted Benign instances to all the actual Anomaly instances.

$$\mathrm{FNR} = \frac{FN}{FN + TP}$$
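As a worked illustration of the metric definitions above, the following sketch computes all seven metrics from the four confusion-matrix counts. The counts in the example are hypothetical and are not taken from the paper's results.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Metrics from confusion-matrix counts, with Anomaly as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "far": fp / (fp + tn),  # false alarm rate
        "tnr": tn / (tn + fp),  # true-negative rate
        "fnr": fn / (fn + tp),  # false-negative (miss) rate
    }


# Hypothetical counts: 100 actual Anomaly flows and 100 actual Benign flows.
m = evaluation_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Note that FAR and TNR sum to one, as do FNR and Recall, so reporting one of each pair determines the other.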

Experimental Setup
To implement and evaluate our proposed methodology on the IoT-Botnet 2020 dataset, we performed the experiments on an HP laptop running the Windows 10 operating system, with 8 GB of RAM, an Intel Core i7-8550U processor, and an NVIDIA GeForce MX150 GPU. Python (version 3.6.9) is used as the primary tool for implementation and evaluation in a Google Colab environment, with the GPU selected as the hardware accelerator [77][78][79].

Results and Discussion
For implementing the different DL-based IDS methodologies in this study, we selected a batch size of $2^7$, a learning rate of 0.01, the Adam optimizer, and binary cross-entropy as the loss function. ReLU and sigmoid are the activation functions used for the DL approaches. Table 4 summarizes the results, in percent, for the performance evaluation metrics considered in this study. The proposed DNN-based methodology is compared with four supervised DL algorithms: the 1-dimensional Convolutional Neural Network (CNN-1D), the Recurrent Neural Network (RNN), and its variants, the Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM). The results confirm the superiority of the proposed DNN-based IDS solution for an IoT network over the other DL methodologies on the evaluation metrics. It is also observed that the CNN-1D model is the second-best model, while the LSTM is the worst in terms of the considered evaluation metrics.
The evaluation metric scores are also depicted in Figures 5 and 6. From Figure 5, DNN achieved the highest detection accuracy of 99.010% among the DL algorithms, while LSTM showed the lowest performance with an accuracy score of 96.41%. The DNN model also correctly predicted the anomalies with 99.304% correctness. The proposed DNN-based methodology also showed a higher F1 score than the other DL models on the IoT-Botnet 2020 dataset. In addition, the DNN model exhibited the highest score, 96.08%, for correctly predicting Benign flows compared to the other considered DL methodologies.
Figure 6 plots the performance of the different DL-based NIDS methodologies in terms of FAR and FNR. As discussed earlier, the main problem exhibited by current NIDS is the high FAR. To this end, it is observed that the proposed method exhibited a very low FAR and FNR compared to the other methodologies. The proposed scheme achieved a FAR of 3.9% while achieving a very high detection accuracy at the same time. The LSTM model is observed to have a high FAR of 11.9%, showing its inability to learn enough patterns to classify the network flows correctly.
Figure 7 exhibits the percentage improvement of the DNN-based NIDS over the other DL-based solutions. It is observed that DNN exhibited an improvement of 0.57-2.6% in the model's accuracy while at the same time reducing the FAR by 0.23-7.98%, showing its effectiveness. It is also observed that DNN showed a slight improvement over CNN-1D in terms of the evaluation metrics. Similarly, DNN performed considerably better than the LSTM model, exhibiting its superiority over the typical supervised DL algorithms.
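The upper ends of the improvement ranges reported for Figure 7 appear consistent with absolute percentage-point differences between the DNN and the LSTM. The short check below uses only the accuracy and FAR values quoted in the text; reading the improvements as simple differences is an assumption, and the dictionary names are illustrative.

```python
# Values quoted in the text, in percent.
dnn = {"accuracy": 99.01, "far": 3.9}
lstm = {"accuracy": 96.41, "far": 11.9}

# Percentage-point differences between the best and the worst model.
accuracy_gain = round(dnn["accuracy"] - lstm["accuracy"], 2)
far_reduction = round(lstm["far"] - dnn["far"], 2)

print(f"accuracy gain over LSTM: {accuracy_gain} points")
print(f"FAR reduction over LSTM: {far_reduction} points")
```

The accuracy difference matches the reported upper bound of 2.6%. The FAR difference computed from the rounded values in the text comes out slightly above the reported 7.98%, which is plausibly a rounding effect of the one-decimal FAR figures.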
Table 5 details the analysis of the individual labels (Benign and Anomaly) in terms of percentage precision, recall, and F1-score. We observed that all of the methodologies exhibited a very high detection rate for the anomaly flows, with DNN showing the best score of 99.95%. On the other hand, the detection rate for benign traffic is lower by 3.87-10.99%, with DNN still performing better than the other algorithms with a score of 96.085%. We also observed that the LSTM model performed poorly in detecting benign flows, with a degradation in the detection rate of almost 11%. We believe that the imbalanced nature of the dataset, with anomaly records almost 3.2 times more numerous than the benign records, contributed to the degradation of the detection rate for the benign labels. Increasing the number of records for the benign label may improve its detection rate.
Figure 8 shows the performance of the different DL-based IDS methodologies for the different numbers of features selected based on the MI scores, as depicted in Figure 4. We observed an almost negligible degradation in accuracy for the DNN model for the feature sets of 80, 32, and 16. The model exhibited an almost 1% decrease in detection accuracy for the 8-feature set. The other DL algorithms, except for LSTM, exhibited similar behavior across the different feature sets. For LSTM, we observe an improvement of 0.7-0.8% in accuracy with 32 and 16 features. Since the number of features contributes to the complexity of the model, and the performance remains almost the same with 32 and 16 features for the majority of DL algorithms, the model can be trained using only the best 16 to 32 features to reduce its complexity.
Figure 9 shows the confusion matrices of the proposed DNN methodology obtained for the different feature sets. Anomalies were detected more accurately with the 80, 32, and 16 feature sets than with the 8-feature set. At the same time, the model made more incorrect predictions for benign traffic, with the 8-feature set performing the worst. Furthermore, the DNN with 16 features made fewer incorrect predictions for benign labels and more incorrect predictions for anomaly labels than with the 32-feature set. The feature sets above were selected using MI scores calculated on the integer and float features only.
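The quantities read from such a confusion matrix can be sketched as below. The sketch assumes that the anomaly class is the positive class and that the false alarm rate is defined as FP / (FP + TN); this section does not state the definition explicitly. Under that assumption, FAR equals one minus the benign recall, which agrees with the benign detection rate of 96.085% and the FAR of 3.9% reported for the DNN. Function names and toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) with `positive` as the anomaly class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1  # benign flow flagged as anomaly (false alarm)
        else:
            if t == positive:
                fn += 1  # missed anomaly
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy_and_far(tp, fp, tn, fn):
    """Accuracy and false alarm rate, both as fractions in [0, 1]."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    far = fp / (fp + tn) if fp + tn else 0.0
    return accuracy, far


# Toy labels (hypothetical): 1 = anomaly, 0 = benign
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
accuracy, far = accuracy_and_far(*counts)
print(counts, accuracy, far)
```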
To check the role of the categorical features of the dataset in the detection accuracy of the DNN, we calculated the MI scores of all of the features of the dataset. For that purpose, the categorical features were first converted to integer and binary data types, which increased the number of features. Figure 10 depicts the features, including the categorical ones, arranged in descending order of MI score.
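The conversion of categorical features to integer and binary types can be illustrated with the following minimal sketch. The column name `proto` and its values are hypothetical, and the exact encoding used in the study is not specified here; the sketch only shows why a binary (one-hot) encoding increases the number of features.

```python
def integer_encode(values):
    """Map each category to an integer code, following sorted category order."""
    codes = {category: i for i, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def one_hot_encode(values, prefix):
    """Expand one categorical column into one binary column per category."""
    return {
        f"{prefix}_{category}": [1 if v == category else 0 for v in values]
        for category in sorted(set(values))
    }


# Hypothetical categorical column with three distinct values
proto = ["tcp", "udp", "tcp", "icmp"]

proto_codes = integer_encode(proto)
proto_binary = one_hot_encode(proto, "proto")

print(proto_codes)
print(proto_binary)
```

One categorical column with three distinct values becomes three binary columns, each of which can then be scored by MI like any numerical feature.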
For this experiment, we selected all the features depicted in Figures 4 and 10 with an MI score of at least 0.5, as given in Table 6. The experiment was then repeated with these top five features to test the efficiency of all the considered DL methodologies. We observed an improvement of 0.99-3.45% in detection accuracy for the DL-based NIDS compared with the implementation using only integer and float type features, as depicted in Figure 11. The result shows the importance of using the categorical features to improve the accuracy of a DL-based NIDS in an IoT environment.
Figure 11. Comparison of accuracy scores considering different features of different data types.
The confusion matrices of the different DL-based NIDS for IoT, considering the features given in Table 6, are shown in Figure 12. The categorical features improved the detection performance of all of the DL methodologies. The DNN and CNN-1D detected the anomaly and benign samples with almost 100% accuracy. The GRU, RNN, and LSTM all improved their detection performance compared with the results obtained using only the numerical feature set, although LSTM made more incorrect predictions than the other DL-based methodologies. We also observed that the DNN achieved this result with just two hidden layers, which keeps the model complexity low.
A 100% accuracy for the binary classification stage is also achieved by the authors of [68] in their proposed solution. Our work differs from theirs in the following ways. Firstly, they adopted an ML approach, while we used a DL approach for this study. Secondly, they achieved 100% accuracy using 20 features, while we needed only five. The DL approach is more suitable for an IoT network, as it has shown its superiority in processing huge amounts of data to make efficient and correct predictions.
The present study provides a detailed comparative analysis of different DL-based NIDS for an IoT network, with the MI score as the feature selection criterion. First, only the numerical features were ranked in descending order of MI score, and the DL models were evaluated on the 80, 32, 16, and 8 feature sets. For these scenarios, DNN achieved the highest detection accuracy of 99.01% compared to the other DL methodologies. We then repeated the experiments with both the numerical and categorical features, considering only those features whose MI score was ≥ 0.5, which resulted in only five features. The experimental results showed a significant improvement in detection accuracy, with the DNN achieving 100%.
The present study considered only binary classification, detecting anomalies in general. The considered model is therefore not able to identify the exact nature of an anomaly, which is required to design an intrusion prevention mechanism. In addition, the present study was evaluated only on the IoT-Botnet 2020 dataset. The proposed methodology needs to be evaluated on other IoT-based datasets to check its effectiveness for the considered feature set. Moreover, the evaluation was performed only in a simulation environment. It should be tested in a real-time environment to verify its efficiency, and its performance should be tested with a large number of IoT sensors.

Conclusions
This paper proposes an effective anomaly detection mechanism based on a deep neural network for the IoT network architecture that efficiently learns valuable complex patterns from IoT network flows to classify traffic as benign or anomalous. The proposed methodology is tested on the newly available IoT-Botnet 2020 dataset. The experimental results demonstrated the proposed model's superiority over the other DL methods, with a detection accuracy of 99.01% and a false alarm rate of 3.9%, an improvement of 0.57-2.6% in accuracy and a reduction of 0.23-7.98% in FAR. The model showed a detection rate of 99.9% for anomalies and a detection rate lower by 3.8% for benign traffic, probably due to the imbalanced nature of the dataset. The results also show that using the best 16-32 numerical features ranked by MI is a reasonable choice to reduce the model complexity with an almost negligible effect on its performance. In addition, the inclusion of the categorical features further improves detection accuracy while utilizing only the top five features.
For future research, we will extend this work by implementing our solution for multiclass classification scenarios to identify the exact nature of the detected anomaly. We will also test the proposed model's effectiveness in a real-time IoT environment.