5.2. Evaluation Metrics
The current work evaluates the performance of machine learning and deep learning models using the following statistical metrics.
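Accuracy: the sum of TP and TN divided by the sum of TP, TN, FP, and FN.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision: TP divided by the sum of TP and FP.
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall or true positive rate (TPR): TP divided by the sum of TP and FN.
$$\text{Recall} = \text{TPR} = \frac{TP}{TP + FN}$$
F1-Score: the harmonic mean of Precision and Recall.
$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$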
False Positive Rate (FPR): the number of normal samples incorrectly classified as attack (FP) divided by the sum of FP and the number of normal samples correctly classified as normal (TN).
$$\text{FPR} = \frac{FP}{FP + TN}$$
The terms TP, TN, FP, and FN are taken from a confusion matrix. A confusion matrix is a table of predicted versus actual values; its dimension is n × n, where n is the number of classes in the dataset.
- True Positive (TP): a sample belonging to the Attack class is correctly predicted as Attack by the model.
- False Positive (FP): a sample belonging to the Normal class is incorrectly predicted as Attack by the model.
- True Negative (TN): a sample belonging to the Normal class is correctly predicted as Normal by the model.
- False Negative (FN): a sample belonging to the Attack class is incorrectly predicted as Normal by the model.
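The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used in the experiments: the function name `binary_metrics`, the choice to return 0.0 when a denominator is empty, and the example counts are all assumptions introduced here, with Attack treated as the positive class.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute the metrics of this section from confusion-matrix counts.

    Attack is the positive class: FP counts Normal samples predicted as
    Attack, and FN counts Attack samples predicted as Normal.
    """
    def ratio(num, den):
        # Assumption of this sketch: report 0.0 instead of failing on 0/0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also the true positive rate (TPR)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),
    }


# Illustrative counts only; they are not taken from the paper's results.
example = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print(example["accuracy"], example["fpr"])
```

With these made-up counts, 170 of 200 samples are classified correctly and 20 of 100 normal samples are flagged as attacks.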
Area Under Curve (AUC): the Receiver Operating Characteristic (ROC) curve shows the performance of a classification model at all classification thresholds. The ROC curve plots the true positive rate against the false positive rate; lowering the classification threshold increases both the false positives and the true positives. The AUC is the area under the ROC curve and provides an aggregate measure of performance across all possible classification thresholds.
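To make the relation between thresholds, the ROC curve, and the AUC concrete, the following is a small sketch under stated assumptions: labels are 0 (Normal) or 1 (Attack), both classes are present, and the area is approximated with the trapezoidal rule. The helper names `roc_points` and `auc` and the toy scores are hypothetical and not part of the original experiments.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold.

    Assumes binary labels (1 = Attack, 0 = Normal) with both classes present.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under a curve given as (x, y) points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(labels, scores)
print(curve)
print(auc(curve))  # 0.75
```

Each step down in threshold can only add flagged samples, so both coordinates are non-decreasing along the curve, which is what the trapezoidal sum relies on.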
        
As mentioned earlier, the DS1 dataset of simulated IoT traffic from the real SDN testbed is used for the model performance experiments. DS1 is processed to remove features that identify network devices, such as MAC and IP addresses; the last_seen timestamp is also removed because we estimate it to have low value for classifying network attacks. No further feature selection is performed because our objective is to employ deep learning models. We selected 25 features in DS1 for our experiments. The attack category feature in DS1 is labeled 0: Normal, 1: DoS, 2: DDoS, 3: Port Scanning, 4: OS Fingerprinting, and 5: Fuzzing. DS1 is split into training and testing sets for performance evaluation. The supervised SVM model with an RBF kernel is chosen because SVM is used extensively for intrusion detection in traditional networks [43]. The RBF kernel is chosen over the linear kernel because, in our case, the training and testing times of the SVM with the RBF kernel are much lower than with the linear kernel, while no significant performance difference is observed between the two kernels. Another reason for selecting the RBF kernel over the linear kernel is that the number of features is smaller than the sample size in our dataset. The SVM regularization parameter “C” and kernel coefficient “gamma” are left at their default values, 1 and “scale”, respectively, because varying these parameters had no notable impact on performance.
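The preprocessing and labeling described above can be sketched as follows. The column names in `DROPPED_FEATURES`, the `attack_category` key, and the rule that every non-zero label counts as an attack in the binary task are assumptions made for illustration; the text only states that MAC addresses, IP addresses, and last_seen are removed and lists the six class labels.

```python
# Class labels as listed in the text.
ATTACK_LABELS = {
    0: "Normal",
    1: "DoS",
    2: "DDoS",
    3: "Port Scanning",
    4: "OS Fingerprinting",
    5: "Fuzzing",
}

# Hypothetical column names standing in for the removed identifier features.
DROPPED_FEATURES = {"src_mac", "dst_mac", "src_ip", "dst_ip", "last_seen"}


def prepare_record(record, label_key="attack_category"):
    """Split one record into features, a binary label, and a multiclass label."""
    features = {
        name: value
        for name, value in record.items()
        if name not in DROPPED_FEATURES and name != label_key
    }
    multiclass_label = record[label_key]
    # Assumption: binary task is Normal (0) versus any attack class (1).
    binary_label = 0 if multiclass_label == 0 else 1
    return features, binary_label, multiclass_label


# Made-up record for illustration only.
sample = {"src_ip": "10.0.0.5", "last_seen": 1600000000,
          "pkt_count": 42, "attack_category": 3}
features, is_attack, category = prepare_record(sample)
print(features, is_attack, ATTACK_LABELS[category])
```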
Simple feed-forward DNN, CNN, and LSTM networks with one to four hidden layers are chosen to evaluate the performance of DL architectures for attack detection in IoT networks. The DNN contains a dense hidden layer with 1024 neurons followed by a dropout layer that drops 1 in 100 output neurons (a dropout rate of 0.01). As the number of hidden layers in the DNN increases, the number of neurons is reduced by 256 in each successive layer. A commonly used activation function is applied in the DNN models. The loss function “loss_entropy” is chosen under the maximum likelihood inference framework: the “binary_entropy” and “categorical_entropy” parameters are selected for the binary and multiclass classification experiments, respectively. The same optimizer is used for all the models.
The CNN model comprises a convolution layer followed by a max-pooling layer that downsamples the input. The output is then flattened, which does not affect the batch size. Subsequently, a dense layer with an activation function is applied. Finally, the binary or multiclass classification is produced by another dense layer whose activation function depends on the task. We use only one hidden layer for the CNN in our experiments, as CNNs are designed for, and work best on, image classification.
The LSTM architectures comprise LSTM layers with output sizes varying from 4 to 32 as the number of hidden layers increases from 1 to 4. Each LSTM layer is followed by a dropout layer that drops 10 in 100 input values (a dropout rate of 0.1). Finally, a dense layer with 1 or 6 output values, for binary or multiclass evaluation respectively, applies the corresponding activation function to obtain the final prediction.
The batch size is 64 for the DNN architectures in both binary and multiclass classification, and 32 for the selected LSTM architectures. After running experiments with varying numbers of epochs, we fixed the number of epochs at 200 for all DL models and architectures.
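As a rough illustration of the DNN settings described above, the sketch below collects them into a plain configuration dictionary. It reads "reduced by 256 in each layer" as hidden widths of 1024, 768, 512, and 256, which is one interpretation rather than a confirmed detail. The function names are hypothetical, and the loss names are copied verbatim from the text, so they may not match the identifier used by any particular library.

```python
def dnn_hidden_units(n_hidden_layers, first_width=1024, step=256):
    """Hidden-layer widths under the assumed reading: 1024, 768, 512, 256."""
    if not 1 <= n_hidden_layers <= 4:
        raise ValueError("the experiments use between 1 and 4 hidden layers")
    return [first_width - i * step for i in range(n_hidden_layers)]


def dnn_config(n_hidden_layers, task):
    """Collect the DNN settings stated in the text for one experiment.

    `task` is either "binary" or "multiclass".
    """
    if task not in ("binary", "multiclass"):
        raise ValueError("task must be 'binary' or 'multiclass'")
    return {
        "name": f"DNN{n_hidden_layers}",
        "hidden_units": dnn_hidden_units(n_hidden_layers),
        "dropout_rate": 0.01,  # 1 in 100 output neurons dropped
        # Loss names quoted from the text, not verified library identifiers.
        "loss": "binary_entropy" if task == "binary" else "categorical_entropy",
        "batch_size": 64,
        "epochs": 200,
    }


print(dnn_config(4, "multiclass"))
```

Keeping the settings in one dictionary makes it straightforward to loop over DNN1 to DNN4 for both tasks when building models in a framework of choice.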
The training accuracy and loss graphs of the DNN, CNN, and LSTM models are shown in Figure 3a, Figure 4a, and Figure 5a for binary classification and in Figure 3b, Figure 4b, and Figure 5b for multiclass classification. The DNN and LSTM models are named in this paper according to the number of layers in the model. DNN1 has one dense layer, whereas DNN4 includes four dense layers. Similarly, LSTM1 has one LSTM layer, and LSTM4 contains four LSTM layers.
Figure 3a shows that DNN2, DNN3, and DNN4 achieved better training accuracy than DNN1. As the number of layers in the DNN increases, the accuracy of the models also increases. Nevertheless, there is no significant increase in accuracy from DNN3 to DNN4; thus, we decided to stop at the DNN4 model. In addition, most of the DNN models reach a training accuracy in the range of 95 to 97% within 75 epochs, after which there is no significant increase. To help interpret model performance, the model loss plots are also included in the figures.
The same DNN architectures were used to run experiments on the multiclass dataset and classify the attacks into different categories. As seen in Figure 3, the DNN obtained almost the same accuracy, which indicates that the DNN model is robust in handling multiclass data.
Figure 4a shows that the CNN achieved a training accuracy of 94 to 96% within 80 epochs and settled at 97% by the end of 200 epochs. Figure 4a also indicates that the CNN training accuracy is comparable to that of DNN4 for binary classification. Based on Figure 4a,b, we observe a significant difference in training accuracy between binary and multiclass classification for the CNN model. Based on the model loss values for multiclass classification, we have not considered adding more layers to the CNN.
As shown in Figure 5a, the LSTM model improved the training accuracy noticeably, from 95 to 97%, after the 200-epoch run for binary classification of the dataset. The same trend is followed in the multiclass classification case, as seen in Figure 5b, except for LSTM1. It is evident that LSTM models with more than two hidden layers achieved good training accuracy in both the binary and multiclass cases. Overall, based on Figure 3, Figure 4, and Figure 5, we can also conclude that the tested models trained for 200 epochs do not suffer from underfitting or overfitting. It is also evident that the proposed four-layer LSTM performed well over the 200-epoch run for attack detection in the IoT network.
The comparative accuracy of the SVM and the DL model architectures is shown in Figure 6 for the binary and multiclass classification of attacks in IoT networks. These plots confirm that the DL architectures outperformed the supervised learning SVM, except in the LSTM1 multiclass case. The DNN4 model outperformed the SVM because the deep neural network learns from the input dataset and adjusts its weights accordingly to classify attacks and attack types effectively. On the other hand, the SVM divides the data space and creates a decision boundary to classify the attacks when it is trained. Since the SVM cannot adapt its learning capability, it is more likely to misclassify the attacks. The lack of hidden layers in LSTM1 has reduced the accuracy of the LSTM1 multiclass classification model.
Table 4 shows the precision, recall, and F1-Score, including both macro and weighted average scores, of all the tested ML and DL models for both binary and multiclass classification. The macro average of a given metric is obtained by computing the metric for each label and then averaging the results without considering the proportion of each label in the dataset. The weighted average of a given metric is obtained by computing the metric for each label and then averaging the results weighted by the proportion of each label in the dataset.
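The difference between the two averages can be shown with a short sketch. The helper names and the per-class scores and supports below are made up for illustration and do not come from Table 4.

```python
def macro_average(per_class_scores):
    """Unweighted mean of a per-class metric: every label counts equally."""
    return sum(per_class_scores.values()) / len(per_class_scores)


def weighted_average(per_class_scores, support):
    """Mean of a per-class metric weighted by each label's sample count."""
    total = sum(support[label] for label in per_class_scores)
    return sum(per_class_scores[label] * support[label]
               for label in per_class_scores) / total


# Illustrative values only: a frequent class scored 0.9, a rare one 0.5.
precision_by_class = {0: 0.9, 1: 0.5}
samples_by_class = {0: 90, 1: 10}

print(macro_average(precision_by_class))
print(weighted_average(precision_by_class, samples_by_class))
```

In this toy case the macro average is about 0.70 while the weighted average is about 0.86, because the weighted form lets the frequent class dominate. On an imbalanced dataset, a large gap between the two usually points to weak performance on a minority class.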
Table 4 reports that LSTM4 obtained the best precision in attack detection (binary classification), whereas CNN-LSTM is the worst-performing model, with a precision of 0.76. The best-performing DNN model and LSTM3 performed equally well for binary classification. However, our proposed LSTM4 model performs better than the best-performing DNN model in binary classification.
The multiclass classification results of the SVM and the best-performing proposed model are shown as confusion matrices in Table 5 and Table 6. Table 5 and Table 6 show that the confusion matrix of the proposed LSTM4 approach is closer than that of the SVM model to a diagonal matrix, which represents the ideal classification scenario. In the LSTM4 model, the small but notable share of misclassified data consists of normal traffic with label 0 classified as port scanning with label 3 (346) and OS fingerprinting with label 4 (282). These false positives do not need any tuning effort in a real-time network intrusion detection system, as source IP whitelisting is not a viable option. Network scanning attempts are usually ignored because of the high volume of these events, unless they impact or disrupt services as denial-of-service attacks do.
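A multiclass confusion matrix of the kind discussed above can be built as in the following sketch. It assumes rows index the actual class and columns the predicted class, which may differ from the layout of Table 5 and Table 6. The function names and the toy labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count samples per (actual, predicted) pair; rows are actual classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def normal_false_positives(matrix, normal_label=0):
    """Normal traffic flagged as an attack, grouped by predicted attack class."""
    return {
        predicted: count
        for predicted, count in enumerate(matrix[normal_label])
        if predicted != normal_label and count > 0
    }


# Toy labels for illustration (0 = Normal, 3 = Port Scanning, 4 = OS Fingerprinting).
actual = [0, 0, 0, 3, 4, 1]
predicted = [0, 3, 4, 3, 4, 1]
cm = confusion_matrix(actual, predicted, n_classes=6)
print(cm[0])
print(normal_false_positives(cm))
```

A perfect classifier would leave every off-diagonal entry at zero, which is the diagonal-matrix ideal referred to in the text.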
The Receiver Operating Characteristic (ROC) curves for the proposed LSTM model architecture in binary and multiclass classification are shown in Figure 7 and Figure 8, respectively. The proposed approach achieved an AUC of 0.99 in classifying the network connection records as either attack or normal. In addition, the model achieved AUC values of 0.993 for Normal, 0.999 for DoS, 0.999 for DDoS, 0.998 for Port Scanning, 0.998 for OS Fingerprinting, and 0.99 for Fuzzing in classifying the SDN IoT attacks into different categories. Even in the multiclass setting, the proposed method achieved an AUC of at least 0.99 for every class. This indicates that the proposed method is robust and performs well in both binary and multiclass classification. Overall, Figure 7 and Figure 8 show that the performance is close to the ideal case, with the curves approaching the reverse “L” shape in both attack detection and attack type classification.