4.4. Evaluation Metrics
To evaluate the effectiveness of the proposed intrusion detection model, several standard performance metrics were employed, including accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (AUC-ROC). These metrics provide a comprehensive assessment of the classification performance of the proposed model.
Accuracy measures the overall proportion of correctly classified samples among the total number of samples. It is defined as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$ (true positive) represents correctly detected attack samples, $TN$ (true negative) denotes correctly identified normal samples, $FP$ (false positive) refers to normal samples incorrectly classified as attacks, and $FN$ (false negative) represents attack samples incorrectly classified as normal traffic.
Precision measures the proportion of correctly predicted attack samples among all predicted attack samples and is defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall, also known as detection rate, measures the proportion of actual attack samples that are correctly identified by the model:

$$\text{Recall} = \frac{TP}{TP + FN}$$

The F1 score represents the harmonic mean of precision and recall and provides a balanced measure of the model’s performance:

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
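As an illustration, the four metrics defined above can be computed directly from the confusion counts. The following is a minimal sketch, not the evaluation code used in this work; the function name `binary_metrics` and the toy counts are assumptions introduced only for this example.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 from confusion counts.

    Hypothetical helper for illustration; a metric whose denominator is
    zero is reported as 0.0 instead of raising an error.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy counts (not taken from any dataset in this paper):
# 100 attack samples, 900 normal samples.
metrics = binary_metrics(tp=90, tn=880, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these toy counts, accuracy is high (0.97) even though about 18% of the predicted attacks are false alarms, which is why precision, recall, and F1 are reported alongside accuracy.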
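In addition, the area under the receiver operating characteristic curve (AUC-ROC) is used to evaluate the ability of the model to distinguish between different classes across various threshold values. An AUC of 1.0 corresponds to perfect separation of the classes, whereas an AUC of 0.5 corresponds to chance-level ranking; a higher AUC value therefore indicates better classification performance.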
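The AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample. The sketch below computes it this way for a single class treated as positive; it is an illustrative pairwise implementation under that assumption (the function name `roc_auc` and the toy scores are hypothetical), and for a multi-class model it would be applied to each class in a one-vs-rest fashion.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class, 0 for all other samples.
    scores: model scores for the positive class (higher = more positive).
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy scores (illustrative only): one positive is ranked below one negative.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

The pairwise loop is quadratic in the number of samples, so it is suitable only for small examples; a rank-based or library implementation would be preferable on full datasets.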
4.5. Performance Results
Tables 4, 5, and 6 show the per-class performance of the proposed model in terms of accuracy, precision, recall, and F1 score. In Table 4, the proposed model performs very well across most classes in the CICIDS2017 dataset. The Benign, DDoS, FTP, Hulk, and PortScan classes show accuracy values above 99%, and their precision, recall, and F1 scores are also above 0.99, which shows strong classification performance. However, the Bot class shows lower performance, indicating that the model sometimes confuses Bot traffic with benign traffic. The Web Attack class also shows slightly lower precision, but the overall detection performance remains high.
Table 5 shows the results for the NSL-KDD dataset. The proposed model performs well for most classes. The Neptune class achieves a perfect precision of 1.00 and a high F1 score of 0.9985, and the Normal class also performs well. The PortSweep, Satan, and Smurf classes show slightly lower values, but their accuracy and F1 scores remain high. These results show that the proposed model effectively detects different attack types on both datasets.
Table 6 presents the per-class performance of the proposed model on the CICIoT2023 dataset. In particular, the DDoS, DoS, and Mirai classes achieve accuracy above 99%, with corresponding precision, recall, and F1 scores consistently exceeding 0.99, indicating robust detection of large-scale attack patterns. The Benign class also achieves strong performance, with high recall and F1 score values, though its precision is slightly lower because a small number of attack samples are misclassified as benign. Comparatively lower performance is observed for the MITM and Recon classes. The MITM class shows lower recall and F1 scores, suggesting that the model occasionally misclassifies these instances, likely because they resemble normal traffic or other attack traffic patterns. Similarly, the Recon class exhibits lower precision and F1 scores, indicating challenges in distinguishing reconnaissance activities from other classes.
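Per-class precision, recall, and F1 scores of the kind reported in these tables can be derived from a multi-class confusion matrix by treating each class in turn as the positive class (one-vs-rest). The sketch below illustrates this under stated assumptions: the helper name `per_class_metrics`, the class list, and the counts are hypothetical and are not the values behind the tables, and the exact procedure used to produce the reported per-class accuracy is not specified here.

```python
def per_class_metrics(cm, class_names):
    """One-vs-rest precision, recall, and F1 for every class.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j (rows = true labels, columns = predictions).
    """
    results = {}
    for k, name in enumerate(class_names):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                    # true class k, predicted as another class
        fp = sum(row[k] for row in cm) - tp     # other classes predicted as k
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        results[name] = {"precision": precision, "recall": recall, "f1": f1}
    return results


# Toy 3-class confusion matrix (illustrative counts only).
class_names = ["Benign", "Bot", "DDoS"]
cm = [
    [95, 5, 0],   # true Benign
    [10, 20, 0],  # true Bot: 10 samples are mistaken for Benign
    [0, 0, 50],   # true DDoS
]
report = per_class_metrics(cm, class_names)
for name, m in report.items():
    print(f"{name}: P={m['precision']:.3f} R={m['recall']:.3f} F1={m['f1']:.3f}")
```

In this toy example, Bot samples mistaken for Benign lower both the recall of Bot and the precision of Benign, which is the same mechanism described for the minority classes above.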
The accuracy curves of the proposed model over the training epochs are illustrated in Figure 2 for the CICIDS2017, NSL-KDD, and CICIoT2023 datasets. The plots show training and test accuracy values over 100 epochs. As shown in Figure 2a, for the CICIDS2017 dataset, training and test accuracies increase rapidly during the initial epochs and gradually stabilize as training progresses. Although minor fluctuations in test accuracy are observed at certain epochs, the training and test curves remain closely aligned. Similarly, Figure 2b illustrates the training performance on the NSL-KDD dataset. The model converges quickly within the first few epochs and reaches maximum accuracy. The training and test accuracy curves closely overlap throughout training, indicating stable learning behavior.
Figure 2c shows the training and testing accuracy curves for the CICIoT2023 dataset. As with the other datasets, the model shows steady improvement in accuracy during the initial training epochs. However, compared to CICIDS2017 and NSL-KDD, the convergence is relatively slower due to the higher complexity and diversity of attack patterns in the IoT environment. After the initial phase, both training and testing accuracy continue to increase and eventually stabilize at high values. The close alignment between the training and test curves demonstrates that the model maintains strong generalization performance on the CICIoT2023 dataset despite its heterogeneous nature.
The loss curves of the proposed model across training epochs are illustrated in Figure 3 for the CICIDS2017, NSL-KDD, and CICIoT2023 datasets. The plots show the training and test losses over 100 epochs. As shown in Figure 3a, for the CICIDS2017 dataset, the training loss decreases rapidly during the initial epochs and gradually stabilizes as training progresses. Similarly, Figure 3b presents the loss curves for the NSL-KDD dataset. The training and test losses decrease sharply in the early epochs and converge quickly as training continues. Both curves remain closely aligned throughout the training process.
Figure 3c illustrates the loss curves for the CICIoT2023 dataset. Training loss decreases steadily during the initial epochs, indicating effective learning of underlying patterns. However, compared to CICIDS2017 and NSL-KDD, the loss reduction is more gradual due to the increased complexity and diversity of IoT traffic. The test loss also shows a decreasing trend in the early stages but exhibits noticeable fluctuations at later epochs. Despite these variations, the overall gap between training and test loss remains moderate, indicating that the model does not suffer from severe overfitting.
The normalized confusion matrices shown in Figure 4 illustrate the relationship between the true and predicted class labels produced by the proposed model. The diagonal elements represent the correctly classified samples and therefore indicate the classification accuracy for each class, whereas the off-diagonal elements correspond to misclassified samples. The diagonal elements across the three datasets indicate that the proposed model achieves high classification performance for most attack categories. However, the results also show that the model confuses some specific attack types. For instance, as shown in Figure 4a, for the CICIDS2017 dataset, the proposed model misclassifies approximately 32% of the Bot attacks as Benign traffic, while the remaining classes are accurately classified. In contrast, Figure 4b shows the results for the NSL-KDD dataset, where the proposed model almost perfectly classifies each attack category with very few misclassifications. Moreover, Figure 4c shows the confusion matrix for the CICIoT2023 dataset. The proposed model correctly classifies most DDoS, DoS, and Mirai traffic with very high accuracy. However, a few misclassifications are observed in the Benign, MITM, and Recon classes.
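A normalized confusion matrix of this kind is commonly obtained by dividing each row of the raw count matrix by the number of true samples in that class, so that each row sums to 1 and the diagonal entry equals the per-class recall. The following is a minimal sketch assuming row-wise normalization; the function name `normalize_rows` and the counts are hypothetical, chosen only so that the Bot row resembles the roughly 32% confusion with Benign described above, and they are not the actual counts behind Figure 4.

```python
def normalize_rows(cm):
    """Divide every row of a confusion matrix by its row total.

    cm[i][j] counts samples with true class i predicted as class j.
    Rows with no samples are returned as all zeros.
    """
    normalized = []
    for row in cm:
        row_total = sum(row)
        if row_total:
            normalized.append([count / row_total for count in row])
        else:
            normalized.append([0.0 for _ in row])
    return normalized


# Illustrative counts only (rows: true Benign, true Bot).
class_names = ["Benign", "Bot"]
cm = [
    [980, 20],
    [32, 68],
]
norm = normalize_rows(cm)
for name, row in zip(class_names, norm):
    print(name, [round(value, 2) for value in row])
```

Row normalization makes classes of very different sizes comparable, which matters for imbalanced intrusion detection datasets where a minority class such as Bot would otherwise be visually dominated by Benign traffic.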
Figure 5 shows the ROC curves of the proposed model for the CICIDS2017, NSL-KDD, and CICIoT2023 datasets. In Figure 5a, the ROC curves for the CICIDS2017 dataset are presented. The curves are located near the top-left corner of the plot, which indicates strong classification performance. Most classes achieve AUC values close to 1.0. The Benign, DDoS, FTP, Hulk, PortScan, and Web Attack classes show perfect discrimination, whereas the Bot class has a slightly lower AUC than the other classes. In Figure 5b, the ROC curves for the NSL-KDD dataset are shown. The curves also remain very close to the top-left corner, indicating strong detection capability for the attack classes. The Neptune, Normal, and Smurf classes achieve an AUC value of 1.0, while the PortSweep and Satan classes show slightly lower values.
Figure 5c shows the ROC curves for the CICIoT2023 dataset. The curves are mostly close to the top-left corner, indicating strong classification performance. Most classes achieve high AUC values near 1.0, demonstrating effective detection. The DDoS, DoS, and Mirai classes show near-perfect discrimination. However, the Benign, MITM, and Recon classes have slightly lower AUC values compared to the others.
Figure 6 shows the precision–recall curves of the proposed model for the CICIDS2017, NSL-KDD, and CICIoT2023 datasets. In Figure 6a, most classes show precision–recall curves close to the top-right corner, which indicates strong classification performance. The Benign, DDoS, FTP, Hulk, PortScan, and Web Attack classes achieve very high average precision values. The Bot class performs worse than the other classes: its curve decreases as recall increases, indicating that the model struggles to detect Bot attacks in some cases. In Figure 6b, the curves remain close to the top-right corner for most classes, indicating high precision and recall. The Neptune, Normal, and Smurf classes show very high average precision values. The PortSweep and Satan classes show slightly lower values, but their performance remains strong. In Figure 6c, most classes have curves close to the top-right corner, indicating strong classification performance. The DDoS, DoS, and Mirai classes achieve high average precision values and maintain stable precision across different recall levels. However, the Benign, MITM, and Recon classes show relatively lower performance. In particular, precision decreases as recall increases for the MITM and Recon classes, indicating that the model struggles to accurately distinguish these traffic types.
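Average precision summarizes a precision–recall curve as a single number. One common definition, assumed here because the exact computation is not stated in the text, averages the precision measured at the rank of each positive sample when samples are sorted by decreasing score. The sketch below implements that definition for one class treated as positive; the function name `average_precision` and the toy scores are hypothetical, and tied scores are assumed not to occur.

```python
def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive sample.

    labels: 1 for the positive class, 0 otherwise.
    scores: model scores for the positive class (assumed free of ties).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")

    # Sort samples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Toy scores (illustrative only): precision drops as recall increases.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
ap = average_precision(labels, scores)
print(f"AP = {ap:.4f}")
```

In this toy ranking, the negative sample scored at 0.8 pulls precision down once recall moves past the first positive, which is the same pattern of falling precision at higher recall described for the weaker classes.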