This section presents a comprehensive analysis of the performance of the proposed IDS model, along with a comparative evaluation against other recently developed algorithms.
5.2. Evaluation Metrics
The effectiveness of the proposed IDS model is assessed using several standard evaluation metrics, including accuracy, precision, recall, and F1-score. These metrics provide a detailed understanding of the model’s ability to correctly detect intrusions while minimizing false detections. We report macro-averaged precision, recall, and F1-score across classes in both binary and multiclass evaluations to reflect minority-class detection performance, treating all classes equally regardless of their frequency [28].
Accuracy represents the overall performance of the model in correctly classifying traffic. High precision means that most instances flagged as attacks are real attacks, which reduces false alarms. High recall means that most attacks are detected by the IDS, which reduces the number of missed intrusions. F1-score is the harmonic mean of precision and recall, providing a balance between the two. The metrics used to evaluate the proposed DL-based IDS are defined as [33]:
Accuracy: The ratio of correctly predicted instances to the total number of instances.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision: The ratio of true positive predictions (correctly classified as intrusion) to all positive predictions.
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall: Also known as detection rate, the ratio of correctly predicted attacks to all instances of actual intrusions.
$$\text{Recall} = \frac{TP}{TP + FN}$$
F1-score: The harmonic mean of precision and recall.
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Matthews Correlation Coefficient (MCC): A correlation-based metric that accounts for all four confusion matrix entries and provides a reliable performance measure under class imbalance.
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{17}$$
where: True negative (TN): The model correctly predicts an instance as normal. False negative (FN): The model fails to predict an attack, incorrectly classifying it as normal. True positive (TP): The model correctly predicts an instance as an attack. False positive (FP): The model incorrectly predicts a normal instance as an attack.
For the evaluation conducted in this study, macro-averaged metrics are computed by averaging the per-class values of each metric across all classes. Additionally, we employ the confusion matrix to analyze the model’s classification performance. The confusion matrix provides a detailed breakdown of correctly and incorrectly classified instances, highlighting areas where the model excels or needs improvement.
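For concreteness, the macro-averaging described above can be written as follows. The number of classes $K$ and the per-class subscript $k$ are notation introduced here for illustration and are not taken from the original equations; the per-class counts treat class $k$ as the positive class and all other classes as negative.

```latex
\[
P_k = \frac{TP_k}{TP_k + FP_k}, \qquad
R_k = \frac{TP_k}{TP_k + FN_k}, \qquad
F1_k = \frac{2\, P_k R_k}{P_k + R_k}
\]
\[
\text{Macro-Precision} = \frac{1}{K}\sum_{k=1}^{K} P_k, \qquad
\text{Macro-Recall} = \frac{1}{K}\sum_{k=1}^{K} R_k, \qquad
\text{Macro-F1} = \frac{1}{K}\sum_{k=1}^{K} F1_k
\]
```

Because each class contributes with weight $1/K$, a poorly detected minority class lowers the macro-averaged score as much as a poorly detected majority class would.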
5.3. Experimental Results
The development of the proposed model begins with the initialization of a swarm of twenty particles, each representing a candidate feature subset from the BoT-IoT and CIC-IDS2017 datasets. This initialization ensures a diverse search space for optimal feature selection. The fitness of each particle is assessed by training the MHA-stacked GRU model on its respective feature subset and calculating the validation accuracy. Throughout the iterative optimization process, the BPSO algorithm continuously updates particle velocities and positions based on personal and global best fitness values. Eventually, the optimal feature subset, with the highest validation accuracy, is identified as the global best. The selected features for the BoT-IoT and CIC-IDS2017 datasets by the proposed approach are provided in Table 3 and Table 4. Once training is completed, the model’s robustness is assessed using the unseen test dataset to evaluate its effectiveness in detecting unseen threats.
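The search loop described above can be sketched as a minimal binary PSO. This is an illustrative sketch under stated assumptions, not the implementation used in the study: it uses the common sigmoid transfer rule for binary positions, and the inertia weight `w`, acceleration constants `c1` and `c2`, velocity clamp `v_max`, iteration count, and the `toy_fitness` function are all hypothetical. In the study, the fitness of a particle is the validation accuracy of the MHA-stacked GRU trained on the selected features; the toy function below only stands in for that expensive step.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def bpso(fitness, n_features, n_particles=20, n_iters=30,
         w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=40):
    """Minimal binary PSO that maximizes `fitness` over 0/1 feature masks.

    All hyperparameter defaults except n_particles=20 are assumptions.
    """
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[rng.uniform(-1.0, 1.0) for _ in range(n_features)]
           for _ in range(n_particles)]

    # Personal bests start at the initial positions.
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    best = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[best][:], pbest_fit[best]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))
                # Sigmoid transfer: velocity sets the probability of bit = 1.
                pos[i][d] = 1 if rng.random() < sigmoid(vel[i][d]) else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


def toy_fitness(mask):
    """Hypothetical stand-in for validation accuracy.

    Rewards three 'informative' features and mildly penalizes subset size.
    """
    informative = (0, 2, 5)
    return sum(mask[i] for i in informative) - 0.1 * sum(mask)


best_mask, best_fit = bpso(toy_fitness, n_features=8)
print(best_mask, round(best_fit, 2))
```

A real fitness function would also need to handle the empty feature subset, for example by returning the lowest possible score without training a model.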
Table 5 presents the binary classification performance of the proposed model, where traffic instances are categorized as either normal or attack. For the BoT-IoT dataset, the binary experiment was intentionally conducted under highly imbalanced conditions by retaining all normal records and sampling a subset of the attack records. This design serves as a stress test that examines the robustness of the proposed model under the severe class imbalance common in IoT networks. The goal was not to artificially simplify the problem but to reduce the overwhelming dominance of the majority class while preserving the minority class. The resulting accuracy should not be considered alone, since it is influenced by the dominance of the majority class. Therefore, in addition to accuracy, we report macro-averaged precision, recall, and F1-score together with the confusion matrix to provide a more informative evaluation of the model’s performance across classes. For the BoT-IoT dataset, we retained all 477 normal traffic records and sampled 500,000 of the 2,541,266 attack records (around 20% of the attack class). For the CIC-IDS2017 dataset, we used all records remaining after preprocessing: 502,983 benign and 425,878 attack records. Additionally, to provide a more realistic assessment of binary detection performance under class imbalance, the MCC defined in Equation (17) is also reported for both the BoT-IoT and CIC-IDS2017 datasets in Table 5. The results show that the proposed model achieves an F1-score of 98.94% with 99.46% precision on the BoT-IoT dataset and an F1-score of 99.76% with 99.76% precision on the CIC-IDS2017 dataset, highlighting the model’s effectiveness in accurately detecting intrusions within diverse network traffic.
Figure 4 provides a visual illustration of the proposed model’s performance across the evaluation metrics, further demonstrating its effectiveness for binary intrusion detection.
Figure 5 and Figure 6 illustrate the model loss and accuracy throughout the training process for binary classification. The training accuracy curve reflects how well the model learns from the training dataset during each epoch, while the validation accuracy curve shows how well the model generalizes to the unseen validation dataset. If both training and validation accuracy increase over time and converge, the model is learning and generalizing well. The training loss curve shows the error on the training dataset, which typically decreases as the model learns. The validation loss curve shows the error on the validation dataset; a small gap between training and validation loss indicates good generalization. In both datasets, the close alignment between training and validation accuracy, along with steadily decreasing loss values, suggests that the model is not overfitting and performs robustly on unseen data. For the BoT-IoT experiment in Figure 5, the target class distribution was intentionally set to be highly imbalanced for stress testing (Normal: 477; Attack: 500,000), resulting in rapid convergence to nearly 100% accuracy and minimal loss values after a few epochs. However, these near-perfect metrics primarily originate from the dominance of the majority class, so accuracy alone is insufficient to represent effective minority-class detection. Therefore, we place additional emphasis on macro-averaged precision, recall, and F1-score to assess performance across all classes more comprehensively. In contrast, on the more balanced CIC-IDS2017 dataset (Normal: 502,983; Attack: 425,878), accuracy improves more gradually and loss decreases steadily across epochs, demonstrating effective generalization under a less extreme class distribution.
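A short calculation illustrates why accuracy alone is uninformative at this class ratio. The sketch below considers a hypothetical trivial classifier that labels every record as an attack, applied to the stated BoT-IoT sample of 477 normal and 500,000 attack records. The classifier and the function name are illustrative assumptions and are not part of the study.

```python
def always_attack_metrics(n_normal, n_attack):
    """Accuracy and macro recall of a classifier that always predicts 'attack'."""
    accuracy = n_attack / (n_normal + n_attack)
    recall_attack = 1.0   # every attack is flagged
    recall_normal = 0.0   # no normal record is ever recognized
    macro_recall = (recall_attack + recall_normal) / 2
    return accuracy, macro_recall


acc, macro_rec = always_attack_metrics(n_normal=477, n_attack=500_000)
print(f"accuracy = {acc:.4%}, macro recall = {macro_rec:.2%}")
# accuracy = 99.9047%, macro recall = 50.00%
```

Even this degenerate classifier exceeds 99.9% accuracy while detecting none of the minority class, which is why the macro-averaged metrics and the MCC are the more informative figures in this setting.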
Accurate intrusion detection is fundamental in cybersecurity. To further evaluate the classification effectiveness of the proposed model, we employed the confusion matrix, which provides a detailed breakdown of classification performance. This matrix enables a comprehensive evaluation of the model’s ability to correctly distinguish between normal traffic and intrusion attempts. For the BoT-IoT dataset, the confusion matrix in Figure 7 shows that the model correctly classified 92 of 95 normal traffic instances and detected 100,000 of 100,001 intrusion attempts. Similarly, for the CIC-IDS2017 dataset in Figure 8, the model correctly classified 100,378 of 100,597 normal traffic instances and identified 84,948 of 85,176 attacks.
In these confusion matrices, the x-axis represents the predicted labels “Predicted Normal” and “Predicted Attack”, while the y-axis represents the actual labels “Actual Normal” and “Actual Attack”. The diagonal elements correspond to correctly classified instances. In contrast, the off-diagonal elements indicate misclassifications: false positives (normal traffic misclassified as an attack) in the top-right, and false negatives (attacks misclassified as normal traffic) in the bottom-left. The confusion matrix is an important tool for analyzing classification performance, as high values along the diagonal highlight the proposed model’s robustness and its strong ability to distinguish between benign and malicious traffic, even under highly skewed class distributions. These results further support the reliability of the proposed model in intrusion detection, demonstrating its capability to enhance cybersecurity by minimizing classification errors. To ensure reproducibility, our experiments were conducted with a fixed random seed of 40.
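As a consistency check, the macro-averaged binary metrics can be recomputed from the confusion-matrix counts quoted above for Figure 7 and Figure 8. The sketch below is illustrative: the function names are ours, and the off-diagonal entries are derived from the stated totals (for example, 95 − 92 = 3 normal records misclassified as attacks on BoT-IoT).

```python
from math import sqrt


def macro_metrics(cm):
    """Accuracy and macro-averaged precision, recall, and F1.

    cm[i][j] is the number of class-i instances predicted as class j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(k))
    precisions, recalls, f1s = [], [], []
    for c in range(k):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(k)) - tp
        fn = sum(cm[c]) - tp
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }


def mcc_binary(cm):
    """Matthews correlation coefficient for [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Rows: actual Normal, actual Attack; columns: predicted Normal, predicted Attack.
bot_iot = [[92, 3], [1, 100_000]]
cic_ids2017 = [[100_378, 219], [228, 84_948]]

for name, cm in (("BoT-IoT", bot_iot), ("CIC-IDS2017", cic_ids2017)):
    m = macro_metrics(cm)
    print(name,
          f"precision={m['precision']:.2%}",
          f"F1={m['f1']:.2%}",
          f"MCC={mcc_binary(cm):.4f}")
```

With these counts, the recomputed macro precision and macro F1-score are 99.46% and 98.94% for BoT-IoT and 99.76% and 99.76% for CIC-IDS2017, matching the values reported for Table 5.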
Following the evaluation of the proposed BPSO-MHA-GRU IDS approach on the binary classification task, we extended the analysis to evaluate its performance in the multiclass intrusion detection setting. For the multiclass experiments, Target Sampling was applied to both the BoT-IoT and CIC-IDS2017 datasets before training and evaluation to reduce the dominance of majority classes while preserving minority classes (Table 1 and Table 2). The model achieved an accuracy of 99.99% and an F1-score of 99.89% on the BoT-IoT dataset, and an accuracy of 99.34% with an F1-score of 98.04% on the CIC-IDS2017 dataset, as provided in Table 6. Macro-averaged metrics are reported throughout all experiments to ensure a fair representation of classification performance, particularly for minority classes, and thus give equal weight to every class category.
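One way to read Target Sampling as described here is as capped per-class undersampling: classes larger than a target size are randomly downsampled, smaller classes are kept in full, and no synthetic records are generated. The sketch below illustrates that reading only. The function name, the single shared `cap`, and the toy label counts are assumptions; the per-class targets actually used are those in Table 1 and Table 2.

```python
import random
from collections import Counter, defaultdict


def target_sample(labels, cap, seed=40):
    """Return sorted indices that keep at most `cap` records per class.

    Minority classes (size <= cap) are preserved in full; nothing is
    oversampled or synthesized. `cap` is a hypothetical shared target.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    kept = []
    for idxs in by_class.values():
        if len(idxs) > cap:
            idxs = rng.sample(idxs, cap)  # random undersampling
        kept.extend(idxs)
    return sorted(kept)


# Toy label vector (made-up counts, for illustration only).
labels = ["DDoS"] * 1000 + ["DoS"] * 800 + ["Normal"] * 12 + ["Theft"] * 3
kept = target_sample(labels, cap=100)
print(Counter(labels[i] for i in kept))
```

In the toy example the two large classes are reduced to 100 records each, while the 12 and 3 minority records are retained unchanged.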
Additionally, we employed the confusion matrix to provide a detailed breakdown of the classification performance of the proposed model on the BoT-IoT and CIC-IDS2017 datasets in the multiclass classification task, as depicted in Figure 9 and Figure 10. The matrix for the BoT-IoT dataset shows high true positive rates across all classes, particularly for DoS and DDoS attacks, with minimal misclassifications among the remaining attack categories and normal traffic. Most off-diagonal entries are zero, indicating strong classification capability despite the class imbalance. This reflects the model’s effectiveness in capturing distinct patterns for each intrusion category. For CIC-IDS2017, the confusion matrix shows high correct-classification rates along the diagonal for both benign traffic and the various attack types. The overall distribution indicates that the model achieves reliable multiclass classification, even for minority classes, supporting its generalizability and robustness across complex modern network and IoT environments.
In addition to the confusion matrices, Table 7 and Table 8 provide the detailed classification reports for the BoT-IoT and the CIC-IDS2017 datasets, respectively. These tables provide precision, recall, and F1-score for each class, offering a comprehensive evaluation of the per-class detection performance of the proposed model. The consistently high metrics across all classes, including both majority and minority categories, highlight the robust effectiveness and generalizability of the proposed methodology in multiclass intrusion detection scenarios.
To ensure a fair comparison, the baseline models, LSTM and GRU, were trained using the same preprocessing steps and training hyperparameters as the proposed model, including data cleaning, recurrent layers, data partitioning, batch size, learning rate, dropout, and epochs. Also, for the ablation experiments, the same training settings were maintained, while the architectural components within the study were selectively removed according to the comparison conducted.
To further analyze the contribution of MHA, BPSO, and Target Sampling within the proposed BPSO-MHA-GRU Target Sampling framework, we benchmarked the proposed model against baseline LSTM and GRU models using the same evaluation metrics. Both baseline models consisted of two recurrent layers with 128 neurons in the first layer and 64 neurons in the second, each followed by dropout with a rate of 0.2, corresponding exactly to the recurrent depth of the proposed model. The baseline models used the full datasets, without the BPSO and Target Sampling stages, with the same preprocessing steps, data partitioning, and hyperparameters as the proposed model, including a batch size of 64, 50 training epochs, and a learning rate of 0.01. This comparison evaluates the effectiveness of the proposed framework relative to recurrent baseline models under identical preprocessing and training settings.
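The baseline configuration described above can be sketched in Keras as follows. Only the layer sizes (128 and 64), the dropout rate (0.2), the batch size (64), the epoch count (50), and the learning rate (0.01) come from the text. The Adam optimizer, the softmax output layer, the sparse categorical cross-entropy loss, the input shape arguments, and the function name `build_baseline` are assumptions made for illustration.

```python
BASELINE_CONFIG = {
    "units": (128, 64),     # neurons in the two recurrent layers
    "dropout": 0.2,         # applied after each recurrent layer
    "batch_size": 64,
    "epochs": 50,
    "learning_rate": 0.01,
}


def build_baseline(cell, n_timesteps, n_features, n_classes, cfg=BASELINE_CONFIG):
    """Build a two-layer recurrent baseline; `cell` is "lstm" or "gru".

    TensorFlow is imported lazily so the configuration can be inspected
    without it installed. Optimizer, loss, and output layer are assumed.
    """
    from tensorflow import keras
    from tensorflow.keras import layers

    rnn = {"lstm": layers.LSTM, "gru": layers.GRU}[cell]
    first, second = cfg["units"]
    model = keras.Sequential([
        keras.Input(shape=(n_timesteps, n_features)),
        rnn(first, return_sequences=True),  # pass the full sequence onward
        layers.Dropout(cfg["dropout"]),
        rnn(second),
        layers.Dropout(cfg["dropout"]),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=cfg["learning_rate"]),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Training would then call `model.fit(..., batch_size=BASELINE_CONFIG["batch_size"], epochs=BASELINE_CONFIG["epochs"])` on the preprocessed sequences; how records are shaped into `(n_timesteps, n_features)` windows is not specified in this section.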
The results of this analysis are outlined in Table 9, which reports the mean ± standard deviation over three independent runs to evaluate the robustness and stability of the proposed model. On both datasets, the proposed model achieved the best overall performance. On the BoT-IoT dataset, its precision, recall, and F1-score exceeded those of both the LSTM and GRU baselines; the per-model values are listed in Table 9. On the CIC-IDS2017 dataset, the proposed model likewise outperformed the LSTM and GRU baselines in precision, recall, and F1-score under the same preprocessing and training settings. In addition to the performance improvement, the proposed model reduced the average training time over three runs by approximately 18% compared to both the LSTM and GRU baseline models on the BoT-IoT dataset and by approximately 71% on the CIC-IDS2017 dataset. The improved performance of the proposed model is attributed to the effective integration of the BPSO optimization algorithm and the MHA-stacked GRU architecture, which together enable high classification performance with lower computational overhead. Moreover, the Target Sampling strategy rebalances imbalanced class distributions, mitigating majority class bias without introducing synthetic records.
Figure 11 provides a visual illustration of the multiclass classification results reported in Table 9 for the proposed, LSTM, and GRU models on the BoT-IoT dataset and the CIC-IDS2017 dataset. For each metric and model, the bar height represents the mean performance over three independent runs, while the black error bars represent the standard deviation. The three runs were conducted using fixed random seeds of 40, 32, and 20; the same seeds were also used in the subsequent ablation studies. The proposed model achieves the highest mean performance with low deviation across runs, indicating robust and stable intrusion detection performance under the class imbalance of the BoT-IoT dataset and under the more diverse attack scenarios of the CIC-IDS2017 dataset. In contrast, the LSTM and GRU models vary more from run to run, especially in recall and F1-score, indicating less stability and an increased risk of missing intrusions. These findings highlight the advantage of combining optimized feature selection methods with attention mechanisms within recurrent networks for reliable IDS.
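The mean ± standard deviation summary over the three seeded runs can be computed as in the sketch below. The per-seed scores are made-up placeholders, not results from this study, and the choice between the sample and population standard deviation is an assumption, since the text does not state which one is reported.

```python
from statistics import mean, pstdev, stdev

SEEDS = (40, 32, 20)  # seeds stated in the text


def summarize(scores, sample=True):
    """Return (mean, standard deviation) of per-run scores.

    sample=True uses the sample standard deviation (n - 1 denominator);
    sample=False uses the population standard deviation.
    """
    spread = stdev(scores) if sample else pstdev(scores)
    return mean(scores), spread


# Placeholder macro-F1 values keyed by seed (NOT values from the paper).
placeholder_f1 = {40: 0.90, 32: 0.92, 20: 0.94}

m, s = summarize([placeholder_f1[seed] for seed in SEEDS])
print(f"{m * 100:.2f} ± {s * 100:.2f}")
# 92.00 ± 2.00
```

With only three runs the two definitions differ noticeably (the sample value is larger by a factor of about 1.22), so stating which one is used makes the error bars easier to interpret.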
To further investigate the contribution of the MHA and the BPSO components within the proposed BPSO-MHA-GRU Target Sampling framework, an ablation study was conducted by removing both the MHA layer and BPSO feature selection while retaining the stacked GRU layers. The baseline model without MHA and BPSO, referred to as GRU-Target Sampling, utilized the same deep learning architecture, preprocessing steps, dropout, and training hyperparameters of the proposed model, consisting of two stacked GRU layers and including the Target Sampling strategy.
Table 10 presents the multiclass classification performance comparison for both datasets, reported as mean ± standard deviation over three independent runs. The proposed IDS outperformed the baseline GRU-Target Sampling model in terms of recall and F1-score on both the BoT-IoT and CIC-IDS2017 datasets; the corresponding values for each model are listed in Table 10.
Figure 12 provides a visual illustration of the results reported in Table 10. The proposed model achieves higher mean performance with lower deviation across recall and F1-score, confirming the effectiveness of the proposed framework.
Additionally, we conducted an ablation study by removing the MHA layer only while keeping BPSO-GRU and Target Sampling to evaluate the contribution of the MHA component in the proposed model, comparing it to the proposed BPSO-MHA-GRU Target Sampling model under the same preprocessing steps, training hyperparameters, and experimental process for both.
Table 11 presents the multiclass classification performance comparison for both datasets, reported as mean ± standard deviation over three independent runs. The proposed model outperformed the baseline BPSO-GRU with Target Sampling model in recall and F1-score on both the BoT-IoT and CIC-IDS2017 datasets, demonstrating the effectiveness of incorporating the MHA mechanism; the corresponding values for each model are listed in Table 11.
Figure 13 provides a visual illustration of the results reported in Table 11. These results highlight the advantage of integrating the MHA mechanism with BPSO-based feature selection, improving detection capability and robustness of the proposed model in computer networks and IoT environments.
To further demonstrate the effectiveness of our proposed model, we conducted a performance comparison against recent state-of-the-art IDS models. The comparison covered both binary and multiclass classification tasks on the BoT-IoT and CIC-IDS2017 datasets. The comparisons reported in Table 12, Table 13, Table 14, and Table 15 are literature-based: they use the performance values reported in the respective original studies.
Table 12 and Table 13 present the binary classification performance analysis on the BoT-IoT and CIC-IDS2017 datasets, respectively. Our model achieved a precision of 99.46% and an F1-score of 98.94% on the BoT-IoT dataset, outperforming existing models such as LSTM-GRU (F1-score of 98.68%) and CNN-LSTM (97.50%). Similarly, on the CIC-IDS2017 dataset, our approach achieved a precision of 99.76% and an F1-score of 99.76%, outperforming other existing models, including GAN-CNN-BiLSTM (F1-score of 96.04%) and ODODL-IDS (94.17%).
For multiclass classification, Table 14 and Table 15 show the proposed model’s strong capability in handling multiclass tasks, achieving 99.79% recall and 99.89% F1-score on the BoT-IoT dataset and 98.69% recall and 98.04% F1-score on the CIC-IDS2017 dataset. These results are higher than those reported for other existing models, including RIDGE (93.00%), CSMCR (92.75%), Attention-RNN (98.94%), and CNN-IoT (72.60%) on the BoT-IoT dataset, and DAMML (89.33%), S2CGAN (92.00%), and CNN 1D-BLSTM (88.00%) on the CIC-IDS2017 dataset. These results support the robustness and generalization capacity of the proposed IDS across diverse network environments and attack scenarios.