The simulation results of all methods are obtained on the CTU-13 dataset [33]. The dataset is inherently imbalanced, consisting of 4,775 botnet instances and 20,902 normal instances. For model development, 80% of the data was allocated for training, while the remaining 20% was used for testing. The CTU-13 dataset, compiled by the Czech Technical University, is a widely used benchmark for evaluating intrusion detection systems, especially in IoT network security research. It contains thirteen different capture scenarios of real botnet traffic mixed with normal and background traffic. Each scenario includes various types of botnet activity, such as spam, click fraud, and DDoS attacks, providing a realistic environment for anomaly detection. We selected the CTU-13 dataset because of its diversity of attack behaviors, the complexity of its real network traffic, and its relevance to modern IoT environments, where devices communicate dynamically across heterogeneous networks. Furthermore, its widespread use in recent research [33] allows for fair comparisons with existing methods while validating the robustness and generalization capability of our proposed anomaly detection framework.
The performance evaluation of the proposed method is carried out using the following key metrics: accuracy, specificity, sensitivity, precision, F-measure, false positive rate (FPR), false negative rate (FNR), Matthews Correlation Coefficient (MCC), and detection rate. Below is a detailed explanation of each metric:
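All of these quantities can be derived from the four confusion-matrix counts. The helper below collects the standard textbook definitions in one place as a compact reference; it is a generic sketch, not code from the proposed framework:

```python
import math

def metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sens = tp / (tp + fn)          # sensitivity / recall / detection rate
    spec = tn / (tn + fp)          # specificity
    prec = tp / (tp + fp)          # precision
    f1 = 2 * prec * sens / (prec + sens)
    fpr = fp / (fp + tn)           # false positive rate = 1 - specificity
    fnr = fn / (fn + tp)           # false negative rate = 1 - sensitivity
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"accuracy": acc, "sensitivity": sens, "specificity": spec,
            "precision": prec, "f_measure": f1, "fpr": fpr, "fnr": fnr,
            "mcc": mcc}
```

For instance, `metrics(50, 10, 30, 10)` yields an accuracy of 0.80 and a specificity of 0.75.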
4.1. Simulation Results
To evaluate the optimization performance of the proposed and baseline feature selection methods, we analyze the fitness values obtained at different iteration steps.
Table 3 presents the convergence behavior of several metaheuristic algorithms, including the Chimp Optimization Algorithm (ChOA) [34], Particle Swarm Optimization (PSO) [35], Social Spider Optimization (SSO) [36], the Sparrow Search Algorithm (SPA) [37], and the proposed feature selection method. Fitness values are reported at iterations 10 through 50, showing that the proposed method consistently achieves faster convergence and higher final fitness values than the other approaches.
Table 3 reports the performance values (i.e., objective function values) attained at each iteration. The results indicate that the proposed feature selection method converges faster and more decisively than the baseline techniques, consistently outperforming them at every iteration and reaching higher values earlier. For instance, after 10 iterations, the proposed method attains a fitness value of 129, significantly higher than ChOA (110) and PSO (86). Similarly, by the 50th iteration, the proposed method achieves a value of 210, surpassing ChOA (188) and the other techniques. This superior convergence behavior can be attributed to the advanced optimization strategies integrated into the proposed method, such as adaptive parameter tuning and enhanced exploration and exploitation capabilities. These features enable the method to navigate the solution space more effectively, avoiding local optima and accelerating convergence. The ability to converge faster and achieve higher performance demonstrates the proposed method's robustness and efficiency, making it a more reliable choice for feature selection in high-dimensional datasets.
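The per-iteration bookkeeping behind such convergence tables can be sketched with a toy genetic algorithm: the best fitness in the population is recorded at every iteration, and elitist selection guarantees the curve is non-decreasing. The objective, operators, and parameters below are illustrative stand-ins, not the paper's optimizer:

```python
import random

random.seed(0)

def fitness(mask):
    # Toy objective: reward informative features (even indices),
    # penalize the size of the selected subset.
    return sum(2 for i, b in enumerate(mask) if b and i % 2 == 0) - sum(mask) * 0.5

def ga_feature_selection(n_features=20, pop_size=30, iterations=50):
    pop = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    history = []
    for _ in range(iterations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))        # best fitness this iteration
        parents = pop[: pop_size // 2]         # truncation selection (elitist)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_features)
            child = a[:cut] + b[cut:]          # one-point crossover
            i = random.randrange(n_features)
            child[i] ^= 1                      # bit-flip mutation
            children.append(child)
        pop = parents + children
    return history

hist = ga_feature_selection()
```

Because the best individual always survives into the next generation, `hist` is monotonically non-decreasing, mirroring the behavior summarized in Table 3.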
Figure 2 compares five classification models: ENN, CNN, SVM, DBN, and the proposed method. The metrics used for evaluation include accuracy, specificity, sensitivity, and precision, displayed in subplots (a), (b), (c), and (d), respectively. The superior performance of the proposed method can be attributed to its sophisticated and well-optimized structure. Unlike traditional models, the proposed method incorporates advanced preprocessing, feature selection, and classification strategies. During preprocessing, noise and outliers are effectively managed using techniques such as the Median-KS Test. The feature selection phase employs Genetic Algorithms, which reduce dimensionality and optimize the feature set, ensuring the inclusion of highly relevant attributes. Finally, in the classification phase, an ensemble voting classifier combines the strengths of the Decision Tree, Random Forest, and XGBoost algorithms, significantly improving prediction accuracy. These enhancements enable the proposed method to handle complex patterns more efficiently and reduce false positives and negatives, achieving a balanced performance across all evaluation metrics.
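As a rough illustration of the kind of median-based outlier screening performed during preprocessing, the sketch below applies a modified z-score filter built on the median absolute deviation (MAD). This is a generic stand-in for exposition only, not the paper's Median-KS Test:

```python
def mad_outlier_mask(values, threshold=3.5):
    """Flag outliers via the modified z-score (median / MAD based).
    Generic stand-in; NOT the paper's Median-KS Test."""
    n = len(values)

    def median(xs):
        s = sorted(xs)
        return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * n
    # 0.6745 rescales MAD to be comparable with a standard deviation.
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]

data = [10, 11, 9, 10, 12, 11, 95]   # 95 is an injected outlier
mask = mad_outlier_mask(data)
```

Only the injected value 95 is flagged; the ordinary samples fall well inside the threshold.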
Figure 3 compares five classification models: ENN, CNN, SVM, DBN, and the proposed method. The metrics used for evaluation include the F-measure, FPR, FNR, and MCC, displayed in subplots (a), (b), (c), and (d), respectively. The proposed method minimizes noise and irrelevant attributes by effectively addressing data inconsistencies during preprocessing and utilizing Genetic Algorithms to select the most impactful features. Additionally, the ensemble voting classifier, which combines the strengths of the Decision Tree, Random Forest, and XGBoost models, ensures high accuracy, robust classification, and scalability. The proposed method’s ability to reduce FPR and FNR is critical for real-world applications, where minimizing false alarms and undetected threats is essential. Moreover, its superior MCC scores reflect the proposed method’s capability to maintain consistent performance across balanced and imbalanced datasets, further emphasizing its reliability and adaptability in various scenarios.
To evaluate the robustness of the proposed method under various threat levels, training datasets with different attack percentages (20%, 30%, 40%, and 50%) were generated. This was achieved by adjusting the ratio of botnet to normal traffic using a combination of random under-sampling (for the majority class) and over-sampling (for the minority class) on the training set, without altering the test set. To address the inherent imbalance in the CTU-13 dataset, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to the training data to generate synthetic samples for the botnet class. This produced a more balanced dataset, improved the model's ability to learn minority-class patterns, and enhanced overall classification performance.
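The SMOTE interpolation step can be sketched in a few lines: each synthetic sample lies on the line segment between a randomly chosen minority instance and one of its k nearest minority neighbors. The 2-D points below are toy values; a real pipeline would rely on an established implementation such as imbalanced-learn's:

```python
import random

random.seed(42)

def smote(minority, n_new, k=3):
    """Minimal SMOTE sketch: interpolate each synthetic point between a
    random minority sample and one of its k nearest minority neighbors."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = random.choice(minority)
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: dist2(base, p))[:k]
        nb = random.choice(neighbors)
        gap = random.random()                  # position along the segment
        synthetic.append(tuple(x + gap * (y - x)
                               for x, y in zip(base, nb)))
    return synthetic

# Toy minority (botnet) samples in a 2-D feature space:
botnet = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85), (0.3, 0.7)]
new_pts = smote(botnet, n_new=4)
```

Because every synthetic point is an interpolation, it stays inside the region spanned by the existing minority samples rather than duplicating them exactly.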
Figure 4 illustrates the detection rate performance under varying attack percentages (η) and different proportions of at-risk devices (from 20% to 70%). Subplots (a), (b), (c), and (d) correspond to attack percentages of η = 20%, η = 30%, η = 40%, and η = 50%, respectively.
The proposed method consistently achieves the highest detection rates across all scenarios and percentages of devices at risk, showcasing its robustness and superior anomaly detection capability. Even as the percentage of devices at risk increases (from 20% to 70%), the detection rate of the proposed method declines more gracefully than those of the other methods, demonstrating its ability to handle high-risk conditions effectively. ENN and CNN show moderate detection rates but fail to match the stability and effectiveness of the proposed method as the percentage of devices at risk increases. SVM and DBN exhibit more pronounced declines in detection rate with increasing devices at risk, particularly in high-risk scenarios (η = 50%). The gap between the proposed method and the other models widens as η and the percentage of devices at risk increase, further emphasizing the robustness of the proposed approach. The superior performance of the proposed method can be attributed to its advanced feature selection and classification strategies. The proposed method ensures better generalization and adaptability by integrating techniques like the Genetic Algorithm for optimal feature selection and ensemble classifiers for improved prediction accuracy. Its capability to maintain a high detection rate under challenging conditions stems from the ensemble voting mechanism, which combines the strengths of multiple models (Decision Tree, Random Forest, and XGBoost). This multi-layered approach enables the proposed method to detect complex patterns and minimize false positives and negatives, ensuring consistent performance even in high-risk scenarios.
Figure 5 illustrates the confusion matrices for training and testing phases at varying percentages of data utilization. In the 80% training phase, the method achieves remarkable results with only 27 false positives and 60 false negatives, indicating high precision and sensitivity in detecting botnet activity. Similarly, the model maintains consistent performance in the 70% training phase, with a minimal increase in false positives (28) and a slight reduction in false negatives (33). This demonstrates the robustness of the proposed approach in accurately identifying botnet and normal traffic, even with reduced training data. During the 20% testing phase, the confusion matrix shows only 7 false positives and 20 false negatives, reflecting the model’s high generalization capability and ability to accurately predict botnet instances under unseen data conditions. The model performs exceptionally well in the 30% testing phase, with just 12 false positives and 50 false negatives. This indicates the scalability and reliability of the proposed method in handling larger test datasets.
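The false positive and false negative counts discussed above are read directly off the confusion matrix. Building one for a binary botnet/normal labelling is straightforward; the sketch below uses made-up labels purely for illustration:

```python
def confusion_matrix(y_true, y_pred, positive="botnet"):
    """Return (tp, fp, tn, fn) counts for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, tn, fn

# Hypothetical labels, not taken from the experiments:
y_true = ["botnet", "botnet", "normal", "normal", "botnet", "normal"]
y_pred = ["botnet", "normal", "normal", "botnet", "botnet", "normal"]
counts = confusion_matrix(y_true, y_pred)   # (tp, fp, tn, fn)
```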
Table 4 presents an evaluation of the proposed method's performance in comparison with several existing botnet detection approaches across different datasets. The comparison focuses on key performance metrics, namely accuracy, detection rate, and false positive rate (FPR). The proposed method achieves an accuracy of 98.0%, a detection rate of 95.0%, and a low FPR of 7.8% on the CTU-13 dataset. Compared to Sharma and Babbar (2024) [33], who also utilized the CTU-13 dataset but achieved lower accuracy (95.2%) and a significantly higher FPR (18.0%), the proposed framework demonstrates enhanced effectiveness. Furthermore, although Elnakib et al. (2023) [24] and Ullah and Mahmoud (2019) [19] reported competitive performance on the Bot-IoT and N-BaIoT datasets, respectively, their results indicated lower detection rates and higher false positive rates than the proposed method. These findings confirm the robustness, scalability, and generalization capability of our proposed three-phase anomaly detection framework in securing IoT networks against diverse and sophisticated botnet attacks.
To ensure a robust evaluation, we adopted a five-fold cross-validation scheme. The dataset was randomly partitioned into five equally sized folds, and each fold was used once as a test set while the remaining four were used for training. We report the average and standard deviation of three standard classification metrics: accuracy, detection rate, and FPR. These metrics were chosen to reflect both overall performance and anomaly detection capability, especially in the presence of class imbalance. The results in Table 5 demonstrate the stability and reliability of the proposed framework across multiple runs.
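A minimal sketch of the five-fold protocol and the mean-and-standard-deviation reporting is given below; the splitter is generic, and the per-fold accuracies are placeholders rather than the paper's results:

```python
import random
import statistics

def kfold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and yield (train, test) index lists,
    with each of the k folds serving once as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for test in folds:
        train = [j for f in folds for j in f if f is not test]
        yield train, test

splits = list(kfold_indices(100, k=5))

# Per-fold accuracies would come from the trained model on each split;
# the values below are placeholders to show the reporting step.
fold_acc = [0.979, 0.981, 0.980, 0.978, 0.982]
print(f"accuracy = {statistics.mean(fold_acc):.3f} "
      f"+/- {statistics.stdev(fold_acc):.4f}")
```

Every index appears in exactly one test fold, so each sample is evaluated on exactly once across the five runs.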
To determine the best ensemble strategy, both Hard Voting and Soft Voting mechanisms were evaluated. As presented in Table 6, Soft Voting achieved higher accuracy (98.0%) and a higher detection rate (95.0%) while maintaining a lower false positive rate (7.8%) than Hard Voting. Therefore, Soft Voting was selected as the final voting strategy in the proposed anomaly detection framework.
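The gap between the two strategies is easiest to see on a sample where one confident base model disagrees with two uncertain ones: hard voting counts only labels, while soft voting weights each model by its predicted probabilities. The probabilities below are toy values, not outputs of the trained models:

```python
def hard_vote(predictions):
    """Majority class label across the base classifiers."""
    return max(set(predictions), key=predictions.count)

def soft_vote(probas):
    """Average the per-class probabilities and pick the argmax class."""
    classes = probas[0].keys()
    avg = {c: sum(p[c] for p in probas) / len(probas) for c in classes}
    return max(avg, key=avg.get)

# Three base models (e.g. DT, RF, XGBoost) disagreeing on one sample:
labels = ["normal", "normal", "botnet"]
probs = [{"botnet": 0.45, "normal": 0.55},
         {"botnet": 0.48, "normal": 0.52},
         {"botnet": 0.95, "normal": 0.05}]
```

Here hard voting returns "normal" (two labels against one), while soft voting returns "botnet" because the confident third model dominates the averaged probabilities.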
4.2. Discussion
The superior performance of the proposed GA-ensemble framework can be attributed to its synergistic design. The Genetic Algorithm, enhanced with eagle-inspired search behavior, ensures optimal feature selection by exploring the search space more effectively and avoiding local minima. This reduces dimensionality without compromising accuracy. Furthermore, the dragonfly-inspired simulated annealing algorithm refines hyperparameters in a biologically inspired manner, enhancing model generalization. The ensemble classifier integrates diverse models, namely Decision Tree, Random Forest, and XGBoost, each contributing complementary strengths; their combination provides both a favorable bias–variance trade-off and resilience to imbalanced data.
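As a baseline for the hyperparameter refinement step, plain simulated annealing can be sketched as follows. The one-dimensional objective and cooling schedule are illustrative only, and the dragonfly-inspired swarm behavior of the paper's variant is not reproduced here:

```python
import math
import random

random.seed(1)

def simulated_annealing(score, init, neighbor, t0=1.0, cooling=0.95, steps=200):
    """Plain simulated annealing: always accept improvements, accept worse
    candidates with probability exp(delta / T), cool T geometrically."""
    current, best = init, init
    t = t0
    for _ in range(steps):
        cand = neighbor(current)
        delta = score(cand) - score(current)
        if delta > 0 or random.random() < math.exp(delta / t):
            current = cand
        if score(current) > score(best):
            best = current
        t *= cooling
    return best

# Toy objective over a single hyperparameter, peaking at x = 3:
score = lambda x: -(x - 3) ** 2
neighbor = lambda x: x + random.uniform(-0.5, 0.5)
best = simulated_annealing(score, init=0.0, neighbor=neighbor)
```

Early on, the high temperature lets the search escape poor regions; as the temperature decays the search becomes effectively greedy, settling near the optimum.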
The proposed framework offers several practical advantages for securing real-world IoT environments. Its modular design, combining lightweight preprocessing, nature-inspired feature selection, and an ensemble of efficient classifiers, makes it suitable for deployment in edge-based applications such as smart homes, healthcare monitoring devices, the industrial IoT (IIoT), and autonomous sensor networks. The low false positive and false negative rates ensure minimal disruption to normal system operations, which is critical in mission-critical IoT deployments. The strengths of the model include its high detection accuracy (98%), generalization to diverse attack types, and robustness under varying attack ratios and device risk levels. The ensemble learning strategy combined with adaptive optimization techniques ensures a balance between computational efficiency and detection performance. However, the model has two notable limitations: its performance may degrade when dealing with encrypted or heavily obfuscated traffic, as the current design relies primarily on flow-level statistical features; and despite using automated tuning via swarm-enhanced simulated annealing, the training phase still incurs a degree of computational overhead that must be considered in resource-limited deployment scenarios.