Article

Ensemble-IDS: An Ensemble Learning Framework for Enhancing AI-Based Network Intrusion Detection Tasks

1 Computer and Information Technology Department, Purdue University, Indianapolis, IN 46202, USA
2 Electrical and Computer Engineering Department, Purdue University, Indianapolis, IN 46202, USA
3 Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh 11671, Saudi Arabia
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(19), 10579; https://doi.org/10.3390/app151910579
Submission received: 6 September 2025 / Revised: 25 September 2025 / Accepted: 26 September 2025 / Published: 30 September 2025

Abstract

Modern cybersecurity threats continue to evolve in both complexity and prevalence, demanding advanced solutions for intrusion detection. Traditional AI-based detection systems face significant challenges in model selection, as performance varies considerably across different network environments and attack scenarios. To overcome these limitations, we propose a comprehensive ensemble learning approach that systematically integrates feature selection, model optimization, and rigorous evaluation components. Our framework evaluates fourteen distinct machine learning approaches, ranging from individual classifiers to sophisticated ensemble methods including bagging, boosting, and hybrid stacking/blending architectures. These techniques are applied to multiple base algorithms such as neural networks and tree-based models. Extensive testing was conducted on two complementary benchmark datasets (RoEduNet-SIMARGL2021 and CICIDS-2017) to assess detection capabilities across varied threat landscapes. Our experimental results revealed several key findings. Ensemble techniques consistently surpassed standalone models in detection accuracy, with random forest achieving the best performance on RoEduNet-SIMARGL2021, while blending and bagging yielded near-perfect scores (F1 > 0.996) on CICIDS-2017. Feature selection via information gain demonstrated particular value, reducing model training times by up to 94% while maintaining detection accuracy. Among ensemble methods, XGBoost showed exceptional computational efficiency, whereas stacking and blending architectures delivered maximum accuracy at the expense of greater resource requirements. This research provides practical guidance for security professionals in model selection based on specific operational constraints and threat profiles. To support community advancement, we have made our complete framework publicly available, facilitating reproducibility and future innovation in intrusion detection systems.


1. Introduction

Intrusion detection systems (IDSs) are fundamentally designed to identify unauthorized access, misuse, and attacks on networked systems, whether initiated by external attackers or insider threats [1,2,3]. Conventional IDS approaches often rely on the premise that malicious activities exhibit distinct patterns compared to normal user behavior and that such anomalies can be reliably detected. Recent advancements in artificial intelligence (AI) have driven the development of autonomous intrusion detection solutions [4,5]. To automate threat detection, researchers have employed diverse AI techniques, such as deep learning models [6,7], tree-based classifiers [8,9], regression-based methods [10,11], and ensemble learning algorithms [12,13].
Most AI-driven intrusion detection techniques, apart from random forest, function as independent models without integrating their outputs [14,15]. These models exhibit distinct limitations, including elevated false alarm rates (e.g., some enterprises grapple with over 10,000 daily alerts from AI-powered security tools [16]) or significant missed detections (a critical concern in high-stakes network environments [17]).
Earlier research on AI-based IDS primarily prioritized individual algorithm accuracy rather than leveraging synergistic combinations of multiple techniques. This gap has underscored the necessity of adopting ensemble learning to improve detection robustness [18,19]. Recent efforts have increasingly explored ensemble-based IDS solutions, as seen in studies such as [20,21,22,23,24,25,26,27,28,29,30,31]. Some frameworks target anomaly detection by distinguishing malicious from benign traffic [21,22,24,26,27,29,32], while others classify specific attack types (e.g., DoS, port scans) alongside normal traffic [20,25,28,30,31,33].
Common ensemble strategies include boosting, stacking, and bagging, applied to base models such as decision trees, K-nearest neighbors (KNN), and neural networks. Performance is typically assessed using metrics such as precision, recall, and F1-score, with evaluations conducted on benchmark datasets (e.g., NSL-KDD) or real-world networks (e.g., Palo Alto [29] or real-time systems such as Kitsune [22]).
Notable contributions include dataset generation and benchmarking via ensemble methods [20] and AI model selection through ensemble optimization [24]. However, existing studies often narrow their focus to specific ensemble techniques applied to a small subset of models, leaving broader comparisons across diverse datasets and methodologies unexplored—a limitation that may restrict their wider adoption.
This study sought to bridge the identified research gap by systematically evaluating a range of ensemble learning techniques for network intrusion detection systems (NIDS). We implemented multiple standalone AI models alongside both basic and advanced ensemble learning frameworks to assess their effectiveness in the NIDS context. Building upon previous studies such as [20,21,22,23,24,25,26,27,28,29,30,31,32,33], which have explored a variety of ensemble strategies, we categorized our proposed framework accordingly to facilitate a structured analysis.
  • Dataset Preparation: This initial step included importing relevant intrusion detection datasets such as CICIDS-2017 [34] and RoEduNet-SIMARGL2021 [35] for subsequent analysis.
  • Feature Reduction: Prior to model development, we employed feature selection techniques to enhance detection accuracy and decrease computational load. Specifically, we utilized information gain (IG) and K-best algorithms to extract the most informative attributes. From these, we generated multiple feature subsets—namely All_features, IG Top-5, IG Top-10, K-best Top-5, and K-best Top-10—which were then consistently applied across all model training pipelines.
  • Training Individual Models: With the selected features, we developed baseline models including decision trees [8,9], logistic regression [10,11], neural networks [6,7], and K-nearest neighbors. Each model’s performance was measured using standard classification metrics such as accuracy, precision, recall, and F1-score.
  • Basic Ensemble Strategies: We then incorporated simple ensemble techniques, including majority voting, weighted averaging, and mean prediction aggregation. The same evaluation metrics were used to assess their effectiveness in comparison to individual models.
  • Sophisticated Ensemble Techniques: This phase integrated more advanced ensemble learning methods, such as bagging, boosting, stacking, and blending. Random forest [12,13], which relies on aggregating multiple decision trees, was categorized here under bagging-based methods. Again, model evaluation was conducted using accuracy, precision, recall, and F1-score.
  • Model Comparison and Insights: In the final step, we performed a comprehensive comparison of all individual and ensemble models to determine the most efficient configurations for intrusion detection. We also examined how different feature subsets influenced model performance, helping identify the most effective feature selection approaches for IDS optimization.
Our study introduces a diverse range of ensemble configurations, including techniques such as bagging, stacking, and boosting, applied across various foundational learners such as decision trees, logistic regression, neural networks, random forests, and others. These methodological variations emphasize the uniqueness of our contribution in comparison to earlier works, as elaborated in Section 2.
To assess the robustness and adaptability of our ensemble framework, we utilized two widely recognized intrusion detection datasets, each offering distinct characteristics. The first is the RoEduNet-SIMARGL2021 dataset [35], compiled under the European Union’s SIMARGL project. This dataset contains live traffic features and simulates real-world network activity, making it particularly appropriate for intrusion detection applications. Notably, few studies have employed ensemble learning extensively on this dataset, highlighting a gap our work aimed to address. The second dataset is CICIDS-2017 [34], developed by the Canadian Institute for Cybersecurity. This benchmark dataset encompasses a range of intrusion types and remains a staple in IDS research.
We systematically examined different ensemble strategies across both datasets, incorporating a wide selection of machine learning algorithms. These included logistic regression (LR), decision tree (DT), K-nearest neighbors (KNN), multi-layer perceptron (MLP), adaptive boosting (ADA), extreme gradient boosting (XGB), CatBoost (CAT), gradient boosting (GB), averaging (Avg), max voting, weighted averaging, and random forest (RF). For each method, we computed a comprehensive set of evaluation metrics across both datasets to measure detection performance. Our analysis covered not only raw metric values but also a comparative ranking of the methods based on F1-score, enabling clearer identification of high-performing models. We further categorized the algorithms by their effectiveness in detecting network intrusions under different scenarios, offering an organized view of their relative strengths. We also performed pairwise statistical significance tests (paired t-tests) on the F1-scores of all models across multiple feature selection settings. These tests were conducted for both datasets—RoEduNet-SIMARGL2021 and CICIDS-2017—to evaluate whether observed performance differences between models were statistically meaningful.
This evaluation framework empowers researchers and practitioners to make data-driven choices when selecting ensemble methods for IDS. Our study significantly contributes to closing the methodological gap in ensemble-based IDS research by offering a detailed comparative analysis. Our assessment included vital performance indicators such as accuracy, precision, recall, and F1-score, along with runtime analysis to evaluate operational efficiency. By doing so, we provide a holistic view of each method’s viability in real-world deployment. Our contributions not only benchmark current ensemble approaches but also lay a foundation for future developments in secure and intelligent network defense systems. The selected metrics—accuracy, precision, recall, and F1-score—offer a comprehensive view of model performance, especially in the context of imbalanced datasets common in intrusion detection. Accuracy provides an overall correctness measure, while precision and recall capture the model’s ability to correctly identify attacks without excessive false alarms. The F1-score balances precision and recall, making it ideal for evaluating detection reliability. Runtime analysis complements these metrics by assessing the operational efficiency and scalability of each model, which is critical for real-time deployment scenarios.
  • Summary of Key Contributions: This paper presents a number of core contributions, outlined as follows:
  • In-depth Comparison of Learning Approaches: We performed an extensive comparative study involving a variety of standalone machine learning models and both basic and advanced ensemble strategies applied to intrusion detection tasks.
  • Multi-Metric Evaluation: The proposed framework was benchmarked using critical performance indicators relevant to cybersecurity, including classification metrics such as accuracy, precision, recall, and F1-score, alongside execution time, to assess the efficiency of different learning methods when applied to IDS.
  • Cross-Dataset Analysis: Our experiments utilized two widely recognized and contrasting IDS datasets—RoEduNet-SIMARGL2021 and CICIDS-2017—to ensure a thorough and diversified performance evaluation in multiple intrusion contexts.
  • Model Effectiveness Ranking: We present a performance-based ranking of individual and ensemble models, organized in descending order of F1-score, highlighting the comparative strengths and weaknesses of each method.
  • Advancing Ensemble Learning in IDS Research: By validating the success of various ensemble approaches, our work broadens the scope of ensemble learning techniques in intrusion detection systems and sets the stage for future explorations in the domain.
  • Open-Source Implementation: To support transparency and reproducibility, we provide public access to our implementation, enabling the research community to extend our framework with additional models or datasets. (The GitHub repository for the complete source code is available at https://github.com/sm3a96/A-Comprehensive-Comparative-Study-of-Individual-ML-Models-and-Ensemble-Strategies-for-IDS.git, accessed on 30 August 2025).

2. Related Work

2.1. Overview of Prior Ensemble Learning Approaches for IDS

Prior studies have extensively explored the role of ensemble learning in improving IDS. A notable survey [23] systematically reviewed developments in ensemble-based IDS from 2009 to 2020. The work examined classical ensemble strategies—such as bagging, boosting, stacking, and majority voting—applied across datasets including KDD’99, NSL-KDD, Kyoto 2006+, and AWID. It also covered a diverse set of learning algorithms, such as neural networks (NNs), support vector machines (SVMs), decision trees (DTs), fuzzy clustering, and radial basis functions (RBFs). This extensive review underscored both existing strengths and limitations of current ensemble methodologies, providing future directions for innovation.
Binary Classification Approaches in Anomaly Detection: For two-class intrusion detection, an IDS architecture was proposed in [21] that integrates various classifiers including Gaussian Naive Bayes, logistic regression, and decision trees. These models were combined using stochastic gradient descent on the CICIDS-2017, UNSW-NB15, and KDD’99 datasets. Chi-square-based feature selection was applied to refine the input space. Despite its effectiveness, the model suffered from imbalanced data, and the authors suggest exploring data augmentation and alternative ensemble methods for performance improvement.
Another contribution, developed in [26], was a binary anomaly detection pipeline with ensemble learners such as random forest, AdaBoost, and gradient boosting, aggregated via soft voting. Similarly, in [27], classifiers such as LR, DT, NB, NN, and SVM were utilized within an ensemble setup to enhance detection accuracy on the NSL-KDD and UNSW-NB15 datasets. This study also experimented with feature selection, emphasizing the need for more realistic datasets and the potential of unsupervised learning. Another study [29] incorporated real-world data, such as Palo Alto logs, into its IDS ensemble framework using weighted voting across SVM, autoencoder, and random forest. Although effective in reducing false positives, its limitations in scalability and voting mechanism diversity were noted.
In the context of IoT security, one work [32] introduced an ensemble IDS based on the TON-IoT dataset. The method involved stacking and voting over base models such as random forest, KNN, DT, and LR. While the framework was successful, it lacked evaluation on additional datasets and excluded other ensemble techniques such as bagging.
In contrast, another work [22] focused on real-time detection via an ensemble of autoencoders, offering a unique approach suited for streaming data. Meanwhile, ref. [24] explored ensemble model selection to minimize overfitting in small binary datasets using random forest, Naive Bayes, and logistic regression. However, its reliance on non-IDS datasets and computationally intensive validation procedures were significant drawbacks.
Multiclass Classification with Ensemble Learning: Several works extended ensemble learning for multiclass IDS classification [16,20,33]. In [33], stacking was applied over DNN, CNN, RNN, and LSTM to classify traffic from the CICIDS-2017 and ToN_IoT datasets. Although this method enhanced accuracy, it faced constraints due to computational costs and the absence of experimentation in real IoT contexts. In [20], the GTCS dataset was introduced and adaptive ensemble learning was applied (via J48, MLP, and IBK) with majority voting, although the work lacked real-world testing and model diversity. The work in [25] addressed class imbalance by incorporating random forest into a framework combining LGP, ANFIS, and weighted voting. Nevertheless, challenges in assigning optimal weights and generalizability remained. Ref. [30] utilized bagging over NB, PART, and AdaBoost for the KDD’99 dataset but was limited in terms of dataset breadth and learner variety.
Recent advancements in resilience modeling have also contributed to the broader understanding of network security. For instance, the work in [36] introduced a resilience recovery method for complex traffic networks using trend forecasting, which models fault propagation and recovery dynamics through a modified SIRD-R framework. Such approaches offer complementary perspectives to intrusion detection by emphasizing proactive recovery and adaptive system behavior, which can be integrated with ensemble-based IDS frameworks to enhance robustness against evolving threats.

2.2. Our Contributions

We present a comprehensive IDS classification framework leveraging both standalone AI models and a range of ensemble learning approaches. Using two contrasting datasets—RoEduNet-SIMARGL2021 and CICIDS-2017—we evaluated models based on accuracy, precision, recall, and F1-score. Our process began with complete feature utilization and dataset preparation. We trained baseline classifiers including logistic regression (LR), decision tree (DT), random forest (RF), multilayer perceptron (MLP), and K-nearest neighbors (KNN). Subsequently, we applied simple ensemble strategies including averaging, max voting, and weighted averaging. Advanced ensemble approaches—including bagging, boosting (e.g., ADA, GB, XGB, CAT), stacking, and blending—were also integrated. Comprehensive benchmarking was conducted across all configurations to identify optimal combinations. A major strength of our work is the inclusion and detailed evaluation of the underexplored RoEduNet-SIMARGL2021 dataset, providing fresh insights into ensemble IDS performance. This detailed analysis offers a solid foundation for future enhancements in ensemble-based intrusion detection. Table 1 shows the differences between our work and prior related ones.

3. Background and Problem Statement

This section lays the foundation for understanding the landscape of network intrusion detection, the limitations posed by individual AI models, the motivation for using ensemble methods, and the associated evaluation challenges within this domain.

3.1. Categories of Network Intrusions

Network intrusions can be classified using the MITRE ATT&CK framework [37], which provides a comprehensive taxonomy of adversarial tactics and techniques. In our evaluation, we focused on key attack types from this framework:
Normal Traffic: This category represents standard, legitimate network operations without any malicious activity.
Malware/Malware Repository Intelligence [MITRE ATT&CK ID: DS0004]: This category involves analysis of software designed with malicious intent. Identifying characteristics such as code signatures, debugging metadata, and code reuse patterns helps trace malware origin or link it to known threat actors. Shared features may reveal malware sourced from common platforms or providers [38].
PortScan (PS)/Network Service Discovery [MITRE ATT&CK ID: T1046]: Port scanning is a reconnaissance activity used to identify open ports and services. It acts as a precursor to full-scale attacks by revealing vulnerable points in the target system [39].
Denial of Service (DoS)/Network Denial of Service [MITRE ATT&CK ID: T1498]: The approach aims to render services inaccessible by overwhelming the target with traffic or connection requests, ultimately exhausting server resources and causing downtime [40].
Brute Force [MITRE ATT&CK ID: T1110]: This method involves repeated attempts to guess authentication credentials, exploiting weak or common passwords to gain unauthorized access [40].
Web Attack/Initial Access [MITRE ATT&CK ID: TA0001, T1659, T1189]: This technique targets vulnerabilities in web applications to gain unauthorized entry. These attacks exploit misconfigurations, software flaws, or exposed services to infiltrate systems [37,41].
Infiltration/Initial Access [MITRE ATT&CK ID: TA0001]: This category refers to attempts to gain unauthorized entry into systems, typically via phishing or by exploiting exposed services, potentially leading to persistent access.
Botnet/Compromise Infrastructure [MITRE ATT&CK ID: T1584.005, T1059, T1036, T1070]: Botnets consist of compromised devices controlled remotely via scripts, often used for scalable and automated attacks across multiple vectors.
Probe Attack/Surveillance [MITRE ATT&CK ID: T1595]: These intrusions gather intelligence on network topologies and exposed services. Tactics include ping sweeps, DNS zone transfers, and other scanning methods [42,43,44].

3.2. Intrusion Detection Systems

The sophistication of modern cyber threats requires resilient monitoring systems. IDSs serve as the primary line of defense against malicious actors attempting unauthorized access [45,46]. Traditional IDS solutions detect anomalies by observing deviations from normal user behavior [47]. The incorporation of AI models into IDSs over the last decade has substantially improved detection capabilities [48], yet significant gaps remain in achieving trustworthy, explainable, and generalizable solutions.

3.3. Limitations of Individual AI Models

Although machine learning models have demonstrated strong performance in IDS, their individual limitations hinder their broader applicability. These models—such as decision trees (DTs), K-nearest neighbors (KNN), support vector machines (SVMs), and deep neural networks (DNNs)—struggle with dataset complexity and often fail to generalize across different types of intrusions. They may exhibit elevated false positive [6] or false negative rates [17], making them unreliable for mission-critical tasks such as real-time intrusion prevention. In addition, base models vary in their computational needs and transparency in decision-making. KNN requires significant memory and can be misled by noise and outliers. Neural networks demand large datasets and may suffer from poor interpretability. Logistic regression offers simplicity but is limited in modeling complex relationships. Decision trees are easy to train and interpret but may overfit the data. These disparities emphasize the difficulty of relying on a single AI model for IDS.

3.4. Need for Ensemble Learning in IDS

To overcome these challenges, ensemble learning methods—such as bagging, boosting, and stacking—combine the strengths of multiple models to enhance overall accuracy and robustness [18,19]. Ensembles help mitigate individual weaknesses by integrating diverse learners, thereby reducing bias, variance, and overfitting risks.
These ensemble strategies are especially beneficial in intrusion detection, where one-size-fits-all models rarely succeed. Through model diversity and collaborative voting or aggregation, ensemble methods provide more resilient and adaptable detection mechanisms, leading to improved identification of sophisticated and evolving cyber threats.

3.5. Key Advantages of Ensemble Methods

Ensemble learning is an evolving discipline in machine learning that focuses on combining multiple learning models to improve overall prediction accuracy and model stability. This approach leverages the diversity of various base models to counteract individual weaknesses, resulting in a more robust predictive system. The most widely adopted ensemble strategies include bagging, boosting, and stacking.
Bagging (Bootstrap Aggregating): This technique involves generating multiple versions of a training dataset by sampling with replacement. Each variant is used to train an independent model instance. The predictions from these models are then aggregated—commonly via majority voting for classification or averaging for regression—to produce a final output. Bagging primarily aims to reduce variance and prevent overfitting by promoting model diversity.
Boosting: Unlike bagging, boosting adopts a sequential training approach where each subsequent model is trained to focus on the errors made by the previous ones. Misclassified samples are assigned greater importance in the training of subsequent models. This progressive correction of mistakes leads to a more accurate final ensemble by minimizing both bias and variance.
Stacking: This method introduces a hierarchical learning process where multiple heterogeneous base models are first trained independently. Their predictions are then fed into a higher-level model, known as a meta-learner or meta-model. The meta-learner synthesizes the outputs of base models to make the final prediction, thereby capturing intricate dependencies among features and predictions.
Together, these ensemble methodologies offer powerful mechanisms for enhancing the predictive strength of machine learning models. By aggregating insights from diverse learners, they not only improve performance metrics but also increase model robustness across different tasks and datasets.
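To make these strategies concrete, the following minimal scikit-learn sketch contrasts bagging and boosting. The variable names X_train, y_train, X_test, and y_test denote an already prepared train/test split, and the hyperparameters are illustrative rather than the configurations used in our experiments.

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

# Bagging: independent learners (decision trees by default) are fit on bootstrap
# resamples of the training data and combined by majority vote, reducing variance.
bagging = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=42)

# Boosting: learners are fit sequentially, each giving more weight to samples
# misclassified by its predecessors, which primarily reduces bias.
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    model.fit(X_train, y_train)          # X_train, y_train: prepared IDS data (assumed)
    print(name, "accuracy:", model.score(X_test, y_test))
```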
Application of Ensemble Learning in Our Framework: In this study, we systematically explored several ensemble learning techniques within the scope of network intrusion detection. Our framework exclusively focused on leveraging ensemble strategies—built upon foundational base models—for detecting anomalous network activities. To thoroughly assess the efficacy and generalizability of these ensemble approaches, we performed a detailed comparative analysis on two heterogeneous datasets, each characterized by unique traffic patterns and attack profiles.
This evaluation allowed us to investigate how different ensemble learning schemes impact detection accuracy, generalization capability, and computational efficiency. The resulting insights contribute to refining intrusion detection systems and guiding future applications of ensemble learning in cybersecurity.

4. Framework

This research introduces a novel ensemble learning approach designed to enhance detection performance across multiple network security applications. The proposed system provides security professionals with a structured methodology for optimizing threat identification and attack classification processes, ultimately strengthening organizational cyber defense capabilities. As illustrated in Figure 1, our comprehensive methodology evaluates multiple ensemble strategies to determine their effectiveness for modern intrusion detection systems.

4.1. Data Preparation

Both the CICIDS-2017 and RoEduNet-SIMARGL2021 datasets were carefully processed to ensure optimal compatibility with intrusion detection algorithms. The CICIDS-2017 dataset required several cleaning operations: elimination of redundant entries, mean-value imputation for missing data in the “Flow Bytes/s” attribute (or feature), standardization of feature naming conventions, and numerical conversion of categorical labels through encoding techniques.
The RoEduNet-SIMARGL2021 dataset underwent comparable refinement procedures, including removal of duplicate entries, elimination of non-varying attributes, mean-based imputation for incomplete values, and numerical transformation of categorical variables using ordinal encoding. These preparatory measures significantly enhanced data integrity for machine learning applications.
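For concreteness, a minimal pandas sketch of these cleaning steps is given below. The file paths and variable names are placeholders; only the “Flow Bytes/s” and label columns are taken from the dataset descriptions, so this should be read as an assumed outline rather than our exact preprocessing code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# CICIDS-2017: deduplicate, impute "Flow Bytes/s" with its mean, encode labels.
df = pd.read_csv("cicids2017_merged.csv")          # placeholder path
df.columns = df.columns.str.strip()                # standardize feature names
df = df.drop_duplicates()                          # remove redundant entries
df["Flow Bytes/s"] = pd.to_numeric(df["Flow Bytes/s"], errors="coerce")
df["Flow Bytes/s"] = df["Flow Bytes/s"].fillna(df["Flow Bytes/s"].mean())
df["Label"] = LabelEncoder().fit_transform(df["Label"])

# RoEduNet-SIMARGL2021: deduplicate, drop non-varying columns, ordinal-encode
# categorical attributes, and mean-impute remaining gaps.
sim = pd.read_csv("simargl2021.csv").drop_duplicates()   # placeholder path
sim = sim.loc[:, sim.nunique() > 1]
cat_cols = sim.select_dtypes(include="object").columns
if len(cat_cols) > 0:
    sim[cat_cols] = OrdinalEncoder().fit_transform(sim[cat_cols])
sim = sim.fillna(sim.mean(numeric_only=True))
```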

4.2. Feature Optimization

To maximize detection accuracy while minimizing computational overhead, we implemented rigorous feature selection protocols. Our approach leveraged two distinct statistical methods: information gain (measuring uncertainty reduction) and K-Best ANOVA F-score (evaluating inter-class variance). Each technique independently identified the 10 most discriminative features, with both methods offering unique perspectives on feature relevance. The finalized feature sets were uniformly applied across all experimental models to maintain evaluation consistency. By evaluating models across multiple feature subsets (e.g., IG Top-5, K-Best Top-10), we demonstrated that certain configurations maintain high detection performance even with reduced or abstracted feature sets. This suggests resilience to feature drift and adaptability to unseen attack vectors.
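A minimal sketch of this selection step is shown below, using mutual information as the information gain criterion and SelectKBest with the ANOVA F-score. Here X and y denote the prepared feature matrix (as a DataFrame) and encoded labels; these names and the random seed are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Information gain (mutual information) ranking; keep the Top-5 / Top-10 columns.
ig_scores = mutual_info_classif(X, y, random_state=42)
ig_top10 = X.columns[np.argsort(ig_scores)[::-1][:10]]

# K-Best selection using the ANOVA F-score.
kbest = SelectKBest(score_func=f_classif, k=10).fit(X, y)
kbest_top10 = X.columns[np.argsort(kbest.scores_)[::-1][:10]]

# Feature subsets reused consistently across all training pipelines.
subsets = {
    "All_features": list(X.columns),
    "IG Top-5": list(ig_top10[:5]),
    "IG Top-10": list(ig_top10),
    "K-best Top-5": list(kbest_top10[:5]),
    "K-best Top-10": list(kbest_top10),
}
```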

4.2.1. CICIDS-2017 Feature Analysis

The feature selection process for CICIDS-2017 employed both information gain and ANOVA F-score methodologies. Table 2 displays the highest-ranked features identified by each approach.

4.2.2. RoEduNet-SIMARGL2021 Feature Analysis

The identical feature selection methodology was employed for the RoEduNet- SIMARGL2021 dataset, with both the information gain and ANOVA F-score approaches being used. Table 3 summarizes the ten most significant features identified by each selection criterion, demonstrating their discriminative capabilities for attack classification.

4.3. Algorithm Selection and Methodology

This section outlines our approach to selecting both core machine learning algorithms and their ensemble combinations for enhanced intrusion detection performance.

4.3.1. Core Single Classification Algorithms

We considered four fundamentally different machine learning approaches to ensure diverse modeling capabilities:
  • Decision Tree Classifiers: These hierarchical models offer transparent decision pathways through recursive data partitioning, making them valuable for interpretable security analytics.
  • K-Nearest Neighbors: This distance-based algorithm classifies network events by comparing them to the most similar historical instances, effectively capturing complex attack patterns through local approximations.
  • Multilayer Perceptrons: Our neural network implementation utilizes multiple hidden layers to learn sophisticated nonlinear relationships in network traffic data.
  • Logistic Regression: Serving as our baseline linear model, this algorithm establishes fundamental discriminative boundaries between attack and normal traffic patterns.

4.3.2. Basic Ensemble Strategies

To combine the strengths of individual models, we use three fundamental aggregation approaches (a brief implementation sketch follows the list):
  • Prediction Averaging: This technique synthesizes outputs from multiple classifiers through arithmetic mean computation, effectively smoothing out individual model biases.
  • Plurality (Majority) Voting: Our voting system determines final classifications by selecting the most frequently predicted class among all constituent models.
  • Performance-Weighted Combinations: More accurate models were assigned greater influence through empirically determined weighting coefficients, as detailed in Section 5.
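The sketch below illustrates the three aggregation strategies with scikit-learn’s VotingClassifier. The base learners and the weights shown are placeholders; in our experiments the weights were determined empirically, as noted above.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

base = [("lr", LogisticRegression(max_iter=500)),
        ("dt", DecisionTreeClassifier()),
        ("knn", KNeighborsClassifier())]

# Prediction averaging: mean of predicted class probabilities ("soft" voting).
avg_ens = VotingClassifier(estimators=base, voting="soft")

# Plurality voting: the most frequently predicted class wins ("hard" voting).
vote_ens = VotingClassifier(estimators=base, voting="hard")

# Performance-weighted combination: more accurate models get larger weights
# (the weights below are illustrative placeholders).
wavg_ens = VotingClassifier(estimators=base, voting="soft", weights=[0.2, 0.5, 0.3])

for name, ens in [("averaging", avg_ens), ("max voting", vote_ens),
                  ("weighted avg", wavg_ens)]:
    ens.fit(X_train, y_train)            # X_train, y_train: prepared IDS data (assumed)
    print(name, ens.score(X_test, y_test))
```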

4.3.3. Advanced Ensemble Methods

To enhance model performance, we employ several advanced ensemble techniques (a brief implementation sketch follows the list):
  • Bagging: This method generates multiple bootstrapped datasets, training base learners independently on each. Aggregating their predictions reduces variance and boosts robustness. Random forest, an example of bagging, combines many decision trees to prevent overfitting and maintain consistent performance.
  • Blending: Blending combines outputs from various base learners as input features to a meta-learner, improving generalization by leveraging model diversity.
  • Boosting: Boosting sequentially trains models that focus on correcting previous errors, placing higher weights on misclassified samples to iteratively refine prediction accuracy.
  • Stacking: This hierarchical technique trains multiple base learners and feeds their predictions into a meta-model, which learns the optimal way to combine them, capturing complex relationships among predictions.
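A compact sketch of the stacking and blending variants follows. The choice of base learners is illustrative, while the decision-tree meta-learner mirrors the configuration described in Section 4.4; X_train, y_train, and X_test are assumed to come from the preprocessing stage.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

base = [("rf", RandomForestClassifier(n_estimators=100)),
        ("knn", KNeighborsClassifier()),
        ("mlp", MLPClassifier(max_iter=300))]

# Stacking: base-learner predictions (obtained via internal cross-validation)
# become the inputs of a decision-tree meta-learner.
stack = StackingClassifier(estimators=base,
                           final_estimator=DecisionTreeClassifier(), cv=3)
stack.fit(X_train, y_train)

# Blending: the meta-learner is instead trained on a single hold-out split.
X_fit, X_blend, y_fit, y_blend = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42)
meta_features = np.column_stack(
    [m.fit(X_fit, y_fit).predict(X_blend) for _, m in base])
meta_model = DecisionTreeClassifier().fit(meta_features, y_blend)

# Blended prediction on unseen traffic.
test_meta = np.column_stack([m.predict(X_test) for _, m in base])
y_pred = meta_model.predict(test_meta)
```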

4.4. Model Development and Training

Our implementation uses Python 3.13, beginning with individual base models, advancing to simple ensemble methods, and culminating in advanced ensemble techniques. Prior to training, feature selection via information gain and K-Best methods identifies the most informative attributes, reducing complexity and improving performance. We also evaluate models trained on all features. To utilize computational resources efficiently, TensorFlow’s tf.distribute.MirroredStrategy() enables synchronous multi-GPU training by replicating models across GPUs, aggregating gradients to maximize throughput and consistency. For individual models, each base learner—decision trees, random forests, neural networks (MLP), and logistic regression—is implemented and trained separately using libraries including scikit-learn, TensorFlow, and Keras. For simple ensemble methods, we combine individual model predictions using averaging, max voting, and weighted averaging ensembles. Training leverages GPU acceleration. For advanced ensemble methods, techniques including bagging, blending, boosting (AdaBoost, CatBoost, gradient boosting, XGBoost), and stacking are implemented using scikit-learn and TensorFlow with multi-GPU support. Bagging uses random forest as a base; blending and stacking train meta-models (typically decision trees) on base learner predictions. Boosting methods follow standard iterative procedures.
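As an illustration of the multi-GPU setup, the sketch below builds a small Keras MLP under tf.distribute.MirroredStrategy(). The layer sizes, epochs, and batch size are placeholders rather than our actual hyperparameters (which are listed in Appendix A.1), and X_train/y_train are assumed to be the preprocessed inputs.

```python
import numpy as np
import tensorflow as tf

# Replicate the model on every visible GPU; gradients are aggregated across replicas.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

num_classes = len(np.unique(y_train))   # y_train: integer-encoded labels (assumed)

with strategy.scope():
    mlp = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(X_train.shape[1],)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    mlp.compile(optimizer="adam",
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])

mlp.fit(X_train, y_train, epochs=20, batch_size=1024, validation_split=0.1)
```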

4.5. Evaluation Metrics and Model Selection

We assessed models based on accuracy, precision, recall, F1-score, and runtime to balance effectiveness with computational efficiency. Models were chosen for their proven utility in IDS literature and diversity in learning principles, enabling robust comparative analysis. The chosen models encompassed a diverse range of learning paradigms—linear (LR), tree-based (DT, RF, GB variants), instance-based (KNN), and neural networks (MLP)—ensuring broad coverage of algorithmic behavior under different intrusion scenarios. Baseline aggregation is provided by ensemble strategies including Avg, Max Voting, and Weighted Avg, while advanced techniques including ADA, XGB, and CAT offer improved performance through feature sensitivity and iterative refinement.

4.6. Key Network Intrusion Features

Table 4 and Table 5 list and describe important features from the RoEduNet-SIMARGL2021 and CICIDS-2017 datasets, essential for understanding model inputs and their relevance. Table 6 presents a comparative overview of the two network intrusion datasets used in this study. CICIDS-2017 includes 7 attack categories and 78 features across approximately 2.78 million samples, while RoEduNet-SIMARGL2021 offers a significantly larger scale with over 31 million samples, 3 attack labels, and 29 features. This contrast highlights the diversity in dataset complexity and volume, supporting robust evaluation across varied threat landscapes. While all features listed in Table 7 were used in initial experiments, highlighting these key characteristics aids interpretability and understanding. We emphasize that we also used feature selection methods to choose the best features in order to test the performance (both accuracy- and efficiency-related) of both ensemble and single methods under this selection.

5. Foundations of Evaluation

Our experimental evaluation focused on answering the following research questions:
  • Which individual machine learning models demonstrate optimal performance for specific network intrusion detection datasets?
  • Among various ensemble techniques, which approach yields the most effective detection capabilities for given cybersecurity scenarios?
  • How do evaluated methods compare across performance indicators, including
    • Detection precision, attack identification sensitivity, and accuracy?
    • Computational efficiency and runtime?
  • To what extent does feature selection influence:
    • Model detection performance?
    • Computational resource requirements?
  • What are the practical benefits and limitations of applying ensemble learning methods in real-world intrusion detection systems?

5.1. Experimental Datasets

RoEduNet-SIMARGL2021 Dataset [35]: Developed through the EU-funded SIMARGL project in partnership with Romania’s national research network, this collection contains genuine network traffic captures with comprehensive flow-level attributes. The data organization follows the Netflow [51] paradigm, mirroring the industry-standard format for network monitoring established by Cisco Systems.
CICIDS-2017 Dataset [34]: Created by the University of New Brunswick’s cybersecurity research team, this reference dataset includes six categories of modern network attacks: credential brute-forcing, heartbleed exploits, botnet communications, DoS floods, port-scanning activities, and web application attacks. The traffic patterns incorporate realistic user behavior simulations through the B-Profile methodology [52], ensuring authentic network conditions.
Dataset Characteristics: Key attributes including volume, attack diversity (class distribution), and feature dimensionality are quantitatively compared in Table 6.

5.2. Computational Environment

Hardware Configuration: All experiments were executed on a state-of-the-art computing cluster designed for machine learning workloads. The system features dual NVIDIA A100 accelerators across 64 compute nodes, with each node containing 256 GB RAM and a 64-core AMD EPYC 7713 CPU operating at 2.0 GHz (225W TDP). This configuration delivers up to 7 petaFLOPs of theoretical performance, providing ample resources for demanding AI workloads [53].
Software Stack: Our implementation leverages Python’s scientific computing ecosystem, utilizing specialized machine learning frameworks (Keras, Scikit-learn) alongside essential data processing and visualization libraries (Pandas 2.3.1, Matplotlib 3.10.3). This ensured both methodological transparency and experimental reproducibility.

5.3. Performance Assessment Criteria

To rigorously evaluate IDS effectiveness, we employed four standard classification metrics derived from confusion matrix analysis:
  • Classification Accuracy [(TP + TN)/Total]: Overall correct prediction rate across all traffic classes.
  • Attack Precision [TP/(TP + FP)]: Proportion of correctly identified attacks among all positive predictions.
  • Threat Detection Rate [TP/(TP + FN)]: Percentage of actual attacks successfully detected (also known as recall or sensitivity).
  • F1-Measure [2TP/(2TP + FP + FN)]: Balanced metric combining precision and recall performance.
We additionally measure computational efficiency through execution time analysis, providing practical insights into real-world deployment feasibility for each approach.
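The sketch below shows how these metrics and the runtime measurements can be computed for any fitted classifier with scikit-learn; the weighted averaging of the multiclass metrics and the variable names are assumptions for illustration, not necessarily the exact settings used to produce our tables.

```python
import time
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

t0 = time.time()
model.fit(X_train, y_train)              # any evaluated classifier (placeholder)
train_time = time.time() - t0

t0 = time.time()
y_pred = model.predict(X_test)
predict_time = time.time() - t0

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="weighted"))
print("Recall   :", recall_score(y_test, y_pred, average="weighted"))
print("F1-score :", f1_score(y_test, y_pred, average="weighted"))
print("Training / prediction / total time (s):",
      train_time, predict_time, train_time + predict_time)
```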

5.4. Machine Learning Methodology

Our evaluation framework incorporates both fundamental classifiers and their ensemble combinations:
  • (A) Core Classification Models:
  • Neural Networks: MLP architecture [54] for learning complex traffic patterns.
  • Decision Trees: Rule-based classifier [55] offering interpretable decisions.
  • Logistic Regression: Linear probabilistic model [56] serving as the baseline.
  • KNN Algorithm: Instance-based learner [57] for local pattern recognition.
  • (B) Ensemble Strategies:
  • Boosting Variants: CAT [58], LGBM [59], ADA [60], GB [61], and XGBoost [62].
  • Composite Methods: Stacking, blending, and random forest implementations.
  • Basic Aggregators: Majority voting [63], prediction averaging, and performance-weighted combinations.
Complete hyperparameter configurations for all models are documented in Appendix A.1 (we refer to Table A1 for all configurations for different models considered in our work). The subsequent section presents detailed evaluation outcomes from this comprehensive experimental framework.  
  • (C) Potential of Our Framework to Capture Zero-day Attacks: We would like to emphasize that several components of our framework indirectly support zero-day detection due to the following reasons:
  • Model Diversity: Our ensemble configurations—particularly bagging, boosting, and stacking—combine heterogeneous base learners (e.g., decision trees, KNN, MLP, logistic regression). This diversity enhances generalization and robustness, which are essential for identifying anomalous patterns not seen during training.
  • Anomaly Detection Potential: Some of the base models used (e.g., KNN, decision trees) are inherently capable of identifying outliers or deviations from learned patterns. When integrated into ensemble strategies, these models contribute to the detection of atypical traffic, which may include zero-day attacks. 

6. In-Depth Evaluation Results

We now detail the main results and insights found from our evaluation experiments on the two datasets considered in this work.

6.1. RoEduNet-SIMARGL2021 Evaluation

6.1.1. Performance Analysis

Our evaluation of the RoEduNet-SIMARGL2021 dataset (Table 8 and Table 9) revealed notable differences in model effectiveness across various configurations.
  • Flawless Detection:
    -
    Individual Models: The decision tree algorithm achieved ideal metrics (all scores = 1.0) regardless of feature selection approach, including complete feature sets and reduced subsets (Table 8). This result answers the first research question.
    -
    Ensemble Approaches: Multiple ensemble strategies (random forest, voting, stacking, AdaBoost) similarly achieved perfect detection when using comprehensive or IG-selected features (Table 9), demonstrating their capacity to effectively combine feature information. We emphasize that the random forest model used in Table 9 was configured with n_estimators=100. This is also explicitly stated in Appendix A.1. This experiment answers the second research question.
  • Feature Selection Observations:
    -
    Comparison of Methods: Information Gain proved superior to ANOVA F-score selection, particularly for Logistic Regression (F1 = 0.994 vs. 0.988 with top 10 features; Table 8). This experiment answers the fourth research question.
    -
    Minimal Feature Performance: Remarkably, decision trees and CatBoost maintained flawless detection even with only 5 IG-selected features, indicating exceptional tolerance to feature reduction (Table 8 and Table 9). This experiment answers the fourth research question.
    -
    Blending Method: While achieving near-perfect detection (F1 ≈ 0.9999), the blending technique showed marginally lower performance compared to simpler ensembles, potentially due to its complex architecture (Table 9).
Table 8. Comparative analysis of intrusion detection performance across machine learning approaches using RoEduNet-SIMARGL2021 demonstrating the impact of different feature selection configurations (full/5/10 features) with models ordered by F1 performance.
Model | Accuracy | Precision | Recall | F1 Score | Training Time (s) | Prediction Time (s) | Total Time (s)
All Features
Decision Tree | 1.0 | 1.0 | 1.0 | 1.0 | 58.52 | 0.14 | 58.66
MLP | 0.999979 | 0.999979 | 0.999979 | 0.999979 | 7722.65 | 7.09 | 7729.74
Logistic Regression | 0.999494 | 0.999494 | 0.999494 | 0.999494 | 48.07 | 0.19 | 48.27
IG Top 5 Features
Decision Tree | 1.0 | 1.0 | 1.0 | 1.0 | 16.53 | 0.07 | 16.59
MLP | 0.998699 | 0.998702 | 0.998699 | 0.998699 | 13460.16 | 3.91 | 13464.07
Logistic Regression | 0.904228 | 0.918102 | 0.904228 | 0.903427 | 63.02 | 0.13 | 63.15
IG Top 10 Features
Decision Tree | 1.0 | 1.0 | 1.0 | 1.0 | 29.22 | 0.08 | 29.31
MLP | 0.999277 | 0.999278 | 0.999277 | 0.999277 | 3109.69 | 4.32 | 3114.01
Logistic Regression | 0.994143 | 0.994171 | 0.994143 | 0.994143 | 34.84 | 0.10 | 34.94
K-Best Top 5 Features
Decision Tree | 0.999981 | 0.999980 | 0.999981 | 0.999979 | 10.29 | 0.08 | 10.37
MLP | 0.994868 | 0.994830 | 0.994868 | 0.994840 | 5623.41 | 3.54 | 5626.95
Logistic Regression | 0.988424 | 0.988542 | 0.988424 | 0.988396 | 26.64 | 0.10 | 26.74
K-Best Top 10 Features
Decision Tree | 0.999998 | 0.999998 | 0.999998 | 0.999998 | 67.02 | 0.11 | 67.13
MLP | 0.998772 | 0.998775 | 0.998772 | 0.998772 | 4630.49 | 4.39 | 4634.89
Logistic Regression | 0.988180 | 0.988365 | 0.988180 | 0.988179 | 34.83 | 0.16 | 34.99
Table 9. Comparison of the ensemble method performance on the RoEduNet-SIMARGL2021 dataset. Results are organized by feature selection strategy (All Features, Top 5, and Top 10) and ranked by F1-score within each category.
Model | Accuracy | Precision | Recall | F1 Score | Training Time (s) | Prediction Time (s) | Total Time (s)
All Features
Random Forest | 1.0 | 1.0 | 1.0 | 1.0 | 2607.52 | 10.87 | 2618.39
Soft Voting Ens. | 1.0 | 1.0 | 1.0 | 1.0 | 4901.11 | 19.58 | 4920.69
Weighted Avg. | 1.0 | 1.0 | 1.0 | 1.0 | 5364.26 | 13.08 | 5377.33
Bagging | 1.0 | 1.0 | 1.0 | 1.0 | 12820.91 | 9.11 | 12830.02
Stacking | 1.0 | 1.0 | 1.0 | 1.0 | 23951.53 | 26.57 | 23978.10
Adaptive Boosting | 1.0 | 1.0 | 1.0 | 1.0 | 1366.73 | 6.73 | 1373.46
CatBoost | 1.0 | 1.0 | 1.0 | 1.0 | 681.30 | 0.83 | 682.13
Gradient Boosting | 1.0 | 1.0 | 1.0 | 1.0 | 22567.71 | 16.53 | 22584.24
XGBoost | 1.0 | 1.0 | 1.0 | 1.0 | 172.57 | 0.60 | 173.18
Blending | 0.999945 | 0.999889 | 0.999945 | 0.999917 | 3929.99 | 6.50 | 3936.49
IG Top 5 Features
Random Forest | 1.0 | 1.0 | 1.0 | 1.0 | 1230.87 | 8.64 | 1239.52
Soft Voting | 1.0 | 1.0 | 1.0 | 1.0 | 2593.67 | 11.19 | 2604.87
Weighted Avg. | 1.0 | 1.0 | 1.0 | 1.0 | 3228.50 | 12.06 | 3240.55
Bagging | 1.0 | 1.0 | 1.0 | 1.0 | 7273.26 | 7.93 | 7281.19
Stacking | 1.0 | 1.0 | 1.0 | 1.0 | 11928.31 | 14.94 | 11943.26
Adaptive Boosting | 1.0 | 1.0 | 1.0 | 1.0 | 419.21 | 3.25 | 422.46
CatBoost | 1.0 | 1.0 | 1.0 | 1.0 | 513.04 | 0.78 | 513.82
Gradient Boosting | 1.0 | 1.0 | 1.0 | 1.0 | 6429.50 | 10.61 | 6440.11
XGBoost | 1.0 | 1.0 | 1.0 | 1.0 | 283.73 | 0.92 | 284.65
Blending | 0.999945 | 0.999889 | 0.999945 | 0.999917 | 952.84 | 6.44 | 959.27
IG Top 10 Features
Random Forest | 1.0 | 1.0 | 1.0 | 1.0 | 1707.36 | 9.52 | 1716.88
Soft Voting | 1.0 | 1.0 | 1.0 | 1.0 | 3731.02 | 11.78 | 3742.80
Weighted Avg. | 1.0 | 1.0 | 1.0 | 1.0 | 3804.80 | 10.06 | 3814.86
Bagging | 1.0 | 1.0 | 1.0 | 1.0 | 9598.89 | 10.94 | 9609.83
Stacking | 1.0 | 1.0 | 1.0 | 1.0 | 10935.13 | 9.72 | 10944.85
Adaptive Boosting | 1.0 | 1.0 | 1.0 | 1.0 | 682.05 | 3.90 | 685.96
CatBoost | 1.0 | 1.0 | 1.0 | 1.0 | 526.96 | 0.76 | 527.71
Gradient Boosting | 1.0 | 1.0 | 1.0 | 1.0 | 9667.26 | 10.25 | 9677.51
XGBoost | 1.0 | 1.0 | 1.0 | 1.0 | 290.79 | 0.70 | 291.49
Blending | 0.999945 | 0.999889 | 0.999945 | 0.999917 | 1482.82 | 9.27 | 1492.08
K-Best Top 5 Features
Random Forest | 0.999981 | 0.999981 | 0.999981 | 0.999980 | 831.59 | 9.10 | 840.69
Soft Voting | 0.999981 | 0.999981 | 0.999981 | 0.999980 | 2367.97 | 11.90 | 2379.87
Weighted Avg. | 0.999981 | 0.999981 | 0.999981 | 0.999980 | 1861.42 | 10.55 | 1871.97
Bagging | 0.999949 | 0.999945 | 0.999949 | 0.999933 | 7134.74 | 9.88 | 7144.62
Stacking | 0.999981 | 0.999980 | 0.999981 | 0.999979 | 4752.17 | 10.46 | 4762.63
Adaptive Boosting | 0.998882 | 0.998829 | 0.998882 | 0.998854 | 220.77 | 3.63 | 224.40
CatBoost | 0.999949 | 0.999945 | 0.999949 | 0.999933 | 549.96 | 0.74 | 550.71
Gradient Boosting | 0.999969 | 0.999969 | 0.999969 | 0.999966 | 4357.26 | 9.28 | 4366.54
XGBoost | 0.999930 | 0.999875 | 0.999930 | 0.999902 | 278.54 | 0.98 | 279.52
Blending | 0.999941 | 0.999885 | 0.999941 | 0.999913 | 1029.72 | 6.52 | 1036.23
K-Best Top 10 Features
Random Forest | 0.999999 | 0.999999 | 0.999999 | 0.999999 | 3029.85 | 13.42 | 3043.27
Soft Voting | 0.999998 | 0.999998 | 0.999998 | 0.999998 | 4245.90 | 12.15 | 4258.04
Weighted Avg. | 0.999998 | 0.999998 | 0.999998 | 0.999998 | 4330.36 | 12.67 | 4343.03
Bagging | 0.999995 | 0.999995 | 0.999995 | 0.999995 | 7777.49 | 7.79 | 7785.28
Stacking | 0.999999 | 0.999999 | 0.999999 | 0.999999 | 9358.06 | 11.35 | 9369.40
Adaptive Boosting | 0.999927 | 0.999927 | 0.999927 | 0.999927 | 884.34 | 9.58 | 893.91
CatBoost | 0.999996 | 0.999996 | 0.999996 | 0.999996 | 615.90 | 0.77 | 616.66
Gradient Boosting | 0.999995 | 0.999995 | 0.999995 | 0.999995 | 8690.55 | 10.11 | 8700.66
XGBoost | 0.999985 | 0.999985 | 0.999985 | 0.999985 | 288.11 | 0.95 | 289.06
Blending | 0.999942 | 0.999887 | 0.999942 | 0.999914 | 1997.98 | 13.45 | 2011.43

6.1.2. Computational Efficiency

This experiment answers the third research question. Execution times varied substantially between algorithms, highlighting important accuracy–efficiency tradeoffs:
  • Individual Models:
    -
    Decision Trees demonstrated the fastest training (e.g., 10.29 s with minimal features), contrasting with neural networks which required extensive computation (7722 s with full features; Table 8).
    -
    Logistic regression provided an effective balance, maintaining consistent sub-minute training durations (Table 8).
  • Ensemble Efficiency:
    -
    Lightweight Options: XGBoost (173 s) and CatBoost (682 s) significantly outperformed the computationally intensive methods including stacking (23,978 s) in total runtime (Table 9).
    -
    Real-Time Viability: Methods including AdaBoost and Blending offered rapid inference (6.44–9.58 s), suggesting practical deployment potential (Table 9).
  • Feature Reduction Benefits: Using IG Top 5 features decreased decision tree training duration by 70%+ and Random Forest by 50%+, confirming the value of feature selection for efficient implementations.

6.1.3. Classification Visualization

The detection accuracy is further illustrated through heatmap-based confusion matrices (Figure 2), evaluating all methods against attacks in RoEduNet-SIMARGL2021 dataset. This experiment answers the fifth research question.
Perfect Classification Models: Multiple approaches including random forest, decision trees, voting ensembles, and boosting methods (XGBoost, CatBoost) achieved flawless classification across all traffic types—normal, DoS, malware, and port scanning—when utilizing complete feature sets.

6.1.4. Performance Evaluation of RF and DT Under Adversarial Conditions

This subsection presents a comprehensive evaluation of the proposed intrusion detection models, addressing potential concerns regarding perfect performance scores through rigorous statistical testing and adversarial validation. We compared the performance of decision tree (DT) and random forest (RF) classifiers across multiple feature subsets.
Evaluation Protocol and Statistical Rigor: To ensure the validity and reliability of our results, we implemented a strict evaluation protocol designed to address common methodological pitfalls.
  • Strict Data Separation: We trained and evaluated the models using a hold-out test set strategy. All hyperparameter tuning and cross-validation (k = 3 stratified folds) were performed exclusively on the training portion of the data, preventing any information leakage from the test set.
  • Adversarial Robustness Testing: To assess model stability and generalization under realistic conditions, we created an adversarial test set by injecting Gaussian noise (magnitude of 20% of each feature’s standard deviation) into 30% of the test instances.
  • Statistical Significance Testing: We employed robust statistical methods:
    -
    Confidence Intervals (CIs): Cross-validation results are reported as the mean macro-F1 with 95% t-distribution CIs. Held-out test results are reported as non-parametric 95% bootstrap CIs (B = 150 resamples).
    -
    Model Comparison: Paired statistical tests (Wilcoxon signed-rank and paired t-tests) were conducted on performance metrics across CV folds.
    -
    Adversarial Effect: Significance of performance degradation was tested using Wilcoxon tests on paired bootstrap samples.
    We emphasize that this adversarial and statistical testing was performed only for DT and RF on the RoEduNet-SIMARGL2021 dataset. The full statistical significance analysis is provided in Section 6.4.
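A minimal sketch of the adversarial perturbation and the paired significance testing is given below; it assumes X_test and y_test are NumPy arrays and model is a fitted classifier, and the seed and helper names are illustrative rather than taken from our released code.

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)

# Adversarial test set: Gaussian noise at 20% of each feature's standard deviation,
# injected into a random 30% of the test instances.
X_adv = X_test.copy()
idx = rng.choice(len(X_adv), size=int(0.3 * len(X_adv)), replace=False)
X_adv[idx] += rng.normal(0.0, 0.2 * X_test.std(axis=0), size=X_adv[idx].shape)

def paired_bootstrap_f1(model, X_clean, X_noisy, y, B=150, seed=0):
    """Macro-F1 on B bootstrap resamples, evaluated on clean and noisy data in pairs."""
    rng_b = np.random.default_rng(seed)
    clean, noisy = [], []
    for _ in range(B):
        rows = rng_b.integers(0, len(y), size=len(y))   # bootstrap resample indices
        clean.append(f1_score(y[rows], model.predict(X_clean[rows]), average="macro"))
        noisy.append(f1_score(y[rows], model.predict(X_noisy[rows]), average="macro"))
    return np.array(clean), np.array(noisy)

f1_clean, f1_adv = paired_bootstrap_f1(model, X_test, X_adv, y_test)

# Significance of the adversarial degradation (Wilcoxon) and a paired t-test.
print(wilcoxon(f1_clean, f1_adv))
print(ttest_rel(f1_clean, f1_adv))
```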
Overall Performance and Adversarial Robustness: The macro F1-scores for both models under original and adversarial conditions are summarized in Table 10. The results revealed the following:
Perfectly Separable Feature Sets: For several feature subsets (All Features, IG Top-5/10, K-Best Top-10), both DT and RF classifiers achieved perfect F1-scores (1.000) with degenerate confidence intervals. This confirms that perfect scores are genuine characteristics of the dataset with these discriminative features, not evaluation artifacts.
Challenging Feature Sets: The K-Best Top-5 subset presented a more challenging task, showing performance degradation even on clean data (e.g., DT CV F1 = 0.9542, CI = [0.9359, 0.9725]), demonstrating our pipeline’s sensitivity to reduced separability.
Adversarial Robustness: Adversarial noise caused significant performance degradation across all scenarios. The degradation was most severe for DT on K-Best Top-5 (Δ = −0.3772), while RF showed greater resilience (Δ = −0.1505). All adversarial degradations were highly significant (Wilcoxon p < 10⁻²⁵), as visualized in Figure 3.
Comparative Analysis of Model Performance for DT and RF: A key objective was to determine if the increased complexity of the random forest model provides a significant advantage over a single decision tree.
Performance on Clean Data: Statistical comparisons revealed no significant difference between DT and RF models for any feature subset. This was true for both perfectly separable cases and the challenging K-Best Top-5 scenario (Wilcoxon p = 0.6547). This indicates that for this intrusion detection task on clean data, the ensemble method does not yield meaningful improvement in classification accuracy.
Robustness as a Key Differentiator: The primary distinction emerged under adversarial conditions. While both models degraded, RF demonstrated superior robustness, particularly on the K-Best Top-5 subset (Δ of −0.1505 vs. −0.3772 for DT). This suggests that RF’s advantage lies not in raw accuracy but in resilience to noise, a critical property for real-world security applications.
Table 10. Decision tree (DT) and random forest (RF) performance with cross-validation (CV) and held-out test results (original and adversarial). CIs are 95% intervals. Δ = F1 (adversarial) − F1 (original).
Feature Set | Model | CV F1 (CI) | Test F1 (CI) | Adv. F1 (CI) | Δ
All Features | DT | 1.0000 [1.0000, 1.0000] | 1.0000 [1.0000, 1.0000] | 0.8979 [0.8976, 0.8982] | −0.1021
All Features | RF | 1.0000 [1.0000, 1.0000] | 1.0000 [1.0000, 1.0000] | 0.8979 [0.8976, 0.8982] | −0.1021
IG Top-5 | DT | 1.0000 [1.0000, 1.0000] | 1.0000 [1.0000, 1.0000] | 0.8979 [0.8976, 0.8982] | −0.1021
IG Top-5 | RF | 1.0000 [1.0000, 1.0000] | 1.0000 [1.0000, 1.0000] | 0.8964 [0.8937, 0.8982] | −0.1034
K-Best Top-5 | DT | 0.9542 [0.9359, 0.9725] | 0.9467 [0.9272, 0.9618] | 0.5696 [0.5684, 0.5710] | −0.3772
K-Best Top-5 | RF | 0.9547 [0.9325, 0.9769] | 0.9481 [0.9275, 0.9636] | 0.7977 [0.7709, 0.8202] | −0.1505
K-Best Top-10 | DT | 1.0000 [1.0000, 1.0000] | 1.0000 [1.0000, 1.0000] | 0.8979 [0.8976, 0.8982] | −0.1021
K-Best Top-10 | RF | 1.0000 [1.0000, 1.0000] | 1.0000 [1.0000, 1.0000] | 0.8979 [0.8976, 0.8982] | −0.1021
Computational Performance: The tradeoff for improved robustness is computational cost. Figure 4 illustrates the computational time for training and inference. As expected, the random forest model incurs higher training and prediction times compared to the single decision tree. This cost must be weighed against the robustness requirements of the target deployment environment.
Having provided main results and insights for the RoEduNet-SIMARGL2021 dataset, we next show main results for the CICIDS-2017 dataset.

6.2. CICIDS-2017 Evaluation Findings

6.2.1. Detection Performance

Analysis of the CICIDS-2017 dataset (Table 11 and Table 12) demonstrated markedly different behavior patterns compared to analysis of RoEduNet-SIMARGL2021, particularly regarding model consistency and feature selection efficacy. We now show the main insights from these patterns.
  • Leading Algorithms:
    -
    Individual Models: Decision trees maintained strong performance with complete features (F1 = 0.998126) but proved vulnerable to K-Best selection (F1 = 0.961301 for Top 5 features). Logistic regression showed particular sensitivity, with F1 plummeting to 0.557483 for K-Best Top 5 (Table 11). This result answers the first research question.
    -
    Ensemble Approaches: CatBoost (F1 = 0.998865) and blending (F1 = 0.998720) delivered superior detection with full features, while gradient boosting failed dramatically (F1 = 0.510177) with K-Best Top 10 features (Table 12). This result answers the second research question.
  • Feature Selection Effects:
    -
    Method Comparison: Information gain consistently surpassed K-Best, evidenced by the decision tree performance (F1 = 0.988682 vs. 0.961301 for Top 5 features) and logistic regression results (F1 = 0.864057 vs. 0.692819 for Top 10 features) in Table 11. This experiment answers the fourth research question.
    -
    Efficiency–Accuracy Balance: While IG Top 5 slashed decision tree training time by 94%, K-Best Top 5 compromised MLP effectiveness (F1 = 0.782772 vs. 0.804494), indicating the removal of features critical for neural networks. This experiment answers the fourth research question.
  • Performance Irregularities:
    -
    Gradient Boosting Degradation: Our analysis revealed a significant performance collapse for gradient boosting when using K-Best Top 10 features (F1 = 0.510177). To investigate this anomaly, we conducted a detailed post hoc analysis comparing model metrics across feature sets. The high precision (0.871) paired with low recall (0.371) and accuracy (0.371) pointed to a classification bias problem. We examined the feature importance values extracted from the trained model and analyzed correlations between the K-Best features, revealing that the timing-related metrics (Idle Min/Mean/Max, Flow IAT Max) created harmful interactions when combined. Further validation through cross-fold performance analysis confirmed that this was not a random fluctuation but a consistent weakness when these specific features were combined. Conversely, the packet-size metrics selected by information gain produced excellent results (F1 = 0.987800), highlighting how feature selection methods can dramatically impact ensemble performance (a small sketch of both selection routes and the correlation check is given after this list).
    -
    Stacking Inconsistency: Stacking demonstrated unpredictable behavior, with performance unexpectedly dropping from F1 = 0.998758 to 0.933565 when moving from all features to IG Top 10. Through component-wise ablation testing of the meta-learner, we determined that this counterintuitive result stems from the learning layer struggling to generalize when base models are trained on a reduced feature set. Our analysis of intermediate predictions from base classifiers showed diminished diversity in their output patterns, reducing the information available to the meta-learner despite these features being individually informative.
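For concreteness, the sketch below (referenced in the gradient boosting discussion above) shows one way to reproduce the two feature-selection routes and the correlation check used in our post hoc analysis. The load_flows() helper and the use of mutual information as a stand-in for information gain are assumptions made for illustration; this is not the released implementation.

```python
# Sketch of ANOVA K-Best vs. an information-gain-style ranking, plus the
# correlation check among selected features; load_flows() is a hypothetical
# helper returning a feature DataFrame X and a label Series y.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

X, y = load_flows()  # placeholder for the CICIDS-2017 preprocessing step

# ANOVA F-score (K-Best) route
kbest = SelectKBest(score_func=f_classif, k=10).fit(X, y)
kbest_cols = X.columns[kbest.get_support()]

# Information-gain-style route via mutual information
mi_scores = pd.Series(mutual_info_classif(X, y, random_state=42), index=X.columns)
ig_cols = mi_scores.nlargest(10).index

# Correlation check used to spot harmful interactions (e.g., the
# Idle Min/Mean/Max and Flow IAT Max timing group).
print(X[list(kbest_cols)].corr().round(2))
print(X[list(ig_cols)].corr().round(2))
```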

6.2.2. Computational Characteristics

This experiment answers the third research question. Execution times revealed significant efficiency variations across methods (a timing sketch is provided after the list below):
  • Individual Models:
    -
    Decision trees remained fastest (9.05 s training for IG Top 5), while MLPs required substantial computation (greater than 1148 s for full features), as per Table 11.
    -
    KNN’s excessive prediction delay (611.05 s) rendered it unsuitable for real-time applications despite reasonable accuracy.
  • Ensemble Methods:
    -
    Blending demanded extensive training (5150 s) but provided quick inference (16.83 s). CatBoost offered a favorable balance (3935 s training, 0.61 s prediction).
    -
    Bagging suffered from prohibitive prediction latency (9444 s) because predictions must be aggregated across all of its KNN base estimators (Table 12).
    -
    XGBoost proved to be the most time-efficient (1305 s total), significantly outperforming complex methods such as stacking (12,304 s).
  • Feature Reduction Benefits: IG Top 5 features enhanced random forest efficiency by 62% (736.57 s to 280.61 s) with negligible F1 impact (0.998290 to 0.988724), demonstrating its value for time-sensitive deployments (Table 12).
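The wall-clock figures above can be collected with a simple timing wrapper; the sketch below is a minimal illustration (the estimator and data splits are placeholders), not the exact instrumentation used in our experiments.

```python
# Minimal sketch for measuring training and prediction wall-clock time,
# mirroring the Training/Prediction/Total columns of Tables 11 and 12.
import time
from sklearn.ensemble import RandomForestClassifier

def timed_fit_predict(model, X_train, y_train, X_test):
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    train_s = time.perf_counter() - t0

    t1 = time.perf_counter()
    preds = model.predict(X_test)
    predict_s = time.perf_counter() - t1
    return preds, train_s, predict_s, train_s + predict_s

# Example usage (placeholders): a 100-tree random forest as in Table A1.
# preds, tr_s, pr_s, total_s = timed_fit_predict(
#     RandomForestClassifier(n_estimators=100, random_state=42),
#     X_train, y_train, X_test)
```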

6.2.3. Classification Performance Visualization

We now summarize the main insights from the confusion matrices of the top-performing individual and ensemble methods on the CICIDS-2017 dataset. These matrices reveal the main confusions between different attack types and between benign and attack classes. We discuss each model in turn below.
  • (1) DT Qualitative Errors (Figure 5): The key misclassifications were as follows:
  • Attacks → Benign (FN): PortScan 278; Bot 64; DoS-Hulk 53; DoS-GoldenEye 6; DoS-Slowhttptest 6; DoS-slowloris 1; Web: XSS 3, BruteForce 2, SQLi 2; Infiltration 1.
  • Benign → Attacks (FP): mainly PortScan 181, Bot 65, DoS-Hulk 47; others are single-digit.
The likely cause of this confusion is feature overlap with normal traffic bursts (PortScan/Bot) and the coarse granularity of flow-level features. The main insight is that DoS/DDoS attacks are detected well, while stealthy classes (Bot/Web/Infiltration) show a higher number of FNs; the sketch after Figure 5 shows how these per-class counts are read off the confusion matrix.
Figure 5. Confusion matrix for decision tree (DT) on the CICIDS-2017 dataset.
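The per-class counts quoted in this subsection can be extracted programmatically from a multiclass confusion matrix; the following sketch (with placeholder label names) shows how the Attacks → Benign and Benign → Attacks tallies are read off.

```python
# Sketch: extract per-class FN (attack predicted as BENIGN) and FP
# (BENIGN predicted as attack) counts from a multiclass confusion matrix.
from sklearn.metrics import confusion_matrix

def fn_fp_per_class(y_true, y_pred, labels, benign="BENIGN"):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    b = labels.index(benign)
    attacks_to_benign = {lab: int(cm[i, b]) for i, lab in enumerate(labels) if i != b}
    benign_to_attacks = {lab: int(cm[b, j]) for j, lab in enumerate(labels) if j != b}
    return attacks_to_benign, benign_to_attacks
```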
  • (2) KNN Qualitative Errors (Figure 6): The confusion matrix shows the following:
  • Attacks → Benign (FN): PortScan 779; DoS-Hulk 290; Bot 162; DoS-GoldenEye 56; DDoS 47; DoS-slowloris 40; DoS-Slowhttptest 13; FTP-Patator 21; SSH-Patator 35; Infiltration 7; Web: Brute Force 10, SQLi 3 (0 correct), XSS 5.
  • Benign → Attacks (FP): mainly PortScan 1,532; DoS-Hulk 442; DDoS 196; Bot 95; smaller spillover to Slowhttptest 67, SSH-Patator 12, Brute Force 10.
The likely cause of this confusion is that KNN's distance-based decisions were made without feature scaling, so the local neighborhoods of attack flows can overlap with normal traffic (PortScan/Bot) and with very small classes (Web/Infiltration); a scaled-pipeline sketch is given after Figure 6.
Figure 6. Confusion matrix for the K-nearest neighbors (KNN) classifier on the CICIDS-2017 dataset.
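As noted above, the KNN distances were computed on unscaled flow features; a standard mitigation is to place a scaler in front of the classifier. The snippet below is a sketch of that remedy under the k = 5 setting of Table A1, not the configuration used to produce Figure 6.

```python
# Sketch: KNN with feature scaling inside a pipeline, a common remedy for
# distance distortion on unscaled flow features.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

scaled_knn = make_pipeline(
    StandardScaler(),                     # normalize each flow feature
    KNeighborsClassifier(n_neighbors=5),  # k = 5, as in Table A1
)
# scaled_knn.fit(X_train, y_train); scaled_knn.predict(X_test)  # placeholders
```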
  • (3) RF Qualitative Errors (Figure 7): The major patterns were as follows:
  • Attacks → Benign (FN): PortScan 193, DoS-Hulk 84, Bot 83; smaller: GoldenEye 12, DDoS 6, Slowhttptest 5, SSH-Patator 3, Infiltration 3, Web (XSS 6, BruteForce 3, SQLi 2), slowloris 1, FTP-Patator 1.
  • Benign → Attacks (FP): mainly PortScan 176, DoS-Hulk 64, Bot 44; minor to Slowhttptest 9 and GoldenEye 2.
  • Cross-attack confusions: Web BruteForce ↔ XSS (55, 73); slight Slowloris ↔ Slowhttptest (3, 2).
The main interpretation is that RF detects volumetric DoS/DDoS attacks well but retains Benign ↔ PortScan/Bot overlap; the rare Web/Infiltration classes remain brittle.
Figure 7. Confusion matrix for random forest (RF) on the CICIDS-2017 dataset.
  • (4) Blending Qualitative Errors (Figure 8): The blending technique ensemble confusion matrix (Figure 8) shows significant improvement in classification accuracy across most attack types. Notable improvements included reduced misclassification of benign traffic as attacks, with only 179 instances labeled as PortScan compared to higher rates in individual models. The ensemble demonstrated substantially improved performance in classifying PortScan attacks, with only 6 misclassified as benign compared to 193 in random forest and 779 in KNN. Web Attack misclassifications persisted but at reduced rates, with 29 brute force and 23 XSS attacks misidentified. The ensemble approach showed particular effectiveness in reducing false negatives for critical attack types, with only 47 DoS Hulk attacks misclassified as benign (compared to 84 in RF and 290 in KNN). These improvements highlight the value of combining multiple classifiers to mitigate individual model weaknesses when distinguishing between similar traffic patterns.
    Figure 8. Confusion matrix for the blending technique ensemble on the CICIDS-2017 dataset.
  • (5) CatBoost Qualitative Errors (Figure 9): The CatBoost ensemble confusion matrix (Figure 9) demonstrated superior performance in reducing misclassifications across nearly all attack categories. Most remarkably, the model drastically reduced DoS Hulk misclassifications to only 1 instance labeled as Benign (compared to 47 in blending and 84 in random forest), indicating an exceptional ability to detect this attack type. The model maintained high accuracy on Web Attack classifications, although it still showed some confusion between Web Attack Brute Force (8 instances) and XSS (18 instances). CatBoost’s gradient boosting approach particularly excelled in classifying PortScan attacks, with only 3 misclassifications as Benign, compared with higher error rates in the other models. Additionally, the model showed marked improvement in correctly identifying DoS GoldenEye attacks, with only 1 instance misclassified as Benign. These results demonstrate CatBoost’s effectiveness in learning complex decision boundaries that better distinguish between attack signatures with similar characteristics.
    Figure 9. Confusion matrix for the CatBoost ensemble on the CICIDS-2017 dataset.
  • (6) Bagging Qualitative Errors (Figure 10): The bagging confusion matrix (Figure 10) revealed both strengths and persistent challenges in network traffic classification. Unlike CatBoost, the bagging approach struggled more with DoS Hulk attacks, with 293 instances misclassified as Benign traffic—the highest misclassification rate among all ensemble methods and a significant security risk. The model also demonstrated substantial misclassification of PortScan attacks, with 809 instances incorrectly labeled as Benign. Web Attack classification remained problematic, with 57 Brute Force attacks misidentified as XSS and 43 XSS attacks misclassified as Brute Force. Additionally, the model incorrectly classified 1462 Benign instances as PortScan attacks, which could generate excessive false alarms in operational environments. These patterns suggest that while bagging improves upon some individual classifiers, it does not match the performance of CatBoost or blending in distinguishing between attacks with similar network patterns, particularly for stealthy attacks that mimic normal traffic.
Optimal Classifiers Using Complete Feature Sets: Six approaches demonstrated superior detection capabilities when utilizing all available features:
  • Decision tree (DT) classifier.
  • K-nearest neighbors (KNN) algorithm.
  • Random forest (RF) ensemble.
  • Blending composite model.
  • CatBoost gradient boosting.
  • Bagging meta-estimator.
Figure 10. Confusion matrix for bagging on the CICIDS-2017 dataset.

6.3. Cross-Dataset Performance Improvements

CICIDS-2017 Results: The evaluation framework demonstrated substantial detection enhancements on the CICIDS-2017 benchmark. Both decision tree and KNN classifiers achieved exceptional performance levels, reaching near-perfect scores (up to 0.998) across all evaluation metrics. When integrated with advanced ensemble techniques (particularly random forest, bagging, blending, and CatBoost), these base learners maintained consistently high detection rates while preserving near-perfect sensitivity and F1-scores.
RoEduNet-SIMARGL2021 Findings: The framework proved even more effective on the RoEduNet dataset, with multiple ensemble configurations achieving flawless classification performance (all metrics = 1.000). This demonstrates the method’s adaptability to different network environments and attack profiles.
The systematic assessment highlights three key advantages of our ensemble learning approach:
  • Ensemble learning yields consistent accuracy improvements across heterogeneous datasets (here CICIDS-2017 and RoEduNet-SIMARGL2021).
  • Our approach maintains robust performance regardless of the feature selection method (IG/K-Best).
  • Our work provides balanced computational efficiency and detection reliability.
A particularly valuable aspect of our analysis involved the comparative examination of confusion matrices across models. These visualizations revealed the following:
  • Model-specific strengths against particular attack categories.
  • The existence of several opportunities for creating specialized ensemble combinations.
  • Potential for developing meta-ensembles tailored to specific threat landscapes (as shown in the class-based performance in the confusion matrices).
This granular understanding enables security teams to do the following:
  • Select optimal detectors based on their network’s threat profile.
  • Combine complementary models for comprehensive protection.
  • Develop adaptive defense systems that evolve with emerging threats.

6.4. Statistical Significance Analysis

We now present our statistical significance testing results for both datasets.

6.4.1. Statistical Significance Setup

We generated a table (Table 13) for such results. The table was generated by conducting pairwise statistical significance tests using the paired t-test on the F1-scores of different machine learning models. For each dataset—RoEduNet-SIMARGL2021 and CICIDS-2017—models were evaluated across multiple feature selection settings. Their performance vectors were compared in pairs, and the resulting t-statistics and p-values indicate whether the differences in performance are statistically significant. A p-value below 0.05 suggests a significant difference, and the better-performing method is highlighted in bold. If the difference is not significant, both methods are marked in bold to indicate statistical equivalence.
The F1-score vectors used in these tests were extracted directly from the performance tables (Table 8, Table 9, Table 11 and Table 12). These scores represent model effectiveness across different feature subsets. The statistical tests were implemented in Python using scipy.stats.ttest_rel. This approach ensured a rigorous and reproducible comparison of model performance, helping identify which methods consistently outperform others across different datasets and configurations.
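As a minimal illustration of this procedure, the sketch below runs one such paired comparison on rounded F1 values taken from Table 11 (decision tree vs. logistic regression across the five feature settings); it reproduces the mechanics of the test rather than the exact vectors used to build Table 13, though the resulting statistics approximately match the corresponding Table 13 entry (t ≈ 3.95, p ≈ 0.017).

```python
# Sketch of one pairwise comparison behind Table 13 using scipy.stats.ttest_rel.
from scipy.stats import ttest_rel

f1_dt = [0.998, 0.989, 0.989, 0.996, 0.961]  # DT F1 across feature settings (rounded, Table 11)
f1_lr = [0.871, 0.838, 0.864, 0.693, 0.557]  # LR F1 across the same settings (rounded, Table 11)

t_stat, p_value = ttest_rel(f1_dt, f1_lr)
verdict = "DT significantly better" if (p_value < 0.05 and t_stat > 0) else "no significant difference"
print(f"t = {t_stat:.4f}, p = {p_value:.4f} -> {verdict}")
```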

6.4.2. Main Insights

For the RoEduNet-SIMARGL2021 dataset, the pairwise t-test results in Table 13 reveal that most ensemble methods perform comparably, with no statistically significant differences among the top-performing models such as random forest, soft voting, and weighted averaging. These methods consistently achieve near-perfect F1-scores across all feature subsets, and their comparisons yield high p-values, indicating statistical equivalence. However, blending stands out with significantly different performance when compared to nearly all other methods, including bagging, CatBoost, and gradient boosting, suggesting that its architecture may be more sensitive to feature selection or model diversity. Additionally, adaptive boosting shows statistically significant differences when compared to several methods, including blending and bagging, highlighting its distinct behavior in this dataset.
In contrast, the CICIDS-2017 dataset presents more pronounced statistical differences between models (as shown in Table 13). Decision tree and KNN outperform logistic regression and MLP in several pairwise comparisons, with p-values below the 0.05 threshold, indicating statistically significant superiority. Ensemble methods such as blending, CatBoost, and stacking also show significant improvements over weaker models such as adaptive boosting and bagging. Notably, blending and CatBoost consistently outperform others, with strong statistical evidence supporting their effectiveness. These results suggest that CICIDS-2017, with its more complex and diverse attack types, benefits more from sophisticated ensemble architectures, whereas simpler models struggle to maintain competitive performance.
Following this complete performance evaluation, we now examine the broader implications and conclusions of our approach in the subsequent section.

7. Conclusions

The fundamental purpose of an intrusion detection tool is to provide a strong safeguard against security threats, and leveraging AI can greatly improve its automation and effectiveness. With the rising frequency of network attacks, significant research has been dedicated to creating AI-based IDS. However, the variety of AI models used for this task, each with unique advantages and limitations, complicates the selection of an optimal model for any specific dataset.
To overcome this issue, hybridizing multiple AI models can lead to substantial gains in overall performance for network intrusion detection. This work addresses this need by assessing a wide array of ensemble techniques for IDS. We conducted an in-depth comparative analysis of standalone models against both simple and advanced ensemble learning architectures. Our methodology included selecting key features, training the base and ensemble models, and then generating performance metrics to offer crucial findings on their effectiveness.
Our findings are based on fourteen different combinations of individual and ensemble models, which utilized techniques such as boosting, stacking, and blending across various base learners. The analysis classified these AI models according to key performance indicators (accuracy, precision, recall, F1-score) and processing time, revealing the strengths of different learning approaches on these datasets. Furthermore, our research offers detailed guidance on selecting the best individual or ensemble ML models for network intrusion detection, tailored to the characteristics of different datasets. Our evaluation was performed on two widely used network intrusion benchmarks, each possessing unique properties.
In particular, our framework was tested on the RoEduNet-SIMARGL2021 and CICIDS-2017 datasets and revealed several important findings:
  • Ensemble Methods Improve Performance: Combining multiple models consistently outperformed single models. For example, random forest and decision trees achieved perfect scores (F1 = 1.0) on RoEduNet-SIMARGL2021, while blending and bagging techniques performed exceptionally well on CICIDS-2017 (F1 > 0.996).
  • Feature Selection Enhances Computational Efficiency: Using information gain (IG) reduced training time by 70%–94% without sacrificing accuracy. However, ANOVA-based K-best selection sometimes removed critical features, negatively impacting performance.
  • Speed vs. Accuracy Tradeoffs: Some ensemble methods, such as XGBoost, offered both speed and accuracy, making them ideal for real-time applications. In contrast, others such as stacking and blending were slower but provided higher robustness and accuracy.
  • Dataset-Specific Performance Variations: Model performance varied according to the dataset. For example, logistic regression struggled with the complex attacks in CICIDS-2017 but performed well on the simpler task in RoEduNet-SIMARGL2021, which has fewer attack labels.
We further supported the research community by releasing our source code, establishing a versatile ensemble learning framework tailored for network intrusion detection. This framework can be extended with additional models and datasets. Our analysis also identified top-performing models per dataset and revealed shared and unique behavior patterns among models using confusion matrices that helped explain performance outcomes. This work marks a meaningful step forward in applying ensemble learning to intrusion detection systems. Our thorough experimentation and comparative analysis validate the strength of these methods, offering practical direction for both academic research and real-world cybersecurity applications.
  • Main Limitations and Future Work Avenues:
  • Concept Drift: In dynamic network environments, the statistical properties of traffic data can change over time due to evolving attack strategies and legitimate usage patterns. This phenomenon, known as concept drift, poses a significant challenge to maintaining model accuracy. We acknowledge that periodic retraining or online learning mechanisms may be necessary to adapt to such changes. Future work will explore drift detection techniques and incremental learning strategies to enhance model resilience.
  • Retraining Costs of Complex Models: As noted in Table 9, advanced ensemble methods such as stacking require substantial computational resources (e.g., over 23,000 s of training time on RoEduNet-SIMARGL2021). While these models offer high accuracy, their retraining cost may be prohibitive in real-time or resource-constrained environments. We suggest that lightweight models such as XGBoost or CatBoost may be more suitable for frequent updates in production settings.
  • Scalability in High-Throughput Networks: Enterprise networks often generate millions of flows per hour, demanding intrusion detection systems that can scale efficiently. Our framework demonstrates that certain models (e.g., decision trees, logistic regression) offer fast inference times and can be deployed in high-throughput scenarios. However, they have lower performance capabilities. Thus, future works can build on our insights for exploring the importance of balancing detection accuracy with inference latency and memory footprint, especially for real-time applications.
  • Integration of Unsupervised Learning: We emphasize that our work focuses exclusively on supervised learning methods, with the main emphasis on a comparative analysis of different ensemble methods. Indeed, while supervised approaches offer strong performance when labeled data is available, they may fall short in detecting novel or zero-day attacks, which are not represented in the training data. Thus, we highlight the importance of integrating unsupervised and semi-supervised techniques in future work. These approaches—such as clustering, anomaly detection, and self-training—can enhance the system’s ability to identify previously unseen threats and adapt to evolving attack patterns.

Author Contributions

Conceptualization, I.B., O.A., and M.A.; methodology, I.B.; software, I.B.; validation, M.A. and W.A.; formal analysis, I.B.; investigation, M.A. and W.A.; resources, M.A.; data curation, I.B.; writing—original draft preparation, I.B.; writing—review and editing, M.A., O.A., and W.A.; visualization, I.B.; supervision, M.A. and W.A.; project administration, M.A.; funding acquisition, W.A. and M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by Lilly Endowment (AnalytixIN) and an Enhanced Mentoring Program with Opportunities for Ways to Excel in Research (EMPOWER) Grant from the Office of the Vice Chancellor for Research at IUPUI. The APC was covered by Princess Nourah bint Abdulrahman University Researchers Supporting Project (project number PNURSP2025R500), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

We adhere to the data availability policy outlined by MDPI journals. The data supporting the findings of this study are available in the public repository at the following URL. The used datasets are available at https://github.com/sm3a96/A-Comprehensive-Comparative-Study-of-Individual-ML-Models-and-Ensemble-Strategies-for-IDS.git, accessed on 30 August 2025.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

    The following abbreviations are used in this manuscript:
IDS: intrusion detection system
KNN: K-nearest neighbors
RF: random forest
ADA: adaptive boosting
CAT: categorical boosting
LGBM: light gradient-boosting machine
MLP: multilayer perceptron
XGB: extreme gradient boosting
SVM: support vector machine
DT: decision tree
Avg: averaging
LR: logistic regression

Appendix A

This appendix provides the detailed technical specifications and implementation settings for all machine learning models evaluated in this study. Table A1 summarizes the model configurations, hyperparameters, learning rates, and stopping conditions used for the CICIDS-2017 and RoEduNet-SIMARGL2021 datasets. By including these details, we ensure the reproducibility of our experiments and maintain transparency with respect to the setup that produced the results reported in the main paper.

Appendix A.1. Model Architectures and Parameter Configurations

This appendix documents the complete implementation details for all machine learning models employed in our experimental framework. Table A1 summarizes the model configurations, hyperparameters, learning rates, and stopping conditions used for both the CICIDS-2017 and RoEduNet-SIMARGL2021 datasets.

Appendix A.2. Base Classifier Specifications

The logistic regression (LR) model was implemented using scikit-learn’s default parameter settings without modification. Similarly, the decision tree (DT) classifier utilized standard scikit-learn parameters with the Gini impurity criterion for node splitting decisions. The decision tree model used in Section 4.3.1 employed default scikit-learn settings, with max_depth=None and min_samples_leaf=1, as clarified in Table A1.
The multilayer perceptron (MLP) classifier was configured with a two-layer architecture containing 50 neurons per hidden layer, employing ReLU activation functions. The optimization process used the Adam algorithm with a fixed learning rate of 0.001 and L2 regularization ( α = 0.0001). Training proceeded for a maximum of 1000 iterations with dynamic batch sizing, with a fixed random seed of 42 for reproducible results.
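A close scikit-learn equivalent of this configuration is sketched below; it is an approximation of the setup summarized in Table A1 rather than the exact training script.

```python
# Sketch of the MLP configuration described above: two hidden layers of 50
# neurons, ReLU activation, Adam optimizer, learning rate 0.001, L2 alpha
# 0.0001, up to 1000 iterations, dynamic batch size, and seed 42.
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(50, 50),
    activation="relu",
    solver="adam",
    learning_rate_init=0.001,
    alpha=0.0001,
    max_iter=1000,
    batch_size="auto",
    random_state=42,
)
```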
Table A1. Model configurations, hyperparameters, learning rates, and stopping conditions used in the experiments.
Model | Dataset | Key Hyperparameters | Learning Rate | Epochs/Iterations | Stopping Condition
Decision Tree | CICIDS-2017/SIMARGL | max_depth=None, min_samples_leaf=1, criterion=gini | – | – | Fixed split criteria
KNN | CICIDS-2017/SIMARGL | n_neighbors=5, weights=uniform | – | – | Distance-based, no early stop
Logistic Regression | CICIDS-2017/SIMARGL | solver=lbfgs, C=1.0, penalty=L2 | – | max_iter=100 | Convergence tolerance (default 10^-4)
MLP | CICIDS-2017 | hidden_layers=(25), solver=adam | 0.001 | 1000 | Early stop via tolerance
MLP | SIMARGL | hidden_layers=(100,50), activation=relu, solver=adam | 0.001 | 500 | Early stop via tolerance
Random Forest | CICIDS-2017/SIMARGL | n_estimators=100, criterion=gini, max_depth=None | – | 100 trees | Fixed number of trees
Bagging | CICIDS-2017 | base=KNN, n_estimators=10 | – | 10 | Fixed number of estimators
Bagging | SIMARGL | base=CatBoost, n_estimators=10 | – | 10 | Fixed number of estimators
Blending | CICIDS-2017 | DT + CatBoost + RF, meta=LogReg | – | – | Meta-model training
Blending | SIMARGL | DT + CatBoost + RF, meta=LogReg | – | – | Meta-model training
Stacking | CICIDS-2017 | DT + KNN + RF, meta=KNN, cv=5 | – | – | Cross-validation
Stacking | SIMARGL | DT + CatBoost + RF, meta=CatBoost, cv=5 | – | – | Cross-validation
AdaBoost | CICIDS-2017/SIMARGL | n_estimators=50 (default), learning_rate=1.0 | 1.0 | 50 | Fixed number of learners
CatBoost | CICIDS-2017/SIMARGL | verbose=0, default params | 0.03 (default) | – | Built-in convergence criteria
Gradient Boosting | CICIDS-2017/SIMARGL | n_estimators=100 (default), learning_rate=0.1 | 0.1 | 100 | Fixed boosting rounds
XGBoost | CICIDS-2017/SIMARGL | objective=multi:softmax, random_state=42, max_depth=6 | 0.01 | 500 | Early stopping after 20 rounds without improvement
Averaging | CICIDS-2017 | DT + KNN (k = 5) + RF (n = 100) | – | – | Equal probability averaging
Averaging | SIMARGL | DT + CatBoost + RF (n = 100) | – | – | Equal probability averaging
Max Voting | CICIDS-2017 | DT + KNN (k = 5) + RF (n = 100) | – | – | Hard voting (majority rule)
Max Voting | SIMARGL | DT + KNN + RF (n = 100), with StandardScaler | – | – | Hard voting (majority rule)
Weighted Averaging | CICIDS-2017 | DT (0.4) + KNN (k = 5, 0.3) + RF (n = 100, 0.3) | – | – | Weighted probability averaging
Weighted Averaging | SIMARGL | DT (0.4) + CatBoost (0.3) + RF (n = 100, 0.3) | – | – | Weighted probability averaging

Appendix A.3. Ensemble Method Implementations

The adaptive boosting (ADA) classifier maintained scikit-learn’s default configuration throughout our experiments. For the XGBoost (XGB) implementation, we specified a learning rate of 0.1 and set the objective function to multi:softmax for multiclass classification. The CatBoost (CAT) algorithm was employed with its standard parameter settings.
Our voting ensemble implementations included three variants: a majority voting approach combining LR and DT predictions through hard voting; a simple averaging method aggregating outputs from the DT, KNN, and RF models with equal weighting; and a weighted averaging scheme assigning differential importance (0.4 for DT, 0.3 for KNN, and 0.3 for RF).
The bootstrap aggregating (bagging) implementation utilized four base estimators (RF, MLP, LR, and DT), with the ensemble size matching the number of base models. Sampling was performed with replacement during the aggregation process. The random forest (RF) configuration included 100 trees with a maximum depth of 10 and required a minimum of 2 samples for node splitting.
For the blending method, we trained four base models (RF, MLP, LR, and DT) on a holdout validation set, with predictions from these models serving as inputs to the meta-learner. The stacked generalization approach followed a similar architecture, using the same base models (RF, MLP, LR, and DT) with their predictions combined through a meta-classifier trained on the stacked outputs.
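To make the blending architecture concrete, the following minimal sketch fits the four base models, turns their class-probability outputs on a held-out split into meta-features, and trains a logistic regression meta-learner; variable names and split sizes are placeholders, and this is a simplified sketch rather than the released implementation.

```python
# Minimal blending sketch: base models -> holdout probabilities -> meta-learner.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def blend_fit_predict(X, y, X_test):
    X_tr, X_hold, y_tr, y_hold = train_test_split(X, y, test_size=0.2, random_state=42)
    bases = [
        DecisionTreeClassifier(random_state=42),
        RandomForestClassifier(n_estimators=100, random_state=42),
        LogisticRegression(max_iter=1000),
        MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=200, random_state=42),
    ]
    for m in bases:
        m.fit(X_tr, y_tr)
    # Class-probability outputs of the base models become meta-features.
    meta_hold = np.hstack([m.predict_proba(X_hold) for m in bases])
    meta_test = np.hstack([m.predict_proba(X_test) for m in bases])
    meta = LogisticRegression(max_iter=1000).fit(meta_hold, y_hold)
    return meta.predict(meta_test)
```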

References

  1. Northcutt, S.; Novak, J. Network Intrusion Detection; Sams Publishing: Indianapolis, IN, USA, 2002. [Google Scholar]
  2. Mukherjee, B.; Heberlein, L.T.; Levitt, K.N. Network Intrusion Detection. IEEE Netw. 1994, 8, 26–41. [Google Scholar] [CrossRef]
  3. Apruzzese, G.; Andreolini, M.; Ferretti, L.; Marchetti, M.; Colajanni, M. Modeling Realistic Adversarial Attacks against Network Intrusion Detection Systems. Digit. Threat. Res. Pract. Dtrap 2022, 3, 1–19. [Google Scholar] [CrossRef]
  4. Buczak, A.L.; Guven, E. A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection. IEEE Commun. Surv. Tutorials 2015, 18, 1153–1176. [Google Scholar] [CrossRef]
  5. Dina, A.S.; Manivannan, D. Intrusion Detection based on Machine Learning Techniques in Computer Networks. Internet Things 2021, 16, 100462. [Google Scholar] [CrossRef]
  6. Kim, J.; Shin, N.; Jo, S.Y.; Kim, S.H. Method of Intrusion Detection Using Deep Neural Network. In Proceedings of the 2017 IEEE International Conference on Big Data and Smart Computing (BIGCOMP), Jeju, South Korea, 13–16 February 2017; IEEE: New York, NY, USA, 2017; pp. 313–316. [Google Scholar]
  7. Tang, C.; Luktarhan, N.; Zhao, Y. SAAE-DNN: Deep learning method on intrusion detection. Symmetry 2020, 12, 1695. [Google Scholar] [CrossRef]
  8. Ferrag, M.A.; Maglaras, L.; Ahmim, A.; Derdour, M.; Janicke, H. RDTIDS: Rules and decision tree-based intrusion detection system for internet-of-things networks. Future Internet 2020, 12, 44. [Google Scholar] [CrossRef]
  9. Al-Omari, M.; Rawashdeh, M.; Qutaishat, F.; Alshira’h, M.; Ababneh, N. An intelligent tree-based intrusion detection model for cyber security. J. Netw. Syst. Manag. 2021, 29, 1–18. [Google Scholar] [CrossRef]
  10. Nick, T.G.; Campbell, K.M. Logistic regression. In Topics in Biostatistics. Methods in Molecular Biology; Humana Press: Totowa, NJ, USA, 2007; Volume 404, pp. 273–301. [Google Scholar]
  11. Panigrahi, R.; Borah, S.; Pramanik, M.; Bhoi, A.K.; Barsocchi, P.; Nayak, S.R.; Alnumay, W. Intrusion detection in cyber–physical environment using hybrid naïve bayes—Decision table and multi-objective evolutionary feature selection. Comput. Commun. 2022, 188, 133–144. [Google Scholar] [CrossRef]
  12. Balyan, A.K.; Ahuja, S.; Lilhore, U.K.; Sharma, S.K.; Manoharan, P.; Algarni, A.D.; Elmannai, H.; Raahemifar, K. A hybrid intrusion detection model using ega-pso and improved random forest method. Sensors 2022, 22, 5986. [Google Scholar] [CrossRef]
  13. Waskle, S.; Parashar, L.; Singh, U. Intrusion detection system using pca with random forest approach. In Proceedings of the 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC), Coimbatore, India, 2–4 July 2020; IEEE: New York, NY, USA, 2020; pp. 803–808. [Google Scholar]
  14. Arisdakessian, S.; Wahab, O.A.; Mourad, A.; Otrok, H.; Guizani, M. A survey on IoT intrusion detection: Federated learning, game theory, social psychology and explainable ai as future directions. IEEE Internet Things J. 2023, 10, 4059–4092. [Google Scholar] [CrossRef]
  15. Sabev, S.I. Integrated approach to cyber defence: Human in the loop. Technical Evaluation Report. Inf. Secur. Int. J. 2020, 44, 76–92. [Google Scholar]
  16. Arreche, O.; Bibers, I.; Abdallah, M. A two-level ensemble learning framework for enhancing network intrusion detection systems. IEEE Access 2024, 12, 83830–83857. [Google Scholar] [CrossRef]
  17. Mijalkovic, J.; Spognardi, A. Reducing the false negative rate in deep learning based network intrusion detection systems. Algorithms 2022, 15, 258. [Google Scholar] [CrossRef]
  18. Aburomman, A.A.; Reaz, M.B.I. A survey of intrusion detection systems based on ensemble and hybrid classifiers. Comput. Secur. 2017, 65, 135–152. [Google Scholar] [CrossRef]
  19. Tama, B.A.; Lim, S. Ensemble learning for intrusion detection systems: A systematic mapping study and cross-benchmark evaluation. Comput. Sci. Rev. 2021, 39, 100357. [Google Scholar] [CrossRef]
  20. Mahfouz, A.; Abuhussein, A.; Venugopal, D.; Shiva, S. Ensemble classifiers for network intrusion detection using a novel network attack dataset. Future Internet 2020, 12, 180. [Google Scholar] [CrossRef]
  21. Thockchom, N.; Singh, M.; Nandi, U. A novel ensemble learning-based model for network intrusion detection. Complex Intell. Syst. 2023, 9, 5693–5714. [Google Scholar] [CrossRef]
  22. Mirsky, Y.; Doitshman, T.; Elovici, Y.; Shabtai, A. Kitsune: An ensemble of autoencoders for online network intrusion detection. arXiv 2018, arXiv:1802.09089. [Google Scholar] [CrossRef]
  23. Al-A’araji, N.H.; Al-Mamory, S.O.; Al-Shakarchi, A.H. Classification and clustering based ensemble techniques for intrusion detection systems: A survey. J. Physics Conf. Ser. 2021, 1818, 012106. [Google Scholar] [CrossRef]
  24. Caruana, R.; Niculescu-Mizil, A.; Crew, G.; Ksikes, A. Ensemble selection from libraries of models. In Proceedings of the Twenty-first International Conference on Machine Learning ICML’04, New York, NY, USA, 4–8 July 2004; p. 18. [Google Scholar] [CrossRef]
  25. Zainal, A.; Maarof, M.; Shamsuddin, S.M. Ensemble classifiers for network intrusion detection system. J. Inf. Assur. Secur. 2009, 4, 217–225. [Google Scholar]
  26. Kiflay, A.Z.; Tsokanos, A.; Kirner, R. A network intrusion detection system using ensemble machine learning. In Proceedings of the 2021 International Carnahan Conference on Security Technology (ICCST), Hatfield, UK, 1–15 October 2021; pp. 1–6. [Google Scholar] [CrossRef]
  27. Das, S.; Saha, S.; Priyoti, A.T.; Roy, E.K.; Sheldon, F.T.; Haque, A.; Shiva, S. Network intrusion detection and comparative analysis using ensemble machine learning and feature selection. IEEE Trans. Netw. Serv. Manag. 2022, 19, 4821–4833. [Google Scholar] [CrossRef]
  28. Zhang, H.; Li, J.L.; Liu, X.M.; Dong, C. Multi-dimensional feature fusion and stacking ensemble mechanism for network intrusion detection. Future Gener. Comput. Syst. 2021, 122, 130–143. [Google Scholar] [CrossRef]
  29. Hsu, Y.F.; He, Z.; Tarutani, Y.; Matsuoka, M. Toward an online network intrusion detection system based on ensemble learning. In Proceedings of the 2019 IEEE 12th International Conference on Cloud Computing (Cloud), Milan, Italy, 8–13 July 2019; pp. 174–178. [Google Scholar]
  30. Kumar Singh Gautam, R.; Doegar, E.A. An ensemble approach for intrusion detection system using machine learning algorithms. In Proceedings of the 2018 8th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 11–12 January 2018; pp. 14–15. [Google Scholar] [CrossRef]
  31. Divyasree, T.; Sherly, K. A network intrusion detection system based on ensemble cvm using efficient feature selection approach. Procedia Comput. Sci. 2018, 143, 442–449. [Google Scholar] [CrossRef]
  32. Alotaibi, Y.; Ilyas, M. Ensemble-learning framework for intrusion detection to enhance internet of things’ devices security. Sensors 2023, 23, 5568. [Google Scholar] [CrossRef] [PubMed]
  33. Lazzarini, R.; Tianfield, H.; Charissis, V. A stacking ensemble of deep learning models for iot intrusion detection. Knowl.-Based Syst. 2023, 279, 110941. [Google Scholar] [CrossRef]
  34. Panigrahi, R.; Borah, S. A detailed analysis of cicids2017 dataset for designing intrusion detection systems. Int. J. Eng. Technol. 2017, 7, 479–482. [Google Scholar]
  35. Mihailescu, M.E.; Mihai, D.; Carabas, M.; Komisarek, M.; Pawlicki, M.; Hołubowicz, W.; Kozik, R. The proposition and evaluation of the roedunet-simargl2021 network intrusion detection dataset. Sensors 2021, 21, 4319. [Google Scholar] [CrossRef]
  36. Hong, S.; Yue, T.; You, Y.; Lv, Z.; Tang, X.; Hu, J.; Yin, H. A resilience recovery method for complex traffic network security based on trend forecasting. Int. J. Intell. Syst. 2025, 2025, 3715086. [Google Scholar] [CrossRef]
  37. Strom, B.E.; Applebaum, A.; Miller, D.P.; Nickels, K.C.; Pennington, A.G.; Thomas, C.B. Mitre Att&ck: Design and Philosophy; Technical Report; The Mitre Corporation: McLean, VA, USA, 2018. [Google Scholar]
  38. Malware Repository. 2021. Available online: https://attack.mitre.org/datasources/DS0004/ (accessed on 30 April 2024).
  39. Lee, C.B.; Roedel, C.; Silenok, E. Detection and characterization of port scan attacks. University of California, Department of Computer Science and Engineering 2003. Available online: https://cseweb.ucsd.edu/~clbailey/PortScans.pdf (accessed on 15 May 2025).
  40. Kurniabudi; Stiawan, D.; Darmawijoyo; Bin Idris, M.Y.; Bamhdi, A.M.; Budiarto, R. CICIDS-2017 dataset feature analysis with information gain for anomaly detection. IEEE Access 2020, 8, 132911–132921. [Google Scholar] [CrossRef]
  41. Drive-by Compromise. 2023. Available online: https://attack.mitre.org/techniques/T1189/ (accessed on 21 October 2023).
  42. Chen, Y.; Lin, Q.; Wei, W.; Ji, J.; Wong, K.C.; Coello Coello, C.A. Intrusion detection using multi-objective evolutionary convolutional neural network for internet of things in fog computing. Knowl.-Based Syst. 2022, 244, 108505. [Google Scholar] [CrossRef]
  43. Gorodetski, V.; Kotenko, I. Attacks against computer network: Formal grammar-based framework and simulation tool. In Proceedings of the International Workshop on Recent Advances in Intrusion Detection, 5th International Symposium, RAID 2002, Zurich, Switzerland, 16–18 October 2002; Springer: Berlin/Heidelberg, Germany, 2002; pp. 219–238. [Google Scholar]
  44. Skwarek, M.; Korczynski, M.; Mazurczyk, W.; Duda, A. Characterizing vulnerability of dns axfr transfers with global-scale scanning. In Proceedings of the 2019 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 19–23 May 2019; IEEE: New York, NY, USA, 2019; pp. 193–198. [Google Scholar]
  45. Mirzaei, O.; Vasilenko, R.; Kirda, E.; Lu, L.; Kharraz, A. SCRUTINIZER: Detecting code reuse in malware via decompilation and machine learning. In Proceedings of the Detection of Intrusions and Malware, and Vulnerability Assessment: 18th International Conference, DIMVA 2021, Virtual Event, 14–16 July 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 130–150. [Google Scholar]
  46. Khan, A.; Kim, H.; Lee, B. M2MON: Building an mmio-based security reference monitor for unmanned vehicles. In Proceedings of the 30th USENIX Security Symposium, Virtual, 11–13 August 2021. [Google Scholar]
  47. Lukacs, S.; Sirb, C.B.; Lutas, D.H.; Colesa, A.V. Strongly Isolated Malware Scanning Using Secure Virtual Containers. U.S. Patent 9117081B2, 25 August 2015. [Google Scholar]
  48. Kim, A.; Park, M.; Lee, D.H. AI-IDS: Application of deep learning to real-time web intrusion detection. IEEE Access 2020, 8, 70245–70261. [Google Scholar] [CrossRef]
  49. Flow Information Elements-Nprobe 10.1 Documentation. Available online: https://www.ntop.org/guides/nprobe/flow_information_elements.html (accessed on 1 April 2025).
  50. Ahlashkari. Master ahlashkari/cicflowmeter. cicflowmeter/readme.txt. 2021. Available online: https://github.com/ahlashkari/CICFlowMeter (accessed on 1 May 2025).
  51. Claise, B. CISCO Systems Netflow Services Export Version 9; Technical Report; Cisco Systems: San Jose, CA, USA, 2004. [Google Scholar]
  52. Sharafaldin, I.; Gharib, A.; Lashkari, A.H.; Ghorbani, A.A. Towards a reliable intrusion detection benchmark dataset. Softw. Netw. 2018, 2018, 177–200. [Google Scholar] [CrossRef]
  53. Stewart, C.A.; Welch, V.; Plale, B.; Fox, G.C.; Pierce, M.; Sterling, T. Indiana University Pervasive Technology Institute: Technical Report; Indiana University Pervasive Technology Institute: Bloomington, IN, USA, 2017. [Google Scholar]
  54. Mebawondu, J.O.; Alowolodu, O.D.; Mebawondu, J.O.; Adetunmbi, A.O. Network intrusion detection system using supervised learning paradigm. Sci. Afr. 2020, 9, e00497. [Google Scholar] [CrossRef]
  55. Song, Y.Y.; Ying, L. Decision tree methods: Applications for classification and prediction. Shanghai Arch. Psychiatry 2015, 27, 130. [Google Scholar] [PubMed]
  56. Dreiseitl, S.; Ohno-Machado, L. Logistic regression and artificial neural network classification models: A methodology review. J. Biomed. Inform. 2002, 35, 352–359. [Google Scholar] [CrossRef] [PubMed]
  57. Li, W.; Yi, P.; Wu, Y.; Pan, L.; Li, J. A new intrusion detection system based on knn classification algorithm in wireless sensor network. J. Electr. Comput. Eng. 2014, 2014, 40217. [Google Scholar] [CrossRef]
  58. Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient boosting with categorical features support. arXiv 2018, arXiv:1810.11363. [Google Scholar] [CrossRef]
  59. Jin, D.; Lu, Y.; Qin, J.; Cheng, Z.; Mao, Z. SwiftIDS: Real-time intrusion detection system based on lightgbm and parallel intrusion detection mechanism. Comput. Secur. 2020, 97, 101984. [Google Scholar] [CrossRef]
  60. Yulianto, A.; Sukarno, P.; Suwastika, N.A. Improving adaboost-based intrusion detection system (ids) performance on cic ids 2017 dataset. J. Phys. Conf. Ser. 2019, 1192, 012018. [Google Scholar] [CrossRef]
  61. Natekin, A.; Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobotics 2013, 7, 21. [Google Scholar] [CrossRef]
  62. Dhaliwal, S.S.; Nahid, A.A.; Abbas, R. Effective intrusion detection system using xgboost. Information 2018, 9, 149. [Google Scholar] [CrossRef]
  63. Dietterich, T.G. Ensemble methods in machine learning. In Proceedings of the International Workshop on Multiple Classifier Systems, Cagliari, Italy, 21–23 June 2000; Springer: Berlin/Heidelberg, Germany, 2000; pp. 1–15. [Google Scholar]
Figure 1. An illustration of our proposed framework. It presents a comprehensive approach to intrusion detection through ensemble learning, incorporating three key components: a diverse collection of machine learning algorithms (individual classifiers, simple ensemble methods, and advanced ensemble techniques), multiple network intrusion datasets (CICIDS-2017 and RoEduNet-SIMARGL2021) representing varied attack scenarios, and optimized feature selection methodologies to improve both detection accuracy and computational efficiency.
Figure 2. Confusion matrices for DT, RF, soft voting ensemble, weighted averaging, bagging, and Catboost. All of these models have perfect performance on the RoEduNet-SIMARGL2021 dataset.
Figure 3. Heatmap of Wilcoxon p-values for the adversarial effect, confirming statistically significant degradation (all p-values < 10^-25).
Figure 4. Computational cost (log-scale) for training, testing, and cross-validation. RF incurs higher computational overhead than does DT.
Table 1. Comparison between our work and prior ensemble learning approaches for intrusion detection systems.
Study | Dataset(s) | Ensemble Method(s) | Base Classifiers | Classification Type | Limitations
Al-a’araji et al. [23] | KDD’99, NSL-KDD, Kyoto 2006+, AWID | Bagging, boosting, stacking, voting | NN, SVM, DT, RBF | Binary | Broad survey
Thockchom et al. [21] | CICIDS-2017, UNSW-NB15, KDD’99 | SGD-based ensemble | GNB, LR, DT | Binary | Imbalanced data
Alotaibi et al. [32] | TON-IoT | Stacking, voting | RF, KNN, DT, LR | Binary | No other datasets
Mirsky et al. [22] | Custom streaming data | Autoencoder ensemble | Autoencoders | Binary | Streaming-focused
Caruana et al. [24] | Custom binary datasets | Ensemble selection | RF, NB, LR | Binary | Non-IDS datasets
Lazzarini et al. [33] | CICIDS-2017, ToN-IoT | Stacking | DNN, CNN, RNN, LSTM | Multiclass | High computational cost
Mahfouz et al. [20] | GTCS | Majority voting | J48, MLP, IBK | Multiclass | Limited model diversity
Our Work | SIMARGL2021, CICIDS-2017 | Bagging, boosting, stacking, blending, voting | LR, DT, RF, MLP, KNN | Binary, Multiclass | Limited datasets
Table 2. Top 10 features selected for the CICIDS-2017 Dataset.
Information Gain | K-Best (ANOVA F-Score)
Average Packet Size | Fwd IAT Std
Packet Length Mean | Bwd Packet Length Std
Packet Length Std | Bwd Packet Length Mean
Packet Length Variance | Avg Bwd Segment Size
Total Length of Bwd Packets | Bwd Packet Length Max
Subflow Bwd Bytes | Idle Min
Bwd Packet Length Mean | Idle Mean
Avg Bwd Segment Size | Packet Length Std
Subflow Fwd Bytes | Idle Max
Total Length of Fwd Packets | Flow IAT Max
Table 3. Top 10 Features selected for the RoEduNet-SIMARGL2021 Dataset.
Information Gain | K-Best (ANOVA F-Score)
IPV4_SRC_ADDR | TCP_WIN_MIN_IN
TCP_FLAGS | TCP_WIN_MAX_IN
IPV4_DST_ADDR | TCP_WIN_MSS_IN
IN_BYTES | TCP_WIN_SCALE_IN
FLOW_ID | IPV4_DST_ADDR
TOTAL_FLOWS_EXP | PROTOCOL
TCP_WIN_MAX_IN | TOTAL_FLOWS_EXP
TCP_WIN_SCALE_IN | FLOW_ID
TCP_WIN_MIN_IN | ANALYSIS_TIMESTAMP
FLOW_DURATION_MILLISECONDS | FIRST_SWITCHED
Table 4. Description of main features for the RoEduNet-SIMARGL2021 dataset [49]. The feature name is given on the left and the explanation on the right.
Main Features | Explanation
FLOW_DURATION_MS | Total elapsed time of the flow in milliseconds
PROTOCOL_MAP | Type of network protocol used (e.g., TCP, UDP, ICMP, IPv6)
TCP_FLAGS | Aggregated flags recorded across the TCP flow
TCP_WIN_MAX_IN | Max. observed TCP window size from source to destination
TCP_WIN_MAX_OUT | Maximum observed TCP window size from destination to source
TCP_WIN_MIN_IN | Smallest TCP window value from source to destination
TCP_WIN_MIN_OUT | Smallest TCP window value from destination to source
TCP_WIN_SCALE_IN | Scaling factor used for TCP window from source to destination
TCP_WIN_MSS_IN | Maximum segment size for TCP from source to destination
TCP_WIN_SCALE_OUT | Scaling value for TCP window size from destination to source
SRC_TOS | Type of Service or DSCP value for source to destination traffic
DST_TOS | Type of Service or DSCP value for destination to source traffic
FIRST_SWITCHED | Timestamp of the first packet in the flow (based on uptime)
LAST_SWITCHED | Timestamp of the final packet in the flow (based on uptime)
TOTAL_FLOWS_EXP | Count of all exported flows for the observed connection
Table 5. Description of the main features for the CICIDS-2017 dataset [50]. The feature name is given on the left and the explanation on the right.
CICIDS-2017 Features | Explanation
Packet Length Std | Standard deviation of packet lengths within a flow
Total Length of Bwd Packets | Cumulative size of all packets sent in the reverse direction
Subflow Bwd Bytes | Average byte count per backward subflow
Destination Port | Target port identifier for the network traffic
Packet Length Variance | Statistical variance of packet lengths in the connection
Bwd Packet Length Mean | Average packet size in the backward stream
Avg Bwd Segment Size | Mean segment size in the reverse flow
Bwd Packet Length Max | Largest observed packet size in the backward direction
Init_Win_Bytes_Backward | Initial byte window size from receiver to sender
Total Length of Fwd Packets | Aggregate length of packets transmitted in the forward direction
Subflow Fwd Bytes | Mean byte count per forward subflow
Init_Win_Bytes_Forward | Initial size of the byte window from sender to receiver
Average Packet Size | Mean packet size calculated across the entire flow
Packet Length Mean | Average value of all packet lengths in the flow
Max Packet Length | Highest packet length observed in the connection
Table 6. Overview and key metrics of the two network intrusion datasets utilized in this study [16], detailing dataset size, count of attack labels, and number of features used for intrusion detection.
Dataset | No. of Labels | No. of Features | No. of Samples
CICIDS-2017 | 7 | 78 | 2,775,364
RoEduNet-SIMARGL2021 | 3 | 29 | 31,433,875
Table 7. Distribution of samples among different attack (intrusion) types for the datasets [16].
Dataset | Normal | DoS | PortScan | Brute Force | Web Attack | Bot | Infiltration
CICIDS-2017 | 84.442% | 9.104% | 5.726% | 0.498% | 0.157% | 0.071% | 0.001%
RoEduNet2021 | 62.20% | 24.53% | 13.27% | – | – | – | –
Table 11. Performance comparison of various base learners on the CICIDS-2017 dataset. Results are categorized by feature selection method (All Features, Top 5, and Top 10) and ordered by F1-score within each category.
Model | Accuracy | Precision | Recall | F1-Score | Training Time (s) | Prediction Time (s) | Total Time (s)
All Features
Decision Tree | 0.998117 | 0.998138 | 0.998117 | 0.998126 | 164.57 | 0.20 | 164.76
K-Neighbors Classifier | 0.991597 | 0.991646 | 0.991597 | 0.991597 | 1.34 | 611.05 | 612.39
Logistic Regression | 0.888424 | 0.857927 | 0.888424 | 0.870850 | 103.30 | 0.16 | 103.46
MLP | 0.865939 | 0.752318 | 0.865939 | 0.804494 | 1147.63 | 0.52 | 1148.15
IG Top 5 Features
Decision Tree | 0.989536 | 0.989346 | 0.989536 | 0.988682 | 9.05 | 0.04 | 9.09
K-Neighbors Classifier | 0.988075 | 0.987584 | 0.988075 | 0.987030 | 3.08 | 505.94 | 509.03
Logistic Regression | 0.871341 | 0.812407 | 0.871341 | 0.837537 | 67.54 | 0.04 | 67.58
MLP | 0.930288 | 0.944872 | 0.930288 | 0.925675 | 286.53 | 0.23 | 286.76
IG Top 10 Features
Decision Tree | 0.989611 | 0.989480 | 0.989611 | 0.988766 | 16.96 | 0.04 | 17.01
K-Neighbors Classifier | 0.988675 | 0.988495 | 0.988675 | 0.987677 | 5.46 | 732.40 | 737.86
Logistic Regression | 0.895156 | 0.849313 | 0.895156 | 0.864057 | 84.08 | 0.05 | 84.13
MLP | 0.882600 | 0.856491 | 0.882600 | 0.847658 | 343.17 | 0.28 | 343.46
K-Best Top 10 Features
Decision Tree | 0.996238 | 0.996170 | 0.996238 | 0.996171 | 25.05 | 0.06 | 25.10
K-Neighbors Classifier | 0.989883 | 0.989836 | 0.989883 | 0.989817 | 5.65 | 34.17 | 39.82
Logistic Regression | 0.653617 | 0.762923 | 0.653617 | 0.692819 | 68.25 | 0.05 | 68.30
MLP | 0.867083 | 0.811638 | 0.867083 | 0.806455 | 309.98 | 0.29 | 310.28
K-Best Top 5 Features
Decision Tree | 0.962192 | 0.960752 | 0.962192 | 0.961301 | 9.58 | 0.05 | 9.64
K-Neighbors Classifier | 0.952291 | 0.949686 | 0.952291 | 0.950723 | 2.82 | 38.96 | 41.78
Logistic Regression | 0.489505 | 0.740203 | 0.489505 | 0.557483 | 68.26 | 0.05 | 68.31
MLP | 0.838636 | 0.826221 | 0.838636 | 0.782772 | 407.30 | 0.23 | 407.53
Table 12. Performance Comparison of Ensemble Learning Methods on the CICIDS-2017 dataset. Results are grouped by feature selection approach (All Features, Top 5, and Top 10) and ranked by F1-Score within each group.
Model | Accuracy | Precision | Recall | F1-Score | Training Time (s) | Prediction Time (s) | Total Time (s)
All Features
Blending | 0.998781 | 0.998703 | 0.998781 | 0.998720 | 5150.21 | 16.83 | 5167.04
CatBoost | 0.998934 | 0.998918 | 0.998934 | 0.998865 | 3934.98 | 0.61 | 3935.58
Stacking | 0.998765 | 0.998755 | 0.998765 | 0.998758 | 11133.67 | 1170.53 | 12304.20
Soft Voting | 0.998436 | 0.998433 | 0.998436 | 0.998433 | 1302.17 | 516.76 | 1818.92
Random Forest | 0.998309 | 0.998281 | 0.998309 | 0.998290 | 736.57 | 8.20 | 744.77
Weighted Avg. | 0.998422 | 0.998427 | 0.998422 | 0.998423 | 1885.23 | 1362.13 | 3247.35
Max Voting | 0.998353 | 0.998325 | 0.998353 | 0.998333 | 1706.85 | 1261.39 | 2968.24
XGBoost | 0.996085 | 0.996066 | 0.996085 | 0.995532 | 1303.62 | 1.72 | 1305.34
Gradient Boosting | 0.966866 | 0.990461 | 0.966866 | 0.977287 | 45650.37 | 22.62 | 45672.98
Bagging | 0.991698 | 0.991716 | 0.991698 | 0.991682 | 25.08 | 9444.05 | 9469.14
Adaptive Boosting | 0.842079 | 0.872464 | 0.842079 | 0.848721 | 948.01 | 5.49 | 953.50
IG Top 5 Features
CatBoost | 0.988640 | 0.988383 | 0.988640 | 0.987672 | 865.11 | 0.45 | 865.57
Random Forest | 0.989591 | 0.989635 | 0.989591 | 0.988724 | 280.61 | 3.67 | 284.28
Soft Voting | 0.989603 | 0.989645 | 0.989603 | 0.988739 | 392.43 | 487.84 | 880.27
Max Voting | 0.989593 | 0.989634 | 0.989593 | 0.988728 | 1080.28 | 1139.73 | 2220.01
Weighted Avg. | 0.989605 | 0.989638 | 0.989605 | 0.988744 | 909.75 | 1119.27 | 2029.02
Stacking | 0.989470 | 0.989442 | 0.989470 | 0.988652 | 3443.69 | 1125.45 | 4569.14
Gradient Boosting | 0.971763 | 0.986697 | 0.971763 | 0.978720 | 4143.04 | 9.81 | 4152.85
XGBoost | 0.987131 | 0.986839 | 0.987131 | 0.985801 | 1130.70 | 1.09 | 1131.80
Bagging | 0.988196 | 0.987953 | 0.988196 | 0.987152 | 40.24 | 5392.05 | 5432.29
Adaptive Boosting | 0.891326 | 0.807278 | 0.891326 | 0.846459 | 102.43 | 2.25 | 104.68
IG Top 10 Features
CatBoost | 0.989409 | 0.989388 | 0.989409 | 0.988506 | 729.29 | 0.48 | 729.76
Random Forest | 0.989690 | 0.989741 | 0.989690 | 0.988842 | 363.05 | 3.62 | 366.67
Soft Voting | 0.989702 | 0.989747 | 0.989702 | 0.988855 | 597.98 | 736.07 | 1334.05
Max Voting | 0.989692 | 0.989741 | 0.989692 | 0.988845 | 1349.14 | 1544.59 | 2893.72
Weighted Avg. | 0.989694 | 0.989734 | 0.989694 | 0.988848 | 1318.02 | 1420.76 | 2738.78
Gradient Boosting | 0.988822 | 0.987958 | 0.988822 | 0.987800 | 6510.82 | 10.58 | 6521.40
Stacking | 0.887046 | 0.992105 | 0.887046 | 0.933565 | 4843.96 | 1288.30 | 6132.26
XGBoost Ensemble | 0.987835 | 0.987835 | 0.987835 | 0.986664 | 1125.95 | 0.84 | 1126.79
Bagging | 0.988475 | 0.988251 | 0.988475 | 0.987503 | 68.61 | 7674.60 | 7743.21
Adaptive Boosting | 0.891526 | 0.825744 | 0.891526 | 0.846705 | 173.90 | 2.44 | 176.34
K-Best Top 10 Features
Random Forest | 0.996620 | 0.996545 | 0.996620 | 0.996540 | 419.21 | 4.71 | 423.92
Blending | 0.996577 | 0.996440 | 0.996577 | 0.996386 | 1305.54 | 5.56 | 1311.11
Stacking | 0.996684 | 0.996577 | 0.996684 | 0.996587 | 2714.78 | 465.17 | 3179.94
Soft Voting | 0.996571 | 0.996490 | 0.996571 | 0.996496 | 647.88 | 24.93 | 672.81
Max Voting | 0.996610 | 0.996530 | 0.996610 | 0.996527 | 936.31 | 46.24 | 982.54
Weighted Avg. | 0.996489 | 0.996403 | 0.996489 | 0.996415 | 884.30 | 29.12 | 913.42
CatBoost | 0.995681 | 0.995591 | 0.995681 | 0.995306 | 743.92 | 0.48 | 744.40
XGBoost Ensemble | 0.992489 | 0.991618 | 0.992489 | 0.991906 | 468.79 | 0.64 | 469.43
Bagging | 0.989938 | 0.989886 | 0.989938 | 0.989865 | 77.63 | 223.84 | 301.47
Gradient Boosting | 0.370658 | 0.871287 | 0.370658 | 0.510177 | 7158.59 | 10.63 | 7169.22
Adaptive Boosting | 0.902619 | 0.852833 | 0.902619 | 0.869632 | 194.43 | 2.48 | 196.91
K-Best Top 5 Features
Random Forest | 0.962920 | 0.961458 | 0.962920 | 0.961999 | 278.49 | 5.04 | 283.53
Blending Ensemble | 0.962688 | 0.960794 | 0.962688 | 0.961504 | 1056.76 | 6.46 | 1063.22
Soft Voting | 0.962097 | 0.961091 | 0.962097 | 0.961452 | 427.28 | 32.37 | 459.65
Max Voting | 0.962916 | 0.961599 | 0.962916 | 0.962075 | 615.56 | 63.65 | 679.21
Weighted Avg. | 0.961938 | 0.960924 | 0.961938 | 0.961294 | 606.83 | 47.02 | 653.85
CatBoost | 0.957970 | 0.955943 | 0.957970 | 0.956140 | 685.86 | 0.45 | 686.30
Stacking | 0.956648 | 0.959269 | 0.956648 | 0.957708 | 1694.01 | 470.52 | 2164.53
Gradient Boosting | 0.946469 | 0.946940 | 0.946469 | 0.946288 | 3980.76 | 9.79 | 3990.55
XGBoost | 0.950063 | 0.944017 | 0.950063 | 0.944816 | 578.48 | 0.68 | 579.16
Bagging | 0.953791 | 0.953841 | 0.953791 | 0.953689 | 39.51 | 326.80 | 366.31
Adaptive Boosting | 0.905589 | 0.859825 | 0.905589 | 0.874778 | 110.29 | 2.31 | 112.60
Table 13. Pairwise statistical test results between every pair of methods by paired t-test. The statistically better method (p < 0.05) is shown in bold (both are marked bold if there is no significant difference). The left half shows the RoEduNet-SIMARGL2021 dataset; the right half shows the CICIDS-2017 dataset.
Method 1 (RoEduNet-SIMARGL2021) | Method 2 | t-Statistic | p-Value | Method 1 (CICIDS-2017) | Method 2 | t-Statistic | p-Value
DT | MLP | 1.8816 | 0.1330 | DT | KNN | 2.9884 | 0.0404
DT | LR | 1.4078 | 0.2319 | DT | LR | 3.9523 | 0.0168
MLP | Logistic Regression | 1.3107 | 0.2601 | DT | MLP | 6.2830 | 0.0033
Random Forest | Soft Voting | 1.0000 | 0.3739 | KNN | LR | 3.9585 | 0.0167
Random Forest | Weighted Avg. | 1.0000 | 0.3739 | KNN | MLP | 6.3823 | 0.0031
Random Forest | Bagging | 1.1048 | 0.3312 | LR | MLP | −1.3453 | 0.2497
Random Forest | Stacking | 1.0000 | 0.3739 | Blending | CatBoost | 1.2573 | 0.3356
Random Forest | Adaptive Boosting | 1.0791 | 0.3413 | Blending | Stacking | 0.9079 | 0.4598
Random Forest | CatBoost | 1.0790 | 0.3413 | Blending | Soft Voting | 0.6623 | 0.5759
Random Forest | Gradient Boosting | 1.3270 | 0.2552 | Blending | Random Forest | −0.2703 | 0.8123
Random Forest | XGBoost | 1.2150 | 0.2912 | Blending | Weighted Avg. | 1.6349 | 0.2437
Random Forest | Blending | 24.1374 | 0.0000 | Blending | Max Voting | −0.391 | 0.7335
Soft Voting | Bagging | 1.0790 | 0.3413 | Blending | XGBoost | 1.8877 | 0.1997
Soft Voting | Stacking | 0.0000 | 1.0000 | Blending | Gradient Boosting | 1.1174 | 0.38
Soft Voting | Adaptive Boosting | 1.0780 | 0.3417 | Blending | Bagging | 18.946 | 0.0028
Soft Voting | CatBoost | 1.0529 | 0.3518 | Blending | Adaptive Boosting | 6.5569 | 0.0225
Soft Voting | Gradient Boosting | 1.2533 | 0.2784 | CatBoost | Stacking | −1.7673 | 0.2192
Soft Voting | XGBoost | 1.2005 | 0.2962 | CatBoost | Soft Voting | −1.1834 | 0.3582
Soft Voting | Blending | 24.5718 | 0.0000 | CatBoost | Random Forest | −1.1341 | 0.3744
Weighted Avg. | Bagging | 1.0790 | 0.3413 | CatBoost | Weighted Avg. | −1.1632 | 0.3648
Weighted Avg. | Stacking | 0.0000 | 1.0000 | CatBoost | Max Voting | −1.1435 | 0.3713
Weighted Avg. | Adaptive Boosting | 1.0780 | 0.3417 | CatBoost | XGBoost | 2.2691 | 0.1513
Weighted Avg. | CatBoost | 1.0529 | 0.3518 | CatBoost | Gradient Boosting | 1.1002 | 0.386
Weighted Avg. | Gradient Boosting | 1.2533 | 0.2784 | CatBoost | Bagging | 3.6367 | 0.068
Weighted Avg. | XGBoost | 1.2005 | 0.2962 | CatBoost | Adaptive Boosting | 5.9148 | 0.0274
Weighted Avg. | Blending | 24.5718 | 0.0000 | Stacking | Soft Voting | −0.841 | 0.4889
Bagging | Stacking | −1.1070 | 0.3304 | Stacking | Random Forest | −0.8275 | 0.495
Bagging | Adaptive Boosting | 1.0780 | 0.3417 | Stacking | Weighted Avg. | −0.8014 | 0.507
Bagging | CatBoost | −1.0000 | 0.3739 | Stacking | Max Voting | −0.8402 | 0.4892
Bagging | Gradient Boosting | −1.0000 | 0.3739 | Stacking | XGBoost | 2.3041 | 0.1477
Bagging | XGBoost | 1.3621 | 0.2448 | Stacking | Gradient Boosting | 1.1048 | 0.3844
Bagging | Blending | 5.5973 | 0.0050 | Stacking | Bagging | 6.1517 | 0.0254
Stacking | Adaptive Boosting | 1.0792 | 0.3412 | Stacking | Adaptive Boosting | 6.095 | 0.0259
Stacking | CatBoost | 1.0806 | 0.3407 | Soft Voting | Random Forest | −0.7248 | 0.5439
Stacking | Gradient Boosting | 1.3482 | 0.2489 | Soft Voting | Weighted Avg. | 1.9422 | 0.1916
Stacking | XGBoost | 1.2176 | 0.2903 | Soft Voting | Max Voting | −0.8303 | 0.4937
Stacking | Blending | 22.7185 | 0.0000 | Soft Voting | XGBoost | 1.8598 | 0.204
Adaptive Boosting | CatBoost | −1.0791 | 0.3413 | Soft Voting | Gradient Boosting | 1.1163 | 0.3804
Adaptive Boosting | Gradient Boosting | −1.0757 | 0.3426 | Soft Voting | Bagging | 19.6328 | 0.0026
Adaptive Boosting | XGBoost | −1.0686 | 0.3455 | Soft Voting | Adaptive Boosting | 6.5715 | 0.0224
Adaptive Boosting | Blending | −0.7075 | 0.5183 | Random Forest | Weighted Avg. | 0.9376 | 0.4474
CatBoost | Gradient Boosting | −0.9620 | 0.3905 | Random Forest | Max Voting | −1.3602 | 0.3068
CatBoost | XGBoost | 1.3911 | 0.2366 | Random Forest | XGBoost | 1.809 | 0.2122
CatBoost | Blending | 5.5930 | 0.0050 | Random Forest | Gradient Boosting | 1.1176 | 0.38
Gradient Boosting | XGBoost | 1.1886 | 0.3003 | Random Forest | Bagging | 12.9337 | 0.0059
Gradient Boosting | Blending | 12.9552 | 0.0002 | Random Forest | Adaptive Boosting | 6.6536 | 0.0219
XGBoost | Blending | 3.3683 | 0.0281 | Weighted Avg. | Max Voting | −1.0169 | 0.4162
Random Forest | DT | 1.5000 | 0.2080 | Weighted Avg. | XGBoost | 1.8575 | 0.2043
Random Forest | MLP | 1.8815 | 0.1331 | Weighted Avg. | Gradient Boosting | 1.1157 | 0.3806
Soft Voting | DT | 1.0000 | 0.3739 | Weighted Avg. | Bagging | 21.4614 | 0.0022
Soft Voting | MLP | 1.8812 | 0.1331 | Weighted Avg. | Adaptive Boosting | 6.5519 | 0.0225
Weighted Avg. | DT | 1.0000 | 0.3739 | Max Voting | XGBoost | 1.8095 | 0.2121
Weighted Avg. | MLP | 1.8812 | 0.1331 | Max Voting | Gradient Boosting | 1.118 | 0.3798
Bagging | Decision Tree | −1.2606 | 0.2760 | Max Voting | Bagging | 12.5462 | 0.0063
Bagging | MLP | −1.1874 | 0.3008 | Max Voting | Adaptive Boosting | 6.6603 | 0.0218
Stacking | Decision Tree | −1.1007 | 0.3328 | XGBoost | Gradient Boosting | 1.0525 | 0.403
Stacking | MLP | −1.0029 | 0.3726 | XGBoost | Bagging | −0.2501 | 0.8258
Adaptive Boosting | Decision Tree | −3.9203 | 0.0172 | XGBoost | Adaptive Boosting | 4.9934 | 0.0378
Adaptive Boosting | MLP | −3.8997 | 0.0175 | Gradient Boosting | Bagging | −1.0697 | 0.3968
CatBoost | Decision Tree | −1.8833 | 0.1328 | Gradient Boosting | Adaptive Boosting | −0.3449 | 0.7631
CatBoost | MLP | −1.8761 | 0.1339 | Bagging | Adaptive Boosting | 6.0827 | 0.026
Gradient Boosting | Decision Tree | −1.2273 | 0.2870 | CatBoost | DT | 1.2979 | 0.2641
Gradient Boosting | MLP | −1.2090 | 0.2933 | CatBoost | KNN | 2.9200 | 0.0432
XGBoost | DT | −2.0763 | 0.1065 | Blending | DT | 0.6146 | 0.5720
XGBoost | MLP | 6.2195 | 0.0034 | Blending | KNN | 0.2251 | 0.8330
  |   |   |   | Blending | MLP | 5.0665 | 0.0071
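As a concrete illustration of how the pairwise comparisons in Table 13 can be produced, the sketch below runs a two-sided t-test on per-run F1 scores for every pair of methods and reports which method scores significantly better at the 0.05 level. It assumes paired scores (e.g., one F1 value per cross-validation fold or repeated run) and uses SciPy's `ttest_rel`; the score arrays are illustrative placeholders, not the paper's measurements, and the exact resampling scheme follows the paper's evaluation setup.

```python
# Minimal sketch of the pairwise comparison behind Table 13: given per-run
# F1 scores for each method, run a two-sided t-test on every pair and flag
# the higher-scoring method when p < 0.05. Scores below are placeholders.
from itertools import combinations

import numpy as np
from scipy import stats

ALPHA = 0.05

# Placeholder per-run F1 scores (one value per fold/run) for a few methods.
scores = {
    "Random Forest": np.array([0.9966, 0.9965, 0.9967, 0.9964, 0.9966]),
    "Blending":      np.array([0.9964, 0.9963, 0.9965, 0.9962, 0.9964]),
    "Bagging":       np.array([0.9899, 0.9897, 0.9900, 0.9898, 0.9899]),
}

for m1, m2 in combinations(scores, 2):
    t_stat, p_value = stats.ttest_rel(scores[m1], scores[m2])  # paired, two-sided
    if p_value < ALPHA:
        better = m1 if scores[m1].mean() > scores[m2].mean() else m2
        verdict = f"{better} significantly better"
    else:
        verdict = "no significant difference"
    print(f"{m1} vs {m2}: t={t_stat:.4f}, p={p_value:.4f} -> {verdict}")
```

Each printed line corresponds to one row (one half) of Table 13, and the "significantly better" flag corresponds to the bolded method in the original table.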