1. Introduction
The digital ecosystem has become increasingly vital across industries, national infrastructures, and small and medium enterprises (SMEs), making network connectivity a cornerstone of economic activity and operational resilience. At the same time, Distributed Denial-of-Service (DDoS) attacks have emerged as one of the most pervasive global cyber threats, disrupting services by overwhelming network resources such as bandwidth, processing power, and memory. Recent reports indicate that DDoS incidents continue to escalate, with attack volumes growing by more than 50% in 2024 and peak capacities exceeding 1.6 Tbps, according to Gcore’s H2 2023 report [1] and A10 Networks’ Global DDoS Weapons Report [2].
Detecting and mitigating DDoS attacks remains challenging due to the diversity of attack vectors, rapid evolution of threat patterns, and increasing traffic heterogeneity. Traditional rule-based and signature-based Intrusion Detection Systems (IDSs) have become insufficient for detecting novel or obfuscated attacks in real time [3]. This limitation has led to a shift toward data-driven methodologies, particularly Shallow Machine Learning (S-ML) and Deep Machine Learning (D-ML), for automated traffic analysis and anomaly detection. Classical S-ML algorithms such as Support Vector Machines (SVM), Decision Trees (DT), and Random Forest (RF) offer interpretability and efficiency, yet they often struggle in dynamic and high-dimensional traffic environments [4]. Conversely, D-ML architectures (e.g., Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory networks) capture spatio-temporal relationships but commonly focus on binary classification, where the task is limited to distinguishing benign from malicious flows. Recent systematic reviews highlight that only 30–40% of prior work addresses the more complex multiclass problem, where the goal is to differentiate among several DDoS attack categories [3].
To improve robustness and generalization, recent research has explored hybrid and ensemble approaches. Empirical studies show that tree-based ensemble methods such as RF and XGBoost frequently achieve high accuracy and F1-scores (often above 98–99%), particularly in binary or reduced-class tasks [4,5]. Hybrid schemes that combine supervised and unsupervised components have also demonstrated strong performance in zero-day detection, reducing false positives and improving adaptability [6]. However, the majority of these high-performing pipelines evaluate simplified versions of CIC-based datasets, for example by restricting the task to binary classification, selecting only a subset of attack types, or applying aggressive feature reduction; such conditions tend to artificially inflate reported metrics and are not directly comparable to full multiclass scenarios.
A more recent but less explored direction involves stacking ensembles, which integrate heterogeneous learners (tree-based, kernel-based, and neural models) into a hierarchical architecture designed to exploit complementary inductive biases [7]. Stacking has shown promise in complex environments such as Software-Defined Networks (SDN), cloud infrastructures, and large-scale IoT ecosystems, yet few studies evaluate stacking under full multiclass DDoS settings, where label imbalance, overlapping traffic patterns, and inter-category variability place additional constraints on performance.
Despite these advances, important limitations persist in most previous studies. First, a large proportion of existing DDoS detection methods focuses primarily on binary classification (benign vs. malicious), leaving the multiclass problem of distinguishing among heterogeneous DDoS attack vectors only partially addressed. Even when multiclass detection is considered, single-model approaches often exhibit reduced generalization under class imbalance, overlapping traffic patterns, or high-dimensional feature spaces. Second, although ensemble and hybrid techniques have gained attention, heterogeneous stacking architectures that jointly integrate tree-based, kernel-based, and deep learning models remain underexplored. Existing stacking frameworks typically rely on a narrow set of learners or are confined to SDN-specific scenarios, without demonstrating scalable and robust multiclass performance on large, diverse datasets such as CIC-DDoS2019. Consequently, the field still lacks a unified multilayer framework capable of achieving stable, high-performing multiclass DDoS detection by leveraging the complementary strengths of S-ML and D-ML models.
Motivated by these gaps, this study proposes a three-layer hierarchical stacking ensemble that integrates S-ML and D-ML models to address the challenges of multiclass DDoS detection. The framework combines heterogeneous learners at the base layer, regression-based and neural meta-models at the second layer, and a voting mechanism at the third layer to produce the final inference. Using the CIC-DDoS2019 dataset, the proposed model achieves 91.26% accuracy and balanced precision-recall performance across nine attack categories, under a realistic multiclass configuration that retains a broad feature set and avoids reducing the task to binary detection. These results demonstrate the potential of stacking ensembles as a scalable and operationally relevant approach for real-time DDoS detection under heterogeneous and evolving network conditions.
Accordingly, the novelty of this work lies not in the proposal of new classifiers but in the structured, large-scale, and robustness-oriented application of hierarchical stacking to multiclass DDoS detection.
The remainder of this article is structured as follows: Section 2 reviews previous S-ML, D-ML, and ensemble-based approaches to DDoS detection; Section 3 describes the proposed stacking architecture and methodology; Section 4 presents the experimental setup and results; Section 5 provides a detailed discussion, including limitations, security implications, and operational considerations; and Section 6 concludes the study and outlines directions for future research.
2. Previous Work
Research on DDoS detection has advanced along three complementary methodological streams: classical S-ML, D-ML, and ensemble approaches. Classical S-ML remains crucial for establishing interpretable and computationally efficient baselines. D-ML methods, in turn, have proven transformative by capturing spatio-temporal patterns that traditional models often miss. Finally, ensemble learning has emerged as a strategy to leverage the complementary strengths of multiple classifiers, boosting accuracy and generalization, while reducing false alarms. Together, these directions provide the foundation for the development of next-generation frameworks that integrate multiple paradigms to overcome individual limitations.
2.1. Machine Learning Approach
In the field of DDoS detection, classical S-ML classifiers remain essential for their interpretability, computational efficiency, and practicality in real-world networks. These models often provide reliable baselines against which more complex approaches can be benchmarked. Abiramasundari and Ramaswamy [8], for example, conducted a systematic comparison of SVM, K-Nearest Neighbors (KNN), Decision Trees, and RF on CIC-DDoS2019, CICIDS2017, and CICIDS2018. Their results (accuracy values consistently between 98.7% and 98.9%) highlight the robustness of traditional S-ML while underscoring its strengths in interpretability and deployment efficiency, particularly for small-to-medium-sized enterprises seeking reliable detection with low latency and modest resources.
Extending these foundations, Sawah et al. [9] demonstrated how disciplined optimization can elevate classical models to near-perfect performance. By integrating Recursive Feature Elimination (RFE) with Grid Search for hyperparameter tuning, their RF classifier on DDoS-SDN achieved 99.99% accuracy, precision, recall, and F1-score, outperforming Naïve Bayes (98.85%), KNN (97.90%), Linear Discriminant Analysis (97.10%), and SVM (95.70%). This study shows how careful feature selection and tuning can significantly enhance robustness and minimize error rates in high-throughput detection scenarios.
Similarly, Becerra-Suarez et al. [10] emphasized the importance of preprocessing pipelines. Their approach to CIC-DDoS2019 combined outlier removal, Pearson-correlation-based feature selection (retaining 22 features), normalization, and Tree-of-Parzen-Estimators (TPE) for hyperparameter optimization. Under this setup, RF achieved 99.97% accuracy, 99.98% F1-score, and 99.96% Receiver Operating Characteristic - Area Under Curve (ROC-AUC), while XGBoost delivered comparable performance. Notably, both models surpassed Multi-Layer Perceptron (MLP) baselines evaluated under the same conditions, underscoring that data quality, feature engineering, and principled optimization often rival the gains attributed to deep learning.
Complementing these efforts, Fathima et al. [11] evaluated RF, KNN, and Logistic Regression (LR) on the CSE-CICIDS2018, CICIDS2017, and CICDoS datasets, with traffic normalized using the Standard Scaler. Results showed RF as the top performer with 97.6% accuracy, followed by KNN (97%) and LR (91.1%). This comparative analysis across multiple benchmark datasets reaffirms RF’s capacity to generalize effectively in heterogeneous traffic environments and highlights it as a consistently reliable classifier in this domain, while also demonstrating the persistent relevance of ensemble-based methods in handling complex traffic distributions.
Further evidence of RF’s dominance was provided by Ebrahem et al. [12], who introduced a feature-grouping methodology for CICIDS2017, partitioning universal attributes into subgroups to test algorithm resilience under dimensionality reduction. RF, Naïve Bayes, KNN, and LR were compared, with RF consistently outperforming the others in terms of accuracy, precision, recall, and F1-score. Remarkably, RF maintained detection performance above 93% even with only two retained features, demonstrating its adaptability and suitability for environments that require efficient computation.
Taken together, these studies establish that well-optimized classical S-ML approaches (particularly RF and XGBoost) remain highly competitive for DDoS detection. Their demonstrated resilience across multiple datasets, preprocessing strategies, and feature-reduction scenarios provides a rigorous baseline for advancing toward ensemble and stacking strategies, where the complementary strengths of different learners can be further exploited for robust multiclass DDoS detection.
2.2. Deep Learning Approaches
Deep learning has emerged as a transformative paradigm for DDoS detection, owing to its capacity to automatically extract hierarchical spatio-temporal representations from raw or minimally processed traffic. Unlike shallow ML classifiers, D-ML architectures capture bursty patterns, flow dependencies, and stealthy behaviors that traditional feature engineering may overlook, positioning them as powerful tools for both binary and multiclass detection tasks.
Among the earliest contributions, Bashaiwth et al. [13] proposed an explainable LSTM-based framework applied across three benchmark datasets (CIC-DDoS2019, CICIDS2017, and CSE-CIC-IDS2018). By integrating interpretability techniques such as LIME, SHAP, Anchor, and LORE, the study addressed the “black-box” limitation of recurrent models. Binary classification consistently yielded high accuracy across datasets, while multiclass results were strong on CICIDS2017 and CSE-CIC-IDS2018 but less effective on CIC-DDoS2019 due to overlap in attack types. This dual emphasis on predictive power and transparency illustrates the relevance of explainability when deploying deep models in operational contexts.
Building on this line of research, Ramzan et al. [14] systematically evaluated RNN, Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU) architectures on CIC-DDoS2019, with validation on CICIDS2017 using the 20 most discriminative features. In binary detection, the RNN achieved 99.99% accuracy, outperforming LSTM and GRU by maintaining lower false-positive and false-negative rates while reducing overfitting. Conversely, in multiclass settings, GRU proved more resilient, achieving 99.54% accuracy compared to 99.43% for LSTM and 99.15% for RNN. These findings highlight the complementary strengths of recurrent networks: RNNs offer efficiency and stability in binary classification, whereas GRUs excel in capturing nuanced temporal dependencies for fine-grained multiclass detection.
Moving beyond purely recurrent designs, Setitra et al. [15] introduced the OptMLP-CNN framework for DDoS detection in SDN environments. By integrating MLP and CNN, the model achieved a TPR of up to 99.95%, precision of 99.90%, recall of 99.98%, F1-score of 99.94%, and AUC of 99.79% on CIC-DDoS2019, with similarly strong results on InSDN (precision 99.98%, recall 99.77%, F1 99.99%, AUC 99.96%). These outcomes surpassed CNN, MLP, and shallow ML baselines. Key methodological innovations included SHAP-based feature selection to identify relevant traffic attributes, Bayesian hyperparameter optimization, and the ADAM optimizer, collectively enhancing interpretability, convergence stability, and overall robustness.
In contrast, Aktar and Nur [16] pursued a semi-supervised approach leveraging a contractive autoencoder (CAE) trained on benign traffic and evaluated through reconstruction error on CIC-DDoS2019, CICIDS2017, and NSL-KDD. The CAE achieved 93.41–97.58% accuracy on CIC-DDoS2019, 96.08% on NSL-KDD, and 92.45% on CICIDS2017, consistently outperforming baseline autoencoders, variational AEs, and LSTM-based AEs (70–95%). By employing Contractive Loss, the Adam optimizer, and Sigmoid activations in hidden layers, the model generalized effectively to unseen attacks, underscoring the utility of deep representation learning in scenarios with limited labeled data.
Finally, a CNN–GRU hybrid model [17] addressed the challenges of incomplete feature extraction, class imbalance, and multiclass accuracy. By combining Random Forest–Pearson correlation for feature selection, ADASYN–RENN for data balancing, CNN for spatial extraction, GRU for temporal dependencies, and attention mechanisms for feature weighting, the framework achieved robust performance across various datasets. Results included 99.65% accuracy, 99.63% recall, 99.65% precision, and 99.64% F1-score on CICIDS2017; 99.69% accuracy, 99.65% recall, 99.69% precision, and 99.70% F1-score on NSL-KDD; and 86.25% accuracy, 86.92% precision, and 86% F1-score on UNSW-NB15. While demonstrating state-of-the-art results on CICIDS2017 and NSL-KDD, the drop in performance on UNSW-NB15 reveals the sensitivity of D-ML models to dataset characteristics, underscoring the need for adaptability in diverse real-world environments.
Overall, D-ML approaches have set state-of-the-art benchmarks in DDoS detection, excelling in accuracy, adaptability, and generalization. However, their sensitivity to dataset variability and computational demands highlights the need for integrative approaches, such as ensembles and stacking, that combine predictive power with robustness and transparency.
2.3. Ensemble Approaches
Ensemble learning has emerged as a compelling approach for DDoS detection, leveraging the complementary strengths of multiple classifiers to improve robustness, accuracy, and generalization. These approaches are particularly effective in handling imbalanced data distributions, high-dimensional feature spaces, and evolving attack patterns, positioning them as strong candidates for real-world deployment in both enterprise and SDN environments.
One representative contribution is the stacking ensemble proposed by Ali et al. [18], where an MLP and an SVM serve as base classifiers, and an RF acts as the meta-learner. A hybrid feature selection strategy combining Genetic Algorithms, chi-square tests, and correlation analysis was applied to eliminate redundant attributes and retain only discriminative features. Evaluated on an SDN benchmark dataset, the model achieved 99.86% accuracy in training and 98.89% in testing, with a precision of 99.82%, a recall of 99.71%, and an F1-score of 99.71%. By maintaining very low false positives and false negatives, this study demonstrates that stacking, combined with advanced feature engineering, can provide reliable and scalable solutions for intrusion detection in programmable networks.
Complementing this, Hirsi et al. [19] developed an ensemble-based RF model (ENRF) integrated with Principal Component Analysis (PCA) to reduce dimensionality in SDN environments. Validated on both custom and public datasets, ENRF achieved 100% accuracy, precision, recall, and F1-score, outperforming baseline ensembles and conventional S-ML methods. PCA not only reduced computational complexity but also mitigated feature redundancy, while the ensemble structure improved resilience against overfitting. In particular, the study emphasized reproducibility by using multiple datasets and embedding the ENRF within the SDN control plane for continuous monitoring, allowing rapid anomaly detection and proactive mitigation.
Beyond supervised learning, Das et al. [20] explored hybrid ensembles combining unsupervised clustering with supervised classifiers to address zero-day detection. Tested on the NSL-KDD, UNSW-NB15, and CICIDS2017 datasets, their model achieved an accuracy of up to 99.1% with exceptionally low false-positive rates, misclassifying only 0.01% of benign instances. This fusion enabled the system to capture novel attack behaviors without compromising detection performance on known patterns, thereby offering a practical balance between generalization and precision, a key requirement for real-world adaptability.
In the IoT domain, where lightweight yet accurate solutions are essential, Lazzarini et al. [21] proposed DIS-IoT, a stacking ensemble of deep learners integrating MLP, CNN, and LSTM as base classifiers with a fully connected meta-learner. Evaluated on ToN-IoT, CICIDS2017, and SWaT datasets, DIS-IoT consistently outperformed single models, achieving near-perfect binary classification results (accuracy and F1 ≈ 0.99–1.00) and strong multiclass performance (≈0.98–0.99). By exploiting the complementary biases of convolutional, recurrent, and dense architectures, the model minimized false alarms while sustaining high recall, establishing stacking as an effective solution for IoT intrusion detection.
Further advancing ensemble innovation, Butt et al. [22] introduced a multi-model framework for SDN-based DDoS detection, combining RF, KNN, and XGBoost. Each learner contributed unique strengths: RF provided stable performance via DT aggregation, KNN offered sensitivity to local traffic dynamics, and XGBoost introduced gradient-boosted decision-making, particularly effective against class imbalance and rare attacks. The ensemble achieved nearly 99% accuracy on SDN-specific datasets, significantly outperforming individual learners and underscoring the utility of heterogeneous integration to enhance robustness under dynamic traffic conditions.
Finally, Hossain [23] presented a feature-driven ensemble where RF was coupled with a novel feature selection pipeline that integrates mutual information, correlation, and PCA. Evaluated on CIC-DDoS2019, the model achieved nearly 100% accuracy, 100% true positive rate, and a 0% false positive rate, outperforming conventional S-ML baselines. By aggregating multiple decision trees and applying dimensionality reduction, the framework effectively mitigated overfitting and enhanced generalization. This study highlights how careful feature curation and ensemble aggregation can jointly optimize detection accuracy, false alarm rates, and scalability in operational contexts.
Collectively, ensemble approaches demonstrate that combining diverse learners (whether homogeneous or heterogeneous, shallow or deep) provides significant benefits for DDoS detection. They consistently reduce false positives, improve generalization, and maintain robustness across datasets ranging from SDN to IoT and general-purpose benchmarks. These works establish ensemble learning not only as a performance booster over single models but also as a practical pathway toward resilient, interpretable, and real-time DDoS defense.
In summary, the body of research on DDoS detection reveals a clear trajectory: from interpretable and resource-efficient S-ML models, to highly accurate but dataset-sensitive D-ML architectures, and finally to ensemble approaches that combine complementary strengths. Classical S-ML provides robust baselines, D-ML introduces powerful feature extraction and adaptability, and ensembles achieve resilience and generalization across diverse datasets and environments. Despite these advances, challenges persist in balancing detection accuracy, interpretability, and real-world scalability. This context motivates the development of stacking frameworks that integrate S-ML and D-ML, leveraging their respective strengths to deliver reliable multiclass DDoS detection suitable for deployment in dynamic and heterogeneous network environments.
3. Materials and Methods
The proposed system for detecting Distributed Denial of Service (DDoS) attacks is based on a multi-stage machine learning architecture that employs a stacked ensemble model. The general workflow, from data acquisition to prediction, is summarized in Figure 1. This diagram illustrates the key stages of the proposed methodology: data collection, preprocessing, model training, ensemble integration, and final classification.
To assess the computational efficiency of the proposed framework, training time, inference latency, throughput, and memory usage were measured during both training and inference phases. Training time was computed as the elapsed wall-clock time between the start and completion of the whole training pipeline, including preprocessing, model fitting, and ensemble construction. Inference latency was measured as the average processing time per network flow, obtained by recording the start and end timestamps for batch inference and normalizing by the number of processed flows. Throughput was calculated as the number of flows processed per second during inference under steady-state conditions. Peak memory usage during training and inference was recorded by monitoring system-level memory consumption and retaining the maximum observed value throughout each phase. All measurements were obtained on the same hardware configuration described in Section 3.3 to ensure consistency.
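As a minimal sketch of how these measurements can be collected, the snippet below assumes an already constructed ensemble object with scikit-learn-style fit/predict methods and NumPy arrays X_train, y_train, and X_test; the use of psutil for memory tracking is an assumption, since the monitoring tool is not specified in the text.

import os
import time
import psutil

process = psutil.Process(os.getpid())

# Training time: wall-clock time over the whole training pipeline.
t0 = time.perf_counter()
ensemble.fit(X_train, y_train)          # placeholder for preprocessing + fitting + stacking
train_time_s = time.perf_counter() - t0

# Inference latency and throughput: batch prediction normalized by flow count.
t0 = time.perf_counter()
y_pred = ensemble.predict(X_test)
elapsed_s = time.perf_counter() - t0
latency_ms_per_flow = 1000.0 * elapsed_s / len(X_test)
throughput_flows_per_s = len(X_test) / elapsed_s

# Memory: resident set size sampled after each phase; the maximum value
# observed across the run approximates the peak usage reported above.
memory_mb = process.memory_info().rss / 2**20

print(train_time_s, latency_ms_per_flow, throughput_flows_per_s, memory_mb)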
3.1. Data Preprocessing
The preprocessing stage, shown in Figure 2, began with a comprehensive literature review to identify prior approaches to DDoS detection and the datasets most commonly used for experimentation. As a result, the CIC-DDoS2019 dataset [24] was selected for this study. The dataset contains approximately 30 GB of labeled network traffic data organized by attack type. From this dataset, a balanced subset of approximately 9.25 million samples was constructed using proportional sampling across classes.
Each record originally contained 87 features; nine were removed after a feature relevance analysis, leaving 78 attributes, including the target variable Label, which indicates whether the traffic flow corresponds to a benign instance or to a specific DDoS attack type (see Appendix A, Table A1 for a detailed description of the selected features).
Outliers and missing values were removed, and underrepresented attack types (with fewer than 450 samples) were excluded. During exploratory training, two classes exhibited significant confusion, leading to their merging after verification using Principal Component Analysis (PCA). The final dataset contained approximately 8.85 million labeled samples.
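For illustration, the overlap check behind this merging decision can be sketched as follows, assuming a feature matrix X and a label vector y; class_a and class_b are placeholders for the two confused categories and are not the exact procedure used in the study.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Project standardized features onto the first two principal components.
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# Compare the projected centroids of the two classes suspected of overlapping;
# a small centroid distance relative to the within-class spread supports merging.
mask_a, mask_b = (y == class_a), (y == class_b)
centroid_a, centroid_b = X_2d[mask_a].mean(axis=0), X_2d[mask_b].mean(axis=0)
spread = X_2d[mask_a | mask_b].std(axis=0).mean()
print("centroid distance:", np.linalg.norm(centroid_a - centroid_b), "avg spread:", spread)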
After preprocessing and data cleaning, the Label variable was standardized and encoded into nine final categories representing distinct traffic behaviors. These categories correspond to the following classes:
BENIGN: Normal network traffic without attack patterns.
DNS/LDAP: Distributed Denial of Service attacks exploiting DNS or LDAP protocols.
MSSQL: DDoS activity based on Microsoft SQL services.
NTP: Attacks that leverage the Network Time Protocol (NTP) amplification.
NetBIOS/Portmap: Traffic generated by NetBIOS or Portmap-based service abuse.
SNMP: Malicious traffic exploiting the Simple Network Management Protocol.
SSDP/UDP: Attacks using the Simple Service Discovery Protocol or generic UDP floods.
Syn/UDPLag: Flooding attacks characterized by SYN or UDP lag behavior.
TFTP: Traffic related to Trivial File Transfer Protocol-based attacks.
Labels were encoded using both Label Encoding and One-Hot Encoding, depending on model requirements. The dataset was split into training, validation, and testing sets using a 70/25/5 ratio. Data normalization was performed with the StandardScaler, and class balance was achieved using the Synthetic Minority Oversampling Technique (SMOTE). The preprocessing steps ensured consistency across model inputs and mitigated data imbalance.
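A condensed sketch of this preprocessing flow is given below using scikit-learn and imbalanced-learn; the stratified splitting and the fixed random seed are assumptions added for reproducibility rather than settings reported in the text.

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Integer-encode the nine traffic categories (one-hot vectors are derived
# from these codes for the neural models that require them).
y_enc = LabelEncoder().fit_transform(y)

# 70/25/5 split: carve out the 5% test set first, then split the remainder.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y_enc, test_size=0.05, stratify=y_enc, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25 / 0.95, stratify=y_tmp, random_state=42)

# Fit the scaler on the training partition only and apply it to all splits.
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(s) for s in (X_train, X_val, X_test))

# Rebalance only the training partition with SMOTE.
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)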
3.2. Model Architecture
Stacking is a hierarchical ensemble strategy in which multiple S-ML and/or D-ML models are organized in successive levels to deliver higher predictive performance than any single model alone. The central premise is that different learning algorithms capture complementary inductive biases and heterogeneous patterns from the same data distribution; therefore, their coordinated aggregation can yield more robust and generalizable decisions [25,26].
As depicted in Figure 3, a standard stacking architecture consists of two conceptual levels. The base models, also referred to as level-0 learners, are trained in parallel using the same training dataset. Once trained, they generate out-of-fold predictions over the validation and test splits. These predictions are then assembled into a new feature matrix (meta-features), which constitutes the input for a higher-level learner, referred to as the meta-model or level-1 learner; ensemble learning surveys consistently highlight the effectiveness of this architecture.
This hierarchical design is particularly effective in complex classification problems such as multiclass DDoS detection, where combining structurally diverse learners (e.g., tree-based, kernel-based, and deep neural models) consistently outperforms isolated predictors in terms of accuracy, stability, and resilience to traffic variability.
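To make the level-0/level-1 data flow concrete, the reduced sketch below builds meta-features from validation-set class probabilities using scikit-learn estimators; the learners shown are only a subset of those used in Section 3.2 and are left at default hyperparameters, so the snippet illustrates the mechanism rather than the exact configuration of this work.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Level 0: structurally diverse base learners trained on the same training split.
base_models = [RandomForestClassifier(), DecisionTreeClassifier(),
               GradientBoostingClassifier(), KNeighborsClassifier()]
for m in base_models:
    m.fit(X_train, y_train)

# Meta-features: class probabilities of every base model on the held-out
# validation and test splits, concatenated column-wise.
meta_val = np.hstack([m.predict_proba(X_val) for m in base_models])
meta_test = np.hstack([m.predict_proba(X_test) for m in base_models])

# Level 1: a meta-learner trained on the stacked probabilities.
meta_model = LogisticRegression(max_iter=1000).fit(meta_val, y_val)
print("meta-model accuracy:", meta_model.score(meta_test, y_test))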
Although stacking has been explored in other domains, its application to multiclass DDoS detection remains limited, and existing approaches typically employ only two layers or rely on a narrow set of homogeneous classifiers. The novelty of this work lies not in the isolated use of known S-ML and D-ML algorithms, but in the design and integration of a three-layer heterogeneous stacking architecture specifically tailored to the characteristics of large-scale DDoS traffic.
First, the proposed framework combines five structurally diverse base learners (tree-based, instance-based, and deep neural models) at Level 0, enabling the simultaneous capture of complementary inductive biases. Second, Level 1 introduces a meta-feature construction pipeline based on out-of-fold predictions, which allows Logistic Regression, Ridge Classifier, and a neural meta-learner to model complex interdependencies and systematic error patterns across base classifiers, an aspect largely unexplored in previous DDoS studies. Third, Level 2 incorporates a final voting layer to enhance prediction stability and mitigate class-level fluctuations typical in multiclass traffic.
Additionally, the ensemble is tightly coupled with a preprocessing pipeline that involves PCA-based class consolidation, SMOTE balancing, and feature refinement, thereby improving the separability between attack categories and contributing to the overall robustness of the stacking mechanism. This coordinated design enables the architecture to deliver highly stable multiclass performance across nine attack types, even under substantial traffic heterogeneity, making it distinct from conventional ensemble and hybrid models in the literature.
The proposed system was trained using a stacked ensemble model composed of three conceptual layers, as illustrated in Figure 4. This architecture aims to combine the strengths of diverse classifiers to achieve higher accuracy and robustness. To ensure reproducibility and interpretability of the experiments, this section details the hyperparameters of each model used in the architecture. Only the most relevant parameters are listed, as some minor internal constants or random seeds may vary.
3.2.1. Layer 0 (Base Models)
The first layer consisted of five base classifiers: MLP, KNN, DT, RF, and Gradient Boosting. An SVM model was initially evaluated but subsequently excluded due to excessive training time and inferior predictive performance.
The models and their hyperparameter configurations are summarized below:
# Base-layer MLP (Level 0): three hidden layers with LeakyReLU activations,
# dropout regularization, and a softmax output over the traffic classes.
import tensorflow as tf
from tensorflow.keras import layers as tfl

model = tf.keras.Sequential([
    tfl.Dense(128, input_shape=(input_dim,)),   # input_dim = number of flow features
    tfl.LeakyReLU(alpha=0.01),
    tfl.Dense(256),
    tfl.LeakyReLU(alpha=0.01),
    tfl.Dense(512),
    tfl.LeakyReLU(alpha=0.01),
    tfl.Dropout(dropout_rate),                  # dropout_rate as configured in Table 5
    tfl.Dense(n_classes, activation='softmax')  # n_classes = 9 traffic categories
])
The model was trained with the hyperparameters described in Table 5.
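Table 5 is not reproduced inline; purely as an illustration of how such a configuration is applied, the sketch below compiles and fits the network with placeholder values (the optimizer, learning rate, batch size, and epoch count shown are assumptions, not the settings reported in Table 5).

# Illustrative training call; hyperparameter values are placeholders, not Table 5 settings.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss='categorical_crossentropy',      # assumes one-hot encoded labels
              metrics=['accuracy'])
model.fit(X_train, y_train_onehot,
          validation_data=(X_val, y_val_onehot),
          epochs=20, batch_size=512)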
3.2.2. Layer 1 (Meta-Models)
The second layer included three meta-classifiers, Logistic Regression, Ridge Classifier, and MLP, that used the class probabilities predicted by the Layer 0 models on the validation set as input. This design allowed the meta-models to capture the interactions and correlations among base classifier predictions.
The models and their hyperparameter configurations are summarized below:
# Meta-layer MLP (Level 1): operates on the concatenated class probabilities
# produced by the Level-0 base models (input_dim = base models x classes).
model_3 = tf.keras.Sequential([
    tfl.Dense(64, activation='leaky_relu', input_shape=(input_dim,)),
    tfl.Dropout(dropout_rate),
    tfl.Dense(n_classes, activation='softmax')
])
The training configuration for this model was identical to the one used in Layer 0, as described in Table 8.
3.2.3. Layer 2 (Voting Ensemble)
The final layer implemented a voting mechanism that aggregated the predictions from the three Layer 1 models. The final class label was determined by majority voting, ensuring stable and generalized performance across attack categories.
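A minimal sketch of this aggregation step is shown below; lr_pred, ridge_pred, and nn_pred are hypothetical arrays holding the Layer 1 class predictions for the test flows, and the soft-voting variant described in Algorithm 1 is included for comparison (for the Ridge Classifier, class scores would have to be derived from its decision function, since it exposes no predicted probabilities).

import numpy as np

# Hard (majority) vote over the three Layer 1 label predictions.
level1_preds = np.stack([lr_pred, ridge_pred, nn_pred])          # shape: (3, n_samples)
majority_pred = np.apply_along_axis(
    lambda votes: np.bincount(votes).argmax(), 0, level1_preds)

# Soft-voting variant: average the Layer 1 class-probability matrices, then take argmax.
avg_proba = (lr_proba + ridge_proba + nn_proba) / 3.0            # each: (n_samples, n_classes)
soft_pred = avg_proba.argmax(axis=1)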
3.2.4. Proposed Algorithm
To improve methodological clarity and reproducibility, the complete training and inference workflow of the proposed stacking architecture is summarized in Algorithm 1. The algorithm details the preprocessing pipeline, the construction of Level-0 base learners, the generation of Level-1 meta-features, and the final soft voting mechanism used in Level 2.
3.3. System Specifications
All experiments, including data preprocessing, model training, and evaluation, were conducted on a high-performance workstation to ensure computational consistency and scalability. The hardware and software configuration of the system is summarized below.
Processor: Dual Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60 GHz (2 processors)
Installed RAM: 48.0 GB
Graphics Card: AMD FirePro W2100 (2 GB)
System Type: 64-bit Operating System, x64-based processor
The experiments were performed under Windows 10 Pro (Version 22H2, OS Build 19045.6396), with the following Python environment and library versions:
Python: 3.10.16
imbalanced-learn: 0.13.0
NumPy: 2.0.2
Pandas: 2.2.3
Scikit-learn: 1.6.1
TensorFlow: 2.18.0
This configuration enabled efficient model training on large-scale datasets. Parallel computation and optimized memory management, provided by the HP Z840 Workstation, reduced processing time and ensured the reproducibility of all results.
| Algorithm 1 Training and Inference Procedure of the Proposed Three-Layer Stacking Ensemble |
| Require: CIC-DDoS2019 dataset |
| Ensure: Final predicted label for each sample |
| 1: Preprocessing Stage |
| 2: Remove invalid or inconsistent records from the dataset. |
| 3: Apply PCA-based class consolidation to reduce overlapping attack categories. |
| 4: Apply SMOTE to balance the class distribution in the training subset. |
| 5: Normalize continuous features using standard scaling. |
| 6: Split the dataset into training, validation, and test sets. |
| 7: Level 0: Base Learners |
| 8: Define the base learner set: {MLP, KNN, DT, RF, GBoost}. |
| 9: for each base model do |
| 10: Train the base model on the training set. |
| 11: Collect its out-of-fold predictions over the validation and test splits. |
| 12: end for |
| 13: Construct the meta-feature matrix by concatenating the Level-0 predictions. |
| 14: Level 1: Meta Learners |
| 15: Define the meta-learner set: {Logistic Regression, Ridge Classifier, Neural Network}. |
| 16: for each meta-learner do |
| 17: Train the meta-learner on the meta-feature matrix. |
| 18: Obtain its Level-1 predictions. |
| 19: end for |
| 20: Collect the Level-1 prediction set. |
| 21: Level 2: Final Voting Layer |
| 22: Aggregate the Level-1 predictions using soft voting. |
| 23: Compute the final class-probability score vector. |
| 24: Assign the final prediction to the class with the highest aggregated probability. |
| 25: return the final predicted label |
4. Results
The final evaluation of the proposed system was conducted using the test subset to measure overall performance. The evaluation pipeline and the main metrics used (Accuracy, Precision, Recall, F1-Score, and Confusion Matrix) are summarized in Figure 5. This figure illustrates the end-to-end validation process, highlighting how the ensemble integrates results across all model layers.
4.1. General Model Evaluation
Table 9 presents the numerical results obtained from each model across the three layers. The ensemble model (Layer 2) achieved the best overall balance between accuracy, precision, and recall, maintaining an average accuracy of 91%.
4.2. Detailed Model Evaluation of Layer 1 and Layer 2
This subsection presents a detailed evaluation of the proposed architecture over four models: the three meta-classifiers in Layer 1 (Logistic Regression, Ridge Classifier, and Neural Network) and the Ensemble model in Layer 2. Each model was trained and tested on the same dataset partition to ensure consistency of results and a fair comparison. The metrics used for performance evaluation include accuracy, precision, recall, and F1-score. The detailed results for each model are presented in Table 10, Table 11, Table 12, and Table 13.
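These metrics can be reproduced from the stored test predictions with a few scikit-learn calls, as sketched below; y_test and y_pred are assumed to hold the integer-encoded true and predicted labels of whichever model is being evaluated.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Overall accuracy plus per-class precision, recall, and F1-score.
print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=4))

# Class-by-class confusion matrix used to inspect overlapping attack types.
print(confusion_matrix(y_test, y_pred))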
4.2.1. Layer 1: Logistic Regression
The Logistic Regression model achieved an overall accuracy of 91.24%. The best performance was achieved for the BENIGN and NTP traffic classes, both of which achieved perfect scores (F1 = 1.00). Lower results were observed in the SNMP (F1 = 0.77) and DNS/LDAP (F1 = 0.83) categories.
4.2.2. Layer 1: Ridge Classifier
The Ridge Classifier obtained an overall accuracy of 91.32%. Its performance remained consistent across most classes, showing slightly improved results compared to Logistic Regression, especially for Syn/UDPLag and TFTP traffic.
4.2.3. Layer 1: Neural Network
The Neural Network model achieved an overall accuracy of 91.27%, showing robust performance across all attack types. High detection accuracy was maintained for BENIGN, NTP, and Syn/UDPLag classes, while slightly lower results persisted for SNMP and DNS/LDAP.
4.2.4. Layer 2: Ensemble Model
The ensemble meta-model in Layer 2 combined the predictions from the three base classifiers, producing an overall accuracy of 91.26%. This model improved class balance while maintaining high precision and recall across all categories, confirming the stability of the ensemble approach.
4.3. Theoretical Interpretation of the Results
While the numerical results reported in the previous subsection demonstrate the effectiveness of the proposed hierarchical stacking ensemble, a deeper theoretical interpretation is required to understand why the model behaves as observed. This section explains the performance trends in terms of classifier design, data characteristics, class imbalance, and ensemble theory.
4.3.1. Behavior of Level-0 Base Learners
The base layer integrates models with distinct inductive biases: tree-based classifiers, distance-based learners, and neural networks. The properties of these model families can theoretically explain the observed performance patterns:
Tree-based methods (RF, DT, GBoost) achieve strong performance because DDoS traffic exhibits non-linear relationships and hierarchical decision boundaries. RF in particular benefits from variance reduction through bagging and feature subsampling, which stabilizes decisions under high-dimensional flows.
The neural network at Level-0 contributes complementary representations by learning distributed features that capture subtle deviations in traffic dynamics that tree models may oversimplify.
KNN, although competitive, is sensitive to feature scaling and local density variations. Its relatively smaller contribution (as shown in the ablation results in Table 14) aligns with theoretical expectations for high-dimensional, heterogeneous network traffic.
This theoretical diversity explains why removing any Level-0 learner decreases performance: the ensemble relies on combining heterogeneous decision boundaries to capture the multi-modal nature of benign and malicious flows.
4.3.2. Theoretical Role of Level-1 Meta-Learners
The Level-1 layer integrates Logistic Regression, Ridge Classifier, and a shallow Neural Network. The theoretical benefit of this design comes from:
Error-correcting capability: Meta-learners operate on prediction vectors, learning second-order relationships between base models, which classical ensemble theory identifies as key for reducing correlated errors.
Linearity vs. non-linearity: Logistic Regression captures linearly separable patterns across the prediction space, Ridge adds regularization to stabilize decisions under multicollinearity, and the neural network models non-linear interactions.
Bias–variance trade-off: The combination ensures that the meta-layer can correct systematic tendencies of overconfident classifiers or ambiguous predictions in minority classes.
The relatively small but consistent F1-score decrease shown in the ablation study when removing any of these models empirically supports this theoretical framework.
4.3.3. Influence of Preprocessing: SMOTE and PCA Label Treatment
The significant performance loss when removing SMOTE (−9.3% F1) or PCA-based label treatment (−13%) is theoretically aligned with imbalance-learning and class-separability principles:
SMOTE increases minority-class density in feature space, improving the decision boundaries around rare attack types, such as DNS, LDAP, and SNMP. Without it, classifiers become biased toward the majority class, reducing recall in minority categories.
PCA label treatment reduces intra-class variance and removes redundant label correlations. From a geometric perspective, PCA enhances cluster separability, improving both precision and recall.
These theoretical mechanisms explain why these preprocessing components contribute more to performance than individual base or meta-learners.
4.3.4. Analysis of Misclassified Attack Types
The slightly lower F1-scores observed in minority attack types (≈0.77–0.83) are theoretically expected due to:
Feature overlap among similar attacks (e.g., DNS vs. LDAP vs. SNMP), which reduces separability in high-dimensional spaces.
Sparse representation of rare attack patterns, which limits the model’s ability to learn discriminative boundaries even after balancing.
Temporal and protocol-level similarity among certain DDoS categories makes them harder to disambiguate without temporal sequence modeling.
Nonetheless, the ensemble mitigates these limitations by integrating complementary learners, which stabilizes predictions across classes.
4.3.5. Summary
Overall, the theoretical interpretation supports the empirical findings:
Level-0 learners capture different structural properties of the traffic.
Level-1 meta-learning combines these representations and corrects correlated errors.
SMOTE and PCA label treatment enhance feature space geometry and balance.
Misclassification patterns reflect the inherent complexity of the dataset, not model limitations.
This theoretical grounding reinforces the validity of the proposed architecture and explains its superior multiclass performance on CIC-DDoS2019.
4.4. Ablation Study
To quantify the contribution of each component of the proposed three-layer stacking architecture, an ablation study was conducted in which specific elements of the system were individually removed. The objective of this analysis is to determine how the performance degrades when essential components are omitted, thereby validating that the proposed architecture is not simply an aggregation of known methods but a cohesive, interdependent design. All ablations were evaluated under the same experimental conditions as the full model, using macro F1-score as the primary evaluation metric.
Table 14 summarizes the obtained performance for each ablated configuration. The Base Model corresponds to the complete architecture proposed in this study, encompassing all Level-0 learners, all Level-1 meta-learners, and the full preprocessing pipeline, including SMOTE and PCA-based label consolidation.
4.4.1. Contribution of Level-0 Base Learners
Removing any Level-0 classifier leads to a measurable drop in macro F1-score, confirming that the diversity of inductive biases at the base layer is a key element of the ensemble.
The most pronounced degradations occur when removing tree-based models such as RF (0.8989), DT (0.8993), or Gradient Boosting (0.8999), underscoring their ability to model non-linear traffic patterns. Removing the Level-0 Neural Network also reduces performance (0.9030), highlighting the value of deep feature representations. Excluding KNN (0.9018) yields the smallest drop, but still confirms its role in capturing local decision boundaries.
These results demonstrate that the heterogeneous set of Level-0 learners is fundamental to the architecture’s success.
4.4.2. Contribution of Level-1 Meta-Learners
Level-1 ablations also yield consistent performance reductions:
Without Logistic Regression: 0.9134.
Without Ridge Classifier: 0.9122.
Without Level-1 Neural Network: 0.9117.
Although these drops are smaller than those observed in Level-0 ablations, they confirm that Level-1 provides essential error-correcting capability. The three meta-learners jointly learn complementary meta-relationships across Level-0 outputs, improving generalization and stabilizing multiclass predictions.
4.4.3. Importance of the Preprocessing Pipeline
The largest degradations in performance occur when removing SMOTE (0.8267) or PCA label treatment (0.7893).
Without SMOTE, minority attack types are significantly underrepresented, resulting in reduced F1 performance across rare categories.
Without PCA-based label consolidation, overlapping attack categories remain uncorrected, increasing confusion between similar attack types and reducing general separability in feature space.
These results confirm that both preprocessing components are essential for robust multiclass DDoS detection.
4.4.4. Summary of Findings
Overall, the ablation study empirically validates the design decisions of the proposed architecture. Each component, Level-0 model diversity, Level-1 meta-learning, SMOTE-based balancing, and PCA label treatment, plays a significant and complementary role. The complete architecture delivers the highest macro F1-score (0.9201), while removing any component consistently reduces performance. These findings support the effectiveness of the proposed three-layer stacking ensemble for multiclass DDoS detection.
4.5. Expanded Comparative Analysis with Previous Works
To strengthen the evaluation of the proposed model, we conducted an expanded comparative analysis incorporating a broad selection of recent studies on DDoS detection.
Table 15 summarizes representative works published between 2022 and 2025, covering machine learning, deep learning, hybrid architectures, ensemble approaches, and feature-engineering techniques across multiple benchmark datasets.
The comparison includes the methodological approach, the classification scheme (binary or multiclass), the achieved F1-score, and the limitations reported by each study. This structured comparison provides a broader context for the performance of our proposed architecture.
Overall, the most recent efforts achieve high performance in binary classification, often exceeding F1-scores of 0.98. However, robustness decreases for multiclass DDoS detection, where inter-class similarity and data imbalance remain open challenges. Hybrid models (e.g., CNN-GRU, Deep Stacking Ensembles, and PCA-based ensembles) frequently yield excellent results but exhibit high computational complexity, limited interpretability, or dependence on large datasets.
In contrast, our proposed three-layer hierarchical Stacking Ensemble addresses these limitations through (i) heterogeneous Level-0 learners, (ii) meta-learning integration, and (iii) a preprocessing pipeline consisting of SMOTE and PCA. In the multiclass CIC-DDoS2019 dataset, our model achieves an overall F1-score of 0.92, outperforming various classical and deep approaches, particularly in scenarios involving complex attack types and class imbalance. Although some minority attack categories still pose challenges (F1 ≈ 0.77–0.83), the architecture demonstrates superior multiclass stability compared to prior works, where many studies only report binary results.
This expanded comparison reinforces the novelty and effectiveness of our approach, situating our contribution within the current state of the art in DDoS detection research.
4.6. Clarifying the Performance Gap Between Binary and Multiclass DDoS Detection
Several studies cited in the literature report very high performance metrics, often exceeding 99% accuracy, precision, recall, or F1-score. Notable examples include the works presented in [14,15]. While these results are technically correct within their respective experimental settings, a closer examination reveals substantial differences in problem formulation compared to the present study.
In [14], the authors propose a deep learning–based approach for DDoS detection and evaluate their model using a binary classification scheme, where all attack traffic is grouped into a single class and distinguished only from benign traffic. Similarly, Ref. [15] formulates DDoS detection as a binary decision problem, reporting near-perfect performance metrics under this simplified setting.
Binary DDoS detection is widely acknowledged to be a less challenging task, particularly for modern machine learning and deep learning models, because it requires learning only a coarse separation boundary between malicious and benign traffic. In contrast, the present work focuses on multiclass DDoS classification, distinguishing nine different attack types in the CIC-DDoS2019 dataset. This setting introduces additional complexity due to:
Strong feature overlap among different DDoS attack vectors (e.g., DNS, LDAP, and SNMP-based floods).
Severe class imbalance between majority and minority attack types.
Increased decision complexity, as the model must learn fine-grained boundaries rather than a single global separation.
Consequently, performance metrics obtained in binary settings are not directly comparable to those achieved in multiclass scenarios. This observation is consistent with prior multiclass studies, which also report lower, but more realistic, performance when full attack taxonomies are considered.
The contribution of this work lies precisely in addressing this more demanding and operationally relevant multiclass problem using a hierarchical stacking ensemble that combines classical machine learning and deep learning models. Although absolute metrics are lower than those reported in binary studies, they reflect a more realistic evaluation of DDoS detection systems intended for real-world deployment, where identifying the specific attack type is crucial for mitigation and response.
To make these differences explicit, Table 16 summarizes key distinctions among related works.
By clearly distinguishing binary from multiclass settings, this study highlights its main contribution: a robust hierarchical stacking ensemble capable of full multiclass detection, which better reflects real-world environments where multiple DDoS attack types must be distinguished, not merely detected.
4.7. Why the Final Ensemble Does Not Significantly Outperform Layer-1
A close inspection of Table 10, Table 11, Table 12, and Table 13 reveals that the performance of the final Layer-2 voting ensemble is very similar to the results obtained by the best individual meta-learner in Layer-1 (particularly the Ridge Classifier). While this may appear to contradict the expected benefit of a stacked ensemble, it is consistent with several well-documented behaviors of hierarchical ensembles in highly separable datasets such as CIC-DDoS2019.
Strong Meta-Learner Dominance
The Ridge Classifier in Layer-1 shows exceptionally strong performance across most attack categories. When a single model already learns nearly all discriminative boundaries, stacking tends to provide diminishing returns because there is very limited residual error for the upper layer to correct.
Correlated Error Patterns Across Models
Although the Level-0 models differ in architecture (trees, distance-based, neural), their errors become partially correlated after the PCA-based label consolidation. This reduces the capacity of a majority-vote mechanism to generate new decision boundaries that substantially outperform the best participant.
Function of Layer-2: Stability Rather Than Accuracy Boost
While Layer-2 does not significantly increase macro performance metrics, it serves an important role:
Reduces variance across training runs.
Stabilizes predictions in borderline cases.
Provides robustness when traffic patterns shift slightly.
This is a known effect in ensemble theory: voting is often most useful for stability, not always peak accuracy.
Highly Separable Classes After Preprocessing
The use of SMOTE, PCA label consolidation, and correlation-based feature reduction strongly increases class separability. Under such conditions, a single meta-learner can exploit the simplified structure almost as effectively as a full ensemble.
Therefore, the similarity between Layer-1 and Layer-2 metrics does not invalidate the architecture. Instead, it indicates that the final voting layer contributes primarily stability and variance reduction rather than an additional accuracy gain, in line with the factors discussed above.
5. Discussion
This section discusses the experimental results obtained with the proposed stacking framework and situates them within the context of the existing DDoS detection literature. The analysis focuses on interpreting performance trends, class-level behavior, and operational implications under the evaluated experimental conditions, rather than solely emphasizing aggregate metric values.
The multilayer stacking architecture demonstrated solid performance and high potential as an integrated solution for detecting multiclass DDoS attacks. With an overall accuracy of 91.26%, the model validated the hypothesis that the hierarchical combination of heterogeneous algorithms, including decision trees, kernel-based methods, and neural networks, enables the capture of diverse behavioral patterns in network traffic, achieving more stable, robust, and generalizable detection. This performance reflects the synergy between S-ML and D-ML models, which contribute both interpretability and representational depth within a hierarchical framework that integrates their complementary strengths.
The results demonstrate that the proposed stacked ensemble model effectively integrates multiple learning algorithms to improve robustness and classification accuracy in DDoS detection. The ensemble achieved approximately 91% accuracy with balanced precision and recall, confirming its reliability across multiple attack categories. As shown in Figure 5, the evaluation workflow verifies the effectiveness of the hierarchical training procedure and the integration of multiple classifiers within the stacking scheme.
The use of successive layers enhanced the system’s robustness against traffic variability and anomalous patterns. The base models (level zero) generated diverse predictions, while the meta-models (level one) learned to combine these outputs to reduce systematic errors. The voting layer (level two) consolidated the most consistent predictions, achieving an optimal balance between precision and generalization. This hierarchical design proved effective in mitigating overfitting and improving model stability when exposed to unseen data, demonstrating greater adaptability compared to standalone learners.
A detailed analysis of the experimental results also revealed a noticeable reduction in class overlap after applying PCA-based class merging and SMOTE balancing, confirming the positive contribution of the preprocessing strategy illustrated in Figure 2. This preprocessing step improved class separability and ensured a more uniform training distribution, contributing directly to the model’s consistent performance across validation and testing. The multilayer ensemble structure depicted in Figure 4 further enhanced generalization capabilities and ensured stable performance under diverse network conditions.
Finally, the results across all evaluated models exhibited remarkable consistency, with accuracies around 91% and F1-scores exceeding 90%. The ensemble layer effectively leveraged the predictive strengths of the base classifiers, achieving balanced precision and recall across various attack types. These outcomes confirm the robustness and scalability of the proposed multilayer detection framework, as well as its suitability for deployment in distributed and real-time DDoS detection environments.
5.1. Contextualizing the Performance Gap with Prior High-Accuracy Works
It is important to note that several prior works reporting 98–99% accuracy or F1-score on CIC-DDoS2019 operate under experimental settings that differ substantially from those in this study. Our evaluation considers a nine-class multiclass problem (after removing classes with fewer than 450 samples and merging two highly confused categories), a large-scale dataset (∼8.85 M flows after cleaning), and a broad feature set of 78 attributes. Many high-performing studies simplify the task by using binary classification, selecting only a subset of attack types, or applying aggressive feature reduction, which can produce overly optimistic results. Under our more challenging setting, the proposed stacking ensemble achieves 91.26% accuracy and demonstrates improved robustness across classes. To further analyze the ensemble’s contribution, we include an ablation study based on F1-score (Table 14), which compares all base, meta, and ensemble components and confirms that the final layer provides consistent improvements in per-class F1 performance.
5.2. Evaluation Strategy and Generalization Limitations
The evaluation in this study was conducted using a single 70-25-5 train-validation-test split. This decision was motivated by the computational cost associated with training the full three-layer stacking architecture on a dataset of 8.85 million flows. While this approach is common in large-scale intrusion detection research, we acknowledge that it does not capture statistical variability to the same extent as k-fold cross-validation. Furthermore, the present work evaluates the model on a single dataset (CIC-DDoS2019), which limits conclusions about generalization across heterogeneous traffic environments. These aspects represent methodological limitations that will be addressed in future research through k-fold validation, statistical significance testing, and cross-dataset experiments. Despite these constraints, the ablation study included in this work provides evidence of the stability and internal consistency of the proposed ensemble architecture.
5.3. Performance Limitations and Potential Improvements
Although the proposed three-layer stacking ensemble demonstrates strong overall performance for multiclass DDoS detection, the results reveal specific cases where the architecture does not outperform all baseline or reference approaches. In particular, minority attack categories, such as SNMP-UDP Flood, LDAP Flood, or specific DNS-based attacks, exhibit comparatively lower F1-scores. This behavior is theoretically expected in highly imbalanced, noisy, and feature-overlapping datasets such as CIC-DDoS2019, where the statistical distribution of minority classes provides limited discriminative information for the learning process.
From a theoretical standpoint, these observations can be attributed to three core factors:
Data imbalance, which biases the decision boundaries toward dominant classes.
Feature redundancy and multicollinearity, which reduce separability in the feature space (a simple correlation-based check is sketched after this list).
Limited interaction modeling between Level-0 learners, which restricts the meta-learner’s capacity to correct correlated errors.
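As a brief illustration of the second factor, the degree of redundancy in a flow-level feature table can be quantified from its pairwise Pearson correlations. The sketch below is a generic check; the 0.95 threshold and the DataFrame name are illustrative choices, not the values or identifiers used in our preprocessing pipeline.

```python
import numpy as np
import pandas as pd

def highly_correlated(df: pd.DataFrame, threshold: float = 0.95) -> list[str]:
    """Return feature names whose absolute Pearson correlation with an
    earlier feature exceeds the threshold (candidates for removal)."""
    corr = df.corr(method="pearson").abs()
    # Keep only the upper triangle so each pair is counted once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Example (hypothetical): redundant = highly_correlated(flow_features_df)
```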
To address these challenges, several improvement directions can be considered:
Advanced Class Rebalancing Techniques
While SMOTE alleviates global imbalance, it may oversample regions of high overlap, degrading minority-class precision. More sophisticated strategies, such as Borderline-SMOTE, ADASYN, SMOTE-IPF, or generative oversampling via GANs, can yield more representative minority samples while better preserving local decision boundaries (a minimal Borderline-SMOTE sketch is given at the end of this list of directions).
Refined Feature Selection and Representation
Although PCA-based label consolidation and Pearson correlation filtering reduce redundancy, complementary techniques such as mutual information, minimum redundancy–maximum relevance (mRMR), or SHAP-based importance could uncover more discriminative features and improve minority-class separability.
Enhanced Meta-Learning Strategies
The current Level-1 stack combines Logistic Regression, Ridge Classifier, and a shallow neural network. Replacing or augmenting these with more expressive meta-learners, e.g., gradient boosting, attention-enhanced layers, or gating mechanisms, may capture higher-order interactions between Level-0 outputs.
Adaptive or Weighted Ensemble Voting
The final decision is based on an unweighted majority vote. A class-sensitive or performance-weighted voting scheme (using per-class validation F1-scores) could emphasize the strengths of individual models in specific attack categories; a short sketch of such a scheme appears after this list of directions.
Temporal and Behavioral Modeling
CIC-DDoS2019 contains latent temporal patterns relevant to attack escalation. Integrating temporal models (LSTM, GRU, or 1D-CNN) in Level-0 may enhance the recognition of low-rate, evolving, or multi-stage attack variants.
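As anticipated in the class-rebalancing item above, the following sketch shows how plain SMOTE could be swapped for Borderline-SMOTE (or ADASYN) using imbalanced-learn. The parameters shown are library defaults used for illustration and would need to be tuned on the validation split rather than adopted as-is.

```python
from imblearn.over_sampling import BorderlineSMOTE, ADASYN

# Borderline-SMOTE concentrates synthetic samples near class boundaries,
# which may reduce the overlap-amplification effect discussed above.
sampler = BorderlineSMOTE(kind="borderline-1", k_neighbors=5, random_state=42)
X_bal, y_bal = sampler.fit_resample(X_train, y_train)

# ADASYN is a drop-in alternative that adapts the number of synthetic
# samples to the local density of each minority class:
# sampler = ADASYN(n_neighbors=5, random_state=42)
```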
Incorporating these improvements could further enhance the ensemble’s robustness, particularly in scenarios involving rare attack types and complex multiclass decision boundaries. These refinements constitute a natural progression for future research.
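Likewise, the adaptive-voting direction can be made concrete with a small, hypothetical weighting scheme in which each base model’s vote is scaled by its per-class validation F1-score. This is a sketch of one possible design, not the voting rule currently implemented in the final layer, and all array shapes and names are assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score

def weighted_vote(predictions: np.ndarray, f1_per_class: np.ndarray) -> np.ndarray:
    """predictions: (n_models, n_samples) array of predicted class indices.
    f1_per_class: (n_models, n_classes) validation F1-scores per model."""
    n_models, _ = predictions.shape
    n_classes = f1_per_class.shape[1]
    scores = np.zeros((predictions.shape[1], n_classes))
    for m in range(n_models):
        for c in range(n_classes):
            # Each model adds its class-specific F1 as the weight of its vote.
            scores[predictions[m] == c, c] += f1_per_class[m, c]
    return scores.argmax(axis=1)

# Each row of f1_per_class could be built on the validation split as:
# f1_score(y_val, model.predict(X_val), average=None)
```

A scheme of this kind lets a model that is strong on, for example, SNMP carry more weight for that class without affecting classes where it is weak, at the cost of one additional validation pass per model.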
5.4. Implications for Real-World Deployment and Future Research Directions
Beyond quantitative performance, the findings have several implications for practical deployment. First, the proposed architecture reduces reliance on deep models with heavy computational overhead, making it suitable for medium-scale environments such as enterprise networks and regional ISPs. Second, the combination of classical machine learning and lightweight neural components ensures predictable inference times, which is critical for real-time DDoS mitigation.
However, deployment in real operational settings presents additional challenges. Real network traffic often contains concept drift, encrypted payloads, adversarial variations, and unseen attack types, none of which are fully covered by CIC-DDoS2019. Future research should incorporate online learning, drift detection, adaptive thresholding, and continuous retraining mechanisms to sustain detection performance as traffic characteristics evolve. Additionally, evaluating the model under high-throughput conditions and on modern datasets, including cloud-native and IoT traffic, will be essential to validate the generalization of the proposed architecture.
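As one concrete example of such monitoring, a population stability index (PSI) computed over individual flow features can flag distribution shift between the training window and live traffic. The implementation below and the 0.1/0.25 decision thresholds follow common rule-of-thumb practice; they are assumptions for illustration, not values validated in this study.

```python
import numpy as np

def population_stability_index(expected, observed, bins: int = 10) -> float:
    """Compare the binned distribution of a feature in training data
    (expected) against recent live traffic (observed)."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    o_pct = np.histogram(observed, bins=edges)[0] / len(observed)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) and division by zero
    o_pct = np.clip(o_pct, 1e-6, None)
    return float(np.sum((o_pct - e_pct) * np.log(o_pct / e_pct)))

# Rule of thumb: PSI < 0.1 stable; 0.1-0.25 moderate drift; > 0.25 retrain.
```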
These considerations suggest that while the current model provides a strong foundation, further work is needed to ensure robust and scalable performance in production-grade intrusion detection systems.
5.5. Computational Performance Analysis and Operational Feasibility
To assess the practical viability of the proposed three-layer stacking architecture, we evaluated its computational performance in terms of training time, inference latency, throughput, and peak memory usage on the workstation described in
Section 3.3.
Table 17 summarizes the measurements obtained for each component of the pipeline.
Training proved computationally intensive due to the large dataset (∼8.85 million flows) and the number of models involved in Level-0 and Level-1. However, inference is significantly lighter: individual models exhibit per-flow latency in the millisecond range, and the final voting ensemble achieves a throughput compatible with near–real-time processing.
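Latency and throughput figures of this kind can be reproduced with a simple timing harness such as the sketch below, where model and X are placeholders for any trained component and a held-out feature matrix; the batch size is an illustrative choice that would be matched to the monitoring pipeline in practice.

```python
import time

def measure_inference(model, X, batch_size: int = 4096):
    """Report mean per-flow latency (ms) and throughput (flows per second)."""
    start = time.perf_counter()
    for i in range(0, len(X), batch_size):
        model.predict(X[i:i + batch_size])
    elapsed = time.perf_counter() - start
    per_flow_ms = 1000.0 * elapsed / len(X)
    throughput = len(X) / elapsed
    return per_flow_ms, throughput
```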
The full pipeline, when all models are loaded simultaneously, has higher memory requirements and reduced throughput, primarily due to the KNN component. Nevertheless, Level-0 pruning or selective loading strategies can substantially improve runtime performance without compromising detection quality.
Overall, these results indicate that, although training requires substantial computational resources, the inference stage of the proposed framework is suitable for deployment in modern network monitoring systems. Future work will investigate model compression, GPU-accelerated inference, and streaming-oriented deployment architectures to further improve scalability and operational efficiency.
5.6. Ensemble Robustness Beyond Global Metrics
Although the overall accuracy and aggregated precision, recall, and F1-scores of the base models, the meta-learners, and the final Layer-2 voting ensemble appear similar (≈0.90–0.92), this behavior is expected in highly separable multiclass DDoS datasets. The ablation study presented in
Table 14 shows that the ensemble contributes measurable improvements in robustness, particularly in stabilizing predictions for minority and overlapping attack types. Therefore, the benefit of stacking in this context is not a significant increase in global accuracy, but rather more balanced and consistent per-class performance across heterogeneous attack categories.
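The per-class comparison underlying this observation can be reproduced with scikit-learn’s unaveraged F1-score, as in the sketch below. The model objects, variable names, and class ordering are placeholders; in particular, class_names must follow the label encoding used during training.

```python
import pandas as pd
from sklearn.metrics import f1_score

# Hypothetical handles to a base model, a meta-learner, and the final ensemble.
models = {"rf": rf_model, "ridge": ridge_model, "ensemble": voting_ensemble}

# One row per model, one column per attack class, which makes it easy to spot
# classes where the ensemble stabilizes otherwise weak predictions.
per_class_f1 = pd.DataFrame(
    {name: f1_score(y_test, m.predict(X_test), average=None)
     for name, m in models.items()},
    index=class_names,  # labels in the same order as the encoded classes
).T
print(per_class_f1.round(3))
```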
5.7. Why the Full Ensemble Does Not Always Outperform Simpler Models
Although ensemble architectures are generally expected to outperform individual learners, our results reveal that some Layer-0 and Layer-1 models, particularly RF and the Ridge Classifier, achieve metrics that are numerically very close to those of the final Layer-2 voting ensemble. Similar observations have been reported in multiple CIC-DDoS2019 studies, where tree-based models often achieve strong performance even without additional meta-learning layers. This behavior does not imply redundancy in the proposed architecture; rather, it reflects intrinsic characteristics of the dataset and of the learning task.
First, CIC-DDoS2019 includes several attack families whose statistical signatures are sufficiently distinct for strong shallow models, particularly Random Forests, to capture most discriminative patterns. When a base learner already models the dominant variance structure of the data, additional ensemble layers may yield only marginal improvements in aggregate metrics such as accuracy or F1-score. This explains why RF appears to perform similarly to the final ensemble when evaluated solely on global averages.
Second, the objective of the stacked architecture is not only to improve average performance but also to reduce class-specific variance and prevent error concentration. While Layer-2 does not always increase the global F1-score, it produces more stable predictions for minority classes (e.g., SNMP, LDAP, DNS-based attacks) by smoothing the misclassification spikes observed in RF and the Ridge Classifier individually. Summary statistics do not fully capture these stabilizing effects, but they are evident in the per-class confusion matrices. For real-world DDoS detection, the stability of prediction across classes is often more critical than maximizing a single global score.
Third, the full ensemble admittedly increases computational complexity. Nevertheless, its design is motivated by two practical considerations:
Robustness to drift and cross-scenario variability: Heterogeneous base learners exhibit complementary error profiles that generalize better when traffic patterns deviate from the training distribution.
Fault-tolerance: In high-stakes security systems, reliance on a single model increases vulnerability to systematic biases or adversarial weaknesses. A hierarchical ensemble reduces the probability that a single failure propagates to the final decision.
Finally, although RF and the Ridge Classifier perform strongly on CIC-DDoS2019, they remain more sensitive to class imbalance, feature redundancy, and noise arising from inter-class similarity than the layered ensemble. The proposed architecture therefore represents a trade-off: while computationally heavier, it offers greater resilience, smoother multiclass behavior, and better handling of minority attacks, benefits that are not readily apparent when examining aggregate performance metrics alone.
These considerations justify the use of the full stacked architecture despite the numerical proximity of some results, and they clarify why its advantages lie not only in absolute performance but also in stability, robustness, and operational reliability.
5.8. Security Implications and Threat Model Considerations
From a security perspective, the proposed model operates under a non-adaptive threat model, where attackers do not deliberately manipulate their traffic to evade detection. This assumption matches the volumetric and reflection-based characteristics of the attacks in CIC-DDoS2019 but does not extend to adversarial evasion strategies. The class-wise F1-scores reported in
Table 10,
Table 11,
Table 12 and
Table 13 show that certain amplification vectors, particularly SNMP and DNS/LDAP, exhibit lower detection performance compared to the remaining classes. These attacks are capable of producing large amplification factors through misconfigured or open servers, and false negatives in these categories may reduce detection coverage during the early stages of a multi-vector DDoS event. A delayed or incomplete identification of these vectors can impact the accuracy of mitigation strategies (e.g., filtering rules, upstream scrubbing, or rate-limiting policies), thereby increasing response latency and potentially allowing attackers to sustain higher traffic volumes for longer periods.
In contrast, misclassifications that occur within malicious categories (e.g., mistaking DNS/LDAP for NetBIOS) pose relatively lower operational risk because the system still correctly categorizes the flow as hostile, preserving the ability to trigger defensive actions. False positives, while less frequent, may temporarily affect benign traffic but generally result in proportionally lower operational costs compared to missed detections of large-scale reflection attacks.
5.9. Limitations and Error Analysis
Although the proposed stacking ensemble achieves balanced performance across most attack categories, the confusion matrix reveals systematic misclassification patterns that provide valuable insights into the limitations of the model. These failure modes arise from a combination of dataset-level constraints, overlapping statistical behaviors among amplification attacks, and inherent limitations of flow-level features. A deeper analysis of these patterns helps contextualize the ensemble’s behavior and clarifies where improvements are most needed for real-world deployment.
In particular, the confusion matrix provides a visual interpretation of the ensemble’s behavior beyond aggregate metrics, enabling the identification of systematic error patterns and their association with specific attack categories. This qualitative analysis complements the quantitative evaluation presented earlier and supports a more security-oriented interpretation of the results.
5.9.1. Misclassification Patterns Observed in the Confusion Matrix
The test-set confusion matrix (
Figure 6) highlights several recurrent misclassification clusters, particularly among reflection/amplification attacks. The most prominent examples include:
DNS/LDAP → SNMP: 6768 instances.
TFTP → Syn/UDPLag: 6084 instances.
SNMP → DNS/LDAP: 4973 instances.
NetBIOS/Portmap → SNMP: 4432 instances.
SNMP → NetBIOS/Portmap: 4123 instances.
NetBIOS/Portmap → MSSQL: 2260 instances.
Syn/UDPLag → SSDP/UDP: 2213 instances.
SSDP/UDP → SNMP: 1453 instances.
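Confusion pairs of this kind can be extracted automatically by ranking the off-diagonal entries of the confusion matrix. The sketch below assumes cm is the test-set confusion matrix and class_names lists the labels in the same order; both names are placeholders.

```python
import numpy as np

def top_confusions(cm: np.ndarray, class_names, k: int = 8):
    """Return the k largest off-diagonal (true -> predicted) cells."""
    off_diag = cm.copy().astype(int)
    np.fill_diagonal(off_diag, 0)          # ignore correct classifications
    flat = np.argsort(off_diag, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, off_diag.shape)
    return [(class_names[r], class_names[c], int(off_diag[r, c]))
            for r, c in zip(rows, cols)]
```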
These patterns reflect two primary characteristics:
High intra-class variability
DNS/LDAP and SNMP exhibit considerable diversity in amplification factors, payload sizes, and upstream server behaviors. This broad variability widens the decision boundary, increasing overlap with other reflection vectors such as NetBIOS/Portmap and SSDP/UDP.
Statistical similarity among amplification-based attacks
Many volumetric DDoS attacks share similar flow-level signatures (bursty packet rates, minimal inter-arrival variation, and comparable statistical profiles), making these classes difficult to separate using purely statistical flow features.
Overall, the confusion is not arbitrary; it reflects structural similarities in traffic behavior rather than deficiencies in the ensemble itself.
5.9.2. Underlying Causes of Failure: Feature Overlap, SMOTE Effects, and Dataset Constraints
A deeper inspection of the feature space and preprocessing pipeline reveals three central causes behind the observed misclassification trends:
Overlapping feature distributions
PCA projections and feature-distribution analyses indicate substantial overlap between DNS/LDAP, NetBIOS/Portmap, SSDP/UDP, and related amplification vectors.
Since all Level-0 and Level-1 models consume the same flow-level feature representation, the ensemble cannot fully disentangle classes whose statistical profiles are inherently similar.
SMOTE-induced boundary distortions
Although SMOTE helps mitigate imbalance, it can introduce:
Synthetic samples near decision boundaries.
Amplification of noisy minority patterns.
Distortions in feature-density regions for SNMP, SSDP/UDP, and other low-frequency classes.
These effects increase ambiguity precisely in the classes where the F1-score is lowest.
Limitations of flow-level statistical features
The CIC-DDoS2019 dataset does not include:
Packet payloads.
Protocol header semantics.
Fine-grained timing structure.
Request-response patterns.
Entropy-based temporal signatures.
Without protocol-aware or payload-level features, classes with similar volumetric patterns (e.g., DNS vs. SSDP) remain difficult to distinguish, regardless of the classifier’s complexity.
5.9.3. Operational Implications of Misclassification
From a real-world security perspective, not all misclassifications carry the same risk. The following considerations outline their operational impact:
False negatives in amplification attacks (e.g., SNMP, DNS/LDAP)
These attack types can produce high amplification factors. Misclassification or delayed detection may:
Slow down activation of rate-limiting or scrubbing mechanisms.
Reduce early situational awareness.
Extend the window during which peak attack bandwidth is sustained.
These are the most critical failure modes from a mitigation standpoint.
False positives on benign traffic
The confusion matrix shows ≈ 4 benign flows misclassified as malicious. While operationally less severe than false negatives, false positives may result in:
Temporary disruption of legitimate traffic.
Unnecessary throttling of valid services.
Reduced operator trust in automated detection.
Misclassifications between DDoS subcategories
Errors where an attack flow is labeled as the wrong attack type but still recognized as malicious have lower operational risk because:
Mitigation is still triggered.
The system identifies the flow as hostile.
Only fine-grained attribution or protocol-specific filtering is affected.
However, they do reduce the quality of forensic analysis and the precision of protocol-specific mitigation rules.
6. Conclusions and Future Work
This study presented a multi-layer stacking ensemble framework for multiclass Distributed Denial-of-Service (DDoS) attack detection, integrating heterogeneous S-ML and D-ML algorithms to improve robustness, generalization, and classification accuracy. The proposed architecture, composed of base, meta, and voting layers, demonstrated its ability to combine the complementary strengths of tree-based, kernel-based, regression, and neural models within a unified hierarchical design. The experimental evaluation on the CIC-DDoS2019 dataset confirmed the effectiveness of this approach, achieving an overall accuracy of 91.26% with balanced precision and recall across all attack categories.
The model successfully mitigated overfitting and improved resilience to class imbalance and traffic variability, which are persistent challenges in network intrusion detection. Moreover, the PCA-based dimensionality reduction and SMOTE balancing contributed to reducing class overlap and enhancing the discriminative capacity of the ensemble. These results validate the proposed framework as a viable and scalable solution for real-world DDoS detection, where maintaining adaptability, interpretability, and efficiency under dynamic traffic conditions is essential.
Overall, this research highlights the potential of hierarchical stacking ensembles as an alternative to single deep or shallow models, providing a foundation for more intelligent, adaptive, and automated network defense systems that can handle the complexity of multiclass DDoS attack scenarios.
Recent studies have explored the potential of quantum machine learning (QML) for complex classification problems, reporting advantages in feature-space expressivity and non-classical kernel mappings [
27,
28,
29]. Although these approaches are promising, current QML implementations remain constrained by hardware limitations, qubit noise, data encoding overhead, and scalability challenges that hinder their applicability to very large datasets such as CIC-DDoS2019. For this reason, the present work focuses on classical S-ML and D-ML techniques. Nevertheless, as quantum hardware matures, future extensions of this research may explore hybrid quantum-classical architectures or QML-based feature encoders as a complementary direction for multiclass DDoS detection.
Additionally, the measured inference latency and throughput indicate that the proposed framework is suitable for near-real-time deployment in high-volume network environments, provided that training is performed offline and inference operates in a streaming fashion. Taken together with the classification results above, these measurements further support the framework’s viability for real-world, near-real-time DDoS detection.
Future research will aim to extend the current framework in several directions. First, incorporating real-time detection capabilities through streaming-based training and incremental learning could enable deployment in live network environments. Second, integrating federated or distributed learning paradigms may improve data privacy and model scalability across multiple domains and organizations. Third, the adoption of self-supervised and explainable AI techniques could enhance interpretability and reduce dependency on extensive labeled datasets, a current limitation in DDoS detection research.
Future work will extend the experimental evaluation beyond a single dataset and a single train–validation–test split. In particular, we plan to incorporate k-fold cross-validation, more extensive statistical significance testing, and cross-dataset validation on heterogeneous benchmarks such as CIC-IDS2017, UNSW-NB15, and ToN-IoT. In parallel, we will investigate strategies to reduce the computational footprint of the proposed stacking architecture, including model compression, pruning of high-latency components (e.g., KNN), GPU-accelerated inference, and deployment on streaming-oriented platforms. These efforts aim to improve scalability and throughput while preserving the robustness observed in the current multiclass setting.
To support reproducibility and transparency, the complete preprocessing pipeline, trained models, and experimental scripts will be made publicly available upon acceptance of the manuscript.
An additional direction for future research concerns the adoption of more realistic threat models in which attackers actively adapt their traffic patterns to evade detection. To this end, future work will explore adversarial and adaptive scenarios through techniques such as adversarial training, online and incremental learning, and robust feature extraction. These extensions aim to enhance the resilience of the proposed framework against evasive behaviors and improve its applicability in dynamic, real-world network environments.
Collectively, these directions aim to evolve the proposed stacking ensemble into a fully adaptive, interpretable, and resource-efficient defense mechanism for modern cyberinfrastructure.