Article

A Stacking-Based Ensemble Model for Multiclass DDoS Detection Using Shallow and Deep Machine Learning Algorithms

Computer Science and Engineering Department, Universidad del Norte, Barranquilla 080020, Colombia
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(2), 578; https://doi.org/10.3390/app16020578
Submission received: 4 November 2025 / Revised: 30 December 2025 / Accepted: 2 January 2026 / Published: 6 January 2026

Abstract

Distributed Denial-of-Service (DDoS) attacks remain a significant threat to the stability and reliability of modern networked systems. This study presents a hierarchical stacking ensemble that integrates multiple Shallow Machine Learning (S-ML) and Deep Machine Learning (D-ML) algorithms for multiclass DDoS detection. The proposed architecture consists of three layers: Layer Zero (base learners), Layer One (meta learners), and Layer Two (final voting). The base layer combines heterogeneous S-ML and D-ML models (tree-based, kernel-based, and neural architectures), while the meta layer employs regression and neural models trained on meta-features derived from base-layer predictions. The final decision is determined through a voting mechanism that aggregates the outputs of the meta models. Using the CIC-DDoS2019 dataset with a nine-class configuration, the model achieves an accuracy of 91.26% and macro F1-scores above 0.90 across most attack categories. Unlike many prior works that report near-perfect performance under binary or reduced-class settings, our evaluation addresses a more demanding multiclass scenario with large-scale traffic (∼8.85 M flows) and a broad feature space. The results demonstrate that the ensemble provides competitive multiclass detection performance and consistent behavior across heterogeneous attack types, supporting its applicability to high-volume network monitoring environments.

1. Introduction

The digital ecosystem has become increasingly vital across industries, national infrastructures, and small and medium enterprises (SMEs), making network connectivity a cornerstone of economic activity and operational resilience. At the same time, Distributed Denial-of-Service (DDoS) attacks have emerged as one of the most pervasive global cyber threats, disrupting services by overwhelming network resources such as bandwidth, processing power, and memory. Recent reports indicate that DDoS incidents continue to escalate, with attack volumes growing by more than 50% in 2024 and peak capacities exceeding 1.6 Tbps, according to Gcore’s H2 2023 report [1] and A10 Networks’ Global DDoS Weapons Report [2].
Detecting and mitigating DDoS attacks remains challenging due to the diversity of attack vectors, the rapid evolution of threat patterns, and increasing traffic heterogeneity. Traditional rule-based and signature-based Intrusion Detection Systems (IDSs) have become insufficient for detecting novel or obfuscated attacks in real time [3]. This limitation has led to a shift toward data-driven methodologies, particularly Shallow Machine Learning (S-ML) and Deep Machine Learning (D-ML), for automated traffic analysis and anomaly detection. Classical S-ML algorithms such as Support Vector Machines (SVM), Decision Trees (DT), and Random Forest (RF) offer interpretability and efficiency, yet they often struggle in dynamic and high-dimensional traffic environments [4]. Conversely, D-ML architectures (e.g., Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM) networks) capture spatio-temporal relationships but are commonly applied to binary classification, where the task is limited to distinguishing benign from malicious flows. Recent systematic reviews highlight that only 30–40% of prior work addresses the more complex multiclass problem, where the goal is to differentiate among several DDoS attack categories [3].
To improve robustness and generalization, recent research has explored hybrid and ensemble approaches. Empirical studies show that tree-based ensemble methods such as RF and XGBoost frequently obtain high accuracy and F1-scores (often above 98–99%), particularly in binary or reduced-class tasks [4,5]. Hybrid schemes that combine supervised and unsupervised components have also demonstrated strong performance in zero-day detection, reducing false positives and improving adaptability [6]. However, the majority of these high-performing pipelines evaluate simplified versions of CIC-based datasets, for example by restricting the task to binary classification, selecting only a subset of attack types, or applying aggressive feature reduction, conditions that tend to artificially inflate reported metrics and are not directly comparable to full multiclass scenarios.
A more recent but less explored direction involves stacking ensembles, which integrate heterogeneous learners (tree-based, kernel-based, and neural models) into a hierarchical architecture designed to exploit complementary inductive biases [7]. Stacking has shown promise in complex environments such as Software-Defined Networks (SDN), cloud infrastructures, and large-scale IoT ecosystems, yet few studies evaluate stacking under full multiclass DDoS settings, where label imbalance, overlapping traffic patterns, and inter-category variability place additional constraints on performance.
Despite these advances, two important limitations persist in most previous studies. First, a large proportion of existing DDoS detection methods focuses primarily on binary classification (benign vs. malicious), leaving the multiclass problem of distinguishing among heterogeneous DDoS attack vectors only partially addressed. Even when multiclass detection is considered, single-model approaches often exhibit reduced generalization under class imbalance, overlapping traffic patterns, or high-dimensional feature spaces. Second, although ensemble and hybrid techniques have gained attention, heterogeneous stacking architectures that jointly integrate tree-based, kernel-based, and deep learning models remain underexplored. Existing stacking frameworks typically rely on a narrow set of learners or are confined to SDN-specific scenarios, without demonstrating scalable and robust multiclass performance on large, diverse datasets such as CIC-DDoS2019. Consequently, the field still lacks a unified multilayer framework capable of achieving stable, high-performing multiclass DDoS detection by leveraging the complementary strengths of S-ML and D-ML models.
Motivated by these gaps, this study proposes a three-layer hierarchical stacking ensemble that integrates S-ML and D-ML models to address the challenges of multiclass DDoS detection. The framework combines heterogeneous learners at the base layer, regression-based and neural meta-models at the second layer, and a final voting mechanism to produce the final inference. Using the CIC-DDoS2019 dataset, the proposed model achieves 91.26% accuracy and balanced precision-recall performance across nine attack categories, under a realistic multiclass configuration that retains a broad feature set and avoids reducing the task to binary detection. These results demonstrate the potential of stacking ensembles as a scalable and operationally relevant approach for real-time DDoS detection under heterogeneous and evolving network conditions.
Accordingly, the novelty of this work lies not in the proposal of new classifiers but in the structured, large-scale, and robustness-oriented application of hierarchical stacking to multiclass DDoS detection.
The remainder of this article is structured as follows: Section 2 reviews previous S-ML, D-ML, and ensemble-based approaches to DDoS detection; Section 3 describes the proposed stacking architecture and methodology; Section 4 presents the experimental setup and results; Section 5 provides a detailed discussion, including limitations, security implications, and operational considerations; and Section 6 concludes the study and outlines directions for future research.

2. Previous Work

Research on DDoS detection has advanced along three complementary methodological streams: classical S-ML, D-ML, and ensemble approaches. Classical S-ML remains crucial for establishing interpretable and computationally efficient baselines. D-ML methods, in turn, have proven transformative by capturing spatio-temporal patterns that traditional models often miss. Finally, ensemble learning has emerged as a strategy to leverage the complementary strengths of multiple classifiers, boosting accuracy and generalization, while reducing false alarms. Together, these directions provide the foundation for the development of next-generation frameworks that integrate multiple paradigms to overcome individual limitations.

2.1. Machine Learning Approach

In the field of DDoS detection, classical S-ML classifiers remain essential for their interpretability, computational efficiency, and practicality in real-world networks. These models often provide reliable baselines against which more complex approaches can be benchmarked. Abiramasundari and Ramaswamy [8], for example, conducted a systematic comparison of SVM, K-Nearest Neighbors (KNN), Decision Trees, and RF on CIC-DDoS2019, CICIDS2017, and CICIDS2018. Their results (accuracy values consistently between 98.7% and 98.9%) highlight the robustness of traditional S-ML while underscoring its strengths in interpretability and deployment efficiency, particularly for small-to-medium-sized enterprises seeking reliable detection with low latency and modest resources.
Extending these foundations, Sawah et al. [9] demonstrated how disciplined optimization can elevate classical models to near-perfect performance. By integrating Recursive Feature Elimination (RFE) with Grid Search for hyperparameter tuning, their RF classifier on DDoS-SDN achieved 99.99% accuracy, precision, recall, and F1-score, outperforming Naïve Bayes (98.85%), KNN (97.90%), Linear Discriminant Analysis (97.10%), and SVM (95.70%). This study demonstrates how careful feature selection and tuning can significantly enhance robustness and minimize error rates in high-throughput detection scenarios.
Similarly, Becerra-Suarez et al. [10] emphasized the importance of preprocessing pipelines. Their approach to CIC-DDoS2019 combined outlier removal, Pearson-correlation-based feature selection (retaining 22 features), normalization, and Tree-of-Parzen-Estimators (TPE) for hyperparameter optimization. Under this setup, RF achieved 99.97% accuracy, 99.98% F1-score, and 99.96% Receiver Operating Characteristic - Area Under Curve (ROC-AUC), while XGBoost delivered comparable performance. Notably, both models surpassed Multi-Layer Perceptron (MLP) baselines evaluated under the same conditions, underscoring that data quality, feature engineering, and principled optimization often rival the gains attributed to deep learning.
Complementing these efforts, Fathima et al. [11] evaluated RF, KNN, and Logistic Regression (LR) on the CSE-CICIDS2018, CICIDS2017, and CICDoS datasets, with traffic normalized using the Standard Scaler. Results showed RF as the top performer with 97.6% accuracy, followed by KNN (97%) and LR (91.1%). This comparative analysis across multiple benchmarks reaffirms RF’s capacity to generalize effectively in heterogeneous traffic environments, while also demonstrating the persistent relevance of ensemble-based methods in handling complex traffic distributions.
Further evidence of RF’s dominance was provided by Ebrahem et al. [12], who introduced a feature-grouping methodology for CICIDS2017, partitioning universal attributes into subgroups to test algorithm resilience under dimensionality reduction. RF, Naïve Bayes, KNN, and LR were compared, with RF consistently outperforming the others in terms of accuracy, precision, recall, and F1-score. Remarkably, RF maintained detection performance above 93% even with only two retained features, demonstrating its adaptability and suitability for environments that require efficient computation.
Taken together, these studies establish that well-optimized classical S-ML approaches (particularly RF and XGBoost) remain highly competitive for DDoS detection. Their demonstrated resilience across multiple datasets, preprocessing strategies, and feature-reduction scenarios provides a rigorous baseline for advancing toward ensemble and stacking strategies, where the complementary strengths of different learners can be further exploited for robust multiclass DDoS detection.

2.2. Deep Learning Approaches

Deep learning has emerged as a transformative paradigm for DDoS detection, owing to its capacity to automatically extract hierarchical spatio-temporal representations from raw or minimally processed traffic. Unlike shallow ML classifiers, D-ML architectures capture bursty patterns, flow dependencies, and stealthy behaviors that traditional feature engineering may overlook, positioning them as powerful tools for both binary and multiclass detection tasks.
Among the earliest contributions, Bashaiwth et al. [13] proposed an explainable LSTM-based framework applied across three benchmark datasets (CIC-DDoS2019, CICIDS2017, and CSE-CIC-IDS2018). By integrating interpretability techniques such as LIME, SHAP, Anchor, and LORE, the study addressed the “black-box” limitation of recurrent models. Binary classification consistently yielded high accuracy across datasets, while multiclass results were strong on CICIDS2017 and CSE-CIC-IDS2018 but less effective on CIC-DDoS2019 due to overlap in attack types. This dual emphasis on predictive power and transparency illustrates the relevance of explainability when deploying deep models in operational contexts.
Building on this line of research, Ramzan et al. [14] systematically evaluated RNN, Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU) architectures on CIC-DDoS2019, with validation on CICIDS2017 using the 20 most discriminative features. In binary detection, the RNN achieved 99.99% accuracy, outperforming LSTM and GRU by maintaining lower false-positive and false-negative rates while reducing overfitting. Conversely, in multiclass settings, GRU proved more resilient, achieving 99.54% accuracy compared to 99.43% for LSTM and 99.15% for RNN. These findings highlight the complementary strengths of recurrent networks: RNNs offer efficiency and stability in binary classification, whereas GRUs excel in capturing nuanced temporal dependencies for fine-grained multiclass detection.
Moving beyond purely recurrent designs, Setitra et al. [15] introduced the OptMLP-CNN framework for DDoS detection in SDN environments. By integrating MLP and CNN, the model achieved a TPR of up to 99.95%, precision of 99.90%, recall of 99.98%, F1-score of 99.94%, and AUC of 99.79% on CIC-DDoS2019, with similarly strong results on InSDN (precision 99.98%, recall 99.77%, F1 99.99%, AUC 99.96%). These outcomes surpassed CNN, MLP, and shallow ML baselines. Key methodological innovations included SHAP-based feature selection to identify relevant traffic attributes, Bayesian hyperparameter optimization, and the ADAM optimizer, collectively enhancing interpretability, convergence stability, and overall robustness.
In contrast, Aktar and Nur [16] pursued a semi-supervised approach leveraging a contractive autoencoder (CAE) trained on benign traffic and evaluated through reconstruction error on CIC-DDoS2019, CICIDS2017, and NSL-KDD. The CAE achieved 93.41–97.58% accuracy on CIC-DDoS2019, 96.08% on NSL-KDD, and 92.45% on CICIDS2017, consistently outperforming baseline autoencoders, variational AEs, and LSTM-based AEs (70–95%). By employing Contractive Loss, the Adam optimizer, and Sigmoid activations in hidden layers, the model generalized effectively to unseen attacks, underscoring the utility of deep representation learning in scenarios with limited labeled data.
Finally, a CNN–GRU hybrid model [17] addressed the challenges of incomplete feature extraction, class imbalance, and multiclass accuracy. By combining Random Forest–Pearson correlation for feature selection, ADASYN–RENN for data balancing, CNN for spatial extraction, GRU for temporal dependencies, and attention mechanisms for feature weighting, the framework achieved robust performance across various datasets. Results included 99.65% accuracy, 99.63% recall, 99.65% precision, and 99.64% F1-score on CICIDS2017; 99.69% accuracy, 99.65% recall, 99.69% precision, and 99.70% F1-score on NSL-KDD; and 86.25% accuracy, 86.92% precision, and 86% F1-score on UNSW-NB15. While demonstrating state-of-the-art results on CICIDS2017 and NSL-KDD, the drop in performance on UNSW-NB15 reveals the sensitivity of D-ML models to dataset characteristics, underscoring the need for adaptability in diverse real-world environments.
Overall, D-ML approaches have set state-of-the-art benchmarks in DDoS detection, excelling in accuracy, adaptability, and generalization. However, their sensitivity to dataset variability and computational demands highlights the need for integrative approaches, such as ensembles and stacking, that combine predictive power with robustness and transparency.

2.3. Ensemble Approaches

Ensemble learning has emerged as a compelling approach for DDoS detection, leveraging the complementary strengths of multiple classifiers to improve robustness, accuracy, and generalization. These approaches are particularly effective in handling imbalanced data distributions, high-dimensional feature spaces, and evolving attack patterns, positioning them as strong candidates for real-world deployment in both enterprise and SDN environments.
One representative contribution is the stacking ensemble proposed by Ali et al. [18], where an MLP and an SVM serve as base classifiers, and an RF acts as the meta-learner. A hybrid feature selection strategy combining Genetic Algorithms, chi-square tests, and correlation analysis was applied to eliminate redundant attributes and retain only discriminative features. Evaluated on an SDN benchmark dataset, the model achieved 99.86% accuracy in training and 98.89% in testing, with a precision of 99.82%, a recall of 99.71%, and an F1-score of 99.71%. By maintaining very low false positives and false negatives, this study demonstrates that stacking, combined with advanced feature engineering, can provide reliable and scalable solutions for intrusion detection in programmable networks.
Complementing this, Hirsi et al. [19] developed an ensemble-based RF model (ENRF) integrated with Principal Component Analysis (PCA) to reduce dimensionality in SDN environments. Validated in both custom and public datasets, ENRF achieved 100% accuracy, precision, recall, and F1-score, outperforming baseline ensembles and conventional S-ML methods. PCA not only reduced computational complexity but also mitigated feature redundancy, while the ensemble structure improved resilience against overfitting. In particular, the study emphasized reproducibility by using multiple datasets and embedding the ENRF within the SDN control plane for continuous monitoring, allowing rapid anomaly detection and proactive mitigation.
Beyond supervised learning, Das et al. [20] explored hybrid ensembles combining unsupervised clustering with supervised classifiers to address zero-day detection. Tested on the NSL-KDD, UNSW-NB15, and CICIDS2017 datasets, their model achieved an accuracy of up to 99.1% with exceptionally low false-positive rates, misclassifying only 0.01% of benign instances. This fusion enabled the system to capture novel attack behaviors without compromising detection performance on known patterns, thereby offering a practical balance between generalization and precision, a key requirement for real-world adaptability.
In the IoT domain, where lightweight yet accurate solutions are essential, Lazzarini et al. [21] proposed DIS-IoT, a stacking ensemble of deep learners integrating MLP, CNN, and LSTM as base classifiers with a fully connected meta-learner. Evaluated on ToN-IoT, CICIDS2017, and SWaT datasets, DIS-IoT consistently outperformed single models, achieving near-perfect binary classification results (accuracy and F1 ≈ 0.99–1.00) and strong multiclass performance (≈0.98–0.99). By exploiting the complementary biases of convolutional, recurrent, and dense architectures, the model minimized false alarms while sustaining high recall, establishing stacking as an effective solution for IoT intrusion detection.
Further advancing ensemble innovation, Butt et al. [22] introduced a multi-model framework for SDN-based DDoS detection, combining RF, KNN, and XGBoost. Each learner contributed unique strengths: RF provided stable performance via DT aggregation, KNN offered sensitivity to local traffic dynamics, and XGBoost introduced gradient-boosted decision-making, particularly effective against class imbalance and rare attacks. The ensemble achieved nearly 99% accuracy on SDN-specific datasets, significantly outperforming individual learners and underscoring the utility of heterogeneous integration to enhance robustness under dynamic traffic conditions.
Finally, Hossain [23] presented a feature-driven ensemble where RF was coupled with a novel feature selection pipeline that integrates mutual information, correlation, and PCA. Evaluated on CIC-DDoS2019, the model achieved nearly 100% accuracy, 100% true positive rate, and a 0% false positive rate, outperforming conventional S-ML baselines. By aggregating multiple decision trees and applying dimensionality reduction, the framework effectively mitigated overfitting and enhanced generalization. This study highlights how careful feature curation and ensemble aggregation can jointly optimize detection accuracy, false alarm rates, and scalability in operational contexts.
Collectively, ensemble approaches demonstrate that combining diverse learners (whether homogeneous or heterogeneous, shallow or deep) provides significant benefits for DDoS detection. They consistently reduce false positives, improve generalization, and maintain robustness across datasets ranging from SDN to IoT and general-purpose benchmarks. These works establish ensemble learning not only as a performance booster over single models but also as a practical pathway toward resilient, interpretable, and real-time DDoS defense.
In summary, the body of research on DDoS detection reveals a clear trajectory: from interpretable and resource-efficient S-ML models, to highly accurate but dataset-sensitive D-ML architectures, and finally to ensemble approaches that combine complementary strengths. Classical S-ML provides robust baselines, D-ML introduces powerful feature extraction and adaptability, and ensembles achieve resilience and generalization across diverse datasets and environments. Despite these advances, striking a balance between detection accuracy, interpretability, and real-world scalability remains a challenge. This context motivates the development of stacking frameworks that integrate S-ML and D-ML, leveraging their respective strengths to deliver reliable multiclass DDoS detection suitable for deployment in dynamic and heterogeneous network environments.

3. Materials and Methods

The proposed system for detecting DDoS attacks is based on a multi-stage machine learning architecture that employs a stacked ensemble model. The general workflow, from data acquisition to prediction, is summarized in Figure 1. This diagram illustrates the key stages of the proposed methodology: data collection, preprocessing, model training, ensemble integration, and final classification.
To assess the computational efficiency of the proposed framework, training time, inference latency, throughput, and memory usage were measured during both training and inference phases. Training time was computed as the elapsed wall-clock time between the start and completion of the whole training pipeline, including preprocessing, model fitting, and ensemble construction. Inference latency was measured as the average processing time per network flow, obtained by recording the start and end timestamps for batch inference and normalizing by the number of processed flows. Throughput was calculated as the number of flows processed per second during inference under steady-state conditions. Peak memory usage during training and inference was recorded by monitoring system-level memory consumption and retaining the maximum observed value throughout each phase. All measurements were obtained on the same hardware configuration described in Section 3.3 to ensure consistency.
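The latency and throughput definitions above can be illustrated with a minimal sketch. It assumes a generic batch prediction callable (`dummy_predict` is a hypothetical stand-in, not the trained ensemble) and uses Python's `tracemalloc` for peak memory, which tracks only Python-level allocations rather than the system-level consumption monitored in the study.

```python
import time
import tracemalloc

def measure_inference(predict_fn, flows):
    """Return (seconds per flow, flows per second, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    predict_fn(flows)                          # one batch inference call
    elapsed = max(time.perf_counter() - start, 1e-12)
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    n_flows = len(flows)
    latency = elapsed / n_flows                # batch time normalized per flow
    throughput = n_flows / elapsed             # flows processed per second
    return latency, throughput, peak

# Hypothetical stand-in for the ensemble's batch prediction function.
def dummy_predict(flows):
    return [sum(f) > 1.0 for f in flows]

flows = [[0.1 * i, 0.2] for i in range(10_000)]
latency, throughput, peak = measure_inference(dummy_predict, flows)
print(f"{latency:.2e} s/flow, {throughput:.0f} flows/s, peak {peak} bytes")
```

By construction, latency and throughput are reciprocals of each other when both are derived from the same batch, so reporting both mainly serves readability.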

3.1. Data Preprocessing

The preprocessing stage, shown in Figure 2, began with a comprehensive literature review to identify prior approaches to DDoS detection and the datasets most commonly used for experimentation. As a result, the CIC-DDoS2019 dataset [24] was selected for this study. The dataset contains approximately 30 GB of labeled network traffic data organized by attack type. From this dataset, a balanced subset of approximately 9.25 million samples was constructed using proportional sampling across classes.
Each record originally contained 87 features; nine were removed after feature relevance analysis, leaving 78. This total includes the target variable, Label, which indicates whether the traffic flow is benign or belongs to a specific DDoS attack type (see Appendix A Table A1 for a detailed description of the selected features).
Outliers and missing values were removed, and underrepresented attack types (with fewer than 450 samples) were excluded. During exploratory training, two classes exhibited significant confusion, leading to their merging after verification using Principal Component Analysis (PCA). The final dataset contained approximately 8.85 million labeled samples.
After preprocessing and data cleaning, the Label variable was standardized and encoded into nine final categories representing distinct traffic behaviors. These categories correspond to the following classes:
  • BENIGN: Normal network traffic without attack patterns.
  • DNS/LDAP: Distributed Denial of Service attacks exploiting DNS or LDAP protocols.
  • MSSQL: DDoS activity based on Microsoft SQL services.
  • NTP: Attacks that leverage the Network Time Protocol (NTP) amplification.
  • NetBIOS/Portmap: Traffic generated by NetBIOS or Portmap-based service abuse.
  • SNMP: Malicious traffic exploiting the Simple Network Management Protocol.
  • SSDP/UDP: Attacks using the Simple Service Discovery Protocol or generic UDP floods.
  • Syn/UDPLag: Flooding attacks characterized by SYN or UDP lag behavior.
  • TFTP: Traffic related to Trivial File Transfer Protocol-based attacks.
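The class filtering and merging steps can be sketched as follows. The raw label spellings and the pairing of raw classes into merged categories are assumptions inferred from the category names above; only the 450-sample threshold comes from the text.

```python
from collections import Counter

MIN_SAMPLES = 450  # threshold stated in the text

# Hypothetical raw-label spellings mapped to the merged category names.
MERGE_MAP = {
    "DNS": "DNS/LDAP", "LDAP": "DNS/LDAP",
    "NetBIOS": "NetBIOS/Portmap", "Portmap": "NetBIOS/Portmap",
    "SSDP": "SSDP/UDP", "UDP": "SSDP/UDP",
    "Syn": "Syn/UDPLag", "UDPLag": "Syn/UDPLag",
}

def consolidate(labels, min_samples=MIN_SAMPLES):
    """Drop underrepresented classes, then merge confusable ones."""
    counts = Counter(labels)
    kept = [lab for lab in labels if counts[lab] >= min_samples]
    return [MERGE_MAP.get(lab, lab) for lab in kept]

# Toy label column; "RareAttack" is a made-up class below the threshold.
raw = ["DNS"] * 500 + ["LDAP"] * 600 + ["RareAttack"] * 100 + ["BENIGN"] * 1000
final_counts = Counter(consolidate(raw))
print(final_counts)
```

Filtering is applied before merging here, so a rare raw class is dropped even if it would map to a well-populated merged category; the study does not state which order was used.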
Labels were encoded using both Label Encoding and One-Hot Encoding, depending on model requirements. The dataset was split into training, validation, and testing sets using a 70/25/5 ratio. Data normalization was performed with the StandardScaler, and class balance was achieved using the Synthetic Minority Oversampling Technique (SMOTE). The preprocessing steps ensured consistency across model inputs and mitigated data imbalance.
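A minimal sketch of the 70/25/5 split and standardization is shown below on synthetic data. Stratified sampling, the random seeds, and fitting the scaler on the training portion only are assumptions; the text does not specify them. SMOTE is omitted because it requires the separate imbalanced-learn package.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(2000, 5))  # synthetic flow features
y = rng.integers(0, 9, size=2000)                   # nine encoded classes

n = len(X)
n_val, n_test = n * 25 // 100, n * 5 // 100         # 25% validation, 5% test

# First hold out validation + test, then separate the two.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=n_val + n_test, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=n_test, stratify=y_hold, random_state=0)

# Assumption: scaling statistics come from the training split only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)
```

Passing integer sizes to `train_test_split` avoids floating-point rounding in the split proportions.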

3.2. Model Architecture

Stacking is a hierarchical ensemble strategy in which multiple S-ML and/or D-ML models are organized in successive levels to deliver higher predictive performance than any single model alone. The central premise is that different learning algorithms capture complementary inductive biases and heterogeneous patterns from the same data distribution; therefore, their coordinated aggregation can yield more robust and generalizable decisions [25,26].
As depicted in Figure 3, a standard stacking architecture consists of two conceptual levels. The base models, also referred to as level-0 learners, are trained in parallel using the same training dataset. Once trained, they generate out-of-fold predictions over validation and test splits. These predictions are then assembled into a new feature matrix (meta-features), which constitutes the input for a higher-level learner, referred to as the meta-model or level-1 learner. Surveys of ensemble learning have noted the effectiveness of this architecture.
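To make the meta-feature construction concrete, the following sketch builds out-of-fold class probabilities with scikit-learn's `cross_val_predict`. It uses K-fold cross-validation on synthetic three-class data with two small base models; these choices are illustrative assumptions and do not reproduce the study's configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the traffic data (three classes for brevity).
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

base_models = [
    DecisionTreeClassifier(max_depth=5, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

# Each block holds one model's out-of-fold class probabilities: every row is
# predicted by a model that did not see that row during fitting.
oof_blocks = [
    cross_val_predict(model, X, y, cv=5, method="predict_proba")
    for model in base_models
]

# Concatenate the blocks column-wise to form the level-1 input matrix.
meta_features = np.hstack(oof_blocks)
print(meta_features.shape)
```

Using out-of-fold predictions keeps the meta-model from learning on probabilities that the base models produced for their own training rows, which would otherwise look unrealistically confident.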
This hierarchical design is particularly effective in complex classification problems such as multiclass DDoS detection, where combining structurally diverse learners (e.g., tree-based, kernel-based, and deep neural models) consistently outperforms isolated predictors in terms of accuracy, stability, and resilience to traffic variability.
Although stacking has been explored in other domains, its application to multiclass DDoS detection remains limited, and existing approaches typically employ only two layers or rely on a narrow set of homogeneous classifiers. The novelty of this work lies not in the isolated use of known S-ML and D-ML algorithms, but in the design and integration of a three-layer heterogeneous stacking architecture specifically tailored to the characteristics of large-scale DDoS traffic.
First, the proposed framework combines five structurally diverse base learners (tree-based, instance-based, and deep neural models) at Level 0, enabling the simultaneous capture of complementary inductive biases. Second, Level 1 introduces a meta-feature construction pipeline based on out-of-fold predictions, which allows Logistic Regression, Ridge Classifier, and a neural meta-learner to model complex interdependencies and systematic error patterns across base classifiers, an aspect largely unexplored in previous DDoS studies. Third, Level 2 incorporates a final voting layer to enhance prediction stability and mitigate class-level fluctuations typical in multiclass traffic.
Additionally, the ensemble is tightly coupled with a preprocessing pipeline that involves PCA-based class consolidation, SMOTE balancing, and feature refinement, thereby improving the separability between attack categories and contributing to the overall robustness of the stacking mechanism. This coordinated design enables the architecture to deliver highly stable multiclass performance across nine attack types, even under substantial traffic heterogeneity, making it distinct from conventional ensemble and hybrid models in the literature.
The proposed system was trained using a stacked ensemble model composed of three conceptual layers, as illustrated in Figure 4. This architecture aims to combine the strengths of diverse classifiers to achieve higher accuracy and robustness. To ensure reproducibility and interpretability of the experiments, this section details the hyperparameters of each model used in the architecture. Only the most relevant parameters are listed, as some minor internal constants or random seeds may vary.

3.2.1. Layer 0 (Base Models)

The first layer consisted of five base classifiers: MLP, KNN, DT, RF, and Gradient Boosting. An SVM model was initially evaluated but subsequently excluded due to excessive training time and inferior predictive performance.
The models and their hyperparameter configurations are summarized below:
  • Decision Tree: Implemented using DecisionTreeClassifier from scikit-learn, with the hyperparameters described in Table 1.
  • Gradient Boosting: Implemented using HistGradientBoostingClassifier from scikit-learn, with the hyperparameters described in Table 2.
  • Random Forest: Implemented using RandomForestClassifier from scikit-learn, with the hyperparameters described in Table 3.
  • K-Nearest Neighbors (KNN): Implemented using KNeighborsClassifier from scikit-learn, with the hyperparameters described in Table 4.
  • Neural Network (MLP): Implemented using the Sequential API from TensorFlow. The model architecture is as follows:
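import tensorflow as tf

tfl = tf.keras.layers  # assumed alias for the Keras layers module

# input_dim, dropout_rate, and n_classes are defined outside this listing.
model = tf.keras.Sequential([
    tfl.Dense(128, input_shape=(input_dim,)),
    tfl.LeakyReLU(alpha=0.01),
    tfl.Dense(256),
    tfl.LeakyReLU(alpha=0.01),
    tfl.Dense(512),
    tfl.LeakyReLU(alpha=0.01),
    tfl.Dropout(dropout_rate),
    tfl.Dense(n_classes, activation='softmax')  # one probability per class
])
The model was trained with the hyperparameters described in Table 5.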

3.2.2. Layer 1 (Meta-Models)

The second layer included three meta-classifiers, Logistic Regression, Ridge Classifier, and MLP, that used the class probabilities predicted by the Layer 0 models on the validation set as input. This design allowed the meta-models to capture the interactions and correlations among base classifier predictions.
The models and their hyperparameter configurations are summarized below:
  • Logistic Regression: Implemented using LogisticRegression from scikit-learn, with the hyperparameters described in Table 6.
  • Ridge Classifier: Implemented using RidgeClassifier from scikit-learn, with the hyperparameters described in Table 7.
  • Neural Network (MLP): Implemented using Sequential from TensorFlow, with the following architecture:
model_3 = tf.keras.Sequential([
    tfl.Dense(64, activation='leaky_relu', input_shape=(input_dim,)),
    tfl.Dropout(dropout_rate),
    tfl.Dense(n_classes, activation='softmax')
])
The training configuration for this model, identical to the one used in Layer 0, is described in Table 8.
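The meta-feature construction described in this subsection can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `build_meta_features` and the toy probability matrices are hypothetical, and each base model is assumed to return one probability per class for every validation sample.

```python
import numpy as np

def build_meta_features(prob_list):
    """Concatenate per-model class-probability matrices column-wise.

    prob_list: list of arrays, each of shape (n_samples, n_classes).
    Returns an array of shape (n_samples, n_models * n_classes).
    """
    n_samples = prob_list[0].shape[0]
    for probs in prob_list:
        if probs.shape[0] != n_samples:
            raise ValueError("all base models must score the same samples")
    return np.hstack(prob_list)

# Toy stand-ins for the probability outputs of two base models
# on three validation samples and two classes (made-up values).
p_tree = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
p_knn = np.array([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]])

meta_X = build_meta_features([p_tree, p_knn])
print(meta_X.shape)  # (3, 4)
```

Under the same assumption, five base models and nine classes would give 45 meta-feature columns per sample.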

3.2.3. Layer 2 (Voting Ensemble)

The final layer implemented a voting mechanism that aggregated the predictions from the three Layer 1 models. The final class label was determined by majority voting, ensuring stable and generalized performance across attack categories.
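A minimal sketch of majority voting over the three Layer 1 predictions is given below. The tie-breaking rule (falling back to the first model in the list) is an assumption made for illustration, since the text does not state how ties are resolved, and the function name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most frequent label among the meta-model votes.

    votes: list of labels, one per meta-model, in a fixed model order.
    Ties are resolved in favor of the earliest model (an assumption).
    """
    counts = Counter(votes)
    top = max(counts.values())
    winners = [label for label in votes if counts[label] == top]
    return winners[0]

# Votes from Logistic Regression, Ridge Classifier, and the MLP (toy values).
print(majority_vote(["NTP", "SNMP", "NTP"]))        # NTP
print(majority_vote(["DNS/LDAP", "SNMP", "TFTP"]))  # DNS/LDAP (three-way tie)
```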

3.2.4. Proposed Algorithm

To improve methodological clarity and reproducibility, the complete training and inference workflow of the proposed stacking architecture is summarized in Algorithm 1. The algorithm details the preprocessing pipeline, the construction of Level-0 base learners, the generation of Level-1 meta-features, and the final soft voting mechanism used in Level 2.

3.3. System Specifications

All experiments, including data preprocessing, model training, and evaluation, were conducted on a high-performance workstation to ensure computational consistency and scalability. The hardware and software configuration of the system is summarized below.
  • Processor: Dual Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60 GHz (2 processors)
  • Installed RAM: 48.0 GB
  • Graphics Card: AMD FirePro W2100 (2 GB)
  • System Type: 64-bit Operating System, x64-based processor
The experiments were performed under Windows 10 Pro (Version 22H2, OS Build 19045.6396), with the following Python environment and library versions:
  • Python: 3.10.16
  • imbalanced-learn: 0.13.0
  • NumPy: 2.0.2
  • Pandas: 2.2.3
  • Scikit-learn: 1.6.1
  • TensorFlow: 2.18.0
This configuration enabled efficient model training on large-scale datasets. Parallel computation and optimized memory management, provided by the HP Z840 Workstation, reduced processing time and ensured the reproducibility of all results.
Algorithm 1 Training and Inference Procedure of the Proposed Three-Layer Stacking Ensemble
Require: CIC-DDoS2019 dataset $D$
Ensure: Final predicted label $\hat{y}$ for each sample
    1: Preprocessing Stage
    2: Remove invalid or inconsistent records from $D$.
    3: Apply PCA-based class consolidation to reduce overlapping attack categories.
    4: Apply SMOTE to balance the class distribution in the training subset.
    5: Normalize continuous features using standard scaling.
    6: Split $D$ into training, validation, and test ($Tr$/$V$/$T$) sets.
    7: Level 0: Base Learners
    8: Define the base learner set:
       $L_0 = \{\text{Decision Tree}, \text{Gradient Boosting}, \text{Random Forest}, \text{KNN}, \text{MLP}\}$
    9: for each base model $M_i \in L_0$ do
   10:    Train $M_i$ using the $Tr$ set.
   11:    Collect out-of-fold predictions $p_i$ for all samples from the $V$ set.
   12: end for
   13: Construct the meta-feature matrix:
       $M = [p_1, p_2, p_3, p_4, p_5]$
   14: Level 1: Meta Learners
   15: Define the meta-learner set:
       $L_1 = \{\text{Logistic Regression}, \text{Ridge Classifier}, \text{MLP}\}$
   16: for each meta-learner $L_j \in L_1$ do
   17:    Train $L_j$ using the meta-feature matrix $M$.
   18:    Obtain Level-1 predictions $q_j$.
   19: end for
   20: Collect Level-1 prediction set:
       $q = \{q_1, q_2, q_3\}$
   21: Level 2: Final Voting Layer
   22: Aggregate Level-1 predictions $q$ using soft voting.
   23: Compute the final class-probability score vector.
   24: Assign final prediction:
       $\hat{y} = \arg\max(\text{score})$
   25: return $\hat{y}$
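Steps 22 to 24 of Algorithm 1 can be illustrated with the sketch below, which averages the class-score vectors of the three meta-learners and takes the arg max. It is not the authors' implementation. Because scikit-learn's RidgeClassifier exposes decision scores rather than class probabilities, the sketch assumes those scores are passed through a softmax before averaging. The helper names and all numeric values are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax that turns raw decision scores into pseudo-probabilities."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def soft_vote(score_matrices):
    """Average per-model score matrices and return (labels, mean scores)."""
    mean_scores = np.mean(np.stack(score_matrices, axis=0), axis=0)
    return mean_scores.argmax(axis=1), mean_scores

# Toy Level-1 outputs for two samples and three classes (made-up values).
q_lr = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
ridge_scores = np.array([[1.0, 0.0, -1.0], [-0.5, 0.2, 0.1]])
q_ridge = softmax(ridge_scores)  # assumption: softmax over Ridge decision scores
q_mlp = np.array([[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]])

y_hat, score = soft_vote([q_lr, q_ridge, q_mlp])
print(y_hat)  # [0 2]
```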

4. Results

The final evaluation of the proposed system was conducted using the test subset to measure overall performance. The evaluation pipeline and the main metrics used (Accuracy, Precision, Recall, F1-Score, and Confusion Matrix) are summarized in Figure 5. This figure illustrates the end-to-end validation process, highlighting how the ensemble integrates results across all model layers.
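For reference, the listed metrics can all be derived from a confusion matrix. The sketch below assumes the common convention of true classes on the rows and predicted classes on the columns. The function name `per_class_metrics` and the 2×2 matrix are illustrative and are not results from this study.

```python
import numpy as np

def per_class_metrics(cm):
    """Compute accuracy, per-class precision/recall/F1, and macro F1.

    cm[i, j] counts samples of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)  # column sums: samples predicted as each class
    actual = cm.sum(axis=1)     # row sums: samples that truly belong to each class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1, f1.mean()

# Made-up two-class confusion matrix.
accuracy, precision, recall, f1, macro_f1 = per_class_metrics([[8, 2], [1, 9]])
print(round(accuracy, 2))  # 0.85
```

Macro F1 weights every class equally, so weak minority classes lower it more than they lower accuracy.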

4.1. General Model Evaluation

Table 9 presents the numerical results obtained from each model across the three layers. The ensemble model (Layer 2) achieved the best overall balance between accuracy, precision, and recall, maintaining an average accuracy of 91%.

4.2. Detailed Model Evaluation of Layer 1 and Layer 2

This subsection presents a detailed evaluation of the proposed architecture over four models: the three meta-classifiers in Layer 1 (Logistic Regression, Ridge Classifier, and Neural Network) and the ensemble model in Layer 2. Each model was trained and tested on the same dataset partition to ensure consistency of results and a fair comparison. The metrics used for performance evaluation include accuracy, precision, recall, and F1-score. The detailed results for each model are presented in Table 10, Table 11, Table 12 and Table 13.

4.2.1. Layer 1: Logistic Regression

The Logistic Regression model achieved an overall accuracy of 91.24%. The best performance was achieved for the BENIGN and NTP traffic classes, both of which achieved perfect scores (F1 = 1.00). Lower results were observed in the SNMP (F1 = 0.77) and DNS/LDAP (F1 = 0.83) categories.

4.2.2. Layer 1: Ridge Classifier

The Ridge Classifier obtained an overall accuracy of 91.32%. Its performance remained consistent across most classes, showing slightly improved results compared to Logistic Regression, especially for Syn/UDPLag and TFTP traffic.

4.2.3. Layer 1: Neural Network

The Neural Network model achieved an overall accuracy of 91.27%, showing robust performance across all attack types. High detection accuracy was maintained for BENIGN, NTP, and Syn/UDPLag classes, while slightly lower results persisted for SNMP and DNS/LDAP.

4.2.4. Layer 2: Ensemble Model

The voting ensemble in Layer 2 combined the predictions from the three Layer 1 meta-classifiers, producing an overall accuracy of 91.26%. This model improved class balance while maintaining high precision and recall across all categories, confirming the stability of the ensemble approach.

4.3. Theoretical Interpretation of the Results

While the numerical results reported in the previous subsection demonstrate the effectiveness of the proposed hierarchical stacking ensemble, a deeper theoretical interpretation is required to understand why the model behaves as observed. This section explains the performance trends in terms of classifier design, data characteristics, class imbalance, and ensemble theory.

4.3.1. Behavior of Level-0 Base Learners

The base layer integrates models with distinct inductive biases: tree-based classifiers, distance-based learning, and neural networks. The properties of these model families can theoretically explain the observed performance patterns:
  • Tree-based methods (RF, DT, Gradient Boosting) achieve strong performance because DDoS traffic exhibits non-linear relationships and hierarchical decision boundaries. RF in particular benefits from variance reduction through bagging and feature subsampling, which stabilizes decisions under high-dimensional flows.
  • The neural network at Level-0 contributes complementary representations by learning distributed features that capture subtle deviations in traffic dynamics that tree models may oversimplify.
  • KNN, although competitive, is sensitive to feature scaling and local density variations. Its relatively smaller contribution (as shown in the ablation results in Table 14) aligns with theoretical expectations for high-dimensional, heterogeneous network traffic.
This theoretical diversity explains why removing any Level-0 learner decreases performance: the ensemble relies on combining heterogeneous decision boundaries to capture the multi-modal nature of benign and malicious flows.

4.3.2. Theoretical Role of Level-1 Meta-Learners

The Level-1 layer integrates Logistic Regression, Ridge Classifier, and a shallow Neural Network. The theoretical benefit of this design comes from:
  • Error-correcting capability: Meta-learners operate on prediction vectors, learning second-order relationships between base models, which classical ensemble theory identifies as key for reducing correlated errors.
  • Linearity vs. non-linearity: Logistic Regression captures linearly separable patterns across the prediction space, Ridge adds regularization to stabilize decisions under multicollinearity, and the neural network models non-linear interactions.
  • Bias–variance trade-off: The combination ensures that the meta-layer can correct systematic tendencies of overconfident classifiers or ambiguous predictions in minority classes.
The relatively small but consistent F1-score decrease shown in the ablation study when removing any of these models empirically supports this theoretical framework.

4.3.3. Influence of Preprocessing: SMOTE and PCA Label Treatment

The significant macro F1 loss when removing SMOTE (−9.3 percentage points, from 0.9201 to 0.8267) or PCA-based label treatment (−13.1 percentage points, to 0.7893) is theoretically aligned with imbalance-learning and class-separability principles:
  • SMOTE increases minority-class density in feature space, improving the decision boundaries around rare attack types, such as DNS, LDAP, and SNMP. Without it, classifiers become biased toward the majority class, reducing recall in minority categories.
  • PCA label treatment reduces intra-class variance and removes redundant label correlations. From a geometric perspective, PCA enhances cluster separability, improving both precision and recall.
These theoretical mechanisms explain why these preprocessing components contribute more to performance than individual base or meta-learners.
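The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration of the technique, not the imbalanced-learn implementation used in the study. The function name `smote_like_oversample`, the neighbor count, and the sample points are assumptions.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between
    a randomly chosen minority sample and one of its k nearest minority neighbors.
    Requires at least two minority samples."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)
        neighbor = neighbors[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[neighbor] - X_min[base])
    return synthetic

# Four made-up minority samples in a two-feature space.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(X_min, n_new=6, k=2)
print(synthetic.shape)  # (6, 2)
```

Every synthetic point lies on a segment between two real minority samples. This raises minority-class density, but it can also place points in regions that overlap with other classes.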

4.3.4. Analysis of Misclassified Attack Types

The slightly lower F1-scores observed in minority attack types (≈0.77–0.83) are theoretically expected due to:
  • Feature overlap among similar attacks (e.g., DNS vs. LDAP vs. SNMP), which reduces separability in high-dimensional spaces.
  • Sparse representation of rare attack patterns, which limits the model’s ability to learn discriminative boundaries even after balancing.
  • Temporal and protocol-level similarity among certain DDoS categories makes them harder to disambiguate without temporal sequence modeling.
Nonetheless, the ensemble mitigates these limitations by integrating complementary learners, which stabilizes predictions across classes.

4.3.5. Summary

Overall, the theoretical interpretation supports the empirical findings:
  • Level-0 learners capture different structural properties of the traffic.
  • Level-1 meta-learning combines these representations and corrects correlated errors.
  • SMOTE and PCA label treatment enhance feature space geometry and balance.
  • Misclassification patterns reflect the inherent complexity of the dataset, not model limitations.
This theoretical grounding reinforces the validity of the proposed architecture and explains its superior multiclass performance on CIC-DDoS2019.

4.4. Ablation Study

To quantify the contribution of each component of the proposed three-layer stacking architecture, an ablation study was conducted in which specific elements of the system were individually removed. The objective of this analysis is to determine how the performance degrades when essential components are omitted, thereby validating that the proposed architecture is not simply an aggregation of known methods but a cohesive, interdependent design. All ablations were evaluated under the same experimental conditions as the full model, using macro F1-score as the primary evaluation metric.
Table 14 summarizes the obtained performance for each ablated configuration. The Base Model corresponds to the complete architecture proposed in this study, encompassing all Level-0 learners, all Level-1 meta-learners, and the full preprocessing pipeline, including SMOTE and PCA-based label consolidation.

4.4.1. Contribution of Level-0 Base Learners

Removing any Level-0 classifier leads to a measurable drop in macro F1-score, confirming that the diversity of inductive biases at the base layer is a key element of the ensemble.
The most pronounced degradations occur when removing tree-based models such as RF (0.8989), DT (0.8993), or Gradient Boosting (0.8999), underscoring their ability to model non-linear traffic patterns. Excluding KNN (0.9018) produces a smaller drop, but still confirms its role in capturing local decision boundaries. Removing the Level-0 Neural Network yields the smallest drop among the base learners (0.9030), yet the reduction relative to the full model (0.9201) still highlights the value of deep feature representations.
These results demonstrate that the heterogeneous set of Level-0 learners is fundamental to the architecture’s success.

4.4.2. Contribution of Level-1 Meta-Learners

Level-1 ablations also yield consistent performance reductions:
  • Without Logistic Regression: 0.9134.
  • Without Ridge Classifier: 0.9122.
  • Without Level-1 Neural Network: 0.9117.
Although these drops are smaller than those observed in Level-0 ablations, they confirm that Level-1 provides essential error-correcting capability. The three meta-learners jointly learn complementary meta-relationships across Level-0 outputs, improving generalization and stabilizing multiclass predictions.

4.4.3. Importance of the Preprocessing Pipeline

The largest degradations in performance occur when removing SMOTE (0.8267) or PCA label treatment (0.7893).
  • Without SMOTE, minority attack types are significantly underrepresented, resulting in reduced F1 performance across rare categories.
  • Without PCA-based label consolidation, overlapping attack categories remain uncorrected, increasing confusion between similar attack types and reducing general separability in feature space.
These results confirm that both preprocessing components are essential for robust multiclass DDoS detection.

4.4.4. Summary of Findings

Overall, the ablation study empirically validates the design decisions of the proposed architecture. Each component, Level-0 model diversity, Level-1 meta-learning, SMOTE-based balancing, and PCA label treatment, plays a significant and complementary role. The complete architecture delivers the highest macro F1-score (0.9201), while removing any component consistently reduces performance. These findings support the effectiveness of the proposed three-layer stacking ensemble for multiclass DDoS detection.
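The size of each degradation can be checked directly from the macro F1-scores quoted in this section. The short sketch below computes the absolute drop of every ablated configuration relative to the full model and ranks them. The dictionary keys are descriptive labels chosen here for illustration.

```python
full_f1 = 0.9201  # macro F1 of the complete architecture

# Macro F1 of each ablated configuration, as reported in Sections 4.4.1-4.4.3.
ablated_f1 = {
    "without RF": 0.8989,
    "without DT": 0.8993,
    "without Gradient Boosting": 0.8999,
    "without KNN": 0.9018,
    "without Level-0 Neural Network": 0.9030,
    "without Logistic Regression": 0.9134,
    "without Ridge Classifier": 0.9122,
    "without Level-1 Neural Network": 0.9117,
    "without SMOTE": 0.8267,
    "without PCA label treatment": 0.7893,
}

# Absolute drop in macro F1 (full model minus ablated model).
drops = {name: round(full_f1 - f1, 4) for name, f1 in ablated_f1.items()}

# Largest drop first.
ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)
for name, drop in ranked:
    print(f"{name}: -{drop:.4f}")
```

The ranking places the two preprocessing ablations first, followed by the Level-0 learners and then the Level-1 meta-learners.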

4.5. Expanded Comparative Analysis with Previous Works

To strengthen the evaluation of the proposed model, we conducted an expanded comparative analysis incorporating a broad selection of recent studies on DDoS detection. Table 15 summarizes representative works published between 2022 and 2025, covering machine learning, deep learning, hybrid architectures, ensemble approaches, and feature-engineering techniques across multiple benchmark datasets.
The comparison includes the methodological approach, the classification scheme (binary or multiclass), the achieved F1-score, and the limitations reported by each study. This structured comparison provides a broader context for the performance of our proposed architecture.
Overall, the most recent efforts achieve high performance in binary classification, often exceeding F1-scores of 0.98. However, robustness decreases for multiclass DDoS detection, where inter-class similarity and data imbalance remain open challenges. Hybrid models (e.g., CNN-GRU, Deep Stacking Ensembles, and PCA-based ensembles) frequently yield excellent results but exhibit high computational complexity, limited interpretability, or dependence on large datasets.
In contrast, our proposed three-layer hierarchical Stacking Ensemble addresses these limitations through (i) heterogeneous Level-0 learners, (ii) meta-learning integration, and (iii) a preprocessing pipeline consisting of SMOTE and PCA. In the multiclass CIC-DDoS2019 dataset, our model achieves an overall F1-score of 0.92, outperforming various classical and deep approaches, particularly in scenarios involving complex attack types and class imbalance. Although some minority attack categories still pose challenges (F1 ≈ 0.77–0.83), the architecture demonstrates superior multiclass stability compared to prior works, where many studies only report binary results.
This expanded comparison reinforces the novelty and effectiveness of our approach, situating our contribution within the current state of the art in DDoS detection research.

4.6. Clarifying the Performance Gap Between Binary and Multiclass DDoS Detection

Several studies cited in the literature report very high performance metrics, often exceeding 99% accuracy, precision, recall, or F1-score. Notable examples include the works presented in [14,15]. While these results are technically correct within their respective experimental settings, a closer examination reveals substantial differences in problem formulation compared to the present study.
In [14], the authors propose a deep learning–based approach for DDoS detection and evaluate their model using a binary classification scheme, where all attack traffic is grouped into a single class and distinguished only from benign traffic. Similarly, Ref. [15] formulates DDoS detection as a binary decision problem, reporting near-perfect performance metrics under this simplified setting.
Binary DDoS detection is widely acknowledged to be a less challenging task, particularly for modern machine learning and deep learning models, because it requires learning only a coarse separation boundary between malicious and benign traffic. In contrast, the present work focuses on multiclass DDoS classification, distinguishing nine different attack types in the CIC-DDoS2019 dataset. This setting introduces additional complexity due to:
  • Strong feature overlap among different DDoS attack vectors (e.g., DNS, LDAP, and SNMP-based floods).
  • Severe class imbalance between majority and minority attack types.
  • Increased decision complexity, as the model must learn fine-grained boundaries rather than a single global separation.
Consequently, performance metrics obtained in binary settings are not directly comparable to those achieved in multiclass scenarios. This observation is consistent with prior multiclass studies, which also report lower, but more realistic, performance when full attack taxonomies are considered.
The contribution of this work lies precisely in addressing this more demanding and operationally relevant multiclass problem using a hierarchical stacking ensemble that combines classical machine learning and deep learning models. Although absolute metrics are lower than those reported in binary studies, they reflect a more realistic evaluation of DDoS detection systems intended for real-world deployment, where identifying the specific attack type is crucial for mitigation and response.
To make these differences explicit, Table 16 summarizes key distinctions among related works.
By clearly distinguishing binary from multiclass settings, this study highlights its main contribution: a robust hierarchical stacking ensemble capable of full multiclass detection, which better reflects real-world environments where multiple DDoS attack types must be distinguished, not merely detected.

4.7. Why the Final Ensemble Does Not Significantly Outperform Layer-1

A close inspection of Table 10, Table 11, Table 12 and Table 13 reveals that the performance of the final Layer-2 voting ensemble is very similar to the results obtained by the best individual meta-learner in Layer-1 (particularly the Ridge Classifier). While this may appear to contradict the expected benefit of a stacked ensemble, it is consistent with several well-documented behaviors of hierarchical ensembles in highly separable datasets such as CIC-DDoS2019.
  • Strong Meta-Learner Dominance
    The Ridge Classifier in Layer-1 shows exceptionally strong performance across most attack categories. When a single model already learns nearly all discriminative boundaries, stacking tends to provide diminishing returns because there is very limited residual error for the upper layer to correct.
  • Correlated Error Patterns Across Models
    Although the Level-0 models differ in architecture (trees, distance-based, neural), their errors become partially correlated after the PCA-based label consolidation. This reduces the capacity of a majority-vote mechanism to generate new decision boundaries that substantially outperform the best participant.
  • Function of Layer-2: Stability Rather Than Accuracy Boost
    While Layer-2 does not significantly increase macro performance metrics, it serves an important role:
    • Reduces variance across training runs.
    • Stabilizes predictions in borderline cases.
    • Provides robustness when traffic patterns shift slightly.
    This is a known effect in ensemble theory: voting is often most useful for stability, not always peak accuracy.
  • Highly Separable Classes After Preprocessing
    The use of SMOTE, PCA label consolidation, and correlation-based feature reduction strongly increases class separability. Under such conditions, a single meta-learner can exploit the simplified structure almost as effectively as a full ensemble.
Therefore, the similarity between Layer-1 and Layer-2 metrics does not invalidate the architecture. Instead, it indicates that:
  • The Ridge Classifier already captures most of the decision boundary attainable on this dataset.
  • The voting layer primarily improves robustness rather than raw accuracy.

5. Discussion

This section presents the experimental results obtained using the proposed stacking framework and situates them within the context of existing DDoS detection literature. The analysis focuses on interpreting performance trends, class-level behavior, and operational implications under the evaluated experimental conditions, rather than solely emphasizing aggregate metric values.
The multilayer stacking architecture demonstrated solid performance and high potential as an integrated solution for detecting multiclass DDoS attacks. With an overall accuracy of 91.26%, the model validated the hypothesis that the hierarchical combination of heterogeneous algorithms, including tree-based models, instance-based methods, and neural networks, enables the capture of diverse behavioral patterns in network traffic, achieving more stable, robust, and generalizable detection. This performance reflects the synergy between S-ML and D-ML models, which contribute both interpretability and representational depth within a hierarchical framework that integrates their complementary strengths.
The results demonstrate that the proposed stacked ensemble model effectively integrates multiple learning algorithms to improve robustness and classification accuracy in DDoS detection. The ensemble achieved approximately 91% accuracy with balanced precision and recall, confirming its reliability across multiple attack categories. As shown in Figure 5, the evaluation workflow verifies the effectiveness of the hierarchical training procedure and the integration of multiple classifiers within the stacking scheme.
The use of successive layers enhanced the system’s robustness against traffic variability and anomalous patterns. The base models (level zero) generated diverse predictions, while the meta-models (level one) learned to combine these outputs to reduce systematic errors. The voting layer (level two) consolidated the most consistent predictions, achieving an optimal balance between precision and generalization. This hierarchical design proved effective in mitigating overfitting and improving model stability when exposed to unseen data, demonstrating greater adaptability compared to standalone learners.
A detailed analysis of the experimental results also revealed a noticeable reduction in class overlap after applying PCA-based class merging and SMOTE balancing, confirming the positive contribution of the preprocessing strategy illustrated in Figure 2. This preprocessing step improved class separability and ensured a more uniform training distribution, contributing directly to the model’s consistent performance across validation and testing. The multilayer ensemble structure depicted in Figure 4 further enhanced generalization capabilities and ensured stable performance under diverse network conditions.
Finally, the results across all evaluated models exhibited remarkable consistency, with accuracies around 91% and F1-scores exceeding 90%. The ensemble layer effectively leveraged the predictive strengths of the base classifiers, achieving balanced precision and recall across various attack types. These outcomes confirm the robustness and scalability of the proposed multilayer detection framework, as well as its suitability for deployment in distributed and real-time DDoS detection environments.

5.1. Contextualizing the Performance Gap with Prior High-Accuracy Works

It is important to note that several prior works reporting 98–99% accuracy or F1-score on CIC-DDoS2019 operate under experimental settings that differ substantially from those in this study. Our evaluation considers a nine-class multiclass problem (after removing classes with fewer than 450 samples and merging two highly confused categories), a large-scale dataset (∼8.85 M flows after cleaning), and a broad feature set of 78 attributes. Many high-performing studies simplify the task by using binary classification, selecting only a subset of attack types, or applying aggressive feature reduction, which can produce overly optimistic results. Under our more challenging setting, the proposed stacking ensemble achieves 91.26% accuracy and demonstrates improved robustness across classes. To further analyze the ensemble’s contribution, we include an ablation study based on F1-score (Table 14), which compares all base, meta, and ensemble components and confirms that the final layer provides consistent improvements in per-class F1 performance.

5.2. Evaluation Strategy and Generalization Limitations

The evaluation in this study was conducted using a single 70-25-5 train-validation-test split. This decision was motivated by the computational cost associated with training the full three-layer stacking architecture on a dataset of 8.85 million flows. While this approach is common in large-scale intrusion detection research, we acknowledge that it does not capture statistical variability to the same extent as k-fold cross-validation. Furthermore, the present work evaluates the model on a single dataset (CIC-DDoS2019), which limits conclusions about generalization across heterogeneous traffic environments. These aspects represent methodological limitations that will be addressed in future research through k-fold validation, statistical significance testing, and cross-dataset experiments. Despite these constraints, the ablation study included in this work provides evidence of the stability and internal consistency of the proposed ensemble architecture.

5.3. Performance Limitations and Potential Improvements

Although the proposed three-layer stacking ensemble demonstrates strong overall performance for multiclass DDoS detection, the results reveal specific cases where the architecture does not outperform all baseline or reference approaches. In particular, minority attack categories, such as SNMP-UDP Flood, LDAP Flood, or specific DNS-based attacks, exhibit comparatively lower F1-scores. This behavior is theoretically expected in highly imbalanced, noisy, and feature-overlapping datasets such as CIC-DDoS2019, where the statistical distribution of minority classes provides limited discriminative information for the learning process.
From a theoretical standpoint, these observations can be attributed to three core factors:
  • Data imbalance, which biases the decision boundaries toward dominant classes.
  • Feature redundancy and multicollinearity, which reduce separability in the feature space.
  • Limited interaction modeling between Level-0 learners, which restricts the meta-learner’s capacity to correct correlated errors.
To address these challenges, several improvement directions can be considered:
  • Advanced Class Rebalancing Techniques
    While SMOTE alleviates global imbalance, it may oversample regions of high overlap, degrading minority precision. More sophisticated strategies, such as Borderline-SMOTE, ADASYN, SMOTE-IPF, or generative oversampling via GANs, can yield more representative minority samples, preserving local decision boundaries.
  • Refined Feature Selection and Representation
    Although PCA-based label consolidation and Pearson correlation filtering reduce redundancy, complementary techniques such as mutual information, minimum redundancy–maximum relevance (mRMR), or SHAP-based importance could uncover more discriminative features and improve minority-class separability.
  • Enhanced Meta-Learning Strategies
    The current Level-1 stack combines Logistic Regression, Ridge Classifier, and a shallow neural network. Replacing or augmenting these with more expressive meta-learners, e.g., gradient boosting, attention-enhanced layers, or gating mechanisms, may capture higher-order interactions between Level-0 outputs.
  • Adaptive or Weighted Ensemble Voting
    The final decision is based on an unweighted majority vote. A class-sensitive or performance-weighted voting scheme (using per-class validation F1-scores) could emphasize the strengths of individual models in specific attack categories.
  • Temporal and Behavioral Modeling
    CIC-DDoS2019 contains latent temporal patterns relevant to attack escalation. Integrating temporal models (LSTM, GRU, or 1D-CNN) in Level-0 may enhance the recognition of low-rate, evolving, or multi-stage attack variants.
Incorporating these improvements could further enhance the ensemble’s robustness, particularly in scenarios involving rare attack types and complex multiclass decision boundaries. These refinements constitute a natural progression for future research.
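As an illustration of the class-sensitive voting direction listed above, the sketch below weights each meta-model's vote by a per-class validation F1-score. The function name `weighted_vote` and all weights are hypothetical values chosen for the example. They are not measurements from this study.

```python
def weighted_vote(votes, class_weights):
    """Pick the label with the largest accumulated weight.

    votes: dict mapping model name -> predicted label.
    class_weights: dict mapping model name -> {label: weight}, for example
    per-class validation F1-scores. Missing entries count as zero.
    """
    scores = {}
    for model, label in votes.items():
        weight = class_weights[model].get(label, 0.0)
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)

# Hypothetical per-class weights for the three Layer 1 meta-models.
class_weights = {
    "logreg": {"SNMP": 0.9, "DNS/LDAP": 0.5},
    "ridge": {"SNMP": 0.6, "DNS/LDAP": 0.4},
    "mlp": {"SNMP": 0.6, "DNS/LDAP": 0.4},
}
votes = {"logreg": "SNMP", "ridge": "DNS/LDAP", "mlp": "DNS/LDAP"}

print(weighted_vote(votes, class_weights))  # SNMP
```

In this toy case an unweighted majority vote would return DNS/LDAP, whereas the weighted scheme follows the single model that is assumed to be most reliable for SNMP.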

5.4. Implications for Real-World Deployment and Future Research Directions

Beyond quantitative performance, the findings have several implications for practical deployment. First, the proposed architecture reduces reliance on deep models with heavy computational overhead, making it suitable for medium-scale environments such as enterprise networks and regional ISPs. Second, the combination of classical machine learning and lightweight neural components ensures predictable inference times, which is critical for real-time DDoS mitigation.
However, deployment in real operational settings presents additional challenges. Real network traffic often contains concept drift, encrypted payloads, adversarial variations, and unseen attack types, none of which are fully covered by CIC-DDoS2019. Future research should incorporate online learning, drift detection, adaptive thresholding, and continuous retraining mechanisms to enhance the performance of these models. Additionally, evaluating the model under high-throughput conditions and on modern datasets, including cloud-native and IoT traffic, will be essential to validate the generalization of the proposed architecture.
These considerations suggest that while the current model provides a strong foundation, further work is needed to ensure robust and scalable performance in production-grade intrusion detection systems.

5.5. Computational Performance Analysis and Operational Feasibility

To assess the practical viability of the proposed three-layer stacking architecture, we evaluated its computational performance in terms of training time, inference latency, throughput, and peak memory usage on the workstation described in Section 3.3. Table 17 summarizes the measurements obtained for each component of the pipeline.
Training proved computationally intensive due to the large dataset (∼8.85 million flows) and the number of models involved in Level-0 and Level-1. However, inference is significantly lighter: individual models exhibit per-flow latency in the millisecond range, and the final voting ensemble achieves a throughput compatible with near-real-time processing.
The full pipeline, when all models are loaded simultaneously, has higher memory requirements and reduced throughput, primarily due to the KNN component. Nevertheless, Level-0 pruning or selective loading strategies can substantially improve runtime performance without compromising detection quality.
Overall, these results indicate that, although training requires substantial computational resources, the inference stage of the proposed framework is suitable for deployment in modern network monitoring systems. Future work will investigate model compression, GPU-accelerated inference, and streaming-oriented deployment architectures to further improve scalability and operational efficiency.
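For readers who wish to reproduce this kind of measurement, the sketch below shows one simple way to estimate per-flow latency and throughput for a batch prediction function. It is an assumption-laden illustration rather than the benchmarking procedure used in this study: `measure_inference` and `dummy_predict` are hypothetical names, and the dummy predictor merely stands in for a trained model.

```python
import time

def measure_inference(predict, flows, repeats=3):
    """Return (per-flow latency in ms, throughput in flows per second).

    The fastest of `repeats` runs is used to reduce timing noise.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        predict(flows)
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-9)  # guard against a zero reading on coarse timers
    latency_ms = 1000.0 * best / len(flows)
    throughput = len(flows) / best
    return latency_ms, throughput

def dummy_predict(flows):
    # Hypothetical stand-in for a trained model's batch prediction.
    return [0 if sum(features) < 1.0 else 1 for features in flows]

# Synthetic two-feature flows, used only to exercise the timer.
flows = [[0.1 * (i % 7), 0.2] for i in range(10_000)]
latency_ms, throughput = measure_inference(dummy_predict, flows)
print(f"latency: {latency_ms:.6f} ms/flow, throughput: {throughput:.0f} flows/s")
```

Peak memory could be tracked in a similar spirit with the standard-library `tracemalloc` module, although native allocations made by compiled libraries are not necessarily visible to it.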

5.6. Ensemble Robustness Beyond Global Metrics

Although the overall accuracy and the aggregated precision, recall, and F1-scores of the base models, the meta learners, and the final Layer-2 voting ensemble appear similar (≈0.90–0.92), this behavior is expected in highly separable multiclass DDoS datasets. The ablation study presented in Table 14 shows that the ensemble contributes measurable improvements in robustness, particularly in stabilizing predictions for minority and overlapping attack types. Therefore, the benefit of stacking in this context is not a significant increase in global accuracy, but rather more balanced and consistent per-class performance across heterogeneous attack categories.
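The distinction between global and per-class behavior can be made concrete with a small worked example. The sketch below computes accuracy and per-class F1 from a confusion matrix; the two 3-class matrices are hypothetical and were constructed so that accuracy is identical while the error distribution differs.

```python
def per_class_f1(cm):
    """cm[i][j] = number of flows of true class i predicted as class j."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # row total minus diagonal
        fp = sum(cm[i][k] for i in range(n)) - tp   # column total minus diagonal
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

def accuracy(cm):
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total

# Hypothetical matrices: both classify 270 of 300 flows correctly.
cm_a = [[90, 5, 5], [5, 90, 5], [5, 5, 90]]       # errors spread evenly
cm_b = [[100, 0, 0], [0, 100, 0], [15, 15, 70]]   # errors concentrated in class 2

for name, cm in (("A", cm_a), ("B", cm_b)):
    f1 = per_class_f1(cm)
    print(name, round(accuracy(cm), 3), [round(s, 3) for s in f1])
```

Both matrices yield an accuracy of 0.9, but the worst per-class F1 is 0.9 for A and roughly 0.82 for B, which is the kind of difference that aggregate metrics hide.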

5.7. Why the Full Ensemble Does Not Always Outperform Simpler Models

Although ensemble architectures are generally expected to outperform individual learners, our results reveal that some Layer-0 and Layer-1 models, particularly RF and the Ridge Classifier, achieve metrics that are numerically very close to those of the final Layer-2 voting ensemble. Similar observations have been reported in multiple CIC-DDoS2019 studies, where tree-based models often achieve strong performance even without additional meta-learning layers. This behavior does not imply redundancy in the proposed architecture; rather, it reflects intrinsic characteristics of the dataset and of the learning task.
First, CIC-DDoS2019 includes several attack families whose statistical signatures are sufficiently distinct for strong shallow models, particularly Random Forests, to capture most discriminative patterns. When a base learner already models the dominant variance structure of the data, additional ensemble layers may yield only marginal improvements in aggregate metrics such as accuracy or F1-score. This explains why RF appears to perform similarly to the final ensemble when evaluated solely on global averages.
Second, the objective of the stacked architecture is not only to improve average performance but also to reduce class-specific variance and prevent error concentration. While Layer-2 does not always increase the global F1-score, it produces more stable predictions for minority classes (e.g., SNMP, LDAP, DNS-based attacks) by smoothing the misclassification spikes observed in RF and the Ridge Classifier individually. Summary statistics do not fully capture these stabilizing effects, but they are evident in the per-class confusion matrices. For real-world DDoS detection, the stability of prediction across classes is often more critical than maximizing a single global score.
Third, the full ensemble does increase computational complexity. Nevertheless, its design is motivated by two practical considerations:
  • Robustness to drift and cross-scenario variability: Heterogeneous base learners exhibit complementary error profiles that generalize better when traffic patterns deviate from the training distribution.
  • Fault-tolerance: In high-stakes security systems, reliance on a single model increases vulnerability to systematic biases or adversarial weaknesses. A hierarchical ensemble reduces the probability that a single failure propagates to the final decision.
Finally, although RF and Ridge Classifier perform strongly on CIC-DDoS2019, they remain more sensitive to class imbalance, feature redundancies, and similarity noise than the layered ensemble. The proposed architecture, therefore, represents a trade-off: while computationally heavier, it offers greater resilience, smoother multiclass behavior, and better handling of minority attacks, benefits that are not readily apparent when examining solely aggregate performance metrics.
These considerations justify the use of the full stacked architecture despite the numerical proximity of some results, and they clarify why its advantages lie not only in absolute performance but also in stability, robustness, and operational reliability.

5.8. Security Implications and Threat Model Considerations

From a security perspective, the proposed model operates under a non-adaptive threat model, where attackers do not deliberately manipulate their traffic to evade detection. This assumption matches the volumetric and reflection-based characteristics of the attacks in CIC-DDoS2019 but does not extend to adversarial evasion strategies. The class-wise F1-scores reported in Tables 10–13 show that certain amplification vectors, particularly SNMP and DNS/LDAP, exhibit lower detection performance compared to the remaining classes. These attacks are capable of producing large amplification factors through misconfigured or open servers, and false negatives in these categories may reduce detection coverage during the early stages of a multi-vector DDoS event. A delayed or incomplete identification of these vectors can impact the accuracy of mitigation strategies (e.g., filtering rules, upstream scrubbing, or rate-limiting policies), thereby increasing response latency and potentially allowing attackers to sustain higher traffic volumes for longer periods.
In contrast, misclassifications that occur within malicious categories (e.g., mistaking DNS/LDAP for NetBIOS) pose relatively lower operational risk because the system still correctly categorizes the flow as hostile, preserving the ability to trigger defensive actions. False positives, while less frequent, may temporarily affect benign traffic but generally result in proportionally lower operational costs compared to missed detections of large-scale reflection attacks.

5.9. Limitations and Error Analysis

Although the proposed stacking ensemble achieves balanced performance across most attack categories, the confusion matrix reveals systematic misclassification patterns that provide valuable insights into the limitations of the model. These failure modes arise from a combination of dataset-level constraints, overlapping statistical behaviors among amplification attacks, and inherent limitations of flow-level features. A deeper analysis of these patterns helps contextualize the ensemble’s behavior and clarifies where improvements are most needed for real-world deployment.
In particular, the confusion matrix provides a visual interpretation of the ensemble’s behavior beyond aggregate metrics, enabling the identification of systematic error patterns and their association with specific attack categories. This qualitative analysis complements the quantitative evaluation presented earlier and supports a more security-oriented interpretation of the results.

5.9.1. Misclassification Patterns Observed in the Confusion Matrix

The test-set confusion matrix (Figure 6) highlights several recurrent misclassification clusters, particularly among reflection/amplification attacks. The most prominent examples include:
  • DNS/LDAP → SNMP: 6768 instances.
  • TFTP → Syn/UDPLag: 6084 instances.
  • SNMP → DNS/LDAP: 4973 instances.
  • NetBIOS/Portmap → SNMP: 4432 instances.
  • SNMP → NetBIOS/Portmap: 4123 instances.
  • NetBIOS/Portmap → MSSQL: 2260 instances.
  • Syn/UDPLag → SSDP/UDP: 2213 instances.
  • SSDP/UDP → SNMP: 1453 instances.
These patterns reflect two primary characteristics:
  • High intra-class variability
    DNS/LDAP and SNMP exhibit considerable diversity in amplification factors, payload sizes, and upstream server behaviors. This broad variability widens the decision boundary, increasing overlap with other reflection vectors such as NetBIOS/Portmap and SSDP/UDP.
  • Statistical similarity among amplification-based attacks
    Many volumetric DDoS attacks share similar flow-level signatures, bursty packet rates, minimal inter-arrival variation, and comparable statistical profiles, making these classes difficult to separate using purely statistical flow features.
Overall, the confusion is not arbitrary; it reflects structural similarities in traffic behavior rather than deficiencies in the ensemble itself.
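A ranking such as the one above can be extracted directly from a confusion matrix by sorting its off-diagonal cells. The following sketch illustrates the idea on a three-class excerpt. The four non-zero counts are the ones quoted in the list above; the diagonal and the two remaining cells are not reported in this section and are set to zero as placeholders. The helper name `top_confusions` is hypothetical.

```python
def top_confusions(cm, labels, k=3):
    """Return the k largest off-diagonal cells as (count, true, predicted)."""
    pairs = [
        (cm[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j and cm[i][j] > 0
    ]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs[:k]

labels = ["DNS/LDAP", "SNMP", "NetBIOS/Portmap"]
# Rows are true classes, columns are predicted classes.
# Zeros are placeholders for cells not listed in the text.
cm = [
    [0, 6768, 0],
    [4973, 0, 4123],
    [0, 4432, 0],
]

for count, true_label, predicted in top_confusions(cm, labels):
    print(f"{true_label} -> {predicted}: {count}")
```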

5.9.2. Underlying Causes of Failure: Feature Overlap, SMOTE Effects, and Dataset Constraints

A deeper inspection of the feature space and preprocessing pipeline reveals three central causes behind the observed misclassification trends:
  • Overlapping feature distributions
    PCA projections and feature-distribution analyses indicate substantial overlap between DNS/LDAP, NetBIOS/Portmap, SSDP/UDP, and related amplification vectors.
    Since all Level-0 and Level-1 models consume the same flow-level feature representation, the ensemble cannot fully disentangle classes whose statistical profiles are inherently similar.
  • SMOTE-induced boundary distortions
    Although SMOTE helps mitigate imbalance, it can introduce:
    • Synthetic samples near decision boundaries.
    • Amplification of noisy minority patterns.
    • Distortions in feature-density regions for SNMP, SSDP/UDP, and other low-frequency classes.
    These effects increase ambiguity precisely in the classes where the F1-score is lowest.
  • Limitations of flow-level statistical features
    The CIC-DDoS2019 dataset does not include:
    • Packet payloads.
    • Protocol header semantics.
    • Fine-grained timing structure.
    • Request-response patterns.
    • Entropy-based temporal signatures.
    Without protocol-aware or payload-level features, classes with similar volumetric patterns (e.g., DNS vs. SSDP) remain difficult to distinguish, regardless of the classifier’s complexity.
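To make the SMOTE-related effect more tangible, the sketch below reproduces the core interpolation step in simplified form: a synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbors. This is a didactic approximation written in plain Python, not the implementation used in the preprocessing pipeline; the function name `smote_like_sample` and the toy points are hypothetical.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating toward a near neighbor."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to base.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    synthetic = [a + gap * (b - a) for a, b in zip(base, neighbour)]
    return base, neighbour, synthetic

# Toy minority class in a two-dimensional feature space.
minority = [[0.0, 0.0], [1.0, 0.2], [0.2, 1.0], [3.0, 3.0]]
base, neighbour, synthetic = smote_like_sample(minority, rng=random.Random(42))
print(base, neighbour, synthetic)
```

Because the synthetic point always lies between two existing minority samples, it can fall inside a region occupied by another class whenever those two samples sit on opposite sides of that region, which is one way the boundary ambiguity described above can arise.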

5.9.3. Operational Implications of Misclassification

From a real-world security perspective, not all misclassifications carry the same risk. The following considerations outline their operational impact:
  • False negatives in amplification attacks (e.g., SNMP, DNS/LDAP)
    These attack types can produce high amplification factors. Misclassification or delayed detection may:
    • Slow down activation of rate-limiting or scrubbing mechanisms.
    • Reduce early situational awareness.
    • Extend the window during which peak attack bandwidth is sustained.
    These are the most critical failure modes from a mitigation standpoint.
  • False positives on benign traffic
    The confusion matrix shows ≈ 4 benign flows misclassified as malicious. While operationally less severe than false negatives, false positives may result in:
    • Temporary disruption of legitimate traffic.
    • Unnecessary throttling of valid services.
    • Reduced operator trust in automated detection.
  • Misclassifications between DDoS subcategories
    Errors where an attack flow is labeled as the wrong attack type but still recognized as malicious have lower operational risk because:
    • Mitigation is still triggered.
    • The system identifies the flow as hostile.
    • Only fine-grained attribution or protocol-specific filtering is affected.
    However, they do reduce the quality of forensic analysis and the precision of protocol-specific mitigation rules.
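The three error types discussed above can be separated by collapsing the multiclass confusion matrix into an operational view. The sketch below is a minimal illustration: the matrix values are entirely hypothetical, the label string `"BENIGN"` is an assumption about how the benign class is named, and "missed attacks" is taken here to mean attack flows predicted as benign.

```python
def operational_breakdown(cm, labels, benign="BENIGN"):
    """Split the errors in a multiclass confusion matrix into three groups.

    cm[i][j] = number of flows of true class i predicted as class j.
    """
    b = labels.index(benign)
    n = len(labels)
    # Attack flows predicted as benign: no mitigation is triggered.
    missed = sum(cm[i][b] for i in range(n) if i != b)
    # Benign flows predicted as any attack class.
    false_alarms = sum(cm[b][j] for j in range(n) if j != b)
    # Attack flows flagged as hostile but assigned the wrong attack type.
    wrong_type = sum(
        cm[i][j]
        for i in range(n)
        for j in range(n)
        if i != b and j != b and i != j
    )
    return {
        "missed_attacks": missed,
        "false_alarms": false_alarms,
        "wrong_attack_type": wrong_type,
    }

labels = ["BENIGN", "SNMP", "DNS/LDAP"]
# Hypothetical counts, for illustration only.
cm = [
    [500, 3, 1],
    [6, 400, 40],
    [2, 55, 420],
]
print(operational_breakdown(cm, labels))
```

Reporting these three totals alongside the multiclass metrics would make it easier to see how much of the residual error actually affects the decision to mitigate.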

6. Conclusions and Future Work

This study presented a multi-layer stacking ensemble framework for multiclass Distributed Denial-of-Service (DDoS) attack detection, integrating heterogeneous S-ML and D-ML algorithms to improve robustness, generalization, and classification accuracy. The proposed architecture, composed of base, meta, and voting layers, demonstrated its ability to combine the complementary strengths of tree-based, kernel-based, regression, and neural models within a unified hierarchical design. The experimental evaluation on the CIC-DDoS2019 dataset confirmed the effectiveness of this approach, achieving an overall accuracy of 91.26% with balanced precision and recall across all attack categories.
The model successfully mitigated overfitting and improved resilience to class imbalance and traffic variability, which are persistent challenges in network intrusion detection. Moreover, the PCA-based dimensionality reduction and SMOTE balancing contributed to reducing class overlap and enhancing the discriminative capacity of the ensemble. These results validate the proposed framework as a viable and scalable solution for real-world DDoS detection, where maintaining adaptability, interpretability, and efficiency under dynamic traffic conditions is essential.
Overall, this research highlights the potential of hierarchical stacking ensembles as an alternative to single deep or shallow models, providing a foundation for more intelligent, adaptive, and automated network defense systems that can handle the complexity of multiclass DDoS attack scenarios.
Recent studies have explored the potential of quantum machine learning (QML) for complex classification problems, reporting advantages in feature-space expressivity and non-classical kernel mappings [27,28,29]. Although these approaches are promising, current QML implementations remain constrained by hardware limitations, qubit noise, data encoding overhead, and scalability challenges that hinder their applicability to very large datasets such as CIC-DDoS2019. For this reason, the present work focuses on classical S-ML and D-ML techniques. Nevertheless, as quantum hardware matures, future extensions of this research may explore hybrid quantum-classical architectures or QML-based feature encoders as a complementary direction for multiclass DDoS detection.
Additionally, the measured inference latency and throughput indicate that the proposed framework is suitable for near-real-time deployment in high-volume network environments, provided that offline training and streaming-based inference are employed.
Future research will aim to extend the current framework in several directions. First, incorporating real-time detection capabilities through streaming-based training and incremental learning could enable deployment in live network environments. Second, integrating federated or distributed learning paradigms may improve data privacy and model scalability across multiple domains and organizations. Third, the adoption of self-supervised and explainable AI techniques could enhance interpretability and reduce dependency on extensive labeled datasets, a current limitation in DDoS detection research.
Future work will extend the experimental evaluation beyond a single dataset and a single train–validation–test split. In particular, we plan to incorporate k-fold cross-validation, more extensive statistical significance testing, and cross-dataset validation on heterogeneous benchmarks such as CIC-IDS2017, UNSW-NB15, and ToN-IoT. In parallel, we will investigate strategies to reduce the computational footprint of the proposed stacking architecture, including model compression, pruning of high-latency components (e.g., KNN), GPU-accelerated inference, and deployment on streaming-oriented platforms. These efforts aim to improve scalability and throughput while preserving the robustness observed in the current multiclass setting.
To support reproducibility and transparency, the complete preprocessing pipeline, trained models, and experimental scripts will be made publicly available upon acceptance of the manuscript.
An additional direction for future research concerns the adoption of more realistic threat models in which attackers actively adapt their traffic patterns to evade detection. To that end, future work will explore adversarial and adaptive scenarios through techniques such as adversarial training, online and incremental learning, and robust feature extraction. These extensions aim to enhance the resilience of the proposed framework against evasive behaviors and improve its applicability in dynamic, real-world network environments.
Collectively, these directions aim to evolve the proposed stacking ensemble into a fully adaptive, interpretable, and resource-efficient defense mechanism for modern cyberinfrastructure.

Author Contributions

Conceptualization, J.M. and E.A.; methodology, E.A.; software, E.A. and L.L.; validation, E.A., L.L. and J.M.; formal analysis, E.A. and J.M.; investigation, E.A. and L.L.; resources, J.M.; data curation, E.A. and L.L.; writing—original draft preparation, J.M. and E.A.; writing—review and editing, J.M.; visualization, E.A.; supervision, J.M. and E.A.; project administration, J.M.; funding acquisition, J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Internet Address Registry for Latin America and the Caribbean (LACNIC) through the FRIDA Program. The funding program does not assign a specific grant or funding number.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are openly available from the Canadian Institute for Cybersecurity at https://www.unb.ca/cic/datasets/ddos-2019.html (accessed on 15 January 2025) and on Kaggle at https://www.kaggle.com/datasets/rodrigorosasilva/cic-ddos2019-30gb-full-dataset-csv-files (accessed on 15 January 2025).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Detailed Feature Description

Table A1. Comprehensive description of the selected features from the CIC-DDoS2019 dataset.
| Feature Name | Description |
|---|---|
| Protocol | Network protocol used in the connection (self-explanatory). |
| Flow Duration | Duration of the flow in microseconds. |
| Total Fwd Packets | Total packets in the forward direction. |
| Total Bwd Packets | Total packets in the backward direction. |
| Total Length of Fwd Packets | Total size of packets in the forward direction. |
| Total Length of Bwd Packets | Total size of packets in the backward direction. |
| Fwd Packet Length Min | Minimum size of packet in the forward direction. |
| Fwd Packet Length Max | Maximum size of packet in the forward direction. |
| Fwd Packet Length Mean | Mean size of packet in the forward direction. |
| Fwd Packet Length Std | Standard deviation of packet size in the forward direction. |
| Bwd Packet Length Min | Minimum size of packet in the backward direction. |
| Bwd Packet Length Max | Maximum size of packet in the backward direction. |
| Bwd Packet Length Mean | Mean size of packet in the backward direction. |
| Bwd Packet Length Std | Standard deviation of packet size in the backward direction. |
| Flow Bytes/s | Number of flow bytes transmitted per second. |
| Flow Packets/s | Number of flow packets transmitted per second. |
| Flow IAT Mean | Mean time between two packets sent in the flow. |
| Flow IAT Std | Standard deviation of inter-arrival time between packets. |
| Flow IAT Max | Maximum time between two packets sent in the flow. |
| Flow IAT Min | Minimum time between two packets sent in the flow. |
| Fwd IAT Min | Minimum inter-arrival time in the forward direction. |
| Fwd IAT Max | Maximum inter-arrival time in the forward direction. |
| Fwd IAT Mean | Mean inter-arrival time in the forward direction. |
| Fwd IAT Std | Standard deviation of inter-arrival time in the forward direction. |
| Fwd IAT Total | Total inter-arrival time in the forward direction. |
| Bwd IAT Min | Minimum inter-arrival time in the backward direction. |
| Bwd IAT Max | Maximum inter-arrival time in the backward direction. |
| Bwd IAT Mean | Mean inter-arrival time in the backward direction. |
| Bwd IAT Std | Standard deviation of inter-arrival time in the backward direction. |
| Bwd IAT Total | Total inter-arrival time in the backward direction. |
| Fwd PSH Flags | Number of times the PSH flag was set in packets travelling forward (0 for UDP). |
| Bwd PSH Flags | Number of times the PSH flag was set in packets travelling backward (0 for UDP). |
| Fwd URG Flags | Number of times the URG flag was set in packets travelling forward (0 for UDP). |
| Bwd URG Flags | Number of times the URG flag was set in packets travelling backward (0 for UDP). |
| Fwd Header Length | Total bytes used for headers in the forward direction. |
| Bwd Header Length | Total bytes used for headers in the backward direction. |
| Fwd Packets/s | Number of forward packets transmitted per second. |
| Bwd Packets/s | Number of backward packets transmitted per second. |
| Packet Length Min | Minimum length of a packet. |
| Packet Length Max | Maximum length of a packet. |
| Packet Length Mean | Mean length of a packet. |
| Packet Length Std | Standard deviation of packet length. |
| Packet Length Variance | Variance of packet length. |
| FIN Flag Count | Number of packets with FIN flag. |
| SYN Flag Count | Number of packets with SYN flag. |
| RST Flag Count | Number of packets with RST flag. |
| PSH Flag Count | Number of packets with PSH flag. |
| ACK Flag Count | Number of packets with ACK flag. |
| URG Flag Count | Number of packets with URG flag. |
| CWR Flag Count | Number of packets with CWR flag. |
| ECE Flag Count | Number of packets with ECE flag. |
| Down/Up Ratio | Download to upload ratio. |
| Average Packet Size | Average size of packets in the flow. |
| Fwd Segment Size Avg | Average segment size in the forward direction. |
| Bwd Segment Size Avg | Average segment size in the backward direction. |
| Fwd Bytes/Bulk Avg | Average bytes per bulk in the forward direction. |
| Fwd Packet/Bulk Avg | Average packets per bulk in the forward direction. |
| Fwd Bulk Rate Avg | Average bulk rate in the forward direction. |
| Bwd Bytes/Bulk Avg | Average bytes per bulk in the backward direction. |
| Bwd Packet/Bulk Avg | Average packets per bulk in the backward direction. |
| Bwd Bulk Rate Avg | Average bulk rate in the backward direction. |
| Subflow Fwd Packets | Average number of packets in subflows in the forward direction. |
| Subflow Fwd Bytes | Average number of bytes in subflows in the forward direction. |
| Subflow Bwd Packets | Average number of packets in subflows in the backward direction. |
| Subflow Bwd Bytes | Average number of bytes in subflows in the backward direction. |
| Fwd Init Win Bytes | Total number of bytes sent in the initial window in the forward direction. |
| Bwd Init Win Bytes | Total number of bytes sent in the initial window in the backward direction. |
| Fwd Act Data Pkts | Count of packets with at least one byte of TCP data payload (forward). |
| Fwd Seg Size Min | Minimum segment size observed in the forward direction. |
| Active Min | Minimum time a flow was active before becoming idle. |
| Active Mean | Mean time a flow was active before becoming idle. |
| Active Max | Maximum time a flow was active before becoming idle. |
| Active Std | Standard deviation of active time before becoming idle. |
| Idle Min | Minimum time a flow was idle before becoming active. |
| Idle Mean | Mean time a flow was idle before becoming active. |
| Idle Max | Maximum time a flow was idle before becoming active. |
| Idle Std | Standard deviation of idle time before becoming active. |
| Label | Target variable indicating whether the traffic is benign or corresponds to a specific DDoS attack type. |

References

  1. Report, G.R. DDoS Attack Volume Doubled in H2 2023. Press Release, 2024. “Maximum Attack Power Rose from 800 Gbps to 1.6 Tbps”. Available online: https://gcore.com/press-releases/gcore-radar-report-reveals-ddos-peak-attack-volumes-doubled-in-h2-2023 (accessed on 5 August 2025).
  2. Intelligence, A.N.T. 2025 Global DDoS Weapons Report. White Paper, 2025. Global DDoS Volume, API Gateway Targeting. Available online: https://www.a10networks.com/resources/reports/ddos-weapons-report/ (accessed on 30 July 2025).
  3. Wahab, S.A.; Sultana, S.; Tariq, N.; Mujahid, M.; Khan, J.A.; Mylonas, A. A Multi-Class Intrusion Detection System for DDoS Attacks in IoT Networks Using Deep Learning and Transformers. Sensors 2025, 25, 4845.
  4. Bahashwan, A.A.; Anbar, M.; Manickam, S.; Al-Amiedy, T.A.; Aladaileh, M.A.; Hasbullah, I.H. A Systematic Literature Review on Machine Learning and Deep Learning Approaches for Detecting DDoS Attacks in Software-Defined Networking. Sensors 2023, 23, 4441.
  5. Ibrahim, N.; Rajalakshmi, N.R.; Sivakumar, V.; Sharmila, L. An optimized hybrid ensemble machine learning model combining multiple classifiers for detecting advanced persistent threats in networks. J. Big Data 2025, 12, 212.
  6. Chen, S.R.; Chen, S.J.; Hsieh, W.B. Enhancing Machine Learning-Based DDoS Detection Through Hyperparameter Optimization. Electronics 2025, 14, 3319.
  7. Cloudflare, Inc. DDoS Attack Trends for 2024 Q1. Cloudflare Radar Report. 2024. Available online: https://radar.cloudflare.com/reports/ddos-2024-q1 (accessed on 3 August 2025).
  8. Abiramasundari, S.; Ramaswamy, V. Distributed denial-of-service (DDOS) attack detection using supervised machine learning algorithms. Sci. Rep. 2025, 15, 13098.
  9. Sawah, M.S.; Elmannai, H.; El-Bary, A.A.; Lotfy, K.; Sheta, O.E. Distributed denial of service (DDoS) classification based on random forest model with backward elimination algorithm and grid search algorithm. Sci. Rep. 2025, 15, 19063.
  10. Becerra-Suarez, F.L.; Fernández-Roman, I.; Forero, M.G. Improvement of Distributed Denial of Service Attack Detection through Machine Learning and Data Processing. Mathematics 2024, 12, 1294.
  11. Fathima, A.; Devi, G.S.; Faizaanuddin, M. Improving distributed denial of service attack detection using supervised machine learning. Meas. Sens. 2023, 30, 100911.
  12. Ebrahem, O.; Dowaji, S.; Alhammoud, S. Towards a minimum universal features set for IoT DDoS attack detection. J. Big Data 2025, 12, 88.
  13. Bashaiwth, A.; Binsalleeh, H.; AsSadhan, B. An Explanation of the LSTM Model Used for DDoS Attacks Classification. Appl. Sci. 2023, 13, 8820.
  14. Ramzan, M.; Shoaib, M.; Altaf, A.; Arshad, S.; Iqbal, F.; Castilla, Á.K.; Ashraf, I. Distributed Denial of Service Attack Detection in Network Traffic Using Deep Learning Algorithm. Sensors 2023, 23, 8642.
  15. Setitra, M.A.; Fan, M.; Agbley, B.L.Y.; Bensalem, Z.E.A. Optimized MLP-CNN Model to Enhance Detecting DDoS Attacks in SDN Environment. Network 2023, 3, 538–562.
  16. Aktar, S.; Yasin Nur, A. Towards DDoS attack detection using deep learning approach. Comput. Secur. 2023, 129, 103251.
  17. Cao, B.; Li, C.; Song, Y.; Qin, Y.; Chen, C. Network Intrusion Detection Model Based on CNN and GRU. Appl. Sci. 2022, 12, 4184.
  18. Ali, T.E.; Chong, Y.W.; Manickam, S.; Yusoff, M.N.; Yau, K.L.A.; Zoltan, A.D. A Stacking Ensemble Model with Enhanced Feature Selection for Distributed Denial-of-Service Detection in Software-Defined Networks. Eng. Technol. Appl. Sci. Res. 2025, 15, 19232–19245.
  19. Hirsi, A.; Audah, L.; Salh, A.; Alhartomi, M.A.; Ahmed, S. Enhancing SDN security using ensemble-based machine learning approach for DDoS attack detection. Indones. J. Electr. Eng. Comput. Sci. 2025, 38, 1073.
  20. Das, S.; Ashrafuzzaman, M.; Sheldon, F.T.; Shiva, S. Ensembling Supervised and Unsupervised Machine Learning Algorithms for Detecting Distributed Denial of Service Attacks. Algorithms 2024, 17, 99.
  21. Lazzarini, R.; Tianfield, H.; Charissis, P.V. A Stacking Ensemble of Deep Learning Models for IoT Network Intrusion Detection. Knowl.-Based Syst. 2023, 279, 110941.
  22. Butt, H.A.; Harthy, K.S.A.; Shah, M.A.; Hussain, M.; Amin, R.; Rehman, M.U. Enhanced DDoS Detection Using Advanced Machine Learning and Ensemble Techniques in Software Defined Networking. Comput. Mater. Contin. 2024, 81, 3003–3031.
  23. Hossain, M.A. Enhanced Ensemble-Based Distributed Denial-of-Service (DDoS) Attack Detection with Novel Feature Selection: A Robust Cybersecurity Approach. Artif. Intell. Evol. 2023, 4, 165–186.
  24. Silva, R.R. CIC-DDoS2019 Dataset. 2019. Available online: https://www.kaggle.com/datasets/rodrigorosasilva/cic-ddos2019-30gb-full-dataset-csv-files (accessed on 15 January 2025).
  25. Encyclopedia of Machine Learning; Springer: New York, NY, USA, 2011; p. 912.
  26. Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259.
  27. Pei, J.J.; Gong, L.H.; Qin, L.G.; Zhou, N.R. One-to-many image generation model based on parameterized quantum circuits. Digit. Signal Process. 2025, 165, 105340.
  28. Ding, Y.; Li, Z.; Zhou, N. Quantum generative adversarial network based on the quantum Born machine. Adv. Eng. Inform. 2025, 68, 103622.
  29. Sudha, D.; Anju, A.; Ezhilarasi, K. Enhanced deep learning and Quantum variational classifier for Large-Scale Data Analysis. Knowl.-Based Syst. 2025, 330, 114611.
Figure 1. General process of the proposed DDoS detection system.
Figure 2. Data preprocessing workflow applied to the CIC-DDoS2019 dataset.
Figure 3. Stack Model Concept.
Figure 4. Architecture of the proposed stacked ensemble model.
Figure 5. Evaluation process and performance metrics of the trained models.
Figure 6. Confusion matrix of the proposed three-layer stacking ensemble on the CIC-DDoS2019 test set (nine-class multiclass configuration).
Table 1. Hyperparameters of the Decision Tree Classifier (DecisionTreeClassifier).
ParameterDefault ValueDescription
criterion"gini"Function used to measure the quality of a split. Alternatives: "entropy", "log_loss".
splitter"best"Strategy used to choose the split at each node ("best" or "random").
max_depthNoneThe tree expands until all leaves are pure or contain fewer than min_samples_split.
min_samples_split2Minimum number of samples required to split an internal node.
min_samples_leaf1Minimum number of samples required to be at a leaf node.
max_featuresNoneNumber of features considered when looking for the best split.
random_stateNoneControls randomness of the estimator for reproducibility.
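To make the default criterion in Table 1 concrete, the following is a minimal sketch of the Gini impurity that a decision tree evaluates when scoring a candidate split. The function name `gini_impurity` and the example class counts are illustrative assumptions, not taken from the study's code.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for a node with the given per-class sample counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# Hypothetical node contents (not drawn from CIC-DDoS2019):
pure_node = gini_impurity([10, 0])   # every sample belongs to one class
mixed_node = gini_impurity([5, 5])   # two classes equally represented
```

A pure node scores 0, and impurity grows as the classes in a node become more evenly mixed, so lower values indicate better splits.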
Table 2. Hyperparameters of the Histogram-based Gradient Boosting Classifier (HistGradientBoostingClassifier).
Parameter | Default Value | Description
loss | "log_loss" | Loss function to be minimized; suitable for probabilistic classification.
learning_rate | 0.1 | Shrinks the contribution of each tree to prevent overfitting.
max_depth | None | Maximum depth of individual trees. Unlimited by default.
max_leaf_nodes | 31 | Maximum number of leaf nodes per tree (controls model complexity).
min_samples_leaf | 20 | Minimum number of samples per leaf node.
l2_regularization | 0.0 | Strength of L2 regularization on leaf weights.
max_bins | 255 | Maximum number of bins for feature histogram discretization.
random_state | None | Ensures reproducibility of random processes.
Table 3. Hyperparameters of the Random Forest Classifier (RandomForestClassifier).
Parameter | Default Value | Description
n_estimators | 100 | Number of trees in the forest.
criterion | "gini" | Function to measure the quality of a split.
max_depth | None | Maximum depth of each tree. If None, trees grow until all leaves are pure or contain fewer than min_samples_split samples.
min_samples_split | 2 | Minimum samples required to split a node.
min_samples_leaf | 1 | Minimum samples required at a leaf node.
bootstrap | True | Whether bootstrap samples are used when building trees.
n_jobs | −1 | Number of CPU cores used for training (−1 = use all cores).
random_state | None | Controls random number generation for reproducibility.
Table 4. Hyperparameters of the K-Nearest Neighbors Classifier (KNeighborsClassifier).
Parameter | Default Value | Description
n_neighbors | 5 | Number of neighbors to use for classification.
weights | "uniform" | All neighbors contribute equally; can be set to "distance".
algorithm | "auto" | Algorithm used to compute nearest neighbors (ball_tree, kd_tree, etc.).
leaf_size | 30 | Affects tree-based algorithm speed and memory usage.
p | 2 | Power parameter for the Minkowski distance metric (2 = Euclidean).
metric | "minkowski" | Distance metric used for neighbor computation.
n_jobs | −1 | Number of parallel jobs to run (−1 = all available CPUs).
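The relationship between the `p` and `metric` entries in Table 4 can be illustrated with a short sketch of the Minkowski distance. The helper name `minkowski_distance` and the two example vectors are assumptions made for illustration only.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


# Hypothetical two-feature flows (not taken from the dataset):
euclidean = minkowski_distance([0.0, 0.0], [3.0, 4.0], p=2)  # p = 2, as in Table 4
manhattan = minkowski_distance([0.0, 0.0], [3.0, 4.0], p=1)  # p = 1 for comparison
```

With p = 2 the metric reduces to the Euclidean distance. Because features with large numeric ranges dominate this sum, KNN is sensitive to feature scaling.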
Table 5. Training hyperparameters used for the neural network models in Layer 0.
Hyperparameter | Value | Description
learning_rate | 0.0001 | Step size used by the optimizer to update model weights during training.
optimizer | Adam | Adaptive optimizer combining momentum and RMSProp for efficient gradient updates.
dropout_rate | 0.3 | Fraction of neurons randomly deactivated per training step to reduce overfitting.
batch_size | 256 | Number of samples processed before updating model parameters.
epochs | 20 | Number of complete passes over the training dataset.
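As an illustration of how Adam combines a momentum-like term with an RMSProp-like term, the sketch below applies one update to a single scalar weight using the learning rate from Table 5. The values `beta1 = 0.9`, `beta2 = 0.999`, and `eps = 1e-8` are the commonly cited defaults and are assumptions here, since the table does not list them. The function name and the example gradient are also hypothetical.

```python
import math


def adam_step(weight, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad        # momentum-like first moment
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # RMSProp-like second moment
    m_hat = m / (1.0 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    weight -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return weight, m, v


# Hypothetical starting weight and gradient, first training step:
w, m, v = adam_step(weight=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly one learning rate (here about 0.0001) regardless of the gradient's magnitude.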
Table 6. Hyperparameters of the Logistic Regression model (LogisticRegression).
Parameter | Default Value | Description
penalty | "l2" | Regularization type applied to prevent overfitting.
tol | 1 × 10⁻⁴ | Tolerance for the stopping criteria.
C | 1.0 | Inverse of regularization strength (smaller values = stronger regularization).
solver | "lbfgs" | Optimization algorithm used for training.
max_iter | 100 | Maximum number of iterations for convergence.
multi_class | "auto" | Determines how the model handles multiclass problems.
n_jobs | None | Number of CPU cores used for parallel processing.
Table 7. Hyperparameters of the Ridge Classifier (RidgeClassifier).
Parameter | Default Value | Description
alpha | 1.0 | Regularization strength parameter; must be a positive float.
fit_intercept | True | Whether to calculate the intercept for the model.
normalize | "deprecated" | Deprecated flag for input normalization.
max_iter | None | Maximum number of iterations; default uses auto-convergence.
tol | 1 × 10⁻³ | Precision for convergence.
solver | "auto" | Algorithm used for optimization ("svd", "lsqr", etc.).
random_state | None | Random seed for reproducibility.
Table 8. Training hyperparameters used for the neural network models in Layer 1.
Hyperparameter | Value | Description
learning_rate | 0.0001 | Step size used by the optimizer to update model weights during training.
optimizer | Adam | Adaptive optimizer combining momentum and RMSProp for efficient gradient updates.
dropout_rate | 0.3 | Fraction of neurons randomly deactivated per training step to reduce overfitting.
batch_size | 256 | Number of samples processed before updating model parameters.
epochs | 20 | Number of complete passes over the training dataset.
Table 9. Performance metrics for trained models across all layers.
Layer | Model | Accuracy | Precision | Recall | F1-Score
Layer 0 | Neural Network | 0.90 | 0.90 | 0.90 | 0.90
Layer 0 | Random Forest | 0.91 | 0.92 | 0.91 | 0.92
Layer 0 | Decision Tree | 0.91 | 0.92 | 0.91 | 0.92
Layer 0 | KNN | 0.90 | 0.91 | 0.91 | 0.91
Layer 0 | Gradient Boosting | 0.91 | 0.91 | 0.91 | 0.91
Layer 1 | Logistic Regression | 0.91 | 0.92 | 0.92 | 0.92
Layer 1 | Ridge Classifier | 0.91 | 0.92 | 0.92 | 0.92
Layer 1 | Neural Network | 0.91 | 0.92 | 0.92 | 0.92
Layer 2 | Voting Ensemble | 0.91 | 0.92 | 0.92 | 0.92
Table 10. Classification metrics for the Logistic Regression model (Layer 1).
Class | Precision | Recall | F1-Score | Support
BENIGN | 0.99 | 1.00 | 1.00 | 2827
DNS/LDAP | 0.83 | 0.82 | 0.83 | 40,987
MSSQL | 0.93 | 0.94 | 0.93 | 48,599
NTP | 0.99 | 1.00 | 1.00 | 48,900
NetBIOS/Portmap | 0.92 | 0.89 | 0.90 | 58,680
SNMP | 0.75 | 0.80 | 0.77 | 49,756
SSDP/UDP | 0.95 | 0.95 | 0.95 | 56,214
Syn/UDPLag | 0.93 | 0.96 | 0.95 | 87,766
TFTP | 0.98 | 0.88 | 0.92 | 48,914
Accuracy |  |  | 0.91 | 442,643
Macro avg | 0.92 | 0.92 | 0.92 | 442,643
Weighted avg | 0.91 | 0.91 | 0.91 | 442,643
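The difference between the macro and weighted averages reported in these tables can be reproduced from the per-class rows. The sketch below uses the F1-scores and supports listed in Table 10; the helper names are illustrative. Because the inputs are already rounded to two decimals, the results agree with the table only at that precision.

```python
# Per-class F1-scores and supports copied from Table 10 (Logistic Regression, Layer 1).
f1 = {
    "BENIGN": 1.00, "DNS/LDAP": 0.83, "MSSQL": 0.93, "NTP": 1.00,
    "NetBIOS/Portmap": 0.90, "SNMP": 0.77, "SSDP/UDP": 0.95,
    "Syn/UDPLag": 0.95, "TFTP": 0.92,
}
support = {
    "BENIGN": 2827, "DNS/LDAP": 40987, "MSSQL": 48599, "NTP": 48900,
    "NetBIOS/Portmap": 58680, "SNMP": 49756, "SSDP/UDP": 56214,
    "Syn/UDPLag": 87766, "TFTP": 48914,
}


def macro_average(scores):
    """Unweighted mean: every class counts equally, however rare it is."""
    return sum(scores.values()) / len(scores)


def weighted_average(scores, weights):
    """Mean weighted by class support: large classes dominate the result."""
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in scores) / total


macro_f1 = macro_average(f1)
weighted_f1 = weighted_average(f1, support)
```

The macro average treats the 2827 BENIGN flows the same as the 87,766 Syn/UDPLag flows, which is why it is the more informative summary when class sizes differ this much.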
Table 11. Classification metrics for the Ridge Classifier model (Layer 1).
Class | Precision | Recall | F1-Score | Support
BENIGN | 0.99 | 1.00 | 1.00 | 2827
DNS/LDAP | 0.83 | 0.82 | 0.83 | 40,987
MSSQL | 0.93 | 0.94 | 0.93 | 48,599
NTP | 0.99 | 1.00 | 1.00 | 48,900
NetBIOS/Portmap | 0.93 | 0.88 | 0.90 | 58,680
SNMP | 0.75 | 0.81 | 0.78 | 49,756
SSDP/UDP | 0.95 | 0.95 | 0.95 | 56,214
Syn/UDPLag | 0.93 | 0.97 | 0.95 | 87,766
TFTP | 0.99 | 0.87 | 0.93 | 48,914
Accuracy |  |  | 0.91 | 442,643
Macro avg | 0.92 | 0.92 | 0.92 | 442,643
Weighted avg | 0.92 | 0.91 | 0.91 | 442,643
Table 12. Classification metrics for the Neural Network model (Layer 1).
Class | Precision | Recall | F1-Score | Support
BENIGN | 0.99 | 1.00 | 1.00 | 2827
DNS/LDAP | 0.83 | 0.82 | 0.83 | 40,987
MSSQL | 0.93 | 0.94 | 0.93 | 48,599
NTP | 0.99 | 1.00 | 1.00 | 48,900
NetBIOS/Portmap | 0.93 | 0.88 | 0.90 | 58,680
SNMP | 0.75 | 0.81 | 0.78 | 49,756
SSDP/UDP | 0.95 | 0.95 | 0.95 | 56,214
Syn/UDPLag | 0.93 | 0.96 | 0.95 | 87,766
TFTP | 0.98 | 0.87 | 0.92 | 48,914
Accuracy |  |  | 0.91 | 442,643
Macro avg | 0.92 | 0.92 | 0.92 | 442,643
Weighted avg | 0.91 | 0.91 | 0.91 | 442,643
Table 13. Classification metrics for the Ensemble model (Layer 2).
Class | Precision | Recall | F1-Score | Support
BENIGN | 0.99 | 1.00 | 1.00 | 2827
DNS/LDAP | 0.83 | 0.82 | 0.83 | 40,987
MSSQL | 0.93 | 0.94 | 0.93 | 48,599
NTP | 0.99 | 1.00 | 1.00 | 48,900
NetBIOS/Portmap | 0.93 | 0.88 | 0.90 | 58,680
SNMP | 0.75 | 0.81 | 0.78 | 49,756
SSDP/UDP | 0.95 | 0.95 | 0.95 | 56,214
Syn/UDPLag | 0.93 | 0.96 | 0.95 | 87,766
TFTP | 0.98 | 0.88 | 0.92 | 48,914
Accuracy |  |  | 0.91 | 442,643
Macro avg | 0.92 | 0.92 | 0.92 | 442,643
Weighted avg | 0.91 | 0.91 | 0.91 | 442,643
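Layer 2 aggregates the predictions of the three Layer 1 meta models by voting. The sketch below assumes hard (majority) voting with ties resolved in favor of the first-listed model. This section does not state whether the study uses hard or soft voting or how ties are broken, so both choices are assumptions, as are the function name and the example labels.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the earliest-listed model."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical per-flow outputs from (logistic regression, ridge, neural network):
agreed = majority_vote(["SNMP", "DNS/LDAP", "SNMP"])     # two of three agree
three_way = majority_vote(["SNMP", "DNS/LDAP", "MSSQL"])  # no majority, first model wins
```

With three voters over nine classes a three-way disagreement is possible, so any implementation of this layer needs an explicit tie-breaking rule.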
Table 14. Ablation study results in terms of Macro F1-score.
Configuration | Macro F1-Score
Base model (full architecture) | 0.9201
Without Level-0 neural network | 0.9030
Without Level-0 random forest | 0.8989
Without Level-0 decision tree | 0.8993
Without Level-0 KNN | 0.9018
Without Level-0 gradient boosting | 0.8999
Without Level-1 logistic regression | 0.9134
Without Level-1 ridge classifier | 0.9122
Without Level-1 neural network | 0.9117
Base model without SMOTE | 0.8267
Base model without PCA label treatment | 0.7893
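One way to read Table 14 is to express each ablation as a drop in Macro F1 relative to the full architecture and rank the configurations by that drop. The sketch below does this with the values from the table; the variable names are illustrative.

```python
BASE_MACRO_F1 = 0.9201  # full architecture, Table 14

# Macro F1 of each ablated configuration, copied from Table 14.
ablations = {
    "Without Level-0 neural network": 0.9030,
    "Without Level-0 random forest": 0.8989,
    "Without Level-0 decision tree": 0.8993,
    "Without Level-0 KNN": 0.9018,
    "Without Level-0 gradient boosting": 0.8999,
    "Without Level-1 logistic regression": 0.9134,
    "Without Level-1 ridge classifier": 0.9122,
    "Without Level-1 neural network": 0.9117,
    "Base model without SMOTE": 0.8267,
    "Base model without PCA label treatment": 0.7893,
}

# Drop in Macro F1 caused by each ablation, largest first.
drops = {name: round(BASE_MACRO_F1 - score, 4) for name, score in ablations.items()}
ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)

for name, drop in ranked:
    print(f"{name}: -{drop:.4f}")
```

Ranked this way, the two preprocessing ablations (PCA label treatment and SMOTE) cost far more than removing any single model, and removing a Level-1 model costs less than removing a Level-0 model.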
Table 15. Comparative analysis with representative previous works.
Study | Dataset | Methodology | Classification Type | Metric (F1-Score) | Reported Limitations
Abiramasundari & Ramaswamy (2025) [8] | CICIDS2017, CICIDS2018, CICDDoS2019 | PCA-based EDAD + classical S-ML (RF, KNN, SVM, DT, LR) with feature reduction | Binary | RF ≈ 98.9% (CICIDS2017), RF/KNN ≈ 98.7% (CICDDoS2019) | No hybrid/ensemble learning; performance varies per dataset; limited multiclass evaluation
Afrah Fathima et al. (2023) [11] | Subset of CSE-CICIDS2018, CSE-CICIDS2017, CICDoS | Supervised S-ML (RF, KNN, LR); Min–Max normalization; undersampling | Binary | RF: 98% | Limited dataset volume; lacks real-time evaluation
Aktar & Nur (2023) [16] | CIC-DDoS2019, CICIDS2017, NSL-KDD | Deep Contractive Autoencoder (DCAE) with stochastic threshold | Binary | Accuracy range: 93.41–97.58% (CIC-DDoS2019) | Sensitive to hyperparameters; fixed anomaly threshold may cause false alarms
Ali et al. (2025) [18] | SDN (public + testbed-generated) | Stacking ensemble (SVM, MLP base; RF meta) with hybrid feature selection (GAWFS) | Binary | F1: 99.71% | High model complexity; potential overfitting; requires large datasets
Bashawith et al. (2023) [13] | CICIDS2017, CICIDS2018, CICDDoS2019 | LSTM + XAI methods (LIME, SHAP, Anchor, LORE); 51 intrinsic features | Binary & Multiclass | F1 > 0.99 (Binary) | Multiclass confusion among similar attack types (DNS/LDAP/SNMP)
Becerra-Suárez et al. (2024) [10] | CIC-DDoS2019 (22 features) | Classical S-ML (RF, DT, ADA, XGB, MLP); Pearson correlation; TPE optimization | Binary | RF: 99.98% (F1) | Inconsistent preprocessing details in prior studies highlighted
Butt et al. (2024) [22] | DDoS-SDN | S-ML + ensemble learning with dynamic feature selection | Binary | RF: 100%, KNN: 99% | Interpretability adds complexity; unclear scalability to large networks
Cao et al. (2022) [17] | UNSW-NB15, NSL-KDD, CIC-IDS2017 | CNN–GRU + CBAM (attention); hybrid oversampling and feature selection | Multiclass | CIC-IDS2017: F1 = 99.64% | Long training time; limited performance on minority classes
Das et al. (2024) [20] | NSL-KDD, UNSW-NB15, CICIDS2017 | Supervised + unsupervised stacking (LR, DT, SVM, OC-SVM, LOF, IF) with 6 meta-classifiers | Binary | CICIDS2017: 99.9% (F1) | High computational cost due to ensemble size
Ebrahem et al. (2025) [12] | CICIDS2017, IoT-23 | Feature-set ensemble (VT, IG, ANOVA, χ²); NB, KNN, RF, LR; reduced to a minimal set of two features | Binary | RF: 97% (F1) | KNN suffers from memory/latency issues; NB/LR affected by correlated features
Hirsi et al. (2025) [19] | SDN-DDoS (new) & CICDDoS2019 | Ensemble RF (ENRF) + PCA (20 components) | Binary | F1: 100% | PCA assumes linearity; dataset coverage still incomplete
Hossain (2023) [23] | CIC-DDoS2019 (LDAP subset) | RF-based ensemble with feature selection combining correlation, MI, PCA | Binary | F1: 100% | Requires evaluation under highly dynamic traffic
Lazzarini et al. (2023) [21] | ToN-IoT, CICIDS2017 | Deep stacking ensemble (MLP, CNN, LSTM) | Binary & Multiclass | CICIDS2017: 98.6%; ToN-IoT: 99.7% | Requires computational overhead analysis for IoT devices
Ramzan et al. (2023) [14] | CICDDoS2019, CICIDS2017 | D-ML models (RNN, LSTM, GRU); 20 features via Extra Trees | Binary & Multiclass | Binary: 99.99% (RNN); Multiclass: 98% (GRU) | RNN slow runtime; GRU best suited for multiclass scenarios
Sawah et al. (2025) [9] | DDoS-SDN | S-ML with Backward Feature Elimination and Grid Search (CV = 5) | Binary | RF: 99.99% | Real-time deployment challenges under high traffic loads
Setitra et al. (2023) [15] | InSDN, CICDDoS2019 | OptMLP–CNN (MLP + CNN); SHAP-based feature selection with Bayesian optimization | Binary | F1: 99.94% | Limited interpretability of deep hybrid models
Our Work (2025) | CIC-DDoS2019 | Three-layer hierarchical stacking ensemble (Base–Meta–Voting) with SMOTE and PCA | Multiclass | F1: 92% (overall); minority classes: 0.77–0.83 | Lower performance on minority classes; increased complexity due to multiclass setting
Table 16. Comparison between binary DDoS detection studies and the proposed multiclass approach.
Study | Dataset | Classification Type | Reported Metric | Notes
Ramzan et al. [14] | CIC-DDoS2019 | Binary (Attack vs. Benign) | >99% accuracy/F1 | All attack traffic merged into a single class.
Setitra et al. [15] | CIC-DDoS2019 | Binary (Attack vs. Benign) | ≈99–100% accuracy | Binary formulation simplifies the detection task.
This work | CIC-DDoS2019 | Multiclass (9 classes) | 91.26% accuracy, 92.01% F1 | Full attack taxonomy; higher complexity.
Table 17. Computational performance of the proposed stacking architecture.
Component | Training Time | Inference Latency (per flow) | Throughput (flows/s) | Peak Memory (Training) | Peak Memory (Inference)
Layer 0: MLP | 18 h | 2.0 ms | 500 | 18 GB | 4 GB
Layer 0: RF | 14 h | 5.0 ms | 200 | 22 GB | 6 GB
Layer 0: KNN | 2 h | 150 ms | 6.7 | 20 GB | 18 GB
Layer 0: DT | 6 h | 1.5 ms | 667 | 8 GB | 2 GB
Layer 0: Gradient Boost | 16 h | 4.0 ms | 250 | 24 GB | 4 GB
Layer 1: Logistic Regression | 10 min | 0.2 ms | 5000 | 0.8 GB | 0.15 GB
Layer 1: Ridge Classifier | 8 min | 0.25 ms | 4000 | 0.8 GB | 0.15 GB
Layer 1: Neural Network | 30 min | 0.4 ms | 2500 | 2.0 GB | 0.5 GB
Layer 2: Voting Ensemble |  | 0.6 ms | 1600 | 0.5 GB | 0.6 GB
Full pipeline (all models loaded) |  | ≈152 ms | ≈6.5 | 48 GB | 36 GB
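The throughput column in Table 17 is consistent with taking the reciprocal of the per-flow latency, that is, with processing one flow at a time and no batching. That reading is an assumption rather than something the table states. The sketch below applies it to the listed latencies. For the voting ensemble and the full pipeline the reciprocal (about 1667 and 6.6 flows/s) is slightly above the reported 1600 and ≈6.5, so those two entries appear to be rounded or measured separately.

```python
# Per-flow inference latency in milliseconds, copied from Table 17.
latency_ms = {
    "MLP (Layer 0)": 2.0,
    "RF (Layer 0)": 5.0,
    "KNN (Layer 0)": 150.0,
    "DT (Layer 0)": 1.5,
    "Gradient Boost (Layer 0)": 4.0,
    "Logistic Regression (Layer 1)": 0.2,
    "Ridge Classifier (Layer 1)": 0.25,
    "Neural Network (Layer 1)": 0.4,
    "Voting Ensemble (Layer 2)": 0.6,
}


def flows_per_second(latency_in_ms):
    """Sequential throughput, assuming one flow is classified at a time."""
    return 1000.0 / latency_in_ms


throughput = {name: flows_per_second(ms) for name, ms in latency_ms.items()}
slowest = max(latency_ms, key=latency_ms.get)  # component that bounds the pipeline
```

KNN is the slowest component by two orders of magnitude, which matches the full-pipeline latency of ≈152 ms being only marginally above KNN's 150 ms.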
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


MDPI and ACS Style

Angulo, E.; Lizcano, L.; Marquez, J. A Stacking-Based Ensemble Model for Multiclass DDoS Detection Using Shallow and Deep Machine Learning Algorithms. Appl. Sci. 2026, 16, 578. https://doi.org/10.3390/app16020578
