Combined Dataset System Based on a Hybrid PCA–Transformer Model for Effective Intrusion Detection Systems

Kamal, Hesham; Mashaly, Maggie

doi:10.3390/ai6080168


by Hesham Kamal * and Maggie Mashaly *

Networks Department, Faculty of Information Engineering and Technology (IET), German University in Cairo (GUC), New Cairo 11835, Egypt

* Authors to whom correspondence should be addressed.

AI 2025, 6(8), 168; https://doi.org/10.3390/ai6080168

Submission received: 26 May 2025 / Revised: 13 July 2025 / Accepted: 14 July 2025 / Published: 24 July 2025


Abstract

With the growing number and diversity of network attacks, traditional security measures such as firewalls and data encryption are no longer sufficient to ensure robust network protection. As a result, intrusion detection systems (IDSs) have become a vital component in defending against evolving cyber threats. Although many modern IDS solutions employ machine learning techniques, they often suffer from low detection rates and depend heavily on manual feature engineering. Furthermore, most IDS models are designed to identify only a limited set of attack types, which restricts their effectiveness in practical scenarios where a network may be exposed to a wide array of threats. To overcome these limitations, we propose a novel approach to IDSs by implementing a combined dataset framework based on an enhanced hybrid principal component analysis–Transformer (PCA–Transformer) model, capable of detecting 21 unique classes, comprising 1 benign class and 20 distinct attack types across multiple datasets. The proposed architecture incorporates enhanced preprocessing and feature engineering, followed by the vertical concatenation of the CSE-CIC-IDS2018 and CICIDS2017 datasets. In this design, the PCA component is responsible for feature extraction and dimensionality reduction, while the Transformer component handles the classification task. Class imbalance was addressed using class weights, adaptive synthetic sampling (ADASYN), and edited nearest neighbors (ENN). 
Experimental results show that the model achieves 99.80% accuracy for binary classification and 99.28% for multi-class classification on the combined dataset (CSE-CIC-IDS2018 and CICIDS2017), 99.66% accuracy for binary classification and 99.59% for multi-class classification on the CSE-CIC-IDS2018 dataset, 99.75% accuracy for binary classification and 99.51% for multi-class classification on the CICIDS2017 dataset, and 99.98% accuracy for binary classification and 98.01% for multi-class classification on the NF-BoT-IoT-v2 dataset, significantly outperforming existing approaches by distinguishing a wide range of classes, including benign and various attack types, within a unified detection framework.

Keywords: binary classification; combined system; feature engineering; IDS; multi-class classification; PCA–Transformer

1. Introduction

The swift advancement of digital and communication systems, alongside web technologies, has made services accessible to individuals globally on an unprecedented scale. Nevertheless, the growing number and diversity of digital threats (such as network malware, unauthorized monitoring, and harmful attacks), which rise annually, pose significant risks to personal data protection and financial security. Consequently, safeguarding digital data and communications has become pivotal for both individuals and society as a whole [1,2]. Protective software is commonly deployed as a fundamental security measure. However, owing to the complexity of manual setup and the delayed response to novel attack categories, it is no longer adequate for entities demanding robust security (e.g., governmental bodies, military installations) [3]. Therefore, cybersecurity researchers have proposed an approach for promptly recognizing and managing irregular network activity: intrusion detection systems (IDSs).

IDSs have proven their value and capability as a cybersecurity tactic. They pinpoint recognized risks and damaging activities by examining data movement within computer networks and issue alerts when such risks are found [4]. Two main techniques are used for monitoring harmful operations. The initial one, signature-based detection, functions similarly to antivirus software, necessitating a comparison with a collection of past attack attributes or digital “fingerprints”. This technique excels at precisely identifying familiar threats with few incorrect warnings; nonetheless, it struggles to detect novel or changing attacks that lack matching patterns, thus restricting its flexibility in evolving threat landscapes. The subsequent method, anomaly-based detection, entails contrasting present system or network actions with models of standard behavior to ascertain if anything atypical is taking place. This strategy can spot previously unknown attacks and emerging threats by highlighting divergences from established standards. Conversely, it might be susceptible to more false alarms, as genuine but infrequent behavior could be wrongly interpreted as harmful. By skillfully utilizing both techniques, IDSs strive to improve detection performance while mitigating the inherent drawbacks of each individual method.

Promptly pinpointing digital dangers is crucial for upholding the safety and robustness of network systems. Employing advanced machine learning methods has demonstrated significant efficacy in the immediate scrutiny of network data, enabling the rapid discovery of possible intrusions [5]. Sophisticated deep learning models further improve the flexibility of IDSs, empowering them with the capacity to adaptively react to changing attack methods [6]. Moreover, integrating live data handling features within IDS architectures considerably reinforces network safeguards by enabling instant threat identification and preemptive countermeasures [7].

This study introduces a unified dataset framework for intrusion detection using enhanced preprocessing and feature engineering, followed by the vertical concatenation of the CSE-CIC-IDS2018 and CICIDS2017 datasets, enabling the detection of 21 unique classes, including 1 benign class and 20 distinct attack types. An advanced hybrid principal component analysis–Transformer (PCA–Transformer) architecture is employed for feature extraction and dimensionality reduction, with the Transformer performing the classification. Class imbalance is addressed using class weights, ADASYN, and ENN. The experimental results showcase remarkable performance, with the model reaching 99.80% accuracy in binary classification and 99.28% in multi-class classification on the combined CSE-CIC-IDS2018 and CICIDS2017 datasets, along with 99.66% accuracy for binary classification and 99.59% for multi-class classification on the CSE-CIC-IDS2018 dataset, 99.75% accuracy for binary classification and 99.51% for multi-class classification on the CICIDS2017 dataset, and 99.98% accuracy for binary classification and 98.01% for multi-class classification on the NF-BoT-IoT-v2 dataset. These outcomes underscore the model’s capability to detect a diverse range of classes, including benign traffic and various attack patterns, within an integrated detection framework. The principal contributions of this research are outlined as follows:

Proposing a novel approach to IDS by implementing a combined dataset framework using enhanced preprocessing and feature engineering, followed by the vertical concatenation of the CSE-CIC-IDS2018 and CICIDS2017 datasets, enabling the detection of 21 unique classes, including 1 benign class and 20 distinct attack types.
Adopting an enhanced hybrid PCA–Transformer model, where PCA performs feature extraction and dimensionality reduction, and the Transformer is responsible for classification, with class imbalance mitigated through the use of class weights, ADASYN, and ENN.
Conducting an evaluation on the combined CSE-CIC-IDS2018 and CICIDS2017 datasets, as well as on the CSE-CIC-IDS2018, CICIDS2017, and NF-BoT-IoT-v2 datasets individually, highlighting the superior performance of the proposed model in comparison to state-of-the-art approaches, while effectively covering a broader spectrum of network traffic classes, including benign and diverse attack types.
Validating the real-time effectiveness of the combined dataset system using the proposed model, accurately detecting and classifying benign traffic alongside multiple attack categories within a real-world IDS deployment.

This manuscript is structured in the subsequent manner: Section 2 furnishes a comprehensive survey of relevant scholarly works. Section 3 details the techniques employed in this inquiry. Section 4 displays the empirical findings, whereas Section 5 provides an in-depth evaluation of the results. Section 6 considers the constraints of the suggested approach. Section 7 concludes this research by emphasizing the key contributions and understandings. Finally, Section 8 proposes prospective avenues for subsequent investigation.

2. Related Work

This section outlines key developments in IDSs, spanning traditional machine learning, deep learning, Transformer-based models, dimensionality reduction, and hybrid approaches, as summarized in Table 1. It also highlights limitations in existing studies and presents the proposed solutions aimed at improving detection performance, particularly in binary and multi-class classification using a combined dataset.

2.1. Traditional Machine Learning for IDSs

Studies such as [8,9,10,11] have investigated how traditional machine learning models, including multilayer perceptrons (MLPs), random forest (RF), and extra trees (ETs), can enhance the performance of intrusion detection. These methods, tested on datasets like CSE-CIC-IDS2018 and UNSW-NB15, commonly leverage optimization algorithms and data preprocessing to improve their classification capabilities. Although they show strong performance in both simple and complex intrusion detection scenarios, their real-world application faces significant hurdles. These include problems with scalability, a limited ability to adapt to new threats, and the ongoing absence of real-time evaluation. These difficulties primarily arise from the models’ fixed designs, a lack of diverse training data, and the challenge of efficiently processing high volumes of network traffic.

2.2. Deep Learning-Based Intrusion Detection

Recent advancements in deep learning have notably enhanced intrusion detection across various network environments, as shown in previous studies [12,13,14,15,16]. These works have utilized models like convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and generative adversarial networks (GANs) to improve detection accuracy in both binary and multi-class classification. To overcome issues such as class imbalance and limited data diversity, techniques including data synthesis, ensemble modeling, and oversampling were employed. However, despite achieving high accuracy, these models frequently encounter significant drawbacks, including substantial computational overhead, the absence of real-time evaluation, scalability problems in high-speed traffic environments, and restricted generalizability due to the narrow scope of the datasets used.

2.3. Transformer-Based Models in IDS

Recent research [17,18,19,20,21] has increasingly focused on Transformer-based models for intrusion detection systems because they are effective at identifying long-term relationships and complex patterns in network traffic. These models have shown high accuracy in both binary and multi-class classification across different datasets and environments. Researchers have used techniques like attention mechanisms, positional embeddings, and encoder–decoder architectures to improve how features are represented and how precisely threats are detected. However, despite these promising outcomes, many of these models still face common challenges. These include high computational costs, limited real-time applicability, scalability issues with large amounts of data, class imbalance, and restricted generalizability due to either a lack of dataset diversity or insufficient resampling strategies.

2.4. Dimensionality Reduction and Feature Selection Techniques

Several studies [19,22,23] have investigated dimensionality reduction and feature selection techniques to improve intrusion detection performance while also reducing computational demands. Approaches such as PCA, Autoencoders, and stacked Autoencoders have shown significant promise in refining feature representation, boosting classification accuracy, and speeding up training processes. However, despite these benefits, most of these methods continue to face persistent challenges, including class imbalance, limitations in scalability when dealing with high-speed network traffic, restricted dataset diversity, and the absence of robust mechanisms for real-time deployment and generalization.

2.5. Hybrid Models for IDS

Numerous studies [22,23,24,25,26,27,28,29] have explored hybrid deep learning architectures to leverage the strengths of various models for better intrusion detection. These approaches often combine models like CNN-MLP, Transformer, deep neural network (DNN), Autoencoder-CNN, CNN-LSTM, Autoencoder-LSTM, and Autoencoder-support vector machine (SVM) and use ensemble methods that integrate both traditional and deep learning models. Together, these hybrids show substantial progress in capturing spatial, temporal, and contextual features, resulting in highly accurate classification for both binary and multi-class tasks. Despite these advancements, common limitations still exist across these works, including difficulties with real-time deployment, managing high-speed traffic, insufficient resampling for class imbalance, and limited generalization due to limited dataset diversity.

2.6. Challenges

Existing works on traditional machine learning, deep learning, Transformer-based architectures, dimensionality reduction techniques, or hybrid models commonly face several challenges, including detecting a wide range of attack types, achieving high performance, addressing class imbalance, ensuring generalization across multiple datasets, and maintaining scalability to handle large-scale environments and support real-time suitability. The challenges and the corresponding solutions proposed in this study are summarized in Table 2.

3. Methodology

While numerous prior studies have contributed to the development of IDSs, key challenges still need to be effectively addressed, including the detection of a wide range of attack types, achieving high performance, handling class imbalance, reducing feature dimensionality, ensuring scalability, supporting real-time evaluation, enabling both binary and multi-class classification, and achieving generalization across different datasets. The core of our approach is the construction of a unified dataset through the vertical concatenation of the CSE-CIC-IDS2018 and CICIDS2017 datasets, resulting in 21 distinct traffic classes, comprising one benign and 20 attack types. This addresses the challenge of detecting diverse attack scenarios. The PCA component is leveraged for feature extraction and dimensionality reduction, while the Transformer network is tasked with learning complex traffic behaviors and enabling scalable classification in both binary and multi-class contexts. To further enhance detection reliability and system robustness, the data preprocessing pipeline incorporates a hybrid outlier detection strategy that combines LOF and Z-score methods, directly contributing to the system’s high-performance capabilities. To mitigate class imbalance, we apply strategies such as class weights, ADASYN, and ENN. Furthermore, to ensure both generalization and scalability, the model is evaluated across large and diverse datasets, including the combined dataset, CSE-CIC-IDS2018, CICIDS2017, and NF-BoT-IoT-v2. This comprehensive evaluation demonstrates the model’s adaptability to various traffic environments and confirms its suitability for real-time intrusion detection. Figure 1 presents the overall system architecture, which demonstrates the flexibility and scalability of our approach, as detailed in the following subsections.

3.1. Dataset Description

Comprehensive detection of diverse attack types is achieved by employing a combined dataset constructed through the vertical concatenation of two benchmark datasets: CSE-CIC-IDS2018 and CICIDS2017. Together, these datasets provide coverage for 21 unique traffic classes, including 1 benign class and 20 distinct attack types. Each dataset is described separately in the following subsections to highlight their individual characteristics before presenting the combined dataset.

3.1.1. CSE-CIC-IDS2018 Dataset

Originating from a joint effort between the Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC), the CSE-CIC-IDS2018 dataset [30] was created to propel research in IDSs. This dataset, envisioned as a detailed and accurate resource, has become a key benchmark for evaluating the performance of IDS techniques. It was carefully designed to mimic real-world, varied cyberattacks within genuine network traffic, capturing the intricate characteristics of contemporary threats. By simulating complex operational network settings, it allows researchers and security experts to thoroughly test and improve the resilience and adaptability of IDS solutions. Gathered over ten continuous days, the dataset includes 80 feature attributes and covers 15 different classes (1 Benign and 14 attack categories) [31], providing a substantial basis for empirical studies and performance comparisons in cybersecurity research.

3.1.2. CICIDS2017 Dataset

Data structure and labeling are critical for network intrusion detection, a point thoroughly explored by Markus et al. [32]. This section introduces the CICIDS2017 dataset, a vital resource for our research. Provided by the CIC for academic use [33], it represents a recent and comprehensive dataset with 2,830,743 records, 79 network features, and 15 classes (including benign and 14 attack types) [34]. The dataset is organized into eight files covering five days of both benign and malicious activity, complete with metadata and available in packet-based and bidirectional flow-based formats [32,33,35].

3.2. Dataset Preprocessing

This section provides an overview of the preprocessing steps applied to the CSE-CIC-IDS2018 and CICIDS2017 datasets. The preprocessing procedures for each dataset are described separately to highlight their individual characteristics and challenges. Following this, the details of the preprocessing strategy applied to the combined dataset are presented, outlining how the data was harmonized and prepared for input into the proposed hybrid PCA–Transformer model.

3.2.1. CSE-CIC-IDS2018 Dataset

The CSE-CIC-IDS2018 dataset underwent preprocessing starting with the consolidation of all ten daily CSV files into a unified dataset for in-depth analysis. To eliminate non-informative features, those with only one unique value were discarded. Representative sampling was employed to create a balanced dataset suitable for computation. Outlier detection, utilizing both LOF and Z-score techniques, was performed to remove extreme values that could negatively impact model performance.

3.2.2. CICIDS2017 Dataset

The initial data preparation stage involved merging the dataset’s eight component files into a single, integrated structure. To ensure data soundness, redundant entries were removed, attributes exhibiting only a single unique value were discarded, and missing values (NaNs) were replaced with the mean of their respective features. Attribute labels were standardized by eliminating leading whitespace to maintain uniformity. Following this, a selection of instances was performed, and anomalous data points were detected and removed via LOF and Z-score methods to lessen the influence of extreme values on the model’s efficacy.

3.2.3. Combined Dataset (CSE-CIC-IDS2018 and CICIDS2017)

Upon completing the specific preprocessing for both the CSE-CIC-IDS2018 and CICIDS2017 datasets (as detailed in Section 3.2.1 and Section 3.2.2), feature engineering was implemented to establish consistency between them. This process involved renaming features with differing names but similar interpretations and generating new features in one dataset to correspond with those present in the other, thus creating a unified set of features across both datasets. Subsequently, the two datasets were stacked vertically to form a single, larger dataset. Any missing values in the newly created features were then imputed. The combined dataset was then separated into input features and their corresponding output labels, and the input features were normalized using MinMaxScaler (version 1.2.2). The data was then divided into training and testing sets. To address class imbalance, class weights were applied during model training, ensuring better learning across all class categories. Subsequently, model evaluation was performed to assess its effectiveness and generalization capabilities.

1. Applying Combined LOF and Z-score

A two-step outlier detection strategy was applied to the CSE-CIC-IDS2018 dataset to improve data quality by removing noise and statistically inconsistent samples. The process began by applying the LOF algorithm to the entire dataset to detect and remove outliers based on local density deviations. Subsequently, Z-score filtering was applied specifically to the Benign and Infiltration classes to further refine these subsets. This combined approach eliminated outliers while retaining essential patterns across diverse classes, including both benign and attack types. As shown in Table 3, substantial reductions were observed in certain classes, notably Benign, which dropped from 39,000 to 3379 samples, and Infiltration, which decreased from 22,000 to 6071. Smaller, volatile classes like SQL Injection and Brute Force-XSS were also affected, with their sample counts reduced from 85 to 46 and 229 to 203, respectively. In contrast, dominant attack categories such as DDoS attacks-LOIC-HTTP, DDOS attack-HOIC, and DoS attacks-Hulk experienced only minor decreases, demonstrating that the method preserved core malicious behaviors while ensuring a cleaner and more reliable dataset for model training and evaluation.

The number of samples before and after the application of the two-step outlier detection method, combining LOF and Z-score on the CICIDS2017 dataset, is shown in Table 4. The preprocessing began by applying the LOF algorithm to the entire dataset to detect and eliminate outliers based on local density deviations. After this global filtering step, Z-score filtering was applied specifically to the Benign class to further refine this subset by removing statistically inconsistent samples. Most attack classes, such as DDoS, DoS Hulk, and PortScan, experienced only minimal reductions in size: PortScan decreased slightly from 12,000 to 11,956 samples, and DDoS from 14,000 to 13,611. In contrast, the Benign class showed a substantial reduction from 99,000 to 25,637 samples, indicating the removal of redundant or noisy data. Similarly, smaller classes such as Infiltration and Web Attack-SqL Injection were refined from 36 to 24 and from 21 to 19 samples, respectively. These results highlight the robustness of the combined LOF and Z-score approach in enhancing dataset quality while preserving essential patterns across diverse classes, including both benign and attack types, thereby supporting more reliable model training and evaluation.
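The two-step filtering described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper name `lof_then_zscore`, the neighbor count of 20, the Z-score threshold of 3, and the synthetic data are all assumptions, since the text does not state these settings.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def lof_then_zscore(X, y, z_classes, n_neighbors=20, z_thresh=3.0):
    """Drop LOF outliers globally, then Z-score outliers within selected classes."""
    # Step 1: LOF on the whole dataset; fit_predict returns -1 for outliers.
    keep = LocalOutlierFactor(n_neighbors=n_neighbors).fit_predict(X) == 1
    X, y = X[keep], y[keep]

    # Step 2: Z-score filtering restricted to the named classes.
    mask = np.ones(len(y), dtype=bool)
    for cls in z_classes:
        idx = np.where(y == cls)[0]
        if idx.size == 0:
            continue
        sub = X[idx]
        std = sub.std(axis=0)
        std[std == 0] = 1.0  # constant columns never trigger the filter
        z = np.abs((sub - sub.mean(axis=0)) / std)
        mask[idx[(z > z_thresh).any(axis=1)]] = False
    return X[mask], y[mask]

# Synthetic stand-in for flow features (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[:5] += 12.0  # a few planted extreme rows in the Benign class
y = np.array(["Benign"] * 200 + ["DoS"] * 100)

X_clean, y_clean = lof_then_zscore(X, y, z_classes=["Benign"])
```

Restricting the second step to chosen classes mirrors the description above, where Z-score filtering is applied only to Benign (and, for CSE-CIC-IDS2018, Infiltration) so that small attack classes are not thinned further.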

2. Applying Feature Engineering

Feature engineering was conducted to align the CSE-CIC-IDS2018 and CICIDS2017 datasets for vertical concatenation. As detailed in Table 5, several features in the CSE-CIC-IDS2018 dataset were renamed to ensure consistency with the CICIDS2017 dataset. For example, Fwd Packets Length Total was renamed to Total Length of Fwd Packets, and Init Fwd Win Bytes was renamed to Init_Win_bytes_forward. Additionally, the features Destination Port and Fwd Header Length.1 were added to the CSE-CIC-IDS2018 dataset. To maintain uniformity, the Protocol feature was also added to the CICIDS2017 dataset. These adjustments ensured that both datasets shared the same feature structure, allowing for their seamless vertical integration.
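A small pandas sketch of this alignment step is given here for illustration. Only the two renames and three added features quoted above are included; the full mapping is in Table 5. The function name `align` and the one-row toy frames are hypothetical.

```python
import pandas as pd

# Renames quoted in the text (CSE-CIC-IDS2018 name -> CICIDS2017 name).
RENAME_2018 = {
    "Fwd Packets Length Total": "Total Length of Fwd Packets",
    "Init Fwd Win Bytes": "Init_Win_bytes_forward",
}
# Features present in only one dataset, added to the other as empty columns.
ADD_TO_2018 = ["Destination Port", "Fwd Header Length.1"]
ADD_TO_2017 = ["Protocol"]

def align(df_2018, df_2017):
    """Return copies of both frames with identical column names and order."""
    df_2018 = df_2018.rename(columns=RENAME_2018)  # rename returns a new frame
    df_2017 = df_2017.copy()
    for col in ADD_TO_2018:
        df_2018[col] = float("nan")  # filled later by imputation
    for col in ADD_TO_2017:
        df_2017[col] = float("nan")
    return df_2018[df_2017.columns], df_2017

# Toy frames with a handful of columns (hypothetical values).
raw_2018 = pd.DataFrame({
    "Fwd Packets Length Total": [10.0], "Init Fwd Win Bytes": [8192],
    "Protocol": [6], "Label": ["Bot"],
})
raw_2017 = pd.DataFrame({
    "Total Length of Fwd Packets": [12.0], "Init_Win_bytes_forward": [29200],
    "Destination Port": [80], "Fwd Header Length.1": [40], "Label": ["DDoS"],
})
a_2018, a_2017 = align(raw_2018, raw_2017)
```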

3. Combining (Concatenating Vertically) Datasets

The CSE-CIC-IDS2018 and CICIDS2017 datasets were combined by vertically concatenating their records after ensuring both shared an identical set of features through careful feature alignment and engineering. This integration resulted in a unified dataset containing a total of 21 unique class types, comprising 1 benign class and 20 distinct attack categories as shown in Table 6, providing a richer and more diverse foundation for training and evaluating intrusion detection models.
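Vertical concatenation amounts to stacking rows once the columns match. The sketch below uses two tiny hypothetical frames and assumes the label strings have already been harmonized, so that a class present in both sources is counted once.

```python
import pandas as pd

# Two already-aligned toy frames (hypothetical rows and labels).
df_2018 = pd.DataFrame({"Flow Duration": [100, 250], "Label": ["Benign", "Bot"]})
df_2017 = pd.DataFrame({"Flow Duration": [90, 300], "Label": ["Benign", "PortScan"]})

# axis=0 stacks rows; ignore_index gives the result a fresh 0..n-1 index.
combined = pd.concat([df_2018, df_2017], axis=0, ignore_index=True)

# Labels shared by both sources collapse into a single class.
n_classes = combined["Label"].nunique()
```

This is why the combined dataset has 21 classes rather than the sum of the two per-dataset class counts: classes that occur in both sources appear only once in the unified label set.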

4. Applying Imputation Techniques

The imputation techniques applied to address missing values in the combined dataset are detailed in Table 7. The newly created features were imputed using either constant values or the mean to maintain data consistency. For instance, in the CSE-CIC-IDS2018 dataset, the Destination Port feature was filled with a constant value of zero, representing unknown or missing entries. Similarly, Fwd Header Length.1 was imputed with the mean value of 127.866965 to handle missing data. In the CICIDS2017 dataset, the Protocol feature’s missing values were also replaced with zero, indicating unknown values. These imputation strategies ensured that all features, including the newly created ones, were complete and suitable for further analysis.
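The two imputation rules can be illustrated with pandas as follows. The toy values are hypothetical; in this sketch the mean is computed from the rows where Fwd Header Length.1 is observed, which is an assumption about how the reported mean of 127.866965 was obtained.

```python
import pandas as pd

nan = float("nan")
# Rows 0-1 mimic CSE-CIC-IDS2018 records, rows 2-3 mimic CICIDS2017 records.
combined = pd.DataFrame({
    "Destination Port": [nan, nan, 80.0, 443.0],
    "Fwd Header Length.1": [nan, nan, 40.0, 60.0],
    "Protocol": [6.0, 17.0, nan, nan],
})

# Constant imputation: zero stands for "unknown".
filled = combined.fillna({"Destination Port": 0, "Protocol": 0})

# Mean imputation: Series.mean() skips NaN, so only observed values contribute.
col = "Fwd Header Length.1"
filled[col] = filled[col].fillna(filled[col].mean())
```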

5. Normalization

Feature normalization was applied using the MinMaxScaler from the scikit-learn library to scale all numerical features within the range of 0 to 1. Since the dataset consists entirely of numerical features, the transformation was applied across all columns. This scaling technique works by subtracting the minimum value of each feature and then dividing by the range (i.e., the difference between the maximum and minimum values). Normalizing the features in this way ensures that no single attribute disproportionately influences the learning process and contributes to more stable and efficient performance of machine and deep learning models.
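As a quick check of the scaling rule just described, the following sketch computes (x - min) / (max - min) per column by hand and compares it with scikit-learn's MinMaxScaler. The 3x2 matrix is a made-up example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two hypothetical features on very different scales.
X = np.array([[1.0, 200.0],
              [3.0, 400.0],
              [5.0, 1000.0]])

# Manual min-max scaling, column by column.
col_min = X.min(axis=0)
col_range = X.max(axis=0) - col_min
manual = (X - col_min) / col_range

# Same transformation via scikit-learn (default feature_range is (0, 1)).
scaled = MinMaxScaler().fit_transform(X)
```

In this example the middle row maps to 0.5 and 0.25, since 3 lies halfway between 1 and 5 and 400 lies a quarter of the way between 200 and 1000.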

6. Splitting the Combined Dataset into Training and Test Sets

The combined dataset was partitioned into distinct training and testing subsets to facilitate model development and evaluation. Each of the 21 unique classes, including one benign category and 20 different attack types, was represented in both sets. As detailed in Table 8, the sample distribution was carefully preserved to maintain class representation throughout both phases. For example, the Benign class includes 24,808 samples in the training set and 4208 in the test set. Prominent attack types such as DoS attacks-Hulk and DDoS attacks-LOIC-HTTP have substantial allocations, with 57,162 and 31,583 samples in training, and 10,245 and 5554 in testing, respectively. Even less frequent classes like Heartbleed and SQL Injection are retained, although with smaller counts. The training set includes 9 and 57 samples for these classes, while the test set includes 2 and 8 samples, respectively. This stratified division supports comprehensive learning and evaluation, promoting the development of a model capable of effectively handling both benign traffic and attack patterns, including frequent and rare classes.
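A stratified split of this kind can be sketched with scikit-learn. The 15% test fraction is an assumption inferred from the quoted counts (for example, 4208 of 29,016 Benign samples fall in the test set); the exact ratio, the random seed, and the toy labels are not taken from the source.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced label vector (hypothetical class sizes).
y = np.array(["Benign"] * 60 + ["DoS"] * 120 + ["Heartbleed"] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y keeps each class's share roughly equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)

test_counts = {cls: int((y_te == cls).sum()) for cls in np.unique(y)}
```

Stratification is what keeps very small classes such as Heartbleed represented in both subsets, which a purely random split would not guarantee.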

7. Class Imbalance Mitigation

In the combined dataset, class imbalance was addressed using class weights during model training. This approach assigned higher importance to minority classes, ensuring the model remained sensitive to underrepresented class types, including both benign and attack categories, without altering the original data distribution. By incorporating class weights into the loss function, the system achieved more balanced classification performance across both common and rare classes.

Class Weights

The application of class weights in the binary classification setup for the combined dataset significantly contributed to addressing class imbalance. As shown in Table 9, the Normal class was assigned a higher weight of 4.87996, while the Attack class received a lower weight of 0.55708. This weighting scheme reflects the disproportionate distribution of samples between the two classes and ensures that the learning process emphasizes the minority (Normal) class more heavily. By integrating these weights into the loss function, the model was better able to distinguish between normal and attack traffic, ultimately enhancing its ability to detect rare but critical normal instances in heavily imbalanced network environments.

Class imbalance in the multi-class classification phase of the combined dataset was addressed through the use of class weights during model training, as shown in Table 10. These weights were assigned based on the inverse frequency of each class, giving significantly higher importance to underrepresented attacks such as Heartbleed (1281.07937), FTP-BruteForce (288.24286), and SQL Injection (202.27569). Conversely, more prevalent classes like DoS attacks-Hulk (0.20170), Bot (0.47638), and DDoS attacks-LOIC-HTTP (0.36506) received lower weights. This strategy ensured balanced model learning, improving its ability to detect rare yet critical classes while maintaining robust performance across the full spectrum of network threats.
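The reported weights appear consistent with the common "balanced" inverse-frequency heuristic, w_c = N / (K · n_c), where N is the number of training samples, K the number of classes, and n_c the size of class c. The sketch below is an illustration under that assumption. The per-class counts come from the text, but the training-set total of 242,124 is not stated there; it is inferred here so that the formula matches the quoted weights.

```python
def balanced_weight(n_class, n_total, n_classes):
    """Inverse-frequency weight: n_total / (n_classes * n_class)."""
    return n_total / (n_classes * n_class)

# ASSUMPTION: total training size inferred from the reported weights.
N_TOTAL = 242_124

# Binary setup: 24,808 Benign training samples, the remainder treated as Attack.
binary_normal = balanced_weight(24_808, N_TOTAL, 2)
binary_attack = balanced_weight(N_TOTAL - 24_808, N_TOTAL, 2)

# Multi-class setup with 21 classes; counts quoted in the text.
multi_hulk = balanced_weight(57_162, N_TOTAL, 21)   # DoS attacks-Hulk
multi_heartbleed = balanced_weight(9, N_TOTAL, 21)  # Heartbleed
```

Under this assumption the formula gives approximately 4.87996 and 0.55708 for the binary case and approximately 0.20170 and 1281.079 for DoS attacks-Hulk and Heartbleed, in line with the values quoted from Tables 9 and 10.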

3.3. Proposed Model

The proposed hybrid PCA–Transformer model enhances intrusion detection by first applying PCA to reduce feature dimensionality and improve efficiency. The transformed features are then classified using a Transformer, which captures complex interactions through self-attention. This two-stage approach aims to boost performance, speed up training, and generalize well across different network traffic datasets.

Principal Component Analysis–Transformer (PCA–Transformer)

Initially, feature extraction and dimensionality reduction are performed on the training dataset using PCA. The PCA model is fitted on the training data to retain nearly all of the variance by selecting an appropriate number of principal components. After fitting, both the training and test datasets are transformed into this new feature space composed of principal components, which significantly lowers the dimensionality while preserving the essential information. This transformation helps streamline the subsequent modeling process by providing a compact and informative representation of the original data. Reduced-dimension features, resulting from PCA transformation, are fed into the Transformer architecture, which excels at modeling complex sequential data patterns. An input layer first receives these feature sequences. The central Transformer module utilizes multi-head attention, followed by layer normalization and a residual connection to preserve critical information. Subsequently, a feed-forward network (FFN), composed of two dense layers, one with ReLU activation and the other linear, is applied. This component also includes a residual connection and layer normalization to enhance training stability. The final output from the Transformer block is then flattened and passed to a dense layer for classification. For binary classification tasks, this layer features a single neuron with sigmoid activation, whereas for multi-class scenarios, it incorporates multiple neurons with softmax activation to predict probabilities for all relevant classes.

PCA reduces data dimensionality by projecting centered data onto principal components that capture the maximum variance, as expressed in Equation (1) [36].

Z = (X - Y) W

(1)

X is the original data matrix, and Y is the mean vector computed from the training data used to center the data. The matrix W contains the principal components (eigenvectors) selected to capture the majority of the data variance. This operation reduces the data’s dimensionality by projecting it onto a lower-dimensional space, resulting in the transformed data Z.
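The fit-on-train, transform-both procedure behind Equation (1) can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the 0.99 variance target, the function names `fit_pca` and `pca_transform`, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def fit_pca(X_train, variance_to_keep=0.99):
    """Return the training mean Y and component matrix W of Equation (1)."""
    Y = X_train.mean(axis=0)                 # mean vector from training data only
    Xc = X_train - Y                         # centered training data
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = s**2 / np.sum(s**2)
    # Smallest k whose cumulative explained variance reaches the target.
    k = int(np.searchsorted(np.cumsum(ratio), variance_to_keep)) + 1
    k = min(k, Vt.shape[0])
    return Y, Vt[:k].T

def pca_transform(X, Y, W):
    """Z = (X - Y) W, applied identically to training and test data."""
    return (X - Y) @ W

# Synthetic, correlated features (hypothetical data for illustration only).
rng = np.random.default_rng(0)
mixing = rng.normal(size=(10, 10))
X_train = rng.normal(size=(200, 10)) @ mixing
X_test = rng.normal(size=(50, 10)) @ mixing

Y, W = fit_pca(X_train)
Z_train = pca_transform(X_train, Y, W)
Z_test = pca_transform(X_test, Y, W)
print(Z_train.shape, Z_test.shape)
```

Reusing the training mean and components for the test set keeps test statistics out of the fitted projection.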

Multi-head attention captures interactions among the input features through scaled dot-product attention, as given in Equation (2) [37].

Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V

(2)

Here, Q is the query matrix, K is the key matrix, and V is the value matrix. The scalar d_{k} is the dimensionality of the key vectors; dividing the scores by \sqrt{d_{k}} keeps them in a range where the softmax does not saturate.
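As a minimal sketch, scaled dot-product attention in its standard form [37] computes softmax weights from the query–key scores and applies them to V. The helper names and the toy input sizes below are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; returns the output and the weights."""
    d_k = K.shape[-1]                                  # key dimensionality
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    return weights @ V, weights

# Toy self-attention over 5 positions with 4-dimensional vectors (hypothetical sizes).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))
out, weights = attention(tokens, tokens, tokens)
print(out.shape, weights.sum(axis=-1))
```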

The attention mechanism’s computation for each individual attention head is executed in isolation. Subsequently, the resultant outputs from these independent computations are consolidated, as formally described by Equation (3) [37].

MultiHead (Q, K, V) = Concat (head_{1}, \dots, head_{h}) W^{O}

(3)

Here, W^{O} is the output weight matrix, which projects the concatenated head outputs into the output of the attention layer.
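The per-head computation and the final projection can be sketched as below. The head count (8) and key dimension (64) are borrowed from the baseline Transformer description in Section 4.2 purely for illustration; the input size, the random weights, and the function names are assumptions, not the configuration reported in Table 12.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Self-attention with num_heads independent heads, concatenated and projected."""
    seq_len = X.shape[0]

    def split(M):
        # (seq_len, h * d_k) -> (h, seq_len, d_k)
        return M.reshape(seq_len, num_heads, -1).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))   # per-head weights
    heads = weights @ V                                          # (h, seq_len, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, -1)       # Concat(head_1..head_h)
    return concat @ Wo                                           # output projection

# Hypothetical sizes: 6 positions, model width 4, 8 heads of key dimension 64.
rng = np.random.default_rng(0)
seq_len, d_model, h, d_k = 6, 4, 8, 64
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, h * d_k)) * 0.1 for _ in range(3))
Wo = rng.normal(size=(h * d_k, d_model)) * 0.1
mha_out = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=h)
print(mha_out.shape)
```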

As given in Equation (4) [38], layer normalization standardizes the activations of each layer, keeping their distribution consistent and stabilizing training.

LayerNorm (x) = \frac{x - μ}{σ} \times γ + β

(4)

Here, μ is the mean of the inputs and σ is their standard deviation, while γ and β are learnable scaling and shifting parameters.

The FFN then transforms the outputs of the attention mechanism, as given in Equation (5) [37].

FFN (x) = \max (0, x W_{1} + b_{1}) W_{2} + b_{2}

(5)

Here, W_{1} and W_{2} are the weight matrices, and b_{1} and b_{2} are the associated bias vectors of the two dense layers.
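The feed-forward sublayer described above (a ReLU layer, a linear layer, a residual connection, and layer normalization) can be sketched in NumPy. The small `eps` term added to σ for numerical safety, the layer widths, and the random weights are assumptions for this example and are not taken from the paper.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize over the feature axis, then scale by gamma and shift by beta."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps) * gamma + beta   # eps is an added safeguard

def ffn(x, W1, b1, W2, b2):
    """Two dense layers: ReLU on the first, no activation on the second."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def ffn_sublayer(x, W1, b1, W2, b2, gamma, beta):
    """Residual connection around the FFN, followed by layer normalization."""
    return layer_norm(x + ffn(x, W1, b1, W2, b2), gamma, beta)

# Hypothetical sizes: 6 positions, width 8, hidden width 16.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)
gamma, beta = np.ones(8), np.zeros(8)
block_out = ffn_sublayer(x, W1, b1, W2, b2, gamma, beta)
print(block_out.shape)
```

The second layer maps back to the input width so that the residual addition is well defined.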

(i): Binary Classification

In this study, the classification pipeline begins with 69 input features for the combined dataset, 67 features for the CSE-CIC-IDS2018 dataset, 68 for the CICIDS2017 dataset, and 19 features for the NF-BoT-IoT-v2 dataset. As shown in Table 11, dimensionality reduction and feature extraction are first performed using PCA, which transforms the raw input features into a more compact form while preserving nearly all the original variance. The PCA model is fitted on the training data and subsequently applied to both training and test sets, resulting in 59 principal components that capture the most informative aspects of the input. The resulting PCA-transformed features serve as the input to the Transformer model, detailed in Table 12. At its core, the Transformer architecture is designed to understand complex data relationships through its multi-head attention mechanism, enabling simultaneous focus on various input segments. To stabilize learning and retain crucial information, this process is augmented by layer normalization and a residual connection. A feed-forward block, comprising two fully connected layers (one with ReLU activation, the other without), is then applied to enhance feature representation. An additional residual connection, along with a normalization layer, helps maintain training stability. Finally, for binary classification, the model’s dense output layer uses a single neuron with a sigmoid activation function to indicate the probability of an input being either an attack or benign. This PCA–Transformer setup offers robust and efficient classification by merging dimensionality-reduced data with a potent sequence-modeling design.

(ii): Multi-class Classification

This study’s classification pipeline for network traffic data starts with initial inputs of either 69 features (for the combined dataset), 68 features (for the CSE-CIC-IDS2018 or CICIDS2017 datasets), or 18 features (for the NF-BoT-IoT-v2 dataset). The first stage involves PCA, as illustrated in Table 11, which reduces dimensionality and extracts key features, compressing the input into 59 principal components that retain nearly all original variance. This PCA output then feeds into our Transformer model. The Transformer’s core strength lies in its multi-head attention mechanism, enabling it to simultaneously focus on various input segments to model complex data relationships. Layer normalization and a residual connection stabilize the learning process and prevent information loss. Following this, a feed-forward block, comprising two fully connected layers (one with ReLU activation, the other without), further refines the feature representation. An additional residual connection and normalization help secure training stability, as detailed in Table 12. Finally, for multi-class classification, the model’s dense output layer uses a softmax activation function to predict class probabilities. This layer has 21 units for the combined system, 15 for the CSE-CIC-IDS2018 or CICIDS2017 datasets, and 5 for the NF-BoT-IoT-v2 dataset. Overall, this PCA–Transformer setup provides robust and efficient classification by combining a reduced-dimensionality input with a sophisticated sequence-modeling design.

(iii): Configuration of Hyperparameters for the PCA–Transformer Model

Dimensionality reduction was first performed using PCA, and the resulting transformed features were subsequently used with the Transformer model. The following hyperparameters were configured specifically for the Transformer to optimize performance for both binary and multi-class classification tasks. A consistent batch size of 128 was utilized, and the Adam optimization algorithm was selected for its efficient parameter updates. The learning rate was dynamically managed by a ReduceLROnPlateau scheduler, commencing at 0.001 and halving when the validation error plateaued, with a minimum value of 1 × 10⁻⁵. The primary differentiation between the two-class and multi-class setups resided in the loss function: binary cross-entropy for two-class tasks and categorical cross-entropy for multi-class tasks. Accuracy consistently served as the chosen performance metric across both scenarios. This adaptable parameter scheme, mirroring the model’s settings, aimed to efficiently fine-tune the models’ capacity for learning and categorizing information.
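The learning-rate schedule described above can be illustrated with a simplified re-implementation of the plateau rule in plain Python. The starting rate (0.001), halving factor, and floor (1 × 10⁻⁵) come from the text; the patience value and the validation-loss sequence are assumptions, and the sketch omits options such as cooldown that the Keras callback provides.

```python
def plateau_schedule(val_losses, lr=1e-3, factor=0.5, patience=2, min_lr=1e-5):
    """Return the learning rate in effect after each epoch's validation loss."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0          # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:          # plateau detected: halve, respecting the floor
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Hypothetical validation losses that stop improving after the second epoch.
lr_history = plateau_schedule([1.0, 0.8, 0.8, 0.8, 0.8, 0.8])
print(lr_history)

# A long plateau drives the rate down to the minimum and no further.
long_history = plateau_schedule([1.0] * 40)
print(min(long_history))
```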

4. Results and Experiments

This section describes the datasets, the configurations of the compared models, the experimental setup, the evaluation metrics, and the comparative results of the proposed model against baseline models across four datasets: CSE-CIC-IDS2018, CICIDS2017, their combination, and NF-BoT-IoT-v2. It further includes an ablation study, computational metrics analysis, a comparative evaluation of Transformer variants, and insights into the model’s decision behavior using LIME for interpretability.

4.1. Dataset Characteristics and Preprocessing Overview

This section details the datasets employed in this study, namely, NF-BoT-IoT-v2, CSE-CIC-IDS2018, CICIDS2017, and a combined version of the latter two. It also outlines the preprocessing procedures applied to ensure the datasets are clean, consistent, and well-prepared for reliable model training and evaluation.

NF-BoT-IoT-v2 Dataset

The NF-BoT-IoT-v2 dataset contains over 37 million network flow instances, predominantly malicious (99.64%) with a small benign portion (0.36%), classified into one benign and four attack types [39,40]. A thorough preprocessing pipeline was applied, including the removal of duplicate records, anomaly detection using Z-Score and LOF, feature selection, MinMaxScaler normalization, and PCA for dimensionality reduction. Class imbalance was addressed using ADASYN oversampling, ENN undersampling, and class weights during training to ensure balanced model development.
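Two of the preprocessing steps named above, Z-score outlier removal and min-max normalization, can be sketched in NumPy. The threshold of 3 standard deviations, the function names, and the synthetic data are assumptions for illustration; the LOF step, feature selection, and resampling are not shown.

```python
import numpy as np

def zscore_filter(X, y, threshold=3.0):
    """Drop rows in which any feature lies more than `threshold` std from its mean."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)            # guard against constant columns
    keep = (np.abs((X - mu) / sd) <= threshold).all(axis=1)
    return X[keep], y[keep]

def fit_minmax(X):
    """Return per-feature minimum and range for scaling to [0, 1]."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span = np.where(span == 0, 1.0, span)      # guard against constant columns
    return lo, span

def apply_minmax(X, lo, span):
    return (X - lo) / span

# Hypothetical data with one injected outlier row.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = rng.integers(0, 2, size=500)
X[0] = 1000.0

X_clean, y_clean = zscore_filter(X, y)
lo, span = fit_minmax(X_clean)
X_scaled = apply_minmax(X_clean, lo, span)
print(len(X), len(X_clean), X_scaled.min(), X_scaled.max())
```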

CSE-CIC-IDS2018 Dataset

The CSE-CIC-IDS2018 dataset, developed by CSE and CIC, comprises realistic network traffic with 80 features and 15 classes, including benign traffic and 14 attack types collected over ten days [30,31]. Preprocessing included merging daily files, removing irrelevant features, outlier elimination via LOF and Z-score, normalization using MinMaxScaler, and dataset splitting into training and testing sets. Class weights were applied during training to address class imbalance and improve model performance.

CICIDS2017 Dataset

The CICIDS2017 dataset, provided by the CIC, contains 2.83 million records with 79 features and 15 classes, including benign traffic and 14 attack types collected over five days in packet- and flow-based formats [32,33,34,35]. Preprocessing included merging the files, removing redundant entries, discarding attributes that take only a single value, replacing missing values (NaNs) with the mean of the respective feature, and removing outliers using the LOF and Z-score methods. MinMaxScaler normalization was then applied to scale features uniformly. To address class imbalance, class weights were applied for binary classification, and ADASYN, ENN, and class weights were used for multi-class classification to enhance model effectiveness.

Combined Dataset

The preprocessed CSE-CIC-IDS2018 and CICIDS2017 datasets were aligned and merged through feature engineering to ensure compatibility, as detailed in Section 3.2, enabling the detection of 21 unique classes, comprising 1 benign class and 20 distinct attack types. Missing values were imputed, and the combined data was normalized using the MinMaxScaler, then split into training and testing sets. To address class imbalance, class weights were applied during training. The model was subsequently evaluated to assess performance and generalization.
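The text does not state how the class weights were computed. One common choice, shown here only as an assumption, is the inverse-frequency ("balanced") heuristic, in which each class receives the weight n_samples / (n_classes × class_count). The label array below is hypothetical.

```python
import numpy as np

def balanced_class_weights(y):
    """Inverse-frequency weights: rarer classes receive larger weights."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Hypothetical imbalanced labels: 90 samples of class 0 and 10 of class 1.
y_labels = np.array([0] * 90 + [1] * 10)
class_weights = balanced_class_weights(y_labels)
print(class_weights)
```

A dictionary of this form can be passed to a training routine that accepts per-class weights, so that errors on minority classes contribute more to the loss.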

4.2. Configuration and Hyperparameter Overview of Compared Models

This section presents the architectures and hyperparameter settings of the CNN, Autoencoder, MLP, and Transformer models [41,42,43], all carefully configured to ensure a fair and consistent comparison under identical experimental conditions.

Convolutional Neural Network (CNN)

The CNN model processes all input features through a 1D convolutional layer (32 filters, kernel size 3, ReLU), followed by a flattening layer. A single output unit with sigmoid activation is used for binary classification, while a softmax layer is used for multi-class classification with 21, 15, and 5 units for the combined, CSE-CIC-IDS2018/CICIDS2017, and NF-BoT-IoT-v2 datasets, respectively [44,45].

Autoencoder

The Autoencoder is initially trained in an unsupervised manner to compress input features using an encoder with three ReLU-activated layers (128, 64, 32) and a decoder comprising layers (64, 128, input), where the first two layers use ReLU activation and the third layer employs sigmoid activation. After training, the decoder is removed and the encoder is used as a feature extractor linked to a classification layer [46,47,48,49]. For binary classification, the output layer uses a sigmoid activation with one unit, while for multi-class classification, it uses softmax activation with 21, 15, and 5 units for the combined, CSE-CIC-IDS2018/CICIDS2017, and NF-BoT-IoT-v2 datasets, respectively [22,23,24].

Multilayer Perceptron (MLP)

The MLP model consists of an input layer, a single hidden layer (Dense: 64, ReLU), and an output layer tailored to the task: a sigmoid unit for binary classification and a softmax layer for multi-class classification, with output sizes of 21, 15, and 5 for the combined, CSE-CIC-IDS2018/CICIDS2017, and NF-BoT-IoT-v2 datasets, respectively. This structure supports versatile classification across multiple network traffic datasets [44].

Transformer

The Transformer model uses multi-head attention (8 heads, 64-dimension key) to extract complex patterns, supported by normalization and a residual connection for stable training [23,24]. A feed-forward block with two dense layers (128 units with ReLU and 128 units without activation) further refines the data. This is followed by a residual connection and normalization to enhance learning dynamics. The output layer applies sigmoid activation with one unit for binary classification and softmax for multi-class classification, with 21, 15, and 5 units for the combined, CSE-CIC-IDS2018/CICIDS2017, and NF-BoT-IoT-v2 datasets, respectively [37,50].

Configuration of Hyperparameters for Models

Models were tuned for optimal performance using a batch size of 128 and the Adam optimizer. A dynamic learning rate started at 0.001 and was reduced when validation error plateaued. Binary cross-entropy was used for binary classification, and categorical cross-entropy for multi-class tasks. Accuracy was the primary metric, supporting efficient and adaptive learning across both classification types, as shown in Table 13.

4.3. Experiment’s Establishment

The models were implemented with TensorFlow and Keras in the Kaggle environment. The experiments were conducted on hardware featuring an Nvidia GeForce RTX 1050 graphics card and running Windows 10.

4.4. Evaluation Metrics

Model evaluation is vital for gauging our approach’s effectiveness. In this study, we used standard metrics like accuracy, precision, recall, and F1-score to assess performance. To assess the generalization capability of the proposed model in comparison with baseline models, evaluation was carried out using a combined dataset comprising CSE-CIC-IDS2018 and CICIDS2017, with additional evaluations performed on each dataset individually, as well as on the NF-BoT-IoT-v2 dataset. These metrics were calculated from true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), with their respective formulas shown in Equations (6)–(9) [51,52,53,54,55].

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(6)

Precision = \frac{TP}{TP + FP}

(7)

Recall = \frac{TP}{TP + FN}

(8)

F\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}

(9)
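Equations (6)–(9) translate directly into code. The sketch below uses hypothetical counts chosen only to make the arithmetic easy to follow; they are not results from this study.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }

# Hypothetical counts for illustration.
metrics = binary_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```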

4.5. Results

The proposed model, in comparison to the baseline models, was evaluated using the combined CSE-CIC-IDS2018 and CICIDS2017 datasets, which comprise 21 distinct classes (one benign, twenty attack types). We also conducted individual assessments on the CSE-CIC-IDS2018, CICIDS2017, and NF-BoT-IoT-v2 datasets to confirm the model’s generalizability. This study investigated both binary and multi-class classification scenarios. Importantly, the PCA–Transformer model demonstrated outstanding performance in class identification and categorization. The CNN, Autoencoder, MLP, and Transformer models likewise yielded strong results. These outcomes underscore the efficiency of our data preprocessing techniques and the resilience of the proposed model in managing intricate classification challenges.

(i): Binary Classification

The binary classification performance of various models across the NF-BoT-IoT-v2, CSE-CIC-IDS2018, CICIDS2017, and combined dataset is presented in Table 14. On the NF-BoT-IoT-v2 dataset, the PCA–Transformer model outperformed all other models, achieving 99.98% across accuracy, precision, recall, and F-score. It surpassed the Transformer (99.92%), Autoencoder (99.94%), MLP (99.81%), and CNN (99.73%) models, demonstrating its superior ability to detect binary threats in IoT-based environments. For the CSE-CIC-IDS2018 dataset, the PCA–Transformer again led with 99.66% accuracy and 99.71% precision, slightly outperforming the Transformer and Autoencoder models, which reached 99.64% accuracy, and maintaining a marginal lead over the MLP (99.62%) and CNN (99.61%) models. In the CICIDS2017 dataset, the PCA–Transformer model achieved the highest accuracy at 99.75%, followed by the Transformer (99.72%) and Autoencoder (99.70%), with MLP and CNN trailing slightly behind. On the combined dataset, the PCA–Transformer continued to demonstrate top-tier performance, recording 99.80% accuracy, ahead of the Autoencoder and Transformer (both at 99.77%), CNN (99.76%), and MLP (99.75%). These consistent results across all datasets highlight the PCA–Transformer model’s strong generalization ability and its reliability for accurate and balanced binary intrusion detection across diverse network environments.

(ii): Multi-Class Classification

The multi-class classification performance of various models across the NF-BoT-IoT-v2, CSE-CIC-IDS2018, CICIDS2017, and combined dataset is summarized in Table 15. On the NF-BoT-IoT-v2 dataset, the PCA–Transformer model achieved the highest overall performance, reaching 98.01% accuracy, 98.06% precision, 98.01% recall, and 98.01% F-score. It slightly outperformed the Transformer and Autoencoder models, which both recorded 97.97–98.02% across the metrics, while CNN and MLP followed closely behind. On the CSE-CIC-IDS2018 dataset, the PCA–Transformer again led with 99.59% accuracy and 99.68% precision, while the MLP and Transformer performed similarly with 99.56–99.57% accuracy. Although MLP achieved slightly higher precision, it showed a marginally lower F1-score compared to the Transformer. CNN and Autoencoder also showed competitive results, with accuracy ranging from 99.52% to 99.53%. For the CICIDS2017 dataset, the PCA–Transformer achieved the highest accuracy at 99.51% and an F-score of 99.52%, followed by the Transformer with an accuracy of 99.45% and the Autoencoder at 99.37%. MLP and CNN trailed with accuracy scores ranging from 98.65% to 99.04%. On the combined dataset, the PCA–Transformer model maintained its top position with 99.28% accuracy and 99.51% precision, marginally surpassing the Transformer (99.23% accuracy) and Autoencoder (99.22%), while CNN and MLP followed closely. These results consistently affirm the PCA–Transformer model’s superior performance and generalization capability in multi-class intrusion detection across diverse and complex network traffic datasets.

4.6. Ablation Analysis on Component-Wise Enhancements in the PCA–Transformer Model

The ablation study presented in Table 16 evaluates the individual and combined contributions of PCA and Transformer components to the overall performance of the PCA–Transformer model across binary and multi-class classification tasks on multiple datasets. The procedure involves three configurations: using PCA alone, Transformer alone, and their integration as the PCA–Transformer model. Results show that PCA on its own yielded the lowest performance across all datasets, with accuracy values ranging between 97.80% and 98.61% for binary classification, and from 94.22% to 97.21% for multi-class classification. This limitation stems from PCA’s primary role in feature dimensionality reduction without learning capabilities. Using the Transformer alone significantly improved results. Accuracy increased to 99.64–99.92% in binary tasks and to 97.97–99.57% in multi-class tasks, depending on the dataset. This improvement reflects the Transformer’s strong capacity to capture temporal dependencies and intricate patterns in network traffic data. The integration of both components in the PCA–Transformer model led to the highest observed performance across all classification types and datasets. For binary classification, accuracy improved significantly, reaching 99.98% on NF-BoT-IoT-v2, 99.66% on CSE-CIC-IDS2018, 99.75% on CICIDS2017, and 99.80% on the combined dataset. In multi-class classification, the PCA–Transformer also recorded top-tier performance with 98.01% accuracy on NF-BoT-IoT-v2, 99.59% on CSE-CIC-IDS2018, 99.51% on CICIDS2017, and 99.28% on the combined dataset. This configuration not only boosted accuracy but also consistently enhanced precision, recall, and F-score across all evaluations, confirming that the PCA–Transformer model effectively combines the dimensionality reduction benefits of PCA with the powerful feature representation and sequence modeling capabilities of the Transformer.

4.7. Inference Time, Training Time, and Memory Consumption

The performance evaluation of the PCA–Transformer model includes a detailed analysis of inference time, training time, and memory consumption to assess its scalability and suitability for real-world deployment. These metrics were measured across four benchmark datasets under both binary and multi-class classification settings, offering insights into the model’s efficiency, responsiveness, and resource demands in practical intrusion detection scenarios.

(i): Inference time

The PCA–Transformer model exhibited consistently low inference times across all datasets and classification types, as depicted in Table 17, making it highly suitable for deployment in real-world environments where rapid decision-making is critical. Inference time per batch of 128 samples ranged from 0.077 to 0.084 s, translating to a per-sample latency of just 0.60 to 0.66 milliseconds. This efficiency ensures that the model can process network traffic in near real time, supporting timely detection and response to intrusions, which is essential for practical intrusion detection systems operating in dynamic and large-scale network environments.
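The per-sample latency follows from dividing the batch time by the batch size of 128. The short sketch below reproduces that arithmetic from the two reported batch times; the derived throughput figures are an added illustration, not values reported in the paper.

```python
BATCH_SIZE = 128

def per_sample_ms(batch_seconds, batch_size=BATCH_SIZE):
    """Convert a per-batch time in seconds to a per-sample time in milliseconds."""
    return batch_seconds / batch_size * 1000.0

fastest_ms = per_sample_ms(0.077)
slowest_ms = per_sample_ms(0.084)
print(f"per-sample latency: {fastest_ms:.2f} to {slowest_ms:.2f} ms")

# Implied throughput in samples per second at each end of the range.
print(f"throughput: {BATCH_SIZE / 0.084:.0f} to {BATCH_SIZE / 0.077:.0f} samples/s")
```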

(ii): Training time

The PCA–Transformer model demonstrated efficient training times across all datasets and classification types, as shown in Table 18, reinforcing its practicality for real-world applications where rapid model retraining or updates are necessary. Training time per batch of 128 samples remained consistently low, ranging from approximately 0.078 to 0.085 s. This translates to a per-sample training duration between 0.61 and 0.67 milliseconds. Such minimal training overhead enables the model to be retrained or fine-tuned swiftly, which is particularly valuable in dynamic network environments where new threats continuously emerge and timely adaptation is essential for maintaining effective intrusion detection.

(iii): Memory consumption

The memory consumption analysis of the PCA–Transformer model across different datasets and classification tasks demonstrates its efficiency and scalability, as shown in Table 19. For binary classification, the model maintained minimal memory usage, requiring only 0.25 MB per batch on the NF-BoT-IoT-v2 dataset and slightly higher values of 0.34 MB to 0.35 MB on CSE-CIC-IDS2018, CICIDS2017, and the combined dataset. In multi-class classification scenarios, memory requirements increased moderately due to the added complexity, with consumption ranging from 0.29 MB on NF-BoT-IoT-v2 to 0.74 MB and 0.73 MB on CSE-CIC-IDS2018 and CICIDS2017, respectively. The combined dataset exhibited the highest memory usage at 0.93 MB per batch. These results reflect the PCA–Transformer model’s ability to handle varying task complexities while maintaining efficient resource utilization.

4.8. Comparative Analysis of Transformer Variants and Lightweight Attention Mechanisms for Performance and Scalability

A comparative analysis of different Transformer variants, namely, Linformer, Performer, and the standard Transformer, on binary and multi-class classification tasks over the combined dataset highlights notable differences in both performance and scalability, as shown in Table 20. In binary classification, all three models delivered exceptionally high accuracy, with the standard Transformer achieving 99.77%, Performer at 99.76%, and Linformer at 99.75%. Despite the close accuracy levels, Linformer exhibited the lowest inference time (0.0735 s per batch) and minimal memory consumption (0.27 MB), demonstrating its suitability for deployment in resource-constrained environments. Performer maintained a strong balance, offering competitive accuracy with slightly higher inference and training times while still keeping memory usage efficient. In multi-class classification, the standard Transformer continued to lead in accuracy at 99.23%, although it required significantly more memory (0.92 MB per batch). Conversely, Linformer and Performer offered improved scalability with reduced memory demands (0.29 MB and 0.27 MB, respectively) and faster processing, albeit at the expense of lower accuracy (95.31% and 96.06%). These results underscore the trade-offs between model performance and computational efficiency when selecting attention mechanisms for real-world intrusion detection systems.

4.9. Explaining Model Decisions Using LIME for the PCA–Transformer Model

The LIME for the PCA–Transformer model’s binary classification reveals how specific features influenced the decision to classify a given instance as normal with a prediction probability of 1.00, while the attack class received 0.00 (Table 21). As detailed in Table 22, several TCP flag-related features played a dominant role in this prediction. ACK Flag Count had the highest positive contribution weight (0.386353), followed by PSH Flag Count (0.242650), Fwd PSH Flags (0.230619), FIN Flag Count (0.221099), and SYN Flag Count (0.202133). Despite all these features having normalized values of 0.000000, their presence in the model’s learned patterns significantly supported the classification of the instance as normal. Other features such as ECE Flag Count, RST Flag Count, and Bwd IAT Max also contributed positively, whereas Active Std and Fwd Packets/s exerted negative influence. These insights demonstrate the model’s sensitivity to subtle traffic characteristics, particularly TCP control flags, and underline its interpretability in differentiating between benign and malicious traffic.

The decision interpretability of the PCA–Transformer model in multi-class classification was examined using LIME to understand its rationale for identifying a specific instance as belonging to the “Heartbleed” class with a prediction probability of 1.00, while assigning 0.00 to all other classes (Table 23). The explanation provided in Table 24 highlights the most influential features that contributed to this decision. Among them, Fwd Header Length.1 (normalized value: 0.997014) had the highest contribution weight (0.002998), followed by Fwd Packet Length Max, Total Backward Packets, Subflow Bwd Bytes, and Total Length of Bwd Packets, each contributing positively with weights above 0.0015. Additional important attributes included Subflow Fwd Packets, Avg Bwd Segment Size, and Bwd Packet Length Mean. Although the individual contribution weights were relatively small, their combined effect guided the model toward a confident classification. These features, primarily related to packet lengths and flow statistics, reflect characteristic patterns associated with Heartbleed traffic, illustrating the model’s capacity to leverage nuanced network behaviors for accurate and interpretable threat detection.

5. Discussion

This section provides a comprehensive evaluation of the PCA–Transformer model across four distinct datasets: CSE-CIC-IDS2018, CICIDS2017, their combined dataset, and NF-BoT-IoT-v2. Each dataset is analyzed individually to assess the model’s performance in both binary and multi-class classification scenarios. Confusion matrices are used to illustrate the model’s classification capabilities. In binary classification, the focus is on distinguishing benign from malicious traffic, while in multi-class classification, the model’s ability to differentiate between benign traffic and multiple attack types is examined. Furthermore, class-level performance metrics are explored to demonstrate the model’s robustness and adaptability across diverse network environments. In addition, real-time testing using the proposed model was performed to evaluate its ability to deliver swift and accurate intrusion detection under live network conditions.

5.1. Binary Classification

The confusion matrix for the NF-BoT-IoT-v2 dataset, shown in Figure 2a, offers a clearer insight into the model’s binary classification performance. The model correctly identified 1272 normal instances, with only one misclassified as an attack. For the Attack class, 10,787 instances were accurately classified, and only two were incorrectly labeled as normal. Analyzing the confusion matrix for the CSE-CIC-IDS2018 dataset offers insight into the model’s binary classification effectiveness, as depicted in Figure 2b. For the normal class, 503 instances were correctly classified, with a small number (2) mistakenly identified as attacks. In contrast, the Attack class saw 27,260 instances accurately detected, while just 93 were incorrectly categorized as normal. Similarly, analyzing the confusion matrix for the CICIDS2017 dataset reveals the proposed model’s impressive ability to differentiate between benign and malicious network activity as depicted in Figure 2c. The model correctly categorized 4683 benign instances, while only 38 were mistakenly identified as attacks. Furthermore, it accurately detected 11,071 attack instances, with a minimal error of only 2 being misclassified as benign. Reflecting the model’s strong performance in binary classification, the confusion matrix for the combined dataset clearly illustrates its ability to reliably distinguish between benign and attack traffic, as shown in Figure 2d. It correctly classified 4208 benign samples and 38,435 attack instances, while misclassifying a mere 85 attack cases as benign. In real-world security operations, minimizing false positives is crucial to avoid overwhelming analysts with irrelevant alerts, while reducing false negatives is equally vital to ensure genuine attacks are not missed. 
The model’s demonstrated low rates of both false positives and false negatives suggest it is well-suited for practical intrusion detection systems, where accurate and efficient threat detection is paramount for maintaining network security.

The PCA–Transformer model delivered strong performance in binary classification tasks across all datasets, as depicted in Table 25. On the NF-BoT-IoT-v2 dataset, it achieved near-perfect accuracy for both normal (99.92%) and attack (99.98%) classes, with correspondingly high precision and recall values above 99.8%, indicating its strong reliability in distinguishing between benign and malicious traffic. On the CSE-CIC-IDS2018 dataset, the model attained 99.60% accuracy for normal and 99.66% for attack traffic, together with strong recall and F1-score, reflecting reliable detection of both benign and malicious instances. For the CICIDS2017 dataset, the model maintained 99.20% accuracy for the normal class and 99.98% for the attack class, alongside precision and recall values exceeding 99.20%, confirming its effectiveness in real-world intrusion scenarios. On the combined dataset, the model reached 100% accuracy for the normal class and 99.78% for the attack class, coupled with precision and recall values at or near 100%, showcasing its ability to generalize across heterogeneous network environments. These comprehensive results underscore the model’s robustness, performance, and adaptability in distinguishing between benign and malicious traffic, even under imbalanced conditions across multiple intrusion detection benchmarks.
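The per-class figures quoted above can be cross-checked against the confusion-matrix counts reported for Figure 2. In the sketch below, the per-class value is computed as the fraction of each true class that was classified correctly, which is an interpretation of the "accuracy" column rather than a definition given in the paper. The zero benign errors on the combined dataset are an assumption inferred from the 100% figure.

```python
# (correct, misclassified) per true class, as quoted in the text.
reported = {
    "NF-BoT-IoT-v2":   {"normal": (1272, 1),  "attack": (10787, 2)},
    "CSE-CIC-IDS2018": {"normal": (503, 2),   "attack": (27260, 93)},
    "CICIDS2017":      {"normal": (4683, 38), "attack": (11071, 2)},
    # Assumption: zero benign errors, inferred from the 100% normal-class figure.
    "Combined":        {"normal": (4208, 0),  "attack": (38435, 85)},
}

def pct(correct, wrong):
    """Percentage of correctly classified instances, rounded to two decimals."""
    return round(100.0 * correct / (correct + wrong), 2)

summary = {}
for name, classes in reported.items():
    total_correct = sum(c for c, _ in classes.values())
    total_wrong = sum(w for _, w in classes.values())
    summary[name] = {
        "normal": pct(*classes["normal"]),
        "attack": pct(*classes["attack"]),
        "overall": pct(total_correct, total_wrong),
    }
    print(name, summary[name])
```

Under this interpretation, the computed values agree with the per-class percentages above and with the overall binary accuracies reported in Section 4.5.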

5.2. Multi-Class Classification

The confusion matrix for the multi-class classification task on the NF-BoT-IoT-v2 dataset, covering five classes, Benign, Reconnaissance, DDoS, DoS, and Theft, demonstrates the model’s strong overall performance, as shown in Figure 3. The Benign class had 1251 out of 1252 samples correctly classified, with only one instance misclassified as Reconnaissance. The model also performed well on the Reconnaissance class, correctly identifying 2732 instances, while 148 were misclassified as DoS, 9 as Benign, and 2 as DDoS. The DDoS class saw 4267 out of 4301 samples correctly predicted, with only 31 misclassified as DoS and 3 as Benign. Similarly, DoS attacks were detected with high accuracy, as 3525 out of 3571 samples were correctly classified, while 31 were misclassified as DDoS and 15 as Reconnaissance. Notably, all 49 instances of the Theft class were correctly classified with no errors. These results indicate that the model effectively distinguishes between benign traffic and multiple attack types, demonstrating strong adaptability and robustness in complex and imbalanced IoT network environments.
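The counts described above can be assembled into a 5 × 5 confusion matrix to recover the overall accuracy and per-class precision and recall. The row and column order and the placement of the misclassified counts follow the prose description; this is a reconstruction for illustration, not the figure's underlying data file.

```python
import numpy as np

labels = ["Benign", "Reconnaissance", "DDoS", "DoS", "Theft"]

# Rows are true classes, columns are predicted classes (same order as `labels`).
cm = np.array([
    [1251,    1,    0,    0,  0],   # Benign: 1 predicted as Reconnaissance
    [   9, 2732,    2,  148,  0],   # Reconnaissance
    [   3,    0, 4267,   31,  0],   # DDoS
    [   0,   15,   31, 3525,  0],   # DoS
    [   0,    0,    0,    0, 49],   # Theft: all correct
])

accuracy = np.trace(cm) / cm.sum()
recall = np.diag(cm) / cm.sum(axis=1)      # correct / all true instances of the class
precision = np.diag(cm) / cm.sum(axis=0)   # correct / all predictions of the class

print(f"overall accuracy: {100 * accuracy:.2f}%")
for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={100 * p:.2f}%, recall={100 * r:.2f}%")
```

The overall accuracy computed from these counts is consistent with the multi-class accuracy reported for NF-BoT-IoT-v2 in Section 4.5.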

The confusion matrix for the multi-class classification task on the CSE-CIC-IDS2018 dataset highlights the PCA-Transformer model’s strong ability to accurately distinguish benign traffic and various network attack types, as shown in Figure 4. Several classes were classified perfectly, including DDOS attack-HOIC (4789), DoS attacks-Hulk (4605), SSH-Bruteforce (3105), DoS attacks-GoldenEye (2575), DoS attacks-Slowloris (1291), and DDOS attack-LOIC-UDP (237). The Benign class was correctly identified in 480 instances, with a small number misclassified as Infiltration (25), while Infiltration itself was mostly accurate (854) but confused with Benign (35). Bot attacks were recognized in 4085 cases, with 23 misclassified as Brute Force-XSS. DDoS attacks-LOIC-HTTP (5625) achieved high accuracy with minimal error. Brute Force-Web was largely correct (55), though a few samples were misclassified. Minority classes such as SQL Injection, DoS attacks-SlowHTTPTest, and FTP-BruteForce were identified with commendable accuracy. Overall, the model demonstrates reliable performance across all classes, including benign and various attack categories, effectively addressing class imbalance and complex traffic scenarios.

The confusion matrix for the CICIDS2017 dataset, as shown in Figure 5, highlights the PCA-Transformer model’s strong capability in multi-class intrusion detection. The model achieved perfect classification of the Benign class with 3740 correct predictions and no misclassifications. It also demonstrated near-perfect performance for PortScan (1789/1791), DDoS (2063/2066), and DoS Hulk (5603/5608), with minimal confusion in each case. DoS GoldenEye was correctly identified in 551 instances, with only one misclassified sample. For FTP-Patator and DoS Slowloris, the model achieved high accuracy with 301 and 253 correct classifications, respectively. SSH-Patator had 116 correctly predicted instances, with minor confusion mainly involving Web Attack-SqL Injection (10) and FTP-Patator (1). The model also showed complete accuracy in identifying DoS Slowhttptest (161/161). Bot traffic was accurately predicted in 98 of 99 instances. In terms of more infrequent attack types, such as Web Attack-Brute Force, Web Attack-XSS, Infiltration, Web Attack-SqL Injection, and Heartbleed, the model maintained consistent and reliable classification. These results underline the model’s effectiveness in handling benign traffic alongside both frequent and rare attack types, preserving balanced and precise multi-class detection performance across a wide range of network traffic behaviors.

The confusion matrix obtained from the combined CSE-CIC-IDS2018 and CICIDS2017 datasets, encompassing 21 classes including benign traffic and a wide range of attack types, demonstrates the PCA-Transformer model’s impressive multi-class classification capabilities, as shown in Figure 6. Of the 4208 benign instances, 4081 were correctly classified, with misclassifications primarily into Infiltration (124) and Bot (3). The model achieved perfect classification for DDOS attack-HOIC (4816), Infiltration (900), SSH-Bruteforce (3157), and DDOS attack-LOIC-UDP (267), and showed near-perfect accuracy for DDoS attacks-LOIC-HTTP (5522/5554), DoS attacks-Hulk (10,235/10,245), Bot (4299/4328), and DoS attacks-GoldenEye (3014/3028). It also performed well for DoS attacks-Slowloris (1506/1508) and PortScan (1798/1801), with minimal confusion. High detection rates were observed for FTP-Patator (302/303) and SSH-Patator (137/139), while rare attacks such as Web Attack-XSS, SQL Injection, and Heartbleed were also identified, with 51 correct classifications for Web Attack-XSS and 2 for Heartbleed. These results underscore the model’s ability to maintain high performance across diverse intrusion scenarios, making it well-suited for practical, large-scale network security applications.

The PCA-Transformer model exhibited outstanding performance in multi-class classification on the NF-BoT-IoT-v2 dataset, as shown in Table 26. It achieved perfect accuracy, precision, recall, and F1-score (100%) for the Theft class, indicating flawless detection of this critical threat. The model also performed exceptionally well in identifying benign traffic, with 99.92% accuracy and a 99.48% F1-score, demonstrating its ability to distinguish benign traffic from a diverse range of malicious activities across multiple attack categories. High effectiveness was observed for DDoS and DoS attacks, with accuracy values of 99.21% and 98.71% respectively, confirming the model’s proficiency in detecting high-volume network intrusions. In the case of Reconnaissance, the model maintained a strong balance, achieving 94.50% accuracy and 96.90% F1-score, reflecting reliable detection of stealthier, information-gathering attacks. These results collectively underscore the model’s capability to accurately classify benign traffic alongside diverse attack types, supporting its applicability in real-world IoT network security scenarios.

The PCA-Transformer model demonstrated outstanding multi-class classification performance on the CSE-CIC-IDS2018 dataset, as shown in Table 27. It achieved perfect scores (100% across all evaluation metrics) for several critical attack types, including DDOS attack-HOIC, SSH-Bruteforce, DoS attacks-Hulk, DoS attacks-GoldenEye, DoS attacks-Slowloris, and DDOS attack-LOIC-UDP, showcasing its exceptional ability to detect well-defined and high-impact threats. DDoS attacks-LOIC-HTTP also yielded near-perfect detection with 99.98% accuracy and a 99.99% F1-score. The model maintained high accuracy on complex categories such as Bot (99.44%) and Infiltration (96.06%), while preserving a strong balance between precision, recall, and F1-score. It further demonstrated reliable detection capabilities for a wide range of additional attack types, including Brute Force-Web, Brute Force-XSS, SQL Injection, FTP-BruteForce, and DoS attacks-SlowHTTPTest, highlighting its versatility in recognizing diverse intrusion patterns. Benign traffic was also accurately classified, with 95.05% accuracy and a well-balanced F1-score of 94.12%, affirming the model’s ability to distinguish between benign and various malicious traffic types. Overall, the PCA-Transformer model exhibits remarkable robustness and adaptability in identifying benign traffic alongside both frequent and less common attack categories, making it highly effective for deployment in real-world, multi-class intrusion detection environments.

The PCA-Transformer model exhibited impressive multi-class classification performance on the CICIDS2017 dataset, as shown in Table 28. It achieved perfect scores (100% accuracy, precision, recall, and F1-score) for the Benign class, highlighting its precise discrimination of benign traffic. Several major attack types, including DoS Slowhttptest, PortScan, DoS Hulk, DDoS, and DoS GoldenEye, were identified with exceptional accuracy and F1-scores above 99%, confirming the model’s reliability in detecting high-volume threats. Notably, FTP-Patator (99.34% accuracy, 99.50% F1-score), SSH-Patator (91.34% accuracy, 95.47% F1-score), and DoS Slowloris (99.22% accuracy, 98.44% F1-score) were also classified with high effectiveness. The model performed confidently across other complex attack types such as Bot (98.99% accuracy, 98.49% F1-score), Web Attack-Brute Force, and Web Attack-XSS, while also achieving 100% recall for Web Attack-SqL Injection, and 100% precision for both Infiltration and Heartbleed. These results collectively demonstrate the PCA-Transformer’s robust generalization across a diverse range of benign and malicious traffic types, reinforcing its practical value for real-world, multi-class intrusion detection systems.

The PCA-Transformer model exhibited exceptional multi-class classification performance on the combined dataset, as evidenced by the per-class metrics in Table 29. It achieved perfect scores (100%) across all evaluation metrics for critical attack types such as DDOS attack-HOIC, SSH-Bruteforce, and Heartbleed, confirming its effectiveness in identifying both high-impact and low-frequency intrusions. The model also maintained consistently high detection capabilities across various attack types, achieving F1-scores of 99.98% for DDoS, 99.92% for PortScan, 99.93% for DoS attacks-Hulk, and 99.71% for DDoS attacks-LOIC-HTTP. Similarly, it achieved 99.63% for Bot, 99.65% for DoS attacks-GoldenEye, 99.77% for DoS attacks-Slowloris, and 99.26% for DDOS attack-LOIC-UDP, indicating robust performance in recognizing complex patterns of malicious traffic. Strong classification results were also recorded for FTP-Patator (99.51% F1-score), SSH-Patator (98.56%), and Infiltration (93.56%), demonstrating the model’s ability to detect a diverse range of sophisticated attacks. The Benign class achieved a high accuracy of 96.98% and an F1-score of 98.46%, reinforcing the model’s capability to accurately distinguish benign traffic from a broad spectrum of attack types. Collectively, these results underscore the PCA-Transformer model’s effectiveness, scalability, and reliability in real-world multi-class network intrusion detection scenarios.

5.3. Case Study for Real-Time Evaluation Using the Hybrid PCA–Transformer Model

A real-time evaluation was conducted to assess the practical applicability of the proposed hybrid PCA–Transformer model within a custom-developed IDS graphical user interface (GUI). The model was deployed in a real-world operational environment, where it processed live network traffic samples. A total of 22 real-time test instances, each representing a distinct network traffic scenario, were analyzed, as illustrated in Table 30. The model achieved perfect classification accuracy, with all predictions correctly matching their actual class labels. This outcome highlights the model’s robustness and reliability in real-time contexts, reinforcing its suitability for deployment in real-world network security systems. The classification outcomes, including the predicted classes, are depicted in Figure 7.

6. Limitations

The PCA–Transformer represents a blend of sophisticated deep learning frameworks, meticulously engineered to optimize performance in both binary and multi-class classification. This innovative strategy successfully tackles crucial issues within security breach detection, notably by improving classification precision and adeptly managing imbalanced class distributions. However, a number of remaining constraints and obstacles still demand attention:

Real-Time Evaluation in Different Environments: Comprehensive real-time evaluation across diverse network environments remains a challenge. While many models demonstrate strong performance on benchmark datasets, their effectiveness in varied real-world settings such as edge devices, cloud platforms, and different network infrastructures is often unvalidated.
Dataset Combination: To achieve comprehensive coverage of the wide variety of attack types and network behaviors, it is necessary to combine multiple datasets. However, the inherent differences in feature definitions, labeling schemes, and data distributions across these heterogeneous datasets pose significant challenges. These disparities complicate data integration and can negatively affect model performance and generalizability, highlighting the need for more effective methods to manage and unify heterogeneous data in intrusion detection research.
Data Preprocessing: The model’s ability to perform optimally is directly tied to how well its data is preprocessed. It is crucial to accurately deal with missing values, appropriately encode categorical information, and effectively normalize numerical data.
Model Adaptation: Adapting the model for peak performance on diverse datasets requires a comprehensive and iterative approach to hyperparameter tuning. This process is vital for ensuring the model perfectly aligns with the specific traits of every new dataset.
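
The three preprocessing steps named in the Data Preprocessing item (handling missing values, encoding categorical information, and normalizing numerical data) can be illustrated with a small, self-contained sketch. It is a generic illustration, not the authors' pipeline: the choice of median imputation, min-max scaling, and one-hot encoding, as well as the helper names and toy values, are assumptions made here for clarity.

```python
from statistics import median


def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def min_max_scale(column):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def one_hot(column):
    """Encode a categorical column as one 0/1 list per row (sorted category order)."""
    categories = sorted(set(column))
    return categories, [[1 if v == c else 0 for c in categories] for v in column]


# Hypothetical toy flow records: a numeric duration and a categorical protocol.
durations = [0.5, None, 2.0, 1.0]
protocols = ["tcp", "udp", "tcp", "icmp"]

scaled = min_max_scale(impute_median(durations))  # missing value filled, then scaled
cats, encoded = one_hot(protocols)                # cats gives the column order
```

In practice, the fill value and the minimum and maximum bounds would be estimated on the training split only and then reused for validation and test data, so that no information leaks from the evaluation sets.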

7. Conclusions

In this paper, we presented a combined dataset IDS framework that utilizes an advanced hybrid PCA–Transformer model to overcome the limitations of traditional IDS solutions. PCA is used for feature extraction and dimensionality reduction, and the Transformer handles the classification. Class imbalance was mitigated through the application of class weights, ADASYN, and ENN. This framework identifies 21 categories, comprising 1 benign class and 20 different attack types across diverse datasets, by deploying enhanced preprocessing and feature engineering, followed by the vertical concatenation of the CSE-CIC-IDS2018 and CICIDS2017 datasets. The model exhibited consistently high performance across various datasets. On the combined CSE-CIC-IDS2018 and CICIDS2017 datasets, it achieved 99.80% accuracy in binary classification and 99.28% in multi-class classification. It attained 99.66% and 99.59% accuracy for binary and multi-class classification on the CSE-CIC-IDS2018 dataset, respectively, and 99.75% and 99.51% on the CICIDS2017 dataset. Additionally, on the NF-BoT-IoT-v2 dataset, the model achieved 99.98% accuracy in binary classification and 98.01% in multi-class classification. These results significantly outperform existing IDS methods in both detection accuracy and the ability to handle a diverse range of network classes. The proposed approach presents a robust and scalable solution for real-time intrusion detection, with strong potential for securing modern network infrastructures.

8. Future Work

To address the limitations outlined in Section 6, future investigations should focus on the following aspects:

Real-Time Evaluation in Different Environments: The combined system using PCA–Transformer will be evaluated in real time across multiple environments to assess its adaptability and resilience to diverse challenges. Deployment will span from resource-constrained edge devices to large-scale cloud systems, ensuring consistent detection accuracy and operational efficiency in a wide range of scenarios.
Dataset Combination: To enhance the robustness of intrusion detection models, future work should focus on combining a broader range of datasets to cover more diverse attack types and network behaviors. However, effectively integrating these heterogeneous datasets remains a challenge due to differences in feature definitions, labeling standards, and data distributions. Addressing these issues will require developing improved techniques for data harmonization and interoperability, which are crucial for advancing intrusion detection systems.
Data Preprocessing: To achieve peak model performance, it is critical to tailor data preprocessing methods to each dataset. For a deeper dive into these refined techniques, see Section 3.2 herein.
Model Adaptation and Hyperparameter Optimization: For successful deployment across varied data landscapes, enhanced hyperparameter optimization methods are crucial. These tuning strategies are detailed in Section 3.3.
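
One simple way to approach the harmonization problem described under Dataset Combination is to restrict both datasets to their shared feature columns and map dataset-specific label names onto a unified scheme before stacking the rows. The sketch below assumes each dataset is a list of dictionaries; the column names, the label mapping, and the helper functions are hypothetical and are not taken from the authors' implementation.

```python
def harmonize(rows, shared_features, label_map, label_key="Label"):
    """Project rows onto the shared feature columns and unify label names."""
    out = []
    for row in rows:
        record = {name: row[name] for name in shared_features}
        original = row[label_key]
        # Labels without an entry in the map are kept unchanged.
        record[label_key] = label_map.get(original, original)
        out.append(record)
    return out


def combine_vertically(ds_a, ds_b, map_a, map_b, label_key="Label"):
    """Stack two datasets row-wise after restricting both to common features."""
    features_a = set(ds_a[0]) - {label_key}
    features_b = set(ds_b[0]) - {label_key}
    shared = sorted(features_a & features_b)
    if not shared:
        raise ValueError("the two datasets share no feature columns")
    return (harmonize(ds_a, shared, map_a, label_key)
            + harmonize(ds_b, shared, map_b, label_key))


# Hypothetical toy records; real datasets have many more columns and rows.
ds_2018 = [
    {"Flow Duration": 45, "Only In 2018": 7, "Label": "DoS attacks-Hulk"},
    {"Flow Duration": 12, "Only In 2018": 3, "Label": "Benign"},
]
ds_2017 = [
    {"Flow Duration": 120, "Only In 2017": 1, "Label": "DoS Hulk"},
    {"Flow Duration": 30, "Only In 2017": 0, "Label": "BENIGN"},
]
# Assumed mapping of the second dataset's label names onto the first one's scheme.
MAP_2017 = {"DoS Hulk": "DoS attacks-Hulk", "BENIGN": "Benign"}

combined = combine_vertically(ds_2018, ds_2017, {}, MAP_2017)
```

Taking the intersection of columns is the most conservative choice, since it discards any feature that only one dataset defines. Differences in units or in how a shared feature is computed would still need to be checked separately.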

Author Contributions

Conceptualization, H.K. and M.M.; Methodology, H.K. and M.M.; Software, H.K. and M.M.; Validation, H.K. and M.M.; Writing—original draft, H.K. and M.M.; Supervision, M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in NF-BoT-IoT-v2 at https://staff.itee.uq.edu.au/marius/NIDS_datasets/, reference number v2; in CSE-CIC-IDS2018 at https://www.unb.ca/cic/datasets/ids-2018.html; and in CICIDS2017 at https://www.unb.ca/cic/datasets/ids-2017.html. These data were derived from resources available in the public domain.

Conflicts of Interest

The authors declare no conflict of interest.

References

Patel, A.; Qassim, Q.; Wills, C. A survey of intrusion detection and prevention systems. Inf. Manag. Comput. Secur. 2010, 18, 277–290.
Khraisat, A.; Gondal, I.; Vamplew, P.; Kamruzzaman, J. Survey of intrusion detection systems: Techniques, datasets and challenges. Cybersecurity 2019, 2, 20.
Yuan, L.; Chen, H.; Mai, J.; Chuah, C.N.; Su, Z.; Mohapatra, P. Fireman: A toolkit for firewall modeling and analysis. In Proceedings of the 2006 IEEE Symposium on Security and Privacy (S&P’06), Berkeley/Oakland, CA, USA, 21–24 May 2006; IEEE: Manhattan, NY, USA, 2006; pp. 15–213.
Musa, U.S.; Chhabra, M.; Ali, A.; Kaur, M. Intrusion detection system using machine learning techniques: A review. In Proceedings of the 2020 International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 10–12 September 2020; IEEE: Manhattan, NY, USA, 2020; pp. 149–155.
Zhang, X.; Xie, J.; Huang, L. Real-Time Intrusion Detection Using Deep Learning Techniques. J. Netw. Comput. Appl. 2020, 140, 45–53.
Kumar, S.; Kumar, R. A Review of Real-Time Intrusion Detection Systems Using Machine Learning Approaches. Comput. Secur. 2020, 95, 101944.
Smith, B.J.; Taylor, C. Enhancing Network Security with Real-Time Intrusion Detection Systems. Int. J. Inf. Secur. 2021, 21, 123–135.
Alzughaibi, S.; El Khediri, S. A cloud intrusion detection systems based on dnn using backpropagation and pso on the cse-cic-ids2018 dataset. Appl. Sci. 2023, 13, 2276.
Gopalsamy, M. Predictive Cyber Attack Detection in Cloud Environments with Machine Learning from the CICIDS 2018 Dataset. Int. J. Sci. Res. Technol. (IJSART) 2024, 10, 34–46.
Talukder, M.A.; Islam, M.M.; Uddin, M.A.; Hasan, K.F.; Sharmin, S.; Salem; Alyami, A.; Moni, M.A. Machine learning-based network intrusion detection for big and imbalanced data using oversampling, stacking feature embedding and feature extraction. J. Big Data 2024, 11, 33.
Fathima, A.; Khan, A.; Uddin, M.F.; Waris, M.M.; Ahmad, S.; Sanin, C.; Szczerbicki, E. Performance evaluation and comparative analysis of machine learning models on the UNSW-NB15 dataset: A contemporary approach to cyber threat detection. Cybern. Syst. 2023, 1–7.
Alabdulwahab, S.; Kim, Y.-T.; Seo, A.; Son, Y. Generating synthetic dataset for ml-based ids using ctgan and feature selection to protect smart iot environments. Appl. Sci. 2023, 13, 10951.
Alghamdi, R.; Bellaiche, M. An ensemble deep learning based IDS for IoT using Lambda architecture. Cybersecurity 2023, 6, 5.
Aljuaid, W.H.; Alshamrani, S.S. A deep learning approach for intrusion detection systems in cloud computing environments. Appl. Sci. 2024, 14, 5381.
Nawaz, M.W.; Munawar, R.; Mehmood, A.; Rahman, M.M.U.; Qammer; Abbasi, H. Multi-class Network Intrusion Detection with Class Imbalance via LSTM & SMOTE. arXiv 2023, arXiv:2310.01850.
Abdelkhalek, A.; Mashaly, M. Addressing the class imbalance problem in network intrusion detection systems using data resampling and deep learning. J. Supercomput. 2023, 79, 10611–10644.
Wu, Z.; Zhang, H.; Wang, P.; Sun, Z. RTIDS: A robust transformer-based approach for intrusion detection system. IEEE Access 2022, 10, 64375–64387.
Nguyen, T.P.; Nam, H.; Kim, D. Transformer-based attention network for in-vehicle intrusion detection. IEEE Access 2023, 11, 55389–55403.
Liu, Y.; Wu, L. Intrusion detection model based on improved transformer. Appl. Sci. 2023, 13, 6251.
Long, Z.; Yan, H.; Shen, G.; Zhang, X.; He, H.; Cheng, L. A Transformer-based network intrusion detection approach for cloud security. J. Cloud Comput. 2024, 13, 5.
Tseng, S.-M.; Wang, Y.-Q.; Wang, Y.-C. Multi-Class Intrusion Detection Based on Transformer for IoT Networks Using CIC-IoT-2023 Dataset. Future Internet 2024, 16, 284.
Kamal, H.; Mashaly, M. Robust Intrusion Detection System Using an Improved Hybrid Deep Learning Model for Binary and Multi-Class Classification in IoT Networks. Technologies 2025, 13, 102.
Kamal, H.; Mashaly, M. Advanced Hybrid Transformer-CNN Deep Learning Model for Effective Intrusion Detection Systems with Class Imbalance Mitigation Using Resampling Techniques. Future Internet 2024, 16, 481.
Kamal, H.; Mashaly, M. Enhanced Hybrid Deep Learning Models-Based Anomaly Detection Method for Two-Stage Binary and Multi-Class Classification of Attacks in Intrusion Detection Systems. Algorithms 2025, 18, 69.
Fu, Y.; Du, Y.; Cao, Z.; Li, Q.; Xiang, W. A deep learning model for network intrusion detection with imbalanced data. Electronics 2022, 11, 898.
Sajid, M.; Malik, K.R.; Almogren, A.; Malik, T.S.; Khan, A.H.; Tanveer, J.; Rehman, A.U. Enhancing intrusion detection: A hybrid machine and deep learning approach. J. Cloud Comput. 2024, 13, 123.
Kamal, H.; Mashaly, M. Improving Anomaly Detection in IDS with Hybrid Auto Encoder-SVM and Auto Encoder-LSTM Models Using Resampling Methods. In Proceedings of the 2024 6th Novel Intelligent and Leading Emerging Sciences Conference (NILES), Giza, Egypt, 19–21 October 2024; pp. 34–39.
Umair, M.B.; Iqbal, Z.; Faraz, M.A.; Khan, M.A.; Zhang, Y.-D.; Razmjooy, N.; Kadry, S. A network intrusion detection system using hybrid multilayer deep learning model. Big Data 2024, 12, 367–376.
Alghamdi, R.; Bellaiche, M. Evaluation and selection models for ensemble intrusion detection systems in IoT. IoT 2022, 3, 285–314.
Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. “CSE-CIC-IDS2018 Dataset.” Canadian Institute for Cybersecurity, University of New Brunswick. 2018. Available online: https://www.unb.ca/cic/datasets/ids-2018.html (accessed on 20 July 2025).
Songma, S.; Sathuphan, T.; Pamutha, T. Optimizing intrusion detection systems in three phases on the CSE-CIC-IDS-2018 dataset. Computers 2023, 12, 245.
Ring, M.; Wunderlich, S.; Scheuring, D.; Landes, D.; Hotho, A. A survey of network-based intrusion detection data sets. Comput. Secur. 2019, 86, 147–167.
Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp 2018, 1, 108–116.
UNB. Intrusion Detection Evaluation Dataset (CICIDS2017), University of New Brunswick. Available online: https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 30 October 2024).
Sharafaldin, I.; Habibi Lashkari, A.; Ghorbani, A.A. A detailed analysis of the cicids2017 data set. In Proceedings of the Information Systems Security and Privacy: 4th International Conference, ICISSP 2018, Funchal-Madeira, Portugal, 22–24 January 2018; Revised Selected Papers 4. Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 172–188.
Bishop, C.M.; Nasser, M.N. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006; Volume 4.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
Ba, J.L. Layer normalization. arXiv 2016, arXiv:1607.06450.
Sarhan, M.; Layeghy, S.; Portmann, M. Towards a standard feature set for network intrusion detection system datasets. Mob. Netw. Appl. 2022, 27, 357–370.
Moustafa, N. Network Intrusion Detection System (NIDS) Datasets [Internet]; University of Queensland: Brisbane, Australia, 2025; Available online: https://staff.itee.uq.edu.au/marius/NIDS_datasets (accessed on 21 June 2025).
El-Habil, B.Y.; Abu-Naser, S.S. Global climate prediction using deep learning. J. Theor. Appl. Inf. Technol. 2022, 100, 4824–4838.
Song, Z.; Ma, J. Deep learning-driven MIMO: Data encoding and processing mechanism. Phys. Commun. 2022, 57, 101976.
Zhou, X.; Zhao, C.; Sun, J.; Yao, K.; Xu, M. Detection of lead content in oilseed rape leaves and roots based on deep transfer learning and hyperspectral imaging technology. Spectroch. Acta Part A Mol. Biomol. Spectrosc. 2022, 290, 122288.
Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
Kunang, Y.N.; Nurmaini, S.; Stiawan, D.; Zarkasi, A. Automatic features extraction using autoencoder in intrusion detection system. In Proceedings of the 2018 International Conference on Electrical Engineering and Computer Science (ICECOS), Pangkal Pinang, Indonesia, 2–4 October 2018; pp. 219–224.
Gogna, A.; Majumdar, A. Discriminative autoencoder for feature extraction: Application to character recognition. Neural Processing Letters 2019, 49, 1723–1735.
Chen, X.; Ma, L.; Yang, X. Stacked denoise autoencoder based feature extraction and classification for hyperspectral images. J. Sens. 2016, 2016, 3632943.
Michelucci, U. An introduction to autoencoders. arXiv 2022, arXiv:2201.03898.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
Veeramreddy, J.; Prasad, K. Anomaly-Based Intrusion Detection System. In Anomaly Detection and Complex Network Systems; Alexandrov, A.A., Ed.; IntechOpen: London, UK, 2019.
Chen, C.; Song, Y.; Yue, S.; Xu, X.; Zhou, L.; Lv, Q.; Yang, L. FCNN-SE: An Intrusion Detection Model Based on a Fusion CNN and Stacked Ensemble. Appl. Sci. 2022, 12, 8601.
Powers, D.M.W. Evaluation: From Precision, Recall, and F-Measure to ROC, Informedness, Markedness & Correlation. J. Mach. Learn. Technol. 2011, 2, 37–63.
Assy, A.T.; Mostafa, Y.; El-Khaleq, A.A.; Mashaly, M. Anomaly-based intrusion detection system using one-dimensional convolutional neural network. Procedia Comput. Sci. 2023, 220, 78–85.
Kamal, H.; Mashaly, M. Hybrid Deep Learning-Based Autoencoder-DNN Model for Intelligent Intrusion Detection System in IoT Networks. In Proceedings of the 2025 15th International Conference on Electrical Engineering (ICEENG), Cairo, Egypt, 12–15 May 2025; pp. 1–6.

Figure 1. Combined dataset system architecture schematic for binary and multi-class classification.

Figure 2. Visualizing binary classification outcomes with confusion matrices for (a) NF-BoT-IoT-v2, (b) CSE-CIC-IDS2018, (c) CICIDS2017, and (d) combined dataset.

Figure 3. Visualizing multi-class classification outcomes with confusion matrix for NF-BoT-IoT-v2 dataset.

Figure 4. Visualizing multi-class classification outcomes with confusion matrix for CSE-CIC-IDS2018 dataset.

Figure 5. Visualizing multi-class classification outcomes with confusion matrix for CICIDS2017 dataset.

Figure 6. Visualizing multi-class classification outcomes with confusion matrix for combined dataset.

Figure 7. GUI of the IDS combined dataset system during real-time evaluation using the hybrid PCA–Transformer model.

Table 1. Previous studies.

Author	Dataset	Year	Utilized Technique	Accuracy (Binary)	Accuracy (Multi-Class)	Contribution	Limitations	Reason for Limitations
Saud Alzughaibi and Salim El Khediri [8]	CSE-CIC-IDS2018	2023	MLP-BP, MLP-PSO	98.97%	98.41%	The study develops two DNN models to enhance IDS in cloud environments, using MLP with Backpropagation and PSO.	• Scalability • Complex architecture • Lack of real-time evaluation	• Limited ability to process large or high-speed network traffic efficiently • High implementation and computational complexity • Unverified performance in live network environments
Mani Gopalsamy [9]	CSE-CIC-IDS2018	2024	MLP-BP	98.97%	-	The study utilizes the CSE-CIC-IDS2018 dataset to evaluate the MLP-BP model for anomaly-based intrusion detection through machine learning and data preprocessing.	• Lack of real-time evaluation • Scalability	• Real-time testing requires complex infrastructure and continuous processing of live network data, which is challenging to set up and manage • Limited ability to process large or high-speed network traffic efficiently
Md. Alamin Talukder [10]	UNSW-NB15	2024	RF and ET	99.59%	99.95%	The paper introduces a machine learning model that enhances intrusion detection by balancing data, reducing dimensions, and improving feature representation.	• Model capacity and advancement	• The model’s limited ability to learn complex patterns or adapt to evolving data due to insufficient capacity or outdated architecture
Afrah Fathima et al. [11]	UNSW-NB15	2023	RF	99%	98%	The paper focuses on building and evaluating machine learning models to improve intrusion detection accuracy using the UNSW-NB15 dataset.	• Generalizability • Model capacity and advancement	• Limited dataset diversity • The model’s limited ability to learn complex patterns or adapt to evolving data due to insufficient capacity or outdated architecture
Saleh Alabdulwahab et al. [12]	Combined Dataset (TON_IoT, BoT-IoT and MQTT-IoT-IDS2020)	2023	CTGAN	-	99%	The study uses CTGAN to generate synthetic IoT intrusion data from multiple datasets, improving data balance and detection accuracy.	• Generalizability	• Limited dataset diversity
Rubayyi Alghamdi and Martine Bellaiche [13]	Iot23	2023	LSTM	98.20%	92.8%	The paper introduces a deep ensemble IDS using Lambda architecture, with LSTM-based models for both binary and multi-class traffic classification.	• Computational complexity and resource usage • Increased training and inference time	• Complex algorithms and large volumes of data require more processing power, memory, and time to execute efficiently • Complexity of models and large dataset sizes, which require more time for both training and making predictions
Wa’ad H. Aljuaid and Sultan S. Alshamrani [14]	CSE-CIC-IDS2018	2024	CNN	-	98.67%	The research proposes a deep learning model using an advanced CNN architecture to improve cyberattack detection efficiency in cloud computing environments.	• Scalability • Lack of real-time evaluation	• Limited ability to process large or high-speed network traffic efficiently • Real-time testing requires complex infrastructure and continuous processing of live network data, which is challenging to set up and manage
Muhammad Wasim Nawaz et al. [15]	CICIDS2017	2023	LSTM	-	99%	The paper proposes an LSTM-based model that tackles class imbalance in network intrusion detection using oversampling and customized loss functions for multi-class classification.	• Class imbalance • Applicable solely to problems with multiple classes	• Lack of resampling causes bias toward majority classes, reducing detection of rare threats • Multi-class problems have uneven class distributions that cause learning bias, unlike simpler binary problems
Ahmed Abdelkhalek and Maggie Mashaly [16]	NSL-KDD	2023	CNN	93.3%	81.8%	The paper addresses class imbalance in the NSL-KDD dataset by combining resampling methods with a CNN model to enhance detection of minority-class attacks.	• Generalizability	• Limited dataset diversity
Zihan Wu et al. [17]	CIC-DDoS2019	2022	RTIDS	-	98.48%	The paper introduces RTIDS, a Transformer-based IDS that addresses data imbalance by using positional embeddings and a stacked encoder–decoder network to reduce dimensionality while preserving essential features.	• Computational overhead	• Handling complex models, large datasets, and techniques like resampling requires extensive processing power and time
Trieu Phong Nguyen et al. [18]	CAN HACKING	2023	Transformer	99.42%	-	The paper details a Transformer-powered attention network that acts as an intrusion detection system, specifically built to monitor CAN bus communications inside vehicles.	• Generalizability • Real-time applicability	• Limited dataset diversity • Real-time testing requires complex infrastructure and continuous processing of live network data, which is challenging to set up and manage
Yi Liu & Lanjian Wu [19]	NSL-KDD	2023	Transformer	88.7%	84.1%	An enhanced Transformer-based model is proposed to improve intrusion detection by reducing training time, improving class separation, and boosting classification accuracy.	• Generalizability • Class imbalance	• Limited dataset diversity • Lack of resampling causes bias toward majority classes, reducing detection of rare threats
Zhenyue Long et al. [20]	CSE-CICIDS2018	2024	Transformer	-	93%	This paper presents a Transformer-based intrusion detection system for cloud environments, using attention mechanisms to enhance feature analysis and improve detection accuracy.	• Generalizability • Scalability	• Limited dataset diversity • Limited ability to process large or high-speed network traffic efficiently
Shu-Ming Tseng et al. [21]	CIC-IoT-2023	2024	Transformer	99.48%	99.24%	The Transformer is employed to scrutinize network traffic characteristics, enabling the detection of anomalous activities and possible intrusions via both two-category and multiple-category classification.	• Class imbalance • Generalizability	• Lack of resampling causes bias toward majority classes, reducing detection of rare threats • Limited dataset diversity
Yi Liu & Lanjian Wu [19]	NSL-KDD	2023	PCA	85.1%	82%	An enhanced PCA model is proposed to improve intrusion detection by reducing training time, improving class separation, and boosting classification accuracy.	• Generalizability • Class imbalance	• Limited dataset diversity • Lack of resampling causes bias toward majority classes, reducing detection of rare threats
Hesham Kamal and Maggie Mashaly [22]	NF-BoT-IoT-v2	2025	Autoencoder	99.95%	98.08%	The study proposes an autoencoder framework for anomaly-based intrusion detection in IoT environments, enabling accurate binary and multi-class traffic classification.	• Generalizability • Scalability	• Limited dataset diversity • Limited ability to process large or high-speed network traffic efficiently
Hesham Kamal and Maggie Mashaly [23]	NF-UNSW-NB15-v2	2024	Autoencoder	99.66%	95.57%	The research develops an autoencoder model to enhance intrusion detection by handling data imbalance and identifying new threats in cloud environments.	• Generalizability • Scalability	• Limited dataset diversity • Limited ability to process large or high-speed network traffic efficiently
Yi Liu & Lanjian Wu [19]	NSL-KDD	2023	Autoencoder and stacked Autoencoders	87.8% and 88.7%	83.5% and 84.1%	Enhanced autoencoder and stacked autoencoder (SAE) models are proposed to improve intrusion detection by reducing training time, improving class separation, and boosting classification accuracy.	• Generalizability • Class imbalance	• Limited dataset diversity • Lack of resampling causes bias toward majority classes, reducing detection of rare threats
Hesham Kamal and Maggie Mashaly [22]	NF-BoT-IoT-v2	2025	CNN-MLP	99.96%	98.11%	The study proposes a hybrid CNN-MLP framework for anomaly-based intrusion detection in IoT environments, enabling accurate binary and multi-class traffic classification.	• Generalizability • Scalability	• Limited dataset diversity • Limited ability to process large or high-speed network traffic efficiently
Hesham Kamal and Maggie Mashaly [24]	NF-BoT-IoT-v2	2025	Transformer-DNN and Autoencoder-CNN	99.98% and 99.98%	97.90% and 97.95%	The study introduces two hybrid models: Autoencoder-CNN for addressing class imbalance and Transformer–DNN for extracting contextual features to improve classification.	• Scalability • Generalizability	• Limited ability to process large or high-speed network traffic efficiently • Limited dataset diversity
Yanfang Fu et al. [25]	NSL-KDD	2022	CNN and BiLSTMs	90.73%	-	The paper introduces DLNID, a deep learning model that uses CNN, attention, and Bi-LSTM to enhance network intrusion detection accuracy and robustness.	• Lack of real-time evaluation	• Real-time testing requires complex infrastructure and continuous processing of live network data, which is challenging to set up and manage
Muhammad Sajid et al. [26]	CICIDS2017	2024	CNN-LSTM	97.90%	-	The paper presents a hybrid model using XGBoost, CNN, and LSTM to improve detection of evolving network attacks.	• Scalability • Class imbalance	• Limited ability to process large or high-speed network traffic efficiently • Lack of resampling causes bias toward majority classes, reducing detection of rare threats
Hesham Kamal and Maggie Mashaly [27]	CICIDS2017	2024	Autoencoder-LSTM and Autoencoder-SVM	99.6% and 95.01%	-	Two hybrid models, Autoencoder-LSTM and Autoencoder-SVM, were developed to improve IDS detection by enhancing feature extraction.	• Generalizability • Scalability • Applicable solely to problems with two distinct classes	• Limited dataset diversity • Limited ability to process large or high-speed network traffic efficiently • Multi-class problems have uneven class distributions that cause learning bias, unlike simpler binary problems
Muhammad Basit Umair et al. [28]	NSL-KDD	2022	Multilayer CNN-LSTM	-	99.5%	The research presents an IDS using CNN and LSTM with a softmax classifier, evaluated on benchmark datasets and compared with a multilayer DNN.	• Generalizability	• Limited dataset diversity
Hesham Kamal and Maggie Mashaly [23]	CICIDS2017	2024	Transformer–CNN	-	99.13%	This research develops a Transformer–CNN model to enhance intrusion detection by handling data imbalance and identifying new threats in cloud environments.	• Generalizability • Scalability	• Limited dataset diversity • Limited ability to process large or high-speed network traffic efficiently
Rubayyi Alghamdi and Martine Bellaiche [29]	Combined Dataset (UNSW-NB15, Ton_IoT and IoT23)	2022	RF, DT, CNN and LSTM	99.45%	97.81%	This study develops an ensemble IDS for IoT by combining ML and DL models with automated selection to enhance attack detection.	• Class imbalance • Generalizability • Scalability	• Lack of resampling causes bias toward majority classes, reducing detection of rare threats • Limited dataset diversity • Limited ability to process large or high-speed network traffic efficiently

Table 2. Challenges and corresponding solutions addressed by the proposed approach.

Challenge	Approaches to Overcome Challenges
Detecting a Wide Range of Attack Types	One of the key challenges in intrusion detection is handling the wide range of attack types, each exhibiting distinct patterns and behaviors. This study addresses the challenge by proposing a novel IDS approach that implements a combined dataset framework with enhanced preprocessing and feature engineering. Specifically, it performs vertical concatenation of the CSE-CIC-IDS2018 and CICIDS2017 datasets, enabling the detection of 21 unique classes, comprising one benign class and 20 distinct attack types, thereby improving the system’s ability to recognize and differentiate a broad spectrum of attacks.
High Performance	Achieving high performance in IDSs remains a key challenge due to the need for accurate classification, efficient feature representation, and robust handling of anomalies. To address this, we present a high-performance IDS based on an integrated PCA–Transformer model. In this configuration, PCA is used to extract and condense relevant features, reducing dimensionality while preserving critical information. The Transformer network then performs accurate classification on these refined features. Additionally, our data preprocessing pipeline includes a hybrid outlier detection strategy, combining local outlier factor (LOF) and Z-score methods to effectively identify anomalies and enhance the overall reliability of the system.
Class Imbalance	Class imbalance often results in biased predictions that favor majority classes, thereby reducing the model’s effectiveness in detecting rare but critical attacks. To mitigate this issue, different techniques were applied depending on the task. Class weights were incorporated during training to improve the representation of minority classes. Additionally, the ADASYN oversampling technique was used to generate synthetic samples for underrepresented classes, while the ENN method was employed to eliminate noisy majority class instances, thereby enhancing class boundary clarity.
Generalization	Effectively managing variations in network traffic across diverse environments remains a significant challenge for IDSs. To address this, the PCA–Transformer model was initially applied to the combined CSE-CIC-IDS2018 and CICIDS2017 datasets, and subsequently evaluated on CSE-CIC-IDS2018, CICIDS2017, and NF-BoT-IoT-v2 to validate its generalization capability across different traffic conditions.
Scalability	Efficiently processing and classifying high-volume network traffic remains a critical requirement for modern IDSs. To address this, the proposed PCA–Transformer model was evaluated on the combined CSE-CIC-IDS2018 and CICIDS2017 dataset, as well as on the individual datasets: CSE-CIC-IDS2018, CICIDS2017, and NF-BoT-IoT-v2. This demonstrates the model’s capability to handle large-scale, diverse, and heterogeneous traffic scenarios. Furthermore, its scalability is validated through inference time, training time, and memory consumption metrics, as presented in Section 4.7, confirming its suitability for real-time and resource-constrained intrusion detection environments, as discussed in Section 5.3.

Table 3. Samples before/after LOF and Z-score on CSE-CIC-IDS2018 dataset.

Label	Samples Before LOF and Z-Score	Samples After LOF and Z-Score
Benign	39,000	3379
DDoS attacks-LOIC-HTTP	38,000	37,137
DDOS attack-HOIC	33,000	32,339
DoS attacks-Hulk	31,000	30,274
Bot	29,000	27,875
Infiltration	22,000	6071
SSH-Bruteforce	21,000	20,764
DoS attacks-GoldenEye	18,000	17,066
DoS attacks-Slowloris	9908	8453
DDOS attack-LOIC-UDP	1730	1649
Brute Force-Web	568	491
Brute Force-XSS	229	203
SQL Injection	85	46
DoS attacks-SlowHTTPTest	55	52
FTP-BruteForce	53	49

Table 4. Samples before/after LOF and Z-score on CICIDS2017 dataset.

Label	Samples Before LOF and Z-Score	Samples After LOF and Z-Score
Benign	99,000	25,637
PortScan	12,000	11,956
DDoS	14,000	13,611
DoS Hulk	39,000	37,133
DoS GoldenEye	4000	3625
FTP-Patator	2000	1992
SSH-Patator	900	854
DoS slowloris	1700	1566
DoS Slowhttptest	1200	1136
Bot	700	656
Web Attack-Brute Force	450	443
Web Attack-XSS	350	341
Infiltration	36	24
Web Attack-Sql Injection	21	19
Heartbleed	11	11
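
The hybrid filtering behind Tables 3 and 4 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the rule for combining the two detectors (a row is kept only if both accept it), the thresholds `z_max`, `k`, and `lof_max`, and the function names are all assumptions made for illustration, and the brute-force LOF shown here is only practical for small samples.

```python
import numpy as np

def zscore_mask(X, z_max=3.0):
    """True for rows whose every feature lies within z_max standard deviations."""
    std = X.std(axis=0)
    std[std == 0] = 1.0                       # constant columns never flag a row
    z = np.abs((X - X.mean(axis=0)) / std)
    return (z < z_max).all(axis=1)

def lof_scores(X, k=10):
    """Brute-force local outlier factor; values well above 1 suggest outliers."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)               # a point is not its own neighbour
    idx = np.argsort(d, axis=1)[:, :k]        # indices of the k nearest neighbours
    knn_d = np.take_along_axis(d, idx, axis=1)
    k_dist = knn_d[:, -1]                     # distance to the k-th neighbour
    reach = np.maximum(knn_d, k_dist[idx])    # reachability distance to each neighbour
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)  # local reachability density
    return lrd[idx].mean(axis=1) / lrd

def hybrid_outlier_mask(X, z_max=3.0, k=10, lof_max=2.0):
    """Assumed combination rule: keep a row only if both detectors accept it."""
    return zscore_mask(X, z_max) & (lof_scores(X, k) < lof_max)

# Toy data: a Gaussian cluster plus one obvious outlier in the last row.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[25.0, 25.0]]])
keep = hybrid_outlier_mask(X)
X_clean = X[keep]
```

Applying such a mask per class label, rather than to the whole dataset at once, would be one way to obtain per-label counts like those in the tables; the source does not state which was done.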

Table 5. Original and renamed feature names on the CSE-CIC-IDS2018 dataset.

Original Feature Name	Renamed Feature
Fwd Packets Length Total	Total Length of Fwd Packets
Bwd Packets Length Total	Total Length of Bwd Packets
Packet Length Min	Min Packet Length
Packet Length Max	Max Packet Length
Avg Packet Size	Average Packet Size
Init Fwd Win Bytes	Init_Win_bytes_forward
Init Bwd Win Bytes	Init_Win_bytes_backward
Fwd Act Data Packets	act_data_pkt_fwd
Fwd Seg Size Min	min_seg_size_forward
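
The renaming in Table 5 presumably aligns the CSE-CIC-IDS2018 column names with those of CICIDS2017 so that the two datasets can be concatenated vertically. A minimal pandas sketch of that step follows; the mapping is copied from Table 5, while the function name `combine` and the tiny example frames are hypothetical.

```python
import pandas as pd

# Original CSE-CIC-IDS2018 name -> renamed feature (Table 5).
RENAME_2018 = {
    "Fwd Packets Length Total": "Total Length of Fwd Packets",
    "Bwd Packets Length Total": "Total Length of Bwd Packets",
    "Packet Length Min": "Min Packet Length",
    "Packet Length Max": "Max Packet Length",
    "Avg Packet Size": "Average Packet Size",
    "Init Fwd Win Bytes": "Init_Win_bytes_forward",
    "Init Bwd Win Bytes": "Init_Win_bytes_backward",
    "Fwd Act Data Packets": "act_data_pkt_fwd",
    "Fwd Seg Size Min": "min_seg_size_forward",
}

def combine(df_2018, df_2017):
    """Rename the 2018 columns, then stack the two datasets row-wise."""
    aligned = df_2018.rename(columns=RENAME_2018)
    return pd.concat([aligned, df_2017], axis=0, ignore_index=True)

# Hypothetical two-feature excerpts, for illustration only.
df_2018 = pd.DataFrame({
    "Packet Length Min": [0, 40],
    "Avg Packet Size": [60.5, 72.0],
    "Label": ["Bot", "Benign"],
})
df_2017 = pd.DataFrame({
    "Min Packet Length": [6],
    "Average Packet Size": [80.0],
    "Label": ["PortScan"],
})
combined = combine(df_2018, df_2017)
```

Columns that exist in only one of the two frames would come out of `pd.concat` as missing values for the other dataset's rows, which is one plausible reason an imputation step is needed afterwards.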

Table 6. Unique classes on combined dataset.

Unique Classes	Unique Classes	Unique Classes
Benign	DoS attacks-GoldenEye	FTP-BruteForce
DDoS attacks-LOIC-HTTP	DoS attacks-Slowloris	PortScan
DDOS attack-HOIC	DDOS attack-LOIC-UDP	DDoS
DoS attacks-Hulk	Brute Force-Web	FTP-Patator
Bot	Brute Force-XSS	SSH-Patator
Infiltration	SQL Injection	Web Attack-XSS
SSH-Bruteforce	DoS attacks-SlowHTTPTest	Heartbleed

Table 7. Imputation techniques for missing values on combined dataset.

Framework	Dataset	Feature	Imputation Technique	Value
Combined System	CSE-CIC-IDS2018	Destination Port	Constant value imputation	0
	CSE-CIC-IDS2018	Fwd Header Length.1	Mean	127.866965
	CICIDS2017	Protocol	Constant value imputation	0
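
The imputation rules in Table 7 can be expressed as a single column-to-value mapping. The sketch below uses pandas and the three fill values from the table; the example rows are hypothetical, and it assumes that the gaps arise because each feature is absent from one of the two source datasets.

```python
import pandas as pd

# Fill values taken from Table 7 (constant 0, or the reported mean).
FILL_VALUES = {
    "Destination Port": 0,
    "Fwd Header Length.1": 127.866965,
    "Protocol": 0,
}

# Hypothetical combined rows: the first lacks the 2017-only columns,
# the second lacks the 2018-only column.
combined = pd.DataFrame({
    "Destination Port": [float("nan"), 443.0],
    "Fwd Header Length.1": [float("nan"), 64.0],
    "Protocol": [6.0, float("nan")],
})

# DataFrame.fillna accepts a dict mapping column names to fill values.
imputed = combined.fillna(value=FILL_VALUES)
```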

Table 8. Samples of train and test file on combined dataset.

Unique Classes	Train	Test
Benign	24,808	4208
DDoS attacks-LOIC-HTTP	31,583	5554
DDOS attack-HOIC	27,523	4816
DoS attacks-Hulk	57,162	10,245
Bot	24,203	4328
Infiltration	5195	900
SSH-Bruteforce	17,607	3157
DoS attacks-GoldenEye	17,663	3028
DoS attacks-Slowloris	8511	1508
DDOS attack-LOIC-UDP	1382	267
Brute Force-Web	808	126
Brute Force-XSS	165	38
SQL Injection	57	8
DoS attacks-SlowHTTPTest	1010	178
FTP-BruteForce	40	9
PortScan	10,155	1801
DDoS	11,549	2062
FTP-Patator	1689	303
SSH-Patator	715	139
Web Attack-XSS	290	51
Heartbleed	9	2

Table 9. Class weights used in model training for binary classification on the combined dataset.

Class	Weight
Normal	4.87996
Attack	0.55708

Table 10. Class weights used in model training for multi-class classification on the combined dataset.

Class	Weight
Benign	0.46476
DDoS attacks-LOIC-HTTP	0.36506
DDOS attack-HOIC	0.41891
DoS attacks-Hulk	0.20170
Bot	0.47638
Infiltration	2.21939
SSH-Bruteforce	0.65484
DoS attacks-GoldenEye	0.65276
DoS attacks-Slowloris	1.35468
DDOS attack-LOIC-UDP	8.34277
Brute Force-Web	14.26945
Brute Force-XSS	69.87706
SQL Injection	202.27569
DoS attacks-SlowHTTPTest	11.41556
FTP-BruteForce	288.24286
PortScan	1.13537
DDoS	0.99833
FTP-Patator	6.82636
SSH-Patator	16.12547
Web Attack-XSS	39.75764
Heartbleed	1281.07937
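
The weights in Tables 9 and 10 are consistent with the common "balanced" heuristic, weight = N / (K × n_c), applied to the training counts in Table 8, where N is the total number of training samples, K the number of classes, and n_c the count of class c. The sketch below reproduces the values checked in its assertions from those counts; whether the authors used a library helper for this is an assumption, and the function name is hypothetical.

```python
# Training samples per class, copied from Table 8.
TRAIN_COUNTS = {
    "Benign": 24808, "DDoS attacks-LOIC-HTTP": 31583, "DDOS attack-HOIC": 27523,
    "DoS attacks-Hulk": 57162, "Bot": 24203, "Infiltration": 5195,
    "SSH-Bruteforce": 17607, "DoS attacks-GoldenEye": 17663,
    "DoS attacks-Slowloris": 8511, "DDOS attack-LOIC-UDP": 1382,
    "Brute Force-Web": 808, "Brute Force-XSS": 165, "SQL Injection": 57,
    "DoS attacks-SlowHTTPTest": 1010, "FTP-BruteForce": 40, "PortScan": 10155,
    "DDoS": 11549, "FTP-Patator": 1689, "SSH-Patator": 715,
    "Web Attack-XSS": 290, "Heartbleed": 9,
}

def balanced_weights(counts):
    """weight_c = total_samples / (number_of_classes * samples_in_class_c)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}

# Multi-class weights (compare with Table 10).
multi = balanced_weights(TRAIN_COUNTS)

# Binary weights: all attack classes merged into one (compare with Table 9).
attack_total = sum(n for label, n in TRAIN_COUNTS.items() if label != "Benign")
binary = balanced_weights({"Normal": TRAIN_COUNTS["Benign"], "Attack": attack_total})
```

Because the weight is inversely proportional to the class count, the nine Heartbleed training samples receive a weight more than three orders of magnitude larger than the Benign class.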

Table 11. PCA model structure.

Block	Step	Output Size	Description
Input Block	Input features	number of Features	Raw input features depending on the dataset used
PCA Block	Fit PCA	59	PCA model fitted with n_components = 0.999999999999 to retain effectively all variance (components are kept until at least that fraction of the variance is explained)
PCA Block	Transform train/test data	59	Both training and test sets projected into reduced feature space
Output Block	PCA-transformed features	59	Final feature set used as input to the Transformer model
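
A fractional `n_components` means "keep the smallest number of leading components whose cumulative explained variance exceeds this fraction". The NumPy sketch below re-implements that selection rule for illustration; it is not the authors' code, the function name and toy data are hypothetical, and the PCA is fitted on the training set only and then applied to both sets, as Table 11 describes.

```python
import numpy as np

def pca_fit_transform(X_train, X_test, variance_to_keep=0.999999999999):
    """Fit PCA on X_train, project both sets, return (Z_train, Z_test, k)."""
    mean = X_train.mean(axis=0)
    # SVD of the centred training data; the rows of Vt are the principal axes.
    _, s, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    ratio = (s ** 2) / np.sum(s ** 2)          # explained variance ratio
    # Smallest k whose cumulative explained variance exceeds the threshold.
    k = int(np.searchsorted(np.cumsum(ratio), variance_to_keep, side="right")) + 1
    k = min(k, len(s))
    W = Vt[:k].T
    return (X_train - mean) @ W, (X_test - mean) @ W, k

rng = np.random.default_rng(0)

def make(n):
    """Toy data: 5 columns, 2 of them exact linear combinations (rank 3)."""
    base = rng.normal(size=(n, 3))
    return np.hstack([base, base[:, :1] + base[:, 1:2], 2 * base[:, 2:3]])

X_train, X_test = make(200), make(50)
Z_train, Z_test, k = pca_fit_transform(X_train, X_test)
```

With a threshold this close to 1, only directions that carry essentially no variance, such as exactly redundant columns, are dropped, so the step acts more as decorrelation and redundancy removal than as aggressive compression.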

Table 12. Transformer model structure.

Block	Layer Type	Output Size	Activation Function	Parameters	Description
Input block	Input layer	PCA Output (59 Features)	-	-	Feeds on the output from PCA as input.
Transformer block	Multi-head attention	-	-	num_heads = 8, key_dim = 64	Learns intricate patterns from the input data.
	Layer Normalization	-	-	epsilon = 1 × 10⁻⁶	Processes the attention layer’s output to ensure uniformity.
	Add (Residual Connection)	-	-	-	Merges input data with the attention output to ensure stability.
Feed Forward block	Dense layer	128	ReLU	units = 128, activation = “relu”	Carries out a dense mapping followed by ReLU activation.
	Dense layer	128	-	units = 128	Implements another dense layer without activation.
	Add (Residual Connection)	-	-	-	Integrates the feed-forward output with the prior block’s output.
	Layer Normalization	-	-	epsilon = 1 × 10⁻⁶	Ensures stability by normalizing the combined output.
Output block	Output layer	1 (Binary)	Sigmoid	-	Single unit for Binary classification
Output block	Output layer	Number of Classes (Multi-Class)	Softmax	-	Units (number of classes) for Multi-class classification

Table 13. Model hyperparameters.

Parameter	Binary Classifier	Multi-Class Classifier
Batch size	128	128
Learning rate	Scheduled: Initial = 0.001, Factor = 0.5, Min = 1 × 10⁻⁵ (ReduceLROnPlateau)	Scheduled: Initial = 0.001, Factor = 0.5, Min = 1 × 10⁻⁵ (ReduceLROnPlateau)
Optimizer	Adam	Adam
Loss function	Binary cross-entropy	Categorical cross-entropy
Metric	Accuracy	Accuracy
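
The learning-rate schedule in Table 13 (initial 0.001, factor 0.5, minimum 1 × 10⁻⁵) can be traced with a small pure-Python simulation of reduce-on-plateau behaviour. The `patience` value and the monitored quantity (validation loss) are assumptions, since the table does not list them, and the function is a simplified stand-in rather than the framework callback itself.

```python
def simulate_reduce_on_plateau(val_losses, initial_lr=1e-3, factor=0.5,
                               min_lr=1e-5, patience=2):
    """Return the learning rate in effect after each epoch.

    Simplified rule: if the monitored loss has not improved for `patience`
    consecutive epochs, multiply the learning rate by `factor`, never going
    below `min_lr`. `patience` is a hypothetical value.
    """
    lr, best, wait = initial_lr, float("inf"), 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# A loss that never improves after the first epoch triggers repeated halving.
history = simulate_reduce_on_plateau([1.0] * 20)
```

Starting from 0.001, six halvings give 1.5625 × 10⁻⁵; the seventh would fall below the minimum and is clipped to 1 × 10⁻⁵, after which the rate stays constant.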

Table 14. Assessment metrics for binary classification.

Dataset	Model	Accuracy	Precision	Recall	F-Score
NF-BoT-IoT-v2	CNN	99.73%	99.73%	99.73%	99.73%
	Autoencoder	99.94%	99.94%	99.94%	99.94%
	MLP	99.81%	99.81%	99.81%	99.81%
	Transformer	99.92%	99.92%	99.92%	99.92%
	PCA–Transformer	99.98%	99.98%	99.98%	99.98%
CSE-CIC-IDS2018	CNN	99.61%	99.68%	99.61%	99.63%
	Autoencoder	99.64%	99.70%	99.64%	99.65%
	MLP	99.62%	99.69%	99.62%	99.64%
	Transformer	99.64%	99.70%	99.64%	99.65%
	PCA–Transformer	99.66%	99.71%	99.66%	99.67%
CICIDS2017	CNN	98.58%	98.57%	98.58%	98.57%
	Autoencoder	99.70%	99.70%	99.70%	99.70%
	MLP	99.59%	99.59%	99.59%	99.59%
	Transformer	99.72%	99.72%	99.72%	99.71%
	PCA–Transformer	99.75%	99.75%	99.75%	99.75%
Combined Dataset	CNN	99.76%	99.76%	99.76%	99.76%
	Autoencoder	99.77%	99.78%	99.77%	99.77%
	MLP	99.75%	99.76%	99.75%	99.75%
	Transformer	99.77%	99.78%	99.77%	99.77%
	PCA–Transformer	99.80%	99.81%	99.80%	99.80%

Table 15. Assessment metrics for multi-class classification.

Dataset	Model	Accuracy	Precision	Recall	F-Score
NF-BoT-IoT-v2	CNN	97.75%	97.82%	97.75%	97.75%
	Autoencoder	97.98%	98.02%	97.98%	97.98%
	MLP	97.82%	97.89%	97.82%	97.82%
	Transformer	97.97%	98.02%	97.97%	97.97%
	PCA–Transformer	98.01%	98.06%	98.01%	98.01%
CSE-CIC-IDS2018	CNN	99.52%	99.64%	99.52%	99.55%
	Autoencoder	99.53%	99.65%	99.53%	99.57%
	MLP	99.56%	99.67%	99.56%	99.59%
	Transformer	99.57%	99.66%	99.57%	99.60%
	PCA–Transformer	99.59%	99.68%	99.59%	99.62%
CICIDS2017	CNN	99.04%	99.12%	99.04%	99.07%
	Autoencoder	99.37%	99.64%	99.37%	99.29%
	MLP	98.65%	98.79%	98.65%	98.65%
	Transformer	99.45%	99.69%	99.45%	99.40%
	PCA–Transformer	99.51%	99.57%	99.51%	99.52%
Combined Dataset	CNN	99.03%	99.30%	99.03%	99.09%
	Autoencoder	99.22%	99.41%	99.22%	99.25%
	MLP	99.01%	99.34%	99.01%	99.09%
	Transformer	99.23%	99.48%	99.23%	99.29%
	PCA–Transformer	99.28%	99.51%	99.28%	99.33%
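
In every row of Tables 14 and 15 the reported recall equals the reported accuracy, which is what support-weighted averaging over classes produces. Assuming that averaging scheme (an inference from the numbers, not something stated in the tables), the four metrics can be computed from a confusion matrix as sketched below; the example matrix is hypothetical.

```python
def weighted_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j.

    Returns (accuracy, precision, recall, f_score), where the last three are
    per-class values averaged with weights proportional to class support.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n_classes)) / total
    precision = recall = f_score = 0.0
    for c in range(n_classes):
        tp = cm[c][c]
        support = sum(cm[c])                              # true samples of class c
        predicted = sum(cm[r][c] for r in range(n_classes))
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        w = support / total
        precision += w * p
        recall += w * r
        f_score += w * f
    return accuracy, precision, recall, f_score

# Hypothetical 3-class confusion matrix.
cm = [[90, 10, 0],
      [5, 40, 5],
      [0, 2, 8]]
acc, prec, rec, f1 = weighted_metrics(cm)
```

Weighted recall reduces algebraically to the sum of the diagonal divided by the total, so it always coincides with accuracy; precision and F-score can differ from it, as they do in several rows of the tables.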

Table 16. Ablation results showing the impact of individual components on PCA–Transformer model performance.

Classification Type	Dataset	Model	Accuracy	Precision	Recall	F-Score
Binary	NF-BoT-IoT-v2	PCA	97.80%	97.92%	97.80%	97.84%
		Transformer	99.92%	99.92%	99.92%	99.92%
		PCA–Transformer	99.98%	99.98%	99.98%	99.98%
	CSE-CIC-IDS2018	PCA	98.39%	99.15%	98.39%	98.63%
		Transformer	99.64%	99.70%	99.64%	99.65%
		PCA–Transformer	99.66%	99.71%	99.66%	99.67%
	CICIDS2017	PCA	98.30%	98.30%	98.30%	98.30%
		Transformer	99.72%	99.72%	99.72%	99.71%
		PCA–Transformer	99.75%	99.75%	99.75%	99.75%
	Combined Dataset	PCA	98.61%	98.78%	98.61%	98.65%
		Transformer	99.77%	99.78%	99.77%	99.77%
		PCA–Transformer	99.80%	99.81%	99.80%	99.80%
Multi-Class	NF-BoT-IoT-v2	PCA	94.22%	94.96%	94.22%	94.51%
		Transformer	97.97%	98.02%	97.97%	97.97%
		PCA–Transformer	98.01%	98.06%	98.01%	98.01%
	CSE-CIC-IDS2018	PCA	97.06%	98.30%	97.06%	97.57%
		Transformer	99.57%	99.66%	99.57%	99.60%
		PCA–Transformer	99.59%	99.68%	99.59%	99.62%
	CICIDS2017	PCA	97.21%	97.93%	97.21%	97.35%
		Transformer	99.45%	99.69%	99.45%	99.40%
		PCA–Transformer	99.51%	99.57%	99.51%	99.52%
	Combined Dataset	PCA	96.73%	97.58%	96.73%	97.02%
		Transformer	99.23%	99.48%	99.23%	99.29%
		PCA–Transformer	99.28%	99.51%	99.28%	99.33%

Table 17. Inference time for PCA–Transformer model.

Dataset	Classification Type	Inference Time Per Batch (128 Samples) (Seconds)	Inference Time Per Sample (milliseconds)
NF-BoT-IoT-v2	Binary	0.083739	0.654211
NF-BoT-IoT-v2	Multi-class	0.084465	0.659883
CSE-CIC-IDS2018	Binary	0.077026	0.601766
CSE-CIC-IDS2018	Multi-class	0.081579	0.637336
CICIDS2017	Binary	0.080222	0.626734
CICIDS2017	Multi-class	0.080888	0.631938
Combined Dataset	Binary	0.080309	0.627414
Combined Dataset	Multi-class	0.082079	0.641242

Table 18. Training time for PCA–Transformer model.

Dataset	Classification Type	Training Time Per Batch (128 Samples) (Seconds)	Training Time Per Sample (milliseconds)
NF-BoT-IoT-v2	Binary	0.083813	0.654789
NF-BoT-IoT-v2	Multi-class	0.085264	0.666125
CSE-CIC-IDS2018	Binary	0.078359	0.612180
CSE-CIC-IDS2018	Multi-class	0.081682	0.638141
CICIDS2017	Binary	0.083152	0.649625
CICIDS2017	Multi-class	0.083170	0.649766
Combined Dataset	Binary	0.080862	0.631734
Combined Dataset	Multi-class	0.082748	0.646469
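
The per-sample columns in Tables 17 and 18 follow from the per-batch columns by dividing by the batch size of 128 and converting seconds to milliseconds. A short sketch of that arithmetic is given below; the throughput helper is an added illustration that assumes batches are processed one after another, and is not a figure reported in the tables.

```python
BATCH_SIZE = 128  # batch size used in Tables 17 and 18

def per_sample_ms(batch_seconds, batch_size=BATCH_SIZE):
    """Convert a per-batch time in seconds to a per-sample time in milliseconds."""
    return batch_seconds / batch_size * 1000.0

def throughput_per_second(batch_seconds, batch_size=BATCH_SIZE):
    """Samples processed per second, assuming strictly sequential batches."""
    return batch_size / batch_seconds

# NF-BoT-IoT-v2, binary inference (Table 17): 0.083739 s per batch.
inference_ms = per_sample_ms(0.083739)

# Combined dataset, multi-class training (Table 18): 0.082748 s per batch.
training_ms = per_sample_ms(0.082748)
```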

Table 19. Memory consumption for PCA–Transformer model.

Dataset	Classification Type	Memory Consumption (MB) Per Batch (128 Samples)
NF-BoT-IoT-v2	Binary	0.25
NF-BoT-IoT-v2	Multi-class	0.29
CSE-CIC-IDS2018	Binary	0.34
CSE-CIC-IDS2018	Multi-class	0.74
CICIDS2017	Binary	0.34
CICIDS2017	Multi-class	0.73
Combined Dataset	Binary	0.35
Combined Dataset	Multi-class	0.93

Table 20. Comparative analysis of Transformer variants and lightweight attention mechanisms in terms of scalability and performance for binary and multi-class classification.

Classification Type	Model	Accuracy	Inference Time per Batch (128 Samples) (Seconds)	Training Time per Batch (128 Samples) (Seconds)	Memory Consumption (MB) per Batch (128 Samples)
Binary	Linformer	99.75%	0.073526	0.077988	0.27
	Performer	99.76%	0.075856	0.079189	0.25
	Transformer	99.77%	0.077694	0.079305	0.24
Multi-Class	Linformer	95.31%	0.078310	0.082654	0.29
	Performer	96.06%	0.078941	0.082798	0.27
	Transformer	99.23%	0.081036	0.083193	0.92

Table 21. Prediction probabilities for binary classification using the PCA–Transformer model on the combined dataset.

Class	Prediction Probability
Normal	1.00
Attack	0.00

Table 22. LIME feature contributions to the PCA–Transformer model’s binary classification decision on the combined dataset.

Feature	Normalized Value	LIME Contribution Weight
ACK Flag Count	0.000000	0.386353
PSH Flag Count	0.000000	0.242650
Fwd PSH Flags	0.000000	0.230619
FIN Flag Count	0.000000	0.221099
SYN Flag Count	0.000000	0.202133
ECE Flag Count	0.000000	0.130902
Active Std	0.000000	−0.119029
RST Flag Count	0.000000	0.084017
Bwd IAT Max	0.000000	0.058481
Fwd Packets/s	0.006452	−0.053315

Table 23. Prediction probabilities for multi-class classification using the PCA–Transformer model on the combined dataset.

Class	Prediction Probability
Benign	0.00
DDoS attacks-LOIC-HTTP	0.00
DDOS attack-HOIC	0.00
DoS attacks-Hulk	0.00
Bot	0.00
Infiltration	0.00
SSH-Bruteforce	0.00
DoS attacks-GoldenEye	0.00
DoS attacks-Slowloris	0.00
DDOS attack-LOIC-UDP	0.00
Brute Force-Web	0.00
Brute Force-XSS	0.00
SQL Injection	0.00
DoS attacks-SlowHTTPTest	0.00
FTP-BruteForce	0.00
PortScan	0.00
DDoS	0.00
FTP-Patator	0.00
SSH-Patator	0.00
Web Attack-XSS	0.00
Heartbleed	1.00

Table 24. LIME feature contributions to the PCA–Transformer model’s multi-class classification decision on the combined dataset.

Feature	Normalized Value	LIME Contribution Weight
Fwd Header Length.1	0.997014	0.002998
Fwd Packet Length Max	1.000000	0.001736
Total Backward Packets	0.458619	0.001727
Subflow Bwd Bytes	0.999816	0.001565
Total Length of Bwd Packets	0.999816	0.001556
Subflow Fwd Packets	0.011287	0.001545
Avg Bwd Segment Size	0.898546	0.001500
act_data_pkt_fwd	0.000484	0.001499
Bwd Packet Length Mean	0.898546	0.001491

Table 25. Per-class results of the PCA–Transformer model for binary classification.

Dataset	Class	Accuracy	Precision	Recall	F-Score
NF-BoT-IoT-v2	Normal	99.92%	99.84%	99.92%	99.88%
NF-BoT-IoT-v2	Attack	99.98%	99.99%	99.98%	99.99%
CSE-CIC-IDS2018	Normal	99.60%	84.40%	99.60%	91.37%
CSE-CIC-IDS2018	Attack	99.66%	99.99%	99.66%	99.83%
CICIDS2017	Normal	99.20%	99.96%	99.20%	99.57%
CICIDS2017	Attack	99.98%	99.66%	99.98%	99.82%
Combined Dataset	Normal	100%	98.02%	100%	99%
Combined Dataset	Attack	99.78%	100%	99.78%	99.89%

Table 26. Per-class results of the PCA–Transformer model for multi-class classification on the NF-BoT-IoT-v2 dataset.

Class	Accuracy	Precision	Recall	F-Score
Benign	99.92%	99.05%	99.92%	99.48%
Reconnaissance	94.50%	99.42%	94.50%	96.90%
DDoS	99.21%	99.23%	99.21%	99.22%
DoS	98.71%	95.17%	98.71%	96.91%
Theft	100%	100%	100%	100%

Table 27. Per-class results of the PCA–Transformer model for multi-class classification on the CSE-CIC-IDS2018 dataset.

Class	Accuracy	Precision	Recall	F-Score
Benign	95.05%	93.20%	95.05%	94.12%
DDoS attacks-LOIC-HTTP	99.98%	100%	99.98%	99.99%
DDOS attack-HOIC	100%	100%	100%	100%
DoS attacks-Hulk	100%	100%	100%	100%
Bot	99.44%	100%	99.44%	99.72%
Infiltration	96.06%	97.05%	96.06%	96.55%
SSH-Bruteforce	100%	100%	100%	100%
DoS attacks-GoldenEye	100%	100%	100%	100%
DoS attacks-Slowloris	100%	100%	100%	100%
DDOS attack-LOIC-UDP	100%	100%	100%	100%
Brute Force-Web	72.37%	100%	72.37%	83.97%
Brute Force-XSS	96.30%	39.39%	96.30%	55.91%
SQL Injection	100%	44.44%	100%	61.54%
DoS attacks-SlowHTTPTest	44.44%	50%	44.44%	47.06%
FTP-BruteForce	66.67%	61.54%	66.67%	64%

Table 28. Per-class results of the PCA–Transformer model for multi-class classification on the CICIDS2017 dataset.

Class	Accuracy	Precision	Recall	F-Score
Benign	100%	100%	100%	100%
PortScan	99.89%	100%	99.89%	99.94%
DDoS	99.85%	99.90%	99.85%	99.88%
DoS Hulk	99.91%	99.96%	99.91%	99.94%
DoS GoldenEye	99.82%	99.10%	99.82%	99.46%
FTP-Patator	99.34%	99.67%	99.34%	99.50%
SSH-Patator	91.34%	100%	91.34%	95.47%
DoS slowloris	99.22%	97.68%	99.22%	98.44%
DoS Slowhttptest	100%	98.77%	100%	99.38%
Bot	98.99%	98%	98.99%	98.49%
Web Attack-Brute Force	60%	72.41%	60%	65.62%
Web Attack-XSS	71.70%	60.32%	71.70%	65.52%
Infiltration	33.33%	100%	33.33%	50%
Web Attack-Sql Injection	100%	29.41%	100%	45.45%
Heartbleed	50%	100%	50%	66.67%

Table 29. Per-class results of the PCA–Transformer model for multi-class classification on the combined dataset.

Class	Accuracy	Precision	Recall	F-Score
Benign	96.98%	99.98%	96.98%	98.46%
DDoS attacks-LOIC-HTTP	99.42%	100%	99.42%	99.71%
DDOS attack-HOIC	100%	100%	100%	100%
DoS attacks-Hulk	99.90%	99.95%	99.90%	99.93%
Bot	99.33%	99.93%	99.33%	99.63%
Infiltration	100%	87.89%	100%	93.56%
SSH-Bruteforce	100%	100%	100%	100%
DoS attacks-GoldenEye	99.54%	99.77%	99.54%	99.65%
DoS attacks-Slowloris	99.87%	99.67%	99.87%	99.77%
DDOS attack-LOIC-UDP	100%	98.52%	100%	99.26%
Brute Force-Web	41.27%	94.55%	41.27%	57.46%
Brute Force-XSS	94.74%	33.64%	94.74%	49.66%
SQL Injection	50%	20%	50%	28.57%
DoS attacks-SlowHTTPTest	95.51%	99.42%	95.51%	97.42%
FTP-BruteForce	100%	60%	100%	75%
PortScan	99.83%	100%	99.83%	99.92%
DDoS	99.95%	100%	99.95%	99.98%
FTP-Patator	99.67%	99.34%	99.67%	99.51%
SSH-Patator	98.56%	98.56%	98.56%	98.56%
Web Attack-XSS	100%	46.36%	100%	63.35%
Heartbleed	100%	100%	100%	100%

Table 30. Real-time prediction results using the hybrid PCA–Transformer model.

Actual Class	Predicted Class	Probabilities
Bot	Bot	Bot: 1.0000, Others: 0.0000
DDoS attacks-LOIC-HTTP	DDoS attacks-LOIC-HTTP	DDoS attacks-LOIC-HTTP: 1.0000, Others: 0.0000
SSH-Bruteforce	SSH-Bruteforce	SSH-Bruteforce: 1.0000, Others: 0.0000
DDoS attacks-LOIC-HTTP	DDoS attacks-LOIC-HTTP	DDoS attacks-LOIC-HTTP: 1.0000, Others: 0.0000
DDoS attacks-LOIC-HTTP	DDoS attacks-LOIC-HTTP	DDoS attacks-LOIC-HTTP: 1.0000, Others: 0.0000
DDOS attack-HOIC	DDOS attack-HOIC	DDOS attack-HOIC: 1.0000, Others: 0.0000
DDOS attack-HOIC	DDOS attack-HOIC	DDOS attack-HOIC: 1.0000, Others: 0.0000
DoS attacks-GoldenEye	DoS attacks-GoldenEye	DoS attacks-GoldenEye: 1.0000, Others: 0.0000
Bot	Bot	Bot: 1.0000, Others: 0.0000
DoS attacks-GoldenEye	DoS attacks-GoldenEye	DoS attacks-GoldenEye: 1.0000, Others: 0.0000
DDoS attacks-LOIC-HTTP	DDoS attacks-LOIC-HTTP	DDoS attacks-LOIC-HTTP: 1.0000, Others: 0.0000
DoS attacks-Hulk	DoS attacks-Hulk	DoS attacks-Hulk: 0.9926, DoS attacks-GoldenEye: 0.0074, Others: 0.0000
Bot	Bot	Bot: 1.0000, Others: 0.0000
DDOS attack-HOIC	DDOS attack-HOIC	DDOS attack-HOIC: 1.0000, Others: 0.0000
DoS attacks-Hulk	DoS attacks-Hulk	DoS attacks-Hulk: 0.9921, DoS attacks-GoldenEye: 0.0079, Others: 0.0000
DoS attacks-GoldenEye	DoS attacks-GoldenEye	DoS attacks-GoldenEye: 0.9996, DoS attacks-Hulk: 0.0003, DDoS attacks-LOIC-HTTP: 0.0001, Others: 0.0000
DDoS attacks-LOIC-HTTP	DDoS attacks-LOIC-HTTP	DDoS attacks-LOIC-HTTP: 1.0000, Others: 0.0000
SSH-Bruteforce	SSH-Bruteforce	SSH-Bruteforce: 1.0000, Others: 0.0000
DDOS attack-HOIC	DDOS attack-HOIC	DDOS attack-HOIC: 1.0000, Others: 0.0000
DoS attacks-Hulk	DoS attacks-Hulk	DoS attacks-Hulk: 0.9934, DoS attacks-GoldenEye: 0.0066, Others: 0.0000
Heartbleed	Heartbleed	Heartbleed: 1.0000, Others: 0.0000
Heartbleed	Heartbleed	Heartbleed: 1.0000, Others: 0.0000

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
