Future Internet
  • Editor’s Choice
  • Article
  • Open Access

23 December 2024

Advanced Hybrid Transformer-CNN Deep Learning Model for Effective Intrusion Detection Systems with Class Imbalance Mitigation Using Resampling Techniques

Department of Information Engineering and Technology, German University in Cairo, Cairo 11835, Egypt
Authors to whom correspondence should be addressed.
This article belongs to the Section Cybersecurity

Abstract

Network and cloud environments must be fortified against a dynamic array of threats, and intrusion detection systems (IDSs) are critical tools for identifying and thwarting hostile activities. IDSs, classified as anomaly-based or signature-based, have increasingly incorporated deep learning models into their framework. Recently, significant advancements have been made in anomaly-based IDSs, particularly those using machine learning, where attack detection accuracy has been notably high. Our proposed method demonstrates that deep learning models can achieve unprecedented success in identifying both known and unknown threats within cloud environments. However, existing benchmark datasets for intrusion detection typically contain more normal traffic samples than attack samples to reflect real-world network traffic. This imbalance in the training data makes it more challenging for IDSs to accurately detect specific types of attacks. Thus, our challenges arise from two key factors, unbalanced training data and the emergence of new, unidentified threats. To address these issues, we present a hybrid transformer-convolutional neural network (Transformer-CNN) deep learning model, which leverages data resampling techniques such as adaptive synthetic (ADASYN), synthetic minority oversampling technique (SMOTE), edited nearest neighbors (ENN), and class weights to overcome class imbalance. The transformer component of our model is employed for contextual feature extraction, enabling the system to analyze relationships and patterns in the data effectively. In contrast, the CNN is responsible for final classification, processing the extracted features to accurately identify specific attack types. The Transformer-CNN model focuses on three primary objectives to enhance detection accuracy and performance: (1) reducing false positives and false negatives, (2) enabling real-time intrusion detection in high-speed networks, and (3) detecting zero-day attacks. We evaluate our proposed model, Transformer-CNN, using the NF-UNSW-NB15-v2 and CICIDS2017 benchmark datasets, and assess its performance with metrics such as accuracy, precision, recall, and F1-score. The results demonstrate that our method achieves an impressive 99.71% accuracy in binary classification and 99.02% in multi-class classification on the NF-UNSW-NB15-v2 dataset, while for the CICIDS2017 dataset, it reaches 99.93% in binary classification and 99.13% in multi-class classification, significantly outperforming existing models. This proves the enhanced capability of our IDS in defending cloud environments against intrusions, including zero-day attacks.

1. Introduction

As the internet has evolved and expanded over time, it now offers a wide array of valuable services that significantly improve people’s lives. Nevertheless, these services are accompanied by various security threats. The increasing prevalence of network infections, eavesdropping, and malicious attacks complicates detection efforts and contributes to a rise in false alarms. Consequently, network security has become a paramount concern for a growing number of internet users, including in critical sectors such as banking, corporations, and government agencies.
Cyber-attacks typically initiate with reconnaissance efforts aimed at identifying system vulnerabilities, which are subsequently exploited to execute harmful actions [1]. Unauthorized access to computer systems threatens their confidentiality, integrity, and availability (CIA), resulting in what is classified as an “intrusion” [2]. In recent years, a plethora of sophisticated cyber-attack methods has emerged, including brute force attacks, botnets, distributed denial of service (DDoS) attacks, and cross-site scripting [3]. These developments have heightened concerns regarding cyber security. Cybercriminals are increasingly leveraging numerous hosts and cloud servers as vehicles for deploying malware and botnets, including Bitcoin Trojans. According to the internet security threat report (ISTR), malware is detected, on average, every 13 s during web searches. There has been a marked rise in incidents of ransomware, email spam, and other online threats, as noted by CNBC [4,5]. In this context, intrusion detection systems are crucial for enhancing network security and alleviating the growing risks associated with cyber-attacks [6].
Real-time intrusion detection is essential for maintaining the security and integrity of network infrastructures. Deep learning models have demonstrated remarkable effectiveness in analyzing network traffic instantaneously, facilitating the rapid identification of potential intrusions [7]. Various machine learning strategies contribute to enhancing the agility of intrusion detection systems (IDS), particularly in their ability to adapt to newly emerging threats [8]. Moreover, the incorporation of real-time functionalities within IDS significantly bolsters network security by enabling the swift detection and mitigation of attacks [9].
IDS are among the most commonly implemented security mechanisms, designed to detect and prevent unauthorized access while safeguarding both individual computers and broader network infrastructures from malicious threats. These systems can be classified into two primary categories, based on their method of identifying intrusions:
  • Signature-based IDS: This approach involves scrutinizing network traffic or host activity by matching it against a repository of known malicious patterns. While it excels at detecting familiar threats, its efficacy hinges on continuous updates to remain vigilant against evolving attacks. However, its dependence on established signatures renders it less effective in confronting unknown or zero-day threats, as it lacks the capacity to detect new intrusions that fall outside its predefined dataset.
  • Anomaly-based IDS: These systems detect threats by recognizing deviations from established behavioral norms, rather than relying on predefined attack signatures. This makes them particularly adept at identifying zero-day attacks that exploit previously undiscovered vulnerabilities. By utilizing machine learning and deep learning algorithms, anomaly-based IDS can analyze extensive datasets, learn patterns of normal system behavior, and detect anomalies with exceptional precision. This method not only enhances adaptability to emerging threats but also minimizes false positives. In our research, we adopted this approach to improve the accuracy and responsiveness of intrusion detection.
In this study, we introduce an advanced hybrid deep learning model combining Transformer and convolutional neural network (CNN) architectures for a robust intrusion detection system. Our methodology tackles class imbalance by employing various data resampling techniques, such as adaptive synthetic (ADASYN) and synthetic minority oversampling technique (SMOTE), for binary and multi-class classification, along with edited nearest neighbors (ENN) and class weighting strategies to enhance model robustness. The findings reveal that our Transformer-CNN model significantly outperforms prior methods, achieving an impressive 99.71% accuracy in binary classification and 99.02% in multi-class classification on the NF-UNSW-NB15-v2 dataset [10,11], as well as 99.93% accuracy in binary classification and 99.13% in multi-class classification on the CICIDS2017 dataset [12,13], highlighting its efficacy in diverse operational contexts. Below, we outline the key contributions of our research:
  • We create a highly efficient intrusion detection system using an advanced hybrid Transformer-CNN model, integrated with techniques such as ADASYN, SMOTE, ENN, and class weights to effectively tackle class imbalance challenges.
  • An enhanced data preprocessing pipeline is applied, which first utilizes a combined outlier detection approach using Z-score and local outlier factor (LOF) to identify and handle outliers, followed by correlation-based feature selection. This structured approach refines model input, enhancing accuracy and reducing computational complexity.
  • Using the NF-UNSW-NB15-v2 and CICIDS2017 datasets, this study highlights the exceptional performance of the proposed model, demonstrating its superiority compared to current state-of-the-art models in the field.
This paper is organized into several sections: Section 2 delivers an extensive overview of the relevant literature, offering insights into existing research in the field. Section 3 outlines the methodology utilized in this study, detailing the approaches and techniques employed. Section 4 showcases the results derived from the experimental procedures, providing an analysis of the data obtained. Following this, Section 5 engages in a thorough discussion of the findings, interpreting their significance and implications. Section 6 highlights the limitations encountered within the proposed methodology, providing a critical assessment of its scope. Section 7 concludes the study by summarizing the primary contributions and key insights gained. Lastly, Section 8 presents potential avenues for future research, suggesting directions for further exploration and investigation.

3. Proposed Approach

The Transformer-CNN model embodies a cutting-edge deep learning architecture that fuses the strengths of Transformer and CNN to achieve exceptional performance in both binary and multi-class classification tasks. This innovative framework proficiently addresses critical challenges faced by IDS, particularly in enhancing classification accuracy and mitigating class imbalances, with a primary emphasis on the NF-UNSW-NB15-v2 and CICIDS2017 datasets. In this section, we outline the detailed steps involved in the model, including comprehensive preprocessing procedures applied to the NF-UNSW-NB15-v2 dataset, followed by an evaluation of its performance on both the NF-UNSW-NB15-v2 and CICIDS2017 datasets. To tackle the issue of class imbalance, the model incorporates a suite of advanced data preprocessing techniques. It employs ADASYN and SMOTE to effectively oversample minority classes, thereby bolstering the model’s capacity to learn from underrepresented yet crucial instances. Additionally, the model utilizes ENN for strategic undersampling, while also applying class weights to recalibrate the importance of each class during the training phase. This dual strategy not only ensures that challenging cases receive adequate focus but also preserves a balanced class distribution throughout the training process. The transformer component of the model is dedicated to contextual feature extraction, empowering the system to adeptly analyze relationships and patterns within the data. Meanwhile, the CNN efficiently processes the extracted features to accurately classify specific attack types. This synergistic architecture significantly reduces the incidence of false positives and false negatives, thereby enhancing the model’s ability to detect both known threats and previously unseen (zero-day) attacks. The model’s outstanding performance is underscored by its remarkable results on the NF-UNSW-NB15-v2 dataset, where it achieved an impressive 99.71% accuracy in binary classification and 99.02% accuracy in multi-class classification, as well as on the CICIDS2017 dataset, achieving 99.93% accuracy in binary classification and 99.13% accuracy in multi-class classification. Figure 1 illustrates the model architecture and its application to various classification tasks using the NF-UNSW-NB15-v2 dataset, providing a clear visual representation of its capabilities.
Figure 1. Architectural design for binary classification and multi-class classification using NF-UNSW-NB15-v2 dataset.

3.1. Description of Dataset

The UNSW-NB15 dataset [19], released in 2015 by the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS), is a widely recognized and utilized NIDS dataset. It combines benign network activities with premeditated attack scenarios, comprising a total of 2,540,044 network samples, with 2,218,761 (87.35%) benign and 321,283 (12.65%) attack samples. The dataset contains 49 features, twelve of which were derived using SQL algorithms. More recently, the NF-UNSW-NB15-v2 dataset was generated and released in 2021, based on the original UNSW-NB15 dataset. This NetFlow dataset includes 43 NetFlow-based features extracted from the pcap files of the UNSW-NB15 dataset using the nProbe feature extraction tool, with the data flows labeled appropriately. The NF-UNSW-NB15-v2 dataset contains a total of 2,390,275 flows, of which 95,053 (3.98%) are attack samples and 2,295,222 (96.02%) are benign [10]. Additionally, this dataset is divided into ten classes, comprising nine different attack types and one class for benign traffic. Table 3 provides a detailed overview of the different attack categories, including comprehensive descriptions of each, along with the distribution of samples across the various classes within the datasets. Figure 1 depicts the architecture developed for binary and multi-class classification using the NF-UNSW-NB15-v2 dataset.
Table 3. Types of attacks in the NF-UNSW-NB15-v2 dataset.

3.2. Data Preprocessing

Data preprocessing is a vital phase in both data analysis and machine learning workflows, where raw data are transformed into a refined, structured format, ready for effective analysis. This process encompasses a range of tasks, including handling missing values, removing duplicates, eliminating outliers or irrelevant data, selecting meaningful features, and applying normalization or standardization to numerical features. Additionally, class resampling techniques are employed to address imbalanced data. Proper data preprocessing significantly enhances data quality, minimizes noise, and optimizes the performance of machine learning models by enabling them to learn efficiently from the processed dataset. The specific steps and techniques required for preprocessing vary depending on the dataset. The NF-UNSW-NB15-v2 dataset, despite its comprehensiveness, contains missing or NaN values that are removed as the initial preprocessing step. Following this, any duplicate entries are eliminated. Outliers are then identified and removed using both the Z-score and LOF methods. Next, a correlation-based feature selection technique is applied to reduce dimensionality. Numerical features are normalized using the MinMaxScaler (scikit-learn 1.2.2) to achieve consistent scaling across the dataset. Once these preprocessing steps are completed, the dataset is split into training and testing subsets. Subsequently, the training and testing sets are recombined, and the ADASYN technique is applied to the combined dataset to generate synthetic samples for the minority classes. The dataset is then split again, with the new training set consisting of the original training data together with the ADASYN-generated samples, while the test set remains unchanged. This strategy helps mitigate class imbalance and allows the model to learn more effectively from the augmented data, ultimately improving accuracy and performance. Additionally, the ENN method is employed to undersample the training data, and class weights are adjusted during model training to further balance the dataset. Our comprehensive data preparation process, encompassing outlier removal, feature selection, normalization, resampling, and model development, is depicted in Figure 1, which illustrates the full workflow for both binary and multi-class classification tasks on the NF-UNSW-NB15-v2 dataset.

3.2.1. Removing Outliers Using Z-Score and Local Outlier Factor (LOF)

Z-score was applied to detect and filter out extreme outliers in the dataset. Specifically, the zscore function from the scipy.stats module calculated the z-scores for all features in the DataFrame. Z-scores represent how far a data point is from the mean in terms of standard deviations. A threshold of 6 was set, meaning any data point with a z-score greater than 6 in any feature was considered an outlier and removed. This process was applied for both binary and multi-class classification to ensure that the dataset remained clean and free of extreme outliers.
Following the z-score, the LOF method was implemented to further detect and eliminate outliers. LOF identifies data points with significantly lower density compared to their neighbors, making it particularly effective for datasets with varying density distributions. The LOF was configured with n_neighbors set to 20 and contamination set to 0.1, indicating that 10% of the data were expected to be outliers. After fitting the LOF model, samples were classified as either outliers (labeled −1) or inliers (labeled 1). Only the inlier samples were retained, resulting in a cleaner dataset for subsequent analysis. This dual approach of outlier removal enhanced the performance and reliability of the classification models for both binary and multi-class tasks.
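The following sketch illustrates this two-stage outlier removal in Python, assuming the preprocessed features are held in a numeric pandas DataFrame (or NumPy array) X with corresponding labels y; the function and variable names are illustrative rather than taken from the study's code.

```python
import numpy as np
from scipy.stats import zscore
from sklearn.neighbors import LocalOutlierFactor

def remove_outliers(X, y, z_threshold=6, n_neighbors=20, contamination=0.1):
    """Two-stage outlier removal: z-score filtering followed by LOF."""
    # Stage 1: drop rows where any feature has an absolute z-score above the threshold.
    z = np.abs(zscore(X, nan_policy="omit"))
    keep = (z <= z_threshold).all(axis=1)
    X, y = X[keep], y[keep]

    # Stage 2: Local Outlier Factor keeps only the inliers (predicted label +1).
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
    inlier_mask = lof.fit_predict(X) == 1
    return X[inlier_mask], y[inlier_mask]
```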
(i) Binary Classification
The NF-UNSW-NB15-v2 dataset underwent z-score filtering to remove extreme outliers, ensuring higher data quality for binary classification. Outliers were identified and removed based on how far data points deviated from the mean, improving the dataset’s reliability. As shown in Table 4, the Benign class was reduced from 96,432 to 93,653 samples after outlier removal. Similarly, attack categories such as Exploits and Fuzzers decreased from 18,804 to 17,576 and 12,999 to 11,695 samples, respectively. Smaller classes, including Shellcode, Backdoor, and Worms, also experienced slight reductions. This filtering process ensured that both majority and minority classes remained balanced while minimizing the impact of noise and outliers on the classification model’s performance.
Table 4. Sample distribution of NF-UNSW-NB15-v2 dataset in binary classification using z-score.
The sample distribution of the NF-UNSW-NB15-v2 dataset in binary classification is presented in Table 5, highlighting the impact of the LOF method on the dataset. Before applying LOF, the class Benign consisted of 93,653 samples, which decreased to 85,680 samples after outlier removal. Other classes also experienced significant reductions; for instance, Exploits reduced from 17,576 to 14,969, while Fuzzers dropped from 11,695 to 10,116. Smaller classes, such as Shellcode, decreased from 886 to 605, and Backdoor saw a reduction from 322 to 233 samples. The Worms class remained relatively stable, with a slight decline from 89 to 87 samples. This filtering process ensured a cleaner dataset, facilitating more accurate classification while reducing the influence of outliers on model performance. The adjustments made through LOF provide a balanced representation of both majority and minority classes, enhancing the reliability of the classification tasks.
Table 5. Sample distribution of NF-UNSW-NB15-v2 dataset in binary classification using LOF.
(ii) Multi-Class Classification
The sample distribution of the NF-UNSW-NB15-v2 dataset in multi-class classification is summarized in Table 6, which illustrates the effects of z-score on the dataset. Prior to the application of z-score, the Benign class contained 96,432 samples, which decreased to 93,530 samples following outlier removal. Similar reductions were observed in other classes; for example, Exploits dropped from 18,804 to 17,492, and Fuzzers declined from 12,999 to 11,730. The Reconnaissance class saw a reduction from 7121 to 6881, while Generic samples decreased from 3810 to 3234. The DoS class also experienced a notable decrease, from 2677 to 2195 samples. In contrast, some smaller classes exhibited minimal changes, such as Shellcode, which went from 900 to 886, and Analysis, which decreased slightly from 490 to 330. The Worms class experienced a reduction from 104 to 92 samples. These modifications, achieved through z-score, helped maintain the integrity of the dataset by ensuring that extreme outliers were removed, thereby enhancing the robustness and accuracy of the subsequent classification models.
Table 6. Sample distribution of NF-UNSW-NB15-v2 dataset in multi-class classification using z-score.
The sample distribution of the NF-UNSW-NB15-v2 dataset in multi-class classification is detailed in Table 7, highlighting the impact of the LOF method on the dataset. Initially, the Benign class comprised 93,530 samples, which was reduced to 85,510 samples after outlier removal. Similarly, the Exploits class decreased from 17,492 to 14,933, while the Fuzzers class saw a decline from 11,730 to 10,131. The Reconnaissance class experienced a minor reduction from 6881 to 6774, and the Generic class dropped from 3234 to 2688 samples. The DoS class also faced a significant decrease, with numbers falling from 2195 to 1730. In the smaller classes, Shellcode decreased from 886 to 614, and Backdoor went from 327 to 243 samples. The Analysis class experienced a slight reduction from 330 to 316, while the Worms class saw minimal change, dropping from 92 to 88 samples. These adjustments, facilitated by the LOF method, contributed to a more balanced dataset by removing outliers, thereby improving the reliability and effectiveness of classification tasks.
Table 7. Sample distribution of NF-UNSW-NB15-v2 dataset in multi-class classification using LOF.

3.2.2. Feature Selection Using Correlation Technique

Feature selection was conducted based on the correlation of features with the target variable, ‘Target’, in the dataset, applicable to both binary and multi-class classification tasks. Initially, the features and target variable were separated, and a correlation matrix was computed by concatenating the features and the target. The absolute correlation values were extracted and sorted to identify the strength of the relationship between each feature and the target variable. A correlation threshold of 0.01 was set to filter out features with weak correlations, allowing only those with significant relationships to be retained. The target variable, ‘Target’, was subsequently removed from the list of selected features. To address potential multicollinearity, the absolute correlation matrix of the selected features was then computed and its upper triangle examined to avoid redundancy. Any feature exhibiting a correlation greater than 0.9 with another selected feature was flagged for removal. The final dataset comprised the selected features that demonstrated meaningful correlations with the target variable while mitigating issues related to multicollinearity. This process ensured a robust input for subsequent analyses and model training in both binary and multi-class classification contexts.
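A minimal sketch of this correlation-based selection is given below, assuming df is a numeric pandas DataFrame containing the preprocessed features together with the ‘Target’ column; the thresholds of 0.01 and 0.9 follow the description above.

```python
import numpy as np
import pandas as pd

def select_features(df, target_col="Target", target_thr=0.01, redundancy_thr=0.9):
    """Keep features correlated with the target, then drop one of every
    highly inter-correlated (redundant) pair."""
    corr_with_target = df.corr()[target_col].abs().drop(target_col)
    selected = corr_with_target[corr_with_target >= target_thr].index.tolist()

    # Upper triangle of the absolute feature-feature correlation matrix.
    corr = df[selected].corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > redundancy_thr).any()]
    return [f for f in selected if f not in to_drop]
```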
(i) Binary Classification
The selected features from the NF-UNSW-NB15-v2 dataset for binary classification, identified using a correlation technique, are detailed in Table 8. These features were carefully chosen based on their significant correlation with the target variable, ensuring their relevance for the classification task. The selected features include metrics such as MAX_TTL and MIN_IP_PKT_LEN, which provide insights into packet attributes, as well as network behaviors like SRC_TO_DST_AVG_THROUGHPUT and DST_TO_SRC_AVG_THROUGHPUT. Additionally, protocol-specific characteristics are captured through features like PROTOCOL and L4_DST_PORT, while metrics like FLOW_DURATION_MILLISECONDS and NUM_PKTS_128_TO_256_BYTES offer a deeper understanding of traffic patterns. Overall, this selection aims to enhance the model’s ability to accurately differentiate between benign and malicious activities in network traffic.
Table 8. Selected features of NF-UNSW-NB15-v2 dataset in binary classification using correlation technique.
(ii) Multi-Class Classification
The selected features from the NF-UNSW-NB15-v2 dataset for multi-class classification, identified using a correlation technique, are outlined in Table 9. These features were chosen for their significant correlation with the target variable, enhancing the model’s ability to distinguish among various classes effectively. Key features include MIN_TTL and MIN_IP_PKT_LEN, which provide important information about packet characteristics, alongside metrics like DST_TO_SRC_AVG_THROUGHPUT and SRC_TO_DST_AVG_THROUGHPUT, which capture network traffic flow dynamics. Additionally, protocol-related attributes are represented through features such as PROTOCOL and L4_SRC_PORT, while metrics like DURATION_IN and LONGEST_FLOW_PKT offer insights into session behavior. This selection aims to improve the classification performance by retaining features that exhibit strong correlations with the output variable across different classes.
Table 9. Selected features of NF-UNSW-NB15-v2 dataset in multi-class classification using correlation technique.

3.2.3. Normalization

Data scaling, a crucial preprocessing step in machine and deep learning, involves adjusting numerical values to a specific range, thereby enhancing the efficiency and effectiveness of the model. This standardization process is applied across all columns, ensuring consistent data representation. Among various normalization techniques, the MinMaxScaler, a widely used tool in the scikit-learn library, stands out as the most effective for our study. The normalization formula, shown in Equation (1) [80], computes each value by subtracting the minimum value in the column and dividing by the range (the difference between the maximum and minimum values). In this context, X represents the original values, min(X) is the minimum value in the column, and max(X) is the maximum value in the column. After evaluating multiple normalization methods, MinMaxScaler was chosen for its superior performance. This normalization technique was applied to the selected features in the dataset, ensuring consistent scaling for both binary and multi-class classification tasks.
$$X_{\mathrm{scaled}} = \frac{X - \min(X)}{\max(X) - \min(X)}$$ (1)
The testing file was employed for evaluating the NF-UNSW-NB15-v2 dataset, while the complete training file was used for training in the initial approach.
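For illustration, the scaling of Equation (1) can be applied with scikit-learn’s MinMaxScaler; here data and selected_features are placeholder names for the preprocessed DataFrame and the features retained in Section 3.2.2.

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale each selected feature into [0, 1], as in Equation (1).
scaler = MinMaxScaler()
data[selected_features] = scaler.fit_transform(data[selected_features])
```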

3.2.4. Train-Test Dataset Split

The division of the dataset into training and testing subsets plays a crucial role in achieving rigorous evaluation and ensuring model generalization. The training subset allows the model to learn intricate patterns and relationships within the data, while the testing subset, isolated from the training phase, serves as an unbiased benchmark for assessing performance on unseen instances. This approach minimizes overfitting and provides meaningful insights into the model’s adaptability and effectiveness in binary and multi-class classification scenarios.
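As an illustration, a stratified split with scikit-learn preserves the class proportions reported in Tables 10 and 11; the test fraction shown below is only indicative, since the exact ratio is given by the tables rather than restated in the text.

```python
from sklearn.model_selection import train_test_split

# Stratified split: both subsets keep the original class proportions of y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, stratify=y, random_state=42  # test_size is illustrative
)
```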
(i) Binary Classification
The NF-UNSW-NB15-v2 dataset used for binary classification was divided into training and testing sets, as detailed in Table 10. The Normal class included 85,680 samples, of which 81,320 were utilized for training and 4360 for testing. Similarly, the Attack class consisted of 37,457 samples, with 35,660 used for training and 1797 for testing. This distribution ensured that both classes were adequately represented in both the training and testing phases, facilitating effective model evaluation and generalization.
Table 10. Sample distribution of the NF-UNSW-NB15-v2 dataset in binary classification.
(ii) Multi-Class Classification
The NF-UNSW-NB15-v2 dataset, used for multi-class classification, includes ten classes, comprising the Benign class and nine attack types, such as Exploits, Fuzzers, Reconnaissance, Generic, DoS, Shellcode, Backdoor, Analysis, and Worms. Table 11 outlines the sample distribution across these classes, highlighting the class-wise split between training and testing sets. The largest class, Benign, comprises 81,200 training samples and 4310 testing samples. Exploits and Fuzzers follow with 14,190 and 9653 training samples, respectively. Smaller attack categories, such as Worms and Analysis, include 83 and 306 training samples, respectively, with a limited number of testing samples. This distribution ensures representation of both frequent and rare attack types, enabling comprehensive evaluation of the model’s ability to detect diverse attacks effectively.
Table 11. Sample distribution of the NF-UNSW-NB15-v2 dataset in multi-class classification.

3.2.5. Class Balancing

Class imbalance is a significant challenge in the NF-UNSW-NB15-v2 dataset, potentially reducing the performance of machine learning models. To mitigate this issue, a robust class balancing strategy was implemented, leveraging a combination of oversampling and undersampling techniques, well-established methods for addressing class imbalance [81]. Following this, the training and testing sets were merged, and ADASYN was employed on the combined dataset. This technique enhances learning by allowing the model to benefit from both training and testing data. After generating synthetic samples, the dataset was split again, with the new training set comprising the original training data plus the ADASYN-generated samples, while the test set remained unchanged. ADASYN was applied to oversample both binary and multi-class classification tasks by generating synthetic samples to improve the representation of minority classes. This approach improves model performance by augmenting the dataset with additional samples. Furthermore, ENN was applied to the training dataset for undersampling, removing noisy or redundant instances from the majority class. Class weights were adjusted during model training to balance the influence of each class, ensuring that the models did not become biased toward the majority class. By employing this combination of ADASYN for oversampling, ENN for undersampling, and class weight adjustments, the model’s ability to accurately detect and classify minority classes was significantly improved, leading to enhanced performance and reliability. However, despite achieving high accuracy, models can still suffer from the accuracy paradox, where minority class predictions are weak [82]. To counter this, an improved strategy, inspired by [79], was introduced, integrating ADASYN for oversampling, ENN for undersampling, and class weights to provide a more effective solution to class imbalance. This approach ensures more balanced performance across all classes, ultimately improving the model’s effectiveness in handling imbalanced datasets.
  • ADASYN
ADASYN is an advanced technique designed to address the challenges of class imbalance in datasets. By generating synthetic samples for the minority class, ADASYN focuses on regions of the feature space where instances of the minority class are underrepresented. This method enhances the representation of the minority class while preserving the distribution of the majority class, leading to improved model performance. In our approach, an enhanced cascaded ADASYN technique was applied twice for binary classification tasks and nine times for multi-class classification tasks. This approach ensured that the dataset remained balanced at each stage, progressively improving model training by handling class imbalance more effectively in both binary and multi-class scenarios [83].
Let $X_i$ represent a minority class sample, and $N(X_i, k)$ denote the k-nearest neighbors of $X_i$. The number of synthetic samples $n_i$ to generate for each minority instance is defined as presented in Equation (2) [84].
$$n_i = \frac{N_{Maj} - N_{Min}}{N_{Min}} \cdot \left( 1 - \frac{N_i}{k} \right)$$ (2)
In this context, $N_{Maj}$ and $N_{Min}$ represent the sample counts of the majority and minority classes, respectively, highlighting the imbalance between them. The term $N_i$ denotes the number of minority class samples that fall within the radius defined by the k-nearest neighbors, which helps in identifying minority instances near decision boundaries, where synthetic samples are often generated to improve model performance.
For each minority instance $X_i$, synthetic samples are generated as presented in Equation (3) [84].
$$X_{syn} = X_i + \gamma \cdot (X_j - X_i)$$ (3)
where $X_{syn}$ denotes the synthetic sample created to address class imbalance, $X_j$ represents a randomly selected neighbor from the k-nearest neighbors of the minority sample $X_i$, and $\gamma$ is a random number between 0 and 1, ensuring that the synthetic sample is generated along the line segment between $X_i$ and $X_j$.
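A single application of ADASYN with the imbalanced-learn library is sketched below; in our pipeline it is applied in a cascaded fashion (twice for binary and nine times for multi-class classification) to the recombined training and testing data, so the snippet is illustrative rather than a full reproduction. X_combined and y_combined are assumed names for the merged feature matrix and labels.

```python
from imblearn.over_sampling import ADASYN

# Generate synthetic minority-class samples as in Equations (2) and (3).
adasyn = ADASYN(n_neighbors=5, random_state=42)  # n_neighbors is the library default
X_resampled, y_resampled = adasyn.fit_resample(X_combined, y_combined)
```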
(i) Binary Classification
The sample distribution in each class before and after applying the ADASYN resampling technique for binary classification on the NF-UNSW-NB15-v2 dataset is presented in Table 12. Initially, the dataset comprised 85,680 samples for the ‘Normal’ class and 37,457 samples for the ‘Attack’ class. After applying ADASYN, the number of samples for the ‘Attack’ class increased to 85,777, while the count for the ‘Normal’ class remained unchanged at 85,680. This adjustment underscores the effectiveness of ADASYN in addressing class imbalance by generating synthetic samples for the minority class, ultimately enhancing the model’s ability to learn from a more balanced dataset.
Table 12. Sample distribution in each class before/after resampling using ADASYN for binary classification on NF-UNSW-NB15-v2 dataset.
(ii) Multi-Class Classification
The sample distribution in each class before and after applying the ADASYN resampling technique for multi-class classification on the NF-UNSW-NB15-v2 dataset is detailed in Table 13. Initially, the ‘Benign’ class consisted of 85,510 samples, while other classes had varying sample sizes. Following the application of ADASYN, the number of samples significantly increased for the minority classes. For instance, the ‘Exploits’ class rose from 14,933 to 86,104 samples, and the ‘Reconnaissance’ class grew from 6774 to 85,734 samples. Meanwhile, the ‘Benign’ class remained stable at 85,510 samples. This resampling process effectively enhances the representation of underrepresented classes, thus improving the dataset‘s balance and aiding in the development of more robust classification models.
Table 13. Sample distribution in each class before/after resampling using ADASYN for multiclass classification on NF-UNSW-NB15-v2 dataset.
  • ENN
ENN is a data preprocessing technique aimed at refining training datasets by removing noisy instances and improving class boundaries. This method examines the nearest neighbors of each instance and eliminates those that are misclassified, thereby enhancing the overall quality of the training data. ENN effectively reduces class overlap and helps maintain a balanced representation of the classes, making it particularly useful in both binary and multi-class classification scenarios. In this study, ENN was applied once for binary classification and three times for multi-class classification. By applying ENN to the training data, models can achieve better generalization and improved performance on unseen data.
For each instance $X_i$ in the dataset, the k-nearest neighbors are determined. The set of neighbors is defined as shown in Equation (4) [85].
$$N(X_i) = \{ X_{j_1}, X_{j_2}, \ldots, X_{j_k} \}$$ (4)
where $X_{j_1}, \ldots, X_{j_k}$ are the nearest neighbors of $X_i$ in terms of a distance metric (e.g., Euclidean distance).
To calculate the majority class among the nearest neighbors, one can use the formula represented in Equation (5) [85]. This involves determining the class labels of the nearest neighbors and identifying which class occurs most frequently. By applying this method, one can ensure that the predicted class for an instance is based on the most common class among its neighbors, thereby enhancing the classification accuracy.
$$C(X_i) = \arg\max_{c} \sum_{j=1}^{k} \mathbb{1}(y_j = c)$$ (5)
Here, $C(X_i)$ denotes the predicted class for instance $X_i$, with $y_j$ representing the class label of its j-th neighbor. The indicator function $\mathbb{1}(\cdot)$ outputs 1 if the condition holds true, otherwise returning 0.
An instance $X_i$ is removed if its predicted class $C(X_i)$ does not match its actual class $y_i$, as expressed in Equation (6) [85].
$$\text{If } C(X_i) \neq y_i, \text{ then remove } X_i$$ (6)
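The undersampling step can be sketched with imbalanced-learn’s EditedNearestNeighbours, which removes training samples whose label disagrees with the majority of their k nearest neighbors (Equations (4)–(6)); the neighborhood size shown is the library default, as it is not stated in the text.

```python
from imblearn.under_sampling import EditedNearestNeighbours

# Drop noisy or boundary samples from the training set only.
enn = EditedNearestNeighbours(n_neighbors=3)
X_train_clean, y_train_clean = enn.fit_resample(X_train, y_train)
```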
(i) Binary Classification
The sample distribution in each training class before and after applying ENN for resampling in the binary classification of the NF-UNSW-NB15-v2 dataset is summarized in Table 14. Initially, the number of samples for the ‘Normal’ class stood at 81,320, which remained unchanged after the resampling process. Conversely, the ‘Attack’ class had 83,980 samples before resampling, which slightly decreased to 83,754 following the application of ENN. This adjustment highlights ENN’s role in refining the dataset by effectively managing class distribution while maintaining the integrity of the ‘Normal’ class samples.
Table 14. Sample distribution in each Train class before/after resampling using ENN for binary classification on NF-UNSW-NB15-v2 dataset.
(ii) Multi-Class Classification
The sample distribution in each training class before and after applying ENN for resampling in the multi-class classification of the NF-UNSW-NB15-v2 dataset is presented in Table 15. The ‘Benign’ class began with 81,200 samples, which slightly decreased to 80,713 following resampling. The ‘Exploits’ class experienced a more substantial reduction, dropping from 85,361 to 72,076 samples. Similar trends were observed in other classes, such as ‘Fuzzers’, which decreased from 85,259 to 80,288 samples, and ‘Reconnaissance’, which dropped from 85,388 to 74,094. In contrast, the ‘Shellcode’ class saw minimal change, with a slight decrease from 85,562 to 85,531 samples. These adjustments highlight ENN’s effectiveness in refining the dataset by eliminating noisy instances and improving class boundaries, ultimately contributing to a more balanced representation of each class.
Table 15. Sample distribution in each Train class before/after resampling using ENN for multi-class classification on NF-UNSW-NB15-v2 dataset.
  • Class Weights
Class weights are a valuable technique used to address class imbalance in datasets by assigning different weights to each class during model training. This approach ensures that the model pays more attention to minority classes, thereby improving its ability to correctly classify instances from these groups. By applying class weights to the training data, both binary and multi-class classification tasks benefit from enhanced model performance and generalization. This method helps mitigate the risks associated with biased predictions, ensuring a more balanced representation of all classes throughout the learning process.
The class weights can be calculated using the following formula to address class imbalance within the dataset. This approach assigns different weights to each class, ensuring that the model pays more attention to minority classes during training. The formula for calculating class weights is provided in Equation (7) [86].
$$Weight_c = \frac{N}{k \cdot n_c}$$ (7)
In this context, $Weight_c$ denotes the weight assigned to class c. The total number of instances in the dataset is represented by N, while k indicates the total number of classes. Additionally, $n_c$ signifies the number of instances belonging to class c.
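Equation (7) coincides with scikit-learn’s “balanced” weighting, so the training-time class weights (Tables 16 and 17) can be reproduced as sketched below; y_train is assumed to hold the training labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Equation (7): weight_c = N / (k * n_c), i.e. the "balanced" heuristic.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))  # passed to model.fit(..., class_weight=class_weight)
```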
(i) Binary Classification
The weights assigned to each class in the training data for binary classification using class weights on the NF-UNSW-NB15-v2 dataset are presented in Table 16. The ‘Normal’ class is assigned a weight of 1.0150, while the ‘Attack’ class receives a weight of 0.9855. These weights reflect the importance of each class during model training, with the goal of addressing any imbalances in the dataset. By incorporating these class weights, the model can enhance its performance and improve the accuracy of its predictions, particularly for the minority class.
Table 16. Weight in each train class using class weights for binary classification on NF-UNSW-NB15-v2 dataset.
(ii) Multi-Class Classification
The weights assigned to each class in the training data for multi-class classification using class weights on the NF-UNSW-NB15-v2 dataset are shown in Table 17. The ‘Benign’ class has a weight of 0.9384, while ‘Exploits’ is assigned a weight of 1.0508. Other classes, such as ‘Fuzzers’ and ‘Reconnaissance,’ receive weights of 0.9433 and 1.0222, respectively. The ‘Generic’ class has a weight of 0.9815, and ‘DoS’ is given a weight of 1.0089. Notably, the ‘Backdoor’ class receives the highest weight at 1.2413, followed closely by ‘Analysis’ with a weight of 1.1456. These weights are designed to improve the model’s performance by addressing class imbalances, allowing for a more effective and accurate representation of each class during training.
Table 17. Weight in each Train class using class weights for multi-class classification on NF-UNSW-NB15-v2 dataset.

3.3. Architectures of Models

In this study, a variety of model architectures were utilized, encompassing CNN, auto encoder, DNN, and Transformer-CNN. These models were selected due to their outstanding performance across multiple evaluation metrics [83,87,88].

3.3.1. Convolutional Neural Networks (CNN)

The model architecture integrates a CNN and an MLP for both binary and multi-class classification tasks. It begins with an input layer designed to accept sequential data structured as a one-dimensional array. The first CNN block applies a convolutional layer followed by batch normalization, ReLU activation, max pooling, and dropout to extract and regularize features effectively. This process is repeated in subsequent CNN blocks with varying kernel sizes to capture different patterns and features in the data. After the CNN blocks, the flattened output is fed into an MLP block that consists of a dense layer with L2 regularization, batch normalization, ReLU activation, and dropout to enhance the model’s representation capabilities while mitigating overfitting. The outputs from the CNN and MLP components are then concatenated to form a comprehensive feature set. Finally, the output layer uses a sigmoid activation function for binary classification, or a softmax activation for multi-class classification, combined with binary cross-entropy or categorical cross-entropy as the loss function, respectively, to optimize model performance based on the specific classification task. The model is compiled using the Adam optimizer, which aids in efficient learning and convergence during training.
The convolution operation in the CNN layers plays a crucial role in feature extraction by applying filters to the input data. This operation involves sliding the convolution kernel over the input feature map, computing the dot product at each position to produce a feature map. The mathematical representation of the convolution operation can be defined as shown in Equation (8) [89].
$$Z_{i,j} = (X * K)_{i,j} = \sum_{m} \sum_{n} X_{i+m,\, j+n} \, K_{m,n}$$ (8)
In this context, Z represents the output feature map, while X denotes the input feature map. The convolution kernel is indicated by K, which is utilized in the convolution operation to transform the input features into the output features.
The ReLU activation function, shown in Equation (9) [90], is a simple yet powerful non-linear transformation. It outputs zero for negative inputs, retains positive values, mitigates the vanishing gradient issue, and promotes sparse activations, enhancing both training efficiency and model performance.
ReLU(x) = max(0,x)
The max pooling operation reduces the spatial dimensions of the input feature map while retaining the most important features. This operation selects the maximum value from a specified pooling window, effectively downsampling the input. The equation for the max pooling operation can be expressed as shown in Equation (10) [89].
$$P_{i,j} = \max\left( X_{i:i+p,\; j:j+q} \right)$$ (10)
In this context, P refers to the pooled output generated from the pooling operation, while p and q represent the dimensions of the pooling window used to aggregate the input features into the pooled output.
The dropout layer randomly sets a fraction p of input units to zero during training to prevent overfitting. This technique helps to improve the model’s generalization by ensuring that it does not rely too heavily on any single input feature, as detailed in Equation (11) [91].
$$\text{Dropout}(x) = \begin{cases} x & \text{with probability } 1 - p \\ 0 & \text{with probability } p \end{cases}$$ (11)
For binary classification tasks, the output layer utilizes the sigmoid function, which outputs a probability score indicating the likelihood of an instance belonging to the positive class. This is mathematically expressed in Equation (12) [92].
$$\sigma(Z) = \frac{1}{1 + e^{-Z}}$$ (12)
where Z is the output from the last dense layer.
The output layer employs the softmax function for multi-class classification, allowing the model to produce a probability distribution across multiple classes. This can be mathematically represented as shown in Equation (13) [92].
$$\text{Softmax}(Z_i) = \frac{e^{Z_i}}{\sum_{j} e^{Z_j}}$$ (13)
where $Z_i$ is the output for class i, and $Z_j$ represents the raw score for class j.
(i) Binary Classification
The architecture of the CNN model designed for binary classification is detailed in Table 18. The model architecture is shared across both the NF-UNSW-NB15-v2 and CICIDS2017 datasets, with the input block differing to accommodate the specific features of each dataset. For the NF-UNSW-NB15-v2 dataset, the input layer processes 25 distinct features, while the CICIDS2017 dataset input layer accommodates 69 features. This input layer serves as the foundation for subsequent computations. The CNN model for binary classification begins with the first hidden block, which includes a one-dimensional (1D) CNN layer with 256 filters, using the ReLU activation function to introduce non-linearity and enhance feature extraction. Following this, a 1D max pooling layer with a pool size of 2 is employed to downsample the data, preserving critical features. A dropout layer with a very low rate of 0.0000001 is incorporated to mitigate the risk of overfitting. The second hidden block replicates this structure, incorporating another 1D CNN layer with 256 filters and ReLU activation, followed by a 1D max pooling layer with a pool size of 4, and another dropout layer with the same low rate to maintain generalization. In the third hidden block, a dense layer with 1024 neurons is used, employing ReLU activation to facilitate complex feature interactions. This is followed by another dropout layer to further enhance robustness against overfitting. The final output block consists of a single neuron configured with a sigmoid activation function, which is critical for producing binary classification outputs for both datasets. This carefully structured architecture, as summarized in Table 18, is optimized to effectively process the unique characteristics of the NF-UNSW-NB15-v2 and CICIDS2017 datasets, ensuring reliable and accurate binary classification performance.
Table 18. CNN model layers for binary classification.
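A Keras sketch of the Table 18 architecture is given below; the convolution kernel size and padding are assumptions, since the text specifies only the number of filters, the pooling sizes, the dropout rate, and the dense width.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_binary_cnn(n_features=25):  # 25 features for NF-UNSW-NB15-v2, 69 for CICIDS2017
    inputs = keras.Input(shape=(n_features, 1))

    # First hidden block: Conv1D (256 filters) -> max pooling (2) -> dropout.
    x = layers.Conv1D(256, kernel_size=3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Dropout(1e-7)(x)

    # Second hidden block: Conv1D (256 filters) -> max pooling (4) -> dropout.
    x = layers.Conv1D(256, kernel_size=3, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=4)(x)
    x = layers.Dropout(1e-7)(x)

    # Third hidden block: dense layer with 1024 neurons, then dropout.
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dropout(1e-7)(x)

    # Output block: a single sigmoid neuron for the binary decision.
    outputs = layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```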
(ii) Multi-Class Classification
The CNN model designed for multi-class classification features a comprehensive architecture, as detailed in Table 19. This architecture is tailored to process the distinct features of both the NF-UNSW-NB15-v2 and CICIDS2017 datasets, with the input block configured specifically for each dataset. For the NF-UNSW-NB15-v2 dataset, the input layer processes 27 features, while the CICIDS2017 dataset input layer handles 35 features. These input layers provide the foundation for the model to effectively capture dataset-specific information relevant to the classification task. The model begins with the first hidden block, which includes a one-dimensional (1D) CNN layer with 256 filters, employing the ReLU activation function to enable effective feature extraction. This is followed by a 1D max pooling layer with a pool size of 2, which reduces dimensionality while retaining critical information. To minimize overfitting, a dropout layer with an extremely low rate of 0.0000001 is applied. The second hidden block mirrors this structure, featuring another 1D CNN layer with 256 filters and ReLU activation, followed by a 1D max pooling layer with a pool size of 4 and a dropout layer with the same low rate to maintain generalization. In the third hidden block, a dense layer with 1024 neurons is used, employing ReLU activation to enhance the model’s ability to learn complex feature relationships. This is followed by another dropout layer to further strengthen the model’s capacity to generalize well to unseen data. The output block varies depending on the dataset. For the NF-UNSW-NB15-v2 dataset, the output layer consists of 10 neurons with a softmax activation function, allowing the model to output probabilities across 10 classes. For the CICIDS2017 dataset, the output layer comprises 15 neurons, also using a softmax activation function to accommodate its multi-class structure. This carefully designed architecture, as summarized in Table 19, is optimized to handle the unique characteristics of both datasets, ensuring effective learning and high-performance multi-class classification.
Table 19. CNN model layers for multi-class classification.
(iii) Hyperparameter Configuration for the CNN Model
The hyperparameters for the CNN model, as outlined in Table 20, are meticulously tuned for both binary and multi-class classification tasks. In both classifier configurations, a batch size of 128 is consistently utilized, ensuring efficient processing of data during training. The learning rate for both the binary and multi-class classifiers is adaptively managed through the ReduceLROnPlateau scheduler. If the validation loss shows no improvement over a set number of epochs (patience), the learning rate is halved. This approach allows the model to make finer adjustments during training, which can help accelerate convergence. To avoid excessively small updates, the learning rate is capped at a minimum value of 1 × 10−5. This method ensures more efficient and stable training, enabling the model to converge steadily without overshooting the optimal solution. Across both classifier types, the Adam optimizer is employed, known for its adaptive learning rate capabilities, which enhances training performance. The choice of loss function is tailored to the nature of the classification task. Binary cross-entropy is adopted for the binary classification scenario, while categorical cross-entropy is utilized in multi-class classification, ensuring appropriate measurement of model performance based on the output format. Lastly, accuracy is designated as the evaluation metric for both classifiers, providing a straightforward assessment of their performance in correctly classifying the input data. This careful selection and configuration of hyperparameters are essential for optimizing the effectiveness of the CNN models in their respective classification tasks.
Table 20. Hyperparameters for CNN model.
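The learning-rate schedule and training setup described above can be sketched as follows; the patience value and the number of epochs are illustrative, since only the halving factor, the 1 × 10−5 floor, and the batch size of 128 are stated.

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Halve the learning rate when validation loss stops improving, never below 1e-5.
lr_scheduler = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2, min_lr=1e-5)

model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=50,                  # illustrative
    batch_size=128,
    class_weight=class_weight,  # weights of Tables 16 and 17
    callbacks=[lr_scheduler],
)
```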

3.3.2. Auto Encoder (AE)

The auto encoder is tailored for both binary and multi-class classification tasks, starting with an input layer that accepts feature vectors. It features an encoder composed of several dense layers that progressively reduce the input’s dimensionality while applying the ReLU activation function, effectively extracting important features from the data. For binary classification, a classification layer follows, using the sigmoid activation function, while for multi-class classification it uses the softmax activation function, enabling the model to output class probabilities for the respective scenarios. The model is compiled with the Adam optimizer and employs binary cross-entropy loss for binary tasks and categorical cross-entropy loss for multi-class tasks, ensuring appropriate loss calculations for each classification type. Additionally, a callback is implemented to adjust the learning rate based on validation loss, facilitating improved convergence and minimizing the risk of overfitting. The training and validation accuracies are plotted across epochs to evaluate the model’s performance in both classification contexts.
The encoder layers progressively reduce the dimensionality of the input data, extracting important features through dense layers. This dimensionality reduction and feature extraction process can be mathematically expressed as shown in Equation (14) [89].
$$h^{(l)} = f\left( W^{(l)} a^{(l-1)} + b^{(l)} \right)$$ (14)
In this formulation, $h^{(l)}$ denotes the output of encoder layer l, while $a^{(l-1)}$ represents the output of the previous layer, with $a^{(0)}$ being the original input to the first layer. The weight matrix for layer l is indicated by $W^{(l)}$, and $b^{(l)}$ denotes the bias vector for that layer. The activation function applied is denoted as f, which is specifically the ReLU function in this context.
In a standard auto encoder, the decoder layer reconstructs the input from the compressed representation learned by the encoder. This reconstruction process can be mathematically represented as detailed in Equation (15) [89].
$$\hat{X} = g\left( W^{(d)} h^{(l)} + b^{(d)} \right)$$ (15)
In this context, $\hat{X}$ represents the reconstructed input, while $h^{(l)}$ is the output from the last encoder layer. The weight matrix for the decoder layer is denoted as $W^{(d)}$, and $b^{(d)}$ indicates the bias vector for the decoder layer. The activation function used for the decoder is represented by g, which is typically linear for reconstruction purposes.
The classification layer utilizes a specific activation function for binary classification output. This can be expressed as presented in Equation (16) [89].
$$\hat{y} = \sigma\left( W^{(out)} h^{(l)} + b^{(out)} \right)$$ (16)
In this framework, $\hat{y}$ denotes the predicted probability for the positive class. The weight matrix for the output layer is represented by $W^{(out)}$, while $b^{(out)}$ signifies the bias for the output layer. The sigmoid function, denoted as σ, is employed to map the output to a probability score between 0 and 1.
For multi-class classification output, the classification layer employs the softmax activation function, which enables the model to generate a probability distribution across multiple classes. This can be expressed as presented in Equation (17) [92].
$$\hat{y} = \text{softmax}\left( W^{(out)} h^{(l)} + b^{(out)} \right)$$ (17)
In this context, $\hat{y}$ represents the vector of predicted probabilities across multiple classes. The weight matrix $W^{(out)}$ and bias $b^{(out)}$ are associated with the output layer. The softmax function is utilized to convert the logits into probabilities, ensuring that the predicted values sum to one across all classes.
(i) Binary Classification
The architecture outlined in Table 21 presents the layers of the Auto Encoder model designed for binary classification, tailored for both the NF-UNSW-NB15-v2 and CICIDS2017 datasets. The input block is configured to accommodate the unique features of each dataset. The NF-UNSW-NB15-v2 dataset processes data with 25 features, while the CICIDS2017 dataset handles 69 features. This input layer serves as the entry point for data, providing the foundation for the model’s operations. The encoder structure comprises three dense layers, with 128, 64, and 32 neurons, respectively. Each dense layer utilizes the ReLU activation function, which introduces non-linearity and facilitates the extraction of complex patterns within the data. These layers effectively compress the input data into a lower-dimensional latent space, capturing the most critical features necessary for effective classification. The final output block consists of a single neuron activated by a sigmoid function. This layer generates a probability score indicating the likelihood of the input data belonging to the positive class, enabling binary classification. The architecture is designed to distinguish effectively between the two classes, ensuring robust performance across both datasets. This carefully structured model leverages its shared architecture to handle the unique characteristics of the NF-UNSW-NB15-v2 and CICIDS2017 datasets, enhancing its overall classification effectiveness.
Table 21. Auto encoder model layers for binary classification.
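A Keras sketch of the Table 21 encoder-classifier follows; it mirrors the 128-64-32 dense encoder and sigmoid head described above, with the input width set per dataset.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_binary_autoencoder_classifier(n_features=25):  # 25 for NF-UNSW-NB15-v2, 69 for CICIDS2017
    inputs = keras.Input(shape=(n_features,))

    # Encoder: progressively compress the input (Equation (14)).
    x = layers.Dense(128, activation="relu")(inputs)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dense(32, activation="relu")(x)

    # Classification head on the latent representation (Equation (16)).
    outputs = layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```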
(ii) Multi-Class Classification
The architecture outlined in Table 22 showcases the auto encoder model specifically designed for multi-class classification, with tailored configurations for both the NF-UNSW-NB15-v2 and CICIDS2017 datasets. The model begins with an input layer that accommodates the unique features of each dataset. The NF-UNSW-NB15-v2 dataset processes data with 27 features, while the CICIDS2017 dataset handles 35 features. This input layer serves as the entry point, setting the foundation for the model’s processing. The encoder consists of three dense layers, with 128, 64, and 32 neurons respectively. Each layer employs the ReLU activation function, which introduces non-linearity and enhances the model’s ability to capture intricate patterns and relationships within the data. This structure efficiently compresses the input data into a lower-dimensional latent space, extracting the most essential features for multi-class classification. The architecture culminates in the output block, where the NF-UNSW-NB15-v2 dataset’s output layer consists of 10 neurons, and the CICIDS2017 dataset’s output layer has 15 neurons. Both layers are activated by the softmax function, which generates class probabilities, allowing the model to classify the input data into multiple distinct categories. This design enables the model to address multi-class classification tasks effectively, distinguishing between various classes with high accuracy across both datasets.
Table 22. Auto encoder model layers for multi-class classification.
(iii)
Hyperparameter Configuration for the Auto Encoder Model
The hyperparameters for the auto encoder model, as outlined in Table 23, are designed to suit both binary and multi-class classification tasks. In both cases, a batch size of 128 is used to streamline the training process. The learning rate for both classifiers is dynamically adjusted using the ReduceLROnPlateau scheduling mechanism. This technique monitors the validation loss during training, and if no improvement is observed over two consecutive epochs, the learning rate is reduced by a factor of 0.5. This gradual reduction enables more stable and refined parameter updates, particularly in the later stages of training, which enhances the model’s ability to converge effectively. Furthermore, the learning rate is bounded below by a minimum value of 1 × 10−5 to prevent it from becoming too small to produce meaningful updates. This strategy strikes a balance between accelerating convergence in the early stages and allowing for finer adjustments as the model nears optimal performance, ultimately leading to more reliable and efficient training. The Adam optimizer is employed for efficient weight updates, while the choice of loss function depends on the classification task. Binary cross-entropy is used for binary classification, and categorical cross-entropy is applied for multi-class classification. For performance evaluation, accuracy is chosen as the primary metric, offering a comprehensive assessment of the model’s ability to classify data correctly in both contexts.
Table 23. Auto encoder model hyperparameters.
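As a hedged sketch, the Table 23 settings translate into a Keras training configuration along the following lines, reusing the autoencoder_clf model from the sketch above; the commented fit call and its validation split are assumptions for illustration.

```python
from tensorflow import keras

reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",  # watch the validation loss
    factor=0.5,          # halve the learning rate when it plateaus
    patience=2,          # after two epochs without improvement
    min_lr=1e-5,         # never drop below 1e-5
)

autoencoder_clf.compile(
    optimizer=keras.optimizers.Adam(),
    loss="binary_crossentropy",   # categorical_crossentropy for multi-class
    metrics=["accuracy"],
)
# autoencoder_clf.fit(X_train, y_train, validation_split=0.1,
#                     batch_size=128, epochs=500, callbacks=[reduce_lr])
```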

3.3.3. Deep Neural Network (DNN)

The DNN model for binary and multi-class classification consists of several blocks, including the input block, two hidden blocks, and the output block. The model begins with an input layer, where the number of features varies depending on the dataset, and a dense layer utilizing ReLU activation to learn complex patterns. This is followed by the first hidden block, which includes a dropout layer with a very small rate to prevent overfitting, followed by a dense layer with ReLU activation, then batch normalization. The second hidden block includes another dropout layer with the same small rate and batch normalization to improve training stability. The output layer differs based on the classification type. For binary classification, it features a single neuron with a sigmoid activation function to produce a probability score, while for multi-class classification, it contains multiple neurons with a softmax activation function to provide class probabilities. The model is compiled using the Adam optimizer with a learning rate defined by an exponential decay schedule. It uses binary cross-entropy for binary tasks or categorical cross-entropy for multi-class tasks, enabling effective training. Additionally, a custom callback is implemented to visualize the confusion matrix at the end of each epoch, offering valuable insights into the model’s classification performance by comparing predicted and actual labels.
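The confusion-matrix callback mentioned above can be sketched as follows; this is an illustrative implementation, and the validation data, decision threshold, and printing format are assumptions rather than the authors’ exact code.

```python
import numpy as np
from tensorflow import keras
from sklearn.metrics import confusion_matrix

class ConfusionMatrixCallback(keras.callbacks.Callback):
    """Print the validation confusion matrix at the end of every epoch."""

    def __init__(self, x_val, y_val, binary=True):
        super().__init__()
        self.x_val, self.y_val, self.binary = x_val, y_val, binary

    def on_epoch_end(self, epoch, logs=None):
        probs = self.model.predict(self.x_val, verbose=0)
        if self.binary:
            preds = (probs > 0.5).astype(int).ravel()   # threshold the sigmoid output
        else:
            preds = np.argmax(probs, axis=1)            # pick the most likely class
        print(f"\nEpoch {epoch + 1} confusion matrix:")
        print(confusion_matrix(self.y_val, preds))
```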
The feed-forward operation in a DNN involves passing the input through multiple layers to produce the output. This can be expressed mathematically for a layer l as presented in Equation (18) [93].
a^{(l)} = f\left(W^{(l)} a^{(l-1)} + b^{(l)}\right)
In this context, a^(l) denotes the activation of the current layer l. The weight matrix for this layer is represented by W^(l), while b^(l) signifies the bias vector for layer l. The activation function f is applied element-wise, which may include functions such as ReLU or sigmoid, to introduce non-linearity into the model.
The ReLU activation function, shown in Equation (19) [94], is a simple and efficient non-linear function that outputs zero for negative values, promotes sparse activations, and supports effective gradient flow, making it ideal for deep learning.
ReLU(x) = max(0,x)
For binary classification tasks, the sigmoid function is utilized, which outputs a probability score indicating the likelihood of an instance belonging to the positive class. The sigmoid function transforms the raw score into a value between 0 and 1, effectively serving as a threshold for classification. This can be represented as presented in Equation (20) [92].
\sigma(Z) = \frac{1}{1 + e^{-Z}}
where Z is the output from the last dense layer.
The output layer employs the softmax function for multi-class classification tasks, enabling the model to produce probability distributions across multiple classes. This function takes a vector of raw scores (logits) and normalizes them into a range between 0 and 1, where the sum of the probabilities equals 1. The softmax function can be mathematically expressed as shown in Equation (21) [92].
\mathrm{softmax}(Z_i) = \frac{e^{Z_i}}{\sum_{j} e^{Z_j}}
where Z_i is the output from the last dense layer for class i, and Z_j represents the raw score for class j.
(i)
Binary Classification
The architecture detailed in Table 24 presents the structure of the DNN model specifically designed for binary classification, with tailored configurations for the NF-UNSW-NB15-v2 and CICIDS2017 datasets. The input layer block begins by processing 25 features for the NF-UNSW-NB15-v2 dataset and 69 features for the CICIDS2017 dataset, followed by a dense layer with 1024 neurons, where the ReLU activation function is applied to introduce non-linearity and enhance the model’s ability to learn complex representations from the data. The first hidden block includes a dropout layer with a very low dropout rate to mitigate overfitting, followed by a dense layer with 768 neurons. The dense layer is equipped with a ReLU activation function to introduce non-linearity, enhancing the model’s ability to learn complex patterns. Batch normalization is applied after the dense layer to stabilize the learning process by normalizing the outputs, ensuring more effective and consistent training. The second hidden block contains another dropout layer and batch normalization, further refining the learning dynamics. Ultimately, the architecture concludes with an output layer featuring a single neuron activated by the sigmoid function. This configuration is meticulously crafted to enhance the model’s effectiveness in binary classification tasks across both datasets.
Table 24. DNN model layers for binary classification.
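A sketch of the Table 24 layout in Keras is given below; the text states only that the dropout rate is very small, so the 1e-7 value used here is an assumption.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 25  # 69 for CICIDS2017

dnn_binary = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(1024, activation="relu"),   # input block dense layer
    layers.Dropout(1e-7),                    # first hidden block (rate is an assumption)
    layers.Dense(768, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(1e-7),                    # second hidden block
    layers.BatchNormalization(),
    layers.Dense(1, activation="sigmoid"),   # probability of the attack class
])
```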
(ii)
Multi-Class Classification
The structure outlined in Table 25 describes the architecture of the DNN model designed for multi-class classification, with specific configurations for the NF-UNSW-NB15-v2 and CICIDS2017 datasets. The input layer block begins by processing 27 features for the NF-UNSW-NB15-v2 dataset and 35 features for the CICIDS2017 dataset, followed by a dense layer with 1024 neurons, where the ReLU activation function is applied to introduce non-linearity and enhance the model’s ability to learn complex representations from the data. The first hidden block incorporates a dropout layer with a minimal dropout rate to mitigate overfitting. This is followed by a dense layer containing 768 neurons, which is augmented with ReLU activation to introduce non-linearity. Batch normalization is applied after the dense layer to stabilize the learning process by normalizing the output, ensuring more effective and faster training. The second hidden block contains another dropout layer and batch normalization, further refining the learning dynamics. The architecture concludes with an output layer for each dataset. The NF-UNSW-NB15-v2 dataset’s output layer consists of 10 neurons, while the CICIDS2017 dataset’s output layer has 15 neurons. Both output layers are activated by a softmax function, generating class probabilities across their respective categories. This design enables the model to effectively handle multi-class classification tasks across both datasets with precision.
Table 25. DNN model layers for multi-class classification.
(iii)
Hyperparameter Configuration for the DNN Model
The hyperparameters for the DNN models, detailed in Table 26, are tailored to accommodate both binary and multi-class classification tasks. Each classifier utilizes a consistent batch size of 128 and employs the Adam optimizer for efficient training. The learning rate for both the binary and multi-class classifiers is governed by an exponential decay schedule, which dynamically adjusts the learning rate throughout the training process. Initially, the learning rate is set to 0.0003. As training progresses, the learning rate undergoes a reduction by a factor of 0.9 after every 10,000 steps. This progressive decrease in the learning rate ensures that the model can make larger, more decisive updates in the early stages of training, followed by more refined and precise adjustments in the later stages. This method of adaptive learning rate adjustment is designed to promote a more stable and efficient optimization process, ultimately facilitating smoother convergence toward an optimal solution. For loss functions, binary cross-entropy is applied in the context of binary classification, while categorical cross-entropy is utilized for the multi-class classification scenario. In both cases, accuracy serves as the primary evaluation metric, providing a clear measure of the models’ effectiveness in classifying data accurately.
Table 26. DNN model hyperparameters.
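The exponential-decay schedule in Table 26 can be expressed in Keras as shown below, applied here to the dnn_binary sketch above; whether the decay is applied continuously or in discrete steps is not stated, so staircase=True is an assumption.

```python
from tensorflow import keras

lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=3e-4,  # 0.0003, as stated in the text
    decay_steps=10_000,          # reduce every 10,000 steps
    decay_rate=0.9,              # by a factor of 0.9
    staircase=True,              # assumption: step-wise rather than continuous decay
)

dnn_binary.compile(
    optimizer=keras.optimizers.Adam(learning_rate=lr_schedule),
    loss="binary_crossentropy",  # categorical_crossentropy for multi-class
    metrics=["accuracy"],
)
```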

3.3.4. Transformer-Convolutional Neural Network (Transformer CNN)

The model architecture presented integrates both Transformer and CNN components to enhance classification performance for the given task. The input layer receives data in a structured format, setting the stage for the subsequent processing. The Transformer block plays a crucial role in capturing intricate relationships within the input data through its multi-head attention mechanism. This approach allows the model to weigh different parts of the input more dynamically, facilitating the identification of complex patterns and dependencies. To stabilize the learning process and improve gradient flow, a layer normalization and residual connection are employed. Following the attention mechanism, a feed-forward neural network (FFN) processes the output, enhancing the data representation by applying non-linear transformations and introducing dropout layers for regularization. The primary job of the Transformer is to provide a global context and highlight important features across the entire input sequence. Subsequently, the CNN blocks operate on the output of the Transformer, focusing on local feature extraction. Each convolutional layer applies filters to detect various features within the input data, while batch normalization and activation functions such as ReLU ensure that the model remains robust and learns effectively. Max pooling layers downsample the data, reducing its dimensionality and allowing the model to concentrate on the most salient features. The CNN’s primary function is to capture spatial hierarchies and patterns within the input, making it particularly effective for tasks requiring detailed analysis of local structures. The architecture also includes a flattening step that prepares the output of the CNN blocks for further processing. This flattened representation is then passed through MLP blocks, which serve to learn high-level abstractions from the features extracted by the CNNs. The concatenation of the Transformer and CNN outputs at this stage enables the model to leverage both global context and local feature patterns for improved classification accuracy. Finally, the output layer employs a sigmoid or softmax activation function to generate class probabilities, completing the model’s capability to classify inputs based on the rich representations learned throughout the architecture. This integrated approach harnesses the strengths of both Transformer and CNN architectures, providing a comprehensive framework for effective classification tasks.
The multi-head attention mechanism effectively captures complex relationships within the input data, as represented by Equation (22) [95].
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
In this context, Q represents the query matrix, K denotes the key matrix, and V signifies the value matrix. The variable d_k refers to the dimension of the keys, which plays a crucial role in the computation of attention scores within the model.
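A minimal NumPy illustration of Equation (22) follows; the matrix shapes are arbitrary stand-ins chosen only to show the computation, with the key dimension set to 128 to mirror the configuration described later.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled similarity of queries and keys
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # attention-weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 128))   # 4 query positions, d_k = 128
K = rng.normal(size=(4, 128))
V = rng.normal(size=(4, 128))
context = scaled_dot_product_attention(Q, K, V)     # shape (4, 128)
```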
Each head performs this attention calculation independently and then concatenates the results, as detailed in Equation (23) [95].
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}
In this context, W^O refers to the output weight matrix, which is utilized to transform the output of the preceding layer into the final output of the model.
Layer normalization stabilizes the output of each layer, using Equation (24) [96].
\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sigma}\,\gamma + \beta
In this context, µ represents the mean of the inputs, while σ denotes the standard deviation. Additionally, γ and β are learnable parameters that are utilized in the normalization process.
The FFN processes the output from the attention mechanism as presented in Equation (25) [95].
\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)\,W_2 + b_2
In this context, W_1 and W_2 refer to the weight matrices, while b_1 and b_2 represent the corresponding biases associated with the layers in the model.
The convolution operation in the CNN layers can be defined as in Equation (26) [89].
Z_{i,j} = (X * K)_{i,j} = \sum_{m}\sum_{n} X_{i+m,\,j+n}\,K_{m,n}
In this scenario, Z denotes the output feature map resulting from the convolution process, while X represents the input feature map. The convolution kernel, denoted as K, is slid over the input feature map to produce the output.
The ReLU activation function, presented in Equation (27) [90], is efficient and straightforward, outputting zero for negative inputs while passing positive values through unchanged. Its ability to promote sparse activations and facilitate gradient flow makes it particularly effective in deep learning applications.
ReLU(x) = max(0,x)
The max pooling operation can be expressed as presented in Equation (28) [89].
P_{i,j} = \max\left(X_{i:i+p,\ j:j+q}\right)
In this context, P represents the pooled output generated from the pooling operation, while p and q denote the dimensions of the pooling window applied to the input feature map.
The dropout layer randomly sets a fraction p of input units to zero during training to prevent overfitting, as presented in Equation (29) [91].
\mathrm{Dropout}(x) = \begin{cases} x & \text{with probability } 1 - p \\ 0 & \text{with probability } p \end{cases}
For binary classification tasks, the sigmoid function is utilized, which outputs a probability score indicating the likelihood of an instance belonging to the positive class, as presented in Equation (30) [92].
\sigma(Z) = \frac{1}{1 + e^{-Z}}
where Z is the output from the last dense layer.
The output layer employs the softmax function for multi-class classification tasks, enabling the model to produce probability distributions across multiple classes, as presented in Equation (31) [92].
\mathrm{softmax}(Z_i) = \frac{e^{Z_i}}{\sum_{j} e^{Z_j}}
where Z_i is the output for class i, and Z_j represents the raw score for class j.
(i)
Binary Classification
The architecture of the Transformer model designed for binary classification is detailed in Table 27. The model begins with an input layer that processes data structured as (25, 1) for the NF-UNSW-NB15-v2 dataset and (69, 1) for the CICIDS2017 dataset, effectively accommodating input with 25 features for NF-UNSW-NB15-v2 and 69 features for CICIDS2017. Following this, the Transformer block employs a multi-head attention mechanism with eight heads and a key dimension of 128. This mechanism captures complex relationships within the input data, enhancing the model’s ability to identify intricate patterns. The output from the attention layer is subsequently normalized using layer normalization with an epsilon value of 1 × 10−6, which helps stabilize the output. A residual connection is implemented to add the original input data back to the attention output, promoting stability during training. The feed-forward block consists of a dense layer with 512 units and a ReLU activation function, applying a transformation to the data. This is followed by a dropout layer with a rate of 0.0000001, aimed at mitigating overfitting by regularizing the network. Another dense layer with 512 units is included without an activation function, allowing for additional transformations. A subsequent dropout layer with the same rate further reinforces regularization, enhancing model robustness. The output from the feed-forward network is then added back to the previous block’s output via another residual connection, followed by another layer normalization step with epsilon = 1 × 10−6 to normalize the combined output, ensuring stability in the model’s learning process.
Table 27. Transformer model layers for binary classification.
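The Transformer block of Table 27 can be sketched with the Keras functional API as below. The number of attention heads, key dimension, feed-forward width, dropout rate, and layer-normalization epsilon follow the text; the final dense projection back to the input width is a simplification added here so that the residual addition has matching shapes, and is an assumption rather than the authors’ exact layer.

```python
from tensorflow import keras
from tensorflow.keras import layers

def transformer_block(x, num_heads=8, key_dim=128, ff_units=512, rate=1e-7):
    # Multi-head self-attention with residual connection and normalization
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
    x = layers.LayerNormalization(epsilon=1e-6)(layers.Add()([x, attn]))
    # Feed-forward block: Dense(512, ReLU) -> Dropout -> Dense -> Dropout
    ffn = layers.Dense(ff_units, activation="relu")(x)
    ffn = layers.Dropout(rate)(ffn)
    ffn = layers.Dense(x.shape[-1])(ffn)   # project back to the input width (assumption)
    ffn = layers.Dropout(rate)(ffn)
    return layers.LayerNormalization(epsilon=1e-6)(layers.Add()([x, ffn]))

inputs = keras.Input(shape=(25, 1))        # (69, 1) for CICIDS2017
encoded = transformer_block(inputs)
```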
The architecture of the CNN model designed for binary classification utilizes the output of the Transformer model as its input and is tailored for datasets like NF-UNSW-NB15-v2 and CICIDS2017. The input block processes the Transformer output, providing structured input for the model. The first hidden block includes a 1D CNN layer with 512 filters and a ReLU activation function, which extracts essential features from the input data. This is followed by a 1D max pooling layer with a pool size of two, reducing the dimensionality of the feature maps, and a dropout layer with a rate of 0.0000001 to mitigate overfitting. The second hidden block repeats this structure with another 1D CNN layer with 512 filters and ReLU activation, followed by a 1D max pooling layer with a pool size of four and a dropout layer with the same dropout rate. In the third hidden block, the model incorporates a dense layer with 1024 units and a ReLU activation function, enhancing the model’s representational capabilities. A dropout layer with a rate of 0.0000001 is again applied for additional regularization. The architecture concludes with a single-output layer employing a sigmoid activation function for binary classification, producing a probability score to determine class membership. The detailed structure, including layer sizes, activation functions, and dropout rates, is outlined in Table 28.
Table 28. CNN model layers for binary classification.
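Continuing the same functional-model sketch, the CNN head of Table 28 could be written as follows; the convolution kernel size and padding are not specified in the text, so kernel_size=3 and padding="same" are assumptions.

```python
x = layers.Conv1D(512, kernel_size=3, padding="same", activation="relu")(encoded)
x = layers.MaxPooling1D(pool_size=2)(x)
x = layers.Dropout(1e-7)(x)
x = layers.Conv1D(512, kernel_size=3, padding="same", activation="relu")(x)
x = layers.MaxPooling1D(pool_size=4)(x)
x = layers.Dropout(1e-7)(x)
x = layers.Flatten()(x)
x = layers.Dense(1024, activation="relu")(x)
x = layers.Dropout(1e-7)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # 10 or 15 softmax units for multi-class

transformer_cnn = keras.Model(inputs, outputs)
```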
(ii)
Multi-Class Classification
The architecture of the Transformer model designed for multi-class classification is outlined in Table 29. The model starts with an input layer that processes data structured as (27, 1) for the NF-UNSW-NB15-v2 dataset and (35, 1) for the CICIDS2017 dataset, effectively accommodating input with 27 features for NF-UNSW-NB15-v2 and 35 features for CICIDS2017. Following this, the Transformer block utilizes a multi-head attention mechanism with eight heads and a key dimension of 128, which captures complex relationships within the input data and enhances the model’s ability to identify intricate patterns. The output from the attention layer is then normalized using layer normalization with an epsilon value of 1 × 10−6, which contributes to stabilizing the output. A residual connection is established to add the original input data back to the attention output, promoting stability during training. The feed-forward block consists of a dense layer with 512 units and a ReLU activation function, which applies a transformation to the data. This is followed by a dropout layer with a rate of 0.0000001, designed to mitigate overfitting by regularizing the network. An additional dense layer with 512 units is included without an activation function, allowing for further transformations. A subsequent dropout layer with the same rate reinforces regularization, enhancing the model’s robustness. The output from the feed-forward network is then added back to the previous block’s output via another residual connection, followed by an additional layer normalization step with epsilon = 1 × 10−6 to normalize the combined output, ensuring stability in the model’s learning process.
Table 29. Transformer model layers for multi-class classification.
The architecture of the CNN model designed for multi-class classification leverages the output of the Transformer model as its input, tailored for both the NF-UNSW-NB15-v2 and CICIDS2017 datasets. The model starts with an input block that processes the Transformer output. The first hidden block incorporates a 1D CNN layer with 512 filters and a ReLU activation function, enabling the extraction of critical features from the data. This is followed by a 1D max pooling layer with a pool size of two, which reduces dimensionality, and a dropout layer with a rate of 0.0000001 to mitigate overfitting. In the second hidden block, another 1D CNN layer with 512 filters and a ReLU activation function is utilized, accompanied by a 1D max pooling layer with a pool size of 4 and another dropout layer with the same rate, reinforcing regularization. The third hidden block comprises a dense layer with 1024 units and a ReLU activation function, further enhancing the model’s ability to represent complex patterns. This block also includes a dropout layer with a rate of 0.0000001 for additional regularization. The output block varies based on the dataset. For the NF-UNSW-NB15-v2 dataset, the output layer consists of 10 units, while for the CICIDS2017 dataset, it includes 15 units. Both employ a softmax activation function to perform multi-class classification. The complete architecture, including layer specifications, is outlined in Table 30.
Table 30. CNN model layers for multi-class classification.
(iii)
Hyperparameter Configuration for the Transformer-CNN Model
The hyperparameters for the Transformer-CNN model, detailed in Table 31, have been meticulously optimized for effectiveness in both binary and multi-class classification tasks. The model operates with a batch size of 128, which defines the number of samples processed before the model’s weights are updated, ensuring consistency across both classification scenarios. The learning rate for both the binary and multi-class classifiers is dynamically adjusted using the ReduceLROnPlateau schedule. If the validation loss does not improve for a specified number of epochs (patience), the learning rate is reduced by a factor of 0.5. This strategy helps to fine-tune the model’s learning process, allowing for smaller adjustments as training progresses, potentially leading to improved convergence. The learning rate is bounded below by a minimum value of 1 × 10−5, preventing it from becoming so small that the model makes ineffective updates. This approach enhances training efficiency and stability, ensuring the model can reliably converge without overshooting the optimal solution. The Adam optimizer is utilized due to its robust adaptive learning features, demonstrating effectiveness in both binary and multi-class contexts. In the case of binary classification, the model leverages binary cross-entropy as its loss function, quantifying the divergence between predicted probabilities and actual binary outcomes. In contrast, the multi-class classification model employs categorical cross-entropy, assessing the difference between predicted class probabilities and the true class labels across multiple categories. For performance evaluation, both models utilize accuracy as their primary metric, which reflects the ratio of correctly predicted instances to the total number of predictions made. This metric serves as a straightforward indicator of model performance, illustrating the extent to which predicted labels correspond with actual labels.
Table 31. Hyperparameters of the Transformer-CNN model.

4. Results and Experiments

In this section, we present a comprehensive evaluation of the proposed models, incorporating advanced data resampling techniques and class weight adjustments to address class imbalance effectively. To ensure a robust comparison, the performance of our approach is assessed alongside state-of-the-art intrusion detection methods. The experimental findings demonstrate that the proposed model achieves superior results, setting a benchmark in anomaly detection performance.

4.1. Dataset Description and Preprocessing Overview

The datasets utilized in this study, NF-UNSW-NB15-v2 and CICIDS2017, are among the most comprehensive benchmarks for evaluating IDS. These datasets capture diverse network behaviors and attack scenarios, offering a solid foundation for developing and assessing anomaly detection models. Despite their strengths, both datasets present challenges such as missing data, duplicates, outliers, and class imbalance, which necessitate rigorous preprocessing. This section provides an overview of the datasets, their suitability for binary and multi-class classification tasks, and their relevance to IDS research, along with the essential preprocessing steps. These steps address issues like missing values, eliminating duplicates, handling outliers, and balancing the class distribution to optimize the datasets for effective model evaluation.

4.1.1. NF-UNSW-NB15-v2 Dataset

The NF-UNSW-NB15-v2 dataset, as described in Section 3.1, captures diverse network behaviors, including normal and malicious traffic across various attack types, providing valuable features for IDS development. However, it faces challenges such as missing values, duplicates, and class imbalance, which are addressed through preprocessing outlined in Section 3.2. This included handling missing values, eliminating duplicates, applying outlier detection techniques like z-score and LOF, performing feature selection to reduce dimensionality, and normalizing numerical features using MinMaxScaler. Advanced resampling methods, such as ADASYN for oversampling and ENN for undersampling, were applied, along with dynamic class weights during training to improve class representation. This comprehensive preprocessing optimized the dataset for both binary and multi-class classification tasks.
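As a rough illustration of the resampling and class-weight steps described above, the following sketch uses imbalanced-learn and scikit-learn on a synthetic stand-in for the training split; the generated data is not the NF-UNSW-NB15-v2 dataset, and the parameter choices are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import EditedNearestNeighbours

# Stand-in imbalanced training split (90% "normal", 10% "attack")
X_train, y_train = make_classification(n_samples=5000, n_features=25,
                                        weights=[0.9, 0.1], random_state=42)

X_res, y_res = ADASYN(random_state=42).fit_resample(X_train, y_train)   # oversample the minority class
X_res, y_res = EditedNearestNeighbours().fit_resample(X_res, y_res)     # prune ambiguous majority samples

classes = np.unique(y_res)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_res)
class_weight = dict(zip(classes, weights))   # passed to model.fit(..., class_weight=class_weight)
```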

4.1.2. CICIDS2017 Dataset

Certain aspects, such as data structure and labeling, are pivotal for effective intrusion detection in network-based datasets. Markus et al. [97] offer a thorough analysis of these factors in both supervised and unsupervised intrusion detection techniques. This section delves into the history and characteristics of the CICIDS2017 dataset, which is utilized in this study for intrusion detection. Released by the Canadian Institute for Cybersecurity, this dataset is publicly available for academic research purposes [98]. It is one of the most up-to-date datasets for network intrusion detection found in the literature, comprising 2,830,743 records, 79 network traffic features, and 15 classes, including 1 for Benign traffic and 14 distinct attack types [12]. The dataset is organized into eight files representing five days of benign and attack traffic, with each file containing real-world network data [98,99]. In addition to the core traffic data, the records include supplementary metadata and are provided in packet-based and bidirectional flow-based formats [97]. The dataset is fully labeled, making it suitable for both binary and multi-class classification tasks. For binary classification, all attack types are labeled as ‘1’, while benign traffic is labeled as ‘0’. For multi-class classification, all attack types are considered individually, providing a comprehensive view of the different forms of network attacks. The CICIDS2017 dataset, while extensive, requires meticulous preprocessing to address missing data and enhance its quality for analysis. Preprocessing began by consolidating the dataset’s eight constituent files into a single comprehensive dataset. Missing values, or NaNs, were systematically addressed to prevent data quality issues. Duplicates were eliminated, and columns with only a single unique value were removed to optimize feature relevance. Remaining NaN values were carefully imputed, and feature names were standardized by stripping leading spaces for uniformity. Sampling was then performed, and for multi-class classification, instances belonging to the ‘Normal’ class were excluded post-sampling. To eliminate extreme values that could bias model outcomes, outliers were identified and removed using the LOF. In multi-class classification, feature selection based on correlation was applied following outlier removal to refine the feature set further. Then, numerical features were normalized using MinMaxScaler to ensure consistent scaling across variables. After these steps, the dataset was partitioned into training and testing subsets. To address class imbalances during training, advanced resampling techniques were implemented. For binary classification, the enhanced hybrid ADASYN-SMOTE method was applied to generate synthetic samples within the training data, while for multi-class classification, an advanced cascaded SMOTE approach was utilized to balance the training dataset effectively. Additionally, the ENN technique was employed to undersample the training data, further refining class distribution and improving model robustness. Class weights were dynamically adjusted during the training process to ensure balanced learning across all classes. Collectively, these preprocessing strategies transformed the raw CICIDS2017 dataset into a well-balanced and optimized resource, tailored for binary and multi-class classification tasks.
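A minimal sketch of the outlier-removal and scaling steps is given below, using scikit-learn’s LocalOutlierFactor and MinMaxScaler on a random stand-in feature matrix; the neighbourhood size is an assumption, as the text does not state it.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 69))                  # stand-in for the CICIDS2017 feature matrix

inlier_mask = LocalOutlierFactor(n_neighbors=20).fit_predict(X) == 1   # -1 marks outliers
X_clean = X[inlier_mask]

X_scaled = MinMaxScaler().fit_transform(X_clean)  # rescale every feature to the [0, 1] range
```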

4.2. Experiment’s Establishment

The models were developed on the Kaggle platform 1.6.17 using TensorFlow 2.17.0 and Keras 3.4.1. The experimental configuration was equipped with hardware that included an Nvidia GeForce RTX 1050 graphics card and operated on Windows 10. Throughout the data resampling process, only the training set was utilized, while the evaluation dataset was reserved as the testing set. The training process involved executing the models for 500 epochs, with validation accuracy monitored throughout the training.

4.3. Evaluation Metrics

The confusion matrix is an essential tool for assessing the performance of machine learning models. It presents a structured table that juxtaposes the actual and predicted class labels, as detailed in reference [100]. This matrix facilitates the calculation of a range of performance metrics.
  • True Positive (TP): These are the instances that the model correctly predicted to be positive. For example, if a spam filter correctly identified an email as spam, this is a true positive.
  • False Negative (FN): These are the instances that the model incorrectly predicted to be negative. In the spam filter example, if it mistakenly classified a spam email as legitimate, this is a false negative.
  • True Negative (TN): These are the instances that the model correctly predicted to be negative. Returning to our spam filter, if it accurately identified a non-spam email as non-spam, this is a true negative.
  • False Positive (FP): These are the instances that the model incorrectly predicted to be positive. In the spam filter context, if it mistakenly classified a legitimate email as spam, this is a false positive.
Equation (32) [101] illustrates the most basic and fundamental metric, accuracy, which can be derived from the confusion matrix.
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
It is common to evaluate the model using a variety of additional metrics, including recall, precision, and the F-score. Precision is determined by dividing the number of true positive results by the total number of predicted positive results, encompassing both correct and incorrect identifications. This metric, also known as positive predictive value, is calculated using Equation (33) [101]. Recall, defined in Equation (34) [101], assesses the proportion of actual positive instances that the model correctly identifies among all instances that should have been recognized as positive. The F-score, computed using Equation (35) [102], serves as the harmonic mean of precision and recall, providing a balanced measure of the model’s performance.
\mathrm{Precision} = \frac{TP}{TP + FP}
\mathrm{Recall} = \frac{TP}{TP + FN}
\mathrm{F\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
In this scenario, the goal is to enhance metrics including the F-score, accuracy, recall, and precision, as outlined by the evaluation criteria.
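Equations (32)–(35) correspond directly to scikit-learn’s standard metrics, as the toy example below illustrates with made-up label vectors.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # ground-truth labels (toy example)
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / (TP + TN + FP + FN)
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F-score  :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```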

4.4. Results

The evaluation of the proposed models was conducted across two primary phases, training and testing, utilizing the train and test subsets of the NF-UNSW-NB15-v2 dataset, with additional evaluation on other datasets like CICIDS2017 to demonstrate the models’ generalizability. These experiments targeted both binary and multi-class classification tasks, ensuring accurate detection of malicious activities and precise identification of various attack types. A comprehensive analysis was performed to assess the impact of data resampling techniques on the models’ performance, offering a thorough comparison of their effectiveness. The models were also benchmarked against established intrusion detection systems from the literature, providing valuable insights into their relative strengths and weaknesses in a broader context. The results from both the NF-UNSW-NB15-v2 and CICIDS2017 datasets underscore the effectiveness and versatility of the proposed models in addressing complex classification challenges. Among the evaluated approaches, the Transformer-CNN model consistently emerged as the top performer, demonstrating exceptional accuracy in detecting malicious activities and classifying diverse attack types. While other models, such as auto encoder, DNN and CNN, delivered commendable results, the Transformer-CNN model proved to be the most resilient and reliable across all evaluation metrics, highlighting the critical role of applied preprocessing techniques and emphasizing the robustness and generalizability of the models.
(i)
Binary Classification
The performance metrics presented in Table 32 illustrate the results of binary classification on the NF-UNSW-NB15-v2 and CICIDS2017 datasets, utilizing data resampling techniques and class weights. Each model demonstrated impressive performance across all metrics, highlighting their reliability and robustness in binary classification tasks. On the NF-UNSW-NB15-v2 dataset, the CNN model achieved an accuracy of 99.69%, with precision, recall, and F-score all matching at 99.69%. The auto encoder reported an accuracy of 99.66%, and similarly, the DNN model achieved an accuracy of 99.68%, with corresponding precision, recall, and F-score values of 99.68%. The Transformer-CNN model outperformed the others, achieving the highest accuracy at 99.71%, along with matching precision, recall, and F-score metrics of 99.71%. On the CICIDS2017 dataset, the CNN model demonstrated outstanding performance, achieving an accuracy of 99.86%, with precision, recall, and F-score all equally high at 99.86%. The auto encoder model, while slightly lower, still achieved a strong accuracy of 99.73%, with corresponding precision, recall, and F-score values matching at 99.73%, suggesting it is effective in identifying anomalies and classifying the data accurately. The DNN model reported an impressive accuracy of 99.88%, with precision, recall, and F-score values consistently high at 99.88%, indicating that it is highly reliable in distinguishing between the different classes within the dataset. However, the Transformer-CNN model stood out as the best performer, achieving the highest accuracy of 99.93%, with precision, recall, and F-score all at 99.93%. These results highlight the impressive performance of each model in binary classification tasks across both datasets, showcasing their reliability and robustness for real-world applications. The Transformer-CNN model, in particular, emerged as the most effective, achieving the highest performance in binary classification on both datasets.
Table 32. Performance metrics in binary classification using data resampling and class weights.
(ii)
Multi-Class Classification
The performance metrics for various models in multi-class classification on the NF-UNSW-NB15-v2 and CICIDS2017 datasets, utilizing data resampling techniques and class weights, are summarized in Table 33. On the NF-UNSW-NB15-v2 dataset, the CNN model achieved an accuracy of 98.36%, with precision at 98.66%, recall at 98.36%, and F-score at 98.46%. The auto encoder showed slightly lower performance, with an accuracy of 95.57%, precision of 96.54%, recall of 95.57%, and an F-score of 95.77%. The DNN model attained an accuracy of 97.65%, with precision at 98.09%, recall at 97.65%, and F-score at 97.77%. The Transformer-CNN model stood out with the highest performance, achieving an accuracy of 99.02%, precision of 99.30%, recall of 99.02%, and F-score of 99.13%. On the CICIDS2017 dataset, the CNN model achieved an accuracy of 99.05%, with precision at 99.12%, recall at 99.05%, and F-score at 99.07%. The Auto Encoder performed similarly, with an accuracy of 99.09%, precision of 99.12%, recall of 99.09%, and F-score of 99.09%. The DNN model reported an accuracy of 99.11%, with precision at 99.20%, recall at 99.11%, and F-score at 99.14%. The Transformer-CNN model once again outperformed the others, achieving the highest accuracy of 99.13%, precision of 99.22%, recall of 99.13%, and F-score of 99.16%. These results emphasize the strong performance of each model in multi-class classification tasks across both datasets, showcasing their reliability and robustness in real-world applications. Notably, the Transformer-CNN model demonstrated the highest effectiveness, standing out as the most proficient model for multi-class classification on both datasets.
Table 33. Performance metrics in multi-class classification using data resampling and class weights.

5. Discussion

This section provides a comprehensive evaluation of the Transformer-CNN model’s performance in comparison to other classification methods, such as CNN, auto encoder, and DNN, across both binary and multi-class classification tasks. We conduct a detailed analysis of the confusion matrices and key performance metrics, including accuracy, precision, recall, and F1-score, to offer a comparative assessment of each model’s strengths and weaknesses. Results obtained from the NF-UNSW-NB15-v2 dataset, along with additional evaluation on other datasets like CICIDS2017 to demonstrate the models’ generalizability, reveal how the Transformer-CNN model’s innovative integration of Transformer and CNN architectures enhances its ability to detect malicious activities and classify various attack types. This analysis not only highlights the model’s superior performance across multiple metrics but also underscores its robustness in real-world intrusion detection scenarios, emphasizing the practical implications of improving the accuracy and reliability of IDS systems.
(i)
Binary Classification
In binary classification on both the NF-UNSW-NB15-v2 and CICIDS2017 datasets, the Transformer-CNN model demonstrated exceptional performance across critical metrics such as accuracy, precision, recall, and F1-score, outperforming previously proposed models. Its ability to extract and leverage essential features from the input data is evident in the classification outcomes. Figure 2 presents the confusion matrices for the Transformer-CNN model applied to the NF-UNSW-NB15-v2 and CICIDS2017 datasets. On the NF-UNSW-NB15-v2 dataset, the model achieved an accuracy of 99.71%, with precision, recall, and F1-score all at 99.71%. The confusion matrix shows that the model correctly identified 4342 normal instances and 1797 attack instances. However, 18 normal instances were misclassified as attacks, with no attack instances misclassified as normal. This performance underscores the model’s robustness in handling imbalanced datasets and its precision in detecting attacks while minimizing false alarms. On the CICIDS2017 dataset, the Transformer-CNN model achieved an even higher accuracy of 99.93%, with precision, recall, and F1-score also at 99.93%. The confusion matrix reveals that the model correctly classified 13,939 normal instances and 11,033 attack instances. However, 15 normal instances were misclassified as attacks, and 3 attack instances were misclassified as normal. This result highlights the model’s exceptional ability to distinguish between normal and malicious traffic effectively, ensuring reliability and precision in real-world intrusion detection scenarios. These results confirm the Transformer-CNN model’s capability to address critical challenges in intrusion detection, including managing imbalanced datasets and reducing false positives and false negatives, making it a highly reliable tool for deployment in real-world network security applications.
Figure 2. Confusion matrix for binary classification using Transformer-CNN on (a) NF-UNSW-NB15-v2 dataset and (b) CICIDS2017 dataset.
The comparative performance of the proposed Transformer-CNN model against other binary classifiers, including a standalone CNN, auto encoder, and DNN, is depicted in Figure 3 and Figure 4. The evaluation metrics displayed include accuracy, precision, recall, and F1-score. The results indicate that the Transformer-CNN model excelled, with an accuracy of 99.71%, a precision of 99.71%, a recall of 99.71%, and an F1-score of 99.71% on the NF-UNSW-NB15-v2 dataset. This underscores its exceptional capability in detecting intrusions. The high precision score of 99.71% indicates that the Transformer-CNN model effectively identified true positives with very few false positives, while the 99.71% recall score shows that it captured nearly all true positive instances, minimizing false negatives. The F1-score of 99.71% reflects a nearly perfect balance between precision and recall, showcasing the model’s overall effectiveness and reliability. On the CICIDS2017 dataset, the Transformer-CNN model demonstrated even greater performance, achieving an accuracy of 99.93%, along with matching precision, recall, and F1-score metrics of 99.93%. In contrast, the standalone auto encoder exhibited lower performance metrics on both datasets, with accuracy, precision, recall, and F1-score around 99.66% on NF-UNSW-NB15-v2 and 99.73% on CICIDS2017. The standalone CNN achieved slightly better metrics of 99.69% on NF-UNSW-NB15-v2 and 99.86% on CICIDS2017. The DNN model had metrics of 99.68% on NF-UNSW-NB15-v2 and 99.88% on CICIDS2017. Ultimately, the Transformer-CNN model stands out due to its robust overall performance on both datasets, reinforcing its suitability for binary classification tasks.
Figure 3. Proposed Transformer-CNN versus binary classifiers on NF-UNSW-NB15-v2 dataset.
Figure 4. Proposed Transformer-CNN versus binary classifiers on CICIDS2017 dataset.
The effectiveness of the Transformer-CNN model in binary classification is further validated by its exemplary performance metrics across different classes on the NF-UNSW-NB15-v2 and CICIDS2017 datasets. For the NF-UNSW-NB15-v2 dataset, the model achieved an overall accuracy of 99.71%, along with precision, recall, and F1-score all at 99.71%. Specifically, for the ‘Normal’ class, it recorded an accuracy of 99.59%, a perfect precision of 100%, a recall of 99.59%, and an F1-score of 99.79%, showcasing its ability to accurately identify benign traffic. In the ‘Attack’ class, it achieved a perfect accuracy of 100%, precision of 99.01%, recall of 100%, and an F1-score of 99.50%, underscoring its effectiveness in detecting malicious traffic while minimizing false positives and false negatives. On the CICIDS2017 dataset, the model also demonstrated outstanding results, achieving an overall accuracy of 99.93%, precision of 99.93%, recall of 99.93%, and an F1-score of 99.93%. For the ‘Normal’ class, it attained an accuracy of 99.89%, precision of 99.98%, recall of 99.89%, and an F1-score of 99.94%, highlighting its precision in identifying benign traffic. For the ‘Attack’ class, the model achieved an accuracy of 99.97%, precision of 99.86%, recall of 99.97%, and an F1-score of 99.92%, validating its robustness in distinguishing attack traffic with high reliability. The results summarized in Table 34 and Table 35 illustrate the Transformer-CNN model’s ability to perform consistently across diverse datasets. The detailed performance metrics for individual classes further emphasize the model’s precision and reliability, making it well-suited for deployment in real-world intrusion detection systems where the consequences of misclassification can be critical.
Table 34. Performance metrics for Transformer-CNN across several classes in binary classification on NF-UNSW-NB15-v2 dataset.
Table 35. Performance metrics for Transformer-CNN across several classes in binary classification on CICIDS2017 dataset.
(ii)
Multi-Class Classification
In multi-class classification on the NF-UNSW-NB15-v2 dataset, the Transformer-CNN model demonstrated exceptional performance across key metrics such as accuracy, precision, recall, and F1-score compared to other models. The model’s ability to accurately distinguish between different types of attacks is clearly reflected in the confusion matrix, as shown in Figure 5. This matrix highlights the model’s effectiveness in correctly classifying a wide range of attack classes with minimal misclassification. For instance, the model successfully identified 4294 instances of Benign traffic, 720 instances of Exploits, 474 instances of Fuzzers, 344 instances of Reconnaissance, and 132 instances of Generic attacks. In addition, it correctly recognized 76 instances of DoS, 25 instances of Shellcode, 14 instances of Backdoor, 8 instances of Analysis, and 5 instances of Worms. Few misclassifications were observed, including some false positives and negatives across various attack classes, underscoring the model’s overall reliability and precision in distinguishing these attacks. The comprehensive accuracy of 99.02%, precision of 99.30%, recall of 99.02%, and F1-score of 99.13%, as detailed in the confusion matrix, confirm the model’s capability in managing the complexities of multi-class classification in real-world scenarios, particularly when dealing with diverse and imbalanced datasets.
Figure 5. Confusion matrix for multi-class classification using Transformer-CNN on NF-UNSW-NB15-v2 dataset.
In multi-class classification on the CICIDS2017 dataset, the Transformer-CNN model demonstrates remarkable effectiveness, as illustrated by the confusion matrix shown in Figure 6. The model achieves outstanding accuracy, precision, recall, and F1-scores across various attack classes, effectively distinguishing between diverse attack types with minimal misclassifications. For instance, the Benign class achieved 13,773 correct classifications, with only a few instances misclassified into other categories, such as 60 instances as PortScan and 34 as DoS Hulk. The PortScan attack class was classified with high precision, correctly identifying 1,806 out of 1,808 instances, with just 2 instances misclassified. Similarly, the model correctly classified 2,080 instances of DDoS, with 2 instances misclassified into other categories. For DoS Hulk, the model correctly classified 5,609 instances, with only two minor misclassifications. In the DoS GoldenEye class, all 480 instances were correctly identified, showcasing perfect performance. For FTP-Patator, 255 instances were correctly classified, with just 2 misclassified as DoS Slowloris. The model maintained strong accuracy for the SSH-Patator class, correctly identifying 112 instances with minimal errors. For more challenging attack types, such as DoS Slowloris and DoS Slowhttptest, the model achieved excellent results, correctly classifying 261 and 160 instances, respectively, without any misclassifications. The model also handled the Bot attack class effectively, correctly classifying 104 instances, with only 3 misclassified into the Benign category. The Web Attack - Brute Force class was classified with perfect precision and recall, correctly identifying all 69 instances without any errors, while the Web Attack - XSS class achieved near-perfect performance, correctly identifying 55 instances with minimal errors. The Transformer-CNN model demonstrated strong performance across the Infiltration, Web Attack - SQL Injection, and Heartbleed classes. For Infiltration, it correctly identified 3 instances, but misclassified 2 instances as Heartbleed. In the Web Attack - SQL Injection class, the model classified both instances correctly, achieving perfect accuracy. Similarly, for Heartbleed, the model exhibited flawless performance, correctly identifying all 4 instances with no errors. These results further emphasize the model’s ability to handle less frequent and challenging attack classes with high precision. With an overall accuracy of 99.13%, precision of 99.22%, recall of 99.13%, and an F1-score of 99.16%, the Transformer-CNN model demonstrates robust capability in handling multi-class classification challenges. Its ability to classify a wide range of attack types accurately and reliably underscores its potential for real-world deployment in intrusion detection systems, where precision and reliability are paramount.
Figure 6. Confusion matrix for multi-class classification using Transformer-CNN on CICIDS2017 dataset.
The comparative performance of the proposed Transformer-CNN model against other multi-class classifiers, including a standalone CNN, auto encoder, and DNN, highlights the Transformer-CNN’s remarkable capability in managing complex classification tasks on both the NF-UNSW-NB15-v2 and CICIDS2017 datasets, as shown in Figure 7 and Figure 8. The evaluation metrics, including accuracy, precision, recall, and F1-score, show that the Transformer-CNN consistently outperforms the other classifiers across both datasets. On the NF-UNSW-NB15-v2 dataset, the Transformer-CNN achieved an accuracy of 99.02%, a precision of 99.30%, a recall of 99.02%, and an F1-score of 99.13%, underscoring its effectiveness in handling multi-class classification with high performance. In addition to its high accuracy, the model excelled in precision, recall, and F1-score, which are essential for assessing performance in imbalanced datasets. Specifically, it achieved a precision of 99.30% and a recall of 99.02%, underscoring its effectiveness in identifying true positives while minimizing false positives. In contrast, the CNN achieved an accuracy of 98.36%, with precision, recall, and F1-score values of 98.66%, 98.36%, and 98.46%, respectively. The DNN recorded an accuracy of 97.65%, with precision, recall, and F1-score values of 98.09%, 97.65%, and 97.77%, respectively. The auto encoder exhibited comparatively lower metrics, achieving 95.57% accuracy, 96.54% precision, 95.57% recall, and 95.77% F1-score. On the CICIDS2017 dataset, the Transformer-CNN also led with an accuracy of 99.13%, a precision of 99.22%, a recall of 99.13%, and an F1-score of 99.16%. The CNN achieved an accuracy of 99.05%, with precision, recall, and F1-score values of 99.12%, 99.05%, and 99.07%, respectively. The DNN recorded an accuracy of 99.11%, with precision, recall, and F1-score values of 99.20%, 99.11%, and 99.14%, respectively. The auto encoder achieved an accuracy of 99.09%, with precision, recall, and F1-score values of 99.12%, 99.09%, and 99.09%. These results emphasize the significant improvement offered by the Transformer-CNN model for multi-class classification tasks across both datasets.
Figure 7. Proposed Transformer-CNN versus multi-class classifiers on NF-UNSW-NB15-v2 dataset.
Figure 8. Proposed Transformer-CNN versus multi-class classifiers on CICIDS2017 dataset.
The Transformer-CNN model demonstrated remarkable effectiveness in multi-class classification, as evidenced by its performance metrics across various attack classes. The model achieved exceptional results, recording 100% accuracy, precision, recall, and F1-score for the Shellcode class, reflecting its outstanding capability to accurately identify this specific attack type without errors. For other classes such as Benign, Exploits, and Reconnaissance, the model maintained high performance, with metrics consistently exceeding 96%. For example, the Benign class achieved an accuracy of 99.63%, a precision of 100%, a recall of 99.63%, and an F1-score of 99.81%. The DoS class recorded an accuracy of 97.44% with a precision of 92.68%, while the Fuzzers class achieved an accuracy of 99.16% and an F1-score of 98.54%. Even for more challenging attack types like Backdoor and Analysis, the model performed robustly, attaining F1-scores of 59.57% and 48.48%, respectively. These comprehensive metrics, detailed in Table 36, highlight the model’s ability to effectively manage the complexities of multi-class classification. Its precision in distinguishing between various attack types further emphasizes its potential for real-world deployment in intrusion detection systems, where accurate and reliable classification is crucial.
Table 36. Performance metrics for Transformer-CNN across several classes in multi-class classification on NF-UNSW-NB15-v2 dataset.
The Transformer-CNN model exhibited exceptional effectiveness in multi-class classification, as evidenced by its performance metrics across various attack classes on the CICIDS2017 dataset. The model achieved outstanding results, particularly for certain attack types. For instance, the DoS Slowhttptest class recorded perfect scores, with 100% accuracy, precision, recall, and F1-score, demonstrating the model’s capability to accurately classify this attack type without any errors. Similarly, the DoS GoldenEye class achieved 100% accuracy and recall, along with precision and an F1-score exceeding 98%. The model also excels in distinguishing between other classes. For example, the PortScan class recorded an accuracy of 99.89%, precision of 96.78%, recall of 99.89%, and an F1-score of 98.31%. The “DDoS” class similarly performed exceptionally, with accuracy and recall of 99.90% and an F1-score of 99.69%. Despite the inherent complexity of multi-class classification, the Transformer-CNN model maintained high metrics for a majority of the attack types, such as FTP-Patator and SSH-Patator, which achieved F1-scores of 97.70% and 97.39%, respectively. However, for more challenging attack classes like Infiltration and Web Attack–SQL Injection, the model’s performance was relatively lower, recording F1-scores of 46.15% and 57.14%, respectively. These results highlight potential areas for improvement in handling low-frequency or highly complex attack types. Overall, the comprehensive metrics detailed in Table 37 underscore the Transformer-CNN model’s ability to manage the complexities of multi-class classification effectively. Its precision in distinguishing between various attack types emphasizes its robustness and potential for real-world deployment in intrusion detection systems, where accurate and reliable classification across a wide range of threats is essential.
Table 37. Performance metrics for Transformer-CNN across several classes in multi-class classification on CICIDS2017 dataset.

Case Study for Zero-Day Attack

In today’s rapidly evolving cyber threat landscape, zero-day attacks pose a significant challenge to network security. These attacks exploit unknown vulnerabilities, often bypassing traditional security measures. To address this challenge, this case study examines the application of an advanced deep learning model, specifically a Transformer-CNN for effective zero-day attack detection. In the realm of zero-day attack detection, our Transformer-CNN model has proven to be highly effective, especially in the context of the “Reconnaissance” category within the NF-UNSW-NB15-v2 dataset. To rigorously test the model’s ability to detect previously unseen threats, we deliberately omitted this attack class from the training dataset, reserving it solely for evaluation during the testing phase. Remarkably, the model was able to accurately identify 293 out of the 299 instances of this attack, as illustrated in Figure 9. This outcome highlights the model’s strong capacity for generalization, allowing it to recognize and respond to novel attack patterns it had not encountered before. The model’s success in handling such sophisticated and unknown attack vectors underscores its robustness and positions it as a powerful asset in real-world cyber security defense mechanisms.
Figure 9. Confusion matrix of the Transformer-CNN model on the NF-UNSW-NB15-v2 dataset, demonstrating its effectiveness in detecting zero-day attacks.
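The zero-day protocol described above amounts to removing one attack class from the training split while keeping it in the test split. A hedged sketch follows; the file name, the ‘Attack’ label column, and the split ratio are assumptions for illustration, not the authors’ exact pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("NF-UNSW-NB15-v2.csv")                     # assumed file name
train_df, test_df = train_test_split(df, test_size=0.2,
                                      stratify=df["Attack"], random_state=42)

# Hold the Reconnaissance class out of training so the model never sees it
train_df = train_df[train_df["Attack"] != "Reconnaissance"]
# test_df still contains Reconnaissance flows, which are evaluated as zero-day attacks
```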

6. Limitations

The Transformer-CNN architecture exemplifies a sophisticated deep learning framework that combines the capabilities of Transformers and CNNs to bolster performance in classification tasks. Although this innovative approach effectively tackles key issues in intrusion detection systems, such as enhancing accuracy and addressing class imbalances, it is essential to acknowledge various limitations and challenges that may arise:
  • Scalability: As the volume of datasets or the complexity of network traffic grows, the computational demands on the model can intensify, which may hinder its efficiency and its capacity to manage larger datasets or adapt to changing network environments.
  • Generalization: Although the Transformer-CNN exhibits impressive performance on the NF-UNSW-NB15-v2 and CICIDS2017 datasets, its efficacy on other types of network traffic and newly emerging attack vectors is not yet fully established. To assess its robustness and generalization capabilities, it is crucial to evaluate the model against a wider array of datasets, including KDDCup99 [36], NSL-KDD [29], and more recent collections such as CSE-CIC-IDS2018 [34] and IoT23 [16].
  • Data Preprocessing: Preprocessing each dataset is a critical stage that encompasses addressing missing values, encoding categorical variables, normalizing or standardizing numerical features, and eliminating extraneous information (a minimal illustrative sketch follows this list). The model’s performance is strongly influenced by the quality and thoroughness of these procedures.
  • Model Adaptation: Adjusting the model for various datasets necessitates a trial-and-error approach to hyperparameter optimization. This iterative process is essential for refining the model to better match the specific characteristics and nuances of new datasets.
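To make the preprocessing requirements concrete, the following is a minimal sketch, assuming pandas and scikit-learn, of an imputation, encoding, and normalization pipeline. The column names and file path are illustrative placeholders rather than the exact schemas of the evaluated datasets.

```python
# Sketch of the preprocessing steps named above: impute missing values,
# encode categorical features, and scale numerical ones. Column names and
# the input file are illustrative placeholders, not the exact dataset schema.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

categorical = ["PROTOCOL", "L7_PROTO"]                              # assumed categorical columns
numerical = ["IN_BYTES", "OUT_BYTES", "FLOW_DURATION_MILLISECONDS"]  # assumed numerical columns

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", MinMaxScaler()),                 # normalize numerical features to [0, 1]
    ]), numerical),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

df = pd.read_csv("flows.csv")                      # placeholder path
X = preprocess.fit_transform(df[categorical + numerical])
```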

7. Conclusions

In this paper, we proposed an advanced hybrid Transformer-CNN deep learning model designed to address the challenges of zero-day attack detection and class imbalance in IDSs. The transformer component is employed for contextual feature extraction, enabling the system to analyze relationships and patterns in the data effectively, while the CNN performs the final classification, processing the extracted features to accurately identify specific attack types. By integrating data resampling techniques such as ADASYN, SMOTE, and ENN, we effectively address class imbalance in the training data, and applying class weights further balances the influence of the different classes during training. As a result, our model significantly improves detection accuracy while reducing false positives and false negatives.
The evaluation results demonstrate the model’s remarkable performance across both the NF-UNSW-NB15-v2 and CICIDS2017 datasets. On NF-UNSW-NB15-v2, the model achieved 99.71% accuracy in binary classification and 99.02% accuracy in multi-class classification; on CICIDS2017, it attained 99.93% and 99.13% accuracy, respectively, demonstrating its effectiveness across diverse datasets and classification tasks and surpassing existing models in both known and unknown threat detection. This research highlights the potential of hybrid deep learning models in fortifying network and cloud environments against increasingly sophisticated cyber threats. Our approach not only enhances real-time detection capabilities but also proves effective in handling imbalanced datasets, a common challenge in IDS development.
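As a concrete illustration of the imbalance-mitigation strategy summarized above, the following minimal sketch, assuming the imbalanced-learn and scikit-learn packages, combines SMOTE-ENN (or ADASYN) resampling with inverse-frequency class weights. It uses synthetic data and is not the exact configuration used in our experiments.

```python
# Minimal sketch of the class-imbalance mitigation named above: resample
# minority classes with SMOTE+ENN (or ADASYN), and weight classes inversely
# to their frequency during training. Synthetic data stands in for the
# preprocessed IDS features; this is not the authors' exact configuration.
import numpy as np
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification
from sklearn.utils.class_weight import compute_class_weight

# Placeholder imbalanced multi-class dataset (3 classes, 85/10/5 split).
X_train, y_train = make_classification(
    n_samples=2000, n_classes=3, n_informative=6,
    weights=[0.85, 0.10, 0.05], random_state=42
)

# Combined oversampling (SMOTE) and cleaning (ENN) of the training data.
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_train, y_train)
# Alternatively: X_res, y_res = ADASYN(random_state=42).fit_resample(X_train, y_train)

# Class weights for the loss function (e.g., Keras model.fit(..., class_weight=...)).
classes = np.unique(y_res)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_res)
class_weight = dict(zip(classes, weights))
print(class_weight)
```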

8. Future Work

To address the limitations and challenges outlined in Section 6, future research should prioritize exploration in the following domains:
  • Broader Dataset Evaluation: Future investigations should involve testing the Transformer-CNN across a more diverse range of datasets, including KDDCup99 [36], NSL-KDD [29], and newer datasets such as CSE-CIC-IDS2018 [34] and IoT23 [16]. This approach will provide insights into its robustness, generalization potential, and effectiveness in addressing emerging attack vectors.
  • Data Preprocessing Refinement: The data preprocessing procedures should be carefully refined and customized for each dataset to achieve optimal model performance. This entails experimenting with various preprocessing techniques and analyzing their effects on model results. These preprocessing strategies are discussed in detail in Sections 3.2 and 4.1 of the manuscript.
  • Model Adaptation and Hyperparameter Optimization: Ongoing investigation into model adaptation techniques is essential, with emphasis on refining the hyperparameter optimization process for different datasets. This process should undergo systematic analysis to uncover best practices for adapting the model to new data environments (an illustrative sketch follows this list). Detailed discussions of these aspects are presented in Section 3, specifically in Section 3.3.4.
  • Scalability and Computational Efficiency: It is imperative to enhance the model’s computational efficiency and scalability, enabling it to effectively manage larger datasets and more intricate network traffic scenarios without sacrificing performance.
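As an illustration of a systematic hyperparameter search, the following sketch assumes the keras-tuner package and a simplified dense classifier; the search space and model are placeholders rather than the Transformer-CNN configuration used in this work.

```python
# Illustrative sketch (assuming the keras-tuner package) of a systematic
# hyperparameter search over units, dropout rate, and learning rate for a
# small dense classifier. The search space and model are placeholders, not
# the paper's Transformer-CNN architecture or tuning procedure.
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(20,)),                                  # placeholder feature size
        tf.keras.layers.Dense(hp.Int("units", 32, 128, step=32), activation="relu"),
        tf.keras.layers.Dropout(hp.Float("dropout", 0.1, 0.5, step=0.1)),
        tf.keras.layers.Dense(3, activation="softmax"),                      # placeholder class count
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(hp.Choice("lr", [1e-2, 1e-3, 1e-4])),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=10,
                        overwrite=True, directory="tuning", project_name="ids")
# tuner.search(X_train, y_train, validation_split=0.2, epochs=10)   # X_train/y_train: preprocessed data
# best_hps = tuner.get_best_hyperparameters(1)[0]
```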

Author Contributions

Conceptualization, H.K. and M.M.; Methodology, H.K. and M.M.; Software, H.K. and M.M.; Validation, H.K. and M.M.; Writing—original draft, H.K. and M.M.; Supervision, M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in our study, NF-UNSW-NB15-v2 and CICIDS2017, are publicly available. Below are the URLs for the datasets: NF-UNSW-NB15-v2: https://staff.itee.uq.edu.au/marius/NIDS_datasets/ (accessed on 15 December 2024); CICIDS2017: https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 15 December 2024).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Conti, M.; Dargahi, T.; Dehghantanha, A. Cyber Threat Intelligence: Challenges and Opportunities; Springer: Berlin/Heidelberg, Germany, 2018; pp. 1–6. [Google Scholar] [CrossRef]
  2. Faker, O.; Dogdu, E. Intrusion detection using big data and deep learning techniques. In Proceedings of the 2019 ACM Southeast Conference. ACM SE’19, Kennesaw, GA, USA, 18–20 April 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 86–93. [Google Scholar] [CrossRef]
  3. Kaur, G.; Habibi Lashkari, A.; Rahali, A. Intrusion traffic detection and characterization using deep image learning. In Proceedings of the 2020 IEEE International Conference on Dependable, Autonomic and Secure Computing, International Conference on Pervasive Intelligence and Computing, International Conference on Cloud and Big Data Computing, International Conference on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), Calgary, AB, Canada, 17–22 August 2020; pp. 55–62. [Google Scholar] [CrossRef]
  4. Internet Security Threat Report. Available online: https://docs.broadcom.com/doc/istr-23-2018-en (accessed on 18 July 2022).
  5. Cyberattacks Now Cost Companies $200,000 on Average, Putting Many out of Business. Available online: https://www.cnbc.com/2019/10/13/cyberattacks-cost-small-companies-200k-putting-many-out-of-business.html (accessed on 13 October 2019).
  6. Kumar, M.; Singh, A.K. Distributed intrusion detection system using blockchain and cloud computing infrastructure. In Proceedings of the 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI)(48184), Tirunelveli, India, 15–17 June 2020; pp. 248–252. [Google Scholar]
  7. Zhang, X.; Xie, J.; Huang, L. Real-Time Intrusion Detection Using Deep Learning Techniques. J. Netw. Comput. Appl. 2020, 140, 45–53. [Google Scholar]
  8. Kumar, S.; Kumar, R. A Review of Real-Time Intrusion Detection Systems Using Machine Learning Approaches. Comput. Secur. 2020, 95, 101944. [Google Scholar]
  9. Smith, A.; Jones, B.; Taylor, C. Enhancing Network Security with Real-Time Intrusion Detection Systems. Int. J. Inf. Secur. 2021, 21, 123–135. [Google Scholar]
  10. Sarhan, M.; Layeghy, S.; Portmann, M. Towards a standard feature set for network intrusion detection system datasets. Mob. Netw. Appl. 2022, 27, 357–370. [Google Scholar] [CrossRef]
  11. Sarhan, M.; Layeghy, S.; Moustafa, N.; Portmann, M. Cyber threat intelligence sharing scheme based on federated learning for network intrusion detection. J. Netw. Syst. Manag. 2023, 31, 3. [Google Scholar] [CrossRef]
  12. UNB. Intrusion Detection Evaluation Dataset (CICIDS2017), University of New Brunswick. Available online: https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 30 October 2024).
  13. Panigrahi, R.; Borah, S. A detailed analysis of CICIDS2017 dataset for designing Intrusion Detection Systems. Int. J. Eng. Technol. 2018, 7, 479–482. [Google Scholar]
  14. Anderson, J.P. Computer security threat monitoring and surveillance. In Technical Report; James P. Anderson Company: Washington, DC, USA, 1980. [Google Scholar]
  15. Mahalingam, A.; Perumal, G.; Subburayalu, G.; Albathan, M.; Altameem, A.; Almakki, R.S.; Hussain, A.; Abbas, Q. ROAST-IoT: A novel range-optimized attention convolutional scattered technique for intrusion detection in IoT networks. Sensors 2023, 23, 8044. [Google Scholar] [CrossRef]
  16. ElKashlan, M.; Elsayed, M.S.; Jurcut, A.D.; Azer, M. A machine learning-based intrusion detection system for iot electric vehicle charging stations (evcss). Electronics 2023, 12, 1044. [Google Scholar] [CrossRef]
  17. Al Nuaimi, T.; Al Zaabi, S.; Alyilieli, M.; AlMaskari, M.; Alblooshi, S.; Alhabsi, F.; Yusof, M.F.B.; Al Badawi, A. A comparative evaluation of intrusion detection systems on the edge-IIoT-2022 dataset. Intell. Syst. Appl. 2023, 20, 200298. [Google Scholar] [CrossRef]
  18. Gad, A.R.; Nashat, A.A.; Barkat, T.M. Intrusion detection system using machine learning for vehicular ad hoc networks based on ToN-IoT dataset. IEEE Access 2021, 9, 142206–142217. [Google Scholar] [CrossRef]
  19. Al-Daweri, M.S.; Ariffin, K.A.Z.; Abdullah, S.; Senan, M.F.E.M. An analysis of the KDD99 and UNSW-NB15 datasets for the intrusion detection system. Symmetry 2020, 12, 1666. [Google Scholar] [CrossRef]
  20. Vitorino, J.; Praça, I.; Maia, E. Towards adversarial realism and robust learning for IoT intrusion detection and classification. Ann. Telecommun. 2023, 78, 401–412. [Google Scholar] [CrossRef]
  21. Othman, T.S.; Abdullah, S.M. An intelligent intrusion detection system for internet of things attack detection and identification using machine learning. Aro-Sci. J. Koya Univ. 2023, 11, 126–137. [Google Scholar] [CrossRef]
  22. Yaras, S.; Dener, M. IoT-Based Intrusion Detection System Using New Hybrid Deep Learning Algorithm. Electronics 2024, 13, 1053. [Google Scholar] [CrossRef]
  23. Vinayakumar, R.; Alazab, M.; Soman, K.P.; Poornachandran, P.; Al-Nemrat, A.; Venkatraman, S. Deep learning approach for intelligent intrusion detection system. IEEE Access 2019, 7, 41525–41550. [Google Scholar] [CrossRef]
  24. Farhana, K.; Rahman, M.; Ahmed, M.T. An intrusion detection system for packet and flow based networks using deep neural network approach. Int. J. Electr. Comput. Eng. 2020, 10, 5514–5525. [Google Scholar] [CrossRef]
  25. Zhang, C.; Chen, Y.; Meng, Y.; Ruan, F.; Chen, R.; Li, Y.; Yang, Y. A novel framework design of network intrusion detection based on machine learning techniques. Secur. Commun. Netw. 2021, 2021, 6610675. [Google Scholar] [CrossRef]
  26. Alsharaiah, M.; Abualhaj, M.; Baniata, L.; Al-saaidah, A.; Kharma, Q.; Al-Zyoud, M. An innovative network intrusion detection system (NIDS): Hierarchical deep learning model based on Unsw-Nb15 dataset. Int. J. Data Netw. Sci. 2024, 8, 709–722. [Google Scholar] [CrossRef]
  27. Jouhari, M.; Benaddi, H.; Ibrahimi, K. Efficient Intrusion Detection: Combining χ2 Feature Selection with CNN-BiLSTM on the UNSW-NB15 Dataset. arXiv 2024, arXiv:2407.14945. [Google Scholar]
  28. Türk, F. Analysis of intrusion detection systems in UNSW-NB15 and NSL-KDD datasets with machine learning algorithms. Bitlis Eren Üniversitesi Fen Bilim. Derg. 2023, 12, 465–477. [Google Scholar] [CrossRef]
  29. Muhuri, P.; Chatterjee, P.; Yuan, X.; Roy, K.; Esterline, A. Using a long short-term memory recurrent neural network (lstm-rnn) to classify network attacks. Information 2020, 11, 243. [Google Scholar] [CrossRef]
  30. Fu, Y.; Du, Y.; Cao, Z.; Li, Q.; Xiang, W. A deep learning model for network intrusion detection with imbalanced data. Electronics 2022, 11, 898. [Google Scholar] [CrossRef]
  31. Yin, Y.; Jang-Jaccard, J.; Xu, W.; Singh, A.; Zhu, J.; Sabrina, F.; Kwak, J. IGRF-RFE: A hybrid feature selection method for MLP-based network intrusion detection on UNSW-NB15 dataset. J. Big Data 2023, 10, 15. [Google Scholar] [CrossRef]
  32. Yoo, J.; Min, B.; Kim, S.; Shin, D.; Shin, D. Study on network intrusion detection method using discrete pre-processing method and convolution neural network. IEEE Access 2021, 9, 142348–142361. [Google Scholar] [CrossRef]
  33. Alzughaibi, S.; El Khediri, S. A cloud intrusion detection systems based on dnn using backpropagation and pso on the cse-cic-ids2018 dataset. Appl. Sci. 2023, 13, 2276. [Google Scholar] [CrossRef]
  34. Basnet, R.B.; Shash, R.; Johnson, C.; Walgren, L.; Doleck, T. Towards Detecting and Classifying Network Intrusion Traffic Using Deep Learning Frameworks. J. Internet Serv. Inf. Secur. 2019, 9, 1–17. [Google Scholar]
  35. Thilagam, T.; Aruna, R. Intrusion detection for network based cloud computing by custom RC-NN and optimization. ICT Express 2021, 7, 512–520. [Google Scholar] [CrossRef]
  36. Farahnakian, F.; Heikkonen, J. A deep auto-encoder based approach for intrusion detection system. In Proceedings of the 2018 20th International Conference on Advanced Communication Technology (ICACT), Chuncheon, Republic of Korea, 11–14 February 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 178–183. [Google Scholar]
  37. Mahmood, H.A.; Hashem, S.H. Network intrusion detection system (NIDS) in cloud environment based on hidden Naïve Bayes multiclass classifier. Al-Mustansiriyah J. Sci. 2018, 28, 134–142. [Google Scholar] [CrossRef]
  38. Baig, M.M.; Awais, M.M.; El-Alfy, E.S.M. A multiclass cascade of artificial neural network for network intrusion detection. J. Intell. Fuzzy Syst. 2017, 32, 2875–2883. [Google Scholar] [CrossRef]
  39. Mohy-Eddine, M.; Guezzaz, A.; Benkirane, S.; Azrour, M.; Farhaoui, Y. An ensemble learning based intrusion detection model for industrial IoT security. Big Data Min. Anal. 2023, 6, 273–287. [Google Scholar] [CrossRef]
  40. Nicolas-Alin, S. Machine Learning for Anomaly Detection in IoT Networks: Malware Analysis on the IoT-23 Data Set. Bachelor’s Thesis, University of Twente, Enschede, The Netherlands, 2020. [Google Scholar]
  41. Susilo, B.; Sari, R.F. Intrusion detection in IoT networks using deep learning algorithm. Information 2020, 11, 279. [Google Scholar] [CrossRef]
  42. Szczepański, M.; Pawlicki, M.; Kozik, R.; Choraś, M. The application of deep learning imputation and other advanced methods for handling missing values in network intrusion detection. Vietnam. J. Comput. Sci. 2023, 10, 1–23. [Google Scholar] [CrossRef]
  43. Kumar, P.; Bagga, H.; Netam, B.S.; Uduthalapally, V. Sad-iot: Security analysis of ddos attacks in iot networks. Wirel. Pers. Commun. 2022, 122, 87–108. [Google Scholar] [CrossRef]
  44. Sarhan, M.; Layeghy, S.; Portmann, M. Feature analysis for machine learning-based IoT intrusion detection. arXiv 2021, arXiv:2108.12732. [Google Scholar]
  45. Ferrag, M.A.; Friha, O.; Hamouda, D.; Maglaras, L.; Janicke, H. Edge-IIoTset: A new comprehensive realistic cyber security dataset of IoT and IIoT applications for centralized and federated learning. IEEE Access 2022, 10, 40281–40306. [Google Scholar] [CrossRef]
  46. Henry, A.; Gautam, S.; Khanna, S.; Rabie, K.; Shongwe, T.; Bhattacharya, P.; Sharma, B.; Chowdhury, S. Composition of hybrid deep learning model and feature optimization for intrusion detection system. Sensors 2023, 23, 890. [Google Scholar] [CrossRef] [PubMed]
  47. Aleesa, A.; Mohammed, A.A.; Mohammed, A.A.; Sahar, N. Deep-intrusion detection system with enhanced UNSW-NB15 dataset based on deep learning techniques. J. Eng. Sci. Technol. 2021, 16, 711–727. [Google Scholar]
  48. Ahmad, M.; Riaz, Q.; Zeeshan, M.; Tahir, H.; Haider, S.A.; Khan, M.S. Intrusion detection in internet of things using supervised machine learning based on application and transport layer features using UNSW-NB15 data-set. EURASIP J. Wirel. Commun. Netw. 2021, 2021, 10. [Google Scholar] [CrossRef]
  49. Mohammed, B.; Gbashi, E.K. Intrusion detection system for NSL-KDD dataset based on deep learning and recursive feature elimination. Eng. Technol. J. 2021, 39, 1069–1079. [Google Scholar] [CrossRef]
  50. Umair, M.B.; Iqbal, Z.; Faraz, M.A.; Khan, M.A.; Zhang, Y.D.; Razmjooy, N.; Kadry, S. A network intrusion detection system using hybrid multilayer deep learning model. Big Data 2022, 12, 367–376. [Google Scholar] [CrossRef]
  51. Choobdar, P.; Naderan, M.; Naderan, M. Detection and multi-class classification of intrusion in software defined networks using stacked auto-encoders and CICIDS2017 dataset. Wirel. Pers. Commun. 2022, 123, 437–471. [Google Scholar] [CrossRef]
  52. Shende, S.; Thorat, S. Long short-term memory (LSTM) deep learning method for intrusion detection in network security. Int. J. Eng. Res. 2020, 9, 1615–1620. [Google Scholar]
  53. Farhan, B.I.; Jasim, A.D. Performance analysis of intrusion detection for deep learning model based on CSE-CIC-IDS2018 dataset. Indones. J. Electr. Eng. Comput. Sci. 2022, 26, 1165–1172. [Google Scholar] [CrossRef]
  54. Farhan, R.I.; Maolood, A.T.; Hassan, N. Performance analysis of flow-based attacks detection on CSE-CIC-IDS2018 dataset using deep learning. Indones. J. Electr. Eng. Comput. Sci. 2020, 20, 1413–1418. [Google Scholar] [CrossRef]
  55. Lin, P.; Ye, K.; Xu, C.Z. Dynamic network anomaly detection system by using deep learning techniques. In Proceedings of the Cloud Computing–CLOUD 2019: 12th International Conference, Held as Part of the Services Conference Federation, SCF 2019, San Diego, CA, USA, 25–30 June 2019; Proceedings 12. Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 161–176. [Google Scholar]
  56. Liu, G.; Zhang, J. CNID: Research of network intrusion detection based on convolutional neural network. Discret. Dyn. Nat. Soc. 2020, 2020, 4705982. [Google Scholar] [CrossRef]
  57. Li, F.; Shen, H.; Mai, J.; Wang, T.; Dai, Y.; Miao, X. Pre-trained language model-enhanced conditional generative adversarial networks for intrusion detection. Peer-to-Peer Netw. Appl. 2024, 17, 227–245. [Google Scholar] [CrossRef]
  58. Wang, S.; Yao, X. Multiclass imbalance problems: Analysis and potential solutions. IEEE Trans. Syst. Man Cybern. Part B 2012, 42, 1119–1130. [Google Scholar] [CrossRef] [PubMed]
  59. Abdelkhalek, A.; Mashaly, M. Addressing the class imbalance problem in network intrusion detection systems using data resampling and deep learning. J. Supercomput. 2023, 79, 10611–10644. [Google Scholar] [CrossRef]
  60. Yang, H.; Xu, J.; Xiao, Y.; Hu, L. SPE-ACGAN: A resampling approach for class imbalance problem in network intrusion detection systems. Electronics 2023, 12, 3323. [Google Scholar] [CrossRef]
  61. Zakariah, M.; AlQahtani, S.A.; Al-Rakhami, M.S. Machine learning-based adaptive synthetic sampling technique for intrusion detection. Appl. Sci. 2023, 13, 6504. [Google Scholar] [CrossRef]
  62. Thiyam, B.; Dey, S. Efficient feature evaluation approach for a class-imbalanced dataset using machine learning. Procedia Comput. Sci. 2023, 218, 2520–2532. [Google Scholar] [CrossRef]
  63. Albasheer, F.O.; Haibatti, R.R.; Agarwal, M.; Nam, S.Y. A Novel IDS Based on Jaya Optimizer and Smote-ENN for Cyberattacks Detection. IEEE Access 2024, 12, 101506–101527. [Google Scholar] [CrossRef]
  64. Arık, A.O.; Çavdaroğlu, G.Ç. An Intrusion Detection Approach based on the Combination of Oversampling and Undersampling Algorithms. Acta Infologica 2023, 7, 125–138. [Google Scholar] [CrossRef]
  65. Rao, Y.N.; Suresh Babu, K. An imbalanced generative adversarial network-based approach for network intrusion detection in an imbalanced dataset. Sensors 2023, 23, 550. [Google Scholar] [CrossRef]
  66. Jamoos, M.; Mora, A.M.; AlKhanafseh, M.; Surakhi, O. A new data-balancing approach based on generative adversarial network for network intrusion detection system. Electronics 2023, 12, 2851. [Google Scholar] [CrossRef]
  67. Xu, B.; Sun, L.; Mao, X.; Ding, R.; Liu, C. IoT Intrusion Detection System Based on Machine Learning. Electronics 2023, 12, 4289. [Google Scholar] [CrossRef]
  68. Assy, A.T.; Mostafa, Y.; Abd El-khaleq, A.; Mashaly, M. Anomaly-based intrusion detection system using one-dimensional convolutional neural network. Procedia Comput. Sci. 2023, 220, 78–85. [Google Scholar] [CrossRef]
  69. Elghalhoud, O.; Naik, K.; Zaman, M.; Manzano, R. Data Balancing and cnn Based Network Intrusion Detection System; IEEE: Piscataway, NJ, USA, 2023. [Google Scholar]
  70. Almarshdi, R.; Nassef, L.; Fadel, E.; Alowidi, N. Hybrid Deep Learning Based Attack Detection for Imbalanced Data Classification. Intell. Autom. Soft Comput. 2023, 35, 297–320. [Google Scholar] [CrossRef]
  71. Thockchom, N.; Singh, M.M.; Nandi, U. A novel ensemble learning-based model for network intrusion detection. Complex Intell. Syst. 2023, 9, 5693–5714. [Google Scholar] [CrossRef]
  72. Jumabek, A.; Yang, S.S.; Noh, Y.T. CatBoost-based network intrusion detection on imbalanced CIC-IDS-2018 dataset. Korean Soc. Commun. Commun. J. 2021, 46, 2191–2197. [Google Scholar] [CrossRef]
  73. Zhu, Y.; Liang, J.; Chen, J.; Ming, Z. An improved nsga-iii algorithm for feature selection used in intrusion detection. Knowl.-Based Syst. 2017, 116, 74–85. [Google Scholar] [CrossRef]
  74. Jiang, J.; Wang, Q.; Shi, Z.; Lv, B.; Qi, B. Rst-rf: A hybrid model based on rough set theory and random forest for network intrusion detection. In Proceedings of the 2nd International Conference on Cryptography, Security and Privacy, Guiyang, China, 16–18 March 2018. [Google Scholar]
  75. Chawla, N.; Bowyer, K.; Hall, L.; Kegelmeyer, W. Smote: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  76. Alikhanov, J.; Jang, R.; Abuhamad, M.; Mohaisen, D.; Nyang, D.; Noh, Y. Investigating the effect of trafc sampling on machine learning-based network intrusion detection approaches. IEEE Access 2022, 10, 5801–5823. [Google Scholar] [CrossRef]
  77. Zhang, X.; Ran, J.; Mi, J. An intrusion detection system based on convolutional neural network for imbalanced network traffic. In Proceedings of the 2019 IEEE 7th International Conference on Computer Science and Network Technology (ICCSNT), Dalian, China, 19–20 October 2019; pp. 456–460. [Google Scholar]
  78. Gupta, N.; Jindal, V.; Bedi, P. CSE-IDS: Using cost-sensitive deep learning and ensemble algorithms to handle class imbalance in Network-based intrusion detection systems. Comput. Secur. 2021, 112, 102499. [Google Scholar] [CrossRef]
  79. Mbow, M.; Koide, H.; Sakurai, K. Handling class imbalance problem in intrusion detection system based on deep learning. Int. J. Netw. Comput. 2022, 12, 467–492. [Google Scholar] [CrossRef] [PubMed]
  80. Patro, S.G.; Sahu, D.-K.K. Normalization: A preprocessing stage. arXiv 2015, arXiv:1503.06462. [Google Scholar] [CrossRef]
  81. Bagui, S.; Li, K. Resampling imbalanced data for network intrusion detection datasets. J. Big Data 2021, 8, 6. [Google Scholar] [CrossRef]
  82. Elmasry, W.; Akbulut, A.; Zaim, A.H. Empirical study on multiclass classification-based network intrusion detection. Comput. Intell. 2019, 35, 919–954. [Google Scholar] [CrossRef]
  83. El-Habil, B.Y.; Abu-Naser, S.S. Global climate prediction using deep learning. J. Theor. Appl. Inf. Technol. 2022, 100, 4824–4838. [Google Scholar]
  84. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar]
  85. Wilson, D.L. Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 1972, 3, 408–421. [Google Scholar] [CrossRef]
  86. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar]
  87. Zhendong, S.; Jinping, M. Deep learning-driven MIMO: Data encoding and processing mechanism. Phys. Commun. 2022, 57, 101976. [Google Scholar] [CrossRef]
  88. Xin, Z.; Chunjiang, Z.; Jun, S.; Kunshan, Y.; Min, X. Detection of lead content in oilseed rape leaves and roots based on deep transfer learning and hyperspectral imaging technology. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2022, 290, 122288. [Google Scholar] [CrossRef]
  89. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  90. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  91. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  92. Bishop, C.M.; Nasrabadi, N.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006; Volume 4. [Google Scholar]
  93. Nielsen, M.A. Neural Networks and Deep Learning. In Chapter 1 Explains the Basics of Feedforward Operations in Neural Networks; Determination Press: San Francisco, CA, USA, 2015. [Google Scholar]
  94. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011. [Google Scholar]
  95. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  96. Lei Ba, J.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  97. Ring, M.; Wunderlich, S.; Scheuring, D.; Landes, D.; Hotho, A. A survey of network-based intrusion detection data sets. Comput. Secur. 2019, 86, 147–167. [Google Scholar] [CrossRef]
  98. Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp 2018, 1, 108–116. [Google Scholar]
  99. Sharafaldin, I.; Habibi Lashkari, A.; Ghorbani, A.A. A detailed analysis of the cicids2017 data set. In Proceedings of the Information Systems Security and Privacy: 4th International Conference, ICISSP 2018, Funchal-Madeira, Portugal, 22–24 January 2018; Revised Selected Papers 4. Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 172–188. [Google Scholar]
  100. Jyothsna, V.; Prasad, K.M. Anomaly-based intrusion detection system. In Computer and Network Security; Intech: Houston, TX, USA, 2019; Volume 10. [Google Scholar]
  101. Chen, C.; Song, Y.; Yue, S.; Xu, X.; Zhou, L.; Lv, Q.; Yang, L. FCNN-SE: An Intrusion Detection Model Based on a Fusion CNN and Stacked Ensemble. Appl. Sci. 2022, 12, 8601. [Google Scholar] [CrossRef]
  102. Powers, D.M.W. Evaluation: From Precision, Recall, and F-Measure to ROC, Informedness, Markedness & Correlation. J. Mach. Learn. Technol. 2011, 2, 37–63. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
