1. Introduction
The rapid growth of digital networks and interconnected systems has dramatically increased the volume, velocity, and complexity of network traffic, exposing critical infrastructure, businesses, and individuals to a wide spectrum of cyber threats. These threats range from common intrusions such as denial-of-service attacks and malware to rare and sophisticated attacks, including user-to-root (U2R) and remote-to-local (R2L) exploits [1, 2]. Modern networks are further complicated by the proliferation of cloud computing, Internet of Things (IoT) devices, and real-time applications, making traditional security measures increasingly inadequate. This evolving threat landscape necessitates the development of intrusion detection systems (IDSs) that are not only accurate but also efficient, interpretable, and capable of real-time operation across dynamic, high-dimensional network environments [2, 3, 4].
Traditional signature-based IDSs rely on known attack patterns and are effective against previously identified threats. However, they fail to detect novel or zero-day attacks and struggle to adapt to rapidly changing traffic patterns. Anomaly-based IDSs, in contrast, offer the ability to identify deviations from normal network behavior, providing a more flexible and generalizable approach for detecting previously unseen threats [2]. Among these, unsupervised learning methods such as autoencoders have gained prominence due to their ability to reconstruct input data and detect anomalies based on reconstruction errors [5]. Autoencoders are particularly valuable for zero-day attack detection and for learning compact, noise-resistant feature representations that can reduce the impact of network variability [3, 6, 7]. Despite their advantages, conventional autoencoder-based approaches face limitations when applied to multi-class classification scenarios and rarely occurring attack types, primarily due to class imbalance and the high dimensionality of network traffic [2, 6].
Figure 1 illustrates the general framework of anomaly-based intrusion detection systems, which form the foundation of modern network security approaches. The process begins with network traffic data collection and preprocessing, including normalization and feature selection to prepare raw packets and flows for analysis. In the training phase, the system learns baseline behavior patterns exclusively from normal traffic using unsupervised or semi-supervised learning methods, establishing thresholds that characterize legitimate network activity. During the detection phase, incoming real-time traffic is scored against this learned baseline through anomaly scoring mechanisms that quantify deviations from normal behavior. Traffic samples are then classified based on whether their anomaly scores exceed the established threshold: samples below the threshold are identified as normal traffic and allowed to pass, while those above the threshold are flagged as anomalous and subjected to blocking or further investigation.
This general approach enables the detection of novel and zero-day attacks without requiring prior labeled examples of malicious activity, addressing a critical limitation of traditional signature-based methods. However, as discussed below, practical implementation of this framework faces significant challenges related to class imbalance, computational efficiency, and accurate detection of rare attack types.
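As an illustration of the thresholding step in this general framework, the following minimal Python sketch derives a threshold from anomaly scores observed on normal traffic and flags incoming samples that exceed it. The 95th-percentile rule, the function names, and the toy scores are assumptions chosen for illustration; they are not taken from the evaluated system.

```python
def fit_threshold(normal_scores, quantile=0.95):
    """Return the anomaly score at the given quantile of normal-traffic scores."""
    ordered = sorted(normal_scores)
    # Clamp the index so that quantile=1.0 still selects the last element.
    idx = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[idx]


def classify(scores, threshold):
    """Flag samples whose anomaly score exceeds the learned threshold."""
    return ["anomalous" if s > threshold else "normal" for s in scores]


# Toy anomaly scores observed on normal traffic during training (assumed values).
normal_scores = [0.05 + 0.001 * i for i in range(100)]
threshold = fit_threshold(normal_scores)

# Scores of incoming traffic at detection time (assumed values).
labels = classify([0.06, 0.12, 0.40], threshold)
print(threshold, labels)
```

In practice the quantile trades false alarms against missed attacks: a lower quantile flags more traffic, a higher one lets more borderline samples pass.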
Hybrid IDS frameworks, which combine unsupervised feature learning with supervised classification, have emerged as effective solutions to these challenges. Isolation forests, for instance, provide robust anomaly scoring by isolating outliers in high-dimensional feature spaces, while gradient boosting algorithms like LightGBM offer fast, interpretable, and accurate multi-class classification [6]. By integrating these complementary components, a hybrid approach can leverage the strengths of unsupervised feature extraction, anomaly scoring, and supervised learning within a unified pipeline. Such integration reduces reliance on manual feature engineering and improves generalization across diverse network conditions, addressing a critical limitation of both purely unsupervised and purely supervised methods [6, 8].
Advanced methods for anomaly detection, including generative models, graph neural networks, and reinforcement learning, have been explored in the literature [1, 8, 9, 10, 11, 12, 13, 14, 15]. While these methods demonstrate promising capabilities, they often incur substantial computational costs, suffer from scalability limitations, or exhibit reduced performance on imbalanced datasets, particularly in detecting rare attack classes. Consequently, there is a persistent practical gap: existing IDS solutions rarely achieve a simultaneous balance of computational efficiency, accurate detection of rare attacks, and high multi-class classification performance.
To address these challenges, this study proposes a hybrid intrusion detection framework that integrates denoising autoencoders for unsupervised feature learning, isolation forests for anomaly scoring, and LightGBM for multi-class classification. The framework is designed with three key objectives:
Computational efficiency: By employing shallow autoencoder architectures, early stopping, and gradient boosting, the framework maintains low-latency inference suitable for real-time deployment in resource-constrained environments.
Effective handling of class imbalance: The autoencoder is trained exclusively on normal traffic, and SMOTE-ENN resampling is applied to enhance the detection of rare attack types such as U2R and R2L. This approach ensures robust multi-class detection performance across all categories, overcoming a major limitation of traditional IDS methods.
Unified learning and generalization: By combining unsupervised feature extraction with supervised classification, the framework eliminates the need for extensive manual feature engineering while maintaining robustness on high-dimensional, unbalanced datasets. This integration enhances the system’s adaptability to diverse network traffic patterns without compromising detection accuracy or computational efficiency.
The proposed framework is rigorously evaluated on benchmark datasets, including NSL-KDD and UNSW-NB15, using stratified cross-validation, ablation studies, and multiple independent runs to ensure reproducibility and robustness. Experimental results demonstrate consistently high classification accuracy (~99%) and strong macro-F1 performance (>97%) across all attack categories, including rare and challenging classes. In addition, the framework exhibits low inference latency (~2–3 ms per sample), confirming its practical suitability for real-time network security deployments [6, 7].
The contributions of this study are therefore threefold:
Systematic integration of complementary techniques: Unlike prior work that focuses on individual methods, this framework combines denoising autoencoders, isolation forests, and LightGBM into a unified pipeline that balances unsupervised feature learning, anomaly scoring, and supervised classification.
Robust rare attack detection and class balancing: By leveraging SMOTE-ENN resampling and training strategies that emphasize normal traffic patterns, the framework achieves high detection performance for minority classes, which are often overlooked in conventional IDS approaches.
Comprehensive evaluation and practical applicability: Extensive experiments on multiple benchmark datasets, including ablation analyses and independent trials, validate the effectiveness, efficiency, and generalization of the framework, providing a practical solution for modern network intrusion detection challenges.
The remainder of this paper is organized as follows:
Section 2 reviews relevant literature on anomaly-based and hybrid IDS frameworks.
Section 3 details the proposed hybrid framework, including preprocessing, denoising autoencoder-based feature extraction, isolation forest anomaly scoring, and LightGBM classification.
Section 4 presents the experimental setup, evaluation metrics, results, and comparative analysis.
Section 5 discusses scalability, practical deployment considerations, and broader implications, and
Section 6 concludes the study.
3. Materials and Methods
This study presents an improved hybrid model for multi-class anomaly detection that addresses a central difficulty in intrusion detection: identifying rare attack types in imbalanced network data. The proposed architecture integrates three main components into a single pipeline: deep feature learning, unsupervised anomaly scoring, and a supervised gradient boosting classifier. This design classifies network traffic into five high-level classes: normal, denial of service (DoS), probe, remote-to-local (R2L), and user-to-root (U2R). As illustrated in Figure 2, the proposed system comprises six main stages: data preprocessing, anomaly detection using isolation forests, deep feature learning using an autoencoder, feature fusion, multi-class classification using LightGBM, and performance evaluation using robust metrics and visualization tools.
3.1. Dataset
The main benchmark for evaluating the proposed hybrid multi-class intrusion detection model is the NSL-KDD dataset. Both of its subsets, KDDTrain+ and KDDTest+, contain labeled instances that reflect a variety of malicious network actions in addition to normal network activity. Each record comprises 41 features, consisting of numerical attributes and three categorical variables (protocol_type, service, and flag), which together capture the statistical and behavioral characteristics of individual network connections.
According to Table 2, these properties make the dataset particularly suitable for anomaly-based intrusion detection studies [29]. This study maintains the multi-class structure by categorizing attack types into five main categories: normal, denial of service (DoS), probe, remote-to-local (R2L), and user-to-root (U2R), rather than reducing the task to binary classification.
This approach allows for a more accurate assessment of the model’s ability to distinguish between different types of intrusions, especially low-frequency ones.
Table 3 shows the distribution of samples across the five classes used in this work. To facilitate model development, the KDDTrain+ and KDDTest+ subsets were first combined into a single dataset and then split into training and test sets using stratified sampling. This ensured that the original class distribution was maintained across both subsets.
Categorical variables were encoded via one-hot encoding to support efficient learning, while numerical features were scaled using Min-Max normalization to align value ranges and stabilize convergence.
To overcome the inherent class imbalance, the synthetic minority oversampling technique combined with edited nearest neighbors (SMOTE-ENN) was used, yielding a more even representation of all classes. This preprocessing and balancing approach helps the model train on a fair and representative feature space, which improves generalization and detection performance across all intrusion classes, including those with few available samples.
Numerical features were scaled using Min-Max normalization after splitting into training and test sets to prevent data leakage. The SMOTE-ENN resampling technique was applied exclusively to the training set to mitigate class imbalance without contaminating the test set. Early stopping was employed for autoencoder training using a separate validation subset, and similarly for LightGBM, ensuring that hyperparameter optimization did not overfit the test data.
In this study, we adopted a unified classification scheme by grouping raw attack types into broader categories. For the NSL-KDD dataset, which originally contained 40 attack types, we divided them into five categories as follows: DoS includes smurf, neptune, back, teardrop, pod, land, apache2, mailbomb, processtable, and udpstorm; Probe includes satan, ipsweep, nmap, portsweep, mscan, and saint; R2L includes guess_passwd, ftp_write, imap, multihop, phf, spy, warezclient, warezmaster, snmpgetattack, snmpguess, xlock, xsnoop, worm, sendmail, and named; U2R includes buffer_overflow, loadmodule, perl, rootkit, ps, sqlattack, xterm, and httptunnel; and Normal represents all benign traffic.
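The grouping listed above can be expressed as a simple lookup table. The sketch below transcribes the attack names from the preceding paragraph; the dictionary and function names (`ATTACK_GROUPS`, `to_class`) are illustrative choices, and unknown labels are assumed to raise an error rather than being silently dropped.

```python
# Attack names transcribed from the grouping described in the text.
ATTACK_GROUPS = {
    "DoS": ["smurf", "neptune", "back", "teardrop", "pod", "land",
            "apache2", "mailbomb", "processtable", "udpstorm"],
    "Probe": ["satan", "ipsweep", "nmap", "portsweep", "mscan", "saint"],
    "R2L": ["guess_passwd", "ftp_write", "imap", "multihop", "phf", "spy",
            "warezclient", "warezmaster", "snmpgetattack", "snmpguess",
            "xlock", "xsnoop", "worm", "sendmail", "named"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit", "ps",
            "sqlattack", "xterm", "httptunnel"],
}

# Invert the grouping into a flat attack-name -> class lookup.
ATTACK_TO_CLASS = {
    name: group for group, names in ATTACK_GROUPS.items() for name in names
}


def to_class(raw_label):
    """Map a raw NSL-KDD label to one of the five classes."""
    label = raw_label.strip().lower()
    if label == "normal":
        return "Normal"
    # A KeyError here signals a label that the grouping does not cover.
    return ATTACK_TO_CLASS[label]


print(to_class("neptune"), to_class("rootkit"), to_class("normal"))
```

The lists above enumerate 39 attack names, which together with Normal give 40 distinct raw labels.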
The experimental evaluation in this study is conducted using the NSL-KDD and UNSW-NB15 datasets, which are among the most widely adopted benchmarks in intrusion detection research. NSL-KDD is selected due to its well-defined multi-class labeling scheme and its continued use as a standard baseline for evaluating intrusion detection models, particularly for challenging and low-frequency attack categories such as R2L and U2R. To complement this legacy benchmark and assess generalization under more realistic traffic conditions, the UNSW-NB15 dataset is also employed, as it contains modern attack behaviors and diverse feature representations generated using real traffic emulation tools. In addition to these datasets, several other popular benchmarks have been extensively used in recent intrusion detection studies, including CICIDS2017, CICIDS2018, TON_IoT, and BoT-IoT, which provide large-scale traffic traces and IoT-oriented attack scenarios. These datasets were not included in the current experimental evaluation due to differences in feature definitions, labeling granularity, and experimental scope. However, their incorporation is identified as an important direction for future work to further validate the proposed framework under broader and more heterogeneous network environments.
While NSL-KDD has limitations as a legacy benchmark, we address this through dual-dataset evaluation including UNSW-NB15 (Section 4.4), which represents modern network traffic generated using the IXIA PerfectStorm tool (Calabasas, CA, USA). Additional contemporary datasets (CICIDS2017, CICIDS2018, TON_IoT, CIC-IoT-2023) are identified as important directions for future validation under broader network environments.
3.2. Data Pre-Processing
Preprocessing is a key component of the proposed model and significantly affects its accuracy, applicability, and resilience to real network traffic. To support both the supervised and unsupervised parts of the pipeline, the raw NSL-KDD dataset must be cleaned of noise, its categorical features must be encoded numerically, and its severe class imbalance must be addressed.
Labeling and Attack Classification: Over 40 different types of attacks with skewed class distributions are found in the original dataset. These labels are grouped into five broader, semantically meaningful categories: normal, denial of service (DoS), probe, remote-to-local (R2L), and user-to-root (U2R). This improves the model’s interpretability and learning efficiency. By reducing label scarcity and enabling more efficient multi-class classification without ignoring important behavioral differences, this grouping is consistent with well-known intrusion detection taxonomies.
Data Cleaning: During preprocessing, samples with unclear, missing, or undefined labels are removed. This removes any noise that might impair anomaly detection performance and ensures more accurate class boundaries. The model is trained using data that more accurately depicts the actual structure of network traffic patterns when these inconsistencies are removed.
One-Hot Encoding: Certain categorical attributes of the dataset, namely protocol type, service, and flag, encode essential details about the types and states of connections. One-hot encoding is used to transform these non-numerical attributes into a numerical representation. This transformation preserves categorical distinctions during training and avoids introducing arbitrary ordinal relationships.
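A minimal sketch of this encoding step is given below in plain Python. The helper names and the toy protocol values are assumptions for illustration, and mapping unseen categories to an all-zero vector is one possible convention rather than a documented choice of this study.

```python
def fit_one_hot(values):
    """Learn a stable category -> column index mapping from training values."""
    return {category: i for i, category in enumerate(sorted(set(values)))}


def one_hot(value, mapping):
    """Encode a single categorical value as a 0/1 indicator vector."""
    vector = [0] * len(mapping)
    if value in mapping:  # unseen categories stay all-zero (assumed convention)
        vector[mapping[value]] = 1
    return vector


# Toy protocol_type column (assumed values).
protocols = ["tcp", "udp", "icmp", "tcp"]
mapping = fit_one_hot(protocols)
encoded = [one_hot(p, mapping) for p in protocols]
print(mapping, encoded)
```

Each category receives its own column, so no ordering such as icmp < tcp < udp is implied to the downstream models.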
Min-max normalization: Min-max normalization rescales all continuous-valued features to a common scale within the interval [0,1]. For models sensitive to differences in feature magnitude, such as autoencoders and isolation forests, normalization is particularly important. During training, normalization promotes better convergence and numerical stability. Each feature is scaled as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ is the original feature value and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of that feature.
All continuous features in this study are scaled to the range [0,1], because the normalization bounds are left at their default values, min = 0 and max = 1.
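The scaling step can be sketched as follows. To stay consistent with the leakage precautions described earlier, the sketch assumes that the minimum and maximum are estimated on the training split only and then reused for the test split; the function names and toy values are illustrative.

```python
def fit_min_max(train_column):
    """Estimate scaling bounds from the training data only."""
    return min(train_column), max(train_column)


def min_max_scale(column, lo, hi):
    """Apply x' = (x - lo) / (hi - lo) with bounds learned on the training set."""
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Toy feature values (assumed), e.g., a byte count per connection.
train = [0.0, 50.0, 200.0]
test = [100.0, 250.0]

lo, hi = fit_min_max(train)
train_scaled = min_max_scale(train, lo, hi)
test_scaled = min_max_scale(test, lo, hi)
print(train_scaled, test_scaled)
```

Note that test values outside the training range map outside [0,1] (here 250 becomes 1.25); whether to clip such values is a design decision not specified in the text.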
Class Balancing with SMOTE-ENN: This study uses SMOTE-ENN, a hybrid resampling technique, to mitigate the observed class imbalance in the NSL-KDD dataset, specifically the underrepresentation of U2R and R2L attacks. By generating synthetic instances by interpolating between pre-existing instances, the synthetic minority oversampling technique (SMOTE) improves the representation of minority classes. Then, by analyzing their nearest neighbors, the edited nearest neighbors (ENN) algorithm removes noisy or unclear data, particularly from the majority classes. In addition to maintaining clear decision boundaries and balancing the class distribution, this two-phase strategy reduces the risk of overfitting and enhances generalization during the supervised learning phase.
After separating the training and test sets, only the training set was resampled using SMOTE-ENN to prevent data leakage. To ensure objective evaluation of the model, the test set was left unaltered after preprocessing. This separation maintains the integrity of the reported results, ensuring that artificial or modified samples generated during balancing do not affect the evaluation process.
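The interpolation idea behind the SMOTE step can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the resampling code used in the experiments: the function name `smote`, the neighbor count `k`, and the toy points are hypothetical, and the ENN cleaning step is omitted.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by neighbor interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base sample (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: euclidean(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority-class samples in a normalized 2-D feature space (assumed).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6)
print(new_points)
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the region already occupied by that class, which is why resampling must be restricted to the training set.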
Maintaining class proportions using stratified sampling (80–20 split): The dataset is split using stratified sampling at a ratio of 80–20 so that all classes are represented consistently in the training and testing phases; as noted above, SMOTE-ENN is then applied only to the training portion. This method maintains the proportions of the original classes in both subsets, which is critical for low-frequency classes such as R2L and U2R. Consequently, the evaluation yields a more realistic performance estimate for all attack types, especially with respect to recall and F1-score for infrequent attacks.
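A stratified 80–20 split can be sketched as below. The implementation is a plain-Python illustration; the function name, the fixed seed, and the rule of keeping at least one test sample per class are assumptions, and the toy label counts are not those of NSL-KDD.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) preserving per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one test sample per class (assumed rule for rare classes).
        n_test = max(1, round(test_fraction * len(indices)))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with one rare class (assumed counts).
labels = ["Normal"] * 50 + ["DoS"] * 40 + ["U2R"] * 10
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_counts)
```

Splitting per class guarantees that even the rarest class appears in the test set, which a purely random split does not.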
The proposed pipeline integrates three complementary modules: a denoising autoencoder for deep feature learning, an Isolation Forest for unsupervised anomaly scoring, and a LightGBM classifier for multi-class prediction. This integration addresses the limitations of using these modules independently, enhancing detection of rare attacks while maintaining computational efficiency.
Figure 3 illustrates the severe class imbalance present in the original NSL-KDD dataset and the effectiveness of SMOTE-ENN resampling. As shown in Figure 3a, the original dataset exhibits extreme imbalance, with U2R attacks representing only 0.04% (52 samples) and R2L attacks constituting 0.8% (995 samples) of the total dataset, while Normal traffic dominates at 53.5% (67,343 samples). This severe imbalance would lead to poor detection of rare but critical attack types. Figure 3b demonstrates that SMOTE-ENN successfully rebalances the dataset, achieving more uniform class distributions (16–25% per class) while maintaining data quality through noise removal via the ENN component.
Algorithm 1 illustrates the data preprocessing pipeline for network traffic. It includes label consolidation into five classes (Normal, DoS, Probe, R2L, U2R), one-hot encoding of categorical features, and normalization of numerical features. The dataset is stratified into training and test sets, and SMOTE-ENN is applied to the training set to balance classes. This ensures the data is clean, normalized, and balanced for effective model training.
| Algorithm 1 Preprocessing and Balancing of Network Traffic Dataset. |
| Input: Raw network traffic dataset D with features X and labels L |
| Output: Preprocessed and balanced training dataset (X_train, y_train) and test dataset (X_test, y_test) |
| procedure PREPROCESS_DATA(D) |
| // Label consolidation |
| y ← GROUP_ATTACKS(L) // Group into 5 classes: Normal, DoS, Probe, R2L, U2R |
| // Handle categorical features |
| X_cat ← EXTRACT_CATEGORICAL(X) // protocol_type, service, flag |
| X_cat ← ONE_HOT_ENCODE(X_cat) |
| // Normalize numerical features |
| X_num ← EXTRACT_NUMERICAL(X) |
| X_num ← MIN_MAX_NORMALIZE(X_num) |
| // Combine features |
| X ← CONCATENATE(X_num, X_cat) |
| // Split dataset with stratification |
| (X_train, y_train), (X_test, y_test) ← STRATIFIED_SPLIT(X, y) |
| // Apply SMOTE-ENN only to training set |
| (X_train, y_train) ← SMOTE-ENN(X_train, y_train) |
| return (X_train, y_train), (X_test, y_test) |
| end procedure |
SMOTE-ENN resampling substantially transformed the class distributions in both datasets. For NSL-KDD, the original 125,973 training samples were expanded to 191,343 samples, with minority classes R2L and U2R growing from 995 to 20,000 samples (+19,005 synthetic) and from 52 to 4000 samples (+3948 synthetic), respectively, while majority class Normal remained unchanged at 67,343 samples. This reduced the maximum class imbalance from 1300:1 to 17:1. For UNSW-NB15, 37,362 synthetic samples were generated, expanding the training set from 206,138 to 243,500 samples, with rare classes such as Worms (130→1500), Shellcode (1133→4000), and Backdoor (1746→5000) receiving substantial augmentation while Normal traffic (56,000) remained unmodified, achieving final imbalance ratios of 2:1 to 4:1.
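The figures reported above can be checked with a few lines of arithmetic. The sketch below uses only the NSL-KDD class counts stated in this paragraph (Normal, R2L, and U2R; DoS and Probe counts are not given here and are therefore omitted) and the UNSW-NB15 training set sizes; the variable names are illustrative.

```python
# NSL-KDD class counts before and after SMOTE-ENN, as reported in the text.
before = {"Normal": 67_343, "R2L": 995, "U2R": 52}
after = {"Normal": 67_343, "R2L": 20_000, "U2R": 4_000}


def imbalance_ratio(counts):
    """Ratio between the largest and smallest class."""
    return max(counts.values()) / min(counts.values())


# Number of synthetic samples added per class.
synthetic = {name: after[name] - before[name] for name in before}

# UNSW-NB15: training set size after minus before resampling.
unsw_synthetic = 243_500 - 206_138

print(synthetic)
print(round(imbalance_ratio(before)), round(imbalance_ratio(after)))
print(unsw_synthetic)
```

The ratios evaluate to roughly 1295:1 before and 17:1 after resampling, consistent with the rounded values of 1300:1 and 17:1 quoted above.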
3.4. Learning Representation Through Autoencoder Architecture for Noise Removal
This study uses an unsupervised deep autoencoder to efficiently extract high-level abstractions and capture complex, nonlinear relationships from network traffic data. The autoencoder acts as an efficient feature learning unit by condensing raw input features into a compact latent representation while preserving underlying structural patterns.
Autoencoder Architecture: The encoder and decoder use a symmetric three-layer architecture. To encode each input sample into a 32-dimensional latent space, the encoder gradually reduces the input dimensionality across layers of 128, 64, and 32 neurons. Rectified Linear Unit (ReLU) activation functions are applied to each layer, introducing the nonlinearity needed to model complex feature interactions. Batch normalization is used in the early encoding layers to improve stability and accelerate convergence.
Training Objective: The autoencoder is trained to minimize the reconstruction error between the input and its reconstructed output, using the mean absolute error (MAE) as the loss function. MAE is preferred over the mean squared error (MSE) due to its robustness against outliers, a critical factor in intrusion detection, where slight deviations from normal patterns can indicate rare or emerging attacks. Furthermore, MAE produces more interpretable reconstruction errors, simplifying subsequent evaluation and thresholding steps. MAE is mathematically defined as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|x_i - \hat{x}_i\right|$$

Because the error is taken in absolute value, every deviation contributes in proportion to its magnitude, regardless of sign. This property is particularly useful in anomaly detection tasks, where even small deviations may indicate potential intrusions. The MAE’s resistance to outliers and its interpretability make it a popular choice for reconstruction-based tasks, especially in autoencoders. Unlike squared error measures, it does not disproportionately penalize large deviations, which makes it suitable for anomaly detection cases where subtle differences matter.
In the formula above, n is the total number of features per sample, $x_i$ is the i-th input feature, and $\hat{x}_i$ is its reconstruction. A lower MAE value indicates less reconstruction error, meaning that the output closely replicates the input; samples with a high reconstruction error are therefore candidates for anomalies. Early stopping is used during autoencoder training as a regularization measure to avoid overfitting and improve generalization.
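The difference between the two loss functions can be illustrated numerically. In the sketch below, one feature of a toy sample is reconstructed with an error of 0.4 and then 0.8; the sample values and function names are assumptions made for illustration.

```python
def mae(x, x_hat):
    """Mean absolute error over the features of one sample."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)


def mse(x, x_hat):
    """Mean squared error over the features of one sample."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# Toy normalized sample and two reconstructions (assumed values).
x = [0.2, 0.5, 0.9, 0.1]
x_hat_small = [0.2, 0.5, 0.5, 0.1]  # one feature off by 0.4
x_hat_large = [0.2, 0.5, 0.1, 0.1]  # the same feature off by 0.8

mae_ratio = mae(x, x_hat_large) / mae(x, x_hat_small)
mse_ratio = mse(x, x_hat_large) / mse(x, x_hat_small)
print(mae_ratio, mse_ratio)
```

Doubling the deviation doubles the MAE but quadruples the MSE, which is the sense in which MAE is less dominated by individual large errors.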
Latent Feature Extraction: After training, only the encoder component of the autoencoder is retained and used to transform each input sample into a 32-dimensional latent vector. These latent representations capture underlying temporal or spatial patterns, complex structural connections, and nonlinear relationships between features. The resulting feature space is a useful input for subsequent classification models because it is dense and information rich.
The latent dimension of 32 balances feature compression and information retention, capturing essential variance while reducing computational complexity. The encoder architecture gradually reduces dimensionality (128 → 64 → 32) using ReLU activations and batch normalization, improving convergence and preventing overfitting. The MAE loss function is used to ensure robustness to outliers common in rare attack types.
Algorithm 3 outlines the training of the denoising autoencoder. The encoder and decoder are built with symmetric layers, and the model is trained on normal traffic using mean absolute error (MAE) and the Adam optimizer. Early stopping is applied based on validation loss to prevent overfitting. The trained encoder and decoder extract meaningful latent features for subsequent anomaly detection and classification.
| Algorithm 3 Denoising Autoencoder Training |
| Input: Training data X_normal (normal traffic only) |
| Output: Trained encoder and decoder |
| procedure TRAIN_AUTOENCODER(X_normal) |
| // Initialize autoencoder architecture |
| encoder_layers ← [128, 64, 32] // Three-layer encoder |
| decoder_layers ← [32, 64, 128] // Symmetric decoder |
| // Build encoder |
| encoder ← BUILD_ENCODER(encoder_layers, activation = 'ReLU', batch_norm = True) |
| // Build decoder |
| decoder ← BUILD_DECODER(decoder_layers) |
| // Training configuration |
| loss_fn ← MAE // Mean Absolute Error |
| optimizer ← Adam(learning_rate = 0.001) |
| // Split for validation |
| (X_tr, X_val) ← SPLIT(X_normal) |
| // Train with early stopping |
| for epoch = 1 to MAX_EPOCHS do |
| X_rec ← decoder(encoder(X_tr)) |
| loss_train ← loss_fn(X_tr, X_rec) |
| UPDATE(encoder, decoder, optimizer, loss_train) |
| // Validation |
| X_val_rec ← decoder(encoder(X_val)) |
| loss_val ← loss_fn(X_val, X_val_rec) |
| if EARLY_STOPPING_CRITERIA(loss_val, patience = 10) then |
| break |
| end if |
| end for |
| return encoder, decoder |
| end procedure |
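The early-stopping criterion in Algorithm 3 can be sketched independently of any deep learning library. The following plain-Python illustration assumes the usual patience rule (stop once the validation loss has not improved for a given number of consecutive epochs); the function name and the loss values are hypothetical, and a patience of 3 is used instead of 10 only to keep the example short.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_loss) under a patience-based stopping rule."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best  # stop: no improvement for `patience` epochs
    return len(val_losses), best  # ran through all available epochs


# Toy validation losses per epoch (assumed values).
val_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.36, 0.38]
stop_epoch, best_loss = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_loss)
```

Here training stops at epoch 6, three epochs after the best validation loss of 0.35 was reached at epoch 3.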
3.6. Multi-Class Classification Based on LightGBM
In the final stage of the proposed hybrid framework, five types of network traffic are accurately classified using a LightGBM-based multi-class classifier: Denial of Service (DoS), Probe, Remote-to-Local (R2L), User-to-Root (U2R), and Normal. The classifier works with a 33-dimensional input vector created by concatenating the isolation forest’s 1-dimensional anomaly scores with the 32-dimensional latent features retrieved from the autoencoder.
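The fusion step amounts to appending one scalar to each latent vector. A minimal sketch is shown below; the function name `fuse` and the placeholder values are assumptions, and the order of concatenation (latent features first, score last) is chosen for illustration.

```python
LATENT_DIM = 32  # size of the autoencoder's latent representation


def fuse(latent, anomaly_score):
    """Concatenate a 32-d latent vector with a 1-d anomaly score."""
    if len(latent) != LATENT_DIM:
        raise ValueError(f"expected {LATENT_DIM} latent features, got {len(latent)}")
    return list(latent) + [float(anomaly_score)]


# Placeholder encoder output and isolation forest score for one sample.
latent = [0.0] * LATENT_DIM
fused = fuse(latent, -0.12)
print(len(fused))
```

The resulting 33-dimensional vector is what the classifier receives for each traffic sample.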
LightGBM was chosen because of its high performance on multi-class classification tasks, good computing efficiency, and built-in weighting techniques that allow it to manage imbalanced datasets. Because of these features, LightGBM is especially well-suited to the diverse and noisy nature of intrusion detection data.
In addition to early stopping with a validation set held out from the training data, the autoencoder and LightGBM classifier were trained using five-fold cross-validation to avoid overfitting. This supports unbiased evaluation because model selection and hyperparameter tuning are performed independently of the test set. LightGBM’s tuned hyperparameters:
- max_depth: 10
- num_leaves: 50
- splits: 5
- learning_rate: 0.05
- class_weight: balanced (to address class imbalance)
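To make the effect of the balanced class-weight setting concrete, the sketch below computes weights with the common heuristic n_samples / (n_classes × class_count), as used by scikit-learn's "balanced" mode. That the setting in this study follows exactly this heuristic is an assumption; the function name and the toy label counts are illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Toy training labels with one rare class (assumed counts).
labels = ["Normal"] * 80 + ["DoS"] * 15 + ["U2R"] * 5
weights = balanced_class_weights(labels)
print(weights)
```

Under this rule the rare U2R class receives a weight sixteen times that of Normal, so misclassifying a rare sample costs the classifier proportionally more.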
By combining grid search with cross-validation, the classifier achieves a favorable balance between bias and variance, which is essential for detecting low-frequency but critical attack classes such as R2L and U2R.
The strength of the proposed architecture lies in the combination of three complementary elements: the representational power of deep autoencoders, the outlier sensitivity of unsupervised anomaly scoring, and the interpretability and efficiency of gradient-boosted decision trees.
This combination significantly improves the model’s ability to detect uncommon and subtle intrusions, while achieving high overall classification accuracy.
Table 4 summarizes model parameters.
Furthermore, regularization techniques, feature normalization, and automated data balancing support generalization across malicious and normal traffic patterns. The resulting model is scalable and reliable, making it suitable for practical use in contemporary intrusion detection systems.
Hyperparameters are kept consistent across datasets and evaluated with five-fold stratified cross-validation to ensure unbiased comparisons.
Early stopping and validation monitoring prevent overfitting, ensuring that performance gains are due to architectural design rather than arbitrary tuning.
These design choices enhance scalability by limiting memory usage, allowing efficient batch processing, and enabling the hybrid pipeline to handle both NSL-KDD and UNSW-NB15 datasets without excessive computational cost, while maintaining reproducibility and fair evaluation across experimental runs.
Algorithm 4 describes the training of the LightGBM classifier using hybrid features. Latent features from the autoencoder are fused with anomaly scores from the Isolation Forest to form a 33-dimensional feature set. The classifier is trained with early stopping on a validation split, using balanced class weights to address class imbalance, producing a model capable of accurate multi-class intrusion detection.
| Algorithm 4 Feature Fusion and Classification |
| Input: Encoder E, Isolation Forest IF, training data X_train, labels y_train |
| Output: Trained LightGBM classifier |
| procedure TRAIN_CLASSIFIER(E, IF, X_train, y_train) |
| // Extract latent features from autoencoder |
| Z ← E(X_train) // 32-dimensional latent representation |
| // Compute anomaly scores |
| s ← ANOMALY_SCORE(IF, X_train) // 1-dimensional score |
| // Fuse features |
| F ← CONCATENATE(Z, s) // 33-dimensional hybrid features |
| // Initialize LightGBM classifier |
| hyperparameters ← { |
| max_depth: 10, |
| n_estimators: 50, |
| learning_rate: 0.05, |
| class_weight: 'balanced' |
| } |
| model ← INITIALIZE_LGBM(hyperparameters) |
| // Split for validation |
| (F_tr, y_tr), (F_val, y_val) ← SPLIT(F, y_train) |
| // Train with early stopping |
| model ← FIT(model, (F_tr, y_tr), |
| eval_set = (F_val, y_val), |
| early_stopping_rounds = 10) |
| return model |
| end procedure |
4. Results
4.1. Evaluation Criteria
A range of well-known evaluation metrics, including precision, recall, accuracy, and F1-score, is used to evaluate the performance of the proposed multi-class anomaly detection model. With a focus on low-frequency and difficult classes such as R2L and U2R, these metrics provide a comprehensive picture of the model’s classification capabilities across all intrusion classes. To provide more detailed information about per-class performance, misclassification patterns, and model robustness, confusion matrices are generated to graphically represent the distribution of predicted versus true class labels.
This multidimensional evaluation framework makes it possible to clearly and comprehensively examine a model’s ability to distinguish between different attacks in complex intrusion detection scenarios.
Precision measures the proportion of correctly predicted cases among all samples predicted to belong to a specific class. In a multi-class context, precision is calculated for each class separately to obtain class-specific predictive performance. To summarize the overall precision of a model, the macro average (which treats all classes equally) or the weighted average (which takes into account the prevalence of each class) can be applied.
This distinction is crucial when evaluating performance on imbalanced datasets, as macro averaging ensures that the model’s behavior toward minority classes is not masked by the performance of majority classes. Precision is defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Here, TP refers to the number of true positive predictions, that is, samples correctly identified as belonging to a particular class, while FP refers to false positive predictions, that is, samples incorrectly classified as belonging to that class.
Recall measures a model’s ability to identify every true occurrence of a given class. It is sometimes referred to as sensitivity or true positive rate. For a given class, recall measures the proportion of true positives out of all actual positives. Similar to precision, recall is calculated independently for each class. It can be aggregated using a weighted average, which takes into account the relative frequency of each class, or a macro average, which gives equal weight to each class. Because recognizing minority classes is essential for the model’s overall reliability on imbalanced datasets, the choice of averaging scheme is crucial. Recall is defined as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$
where TP denotes the number of true positives and FN indicates the number of false negatives. A high recall rate shows that the model correctly detects the majority of actual occurrences of a particular class, which is crucial for rare and critical attack types like U2R and R2L. In contrast, a low recall rate suggests that the model fails to detect a substantial fraction of real attacks, thereby posing security problems in real-world deployment.
Accuracy measures a model’s overall performance by calculating the proportion of correctly identified samples among the total number of cases in the dataset. It provides an overall assessment of the classifier’s performance across all classes, although it may be less informative in imbalanced cases where the majority classes dominate. Accuracy is computed as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where FP and FN denote false positives and false negatives, respectively, and TP and TN represent the number of true positives and negatives. Accuracy is a commonly used metric, but in order to properly evaluate model performance, it must be considered alongside precision, recall, and F1 score.
By calculating the harmonic mean of precision and recall, the F1 score provides a balanced evaluation metric. In imbalanced datasets, where accuracy alone might give a false impression of model performance, this metric is extremely useful. In multi-class scenarios, the F1 score can be aggregated as a weighted F1 score, which takes into account class prevalence by weighting each class based on its support, or as a macro F1 score, which considers all classes equally by calculating an unweighted average. The F1 score is defined as:
$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
The F1 score provides a fuller assessment of a classifier’s robustness by accounting for both false positives and false negatives. This is especially useful when classification errors carry unequal costs, as in intrusion detection, where missing a rare but critical attack (such as U2R or R2L) can be catastrophic.
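As a minimal sketch of how the per-class metrics and their macro and weighted averages described above can be derived from a confusion matrix, the following Python snippet may help. The function names and the toy matrix are illustrative assumptions, not the paper's implementation or results.

```python
def per_class_metrics(cm):
    """Per-class precision, recall, F1 and support from a confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    rows = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true class k, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        rows.append({"precision": precision, "recall": recall,
                     "f1": f1, "support": tp + fn})
    return rows


def accuracy(cm):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


def macro_average(rows, key):
    """Unweighted mean: every class counts equally."""
    return sum(r[key] for r in rows) / len(rows)


def weighted_average(rows, key):
    """Mean weighted by class support (number of true samples per class)."""
    total = sum(r["support"] for r in rows)
    return sum(r[key] * r["support"] for r in rows) / total


# Illustrative 3-class matrix (made-up counts, not results from the paper).
toy_cm = [
    [90, 10, 0],
    [5, 45, 0],
    [2, 2, 6],
]
toy_rows = per_class_metrics(toy_cm)
print("accuracy:", accuracy(toy_cm))
print("macro recall:", macro_average(toy_rows, "recall"))
print("weighted recall:", weighted_average(toy_rows, "recall"))
```

In this toy case the minority class has a recall of 0.6, which lowers the macro recall to 0.80 while the weighted recall stays near 0.88. This is the reason macro averages are more sensitive to weak minority-class performance.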
To evaluate the model’s predictive performance, a confusion matrix is created by comparing predicted class labels to the ground-truth labels. This matrix offers a complete view of the classification results, making it especially useful for spotting cases in which the model wrongly labels one class as another.
These insights highlight weaknesses in differentiating between closely related or uncommon intrusion types, which is particularly important when examining the model’s behavior on minority or rare classes. The model’s performance is examined from several angles by combining the confusion matrix with fundamental evaluation measures including accuracy, precision, recall, and F1 score. This evaluation enables a detailed analysis of the model’s capacity to identify and distinguish underrepresented categories such as R2L and U2R, in addition to accurately classifying common attack types. Collectively, this evaluation methodology provides a thorough, fair, and practical appraisal of the model’s overall robustness, generalization capacity, and real-world applicability in complex intrusion detection scenarios.
To ensure reproducibility, we employed rigorous experimental controls across all evaluations. For NSL-KDD, we used the standard KDDTrain+ (125,973 samples) and KDDTest+ (22,544 samples) partitions, applying 5-fold stratified cross-validation exclusively to the training set for hyperparameter optimization while reserving the test set for final evaluation. For UNSW-NB15, we implemented an 80–20 stratified split (206,138 training, 51,535 test samples) with proportional class representation maintained in both partitions. All random processes were controlled using fixed seeds: Python 3.10 (42), NumPy v1.26.4 (42), TensorFlow v2.15.0 (42), SMOTE-ENN v0.12.0 (42), and LightGBM v4.1.0 (42), ensuring deterministic neural network initialization, synthetic sampling, and model training. To assess performance stability, we conducted five independent runs with different random seeds (42, 123, 456, 789, 1011), reporting mean and standard deviation across runs. Variance reduction was achieved through stratified sampling across all splits, early stopping with 10-epoch patience during autoencoder training, and consistent validation protocols. This comprehensive reproducibility framework ensures that our results represent genuine model capabilities rather than random variations or methodological artifacts.
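The 80–20 stratified split with a fixed seed can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the function name `stratified_split` and the toy label list are hypothetical, and the authors' actual pipeline (likely library-based) is not shown in the text.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split sample indices so each class keeps roughly the same proportion
    in the training and test partitions."""
    rng = random.Random(seed)           # local generator: reproducible for a given seed
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):      # fixed class order keeps the result deterministic
        indices = by_class[label][:]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (illustrative class counts only).
toy_labels = ["normal"] * 80 + ["dos"] * 15 + ["u2r"] * 5
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

Calling the function again with the same seed reproduces the same partition, while a different seed gives another stratified partition with the same class proportions.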
4.2. Experimental Findings
We evaluate the performance of the proposed multi-class hybrid anomaly detection model, which combines a denoising autoencoder for feature extraction, an Isolation Forest for anomaly scoring, and a LightGBM classifier for final attack classification. To ensure a fair assessment of its generalization ability, the model is tested on the NSL-KDD dataset using key classification metrics (accuracy, precision, recall, and F1 score) computed on the held-out test set.
Although the NSL-KDD dataset shows high classification metrics, it is a saturated benchmark and may not fully reflect real-world challenges. To ensure robust evaluation, all reported metrics—including accuracy, precision, recall, and F1-score—are averaged over five independent runs, with standard deviations reported to capture variability. Particular attention was given to rare classes, such as R2L and U2R, to validate that the high scores are meaningful for low-frequency attack types rather than artifacts of dataset bias.
The autoencoder is trained to minimize the mean absolute error (MAE), which is chosen for its robustness to outliers and its ability to preserve small but significant patterns in normal network traffic. After training, the encoder captures the structural features of the input streams in its latent feature representations. Combining these deep features with the anomaly scores generated by the Isolation Forest produces a rich and distinctive feature vector that represents both the reconstruction accuracy and the anomaly probability.
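To illustrate why MAE is considered more robust to outliers than a squared-error loss, the following small sketch compares how the two losses react when a single feature is reconstructed badly. The vectors are made-up values chosen for illustration and are not taken from the experiments.

```python
def mae(x, x_hat):
    """Mean absolute reconstruction error."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)


def mse(x, x_hat):
    """Mean squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# Hypothetical input and two reconstructions: one with uniformly small
# errors, one where a single feature is far off (an outlier residual).
x = [0.0, 0.0, 0.0, 0.0]
clean_recon = [0.1, 0.1, 0.1, 0.1]
outlier_recon = [0.1, 0.1, 0.1, 2.0]

mae_ratio = mae(x, outlier_recon) / mae(x, clean_recon)
mse_ratio = mse(x, outlier_recon) / mse(x, clean_recon)
print(f"MAE grows by a factor of {mae_ratio:.2f}")
print(f"MSE grows by a factor of {mse_ratio:.2f}")
```

The single large residual inflates the MSE roughly a hundredfold but the MAE less than sixfold, so under MAE one extreme feature value has far less influence on the training signal.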
This compact representation is then fed to the LightGBM classifier, which uses grid search in conjunction with five-fold cross-validation to optimize its key hyperparameters, including learning rate, number of leaves, and maximum tree depth. Improving sensitivity to uncommon and important attack types, such as R2L and U2R, requires an appropriate balance between bias and variance, which this systematic tuning approach is intended to provide.
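The tuning loop described above can be sketched in plain Python as an exhaustive search over a parameter grid, scored by the mean over five stratified folds. The helper names (`stratified_folds`, `grid_search`, `toy_score`) and the grid values are assumptions for illustration; in practice `fit_and_score` would train a classifier on the training indices and return a validation metric.

```python
from collections import defaultdict
from itertools import product


def stratified_folds(labels, k=5):
    """Assign sample indices to k folds, round-robin within each class."""
    folds = [[] for _ in range(k)]
    per_class = defaultdict(list)
    for idx, label in enumerate(labels):
        per_class[label].append(idx)
    for label in sorted(per_class):
        for pos, idx in enumerate(per_class[label]):
            folds[pos % k].append(idx)
    return folds


def grid_search(param_grid, labels, fit_and_score, k=5):
    """Return the parameter combination with the best mean validation score."""
    folds = stratified_folds(labels, k)
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for f in range(k):
            val_idx = folds[f]
            train_idx = [i for g in range(k) if g != f for i in folds[g]]
            scores.append(fit_and_score(params, train_idx, val_idx))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, train_idx, val_idx):
    """Stand-in for 'train a model and return validation F1' (hypothetical)."""
    return -(abs(params["learning_rate"] - 0.05)
             + abs(params["num_leaves"] - 31) / 100
             + abs(params["max_depth"] - 6) / 100)


# Illustrative grid; the values searched in the paper are not stated here.
grid = {"learning_rate": [0.01, 0.05, 0.1],
        "num_leaves": [15, 31, 63],
        "max_depth": [4, 6, 8]}
toy_labels = ["normal"] * 10 + ["u2r"] * 5
best_params, best_score = grid_search(grid, toy_labels, toy_score)
print(best_params, best_score)
```

Because every fold contains each class in the same proportion, rare classes such as U2R contribute to every validation score, which keeps the selected hyperparameters from being driven by majority classes alone.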
Table 5 provides comprehensive evaluation metrics for each attack type and summarizes the final classification results generated by the LightGBM model.
Figure 6 provides a visual summary of the classification results for each class, along with the corresponding confusion matrix.
Experimental results confirm the effectiveness of the proposed hybrid framework. The model achieves near-perfect precision, recall, and F1 scores for all types of attacks, including the traditionally challenging and rarely represented R2L and U2R classes. Furthermore, it achieves an overall classification accuracy of 99%, while maintaining consistent performance across both majority and minority classes.
These results highlight the robustness of the model, its strong generalizability, and its suitability for practical deployment in real-world imbalanced network environments.
Experimental results on the NSL-KDD dataset demonstrate the utility of the proposed hybrid anomaly detection framework in accurately detecting a wide range of network intrusions. All five classes showed consistently high precision, recall, and F1 scores, and the model achieved an overall accuracy of 99%. Notably, the system outperformed standard methods in detecting rare and typically difficult classes such as R2L and U2R.
These results show how the integrated design increases detection performance for both majority and minority classes by combining deep feature extraction, anomaly scoring, balanced resampling, and systematic hyperparameter tuning.
All reported metrics are accompanied by 95% confidence intervals computed across folds, confirming the reliability and reproducibility of the results. Balanced training, anomaly scoring, and deep feature extraction collectively ensure stable performance for both common and rare attack classes, making the model suitable for practical network intrusion detection scenarios.
All reported statistics are averaged over five independent runs.
Table 6 shows the average precision, accuracy, recall, and F1 score, together with their standard deviations.
Figure 7 also includes error bars to show the variability among runs, which confirms the proposed model’s resilience and stability.
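As a rough sketch of how per-run scores can be summarized into a mean, a standard deviation, and a 95% confidence interval, the snippet below uses a Student's t interval. The t-based interval is an assumption about the procedure, and the five scores are hypothetical placeholders rather than values from Table 6.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776, 9: 2.262}


def summarize_runs(scores):
    """Mean, sample standard deviation and 95% t-interval over repeated runs."""
    n = len(scores)
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)                 # sample std (n - 1 denominator)
    half_width = T_CRIT_95[n - 1] * std / math.sqrt(n)
    return {"mean": mean, "std": std,
            "ci95": (mean - half_width, mean + half_width)}


# Hypothetical F1 scores from five runs with different seeds.
hypothetical_f1 = [0.990, 0.988, 0.991, 0.989, 0.992]
summary = summarize_runs(hypothetical_f1)
print(f"{summary['mean']:.4f} +/- {summary['std']:.4f}, CI95 = {summary['ci95']}")
```

With only five runs the t critical value (2.776) is noticeably larger than the normal-approximation value of 1.96, so the interval is wider than a naive estimate would suggest.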
The superior performance on rare classes (R2L: 98.3%, U2R: 98.7%) compared to prior work (AE-SAC: 83.97% [
30], RL-NIDS: poor U2R recall [
31]) can be attributed to three factors:
for deviation patterns characteristic of rare attacks
- 2.
Isolation Forest provides independent outlier scoring resistant to class imbalance, particularly effective for sparse attack types
- 3.
SMOTE-ENN resampling ensures LightGBM learns balanced decision boundaries without majority-class dominance.
The ablation study confirms synergistic effects: removing the Isolation Forest reduces F1 to 97.1% (−1.9 percentage points), while removing the autoencoder reduces F1 to 94.9% (−4.1 percentage points), demonstrating that both components are necessary for optimal rare-class detection.
To ensure fair comparison with baseline methods, all experiments were conducted using the standardized NSL-KDD KDDTrain+ and KDDTest+ partitions. Baseline results are directly cited from their original publications, all of which employed the same NSL-KDD variant and evaluation protocol. Furthermore, our preprocessing pipeline adheres to established best practices without introducing unconventional enhancements that could artificially inflate performance metrics, thereby enabling a transparent and unbiased comparison.
4.4. Generalization and Scalability Evaluation
To evaluate the scalability and cross-dataset generalizability of the proposed hybrid intrusion detection model beyond the NSL-KDD benchmark, additional tests were conducted on the UNSW-NB15 dataset, a newer and more challenging benchmark that represents modern network intrusion scenarios. The UNSW-NB15 dataset, which includes real traffic patterns produced using the IXIA PerfectStorm tool [
31], captures a broader range of contemporary attack behaviors, unlike NSL-KDD, which relies on traditional attack simulations in a controlled environment.
While NSL-KDD provides a traditional benchmark for evaluating intrusion detection models, it is limited in representing modern network traffic. The UNSW-NB15 dataset was therefore included to assess generalization and scalability. Attack types were carefully mapped into semantically coherent categories to preserve their characteristics, while ensuring consistency with our multi-class classification framework. We note that further testing on datasets such as CIC-IDS2017 or TON_IoT is necessary to fully assess real-world generalization capabilities, which we leave for future work.
To enable more effective generalization, the original UNSW-NB15 attack types were consolidated into larger, more semantically coherent categories in accordance with our multi-class classification approach. In particular, attacks originally labeled as denial-of-service (DoS) were left unchanged. Due to their shared heuristic and exploitative characteristics, analysis, obfuscation tools, reconnaissance, exploits, and generic attacks were included in the “Scanning” category. Similarly, the “User-to-Root” (U2R) category was assigned to shellcode, backdoors, and worms, which typically entail privilege escalation or unauthorized access to the system. Any attacks that did not fall into the defined categories were classified as “Unknown” and not subjected to further investigation, while normal traffic remained unchanged.
While preserving the semantics of the original labels, this relabeling technique successfully simplifies the classification process.
Table 8 summarizes the distribution of the resulting attack classes, and
Table 9 provides specific evaluation metrics for each group. Furthermore,
Figure 8 shows the confusion matrix of the proposed model’s predictions on the UNSW-NB15 test set, which also demonstrates its ability to reliably achieve high classification accuracy across all classes, including Normal, DoS, Probe, and U2R. These results demonstrate the model’s flexibility and resilience across datasets with different attack types and class distributions.
As mentioned before, all experiments were conducted on an Intel Xeon 3.4 GHz CPU with 128 GB RAM and an NVIDIA RTX 4090 GPU. The hybrid model maintains a memory footprint of less than 2.3 GB and achieves an inference speed of 2.4 ms per sample (approximately 415 samples/s). This performance surpasses typical LSTM- and GAN-based IDS architectures, which exhibit latencies of 15–30 ms per sample, demonstrating that our approach is suitable for real-time deployment even without specialized hardware accelerators.
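The relation between the per-sample latency and the throughput quoted above is a simple reciprocal, sketched below. It assumes samples are processed one at a time without batching; the function name is illustrative.

```python
def throughput_per_second(latency_ms):
    """Samples per second for sequential, unbatched inference."""
    return 1000.0 / latency_ms


proposed = throughput_per_second(2.4)        # latency reported for the hybrid model
baseline_fast = throughput_per_second(15.0)  # lower end of the 15-30 ms range
baseline_slow = throughput_per_second(30.0)  # upper end of the 15-30 ms range

print(f"proposed: {proposed:.1f} samples/s")
print(f"baselines: {baseline_slow:.1f} to {baseline_fast:.1f} samples/s")
```

A latency of 2.4 ms corresponds to about 417 samples per second, in line with the approximate figure given in the text, whereas 15–30 ms corresponds to roughly 33–67 samples per second.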
For consistency with NSL-KDD while maintaining a compact multiclass setup, the UNSW-NB15 attack labels are consolidated into four categories: Normal, DoS, Probe, and U2R. Traffic flooding attacks (DoS, Fuzzers) are mapped to DoS, reconnaissance activities to Probe, while attacks involving unauthorized access, payload execution, or privilege escalation (Exploits, Analysis, Generic, Backdoors, Shellcode, Worms) are grouped under U2R. This design choice reduces class fragmentation in UNSW-NB15 and enables stable per-class metrics and clearer interpretation of the confusion matrix.
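The four-category consolidation described in this paragraph can be written as a lookup table. The sketch below is an assumption about how such a mapping might be coded: the dictionary keys follow the category names used in the text, and the exact label strings in the dataset files may differ in spelling or case.

```python
from collections import Counter

# Mapping from UNSW-NB15 attack categories to the four coarse classes,
# following the grouping stated in the paragraph above.
UNSW_TO_COARSE = {
    "normal": "Normal",
    "dos": "DoS",
    "fuzzers": "DoS",
    "reconnaissance": "Probe",
    "exploits": "U2R",
    "analysis": "U2R",
    "generic": "U2R",
    "backdoors": "U2R",
    "shellcode": "U2R",
    "worms": "U2R",
}


def map_label(raw_label, default="Unknown"):
    """Normalize a raw category string and map it to a coarse class."""
    return UNSW_TO_COARSE.get(raw_label.strip().lower(), default)


# Hypothetical raw labels, used only to show the relabeling step.
raw = ["Normal", "Fuzzers", " Reconnaissance", "Worms", "DoS", "Exploits"]
coarse = [map_label(label) for label in raw]
print(Counter(coarse))
```

Labels that are missing from the table fall back to "Unknown", so unexpected category strings become visible in the class counts rather than being silently dropped.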
On the UNSW-NB15 test set, the proposed hybrid model demonstrates good and consistent performance, achieving an overall accuracy of 98%, an overall precision of 96%, a recall of 97%, and an F1 score of 97%. These results demonstrate the model’s robustness across multiple attack classes in a multi-class classification context, as well as its ability to accurately identify and classify various network traffic patterns.
As mentioned before, to mitigate concerns of overfitting, all confusion matrices and classification metrics were validated using five-fold stratified cross-validation. Figures include error bars reflecting the standard deviation across folds, confirming that high precision, recall, and F1 scores are consistent and not specific to a particular split. This validation provides statistical support that the model reliably distinguishes both majority and minority attack classes, including R2L and U2R.
4.5. Ablation Study
To examine the contribution of each module within the hybrid framework, an ablation study was conducted by systematically evaluating multiple simplified versions of the proposed model, as shown in
Table 10. This approach separates the effects of LightGBM, the Isolation Forest, and the autoencoder on overall detection performance:
Autoencoder + LightGBM (no Isolation Forest): This configuration does not use the anomaly scores provided by the Isolation Forest; it relies on the deep feature representations produced by the autoencoder, which are then classified by LightGBM.
Isolation Forest + LightGBM (no Autoencoder): This setup skips the feature extraction stage by directly using the raw input features, supplemented only with the anomaly scores from the Isolation Forest.
LightGBM only (no autoencoder or isolation forest): This baseline model relies solely on raw features for classification, without leveraging deep representations or anomaly scoring.
The ablation results provide compelling evidence of the synergistic effect achieved by combining the three modules:
- •
With 99.0% accuracy, 99.1% precision, 98.9% recall, and 99.0% F1 score, the full hybrid model—which included an autoencoder for deep feature extraction, an isolation forest for statistical anomaly capture, and LightGBM for supervised classification—achieved the best performance.
- •
Performance declined slightly after removing the Isolation Forest. The Autoencoder + LightGBM combination reached an accuracy of 97.8% and an F1 score of 97.1%, indicating that the anomaly scores provide an additional benefit.
- •
The importance of autoencoder-driven feature abstraction was demonstrated by a further drop in performance to 96.2% accuracy and 94.9% F1 score in the Isolation Forest + LightGBM configuration, which did not include deep representation learning.
- •
With a 93.5% F1-score and 95.0% accuracy, the LightGBM model alone without any modifications gave the lowest results, demonstrating the need for more sophisticated techniques for feature learning and anomaly detection.
Table 10.
Comprehensive ablation study.
| Model | Accuracy | F1-Score |
|---|---|---|
| Autoencoder + LightGBM | 97% | 97% |
| Isolation Forest + LightGBM | 96% | 94% |
| LightGBM-only | 95% | 93% |
| Full hybrid model | 99% | 99% |
While the Isolation Forest provides a modest improvement in overall metrics for majority classes, it plays a critical role in detecting rare and low-frequency attacks (R2L and U2R). The ablation study shows that removing the Isolation Forest reduces the F1-score for these rare classes, indicating that it contributes unique anomaly-based information that complements the deep features from the autoencoder. Furthermore, the Isolation Forest is lightweight and computationally efficient, adding minimal overhead (less than 2.3 GB memory and 2.4 ms per sample inference) while significantly improving rare-class detection. Therefore, its inclusion is justified, particularly for enhancing robustness and sensitivity in multi-class imbalanced intrusion detection scenarios.
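The size of each ablation effect can be read off by subtracting every reduced configuration from the full model. The short sketch below does this arithmetic with the one-decimal figures quoted in the bullet list above; the dictionary layout and function name are illustrative.

```python
# Accuracy and F1 (in percent) as quoted in the ablation bullet list.
ablation = {
    "Full hybrid model": {"accuracy": 99.0, "f1": 99.0},
    "Autoencoder + LightGBM": {"accuracy": 97.8, "f1": 97.1},
    "Isolation Forest + LightGBM": {"accuracy": 96.2, "f1": 94.9},
    "LightGBM-only": {"accuracy": 95.0, "f1": 93.5},
}


def drops_vs_full(results, reference="Full hybrid model"):
    """Percentage-point drop of each configuration relative to the full model."""
    ref = results[reference]
    return {
        name: {metric: round(ref[metric] - value, 1)
               for metric, value in scores.items()}
        for name, scores in results.items()
        if name != reference
    }


drops = drops_vs_full(ablation)
for name, delta in drops.items():
    print(f"{name}: accuracy -{delta['accuracy']} pp, F1 -{delta['f1']} pp")
```

On these figures, removing the autoencoder costs more F1 (4.1 points) than removing the Isolation Forest (1.9 points), and removing both costs 5.5 points.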
In addition to the ablation study summarized in
Table 10, we conducted two extra analyses to further clarify the contributions of each component.
Analysis 1: Component-wise Minority Class Impact examines incremental improvements across four configurations: baseline LightGBM (R2L 62.3%, U2R 41.2%), LightGBM + SMOTE-ENN (R2L 84.7%, U2R 78.5%), AE + IF + LightGBM without SMOTE (R2L 91.2%, U2R 88.4%), and the full hybrid model combining SMOTE-ENN with AE + IF (R2L 99.1%, U2R 100%). These results show that SMOTE-ENN primarily addresses sample quantity (~+25–35% improvement), AE + IF hybridization improves feature quality (~+30–45% improvement), and their combination produces a synergistic effect where oversampling in compressed feature space outperforms raw-space augmentation.
Analysis 2: Class-Specific IF Score Importance uses LightGBM SHAP values to evaluate Isolation Forest (IF) feature importance per class, revealing that IF scores dominate for minority attacks (R2L 34.2%, U2R 42.1%, Worms 38.7%) while contributing less to majority classes (DoS 17.8%, Normal 5.2%). This confirms that IF-based outlier detection specifically supports statistically rare attack types, providing targeted enhancement for minority-class detection.
Additional statistical analyses, beyond a single training and testing split, were conducted to ensure the reliability and reproducibility of the presented results. Specifically, the NSL-KDD and UNSW-NB15 datasets underwent five-fold stratified cross-validation, and the mean and standard deviation of precision, accuracy, recall, and F1 score across folds were recorded. Furthermore, model reconstructions were used to calculate 95% confidence intervals to account for the variability of the model’s predictions. Even for uncommon attack classes such as R2L and U2R, our tests demonstrated the stability and consistency of the proposed hybrid IDS, showing only minor differences between folds. This indicates that the near-perfect results are statistically reliable and generalizable, rather than partition-specific.
Figure 9 illustrates the throughput performance of the proposed hybrid intrusion detection framework compared with representative deep learning–based and reinforcement learning–based IDS models. The results indicate that the proposed approach achieves a substantially higher processing rate, reaching approximately 410 samples per second, whereas AE-LSTM, GAN-IDS, and RL-NIDS process around 40, 65, and 30 samples per second, respectively.
This improvement can be attributed to the lightweight design of the proposed pipeline, which relies on compact latent representations learned by the autoencoder and efficient tree-based inference using LightGBM, rather than recurrent, adversarial, or policy-based learning mechanisms. The observed throughput advantage highlights the suitability of the proposed framework for real-time or near–real-time intrusion detection scenarios, where timely response and scalability are critical, while maintaining competitive detection performance.
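For a sense of scale, the throughput figures quoted for Figure 9 can be turned into speedup ratios. The snippet below uses only the approximate values stated in the text; the variable names are illustrative.

```python
# Approximate throughput in samples per second, as quoted in the text.
throughput = {
    "Proposed": 410,
    "AE-LSTM": 40,
    "GAN-IDS": 65,
    "RL-NIDS": 30,
}

# Ratio of the proposed model's throughput to each baseline.
speedup = {
    name: throughput["Proposed"] / rate
    for name, rate in throughput.items()
    if name != "Proposed"
}

for name, ratio in speedup.items():
    print(f"Proposed vs {name}: {ratio:.1f}x")
```

On these approximate numbers the proposed pipeline processes roughly 6 to 14 times as many samples per second as the compared models.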