1. Introduction
The rapid growth of cyber-attacks across a wide spectrum of industries has made network security a prime research area of worldwide interest. Cybercriminals employ increasingly diverse methods to gain unauthorized access to network systems and the valuable information they hold; such methods include spear phishing via Advanced Persistent Threats (APTs), Denial-of-Service (DoS) attacks, and phishing. To counter these evolving threats, organizations need to deploy strong network security defenses [1,2].
In response, various safeguards have been implemented and tested, including firewalls, antivirus software, and intrusion detection or prevention systems [3]. While firewalls and antivirus software are crucial components of network security, Intrusion Detection Systems (IDSs) offer several additional advantages that make their deployment particularly valuable. These systems, available as both software and hardware, are mostly used for the early detection of intrusions and unauthorized actions on a network or computer system that compromise its confidentiality, integrity, availability, and security [4].
Two types of intrusion detection systems can be distinguished: signature-based IDSs and anomaly-based IDSs [5]. Signature-based IDSs match incoming network traffic against predefined signature patterns of known attacks. They detect known threats with high accuracy and low false positive rates; however, their signatures cover only the attacks included in the signature database, making them less effective against novel or unknown threats [6]. Anomaly-based IDSs, on the other hand, monitor network traffic for deviations from normal behavior. Based on a learned baseline of normal network activity, they raise alerts on unanticipated patterns that may indicate an intrusion. They can therefore detect previously unseen attack types, but they can also generate false positives for benign events that diverge from the established norm [7].
Numerous proposed anomaly-based Intrusion Detection Systems leverage artificial intelligence, basing their detection on machine learning algorithms that establish a baseline of normal network behavior from diverse traffic patterns. A major challenge with this approach, however, is the high dimensionality of the datasets used to train the classifiers, which increases training time and reduces the efficiency of the system. To address this, dimensionality reduction techniques can be adopted to convert the high-dimensional dataset into a more manageable lower-dimensional representation. By applying such methods, we can reduce overfitting and focus on the features that best describe the traffic, thus improving the efficiency of the machine learning models.
This research investigates the performance of various machine learning algorithms in the context of intrusion detection, while concurrently examining the impact of two dimensionality reduction methods: Batch PCA and Incremental PCA. The study focuses on the UNSW-NB15 dataset, which contains a substantial number of both normal and abnormal instances.
Key objectives include:
Assessing the performance of different machine learning algorithms on intrusion detection tasks.
Analyzing the influence of Batch PCA and Incremental PCA on classification performance.
Re-assessing the performance of these algorithms after tuning hyperparameters with K-fold cross-validation.
Comparing the training and execution times of both dimensionality reduction methods.
By addressing these objectives, this research aims to contribute to the development of a more adaptable intrusion detection system, efficient in real-world scenarios, by:
Demonstrating how using different feature extraction methods can impact the performance of ML algorithms.
Successfully adopting and applying dimensionality reduction methods to demonstrate their effectiveness in handling large datasets while maintaining good performance.
Achieving real-time performance through optimized dimensionality reduction: applying either Batch PCA or Incremental PCA enables faster processing and potentially real-time or near real-time analysis for anomaly detection.
Providing a direct comparison between Batch PCA and Incremental PCA: a comparative study of these dimensionality reduction methods and their impact on the performance metrics, offering insight into the trade-offs between accuracy, precision, recall, F1-score, and training and prediction time.
The remainder of the paper is structured as follows:
Section 2 examines previous work in the field of intrusion detection.
Section 3 presents the methodology employed in this study, describing the proposed approach, the machine learning algorithms applied, and the dimensionality reduction techniques.
Section 4 analyzes the experimental results, covering both the performance of the different machine learning algorithms and the impact of dimensionality reduction on the intrusion detection system.
Section 5 highlights the findings, discusses the implications of the research, and offers recommendations for future work in the field of intrusion detection.
2. Related Work
The researchers in [8] suggested a novel multistep approach for detecting cloud intrusions using OPTSA-FCM (Oppositional Tunicate Fuzzy C-Means Clustering). The first step was to preprocess and normalize the raw data to create two separate sets: training and test data. They then used Logistic Regression as a feature selection method to identify the most relevant features. Subsequently, the OPTSA-FCM algorithm was applied to partition the data into C clusters. Their approach was evaluated on the CICIDS2017 dataset, achieving an accuracy of 80%. The authors of [9] proposed a hybrid approach that integrated both machine learning and deep learning, applied to two different datasets: KDDCUP’99 and CIC-MalMem-2022. Because of data imbalance, they applied SMOTE as part of the preprocessing, combining it with XGBoost for feature selection. A variety of machine learning and deep learning methods were employed, including Random Forest, Decision Tree, K-Nearest Neighbors, Multilayer Perceptron, Artificial Neural Network, and Convolutional Neural Network. The evaluation encompassed multiple performance metrics, including accuracy, precision, recall, F1-score, RMSE, MAE, MSE, confusion matrices, and ROC curves, providing a comprehensive understanding of the proposed approach’s performance across all scenarios. The researchers of [10], on the other hand, developed a cloud-based intrusion detection system for wireless networks. They implemented fog computing with sink nodes to offload processing tasks and optimize computational efficiency. By combining polymorphic mutation (PM) and compact stochastic coordinate ascent (CSCA), they sought to reduce resource consumption: CSCA reduces data density while PM mitigates potential precision loss. This approach was used to tune KNN parameters for optimal performance and, evaluated on the NSL-KDD and UNSW-NB15 datasets, achieved accuracies of 99.327% and 98.27%, respectively. The research in [11] tackled the challenge of high dimensionality in intrusion detection, aiming to create a more accurate and efficient system with fewer false alarms. Using the widely adopted NSL-KDD dataset, the authors first established a performance baseline with several classifiers, where the J48 tree achieved the highest initial accuracy of 79.1%. To boost these results, they explored two dimensionality reduction techniques: Random Projection and PCA. The findings showed that Random Projection was the superior method, significantly improving detection accuracy and proving more time-efficient than PCA. The PART algorithm, combined with Random Projection, achieved the highest reported accuracy of 82.0%, outperforming the baseline and confirming that dimensionality reduction can enhance both the speed and accuracy of intrusion detection systems. A machine learning IDS combining Multivariate Correlation Analysis (MCA) and LSTM was described by the authors of [12], who employed Information Gain as a feature selection method. The MCA-LSTM achieved 82.15% test accuracy for the 5-way classification task on NSL-KDD, and 77.74% for the 10-way classification task on UNSW-NB15. Introducing an IDS that leverages SVM for enhanced performance, the authors of [13] incorporated Naïve Bayes for feature selection and optimized the SVM model training. The proposed approach was evaluated on two separate datasets, UNSW-NB15 and CICIDS2017, demonstrating significant improvements over using SVM directly, with accuracies of 93.75% and 98.92%, respectively. The methods proposed in [14] were extensively analyzed across multiple datasets, i.e., UNSW-NB15, ToN-IoT, and CSE-CIC-IDS2018. Applying three different feature reduction methods, the authors ran multiple machine learning algorithms either on the full datasets or on the reduced feature sets, reporting results in terms of accuracy, F1-score, detection rate, FAR, and AUC. The best accuracy was obtained by combining Decision Tree and Autoencoder, achieving 98.67% and 98.23% on UNSW-NB15 and CSE-CIC-IDS2018, respectively, while the same algorithm on the full ToN-IoT dataset achieved 98.15%. An Imbalanced Generative Adversarial Network (IGAN) was used by the authors of [15] to address class imbalance by augmenting the minority class samples. After several preprocessing steps, IGAN was applied to balance the dataset, and an ensemble of LeNet-5 and LSTM models classified the instances as normal or malicious. The proposed approach was evaluated on two datasets, UNSW-NB15 and CICIDS2019, with performance metrics including accuracy, recall, TPR, FPR, and F1-score. The researchers of [16] implemented different ML algorithms, first applying several preprocessing steps by cleaning the data and applying feature engineering methods. Their approach was applied to the well-known UNSW-NB15 dataset, where they measured the impact of Logistic Regression, SVM, Decision Tree, Random Forest, and XGBoost in terms of F1-score, false alarm rate, AUC, and the confusion matrix. The ROC curves were close, each obtaining an AUC of over 0.95, but in terms of F1-score, Random Forest obtained the best results, achieving an F1-score of 97.80% with a FAR of only 1.37%. Louai A.M. [17] proposed an automatic Network Intrusion Detection System (NIDS) by applying different machine learning algorithms (RF, DT, AdaBoost, BNB, KNN, LR) to the UNSW-NB15 dataset. The approach was employed on two sets: the original imbalanced set and a balanced set produced with oversampling and undersampling. Splitting the data into 80% training and 20% testing sets, he obtained the highest accuracy for the imbalanced data with Random Forest, achieving an accuracy of 90.17%, a precision and F1-score of 90.14%, and a recall of 90.17%. He showed that oversampling raised the accuracy of the approach to 98.83%, while undersampling degraded it to 81.66%. Furthermore, the author illustrated the effectiveness of Pearson’s Correlation Coefficient (PCC), a feature selection method, in lowering training time, specifically from 321.6 s to 307.3 s. The authors of [18] presented a sophisticated approach for NIDS. Employing the AWID dataset, containing 154 features, they applied feature selection to narrow the initial features to 76 and then to only 13. Results were reported in terms of accuracy, precision, recall, support, F1-score, and macro average. Deploying multiple machine learning and deep learning algorithms for both multi-class and binary classification, they obtained accuracies ranging from 88% to 97% for deep learning and from 88% to 98% for machine learning. Finally, a comparative study implemented different ML algorithms (Linear Regression, Logistic Regression, AdaBoost, SVM, DT, RF, and both extreme and light gradient boosting) to categorize a range of attacks. The authors in [19] conducted their research on both the UNSW-NB15 and NSL-KDD datasets, incorporating multiple preprocessing steps. They specifically employed a Binary Bat Algorithm for feature selection to minimize training time, then used SMOTE-ENN on the processed data for class balancing. Their findings, including accuracy, precision, recall, and F1-score for both datasets, indicated 99.7% accuracy on NSL-KDD and 97.3% on UNSW-NB15.
In summary, recent studies have used a wide range of feature selection or feature extraction techniques, combining them with various classifiers and applying them to different datasets to analyze the impact of their approaches on anomaly detection, and choosing multiple performance metrics to demonstrate that their models are indeed effective.
Table 1 presents the methodologies applied to different datasets and summarizes their key findings.
4. Results and Discussion
This section presents a comparative analysis of the different approaches, before and after applying the dimensionality reduction methods. Starting with the classifiers alone and examining both performance metrics and computational efficiency, the purpose of our study is to show the impact of dimensionality reduction methods on detecting malicious activities. The comparison focuses on the classification metrics and the training and prediction times, assessing the efficiency gains achieved through dimensionality reduction.
Figure 1 shows the workflow of the IDS on the UNSW-NB15 dataset when the classifiers are used directly, without the dimensionality reduction methods. Our approach consists of several steps. We begin by pre-processing the data and splitting it into two sets: 70% for training and the remaining 30% for testing. For this experiment, the three supervised learning algorithms previously discussed were employed to construct classifiers using the complete 49-dimensional feature set derived from the training data. Once trained, these models were evaluated on the testing dataset to assess their performance in classifying normal and abnormal network traffic.
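As a rough illustration of this baseline workflow, the following Python sketch performs the 70/30 split and trains the three classifiers on the full feature set; the file name, label column, numeric-feature filtering, and scaling step are illustrative assumptions, not the exact preprocessing used in this study.

```python
# Minimal sketch of the baseline experiment (hypothetical path and label column).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("UNSW_NB15.csv")                       # hypothetical file name
y = df["label"]                                         # assumed binary label column
X = df.drop(columns=["label"]).select_dtypes("number")  # keep numeric features for simplicity

# 70% training / 30% testing, stratified to keep the class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)                  # fit on training data only
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

for clf in (LogisticRegression(max_iter=1000), LinearSVC(), DecisionTreeClassifier()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, accuracy_score(y_test, clf.predict(X_test)))
```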
ROC curves and AUC scores are used to evaluate a classifier. The true positive rate is plotted against the false positive rate, showing the trade-off between sensitivity and specificity; a perfect classifier's curve would hug the top-left corner of the ROC space. As represented in Figure 2, all classifiers show strong performance, with each ROC curve lying near the top-left corner. Each plot shows that the model's performance is consistent across the data used, implying minimal overfitting and stable performance. Moreover, the AUC values, which range between 0 and 1, remained good to very good overall. The fact that each model has approximately the same train and test AUC, with minimal difference, indicates that the estimated performance is reliable and that the models generalize to unseen data. SVM gave the best results with a train AUC of 0.9882 and a test AUC of 0.9916, followed by Decision Tree with a train AUC of 0.988 and a test AUC of 0.988, and Logistic Regression with a train AUC of 0.9871 and a test AUC of 0.9925.
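For readers who want to reproduce this kind of evaluation, a minimal sketch is shown below, reusing the fitted classifier and test split from the previous sketch; it assumes the classifier exposes decision_function (as LinearSVC does), otherwise predict_proba would be used.

```python
# Plot a ROC curve and report the test AUC for one fitted classifier.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

scores = clf.decision_function(X_test)        # for trees: clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores)
print("test AUC:", roc_auc_score(y_test, scores))

plt.plot(fpr, tpr, label="classifier")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance level")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```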
Table 3 presents the classification performance of three models: Logistic Regression, SVM, and Decision Tree. Among them, Logistic Regression achieved the lowest accuracy, with 98.71% on the training set and 98.76% on the test set. However, it also recorded the lowest False Alarm Rate (FAR) on the test set at just 0.76%, suggesting stronger control over false positives. SVM performed slightly better in terms of training accuracy (98.85%) but showed a drop in precision on the test set (91.22%), which may indicate a higher number of false positives when predicting on unseen data. Decision Tree delivered the best overall performance, with training and test accuracy above 99%, precision over 94%, and an F1-score exceeding 96%. Despite these strong metrics, it had the highest FAR (1.2%), which could pose challenges depending on the application context.
Across all models, the performance on training and test sets remained closely aligned, indicating good generalization. Each model demonstrates strengths depending on the specific needs of the intrusion detection task: the Decision Tree had the strongest predictive metrics, while Logistic Regression minimized false alarms, making it potentially more suitable where false positives must be kept low.
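The False Alarm Rate quoted in Table 3 follows directly from the binary confusion matrix as FAR = FP / (FP + TN), the fraction of normal traffic wrongly flagged as an attack; a small sketch, again reusing the variables from the earlier baseline:

```python
# Derive the False Alarm Rate from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
far = fp / (fp + tn)          # false positives among all actual negatives
print(f"FAR: {far:.2%}")
```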
The dimensionality reduction methods applied in this work aim to reduce the number of features while preserving the most significant and relevant ones. Removing irrelevant and redundant features reduces model complexity and thus the likelihood of overfitting, i.e., learning noise instead of useful patterns, which would reduce the model's ability to generalize to new, unknown data. Dimensionality reduction also saves time and increases speed. The efficiency of such methods depends on the features used to represent data points: we want to represent each data point with as few highly relevant features as possible. Too many features waste computation on irrelevant ones, while too few may not carry enough information to predict the label of a data point adequately.
Figure 3 shows the workflow of this approach, using either Batch PCA or Incremental PCA to reduce the dimensionality of the dataset.
To optimize the performance of each PCA variant, Grid Search [32] was employed to identify optimal hyperparameters. This involved systematically testing various parameter combinations for both Batch PCA and Incremental PCA (IPCA). For each classifier, the most effective number of Principal Components (PCs) was determined independently, based on the configuration that yielded the highest classification accuracy.
Figure 4 illustrates how accuracy varies with the number of PCs for each approach. The optimal number of PCs for Batch PCA was 19 for SVM, 13 for Decision Tree, and 25 for Logistic Regression. In contrast, IPCA achieved optimal performance with 30 PCs for both SVM and Logistic Regression, and 15 for Decision Tree.
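A sketch of how such a component sweep can be run with scikit-learn is given below; the candidate range of components and the use of a Decision Tree as the scoring classifier are illustrative assumptions, not the exact search grid used here.

```python
# Sweep the number of principal components for Batch PCA and Incremental PCA.
from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.tree import DecisionTreeClassifier

def accuracy_with(reducer, X_tr, y_tr, X_te, y_te):
    Z_tr = reducer.fit_transform(X_tr)        # fit the projection on training data only
    Z_te = reducer.transform(X_te)
    return DecisionTreeClassifier(random_state=42).fit(Z_tr, y_tr).score(Z_te, y_te)

for n in range(5, 35, 5):                     # illustrative candidate PC counts
    acc_pca  = accuracy_with(PCA(n_components=n), X_train, y_train, X_test, y_test)
    acc_ipca = accuracy_with(IncrementalPCA(n_components=n, batch_size=1000),
                             X_train, y_train, X_test, y_test)
    print(f"{n:2d} PCs  Batch PCA: {acc_pca:.4f}  IPCA: {acc_ipca:.4f}")
```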
Figure 5 shows the obtained results in terms of accuracy, precision, recall, F1-score, and ROC AUC for both Batch PCA and Incremental PCA. Applying dimensionality reduction (Batch PCA or IPCA) generally led to a slight decrease in accuracy, never exceeding 0.5%, but consistently improved the F1-score across all classifiers. This trade-off is favorable, particularly given the imbalanced nature of the dataset, where the F1-score offers a more reliable evaluation metric than accuracy. A high F1-score indicates that the model effectively detects the minority class, not just the majority.
For instance, Logistic Regression with IPCA showed a small drop in accuracy (from 98.76% to 98.56%) but a noticeable increase in the F1-score (from 95.33% to 98.57%). Similarly, applying Batch PCA with SVM resulted in a drop in accuracy from 98.75% to 98.57%, while the F1-score increased from 95.28% to 98.57%. Decision Tree with IPCA saw only a 0.12% drop in accuracy, but an impressive 2.65% gain in F1-score, reaching 98.93%. These results show that dimensionality reduction can enhance model robustness in classifying minority instances.
To further improve generalization and reduce the risks of underfitting or overfitting, we applied K-fold cross-validation during hyperparameter tuning. This widely used technique divides the training set into K folds, iteratively training the model on K − 1 folds and validating it on the remaining fold, so that each fold serves as validation exactly once [33,34]. Compared to a single train/test split, this yields a more reliable estimate of a model's performance on unseen data. To avoid leakage, the data were pre-processed after the split and not before. K-fold cross-validation is also used for hyperparameter tuning: it ensures that each hyperparameter combination is evaluated across diverse data subsets, yielding a reliable average performance metric. Hyperparameter optimization was conducted via grid search, which exhaustively tested all parameter combinations within a defined space to identify the best-performing configuration [35].
Hyperparameter tuning was conducted using a grid search with 5-fold cross-validation to optimize each classifier's performance on the training set, with final evaluation reported on a held-out test set. We optimized Logistic Regression by tuning the regularization strength alpha over [1 × 10−6, 1 × 10−5, 1 × 10−4, 1 × 10−3, 1 × 10−2, 1 × 10−1, 1, 10] and the penalty type (l1 or l2). For the Support Vector Machine (SVM) algorithm, using a linear kernel implementation (scikit-learn's LinearSVC), we tuned the regularization parameter alpha over [1 × 10−5, 1 × 10−4, 1 × 10−3, 1 × 10−2, 1 × 10−1, 1, 10, 100] and the penalty type (l1 or l2), identifying the optimal parameters as alpha = 1 × 10−5 and penalty = l1. For Decision Trees, we tuned max_depth in [8, 10, 12, 14], min_samples_split in [2, 4, 6], and min_samples_leaf in [9, 11, 13], with optimal parameters max_depth = 10, min_samples_split = 4, and min_samples_leaf = 11.
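The Decision Tree search is the most unambiguous to reproduce (the alpha/penalty grids for the linear models suggest SGD-style estimators, whose exact setup is not specified here), so the sketch below runs only that grid with 5-fold cross-validation; the accuracy scoring is an assumption.

```python
# Grid search over the Decision Tree hyperparameter grid reported above.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": [8, 10, 12, 14],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [9, 11, 13],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)   # per the text: max_depth=10, min_samples_split=4, min_samples_leaf=11
```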
Figure 6 and Figure 7 show the training and testing performance of Logistic Regression with L1 and L2 penalties, respectively, for varying alpha values. Figure 8 shows the hyperparameter tuning heatmap for SVM, while Figure 9 shows the heatmap for the Decision Tree. Figure 10, on the other hand, shows the training and testing performance of the Decision Tree classifier for varying min_samples_leaf values.
The comparative results presented in Table 4, Table 5, and Table 6 clearly demonstrate the importance of dimensionality reduction in improving the computational efficiency and scalability of intrusion detection systems (IDSs). Without dimensionality reduction, the Decision Tree classifier achieved the highest overall performance, with an accuracy of 99.04% and an F1-score of 99.05%, highlighting its strong capacity to distinguish between normal and attack classes even in an imbalanced dataset.
When Batch PCA was applied, the SVM classifier produced the most computationally efficient configuration, achieving 98.44% accuracy and an F1-score of 98.47%, while reducing the training and prediction times to 3.91 s and 0.04 s, respectively.
Similarly, the combination of IPCA and Decision Tree maintained a high level of accuracy (98.61%) and F1-score (98.64%), albeit with a longer training time (96.54 s) and prediction time (0.09 s). Although less time-efficient, this configuration illustrates the advantage of IPCA for handling large-scale or streaming data, where incremental processing and memory efficiency are crucial.
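The streaming advantage of IPCA comes from its partial_fit interface, which updates the projection chunk by chunk instead of holding the whole matrix in memory; a minimal sketch, with the chunk count chosen arbitrarily for illustration:

```python
# Fit Incremental PCA chunk by chunk, as traffic batches arrive.
import numpy as np
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=15)
for chunk in np.array_split(X_train, 50):     # stand-in for arriving traffic batches
    ipca.partial_fit(chunk)                   # each chunk must have >= n_components rows

X_train_reduced = ipca.transform(X_train)     # project once components are learned
```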
Overall, both PCA variants preserved nearly identical classification performance compared to the models trained on the full feature space, with less than a 0.5% difference in accuracy and F1-score. However, they substantially reduced training and inference times, confirming that PCA effectively mitigates feature redundancy and noise in high-dimensional network traffic data. These improvements are particularly valuable for real-time IDS applications, where rapid detection and responsiveness are as important as classification accuracy. From a methodological perspective, the comparison between Batch PCA and IPCA provides complementary insights: Batch PCA is more suitable for offline analysis of static datasets, whereas IPCA offers better scalability and adaptability for dynamic, continuously evolving network environments. Such adaptability is crucial in real-time or resource-limited settings, where intrusion patterns evolve dynamically. Hence, dimensionality reduction emerges as a key strategy for achieving an optimal balance between detection accuracy, computational efficiency, and operational scalability in modern IDS frameworks.
The comparison between both methods underlines their complementary nature: Batch PCA offers stability and comprehensive feature compression, whereas Incremental PCA provides scalability and adaptability. Evaluating both approaches within the same framework allowed for a deeper understanding of their practical trade-offs and their potential to improve detection reliability across different deployment settings.
The UNSW-NB15 dataset is characterized by a significant class imbalance, with the ‘Normal’ traffic heavily dominating over various attack types. This inherent skew can critically hinder classifier performance by biasing models towards the majority class, often leading to deceptively high overall accuracy while compromising the effective detection of crucial minority attack instances. To address this, our evaluation relied not solely on accuracy, but also on the F1-score, a harmonic mean of precision and recall, which provides a more robust measure of a classifier’s ability to identify both minority and majority classes equitably.
To directly counteract this class imbalance within the dataset, we implemented the Synthetic Minority Oversampling Technique (SMOTE). SMOTE is an oversampling technique that operates at the data level, synthetically generating new samples for the underrepresented minority classes through feature interpolation, thereby producing a more balanced distribution for model training.
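A minimal sketch of this step with the imbalanced-learn package is shown below; SMOTE is applied to the training split only, so the test set keeps its real class ratio.

```python
# Oversample the minority class on the training data with SMOTE.
from collections import Counter
from imblearn.over_sampling import SMOTE

X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("before:", Counter(y_train), "after:", Counter(y_res))
```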
The results presented in Table 7, Table 8, and Table 9 demonstrate the effect of combining SMOTE with dimensionality reduction on the performance and computational behavior of the evaluated intrusion detection models. When SMOTE was applied independently, a slight decrease in accuracy and F1-score was observed for all classifiers compared to their performance on the original imbalanced dataset. Logistic Regression and SVM both achieved 96.69% accuracy and a 96.49% F1-score, while the Decision Tree maintained relatively strong results with 98.83% accuracy and a 98.86% F1-score. This minor reduction can be attributed to the introduction of synthetic samples, which may add mild noise to the decision boundaries; however, SMOTE improves class representation and minority class recall, enhancing fairness across attack categories.
When integrating Batch PCA with SMOTE, both predictive performance and computational efficiency improved notably. The SVM achieved 98.52% accuracy and 98.55% F1-score with a short training time of 9.39 s and a prediction time of 0.04 s, representing an effective balance between detection quality and processing speed. Similarly, the Decision Tree obtained 98.73% accuracy and an F1-score of 98.75%, matching the performance of the best baseline models but with significantly reduced inference time. These results indicate that PCA alleviates redundancy in the oversampled feature space, leading to more stable and efficient model convergence.
A comparable pattern was observed when combining IPCA with SMOTE. While maintaining similar accuracy levels (98.3–98.7%) and F1-scores, IPCA required slightly longer training times due to its incremental nature. Nevertheless, its ability to process data in small batches makes it advantageous for large-scale or streaming IDS scenarios where memory efficiency and adaptability are crucial.
Overall, integrating SMOTE with PCA-based dimensionality reduction provides a balanced and scalable solution for intrusion detection. Although SMOTE alone can introduce minor performance fluctuations, its combination with PCA or IPCA preserves high classification accuracy while substantially improving computational efficiency.