1. Introduction
The evolution of the Internet in recent years has reshaped research on network exposure and the monitoring of advanced security threats. For example, one estimate predicts that the cost of cybercrime will reach USD 10.5 trillion in 2025 [
1]. Moreover, a network intrusion detection system (NIDS) that can adapt to changes in network traffic over time is crucial for two reasons. First, the evolving threat landscape requires a dynamic approach that adapts to new network security breaches. Second, zero-day attacks highlight the need for an adaptive NIDS that relies on dynamic traffic analysis to identify suspicious patterns, such as abnormal traffic spikes or unexpected protocol usage, rather than on signatures alone. Communication systems are being used ever more intensively, producing volumes of network traffic far too large for manual inspection of attack patterns. Broadly speaking, network traffic can be divided into two categories: normal and malicious. In real-world networks, attacks usually constitute only a small fraction of the traffic, so they can be treated as anomalies.
In a real-world network, attacks fluctuate over time: they usually represent a minority of the traffic flow, and their proportion changes continuously. Machine learning algorithms that rely on two-class classification to predict attacks depend on predefined labels for the normal and attack categories, which makes it difficult for them to adapt to evolving threats and to handle imbalanced data, reducing their effectiveness for real-time detection. Additionally, machine learning-based detection systems are significantly affected by highly imbalanced classes [
2], making them unreliable for detecting dynamically changing attack ratios in real-world scenarios. Given this variability in attack ratios, an appealing alternative is a one-class classification (OCC) model trained only on normal instances. Several OCC techniques can be used to tackle this problem; this article adopts two of the most widely used OCC models in the literature, Autoencoders (AEs) and Isolation Forests (IFs), both of which learn from normal instances only. However, a single AE or IF might not capture all the variations of normal traffic present in the data distribution. For that reason, we use an ensemble of multiple AEs or IFs to address this limitation, leveraging diversity among the models: each model within the ensemble captures different facets of the data distribution, resulting in a more robust and reliable detection system [
3,
4]. To evaluate our anomaly detection ensemble (ADE), built with AEs or IFs, we compare it against a baseline method in which each anomaly estimator produces its prediction individually and the individual predictions are then averaged. As mentioned above, attack distributions are not static but fluctuate over time, so we use ADEs to adapt to changes in the attack flow. To test performance in this context, we create validation datasets with different anomaly balances from the NSL-KDD and CIC_IoT-2023 datasets and analyze how the amount of information provided to the ADE influences its predictions. When comparing AEs and IFs within the ensemble framework, the results indicate that both ensemble methods consistently outperformed the baseline across all simulations, particularly in detecting severe anomalies. For the NSL-KDD dataset, the AE and IF ensembles achieved comparable performance in cyberattack detection; however, the IF-based ensemble reached peak performance more quickly, requiring less training data. This efficiency allowed it to better leverage the diversity of the training datasets used in the ensemble. Finally, we observed that a smaller degree of overlap between the training data of the individual anomaly estimators increases diversity, which often results in enhanced overall performance.
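To make the ensemble-versus-baseline distinction concrete, the following minimal sketch (Python with scikit-learn) trains several Isolation Forests on nearly disjoint subsets of normal traffic only and scores a test set two ways: by averaging the continuous anomaly scores (ensemble-style) and by averaging the hard per-estimator labels (baseline-style). The synthetic data, the number of models, and the samples-per-model value are illustrative assumptions, not the exact configuration used in our experiments.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Stand-in data: "normal" traffic plus a small fraction of attack traffic.
normal = rng.normal(0.0, 1.0, size=(5000, 20))
attacks = rng.normal(4.0, 1.0, size=(50, 20))
X_test = np.vstack([normal[-1000:], attacks])
y_test = np.r_[np.zeros(1000), np.ones(len(attacks))]

# Split the normal training pool into (nearly) disjoint subsets, one per model,
# so that each estimator sees a different facet of the normal distribution.
n_models, samples_per_model = 10, 400
train_pool = normal[:4000]
subsets = [train_pool[i * samples_per_model:(i + 1) * samples_per_model]
           for i in range(n_models)]

models = [IsolationForest(n_estimators=100, random_state=i).fit(s)
          for i, s in enumerate(subsets)]

# Ensemble-style score: average the continuous anomaly scores of the members
# (lower score_samples means more anomalous in scikit-learn's convention).
ensemble_score = -np.mean([m.score_samples(X_test) for m in models], axis=0)

# Baseline-style score: average the hard per-estimator labels (+1 normal, -1 anomaly).
baseline_vote = -np.mean([m.predict(X_test) for m in models], axis=0)

print("ensemble ROC-AUC:", roc_auc_score(y_test, ensemble_score))
print("baseline ROC-AUC:", roc_auc_score(y_test, baseline_vote))
```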
The main contributions of this paper are as follows. First, we present a dynamic anomaly detection ensemble based on two distinct model types: AEs and IFs. We call it dynamic because the ensemble responds more effectively to evaluation under different attack ratios than a simple average of models (the baseline approach). This design keeps the system robust under varying attack rates and mitigates the uncertainty that arises when models trained at a fixed attack rate are evaluated at rates different from the training rate [
2]. Using these two ensemble models (AE and IF), we conduct an in-depth study and analysis of the real-world challenge posed by dynamically changing attack rates in network traffic, and we show that this issue can be effectively addressed without the need for system retraining. To clarify, the novelty of our approach does not lie in the use of AEs or IFs per se, but in the strategic ensemble design, the exclusive training on normal traffic, and the evaluation under extreme imbalance conditions, which are rarely addressed in the existing literature; these points constitute the core contributions and methodological benefits of our work. Second, we investigate the impact of overlapping training data among the ensemble components and how this affects the detection of extremely low attack rates, as low as 0.5%, relative to benign traffic. This analysis is conducted for both ensemble types: AEs and IFs. Finally, we examine the minimum number of records required to train each ensemble component, aiming to achieve optimal performance with minimal data and training time. We begin our study with the renowned NSL-KDD dataset. To further generalize our proposal, we then extend the evaluation of the methodology, and the corresponding findings, to the more recent CIC_IoT-2023 dataset.
The rest of the article is organized as follows.
Section 2 reviews research work related to using AEs and IFs for anomaly detection in cybersecurity. In
Section 3 we analyze the datasets and describe the experimental methodology, including the data preprocessing, the model description, and the evaluation metrics.
Section 4 presents the results. Finally, in
Section 5 we discuss the results, draw conclusions, and outline future work. Additional details on the ensemble algorithms and performance evaluations used in our study are provided in
Appendix A.
2. Related Work
Over recent years, cybersecurity threats have increased significantly, leading many researchers to investigate the detection of attacks in network traffic using different machine learning techniques, such as neural networks [
5,
6,
7,
8], ensemble models [
9], and soft voting techniques [
9]. The effect of changing the attack ratio on the performance of machine learning algorithms has also been studied [
2].
Many studies have investigated the use of AEs for detecting attacks, such as the study carried out by Singha et al. [
5]. Their model combines a unified AE that extracts spatial features with a Multiscale Convolutional Neural Network AE (MSCNN-AE) that captures temporal patterns using Long Short-Term Memory (LSTM); a two-stage IF is then used for anomaly detection. They evaluated the model on several datasets (NSL-KDD, UNSW-NB15, and CIC-DDoS2019) and achieved good results. However, the complexity of combining two one-class classification models to detect attacks is high, making this proposal less suitable for real-time attack detection.
A technique based on eliminating features and reducing dimensionality to detect attacks with neural networks was proposed by Chen et al. [
7]. Their method uses two types of AEs: a standard fully connected AE that captures the non-linear correlations between features, and a Convolutional AE (CAE) that reduces the dimensionality. The NSL-KDD dataset was used to evaluate the method, and the CAE outperformed other detection methods. However, the complexity of this approach is also high. Our research uses a simpler method that keeps all features and builds the ensemble from randomly configured AE architectures, which makes it less complex and, we believe, more appropriate for a real-time attack detection system.
Using the technique of eliminating features, Tang et al. [
8] proposed a model called SAAE-DNN, in which a stacked AE (SAE) extracts the features and initializes the hidden layers of the DNN. The model was evaluated using binary and multi-class classification on the NSL-KDD dataset. SAAE-DNN achieves better results than other machine learning algorithms, such as random forests and decision trees. Although their technique may work well, they did not analyze the dynamic change of attack ratios over time.
Another approach is to eliminate outlier samples that could bias the AE training. This was done by Xu et al. [
6], based on an extensive investigation of an AE model with five hidden layers. The proposed model was evaluated on the NSL-KDD dataset and achieved good results in accuracy and F1 score. Eliminating outlier samples may improve the results, but removing patterns is not ideal for datasets such as NSL-KDD, where training patterns are limited. Another relevant approach is soft voting over an ensemble. An example of a soft voting technique was proposed by Khan et al. [
9], whose model is called OE-IDS. They used resampling methods such as SMOTE, ROS, and ADASYN to handle unbalanced classes. The method applies soft voting over an optimal set of four algorithms (gradient boosting, random forest, extra trees, and MLP) and was tested on the UNSW-NB15 and CICIDS-2017 datasets. While they used a soft voting technique similar to ours, they did not directly analyze how this methodology behaves under different ratios of attack versus normal traffic patterns.
Another one-class classification technique commonly used in research is the IF. Elsaid et al. [
10] introduce an optimized IF-based Intrusion Detection System (OIFIDS) designed to handle heterogeneous and streaming data in Industrial Internet of Things (IIoT) networks. The system optimizes the Isolation Forest algorithm using an Enhanced Harris Hawks Optimization (ERHHO) technique to reduce dataset dimensionality and improve detection performance. Evaluated on three datasets (CICIDS-2018, NSL-KDD, and UNSW-NB15), OIFIDS demonstrates superior performance compared to state-of-the-art baseline techniques, achieving higher accuracy. The proposed system effectively addresses the concept drift problem in streaming data, achieving high AUC values. AbuAlghanam et al. [
11] proposed a fusion-based anomaly detection system using a modified isolation forest (M-IF) for Internet of Things (IoT) network intrusion detection. The proposed system comprises two parallel subsystems, one trained on normal data and the other on attack data, utilizing a modified version of the IF classifier to enhance classification performance and reduce runtime. The system was evaluated using three benchmark datasets (UNSW-NB15, NSL-KDD, and KDDCUP99). Results showed that the proposed approach outperformed other NIDS techniques while reducing the runtime of the training model by 28.80%. The M-IF demonstrated superior performance compared to traditional IF, One-Class SVM, and Local Outlier Factors across all datasets.
Nalini et al. [
12] proposed a hybrid approach called Hybrid Density-Based IF with Coati Optimization (HDBIF-CO) for effective anomaly detection in cybersecurity systems. Their method combines density-based clustering (DBSCAN) with an IF algorithm optimized using the Coati Optimization technique. The approach was tested on three datasets: NSL-KDD, CICIDS2017, and UNSW-NB15. The HDBIF-CO method comprises several key stages: data collection, preprocessing (including normalization and outlier elimination), feature selection, cluster discovery using DBSCAN, anomaly detection using the HDBIF-CO algorithm, and a final decision-making phase.
A combination of both algorithms, AE and IF, was proposed by Carrera et al. [
13], who evaluated three novel unsupervised approaches for near real-time network traffic anomaly detection: Deep Autoencoding with GMM and IF (DAGMM-EIF), Deep AE with Isolation Forest (DA-EIF), and Memory Augmented Deep AE with IF (MemAE-EIF). These approaches combine deep learning techniques with the Extended IF algorithm to enhance anomaly detection accuracy while maintaining fast prediction speeds. The proposed methods achieved comparable or superior performance to state-of-the-art unsupervised anomaly detection algorithms on the KDD99, NSL-KDD, and CIC-IDS2017 datasets, with MemAE-EIF obtaining the highest precision and F1-score across all datasets. The addition of IF improved accuracy with only a minimal increase in inference time. SHAP analysis demonstrated that the new features introduced by the combined approaches were influential in improving anomaly detection. While the studies [
10,
11,
12,
13] propose models that achieve high performance, they do not explicitly analyze the effect of varying attack ratios, particularly low ratios, on model performance. There have also been numerous investigations into the use of AE and IF ensembles for anomaly detection. Studies [
14,
15] conducted a comparison between the ensembles of AE and IF, finding that the ensembles of AE performed better. In [
4], a model was proposed utilizing a k-partitioned IF ensemble for detecting stock market manipulation. To detect anomalies in the web, the authors in [
16] proposed an IF with reduced execution time, enabling administrators to respond quickly to attacks. The investigation [
17] analyzes the effect of the IF contamination parameter on a highly unbalanced dataset (CERT r4.2) and finds that it plays a crucial role in the performance of the IF. However, none of the previous studies explicitly analyzed the effect of training data overlap or of different anomaly ratios on their proposed approaches.
An exception is the work in [
2], which analyzed the effect of varying attack ratios on two traditional machine learning techniques: random forests and support vector machines. The study used the UNSW-NB15 and CICIDS-2017 datasets and found that attack detection with both algorithms is affected by ratio fluctuations in different ways, with random forests being more robust even when the attack ratio is severely imbalanced. That work underscores the need to explore attack detection methodologies that adapt to the evolving nature of attacks over time, the central topic we examine in the present communication.
4. Results
The first analysis we conduct using the NSL-KDD dataset is to examine the optimal number of components in an ensemble system designed to effectively detect different rates of malicious traffic, given a set of normal traffic records for training. Our goal is to keep the training of each model as simple as possible in terms of the number of patterns required.
Figure 3 shows the
results across different attack rates for varying ensemble sizes.
Table 3 details how the 51,546 normal traffic records were distributed among the autoencoders (AEs). Each AE receives slightly more than the total number of normal instances divided by the number of models in the ensemble. Based on these results, we selected an ensemble of 150 AEs, as it provides robust performance even under very low attack ratios. This configuration also ensures that each AE is trained on approximately 400 normal samples, maintaining a manageable computational load per model. This setup further allows us to investigate how the number of training samples per AE affects overall system performance. When each AE is trained on 400 or fewer samples, the overlap between training subsets remains minimal, which promotes diversity among models. However, as the number of samples per AE increases beyond 400, the overlap grows significantly, potentially reducing diversity and impacting detection effectiveness. Therefore, by fixing the number of models and varying the number of training samples per component, we indirectly assess how training data overlap influences the ensemble’s ability to detect anomalies.
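As a rough illustration of the overlap argument above, the short sketch below (hypothetical sampling code, not the exact allocation procedure used in our experiments) estimates the expected pairwise overlap between two random training subsets drawn from the 51,546 normal records as the number of samples per model grows, showing why the overlap stays small up to roughly 400 samples per model and grows quickly afterwards.

```python
import numpy as np

N = 51_546                      # normal training records (as in Table 3)
n_models = 150                  # ensemble size, for context; each model draws k records
rng = np.random.default_rng(42)

def expected_overlap_fraction(k, n=N):
    # Expected fraction of records shared by two independent random subsets of
    # size k drawn without replacement from n records: k / n.
    return k / n

for k in (100, 400, 1000, 5000, 15000):
    a = set(rng.choice(N, size=k, replace=False))
    b = set(rng.choice(N, size=k, replace=False))
    measured = len(a & b) / k
    print(f"k={k:>6}: expected overlap {expected_overlap_fraction(k):.3f}, "
          f"measured (one pair) {measured:.3f}")
```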
As we commented above, our primary goal is to detect attacks in conditions that closely resemble real-time network traffic. This analysis examines the impact of the number of training patterns on the ensemble of AEs and IFs across different anomaly ratios. Panels A and B of
Figure 4 and
Figure 5 display the results of the ensemble and baseline approaches, respectively, explained in
Section 3 for AE (
Figure 4) and IF (
Figure 5), with the NSL-KDD dataset. These figures illustrate
results for various dataset sizes used to train 150 AEs and 150 IFs, with each point representing the average of 20 tests with different percentages of attacks. The training data for the AEs ranges from 100 to 15,000 benign traffic patterns, and the training data for the IFs ranges from 10 to 15,000 benign traffic patterns. One can observe that when the percentage of attacks is very low (0.5% and 1%), both the baseline and ensemble methods struggle to detect these rare attacks, especially with smaller training dataset sizes. However, our ensemble model approach proves more reliable in these situations, showing steady improvement as the training size increases up to 1000 data points. Conversely, when attack ratios are higher, such as 25% or 50%, both models perform consistently well, even with smaller training datasets, as these more frequent attacks are easier to detect. It should be noted that a performance saturation is observed between 300 and 500 training patterns for the AE approach (see zoom in
Figure 4), and even earlier for the IFs (
Figure 5), which is precisely when the variability of the data seen by each AE decreases, as we already discussed in
Section 3.3. The performance of both models declines when the training set exceeds 1000 records, particularly for the AE ensemble and in cases with a low percentage of attacks. This may be due to a limited number of truly distinct training samples and the presence of overlapping data, which becomes more problematic in imbalanced scenarios where attack instances are already scarce. These findings underscore not only the robustness of the ensemble approach but also how sensitive model performance is to data overlap when detecting imbalanced attack distributions.
It is important to note that the ensemble approaches consistently outperform the baseline approaches, although the improvement is smaller when dealing with larger training sizes and ratios of attacks. These findings indicate that the ensembles provide a stronger and more flexible solution for identifying anomalies in networks where attack patterns constantly change, especially when the attack ratios are severely imbalanced. Finally,
ROC-AUC results were also obtained and are provided in
Appendix A.2. Although these results confirm our previous observations, the differences between balanced and imbalanced datasets are less pronounced. This suggests that the ROC analysis has difficulty assessing performance in highly imbalanced datasets.
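For clarity on the evaluation protocol, the following sketch shows one simple way to assemble test sets with a prescribed attack ratio from separate benign and attack pools; the arrays and the helper function are hypothetical stand-ins, since the exact preprocessing of NSL-KDD and CIC_IoT-2023 is described in Section 3.

```python
import numpy as np

def make_test_set(benign, attacks, attack_ratio, n_total=10_000, seed=0):
    """Return features X and labels y (1 = attack) with the given attack ratio."""
    rng = np.random.default_rng(seed)
    n_attack = int(round(n_total * attack_ratio))
    n_benign = n_total - n_attack
    Xb = benign[rng.choice(len(benign), n_benign, replace=False)]
    Xa = attacks[rng.choice(len(attacks), n_attack, replace=False)]
    X = np.vstack([Xb, Xa])
    y = np.r_[np.zeros(n_benign), np.ones(n_attack)]
    perm = rng.permutation(n_total)
    return X[perm], y[perm]

# Example: one test file per ratio (in our protocol, 20 such files are drawn per ratio).
benign_pool = np.random.default_rng(1).normal(size=(60_000, 20))
attack_pool = np.random.default_rng(2).normal(loc=3.0, size=(20_000, 20))
for ratio in (0.005, 0.01, 0.125, 0.25, 0.5):
    X, y = make_test_set(benign_pool, attack_pool, ratio, seed=7)
    print(f"ratio {ratio:>5}: {int(y.sum())} attacks out of {len(y)} samples")
```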
To provide a deeper comparison between ADEs (of AEs and IFs) and the baseline methods, panels (A–C) in
Figure 6 illustrate their performance in detecting malicious traffic across three representative sizes of training datasets (200, 400 and 1000) and several attack ratios (0.5%, 1%, 12.5%, 25%, and 50%). We selected these three representative training sizes based on the results in
Figure 4 and
Figure 5.
Panel A presents the best
results for the IF ensemble when the training data size is very small (see
Figure 5A). The panel indicates that the IF ensemble outperformed the AE ensemble under an unbalanced representation of attacks in network traffic. IF-based ensembles are known to achieve optimal performance with fewer training records [
25], as observed in our experiments. Conversely, when the attack representation was balanced at 50%, the
results for both AE and IF ensembles showed minimal differences. In contrast, AE-based ensembles required more training data to achieve comparable performance. Panel B therefore presents the results obtained when the normal traffic records are distributed among the individual components of the ensembles, implying that each model is trained on a dataset with little overlap. At this stage, the ensembles reach a point where the performance stabilizes (see
Figure 4A). This panel emphasizes that when each model within the AE ensembles is trained with an adequate amount of data, both ensembles achieve nearly identical results. Panel C highlights the best results achieved by the ensemble method using AEs (see
Figure 4A). The ensemble model generally outperforms or matches the baseline, particularly at reasonably balanced attack ratios, where increasing the size of the training dataset leads to improved performance.
Finally, panels (A–F) in
Figure 7 compare the
performance values for different attack percentages using three methods: (i) the red point represents the ensemble results (the standard deviation along each axis is calculated using 20 different test files); (ii) the green point represents the baseline approach (the results of the 150 individual estimators are averaged, left panels for AEs and right panels for IFs, and the standard deviation along each axis is again calculated with 20 different test files); and (iii) the blue points represent the 150 individual estimators (left panels for AEs and right panels for IFs), where each point is the average along each axis over the 20 test files. Each panel displays
values for one anomaly size on the
y-axis and another on the
x-axis to compare performance. In panels A, B, and C, which compare anomaly sizes of 0.5%, 1%, and 12.5% against each other for AEs trained with 1000 patterns, the blue points are widely spread, showing varying performance across individual AEs. The ensemble points are consistently placed in the upper-right area of the panels, indicating stable and high performance. The baseline points (green) fall below the ensemble, suggesting that it is less effective. Panels D, E, and F show the same behavior for IFs trained with 100 patterns. What is most remarkable about this figure, for both AEs and IFs, is that some individual estimators exceed the average performance of the ADEs. This observation suggests that a proper selection of ensemble components could potentially improve the results further.
We extended the analysis of the impact of the number of training patterns on the ensemble of AEs and IFs to the CIC_IoT-2023 dataset using a similar procedure to the NSL-KDD dataset (see
Section 3.1.2). As shown in
Figure 8A, the ensemble of AEs achieves exceptional detection of severe anomalies significantly earlier than the baseline, with approximately 2000 records. Moreover, it yields values close to one for a balanced ratio of attacks, employing only 300 training records. The ROC-AUC results, included in
Appendix A.2, provide similar insights; however, the differences between balanced and imbalanced datasets are less pronounced, a pattern also observed with the NSL-KDD dataset. To provide a more comprehensive comparison of ADEs on the CIC_IoT-2023 dataset,
Figure 9 explicitly focuses on the ensemble and baseline approaches for AEs and IFs, for two representative training dataset sizes: 400 and 2000 samples. The training size of 400 represents the approximate average number of samples allocated to each component of the ensemble; this allocation ensured minimal overlap between the datasets used to train each model while still leveraging the entire normal traffic dataset. However, when the training size reaches 2000 patterns, also with relatively low overlap, the ensemble performance stabilizes and stops improving. These panels detail the performance of the AE-based and IF-based detection methods in identifying malicious traffic for a range of attack ratios: 0.5%, 1%, 12.5%, 25%, and 50%. Panel A illustrates the results obtained when the normal traffic training data was distributed among the AE and IF models with a small degree of overlap. The ensemble of AEs shows an improvement of approximately 10% in detecting severe anomalies, such as 0.5% and 1%.
The complete stabilization of performance can be observed in panel B, which highlights the best results reached by the ensemble of AEs, where the system detects the most severe anomaly ratio (0.5%) with a score of 98%. At the same time, it detects the more balanced attack ratios, such as 12.5%, 25%, and 50%, with scores of
approximately 1. The poor performance of IF on the CIC_IoT-2023 dataset is surprising; however, similar results have already been reported in [
26,
27,
28], where the authors demonstrate that IF performs significantly worse compared to the AE approach.
We used ROC-AUC as an additional metric to evaluate the performance of our proposed model.
Figure A1 and
Figure A2 (in
Appendix A.2) show the results of the ROC-AUC for the NSL-KDD and CIC_IOT2023 datasets, comparing the ensemble of AEs and the baseline, for training record counts ranging from 100 to 15,000. As we can see, the ensemble of AEs reaches the optimal results with a smaller number of records. At the same time,
Figure A3 (in
Appendix A.2) shows the ROC-AUC results for the NSL-KDD dataset for the ensemble of IFs and the baseline.
Figure A4 and
Figure A5 (in
Appendix A.3) illustrate the PR-AUC results for an extended experiment that considers only two distinct attack categories of the CIC_IoT-2023 dataset: volumetric vs. semantic. DDoS-TCP Flood is a volumetric attack that aims to exhaust system resources (bandwidth, CPU, memory) by generating massive traffic volumes, where thousands or millions of packets are sent to overwhelm a server. In contrast, DNS Spoofing is a semantic attack: it manipulates the logical behavior or content of the system without producing large traffic volumes.
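The contrast between the two metrics can be reproduced with a few lines of code. The sketch below (synthetic scores, purely illustrative) computes ROC-AUC and PR-AUC for two hypothetical detectors on a test set with a 0.5% attack ratio; PR-AUC typically separates the detectors more sharply under such imbalance, consistent with the observations above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_benign, n_attack = 9_950, 50                     # ~0.5% attack ratio

y = np.r_[np.zeros(n_benign), np.ones(n_attack)]
# Two hypothetical detectors: detector B is noisier than detector A.
scores_a = np.r_[rng.normal(0, 1.0, n_benign), rng.normal(2.5, 1.0, n_attack)]
scores_b = np.r_[rng.normal(0, 1.3, n_benign), rng.normal(2.0, 1.3, n_attack)]

for name, s in (("detector A", scores_a), ("detector B", scores_b)):
    print(f"{name}: ROC-AUC={roc_auc_score(y, s):.3f}  "
          f"PR-AUC={average_precision_score(y, s):.3f}")
```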
To place the model’s operational viability within a practical context, it is important to consider its computational performance. Although the ensemble includes 150 models, training is carried out offline and can be parallelized, significantly reducing computational costs. Thanks to the use of lightweight models (autoencoders and isolation forests), the entire process was executed in a very reasonable time for every training data size considered, as shown below. The experiments were conducted exclusively on CPUs, using an Intel Xeon W-2245 processor running at 3.90 GHz (8 cores) with 64 GB of DDR4 RAM. To provide context for different training sizes, training the autoencoders with 200, 400, and 1000 input patterns took approximately 100, 120, and 180 s, respectively. During the evaluation phase of our simulations, we used 10,000 test samples for each of the 20 test files employed to assess model performance (see
Section 3, Materials and Methods). The ensemble of AEs completed inference on 10,000 test samples in approximately 60 s; importantly, the inference time per individual sample is very low (0.006 s), making the model suitable for real-time or near-real-time applications. Training the isolation forests with 200, 400, and 1000 patterns required approximately 17 to 20 s, and the inference runtime of the IF ensemble is approximately 10 s, tested with a single test file of 10,000 patterns, so again the inference time per individual sample is very low. For scenarios with hardware resource constraints (such as IoT), future work will explore more compact configurations and optimization strategies that maintain the effectiveness of the approach without compromising its applicability.
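As a rough guide to how such timings can be reproduced, the sketch below measures batch and per-sample inference time for an Isolation Forest ensemble on a 10,000-sample batch; the data, model configuration, and resulting times are illustrative and hardware-dependent, not the exact values reported above.

```python
import time
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
train = rng.normal(size=(400, 20))       # one 400-pattern training subset (reused for brevity)
X_test = rng.normal(size=(10_000, 20))   # one test file of 10,000 samples

# 150 lightweight IF models, mirroring the ensemble size used in our experiments.
models = [IsolationForest(n_estimators=100, random_state=i).fit(train)
          for i in range(150)]

start = time.perf_counter()
scores = -np.mean([m.score_samples(X_test) for m in models], axis=0)
elapsed = time.perf_counter() - start
print(f"batch inference: {elapsed:.2f} s, per sample: {elapsed / len(X_test):.6f} s")
```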
5. Discussion and Conclusions
The attack representation in network traffic fluctuates: some attacks occur frequently, while others are relatively infrequent. However, it should be noted that attack traffic is smaller than benign traffic in most real network traffic, and the imbalance between the two changes dynamically over time in real information system scenarios. Traditional two-class machine learning algorithms may fail as real-time detectors because they depend on predefined labels for the normal and attack categories and do not adapt easily to evolving threats. In this context, we proposed an ADE. The ADEs are composed of individual estimators that are OCC models: AEs and IFs. Specifically, in this work we evaluated and analyzed in detail how both approaches behave in scenarios with varying attack rates without the need for retraining, bringing us closer to real-world operating frameworks where this adaptability is essential. Since the individual estimators in the ensembles are OCC models, the system generalizes effectively across variations in the data distribution. This enhances the realism of the cyberattack detection system, as it does not rely as heavily on a predefined proportion of attack instances during training, mirroring real-world conditions where attack frequencies are unknown and unpredictable. For this purpose, we created different attack ratios using the NSL-KDD and CIC_IoT-2023 datasets to mimic these real-world scenarios. We found that the proposed ADEs adapt to changing attack rates without the need for retraining, performing better than the baseline method under severe imbalances of benign versus malicious traffic. As mentioned above, to our knowledge, no previous study has explicitly analyzed the effect of varying attack rates on an NIDS; for this reason, we compare the ensemble methodology with the simple average of the models that participate in the ensemble prediction (the baseline approach). Additionally, we investigated how the quantity of information supplied to the individual estimators, and the overlap between their training subsets, affects predictions in the inference phase. We found that when the training datasets overlap less, the ADE of AEs works better, as shown in
Figure 4 and
Figure 8.
In summary, it is important to highlight that our results suggest that diversity in the training data across ensemble members may contribute to enhanced performance in attack detection. Although several works have already established this relationship between diversity and performance in the context of ensemble models [
29], further investigation is needed to fully understand the extent and nature of this relationship in the present context. The findings point toward the potential benefits of incorporating data diversity when designing ensemble-based detection systems. When comparing the two ADE models evaluated in this study, the AE ensemble achieved reasonable performance in detecting cyberattacks, even under a severe imbalance between anomalous and normal traffic, provided it was trained with sufficient data. However, the IF-based ensemble reached its optimal performance on the NSL-KDD dataset more rapidly, requiring less training data, as illustrated in
Figure 5. As previously discussed, reducing the overlap between the training data of individual anomaly estimators tends to increase diversity, which may contribute to enhanced overall performance, particularly in the case of the IF-based ensemble. While this effect was more evident in specific configurations, further investigation is needed to assess its consistency and broader applicability. In contrast, for the CIC_IoT-2023 dataset,
Figure 9 shows that the IF ensemble failed to generalize effectively, an outcome consistent with prior observations in several works [
26,
27]. Nevertheless, further exploration is needed to enhance the performance of the two ADEs in detecting attacks, including feature dimensionality reduction and tuning the AE parameters for different attack ratios.
Figure 7 shows that some individual estimators (both AEs and IFs) outperformed the ensemble, suggesting that the ensemble itself could perform even better if its individual estimators were optimized. Finally, it is essential to highlight that when the normal traffic records are distributed among the AE or IF components of the ensemble during training, with little overlap between the data assigned to each model, the approach based on an ensemble of autoencoders outperforms the ensemble of IFs, especially on the CIC_IoT-2023 dataset. This property is especially useful in IoT environments, where multiple sensors or devices generate heterogeneous data with little overlap. In this context, each sensor could train a simple model, for example an autoencoder, on the normal traffic it records locally. Subsequently, when evaluating new traffic, any sensor could query the models of the rest of the network and conduct a vote based on their responses, enabling more robust collaborative and distributed detection. In conclusion, this study presents an explicit analysis of how training data overlap influences the detection of varying attack ratios, employing ensemble methods based on two algorithms: AEs and IFs.
Finally, we would like to emphasize that despite the strengths of our proposed work, there are still limitations that open avenues for future research. Regarding scalability, although the base models of the ensemble (autoencoders and isolation forests) were trained independently on separate data subsets, enabling efficient parallelization and controlled computational distribution, particularly relevant for IoT contexts, challenges may arise in extremely high-dimensional scenarios. Issues such as data dispersion or attribute redundancy could affect performance. Future research will address these challenges by incorporating dimensionality reduction techniques and evaluating performance on large-scale datasets to validate the robustness of the proposed approach in real-world detection contexts. Moreover, while our study uses artificially rebalanced datasets to simulate fluctuating attack ratios, we acknowledge that this does not fully capture the complexity of real-world traffic evolution. As future work, we plan to extend our evaluation to more realistic scenarios, incorporating temporally evolving data and naturally occurring attack patterns to better assess model adaptability under operational conditions. Additionally, although our focus was on unsupervised ensembles trained solely on normal traffic, we recognize that the experimental comparison could be strengthened by including more sophisticated baselines, such as deep learning-based multi-class classifiers or hybrid one-class models. Future work will explore these approaches to broaden the scope and generalizability of our findings.