A Continual Learning Process to Detect Both Previously Learned and Newly Emerging Attack

Park, Hansol; Kim, Taesu; Lee, Hanhee; Shin, Dongil; Shin, Dongkyoo; Park, Moosung

doi:10.3390/app151810034

by Hansol Park ^1,2, Taesu Kim ^1,2, Hanhee Lee ^1,3, Dongil Shin ^1, Dongkyoo Shin ^1,2,3,* and Moosung Park ^4,*

^1 Department of Computer Engineering, Sejong University, Seoul 05006, Republic of Korea
^2 Department of Convergence Engineering for Intelligent Drones, Sejong University, Seoul 05006, Republic of Korea
^3 Cyber Warfare Research Institute, Sejong University, Seoul 05006, Republic of Korea
^4 Agency for Defense Development, Seoul 05771, Republic of Korea
^* Authors to whom correspondence should be addressed.

Appl. Sci. 2025, 15(18), 10034; https://doi.org/10.3390/app151810034

Submission received: 10 August 2025 / Revised: 6 September 2025 / Accepted: 12 September 2025 / Published: 14 September 2025

(This article belongs to the Special Issue New Advances in Cybersecurity Technology and Cybersecurity Management)

Abstract

With the recent intensification of geopolitical tensions, cyber-attacks have become increasingly sophisticated and dynamic. Traditional machine learning-based anomaly detection techniques, which rely on pre-trained data, often suffer from performance degradation when exposed to novel attack types not seen during training. To address this limitation, this study proposes a continual learning-based anomaly detection framework capable of incrementally incorporating new attack patterns without forgetting previously learned information. The proposed method consists of three key stages: first, preprocessing and data augmentation are applied to construct high-quality, balanced datasets; second, a base anomaly detection model is trained; and third, new attack data are incrementally integrated to continuously update and evaluate the model. To enhance adaptability and efficiency, the framework incorporates Memory-LGBM, a lightweight architecture that combines a prototype-based memory module with a gradient-free LGBM classifier. The model maintains class-wise latent representations instead of raw samples, enabling compact, memory-efficient learning. Experimental results on the CICIDS 2017 dataset demonstrate that the proposed approach outperforms existing continual learning methods in accuracy, adaptability, and resistance to forgetting, making it a practical and scalable solution for real-world anomaly detection scenarios that demand rapid adaptation, strong knowledge retention, and low computational overhead.

Keywords: continual learning; anomaly detection; catastrophic forgetting; machine learning

1. Introduction

The advancement and widespread adoption of artificial intelligence (AI) technologies have significantly improved operational efficiency across various industries, including autonomous driving and administrative automation. However, these technological developments inherently entail the risk of malicious exploitation [1]. In particular, generative AI tools have enabled the creation of sophisticated and diverse cyber-attacks, thereby amplifying threats in cyber environments beyond any specific industry [2,3]. In automation-centric infrastructures—such as Infrastructure as Code (IaC), Continuous Integration/Continuous Deployment (CI/CD) pipelines, and serverless computing—malicious code or components generated by such models can be rapidly propagated throughout entire systems, leading to large-scale security incidents. Traditional signature-based anomaly detection models, which rely on pre-collected attack datasets, exhibit clear limitations in this context, as they require complete model reinitialization and retraining whenever novel attack patterns emerge [4,5,6]:

Failure to detect unseen patterns: Traditional anomaly detection models rely on fixed rules or signature-based learning. As a result, they often fail to recognize previously unseen data types or novel attack behaviors, leading to serious detection blind spots.
High retraining costs and latency: Models trained on static datasets cannot adapt to the dynamic nature of cyber threats. When unfamiliar data distributions appear, these models become obsolete unless retrained, highlighting their lack of adaptability.
Risk of knowledge erosion: During the retraining process, earlier knowledge may be overwritten or forgotten, especially when past data is unavailable. This can lead to degraded performance on previously detected threat types, weakening long-term robustness.

Given that generative AI can rapidly produce dozens or even hundreds of novel attack variants, the likelihood of bypassing traditional detection models increases significantly. This underscores the urgency of developing adaptive, continual learning-based detection systems that can evolve in real-time with emerging threats.

This study introduces a continual learning-based approach that incrementally updates the model with newly emerging attack types. By incorporating new threats without retraining the model from scratch, the proposed method enables real-time detection while reducing computational overhead. To implement this continual learning scheme, we leverage the memory component of a memory-augmented autoencoder (MAE) to store and update representations of both previously known and newly observed attack types. This allows the model to detect novel threats without undergoing full retraining, thereby enhancing adaptability and efficiency [7].

Experimental results demonstrate that the proposed system, when combined with a Light Gradient Boosting Machine (LGBM), achieves an F1-score of up to 92%, significantly outperforming a traditional autoencoder-based approach, which yields an F1-score of approximately 86%. These findings indicate that the proposed method offers both fast classification and improved robustness in detecting diverse attack scenarios [8].

The remainder of this paper is organized as follows. Section 2 examines the dataset used for attack classification, including preprocessing methods and baseline models, and discusses key insights derived from this analysis. Section 3 proposes a methodology that transforms input data into query representations using a memory-augmented autoencoder and classifies them using an LGBM model. Section 4 presents the experimental results of the proposed approach, and Section 5 concludes the paper with a summary of findings and potential directions for future research.

2. Related Works

2.1. Anomaly Detection

Considering that generative AI can rapidly produce dozens or even hundreds of new attack variants, the likelihood of evading existing detection models increases substantially. This underscores the urgent need for adaptive, continual learning-based detection systems capable of evolving in real time to counter emerging threats. Table 1 summarizes related studies in this domain, highlighting prior efforts in feature engineering, model design, and anomaly detection strategies for intrusion detection.

Kim et al. [9] vectorized data for malicious script detection by applying various preprocessing techniques, Abstract Syntax Tree (AST) [10], n-gram [11], and fuzzy hash [12]-based structural features. These feature vectors were effectively fed into an LGBM model, demonstrating robust detection performance even in code environments where both high- and low-level feature extraction is challenging due to encoding and obfuscation.

Panwar et al. [13] utilized Information Gain [14] as a filter-based feature selection method to address the high dimensionality of the CICIDS 2017 [15] dataset. They computed the entropy-based importance of each feature and ranked all 77 features by their information gain scores, grouping them into seven subsets. The selected subsets were evaluated using multiple classifiers, including Random Forest (RF), Bayes Net (BN), Random Tree (RT), and Naive Bayes (NB) [16,17,18,19]. Their experimental results showed that selecting a smaller set of relevant features not only reduced computational complexity but also improved detection performance. Specifically, the Random Forest classifier achieved an accuracy of 99.86% using just 22 selected features, while the J48 classifier obtained a slightly higher accuracy of 99.87% when using 52 features, albeit with longer execution time. This highlights the importance of effective feature reduction in enhancing both accuracy and efficiency for real-time intrusion detection systems.

Maseer, Z.K. et al. [20] preprocessed the CICIDS 2017 dataset by handling missing and infinite values, normalizing the data within the range of {−3, 3}, and applying z-score standardization, resulting in 38 refined features [21,22]. They evaluated ten machine learning algorithms, including k-Nearest Neighbors (k-NN), Decision Tree (DT), and Naïve Bayes (NB) [23,24]. Among these, k-NN, DT, and NB showed superior performance in detecting web-based attacks. The results indicate that proper feature scaling and model selection can significantly improve the accuracy and efficiency of intrusion detection systems (IDS).

Vibhute, A.D. et al. [25] applied MinMax normalization [26] to the UNSW-NB15 and NSL-KDD [27,28] datasets after handling inconsistencies and missing or zero values. To reduce dimensionality, they used an RF-based feature selection method, selecting the 15 most significant features out of 41. These were used to train a convolutional neural network (CNN) model composed of convolutional, pooling, batch normalization, and fully connected layers [29]. The proposed model achieved 99.00% accuracy on the test set, demonstrating that combining effective feature selection with deep learning can significantly enhance detection performance in complex network environments.

More, S. et al. [30] preprocessed the UNSW-NB15 dataset by removing null values, correcting inconsistent data types, normalizing feature values, and eliminating highly correlated attributes through correlation analysis. The study employed various machine learning models, including Logistic Regression (LR) [31], Support Vector Machine (SVM) [32], Decision Tree (DT), and Random Forest (RF), all of which achieved accuracy scores above 0.9. These results suggest that combining exploratory data analysis with proper feature selection can enhance the reliability and robustness of intrusion detection systems.

Altulaihan, E. et al. [33] applied a series of data refinement steps, including null value removal, categorical encoding [34], noise elimination, and numerical feature scaling, to the IoTID20 dataset [35]. These procedures were designed to enhance data consistency and reduce irrelevant variation prior to feature selection and model training. The study findings indicate that such preprocessing improves the quality of extracted features and enables more reliable anomaly detection in IoT environments, particularly when dealing with noisy or high-dimensional data.

The reviewed studies collectively emphasize that effective preprocessing, feature engineering, and dimensionality reduction are critical to enhancing the performance of IDS. Proper feature scaling, normalization, and handling of missing or inconsistent data significantly improve model stability and detection precision across diverse datasets, including CICIDS 2017, UNSW-NB15, and IoTID20. Furthermore, the integration of optimized feature subsets with advanced classifiers or deep learning architectures demonstrates superior performance, underscoring the necessity of coupling data refinement with appropriate model design. These findings highlight that a systematic combination of preprocessing, feature selection, and model optimization is essential for developing real-time, adaptive IDS capable of addressing evolving cyber threats.

Table 1. Summary of anomaly detection.

References (Year)	Preprocessing Methods Used	Datasets Used	Training Models
Kim, K. et al. (2024) [9]	AST [10], n-gram [11], Fuzzy Hash [12]	A self-collected dataset	LGBM [8]
Panwar, S.S. et al. (2022) [13]	Information Gain [14]	CICIDS 2017 [15]	RF [16], BN [17], RT [18], NB [19]
Maseer, Z.K. et al. (2021) [20]	Normalization [21], Standardization [22]	CICIDS 2017 [15]	NB [19], KNN [23], DT [24]
Vibhute, A.D. et al. (2024) [25]	MinMax normalization [26]	CICIDS 2017 [15], UNSW-NB15 [27], NSL-KDD [28]	RT [18], CNN [29]
More, S. et al. (2024) [30]	Normalization [21]	UNSW-NB15 [27]	RF [16], DT [24], LR [31], SVM [32]
Altulaihan, E. et al. (2024) [33]	Categorical encoding [34]	IoTID20 [35], NSL-KDD [28]	RF [16], KNN [23], DT [24], SVM [32]

2.2. Continual Learning

Continual learning (CL) is a training paradigm that enables a model to sequentially learn new data or tasks while retaining previously acquired knowledge. It aims to mitigate catastrophic forgetting and allows the model to adapt to evolving environments and shifting data distributions over time [36,37].

Liang, Y. S. et al. [38] proposed a continual learning framework that introduces task-specific Low-Rank Adaptation (LoRA)-like branches, consisting of fixed dimensionality reduction matrices and learnable adapters. By projecting new task gradients into a subspace orthogonal to the gradients of previous tasks, the method effectively mitigates task interference. This architecture enables efficient and modular task expansion without retraining the pre-trained model, offering a well-balanced trade-off between stability and plasticity while preserving parameter efficiency.

Yu, J. et al. [39] propose a parameter-efficient continual learning framework built upon Contrastive Language–Image Pretraining (CLIP) by integrating a Mixture-of-Experts (MoE) architecture with task-specific routers and LoRA adapters. To address catastrophic forgetting and support task-specific adaptation, the model selectively activates a subset of experts for each task while keeping the remaining experts frozen. Additionally, the framework introduces a Distribution Discriminative Auto-Selector (DDAS), which automatically identifies the task by analyzing the input distribution, enabling fully automated continual learning without requiring explicit task IDs. Extensive experiments on multi-domain Task-Incremental Learning (TIL) and Class-Incremental Learning (CIL) benchmarks demonstrate that the proposed method surpasses previous approaches while maintaining CLIP’s zero-shot generalization capability.

Gao, Z. et al. [40] proposed the consistent prompting (CPrompt) framework, which enhances continual learning by incorporating two consistency-driven modules: Classifier Consistency Learning (CCL) and Prompt Consistency Learning (PCL). CCL mitigates catastrophic forgetting through smooth regularization, preventing outdated classifiers from overpowering current predictions. PCL addresses the misalignment between task-specific prompts and classifiers by introducing random prompt selection and auxiliary supervision. Experimental results on multiple benchmarks demonstrate that CPrompt significantly improves both accuracy and robustness in class-incremental learning settings, while maintaining parameter efficiency.

Marczak, D. et al. [41] present Maximum Magnitude Selection (MagMax), a novel continual learning framework designed for exemplar-free settings. The method leverages sequential fine-tuning and task vector merging, selecting parameters with the highest magnitude of change across tasks to construct a merged task representation that preserves essential knowledge while minimizing interference. Empirical results show that a small subset of high-impact parameters largely determines performance, and that sequential fine-tuning significantly reduces parameter sign conflicts. These findings demonstrate the effectiveness of MagMax in enabling efficient knowledge consolidation and stable continual adaptation.

Le, M. et al. [42] propose a unified continual learning framework that combines Mixture-of-Experts (MoE), prefix tuning, and a novel Non-linear Residual Gating (NoRGA) mechanism. By interpreting prefix tuning as a modular expert selection process within the MoE formulation, the framework enables task-specific adaptation without modifying the backbone. It integrates both shared pre-trained experts and prefix-specific experts, selected through a dynamic gating function. To overcome the limitations of linear gating, NoRGA introduces non-linear activation and residual connections, significantly improving parameter estimation under limited data. Theoretical analysis further establishes convergence guarantees and demonstrates the statistical efficiency of the proposed method.

Although these approaches demonstrate progress in addressing catastrophic forgetting, they also reveal limitations: LoRA- and MoE-based methods introduce additional task-specific modules that increase model complexity over time; automated selectors like DDAS assume stable distributions that may not hold in adversarial environments; consistency-driven strategies such as CPrompt require careful tuning and struggle with ambiguous task boundaries; exemplar-free designs like MagMax reduce plasticity; and unified frameworks with complex gating mechanisms face optimization challenges. These limitations highlight that while continual learning methods have advanced in benchmark settings, their scalability, efficiency, and adaptability to unpredictable real-world conditions—particularly in intrusion detection—remain unresolved.

To address this gap, our work introduces a lightweight continual learning framework that leverages the encoder–memory module of memory-augmented autoencoders while replacing the decoder with a gradient-free LightGBM (LGBM) classifier. This design reduces computational overhead, avoids reconstruction bias, and enables efficient memory updates, making it well-suited for real-time intrusion detection where both adaptability and efficiency are essential.

3. Proposed Method

Figure 1 illustrates the overall architecture of the proposed Memory-LGBM framework. During the training phase, each incoming task undergoes preprocessing steps, including null value removal, duplicate elimination, and feature normalization. To enhance feature representation, the Dingo Optimization Algorithm (DOA) is first applied, followed by data augmentation to address class imbalance. The processed high-dimensional data is then compressed via an encoder to generate query vectors. These queries are classified by the LGBM model, grouping similar attack patterns and storing them in the memory module. As new task data arrives, the LGBM and memory module exchange information to accurately distinguish between normal and anomalous instances. In the testing phase, data augmentation is omitted for efficiency, and only encoding is performed before classification. If a previously unseen pattern is detected, the memory module is adaptively updated to reflect the new knowledge.

3.1. Feature Extraction

Feature extraction is the process of generating or extracting new features from raw data that effectively represent information useful for tasks such as classification, prediction, and clustering. This process aims to transform high-dimensional or complex raw data into a more concise and meaningful representation, thereby improving the learning efficiency and performance of machine learning models [43]. Recently, there has been growing research interest in feature extraction methods utilizing the Dingo Optimization Algorithm.

The Dingo Optimization Algorithm (DOA) is a metaheuristic optimization technique that mathematically models the cooperative hunting strategies and hierarchical social behaviors of dingoes. Recent machine learning studies have actively employed DOA to derive optimal feature combinations and construct effective training datasets. DOA operates primarily through two phases: hunting and social behaviors. The hunting phase explores the entire feature space to identify the most meaningful feature subsets and the representative numerical values within those subsets, while the social phase fine-tunes the composition and numerical ranges of those features to iteratively optimize performance [44,45,46,47].

When applying DOA to the CICIDS 2017 dataset, the hunting phase performs a global search by generating multiple feature subsets (20–40 features each) and evaluating their classification performance using models like SVM. The best-performing subset is selected as the alpha solution.

In the subsequent social behavior phase, local search is performed by refining the alpha solution. Other agents adjust their feature subsets by making minor modifications, such as adding or removing 1 to 3 features, or by modifying selection criteria based on statistical properties like the mean, standard deviation, and min–max range of each feature. These adjustments aim to achieve marginal improvements in classification performance. If a new subset surpasses the current alpha in performance, it is adopted as the updated alpha solution.
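The two phases described above can be sketched as a small subset search. This is a minimal illustration under stated assumptions, not the full DOA update equations: the function name `two_phase_subset_search`, its parameters, and the scoring function `toy_score` are hypothetical. In the study, the score would come from a classifier such as an SVM evaluated on the candidate subset.

```python
import random

def two_phase_subset_search(n_features, score, n_agents=8, n_iters=20,
                            min_size=20, max_size=40, seed=0):
    """Toy two-phase feature-subset search (hunting, then social refinement).

    `score` maps a collection of feature indices to a number (higher is better).
    Requires min_size <= max_size <= n_features.
    """
    rng = random.Random(seed)
    all_features = list(range(n_features))

    # Hunting phase: global search over randomly sized feature subsets.
    agents = [frozenset(rng.sample(all_features, rng.randint(min_size, max_size)))
              for _ in range(n_agents)]
    alpha = max(agents, key=score)          # best subset so far ("alpha")
    alpha_score = score(alpha)

    # Social phase: local search that adds or removes 1 to 3 features
    # around the current alpha and keeps any improvement.
    for _ in range(n_iters):
        for _ in range(n_agents):
            candidate = set(alpha)
            for _ in range(rng.randint(1, 3)):
                outside = [f for f in all_features if f not in candidate]
                if outside and (rng.random() < 0.5 or len(candidate) <= 1):
                    candidate.add(rng.choice(outside))
                else:
                    candidate.remove(rng.choice(sorted(candidate)))
            candidate_score = score(candidate)
            if candidate_score > alpha_score:
                alpha, alpha_score = frozenset(candidate), candidate_score
    return sorted(alpha), alpha_score


# Hypothetical stand-in for classifier performance: features 0-9 are
# "informative", and every selected feature carries a small cost.
INFORMATIVE = set(range(10))

def toy_score(subset):
    return len(set(subset) & INFORMATIVE) - 0.1 * len(subset)

selected, best_score = two_phase_subset_search(77, toy_score, seed=1)
print(len(selected), round(best_score, 2))
```

The sketch covers only the add/remove refinement; the statistical selection criteria mentioned above (mean, standard deviation, min–max range) are omitted.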

Compared to other metaheuristics such as Genetic Algorithms (GA) and Particle Swarm Optimization (PSO), DOA offers several advantages in network anomaly detection feature selection. First, GA often suffers from premature convergence due to limited exploration in later generations, while PSO can be overly sensitive to parameter initialization. In contrast, DOA’s dual-phase mechanism—global exploration through hunting and localized refinement through social behaviors—strikes a better balance between exploration and exploitation, reducing the risk of being trapped in local optima. Second, DOA inherently supports dynamic feature subset adjustments (adding/removing features or modifying statistical thresholds), which align well with the evolving and heterogeneous nature of network traffic. Third, empirical studies have shown that DOA requires fewer hyper-parameter adjustments than GA and PSO, making it more practical for high-dimensional and imbalanced network intrusion datasets. These properties make DOA particularly effective and stable for anomaly detection tasks, where adaptability and robustness are crucial.

Table 2 summarizes the most discriminative features for each type of attack, as selected by DOA.

3.2. Data Augmentation

Data imbalance refers to a condition in machine learning and deep learning where the number of samples is significantly skewed across different classes. This issue commonly arises in classification tasks, where certain classes have only a few samples while others are heavily represented. As a result, the model tends to be biased toward the majority class, leading to substantially lower predictive performance on minority classes. Despite potentially achieving high overall accuracy, such models often fail to detect critical minority instances (e.g., specific attack types), ultimately reducing both the generalization capability and reliability of the model [48,49].

To address the problem of data imbalance, recent studies have utilized Generative Adversarial Networks (GANs) to augment samples for minority classes. However, because the generator often learns directly from the imbalanced original data, GANs tend to produce samples biased toward the majority class. As a result, the generated data may fail to capture the diverse distribution of minority classes, leading to limited data diversity and a heightened risk of overfitting. Moreover, GANs require substantial computational time for data generation, making them unsuitable for the objectives of this study [50,51,52].

Therefore, this study adopts Adaptive Synthetic Sampling (ADASYN) for data augmentation. ADASYN is similar to the Synthetic Minority Over-sampling Technique (SMOTE) in that both generate synthetic samples for the minority class, as illustrated in Figure 2.

Figure 2. Conceptual illustration of SMOTE and ADASYN oversampling strategies.

In SMOTE (top), synthetic minority samples (red dots) are generated by interpolating between existing minority instances (pink triangles) and their minority-class nearest neighbors within the feature space, resulting in a uniform increase in minority sample density. In contrast, ADASYN (bottom) adaptively focuses sample generation on minority instances located near majority-class samples (blue circles), which are harder to classify. By allocating more synthetic samples to these ambiguous regions, ADASYN enhances the representation of complex decision boundaries and increases the diversity of the augmented dataset [53,54,55].
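The adaptive allocation can be illustrated with a small pure-Python sketch. It assumes a binary setting, Euclidean distance, and interpolation partners drawn from the k nearest minority neighbors; the function names, parameters, and toy data are illustrative and are not taken from the study's implementation.

```python
import math
import random

def _k_nearest(point, pool, k):
    """Return the k (vector, label) items of pool closest to point, skipping point itself."""
    others = [item for item in pool if item[0] is not point]
    return sorted(others, key=lambda item: math.dist(point, item[0]))[:k]

def adasyn(minority, majority, k=5, beta=1.0, seed=0):
    """Generate synthetic minority samples, favouring points near the majority class."""
    rng = random.Random(seed)
    everything = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    minority_pool = [(p, 1) for p in minority]

    # r_i: share of majority points among the k nearest neighbours of x_i.
    ratios = [sum(1 for _, label in _k_nearest(x, everything, k) if label == 0) / k
              for x in minority]
    total = sum(ratios)
    if total == 0:                      # no minority point borders the majority class
        return []
    n_to_generate = int(beta * (len(majority) - len(minority)))

    synthetic = []
    for x, r in zip(minority, ratios):
        g_i = round(r / total * n_to_generate)   # harder points receive more samples
        neighbours = _k_nearest(x, minority_pool, k)
        for _ in range(g_i):
            partner = rng.choice(neighbours)[0]
            lam = rng.random()                   # interpolation factor in [0, 1)
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, partner)))
    return synthetic

# Toy data: one minority point sits inside the majority cluster, three are far away.
majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
            (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (0.1, 0.2)]
minority = [(0.15, 0.15), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
new_points = adasyn(minority, majority, k=2)
print(len(new_points))
```

In this toy run only the minority point inside the majority cluster has majority neighbors, so it receives the entire synthetic budget, whereas SMOTE would spread samples evenly across minority points.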

3.3. Memory-LGBM

Figure 3 presents the overall architecture of the proposed Memory-LGBM model. Unlike conventional autoencoders, this model utilizes only the encoder and the latent vector, replacing the decoder with an LGBM classifier. The encoder compresses high-dimensional input data into a low-dimensional latent space, extracting only the most salient features.

The decoder is omitted and replaced by LGBM for the following reasons [56,57]:

First, since the decoder aims to reconstruct the original input, it may inadvertently restore anomalous data to appear normal, thereby reducing the sensitivity of anomaly detection.
Second, the reconstruction process is computationally intensive, leading to inefficiencies in both training and inference.
Third, LGBM performs classification directly based on the similarity of latent vectors without requiring reconstruction, enabling faster and more efficient anomaly detection.

The proposed model operates in two distinct phases. During the training phase, input data is passed through the encoder to generate a latent vector, which is then classified by the LGBM classifier. The class-specific latent representations are subsequently stored in a memory module.

During the testing phase, new inputs are encoded and classified by the LGBM classifier. However, as illustrated in Figure 1, when a previously unseen attack type is encountered in a new task, an additional memory slot is allocated to store the corresponding attack information. The predicted class label is then used to query or update the memory module. If the latent vector matches an existing pattern labeled as “normal” or “anomalous,” rapid classification can be performed via the memory. This mechanism enables the memory to be continuously adapted without retraining the model, thereby enhancing the robustness of anomaly detection even in the presence of newly emerging attack types.

Overall, the Memory-LGBM architecture provides a lightweight yet effective alternative to conventional autoencoder-based approaches for anomaly detection, as empirically validated in Section 4.

3.3.1. Memory Addressing

This section outlines three key operations used to store information in the memory module, each represented by a corresponding equation. First, the encoder, which compresses high-dimensional input data to extract only the most salient features, is defined in Equation (1) [7,58,59]. The latent vector $z$ represents the compressed output of the encoder, where $f_{e}$ denotes the encoder function and $x$ refers to the input data.

$$z = f_{e}(x) \tag{1}$$

Equation (2) represents the process of identifying the memory entry that is most similar to the current query among all stored patterns in the memory. The attention weight $s_{i}$ is computed using the cosine similarity between the query vector $z$ and each memory slot $m_{i}$, normalized via SoftMax. Here, $z \in \mathbb{R}^{d}$ is the latent representation obtained from the encoder, and $m_{i} \in \mathbb{R}^{d}$ denotes the $i$-th memory vector.

$$s_{i} = \frac{\exp\left(\frac{z^{\top} m_{i}}{\| z \| \, \| m_{i} \|}\right)}{\sum_{j=1}^{N} \exp\left(\frac{z^{\top} m_{j}}{\| z \| \, \| m_{j} \|}\right)} \tag{2}$$
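As a minimal sketch of Equation (2), the attention weights can be computed with the standard library alone. The function names `cosine` and `attention_weights` and the toy vectors are illustrative assumptions; subtracting the largest similarity before exponentiation is a common numerical-stability step that does not change the result.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def attention_weights(z, memory):
    """Softmax over the cosine similarities between query z and each memory slot."""
    sims = [cosine(z, m) for m in memory]
    peak = max(sims)                              # shift for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

# Toy memory with three slots; the query points in the direction of slot 0.
memory = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
weights = attention_weights((2.0, 0.0), memory)
print([round(w, 3) for w in weights])             # largest weight on slot 0
```

Because cosine similarity is bounded in [−1, 1], the softmax weights in this form stay fairly flat; the slot with the highest weight is still the one most similar to the query.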

3.3.2. Adaptive Memory Update

The core idea of continual learning is to update a trained model using newly discovered information during the testing phase, allowing the model to adapt and maintain up-to-date knowledge even after initial training has been completed. Accordingly, this section describes how the memory in the trained Memory-LGBM model is updated. The same procedures defined in Equations (1) and (2) are applied during the testing phase. However, if the input contains information that is not present in any existing memory slot, the memory must be updated accordingly.

Equation (3) defines the condition for updating the memory by determining whether the newly arrived information already exists in the memory slots. Here, $s_{\max}$ denotes the highest cosine similarity between the query vector $z$ and any memory slot $m_{i}$, while $\tau$ represents the minimum similarity threshold above which the input is considered similar to existing memory entries. This mechanism enables continual updates of the memory by identifying and storing information that is not yet represented in the current memory slots.

Equation (4) represents the direct addition of a new latent vector $z$ into the memory module $M$. Equation (5) describes the process in which, after the latent vector $z$ is encoded from the input, it is classified by the LGBM classifier to produce a predicted class label $\hat{y}$. Based on this prediction, the memory slot corresponding to class $\hat{y}$, denoted as $m_{\hat{y}}$, is updated using an exponential moving average scheme. This update mechanism allows the memory to incorporate new information from the input $z$ while preserving the existing prototype representation through the weighting factor $\alpha$. As a result, the memory module dynamically adapts to changes in the data distribution even during the inference phase, enabling continual learning and enhanced robustness in anomaly detection.

In practice, the similarity threshold $\tau$ in Equation (3) and the weight factor $\alpha$ in Equation (5) are treated as fixed hyper-parameters determined through cross-validation. Specifically, $\tau$ is set within the range of 0.70–0.85 based on cosine similarity distributions observed in the training data, ensuring that only sufficiently distinct latent vectors create new memory slots. The weighting factor $\alpha$ is set to 0.9 to give higher importance to existing memory representations while still allowing incremental adaptation. Sensitivity analysis demonstrated that performance remains stable for $\alpha$ values between 0.85 and 0.95, and $\tau$ values between 0.70 and 0.85. These ranges provide a robust balance between stability and adaptability in the memory update process.

$$\text{if } s_{\max} < \tau, \text{ then update memory with } z \tag{3}$$

$$M \leftarrow M \cup \{z\} \tag{4}$$

$$m_{\hat{y}} \leftarrow \alpha \cdot m_{\hat{y}} + (1 - \alpha) \cdot z \tag{5}$$
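One way to read Equations (3)–(5) together is sketched below. The memory is assumed to be a list of `[label, prototype]` slots; that layout, the function name `update_memory`, and the class labels are illustrative assumptions. The text does not say which slot receives the moving-average update when a class has several slots, so this sketch picks the closest slot of the predicted class. The defaults τ = 0.8 and α = 0.9 fall within the ranges reported above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def update_memory(memory, z, y_hat, tau=0.8, alpha=0.9):
    """Update `memory` (a list of [label, prototype] slots) in place.

    Returns "added" when a new slot is created and "updated" when an
    existing prototype is refreshed.
    """
    sims = [cosine(z, proto) for _, proto in memory]
    s_max = max(sims, default=-1.0)               # empty memory: always add
    same_class = [i for i, (label, _) in enumerate(memory) if label == y_hat]

    # Eqs. (3)-(4): nothing similar enough is stored, so z becomes a new slot.
    if s_max < tau or not same_class:
        memory.append([y_hat, list(z)])
        return "added"

    # Eq. (5): exponential moving average on the closest slot of class y_hat.
    i = max(same_class, key=lambda j: sims[j])
    memory[i][1] = [alpha * m + (1 - alpha) * v for m, v in zip(memory[i][1], z)]
    return "updated"

memory = []
print(update_memory(memory, (1.0, 0.0), "normal"))      # first slot is created
print(update_memory(memory, (0.9, 0.1), "normal"))      # similar input refines it
print(update_memory(memory, (0.0, 1.0), "attack_a"))    # dissimilar input adds a slot
print(memory)
```

With α = 0.9, a single similar input moves the prototype only a tenth of the way toward it, which is how the update favors retention over rapid drift.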

4. Experiments

4.1. Datasets

Three widely used intrusion detection benchmarks in the cybersecurity domain were selected for evaluation.

CICIDS 2017 is a comprehensive dataset that captures real-world network traffic over five days, including 15 modern attack types such as DDoS, PortScan, BruteForce, and Botnet. It provides over 2.8 million labeled flow records with more than 80 features extracted using CICFlowMeter. Table 3 summarizes the normal and attack classes of the CICIDS 2017 dataset along with brief descriptions of each attack type. In the continual learning setting of this study, all attack types are initially grouped into a single Anomaly class to address data imbalance, distribution shifts, and the high risk of catastrophic forgetting when introducing new attack types. This binary setup (Normal vs. Anomaly) serves as the initial training configuration, enabling the model to focus on distinguishing benign from malicious traffic in the early stage.

Once the anomaly class is detected, the memory module stores representative patterns of each observed attack instance. Using the stored patterns, the LGBM classifier is employed to further classify anomalies into subtypes with available labels, or to group them into clusters of similar patterns when labels are unavailable.

NSL-KDD is an enhanced version of the original KDD’99 dataset, designed to reduce redundancy and address class imbalance. It consists of 41 handcrafted features and includes four main attack categories: DoS, Probe, R2L, and U2R.

UNSW-NB15 is a modern intrusion detection dataset generated using the IXIA PerfectStorm tool. It contains approximately 2.5 million labeled records across 10 attack categories, including Exploits, Fuzzers, and Backdoors, with 49 extracted features.

For all three datasets, DOA was applied to identify core features specific to each attack type. These optimized feature subsets were then used as inputs to the proposed Memory-LGBM model, improving both training efficiency and anomaly detection performance by focusing on the most informative attributes.

4.2. Experimental Procedure

Based on the CICIDS 2017 dataset, the training–testing protocol is organized as follows: during the initial training phase, the model is trained on Normal traffic and a single attack type. In the subsequent testing phase, Normal traffic is paired with each remaining attack type in separate rounds, for a total of 14 rounds (1 initial training round + 13 incremental testing rounds). At each round, the memory module either creates a new slot for unseen attack patterns or updates existing slots for similar patterns, allowing the LGBM classifier to refine subtype classification incrementally. The same procedure is applied to the other datasets used in this study to ensure a consistent continual learning evaluation.
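The round structure of this protocol can be sketched as a simple schedule builder. The function `build_rounds` and the placeholder attack names are hypothetical; the order in which attack types are introduced is not specified in the text and is assumed here to follow the list order.

```python
def build_rounds(attack_types, normal_label="Normal"):
    """Return one (phase, classes) pair per round: 1 initial round, then one per remaining attack."""
    if not attack_types:
        raise ValueError("at least one attack type is required")
    rounds = [("initial-training", (normal_label, attack_types[0]))]
    for attack in attack_types[1:]:
        rounds.append(("incremental", (normal_label, attack)))
    return rounds


# Placeholder names standing in for 14 attack types.
attacks = [f"attack_{i:02d}" for i in range(1, 15)]
schedule = build_rounds(attacks)
```

With 14 attack types, the schedule contains 1 initial training round and 13 incremental rounds, matching the total of 14 rounds stated above.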

4.3. Baseline

Three representative continual learning methods are compared due to their consistent performance across benchmarks. Sequential Fine-Tuning (Seq-FT) [60] provides a simple and memory-efficient baseline, as it trains on tasks sequentially without the need to store or revisit previous data. Experience Replay (ER) [61] mitigates forgetting by storing and replaying a small set of previous samples during training. Dark Experience Replay++ (DER++) [62] is an enhanced variant of ER that additionally retains soft targets from earlier tasks, allowing the model to preserve more comprehensive knowledge over time.
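To make the replay idea behind ER concrete, the sketch below shows a fixed-size buffer filled by reservoir sampling. Reservoir sampling is one common way to populate a replay buffer and is an assumption here; the text does not state which sampling strategy the baselines in this study use, and the class name `ReservoirBuffer` is illustrative.

```python
import random


class ReservoirBuffer:
    """Fixed-size replay buffer filled by reservoir sampling (illustrative)."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.items = []
        self.seen = 0                      # number of samples offered so far
        self.rng = random.Random(seed)

    def add(self, sample) -> None:
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            # Each sample offered so far stays in the buffer with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def replay(self, k: int):
        """Draw up to k stored samples to mix into the current training batch."""
        return self.rng.sample(self.items, min(k, len(self.items)))
```

Unlike such a buffer, which keeps raw samples, the prototype memory of Memory-LGBM stores only latent vectors per slot.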

4.4. Hyperparameter Setting

To ensure clarity and reproducibility, the experimental hyperparameters employed in the proposed framework are summarized in Table 4. The table covers the main components of the pipeline, including the population size, iteration count, and hunting/social balance in the DOA, as well as the nearest neighbor parameter K and adaptive sampling ratio in ADASYN. The memory module settings are also provided, highlighting the similarity threshold τ and the weighting factor α that govern prototype updates during continual learning. In addition, the hyperparameters of the LGBM classifier, such as learning rate, maximum depth, number of leaves, and boosting rounds, are included to present a complete overview of the experimental setup. These detailed configurations provide a rational basis for parameter selection and ensure that the performance comparisons across methods are fair and consistent.
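For reference, the settings reported in Table 4 and in the sensitivity analysis can be collected into a single configuration object, as sketched below. The dictionary layout and the key names for the DOA and memory entries are hypothetical. The `lgbm` keys follow the parameter names of LightGBM's scikit-learn interface, and mapping the adaptive ADASYN sampling ratio to `"auto"` is an assumption.

```python
# Values come from Table 4 and the sensitivity analysis; key names are illustrative.
CONFIG = {
    "doa": {"population_size": 30, "iterations": 50,
            "hunting_weight": 0.5, "social_weight": 0.5},
    "adasyn": {"n_neighbors": 5, "sampling_strategy": "auto"},
    "memory": {"alpha": 0.9,
               "alpha_stable_range": (0.85, 0.95),
               "tau_stable_range": (0.70, 0.85)},
    "lgbm": {"learning_rate": 0.05, "max_depth": 8,
             "num_leaves": 31, "n_estimators": 100},
}


def check_config(cfg: dict) -> None:
    """Basic consistency checks on the settings above."""
    lo, hi = cfg["memory"]["alpha_stable_range"]
    if not lo <= cfg["memory"]["alpha"] <= hi:
        raise ValueError("alpha lies outside the reported stable range")
    # A tree of depth d has at most 2**d leaves.
    if cfg["lgbm"]["num_leaves"] > 2 ** cfg["lgbm"]["max_depth"]:
        raise ValueError("num_leaves exceeds what max_depth allows")


check_config(CONFIG)
```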

4.5. Results

Accuracy (ACC) [63] refers to the average classification accuracy across all tasks and serves as an indicator of overall performance during the continual learning process. Forgetting Measure (FM) [64] is calculated as the average difference between the maximum accuracy obtained after learning each task and the current accuracy, representing how much previously acquired knowledge has been forgotten; lower FM values indicate better knowledge retention. Learning Accuracy (LA) [65] refers to the average accuracy on each task immediately after it is learned, reflecting the model’s ability to acquire new information during training and serving as a measure of adaptability in continual learning settings.
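The three metrics can be computed from an accuracy matrix, as sketched below under commonly used definitions. The convention R[i, j] (accuracy on task j after training through task i), the function name `continual_metrics`, and the choice to compute ACC from the final row are assumptions; the exact formulas used in this study are not given in the text.

```python
import numpy as np


def continual_metrics(R) -> dict:
    """R[i, j] = accuracy on task j after training through task i (only i >= j is used)."""
    R = np.asarray(R, dtype=float)
    T = R.shape[0]
    acc = R[T - 1, :].mean()        # average final accuracy over all tasks
    la = np.diag(R).mean()          # accuracy on each task right after it is learned
    if T > 1:
        # Best accuracy reached on each earlier task minus its final accuracy.
        best = np.array([R[j:T - 1, j].max() for j in range(T - 1)])
        fm = (best - R[T - 1, :T - 1]).mean()
    else:
        fm = 0.0
    return {"ACC": float(acc), "FM": float(fm), "LA": float(la)}


# Toy three-task example (illustrative numbers, not results from this study).
R_example = [[90.0, 0.0, 0.0],
             [85.0, 92.0, 0.0],
             [80.0, 88.0, 94.0]]
metrics = continual_metrics(R_example)
```

In the toy example, task 1 falls from 90 to 80 and task 2 from 92 to 88, so FM is (10 + 4) / 2 = 7, while LA is the mean of the diagonal, 92.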

Table 5, Table 6 and Table 7 present the evaluation results of baseline continual learning methods, where three representative approaches were compared. Seq-FT offers a simple and memory-free training procedure but lacks access to prior data, often resulting in performance degradation due to forgetting. ER partially alleviates forgetting by storing and replaying a small set of past samples. DER++ extends ER by incorporating soft labels, enabling better knowledge retention across tasks. In contrast, the proposed Memory-LGBM model achieves superior performance through several architectural advantages. First, it maintains a class-wise prototype memory structure, where each class is represented by a continuously updated latent vector rather than raw samples, providing a compact and generalized memory representation. Second, incoming data are integrated into the memory using an exponential moving average (EMA) update, which preserves existing knowledge while smoothly incorporating new information. Lastly, classification is performed using a gradient-free LGBM classifier, which efficiently utilizes extracted slot features to achieve high adaptability to new tasks and maintains robust performance on the final task.

In addition, Table 8 summarizes the representative F1-score results across the three IDS datasets. The F1-score is a widely adopted metric in intrusion detection tasks as it balances precision and recall. As shown in Table 8, Memory-LGBM consistently achieves the highest F1-score, reaching 92.0% on CICIDS 2017, 91.0% on NSL-KDD, and 90.0% on UNSW-NB15. This demonstrates that the proposed model not only excels in continual learning metrics (ACC, FM, LA) but also provides practical detection performance in terms of balanced classification ability. Furthermore, Table 9 presents the ablation study results to validate the contribution of each component in the proposed framework. When DOA feature selection was removed, the F1-score decreased by 4.5 to 4.9 percentage points across the datasets, confirming the importance of selecting discriminative features for intrusion detection. Eliminating ADASYN or replacing it with SMOTE also reduced performance, highlighting the role of adaptive data augmentation in handling class imbalance. Finally, replacing the LGBM classifier with Random Forest or XGBoost lowered the F1-score by 5.3 to 6.5 percentage points, which demonstrates that the gradient-free LGBM classifier provides superior adaptability in continual learning scenarios. These findings verify that each component of the proposed Memory-LGBM pipeline plays a crucial role in achieving robust intrusion detection performance.
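The size of each ablation effect can be recomputed directly from the F1-scores reported in Table 9. The short script below is only a worked arithmetic check; the variable names are illustrative.

```python
# F1-scores (%) copied from Table 9; column order: CICIDS 2017, NSL-KDD, UNSW-NB15.
full = (92.0, 91.0, 90.0)
variants = {
    "w/o DOA": (87.5, 86.2, 85.1),
    "w/o ADASYN": (88.3, 87.4, 86.0),
    "SMOTE instead of ADASYN": (86.2, 86.1, 84.8),
    "RF instead of LGBM": (86.7, 85.3, 84.2),
    "XGB instead of LGBM": (85.9, 84.6, 83.5),
}

# Drop in percentage points relative to the full Memory-LGBM pipeline.
drops = {
    name: tuple(round(f - v, 1) for f, v in zip(full, scores))
    for name, scores in variants.items()
}
for name, d in drops.items():
    print(f"{name:26s} drop (points): {d}")
```

Removing DOA costs 4.5, 4.8, and 4.9 points on the three datasets, and the largest drops (6.1 to 6.5 points) come from replacing LGBM with XGBoost.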

Figure 4 illustrates how the FM varies with the number of memory slots across three network intrusion detection datasets: CICIDS 2017, NSL-KDD, and UNSW-NB15. Overall, FM consistently decreases as the number of memory slots increases, indicating that a larger memory capacity helps preserve information about each class or attack type more effectively. In particular, when the number of slots is low (FM above 4.5), severe catastrophic forgetting is observed. However, as the number of slots reaches 15 to 20, FM drops below 3.0, showing a substantial reduction in forgetting. Among the datasets, NSL-KDD exhibits the lowest FM values across all slot configurations, demonstrating stable responsiveness to the memory structure. In contrast, UNSW-NB15 initially shows the highest FM, reflecting instability due to complex attack types, but it also demonstrates significant improvement as the slot count increases. These results validate that memory-based continual learning frameworks can effectively enhance resistance to forgetting as the number of memory slots increases.

Figure 5 illustrates the training time required by each continual learning method (Memory-LGBM, DER++, ER, Seq-FT, and the baseline memory-autoencoder) as the number of training epochs increases. The proposed Memory-LGBM achieves favorable computational efficiency compared to other methods while maintaining high accuracy. Specifically, Memory-LGBM exhibits a linear and moderate growth in training time, consistently below the computational cost of the full memory-autoencoder framework, which includes both encoder and decoder stages. This reduction is attributed to the replacement of the decoder with the lightweight LGBM classifier, which streamlines the learning phase without sacrificing performance. Notably, while DER++ incurs additional overhead due to dual memory buffering and gradient replay, Memory-LGBM maintains computational simplicity through prototype-based memory updates and avoids full model retraining.

Figure 6 presents the memory usage of each method as the number of epochs increases. Memory-LGBM demonstrates consistently lower memory consumption than the baseline memory-autoencoder, which requires substantial overhead to maintain both encoder and decoder parameters. Compared to DER++ and ER, Memory-LGBM achieves superior memory efficiency by eliminating replay buffers and employing lightweight prototype-based updates. Importantly, the memory footprint of Memory-LGBM remains below 300 MB even after 40 epochs, highlighting its scalability and suitability for long-term continual learning scenarios.

Overall, the results presented in Figure 5 and Figure 6 confirm that the proposed Memory-LGBM framework achieves a superior balance between accuracy, training efficiency, and memory consumption. By combining a lightweight LGBM classifier with prototype-based memory updates, the model significantly reduces computational overhead while preserving strong classification performance. These advantages ensure better scalability and robustness compared to conventional continual learning methods, demonstrating the practicality of Memory-LGBM for deployment in resource-constrained or real-time intrusion detection environments.

5. Conclusions

This study presents Memory-LGBM, a lightweight architecture that integrates a prototype-based memory module with a gradient-free LGBM classifier for anomaly detection and continual learning. The model replaces the conventional decoder with a memory-attention mechanism and updates class-wise memory slots via an exponentially weighted averaging scheme. This design enables efficient classification, continual memory adaptation, and compact latent representations without the need to store raw input samples.

Extensive experiments on three benchmark intrusion detection datasets—CICIDS 2017, NSL-KDD, and UNSW-NB15—demonstrate that Memory-LGBM consistently outperforms baseline continual learning methods, including Seq-FT, ER, and DER++, in terms of overall accuracy, reduced forgetting (FM), and improved adaptability (LA).

Further analysis of the number of memory slots confirms that increased memory capacity enhances the model’s resistance to catastrophic forgetting. In particular, complex datasets show steady performance gains as the number of memory slots increases.

Overall, Memory-LGBM provides a practical and scalable solution for real-world anomaly detection scenarios that demand rapid adaptation, strong knowledge retention, and low computational overhead. A promising direction for future work is the deployment of Memory-LGBM in real-time streaming environments, where data arrive continuously and timely decision-making is essential. Such settings would allow further evaluation of the model’s efficiency, adaptability, and robustness under realistic constraints.

In addition, future research will investigate more complex task sequences in which multiple attack types are introduced per round, better reflecting real-world network conditions where simultaneous threats frequently occur.

Author Contributions

Conceptualization, H.P., D.S. (Dongil Shin), M.P. and D.S. (Dongkyoo Shin); funding acquisition, D.S. (Dongkyoo Shin); methodology, H.P. and T.K.; machine learning, H.P. and T.K.; validation, D.S. (Dongil Shin) and H.L.; writing—original draft, H.P.; writing—review and editing, D.S. (Dongkyoo Shin). All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Culture, Sports and Tourism R&D Program through the Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2025 (Project Name: Training Global Talent for Copyright Protection and Management of On-Device AI Models, Project Number: RS-2025-02221620, Contribution Rate: 100%).

Data Availability Statement

Data presented in this study are available on request from the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Jalali, M.S.; Kaiser, J.P.; Siegel, M.; Madnick, S. The internet of things promises new benefits and risks: A systematic analysis of adoption dynamics of iot products. IEEE Secur. Priv. 2019, 17, 39–48.
2. Papernot, N.; McDaniel, P.; Sinha, A.; Wellman, M.P. SoK: Security and Privacy in Machine Learning. In Proceedings of the 2018 IEEE European Symposium on Security and Privacy (EuroS&P), London, UK, 24–26 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 399–414.
3. Pa Pa, Y.M.; Tanizaki, S.; Kou, T.; Van Eeten, M.; Yoshioka, K.; Matsumoto, T. An Attacker’s Dream? Exploring the Capabilities of ChatGPT for Developing Malware. In Proceedings of the 16th Cyber Security Experimentation and Test Workshop, Marina del Rey, CA, USA, 7–8 August 2023; pp. 10–18.
4. Grinsztajn, L.; Oyallon, E.; Varoquaux, G. Why do tree-based models still outperform deep learning on typical tabular data? In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022.
5. Shin, H.; Lee, J.K.; Kim, J.; Kim, J. Continual Learning with Deep Generative Replay. In Proceedings of the NeurIPS; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
6. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming Catastrophic Forgetting in Neural Networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
7. Gong, D.; Liu, L.; Le, V.; Saha, B.; Mansour, M.R.; Venkatesh, S.; van den Hengel, A. Memorizing Normality to Detect Anomaly: Memory-augmented Deep Autoencoder for Unsupervised Anomaly Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1705–1714.
8. Taha, A.A.; Malebary, S.J. An intelligent approach to credit card fraud detection using an optimized light gradient boosting machine. IEEE Access 2020, 8, 25579–25587.
9. Kim, K.; Shin, J.; Park, J.G.; Kim, J.T. Performance evaluations of AI-based obfuscated and encrypted malicious script detection with feature optimization. ETRI J. 2025, 47, 753–770.
10. Wang, W.; Li, G.; Ma, B.; Xia, X.; Jin, Z. Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree. In Proceedings of the 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2020, London, ON, Canada, 18–21 February 2020; Kontogiannis, K., Khomh, F., Chatzigeorgiou, A., Fokaefs, M., Zhou, M., Eds.; IEEE: Piscataway, NJ, USA, 2020; pp. 261–271.
11. Kondrak, G. N-gram similarity and distance. In International Symposium on String Processing and Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2005; pp. 115–126.
12. Lu, H.; Zhang, M.; Xu, X.; Li, Y.; Shen, H. Deep fuzzy hashing network for efficient image retrieval. IEEE Trans. Fuzzy Syst. 2020, 29, 166–176.
13. Panwar, S.S.; Raiwani, Y.; Panwar, L.S. An Intrusion Detection Model for CICIDS-2017 Dataset Using Machine Learning Algorithms. In Proceedings of the 2022 International Conference on Advances in Computing, Communication and Materials (ICACCM), Dehradun, India, 10–11 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–10.
14. Lai, C.M.; Yeh, W.C.; Chang, C.Y. Gene Selection using Information Gain and Improved Simplified Swarm Optimization. Neurocomputing 2016, 218, 331–338.
15. Sharafaldin, I.; Habibi Lashkari, A.; Ghorbani, A.A. A detailed analysis of the cicids2017 data set. In Proceedings of the Information Systems Security and Privacy: 4th International Conference, ICISSP 2018, Funchal-Madeira, Portugal, 22–24 January 2018; Revised Selected Papers 4. Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 172–188.
16. Rigatti, S.J. Random Forest. J. Insur. Med. 2017, 47, 31–39.
17. Ben-Gal, I. Bayesian Networks. In Encyclopedia of Statistics in Quality and Reliability; John Wiley & Sons: Hoboken, NJ, USA, 2008.
18. Le Gall, J.F. Random trees and applications. Probab. Surv. 2005, 2, 245–311.
19. Rish, I. An Empirical Study of the Naive Bayes Classifier. Available online: https://faculty.cc.gatech.edu/~isbell/reading/papers/Rish.pdf (accessed on 4 September 2025).
20. Maseer, Z.K.; Yusof, R.; Bahaman, N.; Mostafa, S.A.; Foozy, C.F.M. Benchmarking of machine learning for anomaly based intrusion detection systems in the CICIDS2017 dataset. IEEE Access 2021, 9, 22351–22370.
21. Wu, Y.; He, K. Group Normalization. arXiv 2018, arXiv:1803.08494v3.
22. Gal, M.S.; Rubinfeld, D.L. Data standardization. NYUL Rev. 2019, 94, 737.
23. Guo, G.; Wang, H.; Bell, D.; Bi, Y.; Greer, K. KNN model-based approach in classification. In Proceedings of the On the Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania, Italy, 3 November 2003; pp. 986–996.
24. Song, Y.Y.; Ying, L.U. Decision tree methods: Applications for classification and prediction. Shanghai Archiv. Psychiatry 2015, 27, 130.
25. Vibhute, A.D.; Khan, M.; Patil, C.H.; Gaikwad, S.V.; Mane, A.V.; Patel, K.K. Network anomaly detection and performance evaluation of Convolutional Neural Networks on UNSW-NB15 dataset. Procedia Comput. Sci. 2024, 235, 2227–2236.
26. Henderi, H.; Wahyuningsih, T.; Rahwanto, E. Comparison of Min-Max normalization and Z-Score Normalization in the K-nearest neighbor (kNN) Algorithm to Test the Accuracy of Types of Breast Cancer. Int. J. Inform. Inf. Syst. 2021, 4, 13–20.
27. Moustafa, N.; Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, ACT, Australia, 10–12 November 2015; pp. 1–6.
28. Su, T.; Sun, H.; Zhu, J.; Wang, S.; Li, Y. BAT: Deep learning methods on network intrusion detection using NSL-KDD dataset. IEEE Access 2020, 8, 29575–29585.
29. O’Shea, K.; Nash, R. An introduction to convolutional neural networks. arXiv 2015, arXiv:1511.08458.
30. More, S.; Idrissi, M.; Mahmoud, H.; Asyhari, A.T. Enhanced Intrusion Detection Systems Performance with UNSW-NB15 Data Analysis. Algorithms 2024, 17, 64.
31. LaValley, M.P. Logistic regression. Circulation 2008, 117, 2395–2399.
32. Wang, H.; Hu, D. Comparison of SVM and LS-SVM for regression. In Proceedings of the 2005 International Conference on Neural Network and Brain, Beijing, China, 13–15 October 2005; Volume 1, pp. 279–283.
33. Altulaihan, E.; Almaiah, M.A.; Aljughaiman, A. Anomaly Detection IDS for Detecting DoS Attacks in IoT Networks Based on Machine Learning Algorithms. Sensors 2024, 24, 713.
34. Hendricks, R.; Khasawneh, M. Cluster analysis of categorical variables of parkinson’s disease patients. Brain Sci. 2021, 11, 1290.
35. Home. Available online: https://sites.google.com/view/iot-network-intrusion-dataset/home (accessed on 17 February 2023).
36. Wang, L.; Zhang, X.; Su, H.; Zhu, J. A Comprehensive Survey of Continual Learning: Theory, Method and Application. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5362–5383.
37. Van de Ven, G.M.; Tolias, A.S. Three scenarios for continual learning. arXiv 2019, arXiv:1904.07734.
38. Liang, Y.S.; Li, W.J. Inflora: Interference-free low-rank adaptation for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 23638–23647.
39. Yu, J.; Zhuge, Y.; Zhang, L.; Hu, P.; Wang, D.; Lu, H.; He, Y. Boosting continual learning of vision-language models via mixture-of-experts adapters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 23219–23230.
40. Gao, Z.; Cen, J.; Chang, X. Consistent prompting for rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024.
41. Marczak, D.; Twardowski, B.; Trzciński, T.; Cygert, S. Magmax: Leveraging model merging for seamless continual learning. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 379–395.
42. Le, M.; Nguyen, H.; Nguyen, T.; Pham, T.; Ngo, L.; Ho, N. Mixture of experts meets prompt-based continual learning. In Advances in Neural Information Processing Systems; NeurIPS (Neural Information Processing System): San Diego, CA, USA, 2024; Volume 37, pp. 119025–119062.
43. Guyon, I.; Elisseeff, A. An introduction to feature extraction. In Feature Extraction: Foundations and Applications; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1–25.
44. Bairwa, A.K.; Joshi, S.; Singh, D. Dingo optimizer: A nature-inspired metaheuristic approach for engineering problems. Math. Probl. Eng. 2021, 2021, 2571863.
45. Yarram, S. An Optimized Deep Learning Approach for Intrusion Detection: AE-DBN Hybrid Model with Dingo Feature Selection on CSE-CIC-IDS2018. In Proceedings of the 2025 International Conference on Computing for Sustainability and Intelligent Future (COMP-SIF), Bangalore, India, 21–22 March 2025; IEEE: New York, NY, USA, 2025; pp. 1–9.
46. Zhong, R.; Peng, F.; Yu, J.; Munetomo, M. Q-learning based vegetation evolution for numerical optimization and wireless sensor network coverage optimization. Alex. Eng. J. 2024, 87, 148–163.
47. Longadge, R.; Dongre, S. Class Imbalance Problem in Data Mining Review. Eur. J. Intern. Med. 2013, 24, e256.
48. Thabtah, F.; Hammoud, S.; Kamalov, F.; Gonsalves, A. Data imbalance in classification: Experimental evaluation. Inf. Sci. 2020, 513, 429–441.
49. Thanh-Tung, H.; Tran, T. On Catastrophic Forgetting and Mode Collapse in Generative Adversarial Networks. arXiv 2018, arXiv:1807.04015.
50. Srivastava, A.; Valkov, L.; Russell, C.; Gutmann, M.U.; Sutton, C. VEEGAN: Reducing Mode Collapse in GANs Using Implicit Variational Learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
51. Bang, D.; Shim, H. Mggan: Solving mode collapse using manifold-guided training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2347–2356.
52. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
53. Wang, S.; Dai, Y.; Shen, J.; Xuan, J. Research on expansion and classification of imbalanced data based on SMOTE algorithm. Sci. Rep. 2021, 11, 24039.
54. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In Proceedings of the IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, 1–8 June 2008; pp. 1322–1328.
55. Imani, M.; Ghaderpour, Z.; Joudaki, M.; Beikmohammadi, A. The Impact of SMOTE and ADASYN on Random Forest and Advanced Gradient Boosting Techniques in Telecom Customer Churn Prediction. In Proceedings of the 2024 10th International Conference on Web Research (ICWR), IEEE, Tehran, Iran, 24–25 April 2024.
56. Spigler, G. Denoising autoencoders for overgeneralization in neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 998–1004.
57. Kamara, A.F.; Chen, E.; Liu, Q.; Pan, Z. Combining contextual neural networks for time series classification. Neurocomputing 2019, 384, 57–66.
58. Gao, H.; Qiu, B.; Barroso, R.J.D.; Hussain, W.; Xu, Y.; Wang, X. TSMAE: A Novel Anomaly Detection Approach for Internet of Things Time Series Data Using Memory-Augmented Autoencoder. IEEE Trans. Netw. Sci. Eng. 2022, 10, 2978–2990.
59. Yan, H.; Liu, Z.; Chen, J.; Feng, Y.; Wang, J. Memory-augmented skip-connected autoencoder for unsupervised anomaly detection of rocket engines with multi-source fusion. ISA Trans. 2023, 133, 53–65.
60. Gao, Q.; Zhao, C.; Sun, Y.; Xi, T.; Zhang, G.; Ghanem, B.; Zhang, J. A unified continual learning framework with general parameter-efficient tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 11483–11493.
61. Chaudhry, A.; Rohrbach, M.; Elhoseiny, M.; Ajanthan, T.; Dokania, P.K.; Torr, P.H.S.; Ranzato, M. On Tiny Episodic Memories in Continual Learning. arXiv 2019, arXiv:1902.10486.
62. Buzzega, P.; Boschini, M.; Porrello, A.; Abati, D.; Calderara, S. Dark Experience for General Continual Learning: A Strong, Simple Baseline. In Proceedings of the Neural Information Processing Systems (NeurIPS), Online, 6–12 December 2020.
63. Lopez-Paz, D.; Ranzato, M. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems 30 (NIPS 2017); NeurIPS (Neural Information Processing System): San Diego, CA, USA, 2017; Volume 30.
64. Chaudhry, A.; Dokania, P.K.; Ajanthan, T.; Torr, P.H. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2018; pp. 532–547.
65. Riemer, M.; Cases, I.; Ajemian, R.; Liu, M.; Rish, I.; Tu, Y.; Tesauro, G. Learning to Learn without Forgetting by Maximizing Transfer and Minimizing Interference. arXiv 2018, arXiv:1810.11910.

Figure 1. Memory-LGBM process architecture.

Figure 2. Conceptual illustration of SMOTE and ADASYN oversampling strategies.

Figure 3. Memory-LGBM architecture.

Figure 4. Forgetting measure (FM) vs. memory slot count across three datasets. (↓ indicates that lower values represent better performance).

Figure 5. Training time vs. epoch counts across different continual learning methods.

Figure 6. Memory usage vs. epoch count across different continual learning methods.

Table 2. Representative key features per attack type extracted via DOA.

Attack Type	Feature	Description
DDoS	Flow Packets/s, Total Fwd Packets, Bwd Packet Length Std, SYN Flag Count	A high number of packets per second, low response rates, and an increased proportion of SYN flood-related flags were observed.
Botnet	Active Mean, Idle Max, Flow Duration, Fwd IAT Mean	The traffic pattern exhibited periodic communication attempts and sudden burst transmissions following prolonged idle intervals.

Table 3. CICIDS 2017 dataset attack types and descriptions.

Class	Attack Type	Description
Normal		Legitimate and benign traffic without any malicious activity.
Anomaly	DDOS attack-HOIC	High-rate HTTP-based DoS attack.
	DoS attacks-Hulk	Flooding the server with HTTP requests.
	Bot	Automated malicious traffic from infected hosts.
	FTP-BruteForce	Repeated login attempts to FTP server.
	SSH-Bruteforce	Repeated login attempts to SSH service.
	Infiltration	Unauthorized access into internal network.
	DoS attacks-SlowHTTPTest	Slow HTTP request attack to exhaust server resources.
	DoS attacks-GoldenEye	High-load HTTP DoS variant.
	DoS attacks-Slowloris	Holding many connections open to exhaust web server.
	DDOS attack-LOIC-UDP	UDP flooding using LOIC tool.
	Brute Force-Web	Brute force attack on web login forms.
	Brute Force-XSS	Cross-site scripting exploitation attempts.
	SQL Injection	Database query manipulation via SQL injection.

Table 4. Hyperparameter settings used in the proposed Memory-LGBM framework.

Component	Parameter	Value/Range	Rationale
DOA	Population size	30	Ensures sufficient exploration of feature space while maintaining computational efficiency
	Iterations	50	Provides convergence without excessive runtime overhead
	Hunting/Social balance	0.5/0.5	Balanced to avoid premature convergence and maintain diversity
ADASYN	Nearest neighbor number (K)	5	Standard setting in imbalanced learning; empirically stable for network intrusion datasets
	Sampling ratio	Adaptive	Automatically adjusted to balance minority/majority classes
Memory module	Weight factor (α)	0.85–0.95	Maintains stability of prototypes while allowing incremental adaptation
LGBM	Learning rate	0.05	Provides balance between convergence speed and generalization
	Max depth	8	Controls overfitting in high-dimensional feature space
	Number of leaves	31	Default effective setting for structured tabular data
	Boosting rounds	100	Sufficient for convergence across datasets

Table 5. Continual learning performance on CICIDS 2017.

Method	ACC (↑)	FM (↓)	LA (↑)
Seq-FT	$72.4 \pm 1.3$	$13.1 \pm 1.1$	$83.8 \pm 1.3$
ER	$82.7 \pm 0.9$	$8.2 \pm 0.7$	$85.6 \pm 0.8$
DER++	$84.3 \pm 0.8$	$6.1 \pm 0.6$	$87.1 \pm 0.7$
Memory-LGBM	$89.5 \pm 0.5$	$2.7 \pm 0.4$	$92.1 \pm 0.5$

Table 6. Continual learning performance on NSL-KDD. ↑ indicates that higher values represent better performance, whereas ↓ indicates that lower values are preferable.

Method	ACC (↑)	FM (↓)	LA (↑)
Seq-FT	$75.1 \pm 1.2$	$12.3 \pm 1.0$	$84.7 \pm 1.1$
ER	$85.6 \pm 0.8$	$7.1 \pm 0.6$	$87.5 \pm 0.7$
DER++	$86.9 \pm 0.7$	$5.6 \pm 0.5$	$88.8 \pm 0.6$
Memory-LGBM	$89.2 \pm 0.5$	$2.5 \pm 0.4$	$91.3 \pm 0.5$

Table 7. Continual learning performance on UNSW-NB15. ↑ indicates that higher values represent better performance, whereas ↓ indicates that lower values are preferable.

Method	ACC (↑)	FM (↓)	LA (↑)
Seq-FT	$70.8 \pm 1.3$	$13.7 \pm 1.2$	$82.2 \pm 1.1$
ER	$81.5 \pm 1.0$	$8.6 \pm 0.8$	$84.4 \pm 0.9$
DER++	$83.4 \pm 0.9$	$6.2 \pm 0.7$	$86.5 \pm 0.8$
Memory-LGBM	$88.1 \pm 0.6$	$2.8 \pm 0.4$	$91.0 \pm 0.5$

Table 8. F1-score comparison of continual learning methods across datasets.

Method	CICIDS 2017	NSL-KDD	UNSW-NB15
Seq-FT	$72.0 \pm 2.0$	$75.0 \pm 2.2$	$71.0 \pm 2.0$
ER	$82.0 \pm 1.0$	$85.0 \pm 1.0$	$82.0 \pm 1.0$
DER++	$86.0 \pm 1.0$	$87.0 \pm 1.0$	$84.0 \pm 1.0$
Memory-LGBM	$92.0 \pm 1.0$	$91.0 \pm 1.0$	$90.0 \pm 1.0$

Table 9. F1-score comparison of Memory-LGBM and its ablation variants (without DOA, without ADASYN, SMOTE in place of ADASYN, and RF or XGBoost in place of LGBM) across the CICIDS 2017, NSL-KDD, and UNSW-NB15 datasets.

Variant	CICIDS 2017	NSL-KDD	UNSW-NB15
Memory-LGBM	$92.0$	$91.0$	$90.0$
w/o DOA	$87.5$	$86.2$	$85.1$
w/o ADASYN	$88.3$	$87.4$	$86.0$
SMOTE instead of ADASYN	$86.2$	$86.1$	$84.8$
RF instead of LGBM	$86.7$	$85.3$	$84.2$
XGB instead of LGBM	$85.9$	$84.6$	$83.5$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
