Continual Learning for Intrusion Detection Under Evolving Network Threats

Guo, Chaoqun; Li, Xihan; Cheng, Jubao; Yang, Shunjie; Gong, Huiquan

doi:10.3390/fi17100456


by Chaoqun Guo †, Xihan Li †, Jubao Cheng, Shunjie Yang and Huiquan Gong *

School of Software Engineering, Beijing Jiaotong University, Beijing 100044, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Future Internet 2025, 17(10), 456; https://doi.org/10.3390/fi17100456

Submission received: 14 August 2025 / Revised: 1 October 2025 / Accepted: 2 October 2025 / Published: 4 October 2025


Abstract

In the face of ever-evolving cyber threats, modern intrusion detection systems (IDS) must achieve long-term adaptability without sacrificing performance on previously encountered attacks. Traditional IDS approaches often rely on static training assumptions, making them prone to forgetting old patterns, underperforming in label-scarce conditions, and struggling with imbalanced class distributions as new attacks emerge. To overcome these limitations, we present a continual learning framework tailored for adaptive intrusion detection. Unlike prior methods, our approach is designed to operate under real-world network conditions characterized by high-dimensional, sparse traffic data and task-agnostic learning sequences. The framework combines three core components: a clustering-based memory strategy that selectively retains informative historical samples using DP-Means; multi-level knowledge distillation that aligns current and previous model states at output and intermediate feature levels; and a meta-learning-driven class reweighting mechanism that dynamically adjusts to shifting attack distributions. Empirical evaluations on benchmark intrusion detection datasets demonstrate the framework’s ability to maintain high detection accuracy while effectively mitigating forgetting. Notably, it delivers reliable performance in continually changing environments where the availability of labeled data is limited, making it well-suited for real-world cybersecurity systems.

Keywords: intrusion detection; continual learning; semi-supervised learning; class imbalance; knowledge distillation; meta-learning

1. Introduction

In open and continuously evolving network environments, intrusion behaviors are characterized by continuous emergence, diverse attack modalities, and sparse distribution patterns. As adversaries refine their techniques, IDS must not only swiftly recognize known threats but also adapt in real time to detect novel and evolving attack types. However, traditional static modeling approaches, which rely on fixed task settings and fully labeled datasets, are ill-suited for dynamic deployment scenarios due to the following three core limitations:

Catastrophic forgetting. Conventional models often suffer from catastrophic forgetting, where learning new tasks leads to the degradation of performance on previously learned tasks. This is particularly problematic in intrusion detection, where the system must retain knowledge of both historical and emerging threats.
Label scarcity. In real-world scenarios such as industrial control systems and Internet of Things (IoT) infrastructures, most traffic data are collected in raw form without labels. The high cost and time requirements of manual annotation render only a small portion of these data usable for supervised learning, leaving vast amounts of valuable unlabeled traffic underutilized and limiting the system’s capacity to capture latent attack behaviors.
Class imbalance and distribution bias. During model learning, new classes typically dominate the training stream in terms of volume and frequency. This imbalance biases the model towards recent classes, leading to the erosion of performance on earlier, potentially critical categories—a particularly severe problem for the detection of rare but impactful attacks.

To mitigate these challenges, recent studies have begun to explore the integration of Continual Learning (CL) [1,2] and Semi-Supervised Learning (SSL) [3,4] in the domain of intrusion detection. CL aims to maintain long-term knowledge retention through techniques such as memory replay, knowledge distillation, and regularization-based constraints. SSL, on the other hand, enhances the exploitation of unlabeled data through pseudo-labeling, consistency training, and contrastive learning, thereby improving generalization under limited supervision.

Despite their promise, most existing CL and SSL methods are developed in the context of computer vision or few-shot classification and are not readily applicable to the unique properties of network traffic, including high-dimensional feature spaces, inter-class similarity, and evolving data distributions. Furthermore, real-world deployment issues such as blurred task boundaries, non-stationary label growth, and low overlap between classes are frequently overlooked, leading to performance degradation when these methods are directly transferred to the IDS domain.

To address these limitations, we propose a dynamic intrusion detection framework that integrates SSL with CL through three key mechanisms. Our main contributions are as follows:

Dynamic memory construction for discriminative knowledge retention. We design a data-driven memory selection mechanism tailored to the sparsity and high dimensionality of network traffic. By employing DP-Means clustering and uncertainty-based sampling, the model selectively retains both representative class centers and high-entropy boundary samples, enabling effective consolidation of past knowledge while facilitating the learning of new attack patterns.
Multi-level knowledge distillation for robust representation alignment. We propose a hierarchical knowledge alignment strategy that distills knowledge at three levels—output logits, attention maps, and feature representations. This multi-faceted alignment mitigates the risks of pseudo-label noise and catastrophic forgetting, ensuring stable representation learning across task transitions.
Meta-learning-based dynamic weighting for class imbalance mitigation. To address dynamic class distribution shifts and long-tailed data challenges, we incorporate a meta-learning-driven adaptive weighting mechanism. This component adjusts loss weights and bias corrections based on real-time category frequency statistics, thereby enhancing model robustness against rare or under-represented attack types.

These innovations collectively provide a scalable and generalizable solution for intrusion detection in dynamic, label-scarce, and imbalanced network environments. Extensive experiments across multiple real-world datasets demonstrate that our framework consistently outperforms state-of-the-art methods in terms of catastrophic forgetting suppression, minority class recognition, and generalization under limited supervision, validating its practical deployment potential.

2. Related Work

2.1. Continual Learning and Catastrophic Forgetting Mitigation

CL empowers models to continuously acquire new knowledge from sequential data streams without relying on full access to historical data [1,5]. This property is essential for IDS, where attacks evolve over time and novel threats, such as Advanced Persistent Threats (APT) [6] and Zero-Day Exploits [7], can emerge unpredictably. Existing CL methods are typically categorized into the four paradigms below [2].

Regularization-Based: Approaches such as Elastic Weight Consolidation (EWC) [8,9] and Synaptic Intelligence (SI) [10] prevent forgetting by penalizing significant deviations from previously important parameters. These methods embed knowledge retention constraints directly into the optimization process.

Replay-Based: Techniques like iCaRL [11] and GDumb [12] maintain a fixed memory buffer of representative past samples for experience replay [13], enabling the model to revisit prior knowledge during training. Recent extensions tailored to network traffic include metric-based meta-learning for few-shot classes [14,15], and the use of conditional VAEs for generating polymorphic attack variants [16].

Knowledge Distillation-Based: Frameworks such as Learning without Forgetting (LwF) [17] employ output alignment via KL divergence between old and new model predictions. Extensions like PODNet [18] go further by enforcing feature-level structural consistency, preserving representational space continuity across tasks.

Hybrid: Recent works [19,20] combine pseudo-labeling, consistency regularization, and knowledge distillation to tackle non-IID unlabeled data in CL settings, offering a more comprehensive adaptation strategy.

While these methods have demonstrated success in domains like image classification, they encounter the following critical limitations when applied to intrusion detection:

High inter-class similarity: Network attacks often exhibit high similarity across classes, leading to confusion and reduced discriminative power. This is particularly problematic for replay-based methods that rely on distinct class boundaries.
Sparse, high-dimensional feature spaces: Network traffic data are typically sparse and high-dimensional, which complicates the replay efficiency of memory-based methods and exacerbates memory constraints.
Rigid task-boundary assumptions: Many existing CL frameworks assume clear task boundaries, which are often not present in real-world network environments.

These limitations highlight the need for a class-agnostic, sample-adaptive, and representation-aligned CL strategy specifically designed for intrusion detection. An effective solution must not only retain historical knowledge but also dynamically adapt to non-stationary, imbalanced, and label-sparse traffic distributions—a gap that this work aims to fill.

2.2. Semi-Supervised Learning in Intrusion Detection

SSL has emerged as a powerful approach to address the label scarcity problem by leveraging large volumes of unlabeled data alongside limited labeled samples [21]. In the context of intrusion detection, this is particularly valuable. Network traffic is continuously collected, but accurate annotation requires expert knowledge and labor-intensive procedures such as behavioral analysis and attack forensics [4,22]. This makes labeled data both scarce and costly, whereas unlabeled traffic is abundant but underutilized. In recent years, several SSL strategies have been applied to the intrusion detection domain [23], including the ones below.

Pseudo-Labeling Approaches: Methods such as FixMatch [24] and Pseudo-Label [25] generate pseudo-labels for unlabeled data by selecting high-confidence predictions and incorporating them into the supervised training loss. While effective under class-balanced and well-separated distributions, these approaches are highly susceptible to pseudo-label noise, especially in the presence of novel or ambiguous attack types.

Consistency Regularization: Techniques like UDA [26] and Mean Teacher [27] apply perturbations to input samples and enforce prediction consistency across augmented views. These methods enhance model stability and robustness but rely heavily on the quality of data augmentation, which can be challenging to define in high-dimensional network traffic.

Contrastive Learning and Clustering-Based Methods: Recent approaches such as MixMatch [28] and SimCLR [29] aim to learn robust semantic representations by encouraging clustering in the feature space or contrasting similar/dissimilar pairs. These strategies are well-suited for complex behavior modeling in intrusion detection, where attacks may form latent structures in the embedding space.

Despite their promise, applying SSL to IDS presents the following domain-specific challenges:

Pseudo-label bias caused by class imbalance leads to the overfitting of majority classes and neglect of rare but critical attacks.
Extreme scarcity of novel attack samples in early task stages limits the availability of reliable pseudo-supervisory signals.
Dynamic and non-stationary data distributions undermine the stability of SSL algorithms, especially over long task sequences.

To effectively deploy SSL in practical IDS, it is imperative to redesign pseudo-labeling strategies, regularization objectives, and sample selection mechanisms, particularly when integrating with CL frameworks. This ensures stable and adaptive learning in environments characterized by evolving threats, weak supervision, and severe class imbalance.

2.3. Integration of Continual Learning and Semi-Supervised Learning

The integration of CL and SSL has recently emerged as a promising direction for constructing adaptive intrusion detection models under low-label and dynamic conditions. The overarching goal of such integration is to leverage the abundance of unlabeled data in evolving task sequences to enhance recognition of new or rare attack categories while preserving knowledge of previously encountered threats.

However, combining CL and SSL is inherently non-trivial due to their conflicting optimization objectives. CL emphasizes stability, i.e., retaining prior knowledge over time, whereas SSL prioritizes plasticity, i.e., rapidly adapting to new, unlabeled data. Achieving a robust balance between these competing goals requires addressing several fundamental challenges introduced below.

Pseudo-label reliability under non-IID task streams. In practical settings, tasks are often non-independent and non-identically distributed—with each new task introducing distinct and potentially disjoint category distributions. Naïve pseudo-labeling in such scenarios may introduce significant bias, leading to error accumulation and model drift, especially when rare or emerging attack types are involved.

Loss-weight coordination between knowledge retention and knowledge acquisition. SSL frameworks typically encourage aggressive incorporation of unlabeled data into the learning process. However, without proper weighting, this can distort the decision boundaries of previously learned classes, undermining CL’s objective of long-term memory stability. A joint optimization strategy is required to balance the influence of supervised, unsupervised, and replayed samples.

Dynamic sample selection from unlabeled streams. Not all unlabeled samples are equally beneficial for learning. The challenge lies in identifying and selecting informative and reliable samples from a large, heterogeneous pool to be incorporated into the incremental update. Poor sample selection can waste computational resources or even degrade model performance due to noise and redundancy.

Several recent studies have made initial attempts to bridge these gaps. For example, Lechat et al. [30] proposed decoupling pseudo-label generation from incremental updates, while Cermelli et al. [31] introduced verification modules to filter low-confidence pseudo-labels in the CL context. Nevertheless, these solutions are predominantly developed for image-based tasks and lack adaptation to the structural characteristics of network traffic, such as high feature dimensionality, inter-class similarity, and temporal evolution.

These limitations underscore the need for domain-specific integration mechanisms that jointly optimize CL and SSL under the constraints of real-world intrusion detection. This paper addresses this gap by proposing a unified framework that tightly couples memory-efficient incremental updates with robust semi-supervised training strategies, specifically tailored for network traffic data.

3. Methodology

The proposed method aims to construct an intrusion detection framework with CL capabilities, enabling it to incrementally learn new attack categories while effectively retaining knowledge of previously encountered threats. In addition, the framework is designed to leverage large-scale unlabeled traffic data to enhance model robustness under limited supervision. To this end, we propose a unified approach (see Figure 1) that integrates SSL with CL and is composed of three key modules: a dynamic memory selection module for compact and discriminative historical sample retention; a multi-level knowledge alignment module to mitigate catastrophic forgetting across representational hierarchies; and a meta-learning-based dynamic weighting module to address class imbalance and non-stationary distributions.

3.1. Problem Definition

Define the intrusion detection process as a chronologically ordered sequence of incremental tasks $T = \{T_{0}, T_{1}, \ldots, T_{t-1}, T_{t}\}$. Each task $T_{t}$ introduces a new category subset $C_{t}$, where $C_{i} \cap C_{j} = \emptyset$ for any $i \neq j$. The dataset for each task is composed of both labeled and unlabeled samples: $D_{t} = D_{t}^{L} \cup D_{t}^{U}$, where $D_{t}^{L} = \{(x_{i}, y_{i})\}_{i=1}^{n_{l}}$ denotes the labeled subset, and $D_{t}^{U} = \{(x_{j}, \cdot)\}_{j=1}^{n_{u}}$ represents the unlabeled samples. Typically, the number of unlabeled samples far exceeds the labeled ones ($n_{u} \gg n_{l}$). To mitigate forgetting, the model maintains a fixed-size memory buffer $M_{t-1}$, which stores representative samples from previous tasks. During training on task $T_{t}$, the objective is to learn a classifier $f_{\theta_{t}}$ that satisfies the following criteria:

Accurate recognition of new classes: for any $x \in D_{t}^{L}$, the model should predict $f_{\theta_{t}}(x) = y$, where $y \in C_{t}$.
Retention of historical knowledge: for all previously encountered labeled data $x \in D_{t-1}^{L}$, the model should maintain high accuracy on $y \in \bigcup_{i=1}^{t-1} C_{i}$.
Utilization of unlabeled samples: the model should effectively extract discriminative representations from $D_{t}^{U}$ using unsupervised learning objectives to enhance generalization and decision boundaries.

The overall training objective at task $T_{t}$ is formulated as the joint minimization of four complementary loss components,

$$\min_{\theta_{t}} L_{t} = \sum_{i \in \Lambda} \lambda_{i} L_{i}, \quad \Lambda = \{\mathrm{sup}, \mathrm{unsup}, \mathrm{align}, \mathrm{meta}\}, \tag{1}$$

where $L_{i}$ denotes the loss component for term $i \in \Lambda$ and $\lambda_{i}$ is its weighting coefficient.

3.2. Method Overview

This section details the overall architecture and key modules of the proposed framework, which integrates SSL and CL for robust, adaptive intrusion detection.

3.2.1. Dynamic Memory Construction and Sample Selection Strategy

A central challenge in CL is how to retain representative historical knowledge without storing excessive data. Traditional memory-based approaches such as iCaRL [11] typically rely on fixed-ratio sampling, uniform selection, or class center exemplars. While effective in simple domains, these strategies are insufficient for high-dimensional and imbalanced network traffic, as they fail to preserve discriminative boundary information critical for distinguishing subtle attack patterns.

To overcome this limitation, we introduce a hybrid memory selection mechanism that combines online clustering via DP-Means with uncertainty-aware sampling. Specifically, DP-Means clustering adaptively partitions the feature space without requiring a predefined number of clusters. It dynamically adjusts the number of modes based on data density, making it suitable for the evolving category structures encountered in real-world intrusion scenarios. High-entropy sample selection is then employed to capture ambiguous or decision-critical points near boundaries, which are typically underrepresented by class centers. These two components ensure that the memory buffer $M_{t}$ maintains both semantic diversity and decision-boundary sensitivity.

DP-Means Clustering: As a nonparametric algorithm derived from the infinite Gaussian Mixture Model, DP-Means automatically determines the number of clusters K by minimizing the following objective:

$$\min_{\{C_{k}\}_{k=1}^{K}} \sum_{k=1}^{K} \sum_{x_{i} \in C_{k}} \| x_{i} - \mu_{k} \|^{2} + \lambda K, \tag{2}$$

where $C_{k}$ is the k-th cluster, $\mu_{k}$ denotes its centroid, and $\lambda$ is a penalty coefficient that controls the trade-off between clustering tightness and the number of clusters. A larger $\lambda$ encourages fewer, more generalized clusters, while a smaller $\lambda$ allows for finer granularity to capture subtle variations in attack patterns. The algorithm iteratively assigns samples to the nearest centroid, opens a new cluster whenever a sample's squared distance to every existing centroid exceeds $\lambda$, and updates centroids until convergence.
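The assign-and-update loop described above can be sketched in a few lines of NumPy. This is a minimal batch illustration rather than the authors' implementation: the function name `dp_means`, the single-centroid initialization, and the `max_iter` cap are assumptions made for the example.

```python
import numpy as np

def dp_means(X, lam, max_iter=100):
    """Batch DP-Means: Lloyd-style updates plus a rule that opens new clusters.

    X   : (n, d) array of feature vectors
    lam : penalty on the number of clusters (squared-distance threshold)
    """
    X = np.asarray(X, dtype=float)
    centroids = X.mean(axis=0, keepdims=True)   # assumption: start from one global cluster
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        old_labels = labels.copy()
        for i, x in enumerate(X):
            d2 = ((centroids - x) ** 2).sum(axis=1)
            if d2.min() > lam:                  # farther than lam from every centroid
                centroids = np.vstack([centroids, x])
                labels[i] = len(centroids) - 1
            else:
                labels[i] = int(d2.argmin())
        # Drop empty clusters, recompute centroids, and relabel to 0..K-1.
        used = np.unique(labels)
        centroids = np.stack([X[labels == k].mean(axis=0) for k in used])
        labels = np.searchsorted(used, labels)
        if np.array_equal(labels, old_labels):
            break
    return centroids, labels
```

On two well-separated groups of points, a moderate `lam` yields two clusters, whereas a very large `lam` collapses everything into one, which mirrors the granularity trade-off described for $\lambda$.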

High-Entropy Sample Selection: To augment the memory with samples that improve boundary representation, we select samples with high predictive uncertainty, measured by entropy,

$$H(p(y \mid x)) = -\sum_{c=1}^{C} p(y = c \mid x) \log p(y = c \mid x). \tag{3}$$

For each cluster, samples are ranked by their entropy, and the top-$k_{e}$ samples are selected.

Selection Protocol: Under a fixed memory budget M, we allocate a portion $\alpha M$ to cluster centers and $(1 - \alpha) M$ to high-entropy samples. The final memory set $M_{t}$ is constructed by combining these two subsets.
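A possible reading of this protocol is sketched below. It is a simplified, global variant: samples are ranked over the whole task rather than per cluster, "cluster centers" are approximated by the samples nearest to their centroid, and the names `prediction_entropy` and `build_memory` are hypothetical.

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy of each row of an (n, C) probability matrix (Eq. (3))."""
    probs = np.asarray(probs, dtype=float)
    return -(probs * np.log(probs + eps)).sum(axis=1)

def build_memory(features, probs, labels, centroids, budget, alpha=0.5):
    """Return sorted indices of the samples kept under a budget of `budget` slots.

    About alpha * budget slots go to the samples closest to their cluster
    centroid; the remaining slots go to the highest-entropy samples that
    were not already chosen.
    """
    features = np.asarray(features, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.asarray(labels)
    n_center = int(round(alpha * budget))
    # Distance of every sample to the centroid of its own cluster.
    dist = np.linalg.norm(features - centroids[labels], axis=1)
    center_idx = np.argsort(dist)[:n_center]
    # Rank the remaining samples by entropy, highest first.
    entropy = prediction_entropy(probs)
    entropy[center_idx] = -np.inf
    boundary_idx = np.argsort(-entropy)[: budget - n_center]
    return np.sort(np.concatenate([center_idx, boundary_idx]))
```

Masking the already selected center samples before the entropy ranking keeps the two subsets disjoint, so the budget is not spent twice on the same sample.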

Complexity Analysis: The DP-Means algorithm has a time complexity of $O(T n K d)$ per task, where T is the number of iterations, n is the number of samples, K is the number of clusters, and d is the feature dimensionality. The entropy calculation adds $O(n C)$ complexity, where C is the number of classes. Sorting the entropies for ranking introduces an additional $O(n \log n)$ term. Overall, the memory selection process remains efficient and scalable for large datasets.

The combined objective for memory construction is formulated to encapsulate both goals,

$$L_{mem} = \underbrace{\sum_{k=1}^{K} \sum_{x_{i} \in C_{k}} \| x_{i} - \mu_{k} \|^{2}}_{\text{Modal Preservation}} + \beta \cdot \underbrace{\sum_{x_{j} \in \mathcal{H}} H(p(y \mid x_{j}))}_{\text{Boundary Expression}}, \tag{4}$$

where $\mathcal{H}$ is the set of selected high-entropy samples. The first term ensures cluster compactness and core semantic preservation, while the second term emphasizes uncertainty-aware boundary representation, with $\beta$ as a balancing coefficient. This dual-objective mechanism ensures the framework preserves both the core semantics and critical marginal information of past tasks, enhancing long-term stability and adaptive capacity. During this process, the influence of $\lambda$ (in Equation (2)) is already indirectly reflected in the construction of the memory through the clustering structure it determines (K and $\mu_{k}$). A larger $\lambda$ produces fewer, more generalized clusters, resulting in fewer central points in the memory; a smaller $\lambda$ produces more, finer clusters, resulting in more central points in the memory. Therefore, $\lambda$ controls the granularity of the “semantic diversity” of the memory, but does not itself need to appear in the final selection criteria.

3.2.2. Multi-Level Knowledge Alignment Mechanism

In dynamic intrusion detection, the model must continuously incorporate new attack patterns while maintaining accurate recognition of previously learned threats. Traditional CL methods, such as LwF [17], implement output-level distillation by aligning the logits of new and old models. However, this single-layer knowledge retention strategy suffers from several limitations: shallow alignment at the output layer often leads to feature drift in intermediate layers; the model’s attention focus on critical fields (e.g., specific ports, time windows) may shift over time; and inter-class relationships in the embedding space may become distorted by task transitions, compromising the semantic structure of previously learned categories.

To address these limitations, we propose a multi-level knowledge alignment mechanism that performs joint distillation across three hierarchical levels.

Logits-Level Alignment Module: We first align the predicted class distributions of the current model $f_{new}$ and the frozen historical model $f_{old}$ by minimizing the Kullback–Leibler (KL) divergence between their softened outputs,

$$L_{logits} = T^{2} \cdot \mathrm{KL}\left(\sigma(z^{old} / T) \,\|\, \sigma(z^{new} / T)\right), \tag{5}$$

where $z^{old}$ and $z^{new}$ are the logits from $f_{old}$ and $f_{new}$ for the same input; $\sigma(\cdot)$ is the softmax function; and T is a temperature scaling parameter to smooth the output distribution. This term helps stabilize class decision boundaries, particularly in cases with overlapping categories (e.g., DoS vs. DDoS), where misalignment can severely impact performance.
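Equation (5) can be illustrated with a small NumPy sketch. The function names, the default temperature of 2, and the averaging over the batch are assumptions for the example; the source does not state how the loss is reduced over samples.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logits_distillation_loss(z_old, z_new, T=2.0, eps=1e-12):
    """T^2 * KL(softmax(z_old / T) || softmax(z_new / T)), averaged over the batch."""
    p_old = softmax(z_old, T)
    p_new = softmax(z_new, T)
    kl = (p_old * (np.log(p_old + eps) - np.log(p_new + eps))).sum(axis=-1)
    return float(T ** 2 * kl.mean())
```

The loss is zero when the two models produce identical logits and grows as the new model's softened distribution drifts away from the old one.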

Attention-Level Alignment Module: To preserve the semantic focus of the model, we align intermediate attention maps extracted from corresponding layers of the old and new models. The Frobenius norm is used to minimize the discrepancy,

$$L_{attn} = \frac{1}{L} \sum_{l=1}^{L} \left\| A_{old}^{(l)} - A_{new}^{(l)} \right\|_{F}^{2}, \tag{6}$$

where $A^{(l)}$ denotes the attention weight matrix at layer l, and L is the total number of aligned layers. By maintaining consistent focus across tasks, this alignment enhances the model’s robustness and interpretability, especially in temporal or protocol-sensitive detection scenarios.

Feature-Level Alignment Module: To preserve the geometric and semantic structure of the learned representation space, we apply the Barlow Twins loss to the high-level feature embeddings $Z^{old}$ and $Z^{new}$. The loss function is defined as

$$L_{bt} = \sum_{i} (1 - C_{ii})^{2} + \gamma \sum_{i \neq j} C_{ij}^{2}, \tag{7}$$

where C is the cross-correlation matrix between normalized embeddings of the two models; $C_{ii}$ enforces diagonal dominance (invariance), and $C_{ij}$ suppresses off-diagonal redundancy (decorrelation); and $\gamma$ controls the regularization strength. This alignment helps retain cluster structures in the latent space (e.g., ensuring DDoS and PortScan remain adjacent yet distinguishable), which is crucial for semantic consistency across tasks.

The total alignment loss integrates the three components,

$$L_{align} = \sum_{i \in \Lambda} \lambda_{i} L_{i}, \quad \Lambda = \{\mathrm{logits}, \mathrm{attn}, \mathrm{bt}\}, \tag{8}$$

where $\lambda_{i}$ are tunable hyperparameters that balance the contribution of each term. This multi-level alignment strategy ensures knowledge preservation across the model hierarchy, alleviates representation degradation, and significantly improves continual generalization under evolving attack scenarios.
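The attention-level and feature-level terms, together with the weighted sum of Equation (8), might be computed as in the following sketch. The function names, the default $\gamma$ of 5e-3, the per-dimension standardization over the batch, and the unit weights are placeholders chosen for the example, not values reported in this paper; the logits-level term is taken as a precomputed scalar.

```python
import numpy as np

def attention_alignment_loss(attn_old, attn_new):
    """Mean squared Frobenius distance over the aligned layers (Eq. (6))."""
    return float(np.mean([np.sum((np.asarray(a_o) - np.asarray(a_n)) ** 2)
                          for a_o, a_n in zip(attn_old, attn_new)]))

def barlow_twins_loss(z_old, z_new, gamma=5e-3, eps=1e-8):
    """Eq. (7) on (n, d) embeddings produced by the old and new models."""
    z_old = np.asarray(z_old, dtype=float)
    z_new = np.asarray(z_new, dtype=float)
    n = z_old.shape[0]
    # Standardize each feature dimension over the batch.
    a = (z_old - z_old.mean(axis=0)) / (z_old.std(axis=0) + eps)
    b = (z_new - z_new.mean(axis=0)) / (z_new.std(axis=0) + eps)
    c = a.T @ b / n                                   # (d, d) cross-correlation matrix
    on_diag = np.sum((1.0 - np.diag(c)) ** 2)         # invariance term
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)  # decorrelation term
    return float(on_diag + gamma * off_diag)

def alignment_loss(l_logits, l_attn, l_bt, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three alignment terms (Eq. (8))."""
    return weights[0] * l_logits + weights[1] * l_attn + weights[2] * l_bt
```

When the two models produce identical, mutually uncorrelated embedding dimensions, the cross-correlation matrix is the identity and the feature-level term vanishes; sign-flipped embeddings push the diagonal to -1 and are penalized heavily.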

3.2.3. Meta-Learning Driven Dynamic Weighting Mechanism

In incremental intrusion detection, the model is exposed to a non-stationary data stream, where category distribution varies significantly across sequential tasks. These variations manifest in several forms: new class emergence (previously unseen attack types are introduced at each stage); category frequency imbalance (rare or critical attacks appear only sporadically); and category overlap and drift (some classes persist across tasks, interfering with boundary preservation).

Traditional CL methods often assume fixed class priors and therefore fail to adapt to such dynamic shifts. As a result, majority classes dominate the learning process, while minority classes, especially newly introduced or rare attack types, are insufficiently learned. This imbalance exacerbates catastrophic forgetting and degrades detection performance in realistic settings, particularly under low-label or few-shot scenarios. To address this issue, we propose a meta-learning-driven dynamic weighting mechanism, which jointly learns class-specific loss weights and bias correction terms based on real-time category statistics. This module enables adaptive rebalancing of model optimization to improve generalization on imbalanced, evolving attack distributions.

Loss Weight Prediction via Meta-Learning: Let $p_{c}$ denote the empirical frequency of class c over the union of current task data and memory buffer,

$$p_{c} = \frac{\left| \{(x_{i}, y_{i}) \in D_{t} \cup M_{t-1} \mid y_{i} = c\} \right|}{\left| D_{t} \cup M_{t-1} \right|}. \tag{9}$$

We then predict the class-specific loss weight $w_{c}$ using a learnable function (modeled as a small MLP),

$$w_{c} = \mathrm{Softplus}(\mathrm{MLP}(p_{c})). \tag{10}$$

The Softplus activation ensures $w_{c} > 0$ to prevent negative loss scaling. This allows the model to dynamically upweight underrepresented classes and downweight overrepresented ones based on real-time distribution. This design is inspired by long-tail classification techniques such as Class-Balanced Loss [32] and LDAM [33], but is extended with meta-learning to provide task-adaptive flexibility.
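The forward computation of Equations (9) and (10) is sketched below. The MLP shape (one tanh hidden layer applied to each scalar $p_{c}$) and all parameter names are assumptions; the parameters would be trained by the meta-learning procedure, which is not shown here, and the labels are assumed to include pseudo-labels wherever ground truth is missing.

```python
import numpy as np

def class_frequencies(labels, num_classes):
    """Empirical frequency p_c over the pooled labels of D_t and M_{t-1} (Eq. (9))."""
    counts = np.bincount(np.asarray(labels), minlength=num_classes)
    return counts / counts.sum()

def softplus(x):
    # log(1 + exp(x)), computed in a numerically stable way.
    return np.logaddexp(0.0, x)

def predict_class_weights(p, W1, b1, W2, b2):
    """w_c = Softplus(MLP(p_c)) with a one-hidden-layer MLP applied per class (Eq. (10)).

    p  : (C,) class frequencies
    W1 : (1, hidden), b1 : (hidden,), W2 : (hidden, 1), b2 : (1,)
    """
    h = np.tanh(p[:, None] @ W1 + b1)      # (C, hidden)
    return softplus(h @ W2 + b2).ravel()   # (C,), positive by construction
```

With untrained parameters the weights carry no meaning beyond being positive; the intended behavior (larger weights for rarer classes) would have to emerge from training.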

Bias Correction in the Logit Space: To further reduce class bias, we introduce a logit-level adjustment term $b_{c}$ for each class c,

$$b_{c} = b_{c}^{(t-1)} + \eta \cdot \log\left(\frac{p_{c}}{\sum_{k} p_{k} + \epsilon}\right). \tag{11}$$

This bias term acts as a prior correction, counteracting class frequency dominance by shifting the decision boundary in favor of underrepresented categories. The update step accumulates frequency information across tasks to maintain temporal smoothness.

Combined Weighted Loss Formulation: The final objective combines both weighting and bias adjustment into a unified loss for both labeled and confidently pseudo-labeled data,

$$L_{meta} = -\sum_{(x, y) \in D_{t}^{L} \cup D_{t}^{U^{*}}} w_{y} \cdot \log p(y \mid x; \theta + b), \tag{12}$$

where $D_{t}^{U^{*}}$ is the subset of pseudo-labeled unlabeled data filtered by confidence thresholding; $w_{y}$ is the learned importance weight for class y; and b is the class-wise bias vector applied to logits before softmax. This dynamic loss formulation allows the model to focus learning on critical and underrepresented attack types, prevent new/old class competition, and improve robustness under evolving label distributions.
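Reading "$\theta + b$" as adding the class-wise bias vector to the logits before the softmax, Equation (12) can be sketched as follows. The function name and argument layout are hypothetical, and the weights and bias are taken as given inputs.

```python
import numpy as np

def meta_weighted_loss(logits, targets, class_weights, class_bias):
    """Eq. (12): class-weighted negative log-likelihood on bias-adjusted logits.

    logits        : (n, C) raw model outputs
    targets       : (n,) true labels or accepted pseudo-labels
    class_weights : (C,) weights w_c
    class_bias    : (C,) bias vector b added to the logits before softmax
    """
    targets = np.asarray(targets)
    class_weights = np.asarray(class_weights, dtype=float)
    z = np.asarray(logits, dtype=float) + np.asarray(class_bias, dtype=float)
    z = z - z.max(axis=1, keepdims=True)              # stabilize the log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(np.sum(class_weights[targets] * nll))
```

For two samples with uniform logits over two classes, zero bias, and weights (1, 3), each sample contributes $\log 2$ scaled by its class weight, so the loss is $4 \log 2$.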

3.2.4. Confidence-Aware Pseudo-Label Filtering and Enhancement

In dynamic intrusion detection under semi-supervised settings, pseudo-label quality directly affects the generalization performance of the model, especially for minority and ambiguous classes. Noisy pseudo-labels can mislead the learning process and amplify category bias in incremental training. To address this, we propose a confidence-aware pseudo-label filtering mechanism that integrates prediction entropy, agreement consistency, and temporal validation to ensure the reliability of unlabeled supervision.

Entropy-Based Confidence Filtering: Given an unlabeled sample $x \in D_{t}^{U}$, the model computes its predicted class distribution $p(y \mid x; \theta)$ and the corresponding prediction entropy,

$$H(x) = -\sum_{c=1}^{C} p(y = c \mid x) \cdot \log p(y = c \mid x). \tag{13}$$

We define a confidence threshold $\tau$ and retain only samples with low uncertainty,

$$\hat{D}_{t}^{U} = \{x \in D_{t}^{U} \mid H(x) < \tau\}. \tag{14}$$

This ensures that only high-confidence pseudo-labeled samples contribute to the unsupervised loss, thereby mitigating noise propagation in early training stages.

Historical Consistency Verification: To further improve label reliability, we design a consistency-based verification mechanism. We compare the pseudo-labels generated by the current model $f_{\theta_{t}}$ and the frozen model from the previous task $f_{\theta_{t-1}}$,

$$\mathrm{Agree}(x) = \mathbb{I}\left[\arg\max f_{\theta_{t}}(x) = \arg\max f_{\theta_{t-1}}(x)\right]. \tag{15}$$

Only samples that satisfy both entropy confidence and inter-task agreement are retained for training. The final filtered pseudo-labeled set is

$$\tilde{D}_{t}^{U^{*}} = \{x \in D_{t}^{U} \mid H(x) < \tau \ \text{and} \ \mathrm{Agree}(x) = 1\}. \tag{16}$$

This dual criterion helps eliminate unstable or newly emerged boundary samples that might otherwise induce pseudo-supervision noise.
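The dual criterion of Equations (13)–(16) amounts to a boolean mask over the unlabeled batch, as in this sketch. The function name is hypothetical, and the old model's output is assumed to be expressed over the same class set as the new model's (for example, padded for classes it has never seen).

```python
import numpy as np

def filter_pseudo_labels(probs_new, probs_old, tau, eps=1e-12):
    """Keep unlabeled samples whose entropy is below tau and whose predicted
    class agrees between the current and the previous model.

    probs_new, probs_old : (n, C) class probabilities from f_{theta_t} and f_{theta_{t-1}}
    Returns (kept_indices, pseudo_labels_of_kept_samples).
    """
    probs_new = np.asarray(probs_new, dtype=float)
    probs_old = np.asarray(probs_old, dtype=float)
    entropy = -(probs_new * np.log(probs_new + eps)).sum(axis=1)   # Eq. (13)
    pred_new = probs_new.argmax(axis=1)
    agree = pred_new == probs_old.argmax(axis=1)                    # Eq. (15)
    keep = np.flatnonzero((entropy < tau) & agree)                  # Eq. (16)
    return keep, pred_new[keep]
```

A sample can be confident under the current model and still be rejected when the previous model predicts a different class, which is the case the agreement check is meant to catch.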

Temporal Ensemble Smoothing: To stabilize predictions across tasks and training epochs, we apply a temporal ensemble strategy that maintains an exponential moving average of the predictions for each unlabeled sample,

$$\bar{p}_{t}(x) = \alpha \cdot \bar{p}_{t-1}(x) + (1 - \alpha) \cdot p(y \mid x; \theta_{t}), \tag{17}$$

where $\alpha \in [0, 1]$ controls historical memory. This improves label consistency across training rounds, especially for slowly evolving classes. The filtered pseudo-labeled samples $\tilde{D}_{t}^{U^{*}}$ are used to compute the unsupervised loss term

$$L_{unsup} = -\sum_{x \in \tilde{D}_{t}^{U^{*}}} \log p(\hat{y}_{x} \mid x; \theta). \tag{18}$$

This revised loss ensures that only high-confidence, consistent pseudo-labels contribute to model updates, enhancing learning stability and robustness in the presence of noisy, evolving data distributions.
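The moving average of Equation (17) and the loss of Equation (18) are simple enough to check by hand. The sketch below uses plain Python lists. Taking the pseudo-label as the arg max of the smoothed distribution is an assumption made for this example, and the function names and numbers are hypothetical.

```python
import math


def ema_update(prev_avg, current_probs, alpha):
    """One step of Eq. (17): blend the running average with the new prediction."""
    return [alpha * a + (1.0 - alpha) * p for a, p in zip(prev_avg, current_probs)]


def unsupervised_loss(prob_rows, pseudo_labels):
    """Eq. (18): summed negative log-likelihood of the pseudo-labels."""
    return -sum(math.log(probs[y]) for probs, y in zip(prob_rows, pseudo_labels))


# Toy example (made-up numbers): one sample, two classes, alpha = 0.9.
smoothed = ema_update([0.5, 0.5], [1.0, 0.0], alpha=0.9)  # close to [0.55, 0.45]
# Assumption: the pseudo-label is the arg max of the smoothed distribution.
pseudo_label = max(range(len(smoothed)), key=lambda c: smoothed[c])
loss = unsupervised_loss([[0.9, 0.1]], [pseudo_label])    # equals -log(0.9)
```

A value of alpha close to 1 makes the average change slowly, so a single noisy prediction barely moves the pseudo-label, while a small alpha tracks the current model more closely.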

Algorithm 1 demonstrates the semi-supervised incremental learning training process driven by multi-layer alignment and meta-weighting.

Algorithm 1: The Proposed Framework

4. Experiments

To comprehensively evaluate the performance of the proposed framework in terms of catastrophic forgetting mitigation, minority class recognition, and robust learning under label scarcity, we conduct extensive experiments on multiple real-world intrusion detection datasets. The effectiveness of our method is benchmarked against representative baselines across a variety of metrics, including classification accuracy, forgetting rate, robustness to class imbalance, and training efficiency.

4.1. Datasets and Evaluation Metrics

We select three widely used intrusion detection datasets, UNSW-NB15 [34], CIC-IDS2017 [35], and TON_IoT [36], to simulate realistic multi-task, label-incomplete, and heterogeneous network scenarios. Together, these datasets offer a diverse and challenging testbed for evaluating the proposed method’s adaptability and robustness.

UNSW-NB15 dataset [34]: This network intrusion detection benchmark was released by the University of New South Wales, Australia, in 2015 to overcome the limitations of traditional datasets (such as KDD99 and NSL-KDD) in terms of attack types and network environments. It contains about 2.5 million network traffic records that mix real campus network traffic with simulated attack data, covering 9 types of modern network attacks (such as vulnerability exploits, DoS, worms, and Web attacks) alongside normal traffic. It provides 49 fine-grained traffic features, including basic flow features, traffic statistics features, and content features. Compared with earlier datasets, UNSW-NB15 offers a greater diversity of attack types and better data balance, and it better reflects the current network threat landscape. However, some attack categories still have too few samples, which may affect a model’s generalization when detecting those specific attacks.

CIC-IDS2017 dataset [35]: This comprehensive network intrusion detection benchmark was released by the Canadian Institute for Cybersecurity in 2017 and is one of the most advanced benchmarks in current intrusion detection research. It simulates 15 types of modern network attacks (including brute-force cracking, DDoS, Web attacks, penetration attacks, and Heartbleed vulnerability attacks) in a real network environment while collecting normal background traffic, yielding a large-scale dataset of about 2.8 million network flow records. Compared with earlier datasets, CIC-IDS2017 provides richer traffic features, including flow duration, packet statistics, traffic behavior, and TLS/SSL encryption features, which reflect the characteristics of modern network threats more comprehensively. The dataset is organized as a time series, which supports the study of attack evolution patterns and makes it particularly suitable for deep learning-based intrusion detection methods. However, the dataset is large and some attack categories are imbalanced, so computing resource limits and data balance must be considered in practice. CIC-IDS2017 is often used together with datasets such as UNSW-NB15 and provides an important reference for evaluating intrusion detection systems in modern, complex network environments.

TON_IoT dataset [36]: This dedicated IoT security dataset was developed by the Cyber Security Team of the University of New South Wales, Australia, and is one of the most advanced benchmarks in IoT threat detection. Using a real experimental environment that contains smart home devices and industrial control systems, it simulates a variety of typical attacks against IoT (such as device hijacking, data leakage, and ransomware) and collects about 22 million multimodal security data records covering network-layer traffic (NetFlow/PCAP) and host-layer logs (Windows/Linux event logs). Compared with traditional network security datasets, the main advantage of TON_IoT is that it is designed specifically for IoT security scenarios. It includes protocol-level attacks (such as MQTT spoofing and CoAP flooding) and provides detailed attack stage annotations (reconnaissance, intrusion, and lateral movement), supporting end-to-end attack behavior analysis. Network traffic and host logs are recorded together, which provides unique data support for studying cross-layer attack detection methods. However, the dataset is large and some attack categories are imbalanced, so computing resource optimization and data balancing need to be considered in practice. TON_IoT is often used together with datasets such as Bot-IoT [37]. It provides an important reference for evaluating intrusion detection systems in IoT environments and is particularly suitable for developing lightweight security models under limited resources.

For each dataset, a standardized preprocessing pipeline was applied as follows:

Feature Selection and Normalization: We retained the relevant numerical and categorical features describing traffic statistics and flow behavior. Continuous features were normalized to [0, 1], and categorical features were one-hot encoded to reduce scale variance and stabilize training.

Handling Missing Values: Missing numerical values were replaced with the class mean, and missing categorical values with an “UNK” token to ensure consistent input dimensionality.

Label Harmonization: Attack labels were harmonized across datasets. Synonymous attack types were mapped to unified categories following established taxonomies [34,35] to mitigate labeling discrepancies.

Incremental Task Sequence Construction: Each dataset was partitioned into 5 sequential tasks based on chronological flow order, introducing new attack categories successively. To simulate blurred task boundaries, 30–50% of samples were shared between consecutive tasks. Earlier tasks introduced 3–5 categories, while later tasks added 1–2, creating an imbalanced distribution.

Label Scarcity Simulation: We randomly sampled 1%, 5%, 10%, and 20% of each task as labeled data, preserving the class distribution, and treated the remainder as unlabeled.
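Two of the preprocessing steps above, min-max normalization and the class-preserving labeled/unlabeled split, can be sketched with the Python standard library. This is only an illustration: the function names are hypothetical, and keeping at least one labeled sample per class is an assumption of the sketch rather than a detail reported here.

```python
import random
from collections import defaultdict


def min_max_scale(column):
    """Scale one continuous feature column to [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: map everything to 0
    return [(v - lo) / (hi - lo) for v in column]


def split_labeled_unlabeled(labels, ratio, seed=0):
    """Sample `ratio` of each class as labeled; the rest stays unlabeled."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    labeled = []
    for idxs in by_class.values():
        # Assumption: keep at least one labeled sample per class.
        k = max(1, round(ratio * len(idxs)))
        labeled.extend(rng.sample(idxs, k))
    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(labels)) if i not in labeled_set]
    return sorted(labeled), unlabeled


# Toy task with an 80/20 class split and a 10% labeling budget.
task_labels = ["benign"] * 80 + ["dos"] * 20
labeled_idx, unlabeled_idx = split_labeled_unlabeled(task_labels, ratio=0.10)
scaled = min_max_scale([2.0, 4.0, 6.0])
```

Sampling per class rather than over the whole task keeps the labeled subset at the same class proportions as the full task, which matters for rare attack categories at the 1% setting.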

As shown in Table 1, all three datasets are structured to simulate incremental intrusion detection scenarios, where new attack classes are introduced progressively across task rounds. Each dataset exhibits different characteristics: UNSW-NB15 provides a relatively balanced baseline; CIC-IDS2017 emphasizes temporal sequence and class confusion; and TON_IoT introduces device heterogeneity and distribution shift, making them ideal for evaluating CL under varying real-world conditions.

For each dataset, we define a sequence of incremental tasks, where each task introduces a new subset of attack classes. In the default setting, the labeled data for each task are limited to 10% of the total samples, while the remaining 90% are treated as unlabeled. This setup reflects realistic scenarios where labeled data are scarce and the model must rely on SSL techniques to leverage abundant unlabeled traffic.

Table 2 illustrates the task-wise class distribution for the CIC-IDS2017 dataset under a 5-task incremental setting. Each task introduces a new set of attack types, with some tasks containing multiple classes to simulate realistic class overlap and confusion scenarios. This design allows us to evaluate the model’s ability to adapt to new threats while retaining knowledge of previously learned classes.

For quantitative assessment, we adopt the evaluation metrics listed in Table 3. Average Accuracy (AA) measures overall classification performance across tasks; accuracy is defined as the ratio of correctly predicted samples to total samples, i.e., $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, where $TP$, $TN$, $FP$, and $FN$ represent true positives, true negatives, false positives, and false negatives, respectively. Average Precision (AP) measures the precision across tasks; for a specific class, precision is defined as the ratio of true positive predictions to the total predicted positives, i.e., $\mathrm{Precision} = \frac{TP}{TP + FP}$. Average Recall (AR) measures the recall across tasks; for a specific class, recall is defined as the ratio of true positive predictions to the total actual positives, i.e., $\mathrm{Recall} = \frac{TP}{TP + FN}$. F1-Score (F1) balances precision and recall and is calculated as their harmonic mean, i.e., $F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$. Forgetting Measure (FM) quantifies the degree of catastrophic forgetting on previous tasks. Backward Transfer (BWT) assesses the influence of learning new tasks on old tasks. Imbalance Robustness (IR) evaluates performance consistency across imbalanced classes. Minority Recall (MR) focuses on the detection rate of minority classes. Top-k Error (Top-k) assesses multi-class confusion detection.
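The per-class formulas above translate directly into code. For FM and BWT, the text gives no formula, so the sketch below assumes the definitions commonly used in the continual learning literature: forgetting as the drop from the best earlier accuracy on a task to its final accuracy, and backward transfer as the final accuracy minus the accuracy measured right after learning the task, both averaged over all tasks except the last. The function names and the accuracy values are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def forgetting_and_bwt(acc):
    """acc[i][j] is the accuracy on task j after training on task i (j <= i).

    Assumed definitions (not stated in the text):
      FM  = mean over old tasks of (best earlier accuracy - final accuracy)
      BWT = mean over old tasks of (final accuracy - accuracy right after learning)
    Requires at least two tasks.
    """
    last = len(acc) - 1
    fm = sum(max(acc[i][j] for i in range(j, last)) - acc[last][j] for j in range(last))
    bwt = sum(acc[last][j] - acc[j][j] for j in range(last))
    return fm / last, bwt / last


# Toy lower-triangular accuracy matrix for 3 tasks (made-up numbers).
acc = [
    [0.90],
    [0.85, 0.88],
    [0.80, 0.84, 0.86],
]
fm, bwt = forgetting_and_bwt(acc)
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

Under these definitions a positive FM and a negative BWT both indicate that accuracy on earlier tasks declined after later tasks were learned.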

4.2. Baselines

To verify the effectiveness of the proposed method, we select representative methods from mainstream CL, SSL, and their combination as a systematic set of baselines. All baselines are reproduced or reused under the same data division and task sequence to ensure a fair and reliable comparison. Their categories and core strategies are shown in Table 4.

Not all comparison methods are applicable to every experimental setting. Table 5 describes the applicability of each method and serves as the basis for the experimental design. In the subsequent result tables, methods that do not support a given setting are marked as “×” or “N/A”, with the reasons stated in the table notes.

4.3. Implementation Details

The proposed framework is implemented in PyTorch (version 1.13.1) with the following key hyperparameters: Memory size: 5000 samples per task, dynamically adjusted based on the number of classes; DP-Means clustering threshold: 0.5, controlling cluster granularity; Uncertainty sampling: the top 20% of unlabeled samples with the highest entropy; Loss weights: $\lambda = \{1.0, 0.5, 0.5\}$ for balancing the alignment losses and $\lambda_{meta} = 1.0$ for the meta-weighted loss; Learning rate: 0.001 for the Adam optimizer, with a cosine annealing schedule; Batch size: 128 for labeled data and 256 for unlabeled data.
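For readers who want to reproduce the setup, the hyperparameters listed above can be collected in a small configuration object. The sketch below is illustrative only: the field names are hypothetical, and the cosine schedule uses the standard closed form with a minimum learning rate of 0 and a horizon of 100 steps, both of which are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameter values as listed in the text; field names are hypothetical."""
    memory_size: int = 5000                          # samples per task
    dp_means_threshold: float = 0.5                  # cluster granularity
    uncertainty_top_fraction: float = 0.20           # highest-entropy share
    alignment_loss_weights: tuple = (1.0, 0.5, 0.5)  # lambda
    meta_loss_weight: float = 1.0                    # lambda_meta
    learning_rate: float = 1e-3                      # Adam
    labeled_batch_size: int = 128
    unlabeled_batch_size: int = 256


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Standard cosine annealing: decays from base_lr at step 0 to min_lr at total_steps."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * step / total_steps))


cfg = TrainConfig()
# Assumed horizon of 100 steps, used here only to show the shape of the schedule.
schedule = [cosine_lr(s, 100, cfg.learning_rate) for s in (0, 50, 100)]
```

The schedule starts at the base rate, reaches half of it at the midpoint, and ends at the minimum rate.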

The model architecture consists of a feature extractor followed by a multi-layer perceptron head for classification. The attention mechanism is implemented using self-attention layers to capture critical fields in network traffic data.

The model is trained for 100 epochs per task, with early stopping based on validation accuracy to prevent overfitting. The memory buffer is updated after each task, retaining the most representative samples according to the DP-Means clustering and uncertainty sampling strategy.
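The clustering step of the memory update can be illustrated with a plain DP-Means loop, in which a sample opens a new cluster whenever it is too far from every existing center. This is a minimal sketch and not the implementation used in the experiments: it treats the threshold as a penalty on the squared Euclidean distance, picks the sample closest to each center as the exemplar, omits the entropy-based uncertainty sampling, and uses hypothetical function names and toy data.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dp_means(points, lam, max_iter=20):
    """DP-Means: open a new cluster when a point is farther than lam
    (squared distance) from every existing center."""
    n = len(points)
    centers = [[sum(col) / n for col in zip(*points)]]  # start from the global mean
    assign = [0] * n
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [sq_dist(p, c) for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
            if dists[k] > lam:
                centers.append(list(p))
                k = len(centers) - 1
            if assign[i] != k:
                assign[i], changed = k, True
        for k in range(len(centers)):
            members = [points[i] for i in range(n) if assign[i] == k]
            if members:  # empty clusters keep their old center until dropped below
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
        if not changed:
            break
    used = sorted(set(assign))  # drop clusters that ended up empty
    relabel = {old: new for new, old in enumerate(used)}
    return [centers[k] for k in used], [relabel[k] for k in assign]


def select_exemplars(points, centers, assign):
    """For each cluster, return the index of the sample closest to its center."""
    exemplars = []
    for k, c in enumerate(centers):
        members = [i for i in range(len(points)) if assign[i] == k]
        exemplars.append(min(members, key=lambda i: sq_dist(points[i], c)))
    return exemplars


# Two well-separated toy groups of 2-D feature vectors (made-up numbers).
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
centers, assign = dp_means(points, lam=1.0)
exemplars = select_exemplars(points, centers, assign)
```

Unlike k-means, the number of clusters is not fixed in advance. A smaller threshold yields more, finer clusters and therefore more stored exemplars per class.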

For a fair comparison, we carefully designed the training protocols for all baseline methods and our proposed framework. Specifically, for methods that are originally designed for CL but do not support semi-supervised settings (e.g., iCaRL, IL2M), we followed the same incremental task protocol as our method but only used the labeled subset of each task, which was restricted to 10% of the total training data. This ensures that these methods operate under the same label-scarce condition as our framework.

For SSL methods that are not inherently continual (e.g., FixMatch), we trained the model with both labeled and unlabeled data within each task, but without the incremental extension across tasks. This provides a strong semi-supervised baseline while keeping the comparison fair in terms of data utilization.

For our proposed framework, we integrated both continual and semi-supervised components: Labeled data (10% per task) were used with supervised loss, while the remaining unlabeled data were incorporated via pseudo-labeling and consistency regularization.

4.4. Results and Analysis

Figure 2 and Table 6 present the detailed experimental results on the UNSW-NB15, CIC-IDS2017, and TON_IoT datasets. The proposed framework consistently achieves the best overall performance across all three benchmark intrusion detection datasets, which demonstrates its adaptability and robustness under multi-task CL, label scarcity, and class imbalance. Specifically, our framework attains classification accuracies of 85.6% on UNSW-NB15, 83.4% on CIC-IDS2017, and 81.7% on TON_IoT, each surpassing the strongest baseline method (PLCiL) by more than 5 percentage points. These improvements indicate the framework’s stable learning capability under dynamic task sequences and data distributions. To further illustrate the advantages of our method, we analyze the key metrics below.

AA: Our method consistently outperforms all baselines across datasets, achieving gains of 4–5% over the next best method (PLCiL). This highlights its ability to learn and retain knowledge across incremental tasks.

FM: The proposed framework exhibits the lowest forgetting rates, indicating effective mitigation of catastrophic forgetting. The multi-layer alignment and meta-weighted loss components contribute to this stability.

MR: Our approach achieves the highest recall on minority classes, demonstrating its robustness in handling class imbalance. The dynamic memory update and uncertainty-based sampling strategies play a crucial role in enhancing minority class detection.

IR: The framework shows the most consistent performance across imbalanced classes, with IR values higher than those of the baselines. This reflects its ability to generalize well even under skewed class distributions.

Top-3: The lowest Top-3 error rates indicate that our method effectively reduces multi-class confusion, which is critical in intrusion detection scenarios with overlapping attack characteristics.

Table 6 summarizes the key performance metrics across all datasets. A high average precision indicates a low false alarm rate (few benign samples are misclassified as attacks), and a high recall indicates a low missed detection rate (few attacks are misclassified as benign). Our method achieves an F1-Score of 84.3% on UNSW-NB15, 82.1% on CIC-IDS2017, and 80.1% on TON_IoT, which implies that both its precision and recall are high. The high accuracy of our model is therefore not achieved by sacrificing one metric for the other but reflects a balance between minimizing false alarms and missed detections.

Figure 2a presents a detailed breakdown of performance metrics on the UNSW-NB15 dataset. Our framework not only excels in AA but also demonstrates significant reductions in FM and improvements in MR. The BWT metric indicates that learning new tasks has a less detrimental effect on previously learned tasks compared to baselines, showcasing the stability of our approach.

Figure 3a–d illustrate the performance trends of each method across different task sequences on the CIC-IDS2017 dataset. Our proposed framework consistently outperforms all baselines in AA, demonstrating its superior learning and retention capabilities. Additionally, it maintains the lowest FM values, indicating effective mitigation of catastrophic forgetting. The MR metric shows that our method excels in recognizing minority classes, a critical aspect in intrusion detection scenarios. Finally, the F1-Score trends further validate the robustness and reliability of our approach across sequential tasks. The detailed numerical results are provided in Table 7.

To comprehensively assess the stability and adaptability of our framework in diverse and complex deployment scenarios, we design an extended set of experiments along the following four axes:

Catastrophic Forgetting Analysis under Varying Task Sequence Lengths: Evaluate how the model retains prior knowledge as the number of sequential tasks increases, simulating long-term deployment.

Semi-Supervised Robustness under Varying Labeled Sample Ratios: Test the model’s sensitivity to supervision sparsity by adjusting the ratio of labeled samples available per task.

Memory-Efficient Knowledge Retention under Varying Buffer Sizes: Investigate the trade-off between memory constraints and knowledge preservation by altering the replay buffer capacity.

Stress Testing under Extreme Scenarios: Analyze model behavior under challenging conditions (e.g., unbalanced task orders, adversarial class overlap, and label scarcity) designed to simulate worst-case real-world settings.

These extended evaluations offer a holistic perspective on the proposed framework’s real-world readiness, demonstrating not only strong average-case performance but also resilience to domain shift and resource constraints.

Table 8 summarizes the performance variations in each method under different labeled sample ratios. As expected, all models demonstrate improved accuracy and MR with increasing supervision levels. However, the proposed framework consistently maintains strong recognition capability even under extremely limited supervision. Notably, at a 1% label ratio, our framework achieves an MR of 60.1%, significantly outperforming all baseline methods. This result highlights the framework’s ability to effectively leverage unlabeled data through its multi-level alignment and confidence-aware pseudo-labeling mechanisms. Moreover, the model demonstrates superior performance on both MR and IR across all settings, showcasing its resilience to class imbalance and label sparsity—two pervasive challenges in real-world intrusion detection.

These findings suggest that our framework is particularly well-suited for weakly supervised deployment scenarios, such as industrial control systems and IoT environments, where high-quality labels are scarce and costly to obtain.

Table 9 shows the catastrophic forgetting performance of the proposed method and the comparison methods under different memory capacity settings. As the memory capacity increases, the accuracy of all methods improves. However, our method maintains a low forgetting rate and a high minority class recall even with a small memory capacity, and it stores more representative class-center samples and boundary information, which makes it suitable for resource-constrained edge device scenarios.

Table 10 presents the performance of our proposed framework under varying degrees of class imbalance on the CIC-IDS2017 dataset. The experiments are designed to simulate real-world scenarios where certain attack types are significantly underrepresented compared to others, posing challenges for effective intrusion detection. In the balanced scenario (10:10:10), our method achieves an AA of 82.3%, demonstrating its ability to maintain high accuracy across all classes. The FM of 12.1 indicates a low level of catastrophic forgetting, while the MR of 62.5% highlights the model’s effectiveness in identifying minority classes. As the class imbalance increases to 10:50:50, the AA slightly decreases to 78.7%, but the model still maintains a robust performance with an FM of 13.4 and an MR of 56.3%. This suggests that our framework can adapt to moderate class imbalance without significant degradation in performance. In the more extreme imbalance scenario of 10:100:100, the AA further decreases to 73.4%, with an FM of 15.2 and an MR of 50.2%. Despite the challenges posed by severe class imbalance, our method continues to outperform baseline approaches, indicating its resilience and adaptability. Finally, in the most severe imbalance scenarios (50:100:100, 100:100:100, and 100:100:500), the AA drops to 61.1%, with corresponding increases in FM and decreases in MR. However, even under these challenging conditions, our framework demonstrates a commendable ability to detect minority classes, as evidenced by the MR of 38.5% in the 100:100:500 scenario. Overall, these results underscore the robustness of our proposed framework in handling class imbalance, making it a viable solution for real-world intrusion detection tasks where data distribution is often skewed.

Table 11 presents the performance of our proposed framework under various extreme scenarios designed to test its robustness and adaptability. Each scenario simulates challenging real-world conditions that an intrusion detection system might encounter. In the ESTE scenario, the model demonstrates strong stability with an AA of 75.8% and a low FM of 7.9, indicating its ability to retain knowledge over prolonged periods of minimal change. The MR of 62.1% further highlights its effectiveness in identifying rare attack types even when task changes are infrequent. In the FTC scenario, the model achieves an AA of 79.7% and an FM of 13.4, showcasing its capacity to quickly adapt to rapidly evolving threats. The MR of 59.9% indicates that while the model adapts quickly, it still maintains a reasonable level of performance on minority classes. The NICD scenario tests the model’s ability to handle uneven data flows, where it achieves an AA of 77.3% and an FM of 12.7. The high MR of 64.2% suggests that the model effectively manages class imbalance, ensuring that minority classes are not overlooked. Finally, in the EFLR scenario, the model maintains an AA of 78.5% and an FM of 12.1, demonstrating resilience to significant label noise. The MR of 60.3% indicates that the model can still identify minority classes despite fluctuations in label quality. Overall, these results underscore the robustness of our proposed framework across a range of challenging conditions, making it well-suited for deployment in dynamic and unpredictable network environments.

4.5. Ablation Study

To evaluate the contribution of each component within the proposed framework, we design a set of systematic ablation experiments. By removing or replacing specific modules, we observe the changes in key performance metrics such as overall accuracy, catastrophic forgetting, category robustness, and minority class recall to analyze the effectiveness and necessity of each sub-module in dynamic intrusion detection. All ablation experiments are conducted on the CIC-IDS2017 dataset. The task sequence is divided into 5 incremental rounds, with a 10% labeled sample ratio and a fixed memory buffer size of 5000. All other hyperparameters remain consistent with those used in the main experiments to ensure a fair comparison. The configurations are as follows:

Full Model: keep the complete proposed framework, including all components.

w/o Multi-Level Distillation: remove attention alignment and feature-level distillation.

w/o DP-Means Memory: replace the DP-Means + entropy-based memory selection with random sampling (reservoir sampling).

w/o Meta-Weighting: remove the category frequency-driven loss weight and bias adjustment, treating all classes equally.

w/o Semi-Supervised Learning: completely remove the unlabeled data and train only on labeled samples, thus disabling the SSL mechanism.

w/o Pseudo-Labeling: remove the confidence-aware pseudo-labeling mechanism.

Table 12 presents the results of the ablation experiments, highlighting the impact of each component on the overall performance of the proposed framework. The full model consistently outperforms all ablated versions across key metrics, demonstrating the effectiveness of each module in enhancing dynamic intrusion detection. Removing the multi-level distillation module results in a significant drop in AA to 79.2% and an increase in FM to 13.7, indicating that knowledge retention across tasks is compromised without this component. The MR also decreases to 59.6%, underscoring the importance of multi-level feature alignment in recognizing rare attack types. The absence of the DP-Means memory selection strategy leads to a further decline in performance, with AA dropping to 78.6% and FM rising to 15.1. This suggests that effective memory management is crucial for maintaining model stability and performance over time. Excluding the meta-weighting mechanism has a pronounced effect on IR, which falls to 0.71, and MR decreases to 52.4%. This highlights the role of class-aware optimization in addressing class imbalance and ensuring equitable learning across categories. The removal of the semi-supervised learning component results in the most substantial performance degradation, with AA falling to 74.9% and FM increasing to 17.4. The MR drops to 49.1%, indicating that the model struggles to learn effectively from limited labeled data without leveraging unlabeled samples. Finally, eliminating the pseudo-labeling mechanism leads to a decrease in AA to 81.3% and an increase in FM to 10.7, demonstrating that confident pseudo-labeling contributes significantly to the model’s learning process.

5. Limitations and Future Work

This paper presents a systematic study on dynamic intrusion detection based on the integration of SSL and CL, with a particular focus on three critical challenges in real-world network security: the continuous evolution of attack patterns, the scarcity of labeled data, and the imbalance in category distributions. To address these issues, we propose a framework that incorporates dynamic memory construction, multi-level knowledge alignment, and meta-learning-based class-aware optimization. Extensive experiments on multiple real-world intrusion detection datasets demonstrate that our method outperforms existing state-of-the-art methods across key metrics. These results support the practicality, stability, and generalization capacity of the proposed approach under dynamic task sequences, label-deficient settings, and skewed data distributions.

Despite the promising results, our current approach has several limitations that warrant further investigation:

More detailed theoretical analysis: While our empirical results are strong, a deeper theoretical understanding of the convergence properties and generalization bounds of the proposed framework would provide valuable insights into its performance guarantees.

The potential of adversarial attacks: As our method relies on pseudo-labeling and memory replay, it may be vulnerable to adversarial attacks that exploit these mechanisms. Future work should explore robust training techniques to mitigate such risks.

More detailed experiments: Although we have conducted extensive experiments, further evaluations on larger and more diverse datasets, as well as in real-world deployment scenarios, would help to better understand the practical applicability and limitations of our approach. The results in extremely imbalanced scenarios still have room for improvement, and we will continue to explore more effective solutions. In addition, our reported results do not include per-category performance for each intrusion type, which is also an important evaluation indicator for intrusion detection tasks.

Computational efficiency: The proposed framework involves multiple components that may introduce computational overhead. Future work should focus on optimizing the efficiency of the model to ensure its feasibility for real-time intrusion detection applications.

In future work, we plan to further explore the directions introduced below. Online and Real-Time Adaptation: Extend the current framework to support real-time data streams and online continual learning, enabling immediate response to emerging threats in live environments. Model Interpretability and Transparency: Incorporate explainable learning mechanisms to enhance the interpretability of model predictions, making the system’s decision-making process more transparent, accountable, and trustworthy, which is crucial for security-critical applications.

These enhancements aim to broaden the practical deployment scope of our method and further bridge the gap between theoretical models and real-world dynamic intrusion detection systems.

Author Contributions

Methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft preparation, writing—review and editing, C.G.; writing—review and editing, X.L., J.C., S.Y., visualization, supervision, project administration, funding acquisition, H.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available upon request from the author. The datasets can be accessed at https://research.unsw.edu.au/projects/unsw-nb15-dataset and https://research.unsw.edu.au/projects/toniot-datasets (UNSW-NB15 and TON_IoT, accessed on 15 March 2025) and at https://www.unb.ca/cic/datasets/ids-2017.html (CIC-IDS2017, accessed on 1 April 2025).

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable feedback and suggestions. The authors also thank the members of the research group for their helpful discussions and support during the development of this work. The authors also thank the editors for their assistance in the publication process and former colleagues for their contributions to the early stages of this research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Du, L.; Gu, Z.; Wang, Y.; Wang, L.; Jia, Y. A few-shot class-incremental learning method for network intrusion detection. IEEE Trans. Netw. Serv. Manag. 2023, 21, 2389–2401.
2. Van de Ven, G.M.; Tuytelaars, T.; Tolias, A.S. Three types of incremental learning. Nat. Mach. Intell. 2022, 4, 1185–1197.
3. Nasteski, V. An overview of the supervised machine learning methods. Horizons B 2017, 4, 56.
4. Zhu, X.J. Semi-Supervised Learning Literature Survey; Technical Report; University of Wisconsin-Madison Department of Computer Sciences: Madison, WI, USA, 2005.
5. Shyaa, M.A.; Zainol, Z.; Abdullah, R.; Anbar, M.; Alzubaidi, L.; Santamaría, J. Enhanced intrusion detection with data stream classification and concept drift guided by the incremental learning genetic programming combiner. Sensors 2023, 23, 3736.
6. Alshamrani, A.; Myneni, S.; Chowdhary, A.; Huang, D. A survey on advanced persistent threats: Techniques, solutions, challenges, and research opportunities. IEEE Commun. Surv. Tutor. 2019, 21, 1851–1877.
7. Bilge, L.; Dumitraş, T. Before we knew it: An empirical study of zero-day attacks in the real world. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, Raleigh, NC, USA, 16–18 October 2012; pp. 833–844.
8. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
9. Baysal, E.; Bayılmış, C. Overcoming Class Imbalance in Incremental Learning Using an Elastic Weight Consolidation-Assisted Common Encoder Approach. Mathematics 2025, 13, 1887.
10. Zenke, F.; Poole, B.; Ganguli, S. Continual learning through synaptic intelligence. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, NSW, Australia, 6–11 August 2017; pp. 3987–3995.
11. Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2001–2010.
12. Prabhu, A.; Torr, P.H.; Dokania, P.K. Gdumb: A simple approach that questions our progress in continual learning. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 524–540.
13. Bhurani, P.; Chouhan, S.S.; Mittal, N. Exbcil: An exemplar-based class incremental learning for intrusion detection system. Int. J. Mach. Learn. Cybern. 2025, 16, 3865–3885.
14. Xu, H.; Wang, Y. A continual few-shot learning method via meta-learning for intrusion detection. In Proceedings of the 2022 IEEE 4th International Conference on Civil Aviation Safety and Information Technology (ICCASIT), Dali, China, 12–14 October 2022; pp. 1188–1194.
15. Wang, R.; Fei, J.; Zhang, R.; Guo, M.; Qi, Z.; Li, X. Drnet: Dynamic retraining for malicious traffic small-sample incremental learning. Electronics 2023, 12, 2668.
16. Sabeel, U.; Heydari, S.S.; El-Khatib, K.; Elgazzar, K. Incremental Adversarial Learning for Polymorphic Attack Detection. IEEE Trans. Mach. Learn. Commun. Netw. 2024, 2, 869–887.
17. Li, Z.; Hoiem, D. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2935–2947.
18. Douillard, A.; Cord, M.; Ollion, C.; Robert, T.; Valle, E. Podnet: Pooled outputs distillation for small-tasks incremental learning. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XX 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 86–102.
19. Smith, J.; Balloch, J.; Hsu, Y.C.; Kira, Z. Memory-Efficient Semi-Supervised Continual Learning: The World is its Own Replay Buffer. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8.
20. Gao, J.; Chai, S.; Zhang, B.; Xia, Y. Research on network intrusion detection based on incremental extreme learning machine and adaptive principal component analysis. Energies 2019, 12, 1223.
21. Amalapuram, S.K.; Tamma, B.R.; Channappayya, S.S. Spider: A semi-supervised continual learning-based network intrusion detection system. In Proceedings of the IEEE INFOCOM 2024–IEEE Conference on Computer Communications, Vancouver, BC, Canada, 20–23 May 2024; pp. 571–580.
22. Abdallah, E.E.; Otoom, A.F.; Otoom, A.F. Intrusion detection systems using supervised machine learning techniques: A survey. Procedia Comput. Sci. 2022, 201, 205–212.
23. Chen, C.; Gong, Y.; Tian, Y. Semi-supervised learning methods for network intrusion detection. In Proceedings of the 2008 IEEE International Conference on Systems, Man and Cybernetics, Singapore, 12–15 October 2008; pp. 2603–2608.
24. Sohn, K.; Berthelot, D.; Carlini, N.; Zhang, Z.; Zhang, H.; Raffel, C.A.; Cubuk, E.D.; Kurakin, A.; Li, C.L. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Adv. Neural Inf. Process. Syst. 2020, 33, 596–608.
25. Lee, D.H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Proceedings of the Workshop on Challenges in Representation Learning, ICML, Atlanta, GA, USA, 16–21 June 2013; Volume 3, p. 896.
26. Xie, Q.; Dai, Z.; Hovy, E.; Luong, T.; Le, Q. Unsupervised data augmentation for consistency training. Adv. Neural Inf. Process. Syst. 2020, 33, 6256–6268.
27. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 30, 1195–1204.
28. Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; Raffel, C.A. Mixmatch: A holistic approach to semi-supervised learning. Adv. Neural Inf. Process. Syst. 2019, 32, 5049–5059.
29. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 13–18 July 2020; pp. 1597–1607.
30. Lechat, A.; Herbin, S.; Jurie, F. Pseudo-labeling for class incremental learning. In Proceedings of the BMVC 2021: The British Machine Vision Conference, Online, 22–25 November 2021.
31. Cermelli, F.; Fontanel, D.; Tavera, A.; Ciccone, M.; Caputo, B. Incremental learning in semantic segmentation from image labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4371–4381.
32. Cui, Y.; Jia, M.; Lin, T.Y.; Song, Y.; Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9268–9277.
33. Cao, K.; Wei, C.; Gaidon, A.; Arechiga, N.; Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. Adv. Neural Inf. Process. Syst. 2019, 32, 1567–1578.
34. Moustafa, N.; Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, ACT, Australia, 10–12 November 2015; pp. 1–6.
35. Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSP 2018, 1, 108–116.
36. Booij, T.M.; Chiscop, I.; Meeuwissen, E.; Moustafa, N.; Hartog, F.T.H.d. ToN_IoT: The Role of Heterogeneity and the Need for Standardization of Features and Attack Types in IoT Network Intrusion Data Sets. IEEE Internet Things J. 2022, 9, 485–496.
37. Koroniotis, N.; Moustafa, N.; Sitnikova, E.; Turnbull, B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: Bot-iot dataset. Future Gener. Comput. Syst. 2019, 100, 779–796.
38. Belouadah, E.; Popescu, A. Il2m: Class incremental learning with dual memory. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 583–592.
39. Lechat, A.; Herbin, S.; Jurie, F. Semi-supervised class incremental learning. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 10383–10389.

Figure 1. The proposed semi-supervised continual learning framework for intrusion detection.

Figure 2. Performance on three datasets.

Figure 3. Performance evolution across task sequences on CIC-IDS2017 dataset under 3 tasks.

Table 1. Overview of datasets and incremental task settings.

Dataset	Type	Classes	# Tasks	Label Ratio
UNSW-NB15	Flow-based	10 (9 attack + 1 normal)	5	10%
CIC-IDS2017	Flow-based	15 (14 attack + 1 normal)	10	10%
TON_IoT	IoT telemetry + Net	13 (12 attack + 1 normal)	7	10%

* The datasets are structured to simulate incremental learning scenarios, where new attack classes are introduced progressively across task rounds. Each dataset exhibits different characteristics, making them ideal for evaluating CL under varying real-world conditions. ** The label ratio indicates the proportion of labeled samples available for each task, reflecting the challenges of label scarcity in real-world scenarios.

Table 2. Task-wise class split for CIC-IDS2017 under 5-task incremental setting.

Task ID	Classes Introduced (Attack Types with Quantity)
T1	Normal (2,271,320), DDoS (128,025), PortScan (158,804)
T2	Bot (1956), Web Attack-Brute Force (1507), Web Attack-XSS (652)
T3	Infiltration (36), Heartbleed (11), Web Attack-Sql Injection (21)
T4	FTP-Patator (7935), SSH-Patator (5897)
T5	DoS {Hulk (230,124), GoldenEye (10,293), Slowloris (5796), Slowhttptest (5499)}

* Each task introduces a new set of attack types, with some tasks containing multiple classes to simulate realistic class scenarios. This design allows us to evaluate the model’s ability to adapt to new threats while retaining knowledge of previously learned classes. ** The quantities in parentheses indicate the number of samples for each class in the dataset; training and testing splits are performed accordingly. Supervised and unsupervised samples are drawn from these splits (mainly from the training set) based on the defined label ratio. For simplicity, only the total number of samples per class is shown here.
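The label-ratio sampling described in the note above can be illustrated with a minimal sketch. It is not the authors' code: the function name `split_by_label_ratio`, the per-class (stratified) sampling, and the rule of keeping at least one labeled sample per class are all assumptions made for illustration, and the sample indices are placeholders rather than real CIC-IDS2017 records.

```python
import random
from typing import Dict, List, Tuple


def split_by_label_ratio(
    indices_by_class: Dict[str, List[int]],
    label_ratio: float,
    seed: int = 0,
) -> Tuple[List[Tuple[int, str]], List[int]]:
    """Split training samples into a labeled and an unlabeled pool.

    Hypothetical helper: sampling is done per class, and every class keeps
    at least one labeled sample (an assumption, so that very small classes
    such as Heartbleed are not left without any label).
    """
    rng = random.Random(seed)
    labeled: List[Tuple[int, str]] = []
    unlabeled: List[int] = []
    for cls, indices in indices_by_class.items():
        shuffled = list(indices)
        rng.shuffle(shuffled)
        n_labeled = max(1, round(label_ratio * len(shuffled)))
        # Labeled samples keep their class; unlabeled ones drop it.
        labeled.extend((i, cls) for i in shuffled[:n_labeled])
        unlabeled.extend(shuffled[n_labeled:])
    return labeled, unlabeled


# Toy usage with the class sizes of Bot and Heartbleed from Table 2
# and the 10% label ratio from Table 1 (indices are placeholders).
toy_pool = {"Bot": list(range(1956)), "Heartbleed": list(range(11))}
toy_labeled, toy_unlabeled = split_by_label_ratio(toy_pool, label_ratio=0.10)
```

With a 10% ratio, a class of 11 samples contributes a single labeled example, which shows how little supervision the rarest classes receive in this setting.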

Table 3. Definitions of the evaluation metrics.

Metric	Abbreviation	Formula
Average Accuracy	AA	$\mathrm{AA}_{t} = \frac{1}{t} \sum_{i=1}^{t} A_{i}^{t}$
Average Precision	AP	$\mathrm{AP}_{t} = \frac{1}{t} \sum_{i=1}^{t} P_{i}^{t}$
Average Recall	AR	$\mathrm{AR}_{t} = \frac{1}{t} \sum_{i=1}^{t} R_{i}^{t}$
F1-Score	F1	$\mathrm{F1}_{t} = \frac{1}{t} \sum_{i=1}^{t} \frac{2 \cdot P_{i}^{t} \cdot R_{i}^{t}}{P_{i}^{t} + R_{i}^{t}}$
Forgetting Measure	FM	$\mathrm{FM}_{t} = \frac{1}{t-1} \sum_{i=1}^{t-1} \left( \max_{j<t} A_{i}^{j} - A_{i}^{t} \right)$
Backward Transfer	BWT	$\mathrm{BWT}_{t} = \frac{1}{t} \sum_{i=1}^{t} \left( A_{i}^{i} - A_{i}^{t} \right)$
Imbalance Robustness	IR	$\mathrm{IR} = \frac{\min_{c} \mathrm{Acc}_{c}}{\max_{c} \mathrm{Acc}_{c}}$
Minority Recall	MR	$\mathrm{MR} = \frac{1}{|C_{m}|} \sum_{c \in C_{m}} \frac{TP_{c}}{TP_{c} + FN_{c}}$
Top-k Error	Top-k	$\text{Top-}k = 1 - \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left( y_{i} \in \mathrm{Top}_{k}(f(x_{i})) \right)$

* Where $A_{i}^{t}$ represents the accuracy on task $i$ after the $t$-th round of training, and $P_{i}^{t}$ and $R_{i}^{t}$ represent the precision and recall on task $i$ after the $t$-th round of training, respectively. $\mathrm{Acc}_{c}$ is the accuracy for class $c$, and $TP_{c}$ and $FN_{c}$ are the numbers of true positive and false negative examples of class $c$, respectively. AA measures the overall classification performance of the model, AP measures the precision of the model, AR measures the recall of the model, and F1 balances precision and recall, providing a comprehensive performance metric. FM reflects the degree of catastrophic forgetting, BWT measures backward transfer stability, IR evaluates imbalance robustness, MR focuses on minority class detection, and Top-k assesses multi-class confusion detection.
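As an illustration of how several of the Table 3 metrics can be computed, the following is a minimal sketch rather than the authors' evaluation code. All function names are hypothetical, the accuracy history is stored as a lower-triangular list in which `acc[j][i]` holds the accuracy on task i+1 after training round j+1, $C_{m}$ is assumed to be the set of minority classes, and the numbers in `toy_acc` are invented purely to exercise the functions.

```python
from typing import Dict, List, Sequence, Tuple


def average_accuracy(acc: List[List[float]], t: int) -> float:
    """AA_t: mean accuracy over tasks 1..t after training round t."""
    return sum(acc[t - 1][i] for i in range(t)) / t


def forgetting_measure(acc: List[List[float]], t: int) -> float:
    """FM_t: mean drop from the best earlier accuracy, over tasks 1..t-1."""
    if t < 2:
        raise ValueError("FM is only defined for t >= 2")
    drops = []
    for i in range(t - 1):
        # Task i+1 exists from round i+1 on, so earlier rounds are i+1..t-1.
        best_earlier = max(acc[j][i] for j in range(i, t - 1))
        drops.append(best_earlier - acc[t - 1][i])
    return sum(drops) / (t - 1)


def imbalance_robustness(class_acc: Dict[str, float]) -> float:
    """IR: ratio of the worst to the best per-class accuracy."""
    return min(class_acc.values()) / max(class_acc.values())


def minority_recall(
    tp_fn: Dict[str, Tuple[int, int]], minority: Sequence[str]
) -> float:
    """MR: mean recall over the minority classes; tp_fn[c] = (TP_c, FN_c)."""
    recalls = [tp_fn[c][0] / (tp_fn[c][0] + tp_fn[c][1]) for c in minority]
    return sum(recalls) / len(recalls)


def top_k_error(scores: List[List[float]], labels: List[int], k: int) -> float:
    """Top-k error: share of samples whose label is not among the k best scores."""
    hits = 0
    for row, y in zip(scores, labels):
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += int(y in top_k)
    return 1 - hits / len(labels)


# Invented accuracy history for three rounds (not taken from the paper).
toy_acc = [[90.0], [80.0, 70.0], [75.0, 65.0, 85.0]]
aa_3 = average_accuracy(toy_acc, 3)
fm_3 = forgetting_measure(toy_acc, 3)
```

AP, AR and F1 follow the same averaging pattern as `average_accuracy` with precision and recall in place of accuracy, so they are omitted here.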

Table 4. Baselines.

Category	Method	Core Strategy
IL	iCaRL [11]	Memory sample + Prototype classifier
IL	IL2M [38]	Dual memory + Output calibration
SSL	FixMatch [24]	High confidence pseudo-label + Consistency constraint
SSL + IL	SSIL [39]	Incremental distillation + Reconstruction error
SSL + IL	PLCiL [30]	Hybrid label modeling + Dynamic pseudo-label correction

* IL refers to Incremental Learning, SSL refers to Semi-Supervised Learning, and their combination is denoted as SSL + IL.

Table 5. Applicability of comparison methods under different experimental settings.

Method	IL	SST	LST	DLS	MCV	ETS
iCaRL	✓	×	✓	×	∆	✓
IL2M	✓	×	✓	×	✓	✓
FixMatch	×	✓	×	✓	×	×
SSIL/PLCiL/Ours	✓	✓	✓	✓	✓	✓

* ✓ indicates that the method is applicable to the experimental setting; ∆ indicates indirect adaptation; × indicates that the method does not support this setting or that its original design cannot be adapted to it. ** IL stands for Incremental Learning. SST stands for Semi-supervised Training. LST stands for Long Sequence Task. DLS stands for Dynamic Label Scale. MCV stands for Memory Capacity Variation. ETS stands for Extreme Task Sequence.

Table 6. Experimental results on three datasets: UNSW-NB15, CIC-IDS2017, and TON_IoT.

Dataset	Method	AA	AP	AR	F1	FM	BWT	IR	MR	Top-3
UNSW-NB15	iCaRL	76.2	75.5	74.3	74.9	18.7	−7.4	0.67	46.1	25.3
	IL2M	78.0	77.3	76.1	76.7	16.3	−6.1	0.70	49.8	22.9
	FixMatch	75.3	74.6	73.2	73.8	×	×	0.62	44.2	27.1
	SSIL	80.7	80.1	78.9	79.5	12.1	−4.5	0.76	57.9	19.8
	PLCiL	78.5	77.8	76.4	77.1	13.9	−4.8	0.83	55.9	21.3
	Ours	85.6	84.9	83.7	84.3	8.7	−2.1	0.90	64.7	15.6
CIC-IDS2017	iCaRL	70.3	69.8	68.5	69.1	22.4	−9.1	0.67	42.4	30.5
	IL2M	73.9	73.2	71.8	72.5	19.8	−7.6	0.70	45.2	28.3
	FixMatch	72.8	72.1	70.7	71.4	×	×	0.62	43.1	29.7
	SSIL	77.8	77.1	75.9	76.5	14.7	−5.2	0.81	54.2	22.4
	PLCiL	78.6	77.9	76.7	77.3	13.5	−4.8	0.83	55.9	21.3
	Ours	83.4	82.7	81.5	82.1	9.4	−2.9	0.91	63.2	17.1
TON_IoT	iCaRL	68.5	67.9	66.4	67.0	25.1	−10.3	0.65	43.2	33.8
	IL2M	70.4	69.7	68.2	68.8	22.8	−8.7	0.68	47.4	30.5
	FixMatch	69.7	69.0	67.5	68.1	×	×	0.60	39.7	26.5
	SSIL	75.3	74.6	73.1	73.7	18.4	−6.1	0.75	56.1	22.9
	PLCiL	76.9	76.2	74.7	75.3	16.9	−5.4	0.78	59.7	21.7
	Ours	81.7	81.0	79.5	80.1	12.5	−3.7	0.88	69.1	16.1

× indicates that the metric is not applicable. The bold font highlights the best performance in each metric.

Table 7. Experimental results over the course of training on the CIC-IDS2017 dataset (3-task sequence).

Task	Method	AA	AP	AR	F1	FM	BWT	IR	MR	Top-3
T1	iCaRL	76.7	76.0	74.6	75.3	17.5	−6.8	0.69	47.3	26.1
	IL2M	77.2	76.5	75.1	75.8	14.8	−5.6	0.70	48.9	24.3
	FixMatch	72.5	72.0	70.6	71.3	×	×	0.62	43.1	29.7
	SSIL	76.8	76.1	74.7	75.4	15.6	−5.9	0.76	52.1	23.1
	PLCiL	78.5	77.8	76.4	77.1	13.9	−4.8	0.83	55.9	21.3
	Ours	85.0	84.5	83.1	83.8	10.5	−3.0	0.90	63.0	19.0
T2	iCaRL	74.1	73.4	72.0	72.7	20.5	−8.1	0.65	44.0	29.1
	IL2M	74.3	73.6	72.2	72.9	20.1	−7.4	0.65	44.1	28.7
	FixMatch	70.8	70.2	68.8	69.5	×	×	0.60	41.3	30.1
	SSIL	75.1	74.4	73.0	73.7	16.2	−5.5	0.74	50.3	24.7
	PLCiL	76.3	75.6	74.2	74.9	15.1	−4.9	0.79	53.1	22.8
	Ours	81.2	80.5	79.1	79.8	11.8	−3.2	0.88	60.5	21.1
T3	iCaRL	72.5	71.8	70.4	71.1	22.7	−8.9	0.63	40.5	30.8
	IL2M	73.4	72.7	71.3	72.0	21.3	−8.2	0.64	42.3	29.4
	FixMatch	69.2	68.6	67.2	67.9	×	×	0.58	39.7	31.5
	SSIL	73.5	72.8	71.4	72.1	18.3	−6.3	0.72	48.7	25.9
	PLCiL	75.1	74.4	73.0	73.7	16.7	−5.1	0.81	51.6	23.9
	Ours	78.9	78.2	76.8	77.5	13.2	−3.5	0.85	57.1	22.4

× indicates that the metric is not applicable. The bold font highlights the best performance in each metric.

Table 8. Model performance under different label ratios on the UNSW-NB15 dataset.

Label	Method	AA	AP	AR	F1	FM	BWT	IR	MR	Top-3
1%	iCaRL	67.4	66.1	64.8	65.4	21.5	−9.2	0.63	40.2	28.5
	IL2M	66.9	65.2	64.0	64.6	21.0	−9.0	0.64	50.1	28.7
	FixMatch	65.4	63.8	62.5	63.1	×	×	0.48	38.5	29.2
	SSIL	71.2	69.5	68.3	68.9	18.9	−9.5	0.62	51.7	27.5
	PLCiL	73.5	71.8	70.5	71.1	16.7	−9.3	0.65	53.3	30.5
	Ours	77.1	76.4	75.1	75.7	11.3	−5.1	0.73	60.1	28.7
5%	iCaRL	72.8	70.9	69.5	70.1	20.3	−8.7	0.65	43.3	27.1
	IL2M	73.4	71.9	72.5	72.2	18.9	−7.5	0.68	49.5	26.3
	FixMatch	71.5	69.8	68.4	69.1	×	×	0.58	45.1	28.1
	SSIL	76.3	74.7	73.5	74.1	14.4	−6.2	0.70	56.2	24.5
	PLCiL	77.9	75.2	74.0	74.6	12.2	−4.3	0.73	57.6	22.5
	Ours	82.6	81.9	80.6	81.2	9.7	−2.8	0.80	64.7	19.8
10%	iCaRL	74.1	72.8	71.6	72.2	20.1	−8.3	0.66	45.9	26.3
	IL2M	75.2	73.9	74.5	74.2	17.5	−7.1	0.69	49.8	25.7
	FixMatch	73.2	71.5	70.7	71.4	×	×	0.61	47.9	27.2
	SSIL	78.1	78.5	77.2	77.8	13.3	−4.6	0.74	58.5	22.9
	PLCiL	79.5	78.8	77.5	78.1	11.5	−3.5	0.76	60.3	18.7
	Ours	83.9	83.2	81.9	82.5	9.1	−2.4	0.83	67.4	17.1
20%	iCaRL	76.2	75.5	74.3	74.9	18.7	−7.4	0.68	46.1	25.3
	IL2M	78.0	77.3	76.1	76.7	16.3	−6.1	0.70	49.8	22.9
	FixMatch	75.3	74.6	73.2	73.8	×	×	0.65	44.2	27.1
	SSIL	80.7	80.1	78.9	79.5	12.1	−4.3	0.78	57.9	19.8
	PLCiL	81.5	80.8	79.6	80.2	11.6	−3.8	0.80	59.8	18.7
	Ours	85.6	84.9	83.7	84.3	8.7	−2.1	0.82	64.7	15.6

× indicates that the metric is not applicable. The bold font highlights the best performance in each metric.

Table 9. Comparison of catastrophic forgetting under different memory capacities.

Memory Capacity	Method	AA	AP	AR	F1	FM	BWT	IR	MR	Top-3
1000	iCaRL	70.5	67.5	66.3	66.9	22.1	−9.1	0.55	41.6	33.2
1000	Ours	72.3	71.3	70.1	70.7	10.9	−4.3	0.71	58.6	23.7
3000	iCaRL	73.4	70.5	69.3	69.9	18.6	−8.0	0.63	45.7	28.1
3000	Ours	82.3	80.1	78.9	79.5	9.3	−3.1	0.77	61.3	19.4
5000	iCaRL	76.2	75.5	74.3	74.9	18.7	−7.4	0.68	46.1	25.3
5000	Ours	85.6	84.9	83.7	84.3	8.7	−2.1	0.82	64.7	15.6

The bold font highlights the best performance in each metric.

Table 10. Robustness results under different class imbalance ratios on the CIC-IDS2017 dataset (3 tasks).

Imbalance Ratio	Method	AA	AP	AR	F1	FM	BWT	IR	MR	Top-3
10:10:10	iCaRL	78.4	77.3	73.5	75.7	15.8	−6.3	0.72	52.3	22.5
	IL2M	80.1	79.0	75.2	77.1	14.3	−5.8	0.74	54.1	21.3
	FixMatch	74.2	73.1	69.3	71.1	×	×	0.60	48.3	26.7
	SSIL	79.2	78.1	74.3	76.2	14.9	−5.9	0.75	55.8	20.1
	PLCiL	81.1	80.0	76.2	78.1	13.5	−4.8	0.83	60.3	18.7
	Ours	82.3	81.2	77.4	79.3	12.1	−4.1	0.85	62.5	17.3
10:50:50	iCaRL	72.5	71.4	68.6	70.0	17.5	−7.9	0.53	49.8	27.2
	IL2M	71.8	70.7	67.9	69.3	16.8	−7.1	0.55	50.9	25.3
	FixMatch	68.3	67.2	64.4	65.8	×	×	0.45	44.1	30.5
	SSIL	73.5	72.4	69.6	71.0	16.2	−6.5	0.60	51.7	24.1
	PLCiL	75.4	74.3	71.5	72.9	14.8	−5.3	0.65	53.9	22.8
	Ours	78.7	77.6	74.8	76.2	13.4	−4.5	0.68	56.3	20.5
10:100:100	iCaRL	63.2	62.1	60.3	61.2	23.5	−11.8	0.43	35.7	38.1
	IL2M	61.5	60.4	58.6	59.5	22.8	−10.3	0.47	36.5	35.2
	FixMatch	53.8	52.7	50.9	51.8	×	×	0.38	26.1	40.5
	SSIL	67.4	66.3	64.5	65.4	20.1	−8.2	0.55	42.3	30.2
	PLCiL	66.9	65.8	64.0	65.0	19.3	−7.5	0.58	44.7	28.9
	Ours	73.4	72.3	70.5	71.4	15.2	−5.1	0.65	50.2	25.7
50:100:100	Ours	71.3	70.2	68.4	69.3	16.8	−5.9	0.62	48.7	27.3
100:100:100	Ours	67.7	66.6	64.8	65.7	19.4	−7.2	0.58	44.3	30.1
100:100:500	Ours	61.1	60.0	58.2	59.1	22.7	−10.5	0.49	38.5	35.7

* × indicates that the metric is not applicable. The bold font highlights the best performance in each metric. The imbalance ratio is the ratio of the number of majority-class samples to the number of minority-class samples in each task. For example, 10:1:1 indicates that in the first task the majority class has 10 times as many samples as the minority class, while 50:1:1 indicates a more severe imbalance, with the majority class having 50 times as many samples as the minority class. ** To avoid ineffective learning due to extreme class imbalance, we set a minimum threshold of N samples for the minority class in each task; attack types with fewer than N samples are excluded from the experiments. The dataset is then randomly split into 3 tasks, and the number of samples for each class in each task is determined by the specified imbalance ratio; e.g., for a 10:1:1 ratio, if the minority class has 800 samples, the majority class will have 8000 samples. *** All methods are reproduced and compared under the same experimental settings to ensure fairness.
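The construction described in the note above (drop classes below the threshold N, then size the majority class from the imbalance ratio) can be sketched for a single task as follows. This is an illustrative reading, not the authors' procedure: the function name `build_imbalanced_task` is hypothetical, minority classes are assumed to be kept at full size, and the majority class is assumed to be subsampled to the ratio times the smallest remaining minority class, capped by what is available.

```python
import random
from typing import Dict, List


def build_imbalanced_task(
    samples_by_class: Dict[str, List[int]],
    majority: str,
    ratio: int,
    min_samples: int,
    seed: int = 0,
) -> Dict[str, List[int]]:
    """Subsample one task so that majority : minority is about ratio : 1."""
    rng = random.Random(seed)
    # Exclude classes with fewer than N (= min_samples) samples.
    kept = {c: s for c, s in samples_by_class.items() if len(s) >= min_samples}
    if majority not in kept:
        raise ValueError("majority class missing or below the threshold")
    minority_sizes = [len(s) for c, s in kept.items() if c != majority]
    if not minority_sizes:
        raise ValueError("no minority class left after thresholding")
    # Assumption: the ratio is taken against the smallest kept minority class.
    n_major = min(ratio * min(minority_sizes), len(kept[majority]))
    task = {c: list(s) for c, s in kept.items() if c != majority}
    task[majority] = rng.sample(kept[majority], n_major)
    return task


# Toy usage mirroring the 800 -> 8000 example in the note (placeholder indices).
toy_pools = {
    "Normal": list(range(10000)),
    "Bot": list(range(800)),
    "Rare": list(range(5)),
}
toy_task = build_imbalanced_task(toy_pools, majority="Normal", ratio=10, min_samples=100)
```

Here the class with only 5 samples is dropped by the threshold, and the majority class is reduced to 8000 samples against a minority class of 800.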

Table 11. Robustness results under extreme experimental settings.

Experiments	AA	AP	AR	F1	FM	BWT	IR	MR	Top-3
ESTE	75.8	74.1	72.9	73.5	7.9	−2.1	0.79	62.1	18.7
FTC	79.7	78.5	77.3	77.9	13.4	−4.3	0.73	59.9	22.4
NICD	77.3	76.1	74.9	75.5	12.7	−3.8	0.77	64.2	20.3
EFLR	78.5	77.3	76.1	76.7	12.1	−3.5	0.75	60.3	21.3

* The scenarios in this table are constructed from the TON_IoT dataset. ** ESTE is the extreme task evolution scenario, where the model needs to adapt to extremely slow task changes. FTC is the fast task change scenario, where the model needs to quickly adapt to new tasks. NICD is the non-independent category distribution scenario, where the model needs to handle uneven data flows. EFLR is the extreme fluctuation of the label ratio scenario, where the model needs to cope with significant label noise.

Table 12. Results of key module ablation experiment.

Model Configuration	AA	AP	AR	F1	FM	BWT	IR	MR	Top-3
Full Model	83.4	82.7	81.5	82.1	9.4	−2.9	0.91	63.2	17.1
w/o Multi-Level Distillation	79.2	78.5	77.3	77.9	13.7	−4.3	0.84	59.6	20.4
w/o DP-Means Memory	78.6	77.1	75.8	76.4	15.1	−4.5	0.79	55.2	22.8
w/o Meta-weighting	76.7	75.2	73.9	74.5	16.3	−4.8	0.71	52.4	24.9
w/o Semi-Supervised	74.9	73.4	72.1	72.7	17.4	−5.1	0.66	49.1	26.2
w/o Pseudo-Labeling	81.3	80.8	79.5	80.1	10.7	−3.2	0.89	62.8	18.3

The bold font highlights the best performance in each metric.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

