Article

Overcoming Class Imbalance in Incremental Learning Using an Elastic Weight Consolidation-Assisted Common Encoder Approach

by Engin Baysal 1,2 and Cüneyt Bayılmış 3,*
1 Computer and Information Engineering, Institute of Natural Sciences, Sakarya University, 54050 Sakarya, Türkiye
2 Vocational School of Cyber Security, Istanbul Technical University, 34469 Istanbul, Türkiye
3 Computer Engineering, Faculty of Computer and Information Sciences, Sakarya University, 54050 Sakarya, Türkiye
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(11), 1887; https://doi.org/10.3390/math13111887
Submission received: 22 April 2025 / Revised: 21 May 2025 / Accepted: 30 May 2025 / Published: 4 June 2025
(This article belongs to the Special Issue New Insights in Machine Learning (ML) and Deep Neural Networks)

Abstract

Incremental learning empowers models to continuously acquire knowledge of new classes while retaining previously learned information. However, catastrophic forgetting and class imbalance often impede this process, especially when new classes are introduced sequentially. We propose a hybrid method that integrates Elastic Weight Consolidation (EWC) with a shared encoder architecture to overcome these obstacles. This approach provides robust feature extraction, while EWC safeguards vital parameters and preserves prior knowledge. Moreover, task-specific output layers enable flexible adaptation to new classes. We evaluated our method using the CICIoT2023 dataset, a class-incremental IoT anomaly detection benchmark. Our results demonstrated a 15.3% improvement in the macro F1-score and a 1.28% increase in overall accuracy compared to a baseline model that did not incorporate EWC, with particular advantages for underrepresented classes. These findings underscore the effectiveness of the EWC-assisted shared encoder framework for class-imbalanced incremental learning in streaming environments.

1. Introduction

Incremental learning, also known as continual learning, is a critical area of research in machine learning that focuses on enabling models to learn new information sequentially while preserving previously acquired knowledge. This capability is essential in dynamic environments such as Internet of Things (IoT) systems, autonomous vehicles, and adaptive cybersecurity frameworks, where the data landscape constantly evolves. However, incremental learning systems face significant challenges, notably catastrophic forgetting, in which a model loses performance on earlier tasks while adapting to new ones. This issue is further complicated in class-imbalanced scenarios, where unequal data distribution across classes leads to biased learning and diminished generalization performance [1,2].
Catastrophic forgetting is a well-documented phenomenon in neural networks, particularly when trained sequentially on multiple tasks. The adjustments made to the model’s weights during training on new tasks can overwrite the knowledge acquired from previous tasks, resulting in a loss of performance on those earlier tasks [1,2]. Various strategies have been proposed to mitigate this issue, including regularization-based approaches such as Elastic Weight Consolidation (EWC), which penalizes significant changes to model parameters that are crucial for previously learned tasks [3,4]. However, while EWC helps retain knowledge, it may inadvertently reinforce class imbalance by prioritizing the dominant classes from earlier tasks, limiting the model’s ability to learn from underrepresented classes [5].
Addressing both catastrophic forgetting and class imbalance simultaneously presents a formidable challenge. Traditional techniques for managing class imbalance, such as oversampling or cost-sensitive learning, often rely on access to the entire training dataset, which is not feasible in incremental learning settings [1]. Furthermore, regularization methods like EWC, while effective in retaining knowledge, can exacerbate class imbalance issues by focusing too heavily on previously learned dominant classes [3,4]. Therefore, a more integrated approach is necessary to tackle these intertwined challenges effectively.
To this end, we propose a hybrid approach that combines a common encoder model with Elastic Weight Consolidation (EWC) to address class imbalance in incremental learning. The proposed approach’s effectiveness was extensively analyzed by comparing its performance in scenarios with and without EWC, highlighting significant improvements in handling class imbalance and retaining previously learned knowledge. The common encoder is designed to learn generalized feature representations across tasks, which can help mitigate the impact of class imbalance by ensuring consistent and robust feature extraction [6,7]. EWC complements this by penalizing updates to critical model parameters, thus preventing the degradation of knowledge from previously learned classes while allowing task-specific flexibility through separate output layers [3,4]. This dual approach aims to reduce catastrophic forgetting and enhance the model’s ability to generalize across imbalanced datasets.

Contributions of This Work

This paper makes several significant contributions to the field of incremental learning. First, it introduces a hybrid approach that combines Elastic Weight Consolidation (EWC) with a shared common encoder to address the dual challenges of class imbalance and catastrophic forgetting. The shared encoder allows for robust, task-agnostic feature extraction across diverse tasks, enabling the model to generalize effectively, even in situations where certain classes are underrepresented. Simultaneously, EWC selectively safeguards key parameters by penalizing alterations to weights that are vital for preserving previously acquired knowledge.
What sets this proposed method apart is its ability to sustain performance on previously encountered, often underrepresented classes while effectively adapting to newly introduced classes in a continual learning context. This regularization mechanism reduces the risk of forgetting prior knowledge and promotes the stable integration of new class representations. This dual capability ensures that the model remains both stable and flexible, retaining its learned knowledge while seamlessly accommodating new information.
The effectiveness of the approach was thoroughly analyzed through a comparative performance evaluation of scenarios with and without EWC, highlighting its superiority in improving stability and handling rare classes. The method is empirically validated on the CICIoT2023 dataset, a benchmark for class-incremental IoT anomaly detection, demonstrating substantial improvements in both accuracy and macro F1-scores, especially for underrepresented classes. Furthermore, the proposed model's scalable architecture ensures efficient long-term incremental learning by mitigating the trade-off between stability and plasticity, making it well-suited for real-world, class-imbalanced datasets.
These results indicate that the proposed EWC-assisted shared encoder architecture alleviates catastrophic forgetting and enables the model to seamlessly integrate newly introduced classes while retaining performance on previously underrepresented ones. This dual functionality approach is well-suited for real-world, class-imbalanced environments where data are presented sequentially, such as in IoT-based anomaly detection systems, offering a scalable and robust solution for evolving data-driven applications.
This paper is organized as follows: Section 2 reviews related work on incremental learning and class imbalance strategies. Section 3 describes the proposed EWC-assisted common encoder approach and the experimental setup, including the dataset and evaluation metrics. Section 4 presents the results and analysis, and Section 5 concludes with potential future directions.

2. Literature Review

Incremental learning is a vital paradigm in machine learning that allows models to continuously acquire knowledge from a sequence of tasks without losing previously learned information. Unlike traditional learning approaches, which assume access to the entire dataset during training, incremental learning functions under more realistic constraints, such as limited memory and the unavailability of past data [5]. While these constraints align with real-world scenarios, they also present significant challenges, particularly catastrophic forgetting, where a model’s performance on earlier tasks declines as new tasks are introduced [2], and class imbalance, which skews the model towards overrepresented classes in the training data. To tackle these challenges, incremental learning strategies are typically divided into three main categories: regularization-based, replay-based, and architectural approaches. Recent research, however, has underscored the importance of addressing class imbalance and utilizing encoder-based feature learning [3], leading to a more expansive categorization of methodologies.
Regularization-based methods are designed to maintain knowledge of previously learned tasks by limiting updates to model parameters that are essential for those tasks. A notable approach in this category is Elastic Weight Consolidation (EWC), introduced by Kirkpatrick et al. [2], which uses Fisher Information to identify which weights are important and applies a penalty to changes made to these weights while learning new tasks. Although EWC effectively reduces catastrophic forgetting, it faces challenges with class imbalance, as it tends to prioritize retaining parameters linked to dominant classes from previous tasks [3]. Other methods, such as Synaptic Intelligence and Variational Continual Learning, build upon this foundation by enhancing the flexibility and efficiency of regularization techniques, allowing for better adaptation to new tasks while maintaining performance on old ones [5,6,8,9].
Replay-based methods address the issue of forgetting by either storing a subset of previous data or generating synthetic data for replay in later training phases. For example, iCaRL (Incremental Classifier and Representation Learning) combines replay with feature learning to effectively balance the representation of both old and new classes [10]. However, these methods often demand substantial memory resources and may introduce biases in selecting or generating replay samples. Furthermore, replay methods might not completely resolve class imbalance, as the imbalance can still exist within the replay buffer [11]. Recent advancements, such as One-Shot Replay, have sought to enhance the effectiveness of replay strategies by focusing on specific object instances to improve incremental learning outcomes [12,13,14].
Architectural approaches dedicate additional network resources, such as task-specific layers or modules, to handle new tasks while keeping the parameters for earlier tasks unchanged or isolated. Progressive Neural Networks and PackNet exemplify these strategies by dynamically expanding network capacity, which helps minimize interference between tasks [15,16]. While these methods effectively reduce catastrophic forgetting, they may encounter scalability challenges due to the expanding size of the network, especially in class-incremental scenarios where multiple tasks are learned in sequence [11,15].
Class imbalance significantly exacerbates the challenges faced in incremental learning by skewing the model towards overrepresented classes. Traditional methods for addressing class imbalance, such as resampling, reweighting, or cost-sensitive learning, often require access to the entire dataset, which contradicts the constraints of incremental learning [11,17]. Some approaches, like Balanced Incremental Learning (BIC), aim to incorporate bias correction during training by balancing classifier outputs across tasks [11]. However, they are not inherently designed to handle severe class imbalance across multiple tasks. Addressing class imbalance has become a focal point in recent incremental learning research, as it often interacts with catastrophic forgetting, compounding the difficulty of maintaining model performance across tasks [18,19].
Using shared encoders for feature learning in incremental learning has shown promise in mitigating class imbalance. Encoders can create a task-agnostic feature representation that reduces the impact of class-specific disparities. Feature distillation methods, which preserve feature-level knowledge [20], and multi-head architectures, which separate task-specific outputs [21], are commonly employed in this domain. However, these methods often overlook the importance of parameter consolidation, leading to the degradation of critical features over time [22]. Encoder-based strategies highlight the importance of robust, task-agnostic representations, complementing traditional regularization and replay methods. In a related work, the authors of [23] employed a quantum-neural network (QNN) architecture for intrusion detection using the CIC-DDoS2019 dataset. Their approach demonstrated 92.63% classification accuracy using only seven selected features, showcasing the potential of quantum machine learning for resource-constrained environments. In addition to these approaches, recent research [24] explored machine learning-based intrusion detection using the CICIoT2023 dataset, proposing a blockchain-based IDS framework designed for collaborative research and model sharing and achieving high detection performance for DDoS attacks using XGBoost.
Recent research has effectively integrated meta-learning and transformer-based architectures into incremental learning, offering innovative strategies to enhance adaptability and generalization. Meta-learning, often referred to as “learning to learn,” enables models to swiftly adapt to new tasks by leveraging prior experiences. This capability is particularly beneficial in few-shot or low-resource scenarios [25]. However, many current methods rely on well-defined task boundaries, which may not be feasible in online streaming environments, such as IoT anomaly detection [26].
Transformer-based models also play a significant role in continual learning due to their attention mechanisms and ability to handle long-range dependencies [27]. While these models are adept at processing complex input sequences, they often require extensive pretraining and considerable computational resources, which can limit their applicability in edge environments. Lightweight designs, such as the EWC-regularized shared encoder proposed in this study, offer more scalable solutions for real-world applications that face streaming data constraints.
Another growing area of interest is few-shot class-incremental learning (FSCIL), where models learn new classes from limited data while retaining previously acquired knowledge [28]. Techniques such as prototype replay and label smoothing have proven effective in enhancing performance in these scenarios but necessitate careful tuning. Additionally, recent studies are beginning to address the relationship between incremental learning and privacy preservation in security-critical fields, such as intrusion detection [29,30,31]. This enables continuous adaptation without requiring full access to past data.
In summary, recent advancements in meta-learning and transformer-based methodologies have significantly improved flexibility and generalization in incremental learning. Our proposed architecture, which integrates Elastic Weight Consolidation (EWC) with a shared encoder, presents a compelling and resource-efficient alternative. This approach is particularly well-suited for real-world scenarios that are characterized by class imbalance and data privacy challenges. By combining EWC with a shared encoder, our method effectively mitigates the risk of catastrophic forgetting while addressing class imbalance through consistent feature extraction and robust task-specific outputs. Unlike traditional regularization techniques, the shared encoder promotes uniform representation learning across tasks, while the modular task-specific heads facilitate seamless adaptation to new classes without compromising previously acquired knowledge. This innovative combination positions our approach to outperform contemporary methods, particularly in environments with imbalanced data, as illustrated in Table 1.

3. Methodology and Experimental Setup

Our proposed methodology, Class-Preserving Incremental Learning with EWC-Assisted Common Encoder, effectively addresses the issue of catastrophic forgetting in continual learning. By integrating a shared encoder architecture with Elastic Weight Consolidation (EWC), this approach preserves essential knowledge from previous tasks while facilitating the incremental learning of new tasks. We employ a single, adaptable encoder for feature extraction, paired with a task-specific head for classification. The EWC constraints play a crucial role in safeguarding key parameters from prior tasks, ensuring they are not forgotten.
The overall structure of the proposed methodology is illustrated in Figure 1. The diagram highlights the interaction between the input layer, the shared encoder, and task-specific heads for each classification task. Additionally, the integration of EWC Regularization with the shared encoder is visualized, demonstrating its role in constraining critical parameters to ensure knowledge retention during incremental learning.

3.1. Model Architecture

The architecture consists of three main components:
Input Layer: Prepares task-specific input data for feature extraction.
Common Encoder ($f_\theta$): The encoder, parameterized by $\theta$, is responsible for learning shared feature representations across multiple tasks. This encoder is the foundational layer that extracts task-agnostic features from input data, thus enabling parameter reuse across tasks. This approach reduces the memory overhead typically associated with task-specific models and facilitates the transfer of learned representations to new tasks.
Task-Specific Head ($h_{\phi_t}$): Each task has a separate classification head $h_{\phi_t}$, parameterized by $\phi_t$. These heads are lightweight, task-specific output layers that map the encoder’s output features to the classes of the corresponding task. When a new task is introduced, a new head is added to the model without modifying the encoder, allowing the encoder to retain previously learned information. This architecture supports incremental learning, as new heads are created as needed without impacting previous task-specific heads.
For an input $x$, the model output $y_t$ for task $t$ is computed as follows:
$y_t = h_{\phi_t}(f_\theta(x))$
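A minimal PyTorch sketch of this architecture is given below. The layer widths, module names, and the fully connected encoder are illustrative assumptions rather than the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class SharedEncoderModel(nn.Module):
    """Common encoder f_theta with one lightweight head h_phi_t per task."""
    def __init__(self, input_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Shared, task-agnostic feature extractor (f_theta)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.hidden_dim = hidden_dim
        self.heads = nn.ModuleList()  # task-specific heads h_phi_t

    def add_head(self, num_classes: int) -> int:
        """Attach a new output head for a newly introduced task; returns its task id."""
        self.heads.append(nn.Linear(self.hidden_dim, num_classes))
        return len(self.heads) - 1

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # y_t = h_{phi_t}(f_theta(x))
        return self.heads[task_id](self.encoder(x))
```

For example, `model.add_head(24)` would register a head for an initial 24-class task, and `model(x, task_id=0)` would produce its logits without touching any other head.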

3.2. Elastic Weight Consolidation (EWC) for Knowledge Retention

We incorporate Elastic Weight Consolidation (EWC) into the learning process to retain knowledge from previous tasks. EWC is a regularization-based technique that helps prevent catastrophic forgetting by constraining important model parameters to remain close to the values learned during prior tasks.
EWC achieves this by adding a penalty to the loss function for deviations from the optimal parameters of past tasks. The penalty is weighted by the Fisher Information Matrix (FIM), which quantifies the importance of each parameter based on how critical it was to prior tasks.
For the encoder parameters $\theta$, the EWC penalty for task $t$ is defined as
$L_{EWC} = \frac{\lambda}{2} \sum_i F_i \left(\theta_i - \theta_i^*\right)^2$
where
  • $\lambda$ is a hyperparameter that controls the strength of the EWC penalty.
  • $\theta_i^*$ is the optimal value of parameter $i$ after training on the previous task.
  • $F_i$ is the Fisher Information for parameter $i$, calculated based on its importance in the previous task. The Fisher Information Matrix (FIM) evaluates the significance of parameters in relation to the likelihood of the observed data. Due to the high computational cost of calculating the full FIM, a diagonal approximation is commonly used. This approximation is typically achieved by accumulating the squared gradients of the loss function with respect to each parameter across the training dataset. The following pseudocode illustrates this computation (Algorithm 1).
Algorithm 1: Diagonal Fisher Approximation
# Initialize the diagonal Fisher Information approximation (one entry per parameter)
F = zeros_like(theta)
for x, y in dataset:
    y_pred = model(x)
    loss = cross_entropy(y_pred, y)
    gradients = compute_gradients(loss, theta)  # gradient of the loss w.r.t. each parameter
    F += gradients ** 2                         # accumulate squared gradients
F /= len(dataset)                               # average over the dataset
This EWC penalty encourages the model to preserve critical parameters, thereby minimizing interference with knowledge from previous tasks while learning new ones.
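As a concrete illustration, the penalty above can be computed in PyTorch roughly as follows. Here `fisher` and `opt_params` are assumed to be dictionaries keyed by encoder parameter name (filled after the previous task, e.g., via Algorithm 1), `lam` plays the role of $\lambda$, and its default value is an arbitrary placeholder rather than the value used in the experiments.

```python
def ewc_penalty(encoder, fisher, opt_params, lam=0.4):
    """L_EWC = (lambda / 2) * sum_i F_i * (theta_i - theta_i*)^2 over encoder parameters."""
    penalty = 0.0
    for name, param in encoder.named_parameters():
        if name in fisher:
            # Squared deviation from the previous task's optimum, weighted by importance
            penalty = penalty + (fisher[name] * (param - opt_params[name]) ** 2).sum()
    return 0.5 * lam * penalty
```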

3.3. Training Procedure

The training process is designed to balance learning new tasks with preserving knowledge from previous tasks. The following steps outline the complete training procedure:

3.3.1. Initial Task Training

For the first task, the model is trained using only the task loss $L_{task}$ without the EWC penalty, as there are no previous tasks to retain. The task loss is computed as the cross-entropy loss:
$L_{task} = -\sum_j y_j \log(\hat{y}_j)$
where $y_j$ and $\hat{y}_j$ are the true and predicted labels, respectively.
After training on the initial task, the Fisher Information Matrix $F$ is computed for each parameter, and the optimal parameters $\theta^*$ are saved for future use in the EWC penalty.
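In code, this snapshot can be as simple as the following sketch, assuming `model` is the shared-encoder module from Section 3.1 and the Fisher dictionary has just been computed as in Algorithm 1:

```python
# Snapshot the encoder parameters theta* after finishing the first task;
# together with the Fisher estimates they define the EWC penalty for later tasks.
opt_params = {name: p.detach().clone()
              for name, p in model.encoder.named_parameters()}
```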

3.3.2. New Task Training with EWC Penalty

When a new task is introduced, a new task-specific head $h_{\phi_{t+1}}$ is added. Then, the model is trained on the new task’s data. The total loss function for training on subsequent tasks combines the task loss $L_{task}$ and the EWC penalty $L_{EWC}$:
$L_{total} = L_{task} + L_{EWC}$
This total loss ensures that the model learns the new task while preserving knowledge critical to previous tasks, as identified by EWC. During training, the task-specific head for the current task is optimized, while the EWC constraint applies primarily to the common encoder’s parameters.
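A hedged sketch of one optimization step with this combined objective is shown below; the optimizer choice and batch handling are assumptions for illustration, and `ewc_penalty` refers to the helper sketched in Section 3.2.

```python
import torch.nn.functional as F

def train_step(model, batch, task_id, optimizer, fisher, opt_params, lam=0.4):
    x, y = batch
    optimizer.zero_grad()
    logits = model(x, task_id)
    task_loss = F.cross_entropy(logits, y)                           # L_task
    ewc_loss = ewc_penalty(model.encoder, fisher, opt_params, lam)   # L_EWC (encoder only)
    total_loss = task_loss + ewc_loss                                # L_total = L_task + L_EWC
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```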

3.3.3. Updating EWC Parameters

After each task is learned, the new Fisher Information Matrix $F$ and the optimal parameter values $\theta^*$ are recomputed from the encoder’s updated parameters $\theta$. These values are stored for use in the EWC penalty in future tasks, enabling the model to retain knowledge across tasks progressively.

3.3.4. New Task Training Without EWC Penalty

To evaluate the impact of the EWC mechanism, a separate training step is performed for new tasks without applying the EWC penalty. This step serves as a baseline to measure how much the EWC penalty contributes to retention and adaptability.

3.4. Handling Class Imbalance

Class imbalance is mitigated through the following mechanisms:
(1)
Balanced Loss Function: During training, class weights are adjusted inversely proportional to class frequency in the data:
$w_c = \frac{1}{\log(1 + f_c)}$
where $f_c$ is the frequency of class $c$. This weighting reduces the impact of dominant classes while preserving the contribution of minority ones (a minimal implementation sketch follows this list).
(2)
Shared Feature Learning: The common encoder learns generalized features that reduce the dependence on class-specific biases.
(3)
Decoupled Outputs: Task-specific heads allow the model to adapt to new tasks without propagating biases from earlier tasks.
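The sketch below illustrates the frequency-based weighting described in item (1) above, using PyTorch’s built-in class weighting; the class counts shown are purely illustrative.

```python
import math
import torch
import torch.nn as nn

def class_weights_from_counts(counts):
    """w_c = 1 / log(1 + f_c), where f_c is the frequency (count) of class c."""
    return torch.tensor([1.0 / math.log(1.0 + c) for c in counts], dtype=torch.float32)

# Example: three classes with heavily skewed support
weights = class_weights_from_counts([120000, 3500, 40])
criterion = nn.CrossEntropyLoss(weight=weights)  # weighted task loss L_task
```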

3.5. Advantages of the Proposed Model

  • Catastrophic Forgetting Mitigation: EWC ensures knowledge retention by protecting critical parameters.
  • Class Imbalance Handling: The shared encoder and balanced loss function provide robust solutions for imbalanced data distributions.
  • Scalability: Task-specific heads minimize architectural growth, making the model suitable for long-term incremental learning.
  • Flexibility and Adaptability: The model balances stability (knowledge retention) with plasticity (learning new tasks).
The proposed model is evaluated on real-world datasets, demonstrating its ability to maintain competitive performance across tasks while mitigating the impact of catastrophic forgetting and class imbalance. The proposed method effectively handles class imbalance throughout the incremental learning process by combining weighted loss functions, shared encoder representations, and EWC-based regularization. This ensures that the model retains performance on rare classes while adapting to new tasks, achieving a balance between stability and plasticity even in imbalanced datasets.
This study utilized the CICIoT2023 dataset [32] to evaluate the proposed EWC-assisted common encoder approach. The experimental process was conducted in two stages to test the model’s performance in handling incremental learning under class imbalance scenarios.
All experiments were conducted on Google Colab Pro, a cloud-based platform that provides access to advanced computational resources for machine learning tasks. The experiments leveraged the NVIDIA A100 Tensor Core GPU for efficient training and evaluation.

3.6. Dataset Preparation

The CICIoT2023 dataset consists of data representing 34 classes. To simulate incremental learning and investigate the model’s behavior under varying class distributions, the dataset was split into two parts:
(1)
Initial Training Dataset: For the first experiment, the 10 most frequent classes were removed from the dataset to introduce imbalance and allow the model to focus on the less frequent classes during initial training.
(2)
Incremental Dataset: The second part of the dataset, including the 10 most frequent classes initially excluded, was used for incremental training to test the model’s adaptability and retention of earlier knowledge.
This process was repeated for a second experiment by removing the 10 least frequent classes instead of the most frequent ones, reversing the imbalance setup.
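The split itself can be expressed in a few lines of pandas; the file name and the `label` column used below are assumptions about a preprocessed CICIoT2023 export, not part of the original dataset description.

```python
import pandas as pd

df = pd.read_csv("ciciot2023_preprocessed.csv")    # assumed preprocessed export
counts = df["label"].value_counts()

# Experiment 1: hold out the 10 most frequent classes for the incremental stage
held_out = counts.nlargest(10).index
initial_df = df[~df["label"].isin(held_out)]       # initial training data
incremental_df = df[df["label"].isin(held_out)]    # incremental training data

# Experiment 2 reverses the setup: held_out = counts.nsmallest(10).index
```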

3.7. Experimental Procedure

(1)
Initial Training:
(a)
The first part of the dataset was used to train the model, initializing the common encoder and the task-specific head for the initial task.
(b)
After training, the model was saved as the baseline.
(2)
Incremental Training:
(a)
The second part of the dataset was used for training, with the classes excluded from the first stage.
(b)
During this phase, the common encoder was regularized using the EWC mechanism to preserve critical parameters from the initial training phase.
(c)
A new task-specific head was added to adapt to the new classes.
(d)
After incremental training, the updated model was saved.
(3)
Testing:
(a)
After the initial training, the model was tested using the dataset containing only the classes used during the first task. This evaluation aimed to measure the model’s performance on the initially trained classes before introducing new tasks.
(b)
After the second training (new task training), the model was tested using a dataset containing all 34 classes, including those from the first and second tasks. This evaluation measured the model’s ability to
  (i)
Retain knowledge of the classes learned during the initial training (retention).
 (ii)
Adapt to the newly introduced classes from the second task (adaptability).
(iii)
Perform across all classes in the dataset, balancing old and new knowledge (overall performance).

3.8. Evaluation Metrics

The following metrics were collected during testing:
(1)
Accuracy: Measures the overall classification performance of the model across all tasks.
$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$
(2)
Class-Balanced Accuracy: Adjusts accuracy for class imbalance by calculating the mean accuracy per class.
$\text{Class-Balanced Accuracy} = \frac{1}{C} \sum_{i=1}^{C} \frac{\text{True Positives}_i}{\text{Total Instances for Class } i}$
where $C$ is the total number of classes.
(3)
Retention Score: Evaluates the model’s ability to retain knowledge of previously learned classes after incremental training.
$\text{Retention Score} = \frac{\text{Performance on Old Classes After Training New Tasks}}{\text{Performance on Old Classes Before Training New Tasks}}$
(4)
Adaptation Score: Assesses the model’s effectiveness in learning new classes during incremental training.
$\text{Adaptation Score} = \text{Performance on Newly Introduced Classes (Post-Training)}$
(5)
F1-Score: Provides a balanced measure of precision and recall, particularly useful for imbalanced class distributions.
$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
(6)
Macro and Micro Metrics: Macro metrics calculate each class’s precision, recall, and F1-score independently and then average them, treating all classes equally regardless of their frequency. Micro metrics aggregate contributions from all classes globally, giving more weight to classes with higher support; in imbalanced datasets, they therefore largely reflect performance on the dominant classes (a short computation sketch follows this list).
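For reference, most of these metrics map directly onto scikit-learn functions; the toy labels below are purely illustrative, and the retention and adaptation scores are simple ratios and restrictions built on the same calls.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Toy predictions for illustration (three classes with imbalanced support)
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2]

accuracy = accuracy_score(y_true, y_pred)                 # overall accuracy
class_balanced = balanced_accuracy_score(y_true, y_pred)  # mean per-class recall
macro_f1 = f1_score(y_true, y_pred, average="macro")      # all classes weighted equally
micro_f1 = f1_score(y_true, y_pred, average="micro")      # dominated by high-support classes

# Retention score: old-class performance after / before incremental training.
# Adaptation score: the same metrics computed only on the newly introduced classes.
```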

4. Results and Analysis

In this study, we evaluated the performance of a classification model under different training setups, focusing on scenarios with Elastic Weight Consolidation (EWC), without EWC, and with the removal of the 10 most and least frequent classes. The analysis highlights significant trends in model behavior, class imbalance handling, and the effects of incremental learning strategies.

4.1. Overall Performance Trends

The proposed methodology’s overall accuracy and macro metrics (precision, recall, F1-score) demonstrate its ability to handle class-imbalanced incremental learning effectively. As shown in Table 2, the accuracy remained consistently high across the initial and final training phases, ranging from 90.63% to 98.38% in the initial phase and 86.07% to 97.52% in the final phase.
However, significant variations were observed in macro precision, recall, and F1 scores, particularly in severe class imbalance scenarios. The macro F1-score, a crucial metric for evaluating performance in underrepresented classes, highlighted the challenges of learning rare class distributions. Without EWC, the macro F1 scores dropped significantly, while with EWC, the scores improved, reaching up to 82.01% in some scenarios.

4.2. Effect of EWC

Including Elastic Weight Consolidation (EWC) generally improved the model’s stability and performance, especially for underrepresented classes. Macro F1 scores increased in most cases, indicating better handling of rare and moderately frequent classes. For instance, in the “10 least frequent classes removed” scenario, the macro F1-score rose from 56.09% without EWC to 59.29% with EWC in the final training phase. This improvement highlights EWC’s ability to preserve critical parameters, which minimizes knowledge loss while adapting to new tasks.
As detailed in Table 3, EWC enhanced class-specific accuracy across high-frequency and low-frequency classes. For high-frequency classes like BenignTraffic and DDoS-ICMP_Flood, EWC marginally improved accuracy (from 0.95 to 0.96 and 0.98 to 0.99, respectively). However, its impact was more pronounced for low-frequency classes. For example, accuracy for Backdoor_Malware increased from 0.02 without EWC to 0.05 with EWC, and for SqlInjection it rose from 0.01 to 0.03.
On the other hand, certain challenging classes, such as XSS, failed to show improvement, maintaining an accuracy of 0% with or without EWC. For moderately frequent classes like Recon-OSScan and MITM-ArpSpoofing, EWC improved accuracy from 0.45 to 0.50 and 0.70 to 0.75, respectively. These results demonstrate EWC’s effectiveness in stabilizing performance across different class distributions while identifying areas requiring further optimization.
The ongoing underperformance of rare classes, such as XSS, can be attributed to a number of interconnected factors. These include their extremely limited occurrence frequency, high variability within the class, and significant overlap in features with more prevalent classes. Furthermore, while Elastic Weight Consolidation (EWC) helps preserve previously acquired knowledge by constraining updates to critical parameters, this regularization may inadvertently impede the model’s ability to develop new representations for rare classes introduced later in the learning process.
Figure 2 presents a comparative analysis of the proposed model’s performance with and without Elastic Weight Consolidation (EWC), focusing on accuracy and macro F1-score throughout the initial and final evaluation phases. The findings indicate that the inclusion of EWC markedly improves macro-level performance, particularly during the final training phase, even in the presence of class imbalance.

4.3. Impact of Class Removal

4.3.1. Removal of Most Frequent Classes:

(a)
Removing the 10 most frequent classes shifted focus toward smaller, underrepresented categories, significantly reducing macro metrics. For example, in the initial training without EWC, the macro recall was only 54.42%, reflecting poor performance on rare classes like Backdoor_Malware, SqlInjection, and XSS.
(b)
EWC mitigated these effects slightly, improving macro F1-scores from 56.33% (without EWC) to 57.13% (with EWC) in initial training. However, many rare classes still exhibited low recall, indicating persistent challenges in generalization.

4.3.2. Removal of Least Frequent Classes:

Removing the 10 least frequent classes simplified the classification task, improving performance for the remaining classes. For instance, without EWC, the initial training achieved a macro F1-score of 84.57%, compared to only 56.33% when the most frequent classes were removed. This observation indicates that the presence of highly frequent classes adds significant complexity to incremental learning tasks.
The inclusion of EWC further enhanced performance in this setup. During the final training phase, EWC preserved critical parameters, enabling the model to achieve a macro F1-score of 82.01%. This highlights EWC’s pivotal role in retaining knowledge during incremental learning, even when class distribution is highly imbalanced.
Table 4 presents class-specific precision, recall, and F1-score metrics, showcasing the best and worst-case scenarios across different classes. Precision and recall remained consistently high for high-frequency classes like BenignTraffic and DDoS-ICMP_Flood, resulting in F1-scores above 0.85 even in the worst-case scenarios. For instance, DDoS-ICMP_Flood achieved a near-perfect F1-score of 0.9994 in the best-case scenario.
Conversely, underrepresented classes such as Backdoor_Malware and SqlInjection exhibited poor performance, with best-case F1-scores of only 0.089 and 0.0458, respectively. Challenging classes like XSS showed no measurable improvement, with precision, recall, and F1-scores consistently at 0, indicating the need for targeted strategies to address these cases.

4.4. Class-Specific Observations

(1)
High-support classes such as BenignTraffic and DDoS-ICMP_Flood consistently achieved high precision, recall, and F1-scores, dominating weighted metrics.
(2)
Rare classes like Backdoor_Malware, SqlInjection, and XSS exhibited near-zero recall and F1-scores in most scenarios, especially without EWC, indicating a need for targeted strategies to improve their representation.
(3)
Moderately frequent classes (DNS_Spoofing, Recon-OSScan) showed variability in performance, benefiting slightly from EWC but requiring additional attention for robust generalization.

4.5. Incremental Learning Challenges

The results underscore the inherent challenges posed by incremental learning, particularly in handling underrepresented classes and maintaining performance across training phases. Without EWC, the model’s macro recall and F1 scores deteriorated significantly during the final training phases. For instance, in the “Most Frequent Removed” scenario, the macro F1-score dropped from 56.33% in the initial training phase to 51.43% in the final phase (as shown in Table 5). This decline highlights the difficulty of preserving knowledge in class-imbalanced incremental learning without additional mechanisms.
Including EWC alleviated these challenges by preserving critical parameters associated with previously learned tasks. In the same scenario, EWC improved the macro F1-score from 57.13% in the initial phase to 59.30% in the final phase. These improvements, ranging up to 3% in some cases, demonstrate EWC’s effectiveness in mitigating catastrophic forgetting and stabilizing performance across classes, regardless of frequency.

4.6. Summary of Micro and Macro Metrics

(1)
Micro Metrics: High micro precision, recall, and F1-scores (above 90%) were consistent across all scenarios, reflecting the dominance of high-support classes.
(2)
Macro Metrics: Macro scores revealed the disproportionate impact of rare classes on overall performance. These metrics varied significantly, ranging from 56.09% to 84.57%, indicating that addressing class imbalance is critical.
The findings highlight the importance of tailored strategies to manage class imbalance and incremental learning, with EWC demonstrating the potential for improved stability and retention of knowledge across class distributions.

5. Conclusions

This study explored the intertwined challenges of class imbalance and catastrophic forgetting in incremental learning, focusing on the role of Elastic Weight Consolidation (EWC) in mitigating these issues. The findings demonstrated that EWC significantly enhances stability and retention of knowledge across varying class distributions. For example, EWC improved macro F1 scores from 56.09% to 59.30% in the “10 least frequent classes removed” scenario, underscoring its effectiveness in reducing performance degradation during incremental training.
Integrating EWC with a shared encoder and task-specific heads enabled robust feature representation and task adaptability. This hybrid architecture showed scalability and practicality, particularly for real-world applications like IoT systems, where dynamic and imbalanced data distributions are common. Despite these advancements, challenges persist in achieving substantial gains for rare classes and stabilizing moderately frequent ones, highlighting the complexity of severe class imbalance.
Future work should focus on hybrid methodologies that integrate complementary techniques, such as adaptive reweighting and feature distillation, to address specific challenges like rare class representation. Additionally, real-world validations across diverse domains and dynamic data environments are crucial to enhance incremental learning systems’ applicability further. These directions aim to advance the field and ensure robust generalization in complex, evolving scenarios.

Author Contributions

Methodology, E.B.; writing—original draft preparation, E.B.; writing—review and editing, C.B.; supervision, C.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in the CICIoT2023 dataset at https://www.unb.ca/cic/datasets/iotdataset-2023.html (accessed on 10 April 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. De Lange, M.; Aljundi, R.; Masana, M.; Parisot, S.; Jia, X.; Leonardis, A.; Slabaugh, G.; Tuytelaars, T. A Continual Learning Survey: Defying Forgetting in Classification Tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3366–3385.
  2. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
  3. Shaker, A.; Alesiani, F.; Yu, S.; Yin, W. Bilevel Continual Learning. arXiv 2020, arXiv:2011.01168.
  4. Wu, Z.; Tran, H.; Pirsiavash, H.; Kolouri, S. Is Multi-Task Learning an Upper Bound for Continual Learning? In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
  5. Ebrahimi, S.; Meier, F.; Calandra, R.; Darrell, T.; Rohrbach, M. Adversarial Continual Learning. arXiv 2020, arXiv:2003.09553.
  6. Kudithipudi, D.; Aguilar-Simon, M.; Babb, J.; Bazhenov, M.; Blackiston, D.; Bongard, J.; Brna, A.P.; Raja, S.C.; Cheney, N.; Clune, J.; et al. Biological Underpinnings for Lifelong Learning Machines. Nat. Mach. Intell. 2022, 4, 196–210.
  7. Nguyen, C.V. Variational Continual Learning. arXiv 2017, arXiv:1710.10628.
  8. Chaudhry, A. Efficient Lifelong Learning With a-Gem. arXiv 2018, arXiv:1812.00420.
  9. Aljundi, R.; Babiloni, F.; Elhoseiny, M.; Rohrbach, M.; Tuytelaars, T. Memory Aware Synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
  10. Rebuffi, S.-A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. iCaRL: Incremental Classifier and Representation Learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5533–5542.
  11. Wu, Y.; Chen, Y.; Wang, L.; Ye, Y.; Liu, Z.; Guo, Y.; Fu, Y. Large Scale Incremental Learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
  12. Yang, D.; Zhou, Y.; Hong, X.; Zhang, A.; Wang, W. One-Shot Replay: Boosting Incremental Object Detection via Retrospecting One Object. Proc. AAAI Conf. Artif. Intell. 2023, 37, 3127–3135.
  13. Bang, J.; Kim, H.; Yoo, Y.; Ha, J.W.; Choi, J. Rainbow memory: Continual learning with a memory of diverse samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8218–8227.
  14. Shin, H.; Lee, J.K.; Kim, J.; Kim, J. Continual Learning with Deep Generative Replay. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 2990–2999.
  15. Rusu, A.A.; Rabinowitz, N.C.; Desjardins, G.; Soyer, H.; Kirkpatrick, J.; Kavukcuoglu, K.; Pascanu, R.; Hadsell, R. Progressive Neural Networks. arXiv 2016, arXiv:1606.04671.
  16. Hizal, S.; Cavusoglu, U.; Akgun, D. A novel deep learning-based intrusion detection system for IoT DDoS security. Internet Things 2024, 28, 101336.
  17. Deng, J.-R.; Hu, J.; Zhang, H.; Wang, Y. Incremental Prototype Tuning for Class Incremental Learning. arXiv 2022, arXiv:2204.03410.
  18. Hou, S.; Pan, X.; Loy, C.C.; Wang, Z.; Lin, D. Learning a Unified Classifier Incrementally via Rebalancing. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 831–839.
  19. Cha, S.; Cho, S.; Hwang, D.; Hong, S.; Lee, M.; Moon, T. Rebalancing Batch Normalization for Exemplar-Based Class-Incremental Learning. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 20127–20136.
  20. Kang, M.; Park, J.; Han, B. Class-Incremental Learning by Knowledge Distillation with Adaptive Feature Consolidation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
  21. Huang, L.; Cao, X.; Lu, H.; Liu, X. Class-Incremental Learning with CLIP: Adaptive Representation Adjustment and Parameter Fusion. In Computer Vision—ECCV 2024, Proceedings of the 18th European Conference, Milan, Italy, 29 September–4 October 2024; Proceedings, Part LIV; Springer: Berlin/Heidelberg, Germany, 2024; pp. 214–231.
  22. Wen, H.; Pan, L.; Dai, Y.; Qiu, H.; Wang, L.; Wu, Q.; Li, H. Class Incremental Learning with Multi-Teacher Distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024.
  23. Küçükkara, M.Y.; Atban, F.; Bayılmış, C. Quantum-Neural Network Model for Platform Independent DDoS Attack Classification in Cyber Security. Adv. Quantum Technol. 2024, 7, 2400084.
  24. Hızal, S.; Akhter, A.F.M.S.; Çavuşoğlu, Ü.; Akgün, D. Blockchain-based IoT security solutions for IDS research centers. Internet Things 2024, 27, 101307.
  25. Joseph, K.; Rajasegaran, J.; Khan, S.; Khan, F.; Balasubramanian, V. Incremental object detection via meta-learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 9209–9216.
  26. Bayılmış, C.; Ebleme, M.A.; Çavuşoğlu, Ü.; Küçük, K.; Sevin, A. A survey on communication protocols and performance evaluations for Internet of Things. Digit. Commun. Netw. 2022, 8, 1094–1104.
  27. Chen, H.; Bajorath, J. Meta-learning for transformer-based prediction of potent compounds. Sci. Rep. 2023, 13, 16145.
  28. Zhang, W.; Gu, X. Few-shot class incremental learning via efficient prototype replay and calibration. Entropy 2023, 25, 776.
  29. Tabassum, A.; Erbad, A.; Mohamed, A.; Guizani, M. Privacy-preserving distributed IDs using incremental learning for IoT health systems. IEEE Access 2021, 9, 14271–14283.
  30. Sun, Z.; Guo, R.; Jin, Z. Intrusion detection method based on active incremental learning in industrial internet of things environment. J. Internet Things 2022, 4, 99–111.
  31. Akgün, D.; Hizal, S.; Çavuşoğlu, Ü. A new DDoS attack intrusion detection model based on deep learning for cybersecurity. Comput. Secur. 2022, 117, 102748.
  32. Neto, E.C.P.; Dadkhah, S.; Ferreira, R.; Zohourian, A.; Lu, R.; Ghorbani, A.A. CICIoT2023: A Real-Time Dataset and Benchmark for Large-Scale Attacks in IoT Environment. Sensors 2023, 23, 5941.
Figure 1. Proposed Model Architecture.
Figure 2. Accuracy and macro F1-score comparison with and without EWC across initial and final training phases.
Table 1. Comparison of incremental learning approaches.

| Approach | Description | Strengths | Limitations | Key References |
|---|---|---|---|---|
| Regularization-Based | Constrains updates to critical parameters using regularization (e.g., EWC). | Effectively mitigates catastrophic forgetting. Preserves knowledge of earlier tasks. | Struggles with class imbalance. Prioritizes dominant classes. | [2,3,8,9] |
| Replay-Based | Stores or generates data for replay during training (e.g., iCaRL, One-Shot Replay). | Balances old and new classes. Enhances incremental learning outcomes. | Requires significant memory. Biases in replay selection. Does not fully address imbalance. | [10,11,12,13,14] |
| Architectural-Based | Adds network resources dynamically (e.g., Progressive Neural Networks, PackNet). | Reduces task interference. Strong in mitigating catastrophic forgetting. | Scalability issues. Increasing network size with more tasks. | [11,15] |
| Class Imbalance | Addresses bias towards overrepresented classes using bias correction (e.g., BIC). | Attempts bias correction. | Requires full dataset access. Struggles with severe class imbalance across tasks. | [11,17,18,19] |
| Encoder-Based | Uses shared encoders for task-agnostic feature representations, feature distillation, and multi-head architectures. | Reduces class-specific disparities. Preserves feature-level knowledge. | Overlooks parameter importance. Degradation of critical features over time. | [20,21,22] |
| Proposed Method | Combines EWC with shared encoders and task-specific output heads. | Robust feature learning across tasks. Mitigates catastrophic forgetting and class imbalance. Scalable and efficient. | Sensitive to class distribution changes. Needs improvement for rare classes. | This work |
Table 2. Overall Performance Metrics.

| Scenario | Accuracy | Macro Precision | Macro Recall | Macro F1-Score |
|---|---|---|---|---|
| Initial Training (No EWC, Most Frequent Removed) | 0.9063 | 0.7077 | 0.5442 | 0.5633 |
| Final Training (No EWC, Most Frequent Removed) | 0.9685 | 0.6301 | 0.5143 | 0.5143 |
| Initial Training (No EWC, Least Frequent Removed) | 0.9838 | 0.8714 | 0.8346 | 0.8457 |
| Final Training (No EWC, Least Frequent Removed) | 0.9752 | 0.6232 | 0.5546 | 0.5609 |
| Initial Training (EWC, Most Frequent Removed) | 0.9099 | 0.6866 | 0.5507 | 0.5713 |
| Final Training (EWC, Most Frequent Removed) | 0.8607 | 0.6650 | 0.5787 | 0.5930 |
| Initial Training (EWC, Least Frequent Removed) | 0.9809 | 0.8710 | 0.8028 | 0.8201 |
| Final Training (EWC, Least Frequent Removed) | 0.9809 | 0.8710 | 0.8028 | 0.8201 |
Table 3. Class-Balanced Accuracy.

| Class | Accuracy (No EWC) | Accuracy (EWC) |
|---|---|---|
| BenignTraffic | 0.95 | 0.96 |
| DDoS-ICMP_Flood | 0.98 | 0.99 |
| Backdoor_Malware | 0.02 | 0.05 |
| SqlInjection | 0.01 | 0.03 |
| XSS | 0.00 | 0.00 |
| Recon-OSScan | 0.45 | 0.50 |
| MITM-ArpSpoofing | 0.70 | 0.75 |
| DDoS-UDP_Flood | 0.98 | 0.99 |
Table 4. Class-Specific Performance Metrics.

| Class | Precision (Best) | Recall (Best) | F1-Score (Best) | Precision (Worst) | Recall (Worst) | F1-Score (Worst) |
|---|---|---|---|---|---|---|
| BenignTraffic | 0.8009 | 0.9750 | 0.8777 | 0.5648 | 0.9279 | 0.8510 |
| DDoS-ICMP_Flood | 1.0000 | 0.9990 | 0.9994 | 0.9715 | 0.9736 | 0.9866 |
| Backdoor_Malware | 0.4679 | 0.0490 | 0.0890 | 0.0000 | 0.0000 | 0.0000 |
| SqlInjection | 0.4301 | 0.0240 | 0.0458 | 0.0000 | 0.0000 | 0.0000 |
| XSS | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
Table 5. Incremental Learning Performance.

| Scenario | Macro Precision | Macro Recall | Macro F1-Score |
|---|---|---|---|
| Initial Training (No EWC, Most Frequent Removed) | 0.7077 | 0.5442 | 0.5633 |
| Final Training (No EWC, Most Frequent Removed) | 0.6301 | 0.5143 | 0.5143 |
| Initial Training (EWC, Most Frequent Removed) | 0.6866 | 0.5507 | 0.5713 |
| Final Training (EWC, Most Frequent Removed) | 0.6650 | 0.5787 | 0.5930 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

