1. Introduction
Humans learn new tasks without forgetting what they already know, but neural networks trained on sequential tasks typically suffer from catastrophic forgetting [1], especially when access to old data is limited or only a few exemplars are available. Regularization-based methods add penalties that discourage changes to parameters deemed important for previous tasks. Rehearsal-based methods store a small subset of old examples for replay during new-task training. Knowledge distillation is also widely used; for example, iCaRL [2] freezes the model learned up to stage $t-1$ and uses it as a teacher to transfer old-task knowledge to the model at stage $t$. These approaches help, yet continual learning with a single shared backbone remains challenging.
Dynamic-network methods expand the model by assigning a dedicated feature extractor to each task. After training a task, its extractor is frozen, and a new extractor is allocated for the next task (see Figure 1, left). This prevents interference with parameters of previous tasks, but the number of extractors increases linearly with the number of tasks. For example, in methods such as DER, a full CNN backbone is added for every new task, so after M tasks, the model contains M extractors. At test time, each input is processed by all extractors, and the resulting features are concatenated and fed into a global classifier. However, since early extractors are trained only on their original task data, their outputs are often irrelevant for later tasks. Consequently, concatenating features from all extractors introduces redundancy and noise, which reduces accuracy. For example, as shown in Table 1, dividing the 100 classes of CIFAR-100 into 10 tasks requires 10 extractors under DER; when the dataset is instead split into 20 tasks, the number of extractors doubles to 20, yet the accuracy drops. This demonstrates that increasing the number of extractors does not guarantee better performance and that excessive extractors can harm generalization by amplifying redundancy.
These observations highlight a key limitation: dedicating one extractor per task is inefficient. We therefore ask whether extractors can be shared across tasks to reduce their total number, mitigate interference from redundant extractors, and still preserve the benefits of dynamic networks. To this end, we propose Task-Sharing Distillation (TSD), which reduces the number of extractors by allowing tasks to share them. We study two variants.
Under grouped rolling consolidation, consecutive tasks are grouped and merged to share a single extractor via distillation. For example, tasks 1–2 may share one extractor, and tasks 3–4 may share another (see Figure 1, middle). The group size does not need to be fixed; different extractors may cover different numbers of tasks, depending on the experimental setup.
Under a fixed-size pool with similarity-based consolidation, we first set a maximum number of extractors. Early tasks each initialize a new extractor until this limit is reached. For every subsequent task, we train a temporary extractor, then merge it into the most compatible existing extractor through distillation. Compatibility is determined by prototype similarity: the new task’s feature prototypes are compared with those maintained by each existing extractor, and the task is merged into the extractor with the highest similarity. This strategy encourages feature reuse among related tasks while keeping the overall number of extractors bounded (see Figure 1, right).
To further ensure that different extractors maintain independent feature subspaces, we impose a feature distinctness constraint during training. When a new extractor is introduced, we add an explicit constraint that encourages its features to remain distinct from those of existing extractors. This constraint guides the new extractor toward representations that are discriminative with respect to previously learned subspaces so that each extractor specializes in a subset of task subspaces and different subsets remain independent.
Compared with DER, our approach substantially reduces the number of backbone parameters. In particular, using only three extractors, our methods achieve higher accuracy than DER, which relies on ten extractors.
Although our experiments focus on vision tasks, the underlying idea is more general. Large language models (LLMs) are widely applied in continual and multi-domain settings, where assigning a separate adapter to each task similarly leads to parameter growth and redundancy. Our proposed Task-Sharing Distillation (TSD) offers a promising way to consolidate and share modules in such scenarios, highlighting its potential relevance for the efficient scaling of LLMs.
Our main contributions are summarized as follows:
1. We analyze the relationship between the number of extractors and model performance. Increasing the number of extractors not only inflates the parameter count but can also reduce generalization.
2. We propose task-sharing distillation, which reduces the number of extractors by allowing tasks to share them and by consolidating multiple extractors through distillation. We present two practical strategies: grouped rolling consolidation, which groups consecutive tasks into a shared extractor, and fixed-size pooling with similarity-based consolidation, which allocates a fixed number of extractors and assigns new tasks to the most similar one.
3. Our methods achieve superior accuracy and parameter efficiency compared with state-of-the-art methods such as DER. In particular, our methods outperform DER while using only three extractors, whereas DER requires ten.
2. Related Work
Incremental learning can be categorized into three main types [3,4]: task-incremental, domain-incremental, and class-incremental learning. In task-incremental learning, the model receives the task ID during evaluation and only classifies within the given task. Domain-incremental learning uses the same set of classes across tasks but with different domains [5,6]. Class-incremental learning assigns different classes to each task, without providing the task ID at test time, so the model must classify across all classes.
To reduce forgetting, regularization-based methods such as EWC [1,7] constrain parameter updates based on their importance to previous tasks, where the Fisher information matrix is used to estimate parameter importance. Knowledge distillation methods [2,8,9,10,11,12,13] transfer knowledge from old feature extractors and exemplars, using distillation [14] to guide the new model. In this setting, the new model acts as the student, and the old models serve as the teacher, ensuring that the knowledge of previous tasks is preserved. Other works [15,16] address old-class forgetting caused by the imbalance between old and new samples, where the classifier tends to be biased toward classes with more samples; by reweighting the classifiers of old and new classes, these methods adjust the bias and mitigate forgetting. Although these methods alleviate forgetting to some extent, a single shared extractor remains limited in capacity. To improve model expressiveness, AANets [17] combine stable and plastic blocks, and DualNet [18] incorporates complementary fast and slow learning systems [19] to balance stability and plasticity. To cope with the scarcity of old samples, GAN-based approaches [20,21] generate pseudo-samples to alleviate data imbalance, while autoencoder-based methods [22] learn class-specific subspaces to improve class discrimination.
Dynamic-network methods expand the architecture by adding new feature extractors for incoming tasks. DER [23] assigns a new extractor to each task and freezes old ones, thereby reducing interference with previously learned parameters. DyTox [24] introduces a transformer encoder–decoder with dynamic task tokens. TagFex [25] captures task-agnostic features and merges them with task-specific ones. FOSTER [26] incrementally learns new extractors inspired by gradient boosting and distills knowledge from both old and new extractors into a unified model. MEMO [27] expands selected blocks instead of adding entire extractors, while SEED [28] employs a fixed number of extractors and models each class with a Gaussian distribution to select suitable extractors for new tasks. Although these methods are effective and reduce the number of parameters, most of them still do not match the accuracy of DER [23], as using fewer modules makes it difficult to reach comparable performance.
Recently, pre-trained models have also been applied to incremental learning [5,29,30,31,32]. These approaches freeze the pre-trained backbone and insert lightweight modules such as adapters, LoRA, or prompts, tuning only these additional modules for efficient adaptation. This strategy leverages the rich knowledge in pre-trained models while enabling efficient learning of new tasks.
3. Methodology
3.1. Problem Setup and Method Overview
In this section, we first define the class-incremental learning task, then introduce dynamic network-based approaches and, finally, motivate our two methods that share feature extractors.
Class-Incremental Learning Task. The goal of incremental learning is to train on a sequence of tasks and, after completing all tasks, maintain strong performance across all learned classes. Let the total number of tasks be $T$. For task $t$, the dataset is $\mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^{n_t}$, with class set $\mathcal{C}_t$. The class sets are disjoint, i.e., $\mathcal{C}_i \cap \mathcal{C}_j = \emptyset$ for $i \neq j$. Let $|\mathcal{C}_t|$ denote the number of classes in task $t$. Each sample $(x_i, y_i)$ belongs to one of these classes.
Dynamic-Network Methods. Traditional incremental learning often uses a single backbone for all tasks, which risks severe forgetting. Dynamic-network methods address this by assigning a dedicated feature extractor to each task. We denote the task-specific extractor (e.g., a CNN) used in DER for task $t$ as $\phi_t$. A representative baseline, DER, trains $\phi_t$ on task $t$ and then freezes its parameters. For the next task, a new extractor $\phi_{t+1}$ is introduced. At inference time, an input $x$ is processed by all extractors, and their features are concatenated before being classified by a global head $g$:
$\hat{y} = g\big([\phi_1(x); \phi_2(x); \ldots; \phi_t(x)]\big).$
Freezing prevents interference with past knowledge and improves performance compared with a single shared backbone. However, this design introduces two major issues. First, the number of parameters grows linearly with the number of tasks. Second, since learning is sequential, an extractor $\phi_s$ with $s < t$ never observes data from a later task $t$, making its features irrelevant or even harmful for later tasks.
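For concreteness, the following PyTorch-style sketch (an illustration under our assumptions rather than DER's released code; all names are ours) shows how inference concatenates the features of all frozen extractors and classifies them with a single global head.

```python
import torch
import torch.nn as nn

def der_style_inference(x: torch.Tensor, extractors: list, global_head: nn.Module) -> torch.Tensor:
    """DER-style inference: every frozen extractor encodes x; the concatenated
    features are classified by one global head over all classes seen so far."""
    with torch.no_grad():
        feats = [phi(x) for phi in extractors]      # one feature tensor per extractor
    return global_head(torch.cat(feats, dim=1))     # concatenate along the feature dimension
```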
These limitations raise an important question: Can we reduce the number of extractors while still retaining the benefits of dynamic networks? To address this, we introduce Task-Sharing Distillation (TSD), which progressively merges task knowledge into shared extractors through distillation. Building on TSD, our two methods, grouped rolling consolidation and fixed-size pooling with similarity-based consolidation, effectively control parameter growth, reduce redundancy, and maintain accuracy on both past and future tasks.
3.2. Grouped Rolling Consolidation (GRC)
The overall framework is illustrated in Figure 2. Given T tasks, we partition them into groups, where each group contains a contiguous subset of tasks. The group size is not fixed in advance; different groups may cover different numbers of tasks, depending on the schedule. After completing the first group, we freeze its consolidated extractor. For later groups, task-specific extractors are temporarily maintained, progressively merged by distillation into a rolling extractor, and finally frozen as the group extractor at the end of the group. Importantly, the rolling extractor is a single consolidated extractor obtained from the tasks processed so far within the group, not a concatenation of multiple extractors.
For the first task in a group, all frozen extractors from previous groups remain fixed. A new extractor and a temporary classifier head are instantiated, and the classifier input concatenates the frozen features with the new feature; the model is trained with a cross-entropy loss over all classes seen so far. No consolidation is performed at this stage, since there is only one new extractor in the group.
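A minimal sketch of this training step, assuming a standard PyTorch setup (the function and argument names are illustrative), is given below; the frozen extractors only supply features, while gradients flow through the new extractor and the temporary head.

```python
import torch
import torch.nn.functional as F

def train_step_new_extractor(x, y, frozen_extractors, new_extractor, head, optimizer):
    """One cross-entropy step: concatenate frozen features with the new feature,
    classify over all classes seen so far, and update only the new modules."""
    with torch.no_grad():
        old_feats = [phi(x) for phi in frozen_extractors]   # no gradients through frozen parts
    logits = head(torch.cat(old_feats + [new_extractor(x)], dim=1))
    loss = F.cross_entropy(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```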
For each subsequent task in the group, we instantiate a new extractor and a temporary head. The current-group teacher feature concatenates the features of the extractors currently maintained for this group, the full teacher feature further includes the features of all frozen extractors from previous groups, and the temporary head predicts over all classes observed so far. We first optimize the new extractor and its head with the cross-entropy loss. After this step, both are frozen, and consolidation is applied.
We then merge the current-group extractors into a rolling extractor with its own head while keeping all previously frozen extractors fixed. The student features concatenate the frozen features with the rolling extractor's feature, the teacher and student logits are produced by their respective heads, and logit distillation with a temperature scalar is applied. After optimization, the temporary teachers that form the current-group teacher are discarded, and only the rolling extractor is kept.
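For reference, a standard implementation of temperature-scaled logit distillation consistent with the description above is sketched below (PyTorch-style; the helper name is ours, and the exact loss form in our equations may differ in minor details).

```python
import torch
import torch.nn.functional as F

def logit_distillation(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       tau: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student distributions,
    scaled by tau^2 so gradient magnitudes stay comparable across temperatures."""
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (tau ** 2)
```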
When the last task in a group is completed, the rolling extractor, which has distilled knowledge from multiple tasks into a single network, is frozen as the group extractor. Thus, after all K groups have been processed, we obtain K frozen group extractors.
After the completion of each task, prediction uses the concatenation of all frozen group extractors and the current rolling extractor (if the current group is not yet finished). If the task is the last one in its group, inference uses the frozen group extractors only. For instance, if the tasks are partitioned into four groups, the procedure yields four frozen extractors.
3.3. Fixed-Size Pooling with Similarity-Based Consolidation
GRC controls parameter growth effectively, but it does not exploit task similarity. This raises a natural question: should similar tasks share extractors to further reduce redundancy? Since task similarity cannot be determined in advance under incremental learning, we adopt a simplified setting: we first allocate one extractor to each of the initial N tasks, forming a fixed-size pool of N extractors, and subsequent tasks are then integrated by sharing the most related extractor in this pool. The overall framework is illustrated in Figure 3.
We predefine the number of available extractors as N. For each of the first N tasks, a new extractor and a classifier are assigned. During training, all previous extractors are frozen, and only the new extractor and its classifier are updated; the classifier input concatenates the features from all frozen extractors and the new one. After training the N-th task, we retain the resulting pool of N extractors.
When a new task arrives after the first N tasks, we instantiate a temporary extractor and a classifier. The classifier takes as input the concatenation of the features from all N frozen pool extractors and the temporary extractor, and the model is trained with cross-entropy over all classes from tasks 1 through the current task t.
After training, the temporary extractor must be consolidated into one of the existing N extractors to keep the pool size fixed.
For each class c in the new task t, we compute its prototype on every pool extractor as the mean feature of that class's samples under the given extractor.
For each pool extractor and the tasks it already covers, the prototype of each old class is computed as the mean feature of that class's exemplar set.
The similarity score of a new class with respect to a pool extractor compares the new-class prototype with that extractor's own class prototypes. Summing these scores over all classes in task t yields the task-to-extractor similarity, and we select the extractor with the maximum similarity as the consolidation target.
To avoid overusing a single extractor, we further enforce balanced selection among extractors. If an extractor is chosen for the current task, it is temporarily excluded from the next selection, and the consolidation target is chosen from the remaining extractors.
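The sketch below illustrates one plausible instantiation of this selection rule; it assumes cosine similarity between each new-class prototype and the best-matching prototype of an extractor's own classes, summed over the new task's classes, and the excluded argument implements the balanced-selection rule described above. The function names are illustrative.

```python
import torch
import torch.nn.functional as F

def task_to_extractor_similarity(new_protos: torch.Tensor, own_protos: torch.Tensor) -> float:
    """new_protos: [C_new, d] prototypes of the new task's classes on one pool extractor.
    own_protos:  [C_own, d] prototypes of that extractor's own classes (from exemplars)."""
    sims = F.normalize(new_protos, dim=1) @ F.normalize(own_protos, dim=1).t()
    return sims.max(dim=1).values.sum().item()   # best match per new class, summed over the task

def select_consolidation_target(new_protos_per_ext, own_protos_per_ext, excluded=()):
    """Return the index of the pool extractor with the highest task-to-extractor similarity,
    skipping any index temporarily excluded by the balanced-selection rule."""
    scores = [float("-inf") if i in excluded
              else task_to_extractor_similarity(n, o)
              for i, (n, o) in enumerate(zip(new_protos_per_ext, own_protos_per_ext))]
    return max(range(len(scores)), key=scores.__getitem__)
```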
We copy the selected extractor and unfreeze the copy for optimization while keeping the remaining pool extractors frozen. The teacher logits are produced by the classifier applied to the concatenation of the features from all N frozen pool extractors and the frozen temporary extractor.
The student always maintains exactly N extractors: it replaces the selected extractor with its trainable copy and introduces a trainable head. We then minimize the logit distillation loss between the teacher and student outputs, softened by a temperature scalar.
After optimization, we replace the selected extractor with its updated copy and discard the temporary extractor, ensuring that the size of the extractor pool remains N.
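Putting these steps together, the following sketch outlines the consolidation procedure under simplifying assumptions (PyTorch modules, SGD, and the standard KL form of logit distillation; all names are illustrative): the teacher combines the N frozen pool extractors with the frozen temporary extractor, while the student replaces the selected extractor with a trainable copy.

```python
import copy
import torch
import torch.nn.functional as F

def consolidate_into_pool(pool, k, temp_extractor, teacher_head, student_head,
                          loader, tau=2.0, lr=0.1, epochs=1):
    """Distill the (N+1)-extractor teacher into an N-extractor student whose k-th
    extractor is a trainable copy of pool[k]; the pool size stays fixed at N."""
    student_k = copy.deepcopy(pool[k])
    params = list(student_k.parameters()) + list(student_head.parameters())
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                frozen = [phi(x) for phi in pool]                         # N frozen pool features
                t_logits = teacher_head(torch.cat(frozen + [temp_extractor(x)], dim=1))
            s_feats = [student_k(x) if i == k else frozen[i] for i in range(len(pool))]
            s_logits = student_head(torch.cat(s_feats, dim=1))
            loss = F.kl_div(F.log_softmax(s_logits / tau, dim=1),
                            F.softmax(t_logits / tau, dim=1),
                            reduction="batchmean") * tau ** 2
            opt.zero_grad()
            loss.backward()
            opt.step()
    pool[k] = student_k               # update the pool; the temporary extractor is discarded
    return pool
```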
3.4. Training Objective
For each new task t, the learnable feature extractor and its task-specific classifier are optimized with a standard cross-entropy loss, as in Equations (3), (6), (14), and (16). To further encourage feature-space separation across extractors, we enhance the training of each newly added extractor with contrastive learning [33]. Specifically, memory samples from old tasks, processed by the frozen extractors, are incorporated as additional negatives to guide the new extractor toward a distinct representation space.
For a sample from the current task in the mini-batch, the positive set consists of embeddings of augmented views of the same class extracted by the current feature extractor. The negative set includes two sources: (i) embeddings from different classes in the same mini-batch under the current extractor and (ii) embeddings of old classes in the mini-batch extracted by the frozen extractors.
Embeddings are obtained by applying a two-layer, fully connected projection head [33] to the output of the feature extractor, and the contrastive loss is computed over a mini-batch of size B from the current task.
The overall loss for task t combines the cross-entropy loss with the contrastive loss, weighted by a coefficient that balances the two terms.
Here, the cross-entropy term refers to Equation (3) for the first task in a group, Equation (6) for subsequent tasks before consolidation, and Equations (14) and (16) for the fixed-pool method.
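For illustration, a SupCon-style implementation of this separation term is sketched below; the exact loss form is our assumption, with memory-sample embeddings from the frozen extractors appended as additional negatives, and all tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_with_frozen_negatives(z_new, labels, z_frozen_old, tau=0.07):
    """z_new:        [B, d] projected embeddings of current-task samples (new extractor).
    labels:       [B] class labels; same-class pairs within the batch are positives.
    z_frozen_old: [M, d] projected embeddings of old-class memory samples produced by
                  the frozen extractors, used purely as additional negatives."""
    z = F.normalize(z_new, dim=1)
    z_neg = F.normalize(z_frozen_old, dim=1)
    B = z.size(0)
    logits_self = z @ z.t() / tau                       # [B, B] similarities within the batch
    logits_old = z @ z_neg.t() / tau                    # [B, M] similarities to frozen negatives
    eye = torch.eye(B, dtype=torch.bool, device=z.device)
    logits_self = logits_self.masked_fill(eye, -1e9)    # exclude self-similarity
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~eye
    all_logits = torch.cat([logits_self, logits_old], dim=1)
    log_prob = all_logits - torch.logsumexp(all_logits, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                              # anchors without positives are skipped
    mean_log_prob_pos = (log_prob[:, :B] * pos_mask).sum(dim=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()
```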
4. Experiments
4.1. Datasets and Settings
We evaluate our methods on CIFAR-100 [34] and ImageNet-100. CIFAR-100 consists of 100 classes, each containing 500 training images and 100 test images of size 32 × 32. ImageNet-100 is a subset of ImageNet [35] with 100 classes; each class has about 1300 training images and 50 validation images at higher resolution. The evaluation covers multiple incremental learning settings. Following DER, we adopt two standard protocols on ImageNet-100:
(1) B0S10: 10 classes per task, for a total of 10 tasks;
(2) B50S5: the initial task has 50 classes, and each new task adds 5 classes, for a total of 11 tasks.
For CIFAR-100, we consider four variants:
(1) B0S10: 10 classes per task, for a total of 10 tasks;
(2) B0S5: 5 classes per task, for a total of 20 tasks;
(3) B50S5: 50 classes in the initial task, followed by 5 classes per task, for a total of 11 tasks;
(4) B50S2: 50 classes in the initial task, followed by 2 classes per task, for a total of 26 tasks.
Following DER, we use the herding selection strategy [36] to choose and retain old samples. For the B0 setting, we save a total of 2000 samples, while for the B50 setting, we retain 20 samples per class.
Under DER, each task is assigned a separate feature extractor, so the number of extractors grows linearly with the number of tasks. In contrast, our two methods, fixed-size pooling and grouped rolling consolidation, use significantly fewer extractors. We first compare our approaches with DER and other baselines under the same settings and show that using fewer extractors can even surpass the strong DER baseline. For clarity, the experimental tables label our methods by their consolidation strategy (GRC or fixed-size pooling) together with N, the maximum number of extractors used.
For GRC, unless otherwise specified, we assume that all groups (except possibly the last one) contain the same number of tasks. For example, with 10 tasks and N = 4, tasks 1–3 share one extractor, tasks 4–6 share another, tasks 7–9 share another, and task 10 uses its own extractor.
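The grouping schedule implied by this example can be generated, for instance, by the following illustrative helper (0-indexed tasks; the schedules for other configurations may be chosen differently).

```python
def grc_group_schedule(num_tasks: int, group_size: int):
    """Contiguous groups of `group_size` tasks; the last group takes the remainder
    (e.g., 10 tasks with group_size=3 -> [[0,1,2], [3,4,5], [6,7,8], [9]],
    i.e., four extractors in total)."""
    return [list(range(start, min(start + group_size, num_tasks)))
            for start in range(0, num_tasks, group_size)]
```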
Following [23,26], we use ResNet18 [37] as the feature extractor on ImageNet-100 with a batch size of 256. For CIFAR-100, we employ a modified ResNet32 as the feature extractor with a batch size of 128. The initial learning rate is set to 0.1 with a cosine annealing scheduler, and training runs for a total of 170 epochs using SGD with a momentum of 0.9. The weight decay is 5 × 10⁻⁴ when learning new feature extractors and 0 during the distillation phase, where we use a distillation temperature of 2. Following [33], the contrastive temperature is set to 0.07, and we use a two-layer linear projection head, where the hidden layer has the same dimension as the input and the final output dimension is 128.
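For reference, the optimizer and schedule described above can be set up as follows (a minimal PyTorch sketch; the helper name is ours).

```python
import torch

def make_optimizer_and_scheduler(parameters, distillation_phase=False, epochs=170):
    """SGD with momentum 0.9 and initial learning rate 0.1, cosine annealing over the
    training epochs; weight decay is 5e-4 when learning new extractors and 0 during
    the distillation phase, as described in the text."""
    weight_decay = 0.0 if distillation_phase else 5e-4
    optimizer = torch.optim.SGD(parameters, lr=0.1, momentum=0.9, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```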
Following DER, we evaluate models using average accuracy, last-step accuracy, and backward transfer (BWT).
Average accuracy: after completing step $i$, let $A_i$ denote the average accuracy over tasks 1 to $i$. With a total of $N$ tasks, the metric is defined as $\bar{A} = \frac{1}{N}\sum_{i=1}^{N} A_i$.
Last-step accuracy: the accuracy after the final task, i.e., $A_N$.
Backward transfer (BWT): let $R_{j,i}$ denote the accuracy on the test set of task $i$ after learning task $j$. Then, BWT is defined as $\mathrm{BWT} = \frac{1}{T-1}\sum_{i=1}^{T-1}\left(R_{T,i} - R_{i,i}\right)$, where $T$ is the total number of tasks. A negative BWT indicates forgetting of previously learned tasks.
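These two metrics can be computed as in the following sketch, which assumes the standard definitions stated above (the accuracy-matrix layout is our choice).

```python
def average_incremental_accuracy(step_accuracies):
    """step_accuracies[i] = average accuracy over tasks 1..i+1, measured after step i+1.
    Returns the mean over all N incremental steps."""
    return sum(step_accuracies) / len(step_accuracies)

def backward_transfer(R):
    """R[j][i] = accuracy on task i's test set after learning task j (0-indexed, T tasks).
    BWT = mean over i < T-1 of (R[T-1][i] - R[i][i]); negative values indicate forgetting."""
    T = len(R)
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)
```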
We note that DyTox [24] has revised its official results, and in our work, we adopt the corrected values accordingly.
4.2. Results on ImageNet100
In the ImageNet100-B0S10 setting, DER uses 10 feature extractors. Our two methods require only 3 extractors yet outperform DER, as shown in Table 2, achieving both higher average accuracy and higher last-step accuracy. With 6 extractors, our methods achieve even higher accuracy.
In the ImageNet100-B50S5 setting, DER uses 11 extractors, while our method uses only 8 yet achieves higher average accuracy than DER and also consistently outperforms DER in terms of top-5 accuracy, as shown in Table 3.
4.3. Results on CIFAR-100
For CIFAR-100 B0S10, with only six extractors, our methods achieve higher accuracy than DER with ten extractors, as shown in Table 4. We further examine the effect of increasing the number of extractors. As shown in Figure 4, both average accuracy and last-step accuracy improve consistently as the number of extractors increases. This result suggests that, in this setting, the benefit of adding new extractors outweighs the adverse effect of redundant information, thereby alleviating the stability–plasticity dilemma. We further compare the per-task accuracy after learning 10 tasks, as shown in Figure 5. DER suffers from a significant drop in accuracy on the first three tasks, whereas our methods, even with only five extractors, maintain clear advantages on these early tasks. In DER, the representations of old tasks captured by newly added extractors are largely redundant, as these extractors primarily focus on the new tasks; as the number of new extractors grows, this redundancy accumulates and weakens the contribution of the old extractors, which should play the dominant role for their respective tasks. This demonstrates that reducing the number of extractors and enforcing distinct feature subspaces among them is more effective in alleviating the forgetting of early tasks. We also provide a visualization: as shown in Figure 6, DER produces more compact clusters within each class, but several clusters overlap in the central region, whereas our method achieves clearer separation between classes despite using fewer extractors.
In the CIFAR-100 B0S5 setting, both of our methods achieve clear improvements over DER with 20 extractors, as shown in Table 5. We also analyze the effect of the number of extractors on accuracy. As shown in Figure 7, accuracy increases as the number of extractors grows. For the fixed-pool method, increasing the number of extractors from 12 to 15 yields a negligible improvement in average accuracy, whereas the GRC method continues to benefit as more extractors are added.
In the CIFAR-100 B50S5 setting, DER employs 11 extractors. Our two methods achieve higher accuracy with only four extractors, as shown in Table 6. We further analyze the effect of the number of extractors on accuracy. As shown in Figure 8, accuracy improves significantly when the number of extractors increases from four to seven. Beyond seven extractors, the gain becomes marginal, and using 11 extractors does not yield the best accuracy.
In the CIFAR-100 B50S2 setting with 26 tasks, DER uses 26 extractors. Our two methods reach higher accuracy with far fewer extractors, showing clear advantages with only seven, as reported in Table 7.
In this setting, more extractors do not always yield better accuracy. As shown in Figure 9, the best results are obtained with about seven extractors; when the number of extractors increases further, accuracy drops significantly, indicating that too many extractors cause interference across tasks. This suggests that while preserving task-specific extractors helps retain old knowledge, an excessive number of extractors introduces redundant and noisy information, which can interfere with the learning of new tasks and degrade overall performance.
4.4. Ablation Study
We conduct an ablation study on the contrastive separation loss. As shown in Table 8, this loss consistently improves accuracy by enforcing separation across different extractors, making their representations more discriminative. Maintaining distinct feature spaces is particularly important for the fixed-pool method, since it assigns similar tasks to the same extractor. With the separation loss, the fixed-pool method achieves clearer feature separation and larger performance gains.
We also conducted an ablation study on balanced selection in the fixed-pool method. As shown in Table 9, when balanced selection is disabled, the extractors are chosen in an unbalanced manner and one extractor is selected twice, which leads to a drop in accuracy.
4.5. BWT
We report the BWT results in Table 10. Compared with DER, our methods exhibit stronger resistance to forgetting. In DER, newly added extractors do not form clear boundaries with earlier tasks; coupled with the imbalance between new and old classes, the new extractors tend to dominate, leading to more severe forgetting of old tasks. The contrastive separation term encourages separation among extractors and, even with fewer extractors, reduces the redundant cross-task interference introduced by the new ones. In this setting, using the separation term and moderately increasing the number of extractors can further preserve old-task knowledge: the stability gained from old-task extractors outweighs the benefit of reducing redundancy, so the variant with more extractors achieves a lower forgetting rate than the one with fewer. We further analyze the relationship between BWT and the number of extractors without the separation term, as shown in Table 11. Using 10 extractors causes more forgetting than using 5, indicating that simply increasing the number of extractors is counterproductive when their boundaries are not well maintained. The separation term strengthens per-extractor boundaries and mitigates interference from redundant extractors. Consequently, without it, the damage from the redundancy introduced by additional extractors outweighs the stability benefit they provide for old tasks.
4.6. Training and Inference Time
While the distillation step increases the training wall-clock time by more than one-quarter compared with training without distillation, it enables the use of far fewer extractors, and inference is substantially faster. Under identical settings on ImageNet-100, DER requires 41.41 s to complete the test, whereas our method takes only 13.13 s.
4.7. Sensitivity Study of Hyperparameters
We evaluate the effect of the coefficient that balances the cross-entropy and contrastive terms, with results shown in Table 12, and set its value accordingly for all experiments.
4.8. Ablation Study on Task Order
The official DER code provides alternative task orders on CIFAR-100. In our experiments, we additionally adopt order 0 and order 1 from DER to evaluate performance under different task sequences. The results are presented in Table 13. While the fixed-size pooling method achieves slightly higher average accuracy, the difference in last-step accuracy is more pronounced, indicating that the fixed-size pooling method is more sensitive to task order.
4.9. Ablation Study on Similarity-Based Target Selection
We also conducted an ablation study on the similarity-based target selection strategy. Specifically, we compared it with a setting in which the strategy was not used. The resulting accuracy is reported in Table 14, showing that selecting extractors with higher similarity better leverages the information in the existing extractors and thereby achieves higher accuracy.
4.10. Discussion
Our experimental results demonstrate that both task-sharing strategies can achieve higher accuracy than DER while requiring significantly fewer extractors. The additional separation loss enlarges the distance between the feature subspaces of different extractors, ensuring clearer task boundaries and more discriminative representations. In the CIFAR-100 B0S10 setting, we observe that accuracy improves steadily as the number of extractors increases. However, this trend is not always consistent. In the CIFAR-100 B50S5 setting, accuracy gains gradually saturate as the number of extractors grows and even decline slightly in the end. More strikingly, in the CIFAR-100 B50S2 setting, excessive extractors lead to a noticeable drop in accuracy. These findings suggest that while an appropriate number of extractors provides useful task-specific capacity, an excessive number introduces redundancy and noise, which can interfere with cross-task generalization. Future work should therefore explore adaptive mechanisms to determine the optimal number of extractors under different task sequences and data splits.
The fixed-size pooling method may fail due to unreliable similarity measures, and its effectiveness could be limited for tasks with large domain gaps. We leave a more thorough investigation of this limitation for future work.
As a potential future direction, TSD’s feature-sharing mechanism could be extended to continual facial expression recognition, by dynamically allocating the most relevant feature extractors based on the representational similarity between new and existing expressions and incorporating a contrastive loss to enhance their discriminability. However, its practical effectiveness remains to be systematically validated on benchmark affective computing datasets.
5. Conclusions
In this work, we analyzed the impact of extractor growth in dynamic-network methods. We showed that old extractors, which never observe new classes, introduce noise and cause interference, while model parameters continue to grow with the number of tasks. To address these issues, we proposed two strategies for sharing extractors: grouped rolling consolidation (GRC), which groups consecutive tasks to share a consolidated extractor, and fixed-size pooling with similarity-based consolidation, which first learns N extractors and then allows subsequent tasks to share the most similar one. Furthermore, we encouraged each extractor to preserve discriminative and independent features. Both approaches achieve higher accuracy than the strong baseline DER while requiring significantly fewer extractors. Compared with DER, our method uses less than one-third of the extractors while achieving 2.5% higher average accuracy. While our study focuses on vision tasks, the idea of task-sharing distillation is also meaningful for large language models, where continual and multi-domain adaptation often face parameter growth and redundancy; as future work, we plan to explore whether the idea transfers to this setting, in particular by allowing multiple tasks to share adapters in adapter-based extensions. Moreover, the feature-sharing mechanism of TSD holds potential for extension to continual facial expression recognition by dynamically allocating the most relevant feature extractors based on the representational similarity between new and existing expressions and incorporating a contrastive loss to enhance their discriminability. However, its practical effectiveness remains to be systematically validated on benchmark affective computing datasets.