1. Introduction
Humans learn new tasks without forgetting what they already know, but neural networks trained on sequential tasks typically suffer from catastrophic forgetting [1], especially when access to old data is limited or only a few exemplars are available. Regularization-based methods add penalties that discourage changes to parameters deemed important for previous tasks. Rehearsal-based methods store a small subset of old examples for replay during new-task training. Knowledge distillation is also widely used; for example, iCaRL [2] freezes the model learned up to stage $t-1$ and uses it as a teacher to transfer old-task knowledge to the model at stage $t$. These approaches help, yet continual learning with a single shared backbone remains challenging.
Dynamic-network methods expand the model by assigning a dedicated feature extractor to each task. After training a task, its extractor is frozen, and a new extractor is allocated for the next task (see Figure 1, left). This prevents interference with parameters of previous tasks, but the number of extractors increases linearly with the number of tasks. For example, in methods such as DER, a full CNN backbone is added for every new task, so after M tasks, the model contains M extractors. At test time, each input is processed by all extractors, and the resulting features are concatenated and fed into a global classifier. However, since early extractors are trained only on their original task data, their outputs are often irrelevant for later tasks. Consequently, concatenating features from all extractors introduces redundancy and noise, which reduces accuracy. For example, as shown in Table 1, dividing the 100 classes of CIFAR-100 into 10 tasks requires 10 extractors under DER; when the dataset is instead split into 20 tasks, the number of extractors doubles to 20, yet the accuracy drops. This demonstrates that increasing the number of extractors does not guarantee better performance and that excessive extractors can harm generalization by amplifying redundancy.
These observations highlight a key limitation: dedicating one extractor per task is inefficient. We therefore ask whether extractors can be shared across tasks to reduce their total number, mitigate interference from redundant extractors, and still preserve the benefits of dynamic networks. To this end, we propose Task-Sharing Distillation (TSD), which reduces the number of extractors by allowing tasks to share them. We study two variants.
Under grouped rolling consolidation, consecutive tasks are grouped and merged to share a single extractor via distillation. For example, tasks 1–2 may share one extractor, and tasks 3–4 may share another (see Figure 1, middle). The group size does not need to be fixed; different extractors may cover different numbers of tasks, depending on the experimental setup.
Under a fixed-size pool with similarity-based consolidation, we first set a maximum number of extractors. Early tasks each initialize a new extractor until this limit is reached. For every subsequent task, we train a temporary extractor, then merge it into the most compatible existing extractor through distillation. Compatibility is determined by prototype similarity: the new task’s feature prototypes are compared with those maintained by each existing extractor, and the task is merged into the extractor with the highest similarity. This strategy encourages feature reuse among related tasks while keeping the overall number of extractors bounded (see Figure 1, right).
To further ensure that different extractors maintain independent feature subspaces, we impose a feature distinctness constraint during training. When a new extractor is introduced, we add an explicit constraint that encourages its features to remain distinct from those of existing extractors. This constraint guides the new extractor toward representations that are discriminative with respect to previously learned subspaces so that each extractor specializes in a subset of task subspaces and different subsets remain independent.
Compared with DER, our approach substantially reduces the number of backbone parameters. In particular, using only three extractors, our methods achieve higher accuracy than DER, which relies on ten extractors.
Although our experiments focus on vision tasks, the underlying idea is more general. Large language models (LLMs) are widely applied in continual and multi-domain settings, where assigning a separate adapter to each task similarly leads to parameter growth and redundancy. Our proposed Task-Sharing Distillation (TSD) offers a promising way to consolidate and share modules in such scenarios, highlighting its potential relevance for the efficient scaling of LLMs.
Our main contributions are summarized as follows:
1. We analyze the relationship between the number of extractors and model performance. Increasing the number of extractors not only inflates the parameter count but can also reduce generalization.
2. We propose task-sharing distillation, which reduces the number of extractors by allowing tasks to share them and by consolidating multiple extractors through distillation. We present two practical strategies: grouped rolling consolidation, which groups consecutive tasks into a shared extractor, and fixed-size pooling with similarity-based consolidation, which allocates a fixed number of extractors and assigns new tasks to the most similar one.
3. Our methods achieve superior accuracy and parameter efficiency compared with state-of-the-art methods such as DER. In particular, our methods outperform DER while using only three extractors, whereas DER requires ten.
2. Related Work
Incremental learning can be categorized into three main types [3,4]: task-incremental, domain-incremental, and class-incremental learning. In task-incremental learning, the model receives the task ID during evaluation and only classifies within the given task. Domain-incremental learning uses the same set of classes across tasks but with different domains [5,6]. Class-incremental learning assigns different classes to each task, without providing the task ID at test time, so the model must classify across all classes.
To reduce forgetting, regularization-based methods such as EWC [1,7] constrain parameter updates based on their importance to previous tasks, where the Fisher information matrix is used to estimate parameter importance. Knowledge distillation methods [2,8,9,10,11,12,13] transfer knowledge from old feature extractors and exemplars, using distillation [14] to guide the new model. In this setting, the new model acts as the student, and the old models serve as the teacher, ensuring that the knowledge of previous tasks is preserved. Other works [15,16] address old-class forgetting caused by the imbalance between old and new samples, where the classifier tends to be biased toward classes with more samples; by reweighting the classifiers of old and new classes, these methods adjust the bias and mitigate forgetting. Although these methods alleviate forgetting to some extent, a single shared extractor remains limited in capacity. To improve model expressiveness, AANets [17] combine stable and plastic blocks, and DualNet [18] incorporates complementary fast and slow learning systems [19] to balance stability and plasticity. To cope with the scarcity of old samples, GAN-based approaches [20,21] generate pseudo-samples to alleviate data imbalance, while autoencoder-based methods [22] learn class-specific subspaces to improve class discrimination.
Dynamic-network methods expand the architecture by adding new feature extractors for incoming tasks. DER [23] assigns a new extractor to each task and freezes old ones, thereby reducing interference with previously learned parameters. DyTox [24] introduces a transformer encoder–decoder with dynamic task tokens. TagFex [25] captures task-agnostic features and merges them with task-specific ones. FOSTER [26] incrementally learns new extractors inspired by gradient boosting and distills knowledge from both old and new extractors into a unified model. MEMO [27] expands selected blocks instead of adding entire extractors, while SEED [28] employs a fixed number of extractors and models each class with a Gaussian distribution to select suitable extractors for new tasks. Although these methods are effective and reduce the number of parameters, most of them still do not match the accuracy of DER [23], as using fewer modules makes it difficult to reach comparable performance.
Recently, pre-trained models have also been applied to incremental learning [5,29,30,31,32]. These approaches freeze the pre-trained backbone and insert lightweight modules such as adapters, LoRA, or prompts, tuning only these additional modules for efficient adaptation. This strategy leverages the rich knowledge in pre-trained models while enabling efficient learning of new tasks.
3. Methodology
3.1. Problem Setup and Method Overview
In this section, we first define the class-incremental learning task, then introduce dynamic network-based approaches and, finally, motivate our two methods that share feature extractors.
Class-Incremental Learning Task. The goal of incremental learning is to train on a sequence of tasks and, after completing all tasks, maintain strong performance across all learned classes. Let the total number of tasks be $T$. For task $t$, the dataset is $\mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^{n_t}$, with class set $\mathcal{C}_t$. The class sets are disjoint, i.e., $\mathcal{C}_i \cap \mathcal{C}_j = \emptyset$ for $i \neq j$. Let $|\mathcal{C}_t|$ denote the number of classes in task $t$. Each sample $(x_i, y_i)$ belongs to one of these classes.
Dynamic-Network Methods. Traditional incremental learning often uses a single backbone for all tasks, which risks severe forgetting. Dynamic-network methods address this by assigning a dedicated feature extractor to each task. We denote the task-specific extractor (e.g., a CNN) used in DER for task $t$ as $\phi_t$. A representative baseline, DER, trains $\phi_t$ on task $t$ and then freezes its parameters. For the next task, a new extractor $\phi_{t+1}$ is introduced. At inference time, an input $x$ is processed by all extractors, and their features are concatenated before being classified by a global head $g$:
$\hat{y} = g\big([\phi_1(x); \phi_2(x); \ldots; \phi_t(x)]\big).$
Freezing prevents interference with past knowledge and improves performance compared with a single shared backbone. However, this design introduces two major issues. First, the number of parameters grows linearly with the number of tasks. Second, since learning is sequential, an extractor $\phi_s$ with $s < t$ never observes data from a later task $t$, making its features irrelevant or even harmful for later tasks.
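For concreteness, the following PyTorch-style sketch (an illustration under our assumptions rather than DER's released code; all names are ours) shows how inference concatenates the features of all frozen extractors and classifies them with a single global head.

```python
import torch
import torch.nn as nn

def der_style_inference(x: torch.Tensor, extractors: list, global_head: nn.Module) -> torch.Tensor:
    """DER-style inference: every frozen extractor encodes x; the concatenated
    features are classified by one global head over all classes seen so far."""
    with torch.no_grad():
        feats = [phi(x) for phi in extractors]      # one feature tensor per extractor
    return global_head(torch.cat(feats, dim=1))     # concatenate along the feature dimension
```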
These limitations raise an important question: Can we reduce the number of extractors while still retaining the benefits of dynamic networks? To address this, we introduce Task-Sharing Distillation (TSD), which progressively merges task knowledge into shared extractors through distillation. Building on TSD, our two methods, grouped rolling consolidation and fixed-size pooling with similarity-based consolidation, effectively control parameter growth, reduce redundancy, and maintain accuracy on both past and future tasks.
3.2. Grouped Rolling Consolidation (GRC)
The overall framework is illustrated in Figure 2. Given T tasks, we partition them into groups, where each group contains a contiguous subset of tasks. The group size is not fixed in advance; different groups may cover different numbers of tasks, depending on the schedule. After completing the first group, we freeze its consolidated extractor. For later groups, task-specific extractors are temporarily maintained, progressively merged by distillation into a rolling extractor, and finally frozen as the group extractor at the end of the group. Importantly, the rolling extractor is a single consolidated extractor obtained from the tasks processed so far within the group, not a concatenation of multiple extractors.
For the first task in a group, all frozen extractors from previous groups remain fixed. A new extractor and a temporary classifier head are instantiated, and the classifier input concatenates the frozen features with the new feature; the model is trained with a cross-entropy loss over all classes seen so far. No consolidation is performed at this stage, since there is only one new extractor in the group.
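A minimal sketch of this training step, assuming a standard PyTorch setup (the function and argument names are illustrative), is given below; the frozen extractors only supply features, while gradients flow through the new extractor and the temporary head.

```python
import torch
import torch.nn.functional as F

def train_step_new_extractor(x, y, frozen_extractors, new_extractor, head, optimizer):
    """One cross-entropy step: concatenate frozen features with the new feature,
    classify over all classes seen so far, and update only the new modules."""
    with torch.no_grad():
        old_feats = [phi(x) for phi in frozen_extractors]   # no gradients through frozen parts
    logits = head(torch.cat(old_feats + [new_extractor(x)], dim=1))
    loss = F.cross_entropy(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```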
For each subsequent task in the group, we instantiate a new extractor and a temporary head. The current-group teacher feature concatenates the features of the extractors currently maintained for this group, the full teacher feature further includes the features of all frozen extractors from previous groups, and the temporary head predicts over all classes observed so far. We first optimize the new extractor and its head with the cross-entropy loss. After this step, both are frozen, and consolidation is applied.
We then merge the current-group extractors into a rolling extractor with its own head while keeping all previously frozen extractors fixed. The student features concatenate the frozen features with the rolling extractor's feature, the teacher and student logits are produced by their respective heads, and logit distillation with a temperature scalar is applied. After optimization, the temporary teachers that form the current-group teacher are discarded, and only the rolling extractor is kept.
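For reference, a standard implementation of temperature-scaled logit distillation consistent with the description above is sketched below (PyTorch-style; the helper name is ours, and the exact loss form in our equations may differ in minor details).

```python
import torch
import torch.nn.functional as F

def logit_distillation(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       tau: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student distributions,
    scaled by tau^2 so gradient magnitudes stay comparable across temperatures."""
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (tau ** 2)
```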
When the last task in a group is completed, the rolling extractor, which has distilled knowledge from multiple tasks into a single network, is frozen as the group extractor. Thus, after all K groups have been processed, we obtain K frozen group extractors.
After the completion of each task, prediction uses the concatenation of all frozen group extractors and the current rolling extractor (if the current group is not yet finished). If the task is the last one in its group, inference uses the frozen group extractors only. For instance, if the tasks are partitioned into four groups, the procedure yields four frozen extractors.
3.3. Fixed-Size Pooling with Similarity-Based Consolidation
GRC controls parameter growth effectively, but it does not exploit task similarity. This raises a natural question: should similar tasks share extractors to further reduce redundancy? Since task similarity cannot be determined in advance under incremental learning, we adopt a simplified setting: we first allocate one extractor to each of the initial N tasks, forming a fixed-size pool of N extractors, and subsequent tasks are then integrated by sharing the most related extractor in this pool. The overall framework is illustrated in Figure 3.
We predefine the number of available extractors as N. For each of the first N tasks, a new extractor and a classifier are assigned. During training, all previous extractors are frozen, and only the new extractor and its classifier are updated; the classifier input concatenates the features from all frozen extractors and the new one. After training the N-th task, we retain the resulting pool of N extractors.
When a new task arrives after the first N tasks, we instantiate a temporary extractor and a classifier. The classifier takes as input the concatenation of the features from all N frozen pool extractors and the temporary extractor, and the model is trained with cross-entropy over all classes from tasks 1 through the current task t.
After training, the temporary extractor must be consolidated into one of the existing N extractors to keep the pool size fixed.
For each class c in the new task t, we compute its prototype on every pool extractor as the mean feature of that class's samples under the given extractor.
For each pool extractor and the tasks it already covers, the prototype of each old class is computed as the mean feature of that class's exemplar set.
The similarity score of a new class with respect to a pool extractor compares the new-class prototype with that extractor's own class prototypes. Summing these scores over all classes in task t yields the task-to-extractor similarity, and we select the extractor with the maximum similarity as the consolidation target.
To avoid overusing a single extractor, we further enforce balanced selection among extractors. If an extractor is chosen for the current task, it is temporarily excluded from the next selection, and the consolidation target is chosen from the remaining extractors.
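The sketch below illustrates one plausible instantiation of this selection rule; it assumes cosine similarity between each new-class prototype and the best-matching prototype of an extractor's own classes, summed over the new task's classes, and the excluded argument implements the balanced-selection rule described above. The function names are illustrative.

```python
import torch
import torch.nn.functional as F

def task_to_extractor_similarity(new_protos: torch.Tensor, own_protos: torch.Tensor) -> float:
    """new_protos: [C_new, d] prototypes of the new task's classes on one pool extractor.
    own_protos:  [C_own, d] prototypes of that extractor's own classes (from exemplars)."""
    sims = F.normalize(new_protos, dim=1) @ F.normalize(own_protos, dim=1).t()
    return sims.max(dim=1).values.sum().item()   # best match per new class, summed over the task

def select_consolidation_target(new_protos_per_ext, own_protos_per_ext, excluded=()):
    """Return the index of the pool extractor with the highest task-to-extractor similarity,
    skipping any index temporarily excluded by the balanced-selection rule."""
    scores = [float("-inf") if i in excluded
              else task_to_extractor_similarity(n, o)
              for i, (n, o) in enumerate(zip(new_protos_per_ext, own_protos_per_ext))]
    return max(range(len(scores)), key=scores.__getitem__)
```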
We copy the selected extractor and unfreeze the copy for optimization while keeping the remaining pool extractors frozen. The teacher logits are produced by the classifier applied to the concatenation of the features from all N frozen pool extractors and the frozen temporary extractor.
The student always maintains exactly N extractors: it replaces the selected extractor with its trainable copy and introduces a trainable head. We then minimize the logit distillation loss between the teacher and student outputs, softened by a temperature scalar.
After optimization, we replace the selected extractor with its updated copy and discard the temporary extractor, ensuring that the size of the extractor pool remains N.
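Putting these steps together, the following sketch outlines the consolidation procedure under simplifying assumptions (PyTorch modules, SGD, and the standard KL form of logit distillation; all names are illustrative): the teacher combines the N frozen pool extractors with the frozen temporary extractor, while the student replaces the selected extractor with a trainable copy.

```python
import copy
import torch
import torch.nn.functional as F

def consolidate_into_pool(pool, k, temp_extractor, teacher_head, student_head,
                          loader, tau=2.0, lr=0.1, epochs=1):
    """Distill the (N+1)-extractor teacher into an N-extractor student whose k-th
    extractor is a trainable copy of pool[k]; the pool size stays fixed at N."""
    student_k = copy.deepcopy(pool[k])
    params = list(student_k.parameters()) + list(student_head.parameters())
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                frozen = [phi(x) for phi in pool]                         # N frozen pool features
                t_logits = teacher_head(torch.cat(frozen + [temp_extractor(x)], dim=1))
            s_feats = [student_k(x) if i == k else frozen[i] for i in range(len(pool))]
            s_logits = student_head(torch.cat(s_feats, dim=1))
            loss = F.kl_div(F.log_softmax(s_logits / tau, dim=1),
                            F.softmax(t_logits / tau, dim=1),
                            reduction="batchmean") * tau ** 2
            opt.zero_grad()
            loss.backward()
            opt.step()
    pool[k] = student_k               # update the pool; the temporary extractor is discarded
    return pool
```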
3.4. Training Objective
For each new task t, the learnable feature extractor and its task-specific classifier are optimized with a standard cross-entropy loss, as in Equations (3), (6), (14), and (16). To further encourage feature-space separation across extractors, we enhance the training of each newly added extractor with contrastive learning [33]. Specifically, memory samples from old tasks, processed by the frozen extractors, are incorporated as additional negatives to guide the new extractor toward a distinct representation space.
For a sample from the current task in the mini-batch, the positive set consists of embeddings of augmented views of the same class extracted by the current feature extractor. The negative set includes two sources: (i) embeddings from different classes in the same mini-batch under the current extractor and (ii) embeddings of old classes in the mini-batch extracted by the frozen extractors.
Embeddings are obtained by applying a two-layer, fully connected projection head [33] to the output of the feature extractor, and the contrastive loss is computed over a mini-batch of size B from the current task.
The overall loss for task t combines the cross-entropy loss with the contrastive loss, weighted by a coefficient that balances the two terms.
Here, the cross-entropy term refers to Equation (3) for the first task in a group, Equation (6) for subsequent tasks before consolidation, and Equations (14) and (16) for the fixed-pool method.
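For illustration, a SupCon-style implementation of this separation term is sketched below; the exact loss form is our assumption, with memory-sample embeddings from the frozen extractors appended as additional negatives, and all tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_with_frozen_negatives(z_new, labels, z_frozen_old, tau=0.07):
    """z_new:        [B, d] projected embeddings of current-task samples (new extractor).
    labels:       [B] class labels; same-class pairs within the batch are positives.
    z_frozen_old: [M, d] projected embeddings of old-class memory samples produced by
                  the frozen extractors, used purely as additional negatives."""
    z = F.normalize(z_new, dim=1)
    z_neg = F.normalize(z_frozen_old, dim=1)
    B = z.size(0)
    logits_self = z @ z.t() / tau                       # [B, B] similarities within the batch
    logits_old = z @ z_neg.t() / tau                    # [B, M] similarities to frozen negatives
    eye = torch.eye(B, dtype=torch.bool, device=z.device)
    logits_self = logits_self.masked_fill(eye, -1e9)    # exclude self-similarity
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~eye
    all_logits = torch.cat([logits_self, logits_old], dim=1)
    log_prob = all_logits - torch.logsumexp(all_logits, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                              # anchors without positives are skipped
    mean_log_prob_pos = (log_prob[:, :B] * pos_mask).sum(dim=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()
```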
4. Experiments
4.1. Datasets and Settings
We evaluate our methods on CIFAR-100 [34] and ImageNet-100. CIFAR-100 consists of 100 classes, each containing 500 training images and 100 test images of size 32 × 32. ImageNet-100 is a subset of ImageNet [35] with 100 classes; each class has about 1300 training images and 50 validation images at higher resolution. The evaluation covers multiple incremental learning settings. Following DER, we adopt two standard protocols on ImageNet-100:
(1) B0S10: 10 classes per task, for a total of 10 tasks;
(2) B50S5: the initial task has 50 classes, and each new task adds 5 classes, for a total of 11 tasks.
For CIFAR-100, we consider four variants:
(1) B0S10: 10 classes per task, for a total of 10 tasks;
(2) B0S5: 5 classes per task, for a total of 20 tasks;
(3) B50S5: 50 classes in the initial task, followed by 5 classes per task, for a total of 11 tasks;
(4) B50S2: 50 classes in the initial task, followed by 2 classes per task, for a total of 26 tasks.
Following DER, we use the herding selection strategy [36] to choose and retain old samples. For the B0 setting, we save a total of 2000 samples, while for the B50 setting, we retain 20 samples per class.
Under DER, each task is assigned a separate feature extractor, so the number of extractors grows linearly with the number of tasks. In contrast, our two methods, fixed-size pooling and grouped rolling consolidation, use significantly fewer extractors. We first compare our approaches with DER and other baselines under the same settings and show that using fewer extractors can even surpass the strong DER baseline. For clarity, the experimental tables label our methods by their consolidation strategy (GRC or fixed-size pooling) together with N, the maximum number of extractors used.
For GRC, unless otherwise specified, we assume that all groups (except possibly the last one) contain the same number of tasks. For example, with 10 tasks and N = 4, tasks 1–3 share one extractor, tasks 4–6 share another, tasks 7–9 share another, and task 10 uses its own extractor.
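The grouping schedule implied by this example can be generated, for instance, by the following illustrative helper (0-indexed tasks; the schedules for other configurations may be chosen differently).

```python
def grc_group_schedule(num_tasks: int, group_size: int):
    """Contiguous groups of `group_size` tasks; the last group takes the remainder
    (e.g., 10 tasks with group_size=3 -> [[0,1,2], [3,4,5], [6,7,8], [9]],
    i.e., four extractors in total)."""
    return [list(range(start, min(start + group_size, num_tasks)))
            for start in range(0, num_tasks, group_size)]
```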
Following [23,26], we use ResNet18 [37] as the feature extractor on ImageNet-100 with a batch size of 256. For CIFAR-100, we employ a modified ResNet32 as the feature extractor with a batch size of 128. The initial learning rate is set to 0.1 with a cosine annealing scheduler, and training runs for a total of 170 epochs using SGD with a momentum of 0.9. The weight decay is 5 × 10⁻⁴ when learning new feature extractors and 0 during the distillation phase, where we use a distillation temperature of 2. Following [33], the contrastive temperature is set to 0.07, and we use a two-layer linear projection head, where the hidden layer has the same dimension as the input and the final output dimension is 128.
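For reference, the optimizer and schedule described above can be set up as follows (a minimal PyTorch sketch; the helper name is ours).

```python
import torch

def make_optimizer_and_scheduler(parameters, distillation_phase=False, epochs=170):
    """SGD with momentum 0.9 and initial learning rate 0.1, cosine annealing over the
    training epochs; weight decay is 5e-4 when learning new extractors and 0 during
    the distillation phase, as described in the text."""
    weight_decay = 0.0 if distillation_phase else 5e-4
    optimizer = torch.optim.SGD(parameters, lr=0.1, momentum=0.9, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```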
Following DER, we evaluate models using average accuracy, last-step accuracy, and backward transfer (BWT).
Average accuracy: after completing step $i$, let $A_i$ denote the average accuracy over tasks 1 to $i$. With a total of $N$ tasks, the metric is defined as $\bar{A} = \frac{1}{N}\sum_{i=1}^{N} A_i$.
Last-step accuracy: the accuracy after the final task, i.e., $A_N$.
Backward transfer (BWT): let $R_{j,i}$ denote the accuracy on the test set of task $i$ after learning task $j$. Then, BWT is defined as $\mathrm{BWT} = \frac{1}{T-1}\sum_{i=1}^{T-1}\left(R_{T,i} - R_{i,i}\right)$, where $T$ is the total number of tasks. A negative BWT indicates forgetting of previously learned tasks.
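These two metrics can be computed as in the following sketch, which assumes the standard definitions stated above (the accuracy-matrix layout is our choice).

```python
def average_incremental_accuracy(step_accuracies):
    """step_accuracies[i] = average accuracy over tasks 1..i+1, measured after step i+1.
    Returns the mean over all N incremental steps."""
    return sum(step_accuracies) / len(step_accuracies)

def backward_transfer(R):
    """R[j][i] = accuracy on task i's test set after learning task j (0-indexed, T tasks).
    BWT = mean over i < T-1 of (R[T-1][i] - R[i][i]); negative values indicate forgetting."""
    T = len(R)
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)
```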
We note that DyTox [24] has revised its official results, and in our work, we adopt the corrected values accordingly.
4.2. Results on ImageNet100
In the ImageNet100-B0S10 setting, DER uses 10 feature extractors. Our two methods require only 3 extractors yet outperform DER, as shown in Table 2, achieving both higher average accuracy and higher last-step accuracy. With 6 extractors, our methods achieve even higher accuracy.
In the ImageNet100-B50S5 setting, DER uses 11 extractors, while our method uses only 8 yet achieves higher average accuracy than DER and also consistently outperforms DER in terms of top-5 accuracy, as shown in Table 3.
4.3. Results on CIFAR-100
For CIFAR-100 B0S10, with only six extractors, our methods achieve higher accuracy than DER with ten extractors, as shown in Table 4. We further examine the effect of increasing the number of extractors. As shown in Figure 4, both average accuracy and last-step accuracy improve consistently as the number of extractors increases. This result suggests that, in this setting, the benefit of adding new extractors outweighs the adverse effect of redundant information, thereby alleviating the stability–plasticity dilemma. We further compare the per-task accuracy after learning 10 tasks, as shown in Figure 5. DER suffers from a significant drop in accuracy on the first three tasks, whereas our methods, even with only five extractors, maintain clear advantages on these early tasks. In DER, the representations of old tasks captured by newly added extractors are largely redundant, as these extractors primarily focus on the new tasks; as the number of new extractors grows, this redundancy accumulates and weakens the contribution of the old extractors, which should play the dominant role for their respective tasks. This demonstrates that reducing the number of extractors and enforcing distinct feature subspaces among them is more effective in alleviating the forgetting of early tasks. We also provide a visualization: as shown in Figure 6, DER produces more compact clusters within each class, but several clusters overlap in the central region, whereas our method achieves clearer separation between classes despite using fewer extractors.
In the CIFAR-100 B0S5 setting, both of our methods achieve clear improvements over DER with 20 extractors, as shown in Table 5. We also analyze the effect of the number of extractors on accuracy. As shown in Figure 7, accuracy increases as the number of extractors grows. For the fixed-pool method, increasing the number of extractors from 12 to 15 yields a negligible improvement in average accuracy, whereas the GRC method continues to benefit as more extractors are added.
In the CIFAR-100 B50S5 setting, DER employs 11 extractors. Our two methods achieve higher accuracy with only four extractors, as shown in Table 6. We further analyze the effect of the number of extractors on accuracy. As shown in Figure 8, accuracy improves significantly when the number of extractors increases from four to seven. Beyond seven extractors, the gain becomes marginal, and using 11 extractors does not yield the best accuracy.
In the CIFAR-100 B50S2 setting with 26 tasks, DER uses 26 extractors. Our two methods reach higher accuracy with far fewer extractors, showing clear advantages with only seven, as reported in Table 7.
In this setting, more extractors do not always yield better accuracy. As shown in Figure 9, the best results are obtained with about seven extractors; when the number of extractors increases further, accuracy drops significantly, indicating that too many extractors cause interference across tasks. This suggests that while preserving task-specific extractors helps retain old knowledge, an excessive number of extractors introduces redundant and noisy information, which can interfere with the learning of new tasks and degrade overall performance.
4.4. Ablation Study
We conduct an ablation study on the contrastive separation loss. As shown in Table 8, this loss consistently improves accuracy by enforcing separation across different extractors, making their representations more discriminative. Maintaining distinct feature spaces is particularly important for the fixed-pool method, since it assigns similar tasks to the same extractor. With the separation loss, the fixed-pool method achieves clearer feature separation and larger performance gains.
We also conducted an ablation study on balanced selection in the fixed-pool method. As shown in Table 9, when balanced selection is disabled, the extractors are chosen in an unbalanced manner and one extractor is selected twice, which leads to a drop in accuracy.
4.5. BWT
We report the BWT results in Table 10. Compared with DER, our methods exhibit stronger resistance to forgetting. In DER, newly added extractors do not form clear boundaries with earlier tasks; coupled with the imbalance between new and old classes, the new extractors tend to dominate, leading to more severe forgetting of old tasks. The contrastive separation term encourages separation among extractors and, even with fewer extractors, reduces the redundant cross-task interference introduced by the new ones. In this setting, using the separation term and moderately increasing the number of extractors can further preserve old-task knowledge: the stability gained from old-task extractors outweighs the benefit of reducing redundancy, so the variant with more extractors achieves a lower forgetting rate than the one with fewer. We further analyze the relationship between BWT and the number of extractors without the separation term, as shown in Table 11. Using 10 extractors causes more forgetting than using 5, indicating that simply increasing the number of extractors is counterproductive when their boundaries are not well maintained. The separation term strengthens per-extractor boundaries and mitigates interference from redundant extractors. Consequently, without it, the damage from the redundancy introduced by additional extractors outweighs the stability benefit they provide for old tasks.
4.6. Training and Inference Time
While the distillation step increases the training wall-clock time by more than one-quarter compared with training without distillation, it enables the use of far fewer extractors, and inference is substantially faster. Under identical settings on ImageNet-100, DER requires 41.41 s to complete the test, whereas our method takes only 13.13 s.
4.7. Sensitivity Study of Hyperparameters
We evaluate the effect of the coefficient that balances the cross-entropy and contrastive terms, with results shown in Table 12, and set its value accordingly for all experiments.
4.8. Ablation Study on Task Order
The official DER code provides alternative task orders on CIFAR-100. In our experiments, we additionally adopt order 0 and order 1 from DER to evaluate performance under different task sequences. The results are presented in Table 13. While the fixed-size pooling method achieves slightly higher average accuracy, the difference in last-step accuracy is more pronounced, indicating that the fixed-size pooling method is more sensitive to task order.
4.9. Ablation Study on Similarity-Based Target Selection
We also conducted an ablation study on the similarity-based target selection strategy. Specifically, we compared it with a setting in which the strategy was not used. The resulting accuracy is reported in Table 14, showing that selecting extractors with higher similarity better leverages the information in the existing extractors and thereby achieves higher accuracy.
4.10. Discussion
Our experimental results demonstrate that both task-sharing strategies can achieve higher accuracy than DER while requiring significantly fewer extractors. The additional separation loss enlarges the distance between the feature subspaces of different extractors, ensuring clearer task boundaries and more discriminative representations. In the CIFAR-100 B0S10 setting, we observe that accuracy improves steadily as the number of extractors increases. However, this trend is not always consistent. In the CIFAR-100 B50S5 setting, accuracy gains gradually saturate as the number of extractors grows and even decline slightly in the end. More strikingly, in the CIFAR-100 B50S2 setting, excessive extractors lead to a noticeable drop in accuracy. These findings suggest that while an appropriate number of extractors provides useful task-specific capacity, an excessive number introduces redundancy and noise, which can interfere with cross-task generalization. Future work should therefore explore adaptive mechanisms to determine the optimal number of extractors under different task sequences and data splits.
The fixed-size pooling method may fail due to unreliable similarity measures, and its effectiveness could be limited for tasks with large domain gaps. We leave a more thorough investigation of this limitation for future work.
As a potential future direction, TSD’s feature-sharing mechanism could be extended to continual facial expression recognition, by dynamically allocating the most relevant feature extractors based on the representational similarity between new and existing expressions and incorporating a contrastive loss to enhance their discriminability. However, its practical effectiveness remains to be systematically validated on benchmark affective computing datasets.
5. Conclusions
In this work, we analyzed the impact of extractor growth in dynamic-network methods. We showed that old extractors, which never observe new classes, introduce noise and cause interference, while model parameters continue to grow with the number of tasks. To address these issues, we proposed two strategies for sharing extractors: grouped rolling consolidation (GRC), which groups consecutive tasks to share a consolidated extractor, and fixed-size pooling with similarity-based consolidation, which first learns N extractors and then allows subsequent tasks to share the most similar one. Furthermore, we encouraged each extractor to preserve discriminative and independent features. Both approaches achieve higher accuracy than the strong baseline DER while requiring significantly fewer extractors. Compared with DER, our method uses less than one-third of the extractors while achieving 2.5% higher average accuracy. While our study focuses on vision tasks, the idea of task-sharing distillation is also meaningful for large language models, where continual and multi-domain adaptation often face parameter growth and redundancy; as future work, we plan to explore whether the idea transfers to this setting, in particular by allowing multiple tasks to share adapters in adapter-based extensions. Moreover, the feature-sharing mechanism of TSD holds potential for extension to continual facial expression recognition by dynamically allocating the most relevant feature extractors based on the representational similarity between new and existing expressions and incorporating a contrastive loss to enhance their discriminability. However, its practical effectiveness remains to be systematically validated on benchmark affective computing datasets.