Few-Shot Class-Incremental Learning with Prompt Alignment and Subspace Prototype Aggregation

Huang, Qiang

doi:10.3390/a19050407

School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China

Algorithms 2026, 19(5), 407; https://doi.org/10.3390/a19050407

Submission received: 12 March 2026 / Revised: 10 May 2026 / Accepted: 12 May 2026 / Published: 19 May 2026

(This article belongs to the Special Issue Advances in Deep Learning and Next-Generation Internet Technologies)

Abstract

Few-Shot Class-Incremental Learning (FSCIL) aims to learn new classes with only a few samples, making it more challenging than traditional Class-Incremental Learning (CIL) due to the scarcity of available samples. The imbalance in sample distribution further complicates balancing the abundant base data with the scarce incremental data. While the model must fully leverage the extensive base data to guide the learning of subsequent tasks, it must also avoid over-relying on these data, as doing so could degrade its generalization capability and impede the learning of new incremental tasks. To address these challenges, we propose a novel framework for few-shot incremental learning, incorporating tailored prompt alignment strategies for both the base and incremental sessions. In the base session, we strike a balance between task-specific and task-agnostic knowledge to preserve the model’s generalization ability. In the incremental sessions, we mitigate the overfitting issue typically associated with few-shot learning. Furthermore, to tackle the prototype network bias caused by the imbalance in sample distribution, we propose a subspace prototype aggregation module, which effectively alleviates prediction bias in the incremental phase. Extensive experiments conducted on three benchmark datasets (CIFAR-100, miniImageNet, and CUB-200) demonstrate that our approach achieves state-of-the-art (SOTA) performance in FSCIL.

Keywords:

few-shot class-incremental learning; transformer; prompt learning; class-incremental learning

1. Introduction

As data distributions continuously change, machine learning models often need to transition from a familiar data distribution to an entirely unfamiliar one, where the models learn new classes from the unfamiliar data while retaining knowledge acquired from previous data. This task is known as class-incremental learning [1]. The primary challenge in CIL is the catastrophic forgetting problem. Due to limited storage space or privacy concerns [2], models are unable to retain access to the full data from previous tasks. Consequently, after training on new tasks, model performance on old tasks often experiences significant degradation. To address this issue, some works have adopted approaches such as knowledge distillation [3,4] and parameter consolidation [5,6] to alleviate catastrophic forgetting. However, these methods may inevitably hinder the model’s ability to learn new classes, leading to the well-known stability–plasticity dilemma [7].

Few-Shot Class-Incremental Learning (FSCIL) typically involves training a base model using a set of base classes with sufficient data, followed by leveraging the knowledge learned from the base classes to facilitate few-shot learning during the incremental sessions. In addition to the challenge of catastrophic forgetting, FSCIL can suffer from overfitting to limited training samples, making it more difficult for the model to learn new classes. Currently, various works [8] have been proposed to address FSCIL scenarios. Some focus on enhancing the generalization ability of the base model for newly encountered few-shot classes [9], while others explore better strategies for incremental training on new tasks with limited data [10,11].

In recent years, with the rise of prompt learning and Vision Transformer (ViT), some methods [12,13,14] have used prompt learning to address CIL tasks. These methods freeze the pretrained Vision Transformer backbone, train only a set of task-specific prompt parameters, and store these prompt parameters in a prompt pool to retain knowledge for different tasks. During testing, a key–value query mechanism is used to retrieve the corresponding prompt parameters for each sample. These methods have shown strong performance in CIL tasks. However, as they require training task-specific prompt parameters, the small number of samples in FSCIL’s incremental sessions may not be sufficient to allow prompt learning to capture task-specific knowledge. Furthermore, the performance of these methods relies heavily on the prompt selection strategy and requires additional resources to maintain the prompt pool. Methods such as [15,16] leverage pretrained models and prompt learning for FSCIL tasks, but the limited capacity of prompt parameters and the frozen backbone network prevent the model from fully utilizing the large quantity of data available during the base session. This restriction hampers effective knowledge transfer to downstream tasks.

Notably, recent prompt regularization methods such as PromptSRC [17] have achieved strong results in base-to-novel generalization by enforcing consistency between prompted and frozen model features within a single training phase. However, these methods are designed for scenarios where all training data are available simultaneously, without a sequential task structure or incremental few-shot constraints. In FSCIL, directly applying such single-phase regularization strategies is insufficient for three reasons. First, the base session must first absorb rich task-specific knowledge from abundant data before reinforcing generalizable features; a naive joint optimization may compromise this knowledge acquisition. Second, incremental sessions contain only a handful of samples, requiring stronger and more targeted regularization than what task-agnostic consistency losses can provide. Third, the prototype classifier, widely adopted in few-shot learning, suffers from severe prediction bias toward base classes due to sample imbalance.

In this paper, we propose a prompt-learning-based FSCIL framework that incorporates alignment and regularization strategies for prompts and the model across the base and incremental sessions, along with a prototype aggregation module, as shown in Figure 1. Specifically, during base-session training, we train the prompt parameters in two stages and fine-tune specific layers of the Transformer. In the first stage, we freeze the Transformer backbone and optimize the visual prompts using cross-entropy loss so that the prompts learn task-specific knowledge. Once the visual prompts have acquired sufficient task-specific knowledge, the second stage includes the Transformer in the training process and selectively fine-tunes its first two layers. To mitigate overfitting, we propose a self-supervised prompt alignment loss, which aligns the outputs of the model with and without prompts. The two outputs are then weighted together to compute the cross-entropy loss, jointly optimizing both the model and the prompts. Through these two training stages, the model is able to learn both task-specific and task-agnostic knowledge from the abundant training samples in the base session, while retaining the generalization ability of the pretrained Transformer to better adapt to downstream tasks. During the incremental sessions, to mitigate the adverse effects caused by few-shot samples, we preserve the base-session prompts and apply regularization constraints on the current prompt outputs by weighting the outputs of the model with and without prompts based on prototype semantic similarity. In addition, we propose a prototype aggregation calibration module based on subspace regularization. This module uses the prototypes calculated from the abundant base-session samples to calibrate the prototype bias caused by the limited samples in the incremental sessions, alleviating the classification bias of the classifier in the incremental sessions. In summary, our contributions can be listed as follows:

We propose a prompt-based few-shot class-incremental learning framework that addresses the unique stability–plasticity and overfitting challenges in sequential few-shot learning. Specifically, we design a two-stage prompt alignment strategy (TPA) for the base session to decouple task-specific and task-agnostic knowledge learning, and a few-shot prompt alignment strategy (FSPA) for the incremental sessions that leverages the base-session prompt as a cross-task knowledge anchor.
We propose a subspace prototype aggregation calibration module (SPAC) that alleviates the prototype computation bias caused by the severe sample imbalance between base and incremental sessions. The module operates via QR-decomposition-based subspace projection and similarity-weighted aggregation, requiring no gradient-based optimization during incremental updates.
We conduct extensive comparative and ablation experiments on three public datasets: CIFAR100, miniImageNet, and CUB200, demonstrating the effectiveness of our proposed method.

2. Related Work

2.1. Class-Incremental Learning

Generally, incremental learning can be categorized into three different settings: Task-Incremental Learning (TIL), Domain-Incremental Learning (DIL), and class-incremental learning (CIL) [18]. Among these, CIL is considered the most challenging scenario [8], as it requires learning new classes without forgetting the old ones. Class-incremental learning is an active research topic in machine learning, and existing methods can be broadly divided into two groups based on whether old class instances are preserved. Sample-replay-based methods store representative instances of each class and perform data replay when learning new classes. For example, iCaRL [4] combines replay with knowledge distillation to retain previously learned knowledge. BiC [19] utilizes these samples to build a validation set and optimizes an additional scaling layer. GEM [20] projects gradients using the stored samples to mitigate forgetting. Ref. [21] employs generative models for data rehearsal, while other works consider storing embeddings instead of raw images [22]. Non-replay-based methods address the problem by incorporating regularization terms to consolidate model outputs or by dynamically adapting the model structure to meet the demands of new classes. Many studies leverage knowledge distillation to retain knowledge from previous tasks and overcome forgetting [23,24]. Generally, methods that do not require data replay are more suitable for real-world scenarios, as replaying data is often infeasible due to privacy policies or other constraints. Notably, the method proposed in this paper does not require a rehearsal buffer to store any data samples. Typically, CIL methods rely on obtaining sufficient training data in each incremental task to learn new classes, which is not feasible under the FSCIL scenario.

2.2. Few-Shot Class-Incremental Learning

FSCIL is more challenging than standard CIL because it requires incrementally learning new classes with limited labeled data. This task involves first obtaining a well-trained base model using sufficient base class data and then incrementally learning new tasks with limited new class data. TOPIC [25] was the first to formulate the FSCIL task; it learns the feature space topology for different classes and represents new classes by growing and adapting the network topology. Other FSCIL methods can generally be divided into two categories. The first category trains a backbone network using abundant base class samples and transfers the knowledge of the backbone network to incremental tasks [26,27]. The trained backbone network typically remains frozen during the subsequent incremental tasks. The second category of methods [28,29,30] focuses on enabling the model to learn knowledge from limited samples in the incremental sessions without overfitting. Recently, some prompt-based FSCIL methods [15,16,31,32] have demonstrated excellent performance by leveraging the downstream task adaptability of prompt learning to address the FSCIL problem. However, these methods do not fully utilize the abundant samples from base classes to enhance the generalizability of the prompts, so the model underutilizes the knowledge learned during the base session when it reaches the incremental sessions.

2.3. Prompt Engineering for Vision Transformer

Prompt learning is an alternative fine-tuning method that enables a pretrained model to quickly adapt to downstream tasks without retraining the entire model. This approach adapts a pretrained model by adding a small number of new learnable embeddings at the input, known as prompt tokens. Vision Transformers with prompt engineering have demonstrated excellent performance in class-incremental learning. L2P [13] and DualPrompt [12] utilize prompt and prefix tuning to learn new classes while keeping the pretrained ViT frozen. L2P and DualPrompt apply randomly initialized prompts for prompt engineering. Recently, some methods [14,33,34] have proposed generating prompts to adapt to domain spaces for effective continual learning. CODA-Prompt [14] requires a set of prompt components, which are combined with input-dependent weights to generate input-specific prompts. APG [33] and DAP [34] use prompt generators composed of multiple components, including cross-attention layers, learnable parameter sets, and linear layers. However, these prompt generation methods require additional components and introduce training costs for the prompt generator. PriViLege [31] proposed a novel Pretrained Knowledge Tuning (PKT) method to help the model better learn knowledge during the base session. PromptSRC [17] proposed a prompt self-regularization method to prevent overfitting of prompts. However, PromptSRC focuses on single-session generalization and does not address the sequential task structure and extreme data scarcity of FSCIL. In contrast, our method is specifically tailored for the FSCIL setting, where the training process is divided into a base session and multiple incremental sessions, necessitating distinct prompt alignment strategies for each stage and requiring prototype calibration to combat session-wise imbalance.

3. Methodology

In this section, we introduce the overall structure of our proposed framework as shown in Figure 2. It primarily consists of three key components: base-session prompt alignment, which learns task-specific knowledge while preserving the model’s generalization ability; incremental-session prompt regularization, which mitigates the overfitting of prompt parameters to downstream tasks; and subspace prototype aggregation, which addresses the imbalance between new and old classes in prototype networks. The prompt structure follows [31], where the prompt parameters are divided into prompt-V and prompt-L, denoted as $P_{V L} \in R^{2 \times D}$. Prompt-V is responsible for learning visual knowledge from image features, while prompt-L captures semantic knowledge provided by the word embeddings of the language model. Both prompts, along with the image tokens, are fed into the pretrained Transformer. The features of prompt-V and the CLS token are then average-pooled to serve as the input to the classifier.

3.1. Preliminaries

The goal of FSCIL is to learn from a sequence of tasks whose data $D_{0}, \dots, D_{T}$ are provided sequentially. When learning task t, the model cannot fully access the data from previous tasks $0, \dots, t - 1$. After training on task t, the model is evaluated on test data from all tasks it has been trained on, i.e., tasks $0, \dots, t$. The data for the tth training task are denoted as $D_{t} = \{(x_{t}^{i}, y_{t}^{i})\}_{i = 1}^{N_{t}}$, where $N_{t} = | D_{t} |$ is the size of $D_{t}$, and $x_{t}^{i} \in X_{t}$ and $y_{t}^{i} \in Y_{t}$ are the training samples and their corresponding labels, respectively. The classes of different tasks are mutually exclusive, i.e., for tasks $t, t^{'} \in [0, T]$ with $t \neq t^{'}$, $Y_{t} \cap Y_{t^{'}} = \emptyset$. The first task includes a large quantity of training data $D_{0}$ and is referred to as the base session, while each subsequent task contains far fewer training data $D_{t}$ and is referred to as an incremental session. All incremental sessions are N-way K-shot classification tasks, where N is the number of categories in each task and K is the number of samples per category. The FSCIL model consists of a feature extractor backbone network $f_{θ}$ with parameters $θ$ and a classifier $h_{ψ}$ with parameters $ψ$. For a test sample x from any previously trained task, the model outputs the predicted class probabilities $y = h_{ψ} (f_{θ} (x))$ and assigns the label accordingly.

3.2. Subspace Prototype Aggregation

The prototypical classifier [35] is widely used in few-shot learning. It uses the feature mean $c_{k}$ of class k as the prototype for that class:

c_{k} = \frac{1}{N_{k}} \sum_{y_{i} = k} f (x_{i}),

(1)

where $N_{k}$ is the number of samples in class k. Here, the class features output by the model are used as the embeddings for the image. For a classification task with K classes, the prototype matrix $W = [c_{1}, c_{2}, \dots, c_{K}]$ acts as the weights of a linear layer applied to the sample feature embedding, and the output is then passed through a softmax function to compute the classification probabilities:

P (y = k | x) \propto \exp (c_{k}^{T} f_{θ, p} (x))

(2)
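As a rough illustration of Equations (1) and (2), the following sketch computes class-mean prototypes and softmax probabilities over prototype-feature dot products. It uses plain Python lists in place of real backbone features; the function names and the toy two-dimensional features are assumptions made for this example, not part of the proposed method.

```python
import math

def class_prototypes(features, labels):
    """Mean feature vector per class, as in Equation (1)."""
    sums, counts = {}, {}
    for feat, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(feat))
        for j, value in enumerate(feat):
            acc[j] += value
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def predict_proba(prototypes, feature):
    """Softmax over prototype-feature dot products, as in Equation (2)."""
    class_ids = sorted(prototypes)
    logits = [sum(c * f for c, f in zip(prototypes[k], feature)) for k in class_ids]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return {k: e / total for k, e in zip(class_ids, exps)}

# Toy features (hypothetical): two samples for each of two classes.
features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
labels = [0, 0, 1, 1]
protos = class_prototypes(features, labels)
probs = predict_proba(protos, [0.9, 0.1])
```

With only one or two samples per class, the mean in `class_prototypes` becomes a noisy estimate, which is the weakness discussed next.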

However, the prototypical classifier has limitations in few-shot class-incremental learning tasks. Due to the limited data in the incremental sessions, the computed prototypes sometimes fail to accurately reflect the general characteristics of that class. Additionally, the severe imbalance in data between the base and incremental sessions causes the classifier to be biased towards classifying the categories from the incremental sessions as base session categories. To address this issue, some studies have proposed solutions, such as the method in [36], which aggregates the prototypes of the base classes and the new classes in the current incremental sessions based on similarity. This approach alleviates the classification bias to some extent, but directly taking a weighted average of the current prototypes and the base class prototypes may cause significant shifts during the updating process due to the weight distribution, especially when there is a large distributional difference between the base and new classes. If the base class distribution is incomplete or affected by noise, the adjustment effect may be further limited.

To address the above limitations, we propose a simple yet effective subspace prototype aggregation calibration module (SPAC). The key idea is to constrain novel class prototypes within the intrinsic feature subspace of the base classes and then calibrate them by aggregating with their own subspace-anchored counterparts, rather than directly fusing with raw base prototypes. This makes the adjustment more robust to noise and imbalanced data, avoids extreme feature shift of novel classes, and is particularly suitable for incremental scenarios where base and novel class distributions differ significantly.

Formally, let $W_{b} \in R^{N_{b} \times D}$ denote the base class prototypes and $W_{c} \in R^{N_{c} \times D}$ the novel class prototypes of the current incremental session, where $N_{b}$ and $N_{c}$ are the numbers of base and novel classes, and D is the feature dimension. Both matrices are L2-normalized along the feature axis before use. We perform a reduced QR decomposition on $W_{b}^{⊤}$ to extract an orthonormal basis of the base class subspace:

Q, R = QR (W_{b}^{⊤}),

(3)

where $Q \in R^{D \times N_{b}}$ (assuming $D \geq N_{b}$ and linearly independent base prototypes) has orthonormal columns that span the base feature subspace. The orthogonal projection of $W_{c}$ onto this subspace is then obtained as

W_{c}^{'} = (W_{c} Q) Q^{⊤} \in R^{N_{c} \times D} .

(4)

Because the columns of Q are orthonormal, $Q Q^{⊤}$ is an orthogonal projector: it discards the components of the novel prototypes that lie outside the reliable base subspace, thereby suppressing few-shot sampling noise.

The robustness of this projection can be understood through a simple bias–variance perspective. Let the few-shot novel-class prototypes be expressed as $W_{c} = W_{c}^{*} + E$, where $W_{c}^{*}$ denotes the true class means and E denotes the estimation error due to limited samples. The true means $W_{c}^{*}$ are expected to lie primarily within the base-class subspace, as the base classes, trained with abundant data, capture the dominant discriminative directions of the feature space. The sampling error E, by contrast, is assumed to be approximately isotropic, with non-negligible components orthogonal to the base subspace. The orthogonal projection $W_{c}^{'} = W_{c} Q Q^{⊤}$ preserves all components within the subspace while discarding the orthogonal residual, thereby eliminating the portion of E that lies outside the reliable subspace. This introduces a controlled bias toward the base distribution but significantly reduces variance, which is crucial in the few-shot regime where variance dominates the estimation error. Standard operations such as centering or scaling, on the other hand, do not alter the directional structure of the noise.
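The size of this variance reduction can be made concrete under idealized conditions. The short derivation below is a sketch under two assumptions that go beyond the text: each row $e$ of E has zero mean and covariance $\sigma^{2} I_{D}$ (exactly isotropic noise), and each true mean $w^{*}$ lies exactly in the base subspace, so that $w^{*} P = w^{*}$ for the projector $P = Q Q^{⊤}$ of rank $N_{b}$.

```latex
% Assumptions (illustrative, not from the source):
%   e has zero mean and covariance \sigma^2 I_D;  w^* P = w^*;  P = Q Q^\top has rank N_b.
\begin{aligned}
w' - w^{*} &= (w^{*} + e) P - w^{*} = e P, \\
\mathbb{E}\,\lVert e P \rVert^{2} &= \sigma^{2}\,\operatorname{tr}(P) = \sigma^{2} N_{b}, \\
\mathbb{E}\,\lVert e \rVert^{2} &= \sigma^{2} D, \\
\frac{\mathbb{E}\,\lVert e P \rVert^{2}}{\mathbb{E}\,\lVert e \rVert^{2}} &= \frac{N_{b}}{D}.
\end{aligned}
```

For example, with $N_{b} = 60$ base classes and $D = 768$ (the ViT-B embedding width, assuming the classifier features keep that width), the expected squared error would shrink to $60 / 768 \approx 0.078$ of its original value. If a true mean has a component outside the base subspace, the projection removes that component as well, which is the bias referred to above.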

Next, we compute the similarity between the original prototypes $W_{c}$ and their subspace projections $W_{c}^{'}$ using a temperature-scaled dot product, followed by softmax normalization:

S = softmax (τ \cdot W_{c} W_{c}^{' ⊤}),

(5)

where $τ$ is a temperature hyperparameter controlling the softness of the aggregation; because $τ$ multiplies the similarities, larger values yield sharper weights. $S \in R^{N_{c} \times N_{c}}$ can be viewed as a class-to-class weight matrix within the current session.

We then aggregate the projected prototypes according to S to obtain the calibration target:

Δ W_{c} = S W_{c}^{'} .

(6)

Finally, we fuse the original and calibrated prototypes via convex combination:

{\bar{W}}_{c} = (1 - α) W_{c} + α Δ W_{c},

(7)

where $α \in [0, 1]$ balances the retention of novel-class discriminability and the alignment with the base-class subspace. The complete process is summarized in Algorithm 1.

Algorithm 1 Subspace Prototype Aggregation Algorithm.

Input: Base prototypes $W_{b} \in R^{N_{b} \times D}$, current session prototypes $W_{c} \in R^{N_{c} \times D}$, shift weight $α$, softmax temperature $τ$
Output: Calibrated prototypes ${\bar{W}}_{c}$
1: for each incremental session do
2:     // QR decomposition and subspace projection (Equations (3) and (4))
3:     Compute $Q, R = QR (W_{b}^{⊤})$, where $Q \in R^{D \times N_{b}}$.
4:     Project current prototypes onto the base subspace: $W_{c}^{'} = (W_{c} Q) Q^{⊤}$.
5:     // Intra-session similarity-based calibration (Equations (5) and (6))
6:     Compute similarity matrix: $S = softmax (τ \cdot W_{c} W_{c}^{' ⊤})$, where $S \in R^{N_{c} \times N_{c}}$.
7:     Compute calibration target: $Δ W_{c} = S W_{c}^{'}$.
8:     // Convex fusion (Equation (7))
9:     Fuse original and calibrated prototypes: ${\bar{W}}_{c} = (1 - α) W_{c} + α Δ W_{c}$.
10:     Update $W_{c} \leftarrow {\bar{W}}_{c}$.
11: end for
12: Return ${\bar{W}}_{c}$.
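The steps of Algorithm 1 can be sketched for a single incremental session as follows. This is a minimal standard-library illustration, not the authors' implementation: a modified Gram-Schmidt routine stands in for a library QR decomposition (both yield the same projector), the softmax in Equation (5) is assumed to act row-wise, and the values of `alpha` and `tau` as well as the toy prototypes are hypothetical.

```python
import math

def l2_normalize(rows):
    """Scale every row vector to unit L2 norm."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row])
    return out

def orthonormal_basis(rows):
    """Modified Gram-Schmidt over rows; spans the same subspace as Q in Eq. (3)."""
    basis = []
    for row in rows:
        v = list(row)
        for q in basis:
            coeff = sum(a * b for a, b in zip(v, q))
            v = [a - coeff * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-10:  # skip directions that are already spanned
            basis.append([a / norm for a in v])
    return basis

def project(rows, basis):
    """Orthogonal projection of each row onto span(basis), i.e. W_c Q Q^T (Eq. (4))."""
    out = []
    for row in rows:
        proj = [0.0] * len(row)
        for q in basis:
            coeff = sum(a * b for a, b in zip(row, q))
            proj = [p + coeff * b for p, b in zip(proj, q)]
        out.append(proj)
    return out

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def spac(w_base, w_cur, alpha=0.5, tau=10.0):
    """Calibrate current-session prototypes; alpha and tau defaults are hypothetical."""
    w_base, w_cur = l2_normalize(w_base), l2_normalize(w_cur)
    w_proj = project(w_cur, orthonormal_basis(w_base))
    calibrated = []
    for w in w_cur:
        # One row of S (Eq. (5)), assuming a row-wise softmax.
        weights = softmax([tau * sum(a * b for a, b in zip(w, p)) for p in w_proj])
        # Calibration target (Eq. (6)): weighted sum of projected prototypes.
        delta = [sum(s * p[j] for s, p in zip(weights, w_proj)) for j in range(len(w))]
        # Convex fusion (Eq. (7)).
        calibrated.append([(1 - alpha) * a + alpha * d for a, d in zip(w, delta)])
    return calibrated

# Toy data: the base subspace is the x-y plane; novel prototypes carry a z component.
w_base = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_cur = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
w_bar = spac(w_base, w_cur, alpha=0.5, tau=10.0)
```

In this toy case the off-subspace z component of each novel prototype is halved by `alpha=0.5`; setting `alpha=1.0` would remove it entirely, and `alpha=0.0` would return the normalized input unchanged.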

3.3. Prompt-Adaptive Alignment Loss

3.3.1. Base-Session Prompt Alignment

A model with good generalization ability can better adapt to downstream tasks and transfer knowledge. To this end, some methods focus on improving the generalization ability of the model during the base session. For example, PromptSRC [17] constrains the update of prompt parameters by using frozen CLIP outputs without prompts, allowing the prompt parameters to learn more generalizable knowledge from the pretrained CLIP. However, this method struggles to balance task-specific knowledge and task-agnostic knowledge. Therefore, we adopt a staged constraint alignment approach to allow the model to learn both task-specific and task-agnostic knowledge.

In the first stage, the model freezes the entire Transformer backbone, and the prompt parameters are optimized using the cross-entropy loss in Equation (8). This stage primarily ensures that the prompt learns task-related knowledge:

L_{C E} = - \sum_{i = 1}^{N} y_{i} log ({\hat{y}}_{i}^{p})

(8)

In the second stage, to enable the model to gain more knowledge from the base-session tasks, we further fine-tune the first two layers of the Transformer, in addition to the learnable prompt. To mitigate overfitting caused by fine-tuning, we propose a prompt self-supervised alignment loss, Equation (9), where $\hat{y}$ denotes the model output without the prompt and ${\hat{y}}^{p}$ denotes the model output with the prompt. The loss combines a KL divergence term that aligns $\hat{y}$ and ${\hat{y}}^{p}$ with a weighted cross-entropy term on the two outputs, and both are used jointly to optimize the model. This approach alleviates overfitting caused by fine-tuning the backbone network while allowing the prompt parameters to focus on learning task-agnostic knowledge, thereby enhancing the model’s generalization ability. The overall loss of the two-stage prompt alignment (TPA) is shown in Equation (10), where $γ$ is the epoch at which training switches from the first stage to the second. The hyperparameter $λ$ controls the strength of the model’s learning of task-agnostic knowledge. A larger $λ$ enhances the model’s generalization ability but can cause a decline in the model’s performance on base session tasks due to reduced learning of task-specific knowledge.

L_{P A} = L_{C E} ({\hat{y}}^{p}, y) + λ L_{C E} (\hat{y}, y) + L_{K L} (\hat{y}, {\hat{y}}^{p})

(9)

L_{T P A} = \begin{cases} - \sum_{i = 1}^{N} y_{i} \log ({\hat{y}}_{i}^{p}), & \text{epoch} \leq γ \\ L_{C E} ({\hat{y}}^{p}, y) + λ L_{C E} (\hat{y}, y) + L_{K L} (\hat{y}, {\hat{y}}^{p}), & \text{epoch} > γ \end{cases}

(10)
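The staged loss in Equation (10) can be sketched per sample as below. This is an illustration under stated assumptions rather than the authors' code: the outputs are taken to be probability vectors, $L_{K L} (\hat{y}, {\hat{y}}^{p})$ is read as KL(ŷ ‖ ŷᵖ) following the argument order, and epochs are assumed to be counted from 1. The defaults `gamma=2` and `lam=1.2` are the values reported in Section 4.2; the probability vectors are made up.

```python
import math

def cross_entropy(probs, target):
    """Cross-entropy for one sample; `probs` is a probability vector, `target` a class index."""
    return -math.log(probs[target])

def kl_divergence(p, q):
    """KL(p || q) between two probability vectors (direction is an assumption here)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tpa_loss(y_prompt, y_plain, target, epoch, gamma=2, lam=1.2):
    """Two-stage prompt alignment loss for a single sample (Equation (10))."""
    if epoch <= gamma:
        # Stage 1: cross-entropy on the prompted output only (Equation (8)).
        return cross_entropy(y_prompt, target)
    # Stage 2: weighted cross-entropy on both outputs plus KL alignment (Equation (9)).
    return (cross_entropy(y_prompt, target)
            + lam * cross_entropy(y_plain, target)
            + kl_divergence(y_plain, y_prompt))

# Hypothetical outputs for one sample of class 0.
y_prompt = [0.7, 0.2, 0.1]   # model output with the prompt
y_plain = [0.5, 0.3, 0.2]    # model output without the prompt
stage1 = tpa_loss(y_prompt, y_plain, target=0, epoch=1)
stage2 = tpa_loss(y_prompt, y_plain, target=0, epoch=3)
```

In a real training loop the per-sample values would be averaged or summed over a batch, and the stage-2 gradients would also reach the two fine-tuned Transformer layers.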

3.3.2. Incremental Sessions Prompt Alignment

Training with a small number of samples significantly increases the risk of overfitting. The prompt self-supervised alignment loss proposed for the base session may not be sufficient to maintain generalization when samples are extremely scarce. Therefore, in the incremental sessions, we freeze the Transformer backbone and adopt a different regularization strategy. The key idea is to construct a confidence-weighted teacher signal that dynamically balances two complementary sources of knowledge: the output of the base-session prompt ${\hat{y}}_{b}^{p}$, which carries rich transferable knowledge learned from abundant data, and the unprompted output $\hat{y}$, which preserves the pretrained model’s task-agnostic generalization ability. The intuition, rooted in confidence-based knowledge distillation, is that knowledge should only be transferred when the teacher is reliable for a given input. Here, the reliability of the base prompt is measured by the semantic similarity between the novel class and the base classes: higher similarity implies stronger transferability. Formally, let $W_{b} \in R^{N_{b} \times D}$ and $W_{c} \in R^{N_{c} \times D}$ be the L2-normalized prototypes of the base and current novel classes, as in Section 3.2, with $w_{j}$ denoting the prototype of base class j and $w_{c}$ that of novel class c. For a novel class c, its cosine similarity vector to all base classes is

v_{c} = [w_{c}^{⊤} w_{1}, \dots, w_{c}^{⊤} w_{N_{b}}] \in R^{N_{b}}

(11)

We introduce a lightweight learnable MLP $g_{θ}$ with a single hidden layer that takes this similarity vector as input and outputs a class-specific confidence weight:

ρ_{c} = σ (g_{θ} (v_{c})) \in (0, 1)

(12)

where $σ$ is the sigmoid function and the subscript of $g_{θ}$ refers to the MLP's own parameters rather than those of the frozen backbone. The use of a learnable module allows the model to adaptively calibrate the trust level from data, rather than relying on manually tuned thresholds. For a sample x from novel class c, the teacher signal is then a convex combination:

Δ {\hat{y}}^{p} = ρ_{c} \cdot {\hat{y}}_{b}^{p} + (1 - ρ_{c}) \cdot \hat{y}

(13)

The final FSPA loss is the KL divergence between the current prompt output ${\hat{y}}_{t}^{p}$ and the teacher signal:

L_{FSPA} = L_{KL} (Δ {\hat{y}}^{p}, {\hat{y}}_{t}^{p})

(14)

This formulation follows the framework of confidence-weighted distillation, in which the mixing weight reflects the estimated transferability of the source knowledge.
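Equations (11) to (14) can be sketched for one sample as follows. The snippet is illustrative only: the ReLU activation, the MLP weights, and all numeric values are assumptions chosen for the example, and $L_{KL}$ is read as KL(teacher ‖ current output) following the argument order in Equation (14).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def similarity_vector(w_c, base_protos):
    """Cosine similarities of one L2-normalized novel prototype to all base prototypes (Eq. (11))."""
    return [sum(a * b for a, b in zip(w_c, w_b)) for w_b in base_protos]

def confidence(v, w1, b1, w2, b2):
    """One-hidden-layer MLP (ReLU is an assumption) followed by a sigmoid (Eq. (12))."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, v)) + b) for row, b in zip(w1, b1)]
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)

def teacher_signal(rho, y_base_prompt, y_plain):
    """Convex combination of base-prompt and unprompted outputs (Eq. (13))."""
    return [rho * a + (1.0 - rho) * b for a, b in zip(y_base_prompt, y_plain)]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def fspa_loss(rho, y_base_prompt, y_plain, y_cur_prompt):
    """KL divergence between the teacher signal and the current prompt output (Eq. (14))."""
    return kl_divergence(teacher_signal(rho, y_base_prompt, y_plain), y_cur_prompt)

# Hypothetical two-dimensional prototypes and MLP weights.
base_protos = [[1.0, 0.0], [0.0, 1.0]]
v_c = similarity_vector([0.6, 0.8], base_protos)
rho = confidence(v_c, w1=[[1.0, 1.0]], b1=[0.0], w2=[1.0], b2=0.0)

y_base_prompt = [0.6, 0.3, 0.1]
y_plain = [0.4, 0.4, 0.2]
y_cur_prompt = [0.5, 0.3, 0.2]
loss = fspa_loss(rho, y_base_prompt, y_plain, y_cur_prompt)
```

A novel class that resembles the base classes yields a larger `rho`, so the teacher leans toward the base-prompt output; a dissimilar class falls back toward the unprompted output.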

3.4. Model Training and Evaluation

To make the proposed method clearer, we summarize the whole training procedure. During the base session, we set the first two blocks of the model and the prompt as trainable and use the TPA module to balance the learning of task-specific and task-agnostic knowledge. When training is finished, the prototypes of all base classes are calculated from the corresponding training samples for subsequent classification. In each incremental session, we freeze the entire backbone network, keep a copy of the base-session prompt as prompt-base, and use the FSPA module to mitigate the overfitting issue in few-shot learning. After the training of each incremental session is complete, the SPAC module is used to calibrate the new prototypes when the prototype classifier is updated.

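The total loss for model optimization is shown in Equation (16). In addition to the aforementioned $L_{T P A}$ and $L_{F S P A}$, we also introduce a semantic distillation loss $L_{S K D}$, which allows the language prompt to learn the semantic information from the label text. Of these terms, $L_{T P A}$ is active only in the base session and $L_{F S P A}$ only in the incremental sessions, whereas $L_{S K D}$ is used in both (Section 4.2).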

L_{S K D} = L_{K L} (y_{i}^{l a n g}, w_{c n_{i}}) + L_{C E} ({\hat{y}}_{i}^{l a n g}, y_{i})

(15)

L_{t o t a l} = L_{T P A} + L_{S K D} + L_{F S P A}

(16)

4. Experiments

In this section, we compare our model with other state-of-the-art methods on multiple few-shot class-incremental learning datasets and conduct ablation studies on the main modules to validate their effectiveness.

4.1. Dataset and Evaluation Indicators

We conducted comprehensive experiments on three benchmark datasets for FSCIL: CIFAR100 [37], miniImageNet [38], and CUB200 [20].

CIFAR100: This dataset is commonly used in CIL. It consists of 100 categories with 600 RGB images per class. For each category, 500 images were used for training and 100 images for testing. The size of the images is $32 \times 32$.
CUB200: This dataset contains about 6000 training images and 6000 test images from 200 bird categories. The images were resized to $256 \times 256$ and then cropped to $224 \times 224$ for training.
miniImageNet: This is a subset of ImageNet with a smaller number of classes. It includes 600 images for each of 100 classes. The size of the images is $84 \times 84$.

The dataset configuration for FSCIL is illustrated in Table 1, following that in [25]. For CIFAR100 and miniImageNet, we set 60 and 40 classes as the base and novel categories, respectively, and chose a five-way five-shot setting in each incremental session. In total, we had nine training sessions, one session for base classes and eight sessions for novel classes. For CUB200, we chose 100 classes as base classes and split the remaining 100 classes into 10 incremental sessions with the 10-way five-shot setting. Notably, except for the labeled samples used in each incremental learning session, the rest was regarded as an unlabeled dataset, following the work of [39].
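The session layout described above can be reproduced with a few lines of code. The helper below is a sketch: the function name is hypothetical, and assigning classes to sessions in index order is an assumption, since the actual class ordering follows the protocol of [25].

```python
def fscil_sessions(num_classes, num_base, way):
    """Split class indices into one base session plus `way`-class incremental sessions."""
    if (num_classes - num_base) % way != 0:
        raise ValueError("novel classes must divide evenly into N-way sessions")
    sessions = [list(range(num_base))]  # session 0: base classes
    for start in range(num_base, num_classes, way):
        sessions.append(list(range(start, start + way)))
    return sessions

# CIFAR100 / miniImageNet: 60 base classes, then 5-way sessions.
cifar = fscil_sessions(num_classes=100, num_base=60, way=5)
# CUB200: 100 base classes, then 10-way sessions.
cub = fscil_sessions(num_classes=200, num_base=100, way=10)

# With 5 shots per class, each incremental session sees way * 5 labeled samples.
cifar_samples_per_session = 5 * 5
cub_samples_per_session = 10 * 5
```

This gives nine sessions for CIFAR100 and miniImageNet (one base plus eight incremental) and eleven for CUB200 (one base plus ten incremental), with 25 and 50 labeled training samples per incremental session, respectively.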

We evaluated our model using four metrics. First, the accuracy of the tth session was defined as

{Acc}_{t} = \frac{\text{number of correctly classified samples in session } t}{\text{total number of test samples in session } t} \times 100 \% .

(17)

The average accuracy (AVG) was then computed as the mean of the session-wise accuracies:

AVG = \frac{1}{T} \sum_{t = 1}^{T} {Acc}_{t},

(18)

where T is the total number of sessions. This reflects the overall learning performance across all sessions. Second, the final overall accuracy was ${Acc}_{T}$, the accuracy obtained in the final session, where the test set included samples from all classes seen so far. This represents the ultimate performance of the incremental learning model. Third, the performance drop (PD) was defined as

PD = {Acc}_{1} - {Acc}_{T},

(19)

where ${Acc}_{1}$ is the accuracy of the base session (first session), and ${Acc}_{T}$ is the accuracy of the final session. This metric quantifies the absolute accuracy drop from the first to the last session, serving as a reference indicator for model stability. Fourth, to evaluate the balance between base-class and incremental-class performance, we report the harmonic mean (HM). Let ${Acc}_{base}$ be the top-1 accuracy on the classes from the base session only, and ${Acc}_{inc}$ be the top-1 accuracy on all classes introduced in the incremental sessions, both evaluated in the final session. The harmonic mean is then defined as

HM = \frac{2 \cdot {Acc}_{base} \cdot {Acc}_{inc}}{{Acc}_{base} + {Acc}_{inc}} .

(20)
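As a quick check of how these metrics relate, the following minimal Python sketch computes AVG, the final accuracy, PD, and HM from per-session accuracies. The function names are illustrative. The session accuracies are those of the ASP row reported for CIFAR100 in Table 2, and the base/incremental accuracies passed to `harmonic_mean` are made-up values used only to show the arithmetic.

```python
def session_metrics(session_acc):
    """Return (AVG, final accuracy, PD) from a list of per-session accuracies in %."""
    avg = sum(session_acc) / len(session_acc)   # Equation (18)
    final_acc = session_acc[-1]                 # Acc_T
    pd = session_acc[0] - session_acc[-1]       # Equation (19)
    return avg, final_acc, pd


def harmonic_mean(acc_base, acc_inc):
    """Equation (20); returns 0.0 when both accuracies are zero."""
    if acc_base + acc_inc == 0:
        return 0.0
    return 2 * acc_base * acc_inc / (acc_base + acc_inc)


# ASP on CIFAR100, sessions 1-9, as listed in Table 2
asp_cifar100 = [92.2, 90.7, 90.0, 88.7, 88.7, 88.2, 88.2, 87.8, 86.7]
avg, final_acc, pd = session_metrics(asp_cifar100)
print(round(avg, 1), final_acc, round(pd, 1))  # 89.0 86.7 5.5

# Hypothetical base / incremental accuracies, for illustration only
print(round(harmonic_mean(80.0, 60.0), 2))  # 68.57
```

The computed AVG and PD agree with the values listed for that row. Because HM is dominated by the smaller of its two inputs, a model that is accurate only on base classes scores low on this metric.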

4.2. Model Configurations and Training Details

We used a Vision Transformer (ViT-B/16) pretrained on ImageNet-21k as the backbone network and employed BERT [40] to extract word vector embeddings for the labels. In the base session, the TPA and SKD modules were activated for training. During the fine-tuning of the Transformer in that session, the parameters of the first two blocks were set to be learnable, while the rest remained frozen. We used deep prompting with $V = 2$ vision prompts and $L = 2$ language prompts. We used Adam as the optimizer and applied a cosine annealing strategy to adjust the learning rate, with an initial value set to $2 \times 10^{-4}$. The model was trained on a GeForce RTX 3090 GPU, with the batch size set to 32. The base session was trained for six epochs. The hyperparameters $\gamma$ and $\lambda$ were set to 2 and 1.2, respectively. All prompt parameters were initialized by drawing from a standard normal distribution $\mathcal{N}(0, 1)$. In the incremental session, the SPAC, FSPA and SKD modules were activated, while the Transformer backbone was completely frozen to preserve the knowledge learned from the base session. We used Adam as the optimizer and applied a cosine annealing strategy to adjust the learning rate, with an initial value set to $2 \times 10^{-4}$. The batch size was set to 32, and each incremental session was trained for six epochs. The calibration weight $\alpha$ for SPAC and the temperature coefficient $\tau$ for FSPA were set to 0.1 and 10, respectively. All experiments were run using three random seeds and the average results are reported.

4.3. Methods for Comparison

Our comparative methods included classic FSCIL approaches implemented with ResNet-18, such as CEC [41], LIMIT [11], FACT [27], and TOPIC [25]. Additionally, we considered methods based on Vision Transformer (ViT), including ASP [15], PriViLege [31], and IOS [32]. To ensure a fair comparison, we primarily focused on methods utilizing ViT as the backbone. Moreover, for methods designed for traditional class-incremental learning, such as DualPrompt [12] and L2P [13], we modified their classifiers to prototype networks to better suit the few-shot learning tasks. Class prototypes were computed using Equation (1) and expanded at the end of each incremental session. All other settings were kept identical to their original configurations.
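The prototype-based classifier used for these adapted baselines can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure: it assumes that Equation (1) is the usual class-mean prototype and that classification uses cosine similarity to the nearest prototype. The function names and the toy two-dimensional features are hypothetical.

```python
import math


def class_prototype(features):
    """Mean of the feature vectors of one class (assumed form of Equation (1))."""
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict(feature, prototypes):
    """Return the class whose prototype is most similar to the feature."""
    return max(prototypes, key=lambda c: cosine(feature, prototypes[c]))


# Base session: one prototype per base class (toy 2-D features)
prototypes = {0: class_prototype([[1.0, 0.0], [0.8, 0.2]])}
# End of an incremental session: the prototype set is expanded with the new class
prototypes[1] = class_prototype([[0.0, 1.0], [0.2, 0.8]])

print(predict([0.7, 0.3], prototypes), predict([0.1, 0.6], prototypes))  # 0 1
```

Expanding the classifier only requires adding entries to the prototype dictionary, which is why this kind of head suits few-shot incremental sessions.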

4.4. Main Results

We present the performance of our method compared to other classic and state-of-the-art methods on the CIFAR100, CUB200, and miniImageNet datasets in Table 2, Table 3, and Table 4, respectively. The compared methods span two backbone architectures: ResNet-18 and ViT-B/16. It is widely recognized that the architectural shift from CNN to Vision Transformer contributes substantially to overall performance improvements. Therefore, the comparisons between ViT-based and ResNet-based methods should be interpreted with this factor in mind. Our primary baselines were recent ViT-based state-of-the-art methods, which shared the same backbone as ours and thus enabled a fair and direct assessment of the proposed framework. Compared to classic prompt-based continual learning methods such as L2P and DualPrompt, our method demonstrated significant improvements across all three datasets. On CIFAR100, CUB200, and miniImageNet, the average accuracy increased by 8.93%, 5.70%, and 7.32%, respectively. When compared with the latest state-of-the-art methods PriViLege (CVPR 2024) and ASP (ECCV 2024), our method still showed certain advantages. On these three datasets, the average accuracy increased by 1.03%, 0.40%, and 0.35%, respectively, compared to ASP. Moreover, on CIFAR100 and miniImageNet, our method achieved the best results in both the $A_{Last}$ and PD metrics.

4.5. Ablation Studies

To validate the effectiveness of the proposed framework, we conducted ablation experiments on the CUB200 dataset to verify the functionality of its four main modules and analyze the primary results. We adopted cross-entropy training with a frozen backbone and learned prompts as the baseline. Detailed results are shown in Table 5. The implementation details and task settings were consistent with the main experiments.

4.5.1. Efficacy of the TPA Module

The first module, the two-stage prompt alignment (TPA), was designed to leverage prompt alignment for knowledge learning while preserving the generalization capability of the pretrained model. The results indicated that TPA improved the average accuracy by 6.344% compared to the baseline and reduced the performance degradation rate by 1.63%. This demonstrates that the model retains strong generalization performance without hindering subsequent tasks.

4.5.2. Efficacy of the SPAC Module

The subspace prototype aggregation calibration (SPAC) module addressed the issue of prototype networks being biased toward large-sample prototypes. Since this module only functioned during the incremental session, it had no effect on $A_{Base}$. By refining the classification decision boundaries, it enhanced $A_{Last}$ and $A_{Avg}$ by approximately 0.7%. Considering the challenge of improving model performance with extremely few samples, this is a significant improvement.

4.5.3. Efficacy of the FSPA Module

The few-shot prompt alignment (FSPA) module applied regularization constraints during the incremental session to mitigate overfitting. This module boosted $A_{Last}$ and $A_{Avg}$ by approximately 0.8%. Notably, the combined contribution of SPAC and FSPA was substantial, confirming their complementary interaction during incremental sessions.

4.5.4. Efficacy of the SKD Module

Finally, the semantic knowledge distillation (SKD) module introduced semantic information from the pretrained language model into the prompt parameters. While this module slightly reduced the performance of $A_{Base}$, it had minimal impact on $A_{Last}$ and $A_{Avg}$. Overall, it helped smooth the model’s performance curve.

4.6. Model Visualization Results

In this section, we present various forms of visual analysis of the model’s output.

4.6.1. Confusion Matrix

Figure 3 presents the confusion matrix results from our experiments, showing the testing outcomes after the final task session for three different models. The confusion matrix when directly fine-tuning a pretrained Vision Transformer is shown in Figure 3a. It can be observed that the model predicted most categories as the new classes from the final session, indicating catastrophic forgetting. Figure 3b and Figure 3c show the confusion matrices of the baseline and our proposed method, respectively. Compared to the baseline, the noise in the lower-left corner of the confusion matrix was significantly reduced in our proposed method, demonstrating its ability to mitigate the tendency of the model to classify new categories as base-session categories.
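The "lower-left corner" effect discussed above can also be summarized as a single number. The sketch below builds a confusion matrix and measures the fraction of novel-class test samples that are predicted as base classes. It assumes that rows index true labels, columns index predictions, and base classes occupy the lowest class indices; the labels in the example are made up for illustration.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts samples with true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def novel_to_base_rate(cm, num_base):
    """Fraction of novel-class samples predicted as a base class.

    This is the mass of the lower-left block of the matrix, assuming
    base classes occupy indices 0 .. num_base - 1.
    """
    novel_rows = cm[num_base:]
    total = sum(sum(row) for row in novel_rows)
    leaked = sum(sum(row[:num_base]) for row in novel_rows)
    return leaked / total if total else 0.0


# Toy example: classes 0-1 are base classes, classes 2-3 are novel
y_true = [0, 1, 2, 2, 3, 3]
y_pred = [0, 1, 2, 0, 1, 3]
cm = confusion_matrix(y_true, y_pred, num_classes=4)
print(novel_to_base_rate(cm, num_base=2))  # 0.5
```

A lower value corresponds to less noise in the lower-left block of the matrix.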

4.6.2. t-SNE Feature Space

To further demonstrate the effectiveness of our method, we visualized the feature space of our model on the CUB200 dataset. Five classes were randomly selected from the base classes, and three classes were randomly selected from the incremental classes (indicated by triangles in Figure 4). Compared to the baseline, our method increased the inter-class distance between the new classes and the base classes to some extent, resulting in higher discriminability.

4.7. Base-to-Novel Class Generalization

To validate the model’s generalization ability after training on the base session, we tested the model directly on subsequent sessions after training on the first session. Table 6 compares the average performance of our method on three metrics with that of PriViLege [31], ASP [15], L2P [13], and DualPrompt [12]. On the CUB200 dataset, our method outperformed the second-best approach in terms of base-class accuracy, novel-class accuracy, and harmonic mean by 4.41%, 2.85%, and 1.41%, respectively. L2P and DualPrompt are methods designed for traditional class-incremental learning; thus, their ability to learn new categories in few-shot tasks is limited, resulting in lower novel-class accuracy. On the CIFAR100 dataset, our base-class accuracy was comparable to the second-best method, but our novel-class accuracy improved by 3.73%. Due to its strong knowledge transfer capability, PriViLege also exhibited high novel-class performance on this dataset. On the miniImageNet dataset, all methods performed well; compared to ASP, our method achieved improvements of 0.67%, 0.21%, and 0.13% in the three metrics, respectively. Overall, thanks to the prompt alignment during the base session and the calibration effect of prototype aggregation, our method achieved superior performance across all metrics on the three datasets, significantly enhancing the recognition ability for novel classes.

4.8. Further Analysis

4.8.1. Hyperparameter Sensitivity

Figure 5a presents the parameter sensitivity matrix for two hyperparameters, $\gamma$ and $\lambda$, employed in prompt-adaptive alignment on the CUB200 dataset. These hyperparameters were introduced in the loss functions of Equations (9) and (10), where $\gamma$ controls the epoch-based switching of the training objective and $\lambda$ balances the cross-entropy terms. When $\gamma$ was set to 2, the model performed well. However, when $\gamma$ was too large, the model failed to fully learn the foundational knowledge due to fewer learnable parameters. When $\gamma$ was set to −1, which indicated no task-specific knowledge learning for the prompt parameter, the performance was also suboptimal. The model achieved optimal performance when $\lambda$ was between 1 and 1.2. Figure 5b displays an ablation experiment on the hyperparameters of the prototype calibration module. It can be observed that the performance was best when $\alpha$ equaled 0.1, after which the performance steadily declined. When $\alpha$ is too small, the calibrated prototypes remain close to the original few-shot estimates, which are dominated by sampling noise. When $\alpha$ is too large, the calibrated prototypes are excessively pulled toward the subspace projection, losing the unique class-specific information that distinguishes different novel classes. The narrow optimal range around $\alpha = 0.1$ indicates that only a modest degree of calibration is needed, because the subspace projection already provides a strong denoising effect while the remaining original component is essential for preserving novel-class discriminability. This optimal value is consistent across all three datasets, which suggests that it is not dataset-specific but reflects a fundamental trade-off in the few-shot calibration process.
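The role of $\alpha$ described above can be illustrated with a small sketch. The exact SPAC formulation is defined in the method section and is not reproduced here. The code assumes, purely for illustration, that the calibrated prototype is a convex combination of the original few-shot prototype and its projection onto a subspace given by an orthonormal basis. The function names and the toy basis are hypothetical.

```python
def project_onto_subspace(vec, basis):
    """Orthogonal projection of vec onto span(basis); basis must be orthonormal."""
    proj = [0.0] * len(vec)
    for b in basis:
        coeff = sum(v * x for v, x in zip(vec, b))
        proj = [p + coeff * x for p, x in zip(proj, b)]
    return proj


def calibrate_prototype(proto, basis, alpha=0.1):
    """Assumed form: (1 - alpha) * original prototype + alpha * subspace projection."""
    proj = project_onto_subspace(proto, basis)
    return [(1.0 - alpha) * p + alpha * q for p, q in zip(proto, proj)]


# Toy 3-D example: the subspace is the plane spanned by the first two axes
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
few_shot_proto = [1.0, 2.0, 3.0]

calibrated = calibrate_prototype(few_shot_proto, basis, alpha=0.1)
print([round(x, 3) for x in calibrated])  # [1.0, 2.0, 2.7]
```

Under this assumed form, $\alpha = 0$ returns the original few-shot prototype and $\alpha = 1$ returns the pure projection, which matches the two failure modes described in the text.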

4.8.2. Subspace Prototype Aggregation Module

One of the goals of incremental learning is to continuously learn new categories. However, even when a model achieves high average accuracy, the accuracy may primarily stem from base category samples, while its performance on new categories remains poor. Such a model is not practical. To address this issue, some methods have adopted the harmonic mean to evaluate the model’s overall performance on both base and incremental categories. Figure 6 presents the ablation experiment of the harmonic mean for our prototype aggregation module. The blue line at the bottom represents the performance of the model without any prototype calibration measures. The red line in the middle indicates the performance of the module proposed in [36] for the same problem, while the top line demonstrates the performance of our proposed module. It can be observed that the harmonic mean at each session of our proposed module was consistently higher than that of other methods.

4.8.3. Performance on CLIP

To verify the applicability of the modules in the proposed model, we conducted ablation experiments on four modules based on the CLIP model on the CIFAR-100 dataset. We also compared the performance of CEC on the CLIP model and the incremental learning capability of CLIP itself. Specifically, the text encoder of the semantic knowledge distillation module was changed from the BERT model to the language encoder of CLIP, and the visual encoder of CLIP was used to extract image features, with final classification prediction performed based on CLIP. As shown in Figure 7, both CEC and CLIP exhibit severe catastrophic forgetting on few-shot incremental learning tasks. In contrast, each module proposed in this paper brings significant performance improvements when integrated into the CLIP model. The experimental results demonstrate that the proposed method is applicable to CLIP and can achieve superior performance under the CLIP framework.

4.8.4. Effect of Different Numbers of Incremental Samples

To investigate the impact of the number of samples per class in the incremental learning stage on model performance, we conducted comparative experiments with diverse sample sizes. Specifically, we set the number of samples per incremental class to one-shot, five-shot, 10-shot, 20-shot, 50-shot, and 100-shot and evaluated on multiple datasets. The experimental results are shown in Figure 8. As the shot value increases, the overall recognition accuracy of the model across incremental sessions exhibits a clear trend of rapid improvement followed by a plateau. When the shot value increases from one to five, the accuracy of the model improves substantially across all incremental sessions, indicating that a modest increase in incremental samples can significantly alleviate the insufficient feature learning problem under few-shot conditions and effectively enhance the model’s representational capacity for novel classes. When the shot value continues to increase from five to 50, the accuracy gains gradually slow down, and the growth trend steadily diminishes. When the shot value reaches 50, the model accuracy essentially enters a saturation stage, and further increasing to 100 shots brings no noticeable performance improvement. This indicates that the model’s feature learning for novel classes in the incremental stage has reached a bottleneck, and excessively increasing the number of samples can no longer yield further performance gains.

4.8.5. Impact of Trainable Block

As shown in Table 7, we conducted further experimental analysis on fine-tuning different layers of the Transformer. When the number of trainable blocks was zero, the model failed to fully leverage the vast quantity of data from the base session to acquire knowledge due to the limited size of the prompt parameters, resulting in poor performance. As the number of learnable layers increased, the model’s performance on the base session gradually improved, but it failed to achieve optimal performance on the other three metrics. Additionally, the computational cost and training time increased significantly. When only the first two blocks of the Transformer were fine-tuned, the model achieved optimal average accuracy and harmonic mean. This indicates that, under this setting, the model effectively balances the acquisition of task-specific and task-independent knowledge. These findings suggest that fine-tuning a limited number of Transformer blocks can strike a balance between performance and computational efficiency. By focusing on the first two blocks, the model can effectively capture both task-specific and task-independent knowledge without incurring excessive computational overhead.
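The block-wise fine-tuning setting can be expressed as a simple selection over parameter names. The sketch below is framework-agnostic and hypothetical: it assumes parameters are named with a `blocks.<index>.` prefix, a convention borrowed from common ViT implementations, and it leaves the handling of prompt parameters out of scope.

```python
def select_trainable(param_names, num_trainable_blocks=2):
    """Return the names of parameters that belong to the first N Transformer blocks.

    Assumption: block parameters are named "blocks.<index>.<...>".
    """
    trainable = []
    for name in param_names:
        parts = name.split(".")
        if len(parts) > 1 and parts[0] == "blocks" and parts[1].isdigit():
            if int(parts[1]) < num_trainable_blocks:
                trainable.append(name)
    return trainable


# Hypothetical parameter names for a small example
names = [
    "patch_embed.proj.weight",
    "blocks.0.attn.qkv.weight",
    "blocks.1.mlp.fc1.weight",
    "blocks.2.attn.qkv.weight",
    "blocks.11.mlp.fc2.weight",
    "norm.weight",
]
print(select_trainable(names))
# ['blocks.0.attn.qkv.weight', 'blocks.1.mlp.fc1.weight']
```

In a real training script, the selected names would have gradients enabled while all other backbone parameters stay frozen. Setting `num_trainable_blocks=0` reproduces the fully frozen backbone case discussed above.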

4.8.6. Complexity Analysis

All experiments in this comparison were conducted on the same NVIDIA GeForce RTX 4060 Ti GPU with a batch size of 32. For each method, the training time was measured over six epochs in the base session under the CUB200 dataset setting. As shown in Table 8, our method employs a two-stage training strategy, with 1.38M parameters in the first stage and 14.5M in the second stage. Despite the larger parameter count in the second stage, our training time of 11.65 min remains competitive: it is shorter than that of ASP (15.21 min) and close to that of PriViLege (10.64 min). The inference time of 7.72 min and the memory consumption of 3253 MB are also reasonable, and the memory consumption is lower than PriViLege’s 4581 MB. Overall, our method achieves a favorable trade-off between computational cost and performance.

5. Conclusions

In this study, we proposed a novel few-shot incremental learning framework based on a pretrained Transformer model and prompt learning. The method leveraged prompt alignment and regularization to mitigate overfitting during model training and introduced a prototype aggregation module to address the classification bias of prototype networks toward large-sample categories. Our proposed approach demonstrated optimal or near-optimal performance across multiple metrics compared to other methods and exhibited significant performance improvements in numerous experiments. Overall, the proposed method provides solutions to key challenges in the field of few-shot incremental learning and makes a meaningful contribution to its development.

Funding

This research received no external funding.

Data Availability Statement

The three benchmark datasets used in this work are publicly available. Detailed references and configurations are provided in Section 4.1. The code generated during the current study is not publicly available due to its intended use as part of further ongoing research, but it can be made available from the corresponding author upon reasonable request.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Masana, M.; Liu, X.; Twardowski, B.; Menta, M.; Bagdanov, A.D.; van de Weijer, J. Class-incremental learning: Survey and performance evaluation on image classification. arXiv 2020, arXiv:2010.15277.
2. Delange, M.; Aljundi, R.; Masana, M.; Parisot, S.; Jia, X.; Leonardis, A.; Slabaugh, G.; Tuytelaars, T. A continual learning survey: Defying forgetting in classification tasks. TPAMI 2021, in press.
3. Li, Z.; Hoiem, D. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2935–2947.
4. Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. iCaRL: Incremental classifier and representation learning. In Proceedings of the CVPR; IEEE: New York, NY, USA, 2017; pp. 2001–2010.
5. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
6. Aljundi, R.; Babiloni, F.; Elhoseiny, M.; Rohrbach, M.; Tuytelaars, T. Memory aware synapses: Learning what (not) to forget. In Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018, Proceedings, Part III; Springer: Berlin/Heidelberg, Germany, 2018; pp. 144–161.
7. Mermillod, M.; Bugaiska, A.; Bonin, P. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects. Front. Psychol. 2013, 4, 504.
8. Tian, S.; Li, L.; Li, W.; Ran, H.; Ning, X.; Tiwari, P. A survey on few-shot class-incremental learning. Neural Netw. 2024, 169, 307–324.
9. Chi, Z.; Gu, L.; Liu, H.; Wang, Y.; Yu, Y.; Tang, J. MetaFSCIL: A Meta-Learning Approach for Few-Shot Class Incremental Learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 14166–14175.
10. Cheraghian, A.; Rahman, S.; Fang, P.; Roy, S.K.; Petersson, L.; Harandi, M. Semantic-Aware Knowledge Distillation for Few-Shot Class-Incremental Learning. In Proceedings of the CVPR; IEEE: New York, NY, USA, 2021; pp. 2534–2543.
11. Zhou, D.W.; Ye, H.J.; Ma, L.; Xie, D.; Pu, S.; Zhan, D.C. Few-shot class-incremental learning by sampling multi-phase tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12816–12831.
12. Wang, Z.; Zhang, Z.; Ebrahimi, S.; Sun, R.; Zhang, H.; Lee, C.Y.; Ren, X.; Su, G.; Perot, V.; Dy, J.; et al. DualPrompt: Complementary prompting for rehearsal-free continual learning. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2022.
13. Wang, Z.; Zhang, Z.; Lee, C.Y.; Zhang, H.; Sun, R.; Ren, X.; Su, G.; Perot, V.; Dy, J.; Pfister, T. Learning to Prompt for Continual Learning. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2022.
14. Smith, J.S.; Karlinsky, L.; Gutta, V.; Cascante-Bonilla, P.; Kim, D.; Arbelle, A.; Panda, R.; Feris, R.; Kira, Z. CODA-Prompt: COntinual Decomposed Attention-Based Prompting for Rehearsal-Free Continual Learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2023.
15. Liu, C.; Wang, Z.; Xiong, T.; Chen, R.; Wu, Y.; Guo, J.; Huang, H. Few-Shot Class Incremental Learning with Attention-Aware Self-Adaptive Prompt. arXiv 2024, arXiv:2403.09857.
16. D’Alessandro, M.; Alonso, A.; Calabrés, E.; Galar, M. Multimodal parameter-efficient few-shot class incremental learning. In IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2023; pp. 3393–3403.
17. Khattak, M.U.; Wasim, S.T.; Naseer, M.; Khan, S.; Yang, M.H.; Khan, F.S. Self-regulating prompts: Foundational model adaptation without forgetting. In IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2023; pp. 15190–15200.
18. van de Ven, G.M.; Tolias, A.S. Three scenarios for continual learning. arXiv 2019, arXiv:1904.07734.
19. Wu, Y.; Chen, Y.; Wang, L.; Ye, Y.; Liu, Z.; Guo, Y.; Fu, Y. Large scale incremental learning. In Proceedings of the CVPR; IEEE: New York, NY, USA, 2019; pp. 374–382.
20. Chaudhry, A.; Ranzato, M.; Rohrbach, M.; Elhoseiny, M. Efficient Lifelong Learning with A-GEM. In Proceedings of the ICLR; Curran Associates, Inc.: Red Hook, NY, USA, 2019.
21. Xiang, Y.; Fu, Y.; Ji, P.; Huang, H. Incremental learning using conditional adversarial networks. In Proceedings of the ICCV; IEEE: New York, NY, USA, 2019; pp. 6619–6628.
22. Iscen, A.; Zhang, J.; Lazebnik, S.; Schmid, C. Memory-efficient incremental learning through feature adaptation. In Proceedings of the ECCV; Springer: Cham, Switzerland, 2020; pp. 699–715.
23. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. In Proceedings of the NeurIPS Workshop; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2015.
24. Zhao, L.; Lu, J.; Xu, Y.; Cheng, Z.; Guo, D.; Niu, Y.; Fang, X. Few-shot class-incremental learning via class-aware bilateral distillation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 11838–11847.
25. Tao, X.; Hong, X.; Chang, X.; Dong, S.; Wei, X.; Gong, Y. Few-shot class-incremental learning. In Proceedings of the CVPR; IEEE: New York, NY, USA, 2020; pp. 12183–12192.
26. Shi, G.; Chen, J.; Zhang, W.; Zhan, L.M.; Wu, X.M. Overcoming catastrophic forgetting in incremental few-shot learning by finding flat minima. In Advances in Neural Information Processing Systems (NeurIPS); Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2021.
27. Zhou, D.W.; Wang, F.Y.; Ye, H.J.; Ma, L.; Pu, S.; Zhan, D.C. Forward compatible few-shot class-incremental learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 9046–9056.
28. Mazumder, P.; Singh, P.; Rai, P. Few-Shot Lifelong Learning. In Proceedings of the AAAI; AAAI Press: Washington, DC, USA, 2021; pp. 2337–2345.
29. Hersche, M.; Karunaratne, G.; Cherubini, G.; Benini, L.; Sebastian, A.; Rahimi, A. Constrained few-shot class-incremental learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022.
30. Yang, Y.; Yuan, H.; Li, X.; Lin, Z.; Torr, P.; Tao, D. Neural Collapse Inspired Feature-Classifier Alignment for Few-Shot Class-Incremental Learning. In Proceedings of the Eleventh International Conference on Learning Representations, Virtual, 25–29 April 2022.
31. Park, K.H.; Song, K.; Park, G.M. Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 23881–23890.
32. Yoon, I.U.; Choi, T.M.; Lee, S.K.; Kim, Y.M.; Kim, J.H. Image-Object-Specific Prompt Learning for Few-Shot Class-Incremental Learning. arXiv 2023, arXiv:2309.02833.
33. Tang, Y.M.; Peng, Y.X.; Zheng, W.S. When Prompt-based Incremental Learning Does Not Meet Strong Pretraining. In IEEE International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2023.
34. Jung, D.; Han, D.; Bang, J.; Song, H. Generating Instance-level Prompts for Rehearsal-free Continual Learning. In IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2023.
35. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. In Proceedings of the NeurIPS; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2017; pp. 4080–4090.
36. Wang, Q.W.; Zhou, D.W.; Zhang, Y.K.; Zhan, D.C.; Ye, H.J. Few-Shot Class-Incremental Learning Via Training-Free Prototype Calibration. In Proceedings of the NeurIPS; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2023.
37. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
38. Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; Wierstra, D. Matching networks for one shot learning. In Proceedings of the NeurIPS; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2016; pp. 3630–3638.
39. Cui, Y.; Xiong, W.; Tavakolian, M.; Liu, L. Semi-Supervised Few-Shot Class-Incremental Learning. In Proceedings of the ICIP; IEEE: New York, NY, USA, 2021; pp. 1239–1243.
40. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1, pp. 4171–4186.
41. Zhang, C.; Song, N.; Lin, G.; Zheng, Y.; Pan, P.; Xu, Y. Few-shot incremental learning with continually evolved classifiers. In Proceedings of the CVPR; IEEE: New York, NY, USA, 2021; pp. 12455–12464.
42. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Conditional prompt learning for vision-language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 16816–16825.

Figure 1. An illustration of the prompt alignment and regularization. (a) Training mode of baseline model, (b) Prompt alignment of our model in the base session: We first freeze the backbone network and allow the prompt parameters to learn task-specific knowledge. Then, we fine-tune the first two blocks of the backbone network and align the outputs with and without prompts. (c) Prompt regularization of our model in the incremental session where we freeze the backbone network and apply a regularization constraint on the current prompt outputs using both the outputs with base prompts and those without prompts.

Figure 2. Illustration of the proposed framework. The proposed framework includes learning strategies for both the base and incremental sessions. In the base session, we first freeze the backbone network and train the prompt parameters using the cross-entropy loss, allowing the prompts to learn task-specific knowledge. We then fine-tune the first two blocks of the Transformer, using the output of the model without prompts to constrain the output with prompts. Both the output with prompts and the output without prompts are used jointly to calculate the cross-entropy loss for model optimization. Since our model employs a prototype network for classification, we propose a prototype calibration module to mitigate the adverse effects of few-shot prototypes. This module calibrates the prototypes at the end of each incremental session. In the bottom-right corner, the prompt alignment operation for the incremental sessions is illustrated, where additional considerations are made to address the negative effects of few-shot samples compared to the base session.

Figure 3. Confusion matrices of the final session output predictions on the CUB200 dataset: (a) results from directly fine-tuning the model, (b) results from the baseline model, and (c) results from our proposed model.

Figure 4. t-SNE feature space visualization, including five base-session categories and three incremental-session categories. (a) shows the output results of the baseline model, while (b) shows the output results of our proposed model. Triangles represent incremental classes, while circles represent base classes.

Figure 5. Hyperparameter Sensitivity Experiments: (a) Sensitivity matrix of two hyperparameters, $\gamma$ and $\lambda$, for the prompt adaptation alignment. (b) Performance impact curve of the hyperparameter $\alpha$ in the Subspace Regularization Prototype Aggregation module.

Figure 6. The harmonic mean experiment of the SPAC module.

Figure 7. CLIP performance on CIFAR-100.

Figure 8. Impact of the number of samples per incremental class on performance. 1-shot indicates that each incremental class contains only one sample.

Table 1. The dataset configurations for FSCIL. #Class and #Samples stand for the number of classes and the number of samples, respectively. The learning pattern represents the setting of novel tasks in each incremental learning session.

Dataset	Base #Class	Base #Samples	Incremental #Class	Incremental #Samples	Incremental Pattern
CIFAR100	60	500	40	5	5-way 5-shot
miniImageNet	60	500	40	5	5-way 5-shot
CUB200	100	30	100	5	10-way 5-shot

Table 2. The performance of every session on CIFAR100. ↑ indicates higher is better, ↓ indicates lower is better. Bold values indicate the best performance.

Method	Backbone	Acc. in Each Session ↑ (%)									AVG (↑)	PD (↓)
		1	2	3	4	5	6	7	8	9
TOPIC [25]	ResNet18	64.1	55.9	47.1	45.2	40.1	36.4	34.0	31.6	29.4	42.6	34.7
CEC [41]	ResNet18	73.1	68.9	65.3	61.2	58.1	55.6	53.2	51.3	49.1	59.5	23.9
LIMIT [11]	ResNet18	73.8	72.1	67.9	63.9	60.7	57.8	55.7	53.5	51.2	61.8	22.6
FACT [27]	ResNet18	74.6	72.1	67.6	63.5	61.4	58.4	56.3	54.2	52.1	62.2	22.5
CLIP ZSL	ViT-B/16	73.8	71.6	72.0	70.71	69.8	68.5	67.8	67.3	66.9	69.8	6.9
CoCoOp [42]	ViT-B/16	82.2	77.4	73.7	71.7	69.1	67.7	65.5	63.8	62.1	70.4	20.1
IOS [32]	ViT-B/16	86.2	82.3	79.6	77.5	75.9	75.3	74.1	73.6	72.8	77.5	13.4
CEC [41]	ViT-B/16	74.20	71.49	70.11	67.34	65.96	65.14	64.74	63.48	61.48	67.10	12.72
L2P [13]	ViT-B/16	84.7	82.3	80.1	77.5	77.0	76.0	75.6	74.1	72.3	77.7	12.4
DualPrompt [12]	ViT-B/16	86.0	83.6	82.9	80.2	80.6	80.2	80.5	79.0	77.4	81.1	8.5
PriViLege [31]	ViT-B/16	90.88	89.39	88.97	87.55	87.83	87.35	87.53	87.15	86.06	88.08	4.82
ASP [15]	ViT-B/16	92.2	90.7	90.0	88.7	88.7	88.2	88.2	87.8	86.7	89.0	5.5
Ours	ViT-B/16	$\underset{\pm 0.25}{92.37}$	$\underset{\pm 0.31}{90.94}$	$\underset{\pm 0.28}{90.60}$	$\underset{\pm 0.34}{89.61}$	$\underset{\pm 0.29}{89.84}$	$\underset{\pm 0.33}{89.58}$	$\underset{\pm 0.27}{89.68}$	$\underset{\pm 0.30}{89.53}$	$\underset{\pm 0.37}{88.14}$	$\underset{\pm 0.22}{90.03}$	$\underset{\pm 0.19}{4.22}$
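
The two summary columns can be reproduced from the per-session accuracies. The sketch below assumes that AVG is the arithmetic mean over sessions and that PD is the first-session accuracy minus the last-session accuracy; both assumptions agree with the rows above, and the function names are illustrative.

```python
from statistics import fmean
from typing import Sequence


def average_accuracy(accs: Sequence[float]) -> float:
    """Mean accuracy over all sessions (assumed definition of AVG)."""
    return fmean(accs)


def performance_drop(accs: Sequence[float]) -> float:
    """First-session minus last-session accuracy (assumed definition of PD)."""
    return accs[0] - accs[-1]


# Per-session accuracies copied from the ASP [15] row of Table 2.
asp_cifar100 = [92.2, 90.7, 90.0, 88.7, 88.7, 88.2, 88.2, 87.8, 86.7]

print(f"AVG = {average_accuracy(asp_cifar100):.1f}")
print(f"PD  = {performance_drop(asp_cifar100):.1f}")
```

For this row the two values round to 89.0 and 5.5, matching the AVG and PD entries reported for ASP.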

Table 3. The performance of every session on CUB200. ↑ indicates higher is better, ↓ indicates lower is better. Bold values indicate the best performance.

		Acc. in Each Session ↑ (%)
Method	Backbone	1	2	3	4	5	6	7	8	9	10	11	AVG (↑)	PD (↓)
TOPIC [25]	ResNet18	68.7	62.5	54.8	50.0	45.3	41.4	38.4	35.4	32.2	28.3	26.3	43.9	42.4
CEC [41]	ResNet18	75.9	71.9	68.5	63.5	62.4	58.3	57.7	55.8	54.8	53.5	52.3	61.3	23.6
LIMIT [11]	ResNet18	75.9	73.6	72.0	68.1	67.4	63.6	62.4	61.4	59.9	58.7	57.4	65.5	18.5
FACT [27]	ResNet18	75.9	73.2	70.8	66.1	65.6	62.2	61.7	59.8	58.4	57.9	56.9	64.4	19.0
CLIP ZSL	ViT-B/16	65.5	64.2	63.2	62.4	59.9	60.3	59.8	58.4	56.3	54.9	53.5	59.9	12.0
CoCoOp [42]	ViT-B/16	80.3	72.1	68.8	65.4	63.4	61.2	58.2	56.9	54.5	52.3	50.1	62.1	30.2
IOS [32]	ViT-B/16	81.3	77.4	75.8	73.3	72.6	70.4	68.7	67.3	65.9	64.4	63.8	71.0	17.5
CEC [41]	ViT-B/16	75.40	73.23	72.00	68.70	69.35	67.78	67.01	66.40	65.78	65.57	65.70	72.41	9.7
L2P [13]	ViT-B/16	82.4	81.2	79.0	76.8	76.2	74.7	74.1	74.1	72.7	73.0	73.6	76.2	8.7
DualPrompt [12]	ViT-B/16	83.5	82.2	80.9	79.5	78.6	77.0	76.3	77.0	75.7	76.1	76.5	78.5	7.1
PriViLege [31]	ViT-B/16	82.21	81.25	80.45	77.76	77.78	75.95	75.69	76.00	75.19	75.19	75.08	77.50	7.13
ASP [15]	ViT-B/16	87.1	86.0	84.9	83.4	83.6	82.4	82.6	83.0	82.6	83.0	83.5	83.8	3.6
Ours	ViT-B/16	$\underset{\pm 0.29}{87.26}$	$\underset{\pm 0.32}{86.38}$	$\underset{\pm 0.38}{85.71}$	$\underset{\pm 0.45}{84.28}$	$\underset{\pm 0.42}{84.51}$	$\underset{\pm 0.44}{82.87}$	$\underset{\pm 0.42}{82.97}$	$\underset{\pm 0.46}{83.36}$	$\underset{\pm 0.45}{82.75}$	$\underset{\pm 0.39}{82.93}$	$\underset{\pm 0.40}{83.22}$	$\underset{\pm 0.33}{84.20}$	$\underset{\pm 0.18}{4.03}$

Table 4. The performance of every session on miniImageNet. ↑ indicates higher is better, ↓ indicates lower is better. Bold values indicate the best performance.

		Acc. in Each Session ↑ (%)
Method	Backbone	1	2	3	4	5	6	7	8	9	AVG (↑)	PD (↓)
TOPIC [25]	ResNet18	61.3	50.1	45.2	41.2	37.5	35.5	32.2	29.5	24.4	39.6	36.9
CEC [41]	ResNet18	72.0	66.8	63.0	59.4	56.7	53.7	51.2	49.2	47.6	57.7	24.4
LIMIT [11]	ResNet18	72.3	68.5	64.3	60.8	58.0	55.1	52.7	50.7	49.2	59.1	23.1
FACT [27]	ResNet18	72.6	69.6	66.4	62.8	60.6	57.3	54.3	52.2	50.5	60.7	22.1
CLIP ZSL	ViT-B/16	85.8	85.5	84.9	84.8	84.2	84.2	84.0	83.9	83.7	84.6	2.1
CoCoOp [42]	ViT-B/16	94.2	91.7	88.9	87.8	86.3	84.3	82.5	81.9	81.3	86.5	12.9
IOS [32]	ViT-B/16	95.4	94.4	93.4	93.1	92.1	91.4	90.8	90.0	89.1	92.2	6.3
CEC [41]	ViT-B/16	87.43	85.99	84.03	83.21	83.11	81.64	80.66	80.72	80.74	83.06	6.69
L2P [13]	ViT-B/16	93.05	92.31	90.51	87.07	86.38	85.19	84.45	84.15	81.44	87.172	11.61
DualPrompt [12]	ViT-B/16	95.05	93.81	91.51	90.07	88.38	86.19	85.45	84.15	83.14	88.638	11.91
PriViLege [31]	ViT-B/16	96.68	96.49	95.65	95.54	95.54	94.91	94.33	94.19	94.10	95.27	2.58
ASP [15]	ViT-B/16	96.72	96.59	96.05	95.74	95.54	95.11	94.72	94.59	94.33	95.607	2.39
Ours	ViT-B/16	$\underset{\pm 0.06}{97.05}$	$\underset{\pm 0.07}{96.95}$	$\underset{\pm 0.08}{96.31}$	$\underset{\pm 0.09}{96.23}$	$\underset{\pm 0.10}{96.26}$	$\underset{\pm 0.10}{95.74}$	$\underset{\pm 0.11}{95.11}$	$\underset{\pm 0.12}{95.02}$	$\underset{\pm 0.13}{94.96}$	$\underset{\pm 0.08}{95.96}$	$\underset{\pm 0.06}{2.09}$

Table 5. Ablation experiments on CUB200. TPA denotes the two-stage prompt adaptation alignment, SPAC the prototype aggregation calibration module, FSPA the few-shot prompt alignment, and SKD the Semantic Knowledge Distillation module guided by textual semantics. $A_{Base} = \mathrm{Acc}_{1}$, $A_{Last} = \mathrm{Acc}_{T}$, and $A_{Avg} = \mathrm{AVG}$. Bold values indicate the best performance.

Ablation				CUB200
TPA	SPAC	FSPA	SKD	A_Base	A_Last	A_Avg
				81.532	74.095	76.31
✓				87.5	81.693	82.654
✓	✓			87.5	82.325	83.309
✓	✓	✓		87.5	83.034	84.138
	✓	✓	✓	81.416	79.264	78.5
✓	✓		✓	87.256	82.418	83.657
✓		✓	✓	87.256	82.057	83.102
✓			✓	87.5	81.827	82.702
✓	✓	✓		87.314	82.813	83.895
✓	✓	✓	✓	87.256	83.086	84.171

Table 6. Accuracy comparison of our method with previous methods on base-to-novel generalization. Results are averaged over sessions. HM denotes the harmonic mean of the base and novel accuracies, and Δ is the difference between our result and the best previous method. Bold values indicate the best performance.

Dataset		PriViLege	ASP	L2P	DualPrompt	Ours	Δ
Average on 3 datasets	Base	84.09	89.83	86.43	84.04	90.64	+0.81
	Novel	80.22	82.43	71.24	71.85	85.59	+3.16
	HM	82.54	86.34	78.17	78.44	88.67	+2.33
CUB200	Base	79.47	81.84	82.15	76.66	86.56	+4.41
	Novel	69.45	75.34	63.96	64.54	78.19	+2.85
	HM	74.12	80.75	71.80	72.47	82.16	+1.41
CIFAR100	Base	82.48	90.46	84.07	84.24	90.47	+0.01
	Novel	80.95	80.25	66.60	66.76	84.68	+3.73
	HM	81.71	85.05	74.32	74.37	87.48	+2.43
miniImageNet	Base	92.32	95.20	93.08	93.23	95.87	+0.67
	Novel	91.27	92.69	84.15	84.26	92.90	+0.21
	HM	91.79	94.23	88.39	88.48	94.36	+0.13
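
The HM rows can be checked against the Base and Novel rows. The sketch below assumes the standard harmonic mean of two accuracies, HM = 2 · Base · Novel / (Base + Novel), which reproduces the per-dataset values reported for our method; the function name is illustrative.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Standard harmonic mean of base and novel accuracy (assumed form of HM)."""
    total = base_acc + novel_acc
    if total == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / total


# (Base, Novel) pairs copied from the "Ours" column of Table 6.
ours = {
    "CUB200": (86.56, 78.19),
    "CIFAR100": (90.47, 84.68),
    "miniImageNet": (95.87, 92.90),
}

for dataset, (base, novel) in ours.items():
    print(f"{dataset}: HM = {harmonic_mean(base, novel):.2f}")
```

The harmonic mean is never larger than the arithmetic mean and is pulled toward the smaller of the two values, so a method cannot score well on HM by favoring base classes at the expense of novel ones.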

Table 7. Impact of the number of trainable Transformer layers on the CUB200 dataset. The four metrics are base-session accuracy, final-session accuracy, average accuracy, and the harmonic mean of base-class and incremental-class accuracies: $A_{Base} = \mathrm{Acc}_{1}$, $A_{Last} = \mathrm{Acc}_{T}$, $A_{Avg} = \mathrm{AVG}$, and $A_{HM}$ is the harmonic mean defined in Equation (20). Bold values indicate the best performance.

Dataset	CUB200
# of Layers	A_Base	A_Last	A_Avg	A_HM
0 Layers	82.367	70.504	75.550	70.33
2 Layers	87.256	83.224	84.203	83.150
3 Layers	87.360	82.879	84.153	82.770
5 Layers	87.221	83.000	84.04	82.880
8 Layers	87.954	83.218	84.169	82.460
12 Layers	88.478	83.034	84.082	82.850

Table 8. Complexity comparison on CUB200. For our model (Ours), Params are reported as “stage 1/stage 2” of the base session.

Methods	Params (M)	Training Time (min)	Inference Time (min)	Memory (MB)
PriViLege	14.33	10.64	7.64	4581
ASP	2.08	15.21	8.10	2692
Ours	1.38/14.5	11.65	7.72	3253

