Dynamic LoRA-Experts and Prototype-Ensemble Matching for Class-Incremental Learning

Zhao, Hongwei; Liu, Rui; Liu, Yansong

doi:10.3390/app16126153


by Hongwei Zhao *, Rui Liu and Yansong Liu

School of Computer Science and Engineering, Beihang University, 37 Xueyuan Road, Beijing 100191, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(12), 6153; https://doi.org/10.3390/app16126153

Submission received: 7 May 2026 / Revised: 13 June 2026 / Accepted: 14 June 2026 / Published: 17 June 2026


Featured Application

This work can be applied to intelligent vision systems that need to incrementally learn new categories while maintaining previously acquired knowledge with low storage overhead.

Abstract

Class-incremental learning (CIL) aims to continuously learn new classes without forgetting previously acquired knowledge. Recent advances in parameter-efficient fine-tuning (PEFT) based on pre-trained models (PTMs) have shown promise in this setting by integrating new tasks with minimal parameter overhead. However, these methods often suffer from knowledge degradation due to: (1) cumulative interference caused by iterative updates, constrained gradient flows, or entangled module integration; and (2) suboptimal alignment between inference samples and specialized modules. To address these challenges, we propose Dynamic LoRA-Experts and Prototype-Ensemble Matching (DLEPEM), a novel two-stage, rehearsal-free framework. In the first stage, we allocate a task-specific LoRA-Expert for each incremental task, enabling isolated representation learning and reducing cross-task interference. In the second stage, we introduce a prototype-ensemble-matching mechanism that combines general prototypes derived from the frozen PTM with task-adaptive prototypes learned by the LoRA-Experts. This fusion facilitates both strong generalization and precise task-level discrimination. Extensive experiments on standard CIL and few-shot class-incremental learning (FSCIL) benchmarks demonstrate that DLEPEM achieves strong performance under the evaluated protocols. For instance, in CIL, it achieves 93.39% on CIFAR100 (+0.80% over EASE), 92.31% on CUB200 (+2.11% over EASE), and 91.84% on VTAB (+1.39% over EASE). In the more challenging FSCIL setting, it achieves 88.77% on CUB200, outperforming the strongest baseline by a clear margin of 5.31%. These results indicate that DLEPEM effectively mitigates catastrophic forgetting while enhancing incremental learning capability.

Keywords:

class-incremental learning; dynamic LoRA-Experts; prototype-ensemble; catastrophic forgetting

1. Introduction

In open-world environments, data typically arrives as a continuous stream of novel categories, a scenario formalized as class-incremental learning (CIL). Traditional machine learning models struggle under such conditions, exhibiting catastrophic forgetting [1,2], where learning from new classes disrupts existing representations and leads to severe performance degradation. Class-incremental learning seeks to navigate this challenge by balancing the acquisition of new knowledge with the preservation of prior learning, a fundamental trade-off known as the stability–plasticity dilemma [3,4].

Recent advances leverage pre-trained models (PTMs), whose robust generalization capabilities stem from large-scale supervised or self-supervised training [5], providing a compelling foundation for CIL. However, directly fine-tuning all PTM parameters across sequential tasks risks compromising this generalization and amplifying forgetting. To mitigate this, contemporary methods use parameter-efficient fine-tuning (PEFT) techniques [6], such as prompts [7,8], adapters [9,10], and LoRA [4,11], which adapt models with minimal additional parameters, thereby reducing forgetting while preserving generalization.

Despite these gains, two critical challenges remain insufficiently addressed:

1. Stability–plasticity limitations from cumulative interference:

Shared prompt pools [7,8] are prone to overwriting earlier knowledge when exposed to shifting data distributions.
LoRA-based strategies [4,11], though effective in constraining updates to mitigate forgetting, inadvertently restrict the plasticity needed for new task adaptation.
Fusion-based methods [4,9] attempt to balance old and new knowledge but often degrade the fidelity of both due to forced trade-offs.

2. Inference-stage module-sample mismatches:

Fixed PTM selection mechanisms [7,12] struggle under substantial domain shifts, resulting in suboptimal activations and degraded predictions.

These observations motivate a key question: Can we simultaneously enhance stability–plasticity dynamics and improve module-sample matches to robustly mitigate catastrophic forgetting in CIL?

To this end, we propose DLEPEM, a novel rehearsal-free framework that rethinks PEFT-based CIL by introducing two synergistic components. First, inspired by Mixture-of-Experts (MoE) architectures [13], we dynamically allocate a dedicated LoRA-Expert for each new incremental task. Unlike shared or sequentially fine-tuned modules, each LoRA-Expert exclusively encodes task-specific knowledge; only the current expert is trainable, while all prior experts are frozen. Strategically embedded within Transformer feed-forward network (FFN) and multi-head self-attention (MHA) layers, these lightweight experts preserve plasticity for new tasks while entirely isolating past parameters, thereby achieving a more favorable balance between stability and plasticity. Different from conventional token-level sparse MoE or LoRA-MoE models that learn a soft router and aggregate multiple expert outputs, DLEPEM uses MoE as a task-level organization principle: the expert bank grows with the task sequence, each old expert remains frozen, and expert activation is determined by prototype retrieval rather than differentiable token routing. Second, to mitigate inference mismatches, we introduce a prototype-ensemble strategy that jointly leverages (1) general representations from the frozen PTM backbone and (2) specialized representations from task-specific experts. By fusing these distinct feature spaces, our approach enhances sample-to-module alignment, effectively bridging generalization and specialization. Thus, DLEPEM explicitly separates within-task prediction (WTP), handled by isolated task LoRA-Experts, from module-identity inference (MII), handled by prototype-ensemble matching.

In summary, our principal contributions are threefold:

We introduce task-level dynamic LoRA-Expert allocation into PEFT-based CIL. Unlike conventional MoE-style LoRA methods with a fixed expert pool and soft expert mixing, DLEPEM automatically adds one dedicated LoRA-Expert for each incremental task, trains only the current expert, and freezes all previous experts to reduce cross-task parameter interference.
We enhance module-sample alignment through prototype-ensemble matching, which fuses frozen-PTM prototypes and router-based domain-specific prototypes. This design improves task-level expert retrieval when PTM-only matching is unreliable under downstream domain shift.
Extensive experiments on six challenging CIL benchmarks validate the effectiveness of our approach, showing that DLEPEM achieves leading performance among the evaluated methods under the evaluated protocols. We further demonstrate architectural flexibility through DLEPEM-MLP, a variant that explores alternative expert integration strategies while retaining competitive results.

2. Related Work

2.1. Class-Incremental Learning

Class-incremental learning requires models to continually recognize newly introduced classes while maintaining discriminability for previously learned classes. Existing methods are commonly grouped into regularization-, rehearsal-, and architecture-based approaches [7].

Regularization-based methods [14] reduce forgetting by penalizing changes to parameters that are estimated to be important for old tasks. This strategy does not store old samples, but its effectiveness can decrease when the incremental stream contains large domain shifts or many sequential tasks [15]. Rehearsal-based methods replay raw images [15] or stored feature representations [16] to preserve old knowledge. They are often effective, but the buffer budget and possible privacy restrictions limit their applicability in rehearsal-free settings. Dynamic network methods [17] expand the model with task-specific components and freeze previous ones to reduce parameter interference, but the growing architecture may introduce non-trivial memory overhead, and some methods still rely on old data for calibration or fusion.

2.2. PEFT-Based CIL

With the strong transferability of PTMs, recent CIL methods increasingly adopt PEFT modules to adapt only a small subset of parameters while keeping most backbone weights frozen. Prompt-based methods, such as L2P [7], DualPrompt [8], S-Prompts [12], and CODA-Prompt [18], learn task-relevant prompts and retrieve or combine them during inference. These methods reduce the need for full fine-tuning, but their retrieval quality depends heavily on how well prompt keys separate tasks in the PTM feature space.

Adapter- and LoRA-based methods modify the PTM through lightweight modules. APER [10] combines PEFT-adapted and PTM features to retain generalization. LAE [9] improves compatibility across PEFT modules, but repeated feature fusion can introduce stability–plasticity trade-offs. InfLoRA [4] constrains LoRA updates through gradient-orthogonal projection to reduce interference, whereas SD-LoRA [11] decouples gradient direction and magnitude to protect early-task directions. These constraint-based strategies improve stability but may restrict plasticity for newly introduced classes. The MoE-Adapters method [19] uses an activate-freeze mechanism with a predefined expert pool, which enables inter-task collaboration but limits flexibility when the number or diversity of future tasks is unknown.

2.3. Mixture-of-Experts and Expert Retrieval

MoE architectures introduce multiple expert modules and a routing mechanism that selects or weights experts for each input [13]. In vision models, V-MoE [20] replaces part of the dense feed-forward layers in ViT with sparse expert layers, where image patches are routed to a subset of MLP experts. In multi-modal or instruction-tuning settings, MoCLE [21] integrates multiple LoRA experts to handle task diversity. These methods show that expert specialization can improve adaptation, but their routing is usually learned within a fixed or predefined expert set and often combines expert outputs through token-level or sample-level gating.

For incremental learning, the key challenge is different: the model must accommodate an open-ended task sequence while avoiding repeated updates to old task-specific parameters. Therefore, expert life cycle and inference-time expert retrieval become central design choices. A fixed expert pool may be insufficient for long or unpredictable task streams, while a learned soft gate can suffer from task-recency bias when trained only on the current task.

2.4. Our Approach

Similar to PEFT-based CIL methods [4,9,10], DLEPEM uses a PTM backbone with LoRA [22] adaptation. However, instead of repeatedly updating or blending shared PEFT modules, DLEPEM allocates one LoRA-Expert for each incremental task, trains only the current expert, and freezes all historical experts. This stage-wise expert life cycle reduces cross-task parameter interference while preserving plasticity for the new task.

DLEPEM also differs from conventional MoE-based or multi-LoRA methods in routing granularity. Existing MoE formulations usually learn an input-dependent gate to select or softly combine experts from a predefined pool. In contrast, DLEPEM performs top-1 task-level expert retrieval through an append-only prototype dictionary. Each ensemble key combines a frozen-PTM prototype with a router-domain prototype, so inference uses both general semantics and task-adaptive cues to select the appropriate expert. Thus, old-task preservation is mainly achieved by structural parameter isolation, while module-identity inference is handled by prototype-ensemble matching rather than a learned soft gate.

3. Preliminaries

Problem formulation. CIL considers a sequential stream of tasks $D = \{D_{1}, \dots, D_{T}\}$, where the $t$-th task $D_{t} = \{(x_{i}, y_{i})\}_{i = 1}^{n_{t}}$ contains $n_{t}$ samples. Here, $x_{i} \in X_{t}$ denotes an input from domain $X_{t}$, and $y_{i} \in Y_{t}$ is its corresponding label. Importantly, the label spaces are mutually exclusive across tasks, i.e., $Y_{t} \cap Y_{t'} = \emptyset$ for $t \neq t'$.

Following the rehearsal-free setting [7,8,18], the model only observes data from the current task during training. The training objective is to learn a model $f_{\Theta}(x) = W_{cls}^{\top} \phi(x)$ that minimizes the empirical risk over the current task’s training set,

$$L(D_{t}) = \frac{1}{|D_{t}|} \sum_{(x_{i}, y_{i}) \in D_{t}} L(f_{\Theta}(x_{i}), y_{i}), \tag{1}$$

where $\phi(x)$ represents the embedded [class] token from the ViT, $W_{cls}$ denotes the classifier weights, $|D_{t}|$ is the number of examples in the current task, and $L(\cdot, \cdot)$ represents the loss function that measures prediction error. After each task $t$, performance is evaluated on all classes seen so far, i.e., on the union $\mathcal{Y}_{t} = Y_{1} \cup \dots \cup Y_{t}$.

Mixture-of-experts. MoE architectures offer an efficient way to increase model capacity by activating only a subset of parameters per input, achieving faster training and inference compared to dense networks of equivalent scale. A typical MoE layer comprises a set of $M$ expert networks $E = \{E_{1}, \dots, E_{M}\}$ and a router $G$ that determines expert activations based on the input [23]. Each expert is commonly implemented as an FFN, while the router is parameterized by a weight matrix $W_{g}$.

Formally, for an input $x$, the router computes

$$G(x) = \mathrm{softmax}(W_{g} x), \tag{2}$$

producing a soft selection over experts. The final MoE output is given by

$$\mathrm{MoE}(x) = \sum_{i = 1}^{M} G(x)_{i} E_{i}(x), \tag{3}$$

where $G(x)_{i}$ represents the routing probability for expert $E_{i}$. In Transformer-based architectures, MoE layers often replace standard FFN blocks to selectively route representations [24].
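To make Equations (2) and (3) concrete, the following minimal sketch computes the softmax gate and the gate-weighted sum of expert outputs. It uses plain Python lists so that it stays self-contained; the toy experts, the gate matrix, and the helper names (`softmax`, `matvec`, `moe_forward`) are illustrative assumptions and are not taken from DLEPEM.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def moe_forward(x, W_g, experts):
    """Dense MoE layer: G(x) = softmax(W_g x), output = sum_i G(x)_i * E_i(x)."""
    gate = softmax(matvec(W_g, x))        # Eq. (2): one probability per expert
    outputs = [E(x) for E in experts]     # every expert maps x to a vector
    mixed = [sum(g * out[k] for g, out in zip(gate, outputs))
             for k in range(len(outputs[0]))]   # Eq. (3)
    return gate, mixed

# Toy setup (assumed): two experts acting on 2-D inputs.
experts = [lambda x: [2.0 * v for v in x],   # E_1 doubles the input
           lambda x: [-v for v in x]]        # E_2 negates the input
W_g = [[1.0, 0.0],
       [0.0, 1.0]]
gate, y = moe_forward([1.0, 1.0], W_g, experts)
```

In this toy case both gate logits are equal, so each expert receives half of the weight. This soft mixing over all experts is the behavior that DLEPEM later replaces with top-1 task-level retrieval.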

Low-rank adaptation. LoRA was introduced to efficiently fine-tune large pre-trained models by injecting low-rank updates into weight matrices [22]. Given a pre-trained weight matrix $W \in \mathbb{R}^{d_{in} \times d_{out}}$, LoRA learns an additive low-rank decomposition

$$W + \Delta W = W + U V, \tag{4}$$

where $U \in \mathbb{R}^{d_{in} \times r}$, $V \in \mathbb{R}^{r \times d_{out}}$, and the rank $r \ll \min(d_{in}, d_{out})$. This reduces the number of trainable parameters while maintaining expressiveness. LoRA enables cost-effective, scalable fine-tuning, making it suitable for continual learning, where efficiency and avoidance of forgetting are critical.
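As a rough illustration of Equation (4), the sketch below forms $W + UV$ for a tiny matrix and counts the trainable parameters of the low-rank factors. The helper names (`matmul`, `lora_weight`, `lora_param_count`) and the toy values are assumptions made for this example; the 768-dimensional projection with rank 10 mirrors the default setting quoted later in the paper.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_weight(W, U, V):
    """Return W + U V (Eq. (4)); W itself is left unchanged."""
    delta = matmul(U, V)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

def lora_param_count(d_in, d_out, r):
    """Trainable parameters in U (d_in x r) plus V (r x d_out)."""
    return r * (d_in + d_out)

# Toy example (assumed sizes): d_in = 3, d_out = 2, r = 1.
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
U = [[1.0], [2.0], [3.0]]
V = [[0.5, -0.5]]
W_adapted = lora_weight(W, U, V)

# Parameter budget for a 768 x 768 projection adapted with rank 10.
full = 768 * 768
low_rank = lora_param_count(768, 768, 10)
```

Here `low_rank` is 15,360 against 589,824 entries in the full matrix, i.e., about 2.6% of the dense projection.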

4. The Proposed Method

As demonstrated by HiDe-Prompt [25], CIL methods employing multi-module selection can be decomposed into two probabilistic components: module-identity inference (MII) and within-task prediction (WTP), represented by $P(x \in X_{i} \mid D, \Theta)$ and $P(x \in X_{i,j} \mid x \in X_{i}, D, \Theta)$, respectively. By Bayes’ theorem, we have

$$P(x \in X_{i,j} \mid D, \Theta) = P(x \in X_{i,j} \mid x \in X_{i}, D, \Theta) \, P(x \in X_{i} \mid D, \Theta). \tag{5}$$

Letting $\hat{i}$ and $\hat{j}$ denote the ground-truth task index and class label for input $x$, Equation (5) implies that improving either WTP accuracy, $P(x \in X_{\hat{i},\hat{j}} \mid x \in X_{\hat{i}}, D, \Theta)$, or MII accuracy, $P(x \in X_{\hat{i}} \mid D, \Theta)$, directly enhances overall prediction performance. However, existing approaches suffer from two limitations: (1) iterative updates or the fusion of new and existing modules progressively deteriorate WTP [4,7,8,9]; and (2) MII performance, when relying solely on pre-trained features [9,12], is inherently constrained by the similarity between pre-training and downstream data distributions. To address these challenges, we propose DLEPEM, which explicitly enhances both WTP and MII via two complementary innovations:

Dynamic LoRA-Expert

To exploit the strong generalization of the pre-trained model, we keep its weights $W$ fixed throughout training. To maintain plasticity and safeguard WTP, we dynamically allocate a dedicated LoRA-Expert for each incremental task, embedded within an MoE framework. Each new LoRA-Expert is trained exclusively on its respective task while previously introduced experts remain frozen. This ensures isolated task-specific adaptation with a small number of trainable parameters, leveraging LoRA’s efficiency to effectively capture discriminative features. Figure 1a depicts expert training, while Figure 1b details the internal structure. This expert life cycle is the main difference from conventional MoE-based LoRA continual learning. DLEPEM does not repeatedly update a shared expert pool or combine all experts through a soft gate during feature computation. Instead, for task $t$, only $E_{t}$ receives gradients, whereas $\{E_{1}, \dots, E_{t-1}\}$ remain frozen. This design directly reduces cross-task parameter interference and protects WTP for previously learned tasks.

Prototype-Ensemble Matching

After training each incremental task, we extract prototypes using two sources: (1) the fixed PTM for generalizable features, and (2) the router-enhanced LoRA-Expert for domain-specific nuances. Unlike prior methods that rely solely on pre-trained representations, DLEPEM combines these prototypes and associates them with their corresponding LoRA-Experts. During inference, a nearest-neighbor search in this combined prototype space determines the most appropriate expert, substantially improving MII. This design mitigates privacy concerns inherent to rehearsal-based strategies. Figure 1c illustrates our complete prototype-ensemble matching mechanism. The ensemble key for a class is constructed as the concatenation of a frozen-PTM prototype and a router-domain prototype. The former provides stable general semantics, while the latter captures downstream task-specific cues. This complements dynamic expert allocation: the task expert improves WTP once selected, and the prototype-ensemble dictionary improves MII by selecting the appropriate expert without storing raw rehearsal samples.

4.1. Dynamic LoRA-Experts

We integrate the proposed LoRA-Expert modules into the Vision Transformer (ViT) architecture [26]. In ViT, an input image is first divided into fixed-size patches, linearly projected, and augmented with positional embeddings before being processed by a Transformer encoder comprising MHA layers and multilayer perceptrons (MLPs).

To adaptively capture task-specific features in incremental learning, we dynamically introduce a LoRA-Expert at each incremental stage. This module can be inserted either as a parallel branch to the MLP (MLP-Expert, see Figure 1b) or into the attention projections of the MHA module. When LoRA is applied to $W_{q}$ and $W_{v}$, we refer to this variant as the QV-Expert configuration.

The modified forward computations for these components are given by

$$h' = e + \mathrm{MLP}(e) + E_{t}^{\mathrm{MLP}}(e), \tag{6}$$

$$h' = \mathrm{Attn}(h_{Q} + E_{t}^{Q}(e),\; h_{K} + E_{t}^{K}(e),\; h_{V} + E_{t}^{V}(e)), \tag{7}$$

where $e$ and $h$ are the inputs and outputs of the original module, respectively. Here, $E_{t}$ denotes the task-$t$ LoRA-Expert module rather than a complete, independent ViT. It is a group of LoRA adapters inserted into the selected branch or projection, and each historical expert $E_{s}$, $s < t$, is frozen after its task has been learned. The attention operation is defined as

$$\mathrm{Attn}(Q, K, V_{attn}) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V_{attn}. \tag{8}$$

Here, $V_{attn}$ denotes the attention value matrix and is distinct from the LoRA matrix $V_{t}$. For $N$ tokens, the softmax term is an attention-weight matrix $A \in \mathbb{R}^{N \times N}$, so $A V_{attn}$ is the standard weighted sum over value vectors. The multi-head extension is omitted for clarity. Each LoRA-Expert shares the same architecture but learns distinct parameters. Under the row-vector convention, for an input $e \in \mathbb{R}^{1 \times d_{in}}$, one adapter consists of low-rank matrices $U_{t} \in \mathbb{R}^{d_{in} \times r}$ and $V_{t} \in \mathbb{R}^{r \times d_{out}}$, producing

$$E_{t}(e) = e U_{t} V_{t}. \tag{9}$$

In the default ViT-B/16 QV-Expert setting, $d_{in} = d_{out} = 768$ and $r = 10$; for the MLP-Expert variant, $d_{in}$ and $d_{out}$ correspond to the inserted MLP branch dimensions.
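The QV-Expert forward pass described by Equations (7) to (9) can be sketched for a single attention head as follows. This is a simplified illustration under stated assumptions: only the query and value projections receive a LoRA term (the QV configuration), multi-head splitting is omitted, the weights are tiny toy matrices, and the names `qv_expert_attention`, `lora`, and the `expert` dictionary layout are hypothetical.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Row-wise numerically stable softmax."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out

def lora(e, U, V):
    """Eq. (9) for a stack of row-vector tokens: E_t(e) = e U V."""
    return matmul(matmul(e, U), V)

def qv_expert_attention(e, W_q, W_k, W_v, expert):
    """Single-head attention with LoRA added to the query and value projections."""
    Q = add(matmul(e, W_q), lora(e, *expert["q"]))
    K = matmul(e, W_k)                        # key projection left unadapted
    V_attn = add(matmul(e, W_v), lora(e, *expert["v"]))
    d = len(Q[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(Q, transpose(K))]
    A = softmax_rows(scores)                  # N x N attention weights
    return matmul(A, V_attn)                  # Eq. (8)

# Toy example (assumed): N = 2 tokens, d = 2, rank r = 1, identity projections.
I2 = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
zero_expert = {"q": ([[0.3], [0.1]], [[0.0, 0.0]]),   # (U, V) with V = 0
               "v": ([[0.2], [0.4]], [[0.0, 0.0]])}
out = qv_expert_attention(tokens, I2, I2, I2, zero_expert)
```

With the second LoRA factor set to zero, the expert contributes nothing and the output equals plain attention, which is the starting point that zero initialization of $V$ is meant to provide.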

In the ViT setting, we denote by $\phi(x; E_{t})$ the output embeddings produced by the PTM equipped with the task-specific expert $E_{t}$. For the first incremental task, we follow the standard LoRA initialization [22], setting $V$ to zero and initializing $U$ via Kaiming initialization [27]. For subsequent tasks, we initialize each new LoRA-Expert by copying the weights from the preceding expert, then fine-tune it while keeping all previous experts frozen. As illustrated in Figure 1a, this strategy ensures that each task-specific expert adapts independently, preserving knowledge from prior stages. This design provides a direct stability–plasticity mechanism. Let $\theta_{t} = \{U_{t}, V_{t}\}$ denote the LoRA parameters of the expert allocated to task $t$, while the PTM weights $W$ remain frozen. After task $s$ has been learned, its old expert parameters $\theta_{s}$ are not optimized when learning any later task $t > s$; hence,

$$\nabla_{\theta_{s}} L(D_{t}) = 0, \quad s < t. \tag{10}$$

Therefore, old experts remain parameter-stationary during subsequent training, which reduces forgetting caused by repeated modification of shared PEFT parameters. Meanwhile, the current expert $\theta_{t}$ is optimized with the cross-entropy objective in Equation (18) without imposing gradient-orthogonality or update-magnitude shrinking constraints, preserving plasticity for newly introduced classes. The remaining source of old-task degradation is mainly expert retrieval error, which is addressed by the prototype-ensemble-matching mechanism.
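The expert life cycle described above (warm start from the preceding expert, train only the newest expert, keep earlier ones untouched) can be sketched with a small grow-only container. The class name `ExpertBank`, the fixed toy initializer, and the hand-written gradient step are assumptions for illustration; in particular, the toy initializer stands in for Kaiming initialization, and skipping frozen experts emulates the zero-gradient condition of Equation (10).

```python
import copy

class ExpertBank:
    """Grow-only bank of LoRA experts; only the newest expert is trainable."""

    def __init__(self):
        self.experts = []      # list of {"U": matrix, "V": matrix}
        self.trainable = []    # parallel list of booleans

    def add_expert(self, init_fn):
        """Freeze all existing experts, then append a new trainable one."""
        self.trainable = [False] * len(self.experts)
        if self.experts:
            new = copy.deepcopy(self.experts[-1])   # warm start from E_{t-1}
        else:
            new = init_fn()                         # first task: fresh init
        self.experts.append(new)
        self.trainable.append(True)
        return len(self.experts) - 1

    def sgd_step(self, grads, lr):
        """Apply gradients only to trainable experts; frozen ones are skipped."""
        for expert, grad, flag in zip(self.experts, grads, self.trainable):
            if not flag:
                continue
            for name in expert:
                expert[name] = [[w - lr * g for w, g in zip(wr, gr)]
                                for wr, gr in zip(expert[name], grad[name])]

def init_fn():
    # Assumed toy init: small fixed U, zero V (mirrors V = 0 at the first task).
    return {"U": [[0.1], [0.2]], "V": [[0.0, 0.0]]}

bank = ExpertBank()
bank.add_expert(init_fn)                         # task 1
ones = {"U": [[1.0], [1.0]], "V": [[1.0, 1.0]]}  # placeholder gradients
bank.sgd_step([ones], lr=0.5)                    # train E_1
snapshot = copy.deepcopy(bank.experts[0])
bank.add_expert(init_fn)                         # task 2: copy E_1, freeze E_1
bank.sgd_step([ones, ones], lr=0.5)              # only E_2 moves
```

After the second step, the first expert still equals `snapshot`, while the second expert has moved away from its copied starting point.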

4.2. Prototype-Ensemble Matching Mechanism

When the distribution gap between the pre-trained model (PTM) and the downstream dataset is small, generalized features extracted by the frozen PTM can enable effective module-sample matching. However, since the relationship between pre-training and downstream distributions is generally unknown, it is essential to adaptively capture domain-specific features from the downstream task.

To tackle this, we introduce a dynamically updated router $E^{router}$ that extracts domain-specific features. The router shares the same architecture as the LoRA-Experts but is continuously updated across incremental tasks.

Under typical incremental constraints, where only data from the current task is available, directly fine-tuning the router leads to overfitting, causing it to route all samples to the most recent LoRA-Expert. To mitigate this, we propose dual feature distillation mechanisms that jointly enforce stability (retaining prior knowledge) and plasticity (adapting to new tasks), regularizing router updates to ensure robust generalization.

1. Plasticity feature distillation. To encourage the router to learn category-specific features of the current task, we first compute the LoRA-Expert prototype for each class $i \in Y_{t}$:

$$P_{i,t}^{L} = \frac{1}{N_{i,t}} \sum_{(x_{j}, y_{j}) \in D_{t}} \mathbb{I}(y_{j} = i)\, \phi(x_{j}; E_{t}), \qquad N_{i,t} = \sum_{(x_{j}, y_{j}) \in D_{t}} \mathbb{I}(y_{j} = i). \tag{11}$$

Here, $N_{i,t}$ is the number of current-task samples belonging to class $i$, and $\mathbb{I}(\cdot)$ denotes the indicator function. We then align the router feature of each current-task sample with the LoRA-Expert prototype of its ground-truth class using a KL-divergence objective:

$$L_{PFD} = \frac{1}{|D_{t}|} \sum_{(x_{j}, y_{j}) \in D_{t}} \mathrm{KL}\left(\sigma(P_{y_{j},t}^{L} / \tau) \,\|\, \sigma(\phi(x_{j}; E_{t}^{router}) / \tau)\right). \tag{12}$$

Here, $\sigma(z) = \mathrm{softmax}(z)$, and temperature scaling is written explicitly as $\sigma(z / \tau) = \mathrm{softmax}(z / \tau)$, where $\tau$ is the temperature parameter. Thus, $\sigma(\cdot)$ in the distillation losses denotes the same softmax function as $\mathrm{softmax}(\cdot)$ in the attention equation. These class-specific prototypes guide the router toward discriminative features of the current task, ensuring effective adaptation for new tasks.

2. Stability feature distillation. To mitigate the degradation of previously learned knowledge, we introduce a stability constraint that encourages the current router to mimic the feature distribution produced by the previous router on the current-task inputs:

$$L_{SFD} = \frac{1}{|D_{t}|} \sum_{(x_{j}, y_{j}) \in D_{t}} \mathrm{KL}\left(\sigma(\phi(x_{j}; E_{t-1}^{router}) / \tau) \,\|\, \sigma(\phi(x_{j}; E_{t}^{router}) / \tau)\right). \tag{13}$$

The overall router loss combines these objectives,

$$L_{router} = \begin{cases} L_{PFD}, & t = 1, \\ \alpha L_{PFD} + (1 - \alpha) L_{SFD}, & t > 1, \end{cases} \tag{14}$$

where $\alpha$ weights the plasticity and stability distillation terms for the router. For the first task, $L_{SFD}$ is omitted because no previously trained router exists. The router parameters $E_{t}^{router}$ are optimized by minimizing this combined loss function. With the default $\alpha = 0.04$, the router update places more weight on $L_{SFD}$; however, this coefficient controls router regularization only and should not be interpreted as proof of an optimal global stability–plasticity balance. The main source of new-task plasticity remains the trainable current LoRA-Expert, while frozen old experts provide parameter-level stability.
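A minimal sketch of the router objective in Equations (12) to (14) is given below. It evaluates the two KL terms on precomputed feature vectors rather than on a real network, so it only illustrates how the terms are weighted; the function names, the argument layout, and the temperature value `tau=1.0` are assumptions, since the default temperature is not stated in this section.

```python
import math

def softmax(z, tau=1.0):
    """Temperature-scaled, numerically stable softmax."""
    m = max(v / tau for v in z)
    exps = [math.exp(v / tau - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def router_loss(router_feats, expert_protos, prev_router_feats, alpha, tau, t):
    """Eq. (14): L_PFD for t = 1, alpha * L_PFD + (1 - alpha) * L_SFD otherwise.

    router_feats[j]      : feature of sample j from the current router
    expert_protos[j]     : LoRA-Expert prototype of sample j's ground-truth class
    prev_router_feats[j] : feature of sample j from the frozen previous router
    """
    n = len(router_feats)
    l_pfd = sum(kl(softmax(p, tau), softmax(f, tau))
                for p, f in zip(expert_protos, router_feats)) / n
    if t == 1:
        return l_pfd          # no previous router exists yet
    l_sfd = sum(kl(softmax(g, tau), softmax(f, tau))
                for g, f in zip(prev_router_feats, router_feats)) / n
    return alpha * l_pfd + (1.0 - alpha) * l_sfd

# Toy call (assumed values): two samples with 2-D features, second task.
loss = router_loss(
    router_feats=[[0.2, 0.8], [0.9, 0.1]],
    expert_protos=[[0.0, 1.0], [1.0, 0.0]],
    prev_router_feats=[[0.1, 0.9], [0.8, 0.2]],
    alpha=0.04, tau=1.0, t=2)
```

With $\alpha = 0.04$, the stability term carries a weight of 0.96, so a router that matches the prototypes perfectly but drifts from its predecessor is still penalized almost in full.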

At the end of each incremental stage, we compute general prototypes $P_{i}^{F}$ using the frozen PTM $\phi(x)$ and domain-specific prototypes $P_{i}^{R}$ using the current router $E_{t}^{router}$ for each newly introduced class $i \in Y_{t}$, as defined in Equation (11). The prototype dictionary is updated in an append-only manner: old entries for previously seen classes $i \in \mathcal{Y}_{t-1}$ are retained as historical keys and are not recomputed with later-task data. The new ensemble keys created at stage $t$ are

$$K_{t}^{new} = \{K_{i} = [P_{i}^{F}; P_{i}^{R}] \mid i \in Y_{t}\}. \tag{15}$$

Equivalently, we define the class-$i$ ensemble prototype vector as $P_{i} = \mathrm{concat}(P_{i}^{F}, P_{i}^{R}) = [P_{i}^{F}; P_{i}^{R}]$, and use it as the dictionary key $K_{i} = P_{i}$. The semicolon in $[\cdot; \cdot]$ denotes vector concatenation, not matrix addition or summation. Each new key $K_{i}$ is appended to Dict, together with the expert index $\nu_{i} = t$, which points to the corresponding expert $E_{t}$. Thus, after task $t$, the dictionary covers all seen classes $\mathcal{Y}_{t}$, while only entries for $Y_{t}$ are newly created. This keeps DLEPEM rehearsal-free: the persistent state consists of frozen LoRA-Experts, the current router, and compact ensemble keys, but no raw samples, old mini-batches, feature buffers, or exemplar sets.

Because the router is updated across tasks, router-domain keys from older stages may have been generated by earlier router states. DLEPEM mitigates this router-state drift in two ways. First, each key includes the frozen-PTM component $P_{i}^{F}$, which remains comparable across stages because the PTM is fixed. Second, stability feature distillation in Equation (13) encourages $E_{t}^{router}$ to preserve the behavior of $E_{t-1}^{router}$ while adapting to the current task.
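The append-only dictionary update can be sketched as follows, assuming the frozen-PTM and router features of the current task have already been extracted. The function names, the dictionary layout (`"key"`, `"expert"`), and the toy feature values are hypothetical; only the class-mean prototypes, the concatenated key of Equation (15), and the stored expert index follow the description above.

```python
def class_mean(features, labels, cls):
    """Mean feature of all samples whose label equals cls (class-mean prototype)."""
    rows = [f for f, y in zip(features, labels) if y == cls]
    return [sum(col) / len(rows) for col in zip(*rows)]

def append_task_keys(dictionary, t, ptm_feats, router_feats, labels):
    """Append one ensemble key per new class; existing entries are never touched."""
    for cls in sorted(set(labels)):
        if cls in dictionary:
            raise ValueError(f"class {cls} already has a key")
        p_f = class_mean(ptm_feats, labels, cls)      # general prototype P_i^F
        p_r = class_mean(router_feats, labels, cls)   # router prototype P_i^R
        # List concatenation implements [P^F; P^R]; the expert index is nu_i = t.
        dictionary[cls] = {"key": p_f + p_r, "expert": t}
    return dictionary

# Toy stream (assumed): task 1 introduces classes 0 and 1, task 2 introduces class 2.
proto_dict = {}
append_task_keys(proto_dict, 1,
                 ptm_feats=[[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]],
                 router_feats=[[0.0, 1.0], [0.0, 3.0], [4.0, 0.0]],
                 labels=[0, 0, 1])
append_task_keys(proto_dict, 2,
                 ptm_feats=[[5.0, 5.0]],
                 router_feats=[[1.0, 1.0]],
                 labels=[2])
```

Only the keys and expert indices persist between tasks in this sketch; the per-sample features are discarded once the class means are computed, which is what keeps the scheme free of stored exemplars.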

Prototype-Ensemble Matching

Our mechanism dynamically selects the most suitable LoRA-Expert for each input by leveraging a key-value association strategy inspired by L2P [7]. Specifically, we maintain a dictionary where each seen class $i \in \mathcal{Y}_{t}$ has one ensemble key $K_{i}$ and an associated expert index $\nu_{i}$:

$$\mathrm{Dict}_{t} = \{(K_{i}, \nu_{i}) \mid i \in \mathcal{Y}_{t},\ \nu_{i} \in \{1, \dots, t\}\}. \tag{16}$$

Here, $\nu_{i}$ identifies the task-specific LoRA-Expert associated with class $i$.

During inference, given an input $x$, we construct an ensemble query $q(x) = \mathrm{concat}(P^{F}, P^{R}) = [P^{F}; P^{R}]$, where $P^{F}$ is the feature from the frozen PTM and $P^{R}$ is the domain-specific feature from the router. We then identify the nearest key by cosine similarity:

$$\hat{i} = \underset{i \in \mathcal{Y}_{t}}{\arg\max}\ \cos(q(x), K_{i}). \tag{17}$$

The retrieved key $K_{\hat{i}}$ returns an expert identity rather than a direct class prediction. Let $\hat{t} = \nu_{\hat{i}}$ denote the task index associated with $K_{\hat{i}}$ in $\mathrm{Dict}_{t}$; DLEPEM then selects $E_{\hat{t}}$ for within-task prediction. Final classification is restricted to the class set $Y_{\hat{t}}$ associated with the selected expert and uses the corresponding prototype weights. Thus, all stored ensemble keys participate in module-identity inference, while prototype weights from unrelated experts are not mixed during classification.
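The retrieval step of Equation (17) amounts to a cosine nearest-neighbor search over the stored keys, followed by a lookup of the expert index. The sketch below assumes non-zero feature vectors and a dictionary that maps each class to a key and an expert index; the names `retrieve_expert` and `keys`, and the toy vectors, are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_expert(ptm_feat, router_feat, dictionary):
    """Eq. (17): nearest ensemble key by cosine similarity, then its expert index."""
    query = ptm_feat + router_feat                # q(x) = [P^F; P^R]
    best_cls = max(dictionary,
                   key=lambda c: cosine(query, dictionary[c]["key"]))
    return best_cls, dictionary[best_cls]["expert"]

# Toy dictionary (assumed): class 0 belongs to expert 1, class 5 to expert 2.
keys = {
    0: {"key": [1.0, 0.0, 1.0, 0.0], "expert": 1},
    5: {"key": [0.0, 1.0, 0.0, 1.0], "expert": 2},
}
cls, expert = retrieve_expert([0.1, 0.9], [0.2, 0.8], keys)
```

The returned class is used only to look up the expert index; the final label comes from the selected expert's own prototypes.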

4.3. Optimization Objective and Training Procedure

DLEPEM employs a two-stage training paradigm: it first learns Dynamic LoRA-Expert modules and then optimizes the router for effective module-sample matching. Algorithm 1 provides the complete training procedure, and Algorithm 2 gives the corresponding task-agnostic inference procedure.

1. Dynamic LoRA-Expert learning: The objective function for training the LoRA-Expert is

$$\min_{W_{cls}, E_{i}} L_{CE}(W_{cls}^{\top} \phi(x; E_{i}), y), \tag{18}$$

where $L_{CE}$ denotes the cross-entropy loss, and $\phi(x; E_{i})$ represents the PTM equipped with the LoRA-Expert $E_{i}$. During inference, DLEPEM uses LoRA-Expert-derived prototype weights for classification. Specifically, $P^{L}$ denotes the prototype–weight matrix formed by the class prototypes $P_{i,t}^{L}$ defined in Equation (11). Classification is performed using cosine similarity,

$$f(x \mid E_{i}) = \left(\frac{P^{L}}{\|P^{L}\|_{2}}\right)^{\top} \left(\frac{\phi(x; E_{i})}{\|\phi(x; E_{i})\|_{2}}\right), \tag{19}$$

where $E_{i}$ denotes the selected LoRA-Expert, and $P^{L}$ contains the prototype weights associated with this expert.
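The cosine classifier of Equation (19), restricted to the classes owned by the selected expert, can be sketched as below. The function name `classify_within_task`, the prototype dictionary, and the toy vectors are assumptions; the example deliberately includes a prototype from another expert to show that it is excluded from the candidate set.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def classify_within_task(z, prototypes, task_classes):
    """Eq. (19) evaluated only over the selected expert's classes.

    z            : feature phi(x; E_t_hat) from the selected expert
    prototypes   : dict mapping class -> LoRA-Expert prototype P_c^L
    task_classes : classes owned by the selected expert
    """
    z_hat = l2_normalize(z)
    scores = {c: sum(a * b for a, b in zip(l2_normalize(prototypes[c]), z_hat))
              for c in task_classes}
    return max(scores, key=scores.get), scores

# Toy prototypes (assumed): classes 0 and 1 belong to the selected expert,
# class 2 belongs to a different expert and must not be considered.
prototypes = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
pred, scores = classify_within_task([1.0, 1.2], prototypes, task_classes=[0, 1])
```

In this toy case the class-2 prototype would have the highest cosine similarity to the feature, yet it is never scored because it is outside the selected expert's class set.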

Algorithm 1 Training Procedure for DLEPEM

1: Input: Pre-trained model $\phi(\cdot)$, incremental datasets $\{D_{1}, \dots, D_{T}\}$, LoRA rank $r$, distillation coefficient $\alpha$.
2: Output: Trained LoRA-Experts $\{E_{1}, \dots, E_{T}\}$, Router $E_{T}^{router}$, Prototype Dictionary Dict.
3: Initialize: $\mathrm{Dict} \leftarrow \emptyset$, $E_{0}^{router}$ with random weights.
4: for $t = 1$ to $T$ do
5:   # Stage 1: Dynamic LoRA-Expert Learning
6:   Initialize LoRA-Expert $E_{t}$. If $t > 1$, copy weights from $E_{t-1}$.
7:   Train $E_{t}$ and classifier $W_{cls}$ on $D_{t}$ using $L_{CE}$ (Equation (18)).
8:   Freeze parameters of $E_{t}$.
9:   # Stage 2: Router Learning and Prototype-Ensemble Building
10:   If $t > 1$, freeze a copy of the previous router $E_{t-1}^{router}$.
11:   Train router $E_{t}^{router}$ on $D_{t}$ using $L_{router}$ (Equation (14)).
12:   # Append Prototype Dictionary with newly introduced classes
13:   Keep old dictionary entries for previously seen classes in $\mathcal{Y}_{t-1}$ unchanged.
14:   for each newly introduced class $i \in Y_{t}$ do
15:     Compute general prototype $P_{i}^{F}$ using frozen PTM $\phi(\cdot)$ and samples from $D_{t}$.
16:     Compute domain-specific prototype $P_{i}^{R}$ using router $E_{t}^{router}$ and samples from $D_{t}$.
17:     Create ensemble prototype $P_{i} = [P_{i}^{F}; P_{i}^{R}]$ (Equation (15)).
18:     Append an entry to Dict by associating key $K_{i} = P_{i}$ with expert index $\nu_{i} = t$ (Equation (16)).
19:   end for
20: end for
21: return $\{E_{1}, \dots, E_{T}\}$, $E_{T}^{router}$, Dict.

Algorithm 2 Inference Procedure for DLEPEM

1: Input: Test sample $x$, frozen PTM $ϕ (\cdot)$, frozen LoRA-Experts $\{E_{1}, \dots, E_{T}\}$, current router $E_{T}^{\mathrm{router}}$, prototype dictionary Dict, and prototype weights $\{P_{c}^{L}\}_{c \in Y_{T}}$.
2: Output: Predicted label $\hat{y}$.
3: Compute frozen-PTM feature $P^{F} (x) = ϕ (x)$.
4: Compute router-domain feature $P^{R} (x) = ϕ (x; E_{T}^{\mathrm{router}})$.
5: Build the ensemble query $q (x) = [P^{F} (x), P^{R} (x)]$.
6: Retrieve the nearest dictionary key $\hat{i} = \arg\max_{i \in Y_{T}} \cos (q (x), K_{i})$.
7: Obtain the expert index $\hat{t} = ν_{\hat{i}}$ associated with $K_{\hat{i}}$ in Dict.
8: Extract the selected-expert feature $z = ϕ (x; E_{\hat{t}})$.
9: Restrict candidate labels to $Y_{\hat{t}}$ and predict $\hat{y} = \arg\max_{c \in Y_{\hat{t}}} \cos (z, P_{c}^{L})$.
10: return $\hat{y}$.
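The two-step inference in Algorithm 2 (expert retrieval, then within-expert classification) can be illustrated with a small pure-Python sketch. It is not the released code: feature extractors are modeled as callables, the experts are stored in a dictionary keyed by task index, and the names `cosine`, `predict`, `toy_dict`, and `toy_weights` as well as the two-dimensional toy values are assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict(
    x: Vector,
    frozen_ptm: Callable[[Vector], Vector],
    router: Callable[[Vector], Vector],
    experts: Dict[int, Callable[[Vector], Vector]],
    proto_dict: Dict[int, Tuple[Vector, int]],
    class_weights: Dict[int, Vector],
) -> int:
    # Steps 3-5: ensemble query from frozen-PTM and router features.
    query = frozen_ptm(x) + router(x)
    # Step 6: nearest dictionary key under cosine similarity.
    nearest = max(proto_dict, key=lambda c: cosine(query, proto_dict[c][0]))
    # Step 7: expert index stored with that key.
    expert_index = proto_dict[nearest][1]
    # Step 8: feature from the selected expert only.
    z = experts[expert_index](x)
    # Step 9: restrict candidates to the classes owned by the selected expert.
    candidates = [c for c, (_, t) in proto_dict.items() if t == expert_index]
    return max(candidates, key=lambda c: cosine(z, class_weights[c]))


def identity(x: Vector) -> Vector:
    return list(x)


# Toy setup: classes 0-1 belong to expert 1, classes 2-3 to expert 2.
toy_dict = {
    0: ([1.0, 0.0, 1.0, 0.0], 1),
    1: ([0.8, 0.2, 0.8, 0.2], 1),
    2: ([0.0, 1.0, 0.0, 1.0], 2),
    3: ([0.2, 0.8, 0.2, 0.8], 2),
}
toy_weights = {0: [1.0, 0.0], 1: [0.8, 0.2], 2: [0.0, 1.0], 3: [0.2, 0.8]}
toy_experts = {1: identity, 2: identity}

for sample in ([1.0, 0.0], [0.1, 0.9], [0.3, 0.7]):
    print(sample, predict(sample, identity, identity, toy_experts, toy_dict, toy_weights))
```

Note that a retrieval error in step 6 cannot be corrected later, since step 9 only considers the classes of the selected expert; this is why the router ablation reports expert-selection accuracy separately from oracle-expert accuracy.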

2. Router learning: The router is optimized with the objective function shown in Equation (14).

As illustrated in Figure 1, we divide the class-incremental learning process into two stages: First, a new LoRA-Expert is learned for each incremental task to capture task-specific features; each expert is categorized as either an MLP-Expert or a QV-Expert depending on its insertion point. Second, we introduce a prototype-ensemble-matching mechanism that captures both general and domain-specific features, thereby improving module-sample matching. Notably, DLEPEM’s components are orthogonal to many existing approaches and can be integrated with them straightforwardly. For clarity, Table 1 summarizes the main symbols used throughout the paper.

5. Experiments

5.1. Experimental Settings

Datasets: We evaluated DLEPEM in two settings: CIL and few-shot class-incremental learning (FSCIL) [28]. For CIL, we followed standard protocols [10,11] and tested on five benchmarks: VTAB [29], CIFAR100 [30], CUB200 [31], ImageNet-R [32], and OmniBenchmark [33]. VTAB comprises 50 classes, CIFAR100 contains 100 classes, CUB200 and ImageNet-R each contain 200 classes, and OmniBenchmark, the largest benchmark among them, includes 300 classes. As shown in Table 2, we followed common practice [4,9,11], splitting CIFAR100 into 10 tasks, CUB200 into 10 tasks, ImageNet-R into 5 tasks, OmniBenchmark into 10 tasks, and VTAB into 5 tasks. Code is available at: https://github.com/hongwei-zhao/Appl_Sci-DLEPEM-main (accessed on 13 June 2026).

For FSCIL, we adopted the settings used in prior work [34,35] on CUB200, CIFAR100, and miniImageNet [36]. As shown in Table 3, we used 100 classes in CUB200 as the base class set for the first task. The remaining 100 classes were partitioned into 10 incremental tasks, with each incremental task containing 10 new classes and the few-shot training set containing 5 examples per class (10-way 5-shot incremental task). CIFAR100 and miniImageNet were divided into 60 classes for the base task, and the remaining 40 classes were divided into eight 5-way 5-shot incremental tasks.
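The FSCIL session layout described above can be written down as a small helper. This is only a sketch of the class bookkeeping: it assumes contiguous class indices, whereas the actual protocols of prior work fix a specific class order, and the function name `fscil_sessions` is hypothetical.

```python
from typing import List


def fscil_sessions(num_classes: int, num_base: int, way: int) -> List[List[int]]:
    """Return class indices per session: one base session, then N-way sessions."""
    if (num_classes - num_base) % way != 0:
        raise ValueError("remaining classes must divide evenly into N-way sessions")
    sessions = [list(range(num_base))]
    for start in range(num_base, num_classes, way):
        sessions.append(list(range(start, start + way)))
    return sessions


SHOTS_PER_CLASS = 5  # 5-shot incremental sessions in all three benchmarks

cub200 = fscil_sessions(num_classes=200, num_base=100, way=10)   # 10-way 5-shot
cifar100 = fscil_sessions(num_classes=100, num_base=60, way=5)   # 5-way 5-shot

for name, sessions in (("CUB200", cub200), ("CIFAR100", cifar100)):
    incremental = sessions[1:]
    images = len(incremental[0]) * SHOTS_PER_CLASS
    print(name, "base classes:", len(sessions[0]),
          "incremental sessions:", len(incremental),
          "training images per incremental session:", images)
```

The same call with 60 base classes and 5-way sessions also describes the miniImageNet split, since it uses the same layout as CIFAR100.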

Comparison methods: For CIL, we compared DLEPEM against several representative and recent methods, including SimpleCIL [10], prompt-based methods (L2P [7], DualPrompt [8], CODA-Prompt [18]), LoRA-based methods (LAE [9], APER [10], InfLoRA [4], SD-LoRA [11], BiLoRA [37]), and recent PTM-based CIL methods such as EASE [38]. We also include standard full fine-tuning as a baseline, where the model is sequentially fine-tuned without any continual learning mechanism. We reproduced all baseline results in our benchmark tables under the corresponding CIL/FSCIL protocols. For fairness, all methods use the same pre-trained backbone and identical data splits; for LoRA/adapter-based methods, we set the rank to $r = 10$ where applicable. Other method-specific training settings, including optimizer, training epochs, batch size, and augmentation policy, follow their original or recommended implementations.

For FSCIL, we additionally benchmarked against three recent ViT-based methods tailored for few-shot scenarios: PriViLege [34], ASP [35], and CPE-CLIP [39]. All methods leverage the same pre-trained backbone (ViT-B/16-IN21K [26]) and identical data splits to ensure fair comparisons.

Evaluation metrics: For CIL, we assessed model performance using two established metrics: $\bar{A} = \frac{1}{T} \sum_{i = 1}^{T} \mathrm{ACC}_{i}$ and $A_{L}$ [4]. Here, $\bar{A}$ is the average accuracy of all T incremental stages, and $A_{L}$ is the accuracy of the last incremental stage. $\mathrm{ACC}_{i}$ is defined as

\mathrm{ACC}_{i} = \frac{1}{i} \sum_{j = 1}^{i} a_{i, j},

(20)

where $a_{i, j}$ denotes the accuracy on the j-th task after training on the i-th task. For FSCIL, $A_{Base}$ is the accuracy of the base classes in task 0. Both $A_{L}$ and $\bar{A}$ are defined identically to those in the standard CIL setting.
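As a worked example of Equation (20), the following sketch computes the per-stage accuracies, the average accuracy, and the last-stage accuracy from a lower-triangular accuracy matrix. The function names and the toy accuracy values are illustrative assumptions and are not taken from the reported tables.

```python
from typing import List, Sequence


def stage_accuracies(acc: Sequence[Sequence[float]]) -> List[float]:
    """ACC_i = (1/i) * sum_j a_{i,j}; row i holds a_{i,1..i} (Equation (20))."""
    return [sum(row) / len(row) for row in acc]


def average_accuracy(acc: Sequence[Sequence[float]]) -> float:
    """A-bar: mean of ACC_i over all T incremental stages."""
    per_stage = stage_accuracies(acc)
    return sum(per_stage) / len(per_stage)


def last_accuracy(acc: Sequence[Sequence[float]]) -> float:
    """A_L: accuracy after the last incremental stage."""
    return stage_accuracies(acc)[-1]


# Toy accuracies (in %) for T = 3; row i lists accuracy on tasks 1..i after stage i.
toy_acc = [
    [90.0],
    [80.0, 90.0],
    [70.0, 80.0, 90.0],
]

print("ACC_i:", stage_accuracies(toy_acc))
print("A_bar:", average_accuracy(toy_acc))
print("A_L:", last_accuracy(toy_acc))
```

In this toy case the average accuracy (85.0) is higher than the last-stage accuracy (80.0), which is typical when accuracy on earlier tasks declines as more tasks are added.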

Architecture and training details: We adopted ViT-B/16-IN21K [26] pre-trained on ImageNet-21K as the backbone. Optimization was conducted using SGD with an initial learning rate of 0.02 and cosine annealing. LoRA-Experts were trained for 20 epochs and the router for 5 epochs, using a batch size of 48. We set the LoRA rank to $r = 10$ and inserted LoRA-Experts into all Transformer blocks. The distillation coefficient $α$ was set to 0.04. All experiments were performed on an NVIDIA A800 GPU with fixed data splits (seed 1993) and the same backbone to ensure reproducibility; results were averaged over three runs. Following SD-LoRA [11], our QV-Experts were integrated into the query and value projections of the attention module. We also evaluated a variant, DLEPEM-MLP, which introduces MLP-Experts as a parallel branch to the FFN layer.
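For readers who want to reproduce the learning-rate schedule, the sketch below evaluates a standard cosine-annealing curve with the stated initial rate of 0.02 over 20 expert epochs and 5 router epochs. The minimum learning rate of 0, the per-epoch update granularity, and the restart of the schedule for each training phase are assumptions, since the text does not specify them; `cosine_annealed_lr` is a hypothetical helper name.

```python
import math
from typing import List


def cosine_annealed_lr(
    epoch: int, total_epochs: int, base_lr: float = 0.02, min_lr: float = 0.0
) -> float:
    """Standard cosine annealing from base_lr towards min_lr over total_epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


# Assumed: one schedule per training phase, stepped once per epoch.
expert_schedule: List[float] = [cosine_annealed_lr(e, 20) for e in range(20)]
router_schedule: List[float] = [cosine_annealed_lr(e, 5) for e in range(5)]

print("expert lr, first/middle/last epoch:",
      expert_schedule[0], expert_schedule[10], expert_schedule[-1])
print("router lr per epoch:", router_schedule)
```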

5.2. Benchmark Comparison

Class-incremental learning: We conducted a comprehensive evaluation of DLEPEM against representative recent methods on five benchmark datasets. As shown in Table 4, DLEPEM achieves the best average accuracy on all five CIL benchmarks under our reproduced experimental setting, including against the recent EASE and BiLoRA baselines. Compared with the strongest baseline on each dataset, DLEPEM improves $\bar{A}$ by 0.80% on CIFAR100, 2.11% on CUB200, 0.53% on ImageNet-R, 1.54% on OmniBenchmark, and 1.39% on VTAB.

For instance, on CIFAR100, DLEPEM-QV achieves an average accuracy of 93.39%, outperforming the EASE baseline by 0.80%. Following the caution of Kim and Han [40], we do not interpret high average accuracy alone as proof of an optimal stability–plasticity balance; instead, the old/new-task analysis in Section 5.7 provides behavior-level evidence of the trade-off. Figure 2 further shows that DLEPEM achieves the highest performance throughout training, underscoring its robustness.

Few-shot class-incremental learning: We further evaluated DLEPEM in the few-shot class-incremental learning setting. As shown in Table 5, DLEPEM achieves leading results among the evaluated methods on multiple benchmarks. In particular, it achieves the highest last accuracy ($A_{L}$) and average accuracy ($\bar{A}$) on CUB200 and CIFAR100. For instance, on CUB200, DLEPEM-QV achieves an average accuracy of 88.77%, outperforming the strongest baseline, ASP, by a clear margin of 5.31%. On CIFAR100, DLEPEM-QV reaches 90.50%, surpassing ASP by 1.96%. On miniImageNet, its performance is on par with the strongest baselines; the substantial gains on the other two datasets nonetheless highlight DLEPEM’s effectiveness in long-term continual learning, even under severe data sparsity. Figure 3 further illustrates that DLEPEM maintains superior performance throughout the incremental learning process, showing its robustness and adaptability in few-shot scenarios.

5.3. Ablation Study

Different components: We conducted ablation studies to assess the contribution of each component in DLEPEM (Table 6). The variant w/o Dynamic LoRA-Expert uses a single LoRA-Expert for all tasks, resulting in a significant performance drop under large domain shifts and indicating severe forgetting and task interference. In contrast, assigning a dedicated LoRA-Expert to each task better preserves the stability–plasticity trade-off across incremental steps. This result underscores the value of task-specific experts in PEFT-based continual learning. Removing the prototype-ensemble router (w/o Prototype-Ensemble) and using frozen class prototypes as keys also degrades performance, confirming the router’s critical role in effective module-sample matching. The advantage of dynamic expert allocation is most visible on ImageNet-R. Replacing the single shared expert with task-specific dynamic experts increases DLEPEM-MLP from 61.17/74.31 to 76.25/82.37 in $A_{L} / \bar{A}$, and increases DLEPEM-QV from 72.37/80.14 to 78.77/83.43. These gains support our claim that isolating LoRA parameters by task reduces cross-task interference, especially under domain shift.

Different routers: In addition to the prototype-ensemble strategy, we compared three module-sample-matching mechanisms: frozen-PTM prototypes $P^{F}$, K-nearest-neighbor (KNN) matching, and router-domain prototypes from $E^{\mathrm{router}}$. For KNN, features were extracted using the frozen PTM, and performance was evaluated across k values ($k \in \{1, 3, 5, 7, 9\}$), with $k = 3$ yielding the best results. Both $P^{F}$ and $E^{\mathrm{router}}$ provide single-source keys within the prototype-ensemble framework.

In Table 7, each entry is reported as DLEPEM-MLP/DLEPEM-QV. CNN reports the final average classification accuracy of each variant. Router reports the expert-selection accuracy of the learned routing module. LoRA-Expert (Oracle) reports classification accuracy when the ground-truth task/expert identity is used at test time to select the correct LoRA-Expert before normal within-expert classification.

As shown in Figure 4, the prototype ensemble consistently outperforms the baseline KNN approach. Two key insights emerge:

1. Domain-specific adaptation: Under significant domain shift (e.g., ImageNet-R, $T = 5$), the ensemble achieves larger gains, primarily due to $E^{\mathrm{router}}$’s ability to capture domain-specific characteristics.

2. Robust matching via integration: Combining generalized and domain-aware prototypes enables more reliable module-sample matching. Even when $E^{\mathrm{router}}$ underperforms $P^{F}$, the ensemble compensates through mutual alignment, enhancing overall stability and accuracy.

This conclusion is also consistent with the component ablation in Table 6: on ImageNet-R, replacing prototype-ensemble retrieval with frozen-prototype keys reduces $A_{L} / \bar{A}$ from 76.25/82.37 to 69.53/77.75 for DLEPEM-MLP and from 78.77/83.43 to 73.13/80.42 for DLEPEM-QV. The result indicates that combining PTM-general and router-domain prototypes provides more reliable module identity inference (MII) than relying on a single feature source. Figure 4 also contains the two single-source prototype variants: the frozen-PTM prototype $P^{F}$ and the router-domain prototype $E^{\mathrm{router}}$. Across the evaluated settings, both single-source variants obtain lower accuracy than the prototype-ensemble strategy, which supports the complementarity of the two prototype sources.

This router comparison also serves as a post-hoc compatibility check for the append-only dictionary. During evaluation, current queries are matched against stored keys accumulated from previous stages. If historical router-domain keys were incompatible with the current router state, prototype-ensemble retrieval would not consistently outperform PTM-only or KNN-style matching. The observed gains, therefore, suggest that the frozen-PTM component and stability-regularized router updates help maintain usable key-query compatibility across incremental stages.

Different pre-trained models: Beyond validating DLEPEM on ViT-B/16-IN21K, we further assessed its generalization across diverse Transformer-based architectures, including ViT-B/16-IN1K & ViT-L/16 [26], ViT-B/16-DINO [41], and ViT-B/16-SAM [42]. As a baseline, SimpleCIL [10] fine-tunes only the classifier to reflect the inherent capability of each backbone in incremental settings. We evaluated DLEPEM on the CIFAR100 and ImageNet-R benchmarks, with the results shown in Figure 5a,b. Two key observations emerge:

1. DLEPEM yields greater improvements on larger models (e.g., ViT-L/16), highlighting its scalability with model capacity.

2. Among similarly sized architectures (e.g., ViT-B/16 variants), DLEPEM consistently outperforms the SimpleCIL baseline, demonstrating robustness to architectural and pre-training differences.

These results show DLEPEM’s versatility and effectiveness across a wide range of Transformer-based backbones.

5.4. Analysis of Trainable Parameters and Accuracy

To further evaluate the parameter-performance trade-off, we compared the number of trainable parameters with accuracy across different methods in Figure 6. As illustrated in these plots, DLEPEM achieves a favorable balance between model size and performance, demonstrating that it uses additional parameters efficiently to improve accuracy.

We further quantified the scalability of the one-expert-per-task design. DLEPEM is rehearsal-free in the sense that it stores no raw samples, old feature batches, or exemplar buffers; nevertheless, it retains a frozen LoRA-Expert bank, the current router, and an append-only prototype dictionary. For a ViT with hidden dimension d, LoRA rank r, L Transformer blocks, and b bytes per floating-point value, the storage of one MLP-Expert, one QV-Expert, and the prototype dictionary after task t can be written as

\begin{aligned} N_{MLP} &= L (2 d r), \\ N_{QV} &= 2 L (2 d r), \\ M_{proto} (t) &= 2 d \, | Y_{1 : t} | \, b . \end{aligned}

(21)

For the LoRA-Expert bank, under our default ViT-B/16-IN21K setting, $d = 768$, $L = 12$, $r = 10$, and $b = 4$ for FP32. Thus, each MLP-Expert adds 0.18 M parameters (≈0.70 MiB), and each QV-Expert adds 0.37 M parameters (≈1.41 MiB). In a 10-task setting, the frozen expert bank stores 1.84 M LoRA parameters for DLEPEM-MLP and 3.69 M for DLEPEM-QV, corresponding to approximately 2.1% and 4.3% of an 86 M-parameter ViT-B backbone, respectively. For longer task streams, the growth is linear: in a 50-task sequence, the expert bank contains 9.22 M parameters for DLEPEM-MLP and 18.43 M for DLEPEM-QV; in a 100-task sequence, it contains 18.43 M and 36.86 M parameters, respectively. The router contributes one additional same-size LoRA module and is constant with respect to the number of tasks. Therefore, the LoRA-Expert bank remains lightweight under the evaluated protocols, but it is the main component that scales with the number of tasks.

For the prototype dictionary, each ensemble prototype contains two 768-dimensional vectors: one general prototype from the frozen PTM feature space and one domain-specific prototype from the router feature space. Its memory, therefore, scales with the number of seen classes rather than directly with the number of tasks. In FP32, the dictionary requires about 0.59 MiB for 100 classes, 1.17 MiB for 200 classes, and 1.76 MiB for 300 classes. Compared with both the LoRA-Expert bank and the ViT-B backbone, this storage is small in our evaluated benchmarks, but it still grows as $O (| Y_{1 : T} |)$ with the number of seen classes.
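The storage figures above follow directly from Equation (21) and can be checked with a few lines of arithmetic. The sketch below plugs in the stated values ($d = 768$, $L = 12$, $r = 10$, $b = 4$, and an 86 M-parameter backbone); the function and variable names are illustrative.

```python
MIB = 1024 ** 2  # bytes per MiB


def mlp_expert_params(d: int, r: int, num_blocks: int) -> int:
    """N_MLP = L * (2 d r): one low-rank pair per Transformer block."""
    return num_blocks * 2 * d * r


def qv_expert_params(d: int, r: int, num_blocks: int) -> int:
    """N_QV = 2 L * (2 d r): low-rank pairs on both query and value projections."""
    return 2 * num_blocks * 2 * d * r


def prototype_bytes(d: int, num_classes: int, bytes_per_value: int) -> int:
    """M_proto = 2 d |Y| b: two d-dimensional vectors per seen class."""
    return 2 * d * num_classes * bytes_per_value


d, r, num_blocks, b = 768, 10, 12, 4
backbone_params = 86_000_000

n_mlp = mlp_expert_params(d, r, num_blocks)
n_qv = qv_expert_params(d, r, num_blocks)

print(f"MLP-Expert: {n_mlp} params, {n_mlp * b / MIB:.2f} MiB")
print(f"QV-Expert:  {n_qv} params, {n_qv * b / MIB:.2f} MiB")
for tasks in (10, 50, 100):
    print(f"{tasks} tasks: MLP bank {tasks * n_mlp / 1e6:.2f} M "
          f"({100 * tasks * n_mlp / backbone_params:.1f}% of backbone), "
          f"QV bank {tasks * n_qv / 1e6:.2f} M "
          f"({100 * tasks * n_qv / backbone_params:.1f}% of backbone)")
for classes in (100, 200, 300):
    print(f"{classes} classes: dictionary {prototype_bytes(d, classes, b) / MIB:.2f} MiB")
```

The expert bank grows with the number of tasks while the dictionary grows with the number of classes, so the two terms can be budgeted independently for a given deployment.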

5.5. Training and Inference Time Comparison

At inference time, task-specific LoRA-Expert selection is based on nearest-neighbor search using cosine similarity, formulated as $\hat{i} = \arg\max_{i \in Y_{t}} \cos (q (x), K_{i})$. Table 8 reports training time and inference latency. The reported inference latency is measured end-to-end for one image and includes query construction with the frozen PTM and router, nearest-neighbor expert selection in the prototype dictionary, the selected LoRA-Expert forward pass, and final prototype-weight classification. DLEPEM-MLP requires 69.46 s/epoch on CIFAR100 and 33.30 s/epoch on ImageNet-R, which is competitive with representative PEFT-based baselines. Its inference latency is 8.39 ms/image on CIFAR100 and 8.47 ms/image on ImageNet-R. This latency is higher than that of SD-LoRA and InfLoRA, so the overhead is not negligible: DLEPEM accepts a practical but measurable inference cost in exchange for stronger expert retrieval and final accuracy.

From a deployment-memory perspective, the storage analysis above shows that DLEPEM remains lightweight relative to the ViT-B/16-IN21K backbone under the evaluated protocols. For the LoRA-Expert bank, each MLP-Expert adds about 0.70 MiB, and each QV-Expert adds about 1.41 MiB in FP32; in a 10-task setting, the retained expert bank accounts for about 2.1% and 4.3% of the 86 M-parameter backbone for DLEPEM-MLP and DLEPEM-QV, respectively. For the prototype dictionary, the storage is smaller and requires at most about 1.76 MiB for 300 seen classes in our evaluated CIL benchmarks. Therefore, DLEPEM is suitable for GPU-based real-time or near-real-time recognition scenarios, while stricter edge deployment or much longer task streams may require additional expert compression or pruning.

5.6. Parameter Sensitivity Analysis

We investigated the sensitivity of three key hyperparameters in DLEPEM: (1) the LoRA rank r, (2) the insertion positions of LoRA-Experts in the ViT backbone, and (3) the distillation coefficient $α$ for plasticity regularization.

To evaluate the effect of r and insertion positions, we conducted experiments on CUB200 ($T = 10$), varying $r \in \{1, 2, 4, 8, 10, 16\}$ and testing insertion ranges {0–2, 0–4, 0–8, 0–12}, where “0–2” denotes insertion into the first two Transformer layers. As shown in Figure 7a, DLEPEM maintains stable performance across various settings, demonstrating robustness to hyperparameter variations. Following prior work [4,11], we adopt $r = 10$ and insert LoRA-Experts into all Transformer blocks. Similar trends are observed on other datasets.

We also analyzed the impact of the distillation coefficient $α$ across datasets (Figure 7b). The two endpoints correspond to removing one distillation term: $α = 0$ removes plasticity feature distillation and keeps only $L_{\mathrm{SFD}}$, while $α = 1$ removes stability feature distillation and keeps only $L_{\mathrm{PFD}}$. The model performs consistently well for $α \in (0, 0.1]$, with $α = 0.04$ as the default.

5.7. Performance of LoRA-Based Methods Across Sequential Tasks

Kim and Han [40] show that final or average accuracy can obscure whether a CIL method is genuinely plastic or mainly stable, and they propose feature-representation diagnostics such as classifier retraining with frozen feature extractors and representation-similarity analysis. We do not reproduce that full representation-level protocol here. Instead, using the available sequential-task results, we provide a behavior-level decomposition: new-task performance is used as an empirical proxy for plasticity, and old-task performance after subsequent learning is used as an empirical proxy for stability.
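One way to make this behavior-level decomposition concrete is to read both proxies off the accuracy matrix $a_{i, j}$ from Equation (20). The sketch below is an assumed instantiation rather than the exact procedure behind Figure 8: it takes the new-task proxy as $a_{i, i}$ and the old-task proxy as the mean of $a_{i, j}$ over $j < i$, and the toy accuracy values are invented for illustration.

```python
from typing import List, Optional, Sequence


def new_task_accuracy(acc: Sequence[Sequence[float]]) -> List[float]:
    """Plasticity proxy: accuracy a_{i,i} on the task just learned at stage i."""
    return [row[-1] for row in acc]


def old_task_accuracy(acc: Sequence[Sequence[float]]) -> List[Optional[float]]:
    """Stability proxy: mean accuracy over tasks j < i after stage i (None at stage 1)."""
    result: List[Optional[float]] = []
    for row in acc:
        old = row[:-1]
        result.append(sum(old) / len(old) if old else None)
    return result


# Toy accuracies (in %); row i lists accuracy on tasks 1..i after training stage i.
toy_acc = [
    [90.0],
    [80.0, 92.0],
    [70.0, 78.0, 94.0],
]

print("new-task accuracy per stage:", new_task_accuracy(toy_acc))
print("old-task accuracy per stage:", old_task_accuracy(toy_acc))
```

A method that constrains updates heavily would tend to show a flatter old-task curve but lower new-task values, whereas a method that blends parameters without isolation would tend to show the opposite pattern.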

To further analyze the performance characteristics of LoRA-based methods, we evaluate LAE [9], InfLoRA [4], and SD-LoRA [11] in terms of model plasticity and stability on CIFAR100 ($T = 10$). As shown in Figure 8a, both InfLoRA and SD-LoRA show lower performance on new tasks because their gradient-direction constraints are designed to preserve old knowledge. While these constraints effectively mitigate catastrophic forgetting, they can compromise the model’s plasticity when learning new tasks. In contrast, DLEPEM learns task-specific LoRA modules without such restrictions, enabling stronger adaptation to new tasks while maintaining competitive performance. This comparison further distinguishes DLEPEM from constraint-based LoRA continual learning: InfLoRA and SD-LoRA protect old knowledge by restricting update directions or magnitudes, whereas DLEPEM preserves old knowledge structurally by freezing previous experts while keeping the current expert fully trainable within its low-rank subspace. Thus, DLEPEM shifts the stability–plasticity trade-off from a single shared update space to two separated factors: parameter-stationary old experts for stability and a newly trainable expert for plasticity.

LAE employs a different strategy by continuously integrating new parameters into the existing model through weighted blending. However, as demonstrated in Figure 8b, this approach leads to progressive destabilization of previously learned knowledge. DLEPEM consistently outperforms LAE on old tasks, demonstrating that our dynamic expert selection mechanism effectively preserves old knowledge without the instability issues associated with parameter blending.

6. Conclusions

We propose DLEPEM, a novel framework for CIL that combines Dynamic LoRA-Experts for task-specific adaptation and prototype-ensemble matching for improved module selection. This dual design supports a more favorable empirical stability–plasticity trade-off under the evaluated CIL protocols by combining parameter-stationary old experts for stability with a trainable expert for each new task to preserve plasticity. Extensive evaluations on six benchmarks demonstrate that DLEPEM achieves strong and competitive performance under the evaluated protocols. Although each LoRA-Expert is lightweight, the stored expert bank grows linearly with the number of tasks, and the prototype dictionary grows with the number of seen classes. Future work will explore expert compression, rank sharing, expert pruning, expert merging/distillation, different prototype fusion weights, random and expert-ID routing diagnostics, and extensions to multi-modal learning.

Author Contributions

Conceptualization, H.Z.; methodology, H.Z.; software, H.Z.; validation, H.Z., R.L. and Y.L.; formal analysis, H.Z.; investigation, H.Z.; resources, R.L.; data curation, H.Z.; writing–original draft preparation, H.Z.; writing–review and editing, H.Z., R.L. and Y.L.; visualization, H.Z.; supervision, Y.L.; project administration, R.L.; funding acquisition, R.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available benchmark datasets. The code is available at https://github.com/hongwei-zhao/Appl_Sci-DLEPEM-main (accessed on 13 June 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CIL	Class-Incremental Learning
FSCIL	Few-Shot Class-Incremental Learning
PEFT	Parameter-Efficient Fine-Tuning
PTM	Pre-Trained Model
LoRA	Low-Rank Adaptation
MoE	Mixture-of-Experts
ViT	Vision Transformer
MHA	Multi-Head Self-Attention
FFN	Feed-Forward Network
WTP	Within-Task Prediction
MII	Module Identity Inference

References

McCloskey, M.; Cohen, N.J. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. In Psychology of Learning and Motivation; Academic Press: Cambridge, MA, USA, 1989; Volume 24, pp. 109–165. [Google Scholar] [CrossRef]
French, R.M. Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 1999, 3, 128–135. [Google Scholar] [CrossRef] [PubMed]
Grossberg, S.T. Studies of Mind and Brain: Neural Principles of Learning, Perception, Development, Cognition, and Motor Control; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 70. [Google Scholar] [CrossRef]
Liang, Y.S.; Li, W.J. InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Los Alamitos, CA, USA, 2024; pp. 23638–23647. [Google Scholar] [CrossRef]
Han, X.; Zhang, Z.; Ding, N.; Gu, Y.; Liu, X.; Huo, Y.; Qiu, J.; Yao, Y.; Zhang, A.; Zhang, L.; et al. Pre-trained models: Past, present and future. AI Open 2021, 2, 225–250. [Google Scholar] [CrossRef]
Xin, Y.; Luo, S.; Zhou, H.; Du, J.; Liu, X.; Fan, Y.; Li, Q.; Du, Y. Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey. arXiv 2024, arXiv:2402.02242. [Google Scholar] [CrossRef]
Wang, Z.; Zhang, Z.; Lee, C.Y.; Zhang, H.; Sun, R.; Ren, X.; Su, G.; Perot, V.; Dy, J.; Pfister, T. Learning to Prompt for Continual Learning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Los Alamitos, CA, USA, 2022; pp. 139–149. [Google Scholar] [CrossRef]
Wang, Z.; Zhang, Z.; Ebrahimi, S.; Sun, R.; Zhang, H.; Lee, C.Y.; Ren, X.; Su, G.; Perot, V.; Dy, J.; et al. DualPrompt: Complementary Prompting for Rehearsal-Free Continual Learning. In Proceedings of the Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXVI; Springer: Berlin/Heidelberg, Germany, 2022; pp. 631–648. [Google Scholar] [CrossRef]
Gao, Q.; Zhao, C.; Sun, Y.; Xi, T.; Zhang, G.; Ghanem, B.; Zhang, J. A Unified Continual Learning Framework with General Parameter-Efficient Tuning. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE Computer Society: Los Alamitos, CA, USA, 2023; pp. 11449–11459. [Google Scholar] [CrossRef]
Zhou, D.W.; Cai, Z.W.; Ye, H.J.; Zhan, D.C.; Liu, Z. Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need. Int. J. Comput. Vis. 2024, 133, 1012–1032. [Google Scholar] [CrossRef]
Wu, Y.; Piao, H.; Huang, L.K.; Wang, R.; Li, W.; Pfister, H.; Meng, D.; Ma, K.; Wei, Y. SD-LoRA: Scalable Decoupled Low-Rank Adaptation for Class Incremental Learning. In Proceedings of the The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, 24–28 April 2025. [Google Scholar] [CrossRef]
Wang, Y.; Huang, Z.; Hong, X. S-Prompts Learning with Pre-trained Transformers: An Occam’s Razor for Domain Incremental Learning. In Proceedings of the Advances in Neural Information Processing Systems 35, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar] [CrossRef]
Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive Mixtures of Local Experts. Neural Comput. 1991, 3, 79–87. [Google Scholar] [CrossRef] [PubMed]
Aljundi, R.; Kelchtermans, K.; Tuytelaars, T. Task-Free Continual Learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Los Alamitos, CA, USA, 2019; pp. 11246–11255. [Google Scholar] [CrossRef]
Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. iCaRL: Incremental Classifier and Representation Learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Los Alamitos, CA, USA, 2017; pp. 5533–5542. [Google Scholar] [CrossRef]
Yu, L.; Twardowski, B.; Liu, X.; Herranz, L.; Wang, K.; Cheng, Y.; Jui, S.; van de Weijer, J. Semantic Drift Compensation for Class-Incremental Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Los Alamitos, CA, USA, 2020; pp. 6980–6989. [Google Scholar] [CrossRef]
Wang, F.; Zhou, D.; Ye, H.; Zhan, D. FOSTER: Feature Boosting and Compression for Class-Incremental Learning. In Proceedings of the Computer Vision—ECCV 2022—17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXV; Lecture Notes in Computer Science; Avidan, S., Brostow, G.J., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2022; Volume 13685, pp. 398–414. [Google Scholar] [CrossRef]
Smith, J.S.; Karlinsky, L.; Gutta, V.; Cascante-Bonilla, P.; Kim, D.; Arbelle, A.; Panda, R.; Feris, R.; Kira, Z. CODA-Prompt: COntinual Decomposed Attention-Based Prompting for Rehearsal-Free Continual Learning. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Los Alamitos, CA, USA, 2023; pp. 11909–11919. [Google Scholar] [CrossRef]
Yu, J.; Zhuge, Y.; Zhang, L.; Hu, P.; Wang, D.; Lu, H.; He, Y. Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Los Alamitos, CA, USA, 2024; pp. 23219–23230. [Google Scholar] [CrossRef]
Riquelme, C.; Puigcerver, J.; Mustafa, B.; Neumann, M.; Jenatton, R.; Pinto, A.S.; Keysers, D.; Houlsby, N. Scaling vision with sparse mixture of experts. In Proceedings of the Proceedings of the 35th International Conference on Neural Information Processing Systems, Red Hook, NY, USA; NIPS ’21; NeurIPS Foundation: La Jolla, CA, USA, 2021; Available online: https://proceedings.neurips.cc/paper/2021/hash/48237d9f2dea8c74c2a72126cf63d933-Abstract.html (accessed on 13 June 2026).
Gou, Y.; Liu, Z.; Chen, K.; Hong, L.; Xu, H.; Li, A.; Yeung, D.Y.; Kwok, J.T.; Zhang, Y. Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv 2023, arXiv:2312.12379. [Google Scholar] [CrossRef]
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the The Tenth International Conference on Learning Representations, ICLR 2022, Virtual, 25–29 April 2022. [Google Scholar] [CrossRef]
Jin, P.; Zhu, B.; Yuan, L.; Yan, S. MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts. In Proceedings of the The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, 24–28 April 2025. [Google Scholar]
Dou, S.; Zhou, E.; Liu, Y.; Gao, S.; Zhao, J.; Shen, W.; Zhou, Y.; Xi, Z.; Wang, X.; Fan, X.; et al. LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment. arXiv 2023, arXiv:2312.09979. [Google Scholar] [CrossRef]
Wang, L.; Xie, J.; Zhang, X.; Huang, M.; Su, H.; Zhu, J. Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality. In Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; NeurIPS Foundation: La Jolla, CA, USA, 2023; Available online: https://proceedings.neurips.cc/paper_files/paper/2023/hash/d9f8b5abc8e0926539ecbb492af7b2f1-Abstract-Conference.html (accessed on 13 June 2026).
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual, 3–7 May 2021.
He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV); IEEE Computer Society: Los Alamitos, CA, USA, 2015; pp. 1026–1034.
Tao, X.; Hong, X.; Chang, X.; Dong, S.; Wei, X.; Gong, Y. Few-Shot Class-Incremental Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Los Alamitos, CA, USA, 2020; pp. 12180–12189.
Zhai, X.; Puigcerver, J.; Kolesnikov, A.; Ruyssen, P.; Riquelme, C.; Lucic, M.; Djolonga, J.; Pinto, A.S.; Neumann, M.; Dosovitskiy, A.; et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv 2019, arXiv:1910.04867.
Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset; Technical Report; California Institute of Technology: Pasadena, CA, USA, 2011.
Hendrycks, D.; Basart, S.; Mu, N.; Kadavath, S.; Wang, F.; Dorundo, E.; Desai, R.; Zhu, T.; Parajuli, S.; Guo, M.; et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE Computer Society: Los Alamitos, CA, USA, 2021; pp. 8340–8349.
Zhang, Y.; Yin, Z.; Shao, J.; Liu, Z. Benchmarking omni-vision representation through the lens of visual realms. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 594–611.
Park, K.H.; Song, K.; Park, G.M. Pre-trained Vision and Language Transformers are Few-Shot Incremental Learners. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Los Alamitos, CA, USA, 2024; pp. 23881–23890.
Liu, C.; Wang, Z.; Xiong, T.; Chen, R.; Wu, Y.; Guo, J.; Huang, H. Few-Shot Class Incremental Learning with Attention-Aware Self-adaptive Prompt. In Proceedings of the Computer Vision—ECCV 2024—18th European Conference, Milan, Italy, 29 September–4 October 2024, Proceedings, Part LXXXI; Springer: Cham, Switzerland, 2024; pp. 1–18.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.S.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
Zhu, H.; Zhang, Y.; Dong, J.; Koniusz, P. BiLoRA: Almost-Orthogonal Parameter Spaces for Continual Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Los Alamitos, CA, USA, 2025; pp. 25613–25622.
Zhou, D.W.; Sun, H.L.; Ye, H.J.; Zhan, D.C. Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Los Alamitos, CA, USA, 2024; pp. 23554–23564.
D’Alessandro, M.; Alonso, A.; Calabrés, E.; Galar, M. Multimodal Parameter-Efficient Few-Shot Class Incremental Learning. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW); IEEE Computer Society: Los Alamitos, CA, USA, 2023; pp. 3385–3395.
Kim, D.; Han, B. On the Stability-Plasticity Dilemma of Class-Incremental Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Los Alamitos, CA, USA, 2023; pp. 20196–20204.
Caron, M.; Touvron, H.; Misra, I.; Jegou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE Computer Society: Los Alamitos, CA, USA, 2021; pp. 9630–9640.
Chen, X.; Hsieh, C.; Gong, B. When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations. In Proceedings of the Tenth International Conference on Learning Representations, ICLR 2022, Virtual, 25–29 April 2022.

Figure 1. Illustration of DLEPEM. (a) In the t-th incremental task, a new LoRA-Expert $E_{t}$ (with parameters $U_{t}$ and $V_{t}$) is trained to capture task-specific features. (b) Structure of the LoRA-Expert. Depending on the insertion location, the module is categorized as either an MLP-Expert or a QV-Expert. The QV-Expert integrates LoRA into the $W_{q}$ and $W_{v}$ projections of the MHA layer. (c) Prototype-ensemble matching. General prototypes from the PTM and domain-specific prototypes from the router $E_{t}^{router}$ are combined and linked to their respective experts. Circles and triangles denote prototypes in the general and domain-specific spaces, respectively, and different colors indicate different classes. During inference, the nearest prototype guides expert selection for each input sample.
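
The routing step in panel (c) can be illustrated with a minimal sketch. It assumes that each class key is the concatenation of a general prototype and a router-domain prototype (as Table 1 defines the ensemble prototype) and that "nearest" means highest cosine similarity; the similarity measure, the toy 2-D features, and the names `select_expert`, `general`, `router`, and `expert_of` are illustrative assumptions, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def select_expert(query, keys, expert_ids):
    """Return the expert index linked to the key most similar to the query."""
    best = max(range(len(keys)), key=lambda i: cosine(query, keys[i]))
    return expert_ids[best]


# Hypothetical prototypes for two classes learned in two different tasks.
general = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}   # frozen-PTM space
router = {"cat": [0.9, 0.1], "dog": [0.2, 0.8]}    # router-domain space
expert_of = {"cat": 1, "dog": 2}                   # class -> expert index

classes = sorted(general)
keys = [general[c] + router[c] for c in classes]   # list concatenation = [P^F; P^R]
expert_ids = [expert_of[c] for c in classes]

# The query is built the same way: frozen-PTM feature followed by router feature.
query = [0.1, 0.9] + [0.3, 0.7]
chosen = select_expert(query, keys, expert_ids)
print(chosen)
```

Because the query is closest to the "dog" key in both spaces, the sketch routes the sample to the expert linked to that class; the selected expert would then produce the task-adapted feature used for classification.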

Figure 2. Incremental accuracy curves on the CIL benchmarks. All methods use the same ViT-B/16-IN21K backbone, and each subplot reports accuracy after successive incremental tasks for the specified dataset protocol.

Figure 3. Incremental accuracy curves on the FSCIL benchmarks. The three subplots show the CUB200, CIFAR100, and miniImageNet protocols, respectively, using the same ViT-B/16-IN21K backbone.

Figure 4. Comparison of routing strategies for module-identity inference. The prototype-ensemble strategy is compared with frozen-PTM prototypes, KNN matching, and router-domain prototypes under the same evaluation protocols.

Figure 5. Effect of different pre-trained backbones on CIL performance. The comparison evaluates DLEPEM with several ViT-based pre-training sources and reports the corresponding performance on ImageNet-R and CIFAR100. The arrows indicate the accuracy improvements of DLEPEM over the SimpleCIL baseline.

Figure 6. Parameter–accuracy comparison on CIL benchmarks. The plots compare trainable parameter counts and final accuracy on CIFAR100 and VTAB to illustrate the parameter-performance trade-off of DLEPEM.

Figure 7. Hyperparameter sensitivity of DLEPEM. Subplot (a) analyzes the effect of LoRA rank and insertion position, while subplot (b) shows the effect of the distillation coefficient $\alpha$ across datasets.

Figure 8. New-task and old-task performance of LoRA-based CIL methods. The left subplot reports accuracy on newly introduced tasks as a proxy for plasticity, and the right subplot reports retained accuracy on previous tasks as a proxy for stability.

Table 1. Notation used in DLEPEM.

Symbol	Definition
$D_{t}$, $Y_{t}$	Training set and class set of task t.
$\mathcal{Y}_{t}$	The set of all classes observed up to task t, i.e., $\mathcal{Y}_{t} = Y_{1} \cup \dots \cup Y_{t}$.
$E_{t}$	Task-t LoRA-Expert module, i.e., a group of low-rank LoRA adapters inserted into selected Transformer layers rather than a complete independent ViT. Historical experts are frozen after training.
$U_{t}$, $V_{t}$, $r$	Low-rank matrices and rank inside each adapter. For row-vector input $e \in \mathbb{R}^{1 \times d_{in}}$, $U_{t} \in \mathbb{R}^{d_{in} \times r}$ and $V_{t} \in \mathbb{R}^{r \times d_{out}}$, with $r = 10$ by default.
$V_{attn}$	Attention value matrix in Equation (8), distinct from the LoRA matrix $V_{t}$.
$E_{t}^{router}$	Router after learning task t, used to extract router-domain features.
$\phi(x)$, $\phi(x; E)$	Frozen-PTM feature and the feature obtained with module $E$, respectively.
$P_{i, t}^{L}$	LoRA-Expert prototype of class i at task t.
$P_{i}^{F}$, $P_{i}^{R}$	Frozen-PTM prototype vector and router-domain prototype vector of class i.
$P_{i}$	Ensemble prototype vector of class i, defined as $P_{i} = \mathrm{concat}(P_{i}^{F}, P_{i}^{R}) = [P_{i}^{F}; P_{i}^{R}]$.
$[\cdot; \cdot]$	Vector concatenation operator; it does not denote matrix summation.
$K_{i}$, $\nu_{i}$	Dictionary key of class i and the associated expert index. In our implementation, $K_{i} = P_{i}$.
$q(x)$	Ensemble query of test sample $x$.
$\sigma(\cdot)$, $\tau$	Softmax shorthand and temperature used in the distillation losses, where $\sigma(z) = \mathrm{softmax}(z)$ and $\sigma(z / \tau) = \mathrm{softmax}(z / \tau)$.
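
The shape convention for $U_{t}$ and $V_{t}$ in Table 1 can be checked with a small sketch. It assumes the usual LoRA form in which the low-rank path is added to the output of a frozen weight, here called `W`; initialising the second factor to zero and omitting any scaling factor are assumptions borrowed from common LoRA practice, and the toy dimensions are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_param_count(d_in, d_out, r):
    """Number of trainable entries in U (d_in x r) plus V (r x d_out)."""
    return d_in * r + r * d_out


d_in, d_out, r = 6, 4, 2  # toy sizes; Table 1 gives r = 10 as the default rank
rng = random.Random(0)

W = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]  # frozen weight
U = [[rng.gauss(0, 0.1) for _ in range(r)] for _ in range(d_in)]    # d_in x r
V = [[0.0] * d_out for _ in range(r)]                               # r x d_out, zero init (assumption)

e = [[rng.gauss(0, 1) for _ in range(d_in)]]  # row-vector input, 1 x d_in

base = matmul(e, W)                            # frozen path, 1 x d_out
adapted = add(base, matmul(matmul(e, U), V))   # frozen path + low-rank path

print(len(adapted), len(adapted[0]))
print(lora_param_count(768, 768, 10), 768 * 768)
```

With a zero-initialised second factor the adapted output equals the frozen output before any training step, so a new expert starts from the pre-trained behaviour. For a square 768-dimensional projection, which is the hidden width of ViT-B/16, a rank-10 adapter has 15,360 trainable entries against 589,824 in the full matrix.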

Table 2. Configuration of the class-incremental learning (CIL) benchmarks.

Task	CIFAR100	CUB200	ImageNet-R	OmniBenchmark	VTAB
Classes/Task	10	20	40	30	10
# of tasks	10	10	5	10	5

Table 3. Configuration of the few-shot class-incremental learning (FSCIL) benchmarks.

Task	CUB200	CIFAR100	miniImageNet
Base Classes	100	60	60
Incremental Tasks	10-way 5-shot	5-way 5-shot	5-way 5-shot
# of Tasks	1 + 10	1 + 8	1 + 8
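
The FSCIL protocols in Table 3 can be sanity-checked with a few lines of arithmetic. The sketch below only recombines the numbers in the table; the dictionary keys (`base`, `ways`, `shots`, `sessions`) and the helper `summarize` are illustrative names.

```python
# Values copied from Table 3: base classes, N-way K-shot, incremental sessions.
fscil_protocols = {
    "CUB200": {"base": 100, "ways": 10, "shots": 5, "sessions": 10},
    "CIFAR100": {"base": 60, "ways": 5, "shots": 5, "sessions": 8},
    "miniImageNet": {"base": 60, "ways": 5, "shots": 5, "sessions": 8},
}


def summarize(p):
    """Derive task count, final class count, and few-shot training images."""
    return {
        "tasks": 1 + p["sessions"],                       # base task + incremental tasks
        "classes": p["base"] + p["ways"] * p["sessions"],
        "few_shot_images": p["ways"] * p["shots"] * p["sessions"],
    }


summary = {name: summarize(p) for name, p in fscil_protocols.items()}
for name, s in summary.items():
    print(name, s)
```

The derived task counts (11 for CUB200, 9 for CIFAR100 and miniImageNet) match the T values used in the column headers of Table 5, and all incremental sessions together contribute only a few hundred labelled images per benchmark.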

Table 4. Performance comparison on CIL benchmarks using the same ViT-B/16-IN21K backbone. Results are reported as mean ± standard deviation over three runs; the best result in each column is highlighted in bold.

Method	CIFAR100 (T = 10)		CUB200 (T = 10)		ImageNet-R (T = 5)		OmniBenchmark (T = 10)		VTAB (T = 5)
Method	$A_{L}$	$\bar{A}$	$A_{L}$	$\bar{A}$	$A_{L}$	$\bar{A}$	$A_{L}$	$\bar{A}$	$A_{L}$	$\bar{A}$
Full Fine-Tuning	$66.26 \pm 0.18$	$76.94 \pm 0.02$	$55.29 \pm 0.46$	$70.30 \pm 0.74$	$59.90 \pm 0.05$	$72.19 \pm 0.12$	$47.75 \pm 0.14$	$65.86 \pm 0.12$	$62.95 \pm 5.94$	$80.80 \pm 1.50$
SimpleCIL [10]	$81.27 \pm 0.01$	$87.13 \pm 0.01$	$82.28 \pm 6.35$	$91.85 \pm 0.00$	$65.14 \pm 15.29$	$59.72 \pm 0.03$	$66.98 \pm 8.93$	$79.35 \pm 0.01$	$80.74 \pm 5.26$	$90.80 \pm 0.00$
L2P [7]	$84.82 \pm 0.22$	$89.78 \pm 0.01$	$71.98 \pm 0.07$	$81.80 \pm 0.03$	$72.08 \pm 0.01$	$76.76 \pm 0.06$	$64.45 \pm 0.07$	$74.14 \pm 0.03$	$64.27 \pm 0.02$	$81.84 \pm 0.19$
DualPrompt [8]	$85.23 \pm 0.07$	$90.32 \pm 0.06$	$74.23 \pm 0.10$	$84.81 \pm 0.02$	$69.34 \pm 0.09$	$73.58 \pm 0.05$	$66.16 \pm 0.06$	$74.97 \pm 0.03$	$78.90 \pm 0.10$	$89.82 \pm 0.06$
CODA-Prompt [18]	$86.69 \pm 0.00$	$91.31 \pm 0.01$	$75.45 \pm 0.00$	$84.65 \pm 0.00$	$75.16 \pm 0.08$	$80.46 \pm 0.04$	$68.67 \pm 0.00$	$77.79 \pm 0.00$	$75.08 \pm 0.00$	$87.24 \pm 0.00$
APER [10]	$87.32 \pm 0.01$	$92.09 \pm 0.02$	$86.82 \pm 0.04$	$91.84 \pm 0.03$	$68.22 \pm 0.07$	$75.30 \pm 0.07$	$74.40 \pm 0.01$	$80.62 \pm 0.01$	$84.44 \pm 0.01$	$86.27 \pm 0.02$
LAE [9]	$85.60 \pm 0.19$	$91.26 \pm 0.12$	$67.91 \pm 0.04$	$80.17 \pm 0.04$	$71.26 \pm 0.18$	$77.08 \pm 0.04$	$66.14 \pm 0.06$	$74.89 \pm 0.03$	$68.45 \pm 2.53$	$85.31 \pm 0.44$
InfLoRA [4]	$86.43 \pm 0.02$	$91.80 \pm 0.02$	$70.07 \pm 0.26$	$81.71 \pm 0.17$	$77.66 \pm 0.08$	$82.90 \pm 0.05$	$68.38 \pm 0.12$	$78.06 \pm 0.03$	$74.66 \pm 0.43$	$86.03 \pm 0.17$
SD-LoRA [11]	$87.62 \pm 0.00$	$92.10 \pm 0.00$	$72.69 \pm 0.00$	$83.17 \pm 0.00$	$78.52 \pm 0.00$	$82.74 \pm 0.00$	$69.32 \pm 0.00$	$77.78 \pm 0.00$	$67.73 \pm 0.00$	$84.44 \pm 0.00$
EASE [38]	$88.13 \pm 0.04$	$92.59 \pm 0.05$	$84.18 \pm 0.06$	$90.20 \pm 0.04$	$76.95 \pm 0.05$	$81.48 \pm 0.06$	$67.75 \pm 0.04$	$74.85 \pm 0.05$	$82.34 \pm 0.06$	$90.45 \pm 0.04$
BiLoRA [37]	$85.30 \pm 0.05$	$90.73 \pm 0.04$	$73.75 \pm 0.06$	$83.67 \pm 0.05$	$76.33 \pm 0.04$	$81.21 \pm 0.07$	$68.87 \pm 0.05$	$77.53 \pm 0.04$	$76.76 \pm 0.06$	$88.94 \pm 0.05$
DLEPEM-MLP	$85.99 \pm 0.34$	$92.40 \pm 0.03$	$87.70 \pm 0.05$	$92.31 \pm 0.29$	$76.25 \pm 0.44$	$82.37 \pm 0.14$	$74.32 \pm 0.00$	$81.60 \pm 0.00$	$84.96 \pm 0.22$	$91.84 \pm 0.03$
DLEPEM-QV	$88.84 \pm 0.09$	$93.39 \pm 0.10$	$87.56 \pm 0.16$	$92.09 \pm 0.08$	$78.77 \pm 0.07$	$83.43 \pm 0.12$	$75.53 \pm 0.08$	$82.16 \pm 0.06$	$85.18 \pm 0.14$	$91.11 \pm 0.13$
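
For readers unfamiliar with the two column metrics, the sketch below assumes the definitions that are conventional in the CIL literature: $A_{L}$ is the accuracy measured after the last incremental task, and $\bar{A}$ is the mean of the accuracies measured after each task. The per-stage accuracies are hypothetical and are not taken from the paper.

```python
def final_accuracy(stage_acc):
    """A_L: accuracy over all seen classes after the last task."""
    return stage_acc[-1]


def average_accuracy(stage_acc):
    """A-bar: mean of the accuracies recorded after each task."""
    return sum(stage_acc) / len(stage_acc)


# Hypothetical accuracies (%) after each of five incremental tasks.
stage_acc = [98.0, 95.0, 92.0, 90.0, 88.0]

a_last = final_accuracy(stage_acc)
a_bar = average_accuracy(stage_acc)
print(a_last, a_bar)
```

Under these definitions $\bar{A}$ is normally at least as large as $A_{L}$, because early stages contain fewer classes and are easier; that pattern holds for most rows of Table 4.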

Table 5. Performance comparison on FSCIL benchmarks using the same ViT-B/16-IN21K backbone. Results are reported as mean ± standard deviation over three runs; the best result in each column is highlighted in bold.

Method	CUB200 (T = 11)			CIFAR100 (T = 9)			miniImageNet (T = 9)
Method	$A_{Base}$	$A_{L}$	$\bar{A}$	$A_{Base}$	$A_{L}$	$\bar{A}$	$A_{Base}$	$A_{L}$	$\bar{A}$
L2P [7]	$91.50 \pm 0.00$	$50.04 \pm 0.00$	$66.70 \pm 0.00$	$93.43 \pm 0.00$	$55.75 \pm 0.00$	$71.81 \pm 0.00$	$96.53 \pm 0.00$	$60.91 \pm 0.00$	$76.28 \pm 0.00$
CODA-Prompt [18]	$91.50 \pm 0.00$	$53.65 \pm 0.00$	$69.30 \pm 0.00$	$94.05 \pm 0.00$	$57.10 \pm 0.00$	$73.11 \pm 0.00$	$97.15 \pm 0.00$	$65.55 \pm 0.00$	$78.83 \pm 0.00$
InfLoRA [4]	$92.45 \pm 0.20$	$45.18 \pm 2.50$	$66.27 \pm 1.43$	$94.92 \pm 0.06$	$57.41 \pm 0.31$	$74.28 \pm 0.28$	$97.42 \pm 0.06$	$51.52 \pm 0.10$	$71.55 \pm 0.03$
SD-LoRA [11]	$91.92 \pm 0.00$	$56.28 \pm 0.00$	$70.87 \pm 0.00$	$94.60 \pm 0.00$	$73.51 \pm 0.00$	$78.42 \pm 0.00$	$97.72 \pm 0.00$	$79.07 \pm 0.00$	$84.36 \pm 0.00$
CPE-CLIP [39]	$80.21 \pm 0.85$	$63.32 \pm 0.17$	$69.37 \pm 0.37$	$88.32 \pm 0.04$	$79.99 \pm 0.18$	$83.38 \pm 0.10$	$90.14 \pm 0.05$	$81.55 \pm 0.04$	$85.54 \pm 0.04$
ASP [35]	$87.14 \pm 0.11$	$82.86 \pm 0.26$	$83.46 \pm 0.22$	$91.77 \pm 0.09$	$86.04 \pm 0.03$	$88.54 \pm 0.03$	$96.32 \pm 0.15$	$93.72 \pm 0.25$	$94.97 \pm 0.21$
PriViLege [34]	$82.21 \pm 0.35$	$75.08 \pm 0.52$	$77.50 \pm 0.33$	$90.88 \pm 0.20$	$86.06 \pm 0.32$	$88.08 \pm 0.20$	$96.68 \pm 0.06$	$94.10 \pm 0.13$	$95.27 \pm 0.11$
DLEPEM-MLP	$92.51 \pm 0.00$	$85.37 \pm 0.00$	$88.53 \pm 0.00$	$94.28 \pm 0.00$	$84.67 \pm 0.00$	$89.00 \pm 0.00$	$96.37 \pm 0.00$	$89.96 \pm 0.00$	$93.11 \pm 0.00$
DLEPEM-QV	$92.68 \pm 0.00$	$86.22 \pm 0.00$	$88.77 \pm 0.00$	$94.00 \pm 0.00$	$87.29 \pm 0.00$	$90.50 \pm 0.00$	$96.77 \pm 0.00$	$93.62 \pm 0.00$	$94.80 \pm 0.00$

Table 6. Ablation study on CIL and FSCIL tasks. The first two benchmarks follow CIL protocols, while the last two follow FSCIL protocols. Each metric reports MLP-Expert/QV-Expert performance.

Ablated Components	CIFAR100 (T = 10)		ImageNet-R (T = 5)		CIFAR100 (T = 9)		miniImageNet (T = 9)
Ablated Components	$A_{L}$	$\bar{A}$	$A_{L}$	$\bar{A}$	$A_{L}$	$\bar{A}$	$A_{L}$	$\bar{A}$
w/o Dynamic LoRA-Expert	83.13/86.72	88.85/90.82	61.17/72.37	74.31/80.14	73.03/79.73	80.39/86.59	84.26/91.11	89.90/93.31
w/o Prototype Ensemble	85.79/87.59	91.21/92.29	69.53/73.13	77.75/80.42	81.13/84.09	86.97/88.67	85.54/92.25	90.03/94.06
DLEPEM-MLP/QV	85.99/88.84	92.40/93.39	76.25/78.77	82.37/83.43	84.67/87.29	89.00/90.50	89.96/93.62	93.11/94.80

Table 7. Diagnostic accuracy of the classifier, router, and oracle expert selection on five CIL benchmarks. Each entry is reported as DLEPEM-MLP/DLEPEM-QV; the oracle expert setting uses the ground-truth task identity only for diagnostic analysis.

Metric	CIFAR100 ($T = 10$)	CUB200 ($T = 10$)	ImageNet-R ($T = 5$)	OmniBenchmark ($T = 10$)	VTAB ($T = 5$)
Classifier (CNN)	$92.40 / 93.39$	$92.31 / 92.09$	$82.37 / 83.43$	$81.60 / 82.16$	$91.84 / 91.11$
Router	$91.88 / 93.68$	$93.24 / 93.31$	$88.43 / 88.87$	$84.90 / 85.55$	$93.45 / 93.56$
LoRA-Expert (Oracle)	$98.04 / 97.81$	$97.15 / 96.31$	$89.17 / 88.81$	$92.88 / 91.48$	$97.66 / 95.88$
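
One way to read Table 7 is to compute how much accuracy separates the first row from the oracle row. The sketch below does this for the DLEPEM-QV entries; treating the first row as the accuracy obtained with the method's own expert selection is an interpretation of the caption, and the names `first_row`, `oracle`, and `selection_gap` are illustrative.

```python
# Accuracy (%) copied from Table 7, DLEPEM-QV entries only.
first_row = {
    "CIFAR100": 93.39, "CUB200": 92.09, "ImageNet-R": 83.43,
    "OmniBenchmark": 82.16, "VTAB": 91.11,
}
oracle = {
    "CIFAR100": 97.81, "CUB200": 96.31, "ImageNet-R": 88.81,
    "OmniBenchmark": 91.48, "VTAB": 95.88,
}


def selection_gap(with_routing, with_oracle):
    """Percentage points between oracle expert selection and the first row."""
    return {name: round(with_oracle[name] - with_routing[name], 2)
            for name in with_routing}


gaps = selection_gap(first_row, oracle)
largest = max(gaps, key=gaps.get)
print(gaps, largest)
```

Under this reading, the gap is between roughly four and five and a half points on four of the benchmarks and is largest on OmniBenchmark, which suggests that expert selection, rather than expert quality, leaves the most headroom there.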

Table 8. Training and inference time comparison. Training time is the average time per epoch for each incremental task, and inference latency is measured in milliseconds per image. All methods use the same ViT-B/16-IN21K backbone for a fair comparison.

Method	CIFAR100 (T = 10)		ImageNet-R (T = 5)
Method	Training Time (s)	Inference Time (ms)	Training Time (s)	Inference Time (ms)
L2P [7]	102.12	3.77	50.12	3.84
DualPrompt [8]	93.21	3.44	45.16	3.58
CODA-Prompt [18]	99.42	2.99	47.53	3.08
LAE [9]	46.87	3.35	24.26	3.52
InfLoRA [4]	72.36	2.03	34.96	2.24
APER [10]	16.04	3.63	8.92	3.71
SD-LoRA [11]	79.08	1.93	32.85	2.04
DLEPEM-MLP/QV	69.46/71.37	8.39/9.26	33.30/34.15	8.47/9.08
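
The inference latencies in Table 8 are easier to compare as throughput and as a ratio to the fastest method. The sketch below uses only the CIFAR100 column; the helper names `throughput` and `slowdown` are illustrative.

```python
# Inference latency in ms per image on CIFAR100 (T = 10), copied from Table 8.
latency_ms = {
    "L2P": 3.77, "DualPrompt": 3.44, "CODA-Prompt": 2.99, "LAE": 3.35,
    "InfLoRA": 2.03, "APER": 3.63, "SD-LoRA": 1.93,
    "DLEPEM-MLP": 8.39, "DLEPEM-QV": 9.26,
}


def throughput(ms_per_image):
    """Convert latency in milliseconds per image to images per second."""
    return 1000.0 / ms_per_image


fastest = min(latency_ms, key=latency_ms.get)
slowdown = {name: ms / latency_ms[fastest] for name, ms in latency_ms.items()}

for name in ("DLEPEM-MLP", "DLEPEM-QV"):
    print(name, round(throughput(latency_ms[name]), 1), round(slowdown[name], 2))
```

On these numbers both DLEPEM variants are between four and five times slower per image than the fastest baseline, while still processing more than 100 images per second.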

Share and Cite

MDPI and ACS Style

Zhao, H.; Liu, R.; Liu, Y. Dynamic LoRA-Experts and Prototype-Ensemble Matching for Class-Incremental Learning. Appl. Sci. 2026, 16, 6153. https://doi.org/10.3390/app16126153
