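As demonstrated by HiDe-Prompt [25], CIL methods employing multi-module selection can be decomposed into two probabilistic components: module-identity inference (MII) and within-task prediction (WTP), represented by $P(x \in \mathcal{X}_{i} \mid \mathcal{D}, \theta)$ and $P(x \in \mathcal{X}_{i,j} \mid x \in \mathcal{X}_{i}, \mathcal{D}, \theta)$, respectively. By Bayes’ theorem, we have
$$P(x \in \mathcal{X}_{i,j} \mid \mathcal{D}, \theta) = P(x \in \mathcal{X}_{i,j} \mid x \in \mathcal{X}_{i}, \mathcal{D}, \theta)\, P(x \in \mathcal{X}_{i} \mid \mathcal{D}, \theta). \tag{5}$$
Letting $\bar{i}$ and $\bar{j}$ denote the ground-truth task index and class label for input $x$, Equation (5) implies that improving either WTP accuracy, $P(x \in \mathcal{X}_{\bar{i},\bar{j}} \mid x \in \mathcal{X}_{\bar{i}}, \mathcal{D}, \theta)$, or MII accuracy, $P(x \in \mathcal{X}_{\bar{i}} \mid \mathcal{D}, \theta)$, directly enhances overall prediction performance. However, existing approaches suffer from two limitations: (1) iterative updates or the fusion of new and existing modules progressively deteriorate WTP [4,7,8,9]; and (2) MII performance, when relying solely on pre-trained features [9,12], is inherently constrained by the similarity between pre-training and downstream data distributions. To address these challenges, we propose DLEPEM, which explicitly enhances both WTP and MII via two complementary innovations: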
To exploit the strong generalization of the pre-trained model, we keep its weights $\theta_0$ fixed throughout training. To maintain plasticity and safeguard WTP, we dynamically allocate a dedicated LoRA-Expert for each incremental task, embedded within an MoE framework. Each new LoRA-Expert is trained exclusively on its respective task while previously introduced experts remain frozen. This ensures isolated task-specific adaptation with a small number of trainable parameters, leveraging LoRA’s efficiency to effectively capture discriminative features.
Figure 1a depicts expert training, while Figure 1b details the internal structure. This expert life cycle is the main difference from conventional MoE-based LoRA continual learning. DLEPEM does not repeatedly update a shared expert pool or combine all experts through a soft gate during feature computation. Instead, for task $t$, only $E_t$ receives gradients, whereas $E_1, \dots, E_{t-1}$ remain frozen. This design directly reduces cross-task parameter interference and protects WTP for previously learned tasks.
After training each incremental task, we extract prototypes using two sources: (1) the fixed PTM for generalizable features, and (2) the router-enhanced LoRA-Expert for domain-specific nuances. Unlike prior methods that rely solely on pre-trained representations, DLEPEM combines these prototypes and associates them with their corresponding LoRA-Experts. During inference, a nearest-neighbor search in this combined prototype space determines the most appropriate expert, substantially improving MII. Because only class prototypes are stored rather than raw samples, this design also mitigates the privacy concerns inherent to rehearsal-based strategies.
Figure 1c illustrates our complete prototype-ensemble matching mechanism. The ensemble key for a class is constructed as the concatenation of a frozen-PTM prototype and a router-domain prototype. The former provides stable general semantics, while the latter captures downstream task-specific cues. This complements dynamic expert allocation: the task expert improves WTP once selected, and the prototype-ensemble dictionary improves MII by selecting the appropriate expert without storing raw rehearsal samples.
4.1. Dynamic LoRA-Experts
We integrate the proposed LoRA-Expert modules into the Vision Transformer (ViT) architecture [26]. In ViT, an input image is first divided into fixed-size patches, linearly projected, and augmented with positional embeddings before being processed by a Transformer encoder comprising multi-head attention (MHA) layers and multilayer perceptrons (MLPs).
To adaptively capture task-specific features in incremental learning, we dynamically introduce a LoRA-Expert at each incremental stage. This module can be inserted either as a parallel branch to the MLP (MLP-Expert, see Figure 1b) or into the attention projections of the MHA module. When LoRA is applied to the query and value projections $W_q$ and $W_v$, we refer to this variant as the QV-Expert configuration.
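The modified forward computations for these components are given by
$$x_{\text{out}} = \mathrm{MLP}(x_{\text{in}}) + E_t(x_{\text{in}}) \quad \text{(MLP-Expert)}, \tag{6}$$
$$x_{\text{out}} = \mathrm{Attn}\big(x_{\text{in}} W_q + E_t^{q}(x_{\text{in}}),\; x_{\text{in}} W_k,\; x_{\text{in}} W_v + E_t^{v}(x_{\text{in}})\big) \quad \text{(QV-Expert)}, \tag{7}$$
where $x_{\text{in}}$ and $x_{\text{out}}$ are the inputs and outputs of the original module, respectively. Here, $E_t$ denotes the task-$t$ LoRA-Expert module rather than a complete, independent ViT. It is a group of LoRA adapters inserted into the selected branch or projection (with $E_t^{q}$ and $E_t^{v}$ its adapters on the query and value projections), and each historical expert $E_s$, $s < t$, is frozen after its task has been learned. The attention operation is defined as:
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \tag{8}$$
where $d_k$ is the key dimension. Here, $V$ denotes the attention value matrix and is distinct from the LoRA matrices introduced below. For $N$ tokens, the softmax term is an attention-weight matrix in $\mathbb{R}^{N \times N}$, so $\mathrm{softmax}(Q K^{\top}/\sqrt{d_k})\, V$ is the standard weighted sum over value vectors. The multi-head extension is omitted for clarity. Each LoRA-Expert shares the same architecture but learns distinct parameters. Under the row-vector convention, for an input $x \in \mathbb{R}^{1 \times d_{\text{in}}}$, one adapter consists of low-rank matrices $A \in \mathbb{R}^{d_{\text{in}} \times r}$ and $B \in \mathbb{R}^{r \times d_{\text{out}}}$, producing:
$$\Delta(x) = x A B. \tag{9}$$
In the default ViT-B/16 QV-Expert setting, $d_{\text{in}} = 768$ and $d_{\text{out}} = 768$; for the MLP-Expert variant, $d_{\text{in}}$ and $d_{\text{out}}$ correspond to the inserted MLP branch dimensions.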
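As a minimal sketch of one adapter under the row-vector convention, the following pure-Python example computes the low-rank update for a small batch of token vectors. The names `init_lora` and `lora_delta`, the toy dimensions, and the specific Kaiming-uniform bound are assumptions made for illustration; the sketch omits any scaling factor and is not the authors' implementation.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def init_lora(d_in, d_out, rank, seed=0):
    """Return (A, B): A is random and B is zero, so the adapter starts as a no-op."""
    rng = random.Random(seed)
    bound = (6.0 / d_in) ** 0.5  # Kaiming-uniform bound for gain sqrt(2) (assumed variant)
    A = [[rng.uniform(-bound, bound) for _ in range(rank)] for _ in range(d_in)]
    B = [[0.0] * d_out for _ in range(rank)]
    return A, B

def lora_delta(x, A, B):
    """Low-rank update x @ A @ B for a batch of row vectors x (N x d_in)."""
    return matmul(matmul(x, A), B)

# Toy sizes for readability; the ViT-B/16 QV setting would use d_in = d_out = 768.
A, B = init_lora(d_in=8, d_out=8, rank=2)
x = [[1.0] * 8, [0.5] * 8]
delta = lora_delta(x, A, B)  # all zeros at initialization because B is zero
```

With rank 2 and 8-dimensional inputs, the adapter holds 2 × (8 + 8) = 32 parameters instead of the 64 of a full 8 × 8 matrix, which illustrates why a per-task expert stays small.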
In the ViT setting, we denote by $f_{E_t}(x)$ the output embeddings produced by the PTM equipped with the task-specific expert $E_t$. For the first incremental task, we follow the standard LoRA initialization [22], setting $B$ to zero and initializing $A$ via Kaiming initialization [27]. For subsequent tasks, we initialize each new LoRA-Expert by copying the weights from the preceding expert, then fine-tune it while keeping all previous experts frozen. As illustrated in Figure 1a, this strategy ensures that each task-specific expert adapts independently, preserving knowledge from prior stages. This design provides a direct stability–plasticity mechanism. Let $\theta_t$ denote the LoRA parameters of the expert allocated to task $t$, while the PTM weights $\theta_0$ remain frozen. After task $s$ has been learned, its old expert parameters $\theta_s$ are not optimized when learning any later task $t > s$; hence,
$$\theta_s^{(t)} = \theta_s^{(s)}, \quad \forall\, t > s, \tag{10}$$
where $\theta_s^{(t)}$ denotes the value of $\theta_s$ after training on task $t$. Therefore, old experts remain parameter-stationary during subsequent training, which reduces forgetting caused by repeated modification of shared PEFT parameters. Meanwhile, the current expert $E_t$ is optimized with the cross-entropy objective in Equation (18) without imposing gradient-orthogonality or update-magnitude shrinking constraints, preserving plasticity for newly introduced classes. The remaining source of old-task degradation is mainly expert retrieval error, which is addressed by the prototype-ensemble-matching mechanism.
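The expert life cycle described above (warm start from the preceding expert, then freezing) can be sketched as follows. The class `ExpertPool`, its methods, and the flat parameter lists are hypothetical stand-ins chosen for illustration; a real implementation would hold LoRA tensors and rely on an optimizer rather than a hand-written update.

```python
import copy

class ExpertPool:
    """Toy container: one LoRA parameter dict per task, earlier ones frozen."""

    def __init__(self):
        self.experts = []    # experts[k] holds the parameters of the (k+1)-th task's expert
        self.trainable = []  # True only for the expert currently being learned

    def allocate(self, fresh_params):
        """Add an expert for the next task and return its index."""
        if self.experts:
            params = copy.deepcopy(self.experts[-1])   # warm start from the preceding expert
        else:
            params = copy.deepcopy(fresh_params)       # first task: standard LoRA init
        self.trainable = [False] * len(self.experts)   # every earlier expert is frozen
        self.experts.append(params)
        self.trainable.append(True)
        return len(self.experts) - 1

    def sgd_step(self, grads, lr=0.1):
        """Apply one gradient step; frozen experts ignore their gradients."""
        for params, grad, on in zip(self.experts, grads, self.trainable):
            if on:
                for name in params:
                    params[name] = [w - lr * g for w, g in zip(params[name], grad[name])]

pool = ExpertPool()
pool.allocate({"A": [0.3, -0.2], "B": [0.0, 0.0]})
pool.sgd_step([{"A": [1.0, 1.0], "B": [1.0, 1.0]}])       # task 1 training
snapshot = copy.deepcopy(pool.experts[0])
pool.allocate({"A": [0.0, 0.0], "B": [0.0, 0.0]})          # task 2 copies expert 1
pool.sgd_step([{"A": [1.0, 1.0], "B": [1.0, 1.0]}] * 2)   # gradients offered to both experts
```

After the second step, the first expert still equals `snapshot` while the second has moved away from it, which is the parameter-stationarity property stated in the text.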
4.2. Prototype-Ensemble Matching Mechanism
When the distribution gap between the pre-trained model (PTM) and the downstream dataset is small, generalized features extracted by the frozen PTM can enable effective module-sample matching. However, since the relationship between pre-training and downstream distributions is generally unknown, it is essential to adaptively capture domain-specific features from the downstream task.
To tackle this, we introduce a dynamically updated router that extracts domain-specific features. The router shares the same architecture as the LoRA-Experts but is continuously updated across incremental tasks.
Under typical incremental constraints, where only data from the current task is available, directly fine-tuning the router leads to overfitting, causing it to route all samples to the most recent LoRA-Expert. To mitigate this, we propose dual feature distillation mechanisms that jointly enforce stability (retaining prior knowledge) and plasticity (adapting to new tasks), regularizing router updates to ensure robust generalization.
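1. Plasticity feature distillation. To encourage the router to learn category-specific features of the current task, we first compute the LoRA-Expert prototype for each class $i \in \mathcal{Y}_t$, where $\mathcal{Y}_t$ is the class set of the current task with training data $\mathcal{D}_t = \{(x_n, y_n)\}$:
$$p_i^{E} = \frac{1}{N_i} \sum_{n=1}^{|\mathcal{D}_t|} \mathbb{1}(y_n = i)\, f_{E_t}(x_n). \tag{11}$$
Here, $N_i$ is the number of current-task samples belonging to class $i$, and $\mathbb{1}(\cdot)$ denotes the indicator function. We then align the router feature $f_{R_t}(x_n)$ of each current-task sample with the LoRA-Expert prototype of its ground-truth class using a KL-divergence objective:
$$\mathcal{L}_{\text{pla}} = \frac{1}{|\mathcal{D}_t|} \sum_{n=1}^{|\mathcal{D}_t|} \mathrm{KL}\big(\sigma(p_{y_n}^{E}/\tau) \,\big\|\, \sigma(f_{R_t}(x_n)/\tau)\big). \tag{12}$$
Here, $\sigma(z) = \mathrm{softmax}(z)$, and temperature scaling is written explicitly as $\sigma(z/\tau)$, where $\tau$ is the temperature parameter. Thus, $\sigma$ in the distillation losses denotes the same softmax function as $\mathrm{softmax}$ in the attention equation. These class-specific prototypes guide the router toward discriminative features of the current task, ensuring effective adaptation for new tasks.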
2. Stability feature distillation. To mitigate the degradation of previously learned knowledge, we introduce a stability constraint that encourages the current router $R_t$ to mimic the feature distribution produced by the previous router $R_{t-1}$ on the current-task inputs:
$$\mathcal{L}_{\text{sta}} = \frac{1}{|\mathcal{D}_t|} \sum_{n=1}^{|\mathcal{D}_t|} \mathrm{KL}\big(\sigma(f_{R_{t-1}}(x_n)/\tau) \,\big\|\, \sigma(f_{R_t}(x_n)/\tau)\big). \tag{13}$$
The overall router loss combines these objectives,
$$\mathcal{L}_{R} = \lambda\, \mathcal{L}_{\text{pla}} + (1 - \lambda)\, \mathcal{L}_{\text{sta}}, \tag{14}$$
where $\lambda$ weights the plasticity and stability distillation terms for the router. For the first task, $\mathcal{L}_{\text{sta}}$ is omitted because no previously trained router exists. The router parameters $\theta_R$ are optimized by minimizing this combined loss function. The default value of $\lambda$ weights one term more heavily than the other; however, this coefficient controls router regularization only and should not be interpreted as proof of an optimal global stability–plasticity balance. The main source of new-task plasticity remains the trainable current LoRA-Expert, while frozen old experts provide parameter-level stability.
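The two distillation terms and their combination can be illustrated per sample with the sketch below. The function names, the teacher-to-student KL direction, the convex weighting by `lam`, and the placeholder values `lam=0.5` and `tau=2.0` are all assumptions made for this example rather than settings confirmed by the text.

```python
import math

def softmax(z, tau=1.0):
    """Temperature-scaled softmax sigma(z / tau), numerically stabilised."""
    scaled = [v / tau for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def router_loss(router_feat, class_proto, prev_router_feat=None, lam=0.5, tau=2.0):
    """Per-sample router loss: plasticity plus (optionally) stability distillation."""
    student = softmax(router_feat, tau)
    plasticity = kl(softmax(class_proto, tau), student)       # pull toward the class prototype
    if prev_router_feat is None:
        stability = 0.0                                       # first task: no previous router
    else:
        stability = kl(softmax(prev_router_feat, tau), student)
    return lam * plasticity + (1.0 - lam) * stability         # assumed convex weighting

feat = [1.0, 2.0, 3.0]
same = router_loss(feat, feat, feat)                 # identical distributions give zero loss
first_task = router_loss(feat, [3.0, 2.0, 1.0])      # stability term omitted
```

When the router feature already matches both its class prototype and the previous router's output, the loss vanishes; any mismatch with either target makes it strictly positive.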
At the end of each incremental stage, we compute general prototypes $p_i^{G}$ using the frozen PTM $f_0$ and domain-specific prototypes $p_i^{D}$ using the current router $R_t$ for each newly introduced class $i \in \mathcal{Y}_t$, as defined in Equation (11). The prototype dictionary is updated in an append-only manner: old entries for classes in $\mathcal{Y}_{1:t-1}$ are retained as historical keys and are not recomputed with later-task data. The new ensemble keys created at stage $t$ are
$$\big\{\, k_i = [\,p_i^{G};\, p_i^{D}\,] \;:\; i \in \mathcal{Y}_t \,\big\}. \tag{15}$$
Equivalently, we define the class-$i$ ensemble prototype vector as $p_i^{\text{ens}} = [\,p_i^{G};\, p_i^{D}\,]$, and use it as the dictionary key $k_i = p_i^{\text{ens}}$. The semicolon in $[\,p_i^{G};\, p_i^{D}\,]$ denotes vector concatenation, not matrix addition or summation. Each new key $k_i$ is appended to Dict, together with the expert index $e_i = t$, which points to the corresponding expert $E_t$. Thus, after task $t$, the dictionary covers all seen classes $\mathcal{Y}_{1:t}$, while only entries for $\mathcal{Y}_t$ are newly created. This keeps DLEPEM rehearsal-free: the persistent state consists of frozen LoRA-Experts, the current router, and compact ensemble keys, but no raw samples, old mini-batches, feature buffers, or exemplar sets.
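A minimal sketch of the append-only dictionary update is given below. It assumes that features have already been extracted by the frozen PTM and the current router; the helper names `class_means` and `append_keys`, and the use of a plain Python `dict` keyed by class label, are illustrative choices rather than the authors' data structure.

```python
def class_means(features, labels):
    """Average the feature vectors of each class to obtain class prototypes."""
    sums, counts = {}, {}
    for feat, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(feat))
        for j, value in enumerate(feat):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def append_keys(dictionary, task_id, ptm_feats, router_feats, labels):
    """Add one (key, expert index) entry per new class; old entries are never modified."""
    general = class_means(ptm_feats, labels)      # prototypes from the frozen PTM
    domain = class_means(router_feats, labels)    # prototypes from the current router
    for cls in general:
        if cls in dictionary:
            raise ValueError(f"class {cls} already has a key")
        # List concatenation plays the role of the semicolon in the ensemble key.
        dictionary[cls] = (general[cls] + domain[cls], task_id)
    return dictionary

keys = {}
append_keys(keys, 1, [[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0], [0.0, 4.0]], [0, 0])
append_keys(keys, 2, [[0.0, 1.0]], [[5.0, 5.0]], [1])
```

Only prototype vectors and expert indices are stored, so the state grows by one key per class and contains no raw samples.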
Because the router is updated across tasks, router-domain keys from older stages may have been generated by earlier router states. DLEPEM mitigates this router-state drift in two ways. First, each key includes the frozen-PTM component $p_i^{G}$, which remains comparable across stages because the PTM is fixed. Second, stability feature distillation in Equation (13) encourages $R_t$ to preserve the behavior of $R_{t-1}$ while adapting to the current task.
Our mechanism dynamically selects the most suitable LoRA-Expert for each input by leveraging a key-value association strategy inspired by L2P [7]. Specifically, we maintain a dictionary where each seen class $i \in \mathcal{Y}_{1:t}$ has one ensemble key $k_i$ and an associated expert index $e_i$:
$$\mathrm{Dict} = \big\{ (k_i, e_i) \;:\; i \in \mathcal{Y}_{1:t} \big\}. \tag{16}$$
Here, $e_i$ identifies the task-specific LoRA-Expert associated with class $i$.
During inference, given an input $x$, we construct an ensemble query $q(x) = [\,f_0(x);\, f_{R_t}(x)\,]$, where $f_0(x)$ is the feature from the frozen PTM and $f_{R_t}(x)$ is the domain-specific feature from the router. We then identify the nearest key by cosine similarity:
$$k^{*} = \arg\max_{k_i:\; i \in \mathcal{Y}_{1:t}} \cos\big(q(x),\, k_i\big). \tag{17}$$
The retrieved key $k^{*}$ returns an expert identity rather than a direct class prediction. Let $t^{*}$ denote the task index associated with $k^{*}$ in $\mathrm{Dict}$; DLEPEM then selects $E_{t^{*}}$ for within-task prediction. Final classification is restricted to the class set $\mathcal{Y}_{t^{*}}$ associated with the selected expert and uses the corresponding prototype weights. Thus, all stored ensemble keys participate in module-identity inference, while prototype weights from unrelated experts are not mixed during classification.
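The retrieval-then-classify step can be sketched as below. The dictionary layout (class label mapped to a key and an expert index), the per-expert prototype tables, and the callable `expert_feature`, which stands in for a forward pass through the PTM with the selected expert, are hypothetical choices for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_expert(query, dictionary):
    """Return the expert index stored with the key most similar to the query."""
    best_cls = max(dictionary, key=lambda c: cosine(query, dictionary[c][0]))
    return dictionary[best_cls][1]

def predict(x_ptm, x_router, dictionary, expert_feature, expert_protos):
    """Route with the ensemble query, then classify among the chosen expert's classes."""
    task = select_expert(x_ptm + x_router, dictionary)   # concatenated query
    feat = expert_feature(task)                          # feature from the selected expert
    protos = expert_protos[task]                         # {class: prototype} of that expert only
    return max(protos, key=lambda c: cosine(feat, protos[c]))

dictionary = {
    0: ([1.0, 0.0, 1.0, 0.0], 1),
    1: ([0.0, 1.0, 1.0, 0.0], 1),
    2: ([0.0, 0.0, 0.0, 1.0], 2),
}
expert_protos = {1: {0: [1.0, 0.0], 1: [0.0, 1.0]}, 2: {2: [1.0, 1.0]}}
features = {1: [0.9, 0.1], 2: [1.0, 1.0]}
label_a = predict([0.9, 0.1], [0.8, 0.0], dictionary, lambda t: features[t], expert_protos)
label_b = predict([0.0, 0.0], [0.0, 1.0], dictionary, lambda t: features[t], expert_protos)
```

Every key takes part in expert selection, but the final `max` runs only over the prototypes of the selected expert, mirroring the restriction described in the text.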
4.3. Optimization Objective and Training Procedure
DLEPEM employs a two-stage training paradigm: it first learns Dynamic LoRA-Expert modules and then optimizes the router for effective module-sample matching. Algorithm 1 provides the complete training procedure, and Algorithm 2 gives the corresponding task-agnostic inference procedure.
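1. Dynamic LoRA-Expert learning: The objective function for training the LoRA-Expert is
$$\mathcal{L}_{\text{CE}} = \frac{1}{|\mathcal{D}_t|} \sum_{(x_n, y_n) \in \mathcal{D}_t} \ell\big(h_t(f_{E_t}(x_n)),\, y_n\big), \tag{18}$$
where $\ell$ denotes the cross-entropy loss, $f_{E_t}$ represents the PTM equipped with the LoRA-Expert $E_t$, and $h_t$ is the classifier trained jointly with it (Algorithm 1). During inference, DLEPEM uses LoRA-Expert-derived prototype weights for classification. Specifically, $W_t$ denotes the prototype–weight matrix formed by the class prototypes $p_i^{E}$ defined in Equation (11). Classification is performed using cosine similarity,
$$\hat{y} = \arg\max_{i \in \mathcal{Y}_{t^{*}}} \cos\big(f_{E_{t^{*}}}(x),\, W_{t^{*}}^{(i)}\big), \tag{19}$$
where $E_{t^{*}}$ denotes the selected LoRA-Expert, and $W_{t^{*}}$ contains the prototype weights associated with this expert, with $W_{t^{*}}^{(i)} = p_i^{E}$ the entry for class $i$.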
Algorithm 1 Training Procedure for DLEPEM
1: Input: Pre-trained model $f_0$, incremental datasets $\{\mathcal{D}_t\}_{t=1}^{T}$, LoRA rank $r$, distillation coefficient $\lambda$.
2: Output: Trained LoRA-Experts $\{E_t\}_{t=1}^{T}$, Router $R_T$, Prototype Dictionary Dict.
3: Initialize: $\mathrm{Dict} \leftarrow \emptyset$, router $R_0$ with random weights.
4: for $t = 1$ to $T$ do
5:     # Stage 1: Dynamic LoRA-Expert Learning
6:     Initialize LoRA-Expert $E_t$. If $t > 1$, copy weights from $E_{t-1}$.
7:     Train $E_t$ and classifier on $\mathcal{D}_t$ using $\mathcal{L}_{\text{CE}}$ (Equation (18)).
8:     Freeze parameters of $E_t$.
9:     # Stage 2: Router Learning and Prototype-Ensemble Building
10:     If $t > 1$, freeze a copy of the previous router $R_{t-1}$.
11:     Train router $R_t$ on $\mathcal{D}_t$ using $\mathcal{L}_{R}$ (Equation (14)).
12:     # Append Prototype Dictionary with newly introduced classes
13:     Keep old dictionary entries for classes in $\mathcal{Y}_{1:t-1}$ unchanged.
14:     for each newly introduced class $i \in \mathcal{Y}_t$ do
15:         Compute general prototype $p_i^{G}$ using frozen PTM $f_0$ and samples from $\mathcal{D}_t$.
16:         Compute domain-specific prototype $p_i^{D}$ using router $R_t$ and samples from $\mathcal{D}_t$.
17:         Create ensemble prototype $k_i = [\,p_i^{G};\, p_i^{D}\,]$ (Equation (15)).
18:         Append an entry to Dict by associating key $k_i$ with expert index $e_i = t$ (Equation (16)).
19:     end for
20: end for
21: return $\{E_t\}_{t=1}^{T}$, $R_T$, Dict.
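Algorithm 2 Inference Procedure for DLEPEM
1: Input: Test sample $x$, frozen PTM $f_0$, frozen LoRA-Experts $\{E_t\}_{t=1}^{T}$, current router $R_T$, prototype dictionary Dict, and prototype weights $\{W_t\}_{t=1}^{T}$.
2: Output: Predicted label $\hat{y}$.
3: Compute frozen-PTM feature $f_0(x)$.
4: Compute router-domain feature $f_{R_T}(x)$.
5: Build the ensemble query $q(x) = [\,f_0(x);\, f_{R_T}(x)\,]$.
6: Retrieve the nearest dictionary key $k^{*} = \arg\max_{k_i} \cos(q(x), k_i)$.
7: Obtain the expert index $t^{*}$ associated with $k^{*}$ in Dict.
8: Extract the selected-expert feature $f_{E_{t^{*}}}(x)$.
9: Restrict candidate labels to $\mathcal{Y}_{t^{*}}$ and predict $\hat{y} = \arg\max_{i \in \mathcal{Y}_{t^{*}}} \cos\big(f_{E_{t^{*}}}(x),\, W_{t^{*}}^{(i)}\big)$.
10: return $\hat{y}$.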
2. Router learning: The router is optimized with the objective function shown in Equation (14).
As illustrated in Figure 1, we divide the class-incremental learning process into two stages. First, a new LoRA-Expert is learned for each incremental task to capture task-specific features; each expert is categorized as either an MLP-Expert or a QV-Expert depending on its insertion point. Second, we introduce a prototype-ensemble-matching mechanism that captures both general and domain-specific features, thereby improving module-sample matching. Notably, DLEPEM’s components are orthogonal to many existing approaches and can be integrated with them straightforwardly. For clarity, Table 1 summarizes the main symbols used throughout the paper.