Article

A Unified Knowledge Management Framework for Continual Learning and Machine Unlearning in Large Language Models

1
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
2
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
*
Author to whom correspondence should be addressed.
Information 2026, 17(3), 238; https://doi.org/10.3390/info17030238
Submission received: 22 January 2026 / Revised: 24 February 2026 / Accepted: 25 February 2026 / Published: 1 March 2026
(This article belongs to the Special Issue Learning and Knowledge: Theoretical Issues and Applications)

Abstract

Large language models (LLMs) are increasingly deployed as information systems that evolve over time, where managing internal knowledge—acquisition, retention, and removal—becomes essential. In practice, these processes are primarily realized through continual learning and machine unlearning mechanisms. Despite this, these two mechanisms are often studied in isolation, limiting both interpretability and controllability. In this work, we present a parameter-efficient knowledge management framework where continual learning and machine unlearning—despite employing distinct task-specific objectives—are integrated through a shared retention-controlled parameter evolution mechanism. We ground these structural constraints in a drift-aware design principle: under a model smoothness assumption, we establish a formal upper bound showing that Kullback–Leibler (KL) divergence on retained knowledge is controlled by the magnitude and direction of parameter updates, providing a principled rationale for combining Low-Rank Adaptation (LoRA) freezing, sparse masking, and orthogonal gradient projection into a unified constraint system. Experiments on the Task of Fictitious Unlearning (TOFU) benchmark and real-world benchmarks demonstrate effective knowledge acquisition, selective removal, and robust retention across sequential tasks with strong overall performance and stability. This work provides a practical parameter-efficient recipe and a drift-aware design principle validated on controlled interleaved benchmarks, offering insights toward reliable knowledge management in evolving deployment scenarios.

1. Introduction

Large language models (LLMs) are increasingly deployed in real-world applications such as conversational agents, decision support systems, and personalized assistants, where they interact with evolving data streams, user feedback, and regulatory requirements. In such settings, managing the internal knowledge of LLMs—how knowledge is acquired, retained, and removed over time—has emerged as a critical challenge for the reliable and responsible deployment of large-scale Artificial Intelligence (AI) systems [1,2]. In this work, we use “knowledge management” to refer to the explicit organization and control of these acquisition, retention, and removal processes across the entire lifecycle of a deployed LLM, linking low-level parameter updates to high-level requirements such as controlled adaptation, domain shift handling, and desired model behavior.
In practice, the acquisition and removal of knowledge in neural models are primarily addressed through two research paradigms: continual learning and machine unlearning. Continual learning aims to enable models to incrementally acquire new knowledge from sequential tasks or data distributions without catastrophically degrading previously learned capabilities [3,4]. In contrast, machine unlearning focuses on the controlled removal of specific data or knowledge from a trained model, often motivated by privacy regulations, data ownership concerns, or error correction [2,5]. Despite addressing complementary aspects of the knowledge lifecycle, these two paradigms have largely been studied in isolation, with distinct objectives, evaluation protocols, and algorithmic designs.
The joint treatment of these two paradigms is captured by the Continual Learning and Unlearning (CLU) framework [6,7], which models the knowledge lifecycle of deployed systems through interleaved sequences of learning and unlearning operations. Critically, machine unlearning within the CLU framework extends beyond mere data removal or output suppression: it requires targeted modifications to the model’s deep structural parameters—including weight matrices and their low-rank factorizations—in order to selectively eliminate encoded knowledge while preserving the parametric structures responsible for retained capabilities. This parameter-level perspective highlights that effective unlearning must navigate the entangled representations in deep neural networks, where knowledge is distributed across interconnected layers and cannot be simply excised without risking collateral degradation of other learned functionalities.
This separation obscures the intrinsic relationship between learning and unlearning as two directions of knowledge dynamics within a single model. From a knowledge management perspective, both continual learning and machine unlearning operate on the same underlying knowledge space, differing only in whether the objective is to incorporate or remove information. Treating them as unrelated problems limits our theoretical understanding of how model knowledge evolves under sequential updates and hinders the development of unified mechanisms for controlling model behavior. In particular, the lack of a common formulation makes it difficult to reason about trade-offs between knowledge retention, adaptability, and selective forgetting in scenarios involving interleaved task sequences.
While recent studies have explored unified perspectives on CLU through gradient-based optimization in small-scale discriminative models [7,8], these approaches primarily focus on model-centric optimization. In contrast, we propose a fundamentally different perspective by framing CLU as a knowledge management problem, where the central objective is to systematically control what knowledge is retained, acquired, and removed throughout a model’s lifecycle. From this knowledge-centric view, we develop a practical framework grounded in drift-aware design principles for large-scale generative language models. While continual learning and machine unlearning employ distinct task-specific objectives (supervised fine-tuning and gradient ascent, respectively), we integrate them through a shared retention-controlled parameter evolution mechanism. Specifically, we use Kullback–Leibler (KL) divergence as a design principle to characterize distributional drift on retained knowledge, and derive parameter-space structural constraints that provably bound this drift. This enables drift-aware parameter-space approximations that govern stability–plasticity trade-offs without requiring explicit distributional measurements, offering both conceptual clarity and practical scalability for large-scale deployments.
Compared with prior CLU formulations that are predominantly model-centric and optimization-driven, our framework is explicitly knowledge-centric and drift-aware. The unification is operational rather than loss-level: learning and unlearning are governed by the same retention-controlled parameter constraints and drift-control principle, organized around information-theoretic distributional drift rather than task labels or gradient-based heuristics.
Based on the drift-aware conceptual framework, we operationalize the KL-minimization objective through a suite of parameter-space approximations. These structural choices serve as computationally efficient proxies for Equation (4), enabling controlled knowledge evolution without explicit distributional measurements or base model modification. Our approach leverages low-rank adaptation to localize knowledge updates, while combining parameter freezing, sparsity constraints, and orthogonal gradient projection to structurally constrain parameter updates and suppress interference with retained knowledge. By grounding these structural choices in the principle of controlling distributional drift—where KL divergence serves as the conceptual characterization and design motivation—we obtain a parameter-space drift control approximation that operates without explicit distributional KL computation. This KL-inspired parameter-space strategy enables scalable and incremental learning and unlearning operations suitable for large-scale models in interleaved update scenarios.
The main contributions of this work are summarized as follows:
  • We develop a parameter-efficient CLU method that combines Low-Rank Adaptation (LoRA) [9] freezing, magnitude-based sparse masking, and orthogonal gradient projection into a unified structural constraint system, achieving state-of-the-art stability–plasticity balance across interleaved learning-unlearning sequences on 4B- and 8B-scale LLMs.
  • We ground these structural choices in a drift-aware design principle based on KL divergence, establishing a formal upper bound (Theorem A1) that decomposes distributional drift into update magnitude and direction terms. This provides a principled explanation for why magnitude-controlling constraints (freezing, sparsity) yield the largest individual gains, while direction control (orthogonal projection) provides crucial cumulative-drift mitigation in longer sequences.
  • We provide systematic experimental evidence including behavioral metrics, token-level distributional drift analysis, and ablation studies that jointly validate the method’s effectiveness and the design principle’s explanatory power on controlled interleaved CLU benchmarks.

2. Related Work

2.1. Continual Learning

The primary challenge of continual learning for intelligent systems lies in enabling models to acquire new knowledge while retaining previously learned knowledge under a sequential task setting. This requires finding an optimal trade-off between plasticity (the ability to learn new knowledge) and stability (the ability to preserve old knowledge), so as to mitigate the problem of catastrophic forgetting during continual learning.
Existing mainstream continual learning methods can be broadly categorized into the following three classes:

2.1.1. Regularization-Based Methods

Regularization-based methods preserve previously learned knowledge by explicitly introducing regularization terms into the loss function, thereby constraining parameter updates during training on new tasks. Specifically, these methods limit changes to parameters that are deemed important for previous tasks. A key challenge lies in how to quantify the importance of each parameter.
Elastic Weight Consolidation (EWC) [10] estimates parameter importance using the Fisher Information Matrix (FIM), leveraging second-order statistics of the loss with respect to model parameters to identify those critical to past tasks. Memory Aware Synapses (MAS) [11] measures parameter importance based on the sensitivity of the model’s output L2 norm to parameter perturbations. Synaptic Intelligence (SI) [12] tracks parameter updates throughout training and evaluates their contribution to the loss reduction to compute importance scores. Riemannian Walk (RWalk) [13] combines the advantages of EWC and SI by introducing concepts from information geometry, modeling the curvature of different tasks in parameter space through a Riemannian metric.

2.1.2. Replay-Based Methods

Replay-based methods mitigate forgetting by maintaining a representative subset of past data, often referred to as a coreset, to preserve data distributional characteristics [14,15]. The process of selecting such representative samples is known as coreset selection. Since finding an optimal subset is an NP-hard problem, early approaches relied on heuristic strategies to approximate the original data distribution.
More recent studies propose generating representative samples through optimization rather than selecting them directly from the original dataset [16,17]. These approaches, commonly referred to as dataset distillation or dataset condensation, aim to compress large-scale datasets into a compact set of synthetic samples that retain the essential information of the original data.

2.1.3. Structure-Based Methods

While regularization-based and replay-based approaches update knowledge within a shared parameter space, structure-based methods allocate task-specific parameter subspaces for incremental learning [18,19]. During inference, only the neurons, parameters, or network branches associated with the relevant task are activated. Because parameters across tasks are isolated, these methods typically require a task identification step at inference time to determine which task a given input belongs to before invoking the corresponding parameters or modules.

2.2. Machine Unlearning

The goal of machine unlearning is to remove the influence of specific data from a trained model without significantly degrading its overall performance. Existing research can be broadly categorized based on the level of intervention into the following two paradigms:

2.2.1. Removal-Intended Methods

Removal-intended methods aim to negate the effect of the data to be forgotten by modifying the training process. Gradient Ascent (GA)-based approaches achieve unlearning by applying reversed gradients or selectively fine-tuning on targeted sample sets [20,21]. Variants such as Negative Preference Optimization (NPO) and second-order methods [22,23] further improve optimization stability by incorporating divergence-based loss functions or curvature information.

2.2.2. Suppression-Intended Methods

Suppression-intended methods focus on restricting the model’s access to the forgotten information rather than fully retraining the model. Full-parameter approaches include fine-grained probability adjustment [24,25], rejection fine-tuning [26,27], and incorrect label construction [28,29]. These methods weaken the influence of forgotten data by adjusting output confidence or disrupting label consistency.

2.3. Continual Learning and Machine Unlearning

Existing research on CLU has primarily focused on small-scale models in the image domain, with classification tasks as the dominant setting [6,7,8,30]. While these studies have made valuable progress in integrating continual learning and unlearning, their scope remains limited to traditional discriminative models.
For example, ref. [30] introduced the CLU concept into image classification for the first time by adaptively enhancing model plasticity through selective parameter degradation. Work in [6] proposed a complete CLU formalization framework but treated entire tasks as the minimal unlearning unit, which fails to support fine-grained unlearning requirements. Study [7] unified learning and unlearning in classification tasks through a dual-teacher distillation mechanism, albeit at the cost of substantial computational and storage overhead.
We systematically study CLU under parameter-efficient constraints in generative LLMs with an interleaved learning–unlearning protocol. Our framework operationalizes KL-based CLU objectives in a knowledge-centric manner and provides a practical recipe for controlled knowledge evolution in large-scale generative models.

3. Materials and Methods

3.1. Problem Definition

We study the problem of CLU in a parametric model with parameters $\theta$, which sequentially receives a stream of $T$ task requests, where $T$ is the total number of tasks. Each task request is denoted as
$$\mathcal{T}_t = (\mathcal{D}_t, \mathcal{R}_t),$$
where $t \in \{1, 2, \ldots, T\}$ is the task index, $\mathcal{D}_t$ is the dataset for task $t$, and $\mathcal{R}_t$ denotes the request type. The request type can be either learning or unlearning, i.e., $\mathcal{R}_t \in \{L, U\}$. The dataset $\mathcal{D}_t = \{q_i\}_{i=1}^{N_t}$ consists of $N_t$ data points. Each data point $q_i = (x_i, y_i)$ is composed of a prompt $x_i$ and its corresponding reference response $y_i$. These data points are used either for model learning or for unlearning, depending on the request type. For continual learning tasks, we denote the request as $\mathcal{T}_t^L$ with the corresponding dataset $\mathcal{D}_t^L$. For unlearning tasks, we denote the request as $\mathcal{T}_t^U$ with the corresponding dataset $\mathcal{D}_t^U$. Figure 1 illustrates the overall framework of the CLU paradigm, depicting how the model alternately processes learning and unlearning requests in a sequential task stream.
For continual learning, we follow prior work on large-scale model adaptation and adopt Supervised Fine-Tuning (SFT), enabling the model to incrementally acquire new and previously unseen knowledge. For continual unlearning, we adopt established unlearning paradigms, aiming to make the model effectively “forget” specified data or knowledge fragments while preserving the stability of its existing knowledge structure.
Specifically, when the model receives a request, the objective is to update the model parameters from $\theta_t$ to $\theta_{t+1}$ such that the following three core constraints are satisfied:
  • Forgetting Constraint: The model must reduce its tendency to recall or reproduce information from the unlearning dataset $\mathcal{D}_t^U$ under the specified evaluation protocol, achieving observable behavioral redirection while minimizing collateral degradation.
  • Retention Constraint: The model must preserve its performance on retain data $\mathcal{D}_t^R$ that is disjoint from the unlearning target, preventing negative interference with previously acquired knowledge.
  • Acquisition Constraint: The model must maintain its plasticity for learning future task data $\mathcal{D}_t^L$, ensuring that the unlearning operation does not compromise its capacity for subsequent knowledge acquisition in the continual learning paradigm.
Overall, the goal of the model over the entire task stream is to achieve a balanced trade-off between learning new knowledge and forgetting obsolete or sensitive information by alternately executing learning and unlearning tasks, enabling controllable model knowledge evolution.
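The interleaved task-stream protocol above can be sketched as a minimal dispatch loop. The `TaskRequest` container and the logged routine names below are hypothetical illustrations of the $\mathcal{T}_t = (\mathcal{D}_t, \mathcal{R}_t)$ structure, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import List, Literal, Tuple

@dataclass
class TaskRequest:
    """One element of the CLU task stream: T_t = (D_t, R_t)."""
    data: List[Tuple[str, str]]   # (prompt x_i, reference response y_i)
    request: Literal["L", "U"]    # "L" = learning, "U" = unlearning

def process_stream(stream: List[TaskRequest]) -> List[str]:
    """Dispatch each request to the corresponding update routine;
    conceptually the parameters evolve theta_t -> theta_{t+1} per task."""
    log = []
    for t, task in enumerate(stream, start=1):
        if task.request == "L":
            log.append(f"task {t}: SFT on {len(task.data)} examples")
        else:
            log.append(f"task {t}: gradient-ascent unlearning on {len(task.data)} examples")
    return log
```

In a real system each branch would invoke the constrained parameter update described in Section 3.3 rather than merely logging.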

3.2. A Drift-Aware Framework for Retention-Controlled CLU

Continual learning and machine unlearning can be regarded as two canonical special cases within a unified CLU framework. Both scenarios correspond to an idealized model that provides a theoretical upper bound on achievable performance. In the continual learning setting, the jointly trained model is commonly treated as the optimal solution [31]. By aggregating all available data and removing the constraint of historical data inaccessibility, joint training minimizes the global empirical risk, thereby representing the optimal performance attainable by continual learning algorithms. In contrast, in the machine unlearning setting, the optimal model is defined as the model obtained by retraining from scratch after completely removing all data requested to be forgotten. Although this approach guarantees exact unlearning, it relies on full retraining over the remaining dataset, resulting in prohibitive computational costs [32] and rendering it impractical for real-world applications.
Accordingly, in the joint continual learning and machine unlearning problem, we define a theoretical ideal model as the optimal solution obtained by training on the union of all learning data while excluding all data subject to unlearning requests. Formally, this ideal model is defined as
$$\theta^{*} = \arg\min_{\theta} \mathcal{L}(\theta),$$
where $\mathcal{L}(\theta)$ denotes the training loss evaluated on the dataset after removing all samples that need to be forgotten. However, directly obtaining a model whose parameter distribution exactly matches that of the ideal model is generally infeasible in practice. For example, joint retraining requires storing all historical data and performing full model retraining, which incurs excessive computational and storage overhead.
Therefore, in this work, we treat the ideal model as a theoretical reference rather than a directly attainable baseline. Based on this observation, we propose a more practical optimization paradigm, termed approximate CLU. When new learning or machine unlearning requests arrive, the system performs drift-aware updates that balance the new task objective with controlled distributional changes relative to previously acquired knowledge. In this way, a dynamic balance between continual learning and machine unlearning can be achieved.
Following prior work [8], we view continual learning and machine unlearning as two types of controlled distributional updates in a model facing sequential task requests. Instead of fully retraining to the ideal model $\theta^{*}$ (trained on all learning data while excluding all samples requested to be forgotten), we perform approximate CLU by constraining the output-distribution drift relative to a reference model.
Let $\pi_\theta$ denote the language model with parameters $\theta$ and output distribution $p_\theta(\cdot \mid s_{<t})$. At update step $k$, we take the current model $\theta_k$ as the reference model, i.e., $\theta_{\mathrm{ref}} = \theta_k$. Given three data partitions at step $k$—retain set $\mathcal{D}^R$, new learning set $\mathcal{D}^L$, and forget set $\mathcal{D}^U$—we define a unified objective that (i) fits new knowledge on $\mathcal{D}^L$, (ii) suppresses target knowledge on $\mathcal{D}^U$, and (iii) limits distributional drift on $\mathcal{D}^R$ (see Figure 2 for an illustration of the framework):
$$\theta_{k+1} = \arg\min_{\theta} \; \underbrace{\mathcal{L}_{\mathrm{req}}(\theta; \mathcal{D}^L, \mathcal{D}^U)}_{\text{learning/unlearning}} + \lambda \, \underbrace{\mathcal{R}_{\mathrm{retain}}(\theta; \theta_{\mathrm{ref}}, \mathcal{D}^R)}_{\text{retention (drift control)}},$$
where $\lambda > 0$ controls the retention strength.
For retention regularization (staying close on $\mathcal{D}^R$), we characterize distributional drift on retained data via the KL divergence between the reference model and the updated model:
$$\mathcal{R}_{\mathrm{retain}}(\theta; \theta_{\mathrm{ref}}, \mathcal{D}^R) = \mathbb{E}_{s \sim \mathcal{D}^R}\!\left[ \frac{1}{|s|} \sum_{t=1}^{|s|} D_{\mathrm{KL}}\!\left( p_{\theta_{\mathrm{ref}}}(\cdot \mid s_{<t}) \,\big\|\, p_{\theta}(\cdot \mid s_{<t}) \right) \right],$$
where $s$ denotes a sequence in the retain set $\mathcal{D}^R$, $|s|$ is its length, $s_{<t}$ represents the prefix up to position $t-1$, and $p_{\theta}(\cdot \mid s_{<t})$ is the model's next-token probability distribution conditioned on $s_{<t}$. Implementation Bridge: Equation (4) serves as the foundational design principle for our unified framework. While explicit computation of this token-level KL term is avoided to maintain data privacy and efficiency, we operationalize this principle by constraining the parameter-space evolution. Specifically, under a standard model smoothness assumption—namely that the logit output function $f_\theta$ is twice continuously differentiable with bounded first- and second-order derivatives (Assumption A1)—we establish, via Taylor expansion of the logit function and Lipschitz analysis of the softmax operator, that the token-level KL divergence on the retain set is formally upper-bounded by $C_1 \|\Delta\theta\|_2^2 + C_2 \|\Delta\theta\|_2$, where $C_1, C_2$ are explicit constants determined by model properties (Theorem A1, Appendix A). This theoretical link directly motivates our choice of structural constraints: localization via sparse masking and direction control via orthogonal projection are not merely empirical heuristics, but principled proxies for minimizing Equation (4) without historical data rehearsal.
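As a diagnostic (the training loop itself avoids explicit KL computation), the token-level drift in Equation (4) can be estimated directly from the two models' logits. A minimal numpy sketch, assuming per-position next-token logits are available from the reference and updated models:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_level_kl(ref_logits, new_logits):
    """Mean per-position KL(p_ref || p_theta), mirroring the retention
    regularizer: both inputs have shape (seq_len, vocab_size)."""
    p = softmax(ref_logits)
    q = softmax(new_logits)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()
```

Identical models give zero drift; any update that shifts the next-token distribution on retain-set prefixes yields a positive value, which is precisely what the parameter-space constraints are designed to keep small.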
The request term $\mathcal{L}_{\mathrm{req}}$ depends on whether the incoming request is learning or unlearning:
$$\mathcal{L}_{\mathrm{req}}(\theta; \mathcal{D}^L, \mathcal{D}^U) = \begin{cases} \mathcal{L}_{\mathrm{SFT}}(\theta; \mathcal{D}^L) & \text{(learning request)}, \\ \mathcal{L}_{\mathrm{GA}}(\theta; \mathcal{D}^U) & \text{(unlearning request)}. \end{cases}$$
In our research, we employ supervised fine-tuning (SFT) for learning tasks and gradient ascent (GA) for unlearning tasks. Specifically, the SFT loss is defined as the standard negative log-likelihood:
$$\mathcal{L}_{\mathrm{SFT}}(\theta; \mathcal{D}^L) = -\mathbb{E}_{(x,y) \sim \mathcal{D}^L}\!\left[ \frac{1}{|y|} \sum_{t=1}^{|y|} \log p_{\theta}(y_t \mid x, y_{<t}) \right],$$
where $(x, y)$ denotes a prompt-response pair, and $p_{\theta}(y_t \mid x, y_{<t})$ is the model's predicted probability for the next token $y_t$ given the prompt $x$ and previous tokens $y_{<t}$.
For unlearning, we adopt the gradient ascent objective that maximizes the loss on the forget set, thereby reducing the model's confidence on the targeted knowledge:
$$\mathcal{L}_{\mathrm{GA}}(\theta; \mathcal{D}^U) = \mathbb{E}_{(x,y) \sim \mathcal{D}^U}\!\left[ \frac{1}{|y|} \sum_{t=1}^{|y|} \log p_{\theta}(y_t \mid x, y_{<t}) \right],$$
which effectively pushes the model away from reproducing responses in the forget set $\mathcal{D}^U$.
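At the token level the two request objectives differ only in sign, which is what allows a single update routine to serve both. A schematic sketch (function names are ours; real training would compute these losses from model logits):

```python
import numpy as np

def nll(target_log_probs):
    """Mean negative log-likelihood over target tokens."""
    return -float(np.mean(target_log_probs))

def request_loss(target_log_probs, request):
    """L_req: SFT minimizes the NLL on D^L; GA 'maximizes' it on D^U,
    i.e., minimizes the negated NLL, pushing probability mass away
    from the forget-set responses."""
    if request == "L":
        return nll(target_log_probs)
    if request == "U":
        return -nll(target_log_probs)
    raise ValueError(f"unknown request type: {request}")
```

Both branches are then subject to the same retention-controlled parameter constraints described in Section 3.3.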
Scope of Unification. We emphasize that the unification in our framework is operational rather than loss-level. The task-specific objectives for learning (Equation (6)) and unlearning (Equation (7)) are fundamentally distinct—SFT minimizes next-token prediction loss while GA maximizes it. What is unified is the retention-controlled parameter evolution mechanism: both operations are executed within the same constrained low-rank adapter space (frozen A, sparsely masked B), subject to the same orthogonal projection constraints, and governed by the same drift-control principle (Equation (4)). This shared infrastructure ensures that regardless of whether the current task involves knowledge acquisition or removal, the parameter update respects the same stability guarantees on retained knowledge.
Equations (3)–(5) yield three practical design principles: (i) retention control via $\mathcal{R}_{\mathrm{retain}}$ on $\mathcal{D}^R$, (ii) localization by restricting updates to a small parameter subset (to reduce interference), and (iii) direction control by constraining update directions to minimize impact on historical knowledge. These principles motivate our parameter-efficient implementation with frozen LoRA projection matrices $A$, sparse masking, and orthogonal gradient projection in Section 3.3. Details can be found in Table 1.

3.3. Method

In this study, we adopt Low-Rank Adaptation (LoRA) to address the problem of CLU. Compared with full-parameter fine-tuning, LoRA does not modify the parameters of the backbone large language model; instead, it introduces only a small number of additional trainable parameters. Prior studies have shown that this parameter-efficient strategy can achieve performance comparable to full fine-tuning [9]. The overall architecture of the proposed framework is illustrated in Figure 3.
LoRA fine-tunes a large language model for new tasks by factorizing the weight update into the product of two low-rank matrices. Formally, for a specific task $t$, given a pretrained weight matrix $w \in \mathbb{R}^{d_{\mathrm{in}} \times d_{\mathrm{out}}}$, the weight update $\Delta_t \in \mathbb{R}^{d_{\mathrm{in}} \times d_{\mathrm{out}}}$ is constrained to be low-rank:
$$h = x w + x \Delta_t = x w + x A_t B_t,$$
where $x \in \mathbb{R}^{1 \times d_{\mathrm{in}}}$ is the input feature vector, $h \in \mathbb{R}^{1 \times d_{\mathrm{out}}}$ is the output, $A_t \in \mathbb{R}^{d_{\mathrm{in}} \times r}$ is the projection matrix, $B_t \in \mathbb{R}^{r \times d_{\mathrm{out}}}$ is the expansion matrix, and $r \ll \min(d_{\mathrm{in}}, d_{\mathrm{out}})$ is the rank of the low-rank decomposition. We refer to $\Delta_t$ as the LoRA adapter for task $t$. In practice, LoRA adapters are typically applied to multiple projection matrices in Transformer layers (e.g., $w_k$ and $w_v$).
Conventionally, both the low-rank projection matrix $A_t$ and the low-rank expansion matrix $B_t$ are updated via gradient descent. The matrix $A_t$ is usually randomly initialized (e.g., with a Gaussian distribution), while $B_t$ is initialized to zero to ensure $\Delta_t = 0$ at the start of training.
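A minimal numpy sketch of the adapted forward pass in Equation (8), with the conventional initialization (Gaussian $A$, zero $B$); the dimensions are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 4

W = rng.normal(size=(d_in, d_out))   # frozen pretrained weight w
A = rng.normal(size=(d_in, r))       # low-rank projection, Gaussian init
B = np.zeros((r, d_out))             # low-rank expansion, zero init

def lora_forward(x):
    """h = x w + x A B: backbone output plus the low-rank adapter update."""
    return x @ W + x @ A @ B

# With B = 0 the adapter contributes nothing, so training starts exactly
# from the backbone's behavior.
x = rng.normal(size=(1, d_in))
assert np.allclose(lora_forward(x), x @ W)
```

Only $A$ and $B$ (here, only $B$ once $A$ is frozen as in Section 3.3.1) would receive gradients; $W$ stays untouched.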

3.3.1. Freezing the LoRA Matrix A

In our CLU setting, to reduce interference between learning and unlearning across a task stream and to preserve the backbone model's general capability, we freeze the low-rank projection matrix $A$ and only optimize the task-specific expansion matrices $B$. Concretely, for the task sequence $t, t+1, \ldots$, all tasks share the same fixed $A$ but maintain different $B$ matrices, yielding the low-rank updates
$$\Delta_t = A B_t, \qquad \Delta_{t+1} = A B_{t+1}.$$
This design constrains all task updates to a common low-dimensional subspace spanned by $A$, while allowing different tasks to adapt through different directions in the $r$-dimensional coefficient space. As a result, the correlation (or orthogonality) between the induced parameter updates in the original space is largely governed by the alignment between the corresponding $B$ matrices. This relationship can be motivated as follows under standard random matrix concentration assumptions:
Let $A \in \mathbb{R}^{d_{\mathrm{in}} \times r}$ be initialized with i.i.d. standard normal entries and then frozen. Consider the Frobenius inner product between the adapters of two consecutive tasks:
$$\langle \Delta_t, \Delta_{t+1} \rangle = \mathrm{Tr}\!\left( \Delta_t^{\top} \Delta_{t+1} \right) = \mathrm{Tr}\!\left( B_t^{\top} A^{\top} A B_{t+1} \right),$$
where $\langle \cdot, \cdot \rangle$ denotes the Frobenius inner product and $\mathrm{Tr}(\cdot)$ is the matrix trace operator. When $d_{\mathrm{in}}$ is large, random matrix concentration suggests that
$$A^{\top} A \approx \alpha I_r,$$
where $I_r \in \mathbb{R}^{r \times r}$ is the identity matrix and $\alpha > 0$ is a constant determined by the initialization of $A$. Substituting into the inner product gives
$$\langle \Delta_t, \Delta_{t+1} \rangle \approx \alpha \, \mathrm{Tr}\!\left( B_t^{\top} B_{t+1} \right) = \alpha \, \langle B_t, B_{t+1} \rangle.$$
This suggests that orthogonality between the induced updates in the original parameter space, i.e., $\langle \Delta_t, \Delta_{t+1} \rangle \approx 0$, can be promoted when the corresponding coefficients satisfy $\langle B_t, B_{t+1} \rangle \approx 0$. While this argument serves as an intuition under idealized random initialization assumptions, in practice orthogonality is enforced through explicit masking and projection mechanisms regardless of this approximation. Motivated by this perspective, we enforce near-orthogonality between the $B$-space parameters for consecutive tasks using two complementary mechanisms: sparse masking (Section 3.3.2) to protect important large-magnitude parameters while allowing selective updates to less critical parameters, and orthogonal gradient projection (Section 3.3.3) to remove components of the current task's update that align with previously learned directions. Together, these techniques promote approximately perpendicular adaptations in the $B$-space along the task stream, thereby alleviating destructive interference while preserving the backbone's general capability.
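The concentration argument behind Equations (10)–(12) is easy to check numerically. A sketch with illustrative dimensions (for i.i.d. standard normal entries, $\alpha \approx d_{\mathrm{in}}$):

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, r, d_out = 4096, 8, 64

A = rng.normal(size=(d_in, r))      # shared frozen projection matrix
B_t = rng.normal(size=(r, d_out))   # adapters of two consecutive tasks
B_t1 = rng.normal(size=(r, d_out))

# A^T A concentrates around d_in * I_r when d_in >> r.
gram = A.T @ A
assert np.linalg.norm(gram / d_in - np.eye(r)) < 0.3

# Hence <Delta_t, Delta_{t+1}> ~= d_in * <B_t, B_{t+1}>  (Eq. (12)).
lhs = np.sum((A @ B_t) * (A @ B_t1))       # inner product in full space
rhs = d_in * np.sum(B_t * B_t1)            # alpha * inner product in B-space
scale = d_in * np.linalg.norm(B_t) * np.linalg.norm(B_t1)
assert abs(lhs - rhs) < 0.02 * scale
```

So driving $\langle B_t, B_{t+1} \rangle$ toward zero, as the masking and projection mechanisms do, approximately orthogonalizes the full-space updates.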

3.3.2. Sparse Masking for the Weight Matrix B

To mitigate interference between tasks, we construct a sparse mask $M_t$ before training task $t$ $(t > 1)$, based on the magnitude statistics of the current parameters. The mask is then applied during optimization to restrict parameter updates: only parameters with mask value 1 are allowed to be updated, while parameters with mask value 0 remain fixed.
Concretely, prior to training task $t$, we aggregate all parameters from the collection of $B_t$ matrices across layers/projections, denoted by $\mathcal{B}_t$, and compute a global threshold $\tilde{T}_t$ from the $s\%$ quantile of their absolute values, where $s$ denotes the sparsity ratio. Following the standard magnitude-based masking approach, the mask for each matrix $B_t$ is defined as:
$$M_t = \mathbb{I}\!\left( |B_t| < \tilde{T}_t \right), \qquad \tilde{T}_t = \mathrm{Quantile}_{s\%}\!\left( |\mathcal{B}_t| \right),$$
where $\mathbb{I}(\cdot)$ is the element-wise indicator function that returns 1 for parameters below the threshold and 0 otherwise. This formulation protects the top $s\%$ largest-magnitude parameters by setting their mask values to 0 (frozen), while allowing updates to the remaining $(100 - s)\%$ smaller parameters with mask value 1.
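A sketch of the magnitude-based mask construction. The quantile convention below is chosen to match the stated behavior (the top-$s\%$ largest-magnitude entries are frozen), and the function name is ours:

```python
import numpy as np

def build_mask(B, s):
    """Freeze (mask value 0) the top-s% largest-magnitude entries of B;
    allow updates (mask value 1) to the remaining smaller entries."""
    threshold = np.quantile(np.abs(B), 1.0 - s / 100.0)
    return (np.abs(B) < threshold).astype(np.float32)

# Example: with s = 25 on four entries, only the largest (|2.0|) is frozen.
B = np.array([[0.1, -0.5],
              [2.0, -0.05]])
mask = build_mask(B, 25.0)
assert mask[1, 0] == 0.0 and mask.sum() == 3.0
```

In training, the gradient of each $B$ matrix would be multiplied elementwise by its mask; the paper computes a single global threshold over all $B$ matrices, whereas this toy example thresholds one matrix.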

3.3.3. Orthogonal Gradient Projection

To further suppress catastrophic forgetting in continual learning and to prevent unintended damage to non-target knowledge during unlearning, we introduce an orthogonal gradient projection strategy. Recent studies suggest that if the gradient update direction is orthogonal to the feature subspace of previous tasks, the impact on old tasks is minimized, thereby reducing forgetting [33].
Consider training on task $t+1$ after having learned tasks $1, \ldots, t$. Let $E$ denote a generic trainable parameter matrix. The parameter update can be written as:
$$E_{t+1} = E_t + \Delta E,$$
where $\Delta E$ represents the parameter change. To preserve the output of old task $t$ with input feature $x_t$, where $f_\theta(\cdot, \cdot)$ denotes the model's output function parameterized by $\theta$, we require:
$$f_{\theta}\!\left( E_t + \Delta E, \, x_t \right) = f_{\theta}\!\left( E_t, \, x_t \right).$$
By linearization, this condition is approximately satisfied when:
$$\left\langle \nabla_{E} f_{\theta}(E_t, x_t), \, \Delta E \right\rangle = 0,$$
meaning that the parameter update $\Delta E$ should be orthogonal to the gradient direction of the old-task output with respect to the parameters.
Let θ ∈ R^d denote the vector of all trainable model parameters, where d is the total number of parameters. For each task i, let θ_i^init and θ_i^final denote the parameter vectors before and after finishing training on task i, respectively. We define the task-update displacement and its normalized direction as
$$\Delta\theta_i \triangleq \theta_i^{\text{final}} - \theta_i^{\text{init}}, \qquad v_i \triangleq \frac{\Delta\theta_i}{\|\Delta\theta_i\|_2},$$
where Δθ_i ∈ R^d represents the net parameter change induced by task i, v_i ∈ R^d is the corresponding unit direction (the "task direction"), and ‖·‖_2 denotes the Euclidean norm.
When training on a new task t, let g_t ∈ R^d denote the raw gradient of the task-t loss with respect to the parameters, i.e., g_t = ∇_θ L_t(θ), computed at the current optimization step. To prevent updates that interfere with previously learned task directions, we project g_t onto the orthogonal complement of the subspace spanned by the stored directions {v_i}_{i=1}^{t−1}:
$$g_t^{\perp} = g_t - \sum_{i=1}^{t-1} \left(g_t^{\top} v_i\right) v_i,$$
where g_t^⊥ ∈ R^d is the projected gradient used for the parameter update, g_t^⊤ v_i is the scalar inner product measuring the component of g_t along v_i, and (g_t^⊤ v_i) v_i is the corresponding projection component removed from g_t.
As a result, the projected gradient is orthogonal to every previous task direction, exactly so when the stored directions are mutually orthonormal:
$$\left(g_t^{\perp}\right)^{\top} v_i = 0, \qquad \forall\, i \in \{1, \dots, t-1\}.$$
This gradient projection strategy complements the sparse masking mechanism: the sparse mask constrains where updates occur (i.e., which parameters are modified by protecting important parameters), while orthogonal gradient projection constrains the direction of updates (i.e., how parameters are modified). Their combination enables the model to balance parameter protection, directional orthogonality, and knowledge stability, thereby effectively mitigating catastrophic forgetting and reducing the adverse impact of unlearning on the model’s general capabilities.
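The projection above can be sketched in a few lines; `project_orthogonal` is an illustrative helper (not the paper's code), and the result is exactly orthogonal to each stored direction when the directions are orthonormal.

```python
import numpy as np

def project_orthogonal(g: np.ndarray, directions: list) -> np.ndarray:
    """Project gradient g onto the orthogonal complement of the stored unit
    task directions: g_perp = g - sum_i (g . v_i) v_i.

    Orthogonality to every v_i is exact when the directions are orthonormal.
    """
    g_perp = g.astype(float).copy()
    for v in directions:
        g_perp = g_perp - np.dot(g, v) * v  # remove component of the raw g along v
    return g_perp

# Example with two orthonormal stored directions (standard basis vectors).
g = np.array([3.0, -2.0, 5.0, 1.0])
v1 = np.array([1.0, 0.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 0.0, 0.0])
g_perp = project_orthogonal(g, [v1, v2])  # components along v1 and v2 removed
```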

3.3.4. Overall Algorithm

Algorithm 1 summarizes the complete procedure of our unified CLU framework.
Algorithm 1 Unified CLU Framework with Parameter-Efficient Adaptation
Require: Base model θ_0; task sequence {T_1, T_2, …, T_T}, where T_t = (D_t, R_t) and R_t ∈ {L, U}
Require: Hyperparameters: LoRA rank r, sparsity ratio s, learning rate η
Ensure: Updated model θ_T with LoRA adapters
 1: Initialize LoRA matrices: A ∈ R^{d_in × r} (random), B_0 = 0 (zero matrix)
 2: Freeze A for all subsequent tasks
 3: Initialize task-direction history V ← ∅
 4: for each task t = 1, 2, …, T do
 5:     // Construct sparse mask for task t
 6:     if t > 1 then
 7:         T̃_t ← Quantile_{1−s}(|B_{t−1}|)    ▹ Compute global threshold
 8:         M_t ← I(|B_{t−1}| < T̃_t)    ▹ Mask: 1 for small params, 0 for large
 9:     else
10:         M_t ← 1    ▹ No masking (all-ones matrix) for the first task
11:     end if
12:
13:     // Training loop for task t
14:     B_t ← B_{t−1}, θ_t^init ← θ_{t−1}
15:     for each training step do
16:         // Compute task-specific loss
17:         if R_t = L (learning) then
18:             L ← L_SFT(θ; D_t^L)
19:         else if R_t = U (unlearning) then
20:             L ← L_GA(θ; D_t^U)
21:         end if
22:
23:         // Compute and project gradient
24:         g_t ← ∇_{B_t} L
25:         g_t^⊥ ← g_t − Σ_{v_i ∈ V} (g_t^⊤ v_i) v_i    ▹ Orthogonal projection
26:
27:         // Apply sparse mask and update
28:         B_t ← B_t − η · (g_t^⊥ ⊙ M_t)    ▹ Masked parameter update (⊙: element-wise product)
29:     end for
30:
31:     // Store task direction for future projection
32:     θ_t^final ← θ_t with adapter A·B_t
33:     Δθ_t ← θ_t^final − θ_t^init
34:     v_t ← Δθ_t / ‖Δθ_t‖_2
35:     V ← V ∪ {v_t}
36: end for
37: return θ_T with LoRA adapter A·B_T
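The inner loop of the algorithm (projection followed by the masked update) can be condensed into a single step. This is a toy sketch with an illustrative name, `clu_step`; for simplicity it assumes the stored directions are flattened to match B and are orthonormal.

```python
import numpy as np

def clu_step(B, grad, mask, directions, lr=5e-5):
    """One CLU optimization step: orthogonally project the raw gradient
    against stored task directions, then apply the sparse mask so only
    unprotected entries of B are updated."""
    g = grad.ravel().astype(float)
    if directions:
        g = g - sum(np.dot(g, v) * v for v in directions)  # orthogonal projection
    return B - lr * (g.reshape(B.shape) * mask)            # masked update

# Toy example: one stored direction, second column of B frozen by the mask.
B = np.ones((2, 2))
grad = np.full((2, 2), 4.0)
mask = np.array([[1.0, 0.0], [1.0, 0.0]])
v = np.array([1.0, 0.0, 0.0, 0.0])
B_new = clu_step(B, grad, mask, [v], lr=0.1)
```

Masked entries of B are untouched, and the component of the gradient along the stored direction never enters the update.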

4. Experiments

4.1. Dataset and Experimental Setup

We adopt the Task of Fictitious Unlearning (TOFU) benchmark [20] for evaluation. TOFU contains profiles of 200 fully fictitious authors, where each profile consists of 20 question–answer (QA) pairs. All profiles are carefully constructed to ensure that their content does not appear in the model’s pretraining data, thereby providing a controlled environment for evaluating whether a model can selectively forget specific information.
To emulate a CLU setting, we design an experimental protocol with six tasks: three unlearning (UL) tasks and three continual learning (CL) tasks. We construct six data groups from TOFU, where each group contains 20 QA pairs and is assigned to one specific task (i.e., 6 × 20 = 120 QA pairs in total). Specifically, three data groups (3 × 20 = 60 QA pairs) are designated as UL data for the three unlearning tasks, while the remaining three data groups (3 × 20 = 60 QA pairs) serve as CL data for the three learning tasks. Following common practice, the base model undergoes an initial supervised fine-tuning (SFT) stage on a combined dataset consisting of a retain set D_L^0 and the three UL data groups (60 QA pairs in total) that will subsequently be unlearned. This SFT stage establishes both the baseline knowledge to be retained and the target knowledge to be selectively forgotten in later stages. The three CL data groups are kept separate and used exclusively for evaluating continual learning capabilities under interleaved unlearning operations.
We conduct experiments in an interleaved schedule that alternates between unlearning and learning tasks. Specifically, the six tasks are executed in a fixed sequence:
$$\text{UL}_1 \rightarrow \text{CL}_1 \rightarrow \text{UL}_2 \rightarrow \text{CL}_2 \rightarrow \text{UL}_3 \rightarrow \text{CL}_3,$$
where each UL_i (i = 1, 2, 3) represents an unlearning task that aims to forget the i-th injected data group, and each CL_i (i = 1, 2, 3) represents a continual learning task that learns the i-th held-out data group. This interleaved design enables us to evaluate whether the model can successfully unlearn specific knowledge while simultaneously acquiring new knowledge, without catastrophic forgetting or interference.
In our experiments, we conduct comprehensive evaluations using two representative large language models: Qwen3-4B-Instruct and Llama3-8B-Instruct. To adapt these models while maintaining parameter efficiency, we employ LoRA for both continual learning and unlearning tasks. The training configuration is as follows: we use the AdamW optimizer with a learning rate of 5 × 10⁻⁵, a batch size of 16, and train for 10 epochs. For the LoRA hyperparameters, we set the rank r = 8 to control the low-rank decomposition, the scaling coefficient α = 16 to regulate the magnitude of LoRA updates, and apply a dropout rate of 0.05 to the LoRA layers to prevent overfitting. These hyperparameters are kept consistent across all unlearning and continual learning stages to ensure fair comparison and reproducibility.

4.2. Baselines

To systematically evaluate the proposed CLU framework, we compare it against several representative baseline methods across sequential unlearning-continual learning tasks. Given our focus on an interleaved CLU protocol under a single Parameter-Efficient Fine-Tuning (PEFT) adapter without replay or task-specific routing, we adopt the standard supervised fine-tuning (SFT) approach as the CL backbone for all methods. For the unlearning (UL) tasks, we compare our framework against the following established LLM unlearning baselines:
  • Gradient Ascent (GA) [20]. This method performs unlearning by maximizing the negative log-likelihood on the forget set, thereby progressively reducing the model’s confidence in generating answers related to the data that should be forgotten.
  • Gradient Ascent + Gradient Descent (GA + GD) [34]. This approach combines gradient ascent on the forget set with gradient descent on the retain set. It enables the model to erase undesired knowledge while simultaneously maintaining performance on data that should be retained.
  • KL-Regularized Gradient Ascent (GA + KL) [34]. This method applies gradient-ascent unlearning on the forget set while constraining the model’s distributional drift via a KL divergence regularizer with respect to a reference model. This prevents excessive deviation from the original model behavior during the unlearning process.
  • Negative Preference Optimization (NPO) [22]. This technique explicitly downweights forget-set targets by penalizing the likelihood ratio under a negative-preference objective, thereby directly diminishing the model’s confidence in producing answers from the forget set.
  • Direct Preference Optimization (DPO) [20]. We adapt DPO to the unlearning scenario by constructing preference pairs where a neutral or alternative response is preferred over the forget-set target. The pairwise objective increases the probability of neutral responses while decreasing the probability of undesired responses relative to a reference policy.
  • Low-Rank Adaptation (LoRA) [9]. Instead of updating all model parameters, LoRA injects trainable low-rank decomposition matrices into the model’s attention layers. A single shared LoRA adapter is trained across all sequential tasks (both unlearning and continual learning), modifying the model’s behavior through parameter-efficient updates. This enables efficient adaptation across the entire task sequence while maintaining the base model’s weights frozen.
For detailed mathematical formulations and algorithmic implementations of these baseline methods, please refer to Appendix B.

4.3. Evaluation Metrics

We evaluate forgetting and utility from multiple perspectives, including lexical overlap, semantic similarity, and factual consistency.
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) measures token-level overlap between the generated answer and the reference answer [35]. We adopt ROUGE-L recall, which is based on the longest common subsequence (LCS):
$$\text{ROUGE-L} = \frac{\mathrm{LCS}(g, r)}{|r|},$$
where LCS ( g , r ) denotes the length of the longest common subsequence between g and r, g is the generated answer, r is the reference answer, and | r | is the length of the reference answer.
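ROUGE-L recall as defined above reduces to a standard LCS dynamic program over tokens. The sketch below uses simple whitespace tokenization for illustration; production evaluations typically use a dedicated library (e.g., the rouge-score package) with its own tokenization and stemming.

```python
def rouge_l_recall(generated: str, reference: str) -> float:
    """ROUGE-L recall: LCS length between the two token sequences,
    divided by the reference length."""
    g, r = generated.split(), reference.split()
    if not r:
        return 0.0
    # Standard O(|g| * |r|) LCS dynamic program.
    dp = [[0] * (len(r) + 1) for _ in range(len(g) + 1)]
    for i in range(1, len(g) + 1):
        for j in range(1, len(r) + 1):
            if g[i - 1] == r[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(g)][len(r)] / len(r)
```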
Cosine Similarity (CS) [36] measures the semantic similarity between model outputs before and after training. We obtain sentence embeddings using Sentence-BERT [37], compute the cosine similarity between pre- and post-training outputs, and truncate negative values to zero:
$$\mathrm{CS} = \max\left(0,\ \frac{e_{\text{pre}} \cdot e_{\text{post}}}{\|e_{\text{pre}}\|_2\, \|e_{\text{post}}\|_2}\right),$$
where e_pre and e_post are embeddings of the outputs before and after training, respectively. A lower CS indicates that training has introduced greater semantic drift.
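The truncation detail is easy to miss, so here is a minimal sketch operating on precomputed embeddings (Sentence-BERT in the paper); negative cosine values clamp to zero as specified.

```python
import numpy as np

def truncated_cosine(e_pre: np.ndarray, e_post: np.ndarray) -> float:
    """CS = max(0, cos(e_pre, e_post)); negative similarities truncate to 0."""
    cos = np.dot(e_pre, e_post) / (np.linalg.norm(e_pre) * np.linalg.norm(e_post))
    return max(0.0, float(cos))
```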
Entailment Score (ES) [36] assesses the factual consistency between the model’s output and the ground-truth answer, based on Natural Language Inference (NLI). We use a pre-trained NLI model [38] to predict whether the model output entails the ground-truth answer, and compute the proportion of outputs predicted as entailment:
$$\mathrm{ES} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left[\mathrm{NLI}(g_i, r_i) = \text{entailment}\right],$$
where N is the number of evaluated samples, g i is the i-th generated answer, r i is the i-th reference answer, NLI ( · , · ) denotes the NLI model’s prediction, and I [ · ] is the indicator function. A higher ES indicates better factual alignment, and lower scores signal hallucinated or incorrect outputs.
To provide a comprehensive assessment of model performance, we introduce two aggregate metrics that combine the aforementioned individual measures:
  • Model Utility (MU) serves as a task-level response quality proxy, quantifying the model’s ability to retain useful knowledge on the retain set and newly learned tasks. It is computed as the arithmetic mean of ROUGE-L, CS, and ES:
    $$\mathrm{MU} = \frac{1}{3}\left(\text{ROUGE-L} + \mathrm{CS} + \mathrm{ES}\right).$$
    A higher MU indicates that the model maintains strong performance on data that should be preserved.
  • Forgetting Proxy (FP) measures the degree to which model outputs deviate from original responses on the forget set under specified prompt templates. Rather than certifying irrecoverability in a privacy sense, FP quantifies behavioral redirection—the extent to which outputs shift away from target responses. It is defined as:
    $$\mathrm{FP} = 1 - \frac{1}{3}\left(\text{ROUGE-L} + \mathrm{CS} + \mathrm{ES}\right).$$
    A higher FP indicates that the model produces outputs that diverge substantially from the original responses on forget-set samples, reflecting observable behavioral change rather than provable knowledge elimination.
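By construction, MU and FP are complementary averages of the same three scores, so on a fixed set of outputs they sum to one (in practice MU is computed on retain/learn data and FP on the forget set). A minimal sketch:

```python
def model_utility(rouge_l: float, cs: float, es: float) -> float:
    """MU: arithmetic mean of ROUGE-L, CS, and ES (retained/learned data)."""
    return (rouge_l + cs + es) / 3.0

def forgetting_proxy(rouge_l: float, cs: float, es: float) -> float:
    """FP: one minus the same mean (forget-set data)."""
    return 1.0 - (rouge_l + cs + es) / 3.0
```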
Scope and Limitations of Evaluation Metrics. It is important to clarify the applicability boundaries of MU and FP:
  • Not equivalent to privacy guarantees: FP does not equate to reduced membership-inference risk or certified non-extractability of forgotten data. It reflects observable output deviation under controlled prompting, not cryptographic or information-theoretic guarantees of knowledge removal.
  • Behavioral unlearning in controlled settings: Our evaluation framework and conclusions are scoped to behavioral unlearning (behavioral redaction) within a controlled benchmark setting. Claims regarding "unlearning" should be interpreted as empirical output-level behavior changes, not as guarantees of complete knowledge erasure or resistance to adversarial extraction attempts.
Following established practices in continual unlearning evaluation [36], we define task-specific evaluation sets for computing MU and FP at each stage of the task sequence. For a continual learning task CL_i, MU is computed on the union of the current training data and all previous learning datasets, $\bigcup_{j=0}^{i} D_L^j$ (where D_L^0 denotes the initial retain set from the SFT stage, excluding data intended for subsequent unlearning, since those samples are also trained during SFT but will be selectively forgotten later), while FP is evaluated on the forget set from the preceding unlearning task, D_U^i. Conversely, for an unlearning task UL_i, MU is measured on the cumulative set of all prior learning data, $\bigcup_{j=0}^{i-1} D_L^j$, to assess knowledge retention, and FP is computed on the current forget set D_U^i used for training the unlearning objective. This evaluation protocol ensures that we capture both the model's ability to preserve previously acquired knowledge and its effectiveness in selectively forgetting target information across the sequential task trajectory.

5. Results

In this section, we present a comprehensive evaluation of the proposed method across multiple dimensions. We first report the main results on the TOFU benchmark and a real-world unlearning dataset to demonstrate the effectiveness and stability of our approach compared to several representative baselines. Subsequently, we conduct a detailed sensitivity analysis of the sparsity parameter and investigate the distribution drift to quantify the preservation of general capabilities. Furthermore, an extensive ablation study is performed to verify the contribution of each individual component. Finally, we analyze the parameter and computational efficiency to highlight the practical advantages of our framework in resource-constrained scenarios.

5.1. Main Results

To evaluate the performance of the proposed method, we conduct a series of experiments on the TOFU benchmark. We compare the proposed method against several representative baselines. The experimental results are shown in Table 2.
As shown in Table 2, our method achieves the best average performance (computed as the mean of all 12 MU and FP metric values across the six sequential tasks) on both models: 0.560 on Qwen3-4B-Instruct and 0.573 on Llama3-8B-Instruct, outperforming all baselines with exceptional stability (variance ± 0.0089 across five seeds). The MU metric remains consistently high throughout CLU, while baselines like GA show dramatic fluctuations and LoRA suffers catastrophic forgetting. Although our FP scores (0.38–0.46) are lower than aggressive methods, this reflects a deliberate design choice prioritizing stable knowledge retention over maximal output deviation, making it particularly suitable for scenarios requiring controlled, policy-driven knowledge removal with minimal disruption to retained capabilities.

5.2. Sensitivity Analysis

To investigate the impact of the sparsity parameter on model performance, we conduct sensitivity analysis on the Qwen3-4B-Instruct model by varying the sparsity level from 0 to 0.9. The experimental results are presented in Table 3.
As shown in Table 3, sparsity level significantly impacts performance. Without parameter protection (sparsity = 0), the model suffers catastrophic forgetting (MU: 0.54→0.03). Performance improves progressively as sparsity increases from 0.3 to 0.7. At sparsity = 0.9 (protecting top 90% parameters), the model achieves optimal performance with average 0.55 and consistently high MU scores (0.58, 0.58, 0.55, 0.56), demonstrating that aggressive parameter protection is crucial for CLU. We therefore adopt sparsity = 0.9 for all experiments.

5.3. Distribution Drift Analysis

To quantify the side effects of knowledge unlearning on the model’s general capabilities, we introduce the Token-level Distribution Drift proxy. Unlike coarse-grained metrics such as accuracy or perplexity, this metric captures the microscopic probability shifts in the model’s output distribution.
For a given sample in the retain set D_retain, let P(w | t_<) be the next-token probability distribution of the reference model θ_ref (the initial SFT model) and Q(w | t_<) that of the current unlearned model θ_cur, where t_< denotes the preceding context tokens. The token-level Kullback–Leibler (KL) divergence is defined as:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{w \in V} P(w \mid t_<) \log \frac{P(w \mid t_<)}{Q(w \mid t_<) + \epsilon},$$
where V denotes the full vocabulary. To ensure numerical stability and symmetry, we also report the Jensen–Shannon (JS) divergence:
$$D_{\mathrm{JS}}(P \,\|\, Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M),$$
where $M = \tfrac{1}{2}(P + Q)$. These metrics are averaged across all tokens within the generated Answer segment using a teacher-forcing paradigm.
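Over a small vocabulary, both divergences can be computed directly. This NumPy sketch places ε inside the log ratio in both numerator and denominator, a slight variation on the formula above chosen purely for numerical safety at zero-probability tokens.

```python
import numpy as np

EPS = 1e-12

def kl_div(p: np.ndarray, q: np.ndarray) -> float:
    """Token-level KL(P || Q) over a vocabulary; EPS guards log(0) and
    division by zero (a small deviation from the exact formula)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + EPS) / (q + EPS))))

def js_div(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence: symmetrized KL against M = (P + Q) / 2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)
```

In the paper's protocol these values are then averaged over the Answer tokens of each evaluated sample.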
Experimental Setup. Reference baseline: we use the checkpoint after the initial supervised fine-tuning (SFT) as θ_ref, maintaining a consistent reference for distribution comparison. Evaluation focus: to avoid dilution of the drift signal by fixed prompt templates, we apply a mask that restricts the calculation exclusively to the Answer tokens. Sampling: we randomly sample N = 50 instances from the retain set at each task stage.
Results and Analysis. The experimental results in Figure 4 and Figure 5 reveal distinct behaviors in distribution maintenance. Conventional unlearning methods, such as GA and NPO, exhibit a progressive increase in both KL and JS divergence as the task sequence advances. This cumulative drift is particularly pronounced in the later stages (e.g., UL3 and CL3), where the model’s output distribution deviates significantly from the original SFT baseline, leading to the “catastrophic collapsing” of general capabilities. In contrast, our proposed method (Ours) maintains a consistently low and stable drift throughout the entire CLU process. The near-zero KL divergence indicates that our parameter-protected orthogonal optimization effectively confines the updates to a narrow subspace, successfully erasing specific knowledge without perturbing the model’s fundamental linguistic patterns.
Connecting Distribution Drift to Behavioral Metrics. The token-level KL divergence on the retain set and the behavioral metric MU are not independent observations—they are causally linked through the generation process. Since MU is computed from ROUGE-L, CS, and ES on retain-set outputs, and these outputs are generated autoregressively from $p_\theta(\cdot \mid s_{<t})$, any systematic shift in this token-level distribution propagates directly into degraded output quality. This causal chain—cumulative KL drift → shifted generation distribution → degraded retain-set outputs → lower MU—is clearly reflected in the cross-method comparison. For instance, on Qwen3-4B-Instruct, GA exhibits progressively increasing KL divergence (Figure 4) accompanied by a 46% decline in MU (from 0.50 at CL1 to 0.27 at UL3), while its ostensibly high FP scores (0.60→0.80) are not indicative of precise forgetting but rather of indiscriminate distributional collapse affecting both retain and forget sets. In contrast, our method's near-zero KL divergence corresponds to only a 7% MU decline (0.59→0.55), with FP growing selectively (0.50→0.54). This demonstrates that low KL drift is the distributional-level mechanism enabling the behavioral-level stability–plasticity balance: parameter-space constraints (Theorem A1) bound the KL divergence, which in turn preserves retain-set generation fidelity (high MU) while permitting targeted behavioral change on the forget set (moderate FP). The full chain—parameter constraints → bounded KL → stable MU with selective FP—provides end-to-end empirical validation of our drift-aware design principle.

5.4. Ablation Study

To evaluate the effectiveness of the proposed method, we conduct a series of ablation studies on the TOFU benchmark. The experimental results are shown in Table 4. Here, a denotes whether matrix A is frozen, b denotes whether matrix B is sparsified, and c denotes whether orthogonal gradient projection is applied. ✓ indicates that the corresponding component is enabled, while × indicates that it is disabled.
Table 4 reveals the critical role of each component. The baseline (no components) suffers catastrophic forgetting (MU: 0.54→0.03), underscoring the necessity of specialized mechanisms. Individually, freezing matrix A (component a) improves MU to 0.38–0.47, while sparse masking on matrix B (component b) achieves stronger gains (MU: 0.54, 0.54, 0.34, 0.46). Orthogonal projection alone (c) shows minimal improvement with extremely high FP (0.97–0.98). The combination a + b achieves strong performance (MU: 0.58, 0.59, 0.44, 0.44), while the complete framework ( a + b + c ) reaches optimal performance with consistently high MU (0.58, 0.59, 0.55, 0.56) and balanced FP (0.51–0.56), confirming the synergistic contributions of all components.
Connecting Ablation Patterns to the Theoretical Bound. The observed contribution hierarchy—where magnitude-controlling mechanisms (freezing A and sparse masking on B) yield larger individual gains than the direction-controlling mechanism (orthogonal projection)—is consistent with the structure of our theoretical bound (Theorem A1). The bound $\mathbb{E}[D_{\mathrm{KL}}] \le C_1 \|\Delta\theta\|_2^2 + C_2 \|\Delta\theta\|_2$ is dominated by magnitude terms: reducing $\|\Delta\theta\|_2$ yields both quadratic and linear reductions in the KL upper bound, whereas directional constraints operate only through the effective projection of Δθ onto critical subspaces. This explains why freezing A (which restricts updates to a low-rank subspace) and sparse masking (which zeros out updates to critical parameters) each independently prevent catastrophic forgetting, while orthogonal projection alone cannot compensate for unconstrained update magnitude. However, the benefit of orthogonal projection becomes pronounced in later tasks (UL3: MU improves from 0.44 to 0.55 when added to a + b), where cumulative directional interference across multiple sequential updates becomes the binding constraint—a regime where magnitude control alone is insufficient.

5.5. Model Size and Computational Efficiency

Table 5 presents a comprehensive comparison of parameter and computational efficiency across different training approaches for the base model with 3.74 billion parameters. The results demonstrate that our proposed method achieves superior parameter efficiency compared to both full fine-tuning and standard LoRA approaches. Specifically, while full fine-tuning requires updating all 3.74B parameters and consumes 183.8 TFLOPs per training step, our method with rank r = 8 only requires training 8.40M parameters (0.22% of the base model), reducing the trainable parameter count by approximately 47% compared to standard LoRA ( r = 8 ) with 15.63M parameters (0.42%). In terms of computational efficiency, our approach achieves 62.0 TFLOPs per step (33.7% of full fine-tuning), which is marginally more efficient than standard LoRA’s 62.3 TFLOPs per step (33.9%). These results highlight that our method not only maintains competitive computational efficiency but also significantly reduces the memory footprint and parameter overhead, making it particularly suitable for resource-constrained scenarios and sequential learning tasks where parameter efficiency is crucial.

5.6. More Results on Real-World Datasets

To further demonstrate the generalization capability of our proposed method, we conduct additional experiments on a real-world unlearning scenario dataset. Following the setup described by Liu et al. [39], we adopt a more realistic scenario where the knowledge to be unlearned is inherent in the target model and the training data are unknown. This dataset identifies several real-world individuals with Wikipedia entries, along with inappropriate responses from Llama3-8B-Instruct model and golden answers for each individual. For detailed information on the dataset composition, sample size, annotation protocol, data sources, and compliance considerations, we refer readers to the original work [39].
For this evaluation, we employ the Qwen3-4B-Instruct model as the base model, maintaining all other experimental settings identical to those used in the main experiments (Section 5.1). This includes the same hyperparameters, training procedures, and evaluation metrics (MU and FP) to ensure fair comparison. The real-world dataset provides a more challenging testbed as it involves unlearning factual knowledge about actual individuals that has been deeply embedded in the pre-trained model, rather than synthetic or artificially injected information.
The results on the real-world dataset further validate the effectiveness and generalization capability of our proposed method. As shown in Table 6, our method achieves the highest average score of 0.620 (averaged across all 12 MU and FP metric values), surpassing all baseline methods including DPO (0.617), GA + KL (0.614), and GD (0.612). More importantly, our method maintains consistently high and stable MU scores throughout the CLU process (0.81, 0.79, 0.76, 0.74, 0.71, 0.68), demonstrating robust resistance to catastrophic forgetting even when dealing with deeply embedded factual knowledge. In contrast, LoRA exhibits severe performance degradation with MU scores dropping to 0.17, 0.29, and 0.07 in later tasks. While the FP scores remain moderate (0.42–0.55), consistent with our controlled-forgetting design, the combination of highest average score and exceptional stability confirms that our method generalizes well to real-world scenarios involving interleaved learning and unlearning under controlled benchmark conditions.

6. Discussion

This work presents a unified knowledge management framework that integrates continual learning and machine unlearning in large language models under a single information-theoretic perspective. Our experimental results on controlled interleaved benchmarks (six sequential tasks) demonstrate that the proposed method achieves the best average score (0.573 on Llama3-8B-Instruct) and exceptional stability (variance ± 0.0089 across seeds) across sequential tasks, outperforming existing baseline methods on both synthetic (TOFU) and real-world benchmarks.

6.1. Interpretation of Key Findings

The superior performance of our method can be attributed to three synergistic design principles derived from the drift-aware conceptual framework, in which distributional shifts are characterized via KL divergence as a design principle: freezing the LoRA projection matrix A constrains updates to a shared low-dimensional subspace, reducing inter-task interference; sparse masking on B protects important large-magnitude parameters while allowing selective updates to less critical ones; and orthogonal gradient projection suppresses destructive interference with previously learned directions. Compared to prior work, our framework differs fundamentally in its knowledge-centric formulation. Traditional continual learning methods such as EWC [10] and MAS [11] focus on parameter importance estimation in small-scale discriminative models, and existing unlearning methods such as gradient ascent [20] and NPO [22] prioritize rapid maximal forgetting at the cost of collateral damage and instability; our unified framework instead treats learning and unlearning as complementary operations under the same optimization principle, deliberately emphasizing controlled low-collateral forgetting with stable knowledge retention for reliable deployment in interleaved task scenarios. Ablation studies reveal that structural constraints (freezing A and sparse masking on B) are more critical than orthogonal gradient projection alone, aligning with recent findings that parameter protection and magnitude-based selective updating play a more dominant role than gradient-based regularization in large-scale models [3].

6.2. Limitations and Future Directions

Despite promising results on controlled interleaved benchmarks (6 sequential tasks), our framework has several concrete limitations warranting further investigation:
Controlled vs. maximal forgetting trade-off. Our design prioritizes controlled low-collateral forgetting over maximal erasure (FP: 0.38–0.46 vs. GA: 0.51–0.77), making it well-suited for gradual policy-driven knowledge removal but less suitable for emergency privacy scenarios requiring immediate complete erasure. Future work could explore adaptive forgetting strategies with switchable objectives balancing controllability and erasure strength.
Scalability to longer task sequences. Our evaluation covers six interleaved tasks, leaving scalability to significantly longer sequences (e.g., 50+ tasks) unexplored.
Analytical Scaling Behavior. Beyond computational cost, it is important to analyze how the framework’s effectiveness—not just its efficiency—scales with task count. We derive predictions from three complementary perspectives.
Cumulative drift growth. Theorem A1 bounds the per-step KL drift by $C_1 \|\Delta\theta\|_2^2 + C_2 \|\Delta\theta\|_2$. After T sequential tasks with per-task updates $\{\delta_t\}_{t=1}^{T}$, the total displacement is $\Delta\theta_T = \sum_t \delta_t$. Under orthogonal projection, $\delta_i^{\top}\delta_j = 0$ for $i \neq j$, so $\|\Delta\theta_T\|_2^2 = \sum_t \|\delta_t\|_2^2 = O(T)$ and the cumulative KL bound grows as O(T). Without orthogonality, constructive interference can yield $\|\Delta\theta_T\|_2 = O(T)$ and a KL bound of $O(T^2)$ in the worst case. This provides a theoretical rationale for why orthogonal projection becomes increasingly important in longer sequences—it reduces cumulative drift scaling from quadratic to linear—consistent with our ablation results showing a disproportionate Orthogonal Gradient Projection (OGP) benefit in later tasks (UL3 MU: 0.44→0.55 when adding OGP to a + b).
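The O(T) versus O(T²) contrast can be checked numerically with a toy displacement model, contrasting mutually orthogonal per-task updates with fully aligned ones (illustrative numbers, not actual model gradients).

```python
import numpy as np

T, d, step = 16, 64, 1.0

# Orthogonal per-task updates: T distinct standard basis directions.
orthogonal_total = np.zeros(d)
for t in range(T):
    delta = np.zeros(d)
    delta[t] = step
    orthogonal_total += delta

# Fully aligned updates: the same direction every task.
aligned_total = np.zeros(d)
for _ in range(T):
    delta = np.zeros(d)
    delta[0] = step
    aligned_total += delta

sq_orth = float(orthogonal_total @ orthogonal_total)  # = T * step^2   -> O(T)
sq_aligned = float(aligned_total @ aligned_total)     # = (T * step)^2 -> O(T^2)
```

With T = 16 the aligned squared displacement is T times larger than the orthogonal one, matching the quadratic-versus-linear scaling of the KL bound.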
Direction space saturation. OGP stores one direction per task in the space of trainable parameters. With sparse masking at sparsity s, the effective dimension is $(1 - s) \cdot r \cdot d_{\text{out}}$. At s = 0.9, r = 32, and a typical d_out = 4096, this yields roughly 13,000 theoretical directions before the orthogonal complement vanishes. While far beyond practical horizons, numerical accumulation and non-linear gradient dynamics will reduce the effective capacity, motivating direction compression strategies.
Empirical trend extrapolation. On Llama3-8B-Instruct, our method exhibits an approximately linear MU decline of 2.3% per task across the six-task sequence (0.81→0.67). If this rate persisted—a strong assumption, as interference patterns depend on task similarity and distribution overlap—MU would reach 0.50 around task 14. This suggests the current framework without consolidation is best suited to medium-length sequences (roughly 10–20 tasks), with longer horizons requiring periodic adapter merging and mask refresh, as discussed below.
A key concern is the computational and memory footprint of orthogonal gradient projection (OGP) as the number of tasks grows.
Direction set growth and projection cost. Let d denote the number of trainable adapter parameters being projected (in our case, the flattened B parameters after masking), and let m = t − 1 be the number of stored directions. The naive projection $g_t^{\perp} = g_t - \sum_{i=1}^{m} (g_t^{\top} v_i) v_i$ requires: (i) computing m inner products, each O(d); and (ii) accumulating m scaled vectors, also O(d). Thus, the per-step compute is O(md) = O(td), and storing all directions costs O(md) = O(td) memory. Over an entire task with T steps, the total compute is O(Tmd) = O(Ttd). In practice, this can become a bottleneck over long horizons because the projection is applied at every optimization step (not just once per task); the wall-clock overhead therefore scales roughly linearly with both the task count and the number of gradient steps.
Matrix form and memory bandwidth. If we stack the directions as $V \in \mathbb{R}^{m \times d}$, the projection becomes $\tilde{g}_t = g_t - V^\top (V g_t)$ (assuming approximately orthonormal rows). This formulation highlights that OGP reduces to two matrix–vector multiplications, with memory traffic proportional to $md$. As $m$ grows, GPU memory bandwidth and cache locality become limiting factors even when the arithmetic cost is moderate.
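The matrix-form projection step can be sketched in a few lines of NumPy; the function name and toy dimensions below are illustrative, not the paper's implementation:

```python
import numpy as np

def project_orthogonal(g, V):
    """Matrix-form OGP step: remove from gradient g its components along
    the stored task directions (rows of V, approximately orthonormal)."""
    return g - V.T @ (V @ g)              # two matrix-vector products, O(m d)

# Toy check: the projected gradient is orthogonal to every stored direction.
rng = np.random.default_rng(0)
d, m = 64, 5
Q, _ = np.linalg.qr(rng.standard_normal((d, m)))
V = Q.T                                   # (m, d): rows are orthonormal directions
g = rng.standard_normal(d)
g_proj = project_orthogonal(g, V)
```

After the projection, `V @ g_proj` is numerically zero, which is exactly the constraint the stored directions impose.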
Practical mitigation for scalability. To keep OGP scalable, one may (a) retain only a window of recent directions (size $k \ll t$), reducing compute/memory to $O(kd)$; (b) compress the stored directions into a low-dimensional principal subspace of rank $r \ll t$ via incremental Singular Value Decomposition (SVD) or online Principal Component Analysis (PCA), yielding $O(rd)$ storage and $O(rd)$ per-step projection; or (c) use randomized sketching to approximate $V g_t$ with lower memory overhead. These approaches trade full-history orthogonality for scalable approximate constraints and remain to be validated in extended-horizon CLU settings.
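Mitigation (b) can be sketched as follows, assuming direction compression via a plain (rather than incremental) SVD for brevity; names and sizes are illustrative:

```python
import numpy as np

def compress_directions(V, r):
    """Mitigation (b): replace the m stored directions (rows of V) with the
    top-r right singular vectors, i.e., an r-dimensional principal subspace."""
    _, _, Vt = np.linalg.svd(V, full_matrices=False)
    return Vt[:r]                          # (r, d), orthonormal rows

def project_orthogonal(g, V):
    return g - V.T @ (V @ g)

rng = np.random.default_rng(1)
d, m, r = 128, 20, 8
V = rng.standard_normal((m, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # m unit (not orthogonal) directions
Vr = compress_directions(V, r)                 # O(r d) storage instead of O(m d)
g = rng.standard_normal(d)
g_approx = project_orthogonal(g, Vr)           # O(r d) per-step projection
```

The projected gradient is exactly orthogonal to the compressed subspace but only approximately orthogonal to the original $m$ directions, which is the stated trade-off.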
In addition to OGP, two further bottlenecks may arise. First, sparsity capacity exhaustion: with 90% sparsity, only a small fraction of parameters is available for adaptation, and repeated task-specific masking may eventually exhaust the unused capacity, requiring periodic mask refresh or dynamic capacity reallocation. Second, cumulative drift: accumulated parameter deviations may gradually shift representations away from the pretrained reference distribution, potentially destabilizing retained knowledge. Addressing these issues may require periodic consolidation (merging adapters and resetting masks) or dynamic rank adjustment. These remain promising but unvalidated directions for future work.
Task boundary assumptions. Our framework assumes explicit task boundaries and manually constructed data partitions (retain/learn/forget sets). Extending to task-free continual learning settings with automatic boundary detection or gradual distribution shifts represents an important direction, though it introduces additional challenges in identifying when to apply learning vs. unlearning objectives without supervision.
These limitations delineate the scope of our current validation and highlight concrete technical challenges that future research can address to extend the framework toward longer-horizon deployment scenarios.

7. Conclusions

In this work, we present a parameter-efficient knowledge management framework where continual learning and machine unlearning—while employing distinct task-specific objectives (SFT for learning, GA for unlearning)—are integrated through a shared retention-controlled parameter evolution mechanism, with KL divergence serving as the design principle governing drift-aware structural constraints. We develop a practical implementation combining three synergistic mechanisms—freezing the LoRA projection matrix, magnitude-based sparse masking, and orthogonal gradient projection—that realize drift control entirely through parameter-space operations without modifying the base model. Extensive experiments on synthetic (TOFU) and real-world benchmarks using 4B- and 8B-scale language models demonstrate that our framework achieves the best average score (0.573 on Llama3-8B-Instruct) and exceptional stability (variance ± 0.0089 across five random seeds), with consistently high model utility and controlled forget-set response deviation that prioritizes low-collateral behavioral shifts over maximal output divergence. Token-level distributional drift analysis further validates that the parameter-space constraints effectively bound KL divergence on retained knowledge, and that this distributional stability directly underlies the observed behavioral-level stability–plasticity balance. It is important to emphasize that our evaluation captures behavioral unlearning in a controlled benchmark setting—measuring output-level changes rather than certifying complete knowledge elimination or privacy guarantees. While promising, concrete scalability challenges remain: direction set growth introduces linear computational overhead, sparsity capacity may exhaust under extended task sequences, and task boundary assumptions limit applicability to gradual distribution shifts—these represent well-defined technical directions for future research. 
This work provides a practical parameter-efficient recipe and a drift-aware design principle validated on controlled interleaved benchmarks (six sequential tasks), contributing both practical tools and theoretical understanding toward systematic and controllable knowledge dynamics in large language models.

Author Contributions

Conceptualization, J.L. and L.L.; methodology, J.L.; software, J.L.; validation, J.L. and L.L.; formal analysis, J.L.; investigation, J.L.; resources, L.L. and D.Z.; data curation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, L.L. and D.Z.; visualization, J.L.; supervision, L.L. and D.Z.; project administration, L.L.; funding acquisition, D.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDA0480301, in part by the Major Project of the National Social Science Fund of China under Grant 25&ZD043, and by the National Natural Science Foundation of China under Grant 62206293.

Data Availability Statement

The datasets used in this study are publicly available and can be accessed at the following GitHub repository: https://github.com/Langjiaqi/dataset_clu (accessed on 20 February 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Parameter-Space Drift Control as a KL Approximation

In the unified distributional formulation (Section 3.2), retention is expressed as a drift-control regularizer $\mathcal{R}_{\mathrm{retain}}(\theta; \theta_{\mathrm{ref}}, \mathcal{D}_R)$ that constrains distributional changes on the retain set via token-level KL divergence. In our implementation, however, we do not explicitly compute this KL term on $\mathcal{D}_R$. Instead, we realize retention implicitly via parameter-space constraints that approximate drift control. Below, we provide a formal derivation establishing the approximation guarantees.
Formal Approximation Framework. We establish the connection between distributional drift and parameter-space constraints through a series of formal assumptions, lemmas, and theorems.
Assumption A1 (Model Smoothness).
The model output function $f_\theta : \mathcal{X} \times \Theta \to \mathbb{R}^V$ (where $\mathcal{X}$ is the input space, $\Theta$ is the parameter space, and $V$ is the vocabulary size) is twice continuously differentiable with respect to $\Theta$. Moreover, there exist constants $L_1, L_2 > 0$ such that for all $x \in \mathcal{X}$ and $\theta, \theta' \in \Theta$ with $\|\theta - \theta'\|_2 \le \epsilon$ (where $\epsilon$ is the learning rate bound):
$$\|\nabla_\theta f_\theta(x)\|_{\mathrm{op}} \le L_1, \qquad \|\nabla_\theta^2 f_\theta(x)\|_{\mathrm{op}} \le L_2,$$
where $\|\cdot\|_{\mathrm{op}}$ denotes the operator norm.
Lemma A1 (KL-Logit Bound).
Let $p_{\theta_{\mathrm{ref}}}(\cdot \mid s_{<t})$ and $p_\theta(\cdot \mid s_{<t})$ be the softmax distributions over the vocabulary $V$ induced by the logit vectors $z_{\mathrm{ref}} = f_{\theta_{\mathrm{ref}}}(s_{<t})$ and $z = f_\theta(s_{<t})$, respectively. Then the token-level KL divergence satisfies
$$D_{\mathrm{KL}}\big(p_{\theta_{\mathrm{ref}}}(\cdot \mid s_{<t}) \,\|\, p_\theta(\cdot \mid s_{<t})\big) \le \frac{1}{2V}\|z_{\mathrm{ref}} - z\|_2^2 + \|z_{\mathrm{ref}} - z\|_2 \cdot C_{\mathrm{KL}},$$
where $C_{\mathrm{KL}} = \log V$ is a constant depending on the vocabulary size.
Proof. 
By Pinsker’s inequality and properties of softmax perturbation under bounded logit changes, combined with the Lipschitz continuity of the softmax function, the KL divergence can be bounded by a quadratic term in the 2 norm plus a linear term in the norm of the logit perturbation. For detailed derivation, see [40]. □
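As an empirical sanity check of the bound's form (not a proof), the following NumPy snippet verifies the inequality for random small logit perturbations at an illustrative vocabulary size. The check is comfortably satisfied here because the KL divergence under a logit shift $\Delta z$ never exceeds $2\|\Delta z\|_\infty$, and $\log V > 2$ for this $V$:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stabilization
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
V = 100                                  # toy vocabulary size
z_ref = rng.standard_normal(V)
for _ in range(1000):
    dz = 0.1 * rng.standard_normal(V)    # small logit perturbation
    p, q = softmax(z_ref), softmax(z_ref + dz)
    n2 = float(np.linalg.norm(dz))
    bound = n2**2 / (2 * V) + n2 * np.log(V)   # right-hand side of Lemma A1
    assert kl(p, q) <= bound
```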
Lemma A2 (Parameter-Logit Approximation).
Under Assumption A1, for a parameter update $\Delta\theta = \theta - \theta_{\mathrm{ref}}$ with $\|\Delta\theta\|_2 \le \epsilon$, the logit change at input $s_{<t}$ satisfies
$$f_\theta(s_{<t}) = f_{\theta_{\mathrm{ref}}}(s_{<t}) + J_{\theta_{\mathrm{ref}}}(s_{<t})^\top \Delta\theta + O(\|\Delta\theta\|_2^2),$$
where $J_{\theta_{\mathrm{ref}}}(s_{<t}) = \nabla_\theta f_{\theta_{\mathrm{ref}}}(s_{<t}) \in \mathbb{R}^{d \times V}$ is the Jacobian matrix. Consequently,
$$\|f_\theta(s_{<t}) - f_{\theta_{\mathrm{ref}}}(s_{<t})\|_2 \le L_1 \|\Delta\theta\|_2 + \frac{L_2}{2}\|\Delta\theta\|_2^2.$$
Proof. 
By Taylor expansion of f θ around θ ref and applying the bounds from Assumption A1, we obtain the first-order approximation with an explicit remainder term. Taking norms and applying the triangle inequality yields the stated bound. □
Theorem A1 (Parameter-Space Drift Control).
Under Assumption A1, for a parameter update $\Delta\theta$ with $\|\Delta\theta\|_2 \le \epsilon$, the average token-level KL divergence on the retain set $\mathcal{D}_R$ satisfies
$$\mathbb{E}_{s \sim \mathcal{D}_R}\left[\frac{1}{|s|}\sum_{t=1}^{|s|} D_{\mathrm{KL}}\big(p_{\theta_{\mathrm{ref}}}(\cdot \mid s_{<t}) \,\|\, p_\theta(\cdot \mid s_{<t})\big)\right] \le C_1 \|\Delta\theta\|_2^2 + C_2 \|\Delta\theta\|_2,$$
where $C_1 = \frac{L_1^2}{2V} + \frac{L_2^2 \epsilon}{4V}$ and $C_2 = \big(L_1 + \frac{L_2 \epsilon}{2}\big) C_{\mathrm{KL}}$ are constants determined by model properties and hyperparameters.
Proof. 
Combining Lemmas A1 and A2, for any input $s_{<t}$ on the retain set, we have
$$D_{\mathrm{KL}}\big(p_{\theta_{\mathrm{ref}}}(\cdot \mid s_{<t}) \,\|\, p_\theta(\cdot \mid s_{<t})\big) \le \frac{1}{2V}\|z_{\mathrm{ref}} - z\|_2^2 + \|z_{\mathrm{ref}} - z\|_2 \cdot C_{\mathrm{KL}}$$
$$\le \frac{1}{2V}\Big(L_1\|\Delta\theta\|_2 + \frac{L_2}{2}\|\Delta\theta\|_2^2\Big)^2 + \Big(L_1\|\Delta\theta\|_2 + \frac{L_2}{2}\|\Delta\theta\|_2^2\Big) C_{\mathrm{KL}}.$$
Under the constraint $\|\Delta\theta\|_2 \le \epsilon$ (with $\epsilon$ sufficiently small), expanding the squared term and retaining the dominant terms yields
$$D_{\mathrm{KL}}\big(p_{\theta_{\mathrm{ref}}}(\cdot \mid s_{<t}) \,\|\, p_\theta(\cdot \mid s_{<t})\big) \le \Big(\frac{L_1^2}{2V} + \frac{L_2^2 \epsilon}{4V}\Big)\|\Delta\theta\|_2^2 + \Big(L_1 + \frac{L_2 \epsilon}{2}\Big) C_{\mathrm{KL}}\, \|\Delta\theta\|_2.$$
Taking the expectation over sequences $s \sim \mathcal{D}_R$ and averaging over tokens completes the proof. □
Corollary A1 (Structural Constraint Realization).
Theorem A1 implies that controlling the distributional drift $\mathcal{R}_{\mathrm{retain}}$ can be achieved by bounding $\|\Delta\theta\|_2$ and constraining the effective direction of $\Delta\theta$. Our three structural mechanisms realize this as follows:
1. Localization (Freezing A): Restricting updates to $\Delta\theta = A\,\Delta B$ with frozen $A \in \mathbb{R}^{d_{\mathrm{in}} \times r}$ and $r \ll d_{\mathrm{in}}$ reduces the effective parameter space dimension from $d$ to $O(r \cdot d_{\mathrm{out}})$, yielding $\|\Delta\theta\|_F^2 = \|A\,\Delta B\|_F^2 \le \|A\|_F^2 \|\Delta B\|_F^2$, thereby bounding the update magnitude via the fixed subspace defined by $A$.
2. Selective Protection (Sparse Masking on B): Applying an element-wise mask $M_t$ with sparsity $s$ enforces $\|\Delta B\|_0 \le (1 - s) \cdot |B|$, where $\|\cdot\|_0$ denotes the number of non-zero elements. By protecting the top-$s$ percentile of parameters (largest-magnitude entries critical to retained capabilities), we further constrain $\|\Delta B\|_F \le \sqrt{1-s}\,\|\Delta B_{\mathrm{unmasked}}\|_F$, reducing the perturbation magnitude.
3. Direction Control (Orthogonal Projection): Projecting gradients to be orthogonal to the previous task directions $\{v_i\}_{i=1}^{t-1}$ ensures $\Delta\theta^\top v_i = 0$ for all $i < t$, minimizing alignment with directions critical to retained knowledge and thereby reducing the effective impact on $\mathcal{D}_R$ in directions where $\|J_{\theta_{\mathrm{ref}}}(s_{<t})^\top v_i\|_2$ is large.
Remark A1.
Together, Theorem A1 and Corollary A1 establish that our parameter-space structural constraints provide a principled approximation to the distributional drift control $\mathcal{R}_{\mathrm{retain}}$ in Equation (4), with explicit approximation bounds. This justifies our implementation strategy as KL-inspired parameter-space drift control: motivated by distributional considerations but realized entirely through parameter-space operations with formal guarantees.
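The three mechanisms of Corollary A1 can be sketched together in NumPy. All names and sizes below are illustrative; in particular, the single stored direction `v` stands in for the full direction set, and in the actual method the projection operates within the masked coordinates rather than after masking:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, s = 32, 48, 4, 0.9        # toy sizes; s is the sparsity level

A = rng.standard_normal((d_in, r))        # (1) frozen LoRA projection matrix
B = rng.standard_normal((r, d_out))
grad_B = rng.standard_normal((r, d_out))  # gradient w.r.t. the trainable B

# (2) Selective protection: only the smallest-magnitude (1 - s) fraction of B
# stays trainable; large-magnitude entries are protected.
thresh = np.quantile(np.abs(B), 1 - s)
mask = (np.abs(B) <= thresh).astype(float)
g = (grad_B * mask).ravel()

# (3) Direction control: remove the component along a stored task direction v.
v = rng.standard_normal(g.size)
v /= np.linalg.norm(v)
g = g - (g @ v) * v                       # g @ v == 0 up to round-off

# (1) Localization: the resulting weight update has rank at most r.
delta_B = -1e-3 * g.reshape(r, d_out)
delta_theta = A @ delta_B
```

The update `delta_theta` is confined to the column space of `A` (rank at most $r$), touches only a small fraction of `B`'s entries, and is orthogonal to the stored direction, mirroring the three bounds in the corollary.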

Appendix B. Baseline Method Details

Let $\pi_\theta$ denote the language model parameterized by $\theta$. Each sample is a QA pair $(q, a)$, where $q$ is the question (prompt) and $a = (a_1, \ldots, a_T)$ is the answer token sequence. We define the token-level negative log-likelihood (NLL) loss on answer tokens as
$$\ell(q, a; \theta) = -\frac{1}{T}\sum_{t=1}^{T} \log \pi_\theta(a_t \mid q, a_{<t}),$$
where $T$ is the length of the answer sequence $a$, $a_t$ is the $t$-th token, $a_{<t}$ denotes the tokens before position $t$, and $\pi_\theta(a_t \mid q, a_{<t})$ is the model's predicted probability for the token at position $t$ given the question $q$ and preceding tokens $a_{<t}$. Given a dataset $\mathcal{D}$ (a set of QA pairs), the averaged training loss is
$$\mathcal{L}(\mathcal{D}; \theta) = \frac{1}{|\mathcal{D}|}\sum_{(q,a)\in\mathcal{D}} \ell(q, a; \theta).$$
Let $\mathcal{D}_f$ and $\mathcal{D}_r$ denote the forget set and retain set, respectively.
  • Gradient Ascent (GA).
GA aims to “forget” by increasing the loss on the forget set. Equivalently, if we implement unlearning via gradient descent, GA minimizes
$$\mathcal{L}_{\mathrm{GA}}(\theta) = -\mathcal{L}(\mathcal{D}_f; \theta),$$
which corresponds to performing gradient ascent on $\mathcal{L}(\mathcal{D}_f; \theta)$.
  • GA + GD (Gradient Difference).
GA + GD mitigates the utility degradation of GA by combining (i) gradient ascent on $\mathcal{D}_f$ and (ii) gradient descent on $\mathcal{D}_r$:
$$\mathcal{L}_{\mathrm{GA+GD}}(\theta) = -\mathcal{L}(\mathcal{D}_f; \theta) + \mathcal{L}(\mathcal{D}_r; \theta).$$
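As a concrete illustration, the NLL loss and the GA and GA + GD objectives can be sketched with toy per-token probabilities; the probabilities and set sizes below are made up, not from the paper's experiments:

```python
import numpy as np

def nll_loss(token_probs):
    # l(q, a; theta): average negative log-probability of the answer tokens.
    return -np.mean(np.log(token_probs))

def dataset_loss(batch):
    # L(D; theta): mean of the per-sample losses over the dataset.
    return np.mean([nll_loss(p) for p in batch])

# Toy per-token probabilities the model assigns to two answers.
D_f = [np.array([0.9, 0.8, 0.95])]        # forget set
D_r = [np.array([0.6, 0.7])]              # retain set

L_ga = -dataset_loss(D_f)                       # GA: ascend on the forget loss
L_ga_gd = -dataset_loss(D_f) + dataset_loss(D_r)  # GA + GD: add retain descent
```

Minimizing `L_ga` drives the forget-set probabilities down, while the added retain term in `L_ga_gd` counteracts the resulting utility loss.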
  • GA + KL (KL-regularized GA).
GA + KL further constrains distributional drift by adding a KL regularization term between the unlearned model and a reference model. Let $\pi_{\theta_0}$ be a reference model (e.g., the pre-unlearning model), and let $s = [q, a]$ be the concatenated sequence. Denote by $s_{<t}$ the prefix up to position $t - 1$, and by $\pi_\theta(\cdot \mid s_{<t})$ the next-token distribution. A commonly used KL-regularized objective is
$$\mathcal{L}_{\mathrm{GA+KL}}(\theta) = -\mathcal{L}(\mathcal{D}_f; \theta) + \lambda\, \mathcal{R}_{\mathrm{KL}}(\theta),$$
$$\mathcal{R}_{\mathrm{KL}}(\theta) = \frac{1}{|\mathcal{D}_r|}\sum_{s \in \mathcal{D}_r} \frac{1}{|s|}\sum_{t=2}^{|s|} D_{\mathrm{KL}}\big(\pi_{\theta_0}(\cdot \mid s_{<t}) \,\|\, \pi_\theta(\cdot \mid s_{<t})\big),$$
where $\lambda > 0$ controls the strength of the regularization.
  • Negative Preference Optimization (NPO).
NPO reduces the model's confidence on forget-set answers via a negative-preference objective. Given $(x, y) \in \mathcal{D}_f$ (here $x$ is the prompt and $y$ is the target response to be forgotten), a reference model $\pi_{\mathrm{ref}}$, and inverse temperature $\beta > 0$, the NPO loss is
$$\mathcal{L}_{\mathrm{NPO},\beta}(\theta) = -\frac{2}{\beta}\,\mathbb{E}_{(x,y)\sim\mathcal{D}_f}\left[\log \sigma\!\left(-\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right)\right]$$
$$= \frac{2}{\beta}\,\mathbb{E}_{(x,y)\sim\mathcal{D}_f}\left[\log\!\left(1 + \left(\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right)^{\beta}\right)\right],$$
where $\sigma(\cdot)$ is the sigmoid function.
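The two forms of the NPO loss above are algebraically identical, which a short numeric check confirms; the function names below are illustrative:

```python
import numpy as np

def npo_form1(log_ratio, beta):
    # -(2/beta) * log sigmoid(-beta * log_ratio), with sigmoid written out.
    return -(2.0 / beta) * np.log(1.0 / (1.0 + np.exp(beta * log_ratio)))

def npo_form2(log_ratio, beta):
    # (2/beta) * log(1 + (pi_theta / pi_ref)^beta), since ratio^beta = exp(beta * log_ratio).
    return (2.0 / beta) * np.log1p(np.exp(beta * log_ratio))

rng = np.random.default_rng(0)
beta = 0.5
log_ratios = rng.standard_normal(100)  # log pi_theta(y|x) - log pi_ref(y|x)
```

Evaluating both forms on the same random log-ratios gives matching values, because $-\log \sigma(-\beta z) = \log(1 + e^{\beta z})$.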
  • Direct Preference Optimization (DPO).
DPO is originally designed for paired human preferences; for unlearning, it can be adapted by constructing preference pairs that encourage neutral/non-target responses. Given preference pairs $(x, y_w, y_l)$, where $y_w$ is the preferred response (e.g., a neutral “I don't know” style answer) and $y_l$ is the dispreferred response (e.g., the original answer to be forgotten), the DPO objective is
$$\mathcal{L}_{\mathrm{DPO},\beta}(\theta) = -\frac{1}{\beta}\,\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],$$
where $\mathbb{E}_{(x, y_w, y_l)}$ denotes the expectation over preference pairs sampled from the dataset.

References

  1. Shi, H.; Xu, Z.; Wang, H.; Qin, W.; Wang, W.; Wang, Y.; Wang, Z.; Ebrahimi, S.; Wang, H. Continual learning of large language models: A comprehensive survey. ACM Comput. Surv. 2025, 58, 1–42.
  2. Liu, S.; Yao, Y.; Jia, J.; Casper, S.; Baracaldo, N.; Hase, P.; Yao, Y.; Liu, C.Y.; Xu, X.; Li, H.; et al. Rethinking machine unlearning for large language models. Nat. Mach. Intell. 2025, 7, 181–194.
  3. Wang, X.; Chen, T.; Ge, Q.; Xia, H.; Bao, R.; Zheng, R.; Zhang, Q.; Gui, T.; Huang, X.J. Orthogonal subspace learning for language model continual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023; Association for Computational Linguistics: Kerrville, TX, USA, 2023; pp. 10658–10671.
  4. He, J.; Guo, H.; Zhu, K.; Zhao, Z.; Tang, M.; Wang, J. Seekr: Selective attention-guided knowledge retention for continual learning of large language models. arXiv 2024, arXiv:2411.06171.
  5. Gao, C.; Wang, L.; Ding, K.; Weng, C.; Wang, X.; Zhu, Q. On large language model continual unlearning. arXiv 2024, arXiv:2407.10223.
  6. Liu, B.; Liu, Q.; Stone, P. Continual learning and private unlearning. In Proceedings of the Conference on Lifelong Learning Agents; PMLR: Cambridge, MA, USA, 2022; pp. 243–254.
  7. Chatterjee, R.; Chundawat, V.; Tarun, A.; Mali, A.; Mandal, M. A unified framework for continual learning and unlearning. arXiv 2024, arXiv:2408.11374.
  8. Huang, Z.; Cheng, X.; Zhang, J.; Zheng, J.; Wang, H.; He, Z.; Li, T.; Huang, X. A unified gradient-based framework for task-agnostic continual learning-unlearning. arXiv 2025, arXiv:2505.15178.
  9. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. Int. Conf. Learn. Represent. 2022, 1, 3.
  10. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
  11. Aljundi, R.; Babiloni, F.; Elhoseiny, M.; Rohrbach, M.; Tuytelaars, T. Memory aware synapses: Learning what (not) to forget. In European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2018; pp. 139–154.
  12. Zenke, F.; Poole, B.; Ganguli, S. Continual learning through synaptic intelligence. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2017; pp. 3987–3995.
  13. Chaudhry, A.; Dokania, P.K.; Ajanthan, T.; Torr, P.H. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2018; pp. 532–547.
  14. Guo, C.; Zhao, B.; Bai, Y. Deepcore: A comprehensive library for coreset selection in deep learning. In International Conference on Database and Expert Systems Applications; Springer: Cham, Switzerland, 2022; pp. 181–195.
  15. Feldman, D. Core-sets: Updated survey. In Sampling Techniques for Supervised or Unsupervised Tasks; Springer: Cham, Switzerland, 2019; pp. 23–44.
  16. Wang, T.; Zhu, J.Y.; Torralba, A.; Efros, A.A. Dataset distillation. arXiv 2018, arXiv:1811.10959.
  17. Yu, R.; Liu, S.; Wang, X. Dataset distillation: A comprehensive review. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 150–170.
  18. Ahn, H.; Cha, S.; Lee, D.; Moon, T. Uncertainty-based continual learning with adaptive regularization. arXiv 2019, arXiv:1905.11614.
  19. Jin, H.; Kim, E. Helpful or harmful: Inter-task association in continual learning. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 519–535.
  20. Maini, P.; Feng, Z.; Schwarzschild, A.; Lipton, Z.C.; Kolter, J.Z. Tofu: A task of fictitious unlearning for LLMs. arXiv 2024, arXiv:2401.06121.
  21. Jang, J.; Yoon, D.; Yang, S.; Cha, S.; Lee, M.; Logeswaran, L.; Seo, M. Knowledge unlearning for mitigating privacy risks in language models. In 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Kerrville, TX, USA, 2023; pp. 14389–14408.
  22. Zhang, R.; Lin, L.; Bai, Y.; Mei, S. Negative preference optimization: From catastrophic collapse to effective unlearning. arXiv 2024, arXiv:2404.05868.
  23. Fan, C.; Liu, J.; Lin, L.; Jia, J.; Zhang, R.; Mei, S.; Liu, S. Simplicity prevails: Rethinking negative preference optimization for LLM unlearning. arXiv 2024, arXiv:2410.07163.
  24. Cha, S.; Cho, S.; Hwang, D.; Lee, M. Towards robust and parameter-efficient knowledge unlearning for LLMs. arXiv 2024, arXiv:2408.06621.
  25. Russinovich, M.; Salem, A. Obliviate: Efficient unmemorization for protecting intellectual property in large language models. arXiv 2025, arXiv:2502.15010.
  26. Liu, Z.; Dou, G.; Tan, Z.; Tian, Y.; Jiang, M. Towards safer large language models through machine unlearning. arXiv 2024, arXiv:2402.10058.
  27. Ishibashi, Y.; Shimodaira, H. Knowledge sanitization of large language models. arXiv 2023, arXiv:2309.11852.
  28. Liu, Y.; Zhang, Y.; Jaakkola, T.; Chang, S. Revisiting Who’s Harry Potter: Towards targeted unlearning from a causal intervention perspective. arXiv 2024, arXiv:2407.16997.
  29. Xu, H.; Zhao, N.; Yang, L.; Zhao, S.; Deng, S.; Wang, M.; Hooi, B.; Oo, N.; Chen, H.; Zhang, N. Relearn: Unlearning via learning for large language models. arXiv 2025, arXiv:2502.11190.
  30. Shibata, T.; Irie, G.; Ikami, D.; Mitsuzumi, Y. Learning with selective forgetting. Int. Jt. Conf. Artif. Intell. 2021, 3, 4.
  31. Wang, Z.; Bi, B.; Pentyala, S.K.; Ramnath, K.; Chaudhuri, S.; Mehrotra, S.; Mao, X.B.; Asur, S.; Cheng, N. A comprehensive survey of LLM alignment techniques: RLHF, RLAIF, PPO, DPO and more. arXiv 2024, arXiv:2407.16216.
  32. Izzo, Z.; Smart, M.A.; Chaudhuri, K.; Zou, J. Approximate data deletion from machine learning models. In International Conference on Artificial Intelligence and Statistics; PMLR: Cambridge, MA, USA, 2021; pp. 2008–2016.
  33. Qiao, J.; Zhang, Z.; Tan, X.; Qu, Y.; Zhang, W.; Han, Z.; Xie, Y. Gradient projection for continual parameter-efficient tuning. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 9316–9329.
  34. Yao, J.; Chien, E.; Du, M.; Niu, X.; Wang, T.; Cheng, Z.; Yue, X. Machine unlearning of pre-trained large language models. arXiv 2024, arXiv:2402.15159.
  35. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Kerrville, TX, USA, 2004; pp. 74–81.
  36. Yuan, X.; Pang, T.; Du, C.; Chen, K.; Zhang, W.; Lin, M. A closer look at machine unlearning for large language models. arXiv 2024, arXiv:2410.08109.
  37. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv 2019, arXiv:1908.10084.
  38. Sileo, D. tasksource: A dataset harmonization framework for streamlined NLP multi-task learning and evaluation. arXiv 2023, arXiv:2301.05948.
  39. Liu, Z.; Zhu, T.; Tan, C.; Chen, W. Learning to refuse: Towards mitigating privacy risks in LLMs. In 31st International Conference on Computational Linguistics; Association for Computational Linguistics: Kerrville, TX, USA, 2025; pp. 1683–1698.
  40. Pinsker, M.S. Some mathematical questions of theory of information transmission. Probl. Inf. Transm. 2007, 43, 380–392.
Figure 1. Overview of the CLU framework. The model sequentially processes a stream of tasks $\{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_T\}$, where each task can be either a learning request ($R_t = L$) or an unlearning request ($R_t = U$). Through alternating learning and unlearning operations, the model parameters evolve from $\theta_0$ to $\theta_T$, achieving dynamic knowledge management while satisfying forgetting, retention, and acquisition constraints.
Figure 2. The unified distributional framework for CLU. The framework operates on three data partitions: the retain set $\mathcal{D}_R$ (historical knowledge to preserve), the learning set $\mathcal{D}_L$ (new knowledge to acquire), and the forget set $\mathcal{D}_U$ (target knowledge to eliminate). At each update step, the model $\theta$ is optimized relative to a reference model $\theta_{\mathrm{ref}}$ through drift-controlled updates, balancing three objectives: (i) retention regularization (drift minimization on $\mathcal{D}_R$) to maintain stability; (ii) learning via supervised fine-tuning on $\mathcal{D}_L$ for knowledge acquisition; and (iii) unlearning via gradient ascent on $\mathcal{D}_U$ for knowledge removal.
Figure 3. Overview of the proposed LoRA-based framework with frozen matrix A, sparse masking on matrix B, and orthogonal gradient projection for knowledge management in continual learning and unlearning.
Figure 4. Token-level KL divergence across the task sequence. The shaded area represents ±1 standard deviation.
Figure 5. Token-level JS divergence across the task sequence.
Table 1. Concise mapping from KL-inspired design principles to their corresponding algorithmic components in our implementation.

| Design Principle (Conceptual) | Implementation Component (Algorithmic) | Role/Intuition |
|---|---|---|
| Retention control on $\mathcal{D}_R$ (drift-aware stability) | Frozen LoRA projection matrix A (Section 3.3.1) | Constrains updates to a shared low-dimensional subspace, promoting stable behavior on retained knowledge. |
| Localization (reduce interference) | Sparse masking on B (Section 3.3.2) | Restricts parameter changes to a small subset, limiting collateral forgetting and isolating task-specific edits. |
| Direction control (protect past directions) | Orthogonal gradient projection (Section 3.3.3) | Removes update components aligned with previously learned directions, reducing destructive interference across tasks. |
Table 2. Performance comparison of different methods on Qwen3-4B-Instruct and Llama3-8B-Instruct models. Each task cell reports MU/FP.

| Model | Method | UL1 | CL1 | UL2 | CL2 | UL3 | CL3 | Average |
|---|---|---|---|---|---|---|---|---|
| Qwen3-4B-Instruct | GA | 0.39/0.60 | 0.50/0.59 | 0.31/0.76 | 0.43/0.71 | 0.27/0.80 | 0.37/0.70 | 0.536 |
| | GA + GD | 0.47/0.55 | 0.52/0.56 | 0.43/0.72 | 0.49/0.63 | 0.41/0.71 | 0.47/0.62 | 0.548 |
| | GA + KL | 0.57/0.48 | 0.52/0.47 | 0.43/0.67 | 0.48/0.63 | 0.46/0.66 | 0.47/0.61 | 0.538 |
| | NPO | 0.40/0.50 | 0.42/0.58 | 0.44/0.66 | 0.44/0.70 | 0.29/0.79 | 0.43/0.68 | 0.528 |
| | DPO | 0.54/0.49 | 0.51/0.51 | 0.45/0.64 | 0.49/0.63 | 0.47/0.67 | 0.48/0.71 | 0.549 |
| | LoRA | 0.52/0.54 | 0.54/0.49 | 0.03/0.96 | 0.28/0.82 | 0.03/0.99 | 0.23/0.86 | 0.524 |
| | Our Method | 0.59/0.50 | 0.61/0.52 | 0.58/0.51 | 0.58/0.52 | 0.55/0.54 | 0.56/0.66 | 0.560 |
| Llama3-8B-Instruct | GA | 0.52/0.52 | 0.59/0.51 | 0.41/0.77 | 0.44/0.64 | 0.38/0.69 | 0.38/0.70 | 0.546 |
| | GA + GD | 0.67/0.41 | 0.68/0.41 | 0.41/0.71 | 0.51/0.70 | 0.38/0.69 | 0.43/0.67 | 0.556 |
| | GA + KL | 0.62/0.50 | 0.68/0.49 | 0.41/0.65 | 0.47/0.61 | 0.36/0.72 | 0.39/0.66 | 0.547 |
| | NPO | 0.59/0.54 | 0.67/0.52 | 0.41/0.69 | 0.47/0.62 | 0.33/0.71 | 0.38/0.68 | 0.551 |
| | DPO | 0.59/0.40 | 0.59/0.36 | 0.52/0.61 | 0.52/0.57 | 0.48/0.63 | 0.53/0.59 | 0.533 |
| | LoRA | 0.59/0.50 | 0.69/0.37 | 0.03/0.97 | 0.29/0.84 | 0.03/0.99 | 0.25/0.89 | 0.537 |
| | Our Method | 0.81/0.39 | 0.80/0.38 | 0.74/0.38 | 0.73/0.38 | 0.68/0.45 | 0.67/0.46 | 0.573 |
Table 3. Sensitivity analysis of the sparsity parameter on model performance. Each task cell reports MU/FP.

| Sparsity | UL1 | CL1 | UL2 | CL2 | UL3 | CL3 |
|---|---|---|---|---|---|---|
| 0.0 | 0.52/0.54 | 0.54/0.49 | 0.03/0.96 | 0.28/0.82 | 0.03/0.99 | 0.23/0.86 |
| 0.3 | 0.52/0.54 | 0.55/0.49 | 0.10/0.94 | 0.24/0.84 | 0.09/0.94 | 0.17/0.83 |
| 0.5 | 0.52/0.54 | 0.55/0.49 | 0.20/0.79 | 0.27/0.81 | 0.11/0.90 | 0.21/0.80 |
| 0.7 | 0.52/0.54 | 0.58/0.52 | 0.30/0.73 | 0.33/0.72 | 0.19/0.82 | 0.25/0.77 |
| 0.9 | 0.59/0.50 | 0.61/0.52 | 0.58/0.51 | 0.58/0.52 | 0.55/0.54 | 0.56/0.56 |
Table 4. Ablation study under different settings. Each task cell reports MU/FP.

| a | b | c | UL1 | CL1 | UL2 | CL2 | UL3 | CL3 |
|---|---|---|---|---|---|---|---|---|
| × | × | × | 0.52/0.54 | 0.54/0.49 | 0.03/0.96 | 0.28/0.82 | 0.03/0.99 | 0.23/0.86 |
| ✓ | × | × | 0.59/0.50 | 0.60/0.50 | 0.38/0.76 | 0.47/0.61 | 0.31/0.72 | 0.40/0.62 |
| × | ✓ | × | 0.52/0.54 | 0.63/0.47 | 0.54/0.57 | 0.54/0.57 | 0.34/0.71 | 0.46/0.60 |
| × | × | ✓ | 0.52/0.54 | 0.54/0.49 | 0.03/0.97 | 0.18/0.94 | 0.01/0.98 | 0.15/0.97 |
| × | ✓ | ✓ | 0.52/0.54 | 0.63/0.46 | 0.56/0.53 | 0.53/0.55 | 0.35/0.72 | 0.44/0.64 |
| ✓ | × | ✓ | 0.59/0.50 | 0.59/0.48 | 0.38/0.68 | 0.45/0.64 | 0.31/0.71 | 0.36/0.66 |
| ✓ | ✓ | × | 0.59/0.50 | 0.61/0.51 | 0.58/0.50 | 0.59/0.52 | 0.44/0.52 | 0.44/0.64 |
| ✓ | ✓ | ✓ | 0.59/0.50 | 0.61/0.52 | 0.58/0.51 | 0.59/0.52 | 0.55/0.54 | 0.56/0.56 |
Table 5. Parameter and computational efficiency on Qwen3-4B-Instruct.

| Method | Trainable Params | Ratio | FLOPs/Step |
|---|---|---|---|
| Base Model | 3.74B | – | – |
| Full Fine-tuning | 3.74B | 100.0% | 183.8 TFLOPs (100.0%) |
| LoRA (r = 8) | 15.63M | 0.42% | 62.3 TFLOPs (33.9%) |
| Ours (r = 8) | 8.40M | 0.22% | 62.0 TFLOPs (33.7%) |
Table 6. Performance comparison on the real-world dataset with Qwen3-4B-Instruct. Each task cell reports MU/FP.

| Method | UL1 | CL1 | UL2 | CL2 | UL3 | CL3 | Average |
|---|---|---|---|---|---|---|---|
| GA | 0.69/0.57 | 0.75/0.45 | 0.51/0.60 | 0.69/0.56 | 0.56/0.66 | 0.64/0.59 | 0.606 |
| GD | 0.69/0.47 | 0.74/0.48 | 0.73/0.51 | 0.71/0.56 | 0.63/0.59 | 0.61/0.62 | 0.612 |
| GA + KL | 0.76/0.40 | 0.78/0.55 | 0.68/0.54 | 0.69/0.54 | 0.57/0.65 | 0.59/0.62 | 0.614 |
| NPO | 0.69/0.52 | 0.77/0.51 | 0.63/0.54 | 0.70/0.53 | 0.59/0.60 | 0.62/0.59 | 0.601 |
| DPO | 0.69/0.51 | 0.77/0.51 | 0.72/0.54 | 0.71/0.51 | 0.60/0.60 | 0.60/0.58 | 0.617 |
| LoRA | 0.80/0.39 | 0.71/0.42 | 0.17/0.87 | 0.29/0.83 | 0.07/0.95 | 0.33/0.72 | 0.546 |
| Ours | 0.81/0.42 | 0.79/0.45 | 0.76/0.48 | 0.74/0.51 | 0.71/0.54 | 0.68/0.55 | 0.620 |

Share and Cite

MDPI and ACS Style

Lang, J.; Li, L.; Zeng, D. A Unified Knowledge Management Framework for Continual Learning and Machine Unlearning in Large Language Models. Information 2026, 17, 238. https://doi.org/10.3390/info17030238

