Machine Unlearning: A Perspective, Taxonomy, and Benchmark Evaluation

Cosentino, Cristian; Gatto, Simone; Liò, Pietro; Marozzo, Fabrizio

doi:10.3390/fi18030174

Open Access Perspective

¹ Department of Informatics, Modeling, Electronics and Systems Engineering (DIMES), University of Calabria, 87036 Rende, Italy

² Department of Computer Science and Technology, University of Cambridge, Cambridge CB3 0FD, UK

* Author to whom correspondence should be addressed.

Future Internet 2026, 18(3), 174; https://doi.org/10.3390/fi18030174

Submission received: 6 February 2026 / Revised: 12 March 2026 / Accepted: 18 March 2026 / Published: 23 March 2026

Abstract

Machine Learning (ML) models trained on large-scale datasets learn useful predictive patterns, but they may also memorize undesired information, leading to risks such as information leakage, bias, copyright violations, and privacy attacks. As these models are increasingly deployed in real-world and regulated settings, the consequences of such memorization become practical and high-stakes, reinforced by data-protection frameworks that grant individuals a Right to be Forgotten (e.g., the GDPR). Simply removing a record from the training dataset does not guarantee the elimination of its influence from the model, while retrain-from-scratch procedures are often prohibitive for modern architectures, including Transformers and Large Language Models (LLMs). In this work, we provide a perspective on Machine Unlearning (MU) in supervised learning settings, with a particular focus on Natural Language Processing (NLP) scenarios, grounded in a PRISMA-driven systematic review. We propose a multi-level taxonomy that organizes MU techniques along practical and conceptual dimensions, including exactness (exact versus approximate), unlearning granularity, guarantees, and application constraints. To complement this perspective, we run an illustrative benchmark evaluation using a standardized unlearning protocol on DistilBERT trained on a public corpus of news headlines for topic classification, contrasting the retraining gold standard with representative design-for-unlearning and approximate post hoc techniques. For completeness, we also report two oracle-assisted upper-bound baselines (distillation and scrubbing) that rely on a clean retrained reference model, and we account for their incremental cost separately. Our analysis jointly considers model utility, probabilistic quality, forgetting and privacy indicators, as well as computational efficiency. 
The results highlight systematic trade-offs between accuracy, computational cost, and removal effectiveness, providing practical guidance for selecting machine unlearning techniques in realistic deployment scenarios.

Keywords: machine unlearning; privacy; deep learning; transformers; DistilBERT; SISA; approximate unlearning; membership inference; right to be forgotten

1. Introduction

Machine Learning (ML) models are trained to identify predictive patterns from large-scale datasets and to reuse such patterns for inference on unseen data. In modern Deep Neural Networks (DNNs), however, learned knowledge is not stored in a separable or easily removable form: information can become implicitly memorized and distributed across model parameters. As a consequence, the influence of specific training samples cannot be retroactively removed by simple data deletion. This phenomenon exposes ML systems to concrete risks, including leakage of sensitive information, amplification of dataset biases, memorization of copyright-protected content, and vulnerabilities to malicious samples such as poisoning or backdoor attacks. Moreover, adversaries may exploit membership inference attacks to infer whether an individual’s data were used during training [1].

The practical relevance of this issue has been amplified by regulatory frameworks such as the General Data Protection Regulation (GDPR), which establishes the Right to be Forgotten (Art. 17) [2]. In the context of AI systems, however, removing a record from the training dataset is insufficient to ensure compliance: a trained model may continue to encode and reveal information about deleted samples through its parameters. The most direct remedy—retrain-from-scratch on the remaining data—is often impractical for modern architectures, including Transformer-based models and, even more so, Large Language Models (LLMs), due to prohibitive computational costs, energy consumption, and service downtime. To bridge this gap between regulatory obligations and technical feasibility, Machine Unlearning (MU) has emerged as a principled approach for efficiently and effectively removing the influence of specific training data from an already trained model, ideally yielding behavior indistinguishable from a model that has never been exposed to such data [1]. As such, MU represents a key enabling component for responsible, secure, and regulation-compliant AI, offering a practical alternative to full retraining [1,2].

Despite rapid progress, the MU landscape remains heterogeneous: methods differ in the strength of their removal guarantees, the granularity of deletion requests, the extent to which they require design-time constraints, and the way forgetting is evaluated and reported. In particular, empirical claims of “forgetting” are often supported primarily by utility or efficiency metrics, while robust evidence of removal effectiveness and alignment to the retraining gold standard is less consistently assessed across studies [3,4]. This fragmentation complicates both scientific comparison and practical adoption, especially for Transformer-based NLP models where retraining costs are high and information can be diffuse across representations.

Motivated by this fragmentation, and by the fact that “forgetting” claims are often not evaluated against privacy-relevant threats or the retraining gold standard, this work develops an evidence-driven, deployment-oriented view of supervised MU for Transformer-based NLP. We first consolidate the empirical state of the field through a PRISMA-guided review, extracting recurring design choices, assumptions, and evaluation pitfalls that shape real-world adoptability. Building on these findings, we introduce a multi-level taxonomy and decision framework that organizes MU methods by their operational requirements (e.g., design-time constraints vs. post hoc applicability), expected deletion regime, and strength of removal guarantees, with the explicit goal of helping practitioners match techniques to assurance needs.

We then ground this perspective in a reproducible benchmark study using a standardized unlearning protocol on DistilBERT trained on a public corpus of news headlines for topic classification, where retrain-from-scratch serves as the reference point. We compare representative design-for-unlearning strategies and approximate post hoc approaches, and we assess not only utility but also removal effectiveness under adversarial scrutiny, most notably through membership inference risk, together with calibration and computational cost. To reduce ambiguity in what it means to be “indistinguishable from retraining”, we complement conventional metrics with distributional and structural alignment measures that quantify how closely the unlearned model matches the retrained one. Overall, the results surface consistent trade-offs between accuracy, privacy/forgetting strength, and time-to-unlearn, providing practical guidance on when approximate unlearning is sufficient and when stronger assurances warrant heavier procedures.

The remainder of the paper is organized as follows. Section 2 introduces the essential background, including definitions of supervised Machine Unlearning, Transformer-based models, and the AG News dataset. Section 3 presents the PRISMA-driven systematic review of Machine Unlearning, detailing the search protocol, the classification axes, and a comparative synthesis of existing approaches. Section 4 introduces the proposed taxonomy and decision framework, mapping the main families of MU techniques. Section 5 describes the methods and experimental setup, including the pipeline, the definition of forget and retain sets, the implemented techniques, and the evaluation metrics. Section 6 reports the experimental results and analyzes efficiency aspects. Section 7 discusses the observed trade-offs, practical implications, and limitations. Finally, Section 8 concludes the paper.

2. Background

2.1. Supervised Learning and the Need for Data Removal

Consider a supervised classification model $f_{\theta}$ parameterized by weights $\theta$, trained on a dataset $D = \{(x_{i}, y_{i})\}_{i=1}^{n}$ of $n$ labeled examples, where $x_{i}$ denotes the input (e.g., a text) and $y_{i}$ the corresponding class label. Under the Empirical Risk Minimization (ERM) principle, training seeks parameters that minimize the average per-example loss (e.g., cross-entropy) over the dataset $D$. In practice, $\theta$ is optimized via stochastic gradient descent (SGD) and its variants.
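Written out, the ERM objective and a mini-batch SGD step described above take the following form. The per-example loss $\ell$, the minimizer $\theta^{*}$, the learning rate $\eta$, and the mini-batch $B_{t}$ are notation introduced here for illustration and are not symbols used elsewhere in the paper.

```latex
\[
\begin{aligned}
\theta^{*} &\in \arg\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f_{\theta}(x_{i}),\, y_{i}\bigr), \\
\theta_{t+1} &= \theta_{t} - \eta \, \nabla_{\theta} \, \frac{1}{|B_{t}|} \sum_{i \in B_{t}} \ell\bigl(f_{\theta_{t}}(x_{i}),\, y_{i}\bigr),
\qquad B_{t} \subseteq \{1, \dots, n\}.
\end{aligned}
\]
```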

After training, information extracted from the data is not stored as a set of separable records but becomes distributed across parameters and internal representations. In deep neural networks, the influence of a single example is entangled with that of many others and propagates across multiple layers, making it non-trivial to retroactively remove the impact of specific training samples. As a result, deleting a subset $D_{f} \subset D$ from the dataset does not imply that the trained model has “forgotten” it: the model may still encode and reveal information through its predictions or through privacy and security vulnerabilities (e.g., membership inference).

The practical need for data removal arises in several settings, including regulatory compliance (e.g., the Right to be Forgotten), mitigation of security threats (poisoning/backdoor), removal of unauthorized or copyright-protected data, correction of biased or erroneous records, and model maintenance in operational pipelines (e.g., rollback, auditing, and data drift management). These scenarios motivate mechanisms that can reduce or eliminate the influence of $D_{f}$ while preserving utility on the retained data and keeping computational costs sustainable.

2.2. Problem Formulation and Definitions

Let $D$ denote the training dataset and $D_{f} \subset D$ the forget set, with $D_{\text{retain}} = D \setminus D_{f}$ indicating the retain set. Machine Unlearning (MU) addresses the problem of efficiently removing the influence of $D_{f}$ from a model $f_{\theta}$ trained on $D$. Formally, the goal is to compute an updated parameterization $\theta \to \theta'$ such that the resulting model $f_{\theta'}$ behaves, as closely as possible, like a model trained from scratch on $D_{\text{retain}}$, which represents the gold standard, while preserving predictive performance on retained data and satisfying practical constraints on cost, scalability, and availability [4,5].

Because training is stochastic, the gold standard does not correspond to a single deterministic model. Let $A$ be a (potentially randomized) training algorithm that induces a distribution over trained models $P_{A}(D)$ when trained on $D$. A trained model can be viewed as a sample from $P_{A}(D)$, while a retrained model corresponds to a sample from $P_{A}(D_{\text{retain}})$. In the strongest formulation, MU aims to produce an unlearned model that is statistically indistinguishable from a sample drawn from $P_{A}(D_{\text{retain}})$, i.e., it matches (or closely approximates) the distribution induced by retraining on the retained data.

In practice, existing approaches differ in how closely they target the retraining distribution. Exact or near-exact unlearning methods attempt to match the retraining behavior (or an operational approximation thereof), often by controlling the training process and/or reusing intermediate states. While this can be feasible for certain algorithmic classes, it typically introduces substantial overhead for monolithic deep neural networks. Strategies that are design-for-unlearning, such as SISA-style partitioning, aim to confine the influence of data to localized components (e.g., shards/slices), enabling more targeted removals at the cost of structural constraints and additional training complexity [4,6,7]. By contrast, approximate unlearning relaxes indistinguishability and requires only statistical proximity within an explicit tolerance, often inspired by notions related to Differential Privacy. These methods typically offer better efficiency and scalability, but provide weaker or implicit guarantees about forgetting completeness.
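One common way to make the “explicit tolerance” of approximate unlearning precise, borrowed from Differential Privacy, is sketched below. It is a generic formalization, not the definition adopted by any specific method discussed here; the unlearning mechanism $U$ and the set of model outcomes $T$ are symbols introduced for this illustration.

```latex
\[
\begin{aligned}
\Pr\bigl[\, U(A(D), D, D_{f}) \in T \,\bigr] &\le e^{\varepsilon} \, \Pr\bigl[\, A(D \setminus D_{f}) \in T \,\bigr] + \delta, \\
\Pr\bigl[\, A(D \setminus D_{f}) \in T \,\bigr] &\le e^{\varepsilon} \, \Pr\bigl[\, U(A(D), D, D_{f}) \in T \,\bigr] + \delta,
\end{aligned}
\qquad \text{for all measurable } T .
\]
```

Under this reading, $\varepsilon = \delta = 0$ recovers the strongest formulation (the unlearned and retrained models follow the same distribution), while larger values quantify how far an approximate method is allowed to deviate from retraining.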

Evaluating MU is inherently multi-dimensional and typically involves a trade-off among three objectives: (i) efficiency (time/cost relative to full retraining), (ii) retained utility on $D_{\text{retain}}$ and on the original task, and (iii) forgetting quality (effective removal of the influence of $D_{f}$ and closeness to the retraining gold standard, including robustness against attacks such as membership inference). In practice, improving one aspect often degrades another, and the desired operating point depends on the application scenario.
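To make the forgetting-quality objective concrete, the following minimal sketch scores a simple loss-threshold membership inference attack by its AUC. It is an illustration under stated assumptions, not the audit used in this paper: the function name `attack_auc` and all loss values are hypothetical, and real audits typically use stronger attacks and many more samples.

```python
def attack_auc(member_losses, nonmember_losses):
    """AUC of a loss-threshold membership attack.

    Returns the probability that a randomly chosen 'member' example has a
    lower loss than a randomly chosen 'non-member' example (ties count 0.5).
    A value near 0.5 means the attacker cannot tell the two groups apart.
    """
    wins = 0.0
    for m in member_losses:
        for u in nonmember_losses:
            if m < u:
                wins += 1.0
            elif m == u:
                wins += 0.5
    return wins / (len(member_losses) * len(nonmember_losses))


# Hypothetical per-example cross-entropy losses (illustrative numbers only).
unseen = [0.90, 1.10, 0.70, 1.30]          # held-out examples never trained on
forget_before = [0.05, 0.10, 0.08, 0.12]   # forget set under the original model
forget_after = [0.80, 1.20, 0.60, 1.40]    # forget set after unlearning

auc_before = attack_auc(forget_before, unseen)
auc_after = attack_auc(forget_after, unseen)
print(f"attack AUC before unlearning: {auc_before:.2f}")
print(f"attack AUC after unlearning:  {auc_after:.2f}")
```

In this toy setting the attack separates forget-set examples perfectly before unlearning and drops to chance level afterwards, which is the behavior one would expect from a retrained model.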

2.3. Transformer-Based NLP Models

The Transformer [8] is an attention-based architecture that replaces recurrence with highly parallelizable computation. In its standard form, it consists of stacks of encoder and decoder blocks. Each layer includes multi-head self-attention, a position-wise feed-forward network, and residual connections followed by layer normalization (Add & Norm); decoder layers additionally attend to the encoder output. Since self-attention has no built-in notion of token order (it is permutation-equivariant), sequence order is injected through positional encodings.

Building on this architecture, BERT [9] is an encoder-only Transformer that learns bidirectional contextual representations via self-supervised pre-training, primarily through Masked Language Modeling (and Next Sentence Prediction in the original formulation). A common configuration, BERT_BASE, uses 12 layers, hidden size $d_{\text{model}} = 768$, 12 attention heads, and approximately 110 M parameters.

DistilBERT [10] is a compressed variant obtained via Knowledge Distillation, transferring knowledge from a BERT_BASE teacher to a smaller student trained with a distillation objective (without NSP). DistilBERT preserves the encoder-only structure but reduces depth to six layers (with $d_{\text{model}} = 768$ and 12 heads), removes segment embeddings, and omits the pooler. This yields roughly 66 M parameters (about 40% fewer than BERT_BASE) and substantially faster inference, while retaining most of BERT_BASE’s downstream performance. These properties make DistilBERT a practical choice for MU benchmarking, where multiple training and unlearning cycles are required under limited computational budgets.
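The parameter counts quoted above can be checked with a short back-of-the-envelope calculation. The sketch below assumes values that are not stated in the text: a WordPiece vocabulary of 30,522 tokens, 512 position embeddings, and a feed-forward width of 3072, as in the publicly released base checkpoints. The helper `encoder_params` is hypothetical and counts only encoder weights (no task-specific classification head).

```python
def encoder_params(layers, hidden=768, ffn=3072, vocab=30522,
                   max_pos=512, segment_types=0, pooler=False):
    """Count weights and biases of a BERT-style encoder (assumed sizes)."""
    layer_norm = 2 * hidden                        # scale and shift vectors
    embeddings = (vocab + max_pos + segment_types) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)     # Q, K, V, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    head = hidden * hidden + hidden if pooler else 0
    return embeddings + layers * per_layer + head


# BERT_BASE: 12 layers, segment embeddings, pooler.
bert_base = encoder_params(layers=12, segment_types=2, pooler=True)
# DistilBERT: 6 layers, no segment embeddings, no pooler.
distilbert = encoder_params(layers=6)
reduction = 1 - distilbert / bert_base

print(f"BERT_BASE:  {bert_base / 1e6:.1f} M parameters")
print(f"DistilBERT: {distilbert / 1e6:.1f} M parameters")
print(f"reduction:  {reduction:.1%}")
```

Under these assumptions the counts come out at about 109.5 M and 66.4 M, consistent with the rounded figures of 110 M and 66 M and with a reduction of roughly 40%.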

2.4. AG News Dataset

AG News (AG’s News Topic Classification Dataset) is a widely used benchmark for text classification. It is derived from a larger collection of news articles and focuses on the four most prevalent categories, yielding a compact and balanced corpus suitable for reproducible experimentation.

The task involves four classes (World, Sports, Business, Sci/Tech) and two standard splits: a train set of 120,000 instances and a test set of 7600 instances. Each sample includes a label and a text composed of title and short description. A key property for evaluation is the perfect class balance (30,000 training instances per class and 1900 test instances per class), making accuracy particularly informative (chance level: 25%).

AG News is well-suited for unlearning experiments because it combines: (i) broad adoption and integration in standard libraries (facilitating reproducibility and comparisons); (ii) moderate scale enabling repeated training/unlearning runs; and (iii) balanced class structure that supports controlled construction of forget sets (e.g., class-balanced removals), which stabilizes the analysis of trade-offs between forgetting, utility, and efficiency.
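A class-balanced forget set of the kind mentioned in point (iii) can be drawn as in the minimal sketch below. The function name, the seed, and the forget-set size (300 examples per class, i.e., 1% of the training split) are illustrative assumptions rather than the configuration used in the benchmark; the label list is a synthetic stand-in with the same size and balance as the AG News training split.

```python
import random


def class_balanced_forget_split(labels, per_class, seed=0):
    """Sample `per_class` indices from every class as the forget set.

    Returns (forget_indices, retain_indices), both sorted and disjoint.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    forget = set()
    for indices in by_class.values():
        forget.update(rng.sample(indices, per_class))

    retain = [i for i in range(len(labels)) if i not in forget]
    return sorted(forget), retain


# Synthetic stand-in: 120,000 labels, four perfectly balanced classes.
labels = [i % 4 for i in range(120_000)]
forget_idx, retain_idx = class_balanced_forget_split(labels, per_class=300)
print(len(forget_idx), len(retain_idx))
```

Fixing the seed keeps the forget and retain sets identical across methods, so that every technique is compared against the same retraining reference.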

3. A PRISMA-Grounded Perspective on Machine Unlearning

This section provides the evidence base for our perspective on Machine Unlearning (MU), grounded in a PRISMA-driven systematic review [11,12]. The goal is to consolidate a rapidly expanding and heterogeneous body of work into a coherent view, by clarifying the scope of the review, defining a set of overlapping classification axes, and providing a transparent methodology for study identification and selection. We then synthesize existing approaches, highlight recurring patterns, and summarize open challenges that emerge across the literature [3,4,13].

3.1. Scope and Classification Axes

We consider in scope Machine Unlearning techniques for supervised machine learning, with particular emphasis on deep neural networks and medium-to-large models, for which retrain-from-scratch is the conceptual gold standard but is often computationally impractical. We include both methods that provide theoretical guarantees (e.g., exact/certified unlearning) and empirical solutions based on approximate unlearning, and we discuss their strengths and limitations under realistic motivations such as privacy, security, and fairness [3,4].

Several related topics are treated as out of scope. First, we exclude catastrophic forgetting in continual learning, which concerns unintentional loss of prior knowledge and is orthogonal to intentional data removal. Second, we do not provide an extended legal analysis of the Right to be Forgotten, which we treat as a motivating requirement rather than the object of legal discussion. Third, we exclude non-supervised settings such as reinforcement learning. Finally, we omit model classes for which unlearning reduces to straightforward retraining without substantial technical challenges. These choices keep the review focused on the most technically relevant MU developments for modern learning systems [3,4].

To compare heterogeneous approaches, we adopt a multi-axis classification that captures both conceptual and operational differences. We first differentiate works by model scale and learning setting, i.e., small-to-medium models (where retraining may remain feasible) versus large-scale architectures such as Transformers and LLMs (where localized or approximate interventions are typically preferred), as well as distributed scenarios such as federated learning, where the unit of removal is often the client rather than individual instances [3,13].

A second axis concerns the type of guarantee provided by the unlearning procedure. At one end of the spectrum, exact or near-exact approaches aim to match the behavior of a model trained on retained data, whereas approximate techniques trade strict equivalence for efficiency. Between these extremes, certified approximate methods provide probabilistic guarantees (often inspired by $(\varepsilon, \delta)$-style indistinguishability) that partially bridge the gap between practicality and reliability [3,4].

We further categorize works by unlearning granularity, ranging from instance-level and subset-level removal to class-level, concept-level, and client-level unlearning. Another key dimension is the operational strategy: full retraining, design-for-unlearning approaches (e.g., partitioning/sharding such as SISA [6]), and post-training (post hoc) interventions including gradient-based updates, influence/Hessian approximations, teacher–student distillation, and weight scrubbing or masking [3,4]. Finally, we group works by their application motivation (privacy/RTBF, security, data cleaning, model updating, fairness), since motivation strongly influences assumptions and evaluation protocols [3,4].

3.2. Review Methodology (PRISMA)

We follow PRISMA 2020 recommendations for systematic reviews and adopt PRISMA-S to transparently report the search strategy [11,12].

3.2.1. Databases and Search Strings

To ensure broad coverage of MU literature, we queried five major sources: Google Scholar, IEEE Xplore, ACM Digital Library, Scopus, and arXiv. The choice combines peer-reviewed venues and preprint repositories to capture both established and recent contributions. Searches were conducted without temporal restrictions and included all publications available up to (and including) 15 September 2025.

Search strings were derived from a core set of MU synonyms, including machine unlearning, machine forgetting, algorithmic forgetting, selective forgetting, data removal, and data deletion. These terms were combined using OR to maximize recall, optionally constrained via AND (e.g., adding neural network) to improve precision. To avoid unrelated retrieval dominated by continual-learning work, the term catastrophic was explicitly excluded using NOT operators or equivalent syntax. The final queries, adapted to each database, are reported in Table 1.
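The composition rule described above can be sketched as a small helper. This is only an illustration in generic Boolean syntax: the function `build_query` is hypothetical, and the queries actually submitted to each database are those reported in Table 1, whose operators and field restrictions differ per source.

```python
def build_query(synonyms, require=None, exclude=None):
    """Combine synonyms with OR, then add optional AND / NOT constraints."""
    query = "(" + " OR ".join(f'"{term}"' for term in synonyms) + ")"
    if require:
        query += f' AND "{require}"'
    if exclude:
        query += f' NOT "{exclude}"'
    return query


# Core synonym set listed in the text.
MU_TERMS = [
    "machine unlearning",
    "machine forgetting",
    "algorithmic forgetting",
    "selective forgetting",
    "data removal",
    "data deletion",
]

query = build_query(MU_TERMS, require="neural network", exclude="catastrophic")
print(query)
```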

3.2.2. Inclusion and Exclusion Criteria

Eligibility criteria were defined a priori in line with PRISMA recommendations [11]. We included peer-reviewed conference or journal papers, as well as preprints that propose, implement, or evaluate MU techniques, published in English and available up to the search date. We also included works in federated learning or generative/LLM settings where the primary focus is unlearning of training data [3,13].

We excluded studies not directly addressing intentional data removal (including continual-learning catastrophic forgetting), duplicate records indexed across multiple sources, works without accessible full text, and non-scientific items such as editorials or materials without new evidence.

3.2.3. PRISMA Flow Diagram

The study selection process follows the standard PRISMA workflow: identification, deduplication, title/abstract screening, full-text eligibility assessment, and final inclusion [11]. Figure 1 reports the PRISMA 2020 flow diagram with the number of records retained at each stage.

3.3. Comparative Analysis of Existing Works

To enable a systematic comparison of the included studies, we compiled a comparative table capturing, for each work, the learning setting (centralized vs. federated; static vs. online), model family, unlearning granularity, methodological class, assumptions (e.g., data partitioning, convexity/linearity), threat model, evaluation metrics (utility, forgetting/privacy indicators, overhead), datasets and baselines, code availability, and reported limitations. This synthesis is aimed not only at cataloging contributions, but at making explicit the recurring trade-offs and the information that is often missing in experimental reporting.

Across studies, we observe a consistent imbalance in evaluation practice. While post-unlearning utility and computational overhead are frequently reported, robust evidence of effective forgetting is less consistent. In particular, systematic comparisons against the retraining gold standard and privacy-oriented audits (e.g., membership inference) are not uniformly adopted, which complicates cross-paper comparability and the interpretation of “forgetting” claims [3,4].

Methodologically, the literature spans several recurring families. First, exact or design-for-unlearning approaches, most notably full retraining and partition-based training such as SISA, seek strong or near-exact forgetting behavior by construction, at the cost of architectural/training constraints and additional storage or preprocessing [6]. Second, certified removal methods aim to provide formal bounds on the deviation from retraining, typically under restrictive assumptions that limit their applicability to large deep models [3,4]. Third, approximate post hoc techniques update model parameters or outputs after training, using gradient-based updates, influence/Hessian approximations, selective scrubbing/masking/pruning, or teacher–student distillation; these methods prioritize efficiency and scalability, but their forgetting guarantees are often empirical and may depend strongly on the evaluation protocol [3,4]. Finally, federated unlearning and recent concept-/knowledge-level unlearning for LLMs further expand the design space, but currently rely on less standardized benchmarks and still-evolving notions of what constitutes successful forgetting [13].

Overall, cross-mapping the works along the classification axes reveals a clear asymmetry: approximate instance- or class-level unlearning dominates current practice, whereas near-exact approaches remain comparatively rare and typically require design-time constraints. As model scale increases, efficiency gains are often achieved by relaxing equivalence to retraining, highlighting a persistent tension between practicality and verifiable guarantees [3,4,13].

3.4. Open Challenges and Research Gaps

Despite rapid progress, the surveyed literature reveals several persistent gaps that hinder reliable comparison and deployment of MU methods.

A first limitation is the lack of standardized benchmarks and shared protocols. Many studies rely on heterogeneous datasets and unlearning scenarios, often adopting different definitions of the forget set, different removal schedules (single-shot vs. repeated requests), and different baselines. This heterogeneity makes it difficult to isolate methodological contributions and to compare trade-offs across papers [3,4]. Moreover, realistic operational conditions such as multiple requests over time, batch removals, and long-term maintenance are frequently simplified, and code and data releases remain uneven [4].

A second challenge is the absence of uniform and robust forgetting metrics. Current evaluations range from changes in loss/accuracy on the forget set to distributional measures and privacy attacks (e.g., membership inference), often without a shared auditing protocol. Importantly, reduced performance on forgotten data does not necessarily imply reduced privacy leakage, and the two are not consistently disentangled in experimental designs [3,4].

Finally, scalability and deployment constraints remain underexplored. The integration of MU into real-world pipelines requires handling repeated requests, model versioning, traceability of deletions, service-level constraints, and drift-aware updating, yet these aspects are rarely addressed systematically in the experimental literature [4]. These issues become more pronounced in LLMs and federated settings: in LLMs, knowledge is diffuse and may lead to behavior-level suppression rather than true removal, whereas in federated learning the unlearning unit is typically the client and communication constraints interact with privacy and threat models [3,13]. Addressing these gaps will be essential to move MU from academic prototypes to reliable components of model governance and compliance in production systems.

4. Taxonomy and Decision Framework

The body of work on Machine Unlearning (MU) has expanded quickly, spanning security-driven formulations, privacy-motivated data deletion, and production-oriented model editing. While this growth has produced a rich set of solutions, it has also introduced substantial heterogeneity in assumptions, targets, and evaluation protocols. To make the outcomes of our PRISMA-driven systematic review actionable, we distill the evidence from the included studies into (i) a compact taxonomy grounded in recurring dimensions and (ii) a decision framework that supports method selection under practical constraints [1,11,12,14].

4.1. Taxonomy Dimensions

Our taxonomy is designed to be operational: its dimensions are chosen because they repeatedly appear as explicit design variables in the surveyed works and because they directly determine feasibility and guarantees in realistic deployments. Figure 2 summarizes the four primary dimensions.

(D1) Removal Guarantees: exact, certified, or approximate. The first dimension captures the strength of the removal guarantee. Exact (or near-exact) unlearning targets equivalence to retraining on $D \setminus D_{f}$, where $D_{f}$ is the forget set; retraining-from-scratch remains the reference baseline and the most reliable option when strict guarantees are required [15,16]. Certified removal provides formal (often probabilistic) bounds on the deviation from retraining, typically under structural assumptions such as convexity or smoothness [17,18]. Approximate unlearning trades strict equivalence for efficiency, aiming to substantially reduce the influence of $D_{f}$ while preserving utility on retained data [19,20].
(D2) Granularity of the unlearning target. The second dimension specifies what must be forgotten: individual instances or small subsets, entire classes, higher-level concepts/knowledge, or client-level contributions in federated learning [6,21]. This choice shapes both the algorithmic strategy (local vs. distributed interventions) and the evaluation (sample-level tests vs. concept-level probes).
(D3) Model scale and operational setting. The third dimension captures the interaction between model scale (small/convex vs. deep networks and large models) and setting (centralized vs. federated). Second-order and influence-based approaches are often most natural for smaller or structured objectives [22,23], whereas large-scale models (e.g., LLMs) frequently motivate distillation or post hoc editing due to retraining cost and representation entanglement [13,24]. In federated settings, additional constraints emerge: limited observability, communication overhead, and client-level removal requirements [21,25].
(D4) Intervention mode: design-for-unlearning vs. post hoc editing. Finally, methods differ by when unlearning is enabled. Design-for-unlearning builds unlearning capability into training (e.g., sharding and composable pipelines), improving repeatability for recurring requests [6,26]. Post hoc editing modifies an already trained model (e.g., gradient-based “untraining”, scrubbing, pruning, or teacher–student transfer), reflecting the common reality of deployed models not originally designed for MU [27,28].

4.2. Mapping of Techniques to the Taxonomy

Using the above dimensions, we group representative approaches into method families and locate them within the taxonomy.

Retraining-from-scratch provides the strongest removal guarantee and is commonly used as the reference baseline [14,15,16]. To reduce cost while preserving exactness under sparse deletions, sharded and composable pipelines such as SISA retrain only affected components, offering a practical compromise for recurring requests [6,29].
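The sharding idea behind SISA can be illustrated with a deliberately small sketch. It covers only the sharding and aggregation steps (the slicing and checkpointing within each shard are omitted), and the learner is a toy nearest-centroid classifier on scalar features standing in for a real model. All names and data are hypothetical.

```python
from collections import Counter


def train_shard(shard):
    """Toy learner: per-class mean of a scalar feature."""
    sums, counts = {}, {}
    for x, y in shard:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(models, x):
    """Majority vote over the nearest-centroid prediction of each shard model."""
    votes = [min(m, key=lambda y: abs(m[y] - x)) for m in models]
    return Counter(votes).most_common(1)[0][0]


def unlearn(shards, models, sample):
    """Remove `sample` and retrain only the shard that contained it."""
    for k, shard in enumerate(shards):
        if sample in shard:
            shard.remove(sample)
            models[k] = train_shard(shard)
            return k
    raise ValueError("sample not found in any shard")


data = [(0.1, "a"), (0.9, "b"), (0.2, "a"), (1.1, "b"),
        (0.0, "a"), (1.0, "b"), (0.3, "a"), (0.8, "b"), (0.4, "a")]
num_shards = 3
shards = [data[k::num_shards] for k in range(num_shards)]   # disjoint partition
models = [train_shard(s) for s in shards]                   # one model per shard

untouched = [dict(m) for m in models]        # snapshot to compare against
affected = unlearn(shards, models, (0.3, "a"))
print(f"retrained shard {affected} only; prediction for 0.05: {predict(models, 0.05)}")
```

Because the removed sample only ever influenced one shard model, retraining that shard on its remaining data yields the same ensemble as training every shard from scratch without the sample, at a fraction of the cost when deletions are sparse.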

Certified removal and influence-based updates approximate the effect of retraining by leveraging gradients and curvature information. Influence functions quantify how model parameters would change if a point were removed, which motivates Newton-style or Hessian-inverse updates [17,22]. Practical systems reuse training artifacts to accelerate these updates, as in $\Delta$-Grad [23], while recent work explores more scalable influence approximations [30].
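As a rough illustration of such an update, the first-order influence approximation for removing a forget set $D_{f}$ can be written as follows. The loss $\ell$ and the empirical Hessian $H_{\theta}$ are notation introduced here; the expression assumes that $\theta$ minimizes the empirical risk on $D$ and that $H_{\theta}$ is invertible (e.g., a strongly convex objective), which generally does not hold for deep networks without further approximation.

```latex
\[
H_{\theta} = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2}\, \ell\bigl(f_{\theta}(x_{i}),\, y_{i}\bigr),
\qquad
\theta' \approx \theta + \frac{1}{n}\, H_{\theta}^{-1} \sum_{(x, y) \in D_{f}} \nabla_{\theta}\, \ell\bigl(f_{\theta}(x),\, y\bigr).
\]
```

A one-dimensional sanity check: for mean estimation with $\ell = \tfrac{1}{2}(\theta - x_{i})^{2}$, the minimizer is the sample mean, $H_{\theta} = 1$, and removing a single point $z$ gives $\theta' \approx \theta + (\theta - z)/n$, whereas exact retraining gives $\theta + (\theta - z)/(n - 1)$. The two agree up to a factor $n/(n-1)$, which shows both why the approximation is accurate for small deletions and why its error grows with the size of the forget set.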

When second-order computation is infeasible, gradient-based unlearning performs targeted “untraining” by optimizing on the forget set, typically preserving retain-set performance through regularization and/or complementary repair steps [27,31,32]. Scrubbing and editing methods aim to remove information directly from parameters or internal representations, balancing forgetting and utility via controlled perturbations or alternating objectives [20,28,33].
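A minimal sketch of gradient-based “untraining” is given below, assuming a one-parameter model with squared loss so that every step can be followed by hand. The function names, data, learning rate, retain weight, and step count are all illustrative assumptions and do not correspond to any specific method cited above.

```python
def mean_loss(theta, xs):
    """Average squared loss of a scalar estimate `theta` on samples `xs`."""
    return sum(0.5 * (theta - x) ** 2 for x in xs) / len(xs)


def mean_grad(theta, xs):
    """Gradient of `mean_loss` with respect to `theta`."""
    return sum(theta - x for x in xs) / len(xs)


def untrain(theta, forget, retain, lr=0.1, retain_weight=2.0, steps=3):
    """Gradient ascent on the forget loss plus a weighted descent ("repair")
    step on the retain loss, for a small fixed number of steps."""
    for _ in range(steps):
        ascent = mean_grad(theta, forget)   # direction that raises the forget loss
        repair = mean_grad(theta, retain)   # direction that raises the retain loss
        theta = theta + lr * ascent - lr * retain_weight * repair
    return theta


retain = [1.0, 1.2, 0.8, 1.0]
forget = [3.0, 3.2]
theta_before = sum(retain + forget) / len(retain + forget)  # model trained on all data
theta_after = untrain(theta_before, forget, retain)

print(f"forget loss: {mean_loss(theta_before, forget):.3f} -> {mean_loss(theta_after, forget):.3f}")
print(f"retain loss: {mean_loss(theta_before, retain):.3f} -> {mean_loss(theta_after, retain):.3f}")
```

The ascent term is unbounded, so the number of steps (or an explicit stopping criterion) matters: running this loop for many more iterations would push the parameter past the retain-set optimum and start degrading utility, which is the trade-off the regularization and repair steps are meant to control.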

Distillation-based removal relies on teacher–student transfer: a student is trained to match desired behavior on retained data while discouraging retention of deleted content. This strategy is particularly attractive in large-scale and federated settings [34,35], and also appears in recommender systems and other domain-specific pipelines where full retraining is expensive [36].

Figure 3 operationalizes the taxonomy as a method selection procedure. Starting from the unlearning target (instance/class/concept/client), the tree branches on whether formal guarantees are required (exact or certified removal versus approximate forgetting), and then refines the choice according to model scale/complexity and deployment setting (centralized vs. federated). The leaves denote technique families that are typically compatible with each requirement profile rather than a single algorithmic recipe. A consistent trend in the literature is that exact or certified strategies are most feasible in structured or convex regimes (or when training is explicitly designed for deletion), whereas deep and large-scale settings favor post hoc approaches such as gradient-based unlearning, scrubbing, and distillation when retraining costs dominate [6,17,21,23,27,35].

Three practical principles emerge from the systematic review. Method selection should begin with the unlearning granularity (instance/class/concept/client) and the required strength of guarantees, since these choices largely determine feasibility and the evaluation protocol [1,14]. Model scale further constrains the admissible toolset: influence and certified updates are often viable for smaller or structured objectives, whereas deep and large models typically motivate approximate, scrubbing, or distillation-based solutions [13,22,24]. In federated learning, client-level removal requires dedicated protocols that explicitly account for distributed optimization and limited observability [21,25]. In particular, coordinators typically observe only aggregated updates while client data remain local, which prevents direct data filtering or exact retraining; as a result, unlearning relies on protocol-level mechanisms (e.g., update accounting and selective rollback/repair) under partial observability and often non-IID client distributions.

5. Benchmark Setup

This section describes the experimental protocol adopted to compare retraining (gold standard) with representative machine unlearning techniques under a controlled and reproducible setup. We use the term benchmark to denote a standardized and repeatable evaluation protocol (including deletion regime, baselines, and metrics) that enables controlled comparison against retrain-from-scratch as the reference behavior. Accordingly, the benchmark is intended as a focused, illustrative case study: its primary contribution is the reproducible evaluation protocol and the associated comparative diagnostics, rather than broad empirical generalization across datasets, architectures, or deletion regimes.

For completeness, we also include two oracle-assisted upper-bound baselines, Knowledge Distillation and Weight Scrubbing. These methods assume access to a clean reference model trained on the retained data. In the runtime tables, we therefore report only their incremental update cost, excluding the cost of producing the clean reference itself. Although our experiments consider a single-shot deletion request, the protocol can be extended to repeated or streaming deletions, as discussed in Section 7.

5.1. Model and Training Setup

We consider a Transformer-based text classifier implemented with DistilBertForSequenceClassification and configured for four classes (AG News). To ensure comparability, all methods share the same backbone architecture, tokenizer, and evaluation protocol. Inputs (title + description) are tokenized with distilbert-base-uncased using truncation to a maximum sequence length ($L_{max} = 96$), while padding is applied dynamically at the batch level. We adopt this value as a compute–coverage trade-off, since self-attention has quadratic cost in sequence length.

We fine-tune a baseline model on the training pool $D_{full} = D_{train} ∖ \mathtt{val\_full}$ (108,000 instances) and treat a retrain-from-scratch model (trained only on retained data) as the reference for successful unlearning. Unless otherwise specified, baseline and retrain share the same fine-tuning configuration: AdamW, learning rate $2 \cdot 10^{-5}$, weight decay $0.01$, linear warmup ratio $0.06$, and three epochs.

All experiments were executed in Kaggle Notebooks using a fixed environment (CPU + single NVIDIA Tesla P100 GPU) to ensure fair comparisons across methods. The implementation is based on PyTorch (v2.10.0) and Hugging Face Transformers (v5.3.0).

Wall-clock runtimes (in seconds) for a single unlearning cycle are analyzed in Section 6.2. Retraining is the most expensive single-model procedure, while post hoc methods offer substantial speedups. The gradient-based update is the fastest, followed by influence/Hessian updates, scrubbing, and distillation, whereas SISA incurs overhead due to shard management and selective retraining. For Knowledge Distillation and Weight Scrubbing, the reported runtime counts only the student-update/scrub step given an oracle clean teacher/reference model ($M_{retrain}$ or $θ_{clean}$); it does not include the cost of producing that clean model (i.e., retraining), which we compute separately as the benchmark target.

5.2. Definition of Forget and Retain Sets

Let $D_{train}$ denote the original AG News training split (120,000 samples). We first create a disjoint $10\%$ holdout validation set val_full from $D_{train}$, with $|\mathtt{val\_full}| = 12,000$. We define the training pool used for the baseline model as $D_{full} = D_{train} ∖ \mathtt{val\_full}$ (108,000 samples).

From this pool, we then sample the forget set $D_{f}$ (train_deleted) using a class-balanced deletion regime, selecting $|D_{f}| = 30,000$ samples uniformly across the four AG News classes (7500 per class). The retained training set is the remainder:

$\mathtt{train\_retained} = D_{full} ∖ D_{f} \;\Rightarrow\; |D_{train}| = |\mathtt{val\_full}| + |\mathtt{train\_deleted}| + |\mathtt{train\_retained}|.$

In our configuration, this yields $|\mathtt{train\_retained}| = 78,000$. With this construction, val_full is disjoint from both train_deleted and train_retained by design (val_full is held out first, and $D_{f}$ is then sampled from the remaining pool).

We additionally construct a small validation subset val_retained (2000 samples) drawn from train_retained to support hyperparameter tuning for the retrained model and for unlearning methods that require validation. The held-out AG News test split (7600 instances) is used as test_full.
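The split construction can be sketched as follows. This is a minimal illustration rather than the benchmark code: the function name `build_splits`, the seed, the non-stratified holdout, and the synthetic label list (four classes of 30,000 examples standing in for AG News) are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def build_splits(labels, val_frac=0.10, forget_per_class=7500, seed=0):
    """Hold out val_full first, then sample a class-balanced forget set."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_val = int(round(val_frac * len(idx)))
    val_full, pool = idx[:n_val], idx[n_val:]  # pool plays the role of D_full

    # Group the remaining pool by class so the forget set can be balanced.
    by_class = defaultdict(list)
    for i in pool:
        by_class[labels[i]].append(i)

    train_deleted = []
    for c in sorted(by_class):
        train_deleted.extend(rng.sample(by_class[c], forget_per_class))

    deleted = set(train_deleted)
    train_retained = [i for i in pool if i not in deleted]
    return val_full, train_deleted, train_retained

# Synthetic stand-in for the AG News training labels (assumption).
labels = [i % 4 for i in range(120_000)]
val_full, train_deleted, train_retained = build_splits(labels)
print(len(val_full), len(train_deleted), len(train_retained))
```

Because the holdout is removed before the forget set is sampled, the three index lists are disjoint by construction and their sizes add up to the size of the original training split.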

5.3. Implemented Unlearning Techniques

We evaluate both design-for-unlearning and post hoc approaches. Post hoc methods start from the baseline model trained on $D_{full}$ and apply an update guided by $(D_{f}, D_{retain})$. Design-for-unlearning methods instead modify the training procedure so that deletions can be handled efficiently after deployment. We distinguish between methods that can be executed without any clean oracle (SISA, Ascent/Descent-to-Delete, Influence/Hessian) and oracle-assisted upper-bound baselines (Knowledge Distillation and Weight Scrubbing) that require $M_{retrain}$ as a clean teacher/reference; the latter are included to contextualize the best-case alignment to retraining under idealized access to a clean model.

Retrain-from-scratch (gold standard). Given $D_{f} \subset D_{full}$ , the gold-standard model $M_{retrain}$ is trained from scratch on $D_{retain} = D_{full} ∖ D_{f}$ and validated on val_retained. This model defines the target behavior after deletion and serves as the reference for successful unlearning.
SISA (sharded, isolated, sliced, aggregated). SISA can enable near-exact unlearning under appropriate sharding/slicing and stable optimization by retraining only the affected slice(s) after a deletion request. The training data are partitioned into k disjoint shards, each further subdivided into sequential slices. After training, a model checkpoint is stored for each slice, allowing for selective rollback. Upon deletion, only the impacted shard is retrained from the earliest affected slice onward, reducing recomputation relative to full retraining; however, effectiveness and stability can be configuration- and request-regime-dependent in practice.

In our implementation we adopt $(S = 8, R = 3)$ shards/slices (one checkpoint per slice), and we aggregate shard-level predictors as an ensemble. We report these settings explicitly because both utility and compute can vary substantially with $(S, R)$ and with how dispersed deletion requests are across shards/slices (see Section 6 and Section 7).
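To illustrate why deletion dispersion matters, the sketch below counts how many slice-level training phases a deletion request triggers. The index-based assignment rule in `assign` is a hypothetical placeholder (the text does not specify how samples are mapped to shards and slices); only the rollback logic, retraining each affected shard from its earliest affected slice onward, follows the description above.

```python
def assign(sample_id, S=8, R=3):
    """Hypothetical assignment of a sample to (shard, slice); an assumption."""
    return sample_id % S, (sample_id // S) % R

def retrain_plan(deleted_ids, S=8, R=3):
    """Map each affected shard to the earliest slice containing a deleted sample."""
    earliest = {}
    for i in deleted_ids:
        shard, slc = assign(i, S, R)
        earliest[shard] = min(slc, earliest.get(shard, R))
    return earliest

def slices_to_retrain(plan, R=3):
    """Each affected shard is retrained from its earliest affected slice onward."""
    return sum(R - first for first in plan.values())

localized = retrain_plan([21])          # one sample in the last slice of one shard
diffuse = retrain_plan(range(30_000))   # a request that touches every shard
print(slices_to_retrain(localized), slices_to_retrain(diffuse))
```

Under this toy assignment, a single late-slice deletion triggers one slice-level phase, whereas a request spread over all shards and early slices triggers all $S \cdot R = 24$ phases, which is effectively a full retraining of the ensemble.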

Gradient-based Ascent/Descent-to-Delete (D2D). D2D alternates (i) gradient ascent on $D_{f}$ to reduce fit to deleted samples and (ii) repair steps via gradient descent on $D_{retain}$ to preserve utility:

$θ \leftarrow θ + η_{f} \nabla_{θ} L (θ; B_{f}), θ \leftarrow θ - η_{r} \nabla_{θ} L (θ; B_{r}),$

where $B_{f} \subset D_{f}$ and $B_{r} \subset D_{retain}$ . We use one forget step followed by two repair batches per iteration, together with mixed-precision training and gradient clipping. To reduce catastrophic drift, we freeze the first Transformer block in the final configuration. This tractability choice implies that the method performs a constrained parameter edit and therefore targets functional unlearning (behavioral regression toward retraining) rather than exact parameter-level equivalence.
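
The alternating schedule can be sketched on a toy linear logistic model. This is an illustrative sketch only: the model, the synthetic data, the step sizes, and the helper names (`loss_and_grad`, `clip_grad`, `d2d`) are assumptions, and mixed precision and layer freezing are not modeled.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def loss_and_grad(theta, batch):
    """Mean binary cross-entropy and its gradient for a linear logistic model."""
    loss, grad = 0.0, [0.0] * len(theta)
    for x, y in batch:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        loss -= (y * math.log(p) + (1 - y) * math.log(1.0 - p)) / len(batch)
        for j, xi in enumerate(x):
            grad[j] += (p - y) * xi / len(batch)
    return loss, grad

def clip_grad(g, max_norm):
    norm = math.sqrt(sum(v * v for v in g))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in g]

def d2d(theta, forget, retain, eta_f=0.05, eta_r=0.05, iters=20,
        repair_steps=2, batch=16, clip=1.0, seed=0):
    """One ascent step on a forget batch, then repair steps on retain batches."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(iters):
        _, g = loss_and_grad(theta, rng.sample(forget, batch))
        theta = [t + eta_f * gi for t, gi in zip(theta, clip_grad(g, clip))]
        for _ in range(repair_steps):
            _, g = loss_and_grad(theta, rng.sample(retain, batch))
            theta = [t - eta_r * gi for t, gi in zip(theta, clip_grad(g, clip))]
    return theta

# Synthetic two-feature data with a bias term (assumption).
_rng = random.Random(1)
def make(n):
    out = []
    for _ in range(n):
        x1, x2 = _rng.gauss(0, 1), _rng.gauss(0, 1)
        out.append(([1.0, x1, x2], 1 if x1 + x2 > 0 else 0))
    return out

forget, retain = make(32), make(128)
theta = d2d([0.0, 0.0, 0.0], forget, retain)
```

The ratio of one forget step to two repair batches mirrors the schedule stated above; in practice the step sizes $η_{f}$ and $η_{r}$ control how far the model drifts before repair pulls it back.
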
Influence functions with Hessian approximation. We approximate the parameter change induced by removing $D_{f}$ using an influence-style update computed from the baseline parameters $\hat{θ}$ :

$(H_{r} + λ I) Δ θ = - \frac{| D_{f} |}{| D_{r} |} g_{del},$

where $D_{r} \subseteq D_{retain}$ is used to estimate curvature, $H_{r} = \nabla_{θ}^{2} L (θ; D_{r})$ , $g_{del} = \nabla_{θ} L (θ; D_{f})$ , and $λ > 0$ is a Tikhonov damping coefficient. To make the method tractable for DistilBERT, updates are restricted to the classification head and the last Transformer blocks. This restriction makes the method an approximate editing baseline: it aims to match retrain behavior under a limited update budget and does not provide exact (parameter-wise) unlearning guarantees. The linear system is solved via conjugate gradient using Hessian–vector products and a diagonal preconditioner. Since second-order derivatives are required, we disable flash/memory-efficient attention kernels when necessary, which can increase wall-clock time; all methods are timed end-to-end under the same environment, so comparisons remain consistent.
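
The damped linear system can be solved without forming the Hessian, as in the sketch below. It is a toy illustration under stated assumptions: the explicit $3 \times 3$ matrix stands in for Hessian–vector products that would come from automatic differentiation, the gradient vector and set sizes are placeholders, and the right-hand side simply follows the expression as written in the text.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pcg(hvp, b, lam, diag, tol=1e-10, max_iter=200):
    """Solve (H + lam*I) x = b with Jacobi-preconditioned conjugate gradient.

    H is accessed only through Hessian-vector products hvp(v)."""
    apply_a = lambda v: [hv + lam * vi for hv, vi in zip(hvp(v), v)]
    m_inv = [1.0 / (d + lam) for d in diag]  # diagonal preconditioner
    x = [0.0] * len(b)
    r = list(b)
    z = [mi * ri for mi, ri in zip(m_inv, r)]
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if math.sqrt(dot(r, r)) < tol:
            break
        ap = apply_a(p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [mi * ri for mi, ri in zip(m_inv, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

# Toy symmetric positive-definite "Hessian" and deleted-set gradient (assumptions).
H = [[4.0, 1.0, 0.0], [1.0, 3.0, 0.0], [0.0, 0.0, 2.0]]
hvp = lambda v: [dot(row, v) for row in H]
g_del = [1.0, -2.0, 0.5]
n_f, n_r, lam = 30_000, 78_000, 0.01

rhs = [-(n_f / n_r) * g for g in g_del]  # right-hand side as written in the text
delta_theta = pcg(hvp, rhs, lam, diag=[H[i][i] for i in range(3)])
```

The damping term $λ I$ keeps the system well conditioned when the estimated curvature has near-zero directions, at the price of shrinking the update.
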
Knowledge distillation (teacher–student unlearning). We implement a teacher–student unlearning procedure following the general paradigm of distillation-based removal. A clean teacher model $T_{clean}$ is obtained via retraining on retained data ( $M_{retrain}$ ), and a student model is initialized from the baseline ( $M_{base}$ ). This variant is oracle-assisted, as it assumes access to a clean teacher trained without $D_{f}$ ; we use it as an upper-bound reference to quantify the best-case effectiveness of distillation given $M_{retrain}$ .

The student is then trained to match the teacher’s soft predictions on the retained set, using a KL divergence loss:

$L_{KD} = KL (σ (z_{s} / τ) ∥ σ (z_{t} / τ)),$

where $z_{s}$ and $z_{t}$ denote student and teacher logits, $σ$ is the softmax, and $τ$ is a temperature parameter. In our experiments, we use $τ = 2$ and perform a single epoch of distillation over train_retained.
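A per-example version of this loss can be sketched as follows. The helper names are hypothetical, and the argument order of the KL term follows the expression as written above; no additional scaling factor is applied.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [v / tau for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(z_s, z_t, tau=2.0):
    """KL between softened student and teacher distributions, as in the text."""
    return kl(softmax(z_s, tau), softmax(z_t, tau))

student_logits = [2.0, 0.5, -1.0, 0.0]  # illustrative four-class logits (assumption)
teacher_logits = [1.5, 1.0, -0.5, 0.0]
loss = kd_loss(student_logits, teacher_logits)
```

A larger $τ$ flattens both distributions, so the loss puts more weight on how the teacher ranks the non-argmax classes.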

Weight Scrubbing. We implement Weight Scrubbing as a post hoc parameter repair mechanism that nudges the baseline weights toward a clean reference $θ_{clean}$ . Here $θ_{clean}$ denotes the parameters of a retrained model on retained data ( $M_{retrain}$ ), so this configuration should be interpreted as an oracle-assisted/upper-bound variant: it assumes access to a clean reference trained without $D_{f}$ and uses it to quantify the best-case effect of parameter restoration.

Given baseline parameters $θ_{base}$ and a clean reference $θ_{clean}$, we start from $θ = θ_{base}$ and apply an exponential moving update:

$θ \leftarrow (1 - α) θ + α θ_{clean},$

where $α \in (0, 1)$ controls the strength of scrubbing. We use $α = 0.1$ and apply 10 scrubbing steps over the full parameter set.
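Unrolling this recursion with a fixed $θ_{clean}$ gives $θ_{k} = θ_{clean} + (1 - α)^{k} (θ_{base} - θ_{clean})$, so the fraction of the baseline offset that survives $k$ steps is $(1 - α)^{k}$. The sketch below checks this on a toy parameter vector; the function name `scrub` and the example values are assumptions, and the closed form holds only if no other update is interleaved between scrubbing steps.

```python
def scrub(theta_base, theta_clean, alpha=0.1, steps=10):
    """Repeatedly interpolate the current parameters toward the clean reference."""
    theta = list(theta_base)
    for _ in range(steps):
        theta = [(1.0 - alpha) * t + alpha * c for t, c in zip(theta, theta_clean)]
    return theta

theta_base = [1.0, -2.0, 0.5]   # toy parameters (assumption)
theta_clean = [0.0, 0.0, 0.0]

scrubbed = scrub(theta_base, theta_clean)
residual_fraction = (1.0 - 0.1) ** 10
print(round(residual_fraction, 4))  # 0.3487
```

With $α = 0.1$ and 10 steps, roughly 35% of the distance between baseline and clean parameters remains, so the scrubbed model is an interpolation between the two rather than a copy of the clean reference.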

5.4. Evaluation Metrics

We evaluate unlearning outcomes on train_deleted, train_retained, and test_full along four dimensions: task utility, functional forgetting, privacy leakage (membership inference), and alignment to the retrain-from-scratch reference in output behavior, representations, and parameters.

Utility and calibration. We report accuracy and negative log-likelihood (NLL):

$\mathrm{Accuracy} = \frac{\# \mathrm{correct}}{\# \mathrm{total}}, \quad \mathrm{NLL} = - \frac{1}{n} \sum_{i = 1}^{n} \log p_{θ} (y_{i} ∣ x_{i}) .$

To assess probabilistic quality, we apply temperature scaling on the validation set and compute Expected Calibration Error (ECE) using M confidence bins:

$ECE = \sum_{m = 1}^{M} \frac{| B_{m} |}{n} |acc (B_{m}) - conf (B_{m})| .$
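
The ECE formula can be sketched in a few lines. The equal-width binning, the default of 15 bins, and the function name `ece` are assumptions for illustration (the text does not fix $M$ or the bin edges), and temperature scaling itself is not shown.

```python
def ece(confidences, correct, n_bins=15):
    """Expected Calibration Error with equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        b = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[b].append((conf, ok))
    total = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        acc = sum(ok for _, ok in items) / len(items)
        total += len(items) / n * abs(acc - avg_conf)
    return total

# Ten predictions at 0.8 confidence, eight of them correct: well calibrated.
calibrated = ece([0.8] * 10, [1] * 8 + [0] * 2)
# Four predictions at full confidence, half of them wrong: overconfident.
overconfident = ece([1.0] * 4, [1, 1, 0, 0])
```

Here `confidences` would hold the maximum softmax probability per example and `correct` a 0/1 indicator of whether the predicted class matches the label.
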
Forgetting indicators. As a functional signal of forgetting, we compute the average cross-entropy loss on $D_{f}$ and $D_{retain}$ :

$\bar{ℓ} (M, D) = \frac{1}{| D |} \sum_{(x, y) \in D} CE (M (x), y),$

and summarize deviations from retraining via $Δ {\bar{ℓ}}_{f}$ and $Δ {\bar{ℓ}}_{r}$ with respect to $M_{retrain}$ .
Membership inference audit. We use loss-based membership inference as the primary privacy audit (other scores, such as entropy, maxprob, margin, and energy, follow the same trend in our setting). We consider MIA Deleted (members: train_deleted, non-members: test_full) and MIA Retained (members: train_retained, non-members: test_full). We report ROC curves and AUC; leakage is best summarized by how close the AUC is to 0.5, with values closer to 0.5 indicating a weaker membership signal. Because this benchmark operates in a near-chance MIA regime (AUC values close to 0.5), loss-based MIA is best interpreted as a detectability and sanity-check signal rather than as a standalone privacy guarantee. We discuss stronger attacker models and more demanding “stress-test” regimes (e.g., larger models, stronger overfitting, or representation-based MIAs) in Section 7.
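
A loss-based membership AUC can be computed directly from per-example losses, as in the sketch below. The pairwise formulation is chosen for clarity rather than speed, and the function name and toy loss values are assumptions.

```python
def mia_auc(member_losses, nonmember_losses):
    """AUC of an attacker that scores membership by low loss.

    Equals the probability that a random member has a lower loss than a
    random non-member, counting ties as one half."""
    wins = 0.0
    for lm in member_losses:
        for ln in nonmember_losses:
            if lm < ln:
                wins += 1.0
            elif lm == ln:
                wins += 0.5
    return wins / (len(member_losses) * len(nonmember_losses))

# Toy per-example cross-entropy losses (assumption).
deleted_losses = [0.10, 0.35, 0.20, 0.50]
test_losses = [0.15, 0.30, 0.25, 0.45]
auc_deleted = mia_auc(deleted_losses, test_losses)
```

For MIA Deleted the members are the losses on train_deleted, and for MIA Retained those on train_retained; in both cases the non-members are the losses on test_full. An AUC of 0.5 means the loss carries no membership signal.
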
Alignment to retraining (distributional and structural). To quantify similarity to $M_{retrain}$ , we measure: (i) Jensen–Shannon (JS) divergence between predicted class distributions, (ii) Top-k behavioral disagreement based on Jaccard overlap, (iii) activation distances (L2 and cosine) on [CLS] hidden states, and (iv) relative weight shift (RSS). These measures are intended for relative comparison and should be interpreted in context. In general, lower values indicate closer alignment to retraining, but what constitutes “good enough” depends on the task and on the inherent variability of independent retraining runs (e.g., across random seeds). A principled acceptance criterion would compare a method’s distance-to-retrain against the retrain-to-retrain variability band; since our benchmark uses fixed seeds to enable controlled comparisons, we use these metrics primarily to rank methods and to flag large deviations, and we discuss seed-sensitivity as a threat to validity in Section 7.

For each input $x$, let $p_{ref} (x) \in Δ^{C - 1}$ and $p_{cmp} (x) \in Δ^{C - 1}$ denote the softmax class-probability vectors from $M_{retrain}$ and the compared model, respectively; split-level values average over examples in the split $S$.

JS divergence is computed as

$JS (p_{cmp} (x) ∥ p_{ref} (x)) = \frac{1}{2} KL (p_{cmp} (x) ∥ m (x)) + \frac{1}{2} KL (p_{ref} (x) ∥ m (x)), \quad m (x) ≜ \frac{p_{cmp} (x) + p_{ref} (x)}{2},$

and we report $\frac{1}{| S |} \sum_{x \in S} JS (p_{cmp} (x) ∥ p_{ref} (x))$.
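The split-level JS divergence can be sketched as follows, using natural logarithms (an assumption; the text does not state the base). Function names and the example probability vectors are illustrative.

```python
import math

def kl(p, q):
    """KL(p || q); terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p_cmp, p_ref):
    """Jensen-Shannon divergence between two class-probability vectors."""
    m = [(a + b) / 2.0 for a, b in zip(p_cmp, p_ref)]
    return 0.5 * kl(p_cmp, m) + 0.5 * kl(p_ref, m)

def mean_js(probs_cmp, probs_ref):
    """Average JS divergence over the examples of a split."""
    return sum(js(a, b) for a, b in zip(probs_cmp, probs_ref)) / len(probs_cmp)

# Two examples, four classes each (toy values, assumption).
probs_cmp = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
probs_ref = [[0.6, 0.2, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
split_js = mean_js(probs_cmp, probs_ref)
```

With natural logarithms the per-example value lies in $[0, \log 2]$, which gives a fixed scale for comparing methods across splits.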

For Top-k agreement, let $A_{k} (x) ≜ \text{Top-}k (p_{cmp} (x))$ and $B_{k} (x) ≜ \text{Top-}k (p_{ref} (x))$, and define

$J_{k} (x) = \frac{| A_{k} (x) \cap B_{k} (x) |}{| A_{k} (x) \cup B_{k} (x) |}, \quad \mathrm{Dis}_{k} ≜ 1 - \frac{1}{| S |} \sum_{x \in S} J_{k} (x) .$

For $k = 1$, $\mathrm{Dis}_{1}$ reduces to the Top-1 mismatch rate.
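The disagreement measure can be sketched as below. The tie-breaking rule (lower class index first) and the toy probability vectors are assumptions for illustration.

```python
def top_k(p, k):
    """Indices of the k largest probabilities (ties broken by lower index)."""
    return set(sorted(range(len(p)), key=lambda c: -p[c])[:k])

def disagreement(probs_cmp, probs_ref, k):
    """One minus the mean Jaccard overlap of the two models' Top-k label sets."""
    total = 0.0
    for pc, pr in zip(probs_cmp, probs_ref):
        a, b = top_k(pc, k), top_k(pr, k)
        total += len(a & b) / len(a | b)
    return 1.0 - total / len(probs_cmp)

# Two examples, four classes each (toy values, assumption).
probs_cmp = [[0.7, 0.2, 0.1, 0.0], [0.1, 0.6, 0.2, 0.1]]
probs_ref = [[0.6, 0.3, 0.1, 0.0], [0.5, 0.3, 0.1, 0.1]]
dis_1 = disagreement(probs_cmp, probs_ref, k=1)
dis_2 = disagreement(probs_cmp, probs_ref, k=2)
```

In this toy case the two models agree on the first example and differ on the second, so the Top-1 mismatch rate is one half, while the Top-2 sets of the second example still share one label and the disagreement drops to one third.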

For representation alignment, let $h_{ℓ}^{ref} (x), h_{ℓ}^{cmp} (x) \in \mathbb{R}^{d}$ be the [CLS] hidden states at layer $ℓ$ (pre-classifier). We compute

$d_{ℓ}^{L2} (x) = ∥ h_{ℓ}^{cmp} (x) - h_{ℓ}^{ref} (x) ∥_{2}, \quad d_{ℓ}^{cos} (x) = 1 - \left\langle \frac{h_{ℓ}^{cmp} (x)}{∥ h_{ℓ}^{cmp} (x) ∥_{2}}, \frac{h_{ℓ}^{ref} (x)}{∥ h_{ℓ}^{ref} (x) ∥_{2}} \right\rangle,$

and average over $x \in S$. Unless otherwise stated, we report distances at the last Transformer layer ($ℓ = L$), and optionally provide layer-wise profiles.

Finally, parameter drift is quantified via

$\mathrm{RSS}_{global} = \frac{∥ θ_{cmp} - θ_{ref} ∥_{2}}{∥ θ_{ref} ∥_{2}}, \quad \mathrm{RSS}_{g} = \frac{∥ θ_{cmp}^{(g)} - θ_{ref}^{(g)} ∥_{2}}{∥ θ_{ref}^{(g)} ∥_{2}},$

where $θ_{ref}$ and $θ_{cmp}$ denote the full parameter vectors, and $θ^{(g)}$ a parameter group (e.g., embeddings, encoder blocks, classifier head).
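A minimal sketch of the global and per-group drift follows. Parameters are represented as flat lists keyed by a group name; the group names and values are placeholders, not the actual DistilBERT parameter layout.

```python
import math

def rss(theta_cmp, theta_ref):
    """Relative L2 distance between two flat parameter vectors."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_cmp, theta_ref)))
    den = math.sqrt(sum(b * b for b in theta_ref))
    return num / den

def rss_report(groups_cmp, groups_ref):
    """Per-group RSS plus the global value over the concatenated parameters."""
    report = {g: rss(groups_cmp[g], groups_ref[g]) for g in groups_ref}
    flat_cmp = [v for g in groups_ref for v in groups_cmp[g]]
    flat_ref = [v for g in groups_ref for v in groups_ref[g]]
    report["global"] = rss(flat_cmp, flat_ref)
    return report

# Toy parameter groups (assumption): only the classifier head differs.
groups_ref = {"embeddings": [1.0, 2.0, 2.0], "classifier": [3.0, 4.0]}
groups_cmp = {"embeddings": [1.0, 2.0, 2.0], "classifier": [0.0, 0.0]}
report = rss_report(groups_cmp, groups_ref)
```

Because each group is normalized by its own reference norm, a small but heavily modified component such as the classification head can show a large $\mathrm{RSS}_{g}$ even when the global value stays moderate.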

6. Benchmark Evaluation

We first compare the baseline model, fine-tuned on the training pool $D_{full} = D_{train} ∖ \mathtt{val\_full}$, against the retrained model trained only on retained data ($D_{retain} = \mathtt{train\_retained}$). This baseline-vs-retrain comparison establishes the intrinsic impact of removing the forget set on both utility and model behavior. Since baseline training includes $D_{f}$ (i.e., train_deleted), the two models differ only by whether the deleted samples are present during training.

Table 2 reports task performance (accuracy), probabilistic quality (NLL/log-loss), and calibration (ECE with and without temperature scaling) on train_deleted, train_retained, and test_full, together with wall-clock training time. Differences are reported as $Δ = \text{retrain} - \text{baseline}$.

Retraining on $D_{retain}$ is faster than training the baseline on $D_{full}$, consistent with the reduced number of training instances after removal. In our setup, retraining requires about 25% less time (Table 2); its runtime serves as the exact-deletion reference against which the efficiency of unlearning procedures is evaluated.

On train_deleted, the baseline model shows the expected advantage because it has been trained on $D_{f}$. In contrast, the retrained model exhibits a systematic reduction in predictive fit on deleted samples, with a drop in accuracy of roughly $0.02$ and corresponding increases in NLL/log-loss (Table 2). Notably, after temperature scaling, ECE values become nearly indistinguishable, suggesting that deletion mainly reduces task-specific fit on forgotten points rather than broadly degrading probabilistic reliability.

On train_retained, differences remain small: accuracy and NLL change marginally, and temperature-scaled calibration is comparable (or slightly improved) for the retrained model. On test_full, retraining yields negligible variations in accuracy and calibration, indicating that removing $D_{f}$ does not materially affect out-of-sample generalization in this setting (Table 2). Figure 4 summarizes the (raw) test accuracy comparison between baseline and retrain on test_full.

Beyond aggregate metrics, distributional and structural indicators confirm that baseline and retrained models remain close overall (Table 3). Output-probability divergence is modest and is more pronounced on train_deleted than on the test set, consistent with the localized impact of removing $D_{f}$. For representation- and parameter-level indicators (activation distance, RSS weight change, and classifier share), we report a single global value computed on a fixed probe (hence the spanning entries in Table 3). Parameter-level analyses further show that changes concentrate in task-specific components (especially the classification head and upper layers), while early representations remain comparatively stable, a pattern consistent with a boundary readjustment rather than a disruption of general-purpose features.

A per-class inspection does not reveal systematic shifts in confusion patterns or class-specific degradation; observed deviations remain limited and broadly uniform across classes.

6.1. Comparison Across Unlearning Methods

This section compares representative unlearning strategies against the retrain-from-scratch reference, which approximates the desired behavior after removing $D_{f}$. Results are organized by (i) forgetting/utility on train_deleted and train_retained, (ii) generalization and calibration on test_full, and (iii) privacy-oriented audits via membership inference.

6.1.1. Performance on Deleted and Retained Sets

We evaluate each method on the forget split (train_deleted) and on the retained split (train_retained). Retraining defines a natural target: on train_deleted it reflects the expected degradation once $D_{f}$ is absent from training, while on train_retained it approximates the utility of a clean model trained only on $D_{retain}$.

Table 4 shows that the post hoc methods (Ascent/Descent-to-Delete, Influence/Hessian, Knowledge Distillation, Weight Scrubbing) preserve retained utility well, matching or slightly improving upon retraining on train_retained. However, they remain systematically more predictive than retraining on train_deleted (higher accuracy and lower NLL), which is consistent with under-forgetting: residual information about $D_{f}$ still influences the model. SISA behaves qualitatively differently, with a pronounced degradation on both deleted and retained splits and extremely large raw NLL values that are substantially reduced after temperature scaling, suggesting reduced stability and/or overly aggressive removal effects in the adopted configuration. We therefore report the adopted shard/slice setting $(S = 8, R = 3)$ and emphasize that SISA’s utility/forgetting profile can be sensitive to this configuration and to deletion dispersion.

6.1.2. Performance and Calibration

We next assess generalization and probabilistic quality on the disjoint test_full split. The main goal is to verify whether unlearning preserves task utility (accuracy) and whether it introduces calibration shifts (NLL/log-loss and ECE), relative to retraining as the clean reference.

As shown in Table 5, approximate post hoc methods maintain test accuracy and NLL close to retraining, and calibration remains stable overall. Temperature scaling further compresses ECE into a narrow range across methods, indicating that most differences are limited to confidence re-scaling rather than fundamental changes in predictive behavior. A notable exception is Ascent/Descent-to-Delete, which shows a higher uncalibrated ECE, consistent with targeted parameter updates affecting confidence patterns and benefiting from explicit calibration.

SISA exhibits a markedly different profile: substantially lower accuracy and very large raw NLL, while temperature scaling sharply reduces NLL and restores low ECE. This illustrates that calibration can improve even when predictive performance remains degraded, i.e., calibration and accuracy may decouple under strong distributional or ensemble-induced shifts.

Figure 5 summarizes the raw test accuracy of each unlearning method on test_full relative to the retraining reference.

6.1.3. Membership Inference and Privacy Indicators

To probe privacy leakage, we perform threshold-based membership inference using loss-derived scores and report ROC/AUC, average precision (AP), and FPR at fixed TPR (0.90 and 0.95). We consider two complementary settings: MIA Deleted, where members are train_deleted and non-members are test_full (the core adversarial forgetting audit), and MIA Retained, where members are train_retained and non-members are test_full (a control to detect pathological side effects on retained data).

Table 6 shows that AUC values are close to $0.5$ in both settings, indicating weak discriminability overall in this experimental regime. Differences between retraining and post hoc methods are small, and no method produces a clear increase in membership leakage relative to the retraining reference. SISA yields the flattest membership signal (AUC closest to $0.5$), consistent with stronger smoothing/removal effects, but this must be interpreted jointly with its lower utility and higher instability observed in task metrics. Given the small absolute AUC differences, these MIA results primarily function as a safety-consistency check: method selection in this setup is more strongly driven by the utility/forgetting trade-off on deleted vs. retained splits and by behavioral/structural alignment to retraining.

Since the ROC curves are nearly overlapping and AUC differences are within a few thousandths, we inspect the differential signal $Δ \mathrm{TPR}_{m} (\mathrm{FPR}) = \mathrm{TPR}_{m} (\mathrm{FPR}) - \mathrm{TPR}_{retrain} (\mathrm{FPR})$, so that retraining lies on the horizontal axis ($Δ \mathrm{TPR} = 0$). Positive values indicate a slightly more effective attacker than retraining at the same FPR, while negative values indicate a weaker attacker (closer to random).

Figure 6 (MIA Deleted) shows that all methods remain confined to a narrow band (approximately from $-0.01$ to $+0.017$), with small oscillations around zero across the full FPR range. Baseline, Influence/Hessian, Ascent/Descent-to-Delete, and Weight Scrubbing exhibit mildly positive $Δ \mathrm{TPR}$, especially at medium–high FPR (≳0.5), with peaks around $0.01$–$0.015$, meaning the attacker can identify deleted points only marginally better than under retraining. Knowledge Distillation stays closest to zero overall, indicating the strongest alignment to retraining in terms of membership signal on the forget set. SISA is predominantly negative (about $-0.005$ to $-0.01$), bringing the ROC slightly closer to the random diagonal, i.e., a weaker attacker than retraining on deleted instances. Overall, the deviations confirm that no method introduces a marked increase in membership leakage, and the Influence/Hessian curve closely tracks the baseline, consistent with the absence of measurable forgetting under this MIA audit.

Figure 7 (MIA Retained) provides the complementary control: we seek methods that remain close to retraining without systematically collapsing the membership signal on retained data. Baseline, Ascent/Descent-to-Delete, Influence/Hessian, Weight Scrubbing, and Knowledge Distillation oscillate around zero within a tight range (about ±0.006), with occasional mildly positive deviations at high FPR (≳0.6), suggesting that the attacker’s ability to recognize retained members remains essentially unchanged (or slightly higher) compared to retraining, i.e., informative structure is preserved on legitimate data. In contrast, SISA shows predominantly negative $Δ \mathrm{TPR}$ (down to about $-0.01$ to $-0.013$ at medium–high FPR), indicating a weaker attacker also on retained points, consistent with a stronger but less selective smoothing effect, which aligns with its lower utility profile.

6.1.4. Distributional and Structural Analysis

Beyond utility and privacy metrics, we assess whether unlearned models behaviorally match the retrain-from-scratch reference not only in terms of predictions, but also in terms of (i) distributional similarity of output probabilities and (ii) structural similarity of internal representations and parameters. The goal is to quantify how closely each method realigns to retraining and to identify where this realignment occurs along the model architecture.

Table 7 summarizes the main indicators with respect to retraining. Lower values indicate closer agreement.

Overall, the most effective post hoc approximate methods exhibit a tight distributional alignment to retraining: output-probability divergences remain small on test_full and train_retained, with a consistent (but limited) increase on train_deleted. This pattern indicates that the confidence profile remains close to the clean reference while still reflecting the absence of $D_{f}$. In contrast, SISA yields substantially larger distributional shifts, suggesting a broader displacement of the decision geometry that is not confined to the forget set.

Structural indicators provide a coherent explanation for this behavior. The behavioral disagreement measure (Top-k overlap with retrain) shows that post hoc methods preserve local decision consistency: predictions largely coincide with the retrain reference, and disagreement increases only mildly as k grows. Representation-level distances further indicate a characteristic layerwise trend: discrepancies are minimal in early layers and increase toward deeper, task-specific layers. For the methods most aligned with retraining, this profile remains stable across splits, consistent with a controlled readjustment rather than split-specific representational drift. SISA preserves the same monotonic depth trend but at uniformly higher distances, consistent with a more global representational shift.

Finally, weight-space analyses confirm that post hoc updates induce limited overall parameter drift, yet the change is highly concentrated in task-specific components (most notably the classification head and upper layers), while embeddings and early layers remain comparatively stable. This concentration is consistent with a decision-boundary adjustment mechanism that preserves general-purpose representations. Conversely, the larger structural distances observed for SISA are compatible with broader parameter shifts driven by its training and partitioning scheme, rather than targeted and localized forgetting.

6.2. Efficiency Analysis

We evaluate the computational efficiency of unlearning methods in terms of wall-clock time, with the goal of quantifying the practical time-to-unlearn relative to the retrain-from-scratch reference. In operational settings, the feasibility of compliance-driven deletion depends not only on how well a method forgets, but also on how quickly an updated model can be produced after a deletion request.

Under identical architecture, hyperparameters, and hardware, retraining on $D_{retain}$ is faster than baseline training on $D_{full}$ because it operates on a smaller dataset. In our setup, baseline training takes about 710 s, while retraining takes about 540 s (roughly $-24\%$). We treat this retraining runtime as the exact time reference for deletion via full retraining, and as the primary benchmark for comparing alternative unlearning strategies.

All approximate methods that reuse the baseline model as a starting point are faster than monolithic retraining (Table 8). The gradient-based update (Ascent/Descent-to-Delete) is the most time-efficient, reducing wall-clock time to roughly 300 s (about $-44\%$ vs. retrain). Influence/Hessian follows at around 360 s ($-33\%$), while Weight Scrubbing and Knowledge Distillation require about 380–395 s ($-30\%$ to $-27\%$). Overall, these results indicate that the deletion cycle can typically be accelerated by approximately 20–$45\%$ relative to retraining, at the cost of potential trade-offs in forgetting effectiveness that must be assessed jointly with utility and privacy metrics.

SISA introduces a qualitatively different efficiency profile. By design, it supports exact unlearning through sharding/slicing and selective retraining, but it incurs a substantial upfront overhead in the initial ensemble training phase. In our configuration, SISA training without deletions takes about 1290 s compared to about 710 s for monolithic baseline training (approximately $+80\%$). This overhead is driven by multiple training phases across shards/slices and by orchestration costs (repeated checkpointing, data loader reconstruction, and training-loop re-initialization).

At unlearning time, SISA efficiency strongly depends on how many instances are removed and how dispersed they are across shards/slices. For a small, localized request (e.g., deleting 4 instances), SISA remains competitive (about 435 s, i.e., $-19\%$ vs. retrain). For a large, diffuse request (e.g., deleting ∼30k instances), many components are affected and selective retraining approaches a broad retraining workload: runtime increases to about 980 s (about $+80\%$ vs. retrain), becoming more expensive than monolithic retraining. Operationally, this suggests that SISA is most suitable for continuous, incremental, small-scale deletions, whereas its advantage diminishes (and may reverse) under batch deletions affecting large portions of the training data.

SISA efficiency is also influenced by the $(S, R)$ configuration. Increasing the number of shards $S$ and/or slices $R$ increases the number of checkpoints approximately linearly, amplifying storage and I/O overhead. With the adopted setting $(S = 8, R = 3)$, the initial training produces 24 checkpoints (about $6.5$ GB), while a worst-case unlearning scenario can require roughly double (about 48 checkpoints, ∼13 GB). More aggressive configurations quickly lead to dozens or hundreds of checkpoints, which can become a practical bottleneck under strict storage/throughput constraints or highly distributed deletion patterns.

From an efficiency standpoint, three conclusions emerge. First, when immediate speed-ups on an already trained model are required, lightweight post hoc editing methods (especially gradient-based updates and scrubbing) are typically the most economical. Second, Influence/Hessian can remain time-competitive with retraining, but comes with higher implementation complexity and stricter computational requirements. Third, SISA can be advantageous in systems engineered for frequent small deletions, yet may become slower than retraining for large-scale removals or overly aggressive shard/slice configurations.
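
The relative changes quoted in this section can be recomputed from the approximate runtimes given in the text. The sketch below uses only those rounded values, so the results are indicative; the exact figures are those reported in Table 8. The helper name `rel_change` is an assumption.

```python
def rel_change(t, t_ref):
    """Signed percentage change of runtime t relative to the reference t_ref."""
    return 100.0 * (t - t_ref) / t_ref

RETRAIN_S, BASELINE_S = 540.0, 710.0  # approximate runtimes quoted in the text

runtimes_s = {
    "Ascent/Descent-to-Delete": 300.0,
    "Influence/Hessian": 360.0,
    "SISA (4 deletions)": 435.0,
    "SISA (~30k deletions)": 980.0,
}

for name, t in runtimes_s.items():
    print(f"{name}: {rel_change(t, RETRAIN_S):+.0f}% vs. retrain")
print(f"retrain vs. baseline: {rel_change(RETRAIN_S, BASELINE_S):+.0f}%")
```

With these rounded inputs the large SISA request comes out at about +81% relative to retraining, in line with the approximate +80% stated above.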

7. Discussion

Our results highlight a fundamental three-way tension in practical machine unlearning: preserving utility on legitimate data, achieving effective forgetting and privacy that approaches training without $D_{f}$, and satisfying efficiency constraints such as time-to-unlearn and operational overhead. No evaluated method simultaneously optimizes all three dimensions. Instead, approaches occupy distinct positions in this trade-off space, suggesting that unlearning is better framed as a deployment decision conditioned on the required assurance level and the expected deletion regime, rather than as a single universally best algorithmic choice.

Retraining-from-scratch remains the most reliable behavioral reference for deletion, since the model never observes $D_{f}$ and therefore provides a natural target response on train_deleted while maintaining stable generalization on test_full (Table 2). This makes retraining a defensible anchor for assessing whether unlearning has succeeded in practice, as it offers a concrete notion of what “forgotten” should look like at the level of model behavior. Its limitation is primarily operational: full retraining can be costly and may become impractical under frequent requests, strict latency constraints, or larger model scales, motivating approximate strategies that aim to approach retrain behavior at reduced cost.

Single-model post hoc methods (gradient-based editing, distillation, scrubbing, influence-style updates) offer substantial runtime advantages (Table 8) and, in many cases, preserve retained and test utility, as well as calibration relative to retraining (Table 4 and Table 5). However, a recurring pattern is that these methods can remain more predictive on train_deleted than the retrain reference, with higher accuracy and lower NLL. This is consistent with under-forgetting: rather than fully removing evidence associated with $D_{f}$, the update may predominantly readjust decision boundaries while leaving residual information in the representation. Distributional and structural analyses are aligned with this interpretation: the strongest post hoc approaches remain close to retraining overall, yet the induced changes are concentrated in task-specific components (Table 7), suggesting targeted interventions that preserve utility but may offer weaker assurances when judged against retraining.
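The two output-level indicators used in the distributional analysis (Jensen–Shannon divergence between output probabilities and Top-1 behavioral disagreement) can be sketched as follows. This is an illustrative implementation rather than the benchmark code: the function names, the use of the natural logarithm, and the averaging over samples are assumptions.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def compare_to_reference(probs_model, probs_ref):
    """Return (mean JS divergence, Top-1 disagreement rate) over paired samples."""
    n = len(probs_model)
    mean_js = sum(js_divergence(p, q) for p, q in zip(probs_model, probs_ref)) / n
    disagreement = sum(argmax(p) != argmax(q) for p, q in zip(probs_model, probs_ref)) / n
    return mean_js, disagreement

# Toy predicted probabilities for two samples (hypothetical values).
unlearned = [[0.9, 0.1], [0.4, 0.6]]
retrained = [[0.8, 0.2], [0.6, 0.4]]
mean_js, disagreement = compare_to_reference(unlearned, retrained)
print(disagreement)  # 0.5: the two models disagree on the second sample only
```

With the natural logarithm the divergence lies between 0 (identical outputs) and ln 2 (disjoint outputs), so values computed per split can be compared directly against the retrain reference.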

SISA exhibits a qualitatively different trade-off profile. By sharding and slicing training, it enables composable deletion through selective retraining, which is particularly attractive when deletion requests are small and localized. At the same time, the extent to which SISA approximates the retrain-from-scratch reference is influenced by the sharding/slicing configuration and by the optimization dynamics within each shard. In our setting, we find that some configurations yield outcomes that are less closely aligned with retraining, even though SISA is designed to support near-exact unlearning. This effect is more evident under large, class-balanced deletions dispersed across many shards/slices: more components are affected, selective retraining becomes widespread, and the expected locality benefits are reduced, with additional variability introduced by repeated retraining across multiple components.

This design also introduces non-trivial upfront overhead and sensitivity to deletion granularity. Empirically, SISA can be competitive for limited removals, but becomes more expensive than monolithic retraining when deletions are large or widely distributed (Table 8). In our configuration, it is also associated with larger deviations from the retrain reference and a broader impact on predictive performance (Table 5 and Table 7), suggesting weaker separation between deleted and retained effects under this deletion regime. Finally, while temperature scaling improves calibration, it does not eliminate the utility gap, illustrating that calibration and predictive performance can decouple after substantial procedural changes.
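The decoupling between calibration and predictive performance follows from how temperature scaling works: dividing logits by a positive temperature changes confidence but not the ranking of classes. The sketch below illustrates this on toy logits; the helper names and all numeric values are hypothetical and unrelated to the benchmark runs.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def accuracy_and_nll(all_logits, labels, temperature=1.0):
    """Top-1 accuracy and mean negative log-likelihood at a given temperature."""
    correct, nll = 0, 0.0
    for logits, y in zip(all_logits, labels):
        probs = softmax(logits, temperature)
        predicted = max(range(len(probs)), key=probs.__getitem__)
        correct += int(predicted == y)
        nll -= math.log(probs[y])
    n = len(labels)
    return correct / n, nll / n

# Overconfident toy model: three correct predictions and one confident error.
logits = [[8.0, 0.0, 0.0, 0.0], [0.0, 9.0, 0.0, 0.0],
          [7.0, 0.0, 0.0, 0.0], [0.0, 0.0, 6.0, 0.0]]
labels = [0, 1, 1, 2]

acc_raw, nll_raw = accuracy_and_nll(logits, labels, temperature=1.0)
acc_ts, nll_ts = accuracy_and_nll(logits, labels, temperature=4.0)
print(acc_raw, acc_ts)    # identical accuracy
print(nll_raw > nll_ts)   # True: softer probabilities lower the NLL here
```

This mirrors the pattern reported for SISA in Table 5, where temperature scaling lowers NLL and ECE while accuracy stays where it was.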

Taken together, these outcomes suggest that deployable unlearning should be evaluated through an acceptance protocol rather than any single metric. Approximate methods are compelling when low latency is critical, but they should be adopted only when they satisfy all of the following: (i) retained and test utility remain within predefined tolerances, (ii) behavior on deleted samples regresses toward retrain-level performance, and (iii) distributional and structural alignment with retraining is preserved. In this context, membership inference provides supporting evidence rather than a standalone guarantee. In our setting, loss-based MIAs are generally weak, with AUC values close to 0.5 across methods (Table 6), and are therefore best interpreted as safety-consistency checks. This is compatible with limited separation between train and test losses on this task and with the attacker being score-only; stronger attackers (e.g., representation-based) may yield a clearer signal. Stronger assurance is more likely to be achieved by combining privacy audits with behavioral regression on deleted data and complementary alignment diagnostics.
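The three acceptance criteria can be expressed as a simple gate, sketched below. The `Metrics` container, the `accept` function, and every tolerance are hypothetical choices, not values prescribed by the benchmark. Accuracy and NLL figures are taken from Tables 4 and 5; the `js_to_retrain` values are placeholders, because Table 7 reports that indicator only on a qualitative scale.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    acc_test: float
    acc_retained: float
    acc_deleted: float
    nll_deleted: float
    js_to_retrain: float  # placeholder value, not a reported measurement

def accept(candidate, retrain, tol_utility=0.01, tol_forget_acc=0.01,
           tol_forget_nll=0.02, max_js=0.02):
    """Return (accepted, per-criterion results); all tolerances are illustrative."""
    checks = {
        # (i) utility on retained and test data does not drop beyond tolerance
        "utility": (retrain.acc_test - candidate.acc_test <= tol_utility
                    and retrain.acc_retained - candidate.acc_retained <= tol_utility),
        # (ii) behavior on deleted samples is close to the retrain reference
        "forgetting": (abs(candidate.acc_deleted - retrain.acc_deleted) <= tol_forget_acc
                       and abs(candidate.nll_deleted - retrain.nll_deleted) <= tol_forget_nll),
        # (iii) output distributions stay aligned with retraining
        "alignment": candidate.js_to_retrain <= max_js,
    }
    return all(checks.values()), checks

retrain = Metrics(acc_test=0.931, acc_retained=0.943,
                  acc_deleted=0.932, nll_deleted=0.203, js_to_retrain=0.0)
influence = Metrics(acc_test=0.934, acc_retained=0.948,
                    acc_deleted=0.948, nll_deleted=0.163, js_to_retrain=0.005)

ok, checks = accept(influence, retrain)
print(ok, checks)
```

With these illustrative tolerances, the Influence/Hessian numbers pass the utility check but fail the forgetting check, because the model remains more accurate and more confident on deleted samples than the retrain reference. Tighter or looser tolerances would change the verdict, which is why they need to be fixed before evaluation.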

Our benchmark is intentionally controlled: it focuses on supervised text classification (DistilBERT on AG News), a single deletion scale and construction (class-balanced ∼30k deletion), and a fixed execution environment. Additionally, some approximate methods in the benchmark (e.g., Influence/Hessian and gradient-based editing) constrain updates to late layers or the classifier head for tractability; this should be read as targeting functional unlearning rather than guaranteeing exact parameter-level “true unlearning”. Consequently, absolute results should not be over-generalized to other regimes. Different deletion schedules (streaming vs. batch, repeated requests), different targets (instance-, class-, concept-, or client-level deletion), and larger architectures and pre-training settings may amplify or shift the observed trade-offs. Moreover, the experiments rely on fixed random seeds to ensure controlled comparisons across methods; however, retraining outcomes in deep models can vary across seeds due to stochastic optimization and initialization. As a result, distances between an unlearned model and the retraining reference should ideally be interpreted relative to the variability observed across independent retraining runs. We therefore treat seed sensitivity as a potential threat to validity and leave systematic multi-seed evaluation as an extension of the benchmark. In addition, a near-random MIA regime limits the discriminative power of privacy auditing in this setup, motivating stronger attacker models and complementary leakage tests. In particular, when baseline leakage is weak (AUC close to 0.5), incremental improvements due to unlearning may be difficult to detect statistically using score-only attacks. Future benchmark extensions should therefore include settings that induce more detectable memorization (e.g., stronger overfitting, larger models, or more difficult generalization regimes) together with stronger MIAs (e.g., representation-based or likelihood-ratio-style attacks) to stress-test privacy claims.
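Why a score-only attack sits near chance when member and non-member losses overlap can be seen from a minimal sketch of a loss-based membership inference AUC. This is a generic loss-threshold attacker, assumed here for illustration; the function name and the synthetic loss distributions are hypothetical and do not reproduce the benchmark's attacker.

```python
import random

def loss_mia_auc(member_losses, nonmember_losses):
    """AUC of a score-only attacker that predicts 'member' when the loss is low.

    Equals the probability that a random member has a lower loss than a
    random non-member, counting ties as one half.
    """
    wins = 0.0
    for lm in member_losses:
        for ln in nonmember_losses:
            if lm < ln:
                wins += 1.0
            elif lm == ln:
                wins += 0.5
    return wins / (len(member_losses) * len(nonmember_losses))

# Clear memorization: members have strictly lower loss than non-members.
print(loss_mia_auc([0.1, 0.2], [0.5, 0.9]))  # 1.0

# No train/test gap: both groups drawn from the same synthetic distribution.
random.seed(0)
members = [abs(random.gauss(0.2, 0.1)) for _ in range(500)]
nonmembers = [abs(random.gauss(0.2, 0.1)) for _ in range(500)]
auc_overlap = loss_mia_auc(members, nonmembers)
print(round(auc_overlap, 3))  # close to 0.5
```

When the two loss distributions are nearly identical, any change introduced by unlearning shifts the AUC by an amount comparable to sampling noise, which is the detection difficulty described above.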

Several concrete directions emerge from the review and the benchmark evidence. First, standardized benchmarks should move beyond single-shot deletion by varying deletion granularity (instance/class/concept), deletion scale, and deletion schedule (single-shot vs. repeated requests), while reporting utility, calibration, and compute budgets under consistent protocols. This is especially important for design-for-unlearning methods, whose efficiency depends on how deletions are dispersed across shards and slices. Second, privacy auditing should be strengthened: when loss-based MIA operates near chance, differences between methods can be masked, so evaluations should include multiple attacker models (black-box vs. white-box; score-based vs. representation-based) and complementary exposure tests to better assess residual leakage on $D_{f}$. Third, the recurring “too-strong on deleted” pattern motivates representation-level diagnostics that can help distinguish boundary readjustment from deeper representational forgetting, for instance, through layer-wise probes, targeted attribution tests, and similarity analyses that may also inform principled where-to-edit strategies. Fourth, design-for-unlearning may benefit from increased modularity (e.g., adapters, structured heads, or separation of task-specific components) and from principled selection of editable parameters to improve selectivity and alignment to retraining. Finally, scaling toward LLM and federated settings introduces additional requirements: in LLMs, unlearning may manifest as behavioral suppression rather than true removal, requiring concept/knowledge-level tests and careful treatment of auxiliary components (retrieval layers, caches), while in federated learning, client-level removal must account for distributed optimization, limited observability, and communication constraints. Across settings, operational verifiability (audit logs, deletion registries, and reproducible unlearning transcripts) will be important to translate algorithmic progress into governance-ready practice.

Overall, retraining remains the clearest behavioral target, while approximate methods can be viable under strict acceptance testing; progress will benefit from standardized evaluation, stronger audits, design-time modularity, and verifiable evidence of forgetting.

8. Conclusions

This work addresses machine unlearning through an integrated perspective that combines: (i) a systematic organization of the literature along key dimensions (request granularity, required guarantees, model scale/setting, and design-for-unlearning vs. post hoc operation); (ii) a practical taxonomy and selection guidelines to support method choice under operational and compliance constraints; and (iii) a unified, reproducible experimental pipeline that compares retrain-from-scratch against representative approximate post hoc techniques and a design-for-unlearning baseline (SISA). The evaluation spans complementary axes (utility and probabilistic quality, forgetting/privacy indicators, distributional and structural alignment to retraining, and wall-clock efficiency) to make explicit the trade-offs that arise in real deployments.

Empirically, retrain-from-scratch provides the most direct behavioral reference: it preserves stable generalization on held-out data while exhibiting the expected degradation on the forget split, consistent with removing the influence of $D_{f}$. In contrast, the strongest approximate post hoc methods remain close to retraining on retained/test utility and calibration and show tight distributional/structural proximity, but they often remain systematically more predictive on deleted samples than retraining, indicating under-forgetting under functional criteria. On the privacy side, loss-based membership inference operates near chance in this setting, with AUC values close to 0.5 across methods; consequently, MIA acts primarily as a consistency check rather than a decisive discriminator, and robust acceptance should rely on a joint view of deleted-set regression toward retraining, retained/test utility, and behavioral alignment measures. From an efficiency standpoint, post hoc methods substantially reduce time-to-unlearn relative to retraining, while SISA exhibits a request-dependent profile: it incurs substantial upfront overhead and becomes advantageous mainly under frequent, small, localized deletions, whereas large or widely distributed deletions can erase its runtime benefits.

Overall, the study supports a pragmatic conclusion: when strong guarantees or straightforward auditability are required, retraining remains the conservative choice; when rapid response is the dominant constraint, approximate post hoc methods can be effective but should be deployed with explicit acceptance criteria that verify both utility preservation and progress toward retrain-level forgetting. The proposed taxonomy, guidelines, and experimental evidence jointly provide a practical basis for building unlearning systems that are implementable, auditable, and aligned with real-world constraints on performance and resources.

Author Contributions

Conceptualization, C.C., S.G., P.L. and F.M.; Methodology, C.C., S.G., P.L. and F.M.; Software, C.C. and S.G.; Investigation, C.C. and S.G.; Supervision, P.L. and F.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

AG News is publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Nguyen, T.T.; Huynh, T.T.; Ren, Z.; Nguyen, P.L.; Liew, A.W.C.; Yin, H.; Nguyen, Q.V.H. A Survey of Machine Unlearning. ACM Trans. Intell. Syst. Technol. 2025, 16, 1–46.
2. European Data Protection Supervisor (EDPS). Machine Unlearning. 2024. Available online: https://www.edps.europa.eu/data-protection/technology-monitoring/techsonar/machine-unlearning_en (accessed on 1 January 2024).
3. Wang, W.; Tian, Z.; Zhang, C.; Yu, S. Machine Unlearning: A Comprehensive Survey. arXiv 2024, arXiv:2405.07406.
4. Li, C.; Jiang, H.; Chen, J.; Zhao, Y.; Fu, S.; Jing, F.; Guo, Y. An Overview of Machine Unlearning. High-Confid. Comput. 2025, 5, 100254.
5. Liu, H.; Xiong, P.; Zhu, T.; Yu, P.S. A Survey on Machine Unlearning: Techniques and New Emerged Privacy Risks. arXiv 2024, arXiv:2406.06186.
6. Bourtoule, L.; Chandrasekaran, V.; Choquette-Choo, C.A.; Jia, H.; Travers, A.; Zhang, B.; Lie, D.; Papernot, N. Machine Unlearning. In Proceedings of the 2021 IEEE Symposium on Security and Privacy (SP); IEEE: New York, NY, USA, 2021; pp. 141–159.
7. Graves, L.; Nagisetty, V.; Ganesh, V. Amnesiac Machine Learning. Proc. AAAI Conf. Artif. Intell. 2021, 35, 11516–11524.
8. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
9. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
10. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108.
11. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. PLoS Med. 2021, 18, e1003583.
12. Rethlefsen, M.L.; Kirtley, S.; Waffenschmidt, S.; Ayala, A.P.; Moher, D.; Page, M.J.; Koffel, J.B. PRISMA-S: An extension to the PRISMA Statement for Reporting Literature Searches in Systematic Reviews. Syst. Rev. 2021, 10, 39.
13. Geng, J.; Li, Q.; Woisetschläger, H.; Chen, Z.; Cai, F.; Wang, Y.; Nakov, P.; Jacobsen, H.A.; Karray, F. A Comprehensive Survey of Machine Unlearning Techniques for Large Language Models. arXiv 2025, arXiv:2503.01854.
14. Xu, H.; Zhu, T.; Zhang, L.; Zhou, W.; Yu, P.S. Machine Unlearning: A Survey. ACM Comput. Surv. 2023, 56, 1–36.
15. Cao, Y.; Yang, J. Towards Making Systems Forget with Machine Unlearning. In Proceedings of the 2015 IEEE Symposium on Security and Privacy (SP); IEEE: New York, NY, USA, 2015; pp. 463–480.
16. Ginart, A.; Guan, M.Y.; Valiant, G.; Zou, J. Making AI Forget You: Data Deletion in Machine Learning. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
17. Guo, C.; Goldstein, T.; Hannun, A.; van der Maaten, L. Certified Data Removal from Machine Learning Models. In Proceedings of the 37th International Conference on Machine Learning (ICML); PMLR: New York, NY, USA, 2020; pp. 3832–3842.
18. Sekhari, A.; Acharya, J.; Kamath, G.; Suresh, A.T. Remember What You Want to Forget: Algorithms for Machine Unlearning. Adv. Neural Inf. Process. Syst. 2021, 34, 18075–18086.
19. Golatkar, A.; Achille, A.; Soatto, S. Forgetting and Generalization in Machine Learning. In Proceedings of the 37th International Conference on Machine Learning (ICML); PMLR: New York, NY, USA, 2020; pp. 3634–3643.
20. Golatkar, A.S.; Achille, A.; Soatto, S. Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; pp. 9304–9312.
21. Liu, G.; Ma, X.; Yang, Y.; Wang, C.; Liu, J. FedEraser: Enabling Efficient Client-Level Data Removal from Federated Learning Models. In Proceedings of the 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS), Tokyo, Japan, 25–28 June 2021; pp. 1–10.
22. Koh, P.W.; Liang, P. Understanding Black-box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning (ICML); PMLR: New York, NY, USA, 2017; pp. 1885–1894.
23. Wu, Y.; Dobriban, E.; Davidson, S.B. DeltaGrad: Rapid Retraining of Machine Learning Models. In Proceedings of the 37th International Conference on Machine Learning (ICML); PMLR: New York, NY, USA, 2020; pp. 10355–10366.
24. Yao, Y.; Xu, X.; Liu, Y. Large Language Model Unlearning. arXiv 2023, arXiv:2310.10683.
25. Wang, W.; Tian, Z.; Zhang, C.; Liu, A.; Yu, S. BFU: Bayesian Federated Unlearning with Parameter Self-Sharing. In Proceedings of the ACM Asia Conference on Computer and Communications Security (AsiaCCS), Melbourne, Australia, 10–14 July 2023.
26. Schelter, S.; Grafberger, S.; Dunning, T. Amnesia—A Selection of Machine Learning Models That Can Forget User Data Very Fast. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), Amsterdam, The Netherlands, 12–15 January 2020.
27. Neel, S.; Roth, A.; Sharifi-Malvajerdi, S. Descent-to-Delete: Gradient-Based Methods for Machine Unlearning. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory (ALT); PMLR: New York, NY, USA, 2021; pp. 931–962.
28. Kurmanji, M.; Triantafillou, P.; Hayes, J.; Triantafillou, E. Towards Unbounded Machine Unlearning. Adv. Neural Inf. Process. Syst. 2023, 36, 1957–1987.
29. Brophy, J.; Lowd, D. Machine Unlearning for Random Forests. In Proceedings of the International Conference on Machine Learning (ICML), Online, 18–24 July 2021.
30. Liu, J.; Wu, C.; Lian, D.; Chen, E. Efficient Machine Unlearning via Influence Approximation. arXiv 2025, arXiv:2507.23257.
31. Tarun, A.K.; Chundawat, V.S.; Mandal, M.; Kankanhalli, M. Fast Yet Effective Machine Unlearning. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 13046–13055.
32. Hoang, T.; Rana, S.; Gupta, S.K.; Venkatesh, S. Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 4807–4816.
33. Golatkar, A.; Achille, A.; Ravichandran, A.; Polito, M.; Soatto, S. Mixed-Privacy Forgetting in Deep Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 792–801.
34. Wu, C.; Zhu, S.; Mitra, P. Federated Unlearning with Knowledge Distillation. arXiv 2022, arXiv:2201.09441.
35. Zhou, Y.; Zheng, D.; Mo, Q.; Lu, R.; Lin, K.Y.; Zheng, W.S. Decoupled Distillation to Erase: A General Unlearning Method for Any Class-centric Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025.
36. Chen, C.; Sun, F.; Zhang, M.; Ding, B. Recommendation Unlearning. In Proceedings of the ACM Web Conference (WWW), Virtual, 25–29 April 2022; pp. 2768–2777.

Figure 1. PRISMA 2020 flow diagram of the study selection process, following PRISMA guidelines [11].

Figure 2. High-level taxonomy used to organize the PRISMA-selected MU literature and to support method selection. The four dimensions recur across the surveyed works and directly determine feasibility, guarantees, and evaluation requirements [1,3,5,14].

Figure 3. Decision tree for selecting machine unlearning (MU) technique families under practical constraints.

Figure 4. Baseline vs. retrain-from-scratch accuracy on test_full (raw accuracy).

Figure 5. Raw accuracy on test_full for retrain-from-scratch and unlearning methods.

Figure 6. ΔTPR relative to retrain for loss-based MIA Deleted. Positive values indicate a slightly stronger attacker than retraining at the same FPR, while negative values indicate a weaker attacker.

Figure 7. ΔTPR relative to retrain for loss-based MIA Retained (control).

Table 1. Search strings used for each source (MU synonyms and Boolean operators).

Source	Search String
Google Scholar	`“machine unlearning” OR “machine forgetting” OR “algorithmic forgetting” OR “selective forgetting” OR “data removal” OR “data deletion”--catastrophic`
IEEE Xplore	`(“machine unlearning” OR “machine forgetting” OR “algorithmic forgetting” OR “selective forgetting” OR “data removal” OR “data deletion”) NOT “catastrophic”`
ACM Digital Library	`(“machine unlearning” OR “machine forgetting” OR “algorithmic forgetting” OR “selective forgetting” OR “data removal” OR “data deletion”) NOT “catastrophic”`
Scopus	`TITLE-ABS-KEY (“machine unlearning” OR “machine forgetting” OR “algorithmic forgetting” OR “selective forgetting” OR “data removal” OR “data deletion”) AND NOT TITLE-ABS-KEY (“catastrophic”)`
arXiv	`all: “machine unlearning” OR all: “machine forgetting” OR all: “algorithmic forgetting” OR all: “selective forgetting” OR all: “data removal” OR all: “data deletion” NOT “catastrophic”`

Table 2. Baseline vs. retrain-from-scratch across deleted/retained/test splits. Δ denotes retrain − baseline. TS: temperature scaling. Values are reported with mild rounding for readability.

Split	Metric	Baseline	Retrain	$Δ$
Efficiency
–	Runtime (s)	710	540	( $- 24 %$ )
Forget target
`train_deleted`	Accuracy	0.95	0.93	$- 0.02$
`train_deleted`	NLL/log-loss	0.16	0.20	$+ 0.04$
`train_deleted`	ECE (raw)	0.004	0.008	$+ 0.004$
`train_deleted`	ECE (TS)	0.007	0.007	≈0.000
Retained data
`train_retained`	Accuracy	0.95	0.94	$- 0.01$
`train_retained`	NLL/log-loss	0.16	0.17	$+ 0.02$
`train_retained`	ECE (raw)	0.0035	0.0042	≈0.0007
`train_retained`	ECE (TS)	0.007	0.005	$- 0.002$
Generalization
`test_full`	Accuracy	0.93	0.93	$- 0.003$
`test_full`	NLL/log-loss	0.20	0.21	$+ 0.01$
`test_full`	ECE (raw)	0.011	0.011	≈0.000
`test_full`	ECE (TS)	0.010	0.010	≈0.000

Table 3. Distributional and structural indicators comparing baseline vs. retrain. Lower values indicate closer agreement. Values are mildly rounded. JS divergence is reported per split (Deleted/Test); representation- and parameter-level indicators are computed globally on a fixed probe and therefore span both columns.

Metric	Deleted	Test	Interpretation
JS divergence (output probs)	0.036	0.009	higher on deleted indicates forgetting signal
Behavioral disagreement (Top-1)	0.12	0.03	fraction of differing predictions
Activation distance $1 - cos$ (last layer)	0.74		global (L2-normalized last-layer [CLS])
Weight RSS change (total)	3.1%		parameter drift relative to baseline
Classifier-head share	56%		fraction of drift in classification head

Table 4. Performance on train_deleted (forgetting) and train_retained (utility). Retrain-from-scratch is the reference: on deleted it represents the expected forgetting level, while on retained it reflects clean-scenario utility. Values are mildly rounded for readability.

Method	Acc. (Deleted)	NLL (Deleted)	Acc. (Retained)	NLL (Retained)
Retrain-from-scratch	0.932	0.203	0.943	0.174
Ascent/Descent-to-Delete	0.937	0.189	0.950	0.155
Influence/Hessian	0.948	0.163	0.948	0.160
Knowledge Distillation	0.941	0.180	0.948	0.168
Weight Scrubbing	0.938	0.185	0.948	0.167
SISA (selective retraining)	0.837	2.60 (TS: 0.63)	0.839	2.57 (TS: 0.62)

Table 5. Test-set performance and calibration on test_full. TS denotes temperature scaling. Retrain-from-scratch is the clean reference. Values are mildly rounded.

Method	Acc.	NLL	ECE	NLL (TS)	ECE (TS)
Retrain-from-scratch	0.931	0.208	0.011	0.208	0.010
Ascent/Descent-to-Delete	0.934	0.200	0.016	0.200	0.010
Influence/Hessian	0.934	0.196	0.011	0.196	0.010
Knowledge Distillation	0.931	0.199	0.011	0.199	0.010
Weight Scrubbing	0.931	0.202	0.011	0.202	0.010
SISA (selective retraining)	0.836	2.61	0.24	0.63	0.008

Table 6. Loss-based membership inference summary in two scenarios: Deleted (forgetting audit) and Retained (control). AUC values near 0.5 indicate low membership discriminability. Values are mildly rounded.

	MIA Deleted				MIA Retained
Method	AUC	AP	FPR@0.90	FPR@0.95	AUC	AP	FPR@0.90	FPR@0.95
Baseline	0.512	0.804	0.898	0.946	0.507	0.916	0.888	0.941
Retrain-from-scratch	0.505	0.801	0.892	0.941	0.506	0.915	0.887	0.940
Ascent/Descent-to-Delete	0.506	0.802	0.894	0.942	0.507	0.916	0.889	0.942
Influence/Hessian	0.505	0.801	0.893	0.941	0.507	0.915	0.888	0.941
Knowledge Distillation	0.505	0.800	0.891	0.940	0.507	0.915	0.887	0.941
Weight Scrubbing	0.507	0.803	0.895	0.943	0.506	0.915	0.888	0.941
SISA	0.500	0.799	0.999	0.999	0.501	0.914	0.900	0.951

Table 7. Distributional and structural similarity to retrain-from-scratch (lower is better). JS div.: Jensen–Shannon divergence between output probability distributions; Behavioral disagreement: disagreement in Top-k predictions (lower indicates higher agreement); Activation dist.: representation distance in deeper layers; Weight RSS: relative parameter drift. Entries are reported on a qualitative scale when absolute magnitudes are not directly comparable across indicators.

Method	JS Div.	Behavioral Disagreement	Activation Dist.	Weight RSS
Ascent/Descent-to-Delete	low	low	low	few %
Influence/Hessian	very low	very low	very low	few %
Knowledge Distillation	low	low	low	few %
Weight Scrubbing	low	low	low	few %
SISA (selective retraining)	high	high	medium–high	elevated

Table 8. Wall-clock runtime of a single unlearning cycle compared to exact retraining. Δ is reported relative to retrain-from-scratch. Values are mildly rounded for readability. For Knowledge Distillation and Weight Scrubbing, the time reported is the incremental update given an oracle clean teacher/reference and excludes the cost of producing that clean model.

Method	Time (s)	$Δ$ (s)	Relative
Ascent/Descent	360	−180	$0.67 \times$
Influence/Hessian	120	−420	$0.22 \times$
Knowledge Distillation	395	−145	$0.73 \times$
Weight Scrubbing	380	−160	$0.70 \times$
SISA	620	+80	$1.15 \times$
Retrain	540	0	$1.00 \times$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Cosentino, C.; Gatto, S.; Liò, P.; Marozzo, F. Machine Unlearning: A Perspective, Taxonomy, and Benchmark Evaluation. Future Internet 2026, 18, 174. https://doi.org/10.3390/fi18030174

