1. Introduction
Machine unlearning is the task of updating a machine learning (ML) model after the partial deletion of data on which the model had been trained, so that the model reflects the remaining data. The task arises in the context of many database applications that involve training and using an ML model while allowing data deletions to occur. For example, consider an online store that maintains a database of ratings for its products and uses the database to train a model that predicts customer preferences (e.g., a logistic regression model that predicts what rating a customer would assign to a given product). If part of the database is deleted (e.g., if some users request their accounts to be removed), then a problem arises: how to update the ML model to “unlearn” the deleted data. It is crucial to address this problem appropriately so that the computational effort for unlearning is in proportion to the effect of the deletion. A tiny amount of deletion should not trigger a full retraining of the ML model, with its potentially huge data-processing costs, but at the same time, data deletions should not be ignored to such an extent that the ML model no longer reflects the remaining data.
In this work, we perform a comparative analysis of the existing methods for machine unlearning. In doing so, we are motivated both by the practical importance of the task and the lack of a comprehensive comparison in the literature. Our goal is to compare the performance of existing methods in a variety of settings in terms of certain desirable qualities.
What are those qualities? First, machine unlearning should be
efficient (i.e., achieving small running time) and
effective (i.e., achieving good accuracy). Moreover, machine unlearning is sometimes required to be
certifiable (i.e., guarantee that after data deletion the ML model operates
as if the deleted data had never been observed). Such a requirement may be stipulated by laws (e.g., in the spirit of the
right to be forgotten [
1] or the
right of erasure [
2] in EU laws) or even offered voluntarily by the application in order to address privacy concerns. In the example of the online store, consider the case where some users request their data to be removed from its database. The online store should not only delete the data in the hosting database but also ensure that the data are unlearned by any ML model that was built from them. Essentially, if an audit was performed, the employed ML models should be found to have unlearned the deleted data as well as a model that is obtained with a brute-force full retraining of the remaining data, even if full retraining was not actually performed to unlearn the deleted data.
The aforementioned qualities exhibit pairwise trade-offs. There is a trade-off between efficiency on one hand and effectiveness or certifiability on the other, because it takes time to optimize a model so that it reflects the underlying data or unlearns the deleted data. Moreover, there is a trade-off between certifiability and effectiveness: unlearning the deleted data (thus ensuring certifiability) corresponds to learning from fewer data, which decreases accuracy. In this study, we observe the three trade-offs experimentally and find that, because the compared methods involve different processing costs for different operations, they offer better or worse trade-offs in different settings.
For the experimental evaluation, we implement a common
unlearning pipeline (
Figure 1) for the compared methods. The first stage trains an
initial ML model from the data. To limit the variable parts of our experimentation, we will be focusing on
logistic regression, which represents a large class of models that are commonly used in a wide range of settings. In addition, we will be assuming that the initial model is trained with
stochastic gradient descent (SGD), since SGD and its variants are the standard algorithms for training general ML models. The second stage
employs the initial ML model for inference (i.e., for classification). During this stage, if data deletion occurs, then the pipeline proceeds to the third stage to unlearn the deleted data and produce an updated model. After every such model update, the updated model is evaluated for certifiability. If it fails, then the pipeline restarts and trains a new model from scratch on the remaining data; otherwise, it is employed in the inference stage, and the pipeline resumes. When an audit is requested by an external auditor (not shown in
Figure 1), a full retraining of the ML model is executed on the remaining data, and the
fully retrained model is compared to the currently employed model. If the employed model is found to have unlearned the deleted data as well as the fully retrained model (within a threshold of disparity), then the audit is successful, meaning the pipeline has certifiably unlearned the deleted data thus far and is allowed to resume.
Given this pipeline, we evaluate three methods, namely
Fisher [
3],
Influence [
4], and
DeltaGrad [
5], that follow largely different approaches for machine unlearning and represent the state of the art for our setting (linear classification models trained with SGD).
Fisher, proposed by Golatkar et al. [
3], updates the initial ML model using the
remaining data to perform a corrective Newton step.
Influence, proposed by Guo et al. [
4], updates the initial ML model using the
deleted data to perform a corrective Newton step. Finally,
DeltaGrad, proposed by Wu et al. [
5], updates the initial ML model by correcting the SGD steps that led to the initial model. Note that, in this work, we extend the original algorithms of [
3,
5] to ensure that all the evaluated methods are equipped with mechanisms to control the trade-offs between efficiency, effectiveness and certifiability.
For the experimental evaluation, we implement the three methods and compare them in a large range of settings that adhere to the pipeline described above. The aim of the experiments is to demonstrate the trade-offs that the three methods offer in terms of efficiency, effectiveness and certifiability. First, we demonstrate that the trade-offs are much more pronounced for certain worst-case deletion distributions than for random deletions. Subsequently, we observe that Fisher offers the overall best certifiability, along with good effectiveness at a lower efficiency than Influence, especially for larger datasets. In addition, Influence offers the overall best efficiency, along with good effectiveness at lower levels of certifiability, and DeltaGrad offers stable albeit lower performance across all qualities. Moreover, we observe that the efficiency of Fisher and Influence is much higher for datasets of lower dimensionality. The patterns we observe in these experiments have a beneficial by-product: they allow us to define a practical approach to determine in online fashion (i.e., as the pipeline unfolds) when the accumulated error from approximate unlearning is large enough to require restarting the pipeline to perform a full retraining of the ML model.
To summarize, we make the following contributions:
We define a novel framework to compare machine unlearning methods in terms of effectiveness, efficiency, and certifiability.
We extend the methods of [
3,
5] with mechanisms to control performance trade-offs.
We offer the first experimental comparison of the competing methods in a large variety of settings. As an outcome, we obtain novel empirical insights about (1) the effect of the deletion distribution on the performance trade-offs and (2) the strengths of each method in terms of performance trade-offs.
We propose a practical online strategy to determine when to restart the training pipeline.
As for future work, a similar experimental study would address model updates for data addition rather than deletion. For this work, we opted to focus on deletion to keep the paper well-contained, because certifiability is typically required in the case of deletion (e.g., when users request their data to be deleted from an application), and the methods we evaluate are tailored to certifiable deletion.
2. Related Work
Unlearning methods are classified as exact or approximate.
Exact unlearning methods produce ML models that perform as fully retrained models. By definition, these methods offer the highest certifiability, as the produced models are effectively the same as ones obtained with retraining. There exist several exact unlearning methods, typically for training algorithms that are model-specific and deterministic in nature. For instance, ML models such as support vector machines [
6,
7,
8], collaborative filtering, naive Bayes [
9,
10],
k-nearest neighbors and ridge regression [
9] possess exact unlearning methods. The efficiency for such exact methods varies.
For stochastic training algorithms such as SGD, an exact unlearning approach, under the assumption that learning is performed in federated fashion, was proposed in [
11]. In federated learning, separate ML models are trained on separate data partitions, and their predictions are aggregated during inference. This partitioning of data allows for efficient retraining of ML models on smaller fragments of data, leading to efficient unlearning when data are deleted. However, for general ML models trained with SGD, the setting of federated learning comes with a potential cost of effectiveness that is difficult to quantify and control because model optimization is not performed jointly on the full dataset.
Approximate unlearning methods, in contrast, update the ML model so that it approximates the model that full retraining would produce; the existing methods can be organized into three groups based on the information they use for the update. The
first group [
3,
12,
13] uses the remaining data of the training dataset to update the ML model and control certifiability. These methods use Fisher information [
14] to retain the information of the remaining data and inject optimal noise in order to unlearn the deleted data. The
second group [
4,
15,
16] uses the deleted data to update ML models during unlearning. They perform a Newton step [
17] to approximate the influence of the deleted data on the ML model and remove it. To trade off effectiveness for certifiability, they inject random noise into the training objective function [
16]. The
third group [
5,
18,
19,
20] stores data and information during training and then utilizes this when deletion occurs to update the model. Specifically, these methods focus on approximating the SGD steps that would have occurred if full retraining was performed. To aid in this approximation, they store the intermediate quantities (e.g., gradients and model updates) produced by each SGD step during training. The amount of stored information and the approximation process raise an effectiveness vs. efficiency trade-off.
Methods from the above three groups can be used to perform unlearning for classification models with SGD, as long as the relevant quantities (e.g., the model gradients) are easy to compute for the model at hand. Apart from the above three groups, there are other approximate unlearning methods that do not fit the same template (e.g., methods for specific ML models, such as [
21] for random forest models, or for Bayesian modeling, such as [
22] for Bayesian learning), and so we consider them outside the scope of this paper.
In this paper, we focus on approximate unlearning methods, because they are applicable to general ML models, when training is performed with general and widely used optimization algorithms such as SGD. We implement three methods—
Fisher,
Influence and
DeltaGrad—which correspond to state-of-the-art unlearning methods from each of the aforementioned groups ([
3,
4,
5], respectively).
3. Machine Unlearning
Section 3.1 below will present the stages of the unlearning pipeline and how each method implements them. Each unlearning method we consider is equipped with mechanisms to navigate trade-offs between efficiency, effectiveness and certifiability. Therefore, to facilitate presentation, let us first provide a high-level description of those mechanisms along with some related terms and notation.
In what follows, effectiveness is measured as the model’s accuracy on the test dataset
Acctest (i.e., the fraction of the test data that it classifies correctly). Certifiability is measured as the disparity
AccDis in accuracy on the deleted data between the updated model (i.e., the model that results from the (possibly approximate) unlearning of the deleted data) and the fully retrained ML model. This metric is a normalized version of the “error on the cohort to be forgotten” metric seen in Golatkar et al. [
3]. Intuitively,
AccDis quantifies how well the updated model “remembers” the deleted data. If it is small, then the updated model has “unlearned” the deleted data almost as well as if fully retrained.
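For concreteness, here is a minimal sketch of how the two metrics could be computed for a linear classifier, assuming binary labels in {0, 1} and that AccDis is the absolute difference of the two accuracies on the deleted data (the helper names are ours, not the paper's):

```python
import numpy as np

def accuracy(w, X, y):
    """Fraction of points that the linear classifier w labels correctly."""
    preds = (X @ w > 0).astype(int)
    return float(np.mean(preds == y))

def acc_test(w, X_test, y_test):
    """Effectiveness: accuracy of the model on the test dataset (Acctest)."""
    return accuracy(w, X_test, y_test)

def acc_dis(w_updated, w_retrained, X_del, y_del):
    """Certifiability: disparity in accuracy on the deleted data between the
    updated model and the fully retrained model (AccDis); smaller is better."""
    return abs(accuracy(w_updated, X_del, y_del) - accuracy(w_retrained, X_del, y_del))
```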
The first mechanism trades efficiency for effectiveness on one hand and certifiability on the other via an efficiency parameter τ, which is specified separately for each unlearning method. Lower values of τ indicate lower efficiency, thus allowing longer running times to improve the effectiveness and certifiability of the updated model.
The second mechanism trades effectiveness (high accuracy
Acctest) for certifiability (low disparity
AccDis) via noise injection. Simply expressed, noise injection deliberately adds randomness to an ML model both during training and unlearning. A
noise parameter σ controls the amount of injected noise; for a model of
d features, the injected noise is a random d-dimensional vector whose entries are drawn independently and scaled by σ (e.g., $b = \sigma \cdot \mathcal{N}(0,1)^d$).
On one end of this trade-off, large amounts of noise ensure that the fully retrained and unlearned model are both essentially random, thus leading to high certifiability (low disparity
AccDis) at the cost of low effectiveness (low accuracy). On the other end, when no noise is injected, the unlearning method strives to readapt the model to the remaining data, thus possibly sacrificing some of the certifiability it would achieve at the other end of this trade-off in favor of effectiveness. We note that this trade-off, along with noise injection as a control mechanism, has already been introduced in the differential privacy literature [
4,
16,
23].
3.1. Unlearning Pipeline and Methods
The unlearning pipeline represents the lifecycle of the ML models in our study (
Figure 1). Below, we describe how each unlearning method we consider executes each stage of the pipeline, with a summary shown in
Table 1.
3.2. Training Stage
This stage produces an ML model from the training dataset $D$. In what follows, we assume that the ML model is a logistic regression, a simple and widely used model for classification. Each data point $x_i$ is accompanied by a categorical label $y_i$, and there are $n$ data points initially. At any time, $D$ will denote the currently available training dataset, which is a subset of the initial training dataset $D_{full}$ due to possible deletions (i.e., $D \subseteq D_{full}$). The fitness of an ML model's parameters $w$ on a dataset $D$ is measured via an objective function, such as

$$L(w; D) = \sum_{(x_i, y_i) \in D} \ell\left(w^{\top} x_i, y_i\right) + \frac{\lambda}{2}\,\lVert w \rVert^2 \qquad (2)$$

with binary cross entropy ℓ in the first term and ridge regularization in the second term for a fixed value $\lambda > 0$.
Moreover, we use SGD for training [24]. SGD iteratively minimizes the objective function over the training data. First, it initializes the model parameters to a random value $w_0$, and it improves them iteratively:

$$w_{t+1} = w_t - \eta_t \,\nabla_w L(w_t; D) \qquad (3)$$

where $\eta_t$ is the learning rate at iteration $t$ and the iterations repeat until convergence. Following common practice, we execute SGD in mini-batch fashion (i.e., only a subset of $D$ is used in each execution of Equation (3)).
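As a concrete illustration of Equations (2) and (3), the following NumPy sketch trains a regularized logistic regression with mini-batch SGD; the fixed learning rate, the convergence-free loop, and all names are our own simplifications rather than the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(w, X, y, lam):
    """L(w; D): binary cross entropy plus ridge regularization (Equation (2))."""
    p = sigmoid(X @ w)
    eps = 1e-12  # numerical guard for the logarithms
    bce = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return bce + 0.5 * lam * (w @ w)

def gradient(w, X, y, lam):
    """Gradient of the objective on the data (X, y)."""
    return X.T @ (sigmoid(X @ w) - y) + lam * w

def sgd_train(X, y, lam=1e-3, eta=0.1, batch_size=64, epochs=20, seed=0):
    """Mini-batch SGD updates (Equation (3)); the batch gradient is averaged for a stable step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(scale=0.01, size=d)  # random initialization w_0
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w = w - eta * gradient(w, X[idx], y[idx], lam) / len(idx)
    return w
```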
The training algorithms for each unlearning method are presented in the second column of Table 1. The Fisher and DeltaGrad methods obtain the model $w^*$ that optimizes L and subsequently add to it random noise proportional to the noise parameter σ. Here, the random noise is added directly to $w^*$ by the DeltaGrad method, while the Fisher method adds noise in the direction of the Fisher matrix F, which for logistic regression is the Hessian of the objective L. On the other hand, the Influence method optimizes a modified objective function, which adds random noise to the standard objective L rather than adding noise directly to the model parameters. Furthermore, in the DeltaGrad method, after every iteration of the SGD algorithm (Equation (3)), the parameters $w_t$ and the objective function gradients $\nabla L(w_t; D)$ are stored to disk.
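To illustrate the two styles of noise injection mentioned above, here is a schematic sketch: direct perturbation of the trained parameters (in the spirit of DeltaGrad; Fisher additionally shapes the noise using the Fisher matrix, which is omitted here) versus perturbation of the training objective with a random linear term, the form used in the certified-removal work of [4]. The exact scaling conventions of Table 1 are not reproduced, so treat this only as an outline with names of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_parameter_noise(w_star, sigma):
    """Add a Gaussian vector scaled by the noise parameter sigma directly to the
    trained model parameters (schematic of parameter-level noise injection)."""
    return w_star + sigma * rng.standard_normal(w_star.shape)

def draw_objective_perturbation(d, sigma):
    """Draw the random vector b used to perturb the training objective once,
    before training starts (schematic of objective-level noise injection)."""
    return sigma * rng.standard_normal(d)

def perturbed_objective(w, X, y, lam, b, base_objective):
    """Perturbed objective: the standard objective L(w; D) plus the linear term b.w."""
    return base_objective(w, X, y, lam) + b @ w
```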
A model obtained from the training stage of the pipeline is denoted by $w^*$. When it is obtained using the initial dataset $D_{full}$, then $w^*$ is referred to as the initial trained model, and when it is obtained using a subset of the training dataset, it is referred to as the fully retrained model. This model is sent to the second stage to be employed for inference.
3.3. Inference Stage
This stage uses the available model to predict the class of arbitrary data points specified as queries. As soon as a subset of the data is deleted, the pipeline moves to the third stage.
3.4. Unlearning Stage
This stage executes the
unlearning algorithm so as to “unlearn” the deleted data $D_{del}$ and produce an updated model $w^u$ (third column of Table 1).
The Fisher and Influence methods use corrective Newton steps in their unlearning algorithms (see the corresponding update terms in Table 1). These Newton steps require the inverse Hessian of the objective L on the remaining data $D$. However, while the Fisher method also computes the gradient on the remaining data, the Influence method computes the gradient on the deleted data. Furthermore, the Fisher method adds random noise in the direction of the Fisher matrix, similar to the training algorithm. Both the Fisher and Influence unlearning algorithms can be performed in mini-batches of size τ, which leads to multiple smaller corrective Newton steps. Smaller unlearning mini-batch sizes lead to a more effective ML model at the cost of efficiency; therefore, the mini-batch size serves as the efficiency parameter τ for the Fisher and Influence methods. Refer to Algorithms A1 and A2 for the mini-batch versions of the unlearning algorithms.
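The corrective Newton step can be sketched as follows for the regularized logistic regression objective; the sketch follows an Influence-style update (Hessian on the remaining data, gradient on the deleted data) and omits noise injection and mini-batching, so it is an illustration rather than the exact algorithm of either method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian(w, X, lam):
    """Hessian of the ridge-regularized cross-entropy objective on the data X."""
    p = sigmoid(X @ w)
    S = p * (1.0 - p)                           # per-point curvature weights
    return (X * S[:, None]).T @ X + lam * np.eye(X.shape[1])

def loss_gradient(w, X, y):
    """Gradient of the (unregularized) cross-entropy loss on the data (X, y)."""
    return X.T @ (sigmoid(X @ w) - y)

def influence_style_unlearn(w_star, X_remain, X_del, y_del, lam):
    """One corrective Newton step: move the trained model by H^{-1} times the
    gradient of the deleted points, approximating retraining without them."""
    H = hessian(w_star, X_remain, lam)          # Hessian on the remaining data
    g_del = loss_gradient(w_star, X_del, y_del) # gradient contributed by the deleted data
    return w_star + np.linalg.solve(H, g_del)
```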
The
DeltaGrad unlearning algorithm proceeds in two steps. The first step aims to obtain the approximate ML model that would have resulted from SGD if the subset $D_{del}$ had never been used for training. This is achieved by approximating the objective function gradient on the remaining data points using the L-BFGS algorithm [25,26,27] and the terms $w_t$ and $\nabla L(w_t; D)$ stored during the training stage. However, to reduce the error from consecutive L-BFGS approximations, the gradient is computed explicitly after every $T_0$ SGD iterations. Larger values of $T_0$ lead to more consecutive approximation steps, resulting in higher efficiency, though at the cost of a less effective model. Therefore, the periodicity $T_0$ serves as the efficiency parameter τ. The second step adds random noise, similar to the training algorithm. Please see Appendix A.3 and Algorithm A3 for further details.
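The following sketch captures only the control flow of such a "replay with periodic corrections" approach: every $T_0$ iterations the gradient on the remaining data is computed exactly, while the other iterations use a cheap approximation built from the stored quantities (a placeholder standing in for DeltaGrad's L-BFGS-based correction). It is a schematic of the schedule, not DeltaGrad's actual update rule.

```python
import numpy as np

def replay_with_periodic_corrections(w0, stored_ws, stored_grads,
                                     exact_grad, approx_grad, eta=0.1, T0=5):
    """Replay the SGD trajectory on the remaining data: an exact (expensive) gradient
    every T0-th iteration, an approximation from stored parameters/gradients otherwise."""
    w = np.array(w0, dtype=float)
    for t, (w_train, g_train) in enumerate(zip(stored_ws, stored_grads)):
        if t % T0 == 0:
            g = exact_grad(w)                     # explicit gradient on the remaining data
        else:
            g = approx_grad(w, w_train, g_train)  # cheap correction of the stored gradient
        w = w - eta * g
    return w
```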
3.5. Evaluation
During evaluation, the updated model $w^u$ is assessed for certifiability as soon as it is produced, using the test dataset $D_{test}$. If the disparity AccDis is below some threshold (e.g., determined by the administrator of this pipeline), then the pipeline returns to the second stage and employs the updated model ($w^* := w^u$) for inference; otherwise, it returns to the first stage for a full retraining over the remaining data. In
Section 7, we discuss a practical online strategy for evaluation.
A properly conducted evaluation ensures that the pipeline will successfully pass a certifiability audit at any point. In other words, if an auditor were to perform a full retraining on the available data and compare the resulting model to the currently employed model, then the two models should be found to be within the threshold of disparity.
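A minimal sketch of this evaluation decision, assuming a fully retrained reference model is available (Section 7 discusses how to avoid actually computing one); the function name and the use of binary labels are our own assumptions:

```python
import numpy as np

def evaluate_and_decide(w_updated, w_retrained, X_del, y_del, threshold):
    """Return the next pipeline stage: employ the updated model for inference if its
    disparity AccDis on the deleted data (relative to the fully retrained reference)
    is within the threshold; otherwise restart and retrain from scratch."""
    acc = lambda w: float(np.mean((X_del @ w > 0).astype(int) == y_del))
    disparity = abs(acc(w_updated) - acc(w_retrained))
    return ("inference", disparity) if disparity <= threshold else ("retrain", disparity)
```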
5. Effect of Deletion Distribution
Before we compare the unlearning methods, let us explore how the volume and distribution of the deleted data affect the accuracy of fully trained models. This will allow us to separate the effects of data deletion from the effects of a specific unlearning method.
We implemented a two-step process to generate different deletion distributions. The process was invoked once for each deleted data point for a predetermined number of deletions. In the first step, one
class is selected. For example, for binary class datasets (see
Table 2), the first step selects one of the two classes. The selection may be either
uniform, where one of the
k classes is selected at random, each with a probability 1/k, or
targeted, where one class is randomly predetermined and subsequently always selected.
The second step selects a data point from the class chosen in the previous step. The selection may be either
random, where one data point is selected uniformly at random, or
informed, where the point that decreases the model’s accuracy the most is selected. Ideally, for the
informed selection and for each data point, we would compute the exact drop in the accuracy of a fully trained model on the remaining data after the single-point removal, and we would repeat this computation after every single selection. In practice, however, such an approach is extremely heavy computationally, even for experimental purposes. Instead, for the
informed selection, we opted to heuristically select the outliers in the dataset, as quantified by the
norm of each data point. This heuristic is inspired by [
15], who stated that deleting data points with a large
norm negatively affects the approximation of Hessian-based unlearning methods. We note that the
informed selections are highly adversarial and not practically feasible. We included them in our experiments to study the worst-case effects of data deletion on unlearning methods.
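The two-step selection process can be sketched as follows; the norm-based "informed" score follows the heuristic described above, and all names and defaults are ours:

```python
import numpy as np

def sample_deletions(X, y, n_delete, class_mode="uniform", point_mode="random", seed=0):
    """Two-step deletion sampler: pick a class (uniform or targeted), then pick a
    point from that class (random, or 'informed' = largest-norm point first)."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    target = rng.choice(classes)               # fixed class for the targeted mode
    remaining = set(range(len(y)))
    deleted = []
    for _ in range(n_delete):
        c = target if class_mode == "targeted" else rng.choice(classes)
        candidates = [i for i in remaining if y[i] == c]
        if not candidates:
            continue                           # class exhausted; skip this draw
        if point_mode == "informed":
            norms = np.linalg.norm(X[candidates], axis=1)
            pick = candidates[int(np.argmax(norms))]
        else:
            pick = int(rng.choice(candidates))
        deleted.append(pick)
        remaining.remove(pick)
    return deleted
```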
The two-step process described above yields four distinct deletion distributions, namely
uniform-random,
uniform-informed,
targeted-random and
targeted-informed. In the experiments that follow, we vary the distribution and volume of deletions performed. For each set of deleted data, we report the accuracy of the fully trained model after deletion (i.e., the accuracy achieved by the model that optimizes Equation (
2) using SGD).
The results are shown in
Figure 2. Each plot in the figure corresponds to one dataset. The first row of plots reports the accuracy on the test dataset, and the second row shows the accuracy on the deleted data. Accuracy values correspond to the
y-axis, while the volume of deletion (as a fraction of the original dataset size) corresponds to the
x-axis. Different deletion distributions are indicated with different markers and colors. The variance seen in
Figure 2 is a consequence of the randomness in the selection of deleted points (two random runs were performed). There are three main takeaways from these results.
- 1.
Uniform deletion distributions are unimpactful.
Uniform-random and uniform-informed deletions do not adversely affect the test accuracy of a fully retrained ML model even at large deletion fractions. This is due to the clear separation of classes encountered in many real-world datasets. Therefore, evaluating unlearning methods on deletions from uniform distributions will not offer significant insights into the effectiveness and efficiency trade-offs.
- 2.
Targeted deletion distributions are the worst case.
Targeted-random and targeted-informed deletions lead to large drops in test accuracy. This is because deleting data points from one targeted class eventually leads to class imbalance, causing ML models to be less effective in classifying data points from that class. Therefore, to validate the performance of unlearning methods, one should test them on targeted distributions, where data deletions reduce the accuracy of the learned model.
In addition, we observed that the variance resulting from the selection of the deleted class was low in all datasets apart from higgs. We postulate that this is because of the particular way that missing values were treated for this dataset: data points that had missing feature values disproportionately belonged to class 1. Therefore, this tended to cause a steeper drop in accuracy when data from class 1 was targeted. Moreover, between the two targeted distributions, we observed that targeted-informed led to quicker accuracy drops.
- 3.
Test and deleted-data accuracies are correlated.
We see across deletion fractions that the accuracy on the test dataset and the accuracy on the deleted data were highly correlated. Hence, the test accuracy Acctest, which can always be computed for a model on the test data, can be used as a good proxy for the accuracy on the deleted data, which may be impossible to compute after data deletion but is required in order to assess certifiability. This observation will be useful in deciding when to trigger a model retraining in the pipeline (
Section 7).
Additionally, we note that the rate at which the test and deleted-data accuracies dropped with respect to the deletion fraction varied with the dataset.
6. Experimental Evaluation
In this section, we demonstrate the trade-offs exhibited by the unlearning methods in terms of the qualities of interest (effectiveness, efficiency and certifiability) for different values of their parameters
τ and σ. For each dataset, we experimented with three volumes of deleted data points—
small,
medium and
large—measured as a fraction of the initial training data. The deletion volumes corresponded to a 1%, 5% and 10% drop in Acctest
for a fully retrained model when using a
targeted-informed deletion distribution, with class 0 as the deleted class (see
Figure 2). Here, we present the results for the
large volumes of deletion, which were 4493, 2000, 4500, 78,436, 100,000 and 990,000 deletions for the datasets in
Table 2, respectively. Furthermore, we grouped the datasets presented in
Table 2 into three categories (
low,
moderate and
high) based on their dimensionality. Please see
Appendix D for the results corresponding to the
small and
medium deletion volumes.
6.1. Efficiency vs. Certifiability
In this experiment, we studied how varying the efficiency parameter τ trades off certifiability and efficiency for a fixed noise parameter σ. The efficiency parameter τ for Fisher and Influence was the size of the unlearning mini-batch, which we varied relative to m, the volume of deleted data. For DeltaGrad, the efficiency parameter is the periodicity $T_0$ of the unlearning algorithm, which we also varied over a range of values. The noise parameter σ was set to the same fixed value for all methods, and we obtained the updated model and the fully retrained model for each unlearning method as described in Table 1.
The results are shown in
Figure 3a. For each plot in the figure, the
y-axis reports certifiability (
AccDis), and the
x-axis reports efficiency (speed-up). Different unlearning methods and values of
τ are indicated with different colors and markers, respectively, in the legend.
We observed two main trends. First, for the general trade-off between efficiency and certifiability, higher efficiency (i.e., a higher speed-up) was typically associated with lower certifiability (i.e., a higher AccDis) in the plots. Some discontinuity in the plotlines, especially for DeltaGrad, was largely due to the convergence criteria, particularly since DeltaGrad employs SGD not only for training but also for unlearning.
Second, the efficiency of
Influence and
Fisher had a roughly similar trend for each dataset. For the low-dimensional datasets,
Influence and
Fisher provided large speed-ups of nearly 200× and 50× for each dataset, respectively, at the largest setting of τ, while
DeltaGrad provided a speed-up of less than 1× (i.e., requiring more time than the fully retrained model). This was because when the dimensionality was low, the cost of computing the inverse Hessian matrix for
Influence and
Fisher (see
Section 3) was much lower compared with the cost of approximating a large number of SGD iterations for the
DeltaGrad method. Conversely, for the high-dimensional datasets,
Influence and
Fisher provided a smaller speed-up. When τ was decreased to its smallest setting, the efficiency reduced to 2.2× and 1× for cifar2 and 0.3× and 1.2× for epsilon, respectively, whereas DeltaGrad provided better speed-ups of 1.3× and 1.1×, respectively, along with comparable values of AccDis.
6.2. Efficiency vs. Effectiveness
In this experiment, we studied how varying the efficiency parameter τ trades off efficiency and effectiveness for a fixed noise parameter σ and volume of deleted data. The range of τ for each unlearning method was the same as in the previous experiment, and σ was kept fixed. In
Figure 3b, each plot reports effectiveness as the test accuracy error
AccErr and efficiency as the speed-up in running time. We observed the following trends. First, there was the general trade-off: higher efficiency (i.e., a higher speed-up) was associated with slightly higher accuracy error (i.e.,
AccErr) for each method. Furthermore, for
Influence, decreasing
τ in the unlearning algorithm led to lower test accuracy error because the noise was injected only in the training algorithm (see
Table 1).
Second, Influence offered the best efficiency and effectiveness trade-off among all the methods. Especially for the high-dimensional datasets, the highest efficiency offered was 20× and 2.5× compared with 0.4× and 1.3× for Fisher, respectively, at a slightly larger test accuracy error. For the low-dimensional datasets, Influence and Fisher offered similar efficiency while the former had a lower AccErr. Lastly, for the moderate dimensional datasets, the largest efficiency Influence offered was 168× and 29× compared with 9× and 8.5× for Fisher, respectively, at a lower test accuracy error.
Third, we saw that DeltaGrad was mostly stable in terms of both efficiency and effectiveness. However, note that its test accuracy error for all datasets was larger compared with the other methods due to the direct noise injection, and hence it offered lower effectiveness even at favorable parameter settings.
6.3. Effectiveness vs. Certifiability
In this experiment, we studied how varying the noise parameter σ traded off effectiveness and certifiability for a fixed efficiency parameter τ.
The efficiency parameter τ was set as follows: for Influence and Fisher, we set the size of the unlearning mini-batch to a fixed value, and for DeltaGrad, we set the periodicity $T_0$ to its largest value
. For different values of the noise parameter
σ, we obtained the updated models corresponding to each unlearning method as described in Section 3. For the baselines, we first obtained the fully retrained model at the same σ to measure certifiability and a second fully retrained model without injected noise to measure effectiveness, as per
Section 4.3.
The results are shown in
Figure 4, where for each plot the left
y-axis reports the certifiability (
AccDis), and the right
y-axis reports effectiveness (
AccErr) as σ is varied over its full range for the different unlearning methods. We observed the trade-off between effectiveness and certifiability: higher effectiveness (lower
AccErr) was typically associated with lower certifiability (higher
AccDis).
Another clear observation is that for the Influence method, the test accuracy error AccErr increased only at higher values of σ (≥10). Moreover, we saw that its largest AccErr was lower than that for other methods across all datasets; for example, in the mnist dataset, the maximum AccErr (attained at the largest value of σ) was the smallest for Influence among the three methods. At the same time, however, improved certifiability (i.e., decreased AccDis) was achieved for high values of σ. Therefore, to obtain a good combination of effectiveness and certifiability, one must select higher values of σ based on the dataset.
Moreover, for
Fisher, at moderate values of σ, the trade-off between AccErr and AccDis was the best among all methods, with both quantities remaining small across all datasets, as seen in
Figure 4. If a good effectiveness and certifiability trade-off is required, then the
Fisher method appears to be a very suitable choice.
Note that because Influence and Fisher shared the same efficiency parameter setting, their results in this section are directly comparable. However, that is not the case with DeltaGrad. As we saw earlier in this section, DeltaGrad was typically much slower than the other methods. Therefore, for these experiments, we used it with the largest value of its periodicity $T_0$ so that its running time was small and closer to the running time of the other two methods (being, in fact, comparable for the high-dimensional datasets).
7. Online Strategy for Retraining
When the updated model
is obtained after data deletion (see
Figure 1), a decision is made on whether to employ the model for inference. Specifically, if the disparity
AccDis of the updated model is below a certain predetermined threshold for certifiability, then the model is employed for inference; otherwise, the pipeline restarts, training a new model on the remaining data. However, measuring
AccDis using Equation (6) would require the fully retrained model, which is not readily available. In fact, computing a fully retrained model after every batch of deletions would defeat the purpose of using an approximate unlearning method in the first place.
Therefore, in practice,
AccDis needs to be estimated. To this end, we propose an online estimation strategy based on the empirical observation (see
Table A5) that, as more data are deleted, the disparity
AccDis grows proportionally to the drop in test accuracy relative to the initial model:

$$\mathrm{AccDis} \;\propto\; \mathrm{Acc}_{test}(w^*) - \mathrm{Acc}_{test}(w^u),$$

where $\mathrm{Acc}_{test}(w^*)$ and $\mathrm{Acc}_{test}(w^u)$ are the test accuracies for the initial and updated model, respectively. In more detail, we measured the correlations between
AccDis and the drop in test accuracy for targeted-random deletions while varying the deletion fraction over a range of values (a different range for mnist) and utilizing the Fisher unlearning method. We observed that, apart from the higgs dataset, the Pearson correlation [36] for all other datasets was high, suggesting a strong correlation between AccDis and the drop in test accuracy. Note that this observation is related to the correlation of accuracy on the test and deleted data that we mentioned in
Section 5.
Building upon this observation, we estimated
AccDis as

$$\widehat{\mathrm{AccDis}} = c \cdot \left( \mathrm{Acc}_{test}(w^*) - \mathrm{Acc}_{test}(w^u) \right), \qquad (8)$$

where c is a constant proportion learned from the data before the pipeline starts as follows. First, we obtained the test accuracy $\mathrm{Acc}_{test}(w^*)$ for the model trained on the initial dataset. Second, we obtained an updated model $w^u$ and a fully retrained model $w^f$ for a large deletion fraction. Third, the proportion c was calculated as the ratio of the observed disparity to the observed drop in test accuracy:

$$c = \frac{\mathrm{AccDis}(w^u, w^f)}{\mathrm{Acc}_{test}(w^*) - \mathrm{Acc}_{test}(w^u)}.$$
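A compact sketch of this estimation strategy, assuming binary labels and hypothetical helper names: the proportion c is calibrated once (which requires one full retraining), after which the disparity is estimated from test accuracies alone.

```python
import numpy as np

def accuracy(w, X, y):
    """Fraction of points a linear model w classifies correctly (binary labels in {0, 1})."""
    return float(np.mean((X @ w > 0).astype(int) == y))

def calibrate_c(acc_test_initial, w_updated, w_retrained, X_test, y_test, X_del, y_del):
    """Calibration: ratio of the observed disparity on the deleted data to the
    observed drop in test accuracy, for one large calibration deletion."""
    acc_dis = abs(accuracy(w_updated, X_del, y_del) - accuracy(w_retrained, X_del, y_del))
    drop = acc_test_initial - accuracy(w_updated, X_test, y_test)
    return acc_dis / max(drop, 1e-12)  # guard against a zero accuracy drop

def estimated_acc_dis(c, acc_test_initial, w_updated, X_test, y_test):
    """Online estimate of AccDis (Equation (8)): c times the current drop in test
    accuracy, computed without any full retraining."""
    return c * (acc_test_initial - accuracy(w_updated, X_test, y_test))
```

In a running pipeline, the estimate would simply be compared against the chosen threshold after every batch of deletions to decide whether to restart.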
An example run of the pipeline using the estimation strategy for a given threshold is shown in Figure 5. At each timestep, batches of m data points from a targeted-informed distribution were chosen for deletion. Then, the Fisher unlearning algorithm (see Table 1) was used, with the efficiency parameter fixed, to obtain an updated model $w^u$. Next, the test accuracy $\mathrm{Acc}_{test}(w^u)$ was measured on the test dataset, and the certifiability disparity was estimated using Equation (8). If the estimated AccDis exceeds the threshold, then the pipeline restarts; a new initial model is trained from scratch on the remaining data, and $\mathrm{Acc}_{test}(w^*)$ is updated, while c remains unchanged. The dotted green lines in
Figure 5 correspond to the times when the pipeline restarted. Then, the pipeline resumes receiving further batches of deletions until the next restart is determined by the estimation strategy or the pipeline ends.
To evaluate the strategy, we selected three thresholds for each dataset, including 50 for higgs and 5 for the other datasets. Next, the number of deletions m per batch was 100 for the mnistb, mnist and cifar2 datasets and larger values, up to 10,000, for the epsilon, covtype and higgs datasets. In
Figure 5, we observe that the initial estimate of AccDis was larger than the true AccDis, and as more data were deleted, the true AccDis exceeded the threshold, leading to errors in the estimation. We define the relative percentage estimation error of the true AccDis for a given threshold at a given time as

$$\mathrm{Err} = \frac{\left|\,\widehat{\mathrm{AccDis}} - \mathrm{AccDis}\,\right|}{\mathrm{AccDis}} \times 100,$$

where the true AccDis is computed by obtaining a fully retrained model on the remaining data at that time. For each individual run of the pipeline, we computed the mean Err across the duration of the pipeline (one such mean is obtained for the run in Figure 5). Then, we collected multiple individual runs corresponding to different random seeds (six or two, depending on the deletion distribution) and computed the pooled mean (i.e., the mean of the mean Err for each threshold).
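For completeness, a small sketch of how these summary numbers could be aggregated, assuming per-timestep lists of estimated and true disparities for each run (variable names are ours):

```python
import numpy as np

def relative_error_percent(estimated, true):
    """Relative percentage estimation error of the true AccDis at one timestep."""
    return abs(estimated - true) / true * 100.0

def mean_run_error(estimates, truths):
    """Mean estimation error across the duration of one pipeline run."""
    return float(np.mean([relative_error_percent(e, t) for e, t in zip(estimates, truths)]))

def pooled_mean_error(runs):
    """Pooled mean: the mean of the per-run mean errors over runs with different seeds;
    `runs` is a list of (estimates, truths) pairs."""
    return float(np.mean([mean_run_error(e, t) for e, t in runs]))
```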
In
Table 3, we report the pooled mean of Err for different datasets and deletion distributions at different thresholds. We observe that the pooled mean Err for all thresholds remained low. The larger estimation errors for the higgs dataset were a result of the lower Pearson correlation discussed earlier. However, the Spearman correlation [36] was high, indicating that the estimation error may have been reduced by using a non-linear estimation strategy. Note that the pooled mean Err was larger for the targeted-informed distribution, indicating that the AccDis estimated using the proportion c computed from targeted-random deletions tended to underestimate the true disparity for the more adversarial targeted-informed deletions. Therefore, based on Table 3, for a pipeline to be certifiable at a given threshold, the estimation strategy should be employed with a stricter (lower) threshold, ensuring a large buffer for possible estimation errors.
In
Figure 6, we report the speed-up of the pipeline with respect to retraining at every timestep for different datasets and deletion distributions at different thresholds
. The observed drops in the speed-up for the
targeted-informed distribution correspond to the restarts of the pipeline triggered by the estimation strategy. Notice how smaller thresholds resulted in more frequent restarts and therefore larger drops in the speed-up for
targeted-informed deletions. Furthermore, the estimation strategy was adaptive; in the less adversarial
uniform-random distribution, fewer restarts were triggered, thereby resulting in larger speed-ups.
In summary, we found that the proposed strategy to estimate the disparity and restart the pipeline, albeit a heuristic, provided a significant speed-up in running time. Furthermore, as evidenced by our experiments, the strategy performed well for different distributions of deletions and could effectively be used to ensure that the pipeline was certifiable at a given disparity threshold. The design of more sophisticated strategies to better estimate the disparity is left for future work.