ABCDP: Approximate Bayesian Computation with Differential Privacy

We developed a novel approximate Bayesian computation (ABC) framework, ABCDP, which produces differentially private (DP) and approximate posterior samples. Our framework takes advantage of the sparse vector technique (SVT), which is widely studied in the differential privacy literature. SVT incurs a privacy cost only when a condition (whether a quantity of interest is above/below a threshold) is met. If the condition is met only sparsely during the repeated queries, SVT can drastically reduce the cumulative privacy loss, unlike the usual case where every query incurs a privacy loss. In ABC, the quantity of interest is the distance between observed and simulated data, and only when the distance falls below a threshold do we take the corresponding prior sample as a posterior sample. Hence, applying SVT to ABC is an organic way to transform an ABC algorithm into a privacy-preserving variant with minimal modification, while yielding posterior samples with a high privacy level. We theoretically analyze the interplay between the noise added for privacy and the accuracy of the posterior samples, and we apply ABCDP to several data simulators to show the efficacy of the proposed framework.


Introduction
Approximate Bayesian computation (ABC) aims to identify the posterior distribution over simulator parameters. The posterior distribution is of interest as it provides a mechanistic understanding of the stochastic procedure that directly generates data in many areas, such as climate and weather, ecology, cosmology, and bioinformatics [1][2][3][4].
Under these complex models, directly evaluating the likelihood of the data given the parameters is often intractable. ABC instead resorts to an approximation of the likelihood function using simulated data that are similar to the actual observations.
In the simplest form of ABC, called rejection ABC [5], we proceed by sampling multiple model parameters from a prior distribution π: θ_1, θ_2, . . . ∼ π. For each θ_t, a pseudo-dataset Y_t is generated from a simulator (the forward sampler associated with the intractable likelihood P(y|θ)). The parameter θ_t for which the generated Y_t is similar to the observed Y*, as decided by ρ(Y_t, Y*) < ε_abc, is accepted. Here, ρ is a notion of distance, for instance, the L2 distance between Y_t and Y* in terms of a pre-chosen summary statistic. Whether the distance counts as small or large is determined by the similarity threshold ε_abc. The result is samples {θ_t}_{t=1}^M from a distribution P̃(θ|Y*) ∝ π(θ)P̃(Y*|θ), where P̃(Y*|θ) = ∫_{B_ε(Y*)} P(Y|θ)dY and B_ε(Y*) = {Y : ρ(Y, Y*) < ε_abc}. As the likelihood computation is approximate, so is the posterior distribution; hence, this framework is named approximate Bayesian computation, as we do not compute the likelihood of the data explicitly.
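The rejection-ABC loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the names `prior_sample`, `simulate`, and `rho` stand in for a user-supplied prior sampler, simulator, and distance:

```python
import numpy as np

def rejection_abc(y_obs, prior_sample, simulate, rho, eps_abc, num_draws):
    """Basic rejection ABC: keep theta_t whenever rho(Y_t, Y*) < eps_abc."""
    accepted = []
    for _ in range(num_draws):
        theta = prior_sample()           # theta_t ~ pi
        y_sim = simulate(theta)          # Y_t ~ P(y | theta_t)
        if rho(y_sim, y_obs) < eps_abc:  # similarity check
            accepted.append(theta)
    return accepted
```

For instance, to infer the mean of a Gaussian, one could pass a uniform prior sampler, a Gaussian simulator, and the absolute difference of sample means as `rho`.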
Most ABC algorithms evaluate data similarity in terms of summary statistics computed by aggregating individual datapoints [6][7][8][9][10][11]. However, this seemingly innocuous similarity check can pose a privacy threat, as aggregated statistics can still reveal an individual's participation in the dataset when combined with other publicly available datasets (see [12,13]). In addition, in some studies the actual observations are privacy-sensitive in nature, e.g., genotype data for estimating tuberculosis transmission parameters [14]. Hence, it is necessary to privatize the similarity-check step in ABC algorithms.
In this light, we introduce an ABC framework that obeys the notion of differential privacy. The differential privacy definition provides a way to quantify the amount of information that the distance computed on the privacy-sensitive data contains, whether or not a single individual's data are included (or modified) in the data [15]. Differential privacy also provides rigorous privacy guarantees in the presence of arbitrary side information such as similar public data.
A common form of applying DP to an algorithm is adding noise to its outputs, called output perturbation [16]. In the case of ABC, we found that adding noise to the distance computed on the real observations and pseudo-data suffices for the privacy guarantee of the resulting posterior samples. However, if we simply add noise to the distance in every ABC inference step, this DP-ABC inference faces an additional challenge due to the repeated use of the real observations: the composition property of differential privacy states that the privacy level degrades over repeated uses of the data. To overcome this challenge, we adopt the sparse vector technique (SVT) [17] and apply it to the rejection ABC paradigm. The SVT outputs noisy answers to whether each query in a stream is above a certain threshold, and a privacy cost is incurred only for the at most c "above threshold" answers it outputs. This is a significant saving in privacy cost, as arbitrarily many "below threshold" answers are free of privacy cost.
We name our framework, which combines ABC with SVT, ABCDP (approximate Bayesian computation with differential privacy). Under ABCDP, we theoretically analyze the effect of the noise added to the distance on the resulting posterior samples and the subsequent posterior integrals. Putting these together, we summarize our main contributions:

1. We provide a novel ABC framework, ABCDP, which combines the sparse vector technique (SVT) [17] with the rejection ABC paradigm. The resulting ABCDP framework can improve the trade-off between the privacy and accuracy of the posterior samples, as the privacy cost under ABCDP is a function of the number of accepted posterior samples only.

2. We theoretically analyze ABCDP, focusing on the effect of the noisy posterior samples in terms of two quantities. The first quantity is the probability that an output of ABCDP differs from that of ABC at any given time during inference. The second quantity is the convergence rate, i.e., how fast the posterior integral using ABCDP's noisy samples approaches that using non-private ABC's samples. We write both quantities as functions of the noise added for privacy to better understand the characteristics of ABCDP.

3. We validate our theory in experiments using several simulators. The results of these experiments are consistent with our theoretical findings on the flip probability and the average error induced by the noise added for privacy.
Unlike other existing ABC frameworks that typically rely on a pre-specified set of summary statistics, we use a kernel-based distance metric called maximum mean discrepancy, following K2-ABC [18], to eliminate the need to pre-select a summary statistic. Using a kernel to measure similarity between two empirical distributions was also proposed in K-ABC [19], which formulates ABC as the problem of estimating a conditional mean embedding operator (induced by a kernel) mapping from summary statistics to the corresponding parameters. However, unlike our algorithm, K-ABC still relies on a particular choice of summary statistics. In addition, K-ABC is a soft-thresholding ABC algorithm, while ours is a rejection-ABC algorithm.
To avoid pre-selecting summary statistics, one could instead resort to methods that automatically or semi-automatically learn the best summary statistics given a dataset, and use the learned summary statistics in our ABCDP framework. An example is semi-auto ABC [6], where the authors suggest using the posterior mean of the parameters as a summary statistic. Another example is indirect-score ABC [20], where the authors suggest using an auxiliary model that determines a score vector as a summary statistic. However, the posterior mean of the parameters in semi-auto ABC, as well as the parameters of the auxiliary model in indirect-score ABC, need to be estimated. This estimation step can incur further privacy loss if the real data are used for it. Our ABCDP framework involves no such estimation step and is therefore more economical in terms of the privacy budget than semi-auto ABC and indirect-score ABC.

Background
We start by describing relevant background information.

Approximate Bayesian Computation
Given a set Y* of observations, rejection ABC [5] yields samples from an approximate posterior distribution by repeating the following three steps: (i) draw a parameter θ from the prior π; (ii) generate a pseudo-dataset Y from the simulator given θ; and (iii) accept θ if ρ(Y, Y*) < ε_abc. Here, the pseudo-dataset Y is compared with the observations Y* through ρ, a divergence measure between two datasets. Any distance metric can be used for ρ. For instance, one can use the L2 distance between two datasets in terms of a pre-chosen set of summary statistics, i.e., ρ(Y, Y*) = D(S(Y), S(Y*)), with an L2 distance measure D on the statistics computed by S. A statistically sounder choice for ρ is the maximum mean discrepancy (MMD, [21]), as used in [18]. Unlike a pre-chosen, finite-dimensional summary statistic typically used in ABC, MMD compares two distributions in terms of all possible moments of the random variables they describe. Hence, ABC frameworks using the MMD metric, such as [18], can avoid the problem of non-sufficiency of a chosen summary statistic that arises in many ABC methods. For this reason, we demonstrate our algorithm using the MMD metric in this paper; however, other metrics can be used, as we illustrate in our experiments.

Maximum Mean Discrepancy
Assume that the data Y ⊂ X and let k : X × X → R be a positive definite kernel. The MMD between two distributions P, Q is defined via MMD²(P, Q) = E_{x,x′∼P}[k(x, x′)] − 2E_{x∼P, y∼Q}[k(x, y)] + E_{y,y′∼Q}[k(y, y′)]. Following the convention in the kernel literature, we call MMD² simply MMD. The Moore-Aronszajn theorem states that there is a unique Hilbert space H on which k defines an inner product. As a result, there exists a feature map φ : X → H such that k(x, y) = ⟨φ(x), φ(y)⟩_H, where ⟨·, ·⟩_H denotes the inner product on H. The MMD in (5) can then be written as MMD²(P, Q) = ||E_{x∼P}[φ(x)] − E_{y∼Q}[φ(y)]||²_H, where E_{x∼P}[φ(x)] ∈ H is known as the (kernel) mean embedding of P and exists if E_{x∼P} k(x, x) < ∞ [22]. The MMD can thus be interpreted as the distance between the mean embeddings of the two distributions. If k is a characteristic kernel [23], then P ↦ E_{x∼P}[φ(x)] is injective, implying that MMD(P, Q) = 0 if and only if P = Q. When P, Q are observed through samples X_m = {x_i}_{i=1}^m ∼ P and Y_n = {y_i}_{i=1}^n ∼ Q, MMD can be estimated by empirical averages ([21], Equation (3)). When applied in the ABC setting, one input to MMD is the observed dataset Y* and the other is a pseudo-dataset Y_t ∼ p(·|θ_t) generated by the simulator given θ_t ∼ π(θ).
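The biased empirical estimate of MMD² with a Gaussian kernel can be sketched as follows. This is an illustrative implementation, not the paper's code; `l` denotes the kernel bandwidth:

```python
import numpy as np

def gaussian_kernel(X, Y, l):
    # k(x, y) = exp(-||x - y||^2 / (2 l^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * l ** 2))

def mmd2(X, Y, l=1.0):
    """Biased empirical estimate of MMD^2 between samples X ~ P and Y ~ Q:
    mean(K_XX) + mean(K_YY) - 2 * mean(K_XY)."""
    return (gaussian_kernel(X, X, l).mean()
            + gaussian_kernel(Y, Y, l).mean()
            - 2 * gaussian_kernel(X, Y, l).mean())
```

By construction, the estimate is exactly zero when the two samples coincide, and it grows as the two empirical distributions separate.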

Differential Privacy
An output from an algorithm that takes sensitive data as input will naturally contain some information about the sensitive data D. The goal of differential privacy is to augment such an algorithm so that useful information about the population is retained, while sensitive information, such as an individual's participation in the dataset, cannot be learned [17]. A common way to achieve these two seemingly paradoxical goals is to deliberately inject a controlled level of random noise into the to-be-released quantity. The modified procedure, known as a DP mechanism, gives a stochastic output due to the injected noise. In the DP framework, a higher level of noise provides a stronger privacy guarantee at the expense of less accurate population-level information derived from the released quantity; less noise added to the output thus reveals more about an individual's presence in the dataset.
More formally, given a mechanism M (a mechanism takes a dataset as input and produces stochastic outputs) and neighboring datasets D, D′ differing by a single entry (either by replacing one individual's datapoint with another, or by adding/removing a datapoint to/from D), the privacy loss of an outcome o is defined by L^(o) = log( P[M(D) = o] / P[M(D′) = o] ). The mechanism M is called ε-DP if and only if |L^(o)| ≤ ε for all possible outcomes o and for all possible neighboring datasets D, D′. The definition states that a single individual's participation in the data does not change the output probabilities by much; this limits the amount of information that the algorithm reveals about any one individual. A weaker, approximate version of the above notion is (ε, δ)-DP, which requires P[M(D) ∈ S] ≤ e^ε P[M(D′) ∈ S] + δ for all measurable sets S of outcomes, where δ is often called a failure probability, quantifying how often the DP guarantee of the mechanism fails. Output perturbation is a commonly used DP mechanism that ensures the outputs of an algorithm are differentially private. Suppose a deterministic function h : D → R^p computed on sensitive data D outputs a p-dimensional vector quantity. To make h private, we can add noise to the output of h, where the level of noise is calibrated to the global sensitivity [24], ∆_h, defined as the maximum difference ||h(D) − h(D′)||, in terms of some norm, over neighboring D and D′ (i.e., differing by one data sample).
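As a concrete sketch of output perturbation, the Laplace mechanism adds Lap(∆_h/ε) noise to each coordinate of h(D). The function below is a standard, illustrative construction, not the paper's code:

```python
import numpy as np

def laplace_mechanism(h_value, sensitivity, epsilon, rng):
    """Release h(D) + Lap(sensitivity / epsilon) noise per coordinate:
    an epsilon-DP output perturbation of the deterministic function h."""
    scale = sensitivity / epsilon
    return h_value + rng.laplace(0.0, scale, size=np.shape(h_value))
```

For example, releasing the mean of N values each bounded in [0, 1] has global sensitivity 1/N, so the added noise shrinks as the dataset grows.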
There are two important properties of differential privacy. First, the post-processing invariance property [24] tells us that the composition of any arbitrary data-independent mapping with an (ε, δ)-DP algorithm is also (ε, δ)-DP. Second, the composability theorem [24] states that the strength of the privacy guarantee degrades with repeated use of DP algorithms. Formally, given an ε_1-DP mechanism M_1 and an ε_2-DP mechanism M_2, the mechanism M(D) := (M_1(D), M_2(D)) is (ε_1 + ε_2)-DP. This composition is often called linear composition, under which the total privacy loss increases linearly with the number of repeated uses of DP algorithms. The strong composition ([17], Theorem 3.20) improves on linear composition, although the resulting DP guarantee becomes weaker (i.e., approximate (ε, δ)-DP). Recently, more refined methods further improve the privacy loss (e.g., [25]).
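To see the difference between the two composition regimes, one can compare the linearly composed budget k·ε with the strong-composition bound ε√(2k ln(1/δ)) + kε(e^ε − 1) from [17]. A small sketch (for small per-step ε and many steps k, the strong-composition total is smaller):

```python
import math

def linear_composition(eps, k):
    """k-fold linear composition of eps-DP mechanisms: pure (k * eps)-DP."""
    return k * eps

def strong_composition(eps, k, delta):
    """Strong composition bound ([17], Theorem 3.20): the k-fold composition
    of eps-DP mechanisms is (eps', delta)-DP with eps' as below."""
    return math.sqrt(2 * k * math.log(1 / delta)) * eps + k * eps * math.expm1(eps)
```

For instance, with ε = 0.1 per step, k = 100 steps, and δ = 10⁻⁵, strong composition gives a total ε of roughly 5.9 instead of the linear 10, at the cost of the δ failure probability.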

AboveThreshold and Sparse Vector Technique
Among the DP mechanisms, we utilize AboveThreshold and the sparse vector technique (SVT) [17] to make the rejection ABC algorithm differentially private. AboveThreshold outputs 1 when a query value exceeds a pre-defined threshold and 0 otherwise. This resembles rejection ABC, where the output is 1 when the distance is less than a chosen threshold. To ensure the output is differentially private, AboveThreshold adds noise to both the threshold and the query value. We take the same route as AboveThreshold to make our ABCDP outputs differentially private. The sparse vector technique (SVT) consists of c calls to AboveThreshold, where c in our case determines how many posterior samples ABCDP releases.
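A sketch of AboveThreshold following the standard presentation in [17] may clarify the mechanics; the noise scales 2∆/ε on the threshold and 4∆/ε on the queries are the textbook calibration, and the code is an illustration rather than the exact algorithm used later in this paper:

```python
import numpy as np

def above_threshold(queries, data, threshold, sensitivity, epsilon, rng):
    """AboveThreshold: answer a stream of sensitivity-bounded queries on
    `data`, paying the privacy budget only for the single 'above' answer."""
    t_hat = threshold + rng.laplace(0.0, 2 * sensitivity / epsilon)
    answers = []
    for q in queries:
        nu = rng.laplace(0.0, 4 * sensitivity / epsilon)
        if q(data) + nu >= t_hat:
            answers.append(1)
            break                 # the above-threshold budget is spent; stop
        answers.append(0)         # 'below' answers are free of privacy cost
    return answers
```

SVT then amounts to re-running this loop up to c times, which is exactly the structure ABCDP exploits (with the comparison direction flipped, since rejection ABC accepts when the distance falls *below* the threshold).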
Before presenting our ABCDP framework, we first describe the privacy setup we consider in this paper.

Problem Formulation
We assume a data owner who owns sensitive data Y * and is willing to contribute to the posterior inference.
We also assume a modeler who aims to learn the posterior distribution of the parameters of a simulator. Our ABCDP algorithm proceeds in two steps:

1. Non-private step: the modeler draws a parameter sample θ_t ∼ π(θ) and then generates a pseudo-dataset Y_t ∼ P(y|θ_t), for t = 1, . . . , T with a large T. We assume these parameter-pseudo-data pairs {(θ_t, Y_t)}_{t=1}^T are publicly available (even to an adversary).

2. Private step: the data owner takes the whole sequence of parameter-pseudo-data pairs {(θ_t, Y_t)}_{t=1}^T and runs our ABCDP algorithm to output a set of differentially private binary indicators determining whether or not to accept each θ_t.
Note that T is the maximum number of publicly available parameter-pseudo-data pairs. We run our algorithm for at most T steps, but it can terminate as soon as c posterior samples have been accepted; generally, c ≪ T. The details are introduced next.

ABCDP
Recall that the only place where the real data Y* appear in the ABC algorithm is when we judge whether the simulated data are similar to the real data, i.e., in (4). Our method hence adds noise to this step. To take advantage of the privacy analysis of SVT, we add noise to both the ABC threshold and the ABC distance. Consequently, we introduce two perturbation steps.
Before we introduce them, we describe the global sensitivity ∆_ρ of the distance, as this quantity tunes the amount of noise we add in the two perturbation steps; the sensitivity bound is stated in Lemma 1, and a proof is given in Appendix B. For ρ = MMD using a Gaussian kernel, k(x, y) = exp(−||x − y||²/(2l²)), where l > 0 is the bandwidth of the kernel, the kernel bound satisfies B_k = 1 for any l > 0. Now, we introduce the two perturbation steps used in our algorithm, summarized in Algorithm 1.

Algorithm 1 Proposed c-sample ABCDP
Require: observations Y*, number of accepted posterior samples c, privacy tolerance ε_total, ABC threshold ε_abc, distance ρ, parameter-pseudo-data pairs {(θ_t, Y_t)}_{t=1}^T, and option RESAMPLE.
Step 1: Noise for privatizing the ABC threshold: ε̂_abc = ε_abc + m_t, where m_t ∼ Lap(b), i.e., drawn from the zero-mean Laplace distribution with scale parameter b.
Step 2: Noise for privatizing the distance: ρ̂_t = ρ(Y_t, Y*) + ν_t, where ν_t ∼ Lap(2b). Due to these perturbations, Algorithm 1 runs with the privatized threshold and distance. By setting RESAMPLE to false or true, we can choose to perturb the threshold only once, or afresh every time we output 1. After outputting c 1's, the algorithm terminates. How do we calculate the resulting privacy loss under the different options?
We formally state the relationship between the noise scale b and the final privacy loss ε_total for the Laplace noise in Theorem 1.
Theorem 1 (Algorithm 1 is ε_total-DP). For any neighboring datasets Y*, Y*′ of size N and any dataset Y, assume that ρ has bounded sensitivity ∆_ρ; then, with b calibrated to ε_total, c, and ∆_ρ, Algorithm 1 is ε_total-DP. A proof is given in Appendix A. The proof uses linear composition, i.e., the privacy level degrades linearly with c. Using the strong composition or more advanced compositions could reduce the resulting privacy loss, but these compositions turn pure DP into a weaker, approximate DP; in this paper, we focus on pure DP. For the case RESAMPLE = True, the proof directly follows the proof of the standard SVT algorithm under linear composition [17], except that we utilize the quantity representing the minimum noisy value of any query evaluated on Y*, as opposed to the maximum utilized in SVT. For the case RESAMPLE = False, the proof follows the proof of Algorithm 1 in [26].
Note that the DP analysis in Theorem 1 holds for other types of distance metrics and is not limited to MMD, as long as the chosen metric has a bounded sensitivity ∆_ρ. When the sensitivity is not bounded, one could impose a clipping bound C by using min[ρ(Y_t, Y*), C] as the distance, so that the resulting distance between any pseudo-dataset Y_t and Y* cannot exceed the clipping bound, even after modifying one datapoint in Y*. In fact, we use this trick in our experiments when there is no bounded sensitivity.
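Putting the two perturbation steps together, the c-sample loop of Algorithm 1 can be sketched roughly as follows. This is an illustrative sketch only: the Laplace scale `b` is passed in directly, whereas Theorem 1 calibrates it from ε_total, c, and ∆_ρ, and the optional `clip` argument implements the clipping trick just described:

```python
import numpy as np

def abcdp(pairs, y_obs, rho, eps_abc, b, c, clip=None, resample=True, rng=None):
    """Rough sketch of Algorithm 1 (ABCDP). `pairs` is the public list of
    (theta_t, Y_t); `b` is the Laplace scale, assumed pre-calibrated."""
    rng = rng or np.random.default_rng()
    eps_hat = eps_abc + rng.laplace(0.0, b)         # Step 1: noisy threshold
    indicators, accepted = [], []
    for theta, y_sim in pairs:
        dist = rho(y_sim, y_obs)
        if clip is not None:
            dist = min(dist, clip)                  # bound the sensitivity
        rho_hat = dist + rng.laplace(0.0, 2 * b)    # Step 2: noisy distance
        if rho_hat < eps_hat:                       # noisy accept
            indicators.append(1)
            accepted.append(theta)
            if len(accepted) == c:
                break                               # c samples released: stop
            if resample:
                eps_hat = eps_abc + rng.laplace(0.0, b)
        else:
            indicators.append(0)
    return indicators, accepted
```

Only the binary indicators touch the privacy budget; the accepted θ_t themselves are already public, which is what makes the released posterior samples differentially private.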

Effect of Noise Added to ABC
Here, we analyze the effect of the noise added to ABC. In particular, we are interested in the probability that the output of ABCDP differs from that of ABC, P[τ̂_t ≠ τ_t | τ_t], at any given time t. To compute this probability, we first derive the probability density function (PDF) of the random variable m_t − ν_t in the following lemma.
Lemma 2. Let m_t ∼ Lap(b) and ν_t ∼ Lap(2b). Their difference Z = m_t − ν_t is another random variable, whose PDF is given by f_Z(z) = (1/(6b)) [2 exp(−|z|/(2b)) − exp(−|z|/b)]. Furthermore, for a ≥ 0, G_b(a) := P[Z ≥ a] = (1/6) [4 exp(−a/(2b)) − exp(−a/b)]. See Appendix C for the proof. Using this PDF, we now provide the following proposition. Proposition 1. Denote the output of Algorithm 1 at time t by τ̂_t ∈ {0, 1} and the output of ABC by τ_t ∈ {0, 1}. The flip probability, i.e., the probability that the outputs of ABCDP and ABC differ given the output of ABC, is given by P[τ̂_t ≠ τ_t | τ_t] = G_b(|ρ(Y_t, Y*) − ε_abc|). See Appendix D for the proof. To provide an intuition for Proposition 1, we visualize the flip probability in Figure 1. The flip probability provides a guideline for choosing the accepted sample size c given the data size N and the desired privacy level ε_total. For instance, if a given dataset is extremely small, e.g., containing datapoints on the order of 10, c has to be chosen such that the flip probability of each posterior sample remains low for a given privacy guarantee (ε_total). If a higher number of posterior samples is needed, then one has to relax the desired privacy level for the posterior samples of ABCDP to remain similar to those of ABC. Otherwise, with a small ε_total and a large c, the accepted posterior samples will be poor. On the other hand, if the dataset is bigger, a larger c can be taken at a reasonable level of privacy.
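The tail probability of Z = m_t − ν_t, with m_t ∼ Lap(b) and ν_t ∼ Lap(2b), can be obtained in closed form by convolving the two Laplace densities. The snippet below states this closed form, reconstructed here from the stated noise scales (so treat it as a sketch), and checks it by Monte Carlo:

```python
import numpy as np

def tail_Z(a, b):
    # P[Z >= a] for Z = m - nu, m ~ Lap(b), nu ~ Lap(2b), with a >= 0;
    # obtained by integrating the convolution of the two Laplace densities.
    return (4 * np.exp(-a / (2 * b)) - np.exp(-a / b)) / 6

rng = np.random.default_rng(0)
b = 1.0
z = rng.laplace(0.0, b, 1_000_000) - rng.laplace(0.0, 2 * b, 1_000_000)

print(tail_Z(1.0, b))      # analytic value, approx 0.3430
print((z >= 1.0).mean())   # Monte Carlo estimate; agrees to a few decimals
```

At a = 0 the tail is exactly 1/2, as it must be for a symmetric density; the gap |ρ(Y_t, Y*) − ε_abc| plugged into this tail is what drives the flip probability down as the distance moves away from the threshold.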

Convergence of Posterior Expectation of Rejection-ABCDP to Rejection-ABC.
The flip probability studied in Section 4.1 only accounts for the effect of noise on a single output of ABCDP. Building on this result, we analyze the discrepancy between the posterior expectations derived from ABCDP and from rejection ABC. This analysis requires quantifying the effect of the noise on the whole sequence of ABCDP outputs. The result is presented in Theorem 2.
Theorem 2 contains three statements. The first states that the expected error between the two posterior expectations of an arbitrary function f is bounded by a constant factor of the sum of the flip probability in each rejection/acceptance step. As we have seen in Section 4.1, the flip probability is determined by the scale parameter b of the Laplace distribution. Since b = O(1/N) (see Theorem 1 and Lemma 1), it follows that the expected error decays as N increases, giving the second statement.
The third statement gives a probabilistic bound on the error, guaranteeing that the error decays exponentially in N. Our proof relies on establishing an upper bound on the error as a function of the total number of flips, ∑_{t=1}^T |τ̂_t − τ_t|, which is a random variable; bounding the error of interest then amounts to characterizing the tail behavior of this quantity. Observe that in Theorem 2 we consider ABCDP and rejection ABC with the same computational budget, i.e., the same total number of iterations T. However, the number of accepted samples may differ in each case (c for ABCDP and c′ for rejection ABC). The fact that the number of samples ABCDP actually accepts within T iterations is itself random due to the injected noise presents its own technical challenge in the proof. Our proof can be found in Appendix E.

Related Work
Combining DP with ABC is relatively novel. The only related work is [27], which states that a rejection ABC algorithm produces posterior samples from the exact posterior distribution given perturbed data, when the kernel and bandwidth of rejection ABC are chosen in line with the data perturbation mechanism. The focus of [27] is to identify the condition when the posterior becomes exact in terms of the kernel and bandwidth of the kernel through the lens of data perturbation. On the other hand, we use the sparse vector technique to reduce the total privacy loss. The resulting theoretical studies including the flip probability and the error bound on the posterior expectation are new.

Toy Examples
We start by investigating the interplay between ε_abc and ε_total on a synthetic dataset where the ground-truth parameters are known. Following [18], we consider a symmetric Dirichlet prior, π(θ) = Dirichlet(θ; 1), and a likelihood p(y|θ) given by a mixture of uniform distributions (Equation (11)). The vector of mixing proportions is our model parameter θ, with ground truth θ* = [0.25, 0.04, 0.33, 0.04, 0.34] (see Figure 2). The goal is to estimate E[θ|Y*], where Y* is generated with θ*. We first generated 5000 samples for Y* drawn from (11) with the true parameters θ*. Then, we tested our two ABCDP variants with varying ε_abc and ε_total. In these experiments, we set ρ = MMD with a Gaussian kernel, whose bandwidth we set using the median heuristic computed on the simulated data (i.e., we did not use the real data for this, hence there is no privacy violation in this regard).
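A minimal sketch of this toy setup follows; the bin edges of the uniform components (component k uniform on [k, k + 1)) are our own illustrative choice, not necessarily those of [18]:

```python
import numpy as np

def simulate_mixture(theta, n, rng):
    """Mixture of uniforms with mixing proportions theta; component k is
    taken to be uniform on [k, k + 1) (illustrative bin edges)."""
    comps = rng.choice(len(theta), size=n, p=theta)
    return comps + rng.uniform(0.0, 1.0, size=n)

rng = np.random.default_rng(0)
theta_star = np.array([0.25, 0.04, 0.33, 0.04, 0.34])  # ground truth
y_obs = simulate_mixture(theta_star, 5000, rng)        # Y* with 5000 samples
theta_prior = rng.dirichlet(np.ones(5))                # symmetric Dirichlet draw
```

Each prior draw `theta_prior` would then be fed through `simulate_mixture` to produce the pseudo-dataset Y_t compared against `y_obs` via MMD.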
We drew 5000 pseudo-samples for Y_t at each time step and tested various settings, as shown in Figure 3.

Coronavirus Outbreak Data
In this experiment, we modelled the coronavirus outbreak in the Netherlands using a polynomial model with four parameters a_0, a_1, a_2, a_3, which we aimed to infer, where the daily case count is modelled by the cubic polynomial a_0 + a_1 t + a_2 t² + a_3 t³ in time t. The observed (https://www.ecdc.europa.eu/en/publications-data/download-todaysdata-geographic-distribution-COVID-19-cases-worldwide, accessed on 10 October 2020) data are the numbers of coronavirus cases from 27 February to 17 March 2020, which amounts to 18 datapoints (N = 18). This experiment raises privacy concerns, as each datapoint is a count of the individuals who tested COVID-positive on each day. The goal is to identify the approximate posterior distribution P̃(a_0, a_1, a_2, a_3 | y*) over these parameters, given a set of observations.
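A minimal sketch of the simulator is below. The Poisson observation noise and the clipping of negative means are our own illustrative assumptions; only the cubic trend, the 18 daily datapoints, and the N(0, 1) priors come from the text:

```python
import numpy as np

def polynomial_counts(a, t):
    """Cubic trend a0 + a1*t + a2*t^2 + a3*t^3 evaluated over days t."""
    a0, a1, a2, a3 = a
    return a0 + a1 * t + a2 * t ** 2 + a3 * t ** 3

rng = np.random.default_rng(0)
t = np.arange(18)                          # 18 daily observations (N = 18)
a_prior = rng.normal(0.0, 1.0, size=4)     # a_i ~ N(0, 1), as in the paper
mean = np.clip(polynomial_counts(a_prior, t), 0, None)
y_sim = rng.poisson(mean)                  # illustrative noisy daily counts
```

Each such `y_sim` plays the role of a pseudo-dataset Y_t, to be compared against the observed case counts through the chosen distance.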
Recalling from Figure 1 that a small dataset worsens the privacy-accuracy trade-off, we restricted the inference to a small number of posterior samples (we chose c = 5), since the number of datapoints in this dataset is extremely limited. We used the same prior distribution for each of the four parameters, a_i ∼ N(0, 1) for all i = 0, 1, 2, 3. We drew 50,000 samples from the Gaussian prior and ran our ABCDP algorithm with ε_total ∈ {13, 22, 44} and ε_abc = 0.1, as shown in Figure 4.

Modeling Tuberculosis (TB) Outbreak Using Stochastic Birth-Death Models
In this experiment, we used a stochastic birth-death model to model a tuberculosis (TB) outbreak. There are four parameters that we aim to infer, which enter the communicable-disease outbreak simulator as inputs: the burden rate β, the transmission rate t_1, and the reproductive numbers R_1 and R_2. The goal is to identify the approximate posterior distribution p̃(R_1, t_1, R_2, β | y*) over these parameters given a set of observations. Please refer to Section 3 in [28] for a description of the birth-death process of the model. We used the same prior distributions for the four parameters as in [28]: β ∼ N(200, 30), Unif(0.01, 30).
To illustrate the privacy-accuracy trade-off, we first generated two sets of observations y* (n = 100 and n = 1000) using some true model parameters (shown as black bars in Figure 5). We then tested our ABCDP algorithm at a privacy level of ε = 1. We used the summary statistics described in Table 1 of [28] and a weighted L2 distance as ρ, as done in [28], together with ε_abc = 150. Since the sensitivity is not bounded in this case, we imposed artificial boundedness by clipping the distance at C (we set C = 200) whenever it exceeds C.
As an error metric, we computed the mean absolute distance between each posterior mean and the true parameter values. The top row of Figure 5 shows that the mean of the prior (red) is far from the true value (black) that we chose. As we increase the data size from n = 100 (middle) to n = 1000 (bottom), the distance between the true values and the estimates shrinks, as reflected in the error dropping from 4.71 to 2.20 for RESAMPLE = True, and from 4.51 to 2.10 for RESAMPLE = False.

Summary and Discussion
We presented the ABCDP algorithm by combining DP with ABC. Our method outputs differentially private binary indicators, yielding differentially private posterior samples.
To analyze the proposed algorithm, we derived the probability that ABCDP's indicator flips relative to rejection ABC's indicator, as well as an average error bound on the posterior expectation.
We showed experimental results that output a relatively small number of posterior samples. This is because the cumulative privacy loss increases linearly with the number of posterior samples (i.e., c) that our algorithm outputs. For a large dataset (i.e., large N), one can still increase the number of posterior samples while providing a reasonable privacy guarantee. For a small dataset (i.e., small N), however, a more refined privacy composition (e.g., [29]) would be necessary to keep the cumulative privacy loss relatively small, at the expense of providing an approximate DP guarantee rather than the pure DP guarantee that ABCDP provides.
When we presented our work to the ABC community, we often received the question of whether ABCDP could be applied to other types of ABC algorithms, such as the sequential Monte Carlo algorithm, which outputs the significance of each proposal sample, as opposed to its acceptance or rejection as in the rejection ABC algorithm. Directly applying the current form of ABCDP to these algorithms is not possible. However, applying the Gaussian mechanism to the significance of each proposal sample can guarantee differential privacy for the output of the sequential Monte Carlo algorithm, although the cumulative privacy loss would then be relatively large, as it is a function of the number of proposal samples, whether or not they are taken as good posterior samples.
A natural by-product of ABCDP is differentially private synthetic data, as the simulator is a public tool that anybody can run and hence differentially private posterior samples suffice for differentially private synthetic data without any further privacy cost. Applying ABCDP to generate complex datasets is an intriguing future direction.
Author Contributions: M.P. and W.J. contributed to conceptualization and methodology development. M.P. and M.V. contributed to software development. All three authors contributed to writing the paper. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.
Note: Our code is available at https://github.com/ParkLabML/ABCDP.

Appendix A. Proof of Theorem 1

Proof of Theorem 1. Case I: RESAMPLE = True. Given any neighboring datasets Y* and Y*′ of size N and any dataset Y, assume that ρ has sensitivity ∆_ρ. Let A denote the random variable that represents the outputs of Algorithm 1 given ({(θ_t, Y_t)}_{t=1}^T, Y*, ρ, ε_abc, ε_total) and A′ the random variable that represents the outputs given ({(θ_t, Y_t)}_{t=1}^T, Y*′, ρ, ε_abc, ε_total). The output of the algorithm is some realization of these variables, τ ∈ {0, 1}^k, where 0 < k ≤ T and, for all t < k, τ_t = 0 and τ_k = 1. For the rest of the proof, we fix arbitrary values of ν_1, ..., ν_{k−1} and take probabilities over the randomness of ν_k and ε̂_abc. We define the deterministic quantity g(Y*) (with ν_1, ..., ν_{k−1} fixed) that represents the minimum noised value of the distance evaluated on any dataset Y*.
Let P[ε̂_abc = a] be the pdf of ε̂_abc evaluated at a, P[ν_k = v] the pdf of ν_k evaluated at v, and 1[x] the indicator function of the event x. We have: Now, we define the following variables: We know that, for each Y*, Y*′, ρ is ∆_ρ-sensitive and hence the quantity g(Y*) is ∆_ρ-sensitive as well. In this way, we obtain that |â − â′| ≤ ∆_ρ and |v − v′| ≤ 2∆_ρ. Applying these changes of variables, we have: where the inequality comes from the bounds considered throughout the proof (i.e., |â − â′| ≤ ∆_ρ and |v − v′| ≤ 2∆_ρ) and the form of the cdf of the Laplace distribution. Case II: RESAMPLE = False. In this case, the proof follows the proof of Algorithm 1 in [26], except that positive events in [26] become negative events for us and vice versa, since we test whether the value falls below a threshold, whereas [26] tests whether it falls above one.

Appendix B. Proof of Lemma 1
Proof of Lemma 1. We establish ∆_ρ when ρ is MMD. Recall that (Y*, Y*′) is a pair of neighboring datasets and Y is an arbitrary dataset. Without loss of generality, assume that Y* = {x_1, . . . , x_N} and Y*′ = {x′_1, . . . , x′_N} with x_i = x′_i for all i = 1, . . . , N − 1, and Y = {y_1, . . . , y_m}. We start with: where at (a) we use the reverse triangle inequality. Furthermore:

Appendix C. Proof of Lemma 2
Proof of Lemma 2. The PDF of Z = m_t − ν_t is computed from the convolution of the two PDFs f_{m_t}(x) = (1/(2b)) exp(−|x|/b) and f_{ν_t}(y) = (1/(4b)) exp(−|y|/(2b)). Carrying out the convolution separately for the cases z ≥ 0 and z < 0 and combining the two cases yields, more concisely, f_Z(z) = (1/(6b)) [2 exp(−|z|/(2b)) − exp(−|z|/b)].

Appendix D. Proof of Proposition 1
Proof of Proposition 1. Using the PDF of Z from Lemma 2, we compute the flip probability in the two cases. Writing ρ_t = ρ(Y_t, Y*), the noisy acceptance event ρ̂_t < ε̂_abc is equivalent to Z = m_t − ν_t > ρ_t − ε_abc. If ρ_t < ε_abc (so τ_t = 1), a flip occurs when Z ≤ ρ_t − ε_abc, which by the symmetry of Z has probability G_b(ε_abc − ρ_t). If ρ_t ≥ ε_abc (so τ_t = 0), a flip occurs when Z > ρ_t − ε_abc, which has probability G_b(ρ_t − ε_abc). The two cases can be combined with the use of an absolute value to give the result, P[τ̂_t ≠ τ_t | τ_t] = G_b(|ρ_t − ε_abc|).