1. Introduction
Approximate Bayesian computation (ABC) aims to identify the posterior distribution over simulator parameters. The posterior distribution is of interest because it provides a mechanistic understanding of the stochastic procedure that generates data in many areas, such as climate and weather, ecology, cosmology, and bioinformatics [1,2,3,4].
Under these complex models, directly evaluating the likelihood of the data given the parameters is often intractable. ABC resorts to an approximation of the likelihood function using simulated data that are similar to the actual observations.
In the simplest form of ABC, called rejection ABC [5], we proceed by sampling multiple model parameters from a prior distribution $\pi$: $\theta_1, \theta_2, \dots \sim \pi$. For each $\theta_t$, a pseudo-dataset $Y_t$ is generated from a simulator (the forward sampler associated with the intractable likelihood $\mathrm{P}(y \mid \theta)$). The parameter $\theta_t$ for which the generated $Y_t$ is similar to the observed $Y^*$, as decided by $\rho(Y_t, Y^*) < \epsilon_{abc}$, is accepted. Here, $\rho$ is a notion of distance, for instance, the L2 distance between $Y_t$ and $Y^*$ in terms of a pre-chosen summary statistic. Whether the distance is small or large is determined by $\epsilon_{abc}$, a similarity threshold. The result is samples $\{\theta_t\}_{t=1}^{M}$ from a distribution $\tilde{\mathrm{P}}_{\epsilon}(\theta \mid Y^*) \propto \pi(\theta)\, \tilde{\mathrm{P}}_{\epsilon}(Y^* \mid \theta)$, where $\tilde{\mathrm{P}}_{\epsilon}(Y^* \mid \theta) = \int_{B_{\epsilon}(Y^*)} \mathrm{P}(Y \mid \theta)\, dY$ and $B_{\epsilon}(Y^*) = \{Y : \rho(Y, Y^*) < \epsilon_{abc}\}$. As the likelihood computation is approximate, so is the posterior distribution. Hence, this framework is called approximate Bayesian computation, as we do not compute the likelihood of the data explicitly.
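For illustration, the rejection ABC procedure above can be sketched in a few lines. This is a minimal sketch with a hypothetical one-dimensional Gaussian simulator; the prior, the mean summary statistic, and the threshold value are illustrative choices, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, n, rng):
    # Hypothetical forward sampler P(. | theta): a unit-variance Gaussian.
    return rng.normal(theta, 1.0, size=n)

def rho(y, y_star):
    # L2 distance in terms of a pre-chosen summary statistic (here: the mean).
    return abs(y.mean() - y_star.mean())

def rejection_abc(y_star, prior_sample, simulator, rho, eps_abc, num_draws, rng):
    """Keep each theta_t whose pseudo-dataset Y_t lands within eps_abc of Y*."""
    accepted = []
    for _ in range(num_draws):
        theta = prior_sample(rng)                 # theta_t ~ pi
        y = simulator(theta, len(y_star), rng)    # Y_t ~ P(. | theta_t)
        if rho(y, y_star) < eps_abc:              # similarity check
            accepted.append(theta)
    return accepted

y_star = rng.normal(2.0, 1.0, size=200)           # "observed" data
samples = rejection_abc(y_star, lambda r: r.uniform(-5, 5),
                        simulator, rho, eps_abc=0.1, num_draws=2000, rng=rng)
```

The accepted `samples` concentrate around the parameter value that generated `y_star`, which is the sense in which they approximate draws from $\tilde{\mathrm{P}}_{\epsilon}(\theta \mid Y^*)$.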
Most ABC algorithms evaluate the data similarity in terms of summary statistics computed by aggregating individual datapoints [6,7,8,9,10,11]. However, this seemingly innocuous similarity check could pose a privacy threat, as aggregated statistics can still reveal an individual’s participation in the dataset when combined with other publicly available datasets (see [12,13]). In addition, in some studies, the actual observations are privacy-sensitive in nature, e.g., genotype data for estimating tuberculosis transmission parameters [14]. Hence, it is necessary to privatize the similarity-check step in ABC algorithms.
In this light, we introduce an ABC framework that obeys the notion of differential privacy (DP). The definition of differential privacy provides a way to quantify how much information the distance computed on the privacy-sensitive data reveals about whether a single individual’s record is included in (or modified within) the data [15]. Differential privacy also provides rigorous privacy guarantees in the presence of arbitrary side information, such as similar public data.
A common way of applying DP to an algorithm is to add noise to its outputs, called output perturbation [16]. In the case of ABC, we found that adding noise to the distance computed between the real observations and the pseudo-data suffices for the privacy guarantee of the resulting posterior samples. However, if we simply add noise to the distance in every ABC inference step, this DP-ABC inference faces an additional challenge due to the repeated use of the real observations. The composition property of differential privacy states that the privacy level degrades with repeated use of the data. To overcome this challenge, we adopt the sparse vector technique (SVT) [17] and apply it to the rejection ABC paradigm. The SVT outputs noisy answers to whether each query in a stream is above a certain threshold, and a privacy cost is incurred only for the at most c “above threshold” answers it outputs. This is a significant saving in privacy cost, as arbitrarily many “below threshold” answers are free of privacy cost.
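To make the mechanism concrete, here is a minimal sketch of an above-threshold SVT variant. The Laplace noise scales follow the common linear-composition analysis over the (at most) c “above” answers; the function name and exact scales are illustrative, not the precise construction used later in this paper:

```python
import numpy as np

def sparse_vector(queries, threshold, sensitivity, eps, c, rng):
    """Answer noisy 'above threshold?' questions for a stream of queries.

    A privacy cost is paid only for the at most c 'above' answers;
    arbitrarily many 'below' answers are free of privacy cost.
    """
    b = 2 * c * sensitivity / eps              # Laplace scale (linear composition)
    noisy_threshold = threshold + rng.laplace(0, b)
    answers, above = [], 0
    for q in queries:
        if q + rng.laplace(0, 2 * b) >= noisy_threshold:
            answers.append(True)
            above += 1
            if above >= c:                     # budget for 'above' answers exhausted
                break
        else:
            answers.append(False)
    return answers

rng = np.random.default_rng(0)
# With a large eps (tiny noise), only the queries above 5.0 answer True,
# and the stream stops after the c-th True.
ans = sparse_vector([0.0, 0.2, 9.0, 0.1, 8.0, 7.0],
                    threshold=5.0, sensitivity=1.0, eps=500.0, c=2, rng=rng)
```

Note that the stream terminates after the second `True`, so the sixth query is never answered; this early stopping is exactly what caps the privacy cost.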
We name our framework, which combines ABC with SVT, ABCDP (approximate Bayesian computation with differential privacy). Under ABCDP, we theoretically analyze the effect of the noise added to the distance on the resulting posterior samples and the subsequent posterior integrals. Putting these together, we summarize our main contributions:
We provide a novel ABC framework, ABCDP, which combines the
sparse vector technique (SVT) [
17] with the rejection ABC paradigm. The resulting ABCDP framework can improve the tradeoff between the privacy and accuracy of the posterior samples, as the privacy cost under ABCDP is a function of the number of
accepted posterior samples only.
We theoretically analyze ABCDP by focusing on the effect of noisy posterior samples in terms of two quantities. The first quantity is the probability that an output of ABCDP differs from that of ABC at any given time during inference. The second quantity is the convergence rate, i.e., how fast the posterior integral using ABCDP’s noisy samples approaches that using non-private ABC’s samples. We write both quantities as functions of the noise added for privacy to better understand the characteristics of ABCDP.
We validate our theory in the experiments using several simulators. The results of these experiments are consistent with our theoretical findings on the flip probability and the average error induced by the noise addition for privacy.
Unlike other existing ABC frameworks that typically rely on a pre-specified set of summary statistics, we use a kernel-based distance metric called maximum mean discrepancy (MMD), following K2-ABC [18], to eliminate the need to pre-select a summary statistic. Using a kernel to measure the similarity between two empirical distributions was also proposed in K-ABC [19]. K-ABC formulates ABC as the problem of estimating a conditional mean embedding operator (induced by a kernel) mapping from summary statistics to corresponding parameters. However, unlike our algorithm, K-ABC still relies on a particular choice of summary statistics. In addition, K-ABC is a soft-thresholding ABC algorithm, while ours is a rejection-ABC algorithm.
To avoid the need to pre-select summary statistics, one could resort to methods that automatically or semi-automatically learn the best summary statistics from a dataset and use the learned summary statistics in our ABCDP framework. An example is semi-automatic ABC [6], where the authors suggest using the posterior mean of the parameters as a summary statistic. Another example is indirect-score ABC [20], where the authors suggest using an auxiliary model that determines a score vector as a summary statistic. However, the posterior mean of the parameters in semi-automatic ABC, as well as the parameters of the auxiliary model in indirect-score ABC, need to be estimated. This estimation step can incur a further privacy loss if the real data are used. Our ABCDP framework does not involve such an estimation step and is more economical in terms of the privacy budget than semi-automatic ABC and indirect-score ABC.
4. ABCDP
Recall that the only place where the real data $Y^*$ appear in the ABC algorithm is when we judge whether the simulated data are similar to the real data, i.e., as in (4). Our method hence adds noise to this step. To take advantage of the privacy analysis of SVT, we add noise to both the ABC threshold and the ABC distance. Consequently, we introduce two perturbation steps.
Before we introduce them, we describe the global sensitivity of the distance, as this quantity tunes the amount of noise we add in the two perturbation steps. For $\rho(Y^*, Y) = \widehat{\mathrm{MMD}}(Y^*, Y)$ with a bounded kernel, the sensitivity of the distance is $\Delta_{\rho} = O(1/N)$, as shown in Lemma 1.
Lemma 1. ($\Delta_{\rho} = O(1/N)$ for MMD). Assume that $Y^*$ and each pseudo-dataset $Y_t$ have the same cardinality N. Set $\rho(Y^*, Y) = \widehat{\mathrm{MMD}}(Y^*, Y)$ with a kernel k bounded by $B_k > 0$, i.e., $\sup_{x,y \in \mathcal{X}} k(x,y) \le B_k < \infty$. Then $\Delta_{\rho} \le \frac{2\sqrt{B_k}}{N}$ and $\sup_{Y^*, Y} \rho(Y^*, Y) \le 2\sqrt{B_k}$. A proof is given in
Appendix B. For
$\rho = \widehat{\mathrm{MMD}}$ with a Gaussian kernel $k(x,y) = \exp\left(-\frac{\parallel x - y \parallel^2}{2 l^2}\right)$, where $l > 0$ is the bandwidth of the kernel, we have $B_k = 1$ for any $l > 0$.
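As a numerical sanity check on Lemma 1, the following sketch computes the biased (V-statistic) MMD estimate with a Gaussian kernel ($B_k = 1$) and verifies on one example that changing a single record moves the distance by at most a bound of order $1/N$. The constant $2\sqrt{B_k}$ in the check is our own derivation via the reverse triangle inequality for kernel mean embeddings, consistent with $\Delta_{\rho} = O(1/N)$; the toy data are illustrative:

```python
import numpy as np

def gaussian_kernel(x, y, l=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 l^2)); bounded by B_k = 1.
    d2 = (x[:, None] - y[None, :]) ** 2
    return np.exp(-d2 / (2 * l ** 2))

def mmd_hat(x, y, l=1.0):
    """Biased (V-statistic) MMD estimate between two 1-d samples."""
    kxx = gaussian_kernel(x, x, l).mean()
    kyy = gaussian_kernel(y, y, l).mean()
    kxy = gaussian_kernel(x, y, l).mean()
    return np.sqrt(max(kxx + kyy - 2 * kxy, 0.0))

# Empirical sensitivity check: swap one datapoint of Y* (a neighbouring
# dataset) and look at the change in the distance to a pseudo-dataset.
rng = np.random.default_rng(0)
N = 500
y_star = rng.normal(0, 1, N)
y_neigh = y_star.copy()
y_neigh[0] = 10.0                    # one record changed
y_pseudo = rng.normal(0.5, 1, N)

delta = abs(mmd_hat(y_star, y_pseudo) - mmd_hat(y_neigh, y_pseudo))
bound = 2 * np.sqrt(1.0) / N         # 2 sqrt(B_k) / N with B_k = 1
```

Because the V-statistic estimate equals the RKHS norm of the difference of empirical mean embeddings, the inequality `delta <= bound` holds exactly, not just approximately.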
Now, we introduce the two perturbation steps used in our algorithm summarized in Algorithm 1.
Algorithm 1 Proposed c-sample ABCDP
Require: Observations $Y^*$, number of accepted posterior samples c, privacy tolerance $\epsilon_{total}$, ABC threshold $\epsilon_{abc}$, distance $\rho$, parameter–pseudo-data pairs $\{(\mathbf{\theta}_t, Y_t)\}_{t=1}^{T}$, and option RESAMPLE.
Ensure: $\epsilon_{total}$-DP indicators $\{\tilde{\tau}_t\}_{t=1}^{T}$ for the corresponding samples $\{\mathbf{\theta}_t\}_{t=1}^{T}$
1: Calculate the noise scale b by Theorem 1.
2: Privatize the ABC threshold: $\widehat{\epsilon}_{abc} = \epsilon_{abc} + m_t$ via (7)
3: Set count = 0
4: for $t = 1, \dots, T$ do
5:  Privatize the distance: $\widehat{\rho}_t = \rho(Y^*, Y_t) + \nu_t$ via (8)
6:  if $\widehat{\rho}_t \le \widehat{\epsilon}_{abc}$ then
7:   Output $\tilde{\tau}_t = 1$
8:   count = count + 1
9:   if RESAMPLE then
10:    $\widehat{\epsilon}_{abc} = \epsilon_{abc} + m_t$ via (7)
11:   end if
12:  else
13:   Output $\tilde{\tau}_t = 0$
14:  end if
15:  if count ≥ c then
16:   Break the loop
17:  end if
18: end for
Step 1: Noise for privatizing the ABC threshold:
$\widehat{\epsilon}_{abc} = \epsilon_{abc} + m_t,$ (7)
where $m_t \sim Lap(b)$, i.e., $m_t$ is drawn from the zero-mean Laplace distribution with scale parameter b.
Step 2: Noise for privatizing the distance:
$\widehat{\rho}_t = \rho(Y^*, Y_t) + \nu_t,$ (8)
where $\nu_t \sim Lap(2b)$.
Due to these perturbations, Algorithm 1 runs with the privatized threshold and distance. We can choose to perturb the threshold only once, or each time we output a 1, by setting RESAMPLE to false or true, respectively. After outputting c 1’s, the algorithm terminates. How do we calculate the resulting privacy loss under the different options?
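Algorithm 1 can be sketched as follows. This is a minimal sketch: the noise-scale expressions are back-solved from the b values quoted in Theorem 2 and should be treated as an assumption of the sketch rather than a restatement of the paper's theorem; the toy inputs carry pre-computed distances for brevity:

```python
import numpy as np

def abcdp(y_star, pairs, rho, eps_abc, eps_total, c, delta_rho, resample, rng):
    """Sketch of c-sample ABCDP (Algorithm 1).

    Noise scale (assumption of this sketch, back-solved from Theorem 2):
      b = 2 c delta_rho / eps_total       if RESAMPLE = True,
      b = (c + 1) delta_rho / eps_total   if RESAMPLE = False.
    """
    b = (2 * c if resample else c + 1) * delta_rho / eps_total
    eps_hat = eps_abc + rng.laplace(0, b)         # privatized threshold, Eq. (7)
    indicators, count = [], 0
    for theta_t, y_t in pairs:
        rho_hat = rho(y_star, y_t) + rng.laplace(0, 2 * b)  # privatized distance, Eq. (8)
        if rho_hat <= eps_hat:
            indicators.append(1)
            count += 1
            if resample:
                eps_hat = eps_abc + rng.laplace(0, b)  # fresh threshold noise
        else:
            indicators.append(0)
        if count >= c:                            # c accepted samples: stop
            break
    return indicators

rng = np.random.default_rng(0)
pairs = [(0.4, 0.1), (1.7, 0.9), (0.5, 0.05), (0.3, 0.2)]  # (theta_t, precomputed distance)
inds = abcdp(None, pairs, lambda ys, d: d, eps_abc=0.5,
             eps_total=1e6, c=2, delta_rho=0.01, resample=False, rng=rng)
```

With a very large `eps_total` (negligible noise), the run accepts the first and third pairs and then stops, never touching the fourth, mirroring the early termination of the SVT.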
We formally state the relationship between the noise scale b and the final privacy loss $\epsilon_{total}$ for the Laplace noise in Theorem 1.
Theorem 1. (Algorithm 1 is $\epsilon_{total}$-DP). For any neighboring datasets $Y^*, Y^{*\prime}$ of size N and any dataset Y, assume that $\rho$ satisfies $0 < \sup_{(Y^*, Y^{*\prime}), Y} |\rho(Y^*, Y) - \rho(Y^{*\prime}, Y)| \le \Delta_{\rho} < \infty$. Algorithm 1 is $\epsilon_{total}$-DP, where $\epsilon_{total} = \frac{2 c \Delta_{\rho}}{b}$ if RESAMPLE = True and $\epsilon_{total} = \frac{(c+1) \Delta_{\rho}}{b}$ if RESAMPLE = False. A proof is given in
Appendix A. The proof uses linear composition, i.e., the privacy level degrades linearly with c. Using the strong composition or more advanced composition theorems can reduce the resulting privacy loss, but these compositions turn pure DP into the weaker, approximate DP. In this paper, we focus on pure DP. For the case RESAMPLE = True, the proof directly follows that of the standard SVT algorithm under linear composition [17], with the exception that we utilize the quantity representing the minimum noisy value of any query evaluated on $Y^*$, as opposed to the maximum utilized in SVT. For the case RESAMPLE = False, the proof follows that of Algorithm 1 in [26].
Note that the DP analysis in Theorem 1 holds for other distance metrics, not only MMD, as long as the chosen metric has a bounded sensitivity $\Delta_{\rho}$. When the sensitivity is unbounded, one can impose a clipping bound C on the distance by using $\mathrm{min}[\rho(Y_t, Y^*), C]$ instead, such that the resulting distance between any pseudo-dataset $Y_t$ and any $Y^{*\prime}$ obtained by modifying one datapoint of $Y^*$ cannot exceed the clipping bound. In fact, we use this trick in our experiments when the sensitivity is unbounded.
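The clipping trick can be written as a one-line wrapper (a sketch; the clipping bound C and the example distance are illustrative choices left to the practitioner):

```python
def clip_distance(rho, C):
    """Wrap a distance so its value, and hence its sensitivity, is at most C."""
    return lambda y_star, y: min(rho(y_star, y), C)

# Example: clip an unbounded absolute-difference distance at C = 1.0.
rho_c = clip_distance(lambda a, b: abs(a - b), 1.0)
```

Since $|\min(\rho_1, C) - \min(\rho_2, C)| \le C$ for any $\rho_1, \rho_2 \ge 0$, the clipped distance has sensitivity at most C regardless of the underlying metric.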
4.1. Effect of Noise Added to ABC
Here, we analyze the effect of the noise added to ABC. In particular, we are interested in the probability that the output of ABCDP differs from that of ABC, $\mathbb{P}[\tilde{\tau}_t \ne \tau_t \mid \tau_t]$, at any given time t. To compute this probability, we first derive the probability density function (PDF) of the random variable $m_t - \nu_t$ in the following lemma.
Lemma 2. Recall $m_t \sim Lap(b)$ and $\nu_t \sim Lap(2b)$. Their difference is another random variable $Z = m_t - \nu_t$, whose PDF is given by $f_Z(z) = \frac{1}{6b}\left[2\,\mathrm{exp}\left(-\frac{|z|}{2b}\right) - \mathrm{exp}\left(-\frac{|z|}{b}\right)\right]$. Furthermore, for $a \ge 0$, $G_b(a) := \int_a^{\infty} f_Z(z)\, \mathrm{d}z = \frac{1}{6}\left[4\,\mathrm{exp}\left(-\frac{a}{2b}\right) - \mathrm{exp}\left(-\frac{a}{b}\right)\right]$, and the CDF of Z is given by $F_Z(a) = H[a] + (1 - 2H[a])\, G_b(|a|)$, where $H[a]$ is the Heaviside step function. See
Appendix C for the proof. Using this PDF, we now provide the following proposition:
Proposition 1. Denote the output of Algorithm 1 at time t by $\tilde{\tau}_t \in \{0,1\}$ and the output of ABC by $\tau_t \in \{0,1\}$. The flip probability, i.e., the probability that the outputs of ABCDP and ABC differ given the output of ABC, is $P[\tilde{\tau}_t \ne \tau_t \mid \tau_t] = G_b(|\rho_t - \epsilon_{abc}|)$, where $G_b(a)$ is defined in Lemma 2 and $\rho_t := \rho(Y^*, Y_t)$.
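The closed form of $G_b$ in Lemma 2 can be checked numerically by sampling $Z = m_t - \nu_t$ directly (a quick Monte Carlo sanity check; the sample size and test points are arbitrary):

```python
import numpy as np

def G_b(a, b):
    # Upper-tail probability P[Z >= a] for Z = m - nu, m ~ Lap(b), nu ~ Lap(2b), a >= 0.
    return (4 * np.exp(-a / (2 * b)) - np.exp(-a / b)) / 6

rng = np.random.default_rng(0)
b = 0.5
z = rng.laplace(0, b, 1_000_000) - rng.laplace(0, 2 * b, 1_000_000)

# Monte Carlo estimates of P[Z >= a] at a few points
mc = {a: float((z >= a).mean()) for a in (0.0, 0.3, 1.0)}
```

At $a = 0$ the formula gives $G_b(0) = \frac{1}{6}(4 - 1) = \frac{1}{2}$, as it must by the symmetry of Z, and the Monte Carlo estimates agree with the closed form at every test point.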
To provide an intuition for Proposition 1, we visualize the flip probability in Figure 1. The flip probability provides a guideline for choosing the accepted sample size c given the dataset size N and the desired privacy level $\epsilon_{total}$. For instance, if a given dataset is extremely small, e.g., containing on the order of 10 datapoints, c has to be chosen such that the flip probability of each posterior sample remains low for a given privacy guarantee ($\epsilon_{total}$). If a larger number of posterior samples is needed, then one has to relax the desired privacy level for the posterior samples of ABCDP to remain similar to those of ABC. Otherwise, with a small $\epsilon_{total}$ and a large c, the accepted posterior samples will be poor. On the other hand, if the dataset is larger, then a larger c can be taken at a reasonable level of privacy.
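This guideline can be made quantitative by plugging the Theorem 2 noise scales into $G_b$ and comparing dataset sizes (a sketch; the gap $|\rho_t - \epsilon_{abc}| = 0.1$ and the parameter values are illustrative):

```python
import numpy as np

def flip_prob(gap, N, c, eps_total, B_k=1.0, resample=True):
    """Flip probability G_b(|rho_t - eps_abc|) using the Theorem 2 noise scales."""
    if resample:
        b = 4 * c * np.sqrt(B_k) / (eps_total * N)
    else:
        b = 2 * (c + 1) * np.sqrt(B_k) / (eps_total * N)
    return (4 * np.exp(-gap / (2 * b)) - np.exp(-gap / b)) / 6

# Tiny dataset: each accepted sample is close to a coin flip.
small = flip_prob(gap=0.1, N=10, c=5, eps_total=1.0)
# Larger dataset: the same c and privacy level give a negligible flip probability.
large = flip_prob(gap=0.1, N=10_000, c=5, eps_total=1.0)
```

With $N = 10$ the flip probability is close to its worst-case value of $\frac{1}{2}$, while with $N = 10{,}000$ it is essentially zero, matching the discussion above.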
4.2. Convergence of Posterior Expectation of Rejection-ABCDP to Rejection-ABC
The flip probability studied in Section 4.1 only accounts for the effect of noise on a single output of ABCDP. Building on this result, we analyze the discrepancy between the posterior expectations derived from ABCDP and from rejection ABC. This analysis requires quantifying the effect of noise on the whole sequence of outputs of ABCDP. The result is presented in Theorem 2.
Theorem 2. Given $Y^*$ of size N and $\{(\mathit{\theta}_t, Y_t)\}_{t=1}^{T}$ as input, let $\tilde{\tau}_t \in \{0,1\}$ be the output of Algorithm 1, where $\tilde{\tau}_t = 1$ indicates that $(\mathit{\theta}_t, Y_t)$ is accepted, for $t = 1, \dots, T$. Similarly, let $\tau_t$ denote the output of the traditional rejection ABC algorithm, for $t = 1, \dots, T$. Let f be an arbitrary vector-valued function of θ. Assume that the numbers of accepted samples from Algorithm 1 and the traditional rejection ABC algorithm are $c := \sum_{t=1}^{T} \tilde{\tau}_t \ge 1$ and $c' := \sum_{t=1}^{T} \tau_t \ge 1$, respectively. Let $b = \frac{4 c \sqrt{B_k}}{\epsilon_{total} N}$ if RESAMPLE = True, and $b = \frac{2(c+1)\sqrt{B_k}}{\epsilon_{total} N}$ if RESAMPLE = False (see Theorem 1). Define $K_T := \mathrm{max}_{t=1,\dots,T} \parallel f(\mathit{\theta}_t) \parallel_2$. Then, the following statements hold for both RESAMPLE options:
1. $\mathbb{E}_{\tilde{\tau}_1, \dots, \tilde{\tau}_T} \parallel \frac{1}{c} \sum_{t=1}^{T} f(\mathit{\theta}_t) \tilde{\tau}_t - \frac{1}{c'} \sum_{t=1}^{T} f(\mathit{\theta}_t) \tau_t \parallel_2 \le \frac{2 K_T}{c'} \sum_{t=1}^{T} G_b(|\rho_t - \epsilon_{abc}|),$ where the decreasing function $G_b(x) \in (0, \frac{1}{2}]$ for any $x \ge 0$ is defined in Lemma 2;
2. $\mathbb{E}_{\tilde{\tau}_1, \dots, \tilde{\tau}_T} \parallel \frac{1}{c} \sum_{t=1}^{T} f(\mathit{\theta}_t) \tilde{\tau}_t - \frac{1}{c'} \sum_{t=1}^{T} f(\mathit{\theta}_t) \tau_t \parallel_2 \to 0$ as $N \to \infty$;
3. For any $a > 0$, the probability that the above error exceeds a is bounded by a quantity that decays exponentially in N, where the probability is taken with respect to $\tilde{\tau}_1, \dots, \tilde{\tau}_T$. Theorem 2 contains three statements. The first states that the expected error between the two posterior expectations of an arbitrary function f is bounded by a constant factor times the sum of the flip probabilities over the rejection/acceptance steps. As we have seen in Section 4.1, the flip probability is determined by the scale parameter b of the Laplace distribution. Since $b = O(1/N)$ (see Theorem 1 and Lemma 1), the expected error decays as N increases, giving the second statement.
The third statement gives a probabilistic bound on the error, guaranteeing that the error decays exponentially in N. Our proof relies on establishing an upper bound on the error as a function of the total number of flips $\sum_{t=1}^{T} |\tilde{\tau}_t - \tau_t|$, which is a random variable. Bounding the error of interest then amounts to characterizing the tail behavior of this quantity. Observe that in Theorem 2, we consider ABCDP and rejection ABC with the same computational budget, i.e., the same total number of iterations T. However, the number of accepted samples may differ in each case (c for ABCDP and $c'$ for rejection ABC). The fact that c itself is a random quantity due to the injected noise presents its own technical challenge in the proof. Our proof can be found in Appendix E.
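The second statement of Theorem 2 can be probed empirically with synthetic indicators: as N grows (so the Laplace scale b shrinks), the noisy and exact posterior means coincide. This is a self-contained sketch with made-up distances; it ignores the early stopping at c and uses the RESAMPLE = False noise scale as an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_mean_error(N, T=2000, c_max=50, eps_abc=0.5, eps_total=1.0, B_k=1.0):
    """|posterior mean from noisy indicators - posterior mean from exact ones|."""
    thetas = rng.uniform(0, 1, T)
    rhos = rng.uniform(0, 1, T)                            # synthetic distances
    b = 2 * (c_max + 1) * np.sqrt(B_k) / (eps_total * N)   # RESAMPLE = False scale
    eps_hat = eps_abc + rng.laplace(0, b)                  # noisy threshold
    tau = rhos < eps_abc                                   # exact ABC indicators
    tau_tilde = rhos + rng.laplace(0, 2 * b, T) <= eps_hat # noisy ABCDP-style indicators
    m_exact = thetas[tau].mean() if tau.any() else 0.0
    m_noisy = thetas[tau_tilde].mean() if tau_tilde.any() else 0.0
    return abs(m_noisy - m_exact)
```

For large N (or a generous privacy budget), `b` is tiny, almost no indicators flip, and the error is essentially zero, whereas for small N the noisy indicators can diverge substantially from the exact ones.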
7. Summary and Discussion
We presented the ABCDP algorithm, which combines DP with ABC. Our method outputs differentially private binary indicators, yielding differentially private posterior samples. To analyze the proposed algorithm, we derived the probability of a flip from the rejection ABC indicator to the ABCDP indicator, as well as an average error bound on the posterior expectation.
We showed experimental results that output a relatively small number of posterior samples. This is because the cumulative privacy loss increases linearly with the number of posterior samples (i.e., c) that our algorithm outputs. For a large dataset (i.e., large N), one can still increase the number of posterior samples while providing a reasonable privacy guarantee. However, for a small dataset (i.e., small N), a more refined privacy composition (e.g., [29]) would be necessary to keep the cumulative privacy loss relatively small, at the expense of providing an approximate DP guarantee rather than the pure DP guarantee that ABCDP provides.
When we presented our work to the ABC community, we were often asked whether ABCDP could be applied to other types of ABC algorithms, such as the sequential Monte Carlo algorithm, which outputs the significance of each proposal sample, as opposed to its acceptance or rejection as in the rejection ABC algorithm. Directly applying the current form of ABCDP to these algorithms is not possible. However, applying the Gaussian mechanism to the significance of each proposal sample can guarantee differential privacy for the output of the sequential Monte Carlo algorithm. The cumulative privacy loss would then be relatively large, as it becomes a function of the number of proposal samples, regardless of whether they are taken as good posterior samples or not.
A natural byproduct of ABCDP is differentially private synthetic data: since the simulator is a public tool that anybody can run, differentially private posterior samples suffice to generate differentially private synthetic data without any further privacy cost. Applying ABCDP to generate complex datasets is an intriguing future direction.