F-Divergences and Cost Function Locality in Generative Modelling with Quantum Circuits

Generative modelling is an important unsupervised task in machine learning. In this work, we study a hybrid quantum-classical approach to this task, based on the use of a quantum circuit born machine. In particular, we consider training a quantum circuit born machine using f-divergences. We first discuss the adversarial framework for generative modelling, which enables the estimation of any f-divergence in the near term. Based on this capability, we introduce two heuristics which demonstrably improve the training of the born machine. The first is based on f-divergence switching during training. The second introduces locality to the divergence, a strategy which has proved important in similar applications in terms of mitigating barren plateaus. Finally, we discuss the long-term implications of quantum devices for computing f-divergences, including algorithms which provide quadratic speedups to their estimation. In particular, we generalise existing algorithms for estimating the Kullback–Leibler divergence and the total variation distance to obtain a fault-tolerant quantum algorithm for estimating another f-divergence, namely, the Pearson divergence.


Introduction
One of the most challenging technological questions of our time is whether existing quantum computers can achieve quantum advantage in tasks of practical interest. Variational quantum algorithms (VQAs), which are well suited to the constraints imposed by existing devices, have emerged as the leading strategy for achieving such a quantum advantage [1][2][3][4].
In VQAs, a problem-specific cost function, which typically consists of a functional of the output of a parameterised quantum circuit, is efficiently evaluated using a quantum computer. Meanwhile, a classical optimiser is leveraged to train the circuit parameters in order to minimise the cost function. This hybrid quantum-classical approach is robust to the limited connectivity and qubit count of existing devices, and, by restricting the circuit depth, also provides an effective strategy for error mitigation.
In this paper, we focus on a hybrid quantum-classical approach to generative modelling using a born machine [34]. We adopt an adversarial framework to this task, in which a born machine (the 'generator') generates samples from the target distribution, while a binary classifier (the 'discriminator') attempts to distinguish between generated samples and true samples. This is sometimes referred to in the literature as a quantum generative adversarial network.
In a generalisation of existing approaches, we consider training the born machine with respect to any f -divergence as a cost function. Well-known examples of f -divergences include the Kullback-Leibler divergence (KL), the Jensen-Shannon divergence (JS), the squared Hellinger distance (H 2 ), the total variation distance (TV), and the Pearson divergence (χ 2 ). In the adversarial framework, it is straightforward to estimate the f -divergence: any such divergence is defined in terms of the density ratio of the target distribution and model distribution, which can be estimated using standard techniques via the output of the binary classifier [35]. On this basis, we propose a heuristic for training the born machine, based on the idea of dynamically switching the f -divergence during training in order to optimise the rate of convergence and utilise favourable qualities of each one. We also propose a second heuristic, based on introducing locality into the f -divergence, motivated by the now well-established connection between locality and barren plateaus in VQA training landscapes [36,37]. For both heuristics, we provide numerical evidence to suggest that they can lead to (sometimes significant) performance improvements, particularly in under-and over-parameterised circuits.
We conclude this paper with a discussion of the longer-term implication of quantum devices for computing the f -divergences between two probability distributions. In particular, we discuss the existence of quadratic speedups for the estimation of TV and KL shown by [38][39][40] and extend these results to an algorithm for estimating χ 2 , assuming access to a fault-tolerant quantum computer.
The remainder of this paper is organised as follows. In Section 2, we begin by introducing generative modelling, Born machines, and f -divergences. In Section 3, we then introduce the two training heuristics for the born machine. In Section 4, we provide numerical results to demonstrate the performance of the heuristics. In Section 5, we discuss the long-term implications of quantum devices for computing f -divergences. Finally, in Section 6, we offer some concluding remarks.

Generative Modelling
Generative modelling is an unsupervised machine learning task in which the goal is to learn the probability distribution which generates a given data set. More precisely, given access to i.i.d. samples x 1 , . . . , x m i.i.d.
∼ p(x) in R p , the objective of generative modelling is to learn a model q θ (x), typically parameterised by a d dimensional parameter vector, θ ∈ R d , which closely resembles p(x). Generative models find applications in a wide range of problems, ranging from the typical modalities of machine learning such as text [41], image [42] and graph [43] analysis, to problems in active learning [44], reinforcement learning [45], medical imaging [46], physics [47], and speech synthesis [48].
Broadly speaking, one can distinguish between two main categories of generative model: prescribed models and implicit models [49,50]. Prescribed models provide an explicit parametric specification of the distribution of the observed random variable x, directly specifying the density q θ (x). An example of a prescribed model is the ubiquitous multivariate Gaussian distribution. Implicit models, on the other hand, specify only the stochastic procedure which generates samples. An example of an implicit model is a complex computer simulation of some physical phenomenon, for which the likelihood function cannot be computed. Since, in this case, one no longer models q θ (x) directly, valid objectives can now only involve quantities (e.g., expectation values) which can be estimated efficiently using samples.

Born Machines as Implicit Generative Models
By directly exploiting born's probabilistic interpretation of quantum wave functions [65], it is possible to model the probability distribution of classical data using a pure quantum state. Such models are referred to as born machines [34]. We are particularly interested in born machines for which the quantum state is obtained via a parameterised quantum circuit (as opposed to, say, a continuous time Hamiltonian evolution). These are known as quantum circuit born machines (QCBMs) [15,16].
The use of QCBMs as generative models is in large part motivated by their expressiveness. Indeed, it is now well established that born machines have greater expressive power than classical models, including neural networks [20] and partially matrix product states [66] (see also [19]). This means, in particular, that QCBMs can efficiently represent certain distributions which are classically intractable to simulate (e.g., [67][68][69]). These include those recently used in a demonstration of quantum supremacy [70].
Let us consider a binary vector x ∈ {0, 1} n , with n the number of qubits. A QCBM takes a product state |0 ⊗n as input and evolves it into a normalised output state |Ψ(θ) via a parameterised quantum circuit U(θ). One can generate n-bit strings according to where |x are computational basis states; sampling from this distribution then consists of a simple measurement. Since we only have access to x ∼ q θ (x) and not the probabilities, q θ (x) themselves, the born machine can be regarded as an implicit generative model. We consider parameterised quantum circuits U(θ) of the form is a set of parameterised unitaries, and D is the depth of circuit. We also assume that U i (θ i ) = e −iθ i V i are rotations through angles θ i , generated by Hermitian operators V i with eigenvalues ±1. In this case, one can compute partial derivatives of q θ (x) using the parameter-shift rule [71], which reads where θ i ± = θ ± π 4 e i , with e i a unit vector in the ith direction. More generally, this formula allows one to express the first-order partial derivative of an expectation of a function h as The major challenge in using any implicit generative model is designing a suitable objective function. As noted before, one cannot compute q θ (x) directly, and thus valid objectives can only involve statistical quantities (e.g., expectations) which can be efficiently computed using samples. For generative models based on QCBMs, various objectives have been proposed, including moment-matching, maximum mean discrepancy, Stein and Sinkhorn divergences, and adversarial objectives based on the Kullback-Leibler divergence. In this paper, we propose a more general class of objective functions-f -divergences-for training QCBMs.

Adversarial Generative Modelling with f -Divergences
Let f : (0, ∞) → R be a convex function with f (1) = 0 and strict convexity at 1. Suppose that p(x) = 0 whenever q θ (x) = 0. The f -divergence, or Csiszár divergence [72,73], between q θ and p is defined as Suppose instead that q θ (x) = 0 whenever p(x) = 0. Then the f -divergence can be written in terms of f * , the convex conjugate of f , as In what follows, we will generally prefer this formulation, as it leads to simpler expressions. The function f is called the generator of the divergence. For different choices of f , one obtains well-known divergences such as TV, KL, and χ 2 . In this paper, we investigate the effect of this choice on the training of a QCBM. To ensure a fair comparison, we assume that the generators are standardised and normalised such that f (1) = 0 and f (1) = 1 [74]. This ensures that D f (p q θ ) ≥ 0 with equality if and only if p ≡ q θ , even if p and q θ are unnormalised. Note that one can normalise and standardise We minimise the f -divergence using gradient-based methods. We thus require the derivative of D f with respect to θ i . Using the chain and the parameter-shift rules, it is straightforward to compute We summarise some well-known f -divergences, the convex conjugates of their generators, and their parameter-shift rules, in Tables 1 and 2. We also plot some of the convex conjugate generators in Figure 1.  Returning to Equation (10), it is clear that the problem of computing the gradient reduces to that of estimating the probability ratio r(x) = q θ (x) p(x) . We choose to define r(x) in this way since it is more natural when one is interested with writing the f -divergence in terms of f * , as we do here. Note that in some literature the ratio is defined in the reverse manner by switching the probabilities. We can estimate the probability ratio from the output of a binary classifier [35]. Suppose we assign samples x∼q θ (x) to one class, and samples x∼p(x) to another class. Suppose, in addition, that one has access to an exact binary classifier d * (x), which outputs the probability that the sample x originated from q θ (x). Then, assuming uniform prior probabilities for the two classes, it is straightforward to show via Bayes' theorem that (see Section 2.2 in [50]). Table 1. A summary of well-known f -divergences, including the definition, the convex conjugate of the generator f * , and the corresponding parameter-shift rule in terms of the ratio r(x) = q θ (x) p(x) . The symbol indicates that the divergence is asymmetric, while a comma indicates that it is symmetric. Interestingly, one can construct symmetric f -divergences for every asymmetric one (see Table 2).

f -Divergence
Definition Kullback-Leibler (type II, forward) KL(p| Table 2. A summary of the symmetric f -divergences corresponding to some well-known asymmetric f -divergences, including the definition, and the parameter-shift rule.

f -Divergence Definition Parameter-Shift
symmetric Kullback-Leibler (type I, Jeffrey) J(p, q θ ) = KL(p q θ ) + KL(q θ p) symmetric Kullback-Leibler (type II, Jensen-Shannon) In practice, we do not have access to the exact classifier d * (x). However, under the assumption that we can efficiently sample from both distributions, we can train a classifier d φ (x), parameterised by φ, to distinguish between the two distributions. One can use any proper scoring rule to train the classifier [50]. A typical choice is the negative cross entropy, given by The classifier seeks to minimise this objective, which corresponds to low classification errors. We emphasise that, in this objective, θ is fixed at the current QCBM parameters. The resulting classifier approximates the probability ratio for the current QCBM as This can be plugged into Equation (10) to approximate the gradient. With this in mind, we define the cost function for the QCBM as where now the parameters of the classifier φ are fixed and the argument of the expectation value is independent of θ. The adversarial generative modelling can be regarded as the following optimisation problem where the required expectation values are estimated from samples. In principle, the classifier can be trained to optimality in order to provide the best possible ratio for the generative model. Alternatively, the two objective functions can be optimised in tandem, using alternating gradient descent steps or a two-timescale gradient descent scheme [75].

Switching f -Divergences
In this Section, we describe a heuristic for dynamically switching between f -divergences throughout the training process of our generative model (specifically the QCBM).
To motivate this heuristic, we examine how D f (p q θ ) varies with respect to values of r(x) = q θ (x) p(x) . We begin by noting that all f -divergences which can be standardised agree on the divergence between nearby distributions [76], but can otherwise exhibit very different behaviours. In particular, we focus on their initial rates of convergence.
One may rationalise the different rates of convergence for each divergence at the beginning of training by considering the following argument [50,64,77]. Consider n qubits, such that there are 2 n different values of r(x). For a successful training, all these values need to converge towards 1 (which implies our goal that q θ ≡ p). Now suppose we were to estimate the divergence in Equation (6) using a set of samples from the target distribution . At the beginning of training, q θ is initialised at random and is therefore expected to be far from the target. This means that q θ (x i ) p(x i ) for most of the samples. In other words, at the beginning of training most of the samples yield probability ratios r(x i ) 1. It is evident from the left panel of Figure 1 that some divergences, including TV, vary slowly in the region where r 1, and are therefore more liable to saturation in the initial stages of training. Other divergences, such as forward KL and reverse χ 2 , generate strong gradients in this region. In the limiting case where p and q θ have disjoint supports, TV and JS saturate, whereas forward KL diverges [78]. This problem is well known within the context of training generative adversarial networks; since an idealised formulation optimises JS, several alternative cost functions have been proposed to mitigate its slow initial convergence [64,[77][78][79].
Though we can only apply this logic to the particular regime where p and q θ are far apart, it is also evident from Figure 1 that the f -divergences exhibit a wide diversity of behaviours throughout most of training. We propose to exploit this with the following heuristic. At every optimisation step, we choose an f -divergence for each direction in parameter space that generates the highest gradient in said direction. This requires no additional quantum circuit evaluations since we only need to evaluate Equation (10) for the different generators. Concretely, the heuristic can be written as follows. For each step, to update parameter θ i , we choose the f -divergence labelled j, D f j , which obeys: For simplicity, in this paper, we restrict the set F to only contain those f -divergences illustrated in Figure 1. We call this heuristic f -switch.

Local Cost Functions
In this Section, we outline an alternative heuristic for training the QCBM, based on introducing locality into the cost function. Let us briefly provide some motivation for this approach. One of the most fundamental challenges associated with hybrid quantumclassical algorithms is the barren plateau phenomenon, whereby the gradient of the cost function vanishes exponentially in the number of qubits [36,37,[80][81][82][83][84][85][86][87][88]. This phenomenon can arise due to deep unstructured ansätze [80], large entanglement [83,84], high levels of noise [88], and global cost functions [36,37]. As such, it is a rather general phenomenon in many quantum machine learning applications, including generative models. In the presence of barren plateaus, exponential precision (i.e., an exponential number of samples) is required in order to resolve against finite sampling noise and determine a minimising direction in the cost function landscape. Since the standard objective of quantum algorithms is to achieve a polynomial scaling in the system size (as opposed to the exponential scaling of classical algorithms), barren plateaus can destroy any hope of a variational quantum algorithm achieving quantum advantage.
Although, in this paper, we do not directly analyse the emergence of barren plateaus in the QCBM, we are nonetheless motivated by existing results on barren plateaus. We focus, in particular, on the connection between barren plateaus and global cost functions (i.e., cost functions defined in terms of global observables), given that such cost functions naturally arise in hybrid quantum-classical generative models. The connection between trainability and locality was first established by Cerezo et al. [36], who proved that cost functions defined in terms of global observables exhibit barren plateaus for all circuit depths in circuits composed of random two-qubit gates which act on alternating pairs of qubits (i.e., blocks forming local 2-designs). Meanwhile, local cost functions do not exhibit barren plateaus for shallow circuits; in this case, cost function gradients vanish at worst polynomially in the number of qubits.
On the basis of this result, there is clear motivation to seek a local cost function (i.e., a cost function defined in terms of local observables) for the hybrid quantum-classical generative model introduced in Section 2.3. We now attempt to make some progress towards this goal.
We write q i θ (x i ) to denote the marginal distribution of the i th element of the bit-string x = (x 1 , . . . , x n ). Using Jensen's inequality on Equation (6), it can be shown that the fdivergence between joint distributions is larger than the f -divergence between marginal distributions. Thus, we have Our heuristic consists of minimising the right-hand side of this inequality. Even though this is a lower-bound to the original cost, it is a fully local cost function. Later, we show how to generalise this approach allowing for a trade off between trainability and accuracy. We call this heuristic f-local.
Let us show the difference between the global cost function (left-hand side of the inequality) and the local cost function (right-hand side) by means of an example. For ease of exposition, we assume in this discussion that the f -divergence of interest is the reverse KL with generator f * (r) = r log r − r + 1. We emphasise, however, that the methodology is generic to any f -divergence. We begin by rewriting the expression in Equation (1) as where we have defined H x := |x x|. We can thus write the reverse KL in the form of a generic cost function (see, e.g., [3]) as where we define g x (q θ ) := q θ log q θ p(x) . This cost function is clearly global, since the observables, H x , act on all qubits. Now, rewriting Equation (20) in terms of the adversarial approximation in Equation (14), we have where h x (q θ ) := q θ logit(d φ (x)), and logit(y) := log y 1−y . It is interesting to note that the global observable H x only enters into h x (q θ ) via the first term, namely q θ (x). It is arguable, however, that the second term in h x (q θ ), namely logit(d φ (x)) should also be regarded as a global quantity.
We now consider the fully local cost function in the right-hand side of Equation (18). Applying the adversarial approximation to each of the n probability ratios, the QCBM objective is where we have replaced the global observable H x in Equation (21) by the set of local observables Here, |x x| i is a projector on the computational basis for qubit i, and 1˜i denotes the identity on all qubits except qubit i. We have also replaced the 'global' function h x (q θ ) in Equation (21) by the set of local functions Here is a set of n 'local' classifiers, which act only on the marginal distribution corresponding to the ith qubit. That is to say, d i φ are trained to distinguish between samples One may ask why it is not sufficient to simply make only the observable, H x , local as is done in other literature addressing the barren plateau problem [36]. In our case, it turns out that if one does not also make the functions h x local, in other words by keeping the classifier 'global', the cost function becomes intractable to compute due to a need to explicitly compute joint probabilities from the circuit, q θ . This hints at the subtlety that appears when attempting to address barren plateaus in generative modelling, that does not necessarily exist in other variational algorithms.
We are, of course, interested in whether the local cost function is faithful to the original cost function. Recall that we are minimising the lower bound in Equation (18). It is clear that, if the local cost function is minimised, so that D f (q i θ p i ) = 0 for all i ∈ {1, . . . , n}, and all of the marginals coincide, there is still no guarantee that the joint distributions will be identical. This observation suggests that, while this cost function may be more trainable than the original cost function on account of its locality, it may also be significantly less accurate. In an attempt to remedy this, we can instead consider a more general k-local cost function which acts on subsets of k qubits. In particular, by defining x i:j := (x i , . . . , x j ), we can introduce where H L(k) h L(k) and is a set of n − k + 1 'k-local' classifiers, defined in an obvious fashion. This k-local cost function now approximates the sum of the reverse KL between the k-marginals (of neighbouring qubits) of the target distribution p(x), and the variational distribution q θ (x).
Arguing as before, it is clear that the k-local cost function will admit additional global minima in comparison to the global cost function for any 1 ≤ k < n. In particular, when the k-local cost function is minimised, the k-nearest neighbour marginals of p(x) and q θ (x) coincide. One can expect, however, that as the value of k is increased, not only will the number of additional minima decrease, but the disparity between the joint distributions of the target and the model at these global minima will decrease. This suggests that in order to achieve a 'sweet spot' between trainability and accuracy, a reasonably approach is to start by optimising the k-local cost function with a small value of k (promoting trainability), before iteratively increasing the value of k (promoting accuracy) until k = n, thus recovering the global cost function.
We should remark that, while for ease of notation we have defined the k-local cost function in terms of marginals with respect to neighbouring qubits (x i , . . . , x i+k−1 ), one can in theory choose any sets of qubits of size at most k (e.g., nearest neighbours, all possible combinations, and randomly sampled). In general, for a fixed value of k, this choice will influence the accuracy of the objective function, as well as its computational cost, and should be made on a case-by-case basis on the basis of the available computational resources.

Numerical Results
In this Section, we present numerical results to illustrate the performance of the training heuristics proposed in Section 3.
Preliminaries Throughout this Section, we utilise a QCBM composed of alternating layers of single qubit gates and entangling gates (see Figure 2). We implement the quantum circuit using pytket [89] and execute the simulations with Qiskit [90]. The parameters of the QCBM are updated using stochastic gradient descent with a constant learning rate, which is tuned to each of the simulations. … … R z (θ 1,1 ) Regarding the classical component of the adversarial generative model (i.e., the binary classifier), we use either a fully connected feed-forward neural network with ReLU neurons (NN), or a support vector machine with RBF kernel (SVM). Indeed, one rather surprising byproduct of our numerical investigation is that the training performance of the adversarial generative model could be improved, at times significantly, by using a SVM in place of a NN for this component (see Figure 3). This, in itself, should be of some interest to practitioners. Not only can SVMs be faster to train, but they depend on significantly fewer hyper-parameters than NNs, whose performance is often highly dependent on careful tuning of the number of hidden layers, the number of neurons in each hidden layer, the learning rate, the batch size, etc. While we do not suggest that SVMs will always outperform NNs in this setting, this does indicate that SVMs may represent a viable alternative. We implement the NNs using PyTorch [91], while the SVMs are implemented with scikit-learn [92]. The particular hyper-parameters used in each simulation are specified below.  In the majority of our numerical simulations, we consider a QCBM with 3 qubits. This corresponds to a discrete target distribution p which takes 2 3 values. We generally also assume that the target distribution corresponds to a particular instantiation of the QCBM, for a fixed number of layers, D p . By varying the number of layers, D q θ used to train the generative model, we can then investigate different parameterisation regimes of interest. In the case that the number of layers used to generate the target is greater than the number of layers used in the model (D p > D q θ ), the model is under-parameterised (or severely under-parameterised). Meanwhile, when the number of layers used to generate the target and the number of layers used in the model are equal (D p = D q θ ), the model is said to be exactly parameterised. In these cases, a solution to the learning problem is guaranteed to exist: there exists θ = θ 0 such that p ≡ q θ 0 . Finally, when the number of layers used to generate the target is less than the number used in the model (D p < D q θ ), the model is over-parameterised (or severely over-parameterised). We provide a more precise definition of these different cases, as applied to our numerics, in Table 3. For each of the settings (i.e., choice of circuit depth for the target and model, choice of heuristic, number of qubits) explored, we train the generative model using nine independent parameter initialisations. We then use a bootstrapping procedure to provide a more robust estimate of the median cost at each training epoch. We first take samples of size nine from the outcome of the nine independent experiments, 10,000 times with replacement. We then compute the median cost across each set of samples to obtain a distribution of 10,000 medians. Using this distribution, we compute the median and obtain error bars from the 5th and 95th percentiles, corresponding to a 90% confidence interval.

Switching f -Divergences
We begin by considering the performance of the heuristic introduced in Section 3.1. The f -divergences that can be standardised locally behave as KL to second order [76]. Notably, TV cannot be standardised; indeed, it is straightforward to show that TV provides an upper bound for all other f -divergences with f (1) = 1 in this regime. For this reason, we evaluate both the exact TV and the exact KL to measure performance.
We begin by reporting the results obtained using an exact classifier, for each of the parameterisation regimes given in Table 3. The generator is trained using 1000 samples per iteration. The results are given in Table 4. Table 4. Performance of the QCBM trained using the TV and the f -divergence heuristic for 3 qubits in over-, under-, and exactly parameterised regimes. We show the bootstrapped median of the TV (top two rows) and the KL (bottom two rows) after 500 epochs. The asterisk (*) on some of the experiments indicates that the cost is still converging. The bold indicates the regimes where f -switch significantly outperforms the other methods.  Our results indicate that the heuristic is able to outperform TV when the QCBM is (severely) over-parameterised. This may be due to the extra degrees of freedom in the model. These allow for more discrepancies between the loss landscapes of the fdivergences, which the heuristic is able to exploit. In Figures 4 and 5, we provide a more detailed illustration of the training performance of the f -switch heuristic in this regime. Figure 4 corresponds to an exact classifier: in this case, use of the heuristic significantly improves the convergence of the QCBM.   The average performance of the heuristic is similar to TV in the exactly and underparametrised regimes. There are, however, initial parameter configurations within these regimes for which the heuristic significantly outperforms TV. In Figure 6, we plot the median losses obtained throughout the training of the QCBM in the under-parametrised U (30,18) regime. The best-performing experiment in this regime is also presented in Figure 7, alongside all the other f -divergences considered in Figure 1. After 200 epochs, the training method that solely uses TV has converged, but all the other divergences, including the heuristic, continue to converge exponentially quickly to smaller losses. In the under-parameterised regime, the ansatz is not guaranteed to contain the true solution. However, after reaching a KL of ∼10 −3 , these f -divergences traverse similar landscapes. Since the f -switch heuristic is shown to reach a KL of ∼10 −5 , we can assume that all of these f -divergences will converge to the global minimum, with the heuristic arriving first.  Figure 7. Performance of the QCBM trained using several f -divergences for 3 qubits in the underparameterised case U (30,18). The parameters are initialised using the parameters which gave the lowest cost during training in Figure 6. We show the exact TV (left) and the exact KL (right).
Finally, in Figure 8, we illustrate the mechanics of the f -switch heuristic. In particular, we plot which f -divergence is 'activated' for each direction in the parameter space, at each epoch of the training in Figure 7.  We remark that as the number of qubits is increased, the randomly initialised model and the target distributions are expected to be increasingly further apart. The heuristic can pick the divergence that provides the highest initial learning signal. For this reason, we expect the heuristic to become particularly useful as the number of qubits is increased.

Local Cost Functions
We now turn our attention to the heuristic introduced in Section 3.2, incorporating locality in the cost function, dubbed f -local. In this Section, the target distribution is a discretised Gaussian. All classifiers are neural networks with 1 hidden layer made of 10 k ReLU neurons, where k is the locality parameter. The number of layers in the QCBM equals the number of qubits, D = n. All expectation values are estimated using 500 samples. In Figure 9, we plot the training performance of the QCBM using the global cost function and several k-local cost functions, for n = 4, 5, and 6 qubit experiments. For 4 and 5 qubits, we show the bootstrapped median for the first 500 training epochs, as well as 90% confidence intervals. For 6 qubits, we plot an illustrative training example for the first 1000 training epochs.  . Training performance of the QCBM using the global and local reverse KL for 4 qubits, 5 qubits, and 6 qubits, for a discretised Gaussian target distribution. For 4 qubits and 5 qubits, we show the bootstrapped median (solid line), as well as 90% confidence intervals (shaded). For 6 qubits, we plot an illustrative training example.
Let us make several remarks. Firstly, it would appear that the use of a k-local cost function can indeed improve the convergence (rate) of the training procedure, particularly during the initial stages. This improvement is increasingly evident as the number of qubits is increased. As such, this approach could be regarded as a potential strategy for tackling barren plateaus in higher-dimensional problems. However, we leave a thorough study of this phenomenon to future work.
Secondly, it is clear that the use of any k-local cost function will eventually prohibit convergence to the true target distribution. As discussed in Section 3.2, the k-local cost function is minimised whenever the k-marginal distributions of the target and the model coincide, which does not necessarily imply that their joint distributions are equal. The smaller the value of k, the greater the possible disparity between two distributions whose k-marginals coincide. This is clearly visualised in Figure 9: as the value of k decreases, the asymptotic reverse KL achieved during training with the k-local cost function plateaus at increasingly larger values.
As remarked previously, this suggests that an optimal training strategy may be to start the training procedure with a small value of k, before iteratively increasing the value of k as training proceeds. For example, let us consider the 5 qubit experiment in Figure 9b. Initially, the 3-local cost function (red) appears to yields the greatest convergence rate. After approximately 150 epochs, the 4-local cost function (purple) now seems to be favourable. Asymptotically, one can imagine that the global cost function (blue) will be preferable. One observes similar behaviour in the 6 qubit experiment in Figure 9c.
In practice, of course, it is not possible to compute the reverse KL directly, and thus another tractable metric is required in order to determine the optimal moment for switching between the k-local cost functions. Alternatively, one can simply increase the locality of the cost function after a set number of epochs.

Estimation of f -Divergences on Fault-Tolerant Quantum Computers
The above discussion is purely heuristic in nature and suitable for near-term quantum computers, but we can also address f -divergences from the other end of the spectrum; using fault-tolerant devices. In particular, we can leverage a recent line of study into quantum property testing of distributions. The key question here is whether or not a particular probability distribution has a certain property.
The work of [38] was one of the first to provide such an answer, demonstrating a quadratic speedup for determining whether two distributions over [n] were close or ε-far in TV. These quantum algorithms typically work in the oracle model, and we measure run time relative to the number of queries to such an oracle (query complexity). In the classical case, we define oracle access to a distribution over [n], The oracle is a mechanism to translate a uniform distribution over [S] to the true distribution over [n]. In the quantum case, such an oracle is replaced by a unitary operator,Ô p acting on a state encoding s ∈ [S], along with an ancillary register to ensure reversibility and defined as: We begin our discussion with the TV. The authors of [38] produced a quantum property testing algorithm for the TV via an algorithm which actually estimates the TV quadratically faster. The analysis in [38] resulted in an algorithm to estimate the TV up to additive error ε, with probability of success of 1 − δ, using O( √ n/ε 8 δ 5 ) samples. This was later improved by [39] to the following Theorem 1 (Section 4, Montanaro [39]). Assume p, q are two distributions on [n]. Then there is a quantum algorithm that approximates TV(p, q) up to an additive error ε > 0, with probability of success 1 − δ, using O( √ nε −3/2 / log(1/δ)) quantum queries.
These ideas were extended in [40] to also give an algorithm for computing the (forward) KL quadratically faster than possibly classically (and also computing certain entropies of distributions). Due to the existence of the ratio p i /q i in the expression for the KL, we must make a further assumption, which was not necessary in the case of the TV distance in Theorem 1. This assumption will also be necessary when considering many of the other divergences in Table 1. In particular, we must assume the two distributions are such that: p i /q i ≤ g(n), ∀i ∈ [n], for some g : N → R + . (This assumption is appropriate when one defines the KL in terms of the generator f and the ratio r = p/q. Conversely, when one defines the KL in terms of the convex conjugate f * and the ratio r = q/p, then the appropriate assumption would instead be that q i /p i ≤ g(n), ∀i ∈ [n].) This assumption is also necessary in the classical case. With this, we then have Theorem 2 (Theorem 4.1, Li and Wu [40]). Assume p, q are two distributions on [n] satisfying p i /q i ≤ g(n), ∀i ∈ [n] for some a : N → R + . Then there is a quantum algorithm that approximates KL(p q) within an additive error ε > 0 with probability of success at least 2/3 using O( √ n/ε 2 ) quantum queries to p and O( √ n g(n)/ε 2 ) quantum queries to q. (The notation O(·) ignores factors that are polynomial in log n and log 1/ε.) These results cover two of the f -divergences we use above (see Table 1). In particular, the latter algorithm provide a quantum speedup since it is known that one requires Ω(n/ log(n)), Ω(ng(n)/ log(n)) classical queries to p and q respectively to estimate the KL [93]. On the other hand, we get a speedup for the former algorithm since it is known one requires Θ(n 2/3 ε −4/3 ) [94] queries to test if two distributions are near or far in TV classically, which is an easier problem than estimating the metric directly.
The key idea behind both of these algorithms is to use a subroutine known as quantum probability estimation or quantum counting, which is adapted from quantum amplitude estimation. This provides a quadratic speedup in producing estimatesp i ,q i , of probabilities p i , q i from the distributions p, q, which are specified via a quantum oracle. Once the estimates ofp i ,q i have been produced via the quantum subroutine, both of the above algorithms reduce to simple classical post-processing. This post processing involves constructing a random variable, y, whose expectation value gives exactly the divergence we require. For TV and KL estimation, this random variable is given by By sampling this random variable according to another distribution r := (r i ) n i=1 (to be defined below), the quantity of interest is exactly given as an expectation value, namely One can check [38,40] that the suitable random variables are given by Due to the probabilistic nature of quantum mechanics, one cannot obtain the exact values of the probabilities required to compute these expectation values. We must settle instead for approximations of p, q, namelyp,q. These estimates are achieved using the quantum approximate counting lemma, which is an application of quantum amplitude estimation [95]. The work in [40] considered two versions of this algorithm, called EstAmp and EstAmp'. The only difference between these two algorithms is the behavior when one of the probabilities, q i , is sufficiently close to zero. This is problematic in the case of the KL estimation (and indeed entropy estimation) in [40] since the relevant quantities diverge as q i → 0. The same is true in our case, as q −1 i appears in many f -divergences.
The modified algorithm (EstAmp') outputs sin 2 ( π 2M ) when EstAmp outputs 0, and outputs the same as EstAmp otherwise. Now that we have a mechanism for estimating the probabilities, we need a final ingredient, which is the generic speedup of Monte Carlo methods from [39] Theorem 4 (Theorem 5, Montanaro [39]). Let A be a quantum algorithm with output X such that Var[X] ≤ σ 2 . Then for ε where 0 < ε < 4σ, by using O((σ/ε) log 3/2 (σ/ε) log log(σ/ε)) executions of A and A −1 , Algorithm 3 in [39] outputs an estimateẼ[X] of E[X] such that Using these results, we now extend Theorems 1 and 2 to cover another f -divergence in Table 1: the forward Pearson divergence, χ 2 (p q). The convex conjugate of the generator for this divergence is given by f * (r) = 1 2 (r − 1) 2 or, equivalently, f * (r) = 1 2 (r 2 − 1). The equivalence of these two generators is straightforward to demonstrate. In particular, we have In fact, in what follows, we make use of the following representation: where we have identified r FP i = q i and y FP i = 1 2 ( q i p i − 1). Using this representation, we develop the following Algorithm 1 for estimating the forward Pearson divergence.

Set:
The following subroutine to be the algorithm A: begin Sample an index, i ∈ [n], according to p. Use the procedure EstAmp' with 2 log 2 ( √ ng(n)/ε) and 2 log 2 ( √ ng(n) 2 /ε) queries to q and p, respectively. Obtain estimates ofp i ,q i . Outputỹ FP i = 1 2 q ĩ p i − 1 . end Use A for l times in Theorem 4 to output an estimate,χ 2 (p q) for χ 2 (p q).
The query complexity of this Algorithm is contained in the following theorem. We defer the proof of this result, which is largely a technical extension of the proof(s) in [40], to Appendix A.
Theorem 5. Assume p, q are two distributions on [n] satisfying q i /p i ≤ g(n), ∀i ∈ [n] for some a : N → R + . Then there is a quantum algorithm that approximates X 2 (p q) within an additive error ε > 0 with probability of success at least 2/3 usingÕ( √ ng(n)/ε 2 ) quantum queries to q andÕ( √ ng(n) 2 /ε 2 ) quantum queries to p.

Discussion
Each f -divergence, with its unique operational meaning, finds application in information theory, statistics, and machine learning. In this paper, we showed that a generative model called quantum circuit born machine can be trained by efficiently minimising any fdivergence. The key observation is that a probabilistic classifier can be trained adversarially to provide an approximation to such divergences.
Building on this, we developed heuristics aimed at improving convergence of the generative training. The first heuristic, f-switch, lets each parameter minimise a different f -divergence. Numerical results with an ideal exact classifier show that this heuristic can converge faster and to better minima than when using a single f -divergence. However, in a more realistic setting where the classifier is trained adversarially, f -switch yields results similar to those obtained by minimising a single f -divergence.
The second training heuristic, f-local, consists of using a single f -divergence approximated by local cost functions. Numerical results show that, as the number of qubits increases, this strategy yield improved convergence of the generative training than when using a global cost function. To the best of our knowledge this is the first proposal of cost functions for generative modelling that can interpolate between trainability and accuracy. Extensive numerical simulations will be needed to confirm whether f -local can alleviate the barren plateau problem in generative modelling.
Interestingly, our local cost functions approximate the f -divergence using an ensemble of local binary classifiers. If the target probability distribution is known to have a particular conditional independence structure (e.g., it is defined by a Bayesian network or a Hidden Markov model), this information could be used to inform the choice of local classifiers.
One interesting research direction is to adapt the above heuristics to work with other families of distance measures. Of particular interest, integral probability metrics (IPMs) include the maximum mean discrepancy, the Dudley metric and the Wasserstein distance. While f -divergences are defined in terms of probability ratios, IPMs are defined in terms of probability differences. However, it is know that under suitable constraints margin-based classifiers yield estimators for IPMs [96]. This suggests that an extension of our heuristics to IPMs could be possible.
In this work, we also discussed the possibility of estimating certain f -divergences on a fault-tolerant quantum computer, therefore avoiding the use of classifiers. Previously published work has proven quadratic quantum speedups for the estimation of total variation [38,39] and forward Kullback-Leibler (KL) of type I [40]. Using these algorithms a quadratic speedups is achievable for the reverse KL of type I, and thus for the symmetric KL of type I (also known as Jeffrey divergence). It is plausible that with some refinements these algorithms can provide quadratic speedups for the KL of type II as well.
We contributed to this topic with an algorithm for estimating Pearson χ 2 divergences and by providing its query complexity. Interestingly, high-order Pearson divergences (also known as Vajda divergences) can be used to approximate any other f -divergence via Taylor expansion [97]. Generalising our quantum algorithm to Vajda divergences would therefore provide a way to estimate all other f -divergences on a fault-tolerant quantum computer.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proof of Theorem 5
In this Appendix, we provide a proof of Theorem 5. For completeness, we first repeat the theorem here.
Theorem 5. Assume p, q are two distributions on [n] satisfying q i /p i ≤ g(n), ∀i ∈ [n] for some a : N → R + . Then there is a quantum algorithm that approximates X 2 (p q) within an additive error ε > 0 with probability of success at least 2/3 usingÕ( √ ng(n)/ε 2 ) quantum queries to q andÕ( √ ng(n) 2 /ε 2 ) quantum queries to p.
Proof. We prove this theorem in two parts, following closely the approach in [40]. We are first required to show that the expectation of the output of the sub-routine A, namelỹ E = ∑ i∈[n] q i (q i /p i − 1) is sufficiently close to E = ∑ i∈[n] q i (q i /p i − 1). We begin by observing the following inequality. Let x, y > 0. In addition, suppose there exists 0 < K < ∞ such that y ≤ K. Then where we have used the elementary inequality: (z − 1)/z ≤ log(z). We then have, using the linearity of expectation, The remainder of the proof follows [40], with the roles of p and q now reversed, and with an additional factor of g(n). In particular, using elementary properties of the logarithm, we have where in the second line we have used the assumption that q i /p i ≤ g(n) for all i ∈ [n]. By (IV.5) and (IV.6) in ( [40], Section IV), 2 log 2 ( √ ng(n)/ε) queries to q and 2 log 2 ( √ ng(n) 2 /ε) queries to p yield ∑ i∈ [n] q i E[|log q i − logq i |] = O ε g(n) , Substituting these bounds into Equation (A5), and re-scaling Algorithm 1 by a large enough constant, we obtain |E −Ẽ| ≤ ε 2 . We are now required to bound the variance of this random variable. The variance is at most We first turn our attention to the first term. Recall that EstAmp' outputsq i such that q i ≥ sin 2 (π/2 log 2 ( √ ng(n)/ε) +1 ) ≥ ε 2 /(4ng(n) 2 ) for any i. It follows thatq i /p i ≥q i ≥ ε 2 /(4ng(n) 2 ), and thus exp(−2q i /p i ) ≤ exp −ε 2 /(2ng(n) 2 ) . We thus have, using also the fact that (x − 1) 2 < exp(−2x) for x > −1, that Meanwhile, for the second term, using the fact that (q i /p i − 1) 2 ≤ [q i /p i ] 2 since, in this summation,q i /p i ≥ 1 ≥ 1/2, we obtain Substituting Equations (A10) and (A11) into Equation (A8), we see that the variance of the random variable is at most It follows from Corollary 2 in [40] that we can approximateẼ up to an additive error of ε/2 with probability of success of at least 2/3 using O(1/ε) · 2 log 2 ( √ ng(n)/ε) = O( √ ng(n)/ε 2 ) queries to q and O(1/ε) · 2 log 2 ( √ ng(n) 2 /ε) = O( √ ng(n) 2 /ε 2 ) queries to p. Together with our earlier demonstration that |E −Ẽ| ≤ ε/2, this completes the proof.