Learning Functions and Approximate Bayesian Computation Design: ABCD

A general approach to Bayesian learning revisits some classical results that characterize which functionals on a prior distribution are expected to increase, in a preposterior sense. The results are applied to information functionals of the Shannon type and to a class of functionals based on expected distance. A close connection is made between the latter and a metric embedding theory due to Schoenberg and others. For the Shannon type, there is a connection to majorization theory for distributions. A computational method is described for solving the generalized optimal experimental design problems that arise from the learning framework. It is based on a version of the well-known approximate Bayesian computation (ABC) method, which carries out the Bayesian analysis by Monte Carlo simulation. Some simple examples are given.


Information-Based Learning
The classical formulation proceeds as follows. Let $U$ be a random variable with density $f_U(u)$. Let $g(\cdot)$ be a function $\mathbb{R}^+ \to \mathbb{R}$ and define a measure of information of the Shannon type for $U$ with respect to $g$ as
$$I_g(U) = \mathrm{E}_U\big(g(f_U(U))\big).$$
Here, $\theta$ denotes the unknown parameter, with prior density $\pi(\theta)$. If $X$ represents the future observation, we can measure the preposterior information of the experiment (query, etc.), which generates a realization of $X$, by the prior expectation of the posterior information, which we define as:
$$I_g(\theta; X) = \mathrm{E}_X \mathrm{E}_{\theta|X}\big(g(\pi(\theta|X))\big) = \mathrm{E}_{X,\theta}\big(g(\pi(\theta|X))\big).$$
In the second term, the inner expectation is with respect to the posterior (conditional) distribution of $\theta$ given $X$, namely $\pi(\theta|X)$, and the outside expectation is with respect to the marginal distribution of $X$.
In the last term, the expectation is with respect to the full joint distribution of $X$ and $\theta$. We wish to compare $I_g(\theta; X)$ with the prior information:
$$I_g(\theta) = \mathrm{E}_\theta\big(g(\pi(\theta))\big).$$
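The comparison between prior and preposterior information can be made concrete in a small discrete setting. The following is a minimal sketch, assuming $g = \log$ (the Shannon case), a three-point prior and a two-outcome sampling distribution; the numbers and the function names `shannon_info` and `preposterior_info` are hypothetical and chosen only for illustration.

```python
import math

def shannon_info(p):
    """Shannon information sum_i p_i log p_i (negative entropy), i.e. g = log."""
    return sum(q * math.log(q) for q in p if q > 0)

def preposterior_info(prior, lik):
    """E_X[ I_g(pi(.|X)) ] for a discrete prior and lik[t][x] = f(x | theta_t)."""
    total = 0.0
    for x in range(len(lik[0])):
        joint = [prior[t] * lik[t][x] for t in range(len(prior))]
        f_x = sum(joint)              # marginal probability of X = x
        if f_x == 0:
            continue
        posterior = [j / f_x for j in joint]
        total += f_x * shannon_info(posterior)
    return total

# Hypothetical prior over three parameter values and a binary observation
prior = [0.5, 0.3, 0.2]
lik = [[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]]

prior_info = shannon_info(prior)
prepost_info = preposterior_info(prior, lik)
print(prior_info, prepost_info)
```

In this Shannon case, the difference `prepost_info - prior_info` is the mutual information between $\theta$ and $X$, so it is non-negative for any choice of `prior` and `lik`.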
We shall postpone the proof of Theorem 1 until after a more general result for functionals on densities: ϕ : π(θ) → R.
Proof. (of Theorem 1). We now show that Theorem 1 is a special case of Theorem 2.

Distance-Based Information Functions
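Shannon type information functionals take no account of metrics. Intuitively, if mass is moved around, the information stays the same. Let $Z_1, Z_2$ be independent copies from $\pi(z)$, and let $d(z_1, z_2)$ be a distance or metric. Define d-information as:
$$\phi(\pi) = -\mathrm{E}\big(d(Z_1, Z_2)\big) = -\int\!\!\int d(z_1, z_2)\,\pi(z_1)\,\pi(z_2)\,dz_1\,dz_2.$$
The condition for convexity, again using the second directional derivative with respect to $\alpha$, is
$$-\int\!\!\int d(z_1, z_2)\,\big(\pi_1(z_1) - \pi_2(z_1)\big)\big(\pi_1(z_2) - \pi_2(z_2)\big)\,dz_1\,dz_2 \ge 0. \qquad (5)$$
Noting that $\int (\pi_1(z_1) - \pi_2(z_1))\,dz_1 = 0$, (5) is a generalized version of the following condition:
$$-\sum_{i}\sum_{j} d_{ij}\,x_i x_j \ge 0 \quad \text{for all } x \text{ with } \sum_i x_i = 0. \qquad (6)$$
Condition (6), considered as a condition on a distance matrix $d_{ij} = d(z_i, z_j)$, is called almost positive and is the necessary and sufficient condition for an abstract set of points $P_1, \dots, P_k$, with interpoint distances $\{d_{ij}\}$, to be embedded in Euclidean space.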
Theorem 3. If $d_{ij} = d_{ji}$, $1 \le i < j \le n$, are $\tfrac{1}{2}n(n-1)$ positive quantities, then a necessary and sufficient condition that the $d_{ij}$ are the interpoint distances between points $P_i$, $i = 1, \dots, n$, in $\mathbb{R}^n$ is that the distance matrix $D = -\{d_{ij}\}$ is an almost positive matrix.

This is a special case of metric embedding, sometimes called metric multi-dimensional scaling in statistics; see, for example, Torgerson [16], Gower [17,18]. A more general result is available. The task is to identify the functions $B(d(x, y)^2)$, such that, when $d(x, y)$ is a Euclidean or Hilbert space metric, the space with the new metric can still be embedded into the Hilbert space. Schoenberg [10] gives the major result that such $B(\cdot)$ are exactly the Bernstein functions, defined as follows (see Theorem 12.14 in [11]): a function $f$ on $(0, \infty)$ is a Bernstein function if it is infinitely differentiable, $f(\lambda) \ge 0$ for all $\lambda > 0$ and the derivatives $f^{(n)}$ satisfy $(-1)^{n-1} f^{(n)}(\lambda) \ge 0$ for all positive integers $n$ and all $\lambda > 0$.
Note that this says that f ′ is a completely monotone function.
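The following are equivalent:
(2) $B$ is a Bernstein function.
(3) $e^{-B(t)}$ is the Laplace transform of an infinitely divisible distribution, i.e.,
$$e^{-B(t)} = \int_0^\infty e^{-ts}\,d\gamma(s),$$
where $\gamma$ is an infinitely divisible distribution.
(4) $B$ has the Lévy-Khintchine representation:
$$B(t) = bt + \int_0^\infty \big(1 - e^{-ts}\big)\,d\mu(s),$$
for some $b \ge 0$ and a measure $\mu$, such that
$$\int_0^\infty \min(1, s)\,d\mu(s) < \infty.$$
We now combine the above discussion with Schoenberg's theorem.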
In the univariate case, the negative of the variance of the distribution is a learning function, since:
$$-\mathrm{Var}(Z) = -\tfrac{1}{2}\,\mathrm{E}(Z_1 - Z_2)^2.$$
When $Z$ is multivariate, we again take independent copies $Z_1, Z_2$ of $Z$ and use Euclidean distance, and we have that minus the trace of the covariance matrix of $Z$, $\Gamma$, is a learning function:
$$-\mathrm{trace}(\Gamma) = -\tfrac{1}{2}\,\mathrm{E}\,\|Z_1 - Z_2\|^2.$$
Schilling et al. [11] (Chapter 15) list 138 Bernstein functions, each of which will lead to a learning functional of the distance type. We give a small selection of Bernstein functions $B(\lambda)$, which then, applied with $\lambda = d(z_1, z_2)^2$, give a learning function:
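As an illustration only, the sketch below takes four standard Bernstein functions, $B(\lambda) = \lambda$, $\sqrt{\lambda}$, $\log(1+\lambda)$ and $1 - e^{-\lambda}$ (chosen here for convenience and not necessarily those listed by the authors), and evaluates the functional $-\mathrm{E}\,B(|Z_1 - Z_2|^2)$ exactly for two hypothetical discrete distributions on the real line. It then checks convexity along the mixture of the two distributions, which is the property required of a learning function. All names and numbers are assumptions.

```python
import math

# Standard Bernstein functions (illustrative choice, not the paper's list)
BERNSTEIN = {
    "identity": lambda lam: lam,
    "sqrt": lambda lam: math.sqrt(lam),
    "log1p": lambda lam: math.log1p(lam),
    "one_minus_exp": lambda lam: 1.0 - math.exp(-lam),
}

def d_information(points, probs, B):
    """phi(pi) = -E[ B(|Z1 - Z2|^2) ] for independent Z1, Z2 ~ pi on a finite support."""
    return -sum(p1 * p2 * B((z1 - z2) ** 2)
                for z1, p1 in zip(points, probs)
                for z2, p2 in zip(points, probs))

def mix(p, q, alpha):
    """Mixture (1 - alpha) * p + alpha * q of two probability vectors."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(p, q)]

# Hypothetical support points and two distributions on them
points = [0.0, 0.5, 1.0, 2.0]
pi1 = [0.7, 0.1, 0.1, 0.1]
pi2 = [0.1, 0.2, 0.3, 0.4]

def convexity_gap(B, alpha=0.5):
    """(1 - a) phi(pi1) + a phi(pi2) - phi(mixture); non-negative when phi is convex."""
    mixed = d_information(points, mix(pi1, pi2, alpha), B)
    chord = ((1 - alpha) * d_information(points, pi1, B)
             + alpha * d_information(points, pi2, B))
    return chord - mixed

gaps = {name: convexity_gap(B) for name, B in BERNSTEIN.items()}
print(gaps)
```

With the identity function, `d_information` returns minus twice the variance, in line with the identity $\mathrm{Var}(Z) = \tfrac12 \mathrm{E}(Z_1 - Z_2)^2$.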

Counterexamples
We show first that it is not true that information always increases. That is, it is not true that the posterior information is always more than the prior information.

A simple discrete example runs as follows. I have lost my keys. With high prior probability, $p$, I think they are on my desk. Suppose I have a uniform prior over the $k$ other likely locations, so that each has prior probability $(1-p)/k$. However, suppose when I look on the desk that my keys are not there. My posterior distribution is now uniform on the other locations. Under certain conditions on $p$ and $k$, Shannon information has gone down. For fixed $p$, the condition is $k > k^*$, where:
$$k^* = \frac{1}{p\,(1-p)^{(1-p)/p}},$$
which can be approximated by expanding $pk^*$ in a Taylor expansion. When $p = \tfrac{1}{2}$, $k^* = 4$, and $pk^* \to e$ as $p \to 0$, while $pk^* \to 1$ as $p \to 1$. This example is captured by the somewhat self-doubting phrase "if my keys are not on my desk, I don't know where they are". Note, however, that something has improved: the support size is reduced from $k + 1$ to $k$.
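The lost-keys example is easy to check numerically. The following minimal sketch assumes the prior puts mass $p$ on the desk and $(1-p)/k$ on each of the $k$ other locations, and uses Shannon information $\sum_i p_i \log p_i$; the function names are hypothetical.

```python
import math

def prior_info(p, k):
    """Shannon information of the prior: mass p on the desk, (1 - p)/k elsewhere."""
    return p * math.log(p) + (1 - p) * math.log((1 - p) / k)

def posterior_info(k):
    """Shannon information once the desk is ruled out: uniform on k locations."""
    return -math.log(k)

def k_star(p):
    """Value of k above which the posterior information falls below the prior information."""
    return 1.0 / (p * (1 - p) ** ((1 - p) / p))

p = 0.5
for k in (3, 4, 5):
    print(k, prior_info(p, k), posterior_info(k))
print("k* at p = 0.5:", k_star(p))
print("p k* for small p:", 1e-6 * k_star(1e-6))
```

For $p = \tfrac12$, the two information values coincide at $k = 4$, the posterior information is larger for $k = 3$ and smaller for $k = 5$.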
There is a simple way of obtaining a large class of examples, namely to arrange that there are x-values for which the posterior distribution is approximately uniform. Then, because the uniform distribution typically has low information, for such x, we can have a decrease in information. Thus, we construct examples in which f (x|θ)π(θ) happens to be approximately constant for some x. This motivates the following example.
Let $\Theta \times X = [0, 1]^2$ with joint distribution having support on $[0, 1]^2$. Let $\pi(\theta)$ be the prior distribution and define a sampling distribution:
$$f(x|\theta) = a(\theta)(1 - x) + \frac{x}{\pi(\theta)}.$$
Note that we include the prior distribution into the sampling distribution as a constructive device, not as some strange new general principle. We have in mind, in giving this construction, that when $x \to 1$, the first term should approach zero and the second term, after multiplying by $\pi(\theta)$, should approach unity. Solving for $a(\theta)$ by setting $\int_0^1 f(x|\theta)\,dx = 1$ gives $a(\theta) = 2 - 1/\pi(\theta)$. The joint distribution is then:
$$f(x|\theta)\,\pi(\theta) = (2\pi(\theta) - 1)(1 - x) + x. \qquad (9)$$
The marginal distribution of $X$ is $f_X(x) = 1$ on $[0, 1]$, since the integral of (9) with respect to $\theta$ is unity, so that (9) is also the posterior distribution $\pi(\theta|x)$. Note that, in order for (9) to be a proper density, we require that $\pi(\theta) \ge \tfrac{1}{2}$ for $0 \le \theta \le 1$. The Shannon information of the prior is:
$$I_0 = \int_0^1 \pi(\theta) \log \pi(\theta)\,d\theta,$$
and of the posterior is
$$I_1 = \int_0^1 \{(2\pi(\theta) - 1)(1 - x) + x\}\,\log\{(2\pi(\theta) - 1)(1 - x) + x\}\,d\theta.$$
When $x = \tfrac{1}{2}$, the integrands of $I_1$ and $I_0$ are equal and $I_0 = I_1$. When $x = 1$, the integrand of $I_1$ is zero, as expected. Thus, for a non-uniform prior, we have less posterior information in a neighborhood of $x = 1$, as we aimed to achieve.

Specializing $\pi(\theta) = \tfrac{1}{2} + \theta$ on $[0, 1]$ gives:
$$I_1 = \frac{(2 - x)^2 \log(2 - x) - x^2 \log x}{4(1 - x)} - \frac{1}{2}.$$
Information $I_1$ decreases from a maximum of $\log(2) - \tfrac{1}{2}$ at $x = 0$, through the value $I_0$ at $x = \tfrac{1}{2}$, to the value zero at $x = 1$; see also Figure 1. Thus, $I_0 > I_1$ for $\tfrac{1}{2} < x \le 1$. Since the marginal distribution of $X$ is uniform on $[0, 1]$, we have the challenging fact that:
$$\mathrm{Prob}_X(I_1 < I_0) = \tfrac{1}{2}.$$
Namely, with prior probability equal to one half, there is less Shannon information in the posterior than the prior. The Renyi entropy exhibits the same phenomenon, but we omit the calculations. We might say that $f(x|\theta)$ is not a good choice of sampling distribution to learn about $\theta$.
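The values quoted for this example can be checked by numerical integration. The sketch below is hedged: it assumes that the posterior density has the form $(2\pi(\theta) - 1)(1 - x) + x$, which is consistent with the stated values at $x = 0$, $x = \tfrac12$ and $x = 1$, and it uses a simple midpoint rule. The function names are hypothetical.

```python
import math

def prior_density(theta):
    """Prior pi(theta) = 1/2 + theta on [0, 1]."""
    return 0.5 + theta

def posterior_density(theta, x):
    """Assumed posterior: (2 pi(theta) - 1)(1 - x) + x."""
    return (2.0 * prior_density(theta) - 1.0) * (1.0 - x) + x

def shannon_info(density, n=20000):
    """Midpoint-rule approximation of the integral of d log d over [0, 1]."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        d = density((i + 0.5) * h)
        if d > 0.0:
            total += d * math.log(d) * h
    return total

I0 = shannon_info(prior_density)

def I1(x):
    """Posterior Shannon information for an observed x."""
    return shannon_info(lambda t: posterior_density(t, x))

for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(x, I1(x), "prior:", I0)
```

The printed values can be compared with the prior information `I0`: the posterior information exceeds it for the smaller values of `x` and falls below it for the larger ones.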

Surprise and Ignorance
The conflict between prior beliefs and empirical data, demonstrated by these examples, lies at the heart of debates about inference and learning, that is to say epistemology. This has given rise to formal theories of surprise, which seek to take account of the conflict. Some Bayesian theories are closely related to the learning theory discussed here and measure surprise by quantities such as the difference between the prior information and the posterior information for the observed $x$:
$$S = \mathrm{E}_\theta\big(g(\pi(\theta))\big) - \mathrm{E}_{\theta|x}\big(g(\pi(\theta|x))\big).$$
Since, under the conditions of Theorem 1, $S$ is expected to be negative, a positive value is taken to measure surprise; see Itti and Baldi [19].
Taking a subjective view of these issues, we may stray into cognitive science, where there is evidence that the human brain may react in a more focused way than normal when there is surprise. This is related to wider computational models of learning: given the finite computational capacity of the brain, we need to use our sensing resources carefully in situations of risk or utility. One such body of work emanates from the so-called "cocktail party effect": if the subject matter is of sufficient interest, such as the mention of one's own name across a crowded room, then one's attention is directed towards the conversation. Discussions about how the attention is first captured are closely related to surprise; see Haykin and Chen [20].

Minimal Information Prior Distributions
It is clear that if the prior distribution has minimal information (maximum entropy), then there is no surprise, because $S$, as defined above, is never positive. The use of such prior distributions has been advocated for many years and is incorporated into objective Bayesian analysis by some researchers. One key idea is to use Jeffreys prior distributions, that is, those which are invariant under a suitable group (Haar measure); for a discussion, see Berger [21].
An unresolved issue is that the minimal information distribution depends on the learning function. A simple example is that for Shannon information, the minimal information distribution with support on $[0, 1]$ is the uniform distribution, whereas the maximum variance distribution has mass $\tfrac{1}{2}$ at each of $\{0, 1\}$ and variance $\tfrac{1}{4}$, which is attained in the limit by the Beta$(\alpha, \beta)$ distribution as $\alpha, \beta \to 0$. The variance of the uniform distribution, on the other hand, is $\tfrac{1}{12} < \tfrac{1}{4}$.

Consider the standard beta-binomial Bayesian set-up, where the sampling distribution is Bin$(n, \theta)$ and the (conjugate) prior is Beta$(\alpha, \beta)$. If $x$ is the data, the posterior distribution is Beta$(\alpha + x, \beta + n - x)$, and the posterior mean, which is the Bayes estimator with respect to quadratic loss, is $\hat{\theta} = \frac{\alpha + x}{\alpha + \beta + n}$. The minimal Shannon information is achieved for the uniform distribution, when $\alpha = \beta = 1$, in which case we have $\hat{\theta} = \frac{1 + x}{2 + n}$. However, if we take $\alpha, \beta \to 0$, giving, as mentioned, the minimal information with respect to the variance, we obtain in the limit the maximum likelihood estimator $x/n$.

The same feature arises with the Dirichlet-multinomial case, with the Dirichlet prior distribution:
$$\pi(\theta_1, \dots, \theta_k) = \frac{1}{\mathrm{Beta}(\alpha_1, \dots, \alpha_k)} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}.$$
The minimal Shannon information is attained by the uniform distribution, when all $\alpha_i = 1$, but the maximal trace of the covariance matrix, that is, the minimal information with respect to the trace, is attained for mass $\tfrac{1}{k}$ at each corner of the simplex.
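The dependence of the "minimal information" prior on the learning function can be seen in a few lines. The sketch below uses the standard formulas for the variance of a Beta$(\alpha, \beta)$ distribution and for the beta-binomial posterior mean; the data values ($x = 3$ successes in $n = 10$ trials) and the function names are hypothetical.

```python
def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

def posterior_mean(a, b, x, n):
    """Posterior mean of theta for a Beta(a, b) prior and x successes in n trials."""
    return (a + x) / (a + b + n)

x, n = 3, 10  # hypothetical data

# Uniform prior (minimal Shannon information): variance 1/12, estimator (1 + x)/(2 + n)
print(beta_variance(1.0, 1.0), posterior_mean(1.0, 1.0, x, n))

# alpha, beta -> 0 (maximal variance): variance -> 1/4, estimator -> x/n
tiny = 1e-9
print(beta_variance(tiny, tiny), posterior_mean(tiny, tiny, x, n))
```

The two limiting priors therefore lead to different Bayes estimators from the same data, which is the issue raised in the text.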

The Role of Majorization
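We concentrate here on Shannon-type learning functions. The analysis of the last section leads to the notion that for two distributions $\pi_1(\theta)$ and $\pi_2(\theta)$, the second is more peaked than the first if and only if:
$$\int \pi_1(\theta)\,g(\pi_1(\theta))\,d\theta \le \int \pi_2(\theta)\,g(\pi_2(\theta))\,d\theta \quad \text{for all } g \text{ such that } h(u) = u\,g(u) \text{ is convex}. \qquad (10)$$
The statement (10) defines a partial ordering between $\pi_1$ and $\pi_2$.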
For Bayesian learning, we may hope that the ordering holds when π 1 is the prior distribution and π 2 is the posterior distribution. We have seen from the counterexamples that it does not hold in general, but, loosely speaking, always holds in expectation, by Theorem 1. However, it is natural to try to understand the partial ordering, and we shall now indicate that the ordering is equivalent to a well-known majorization ordering for distributions.
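Consider two discrete distributions with $n$-vectors of probabilities $\pi_1 = (\pi^{(1)}_1, \dots, \pi^{(1)}_n)$ and $\pi_2 = (\pi^{(2)}_1, \dots, \pi^{(2)}_n)$. First, order the probabilities:
$$\tilde{\pi}^{(1)}_1 \ge \dots \ge \tilde{\pi}^{(1)}_n, \qquad \tilde{\pi}^{(2)}_1 \ge \dots \ge \tilde{\pi}^{(2)}_n.$$
Then, $\pi_2$ is said to majorize $\pi_1$, written $\pi_1 \preceq \pi_2$, when:
$$\sum_{i=1}^{j} \tilde{\pi}^{(1)}_i \le \sum_{i=1}^{j} \tilde{\pi}^{(2)}_i \quad \text{for } j = 1, \dots, n \text{ (equality for } j = n\text{)}.$$
The standard reference is Marshall and Olkin [22], where one can find several equivalent conditions. Two of the best known are:
A1. there is a doubly stochastic matrix $P_{n \times n}$, such that $\pi_1 = P \pi_2$;
A2. $\sum_{i} h(\pi^{(1)}_i) \le \sum_{i} h(\pi^{(2)}_i)$ for all continuous convex functions $h$.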
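The discrete majorization conditions can be illustrated directly. The sketch below checks the partial-sum definition, builds $\pi_1 = P\pi_2$ from a doubly stochastic matrix as in A1, and evaluates the convex function $h(u) = u \log u$, which corresponds to Shannon information. The probability vector, the matrix `P` and the function names are hypothetical.

```python
import math

def majorizes(p2, p1, tol=1e-12):
    """True if p2 majorizes p1: ordered partial sums of p2 dominate those of p1."""
    a = sorted(p1, reverse=True)
    b = sorted(p2, reverse=True)
    if len(a) != len(b) or abs(sum(a) - sum(b)) > tol:
        return False
    s1 = s2 = 0.0
    for u, v in zip(a, b):
        s1 += u
        s2 += v
        if s1 > s2 + tol:
            return False
    return True

def apply_matrix(P, p):
    """Matrix-vector product P p."""
    return [sum(P[i][j] * p[j] for j in range(len(p))) for i in range(len(P))]

def shannon_info(p):
    """Sum of h(p_i) with h(u) = u log u."""
    return sum(q * math.log(q) for q in p if q > 0)

# Hypothetical doubly stochastic matrix (rows and columns sum to one)
P = [[0.6, 0.3, 0.1],
     [0.3, 0.4, 0.3],
     [0.1, 0.3, 0.6]]
pi2 = [0.7, 0.2, 0.1]
pi1 = apply_matrix(P, pi2)   # smoothing pi2 gives a less peaked distribution

print(pi1, majorizes(pi2, pi1), shannon_info(pi1), shannon_info(pi2))
```

Here `pi2` majorizes `pi1`, and the Shannon information of `pi1` is the smaller of the two, as the convex-function condition requires.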
Condition A2 shows that, in the discrete case, the partial ordering (10) is equivalent to the majorization of the raw probabilities. We now extend this to the continuous case. This generalization, which we shall also call $\preceq$, to save notation, has a long history, and the area is historically referred to as the theory of the "rearrangements of functions", to respect the terminology of Hardy et al. [23]. It is particularly well-suited to probability density functions, because $\int \pi_1(\theta)\,d\theta = \int \pi_2(\theta)\,d\theta = 1$. The natural analogue of the ordered values in the discrete case is that every density $\pi$ has a unique density $\tilde{\pi}$, called a "decreasing rearrangement", obtained by a reordering of the probability mass to be non-increasing, by direct analogy with the discrete case above. In the theory, $\pi$ and $\tilde{\pi}$ are then referred to as being equimeasurable, in the sense that the supports are transformed in a measure-preserving way.
There are short sections on the topic in Marshall and Olkin [22] and in Müller and Stoyan [24]. A key paper in the development is Ryff [25]. The next paragraph is a brief summary.
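Definition 2. Let $\pi(z)$ be a probability density and define $m(y) = \mu\{z : \pi(z) \ge y\}$. Then:
$$\tilde{\pi}(t) = \sup\{y : m(y) > t\}, \quad t > 0,$$
is the decreasing rearrangement of $\pi$.

The picture is that the probability mass (in infinitely small intervals) is moved, so that a given mass is to the left of any smaller mass. For example, for a triangular distribution on $[0, 1]$, whose density has peak height 2, we have $m(y) = 1 - y/2$ for $0 \le y \le 2$, so that $\tilde{\pi}(z) = 2(1 - z)$ on $[0, 1]$.

Definition 3. We say that $\pi_2$ majorizes $\pi_1$, written $\pi_1 \preceq \pi_2$, if and only if, for the decreasing rearrangements,
$$\int_0^c \tilde{\pi}_1(z)\,dz \le \int_0^c \tilde{\pi}_2(z)\,dz \quad \text{for all } c > 0.$$
There is a list of key equivalent conditions to $\preceq$, which are the continuous counterparts of the discrete majorization conditions. The first two generalize A1 and A2 above.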
Condition B2 is the key, for it shows that in the univariate case, if we assume that $h(u) = u\,g(u)$ is continuous and convex, (10) is equivalent to $\pi_1(\theta) \preceq \pi_2(\theta)$. We also see that $\preceq$ is equivalent to standard first order stochastic dominance of the decreasing rearrangements, since $\tilde{F}(\theta) = \int_0^\theta \tilde{\pi}(z)\,dz$ is the cdf corresponding to $\tilde{\pi}(\theta)$. Condition B3 says that the probability mass under the density above a "slice" at height $c$ is more for $\pi_2$ than for $\pi_1$.
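A decreasing rearrangement can be approximated on a grid by sorting density values, which gives a quick numerical check of the continuous ordering. The sketch below assumes a symmetric triangular density on $[0, 1]$ (an illustrative choice), compares its rearrangement with $2(1 - z)$, and verifies that the uniform density is majorized by it through the cdfs of the rearrangements. Function names and the grid size are hypothetical.

```python
def decreasing_rearrangement(density, n=1000):
    """Grid approximation on [0, 1]: midpoint values sorted in decreasing order."""
    h = 1.0 / n
    values = sorted((density((i + 0.5) * h) for i in range(n)), reverse=True)
    return values, h

def triangular(z):
    """Symmetric triangular density on [0, 1] with peak height 2."""
    return 4.0 * z if z <= 0.5 else 4.0 * (1.0 - z)

def rearranged_cdf(values, h):
    """Cumulative sums of the rearranged density, i.e. the cdf of the rearrangement."""
    out, running = [], 0.0
    for v in values:
        running += v * h
        out.append(running)
    return out

values, h = decreasing_rearrangement(triangular)
# Largest deviation from 2(1 - z) at the grid midpoints
max_err = max(abs(v - 2.0 * (1.0 - (i + 0.5) * h)) for i, v in enumerate(values))

cdf_tri = rearranged_cdf(values, h)
uniform_values, _ = decreasing_rearrangement(lambda z: 1.0)
cdf_uni = rearranged_cdf(uniform_values, h)
uniform_is_majorized = all(u <= t + 1e-9 for u, t in zip(cdf_uni, cdf_tri))

print(max_err, uniform_is_majorized)
```

Sorting grid values is only a crude discretization; the error in `max_err` is of the order of the grid spacing.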
We can summarize this discussion by the following.

Proposition 1. A functional is a learning functional of the Shannon type (under mild conditions) if and only if it is an order-preserving functional with respect to the majorization ordering on distributions.
The role of majorization has been noticed by DeGroot and Fienberg [26] in the related area of proper scoring rules.
The classic theory of rearrangements is for univariate distributions, whereas, as stated, we are interested in θ of arbitrary dimension. In the present paper, we will simply make the claim that the interpretation of our partial ordering in terms of decreasing rearrangements can indeed be extended to the multivariate case. Heuristically, this is done as follows. For a multivariate distribution, we may create a univariate rearrangement by considering a decreasing threshold and "squashing" all of the multivariate mass for which the density is above the threshold to a univariate mass adjacent to the origin. Since we are transforming multivariate volume to area, care is needed with Jacobians. We can then use the univariate development above. It is an instructive exercise to consider the univariate decreasing rearrangement of the multivariate normal distribution, but we omit the computations here.

Learning Based on Covariance Functions
If we restrict our functionals to those which are only functionals of covariance matrices, then we can prove wider results than just for the trace. Dawid and Sebastiani [27] (Section 4) refer to dispersion-coherent uncertainty functions and, where their results are close to ours, we differ only by assumptions.
We use the notation A ≥ 0 to mean that a symmetric matrix is non-negative definite.
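We first prove the converse for matrices $\Gamma$ and $\tilde{\Gamma} = \Gamma + zz^T$, for some vector $z$. Take two distributions with equal covariance matrices $\Gamma$, but with means satisfying $\mu_1 - \mu_2 = 2z$. Then, the mixture of the two distributions with equal weights $\tfrac{1}{2}$ has covariance matrix
$$\Gamma + \tfrac{1}{4}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \Gamma + zz^T = \tilde{\Gamma}.$$
Now assume $-\phi$ is a learning function. Then, by concavity,
$$\phi(\tilde{\Gamma}) \ge \tfrac{1}{2}\phi(\Gamma) + \tfrac{1}{2}\phi(\Gamma) = \phi(\Gamma).$$
In general, we can write any $\tilde{\Gamma} \ge \Gamma$ as $\tilde{\Gamma} = \Gamma + \sum_{i=1}^{m} z^{(i)} z^{(i)T}$, for a sequence of vectors $\{z^{(i)}\}$, $i = 1, \dots, m$, and the result follows by induction from the last result.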
Most criteria used in classical optimum design theory (in the linear regression setting) when applied to covariance matrices are Loewner increasing. If, in addition, we can claim concavity, then by Theorem 7, the negative of any such function is a learning function. We have seen in Section 3 that −trace(Γ) is a learning function, while − log det(Γ) corresponding to D-optimality is another example.
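For a jointly Gaussian pair, the effect of an observation on the two criteria just mentioned can be checked by hand. The sketch below is a hedged illustration: it assumes a two-dimensional $\theta$ and a scalar observation $x$, uses the standard Gaussian conditioning formula for the posterior covariance, and compares $-\mathrm{trace}$ and $-\log\det$ before and after. The covariance values and function names are hypothetical.

```python
import math

def conditional_covariance(gamma_theta, cov_theta_x, var_x):
    """Covariance of theta given x for jointly Gaussian (theta, x), theta in R^2, x scalar."""
    return [[gamma_theta[i][j] - cov_theta_x[i] * cov_theta_x[j] / var_x
             for j in range(2)] for i in range(2)]

def trace(m):
    return m[0][0] + m[1][1]

def det(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Hypothetical prior covariance of theta, covariance with x, and variance of x
gamma_prior = [[2.0, 0.5], [0.5, 1.0]]
cov_theta_x = [1.0, 0.3]
var_x = 1.5

gamma_post = conditional_covariance(gamma_prior, cov_theta_x, var_x)

# The prior covariance exceeds the posterior one by the rank-one term c c^T / var_x,
# so both -trace and -log det are larger for the posterior.
print(-trace(gamma_prior), -trace(gamma_post))
print(-math.log(det(gamma_prior)), -math.log(det(gamma_post)))
```

In the Gaussian case the posterior covariance does not depend on the observed value of $x$, so the increase holds for every realization and not only in expectation.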
For the normal distribution, we can show that for two normal density functions, $\pi_1$ and $\pi_2$, with covariance matrices $\Gamma_1$ and $\Gamma_2$, respectively, we have that for any Shannon-type learning function, $I_g(\theta_1) \le I_g(\theta_2)$ if and only if $\det(\Gamma_1) \ge \det(\Gamma_2)$. We should note that in many Bayesian set-ups, such as regression and Gaussian process prediction, we have a joint multivariate distribution between $x$ and $\theta$. Suppose that, with obvious notation, the joint covariance matrix is:
$$\begin{pmatrix} \Gamma_{\theta} & \Gamma_{\theta x} \\ \Gamma_{x\theta} & \Gamma_{x} \end{pmatrix}.$$

Approximate Bayesian Computation Designs
We now present a general method for performing optimum experimental design calculations, which, combined with the theory of learning outlined above, may provide a comprehensive approach. Recall that in our general setting, a decision about experimentation or observation is essentially a choice of the sampling distribution. In the statistical theory of the design of experiments, this choice typically means a choice of observation sites indexed by a control or independent variable z.
Indeed, we will have examples below in this category. However, the general formulation is that we want to maximize $\psi$ over some restricted set of sampling distributions $f(x|\theta) \in \mathcal{F}$. We call a choice of $f$ a generalized design. Below, we will have one non-standard example based on selective sampling. Note that we shall always assume that the prior distribution $\pi(\theta)$ is fixed and independent of the choice of $f$. Then, recalling our general information functional as $\phi(\pi)$, the design optimization problem is (for fixed $\pi$):
$$\max_{f \in \mathcal{F}} \psi(f) = \max_{f \in \mathcal{F}} \mathrm{E}_{X_D}\,\phi(\pi(\theta|X_D)), \qquad (11)$$
where we stress the dependence of the random variable $X$ on the design and, thereby, on the sampling distribution $f$, by adding the subscript $D$.
If the set of sampling distributions is specified by the control variable $z$, that is, the choice of the sampling distribution $f(x|\theta, z)$ amounts to selecting $z \in Z$, then the maximization problem is:
$$\max_{z \in Z} \mathrm{E}_{X_D}\,\phi(\pi(\theta|X_D, z)).$$
In the examples that we consider below, the sampling distribution will be indexed by a control variable z.
An important distinction should be made between what we shall here call linear and non-linear criteria. By a more general utility problem being linear, we mean that there is a utility function $U(\theta, x)$, such that we seek to optimize, again over the choice of $f$,
$$\mathrm{E}_{X_D} \mathrm{E}_{\theta|X_D}\,U(\theta, X_D) = \mathrm{E}_{X_D, \theta}\,U(\theta, X_D),$$
where the last expectation is with respect to the joint distribution of $X_D$ and $\theta$. In terms of integration, this only requires a single double integral. The non-linear case requires the evaluation of an "internal" integral for $\mathrm{E}_{\theta|X_D} U(\theta, X_D)$ and an external integral for $\mathrm{E}_{X_D}$. It is important to note that Shannon-type functionals are special types of linear functionals where $U(\theta, X_D) = g(\pi(\theta|X_D))$. The distance-based functionals are non-linear in that they require a repeated single integral.
This distinction is important when other costs or utilities are included in addition to those coming from learning. Most obvious is a cost for the experiment. This could be fixed, so that no preposterior analysis is required, or it might be random in that it depends on the actual observation. For example, one might add an additional utility $U(X_D)$ solely dependent on the outcome of the experiment: if it really does snow, then snow plows may need to be deployed. The overall (preposterior) expected value of the experiment might be:
$$\mathrm{E}_{X_D}\big\{\phi(\pi(\theta|X_D)) + U(X_D)\big\}.$$
In this way, one can study the exploration-exploitation problem, often referred to in search and optimization.

We now give a procedure to compute $\psi$ for a particular choice of sampling distribution $f \in \mathcal{F}$. We assume that $f(x|\theta)$ and $\pi(\theta)$ are known. If the functional $\phi$ is non-linear, we have to obtain the posterior distribution $\pi(\theta|X_D)$ before evaluating $\phi$. For simplicity, we use ABC rejection sampling (see Marjoram et al. [28]) to obtain an approximate sample from $\pi(\theta|X_D)$ that allows us to estimate the functional $\phi(\pi(\theta|X_D))$. In many cases, it is hard to find an analytical solution for $\pi(\theta|X_D)$, especially if $f(x|\theta)$ is intractable. These are the cases where ABC methods are most useful. Furthermore, ABC rejection sampling has the advantage that it is easy to re-compute $\hat{\phi}(\pi(\theta|X_D))$ for different values of $X_D$, which is an important feature, because we have to integrate over the marginal distribution of $X_D$ in order to obtain $\psi(f) = \mathrm{E}_{X_D}\,\phi(\pi(\theta|X_D))$.
For a given $f \in \mathcal{F}$, we find the estimate $\hat{\psi}$ by integrating $\hat{\phi}(\pi(\theta|X_D))$ with respect to the marginal distribution $f_X$. We can achieve this using Monte Carlo integration over a sample $x_D^{(1)}, \dots, x_D^{(G)}$ from $f_X$:
$$\hat{\psi}(f) = \frac{1}{G} \sum_{i=1}^{G} \hat{\phi}\big(\pi(\theta|x_D^{(i)})\big).$$
The ABC procedure to obtain the estimate $\hat{\phi}(\pi(\theta|x_D))$ given $x_D$ is as follows.
(2) For each $\theta_i$, sample from $f(x|\theta_i)$ to obtain a sample $\{x^{(1)}, \dots, x^{(H)}\}$. This gives a sample from the joint distribution $f_{X,\theta}$.
Steps 1-4 need to be conducted only once at the outset for each $f \in \mathcal{F}$; only Steps 5-7 have to be repeated for each $x_D$. For the linear functional, explained above, we do not even need to compute the posterior distribution, $\pi(\theta|x_D)$, if we are happy to use the naive approximation to the double integral, based on the joint sample $(\theta_i, x^{(i)})$, $i = 1, \dots, H$:
$$\hat{\psi}(f) = \frac{1}{H} \sum_{i=1}^{H} U(\theta_i, x^{(i)}).$$
The optimum of $\psi(f)$ over $f \in \mathcal{F}$ may be found by employing any suitable optimization method. In this paper, we intend to focus on the computation of $\hat{\psi}(f)$. Therefore, in the illustrative examples below, we take a "crude" optimization approach, that is, we estimate $\psi(f)$ for a fixed set of possible choices for $f$ and compare the estimates.
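The overall scheme (simulate once from the joint distribution, then reuse that sample to estimate $\hat{\phi}$ for each simulated observation and average) can be sketched on a toy model. The following is a minimal sketch under loudly stated assumptions, not the authors' implementation: it uses a plain ABC rejection variant that may differ in detail from the numbered procedure, a hypothetical model $\theta \sim N(0, 1)$, $x|\theta \sim N(\theta, \sigma^2)$ with the design being the choice of $\sigma$, and the non-linear functional $\phi = -\mathrm{Var}(\theta|x)$. The names `abc_phi`, `abcd_psi`, and the values of `H`, `G`, `eps` are illustrative.

```python
import random
import statistics

def simulate_joint(H, sigma, rng):
    """Draw (theta_j, x_j), j = 1..H, from the joint distribution of the toy model."""
    sample = []
    for _ in range(H):
        theta = rng.gauss(0.0, 1.0)               # prior draw
        sample.append((theta, rng.gauss(theta, sigma)))  # sampling distribution draw
    return sample

def abc_phi(joint, x_obs, eps):
    """ABC rejection estimate of phi = -Var(theta | x_obs): keep theta_j with |x_j - x_obs| <= eps."""
    accepted = [theta for theta, x in joint if abs(x - x_obs) <= eps]
    if len(accepted) < 2:
        return None                                # too few accepted draws to estimate a variance
    return -statistics.variance(accepted)

def abcd_psi(sigma, H=20000, G=50, eps=0.1, seed=1):
    """Monte Carlo estimate of psi = E_X[phi(pi(theta | X))] for the design sigma."""
    rng = random.Random(seed)
    joint = simulate_joint(H, sigma, rng)          # simulated once, reused for every x_obs
    estimates = []
    for _ in range(G):
        theta = rng.gauss(0.0, 1.0)
        x_obs = rng.gauss(theta, sigma)            # draw from the marginal distribution of X
        phi_hat = abc_phi(joint, x_obs, eps)
        if phi_hat is not None:
            estimates.append(phi_hat)
    return sum(estimates) / len(estimates)

# "Crude" optimization: compare the estimates over a fixed set of candidate designs
for sigma in (0.5, 1.0, 2.0):
    print(sigma, abcd_psi(sigma))
```

For this toy model the exact value is $\psi = -\sigma^2/(1 + \sigma^2)$, which gives a reference against which the effect of `H`, `G` and `eps` on the estimate can be judged.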
The basic technique of ABCD was introduced in Hainy et al. [29], but here, we present it fully embedded into statistical learning theory. Note that related but different procedures utilizing MCMC chains were independently developed in Drovandi and Pettitt [30] and Hainy et al. [31].
We now present two examples that are meant to illustrate the applicability of ABCD to very general design problems using non-linear design criteria. Although these examples are rather simple and may also be solved by analytical or numerical methods, their generalizations become intractable using traditional methods.

Selective Sampling
When the background sampling distribution is $f(x|\theta)$, we may impose prior constraints on which data we accept for use. Such models, in greater generality, may occur when observation is cheap, but the use of observation is expensive, for example computationally. We can call this "selective sampling", and we present a simple example.
Suppose in a one-dimensional problem that we are only allowed to accept observations from two slits of equal width at $z_1$ and $z_2$. Here, the model is equivalent (in the limit as the slit widths become small) to replacing $f(x|\theta)$ by the discrete distribution:
$$f(x|\theta, z_1, z_2) = \frac{f(z_i|\theta)}{f(z_1|\theta) + f(z_2|\theta)}, \quad x = z_i, \; i = 1, 2.$$
If we have a prior distribution $\pi(\theta)$ and $f(x|z_1, z_2) = \int f(x|\theta, z_1, z_2)\,\pi(\theta)\,d\theta$ denotes the marginal distribution of $x$, the posterior distribution is given by:
$$\pi(\theta|x, z_1, z_2) = \frac{f(x|\theta, z_1, z_2)\,\pi(\theta)}{f(x|z_1, z_2)}.$$
To simplify even further, we take as a criterion:
$$\phi(\pi(\theta|x, z_1, z_2)) = \max_\theta \pi(\theta|x, z_1, z_2).$$
The maximum is a limiting version of Tsallis entropy and is a learning functional. Now consider a special case: The preposterior: can be calculated explicitly. If z 2 ≥ z 1 and z i ∈ [−a, a], then: Next, we show how this example can be solved using ABCD. Due to the special structure of the sampling distribution in this example, we modified our ABC sampling strategy slightly.
We performed our ABC sampling strategy for this example for a range of values of the slit neighborhood length $\epsilon$ ($\epsilon = 0.005, 0.01, 0.05$), the ABC sample size $H$ ($H = 100$; 1,000; 10,000) and $K_z$ ($K_z = 50, 100, 200$) in order to assess the effect of these parameters on the accuracy of the ABC estimates of the criterion $\psi$. The most notable effect was found for the ABC sample size $H$. Figure 2 shows the estimated values of the criterion, $\hat{\psi}$, for the special case where $z_2 = -z_1$ and $a = 1.5$, with $\epsilon = 0.01$ and $K_z = 100$. The ABC sample size is $H = 100$ (left), $H = 1{,}000$ (center) and $H = 10{,}000$ (right). The criterion was evaluated at the eight points $z_1 = 0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5$. The theoretical criterion function $\psi(z_1)$ is plotted as a solid line.

Spatial Sampling for Prediction
This example is also a simple version of an important paradigm, namely optimal sampling of a spatial stochastic process for good prediction. Here, the stochastic process labeled X is indexed by a space variable z, and we write X i = X i (z i ), i = 1, . . . , n to indicate sampling at sites (the design) D n = {z 1 , . . . , z n }. We would typically take the design space, Z, to be a compact region.
We wish to compute the predictive distribution at a new site z n+1 , namely x n+1 (z n+1 ), given x D = x(D n ) = (x 1 (z 1 ), . . . , x n (z n )). In the Gaussian case, the background parameter θ could be related to a fixed effect (drift) or the covariance function of the process, or both. In the analysis, x n+1 is regarded as an additional parameter, and we need its (marginal) conditional distribution.
The criterion of interest is the maximum variance of the (posterior) predictive distribution over the design space:
$$\phi(x_D) = \max_{z_{n+1} \in Z} \mathrm{Var}\big(X_{n+1}(z_{n+1}) \,|\, x_D\big).$$
This functional is learnable, since it is a maximum of a set of variances, each one of which is learnable. Referring back to the general design optimization problem stated in (11), the posterior predictive distribution of $x_{n+1}$ may be interpreted as the posterior distribution in (11). The optimality criterion $\psi$ is found by integrating $\phi$ with respect to $X_1, \dots, X_n$.
The strategy is to select a design $D_n$ and then perform ABC at each test point $z_{n+1}$. The learning functional $\phi(x_D)$ is estimated by generating the sample $I = \{(x_D^{(j)}, x_{n+1}^{(j)})\}_{j=1}^{H}$ at the sites $\{z_1, z_2, \dots, z_n, z_{n+1}\}$ and calculating:
$$\hat{\phi}(x_D) = \max_{z_{n+1}} \widehat{\mathrm{Var}}\big\{x_{n+1}^{(j)} : j \in J_\epsilon(x_D)\big\},$$
where $J_\epsilon(x_D)$ indexes the elements of $I$ whose values at the design sites lie in an $\epsilon$-neighborhood of $x_D$. In order to estimate $\psi(D_n) = \mathrm{E}_{X_D}(\phi(X_D))$, we obtain a sample $O = \{x_D^{(i)}\}_{i=1}^{G}$ from the marginal distribution of the random field at the design $D_n$ and perform Monte Carlo integration:
$$\hat{\psi}(D_n) = \frac{1}{G} \sum_{i=1}^{G} \hat{\phi}\big(x_D^{(i)}\big). \qquad (12)$$
For each $x_D^{(i)} \in O$ from the marginal sample, we use the same sample $I$ to compute $\hat{\phi}(x_D^{(i)})$ in order to save computing time. We then vary the design using some optimization algorithm.
A simple example is adopted from Müller et al. [32]. The observations $(x_1(z_1), x_2(z_2), x_3(z_3), x_4(z_4))$ are assumed to be distributed according to a one-dimensional Gaussian random field with mean zero, a marginal variance of one and $z_i \in [0, 1]$. We want to select an optimal design $D_3 = (z_1, z_2, z_3)$, such that:
$$\psi(D_3) = \mathrm{E}_{X_D}\Big\{\max_{z_4 \in [0,1]} \mathrm{Var}\big(X_4(z_4) \,|\, X_{D_3}\big)\Big\}$$
is minimal. We assume the Ornstein-Uhlenbeck process with correlation function $\rho(|s - t|; \theta) = e^{-\theta|s - t|}$. Two prior distributions for the parameter $\theta$ are considered. The first one is a point prior at $\theta = \log(100)$, so that $\rho(h) = \rho(h; \log(100)) = 0.01^{h}$. This is the correlation function used by Müller et al. [32] in their study of empirical kriging optimal designs. The second prior distribution is an exponential prior for $\theta$ with scale parameter $\lambda = 10$ (i.e., $\theta \sim \mathrm{Exp}(10)$). The scale parameter $\lambda$ was chosen such that the average correlation functions of the point and exponential priors are similar. By that, we mean that the average of the mean correlation function for the exponential prior over all pairs of sites $s$ and $t$, $\mathrm{E}_{s,t}[\mathrm{E}_\theta\{\rho(|s - t|; \theta) \,|\, \theta \sim \mathrm{Exp}(\lambda)\}] = \mathrm{E}_{s,t}[1/(1 + \lambda|s - t|)]$, matches the average of the fixed correlation function $\rho(|s - t|; \log(100)) = 0.01^{|s - t|}$ over all pairs of sites $s$ and $t$, $\mathrm{E}_{s,t}[0.01^{|s - t|}]$. The sites are assumed to be uniformly distributed over the coordinate space.
To be more specific, first, for each site $s \in \mathcal{X}$, the average correlation to all other sites $t \in \mathcal{X}$ is computed. Then, these average correlations are averaged over all sites $s \in \mathcal{X}$. For the point prior, the average correlation is
$$\mathrm{E}_{s,t}[\rho(|s - t|; \log(100))] = \frac{2}{\log(100)^2}\Big(\log(100) - \big(1 - \tfrac{1}{100}\big)\Big) = 0.3409,$$
and for the exponential prior, the value is
$$\mathrm{E}_{s,t}\big[1/(1 + \lambda|s - t|)\big] = \frac{2}{\lambda^2}\big((1 + \lambda)\log(1 + \lambda) - \lambda\big) = 0.3275 \quad \text{for } \lambda = 10.$$

We estimated the criterion on a grid with spacing 0.05 for the design points $z_1$ and $z_3$ ($z_2$ is fixed at $z_2 = 0.5$). We set $G = 1{,}000$, $H = 5 \cdot 10^6$ and $\epsilon = 0.01$ for each design point. The sample $\{x^{(j)}(z) : z \in Z\}_{j=1}^{H}$ is simulated at all points $z$ of the grid prior to the actual ABC algorithm. In order to accelerate the computations, it is then reused for all possible designs $D_3$ to estimate each $\hat{\phi}(x_D^{(i)})$, $i = 1, \dots, G$, in (12). The sample size $H = 5 \cdot 10^6$ was deemed to provide a sufficiently exhaustive sample from the four-dimensional normal vector $(x_1(z_1), x_2(z_2), x_3(z_3), x_4(z_4))$ for any $z_i \in Z$, so that the distortive effect of using the same sample for the computations of all $\hat{\phi}(x_D^{(i)})$ is only of negligible concern for our purposes of ranking the designs.

Consider first the case when the prior distribution of $\theta$ is the point prior at $\theta = \log(100)$. It can be seen that the minimum of the criterion is attained at about $(z_1, z_3) = (0.9, 0.1)$ or $(z_1, z_3) = (0.1, 0.9)$, which is comparable to the results obtained in Müller et al. [32] for empirical kriging optimal designs. Note that the diverging criterion values at the diagonal and at $z_1 = 0.5$ and $z_3 = 0.5$ are attributable to a specific feature of the ABC method used. At these designs, the actual dimension of the design is lower than three, so for a given $\epsilon$, there are more elements in the neighborhood than for the other designs with three distinct design points. Hence, a much larger fraction of the total sample is accepted into the ABC sample $\{x_{n+1}^{(j)} : j \in J_\epsilon(y_D)\}$. Therefore, the values of the criterion get closer to the marginal variance of one.
In order to avoid this effect, the parameter $\epsilon$ would have to be adapted in these cases. Alternatively, one could use other variants of ABC rejection, where a fixed number $N$ of the elements of the sample $I$ closest to the observed data is accepted.

For the exponential prior, due to the uncertainty about the prior parameter $\theta$, the optimal design points for $z_1$ and $z_3$ move slightly towards the edges, which is also in accordance with the findings of Müller et al. [32].
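The calibration of the exponential prior used in this example rests on two averages over pairs of sites that are uniformly distributed on $[0, 1]$. The sketch below evaluates both in closed form and cross-checks them numerically, using the fact that $h = |s - t|$ has density $2(1 - h)$ on $[0, 1]$. The closed form for the exponential prior is derived here from that density and should be read as an assumption of this sketch; the function names are hypothetical.

```python
import math

def avg_corr_point(theta):
    """E_{s,t}[exp(-theta |s - t|)] for s, t independent and uniform on [0, 1]."""
    return 2.0 / theta ** 2 * (theta - (1.0 - math.exp(-theta)))

def avg_corr_exponential(lam):
    """E_{s,t}[1 / (1 + lam |s - t|)] for s, t independent and uniform on [0, 1]."""
    return 2.0 / lam ** 2 * ((1.0 + lam) * math.log(1.0 + lam) - lam)

def midpoint_check(rho, n=2000):
    """Numerical average of rho(h) against the density 2(1 - h) of h = |s - t|."""
    step = 1.0 / n
    return sum(2.0 * (1.0 - (i + 0.5) * step) * rho((i + 0.5) * step) * step
               for i in range(n))

theta0 = math.log(100.0)   # point prior
lam = 10.0                 # scale parameter of the exponential prior

print(avg_corr_point(theta0), midpoint_check(lambda h: math.exp(-theta0 * h)))
print(avg_corr_exponential(lam), midpoint_check(lambda h: 1.0 / (1.0 + lam * h)))
```

Other values of `lam` can be tried in the same way to see how closely the two averages can be matched.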

Conclusions
There are some fundamental results in Bayesian learning which provide important background to fields like the optimal design of experiments. Functionals of prior distributions which are learnable, via observation, in a wide sense, are convex. Shannon information is an example, but there are many others, and the paper points to some wide classes with connections to other fields. It combines the theory of learning with an effective method for the optimal design of experiments based on simulation: ABCD. It is suggested that the method should prove useful in non-standard situations, such as non-linear, non-Gaussian models, and for complex problems where the sampling distribution is intractable, but one can still draw samples from it for given parameter values. A simple message is that the learning theory and simulation method apply to a generalized notion of an experiment as a choice of sampling distribution, under restrictions.