1. Introduction
The nested sampling Monte Carlo algorithm (proposed by [1] and recently reviewed in [2,3]) enables Bayesian inference by estimating the posterior and its integral. A population of $K$ live points is sampled randomly from the prior. Then, the lowest-likelihood point is discarded, and a new live point is sampled from the prior under the constraint that its likelihood must be higher than that of the discarded point. Nested sampling repeats this step iteratively. If the likelihood-restricted prior sampling (LRPS) is unbiased, the likelihoods of the discarded points have useful statistical properties. In particular, at each iteration the fraction of prior mass enclosed by the likelihood threshold shrinks by a factor $t_i \sim \mathrm{Beta}(K, 1)$, so after $i$ iterations the remaining prior volume is approximately $V_i \approx \exp(-i/K)$. The recursive estimation of the discarded prior mass, together with the sampled likelihoods, allows nested sampling to perform the integration.
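To make this bookkeeping concrete, the following minimal Python sketch outlines the nested sampling loop with the shrinkage-based volume estimate. Here, `sample_from_prior` and `lrps_sample` are hypothetical placeholders (for prior sampling and for the LRPS discussed below), and the simple expectation-based weighting scheme is one of several common choices.

```python
import numpy as np

def nested_sampling(loglike, sample_from_prior, lrps_sample, K=400, n_iter=10000):
    """Minimal nested sampling sketch with K live points.

    sample_from_prior(n): draws n prior samples (array of shape (n, d));
    lrps_sample(live, Lmin): likelihood-restricted prior sampler returning
    a new point and its log-likelihood (both are hypothetical placeholders).
    """
    live = sample_from_prior(K)
    live_logL = np.array([loglike(x) for x in live])
    logZ = -np.inf   # running estimate of the log-evidence
    logV = 0.0       # log of the remaining prior volume, V_0 = 1
    for i in range(n_iter):
        worst = np.argmin(live_logL)
        # expected per-iteration shrinkage: E[1 - t_i] = 1/(K+1),
        # so the discarded prior mass is roughly V_i / (K+1)
        logw = logV - np.log(K + 1.0)
        logZ = np.logaddexp(logZ, logw + live_logL[worst])
        logV -= 1.0 / K   # V_{i+1} ~ V_i * exp(-1/K) in expectation
        # replace the worst point via likelihood-restricted prior sampling
        live[worst], live_logL[worst] = lrps_sample(live, live_logL[worst])
    return logZ
```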
The first LRPS algorithms proposed were adaptations of Markov Chain Monte Carlo (MCMC) variants. For simplicity, we assume that the parameters have been transformed such that the prior probability density is uniform (the unit hypercube parameterization). Then, in a Gaussian Metropolis MCMC started from a randomly chosen live point, a random nearby point is proposed. If its likelihood exceeds the current threshold, it replaces the current point for the next iteration. This procedure is repeated $N_{\mathrm{steps}}$ times. After this, the final point is considered a sufficiently independent prior sample. Since the likelihood value is not used beyond the threshold comparison, this is a purely geometric random walk. The class of such step sampler algorithms [4] includes slice sampling [5], first proposed for LRPS by [6]. In Section 2.4, we describe ten members of a larger family of algorithms that includes slice sampling. These algorithms are popular because no additional parameters that influence the outcome need to be chosen, enabling robust application to a wide variety of problems. When the dimensionality $d$ of the inference problem is high, i.e., the fitted model has many parameters, such step samplers tend to outperform rejection-sampling-based algorithms (see [3] for a survey).
Given an inference problem with $d$ parameters, choosing an LRPS method and its parameters is not trivial. Detecting biased LRPS is a question of on-going research (e.g., [4,7,8]), and we give our definition of acceptable LRPS quality in Section 2.1. Our procedure for calibrating the number of steps $N_{\mathrm{steps}}$ is described in Section 2.3. The test problems are listed in Section 2.2 and the LRPS methods are described in Section 2.4. The evaluation results for the mixing behaviour of the various LRPS methods are presented in Section 3.1 and discussed in Section 4.
2. Materials and Methods
2.1. Distinguishing Good and Bad LRPS Performance
How can we judge whether one method performs better than another, or even acceptably well? We could run NS on example inference problems and analyse the produced posterior and integral. While this may be very realistic, the results would be limited to the geometry and likelihood slopes of that particular inference problem, leaving generalization to different data or models unclear. It is also difficult to define an objective criterion for whether the integral and posterior approximations are acceptable, in addition to the requirement that the true result must be available for comparison. As an alternative to such an “in vivo” test, we could explore the geometric mixing behaviour in isolation (“in vitro”). For example, starting from initial samples concentrated in a small ball, we could measure the time until the samples have converged to the true geometry. Here, the information gain could measure the quality, but acceptance thresholds are again unclear. Additionally, in NS, the situation is not static. Rather, a sequence of similar geometries is explored with ever-shrinking size. To address the shortcomings listed above, we instead adopt a verification test built for NS that is already familiar.
The shrinkage test was presented by [4]. It applies to problems where the volume enclosed by a likelihood threshold, $V(L^{\star})$, can be computed analytically. The shrinkage test analyses the distribution of volume ratios $t_i = V_i / V_{i-1}$ of the sequence of discarded samples produced by NS with an LRPS. This measures the critical issue of interest, namely, whether an unbiased LRPS passes the shrinkage to NS correctly, i.e., following the expected $\mathrm{Beta}(K, 1)$ distribution.
The shrinkage test can flag problematic LRPS methods. We collect shrinkage samples over 10,000 LRPS iterations with a fixed population of $K$ live points; configurations are rejected if a Kolmogorov–Smirnov (KS) test of the shrinkage distribution gives a p-value below 0.01. We consider a configuration to be the combination of an LRPS method with its hyperparameters. Additionally, if any iteration shows no movement (a live point is stuck), the configuration is also rejected. The LRPS is started with live points drawn perfectly from the geometry, and the initial iterations are not included in the sample collection to allow the LRPS method to warm up and calibrate.
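As an illustration, the sketch below implements the core of such a shrinkage check, assuming the enclosed volumes $V_i$ have already been computed analytically for each discarded point; the function name is ours, and the additional stuck-point rejection described above would be applied separately.

```python
import numpy as np
from scipy import stats

def shrinkage_test(volumes, K, alpha=0.01):
    """Shrinkage test sketch (function name illustrative).

    volumes: sequence V_0, V_1, ... of analytically computed volumes
    enclosed by the likelihood threshold at each NS iteration, obtained
    with K live points. Returns (p_value, passed).
    """
    volumes = np.asarray(volumes, dtype=float)
    t = volumes[1:] / volumes[:-1]      # per-iteration shrinkage ratios
    # for an unbiased LRPS, t ~ Beta(K, 1); equivalently u = t**K ~ Uniform(0, 1)
    u = t ** K
    p_value = stats.kstest(u, "uniform").pvalue
    # configurations with any stuck live point are rejected separately
    return p_value, p_value >= alpha
```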
On real inference problems, troublesome situations may occur at some arbitrary iteration and persist for some arbitrary length; this modulates the impact on evidence and posterior estimation. The shrinkage test is independent of this and, thus, provides conservative results. Since NS applications typically require several tens of thousands of iterations, our detections of deviations provide lower limits. Importantly, the shrinkage test results are independent of the likelihood slope and are therefore generalizable. Only the geometry of the likelihood contours is important, regardless of how informative the posterior is, whether there are phase transitions, etc. While the requirement that $V(L^{\star})$ can be computed analytically may seem restrictive, this is possible for difficult geometries (Section 2.2), including ellipsoidal, multi-ellipsoidal, convex and non-convex problems. We seek LRPS methods that behave robustly across such situations.
2.2. Geometries Considered
Likelihood contours can be classified as convex or non-convex, depending on whether the straight line between any two points within the restricted prior lies entirely inside it. A sub-class of the convex geometries is ellipsoidal contours, arising from Gaussian or other ellipsoidal posterior distributions. This classification can also be applied locally. The information gain per parameter influences how the contour changes shape from iteration to iteration.
Our goal here is to span the space of geometries across dimensionality, while simultaneously avoiding excessive computational cost. Therefore, we sparsely sample the problem space of geometries with six setups: (1) an ellipsoidal problem, a correlated Gaussian with a unit-diagonal covariance matrix and non-zero off-diagonal covariances, in 16 and 100 dimensions; (2) a difficult convex problem, a hyper-pyramid whose log-likelihood is based on the maximum (supremum) norm of the parameters, in 4 and 16 dimensions; and (3) a non-convex problem, the Gaussian shell problem [9], in two and eight dimensions. The former two are self-similar across iterations, while the last is not, with the shell becoming thinner as the iterations progress. To avoid entering extreme situations that are infrequent in real applications, short runs of 3000 and 6000 iterations were performed with different seeds. They were repeated until the desired number of samples was collected.
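For the ellipsoidal setup (1), the volume required by the shrinkage test can be written down in closed form. The sketch below shows this calculation, assuming a Gaussian log-likelihood of the form $\log L = \log L_{\max} - r^2/2$ with Mahalanobis radius $r$; the function name and normalisation convention are ours for illustration, and only ratios of successive volumes matter for the test.

```python
import numpy as np
from scipy.special import gammaln

def enclosed_volume_gaussian(loglike_threshold, cov, loglike_max=0.0):
    """Volume enclosed by an ellipsoidal (correlated Gaussian) contour.

    For log L = loglike_max - 0.5 * r**2 with Mahalanobis radius r, the
    contour at threshold L* is an ellipsoid of radius
    r* = sqrt(2 * (loglike_max - log L*)), with volume
    V = V_ball(d) * sqrt(det cov) * r***d.
    """
    d = cov.shape[0]
    r = np.sqrt(2.0 * (loglike_max - loglike_threshold))
    # log-volume of the unit d-ball: pi^(d/2) / Gamma(d/2 + 1)
    log_ball = (d / 2.0) * np.log(np.pi) - gammaln(d / 2.0 + 1.0)
    logdet = np.linalg.slogdet(cov)[1]
    return np.exp(log_ball + 0.5 * logdet + d * np.log(r))
```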
The scope sampled by these geometries is limited and does not capture all real-world situations of interest. The scope of this work is to develop a methodology to evaluate LRPS behaviour and to discard configurations that already show problematic behaviour in these simple geometries. Nevertheless, there are some connections to realistic problems. For example, posteriors are often Gaussian away from the prior borders, a situation captured by setup (1). Setup (2) approximates the case where the data impose upper limits on each parameter independently. Non-linear banana degeneracies are a sub-set of setup (3). Multi-modal problems are left for future work. However, we can extrapolate in a limited way: methods that only consider local instead of global live-point correlations may be unaffected by inter-mode structure, and, thus, our results may hold for these. To emphasize, some reasonable situations are considered “in vitro” here to place limits on “in vivo” behaviour, but we do not aim to consider the most difficult case conceivable.
2.3. Calibrating the Number of Steps
To find the number of steps $N_{\mathrm{steps}}$ required to sample the geometry correctly, a run as described above is performed. Upon rejection, $N_{\mathrm{steps}}$ is doubled until the test is passed. Since we want a lower limit usable by practitioners, and because the required $N_{\mathrm{steps}}$ should be a monotonically increasing function of the problem dimensionality, an improvement in computational efficiency can be made: if the test problems are ordered by $d$, then the next problem does not need to be tested with an $N_{\mathrm{steps}}$ value that already failed on a previous problem. Thus, we run the problems in order of $d$ and initialize the search for $N_{\mathrm{steps}}$ at the value that was acceptable for the previous problem. This automatically finds a conservative $N_{\mathrm{steps}}(d)$ curve across the different problems.
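The doubling search can be summarised in a few lines; `run_shrinkage_test` is a placeholder that performs the NS run with the given $N_{\mathrm{steps}}$ and applies the acceptance criteria of Section 2.1.

```python
def calibrate_nsteps(problems, run_shrinkage_test, nsteps_init=1):
    """Sketch of the N_steps calibration across problems ordered by d.

    run_shrinkage_test(problem, nsteps) is a placeholder returning True
    if the shrinkage test passes and no live point got stuck.
    """
    nsteps = nsteps_init
    required = []
    for problem in problems:                 # must be ordered by increasing d
        while not run_shrinkage_test(problem, nsteps):
            nsteps *= 2                      # double until acceptable
        required.append((problem, nsteps))   # reused as start for the next problem
    return required
```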
2.4. LRPS Methods Considered
Multiple algorithms are considered. Within a single step, they have similar behaviour, which we describe first. From a starting point $x$, a direction $v$ is chosen. To sample a point on the slice $x' = x + t\,v$, with $t$ uniformly sampled, bounds on $t$ first need to be found. Given a guess length $L$, the points at $t = L$, $t = 2L$, etc., are tested until a point lies outside the restricted prior. The same is repeated for negative $t$ values, yielding an interval $(t^{-}, t^{+})$. Then, $t$ can be uniformly sampled between $t^{-}$ and $t^{+}$. If the point $x + t\,v$ is rejected, subsequent tries can set $t^{-} = t$ (if $t$ is negative) or $t^{+} = t$ (if $t$ is positive). This is akin to bisection and assumes convexity [5]. When a point is accepted, the sampling is restarted from there. This is done $N_{\mathrm{steps}}$ times. The number of model evaluations per NS iteration is therefore approximately $N_{\mathrm{steps}} \times (i + j)$, where the number of stepping-out iterations $i$ and bisection steps $j$ are subject to the behaviour of the likelihood constraint and the direction proposal. The guess length $L$ is increased if the initial interval had to be expanded, or decreased by 10% otherwise.
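A compact sketch of a single slice sampling step, with the stepping-out and interval-shrinking phases described above, is given below; `inside` is a placeholder for the likelihood-restricted prior membership test, and the adaptive update of the guess length $L$ is omitted.

```python
import numpy as np

def slice_step(x, v, inside, L=1.0, rng=None):
    """One slice sampling step from x along direction v (sketch).

    inside(p): True if p lies within the likelihood-restricted prior
    (placeholder). Returns the accepted point and the number of
    constraint evaluations used.
    """
    rng = rng or np.random.default_rng()
    n_evals = 0
    # step out: test t = L, 2L, ... until a point falls outside
    t_hi = L
    while inside(x + t_hi * v):
        n_evals += 1
        t_hi += L
    n_evals += 1
    # repeat for negative t
    t_lo = -L
    while inside(x + t_lo * v):
        n_evals += 1
        t_lo -= L
    n_evals += 1
    # shrink: sample t uniformly; on rejection, move the bound on t's side
    while True:
        t = rng.uniform(t_lo, t_hi)
        n_evals += 1
        x_new = x + t * v
        if inside(x_new):
            return x_new, n_evals
        if t < 0:
            t_lo = t
        else:
            t_hi = t
```

In the full procedure, this step would be repeated $N_{\mathrm{steps}}$ times with a fresh direction each time, with $L$ adapted between steps.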
We consider ten algorithms which differ in their direction proposal. The algorithm variants are the following (a code sketch of representative direction proposals is given after the list):
cube-slice: In slice sampling, one parameter is randomly chosen as the slice axis [
5];
region-slice: Whitened slice sampling, randomly choosing a principal axis based on the estimated covariance matrix;
region-seq-slice: Whitened slice sampling, iteratively choosing a principal axis [
10];
cube-harm: Hit-and-run sampling chooses a random direction from the unit sphere [11], combined with the slice bisection described above [12];
region-harm: Whitened hit-and-run sampling by randomly choosing a ball direction using the sample covariance;
cube-ortho-harm, region-ortho-harm: the same as cube-harm and region-harm, respectively, but a sequence of $d$ proposals is prepared and made mutually orthogonal by Gram–Schmidt orthogonalisation; these directions are then used in order. This is similar to the PolyChord [13] implementation;
de-harm: Differential evolution (randomly choose a pair of live points and use their differential vector) [14];
de1: same as de-harm, but set the difference in all but one randomly chosen dimension to zero. This is very similar to cube-slice, but scales the direction using the distance between points;
de-mix: randomly choose between de-harm and region-slice at each step with equal probability.
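As referenced above, the following sketch illustrates four representative direction proposals. It assumes `live` is a $(K, d)$ array of the current live points and `rng` a NumPy random generator; the scalings (e.g., by the principal-axis standard deviation) are illustrative choices rather than the exact implementations.

```python
import numpy as np

def cube_slice_direction(x, live, rng):
    """cube-slice: unit vector along one randomly chosen coordinate axis."""
    v = np.zeros(len(x))
    v[rng.integers(len(x))] = 1.0
    return v

def cube_harm_direction(x, live, rng):
    """cube-harm: isotropic random direction on the unit sphere."""
    v = rng.normal(size=len(x))
    return v / np.linalg.norm(v)

def region_slice_direction(x, live, rng):
    """region-slice: a random principal axis of the live-point covariance."""
    eigval, eigvec = np.linalg.eigh(np.cov(live, rowvar=False))
    i = rng.integers(len(x))
    return eigvec[:, i] * np.sqrt(eigval[i])   # scaled by its standard deviation

def de_harm_direction(x, live, rng):
    """de-harm: differential vector between two randomly chosen live points."""
    i, j = rng.choice(len(live), size=2, replace=False)
    return live[i] - live[j]
```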
The phase from proposing a slice until the acceptance of a point on that slice is counted as a single slice sampling step towards $N_{\mathrm{steps}}$. This is also the case for methods where subsequent proposals are prepared together (e.g., region-seq-slice, cube-ortho-harm, region-ortho-harm). Some algorithms take advantage of the sample covariance matrix estimated from the live points, and its principal axes, to whiten the problem space. We re-estimate the covariance matrix at regular intervals of nested sampling iterations, i.e., each time the space has shrunk by an approximately constant factor.
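For the methods that prepare a batch of directions in advance (the *-ortho-harm variants), the orthogonalisation can be done as follows; the QR factorisation performs the Gram–Schmidt-style step, and `propose` is a zero-argument placeholder returning a length-$d$ direction vector from any of the proposals sketched earlier.

```python
import numpy as np

def ortho_directions(propose, d):
    """Prepare d proposed directions and orthogonalise them (Gram-Schmidt),
    as in cube-ortho-harm / region-ortho-harm; they are then used in order."""
    V = np.array([propose() for _ in range(d)])   # shape (d, d), rows = proposals
    Q, _ = np.linalg.qr(V.T)   # columns of Q: orthonormalised proposals, in order
    return Q.T                 # rows are the orthogonal directions to use
```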
Detailed balance is satisfied by the cube-* and de-* methods. When the whitening is computed from the live points, detailed balance is not retained. Since detailed balance is a sufficient but not a necessary condition for a valid Metropolis proposal, this may or may not lead to convergence issues. The shrinkage test should reveal any such issues.
4. Discussion
Methods can now be compared by the relative positions of their curves in Figure 3. Slice sampling (blue squares) is the most inefficient among the considered methods. The single-parameter differential proposal (blue circle) improves the efficiency only marginally. Hit-and-run Monte Carlo (red squares) is more than twice as efficient as these. Orthogonalisation helps further (yellow squares). The differential proposal (de-harm) has comparable performance to these, with slightly better behaviour at low $d$ than at high $d$.
The results from the whitened proposals (region-*) are diverse. To inform the discussion of these results, two opposing insights seem relevant. Firstly, whitening a space, or more accurately, adapting the proposal to the distribution being sampled, is a common MCMC technique to improve proposal efficiency. For example, in a highly correlated Gaussian, sampling along one coordinate axis, then another, will only diffuse over short distances, while sampling along the principal axes generates distant jumps. This may explain the poor performance of the cube-slice sampler (blue curve in Figure 3) relative to region-slice (green crosses). A related effect could be that picking the same axis twice in a row leads to unnecessary reversals. However, our results show that efficiency worsens when the directions are chosen sequentially (region-seq-slice; gray crosses) instead of randomly (region-slice; green crosses).
Secondly, if the adaptation is computed from the live points, an iterative bias could be induced. Indeed, it is usually recommended to perform a warm-up or burn-in phase that adjusts the proposal before a sampling phase, in which the proposal is fixed and the samples are trusted. Ref. [15] observed that in the affine-invariant ensemble sampler [16], which uses differential vectors from one half of the live points to update the other half, the iterative updating can lead to a collapse onto a sub-space and hinder convergence to the true distribution at high dimension. Here, we observe that, while region-slice and cube-harm both perform efficiently, region-harm (purple crosses) performs very poorly at the highest dimensionalities tested. This may be a similar self-reinforcing behaviour. The orthogonalisation of region-harm (black crosses) helps, but still requires a high number of steps at high dimension. It is interesting that cube-harm does not show this behaviour. We can therefore conclude that hit-and-run and whitening are good ideas separately; however, the iterative estimation of covariance matrices during the run may lead to problematic behaviour.
It is surprising that region-slice outperforms cube-slice even for symmetric problems such as the shell and the pyramid. This may be because the sample covariance estimation is an additional source of noise and thus improves mixing. If this is true, then it is perhaps wiser not to rely on this and instead use a hit-and-run (harm) step to introduce more directional diversity, at least in some NS iterations.
Among the algorithms discussed so far, region-slice performs best and shows a relatively flat scaling with dimensionality. This, however, may be driven by chance in the 100-dimensional data point. That said, even if that data point had to be corrected down by a factor of two, it would still be within the trend.
The differential evolution hit-and-run variant lies in the middle, performing similarly to the other slice and hit-and-run methods, with no strong outliers. It outperforms cube-slice and cube-de, and behaves similarly to region-slice. This suggests that this method is not susceptible to collapse in the same way as whitened hit-and-run.
The best performance overall is the mixture of de-harm and region-slice (de-mix). It outperforms both individual methods. In particular, in the shell problems, we speculate that the differential vector proposal helps circumnavigate the shell.
Finally, how do these results apply to real-world NS analyses? The shrinkage test in our setup is sensitive to approximately 1% deviations in the mean shrinkage over 10,000 iterations. This is directly related to the integral accuracy in NS and corresponds approximately to the accuracy desired in real NS runs, which are dozens of times longer. Given the results of this limited study, the de-mix, region-slice and cube-ortho-harm methods appear promising for future research and application. The minimally necessary number of steps $N_{\mathrm{steps}}$, obtained following the calibration procedure of Section 2.3, is listed in Table 1, together with the efficiencies observed in this work. Future simulation studies should extend the scope to asymmetric problems, where some parameters shrink faster than others, and to multi-modal problems, as well as to other step samplers. The tested proposals are implemented in the nested sampling package UltraNest, available at https://johannesbuchner.github.io/UltraNest/ (accessed on 1 January 2023), enabling efficient low- and high-dimensional inference for a large class of practical inference problems relevant to cosmology, particle physics and astrophysics.
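As a closing illustration, a minimal UltraNest usage sketch with a step sampler attached is shown below. The toy likelihood, prior transform and nsteps choice are ours for illustration, and the exact step-sampler class and argument names may differ between UltraNest versions (this follows the documented API of recent 3.x releases).

```python
import numpy as np
import ultranest
import ultranest.stepsampler

# illustrative toy problem: independent Gaussian in d dimensions
d = 16
param_names = ["p%d" % i for i in range(d)]

def loglike(theta):
    return -0.5 * np.sum(theta**2)

def transform(cube):
    return 10.0 * (cube - 0.5)   # map the unit cube to [-5, 5]^d

sampler = ultranest.ReactiveNestedSampler(param_names, loglike, transform)
# attach a slice step sampler with a mixture direction proposal
# (similar in spirit to de-mix)
sampler.stepsampler = ultranest.stepsampler.SliceSampler(
    nsteps=2 * d,   # illustrative; calibrate as in Section 2.3
    generate_direction=ultranest.stepsampler.generate_mixture_random_direction,
)
result = sampler.run()
sampler.print_results()
```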