Proximal nested sampling with data-driven priors for physical scientists

Proximal nested sampling was introduced recently to open up Bayesian model selection for high-dimensional problems such as computational imaging. The framework is suitable for models with a log-convex likelihood, which are ubiquitous in the imaging sciences. The purpose of this article is two-fold. First, we review proximal nested sampling in a pedagogical manner in an attempt to elucidate the framework for physical scientists. Second, we show how proximal nested sampling can be extended in an empirical Bayes setting to support data-driven priors, such as deep neural networks learned from training data.


Introduction
In much of the sciences not only is one interested in estimating the parameters of an underlying model, but deciding which model is best among a number of alternatives is of critical scientific interest.Bayesian model comparison provides a principled approach to model selection [1] that has found widespread application in the sciences [2].
Bayesian model comparison requires computation of the model evidence: also called the marginal likelihood, where y ∈ R m denotes data, x ∈ R n parameters of interest, and M the model under consideration.We adopt the shorthand notation for the likelihood of L(x) = p(y | x, M) and prior of π(x) = p(x | M).Evaluating the multidimensional integral of the model evidence is computationally challenging, particularly in high dimensions.While a number of highly successful approaches to computing the model evidence have been developed, such as nested sampling [e.g.[2][3][4][5][6][7][8] and the learned harmonic mean estimator [9][10][11], previous approaches do not scale to the very high-dimensional settings of computational imaging, which is our driving motivation.The proximal nested sampling framework was introduced recently by a number of authors of the current article in order to open up Bayesian model selection for highdimensional imaging problems [12].Proximal nested sampling is suitable for models for which the likelihood is log-convex, which are ubiquitous in the imaging sciences.By restricting the class of models considered, it is possible to exploit structure of the problem to enable computation in very high-dimensional settings of O(10 6 ) and beyond.
Proximal nested sampling draws heavily on convex analysis and proximal calculus.In this article we present a pedagogical review of proximal nested sampling, sacrificing some mathematical rigor in an attempt to provide greater accessibility.We also provide a concise review of convexity and proximal calculus to introduce the background underpinning the framework.We assume the reader is familiar with nested sampling, hence we avoid repeating an introduction to nested sampling and instead refer the reader to other sources that provide excellent descriptions [2,3,8].Finally, for the first time we show in an empirical arXiv:2307.00056v2[stat.ME] 28 Jul 2023 Bayes setting how proximal nested sampling can be extended to support data-driven priors, such as deep neural networks learned from training data.

Convexity and proximal calculus
We present a concise review of convexity and proximal calculus to introduce the background underpinning proximal nested sampling to make it more accessible.

Convexity
Proximal nested sampling draws on convexity, key concepts of which are illustrated in Figure 1.A set C is convex if for any x 1 , x 2 ∈ C and α ∈ (0, 1) we have αx The function f is convex if and only if its epigraph is convex.A convex function is lower semicontinuous if its epigraph is closed (i.e.includes its boundary).

Proximity operator
Proximal nested sampling leverages proximal calculus [13,14], a key component of which is the proximity operator, or prox.The proximity operator of the function f with parameter λ is defined by The proximity operator maps a point x towards the minimum of f , while remaining in the proximity of the original point.The parameter λ controls how close the mapped point remains to x.An illustration is given in Figure 2.
The proximity operator can be considered as a generalisation of the projection onto a convex set.Indeed, the projection operator can be expressed as a prox by with function f given by the characteristic function χ C (x) = ∞ if x / ∈ C and zero otherwise.

Moreau-Yosida regularisation
The final component required in the development of proximal nested sampling is Moreau-Yosida regularisation [e.g.14].The Moreau-Yosida envelop of a convex function f : R n → R is given by the infimal convolution: The Moreau-Yoshida envelope of a function can be interpreted as taking its convex conjugate, adding regularisation, before taking the conjugate again [14].Consequently, it Illustration of the proximity operator (reproduced from [14]).The proximal operator maps the blue points to red points (i.e. from base to head of arrows).The thick black line defines the domain boundary, while the thin black lines define level-sets (iso-contours) of f .The proximity operator maps points towards the minimum of f , while remaining in the proximity of the original point.
provides a smooth regularised approximation of f , which is very useful to enable the use of gradient-based computational algorithms [e.g.15].The Moreau-Yosida envelop exhibits the following properties.First, λ controls the degree of regularisation with f λ (x) → f (x) as λ → 0. Second, the gradient of the Moreau-Yosida envelope of f can be computed through its prox by ∇ f λ (x) = (x − prox λ f (x))/λ.

Proximal nested sampling
The challenge of nested sampling in high-dimensional settings is to sample from the prior distribution subject to a hard likelihood constraint [2,3,8].Proximal nested sampling addresses this challenge for the case of log-convex likelihoods, which are widespread in computational imaging problems.In this section we review the proximal nested sampling framework [12] in a pedagogical manner, sacrificing some mathematical rigor in an attempt to improve readability and accessibility.

Constrained sampling formulation
Consider a prior and likelihood π(x) ∝ exp(− f (x)) and L(x) ∝ exp(−g(x)), where the log-likehood g = − log L is a convex lower semicontinuous function.The log-prior f = − log π need only be differentiable or convex (it need not be convex if it is differentiable).
We consider sampling from the prior π(x), such that L(x) > L * for some likelihood value L * ≥ 0. Let ι L * (x) and χ L * (x) be the indicator function and characteristic function corresponding to this constraint, respectively, defined as To sample from the constrained prior we require sampling techniques that firstly can scale to high-dimensional settings and that secondly can support the convex constraint χ B τ (x).

Langevin MCMC sampling
Langevin Markov chain Monte Carlo (MCMC) sampling has been demonstrated to be highly effective at sampling in high-dimensional settings by exploiting gradient information [15,16].The Langevin stochastic differential equation associated with distribution p(x) is a stochastic process defined by where w(t) is Brownian motion.This process converges to p(x) as time t increases and is therefore useful for generating samples from p(x).In practice we compute a discrete-time approximation of x(t) by the conventional Euler-Maruyama discretisation: where w (k) is a sequence of standard Gaussian random variables and δ is a step size.Equation 8 provides a strategy for sampling in high-dimensions.However, notice that the updates rely on the score of the target distribution ∇ log p(•).Nominally the target distribution must therefore be differentiable, which is not the case for our target of interest given by Equation 6.The prior may or may not be differentiable but the likelihood constraint certainly is not.Proximal versions of Langevin sampling have been developed to address the setting where the distribution is log-convex but not necessarily differentiable [15,16].We follow a similar approach.

Proximal nested sampling framework
The proximal nested sampling framework follows by taking the constrained sampling formulation of Equation 6, adopting Langevin MCMC sampling of Equation 8, and applying Moreau-Yosida regularisation of Equation 4to the convex constraint χ B τ (x) to yield a differentiable target.This strategy yields (see [12]) the update equation: where δ is the step size and λ is the Moreau-Yosida regularisation parameter.Further intuition regarding proximal nested sampling can be gained by examining the term v (k) = −[x (k) − prox χ Bτ (x (k) )], together with Figure 3.The vector v (k) points from the sample x (k) to its projection onto the likelihood constraint.If the sample x (k) is already in the likelihood-restricted prior support B τ , i.e. x ∈ B τ , the term v (k) disappears and the Markov chain iteration simply involves the standard Langevin MCMC update.In contrast, if x (k) is not in B τ , i.e. x / ∈ B τ , then a step is taken in the direction v (k) , which acts to move the next iteration of the Markov chain in the direction of the projection of x (k) onto the convex set B τ .This term therefore acts to push the Markov chain back into the constraint set B τ if it wanders outside of it. 1e have so far assumed that the (log) prior is differentiable (see Equation 9).This may not be the case, as is typical for sparsity-promoting priors (e.g.− log π(x) = ∥Ψ † x∥ 1 + const.for some wavelet dictionary Ψ).Then we make a Moreau-Yosida approximation of the log-prior, yielding the update equation: For notational simplicity here we have adopted the same regularisation parameter λ for each Moreau-Yosida approximation.
With the current formulation we are not guaranteed to recover samples from the prior subject to the hard likelihood constraint due to the approximation introduced in the Moreu- x (k−1) x (k+1) Yosida regularisation and due to the approximation in discretising the underlying Langevin stochastic differential equation.We therefore introduce a Metropolis-Hastings correction step to eliminate the bias introduced by these approximations and ensure convergence to the required target density (see [12] for further details).Finally, we adopt this strategy for sampling from the constrained prior in the standard nested sampling strategy to recover the proximal nested sampling framework.The algorithm can be initalised with samples from the prior as described by the update equations above but with the likelihood term removed, i.e. with [x (k) − prox χ Bτ (x (k) )] → 0.

Explicit forms of proximal nested sampling
While we have discussed the general framework for proximal nested sampling, we have yet to address the issue of computing the proximity operators involved.As Equation 2demonstrates, computing proximity operators involves solving an optimisation problem.Only in certain cases are closed form solutions available [13].Explicit forms of proximal nested sampling must therefore be considered for the problem at hand.We focus on a common high-dimensional inverse imaging problem where we acquire noisy observations y = Φx + n, of an underlying image x via some measurement model Φ, in the presence of Gaussian noise n (without loss of generality we consider independent and identically distributed noise here).We consider a Gaussian negative likelihood, − log L(x) = y − Φx 2 2 /2σ 2 + const., and a sparsity-promoting prior, − log π(x) = µ Ψ † x 1 + const., for some wavelet dictionary Ψ.The prox of the prior can be computed in closed-form by [13] where soft λ (•) is the soft thresholding function with threshold λ (recall µ is the scale of the sparsity-promoting prior, i.e. the regularisation parameter, defined above).However, the prox of the likelihood is not so straightforward.The prox for the likelihood can be recast as a saddle-point problem that can be solved iteratively by a primal dual method initialised by the current sample position (see [12] for further details): 1. , Combining these algorithms to efficiently compute prox operators with the proximal nested sampling framework, we can compute the model evidence to perform Bayesian model comparison in high-dimensional settings.We can also obtain posterior distributions with the usual weighted samples from the dead points of nested sampling.This allows one to recover, for example, point estimates such as the posterior mean image.

Deep data-driven priors
While hand-crafted priors, such as wavelet-based sparsity promoting priors, are common in computational imaging, they provide only limited expressivity.If example images are available an empirical Bayes approach with data-driven priors can be taken, where the prior is learned from training data.Since proximal nested sampling requires only the log-likelihood to be convex, complex data-driven priors, such as represented by deep neural networks, can be integrated into the framework.Through Tweedie's formula we describe how proximal nested sampling can be adapted to support data-driven priors, opening up Bayesian model selection for data-driven approaches.We take a similar approach to [19], where data-driven priors are integrated into Langevin MCMC sampling strategies, although in that work model selection is not considered.

Tweedie's formula and data-driven priors
Tweedie's formula is a remarkable result in Bayesian estimation credited to personal correspondence with Maurice Kenneth Tweedie [20].Tweedie's formula has gained renewed interest in recent years [19,[21][22][23] due to its connection to score matching [24][25][26] and denoising diffusion models [27,28], which as of this writing provide state-of-the-art performance in deep generative modelling.
Tweedie's result follows by considering the following scenario.Consider x sampled from a prior distribution q(•) and noisy observations z ∼ N (x, σ 2 I).Tweedie's formula gives the posterior expectation of x given z as where p(z) is the marginal distribution of z (for further details see, e.g., [21]).The critical advantage of Tweedie's formula is that it does not require knowledge of the underlying distribution q(•) but rather only the marginalised distribution of the observation.Equation 12 can be interpreted as a denoising strategy to estimate x from noisy observations z.Moreover, Tweedie's formula can also be used to relate a denoiser (potentially a trained deep neural network) to the score ∇ log p(z).
In a data-driven setting, where the underlying prior is implicitly specified by training data (which are considered to be samples from the prior), there is no guarantee that the underlying prior, and therefore the posterior, is well-suited for gradient-based Bayesian computation such as Langevin sampling, e.g. it may not be differentiable.Therefore we consider a regularised version of the prior defined by Gaussian smoothing: This regularisation can also be viewed as adding a small amount of regularising Gaussian noise.We can therefore leverage Tweedie's formula to relate the regularised prior distribution p ϵ (x) to a denoiser D ϵ trained to recover x from noisy observations x ϵ ∼ N (x, ϵI), i.e. the score of the regualised prior can be related to the denoiser by Denoisers are commonly integrated in proximal optimisation algorithms in replace of proximity operators, giving rise to so-called plug-and-play (PnP) methods [29,30] and, more recently, also into Bayesian computational algorithms [19].Typically denoisers are represented by deep neural networks, which can be trained by injecting a small amount of noise in training data and learning to denoise the corrupted data.While a noise level ϵ needs to be chosen, as discussed above this is considered a regularisation of the prior and so the denoiser need not be trained on the noise level of a problem at hand.In this manner, the same denoiser can be used for multiple subsequent problems (hence the PnP name).The learned score of the regularised prior inherits the same properties as the denoiser, such as smoothness, hence the denoiser should be considered carefully.Well-behaved denoisers have been considered already in PnP methods (in order to provide convergence guarantees) and a popular approach for imaging problems is the DnCNN model [30], which is based on a deep convolutional neural network, and that is (Lipschitz) continuous.

Proximal nested sampling with data-driven priors
By Tweedie's formula the standard proximal nested sampling update of Equation 9 can be revised to integrate a learned denoiser, yielding where we have included a regularisation parameter α that allows us to balance the influence of the prior and the data fidelity terms [19].We typically consider a deep convolutional neural network based on the DnCNN model [30] since it is (Lipschitz) continuous and has been demonstrated to perform very well in PnP settings [19,30].Again, this sampling strategy can then be integrated into the standard nested sampling framework.We can therefore support data-driven priors in the proximal nested sampling framework by integrating a deep denoiser that learns to denoise training data, using Tweedie's formula to relate this to the score of a regularised data-driven prior.

Numerical experiments
The new methodology presented allows us to perform Bayesian model comparison between a data-driven and hand-crafted prior(validation of proximal nested sampling in a setting where the ground truth can be computed directly has been performed already [12]).We consider a simple radio interferometric imaging reconstruction problem as an illustration.We assume the same observational model as Section 3.4, with white Gaussian noise giving a signal-to-noise ratio (SNR) of 15dB.The measurement operator Φ is a masked Fourier transform as a simple model of a radio interferometric telescope.The mask is built by randomly selecting 50% of the Fourier coefficients.A Gaussian likelihood is used in both models.For the hand-crafted prior we consider a sparsity-promoting prior using a Daubechis 6 wavelet dictionary.We base the data-driven prior on a DnCNN [30] model trained on galaxy images extracted from the IllustrisTNG simulations [31].We also consider an IllustrisTNG galaxy simulation, not used in training, as the ground truth test image.We generate samples following the proximal nested sampling strategies of Equation 10 and Equation 15 for the hand-crafted and data-driven priors, respectively.Posterior inferences (e.g.posterior mean image) and the model evidence can then be computed from nested sampling samples in the usual manner.The step size δ is set to 10 −7 , the Moreau-Yosida regularisation parameter λ to 5 × 10 −7 , and the regularisation strength of wavelet-based model µ to 5 × 10 4 .We consider noise level ϵ ≃ 8.34 and set the regularisation parameter α of the data-driven prior to 3.5 × 10 −7 .For the nested sampling methods, the number of live and dead samples is set to 10 2 and 2.5 × 10 3 , respectively.For the Langevin sampling, we use a thinning factor of 20 and set the number of burn-in iterations to 10 2 .
Results are presented in Figure 4.The data-driven prior results in a superior reconstruction with an improvement in SNR of 1.2dB, although it may be difficult to tell simply from visual inspection of the recovered images.Computing the SNR of the reconstructed images requires knowledge of the ground truth, which clearly is not accessible in realistic settings involving real observational data.The Bayesian model evidence, computed by proximal nested sampling, proves a way to compare the hand-crafted and data-driven models without requiring knowledge of the ground truth and is therefore applicable in realistic scenarios.We compute log evidences of −2.96 × 10 3 for the hand-crafted prior and −1.35 × 10 3 for the data-driven prior.Consequently, the data-driven model is preferred by the model evidence, which agrees with the SNR levels computed from the ground truth.These results are all as one might expect since learned data-driven priors are more expressive than hand-crafted priors and can better adapt to model high-dimensional images.

Conclusions
Proximal nested sampling leverages proximal calculus to extend nested sampling to high-dimensional settings for problems involving log-convex likelihoods, which are ubiquitous in computational imaging.The purpose of this article is two-fold.First, we review proximal nested sampling in a pedagogical manner in an attempt to elucidate the framework for physical scientists.Second, we show how proximal nested sampling can be extended in an empirical Bayes setting to support data-driven priors, such as deep neural networks learned from training data.We show only preliminary results for learned proximal nested sampling and will present a more thorough study in a follow-up article.

1 .
Convex set (b) Non-convex set (c) Convex function (d) Non-convex function Figure Proximal nested sampling considers likelihoods that are log-convex and lower semicontinuous.A lower semicontinuous convex function has a convex and closed epigraph.

Figure 2 .
Figure 2. Illustration of the proximity operator (reproduced from[14]).The proximal operator maps the blue points to red points (i.e. from base to head of arrows).The thick black line defines the domain boundary, while the thin black lines define level-sets (iso-contours) of f .The proximity operator maps points towards the minimum of f , while remaining in the proximity of the original point.

Figure 3 .
Figure 3. Diagram illustrating proximal nested sampling.If a sample x (k) outside of the likelihood constraint is considered, then proximal nested sampling introduces a term in the direction of the projection of x (k) onto the convex set defining the likelihood constraint, thereby acting to push the Markov chain back into the constraint set B τ if it wanders outside of it.A subsequent Metropolis-Hastings step can be introduced to enforce strict adherence to the convex likelihood constraint.

Figure 4 .
Results of radio interferometric imaging reconstruction problem.(a) Ground truth galaxy image.(b) Dirty reconstruction based on pseudo-inverting the measurement operator Φ. (c) Posterior mean reconstruction computed from proximal nested samples for the hand-crafted wavelet-sparsity prior.(d) Posterior mean reconstruction for the data-driven prior based on a deep neural network (DnCNN) trained on example images.Reconstruction SNR is shown on each image.The computed SNR levels demonstrate that the data-driven prior results in a superior reconstruction quality, although this may not be obvious from a visual assessment of the reconstructed images.Computing the reconstructed SNR requires knowledge of the ground truth, which is not available in realistic settings.The Bayesian model evidence proves a way to compare the hand-crafted and data-driven models without requiring knowledge of the ground truth.For this example the Bayesian evidence correctly selects the data-driven prior as the best model.