An Analytical Approach to Bayesian Evidence Computation

García-Bellido, Juan

doi:10.3390/universe9030118

by

Juan García-Bellido

Departamento de Física Teórica C-XI, Universidad Autónoma de Madrid, Cantoblanco, 28049 Madrid, Spain

Universe 2023, 9(3), 118; https://doi.org/10.3390/universe9030118

Submission received: 1 February 2023 / Revised: 20 February 2023 / Accepted: 22 February 2023 / Published: 24 February 2023

(This article belongs to the Section Cosmology)

Abstract

Bayesian evidence is a key tool in model selection, allowing a comparison of models with different numbers of parameters. Its use in the analysis of cosmological models has been limited by difficulties in calculating it, with current numerical algorithms requiring supercomputers. In this paper we give exact formulae for the Bayesian evidence in the case of Gaussian likelihoods with arbitrary correlations and top-hat priors, and approximate formulae for the case of likelihood distributions with leading non-Gaussianities (skewness and kurtosis). We apply these formulae to cosmological models with and without isocurvature components, and compare with results we previously obtained using numerical thermodynamic integration. We find that the results are of lower precision than the thermodynamic integration, while still being good enough to be useful.

Keywords:

Bayesian evidence; Gaussian likelihoods; thermodynamic integration

1. Introduction

Model selection refers to the statistical problem of deciding which model description of observational data is the best [1,2]. It differs from parameter estimation, where the choice of a single model (i.e., choice of parameters to be varied) has already been made and the aim is to find their best-fitting values and ranges. While there have been widespread applications of parameter estimation techniques, usually likelihood fitting, to cosmological data, there has so far been quite limited application of model selection statistics [3,4,5,6,7,8,9,10,11,12]. This is unfortunate, as model selection techniques are necessary to robustly distinguish between models with different numbers of parameters, and many of the most interesting issues in cosmology concern the desirability or otherwise of incorporating additional parameters to describe new physical effects.

Within the context of Bayesian inference, model selection should be carried out using the Bayesian evidence [1,2], which measures the probability of the model in light of the observational data (i.e., the average likelihood over the prior distribution). The Bayesian evidence associates a single number with each model, and the models can then be ranked in order of the evidence, with the ratios of those values interpreted as the relative probability of the models. This process sets up a desirable tension between model simplicity and the ability to fit the data.

Use of the Bayesian evidence has so far been limited by difficulties in calculating it. The standard technique is thermodynamic integration [13,14], which varies the temperature in a Markov Chain Monte Carlo (MCMC) approach so that the distribution is sampled in a way that covers both the posterior and prior distributions. However, in recent work [12] we showed that, in order to obtain sufficiently accurate results in a cosmological context, around

10^{7}

likelihood evaluations are required per model. Such analyses are CPU-limited by the time needed to generate the predicted spectra to compare with the data, and this requirement pushes the problem into the supercomputer class (for comparison, parameter estimation runs typically employ

10^{5}

to

10^{6}

likelihood evaluations).

In this paper, we propose and exploit a new analytic method to compute the evidence based on an expansion of the likelihood distribution function. The method pre-supposes that the covariance of the posterior distribution has been obtained, for instance via an MCMC parameter estimation run, and in its present form requires that the prior distributions of the parameters are uniform top-hat priors.1 While the method will not be applicable for general likelihood distributions, we include the leading non-Gaussianities (skewness and kurtosis) in approximating the likelihood shape, with the expectation of obtaining good results whenever the likelihood distribution is sufficiently simple. Cosmological examples commonly exhibit likelihood distributions with only a single significant peak.

We apply the method both to toy model examples and to genuine cosmological situations. In particular, we calculate the evidences for adiabatic and isocurvature models, which we previously computed using thermodynamic integration in ref. [12]. We find that the discrepancies between the methods are typically no worse than 1 in ln(Evidence), meaning that the analytical method is somewhat less accurate than would be ideal, but is accurate enough to give a useful indication of model preference.

2. The Bayesian Evidence

The posterior probability distribution

P (θ, M | D)

for the parameters

θ

of the model

M

, given the data

D

, is related to the likelihood function

L (D | θ, M)

within a given set of prior distribution functions

π (θ, M)

for the parameters of the model, by Bayes’ theorem:

P (θ, M | D) = \frac{L (D | θ, M) π (θ, M)}{E (D | M)},

(1)

where E is the Bayesian evidence, i.e., the average likelihood over the priors,

E (D | M) = \int d θ L (D | θ, M) π (θ, M),

(2)

where

θ

is a vector with n components characterising the n independent parameters. The prior distribution function

π

contains all the information about the parameters before observing the data, i.e., our theoretical prejudices, our physical understanding of the model, and input from previous experiments.

In the case of a large number of parameters (

n ≫ 1

), the evidence integral cannot be performed straightforwardly and must be obtained either numerically or via an analytical approximation. Amongst numerical methods the most popular is thermodynamic integration [13,14] but this can be computationally intensive [12]. An alternative is the application of the nested sampling algorithm [15,16] and Monte Carlo methods with the stepping-stone sampling algorithm [17,18]. On the other hand, the simplest analytical approximation is the Laplace approximation, valid when the distribution can be approximated by a multivariate Gaussian. This may hold when the quantity and quality of the data is optimal, but is likely to be valid only in limited cosmological circumstances.

The Bayesian evidence is of interest because it allows a comparison of models amongst an exclusive and exhaustive set

{M_{i}}_{i = 1 \dots N}

. We can compute the posterior probability for each hypothesis given the data

D

using Bayes’ theorem:

P (M_{i} | D) \propto E (D | M_{i}) π (M_{i}),

(3)

where

E (D | M_{i})

is the evidence of the data under the model

M_{i}

, and

π (M_{i})

is the prior probability of the ith model before we see the data. The ratio of the evidences for the two competing models is called the Bayes factor [19]

B_{i j} = \frac{E (D | M_{i})}{E (D | M_{j})},

(4)

and this is also equal to the ratio of the posterior model probabilities if we assume that we do not favour any model a priori, so that

π (M_{1}) = π (M_{2}) = \dots = π (M_{N}) = 1 / N

.
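As a minimal sketch of Equations (3) and (4), the following Python fragment converts a set of ln(Evidence) values into Bayes factors and posterior model probabilities. The function names and the numerical values are hypothetical and chosen only for illustration; equal model priors are assumed unless others are supplied.

```python
import math

def posterior_model_probabilities(ln_evidences, priors=None):
    """Posterior probabilities P(M_i | D) from ln-evidences, following Equation (3)."""
    n = len(ln_evidences)
    if priors is None:
        priors = [1.0 / n] * n  # equal prior probabilities, pi(M_i) = 1/N
    # Subtract the largest ln-evidence before exponentiating to avoid underflow.
    ln_max = max(ln_evidences)
    weights = [math.exp(ln_e - ln_max) * p for ln_e, p in zip(ln_evidences, priors)]
    total = sum(weights)
    return [w / total for w in weights]

def ln_bayes_factor(ln_e_i, ln_e_j):
    """ln B_ij, the logarithm of the ratio in Equation (4)."""
    return ln_e_i - ln_e_j

# Hypothetical ln-evidences for two competing models (illustrative numbers only).
ln_evidences = [-10.0, -11.0]
probs = posterior_model_probabilities(ln_evidences)
print("ln B_12 =", ln_bayes_factor(*ln_evidences))
print("posterior model probabilities:", probs)
```

With equal model priors the ratio of the two posterior probabilities equals the Bayes factor, which is the statement made in the text.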

The Bayes factor Equation (4) provides a mathematical representation of Occam’s razor, because more complex models tend to be less predictive, lowering their average likelihood in comparison to simpler, more predictive models. More complex models can only be favoured if they are able to provide a significantly improved fit to the data. In simple cases where models give vastly different maximum likelihoods there is no need to employ model selection techniques, but they are essential for properly discussing cases where the improvement of fit is marginal. This latter situation is more or less inevitable whenever the possibility of requiring an additional parameter arises from new data, unless the new data is of vastly greater power than that preceding it; cosmological examples include the inclusion of spectral tilt, dark energy density variation, or trace isocurvature perturbations, explored later in this paper.

In this paper we will obtain analytical formulae that approximate the Bayesian evidence by considering the higher-order cumulants of the distribution in a systematic way. The advantage is that with these analytical formulae one can compute the evidence for a given model with an arbitrary number of parameters, given the hierarchy of cumulants of the distribution, which we assume have already been computed for the likelihood distribution function within the parameter estimation programme.

The evidence needs to be calculated to sufficient precision for robust conclusions to be drawn. The standard interpretational scale, due to Jeffreys [1] and summarized in ref. [12], strengthens its verdict roughly each time the difference in ln(Evidence) increases by one. The evidence therefore needs to be computed more accurately than this, with an uncertainty of 0.1 in ln(Evidence) easily sufficient, and a factor two worse than that acceptable. This accuracy requirement ensures that the relative model probabilities are changed little by the uncertainty.

The first thing needed is to characterize the distribution function for the model with n parameters. Let

f (x)

be this function, and let us assume that it is properly normalized,

\int_{- \infty}^{\infty} d^{n} x f (x) = 1 .

(5)

Then, the p-point correlation function is given by

〈 x_{i_{1}} \dots x_{i_{p}} 〉 = \int_{- \infty}^{\infty} d^{n} x x_{i_{1}} \dots x_{i_{p}} f (x) .

(6)

From this distribution function one can always construct the generating functional,

ϕ (u)

, as the Fourier transform

ϕ (u) = \int_{- \infty}^{\infty} d^{n} x e^{i u \cdot x} f (x) .

(7)

This function can be expanded as

ϕ (u) = exp [\sum_{p = 1}^{\infty} \frac{i^{p}}{p!} A_{i_{1} \dots i_{p}} u^{i_{1}} \dots u^{i_{p}}],

(8)

where

A_{i_{1} \dots i_{p}}

are totally symmetric rank-p tensors. For instance, if we restrict ourselves to order 4, we can write

ϕ (u) = exp [i μ_{i} u_{i} - \frac{1}{2!} C_{i j} u_{i} u_{j} - \frac{i}{3!} B_{i j k} u_{i} u_{j} u_{k} + \frac{1}{4!} D_{i j k l} u_{i} u_{j} u_{k} u_{l} + \dots + \frac{i^{n}}{n!} A_{i_{1} \dots i_{n}} u_{i_{1}} \dots u_{i_{n}}],

(9)

where

μ_{i}

is the mean value of variable

x_{i}

;

C_{i j}

is the covariance matrix;

B_{i j k}

is the trilinear matrix associated with the third cumulant or skewness;

D_{i j k l}

is the rank-4 tensor associated with the fourth cumulant or kurtosis; and

A_{i_{1} \dots i_{n}}

is the rank-n tensor associated with the n-th cumulant. Their expressions in terms of n-point correlation functions can be obtained from Equation (7), by realising that

〈 x_{i_{1}} \dots x_{i_{n}} 〉 = {(- i)}^{n} {\frac{\partial^{n} ϕ (u)}{\partial u_{i_{1}} \dots \partial u_{i_{n}}}|}_{u = 0} .

(10)

For instance, the first-order term gives

〈 x_{i} 〉 = (- i) {\frac{\partial ϕ (u)}{\partial u_{i}}|}_{u = 0} = μ_{i} .

(11)

The second-order correlation function gives

〈 x_{i} x_{j} 〉 = {(- i)}^{2} {\frac{\partial^{2} ϕ (u)}{\partial u_{i} \partial u_{j}}|}_{u = 0} = C_{i j} + μ_{i} μ_{j},

(12)

such that the covariance matrix is obtained, as usual, from

C_{i j} = 〈 x_{i} x_{j} 〉 - 〈 x_{i} 〉 〈 x_{j} 〉 .

The third-order correlation function gives

〈 x_{i} x_{j} x_{k} 〉 = {(- i)}^{3} {\frac{\partial^{3} ϕ (u)}{\partial u_{i} \partial u_{j} \partial u_{k}}|}_{u = 0} = B_{i j k} + μ_{i} C_{j k} + μ_{j} C_{k i} + μ_{k} C_{i j} + μ_{i} μ_{j} μ_{k},

(13)

such that the skewness matrix is obtained from

B_{i j k} = 〈 x_{i} x_{j} x_{k} 〉 - 〈 x_{i} 〉 〈 x_{j} x_{k} 〉 - 〈 x_{j} 〉 〈 x_{k} x_{i} 〉 - 〈 x_{k} 〉 〈 x_{i} x_{j} 〉 + 2 〈 x_{i} 〉 〈 x_{j} 〉 〈 x_{k} 〉 .

(14)

The fourth-order correlation function gives

\begin{matrix} 〈 x_{i} x_{j} x_{k} x_{l} 〉 = {(- i)}^{4} {\frac{\partial^{4} ϕ (u)}{\partial u_{i} \partial u_{j} \partial u_{k} \partial u_{l}}|}_{u = 0} & = & D_{i j k l} + C_{i j} C_{k l} + C_{i k} C_{j l} + C_{i l} C_{j k} \\ + & B_{i j k} μ_{l} + B_{i j l} μ_{k} + B_{j k l} μ_{i} + B_{i k l} μ_{j} \\ + & C_{i j} μ_{k} μ_{l} + C_{i k} μ_{j} μ_{l} + C_{i l} μ_{j} μ_{k} \\ + & C_{j k} μ_{i} μ_{l} + C_{j l} μ_{i} μ_{k} + C_{k l} μ_{i} μ_{j} \\ + & μ_{i} μ_{j} μ_{k} μ_{l}, \end{matrix}

(15)

such that the kurtosis matrix is obtained from

\begin{matrix} D_{i j k l} & = & 〈 x_{i} x_{j} x_{k} x_{l} 〉 - 〈 x_{i} x_{j} 〉 〈 x_{k} x_{l} 〉 - 〈 x_{i} x_{k} 〉 〈 x_{j} x_{l} 〉 - 〈 x_{i} x_{l} 〉 〈 x_{j} x_{k} 〉 \\ - & 〈 x_{i} x_{j} x_{k} 〉 〈 x_{l} 〉 - 〈 x_{i} x_{j} x_{l} 〉 〈 x_{k} 〉 - 〈 x_{i} x_{k} x_{l} 〉 〈 x_{j} 〉 - 〈 x_{j} x_{k} x_{l} 〉 〈 x_{i} 〉 \\ + & 2 〈 x_{i} x_{j} 〉 〈 x_{k} 〉 〈 x_{l} 〉 + 2 〈 x_{i} x_{k} 〉 〈 x_{j} 〉 〈 x_{l} 〉 + 2 〈 x_{i} x_{l} 〉 〈 x_{j} 〉 〈 x_{k} 〉 + 2 〈 x_{j} x_{k} 〉 〈 x_{i} 〉 〈 x_{l} 〉 \\ + & 2 〈 x_{j} x_{l} 〉 〈 x_{i} 〉 〈 x_{k} 〉 + 2 〈 x_{k} x_{l} 〉 〈 x_{i} 〉 〈 x_{j} 〉 - 6 〈 x_{i} 〉 〈 x_{j} 〉 〈 x_{k} 〉 〈 x_{l} 〉, \end{matrix}

(16)

and so on, for the higher-order cumulants.
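In practice the correlation functions would be estimated from samples, for instance an MCMC chain. The following sketch, which assumes equally weighted samples stored as a NumPy array of shape (N, n), evaluates the mean, the covariance, and Equations (14) and (16) term by term. The function name and the small data set are hypothetical.

```python
import numpy as np

def cumulants_from_samples(x):
    """Estimate mu_i, C_ij, B_ijk and D_ijkl from samples x of shape (N, n)."""
    x = np.asarray(x, dtype=float)
    n_samples = x.shape[0]
    # Raw p-point correlation functions, Equation (6), as sample averages.
    m1 = x.mean(axis=0)
    m2 = np.einsum("si,sj->ij", x, x) / n_samples
    m3 = np.einsum("si,sj,sk->ijk", x, x, x) / n_samples
    m4 = np.einsum("si,sj,sk,sl->ijkl", x, x, x, x) / n_samples

    # Covariance: C_ij = <x_i x_j> - <x_i><x_j>.
    C = m2 - np.einsum("i,j->ij", m1, m1)

    # Skewness tensor, Equation (14).
    B = (m3
         - np.einsum("i,jk->ijk", m1, m2)
         - np.einsum("j,ki->ijk", m1, m2)
         - np.einsum("k,ij->ijk", m1, m2)
         + 2.0 * np.einsum("i,j,k->ijk", m1, m1, m1))

    # Kurtosis tensor, Equation (16).
    D = (m4
         - np.einsum("ij,kl->ijkl", m2, m2)
         - np.einsum("ik,jl->ijkl", m2, m2)
         - np.einsum("il,jk->ijkl", m2, m2)
         - np.einsum("ijk,l->ijkl", m3, m1)
         - np.einsum("ijl,k->ijkl", m3, m1)
         - np.einsum("ikl,j->ijkl", m3, m1)
         - np.einsum("jkl,i->ijkl", m3, m1)
         + 2.0 * (np.einsum("ij,k,l->ijkl", m2, m1, m1)
                  + np.einsum("ik,j,l->ijkl", m2, m1, m1)
                  + np.einsum("il,j,k->ijkl", m2, m1, m1)
                  + np.einsum("jk,i,l->ijkl", m2, m1, m1)
                  + np.einsum("jl,i,k->ijkl", m2, m1, m1)
                  + np.einsum("kl,i,j->ijkl", m2, m1, m1))
         - 6.0 * np.einsum("i,j,k,l->ijkl", m1, m1, m1, m1))
    return m1, C, B, D

# Tiny hypothetical data set with two parameters (illustrative only).
samples = np.array([[0.0, 1.0], [1.0, 0.5], [3.0, 2.0], [2.0, 4.0]])
mu, C, B, D = cumulants_from_samples(samples)
```

Storing the full rank-4 tensor costs n^4 numbers, which is harmless for the handful of parameters typical of cosmological fits but would need a more economical layout for large n.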

3. The Gaussian Approximation

Let us first evaluate the evidence for a multivariate Gaussian distribution, that is, one in which all the cumulants are zero except the covariance matrix

C_{i j}

and the means

μ_{i}

. In this case, the generating functional and the distribution are given by 2

\begin{matrix} ϕ (u) = exp [i μ_{i} u_{i} - \frac{1}{2} C_{i j} u_{i} u_{j}], \end{matrix}

(17)

\begin{matrix} f (x) = \frac{1}{{(2 π)}^{n}} \int_{- \infty}^{\infty} d^{n} u e^{- i u \cdot x} ϕ (u) \end{matrix}

(18)

\begin{matrix} = \frac{1}{{(2 π)}^{n / 2} \sqrt{det C}} exp [- \frac{1}{2} C_{i j}^{- 1} (x_{i} - μ_{i}) (x_{j} - μ_{j})], \end{matrix}

(19)

which satisfies

〈 x_{i} 〉 = μ_{i}, 〈 x_{i} x_{j} 〉 = C_{i j} + μ_{i} μ_{j}, 〈 x_{i} x_{j} x_{k} 〉 = μ_{(i} C_{j k)} + μ_{i} μ_{j} μ_{k}, \dots

(20)

where the sub-indices in parentheses,

(i j k)

, indicate a cyclic sum. Notice that all the n-point correlation functions can be written in terms of the first two moments of the distribution, and all the higher-order cumulants vanish.

3.1. Centred Priors

For initial calculations, we assume a top-hat prior and make the unrealistic assumption, to be lifted later, that it is centred at the mean value:

π (x, a) \equiv \{\begin{matrix} {(2 a)}^{- 1} & - a < x - μ < a, \\ 0 & otherwise . \end{matrix}

(21)

Since the Fourier transform of a top-hat function is

\int_{- \infty}^{\infty} d x e^{i u x} π (x, a) = \frac{sin a u}{a u} exp [i μ u],

we can write the evidence either way

\begin{matrix} E (a_{1}, \dots, a_{n}) & = & \int_{- \infty}^{\infty} d^{n} x f (x) \prod_{i = 1}^{n} π (x_{i}, a_{i}) = \prod_{i = 1}^{n} {(2 a_{i})}^{- 1} \int_{- a_{1}}^{a_{1}} d x_{1} \dots \int_{- a_{n}}^{a_{n}} d x_{n} f (\tilde{x}) \end{matrix}

(22)

\begin{matrix} = & \frac{1}{{(2 π)}^{n}} \int_{- \infty}^{\infty} d^{n} u ϕ (u) \prod_{i = 1}^{n} \frac{sin a_{i} u_{i}}{a_{i} u_{i}} . \end{matrix}

(23)

In Equation (22) we integrate over the displaced coordinate,

{\tilde{x}}_{i} \equiv x_{i} - μ_{i}

, such that

〈 {\tilde{x}}_{i} 〉 = 0

and

〈 {\tilde{x}}_{i} {\tilde{x}}_{j} 〉 = C_{i j}

. From now on, we ignore the tildes, and assume we have moved to those coordinates. Note that the choice of prior is not crucial. We could have chosen a Gaussian prior, and the result would not be very different, except that the window functions,

sin z / z

, would then be Gaussian. Let us now perform the integration Equation (22) in the case of one, two and then n variables.

One variable. Suppose the covariance is just

C = σ^{2}

. The evidence is then

E (a) = \frac{1}{2 a σ \sqrt{2 π}} \int_{- a}^{a} d x e^{- \frac{x^{2}}{2 σ^{2}}} = \frac{1}{2 π} \int_{- \infty}^{\infty} d u \frac{sin a u}{a u} e^{- \frac{1}{2} σ^{2} u^{2}} = \frac{1}{2 a} Erf [\frac{a}{σ \sqrt{2}}],

(24)

where

Erf [x]

is the error function, which asymptotes very quickly to one for

x \geq 2

, or

a \geq 3 σ

. Therefore, the evidence of a model with centred top-hat prior of width

2 a

is well approximated by

{(2 a)}^{- 1}

. Note that the Bayesian evidence depends very strongly on the prior chosen for the model, and often choosing this prior is crucial for model specification [20].
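The closed form in Equation (24) is easy to evaluate numerically. The sketch below uses only the Python standard library; the function name is hypothetical. It tabulates how quickly the product E × 2a, which equals the error function in Equation (24), approaches one as the prior widens.

```python
import math

def evidence_centred_1d(a, sigma):
    """Equation (24): evidence of a unit-normalized Gaussian of width sigma
    under a top-hat prior of half-width a centred on the mean."""
    return math.erf(a / (sigma * math.sqrt(2.0))) / (2.0 * a)

sigma = 1.0
for a in (0.5, 1.0, 3.0, 10.0):
    e = evidence_centred_1d(a, sigma)
    # e * 2a is the error-function factor, which tends to one for wide priors.
    print(f"a = {a:5.1f} sigma   E = {e:.5f}   E * 2a = {e * 2.0 * a:.5f}")
```

In the opposite limit of a very narrow prior the evidence tends to the peak value of the Gaussian, 1/(σ√(2π)), rather than to 1/(2a).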

Two variables. Suppose we have two correlated variables,

x_{1}

and

x_{2}

, with covariance matrix

C = (\begin{matrix} C_{11} & C_{12} \\ C_{12} & C_{22} \end{matrix}) = (\begin{matrix} σ_{1}^{2} & ρ σ_{1} σ_{2} \\ ρ σ_{1} σ_{2} & σ_{2}^{2} \end{matrix}) .

(25)

where the cross-correlation

ρ

is defined by

ρ = \frac{〈 x_{1} x_{2} 〉}{\sqrt{〈 x_{1}^{2} 〉 〈 x_{2}^{2} 〉}} = \frac{〈 x_{1} x_{2} 〉}{σ_{1} σ_{2}},

with

σ_{1}

and

σ_{2}

as the corresponding quadratic dispersions. In this case, the normalized two-dimensional distribution function is

f (x) = \frac{1}{2 π σ_{1} σ_{2} \sqrt{1 - ρ^{2}}} exp [\frac{- 1}{1 - ρ^{2}} (\frac{x_{1}^{2}}{2 σ_{1}^{2}} - \frac{ρ x_{1} x_{2}}{σ_{1} σ_{2}} + \frac{x_{2}^{2}}{2 σ_{2}^{2}})],

(26)

which has the property that integrating (“marginalizing”) over one of the two variables leaves a properly normalized Gaussian distribution for the remaining variable,

\int_{- \infty}^{\infty} d x_{2} f (x) = \frac{1}{σ_{1} \sqrt{2 π}} e^{- \frac{x_{1}^{2}}{2 σ_{1}^{2}}} .

(27)

Let us now evaluate the evidence Equation (22) by integrating first over the prior in

x_{2}

,

\frac{1}{2 a_{2}} \int_{- a_{2}}^{a_{2}} d x_{2} f (x) = \frac{e^{- \frac{x_{1}^{2}}{2 σ_{1}^{2}}}}{σ_{1} \sqrt{2 π}} \cdot \frac{1}{4 a_{2}} [Erf [\frac{a_{2} σ_{1} + ρ σ_{2} x_{1}}{σ_{1} σ_{2} \sqrt{2 (1 - ρ^{2})}}] + Erf [\frac{a_{2} σ_{1} - ρ σ_{2} x_{1}}{σ_{1} σ_{2} \sqrt{2 (1 - ρ^{2})}}]] .

(28)

The first term is the result we would have obtained if we had been marginalizing over

x_{2}

; the second is a sum of error functions that still depend on

x_{1}

, and modulates the marginalization. We can use the series expansion of the error function to second order,

\frac{1}{2} (Erf [a + x] + Erf [a - x]) = Erf [a] - \frac{2 a x^{2}}{\sqrt{π}} e^{- a^{2}} + O (x^{4}),

to write Equation (28) to order

x_{1}^{2}

as

\frac{1}{2 a_{2}} \int_{- a_{2}}^{a_{2}} d x_{2} f (x) = \frac{e^{- \frac{x_{1}^{2}}{2 σ_{1}^{2}}}}{σ_{1} \sqrt{2 π}} [\frac{1}{2 a_{2}} Erf [\frac{a_{2}}{σ_{2} \sqrt{2 (1 - ρ^{2})}}] - \frac{ρ^{2} x_{1}^{2} e^{- \frac{a_{2}^{2}}{2 σ_{2}^{2} (1 - ρ^{2})}}}{2 σ_{1}^{2} σ_{2} (1 - ρ^{2}) \sqrt{2 π (1 - ρ^{2})}}] .

(29)

Integrating now over the

x_{1}

prior, we finally obtain the evidence

\begin{matrix} E (a_{1}, a_{2}) & = & \frac{1}{4 a_{1} a_{2}} \int_{- a_{1}}^{a_{1}} d x_{1} \int_{- a_{2}}^{a_{2}} d x_{2} f (x) \\ = & \frac{1}{4 a_{1} a_{2}} Erf [\frac{a_{2}}{σ_{2} \sqrt{2 (1 - ρ^{2})}}] Erf [\frac{a_{1}}{σ_{1} \sqrt{2}}] \\ - & \frac{ρ^{2} e^{- \frac{a_{2}^{2}}{2 σ_{2}^{2} (1 - ρ^{2})}}}{2 σ_{1} σ_{2} (1 - ρ^{2}) \sqrt{2 π (1 - ρ^{2})}} \frac{Erf [\frac{a_{1}}{σ_{1} \sqrt{2}}]}{2 a_{1}} + \frac{ρ^{2} e^{- \frac{a_{2}^{2}}{2 σ_{2}^{2} (1 - ρ^{2})} - \frac{a_{1}^{2}}{2 σ_{1}^{2}}}}{4 π σ_{1}^{2} σ_{2} \sqrt{1 - ρ^{2}}} . \end{matrix}

(30)

Note that in the limit of no cross-correlations,

ρ \to 0

, the integral factorizes and we can write an exact expression for the evidence,

\begin{matrix} E (a_{1}, a_{2}) & = & \frac{1}{4 a_{1} a_{2}} \frac{1}{2 π σ_{1} σ_{2}} \int_{- a_{1}}^{a_{1}} d x_{1} \int_{- a_{2}}^{a_{2}} d x_{2} e^{- \frac{x_{1}^{2}}{2 σ_{1}^{2}} - \frac{x_{2}^{2}}{2 σ_{2}^{2}}} \end{matrix}

(31)

\begin{matrix} = & \frac{1}{4 π^{2}} \int_{- \infty}^{\infty} d u_{1} \int_{- \infty}^{\infty} d u_{2} \frac{sin a_{1} u_{1}}{a_{1} u_{1}} \frac{sin a_{2} u_{2}}{a_{2} u_{2}} e^{- \frac{1}{2} σ_{1}^{2} u_{1}^{2} - \frac{1}{2} σ_{2}^{2} u_{2}^{2}} \end{matrix}

(32)

\begin{matrix} = & \frac{1}{4 a_{1} a_{2}} Erf [\frac{a_{1}}{σ_{1} \sqrt{2}}] Erf [\frac{a_{2}}{σ_{2} \sqrt{2}}] . \end{matrix}

(33)

It happens, however, that even in the presence of cross-correlations, if the prior is wide (

a_{i} \geq 2 σ_{i}

), then the terms proportional to exponentials are negligible and the evidence becomes, to a very good approximation,

E (a_{1}, a_{2}) = \frac{1}{4 a_{1} a_{2}} Erf [\frac{a_{2}}{σ_{2} \sqrt{2 (1 - ρ^{2})}}] Erf [\frac{a_{1}}{σ_{1} \sqrt{2}}] .

(34)

Moreover, in that case, the error functions are approximately given by 1.

n variables. Suppose we have n correlated variables,

x = (x_{1}, \dots, x_{n})

, with covariance matrix

C_{n} = (\begin{matrix} C_{11} & C_{12} & \dots & C_{1 n} \\ C_{12} & C_{22} & \dots & C_{2 n} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ C_{1 n} & C_{2 n} & \dots & C_{n n} \end{matrix}) .

(35)

In this case, the probability distribution function can be expressed as

f (x) = \frac{1}{{(2 π)}^{n / 2} \sqrt{det C_{n}}} exp [- \frac{1}{2} x^{T} C_{n}^{- 1} x],

(36)

which has the property that marginalizing over the last variable,

x_{n}

, we obtain a correlated probability distribution function for the

n - 1

variables,

x = (x_{1}, \dots, x_{n - 1})

,

f (x) = \frac{1}{{(2 π)}^{(n - 1) / 2} \sqrt{det C_{n - 1}}} exp [- \frac{1}{2} x^{T} C_{n - 1}^{- 1} x],

(37)

where the

C_{n - 1}

covariance matrix is given by Equation (35) without the last column and last row.

We will now evaluate the evidence Equation (22) for this multivariate Gaussian, starting with the integration over the last variable,

x_{n}

,

\begin{matrix} \frac{1}{2 a_{n}} \int_{- a_{n}}^{a_{n}} d x_{n} f (x) & = & \frac{1}{{(2 π)}^{(n - 1) / 2} \sqrt{det C_{n - 1}}} exp [- \frac{1}{2} x^{T} C_{n - 1}^{- 1} x] \\ \times \{\frac{1}{2 a_{n}} Erf [\frac{a_{n}}{\sqrt{2}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}}] + O (e^{- \frac{a_{n}^{2} det C_{n - 1}}{2 det C_{n}}})\} . \end{matrix}

(38)

Integrating now over the next variable,

x_{n - 1}

, we find

\begin{matrix} \frac{1}{4 a_{n} a_{n - 1}} \int_{- a_{n}}^{a_{n}} d x_{n} \int_{- a_{n - 1}}^{a_{n - 1}} d x_{n - 1} f (x) = \frac{1}{{(2 π)}^{(n - 2) / 2} \sqrt{det C_{n - 2}}} exp [- \frac{1}{2} x^{T} C_{n - 2}^{- 1} x] \\ \times \{\frac{1}{4 a_{n} a_{n - 1}} Erf [\frac{a_{n}}{\sqrt{2}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}}] Erf [\frac{a_{n - 1}}{\sqrt{2}} \sqrt{\frac{det C_{n - 2}}{det C_{n - 1}}}] + O (e^{- \frac{a_{n}^{2} det C_{n - 1}}{2 det C_{n}}})\} . \end{matrix}

(39)

Continuing the integration over the priors, we end up with the evidence for the n-dimensional distribution,

\begin{matrix} E (a_{1}, \dots, a_{n}) & = & \frac{1}{\prod_{p = 1}^{n} 2 a_{p}} \int_{- a_{1}}^{a_{1}} \dots \int_{- a_{n}}^{a_{n}} d^{n} x f (x) \\ = & \prod_{p = 1}^{n} \frac{1}{2 a_{p}} Erf [\frac{a_{p}}{\sqrt{2}} \sqrt{\frac{det C_{p - 1}}{det C_{p}}}] + O (exp [- \sum_{p = 1}^{n} \frac{a_{p}^{2} det C_{p - 1}}{2 det C_{p}}]), \end{matrix}

(40)

where the covariance matrices

C_{p}

are constructed as above, by eliminating the

n - p

last rows and columns, until we end up with

C_{0} \equiv 1

. Note that the approximation is very good whenever

\sum_{p = 1}^{n} (a_{p}^{2} det C_{p - 1}) / (2 det C_{p}) ≫ 1

, which is often the case. Note also that we recover the previous result Equation (34) for the particular case

n = 2

.
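The leading term of Equation (40) needs only the determinants of the nested covariance matrices C_p. The following is a minimal sketch assuming NumPy is available; the function name and the example numbers are hypothetical, and the exponentially small corrections are dropped.

```python
import math
import numpy as np

def evidence_centred_gaussian(cov, a):
    """Leading term of Equation (40) for a unit-normalized Gaussian with
    covariance `cov` and centred top-hat priors of half-widths `a`."""
    cov = np.asarray(cov, dtype=float)
    a = np.asarray(a, dtype=float)
    n = cov.shape[0]
    # det C_p for the leading p x p blocks, with det C_0 = 1 by convention.
    dets = [1.0] + [np.linalg.det(cov[:p, :p]) for p in range(1, n + 1)]
    evidence = 1.0
    for p in range(1, n + 1):
        scale = math.sqrt(dets[p - 1] / dets[p])
        evidence *= math.erf(a[p - 1] * scale / math.sqrt(2.0)) / (2.0 * a[p - 1])
    return evidence

# Hypothetical two-parameter example: sigma_1 = 1, sigma_2 = 2, rho = 0.5.
s1, s2, rho = 1.0, 2.0, 0.5
cov2 = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]
half_widths = [3.0, 5.0]
print(evidence_centred_gaussian(cov2, half_widths))
```

For n = 2 the two factors in the loop are the two error functions of Equation (34), and for a diagonal covariance the product is the exact result of Equation (41).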

In the limit that the cross-correlation between the n variables vanishes, the evidence (40) reduces to the exact result

E (a_{1}, \dots, a_{n}) = \prod_{p = 1}^{n} \frac{1}{2 a_{p}} Erf [\frac{a_{p}}{σ_{p} \sqrt{2}}] .

(41)

Note that the evidence Equation (40) correctly reflects the limit in which we eliminate the need for a new variable

x_{n}

, by making its prior vanish,

lim_{a_{n} \to 0} E (a_{1}, \dots, a_{n}) = E (a_{1}, \dots, a_{n - 1}) \frac{1}{\sqrt{2 π}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}},

(42)

and thus we recover in that limit a properly normalized distribution,

f (x_{1}, \dots, x_{n}) \to f (x_{1}, \dots, x_{n - 1})

, while the inspection of the likelihood function alone would not have been able to give a reasonable answer.

On the other hand, in the case that our theoretical prejudice cannot assign a concrete prior to a given variable, we see that the evidence decreases as

1 / 2 a

as a increases. Therefore, the Bayesian evidence seems to be a very good discriminator between theoretical priors, and penalizes including too many parameters, à la Occam’s razor.

3.2. Uncentred Priors

It is unlikely that the priors will actually be centred on the mean of the distribution, as the priors are not supposed to know what the data will tell us. We therefore need to generalize the above for uncentred priors. We continue to assume that the priors are top hats.

We also continue to assume for the moment that the probability distribution is well-approximated by a Gaussian with mean value

μ

. We will then use displaced variables

{\tilde{x}}_{i} = x_{i} - μ_{i}

, and write the Gaussian distribution function as in Equation (36). The normalized top-hat prior is now uncentered with respect to the mean value,

π (\tilde{x}; a, b) \equiv \{\begin{matrix} {(a + b)}^{- 1} & - a < \tilde{x} < b, \\ 0 & otherwise . \end{matrix}

(43)

For a single variable, the result is exact,

E (a; b) = \int_{- \infty}^{\infty} d x f (x) π (x; a, b) = \frac{1}{2 a + 2 b} (Erf [\frac{a}{σ \sqrt{2}}] + Erf [\frac{b}{σ \sqrt{2}}]),

(44)

where we are integrating over the displaced variable

\tilde{x}

, from now on renamed as x. Note that we recover the result Equation (24) for the centred prior case in the limit

b \to a

.

For two variables, with distribution function Equation (26), the uncentred Bayesian evidence is

\begin{matrix} E (a_{1}, a_{2}; b_{1}, b_{2}) & = & \frac{1}{(a_{1} + b_{1}) (a_{2} + b_{2})} \int_{- a_{1}}^{b_{1}} d x_{1} \int_{- a_{2}}^{b_{2}} d x_{2} f (x_{1}, x_{2}) \end{matrix}

(45)

\begin{matrix} = & \frac{1}{(2 a_{1} + 2 b_{1}) (2 a_{2} + 2 b_{2})} \{(Erf [\frac{a_{1}}{σ_{1} \sqrt{2}}] + Erf [\frac{b_{1}}{σ_{1} \sqrt{2}}]) \\ \times (Erf [\frac{a_{2}}{σ_{2} \sqrt{2 (1 - ρ^{2})}}] + Erf [\frac{b_{2}}{σ_{2} \sqrt{2 (1 - ρ^{2})}}]) \\ - \frac{ρ}{2 π \sqrt{1 - ρ^{2}}} (e^{- \frac{a_{1}^{2}}{2 σ_{1}^{2}}} - e^{- \frac{b_{1}^{2}}{2 σ_{1}^{2}}}) (e^{- \frac{a_{2}^{2}}{2 σ_{2}^{2} (1 - ρ^{2})}} + e^{- \frac{b_{2}^{2}}{2 σ_{2}^{2} (1 - ρ^{2})}})\} \end{matrix}

(46)

The evidence for the multiple-variable case Equation (36) is

E (a, b) = \int_{- \infty}^{\infty} d^{n} x f (x) \prod_{i = 1}^{n} π (x_{i}; a_{i}, b_{i}) = \prod_{i = 1}^{n} {(a_{i} + b_{i})}^{- 1} \int_{- a_{1}}^{b_{1}} d {\tilde{x}}_{1} \dots \int_{- a_{n}}^{b_{n}} d {\tilde{x}}_{n} f (\tilde{x}) .

(47)

Let us now evaluate it for the multivariate Gaussian Equation (36), starting with the integration over the last variable,

x_{n}

,

\begin{matrix} \frac{1}{a_{n} + b_{n}} \int_{- a_{n}}^{b_{n}} d x_{n} f (x) = \frac{1}{{(2 π)}^{(n - 1) / 2} \sqrt{det C_{n - 1}}} exp [- \frac{1}{2} x^{T} C_{n - 1}^{- 1} x] \frac{1}{(2 a_{n} + 2 b_{n})} \\ \times \{Erf [\frac{a_{n}}{\sqrt{2}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}}] + Erf [\frac{b_{n}}{\sqrt{2}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}}] + O (e^{- \frac{a_{n}^{2} det C_{n - 1}}{2 det C_{n}}} + e^{- \frac{b_{n}^{2} det C_{n - 1}}{2 det C_{n}}})\} . \end{matrix}

(48)

Integrating now over the next variable,

x_{n - 1}

, we find

\begin{matrix} \frac{1}{(a_{n} + b_{n}) (a_{n - 1} + b_{n - 1})} \int_{- a_{n}}^{b_{n}} d x_{n} \int_{- a_{n - 1}}^{b_{n - 1}} d x_{n - 1} f (x) = \end{matrix}

\begin{matrix} \frac{1}{{(2 π)}^{(n - 2) / 2} \sqrt{det C_{n - 2}}} exp [- \frac{1}{2} x^{T} C_{n - 2}^{- 1} x] \frac{1}{(2 a_{n} + 2 b_{n}) (2 a_{n - 1} + 2 b_{n - 1})} \end{matrix}

(49)

\begin{matrix} \times \{(Erf [\frac{a_{n}}{\sqrt{2}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}}] + Erf [\frac{b_{n}}{\sqrt{2}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}}]) \end{matrix}

(50)

\begin{matrix} \times (Erf [\frac{a_{n - 1}}{\sqrt{2}} \sqrt{\frac{det C_{n - 2}}{det C_{n - 1}}}] + Erf [\frac{b_{n - 1}}{\sqrt{2}} \sqrt{\frac{det C_{n - 2}}{det C_{n - 1}}}]) \\ + O (e^{- \frac{a_{n}^{2} det C_{n - 1}}{2 det C_{n}}} + e^{- \frac{b_{n}^{2} det C_{n - 1}}{2 det C_{n}}}) \times (e^{- \frac{a_{n - 1}^{2} det C_{n - 2}}{2 det C_{n - 1}}} + e^{- \frac{b_{n - 1}^{2} det C_{n - 2}}{2 det C_{n - 1}}})\} . \end{matrix}

(51)

Continuing the integration over the priors, we end up with the evidence for the n-dimensional distribution,

\begin{matrix} E (a, b) & = & \frac{1}{\prod_{p = 1}^{n} (a_{p} + b_{p})} \int_{- a_{1}}^{b_{1}} \dots \int_{- a_{n}}^{b_{n}} d^{n} x f (x) \\ = & \prod_{p = 1}^{n} \frac{1}{(2 a_{p} + 2 b_{p})} (Erf [\frac{a_{p}}{\sqrt{2}} \sqrt{\frac{det C_{p - 1}}{det C_{p}}}] + Erf [\frac{b_{p}}{\sqrt{2}} \sqrt{\frac{det C_{p - 1}}{det C_{p}}}]) \\ + O (\prod_{p = 1}^{n} [exp (- \frac{a_{p}^{2} det C_{p - 1}}{2 det C_{p}}) + exp (- \frac{b_{p}^{2} det C_{p - 1}}{2 det C_{p}})]), \end{matrix}

(52)

where the covariance matrices

C_{p}

are constructed as above, by eliminating the

n - p

last rows and columns, until

C_{0} \equiv 1

. Note that the approximation is very good whenever the exponents are large,

\sum_{p = 1}^{n} (a_{p}^{2} det C_{p - 1}) / (2 det C_{p}) ≫ 1

, which is often the case. Note also that we recover the expression of the evidence for the centred priors Equation (40) in the limit

b \to a

.

Let us now evaluate the evidence for a distribution normalized to the maximum of the likelihood distribution,

f (x) = L_{\max} exp [- \frac{1}{2} x^{T} C_{n}^{- 1} x] .

(53)

In this case, the evidence is given by Equation (52), multiplied by a factor

L_{\max} \times {(2 π)}^{n / 2} \sqrt{det C_{n}}

from the normalization. We can then evaluate the logarithm of the evidence, ignoring the exponentially small corrections, as

\begin{matrix} ln E & = & ln L_{\max} + \frac{n}{2} ln (2 π) + \frac{1}{2} ln det C_{n} - \sum_{p = 1}^{n} ln (2 a_{p} + 2 b_{p}) \\ + \sum_{p = 1}^{n} ln (Erf [\frac{a_{p}}{\sqrt{2}} \sqrt{\frac{det C_{p - 1}}{det C_{p}}}] + Erf [\frac{b_{p}}{\sqrt{2}} \sqrt{\frac{det C_{p - 1}}{det C_{p}}}]) . \end{matrix}

(54)
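Equation (54) can be sketched in a few lines of Python. The version below assumes NumPy, works with log-determinants for numerical stability, and neglects the exponentially small corrections exactly as in the text. The function name, the argument order, and the example numbers are hypothetical.

```python
import math
import numpy as np

def ln_evidence_gaussian(ln_l_max, cov, a, b):
    """Equation (54): ln E for a Gaussian likelihood with peak value exp(ln_l_max),
    covariance `cov`, and top-hat priors -a_p < x_p < b_p about the mean."""
    cov = np.asarray(cov, dtype=float)
    n = cov.shape[0]
    # ln det C_p for the leading p x p blocks; ln det C_0 = 0 since C_0 = 1.
    ln_dets = [0.0] + [np.linalg.slogdet(cov[:p, :p])[1] for p in range(1, n + 1)]
    ln_e = ln_l_max + 0.5 * n * math.log(2.0 * math.pi) + 0.5 * ln_dets[n]
    for p in range(1, n + 1):
        # sqrt(det C_{p-1} / det C_p) / sqrt(2), the argument scale of Erf.
        scale = math.exp(0.5 * (ln_dets[p - 1] - ln_dets[p])) / math.sqrt(2.0)
        erf_sum = math.erf(a[p - 1] * scale) + math.erf(b[p - 1] * scale)
        ln_e += math.log(erf_sum) - math.log(2.0 * (a[p - 1] + b[p - 1]))
    return ln_e

# Hypothetical inputs: two correlated parameters with uncentred priors.
cov_example = [[0.25, 0.3], [0.3, 4.0]]
a_example, b_example = [1.0, 3.0], [2.0, 4.0]
print(ln_evidence_gaussian(-5.0, cov_example, a_example, b_example))
```

The covariance and the value of ln L_max would normally come from a parameter estimation run; the prior edges a_p and b_p are measured from the mean of each parameter.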

Uncorrelated case. Suppose we have a multivariate Gaussian distribution without correlations between variables, i.e.,

C_{i j} = σ_{i}^{2} δ_{i j}

is a diagonal matrix; then the evidence reads exactly,

E (a, b) = \frac{1}{\prod_{p = 1}^{n} (a_{p} + b_{p})} \int_{- a_{1}}^{b_{1}} \dots \int_{- a_{n}}^{b_{n}} d^{n} x f (x) = \prod_{p = 1}^{n} \frac{1}{2 (a_{p} + b_{p})} (Erf [\frac{a_{p}}{σ_{p} \sqrt{2}}] + Erf [\frac{b_{p}}{σ_{p} \sqrt{2}}]),

(55)

where

σ_{p}

are the dispersions of each variable

{\tilde{x}}_{p}

, and thus the logarithm of the evidence becomes

ln E = ln L_{\max} + \frac{n}{2} ln (2 π) + \sum_{p = 1}^{n} ln σ_{p} - \sum_{p = 1}^{n} ln (2 a_{p} + 2 b_{p}) + \sum_{p = 1}^{n} ln (Erf [\frac{a_{p}}{σ_{p} \sqrt{2}}] + Erf [\frac{b_{p}}{σ_{p} \sqrt{2}}])

(56)

Laplace approximation. The Laplace approximation to the evidence assumes that the distribution is a correlated Gaussian and that the priors are wide enough for the whole distribution to fit easily inside them. Each error function is then approximately unity, so each sum of two error functions in Equation (54) is approximately 2 and cancels the factor 2 in the prior normalization; from Equation (54) we now have

ln E = ln L_{\max} + \frac{n}{2} ln (2 π) + \frac{1}{2} ln det C_{n} - \sum_{p = 1}^{n} ln Δ θ_{p},

(57)

where

Δ θ_{p} = a_{p} + b_{p}

is the parameter interval associated to the prior. In the next section we will compare the different approximations.
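A minimal sketch of Equation (57) follows, again assuming NumPy. The function name and the numbers are hypothetical. The last lines illustrate the Occam penalty: doubling the width of one prior lowers ln E by ln 2.

```python
import math
import numpy as np

def ln_evidence_laplace(ln_l_max, cov, prior_widths):
    """Equation (57): Laplace approximation, with prior_widths[p] = a_p + b_p."""
    cov = np.asarray(cov, dtype=float)
    n = cov.shape[0]
    sign, ln_det = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("the covariance matrix must have a positive determinant")
    return (ln_l_max + 0.5 * n * math.log(2.0 * math.pi) + 0.5 * ln_det
            - sum(math.log(w) for w in prior_widths))

# Hypothetical example: the same likelihood with a narrow and a wide second prior.
cov_example = [[0.25, 0.3], [0.3, 4.0]]
narrow = ln_evidence_laplace(-5.0, cov_example, [10.0, 20.0])
wide = ln_evidence_laplace(-5.0, cov_example, [10.0, 40.0])
print("change in ln E when the second prior doubles:", wide - narrow)
```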

4. Non-Gaussian Corrections

The advantage of this method is that one can perform a systematic computation of the evidence of a given model with its own priors, given an arbitrary set of moments of the distribution. Here we will consider the first two beyond the covariance matrix, i.e., the skewness and kurtosis terms, see Equation (9).

4.1. Skewness

Let us start with the first correction to the Gaussian approximation, the trilinear term

B_{i j k}

. For this, we write the generating functional (9) as

ϕ (u) = exp [i μ_{i} u_{i} - \frac{1}{2!} C_{i j} u_{i} u_{j} - \frac{i}{3!} B_{i j k} u_{i} u_{j} u_{k}] .

(58)

By performing a change of variable,

u_{i} = y_{i} - i C_{i k}^{- 1} (x_{k} - μ_{k})

, we can evaluate the Fourier transform integral and obtain the properly-normalized probability distribution function

\begin{matrix} f (x) & = & \frac{1}{{(2 π)}^{n / 2} \sqrt{det C_{n}}} exp [- \frac{1}{2} x^{_{T}} C_{n}^{- 1} x] \\ \times (1 - \frac{1}{2} B_{i j k} C_{i j}^{- 1} C_{k l}^{- 1} x_{l} + \frac{1}{6} B_{i j k} C_{i l}^{- 1} C_{j m}^{- 1} C_{k n}^{- 1} x_{l} x_{m} x_{n}), \end{matrix}

(59)

where

x_{k}

are the displaced coordinates

(x_{k} - μ_{k})

. This skewed distribution function satisfies

〈 x_{i} 〉 = 0, 〈 x_{i} x_{j} 〉 = C_{i j}, 〈 x_{i} x_{j} x_{k} 〉 = B_{i j k}, 〈 x_{i} x_{j} x_{k} x_{l} 〉 = 0, \dots

(60)

as can be confirmed by direct evaluation. Let us now compute the evidence, Equation (22), for this skewed model. Since the extra terms in the parenthesis of Equation (59) are both odd functions of x, their contribution to the evidence vanishes when integrating over an even range such as that of the centred top-hat prior, Equation (21), and thus the final evidence for the skewed model does not differ from that of the Gaussian model, Equation (40). If the prior is off-centred with respect to the mean, e.g., as in Equation (43), the contribution of the odd terms to the evidence does not vanish. Let us evaluate their contribution.

For a single variable

(n = 1)

, the correctly normalized likelihood function can be written as

f (x) = \frac{e^{- x^{2} / 2 σ^{2}}}{σ \sqrt{2 π}} (1 - \frac{B x}{2 σ^{4}} + \frac{B x^{3}}{6 σ^{6}}),

satisfying

〈 x 〉 = 0

,

〈 x^{2} 〉 = σ^{2}

,

〈 x^{3} 〉 = B

, and the Bayesian integral can be computed exactly as

E (a, b) = \frac{1}{2 a + 2 b} (Erf [\frac{a}{σ \sqrt{2}}] + Erf [\frac{b}{σ \sqrt{2}}]) - \frac{B σ^{- 3}}{6 \sqrt{2 π}} [(1 - \frac{a^{2}}{σ^{2}}) e^{- \frac{a^{2}}{2 σ^{2}}} - (1 - \frac{b^{2}}{σ^{2}}) e^{- \frac{b^{2}}{2 σ^{2}}}] \frac{1}{a + b} .

(61)

Note that for even (centred) priors, with

b = a

, the evidence reduces to Equation (24).
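The single-variable result, Equation (61), can be checked against a direct numerical integral of the skewed likelihood. The sketch below is illustrative only: the function names, the midpoint-rule integrator and the test values of σ, B, a and b are assumptions introduced here.

```python
import math


def evidence_skew_1d(sigma, skew_b, a, b):
    """Equation (61): evidence of the skewness-corrected 1D likelihood."""
    scale = sigma * math.sqrt(2.0)
    gauss = (math.erf(a / scale) + math.erf(b / scale)) / (2.0 * (a + b))

    def edge(c):
        # (1 - c^2/sigma^2) exp(-c^2 / 2 sigma^2), evaluated at a prior edge
        return (1.0 - c * c / sigma**2) * math.exp(-c * c / (2.0 * sigma**2))

    corr = skew_b / (6.0 * sigma**3 * math.sqrt(2.0 * math.pi))
    return gauss - corr * (edge(a) - edge(b)) / (a + b)


def likelihood_skew_1d(x, sigma, skew_b):
    """Normalized single-variable likelihood with third moment skew_b."""
    g = math.exp(-x * x / (2.0 * sigma**2)) / (sigma * math.sqrt(2.0 * math.pi))
    return g * (1.0 - skew_b * x / (2.0 * sigma**4) + skew_b * x**3 / (6.0 * sigma**6))


def evidence_numeric(f, a, b, steps=20000):
    """Midpoint-rule integral of f over (-a, b), divided by the prior width."""
    h = (a + b) / steps
    return sum(f(-a + (i + 0.5) * h) for i in range(steps)) * h / (a + b)


# Off-centred prior: the skewness term contributes (hypothetical values).
analytic = evidence_skew_1d(1.0, 0.3, 0.5, 2.0)
numeric = evidence_numeric(lambda x: likelihood_skew_1d(x, 1.0, 0.3), 0.5, 2.0)
print(analytic, numeric)
```

Setting b = a makes the two edge terms cancel, so the function returns the purely Gaussian value, as stated above.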

For an arbitrary number of variables the computation is more complicated. Let us start with the n-th variable and, in order to compute the integral, let us define the auxiliary function

\begin{matrix} g (λ) & = & \int_{- a_{n}}^{b_{n}} d x_{n} x_{n} \frac{exp [- \frac{λ}{2} x^{_{T}} C_{n}^{- 1} x]}{{(2 π)}^{n / 2} \sqrt{det C_{n}}} = \frac{exp [- \frac{1}{2} x^{_{T}} C_{n - 1}^{- 1} x]}{{(2 π)}^{(n - 1) / 2} \sqrt{det C_{n - 1}}} \times \\ \times \frac{1}{λ \sqrt{2 π}} (exp [- \frac{λ a_{n}^{2}}{2} \frac{det C_{n - 1}}{det C_{n}}] - exp [- \frac{λ b_{n}^{2}}{2} \frac{det C_{n - 1}}{det C_{n}}]), \end{matrix}

(62)

such that, using

{Erf}^{'} [x] = \frac{2}{\sqrt{π}} e^{- x^{2}}

,

\begin{matrix} - 2 g^{'} (λ = 1) = \int_{- a_{n}}^{b_{n}} d x_{n} x_{n} \frac{(x^{_{T}} C_{n}^{- 1} x) exp [- \frac{1}{2} x^{_{T}} C_{n}^{- 1} x]}{{(2 π)}^{n / 2} \sqrt{det C_{n}}} = \frac{exp [- \frac{1}{2} x^{_{T}} C_{n - 1}^{- 1} x]}{{(2 π)}^{(n - 1) / 2} \sqrt{det C_{n - 1}}} \times \\ \times \frac{1}{\sqrt{2 π}} \{(2 + a_{n}^{2} \frac{det C_{n - 1}}{det C_{n}}) exp [- \frac{a_{n}^{2}}{2} \frac{det C_{n - 1}}{det C_{n}}] - (2 + b_{n}^{2} \frac{det C_{n - 1}}{det C_{n}}) exp [- \frac{b_{n}^{2}}{2} \frac{det C_{n - 1}}{det C_{n}}]\} . \end{matrix}

(63)

Therefore, with the use of Equation (63), the integral of the skewness-corrected distribution function Equation (59) over the

x_{n}

uncentred prior becomes

\begin{matrix} \int_{- a_{n}}^{b_{n}} d x_{n} f (x) = \frac{exp [- \frac{1}{2} x^{_{T}} C_{n - 1}^{- 1} x]}{{(2 π)}^{(n - 1) / 2} \sqrt{det C_{n - 1}}} \{\frac{1}{2} (Erf [\frac{a_{n}}{\sqrt{2}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}}] + Erf [\frac{b_{n}}{\sqrt{2}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}}]) \\ - \frac{1}{6} B_{i j n} C_{i j}^{- 1} \frac{1}{\sqrt{2 π}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}} [(1 - a_{n}^{2} \frac{det C_{n - 1}}{det C_{n}}) e^{- \frac{a_{n}^{2} det C_{n - 1}}{2 det C_{n}}} - (1 - b_{n}^{2} \frac{det C_{n - 1}}{det C_{n}}) e^{- \frac{b_{n}^{2} det C_{n - 1}}{2 det C_{n}}}]\} . \end{matrix}

(64)

Let us define two new functions,

\begin{matrix} E_{i} (a_{i}, b_{i}) & = & \frac{1}{2} (Erf [\frac{a_{i}}{\sqrt{2}} \sqrt{\frac{det C_{i - 1}}{det C_{i}}}] + Erf [\frac{b_{i}}{\sqrt{2}} \sqrt{\frac{det C_{i - 1}}{det C_{i}}}]), \\ F_{i} (a_{i}, b_{i}) & = & \frac{1}{6 \sqrt{2 π}} \sqrt{\frac{det C_{i - 1}}{det C_{i}}} [(1 - a_{i}^{2} \frac{det C_{i - 1}}{det C_{i}}) e^{- \frac{a_{i}^{2} det C_{i - 1}}{2 det C_{i}}} - (1 - b_{i}^{2} \frac{det C_{i - 1}}{det C_{i}}) e^{- \frac{b_{i}^{2} det C_{i - 1}}{2 det C_{i}}}] . \end{matrix}

(65)
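The functions of Equation (65) depend on the covariance only through the ratios det C_{i-1} / det C_{i}. A minimal sketch of their evaluation follows, under two assumptions made here: that C_{i} denotes the leading i × i block of the covariance matrix, and that det C_{0} = 1. With those conventions the ratio equals the inverse square of the i-th diagonal element of the Cholesky factor. The function names are hypothetical.

```python
import math


def cholesky_diagonal(cov):
    """Diagonal of the lower Cholesky factor L of cov (cov = L L^T)."""
    n = len(cov)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = cov[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return [low[i][i] for i in range(n)]


def e_and_f(cov, a, b):
    """E_i and F_i of Equation (65) for i = 1..n, returned as two lists."""
    e_vals, f_vals = [], []
    for l_ii, lo, hi in zip(cholesky_diagonal(cov), a, b):
        k = 1.0 / (l_ii * l_ii)  # det C_{i-1} / det C_i (assumed convention)
        root_k = math.sqrt(k)
        e_vals.append(0.5 * (math.erf(lo * root_k / math.sqrt(2.0))
                             + math.erf(hi * root_k / math.sqrt(2.0))))
        edge_lo = (1.0 - lo * lo * k) * math.exp(-0.5 * lo * lo * k)
        edge_hi = (1.0 - hi * hi * k) * math.exp(-0.5 * hi * hi * k)
        f_vals.append(root_k * (edge_lo - edge_hi) / (6.0 * math.sqrt(2.0 * math.pi)))
    return e_vals, f_vals


# Illustrative covariance and prior edges (hypothetical numbers).
e_vals, f_vals = e_and_f([[4.0, 0.0], [0.0, 9.0]], [2.0, 1.0], [2.0, 4.0])
print(e_vals, f_vals)
```

For a diagonal covariance the ratios reduce to 1/σ_i², and F_i vanishes identically whenever a_i = b_i.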

Integrating iteratively over

x_{n - 1}, \dots, x_{1}

, we end up with the Bayesian evidence for the third-order-corrected probability distribution function

f (x)

,

E (a, b) = \prod_{p = 1}^{n} \frac{E_{p} (a_{p}, b_{p})}{(a_{p} + b_{p})} [1 - \sum_{k = 1}^{n} B_{i j k} C_{i j}^{- 1} \frac{F_{k} (a_{k}, b_{k})}{E_{k} (a_{k}, b_{k})}] .

(66)

Unless

B_{i j k} C_{i j}^{- 1}

is very large, the correction to the error function is exponentially suppressed, and we do not expect significant departures from the Gaussian case Equation (40). Note also that if the prior is symmetrical, it is easy to see that the skewness part of the integral vanishes,

F_{k} (a_{k}, b_{k}) \to 0

, as can be checked explicitly by taking

b_{k} \to a_{k}

.

4.2. Kurtosis

The next correction beyond skewness is the fourth-order moment or kurtosis, given by the

D_{i j k l}

term in Equation (9). Let us ignore for the moment the third-order skewness and write

ϕ (u) = exp [i μ_{i} u_{i} - \frac{1}{2!} C_{i j} u_{i} u_{j} + \frac{1}{4!} D_{i j k l} u_{i} u_{j} u_{k} u_{l}] .

(67)

By performing the same change of variables,

u_{i} = y_{i} - i C_{i k}^{- 1} (x_{k} - μ_{k})

, we can now compute the Fourier transform and obtain the properly normalized probability distribution function

\begin{matrix} f (x) & = & \frac{1}{{(2 π)}^{n / 2} \sqrt{det C_{n}}} exp [- \frac{1}{2} x^{_{T}} C_{n}^{- 1} x] (1 + \frac{1}{8} D_{i j k l} C_{i j}^{- 1} C_{k l}^{- 1} \\ - \frac{1}{4} D_{i j k l} C_{i j}^{- 1} C_{k m}^{- 1} C_{l n}^{- 1} x_{m} x_{n} + \frac{1}{24} D_{i j k l} C_{i m}^{- 1} C_{j n}^{- 1} C_{k p}^{- 1} C_{l q}^{- 1} x_{m} x_{n} x_{p} x_{q}) . \end{matrix}

(68)

Performing the integrals, it is easy to see that this distribution satisfies

〈 x_{i} x_{j} 〉 = C_{i j}, 〈 x_{i} x_{j} x_{k} x_{l} 〉 = D_{i j k l} + C_{i j} C_{k l} + C_{i k} C_{j l} + C_{i l} C_{j k}, \dots

(69)

Note that in order for the new likelihood distribution (68) to be positive definite, it is required that

D_{i j k l} C_{i j}^{- 1} C_{k l}^{- 1} < 4

, and if we impose that there is only one maximum at the centre, then it must satisfy

D_{i j k l} C_{i j}^{- 1} C_{k l}^{- 1} < 2

. These conditions impose bounds on the maximum possible deviation of the evidence from that of a Gaussian.
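For a single variable the first of these bounds can be verified directly from the single-variable form of Equation (68). The following short check uses the shorthand d ≡ D / σ⁴ and t ≡ x² / σ², both introduced here for convenience, and takes d > 0. The polynomial multiplying the Gaussian is

P (t) = 1 + \frac{d}{8} - \frac{d}{4} t + \frac{d}{24} t^{2}, \qquad P^{'} (t) = - \frac{d}{4} + \frac{d}{12} t = 0 \Rightarrow t = 3,

so that its minimum value is

P (3) = 1 + \frac{d}{8} - \frac{3 d}{4} + \frac{3 d}{8} = 1 - \frac{d}{4} > 0 \Leftrightarrow d < 4,

which reproduces the positivity bound quoted above in the one-dimensional case.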

Let us now compute the evidence, Equation (22), for this kurtosis model. The extra terms in the parenthesis of Equation (68) are all even functions of x, and we cannot ignore them, even for centred priors.

For a single variable

(n = 1)

, the correctly normalized likelihood function can be written as

f (x) = \frac{e^{- \frac{x^{2}}{2 σ^{2}}}}{σ \sqrt{2 π}} (1 + \frac{D}{8 σ^{4}} - \frac{D x^{2}}{4 σ^{6}} + \frac{D x^{4}}{24 σ^{8}}),

satisfying

〈 x 〉 = 0

,

〈 x^{2} 〉 = σ^{2}

,

〈 x^{3} 〉 = 0

,

〈 x^{4} 〉 = D + 3 σ^{4}

, etc. The Bayesian integral can be computed exactly as

E (a, b) = \frac{1}{2 a + 2 b} (Erf [\frac{a}{σ \sqrt{2}}] + Erf [\frac{b}{σ \sqrt{2}}]) + \frac{D σ^{- 4}}{8 \sqrt{2 π}} (\frac{a}{σ} (1 - \frac{a^{2}}{3 σ^{2}}) e^{- \frac{a^{2}}{2 σ^{2}}} + \frac{b}{σ} (1 - \frac{b^{2}}{3 σ^{2}}) e^{- \frac{b^{2}}{2 σ^{2}}}) \frac{1}{a + b} .

(70)
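As with the skewness term, Equation (70) can be checked against a direct numerical integral. The sketch below is illustrative only; the function names, the midpoint-rule integrator and the numerical values are assumptions introduced here.

```python
import math


def evidence_kurt_1d(sigma, kurt_d, a, b):
    """Equation (70): evidence of the kurtosis-corrected 1D likelihood."""
    scale = sigma * math.sqrt(2.0)
    gauss = (math.erf(a / scale) + math.erf(b / scale)) / (2.0 * (a + b))

    def edge(c):
        # (c/sigma)(1 - c^2 / 3 sigma^2) exp(-c^2 / 2 sigma^2) at a prior edge
        return (c / sigma) * (1.0 - c * c / (3.0 * sigma**2)) * math.exp(-c * c / (2.0 * sigma**2))

    corr = kurt_d / (8.0 * sigma**4 * math.sqrt(2.0 * math.pi))
    return gauss + corr * (edge(a) + edge(b)) / (a + b)


def likelihood_kurt_1d(x, sigma, kurt_d):
    """Normalized single-variable likelihood with <x^4> = D + 3 sigma^4."""
    g = math.exp(-x * x / (2.0 * sigma**2)) / (sigma * math.sqrt(2.0 * math.pi))
    return g * (1.0 + kurt_d / (8.0 * sigma**4)
                - kurt_d * x**2 / (4.0 * sigma**6)
                + kurt_d * x**4 / (24.0 * sigma**8))


def evidence_numeric(f, a, b, steps=20000):
    """Midpoint-rule integral of f over (-a, b), divided by the prior width."""
    h = (a + b) / steps
    return sum(f(-a + (i + 0.5) * h) for i in range(steps)) * h / (a + b)


# Hypothetical values; the kurtosis term survives even for a centred prior.
analytic = evidence_kurt_1d(1.0, 0.5, 1.5, 1.5)
numeric = evidence_numeric(lambda x: likelihood_kurt_1d(x, 1.0, 0.5), 1.5, 1.5)
print(analytic, numeric)
```

For prior edges many dispersions away from the mean the edge terms are exponentially small and the Gaussian value is recovered.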

For an arbitrary number of variables, the computation is again much more complicated. Let us start with the n-th variable and, in order to compute the first integral, let us define a new auxiliary function

\begin{matrix} h (λ) & = & \int_{- a_{n}}^{b_{n}} d x_{n} \frac{exp [- \frac{λ}{2} x^{_{T}} C_{n}^{- 1} x]}{{(2 π)}^{n / 2} \sqrt{det C_{n}}} = \frac{exp [- \frac{1}{2} x^{_{T}} C_{n - 1}^{- 1} x]}{{(2 π)}^{(n - 1) / 2} \sqrt{det C_{n - 1}}} \times \\ \times \frac{1}{2 \sqrt{λ}} (Erf [\frac{a_{n} \sqrt{λ}}{\sqrt{2}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}}] + Erf [\frac{b_{n} \sqrt{λ}}{\sqrt{2}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}}]), \end{matrix}

(71)

such that,

\begin{matrix} - 2 h^{'} (λ = 1) & = & \int_{- a_{n}}^{b_{n}} d x_{n} \frac{(x^{_{T}} C_{n}^{- 1} x) exp [- \frac{1}{2} x^{_{T}} C_{n}^{- 1} x]}{{(2 π)}^{n / 2} \sqrt{det C_{n}}} = \frac{exp [- \frac{1}{2} x^{_{T}} C_{n - 1}^{- 1} x]}{{(2 π)}^{(n - 1) / 2} \sqrt{det C_{n - 1}}} \times \\ \times \{\frac{1}{2} (Erf [\frac{a_{n}}{\sqrt{2}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}}] + Erf [\frac{b_{n}}{\sqrt{2}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}}]) \\ - \frac{1}{\sqrt{2 π}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}} (a_{n} exp [- \frac{a_{n}^{2}}{2} \frac{det C_{n - 1}}{det C_{n}}] + b_{n} exp [- \frac{b_{n}^{2}}{2} \frac{det C_{n - 1}}{det C_{n}}])\} . \end{matrix}

(72)

\begin{matrix} 4 h^{''} (λ = 1) & = & \int_{- a_{n}}^{b_{n}} d x_{n} \frac{{(x^{_{T}} C_{n}^{- 1} x)}^{2} exp [- \frac{1}{2} x^{_{T}} C_{n}^{- 1} x]}{{(2 π)}^{n / 2} \sqrt{det C_{n}}} = \frac{exp [- \frac{1}{2} x^{_{T}} C_{n - 1}^{- 1} x]}{{(2 π)}^{(n - 1) / 2} \sqrt{det C_{n - 1}}} \times \\ \times \{\frac{3}{2} (Erf [\frac{a_{n}}{\sqrt{2}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}}] + Erf [\frac{b_{n}}{\sqrt{2}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}}]) \\ - \frac{3}{\sqrt{2 π}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}} (a_{n} exp [- \frac{a_{n}^{2}}{2} \frac{det C_{n - 1}}{det C_{n}}] + b_{n} exp [- \frac{b_{n}^{2}}{2} \frac{det C_{n - 1}}{det C_{n}}]) \\ - \frac{1}{\sqrt{2 π}} {(\frac{det C_{n - 1}}{det C_{n}})}^{3 / 2} (a_{n}^{3} exp [- \frac{a_{n}^{2}}{2} \frac{det C_{n - 1}}{det C_{n}}] + b_{n}^{3} exp [- \frac{b_{n}^{2}}{2} \frac{det C_{n - 1}}{det C_{n}}])\} . \end{matrix}

(73)

Therefore, with the use of Equations (72) and (73), the integral of the kurtosis-corrected distribution function (68) over the

x_{n}

prior becomes

\begin{matrix} \int_{- a_{n}}^{b_{n}} d x_{n} f (x) = \frac{exp [- \frac{1}{2} x^{_{T}} C_{n - 1}^{- 1} x]}{{(2 π)}^{(n - 1) / 2} \sqrt{det C_{n - 1}}} \{\frac{1}{2} (Erf [\frac{a_{n}}{\sqrt{2}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}}] + Erf [\frac{b_{n}}{\sqrt{2}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}}]) + \\ + \frac{1}{8} D_{i j k l} C_{i j}^{- 1} C_{k l}^{- 1} \frac{1}{\sqrt{2 π}} \sqrt{\frac{det C_{n - 1}}{det C_{n}}} [a_{n} (1 - \frac{a_{n}^{2}}{3} \frac{det C_{n - 1}}{det C_{n}}) e^{- \frac{a_{n}^{2} det C_{n - 1}}{2 det C_{n}}} + b_{n} (1 - \frac{b_{n}^{2}}{3} \frac{det C_{n - 1}}{det C_{n}}) e^{- \frac{b_{n}^{2} det C_{n - 1}}{2 det C_{n}}}]\} . \end{matrix}

(74)

We can now define a new function

G_{i} (a_{i}, b_{i}) = \frac{1}{8 \sqrt{2 π}} \sqrt{\frac{det C_{i - 1}}{det C_{i}}} [a_{i} (1 - \frac{a_{i}^{2}}{3} \frac{det C_{i - 1}}{det C_{i}}) e^{- \frac{a_{i}^{2} det C_{i - 1}}{2 det C_{i}}} + b_{i} (1 - \frac{b_{i}^{2}}{3} \frac{det C_{i - 1}}{det C_{i}}) e^{- \frac{b_{i}^{2} det C_{i - 1}}{2 det C_{i}}}] .

(75)

Integrating iteratively over

x_{n - 1}, \dots, x_{1}

, we end up with the Bayesian evidence for the fourth-order-corrected probability distribution function

f (x)

,

E (a, b) = \prod_{p = 1}^{n} \frac{E_{p} (a_{p}, b_{p})}{(a_{p} + b_{p})} [1 + D_{i j k l} C_{i j}^{- 1} C_{k l}^{- 1} \sum_{m = 1}^{n} \frac{G_{m} (a_{m}, b_{m})}{E_{m} (a_{m}, b_{m})}] .

(76)

So, unless

D_{i j k l} C_{i j}^{- 1} C_{k l}^{- 1}

is very large, the correction to the error function is exponentially suppressed, and we do not expect significant departures from the Gaussian case, Equation (40).

In order to compare models it is customary to compute the logarithm of the evidence. Let us assume that we are given a likelihood distribution function normalized by the maximum likelihood, and with corrections up to the fourth order,

\begin{matrix} f (x) = L_{\max} exp [- \frac{1}{2} x^{_{T}} C_{n}^{- 1} x] {(1 + \frac{1}{8} D_{i j k l} C_{i j}^{- 1} C_{k l}^{- 1})}^{- 1} (1 - \frac{1}{2} B_{i j k} C_{i j}^{- 1} C_{k l}^{- 1} x_{l} + \frac{1}{6} B_{i j k} C_{i l}^{- 1} C_{j m}^{- 1} C_{k n}^{- 1} x_{l} x_{m} x_{n} \\ + \frac{1}{8} D_{i j k l} C_{i j}^{- 1} C_{k l}^{- 1} - \frac{1}{4} D_{i j k l} C_{i j}^{- 1} C_{k m}^{- 1} C_{l n}^{- 1} x_{m} x_{n} + \frac{1}{24} D_{i j k l} C_{i m}^{- 1} C_{j n}^{- 1} C_{k p}^{- 1} C_{l q}^{- 1} x_{m} x_{n} x_{p} x_{q}) . \end{matrix}

(77)

Note that it is normalized so that the maximum corresponds to the mean-centred distribution, i.e.,

x = 0

. In this case, the evidence of the normalized distribution is given by

\begin{matrix} E (a, b) = L_{\max} {(2 π)}^{n / 2} \sqrt{det C_{n}} {(1 + \frac{1}{8} D_{i j k l} C_{i j}^{- 1} C_{k l}^{- 1})}^{- 1} \times \\ \prod_{p = 1}^{n} \frac{E_{p} (a_{p}, b_{p})}{(a_{p} + b_{p})} [1 - \sum_{k = 1}^{n} B_{i j k} C_{i j}^{- 1} \frac{F_{k} (a_{k}, b_{k})}{E_{k} (a_{k}, b_{k})} + D_{i j k l} C_{i j}^{- 1} C_{k l}^{- 1} \sum_{m = 1}^{n} \frac{G_{m} (a_{m}, b_{m})}{E_{m} (a_{m}, b_{m})}] . \end{matrix}

(78)

We can then evaluate the logarithm of the evidence by

\begin{matrix} ln E & = & ln L_{\max} + \frac{n}{2} ln (2 π) + \frac{1}{2} ln det C_{n} - ln (1 + \frac{1}{8} D_{i j k l} C_{i j}^{- 1} C_{k l}^{- 1}) - \sum_{p = 1}^{n} ln (2 a_{p} + 2 b_{p}) \\ + \sum_{p = 1}^{n} ln (Erf [\frac{a_{p}}{\sqrt{2}} \sqrt{\frac{det C_{p - 1}}{det C_{p}}}] + Erf [\frac{b_{p}}{\sqrt{2}} \sqrt{\frac{det C_{p - 1}}{det C_{p}}}]) \\ + ln (1 - \sum_{k = 1}^{n} B_{i j k} C_{i j}^{- 1} \frac{F_{k} (a_{k}, b_{k})}{E_{k} (a_{k}, b_{k})} + D_{i j k l} C_{i j}^{- 1} C_{k l}^{- 1} \sum_{m = 1}^{n} \frac{G_{m} (a_{m}, b_{m})}{E_{m} (a_{m}, b_{m})}) . \end{matrix}

(79)

Note that the condition

D_{i j k l} C_{i j}^{- 1} C_{k l}^{- 1} < 2

constrains the maximum amount that the kurtosis corrections can contribute to the evidence.

Uncorrelated case. In the case where the likelihood distribution has no correlations among the different variables, the exact expression for the Bayesian evidence is

\begin{matrix} ln E = ln L_{\max} + \frac{n}{2} ln (2 π) + \sum_{p = 1}^{n} ln σ_{p} - \sum_{p = 1}^{n} ln (2 a_{p} + 2 b_{p}) + \sum_{p = 1}^{n} ln (Erf [\frac{a_{p}}{σ_{p} \sqrt{2}}] + Erf [\frac{b_{p}}{σ_{p} \sqrt{2}}]) \\ - ln (1 + \frac{1}{8} D_{i i j j} σ_{i}^{- 2} σ_{j}^{- 2}) + ln (1 - \sum_{k = 1}^{n} B_{i i k} σ_{i}^{- 2} \frac{F_{k} (a_{k}, b_{k})}{E_{k} (a_{k}, b_{k})} + D_{i i j j} σ_{i}^{- 2} σ_{j}^{- 2} \sum_{m = 1}^{n} \frac{G_{m} (a_{m}, b_{m})}{E_{m} (a_{m}, b_{m})}), \end{matrix}

(80)

where

σ_{p}

are the corresponding dispersions of variables

x_{p}

, and the functions

E_{i}, F_{i}

and

G_{i}

are the corresponding limiting functions of Equations (65) and (75) for uncorrelated matrices.

5. Model Comparison

Finally we turn to specific applications of the formalism discussed above. Initially we will carry out some toy model tests of its performance, and then examine real cosmological applications for which we previously obtained results by thermodynamic integration [12].

5.1. A Baby-Toy Model Comparison

We begin with a very simple two-dimensional toy model. The purpose of this section is to illustrate the inefficiency of thermodynamic integration and to give an indication of the performance of the method we propose here. In addition, the two-dimensional model is simple enough to allow a brute-force direct numerical integration of the evidence, which lets us check the accuracy at the same time. We use the following two forms of likelihood:

\begin{matrix} L_{g} (x, y) & = & exp [- \frac{2 x^{2} + 2 {(y - 1)}^{2} - x y}{2}] \end{matrix}

(81)

\begin{matrix} L_{n g} (x, y) & = & exp [- \frac{2 x^{2} + 2 {(y - 1)}^{2} - x y}{2}] + exp [- \frac{2 x^{2} + 2 y^{2} - 3 x y}{2}] \end{matrix}

(82)

The subscripts g and

n g

indicate the Gaussian and non-Gaussian cases, respectively.

Firstly, we calculate the evidence by the analytical method using Equations (56) and () and covariance matrices inferred from sampling the likelihood using the vanilla Metropolis–Hastings algorithm with fixed proposal widths. Chains ranging from a few to several millions of samples were used. We also calculate the evidence using the thermodynamic integration algorithm explained in ref. [12]. Again, we vary the algorithm parameters to obtain evidence values of varying accuracy. The resulting evidence as a function of the number of likelihood evaluations is plotted in Figure 1, together with the correct value inferred by direct numerical integration. The number of likelihood evaluations is crucial, as this is the time-limiting step in cosmological parameter estimation and model comparison exercises. The results are what could have been anticipated. We note that the size of the prior does not seem to be of crucial importance. This is comforting, given that the analytical method requires knowledge of the true covariance, while we can only supply a covariance matrix estimated from the prior-truncated likelihood. We also note that the thermodynamic integration converges to the correct value in all cases. However, it does so only after very many likelihood evaluations, typically about a million even for a two-dimensional problem. The analytical method already becomes limited by systematics at about ten thousand samples. For the Gaussian case, there is no systematic error by construction, while the non-Gaussian case suffers a systematic error of about

0.1

in

ln E

. The non-Gaussian correction reduces the error by about half and thus correctly estimates the uncertainty associated with the purely Gaussian approximation. In the case of wide priors, the only non-Gaussian correction of appreciable size is the

ln (1 + D_{i j k l} C_{i j}^{- 1} C_{k l}^{- 1} / 8)

term.
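The brute-force check described in this subsection can be sketched in a few lines. The example below is an illustration under stated assumptions, not the code used for Figure 1: it takes the exponent of the Gaussian likelihood to be the negative-definite form −(2x² + 2(y − 1)² − xy)/2, replaces Metropolis–Hastings sampling by the exact covariance (the inverse Hessian of −ln L), and uses a simple midpoint grid. All function names are hypothetical.

```python
import math


def like_g(x, y):
    # Assumed negative-definite exponent for the Gaussian toy likelihood.
    return math.exp(-(2.0 * x * x + 2.0 * (y - 1.0) ** 2 - x * y) / 2.0)


def ln_evidence_grid(like, lo, hi, steps=400):
    """Midpoint-rule evidence for a top-hat prior (lo, hi) on both axes."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        for j in range(steps):
            total += like(x, lo + (j + 0.5) * h)
    return math.log(total * h * h / (hi - lo) ** 2)


def ln_evidence_laplace_2d(ln_l_max, det_cov, lo, hi):
    """Two-parameter form of Equation (57) with equal prior widths."""
    return ln_l_max + math.log(2.0 * math.pi) + 0.5 * math.log(det_cov) - 2.0 * math.log(hi - lo)


# Hessian of -ln L is [[2, -1/2], [-1/2, 2]], so det C = 1 / 3.75;
# the peak sits at (4/15, 16/15), where the gradient of the exponent vanishes.
ln_l_max = math.log(like_g(4.0 / 15.0, 16.0 / 15.0))
det_cov = 1.0 / 3.75

ln_grid_wide = ln_evidence_grid(like_g, -7.0, 10.0)
ln_lap_wide = ln_evidence_laplace_2d(ln_l_max, det_cov, -7.0, 10.0)
print(ln_grid_wide, ln_lap_wide)
```

For the wide prior the two numbers agree to high precision because the likelihood is exactly Gaussian and fits well inside the prior; with the narrow prior the Laplace value no longer accounts for the truncated tails.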

5.2. A Toy Model Comparison

We now proceed by calculating the Bayesian evidence for simple toy models with five and six parameters, shown in Table 1. The purpose is to compare results with those obtained from thermodynamic integration again, but this time using a model that bears more resemblance to a typical problem encountered in cosmology.

Beginning with the five-parameter model, we first assume that it has an uncorrelated multivariate Gaussian likelihood distribution. In this case the aim is to test the thermodynamic integration method, which gives

ln E_{toy 5}^{num} = - 8.65 \pm 0.03

, while the exact expression gives

ln E_{toy 5}^{ana} = - 8.66

. Therefore, we conclude that the thermodynamic integration method is rather good in obtaining the correct evidence of the model. The Laplace approximation Equation (57) also fares well for uncorrelated distributions,

ln E_{toy 5}^{Lap} = - 8.67

.

We now consider a likelihood function with a correlated covariance matrix

C_{i j}

, with the same mean values and dispersions as the previous case, but with significant correlations. The analytic formula needed, Equation (54), is no longer exact (see Note 3), and gives

ln E_{toy 5 c}^{ana} = - 7.32

. For comparison thermodynamic integration gives

ln E_{toy 5 c}^{num} = - 7.28 \pm 0.06

, again in perfect agreement within errors. In this case the Laplace approximation fails significantly,

ln E_{toy 5 c}^{Lap} = - 6.89

, the reason being that the correlations chosen bring the posterior into significant contact with the edges of the priors.

Let us now return to the uncorrelated case and include a new parameter,

x_{6}

, as in Table 1, and evaluate the different evidences that appear because of this new parameter, in order to see the sensitivity to systematic errors in the evaluation of the Bayesian evidence and their effects on model comparison. The numerical result is

ln E_{toy 6}^{num} = - 10.75 \pm 0.03

, while the exact analytical expression gives

ln E_{toy 6}^{ana} = - 10.74

, in perfect agreement within errors. The Laplace approximation Equation (57) again fares well for uncorrelated distributions,

ln E_{toy 6}^{Lap} = - 10.74

.

When the likelihood function has large correlations, and the priors are not too large, the naive Laplace approximation, Equation (57), fares less well than the analytical approximation, Equation (54).

5.3. A Real Model Comparison

In this subsection we will make use of the results obtained in ref. [12], where we evaluated the evidence for 5- and 6-parameter adiabatic models, and for three 10-parameter mixed adiabatic plus isocurvature models. The prior ranges used are given in Table 2. The latter models give a marginally better fit to the data but require more parameters, which is exactly the situation where model selection techniques are needed to draw robust conclusions. In ref. [12] we used thermodynamic integration to compute the evidence and showed that the isocurvature models were less favoured than the adiabatic ones, but only at a mild significance level (see Note 4).

Beginning with the simplest adiabatic model, which uses the Harrison–Zel’dovich spectrum, we have used the analytical formulae above, Equation (54), together with the covariance matrix provided by the cosmoMC programme [21], and obtained

ln E_{ad}^{ana} = - 854.07

, while the thermodynamic integration gave

ln E_{ad}^{num} = - 854.1 \pm 0.1

[12]. The agreement is excellent; this is because the distribution function for the adiabatic model is rather well-approximated by a Gaussian, and the priors are rather large, so the formula Equation (54) is very close to that obtained in the Laplace approximation,

ln E_{ad}^{Lap} = - 854.08

.

However the analytic method fares less well for the adiabatic model with varying

n_{s}

, with both the analytical and Laplace methods giving

ln E_{AD - n_{s}} = - 853.4

, while the numerical method gives the smaller value −854.1, a discrepancy of about 0.7.

Turning now to the isocurvature cases, we found an extremely good result for the CDI model, gaining from Equation (54) the value

ln E_{cdi}^{ana} = - 855.08

, while the thermodynamic integration gives

ln E_{cdi}^{num} = - 855.1 \pm 0.1

. This is surprising, given the relatively large non-Gaussianities for at least three variables:

n_{iso}

,

β

and

δ_{cor}

, whose priors are not centred with respect to the mean. However the NID case shows much less agreement, with a discrepancy of 0.6. This suggests that the closeness of the CDI comparison is to some extent a statistical fluke, with the underlying method less accurate.
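As a worked check of the numbers quoted above, the log Bayes factor between the Harrison–Zel’dovich adiabatic model and the CDI model follows by simple subtraction. The combined numerical uncertainty below assumes that the two thermodynamic integration errors are independent and add in quadrature, which is an assumption made here rather than a statement taken from ref. [12]:

ln (E_{ad} / E_{cdi})^{ana} = - 854.07 - (- 855.08) = 1.01, \qquad ln (E_{ad} / E_{cdi})^{num} = - 854.1 - (- 855.1) = 1.0 \pm \sqrt{0.1^{2} + 0.1^{2}} \approx 1.0 \pm 0.14 .

Both estimates favour the adiabatic model by about one unit of ln E.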

A summary of the different models can be found in Table 3.

5.4. Savage–Dickey Method

Another numerical method for evidence calculation is the Savage–Dickey method, first described in ref. [22] and recently used in ref. [20]. This technique allows one to calculate the evidence ratio of two models from a simple and quick analysis of the Markov chains used for parameter estimation, provided that the models are nested; i.e., that one of them is included in the parameter space of the other. For instance, the AD model is nested within the AD-

n_{s}

model, and the AD and AD-

n_{s}

models are both nested within the CDI, NID and NIV ones. In the context of Markov chains, the Savage–Dickey method is essentially a measure of how much time the sampler spends in the nested model, weighted by the respective volumes of the two models. When the outer model has extra parameters, this method relies on approximating the nested model as a model with negligibly narrow priors in the directions of the extra parameters. We note, however, that when many extra parameters are present, this method must fail for reasons similar to those that make grid-based parameter estimation fail for models with many parameters. The MCMC parameter estimation simply does not have a high enough dynamic range to probe the two models, given the large prior volume ratio.

The AD and AD-

n_{s}

models differ by one parameter. Using the same AD-

n_{s}

samples as for the analytical method (i.e., the samples from which we extracted the covariance matrix), we obtained

ln (E_{AD} / E_{AD - n_{s}}) = 0.03

. The result from the precise thermodynamic integration,

ln (E_{AD} / E_{AD - n_{s}}) = 0 \pm 0.1

, is in excellent agreement. The AD-

n_{s}

and CDI (or NID, NIV) models differ by four parameters. With most simple choices of parametrization (including in particular the isocurvature and cross-correlation tilts), the AD-

n_{s}

is not a point, but a hyper-surface within the parameter space of the isocurvature models (i.e.,

α = 0

and the other three parameters act as dummy, unconstrained, parameters which do not affect the evidence). In these cases, the evidence ratios given by the Savage–Dickey method do not converge as the priors of the extra parameters are tightened up around the nested model, although they match thermodynamically determined values to within a unit of

ln E

.
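For a single extra parameter with a top-hat prior, the "time spent in the nested model, weighted by the volumes" can be sketched as a density ratio at the nested value. The example below is a minimal illustration, not the implementation of ref. [20] or ref. [22]: it assumes unweighted chain samples of the extra parameter, estimates the posterior density at the nested value from the fraction of samples in a narrow bin, and uses synthetic Gaussian samples in place of a real chain. The function name and all numbers are hypothetical.

```python
import math
import random


def savage_dickey_ln_ratio(samples, theta0, half_width, prior_lo, prior_hi):
    """ln(E_nested / E_outer) from chain samples of the extra parameter.

    The posterior density at theta0 is estimated from the fraction of
    samples within +/- half_width; the prior density is flat (top-hat).
    """
    inside = sum(1 for t in samples if abs(t - theta0) <= half_width)
    if inside == 0:
        raise ValueError("no samples near theta0: insufficient dynamic range")
    posterior_density = inside / (len(samples) * 2.0 * half_width)
    prior_density = 1.0 / (prior_hi - prior_lo)
    return math.log(posterior_density / prior_density)


# Synthetic stand-in for a chain: unit Gaussian posterior centred on theta0 = 0.
random.seed(1)
chain = [random.gauss(0.0, 1.0) for _ in range(200000)]
ln_ratio = savage_dickey_ln_ratio(chain, 0.0, 0.05, -5.0, 5.0)
print(ln_ratio)
```

The explicit failure when no samples fall in the bin mirrors the dynamic-range limitation noted above: with many extra parameters the corresponding bin volume becomes too small for the chain to populate.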

6. Discussion and Conclusions

We have developed an analytical formalism for computing the Bayesian evidence in the case of an arbitrary likelihood distribution with a hierarchy of non-Gaussian corrections, and with arbitrary top-hat priors, centred or uncentred. This analysis can be of great help for the problem of model comparison in the present context of cosmology where observational data is still unable to rule out most extensions of the standard model based on the

Λ

CDM inflationary paradigm.

As an application of the exact and approximate formulae obtained for the Bayesian evidence of a model with approximately Gaussian likelihood distributions, we have compared the value predicted analytically with that computed with a time-consuming algorithm based on the thermodynamic integration approach. The values analytically obtained agree surprisingly well with those obtained numerically. While one can estimate the magnitude of the higher-order corrections for the analytical formulae, it is very difficult to estimate the systematic effects of the numerical approach. Thus, with this analytical method we can test for systematics in the thermodynamic integration approach. So far, the values obtained agree, so it seems that the numerical approach is a good tool for estimating the evidence. However, it takes considerable effort and machine time to do the correct evaluation, and therefore we propose the use of the analytical estimate, whose corrections are well under control, in the sense that one can compute the next order corrections and show that they are small.

Funding

This research was funded by the Spanish grants PID2021-123012NB-C43 [MICINN-FEDER] and the Centro de Excelencia Severo Ochoa Program CEX2020-001007-S through IFT.

Data Availability Statement

There is no data associated with this work.

Conflicts of Interest

The author declares no conflict of interest.

Notes

1	An extension to Gaussian priors should be feasible, but not one to arbitrary priors.
2	Note that, for scalar quantities, Einstein notation for the sum over free indices is assumed.
3	One could rotate the parameter basis to remove the correlations, but then the priors would not be top-hats.
4	Recently, Trotta [20] used a different technique to analyse a restricted class of isocurvature model featuring just one extra parameter, and found it highly disfavoured. The different conclusion is primarily due to the very different prior he chose on the isocurvature amplitude, such that almost all the models under the prior are dominated by isocurvature models and in poor agreement with the data.

References

1. Jeffreys, H. Theory of Probability, 3rd ed.; Oxford University Press: Oxford, UK, 1961.
2. MacKay, D.J.C. Information Theory, Inference and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003.
3. Jaffe, A. H0 and odds on cosmology. Astrophys. J. 1996, 471, 24.
4. Drell, P.S.; Loredo, T.J.; Wasserman, I. Type Ia supernovae, evolution, and the cosmological constant. Astrophys. J. 2000, 530, 593.
5. John, M.V.; Narlikar, J.V. Comparison of cosmological models using Bayesian theory. Phys. Rev. D 2002, 65, 043506.
6. Hobson, M.P.; Bridle, S.L.; Lahav, O. Combining cosmological data sets: Hyperparameters and Bayesian evidence. Mon. Not. R. Astron. Soc. 2002, 335, 377.
7. Slosar, A.; Carreira, P.; Cleary, K.; Davies, R.D.; Davis, R.J.; Dickinson, C.; Genova-Santos, R.; Grainge, K.; Gutierrez, C.M.; Hafez, Y.A.; et al. Cosmological parameter estimation and Bayesian model comparison using Very Small Array data. Mon. Not. R. Astron. Soc. 2003, 341, L29.
8. Saini, T.D.; Weller, J.; Bridle, S.L. Revealing the nature of dark energy using Bayesian evidence. Mon. Not. R. Astron. Soc. 2004, 348, 603.
9. Niarchou, A.; Jaffe, A.H.; Pogosian, L. Large-scale power in the CMB and new physics: An analysis using Bayesian model comparison. Phys. Rev. D 2004, 69, 063515.
10. Marshall, P.; Rajguru, N.; Slosar, A. Bayesian evidence as a tool for comparing datasets. Phys. Rev. D 2006, 73, 067302.
11. Liddle, A.R. How many cosmological parameters? Mon. Not. R. Astron. Soc. 2004, 351, L49–L53.
12. Beltrán, M.; García-Bellido, J.; Lesgourgues, J.; Liddle, A.R.; Slosar, A. Bayesian model selection and isocurvature perturbations. Phys. Rev. D 2005, 71, 063532.
13. Ó’Ruanaidh, J.J.K.; Fitzgerald, W.J. Numerical Bayesian Methods Applied to Signal Processing; Springer: New York, NY, USA, 1996.
14. Hobson, M.P.; McLachlan, C. A Bayesian approach to discrete object detection in astronomical data sets. Mon. Not. R. Astron. Soc. 2003, 338, 765.
15. Skilling, J. Nested sampling. AIP Conf. Proc. 2004, 735, 395.
16. Handley, W.J.; Hobson, M.P.; Lasenby, A.N. POLYCHORD: Nested sampling for cosmology. Mon. Not. R. Astron. Soc. 2015, 450, L61.
17. Xie, W.; Lewis, P.O.; Fan, Y.; Kuo, L.; Chen, M.-H. Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Syst. Biol. 2011, 60, 150.
18. Maturana-Russel, P.; Meyer, R.; Veitch, J.; Christensen, N. Search for the isotropic stochastic background using data from Advanced LIGO’s second observing run. Phys. Rev. D 2019, 99, 084006.
19. Kass, R.E.; Raftery, A.E. Bayes Factors. J. Am. Stat. Assoc. 1995, 90, 773.
20. Trotta, R. Applications of Bayesian model selection to cosmological parameters. Mon. Not. R. Astron. Soc. 2007, 378, 72.
21. Lewis, A.; Bridle, S. Cosmological parameters from CMB and other data: A Monte Carlo approach. Phys. Rev. D 2002, 66, 103511.
22. Dickey, J.M. The Weighted Likelihood Ratio, Linear Hypotheses on Normal Location Parameters. Ann. Math. Stat. 1971, 42, 204.

Figure 1. This figure shows the calculated evidence as a function of the number of likelihood evaluations. Note that the horizontal axis is logarithmic. The star-centred line corresponds to the thermodynamic integration. The cross-centred lines are the analytical methods with (upper panels) and without (lower panels) non-Gaussian corrections applied. The horizontal dashed line is the number obtained by the direct integration. The upper two panels correspond to

L_{g}

, while the lower two to

L_{n g}

. The left-hand side panels correspond to wide flat priors of

(- 7, 10)

on both parameters, while the right-hand side to the narrow priors of

(- 2, 3)

on both parameters. The error bars correspond to the dispersion due to the number of likelihood evaluations.

Table 1. The parameters used in the analytical evaluation of the toy model evidences, with five and six parameters, respectively. The maximum likelihood of the toy models is taken (arbitrarily) to be $L_{\max} = 1$.

Parameter	Mean	Prior Range	Model
$x_{1}$	0.022	[0.0001, 0.044]	toy5, toy6
$x_{2}$	0.12	[0.001, 0.3]	toy5, toy6
$x_{3}$	1.04	[0.8, 1.4]	toy5, toy6
$x_{4}$	0.1	[0.01, 0.3]	toy5, toy6
$x_{5}$	3.1	[2.6, 3.6]	toy5, toy6
$x_{6}$	0.98	[0.5, 1.5]	toy6
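
As an illustration of how the entries of Table 1 enter an analytical evidence, the following is a minimal sketch for the simplest case only: an uncorrelated Gaussian likelihood with top-hat priors, where the evidence factorises into one-dimensional error-function integrals. The function name and the parameter widths `HYPOTHETICAL_SIGMAS` are assumptions made for this example; Table 1 lists only means and prior ranges, so the resulting number is not expected to reproduce the toy5 evidence quoted in the paper.

```python
import math


def ln_evidence_gaussian_tophat(means, sigmas, priors, ln_l_max=0.0):
    """ln E for an uncorrelated Gaussian likelihood with top-hat priors.

    Each parameter contributes the integral of exp(-(x - mu)^2 / (2 sigma^2))
    over its prior range [a, b], divided by the prior width (b - a).
    """
    ln_e = ln_l_max
    for mu, sigma, (a, b) in zip(means, sigmas, priors):
        upper = math.erf((b - mu) / (sigma * math.sqrt(2.0)))
        lower = math.erf((a - mu) / (sigma * math.sqrt(2.0)))
        # Closed form of the truncated Gaussian integral over [a, b]
        integral = sigma * math.sqrt(math.pi / 2.0) * (upper - lower)
        ln_e += math.log(integral / (b - a))
    return ln_e


# Means and prior ranges of the five-parameter toy model, copied from Table 1
TOY5_MEANS = [0.022, 0.12, 1.04, 0.1, 3.1]
TOY5_PRIORS = [(0.0001, 0.044), (0.001, 0.3), (0.8, 1.4), (0.01, 0.3), (2.6, 3.6)]

# HYPOTHETICAL widths: not given in Table 1, chosen only for illustration
HYPOTHETICAL_SIGMAS = [0.001, 0.01, 0.02, 0.03, 0.05]

# ln L_max = 0 corresponds to L_max = 1, as in the caption of Table 1
ln_e_toy5 = ln_evidence_gaussian_tophat(
    TOY5_MEANS, HYPOTHETICAL_SIGMAS, TOY5_PRIORS, ln_l_max=0.0
)
print(f"ln E (illustrative widths) = {ln_e_toy5:.2f}")
```

When the prior range is much wider than the likelihood, each factor reduces to $\sqrt{2\pi}\,\sigma/(b-a)$, the usual Occam penalty for an extra parameter. Correlations between parameters, which this sketch ignores, would require the full covariance matrix.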

Table 2. The parameters used in the models; see ref. [12] for nomenclature and other details. For the AD-HZ model, $n_{s}$ was fixed to 1 and $n_{iso}$, $δ_{cor}$, $α$ and $β$ were fixed to 0. In the AD-$n_{s}$ model, $n_{s}$ also varies. Every isocurvature model holds the same priors for the whole set of parameters.

Parameter	Mean	Prior Range	Model
$ω_{b}$	0.022	[0.018, 0.032]	AD-HZ, AD-$n_{s}$, ISO
$ω_{dm}$	0.12	[0.04, 0.16]	AD-HZ, AD-$n_{s}$, ISO
$θ$	1.04	[0.98, 1.10]	AD-HZ, AD-$n_{s}$, ISO
$τ$	0.17	[0, 0.5]	AD-HZ, AD-$n_{s}$, ISO
$\ln [10^{10} R_{rad}]$	3.1	[2.6, 4.2]	AD-HZ, AD-$n_{s}$, ISO
$n_{s}$	1.0	[0.8, 1.2]	AD-$n_{s}$, ISO
$n_{iso}$	1.5	[0, 3]	ISO
$δ_{cor}$	1.5	[−0.14, 0.4]	ISO
$\sqrt{α}$	0	[−1, 1]	ISO
$β$	0	[−1, 1]	ISO

Table 3. The different models, both toy and real, with their maximum likelihoods and evidences.

Model	$\ln L^{\max}$	$\ln E^{num}$	$\ln E^{ana}$	$\ln E^{Lap}$
toy5	0	$-8.65 \pm 0.03$	$-8.66$	$-8.67$
toy5c	0	$-7.28 \pm 0.06$	$-7.32$	$-6.89$
toy6	0	$-10.75 \pm 0.03$	$-10.74$	$-10.74$
toy6c	0	$-9.73 \pm 0.06$	$-9.71$	$-9.63$
AD	$-840.78$	$-854.1 \pm 0.1$	$-854.1$	$-854.1$
AD-$n_{s}$	$-838.50$	$-854.1 \pm 0.1$	$-853.4$	$-853.4$
CDI	$-838.05$	$-855.1 \pm 0.2$	$-855.1$	$-854.5$
NID	$-836.60$	$-855.1 \pm 0.2$	$-854.5$	$-854.5$
NIV	$-842.53$	$-855.1 \pm 0.3$	$-854.9$	$-854.9$
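
Model comparison uses differences of the log-evidences in Table 3, since the Bayes factor between two models is the ratio of their evidences. The short sketch below computes $\ln B = \ln E_{\rm model} - \ln E_{\rm ref}$ from the $\ln E^{num}$ column, taking AD as the reference. The dictionary keys and the function name are illustrative choices for this example, and the quoted numerical errors are not propagated.

```python
import math

# ln E^num central values copied from Table 3 (real models only)
LN_E_NUM = {
    "AD": -854.1,
    "AD-ns": -854.1,
    "CDI": -855.1,
    "NID": -855.1,
    "NIV": -855.1,
}


def ln_bayes_factors(ln_evidences, reference):
    """Return ln B = ln E_model - ln E_reference for every model."""
    ln_e_ref = ln_evidences[reference]
    return {model: ln_e - ln_e_ref for model, ln_e in ln_evidences.items()}


ln_b = ln_bayes_factors(LN_E_NUM, reference="AD")
for model, value in ln_b.items():
    # exp(ln B) is the evidence ratio E_model / E_AD
    print(f"{model:6s} ln B = {value:+.1f}   E/E_AD = {math.exp(value):.2f}")
```

With these values the isocurvature models (CDI, NID, NIV) sit one unit of $\ln E$ below the adiabatic reference, which corresponds to an evidence ratio of $e^{-1} \approx 0.37$.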

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
