Article

An Analytical Approach to Bayesian Evidence Computation

by
Juan García-Bellido
Departamento de Física Teórica C-XI, Universidad Autónoma de Madrid, Cantoblanco, 28049 Madrid, Spain
Universe 2023, 9(3), 118; https://doi.org/10.3390/universe9030118
Submission received: 1 February 2023 / Revised: 20 February 2023 / Accepted: 22 February 2023 / Published: 24 February 2023
(This article belongs to the Section Cosmology)

Abstract

Bayesian evidence is a key tool in model selection, allowing a comparison of models with different numbers of parameters. Its use in the analysis of cosmological models has been limited by difficulties in calculating it, with current numerical algorithms requiring supercomputers. In this paper we give exact formulae for the Bayesian evidence in the case of Gaussian likelihoods with arbitrary correlations and top-hat priors, and approximate formulae for the case of likelihood distributions with leading non-Gaussianities (skewness and kurtosis). We apply these formulae to cosmological models with and without isocurvature components, and compare with results we previously obtained using numerical thermodynamic integration. We find that the results are of lower precision than the thermodynamic integration, while still being good enough to be useful.

1. Introduction

Model selection refers to the statistical problem of deciding which model description of observational data is the best [1,2]. It differs from parameter estimation, where the choice of a single model (i.e., choice of parameters to be varied) has already been made and the aim is to find their best-fitting values and ranges. While there have been widespread applications of parameter estimation techniques, usually likelihood fitting, to cosmological data, there has so far been quite limited application of model selection statistics [3,4,5,6,7,8,9,10,11,12]. This is unfortunate, as model selection techniques are necessary to robustly distinguish between models with different numbers of parameters, and many of the most interesting issues in cosmology concern the desirability or otherwise of incorporating additional parameters to describe new physical effects.
Within the context of Bayesian inference, model selection should be carried out using the Bayesian evidence [1,2], which measures the probability of the model in light of the observational data (i.e., the average likelihood over the prior distribution). The Bayesian evidence associates a single number with each model, and the models can then be ranked in order of the evidence, with the ratios of those values interpreted as the relative probability of the models. This process sets up a desirable tension between model simplicity and the ability to fit the data.
Use of the Bayesian evidence has so far been limited by difficulties in calculating it. The standard technique is thermodynamic integration [13,14], which varies the temperature in a Markov Chain Monte Carlo (MCMC) approach so that the distribution is sampled in a way that covers both the posterior and prior distributions. However, in recent work [12] we showed that in order to obtain sufficiently accurate results in a cosmological context, around $10^7$ likelihood evaluations are required per model. Such analyses are CPU-limited by the time needed to generate the predicted spectra to compare with the data, and this requirement pushes the problem into the supercomputer class (for comparison, parameter estimation runs typically employ $10^5$ to $10^6$ likelihood evaluations).
In this paper, we propose and exploit a new analytic method to compute the evidence based on an expansion of the likelihood distribution function. The method pre-supposes that the covariance of the posterior distribution has been obtained, for instance via an MCMC parameter estimation run, and in its present form requires that the prior distributions of the parameters are uniform top-hat priors.¹ While the method will not be applicable for general likelihood distributions, we include the leading non-Gaussianities (skewness and kurtosis) in approximating the likelihood shape, with the expectation of obtaining good results whenever the likelihood distribution is sufficiently simple. Cosmological examples commonly exhibit likelihood distributions with only a single significant peak.
We apply the method both to toy model examples and to genuine cosmological situations. In particular, we calculate the evidences for adiabatic and isocurvature models, which we previously computed using thermodynamic integration in ref. [12]. We find that the discrepancies between the methods are typically no worse than 1 in ln(Evidence), meaning that the analytical method is somewhat less accurate than would be ideal, but is accurate enough to give a useful indication of model preference.

2. The Bayesian Evidence

The posterior probability distribution $P(\theta, M | D)$ for the parameters $\theta$ of the model $M$, given the data $D$, is related to the likelihood function $L(D | \theta, M)$ within a given set of prior distribution functions $\pi(\theta, M)$ for the parameters of the model, by Bayes’ theorem:
$$P(\theta, M | D) = \frac{L(D | \theta, M)\, \pi(\theta, M)}{E(D | M)}, \tag{1}$$
where $E$ is the Bayesian evidence, i.e., the average likelihood over the priors,
$$E(D | M) = \int d\theta\, L(D | \theta, M)\, \pi(\theta, M), \tag{2}$$
where $\theta$ is a vector with $n$ components characterising the $n$ independent parameters. The prior distribution function $\pi$ contains all the information about the parameters before observing the data, i.e., our theoretical prejudices, our physical understanding of the model, and input from previous experiments.
In the case of a large number of parameters ($n \gg 1$), the evidence integral cannot be performed straightforwardly and must be obtained either numerically or via an analytical approximation. Amongst numerical methods the most popular is thermodynamic integration [13,14] but this can be computationally intensive [12]. An alternative is the application of the nested sampling algorithm [15,16] and Monte Carlo methods with the stepping-stone sampling algorithm [17,18]. On the other hand, the simplest analytical approximation is the Laplace approximation, valid when the distribution can be approximated by a multivariate Gaussian. This may hold when the quantity and quality of the data is optimal, but is likely to be valid only in limited cosmological circumstances.
The Bayesian evidence is of interest because it allows a comparison of models amongst an exclusive and exhaustive set $\{M_i\}_{i=1}^{N}$. We can compute the posterior probability for each hypothesis given the data $D$ using Bayes’ theorem:
$$P(M_i | D) \propto E(D | M_i)\, \pi(M_i), \tag{3}$$
where $E(D | M_i)$ is the evidence of the data under the model $M_i$, and $\pi(M_i)$ is the prior probability of the $i$th model before we see the data. The ratio of the evidences for the two competing models is called the Bayes factor [19]
$$B_{ij} = \frac{E(D | M_i)}{E(D | M_j)}, \tag{4}$$
and this is also equal to the ratio of the posterior model probabilities if we assume that we do not favour any model a priori, so that $\pi(M_1) = \pi(M_2) = \dots = \pi(M_N) = 1/N$.
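As a small illustration of how evidences turn into model probabilities, the following sketch converts a list of ln(Evidence) values into posterior model probabilities under equal model priors, $\pi(M_i) = 1/N$, and evaluates the logarithm of the Bayes factor $B_{ij} = E(D|M_i)/E(D|M_j)$. The function names and the numerical values are hypothetical and chosen only for illustration; they are not results from this paper.

```python
import math

def model_probabilities(ln_evidences):
    """Posterior model probabilities for equal model priors pi(M_i) = 1/N."""
    # Subtract the largest value before exponentiating to avoid underflow.
    ln_max = max(ln_evidences)
    weights = [math.exp(v - ln_max) for v in ln_evidences]
    total = sum(weights)
    return [w / total for w in weights]

def ln_bayes_factor(ln_e_i, ln_e_j):
    """ln B_ij = ln E(D|M_i) - ln E(D|M_j)."""
    return ln_e_i - ln_e_j

# Hypothetical ln(Evidence) values for two models (illustration only):
# the second model has an evidence three times larger than the first.
ln_e = [-1000.0, -1000.0 + math.log(3.0)]
probs = model_probabilities(ln_e)
print(probs, ln_bayes_factor(ln_e[1], ln_e[0]))
```

Working with logarithms throughout is a design choice: evidences in realistic problems are far too small to exponentiate directly, whereas differences of ln(Evidence) are of order unity.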
The Bayes factor Equation (4) provides a mathematical representation of Occam’s razor, because more complex models tend to be less predictive, lowering their average likelihood in comparison to simpler, more predictive models. More complex models can only be favoured if they are able to provide a significantly improved fit to the data. In simple cases where models give vastly different maximum likelihoods there is no need to employ model selection techniques, but they are essential for properly discussing cases where the improvement of fit is marginal. This latter situation is more or less inevitable whenever the possibility of requiring an additional parameter arises from new data, unless the new data is of vastly greater power than that preceding it; cosmological examples include the inclusion of spectral tilt, dark energy density variation, or trace isocurvature perturbations, explored later in this paper.
In this paper we will obtain analytical formulae that approximate the Bayesian evidence by considering the higher-order cumulants of the distribution in a systematic way. The advantage is that with these analytical formulae one can compute the evidence for a given model with an arbitrary number of parameters, given the hierarchy of cumulants of the distribution, which we assume has already been computed for the likelihood distribution function within the parameter estimation programme.
The evidence needs to be calculated to sufficient precision for robust conclusions to be drawn. The standard interpretational scale, due to Jeffreys [1] and summarized in ref. [12], strengthens its verdict roughly each time the difference in ln(Evidence) increases by one. The evidence therefore needs to be computed more accurately than this, with an uncertainty of 0.1 in ln(Evidence) easily sufficient, and a factor two worse than that acceptable. This accuracy requirement ensures that the relative model probabilities are changed little by the uncertainty.
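The first thing needed is to characterize the distribution function for the model with $n$ parameters. Let $f(x)$ be this function, and let us assume that it is properly normalized,
$$\int_{-\infty}^{\infty} d^n x\, f(x) = 1. \tag{5}$$
Then, the $p$-point correlation function is given by
$$\langle x_{i_1} \cdots x_{i_p} \rangle = \int_{-\infty}^{\infty} d^n x\; x_{i_1} \cdots x_{i_p}\, f(x). \tag{6}$$
From this distribution function one can always construct the generating functional, $\phi(u)$, as the Fourier transform
$$\phi(u) = \int_{-\infty}^{\infty} d^n x\; e^{i u \cdot x} f(x). \tag{7}$$
This function can be expanded as
$$\phi(u) = \exp\left[\sum_{p=1}^{\infty} \frac{i^p}{p!}\, A_{i_1 \dots i_p}\, u_{i_1} \cdots u_{i_p}\right], \tag{8}$$
where $A_{i_1 \dots i_p}$ are totally symmetric rank-$p$ tensors. For instance, writing the terms up to order 4 explicitly, we have
$$\phi(u) = \exp\left[i \mu_i u_i - \frac{1}{2!} C_{ij} u_i u_j - \frac{i}{3!} B_{ijk} u_i u_j u_k + \frac{1}{4!} D_{ijkl} u_i u_j u_k u_l + \dots + \frac{i^n}{n!} A_{i_1 \dots i_n} u_{i_1} \cdots u_{i_n}\right], \tag{9}$$
where $\mu_i$ is the mean value of variable $x_i$; $C_{ij}$ is the covariance matrix; $B_{ijk}$ is the trilinear matrix associated with the third cumulant or skewness; $D_{ijkl}$ is the rank-4 tensor associated with the fourth cumulant or kurtosis; and $A_{i_1 \dots i_n}$ is the rank-$n$ tensor associated with the $n$-th cumulant. Their expressions in terms of $n$-point correlation functions can be obtained from Equation (7), by realising that
$$\langle x_{i_1} \cdots x_{i_n} \rangle = (-i)^n \left.\frac{\partial^n \phi(u)}{\partial u_{i_1} \cdots \partial u_{i_n}}\right|_{u=0}. \tag{10}$$
For instance, the first-order term gives
$$\langle x_i \rangle = (-i) \left.\frac{\partial \phi(u)}{\partial u_i}\right|_{u=0} = \mu_i. \tag{11}$$
The second-order correlation function gives
$$\langle x_i x_j \rangle = (-i)^2 \left.\frac{\partial^2 \phi(u)}{\partial u_i\, \partial u_j}\right|_{u=0} = C_{ij} + \mu_i \mu_j, \tag{12}$$
such that the covariance matrix is obtained, as usual, from
$$C_{ij} = \langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle. \tag{13}$$
The third-order correlation function gives
$$\langle x_i x_j x_k \rangle = (-i)^3 \left.\frac{\partial^3 \phi(u)}{\partial u_i\, \partial u_j\, \partial u_k}\right|_{u=0} = B_{ijk} + \mu_i C_{jk} + \mu_j C_{ki} + \mu_k C_{ij} + \mu_i \mu_j \mu_k, \tag{14}$$
such that the skewness matrix is obtained from
$$B_{ijk} = \langle x_i x_j x_k \rangle - \langle x_i \rangle \langle x_j x_k \rangle - \langle x_j \rangle \langle x_k x_i \rangle - \langle x_k \rangle \langle x_i x_j \rangle + 2 \langle x_i \rangle \langle x_j \rangle \langle x_k \rangle. \tag{15}$$
The fourth-order correlation function gives
$$\begin{aligned}
\langle x_i x_j x_k x_l \rangle &= (-i)^4 \left.\frac{\partial^4 \phi(u)}{\partial u_i\, \partial u_j\, \partial u_k\, \partial u_l}\right|_{u=0} \\
&= D_{ijkl} + C_{ij} C_{kl} + C_{ik} C_{jl} + C_{il} C_{jk} + B_{ijk} \mu_l + B_{ijl} \mu_k + B_{jkl} \mu_i + B_{ikl} \mu_j \\
&\quad + C_{ij} \mu_k \mu_l + C_{ik} \mu_j \mu_l + C_{il} \mu_j \mu_k + C_{jk} \mu_i \mu_l + C_{jl} \mu_i \mu_k + C_{kl} \mu_i \mu_j + \mu_i \mu_j \mu_k \mu_l,
\end{aligned} \tag{16}$$
such that the kurtosis matrix is obtained from
$$\begin{aligned}
D_{ijkl} &= \langle x_i x_j x_k x_l \rangle - \langle x_i x_j \rangle \langle x_k x_l \rangle - \langle x_i x_k \rangle \langle x_j x_l \rangle - \langle x_i x_l \rangle \langle x_j x_k \rangle \\
&\quad - \langle x_i x_j x_k \rangle \langle x_l \rangle - \langle x_i x_j x_l \rangle \langle x_k \rangle - \langle x_i x_k x_l \rangle \langle x_j \rangle - \langle x_j x_k x_l \rangle \langle x_i \rangle \\
&\quad + 2 \langle x_i x_j \rangle \langle x_k \rangle \langle x_l \rangle + 2 \langle x_i x_k \rangle \langle x_j \rangle \langle x_l \rangle + 2 \langle x_i x_l \rangle \langle x_j \rangle \langle x_k \rangle \\
&\quad + 2 \langle x_j x_k \rangle \langle x_i \rangle \langle x_l \rangle + 2 \langle x_j x_l \rangle \langle x_i \rangle \langle x_k \rangle + 2 \langle x_k x_l \rangle \langle x_i \rangle \langle x_j \rangle - 6 \langle x_i \rangle \langle x_j \rangle \langle x_k \rangle \langle x_l \rangle,
\end{aligned} \tag{17}$$
and so on, for the higher-order cumulants.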
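The relations above between correlation functions and cumulants can be turned into a simple estimator. The sketch below is an illustration rather than the procedure used in this paper: it assumes the input is a list of equally weighted samples (for example from an MCMC chain), and the function name `cumulant_tensors` is hypothetical. It uses the fact that cumulants of order two and higher are unchanged by a shift of the variables, so after subtracting the sample mean every term containing a single-point average drops out of the expressions for $C_{ij}$, $B_{ijk}$ and $D_{ijkl}$.

```python
from itertools import product

def cumulant_tensors(samples):
    """Estimate mu_i, C_ij, B_ijk and D_ijkl from equally weighted n-dim samples."""
    m = len(samples)
    n = len(samples[0])
    mu = [sum(s[i] for s in samples) / m for i in range(n)]
    # Work with displaced variables y = x - mu, for which <y_i> = 0.
    y = [[s[i] - mu[i] for i in range(n)] for s in samples]

    def moment(*idx):
        """Sample average of the product y[idx[0]] * y[idx[1]] * ..."""
        total = 0.0
        for row in y:
            term = 1.0
            for i in idx:
                term *= row[i]
            total += term
        return total / m

    C = {ij: moment(*ij) for ij in product(range(n), repeat=2)}
    B = {ijk: moment(*ijk) for ijk in product(range(n), repeat=3)}
    D = {}
    for (i, j, k, l) in product(range(n), repeat=4):
        # Fourth cumulant: fourth moment minus the three Gaussian pairings.
        D[(i, j, k, l)] = (moment(i, j, k, l)
                           - C[(i, j)] * C[(k, l)]
                           - C[(i, k)] * C[(j, l)]
                           - C[(i, l)] * C[(j, k)])
    return mu, C, B, D

# Toy data: two equally weighted points on the diagonal of a 2D parameter space.
mu, C, B, D = cumulant_tensors([[0.0, 0.0], [1.0, 1.0]])
print(mu, C[(0, 1)], B[(0, 0, 1)], D[(0, 0, 1, 1)])
```

The dictionaries store every index combination, which is wasteful for symmetric tensors but keeps the correspondence with the index notation of the text transparent.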

3. The Gaussian Approximation

Let us first evaluate the evidence for a multivariate Gaussian distribution, that is, one in which all the cumulants are zero except the covariance matrix $C_{ij}$ and the means $\mu_i$. In this case, the generating functional and the distribution are given by²
$$\phi(u) = \exp\left[i \mu_i u_i - \frac{1}{2} C_{ij} u_i u_j\right], \tag{18}$$
$$f(x) = \frac{1}{(2\pi)^n} \int_{-\infty}^{\infty} d^n u\; e^{-i u \cdot x} \phi(u) = \frac{1}{(2\pi)^{n/2} \sqrt{\det C}} \exp\left[-\frac{1}{2} C^{-1}_{ij} (x_i - \mu_i)(x_j - \mu_j)\right], \tag{19}$$
which satisfies
$$\langle x_i \rangle = \mu_i, \qquad \langle x_i x_j \rangle = C_{ij} + \mu_i \mu_j, \qquad \langle x_i x_j x_k \rangle = \mu_{(i} C_{jk)} + \mu_i \mu_j \mu_k, \qquad \dots \tag{20}$$
where the sub-indices in parentheses, $(ijk)$, indicate a cyclic sum. Notice that all the $n$-point correlation functions can be written in terms of the first two moments of the distribution, and all the higher-order cumulants vanish.

3.1. Centred Priors

For initial calculations, we assume a top-hat prior and make the unrealistic assumption, to be lifted later, that it is centred at the mean value:
$$\pi(x, a) \equiv \begin{cases} (2a)^{-1} & -a < x - \mu < a, \\ 0 & \text{otherwise}. \end{cases} \tag{21}$$
Since the Fourier transform of a top-hat function is
$$\int_{-\infty}^{\infty} dx\; e^{i u x}\, \pi(x, a) = \frac{\sin a u}{a u} \exp[i \mu u],$$
we can write the evidence either way
$$E(a_1, \dots, a_n) = \int_{-\infty}^{\infty} d^n x\, f(x) \prod_{i=1}^{n} \pi(x_i, a_i) = \prod_{i=1}^{n} (2 a_i)^{-1} \int_{-a_1}^{a_1} dx_1 \cdots \int_{-a_n}^{a_n} dx_n\, f(\tilde{x}) \tag{22}$$
$$= \frac{1}{(2\pi)^n} \int_{-\infty}^{\infty} d^n u\; \phi(u) \prod_{i=1}^{n} \frac{\sin a_i u_i}{a_i u_i}.$$
In Equation (22) we integrate over the displaced coordinate, $\tilde{x}_i \equiv x_i - \mu_i$, such that $\langle \tilde{x}_i \rangle = 0$ and $\langle \tilde{x}_i \tilde{x}_j \rangle = C_{ij}$. From now on, we ignore the tildes, and assume we have moved to those coordinates. Note that the choice of prior is not crucial. We could have chosen a Gaussian prior, and the result would not be very different, except that the window functions, $\sin z / z$, would then be Gaussian. Let us now perform the integration Equation (22) in the case of one, two and then $n$ variables.
One variable. Suppose the covariance is just $C = \sigma^2$. The evidence is then
$$E(a) = \frac{1}{2 a\, \sigma \sqrt{2\pi}} \int_{-a}^{a} dx\; e^{-\frac{x^2}{2\sigma^2}} = \frac{1}{2\pi} \int_{-\infty}^{\infty} du\; \frac{\sin a u}{a u}\, e^{-\frac{1}{2} \sigma^2 u^2} = \frac{1}{2a}\, \mathrm{Erf}\left[\frac{a}{\sigma \sqrt{2}}\right], \tag{24}$$
where $\mathrm{Erf}[x]$ is the error function, which asymptotes very quickly to one for $x \gtrsim 2$, or $a \gtrsim 3\sigma$. Therefore, the evidence of a model with a centred top-hat prior of width $2a$ that is wide compared with $\sigma$ is well approximated by $(2a)^{-1}$. Note that the Bayesian evidence depends very strongly on the prior chosen for the model, and often choosing this prior is crucial for model specification [20].
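The single-variable result, $E(a) = \mathrm{Erf}[a/(\sigma\sqrt{2})]/(2a)$, is easy to check numerically. The following sketch compares the closed form with a direct midpoint-rule average of the unit-normalized Gaussian over the prior range, and with the wide-prior limit $1/(2a)$. The function names, the step count and the chosen values of $a$ are illustrative assumptions.

```python
import math

def evidence_centred_1d(a, sigma):
    """Closed form: Erf[a / (sigma*sqrt(2))] / (2a) for a centred prior of half-width a."""
    return math.erf(a / (sigma * math.sqrt(2.0))) / (2.0 * a)

def evidence_numeric_1d(a, sigma, steps=20000):
    """Midpoint-rule average of the unit-normalized Gaussian over [-a, a]."""
    h = 2.0 * a / steps
    total = 0.0
    for k in range(steps):
        x = -a + (k + 0.5) * h
        total += math.exp(-0.5 * (x / sigma) ** 2)
    return total * h / (sigma * math.sqrt(2.0 * math.pi)) / (2.0 * a)

sigma = 1.0
for a in (0.5, 1.0, 3.0, 10.0):
    # Columns: half-width, closed form, numerical integral, wide-prior limit 1/(2a).
    print(a, evidence_centred_1d(a, sigma), evidence_numeric_1d(a, sigma), 1.0 / (2.0 * a))
```

For $a = 3\sigma$ the closed form already lies within about 0.3% of $1/(2a)$, which is the sense in which the error function "asymptotes very quickly to one".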
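Two variables. Suppose we have two correlated variables, $x_1$ and $x_2$, with covariance matrix
$$C = \begin{pmatrix} C_{11} & C_{12} \\ C_{12} & C_{22} \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \rho\, \sigma_1 \sigma_2 \\ \rho\, \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix},$$
where the cross-correlation $\rho$ is defined by
$$\rho = \frac{\langle x_1 x_2 \rangle}{\sqrt{\langle x_1^2 \rangle \langle x_2^2 \rangle}} = \frac{\langle x_1 x_2 \rangle}{\sigma_1 \sigma_2},$$
with $\sigma_1$ and $\sigma_2$ as the corresponding quadratic dispersions. In this case, the normalized two-dimensional distribution function is
$$f(x) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\left[-\frac{1}{1 - \rho^2} \left(\frac{x_1^2}{2\sigma_1^2} - \frac{\rho\, x_1 x_2}{\sigma_1 \sigma_2} + \frac{x_2^2}{2\sigma_2^2}\right)\right], \tag{26}$$
which has the property that integrating (“marginalizing”) over one of the two variables leaves a properly normalized Gaussian distribution for the remaining variable,
$$\int_{-\infty}^{\infty} dx_2\, f(x) = \frac{1}{\sigma_1 \sqrt{2\pi}}\, e^{-\frac{x_1^2}{2\sigma_1^2}}. \tag{27}$$
Let us now evaluate the evidence Equation (22) by integrating first over the prior in $x_2$,
$$\frac{1}{2 a_2} \int_{-a_2}^{a_2} dx_2\, f(x) = \frac{e^{-\frac{x_1^2}{2\sigma_1^2}}}{\sigma_1 \sqrt{2\pi}} \cdot \frac{1}{4 a_2} \left(\mathrm{Erf}\left[\frac{a_2 \sigma_1 + \rho\, \sigma_2 x_1}{\sigma_1 \sigma_2 \sqrt{2 (1 - \rho^2)}}\right] + \mathrm{Erf}\left[\frac{a_2 \sigma_1 - \rho\, \sigma_2 x_1}{\sigma_1 \sigma_2 \sqrt{2 (1 - \rho^2)}}\right]\right). \tag{28}$$
The first term is the result we would have obtained if we had been marginalizing over $x_2$; the second is a sum of error functions that still depend on $x_1$, and modulates the marginalization. We can use the series expansion of the error function to second order,
$$\frac{1}{2} \Big(\mathrm{Erf}[a + x] + \mathrm{Erf}[a - x]\Big) = \mathrm{Erf}[a] - \frac{2 a x^2}{\sqrt{\pi}}\, e^{-a^2} + O(x^4),$$
to write Equation (28) to order $x_1^2$ as
$$\frac{1}{2 a_2} \int_{-a_2}^{a_2} dx_2\, f(x) = \frac{e^{-\frac{x_1^2}{2\sigma_1^2}}}{\sigma_1 \sqrt{2\pi}} \left(\frac{1}{2 a_2}\, \mathrm{Erf}\left[\frac{a_2}{\sigma_2 \sqrt{2 (1 - \rho^2)}}\right] - \frac{\rho^2 x_1^2\; e^{-\frac{a_2^2}{2 \sigma_2^2 (1 - \rho^2)}}}{2 \sigma_1^2 \sigma_2 (1 - \rho^2) \sqrt{2\pi (1 - \rho^2)}}\right).$$
Integrating now over the $x_1$ prior, we finally obtain the evidence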
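$$\begin{aligned}
E(a_1, a_2) &= \frac{1}{4 a_1 a_2} \int_{-a_1}^{a_1} dx_1 \int_{-a_2}^{a_2} dx_2\, f(x) \\
&= \frac{1}{4 a_1 a_2}\, \mathrm{Erf}\left[\frac{a_2}{\sigma_2 \sqrt{2 (1 - \rho^2)}}\right] \mathrm{Erf}\left[\frac{a_1}{\sigma_1 \sqrt{2}}\right] - \frac{\rho^2\; e^{-\frac{a_2^2}{2 \sigma_2^2 (1 - \rho^2)}}}{2 \sigma_2 (1 - \rho^2) \sqrt{2\pi (1 - \rho^2)}} \cdot \frac{1}{2 a_1}\, \mathrm{Erf}\left[\frac{a_1}{\sigma_1 \sqrt{2}}\right] \\
&\quad + \frac{\rho^2\; e^{-\frac{a_2^2}{2 \sigma_2^2 (1 - \rho^2)} - \frac{a_1^2}{2 \sigma_1^2}}}{4\pi\, \sigma_1 \sigma_2\, (1 - \rho^2)^{3/2}}.
\end{aligned}$$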
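Note that in the limit of no cross-correlations, $\rho \to 0$, the integral factorizes and we can write an exact expression for the evidence,
$$\begin{aligned}
E(a_1, a_2) &= \frac{1}{4 a_1 a_2} \frac{1}{2\pi \sigma_1 \sigma_2} \int_{-a_1}^{a_1} dx_1 \int_{-a_2}^{a_2} dx_2\; e^{-\frac{x_1^2}{2\sigma_1^2} - \frac{x_2^2}{2\sigma_2^2}} \\
&= \frac{1}{4\pi^2} \int_{-\infty}^{\infty} du_1 \int_{-\infty}^{\infty} du_2\; \frac{\sin a_1 u_1}{a_1 u_1} \frac{\sin a_2 u_2}{a_2 u_2}\; e^{-\frac{1}{2} \sigma_1^2 u_1^2 - \frac{1}{2} \sigma_2^2 u_2^2} \\
&= \frac{1}{4 a_1 a_2}\, \mathrm{Erf}\left[\frac{a_1}{\sigma_1 \sqrt{2}}\right] \mathrm{Erf}\left[\frac{a_2}{\sigma_2 \sqrt{2}}\right].
\end{aligned}$$
It happens, however, that even in the presence of cross-correlations, if the prior is wide ($a_i \gtrsim 2\sigma_i$), then the terms proportional to exponentials are negligible and the evidence becomes, to a very good approximation,
$$E(a_1, a_2) = \frac{1}{4 a_1 a_2}\, \mathrm{Erf}\left[\frac{a_2}{\sigma_2 \sqrt{2 (1 - \rho^2)}}\right] \mathrm{Erf}\left[\frac{a_1}{\sigma_1 \sqrt{2}}\right]. \tag{34}$$
Moreover, in that case, the error functions are approximately given by 1.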
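To get a feeling for the quality of the wide-prior approximation in the correlated case, one can compare it with a brute-force integration of the two-dimensional Gaussian over the centred prior box. The sketch below does this with a simple midpoint rule; the grid size and the parameter values ($\sigma_1 = \sigma_2 = 1$, $\rho = 0.5$, $a_1 = a_2 = 3$) are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math

def bivariate_gaussian(x1, x2, s1, s2, rho):
    """Zero-mean correlated two-dimensional Gaussian, normalized to unity."""
    norm = 2.0 * math.pi * s1 * s2 * math.sqrt(1.0 - rho ** 2)
    q = x1 ** 2 / (2.0 * s1 ** 2) - rho * x1 * x2 / (s1 * s2) + x2 ** 2 / (2.0 * s2 ** 2)
    return math.exp(-q / (1.0 - rho ** 2)) / norm

def evidence_2d_numeric(a1, a2, s1, s2, rho, steps=300):
    """Midpoint-rule average of the Gaussian over the box |x1| < a1, |x2| < a2."""
    h1, h2 = 2.0 * a1 / steps, 2.0 * a2 / steps
    total = 0.0
    for i in range(steps):
        x1 = -a1 + (i + 0.5) * h1
        for j in range(steps):
            x2 = -a2 + (j + 0.5) * h2
            total += bivariate_gaussian(x1, x2, s1, s2, rho)
    return total * h1 * h2 / (4.0 * a1 * a2)

def evidence_2d_wide_prior(a1, a2, s1, s2, rho):
    """Wide-prior formula: product of two error functions divided by 4*a1*a2."""
    erf2 = math.erf(a2 / (s2 * math.sqrt(2.0 * (1.0 - rho ** 2))))
    erf1 = math.erf(a1 / (s1 * math.sqrt(2.0)))
    return erf1 * erf2 / (4.0 * a1 * a2)

exact = evidence_2d_numeric(3.0, 3.0, 1.0, 1.0, 0.5)
approx = evidence_2d_wide_prior(3.0, 3.0, 1.0, 1.0, 0.5)
print(exact, approx, math.log(approx) - math.log(exact))
```

The last printed number is the error in ln(Evidence), which is the quantity that matters for model comparison; for these values it is far below the 0.1 target quoted earlier.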
$n$ variables. Suppose we have $n$ correlated variables, $x = (x_1, \dots, x_n)$, with covariance matrix
$$C_n = \begin{pmatrix} C_{11} & C_{12} & \cdots & C_{1n} \\ C_{12} & C_{22} & \cdots & C_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ C_{1n} & C_{2n} & \cdots & C_{nn} \end{pmatrix}. \tag{35}$$
In this case, the probability distribution function can be expressed as
$$f(x) = \frac{1}{(2\pi)^{n/2} \sqrt{\det C_n}} \exp\left[-\frac{1}{2}\, x^T C_n^{-1} x\right], \tag{36}$$
which has the property that marginalizing over the last variable, $x_n$, we obtain a correlated probability distribution function for the $n-1$ variables, $x = (x_1, \dots, x_{n-1})$,
$$f(x) = \frac{1}{(2\pi)^{(n-1)/2} \sqrt{\det C_{n-1}}} \exp\left[-\frac{1}{2}\, x^T C_{n-1}^{-1} x\right], \tag{37}$$
where the $C_{n-1}$ covariance matrix is given by Equation (35) without the last column and last row.
We will now evaluate the evidence Equation (22) for this multivariate Gaussian, starting with the integration over the last variable, $x_n$,
$$\frac{1}{2 a_n} \int_{-a_n}^{a_n} dx_n\, f(x) = \frac{1}{(2\pi)^{(n-1)/2} \sqrt{\det C_{n-1}}} \exp\left[-\frac{1}{2}\, x^T C_{n-1}^{-1} x\right] \times \frac{1}{2 a_n}\, \mathrm{Erf}\left[\frac{a_n}{\sqrt{2}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}\right] + O\left(e^{-\frac{a_n^2 \det C_{n-1}}{2 \det C_n}}\right). \tag{38}$$
Integrating now over the next variable, $x_{n-1}$, we find
$$\begin{aligned}
\frac{1}{4 a_n a_{n-1}} \int_{-a_n}^{a_n} dx_n \int_{-a_{n-1}}^{a_{n-1}} dx_{n-1}\, f(x) &= \frac{1}{(2\pi)^{(n-2)/2} \sqrt{\det C_{n-2}}} \exp\left[-\frac{1}{2}\, x^T C_{n-2}^{-1} x\right] \\
&\quad \times \frac{1}{4 a_n a_{n-1}}\, \mathrm{Erf}\left[\frac{a_n}{\sqrt{2}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}\right] \mathrm{Erf}\left[\frac{a_{n-1}}{\sqrt{2}} \sqrt{\frac{\det C_{n-2}}{\det C_{n-1}}}\right] + O\left(e^{-\frac{a_n^2 \det C_{n-1}}{2 \det C_n}}\right).
\end{aligned} \tag{39}$$
Continuing the integration over the priors, we end up with the evidence for the $n$-dimensional distribution,
$$E(a_1, \dots, a_n) = \frac{1}{\prod_{p=1}^{n} 2 a_p} \int_{-a_1}^{a_1} \cdots \int_{-a_n}^{a_n} d^n x\, f(x) = \prod_{p=1}^{n} \frac{1}{2 a_p}\, \mathrm{Erf}\left[\frac{a_p}{\sqrt{2}} \sqrt{\frac{\det C_{p-1}}{\det C_p}}\right] + O\left(\exp\left[-\sum_{p=1}^{n} \frac{a_p^2 \det C_{p-1}}{2 \det C_p}\right]\right), \tag{40}$$
where the covariance matrices $C_p$ are constructed as above, by eliminating the $n-p$ last rows and columns, until we end up with $C_0 \equiv 1$. Note that the approximation is very good whenever $\sum_{p=1}^{n} (a_p^2 \det C_{p-1}) / (2 \det C_p) \gg 1$, which is often the case. Note also that we recover the previous result Equation (34) for the particular case $n = 2$.
In the limit that the cross-correlation between the $n$ variables vanishes, the evidence (40) reduces to the exact result
$$E(a_1, \dots, a_n) = \prod_{p=1}^{n} \frac{1}{2 a_p}\, \mathrm{Erf}\left[\frac{a_p}{\sigma_p \sqrt{2}}\right]. \tag{41}$$
Note that the evidence Equation (40) correctly reflects the limit in which we eliminate the need for a new variable $x_n$, by making its prior vanish,
$$\lim_{a_n \to 0} E(a_1, \dots, a_n) = E(a_1, \dots, a_{n-1})\, \frac{1}{\sqrt{2\pi}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}, \tag{42}$$
and thus we recover in that limit a properly normalized distribution, $f(x_1, \dots, x_n) \to f(x_1, \dots, x_{n-1})$, while the inspection of the likelihood function alone would not have been able to give a reasonable answer.
On the other hand, in the case that our theoretical prejudice cannot assign a concrete prior to a given variable, we see that the evidence decreases as $1/2a$ as $a$ increases. Therefore, the Bayesian evidence seems to be a very good discriminator between theoretical priors, and penalizes including too many parameters, à la Occam’s razor.
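The leading term of the $n$-variable result only requires the determinants of the nested covariance sub-matrices $C_1, \dots, C_n$. A minimal pure-Python sketch, under the assumption that the covariance matrix is given as a list of rows and that the exponentially small corrections are dropped, could look as follows; `determinant` and `evidence_centred` are hypothetical helper names.

```python
import math

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting (pure Python)."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if m[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            det = -det  # a row swap flips the sign of the determinant
        det *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= factor * m[c][k]
    return det

def evidence_centred(cov, a):
    """Product over p of Erf[a_p * sqrt(det C_{p-1} / (2 det C_p))] / (2 a_p)."""
    evidence = 1.0
    det_prev = 1.0  # det C_0 is defined to be 1
    for p in range(1, len(a) + 1):
        det_p = determinant([row[:p] for row in cov[:p]])
        arg = a[p - 1] * math.sqrt(det_prev / (2.0 * det_p))
        evidence *= math.erf(arg) / (2.0 * a[p - 1])
        det_prev = det_p
    return evidence

# Illustrative two-parameter example with sigma_1 = sigma_2 = 1 and rho = 0.5.
cov = [[1.0, 0.5], [0.5, 1.0]]
print(evidence_centred(cov, [3.0, 3.0]))
```

Because the formula integrates the variables out one at a time starting from the last one, the result depends (through the exponentially small terms that are neglected) on the ordering of the parameters; for wide priors this dependence is negligible.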

3.2. Uncentred Priors

It is unlikely that the priors will actually be centred on the mean of the distribution, as the priors are not supposed to know what the data will tell us. We therefore need to generalize the above for uncentred priors. We continue to assume that the priors are top hats.
We also continue to assume for the moment that the probability distribution is well-approximated by a Gaussian with mean value $\mu$. We will then use displaced variables $\tilde{x}_i = x_i - \mu_i$, and write the Gaussian distribution function as in Equation (36). The normalized top-hat prior is now uncentred with respect to the mean value,
$$\pi(\tilde{x}; a, b) \equiv \begin{cases} (a + b)^{-1} & -a < \tilde{x} < b, \\ 0 & \text{otherwise}. \end{cases} \tag{43}$$
For a single variable, the result is exact,
$$E(a; b) = \int_{-\infty}^{\infty} dx\, f(x)\, \pi(x; a, b) = \frac{1}{2a + 2b} \left(\mathrm{Erf}\left[\frac{a}{\sigma \sqrt{2}}\right] + \mathrm{Erf}\left[\frac{b}{\sigma \sqrt{2}}\right]\right),$$
where we are integrating over the displaced variable $\tilde{x}$, from now on renamed as $x$. Note that we recover the result Equation (24) for the centred prior case in the limit $b \to a$.
For two variables, with distribution function Equation (26), the uncentred Bayesian evidence is
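$$\begin{aligned}
E(a_1, a_2; b_1, b_2) &= \frac{1}{(a_1 + b_1)(a_2 + b_2)} \int_{-a_1}^{b_1} dx_1 \int_{-a_2}^{b_2} dx_2\, f(x_1, x_2) \\
&= \frac{1}{(2a_1 + 2b_1)(2a_2 + 2b_2)} \left(\mathrm{Erf}\left[\frac{a_1}{\sigma_1 \sqrt{2}}\right] + \mathrm{Erf}\left[\frac{b_1}{\sigma_1 \sqrt{2}}\right]\right) \times \left(\mathrm{Erf}\left[\frac{a_2}{\sigma_2 \sqrt{2 (1 - \rho^2)}}\right] + \mathrm{Erf}\left[\frac{b_2}{\sigma_2 \sqrt{2 (1 - \rho^2)}}\right]\right) \\
&\quad + \frac{\rho}{2\pi \sqrt{1 - \rho^2}} \left(e^{-\frac{a_1^2}{2\sigma_1^2}} - e^{-\frac{b_1^2}{2\sigma_1^2}}\right) \left(e^{-\frac{a_2^2}{2\sigma_2^2 (1 - \rho^2)}} - e^{-\frac{b_2^2}{2\sigma_2^2 (1 - \rho^2)}}\right) \frac{1}{(a_1 + b_1)(a_2 + b_2)}.
\end{aligned}$$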
The evidence for the multiple-variable case Equation (36) is
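$$E(a, b) = \int_{-\infty}^{\infty} d^n x\, f(x) \prod_{i=1}^{n} \pi(x_i; a_i, b_i) = \prod_{i=1}^{n} (a_i + b_i)^{-1} \int_{-a_1}^{b_1} d\tilde{x}_1 \cdots \int_{-a_n}^{b_n} d\tilde{x}_n\, f(\tilde{x}).$$
Let us now evaluate it for the multivariate Gaussian Equation (36), starting with the integration over the last variable, $x_n$,
$$\begin{aligned}
\frac{1}{a_n + b_n} \int_{-a_n}^{b_n} dx_n\, f(x) &= \frac{1}{(2\pi)^{(n-1)/2} \sqrt{\det C_{n-1}}} \exp\left[-\frac{1}{2}\, x^T C_{n-1}^{-1} x\right] \frac{1}{(2a_n + 2b_n)} \\
&\quad \times \left(\mathrm{Erf}\left[\frac{a_n}{\sqrt{2}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}\right] + \mathrm{Erf}\left[\frac{b_n}{\sqrt{2}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}\right]\right) + O\left(e^{-\frac{a_n^2 \det C_{n-1}}{2 \det C_n}} + e^{-\frac{b_n^2 \det C_{n-1}}{2 \det C_n}}\right).
\end{aligned}$$
Integrating now over the next variable, $x_{n-1}$, we find
$$\begin{aligned}
&\frac{1}{(a_n + b_n)(a_{n-1} + b_{n-1})} \int_{-a_n}^{b_n} dx_n \int_{-a_{n-1}}^{b_{n-1}} dx_{n-1}\, f(x) = \\
&\qquad \frac{1}{(2\pi)^{(n-2)/2} \sqrt{\det C_{n-2}}} \exp\left[-\frac{1}{2}\, x^T C_{n-2}^{-1} x\right] \frac{1}{(2a_n + 2b_n)(2a_{n-1} + 2b_{n-1})} \\
&\qquad \times \left(\mathrm{Erf}\left[\frac{a_n}{\sqrt{2}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}\right] + \mathrm{Erf}\left[\frac{b_n}{\sqrt{2}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}\right]\right) \\
&\qquad \times \left(\mathrm{Erf}\left[\frac{a_{n-1}}{\sqrt{2}} \sqrt{\frac{\det C_{n-2}}{\det C_{n-1}}}\right] + \mathrm{Erf}\left[\frac{b_{n-1}}{\sqrt{2}} \sqrt{\frac{\det C_{n-2}}{\det C_{n-1}}}\right]\right) \\
&\qquad + O\left(\left(e^{-\frac{a_n^2 \det C_{n-1}}{2 \det C_n}} + e^{-\frac{b_n^2 \det C_{n-1}}{2 \det C_n}}\right) \times \left(e^{-\frac{a_{n-1}^2 \det C_{n-2}}{2 \det C_{n-1}}} + e^{-\frac{b_{n-1}^2 \det C_{n-2}}{2 \det C_{n-1}}}\right)\right).
\end{aligned}$$
Continuing the integration over the priors, we end up with the evidence for the $n$-dimensional distribution,
$$\begin{aligned}
E(a, b) &= \frac{1}{\prod_{p=1}^{n} (a_p + b_p)} \int_{-a_1}^{b_1} \cdots \int_{-a_n}^{b_n} d^n x\, f(x) \\
&= \prod_{p=1}^{n} \frac{1}{(2a_p + 2b_p)} \left(\mathrm{Erf}\left[\frac{a_p}{\sqrt{2}} \sqrt{\frac{\det C_{p-1}}{\det C_p}}\right] + \mathrm{Erf}\left[\frac{b_p}{\sqrt{2}} \sqrt{\frac{\det C_{p-1}}{\det C_p}}\right]\right) \\
&\quad + O\left(\prod_{p=1}^{n} \left(\exp\left[-\frac{a_p^2 \det C_{p-1}}{2 \det C_p}\right] + \exp\left[-\frac{b_p^2 \det C_{p-1}}{2 \det C_p}\right]\right)\right),
\end{aligned} \tag{52}$$
where the covariance matrices $C_p$ are constructed as above, by eliminating the $n-p$ last rows and columns, until $C_0 \equiv 1$. Note that the approximation is very good whenever the exponents are large, $\sum_{p=1}^{n} (a_p^2 \det C_{p-1}) / (2 \det C_p) \gg 1$, which is often the case. Note also that we recover the expression of the evidence for the centred priors Equation (40) in the limit $b \to a$.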
Let us now evaluate the evidence for a distribution normalized to the maximum of the likelihood distribution,
$$f(x) = L_{\max} \exp\left[-\frac{1}{2}\, x^T C_n^{-1} x\right]. \tag{53}$$
In this case, the evidence is given by Equation (52), multiplied by a factor $L_{\max} \times (2\pi)^{n/2} \sqrt{\det C_n}$ from the normalization. We can then evaluate the logarithm of the evidence, ignoring the exponentially small corrections, as
$$\ln E = \ln L_{\max} + \frac{n}{2} \ln(2\pi) + \frac{1}{2} \ln \det C_n - \sum_{p=1}^{n} \ln(2a_p + 2b_p) + \sum_{p=1}^{n} \ln\left(\mathrm{Erf}\left[\frac{a_p}{\sqrt{2}} \sqrt{\frac{\det C_{p-1}}{\det C_p}}\right] + \mathrm{Erf}\left[\frac{b_p}{\sqrt{2}} \sqrt{\frac{\det C_{p-1}}{\det C_p}}\right]\right). \tag{54}$$
Uncorrelated case. Suppose we have a multivariate Gaussian distribution without correlations between variables, i.e., $C_{ij} = \sigma_i^2 \delta_{ij}$ is a diagonal matrix; then the evidence reads exactly,
$$E(a, b) = \frac{1}{\prod_{p=1}^{n} (a_p + b_p)} \int_{-a_1}^{b_1} \cdots \int_{-a_n}^{b_n} d^n x\, f(x) = \prod_{p=1}^{n} \frac{1}{2 (a_p + b_p)} \left(\mathrm{Erf}\left[\frac{a_p}{\sigma_p \sqrt{2}}\right] + \mathrm{Erf}\left[\frac{b_p}{\sigma_p \sqrt{2}}\right]\right), \tag{55}$$
where $\sigma_p$ are the dispersions of each variable $\tilde{x}_p$, and thus the logarithm of the evidence becomes
$$\ln E = \ln L_{\max} + \frac{n}{2} \ln(2\pi) + \sum_{p=1}^{n} \ln \sigma_p - \sum_{p=1}^{n} \ln(2a_p + 2b_p) + \sum_{p=1}^{n} \ln\left(\mathrm{Erf}\left[\frac{a_p}{\sigma_p \sqrt{2}}\right] + \mathrm{Erf}\left[\frac{b_p}{\sigma_p \sqrt{2}}\right]\right). \tag{56}$$
Laplace approximation. The Laplace approximation to the evidence assumes the distribution is a correlated Gaussian, and that the priors are large enough so that the whole distribution fits easily inside them, in which case the error functions are approximately unity and do not contribute to the evidence; from Equation (54) we now have
$$\ln E = \ln L_{\max} + \frac{n}{2} \ln(2\pi) + \frac{1}{2} \ln \det C_n - \sum_{p=1}^{n} \ln \Delta\theta_p, \tag{57}$$
where $\Delta\theta_p = a_p + b_p$ is the parameter interval associated to the prior. In the next section we will compare the different approximations.
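The two expressions for $\ln E$ can be evaluated with very little code. The sketch below is one possible implementation, not the code used for the results of this paper. It relies on a standard property of the Cholesky factorization $C_n = L L^T$ (an implementation choice made here, not something taken from the text): $\det C_p = \prod_{k \le p} L_{kk}^2$, so that $\det C_{p-1} / \det C_p = 1 / L_{pp}^2$ and $\frac{1}{2} \ln \det C_n = \sum_p \ln L_{pp}$. The function names are hypothetical, and the covariance matrix is assumed to be symmetric and positive definite.

```python
import math

def cholesky_diagonal(cov):
    """Diagonal entries L_pp of the Cholesky factor of cov (cov = L L^T)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = cov[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return [L[p][p] for p in range(n)]

def ln_evidence_gaussian(ln_l_max, cov, a, b):
    """Gaussian ln E with uncentred top-hat priors -a_p < x_p < b_p,
    dropping the exponentially small corrections."""
    diag = cholesky_diagonal(cov)
    n = len(diag)
    ln_e = ln_l_max + 0.5 * n * math.log(2.0 * math.pi) + sum(math.log(d) for d in diag)
    for d, ap, bp in zip(diag, a, b):
        erfs = math.erf(ap / (d * math.sqrt(2.0))) + math.erf(bp / (d * math.sqrt(2.0)))
        ln_e += math.log(erfs) - math.log(2.0 * ap + 2.0 * bp)
    return ln_e

def ln_evidence_laplace(ln_l_max, cov, a, b):
    """Laplace approximation: error functions set to one, prior widths a_p + b_p."""
    diag = cholesky_diagonal(cov)
    n = len(diag)
    ln_e = ln_l_max + 0.5 * n * math.log(2.0 * math.pi) + sum(math.log(d) for d in diag)
    return ln_e - sum(math.log(ap + bp) for ap, bp in zip(a, b))

# Illustrative values only: unit variances, rho = 0.5, ln L_max set to zero.
cov = [[1.0, 0.5], [0.5, 1.0]]
wide = (ln_evidence_gaussian(0.0, cov, [8.0, 6.0], [7.0, 9.0]),
        ln_evidence_laplace(0.0, cov, [8.0, 6.0], [7.0, 9.0]))
narrow = (ln_evidence_gaussian(0.0, cov, [1.0, 1.0], [1.0, 1.0]),
          ln_evidence_laplace(0.0, cov, [1.0, 1.0], [1.0, 1.0]))
print(wide, narrow)
```

For wide priors the two numbers agree to high precision, while for priors comparable to the width of the distribution the Laplace approximation overestimates the evidence, because it counts likelihood that actually lies outside the prior range.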

4. Non-Gaussian Corrections

The advantage of this method is that one can perform a systematic computation of the evidence of a given model with its own priors, given an arbitrary set of moments of the distribution. Here we will consider the first two beyond the covariance matrix, i.e., the skewness and kurtosis terms, see Equation (9).

4.1. Skewness

Let us start with the first correction to the Gaussian approximation, the trilinear term $B_{ijk}$. For this, we write the generating functional (9) as
$$\phi(u) = \exp\left[i \mu_i u_i - \frac{1}{2!} C_{ij} u_i u_j - \frac{i}{3!} B_{ijk} u_i u_j u_k\right]. \tag{58}$$
By performing a change of variable, $u_i = y_i - i\, C^{-1}_{ik} (x_k - \mu_k)$, we can evaluate the Fourier transform integral and obtain the properly normalized probability distribution function
$$f(x) = \frac{1}{(2\pi)^{n/2} \sqrt{\det C_n}} \exp\left[-\frac{1}{2}\, x^T C_n^{-1} x\right] \times \left(1 - \frac{1}{2} B_{ijk} C^{-1}_{ij} C^{-1}_{kl}\, x_l + \frac{1}{6} B_{ijk} C^{-1}_{il} C^{-1}_{jm} C^{-1}_{kn}\, x_l x_m x_n\right), \tag{59}$$
where $x_k$ are the displaced coordinates $(x_k - \mu_k)$. This skewed distribution function satisfies
$$\langle x_i \rangle = 0, \qquad \langle x_i x_j \rangle = C_{ij}, \qquad \langle x_i x_j x_k \rangle = B_{ijk}, \qquad \langle x_i x_j x_k x_l \rangle = C_{ij} C_{kl} + C_{ik} C_{jl} + C_{il} C_{jk} \quad (\text{i.e., } D_{ijkl} = 0), \qquad \dots \tag{60}$$
as can be confirmed by direct evaluation. Let us now compute the evidence Equation (22) for this skewed model. Since the extra terms in the parenthesis of Equation (59) are both odd functions of $x$, when integrating over an even range like that of the centred top-hat prior Equation (21), their contribution to the evidence vanishes, and thus the final evidence for the skewed model does not differ from that of the Gaussian model Equation (40). In case the prior is off-centred with respect to the mean, e.g., in Equation (43), then the contribution of the odd terms to the evidence would not vanish. Let us evaluate their contribution.
For a single variable ($n = 1$), the correctly normalized likelihood function can be written as
$$f(x) = \frac{e^{-x^2 / 2\sigma^2}}{\sigma \sqrt{2\pi}} \left(1 - \frac{B x}{2 \sigma^4} + \frac{B x^3}{6 \sigma^6}\right), \tag{61}$$
satisfying $\langle x \rangle = 0$, $\langle x^2 \rangle = \sigma^2$, $\langle x^3 \rangle = B$, and the Bayesian integral can be computed exactly as
$$E(a, b) = \frac{1}{2a + 2b} \left(\mathrm{Erf}\left[\frac{a}{\sigma \sqrt{2}}\right] + \mathrm{Erf}\left[\frac{b}{\sigma \sqrt{2}}\right]\right) - \frac{B}{6 \sigma^3 \sqrt{2\pi}} \left[\left(1 - \frac{a^2}{\sigma^2}\right) e^{-\frac{a^2}{2\sigma^2}} - \left(1 - \frac{b^2}{\sigma^2}\right) e^{-\frac{b^2}{2\sigma^2}}\right] \frac{1}{a + b}. \tag{62}$$
Note that for even (centred) priors, with $b = a$, the evidence reduces to Equation (24).
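The single-variable skewness result can be verified by integrating the skewed likelihood numerically over an off-centre prior. In the sketch below, the argument `skew` plays the role of the third cumulant $B$; the function names, the midpoint rule and the numerical values ($\sigma = 1$, $B = 0.3$, prior $-1 < x < 2$) are illustrative assumptions only.

```python
import math

def skewed_pdf(x, sigma, skew):
    """Gaussian times (1 - B x / (2 sigma^4) + B x^3 / (6 sigma^6))."""
    gauss = math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    return gauss * (1.0 - skew * x / (2.0 * sigma ** 4) + skew * x ** 3 / (6.0 * sigma ** 6))

def evidence_skewed_1d(a, b, sigma, skew):
    """Closed form for the top-hat prior -a < x < b (x measured from the mean)."""
    erfs = math.erf(a / (sigma * math.sqrt(2.0))) + math.erf(b / (sigma * math.sqrt(2.0)))

    def edge(c):
        # (1 - c^2 / sigma^2) * exp(-c^2 / (2 sigma^2)), evaluated at a prior edge
        return (1.0 - (c / sigma) ** 2) * math.exp(-0.5 * (c / sigma) ** 2)

    corr = skew / (6.0 * sigma ** 3 * math.sqrt(2.0 * math.pi)) * (edge(a) - edge(b))
    return erfs / (2.0 * a + 2.0 * b) - corr / (a + b)

def evidence_numeric(pdf, a, b, steps=20000):
    """Midpoint-rule average of pdf over the prior range [-a, b]."""
    h = (a + b) / steps
    total = sum(pdf(-a + (k + 0.5) * h) for k in range(steps))
    return total * h / (a + b)

sigma, skew, a, b = 1.0, 0.3, 1.0, 2.0
closed = evidence_skewed_1d(a, b, sigma, skew)
numeric = evidence_numeric(lambda x: skewed_pdf(x, sigma, skew), a, b)
print(closed, numeric)
```

Setting `b = a` makes the two edge terms cancel identically, reproducing the statement that a centred prior is insensitive to the skewness at this order.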
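For an arbitrary number of variables the computation is more complicated. Let us start with the $n$-th variable and, in order to compute the integral, let us define the auxiliary function
$$\begin{aligned}
g(\lambda) &= \int_{-a_n}^{b_n} dx_n\; x_n\, \frac{\exp\left[-\frac{\lambda}{2}\, x^T C_n^{-1} x\right]}{(2\pi)^{n/2} \sqrt{\det C_n}} = \frac{\exp\left[-\frac{1}{2}\, x^T C_{n-1}^{-1} x\right]}{(2\pi)^{(n-1)/2} \sqrt{\det C_{n-1}}} \\
&\quad \times \frac{1}{\lambda \sqrt{2\pi}} \sqrt{\frac{\det C_n}{\det C_{n-1}}} \left(\exp\left[-\frac{\lambda a_n^2}{2} \frac{\det C_{n-1}}{\det C_n}\right] - \exp\left[-\frac{\lambda b_n^2}{2} \frac{\det C_{n-1}}{\det C_n}\right]\right),
\end{aligned} \tag{63}$$
such that, using $\mathrm{Erf}'[x] = \frac{2}{\sqrt{\pi}}\, e^{-x^2}$,
$$\begin{aligned}
-2 g'(\lambda = 1) &= \int_{-a_n}^{b_n} dx_n\; x_n\, (x^T C_n^{-1} x)\, \frac{\exp\left[-\frac{1}{2}\, x^T C_n^{-1} x\right]}{(2\pi)^{n/2} \sqrt{\det C_n}} = \frac{\exp\left[-\frac{1}{2}\, x^T C_{n-1}^{-1} x\right]}{(2\pi)^{(n-1)/2} \sqrt{\det C_{n-1}}} \\
&\quad \times \frac{1}{\sqrt{2\pi}} \sqrt{\frac{\det C_n}{\det C_{n-1}}} \left[\left(2 + a_n^2 \frac{\det C_{n-1}}{\det C_n}\right) \exp\left[-\frac{a_n^2}{2} \frac{\det C_{n-1}}{\det C_n}\right] - \left(2 + b_n^2 \frac{\det C_{n-1}}{\det C_n}\right) \exp\left[-\frac{b_n^2}{2} \frac{\det C_{n-1}}{\det C_n}\right]\right].
\end{aligned}$$
Therefore, with the use of Equation (63), the integral of the skewness-corrected distribution function Equation (59) over the $x_n$ uncentred prior becomes
$$\begin{aligned}
\int_{-a_n}^{b_n} dx_n\, f(x) &= \frac{\exp\left[-\frac{1}{2}\, x^T C_{n-1}^{-1} x\right]}{(2\pi)^{(n-1)/2} \sqrt{\det C_{n-1}}} \Bigg\{\frac{1}{2} \left(\mathrm{Erf}\left[\frac{a_n}{\sqrt{2}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}\right] + \mathrm{Erf}\left[\frac{b_n}{\sqrt{2}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}\right]\right) \\
&\quad - \frac{1}{6} B_{ijn} C^{-1}_{ij}\, \frac{1}{\sqrt{2\pi}} \sqrt{\frac{\det C_{n-1}}{\det C_n}} \left[\left(1 - a_n^2 \frac{\det C_{n-1}}{\det C_n}\right) e^{-\frac{a_n^2 \det C_{n-1}}{2 \det C_n}} - \left(1 - b_n^2 \frac{\det C_{n-1}}{\det C_n}\right) e^{-\frac{b_n^2 \det C_{n-1}}{2 \det C_n}}\right]\Bigg\}.
\end{aligned}$$
Let us define two new functions,
$$\begin{aligned}
E_i(a_i, b_i) &= \frac{1}{2} \left(\mathrm{Erf}\left[\frac{a_i}{\sqrt{2}} \sqrt{\frac{\det C_{i-1}}{\det C_i}}\right] + \mathrm{Erf}\left[\frac{b_i}{\sqrt{2}} \sqrt{\frac{\det C_{i-1}}{\det C_i}}\right]\right), \\
F_i(a_i, b_i) &= \frac{1}{6 \sqrt{2\pi}} \sqrt{\frac{\det C_{i-1}}{\det C_i}} \left[\left(1 - a_i^2 \frac{\det C_{i-1}}{\det C_i}\right) e^{-\frac{a_i^2 \det C_{i-1}}{2 \det C_i}} - \left(1 - b_i^2 \frac{\det C_{i-1}}{\det C_i}\right) e^{-\frac{b_i^2 \det C_{i-1}}{2 \det C_i}}\right].
\end{aligned}$$
Integrating iteratively over $x_{n-1}, \dots, x_1$, we end up with the Bayesian evidence for the third-order-corrected probability distribution function $f(x)$,
$$E(a, b) = \prod_{p=1}^{n} \frac{E_p(a_p, b_p)}{(a_p + b_p)} \left[1 - \sum_{k=1}^{n} B_{ijk} C^{-1}_{ij}\, \frac{F_k(a_k, b_k)}{E_k(a_k, b_k)}\right].$$
Unless $B_{ijk} C^{-1}_{ij}$ is very large, the correction to the error function is exponentially suppressed, and we do not expect significant departures from the Gaussian case Equation (40). Note also that if the prior is symmetrical, it is easy to see that the skewness part of the integral vanishes, $F_k(a_k, b_k) \to 0$, as can be checked explicitly by taking $b_k \to a_k$.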

4.2. Kurtosis

The next correction beyond skewness is the fourth-order moment or kurtosis, given by the $D_{ijkl}$ term in Equation (9). Let us ignore for the moment the third-order skewness and write
$$\phi(u) = \exp\left[i \mu_i u_i - \frac{1}{2!} C_{ij} u_i u_j + \frac{1}{4!} D_{ijkl} u_i u_j u_k u_l\right].$$
By performing the same change of variables, $u_i = y_i - i\, C^{-1}_{ik} (x_k - \mu_k)$, we can now compute the Fourier transform and obtain the properly normalized probability distribution function
$$\begin{aligned}
f(x) &= \frac{1}{(2\pi)^{n/2} \sqrt{\det C_n}} \exp\left[-\frac{1}{2}\, x^T C_n^{-1} x\right] \Big(1 + \frac{1}{8} D_{ijkl} C^{-1}_{ij} C^{-1}_{kl} - \frac{1}{4} D_{ijkl} C^{-1}_{ij} C^{-1}_{km} C^{-1}_{ln}\, x_m x_n \\
&\quad + \frac{1}{24} D_{ijkl} C^{-1}_{im} C^{-1}_{jn} C^{-1}_{kp} C^{-1}_{lq}\, x_m x_n x_p x_q\Big).
\end{aligned} \tag{68}$$
Performing the integrals, it is easy to see that this distribution satisfies
$$\langle x_i x_j \rangle = C_{ij}, \qquad \langle x_i x_j x_k x_l \rangle = D_{ijkl} + C_{ij} C_{kl} + C_{ik} C_{jl} + C_{il} C_{jk}, \qquad \dots \tag{69}$$
Note that in order for the new likelihood distribution (68) to be positive definite, it is required that $D_{ijkl} C^{-1}_{ij} C^{-1}_{kl} < 4$, and if we impose that there is only one maximum at the centre, then it must satisfy $D_{ijkl} C^{-1}_{ij} C^{-1}_{kl} < 2$. These conditions impose bounds on the maximum possible deviation of the evidence from that of a Gaussian.
Let us now compute the evidence Equation (22) for this kurtosis model. The extra terms in the parenthesis of Equation (68) are all even functions of $x$, and we cannot ignore them, even for centred priors.
For a single variable ($n = 1$), the correctly normalized likelihood function can be written as
$$f(x) = \frac{e^{-\frac{x^2}{2\sigma^2}}}{\sigma \sqrt{2\pi}} \left(1 + \frac{D}{8 \sigma^4} - \frac{D x^2}{4 \sigma^6} + \frac{D x^4}{24 \sigma^8}\right), \tag{70}$$
satisfying $\langle x \rangle = 0$, $\langle x^2 \rangle = \sigma^2$, $\langle x^3 \rangle = 0$, $\langle x^4 \rangle = D + 3\sigma^4$, etc. The Bayesian integral can be computed exactly as
$$E(a, b) = \frac{1}{2a + 2b} \left(\mathrm{Erf}\left[\frac{a}{\sigma \sqrt{2}}\right] + \mathrm{Erf}\left[\frac{b}{\sigma \sqrt{2}}\right]\right) + \frac{D}{8 \sigma^4 \sqrt{2\pi}} \left[\frac{a}{\sigma} \left(1 - \frac{a^2}{3\sigma^2}\right) e^{-\frac{a^2}{2\sigma^2}} + \frac{b}{\sigma} \left(1 - \frac{b^2}{3\sigma^2}\right) e^{-\frac{b^2}{2\sigma^2}}\right] \frac{1}{a + b}. \tag{71}$$
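As for the skewness, the single-variable kurtosis formula can be checked against a direct numerical integration of the corrected likelihood. In this sketch the argument `kurt` stands for the fourth cumulant $D$, chosen so that $D/\sigma^4 < 4$ and the distribution stays positive; all names and numerical values are illustrative assumptions.

```python
import math

def kurtosis_pdf(x, sigma, kurt):
    """Gaussian times (1 + D/(8 s^4) - D x^2/(4 s^6) + D x^4/(24 s^8))."""
    gauss = math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    poly = (1.0 + kurt / (8.0 * sigma ** 4)
            - kurt * x ** 2 / (4.0 * sigma ** 6)
            + kurt * x ** 4 / (24.0 * sigma ** 8))
    return gauss * poly

def evidence_kurtosis_1d(a, b, sigma, kurt):
    """Closed form for the top-hat prior -a < x < b (x measured from the mean)."""
    erfs = math.erf(a / (sigma * math.sqrt(2.0))) + math.erf(b / (sigma * math.sqrt(2.0)))

    def edge(c):
        # (c / sigma) * (1 - c^2 / (3 sigma^2)) * exp(-c^2 / (2 sigma^2))
        t = c / sigma
        return t * (1.0 - t ** 2 / 3.0) * math.exp(-0.5 * t ** 2)

    corr = kurt / (8.0 * sigma ** 4 * math.sqrt(2.0 * math.pi)) * (edge(a) + edge(b))
    return erfs / (2.0 * a + 2.0 * b) + corr / (a + b)

def evidence_numeric(pdf, a, b, steps=20000):
    """Midpoint-rule average of pdf over the prior range [-a, b]."""
    h = (a + b) / steps
    total = sum(pdf(-a + (k + 0.5) * h) for k in range(steps))
    return total * h / (a + b)

sigma, kurt, a, b = 1.0, 0.5, 1.0, 2.0
closed = evidence_kurtosis_1d(a, b, sigma, kurt)
numeric = evidence_numeric(lambda x: kurtosis_pdf(x, sigma, kurt), a, b)
print(closed, numeric)
```

Unlike the skewness term, the kurtosis correction does not cancel for a centred prior, since both edge contributions enter with the same sign; it does become exponentially small once $a$ and $b$ are several $\sigma$.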
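For an arbitrary number of variables, the computation is again much more complicated. Let us start with the $n$-th variable and, in order to compute the first integral, let us define a new auxiliary function
$$\begin{aligned}
h(\lambda) &= \int_{-a_n}^{b_n} dx_n\, \frac{\exp\left[-\frac{\lambda}{2}\, x^T C_n^{-1} x\right]}{(2\pi)^{n/2} \sqrt{\det C_n}} = \frac{\exp\left[-\frac{1}{2}\, x^T C_{n-1}^{-1} x\right]}{(2\pi)^{(n-1)/2} \sqrt{\det C_{n-1}}} \\
&\quad \times \frac{1}{2 \sqrt{\lambda}} \left(\mathrm{Erf}\left[\frac{a_n \sqrt{\lambda}}{\sqrt{2}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}\right] + \mathrm{Erf}\left[\frac{b_n \sqrt{\lambda}}{\sqrt{2}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}\right]\right),
\end{aligned} \tag{72}$$
such that,
$$\begin{aligned}
-2 h'(\lambda = 1) &= \int_{-a_n}^{b_n} dx_n\, (x^T C_n^{-1} x)\, \frac{\exp\left[-\frac{1}{2}\, x^T C_n^{-1} x\right]}{(2\pi)^{n/2} \sqrt{\det C_n}} = \frac{\exp\left[-\frac{1}{2}\, x^T C_{n-1}^{-1} x\right]}{(2\pi)^{(n-1)/2} \sqrt{\det C_{n-1}}} \\
&\quad \times \Bigg\{\frac{1}{2} \left(\mathrm{Erf}\left[\frac{a_n}{\sqrt{2}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}\right] + \mathrm{Erf}\left[\frac{b_n}{\sqrt{2}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}\right]\right) \\
&\quad - \frac{1}{\sqrt{2\pi}} \sqrt{\frac{\det C_{n-1}}{\det C_n}} \left(a_n \exp\left[-\frac{a_n^2}{2} \frac{\det C_{n-1}}{\det C_n}\right] + b_n \exp\left[-\frac{b_n^2}{2} \frac{\det C_{n-1}}{\det C_n}\right]\right)\Bigg\},
\end{aligned}$$
$$\begin{aligned}
4 h''(\lambda = 1) &= \int_{-a_n}^{b_n} dx_n\, (x^T C_n^{-1} x)^2\, \frac{\exp\left[-\frac{1}{2}\, x^T C_n^{-1} x\right]}{(2\pi)^{n/2} \sqrt{\det C_n}} = \frac{\exp\left[-\frac{1}{2}\, x^T C_{n-1}^{-1} x\right]}{(2\pi)^{(n-1)/2} \sqrt{\det C_{n-1}}} \\
&\quad \times \Bigg\{\frac{3}{2} \left(\mathrm{Erf}\left[\frac{a_n}{\sqrt{2}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}\right] + \mathrm{Erf}\left[\frac{b_n}{\sqrt{2}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}\right]\right) \\
&\quad - \frac{3}{\sqrt{2\pi}} \sqrt{\frac{\det C_{n-1}}{\det C_n}} \left(a_n \exp\left[-\frac{a_n^2}{2} \frac{\det C_{n-1}}{\det C_n}\right] + b_n \exp\left[-\frac{b_n^2}{2} \frac{\det C_{n-1}}{\det C_n}\right]\right) \\
&\quad - \frac{1}{\sqrt{2\pi}} \left(\frac{\det C_{n-1}}{\det C_n}\right)^{3/2} \left(a_n^3 \exp\left[-\frac{a_n^2}{2} \frac{\det C_{n-1}}{\det C_n}\right] + b_n^3 \exp\left[-\frac{b_n^2}{2} \frac{\det C_{n-1}}{\det C_n}\right]\right)\Bigg\}.
\end{aligned}$$
Therefore, with the use of Equations (72) and (73), the integral of the kurtosis-corrected distribution function (68) over the $x_n$ prior becomes
$$\begin{aligned}
\int_{-a_n}^{b_n} dx_n\, f(x) &= \frac{\exp\left[-\frac{1}{2}\, x^T C_{n-1}^{-1} x\right]}{(2\pi)^{(n-1)/2} \sqrt{\det C_{n-1}}} \Bigg\{\frac{1}{2} \left(\mathrm{Erf}\left[\frac{a_n}{\sqrt{2}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}\right] + \mathrm{Erf}\left[\frac{b_n}{\sqrt{2}} \sqrt{\frac{\det C_{n-1}}{\det C_n}}\right]\right) \\
&\quad + \frac{1}{8} D_{ijkl} C^{-1}_{ij} C^{-1}_{kl}\, \frac{1}{\sqrt{2\pi}} \sqrt{\frac{\det C_{n-1}}{\det C_n}} \Bigg[a_n \left(1 - \frac{a_n^2}{3} \frac{\det C_{n-1}}{\det C_n}\right) e^{-\frac{a_n^2 \det C_{n-1}}{2 \det C_n}} \\
&\quad + b_n \left(1 - \frac{b_n^2}{3} \frac{\det C_{n-1}}{\det C_n}\right) e^{-\frac{b_n^2 \det C_{n-1}}{2 \det C_n}}\Bigg]\Bigg\}.
\end{aligned}$$
We can now define a new function
$$G_i(a_i, b_i) = \frac{1}{8 \sqrt{2\pi}} \sqrt{\frac{\det C_{i-1}}{\det C_i}} \left[a_i \left(1 - \frac{a_i^2}{3} \frac{\det C_{i-1}}{\det C_i}\right) e^{-\frac{a_i^2 \det C_{i-1}}{2 \det C_i}} + b_i \left(1 - \frac{b_i^2}{3} \frac{\det C_{i-1}}{\det C_i}\right) e^{-\frac{b_i^2 \det C_{i-1}}{2 \det C_i}}\right].$$
Integrating iteratively over $x_{n-1}, \dots, x_1$, we end up with the Bayesian evidence for the fourth-order-corrected probability distribution function $f(x)$,
$$E(a, b) = \prod_{p=1}^{n} \frac{E_p(a_p, b_p)}{(a_p + b_p)} \left[1 + D_{ijkl} C^{-1}_{ij} C^{-1}_{kl} \sum_{m=1}^{n} \frac{G_m(a_m, b_m)}{E_m(a_m, b_m)}\right].$$
Unless $D_{ijkl} C^{-1}_{ij} C^{-1}_{kl}$ is very large, the correction to the error function is exponentially suppressed, and we do not expect significant departures from the Gaussian case, Equation (40).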
In order to compare models it is customary to compute the logarithm of the evidence. Let us assume that we are given a likelihood distribution function normalized by the maximum likelihood, and with corrections up to the fourth order,
$$f(x) = L_{\max} \exp\left( -\frac{1}{2}\, x^T C_n^{-1} x \right) \left[ 1 + \frac{1}{8} D_{ijkl}\, C^{-1}_{ij} C^{-1}_{kl} \right]^{-1} \left[ 1 - \frac{1}{2} B_{ijk}\, C^{-1}_{ij} C^{-1}_{kl}\, x_l + \frac{1}{6} B_{ijk}\, C^{-1}_{il} C^{-1}_{jm} C^{-1}_{kn}\, x_l x_m x_n + \frac{1}{8} D_{ijkl}\, C^{-1}_{ij} C^{-1}_{kl} - \frac{1}{4} D_{ijkl}\, C^{-1}_{ij} C^{-1}_{km} C^{-1}_{ln}\, x_m x_n + \frac{1}{24} D_{ijkl}\, C^{-1}_{im} C^{-1}_{jn} C^{-1}_{kp} C^{-1}_{lq}\, x_m x_n x_p x_q \right].$$
Note that it is normalized so that the maximum corresponds to the mean-centred distribution, i.e., $x = 0$. In this case, the evidence of the normalized distribution is given by
$$E(a,b) = L_{\max}\, (2\pi)^{n/2} \sqrt{\det C_n} \left[ 1 + \frac{1}{8} D_{ijkl}\, C^{-1}_{ij} C^{-1}_{kl} \right]^{-1} \times \prod_{p=1}^{n} \frac{E_p(a_p, b_p)}{(a_p + b_p)} \left[ 1 - \sum_{k=1}^{n} B_{ijk}\, C^{-1}_{ij}\, \frac{F_k(a_k, b_k)}{E_k(a_k, b_k)} + D_{ijkl}\, C^{-1}_{ij} C^{-1}_{kl} \sum_{m=1}^{n} \frac{G_m(a_m, b_m)}{E_m(a_m, b_m)} \right].$$
We can then evaluate the logarithm of the evidence by
$$\ln E = \ln L_{\max} + \frac{n}{2} \ln(2\pi) + \frac{1}{2} \ln \det C_n - \ln\left[ 1 + \frac{1}{8} D_{ijkl}\, C^{-1}_{ij} C^{-1}_{kl} \right] - \sum_{p=1}^{n} \ln(2a_p + 2b_p) + \sum_{p=1}^{n} \ln\left[ \mathrm{Erf}\left( \frac{a_p}{\sqrt{2}} \sqrt{\frac{\det C_{p-1}}{\det C_p}} \right) + \mathrm{Erf}\left( \frac{b_p}{\sqrt{2}} \sqrt{\frac{\det C_{p-1}}{\det C_p}} \right) \right] + \ln\left[ 1 - \sum_{k=1}^{n} B_{ijk}\, C^{-1}_{ij}\, \frac{F_k(a_k, b_k)}{E_k(a_k, b_k)} + D_{ijkl}\, C^{-1}_{ij} C^{-1}_{kl} \sum_{m=1}^{n} \frac{G_m(a_m, b_m)}{E_m(a_m, b_m)} \right].$$
Note that the condition D i j k l C i j 1 C k l 1 < 2 constrains the maximum amount that the kurtosis corrections can contribute to the evidence.
Uncorrelated case. In the case where the likelihood distribution has no correlations among the different variables, the exact expression for the Bayesian evidence is
$$\ln E = \ln L_{\max} + \frac{n}{2} \ln(2\pi) + \sum_{p=1}^{n} \ln \sigma_p - \sum_{p=1}^{n} \ln(2a_p + 2b_p) + \sum_{p=1}^{n} \ln\left[ \mathrm{Erf}\left( \frac{a_p}{\sigma_p\sqrt{2}} \right) + \mathrm{Erf}\left( \frac{b_p}{\sigma_p\sqrt{2}} \right) \right] - \ln\left[ 1 + \frac{1}{8} D_{iijj}\, \sigma_i^{-2} \sigma_j^{-2} \right] + \ln\left[ 1 - \sum_{k=1}^{n} B_{iik}\, \sigma_k^{-2}\, \frac{F_k(a_k, b_k)}{E_k(a_k, b_k)} + D_{iijj}\, \sigma_i^{-2} \sigma_j^{-2} \sum_{m=1}^{n} \frac{G_m(a_m, b_m)}{E_m(a_m, b_m)} \right],$$
where $\sigma_p$ are the corresponding dispersions of variables $x_p$, and the functions $E_i$, $F_i$ and $G_i$ are the corresponding limiting functions of Equations (65) and (75) for uncorrelated matrices.
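The Gaussian part of these expressions (dropping the $B$ and $D$ corrections) is simple to evaluate numerically. The following is a minimal sketch under two assumptions that are not spelled out in this section: $C_p$ is taken to be the leading $p \times p$ block of the covariance matrix with $\det C_0 = 1$, and $a_p$, $b_p$ are the distances from the mean-centred maximum to the lower and upper prior edges. With those assumptions, $\sqrt{\det C_p / \det C_{p-1}}$ is the $p$-th diagonal element of the Cholesky factor of $C$. The function names and test matrices are hypothetical.

```python
import math

def cholesky(C):
    """Lower-triangular L with C = L L^T (C symmetric positive definite)."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(C[i][i] - s)
            else:
                L[i][j] = (C[i][j] - s) / L[j][j]
    return L

def ln_evidence_gaussian(ln_Lmax, C, a, b):
    """Gaussian terms of ln E for top-hat priors [-a_p, b_p] about the maximum."""
    n = len(C)
    L = cholesky(C)
    ln_E = ln_Lmax + 0.5 * n * math.log(2 * math.pi)
    for p in range(n):
        eff_sigma = L[p][p]                     # sqrt(det C_p / det C_{p-1})
        ln_E += math.log(eff_sigma)             # sums to (1/2) ln det C_n
        ln_E -= math.log(2 * a[p] + 2 * b[p])   # prior width term
        arg = 1.0 / (eff_sigma * math.sqrt(2))
        ln_E += math.log(math.erf(a[p] * arg) + math.erf(b[p] * arg))
    return ln_E

# Hypothetical two-parameter examples
a = [1.0, 2.0]
b = [1.5, 2.0]
C_diag = [[0.04, 0.0], [0.0, 0.25]]     # sigma = 0.2 and 0.5, no correlation
C_corr = [[0.04, 0.03], [0.03, 0.25]]   # same dispersions, correlated
ln_E_diag = ln_evidence_gaussian(0.0, C_diag, a, b)
ln_E_corr = ln_evidence_gaussian(0.0, C_corr, a, b)
print(ln_E_diag, ln_E_corr)
```

For a diagonal covariance matrix the Cholesky diagonal reduces to $\sigma_p$, so the function reproduces the Gaussian terms of the uncorrelated expression. In the correlated case the per-variable error-function arguments in this sketch depend on the ordering of the variables.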

5. Model Comparison

Finally we turn to specific applications of the formalism discussed above. Initially we will carry out some toy model tests of its performance, and then examine real cosmological applications for which we previously obtained results by thermodynamic integration [12].

5.1. A Baby-Toy Model Comparison

We begin with a very simple two-dimensional toy model. The purpose of this section is to illustrate the ineffectiveness of the thermodynamic integration and to give an indication of the performance of the method we propose here. In addition, the two-dimensional model is simple enough to allow a brute-force direct numerical integration of the evidence, which lets us check the accuracy at the same time. We use the following two forms of likelihood:
L g ( x , y ) = exp 2 x 2 2 ( y 1 ) 2 x y 2
L n g ( x , y ) = exp 2 x 2 2 ( y 1 ) 2 x y 2 + exp 2 x 2 2 y 2 3 x y 2
The subscripts g and n g indicate the Gaussian and non-Gaussian cases, respectively.
Firstly, we calculate the evidence by the analytical method using Equations (56) and () and covariance matrices inferred from sampling the likelihood with the vanilla Metropolis–Hastings algorithm with fixed proposal widths. Chains ranging from a few to several million samples were used. We also calculate the evidence using the thermodynamic integration algorithm explained in ref. [12]. Again, we vary the algorithm parameters to obtain evidence values of varying accuracy. The resulting evidence as a function of the number of likelihood evaluations is plotted in Figure 1, together with the correct value inferred by direct numerical integration. The number of likelihood evaluations is crucial, as this is the time-limiting step in cosmological parameter estimation and model comparison exercises.
The results are what could have been anticipated. We note that the size of the prior does not seem to be of crucial importance. This is comforting, given that the analytical method requires knowledge of the true covariance information, while we can only supply a covariance matrix estimated from the prior-truncated likelihood. We also note that the thermodynamic integration converges to the correct value in all cases. However, it does so only after very many likelihood evaluations, typically about a million or so even for a two-dimensional problem. The analytical method already becomes limited by systematics at about ten thousand samples. For the Gaussian case, there is no systematic by construction, while the non-Gaussian case suffers a systematic of about 0.1 in $\ln E$. The non-Gaussian correction reduces the error by about half and thus correctly estimates the uncertainty associated with the purely Gaussian approximation. In the case of wide priors, the only non-Gaussian correction of appreciable size is the $\ln(1 + D_{ijkl}\, C^{-1}_{ij} C^{-1}_{kl}/8)$ term.
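The covariance-estimation step can be illustrated with a minimal Metropolis–Hastings sketch with fixed proposal widths and a top-hat prior. The likelihood used here is a hypothetical correlated two-dimensional Gaussian chosen for illustration, not the $L_g$ or $L_{ng}$ of this section, and the proposal widths, prior box, chain length and seed are likewise assumptions.

```python
import math
import random

def ln_like(p):
    # Hypothetical Gaussian with precision matrix [[2, 0.5], [0.5, 1]]
    x, y = p
    return -0.5 * (2.0 * x * x + x * y + y * y)

def metropolis_hastings(ln_like, start, widths, lower, upper, n_steps, seed=1):
    """Vanilla Metropolis-Hastings with fixed Gaussian proposal widths."""
    rng = random.Random(seed)
    current = list(start)
    ln_l = ln_like(current)
    n_evals = 1
    chain = []
    for _ in range(n_steps):
        proposal = [c + rng.gauss(0.0, w) for c, w in zip(current, widths)]
        # Proposals outside the top-hat prior have zero prior weight: reject
        inside = all(lo <= q <= hi for q, lo, hi in zip(proposal, lower, upper))
        if inside:
            ln_l_new = ln_like(proposal)
            n_evals += 1
            if rng.random() < math.exp(min(0.0, ln_l_new - ln_l)):
                current, ln_l = proposal, ln_l_new
        chain.append(tuple(current))
    return chain, n_evals

def sample_covariance(chain):
    """Sample mean and covariance matrix of the chain."""
    n, dim = len(chain), len(chain[0])
    mean = [sum(s[i] for s in chain) / n for i in range(dim)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in chain) / (n - 1)
            for j in range(dim)] for i in range(dim)]
    return mean, cov

chain, n_evals = metropolis_hastings(
    ln_like, start=[0.0, 0.0], widths=[1.0, 1.0],
    lower=[-6.0, -6.0], upper=[6.0, 6.0], n_steps=60000)
mean, cov = sample_covariance(chain)
print(n_evals, cov)
```

For this test likelihood the true covariance is the inverse of the precision matrix, with elements 4/7, −2/7 and 8/7, so the sampling noise of the estimate can be read off directly. The estimated matrix is what would then be fed into the analytical evidence formulae.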

5.2. A Toy Model Comparison

We now proceed by calculating the Bayesian evidence for simple toy models with five and six parameters, shown in Table 1. The purpose is to compare results with those obtained from thermodynamic integration again, but this time using a model that bears more resemblance to a typical problem encountered in cosmology.
Beginning with the five-parameter model, we first assume that it has an uncorrelated multivariate Gaussian likelihood distribution. In this case the aim is to test the thermodynamic integration method, which gives $\ln E^{\rm num}_{\rm toy5} = -8.65 \pm 0.03$, while the exact expression gives $\ln E^{\rm ana}_{\rm toy5} = -8.66$. Therefore, we conclude that the thermodynamic integration method is rather good at obtaining the correct evidence of the model. The Laplace approximation, Equation (57), also fares well for uncorrelated distributions, $\ln E^{\rm Lap}_{\rm toy5} = -8.67$.
We now consider a likelihood function with a correlated covariance matrix $C_{ij}$, with the same mean values and dispersions as the previous case, but with significant correlations. The analytic formula needed, Equation (54), is no longer exact (see Note 3), and gives $\ln E^{\rm ana}_{\rm toy5c} = -7.32$. For comparison, thermodynamic integration gives $\ln E^{\rm num}_{\rm toy5c} = -7.28 \pm 0.06$, again in perfect agreement within errors. In this case the Laplace approximation fails significantly, $\ln E^{\rm Lap}_{\rm toy5c} = -6.89$, the reason being that the correlations chosen bring the posterior into significant contact with the edges of the priors.
Let us now return to the uncorrelated case and include a new parameter, $x_6$, as in Table 1, and evaluate the different evidences that appear because of this new parameter, in order to see the sensitivity to systematic errors in the evaluation of the Bayesian evidence and their effects on model comparison. The numerical result is $\ln E^{\rm num}_{\rm toy6} = -10.75 \pm 0.03$, while the exact analytical expression gives $\ln E^{\rm ana}_{\rm toy6} = -10.74$, in perfect agreement within errors. The Laplace approximation, Equation (57), again fares well for uncorrelated distributions, $\ln E^{\rm Lap}_{\rm toy6} = -10.74$.
When the likelihood function has large correlations, and the priors are not too large, the naive Laplace approximation, Equation (57), fares less well than the analytical approximation, Equation (54).
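The effect of prior edges on the Laplace approximation can be seen in a one-dimensional sketch. Equation (57) is not reproduced in this section, so the code assumes the usual form in which all of the Gaussian mass lies inside the prior, i.e. the sum of error functions is replaced by 2. The function names and numerical values are hypothetical.

```python
import math

def ln_E_exact(sigmas, a, b, ln_Lmax=0.0):
    """Uncorrelated Gaussian evidence with top-hat priors [-a_p, b_p]."""
    total = ln_Lmax
    for s, ap, bp in zip(sigmas, a, b):
        total += 0.5 * math.log(2 * math.pi) + math.log(s)
        total -= math.log(2 * ap + 2 * bp)
        total += math.log(math.erf(ap / (s * math.sqrt(2)))
                          + math.erf(bp / (s * math.sqrt(2))))
    return total

def ln_E_laplace(sigmas, a, b, ln_Lmax=0.0):
    """Assumed Laplace form: Gaussian volume divided by prior volume."""
    total = ln_Lmax
    for s, ap, bp in zip(sigmas, a, b):
        total += 0.5 * math.log(2 * math.pi) + math.log(s)
        total -= math.log(ap + bp)
    return total

# Wide prior: both edges ten sigma away from the maximum
wide_gap = ln_E_laplace([1.0], [10.0], [10.0]) - ln_E_exact([1.0], [10.0], [10.0])
# One edge only half a sigma from the maximum
edge_gap = ln_E_laplace([1.0], [0.5], [10.0]) - ln_E_exact([1.0], [0.5], [10.0])
print(wide_gap, edge_gap)
```

In this sketch the Laplace value never falls below the exact one, and the gap grows as the posterior comes into contact with a prior edge, which is the behaviour described for the correlated toy model.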

5.3. A Real Model Comparison

In this subsection we will make use of the results obtained in ref. [12], where we evaluated the evidence for 5- and 6-parameter adiabatic models, and for three 10-parameter mixed adiabatic plus isocurvature models. The prior ranges used are given in Table 2. The latter models give a marginally better fit to the data but require more parameters, which is exactly the situation where model selection techniques are needed to draw robust conclusions. In ref. [12] we used thermodynamic integration to compute the evidence and showed that the isocurvature models were less favoured than the adiabatic ones, but only at a mild significance level (see Note 4).
Beginning with the simplest adiabatic model, which uses the Harrison–Zel’dovich spectrum, we have used the analytical formulae above, Equation (54), together with the covariance matrix provided by the cosmoMC programme [21], and obtained $\ln E^{\rm ana}_{\rm ad} = -854.07$, while the thermodynamic integration gave $\ln E^{\rm num}_{\rm ad} = -854.1 \pm 0.1$ [12]. The agreement is excellent; this is because the distribution function for the adiabatic model is rather well-approximated by a Gaussian, and the priors are rather large, so the result of Equation (54) is very close to that obtained in the Laplace approximation, $\ln E^{\rm Lap}_{\rm ad} = -854.08$.
However, the analytic method fares less well for the adiabatic model with varying $n_s$, with both the analytical and Laplace methods giving $\ln E_{{\rm AD}\text{-}n_s} = -853.4$, while the numerical method gives the smaller value −854.1, a discrepancy close to unity.
Turning now to the isocurvature cases, we found an extremely good result for the CDI model, obtaining from Equation (54) the value $\ln E^{\rm ana}_{\rm cdi} = -855.08$, while the thermodynamic integration gives $\ln E^{\rm num}_{\rm cdi} = -855.1 \pm 0.1$. This is surprising, given the relatively large non-Gaussianities for at least three variables: $n_{\rm iso}$, $\beta$ and $\delta_{\rm cor}$, whose priors are not centred with respect to the mean. However, the NID case shows much less agreement, with a discrepancy of 0.6. This suggests that the closeness of the CDI comparison is to some extent a statistical fluke, with the underlying method less accurate.
A summary of the different models can be found in Table 3.
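To see how these differences propagate into a model comparison, the log-evidences of Table 3 can be subtracted pairwise to give the logarithm of the Bayes factor. The snippet below is only an arithmetic illustration; it takes the real-model log-evidences as negative numbers (consistent with the value −854.1 quoted in the text), and the dictionary and function names are hypothetical.

```python
# ln E values for the real models, transcribed from Table 3 (taken as negative)
ln_E_num = {"AD": -854.1, "AD-ns": -854.1, "CDI": -855.1, "NID": -855.1, "NIV": -855.1}
ln_E_ana = {"AD": -854.1, "AD-ns": -853.4, "CDI": -855.1, "NID": -854.5, "NIV": -854.9}

def ln_bayes_factor(table, model_a, model_b):
    """ln(E_a / E_b); positive values favour model_a."""
    return table[model_a] - table[model_b]

for other in ("AD-ns", "CDI", "NID", "NIV"):
    num = ln_bayes_factor(ln_E_num, "AD", other)
    ana = ln_bayes_factor(ln_E_ana, "AD", other)
    print(f"AD vs {other}: numerical {num:+.1f}, analytical {ana:+.1f}")
```

The comparison makes explicit which model pairs are sensitive to the method used, without taking into account the quoted numerical uncertainties.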

5.4. Savage–Dickey Method

Another numerical method for evidence calculation is the Savage–Dickey method, first described in ref. [22] and recently used in ref. [20]. This technique allows one to calculate the evidence ratio of two models from a simple and quick analysis of the Markov chains used for parameter estimation, provided that the models are nested; i.e., that one of them is included in the parameter space of the other. For instance, the AD model is nested within the AD-$n_s$ model, and the AD and AD-$n_s$ models are both nested within the CDI, NID and NIV ones. In the context of Markov chains, the Savage–Dickey method is essentially a measure of how much time the sampler spends in the nested model, weighted by the respective volumes of the two models. When the outer model has extra parameters, this method relies on approximating the nested model as a model with negligibly narrow priors in the directions of the extra parameters. We note, however, that when many extra parameters are present, this method must fail, for reasons similar to those that make grid-based parameter estimation approaches fail for models with many parameters. The MCMC parameter estimation simply does not have a high enough dynamic range to probe the two models given the large prior volume ratio.
The AD and AD-$n_s$ models differ by one parameter. Using the same AD+$n_s$ samples as for the analytical method (i.e., the samples from which we extracted the covariance matrix), we obtained $\ln(E_{\rm AD}/E_{{\rm AD}+n_s}) = 0.03$. The result from the precise thermodynamic integration, $\ln(E_{\rm AD}/E_{{\rm AD}\text{-}n_s}) = 0 \pm 0.1$, is in excellent agreement. The AD-$n_s$ and CDI (or NID, NIV) models differ by four parameters. With most simple choices of parametrization (including in particular the isocurvature and cross-correlation tilts), the AD-$n_s$ model is not a point, but a hyper-surface within the parameter space of the isocurvature models (i.e., $\alpha = 0$ and the other three parameters act as dummy, unconstrained, parameters which do not affect the evidence). In these cases, the evidence ratios given by the Savage–Dickey method do not converge as the priors of the extra parameters are tightened up around the nested model, although they match thermodynamically determined values to within a unit of $\ln E$.
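For a single extra parameter with a top-hat prior, the Savage–Dickey ratio can be sketched as the marginal posterior density at the nested value divided by the prior density there. The code below is an illustration on synthetic samples, not the analysis of this paper: the Gaussian "posterior", the prior range, the bin half-width and the function name are all assumptions.

```python
import math
import random

def savage_dickey_ln_ratio(samples, phi0, prior_lo, prior_hi, half_width):
    """ln(E_nested / E_outer) for one extra parameter with a flat prior.

    The posterior density at phi0 is estimated from the fraction of chain
    samples falling within +/- half_width of phi0.
    """
    n_in = sum(1 for s in samples if abs(s - phi0) <= half_width)
    if n_in == 0:
        raise ValueError("no samples near the nested value; ratio is unconstrained")
    posterior_density = n_in / (len(samples) * 2.0 * half_width)
    prior_density = 1.0 / (prior_hi - prior_lo)
    return math.log(posterior_density / prior_density)

# Synthetic stand-in for the marginalized chain of the extra parameter
rng = random.Random(0)
samples = [rng.gauss(0.2, 0.1) for _ in range(100000)]

ln_ratio = savage_dickey_ln_ratio(samples, phi0=0.0, prior_lo=-1.0,
                                  prior_hi=1.0, half_width=0.02)
print(ln_ratio)
```

The guard against an empty bin mirrors the dynamic-range limitation noted above: if the chain never visits the neighbourhood of the nested model, the ratio cannot be estimated from the samples.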

6. Discussion and Conclusions

We have developed an analytical formalism for computing the Bayesian evidence in the case of an arbitrary likelihood distribution with a hierarchy of non-Gaussian corrections, and with arbitrary top-hat priors, centred or uncentred. This analysis can be of great help for the problem of model comparison in the present context of cosmology, where observational data is still unable to rule out most extensions of the standard model based on the ΛCDM inflationary paradigm.
As an application of the exact and approximate formulae obtained for the Bayesian evidence of a model with approximately Gaussian likelihood distributions, we have compared the value predicted analytically with that computed with a time-consuming algorithm based on the thermodynamic integration approach. The values analytically obtained agree surprisingly well with those obtained numerically. While one can estimate the magnitude of the higher-order corrections for the analytical formulae, it is very difficult to estimate the systematic effects of the numerical approach. Thus, with this analytical method we can test for systematics in the thermodynamic integration approach. So far, the values obtained agree, so it seems that the numerical approach is a good tool for estimating the evidence. However, it takes considerable effort and machine time to do the correct evaluation, and therefore we propose the use of the analytical estimate, whose corrections are well under control, in the sense that one can compute the next order corrections and show that they are small.

Funding

This research was funded by the Spanish grants PID2021-123012NB-C43 [MICINN-FEDER] and the Centro de Excelencia Severo Ochoa Program CEX2020-001007-S through IFT.

Data Availability Statement

There is no data associated with this work.

Conflicts of Interest

The author declares no conflict of interest.

Notes

1. An extension to Gaussian priors should be feasible, but not one to arbitrary priors.
2. Note that, for scalar quantities, Einstein notation for the sum over free indices is assumed.
3. One could rotate the parameter basis to remove the correlations, but then the priors would not be top-hats.
4. Recently, Trotta [20] used a different technique to analyse a restricted class of isocurvature model featuring just one extra parameter, and found it highly disfavoured. The different conclusion is primarily due to the very different prior he chose on the isocurvature amplitude, such that almost all the models under the prior are dominated by isocurvature models and in poor agreement with the data.

References

  1. Jeffreys, H. Theory of Probability, 3rd ed.; Oxford University Press: Oxford, UK, 1961.
  2. MacKay, D.J.C. Information Theory, Inference and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003.
  3. Jaffe, A. H0 and odds on cosmology. Astrophys. J. 1996, 471, 24.
  4. Drell, P.S.; Loredo, T.J.; Wasserman, I. Type Ia supernovae, evolution, and the cosmological constant. Astrophys. J. 2000, 530, 593.
  5. John, M.V.; Narlikar, J.V. Comparison of cosmological models using Bayesian theory. Phys. Rev. D 2002, 65, 043506.
  6. Hobson, M.P.; Bridle, S.L.; Lahav, O. Combining cosmological data sets: Hyperparameters and Bayesian evidence. Mon. Not. R. Astron. Soc. 2002, 335, 377.
  7. Slosar, A.; Carreira, P.; Cleary, K.; Davies, R.D.; Davis, R.J.; Dickinson, C.; Genova-Santos, R.; Grainge, K.; Gutierrez, C.M.; Hafez, Y.A.; et al. Cosmological parameter estimation and Bayesian model comparison using Very Small Array data. Mon. Not. R. Astron. Soc. 2003, 341, L29.
  8. Saini, T.D.; Weller, J.; Bridle, S.L. Revealing the nature of dark energy using Bayesian evidence. Mon. Not. R. Astron. Soc. 2004, 348, 603.
  9. Niarchou, A.; Jaffe, A.H.; Pogosian, L. Large-scale power in the CMB and new physics: An analysis using Bayesian model comparison. Phys. Rev. D 2004, 69, 063515.
  10. Marshall, P.; Rajguru, N.; Slosar, A. Bayesian evidence as a tool for comparing datasets. Phys. Rev. D 2006, 73, 067302.
  11. Liddle, A.R. How many cosmological parameters? Mon. Not. R. Astron. Soc. 2004, 351, L49–L53.
  12. Beltrán, M.; García-Bellido, J.; Lesgourgues, J.; Liddle, A.R.; Slosar, A. Bayesian model selection and isocurvature perturbations. Phys. Rev. D 2005, 71, 063532.
  13. Ó’Ruanaidh, J.J.K.; Fitzgerald, W.J. Numerical Bayesian Methods Applied to Signal Processing; Springer: New York, NY, USA, 1996.
  14. Hobson, M.P.; McLachlan, C. A Bayesian approach to discrete object detection in astronomical data sets. Mon. Not. R. Astron. Soc. 2003, 338, 765.
  15. Skilling, J. Nested sampling. AIP Conf. Proc. 2004, 735, 395.
  16. Handley, W.J.; Hobson, M.P.; Lasenby, A.N. POLYCHORD: Nested sampling for cosmology. Mon. Not. R. Astron. Soc. 2015, 450, L61.
  17. Xie, W.; Lewis, P.O.; Fan, Y.; Kuo, L.; Chen, M.-H. Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Syst. Biol. 2011, 60, 150.
  18. Maturana-Russel, P.; Meyer, R.; Veitch, J.; Christensen, N. Search for the isotropic stochastic background using data from Advanced LIGO’s second observing run. Phys. Rev. D 2019, 99, 084006.
  19. Kass, R.E.; Raftery, A.E. Bayes Factors. J. Am. Stat. Assoc. 1995, 90, 773.
  20. Trotta, R. Applications of Bayesian model selection to cosmological parameters. Mon. Not. R. Astron. Soc. 2007, 378, 72.
  21. Lewis, A.; Bridle, S. Cosmological parameters from CMB and other data: A Monte Carlo approach. Phys. Rev. D 2002, 66, 103511.
  22. Dickey, J.M. The Weighted Likelihood Ratio, Linear Hypotheses on Normal Location Parameters. Ann. Math. Stat. 1971, 42, 204.
Figure 1. This figure shows the calculated evidence as a function of the number of likelihood evaluations. Note that the horizontal axis is logarithmic. The star-centred line corresponds to the thermodynamic integration. The cross-centred lines are the analytical methods with (upper panels) and without (lower panels) non-Gaussian corrections applied. The horizontal dashed line is the number obtained by the direct integration. The upper two panels correspond to $L_g$, while the lower two to $L_{ng}$. The left-hand side panels correspond to wide flat priors of (7, 10) on both parameters, while the right-hand side to the narrow priors of (2, 3) on both parameters. The error bars correspond to the dispersion due to the number of likelihood evaluations.
Table 1. The parameters used in the analytical evaluation of the toy model evidences, with five and six parameters, respectively. The maximum likelihood of the toy models is taken (arbitrarily) to be $L_{\max} = 1$.

| Parameter | Mean | Prior Range | Model |
| $x_1$ | 0.022 | [0.0001, 0.044] | toy5, toy6 |
| $x_2$ | 0.12 | [0.001, 0.3] | toy5, toy6 |
| $x_3$ | 1.04 | [0.8, 1.4] | toy5, toy6 |
| $x_4$ | 0.1 | [0.01, 0.3] | toy5, toy6 |
| $x_5$ | 3.1 | [2.6, 3.6] | toy5, toy6 |
| $x_6$ | 0.98 | [0.5, 1.5] | toy6 |
Table 2. The parameters used in the models; see ref. [12] for nomenclature and other details. For the AD-HZ model $n_s$ was fixed to 1 and $n_{\rm iso}$, $\delta_{\rm cor}$, $\alpha$ and $\beta$ were fixed to 0. In the AD-$n_s$ model, $n_s$ also varies. Every isocurvature model holds the same priors for the whole set of parameters.

| Parameter | Mean | Prior Range | Model |
| $\omega_b$ | 0.022 | [0.018, 0.032] | AD-HZ, AD-$n_s$, ISO |
| $\omega_{\rm dm}$ | 0.12 | [0.04, 0.16] | AD-HZ, AD-$n_s$, ISO |
| $\theta$ | 1.04 | [0.98, 1.10] | AD-HZ, AD-$n_s$, ISO |
| $\tau$ | 0.17 | [0, 0.5] | AD-HZ, AD-$n_s$, ISO |
| $\ln[10^{10} R_{\rm rad}]$ | 3.1 | [2.6, 4.2] | AD-HZ, AD-$n_s$, ISO |
| $n_s$ | 1.0 | [0.8, 1.2] | AD-$n_s$, ISO |
| $n_{\rm iso}$ | 1.5 | [0, 3] | ISO |
| $\delta_{\rm cor}$ | 1.5 | [−0.14, 0.4] | ISO |
| $\alpha$ | 0 | [−1, 1] | ISO |
| $\beta$ | 0 | [−1, 1] | ISO |
Table 3. The different models, both toy and real, with their maximum likelihoods and evidences.

| Model | $\ln L_{\max}$ | $\ln E^{\rm num}$ | $\ln E^{\rm ana}$ | $\ln E^{\rm Lap}$ |
| toy5 | 0 | −8.65 ± 0.03 | −8.66 | −8.67 |
| toy5c | 0 | −7.28 ± 0.06 | −7.32 | −6.89 |
| toy6 | 0 | −10.75 ± 0.03 | −10.74 | −10.74 |
| toy6c | 0 | −9.73 ± 0.06 | −9.71 | −9.63 |
| AD | −840.78 | −854.1 ± 0.1 | −854.1 | −854.1 |
| AD-$n_s$ | −838.50 | −854.1 ± 0.1 | −853.4 | −853.4 |
| CDI | −838.05 | −855.1 ± 0.2 | −855.1 | −854.5 |
| NID | −836.60 | −855.1 ± 0.2 | −854.5 | −854.5 |
| NIV | −842.53 | −855.1 ± 0.3 | −854.9 | −854.9 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
