Marginal Bayesian Statistics Using Masked Autoregressive Flows and Kernel Density Estimators with Examples in Cosmology

Cosmological experiments often employ Bayesian workflows to derive constraints on cosmological and astrophysical parameters from their data. It has been shown that these constraints can be combined across different probes, such as Planck and the Dark Energy Survey, and that this can be a valuable exercise to improve our understanding of the universe and quantify tension between multiple experiments. However, these experiments are typically plagued by differing systematics, instrumental effects and contaminating signals, which we collectively refer to as 'nuisance' components, that have to be modelled alongside target signals of interest. This leads to high dimensional parameter spaces, especially when combining data sets, with >20 dimensions of which only around 5 correspond to key physical quantities. We present a means by which to combine constraints from different data sets in a computationally efficient manner by generating rapid, reusable and reliable marginal probability density estimators, giving us access to nuisance-free likelihoods. This is made possible by the unique combination of nested sampling, which gives us access to Bayesian evidences, and the marginal Bayesian statistics code MARGARINE. Our method is lossless in the signal parameters, resulting in the same posterior distributions as would be found from a full nested sampling run over all nuisance parameters, and is typically quicker than evaluating full likelihoods. We demonstrate our approach by applying it to the combination of posteriors from the Dark Energy Survey and Planck.


Introduction
Bayesian inference is a cornerstone of modern cosmology and astrophysics. It is frequently employed to derive constraints on key signal parameters from data sets such as the Dark Energy Survey [DES, 1,2], Planck [3], REACH [4] and SARAS2 [5] among others. Often experiments are sensitive to different aspects of the same physics, and by combining constraints across probes we can improve our understanding of the Universe or reveal tensions between different experiments.
However, this can become a computationally expensive task, as many experiments feature systematics, instrumental effects and contamination from other physical signals that need to be modelled alongside the signal or parameters of interest. For individual experiments, this can lead to high dimensional problems with upwards of 20 parameters, of which the majority can be considered 'nuisance' parameters. The problem is compounded when combining different data sets with different models for common nuisance components and different systematics or instrumental effects that have to be modelled.
In this work we demonstrate that we can use density estimators, such as Kernel Density Estimators [6,7] and Masked Autoregressive Flows [8], to rapidly calculate reliable and reusable representations of marginal probability densities and marginal Bayesian summary statistics for key signal or cosmological parameters. This gives us access to nuisance-free likelihood functions and allows us to combine parameter constraints from different data sets in a computationally efficient manner, given marginal posterior samples from the different experiments. We use the publicly available code MARGARINE [9] to generate density estimators.
In section 2 we mathematically demonstrate that the application of MARGARINE to the problem of combining the marginal posteriors from two data sets is equivalent to running a full nested sampling run including all 'nuisance' parameters. We define in this section the nuisance-free likelihood. Section 3 briefly discusses the methodology behind MARGARINE with reference to a previously published work [9]. Finally, we show the results of combining samples from DES and Planck in section 4 and conclude in section 5.

Theory
Notation. Given a likelihood L(Θ) ≡ P(D|Θ, M) representing the probability of data D given some model M with parameters Θ, Bayesian inference proceeds by defining a prior π(Θ) ≡ P(Θ|M), and then through Bayes theorem computing a posterior distribution P(Θ) ≡ P(Θ|D, M) for the purposes of parameter estimation, and an evidence Z ≡ P(D|M) in order to perform model comparison. In our notation we suppress model dependence, but where we wish to refer to the likelihoods derived from different datasets we denote this with a subscript, so for example L_A(Θ) ≡ P(D_A|Θ, M) and L_B(Θ) ≡ P(D_B|Θ, M).
In our setting, the parameter vector is split into two sub-vectors Θ = (θ, α), where θ are parameters of scientific interest, and α are nuisance parameters, included for the purposes of data analysis. Such situations are common in astrophysics, where for example θ might be parameters governing the Universe's evolution, whilst α might be associated with instrument calibration and foreground removal [3,4]. The α parameters are generally "marginalised out" and not considered further in final or future analyses.

Definitions
With this notation established, the version of Bayes theorem including nuisance parameters takes the form

    L(θ, α) π(θ, α) = P(θ, α) Z,    (1)

where we have placed the inputs of inference (likelihood and prior) on the left-hand side, and the outputs (posterior and evidence) on the right. The evidence is as usual equivalent to the (fully-)marginalised likelihood Z = ∫ L(θ, α) π(θ, α) dθ dα. We may marginalise any probability distribution, so can straightforwardly define the nuisance-marginalised posterior and prior by integrating over α:

    P(θ) = ∫ P(θ, α) dα,    π(θ) = ∫ π(θ, α) dα.    (2)

The nuisance-marginalised version of Bayes theorem eq. (1) takes the form

    L(θ) π(θ) = P(θ) Z.    (3)

Here Z is the original evidence, whilst L(θ) is non-trivially the nuisance-free likelihood

    L(θ) = ∫ L(θ, α) π(θ, α) dα / π(θ) = ∫ L(θ, α) π(α|θ) dα,    (4)

where the above is motivated by marginalising the full Bayes theorem eq. (1) over α and substituting the definitions in eq. (2).
We now explain why eq. (4) is a useful definition. Let L_A(θ, α_A) and L_B(θ, α_B) be two likelihoods with distinct datasets, each with their own nuisance parameters, and write the combined Bayes theorem over all parameters as

    L_A(θ, α_A) L_B(θ, α_B) π_AB(θ, α_A, α_B) = P(θ, α_A, α_B) Z.    (5)

The nuisance-free likelihoods L_A(θ), L_B(θ) form a lossless compression in θ. This means that we can recover the same (marginal) inference in combination,

    L_A(θ) L_B(θ) π(θ) = P(θ) Z,    (6)

that we would have made when performing a combined analysis with all nuisance parameters, if the respective priors π_A(θ, α_A) and π_B(θ, α_B) satisfy the marginal consistency relations

    π_A(θ, α_A) = ∫ π_AB(θ, α_A, α_B) dα_B,    (7)

    π_B(α_B|θ) = π_AB(α_B|θ, α_A).    (8)

This process is represented graphically in fig. 1 and fig. 2.

Proof.
Integrating the combined Bayes theorem eq. (5) with respect to α_B, applying the definition of the marginal posterior eq. (2) on the right-hand side, and drawing out terms independent of α_B on the left, yields

    L_A(θ, α_A) ∫ L_B(θ, α_B) π_AB(θ, α_A, α_B) dα_B = P(θ, α_A) Z.    (9)

From the definition of a nuisance-free likelihood eq. (4), and the marginal consistency eq. (8), we can say that the integral on the left-hand side becomes:

    ∫ L_B(θ, α_B) π_AB(θ, α_A, α_B) dα_B = L_B(θ) π_AB(θ, α_A).    (10)

Substituting eq. (10) back into eq. (9) we find

    L_A(θ, α_A) L_B(θ) π_AB(θ, α_A) = P(θ, α_A) Z.    (11)

Proceeding with a similar manipulation to eq. (10), marginalising with respect to α_A, and applying the definition of the nuisance-free likelihood L_A(θ) eq. (4) and the marginal prior consistency eq. (7), we recover eq. (6).

Discussion
Equation (5) represents Bayes theorem for the combined likelihood of both datasets, L_AB(θ, α_A, α_B) = L_A(θ, α_A) L_B(θ, α_B), using the combined prior π_AB(θ, α_A, α_B). We assume the combined prior is marginally consistent, eqs. (7) and (8), which is reasonable, merely demanding that the priors are identical in the parameter spaces where they overlap. In practice, this would usually be achieved by assuming separability between the signal and nuisance parameter spaces, π(θ, α) = π(θ)π(α), but eqs. (7) and (8) are a slightly less restrictive requirement and therefore more general.
The upshot of this is that if you have performed inference for two datasets separately, such that you are able to compute the nuisance-free likelihoods with MARGARINE, you may discard the nuisance parameters in any subsequent analyses that combine the datasets.
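As a concrete sanity check on this result, the toy sketch below compares the two routes numerically for a one-dimensional signal parameter with Gaussian nuisance offsets and a separable prior. All distributions, data values and grid choices here are hypothetical, chosen purely for illustration:

```python
import numpy as np

# Toy example: theta is the shared signal parameter; alpha_A, alpha_B are
# per-experiment nuisance offsets. Each experiment measures d = theta + alpha
# plus Gaussian noise. All numbers are illustrative stand-ins.
def gauss(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

theta = np.linspace(-5, 5, 121)
a_A = np.linspace(-5, 5, 121)
a_B = np.linspace(-5, 5, 121)
dt, da = theta[1] - theta[0], a_A[1] - a_A[0]

d_A, d_B, sig = 0.8, 1.2, 0.5
pi_t = gauss(theta, 0.0, 2.0)   # pi(theta)
pi_a = gauss(a_A, 0.0, 1.0)     # pi(alpha), same separable form for A and B

# Route 1: full joint over (theta, alpha_A, alpha_B), marginalised numerically.
T, AA, AB = theta[:, None, None], a_A[None, :, None], a_B[None, None, :]
joint = gauss(d_A, T + AA, sig) * gauss(d_B, T + AB, sig) \
        * pi_t[:, None, None] * pi_a[None, :, None] * pi_a[None, None, :]
post_full = joint.sum(axis=(1, 2)) * da * da
post_full /= post_full.sum() * dt

# Route 2: nuisance-free likelihoods (eq. (4), separable prior), then eq. (6).
L_A_nf = (gauss(d_A, theta[:, None] + a_A[None, :], sig) * pi_a).sum(axis=1) * da
L_B_nf = (gauss(d_B, theta[:, None] + a_B[None, :], sig) * pi_a).sum(axis=1) * da
post_marg = L_A_nf * L_B_nf * pi_t
post_marg /= post_marg.sum() * dt

assert np.allclose(post_full, post_marg)  # identical marginal posteriors
```

Both routes yield the same marginal posterior to numerical precision, as eq. (6) requires; note that with a separable prior the nuisance-free likelihood of eq. (4) reduces to ∫ L(θ, α) π(α) dα.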

Methods
MARGARINE was first introduced in [9] and uses density estimation to approximate probability distributions, such as P(θ) and π(θ), given sets of representative samples. The code was initially developed to calculate marginal Kullback-Leibler (KL) divergences [10] and Bayesian Model Dimensionalities (BMDs) [11]; however, as discussed in section 2, it can also be used to calculate nuisance-free likelihoods. This in turn means that we can use MARGARINE alongside an implementation of the nested sampling algorithm to sample the product L_A(θ)L_B(θ). In this manner, MARGARINE allows us to combine constraints on common parameters across different data sets.
We refer the reader to [9] for a complete discussion of how MARGARINE works; here we briefly discuss the density estimation. MARGARINE uses two different types of density estimator to model posterior and prior samples, namely Masked Autoregressive Flows (MAFs, [8]) and Kernel Density Estimators (KDEs, [6,7]).
Figure 1. A graphical representation of combining constraints from different data sets via a full nested sampling run over both cosmological and nuisance parameters, θ, α_A and α_B (eq. (5)).
MAFs transform a multivariate base distribution, the standard normal, into a target distribution via a series of shifts and scalings estimated by autoregressive neural networks. To improve the performance of the MAF, the samples representing the target distribution, in our case P(θ) and π(θ), are transformed into a gaussianized space. We implement the MAFs using TENSORFLOW with KERAS [12].
KDEs use a kernel to approximate the multivariate probability density of a series of samples. In our case, the kernel is Gaussian and the probability density is a sum of Gaussians centred on the sample points with a given bandwidth. Again we transform the target samples into a gaussianized parameter space allowing the KDE to better capture the distribution. The KDEs are implemented with SCIPY in MARGARINE [6,7,13].
Since both types of density estimator build approximations to the target distribution using known distributions, the approximate log-probabilities of the target distribution can be easily calculated.
The evaluation of normalized log-probabilities for the marginal posterior and marginal prior allows us to calculate the nuisance-free likelihoods, as discussed, along with the marginal Kullback-Leibler divergence, which quantifies the amount of information gained when moving from the marginal prior to the marginal posterior.
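To make this step concrete, the sketch below assembles a nuisance-free log-likelihood by rearranging eq. (3), log L(θ) = log P(θ) − log π(θ) + log Z. It uses SciPy KDEs as stand-ins for MARGARINE's density estimators (this is not MARGARINE's actual interface), and the samples and evidence value log_Z are hypothetical placeholders rather than outputs of any real run:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Stand-in 1D samples; in practice these would be the marginal posterior and
# prior samples from a nested sampling run, and log_Z the run's evidence.
rng = np.random.default_rng(1)
post_samples = rng.normal(0.3, 0.2, size=5000)    # samples from P(theta)
prior_samples = rng.normal(0.0, 1.0, size=5000)   # samples from pi(theta)
log_Z = -12.3                                     # assumed evidence (placeholder)

post_kde = gaussian_kde(post_samples)             # density estimator for P(theta)
prior_kde = gaussian_kde(prior_samples)           # density estimator for pi(theta)

def log_nf_likelihood(theta):
    """Nuisance-free log-likelihood from eq. (3): log P - log pi + log Z."""
    return post_kde.logpdf(theta) - prior_kde.logpdf(theta) + log_Z

print(log_nf_likelihood(np.array([0.3])))
```

A callable of this form can then be handed to a nested sampler to explore the product L_A(θ)L_B(θ) over the signal parameters alone.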

Cosmological Example
It has previously been demonstrated that MARGARINE is capable of replicating complex probability distributions and approximating marginal Bayesian statistics such as the KL divergence and the BMD [9]. Here we demonstrate the theory discussed in section 2 by combining samples from the Dark Energy Survey (DES) Year 1 posterior [1] and the Planck posterior [3], using MARGARINE to estimate nuisance-free likelihoods. DES surveys supernovae, galaxies and large-scale cosmic structure in an effort to measure the dark matter and dark energy densities and to model the dark energy equation of state. In contrast, Planck mapped the anisotropies in the Cosmic Microwave Background (CMB) and correspondingly provided constraints on key cosmological parameters.
The constraints from DES and Planck have previously been combined using a full nested sampling run over all parameters, including a multitude of 'nuisance' parameters, in a computationally expensive exercise [14]. This corresponds to the flow chart in fig. 1, and the previous analysis gives us access to the combined DES and Planck evidence, which is found to have a value of log(Z) = −5965.7 ± 0.3. In fig. 3 we show the DES, Planck and joint posteriors for the six cosmological parameters derived in this work using MARGARINE and the flow chart in fig. 2. The constrained parameters are the baryon and dark matter density parameters, Ω_b h^2 and Ω_c h^2, the angular size of the sound horizon at recombination, θ_MC, the CMB optical depth, τ, the amplitude of the power spectrum, A_s, and the corresponding spectral index, n_s. These make up the set θ = (Ω_b h^2, Ω_c h^2, θ_MC, τ, A_s, n_s). We use the nested sampling algorithm POLYCHORD in our analysis [15,16].
We use a uniform prior that is defined to be three sigma around the Planck posterior mean. This is done to improve the efficiency of our nested sampling run. However, we subsequently have to re-weight the samples and correct the evidence for the difference between the priors used here and in the previous full nested sampling run [14] for comparison. If we define

    Z_A = ∫ L(θ) π_A(θ) dθ,    Z_B = ∫ L(θ) π_B(θ) dθ,    (12)

where A is our uniform prior space and B is our target prior space from the previous work [14], then

    Z_B = ∫ L(θ) π_A(θ) (π_B(θ)/π_A(θ)) dθ = Z_A ∫ (π_B(θ)/π_A(θ)) P_A(θ) dθ,    (13)

giving

    Z_B = Z_A ⟨π_B(θ)/π_A(θ)⟩_{P_A}.    (14)

Then following similar arguments we can transform our posteriors by re-weighting the distributions with the following:

    P_B(θ) = P_A(θ) (π_B(θ)/π_A(θ)) (Z_A/Z_B).    (15)

We see from the figure and corresponding table that with our joint analysis we are able to derive a log-evidence that is approximately consistent with that found in [14], validating the theory discussed and its implementation with MARGARINE. We note that the re-weighting described above relies on the calculation of the two prior log-probabilities, which we evaluate with MARGARINE and for which we currently do not have an error estimate. As a result, the error in the combined evidence, Z_B, is given by the error in Z_A from the nested samples and is likely underestimated. Using MARGARINE [9] we can also derive the combined KL divergence, also reported in fig. 3, which we find is consistent with the result in the literature of D = 6.17 ± 0.36 [11]. Similarly, we derive marginal KL divergences for the DES and Planck cosmological parameters using MARGARINE. A full discussion of the implications of combining the two data sets for our understanding of cosmology can be found in the literature [e.g. 11,14].
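A minimal sketch of this correction, assuming a one-dimensional parameter, a stand-in Gaussian target prior and a placeholder evidence value (none of these match the actual analysis):

```python
import numpy as np

# Prior re-weighting sketch: Z_B = Z_A * <pi_B/pi_A>_{P_A}, with the
# expectation taken over posterior samples obtained under prior A.
# All numbers below are hypothetical placeholders.
rng = np.random.default_rng(2)
lo, hi = 0.2, 0.8                                  # bounds of the uniform prior A
theta = rng.normal(0.5, 0.1, size=20000)           # stand-in posterior samples under A
theta = theta[(theta > lo) & (theta < hi)]         # restrict to the support of pi_A
log_Z_A = -5900.0                                  # evidence under prior A (placeholder)

log_pi_A = np.full(theta.size, -np.log(hi - lo))   # uniform prior log-density
sigma_B = 2.0                                      # assumed Gaussian target prior B
log_pi_B = -0.5 * (theta / sigma_B) ** 2 - 0.5 * np.log(2 * np.pi * sigma_B**2)

log_w = log_pi_B - log_pi_A                        # per-sample importance log-weights
# log-mean-exp of the weights gives the evidence correction:
log_Z_B = log_Z_A + np.logaddexp.reduce(log_w) - np.log(len(log_w))
weights = np.exp(log_w - log_w.max())              # re-weights for the posterior samples
print(log_Z_B - log_Z_A)                           # log-evidence correction
```

In the actual analysis, the two prior log-probabilities would come from MARGARINE's density estimators rather than the closed forms assumed here; the logaddexp reduction is simply a numerically stable mean of the weights.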
By reducing the number of parameters that need to be sampled, we significantly reduce the nested sampling runtime. For POLYCHORD the runtime scales as the cube of the number of dimensions [17]. This can be seen by assessing the time complexity of the algorithm, T ∝ n_live × T_{L(θ)} × T_{impl.} × D(P||π). Here the number of live points, n_live, scales with the number of dimensions, d, as does the Kullback-Leibler divergence, D(P||π). For POLYCHORD, the implementation-specific time complexity factor, T_{impl.}, representing the cost of replacing dead points with higher-likelihood live points, also scales linearly with d. Together this gives T ∝ d^3 × T_{L(θ)}. Therefore, by using nuisance-free likelihoods and sampling over 6 parameters rather than 41 (cosmological plus 20 nuisance parameters for DES and 15 different nuisance parameters for Planck), we reduce the runtime by a factor of (41/6)^3 ≈ 319, with further improvements coming from T_{L(θ)}. Using MARGARINE, T_{L(θ)} is typically reduced, since analytic likelihoods are computationally more expensive than emulated likelihoods.
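The quoted factor follows directly from the cubic scaling:

```python
# Back-of-the-envelope check of the d^3 runtime scaling discussed above:
# 6 cosmological parameters versus 6 + 20 (DES) + 15 (Planck) = 41 in total.
d_full, d_marg = 41, 6
speedup = (d_full / d_marg) ** 3
print(round(speedup))  # prints 319
```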

Conclusion
In this paper, we have demonstrated the consistency of combining constraints from different experiments in a marginal framework, using density estimators and the code MARGARINE, with a full nested sampling run over all parameters, including those describing 'nuisance' components of the data. We have shown this consistency mathematically and with a cosmological example. For the combination of Planck and DES, we find a Bayesian evidence and KL divergence that are consistent with previous results [11,14].
The analysis in this paper is only possible because (a) we are able to estimate densities in the (much smaller) cosmological parameter space θ using MARGARINE, and (b) we have evidences, Z, from our original nested sampling runs. It is this unique combination which allows us to compress away, or discard, nuisance parameters once they have been used. We note also that working in the marginal space results in a compression that is lossless in information on θ, as it recovers an identical marginal posterior and total evidence as is found during a full Bayesian inference. Finally, through the nuisance-free likelihood we can significantly reduce the dimensionality of our problems, and since it is faster to emulate a likelihood than to evaluate it analytically, MARGARINE offers a much more computationally efficient path to combined Bayesian analysis. In principle, our work paves the way for the development of a publicly available library of cosmological density estimators modelled with MARGARINE that can be combined with new data sets using the proposed method in a more efficient manner than currently implemented techniques. However, the work has implications outside of cosmology in any field where multiple experiments probe different aspects of the same physics.

Figure 3. The combined posterior (in grey) found when combining the constraints on the cosmological parameters from DES and Planck using MARGARINE. For DES and Planck, we calculate the marginal KL divergences using MARGARINE, whereas the Bayesian evidences are calculated using ANESTHETIC. The joint evidence and joint KL divergence are calculated with a combination of the two codes and are found to be approximately consistent with those found in the literature [11,14]. Note that the error on the joint evidence is likely underestimated as it relies on evaluations of log-probabilities for various distributions, for which MARGARINE does not currently provide errors. The figure was produced with ANESTHETIC [18].