Cosmological parameter inference with Bayesian statistics

Bayesian statistics and Markov Chain Monte Carlo (MCMC) algorithms have found their place in the field of Cosmology. They have become important mathematical and numerical tools, especially in parameter estimation and model comparison. In this paper, we review some fundamental concepts to understand Bayesian statistics and then introduce MCMC algorithms and samplers that allow us to perform the parameter inference procedure. We also introduce a general description of the standard cosmological model, known as the $\Lambda$CDM model, along with several alternatives, and current datasets coming from astrophysical and cosmological observations. Finally, with the tools acquired, we use an MCMC algorithm implemented in Python to test several cosmological models and find the combination of parameters that best describes the Universe.


INTRODUCTION
The beginning of the standard cosmology as it is known today emerged after 1920, when the Shapley-Curtis debate was carried out [1]. This debate was held between the astronomers Harlow Shapley and Heber Curtis, resulting in a revolution for astronomy at that time by reaching an important conclusion: "The Universe has a larger scale than the Milky Way". Several observations at that epoch established that the size and dynamics of the cosmos could be explained by Einstein's General Theory of Relativity. In its childhood, cosmology was a speculative science based only on a few datasets, and it was characterized by a dispute between two cosmological models: the steady state model and the Big Bang (BB) theory. It was not until the 1990s, when the amount of data increased enough to discriminate among and rule out compelling theories, that the BB model emerged as the most widely accepted. During the same decade, David Schramm heralded the "Golden Age of Cosmology" at a National Academy of Sciences colloquium [2].
Once the new age of cosmological observations arrived with a large variety of data, it became necessary to confront the cosmological models with such data, usually done through statistics. It is important to stress that, since we have a unique Universe, we cannot rely on a frequentist interpretation of statistics (we are not able to create multiple Universes and make a frequentist inference of our models). An alternative approach that helps in this task is Bayesian statistics. In Bayesian statistics, probability is interpreted as a "degree of belief", and it is useful precisely when repetitive processes are complicated to reproduce.
The main aim of this work is to provide an introduction to Bayesian parameter inference and its applications to cosmology. We assume the reader is familiar with the basic concepts of statistics, but not necessarily with Bayesian statistics. We therefore provide a general introduction to this subject, enough to work out some examples. This review is written in a generic way, so that the parameter inference theory may be applicable to any subject; in particular, we apply the Bayesian concepts to the field of cosmology.
The paper is organized as follows. In Section 2, we point out the main differences between the Bayesian and Frequentist approaches. Then, in Section 3, we introduce the basic mathematical concepts in Bayesian statistics to perform the parameter estimation procedure for a given model. Once we have the mathematical background, we continue, in Section 4, with some numerical resources that are able to simplify our task, especially for models with several parameters that need to be tested with many datasets. With these methods and tools in place, we provide the example of fitting a straight line in Section 5. In Section 6, we present an introduction to cosmology, and then, in Section 7, focus on some codes to compute the cosmological observables. In Section 8, we constrain the parameter space that describes the standard cosmological model, namely the ΛCDM model, along with several extensions. Finally, in Section 9, we present our conclusions.

BAYESIAN VS FREQUENTIST STATISTICS
Fundamentally, the main difference between Bayesian and Frequentist statistics lies in the definition of probability. From a Frequentist point of view, probability only has meaning in the limiting case of repeated measurements: if n denotes the number of successes and N the total number of trials, Frequentist statistics defines the probability P as the limit

$$P = \lim_{N \to \infty} \frac{n}{N}.$$

Then, for Frequentist statistics, probabilities are fundamentally related to frequencies of events. On the other hand, in Bayesian statistics, the concept of probability is extended to cover degrees of certainty about a statement. For Bayesian statistics, probabilities are fundamentally related to our knowledge concerning an event.
These rules are summed up as follows: the first condition (2) is necessary because the probability of an event is always positive; the second rule (3) is a normalization relation, which tells us that we are certain to obtain one of the possible outcomes; the third point (4) states that the probability of obtaining an observation from a set of mutually exclusive events is given by the sum of the individual probabilities of each event; finally, and in general, if one event occurs given the occurrence of another, then the probability that both x_1 and x_2 happen is equal to the probability of x_1 times the probability of x_2 given that x_1 has already happened:

$$P(x_1 \cap x_2) = P(x_1)\,P(x_2|x_1). \qquad (5)$$

If two events x_1 and x_2 are mutually exclusive, then $P(x_1 \cup x_2) = P(x_1) + P(x_2)$. The rules of probability must be fulfilled by both Frequentist and Bayesian statistics. However, there are some consequences of the fact that these two scenarios have a different definition of probability, as we shall see below.

Frequentist statistics
Any Frequentist inferential procedure relies on three basic ingredients: the data, the model, and an estimation procedure. The main assumption in Frequentist statistics is that the data have a definite, albeit unknown, underlying distribution to which all inference pertains.
The data are a measurement or observation, denoted by X, that can take any value from a corresponding sample space. A sample space of an observation X can be defined as a measurable space (x, B) that contains all values that X can take upon measurement. In Frequentist statistics, it is considered that there is a probability function P_0 : B → [0, 1] on the sample space (x, B) representing the "true distribution of the data", X ∼ P_0.
For Frequentist statistics, the model Q is a collection of probability measures P_θ : B → [0, 1] on the sample space (x, B). The distributions P_θ are called model distributions, with θ being the model parameters; in this approach, θ is unchanged. A model Q is said to be well-specified if it contains the true distribution of the data P_0, i.e., P_0 ∈ Q.
Finally, we need a point-estimator (or simply estimator) for P_0. An estimator for P_0 is a map P̂ : x → Q, representing our best guess P̂ ∈ Q for P_0 based on the data X. Therefore, Frequentist statistics is based on trying to answer questions such as: "what are the data trying to tell us about P_0?" or "considering the data, what can we say about the mean value of P_0?"

Bayesian statistics
In Bayesian statistics, data and model are two elements of the same space [3], i.e., no formal distinction is made between measured quantities X and parameters θ. One may envisage the process of generating a measurement's outcome Y = y as two draws: one draw from Θ (where Θ is a model with probabilities associated to the parameter θ) to select a value of θ, and a subsequent draw from P_θ to arrive at X = x. This perspective may seem rather absurd when thinking in a Frequentist way but, in Bayesian statistics, where probabilities are related to our own knowledge, it is natural to associate probability distributions with our parameters. In this way, an element P_θ of the model is interpreted simply as the distribution of X given the parameter value θ, i.e., as the conditional distribution X|θ.

Comparing both descriptions
Table I provides a short summary of the most important differences between the two statistics.

  Frequentist                                    Bayesian
  Data are a repeatable random sample;           Data are observed from the
  there is a frequency.                          realized sample.
  Underlying parameters remain constant          Parameters are unknown and
  during this repeatable process.                described probabilistically.
  Parameters are fixed.                          Data are fixed.

To understand these differences, let us review a basic example. Here, we present an experiment and, since we are interested in comparing both descriptions, we show only the basic results from both points of view: Frequentist and Bayesian.
Example.-Let us assume we have a coin that has a probability p to land as heads and a probability 1 − p to land as tails. Our goal is to know whether this coin is fair (p = 0.5) or not. In the process of trying to estimate p, we flip the coin 14 times, obtaining heads in 10 of the trials. Now, we are interested in the outcome of the next two tosses. To be precise: "What is the probability that in the next two tosses we will get two heads in a row?" • Frequentist approach. As mentioned previously, in Frequentist statistics probability is related to the frequency of events; then our best estimate for p is P(head) = p = (# of heads)/(# of events) = 10/14. So, the probability of having 2 heads in a row is P(2 heads) = P(head)·P(head) ≈ 0.51.
• Bayesian approach. In Bayesian statistics, p is not a fixed value; it is a random variable with its own distribution, which must be defined by the existing evidence (the 14 trials and 10 successes). Then, by considering that we do not know anything about p a priori and averaging over all possible values of p, we have that the probability of having two heads is

$$P(2\,\text{heads}) = \int_0^1 p^2\, P(p|D)\, dp \approx 0.485.$$

This Bayesian example will be expanded in detail in the following section but, for now, we just want to stress that the two approaches arrive at different results.
In the Frequentist approach, we adopt probability as a frequency of events (the probability of obtaining a head was fixed at p = 10/14); hence, the final result was obtained by simply multiplying each of these probabilities (since we assume the events are independent of each other). On the other hand, in the Bayesian framework, it was necessary to average over all possible values of p in order to obtain a numerical value. However, in both cases, the probability differs from the real one (P(2 heads) = 0.25) because we do not have enough data for our estimations.
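The two estimates above can be reproduced in a few lines of Python. This is a sketch, not code from the paper; the Bayesian shortcut uses the Beta-moment formula, which assumes a flat prior on p (the case expanded in the next section).

```python
n, x = 14, 10  # tosses and observed heads

# Frequentist: point estimate p_hat = x / n, then multiply (independent tosses).
p_hat = x / n
freq_two_heads = p_hat**2

# Bayesian: with a flat prior, the posterior for p is Beta(x+1, n-x+1), and
# averaging p^2 over it gives E[p^2] = (x+1)(x+2) / ((n+2)(n+3)).
bayes_two_heads = (x + 1) * (x + 2) / ((n + 2) * (n + 3))

print(f"Frequentist: {freq_two_heads:.3f}")  # 0.510
print(f"Bayesian:    {bayes_two_heads:.3f}") # 0.485
```

Both numbers fall short of the true 0.25 per head squared simply because 14 tosses are not many.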
In this example, we have seen the forward application of statistics: using a mathematical model to relate measured quantities to an unknown quantity of interest, in this case the probability of getting two heads in a row. This is also applicable to the inverse problem: having the data, we would like to obtain information about the parameters of a given model [4,14]. In Section 8.1, we will illustrate this point by using different cosmological observations to find the best-fit values of the parameters that describe a given model.

A FIRST LOOK AT BAYESIAN STATISTICS
Before we start with the applications of Bayesian statistics in cosmology, it is advisable to understand the important mathematical tools within the Bayesian procedure. In this section, we present a basic revision, but encourage the reader to look for the formal treatment in the literature cited in each section.

Bayes' theorem, priors, posteriors and all that stuff
When anyone becomes interested in the Bayesian framework, there are several concepts to understand before presenting results. In this section, we review these concepts, and then we return to the coin-toss example given in the last section.
Bayes' theorem. Bayes' theorem is a direct consequence of the axioms of probability shown in Equations (2)-(5). From Equation (5), without loss of generality, it must be fulfilled that P(x_1 ∩ x_2) = P(x_2 ∩ x_1). In such a case, the following relation applies:

$$P(x_2|x_1) = \frac{P(x_1|x_2)\,P(x_2)}{P(x_1)}.$$

As already mentioned, in the Bayesian framework, data and model are part of the same space. Given a model (or hypothesis) H, identifying x_1 → D as a set of data, and x_2 → θ as the parameter vector of said hypothesis, we can rewrite the above equation as

$$P(\theta|D, H) = \frac{P(D|\theta, H)\, P(\theta|H)}{P(D|H)}.$$

This last relation is the so-called Bayes' theorem, the most important tool in a Bayesian inference procedure. In this result, P(θ|D, H) is called the posterior probability of the model; L(D|θ, H) ≡ P(D|θ, H) is called the likelihood, and it will be our main focus in future sections; π(θ) ≡ P(θ|H) is called the prior and expresses the knowledge about the model before acquiring the data (this prior can be fixed depending on either previous experimental results or the theory behind the model); Z ≡ P(D|H) is the evidence of the model, usually referred to as the Bayesian Evidence.
The prior refers to the information one has a priori about the model. It can be defined in various ways; however, a common one is the uniform prior (also referred to as a flat prior),

$$\pi(\theta) = c,$$

with c being a constant. This type of prior tells us that every parameter value is equally probable a priori, as seen in Figure 1. Using this prior also means that the posterior probability will be proportional to the likelihood (since the Bayesian Evidence is a constant). Another convenient prior distribution is the beta distribution B(θ; a, b), since it contains several statistical distributions obtained by varying its parameters a and b (in particular, the flat prior is recovered when a = b = 1). It is defined as

$$B(\theta; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, \theta^{a-1} (1-\theta)^{b-1}, \qquad (11)$$

where Γ is the gamma function. These are just two examples of useful priors, and it is evident that the choice of a prior will influence the posterior distribution, although its effect is reduced as more data are collected, as we shall see later in this section. Now, regarding the Bayesian Evidence, we notice that it acts as a normalizing factor, and is nothing more than the average of the likelihood over the prior:

$$Z = \int L(D|\theta)\, \pi(\theta)\, d^N\theta, \qquad (12)$$

where N is the dimensionality of the parameter space. This quantity is usually ignored, for practical reasons, when testing the parameter space of a unique model. Nevertheless, the Bayesian evidence plays an important role in selecting the model that best describes the data, this process being known as model selection. For convenience, the ratio of two evidences, or equivalently the difference in log evidence ln Z_0 − ln Z_1, is often termed the Bayes factor B_{0,1}:

$$B_{0,1} = \frac{Z_0}{Z_1} = \frac{\int L(D|\theta_0)\, \pi(\theta_0)\, d^{N_0}\theta_0}{\int L(D|\theta_1)\, \pi(\theta_1)\, d^{N_1}\theta_1}, \qquad (14)$$

where θ_i is a parameter vector (with dimensionality N_i) for the hypothesis H_i, i = 0, 1. In Equation (14), the quantity ln B_{0,1} provides an idea of how well model 0 may fit the data when compared to model 1. Jeffreys provided a suitable guideline scale on which we are able to make qualitative conclusions (see Table II [15]).

We can see that Bayes' theorem has enormous implications from a statistical inferential point of view. In a typical scenario, we would collect some data and then interpret them with a given model; however, we usually do the opposite: first we have a set of data, and then we confront a model by considering the probability that our model fits the data. Bayes' theorem provides a tool to relate both scenarios. Then, based on Bayes' theorem, we are able to select the model that best fits the data.
Example.-We go back to the example shown in the last section: the coin toss. We are interested in the probability of obtaining two heads in a row given the data, P(2 heads|D) (D being the previous 14 coin tosses acting as data). First, let us assume that we have a model with a parameter p that defines the probability of obtaining the two heads, that is, P(2 heads|p). This parameter p will have a probability distribution P(p|D) depending on the data in place. Therefore, the probability can be obtained by averaging over all the possible parameter values with their corresponding density distribution:

$$P(2\,\text{heads}|D) = \int_0^1 P(2\,\text{heads}|p)\, P(p|D)\, dp. \qquad (15)$$

For simplicity, we do not update p between the two tosses, but we assume that both of them are independent of each other. With this last assumption, we have

$$P(2\,\text{heads}|p) = P(\text{head}|p)^2,$$

where P(head|p) is the probability of obtaining a head given our model. We assume a simple description of P(head|p) as

$$P(\text{head}|p) = p. \qquad (17)$$

On the other hand, notice that we do not know a priori the quantity P(p|D), but rather P(D|p) (i.e., we know the probability of obtaining a dataset by considering a model as correct). A good choice for experiments that have two possible outcomes is the binomial distribution,

$$P(D|p) = \binom{n}{x}\, p^x (1-p)^{n-x}, \qquad (18)$$

with n the number of trials (in this case n = 14) and x the number of successes (here x = 10). Hence, we have an expression for P(D|p). Using the Bayes formula, we have

$$P(p|D) = \frac{P(D|p)\, P(p)}{P(D)}.$$

For the prior, we will use the beta distribution (11), so

$$P(p) = B(p; a, b). \qquad (20)$$

In order to get the explicit form of P(p|D), we still need to compute P(D). That is, plugging Equations (18) and (20) into the integral of Equation (12) yields the normalization, and the posterior becomes another beta distribution,

$$P(p|D) = B(p;\, x + a,\; n - x + b). \qquad (22)$$

If we know nothing about p, then we can assume the prior is a uniform distribution; this means a = b = 1. Notice from Figure 1 that our posterior result (red curve), described by Equation (22), does not exactly agree with the real value of p (black dashed vertical line). We would expect the posterior distribution to be centered at p = 0.5 with a very narrow distribution. Nevertheless, this value is recovered by increasing the experimental data.
Finally, solving the integral in Equation (15) using (17) and (22), we arrive at the result obtained in the previous section,

$$P(2\,\text{heads}|D) = \frac{(x+a)(x+a+1)}{(n+a+b)(n+a+b+1)} \approx 0.485,$$

for a = b = 1. It is important to clarify that the inferred value for the parameter p is merely the probability of said value given our data D. This value can change, and will generally be a better estimate as more data are collected (as will be seen in detail in the next section).
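The average over the beta posterior can be cross-checked numerically. The sketch below (not from the paper) integrates p² against the posterior shape on a grid; the normalization cancels in the ratio, so the closed form never enters.

```python
import numpy as np

def posterior_two_heads(n, x, a=1.0, b=1.0, grid=200_001):
    """P(2 heads | D): average p^2 over the Beta(x+a, n-x+b) posterior,
    evaluated by direct numerical integration on a grid in [0, 1]."""
    p = np.linspace(0.0, 1.0, grid)
    post = p**(x + a - 1) * (1 - p)**(n - x + b - 1)  # unnormalized posterior
    return float(np.sum(p**2 * post) / np.sum(post))  # normalization cancels

print(posterior_two_heads(14, 10))   # flat prior (a = b = 1): ~0.485
```

Changing a and b shows directly how the prior choice shifts the answer while the data term stays fixed.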

Updating the probability distribution
As seen in the coin example, we were not able to recover the real value of p because of the lack of enough data. If we want to get closer, we have to keep flipping the coin until the amount of data becomes sufficient. Let us continue with the example: suppose that after throwing the coin 100 times we obtain, say, 56 heads, while after throwing it 500 times, we obtain 246 heads. Then, we expect to obtain a thinner distribution with its center close to p = 0.5 (see Figure 2): as we keep increasing the experimental results, the posterior distribution becomes more localized near the real value p = 0.5. Given this, it is clear that in order to constrain a parameter and be more accurate about its most probable (or "real") value, it is necessary to increase the amount of data (and the precision) in any experiment. That is, if we take into account the 500 tosses, with 246 heads, the previous result is updated to P(2 heads|D) = 0.249, much closer to the real value.
Then, we have some model parameters that have to be confronted with different sets of data. This can be done in two alternative ways: (a) by considering the set of all datasets at once; or (b) by taking each dataset as the new data, with our prior updated by the previous information. The important point in Bayesian statistics is that these two possibilities are indeed equivalent. In the coin-toss example, this means that it is identical to start with the prior given in Figure 2a and, by considering the 500 data points, arrive at the posterior in Figure 2d, or to start with the posterior shown in Figure 2c as our prior, consider only the last 400 data points, and obtain the same posterior, displayed in Figure 2d.
In fact, if we rewrite Bayes' theorem so that all probabilities are explicitly dependent on some prior information I,

$$P(\theta|D, I) = \frac{P(D|\theta, I)\, P(\theta|I)}{P(D|I)}, \qquad (24)$$

we can explicitly see the equivalence of the two different options.
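For the conjugate Beta-binomial case, the equivalence of the two updating routes can be checked in code. A minimal sketch, with the batch split taken from the counts quoted above (56 heads in the first 100 tosses, hence 190 in the remaining 400):

```python
def update(a, b, heads, tails):
    """Conjugate Beta-binomial update: Beta(a, b) prior -> Beta(a+heads, b+tails)."""
    return a + heads, b + tails

# Option (a): flat prior Beta(1, 1) and all 500 tosses at once (246 heads).
all_at_once = update(1, 1, 246, 254)

# Option (b): first the 100 tosses (56 heads), then the remaining 400 (190 heads),
# using the intermediate posterior as the new prior.
step1 = update(1, 1, 56, 44)
sequential = update(*step1, 190, 210)

print(all_at_once, sequential)   # both give Beta(247, 255)
```

The two routes land on the same posterior, which is exactly the statement made by Equation (24).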

About the Likelihood
We mentioned that the Bayesian evidence is usually set aside when doing any inference procedure in the parameter space of a single model. Then, without loss of generality, we can fix it to P(D|H) = 1. If we also ignore the prior (a motivation for doing this will be expanded upon in the next section), we can identify the posterior with the likelihood, P(θ|D, H) ∝ L(D|θ, H); thus, by maximizing it, we can find the most probable set of parameters for a model given the data. However, having ignored P(D|H) and the prior, we are not able to provide an absolute probability for a given model, but only relative probabilities. On the other hand, it is possible to report results independently of the prior by using the likelihood ratio: the likelihood at a particular point in the parameter space is compared with the best-fit value, i.e., the maximum likelihood L_max. Then, we can say that some parameters are acceptable if the likelihood ratio is bigger than a given value.
Let us assume we have a single-peaked distribution, and consider the mean of the distribution,

$$\bar{\theta} = \int d\theta\, \theta\, P(\theta|D, H).$$

If our model is well-specified and the expectation value of θ corresponds to the real or most probable value θ_0, then we say that θ̄ is unbiased. Considering a Taylor expansion of the log-likelihood around its maximum,

$$\ln L(D|\theta) = \ln L(D|\theta_0) - \frac{1}{2}\sum_{ij}(\theta_i - \theta_{0,i})\, H_{ij}\, (\theta_j - \theta_{0,j}) + \dots,$$

where θ_0 corresponds to the parameter vector of the real model, we have that the likelihood can be expressed as a multi-variate Gaussian,

$$L(D|\theta) = L_{\max}\, \exp\!\left[-\frac{1}{2}(\theta - \theta_0)^{T} H\, (\theta - \theta_0)\right], \qquad (30)$$

where

$$H_{ij} = -\left.\frac{\partial^2 \ln L}{\partial \theta_i\, \partial \theta_j}\right|_{\theta_0}$$

is called the Hessian matrix. It controls whether the estimates of θ_i and θ_j are correlated; if it is diagonal, these estimates are uncorrelated. The above expression for the likelihood is a good approximation as long as our posterior distribution possesses a single peak. It is worth mentioning that, if the data errors are normally distributed, then the likelihood for the data will be a Gaussian function as well; in fact, this is always true if the model is linearly dependent on the parameters. On the other hand, if the data are not normally distributed, we can resort to the central limit theorem, which tells us that the resulting distribution will be best approximated by a multi-variate Gaussian distribution [6].

Letting aside the priors
In this section, we present an argument for setting aside the prior in the parameter estimation. For this, we follow the example given in Reference [5]. In this example, there are two people, A and B, interested in the measurement of a given physical quantity θ. A and B have different prior beliefs regarding the possible value of θ. This discrepancy could be given by experience, such as the possibility that A and B have made a similar measurement at different times. Let us denote their priors by P(θ|I_i), (i = A, B), and assume they are described by two Gaussian distributions with mean μ_i and variance Σ_i². Now, A and B measure θ together using an apparatus subject to a Gaussian noise with known variance σ². They obtain the value θ_0 = m_1. Therefore, they can write their likelihood for θ as

$$L(m_1|\theta) \propto \exp\!\left[-\frac{1}{2}\frac{(\theta - m_1)^2}{\sigma^2}\right].$$

By using the Bayes formula, the posterior of model A (and B) becomes

$$P(\theta|m_1, I_i) \propto L(m_1|\theta)\, P(\theta|I_i),$$

where we have skipped writing explicitly the hypothesis H and used the notation given in Equation (24). Then, the posteriors of A and B are (again) Gaussian, with mean and variance

$$\bar{\theta}_i = \frac{\sigma^2 \mu_i + \Sigma_i^2\, m_1}{\sigma^2 + \Sigma_i^2}, \qquad \tau_i^2 = \frac{\sigma^2\, \Sigma_i^2}{\sigma^2 + \Sigma_i^2}.$$

Thus, if the likelihood is more informative than the prior, i.e., (σ/Σ_i) ≪ 1, the posterior mean of A (and B) will converge towards the measured value m_1. As more data are obtained, one can simply replace the value of m_1 in the above equation by the sample mean m̄, and σ² by σ²/N. Then, we can see that the initial priors μ_i of A and B will progressively be overridden by the data. This process is illustrated in Figure 3, where the green (red) curve corresponds to the probability distribution of θ for person A (B) and the blue curve corresponds to their likelihood.
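The conjugate Gaussian update above is short enough to script. A sketch with hypothetical numbers (the priors and the measured value are invented for illustration, not taken from Reference [5]):

```python
def gaussian_posterior(mu_prior, Sigma, m1, sigma):
    """Conjugate update of a Gaussian prior N(mu_prior, Sigma^2) with one
    Gaussian measurement m1 of known noise sigma: inverse variances add."""
    w_prior, w_data = 1.0 / Sigma**2, 1.0 / sigma**2
    var = 1.0 / (w_prior + w_data)
    mean = var * (w_prior * mu_prior + w_data * m1)
    return mean, var

# Hypothetical priors for A and B; both share the same measurement m1 = 5.0.
mean_A, var_A = gaussian_posterior(2.0, 1.0, m1=5.0, sigma=0.5)
mean_B, var_B = gaussian_posterior(8.0, 2.0, m1=5.0, sigma=0.5)
print(mean_A, mean_B)   # both means are pulled toward m1 = 5
```

With N measurements, replacing sigma by sigma/sqrt(N) pulls both posteriors arbitrarily close to the data, which is the sense in which the priors get overridden.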

Chi-square and goodness of fit
We mentioned that the main aim of parameter estimation is to maximize the likelihood in order to obtain the most probable set of model parameters given the data. If we consider the Gaussian approximation given in Equation (30), we can see the likelihood will be maximal when the quantity

$$\chi^2 = \sum_{ij}\, (x_i - \mu_i)\, C^{-1}_{ij}\, (x_j - \mu_j)$$

is minimal, where x_i are the data points, μ_i = μ_i(θ) the model predictions, and C the covariance matrix of the data. The quantity χ² is usually called the chi-square and is related to the Gaussian likelihood via L = L_0 e^{−χ²/2}. Then, we can say that maximizing the Gaussian likelihood is equivalent to minimizing the chi-square. However, as we mentioned before, there are some circumstances where the likelihood cannot be described by a Gaussian distribution; in these cases, the chi-square and the likelihood are no longer equivalent.
The probability distribution for different values of χ² around its minimum is given by the χ² distribution for ν = n − M degrees of freedom, where n is the number of independent data points and M the number of parameters. Hence, we can calculate the probability that an observed χ² exceeds, by chance, a value χ̂² for the correct model. This probability is given by Q(ν, χ̂²) = 1 − Γ(ν/2, χ̂²/2) [16], where Γ is the regularized incomplete gamma function. Then, the probability that the observed χ² (even for the correct model) is less than a given value χ̂² is 1 − Q. This statement is strictly true if the errors are Gaussian and the model is a linear function of the parameters, i.e., for Gaussian likelihoods.
If we evaluate the quantity Q for the best-fit values (minimum chi-square), we can have a measure of the goodness of fit. If Q is small (a small probability), we can interpret it as:
• The model is wrong and can be rejected.
• The errors are underestimated.
• The error measurements are not normally distributed.
On the other hand, if Q is too large, there are some reasons to believe that:
• Errors have been overestimated.
• Data are correlated or non-independent.
• The distribution is non-Gaussian.
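In practice, Q is one call to the χ² survival function. A sketch using SciPy (the fit numbers below are hypothetical, chosen only to land in the "healthy" middle range):

```python
from scipy.stats import chi2

def goodness_of_fit(chi2_min, n_data, n_params):
    """Q: probability that chi^2 exceeds the observed minimum by chance,
    for nu = n_data - n_params degrees of freedom (Gaussian errors assumed)."""
    return chi2.sf(chi2_min, df=n_data - n_params)   # sf = 1 - CDF

# Hypothetical fit: chi2_min = 18.3 from 20 data points and 2 parameters.
Q = goodness_of_fit(18.3, 20, 2)
print(Q)   # neither suspiciously small nor suspiciously large
```

A Q near 0 or near 1 would trigger the checklists above; values in between indicate a fit consistent with the stated errors.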

Contour plots and confidence regions
Once the best-fit parameters are obtained, we would like to know the confidence regions where values could be considered good candidates for our model. The most logical choice is to take values inside a compact region around the best-fit value. Then, a natural choice is regions with constant χ² boundaries. When the χ² possesses more than one minimum, it is said that we have non-connected confidence regions; for multi-variate Gaussian distributions (as in the likelihood approximation of Equation (30)), these are ellipsoidal regions. In this section, we exemplify how to calculate the confidence regions, following Reference [6].
We consider a small perturbation from the best-fit chi-square, Δχ² = χ² − χ²_best. Then, we use the properties of the χ² distribution to define confidence regions via variations of χ² around its minimum. In Table III, we show the typical 68.3%, 95.4%, and 99.73% joint confidence levels as a function of the number of parameters M. For Gaussian distributions, these correspond to the conventional 1, 2, and 3σ confidence levels.
As an example, we plot in Figure 4 the corresponding confidence regions associated with the coin example. The general recipe to compute constant-χ² confidence regions is as follows: after finding the best fit by minimizing χ² (or maximizing the likelihood) and checking that Q is acceptable for the best-fit parameters, then: 1. Let M be the number of parameters, n the number of data points, and p the desired confidence limit.
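The Δχ² thresholds of the kind tabulated in Table III can be generated directly from the χ² quantile function. A sketch using SciPy:

```python
from scipy.stats import chi2

def delta_chi2(p, M):
    """Delta chi^2 boundary enclosing joint confidence level p for M parameters."""
    return chi2.ppf(p, df=M)

for M in (1, 2, 3):
    print(M, delta_chi2(0.683, M), delta_chi2(0.954, M), delta_chi2(0.9973, M))
```

This reproduces the familiar values, e.g. Δχ² ≈ 1.00 for 1σ with one parameter and Δχ² ≈ 2.30 for 1σ with two jointly estimated parameters; the region χ² ≤ χ²_best + Δχ² is then the p confidence region.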

Marginalization
It is clear that a model may (in general) depend on more than one parameter. However, some of these parameters θ_i may be of less interest. For example, they may correspond to nuisance parameters, like calibration factors, or it may be the case that we are interested in only one of the parameter constraints rather than the joint constraint on two or more of them simultaneously. Then, we marginalize over the uninteresting parameters by

$$P(\theta_1, \dots, \theta_j | D, H) = \int d\theta_{j+1} \cdots d\theta_m\; P(\theta | D, H), \qquad (38)$$

where m is the total number of parameters in our model, and θ_1, ..., θ_j denote the parameters we are interested in.
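On a grid, Equation (38) is just a sum along the nuisance axes. A toy sketch (the two-parameter posterior below is a hypothetical correlated Gaussian, not a cosmological one):

```python
import numpy as np

# Toy correlated 2D posterior evaluated on a grid.
t1 = np.linspace(-4.0, 4.0, 161)
t2 = np.linspace(-4.0, 4.0, 161)
T1, T2 = np.meshgrid(t1, t2, indexing="ij")
post = np.exp(-0.5 * (T1**2 + (T2 - 0.5 * T1) ** 2))

# Marginalize over theta_2: integrate it out along axis 1.
dt1, dt2 = t1[1] - t1[0], t2[1] - t2[0]
marg_t1 = post.sum(axis=1) * dt2
marg_t1 /= marg_t1.sum() * dt1            # normalize to unit area

print(t1[np.argmax(marg_t1)])             # the marginal peaks at theta_1 = 0
```

The same axis-summing operation generalizes to any number of nuisance dimensions, which is how MCMC chains are commonly post-processed into 1D and 2D marginal posteriors.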

Fisher Matrix
Once we have a dataset, it is important to know the accuracy with which we can estimate parameters. Fisher suggested a way over 70 years ago [17]. Let us start by considering again a Gaussian likelihood. As we noticed, the Hessian matrix H_ij carries information on the parameter errors and their covariance. More specifically, when all parameters are fixed except one (e.g., the i-th parameter), its error is 1/√H_ii. These errors are called conditional errors, although they are rarely used.
A quantity to forecast the precision of a model, which arises naturally with Gaussian likelihoods, is the so-called Fisher information matrix,

$$F_{ij} = \left\langle -\frac{\partial^2 \ln L}{\partial \theta_i\, \partial \theta_j} \right\rangle,$$

where the average is taken over the observational data. It is clear that F = ⟨H⟩.
As we can see from Equation (4), for independent datasets the complete likelihood is the product of the individual likelihoods, and the Fisher matrix is the sum of the individual Fisher matrices. A pedagogical, easy case is a single parameter θ_i with a Gaussian likelihood. In this scenario, setting 2Δ ln L = 1 and identifying the Δχ² corresponding to the 68% confidence level, we notice that 1/√F_ii yields the 1σ displacement for θ_i. In the general case,

$$\sigma^2_{\theta_i} \geq (F^{-1})_{ii};$$

thus, when all parameters are estimated simultaneously from the data, the marginalized error is

$$\sigma_{\theta_i} \geq \sqrt{(F^{-1})_{ii}}. \qquad (42)$$

The beauty of the Fisher matrix approach is that there is a simple prescription for setting it up knowing only the model and the measurement uncertainties, and, under the assumption of a Gaussian likelihood, the Fisher matrix is the inverse of the covariance matrix. So, all we have to do is set up the Fisher matrix and then invert it to obtain the covariance matrix (that is, the uncertainties on the model parameters). In addition, its fast calculation also enables one to explore different experimental setups and optimize the experiment.
The main point of the Fisher matrix formalism is to predict how well the experiment will be able to constrain the parameters of a given model before doing the experiment, and perhaps even without simulating it in any detail. We can then forecast the results of different experiments and look at trade-offs, such as precision versus cost; in other words, we can engage in experimental design. The inequality in Equation (42) is called the Cramér-Rao inequality. One can see that the Fisher information matrix represents a lower bound on the errors: only when the likelihood is normally distributed does the inequality become an equality. However, as we saw in Section 3.3, a Gaussian likelihood is applicable only in some circumstances, so the key is to have a good enough understanding of our theoretical model that we can construct a Gaussian likelihood.

Constructing Fisher Matrices: A simple description
Let us construct Fisher matrices in a simple way. Suppose we have a model that depends on N parameters θ_1, θ_2, ..., θ_N. We consider M observables f_1, f_2, ..., f_M, each one related to the model parameters by some equation f_k = f_k(θ_1, θ_2, ..., θ_N). Then, the elements of the Fisher matrix can be computed as

$$F_{ij} = \sum_{k=1}^{M} \frac{1}{\sigma_k^2}\, \frac{\partial f_k}{\partial \theta_i}\, \frac{\partial f_k}{\partial \theta_j},$$

where σ_k are the errors associated with each observable, which we have considered to be Gaussian distributed.
Here, instead of taking the real data values (which could be unknown), it is possible to recreate the data with a fiducial model. The errors associated with the mock data can be taken as the expected experimental errors, and then the above expression can be evaluated.
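This prescription is a few lines of NumPy. A sketch for a hypothetical straight-line model with assumed fiducial values and errors (the derivatives are taken numerically, by central differences):

```python
import numpy as np

def fisher_matrix(model, theta0, sigmas, eps=1e-6):
    """F_ij = sum_k (1/sigma_k^2) (df_k/dtheta_i)(df_k/dtheta_j),
    with the derivatives evaluated at the fiducial point theta0
    by central finite differences."""
    theta0 = np.asarray(theta0, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    N = len(theta0)
    J = np.empty((len(sigmas), N))        # Jacobian of the observables
    for i in range(N):
        dp = np.zeros(N)
        dp[i] = eps
        J[:, i] = (model(theta0 + dp) - model(theta0 - dp)) / (2.0 * eps)
    return (J.T / sigmas**2) @ J

# Hypothetical example: straight line f_k = a + b * x_k with fiducial (a, b).
x = np.array([0.0, 1.0, 2.0, 3.0])
sigma = np.full_like(x, 0.1)              # assumed experimental errors
model = lambda th: th[0] + th[1] * x
F = fisher_matrix(model, [1.0, 2.0], sigma)
cov = np.linalg.inv(F)                    # forecast covariance matrix
errors = np.sqrt(np.diag(cov))            # marginalized 1-sigma forecasts
```

Inverting F and reading off the diagonal gives exactly the marginalized-error forecast of Equation (42), without ever generating or fitting mock data.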
To complement the subject, there is also the Figure of Merit used by the Dark Energy Task Force (DETF) [18], defined as the reciprocal of the area enclosing the 95% confidence limit in the plane of two parameters. The larger the figure of merit, the greater the accuracy with which said parameters are measured. The figure of merit can also be used to see how different experiments break degeneracies, and to predict the accuracy of future experiments (experimental design).

Importance Sampling
Importance Sampling (IS) refers to a set of techniques for determining properties of a distribution by drawing samples from another one. The main idea is that the distribution one samples from should be representative of the distribution of interest (for a large number of samples); in such a case, we can infer different quantities from it. In this section, we review the basic concepts necessary to understand IS, following Reference [19].
Suppose we are interested in computing the expectation value μ_f = E_p[f(X)], where X is a random variable with probability density p(x), and the sub-index p means the average is taken over the distribution p. Then, if we consider a new probability density q(x) that satisfies q(x) > 0 whenever f(x)p(x) ≠ 0, we can rewrite the mean value μ_f as
$$\mu_f = \int f(x)\,p(x)\,dx = \int f(x)\,w(x)\,q(x)\,dx = E_q[f(X)w(X)],$$
where w(x) = p(x)/q(x), and now we have an average over q. So, if we have a collection of draws x^{(1)}, ..., x^{(m)} from q(x), we can estimate μ_f using
$$\hat{\mu}_f = \frac{1}{m}\sum_{i=1}^{m} f(x^{(i)})\,w(x^{(i)}).$$
If p(x) is known only up to a normalizing constant, the above expression can be calculated as a ratio estimate:
$$\tilde{\mu}_f = \frac{\sum_{i=1}^{m} f(x^{(i)})\,w(x^{(i)})}{\sum_{i=1}^{m} w(x^{(i)})}.$$
By the strong law of large numbers, in the limit m → ∞, we will have $\hat{\mu}_f \to \mu_f$.
Another useful quantity to compute in a Bayesian analysis is the ratio between the evidences of two different models, where the samples {θ_n} are drawn from P(θ|D).
An important result for importance sampling is that, if we have a new set of data which is broadly consistent with the current data (in the sense that the posterior only shrinks), we can make use of importance sampling in order to quickly calculate a new posterior including the new data.
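A minimal numerical sketch of the self-normalized (ratio) estimator, with an illustrative target p = N(0, 1), proposal q = N(0, 4), and f(x) = x², so the true value is E_p[x²] = 1; both densities are written up to their common normalization, which cancels in the ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):
    return np.exp(-x**2 / 2.0)            # target N(0,1), unnormalized

def q_unnorm(x):
    return np.exp(-x**2 / 8.0) / 2.0      # proposal N(0,4), same missing constant

m = 200_000
x = rng.normal(0.0, 2.0, m)               # draws from q
w = p_unnorm(x) / q_unnorm(x)             # importance weights w = p/q
f = x**2
mu_hat = np.sum(w * f) / np.sum(w)        # ratio (self-normalized) estimate
print(mu_hat)                             # ≈ 1
```

Note the proposal is deliberately wider than the target, satisfying q(x) > 0 wherever f(x)p(x) ≠ 0.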

Combining datasets: Hyperparameter method
Suppose we are dealing with multiple datasets {D_1, . . ., D_N}, coming from a collection of different surveys {S_1, . . ., S_N}. Sometimes it is difficult to know, a priori, whether all our data are consistent with each other, or whether one or more of them are likely to be erroneous. If we were sure that all datasets are consistent, it would be enough to update the probability, as seen in Section 3.2, in order to calculate the new posterior distribution for the parameters we are interested in. However, since there is usually some uncertainty about this, a way to assess how useful the data may be is the hyperparameter method. This method was initially introduced in References [20,21] in order to perform a joint estimation of cosmological parameters from combined datasets. It may be used as long as the surveys are independent of each other. In this section, we review the main steps necessary to understand the hyperparameter method.
The main feature of this process is the introduction of a new set of hyperparameters α into the Bayesian procedure, to allow extra freedom in the parameter estimation. These hyperparameters are equivalent to nuisance parameters in the sense that we need to marginalize over them in order to recover the posterior distribution, i.e.,
$$P(\theta|D,H) = \frac{1}{P(D|H)} \int P(\theta|\alpha,H)\,P(\alpha|D,H)\,d\alpha, \qquad (49)$$
where we have used Bayes' theorem. Now, for the method, it is necessary to assume that the hyperparameters α and the parameters of interest θ are independent, i.e., P(θ, α|H) = P(α)P(θ|H), and that each hyperparameter α_k is independent of the others, i.e., P(α) = P(α_1)P(α_2) . . . P(α_N). In this way, we can rewrite the above expression. Here, the quantity inside the square brackets is the likelihood marginalized over the hyperparameters. We can identify the quantity inside the integration as the individual likelihood L(D_k|θ, α_k, H) for every α_k and dataset D_k; P(D|H) is the evidence and, as in a parameter inference procedure, it works as a normalizing function, i.e., P(D|H) = ∫ dθ P(θ|H) L(D|θ, H). Notice that, by considering P(α_k) = δ(α_k − 1), we recover the standard approach, where no hyperparameters are used.
We add these α_k in order to weight every dataset and downweight the data that do not seem to be consistent with the others. Then, we would like to know whether the data support the introduction of hyperparameters or not. A way to address this point is given by the Bayesian evidence K defined in Equation (13). If we consider a Gaussian likelihood with a maximum entropy prior, and assume that on average the hyperparameters' weights are unity, we can rewrite the marginalized likelihood function L(D|θ, H_1) for model H_1, obtaining an explicit functional form for K. Here, χ²_k is given by (36) for every dataset, and n_k is the number of points contained in D_k. In Equation (50), V_k is the covariance matrix for the k-th dataset. Suppose we have two models: one with hyperparameters, called H_1, and a second one without them, called H_0. The Bayesian evidence P(D|H_i) is the key quantity for making a comparison between two different models. In fact, by using the Bayes factor K from Equation (51), we can estimate the necessity of introducing the hyperparameters into our model using the criteria given in Table II. Notice that, if we have a set of independent samples for H_0, we can compute an estimate for K with the help of Equation (48).

NUMERICAL TOOLS
In typical scenarios, it is very difficult to compute the posterior distribution analytically. In these cases, numerical tools play an important role in the parameter estimation task. There exist several options to carry out this work; nevertheless, in this section, we focus on Markov Chain Monte Carlo (MCMC) with the Metropolis-Hastings algorithm (MHA). Additionally, we present some useful details we take into account to make our computation more efficient.

MCMC techniques for parameter inference
The purpose of the MCMC algorithm is to build up a sequence of points (called a chain) in parameter space in order to evaluate the posterior of Equation (9). In this section, we review the basic steps of this procedure in a simplified way; however, we recommend checking [22][23][24][25][26] for a more formal version of MCMC theory.
A Monte Carlo simulation refers to algorithms that use random number generators to approximate a specific quantity. On the other hand, a sequence X_1, X_2, . . . of elements of some set is a Markov Chain if the conditional distribution of X_{n+1} given X_1, . . ., X_n depends only on X_n. In other words, a Markov Chain is a process where subsequent steps can be computed based only on the information given at the present. An important property of a Markov Chain is that it converges to a stationary state where successive elements of the chain are samples from the target distribution; in our case, it converges to the posterior P(θ|D, H). Hence, we can estimate all the usual quantities of interest from the posterior (mean, variance, etc.).
The combination of both procedures is called an MCMC. The number of points required to get good estimates in MCMCs is said to scale linearly with the number of parameters, so this method becomes much faster than grids as the dimensionality increases.
The target density is approximated by a set of delta functions:
$$P(\theta|D,H) \approx \frac{1}{N}\sum_{i=1}^{N} \delta(\theta - \theta_i),$$
with N being the number of points in the chain. Then, the posterior mean is computed as
$$\langle\theta\rangle \approx \frac{1}{N}\sum_{i=1}^{N} \theta_i,$$
which follows because the samples θ_i are generated from the posterior by construction. We can likewise estimate any integral (mean, variance, etc.) as
$$\langle f(\theta)\rangle \approx \frac{1}{N}\sum_{i=1}^{N} f(\theta_i).$$
As mentioned before, in a Markov Chain it is necessary to generate a new point θ_{i+1} from the present point θ_i. However, as expected, we need a criterion for accepting (or rejecting) this new point, depending on whether it turns out to be better for our model or not. If the new step is worse than the previous one, we may still accept it: if we only accepted steps with better probability, we could converge to a local maximum of our parameter space and, therefore, explore only a small region of the entire space. The simplest algorithm that contains all this information in its methodology is the Metropolis-Hastings algorithm [27,28]. It starts from a random initial point θ_i, with an associated posterior probability p_i = p(θ_i|D, H). We then propose a candidate θ_c by drawing from a proposal distribution q(θ_i, θ_c) used as a generator of new random steps. The probability of acceptance of the new point is given by
$$p(\mathrm{acceptance}) = \min\left(1,\ \frac{p(\theta_c|D,H)\,q(\theta_c,\theta_i)}{p(\theta_i|D,H)\,q(\theta_i,\theta_c)}\right).$$
If the proposal distribution is symmetric, this reduces to the Metropolis algorithm, with
$$p(\mathrm{acceptance}) = \min\left(1,\ \frac{p(\theta_c|D,H)}{p(\theta_i|D,H)}\right).$$
In this way, the complete algorithm can be expressed by the following steps: 1. Choose a random initial condition θ_i in the parameter space and compute its posterior distribution.
2. Generate a new candidate from a proposal distribution in the parameter space and compute the corresponding posterior distribution.
3. Accept (or reject) the new point with the help of the Metropolis-Hastings algorithm.
4. If the point is rejected, repeat the previous point in the chain.
5. Repeat steps 2-4 until you have a large enough chain.
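The five steps above can be sketched in a few lines of Python. Here we apply them to the coin-toss posterior of the next subsection, with an illustrative dataset (10 heads in 14 tosses, our own assumption, chosen so the posterior is a Beta(11, 5)) and a flat prior:

```python
import numpy as np

rng = np.random.default_rng(2)

# Metropolis sketch of the steps above, applied to a coin-toss posterior.
# The data (10 heads in 14 tosses) and the proposal width are illustrative.
n, k = 14, 10

def log_post(p):
    if not 0.0 < p < 1.0:
        return -np.inf                      # flat prior on (0, 1)
    return k * np.log(p) + (n - k) * np.log(1.0 - p)   # binomial log-likelihood

p_cur, sigma = 0.1, 0.1
chain = []
for _ in range(50_000):
    p_cand = rng.normal(p_cur, sigma)       # symmetric Gaussian proposal
    # Metropolis acceptance: accept with prob min(1, post(cand)/post(cur))
    if np.log(rng.random()) < log_post(p_cand) - log_post(p_cur):
        p_cur = p_cand
    chain.append(p_cur)                     # on rejection, the point repeats
chain = np.array(chain[5_000:])             # discard burn-in
print(chain.mean())                         # ≈ posterior mean, 11/16 for Beta(11,5)
```

Step 4 of the algorithm shows up as the `chain.append(p_cur)` on rejection: the previous point is repeated in the chain.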

A first example of parameter inference
In order to put the numerical tools into practice, we take up again the coin toss example from Section 3.1. Let us try to estimate the value of p (or region of values for p) that best matches our data (we assume the coin was tossed only 14 times). To calculate the posterior distribution (20), we use the MHA.
As mentioned before, we consider a likelihood given by the binomial distribution (18) and a prior given by (11) (with a = b = 1). As our first guess for p, we consider p_i = 0.1. We generate a new candidate p_c by drawing from G(p_cu, σ), our proposal Gaussian distribution centered at the current value p_cu with width σ = 0.1; for the first step, p_cu = p_i. Then, we implement the MHA in a Python code. Our final result (shown in Figure 4) is a posterior distribution that matches very well the results calculated analytically (shown in Figure 1). Numerically, we obtained $p = 0.695^{+0.123}_{-0.107}$, where the upper and lower values for p correspond to the 1σ standard deviation. Notice that we have plotted the width of our 1σ, 2σ, and 3σ confidence regions in the same figure. To complement the example (and Figure 4), we also show in Figure 5 the Markov Chains generated by our code. It is easy to see that the chains oscillate with small amplitude around the mean value.
Remark: In Figure 18, we include the MCMC algorithm using an explicit code for the MCMC process. However, in Python there are modules that can simplify this task. For example, PyMC3 [29] is a Python module that implements statistical models and fitting algorithms, including MCMC. We use this module at the end of this section.

Convergence test
It is clear that we need a test to know when our chains have converged. We need to verify that the points in the chain are not converging to a false convergence point or a local maximum, so our algorithm must take this possible difficulty into account. The simplest (informal) way to know whether our chain is converging to a global maximum is to run several chains starting from different initial proposals for the parameters. Then, if we see by eye that all chains seem to converge to a single region of possible values for our parameter, we may say that our chains are converging to that region. Taking yet again the coin example, we can run several chains and try to estimate whether the value (region) of p that we found is stationary. In Figure 5, we plot 5 different Markov chains with initial conditions p = 0.1, 0.3, 0.5, 0.7, 0.9. As expected from the analytical result, after several steps all the chains concentrate near the same value.
The convergence method used above is very informal, and we would like a better way to ensure that our result is correct. The usual test is the Gelman-Rubin convergence criterion [30,31]. Starting with M chains with very different initial points and N points per chain, let θ_i^j be the point of position i belonging to chain j. We compute the mean of each chain,
$$\bar{\theta}^j = \frac{1}{N}\sum_{i=1}^{N}\theta_i^j,$$
and the mean of all the chains,
$$\bar{\theta} = \frac{1}{NM}\sum_{j=1}^{M}\sum_{i=1}^{N}\theta_i^j.$$
Then, the chain-to-chain variance B is
$$B = \frac{N}{M-1}\sum_{j=1}^{M}\left(\bar{\theta}^j - \bar{\theta}\right)^2,$$
and the average variance of each chain is
$$W = \frac{1}{M(N-1)}\sum_{j=1}^{M}\sum_{i=1}^{N}\left(\theta_i^j - \bar{\theta}^j\right)^2.$$
If our chains converge, W and B/N must agree. In fact, we say that the chains converge when the quantity
$$R = \frac{\frac{N-1}{N}W + \frac{1}{N}B}{W},$$
which is the ratio of the two estimates, approaches unity. A typical convergence criterion is 0.97 < R < 1.03.
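The criterion can be sketched in a few lines, assuming the chains are stored as an (M, N) array; the pooled-variance form of R used here is the standard one and may differ in small details from other conventions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Gelman-Rubin sketch: M chains of N points each, stored as an (M, N) array.
def gelman_rubin(chains):
    M, N = chains.shape
    chain_means = chains.mean(axis=1)                        # mean of each chain
    grand_mean = chain_means.mean()                          # mean of all chains
    B = N / (M - 1) * np.sum((chain_means - grand_mean)**2)  # chain-to-chain variance
    W = chains.var(axis=1, ddof=1).mean()                    # average within-chain variance
    V = (N - 1) / N * W + B / N                              # pooled variance estimate
    return V / W                                             # R, should approach 1

# Illustrative check: independent draws from the same target give R ≈ 1.
converged = rng.normal(0.0, 1.0, size=(5, 10_000))
print(gelman_rubin(converged))
```

Feeding in chains stuck around different values (e.g., shifting each row by a different constant) drives R well above 1, which is the signal that the chains have not mixed.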

Some useful details
The proposal distribution: The choice of a proposal distribution q is crucial for the efficient exploration of the posterior. In our example, we used a Gaussian-like distribution with a variance (step) σ = 0.1. This value was chosen because we initially explored different values of σ and selected the one that approached the analytic posterior distribution of p the fastest. If the scale of q is too small compared to the scale of the target (in the sense that the typical jump is small), then the chain may take very long to explore the target distribution, which means the algorithm will be very inefficient. As we can see in Figure 6 (top panel), considering an initial step p_i = 0.6 and a proposal variance σ = 0.002, the number of points is not enough for the system to move to its "real" posterior distribution. On the other hand, if the scale of q is too large, the chain gets stuck and does not jump very frequently (bottom panel of the figure, corresponding to σ = 0.8), so we will have different maxima in our posterior.
In order to fix this issue more efficiently, it is recommended to run an exploratory MCMC, compute the covariance matrix of its samples, and then re-run with this matrix as the covariance of a multivariate Gaussian proposal distribution. This process can be repeated a couple of times before running the final MCMC.
The burn-in: It is important to notice that, at the beginning of the chain, we have a set of points far outside the stationary region. This early stage of the chain (called burn-in) must be ignored; this means that the dependence on the starting point must be lost. Thus, it is important to have a reliable convergence test.
Thinning: Several Bayesian statisticians usually thin their MCMC: instead of saving every step of the MCMC, they save a new step only after every n steps have taken place. An obvious consequence of thinning the chains is that the amount of autocorrelation is reduced. However, when the chains are thinned, the precision of the estimated parameters is reduced [32]. Thinning the chains can be useful in other kinds of circumstances, for example, if we have memory limitations. Notice that thinning a chain does not yield incorrect results; it yields correct, but less efficient, results than using the full chains.
Autocorrelation probes: A complementary way to look for convergence in an MCMC estimation is to look at the autocorrelation between the samples. The autocorrelation at lag k is defined as the correlation between every sample and the sample k steps before, and can be quantified as [33,34]
$$\rho_k = \frac{\sum_{t=1}^{n-k}\left(X_t - \bar{X}\right)\left(X_{t+k} - \bar{X}\right)}{\sum_{t=1}^{n}\left(X_t - \bar{X}\right)^2},$$
where X_t is the t-th sample and $\bar{X}$ is the mean of the samples. This autocorrelation should become smaller as k increases (meaning that the samples start to become independent).
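The definition can be implemented directly. As an illustration (our own toy case, not from the text), we apply it to an AR(1) series, a simple stand-in for a correlated MCMC chain, whose lag-k autocorrelation decays roughly as 0.9^k:

```python
import numpy as np

# Autocorrelation at lag k, following the definition above.
def autocorr(x, k):
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    num = np.sum((x[:-k] - xbar) * (x[k:] - xbar))
    den = np.sum((x - xbar)**2)
    return num / den

rng = np.random.default_rng(4)
# AR(1) chain x_t = 0.9 x_{t-1} + noise mimics a correlated MCMC chain.
n = 100_000
eps = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.9 * x[t - 1] + eps[t]
print([round(autocorr(x, k), 2) for k in (1, 10, 50)])  # decays toward 0
```

In practice one plots ρ_k against k for each parameter and checks that it falls toward zero well within the chain length.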

More Samplers
Gibbs sampling: The basic idea of the Gibbs sampling algorithm [35] is to split the multidimensional θ into blocks and sample each block separately, conditional on the most recent values of the other blocks. It basically breaks a high-dimensional problem into low-dimensional problems.
The algorithm reads as follows: 1. Split θ into k blocks θ_1, . . ., θ_k. Then, at step i: 2. Draw θ_1^{(i)} from P(θ_1|θ_2^{(i−1)}, . . ., θ_k^{(i−1)}, D). 3. Draw θ_2^{(i)} from P(θ_2|θ_1^{(i)}, θ_3^{(i−1)}, . . ., θ_k^{(i−1)}, D). 4. Continue in the same way through the remaining blocks, always conditioning on the most recent values. 5. Draw θ_k^{(i)} from P(θ_k|θ_1^{(i)}, . . ., θ_{k−1}^{(i)}, D). 6. Repeat the above steps for the desired number of iterations with i → i + 1.
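A minimal sketch of these steps for a bivariate Gaussian with correlation ρ, where both conditionals are known exactly (an illustrative target of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(5)

# Gibbs sketch for a bivariate Gaussian with correlation rho:
# each block (here each coordinate) is drawn from its exact conditional.
rho, nsteps = 0.8, 50_000
x = y = 0.0
samples = np.empty((nsteps, 2))
for i in range(nsteps):
    # x | y ~ N(rho*y, 1 - rho^2), then y | x ~ N(rho*x, 1 - rho^2)
    x = rng.normal(rho * y, np.sqrt(1.0 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1.0 - rho**2))
    samples[i] = x, y
print(np.corrcoef(samples.T)[0, 1])   # ≈ rho
```

Each update is a one-dimensional draw, which is the whole point: the high-dimensional target is never sampled directly.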
Metropolis Coupled Markov Chain Monte Carlo (MC³): It is easy to see that things can become problematic if our likelihood has local maxima. MC³ is a modification of the standard MCMC algorithm that consists of running several Markov Chains in parallel to explore the target distribution at different temperatures. The temperature controls the height of the peaks in the likelihood; this simplifies the way we sample our parameter space and helps us avoid local maxima. A distribution whose likelihood is raised to a power 1/T with T > 1 is said to be tempered.
We consider a tempered version of the posterior distribution,
$$P(\theta, T|D,H) \propto \mathcal{L}(D|\theta,H)^{1/T}\, P(\theta|H),$$
where L is the likelihood and P(θ|H) the prior. Notice that, for higher T, individual peaks of L become flatter, making the distribution easier to sample with an MCMC algorithm. Now, we run N chains with different temperatures assigned in a ladder T_1 < T_2 < . . . < T_N, usually taken with a geometric spacing and T_1 = 1. The coldest chain, at T_1, samples the posterior distribution most accurately and behaves as a typical MCMC, so we define it as the main chain. The remaining chains run such that they can cross local likelihood maxima more easily and transport this information to the main chain. The chains explore the landscape independently for a certain number of generations. Then, at predetermined intervals, chains are allowed to swap their current positions with probability
$$\alpha = \min\left(1,\ \left[\frac{\mathcal{L}(D|\theta_j)}{\mathcal{L}(D|\theta_i)}\right]^{1/T_i}\left[\frac{\mathcal{L}(D|\theta_i)}{\mathcal{L}(D|\theta_j)}\right]^{1/T_j}\right).$$
In this way, if a swap is accepted, chains i and j exchange their current positions in parameter space: chain i moves to position θ_j and chain j moves to position θ_i.
We can see that, since the hottest chain T_max can access all the modes of P(θ, T_max|D, H) more easily, it can propagate its position to colder chains and, in particular, to the coldest chain with T = 1. At the same time, the positions of colder chains can be propagated to hotter chains, allowing them to explore the entire prior volume. For an extensive explanation, or a modification that makes the temperature of the chains dynamical, see References [36][37][38].
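A compact sketch of the scheme on a bimodal one-dimensional target (the target, temperature ladder, and proposal widths are our own illustrative choices, with a flat prior so only the likelihood enters the swap rule):

```python
import numpy as np

rng = np.random.default_rng(6)

# Parallel-tempering (MC^3) sketch on a bimodal 1D likelihood.
def log_like(x):
    # two well-separated modes that trap an ordinary Metropolis chain
    return np.logaddexp(-0.5 * (x + 5.0)**2, -0.5 * (x - 5.0)**2)

temps = np.array([1.0, 2.0, 4.0, 8.0, 16.0])   # geometric ladder, T1 = 1
x = np.full(len(temps), -5.0)                  # all chains start in the left mode
cold = np.empty(50_000)
for step in range(cold.size):
    for i, T in enumerate(temps):              # within-chain Metropolis updates
        cand = x[i] + rng.normal()
        if np.log(rng.random()) < (log_like(cand) - log_like(x[i])) / T:
            x[i] = cand
    if step % 10 == 0:                         # periodically propose an adjacent swap
        i = rng.integers(len(temps) - 1)
        log_a = (1.0/temps[i] - 1.0/temps[i+1]) * (log_like(x[i+1]) - log_like(x[i]))
        if np.log(rng.random()) < log_a:
            x[i], x[i+1] = x[i+1], x[i]
    cold[step] = x[0]
# With swaps, the T = 1 chain visits both modes; alone it would stay near -5.
print("fraction of time in the right mode:", (cold > 0).mean())
```

Without the swap step, the cold chain essentially never crosses the barrier between the two modes, which is exactly the failure mode MC³ is designed to cure.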
Affine Invariant MCMC Ensemble Sampler: The main property of this algorithm is its invariance under affine transformations. Let us consider a highly anisotropic density:
$$\pi(x_1, x_2) \propto \exp\left(-\frac{(x_1 - x_2)^2}{2\epsilon} - \frac{(x_1 + x_2)^2}{2}\right),$$
which is difficult to sample for small ε. But, by making the affine transformation $y_1 = (x_1 - x_2)/\sqrt{\epsilon}$, $y_2 = x_1 + x_2$, we can rewrite the anisotropic density as the easier problem
$$\pi(y_1, y_2) \propto \exp\left(-\frac{y_1^2 + y_2^2}{2}\right).$$
An MCMC sampler has the form X(t + 1) = R(X(t), ψ(t), p), where X(t) is the sample after t iterations, R is the sampler algorithm, ψ is a sequence of independent identically distributed random variables, and p is the density. A sampler is said to be affine invariant if, for any affine transformation Ax + b,
$$R(AX(t) + b, \psi(t), p_{A,b}) = AR(X(t), \psi(t), p) + b. \qquad (68)$$
There are already several algorithms that are affine invariant; one of the simplest is known as the stretch move [39]. An algorithm fully implemented in Python under the name EMCEE [40] is also affine invariant, and some others can be found in Reference [41].
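A minimal sketch of the stretch move (our own compact implementation following Goodman and Weare, not the EMCEE code), run on a simple 2D Gaussian target:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_prob(x):
    return -0.5 * np.sum(x**2)          # standard 2D Gaussian target

nwalkers, ndim, nsteps, a = 20, 2, 2000, 2.0
walkers = rng.normal(size=(nwalkers, ndim))
samples = []
for _ in range(nsteps):
    for k in range(nwalkers):
        # pick a complementary walker and stretch toward/away from it
        j = rng.choice([i for i in range(nwalkers) if i != k])
        z = ((a - 1.0) * rng.random() + 1.0)**2 / a   # stretch variable, g(z) ∝ 1/sqrt(z)
        proposal = walkers[j] + z * (walkers[k] - walkers[j])
        log_ratio = (ndim - 1) * np.log(z) + log_prob(proposal) - log_prob(walkers[k])
        if np.log(rng.random()) < log_ratio:
            walkers[k] = proposal
    samples.append(walkers.copy())
samples = np.concatenate(samples[nsteps // 2:])       # drop burn-in
print(samples.mean(axis=0), samples.std(axis=0))      # ≈ (0, 0) and (1, 1)
```

Because the proposal is built from the positions of the other walkers, the move automatically adapts to any linear stretching or rotation of the target, which is the affine invariance in practice.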
Even more samplers: The generation of the elements in a Markov chain is probabilistic by construction, and it depends on the algorithm we are working with. The MHA is the simplest algorithm used in Bayesian inference. However, there are several algorithms that can help accomplish our mission; some of the most popular and effective ones are Hamiltonian Monte Carlo (see, e.g., References [42,43]) and Adaptive Metropolis-Hastings (AMH) (see, e.g., Reference [19]).

FITTING A STRAIGHT-LINE
In this section, we carry out a standard example: fitting a straight line. That is, we assume that we have a certain theory in which our measurements should follow a straight line. Then, we simulate several datasets and focus on two different cases (Figure 7): 1. Consider two datasets coming from the same straight line but having different errors.
2. Consider two datasets coming from different straight lines and also having different errors.
In our analysis, we used the PyMC3 module, and the complete code can be downloaded from the Git repository [44]. In Figure 7, the top panel corresponds to case 1 and the bottom panel to case 2.

Case 1
In this example, we start by considering that our measurements for a given theory (a straight line y = a + bx) are given by the data shown in the upper panel of Figure 7. The two datasets, D1 and D2, were generated from the same line, y = 3 + 2x, adding a Gaussian error to each point. For D1, we added an error with a standard deviation σ_1 = 0.3, while for D2 we used σ_2 = 0.2. We would like to estimate the parameters of the model, i.e., a and b. We will analyze this data with and without the hyperparameter method and discuss our results in detail.
Before we make a Bayesian estimation, it is necessary to specify the priors. As we have seen, a good prior is a non-informative one. Suppose we only know some limits for a and b, as can be seen when plotting the data. Then, we consider flat priors, where U[α, β] denotes a uniform distribution with lower limit α and upper limit β.
From Equation (30), we can write our likelihood as
$$\mathcal{L}(D|a,b) \propto \exp\left[-\frac{1}{2}\sum_d \frac{(y_d - a - b\,x_d)^2}{\sigma_d^2}\right],$$
where y_d is our data taken from the dataset D = D_1 + D_2, and σ_d its errors. We use the MHA to generate the Markov chains. In our analysis, we ran 5 chains with 10,000 steps each. We ran each chain with temperature T = 2 and thinned them every 50 steps. The result corresponds to a = 2.982 ± 0.047 and b = 1.994 ± 0.013, and the posterior distributions are plotted in Figure 8. Notice that there are some regions where the frequency of events in our sample is increased, so we can say that such parameter regions seem more likely to match the data. Additionally, we compute the Gelman-Rubin criterion for each variable in order to verify that our results converged: 1.000017 for a, and 1.000291 for b. These numbers are very close to 1, so our convergence criterion is achieved. The bottom panel of Figure 8 displays the 1 − 4σ confidence regions. We also added a red point to show the real values of our parameters. The real values of a and b are within the curve corresponding to one standard deviation of our estimations.
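Since this likelihood is Gaussian, its maximum coincides with the weighted least-squares solution, which gives a quick cross-check on the MCMC results. The mock data below only mimic Case 1; the dataset sizes and x-ranges are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(7)

# Mock Case-1 data: both sets drawn from y = 3 + 2x with different errors.
x1, x2 = rng.uniform(0, 5, 20), rng.uniform(0, 5, 16)
y1 = 3 + 2 * x1 + rng.normal(0, 0.3, x1.size)     # D1, sigma_1 = 0.3
y2 = 3 + 2 * x2 + rng.normal(0, 0.2, x2.size)     # D2, sigma_2 = 0.2
x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])
sigma = np.concatenate([np.full(x1.size, 0.3), np.full(x2.size, 0.2)])

def chi2(a, b):
    return np.sum((y - a - b * x)**2 / sigma**2)

# Weighted least squares minimizes exactly this chi^2.
A = np.vstack([np.ones_like(x), x]).T / sigma[:, None]
a_fit, b_fit = np.linalg.lstsq(A, y / sigma, rcond=None)[0]
print(a_fit, b_fit)    # ≈ 3 and 2
```

Any MHA chain run on this likelihood should concentrate around the same (a, b) that the direct solve returns.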
As we mentioned, we need the autocorrelation to be small as k increases in order to consider that our analysis is converging. We see in Figure 9 such plots and notice that our convergence criterion is fulfilled. Then, in Case 1, we can see that the model H_0 seems to be a very good estimation procedure.
Now, let us consider the hyperparameter method. In this case, our likelihood can be written as Equation (50). Similar to the last procedure, we compute the posterior with flat priors, using 5 chains with 10,000 steps each, and check for autocorrelations. Our results are as follows: a = 2.97 ± 0.038 with a Gelman-Rubin value of 1.000113, and b = 1.995 ± 0.010 with Gelman-Rubin 1.000155. Comparing both procedures, we observe they provide similar results. In fact, the confidence regions for both approximations, Figure 8 and the top panel of Figure 10, are similar as well. So, which method is better? We could say that the method with hyperparameters is as good as the one without them but, in order to be sure, we compute the evidence ratio K between both models from Equation (51). Comparing with Table II, we can say that the evidence for H_1 being better than H_0 is weak. In such a case, it is equally good to work with H_0 as with H_1, as explained before.
Finally, in order to illustrate our results, in the bottom panel of Figure 10 we plot our data along with the straight lines inferred from the mean parameters of both models. As expected, our estimation fits the data well in both cases.

Case 2
Here, we consider that we have the same theory for the straight line but different measurements. The data points are given in the lower panel of Figure 7. These correspond to our datasets D_1 and D_2, but now D_2 is replaced by 16 new points generated around the line y = 3.5 + 1.5x with Gaussian noise of standard deviation σ = 0.5. So, our datasets are not consistent with each other. Let us again estimate the parameters a and b and look for the differences between both procedures.
We follow the same procedure as in Case 1. We computed our posterior and verified that our results converged with the help of the Gelman-Rubin criterion and the autocorrelation plots. Our results are: a = 3.528 ± 0.056 and b = 1.795 ± 0.014. We then plotted the 1 − 4σ confidence regions in the upper panel of Figure 11. It is easy to see that our estimation differs greatly from the real parameters of the datasets (red points). This is because we are trying to fit a single model to mutually inconsistent datasets; therefore, we arrive at incorrect results. Now, let us see what happens with the hyperparameter procedure.
In the bottom-left panel of Figure 11, we plot the posterior distribution. We see immediately that both approximations are very different. While for model H_0 we obtained a single region far away from the real values of our data, for model H_1 we obtained two local maximum regions near the real values for our datasets (red dots).
Given that we know a priori the real values of the parameters, we could immediately say that the method with hyperparameters is a better approximation than the case without them. However, we confirm this assumption by computing the ratio K between both models, which tells us that we have very strong evidence that H_1 is better than H_0. Finally, we can plot the straight line inferred by model H_0 and the two lines inferred by model H_1. Considering the parameters inside the two regions in the bottom-left of Figure 11, we obtain the bottom-right panel of the same figure.

Bayesian statistics is a very useful tool in Cosmology to determine, for instance, the combination of model parameters that best describes the Universe. In the following, we present the basics of Cosmology and apply Bayesian statistics to perform the parameter inference. In our examples, we will focus on the background Universe (and avoid perturbations), since the main purpose of this article is the application of these techniques rather than the cosmology by itself. It should be clear, however, that the extension to consider perturbations is immediate: there is only an increment in the number of parameters, and the expressions turn out to be just a little more complicated.

Einstein Field equations
In order to specify the geometry of the Universe, an essential assumption is the Cosmological Principle: at a given time and on sufficiently large scales, the observable Universe can be considered homogeneous and isotropic to great precision. For example, at scales greater than 100 megaparsecs, the distribution of galaxies observed on the celestial sphere justifies the assumption of isotropy. The uniformity observed in the temperature distribution (one part in $10^5$) measured through the Cosmic Microwave Background (CMB) is the best observational evidence we have in favor of universal isotropy. Therefore, if isotropy is taken for granted, and taking into account that our position in the Universe is not preferred (known as the Copernican Principle), homogeneity follows from isotropy at every point.
Homogeneity establishes that the Universe is observed equally in each point of space.
Isotropy establishes that the Universe is observed equally in all directions.
The formalism of General Relativity establishes the relationship between the geometry of space-time and the matter in it. That is, the curvature of spacetime produces physical effects on matter, and these effects are associated with the gravitational field. In turn, the curvature is related to the matter content, described by an energy-momentum tensor T_{μν}. The above can be summarized by paraphrasing Wheeler: "matter tells space-time how to curve and, in turn, the geometry of this curvature tells matter how to move". We can write this sentence down as the Einstein equations
$$G_{\mu\nu} = 8\pi G\, T_{\mu\nu},$$
where G_{μν} is the Einstein tensor (the geometry of spacetime), and G is Newton's gravitational constant [45][46][47][48]. Throughout this review, we use natural units c = ℏ = 1.
The distance between two points in a curved spacetime can be measured as
$$ds^2 = g_{\mu\nu}\, dx^{\mu} dx^{\nu},$$
where g_{μν} is the metric tensor that contains all the information about the geometry of the space-time. From now on, and unless stated otherwise, Greek letters μ, ν, . . . denote spacetime indices ranging from 0 to 3, while Latin letters i, j, . . . denote spatial coordinates ranging from 1 to 3. The geometry that best describes a homogeneous, isotropic, and expanding Universe is given by the Friedmann-Lemaître-Robertson-Walker (FLRW) metric, with a line element
$$ds^2 = -dt^2 + a^2(t)\,\gamma_{ij}\, dx^i dx^j, \qquad (75)$$
where
$$\gamma_{ij} = \delta_{ij} + \kappa\, \frac{x_i x_j}{1 - \kappa\, x_k x^k}. \qquad (76)$$
In Equation (75), a represents the scale factor of the Universe, which depends only on time and by convention is normalized to unity today, a(t_0) ≡ 1. Similarly, in expression (76), the x^i label the spatial coordinates (also called comoving coordinates), δ_ij is the Kronecker delta, and κ describes the curvature of the space-time.

Friedmann and continuity equations
The content of the Universe needs to satisfy homogeneity and isotropy as well; hence, it is described by the energy-momentum tensor of a perfect fluid,
$$T^{\mu\nu} = (\rho + P)\,U^{\mu}U^{\nu} + P\, g^{\mu\nu},$$
where ρ is the energy density, P is the fluid pressure, and U^μ is the 4-velocity relative to the observer. If we take the velocity as U^μ = (1, 0, 0, 0) (a comoving observer), the energy-momentum tensor reduces to $T^{\mu}{}_{\nu} = \mathrm{diag}(-\rho, P, P, P)$. Using Equations (73) and (78) with the FLRW metric, we can obtain the Friedmann equations
$$H^2 \equiv \left(\frac{\dot{a}}{a}\right)^2 = \frac{8\pi G}{3}\sum_i \rho_i - \frac{\kappa}{a^2}, \qquad (79)$$
$$\frac{\ddot{a}}{a} = -\frac{4\pi G}{3}\sum_i \left(\rho_i + 3P_i\right). \qquad (80)$$
In these expressions, H is the Hubble parameter, which accounts for the rate of expansion/contraction of the Universe. The subindex i labels all the components the Universe is believed to be made of, as we will see in the next section. These equations describe the evolution of the Universe. By combining Equations (79) and (80), we can obtain the continuity equation,
$$\dot{\rho}_i + 3H\left(\rho_i + P_i\right) = 0. \qquad (81)$$
The meaning of (81) is the conservation of the energy-momentum tensor (∇_μ T^{μν} = 0). In order to close the system, we need to include an equation of state that relates pressure and energy density for a given fluid. In particular, we are interested in barotropic fluids, which generally have the form P = ωρ.
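For a barotropic fluid with constant ω, it is worth making the next step explicit: the continuity equation integrates directly, giving the density scalings quoted in the next section,

```latex
\dot{\rho} + 3H(1+\omega)\rho = 0
\quad\Longrightarrow\quad
\frac{d\rho}{\rho} = -3(1+\omega)\frac{da}{a}
\quad\Longrightarrow\quad
\rho = \rho_0\, a^{-3(1+\omega)} ,
```

so that dust (ω = 0) dilutes as a⁻³, radiation (ω = 1/3) as a⁻⁴, and a cosmological constant (ω = −1) keeps a constant density.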

Content of the Universe
Once the equations that define the dynamics of the Universe are known, it is necessary to specify its content. The standard cosmological model, also known as Λ Cold Dark Matter (ΛCDM), is one of the most accepted models to describe the Universe, and its content is the following: • Dust: It has no pressure, and its energy density evolves as ρ ∝ a⁻³. Dust is made up of baryons (ordinary matter).
• Dark matter: It is proposed to explain several astrophysical observations, like the dynamics of galaxies in the Coma cluster or the rotation curves of galaxies [49,50]. The ΛCDM model assumes that dark matter interacts only gravitationally (and possibly through the weak interaction) with the rest of the Universe, hence its name, Cold Dark Matter (CDM). Since it is proposed as interacting only via the gravitational force, there can be several candidates to fulfill this requirement: it could be made of weakly interacting massive particles (WIMPs); of gravitationally-interacting massive particles (GIMPs); of axions (hypothetical elementary particles); or of sterile neutrinos, just to mention a few. For a short review of dark matter and possible candidates, see Reference [51].
• Radiation: This corresponds to relativistic particles that follow the relation P = ρ/3, which implies a density behaving as ρ ∝ a⁻⁴. We consider photons ρ_γ and massless neutrinos ρ_ν as radiation, so the total radiation energy density in the Universe is given by ρ_r = ρ_γ + ρ_ν. The relation between these quantities is
$$\rho_{\nu} = N_{\mathrm{eff}}\,\frac{7}{8}\left(\frac{4}{11}\right)^{4/3}\rho_{\gamma},$$
where N_eff is the effective number of relativistic degrees of freedom, with standard value N_eff = 3.046 [13].
• Dark Energy: It is introduced to explain the current accelerated expansion of the Universe. In the ΛCDM model, dark energy is given by the cosmological constant Λ, or equivalently by an equation of state ω = −1. Each of these components can be described by its equation of state, shown in Table IV. Defining the density parameter
$$\Omega_i \equiv \frac{\rho_i}{\rho_{\mathrm{crit}}}, \qquad \rho_{\mathrm{crit}} = \frac{3H^2}{8\pi G},$$
with ρ_crit being the density required to have a flat Universe (equivalently, zero curvature), we can rewrite (79) as
$$H^2 = H_0^2\left[\Omega_{r,0}\, a^{-4} + \Omega_{m,0}\, a^{-3} + \Omega_{k,0}\, a^{-2} + \Omega_{\Lambda,0}\right],$$
where Ω_{r,0} is the radiation density parameter, Ω_{m,0} ≡ Ω_{b,0} + Ω_{DM,0} corresponds to the total matter, with Ω_{b,0} for baryons and Ω_{DM,0} for dark matter, Ω_k ≡ −κ/(aH)² is the curvature density parameter, and Ω_Λ ≡ Λ/3H² is associated with the Cosmological Constant; the subscript zero indicates today's values.
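The Friedmann constraint above can be evaluated directly. The density parameters below are illustrative (Planck-like) values, not fits from this paper, and we assume a flat Universe so that Ω_k = 0:

```python
import numpy as np

# Expansion rate H(a) for a flat LambdaCDM background.
H0 = 67.4                     # km/s/Mpc, assumed value
Om_r, Om_m = 9.2e-5, 0.315    # radiation and total matter today (illustrative)
Om_L = 1.0 - Om_m - Om_r      # flat Universe: Omega_k = 0

def H(a):
    """H(a) = H0 * sqrt(Om_r a^-4 + Om_m a^-3 + Om_L)."""
    return H0 * np.sqrt(Om_r * a**-4 + Om_m * a**-3 + Om_L)

print(H(1.0))                 # today: recovers H0
print(H(1e-3))                # deep in the matter/radiation era: much larger
```

This H(a) is the quantity that cosmological likelihoods repeatedly evaluate inside an MCMC, once the Ω_{i,0} are promoted to free parameters.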

Alternatives to the ΛCDM model
The ΛCDM model has had great success in modeling a wide range of astronomical observations. However, it is in apparent conflict with some observations on small scales within galaxies (e.g., cuspy halo density profiles or the overproduction of satellite dwarfs within the Local Group, amongst many others; see, for example, References [50,52]). In addition, all attempts to detect WIMPs, either directly in the laboratory or indirectly through astronomical signals from distant objects, have failed so far. For these reasons, it seems necessary to explore alternatives to the standard ΛCDM model, and several have been suggested. For instance, the Scalar Field Dark Matter (SFDM) model proposes that dark matter is a spin-0 boson particle [53-58], while Self-Interacting Dark Matter, as its name states, assumes that the cold dark matter is made of self-interacting particles [59-61]. On the other hand, in order to explain the accelerated expansion of the Universe, there exist different modifications to the theory of General Relativity, e.g., f(R) theories [62] and braneworld models [63,64]. There are also alternatives to the cosmological constant as dark energy, e.g., scalar fields (quintessence, K-essence, phantom, quintom, non-minimally coupled scalar fields [65-69]), or anisotropic Universes [70-72]. Finally, if the dark energy is assumed to be a perfect fluid, one of the most popular time-evolving parameterizations of its equation of state consists of expanding ω in a Taylor series; for example, the Chevallier-Polarski-Linder (CPL) form ω = ω_0 + ω_a(1 − a), with two free parameters ω_0, ω_a [73,74]. It may also be expanded in a Fourier series [75], and various other Bayesian approaches have been suggested to account for a dynamical dark energy [76-78].

Base parameters
These parameters, also known as base parameters, are the main quantities used in the description of the Universe. They are not predicted by a fundamental theory, but their values must be fitted to provide the best description of the current astrophysical and cosmological observables. To describe the homogeneous and isotropic Universe, we can use the density parameter of each component Ω_i,0 and the Hubble parameter H_0, related by (85). In particular, the radiation contribution is measured with great precision, so Ω_γ is pinned down very accurately and there is no need to fit this parameter. Similarly, neutrinos, as long as they maintain a relativistic behavior, can be related to the density of the photons through (83).
On the other hand, the existence of strong degeneracies among different combinations of parameters is also notorious, in particular the geometric degeneracy involving Ω_m, Ω_Λ, and the curvature parameter Ω_k. To reduce these degeneracies, it is common to introduce combinations of cosmological parameters that have orthogonal effects on the measurements.

Derived parameters
The above standard set of parameters provides an adequate description of cosmological models. However, this parameterization is not unique, and others can be just as good. Various parameterizations make use of knowledge of the physics or of the sensitivity of the detectors and can, therefore, be interpreted more naturally. In general, other parameters could have been used to describe the Universe, for example: the age of the Universe, the current temperature of the neutrino background, the epoch of matter-radiation equality, or the epoch of reionization. In the standard cosmological model (ΛCDM), in order to decrease degeneracies, the physical energy densities Ω_DM,0 h² and Ω_b,0 h² are used as base parameters [13].

Cosmological observations
In this section, we review some of the most common experiments and observables used to constrain cosmological models at the background level.
Baryon Acoustic Oscillations (BAO): The BAO is a statistical property, a feature in the correlation function of galaxies or in the power spectrum. The best description of the early Universe considers that it was made of a plasma of coupled photons and matter (baryons and dark matter). The interplay between the gravitational pull of matter and the radiation pressure formed spherical waves in the plasma. When the Universe cooled down enough, protons and electrons were able to combine into hydrogen atoms, a process that allowed photons to decouple from the rest of the baryons. The photons began to travel uninterrupted, while the gravitational field attracted matter towards the center of each spherical wave. The final configuration is an overdensity of matter at the center and a shell of baryons of fixed radius called the sound horizon. This radius, used as a standard ruler, is the maximum distance that sound waves could have traveled through the primordial plasma before recombination. The sound horizon r_d is given by $r_d = \int_{z_d}^{\infty} \frac{c_s(z)}{H(z)}\, dz$, where the sound speed (in terms of redshift z) in the photon-baryon fluid is $c_s(z) = c/\sqrt{3(1+R)}$, with $R \equiv 3\rho_b/(4\rho_\gamma)$, and z_d is the redshift at which photons and baryons decouple. The BAO scale is determined by adopting a fiducial model in order to translate the angular and redshift separations into comoving distances. The information of the measurement is contained in the ratio (α) of the measured BAO scale to that predicted by the fiducial model (fid). In an anisotropic fit, two ratios are used, one perpendicular (α_⊥) and one parallel (α_∥) to the line of sight. A measurement of α_⊥ constrains the ratio of the comoving angular diameter distance to the sound horizon, $D_M(z)/r_d$ [79], where the comoving angular diameter distance is given by $D_M(z) = S_k(D_C(z))$. The line-of-sight comoving distance is defined as $D_C(z) = c\int_0^z dz'/H(z')$, and $S_k$ encodes the spatial curvature, reducing to the identity, $S_k(x) = x$, in a flat Universe. The Hubble parameter can be constrained by measuring α_∥ through the analogous quantity $D_H(z)/r_d$, with $D_H(z) = c/H(z)$.
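To make the sound-horizon expression concrete, here is a sketch that evaluates the integral numerically with SciPy; the cosmological parameters and the drag-epoch redshift z_d below are rough, assumed values, not the ones fitted in this work:

```python
import numpy as np
from scipy.integrate import quad

c = 299792.458                               # speed of light in km/s
H0, Om, Ob, Og = 67.7, 0.31, 0.049, 5.4e-5   # assumed illustrative values
Orad = Og * (1 + (7 / 8) * (4 / 11) ** (4 / 3) * 3.046)
OL = 1 - Om - Orad                           # flat Universe
zd = 1060.0                                  # assumed drag-epoch redshift

def H(z):
    # background expansion rate in km/s/Mpc
    return H0 * np.sqrt(Om * (1 + z) ** 3 + Orad * (1 + z) ** 4 + OL)

def c_s(z):
    # photon-baryon sound speed, with R = 3 rho_b / (4 rho_gamma)
    R = (3 * Ob / (4 * Og)) / (1 + z)
    return c / np.sqrt(3 * (1 + R))

# r_d = integral of c_s(z)/H(z) from z_d to infinity, in Mpc
rd, _ = quad(lambda z: c_s(z) / H(z), zd, np.inf)
```

For parameters in this neighborhood the integral comes out in the vicinity of 150 Mpc, the scale of the standard ruler.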
If redshift-space distortions are weak (which is valid for luminous galaxy surveys but not for the Ly-α forest), an isotropic analysis measures an effective combination of (87) and (91), the volume-averaged distance [79] $D_V(z) = \left[z\, D_H(z)\, D_M^2(z)\right]^{1/3}$. The BAO measurements constrain the cosmological parameters through the radius of the sound horizon r_d, the Hubble distance D_H(z), and the comoving angular diameter distance D_M(z); see Figure 12. The data used in this work to constrain D_V/r_d are obtained from the 6dF Galaxy Survey (6dFGS [80]) from the UK Schmidt Telescope, the Main Galaxy Sample (MGS [81]), and the BOSS LOWZ sample [82] from SDSS. On the other hand, the data from the BOSS CMASS sample [82], the BOSS Lyman-α auto-correlation (Lyα auto [83]), and the BOSS Lyman-α cross-correlation (Lyα cross [84]) are used to constrain D_M/r_d and D_H/r_d.
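A sketch of how these distance combinations could be computed for a flat Universe follows; the parameter values are assumed for illustration:

```python
import numpy as np
from scipy.integrate import quad

c, H0, Om = 299792.458, 67.7, 0.31   # km/s, km/s/Mpc; assumed flat LCDM values
OL = 1.0 - Om                        # radiation neglected at these low redshifts

def H(z):
    return H0 * np.sqrt(Om * (1 + z) ** 3 + OL)

def D_M(z):
    # comoving angular diameter distance; equals D_C in a flat Universe (Mpc)
    return c * quad(lambda zp: 1.0 / H(zp), 0.0, z)[0]

def D_H(z):
    # Hubble distance c/H(z) in Mpc
    return c / H(z)

def D_V(z):
    # volume-averaged distance entering isotropic BAO fits
    return (z * D_H(z) * D_M(z) ** 2) ** (1.0 / 3.0)
```

An isotropic BAO measurement then constrains the ratio D_V(z)/r_d at the survey's effective redshift.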
Supernovae type Ia (SNIa): This type of supernova occurs in binary star systems in which one of the stars is a white dwarf that accretes matter from its companion. When the white dwarf accumulates sufficient mass (≈1.4 solar masses), its core reaches the ignition temperature for carbon fusion and, within a few seconds, releases enough energy to produce the supernova [86]. Since type Ia supernovae (SNIa) are hypothesized to occur near the same mass limit of 1.4 solar masses, commonly referred to as the Chandrasekhar mass, their luminosity peaks are fairly consistent and can be standardized; thus, they can be used as standard candles [13]. From several analyses of SNIa, the Supernova Cosmology Project and the High-z Supernova Search Team both found evidence that the Universe is currently expanding at an accelerated rate [87-89].
These objects allow us to measure relative distances using the luminosity distance $D_L \equiv \sqrt{L/S}$, where L is the luminosity, defined as the energy emitted per unit solid angle per second, and S is the radiation flux density, defined as the energy received per unit area per second [49]. The observable quantity is the radiation flux density received, and it cannot be translated into the luminosity unless the absolute luminosity of the object is known. Even if the luminosity is unknown, it will appear only as a scaling factor [49]. The relation between D_L and the cosmological parameters is given by $D_L = (1+z)\, D_M(z)$, where D_M is provided by Equation (88). Another important quantity in the observation of supernovae is the standardized distance modulus $\mu = m_B^* - (M_B - \alpha X_1 + \beta C)$, where m*_B is the observed peak magnitude in the rest frame of the blue band (B), and α, β, and M_B are parameters that depend on host galaxy properties [90]. X_1 is the time stretching of the light curve, and C is the supernova color at maximum brightness. The relation between the standardized distance modulus and the luminosity distance is $\mu = 5\log_{10}(D_L/10\,{\rm pc})$. The data used in this paper (shown in Figure 13) are obtained from the Joint Light-curve Analysis (JLA).
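The distance-modulus relation can be sketched numerically as follows; the background parameters are assumed, illustrative values:

```python
import numpy as np
from scipy.integrate import quad

c, H0, Om = 299792.458, 67.7, 0.31   # assumed flat LambdaCDM background
OL = 1.0 - Om

def D_L(z):
    # luminosity distance in Mpc: D_L = (1 + z) * D_M, with D_M = D_C when flat
    E = lambda zp: np.sqrt(Om * (1 + zp) ** 3 + OL)
    D_M = (c / H0) * quad(lambda zp: 1.0 / E(zp), 0.0, z)[0]
    return (1 + z) * D_M

def mu(z):
    # distance modulus mu = 5 log10(D_L / 10 pc); the +25 converts D_L from Mpc
    return 5.0 * np.log10(D_L(z)) + 25.0
```

Comparing mu(z) against the standardized moduli of a supernova sample is what constrains the background parameters.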
It is a collaboration to analyze the data of 740 supernovae from SDSS-II (the previous version of BOSS), the SuperNova Legacy Survey (SNLS [91]) experiment that used the Canada-France-Hawaii Telescope (CFHT), the Calán/Tololo Survey, the Carnegie Supernova Project, the Harvard-Smithsonian Center for Astrophysics, La Silla Observatory, and the Fred Lawrence Whipple Observatory. FIG. 13 shows the Joint Light-curve Analysis (JLA) data: the vertical axis is the standardized distance modulus µ (luminosity distance function) and the horizontal axis is the redshift z. Source: [93].
Although SNIa have been widely used to constrain cosmological models, there are still discussions about the way it is done. That is, in order to extract information from them, an a priori cosmological model has to be assumed, which may bias the outcomes [92].
Cosmic Microwave Background (CMB): This corresponds to the radiation that permeates the whole Universe, discovered in 1965. Before recombination, baryons and photons were tightly coupled; once photons decoupled from the rest of the matter, they traveled uninterrupted until reaching us. The radiation temperature measured at different parts of the sky contains information about the last-scattering epoch, gravitational lensing, among others. The CMB displays the primordial anisotropies, studied through the angular power spectrum. One of the most important recent collaborations studying the CMB is the Planck satellite; previous probes include COBE [94] and WMAP [95]. Planck is a European Space Agency mission whose main objective is to measure the temperature, polarization, and anisotropies of the CMB over the entire sky. These results allow one to determine the properties of the Universe at large scales and the nature of dark matter and dark energy, as well as to test inflationary theories, determine whether the Universe is homogeneous or not, and obtain maps of galaxies in the microwave band [96-103].
In this work, we use the CMB information as a BAO located at redshift z = 1090, measuring the angular scale of the sound horizon D_M(1090)/r_d, in addition to calibrating the absolute length of the BAO ruler through the determination of Ω_b h² and Ω_cb h² (the physical density parameters of baryons and of baryons plus cold dark matter, respectively; more details about these parameters can be found in Reference [79]).
Luminous Red Galaxies (Cosmic Chronometers): These are the most massive galaxies at each redshift z and contain the oldest stellar populations. These kinds of galaxies are used to estimate the Hubble factor because they contain little stardust, which makes it easier to obtain their spectra. These chronometers work by selecting two galaxy samples at different redshifts in the range z ∼ 0-2 and comparing the upper cuts of their age distributions. By doing this, it is possible to obtain the differences of ages ∆t and redshifts ∆z, such that dz/dt can be approximated as $\frac{dz}{dt} \approx \frac{\Delta z}{\Delta t}$. This quantity is related to the Hubble factor via $H(z) = -\frac{1}{1+z}\frac{dz}{dt}$ [104,105]. In this work, the data used to constrain H(z) were obtained from References [106-109]. In Figure 14, a compilation of these data is shown.
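The differential-age idea can be sketched with a toy function; the galaxy ages and redshifts in the usage line are made-up numbers, chosen only to show the sign conventions:

```python
def hubble_from_ages(z1, t1, z2, t2):
    """Cosmic-chronometer estimate of H at the mean redshift.

    z1, z2: redshifts of two passively evolving galaxy samples
    t1, t2: their inferred ages (e.g., in Gyr); returns H in inverse time
    units, using H(z) = -(1/(1+z)) dz/dt with a finite-difference dz/dt.
    """
    z_mean = 0.5 * (z1 + z2)
    dz_dt = (z2 - z1) / (t2 - t1)   # galaxies at higher z are younger, so dz/dt < 0
    return -dz_dt / (1.0 + z_mean)

# toy numbers: ages drop by 1 Gyr between z = 0.2 and z = 0.3
H_est = hubble_from_ages(0.2, 11.0, 0.3, 10.0)   # H in 1/Gyr
```

Since older galaxies sit at lower redshift, the finite difference dz/dt is negative and the estimator returns a positive expansion rate.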
Another important probe in the foreseeable future is gravitational waves from merging black holes (standard sirens), as they directly measure the luminosity distance to the merger. Therefore, it is possible to constrain the distance-redshift relation and, hence, the cosmological parameters [110,111].

PARAMETERS INFERENCE IN COSMOLOGY
Here, we estimate the Hubble parameter H_0 at the present time and the matter density of the Universe Ω_m, assuming the standard ΛCDM model. We use our own Python code, which can be found in Reference [44].
The data.- For this particular example, we focus only on the Cosmic Chronometers, shown in Figure 14.
The model.- Our interest is to fit the density parameters for each component of the Universe, as well as the value of the Hubble parameter at the present time. For this purpose, we use Equation (85) and, for simplicity, we assume a flat Universe with a well-measured radiation content. Then, the model is given by $H(z) = H_0\sqrt{\Omega_{m,0}(1+z)^3 + \Omega_{r,0}(1+z)^4 + \Omega_{\Lambda,0}}$, constrained by the relation $\Omega_{m,0} + \Omega_{r,0} + \Omega_{\Lambda,0} = 1$. Notice that the above relation implies that we can eliminate one parameter, Ω_m,0 or Ω_Λ,0, from the analysis. Then, the parameters we decided to estimate are H_0 and Ω_m,0. By assuming the Gaussian approximation, we construct the likelihood $\mathcal{L} \propto \exp\left[-\frac{1}{2}\sum_i \frac{(H(z_i) - H_i)^2}{\sigma_i^2}\right]$, where H(z_i) is described by Equation (98) evaluated at each redshift z_i, H_i is the value of the Hubble parameter measured at z_i, and σ_i is the error of the i-th measurement.
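In code, the Gaussian likelihood just described can be sketched as a log-likelihood; radiation is neglected here for brevity, an assumption on top of the text's model:

```python
import numpy as np

def ln_likelihood(theta, z, H_obs, sigma):
    """Gaussian log-likelihood for H(z) data under flat LambdaCDM.

    theta = (H0, Om); flatness fixes Omega_Lambda = 1 - Om.
    Returns -chi^2 / 2, dropping parameter-independent constants.
    """
    H0, Om = theta
    H_model = H0 * np.sqrt(Om * (1 + z) ** 3 + (1.0 - Om))
    chi2 = np.sum(((H_model - H_obs) / sigma) ** 2)
    return -0.5 * chi2

# noiseless synthetic check: the log-likelihood peaks (at zero chi^2) on the truth
z_test = np.array([0.1, 0.5, 1.0])
H_test = 70.0 * np.sqrt(0.3 * (1 + z_test) ** 3 + 0.7)
lnL_true = ln_likelihood((70.0, 0.3), z_test, H_test, np.ones(3))
```

Maximizing this quantity over (H_0, Ω_m) is equivalent to minimizing the chi-squared.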
The only a priori information we have about the free parameters is that each component of the Universe must satisfy 0 ≤ Ω_i,0 ≤ 1, while for the present Hubble parameter a conservative prior can be obtained by inspecting the data at our disposal. In such cases, a good prior choice is a uniform distribution (flat prior), here with limits Ω_m ∈ [0, 1] and H_0 ∈ [10, 100]. Numerical estimation.- We follow the same procedure as in the straight-line example. In the left panel of Figure 15, we plot the chains obtained for our estimations and the corresponding 1D posterior distributions. Similar to the above examples, we also plot the 2D posterior distributions with 1-4σ confidence regions in the right panel of the same figure. Additionally, we obtain the mean, standard deviation, and the Gelman-Rubin criterion for each parameter: H_0 = 67.77 ± 3.13 with convergence 1.00045, and Ω_m = 0.331 ± 0.0628 with convergence 1.00044. We can see that our estimations are very similar to the values reported in the literature [85].
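The whole estimation can be reproduced in miniature with a hand-rolled Metropolis-Hastings sampler on synthetic chronometer-like data; everything below (the data, noise level, and step sizes) is an illustrative assumption, not the paper's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic H(z) data drawn from known "true" parameters
z = np.linspace(0.1, 1.8, 20)
H0_true, Om_true = 68.0, 0.31
H_obs = (H0_true * np.sqrt(Om_true * (1 + z) ** 3 + 1 - Om_true)
         + rng.normal(0.0, 5.0, z.size))
sigma = np.full(z.size, 5.0)

def ln_post(theta):
    """Log-posterior: flat priors H0 in [10, 100], Om in [0, 1]; Gaussian likelihood."""
    H0, Om = theta
    if not (10.0 <= H0 <= 100.0 and 0.0 <= Om <= 1.0):
        return -np.inf
    H_model = H0 * np.sqrt(Om * (1 + z) ** 3 + 1 - Om)
    return -0.5 * np.sum(((H_model - H_obs) / sigma) ** 2)

def metropolis(n_steps=20000, step=(1.0, 0.03)):
    theta = np.array([70.0, 0.3])                    # starting point
    lp = ln_post(theta)
    chain = np.empty((n_steps, 2))
    for i in range(n_steps):
        proposal = theta + rng.normal(0.0, step)     # Gaussian proposal kernel
        lp_prop = ln_post(proposal)
        # accept with probability min(1, posterior ratio)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
        chain[i] = theta
    return chain

chain = metropolis()
H0_est, Om_est = chain[5000:].mean(axis=0)   # posterior means after burn-in
```

With reasonable step sizes the post-burn-in means land close to the injected truth, mimicking on toy data what Figure 15 shows for the real chronometer sample.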

Cosmological and statistical codes
As we have shown, during the process of parameter inference for a given model there are three steps to follow. First, obtain the data with which we would like to confront the model parameters. Then, construct the likelihood associated with the theory we are working with; depending on the nature of the data, the likelihood can depend on the parameters in different ways (for example, in the last exercise the likelihood depends on the parameters via Equations (98) and (100)). Finally, program the numerical tools needed to perform the parameter inference. Such programming can be done, for example, in PyMC3, as we saw before.
The above process can become a tremendous programming task, for instance, when the theories involve numerous parameters, when some of them must be marginalized over in order to ignore the non-interesting ones (like nuisance parameters), or when the models depend on the parameters of interest in a complicated way (through differential equations, integrals, etc.). Then, we can proceed in two different ways. The first is to accept the challenge and create our own code; developing a new code may be a better option than using existing ones when the theory we are dealing with is relatively simple, since learning to use and modify an existing code can take a long time. Otherwise, we need to rely on existing codes, principally when the theory of interest is quite complicated, as is usually the case when perturbations are taken into account in the cosmological models. In this section, we present some cosmological and statistical codes used to test cosmological models.

Cosmological codes
Today, there are several cosmological Boltzmann codes available. Some of them include: CMBFAST (written in FORTRAN 77 [126-128]), CMBEASY (C++ [129]), CAMB (FORTRAN 90 [130,131]), CLASS (C [132,133]), and COSMOSIS [134] (written in Python, working as an interface between CLASS, CAMB, MontePython, CosmoMC, and more). All of them are used for calculating the linear CMB anisotropy spectra, based on integrations over the sources along the photon line of sight. CosmoMC [138].- This is a Fortran MCMC engine for exploring the cosmological parameter space. It performs Monte Carlo sampling and importance sampling. It also includes, by default, the likelihoods of the most recent experiments, and it interfaces with CAMB.
SimpleMC.- This is an MCMC code for cosmological parameter estimation in situations where only the expansion history of the Universe matters. It was written by Anže Slosar and José A. Vázquez, initially released in Reference [79], and can be downloaded from Reference [139]. This code solves the cosmological equations for the background parameters in the same way as CLASS or CAMB, and it contains the statistical parameter inference of CosmoMC/MontePython. An advantage of this code is that it is completely written in Python, with an interface to machine learning tools, such as artificial neural networks and genetic algorithms, as well as algorithms to compute the Bayesian evidence, e.g., Dynesty [140] or MCEvidence [141].
The main idea of MCEvidence is that, asymptotically, the number density n of the MCMC samples is proportional to the likelihood multiplied by the prior, that is, to the non-normalized posterior: $\tilde{P}(\theta|D,H) = P(D|\theta,H)P(\theta|H) = a\, n(\theta)$, where $n(\theta) = N\, P(\theta|D,H)$, with N the length of the chain. The code uses Bayesian inference to find a, the constant of proportionality (see Reference [142] for details). Once this constant has been found, the Bayesian evidence is given by P(D|H) = aN.

EXAMPLES WITH SIMPLEMC
The main interest of this section is to test several cosmological models with SimpleMC. Even though we selected this code, the results we present here can also be obtained with any of the aforementioned codes.
Throughout these examples, we consider Gaussian likelihoods for each dataset, of the form $\mathcal{L} \propto \exp\left[-\frac{1}{2}\sum_i \left(\frac{T(z_i) - T_i}{\sigma_i}\right)^2\right]$, where T(z_i) is the theoretical value related to the observation T_i, and σ_i are the corresponding errors associated with each measurement. In our estimation, T(z_i) is given by (94) for Supernovae; (87) for the CMB; (87), (91), and (92) for BAO; and (97) for Cosmic Chronometers. We use the BAO data mentioned in Section 6.3 (labeled BBAO); for Supernovae, we use the Joint Light-Curve Analysis compressed data, denoted SN; the Planck data (denoted Planck) for the CMB; and the Cosmic Chronometers data (HD).

Models of the Universe
The base example corresponds to the flat ΛCDM model, already explored in Section 7, but now including the full set of observations presented in Section 6.3 and using the SimpleMC code. Here, in order to test the Friedmann Equation (85) with the data, we consider as free parameters (along with flat priors): the total matter dimensionless density parameter Ω_m ∈ [0.05, 1.5], the baryon physical density Ω_b h² ∈ [0.02, 0.025], and the dimensionless Hubble constant h ∈ [0.4, 1]. Then, starting from the base ΛCDM model, we let the curvature of the Universe be a free parameter (model oΛCDM), with its corresponding flat prior Ω_k ∈ [−1.5, 1.5]. Moreover, because the cosmological constant is only a particular case of the dark energy equation of state, ω = −1, we let ω be a free parameter with flat prior ω ∈ [−2.0, 0.0] and label it the ωCDM model. We may combine the addition of curvature and constant ω to define the oωCDM model. In order to go even further and describe a dynamical dark energy, we use the CPL parameterization for the equation of state, with flat priors on the parameters ω_0 ∈ [−2.0, 0.0] and ω_a ∈ [−2.0, 2.0], and label it the ω_0ω_aCDM model. Again, we can incorporate the curvature of the Universe into the CPL parameterization, naming the result oω_0ω_aCDM. Using the combined dataset BBAO+Planck+SN+HD, Table V shows the best-fit values, along with 1σ confidence levels. The first important result to highlight is how the constraints shrink once new information is taken into account. That is, in Section 7, for the ΛCDM model and using only Hubble data, we had h = 0.677 ± 0.031 and Ω_m = 0.331 ± 0.0628. Now, with the inclusion of the BAO, Planck, and SN data, the constraints have improved considerably to h = 0.684 ± 0.006 and Ω_m = 0.299 ± 0.007. Figure 15 is updated with the new data in order to obtain Figure 16. Here, the upper panel displays the chain for the parameter H_0 = 100h, with 9000 steps. In the lower panel of the same figure, we plot the 2D posterior distribution, along with 1 and 2σ confidence regions obtained from our estimations.
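To see what the CPL extension changes at the background level, the dark energy density can be integrated analytically; this is a standard result, sketched here with assumed parameter values:

```python
import numpy as np

def rho_de_ratio(a, w0=-1.0, wa=0.0):
    """rho_DE(a)/rho_DE,0 for the CPL equation of state w(a) = w0 + wa (1 - a)."""
    return a ** (-3.0 * (1.0 + w0 + wa)) * np.exp(-3.0 * wa * (1.0 - a))

def E2(z, Om=0.31, Ok=0.0, w0=-1.0, wa=0.0):
    """(H/H0)^2 for an o-w0waCDM background (radiation neglected)."""
    a = 1.0 / (1.0 + z)
    Ode = 1.0 - Om - Ok
    return Om * (1 + z) ** 3 + Ok * (1 + z) ** 2 + Ode * rho_de_ratio(a, w0, wa)
```

For w0 = −1 and wa = 0 the density ratio is identically one and the ΛCDM limit is recovered, which is exactly why the extra parameters can only improve the fit.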
From Table V, we also observe that the best-fit values of most of the new parameters, additional to the base ΛCDM model, remain well inside the 1σ confidence level. The exception is the oω_0ω_aCDM model, where Ω_k, ω_0, and ω_a lie just outside the 1σ region. This feature is better observed in Figure 17, where the 2D posterior distributions, along with 1 and 2σ confidence regions, are shown. Here, the standard ΛCDM values are marked with dashed lines (Ω_k = 0, ω = ω_0 = −1, ω_a = 0). The inclusion of extra parameters improves the fit to the data, as observed through the minimum χ²_min in the row before the last of Table V. However, it also carries a penalization factor that directly affects the model selection, as seen in References [75,77,143].
In the last row of the same table, we show the Bayes factor for the extended models using the ΛCDM model as reference, where the Bayesian evidences were obtained using the MCEvidence code. We found significant evidence in favor of ΛCDM compared to ωCDM and ω_0ω_aCDM; strong evidence with respect to oΛCDM and oω_0ω_aCDM; and decisive evidence against oωCDM. This is in agreement with the results presented in Reference [144], where Planck 2013 found no evidence against ΛCDM.

CONCLUSIONS AND DISCUSSION
The fact that the number of cosmological observations has increased impressively over the last decades has allowed us to obtain a better description of the Universe. However, since we still face the limitation of having a unique Universe, a frequentist approach may not be the best to rely on; hence, Bayesian statistics comes into consideration. In this work, we provide a review of Bayesian statistics and present some of its applications to cosmology, mainly through several examples.
Bayesian statistics rests on the rules of probability, which yield Bayes' theorem. Given a model or hypothesis H for some data D, Bayes' theorem tells us how to determine the probability distribution of the set of parameters θ, $P(\theta|D,H) = \frac{P(D|\theta,H)P(\theta|H)}{P(D|H)}$, where the prior probability P(θ|H) (the state of knowledge before acquiring the data) is updated through the likelihood P(D|θ,H) when experimental data D are considered. The aim of parameter estimation is then to obtain the posterior probability P(θ|D,H), which represents the state of knowledge once we have taken the new information into account. We noticed that, if the prior probability is constant, we can identify the posterior probability with the likelihood, P(θ|D,H) ∝ L(D|θ,H); thus, by maximizing it, we can find the most probable set of parameters for a model given the data. Moreover, if we assume a Gaussian approximation for the likelihood, then the chi-squared quantity is related to the Gaussian likelihood via L = L_0 e^{−χ²/2}; therefore, maximizing the Gaussian likelihood is equivalent to minimizing the chi-squared. Once the posterior distribution for the set of parameters of a given model is calculated, we show the results in the form of confidence regions for those parameters. In addition, for the particular case in which the likelihoods are Gaussian, the Fisher matrix can be computed from the Hessian matrix, which contains information about the errors of the parameters and their covariances. The Fisher matrix gives information about the accuracy of the model and allows us to predict how well an experiment will be able to constrain the set of parameters of a given model.
On the other hand, it is sometimes difficult to know a priori whether multiple datasets are consistent with each other, or whether one or more of them are likely to be erroneous. Given this uncertainty, a way to assess how useful a dataset is may be to introduce the hyperparameter method. These hyperparameters act as weights for every dataset, in order to down-weight data that do not seem to be consistent with the rest. Here, the key quantity for estimating the necessity of introducing hyperparameters into our model is the Bayesian evidence P(D|H).
The estimation of the posterior distribution is a very computationally demanding process, since it requires a multidimensional exploration of the likelihood and prior. To carry out the exploration of the cosmological parameter space, we focus on Markov Chain Monte Carlo methods with the Metropolis-Hastings algorithm. A Markov process is a stochastic process (one that aims to describe the temporal evolution of some random phenomenon) in which the probability distribution of the immediate future state depends only on the present state, and any computational algorithm that uses random numbers is called Monte Carlo. The Metropolis-Hastings algorithm uses a transition kernel to construct a sequence of points (called a chain) in the parameter space in order to evaluate the posterior distribution of the parameters. The generation of the elements in a Markov chain is probabilistic by construction, and it depends on the algorithm we are working with. The MHA is the simplest algorithm used in Bayesian inference; however, to explore complex posterior distributions more efficiently, we also provide a brief description of several other samplers.
Finally, we show that Bayesian statistics is a very useful tool in cosmology to determine, for instance, the combination of model parameters that best describes the Universe. In particular, we confront the standard cosmological model (ΛCDM) with current observations and compare it to different models. We found that the model that best fits the data, through χ²_min, corresponds to a curved Universe with a dynamical dark energy, namely oω_0ω_aCDM. It is important to clarify that this does not mean that the oω_0ω_aCDM model is the final model; it merely shows that, for these particular datasets, there is an improvement in the fit compared to ΛCDM, but further analysis and more data are necessary to give a verdict. Moreover, adding extra parameters brings a penalization factor, seen through the Bayesian evidence; hence, the favored model becomes ΛCDM.

FIG. 1 .
FIG. 1. The coin example: the blue curve displays the prior distribution P(p), which is updated, when the data are taken into account, to obtain the posterior distribution P(p|D) (red). The vertical black line corresponds to the true value, p = 0.5.

FIG. 3 .
FIG. 3. Converging views in Bayesian inference (taken from [5]). A and B have different priors P(θ|I_i) for a value θ (panel (a)). Then, they observe one datum with an apparatus subject to Gaussian noise and obtain a likelihood L(θ; HI) (panel (b)), after which their posteriors P(θ|m_1) are obtained (panel (c)). After observing 100 data points, both posteriors are practically indistinguishable (panel (d)).

FIG. 4 .
FIG. 4. 1D posterior distribution for our example. We plot the prior distribution (red), the true posterior (dashed black), and the posterior calculated by the MHA (blue), with 1, 2, and 3σ confidence regions for the estimation of p.

FIG. 6 .
FIG. 6. Two Markov chains considering different variances for our Gaussian proposal distribution. The upper panel corresponds to σ = 0.002, while the lower panel corresponds to σ = 0.8.

FIG. 7 .
FIG. 7. Datasets D_1 and D_2 described by the straight-line theory.

FIG. 8 .
FIG. 8. Top panel: 1D marginalized posterior distributions for our samples and the Markov chains for model H_0. Bottom panel: 2D marginalized posterior distributions along with 1-4σ confidence regions for our parameters for model H_0. The red point corresponds to the true value.

FIG. 11 .
FIG. 11. Upper panel: confidence regions for the parameters in model H_0. Middle panel: confidence regions for the parameters in model H_1. Bottom panel: best-fit straight lines for Case 2, inferred with the datasets shown.

FIG. 12 .
FIG. 12. BAO Hubble diagram. BAO measurements of D_V/r_d, D_M/r_d, and zD_H/r_d from the sources indicated in the legend. The scaling factor √z is included for a better display of the error bars. Solid lines are plotted using the best-fit values obtained by the Planck satellite [85]. The Lyα cross-correlation points have been shifted in redshift; auto-correlation points are plotted at the correct effective redshift.

FIG. 14 .
FIG. 14. Hubble data as a function of redshift z for the Cosmic Chronometers. The solid line corresponds to the best fit using the ΛCDM model.

FIG. 15 .
FIG. 15. Results for the ΛCDM model. Upper panel: 1D posterior distributions for the parameters H_0 and Ω_m, along with their chains. Lower panel: joint 2D posterior distributions with 1-4σ confidence levels.

FIG. 16 .
FIG. 16. Top panel: Markov chain for the parameter H_0, with 9000 steps. Bottom panel: 2D posterior distribution with 1 and 2σ confidence regions for the joint parameters H_0 and Ω_m.

FIG. 17 .
FIG. 17. 2D posterior distributions with 1 and 2σ confidence regions for the different models. oΛCDM refers to ΛCDM with curvature; ωCDM is a flat Universe with the dark energy equation of state as a free parameter; oωCDM generalizes it to non-zero curvature; ω_0ω_aCDM uses the CPL parameterization; and oω_0ω_aCDM generalizes it to non-zero curvature. The dashed lines show the standard ΛCDM values.

TABLE I .
Main differences between the Bayesian and Frequentist interpretations.

TABLE II .
Jeffreys guideline scale for evaluating the strength of evidence when two models are compared.

TABLE III .
∆χ² for the conventional 68.3%, 95.4%, and 99.73% joint confidence levels as a function of the number of parameters (M).

TABLE IV .
Equation of state associated with each component of the Universe.
Once the cosmological model is established, we need a statistical code to estimate the free parameters of our model. There are several MCMC codes that can make this task easy to handle. Some of them are: MontePython [135-137].- This is a Monte Carlo code for cosmological parameter extraction that contains the likelihoods of the most recent experiments and interfaces with the Boltzmann code CLASS for computing the cosmological observables. The code has several sampling methods available: Metropolis-Hastings, Nested Sampling (through MultiNest), EMCEE (through CosmoHammer), and Importance Sampling.