Generalisations of Fisher Matrices

Fisher matrices play an important role in experimental design and in data analysis. Their primary role is to make predictions for the inference of model parameters - both their errors and covariances. In this short review, I outline a number of extensions to the simple Fisher matrix formalism, covering a number of recent developments in the field. These are: (a) situations where the data (in the form of (x,y) pairs) have errors in both x and y; (b) modifications to parameter inference in the presence of systematic errors, or through fixing the values of some model parameters; (c) Derivative Approximation for LIkelihoods (DALI) - higher-order expansions of the likelihood surface, going beyond the Gaussian shape approximation; (d) extensions of the Fisher-like formalism, to treat model selection problems with Bayesian evidence.


Introduction
Fisher information matrices are widely used for making predictions for the errors and covariances of parameter estimates. They characterise the expected shape of the likelihood surface in parameter space, subject to an assumption that the likelihood surface is a multivariate Gaussian when viewed as a function of the model parameters. Diagonal terms are the inverse variances of the parameters, conditional on all others being known, and non-zero off-diagonal terms indicate correlations between inferred parameters. Diagonal terms of the inverse Fisher matrix yield the variances of parameters when all others are marginalised over. The Cramér-Rao inequality shows that the variances deduced from the Fisher matrix are lower bounds.
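The distinction between conditional and marginal errors can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical two-parameter Fisher matrix whose entries are chosen purely for illustration.

```python
import numpy as np

# Hypothetical 2-parameter Fisher matrix (illustrative numbers only).
F = np.array([[4.0, 3.0],
              [3.0, 9.0]])

# Conditional errors: each parameter with the other assumed known.
sigma_conditional = 1.0 / np.sqrt(np.diag(F))

# Marginal errors: the other parameter is marginalised over.
cov = np.linalg.inv(F)
sigma_marginal = np.sqrt(np.diag(cov))

# Correlation coefficient between the two inferred parameters.
rho = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
```

The marginal errors are never smaller than the conditional ones, and they coincide only when the off-diagonal terms vanish.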
Fisher matrices have been extensively used in cosmology, where future experiments have been designed in order to deduce as precisely as possible the parameters of the standard cosmological model, so-called ΛCDM (Cold Dark Matter, with a cosmological constant Λ), and are routinely used to give "figures-of-merit" [1] for the power of each experiment. Normally, these studies are standard applications of Fisher matrix theory, often simplified by an approximation (which is very good for observations of the Early Universe) that the data are Gaussian-distributed.
In this article, I review a number of generalisations of the Fisher matrix approach. In Section 2 the derivation of the Fisher matrix for Gaussian data is sketched out; in Section 3 we consider Fisher matrices for data pairs that have errors in both x and y; in Section 4 we show how Fisher matrices may be used to estimate biases when some parameters are fixed at incorrect values; in Section 5 we explore better approximations for the likelihood surface ("DALI"), from expansions to higher order in derivatives, and in Section 6 we generalise the use of Gaussian likelihood surfaces to model selection and Bayesian evidence.

Gaussian Fields
In cosmology, one is very often dealing with Gaussian random fields, which are characterised statistically entirely by their mean and covariance. A pedagogical derivation of the Fisher matrix when the data $y$ are Gaussian appears in [2]. The negative log-likelihood $\mathcal{L} \equiv -\ln L$ is, up to an additive constant,

$$2\mathcal{L} = \ln\det C + (y-\mu)^T C^{-1} (y-\mu), \qquad (1)$$

where in general both the mean vector $\mu$ and the covariance matrix $C = \langle (y-\mu)(y-\mu)^T \rangle$ depend on the model parameters $\theta$. If $y$ represents 1-point statistics, such as Fourier coefficients, then typically $\mu = 0$, and all the parameter dependence is in $C$. If $y$ represents 2-point statistics, then for Gaussian fields they have only approximately a Gaussian distribution, and the analysis is only approximately correct. In this case, the covariance matrix has some parameter dependence through the 4-point function, which for Gaussian fields can be written as products of the 2-point function.
The Gaussian assumption is widely applicable in cosmology, since the quantum fluctuations that are thought to give rise to the density and radiation fields should ensure this, and limits on departures from Gaussianity are very tight [3]. Defining the data matrix $D \equiv (y-\mu)(y-\mu)^T$ and using the matrix identity for positive definite square matrices $\ln\det C = \mathrm{Tr}\ln C$, where Tr indicates trace, we can re-write (1) as

$$2\mathcal{L} = \mathrm{Tr}\left[\ln C + C^{-1} D\right]. \qquad (2)$$

Using standard comma notation for partial derivatives, $Z_{,\alpha} \equiv \partial Z/\partial\theta_\alpha$, and using the matrix identities $(C^{-1})_{,\alpha} = -C^{-1} C_{,\alpha} C^{-1}$ and $(\ln C)_{,\alpha} = C^{-1} C_{,\alpha}$, we find after taking two derivatives and then the expectation value,

$$F_{\alpha\beta} \equiv \langle \mathcal{L}_{,\alpha\beta} \rangle = \frac{1}{2}\mathrm{Tr}\left[C^{-1} C_{,\alpha} C^{-1} C_{,\beta} + C^{-1} M_{\alpha\beta}\right], \qquad (3)$$

where $M_{\alpha\beta} \equiv \mu_{,\alpha}\mu_{,\beta}^T + \mu_{,\beta}\mu_{,\alpha}^T$.

The great advantage of the Fisher matrix approach is seen in this example: no data (real or simulated) are required to compute the expected log-likelihood surface, only the statistical properties of the data. This can be a big advantage if simulation is computationally expensive.
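The Gaussian Fisher matrix can be evaluated directly from the derivatives of the mean and covariance. The following is a minimal sketch, assuming a hypothetical straight-line model $\mu_i = a + b x_i$ with a parameter-independent covariance; the function name `gaussian_fisher` and all numerical values are illustrative assumptions, not taken from the references.

```python
import numpy as np

def gaussian_fisher(C, dC, dmu):
    """F_ab = 1/2 Tr[C^-1 C_,a C^-1 C_,b + C^-1 M_ab],
    with M_ab = mu_,a mu_,b^T + mu_,b mu_,a^T."""
    Cinv = np.linalg.inv(C)
    n_par = len(dmu)
    F = np.zeros((n_par, n_par))
    for a in range(n_par):
        for b in range(n_par):
            M = np.outer(dmu[a], dmu[b]) + np.outer(dmu[b], dmu[a])
            F[a, b] = 0.5 * np.trace(Cinv @ dC[a] @ Cinv @ dC[b] + Cinv @ M)
    return F

# Hypothetical setup: mu = a + b*x, uncorrelated errors of fixed size.
x = np.linspace(0.0, 1.0, 5)
sigma = 0.1
C = sigma**2 * np.eye(x.size)
dmu = [np.ones_like(x), x]                 # d mu / d a, d mu / d b
dC = [np.zeros_like(C), np.zeros_like(C)]  # C does not depend on (a, b)

F = gaussian_fisher(C, dC, dmu)
```

When the covariance is independent of the parameters, only the second term contributes and the result reduces to $\mu_{,\alpha}^T C^{-1} \mu_{,\beta}$. No data values enter the calculation.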

Fisher Matrix with Errors in x as Well as y
The previous section gives the standard analysis where only the covariance of the y values is considered. Let us now consider the fairly general case where the data consist of pairs (X, Y), with errors in both X and Y. We can compute the Fisher matrix via the application of a Bayesian hierarchical model, provided that the errors in X are small (this will be defined later). The full analysis is given in [4].
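We assume $X$ and $Y$ are vectors of length $m$ and $n$ respectively (for data pairs, $m = n$, but the analysis is more general), with Gaussian errors around true values $x$, $y$ and a covariance matrix $C$ that also allows correlations between $X$ and $Y$. The true values $x$ and $y$ are not observed; they are latent variables, and are essentially nuisance parameters. In fact the $y$ are not independent nuisance parameters, as they are assumed to be related precisely to $x$ through a deterministic theoretical model $y = \mu(x)$ (however, a stochastic element could easily be included). Given the observed data $X$, $Y$, we seek the posterior $p(\theta|X,Y)$. With a uniform prior for $\theta$, this is proportional to the likelihood $L = p(X,Y|\theta)$. We write this as the marginalised distribution over $x$ and $y$:

$$L = \int p(X,Y,x,y|\theta)\,dx\,dy = \int p(X,Y|x,y)\,p(y|x,\theta)\,p(x|\theta)\,dx\,dy.$$

A deterministic $y(x)$ relation gives a delta function, $p(y|x,\theta) = \delta(y - \mu(x))$, and assuming a uniform prior for $x$ (a more general prior is considered in [4]), integration over $y$ gives

$$L \propto \int p(X,Y|x,y=\mu(x))\,dx.$$

We now assume that the errors in $X$ are small, by which we mean that the Taylor expansion of $\mu$ can be truncated at the linear term:

$$\mu(x) \simeq \mu(X) + T^T (x - X), \qquad T_{ij} \equiv \left.\frac{\partial\mu_j}{\partial x_i}\right|_{X},$$

where the $m \times n$ matrix $T$ is diagonal for data consisting of $X$, $Y$ pairs. We assume a multivariate Gaussian for $p(X,Y|x,y)$ (independent of $\theta$), and write the covariance matrix of the data in block form as

$$C = \begin{pmatrix} C_{XX} & C_{XY} \\ C_{XY}^T & C_{YY} \end{pmatrix}.$$

The $m \times n$ block $C_{XY}$ is not symmetrical, nor invertible or even square in general, although $C_{XX}$ and $C_{YY}$ are. The covariance matrix may include a number of elements, such as intrinsic scatter and measurement noise, with individual covariance matrices adding to give the final $C$. We also assume that the function $\mu(x)$ is linear across the width of the Gaussian error distribution of $x$, in which case the likelihood may be integrated analytically, as follows. We write $\tilde{x} \equiv x - X$ and $\tilde{Y} \equiv Y - \mu(X)$, and let $Z \equiv (X, Y)$ and $z \equiv (x, \mu(x))$ be $(m+n)$-dimensional vectors, so that

$$L \propto \int \exp\left[-\frac{1}{2}(Z-z)^T C^{-1} (Z-z)\right] d^m\tilde{x}. \qquad (9)$$

The inverse of $C$ in block form is

$$C^{-1} = \begin{pmatrix} U & V \\ V^T & E \end{pmatrix},$$

and, defining $Q \equiv (Z-z)^T C^{-1} (Z-z)$ with $Z - z = (-\tilde{x},\ \tilde{Y} - T^T\tilde{x})$, we find that $Q$ has the quadratic form

$$Q = \tilde{x}^T A\,\tilde{x} + 2\tilde{x}^T P\,\tilde{Y} + \tilde{Y}^T E\,\tilde{Y}, \qquad (12)$$

where

$$A \equiv U + V T^T + T V^T + T E T^T, \qquad P \equiv -(V + T E).$$

With the definition of $Q$ in Equation (12), the Gaussian integral of Equation (9) can be performed, using

$$\int \exp\left(-\frac{1}{2}\tilde{x}^T A\,\tilde{x} + B^T\tilde{x}\right) d^m\tilde{x} = \sqrt{\frac{(2\pi)^m}{\det A}}\,\exp\left(\frac{1}{2} B^T A^{-1} B\right)$$

with $B = -P\tilde{Y}$, and noting that the coefficient matrices $A$, $P$ and $E$ in $Q$ are independent of $\tilde{x}$.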
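The likelihood then simplifies to

$$L \propto \frac{1}{\sqrt{\det R}}\exp\left(-\frac{1}{2}\tilde{Y}^T R^{-1}\tilde{Y}\right),$$

where $\tilde{Y} \equiv Y - \mu(X)$ and the inverse of the marginal covariance matrix of $\tilde{Y}$ is $R^{-1} = E - P^T A^{-1} P$. Using the Woodbury formula [5], this gives

$$R = C_{YY} - C_{XY}^T T - T^T C_{XY} + T^T C_{XX} T,$$

with $T_{ij} = \partial\mu_j/\partial x_i$. This is a key result. We see that this looks just like a standard Gaussian (in terms of data) likelihood, but with the covariance matrix $C$ ($C_{YY}$ in our current notation) replaced by $R$. Hence to compute the Fisher matrix, we can use the standard formula found in Equation (3) and Equation (15) of [2], and simply replace $C$ by $R$:

$$F_{\alpha\beta} = \frac{1}{2}\mathrm{Tr}\left[R^{-1} R_{,\alpha} R^{-1} R_{,\beta} + R^{-1} M_{\alpha\beta}\right],$$

with $M_{\alpha\beta} = \mu_{,\alpha}\mu_{,\beta}^T + \mu_{,\beta}\mu_{,\alpha}^T$ evaluated at $X$. Note that $R$ depends not only on the standard covariance, but also on the covariance in the independent variable, $C_{XX}$, the meta-covariance, $C_{XY}$, and the first partial derivatives of the model function $\mu$. In the case of uncorrelated data pairs, the result reduces to that found in [6]. For the simple case of no correlations between $X$ and $Y$ values, $R = C_{YY} + T^T C_{XX} T$, and with diagonal covariance matrices $C_{YY}$ and $C_{XX}$ we recover the propagation-of-error result that the variance of $f \equiv Y - \mu(X)$ for each data point is effectively

$$\sigma_f^2 = \sigma_Y^2 + \mu'^2 \sigma_X^2,$$

where $\mu' = \partial\mu/\partial x$, and $C$ can be replaced in the standard Fisher expression, Equation (3), by a diagonal $n \times n$ matrix with these enhanced entries.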
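The replacement of $C$ by $R$ is straightforward to carry out numerically. The following is a minimal sketch, assuming a hypothetical straight-line model $\mu(x) = a + b x$ with uncorrelated errors in both coordinates; the helper name `effective_covariance` and all numerical values are illustrative assumptions. Because $T$ depends on the slope $b$, the matrix $R$ acquires a parameter dependence even though the input covariances are fixed.

```python
import numpy as np

def effective_covariance(C_XX, C_XY, C_YY, T):
    """R = C_YY - C_XY^T T - T^T C_XY + T^T C_XX T,
    with T_ij = d mu_j / d x_i (an m x n matrix)."""
    return C_YY - C_XY.T @ T - T.T @ C_XY + T.T @ C_XX @ T

# Hypothetical fiducial model mu(x) = a + b*x and data layout.
a, b = 1.0, 2.0
X = np.linspace(0.0, 1.0, 5)
n = X.size
sigma_x, sigma_y = 0.05, 0.1
C_XX = sigma_x**2 * np.eye(n)
C_YY = sigma_y**2 * np.eye(n)
C_XY = np.zeros((n, n))        # no X-Y correlations in this example
T = b * np.eye(n)              # d mu_j / d x_i = b * delta_ij

R = effective_covariance(C_XX, C_XY, C_YY, T)
Rinv = np.linalg.inv(R)

# Derivatives with respect to the parameters (a, b).
dmu = [np.ones(n), X]
dR = [np.zeros((n, n)), 2.0 * b * sigma_x**2 * np.eye(n)]

# Standard Gaussian Fisher matrix with C replaced by R.
F = np.array([[0.5 * np.trace(Rinv @ dR[i] @ Rinv @ dR[j])
               + dmu[i] @ Rinv @ dmu[j]
               for j in range(2)] for i in range(2)])
```

In this uncorrelated case the diagonal of `R` is $\sigma_Y^2 + b^2\sigma_X^2$ for every point, which is the propagation-of-error variance.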

Generalising Still Further
The analysis above is applicable not just to the simple case of data with errors in x as well as y, but to any system where the 'data' y depend (in a locally linear way) on any parameters x that have some error associated with them.

Systematic Errors, or Errors from Simplified Nested Models
The Fisher matrix can also be useful to determine the errors in parameter inference that arise if one parameter is fixed at an erroneous value. This could arise in a number of contexts, such as a nuisance parameter (e.g., a calibration setting) being fixed at an incorrect value, or when considering nested models. An example of the latter would be cosmological models where the Universe is assumed to be flat. This is an example of a nested model, being a subset of a more general model, but with the curvature parameter (usually given the symbol $\Omega_k$) set to zero. In these cases, the maximum likelihood values of all the other parameters are, in general, shifted from their maximum likelihood values in the more general model. See Figure 1 for an illustration of this in two dimensions. With the usual Fisher assumption that the likelihood surface is a Gaussian function of the parameters, these shifts can be computed using the Fisher matrix. We consider two models: $M$, which has $n = n' + p$ parameters, and a simpler nested model $M'$, which has $n'$. The $p$ extra parameters are designated $\psi_\zeta$, and these are fixed in $M'$ at values that are $\delta\psi_\zeta$ from their maximum likelihood values in $M$. In this case, the maximum likelihood values of the other parameters $\theta$ of $M'$ are systematically shifted by [7,8]

$$\delta\theta_\alpha = -\left(F'^{-1}\right)_{\alpha\beta} G_{\beta\zeta}\,\delta\psi_\zeta, \qquad \alpha, \beta = 1, \ldots, n'; \quad \zeta = 1, \ldots, p, \qquad (19)$$

where $F'$ is the $n' \times n'$ Fisher matrix of $M'$, repeated indices are summed over, and

$$G_{\beta\zeta} = \frac{1}{2}\mathrm{Tr}\left[C^{-1} C_{,\beta} C^{-1} C_{,\zeta} + C^{-1} M_{\beta\zeta}\right], \qquad (20)$$

with $M_{\beta\zeta} = \mu_{,\beta}\mu_{,\zeta}^T + \mu_{,\zeta}\mu_{,\beta}^T$, which we recognise as a subset of the Fisher matrix of the full model $M$.
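The shift formula can be checked against a direct fit in a case where the Gaussian assumption is exact. The following is a minimal sketch, assuming a hypothetical quadratic model $\mu = a + b x + c x^2$ in which the simpler model fixes $c = 0$; all numerical values are illustrative assumptions. Since the model is linear in its parameters and the covariance is fixed, the Fisher matrix is $J^T C^{-1} J$, with $J$ the matrix of derivatives of $\mu$.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 20)
sigma = 0.1
Cinv = np.eye(x.size) / sigma**2

# Derivatives of mu = a + b*x + c*x^2 with respect to (a, b, c).
J = np.column_stack([np.ones_like(x), x, x**2])

F_full = J.T @ Cinv @ J      # 3 x 3 Fisher matrix of the full model M
F_sub = F_full[:2, :2]       # F' for the nested model M' (c fixed)
G = F_full[:2, 2:]           # n' x p block linking (a, b) to c

# M' fixes c = 0, while the maximum likelihood value in M is c_true.
c_true = 0.3
delta_psi = np.array([0.0 - c_true])

# Predicted systematic shift of (a, b).
delta_theta = -np.linalg.solve(F_sub, G @ delta_psi)

# Direct check: fit the nested model to noise-free data drawn from M.
theta_true = np.array([1.0, 2.0])
y = J @ np.append(theta_true, c_true)
theta_fit = np.linalg.solve(F_sub, J[:, :2].T @ Cinv @ y)
```

For this linear example `theta_fit - theta_true` agrees with `delta_theta` to numerical precision. For non-linear models the formula holds only to the extent that the likelihood surface is Gaussian in the parameters.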

Beyond the Gaussian Approximation-DALI
The Fisher matrix approach assumes that the likelihood surface is a multivariate Gaussian, which will be asymptotically true near the peak, but may not be a good approximation over the range of parameter values of interest. A generalisation of the Fisher matrix is DALI, Derivative Approximation for LIkelihoods [9], which expands the likelihood surface to include higher-order derivatives than the second. This is a rather elegant expansion, in derivatives rather than parameters, that ensures that the approximate distribution is a genuine probability distribution-i.e., it is non-negative and normalisable, non-divergent and asymptotically approaches the true likelihood.
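The starting point is a Taylor expansion of the log-likelihood about its maximum:

$$L \simeq L_0 \exp\left[-\frac{1}{2} F_{\alpha\beta}\Delta\theta_\alpha\Delta\theta_\beta - \frac{1}{3!} S_{\alpha\beta\gamma}\Delta\theta_\alpha\Delta\theta_\beta\Delta\theta_\gamma - \frac{1}{4!} Q_{\alpha\beta\gamma\delta}\Delta\theta_\alpha\Delta\theta_\beta\Delta\theta_\gamma\Delta\theta_\delta\right],$$

where $\Delta\theta_\alpha$ is the offset from the maximum, $L_0$ is a normalisation constant and $F_{\alpha\beta} = \mathcal{L}_{,\alpha\beta}$, $S_{\alpha\beta\gamma} = \mathcal{L}_{,\alpha\beta\gamma}$ and $Q_{\alpha\beta\gamma\delta} = \mathcal{L}_{,\alpha\beta\gamma\delta}$, with $\mathcal{L} \equiv -\ln L$ and all derivatives evaluated at the maximum.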
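If the expansion is arranged in order of derivatives, the expressions are normalisable and positive-definite. For example, to second order in the $\mu$ derivatives, and assuming $C$ is independent of $\theta$, we have

$$L \simeq L_0 \exp\left[-\frac{1}{2}\mu_{,\alpha}^T C^{-1}\mu_{,\beta}\,\Delta\theta_\alpha\Delta\theta_\beta - \frac{1}{2}\mu_{,\alpha\beta}^T C^{-1}\mu_{,\gamma}\,\Delta\theta_\alpha\Delta\theta_\beta\Delta\theta_\gamma - \frac{1}{8}\mu_{,\alpha\beta}^T C^{-1}\mu_{,\gamma\delta}\,\Delta\theta_\alpha\Delta\theta_\beta\Delta\theta_\gamma\Delta\theta_\delta\right].$$

The exponent is $-\frac{1}{2} v^T C^{-1} v$ with $v = \mu_{,\alpha}\Delta\theta_\alpha + \frac{1}{2}\mu_{,\alpha\beta}\Delta\theta_\alpha\Delta\theta_\beta$, so it can never be positive and the approximate likelihood cannot diverge. This is apparently true at every order (see [9] for the third-order expansion, and [10] for the case where the parameter dependence is in $C$). Figure 2 shows the improvement in the expected likelihood surfaces for a supernova cosmology experiment.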
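The effect of the second-derivative terms can be seen in a one-parameter toy problem. The following is a minimal sketch, assuming a hypothetical model $\mu_i(\theta) = \exp(\theta x_i)$ with fixed, uncorrelated errors; it compares the Fisher and second-order DALI approximations with the exact noise-free negative log-likelihood. All names and numerical values are illustrative assumptions.

```python
import numpy as np

x = np.linspace(0.1, 1.0, 10)
sigma = 0.1
theta0 = 1.0   # fiducial parameter value

def mu(theta):
    """Hypothetical non-linear model mu_i = exp(theta * x_i)."""
    return np.exp(theta * x)

# First and second derivatives of mu at the fiducial point.
d1 = x * np.exp(theta0 * x)
d2 = x**2 * np.exp(theta0 * x)

def half_chi2(v):
    """1/2 v^T C^-1 v for C = sigma^2 * identity."""
    return 0.5 * np.sum(v**2) / sigma**2

delta = np.linspace(-0.5, 0.5, 41)

# Exact noise-free negative log-likelihood, relative to its minimum.
L_exact = np.array([half_chi2(mu(theta0 + d) - mu(theta0)) for d in delta])

# Fisher approximation: first derivatives only.
L_fisher = np.array([half_chi2(d1 * d) for d in delta])

# DALI to second order in derivatives: still a perfect square.
L_dali = np.array([half_chi2(d1 * d + 0.5 * d2 * d**2) for d in delta])

err_fisher = np.abs(L_fisher - L_exact)
err_dali = np.abs(L_dali - L_exact)
```

Expanding the square in `L_dali` reproduces the quadratic, cubic and quartic terms with coefficients 1/2, 1/2 and 1/8. Keeping the square intact is what guarantees that the approximation is non-negative everywhere.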

The Expected Bayesian Evidence-Generalising Fisher Matrices to Model Selection
At the root of the Fisher matrix formalism is the Laplace approximation, i.e., the assumption that the likelihood surface is a multivariate Gaussian when viewed as a function of the model parameters. We can generalise this to the higher-level question of model selection, where we compute the posterior probabilities of different models, given the data collected, but regardless of the model parameters $\theta$. The ratio of these probabilities is the ratio of the prior model probabilities, multiplied by the "Bayes factor", which is the ratio of the marginal likelihoods (or Bayesian evidence) of the models, where the evidence for a model $M$ given data $y$ is

$$p(y|M) = \int p(y|\theta, M)\,p(\theta|M)\,d\theta.$$

With the Laplace approximation for the first, likelihood term, and a uniform prior (which can be generalised to a Gaussian prior), we can compute the expected evidence (conditional on some fiducial set of parameters) by performing Gaussian integrals. For nested models (with $n'$ and $n = n' + p$ parameters respectively), the considerations of Section 4 on the locations of the peak likelihood are relevant, and the result depends on the shifts of the fiducial parameters away from the values that are fixed in the lower-dimensional model, $\delta\psi_\zeta$. If we further approximate that the expected Bayes factor is the ratio of the expected evidences, then the expected Bayes factor for the simpler model $M'$ relative to $M$ is (see [7] for details)

$$\langle B \rangle = (2\pi)^{-p/2}\,\frac{\sqrt{\det F}}{\sqrt{\det F'}}\,\exp\left(-\frac{1}{2}\delta\theta_\alpha F_{\alpha\beta}\,\delta\theta_\beta\right)\prod_{q=1}^{p}\Delta\theta_{n'+q},$$

where $\Delta\theta_{n'+q}$ are the prior ranges of the additional $p$ parameters in the extended model, and the offsets $\delta\theta_\alpha$ are given by Equation (19) for $\alpha \le n'$ and by $\delta\psi_{\alpha-n'}$ for $\alpha > n'$. Note that $F$ is an $n \times n$ matrix, $F'$ is $n' \times n'$, and $G$ is an $n' \times p$ block of the full $n \times n$ Fisher matrix $F$, given by Equation (20). The expression we find is a specific example of the Savage-Dickey density ratio [11]; here we explicitly use the Laplace approximation to compute the offsets in the parameter estimates which accompany the wrong choice of model. Figure 3 shows the ratio of expected evidences, assuming the Laplace approximation (as the Fisher matrix does), for nested cosmological models.
Details are in the caption, but essentially one parameter is fixed in the simpler model, but allowed to vary in the more complex model. If the more complex model applies, then the data will favour the simpler model if the parameter is close to the fixed value. This is shown in the figure by the cusp in the graph. ln B is positive to the left of the cusp, and negative to the right.

Figure 3. The ratio of expected evidences B for two cosmological models. One is based on Einstein Gravity; the other is a more general model where the growth rate of perturbations is allowed to be a free parameter, rather than fixed. The graph shows the ratio as a function of the true shift of the growth rate away from the General Relativity value, for weak lensing data expected from ESA's Euclid satellite. If the growth rate is close to Einstein's (left of the figure; ln B > 0), Bayesian evidence is expected to favour Einstein gravity, whereas if the deviation is large enough (right of the cusp; ln B < 0), it favours the more complex model. Adapted from Figure 2 of "On model selection forecasting, Dark Energy and modified gravity" published in Mon. Not. Roy. Astron. Soc. [7].
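The expected Bayes factor can be evaluated from the full Fisher matrix alone. The following is a minimal sketch, assuming the convention that the ratio is the evidence for the simpler model divided by that for the extended model, uniform priors, and a hypothetical three-parameter Fisher matrix in which the last parameter is the one fixed in the simpler model. The function name `expected_ln_bayes_factor` and all numerical values are illustrative assumptions.

```python
import numpy as np

def expected_ln_bayes_factor(F, n_sub, delta_psi, prior_ranges):
    """ln<B> in the Laplace approximation for nested models.
    F: full n x n Fisher matrix, extra parameters ordered last.
    n_sub: number of parameters n' in the simpler model.
    delta_psi: offsets of the p fixed parameters.
    prior_ranges: uniform prior widths of the p extra parameters."""
    F = np.asarray(F, dtype=float)
    delta_psi = np.atleast_1d(delta_psi).astype(float)
    p = F.shape[0] - n_sub
    F_sub = F[:n_sub, :n_sub]
    G = F[:n_sub, n_sub:]
    # Shifts of the common parameters, then the full offset vector.
    shift = -np.linalg.solve(F_sub, G @ delta_psi)
    delta = np.concatenate([shift, delta_psi])
    _, logdet_full = np.linalg.slogdet(F)
    _, logdet_sub = np.linalg.slogdet(F_sub)
    return (-0.5 * p * np.log(2.0 * np.pi)
            + 0.5 * (logdet_full - logdet_sub)
            - 0.5 * delta @ F @ delta
            + np.sum(np.log(prior_ranges)))

# Hypothetical Fisher matrix; the third parameter is fixed in M'.
F = np.array([[4.0, 3.0, 1.0],
              [3.0, 9.0, 2.0],
              [1.0, 2.0, 5.0]])
prior_width = [10.0]

ln_B_small = expected_ln_bayes_factor(F, 2, 0.0, prior_width)
ln_B_large = expected_ln_bayes_factor(F, 2, 3.0, prior_width)
```

With one extra parameter the result equals the Savage-Dickey ratio: the marginal Gaussian posterior density of that parameter at the fixed value, divided by the prior density $1/\Delta\theta$. In this example `ln_B_small` is positive and `ln_B_large` is negative, matching the behaviour on either side of the cusp.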

Discussion
This article reviews some recent developments in Fisher matrix theory, which have been motivated by cosmology. The Fisher matrix for data consisting of pairs that have errors in both x and y is derived, as a specific example of a general result where the data can depend on arbitrary variables x that may be measured with some error. The Fisher matrix is shown to be able to determine biases in some parameters when others are set to fixed values (such as in nested models where the simpler model does not allow some parameters to vary). DALI, which goes beyond the Laplace approximation by using higher-order derivatives, is found to allow much more accurate predictions for the expected shape of the likelihood surface. Finally, the concept of expected probabilities in the Laplace approximation is generalised to model selection, by computing the expected Bayesian evidence.