1. Introduction
No experiment fixes a model’s parameters perfectly. Every approach to propagating the resulting uncertainty must, explicitly or implicitly, assume a measure on the space of possible parameter values. A badly chosen measure can introduce bias, and we argue here that avoiding such bias is equivalent to the very natural goal of assigning equal weight to each distinguishable outcome. However, this goal is seldom reached, either because no attempt is made, or because the problem is simplified by prematurely assuming the asymptotic limit of nearly infinite data. We demonstrate here that this assumption can lead to a large bias in what we infer from the parameters, in models with features typical of many-parameter mechanistic models found in science. We propose a score for such bias, and advocate for using a measure that makes this zero. Such a measure allows for unbiased inference without the need to first simplify the model to just the right degree of complexity. Instead, weight is automatically spread according to a lower effective dimensionality, ignoring details irrelevant to visible outcomes.
We consider models which predict a probability distribution p(x|θ) for observing data x given parameters θ. The degree of overlap between two such distributions indicates how difficult it is to distinguish the two parameter points, which gives a notion of distance on parameter space. The simplifying idea of information geometry is to focus on infinitesimally close parameter points, for which there is a natural Riemannian metric, the Fisher information [1,2]. This may be thought of as having units of standard deviations, so that along a line of integrated length L there are about L distinguishable points, and thus any parameter which can be measured to a few digits of precision has length L ≫ 1. It is a striking empirical feature of models in science that most have a few such long (or relevant) parameter directions, followed by many more short (or irrelevant) orthogonal directions [3,4,5,6]. The irrelevant lengths, all L ≲ 1, show a characteristic spectrum of being roughly evenly spaced on a log scale, often over many decades. As a result, much of the geometry of this Riemannian model manifold consists of features much smaller than 1, far too small to observe. However, the natural intrinsic volume measure, which follows from the Fisher metric, is sensitive to all of these unobservable dimensions, and as we demonstrate here, they cause this measure to introduce enormous bias.
To avoid this problem, we need a measure tied to the Fisher length scale L ∼ 1, instead of one from the continuum. Locally, this length scale partitions dimensions into relevant and irrelevant, which in turn approximately factorizes the volume element into a relevant part and what we term the irrelevant co-volume. The wild variations of this co-volume are the source of the bias we describe, and it is rational to ignore them. As we illustrate in Figure 1 for a simple two-parameter model, equally distinguishable predictions do not correspond to equal intrinsic volumes, and this failure is detected by a score we call bias pressure. The measure π⋆(θ) for which this score is everywhere zero, by contrast, captures relevant distinguishability and ignores the very thin irrelevant direction. The same measure is also obtained by maximizing the information learned about parameters θ from seeing data x [7,8,9], or equivalently from a particular minimax game [10,11,12]. Since π⋆ is usually discrete [9,13,14,15,16,17,18], it can be seen as implementing a length cutoff, replacing the smooth differential-geometric view of the model manifold with something quantized [19].
In the Bayesian framework, the natural continuous volume measure derived from the Fisher metric is known as Jeffreys prior, and is the canonical example of an uninformative prior: a principled, ostensibly neutral choice. It was first derived based on invariance considerations [20], and can also be justified by information- or game-theoretic ideas, provided these are applied in the limit of infinitely many repetitions [7,8,17,21,22]. This asymptotic limit often looks like a technical trick to simplify derivations. However, in realistic models, this limit is very far from being justified, exponentially far in the number of parameters, often requiring an experiment to be repeated for longer than the age of the universe. We demonstrate here that using the prior derived in this limit introduces a large bias in such models. Furthermore, we argue that such bias, and not only computational difficulties, has prevented the wide use of uninformative priors.
The promise of principled ways of tracking uncertainty, Bayesian or otherwise, is to free us from the need to select a model with precisely the right degree of complexity. This idea is often encountered in the context of overfitting, where the maximum likelihood point of an overly complex model gives worse predictions. The bias discussed here is a distinct way for overly complex models to give bad predictions. We begin with toy models in which the number of parameters can be easily adjusted. However, in the real models of interest, we cannot trivially tune the number of parameters. This is why we wish to find principled methods which are not fooled by the presence of many irrelevant parameters.
2. Results
We consider a model to be characterized by the likelihood p(x|θ) of observing data x when the parameters are θ. In such a model, the Fisher information metric (FIM) measures the distinguishability of nearby points in parameter space as a distance ds² = Σ_ij g_ij(θ) dθ_i dθ_j, where

    g_ij(θ) = ∫ dx p(x|θ) [∂ log p(x|θ)/∂θ_i] [∂ log p(x|θ)/∂θ_j]    (1)

For definiteness, we may take points separated along a geodesic by a distance ds ≥ 1 to be distinguishable. Intuitively, though incorrectly, the d-dimensional volume implied by the FIM might be thought to correspond to the total number of distinguishable parameter values inferable from an experiment:

    Z = ∫ d^d θ √det g(θ)

However, this counting makes a subtle assumption that all structure in the model has a scale much larger than 1. When many dimensions are smaller than 1, their lengths weight the effective volume along the larger dimensions, despite having no influence on distinguishability.
The same effect applies to the normalized measure, Jeffreys prior:

    πJ(θ) = √det g(θ) / Z    (2)

This measure’s dependence on the irrelevant co-volume is an under-appreciated source of bias in posteriors derived from this prior. The effect is most clearly seen when the FIM is block-diagonal in the relevant and irrelevant directions, g = g_rel ⊕ g_irr. Then the volume form factorizes exactly, and the relevant effective measure is the √det g_rel factor times ∫ dθ_irr √det g_irr, an integral over the irrelevant dimensions.
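The factorization is easy to verify numerically. The following sketch uses arbitrary toy SPD matrices (not from any model in the text) to check that for a block-diagonal metric, √det g splits into a relevant factor times an irrelevant co-volume density:

```python
import numpy as np

# Toy check: for block-diagonal g, sqrt(det g) factorizes exactly into
# a relevant factor times the irrelevant co-volume density.
# The matrices here are arbitrary SPD examples, not from any model.
rng = np.random.default_rng(1)

Ar = rng.normal(size=(2, 2))
g_rel = Ar @ Ar.T + np.eye(2)            # relevant block (lengths ~ 1)
Ai = rng.normal(size=(3, 3))
g_irr = 1e-4 * (Ai @ Ai.T + np.eye(3))   # irrelevant block (tiny lengths)

g = np.block([[g_rel, np.zeros((2, 3))],
              [np.zeros((3, 2)), g_irr]])

vol = np.sqrt(np.linalg.det(g))
vol_rel = np.sqrt(np.linalg.det(g_rel))
covol = np.sqrt(np.linalg.det(g_irr))    # irrelevant co-volume density
assert np.isclose(vol, vol_rel * covol)
```

Scaling the irrelevant block changes the total volume density but has no effect on which points are distinguishable, which is the problem described above.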
A more principled measure of the (log of the) number of distinguishable outcomes is the mutual information between parameters and data, I(X; Θ):

    I(X; Θ) = ∫ dθ p(θ) D_KL( p(x|θ) ‖ p(x) )    (3)

where

    D_KL( p ‖ q ) = ∫ dx p(x) log[ p(x)/q(x) ]

is the Kullback–Leibler divergence between two probability distributions, which are not necessarily close: the marginal p(x) = ∫ dθ p(θ) p(x|θ) is typically much broader than p(x|θ). Unlike the volume Z, the mutual information depends on the prior p(θ). Past work, both by ourselves and others, has advocated for using the prior which maximizes this mutual information, with [8] or without [9] taking the asymptotic limit:

    π⋆(θ) = argmax_{p(θ)} I(X; Θ)    (4)

The same prior arises from a minimax game in which you choose a prior, your opponent chooses the true θ, and you lose the (large) KL divergence [10,11,12]:

    π⋆(θ) = argmin_{p(θ)} max_θ D_KL( p(x|θ) ‖ p(x) )
Here we stress a third perspective, defining a quantity we call bias pressure, which captures how strongly the prior disfavors predictions from a given point:

    b(θ) = D_KL( p(x|θ) ‖ p(x) ) − I(X; Θ)    (5)

The optimal π⋆ has b(θ) = 0 on its support, and can be found by minimizing

    max_θ b(θ)

Other priors have b(θ) > 0 at some points, indicating that I(X; Θ) can be increased by moving weight there (and away from points where b(θ) < 0). We demonstrate below that b(θ) deserves to be called a bias, as it relates to large deviations of the posterior center of mass. We show this by presenting a number of toy models, chosen to have information geometry similar to that typically found in mechanistic models from many scientific fields [5].
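For models small enough to discretize, a mutual-information-maximizing prior and its bias pressure can be computed with the classical Blahut–Arimoto iteration. The sketch below uses our own illustrative setup, a one-parameter model with Gaussian noise on a grid; none of the grids or values are from the text:

```python
import numpy as np

# Blahut-Arimoto iteration for a mutual-information-maximizing prior,
# on an illustrative one-parameter Gaussian model discretized on grids.
thetas = np.linspace(0.0, 1.0, 201)     # parameter grid
xs = np.linspace(-1.0, 2.0, 600)        # data grid
sigma = 0.3                             # noise width (Fisher length ~ 1/sigma)

lik = np.exp(-0.5 * ((xs[None, :] - thetas[:, None]) / sigma) ** 2)
lik /= lik.sum(axis=1, keepdims=True)   # rows: p(x|theta) on the x grid

prior = np.full(len(thetas), 1.0 / len(thetas))
for _ in range(2000):
    px = prior @ lik                    # marginal p(x)
    dkl = np.sum(lik * np.log(lik / px), axis=1)  # D_KL(p(x|theta)||p(x))
    prior = prior * np.exp(dkl)         # Blahut-Arimoto update
    prior /= prior.sum()

px = prior @ lik
dkl = np.sum(lik * np.log(lik / px), axis=1)
I_nats = prior @ dkl                    # mutual information, in nats
bias = dkl - I_nats                     # bias pressure b(theta) of Eq. (5)
# max(bias) is near zero at the optimum, and the prior is nearly discrete
```

The same dkl computation scores any other prior: wherever b(θ) > 0, moving weight there would increase the mutual information.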
2.1. Exponential Decay Models
The first model we study involves inferring rates of exponential decay. This may be motivated, for instance, by the problem of determining the composition of a radioactive source containing elements with different half-lives, using Geiger counter readings taken over some period of time. The mean count rate at time t is

    y(t) = Σ_{μ=1}^{d} A_μ e^{−k_μ t}    (6)

We take the decay rates k_μ as parameters, and fix the proportions A_μ, usually to A_μ = 1/d, thus initial condition y(0) = 1. If we make observations at m distinct times t_a, then the prediction y is an m-vector, restricted to a compact region Y. For radioactivity, we would expect to observe y(t) plus Poisson noise, but the qualitative features are the same if we simplify to Gaussian noise with constant width σ:

    p(x|θ) = Π_{a=1}^{m} (2πσ²)^{−1/2} exp[ −(x_a − y(t_a))² / 2σ² ]

The Fisher metric then simplifies to be the Euclidean metric in the space of predictions Y, pulled back to parameter space:

    g_ij(θ) = (1/σ²) Σ_{a=1}^{m} [∂y(t_a)/∂θ_i] [∂y(t_a)/∂θ_j]

thus, plots of Y in units of σ will show Fisher distances accurately. This model is known to be ill conditioned, with many small manifold widths and many small FIM eigenvalues when d is large [23].
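The ill-conditioned spectrum is easy to reproduce. This sketch builds the pulled-back Fisher metric for a small decay model; the time grid, rates, and σ are arbitrary illustrative choices, not those used in the figures:

```python
import numpy as np

# FIM for y(t) = (1/d) * sum_mu exp(-k_mu t) with Gaussian noise sigma:
# g = J^T J / sigma^2, where J_ai = dy(t_a)/dk_i = -(t_a/d) exp(-k_i t_a).
# Times, rates, and sigma below are illustrative choices only.
d = 5
ks = np.linspace(0.5, 4.0, d)       # decay rates k_mu
ts = np.linspace(0.1, 5.0, 20)      # m = 20 observation times
sigma = 0.1

J = -(ts[:, None] / d) * np.exp(-np.outer(ts, ks))   # Jacobian, shape (m, d)
g = J.T @ J / sigma**2                               # Fisher metric

eig = np.sort(np.linalg.eigvalsh(g))[::-1]
# eigenvalues span many decades, roughly evenly spaced on a log scale
print(np.round(np.log10(eig), 1))
```

Even at this small d, the ratio of largest to smallest eigenvalue spans several orders of magnitude, and it grows rapidly with d.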
With just two dimensions, d = 2, Figure 1 shows the region Y, Jeffreys prior πJ, and the optimal prior π⋆, projected to densities on Y. Jeffreys is uniform on Y (since the metric is constant in y), and hence always weights a two-dimensional area, both where this is appropriate and where it is not. The upper portion of Y in the figure is thin compared to σ, so the points we can distinguish are those separated vertically: the model is effectively one-dimensional there. Jeffreys does not handle this well, which we illustrate in two ways. First, the prior is drawn divided into 20 segments of equal weight (equal area), which roughly correspond to distinguishable differences where the model is two-dimensional, but not where it becomes one-dimensional. Second, the points are colored by the bias pressure b(θ), which detects this effect, and gives large values at the top (about 10 bits). The optimal prior avoids these flaws by smoothly adjusting from the one- to the two-dimensional part of the model [9].
The claim that some parts of the model are effectively one-dimensional depends on the amount of data gathered. M independent repetitions of the experiment have overall likelihood Π_{m=1}^{M} p(x_m|θ), which will always scale the FIM by M, hence all distances by a factor √M. This scaling is exactly equivalent to smaller Gaussian noise σ/√M. Increasing M increases the number of distinguishable points, and large enough M (or small enough σ) can eventually make any nonzero length larger than 1. Thus, the amount of data gathered affects which parameters are relevant. However, notice that such repetition has no effect at all on Jeffreys prior, since the scale of √det g in Equation (2) is canceled by Z. In this sense, it is already clear that Jeffreys prior belongs to the fixed point of repetition, i.e., to the asymptotic limit M → ∞.
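The cancellation is mechanical and can be checked numerically. In this sketch, arbitrary SPD matrices stand in for g(θ) on a parameter grid (they are not from any model in the text); repetition scales every Fisher distance by √M but leaves the normalized Jeffreys weights untouched:

```python
import numpy as np

# Repetition scales the FIM: g -> M g, so distances scale by sqrt(M) and
# sqrt(det g) by M^(d/2).  Z picks up the same factor, so Jeffreys prior
# is invariant.  The g(theta) below are arbitrary SPD stand-ins.
rng = np.random.default_rng(0)
d, n_grid, M = 3, 50, 1000

A = rng.normal(size=(n_grid, d, d))
gs = A @ A.transpose(0, 2, 1) + 0.1 * np.eye(d)   # SPD metric at each point

dens = np.sqrt(np.linalg.det(gs))
jeffreys = dens / dens.sum()

dens_M = np.sqrt(np.linalg.det(M * gs))           # after M repetitions
jeffreys_M = dens_M / dens_M.sum()
assert np.allclose(jeffreys, jeffreys_M)          # prior unchanged

v = rng.normal(size=d)                            # a tangent direction
ds = np.sqrt(v @ gs[0] @ v)
ds_M = np.sqrt(v @ (M * gs[0]) @ v)
assert np.isclose(ds_M, np.sqrt(M) * ds)          # distances grow by sqrt(M)
```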
Figure 2 shows a more complicated version of the model (6), with d = 4 parameters, and looks at the effect of varying the noise level σ. Jeffreys prior always fills the 4-dimensional bulk, but at moderate σ, most of the distinguishable outcomes are located far from this mass. At large σ, equivalent to few repetitions, all the weight of the optimal prior is on zero- and one-dimensional edges. As more data are gathered, it gradually fills in the bulk, until, in the asymptotic limit σ → 0, it approaches Jeffreys prior [12,17,21,24]. However, while π⋆ approaches a continuum at any interior point [25], it remains discrete at Fisher distances of order 1 from the boundary. The worst-case bias pressure detects this; hence, the maximum for Jeffreys prior does not approach that for the optimal prior: max_θ b_J(θ) ↛ max_θ b_⋆(θ). However, since mutual information is dominated by the interior in this limit, we expect the values for the two priors to agree in the limit: I_J(X; Θ) → I_⋆(X; Θ).
One way to quantify the effective dimensionality is to look at the rate of increase in mutual information under repetition, or decreasing noise σ. Along a dimension with Fisher length L, the number of distinguishable points is proportional to L, and thus a cube with D_eff large dimensions will have about L^{D_eff} such points. This motivates defining D_eff by

    D_eff = d I(X; Θ) / d log(1/σ)

Figure 2 shows lines for slope D_eff, and we expect D_eff → d in the limit σ → 0.
2.2. The Costs of High Dimensionality
The problems of uneven measure grow more severe with more dimensions. To explore this, Figure 3, Figure 4 and Figure 5 show a sequence of models with 1 to 26 parameters. All describe the same data: observations at the same list of times, with the same noise σ. While Jeffreys prior is nonzero everywhere, its weight is concentrated where the many irrelevant dimensions are largest. With a Monte Carlo sample of a million points, all are found within the small orange area on the right of Figure 3. For a particular observation x, we also plot the posterior p(θ|x) for each prior. The extreme concentration of weight in πJ pulls this posterior some 20 standard deviations away from the maximum likelihood point. We call this distance the posterior deviation; it is the most literal kind of bias in results.
Figure 4 compares the posterior deviation to the bias pressure b(θ) defined in Equation (5). For each of many observations x, we find the maximum likelihood point θ̂(x), and calculate the distance from this point to the posterior expectation value of y:

    (posterior deviation) = ‖ ⟨y⟩_{p(θ|x)} − y(θ̂) ‖ / σ

Then, using the same prior, we evaluate the corresponding bias pressure, b(θ̂). The figure shows 100 observations x drawn from the model, and we believe the relationship seen there justifies the use of the word “bias” to describe b(θ). The figure is for a fixed number of parameters, but a similar relationship is seen in other dimensionalities.
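The mechanism is easy to see in one dimension. In this sketch, a trivially linear model y = θ with Gaussian noise is paired with a lopsided prior; all numbers are our own illustration, not any prior from the text. The posterior mean lands several standard deviations from the maximum likelihood point:

```python
import numpy as np

# A prior concentrated far from the truth drags the posterior mean away
# from the maximum likelihood point; the gap, in units of sigma, is the
# posterior deviation.  Model y = theta with Gaussian noise (illustrative).
thetas = np.linspace(0.0, 1.0, 1001)
sigma = 0.05

prior = np.exp(-0.5 * ((thetas - 0.9) / 0.01) ** 2)  # mass piled near 0.9
prior /= prior.sum()

x = 0.5                                              # the observed data
lik = np.exp(-0.5 * ((x - thetas) / sigma) ** 2)
post = prior * lik
post /= post.sum()

theta_ml = thetas[np.argmax(lik)]                    # maximum likelihood
deviation = abs(post @ thetas - theta_ml) / sigma    # in standard deviations
```

In high dimensions the same pull comes not from a deliberately lopsided prior, but from the concentration of the irrelevant co-volume.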
Instead of looking at particular observations x, Figure 5 shows the global criteria I(X; Θ) and the worst-case bias B = max_θ b(θ). The optimal prior is largely unaffected by the addition of many irrelevant dimensions. Once the relevant directions are included, it captures essentially the same information in any higher dimension and has zero bias (or near-zero bias, in our numerical approximation). We may think of this as a new invariance principle: predictions should be independent of unobservable model details. This replaces one of the invariances of Jeffreys, that repetition of the experiment does not change the prior. Repetition invariance guarantees poor performance when we are far from the asymptotic limit, as we see here from the rapidly declining performance of Jeffreys prior with increasing dimension, capturing less than one bit in the highest-dimensional model. This decline in information is mirrored by a rise in the worst-case bias B.
Figure 3, Figure 4 and Figure 5 also show a third prior, which is log-normal in each decay rate k_μ, that is, normal in terms of log k_μ:

    p(θ) ∝ Π_{μ=1}^{d} (1/k_μ) exp[ −(log k_μ − μ̄)² / 2s² ]

This is not a strongly principled choice, but something like this is commonly used for parameters known to be positive. Here it produces better results than Jeffreys prior in high dimensions. We observe that it also suffers a decline in performance with increasing d, despite making no deliberate attempt to adapt to the high-dimensional geometry. The details of how well it works will, of course, depend on the values chosen for μ̄ and s, and more complicated priors of this sort can be invented. With enough free “meta-parameters” such as these, we can surely adjust such a prior to approximate the optimal prior, and in practice, such a variational approach might be more useful than solving for the optimal prior directly. We believe that worst-case bias B is a good score for this purpose, partly because its zero point is meaningful.
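As a concrete sketch of using B as a score, the following code evaluates the worst-case bias pressure of a log-normal prior (and, for comparison, a uniform one) on a toy one-observation decay model; the grids, σ, and meta-parameter values are all illustrative choices of ours:

```python
import numpy as np

# Score priors by worst-case bias pressure B = max_theta b(theta), for a
# toy model with one observation of y = exp(-k t).  All values illustrative.
t, sigma = 1.0, 0.05
ks = np.linspace(1e-3, 6.0, 600)            # decay-rate grid
xs = np.linspace(-0.3, 1.3, 1200)           # data grid

y = np.exp(-ks * t)
lik = np.exp(-0.5 * ((xs[None, :] - y[:, None]) / sigma) ** 2)
lik = np.clip(lik, 1e-300, None)            # avoid log(0) from underflow
lik /= lik.sum(axis=1, keepdims=True)

def worst_case_bias(prior):
    px = prior @ lik
    dkl = np.sum(lik * np.log(lik / px), axis=1)
    return np.max(dkl - prior @ dkl) / np.log(2)   # B, in bits

mu, s = 0.0, 1.0                            # log-normal meta-parameters
logn = np.exp(-0.5 * ((np.log(ks) - mu) / s) ** 2) / ks
logn /= logn.sum()
uniform = np.full(len(ks), 1.0 / len(ks))

B_logn, B_unif = worst_case_bias(logn), worst_case_bias(uniform)
# tuning (mu, s) to reduce B_logn is a tiny variational version of
# approximating the optimal prior
```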
2.3. Inequivalent Parameters
Compared to these toy models, more realistic models often still have many parameter combinations poorly fixed by data, but seldom come in families that allow us to easily tune the number of dimensions. Instead of having many interchangeable parameters, each will often describe a different microscopic effect that we know to exist, even if we are not sure which combination of them will matter in a given regime [
27]. To illustrate this, we now examine some models of enzyme kinetics, starting with the famous reaction:
This summarises differential equations for the concentrations, such as
for the final product
P, and
for the enzyme, which combines with the substrate to form a bound complex.
If the concentration of product
is observed at some number times, with some noise, and starting from fixed initial conditions, then this model is not unlike the toy model above.
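To make the setup concrete, here is a minimal mass-action integrator for this reaction, using a hand-rolled RK4 step; the rate constants and initial concentrations are arbitrary illustrative values, not fitted ones:

```python
import numpy as np

# Mass-action kinetics for E + S <-> ES -> E + P, integrated with RK4.
# Rate constants k1, km1, k2 and initial concentrations are illustrative.
def rhs(c, k1, km1, k2):
    E, S, ES, P = c
    bind = k1 * E * S - km1 * ES       # net binding flux
    cat = k2 * ES                      # catalytic flux
    return np.array([-bind + cat, -bind, bind - cat, cat])

def integrate(k1, km1, k2, c0, t_end=10.0, n=2000):
    dt = t_end / n
    c = np.array(c0, dtype=float)
    traj = [c.copy()]
    for _ in range(n):
        f1 = rhs(c, k1, km1, k2)
        f2 = rhs(c + 0.5 * dt * f1, k1, km1, k2)
        f3 = rhs(c + 0.5 * dt * f2, k1, km1, k2)
        f4 = rhs(c + dt * f3, k1, km1, k2)
        c = c + (dt / 6.0) * (f1 + 2 * f2 + 2 * f3 + f4)
        traj.append(c.copy())
    return np.array(traj)

traj = integrate(k1=2.0, km1=1.0, k2=1.0, c0=[1.0, 2.0, 0.0, 0.0])
E, S, ES, P = traj[-1]   # total enzyme E + ES and total substrate
                         # S + ES + P are conserved along the trajectory
```

Observing traj[:, 3] at a few times, plus noise, defines a likelihood exactly as in the decay models above.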
Figure 6 shows the resulting priors for the rate constants appearing in Equation (11). The shape of the model manifold is similar, and the optimal prior again places most of its weight along two one-dimensional edges, while Jeffreys prior places it in the bulk, favoring the region where all three rate constants come closest to having independently visible effects on the data. However, the resulting bias is not extreme in three dimensions. The edges of this model are known approximations, in which certain rate constants become infinite (or equal), which we discuss in Appendix A [28]. These approximations are useful in practice since each spans the full length of the most relevant parameter. However, the more difficult situation is when many different processes of comparable speed are unavoidably involved. The model manifold may still have many short directions, but the simpler description selected by π⋆ will tend to have weight on many different processes. In other words, the simpler model, according to information theory, is not necessarily one simpler model obtained by taking a limit, but instead a mixture of many different analytic limits.
To see this, we consider a slightly more complicated enzyme kinetics model, the ping-pong mechanism, with more rate constants:

    E + A ⇌ EA → E′ + P,    E′ + B ⇌ E′B → E + Q

Here E′ is a deformed version of the enzyme E, which is produced in the reaction from A to P, and reverted in the reaction from B to final product Q. There are clearly many more possible limits in which some combination of the rate constants become large or small. Figure 6 shows that the optimal prior has weight on at least five different 1-edges, none of which is a good description by itself.
The concentration of weight seen in Jeffreys prior for these enzyme models is comparable to what we had before, with worst-case bias pressure reaching 28 bits in the ping-pong model. These examples share geometric features with many real models in science [5], and thus we believe the problems described here are generic.
3. Discussion
Before fitting a model to data, there is often a selection step to choose a model which is complex enough to fit the true pattern, but not so complex as to fit the noise. The motivation for this is clear in maximum likelihood estimation, where only one point θ̂ is kept, and there are various criteria for making the trade-off [29,30,31]. The motivation is less clear in Bayesian analysis, where slightly different criteria can be derived by approximating the evidence p(x) [32,33]. We might hope that if many different points θ are consistent with the noisy data x, then the posterior p(θ|x) should simply have weight on all of them, encoding our uncertainty about θ.
Why, then, is model selection needed at all in Bayesian inference? Our answer here is that it is performed to avoid measure-induced bias, not overfitting. When using a sub-optimal prior, models with too much complexity do indeed perform badly. This problem is seen in Figure 5, in the rapid decline of the scores I(X; Θ) or B with increasing d, and would also be seen in the more traditional model evidence p(x)—all of these scores prefer lower-dimensional models. However, the problem is not overfitting, since the extra parameters being added are irrelevant, i.e., they can have very little effect on the predictions y. Instead, the problem is concentration of measure. In models with tens of parameters, this effect can be enormous: it leads to posterior expectation values some 20 standard deviations away from ideal for the largest model with Jeffreys prior, with less than one bit of mutual information learned, and many bits of bias. This problem is completely avoided by the optimal prior π⋆, which suffers no decline in performance with increasing parameter count d.
Geometrically, we can view traditional model selection as adjusting d to ensure that the model manifold only has dimensions of length L ≳ 1. This ensures that most of the posterior weight is in the interior of the manifold; hence, ignoring model edges is justified. By contrast, when there are dimensions of length L ≲ 1, the optimal posterior will usually have its weight at their extreme values, on several manifold edges, which are themselves simpler models [9]. Fisher lengths L depend on the quantity of data to be gathered, and repeating an experiment M times enlarges all by a factor √M. Large enough M can eventually make any dimension larger than 1, and thus repetition alters what d traditional model selection prefers. Similarly, repetition alters the effective dimensionality of π⋆. Some earlier work on model geometry studies a series in inverse powers of M [22,33,34]; this expansion around M = ∞ captures some features beyond the volume but is not suitable for models with dimensions L ≲ 1.
Real models in science typically have many irrelevant parameters [5,35,36,37,38]. It is common to have parameter directions many orders of magnitude less important than the most relevant one, but impossible to repeat an experiment often enough to bridge this gap. Sometimes it is possible to remove the irrelevant parameters and derive a simpler effective theory. This is what happens in physics, where a large separation of scales allows great simplicity and high accuracy [39,40]. However, many other systems we would like to model cannot, or cannot yet, be so simplified. For complicated biological reactions, climate models, or neural networks, it is unclear which of the microscopic details can be safely ignored, or what the right effective variables are. Unlike our toy models, we cannot easily adjust d, since every parameter has a different meaning. This is why we seek statistical methods which do not require us to find the right effective theory. In particular, here we study priors almost invariant to complexity.
The optimal prior is discrete, which makes it difficult to find, and this difficulty appears to be why its good properties have been overlooked. It is known analytically only for extremely simple models, such as the Bernoulli problem, and previous numerical work only treated slightly more complicated models, with only a few parameters [9]. While our concern here is with the ideal properties, for practical use nearly optimal approximations may be required. One possibility is the adaptive slab-and-spike prior introduced in [6]. Another would be to use some variational family with adjustable meta-parameters [41].
Discreteness is also how the exactly optimal π⋆ encodes a length scale in the model geometry, which is the divide between relevant and irrelevant parameters, between parameters that are constrained by data and those which are not. Making this distinction in some way is essential for good behavior, and it implies a dependence on the quantity of data. An effective model appropriate for far fewer data than observed will be too simple: the atoms of π⋆ will be too far apart (much like recording too few significant figures), or else selecting a small d means picking just one edge (fixing some parameters which may, in fact, be relevant). On the other hand, what we have demonstrated here is that a model appropriate for much more data—infinitely more in the case of Jeffreys prior—will instead introduce enormous bias into our inference about θ.