Far from Asymptopia: Unbiased High-Dimensional Inference Cannot Assume Unlimited Data

Inference from limited data requires a notion of measure on parameter space, which is most explicit in the Bayesian framework as a prior distribution. Jeffreys prior is the best-known uninformative choice, the invariant volume element from information geometry, but we demonstrate here that this leads to enormous bias in typical high-dimensional models. This is because models found in science typically have an effective dimensionality of accessible behaviors much smaller than the number of microscopic parameters. Any measure which treats all of these parameters equally is far from uniform when projected onto the sub-space of relevant parameters, due to variations in the local co-volume of irrelevant directions. We present results on a principled choice of measure which avoids this issue and leads to unbiased posteriors by focusing on relevant parameters. This optimal prior depends on the quantity of data to be gathered, and approaches Jeffreys prior in the asymptotic limit. However, for typical models, this limit cannot be justified without an impossibly large increase in the quantity of data, exponential in the number of microscopic parameters.


Introduction
No experiment fixes a model's parameters perfectly. Every approach to propagating the resulting uncertainty must, explicitly or implicitly, assume a measure on the space of possible parameter values. A badly chosen measure can introduce bias, and we argue here that avoiding such bias is equivalent to the very natural goal of assigning equal weight to each distinguishable outcome. However, this goal is seldom reached, either because no attempt is made, or because the problem is simplified by prematurely assuming the asymptotic limit of nearly infinite data. We demonstrate here that this assumption can lead to a large bias in what we infer about the parameters, in models with features typical of many-parameter mechanistic models found in science. We propose a score for such bias, and advocate for using a measure that makes it zero. Such a measure allows for unbiased inference without the need to first simplify the model to just the right degree of complexity. Instead, weight is automatically spread according to a lower effective dimensionality, ignoring details irrelevant to visible outcomes.
We consider models which predict a probability distribution p(x|θ) for observing data x given parameters θ. The degree of overlap between two such distributions indicates how difficult it is to distinguish the two parameter points, which gives a notion of distance on parameter space. The simplifying idea of information geometry is to focus on infinitesimally close parameter points, for which there is a natural Riemannian metric, the Fisher information [1,2]. This may be thought of as having units of standard deviations, so that along a line of integrated length L there are about L distinguishable points, and thus any parameter which can be measured to a few digits of precision has length L > 100. It is a striking empirical feature of models in science that most have a few such long (or relevant) parameter directions, followed by many more short (or irrelevant) orthogonal directions [3–6]. The irrelevant lengths, all L < 1, show a characteristic spectrum of being roughly evenly spaced on a log scale, often over many decades. As a result, much of the geometry of this Riemannian model manifold consists of features much smaller than 1, far too small to observe. However, the natural intrinsic volume measure, which follows from the Fisher metric, is sensitive to all of these unobservable dimensions, and as we demonstrate here, they cause this measure to introduce enormous bias.
To avoid this problem, we need a measure tied to the Fisher length scale L ≈ 1, instead of one from the continuum. Locally, this length scale partitions dimensions into relevant and irrelevant, which in turn approximately factorizes the volume element into a relevant part and what we term the irrelevant co-volume. The wild variations of this co-volume are the source of the bias we describe, and it is rational to ignore them. As we illustrate in Figure 1 for a simple two-parameter model, equally distinguishable predictions do not correspond to equal intrinsic volumes, and this failure is detected by a score we call bias pressure. The measure p⋆(θ) for which this score is everywhere zero, by contrast, captures relevant distinguishability and ignores the very thin irrelevant direction. The same measure is also obtained by maximizing the information learned about parameter θ from seeing data x [7–9], or equivalently from a particular minimax game [10–12]. Since p⋆(θ) is usually discrete [9,13–18], it can be seen as implementing a length cutoff, replacing the smooth differential-geometric view of the model manifold with something quantized [19].

Figure 1. The natural volume is a biased measure for the space of distinguishable outcomes. The left panel outlines the space of possible predictions Y; the observed x is deterministic y(θ) plus measurement noise. With the scale of the noise σ as shown, the upper half is effectively one-dimensional. The center panel shows a sample from the volume measure p_J(θ), divided into blocks of equal weight. These are strongly influenced by the unobservable thickness of the upper portion. Points are colored by bias pressure b(θ), which we define in Equation (5). The right panel shows the explicitly unbiased optimal measure p⋆(θ), which gradually adjusts from two- to one-dimensional behavior. (The model is Equation (6) with a_1 = 0.8, a_2 = 0.2, and k_1 ≥ k_2, observed at times t = 1, 3, each with Gaussian noise σ = 0.1.)
In the Bayesian framework, the natural continuous volume measure p_J(θ) is known as Jeffreys prior, and is the canonical example of an uninformative prior: a principled, ostensibly neutral choice. It was first derived based on invariance considerations [20], and can also be justified by information- or game-theoretic ideas, provided these are applied in the limit of infinitely many repetitions [7,8,17,21,22]. This asymptotic limit often looks like a technical trick to simplify derivations. However, in realistic models, this limit is very far from being justified, exponentially far in the number of parameters, often requiring an experiment to be repeated for longer than the age of the universe. We demonstrate here that using the prior derived in this limit introduces a large bias in such models. Furthermore, we argue that such bias, and not only computational difficulties, has prevented the wide use of uninformative priors.
The promise of principled ways of tracking uncertainty, Bayesian or otherwise, is to free us from the need to select a model with precisely the right degree of complexity. This idea is often encountered in the context of overfitting, where the maximum likelihood point of an overly complex model gives worse predictions. The bias discussed here is a distinct way for overly complex models to give bad predictions. We begin with toy models in which the number of parameters can be easily adjusted. However, in the real models of interest, we cannot trivially tune the number of parameters. This is why we wish to find principled methods which are not fooled by the presence of many irrelevant parameters.

Results
We consider a model to be characterized by the likelihood p(x|θ) of observing data x ∈ X when the parameters are θ ∈ Θ. In such a model, the Fisher information metric (FIM) measures the distinguishability of nearby points in parameter space as a distance

ds²(θ) = Σ_{μν} g_μν(θ) dθ^μ dθ^ν,   g_μν(θ) = ∫dx p(x|θ) [∂ log p(x|θ)/∂θ^μ] [∂ log p(x|θ)/∂θ^ν].   (1)

For definiteness, we may take points separated along a geodesic by a length L = ∫ds > 1 to be distinguishable. Intuitively, though incorrectly, the d-dimensional volume implied by the FIM might be thought to correspond to the total number of distinguishable parameter values inferable from an experiment:

Z = ∫_Θ d^d θ √(det g(θ)).   (2)

However, this counting makes a subtle assumption that all structure in the model has a scale much larger than 1. When many dimensions are smaller than 1, their lengths weight the effective volume along the larger dimensions, despite having no influence on distinguishability.
The same effect applies to the normalized measure, Jeffreys prior:

p_J(θ) = √(det g(θ)) / Z.   (3)

This measure's dependence on the irrelevant co-volume is an under-appreciated source of bias in posteriors derived from this prior. The effect is most clearly seen when the FIM is block-diagonal, g = g_rel ⊕ g_irrel. Then the volume form factorizes exactly, and the relevant effective measure is the √(det g_rel(θ_rel)) factor times V_irrel(θ_rel), an integral of √(det g_irrel) over the irrelevant dimensions.
A more principled measure of the (log of the) number of distinguishable outcomes is the mutual information between parameters and data, I(X;Θ):

I(X;Θ) = ∫dθ p(θ) D_KL[ p(x|θ) ‖ p(x) ],   (4)

where D_KL is the Kullback–Leibler divergence between two probability distributions, which are not necessarily close: p(x) = ∫dθ p(θ) p(x|θ) is typically much broader than p(x|θ). Unlike the volume Z, the mutual information depends on the prior p(θ). Past work, both by ourselves and others, has advocated for using the prior which maximizes this mutual information, with [8] or without [9] taking the asymptotic limit:

p⋆(θ) = argmax_{p(θ)} I(X;Θ).

The same prior arises from a minimax game in which you choose a prior, your opponent chooses the true θ, and you lose the (large) KL divergence [10–12]:

p⋆(θ) = argmin_{p(θ)} max_θ D_KL[ p(x|θ) ‖ p(x) ].

Here we stress a third perspective, defining a quantity we call bias pressure, which captures how strongly the prior disfavors predictions from a given point:

b(θ) = D_KL[ p(x|θ) ‖ p(x) ] − I(X;Θ).   (5)

The optimal p⋆(θ) has b(θ) = 0 on its support, and can be found by minimizing B = max_θ b(θ). Other priors have b(θ) > 0 at some points, indicating that I(X;Θ) can be increased by moving weight there (and away from points where b(θ) < 0). We demonstrate below that b(θ) deserves to be called a bias, as it relates to large deviations of the posterior center of mass. We enact this by presenting a number of toy models, chosen to have information geometry similar to that typically found in mechanistic models from many scientific fields [5].
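As a concrete numerical sketch (the one-dimensional model, atoms, and weights here are invented for illustration, and are not from the paper's code), for a discrete prior over a Gaussian-noise model both the mutual information and the bias pressure reduce to simple grid integrals:

```python
import numpy as np

# Hypothetical 1-D model (invented for illustration): x = y(theta) + noise,
# with y(theta) = theta and Gaussian noise of width sigma.
sigma = 0.1

# A discrete prior: atoms theta_a with weights lam_a.
atoms = np.array([0.0, 0.3, 1.0])
lam = np.array([0.25, 0.35, 0.40])

xs = np.linspace(-1.0, 2.0, 4001)   # integration grid for x
dx = xs[1] - xs[0]

def gauss(x, mu):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

like = np.array([gauss(xs, a) for a in atoms])   # p(x|theta_a), one row per atom
px = lam @ like                                  # p(x) = sum_a lam_a p(x|theta_a)

# D_KL[p(x|theta_a) || p(x)] for each atom, then mutual information I(X;Theta).
integrand = np.where(like > 0, like * np.log(like / px), 0.0)
dkl = integrand.sum(axis=1) * dx
I = float(lam @ dkl)                             # in nats

# Bias pressure: b(theta_a) = D_KL - I.  Its prior-weighted mean is zero by
# construction, so positive b at one atom is balanced by negative b elsewhere.
b = dkl - I
print(np.round(b / np.log(2), 3))                # in bits
```

Here every b(θ_a) is generically nonzero; the optimal prior is precisely the choice of atoms and weights that drives b to zero on its support.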

Exponential Decay Models
The first model we study involves inferring rates of exponential decay. This may be motivated, for instance, by the problem of determining the composition of a radioactive source containing elements with different half-lives, using Geiger counter readings taken over some period of time. The mean count rate at time t is

y(t) = Σ_{μ=1}^{d} a_μ e^{−k_μ t}.   (6)

We take the decay rates k_μ as parameters, and fix the proportions a_μ, usually to a_μ = 1/d, thus the initial condition is y(0) = 1. If we make observations at m distinct times t, then the prediction y is an m-vector, restricted to a compact region Y ⊂ [0, 1]^m. For radioactivity, we would expect to observe y_t plus Poisson noise, but the qualitative features are the same if we simplify to Gaussian noise with constant width σ:

p(x|θ) = Π_t (2πσ²)^{−1/2} exp[−(x_t − y_t(θ))²/2σ²].   (7)

The Fisher metric then simplifies to be the Euclidean metric in the space of predictions Y, pulled back to parameter space Θ:

g_μν(θ) = (1/σ²) Σ_{t,t'} (∂y_t/∂θ^μ) δ_{tt'} (∂y_{t'}/∂θ^ν) = (1/σ²) Σ_t (∂y_t/∂θ^μ)(∂y_t/∂θ^ν);

thus, plots of p(y) in R^m will show Fisher distances accurately. This model is known to be ill-conditioned, with many small manifold widths and many small FIM eigenvalues when d is large [23]. With just two dimensions, d = m = 2, Figure 1 shows the region Y ⊂ R², Jeffreys prior p_J(θ), and the optimal prior p⋆(θ), projected to densities on Y. Jeffreys is uniform, p_J(y) ∝ 1 (since the metric is constant in y), and hence always weights a two-dimensional area, both where this is appropriate and where it is not. The upper portion of Y in the figure is thin compared to σ, so the points we can distinguish are those separated vertically: the model is effectively one-dimensional there. Jeffreys does not handle this well, which we illustrate in two ways. First, the prior is drawn divided into 20 segments of equal weight (equal area), which roughly correspond to distinguishable differences where the model is two-dimensional, but not where it becomes one-dimensional. Second, the points are colored by b(θ), which detects this effect, and gives large values at the top (about 10 bits).
The optimal prior avoids these flaws by smoothly adjusting from the one- to the two-dimensional parts of the model [9].
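Concretely, the pullback metric can be assembled from the Jacobian of y(θ). The sketch below (our own construction, with illustrative parameter values rather than exactly those of Figure 1) shows the characteristic eigenvalue hierarchy, one relevant and one nearly irrelevant direction:

```python
import numpy as np

def y(k, a, ts):
    """Predictions y_t = sum_mu a_mu * exp(-k_mu * t) at the observation times."""
    return np.array([np.sum(a * np.exp(-k * t)) for t in ts])

def fisher_metric(k, a, ts, sigma):
    """FIM for Gaussian noise: g = J^T J / sigma^2, where J[t, mu] = dy_t/dk_mu."""
    J = np.array([[-t * a[mu] * np.exp(-k[mu] * t) for mu in range(len(k))] for t in ts])
    return J.T @ J / sigma**2

# Illustrative two-exponential model, observed at t = 1 and t = 3 with sigma = 0.1.
k = np.array([1.0, 3.0])
a = np.array([0.8, 0.2])
g = fisher_metric(k, a, ts=[1.0, 3.0], sigma=0.1)

eigs = np.sort(np.linalg.eigvalsh(g))[::-1]
lengths = np.sqrt(eigs)      # Fisher length per unit change along each principal direction
print(np.round(lengths, 3))  # one direction much longer than 1, one much shorter
```

The ratio between the two eigenvalues is already several thousand in d = 2; adding further decay rates extends this hierarchy downward, one eigenvalue per decade or so.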
The claim that some parts of the model are effectively one-dimensional depends on the amount of data gathered. Independent repetitions of the experiment have overall likelihood Π_{i=1}^{M} p(x_i|θ), which will always scale the FIM by M, hence all distances by a factor √M. This scaling is exactly equivalent to smaller Gaussian noise σ. Increasing M increases the number of distinguishable points, and large enough M (or small enough σ) can eventually make any nonzero length larger than 1. Thus, the amount of data gathered affects which parameters are relevant. However, notice that such repetition has no effect at all on p_J(θ), since the scale of g_μν(θ) in Equation (2) is canceled by Z. In this sense, it is already clear that Jeffreys prior belongs to the fixed point of repetition, i.e., to the asymptotic limit M → ∞. Figure 2 shows a more complicated version of the model (6), with d = 4 parameters, and looks at the effect of varying the noise level σ. Jeffreys prior always fills the 4-dimensional bulk, but at moderate σ, most of the distinguishable outcomes are located far from this mass. At large σ, equivalent to few repetitions, all the weight of the optimal prior is on zero- and one-dimensional edges. As more data are gathered, it gradually fills in the bulk, until, in the asymptotic limit σ → 0, it approaches Jeffreys prior [12,17,21,24]. However, while p⋆(θ) approaches a continuum at any interior point [25], it remains discrete at Fisher distances ∼ 1 from the boundary. The worst-case bias pressure detects this; hence, the maximum for Jeffreys prior does not approach that for the optimal prior: B_J ↛ 0. However, since mutual information is dominated by the interior in this limit, we expect the values for p_J(θ) and p⋆(θ) to agree in the limit: I_J − I⋆ → 0.

Figure 2. Priors for the model of Equation (6) with d = 4 parameters, observed at m = 5 times t = 1, 2, . . . , 5. Top right, the optimal prior p⋆(θ) has all of its weight on 0- and 1-dimensional edges at large σ, but adjusts to fill in the bulk at small σ (colors indicate the dimension r of the 4-dimensional shape's edge on which a point is located, the rank of the FIM there). Jeffreys prior p_J(θ) is independent of σ, and has nonzero density everywhere, but a sample of 10^6 points is largely located near the middle of the shape. Left, the slope of I(X;Θ) ∼ d_eff log(1/σ) gives a notion of effective dimensionality; in the asymptotic limit σ → 0, we expect d_eff = d = 4. Bottom right, the worst-case bias pressure B = max_θ b(θ) is always zero for p⋆(θ), up to numerical error, but remains nonzero for p_J(θ) even in the asymptotic limit. Appendix A describes how upper and lower bounds for B are calculated.
One way to quantify the effective dimensionality is to look at the rate of increase in mutual information under repetition, or decreasing noise σ. Along a dimension with Fisher length L ≫ 1, the number of distinguishable points is proportional to L, and thus a cube with d_eff large dimensions will have ∝ L^{d_eff} such points. This motivates defining d_eff by

d_eff = dI(X;Θ) / d log(1/σ).   (8)

Figure 2 shows lines of slope d_eff = 1, 2, 3, and we expect d_eff → d in the limit σ → 0.
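This definition can be checked in a model simple enough to integrate numerically. The following sketch (our own construction: a hypothetical one-parameter model y(θ) = θ on [0, 1] with a uniform prior, not anything from the paper) estimates the slope of I against log(1/σ) and recovers d_eff ≈ 1:

```python
import numpy as np

def mutual_info(sigma, n_theta=1000, n_x=3000):
    """I(X;Theta) in nats for a uniform prior on theta in [0, 1],
    with y(theta) = theta and Gaussian noise of width sigma."""
    thetas = np.linspace(0.0, 1.0, n_theta)
    xs = np.linspace(-6 * sigma, 1 + 6 * sigma, n_x)
    dx = xs[1] - xs[0]
    like = np.exp(-(xs[None, :] - thetas[:, None])**2 / (2 * sigma**2)) \
           / np.sqrt(2 * np.pi * sigma**2)          # p(x|theta) on a grid
    px = like.mean(axis=0)                          # p(x) under the uniform prior
    ratio = np.where(like > 0, like / px, 1.0)      # avoid log(0) where the likelihood underflows
    dkl = (like * np.log(ratio)).sum(axis=1) * dx   # D_KL[p(x|theta) || p(x)] per theta
    return dkl.mean()                               # average over the uniform prior

# Slope of I(X;Theta) against log(1/sigma) between two noise levels:
I1, I2 = mutual_info(0.05), mutual_info(0.005)
d_eff = (I2 - I1) / np.log(10)
print(round(d_eff, 2))  # close to 1, as expected for a one-dimensional model
```

With a d-dimensional model whose Fisher lengths are all comparable, the same slope would approach d.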

The Costs of High Dimensionality
The problems of uneven measure grow more severe with more dimensions. To explore this, Figures 3–5 show a sequence of models with 1 to 26 parameters. All describe the same data: observations at the same list of m = 26 times in 1 ≤ t ≤ 5 with the same noise σ = 0.1. While Jeffreys prior is nonzero everywhere, its weight is concentrated where the many irrelevant dimensions are largest. With a Monte Carlo sample of a million points, all are found within the small orange area on the right of Figure 3. For a particular observation x, we plot also the posterior p(θ|x) for each prior. The extreme concentration of weight in p_J(θ) in d = 26 pulls this some 20 standard deviations away from the maximum-likelihood point y(θ̂_x). We call this distance the posterior deviation,

∆ = ‖ ⟨y(θ)⟩_{p(θ|x)} − y(θ̂_x) ‖ / σ;   (9)

it is the most literal kind of bias in results.
Then, using the same prior, we evaluate the corresponding bias pressure, b(θ̂_x). Figure 4 shows 100 observations x drawn from p⋆(x) = ∫dθ p⋆(θ) p(x|θ), and we believe the relationship seen there between ∆ and b(θ̂_x) justifies the use of the word "bias" to describe b(θ). The figure is for d = 11, but a similar relationship is seen in other dimensionalities. Instead of looking at particular observations x, Figure 5 shows global criteria I(X;Θ) and B = max_θ b(θ). The optimal prior is largely unaffected by the addition of many irrelevant dimensions. Once d > 3, it captures essentially the same information in any higher dimension and has zero bias (or near-zero bias, in our numerical approximation). We may think of this as a new invariance principle, that predictions should be independent of unobservable model details. This replaces one of the invariances of Jeffreys, that repetition of the experiment does not change the prior. Repetition invariance guarantees poor performance when we are far from the asymptotic limit, as we see here from the rapidly declining performance of Jeffreys prior with increasing dimension, capturing less than one bit in d = 26. This decline in information is mirrored by a rise in the worst-case bias B.

Figure 5. As in Figure 3, these models all describe the same data, with the same noise. Above, mutual information I(X;Θ)/log 2 (all plots are scaled thus to have units of bits). The optimal prior ignores the addition of more irrelevant parameters, but Jeffreys prior is badly affected, and ends up capturing less than 1 bit. Below, worst-case bias pressure max_θ b(θ)/log 2. This should be zero for the optimal prior, but our numerical solution has small errors. For the other priors, we plot lower and upper bounds, calculated using Bennett's method [26], as described in Appendix A. The bias of Jeffreys prior increases strongly with the increasing concentration of its weight in higher dimensions.
Figures 3–5 also show a third prior, which is log-normal in each decay rate k_μ = e^{θ^μ} > 0, that is, normal in terms of θ ∈ R^d:

p(θ) = Π_{μ=1}^{d} (2πσ̄²)^{−1/2} exp[−(θ^μ − θ̄)²/2σ̄²].   (10)

This is not a strongly principled choice, but something like this is commonly used for parameters known to be positive. Here it produces better results than Jeffreys prior in high dimensions. We observe that it also suffers a decline in performance with increasing d, despite making no attempt to deliberately adapt to the high-dimensional geometry. The details of how well it works will, of course, depend on the values chosen for θ̄, σ̄, and more complicated priors of this sort can be invented. With enough free "meta-parameters" such as θ̄, σ̄, we can surely adjust such a prior to approximate the optimal prior, and in practice, such a variational approach might be more useful than solving for the optimal prior directly. We believe that the worst-case bias B = max_θ b(θ) is a good score for this purpose, partly because its zero point is meaningful.

Inequivalent Parameters
Compared to these toy models, more realistic models often still have many parameter combinations poorly fixed by data, but seldom come in families that allow us to easily tune the number of dimensions. Instead of having many interchangeable parameters, each will often describe a different microscopic effect that we know to exist, even if we are not sure which combination of them will matter in a given regime [27]. To illustrate this, we now examine some models of enzyme kinetics, starting with the famous reaction

E + S ⇌ ES → E + P,   (11)

with rate constants k_f, k_r for binding and unbinding, and k_p for catalysis. If the concentration of product [P] is observed at some number of times, with some noise, and starting from fixed initial conditions, then this model is not unlike the toy model above. Figure 6 shows the resulting priors for the rate constants appearing in Equation (11). The shape of the model manifold is similar, and the optimal prior again places most of its weight along two one-dimensional edges, while Jeffreys prior places it in the bulk, favoring the region where all three rate constants come closest to having independently visible effects on the data. However, the resulting bias is not extreme in three dimensions.

Figure 6. Priors for two models of enzyme kinetics. Above, the 3-parameter model from Equation (11), observing only the concentration of product [P] at times t = 1, 2, . . . , 5. Below, the 8-parameter model from Equation (12), observing only the final product [Q] at times t = 1, 2, . . . , 10. Here Jeffreys prior has worst-case bias B ≈ 28 bits, comparable to the models in Figure 5 at similar dimension. While the optimal prior for the d = 3 model has its weight on well-known 2-parameter approximations, including that of Michaelis and Menten, the edge structure of the d = 8 model is much more complicated (for suitable initial conditions, it will include the d = 3 model as an edge).
The edges of this model are known approximations, in which certain rate constants become infinite (or equal), which we discuss in Appendix A [28]. These approximations are useful in practice, since each spans the full length of the most relevant parameter. However, the more difficult situation is when many different processes of comparable speed are unavoidably involved. The model manifold may still have many short directions, but the simpler description selected by p⋆(θ) will tend to have weight on many different processes. In other words, the simpler model selected by information theory is not necessarily a single model obtained by taking a limit, but instead a mixture of many different analytic limits.
To see this, we consider a slightly more complicated enzyme kinetics model, the ping-pong mechanism with d = 8 rate constants:

E + A ⇌ EA ⇌ E* + P,   E* + B ⇌ E*B ⇌ E + Q.   (12)

Here E* is a deformed version of the enzyme E, which is produced in the reaction from A to P, and reverted in the reaction from B to final product Q. There are clearly many more possible limits in which some combination of the rate constants becomes large or small. Figure 6 shows that the optimal prior has weight on at least five different 1-edges, none of which is a good description by itself. The concentration of weight seen in Jeffreys prior for these enzyme models is comparable to what we had before, with worst-case bias pressure B ≈ 14 bits in d = 3 and 28 bits in d = 8. These examples share geometric features with many real models in science [5], and thus we believe the problems described here are generic.

Discussion
Before fitting a model to data, there is often a selection step to choose a model which is complex enough to fit the true pattern, but not so complex as to fit the noise. The motivation for this is clear in maximum likelihood estimation, where only one point θ̂_x is kept, and there are various criteria for making the trade-off [29–31]. The motivation is less clear in Bayesian analysis, where slightly different criteria can be derived by approximating p(x) [32,33]. We might hope that if many different points θ are consistent with the noisy data x, then the posterior p(θ|x) should simply have weight on all of them, encoding our uncertainty about θ.
Why, then, is model selection needed at all in Bayesian inference? Our answer here is that it is performed to avoid measure-induced bias, not overfitting. When using a suboptimal prior, models with too much complexity do indeed perform badly. This problem is seen in Figure 5, in the rapid decline of the scores I(X;Θ) and B with increasing d, and would also be seen in the more traditional model evidence p(x); all of these scores prefer models with d ≤ 3. However, the problem is not overfitting, since the extra parameters being added are irrelevant, i.e., they can have very little effect on the predictions y_t(θ). Instead, the problem is concentration of measure. In models with tens of parameters, this effect can be enormous: for the d = 26 model with Jeffreys prior, it leads to posterior expectation values ∆ > 20 standard deviations away from ideal, mutual information I < 1 bit learned, and B > 500 bits of bias. This problem is completely avoided by the optimal prior p⋆(θ), which suffers no decline in performance with increasing parameter count d.
Geometrically, we can view traditional model selection as adjusting d to ensure that the model manifold only has dimensions of length L > 1. This ensures that most of the posterior weight is in the interior of the manifold; hence, ignoring model edges is justified. By contrast, when there are dimensions of length L < 1, the optimal posterior will usually have its weight at their extreme values, on several manifold edges, which are themselves simpler models [9]. Fisher lengths L depend on the quantity of data to be gathered, and repeating an experiment M times enlarges all of them by a factor √M. Large enough M can eventually make any dimension larger than 1, and thus repetition alters what d traditional model selection prefers. Similarly, repetition alters the effective dimensionality of p⋆(θ). Some earlier work on model geometry studies a series in 1/M [22,33,34]; this expansion around L = ∞ captures some features beyond the volume, but is not suitable for models with dimensions L ≲ 1.

Real models in science typically have many irrelevant parameters [5,35–38]. It is common to have parameter directions 10^−10 times as important as the most relevant one, but impossible to repeat an experiment the M = 10^20 times needed to bridge this gap. Sometimes it is possible to remove the irrelevant parameters and derive a simpler effective theory. This is what happens in physics, where a large separation of scales allows great simplicity and high accuracy [39,40]. However, many other systems we would like to model cannot, or cannot yet, be so simplified. For complicated biological reactions, climate models, or neural networks, it is unclear which of the microscopic details can be safely ignored, or what the right effective variable is. Unlike our toy models, we cannot easily adjust d, since every parameter has a different meaning. This is why we seek statistical methods which do not require us to find the right effective theory, and in particular, here we study priors almost invariant to model complexity.
The optimal prior is discrete, which makes it difficult to find, and this difficulty appears to be why its good properties have been overlooked. It is known analytically only for extremely simple models, such as M = 1 Bernoulli, and previous numerical work treated only slightly more complicated models, with d ≤ 2 parameters [9]. While our concern here is with the ideal properties, for practical use nearly optimal approximations may be required. One possibility is the adaptive slab-and-spike prior introduced in [6]. Another would be to use some variational family p_λ(θ) with adjustable meta-parameters λ [41].
Discreteness is also how the exactly optimal p⋆(θ) encodes a length scale L ≈ 1 in the model geometry, which is the divide between relevant and irrelevant parameters, between parameters that are constrained by the data and those which are not. Making this distinction in some way is essential for good behavior, and it implies a dependence on the quantity of data. An effective model appropriate for much less data than observed will be too simple: the atoms of p⋆(θ) will be too far apart (much like recording too few significant figures), or else selecting a small d means picking just one edge (fixing some parameters which may, in fact, be relevant). On the other hand, what we have demonstrated here is that a model appropriate for much more data, infinitely more in the case of p_J(θ), will instead introduce enormous bias into our inference about θ.
Author Contributions: Both authors contributed to developing the theory and writing the manuscript. M.C.A. was responsible for software, algorithms, and graphs. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
The code used to find the priors (and the scores) shown is available at https://github.com/mcabbott/AtomicPriors.jl, using Julia [42].

Acknowledgments:
We thank Isabella Graf, Mason Rouches, Jim Sethna, and Mark Transtrum for helpful comments on a draft.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
For brevity, the main text omits some standard definitions. The KL divergence (or relative entropy) is defined as

D_KL[ p(x) ‖ q(x) ] = ∫dx p(x) log [ p(x)/q(x) ].

With a conditional probability, our notation D_KL[ p(x|θ) ‖ q(x) ] integrates over x but remains a function of θ. The Fisher information metric g_μν(θ) is the quadratic term from expanding D_KL[ p(x|θ+dθ) ‖ p(x|θ) ].
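As a quick numerical check of this definition (our own illustration, with arbitrary Gaussians), grid integration reproduces the closed form for the KL divergence between two normal distributions:

```python
import numpy as np

# Two arbitrary Gaussians: a narrow p (like p(x|theta)) and a broad q (like p(x)).
mu1, s1 = 0.0, 0.1
mu2, s2 = 0.5, 0.4

xs = np.linspace(-3.0, 4.0, 20001)
dx = xs[1] - xs[0]
p = np.exp(-(xs - mu1)**2 / (2 * s1**2)) / np.sqrt(2 * np.pi * s1**2)
q = np.exp(-(xs - mu2)**2 / (2 * s2**2)) / np.sqrt(2 * np.pi * s2**2)

# D_KL[p || q] = integral of p log(p/q), by grid integration
# (guarding the points where p has underflowed to zero):
safe_p = np.where(p > 0, p, 1.0)
dkl_grid = float(np.sum(np.where(p > 0, p * np.log(safe_p / q), 0.0)) * dx)

# Closed form for two normal distributions:
dkl_exact = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
print(round(dkl_grid, 4), round(dkl_exact, 4))
```

The two numbers agree to the accuracy of the grid; note that D_KL is asymmetric, so swapping p and q gives a different value.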

The mutual information is

I(X;Θ) = ∫dθ p(θ) ∫dx p(x|θ) log [ p(x|θ)/p(x) ] = S(X) − S(X|Θ),

where we use Bayes' theorem, with p(x) = ∫dθ p(x|θ) p(θ), and entropy S(X) = −∫dx p(x) log p(x). Conditional entropy is S(X|θ) = −∫dx p(x|θ) log p(x|θ) for one value θ, or S(X|Θ) = ∫dθ p(θ) S(X|θ). With a Gaussian likelihood, Equation (7), and X = R^m, this is a constant:

S(X|Θ) = (m/2) log(2πeσ²).

Many of these quantities depend on the choice of prior, such as the posterior p(θ|x) and the mutual information I(X;Θ). This is also true of our bias pressure b(θ) and worst-case B = max_θ b(θ). When we need to refer to those for a specific prior, such as p⋆(θ), we use the same subscript, writing I⋆ and B⋆.
All probability distributions are normalized. In particular, the gradient in (5) is taken with the constraint of normalization-varying the density at each point independently would give a different constant.
Appendix A.1. Square Hypercone

Here we consider an even simpler toy model, in which we can more rigorously define what we mean by co-volume, and analytically calculate the posterior deviation. The model has one relevant parameter θ^1, of Fisher length L ≫ 1, and d − 1 irrelevant parameters θ^2, . . . , θ^d, along which the manifold's local width r(θ^1) ∝ θ^1 shrinks linearly toward the tip of the cone, so that the metric determinant factorizes, det g = det g_rel · det g_irrel.

Regarding the first factor as √(det g_rel) + O(1/L²), it is trivial to integrate the second factor over θ^μ for all μ ≥ 2, and this factor

∫₀¹ dθ² · · · ∫₀¹ dθ^d √(det g_irrel) = (r(θ^1))^{d−1}

is the irrelevant co-volume. The effective Jeffreys prior along the one relevant dimension is thus p_J(θ^1) ∝ (θ^1)^{d−1}, which clearly has much more weight at large θ^1, at the thick end of the cone. Now observe some x, giving a posterior p(θ^1|x). With a few lines of algebra, we can derive, assuming 1 ≪ x ≪ L, that the posterior deviation (9) is ∆ ≈ (d − 1)/x. Choosing L = 50 and d = 26 to roughly match Figure 3, at x ≈ 10 the deviation is ∆ ≈ 2.5. This is smaller than what is seen for the exponential decay model, whose geometry is, of course, more complicated. This difference is also detected by bias pressure. The maximum b(θ) for this cone is about 55 bits, which is close to the d = 5 model in Figure 5.

While this example takes all irrelevant dimensions to be of equal Fisher length, it would be more realistic to have a series L, 1, L^−1, L^−2, . . . equally spaced on a log scale. This makes no difference to the effective p_J(θ^1) ∝ (θ^1)^{d−1}, and hence no difference to the posterior deviation ∆. Figure A1 draws instead a cone with a round cross-section, which also makes no difference. It compares this to a shape of constant cross-section, for which p_J(θ^1) ∝ 1; hence, there is no such bias. The distinction between these two situations is, by assumption, unobservable, but using the d-dimensional notion of volume as a prior gives an effective p(θ^1) which is either flat or ∝ (θ^1)^5. This can induce substantial bias in the posterior p(θ^1|x).

Appendix A.2. Estimating I(X; Θ) and Its Gradient
The mutual information can be estimated, for a discrete prior with atoms θ_a and weights λ_a and a Gaussian likelihood (7), by replacing the integral over x with normally distributed samples:

I(X;Θ) ≈ Σ_a λ_a (1/N) Σ_{n=1}^{N} log [ p(x_{a,n}|θ_a) / p(x_{a,n}) ],   x_{a,n} ∼ p(x|θ_a).

This gives an unbiased estimate, and is what we used for plotting I(X;Θ) in Figures 2 and 5.
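A minimal Python sketch of such a sampling estimator (with invented atoms and noise level; the paper's own implementation is the Julia code cited in the Data Availability Statement):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1

# Invented discrete prior: two atoms in prediction space, separated by far more than sigma.
ys = np.array([0.0, 1.0])
lam = np.array([0.5, 0.5])

def log_px(x):
    """log p(x) for the mixture p(x) = sum_a lam_a * Normal(x; y_a, sigma^2)."""
    logs = -(x[None, :] - ys[:, None])**2 / (2 * sigma**2) \
           - 0.5 * np.log(2 * np.pi * sigma**2)
    m = logs.max(axis=0)   # log-sum-exp for numerical stability
    return m + np.log(np.sum(lam[:, None] * np.exp(logs - m), axis=0))

# Monte Carlo estimate of I(X;Theta): average log[p(x|theta_a)/p(x)] over x ~ p(x|theta_a).
N = 20000
I = 0.0
for y_a, l_a in zip(ys, lam):
    x = y_a + sigma * rng.standard_normal(N)
    log_like = -(x - y_a)**2 / (2 * sigma**2) - 0.5 * np.log(2 * np.pi * sigma**2)
    I += l_a * float(np.mean(log_like - log_px(x)))

print(round(I / np.log(2), 3))  # just under 1 bit: two fully distinguishable atoms
```

Because the two atoms sit ten noise widths apart, the estimate lands essentially at the entropy of the weights, here one bit; moving the atoms closer than σ would push it toward zero.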
However, for finding p⋆(θ), what we need are the very small gradients of I with respect to each θ_a: near the optimum, the function is very close to flat. We find that, instead of Monte Carlo, the following kernel density approximation of the entropy works well:

S(X) ≈ −Σ_a λ_a log Σ_b λ_b p(y(θ_a)|θ_b).   (A1)

Here we Taylor expand log p(x) about x = y(θ_a) for each atom [43]. For the purpose of finding p⋆(θ), we may ignore the constant and the conditional entropy in I(X;Θ) = S(X) − S(X|Θ). Further, this is a better estimate when used at σ′ = √2 σ. Before maximizing I(X;Θ) using L-BFGS [44,45] to adjust all θ_a and λ_a together, we find it useful to sample initial points using Mitchell's best-candidate algorithm [46].

Appendix A.3. Other Methods
Our focus here is on the properties of the optimal prior p⋆(θ), but better ways to find nearly optimal solutions may be needed in order to use these ideas on larger problems. Some ideas have been explored in the literature:

• The classic algorithm for finding p⋆(θ) is due to Blahut and Arimoto [47,48], but it needs a discrete Θ, which limits d. This was adapted to use MCMC sampling instead by [49], although their work appears to need a discrete X instead. Perhaps it can be generalized.

• More recently, a lower bound on I(X;Θ), denoted I_NS, was used by [41] to find approximations to p⋆(θ). This bound is too crude to see the features of interest here: for all models in this paper, it favors a prior p_2(θ) = argmax_{p(θ)} I_NS with just two delta functions, for any noise level σ.

• We mentioned above that adjusting some "meta-parameters" of some distribution p_{θ̄,σ̄}(θ) would be one way to handle near-optimal priors. This is the approach of [41], and of many papers maximizing other scores, often described as "variational".

• Another prior that typically has large I(X;Θ) was introduced in [6] under the name "adaptive slab-and-spike prior". It pulls every point x in the distribution

p_NML(x) = max_θ̂ p(x|θ̂) / Z,   Z = ∫dx max_θ̂ p(x|θ̂)

back to its maximum-likelihood point θ̂(x). The result has weight everywhere in the model manifold, but has extra weight on the edges. Because the amount of weight on edges is controlled by σ, it adopts an appropriate effective dimensionality (8), and has a low bias (5).
Computing √(det g(θ)) naively in high dimensions suffers badly from floating-point rounding (extended-precision numbers are required). However, in the case d = m, where the Jacobian J_{tμ} = ∂y_t/∂k_μ is a square matrix, it is of Vandermonde form. Hence, its determinant is known exactly, and for equally spaced times t_j = jδ, j = 1, . . . , m, we can simply write

|det J| = (Π_j t_j) · (Π_μ a_μ u_μ) · Π_{μ<ν} |u_μ − u_ν|,   u_μ = e^{−δ k_μ}.

For d ≠ m, more complicated formulae (involving a sum over Schur polynomials) are known, but the points in Figure 5 are chosen not to need them. To sample from Jeffreys prior, or posterior, we use the affine-invariant "emcee" sampler [50]. This adapts well to badly conditioned geometries. Because the Vandermonde formula lets us work in machine precision, we can sample 10^6 points in a few minutes, which is sufficient for Figure 3.
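For example, a small check of the d = m = 3 case with unit-spaced times and invented rate constants (δ = 1, so u_μ = e^{−k_μ}):

```python
import numpy as np
from math import factorial

# Exponential decay model with d = m = 3, observed at unit-spaced times t = 1, 2, 3,
# so the Jacobian J[t, mu] = -t * a_mu * x_mu^t with x_mu = exp(-k_mu) is of Vandermonde form.
k = np.array([0.5, 1.0, 2.0])   # invented rate constants
a = np.full(3, 1 / 3)
x = np.exp(-k)
ts = np.arange(1, 4)

J = np.array([[-t * a[mu] * x[mu]**t for mu in range(3)] for t in ts])
det_numeric = abs(np.linalg.det(J))

# |det J| = m! * prod_mu(a_mu x_mu) * prod_{mu<nu} |x_mu - x_nu|
vand = np.prod([abs(x[i] - x[j]) for i in range(3) for j in range(i + 1, 3)])
det_formula = factorial(3) * np.prod(a * x) * vand

print(det_numeric, det_formula)  # agree to machine precision
```

The product over pairwise differences |x_μ − x_ν| is what underflows when many decay rates crowd together, which is why evaluating it factor by factor in log space is so much better conditioned than a generic determinant.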
In the enzyme kinetics models of Figure 6, finding y t (θ) involves solving a differential equation. The gradient ∂y t /∂θ µ is needed both for Jeffreys density and for maximizing S(X) via the above KDE Formula (A1). This can be handled efficiently by passing dual numbers through the solver [51].
The arrows in (11) summarise the following differential equations for the concentrations of the four chemicals involved:

d[E]/dt = −k_f [E][S] + (k_r + k_p)[ES],
d[S]/dt = −k_f [E][S] + k_r [ES],
d[ES]/dt = k_f [E][S] − (k_r + k_p)[ES],
d[P]/dt = k_p [ES].

The original analysis of Michaelis and Menten [52] takes the first two reactions to be in equilibrium. This can be viewed as taking the limit k_r, k_f → ∞ holding fixed K_D = k_r/k_f, which picks a 2-parameter subspace of Θ, an edge of the manifold. Then the ratio [ES] = [E][S]/K_D is maintained, leaving their equation

d[P]/dt = k_p E_0 [S] / (K_D + [S]).

If we do not observe [E], then this is almost identical to the quasi-static limit of Briggs and Haldane [53], who take k_r, E_0 → 0, and k_f, k_p → ∞ holding fixed K_M = k_p/k_f and V_max = k_p E_0, which gives

d[P]/dt = V_max [S] / (K_M + [S]),

which was much later shown to be analytically tractable [54]. In Figure 6, most of the points of weight in the optimal prior lie on the intersection of these two 2-parameter models, that is, on a pair of one-parameter models. These and other limits were discussed geometrically in [28].
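These equations are easy to integrate directly. The following sketch (our own RK4 integration, with invented rate constants and initial conditions, not those behind Figure 6) simulates the full mass-action system and checks the two conservation laws [E] + [ES] = E_0 and [S] + [ES] + [P] = S_0:

```python
import numpy as np

def rhs(c, kf, kr, kp):
    """Mass-action rates for E + S <-> ES -> E + P, with c = [E, S, ES, P]."""
    E, S, ES, P = c
    v1 = kf * E * S - kr * ES   # net binding
    v2 = kp * ES                # catalysis
    return np.array([-v1 + v2, -v1, v1 - v2, v2])

def rk4_step(c, dt, *k):
    k1 = rhs(c, *k)
    k2 = rhs(c + dt / 2 * k1, *k)
    k3 = rhs(c + dt / 2 * k2, *k)
    k4 = rhs(c + dt * k3, *k)
    return c + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# Invented rate constants and initial conditions (not those used for Figure 6).
kf, kr, kp = 10.0, 1.0, 1.0
c = np.array([0.1, 1.0, 0.0, 0.0])   # E0 = 0.1, S0 = 1.0, no ES or P yet
traj = [c]
for _ in range(5000):                # integrate to t = 5 with dt = 0.001
    c = rk4_step(c, 0.001, kf, kr, kp)
    traj.append(c)
traj = np.array(traj)

print(np.round(traj[-1], 3))  # [E, S, ES, P] at t = 5
```

Because both conservation laws are linear in the concentrations, Runge–Kutta preserves them to machine precision, which makes them a convenient sanity check on the integrator.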