Empirical Information Metrics for Prediction Power and Experiment Planning

In principle, information theory could provide useful metrics for statistical inference. In practice this is impeded by divergent assumptions: Information theory assumes the joint distribution of variables of interest is known, whereas in statistical inference it is hidden and is the goal of inference. To integrate these approaches we note a common theme they share, namely the measurement of prediction power. We generalize this concept as an information metric, subject to several requirements: Calculation of the metric must be objective or model-free; unbiased; convergent; probabilistically bounded; and low in computational complexity. Unfortunately, widely used model selection metrics such as Maximum Likelihood, the Akaike Information Criterion and Bayesian Information Criterion do not necessarily meet all these requirements. We define four distinct empirical information metrics measured via sampling, with explicit Law of Large Numbers convergence guarantees, which meet these requirements: $I_e$, the empirical information, a measure of average prediction power; $I_b$, the overfitting bias information, which measures selection bias in the modeling procedure; $I_p$, the potential information, which measures the total remaining information in the observations not yet discovered by the model; and $I_m$, the model information, which measures the model's extrapolation prediction power. Finally, we show that $I_p + I_e$, $I_p + I_m$, and $I_e - I_m$ are fixed constants for a given observed dataset (i.e. prediction target), independent of the model, and thus represent a fundamental subdivision of the total information contained in the observations. We discuss the application of these metrics to modeling and experiment planning.

Information theory as formulated by Shannon [1], Kolmogorov and others provides an elegant and general measure of information (or coupling) that connects variables. As such, it might be expected to be universally applied in the "Information Age" (see, for example, the many fields to which it is relevant, described in [2]). Identifying and measuring such information connections between variables lies at the heart of statistical inference (inferring accurate models from observed data) and more generally of scientific inference (performing experimental observations to infer increasingly accurate models of the universe).
However, information theory and statistical inference are founded on rather different assumptions, which greatly complicate their union. Statistical inference draws a fundamental distinction between observable variables (operationally defined measurements with no uncertainty) and hidden variables (everything else). It seeks to estimate the likely probability distribution of one or more hidden variables, given a sample of relevant observed variables. Note that from this point of view, probability distributions are themselves hidden, in the sense that they can only be estimated (with some uncertainty) via inference. For example, individual values of an observable are directly observed, but their true distribution can only be inferred from a sample of many such observations. Traditional information theory, by contrast, assumes as a starting point that the joint probability distribution p(X, Y, Z...) of all variables of interest is completely known, as a prerequisite for beginning any calculations. The basic tools of information theory (entropy, relative entropy, and mutual information) are undefined unless one has the complete joint probability distribution p(X, Y, Z...) in hand. Unfortunately, in statistical inference problems this joint distribution is unknown, and is precisely what we are trying to infer. Thus, while "marrying" information theory and statistical inference is by no means impossible, it requires clear definitions that resolve these basic mismatches in assumptions.
In this paper we begin from a common theme that is important to both areas, namely the concept of prediction power, i.e., a model's ability to accurately predict values of the observable variable(s) that it seeks to model. Prediction power metrics have long played a central role in statistical inference. Fisher formulated prediction power as simply the total likelihood of the observations given the model, and developed Maximum Likelihood estimators, based on seeking the specific model that maximizes this quantity. This concept remains central to more recent metrics such as the Akaike Information Criterion (AIC) [3], and Bayesian Information Criterion (BIC) [4], which add "corrections" based on the number of model parameters being fitted.
In this paper we define a set of metrics that serve as statistical inference proxies for the fundamental metrics of information theory (such as mutual information, entropy and relative entropy). We show that they are vitally useful for statistical inference (for precisely the same properties that make them useful in information theory), and highlight how they differ from standard statistical inference metrics such as Maximum Likelihood, AIC and BIC. We present a series of metrics that address distinct aspects of statistical inference:
• prediction power, as it is ordinarily defined, as the likelihood of future observations (e.g., "test data") under a given set of conditions that we have already observed ("training data").
• bias: A measure of any systematic difference in the model's prediction power on future observations vs. on its original training data.
• completeness: We define a modeling process as "complete" when no further improvements in prediction power are possible (by further varying the model). Thus a completeness metric measures how far we are from obtaining the best possible model.
• extrapolation prediction power: We will introduce a measure of how much the model's prediction power exceeds the prediction power of our existing observation density, when tested on future observations. If this value is zero (or negative) one might reasonably ask whether the model's results can truly be called a "prediction", or are instead only a summary (or "interpolation") of our existing observation data.
To clarify the challenges that such metrics must solve, we wish to highlight several characteristics they must possess:
• objective or model-free: One important criterion for such a metric is whether it is model-free; that is, whether or not the calculation of the metric itself involves a process that is equivalent to modeling. If it does, the metric can only be considered to yield a "subjective" evaluation: how well one model fits the expectations of another model. By contrast, a model-free metric aims to provide an objective measure of how well a model fits the empirical observations. While this criterion may seem very simple to achieve, it poses several challenges, which this paper will seek to clarify.
• unbiased: Like any estimator calculated from a finite sample, these metrics are expected to suffer from sampling errors, but they must be mathematically proven to be free from systematic errors. Such errors are an important source of overfitting problems, and it is important to understand how to exclude them by design.
• convergent: These metrics must provide explicit Law of Large Numbers proofs that they converge to the "true value" in the limit of large sample size. The assumption of convergence is implicit in the use of many methods (such as Maximum Likelihood), but unfortunately the strict requirements of the Law of Large Numbers are sometimes violated, breaking the convergence guarantee and resulting in serious errors. To prevent this, a metric must explicitly show that it meets the requirements of the Law of Large Numbers.
• bounded: These metrics must provide probabilistic bounds that measure the level of uncertainty about their true value, based on the limitations of the available evidence.
• low computational complexity: Ideally, the computational complexity for computing a metric should be O(N log N) or better, where N is the number of sample observations.
In this paper we define a set of metrics obeying these requirements, which we shall refer to as empirical information metrics. As a Supplement, we also provide a tutorial that shows how to calculate these metrics using darwin, an easy-to-use open source software package in Python, available at https://github.com/cjlee112/darwin.

Standard Prediction Power Metrics
Fisher defined the prediction power of a model Ψ for an observable variable X in terms of the total likelihood of a sample of independent and identically distributed (I.I.D.) draws $X_1, X_2, \ldots, X_n$:
$$\log p(X_1, X_2, \ldots, X_n|\Psi) = \sum_{i=1}^{n} \log \Psi(X_i) = n\bar{L}$$
where we adopt the convention Ψ(X) ≡ p(X|Ψ) as a shorthand for the probability of an observation given a model, and define the log-likelihood L = log Ψ(X). We follow the standard notation $\bar{L}$ to indicate its sample mean. Note that we will sometimes write L(Ψ) to emphasize that L is a function of the specific model we are computing.
Fisher's Maximum Likelihood method seeks the model that maximizes the total likelihood or, equivalently, the sample average log-likelihood $\bar{L}$. Similarly, minimizing the Akaike Information Criterion (AIC) [3]
$$AIC = 2k - 2\log p(x_1, x_2, \ldots, x_n|\Psi) = 2k - 2n\bar{L}$$
or the Bayesian Information Criterion (BIC) [4]
$$BIC = k \log n - 2n\bar{L}$$
again seeks to maximize the prediction power $\bar{L}$ while explicitly correcting for model complexity expressed as k, the number of free parameters in the model Ψ.
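To make these three metrics concrete, the following minimal sketch computes the sample average log-likelihood, the AIC and the BIC for a normal model fitted by Maximum Likelihood. It assumes natural logarithms, the forms AIC = 2k − 2n·(mean log-likelihood) and BIC = k log n − 2n·(mean log-likelihood), and k = 2 free parameters; all function names are illustrative and are not taken from the darwin package.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of the normal distribution N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def fit_normal_ml(xs):
    """Maximum Likelihood estimates (mean, standard deviation) of a normal model."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # ML variance divides by n
    return mu, math.sqrt(var)

def mean_log_likelihood(xs, mu, sigma):
    """Sample average log-likelihood of the observations under the model."""
    return sum(normal_logpdf(x, mu, sigma) for x in xs) / len(xs)

def aic(xs, mu, sigma, k=2):
    """AIC = 2k - 2n * (mean log-likelihood)."""
    return 2 * k - 2 * len(xs) * mean_log_likelihood(xs, mu, sigma)

def bic(xs, mu, sigma, k=2):
    """BIC = k log n - 2n * (mean log-likelihood)."""
    n = len(xs)
    return k * math.log(n) - 2 * n * mean_log_likelihood(xs, mu, sigma)

if __name__ == "__main__":
    xs = [-1.0, 0.0, 1.0]  # hypothetical observations
    mu, sigma = fit_normal_ml(xs)
    print(mean_log_likelihood(xs, mu, sigma), aic(xs, mu, sigma), bic(xs, mu, sigma))
```

Because every normal model has the same k, the AIC and BIC here differ from −2n times the mean log-likelihood only by a constant, so all three criteria rank the candidate models identically.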
Vapnik-Chervonenkis theory also supplies a correction factor that penalizes model complexity for classifier problems [6]. For example, consider the simplest case of a binary classifier that predicts the class of each data point with a confidence factor C (by assigning that class a likelihood of $1 - \frac{1}{C}$, and the other class a likelihood of $\frac{1}{C}$). In this case the classification error probability on the training data, $R_{train}$, converges for large C to $R_{train} \to -\bar{L}/\log C$, and structural risk minimization indicates choosing the model that minimizes the upper bound of the classification error probability:
$$R_{VC} = R_{train} + \sqrt{\frac{h\left(\log\frac{2n}{h} + 1\right) - \log\frac{\eta}{4}}{n}}$$
where h is the Vapnik-Chervonenkis (VC) dimension of the model (a measure of model complexity), n is the number of training observations, and η is the desired level of confidence for the probabilistic bound.

Prediction Power and the Law of Large Numbers
These metrics are best understood by highlighting the critical role that the Law of Large Numbers plays in inference metrics. Say we want to find a model Ψ that maximizes the total likelihood of many draws of X, or equivalently the expectation value of the log-likelihood, which depends on the true distribution Ω(X):
$$E(L) = \sum_X \Omega(X) \log \Psi(X)$$
where the summation is over all possible values of X (for a continuous variable the summation is replaced by an integral). Since we do not know the true distribution Ω(X) we cannot use this definition directly. However, we can apply the Law of Large Numbers (LLN) to the log-likelihood of a sample of observations, whose sample average must converge
$$\bar{L} = \frac{1}{n}\sum_{i=1}^{n} L_i \to E(L)$$
as n → ∞, if the sample values $L_i$ are conditionally independent given Ω and identically distributed as L, and the variance Var(L) is finite (the LLN can also be extended to the case of exchangeable observations [5]). Specifically, the Law of Large Numbers guarantees a probabilistic bound on the sample estimator's deviation from the expectation value: for any α > 0,
$$p\left(|\bar{L} - E(L)| \ge \alpha\right) \le \frac{Var(L)}{n\alpha^2}$$
So we obtain a lower bound estimate for E(L) at confidence level $1 - \epsilon$ of
$$\bar{L}_\epsilon = \bar{L} - \sqrt{\frac{Var(L)}{\epsilon n}}$$
Note that to actually compute this lower bound, we must also use our sample to estimate the variance, which adds another source of error. In practice this is usually not a problem, except for pathological cases (e.g., Var(L) → ∞). For example, to calculate a 95% confidence lower bound:
$$\bar{L}_{0.05} = \bar{L} - \sqrt{\frac{Var(L)}{0.05\,n}}$$
where we have used the shorthand notation $Var(L) = \overline{(L - \bar{L})^2}$ to denote the sample estimator of the variance. Note that since the Law of Large Numbers is a general result (i.e., it holds over all possible distributions), it does not necessarily represent the best confidence interval that one can obtain for a specific case. Other methods for computing a confidence interval, such as resampling [7], can usually improve on (i.e., increase) this lower bound, but we will not explore such implementation details in this paper.
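The lower bound estimator described above can be sketched in a few lines. This is a minimal illustration that assumes the bound takes the Chebyshev form (sample mean minus the square root of Var/(εn)), with the variance replaced by its sample estimator; the function name `lln_lower_bound` is hypothetical.

```python
import math

def lln_lower_bound(values, epsilon=0.05):
    """Lower bound on the expectation of `values` at confidence 1 - epsilon.

    Assumes the Chebyshev form: mean - sqrt(Var / (epsilon * n)),
    with Var estimated from the same sample.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # sample estimator of Var
    return mean - math.sqrt(var / (epsilon * n))

if __name__ == "__main__":
    log_likelihoods = [1.0, 2.0, 3.0, 4.0]  # hypothetical L_i values
    print(lln_lower_bound(log_likelihoods, epsilon=0.05))
```

Because the width of the bound shrinks as 1/√n, quadrupling the sample size halves the gap between the sample mean and its lower bound.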
Since the $X_i$ are indeed conditionally independent given Ω and identically distributed as X, we expect for large sample size n to be able to use $\bar{L}$ as a proxy for E(L). In that case maximizing $\bar{L}$ also maximizes E(L), which it is convenient to separate into one term dependent on Ψ and another term dependent only on Ω:
$$E(L) = \sum_X \Omega(X) \log \Psi(X) = -D(\Omega||\Psi) - H(\Omega(X))$$
where D(Ω||Ψ) is the relative entropy of model Ψ relative to the true distribution Ω, and H(Ω(X)) is the entropy of the true distribution Ω. Since the right hand term is constant with respect to Ψ, this expression is maximized when D(Ω||Ψ) is minimized, which occurs iff Ψ(X) = Ω(X) for all values of X. This guarantees that choosing the model $\Psi^*$ that maximizes E(L) will indeed identify the correct model $\Psi^*(X) = \Omega(X)$.
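The decomposition of the expected log-likelihood into a relative entropy term and an entropy term can be checked numerically for a discrete observable. The sketch below assumes the identity E(L) = −D(Ω||Ψ) − H(Ω) with natural logarithms; the two distributions are hypothetical examples.

```python
import math

def expected_log_likelihood(omega, psi):
    """E(L): sum over all values of X of Omega(X) * log Psi(X)."""
    return sum(w * math.log(p) for w, p in zip(omega, psi) if w > 0)

def entropy(omega):
    """H(Omega) = -sum Omega(X) log Omega(X)."""
    return -sum(w * math.log(w) for w in omega if w > 0)

def relative_entropy(omega, psi):
    """D(Omega||Psi) = sum Omega(X) log(Omega(X) / Psi(X))."""
    return sum(w * math.log(w / p) for w, p in zip(omega, psi) if w > 0)

omega = [0.5, 0.25, 0.25]  # hypothetical true distribution
psi = [0.25, 0.5, 0.25]    # hypothetical model

lhs = expected_log_likelihood(omega, psi)
rhs = -relative_entropy(omega, psi) - entropy(omega)
print(lhs, rhs)  # the two values agree up to rounding
```

Setting the model equal to the true distribution makes the relative entropy vanish, so the expected log-likelihood reaches its maximum value of −H(Ω).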

The Problem of Selection Bias
Unfortunately, there is a catch. This guarantee can only be extended to maximization of the sample log-likelihood $\bar{L}$ if the $L_i$ are identically distributed as L. All of these metrics ($\bar{L}$, AIC, BIC) were designed for use with model selection; that is, we compute the metric for each of a large set of models, then select the model that maximizes the likelihood (or minimizes the AIC or BIC). And the very nature of model selection introduces bias into the sample likelihoods [8]. Briefly, if the model Ψ was chosen specifically to maximize the values $L_i$, we cannot assume that the $L_i$ are identically distributed as L. Indeed, we expect that the $L_i$ will be biased to higher values than L in general. Therefore the Law of Large Numbers convergence guarantee collapses, and we cannot prove that model selection using $\bar{L}$ will yield the true distribution Ω. Vapnik-Chervonenkis theory seeks to protect against this bias by deriving an upper bound on the possible error due to selection bias [6], based on the model's VC dimension.
First, let's examine this problem from an empirical point of view, by simply defining a metric for measuring the bias. We define a test data criterion:
• a set of sample values $X_1, X_2, \ldots, X_m$ are valid test data for a model Φ predicting an observable X if the $X_i$ are exchangeable, identically distributed as X, and conditionally independent of Φ given the true distribution Ω, i.e., $p(X_i, \Phi|\Omega) = p(X_i|\Omega)p(\Phi|\Omega)$. Equivalently, Φ contains no information about the $X_i$ except via their shared dependence on the hidden distribution Ω.
Note that for any model Φ generated by model selection, its training data do not meet this requirement, since Φ is not conditionally independent of the training data given Ω.
We desire an estimator for $\bar{L} - E(L)$. Since the $X_i$ are identically distributed as X and conditionally independent of Φ given Ω, the $\log \Phi(X_i)$ are identically distributed as log Φ(X), i.e., L. So by the Law of Large Numbers we can define an overfitting bias information metric
$$I_b = \bar{L} - \bar{L}_e \to \bar{L} - E(L)$$
as m → ∞, where $\bar{L}_e$ is the sample average of the $\log \Phi(X_i)$ test data log-likelihoods. We will refer to $\bar{L}_e$ as the empirical log-likelihood. Note that whereas Vapnik-Chervonenkis theory provides an upper bound on the bias errors for an entire class of models (i.e., all models with the same VC dimension), $I_b$ measures the actual error due to a specific model's selection bias.
The BIC adds a correction term k log n to the total log-likelihood, which penalizes models with larger numbers of parameters. Note that this correction is designed specifically to protect against overfitting. This correction is referred to as the Bayesian Information Criterion because it is based on choosing the model with maximum Bayesian posterior probability, and by this criterion is provably optimal for the exponential family of models [4].
However, several caveats about such corrections should be understood:
• a given correction addresses a particular kind of overfitting, for example, for the AIC and BIC, excessive number of model parameters k;
• a given correction is based on specific assumptions about the model, and may not behave as expected under other conditions;
• such corrections do not guarantee that the model they select will be optimal, or even unbiased.
As an example, Figure 1 (Overfitting analysis of BIC models on a small sample from a normal distribution) shows the distribution of $\bar{L}$ vs. $\bar{L}_e$ for BIC-optimal models generated using a sample of three observations drawn from a unit normal distribution. (Note that in this case BIC-optimality is equivalent to AIC-optimality and Maximum Likelihood, since all possible normal models share the same value of k = 2.) This simple example illustrates several points:
• A large fraction of the models strongly overfit the observations, as indicated by a large deviation from the $\bar{L} = \bar{L}_e$ diagonal.
• $\bar{L}$ and $\bar{L}_e$ are strongly and non-linearly anti-correlated. That is, the better the apparent fit to the training data, the worse the actual fit to the test data.
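The overfitting bias in this kind of experiment can be measured directly. The following minimal sketch assumes the bias estimate is the training mean log-likelihood minus the test mean log-likelihood, and uses a normal model fitted by Maximum Likelihood to three observations, as in the example above; the seed, the test sample size and the function names are arbitrary illustrative choices.

```python
import math
import random

def normal_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def overfitting_bias(train, test):
    """Fit a normal model to `train` by Maximum Likelihood.

    Returns (training mean log-likelihood, test mean log-likelihood,
    their difference). `test` must not have been used to choose the model.
    """
    n = len(train)
    mu = sum(train) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in train) / n)  # ML estimate
    l_train = sum(normal_logpdf(x, mu, sigma) for x in train) / n
    l_test = sum(normal_logpdf(x, mu, sigma) for x in test) / len(test)
    return l_train, l_test, l_train - l_test

if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed
    train = [rng.gauss(0.0, 1.0) for _ in range(3)]
    test = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    print(overfitting_bias(train, test))
```

A training sample whose points happen to lie close together yields a narrow fitted model with a high training likelihood and a very poor test likelihood, which is the anti-correlation noted above.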

The Empirical Information Metric
Based on these considerations, we use the unbiased estimator $\bar{L}_e$ to define the empirical information, a signed measure of prediction power relative to the uninformative distribution p(X):
$$I_e(\Psi) = \frac{1}{m}\sum_{i=1}^{m} \log\frac{\Psi(X_i)}{p(X_i)}$$
The empirical information estimates the improvement in the accuracy of a model Ψ(X) in predicting the test observations. For observable variables X whose uninformative distribution is simply a constant density, $I_e(\Psi)$ differs from $\bar{L}_e(\Psi)$ by simply a constant (log R, where R is the size of the range of X).
In such cases the lower bound estimator for $I_e$ differs from that of $\bar{L}_e$ only by this constant:
$$I_{e,\epsilon} = \bar{L}_{e,\epsilon} + \log R$$
It is important to note a few aspects of the empirical information that arise from the above considerations:
• $I_e$ can be negative, if the model's prediction power is even worse than that of the uninformative distribution.
• Whereas most metrics for model selection such as the AIC and BIC contain correction terms dependent on the model complexity k (or VC dimension h), $I_e$ needs no such corrections because it is unbiased by definition. Excessive model complexity will not increase $I_e$ but instead will reduce it. In this sense, it follows an approach similar to cross-validation [9].
• We do not need to incorporate the sample size directly into the metric definition (as in the case of the BIC [4], the Vapnik-Chervonenkis upper-bound error $R_{VC}$ [6], and "small-sample corrected" versions of the AIC such as the AICc [10]). Instead, the effect of sample size emerges naturally from the Law of Large Numbers lower bound estimator for our empirical information metrics (e.g., $\bar{L}_{e,\epsilon}$, $I_{b,\epsilon}$, $I_{e,\epsilon}$). Fundamentally, the importance of sample size is simply the uncertainty due to sampling error, and the Law of Large Numbers probabilistic bound captures this in a general way.
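As an illustration, the empirical information and its lower bound can be computed from test data alone. This sketch assumes the empirical information is the test-sample average of log Ψ(X) − log p(X), and that its lower bound takes the Chebyshev form used for the other estimators; the function name and the example densities are hypothetical.

```python
import math

def empirical_information(test_data, model_logpdf, ref_logpdf, epsilon=0.05):
    """Return (I_e, lower bound at confidence 1 - epsilon).

    I_e is the mean over test data of log Psi(X) - log p(X), where p is
    the uninformative reference distribution. The test data must not have
    been used to select the model.
    """
    diffs = [model_logpdf(x) - ref_logpdf(x) for x in test_data]
    m = len(diffs)
    i_e = sum(diffs) / m
    var = sum((d - i_e) ** 2 for d in diffs) / m  # sample variance
    return i_e, i_e - math.sqrt(var / (epsilon * m))

if __name__ == "__main__":
    # Hypothetical example: reference is uniform on a range of size R = 4,
    # the model is uniform on a range of size 2 that covers all test data.
    ref = lambda x: math.log(1.0 / 4.0)
    model = lambda x: math.log(1.0 / 2.0)
    print(empirical_information([0.5, 1.0, 1.5], model, ref))
```

In this example the model halves the range in which observations are expected, so it gains log 2 nats (one bit) of empirical information over the reference.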

Empirical Information as A Sampleable Form of Mutual Information
Consider the following "mutual information sampling problem":
• draw a specific inference problem (hidden distribution Ω(X)) from some class of real-world problems (e.g., for weight distributions of different animal species, this step would mean randomly choosing one particular animal species);
• draw training data $\vec{X}^t$ and test data X from Ω(X);
• find a way to estimate the mutual information $I(\vec{X}^t; X)$ on the basis of this single case (single instance of Ω).
The standard definition of mutual information
$$I(\vec{X}^t; X) = E\left(\log\frac{p(\vec{X}^t, X)}{p(\vec{X}^t)\,p(X)}\right)$$
does not enable such a calculation. Even if we draw many pairs $\vec{X}^t, X$ to estimate this value, we will just get a value of zero, because $\vec{X}^t, X$ are conditionally independent given Ω. The mutual information $I(\vec{X}^t; X)$ is defined only over the complete joint distribution $p(\Omega, \vec{X}^t, X)$; it does not appear meaningful to talk about calculating it from a single instance of Ω.
By contrast with mutual information, we do calculate empirical information for a specific value of Ω, i.e., we use it to measure the prediction power of our model Ψ on observations emitted by that specific value of Ω. It is therefore interesting to investigate the relationship of the empirical information vs. the mutual information. We follow the usual information theory approach of taking its expectation value over the complete joint distribution:
$$E(I_e) = E\left(\log\frac{\Psi(X|\vec{X}^t)}{p(X)}\right) = E\left(\log\Psi(X|\vec{X}^t)\right) + H(X)$$
assuming that the uninformative distribution p(X) used in the denominator of $I_e$ matches the true marginal distribution of X. Focusing on the remaining expectation log-likelihood term:
$$E(L) = \sum_{\Omega}\sum_{\vec{X}^t}\sum_{X} p(\Omega, \vec{X}^t, X)\log\Psi(X|\vec{X}^t)$$
where we take the expectation value over all possible values of the observable X, all possible values of the hidden variable Ω, and all possible training data sets $\vec{X}^t$ of size t. Note that we write the model as $\Psi(X|\vec{X}^t)$ to explicitly emphasize its dependence on a set of training data $\vec{X}^t$. Since Ω does not appear in the log term we can eliminate it:
$$E(L) = \sum_{\vec{X}^t}\sum_{X} p(\vec{X}^t, X)\log\Psi(X|\vec{X}^t) = -D\left(p(X|\vec{X}^t)\,||\,\Psi(X|\vec{X}^t)\right) - H(X|\vec{X}^t)$$
where the first term is a relative entropy of the model vs. the true conditional probability, and the second term is the conditional entropy of the observable vs. the training data. Therefore the expectation value of the empirical information is just:
$$E(I_e) = I(X; \vec{X}^t) - D\left(p(X|\vec{X}^t)\,||\,\Psi(X|\vec{X}^t)\right)$$
where $I(X; \vec{X}^t) = H(X) - H(X|\vec{X}^t)$ is the mutual information between the training data and the observable. Now consider the following sampling protocol:
• for one specific inference problem (hidden value of Ω), we draw a training dataset $\vec{X}^t$, use it to train a model $\Psi(X|\vec{X}^t)$, and measure the empirical information $I_e(\Psi)$ on a set of test data $\vec{X}^n$ drawn from the same distribution.
• We repeat this procedure for multiple inference problems $\Omega^{(1)}, \Omega^{(2)}, \ldots, \Omega^{(m)}$, and take the average
$$\bar{I}_e = \frac{1}{m}\sum_{j=1}^{m} I_e^{(j)}$$
where $I_e^{(j)}$ is the empirical information measured for problem $\Omega^{(j)}$.
If the model $\Psi(X|\vec{X}^t)$ approximates the true conditional distribution $p(X|\vec{X}^t)$ more and more closely, the relative entropy term $D(p(X|\vec{X}^t)||\Psi(X|\vec{X}^t))$ will vanish, and we expect the average of the empirical information values to converge simply to $I(X; \vec{X}^t)$. Under these conditions, the empirical information becomes a "sampleable form" of the mutual information. Note that the mutual information itself does not have this property; as shown above, the mutual information cannot be computed "piecewise" for individual instances of Ω and then averaged. By contrast, if we compute the empirical information for each inference problem, and then take the average, it will converge to the mutual information.

The Problem of Convergence
If we wish to maximize prediction power, our ultimate goal must be convergence, namely that our model will converge to the true, hidden distribution Ω. So we must ask the obvious question, how do we know when we're done? Two basic strategies present themselves:
• self-consistency tests: We can use our model as a reference to test whether the observations exactly match its expectations, as must be true if Ψ → Ω.
• convergence distance metric: If we knew the value of the absolute maximum prediction power L(Ω) possible for our target observable X, we could define a distance metric δ = L(Ω) − L(Ψ), which measures how "far" our current model is from convergence, in terms of its relative prediction power.
We will define empirical information metrics for both these approaches.

The Inference "Halting Problem"
As an example of the need for a convergence metric, we consider the process of Bayesian inference in modeling scientific data. In scientific research, we cannot easily restrict the set of possible models a priori either to closed-form analytic solutions or to finite sets of models that we can fully compute in practical amounts of CPU time. That is, the set of all possible models of the universe is not strictly bounded, and generally can be reduced only by calculating likelihoods for different terms of this set vs. experimental observations. What is the computational complexity for Bayesian inference to find the correct term Ω or any term within some distance δ of it? We can view this as a form of the Halting Problem, in the sense that it requires a metric that indicates when it has found a term that is less than δ distance from Ω, at which point the algorithm halts. Unfortunately, the standard form of Bayes' Law offers no evident shortcuts:
$$p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{\sum_{\theta'} p(X|\theta')\,p(\theta')}$$
Even if we had calculated all but one last term of the summation, we still would not know whether our best model so far is actually the best model, or even whether it is within distance δ of the best model. In the absence of a halting test, this implies that its computational complexity must simply be that of exhaustive enumeration. This is a serious problem, especially given that the set of all possible models may be infinite for scientific inference problems.
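A toy calculation illustrates why the uncomputed terms matter. The sketch below applies Bayes' Law to a hypothetical list of joint terms p(X|θ)p(θ), first using only the terms computed so far and then using the full set; the numbers are invented purely for illustration.

```python
def posterior(joint_terms, index):
    """Bayes' Law with the denominator summed over the supplied terms only.

    joint_terms[i] is p(X|theta_i) * p(theta_i) for model term theta_i.
    """
    return joint_terms[index] / sum(joint_terms)

# Hypothetical joint terms for three model terms; only the first two
# have been computed so far.
all_terms = [0.02, 0.01, 0.07]
computed_terms = all_terms[:2]

partial = posterior(computed_terms, 0)  # posterior using computed terms only
full = posterior(all_terms, 0)          # posterior using every term
print(partial, full)
```

Nothing in the first two terms reveals that the third, uncomputed term dominates the sum, so the partial calculation ranks the first model as clearly best even though its true posterior is much smaller.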
In real-world practice this "halting problem" often grows into an even worse problem of "model misspecification" [11]. That is, Bayesian computational methods typically lack a mechanism for generating all possible models even in theory. Instead they are limited to assuming a specific mathematical form for the model. Unless by good fortune the true distribution exactly fits this mathematical form, the computation will simply exclude it. Therefore, a reliable convergence metric becomes essential as an external indicator for whether the computational model is "misspecified" in this way. It should be noted that this is not addressed by asking whether a given Bayesian modeling process has "converged" in the sense of a Markov Chain Monte Carlo sampling process converging to its stationary distribution [12]. Any such process is still restricted by its assumptions of a specific mathematical form for the model; there is no guarantee that this will contain the correct answer.

Potential Information
We define $I_\infty$ as the total information content obtainable from a set of observations by considering the infinite set of all possible models. By analogy to the classical physics division of kinetic vs. potential energy components, we divide this into one part representing the model terms we've actually calculated ($I_e$, the empirical information), and a second part for the remaining uncomputed terms, which we define as $I_p$, the potential information:
$$I_\infty = I_e + I_p$$
$I_p$ therefore represents the maximum amount of information theoretically attainable by computing more terms of the infinite set. Assuming that the true, hidden likelihood is Ω(X) and that our current model (after considering all terms calculated so far) is Ψ(X), then
$$I_p = I_\infty - I_e = E\left(\log\frac{\Omega(X)}{p(X)}\right) - E\left(\log\frac{\Psi(X)}{p(X)}\right)$$
where p(X) is the uninformative reference distribution, which cancels, yielding
$$I_p = -H(\Omega(X)) - E(L(\Psi))$$
We can therefore solve the Inference Halting Problem by deriving an empirical $I_p$ estimator (with a Law of Large Numbers convergence guarantee) that can be calculated without computing any more terms of the infinite model set. This is surprisingly straightforward. The right-hand term can be estimated directly by $-\bar{L}_e$ (the negative of the empirical log-likelihood). The left-hand term −H(Ω(X)) is simply the negative entropy of the observable. We evidently need an empirical estimator of the entropy, and specifically of the density Ω(X).
This density estimation problem poses one conceptual problem that requires clarification. Since the ultimate purpose of the potential information calculation is to catch possible errors in modeling, no part of its calculation (such as the empirical entropy calculation) should itself be equivalent to a form of modeling. If we used such a form of modeling to compute the empirical entropy, that would introduce a strongly subjective element, i.e., simply comparing one model (Ψ) versus another (the model used for estimating $H_e$). To obtain an objective $I_p$ metric, the empirical entropy calculation should be model-free. It should be a purely empirical procedure with a Law of Large Numbers convergence guarantee for large sample size n → ∞.

The Empirical Entropy
For the case where the observable X is restricted to a set of discrete values, we define an indicator label $\kappa_x(X)$ which equals 1 if X equals a desired value x, otherwise zero. Then by the Law of Large Numbers
$$P_e(x) = \overline{\kappa_x(X)} \to E(\kappa_x(X)) = \Omega(x)$$
The empirical entropy estimator follows directly in this case:
$$H_e = -\overline{\log P_e(X)} \to H(\Omega(X))$$
for n → ∞.
For the continuous case, we need an empirical probability density estimator $P_e(X)$. To obtain this we define an indicator function $\kappa_x(X)$ which equals 1 if X ≤ x, otherwise zero. Then
$$\overline{\kappa_x(X)} \to E(\kappa_x(X)) = p(X \le x)$$
i.e., the cumulative distribution function c.d.f.(x). Therefore we define
$$P_e(x) = \frac{\overline{\kappa_{x+\delta x/2}(X)} - \overline{\kappa_{x-\delta x/2}(X)}}{\delta x}$$
By construction we choose δx ∝ 1/n → 0 as n → ∞. Then by the Fundamental Theorem of Calculus,
$$P_e(x) \to \frac{d}{dx}\,\mathrm{c.d.f.}(x) = \Omega(x)$$
For example, we can construct δx ∝ 1/n as follows: For each sample point $X_j$, find its m nearest neighbors (sample points), where m is a relatively small constant. Then set
$$\delta x = 2\,|X_{j:m} - X_j|$$
where we use the notation $X_{j:m}$ to mean the "m-th nearest neighbor of point $X_j$". Note that the interval $[X_j - \delta x/2, X_j + \delta x/2]$ contains m − 1 sample points (not including $X_j$ itself, to avoid the inherent bias that would introduce; this in turn requires replacing the n in the log-denominator with n − 1). This implementation of the $H_e$ calculation is simply:
$$H_e = -\frac{1}{n}\sum_{j=1}^{n}\log\frac{m-1}{(n-1)\,2\,|X_{j:m} - X_j|}$$
There are of course many possible empirical density estimation implementations that could be used; we offer this implementation solely as an illustrative example. This implementation also generalizes to multidimensional data, and thus can be used to estimate mutual information [13,14].
Of course, the empirical entropy has the usual lower bound estimator from the Law of Large Numbers
$$H_{e,\epsilon} = H_e - \sqrt{\frac{Var(\log P_e)}{\epsilon n}}$$
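The two empirical entropy estimators described above can be sketched as follows. This is an illustrative sketch, not the darwin implementation: it assumes one-dimensional data with distinct values, takes the density at each sample point to be (m − 1) / ((n − 1) δx) with δx equal to twice the distance to the m-th nearest neighbor, and uses hypothetical function names.

```python
import math
from collections import Counter

def discrete_empirical_entropy(xs):
    """H_e for discrete data: minus the sample mean of log P_e(X),
    where P_e(x) is the observed frequency of value x."""
    n = len(xs)
    counts = Counter(xs)
    return -sum(math.log(counts[x] / n) for x in xs) / n

def knn_log_density(xs, m=4):
    """log P_e at each sample point of 1-D data, in sorted order.

    Assumes distinct values, m >= 2 and more than m sample points.
    """
    pts = sorted(xs)
    n = len(pts)
    if m < 2 or n <= m:
        raise ValueError("need m >= 2 and more than m sample points")
    logs = []
    for j, x in enumerate(pts):
        # In sorted 1-D data the m nearest neighbors lie within m positions.
        window = pts[max(0, j - m):j] + pts[j + 1:j + 1 + m]
        d = sorted(abs(y - x) for y in window)[m - 1]  # m-th nearest neighbor
        dx = 2.0 * d
        logs.append(math.log((m - 1) / ((n - 1) * dx)))
    return logs

def empirical_entropy(xs, m=4):
    """H_e for continuous 1-D data: minus the mean of log P_e."""
    logs = knn_log_density(xs, m)
    return -sum(logs) / len(logs)

if __name__ == "__main__":
    print(discrete_empirical_entropy(["a", "a", "b", "b"]))
    print(empirical_entropy([0.0, 1.0, 3.0, 6.0, 10.0], m=2))
```

After the initial sort, each point inspects at most 2m neighbors, so the loop itself costs O(mn) apart from the small per-point sort of the window.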

Potential Information Estimators
This gives us mean and lower bound estimators for the potential information
$$I_p = -H_e - \bar{L}_e$$
$$I_{p,\epsilon} = -H_e - \bar{L}_e - \sqrt{\frac{Var(\log P_e - L_e)}{\epsilon n}}$$
where the variance is computed from $P_e$ and $L_e$ pairs calculated from the same sample of observations. Note that since the potential information is computed in "observation space" instead of "model space", the computational complexity of its calculation depends primarily on the observation sample size. This can be very efficient. First of all, the calculation divides into two parts that can be done separately; since the empirical entropy has no dependence on the model Ψ, it need only be calculated once and can then be used for computing $I_p$ for many different models. Second, the empirical entropy calculation can have low computational complexity. For the simple implementation outlined above, it is simply O(mn), where m is a small constant for the nearest-neighbor density calculation. This assumes the observations are already sorted in order; if not, an additional O(n log n) step is required to sort them. For high dimensional data, the computational complexity scales as $O(n^2)$, due to the need to calculate pairwise distances. Of course, the details of the computational complexity will vary depending on what empirical entropy implementation is used.
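A possible implementation of the potential information estimators is sketched below. It assumes that the potential information is the sample mean of log P_e(X) − log Ψ(X) over valid test data, that the lower bound takes the Chebyshev form with the variance of those paired differences, and that the density is the nearest-neighbor estimate for distinct one-dimensional values; the function names are hypothetical.

```python
import math

def knn_log_density(xs, m=4):
    """Return (sorted points, log P_e at each point) for distinct 1-D data.

    Assumes P_e = (m - 1) / ((n - 1) * dx), where dx is twice the distance
    to the m-th nearest neighbor.
    """
    pts = sorted(xs)
    n = len(pts)
    if m < 2 or n <= m:
        raise ValueError("need m >= 2 and more than m sample points")
    logs = []
    for j, x in enumerate(pts):
        window = pts[max(0, j - m):j] + pts[j + 1:j + 1 + m]
        d = sorted(abs(y - x) for y in window)[m - 1]
        logs.append(math.log((m - 1) / ((n - 1) * 2.0 * d)))
    return pts, logs

def potential_information(xs, model_logpdf, m=4, epsilon=0.05):
    """Return (I_p, lower bound at confidence 1 - epsilon).

    xs should be valid test data for the model, i.e. not used to select it.
    """
    pts, log_pe = knn_log_density(xs, m)
    # Pair each density estimate with the model log-likelihood at that point.
    diffs = [lp - model_logpdf(x) for x, lp in zip(pts, log_pe)]
    n = len(diffs)
    i_p = sum(diffs) / n
    var = sum((d - i_p) ** 2 for d in diffs) / n
    return i_p, i_p - math.sqrt(var / (epsilon * n))

if __name__ == "__main__":
    # Hypothetical model: uniform density on [0, 10].
    uniform_model = lambda x: math.log(0.1)
    print(potential_information([0.0, 1.0, 3.0, 6.0, 10.0], uniform_model, m=2))
```

Since `knn_log_density` does not depend on the model, its output could be cached and reused when evaluating many candidate models against the same observations.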

Convergence to the Kullback-Leibler Distance
In the limit of large sample size, the potential information converges to
$$I_p \to -H(\Omega(X)) - E(L(\Psi)) = D(\Omega||\Psi)$$
which is simply the relative entropy (Kullback-Leibler divergence [15]) of the true distribution vs. the model. (It should be emphasized that computing the Kullback-Leibler divergence directly requires knowing the true distribution, which of course in any inference problem is unknown.)
We may thus consider the potential information to represent a distance estimator from the true distribution Ω. Specifically, it estimates the difference in prediction power of our current model vs. that of the true distribution. Thus it solves the Inference Halting metric problem: if we are searching for a model with prediction power within distance δ of the maximum, we simply halt when $I_p \le \delta$, at whatever level of confidence $1 - \epsilon$ we desire.

The Akaike Information Criterion (AIC) [3] and related information metrics [16] are often referred to as representing the Kullback-Leibler (KL) divergence of the true distribution vs. the model [17]. So it is logical to ask how the potential information differs from these well-known metrics. The AIC and related metrics were designed for model selection problems, in which the observable (characterized by the true distribution Ω) is treated as a fixed constant, and the model is varied in search of the best fit. As shown in part A of Figure 2: Comparing AIC and Potential Information to the Theoretical Kullback-Leibler Divergence, the AIC does indeed correlate directly with the KL divergence $D(\Omega\|\Psi)$ under this assumption (holding the true distribution fixed as a constant). Specifically, for a sample of exchangeable observations $X^n$ and a model with $k$ parameters,

$$\frac{AIC}{2n} = \frac{k}{n} - L_e(\Psi) \xrightarrow{LLN} -E(L(\Psi)).$$

Thus the AIC (scaled by $1/2n$) converges to the negative log-likelihood, whereas the KL divergence $D(\Omega\|\Psi) = -H(\Omega(X)) - E(L(\Psi))$ also contains an entropy term $-H(\Omega(X))$. However, if the true distribution Ω(X) is held fixed, then the AIC differs from the KL divergence only by a constant. So for comparing two different models $\Psi_1$, $\Psi_2$, the difference in their AIC values converges to

$$\frac{AIC(\Psi_1) - AIC(\Psi_2)}{2n} \xrightarrow{LLN} E(L(\Psi_2)) - E(L(\Psi_1)) = D(\Omega\|\Psi_1) - D(\Omega\|\Psi_2).$$

This is why the AIC and related likelihood metrics are often treated as a proxy for the KL divergence in model selection. However, if the true distribution Ω is not treated as a fixed constant, and instead is allowed to vary, this simple relationship breaks. In that case, the AIC no longer correlates with the KL divergence (Figure 2B). By contrast, the potential information metric $I_p$ correlates with the KL divergence under all conditions (Figure 2C). The main difference between the potential information and the AIC is simply the empirical entropy term, which is included in the potential information metric but missing from the AIC:

$$I_p = -H_e - L_e(\Psi) = \frac{AIC}{2n} - H_e - \frac{k}{n}.$$

Thus, the potential information metric (and consequently, the empirical entropy term) is essential for any problem where
• we need an estimate of the absolute value of the Kullback-Leibler divergence, rather than simply comparing its relative value for two models;
• or we need to consider possible variation between different true distributions Ω (or equivalently, different observable variables X).
For example, in experiment planning problems, we consider different possible experiments (different observable variables) in order to estimate how much information they are likely to yield [18].
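The contrast between the AIC and the potential information can be sketched numerically for the zero-mean normal family used in Figure 2. This is only an illustration under stated assumptions: the paper's own empirical entropy estimator is defined elsewhere, so the m-spacing (Vasicek) estimator below is a stand-in for $H_e$, and the function names (`vasicek_entropy`, `potential_information`, and so on) are hypothetical.

```python
import numpy as np

def vasicek_entropy(x, m=None):
    """m-spacing (Vasicek) entropy estimate in nats; an assumed stand-in for H_e."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if m is None:
        m = max(1, int(round(np.sqrt(n))))
    idx = np.arange(n)
    upper = x[np.minimum(idx + m, n - 1)]  # X_(i+m), clipped at the largest value
    lower = x[np.maximum(idx - m, 0)]      # X_(i-m), clipped at the smallest value
    return float(np.mean(np.log(n / (2.0 * m) * (upper - lower))))

def mean_loglik_normal(x, tau):
    """Empirical log-likelihood L_e (per observation) of the model N(0, tau^2)."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(-0.5 * np.log(2 * np.pi * tau**2) - x**2 / (2 * tau**2)))

def aic_per_observation(x, tau, k=1):
    """AIC / (2n) = k/n - L_e."""
    return k / len(x) - mean_loglik_normal(x, tau)

def potential_information(x, tau):
    """I_p = -H_e - L_e."""
    return -vasicek_entropy(x) - mean_loglik_normal(x, tau)

def kl_normal(sigma, tau):
    """Theoretical D(N(0, sigma^2) || N(0, tau^2)) in nats."""
    return float(np.log(tau / sigma) + (sigma**2 - tau**2) / (2 * tau**2))

rng = np.random.default_rng(0)
n = 20000
for sigma, tau in [(1.0, 1.0), (1.0, 1.5), (0.5, 0.5), (0.5, 1.0)]:
    x = rng.normal(0.0, sigma, size=n)
    print(f"sigma={sigma} tau={tau} KL={kl_normal(sigma, tau):.3f} "
          f"AIC/2n={aic_per_observation(x, tau):.3f} "
          f"I_p={potential_information(x, tau):.3f}")
```

The two matched cases (σ = τ) both have zero KL divergence, yet their expected AIC/2n values differ by the entropy difference H(N(0, 1)) − H(N(0, 0.25)) = log 2, which is exactly the term that $I_p$ subtracts.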

Unbiased Empirical Posteriors
Standard Bayesian inference can grossly overestimate the posterior probability of a model term, because the sum of calculated terms is biased to underestimate the total p(X) summed over the complete infinite series. The empirical entropy provides a resolution to this problem. By the Asymptotic Equipartition theorem [1], for a sample $X^n$ of size $n$,

$$-\frac{1}{n} \log p(X^n) \to H(X),$$

and we can therefore estimate $p(X^n)$ via

$$p(X^n) \approx e^{-n H_e}.$$

This provides an unbiased estimator of the posterior probability of a model term θ:

$$p_e(\theta|X^n) = \frac{p(X^n|\theta)\, p(\theta)}{e^{-n H_e}} = p(X^n|\theta)\, p(\theta)\, e^{n H_e}.$$

We designate this the "empirical posterior" probability of model term θ, with an associated confidence interval.

The Model Self-Consistency Test
We note that a more limited convergence test is possible, by reversing the procedure and calculating the entropy of the model (which can be done directly, either analytically or by simulation). We define a self-consistency measure $\delta_{SC}$ that compares the empirical log-likelihood $L_e(\Psi)$ against $-H(\Psi(X))$, where $H(\Psi(X))$ is the entropy of our model. For Ψ → Ω, $\delta_{SC} \to 0$. We use this fact to construct a test for rejecting the null hypothesis that Ψ = Ω at confidence $1 - \epsilon$.
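A minimal sketch of such a self-consistency check for a zero-mean normal model follows. The sign convention $\delta_{SC} = L_e(\Psi) + H(\Psi)$ and the z-score built from the sample standard error are assumptions made for illustration, not necessarily the form used in the original definition; `self_consistency` is a hypothetical name.

```python
import numpy as np

def self_consistency(x, tau):
    """Return (delta_sc, standard_error) for the model N(0, tau^2).

    Assumed convention: delta_sc = L_e(model) + H(model), which tends to 0
    when the model matches the true distribution.
    """
    x = np.asarray(x, dtype=float)
    loglik = -0.5 * np.log(2 * np.pi * tau**2) - x**2 / (2 * tau**2)
    model_entropy = 0.5 * np.log(2 * np.pi * np.e * tau**2)  # analytic H(N(0, tau^2))
    delta = loglik.mean() + model_entropy
    stderr = loglik.std(ddof=1) / np.sqrt(x.size)
    return float(delta), float(stderr)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)  # true distribution: unit normal
for tau in (1.0, 1.3):
    delta, se = self_consistency(x, tau)
    print(f"tau={tau}: delta_SC={delta:+.4f}, z={delta / se:+.1f}")
```

For this family the expected value is $0.5 - \sigma^2 / (2\tau^2)$, which vanishes only when τ = σ. The test is "more limited" in that it can reject Ψ = Ω but a value near zero does not prove equality for arbitrary model families.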

What is "Prediction"?
We defined our empirical information metric as a measure of prediction power. However, it seems worthwhile to ask again what exactly we mean by "prediction". The empirical density estimation procedure outlined above suggests that in the limit of large sample size there is always a trivial way of obtaining perfect prediction power: copy the empirical density for X as our "likelihood model" for X, and show that it accurately predicts new observations of X. Such a procedure does not seem to qualify as "prediction"; we simply copied the observed density. In this case all the information for the "prediction" came from the observed data, and none at all from the modeling procedure itself. This suggests several conclusions:
• We desire a metric for the intrinsic prediction power of a model, above and beyond just copying the existing observation density. We will refer to this as $I_m$, the model information.
• Generalizing our original definition of "prediction power", we wish to maximize our prediction accuracy not only for situations that we have already observed, but also for novel situations that we have never encountered before. In other words, we adopt the conservative position that our data may be incomplete, so we cannot assume that future experience will simply mirror past experience. To maximize future prediction power, we must seek models that predict future observations more accurately than simply interpolating from past observations.
• Of course, we do not know a priori that such models even exist; that is a strictly empirical question. We simply generate models and measure whether they have such intrinsic prediction power, i.e., $I_m > 0$.
• By definition, such a measurement can only be performed via new observations, e.g., a region of observation space that we have not observed before. As we will show in a moment, a region that has already been observed (thoroughly) cannot yield significant model information, because the past observations already provide a good density image for predicting future observations in this region.
• Thus, we can consider the adoption of a new model to be a cut on the temporal sequence of observations, partitioning them into two sets: the "old" observations (those taken before the adoption of the model), and the "new" observations (those taken after the adoption of the model).

Defining Model Information
The key question of model information is whether the model yields better prediction power than simple interpolation from past observations. As the interpolation reference, we simply use the empirical density calculation defined previously. Specifically, for a model Ψ we define its model information as

$$I_m(\Psi) = L_e(\Psi|new) + H_e(new, old),$$

where $L_e(\Psi|new)$ is calculated specifically using the new observations, and we define $H_e(new, old) = -L_e(P_{e,old}|new)$ as the empirical cross entropy of the new observations versus the old observations; $P_{e,old}$ is the empirical density estimator from the old observations. One example implementation (based on the previous empirical density estimator) is

$$H_e(new, old) = -\frac{1}{n} \sum_{j=1}^{n} \log P_{e,old}(X_{j,new}) \xrightarrow{LLN} -\int \Omega(X) \log P_{e,old}(X)\, dX,$$

where $X_{j,new}$ is the $j$th observation from the new observation set, $P_{e,old}$ is computed from the old observations $X_{i,old}$ ($X_{i,old}$ being the $i$th observation from the old observations), $n$ is the sample size of the new observations, and $n_{old}$ is the sample size of the old observations. Many other $H_e(new, old)$ estimation implementations are possible. It should be noted that proper normalization of the empirical density is especially important for cross-entropy calculation; however, we will not investigate such implementation details here.

Thus, $I_m$ measures whether the model's empirical log-likelihood $L_e$ on the new observations exceeds the average log-likelihood of the new observations computed from the old observation density, i.e., $-H_e(new, old)$. As for the potential information, we define a lower bound estimator for $I_m$ with confidence level $1 - \epsilon$ based on the Law of Large Numbers, with a margin governed by the sampling variance term $\mathrm{Var}(L_e - \log P_{e,old}) / n$.
• In the case $n_{old} \to 0$ we make the density function converge to the uninformative prior based on the detector range for the observable X. That is, if the range of detectable values for X is [0, 10] then $P_{e,old}(X) \to 1/10$.
• Note that the model information can be negative, indicating that the model has worse prediction power than the old empirical density estimator.
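The definition above can be sketched in a few lines. This is an illustration under explicit assumptions: a Gaussian kernel density estimate with a normal-reference rule-of-thumb bandwidth stands in for the empirical density $P_{e,old}$ (the paper's own estimator is defined elsewhere), the lower bound uses a normal-approximation z-value rather than the paper's exact bound, and the $n_{old} \to 0$ limit is not handled (at least two old observations are required). All function names are hypothetical.

```python
import numpy as np

def kde_logpdf(x_new, x_old):
    """Log density at x_new of a Gaussian KDE built from x_old (assumed P_e,old)."""
    x_old = np.asarray(x_old, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    h = 1.06 * x_old.std(ddof=1) * x_old.size ** (-0.2)  # rule-of-thumb bandwidth
    z = (x_new[:, None] - x_old[None, :]) / h
    log_kernel = -0.5 * z**2 - 0.5 * np.log(2 * np.pi) - np.log(h)
    # log of the mean kernel value, computed stably (log-mean-exp)
    peak = log_kernel.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.exp(log_kernel - peak).mean(axis=1))

def normal_logpdf(x, mu, sd):
    """Log density of N(mu, sd^2)."""
    x = np.asarray(x, dtype=float)
    return -0.5 * np.log(2 * np.pi * sd**2) - (x - mu) ** 2 / (2 * sd**2)

def model_information(model_logp_new, old_logp_new, z=1.645):
    """Return (I_m, lower_bound) from per-observation log densities on new data.

    I_m = L_e(model | new) + H_e(new, old) = mean(log model - log P_e,old).
    The lower bound is a normal-approximation sketch, not the paper's bound.
    """
    diff = np.asarray(model_logp_new) - np.asarray(old_logp_new)
    i_m = diff.mean()
    lower = i_m - z * np.sqrt(diff.var(ddof=1) / diff.size)
    return float(i_m), float(lower)

rng = np.random.default_rng(0)
x_old = rng.normal(size=20)   # training ("old") observations
x_new = rng.normal(size=100)  # test ("new") observations
fitted = normal_logpdf(x_new, x_old.mean(), x_old.std(ddof=1))
i_m, lower = model_information(fitted, kde_logpdf(x_new, x_old))
print(f"I_m = {i_m:+.3f} nats, lower bound = {lower:+.3f}")
```

The sign and size of $I_m$ obtained this way depend on the density estimator chosen for $P_{e,old}$, so results from this sketch should not be read as reproducing the paper's figures.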

Example: The Binomial Distribution
By contrast, the binomial distribution doesn't yield significant model information, because the observable has only two possible states (success or failure) for the model to predict, and the binomial model's prediction of its probability is just equivalent to the empirical probability in the training data:

$$\Psi(\text{success}) = \frac{s_{old} + 1}{n_{old} + 2},$$

where $s_{old}$ is the count of successes in the training data, and $n_{old}$ is the size of the training data set (the +1 and +2 arise from the pseudocount principle, derived by Laplace as his "rule of succession" [19]). Fundamentally, since there is no "shape" for the model to predict (as there would be for a continuous variable, as in the case of the Normal distribution above), there is no way for the model to systematically outperform the empirical distribution.

Empirical Information Partition Rules
• All information originates as potential information. That is, before we have a successful model for a set of observations, our prediction power is no better than random, and this manifests as positive $I_p$ and zero $I_e$.
• For a given observable X, the sum $I_p + I_e$ is a constant (i.e., independent of the model Ψ(X)). That is, for any observation sample $X^n$,

$$I_p(\Psi) + I_e(\Psi) = -H_e - L_e(\Psi) + L_e(\Psi) - L_e(p) = -H_e - L_e(p),$$

where p(X) is the uninformative distribution for X. For large sample size $n$,

$$I_p + I_e \xrightarrow{LLN} -H(\Omega) - E(\log p(X)) = D(\Omega\|p),$$

which is simply the relative entropy of the true distribution relative to the uninformative distribution p(X).
• Thus potential information is converted to empirical information by modeling. As the model Ψ becomes a more accurate image of the observation density, $I_p$ decreases and $I_e$ increases by the same amount.
• Relation to mutual information: It must be emphasized that the mutual information I(X; Ω) is defined only if we know the complete joint distribution p(X, Ω). Since we do not know this joint distribution, we would like a sampling-based estimator for I(X; Ω). We can do this by simply sampling different inference cases $\Omega^{(1)}, \Omega^{(2)}, \dots, \Omega^{(m)}$ (represented by different observation samples $X^n_{(1)}, X^n_{(2)}, \dots, X^n_{(m)}$). Taking the average of $I_p(\Psi) + I_e(\Psi)$ over a large number $m$ of inference cases converges:

$$\frac{1}{m} \sum_{k=1}^{m} \left[ I_p(\Psi) + I_e(\Psi) \right]_{(k)} \xrightarrow{LLN} E_\Omega\left[ D(\Omega\|p) \right] = -H(X|\Omega) - E(\log p(X)).$$

If we explicitly assume that the uninformative distribution used for computing the empirical information matches the true marginal distribution of X, then

$$E_\Omega\left[ D(\Omega\|p) \right] = -H(X|\Omega) + H(X) = I(X; \Omega).$$

Thus, $I_p + I_e$ may be considered to be a "sampleable version of the mutual information"; that is, it can be measured for any individual inference case, and its average over multiple inference problems will converge to the mutual information of the observable vs. hidden variables.
• For a given observable X, the difference $I_e - I_m$ is a constant (i.e., independent of the model Ψ(X)). Thus $I_e - I_m$ measures the amount of information supplied by the past observations (in the form of $P_{e,old}(X)$).
• Moreover, in the asymptotic limit, $I_e - I_m \ge 0$, since for $n_{old} \to 0$ we guarantee that $P_{e,old}(X) \to p(X)$ and for $n_{old} \to \infty$ we have $P_{e,old}(X) \xrightarrow{LLN} \Omega(X)$.
• Thus, $I_m$ partitions $I_e$ into the part that is simply provided by the training observations themselves, versus the part that actually constitutes "value added" predictive power of the model itself.
• For a given observable X, the sum $I_p + I_m$ is a constant (i.e., independent of the model Ψ(X)). Specifically, assuming both $I_p$, $I_m$ are calculated on the same test data,

$$I_p(\Psi) + I_m(\Psi) = -H_e - L_e(\Psi) + L_e(\Psi) - L_e(P_{e,old}) = -H_e - L_e(P_{e,old}) \xrightarrow{LLN} D(\Omega\|P_{e,old}),$$

which simply measures the amount of information available to be learned about the true distribution of X above and beyond that already provided by past observations (in the form of $P_{e,old}(X)$).
• Relation of $I_m$ to relative entropy: Note that since $I_p \xrightarrow{LLN} D(\Omega\|\Psi)$, this also implies that $I_m \xrightarrow{LLN} D(\Omega\|P_{e,old}) - D(\Omega\|\Psi)$. This simply restates the principle that the model information represents the increase in model prediction power relative to the empirical density of the past observations.
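The three partition rules listed above can be checked numerically: for a fixed test sample, the combinations $I_p + I_e$, $I_e - I_m$ and $I_p + I_m$ should come out the same for any model. The sketch below assumes $I_p = -H_e - L_e(\Psi)$, $I_e = L_e(\Psi) - L_e(p)$ and $I_m = L_e(\Psi) - L_e(P_{e,old})$, with a uniform uninformative distribution on a hypothetical detector range [−10, 10], a Vasicek spacing estimator standing in for $H_e$, and a Gaussian KDE standing in for $P_{e,old}$; none of these estimator choices come from the paper.

```python
import numpy as np

def vasicek_entropy(x):
    """m-spacing entropy estimate in nats (assumed stand-in for H_e)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    m = max(1, int(round(np.sqrt(n))))
    idx = np.arange(n)
    spacing = x[np.minimum(idx + m, n - 1)] - x[np.maximum(idx - m, 0)]
    return float(np.mean(np.log(n / (2.0 * m) * spacing)))

def kde_logpdf(x_new, x_old):
    """Log density of a Gaussian KDE built from x_old (assumed P_e,old)."""
    x_old = np.asarray(x_old, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    h = 1.06 * x_old.std(ddof=1) * x_old.size ** (-0.2)
    z = (x_new[:, None] - x_old[None, :]) / h
    log_kernel = -0.5 * z**2 - 0.5 * np.log(2 * np.pi) - np.log(h)
    peak = log_kernel.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.exp(log_kernel - peak).mean(axis=1))

def normal_logpdf(x, mu, sd):
    x = np.asarray(x, dtype=float)
    return -0.5 * np.log(2 * np.pi * sd**2) - (x - mu) ** 2 / (2 * sd**2)

def partition_metrics(x_new, model_logp, old_logp, log_p_uninformative):
    """Return I_p, I_e, I_m measured on the same new observations."""
    h_e = vasicek_entropy(x_new)
    l_model = float(np.mean(model_logp))
    l_old = float(np.mean(old_logp))
    return {
        "I_p": -h_e - l_model,
        "I_e": l_model - log_p_uninformative,
        "I_m": l_model - l_old,
    }

rng = np.random.default_rng(0)
x_old = rng.normal(size=50)
x_new = rng.normal(size=5000)
old_logp = kde_logpdf(x_new, x_old)
log_p = -np.log(20.0)  # uniform density on the assumed range [-10, 10]
for mu, sd in [(0.0, 1.0), (0.5, 2.0)]:  # two different models, same data
    r = partition_metrics(x_new, normal_logpdf(x_new, mu, sd), old_logp, log_p)
    print(f"model N({mu}, {sd}^2): I_p+I_e={r['I_p'] + r['I_e']:.4f} "
          f"I_e-I_m={r['I_e'] - r['I_m']:.4f} I_p+I_m={r['I_p'] + r['I_m']:.4f}")
```

Because the model term $L_e(\Psi)$ cancels algebraically in each combination, the printed values agree across models up to floating-point rounding, whatever estimators are plugged in.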

Asymptotic Conversion of Potential and Model Information to Empirical Information
Consider the following asymptotic modeling protocol: for a large sample size $n_{old} \to \infty$ we simply adopt the empirical density $P_{e,old}$ as our model Ψ. We then measure $I_e$, $I_p$, $I_m$ on a set of new observations.
As $n_{old} \to \infty$, $P_{e,old}(X)$ converges to the true density Ω(X), so

$$H_e(new, old) \xrightarrow{LLN} H(\Omega) \quad \text{and} \quad I_m \xrightarrow{LLN} -D(\Omega\|\Psi) \le 0.$$

Since the relative entropy is non-negative, the maximum attainable value of the model information drops asymptotically to zero. Moreover, as $\Psi(X) = P_{e,old}(X)$ also converges to the true density Ω(X), $I_p \xrightarrow{LLN} D(\Omega\|\Omega) = 0$. Since both the model and potential information vanish, by the $I_p + I_e$ and $I_e - I_m$ partition rules, all information is converted exclusively to empirical information.
This scenario illustrates a simple point about the distinct meanings of empirical information vs. model information. The overriding goal of model selection is maximizing empirical information (likelihood). However, this scenario shows that maximizing the empirical information is in a sense trivial if one can collect a large enough observation sample. By contrast, there is no trivial way to produce positive model information; note that the very procedure that automatically maximizes $I_e$ also ensures that $I_m \le 0$.
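The asymptotic protocol can be sketched as follows, again under assumptions that are not taken from the paper: a Gaussian KDE plays the role of $P_{e,old}$, a Vasicek spacing estimator plays the role of $H_e$, and the uninformative distribution is uniform on a hypothetical range [−10, 10]. The function `asymptotic_protocol` is a hypothetical name.

```python
import numpy as np

def vasicek_entropy(x):
    """m-spacing entropy estimate in nats (assumed stand-in for H_e)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    m = max(1, int(round(np.sqrt(n))))
    idx = np.arange(n)
    spacing = x[np.minimum(idx + m, n - 1)] - x[np.maximum(idx - m, 0)]
    return float(np.mean(np.log(n / (2.0 * m) * spacing)))

def kde_logpdf(x_new, x_old):
    """Log density of a Gaussian KDE built from x_old (assumed P_e,old)."""
    x_old = np.asarray(x_old, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    h = 1.06 * x_old.std(ddof=1) * x_old.size ** (-0.2)
    z = (x_new[:, None] - x_old[None, :]) / h
    log_kernel = -0.5 * z**2 - 0.5 * np.log(2 * np.pi) - np.log(h)
    peak = log_kernel.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.exp(log_kernel - peak).mean(axis=1))

def asymptotic_protocol(x_old, x_new, log_p_uninformative):
    """Adopt P_e,old as the model and return (I_e, I_p, I_m) on the new data."""
    old_logp = kde_logpdf(x_new, x_old)
    model_logp = old_logp  # the protocol sets the model equal to P_e,old
    l_model = float(model_logp.mean())
    i_e = l_model - log_p_uninformative
    i_p = -vasicek_entropy(x_new) - l_model
    i_m = float(np.mean(model_logp - old_logp))  # zero by construction
    return i_e, i_p, i_m

rng = np.random.default_rng(0)
x_new = rng.normal(size=2000)
for n_old in (10, 100, 1000):
    i_e, i_p, i_m = asymptotic_protocol(rng.normal(size=n_old), x_new, -np.log(20.0))
    print(f"n_old={n_old:5d}  I_e={i_e:.3f}  I_p={i_p:.3f}  I_m={i_m:.3f}")
```

Under this protocol $I_m$ is identically zero, while growing $n_{old}$ moves information from $I_p$ into $I_e$; with finite samples the estimated $I_p$ can dip slightly below zero because of estimator bias.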
This suggests several changes in how we think about the value of modeling. In model selection, the value of a model is often thought of in terms of data compression; that is, that the best model encodes the underlying pattern of the data in the most efficient manner possible. Metrics such as the AIC and BIC seek to enforce this principle by adding "correction terms" that penalize the number of model parameters. However, to be truly valuable for prediction, a model should meet this data compression criterion not only retrospectively (i.e., it can yield a more efficient encoding of the past observations) but also prospectively (i.e., it can predict future observations more accurately than simply interpolating from the past observations). Whereas the total empirical information metric fails to draw this distinction, the model information explicitly measures it. That is, it partitions the total $I_e$ into a "trivial" part that represents the prediction power implicit in the observation dataset itself, and a non-trivial part that represents true "predictions" coming from the model.

Conclusion
We wish to suggest that these empirical information metrics represent a useful extension of existing statistical inference metrics, because they provide "sampleable" measures of key information theory metrics (such as mutual information and relative entropy), with explicit Law of Large Numbers convergence guarantees. That is, each empirical information metric can be measured via sampling on an individual inference problem (unlike the conventional definition of mutual information); yet its average value over multiple inference problems will converge to the true, hidden value of its associated metric from information theory (such as the mutual information). On such a foundation, one can begin to recast statistical and scientific inference problems in terms of the very useful and general tools of information theory. For example, the "inference halting problem", which imposes a variety of problems and limitations in Bayesian inference, can be easily resolved by the potential information metric, which directly measures the distance of the current model from the true distribution in standard information theoretic terms. Similarly, the model information metric measures the "value-added" prediction power of a model relative to its training data.

1. The Need for Information Metrics for Statistical and Scientific Inference

Figure 1. Overfitting analysis of BIC models on a small sample from a normal distribution. For each data point, a sample of three observations was drawn randomly from a unit normal distribution. The BIC-optimal model was fit to these observations and used to compute the training vs. test log-likelihoods $L$ vs. $L_e$, the latter calculated on an additional test sample of three observations drawn from the same unit normal. To generate the scatter plot, this process was performed a total of N = 100,000 times. The mean value of $L_e$ for successive windows of 1000 observations sorted from left to right is plotted in red. The zero-bias line is shown in black ($L = L_e$). Thus, the overfitting bias information $I_b$ is given at any position on the graph by the vertical distance between the black and red lines. The white circle indicates the true expectation log-likelihood for the unit normal distribution. The dotted line marks the mean value of $L_e$ averaged over all 100,000 data points. Note that this figure shows only a portion of the full distribution, which has a long tail extending to large negative values of $L_e$.

of their empirical information values $\frac{1}{m} \sum I_e \xrightarrow{LLN} E(I_e(\Psi))$.

Figure 2. Comparing AIC and Potential Information to the Theoretical Kullback-Leibler Divergence. A. Comparison of AIC values vs. Kullback-Leibler divergence for a sample of 10,000 different models, with the true distribution fixed to the unit normal distribution $N(0, 1)$. Each model was a normal distribution $N(0, \tau^2)$ where the standard deviation τ was drawn uniformly on the interval (0.1, 2). For each model, the AIC was calculated using n = 1000 observations. B. The same comparison, with a variable true distribution $\Omega = N(0, \sigma^2)$ with standard deviation σ ∈ (0.1, 2). Note the AIC no longer correlates with the Kullback-Leibler divergence. C. The same comparison as in B, except using the potential information metric. Note that it closely matches the theoretical Kullback-Leibler divergence $D(N(0, \sigma^2)\|N(0, \tau^2)) = \log \frac{\tau}{\sigma} + \frac{\sigma^2 - \tau^2}{2\tau^2}$.

Figure 3: Model Information of the Normal Distribution. We draw $n_{old}$ observations from the unit normal distribution $N(0, 1)$ and compute the posterior likelihood distribution for this sample. We then

Figure 3. Model information of the normal distribution. A model can exceed the prediction power of the empirical density computed from the training observations, because the model predicts the complete shape of the probability distribution, and how fast the tails will go to zero. Of course, as the training dataset size increases, the training data constitute a more and more accurate competing "model", and the model information decreases asymptotically. For each dataset size, a sample of that size was drawn from a unit normal distribution, and used to train a normal distribution Ψ based on the sample mean and variance. We then computed $I_m(\Psi)$ using a test sample of size 100 drawn from the unit normal. This procedure was repeated 1000 times, to obtain the average of $I_m(\Psi)$ for that training dataset size.

5.1. The $I_p + I_e$, $I_e - I_m$, $I_p + I_m$ Partitions
We now briefly consider the relationships between potential information, empirical information and model information, illustrated in Figure 4: Empirical Information Partition Rules.

Figure 4. Empirical information partition rules. This diagram illustrates the three basic partition rules:
1. total information: $I_p + I_e \to D(\Omega\|p)$
2. new observations yield: $I_p + I_m \to D(\Omega\|P_{e,old})$
3. old observations yield: $I_e - I_m \to D(\Omega\|p) - D(\Omega\|P_{e,old})$.
The vertical axis represents increasing information yield, starting from zero when there are no observations, to a maximum of $D(\Omega\|p)$. This axis is split by two intermediate points: the current model, Ψ(X), and the old observation density $P_{e,old}(X)$. Colored intervals represent the three information metrics: $I_p$ (red), $I_e$ (green), $I_m$ (blue).
Assuming both $I_e$, $I_m$ are calculated on the same test data,

$$I_e(\Psi) - I_m(\Psi) = L_e(\Psi) - L_e(p) - L_e(\Psi) + L_e(P_{e,old}) = -L_e(p) + L_e(P_{e,old}),$$

where $P_{e,old}(X)$ is the distribution of X computed from past observations (as described above). So for $n \to \infty$,

$$I_e(\Psi) - I_m(\Psi) \xrightarrow{LLN} D(\Omega\|p) - D(\Omega\|P_{e,old}).$$