
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

In principle, information theory could provide useful metrics for statistical inference. In practice this is impeded by divergent assumptions: information theory assumes the joint distribution of the variables of interest is known, whereas in statistical inference it is hidden and is itself the goal of the inference. To integrate these approaches we note a common theme they share, namely the measurement of prediction power.

Information theory as formulated by Shannon provides a general quantitative framework for measuring information and uncertainty, through metrics such as entropy, relative entropy and mutual information.

However, information theory and statistical inference are founded on rather different assumptions, which greatly complicate their union. Statistical inference draws a fundamental distinction between the observed data, which are known, and the model that generated them, which is hidden and must be inferred.

Traditional information theory, by contrast, assumes as a starting point that the joint probability distribution of all variables of interest is known exactly.

Thus, while “marrying” information theory and statistical inference is by no means impossible, it requires clear definitions that resolve these basic mismatches in assumptions. In this paper we begin from a common theme that is important to both areas, namely the concept of prediction power.

In this paper we define a set of statistical inference metrics that constitute statistical inference proxies for the fundamental metrics of information theory (such as mutual information, entropy and relative entropy). We show that they are vitally useful for statistical inference (for precisely the same properties that make them useful in information theory), and highlight how they differ from standard statistical inference metrics such as Maximum Likelihood, the AIC and the BIC. We present a series of metrics that address distinct aspects of statistical inference: the empirical information I_e, a sampleable proxy for mutual information; the potential information I_p, a proxy for relative entropy that measures a model's distance from the true distribution; and the model information I_m, which measures a model's prediction power relative to its own training data.

To clarify the challenges that such metrics must meet, we wish to highlight several characteristics they must possess:

In this paper we define a set of metrics obeying these requirements, which we shall refer to as empirical information metrics.

Fisher defined the prediction power of a model Ψ for an observable variable X in terms of the likelihood of the observed sample values x_1, x_2, …, x_n, i.e. the joint probability of the sample under the model.

Fisher's Maximum Likelihood method seeks the model that maximizes the total likelihood or, equivalently, the sample average log-likelihood
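As a concrete sketch of this maximization (an illustration, not the paper's code; the normal model and all numerical details are assumptions), the sample average log-likelihood of a normal model is maximized in closed form by the sample mean and the maximum-likelihood sample variance:

```python
import math
import random

# Illustrative sketch: for a Normal(mu, sigma) model, the sample average
# log-likelihood (1/n) sum_i log p(x_i | mu, sigma) is maximized by the
# sample mean and the (biased) maximum-likelihood sample variance.

def avg_log_likelihood(xs, mu, sigma):
    """Sample average log-likelihood under a Normal(mu, sigma) model."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs) / len(xs)

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Closed-form maximizers of the average log-likelihood:
mu_hat = sum(xs) / len(xs)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / len(xs))

# Any other parameter choice scores no higher on the same sample:
assert avg_log_likelihood(xs, mu_hat, sigma_hat) >= avg_log_likelihood(xs, 0.5, 1.5)
```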

Vapnik-Chervonenkis theory also supplies a correction factor that penalizes model complexity for classifier problems, bounding the gap between a classifier's training error and its true generalization error.

These metrics are best understood by highlighting the critical role that the Law of Large Numbers plays in inference metrics. Say we want to find a model Ψ that maximizes the total likelihood of many future draws of the observable X from its true distribution Ω(X); that is, we wish to maximize the expectation E[log Ψ(X)].

Since we do not know the true distribution Ω(X), we cannot compute this expectation directly; we can only evaluate Ψ on the observed sample values x_i.

Since the x_i are drawn independently from Ω(X), the Law of Large Numbers guarantees that the sample average log-likelihood (1/n) Σ_i log Ψ(x_i) converges to the desired expectation E[log Ψ(X)] as the sample size grows.
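This Law of Large Numbers guarantee can be checked numerically. A minimal sketch, assuming (for illustration only) that both the fixed model Ψ and the true distribution are a standard normal:

```python
import math
import random

# Sketch of the LLN guarantee: for a FIXED model Psi (chosen independently
# of the sample), the sample average log-likelihood converges to the
# expectation E[log Psi(X)] under the true distribution. Here Psi and the
# truth are both N(0, 1), so the expectation equals the negative
# differential entropy of N(0, 1).

def log_psi(x):
    return -0.5 * math.log(2 * math.pi) - 0.5 * x * x  # log density of N(0, 1)

expected = -0.5 * math.log(2 * math.pi) - 0.5  # E[log Psi(X)]

random.seed(1)

def sample_avg(n):
    return sum(log_psi(random.gauss(0, 1)) for _ in range(n)) / n

deviation = abs(sample_avg(100000) - expected)  # shrinks as n grows
```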

Unfortunately, there is a catch. This guarantee can only be extended to maximization of the sample log-likelihood if the model Ψ is chosen independently of the sample x_i. If instead we choose Ψ to maximize the likelihood of the very same x_i used to evaluate it, the resulting sample log-likelihood is biased upward relative to the true expectation: this is overfitting.

First, let's examine this problem from an empirical point of view, by simply defining a metric for measuring the bias. We define a bias metric I_b as the amount by which a model's average log-likelihood on its own training sample overstates its true (expected) log-likelihood on new observations from the same source.

We desire an estimator for I_b that can be computed from available data. The straightforward approach is to hold out a set of sample values as test data: the difference between the training-sample and test-sample average log-likelihoods estimates I_b, and subtracting this bias from the training score yields an unbiased estimate I_e of the model's prediction power.
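A hedged sketch of measuring this bias by simulation, in the spirit of I_b: tiny three-observation training samples (mirroring the overfitting figure described later) systematically score higher on their own training data than on fresh test data. All numerical details here are illustrative assumptions:

```python
import math
import random

# Illustrative bias measurement: average log-likelihood of a fitted model
# on its own training sample, minus its value on fresh test data. With
# tiny training samples, overfitting makes this difference strongly
# positive on average.

def avg_ll(xs, mu, sigma):
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs) / len(xs)

def fit_normal(xs):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(max(var, 1e-6))  # guard against zero variance

random.seed(2)
biases = []
for _ in range(500):
    train = [random.gauss(0, 1) for _ in range(3)]  # tiny training sample
    test = [random.gauss(0, 1) for _ in range(1000)]
    mu, sigma = fit_normal(train)
    biases.append(avg_ll(train, mu, sigma) - avg_ll(test, mu, sigma))

mean_bias = sum(biases) / len(biases)  # positive on average: overfitting
```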

The BIC adds a correction term proportional to the number of model parameters k and the logarithm of the sample size n (a penalty of (k/2) ln n on the log-likelihood), intended to cancel this bias.

However, several caveats about such corrections should be understood:

a given correction addresses only a particular kind of overfitting (for the AIC and BIC, for example, an excessive number of model parameters);

a given correction is based on specific assumptions about the model, and may not behave as expected under other conditions;

Such corrections do not, however, guarantee unbiased results when their underlying assumptions fail to hold.

As an example, consider fitting BIC-optimal models to samples of just three observations drawn from a unit normal distribution (see the overfitting analysis figure). A large fraction of the models strongly overfit the observations, as indicated by a large deviation of the training log-likelihood from its unbiased test value.

Based on these considerations, we use the unbiased estimator I_e, measured on test data independent of training, as our basic metric of a model's prediction power.

Note that I_e is a direct empirical measurement rather than a theoretical correction; its sampling error can be bounded explicitly.

Whereas most metrics for model selection such as the AIC and BIC contain correction terms dependent on the model complexity, I_e needs no such terms: because it is evaluated on data not used for training, I_e is an unbiased estimate of prediction power regardless of the model's complexity.

Note that we do not need to incorporate the sample size directly into the metric definition (as in the case of the BIC, or the VC correction): instead, the sample size enters through the Law of Large Numbers, which yields lower-bound estimates I_{e,ε} and I_{b,ε} that hold with any desired confidence 1 − ε and tighten as the sample grows.
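One simple way to realize such a confidence lower bound is sketched below; the use of a normal-approximation z-score is an illustrative assumption, not necessarily the paper's construction:

```python
import math
import random

# Hedged sketch: evaluate a model's average log-likelihood on held-out
# data (unbiased, since the model was chosen independently of that data),
# and attach a Law-of-Large-Numbers lower confidence bound from the
# sample standard error. The z-score convention is an assumption.

def heldout_score(log_likes, z=1.96):
    n = len(log_likes)
    mean = sum(log_likes) / n
    var = sum((v - mean) ** 2 for v in log_likes) / (n - 1)
    se = math.sqrt(var / n)
    return mean, mean - z * se  # point estimate and lower bound

random.seed(3)
# Model Psi: standard normal, evaluated on fresh data from the same source
test = [random.gauss(0, 1) for _ in range(5000)]
lls = [-0.5 * math.log(2 * math.pi) - 0.5 * x * x for x in test]
mean, lower = heldout_score(lls)
```

The lower bound tightens as the test sample grows, which is how sample size enters the picture without appearing in the metric's definition.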

Consider the following “mutual information sampling problem”:

draw a specific inference problem (i.e. a hidden distribution Ω(X));

draw a sample of training data from it;

find a way to estimate the mutual information from these data alone.

The standard definition of mutual information cannot be applied here, since it requires the joint distribution itself, which is hidden.

By contrast with mutual information, we can measure the empirical information I_e by sampling:

for one specific inference problem (hidden value of Ω), we draw a training dataset, fit our model, and measure its empirical information I_e on independent test data.

We repeat this procedure for multiple inference problems Ω_(1), Ω_(2), …, Ω_(m), and take the average of their empirical information values.

If the model Ψ(X | ·) is fit consistently for each problem, this average converges to the true, hidden mutual information of the underlying joint distribution.

If we wish to maximize prediction power, our ultimate goal must be convergence, namely that our model will converge to the true, hidden distribution Ω. So we must ask the obvious question, how do we know when we're done? Two basic strategies present themselves:

We will define empirical information metrics for both these approaches.

As an example of the need for a convergence metric, we consider the process of Bayesian inference in modeling scientific data. In scientific research, we cannot easily restrict the set of possible models to a fixed list specified in advance; new hypotheses can always be proposed.

What is the computational complexity for Bayesian inference to find the correct term Ω, or any term within some specified distance of it?

In real-world practice this “halting problem” often grows into an even worse problem of “model misspecification”.

We define the potential information I_p as the gap between the prediction power of our current model Ψ and the maximum achievable prediction power, i.e. that of the true distribution Ω itself (the large-sample limit of the empirical information I_e). Computing I_p requires an empirical estimate of the entropy of the observations, which raises a density estimation problem.

This density estimation problem poses one conceptual difficulty that requires clarification. Since the ultimate purpose of the potential information calculation is to catch possible errors in modeling, no part of its calculation (such as the empirical entropy calculation) should itself be equivalent to a form of modeling. If we used such a form of modeling to compute the empirical entropy, that would introduce a strongly subjective element, undermining the objectivity of both the empirical entropy H_e and the potential information I_p.

For the case where the observable X is continuous, we compute the empirical entropy H_e directly from the sample: the local density at each observed value x_j is estimated from the distances to its nearest neighboring observations, and the resulting surprisals −log p(x_j) are averaged over the sample.

Of course, the empirical entropy has the usual lower bound estimator from the Law of Large Numbers

This gives us mean and lower-bound estimators for the potential information; the mean estimator combines the empirical entropy with the model's sample average log-likelihood, I_p = −H_e − (1/n) Σ_j log Ψ(x_j).

Note that since the potential information is computed in “observation space” instead of “model space”, the computational complexity of its calculation depends primarily on the observation sample size. This can be very efficient. First of all, the calculation divides into two parts that can be done separately; since the empirical entropy has no dependence on the model Ψ, it need only be calculated once and can then be used for computing I_p for any number of candidate models. The empirical entropy calculation itself is O(n²), due to the need to calculate pairwise distances. Of course, the details of the computational complexity will vary depending on what empirical entropy implementation is used.
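As a concrete stand-in for such a model-free entropy estimate, the classic Kozachenko-Leonenko nearest-neighbor estimator for one-dimensional data is sketched below; the naive pairwise search is O(n²), matching the complexity discussed above. The estimator choice itself is an assumption, not necessarily the paper's exact implementation:

```python
import math
import random

# Kozachenko-Leonenko 1-nearest-neighbor entropy estimator for 1-D data.
# No model is fit: the local density at each point is inferred from the
# distance to its nearest neighboring observation. Naive O(n^2) search.

EULER_GAMMA = 0.5772156649015329

def empirical_entropy_1d(xs):
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs):
        # distance to the nearest other sample point (O(n) per point)
        r = min(abs(x - y) for j, y in enumerate(xs) if j != i)
        total += math.log(2.0 * r * (n - 1))
    return total / n + EULER_GAMMA  # bias correction for 1-NN spacings

random.seed(4)
xs = [random.gauss(0, 1) for _ in range(1000)]
h_e = empirical_entropy_1d(xs)
h_true = 0.5 * math.log(2 * math.pi * math.e)  # entropy of N(0,1), ~1.419 nats
```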

In the limit of large sample size, the potential information converges to the relative entropy (Kullback-Leibler divergence) D(Ω ‖ Ψ) between the true distribution and the model.

We may thus consider the potential information to represent a distance estimator from the true distribution Ω. Specifically, it estimates the difference in prediction power between our current model Ψ and the true distribution.
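In standard notation, this convergence claim can be written out explicitly (assuming a discrete observable for simplicity; H(Ω) denotes the entropy of the true distribution):

```latex
\lim_{n \to \infty} I_p \;=\; D(\Omega \,\|\, \Psi)
\;=\; \sum_{x} \Omega(x) \log \frac{\Omega(x)}{\Psi(x)}
\;=\; -H(\Omega) \;-\; E_{\Omega}\!\left[\log \Psi(X)\right]
```

The last form makes the estimator structure transparent: the empirical entropy supplies the −H(Ω) term, and the sample average log-likelihood supplies the expectation term.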

The Akaike Information Criterion (AIC) is likewise motivated as an estimator of a model's relative (Kullback-Leibler) distance from the true distribution, which invites direct comparison with the potential information.

However, if the true distribution Ω is not a member of the model family under consideration, the average potential information I̅_p cannot converge to zero, and interpreting it requires care:

we need an estimate of the minimum achievable value of the potential information within the family;

or we need to consider possible variation between individual inference problems.

Standard Bayesian inference can grossly overestimate the posterior probability of a model term, because the sum over the terms actually calculated is biased to underestimate the total over all possible terms Ω_1, Ω_2, …, Ω_N, most of which have never been considered.

We note that a more limited convergence test is possible, by reversing the procedure, and calculating the entropy of the model (which can be done directly, either analytically or by simulation) for comparison against the empirical entropy of the observations. We define a self-consistency measure from this comparison.

For Ψ → Ω, the self-consistency measure converges to zero; a persistently nonzero value signals that the model has not converged to the true distribution.
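A hedged sketch of such a self-consistency check: compare the model's analytically known entropy H(Ψ) against the average surprisal of fresh observations under Ψ. The specific combination used here (average surprisal minus H(Ψ)) is an assumed illustrative form, not necessarily the paper's definition; it vanishes when Ψ matches the truth:

```python
import math
import random

# Self-consistency sketch: if Psi = Omega, the average surprisal of
# observations under Psi equals the model's own entropy H(Psi), so the
# difference is ~0. A mismatched model yields a clearly nonzero value.

def model_entropy_normal(sigma):
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def avg_surprisal(xs, mu, sigma):
    return sum(0.5 * math.log(2 * math.pi * sigma ** 2)
               + (x - mu) ** 2 / (2 * sigma ** 2) for x in xs) / len(xs)

random.seed(5)
xs = [random.gauss(0, 1) for _ in range(20000)]  # truth: N(0, 1)

good = avg_surprisal(xs, 0.0, 1.0) - model_entropy_normal(1.0)  # ~0: consistent
bad = avg_surprisal(xs, 0.0, 3.0) - model_entropy_normal(3.0)   # nonzero: mismatch
```

Note that the check only requires the model's entropy and observed data; it never requires knowing Ω, which is what makes it usable as a convergence signal.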

We defined our empirical information metric as a measure of prediction power. However, it seems worthwhile to ask again what exactly we mean by “prediction”. The empirical density estimation procedure outlined above suggests that in the limit of large sample size there is always a trivial way of obtaining perfect prediction power: copy the empirical density computed from the observations, and present it as the “model”.

We desire a metric for the genuine predictive value a model adds beyond this trivial strategy; we shall call this the model information I_m.

Generalizing our original definition of “prediction power”, we wish to maximize our prediction accuracy not only for situations that we have already observed, but also for novel situations that we have never encountered before. In other words, we adopt the conservative position that our data may be incomplete, so we cannot assume that future experience will simply mirror past experience. To maximize future prediction power, we must seek models that predict future observations more accurately than simply interpolating from past observations.

Of course, we do not know in advance how well a model will predict novel situations; the model information I_m must be measured empirically.

By definition, such a measurement can only be performed via new observations, obtained after the model was formulated.

Thus, we can consider the adoption of a new model to be a prediction in its own right, to be tested against future observations.

The key question of model information is whether the model yields better prediction power than simple interpolation from past observations. As the interpolation reference, we simply use the empirical density calculation defined previously. Specifically, for a model Ψ we define its model information as the average, over new observations x_{j,new}, of log Ψ(x_{j,new}) − log P_{e,old}(x_{j,new}), where P_{e,old} is the empirical density computed from the old (training) observations.

Thus, I_m > 0 means the model predicts new observations better than mere interpolation from past data, while I_m ≤ 0 means it adds no prediction power beyond the training observations themselves.

In the case where the number of old observations n_old is large, the empirical density attains prediction power close to the maximum (I_{e,old} approaches I_e), so the model information necessarily shrinks.

Note that the model information can be negative: a model that predicts new observations worse than simple interpolation from the training data has negative value-added.

For example, a normal distribution model trained on a small sample n_old can achieve substantial positive model information I_m, because it predicts the complete shape of the distribution, including its tails, better than interpolation from a few points can.

By contrast, the binomial distribution doesn't yield significant model information, because the observable has only two possible states: even a modest number of old observations n_old pins down the empirical frequencies, leaving the model little room to improve on interpolation from the old data.
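The binomial claim can be checked with a toy simulation; the pseudocount convention and the use of the true parameter as the "model" are illustrative assumptions:

```python
import math
import random

# Toy check: with only two observable states, empirical frequencies from a
# modest training sample already interpolate almost perfectly, so even the
# best possible model adds little prediction power (model info near zero).

random.seed(6)
p_true = 0.3
old = [1 if random.random() < p_true else 0 for _ in range(200)]
new = [1 if random.random() < p_true else 0 for _ in range(10000)]

# Empirical "interpolation" density from old observations (with pseudocounts)
p_emp = (sum(old) + 1) / (len(old) + 2)

def log_prob(x, p):
    return math.log(p if x == 1 else 1 - p)

# Model information: model's average log-likelihood on new data minus the
# empirical density's average log-likelihood on the same new data.
i_m = sum(log_prob(x, p_true) - log_prob(x, p_emp) for x in new) / len(new)
```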

We now briefly consider the relationships between potential information, empirical information and model information, illustrated in the empirical information partition rules figure.

Partition rule 1: I_p + I_e is a constant. For a fixed inference problem, improving the model Ψ decreases the potential information I_p and increases the empirical information I_e by the same amount; their sum, the total information available in the problem, does not depend on Ψ.

Over different inference problems Ω_(1), Ω_(2), …, Ω_(m) (represented by different observation samples), this total I_p + I_e can of course differ; the partition rule applies within each individual problem.

Partition rule 2: I_e − I_m is a constant. Specifically, I_e = I_m + I_{e,old}, where I_{e,old} is the prediction power of the empirical density computed from the old observations; since I_{e,old} does not depend on the model Ψ, the difference I_e − I_m is fixed.

Moreover, in the asymptotic limit of large n_old, the empirical density P_{e,old} converges to the true distribution, so I_{e,old} approaches the maximum achievable prediction power.

Thus, in this limit I_m → 0: the model can no longer add prediction power beyond what interpolation from the abundant old data already supplies, even though I_e remains high.

Partition rule 3: I_p + I_m is a constant. Combining the first two rules, I_p + I_m = (I_p + I_e) − I_{e,old}, which is likewise independent of the model Ψ.

Relationship of I_m to relative entropy

Consider the following asymptotic modeling protocol: for a large sample size n_old, we simply adopt the empirical density P_{e,old} as our model Ψ. We then measure I_e, I_p and I_m against new observations.

As n_old → ∞, the empirical density P_{e,old} converges to the true distribution, so I_p → 0 and I_e attains the maximum achievable prediction power. But since the model is, by construction, identical to the interpolation reference P_{e,old}, the model information is exactly zero: I_m = 0.

This scenario illustrates a simple point about the distinct meanings of empirical information I_e and model information I_m: a “model” that merely copies the data can achieve maximal I_e while contributing no model information at all.

This suggests several changes in how we think about the value of modeling. In model selection, the value of a model is often thought of in terms of data compression; that is, that the best model encodes the underlying pattern of the data in the most efficient manner possible. Metrics such as the AIC and BIC seek to enforce this principle by adding “correction terms” that penalize the number of model parameters. However, to be truly valuable for prediction, a model should meet this data compression criterion not only retrospectively (achieving high empirical information I_e on data already seen), but prospectively: it should compress, i.e. predict, observations that have not yet been made.

We wish to suggest that these empirical information metrics represent a useful extension of existing statistical inference metrics, because they provide “sampleable” measures of key information theory metrics (such as mutual information and relative entropy), with explicit Law of Large Numbers convergence guarantees. That is, each empirical information metric can be measured via sampling on an individual inference problem (unlike the conventional definition of mutual information), yet its average value over multiple inference problems will converge to the true, hidden value of its associated metric from information theory (such as the mutual information). On such a foundation, one can begin to recast statistical and scientific inference problems in terms of the very useful and general tools of information theory. For example, the “inference halting problem”, which imposes a variety of problems and limitations in Bayesian inference, can be easily resolved by the potential information metric, which directly measures the distance of the current model from the true distribution in standard information-theoretic terms. Similarly, the model information metric measures the “value-added” prediction power of a model relative to its training data.

Overfitting analysis of BIC models on a small sample from a normal distribution. For each data point, a sample of three observations was drawn randomly from a unit normal distribution. The BIC-optimal model was fit to these observations and used to compute the training bias I_b.

Comparing AIC and Potential Information to the Theoretical Kullback-Leibler Divergence, for normal models N(μ, σ²) evaluated against a true normal distribution with varying standard deviation.

Model information of the normal distribution. A model can exceed the prediction power of the empirical density computed from the training observations, because the model predicts the complete shape of the probability distribution, and how fast the tails will go to zero. Of course, as the training dataset size increases, the training data constitute a more and more accurate competing “model”, and the model information decreases asymptotically. For each dataset size, a sample of that size was drawn from a unit normal distribution, and used to train a normal distribution Ψ based on the sample mean and variance. We then computed the model information I_m on an independent test sample.

Empirical information partition rules. This diagram illustrates the three basic partition rules: 1. I_p + I_e is constant with respect to the model; 2. I_e = I_m + I_{e,old}; 3. I_p + I_m + I_{e,old} is constant. Here I_{e,old} is the prediction power of the empirical density computed from the old observations, I_p (red) is the potential information, and I_e and I_m are the empirical and model information.

The author wishes to thank Marc Harper, Esfan Haghverdi, John Baez, Qing Zhou, Alex Alekseyenko, and Cosma Shalizi for helpful discussions on this work. This research was supported by the Office of Science (BER), U. S. Department of Energy, Cooperative Agreement No. DE-FC02-02ER63421.