Statistical Information: A Bayesian Perspective

We explore the meaning of information about quantities of interest. Our approach is divided into two scenarios: the analysis of observations and the planning of an experiment. First, we review the Sufficiency, Conditionality and Likelihood principles and how they relate to trivial experiments. Next, we review Blackwell Sufficiency and show that sampling without replacement is Blackwell Sufficient for sampling with replacement. Finally, we unify the two scenarios by presenting an extension of the relationship between Blackwell Equivalence and the Likelihood Principle.


Introduction
One of the goals of statistics is to extract information about unknown quantities of interest from observations or from an experiment to be performed. The intuitive definition of information that we adopt from Basu [1] is: "Information is what it does for you, it changes your opinion".
One might further question:
• Information about what?
We are interested in information about a quantity of interest, θ ∈ Θ. A quantity of interest represents a state of nature that we are uncertain of. For example, one might be interested in the number of rainy days next year. In this case, θ is this number and Θ is the set of all natural numbers smaller than or equal to 366.
• Where is the information?
Stating Θ already uses previous knowledge about θ. In the example in the last paragraph, we have stated that any year has at most 366 days and, therefore, θ cannot exceed this number. Besides stating Θ, one might also think that some values are more probable than others. This kind of knowledge is used to elicit the prior distribution for θ. The prior distribution represents a description of our present state of uncertainty about θ. Usually, the scientist's goal is to decrease his uncertainty about θ. Thus, he collects data he believes to be related to the quantity of interest. That is, he expects that there is information about θ in the data he collects.
• How is information extracted?
We focus on the case in which one uses Bayes' theorem to compute the posterior distribution for θ given the observation. The posterior distribution describes the uncertainty about the quantity of interest after calibrating the prior by the observation. (In practice, the posterior distribution can rarely be computed exactly. In these cases, it is usually sufficient to compute a quantity proportional to the posterior or to sample from the posterior.) Information also depends on the statistical framework.
• How much information is extracted?
In Section 3 we question: How much information is extracted from a given observation? Section 3.1 reviews common principles in Statistics and their relationship with the Likelihood Principle. Section 3.2 presents a simple example and discusses information functions compatible with the Likelihood Principle.
In Section 4 we consider questions related to experimental design: "How much information do we expect to obtain by performing an experiment?" or "What is the best choice among possible experiments?". Blackwell Sufficiency is a strong criterion for the comparison of experiments. The definition of Blackwell Sufficiency, with a new example, is presented in Section 4.1. If Θ is finite, then two experiments are equally informative in Blackwell's sense iff their likelihoods have the same distribution (Torgersen [2]). In Section 4.2 we extend this result to a setting with no restrictions on Θ. Finally, since not all experiments are comparable in Blackwell's sense, Section 4.3 explores the metrics discussed in Section 3.2 within the framework of decision theory to compare experiments.
In the next section, we formalize the definitions introduced here.

Definitions
A probability space is a triple $(\Omega, \mathcal{F}, P)$ in which Ω is a set, $\mathcal{F}$ is a σ-algebra on Ω and $P : \mathcal{F} \to [0, 1]$ is a probability function. A quantity R corresponds to a function from Ω to a set $\mathcal{R}$. We define the probability space induced by R, $(\mathcal{R}, \mathcal{F}_R, P_R)$, where $\mathcal{F}_R = \{M \subset \mathcal{R} : R^{-1}[M] \in \mathcal{F}\}$ and $P_R(M) = P(R^{-1}[M])$.
An experiment corresponds to a mechanism that allows observing a given quantity. The performance of an experiment corresponds to the observation of this quantity. In order to be concise, from this point forward, we use the word experiment for both the experiment itself and the quantity that is observed when the experiment is performed.
Many quantities of interest are not observable. Therefore, it is only possible to learn about them in an indirect manner. Here, we restrict ourselves to performing experiments that are related to the quantity of interest and applying Bayes' theorem to update our knowledge about the latter. For example, in Section 1, the quantity of interest θ corresponds to the number of rainy days next year. A possible experiment to learn about θ would be to collect pluviometric data from recent years. Let X be the quantity representing this yet unobserved data. For brevity, we also call X the experiment. Our uncertainty about θ after performing X and observing X = x is given by P(θ|X = x).
Let X be an experiment taking values in $\mathcal{X}$. A function $T : \mathcal{X} \to \tau$ is called a statistic of X. Therefore, T(X) is also an experiment. Whenever there is no confusion, we use the letter T both to indicate the statistic T and the experiment T(X).
From now on, we restrict ourselves to quantities in ℝ with probability distributions that are discrete or are absolutely continuous with respect to the Lebesgue measure. We write p_X(x|θ) for the conditional probability (density) function of the experiment X given the quantity of interest θ. After the experiment is performed, we write L(θ|X = x) for the likelihood function of X at point x. Whenever clear from the context, we write p(x|θ) and L(θ|x) for the former functions. The prior distribution of θ is denoted by p(θ) and the posterior by p(θ|X = x). Bayes' theorem provides p(θ|X = x) ∝ p(θ)L(θ|x).
Finally, we say that an experiment $X : \Omega \to \mathcal{X}$ is trivial for a quantity of interest θ if X and θ are independent. We use the word trivial to emphasize that X and θ are not associated. Consequently, performing X alone does not bring "information" about θ.

Statistical Principles and Information
Let Inf(X, x, θ) denote the information gained about the quantity of interest θ after observing outcome x in experiment X. We follow Basu [1] and Birnbaum [3] in restricting the possible forms for Inf(X, x, θ) by assuming common statistical principles.
A statistic $T : \mathcal{X} \to \tau$ is sufficient if X and θ are conditionally independent given T, that is, X is a trivial experiment for θ given T. Thus, all the information about θ in X is gained by observing T alone. The Sufficiency Principle states that for any sufficient statistic T and for any x and x′ in $\mathcal{X}$, if T(x) = T(x′) then Inf(X, x, θ) = Inf(X, x′, θ). This principle is followed by most scientists, although not always explicitly mentioned: for inference about θ the scientist only needs to consider a sufficient statistic.
The Conditionality Principle is another important statistical principle: it can be seen as the reciprocal of Sufficiency. The latter states that a trivial experiment performed after T does not bring extra information about θ. The former states that a trivial experiment performed before another experiment does not bring extra information about θ. Let X_1 and X_2 be two arbitrary experiments. Let Y ∈ {1, 2} be an experiment jointly independent of θ, X_1 and X_2. Let X_Y be the mixture of X_1 and X_2, performed in the following way: perform Y; if the result of Y is 1 then perform X_1, else perform X_2. The Conditionality Principle states that Inf((Y, X_Y), (i, x), θ) = Inf(X_i, x, θ), ∀i ∈ {1, 2}. This principle is more controversial than that of Sufficiency.
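A short derivation may help to see why a Bayesian update is compatible with this principle. It is only a sketch for the discrete case, and the mixing probabilities $P(Y = i)$ are the one ingredient not named in the text. Since Y is jointly independent of θ, X_1 and X_2, the likelihood of the mixture factorizes as:

```latex
\begin{align*}
L\bigl(\theta \mid (Y, X_Y) = (i, x)\bigr)
  &= P(Y = i,\; X_i = x \mid \theta) \\
  &= P(Y = i)\, P(X_i = x \mid \theta) \\
  &\propto L(\theta \mid X_i = x).
\end{align*}
```

The factor $P(Y = i)$ does not depend on θ, so it cancels when the posterior is normalized, and observing (i, x) in the mixture leads to the same posterior as observing x in X_i.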
The Likelihood Principle states that any two possible outcomes having proportional likelihood functions must provide the same information about the quantity of interest. Therefore, for any experiments X_1 and X_2 and any x_1 ∈ $\mathcal{X}_1$ and x_2 ∈ $\mathcal{X}_2$, if L(θ|X_1 = x_1) ∝ L(θ|X_2 = x_2), then Inf(X_1, x_1, θ) = Inf(X_2, x_2, θ). This principle is stronger than the Sufficiency Principle and the Conditionality Principle. Birnbaum [3] and Basu [4] present the converse statement:
Theorem 1. The Sufficiency and the Conditionality Principles hold iff the Likelihood Principle holds.
A scientist who follows the Likelihood Principle can perform inference about the quantity of interest solely based on the likelihood function. Lindley and Phillips [5] and Pereira and Lindley [6] provide examples in which some frequentist methods violate the Likelihood Principle. Indeed, Frequentist Statistics does not follow the Conditionality Principle. On the other hand, Wechsler et al. [7] show that Bayesian Statistics follows the Likelihood Principle.
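The following minimal Python sketch illustrates what following the Likelihood Principle means for a Bayesian update. Everything in it is an illustrative assumption rather than an example taken from the references: a success probability restricted to a hypothetical three-point grid, a uniform prior, and 9 successes and 3 failures observed either with a fixed sample size (binomial) or by sampling until the third failure (negative binomial). The two likelihoods differ only by a constant factor, so the posteriors coincide.

```python
from fractions import Fraction
from math import comb

def posterior(prior, likelihood):
    """Bayes' theorem on a finite grid: posterior is proportional to prior x likelihood."""
    unnormalized = {t: prior[t] * likelihood[t] for t in prior}
    total = sum(unnormalized.values())
    return {t: v / total for t, v in unnormalized.items()}

# Hypothetical grid of success probabilities with a uniform prior (assumption).
grid = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = {t: Fraction(1, 3) for t in grid}

successes, failures = 9, 3

# Design 1: the number of trials (12) is fixed in advance.
lik_binomial = {
    t: comb(successes + failures, successes) * t**successes * (1 - t)**failures
    for t in grid
}

# Design 2: trials continue until the 3rd failure, so the last trial is a failure.
lik_negative_binomial = {
    t: comb(successes + failures - 1, successes) * t**successes * (1 - t)**failures
    for t in grid
}

post_binomial = posterior(prior, lik_binomial)
post_negative_binomial = posterior(prior, lik_negative_binomial)

# Exact rational arithmetic, so equality can be tested directly.
print(post_binomial == post_negative_binomial)
```

Exact fractions are used so that the comparison does not depend on floating-point rounding.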

Information in the Observation
After performing an experiment, how much information about θ does one obtain? In the last section, we argued that the information obtained from points with proportional likelihoods should be the same. Nevertheless, this property only gives a vague idea of what the information function should look like. In order to add precision to the definition of information we again rely on: "Information is what it does for you, it changes your opinion".
Before one performs an experiment, one's opinion about θ is given by the prior distribution. On the other hand, one's opinion after the experiment is performed is given by the posterior distribution. Hence, since the information should represent the change in opinion, it should be a function of the prior and posterior distributions. If the prior and posterior distributions are equal, there is no gain of information.
Next, we use an intuitive example to illustrate some information functions that satisfy this property. Consider 4 balls, 2 of them black and 2 white. Three of these 4 balls are put in an urn. You do not know which ball was left out. You are offered the possibility of performing one of the following three experiments: Experiment 1 consists of taking only one ball from the urn; Experiment 2 consists of taking two balls with replacement; and Experiment 3 consists of taking two balls without replacement. Your goal is to guess the number of white balls in the urn, 1 or 2. Assume that, a priori, you do not believe any combination of balls is more likely than another, that is, you hold a uniform prior. Also assume that all balls in the urn have equal probability of being selected. Let θ be the number of white balls in the urn and X_i be the number of white balls observed in the i-th experiment. The posterior probabilities P(θ = 1|X_i = j) are provided in Table 1.
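The posterior probabilities of the urn example can be recomputed with a short Python sketch. It assumes only what the paragraph states (three balls in the urn, θ ∈ {1, 2}, uniform prior, equiprobable balls); the function and variable names are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from math import comb

N_URN = 3                                    # balls in the urn
THETAS = (1, 2)                              # possible numbers of white balls
PRIOR = {t: Fraction(1, 2) for t in THETAS}  # uniform prior

def lik_one_draw(x, theta):
    """Experiment 1: P(X_1 = x | theta) for a single draw."""
    p_white = Fraction(theta, N_URN)
    return p_white if x == 1 else 1 - p_white

def lik_with_replacement(x, theta, n=2):
    """Experiment 2: binomial probability of x white balls in n draws."""
    p_white = Fraction(theta, N_URN)
    return comb(n, x) * p_white**x * (1 - p_white)**(n - x)

def lik_without_replacement(x, theta, n=2):
    """Experiment 3: hypergeometric probability of x white balls in n draws."""
    return Fraction(comb(theta, x) * comb(N_URN - theta, n - x), comb(N_URN, n))

EXPERIMENTS = {
    1: (lik_one_draw, (0, 1)),
    2: (lik_with_replacement, (0, 1, 2)),
    3: (lik_without_replacement, (0, 1, 2)),
}

def posterior_theta_1(lik, x):
    """P(theta = 1 | X = x): posterior is proportional to prior x likelihood."""
    joint = {t: PRIOR[t] * lik(x, t) for t in THETAS}
    return joint[1] / sum(joint.values())

# table[i][j] = P(theta = 1 | X_i = j)
table = {
    i: {x: posterior_theta_1(lik, x) for x in outcomes}
    for i, (lik, outcomes) in EXPERIMENTS.items()
}

for i, row in table.items():
    print(f"Experiment {i}:", {x: str(p) for x, p in row.items()})
```

Under these assumptions, Experiment 1 moves the prior of 1/2 to 2/3 or 1/3, Experiment 2 to 4/5, 1/2 or 1/5, and Experiment 3 to 1, 1/2 or 0.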
Which of the experiments is the most informative? That is, which experiment do you expect to most change your opinion? Table 2 does not provide a straightforward answer. For example, Experiment 1, in a worst-case scenario, brings more information than Experiments 2 and 3. Similarly, P(X_2 = 1) < P(X_3 = 1) and, thus, obtaining no information in Experiment 2 is less likely than in Experiment 3. On the other hand, Experiment 3 provides the largest possible increments in information. In the next section, we discuss how to decide which experiment is the most informative.
Table 2. From left to right, tables for Inf_E(X_i, j, θ), Inf_V(X_i, j, θ) and Inf_KL(X_i, j, θ).

Blackwell Sufficiency
Consider two experiments, X and Y, that depend on θ. One usually wants to choose between X and Y for inferences about θ based solely on the conditional distributions of X given θ and Y given θ. In this section we review the concept of Blackwell Sufficiency (Blackwell [8]) and show that it is a generalization of the Sufficiency Principle for the comparison of experiments.
A statistic T is sufficient for an experiment X if X and θ are conditionally independent given T. Consequently, T is sufficient iff p(x|θ) = p(t|θ)p(x|t). The conditional distribution of X given θ can be generated by observing T and sampling from p(x|t).
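As a worked instance of this factorization, consider a standard textbook case that is not one of the paper's examples: $x = (x_1, \ldots, x_n)$ is a sequence of n independent Bernoulli trials, θ is here assumed to denote the success probability, and $t = \sum_i x_i$ is the number of successes.

```latex
\[
p(x \mid \theta)
  = \theta^{t}(1-\theta)^{n-t}
  = \underbrace{\binom{n}{t}\,\theta^{t}(1-\theta)^{n-t}}_{p(t \mid \theta)}
    \;\cdot\;
    \underbrace{\binom{n}{t}^{-1}}_{p(x \mid t)} .
\]
```

The second factor does not involve θ: given the count t, every ordering of the successes is equally likely, so the full sequence can be regenerated from t by a randomization that carries no information about θ.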
Let X ∈ $\mathcal{X}(X)$ and Y ∈ $\mathcal{X}(Y)$ be two statistical experiments. X is Blackwell Sufficient for Y if there exists a map $H : \mathcal{X}(X) \times \mathcal{X}(Y) \to [0, 1]$, a transition function, satisfying the following properties:
• For any y ∈ $\mathcal{X}(Y)$, H(•, y) is measurable on the σ-algebra induced by X.
Let $\mathcal{X}(X)$ and $\mathcal{X}(Y)$ be countable sets and define, for all x ∈ $\mathcal{X}(X)$, Z_x ∈ $\mathcal{X}(Y)$ as a trivial experiment such that P(Z_x = y) = H(x, y). From the definition of Blackwell Sufficiency, the quantities (Z_X, θ) and (Y, θ) are equally distributed: X is Blackwell Sufficient for Y if and only if one can obtain an experiment with the same distribution as Y by observing X = x and, after that, performing the "randomization" Z_x.
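In the countable case, the statement that (Z_X, θ) and (Y, θ) are equally distributed can be written out explicitly. The identity below is a sketch that assumes each randomization Z_x is independent of both X and θ:

```latex
\[
P(Y = y \mid \theta)
  \;=\; P(Z_X = y \mid \theta)
  \;=\; \sum_{x} H(x, y)\, P(X = x \mid \theta),
  \qquad \forall y,\ \forall \theta \in \Theta,
\]
\[
\text{with } H(x, y) \ge 0 \ \text{ and } \ \sum_{y} H(x, y) = 1 \ \text{ for every } x .
\]
```

The same map H must work for every θ simultaneously, which is what makes the criterion strong: the randomization may use the observed x but not the unknown θ.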
Next, we provide two examples of Blackwell Sufficiency that address the question at the end of Section 3.2. Example 1 is a version of that in Basu and Pereira [9]. Example 2 is new and shows that sampling without replacement is Blackwell Sufficient for sampling with replacement. Other examples of Blackwell Sufficiency can be found, for example, in Goel and Ginebra [10] and Torgersen [2].
Analogously, (Y_1, ..., Y_n) corresponds to a sample without replacement, that is: Y_j and ∀i ∈ {1, ..., n − 1} two quantities A_{i+1} and B_{i+1}. These two quantities are such that: , and is independent of all other variables; 2.
Hence, in Section 3.2, Experiment 3 is Blackwell Sufficient for Experiment 2. Similarly, Basu and Pereira [11] show that Experiment 3 is Blackwell Sufficient for Experiment 1. One expects that the information gained about θ by performing Experiment 3 is at least as much as one would obtain by performing Experiments 1 or 2. Are Experiments 1 or 2 also Blackwell Sufficient for Experiment 3? In this case, the experiments would be equally informative. In the next subsection we present a theorem that characterizes when two experiments are equally informative in Blackwell's sense and, thus, also settles the comparison of the experiments in Section 3.2.
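The first claim can be checked numerically for two draws. The Python sketch below is an assumption-laden illustration, not the construction used in the paper's proof. It takes the number of white balls X in a sample of two without replacement and applies a hypothetical randomization: keep the first ball, and let the second draw repeat the first ball with probability 1/N, otherwise use the second sampled ball. It then verifies that the randomized count has the with-replacement distribution for every θ.

```python
from fractions import Fraction
from math import comb

def pmf_without_replacement(N, theta, n=2):
    """Hypergeometric pmf of the number of white balls (theta white among N)."""
    return [Fraction(comb(theta, x) * comb(N - theta, n - x), comb(N, n))
            for x in range(n + 1)]

def pmf_with_replacement(N, theta, n=2):
    """Binomial pmf of the number of white balls (theta white among N)."""
    p = Fraction(theta, N)
    return [comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)]

def transition(N):
    """H[x][y] for n = 2: distribution of the with-replacement count y given x.

    Assumed randomization: the second draw repeats the first ball with
    probability 1/N. Given x = 1, the white ball is first or second with
    equal probability, hence the two entries 1/(2N).
    """
    repeat = Fraction(1, N)
    return [
        [Fraction(1), Fraction(0), Fraction(0)],  # two black balls sampled
        [repeat / 2, 1 - repeat, repeat / 2],     # one white, one black
        [Fraction(0), Fraction(0), Fraction(1)],  # two white balls sampled
    ]

def randomized_pmf(N, theta):
    """Distribution of Z_X: sum over x of P(X = x | theta) * H(x, y)."""
    H = transition(N)
    p_x = pmf_without_replacement(N, theta)
    return [sum(p_x[x] * H[x][y] for x in range(3)) for y in range(3)]

# Urn of Section 3.2: N = 3 balls, theta in {1, 2} white balls.
for theta in (1, 2):
    print(theta, randomized_pmf(3, theta) == pmf_with_replacement(3, theta))
```

The transition matrix depends on N but not on θ, which is the requirement for a valid randomization.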

Equivalence Relation in Experiment Information
In this section, the experiments can assume values in a countable set. For an experiment $X : \Omega \to \mathcal{X}$, we assume that X is measurable on the power set of $\mathcal{X}$ and that ∀θ ∈ Θ, ∃x ∈ $\mathcal{X}$, P(x|θ) > 0. No assumption is required of Θ.
Using Blackwell Sufficiency, it is possible to define an equivalence relation between experiments: X and Y are Blackwell Equivalent, X ≈ Y, if each one is Blackwell Sufficient for the other. This equivalence relates to the Likelihood Principle in Section 3.1 through:
Theorem 2. Let X ∈ $\mathcal{X}$ and Y ∈ $\mathcal{Y}$ be two experiments. X ≈ Y iff, for every likelihood function L(•), the probability of observing an outcome whose likelihood function is proportional to L(•) is the same in X and in Y, for every θ ∈ Θ.
The following notation reduces the algebra involved. Since all sets are countable, consider them to be ordered. Let, ∀θ ∈ Θ, P(X = x|θ) be a probability function; then we define p(•|θ) as the vector whose i-th position is P(x_i|θ), where x_i is the i-th element of the ordering assumed in the set of values of X. Consider F to be an arbitrary map from $\mathcal{X} \times \mathcal{Y}$ into [0, 1]. We also use the symbol F for the countably infinite matrix that has in its j-th row and i-th column the value F(x_i, y_j), where x_i is the i-th element of the ordering in $\mathcal{X}$ and y_j is the j-th element of the ordering in $\mathcal{Y}$. Finally, a (transposed) transition matrix is such that all of its elements are greater than or equal to 0 and, for any column, the sum of its elements is equal to 1.
Proof. (⇐) Let $S : \mathcal{X} \to [0, 1]^{\Theta}$ and $T : \mathcal{Y} \to [0, 1]^{\Theta}$ be such that S(x) and T(y) are likelihood nuclei of x and y (a likelihood nucleus is a likelihood chosen among all of those that are proportional).
If two states are in the same irreducible component then their likelihood functions are proportional. The same proof holds for matrix N.
The i-th element of ξ_X is said to connect to the j-th element of ξ_Y if A(i, j) > 0. Similarly, the i-th element of ξ_Y is said to connect to the j-th element of ξ_X if B(i, j) > 0. Note that every state in ξ_X connects to at least one state in ξ_Y and vice versa. This is true because A and B are transition matrices.
For all x_1 ∈ ξ_X, if x_1 connects to y ∈ ξ_Y then y only connects to x_1. If there were a state x_2 ∈ ξ_X such that y connected to x_2, then x_1 and x_2 would be in the same irreducible component of M. Therefore x_1 and x_2 would yield proportional likelihood functions and, by the definition of S, x_1 = x_2. Similarly, if a state y ∈ ξ_Y connects to a state x ∈ ξ_X then x connects solely to y.
Finally, we conclude that every state in ξ_X connects to only one state in ξ_Y and vice versa. Also, if x ∈ ξ_X connects to y ∈ ξ_Y, then y connects to x and vice versa. This implies that if x connects to y, then P(X = x|θ) = P(Y = y|θ), ∀θ ∈ Θ. Since S and T are sufficient, the Theorem is proved.
Applying the above Theorem and the Likelihood Principle, one obtains the following result: if X is Blackwell Equivalent to Y, then P(A_e|θ) = P(B_e|θ), ∀θ ∈ Θ, for every possible value e of the information.
For any information function Inf satisfying the Likelihood Principle (if x and y yield proportional likelihood functions, then Inf(X, x, θ) = Inf(Y, y, θ)), X is Blackwell Equivalent to Y if and only if the distributions of (Inf, θ) for X and Y are the same. Also, since the likelihood nuclei are not equally distributed in the experiments in Section 3.2, we conclude that no pair of them is Blackwell Equivalent. Hence, from the conclusions in Section 4.1, Experiment 3 is strictly more informative than Experiments 2 and 1.
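The statement that the likelihood nuclei are not equally distributed across the three urn experiments can be checked with a small Python sketch. The choice of nucleus is an assumption made here for illustration: each likelihood function over θ ∈ {1, 2} is represented by its version normalized to sum to 1, which is one way of picking a representative among proportional likelihoods when Θ is finite.

```python
from collections import defaultdict
from fractions import Fraction
from math import comb

N_URN = 3        # balls in the urn
THETAS = (1, 2)  # possible numbers of white balls

def pmf_one_draw(theta):
    p = Fraction(theta, N_URN)
    return {0: 1 - p, 1: p}

def pmf_with_replacement(theta, n=2):
    p = Fraction(theta, N_URN)
    return {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

def pmf_without_replacement(theta, n=2):
    return {x: Fraction(comb(theta, x) * comb(N_URN - theta, n - x), comb(N_URN, n))
            for x in range(n + 1)}

def nucleus_distribution(pmf):
    """For each theta, the distribution of the likelihood nucleus of the outcome."""
    by_theta = {t: pmf(t) for t in THETAS}
    dist = {t: defaultdict(Fraction) for t in THETAS}
    for x in by_theta[THETAS[0]]:
        lik = [by_theta[t][x] for t in THETAS]
        total = sum(lik)
        if total == 0:
            continue  # outcome impossible under every theta
        nucleus = tuple(v / total for v in lik)  # assumed normalization
        for t in THETAS:
            if by_theta[t][x] > 0:
                dist[t][nucleus] += by_theta[t][x]
    return {t: dict(d) for t, d in dist.items()}

d1 = nucleus_distribution(pmf_one_draw)
d2 = nucleus_distribution(pmf_with_replacement)
d3 = nucleus_distribution(pmf_without_replacement)

# No two experiments share the same nucleus distribution.
print(d1 == d2, d1 == d3, d2 == d3)
```

Outcomes with proportional likelihoods are merged into a single nucleus, so two experiments that differ only by relabeling or by splitting an outcome at random would yield identical dictionaries.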

Experiment Information Function
In the last section, we defined properties an information function should satisfy. We reviewed Blackwell Sufficiency as a general rule for comparing experiments. Nevertheless, not every two experiments are comparable through this criterion. Next, we explicitly consider functions capable of describing the information of an experiment. A possible approach to this problem is to consider the information gained as a utility function (DeGroot [13]) that the scientist wants to maximize. This way, it follows from DeGroot [13] that Inf(X, θ) = E(Inf(X, x, θ)). Since we consider the data information function as non-negative, the utility function is concave; see DeGroot [14], for instance.
Proceeding with this approach, we compare the different information functions presented in Section 3.2. In this example, the maximum information is obtained when the posterior distribution is such that P(θ = 1|x) = 0 or P(θ = 1|x) = 1. Therefore, to compare those information functions, we divide all of them by these maxima.
First, we consider Euclidean distance as the information function. In the first experiment, with probability 1 the gain of information is 33%. That is, a small gain with a small risk. In the second experiment, with probability 56% the gain is 60% of the maximum and with probability 44% it is 0%.
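The percentages quoted for the Euclidean information function can be reproduced with the following Python sketch. It assumes that the gain is the Euclidean distance between the prior and posterior probability vectors for θ ∈ {1, 2}, divided by its maximum; for two-point distributions this ratio reduces to |posterior − prior| / max(prior, 1 − prior), which avoids square roots. Function names are hypothetical.

```python
from fractions import Fraction
from math import comb

N_URN = 3                 # balls in the urn
THETAS = (1, 2)           # possible numbers of white balls
PRIOR_1 = Fraction(1, 2)  # prior probability of theta = 1

def pmf_one_draw(theta):
    p = Fraction(theta, N_URN)
    return {0: 1 - p, 1: p}

def pmf_with_replacement(theta, n=2):
    p = Fraction(theta, N_URN)
    return {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

def pmf_without_replacement(theta, n=2):
    return {x: Fraction(comb(theta, x) * comb(N_URN - theta, n - x), comb(N_URN, n))
            for x in range(n + 1)}

def gain_distribution(pmf):
    """Map each normalized Euclidean gain to the prior predictive probability of obtaining it."""
    by_theta = {t: pmf(t) for t in THETAS}
    prior = {1: PRIOR_1, 2: 1 - PRIOR_1}
    dist = {}
    for x in by_theta[1]:
        marginal = sum(prior[t] * by_theta[t][x] for t in THETAS)
        if marginal == 0:
            continue  # outcome cannot occur
        post_1 = prior[1] * by_theta[1][x] / marginal
        # Ratio of Euclidean distances; the sqrt(2) factors cancel.
        gain = abs(post_1 - PRIOR_1) / max(PRIOR_1, 1 - PRIOR_1)
        dist[gain] = dist.get(gain, 0) + marginal
    return dist

def expected_gain(dist):
    """Inf(X, theta) = E(Inf(X, x, theta)) under the prior predictive distribution."""
    return sum(gain * prob for gain, prob in dist.items())

for name, pmf in [("Experiment 1", pmf_one_draw),
                  ("Experiment 2", pmf_with_replacement),
                  ("Experiment 3", pmf_without_replacement)]:
    dist = gain_distribution(pmf)
    print(name, {str(g): str(p) for g, p in dist.items()}, "expected:", expected_gain(dist))
```

Under these assumptions the sketch gives a gain of 1/3 with probability 1 for Experiment 1, and gains of 3/5 and 0 with probabilities 5/9 and 4/9 for Experiment 2. The expected normalized gain equals 1/3 for all three experiments, so this particular function, taken in expectation, does not by itself separate them.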

Information before an Experiment is Performed

Proof. Let A ∼ Ber(q) and B ∼ Ber(p), both independent of all other variables. Then, defining Y′ = AX + (1 − A)B, (Y′, π) and (Y, π) are equally distributed. Therefore, X is Blackwell Sufficient for Y.

Example 2. Next, we generalize the example of Section 3.2. Consider an urn with N balls. θ of these balls are black and N − θ are white. n (≤ N) balls are drawn from the urn.