Coarse-graining and the Blackwell order

Suppose we have a pair of information channels, $\kappa_{1},\kappa_{2}$, with a common input. The Blackwell order is a partial order over channels that compares $\kappa_{1}$ and $\kappa_{2}$ by the maximal expected utility an agent can obtain when decisions are based on the outputs of $\kappa_{1}$ and $\kappa_{2}$. Equivalently, $\kappa_{1}$ is said to be Blackwell-inferior to $\kappa_{2}$ if and only if $\kappa_{1}$ can be constructed by garbling the output of $\kappa_{2}$. A related partial order stipulates that $\kappa_{2}$ is more capable than $\kappa_{1}$ if the mutual information between the input and output is at least as large for $\kappa_{2}$ as for $\kappa_{1}$ for every distribution over inputs. If one channel is Blackwell-inferior to another then it must be less capable. However, examples are known where $\kappa_{1}$ is less capable than $\kappa_{2}$ even though it is not Blackwell-inferior. We give a new example of this phenomenon in which $\kappa_{1}$ is constructed by coarse-graining the inputs of $\kappa_{2}$. Such a coarse-graining is a special kind of "pre-garbling" of the channel inputs. This example directly establishes that the expected value of the shared utility function for the coarse-grained channel is larger than it is for the non-coarse-grained channel. This is surprising, as one might think that coarse-graining can only destroy information and lead to inferior channels.


I. INTRODUCTION
Suppose we are given the choice of two channels that both provide information about the same random variable, and that we want to make a decision based on the channel outputs. Suppose that our utility function depends on the joint value of the input to the channel and our resultant decision based on the channel outputs. Suppose as well that we know the precise conditional distributions defining the channels, and the distribution over channel inputs. Which channel should we choose? The answer to this question depends on the choice of our utility function as well as on the details of the channels and the input distribution. So, for example, without specifying how we will use the channels, in general we cannot just compare their information capacities to choose between them.
Nonetheless, for certain pairs of channels we can make our choice, even without knowing the utility functions or the distribution over inputs. Let us denote the two channels by $\kappa_1$ and $\kappa_2$, respectively. Then if there exists another stochastic matrix $\lambda$ such that $\kappa_1 = \lambda \circ \kappa_2$, there is never any reason to strictly prefer $\kappa_1$. This is because if we choose $\kappa_2$, we can always make our decision by chaining the output of $\kappa_2$ through the channel $\lambda$ and then using the same decision function we would have used had we chosen $\kappa_1$. This simple argument shows that whatever the three stochastic matrices are and whatever the decision rule we would use if we chose channel $\kappa_1$, we can always get the same expected utility (as we would get by choosing $\kappa_1$ and using the associated decision rule) by instead choosing channel $\kappa_2$ (for the appropriate associated decision rule). In this kind of situation, where $\kappa_1 = \lambda \circ \kappa_2$, we say that $\kappa_1$ is a garbling (or degradation) of $\kappa_2$. It is much more difficult to prove that the converse also holds true:
Theorem 1 (Blackwell's theorem [1]). Let $\kappa_1, \kappa_2$ be two stochastic matrices representing two channels with the same input alphabet. Then the following two conditions are equivalent:
1) When the agent chooses $\kappa_2$ (and uses the decision rule that is optimal for $\kappa_2$), her expected utility is always at least as big as the expected utility when she chooses $\kappa_1$ (and uses the optimal decision rule for $\kappa_1$), independent of the utility function and the distribution of the input $S$.
2) $\kappa_1$ is a garbling of $\kappa_2$.
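The chaining argument that precedes Theorem 1 can be checked numerically. The following is a minimal sketch, not part of the original text: the prior, the channels, the garbling and the utility values are hypothetical numbers chosen only for illustration, and channels are stored as column-stochastic nested lists with `channel[x][s] = P(output = x | input = s)`.

```python
from fractions import Fraction as F

def compose(outer, inner):
    """Matrix product outer @ inner: the output of `inner` is fed into `outer`.
    Channels are column-stochastic: channel[x][s] = P(output = x | input = s)."""
    return [[sum(outer[y][x] * inner[x][s] for x in range(len(inner)))
             for s in range(len(inner[0]))]
            for y in range(len(outer))]

def expected_utility(prior, channel, rule, u):
    """E[u(a, S)] when the action a is drawn from rule[a][x] after observing x."""
    return sum(prior[s] * channel[x][s] * rule[a][x] * u[a][s]
               for s in range(len(prior))
               for x in range(len(channel))
               for a in range(len(rule)))

# Hypothetical numbers, chosen only for illustration.
prior = [F(1, 2), F(1, 2)]
kappa2 = [[F(9, 10), F(0)], [F(1, 10), F(1)]]
lam = [[F(4, 5), F(1, 5)], [F(1, 5), F(4, 5)]]   # binary symmetric garbling
kappa1 = compose(lam, kappa2)                     # kappa1 is a garbling of kappa2
u = [[2, 0], [0, 1]]                              # u[a][s]
rule1 = [[1, 0], [0, 1]]                          # deterministic rule a(x) = x for kappa1

# With kappa2: first pass the output through lam, then apply rule1.
rule2 = compose(rule1, lam)

eu_kappa1 = expected_utility(prior, kappa1, rule1, u)
eu_kappa2 = expected_utility(prior, kappa2, rule2, u)
print(eu_kappa1, eu_kappa2)
```

The two printed values coincide because the rule used with `kappa2` is exactly `rule1` applied to the garbled output; note that this induced rule is in general randomized.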
Blackwell's theorem motivates looking at the following partial order over channels $\kappa_1, \kappa_2$ with a common input alphabet: $\kappa_1 \preceq \kappa_2$ if and only if one of the two statements in Blackwell's theorem holds true.
We call this partial order the Blackwell order. If $\kappa_1 \preceq \kappa_2$, then $\kappa_1$ is said to be Blackwell-inferior to $\kappa_2$. Strictly speaking, the Blackwell order is only a preorder, since there are channels $\kappa_1 \neq \kappa_2$ that satisfy $\kappa_1 \preceq \kappa_2 \preceq \kappa_1$. However, for our purposes such channels can usually be considered as equivalent.
For a given distribution of $S$, we can also compare $\kappa_1$ and $\kappa_2$ by comparing the two mutual informations $I(S; X_1)$, $I(S; X_2)$ between the common input $S$ and the channel outputs $X_1$ and $X_2$. The data processing inequality shows that $\kappa_2 \succeq \kappa_1$ implies $I(S; X_2) \ge I(S; X_1)$. However, the converse implication does not hold. The intuitive reason is that for the Blackwell order it is not only the amount of information that matters. Rather, the question is how much of the information that $\kappa_1$ or $\kappa_2$ preserves is relevant for the decision problem at hand.
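The data processing inequality stated above can be illustrated numerically. This is a small sketch under stated assumptions: the two matrices are hypothetical values, the helper names are not from the text, and the check runs over a grid of binary input distributions rather than over all of them.

```python
from math import log2

def mutual_information(prior, channel):
    """I(S; X) in bits, with channel[x][s] = P(X = x | S = s)."""
    n_in, n_out = len(prior), len(channel)
    p_x = [sum(channel[x][s] * prior[s] for s in range(n_in)) for x in range(n_out)]
    total = 0.0
    for x in range(n_out):
        for s in range(n_in):
            joint = channel[x][s] * prior[s]
            if joint > 0:
                total += joint * log2(joint / (p_x[x] * prior[s]))
    return total

def compose(outer, inner):
    """Matrix product outer @ inner (feed the output of `inner` into `outer`)."""
    return [[sum(outer[y][x] * inner[x][s] for x in range(len(inner)))
             for s in range(len(inner[0]))]
            for y in range(len(outer))]

# Hypothetical channel and garbling, for illustration only.
kappa2 = [[0.9, 0.0], [0.1, 1.0]]
lam = [[0.8, 0.2], [0.2, 0.8]]
kappa1 = compose(lam, kappa2)            # kappa1 is a garbling of kappa2

# I(S; X2) - I(S; X1) on a grid of input distributions (p, 1 - p).
gaps = []
for k in range(1, 100):
    prior = [k / 100, 1 - k / 100]
    gaps.append(mutual_information(prior, kappa2) - mutual_information(prior, kappa1))
print(min(gaps))
```

A non-negative minimum on the grid is what the inequality predicts for a garbling; the grid check is only an illustration and does not establish the inequality for every input distribution.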
Given two channels $\kappa_1, \kappa_2$, suppose that $I(S; X_2) \ge I(S; X_1)$ for all distributions of $S$. In this case, we say that $\kappa_2$ is more capable than $\kappa_1$. Equivalently, $\kappa_1$ is less capable than $\kappa_2$. Does this imply that $\kappa_1 \preceq \kappa_2$? Surprisingly, specific examples are known where the answer is negative [2]. In Proposition 6 we introduce a new example of this phenomenon. In this example, $\kappa_1$ is a Markov approximation of $\kappa_2$ by a deterministic function, in the following sense: Consider a function $f : \mathcal{S} \to \mathcal{S}'$ from the support $\mathcal{S} := \{s : P_S(s) = \Pr(S = s) > 0\}$ of $S$ to another set $\mathcal{S}'$. Given two random variables $S, X$, denote by $X \leftarrow S$ the channel defined by the conditional probabilities $P_{X|S}(x|s)$ on $\mathcal{S}$. Consider the two channels $\kappa_2 := (X \leftarrow S)$ and $\kappa_1 := (X \leftarrow f(S)) \circ (f(S) \leftarrow S)$. Which channel is superior? Using the data processing inequality, it is easy to see that $\kappa_1$ is less capable than $\kappa_2$. However, as Proposition 6 shows, in general $\kappa_1 \not\preceq \kappa_2$.
We call $\kappa_1$ a Markov approximation, because the output of $\kappa_1$ is independent of the input $S$ given $f(S)$. $\kappa_1$ can also be obtained from $\kappa_2$ by "pre-garbling" (Lemma 8); that is, $\kappa_1$ is obtained by composing another channel $\lambda^f$ with $\kappa_2$. It is known that pre-garbling may improve the performance of a channel (but not its capacity!), as we recall in Section II. What may be surprising is that this can happen for pre-garbling distributions of the form $\lambda^f$, which have the effect of coarse-graining according to $f$.
The fact that the more capable preorder does not imply the Blackwell order shows that "Shannon information," as captured by the mutual information, is not the same as "Blackwell information," as needed for the Blackwell decision problems. Indeed, our example explicitly shows that even though coarse-graining always reduces Shannon information, it may not reduce Blackwell information.
Proposition 6 builds upon another effect that we find paradoxical: Namely, there exist random variables $S, X_1, X_2$ and there exists a function $f : \mathcal{S} \to \mathcal{S}'$ from the support of $S$ to a finite set $\mathcal{S}'$ such that the following holds:
1) $S$ and $X_1$ are independent given $f(S)$.
2) $(X_1 \leftarrow f(S)) \prec (X_2 \leftarrow f(S))$.
3) $(X_1 \leftarrow S) \not\preceq (X_2 \leftarrow S)$.
Statement 1) says that everything $X_1$ knows about $S$, it knows through $f(S)$. Statement 2) says that $X_2$ knows more about $f(S)$ than $X_1$. Still, 3) says that we cannot conclude that $X_2$ knows more about $S$ than $X_1$. The paradox illustrates that it is difficult to formalize what it means to "know more."
The remainder of this work is organized as follows: In Section II, we recall how pre-garbling can be used to improve the performance of a channel. We also show that the pre-garbled channel will always be less capable and that simultaneous pre-garbling of both channels preserves the Blackwell order. In Section III, we state a few properties of the Blackwell order, and we explain why we find these properties counterintuitive and paradoxical. In particular, we also show that coarse-graining the input can improve the performance of a channel. Section IV contains a detailed discussion of an example that illustrates these properties. In the appendix, we briefly remark how to compute the Blackwell order and how to quantify deviations from the same.

II. PRE-GARBLING
As discussed above, Blackwell's theorem states that garbling the output of a channel ("post-garbling") never increases the quality of a channel. On the other hand, garbling the input of a channel ("pre-garbling") may increase the quality of a channel, as the following example shows.
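Example 2. Suppose that an agent can choose an action from a finite set $A$. She then receives a utility $u(a, s)$ that depends both on the chosen action $a \in A$ and on the value $s$ of a random variable $S$. Consider the channels
$$\kappa_1 = \begin{pmatrix} 0.9 & 0 \\ 0.1 & 1 \end{pmatrix} \quad\text{and}\quad \kappa_2 = \kappa_1 \circ \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0.9 \\ 1 & 0.1 \end{pmatrix}$$
and the utility function $u(a, s)$ [utility table missing in source]. For uniform input the optimal decision rule for $\kappa_1$ is [decision rule missing in source] and the opposite for $\kappa_2$. The expected utility with $\kappa_1$ is 1.4, while using $\kappa_2$, it is slightly higher, 1.45.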
The intuitive reason for the difference in the expected utilities is that the channel $\kappa_2$ transmits one of the states without noise and the other state with noise. With a convenient pre-processing, it is possible to make sure that the relevant information for choosing which action has better expected utility is transmitted with less noise.
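The two expected utilities quoted in Example 2 can be reproduced with a short computation. This sketch rests on two assumptions that are not confirmed by the text as it stands: $\kappa_2$ is taken to be $\kappa_1$ with its two inputs swapped (consistent with the invertible pre-processing described for this example), and the utility table is a hypothetical choice, $u(0,0) = 2$, $u(1,1) = 1$ and $0$ otherwise, picked because it yields the stated values 1.4 and 1.45.

```python
from fractions import Fraction as F

def best_expected_utility(prior, channel, u):
    """Maximal expected utility: choose the best action separately for each output x.
    channel[x][s] = P(X = x | S = s), u[a][s] = utility of action a in state s."""
    return sum(
        max(sum(prior[s] * channel[x][s] * u[a][s] for s in range(len(prior)))
            for a in range(len(u)))
        for x in range(len(channel)))

prior = [F(1, 2), F(1, 2)]                        # uniform input
kappa1 = [[F(9, 10), F(0)], [F(1, 10), F(1)]]     # kappa1[x][s], as given in the text
kappa2 = [[row[1], row[0]] for row in kappa1]     # ASSUMPTION: kappa1 with inputs swapped
u = [[2, 0], [0, 1]]                              # ASSUMPTION: hypothetical utility u[a][s]

eu1 = best_expected_utility(prior, kappa1, u)
eu2 = best_expected_utility(prior, kappa2, u)
print(float(eu1), float(eu2))                     # 1.4 1.45
```

Under these assumptions the optimal rule for `kappa1` is to copy the output, and for `kappa2` to invert it, which matches the statement that the two optimal rules are opposite.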
Note the symmetry of the example: Each of the two channels arises from the other by a convenient pre-processing, since the pre-processing is invertible, and the two channels are not comparable by the Blackwell order. In contrast, two channels that only differ by an invertible garbling of the output are equivalent with respect to the Blackwell order.
The pre-garbling in Example 2 is invertible, and so it is more aptly described as a pre-processing. In general, though, pure pre-garbling and pure pre-processing are not easily distinguishable, and it is easy to perturb Example 2 by adding noise without changing the conclusion. In Section III, we will present an example in which the pre-garbling consists of coarse-graining. It is much more difficult to understand how coarse-graining can be used as sensible pre-processing.
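Even though pre-garbling can make a channel better (or, more precisely, more suited for a particular decision problem at hand), pre-garbling cannot invert the Blackwell order:
Lemma 3. If $\kappa_1 \prec \kappa_2 \circ \lambda$, then $\kappa_1 \not\succeq \kappa_2$.
Proof. Suppose that $\kappa_1 \prec \kappa_2 \circ \lambda$. Then the capacity of $\kappa_1$ is less than the capacity of $\kappa_2 \circ \lambda$, which is bounded by the capacity of $\kappa_2$. Therefore, the capacity of $\kappa_1$ is less than the capacity of $\kappa_2$. Since a garbling cannot have larger capacity than the channel it is derived from, $\kappa_2 \preceq \kappa_1$ cannot hold.
Also, it follows directly from Blackwell's theorem that $\kappa_1 \preceq \kappa_2$ implies $\kappa_1 \circ \lambda \preceq \kappa_2 \circ \lambda$ for any channel $\lambda$, where the input and output alphabets of $\lambda$ equal the input alphabet of $\kappa_1, \kappa_2$. Thus, pre-garbling preserves the Blackwell order when applied to both channels simultaneously.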
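The closing remark on simultaneous pre-garbling can also be seen at the level of matrices. The text derives it from Blackwell's theorem; the sketch below instead uses the elementary observation that an output garbling which turns $\kappa_2$ into $\kappa_1$ still works after both channels are pre-garbled, by associativity of the matrix product. All numbers and the names `post` and `pre` are hypothetical.

```python
from fractions import Fraction as F

def compose(outer, inner):
    """Matrix product outer @ inner (feed the output of `inner` into `outer`).
    Channels are column-stochastic: channel[x][s] = P(output = x | input = s)."""
    return [[sum(outer[y][x] * inner[x][s] for x in range(len(inner)))
             for s in range(len(inner[0]))]
            for y in range(len(outer))]

# Hypothetical channels, for illustration only.
kappa2 = [[F(9, 10), F(0)], [F(1, 10), F(1)]]
post = [[F(4, 5), F(1, 5)], [F(1, 5), F(4, 5)]]   # output garbling: kappa1 = post @ kappa2
kappa1 = compose(post, kappa2)
pre = [[F(1, 3), F(1, 2)], [F(2, 3), F(1, 2)]]    # arbitrary pre-garbling of the common input

lhs = compose(kappa1, pre)                         # kappa1 pre-garbled
rhs = compose(post, compose(kappa2, pre))          # same garbling applied to kappa2 pre-garbled
print(lhs == rhs)
```

Exact rational arithmetic is used so that the equality test is not affected by rounding.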
Finally, let us remark that certain kinds of simultaneous pre-garbling can also be "hidden" in the utility function: Namely, in Blackwell's theorem, it is not necessary to vary the distribution of $S$, as long as the (fixed) input distribution has full support $\mathcal{S}$ (that is, every state of the input alphabet of $\kappa_1$ and $\kappa_2$ appears with positive probability). In this setting, it suffices to look only at different utility functions. When the input distribution is fixed, it is more convenient to think in terms of random variables instead of channels, which slightly changes the interpretation of the decision problem. Suppose we are given random variables $S, X_1, X_2$ and a utility function $u(a, s)$ depending on the value of $S$ and an action $a \in A$ as above. If we cannot look at both $X_1$ and $X_2$, should we rather look at $X_1$ or at $X_2$ to take our decision?
Theorem 4 (Blackwell's theorem for random variables [3]).
The following two conditions are equivalent:
1) Under the optimal decision rule, when the agent chooses $X_2$, her expected utility is always at least as big as the expected utility when she chooses $X_1$, independent of the utility function.
2) $(X_1 \leftarrow S) \preceq (X_2 \leftarrow S)$.

III. PRE-GARBLING BY COARSE-GRAINING
In this section we present a few counter-intuitive properties of the Blackwell order.
Proposition 5. There exist random variables $S, X_1, X_2$ and a function $f : \mathcal{S} \to \mathcal{S}'$ from the support of $S$ to a finite set $\mathcal{S}'$ such that the following holds:
1) $S$ and $X_1$ are independent given $f(S)$.
2) $(X_1 \leftarrow f(S)) \prec (X_2 \leftarrow f(S))$.
3) $(X_1 \leftarrow S) \not\preceq (X_2 \leftarrow S)$.
This result may at first seem paradoxical. After all, property 3) implies that there exists a decision problem involving $S$ for which it is better to use $X_1$ than $X_2$. Now, property 1) implies that any information that $X_1$ has about $S$ is contained in $X_1$'s information about $f(S)$. One would therefore expect that, from the viewpoint of $X_1$, any decision problem in which the task is to predict $S$ and to react to $S$ looks like a decision problem in which the task is to react to $f(S)$. But property 2) implies that for such a decision problem, it is better to look at $X_2$.
Proof of Proposition 5. The proof is by Example 9, which will be given in Section IV. This example satisfies
1) $S$ and $X_1$ are independent given $f(S)$.
2) $(X_1 \leftarrow f(S)) \preceq (X_2 \leftarrow f(S))$.
3) $(X_1 \leftarrow S) \not\preceq (X_2 \leftarrow S)$.
It only remains to show that it is possible to also achieve the strict relation $(X_1 \leftarrow f(S)) \prec (X_2 \leftarrow f(S))$ in the second statement. This can easily be done by adding a small garbling to the channel $X_1 \leftarrow f(S)$ (e.g., by composing it with a binary symmetric channel with sufficiently small noise parameter $\epsilon$). This ensures $(X_1 \leftarrow f(S)) \prec (X_2 \leftarrow f(S))$, and if the garbling is small enough, this does not destroy the property $(X_1 \leftarrow S) \not\preceq (X_2 \leftarrow S)$.
The example from Proposition 5 also leads to the following paradoxical property:
Proposition 6. There exist random variables $S, X$ and there exists a function $f : \mathcal{S} \to \mathcal{S}'$ from the support of $S$ to a finite set $\mathcal{S}'$ such that the following holds: $(X \leftarrow f(S)) \circ (f(S) \leftarrow S) \not\preceq (X \leftarrow S)$.
Let us again give a heuristic argument why we find this property paradoxical. Namely, the combined channel $(X \leftarrow f(S)) \circ (f(S) \leftarrow S)$ can be seen as a Markov chain approximation of the direct channel $X \leftarrow S$ that corresponds to replacing the conditional distribution $P_{X|S}(x|s)$ by $P_{X|f(S)}(x|f(s))$. Proposition 6 together with Blackwell's theorem states that there are situations where this approximation is better than the direct channel.
Proof of Proposition 6. Let $S, X_1, X_2$ be as in Example 9 in Section IV, which also proves Proposition 5, and let $X = X_2$. In that example, the two channels $X_1 \leftarrow f(S)$ and $X_2 \leftarrow f(S)$ are equal. Moreover, $X_1$ and $S$ are independent given $f(S)$. Thus, $(X \leftarrow f(S)) \circ (f(S) \leftarrow S) = (X_1 \leftarrow S)$. Therefore, the statement follows from $(X_1 \leftarrow S) \not\preceq (X_2 \leftarrow S)$.
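On the other hand, the channel $(X \leftarrow f(S)) \circ (f(S) \leftarrow S)$ is less capable than $X \leftarrow S$:
Lemma 7. For any random variables $S, X$, and function $f : \mathcal{S} \to \mathcal{S}'$, the channel $(X \leftarrow f(S)) \circ (f(S) \leftarrow S)$ is less capable than $X \leftarrow S$.
Proof. For any distribution of $S$, let $X'$ be the output of the channel $(X \leftarrow f(S)) \circ (f(S) \leftarrow S)$. Then, $X'$ is independent of $S$ given $f(S)$. On the other hand, since $f$ is a deterministic function, $X'$ is independent of $f(S)$ given $S$. Together, this implies $I(S; X') = I(f(S); X')$. Using the fact that the joint distributions of $(X, f(S))$ and $(X', f(S))$ are identical and applying the data processing inequality again gives
$$I(S; X') = I(f(S); X') = I(f(S); X) \le I(S; X).$$
The setting of Proposition 6 can also be understood as a specific kind of pre-garbling. Namely, consider the channel $\lambda^f$ defined by
$$\lambda^f_{s', s} := P(S = s' \mid f(S) = f(s)).$$
The effect of this channel is the following: The precise value of $S$ is forgotten, and only the value of $f(S)$ is preserved. Then a new value $s'$ is sampled for $S$ according to the conditional distribution of $S$ given $f(S)$.
Lemma 8. $(X \leftarrow f(S)) \circ (f(S) \leftarrow S) = (X \leftarrow S) \circ \lambda^f$.
Proof. For all $x$ and $s$,
$$\sum_{s'} P_{X|S}(x|s')\, P_{S|f(S)}(s'|f(s)) = \sum_{s'} P_{X|S,f(S)}(x|s', f(s))\, P_{S|f(S)}(s'|f(s)) = P_{X|f(S)}(x|f(s)),$$
where we have used that $X - S - f(S)$ forms a Markov chain.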
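The coarse-graining channel described above (forget $s$, keep $f(s)$, resample $S$ from its conditional distribution) can be built explicitly and compared with the Markov approximation. The following is a minimal sketch with a hypothetical input distribution `p_s`, a hypothetical channel `x_given_s`, and a hypothetical map `f`; none of these values come from the text.

```python
from fractions import Fraction as F

p_s = [F(3, 8), F(3, 8), F(1, 4)]          # hypothetical distribution of S on {0, 1, 2}
f = [0, 0, 1]                               # hypothetical coarse-graining map s -> f(s)
x_given_s = [[F(3, 4), F(1, 3), F(0)],      # hypothetical channel X <- S,
             [F(1, 4), F(2, 3), F(1)]]      # x_given_s[x][s] = P(X = x | S = s)
n = len(p_s)

def fiber_mass(s):
    """P(f(S) = f(s))."""
    return sum(p_s[t] for t in range(n) if f[t] == f[s])

# lam_f[t][s] = P(S = t | f(S) = f(s)): forget s, keep f(s), resample S.
lam_f = [[(p_s[t] / fiber_mass(s)) if f[t] == f[s] else F(0) for s in range(n)]
         for t in range(n)]

# Pre-garbled channel: (X <- S) composed with lam_f.
pre_garbled = [[sum(x_given_s[x][t] * lam_f[t][s] for t in range(n)) for s in range(n)]
               for x in range(len(x_given_s))]

# Markov approximation: s -> P(X = x | f(S) = f(s)).
markov = [[sum(x_given_s[x][t] * p_s[t] for t in range(n) if f[t] == f[s]) / fiber_mass(s)
           for s in range(n)]
          for x in range(len(x_given_s))]

print(pre_garbled == markov)
```

Both matrices have identical columns for inputs in the same fiber of `f`, which is the sense in which the pre-garbled channel only sees $f(S)$.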
While it is easy to understand that pre-garbling can be advantageous in general (since it can work as pre-processing), we find it surprising that this can also happen in the case where the pre-garbling is done in terms of a function $f$; that is, in terms of a channel $\lambda^f$ that does coarse-graining.

IV. EXAMPLES
Example 9. Consider the joint distribution [table missing in source] and the function $f : \{0, 1, 2\} \to \{0, 1\}$ with $f(0) = f(1) = 0$ and $f(2) = 1$. Then $X_1$ and $X_2$ are independent uniform binary random variables, and $f(S) = \mathrm{AND}(X_1, X_2)$. By symmetry, the joint distributions of the pairs $(f(S), X_1)$ and $(f(S), X_2)$ are identical, and so the two channels $X_1 \leftarrow f(S)$ and $X_2 \leftarrow f(S)$ are identical. In particular, $(X_1 \leftarrow f(S)) \preceq (X_2 \leftarrow f(S))$. On the other hand, consider the utility function $u(s, a)$ [table missing in source]. To compute the optimal decision rule, let us look at the conditional distributions: [table missing in source]. The optimal decision rule for $X_1$ is $a(0) = 0$, $a(1) = 1$, with expected utility [value missing in source]. The optimal decision rule for $X_2$ is $a(0) = 0$, $a(1) \in \{0, 1\}$ (this is not unique in this case), with expected utility [value missing in source].
How can we understand this example? Some observations:
• It is easy to see that $X_2$ has more irrelevant information than $X_1$: namely, $X_2$ can determine relatively precisely when $S = 0$. However, since $S = 0$ yields no utility regardless of the action, this information is not relevant. It is more difficult to understand why $X_2$ has less relevant information than $X_1$. Surprisingly, $X_1$ can determine more precisely when $S = 1$: If $S = 1$, then $X_1$ "detects this" (in the sense that $X_1$ chooses action 0) with probability 2/3. For $X_2$, the same probability is only [value missing in source]. Thus, the conditional entropies of $S$ given $X_2$ are smaller than the conditional entropies of $S$ given $X_1$.
• One can see in which sense $f(S)$ captures the relevant information for $X_1$, and indeed for the whole decision problem: knowing $f(S)$ is completely sufficient in order to receive the maximal utility for each state of $S$. However, when information is incomplete, it matters how the information about the different states of $S$ is mixed, and two variables $X_1, X_2$ that have the same joint distribution with $f(S)$ may perform differently. It is somewhat surprising that it is the random variable that has less information about $S$ and that is conditionally independent of $S$ given $f(S)$ which actually performs better.
Example 9 is different from the pre-garbling Example 2 discussed in Section II. In the latter, both channels had the same amount of information (mutual information) about $S$, but for the given decision problem the information provided by $\kappa_2$ was more relevant than the information provided by $\kappa_1$. The first difference in Example 9 is that $X_1$ has less information about $S$ than $X_2$ (Lemma 7). Moreover, both channels are identical with respect to $f(S)$, i.e., they provide the same information about $f(S)$, and for $X_1$ it is the only information it has about $S$. So one could argue that $X_2$ has additional information which, rather than helping, decreases the expected utility.
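The qualitative claims of Example 9 can be checked on a concrete distribution. Since the tables of the example are not available in this text, the joint distribution and the utility function below are assumptions: they were chosen to satisfy the properties that the text does state ($X_1, X_2$ independent and uniform, $f(S) = \mathrm{AND}(X_1, X_2)$, $S = 0$ yields no utility, $P(X_1 = 0 \mid S = 1) = 2/3$), and they need not coincide with the original tables.

```python
from fractions import Fraction as F

# ASSUMPTION: joint distribution of (S, X1, X2); not taken from the source.
joint = {(0, 0, 0): F(1, 4), (1, 0, 1): F(1, 4), (0, 1, 0): F(1, 8),
         (1, 1, 0): F(1, 8), (2, 1, 1): F(1, 4)}
f = {0: 0, 1: 0, 2: 1}
u = {(1, 0): 1, (2, 1): 1}   # ASSUMPTION: u(s, a); every other entry is 0

def marginal(keep):
    """Marginal of `joint` after mapping each (s, x1, x2) through `keep`."""
    out = {}
    for key, p in joint.items():
        k = keep(*key)
        out[k] = out.get(k, 0) + p
    return out

def best_expected_utility(p_sx):
    """Best action per observed x, summed over x, for a joint distribution of (S, X)."""
    return sum(max(sum(p * u.get((s, a), 0) for (s, x_), p in p_sx.items() if x_ == x)
                   for a in (0, 1))
               for x in (0, 1))

p_s_x1 = marginal(lambda s, x1, x2: (s, x1))
p_s_x2 = marginal(lambda s, x1, x2: (s, x2))
p_f_x1 = marginal(lambda s, x1, x2: (f[s], x1))
p_f_x2 = marginal(lambda s, x1, x2: (f[s], x2))
p_s = marginal(lambda s, x1, x2: s)
p_f = marginal(lambda s, x1, x2: f[s])

# S and X1 independent given f(S): P(s, x1) P(f(s)) = P(s) P(f(s), x1) for all s, x1.
cond_indep = all(p_s_x1.get((s, x), 0) * p_f[f[s]] == p_s[s] * p_f_x1.get((f[s], x), 0)
                 for s in f for x in (0, 1))

eu1 = best_expected_utility(p_s_x1)
eu2 = best_expected_utility(p_s_x2)
print(cond_indep, p_f_x1 == p_f_x2, eu1, eu2)
```

For this assumed distribution, $X_1$ and $X_2$ have the same joint distribution with $f(S)$, yet the best expected utility based on $X_1$ exceeds the one based on $X_2$, so by Blackwell's theorem $(X_1 \leftarrow S) \not\preceq (X_2 \leftarrow S)$.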
To visualize how far the channel $X_1 \leftarrow S$ is from being a garbling of the channel $X_2 \leftarrow S$, we use the function $UI$ defined in the appendix.

APPENDIX. COMPUTING THE BLACKWELL ORDER
Given two channels $\kappa_1, \kappa_2$, how can one decide whether or not $\kappa_1 \preceq \kappa_2$? The easiest way is to check whether the equation $\kappa_1 = \lambda \circ \kappa_2$ has a solution $\lambda$ that is a stochastic matrix. In the finite alphabet case, this amounts to checking feasibility of a linear program, which is considered computationally easy. However, when the feasibility check returns a negative result, this approach does not give any more information, e.g., how far $\kappa_1$ is from being a garbling of $\kappa_2$. A measure for this can be constructed as follows: Let $P_{SX_1X_2}$ be a joint distribution of $S$ and the outputs $X_1$ and $X_2$ of $\kappa_1$ and $\kappa_2$ satisfying the constraints that $(X_1 \leftarrow S) = \kappa_1$ and $(X_2 \leftarrow S) = \kappa_2$ and that the support of $S$ equals the input alphabet $\mathcal{S}$ of $\kappa_1$ and $\kappa_2$. Let $\Delta$ be the set of all joint distributions of the random variables $S, X_1, X_2$ (with the same alphabets), and let $\Delta_P$ be the subset of those that are compatible with the marginal distributions of $P_{SX_1X_2}$ for the pairs $(S, X_1)$ and $(S, X_2)$, i.e., $\Delta_P := \{Q_{SX_1X_2} \in \Delta : Q_{SX_1} = P_{SX_1},\ Q_{SX_2} = P_{SX_2}\}$. In other words, $\Delta_P$ consists of all joint distributions that are compatible with $\kappa_1$ and $\kappa_2$ and that have the same distribution for $S$ as $P_{SX_1X_2}$. Consider the function $UI(S; X_1 \setminus X_2) := \min_{Q \in \Delta_P} I_Q(S; X_1 | X_2)$, where $I_Q$ denotes the conditional mutual information evaluated with respect to the joint distribution $Q$. This function has the following property: $UI(S; X_1 \setminus X_2) = 0$ if and only if $\kappa_1 \preceq \kappa_2$ [3]. Note that $UI$ depends not only on the two channels $\kappa_1, \kappa_2$, but also on the choice of the input distribution. Computing $UI$ is a convex optimization problem. However, the problem can be very ill-conditioned, which makes it difficult in practice.
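The feasibility question $\kappa_1 = \lambda \circ \kappa_2$ is a linear program in general. For the special case in which both channels have binary output, the unknown garbling has only two free parameters, and feasibility can be decided by solving two of the equations and checking the rest. The following sketch covers only that special case and is not a general LP solver; the function name `is_garbling_binary` and the channel values are hypothetical.

```python
from fractions import Fraction as F

def is_garbling_binary(kappa1, kappa2):
    """Decide whether kappa1 = lam @ kappa2 for some stochastic 2x2 matrix lam.
    Both channels must have binary output; channel[x][s] = P(X = x | S = s).
    Entries should be Fractions (or ints) so that the comparisons are exact."""
    q = [F(v) for v in kappa1[0]]    # P(X1 = 0 | s)
    p = [F(v) for v in kappa2[0]]    # P(X2 = 0 | s)
    n = len(p)
    # lam = [[a, b], [1 - a, 1 - b]] must satisfy a*p[s] + b*(1 - p[s]) = q[s] for all s.
    pivot = next(((s, t) for s in range(n) for t in range(n) if p[s] != p[t]), None)
    if pivot is None:
        # kappa2 ignores its input; kappa1 must do the same (then a = b = q[0] works).
        return all(v == q[0] for v in q)
    s, t = pivot
    det = p[s] - p[t]
    a = (q[s] * (1 - p[t]) - q[t] * (1 - p[s])) / det   # Cramer's rule
    b = (p[s] * q[t] - p[t] * q[s]) / det
    in_range = 0 <= a <= 1 and 0 <= b <= 1
    consistent = all(a * p[r] + b * (1 - p[r]) == q[r] for r in range(n))
    return in_range and consistent

# Hypothetical channels with three inputs and binary output.
kappa2 = [[F(1), F(1, 3), F(0)], [F(0), F(2, 3), F(1)]]
kappa1 = [[F(2, 3), F(2, 3), F(0)], [F(1, 3), F(1, 3), F(1)]]
# A genuine garbling of kappa2 by a binary symmetric channel with noise 1/5.
garbled = [[F(4, 5) * v + F(1, 5) * (1 - v) for v in kappa2[0]],
           [F(1, 5) * v + F(4, 5) * (1 - v) for v in kappa2[0]]]

print(is_garbling_binary(garbled, kappa2), is_garbling_binary(kappa1, kappa2))
```

For larger output alphabets the same question has more unknowns than this closed-form approach can handle, and a linear programming solver would be the natural tool.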