An Axiomatic Characterization of Mutual Information

We characterize mutual information as the unique map on ordered pairs of discrete random variables satisfying a set of axioms similar to those of Faddeev's characterization of the Shannon entropy. Our characterization, however, contains a new axiom with no analog for Shannon entropy, based on the notion of a Markov triangle, which may be thought of as a composition of communication channels over which conditional entropy acts functorially. Our proofs are coordinate-free in the sense that no logarithms appear in our calculations.


Introduction
Axiomatic characterizations of information measures go back to the seminal work of Shannon [1], providing conceptual insights into their meaning as well as justification for the analytic formulae involved in their definitions. Various characterizations for Shannon entropy, relative entropy, Rényi and Tsallis entropies, von Neumann and Segal entropies, quantum relative entropy, as well as other generalized information measures have appeared in the literature [2][3][4][5][6][7][8][9], and a review of such enterprises in the classical (i.e., non-quantum) setting appears in the survey of Csiszár [10]. More recently, functorial characterizations of information measures from a categorical viewpoint have appeared in the works of Baez, Fritz, and Leinster [11,12], as well as in our work with Parzygnat [13] and in the work of Parzygnat [14], where a functorial characterization of the von Neumann entropy is proven. An axiomatic approach to entropy in the theory of biodiversity is the subject of the recent book [15] by Leinster. In spite of the breadth of the aforementioned results, the mutual information of a pair of random variables seems to be missing from the story. While an operational characterization of mutual information in the context of algorithmic information theory appears in [16], to the best of our knowledge, an axiomatic characterization in the vein of those surveyed by Csiszár in [10] is absent from the literature. It is then the goal of the present work to introduce mutual information into the axiomatic framework.
Our main result is Theorem 1, where we prove that the mutual information I(X, Y) of an ordered pair of random variables is the unique function (up to an arbitrary multiplicative factor) on pairs of random variables satisfying the following axioms:

1. Continuity: If (X_n, Y_n) → (X, Y) is a weakly convergent sequence of pairs of random variables, then lim_{n→∞} I(X_n, Y_n) = I(X, Y).

2. Strong Additivity: Given a random variable X : Ω → X with probability mass function p : X → [0, 1], and a collection of pairs of random variables (Y_x, Z_x) indexed by X, then

I(∑_{x∈X} p(x)Y_x, ∑_{x∈X} p(x)Z_x) = I(X, X) + ∑_{x∈X} p(x)I(Y_x, Z_x).

3. Symmetry: I(X, Y) = I(Y, X).

4. Weak Functoriality: If (X, Y, Z) is a Markov triangle, then I(X, Z) = I(X, Y) + I(Y, Z) − I(Y, Y).

5. Invariance under Pullbacks: If π : Ω′ → Ω is a measure-preserving map, then I(X • π, Y • π) = I(X, Y).

6. Vacuity: If C is a constant random variable, then I(X, C) = 0.

The fact that mutual information satisfies Axioms 1, 3, and 6 is well known to anybody familiar with mutual information. As we work at the level of random variables as opposed to probability distributions alone (which we do for wider applicability of our results), Axiom 5 is a reflection of the fact that mutual information only depends on probabilities. For Axiom 2, we define a convex structure on pairs of random variables in such a way that the strong additivity of Shannon entropy is generalized to our context. Axiom 4 is defined in terms of the notion of a Markov triangle, a concept we define based on the notion of a "coalescable" composition of communication channels which was introduced in [13]. Intuitively, a Markov triangle may be thought of as a composition of noisy channels over which the associated conditional entropy is additive. Moreover, the axioms are sharp in the sense that if any one of them is removed, then mutual information is no longer uniquely characterized. In particular, the joint entropy H(X, Y) satisfies Axioms 1-5, while the conditional entropy H(Y|X) satisfies all the axioms except the symmetry Axiom 3 (note that since H(X|X) = 0, Axiom 2 in the case of conditional entropy becomes convex linearity).
In the spirit of the axiomatic approach, we note that logarithms are absent from all calculations in this paper.

Mutual Information
Let (Ω, Σ, µ) be a probability space, where Ω is thought of as the set of all possible outcomes of a data generating process or experiment, Σ is a σ-algebra of measurable subsets of Ω, and µ is a probability measure.

Definition 1.
A finite random variable is a surjective function X : Ω → X such that X is a finite set and X⁻¹(x) ∈ Σ for all x ∈ X. In such a case, the set X is often referred to as the support, or alphabet, associated with X. The probability mass function of X is the function p : X → [0, 1] given by

p(x) = µ(X⁻¹(x)),

and the Shannon entropy of X is the non-negative real number H(X) given by

H(X) = −∑_{x∈X} p(x) log p(x).

The collection of all finite random variables on Ω will be denoted FRV(Ω).
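To make Definition 1 concrete, the following sketch computes the probability mass function and Shannon entropy of a finite random variable, encoding Ω as dictionary keys. The helper names `pmf` and `shannon_entropy` are ours, and we use log base 2 for the numerics even though no particular base plays a role in what follows.

```python
import math
from collections import Counter

def pmf(X, mu):
    """Probability mass function p(x) = mu(X^{-1}(x)) of a finite
    random variable X, encoded as a dict omega -> x, with mu a dict
    omega -> probability."""
    p = Counter()
    for omega, x in X.items():
        p[x] += mu[omega]
    return dict(p)

def shannon_entropy(p):
    """H(X) = -sum_x p(x) log2 p(x), with the convention 0 log 0 = 0."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

mu = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}   # uniform measure on Omega
X = {"a": 0, "b": 0, "c": 1, "d": 1}                # X collapses outcomes in pairs
print(pmf(X, mu))                   # {0: 0.5, 1: 0.5}
print(shannon_entropy(pmf(X, mu)))  # 1.0
```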

Definition 2.
Let (X, Y) ∈ FRV(Ω) × FRV(Ω) be an ordered pair of random variables with supports X and Y respectively.

• The joint distribution function of (X, Y) is the function ϑ : X × Y → [0, 1] given by ϑ(x, y) = µ(X⁻¹(x) ∩ Y⁻¹(y)).
• The joint entropy of (X, Y) is the non-negative real number H(X, Y) given by H(X, Y) = −∑_{(x,y)∈X×Y} ϑ(x, y) log ϑ(x, y).
• The mutual information of (X, Y) is the real number I(X, Y) given by I(X, Y) = H(X) + H(Y) − H(X, Y).
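As an illustration, the following sketch computes I(X, Y) = H(X) + H(Y) − H(X, Y) directly from a joint distribution function ϑ encoded as a dictionary (the helper names are ours):

```python
import math

def H(vals):
    """Shannon entropy of a collection of probabilities (0 log 0 = 0)."""
    return -sum(v * math.log2(v) for v in vals if v > 0)

def mutual_information(theta):
    """I(X, Y) = H(X) + H(Y) - H(X, Y) from a joint pmf {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), v in theta.items():
        px[x] = px.get(x, 0.0) + v   # marginal of X
        py[y] = py.get(y, 0.0) + v   # marginal of Y
    return H(px.values()) + H(py.values()) - H(theta.values())

copy  = {(0, 0): 0.5, (1, 1): 0.5}                      # Y = X: one full bit shared
indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}  # independent fair bits
print(mutual_information(copy))   # 1.0
print(mutual_information(indep))  # 0.0
```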

Remark 1.
With every pair of random variables (X, Y) one may associate a probability transition matrix p(y|x) given by p(y|x) = ϑ(x, y)/p(x) (defined whenever p(x) > 0), where p : X → [0, 1] is the probability mass function of X. As such, one may view (X, Y) as a noisy channel X ⇝ Y together with the prior distribution p on its set of inputs.
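The passage from a pair (X, Y) to a channel with a prior can be sketched as follows, extracting p(y|x) = ϑ(x, y)/p(x) row by row (an encoding we choose for illustration, with a row only defined when p(x) > 0):

```python
def channel_matrix(theta):
    """p(y|x) = theta(x, y) / p(x): rows of the transition matrix,
    returned as {x: {y: p(y|x)}} together with the prior {x: p(x)}."""
    prior = {}
    for (x, y), v in theta.items():
        prior[x] = prior.get(x, 0.0) + v
    rows = {}
    for (x, y), v in theta.items():
        rows.setdefault(x, {})[y] = v / prior[x]
    return rows, prior

theta = {(0, 0): 0.4, (0, 1): 0.1, (1, 1): 0.5}
rows, prior = channel_matrix(theta)
print(rows)   # {0: {0: 0.8, 1: 0.2}, 1: {1: 1.0}}
print(prior)  # {0: 0.5, 1: 0.5}
```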
We now list some well-known properties of mutual information which will be useful for our purposes (see, e.g., [17] for proofs). In what follows, P(X, Y) ∈ FRV(Ω) denotes the pairing of X and Y, i.e., the random variable P(X, Y) : Ω → X × Y given by P(X, Y)(ω) = (X(ω), Y(ω)).

Proposition 1. Let (X, Y) be a pair of random variables, and let C be a constant random variable. Then the following statements hold.
1. I(X, Y) = I(Y, X) and I(X, C) = 0.
2. I(X, X) = H(X).

Proposition 2. Let (X, Y) be a pair of random variables. Then the following statements hold.
i. The probability mass function of P(X, Y) is the joint distribution function ϑ(x, y). In particular, H(X, Y) = H(P(X, Y)).
ii. I(X, P(X, Y)) = H(X).

Proof.
i. Let ν : X × Y → [0, 1] denote the probability mass function of P(X, Y). Then for all (x, y) ∈ X × Y we have

ν(x, y) = µ(P(X, Y)⁻¹(x, y)) = µ(X⁻¹(x) ∩ Y⁻¹(y)) = ϑ(x, y),

as desired. ii. The statement follows from item i together with the fact that H(X, P(X, Y)) = H(X, Y).

Convexity
We now generalize the notion of a convex combination of probability distributions to the setting of pairs of random variables, which will be used to extend the notion of strong additivity for Shannon entropy to mutual information.

Notation 1. We use the notation X ⊔ Y to denote the disjoint union of the sets X and Y.

Definition 4. Let X be a finite set, and let p : X → [0, 1] be a probability distribution on X. Then ∑_{x∈X} p(x)(Ω, Σ, µ) is the probability space associated with the triple (X × Ω, X × Σ, p × µ). Now suppose Y_x ∈ FRV(Ω) is a collection of random variables indexed by X, and let q_x : Y_x → [0, 1] denote the probability mass function of Y_x. The p-weighted convex sum ∑_{x∈X} p(x)Y_x ∈ FRV(X × Ω) is the random variable

∑_{x∈X} p(x)Y_x : X × Ω → ⊔_{x∈X} Y_x given by (x, ω) ↦ Y_x(ω).

It then follows that the probability mass function of ∑_{x∈X} p(x)Y_x is a function of the form r : ⊔_{x∈X} Y_x → [0, 1], and using the fact that ⊔_{x∈X} Y_x is canonically isomorphic to the set {(x, y) | x ∈ X and y ∈ Y_x}, it follows that r is given by r(x, y) = p(x)q_x(y).
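A sketch of the convex-sum construction at the level of probability mass functions, in an encoding of our own choosing where `q` maps each x to the pmf of Y_x:

```python
def convex_sum_pmf(p, q):
    """r(x, y) = p(x) q_x(y): pmf of the p-weighted convex sum of {Y_x}."""
    return {(x, y): p[x] * q[x][y] for x in p for y in q[x]}

p = {0: 0.5, 1: 0.5}
q = {0: {"a": 0.25, "b": 0.75},   # pmf of Y_0
     1: {"c": 1.0}}               # pmf of Y_1
r = convex_sum_pmf(p, q)
print(r)               # {(0, 'a'): 0.125, (0, 'b'): 0.375, (1, 'c'): 0.5}
print(sum(r.values())) # 1.0  (r is again a probability distribution)
```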
A reformulation of the strong additivity property for Shannon entropy in terms of the convex structure just introduced is given by the following proposition.

Proposition 3. Let X be a finite set, let p : X → [0, 1] be a probability distribution on X, and suppose Y_x ∈ FRV(Ω) is a collection of random variables indexed by X. Then

H(∑_{x∈X} p(x)Y_x) = H(p) + ∑_{x∈X} p(x)H(Y_x),

where H(p) is the Shannon entropy of the probability distribution p.
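Proposition 3 can be checked numerically; the sketch below verifies H(r) = H(p) + ∑_x p(x)H(q_x) for an arbitrary choice of weights and distributions (helper names ours):

```python
import math

def H(vals):
    """Shannon entropy of a collection of probabilities (0 log 0 = 0)."""
    return -sum(v * math.log2(v) for v in vals if v > 0)

p = {0: 0.3, 1: 0.7}
q = {0: {"a": 0.25, "b": 0.75},
     1: {"a": 0.5, "b": 0.25, "c": 0.25}}
r = {(x, y): p[x] * q[x][y] for x in p for y in q[x]}  # pmf of the convex sum

lhs = H(r.values())
rhs = H(p.values()) + sum(p[x] * H(q[x].values()) for x in p)
print(abs(lhs - rhs) < 1e-12)  # True
```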
In light of Proposition 4, we make the following definition.
Definition 5. Let X be a finite set, let p : X → [0, 1] be a probability distribution on X, and suppose (Y_x, Z_x) ∈ FRV(Ω) × FRV(Ω) is a collection of pairs of random variables indexed by X. The p-weighted convex sum of the collection (Y_x, Z_x) is the pair of random variables (∑_{x∈X} p(x)Y_x, ∑_{x∈X} p(x)Z_x) ∈ FRV(X × Ω) × FRV(X × Ω).

Proposition 5 (Strong Additivity of Mutual Information). Let X be a finite set, let p : X → [0, 1] be a probability distribution on X, and suppose (Y_x, Z_x) ∈ FRV(Ω) × FRV(Ω) is a collection of pairs of random variables indexed by X. Then

I(∑_{x∈X} p(x)Y_x, ∑_{x∈X} p(x)Z_x) = H(p) + ∑_{x∈X} p(x)I(Y_x, Z_x),

where H(p) is the Shannon entropy of the probability distribution p.
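The identity of Proposition 5 can likewise be verified numerically. In the sketch below, the joint pmf of the pair of convex sums assigns mass p(x)·ϑ_x(y, z) to the point ((x, y), (x, z)), where ϑ_x is the joint pmf of (Y_x, Z_x); the encoding and helper names are ours.

```python
import math

def H(vals):
    return -sum(v * math.log2(v) for v in vals if v > 0)

def I(joint):
    """Mutual information from a joint pmf {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), v in joint.items():
        pa[a] = pa.get(a, 0.0) + v
        pb[b] = pb.get(b, 0.0) + v
    return H(pa.values()) + H(pb.values()) - H(joint.values())

p = {0: 0.25, 1: 0.75}
pairs = {0: {("a", "u"): 0.5, ("b", "v"): 0.5},   # joint pmf of (Y_0, Z_0)
         1: {("a", "u"): 0.9, ("a", "v"): 0.1}}   # joint pmf of (Y_1, Z_1)

# Joint pmf of the pair of convex sums:
big = {((x, y), (x, z)): p[x] * v for x in p for (y, z), v in pairs[x].items()}

lhs = I(big)
rhs = H(p.values()) + sum(p[x] * I(pairs[x]) for x in p)
print(abs(lhs - rhs) < 1e-9)  # True
```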

Continuity
Definition 6. Let X n ∈ FRV(Ω) be a sequence of random variables, and let p n : X n → [0, 1] be the associated sequence of probability mass functions. Then X n is said to weakly converge (or converge in distribution) to the random variable X ∈ FRV(Ω) with probability mass function p : X → [0, 1] if the following conditions hold.
i. There exists an N ∈ N for which X_n = X for all n ≥ N (where X_n and X denote the supports of X_n and X respectively). ii.
For all x ∈ X we have lim n→∞ p n (x) = p(x), i.e., p n → p pointwise.
In such a case, we write X_n → X. If (X_n, Y_n) ∈ FRV(Ω) × FRV(Ω) is a sequence of pairs of random variables, then (X_n, Y_n) is said to weakly converge to (X, Y) ∈ FRV(Ω) × FRV(Ω) if P(X_n, Y_n) → P(X, Y).

Proposition 6. Shannon entropy is continuous, i.e., if X_n → X, then lim_{n→∞} H(X_n) = H(X).

Proof. This result is standard; see, e.g., [3] or [11].
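For instance, the continuity asserted in Proposition 6 can be observed numerically for the sequence p_n = (1/2 + 1/n, 1/2 − 1/n), which converges pointwise to the fair-coin distribution (1/2, 1/2):

```python
import math

def H(vals):
    """Shannon entropy (base 2) of a collection of probabilities."""
    return -sum(v * math.log2(v) for v in vals if v > 0)

for n in (10, 100, 1000):
    pn = (0.5 + 1 / n, 0.5 - 1 / n)
    print(n, H(pn))
# H(p_n) increases towards H(1/2, 1/2) = 1 as n grows
```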
Proposition 7 (Continuity of Mutual Information). Mutual information is continuous, i.e., if (X_n, Y_n) → (X, Y), then lim_{n→∞} I(X_n, Y_n) = I(X, Y).

Proof. Since (X_n, Y_n) → (X, Y) we have P(X_n, Y_n) → P(X, Y), from which it follows that X_n → X and Y_n → Y as well. The statement then follows by applying Proposition 6 to each term on the right-hand side of I(X_n, Y_n) = H(X_n) + H(Y_n) − H(P(X_n, Y_n)), as desired.

Markov Triangles
In this section, we define the notion of a Markov triangle, a concept based on the notion of a "coalescable" composition of communication channels which was introduced in [13]. Such a notion will be crucial for our characterization of mutual information.

Definition 7. Let X ∈ FRV(Ω) be a random variable with probability mass function p : X → [0, 1], and let x ∈ X. Then for any random variable Y ∈ FRV(Ω), the conditional distribution function of Y given X = x is the function q_x : Y → [0, 1] given by

q_x(y) = ϑ(x, y)/p(x)

(defined whenever p(x) > 0). From here on, the value q_x(y) will be denoted q(y|x). The conditional entropy of Y given X is the non-negative real number H(Y|X) given by

H(Y|X) = ∑_{x∈X} p(x)H(q_x),

where H(q_x) is the Shannon entropy of the distribution q_x on Y.
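The formula H(Y|X) = ∑_x p(x)H(q_x) can be sketched as follows, computed from a joint pmf; the second print checks it against the standard chain-rule identity H(Y|X) = H(X, Y) − H(X) (helper names ours):

```python
import math

def H(vals):
    return -sum(v * math.log2(v) for v in vals if v > 0)

def conditional_entropy(theta):
    """H(Y|X) = sum_x p(x) H(q_x), where q_x(y) = theta(x, y)/p(x)."""
    px, rows = {}, {}
    for (x, y), v in theta.items():
        px[x] = px.get(x, 0.0) + v
        rows.setdefault(x, []).append(v)
    return sum(px[x] * H(v / px[x] for v in rows[x]) for x in px)

theta = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.5}
print(conditional_entropy(theta))         # 0.5
print(H(theta.values()) - H([0.5, 0.5]))  # 0.5  (= H(X,Y) - H(X))
```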
We will also make use of the identity H(X, Y) = H(X) + H(Y|X), the proof of which may be found in any information theory text (e.g., [17]).

Definition 8.
Let (X, Y, Z) be a triple of random variables with supports X, Y and Z respectively, and let q(y|x), p(z|y) and r(z|x) denote the associated conditional distribution functions. Then (X, Y, Z) is said to be a Markov triangle if and only if

r(z|x) = ∑_{y∈Y} p(z|y)q(y|x)

for all x ∈ X and z ∈ Z, and there exists a function h : Z × X → Y such that p(z|y)q(y|x) = 0 unless y = h(z, x). In such a case, h is said to be a mediator function for the triple (X, Y, Z).
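A sketch of Definition 8 for a concrete triple: the checker below verifies both the composite-channel equation and the mediator condition, with conditionals represented as dicts of rows (an encoding of our own choosing). As an example, we test a deterministic composite, which the remark below observes is always a Markov triangle.

```python
def is_markov_triangle(q, p, r, h):
    """Check Definition 8: r(z|x) = sum_y p(z|y) q(y|x), with
    p(z|y) q(y|x) = 0 unless y = h(z, x).
    q, p, r are channels encoded as {input: {output: prob}}."""
    for x, row in r.items():
        for z, rzx in row.items():
            total = sum(p[y].get(z, 0.0) * qyx for y, qyx in q[x].items())
            if abs(total - rzx) > 1e-12:
                return False  # composite-channel equation fails
            for y, qyx in q[x].items():
                if y != h(z, x) and p[y].get(z, 0.0) * qyx != 0.0:
                    return False  # mediator condition fails
    return True

# Deterministic channels y = x % 2 and z = y, with mediator h(z, x) = x % 2:
q = {x: {x % 2: 1.0} for x in range(4)}
p = {0: {0: 1.0}, 1: {1: 1.0}}
r = {x: {x % 2: 1.0} for x in range(4)}
print(is_markov_triangle(q, p, r, lambda z, x: x % 2))  # True
```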

Remark 2.
A Markov triangle (X, Y, Z) with supports X, Y, and Z may be thought of as a composition of noisy channels f : X ⇝ Y and g : Y ⇝ Z such that if z ∈ Z is the output of the composite channel g • f, and one is given the information that the associated input was x ∈ X, then the output at the intermediary stage Y was necessarily y = h(z, x) (where h is the associated mediator function). Moreover, if P is a symbol for general probabilities, then P(x, z|y) = P(x|y)P(z|y), thus the Markov triangle condition says that X and Z are conditionally independent given Y. As compositions of deterministic channels always satisfy this property, Markov triangles are a generalization of compositions of deterministic channels. While Markov triangles play a crucial role in our characterization of mutual information, as well as in the characterizations of conditional entropy and information loss in [13], their broader significance in the study of information measures has yet to be determined.

Proposition 9 (Weak Functoriality of Mutual Information). If (X, Y, Z) is a Markov triangle, then I(X, Z) = I(X, Y) + I(Y, Z) − I(Y, Y). In particular, I(X, Z) ≤ I(X, Y) + I(Y, Z).
Before giving a proof of Proposition 9, we first need the following lemma.
Lemma 1. If (X, Y, Z) is a Markov triangle, then H(Z|X) = H(Y|X) + H(Z|Y).

Proof. The statement is simply a reformulation of Theorem 2 in [13].

Proof of Proposition 9. Since I(X, Z) = H(Z) − H(Z|X), Lemma 1 yields

I(X, Z) = H(Z) − H(Y|X) − H(Z|Y) = (H(Y) − H(Y|X)) + (H(Z) − H(Z|Y)) − H(Y) = I(X, Y) + I(Y, Z) − I(Y, Y),

where the last equality follows from the fact that I(Y, Y) = H(Y).

Proposition 10. Let (X, Y) be a pair of random variables with supports X and Y respectively. Then the following statements hold.
i. The triple (X, P(X, Y), Y) is a Markov triangle. ii.
If f : X → X′ is a bijection, then the triple (X, f • X, Y) is a Markov triangle. iii. If g : Y → Y′ is a bijection, then the triple (X, Y, g • Y) is a Markov triangle.
Proof. i. Let r(y|x) be the conditional distribution associated with (X, Y), let p(y|(x̃, ỹ)) be the conditional distribution associated with (P(X, Y), Y), and let q((x̃, ỹ)|x) be the conditional distribution function associated with (X, P(X, Y)). Then for all y ∈ Y and x ∈ X we have

r(y|x) = ∑_{(x̃,ỹ)∈X×Y} p(y|(x̃, ỹ))q((x̃, ỹ)|x) = p(y|(x, y))q((x, y)|x),

where the second equality comes from the fact that p(y|(x̃, ỹ)) = 0 unless ỹ = y and q((x̃, ỹ)|x) = 0 unless x̃ = x. It then follows that the function h : Y × X → X × Y given by h(y, x) = (x, y) is a mediator function for (X, P(X, Y), Y), thus (X, P(X, Y), Y) is a Markov triangle.

ii. Let r(y|x) be the conditional distribution associated with (X, Y), let p(y|x′) be the conditional distribution associated with (f • X, Y), and let q(x′|x) be the conditional distribution function associated with (X, f • X). Then for all y ∈ Y and x ∈ X we have

r(y|x) = ∑_{x′∈X′} p(y|x′)q(x′|x) = p(y|f(x))q(f(x)|x),

where the second equality comes from the fact that q(x′|x) = 0 unless x′ = f(x). It then follows that the function h : Y × X → X′ given by h(y, x) = f(x) is a mediator function for (X, f • X, Y), thus (X, f • X, Y) is a Markov triangle.

iii. Let r(y′|x) be the conditional distribution associated with (X, g • Y), let p(y′|y) be the conditional distribution associated with (Y, g • Y), and let q(y|x) be the conditional distribution associated with (X, Y). Then for all y′ ∈ Y′ and x ∈ X we have

r(y′|x) = ∑_{y∈Y} p(y′|y)q(y|x) = p(y′|g⁻¹(y′))q(g⁻¹(y′)|x),

where the second equality comes from the fact that p(y′|y) = 0 unless y = g⁻¹(y′). It then follows that the function h : Y′ × X → Y given by h(y′, x) = g⁻¹(y′) is a mediator function for (X, Y, g • Y), thus (X, Y, g • Y) is a Markov triangle.
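Item i can be checked numerically against the identity I(X, Y) = I(X, P(X, Y)) + I(P(X, Y), Y) − I(P(X, Y), P(X, Y)), which is the form of weak functoriality used in the characterization theorem of the next section. Below, the pairings with P(X, Y) are built directly from a joint pmf ϑ (encoding and helper names ours):

```python
import math

def H(vals):
    return -sum(v * math.log2(v) for v in vals if v > 0)

def I(joint):
    """Mutual information from a joint pmf {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), v in joint.items():
        pa[a] = pa.get(a, 0.0) + v
        pb[b] = pb.get(b, 0.0) + v
    return H(pa.values()) + H(pb.values()) - H(joint.values())

theta = {(0, 0): 0.125, (0, 1): 0.375, (1, 0): 0.5}   # joint pmf of (X, Y)
XP = {(x, (x, y)): v for (x, y), v in theta.items()}  # pair (X, P(X, Y))
PY = {((x, y), y): v for (x, y), v in theta.items()}  # pair (P(X, Y), Y)
PP = {((x, y), (x, y)): v for (x, y), v in theta.items()}

lhs = I(theta)
rhs = I(XP) + I(PY) - I(PP)
print(abs(lhs - rhs) < 1e-12)  # True
```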

Characterization Theorem
We now state and prove our characterization theorem for mutual information.

Definition 10.
Let F be a map that sends pairs of random variables to the real numbers.
• F is said to be continuous if (X_n, Y_n) → (X, Y) implies lim_{n→∞} F(X_n, Y_n) = F(X, Y).
• F is said to be strongly additive if given a random variable X with probability mass function p : X → [0, 1], and a collection of pairs of random variables (Y_x, Z_x) indexed by X, then

F(∑_{x∈X} p(x)Y_x, ∑_{x∈X} p(x)Z_x) = F(X, X) + ∑_{x∈X} p(x)F(Y_x, Z_x).

• F is said to be symmetric if F(X, Y) = F(Y, X) for every pair of random variables (X, Y).
• F is said to be invariant under pullbacks if for every pair of random variables (X, Y) ∈ FRV(Ω) × FRV(Ω) and every measure-preserving map π : Ω′ → Ω we have

F(X • π, Y • π) = F(X, Y). (8)

• F is said to be weakly functorial if for every Markov triangle (X, Y, Z) we have

F(X, Z) = F(X, Y) + F(Y, Z) − F(Y, Y). (9)

Remark 3. The terminology "weakly functorial" comes from viewing (9) from a category-theoretic perspective. In particular, with a pair of random variables (X, Y) one may associate a noisy channel X ⇝ Y (as in Remark 1), and with a Markov triangle (X, Y, Z) a composition of noisy channels X ⇝ Y ⇝ Z with Z = Supp(Z). If FinPS denotes the category of noisy channels and BR denotes the category with one object whose morphisms are the real numbers (with composition corresponding to addition), then a map F : FinPS → BR is a functor if

F(g • f) = F(f) + F(g). (10)

Rewriting (10) in terms of the pairs of random variables with which the morphisms f, g, and g • f are associated, the functoriality condition (10) reads F(X, Z) = F(X, Y) + F(Y, Z), thus the condition F(X, Z) = F(X, Y) + F(Y, Z) − F(Y, Y) is a weaker form of functoriality. For more on information measures from a category-theoretic perspective, see [11][12][13][14].

Theorem 1 (Axiomatic Characterization of Mutual Information).
Let F be a map that sends pairs of random variables to the non-negative real numbers, and suppose F satisfies the following conditions.
1. F is continuous.
2. F is strongly additive.
3. F is symmetric.
4. F is weakly functorial.
5. F is invariant under pullbacks.
6. F(X, C) = 0 for every constant random variable C.
Then F is a non-negative multiple of mutual information. Conversely, mutual information satisfies conditions 1-6.
Before giving a proof, we first need several lemmas. The first lemma states that a map F on pairs of random variables which is continuous and invariant under pullbacks only depends on the underlying probability mass functions of the random variables.

Lemma 2. Let F be a map on pairs of random variables which is continuous and invariant under pullbacks, and let (X, Y) ∈ FRV(Ω) × FRV(Ω) and (X′, Y′) ∈ FRV(Ω′) × FRV(Ω′) be pairs of random variables whose joint distribution functions ϑ and ϑ′ satisfy ϑ = ϑ′. Then F(X, Y) = F(X′, Y′).

Proof. Let π : Ω × Ω′ → Ω and π′ : Ω × Ω′ → Ω′ be the natural projections (where Ω × Ω′ is equipped with the product measure). Since both of the natural projections are measure-preserving, we have F(X, Y) = F(X • π, Y • π) and F(X′, Y′) = F(X′ • π′, Y′ • π′), and moreover, from the assumption that ϑ = ϑ′ it follows that the joint distribution functions associated with (X • π, Y • π) and (X′ • π′, Y′ • π′) are equal. It then follows that if (X_n, Y_n) is the constant sequence given by X_n = X • π and Y_n = Y • π for all n ∈ N, then (X_n, Y_n) → (X′ • π′, Y′ • π′) (since P(X_n, Y_n) → P(X′ • π′, Y′ • π′)). We then have F(X, Y) = F(X • π, Y • π) = lim_{n→∞} F(X_n, Y_n) = F(X′ • π′, Y′ • π′) = F(X′, Y′), where the third equality follows from the continuity of F.

Lemma 3. Let X be a random variable with probability mass function p : X → [0, 1], let f : X → Y be a bijection, and suppose C is a constant random variable. Then the following statements hold. i.
The triples (X, f • X, C) and (f • X, X, C) are both Markov triangles. ii.
Let F be a map that sends pairs of random variables to real numbers, and suppose F is symmetric and weakly functorial. Then

F(X, X) − F(f • X, f • X) = 2(F(X, C) − F(f • X, C)). (11)

Proof.
i. The statement follows from item ii of Proposition 10. ii. By item i, the triples (X, f • X, C) and (f • X, X, C) are both Markov triangles, thus the weak functoriality of F yields

F(X, C) = F(X, f • X) + F(f • X, C) − F(f • X, f • X) (12)

and

F(f • X, C) = F(f • X, X) + F(X, C) − F(X, X). (13)

And since F is symmetric, F(X, f • X) = F(f • X, X), thus Equations (12) and (13) imply Equation (11), as desired.
The next lemma is Baez, Fritz, and Leinster's reformulation of Faddeev's characterization of Shannon entropy [3], which they use in their characterization of the information loss associated with a deterministic mapping [11]. This lemma will allow us to relate F(X, X) to the Shannon entropy H(X).

Lemma 4.
Let S be a map that sends finite probability distributions to the non-negative real numbers, and suppose S satisfies the following conditions. i.
S is continuous, i.e., if p_n : X → [0, 1] is a convergent sequence of probability distributions on a finite set X (i.e., if lim_{n→∞} p_n(x) exists for all x ∈ X), then S(lim_{n→∞} p_n) = lim_{n→∞} S(p_n).
ii. S(1) = 0 for the unique distribution 1 : {∗} → [0, 1] on a one-point set.
iii. If q : Y → [0, 1] is a probability distribution on a finite set Y and f : X → Y is a bijection, then S(q) = S(q • f).
iv. If p : X → [0, 1] is a probability distribution on a finite set X, and q_x : Y_x → [0, 1] is a collection of finite probability distributions indexed by X, then

S(∑_{x∈X} p(x)q_x) = S(p) + ∑_{x∈X} p(x)S(q_x),

where ∑_{x∈X} p(x)q_x : ⊔_{x∈X} Y_x → [0, 1] denotes the distribution (x, y) ↦ p(x)q_x(y).
Then S is a non-negative multiple of Shannon entropy.

Lemma 5.
Let F be a map that sends pairs of random variables to the non-negative real numbers satisfying conditions 1-6 of Theorem 1, and let E be the map on random variables given by E(X) = F(X, X). Then E is a non-negative multiple of Shannon entropy.
Proof. Let φ be the map that takes a random variable to its probability mass function, let σ be a section of φ (so that φ • σ is the identity), and let S = E • σ. Since F is invariant under pullbacks (condition 5 of Theorem 1), Lemma 2 holds, thus the map S is independent of the choice of a section σ of φ, and as such, it follows that E = S • φ. We now show that S satisfies items i-iv of Lemma 4, which then implies that E(X) = S(p) is a non-negative multiple of the Shannon entropy H(X).

i. Let p_n : X → [0, 1] be a convergent sequence of probability distributions on a finite set X, and for every n ∈ N let X_n = σ(p_n), so that X_n → σ(lim_{n→∞} p_n). We then have

S(lim_{n→∞} p_n) = E(σ(lim_{n→∞} p_n)) = F(σ(lim_{n→∞} p_n), σ(lim_{n→∞} p_n)) = lim_{n→∞} F(X_n, X_n) = lim_{n→∞} E(X_n) = lim_{n→∞} S(p_n),

where the third equality follows from the continuity assumption on F (condition 1 of Theorem 1).

ii. Let 1 : {∗} → [0, 1] be a point mass distribution, so that σ(1) = C with C a constant random variable. Then S(1) = E(σ(1)) = E(C) = F(C, C) = 0, where the last equality follows from condition 6 of Theorem 1, i.e., that F(X, C) = 0 for every constant random variable C.

iii. Let X be a random variable with probability mass function p : X → [0, 1], and suppose f : X → Y is a bijection. Since F is symmetric and weakly functorial (conditions 3 and 4 of Theorem 1), the hypotheses of item ii of Lemma 3 are satisfied, so that Equation (11) holds, i.e., for any constant random variable C we have

F(X, X) − F(f • X, f • X) = 2(F(X, C) − F(f • X, C)).

And since F(X, C) = F(f • X, C) = 0 by condition 6 of Theorem 1, it follows that F(X, X) = F(f • X, f • X). Now let q : Y → [0, 1] be the probability mass function of f • X, so that q = p • f⁻¹. We then have

S(q) = E(f • X) = F(f • X, f • X) = F(X, X) = E(X) = S(p),

and since q • f = p, it follows that S satisfies item iii of Faddeev's Theorem.
iv. Let X be a random variable with probability mass function p : X → [0, 1], Y_x a collection of random variables indexed by X, and let q_x : Y_x → [0, 1] be the associated probability mass functions for all x ∈ X. Then ∑_{x∈X} p(x)Y_x has probability mass function ∑_{x∈X} p(x)q_x, thus

S(∑_{x∈X} p(x)q_x) = F(∑_{x∈X} p(x)Y_x, ∑_{x∈X} p(x)Y_x) = F(X, X) + ∑_{x∈X} p(x)F(Y_x, Y_x) = S(p) + ∑_{x∈X} p(x)S(q_x),

where the second equality follows from the strong additivity of F, i.e., condition 2 of Theorem 1. It then follows that S satisfies item iv of Faddeev's Theorem, as desired.
The next lemma is the analog of property iii of Lemma 4 for information measures on pairs of random variables.

Lemma 6. Let X, Y ∈ FRV(Ω) be random variables with probability mass functions p : X → [0, 1] and q : Y → [0, 1] respectively, and suppose F is a map on pairs of random variables to the real numbers which is symmetric, weakly functorial, and satisfies F(X, C) = 0 for every constant random variable C. If f : X → X′ and g : Y → Y′ are bijections, then

F(X, Y) = F(f • X, g • Y). (14)

Proof. Since f is a bijection, (X, f • X, X) is a Markov triangle by item ii of Proposition 10, thus

F(X, X) = F(X, f • X) + F(f • X, X) − F(f • X, f • X). (15)

From the proof of Lemma 5 it follows that if F is weakly functorial, symmetric, and F(X, C) = 0 for every constant random variable C, then F(f • X, f • X) = F(X, X). Moreover, by the symmetry of F we have F(X, f • X) = F(f • X, X), thus Equation (15) implies F(X, X) = F(X, f • X). Now consider the triples (f • X, X, g • Y) and (X, Y, g • Y), which are both Markov triangles by items ii and iii of Proposition 10. The weak functoriality assumption on F then yields

F(f • X, g • Y) = F(f • X, X) + F(X, g • Y) − F(X, X)

and

F(X, g • Y) = F(X, Y) + F(Y, g • Y) − F(Y, Y),

and since F(f • X, X) = F(X, X) and F(Y, g • Y) = F(Y, Y), it follows that F(X, Y) = F(f • X, g • Y), as desired.
The next lemma, together with the fact that (X, P(X, Y), Y) is a Markov triangle (by Proposition 10) is the crux of the proof, as we will soon see.

Lemma 7.
Let F be a map from pairs of random variables to the real numbers satisfying conditions 1-6 of Theorem 1, and let (X, Y) be a pair of random variables. Then F(X, P(X, Y)) = F(X, X) and F(P(X, Y), Y) = F(Y, Y).
Proof. Let p : X → [0, 1] and q : Y → [0, 1] be the probability mass functions of X and Y respectively, and for all x ∈ X, let Y_x be a random variable with probability mass function q_x : Y → [0, 1] given by q_x(y) = q(y|x), so that q_x is the conditional distribution of Y given X = x. By pulling back to larger sample spaces if necessary, we may assume without loss of generality that each Y_x ∈ FRV(Ω) for some fixed Ω. We also let C_x ∈ FRV(Ω) be the constant random variable supported on {x} for all x ∈ X, we let f : ⊔_{x∈X}{x} → X and g : ⊔_{x∈X} Y → X × Y be the canonical bijections, and we let π : X × Ω → Ω be the natural projection. It then follows that f • ∑_{x∈X} p(x)C_x and X • π both have probability mass function p : X → [0, 1], and also that g • ∑_{x∈X} p(x)Y_x and P(X, Y) • π both have probability mass function equal to the joint distribution function ϑ : X × Y → [0, 1] associated with (X, Y), thus Lemma 2 yields

F(X • π, P(X, Y) • π) = F(f • ∑_{x∈X} p(x)C_x, g • ∑_{x∈X} p(x)Y_x). (16)

We then have

F(X, P(X, Y)) = F(X • π, P(X, Y) • π) (by (8))
= F(f • ∑_{x∈X} p(x)C_x, g • ∑_{x∈X} p(x)Y_x) (by (16))
= F(∑_{x∈X} p(x)C_x, ∑_{x∈X} p(x)Y_x) (by (14))
= F(X, X) + ∑_{x∈X} p(x)F(C_x, Y_x) (by condition 2)
= F(X, X),

where the last equality follows from the fact that F(C, X) = 0 for every constant random variable C (since F is symmetric and F(X, C) = 0 for every constant random variable C).
As for F(P(X, Y), Y), first note that F(Y, P(Y, X)) = F(Y, Y) by what we have just proved. We then have F(P(X, Y), Y) = F(Y, P(X, Y)) = F(Y, P(Y, X)) = F(Y, Y), where the first equality follows from the symmetry of F, and the second equality follows from Lemma 6 (taking f to be the identity and g to be the canonical bijection X × Y → Y × X).
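For F = I, the content of Lemma 7 is that I(X, P(X, Y)) = H(X) and I(P(X, Y), Y) = H(Y), which can be confirmed numerically (with the dictionary encoding and helper names used in our earlier sketches):

```python
import math

def H(vals):
    return -sum(v * math.log2(v) for v in vals if v > 0)

def I(joint):
    """Mutual information from a joint pmf {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), v in joint.items():
        pa[a] = pa.get(a, 0.0) + v
        pb[b] = pb.get(b, 0.0) + v
    return H(pa.values()) + H(pb.values()) - H(joint.values())

theta = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.2}
XP = {(x, (x, y)): v for (x, y), v in theta.items()}  # pair (X, P(X, Y))
PY = {((x, y), y): v for (x, y), v in theta.items()}  # pair (P(X, Y), Y)

px = {0: 0.5, 1: 0.5}   # marginal of X
py = {0: 0.4, 1: 0.6}   # marginal of Y
print(abs(I(XP) - H(px.values())) < 1e-12)  # True  (I(X, P(X, Y)) = H(X))
print(abs(I(PY) - H(py.values())) < 1e-12)  # True  (I(P(X, Y), Y) = H(Y))
```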
Proof of Theorem 1. Suppose F is a map from pairs of random variables to the non-negative real numbers satisfying conditions 1-6 of Theorem 1. According to Lemma 5, there exists a constant c ≥ 0 such that F(X, X) = cH(X) for all random variables X. Now let (X, Y) be an arbitrary pair of random variables. According to Proposition 10, the triple (X, P(X, Y), Y) is a Markov triangle, thus by (9) we have

F(X, Y) = F(X, P(X, Y)) + F(P(X, Y), Y) − F(P(X, Y), P(X, Y))
= F(X, X) + F(Y, Y) − cH(P(X, Y))
= cH(X) + cH(Y) − cH(X, Y)
= cI(X, Y),

where the second equality follows from Lemma 7 and Lemma 5, and the third equality follows from Lemma 5 and item i of Proposition 2, thus F is a non-negative multiple of mutual information.
Conversely, mutual information satisfies condition 1 by Proposition 7, condition 2 by Proposition 5, condition 3 by item 1 of Proposition 1, condition 4 by Proposition 9, condition 5 by the fact that mutual information only depends on probabilities, and condition 6 by item 1 of Proposition 1.