Probability Mass Exclusions and the Directed Components of Mutual Information

Information is often described as a reduction of uncertainty associated with a restriction of possible choices. Although this interpretation appears in Hartley's foundational work on information theory, there is a surprising lack of a formal treatment of it in terms of exclusions. This paper addresses the gap by providing an explicit characterisation of information in terms of probability mass exclusions. It then demonstrates that different exclusions can yield the same amount of information and discusses the insight this provides about how information is shared amongst random variables; lack of progress in this area is a key barrier preventing us from understanding how information is distributed in complex systems. The paper closes by deriving a decomposition of the mutual information which can distinguish between differing exclusions; this provides surprising insight into the nature of directed information.


I. INTRODUCTION
Consider three random variables X, Y, Z with finite discrete state spaces X, Y, Z, and let x, y, z represent events that have occurred simultaneously in each space. Although underappreciated in the current reference texts on information theory [1], [2], both the entropy and the mutual information can be derived from first principles as fundamentally pointwise quantities that measure the information content of individual events rather than entire variables. The pointwise entropy h(x) = −log_b p(x), also known as the Shannon information content, quantifies the information content of a single event x, while the pointwise mutual information quantifies the information provided by x about y, or vice versa.¹ To our knowledge, the first explicit reference to pointwise information is due to Woodward and Davies [3], [5], who noted that the average form of Shannon's entropy "tempts one to enquire into other simpler methods of derivation [of the pointwise entropy]" [3, p. 51]. Indeed, using two axioms regarding the addition of information, they derived the pointwise mutual information [5]. Fano further formalised this idea by deriving the quantities from four postulates that "should be satisfied by a useful measure of information" [4, p. 31].

C. Finn (email: conor.finn@sydney.edu.au) and J. T. Lizier (email: joseph.lizier@sydney.edu.au) are with the Complex Systems Research Group and Centre for Complex Systems, Faculty of Engineering & IT, The University of Sydney, NSW 2006, Australia. C. Finn is also with CSIRO Data61, Marsfield NSW 2122, Australia.
Manuscript received October 17, 2018.
¹ The prefix pointwise has only recently become typical; both [3] and [4] referred to the pointwise mutual information as the mutual information and then explicitly prefixed the average mutual information.
Similar to the average entropy, the pointwise entropy is non-negative. However, unlike the average mutual information, the pointwise mutual information is a signed measure. A positive value corresponds to the event y raising the posterior p(x|y) relative to the prior p(x), so that when x occurs one would say that y was informative about x. In contrast, a negative value corresponds to the event y lowering the posterior p(x|y) relative to the prior p(x), hence when x occurs one would say that y was misinformative about x. Nonetheless, this misinformation is a purely pointwise phenomenon since (as observed by both Woodward and Fano) the average information provided by the event y about the variable X is non-negative, I(X; y) = ⟨i(x; y)⟩_{x∈X} ≥ 0. It follows trivially that the (doubly averaged) mutual information is non-negative, I(X; Y) = ⟨i(x; y)⟩_{x∈X, y∈Y} ≥ 0.
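As a quick numerical illustration of this sign behaviour, consider the following sketch; the joint distribution is an arbitrary choice made for the example, not one taken from the paper.

```python
import math

# Hypothetical joint distribution p(x, y) over X = {0, 1}, Y = {0, 1};
# the masses are arbitrary and chosen only to exhibit a negative pointwise value.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

p_x = {x: sum(v for (a, b), v in p_xy.items() if a == x) for x in (0, 1)}
p_y = {y: sum(v for (a, b), v in p_xy.items() if b == y) for y in (0, 1)}

def i(x, y, b=2):
    """Pointwise mutual information i(x; y) = log_b [p(x|y) / p(x)]."""
    return math.log(p_xy[(x, y)] / (p_y[y] * p_x[x]), b)

# Pointwise values can be negative (misinformative)...
assert i(0, 1) < 0
# ...but the average over x for the fixed event y = 1 is non-negative,
I_X_y1 = sum(p_xy[(x, 1)] / p_y[1] * i(x, 1) for x in (0, 1))
# ...as is the doubly averaged mutual information I(X; Y).
I_XY = sum(p_xy[(x, y)] * i(x, y) for x in (0, 1) for y in (0, 1))
assert I_X_y1 >= 0 and I_XY >= 0
```

The negative pointwise term i(0; 1) is outweighed in the average by the positive term i(1; 1), which is exactly the Woodward/Fano observation quoted above.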

II. INFORMATION AND PROBABILITY MASS EXCLUSIONS
By definition, the pointwise information provided by y about x is associated with a change from the prior p(x) to the posterior p(x|y). Ultimately, this change is a consequence of the exclusion of probability mass in the distribution P(X) induced by the occurrence of the event y and inferred via the joint distribution P(X, Y). To be specific, when the event y occurs, one knows that the complementary event ȳ = {Y \ y} did not occur; hence, one can exclude the probability mass in the joint distribution P(X, Y) associated with this complementary event, i.e. exclude P(X, ȳ). This exclusion leaves only the probability mass P(X, y) remaining, which can be normalised to obtain the conditional distribution P(X|y). A visual representation of how the event y excludes probability mass in P(X) can be seen in the probability mass diagram in Fig. 1.
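This exclude-and-renormalise picture is straightforward to sketch computationally; the joint distribution below is a hypothetical example.

```python
# Sketch of the exclusion picture: the occurrence of y excludes the mass
# P(X, ȳ); normalising the remaining mass P(X, y) yields P(X|y).
# The joint masses are an arbitrary illustrative choice.
p_xy = {('x1', 'y1'): 0.3, ('x1', 'y2'): 0.2,
        ('x2', 'y1'): 0.1, ('x2', 'y2'): 0.4}

def posterior_via_exclusion(joint, y):
    # Keep only the mass consistent with y (i.e. exclude P(X, ȳ))...
    remaining = {x: m for (x, yy), m in joint.items() if yy == y}
    # ...then renormalise by p(y).
    p_y = sum(remaining.values())
    return {x: m / p_y for x, m in remaining.items()}

p_x_given_y1 = posterior_via_exclusion(p_xy, 'y1')
# This matches the usual conditional p(x|y) = p(x, y) / p(y):
assert abs(p_x_given_y1['x1'] - 0.75) < 1e-12
assert abs(sum(p_x_given_y1.values()) - 1) < 1e-12
```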
Since the event x has also occurred, the excluded probability mass P(X, ȳ) can be divided into two distinct categories: the informative exclusion p(x̄, ȳ) is the portion of the exclusion associated with the complementary event x̄, while the misinformative exclusion p(x, ȳ) is the portion of the exclusion associated with the event x. The choice of appellations is justified by considering the subsequent two special cases.

The first special case is a purely informative exclusion which, as depicted in Fig. 2, occurs when the event y induces exclusions which are confined to the probability mass associated with the complementary event x̄. Formally, the informative exclusion p(x̄, ȳ) is non-zero while there is no misinformative exclusion as p(x, ȳ) = 0. Thus, p(x) = p(x, y) + p(x, ȳ) = p(x, y), and hence the pointwise mutual information

i(x; y) = −log_b (1 − p(x̄, ȳ))  (2)

is a strictly positive, monotonically increasing function of the size of the informative exclusion p(x̄, ȳ) for fixed p(x).
The second special case is a purely misinformative exclusion which, as depicted in Fig. 2, occurs when the event y induces exclusions which are confined to the probability mass associated with the event x. Formally, there is no informative exclusion as p(x̄, ȳ) = 0 while the misinformative exclusion p(x, ȳ) is non-zero. Thus, p(y) = 1 − p(ȳ) = 1 − p(x̄, ȳ) − p(x, ȳ) = 1 − p(x, ȳ), and hence, together with p(x, y) = p(x) − p(x, ȳ), the pointwise mutual information

i(x; y) = log_b [ (p(x) − p(x, ȳ)) / (p(x) (1 − p(x, ȳ))) ]  (3)

is a strictly negative, monotonically decreasing function of the size of the misinformative exclusion p(x, ȳ) for fixed p(x).

Now consider the general case depicted in Fig. 2, where both informative and misinformative exclusions are present simultaneously. Given that the purely informative exclusion yields positive pointwise mutual information, while the purely misinformative exclusion yields negative pointwise mutual information, the question naturally arises: in the general case, can one decompose the pointwise information into underlying informative and misinformative components, each associated with one type of exclusion?
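The signs claimed for the two special cases can be checked numerically; the masses below are arbitrary illustrative choices satisfying the respective constraints.

```python
import math

def pmi(pxy, pxny, pnxy, pnxny, b=2):
    """i(x; y) from the four joint masses p(x,y), p(x,ȳ), p(x̄,y), p(x̄,ȳ)."""
    px, py = pxy + pxny, pxy + pnxy
    return math.log((pxy / py) / px, b)

# Purely informative exclusion, p(x, ȳ) = 0:
# i(x; y) = -log(1 - p(x̄, ȳ)) > 0.
i_inf = pmi(0.3, 0.0, 0.3, 0.4)
assert abs(i_inf - (-math.log(1 - 0.4, 2))) < 1e-12 and i_inf > 0

# Purely misinformative exclusion, p(x̄, ȳ) = 0:
# i(x; y) = log[(p(x) - p(x, ȳ)) / (p(x) (1 - p(x, ȳ)))] < 0.
i_mis = pmi(0.3, 0.2, 0.5, 0.0)
expected = math.log((0.5 - 0.2) / (0.5 * (1 - 0.2)), 2)
assert abs(i_mis - expected) < 1e-12 and i_mis < 0
```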
Before attempting to address this question, there are two other important observations to be made about probability mass exclusions. The first observation is that an event can only ever induce an informative exclusion about itself: if x occurred, then clearly that precludes the complementary event x̄ from having occurred. The second observation is that the exclusion process must satisfy the chain rule of probability; in particular, as shown in Fig. 3, there are three equivalent ways to consider the exclusions induced in P(X) by the events y and z. Firstly, one could consider the information provided by the joint event yz, which excludes the probability mass in P(X) associated with the joint events yz̄, ȳz and ȳz̄. Secondly, one could first consider the information provided by y, which excludes the probability mass in P(X) associated with the joint events ȳz and ȳz̄, and then subsequently consider the information provided by z, which excludes the probability mass in P(X|y) associated with the joint event yz̄. Thirdly, one could first consider the information provided by z, which excludes the probability mass in P(X) associated with the joint events yz̄ and ȳz̄, and then subsequently consider the information provided by y, which excludes the probability mass in P(X|z) associated with the joint event ȳz. Regardless of the chaining, one starts with the same p(x) and p(x̄) and finishes with the same p(x|yz) and p(x̄|yz).

Fig. 2. Top: A purely informative probability mass exclusion, p(x̄, ȳ) > 0 and p(x, ȳ) = 0, leading to p(x|y) > p(x) and hence i(x; y) > 0. Middle: A purely misinformative probability mass exclusion, p(x̄, ȳ) = 0 and p(x, ȳ) > 0, leading to p(x|y) < p(x) and hence i(x; y) < 0. Bottom: The general case, p(x̄, ȳ) > 0 and p(x, ȳ) > 0; whether i(x; y) turns out to be positive or negative depends on the balance of the exclusions.
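The consistency of the three chainings can be verified numerically for any joint distribution; the randomly generated distribution below is purely illustrative.

```python
import random

random.seed(1)
# Hypothetical joint distribution over binary (x, y, z); arbitrary random masses.
states = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
weights = [random.random() for _ in states]
total = sum(weights)
p = {s: w / total for s, w in zip(states, weights)}

def condition(dist, fixed):
    """Exclude all mass inconsistent with the observed events in `fixed`
    (a map from coordinate index to observed value), then renormalise."""
    kept = {s: m for s, m in dist.items()
            if all(s[i] == v for i, v in fixed.items())}
    z = sum(kept.values())
    return {s: m / z for s, m in kept.items()}

# One-shot: the joint event (y=1, z=1) excludes yz̄, ȳz and ȳz̄ at once.
direct = condition(p, {1: 1, 2: 1})
# Chained: y=1 first (excludes ȳz and ȳz̄), then z=1 within P(X|y).
via_y = condition(condition(p, {1: 1}), {2: 1})
# Chained the other way: z=1 first, then y=1 within P(X|z).
via_z = condition(condition(p, {2: 1}), {1: 1})

assert all(abs(direct[s] - via_y[s]) < 1e-12 for s in direct)
assert all(abs(direct[s] - via_z[s]) < 1e-12 for s in direct)
```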
Returning now to the question of decomposing the pointwise information, consider the following postulates. Postulate 1 is a formal statement of the proposed decomposition, while Postulate 2 mandates that the information associated with the exclusions satisfies the functional relationship observed between the pointwise mutual information and the exclusions in both the purely informative and purely misinformative cases. Postulate 3 is based upon the observation that an event cannot misinform about itself, and finally, Postulate 4 demands that the information associated with these exclusions must satisfy the chain rule of probability.

Postulate 1 (Decomposition). The pointwise information provided by y about x can be decomposed into two non-negative components, such that i(x; y) = i+(y → x) − i−(y → x).

Postulate 2 (Monotonicity). For all fixed p(x, y) and p(x, ȳ), the function i+(y → x) is a monotonically increasing, continuous function of p(x̄, ȳ). For all fixed p(x, y) and p(x̄, ȳ), the function i−(y → x) is a monotonically increasing, continuous function of p(x, ȳ). For all fixed p(x̄, y) and p(x̄, ȳ), the functions i+(y → x) and i−(y → x) are monotonically increasing and decreasing functions of p(x, ȳ) and p(x, y), respectively.

Postulate 3 (Self-Information). An event cannot misinform about itself, i.e. i−(x → x) = 0, and hence i+(x → x) = i(x; x) = h(x) = −log_b p(x).

Postulate 4 (Chain Rule). The components satisfy the chain rule of probability, i.e. i+(yz → x) = i+(y → x) + i+(z → x|y) and i−(yz → x) = i−(y → x) + i−(z → x|y).
Theorem 1. The unique functions satisfying the postulates are

i+(y → x) = −log_b p(y),  (4)
i−(y → x) = −log_b p(y|x),  (5)

where the base b is fixed by the choice of base in Postulate 3.
By writing these functions in terms of the exclusions, it is trivial to see that (4) and (5) satisfy Postulates 1–4, i.e.

i+(y → x) = −log_b (1 − p(x̄, ȳ) − p(x, ȳ)),  (6)
i−(y → x) = −log_b (1 − p(x, ȳ)/p(x)).  (7)
As such, the proof focuses on the uniqueness of the functions and is structured as follows: Lemma 1 considers the functional form required when p(x) = 1, and is used in the proof of Lemma 3; Lemmas 2 and 3 consider the purely informative and misinformative special cases respectively; finally, the proof of Theorem 1 brings these two special cases together for the general case.
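Before following the uniqueness argument, the closed forms (4) and (5) can be sanity-checked numerically; the four joint masses below are arbitrary.

```python
import math

log2 = lambda v: math.log(v, 2)

# Four joint masses: p(x,y), p(x,ȳ), p(x̄,y), p(x̄,ȳ); arbitrary illustrative values.
pxy, pxny, pnxy, pnxny = 0.2, 0.1, 0.3, 0.4
px, py = pxy + pxny, pxy + pnxy
py_given_x = pxy / px

i_plus = -log2(py)            # specificity: h(y) = -log p(y)
i_minus = -log2(py_given_x)   # ambiguity: h(y|x) = -log p(y|x)
i = log2((pxy / py) / px)     # pointwise mutual information

# Postulate 1: i(x; y) = i+(y→x) − i−(y→x), with both components non-negative.
assert abs(i - (i_plus - i_minus)) < 1e-12
assert i_plus >= 0 and i_minus >= 0

# The same components written in terms of the exclusions:
assert abs(i_plus - (-log2(1 - pnxny - pxny))) < 1e-12
assert abs(i_minus - (-log2(1 - pxny / px))) < 1e-12
```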
Lemma 1. In the special case where p(x) = 1, we have that

i+(y → x) = i−(y → x) = −log_k p(y),

where k ≥ b.
Proof. Since p(x) = 1, we have that p(x|y) = 1 and hence i(x; y) = 0, so by Postulate 1, i+(y → x) = i−(y → x). Furthermore, we also have that p(y) = 1 − p(x, ȳ); thus, without a loss of generality, we will consider i−(y → x) to be a function of p(y) rather than p(x, ȳ). As such, let f(m) be our candidate function for i−(y → x), where m = 1/p(y).

First consider choosing p(x, ȳ) = 0, such that m = 1. Postulate 4 demands that f(1) = f(1 · 1) = f(1) + f(1), and hence f(1) = 0, i.e. if there is no misinformative exclusion, then the negative informational component should be zero.

Now consider choosing p(x, ȳ) so that m is a positive integer greater than 1. If r is an arbitrary positive integer, then 2^r lies somewhere between two powers of m, i.e. there exists a positive integer n such that

m^n ≤ 2^r ≤ m^(n+1).  (8)

So long as the base k is greater than 1, the logarithm is a monotonically increasing function, thus

n log_k m ≤ r log_k 2 ≤ (n + 1) log_k m,  (9)

or equivalently,

n/r ≤ (log_k 2)/(log_k m) ≤ (n + 1)/r.  (10)

By Postulate 2, f(m) is a monotonically increasing function of m, hence applying it to (8) yields

f(m^n) ≤ f(2^r) ≤ f(m^(n+1)).  (11)

Note that, by Postulate 4 and mathematical induction, it is trivial to verify that

f(m^n) = n f(m).  (12)

Hence, by (11) and (12), we have that

n/r ≤ f(2)/f(m) ≤ (n + 1)/r.  (13)

Now, (10) and (13) have the same bounds, hence

| f(2)/f(m) − (log_k 2)/(log_k m) | ≤ 1/r.  (14)

Since m is fixed and r is arbitrary, let r → ∞. Then, by the squeeze theorem, we get that

f(2)/f(m) = (log_k 2)/(log_k m),  (15)

and hence, choosing the base k such that f(2) = log_k 2,

f(m) = log_k m.  (16)

Now consider choosing p(x, ȳ) so that m is a rational number; in particular, let m = s/r where s and r are positive integers. By Postulate 4,

f(s) = f(s/r) + f(r).  (17)

Thus, combining (16) and (17), we get that

f(s/r) = log_k s − log_k r = log_k (s/r).  (18)

Now consider choosing p(x, ȳ) such that m is a real number. By Postulate 2, f is continuous, so the function (18) is the unique solution, and hence i−(y → x) = log_k m = −log_k p(y). Finally, to show that k ≥ b, consider an event z = y.
Lemma 2. In the purely informative case, where p(x, ȳ) = 0, we have that i+(y → x) = −log_b p(y) and i−(y → x) = 0.

Proof. Consider an event z such that x = yz and x̄ = {yz̄, ȳz, ȳz̄}. By Postulate 4,

i+(yz → x) = i+(y → x) + i+(z → x|y),  (19)

as depicted in Fig. 4. By Postulate 3, i+(yz → x) = h(x) and i+(z → x|y) = h(x|y), where the latter equality follows from the equivalence of the events x and z given y. Furthermore, since p(x, ȳ) = 0, we have that p(x, y) = p(x), and hence that p(y|x) = 1. Thus, from (19), we have that

i+(y → x) = h(x) − h(x|y) = −log_b p(x) + log_b (p(x)/p(y)) = −log_b p(y).

Finally, by Postulate 1, i−(y → x) = i+(y → x) − i(x; y) = 0.

Lemma 3. In the purely misinformative case, where p(x̄, ȳ) = 0, we have that i−(y → x) = −log_k p(y|x), where k ≥ b.

Proof. Consider an event z = x. By Postulate 4,

i−(yz → x) = i−(y → x) + i−(z → x|y)  (21)
= i−(z → x) + i−(y → x|z),  (22)

as depicted in Fig. 5. Since z = x, by Postulate 3, i−(z → x) = 0 and i−(z → x|y) = 0; hence, from (21) and (22), we get that i−(y → x) = i−(y → x|x). Since p(x|x) = 1, Lemma 1 applies in the space conditioned on x, giving i−(y → x) = −log_k p(y|x), as required.
In addition, since p(x̄, v̄|u) = 0, by Lemma 3, we have that

i−(v → x|u) = −log_k p(v|x, u),  (26)

where k ≥ b. Therefore, by (25) and (26),

i+(y → x) = h(y) − h(y|x) − log_k p(y|x).

Finally, since Postulate 1 requires that i+(y → x) ≥ 0, we have that h(y) − h(y|x) − log_k p(y|x) ≥ 0, or equivalently,

h(y|x) (1 − 1/log_b k) ≤ h(y).

This must hold for all p(y) and p(y|x), which is only true in general for b ≥ k. Hence, k = b and therefore

i+(y → x) = −log_b p(y) = h(y),
i−(y → x) = −log_b p(y|x) = h(y|x).

Corollary 1. The conditional decomposition of the information provided by y about x given z is given by

i+(y → x|z) = h(y|z),
i−(y → x|z) = h(y|x, z).

Proof. Follows trivially using conditional distributions.
Corollary 2. The joint decomposition of the information provided by y and z about x is given by

i+(yz → x) = h(y, z),
i−(yz → x) = h(y, z|x).

Corollary 3. The joint decomposition of the information provided by y about x and z is given by

i+(y → xz) = h(y),  (36)
i−(y → xz) = h(y|x, z).  (37)

Proof. Both follow trivially using joint distributions.

Note the identities

i+(y → xz) = i+(y → x),  (38)
i+(y → z|x) = i−(y → x),  (39)
i−(y → xz) = i−(y → z|x),  (40)

which hold since both sides equal h(y), h(y|x) and h(y|x, z), respectively.
Corollary 4. The information provided by y about x and z satisfies the following chain rule,

i(y → xz) = i(y → x) + i(y → z|x).  (41)

Proof. Starting from the joint decomposition (36) and (37),

i(y → xz) = i+(y → xz) − i−(y → xz).

By the identities (38) and (40), we get that

i(y → xz) = i+(y → x) − i−(y → z|x).

Then, by identity (39), and recomposition, we get that

i(y → xz) = i+(y → x) − i−(y → x) + i+(y → z|x) − i−(y → z|x)
= i(y → x) + i(y → z|x).
Note that, in general, it is not true that i+(y → xz) = i+(y → x) + i+(y → z|x), nor is it true that i−(y → xz) = i−(y → x) + i−(y → z|x). Hence, although not unexpected, it is interesting to see how the chain rule (41) is satisfied: the key observation is that the positive informational component provided by y about z given x equals the negative informational component provided by y about x, as per (39).
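The chain rule (41), and the failure of the individual components to chain, can be checked numerically on a random joint distribution; the distribution below is purely illustrative.

```python
import math
import random

random.seed(7)
# Hypothetical joint distribution over binary (x, y, z); arbitrary random masses.
states = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
weights = [random.random() for _ in states]
total = sum(weights)
p = {s: w / total for s, w in zip(states, weights)}

def prob(pred):
    return sum(m for s, m in p.items() if pred(s))

h = lambda q: -math.log(q, 2)  # pointwise entropy of a probability value

x, y, z = 1, 1, 1
p_y = prob(lambda s: s[1] == y)
p_y_x = prob(lambda s: s[0] == x and s[1] == y) / prob(lambda s: s[0] == x)
p_y_xz = p[(x, y, z)] / prob(lambda s: s[0] == x and s[2] == z)

# Decompositions from Theorem 1 and Corollary 1:
i_y_x = h(p_y) - h(p_y_x)        # i(y→x)   = h(y) − h(y|x)
i_y_z_gx = h(p_y_x) - h(p_y_xz)  # i(y→z|x) = h(y|x) − h(y|x,z)
i_y_xz = h(p_y) - h(p_y_xz)      # i(y→xz)  = h(y) − h(y|x,z)

# Chain rule (41) holds...
assert abs(i_y_xz - (i_y_x + i_y_z_gx)) < 1e-12
# ...but the positive components alone do not chain:
# i+(y→xz) = h(y), whereas i+(y→x) + i+(y→z|x) = h(y) + h(y|x).
assert abs(h(p_y) - (h(p_y) + h(p_y_x))) > 1e-9
```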
In summary, the unique forms satisfying Postulates 1–4 are

i+(y → x) = h(y) = −log_b p(y),
i−(y → x) = h(y|x) = −log_b p(y|x).

That is, Postulates 1–4 decompose the pointwise information provided by y about x into

i(x; y) = h(y) − h(y|x).  (50)

III. DISCUSSION
Clearly, the decomposition (50) is a well-known result, especially with regards to the (average) mutual information. Nonetheless, it is non-trivial that considering the pointwise mutual information in terms of the exclusions induced by y in P(X) should lead to this decomposition as opposed to the decomposition i(x; y) = h(x) − h(x|y). Indeed, this latter form is more typically used when considering the information provided by y about x, since it states that this information is equal to the difference between the entropy of the prior p(x) and the entropy of the posterior p(x|y). Despite this, Postulates 1–4 mandate the use of the former decomposition (50), rather than the latter.
Recall the motivational question from Section II, which asked if it was possible to decompose the pointwise information into an informative and a misinformative component, each associated with one type of exclusion. As can be seen from (6) and (7), the unique functions derived from the exclusions do not quite possess this precise functional independence: although the negative informational component only depends on the size of the misinformative exclusion p(x, ȳ), the positive component depends on the size of both the informative exclusion p(x̄, ȳ) and the misinformative exclusion p(x, ȳ). That is, since p(ȳ) = p(x̄, ȳ) + p(x, ȳ), the positive component i+(y → x) depends on the total size of the exclusions induced by y and hence has no functional dependence on x, or indeed X. Thus, i+(y → x) quantifies the specificity of the event y: the less likely y is to occur, the greater the total amount of probability mass excluded and therefore the greater the potential for y to inform about x. On the other hand, the negative component i−(y → x) quantifies the ambiguity of y given x: the less likely y is to coincide with the event x, the greater the misinformative probability mass exclusion and therefore the greater the potential for y to misinform about x.

This asymmetry in the functional dependence can be seen in the two special cases. Decomposing the pointwise mutual information for a purely informative exclusion yields

i+(y → x) = −log_b (1 − p(x̄, ȳ)),
i−(y → x) = 0,

i.e. only the positive informational component is non-zero. (Note that (2) is recovered.) On the other hand, decomposing the pointwise mutual information for a purely misinformative exclusion yields

i+(y → x) = −log_b (1 − p(x, ȳ)),
i−(y → x) = −log_b (1 − p(x, ȳ)/p(x)),

i.e. both the positive and negative informational components are non-zero. (Note that (3) is recovered.) Nevertheless, despite both terms being non-zero, it is clear that i+(y → x) < i−(y → x) and hence i(x; y) < 0.
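The specificity/ambiguity asymmetry can be made concrete with a small sketch: for a hypothetical joint distribution over a three-event X, the positive component is computed once from p(y) alone, while the negative component varies with the target event.

```python
import math

log2 = lambda v: math.log(v, 2)

# Hypothetical joint over X = {'a', 'b', 'c'} and Y = {'y', 'n'} ('n' plays ȳ);
# the masses are an arbitrary illustrative choice.
p = {('a', 'y'): 0.10, ('a', 'n'): 0.15,
     ('b', 'y'): 0.25, ('b', 'n'): 0.05,
     ('c', 'y'): 0.25, ('c', 'n'): 0.20}
p_y = sum(m for (x, yy), m in p.items() if yy == 'y')

# Specificity: i+(y→x) = -log p(y); note there is no target argument at all.
i_plus = -log2(p_y)

def i_minus(x):
    # Ambiguity: i−(y→x) = -log p(y|x), which does depend on the target event.
    p_x = p[(x, 'y')] + p[(x, 'n')]
    return -log2(p[(x, 'y')] / p_x)

# One shared specificity term; per-target ambiguity recovers each i(x; y).
for x in ('a', 'b', 'c'):
    p_x = p[(x, 'y')] + p[(x, 'n')]
    pmi = log2((p[(x, 'y')] / p_y) / p_x)
    assert abs(pmi - (i_plus - i_minus(x))) < 1e-12
assert abs(i_minus('a') - i_minus('b')) > 1e-9
```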
Now as to why one should be interested in considering information in terms of exclusions: recently, there has been a concerted effort to quantify the shared or redundant information contained in a set of variables about one or more target variables. There has been particular interest in a proposed axiomatic framework for decomposing multivariate information called the partial information decomposition [6]. (There are a substantial number of publications following on from this paper; see [7] and references therein.) However, flaws have been identified in this approach regarding "whether different random variables carry the same information or just the same amount of information" [8] (see also [9]). In [7], exclusions are utilised to provide an operational definition of when the events y and z provide the same information about x. Specifically, the information is deemed to be the same information when the events y and z induce the same probability mass exclusions in P(X) with respect to the event x. To motivate why this approach is appealing, consider the situation depicted in the probability mass diagram in Fig. 7, where i(x₁; y₁) = i(x₁; z₁) = log₂ 4/3 bit, but yet

i+(y₁ → x₁) = log₂ 8/3 bit,  i−(y₁ → x₁) = 1 bit,
i+(z₁ → x₁) = log₂ 4/3 bit,  i−(z₁ → x₁) = 0 bit.  (53)

Although the net amount of information provided by y and z is the same, it is in some way different, since y and z differ in terms of their exclusions. However, this is not the subject of this paper; those who are interested in the operational definition of shared information based on redundant exclusions should see [7].
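The figures in (53) can be reproduced from any joint distributions consistent with them; the masses below are one such hypothetical choice (the exact distributions of Fig. 7 are not reproduced in this text).

```python
import math

log2 = lambda v: math.log(v, 2)

# Hypothetical masses consistent with the values quoted in (53):
# For Y: p(x1, y1) = 1/4, p(x1, ȳ1) = 1/4, p(x̄1, y1) = 1/8, p(x̄1, ȳ1) = 3/8.
# For Z: p(x1, z1) = 1/2, p(x1, z̄1) = 0,  p(x̄1, z1) = 1/4, p(x̄1, z̄1) = 1/4.
def components(pxy, pxny, pnxy, pnxny):
    px, py = pxy + pxny, pxy + pnxy
    return -log2(py), -log2(pxy / px)  # (i+, i−)

ip_y, im_y = components(0.25, 0.25, 0.125, 0.375)
ip_z, im_z = components(0.50, 0.00, 0.250, 0.250)

assert abs(ip_y - log2(8 / 3)) < 1e-12 and abs(im_y - 1) < 1e-12
assert abs(ip_z - log2(4 / 3)) < 1e-12 and abs(im_z - 0) < 1e-12
# Same net information, despite different exclusions:
assert abs((ip_y - im_y) - (ip_z - im_z)) < 1e-12
```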

Fig. 1. In probability mass diagrams, height represents the probability mass of each joint event from X × Y. Left: the full joint distribution P(X, Y). Centre: the occurrence of the event y₁ leads to exclusion of the probability mass associated with ȳ₁ = {y₂, y₃}. Since the event x₁ occurred, there is an informative exclusion p(x̄₁, ȳ₁) and a misinformative exclusion p(x₁, ȳ₁), represented, by convention, with vertical and diagonal hatching respectively. Right: normalising the remaining probability mass yields P(X|y₁).

Fig. 3. The probability mass exclusions must satisfy the chain rule of probability: there are three equivalent ways y and z can provide information about x.

Fig. 4. The probability mass diagram associated with (19). Lemma 2 uses Postulates 3 and 4 to provide a solution for the purely informative case.

Fig. 5. The diagram corresponding to (21) and (22). Lemma 3 uses Postulate 4 and Lemma 1 to provide a solution for the purely misinformative case.

Fig. 7. Top: probability mass diagram for X × Y. Bottom: probability mass diagram for X × Z. Note that the events y₁ and z₁ can induce different exclusions in P(X) and yet still yield the same conditional distributions P(X|y₁) = P(X|z₁), and hence provide the same amount of information i(x₁; y₁) = i(x₁; z₁) about the event x₁.