Article

Generalised Measures of Multivariate Information Content

1 Centre for Complex Systems, The University of Sydney, Sydney NSW 2006, Australia
2 CSIRO Data61, Marsfield NSW 2122, Australia
* Author to whom correspondence should be addressed.
Entropy 2020, 22(2), 216; https://doi.org/10.3390/e22020216
Submission received: 11 December 2019 / Revised: 5 February 2020 / Accepted: 12 February 2020 / Published: 14 February 2020
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract: The entropy of a pair of random variables is commonly depicted using a Venn diagram. This representation is potentially misleading, however, since the multivariate mutual information can be negative. This paper presents new measures of multivariate information content that can be accurately depicted using Venn diagrams for any number of random variables. These measures complement the existing measures of multivariate mutual information and are constructed by considering the algebraic structure of information sharing. It is shown that the distinct ways in which a set of marginal observers can share their information with a non-observing third party correspond to the elements of a free distributive lattice. The redundancy lattice from partial information decomposition is then subsequently and independently derived by combining the algebraic structures of joint and shared information content.


1. Introduction

For any pair of random variables X and Y, the entropy H satisfies the inequality
$H(X) + H(Y) \geq H(X,Y) \geq H(X),\, H(Y) \geq 0.$
From this inequality, it is easy to see that the conditional entropies and mutual information are non-negative,
$H(X|Y) = H(X,Y) - H(Y) \geq 0,$
$H(Y|X) = H(X,Y) - H(X) \geq 0,$
$I(X;Y) = H(X) + H(Y) - H(X,Y) \geq 0.$
For any pair of sets A and B, a measure μ satisfies the inequality
$\mu(A) + \mu(B) \geq \mu(A \cup B) \geq \mu(A),\, \mu(B) \geq 0,$
which follows from the non-negativity of measure on the relative complements and the intersection,
$\mu(A \setminus B) = \mu(A \cup B) - \mu(B) \geq 0,$
$\mu(B \setminus A) = \mu(A \cup B) - \mu(A) \geq 0,$
$\mu(A \cap B) = \mu(A) + \mu(B) - \mu(A \cup B) \geq 0.$
Although the entropy is not itself a measure, several authors have noted that the entropy is analogous to measure in this regard [1,2,3,4,5,6,7]. Indeed, it is this analogy which provides the justification for the typical depiction of a pair of entropies using Venn diagrams, i.e., Figure 1. Nevertheless, MacKay [8] noted that this representation is misleading for at least two reasons: Firstly, since the measure on the intersection $\mu(A \cap B)$ is a measure on a set, it gives the false impression that the mutual information $I(X;Y)$ is the entropy of some intersection between the random variables. Secondly, it might lead one to believe that this analogy can be generalised beyond two variables. However, the analogy does not generalise beyond two variables since the multivariate mutual information [9] between three random variables (which is also known as the interaction information [10], amount of information [2] or co-information [11]),
$I(X;Y;Z) = H(X) + H(Y) + H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z),$
is not non-negative [3,9,12], and hence is not analogous to measure on the triple intersection $\mu(A \cap B \cap C)$ [3]. Indeed, this “unfortunate” property led Cover and Thomas to conclude that “there isn’t really a notion of mutual information common to three random variables” (p. 49 [13]). Consequently, MacKay [8] recommended against depicting the entropy of three or more variables using a Venn diagram, i.e., Figure 1, unless one is aware of these issues with this representation.
However, Yeung [6,7] showed that there is an analogy between entropy and signed measure that is valid for an arbitrary number of random variables. To do this, Yeung defined a signed measure on a suitably constructed $\sigma$-algebra that is uniquely determined by the joint entropies of the random variables involved. This correspondence enables one to establish information-theoretic identities from measure-theoretic identities, and hence Venn diagrams can be used to represent the entropy of three or more variables provided one is aware that certain overlapping areas may correspond to negative quantities. Moreover, the multivariate mutual information is useful both as a summary quantity and for manipulating information-theoretic identities provided one is mindful it may have “no intuitive meaning” [5,6].
In this paper, we introduce new measures of multivariate information that are analogous to measures upon sets and maintain their operational meaning when considering an arbitrary number of variables. These new measures complement the existing measures of multivariate mutual information, and will be constructed by considering the distinct ways in which a set of marginal observers might share their information with a non-observing third party. In Section 2, we discuss the existing measures of information content in terms of a set of individuals who each have different knowledge about a joint realisation from a pair of random variables. Then, in Section 3, we discuss how these individuals can share their information with a non-observing third party, and derive the functional form of this individual’s information. In Section 4, we relate this new measure of information content back to the mutual information. Section 5, Section 6 and Section 7 then generalise the arguments of Section 3 and Section 4 to consider an arbitrary number of observers. Finally, in Section 8, we discuss how these new measures can be combined to define new measures of mutual information.

2. Mutual Information Content

Suppose that Alice and Bob are separately observing some process and let the discrete random variables X and Y represent their respective observations. Say that Johnny is a third individual who can simultaneously make the same observations as Alice and Bob such that his observations are given by the joint variable $(X,Y)$. When a realisation $(x,y)$ occurs, Alice’s information is given by the information content [8],
$h(x) = -\log p_X(x) \geq 0,$
where $p_X(x)$ is the probability mass of the realisation x of variable X computed from the probability distribution $p_X$. Likewise, Bob’s information is given by the information content $h(y)$, while Johnny’s information is given by the joint information content $h(x,y) = -\log p_{XY}(x,y)$. The information that Alice can expect to gain from an observation is given by the entropy,
$H(X) = \mathbb{E}_X[h(x)] \geq 0,$
where $\mathbb{E}_X$ represents an expectation value over realisations of the variable X. Similarly, Bob’s expected information gain is given by the entropy $H(Y)$ and Johnny’s expected information is given by the joint entropy $H(X,Y) = \mathbb{E}_{XY}[h(x,y)]$. Clearly, for any realisation, Johnny has at least as much information as either Alice or Bob,
$h(x,y) \geq h(x),\, h(y) \geq 0.$
The conditional information content can be used to quantify how much more information Johnny has relative to either Alice or Bob, respectively,
$h(x|y) = h(x,y) - h(y) \geq 0,$
$h(y|x) = h(x,y) - h(x) \geq 0.$
Similarly, we can quantify how much more information Johnny expects to get compared to either Alice or Bob via the conditional entropies,
$H(X|Y) = \mathbb{E}_{XY}[h(x|y)] \geq 0,$
$H(Y|X) = \mathbb{E}_{XY}[h(y|x)] \geq 0.$
Now, consider a fourth individual who does not directly observe the process, but with whom Alice and Bob share their knowledge. To be explicit, we are considering the situation whereby this individual knows that the joint realisation $(x,y)$ has occurred and knows the marginal distributions $p_X$ and $p_Y$, but does not know the joint distribution $p_{XY}$. How much information does this individual obtain from the shared marginal knowledge provided by Alice and Bob? The answer to this question is provided in Section 3, but for now let us consider a simplified version of this problem. Suppose that such an individual, whom we call Indiana (or Indy for short), assumes that Alice’s observations are independent of Bob’s observations. In terms of the probabilities, this means that Indy believes that the joint probability $p_{XY}(x,y)$ is equal to the product probability $p_{X \times Y}(x,y) = p_X(x)\,p_Y(y)$, while, in terms of information, this assumption leads Indiana to believe that her information is given by the independent information content $h(x) + h(y)$. Moreover, the information that Indiana expects to gain from any one realisation is given by $H(X) + H(Y)$.
Let us now compare how much information Indiana believes that she has compared to our other observers. For every realisation, Indiana believes that she has at least as much information as either Alice or Bob,
$h(x) + h(y) \geq h(x),\, h(y) \geq 0.$
Since Indy knows what both Alice and Bob know individually, it is hardly surprising that she always has at least as much information as either Alice or Bob. The comparison between Indiana and Johnny, however, is not so straightforward—there is no inequality that requires the information content of the joint realisation to be less than the information content of the independent realisations, or vice versa. Consequently, the difference between the information that Indiana thinks she has and Johnny’s information, i.e., the mutual information content between a pair of realisations,
$i(x;y) = h(x) + h(y) - h(x,y) = \log \frac{p_{XY}(x,y)}{p_X(x)\,p_Y(y)},$
is not non-negative [14]. (This function goes by several different names including the pointwise mutual information, the information density [15] or simply the mutual information [9].) Thus, similar to how it is potentially misleading to depict the entropy of three or more variables using a Venn diagram, representing the information content of two variables using a Venn diagram is somewhat dubious (see Figure 2).
Since Johnny knows the joint distribution $p_{XY}$, while Indiana only knows the marginal distributions $p_X$ and $p_Y$, we might expect that Indiana should never have more information than Johnny. However, Indiana’s assumed information is based upon the belief that Alice’s observations X are independent of Bob’s observations Y, which leads Indiana to overestimate her information on average. Indeed, Indiana is so optimistic that the information she expects to get upper bounds the information that Johnny can expect to get,
$H(X) + H(Y) \geq H(X,Y) \geq 0.$
Thus, despite the fact that Indiana can have less information than Johnny for certain realisations—i.e., despite the fact that the mutual information content is not non-negative—the mutual information in expectation is non-negative,
$I(X;Y) = H(X) + H(Y) - H(X,Y) = \mathbb{E}_{XY}[i(x;y)] \geq 0.$
Crucially, and in contrast to the information content (10) and entropy (11), the non-negativity of the mutual information does not follow directly from the non-negativity of the mutual information content (18), but rather must be proved separately. (Typically, this is done by showing that the mutual information can be written as a Kullback–Leibler divergence which is non-negative by Jensen’s inequality, e.g., see Cover and Thomas [13].) Thus, not only does Indiana potentially have more information than Johnny for certain realisations, but on average we expect Indiana to have more information than Johnny. Of course, by assuming Alice’s observations are independent of Bob’s observations, Indiana is overestimating her information. Thus, in the next section, we consider the situation whereby one does not make this assumption.
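As a concrete illustration of this last point, the following Python sketch (not part of the original analysis; the joint distribution is an assumed toy example) evaluates the mutual information content for every realisation of a small pair of binary variables and averages it: some realisations have negative $i(x;y)$, yet the expectation $I(X;Y)$ is non-negative.

```python
import numpy as np

# Illustrative joint distribution p_XY (an assumed example, not from the paper);
# rows index realisations of X, columns index realisations of Y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def h(p):
    """Information content (surprisal) of a probability, in bits."""
    return -np.log2(p)

I = 0.0
for x in range(2):
    for y in range(2):
        i_xy = h(p_x[x]) + h(p_y[y]) - h(p_xy[x, y])  # mutual information content
        I += p_xy[x, y] * i_xy
        print(f"x={x}, y={y}: i(x;y) = {i_xy:+.3f} bits")
print(f"I(X;Y) = {I:.3f} bits")  # non-negative, even though some i(x;y) are not
```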

3. Marginal Information Sharing

Suppose that Eve is another individual who, similar to Indiana, does not make any direct observations, but with whom both Alice and Bob share their knowledge; i.e., Eve knows the joint realisation $(x,y)$ has occurred and knows the marginal distributions $p_X$ and $p_Y$, but does not know the joint distribution $p_{XY}$. Furthermore, suppose that Eve is more conservative than Indiana and does not assume that Alice’s observations are independent of Bob’s observations—how much information does Eve have for any one realisation?
It seems clear that Eve’s information should always satisfy the following two requirements. Firstly, since Alice and Bob both share their knowledge with Eve, she should have at least as much information as either of them have individually. Secondly, since Eve has less knowledge than Johnny, she should have no more information than Johnny; i.e., in contrast to Indy, Eve should never have more information than Johnny. As the following theorem shows, these two requirements uniquely determine the functional form of Eve’s information:
Theorem 1.
The unique function $h(x \cup y)$ of $p_X(x)$ and $p_Y(y)$ that satisfies $h(x,y) \geq h(x \cup y) \geq h(x),\, h(y) \geq 0$ for all $p_{XY}(x,y)$ is
$h(x \cup y) = \max\{h(x), h(y)\} \geq 0.$
Proof. 
Clearly, the function is lower bounded by $\max\{h(x), h(y)\}$. The upper bound is given by the minimum possible $h(x,y)$, which corresponds to the maximum allowed $p_{XY}(x,y)$. For any $p_X(x)$ and $p_Y(y)$, the maximum allowed $p_{XY}(x,y)$ is $\min\{p_X(x), p_Y(y)\}$, which corresponds to $h(x,y) = \max\{h(x), h(y)\}$.□
Eve’s information is given by the maximum of Alice’s and Bob’s information, or the information content of the most surprising marginal realisation. Although we have defined Eve’s information by requiring it to be no greater than Johnny’s information, it is clear that Eve also has no more information than Indiana. As such, Eve’s information satisfies the inequality
$h(x) + h(y) \geq h(x \cup y) \geq h(x),\, h(y) \geq 0,$
which is analogous to the inequality (5) satisfied by measure. Hence, as pre-empted by the notation (and as further justified in Section 6), Eve’s information is referred to as the union information content. The union information content is the maximum possible information that Eve can get from knowing what Alice and Bob know—it quantifies the information provided by a joint event $(x,y)$ when one knows the marginal distributions $p_X$ and $p_Y$, but does not know nor make any assumptions about the joint distribution $p_{XY}$.
Similar to how the conditional information contents (15) and (16) enable us to quantify how much more information Johnny has relative to either Alice or Bob, the inequality (22) enables us to quantify how much information Eve gets from Alice relative to Bob and vice versa, respectively,
$h(x \setminus y) = h(x \cup y) - h(y) = \max\{h(x) - h(y),\, 0\} \geq 0,$
$h(y \setminus x) = h(x \cup y) - h(x) = \max\{0,\, h(y) - h(x)\} \geq 0.$
These non-negative functions are analogous to measure on the relative complements of a pair of sets and are called the unique information content from x relative to y, and vice versa, respectively. It is easy to see that, since Eve’s information is either equal to Alice’s or Bob’s information (or both), at least one of these two functions must be equal to zero.
The inequality (22) also enables us to quantify how much more information Indiana has relative to Eve. Since Indiana’s assumed information is given by the sum of Alice’s and Bob’s information while Eve’s information is given by the maximum of Alice’s and Bob’s information, the difference between the two is given by the minimum of Alice’s and Bob’s information,
$h(x \cap y) = h(x) + h(y) - h(x \cup y) = h(x) + h(y) - \max\{h(x), h(y)\} = \min\{h(x), h(y)\} \geq 0.$
In contrast to the comparison between Indiana and Johnny, i.e., the mutual information content (18), the comparison between Indiana and Eve is non-negative. As such, this function is analogous to measure on the intersection of two sets and hence will be referred to as the intersection information content. The intersection information content is the minimum possible information that Eve could have gotten from knowing either what Alice or Bob know, and is given by the information content of the least surprising marginal realisation.
Finally, from (21) and (23)–(25), it is not difficult to see that Eve’s information can be decomposed into the information that could have been obtained from either Alice or Bob, the unique information from Alice relative to Bob and the unique information from Bob relative to Alice,
$h(x \cup y) = h(x \cap y) + h(x \setminus y) + h(y \setminus x).$
Of course, as already discussed, at least one of these unique information contents must be zero. Figure 3 depicts this decomposition for some realisation whereby Alice’s information $h(x)$ is greater than Bob’s information $h(y)$.
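For a single realisation, these pointwise quantities are simple to compute. The following sketch (with assumed illustrative marginal probabilities, not values from the paper) evaluates the union, intersection and unique information contents and checks the decomposition above.

```python
import numpy as np

def h(p):
    """Information content (surprisal) in bits."""
    return -np.log2(p)

# Assumed marginal probabilities of one joint realisation (x, y); illustrative only.
p_x, p_y = 0.5, 0.125

h_x, h_y = h(p_x), h(p_y)
h_union = max(h_x, h_y)            # Eve's union information content h(x U y)
h_isect = min(h_x, h_y)            # intersection information content h(x ∩ y)
h_x_unique = max(h_x - h_y, 0.0)   # unique information from x relative to y
h_y_unique = max(h_y - h_x, 0.0)   # unique information from y relative to x

# The decomposition of Eve's information; at least one unique term is always zero.
assert np.isclose(h_union, h_isect + h_x_unique + h_y_unique)
assert h_x_unique == 0.0 or h_y_unique == 0.0
```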
To summarise thus far, both Alice and Bob share their information with Indiana and Eve, who then each interpret this information in a different way. By comparing Figure 2 and Figure 3, we can easily contrast their distinct perspectives. Eve is more conservative than Indiana and assumes that she has gotten as little information as she could possibly have gotten from knowing what Alice and Bob know; this is given by the maximum from Alice’s and Bob’s information, or is the information content associated with the most surprising marginal realisation observed by Alice and Bob. In effect, Eve’s conservative approach means that she pessimistically assumes that the information provided by the least surprising marginal realisation was already provided by the most surprising marginal realisation. In contrast, Indiana optimistically assumes that the information provided by the least surprising marginal realisation is independent of the information provided by the most surprising marginal realisation.
Let us now consider the information that Eve expects to get from a single realisation,
$H(X \cup Y) = \mathbb{E}_{XY}[h(x \cup y)] \geq 0.$
This function is called the union entropy, and quantifies the expected surprise of the most surprising realisation from either X or Y. Similar to how the non-negativity of the entropy (11) follows from the non-negativity of the information content (10), the non-negativity of the union entropy (27) follows directly from the non-negativity of the union information content (21)—i.e., we do not need to invoke Jensen’s inequality. Indeed, the union entropy cannot be written as a Kullback–Leibler divergence.
Since the expectation value is monotonic, and since the union information content satisfies the inequality (22), we get that the union entropy satisfies
$H(X) + H(Y) \geq H(X \cup Y) \geq H(X),\, H(Y) \geq 0,$
and hence is also analogous to measure on the union of two sets. Using this inequality, we can quantify how much more information Eve expects to get from Alice relative to Bob, or vice versa, respectively,
$H(X \setminus Y) = H(X \cup Y) - H(Y) = \mathbb{E}_{XY}[h(x \setminus y)] \geq 0,$
$H(Y \setminus X) = H(X \cup Y) - H(X) = \mathbb{E}_{XY}[h(y \setminus x)] \geq 0.$
These functions are also analogous to measure on the relative complements of a pair of sets and hence will be called the unique entropy from X relative to Y, and vice versa, respectively. Crucially, and in contrast to (23) and (24), both of these quantities can be simultaneously non-zero; although Alice might observe the most surprising event in one joint realisation, Bob might observe the most surprising event in another, and hence both expected values can be simultaneously non-zero.
Now, consider how much more information Indiana expects to get relative to Eve,
$H(X \cap Y) = H(X) + H(Y) - H(X \cup Y) = \mathbb{E}_{XY}[h(x \cap y)] \geq 0.$
This function is also analogous to measure on the intersection of two sets and hence will be called the intersection entropy. In contrast to the mutual information (20), since the intersection information content (25) is non-negative, we do not require an additional proof to show that the intersection entropy is non-negative. Moreover, the intersection entropy cannot be written as a Kullback–Leibler divergence.
Finally, similar to (26), we can decompose Eve’s expected information into the following components,
$H(X \cup Y) = H(X \cap Y) + H(X \setminus Y) + H(Y \setminus X).$
It is important to reiterate that, in contrast to (26), there is nothing which requires either of the two unique entropies to be zero. Thus, as shown in Figure 3, the Venn diagram which represents the union and intersection entropy differs from that which represents the union information content.
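The following sketch (again using an assumed toy joint distribution, not one from the paper) computes these entropies as expectations of the pointwise terms and shows that, unlike the pointwise unique information contents, both unique entropies can be non-zero at the same time.

```python
import numpy as np

# Assumed joint distribution p_XY for illustration only.
p_xy = np.array([[0.5, 0.1],
                 [0.1, 0.3]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

def h(p):
    return -np.log2(p)

H_union = H_isect = H_x_unique = H_y_unique = 0.0
for x in range(2):
    for y in range(2):
        hx, hy = h(p_x[x]), h(p_y[y])
        H_union    += p_xy[x, y] * max(hx, hy)         # union entropy H(X U Y)
        H_isect    += p_xy[x, y] * min(hx, hy)         # intersection entropy H(X ∩ Y)
        H_x_unique += p_xy[x, y] * max(hx - hy, 0.0)   # unique entropy H(X \ Y)
        H_y_unique += p_xy[x, y] * max(hy - hx, 0.0)   # unique entropy H(Y \ X)

print(H_x_unique, H_y_unique)  # both strictly positive for this distribution
assert np.isclose(H_union, H_isect + H_x_unique + H_y_unique)
```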

4. Synergistic Information Content

As discussed at the beginning of the previous section, and as required in Theorem 1, one of the defining features of Eve’s information is that it is never greater than Johnny’s information,
$h(x,y) \geq h(x \cup y).$
Thus, we can compare how much more information Johnny has relative to Eve,
$h(x \Cup y) = h(x,y) - h(x \cup y) = h(x,y) - \max\{h(x), h(y)\} = \min\{h(y|x), h(x|y)\} \geq 0.$
This non-negative function is called the synergistic information content, and it quantifies how much more information one gets from knowing the joint probability $p_{XY}(x,y)$ relative to merely knowing the marginal probabilities $p_X(x)$ and $p_Y(y)$. Figure 4 shows how this relationship can be represented using a Venn diagram. Of course, by this definition, Johnny’s information is equal to the union information content plus the synergistic information content, and hence, by using (26), we can decompose Johnny’s information into the intersection information content, the unique information contents and the synergistic information contents,
$h(x,y) = h(x \cup y) + h(x \Cup y) = h(x \cap y) + h(x \setminus y) + h(y \setminus x) + h(x \Cup y).$
This decomposition can be seen in Figure 4, although it is important to recall that at least one of $h(x \setminus y)$ and $h(y \setminus x)$ must be equal to zero. In a similar manner, the extra information that Johnny has relative to Bob (13) can be decomposed into the unique information content from Alice and the synergistic information content, and vice versa for the extra information that Johnny has relative to Alice (14),
$h(x|y) = h(x \setminus y) + h(x \Cup y),$
$h(y|x) = h(y \setminus x) + h(x \Cup y).$
Now, recall that the mutual information content (18) is given by Indiana’s information minus Johnny’s information. By replacing Johnny’s information with the union information content plus the synergistic information content via (34) and rearranging using (25), we get that the mutual information content is equal to the intersection information content minus the synergistic information content,
$i(x;y) = h(x) + h(y) - h(x,y) = h(x) + h(y) - h(x \cup y) - h(x \Cup y) = h(x \cap y) - h(x \Cup y).$
Indeed, this relationship can be identified in Figure 4. Clearly, the mutual information content is negative whenever the synergistic information content is greater than the intersection information content. From this perspective, the mutual information content can be negative because there is nothing to suggest that the synergistic information content should be no greater than the intersection information content. In other words, the additional surprise associated with knowing $p_{XY}(x,y)$ relative to merely knowing $p_X(x)$ and $p_Y(y)$ can exceed the surprise of the least surprising marginal realisation.
Let us now quantify how much more information Johnny expects to get relative to Eve,
$H(X \Cup Y) = \mathbb{E}_{XY}[h(x \Cup y)] = H(X,Y) - H(X \cup Y) \geq 0,$
which we call the synergistic entropy. Crucially, although the synergistic information content is given by the minimum of the two conditional information contents, the synergistic entropy does not in general equal one of the two conditional entropies. This is because, although Alice might observe the most surprising event in one joint realisation such that the synergistic information content is equal to Bob’s information given Alice’s information, Bob might observe the most surprising event in another realisation such that the synergistic information content is equal to Alice’s information given Bob’s information for that particular realisation. Thus, the synergistic entropy does not equal the conditional entropy for the same reason that the unique entropies (29) and (30) can be simultaneously non-zero.
With the definition of synergistic entropy, it is not difficult to show that, similar to (35), the joint entropy can be decomposed into the following components,
$H(X,Y) = H(X \cup Y) + H(X \Cup Y) = H(X \cap Y) + H(X \setminus Y) + H(Y \setminus X) + H(X \Cup Y).$
Figure 5 depicts this decomposition using a Venn diagram, and shows how the union entropy from Figure 3 is related to the joint entropy $H(X,Y)$. Likewise, similar to (36) and (37), it is easy to see that the conditional entropies can be decomposed as follows,
$H(X|Y) = H(X \setminus Y) + H(X \Cup Y),$
$H(Y|X) = H(Y \setminus X) + H(X \Cup Y).$
Finally, as with (38), we can also show that the mutual information is equal to the intersection entropy minus the synergistic entropy,
$I(X;Y) = H(X \cap Y) - H(X \Cup Y) \geq 0.$
Although there is nothing to suggest that the synergistic information content must be no greater than the intersection information content, we know that the synergistic entropy must be no greater than the intersection entropy because $I(X;Y) \geq 0$. In other words, the expected difference between the surprise of the joint realisation and the most surprising marginal realisation cannot exceed the expected surprise of the least surprising marginal realisation.
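The following sketch (using the same kind of assumed toy distribution as above, not one taken from the paper) computes the synergistic entropy and checks numerically that $I(X;Y) = H(X \cap Y) - H(X \Cup Y)$, and hence that the synergistic entropy never exceeds the intersection entropy.

```python
import numpy as np

# Assumed joint distribution p_XY for illustration only.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

def h(p):
    return -np.log2(p)

H_joint = H_union = H_isect = I = 0.0
for x in range(2):
    for y in range(2):
        hx, hy, hxy = h(p_x[x]), h(p_y[y]), h(p_xy[x, y])
        H_joint += p_xy[x, y] * hxy
        H_union += p_xy[x, y] * max(hx, hy)       # union entropy
        H_isect += p_xy[x, y] * min(hx, hy)       # intersection entropy
        I       += p_xy[x, y] * (hx + hy - hxy)   # mutual information

H_syn = H_joint - H_union                         # synergistic entropy
assert np.isclose(I, H_isect - H_syn)             # mutual information identity
assert H_syn <= H_isect + 1e-12                   # consistent with I(X;Y) >= 0
```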

5. Properties of the Union and Intersection Information Content

Theorem 1 determined the functional form of Eve’s information when Alice and Bob share their knowledge with her. We now wish to generalise this result to consider the situation whereby an arbitrary number of marginal observers share their information with Eve. Rather than try to directly determine the functional form, however, we proceed by considering the algebraic structure of shared marginal information.
If Alice and Bob observe the same realisation x such that they have the same information $h(x)$, then upon sharing we would intuitively expect Eve to have the same information $h(x)$. Similarly, the minimum information that Eve could have received from either Alice or Bob should be the same information $h(x)$. Since the maximum and minimum operators are idempotent, the union and intersection information content both align with this intuition.
Property 1
(Idempotence). The union and intersection information content are idempotent,
$h(x \cup x) = h(x),$
$h(x \cap x) = h(x).$
It also seems reasonable to expect that Eve’s information should not depend on the order in which Alice and Bob share their information, nor should the minimum information that Eve could have received from either individual. Again, since the maximum and minimum operators are commutative, the union and intersection information content both align with our intuition.
Property 2
(Commutativity). The union and intersection information content are commutative,
$h(x \cup y) = h(y \cup x),$
$h(x \cap y) = h(y \cap x).$
Now, suppose that Charlie is another individual who, similar to Alice and Bob, is separately observing some process, and let the random variable Z represent her observations. Say that Dan is yet another individual with whom, similar to Eve, our observers can share their information. Intuitively, it should not matter whether Alice, Bob and Charlie share their information directly with Eve, or whether they share their information through Dan. To be specific, Alice and Bob could share their information with Dan such that his information is given by $h(x \cup y)$, and then Charlie and Dan could subsequently share their information with Eve such that her information is given by $h((x \cup y) \cup z)$. Similarly, Bob and Charlie could share their information with Dan such that his information is given by $h(y \cup z)$, and then Alice and Dan could subsequently share their information with Eve such that her information is given by $h(x \cup (y \cup z))$. Alternatively, Alice, Bob and Charlie could entirely bypass Dan and share their information directly with Eve such that her information is given by $h(x \cup y \cup z)$. Since the maximum operator is associative, the union information content is the same in all three cases and hence aligns with our intuition. A similar argument can be made to show that the intersection information content is also associative.
Property 3
(Associativity). The union and intersection information content are associative,
$h(x \cup y \cup z) = h((x \cup y) \cup z) = h(x \cup (y \cup z)),$
$h(x \cap y \cap z) = h((x \cap y) \cap z) = h(x \cap (y \cap z)).$
Suppose now that Alice and Bob share their information with Dan such that the information that he could have gotten from either Alice or Bob is given by $h(x \cap y)$. If Alice and Dan both share their information with Eve, then Eve’s information is given by
$h(x \cup (x \cap y)) = \max\{h(x), \min\{h(x), h(y)\}\} = h(x),$
and hence Bob’s information has been absorbed by Alice’s information. Now, suppose that Alice and Bob share their information with Dan such that his information is given by $h(x \cup y)$. If Alice and Dan both share their information with Eve, then the information that Eve could have gotten from either Alice or Dan is given by
$h(x \cap (x \cup y)) = \min\{h(x), \max\{h(x), h(y)\}\} = h(x).$
Again, Bob’s information has been absorbed by Alice’s information. Both of these results are a consequence of the fact that the maximum and minimum operators are connected to each other by the absorption identity.
Property 4
(Absorption). The union and intersection information content are connected by absorption,
$h(x \cup (x \cap y)) = h(x),$
$h(x \cap (x \cup y)) = h(x).$
Now, say that Daniella is, similar to Eve or Dan, an individual with whom our observers can share their information. Consider the following two cases: Firstly, suppose that Bob and Charlie share their information with Dan such that the information that Dan could have gotten from either Bob or Charlie is given by $h(y \cap z)$. If both Alice and Dan share their information with Eve, then her information is given by $h(x \cup (y \cap z))$. In the second case, suppose that Alice and Bob share their information with Dan such that his information is given by $h(x \cup y)$, while Alice and Charlie simultaneously share their information with Daniella such that her information is given by $h(x \cup z)$. If Dan and Daniella both share their information with Eve, then the information that she could have gotten from either Dan or Daniella is then given by $h((x \cup y) \cap (x \cup z))$. In both cases, Eve has the same information since the maximum operator is distributive,
$h(x \cup (y \cap z)) = \max\{h(x), \min\{h(y), h(z)\}\} = \min\{\max\{h(x), h(y)\}, \max\{h(x), h(z)\}\} = h((x \cup y) \cap (x \cup z)).$
Since the maximum and minimum operators are distributive over each other, regardless of whether Eve gets Alice’s information and Bob’s or Charlie’s information, or if Eve gets Alice’s and Bob’s information or Alice’s and Charlie’s information, Eve has the same information. The same reasoning can be applied to show that, regardless of whether Eve gets Alice’s information or Bob’s and Charlie’s information, or if Eve gets Alice’s or Bob’s information and Alice’s or Charlie’s information, Eve has the same information.
Property 5
(Distributivity). The union and intersection information content distribute over each other,
$h(x \cup (y \cap z)) = h((x \cup y) \cap (x \cup z)),$
$h(x \cap (y \cup z)) = h((x \cap y) \cup (x \cap z)).$
Now, consider a set of n individuals and let $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$ be the joint random variable that represents their observations. Suppose that these individuals together observe the joint realisation $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$ from $\mathbf{X}$. By Property 3 and the general associativity theorem, it is clear that Eve’s information is given by
$h(x_1 \cup x_2 \cup \cdots \cup x_n) = \max\{h(x_1), h(x_2), \ldots, h(x_n)\} \geq 0,$
while the minimum information that Eve could have gotten from any individual observer is given by
$h(x_1 \cap x_2 \cap \cdots \cap x_n) = \min\{h(x_1), h(x_2), \ldots, h(x_n)\} \geq 0.$
This accounts for the situation whereby n marginal observers directly share their information with Eve, and could clearly be considered for any subset S of the observers x . We now wish to consider all of the distinct ways that these marginal observers can share their information indirectly with Eve. As the following theorem shows, Properties 1–5 completely characterise the unique methods of marginal information sharing.
Theorem 2.
The marginal information contents form a join semi-lattice $\langle \mathbf{x}, h(\cup) \rangle$ under the max operator. Separately, the marginal information contents form a meet semi-lattice $\langle \mathbf{x}, h(\cap) \rangle$ under the min operator.
Proof. 
Properties 1–3 completely characterise semi-lattices [16,17]. □
Theorem 3.
The marginal information contents form a distributive lattice $\langle \mathbf{x}, h(\cup), h(\cap) \rangle$ under the max and min operators.
Proof. 
From Property 4, we have that the semi-lattices $\langle \mathbf{x}, h(\cup) \rangle$ and $\langle \mathbf{x}, h(\cap) \rangle$ are connected by absorption and hence form a lattice $\langle \mathbf{x}, h(\cup), h(\cap) \rangle$. By Property 5, this is a distributive lattice [16,17].□
Each way that a set of n observers can share their information with Eve such that she has distinct information corresponds to an element of a partially ordered set, or more specifically the free distributive lattice on n generators [16]. Figure 6 shows the free distributive lattices generated by n = 2 and n = 3 observers. The number of elements in this lattice is given by the nth Dedekind number (p. 273 [18]) (see also [19]). By the fundamental theorem of distributive lattices (or Birkhoff’s representation theorem), there is an isomorphism between the union information content and set union, and between the intersection information content and set intersection [16,17,20,21]. It is this one-to-one correspondence that justifies our use of the terms union and intersection information content for n variables in general. Every identity that holds in a lattice of sets will have a corresponding identity in this distributive lattice of information contents. Figure 6 also depicts the sets which correspond to each term in the lattice of information contents. Just as the cardinality of sets is non-decreasing as we consider moving up through the various terms in a lattice of sets, Eve’s information is non-decreasing as we move up through the various terms in the corresponding lattice of information contents. In particular, we can quantify the unique information content that Eve gets from one method of information sharing relative to any other method that is lower in the lattice.
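The size of this structure can be checked by brute force. The sketch below (a brute-force illustration added here, not code from the paper) enumerates the distinct max-of-min expressions that can be built from n information contents by evaluating each candidate expression on every 0/1 assignment. The counts of 4 and 18 for n = 2 and n = 3 equal the corresponding Dedekind numbers (6 and 20) minus the two trivial bounded elements, which do not correspond to any sharing of marginal information.

```python
from itertools import combinations, product

def distinct_sharing_terms(n):
    """Count the distinct lattice terms (max of mins) over n generators by
    identifying each expression with its truth table on 0/1 assignments."""
    variables = range(n)
    nonempty_subsets = [set(c) for k in range(1, n + 1) for c in combinations(variables, k)]
    tables = set()
    for k in range(1, len(nonempty_subsets) + 1):
        for family in combinations(nonempty_subsets, k):
            table = tuple(
                max(min(bits[i] for i in subset) for subset in family)
                for bits in product((0, 1), repeat=n)
            )
            tables.add(table)
    return len(tables)

print(distinct_sharing_terms(2))  # 4 distinct ways for two observers
print(distinct_sharing_terms(3))  # 18 distinct ways for three observers
```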
Every property of the union and intersection information content that we have considered thus far has been directly inherited by the union and intersection entropy. However, there is one final property that is not inherited by the entropies. If Alice and Bob share their information with Eve, then Eve’s information is given by either Alice’s or Bob’s information, and similarly for the information that Eve could have gotten from either Alice or Bob. As the subsequent theorem shows, this property enables us to greatly reduce the number of distinct terms in the distributive lattice for information content since any partially ordered set with a connex relation forms a total order.
Property 6
(Connexity). The union and intersection information content are given by at least one of
$h(x \cup y) = h(x) \text{ and } h(x \cap y) = h(y), \quad \text{or} \quad h(x \cup y) = h(y) \text{ and } h(x \cap y) = h(x).$

6. Generalised Marginal Information Sharing

We now use Properties 1–6 to generalise the results of Theorem 1 and Section 3.
Theorem 4.
The marginal information contents are a totally ordered set under the max and min operators.
Proof. 
A totally ordered set is a partially ordered set with the connex property (p. 2 [16]).□
Figure 7 shows the totally ordered sets generated by n = 2 and n = 3 observers, and also depicts the corresponding sets. Although the number of distinct terms has been reduced, Eve’s information is still non-decreasing as we move up through terms of the totally ordered set. If we now compare how much unique information Eve gets from a given method of information sharing relative to any other method of information sharing which is equal or lower in the totally ordered set, then we obtain a result which generalises (23) and (24) to consider more than two observers. Similarly, this total order enables us to generalise (25) using the maximum–minimum identity [22], which is a form of the principle of inclusion–exclusion [21] for a totally ordered set,
$h(x_1 \cap x_2 \cap \cdots \cap x_n) = \min\{h(x_1), h(x_2), \ldots, h(x_n)\} = \sum_{k=1}^{n} (-1)^{k-1} \sum_{\substack{S \subseteq \mathbf{x} \\ |S| = k}} \max\{h(s_1), h(s_2), \ldots, h(s_k)\} = \sum_{k=1}^{n} (-1)^{k-1} \sum_{\substack{S \subseteq \mathbf{x} \\ |S| = k}} h(s_1 \cup s_2 \cup \cdots \cup s_k),$
or, conversely,
$h(x_1 \cup x_2 \cup \cdots \cup x_n) = \sum_{k=1}^{n} (-1)^{k-1} \sum_{\substack{S \subseteq \mathbf{x} \\ |S| = k}} h(s_1 \cap s_2 \cap \cdots \cap s_k).$
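The maximum–minimum identity underlying these expressions is easy to verify numerically; the following sketch (an added illustration, not from the paper) checks it for an arbitrary set of surprisal values.

```python
import random
from itertools import combinations

# Stand-ins for the marginal information contents h(x_1), ..., h(x_n).
values = [random.random() for _ in range(4)]

# Maximum-minimum identity: the minimum equals the alternating sum, over all
# non-empty subsets, of the maxima (cf. the inclusion-exclusion form above).
alternating_sum = sum(
    (-1) ** (k - 1) * sum(max(subset) for subset in combinations(values, k))
    for k in range(1, len(values) + 1)
)
assert abs(min(values) - alternating_sum) < 1e-9
```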
Now that we have generalised the union and intersection information content, similar to Section 3, let us now consider taking the expectation value for each term in the distributive lattice. For every joint realisation $\mathbf{x}$ from $\mathbf{X}$, there is a corresponding distributive lattice of information contents. Hence, similar to (27) and (29)–(31), we can consider taking the expectation value of each term in the lattice over all realisations. Since the expectation is a linear operator, this yields a set of entropies that are also idempotent, commutative, associative, absorptive and distributive, only now over the random variables from $\mathbf{X}$. Thus, the information that Eve expects to gain from a single realisation for a particular method of information sharing also corresponds to a term in a free distributive lattice on n generators. This distributive lattice for entropies can be seen in Figure 6 by replacing x, y, z and h with X, Y, Z and H, respectively.
Crucially, however, Property 6 does not hold for the entropies—it is not true that Eve’s expected information $H(X \cup Y)$ is given by either Alice’s expected information $H(X)$ or Bob’s expected information $H(Y)$. Thus, despite the fact that the distributive lattice of information content can be reduced to a total order, the distributive lattice of entropies remains partially ordered. Although the information contents are totally ordered for every realisation, this order is not in general the same for every realisation. Consequently, when taking the expectation value across many realisations to yield the corresponding entropies, the total order is not maintained, and hence we are left with a partially ordered set of entropies. Indeed, we already saw the consequences of this result in Figure 3 whereby Alice’s and Bob’s information content was totally ordered for any one realisation, but their expected information was partially ordered.

7. Multivariate Information Decomposition

In Section 4, we used the shared marginal information from Section 3 to decompose the joint information content into four distinct components. Our aim now is to use the generalised notion of shared information from the previous section to produce a generalised decomposition of the joint information content. To begin, suppose that Johnny observes the joint realisation $(x,y,z)$ while Alice, Bob and Charlie observe the marginal realisations x, y and z, respectively, and say that Alice, Bob and Charlie share their information with Eve such that her information is given by $h(x \cup y \cup z)$. Clearly, Johnny has at least as much information as Eve,
$h(x,y,z) \geq h(x \cup y \cup z).$
Thus, we can compare how much more information Johnny has relative to Eve,
$h(x \Cup y \Cup z) = h(x,y,z) - h(x \cup y \cup z) = \min\{h(y,z|x), h(x,z|y), h(x,y|z)\} \geq 0.$
This non-negative function generalises the earlier definition of the synergistic information content (34) such that it now quantifies how much information one gets from knowing the joint probability $p_{XYZ}(x,y,z)$ relative to merely knowing the three marginal probabilities $p_X(x)$, $p_Y(y)$ and $p_Z(z)$. Figure 8 shows how this relationship can be represented using a Venn diagram.
Now, consider three more observers, Joan, Jonas, and Joanna, who observe the joint marginal realisations $(x,y)$, $(x,z)$ and $(y,z)$, respectively. Clearly, these additional observers greatly increase the number of distinct ways in which marginal information might be shared with Eve. For example, if Alice and Joanna share their information, then Eve’s information is given by $h(x \cup (y,z))$. Alternatively, if Joan and Jonas share their information, then Eve’s information is given by $h((x,y) \cup (x,z))$. Perhaps most interestingly, if Joan, Jonas and Joanna share their information, then Eve’s information is given by $h((x,y) \cup (x,z) \cup (y,z))$. Moreover, we know that Johnny has at least as much information as Eve has in this situation,
$h(x,y,z) \geq h((x,y) \cup (x,z) \cup (y,z)).$
Thus, by comparing how much more information Johnny has relative to Eve in this situation, we can define a new type of synergistic information content that quantifies how much information one gets from knowing the full joint realisation relative to merely knowing all of the pairwise marginal realisations,
$h((x,y) \Cup (x,z) \Cup (y,z)) = h(x,y,z) - h((x,y) \cup (x,z) \cup (y,z)) = \min\{h(z|x,y), h(y|x,z), h(x|y,z)\}.$
Of course, these new ways to share joint information are not just restricted to the union information. If Alice and Joanna share their information, then the information that Eve could have gotten from either is given by $h(x \cap (y,z))$. It is also worthwhile noting that this quantity is not less than the information that Eve could have gotten from either Alice’s information or Bob’s and Charlie’s information,
$h(x \cap (y,z)) \geq h(x \cap (y \cup z)).$
Thus, we can also consider defining new types of synergistic information content associated with these mixed-type comparisons,
$h(x \cap (y \Cup z)) = h(x \cap (y,z)) - h(x \cap (y \cup z)).$
However, it is important to note that this quantity does not equal $\min\{h(x), \min\{h(z|y), h(y|z)\}\}$.
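The final remark is easy to confirm numerically. In the sketch below (an added illustration with assumed probabilities, not values from the paper), the mixed-type quantity $h(x \cap (y \Cup z))$ is evaluated directly from its definition and compared against $\min\{h(x), \min\{h(z|y), h(y|z)\}\}$; the two generally differ.

```python
import numpy as np

def h(p):
    return -np.log2(p)

# Assumed probabilities for a single realisation (illustrative only):
p_x, p_y, p_z, p_yz = 0.5, 0.25, 0.125, 0.0625

# h(x ∩ (y ⋓ z)) = h(x ∩ (y,z)) - h(x ∩ (y U z))
mixed = min(h(p_x), h(p_yz)) - min(h(p_x), max(h(p_y), h(p_z)))
# min{h(x), min{h(z|y), h(y|z)}}
other = min(h(p_x), min(h(p_yz) - h(p_y), h(p_yz) - h(p_z)))

print(mixed, other)  # 0.0 and 1.0 here, so the two expressions are not equal
```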
With all of these new ways to share joint marginal information, it is not immediately clear how we should decompose Johnny’s information. Nevertheless, let us begin by considering the algebraic structure of joint information content. From the inequality (12), we know that any pair of marginal information contents $h(x)$ and $h(y)$ are upper-bounded by the joint information content $h(x,y)$. It is also easy to see that the joint information content is idempotent, commutative and associative. Together, these properties are sufficient for establishing that the algebraic structure of joint information content is that of a join semi-lattice [16] which we denote by $\langle \mathbf{x}; h(\cdot,\cdot) \rangle$. Figure 9 shows the semi-lattices generated by n = 2 and n = 3 observers.
We now wish to establish the relationship between this semi-lattice of joint information content $\langle \mathbf{x}; h(\cdot,\cdot) \rangle$ and the distributive lattice of shared marginal information $\langle \mathbf{x}; h(\cup), h(\cap) \rangle$. In particular, since our aim is to decompose Johnny’s information, consider the relationship between the join semi-lattice $\langle \mathbf{x}; h(\cdot,\cdot) \rangle$ and the meet semi-lattice $\langle \mathbf{x}; h(\cap) \rangle$, which is also depicted in Figure 9. In contrast to the semi-lattice of union information content $\langle \mathbf{x}; h(\cup) \rangle$, the semi-lattice $\langle \mathbf{x}; h(\cdot,\cdot) \rangle$ is not connected to the semi-lattice $\langle \mathbf{x}; h(\cap) \rangle$. Although the intersection information content absorbs the joint information content, since
$h(x \cap (x,y)) = h(x)$
for all $h(x)$ and $h(y)$, the joint information content does not absorb the intersection information content since $h(x, (x \cap y))$ is equal to $h(x,y)$ for $h(x) \geq h(y)$, i.e., is not equal to $h(x)$ as required for absorption. Since the join semi-lattice $\langle \mathbf{x}; h(\cdot,\cdot) \rangle$ is not connected to the meet semi-lattice $\langle \mathbf{x}; h(\cap) \rangle$ by absorption, their combined algebraic structure is not a lattice.
Despite the fact that the overall algebraic structure is not a lattice, there is a lattice sub-structure $\langle \mathcal{A}(\mathbf{x}), \preceq \rangle$ within the general structure. This substructure is isomorphic to the redundancy lattice from the partial information decomposition [23] (see also [24]), and its existence is a consequence of the fact that the intersection information content absorbs the joint information content in (68). To identify this lattice, we must first determine the reduced set of elements $\mathcal{A}(\mathbf{x})$ upon which it is defined. We begin by considering the set of all possible joint realisations, which is given by $\mathcal{P}_1(\mathbf{x})$ where $\mathcal{P}_1(\mathbf{x}) = \mathcal{P}(\mathbf{x}) \setminus \{\emptyset\}$. Elements of this set $\mathcal{P}_1(\mathbf{x})$ correspond to the elements from the join semi-lattice $\langle \mathbf{x}; h(\cdot,\cdot) \rangle$, e.g., the elements $\{x\}$ and $\{x,y\}$ correspond to $h(x)$ and $h(x,y)$, respectively. In alignment with Williams and Beer [23], we call the elements of $\mathcal{P}_1(\mathbf{x})$ sources and denote them by $\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_k$. Next, we consider the set of all possible collections of sources, which is given by the set $\mathcal{P}_1(\mathcal{P}_1(\mathbf{x}))$. Each collection of sources corresponds to an element of the meet semi-lattice $\langle \mathcal{P}_1(\mathbf{x}); h(\cap) \rangle$, or a particular way in which we can evaluate the intersection information content of a group of joint information contents. For example, the collections of sources $\{\{x\},\{y\}\}$ and $\{\{x\},\{y,z\}\}$ correspond to $h(x \cap y)$ and $h(x \cap (y,z))$, respectively. Not all of these collections of sources are distinct, however. Since the intersection information content absorbs the joint information content, we can remove the element $\{\{x\},\{x,y\}\}$ corresponding to $h(x \cap (x,y))$, as this information is already captured by the element $\{\{x\}\}$ corresponding to $h(x)$. In general, we can remove any collection of sources that corresponds to the intersection information content between a source $\mathbf{A}_i$ and any source $\mathbf{A}_j$ that is in the down-set ${\downarrow}\mathbf{A}_i$ with respect to the join semi-lattice $\langle \mathbf{x}; h(\cdot,\cdot) \rangle$. (A definition of the down-set can be found in [17]. Informally, the down-set ${\downarrow}\mathbf{A}$ is the set of all elements that precede $\mathbf{A}$.) By removing all such collections of sources, we get the following reduced set of collections of sources,
$\mathcal{A}(\mathbf{x}) = \{\, \alpha \in \mathcal{P}_1(\mathcal{P}_1(\mathbf{x})) : \forall \mathbf{A}_i, \mathbf{A}_j \in \alpha,\ \mathbf{A}_i \not\subset \mathbf{A}_j \,\}.$
Formally, this set corresponds to the set of antichains on the lattice $\langle \mathcal{P}_1(\mathbf{x}), \subseteq \rangle$, excluding the empty set [23].
Now that we have determined the elements upon which the lattice sub-structure is defined, we must show that they indeed form a lattice. Recall that when constructing the set $\mathcal{A}(\mathbf{x})$, we first considered the ordered elements of the semi-lattice $\langle \mathbf{x}; h(\cdot,\cdot) \rangle$ and then subsequently considered the ordered elements of the semi-lattice $\langle \mathcal{P}_1(\mathbf{x}); h(\cap) \rangle$. Thus, we need to show that these two orders can be combined together into one new ordering relation over the set $\mathcal{A}(\mathbf{x})$. This can be done by extending the approach underlying the construction of the set $\mathcal{A}(\mathbf{x})$ to consider any pair of collections of sources $\alpha$ and $\beta$ from $\mathcal{A}(\mathbf{x})$. In particular, the collection of sources $\alpha$ precedes the collection of sources $\beta$ if and only if, for every source $\mathbf{B}$ from $\beta$, there exists a source $\mathbf{A}$ from $\alpha$ such that $\mathbf{A}$ is in the down-set ${\downarrow}\mathbf{B}$ with respect to the join semi-lattice $\langle \mathbf{x}; h(\cdot,\cdot) \rangle$, or formally,
$\forall \alpha, \beta \in \mathcal{A}(\mathbf{x}),\quad \big( \alpha \preceq \beta \iff \forall \mathbf{B} \in \beta,\ \exists \mathbf{A} \in \alpha,\ \mathbf{A} \subseteq \mathbf{B} \big).$
The fact that $\langle \mathcal{A}(\mathbf{x}), \preceq \rangle$ forms a lattice was proved by Crampton and Loizou [25,26], where the corresponding lattice is denoted $\mathcal{A}(X)$ in their notation. Furthermore, they showed that this lattice is isomorphic to the free distributive lattice, and hence the number of elements in the set $\mathcal{A}(\mathbf{x})$ for n marginal observers is also given by the nth Dedekind number (p. 273 [18]) (see also [19]). Crampton and Loizou [26] also provided the meet ∧ and join ∨ operations for this lattice, which are given by
$\alpha \wedge \beta = \underline{\alpha \cup \beta},$
$\alpha \vee \beta = \underline{\{\, \mathbf{A} \cup \mathbf{B} : \mathbf{A} \in \alpha,\ \mathbf{B} \in \beta \,\}},$
where $\underline{\alpha}$ denotes the set of minimal elements of $\alpha$ with respect to the semi-lattice $\langle \mathbf{x}; h(\cdot,\cdot) \rangle$. (A definition of the set of minimal elements can be found in [17]. Informally, $\underline{\alpha}$ is the set of sources of $\alpha$ that are not preceded by any other sources from $\alpha$ with respect to the semi-lattice $\langle \mathbf{x}; h(\cdot,\cdot) \rangle$.) This lattice $\langle \mathcal{A}(\mathbf{x}), \preceq \rangle$ is the aforementioned sub-structure that is isomorphic to the redundancy lattice from Williams and Beer [23]. However, as it is a lattice over information contents, it is actually equivalent to the specificity lattice from [27]. Figure 10 depicts the redundancy lattice of information contents for n = 2 and n = 3 marginal observers.
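This construction is straightforward to reproduce computationally. The sketch below (an added illustration, not code from the paper) enumerates the collections of sources $\mathcal{A}(\mathbf{x})$ as antichains of non-empty subsets and implements the ordering relation above; for three variables it recovers the 18 elements of the redundancy lattice, with $\{\{x\},\{y\},\{z\}\}$ at the bottom and $\{\{x,y,z\}\}$ at the top.

```python
from itertools import combinations

def redundancy_lattice_elements(n):
    """Antichains of non-empty subsets of {0, ..., n-1}, excluding the empty
    collection, together with the ordering alpha <= beta iff every source in
    beta has some source in alpha beneath it."""
    variables = range(n)
    sources = [frozenset(c) for k in range(1, n + 1) for c in combinations(variables, k)]
    collections = []
    for k in range(1, len(sources) + 1):
        for family in combinations(sources, k):
            # keep only antichains: no source strictly contains another
            if all(not a < b and not b < a for a, b in combinations(family, 2)):
                collections.append(frozenset(family))
    def precedes(alpha, beta):
        return all(any(a <= b for a in alpha) for b in beta)
    return collections, precedes

elements, precedes = redundancy_lattice_elements(3)
print(len(elements))  # 18 collections of sources for three variables

bottom = frozenset({frozenset({0}), frozenset({1}), frozenset({2})})  # {{x},{y},{z}}
top = frozenset({frozenset({0, 1, 2})})                               # {{x,y,z}}
assert all(precedes(bottom, alpha) and precedes(alpha, top) for alpha in elements)
```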
Similar to how Eve’s information is non-decreasing as we move up through the terms of the distributive lattice of shared information, the redundancy lattice of information contents enables us to see that, for example, the information that Eve could have gotten from either Alice or Joanna, $h(x \cap (y,z))$, is no less than the information that Eve could have gotten from Alice or Bob, $h(x \cap y)$. Thus, by taking the information $h(\alpha)$ associated with the collection of sources $\alpha$ from $\mathcal{A}(\mathbf{x})$ and subtracting from it the information $h(\alpha_i)$ associated with any collection of sources $\alpha_i$ from the down-set ${\downarrow}\alpha$, we can evaluate the unique information $h(\alpha \setminus \alpha_i)$ provided by $\alpha$ relative to $\alpha_i$. Moreover, as per Williams and Beer [23], we can derive a function that quantifies the partial information content $h_\partial(\alpha)$ associated with the collection of sources $\alpha$ that is not available in any of the collections of sources that are covered by $\alpha$. (The set of collections of sources that are covered by $\alpha$ is denoted $\alpha^-$. A definition of the covering relation is provided in [17]. Informally, $\alpha^-$ is the set of collections of sources that immediately precede $\alpha$.) Formally, this function corresponds to the Möbius inverse of h on the redundancy lattice $\langle \mathcal{A}(\mathbf{x}), \preceq \rangle$, and can be defined implicitly by
$h(\alpha) = \sum_{\beta \preceq \alpha} h_\partial(\beta).$
By subtracting away the partial information terms that strictly precede $\alpha$ from both sides, it is easy to see that the partial information content $h_\partial(\alpha)$ can be calculated recursively from the bottom of the redundancy lattice of information contents,
$h_\partial(\alpha) = h(\alpha) - \sum_{\beta \prec \alpha} h_\partial(\beta).$
As the following theorem shows, the partial information content $h_\partial(\alpha)$ can be written in closed form.
Theorem 5.
The partial information content $h_\partial(\alpha)$ is given by
$h_\partial(\alpha) = h(\alpha) - h(\alpha^-_1 \cup \alpha^-_2 \cup \cdots \cup \alpha^-_{|\alpha^-|}) = h(\alpha) - \max\{h(\alpha^-_1), h(\alpha^-_2), \ldots, h(\alpha^-_{|\alpha^-|})\} \geq 0,$
where each $\alpha^-_i$ is a collection of sources from $\alpha^-$.
Proof. 
For $S \subseteq \mathcal{A}(\mathbf{x})$, define the set-additive function
$f(S) = \sum_{\beta \in S} h_\partial(\beta).$
From (73), we have that $h(\alpha) = f({\downarrow}\alpha)$. The partial information can then be obtained by subtracting the set-additive function on the strict down-set $\dot{\downarrow}\alpha$ from the set-additive function on the down-set ${\downarrow}\alpha$,
$h_\partial(\alpha) = f({\downarrow}\alpha) - f(\dot{\downarrow}\alpha) = f({\downarrow}\alpha) - f\Big( \bigcup_{\beta \in \alpha^-} {\downarrow}\beta \Big).$
By applying the principle of inclusion–exclusion [21], we get that
$h_\partial(\alpha) = f({\downarrow}\alpha) - \sum_{k=1}^{|\alpha^-|} (-1)^{k-1} \sum_{\substack{S \subseteq \alpha^- \\ |S| = k}} f\Big( \bigcap_{\sigma \in S} {\downarrow}\sigma \Big).$
For any lattice L and $A \subseteq L$, we have that $\bigcap_{a \in A} {\downarrow}a$ is equal to ${\downarrow}\big(\bigwedge A\big)$ (p. 57 [17]), and since the meet operation is given by the intersection information content, we have that
$h_\partial(\alpha) = f({\downarrow}\alpha) - \sum_{k=1}^{|\alpha^-|} (-1)^{k-1} \sum_{\substack{S \subseteq \alpha^- \\ |S| = k}} h(s_1 \cap s_2 \cap \cdots \cap s_k) = h(\alpha) - h(\alpha^-_1 \cup \alpha^-_2 \cup \cdots \cup \alpha^-_{|\alpha^-|}),$
where the final step has been made using (61) and (76).□
The closed-form solution (75) from Theorem 5 is the same as the closed-form solution presented in Theorem A2 from Finn and Lizier [27]. This, together with the aforementioned fact that the lattice $\langle \mathcal{A}(\mathbf{x}), \preceq \rangle$ is equivalent to the specificity lattice, means that each partial information content $h_\partial(\alpha)$ is equal to the partial specificity $i^+(\alpha \to t)$ from (A22) of [27]. As such, the partial information decomposition presented in this paper is equivalent to the pointwise partial information decomposition presented in [27].
Let us now use the closed-form solution (75) from Theorem 5 to evaluate the partial information contents for the n = 2 redundancy lattice of information contents. Starting from the bottom, we get the intersection information content,
$h_\partial(x \cap y) = h(x \cap y),$
followed by the unique information contents,
$h_\partial(x) = h(x) - h(x \cap y) = h(x \setminus y),$
$h_\partial(y) = h(y) - h(x \cap y) = h(y \setminus x),$
and, finally, the synergistic information content,
$h_\partial(x,y) = h(x,y) - h(x \cup y) = h(x \Cup y).$
It is clear that these partial information contents recover the intersection, unique and synergistic information contents from Section 3 and Section 4. Moreover, by inserting these partial terms back into (73) for $\alpha = \{\{x,y\}\}$, we recover the earlier decomposition (35) of Johnny’s information,
$h(x,y) = h_\partial(x \cap y) + h_\partial(x) + h_\partial(y) + h_\partial(x,y) = h(x \cap y) + h(x \setminus y) + h(y \setminus x) + h(x \Cup y).$
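For a single realisation, this n = 2 inversion can be carried out directly from the closed form of Theorem 5. The sketch below (an added illustration with assumed probabilities, not values from the paper) computes the four partial information contents and verifies that they sum to Johnny's joint information content.

```python
import numpy as np

def h(p):
    return -np.log2(p)

# Assumed probabilities for one joint realisation (x, y); illustrative only.
p_x, p_y, p_xy = 0.5, 0.25, 0.2

partial = {
    "intersection": min(h(p_x), h(p_y)),              # h_partial(x ∩ y)
    "unique x":     max(h(p_x) - h(p_y), 0.0),        # h_partial(x) = h(x \ y)
    "unique y":     max(h(p_y) - h(p_x), 0.0),        # h_partial(y) = h(y \ x)
    "synergy":      h(p_xy) - max(h(p_x), h(p_y)),    # h_partial(x,y) = h(x,y) - h(x U y)
}
# The four atoms of the n = 2 redundancy lattice sum to h(x, y).
assert np.isclose(sum(partial.values()), h(p_xy))
```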
Of course, our aim is to generalise this result such that we can decompose the joint information content for an arbitrary number of marginal realisations. This can be done by first evaluating the partial information contents over the redundancy lattice corresponding to n marginal realisations, and then subsequently inserting the results back into (73) for $\alpha = \{\{x_1, x_2, \ldots, x_n\}\}$. For example, we can invert the n = 3 redundancy lattice of information contents, which yields the partial information contents shown in Figure 10. (The inversion is evaluated in Appendix A.) When inserted back into (73), we get the following decomposition for Johnny’s information,
$h(x,y,z) = h(x \cap y \cap z) + h((x \cap y) \setminus z) + h((x \cap z) \setminus y) + h((y \cap z) \setminus x) + h(x \cap (y \Cup z)) + h(y \cap (x \Cup z)) + h(z \cap (x \Cup y)) + h(x \setminus (y,z)) + h(y \setminus (x,z)) + h(z \setminus (x,y)) + h((x \Cup y) \cap (x \Cup z) \cap (y \Cup z)) + h(((x \Cup y) \cap (x \Cup z)) \setminus (y,z)) + h(((x \Cup y) \cap (y \Cup z)) \setminus (x,z)) + h(((x \Cup z) \cap (y \Cup z)) \setminus (x,y)) + h((x \Cup y) \setminus ((x,z) \cap (y,z))) + h((x \Cup z) \setminus ((x,y) \cap (y,z))) + h((y \Cup z) \setminus ((x,y) \cap (x,z))) + h((x,y) \Cup (x,z) \Cup (y,z)).$
Finally, we can also consider taking the expectation value of each term in the redundancy lattice of information contents. Since the expectation is a linear and monotonic operator, the resulting expectation values will inherit the structure of the redundancy lattice of information contents and so form a redundancy lattice of entropies, i.e., Figure 10 with x, y, z and h replaced by X, Y, Z and H, respectively. By inverting the n = 2 redundancy lattice of entropies, we can recover the decomposition (40) from Figure 5. Furthermore, inverting the n = 3 lattice generalises this result and is depicted in Figure 11.

8. Union and Intersection Mutual Information

Suppose that Alice, Bob and Johnny are now additionally and commonly observing the variable Z. When a realisation $(x,y,z)$ occurs, Alice’s information for z is given by the conditional information content $h(x|z)$, while Bob’s conditional information is given by $h(y|z)$ and Johnny’s conditional information is given by $h(x,y|z)$. By using the same argument as in Section 3, it is easy to see that Eve’s conditional information given z is given by the conditional union information content,
$h(x \cup y \,|\, z) = \max\{h(x|z), h(y|z)\}.$
Likewise, we can define the conditional unique information contents and conditional intersection information content, respectively,
$h(x \setminus y \,|\, z) = h(x \cup y \,|\, z) - h(y|z) = \max\{h(x|z) - h(y|z),\, 0\},$
$h(y \setminus x \,|\, z) = h(x \cup y \,|\, z) - h(x|z) = \max\{0,\, h(y|z) - h(x|z)\},$
$h(x \cap y \,|\, z) = h(x|z) + h(y|z) - h(x \cup y \,|\, z) = \min\{h(x|z), h(y|z)\}.$
Furthermore, since Johnny’s conditional information $h(x,y|z)$ is no less than Eve’s conditional information content $h(x \cup y \,|\, z)$, we can also define the conditional synergistic information content,
$h(x \Cup y \,|\, z) = h(x,y|z) - h(x \cup y \,|\, z) = \min\{h(y|x,z), h(x|y,z)\}.$
Similar to (35), we can decompose Johnny’s conditional information h ( x , y | z ) into the following components,
$h(x,y|z) = h(x \cup y \,|\, z) + h(x \Cup y \,|\, z) = h(x \cap y \,|\, z) + h(x \setminus y \,|\, z) + h(y \setminus x \,|\, z) + h(x \Cup y \,|\, z).$
Moreover, similar to (38), the conditional mutual information content is equal to the difference between the conditional intersection information content and the conditional synergistic information content,
$i(x;y|z) = h(x|z) + h(y|z) - h(x,y|z) = h(x \cap y \,|\, z) - h(x \Cup y \,|\, z).$
Notice that all of the above definitions directly correspond to the definitions of the unconditioned quantities, with all probability distributions conditioned on z here.
Let us now consider how much information each of our observers have about the commonly observed realisation z. The information that Alice has about z from observing x is given by the mutual information content,
$i(x;z) = h(x) - h(x|z).$
Similarly, Bob’s information about z is given by i ( y ; z ) , while Johnny’s information is given by the joint mutual information content i ( x , y ; z ) . Thus, the question naturally arises—are we able to quantify how much information Eve has about the realisation z from knowing Alice’s and Bob’s shared information?
Clearly, we could consider defining the union mutual information content,
$i(x \cup y; z) = h(x \cup y) - h(x \cup y \,|\, z).$
It is important to note that, while the mutual information can be defined in three different ways, $i(x;z) = h(x) - h(x|z) = h(x) + h(z) - h(x,z) = h(z) - h(z|x)$, there is only one way in which one can define this function. (Indeed, this point aligns well with our argument based on exclusions presented in [28].) Similar to (94), we could consider respectively defining the unique mutual information contents, the intersection mutual information content and synergistic mutual information content,
i(x\y;z) = h(x\y) − h(x\y|z),
i(y\x;z) = h(y\x) − h(y\x|z),
i(x∩y;z) = h(x∩y) − h(x∩y|z),
i(x⊎y;z) = h(x⊎y) − h(x⊎y|z).
As with the mutual information content (18), there is nothing to suggest that these quantities are non-negative. Of course, the mutual information or expected mutual information content (20) is non-negative. Thus, with this in mind, consider defining the union mutual information
I(X∪Y;Z) = E_XYZ[ i(x∪y;z) ].
However, there is nothing to suggest that this function is non-negative. Consequently, it is dubious to claim that this function represents Eve's expected information about Z, and it is similarly fallacious to say that Eve's information about z is given by the union mutual information content (94). Indeed, by inserting the definitions (21) and (86) into (94), it is easy to see why it is difficult to interpret these functions,
i(x∪y;z) = max[h(x), h(y)] − max[h(x|z), h(y|z)] = max{ min[i(x;z), h(x) − h(y|z)], min[h(y) − h(x|z), i(y;z)] }.
That is, the union mutual information content can mix the information content provided by one realisation with the conditional information content provided by another. Thus, there is no guarantee that this function's expected value will be non-negative. It is perhaps best to interpret this function as a difference between two surprisals, rather than as a function which represents information. Of course, similar to the multivariate mutual information (9), the union mutual information can be used as a summary quantity provided one is careful not to misinterpret its meaning. The same is true for the unique mutual informations, intersection mutual information and synergistic mutual information, which we can similarly define,
I(X\Y;Z) = E_XYZ[ i(x\y;z) ],
I(Y\X;Z) = E_XYZ[ i(y\x;z) ],
I(X∩Y;Z) = E_XYZ[ i(x∩y;z) ],
I(X⊎Y;Z) = E_XYZ[ i(x⊎y;z) ].
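To see concretely how the pointwise quantities underlying these expectations can behave, suppose, purely as an illustrative example, that X and Y are independent fair bits and that Z reproduces X with probability 0.9. For the realisation (x, y, z) = (0, 0, 1) we have h(x) = h(y) = h(y|z) = 1 bit and h(x|z) = log2(10) ≈ 3.32 bits, so that i(x∪y;z) = max[h(x), h(y)] − max[h(x|z), h(y|z)] = 1 − 3.32 ≈ −2.32 bits. Here the union mutual information content effectively subtracts Alice's conditional information content from Bob's (equally large) marginal information content, which is precisely the mixing of terms discussed above.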
Despite lacking the clear interpretation that we had for the information contents, these functions share a similar algebraic structure. For example, by using (35) and (91), we can decompose the joint mutual information content into the following components,
i(x,y;z) = i(x∩y;z) + i(x\y;z) + i(y\x;z) + i(x⊎y;z),
which is similar to the earlier decomposition (35) of the joint information content. Moreover, by using (38) and (92), we find that the multivariate mutual information content is given by the difference between the intersection mutual information content and the synergistic mutual information content,
i(x;y;z) = i(x;y) − i(x;y|z) = h(x∩y) − h(x⊎y) − h(x∩y|z) + h(x⊎y|z) = i(x∩y;z) − i(x⊎y;z).
Of course, since the expectation value is a linear operator, both of these results carry over to the expected quantities. Hence, the joint mutual information can be decomposed into the following components,
I(X,Y;Z) = I(X∩Y;Z) + I(X\Y;Z) + I(Y\X;Z) + I(X⊎Y;Z),
while the multivariate mutual information is equal to the intersection mutual information minus the synergistic mutual information,
I(X;Y;Z) = I(X;Y) − I(X;Y|Z) = H(X∩Y) − H(X⊎Y) − H(X∩Y|Z) + H(X⊎Y|Z) = I(X∩Y;Z) − I(X⊎Y;Z).
This latter result aligns with Williams and Beer’s prior result that the multivariate mutual information conflates redundant and synergistic information (Equation (14) [23]).
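As a numerical check of these last two identities, the short sketch below averages the pointwise quantities of this section over a randomly generated joint distribution. The distribution p, the variable names and the dictionary keys are assumptions made purely for this illustration; the assertions verify the decomposition of I(X,Y;Z) and the relation between the multivariate mutual information, the intersection mutual information and the synergistic mutual information.

import numpy as np
from itertools import product

# Illustrative joint pmf p(x, y, z); an assumption for this check only.
rng = np.random.default_rng(1)
p = rng.random((2, 2, 2))
p /= p.sum()

h = lambda q: -np.log2(q)
p_x, p_y, p_z = p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))
p_xy, p_xz, p_yz = p.sum(axis=2), p.sum(axis=1), p.sum(axis=0)

def pointwise(x, y, z):
    hx, hy, hxy = h(p_x[x]), h(p_y[y]), h(p_xy[x, y])
    hx_z, hy_z = h(p_xz[x, z] / p_z[z]), h(p_yz[y, z] / p_z[z])
    hxy_z = h(p[x, y, z] / p_z[z])
    un, un_z = max(hx, hy), max(hx_z, hy_z)          # h(x∪y), h(x∪y|z)
    return {
        "union":     un - un_z,                      # i(x∪y;z)
        "unique_x":  (un - hy) - (un_z - hy_z),      # i(x\y;z)
        "unique_y":  (un - hx) - (un_z - hx_z),      # i(y\x;z)
        "intersect": min(hx, hy) - min(hx_z, hy_z),  # i(x∩y;z)
        "synergy":   (hxy - un) - (hxy_z - un_z),    # i(x⊎y;z)
        "joint":     hxy - hxy_z,                    # i(x,y;z)
        "coinfo":    (hx + hy - hxy) - (hx_z + hy_z - hxy_z),  # i(x;y;z)
    }

# Average each pointwise quantity over the joint distribution.
avg = {k: 0.0 for k in pointwise(0, 0, 0)}
for x, y, z in product(range(2), repeat=3):
    for k, v in pointwise(x, y, z).items():
        avg[k] += p[x, y, z] * v

# Joint mutual information decomposes into intersection, unique and synergistic parts.
assert np.isclose(avg["joint"],
                  avg["intersect"] + avg["unique_x"] + avg["unique_y"] + avg["synergy"])
# Multivariate mutual information equals intersection minus synergy.
assert np.isclose(avg["coinfo"], avg["intersect"] - avg["synergy"])
print({k: round(v, 4) for k, v in avg.items()})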

9. Conclusions

The main aim of this paper has been to understand and quantify the distinct ways that a set of marginal observers can share their information with some non-observing third party. To accomplish this objective, we examined the distinct ways in which two marginal observers, Alice and Bob, can share their information with the non-observing individual, Eve, and introduced several novel information-theoretic quantities: the union information content, which quantifies how much information Eve gets from Alice and Bob; the intersection information content, which quantifies how much information Eve could have gotten from either Alice or Bob; and the unique information content, which quantifies how much information Eve gets from Alice relative to Bob, and vice versa. We then investigated the algebraic structure of these new measures of shared marginal information and showed that the structure of shared marginal information is that of a distributive lattice. Next, by using the fundamental theorem of distributive lattices, we showed that these new measures are isomorphic to the various unions and intersections of sets. This isomorphism is similar to Yeung’s correspondence between multivariate mutual information and signed measure [6,7]. However, in contrast to Yeung’s correspondence, the measures of information content presented in this paper are non-negative and maintain a clear operational meaning regardless of the number of realisations or variables involved. (This is, of course, excepting the mutual information contents presented in Section 8, which are not non-negative.)
The appearance of a lattice structure within the context of information theory is by no means novel. Han [12] developed a lattice-theoretic description of the entropy over a Boolean lattice generated by a set of random variables. This lattice encapsulates all linear sums and differences of the basic information-theoretic quantities, i.e., entropy, conditional entropy, mutual information and conditional mutual information. Moreover, this lattice structure captures several of the existing multivariate generalisations of mutual information [29], including the aforementioned multivariate mutual information (9) (which is also known as the interaction information [10], amount of information [2] or co-information [11]), the total correlation [30] (which is also known as the multivariate constraint [31], multi-information [32] or integration [33]), the dual total correlation [12] (which is also known as binding information [34]) and the novel measure of multivariate mutual information defined by Chan et al. [29] (see Han [12] and Chan et al. [29] for further details). Similar to the lattice of shared marginal information content, Han’s lattice is distributive—indeed, on a fundamental level, it is this algebraic structure that enables Yeung [6,7] to establish a correspondence with signed measure. Nevertheless, there are two important differences to note between Han’s information lattice and the lattice of shared marginal information content: Firstly, Han’s lattice is based upon the entropies of random variables rather than the information content of realisations. In principle, there is no reason why one could not consider the information content of a Boolean lattice generated by a set of realisations (although the mutual information content would not be non-negative). Secondly, the Möbius inverse on Han’s information lattice yields the multivariate mutual information (9), which is not non-negative. In contrast, the partial information contents (75) that result from the Möbius inversion of the lattice of shared marginal information content are non-negative. Thus, in contrast to the multivariate mutual information, the new measures of multivariate information presented in this paper maintain their operational meaning for any number of random variables.
Similar to Han, Shannon [35] introduced his own information lattice, although it is based upon the notion of common information. In comparison to Shannon’s other work, this paper is not well recognised. Indeed, this common information was later independently proposed and studied by Gács and Körner [36]. Shannon’s original paper is relatively brief; however, Li and Chong [37] expanded upon Shannon’s discussion by formalising his argument in terms of σ-algebras and sample space partitions (see also [38]). To be specific, they described a random variable X as “being-richer-than” another random variable Y if the former’s sample space partition is finer than the latter’s sample space partition. Moreover, if their σ-algebras coincide, then two random variables are said to be informationally equivalent. This relation naturally forms a partial order over a set of random variables. For all X and Y, the joint variable (X, Y) is the poorest amongst all of the variables that are richer than both X and Y. Conversely, one can define a random variable Z that is the richest amongst all of the variables that are poorer than both X and Y. The entropy of this common variable Z defines the aforementioned common information. In contrast to the joint variable (X, Y), it is relatively difficult to characterise the common variable Z [36,37,39]. Nevertheless, its existence is sufficient for the definition of Shannon’s information lattice [35,37]. There are several features that distinguish this lattice from the lattice of shared marginal information. Firstly, similar to Han’s information lattice, the joint entropy and common information are defined in terms of entire random variables, rather than the information content of realisations. Secondly, even if we were to restrict ourselves to comparing Shannon’s information lattice to the lattice of shared marginal entropy, the meet and join operations for these lattices are fundamentally different. We have already discussed the difference between their respective join operations, i.e., the joint entropy and union entropy, in Section 3 and Section 4. If we consider their respective meet operations, we find that the common information is relatively restrictive compared to the intersection entropy, due to the fact that the common information requires one to identify the common random variable Z. This follows from the fact that the intersection information is greater than or equal to the mutual information (43), which is in turn greater than or equal to the common information [36]. Finally, in general, Shannon’s information lattice is not distributive, nor is it even modular [35,37]. Thus, unlike the lattice of shared marginal information or Han’s information lattice, the fundamental theorem of distributive lattices is not applicable, and hence Shannon’s information lattice does not inherit any set-like identities.
The secondary objective of this paper has been to understand and demonstrate how we can use the measures of shared information content to decompose multivariate information. We began by comparing the union information content to the joint information content and used this comparison to define a measure of synergistic information content that captures how much more information a full joint observer, Johnny, has relative to an individual, Eve, who knows which joint realisation has occurred, but only knows the marginal distributions. We showed how one can use this measure, together with the measures of shared information content, to decompose the joint information content. We then compared the algebraic structure of joint information to the lattice structure of shared information, and showed how one can find the redundancy lattice from the partial information decomposition [23] embedded within this larger algebraic structure. More specifically, since this paper considers information contents, this redundancy lattice is actually the same as the specificity lattice from pointwise partial information decomposition [27,28]. This observation connects the work presented in this paper to the existing body of theoretical literature on information decomposition [23,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62], and its applications [63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86]. (For a brief summary of this literature, see [24].) Nevertheless, in contrast to the pointwise partial information decomposition [27,28], most of these approaches aim to decompose the average mutual information rather than the information content. The ability to decompose information content, and pointwise mutual information, provides a unique perspective on multivariate dependency.
To our knowledge, the only other approach that attempts to provide this pointwise perspective is due to Ince [87]. Ince’s approach proposes a method of information decomposition based upon the entropy, which can also be applied to the information content (or, in Ince’s terminology, the local entropy). Of particular relevance to this paper, Ince obtains a result that is equivalent to (38) whereby the mutual information content is equal to the redundant information content minus the synergistic information content (Equation (5) [87]). However, Ince’s definition of redundant information content differs from that of the intersection information content in (38). To be specific, it is based upon the sign of the multivariate mutual information content (or pointwise co-information), which is interpreted as a measure of “the set-theoretic overlap” of multiple information contents (or local entropies) (p. 7 [87]). However, as discussed in Section 1, this set-theoretic interpretation of the multivariate mutual information (co-information) is problematic. To account for these difficulties, Ince disregards the negative values, defining the redundant information content to be equal to the multivariate mutual information content when it is positive, and to be zero otherwise.
There are several avenues of inquiry for which this research will yield new insights, particularly in complex systems, neuroscience and communications theory. For instance, these measures might be used to better understand and quantify distributed intrinsic computation [66,79]. It is well known that the dynamics of individual regions in the brain depend synergistically on multiple other regions; synergistic information content might provide a means to quantify such dependencies in neural data [69,77,88,89,90]. Furthermore, these measures might be helpful for quantifying the synergistic encodings used in network coding [7]. Finally, it is well known that many biological traits are not dependent on any one gene, but rather are synergistically dependent on two or more genes, and the decomposed information provides a means to quantify the unique, redundant and synergistic dependencies between a trait and a set of genes [91,92,93,94].

Author Contributions

C.F. and J.T.L. conceived the idea; C.F. prepared the original draft; and J.T.L. reviewed and edited the draft. All authors have read and agreed to the published version of the manuscript.

Funding

JL was supported through the Australian Research Council DECRA grant DE160100630 and through The University of Sydney Research Accelerator (SOAR) Fellowship program.

Acknowledgments

The authors would like to thank Nathan Harding, Richard Spinney, Daniel Polani, Leonardo Novelli and Oliver Cliff for helpful discussions relating to this manuscript.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A

We can use the closed-form (75) to evaluate the partial information contents for the n = 3 redundancy lattice of information contents. Starting from the bottom and working up through the redundancy lattice, we have the following partial information contents:
h_∂(x∩y∩z) = h(x∩y∩z);
h_∂(x∩y) = h(x∩y) − h(x∩y∩z) = h((x∩y)\z);
h_∂(x∩(y,z)) = h(x∩(y,z)) − h((x∩y)∪(x∩z)) = h(x∩(y,z)) − h(x∩(y∪z)) = h(x∩(y⊎z));
h_∂(x) = h(x) − h(x∩(y,z)) = h(x\(y,z));
h_∂((x,y)∩(x,z)∩(y,z)) = h((x,y)∩(x,z)∩(y,z)) − h((x∩(y,z)) ∪ (y∩(x,z)) ∪ (z∩(x,y))) = h((x,y)∩(x,z)∩(y,z)) − h(x∩(y⊎z)) − h(y∩(x⊎z)) − h(z∩(x⊎y)) − h((x∩y)\z) − h((x∩z)\y) − h((y∩z)\x) − h(x∩y∩z) = h((x,y)∩(x,z)∩(y,z)) − h(x∩(y,z)) − h(y∩(x⊎z)) − h(z∩(x⊎y)) − h((y∩z)\x) = h((x⊎y)∩(x⊎z)∩(y⊎z))  (by Lemma A9);
h_∂((x,y)∩(x,z)) = h((x,y)∩(x,z)) − h(x ∪ ((x,y)∩(x,z)∩(y,z))) = h((x,y)∩(x,z)) − h(x) − h((x,y)∩(x,z)∩(y,z)) + h(x∩((x,y)∩(x,z)∩(y,z))) = h(((x,y)∩(x,z))\(y,z)) − h(x\(y,z)) = h(((x,y)\(y,z))∩((x,z)\(y,z))) − h(x\(y,z))  (by Lemma A1) = h(((x⊎y)∩(x⊎z))\(y,z))  (by Lemma A2);
h_∂((x,y)) = h(x,y) − h(((x,y)∩(x,z)) ∪ ((x,y)∩(y,z))) = h(x,y) − h((x,y)∩((x,z)∪(y,z))) = h((x,y)\((x,z)∪(y,z))) = h((x⊎y)\((x,z)∪(y,z)))  (by Lemma A3);
h_∂(x,y,z) = h(x,y,z) − h((x,y)∪(x,z)∪(y,z)) = h((x,y)⊎(x,z)⊎(y,z)).
Lemma A1.
We have the following identity,
h((x∩y)\z) = h((x\z)∩(y\z)).
Proof. 
From (23) and (25), we have that
h((x∩y)\z) = max[min(h(x), h(y)) − h(z), 0] = max[min(h(x) − h(z), h(y) − h(z)), 0] = min[max(h(x) − h(z), 0), max(h(y) − h(z), 0)] = h((x\z)∩(y\z)),
where we have used the fact that max ( a , min ( b , c ) ) is equal to min ( max ( a , b ) , max ( a , c ) ) .□
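The interchange of max and min used in this step, max(a, min(b, c)) = min(max(a, b), max(a, c)), is a purely arithmetic fact about real numbers. The brief sketch below is an illustrative check only, not part of the original argument, and confirms the identity on randomly drawn triples.

import random

# Check max(a, min(b, c)) == min(max(a, b), max(a, c)) on random triples,
# the distributivity fact used in the proof of Lemma A1.
random.seed(0)
for _ in range(10000):
    a, b, c = (random.uniform(-5.0, 5.0) for _ in range(3))
    assert abs(max(a, min(b, c)) - min(max(a, b), max(a, c))) < 1e-12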
Lemma A2.
We have the following identity,
h(((x,y)\(y,z))∩((x,z)\(y,z))) = h(x\(y,z)) + h(((x⊎y)∩(x⊎z))\(y,z)).
Proof. 
From (35), we have that
h((x,y)\(y,z)) = h((x∩y)\(y,z)) + h((x\y)\(y,z)) + h((y\x)\(y,z)) + h((x⊎y)\(y,z)) = h(x\(y,z)) + h((y\x)\(y,z)) + h((x⊎y)\(y,z)),
and since
h((y\x)\(y,z)) = max[h(y\x) − h(y,z), 0] = 0,
we get that
h((x,y)\(y,z)) = h(x\(y,z)) + h((x⊎y)\(y,z)).
Finally, by inserting this result into (25), we get that
h(((x,y)\(y,z))∩((x,z)\(y,z))) = min[h(x\(y,z)) + h((x⊎y)\(y,z)), h(x\(y,z)) + h((x⊎z)\(y,z))] = h(x\(y,z)) + h(((x⊎y)\(y,z))∩((x⊎z)\(y,z))).
Lemma A3.
We have the following identity,
h((x,y)\((x,z)∪(y,z))) = h((x⊎y)\((x,z)∪(y,z))).
Proof. 
By (35), we have that
h((x,y)\((x,z)∪(y,z))) = h((x∩y)\((x,z)∪(y,z))) + h((x\y)\((x,z)∪(y,z))) + h((y\x)\((x,z)∪(y,z))) + h((x⊎y)\((x,z)∪(y,z))).
Since h(x∩y), h(x\y) and h(y\x) are no greater than h((x,z)∪(y,z)), from (23) we have that h((x∩y)\((x,z)∪(y,z))), h((x\y)\((x,z)∪(y,z))) and h((y\x)\((x,z)∪(y,z))) are all equal to 0.□
Lemma A4.
We have the following identity,
h(x∩(y\x)) = 0.
Proof. 
We have that
h(x∩(y\x)) = h(x∩(x,y)) − h(x∩x) = h(x∩x) − h(x) = 0,
where we have used (44) and (68).□
Lemma A5.
We have the following identity,
h((y\x)∩(y,z)) = h(y\x).
Proof. 
We have that
h((y\x)∩(y,z)) = h((y\x)∩(y\x, y∩x, z)).
Using (68), this then reduces to
h((y\x)∩(y,z)) = h(y\x).
Lemma A6.
We have the following identity,
h(x∩(x⊎z)) = 0.
Proof. 
We have
h(x∩(x⊎z)) = h(x∩(x,z)) − h(x∩x) − h(x∩(z\x)),
which then reduces via (68), (44) and Lemma A4 to:
h(x∩(x⊎z)) = h(x) − h(x) − 0 = 0.
Lemma A7.
We have the following identity,
h((y\x)∩(x⊎z)) = h(y∩(x⊎z)).
Proof. 
We have that
h(y∩(x⊎z)) = h((y\x)∩(x⊎z)) + h((y∩x)∩(x⊎z)).
However, since h(x∩(x⊎z)) = 0 via Lemma A6, and therefore h((y∩x)∩(x⊎z)) = 0, we have
h(y∩(x⊎z)) = h((y\x)∩(x⊎z))
as required.□
Lemma A8.
We have the following identity,
h((x⊎y)∩(x,z)∩(y,z)) = h(z∩(x⊎y)) + h((x⊎y)∩(x⊎z)∩(y⊎z)).
Proof. 
We have that
h((x⊎y)∩(x,z)∩(y,z)) = h((x⊎y)∩x∩(y,z)) + h((x⊎y)∩(z\x)∩(y,z)) + h((x⊎y)∩(x⊎z)∩(y,z)).
Then, by using Lemma A6, we get that
h((x⊎y)∩(x,z)∩(y,z)) = h((x⊎y)∩(z\x)∩(y,z)) + h((x⊎y)∩(x⊎z)∩(y,z)) = h((x⊎y)∩(z\x)∩y) + h((x⊎y)∩(z\x)∩(z\y)) + h((x⊎y)∩(z\x)∩(y⊎z)) + h((x⊎y)∩(x⊎z)∩y) + h((x⊎y)∩(x⊎z)∩(z\y)) + h((x⊎y)∩(x⊎z)∩(y⊎z)) = h((x⊎y)∩(z\x)∩(z\y)) + h((x⊎y)∩(x⊎z)∩(y⊎z)),
where we have used Lemma A6 four more times. Finally, using Lemma A7, we get that
h((x⊎y)∩(x,z)∩(y,z)) = h((x⊎y)∩z) + h((x⊎y)∩(x⊎z)∩(y⊎z)),
as required.□
Lemma A9.
We have the following identity,
h((x,y)∩(x,z)∩(y,z)) = h(x∩(y,z)) + h(y∩(x⊎z)) + h(z∩(x⊎y)) + h((y∩z)\x) + h((x⊎y)∩(x⊎z)∩(y⊎z)).
Proof. 
We have that
h((x,y)∩(x,z)∩(y,z)) = h(x∩(x,z)∩(y,z)) + h((y\x)∩(x,z)∩(y,z)) + h((x⊎y)∩(x,z)∩(y,z)) = h(x∩(y,z)) + h((y\x)∩(x,z)) + h((x⊎y)∩(x,z)∩(y,z)),
where we have used (68) and Lemma A5. Next, we have that
h((x,y)∩(x,z)∩(y,z)) = h(x∩(y,z)) + h((y\x)∩x) + h((y\x)∩(z\x)) + h((y\x)∩(x⊎z)) + h((x⊎y)∩(x,z)∩(y,z)) = h(x∩(y,z)) + 0 + h((y∩z)\x) + h(y∩(x⊎z)) + h((x⊎y)∩(x,z)∩(y,z)),
where we have used Lemma A1, Lemma A4 and Lemma A7. Finally, we have that
h((x,y)∩(x,z)∩(y,z)) = h(x∩(y,z)) + h((y∩z)\x) + h(y∩(x⊎z)) + h(z∩(x⊎y)) + h((x⊎y)∩(x⊎z)∩(y⊎z)),
where we have used Lemma A8.□

References

  1. Reza, F. An Introduction to Information Theory, International student edition; McGraw-Hill: New York, NY, USA, 1961. [Google Scholar]
  2. Kuo Ting, H. On the Amount of Information. Theory Probab. Appl. 1962, 7, 439–447. [Google Scholar] [CrossRef]
  3. Abramson, N. Information Theory and Coding; McGraw-Hill: New York, NY, USA, 1963. [Google Scholar]
  4. Campbell, L. Entropy as a measure. IEEE Trans. Inf. Theory 1965, 11, 112–114. [Google Scholar] [CrossRef]
  5. Csiszar, I.; Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems; Academic Press, Inc.: Cambridge, MA, USA, 1981. [Google Scholar]
  6. Yeung, R.W. A new outlook on Shannon’s information measures. IEEE Trans. Inf. Theory 1991, 37, 466–474. [Google Scholar] [CrossRef]
  7. Yeung, R.W. Information Theory and Network Coding; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
  8. MacKay, D.J. Information Theory, Inference and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  9. Fano, R.M. Transmission of Information: A Statistical Theory of Communication; M.I.T. Press: Cambridge, MA, USA, 1961. [Google Scholar]
  10. McGill, W. Multivariate information transmission. Trans. IRE Prof. Group Inf. Theory 1954, 4, 93–111. [Google Scholar] [CrossRef]
  11. Bell, A.J. The co-information lattice. In Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation: ICA, Granada, Spain, 22–24 September 2004; Volume 2003. [Google Scholar]
  12. Han, T.S. Linear dependence structure of the entropy space. Inf. Control 1975, 29, 337–368. [Google Scholar] [CrossRef] [Green Version]
  13. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012. [Google Scholar]
  14. Fano, R.M. The statistical theory of information. Il Nuovo Cimento 1959, 13, 353–372. [Google Scholar] [CrossRef]
  15. Pinsker, M.S. Information and Information Stability of Random Variables and Processes; Holden-Day: San Francisco, CA, USA, 1964. [Google Scholar]
  16. Grätzer, G. General Lattice Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2002. [Google Scholar]
  17. Davey, B.A.; Priestley, H.A. Introduction to Lattices and Order; Cambridge University Press: Cambridge, UK, 2002. [Google Scholar]
  18. Comtet, L. Advanced Combinatorics: The Art of Finite and Infinite Expansions; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
  19. The OEIS Foundation Inc. The On-Line Encyclopedia of Integer Sequences. 2019. Available online: https://oeis.org/A000372 (accessed on 14 February 2020).
  20. Birkhoff, G. Lattice Theory; American Mathematical Soc.: Providence, RI, USA, 1940; Volume 25. [Google Scholar]
  21. Stanley, R.P. Enumerative Combinatorics; Cambridge University Press: Cambridge, UK, 1997; Volume 1. [Google Scholar]
  22. Ross, S. A First Course in Probability; Pearson Education India: Delhi, India, 2002. [Google Scholar]
  23. Williams, P.L.; Beer, R.D. Nonnegative decomposition of multivariate information. arXiv 2010, arXiv:1004.2515. [Google Scholar]
  24. Lizier, J.T.; Bertschinger, N.; Jost, J.; Wibral, M. Information Decomposition of Target Effects from Multi-Source Interactions: Perspectives on Previous, Current and Future Work. Entropy 2018, 20, 307. [Google Scholar] [CrossRef] [Green Version]
  25. Crampton, J.; Loizou, G. The completion of a poset in a lattice of antichains. Int. Math. J. 2001, 1, 223–238. [Google Scholar]
  26. Crampton, J.; Loizou, G. Two Partial Orders on the Set of Antichains. 2000. Available online: http://learninglink.ac.uk/oldsite/research/techreps/2000/bbkcs-00-09.pdf (accessed on 14 February 2020).
  27. Finn, C.; Lizier, J.T. Pointwise partial information decomposition using the specificity and ambiguity lattices. Entropy 2018, 20, 297. [Google Scholar] [CrossRef] [Green Version]
  28. Finn, C.; Lizier, J.T. Probability Mass Exclusions and the Directed Components of Mutual Information. Entropy 2018, 20, 826. [Google Scholar] [CrossRef] [Green Version]
  29. Chan, C.; Al-Bashabsheh, A.; Ebrahimi, J.B.; Kaced, T.; Liu, T. Multivariate mutual information inspired by secret-key agreement. Proc. IEEE 2015, 103, 1883–1913. [Google Scholar] [CrossRef]
  30. Watanabe, S. Information theoretical analysis of multivariate correlation. IBM J. Res. Dev. 1960, 4, 66–82. [Google Scholar] [CrossRef]
  31. Garner, W.R. Uncertainty and structure as psychological concepts. Science 1963, 140, 799. [Google Scholar]
  32. Studenỳ, M.; Vejnarová, J. The multiinformation function as a tool for measuring stochastic dependence. In Learning in Graphical Models; Springer: Berlin/Heidelberg, Germany, 1998; pp. 261–297. [Google Scholar]
  33. Tononi, G.; Sporns, O.; Edelman, G.M. A measure for brain complexity: Relating functional segregation and integration in the nervous system. Proc. Natl. Acad. Sci. USA 1994, 91, 5033–5037. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  34. Abdallah, S.A.; Plumbley, M.D. A measure of statistical complexity based on predictive information with application to finite spin systems. Phys. Lett. A 2012, 376, 275–281. [Google Scholar] [CrossRef]
  35. Shannon, C. The lattice theory of information. Trans. IRE Prof. Group Inf. Theory 1953, 1, 105–107. [Google Scholar] [CrossRef]
  36. Gács, P.; Körner, J. Common information is far less than mutual information. Probl. Control. Inf. Theory 1973, 2, 149–162. [Google Scholar]
  37. Li, H.; Chong, E.K. On a connection between information and group lattices. Entropy 2011, 13, 683–708. [Google Scholar] [CrossRef] [Green Version]
  38. Yu, H.; Mineyev, I.; Varshney, L.R. A group-theoretic approach to computational abstraction: Symmetry-driven hierarchical clustering. arXiv 2018, arXiv:1807.11167. [Google Scholar]
  39. Wolf, S.; Wultschleger, J. Zero-error information and applications in cryptography. In Proceedings of the IEEE Information Theory Workshop, San Antonio, TX, USA, 24–29 October 2004; pp. 1–6. [Google Scholar]
  40. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J. Shared information—New insights and problems in decomposing information in complex systems. In Proceedings of the European Conference on Complex Systems 2012; Springer: Berlin/Heidelberg, Germany, 2013; pp. 251–269. [Google Scholar]
  41. Harder, M.; Salge, C.; Polani, D. Bivariate measure of redundant information. Phys. Rev. E 2013, 87, 012130. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  42. Harder, M. Information Driven Self-Organization of Agents and Agent Collectives. Ph.D. Thesis, University of Hertfordshire, Hatfield, UK, 2013. [Google Scholar]
  43. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J.; Ay, N. Quantifying unique information. Entropy 2014, 16, 2161–2183. [Google Scholar] [CrossRef] [Green Version]
  44. Griffith, V.; Koch, C. Quantifying Synergistic Mutual Information. In Guided Self-Organization: Inception; Emergence, Complexity and Computation; Prokopenko, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2014; Volume 9, pp. 159–190. [Google Scholar]
  45. Griffith, V.; Chong, E.K.; James, R.G.; Ellison, C.J.; Crutchfield, J.P. Intersection information based on common randomness. Entropy 2014, 16, 1985–2000. [Google Scholar] [CrossRef] [Green Version]
  46. Rauh, J.; Bertschinger, N.; Olbrich, E.; Jost, J. Reconsidering unique information: Towards a multivariate information decomposition. In Proceedings of the 2014 IEEE International Symposium on Information Theory (ISIT), Honolulu, HI, USA, 29 June–4 July 2014; pp. 2232–2236. [Google Scholar]
  47. Barrett, A.B. Exploration of synergistic and redundant information sharing in static and dynamical Gaussian systems. Phys. Rev. E 2015, 91, 052802. [Google Scholar] [CrossRef] [Green Version]
  48. Griffith, V.; Ho, T. Quantifying redundant information in predicting a target random variable. Entropy 2015, 17, 4644–4653. [Google Scholar] [CrossRef] [Green Version]
  49. Olbrich, E.; Bertschinger, N.; Rauh, J. Information decomposition and synergy. Entropy 2015, 17, 3501–3517. [Google Scholar] [CrossRef] [Green Version]
  50. Perrone, P.; Ay, N. Hierarchical Quantification of Synergy in Channels. Front. Robot. AI 2016, 2, 35. [Google Scholar] [CrossRef] [Green Version]
  51. Rosas, F.; Ntranos, V.; Ellison, C.J.; Pollin, S.; Verhelst, M. Understanding interdependency through complex information sharing. Entropy 2016, 18, 38. [Google Scholar] [CrossRef] [Green Version]
  52. Faes, L.; Marinazzo, D.; Stramaglia, S. Multiscale information decomposition: Exact computation for multivariate Gaussian processes. Entropy 2017, 19, 408. [Google Scholar] [CrossRef] [Green Version]
  53. Ince, R.A.A. Measuring multivariate redundant information with pointwise common change in surprisal. Entropy 2017, 19, 318. [Google Scholar] [CrossRef] [Green Version]
  54. James, R.G.; Crutchfield, J.P. Multivariate dependence beyond shannon information. Entropy 2017, 19, 531. [Google Scholar] [CrossRef]
  55. Kay, J.W.; Ince, R.A.; Dering, B.; Phillips, W.A. Partial and Entropic Information Decompositions of a Neuronal Modulatory Interaction. Entropy 2017, 19, 560. [Google Scholar] [CrossRef] [Green Version]
  56. Makkeh, A.; Theis, D.O.; Vicente, R. Bivariate Partial Information Decomposition: The Optimization Perspective. Entropy 2017, 19, 530. [Google Scholar] [CrossRef] [Green Version]
  57. Pica, G.; Piasini, E.; Chicharro, D.; Panzeri, S. Invariant components of synergy, redundancy, and unique information among three variables. Entropy 2017, 19, 451. [Google Scholar] [CrossRef] [Green Version]
  58. Quax, R.; Har-Shemesh, O.; Sloot, P. Quantifying synergistic information using intermediate stochastic variables. Entropy 2017, 19, 85. [Google Scholar] [CrossRef] [Green Version]
  59. Rauh, J.; Banerjee, P.; Olbrich, E.; Jost, J.; Bertschinger, N. On extractable shared information. Entropy 2017, 19, 328. [Google Scholar] [CrossRef] [Green Version]
  60. Rauh, J.; Banerjee, P.; Olbrich, E.; Jost, J.; Bertschinger, N.; Wolpert, D. Coarse-graining and the Blackwell order. Entropy 2017, 19, 527. [Google Scholar] [CrossRef]
  61. Rauh, J. Secret sharing and shared information. Entropy 2017, 19, 601. [Google Scholar] [CrossRef] [Green Version]
  62. James, R.G.; Emenheiser, J.; Crutchfield, J.P. Unique information via dependency constraints. J. Phys. Math. Theor. 2018, 52, 014002. [Google Scholar] [CrossRef] [Green Version]
  63. Williams, P.L.; Beer, R.D. Generalized measures of information transfer. arXiv 2011, arXiv:1102.1507. [Google Scholar]
  64. Flecker, B.; Alford, W.; Beggs, J.M.; Williams, P.L.; Beer, R.D. Partial information decomposition as a spatiotemporal filter. Chaos Interdiscip. J. Nonlinear Sci. 2011, 21, 037104. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  65. Stramaglia, S.; Wu, G.R.; Pellicoro, M.; Marinazzo, D. Expanding the transfer entropy to identify information circuits in complex systems. Phys. Rev. E 2012, 86, 066211. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  66. Lizier, J.T.; Flecker, B.; Williams, P.L. Towards a synergy-based approach to measuring information modification. In Proceedings of the 2013 IEEE Symposium on Artificial Life (ALIFE), Singapore, 15–19 April 2013; pp. 43–51. [Google Scholar]
  67. Stramaglia, S.; Cortes, J.M.; Marinazzo, D. Synergy and redundancy in the Granger causal analysis of dynamical networks. New J. Phys. 2014, 16, 105003. [Google Scholar] [CrossRef] [Green Version]
  68. Timme, N.; Alford, W.; Flecker, B.; Beggs, J.M. Synergy, redundancy, and multivariate information measures: An experimentalist’s perspective. J. Comput. Neurosci. 2014, 36, 119–140. [Google Scholar] [CrossRef] [PubMed]
  69. Wibral, M.; Lizier, J.T.; Priesemann, V. Bits from brains for biologically inspired computing. Front. Robot. 2015, 2, 5. [Google Scholar] [CrossRef] [Green Version]
  70. Biswas, A.; Banik, S.K. Redundancy in information transmission in a two-step cascade. Phys. Rev. E 2016, 93, 052422. [Google Scholar] [CrossRef] [Green Version]
  71. Frey, S.; Williams, P.L.; Albino, D.K. Information encryption in the expert management of strategic uncertainty. arXiv 2016, arXiv:1605.04233. [Google Scholar]
  72. Timme, N.M.; Ito, S.; Myroshnychenko, M.; Nigam, S.; Shimono, M.; Yeh, F.C.; Hottowy, P.; Litke, A.M.; Beggs, J.M. High-degree neurons feed cortical computations. PLoS Comput. Biol. 2016. [Google Scholar] [CrossRef]
  73. Ghazi-Zahedi, K.; Langer, C.; Ay, N. Morphological computation: Synergy of body and brain. Entropy 2017, 19, 456. [Google Scholar] [CrossRef]
  74. Maity, A.K.; Chaudhury, P.; Banik, S.K. Information theoretical study of cross-talk mediated signal transduction in MAPK pathways. Entropy 2017, 19, 469. [Google Scholar] [CrossRef] [Green Version]
  75. Sootla, S.; Theis, D.; Vicente, R. Analyzing information distribution in complex systems. Entropy 2017, 19, 636. [Google Scholar] [CrossRef] [Green Version]
  76. Tax, T.; Mediano, P.A.; Shanahan, M. The partial information decomposition of generative neural network models. Entropy 2017, 19, 474. [Google Scholar] [CrossRef]
  77. Wibral, M.; Finn, C.; Wollstadt, P.; Lizier, J.T.; Priesemann, V. Quantifying Information Modification in Developing Neural Networks via Partial Information Decomposition. Entropy 2017, 19, 494. [Google Scholar] [CrossRef] [Green Version]
  78. Wibral, M.; Priesemann, V.; Kay, J.W.; Lizier, J.T.; Phillips, W.A. Partial information decomposition as a unified approach to the specification of neural goal functions. Brain Cogn. 2017, 112, 25–38. [Google Scholar] [CrossRef] [Green Version]
  79. Finn, C.; Lizier, J.T. Quantifying Information Modification in Cellular Automata using Pointwise Partial Information Decomposition. In Artificial Life Conference Proceedings; MIT Press: Cambridge, MA, USA, 2018; pp. 386–387. [Google Scholar]
  80. Rosas, F.; Mediano, P.A.; Ugarte, M.; Jensen, H. An information-theoretic approach to self-organisation: Emergence of complex interdependencies in coupled dynamical systems. Entropy 2018, 20, 793. [Google Scholar] [CrossRef] [Green Version]
  81. Wollstadt, P.; Lizier, J.T.; Vicente, R.; Finn, C.; Martínez-Zarzuela, M.; Mediano, P.A.; Novelli, L.; Wibral, M. IDTxl: The Information Dynamics Toolkit xl: A Python package for the efficient analysis of multivariate information dynamics in networks. J. Open Source Softw. 2019, 4, 1081. [Google Scholar] [CrossRef]
  82. Biswas, A. Multivariate information processing characterizes fitness of a cascaded gene-transcription machinery. Chaos: Interdiscip. J. Nonlinear Sci. 2019, 29, 063108. [Google Scholar] [CrossRef]
  83. James, R.G.; Emenheiser, J.; Crutchfield, J. Unique information and secret key agreement. Entropy 2019, 21, 12. [Google Scholar] [CrossRef] [Green Version]
  84. Kolchinsky, A. A novel approach to multivariate redundancy and synergy. arXiv 2019, arXiv:1908.08642. [Google Scholar]
  85. Li, M.; Han, Y.; Aburn, M.J.; Breakspear, M.; Poldrack, R.A.; Shine, J.M.; Lizier, J.T. Transitions in information processing dynamics at the whole-brain network level are driven by alterations in neural gain. PLoS Comput. Biol. 2019. [Google Scholar] [CrossRef] [Green Version]
  86. Rosas, F.; Mediano, P.A.; Gastpar, M.; Jensen, H.J. Quantifying high-order interdependencies via multivariate extensions of the mutual information. Phys. Rev. E 2019, 100, 032305. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  87. Ince, R.A. The Partial Entropy Decomposition: Decomposing multivariate entropy and mutual information via pointwise common surprisal. arXiv 2017, arXiv:1702.01591. [Google Scholar]
  88. Lizier, J.T.; Heinzle, J.; Horstmann, A.; Haynes, J.D.; Prokopenko, M. Multivariate information-theoretic measures reveal directed information structure and task relevant changes in fMRI connectivity. J. Comput. Neurosci. 2011, 30, 85–107. [Google Scholar] [CrossRef] [PubMed]
  89. Vakorin, V.A.; Krakovska, O.A.; McIntosh, A.R. Confounding effects of indirect connections on causality estimation. J. Neurosci. Methods 2009, 184, 152–160. [Google Scholar] [CrossRef]
  90. Novelli, L.; Wollstadt, P.; Mediano, P.A.; Wibral, M.; Lizier, J.T. Large-scale directed network inference with multivariate transfer entropy and hierarchical statistical testing. Netw. Neurosci. 2019, 3, 827–847. [Google Scholar] [CrossRef]
  91. Deutscher, D.; Meilijson, I.; Schuster, S.; Ruppin, E. Can single knockouts accurately single out gene functions? BMC Syst. Biol. 2008, 2, 50. [Google Scholar] [CrossRef] [Green Version]
  92. Anastassiou, D. Computational analysis of the synergy among multiple interacting genes. Mol. Syst. Biol. 2007, 3, 83. [Google Scholar] [CrossRef]
  93. White, D.; Rabago-Smith, M. Genotype–phenotype associations and human eye color. J. Hum. Genet. 2011, 56, 5–7. [Google Scholar] [CrossRef] [Green Version]
  94. Chan, T.E.; Stumpf, M.P.; Babtie, A.C. Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst. 2017, 5, 251–267. [Google Scholar] [CrossRef] [Green Version]
Figure 1. (Top left) When depicting a measure on the union of two sets μ ( A B ) , the area of each section can be used to represent the inequality (5) and hence the values μ ( A \ B ) , μ ( B \ A ) and μ ( A B ) correspond to the area of each section. This correspondence can be generalised to consider an arbitrary number of sets. (Bottom left) When depicting the joint entropy H ( X , Y ) , the area of each section can also be used to represent the inequality (1) and hence the values H ( X | Y ) , H ( Y | X ) and I ( X ; Y ) correspond to the area of each section. However, this correspondence does not generalise beyond two variables. (Right) For example, when considering the entropy of three variables, the multivariate mutual information I ( X ; Y ; Z ) cannot be accurately represented using an area since, as represented by the hatching, it is not non-negative.
Figure 2. (Left) Indiana assumes that Alice’s information h ( x ) is independent of Bob’s information h ( y ) such that her information is given by h ( x ) + h ( y ) . (Middle) Johnny knows the joint distribution p X Y , and hence his information is given by the joint information content h ( x , y ) . (Right) There is no inequality that requires Johnny’s information to be no greater than Indiana’s assumed information, or vice versa. On the one hand, Johnny can have more information than Indiana since a joint realisation can be more surprising than both of the individual marginal realisations. On the other hand, Indiana can have more information than Johnny since a joint realisation can be less surprising than both of the individual marginal realisations occurring independently. Thus, as represented by the hatching, the mutual information content i ( x ; y ) is not non-negative.
Figure 3. (Left) If Alice’s information h(x) is greater than Bob’s information h(y), then Eve’s information h(x∪y) is equal to Alice’s information h(x). In effect, Eve is pessimistically assuming that information provided by the least surprising marginal realisation h(x∩y) is already provided by the most surprising marginal realisation h(x∪y), i.e., Bob’s information h(y) is a subset of Alice’s information h(x). From this perspective, Eve gets unique information from Alice relative to Bob h(x\y), but does not get any unique information from Bob relative to Alice h(y\x) = 0. (Right) Although for each realisation Eve can only get unique information from either Alice or Bob, it is possible that Eve can expect to get unique information from both Alice and Bob on average. (Do not confuse this representation of the union entropy with the diagram that represents the joint entropy in Figure 1).
Figure 4. (Left) This Venn diagram shows how the synergistic information h(x⊎y) can be defined by comparing the joint information content h(x,y) from Figure 2 to the union information content h(x∪y) from Figure 3. Note that, for this particular realisation, we are assuming that h(x) > h(y). It also provides a visual representation of the decomposition (40) of the joint information content h(x,y). (Right) By rearranging the marginal entropies such that they match Figure 2 (albeit with different sizes here), it is easy to see why the mutual information content i(x;y) is equal to the intersection information content h(x∩y) minus the synergistic information content h(x⊎y), c.f. (38).
Figure 5. This Venn diagram shows how the synergistic entropy H(X⊎Y) can be defined by comparing the joint entropy H(X,Y) from Figure 1 to the union entropy H(X∪Y) from Figure 3. It also provides a visual representation of the decomposition (40) of the joint entropy H(X,Y).
Figure 6. (Bottom right) The distributive lattices of information contents for two and three observers. It is also important to note that, by replacing h, x, y and z with H, X, Y and Z, respectively, we can obtain the distributive lattices for entropy. In fact, this is crucial since Property 6 enables us to reduce the distributive lattice of information contents to a mere total order; however, this property does not apply to the entropies, and hence we cannot further simplify the lattice of entropies. (Top left) By the fundamental theorem of distributive lattices, the distributive lattices of marginal information contents have a one-to-one correspondence with the lattice of sets. Notice that the lattice for two sets corresponds to the Venn diagram for entropies in Figure 3.
Figure 7. (Left) The total order of marginal information contents for two and three observers, whereby we have assumed that Alice’s information h(x) is greater than Bob’s information h(y), which is greater than Charlie’s information h(z). It is important to note that taking the expectation value over these information contents for each realisation, which may each have a different total order, yields entropies which are merely partially ordered. It is for this reason that Property 6 does not apply to entropies. (Right) The Venn diagrams corresponding to the total order for two and three observers and their corresponding information contents. Notice that the total order for two sets corresponds to the Venn diagram for information contents in Figure 3.
Figure 8. Similar to Figure 4, this Venn diagram shows how the synergistic information h(x⊎y⊎z) can be defined by comparing the joint information content h(x,y,z) to the union information content h(x∪y∪z). Note that, for this particular realisation, we are assuming that h(x) > h(y) > h(z).
Figure 9. (Top-middle and left) The join semi-lattice for n = 2 and n = 3 marginal observers. Johnny’s information is always given by the joint information content at the top of the semi-lattice, while the information content of individuals such as Alice, Bob and Charlie who observe single realisations is found at the bottom of the semi-lattice. The information content of joint marginal observers such as Joanna, Jonas and Joan is found in between these two extremities. (Bottom-middle and right) The meet semi-lattice for n = 2 and n = 3 marginal observers. Since these two semi-lattices are not connected by absorption, their combined structure is not a lattice.
Figure 10. (Top left) The redundancy lattices ⟨A(x), ⪯⟩ of information contents for two and three observers. Each node in the lattice corresponds to an element in A(x) from (69), while the ordering between elements is given by ⪯ from (70). (Bottom right) The partial information contents h_∂(α) corresponding to the redundancy lattices of information contents for two and three observers.
Figure 11. This Venn diagram provides a visual representation of the decomposition of the joint entropy H(X,Y,Z). This decomposition is given by replacing x, y, z and h with X, Y, Z and H in (85), respectively.
