Article

A Novel Approach to the Partial Information Decomposition

Santa Fe Institute, Santa Fe, NM 87501, USA
Entropy 2022, 24(3), 403; https://doi.org/10.3390/e24030403
Submission received: 4 January 2022 / Revised: 22 February 2022 / Accepted: 23 February 2022 / Published: 13 March 2022
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

We consider the “partial information decomposition” (PID) problem, which aims to decompose the information that a set of source random variables provide about a target random variable into separate redundant, synergistic, union, and unique components. In the first part of this paper, we propose a general framework for constructing a multivariate PID. Our framework is defined in terms of a formal analogy with intersection and union from set theory, along with an ordering relation which specifies when one information source is more informative than another. Our definitions are algebraically and axiomatically motivated, and can be generalized to domains beyond Shannon information theory (such as algorithmic information theory and quantum information theory). In the second part of this paper, we use our general framework to define a PID in terms of the well-known Blackwell order, which has a fundamental operational interpretation. We demonstrate our approach on numerous examples and show that it overcomes many drawbacks associated with previous proposals.

1. Introduction

Understanding how information is distributed in multivariate systems is an important problem in many scientific fields. In the context of neuroscience, for example, one may wish to understand how information about an external stimulus is encoded in the activity of different brain regions. In computer science, one might wish to understand how the output of a logic gate reflects the information present in different inputs to that gate. Numerous other examples abound in biology, physics, machine learning, cryptography, and other fields [1,2,3,4,5,6,7,8,9,10].
Formally, suppose that we are provided with a random variable $Y$ which we call the “target”, as well as a set of $n$ random variables $X_1, \dots, X_n$ which we call the “sources”. The partial information decomposition (PID), first proposed by Williams and Beer in 2010 [11], aims to quantify how information about the target is distributed among the different sources. In particular, the PID seeks to decompose the mutual information provided jointly by all sources into a set of nonnegative terms, such as redundancy (information present in each individual source), synergy (information only provided by the sources jointly, not individually), union information (information provided by at least one individual source), and unique information (information provided by only one individual source).
As discussed in detail below, the PID is inspired by an analogy between information theory and set theory. In this analogy, the information that each source provides about the target is imagined as a set, while PID terms such as redundancy, union information, and synergy are imagined as the sizes of intersections, unions, and complements. While the analogy between information-theoretic and set-theoretic quantities is suggestive, it does not specify how to actually define the PID. Moreover, it has also been shown that existing measures from information theory (such as mutual information and conditional mutual information) cannot be used directly to construct the PID, since these measures conflate contributions from different terms like synergy and redundancy [11,12]. In response, many proposals for how to define PID terms have been advanced [5,13,14,15,16,17,18,19,20,21]. However, existing proposals suffer from various drawbacks, such as behaving counterintuitively on simple examples, being limited to only two sources, or lacking a clear operational interpretation. Today there is no generally agreed-upon way of defining the PID.
In this paper, we propose a new and principled approach to the PID which addresses these drawbacks. Our approach can handle any number of sources and can be justified in algebraic, axiomatic, and operational terms. We present our approach in two parts.
In part I (Section 4), we propose a general framework for defining the PID. Our framework does not prescribe specific definitions, but instead shows how an information-theoretic decomposition can be grounded in a formal analogy with set theory. Specifically, we consider the definitions of “set intersection” and “set union” in set theory: the intersection of sets $S_1, S_2, \dots$ is the largest set that is contained in all of the $S_i$, while the union of sets $S_1, S_2, \dots$ is the smallest set that contains all of the $S_i$. As we show, these set-theoretic definitions can be mapped into information-theoretic terms by treating “sets” as random variables, “set size” as mutual information between a random variable and the target $Y$, and “set inclusion” as some externally specified ordering relation ⊏, which specifies when one random variable is more informative than another. Using this mapping, we define information-theoretic redundancy and union information in the same way that the sizes of intersections and unions are defined in set theory (other PID terms, such as synergy and unique information, can be computed in a straightforward way from redundancy and union information). Moreover, while our approach is motivated by set-theoretic intuitions, as we show in Section 4.2, it can also be derived from an alternative axiomatic foundation. We finish part I by reviewing relevant prior work in information theory and the PID literature. We also discuss how our framework can be generalized beyond the standard setting of the PID and even beyond Shannon information theory, to domains like algorithmic information theory and quantum information theory.
One unusual aspect of our framework is that it provides independent definitions of union information and redundancy. Most prior work on the PID has focused exclusively on the definition of redundancy, because it assumed that union information can be determined from redundancy using the so-called “inclusion-exclusion principle”. In Section 4.3, we argue that the inclusion-exclusion principle should not be expected to hold in the context of the PID.
Part I provides a general framework. Concrete definitions of the PID can be derived from this general framework by choosing a specific “more informative” ordering relation ⊏. In fact, the study of ordering relations between information sources has a long history in statistics and information theory [22,23,24,25,26,27]. One particularly important relation is the so-called “Blackwell order” [13,28], which has a fundamental operational interpretation in terms of utility maximization in decision theory.
In part II of this paper (Section 5), we combine the general framework developed in part I with the Blackwell order. This gives rise to concrete definitions of redundancy and union information. We show that our measures behave intuitively and have simple operational interpretations in terms of decision theory. Interestingly, while our measure of redundancy is novel, our measure of union information has previously appeared in the literature under a different guise [13,17].
In Section 6, we compare our redundancy measure to previous proposals, and illustrate it with various bivariate and multivariate examples. We finish the paper with a discussion and proposals for future work in Section 7.
We introduce some necessary notation and preliminaries in the next section. In addition, we provide background regarding the PID in Section 3. All proofs, as well as some additional results, are found in the appendix.

2. Notation and Preliminaries

We use uppercase letters ($Y, X, Q, \dots$) to indicate random variables over some underlying probability space. We use lowercase letters ($y, x, q, \dots$) to indicate specific outcomes of random variables, and calligraphic letters ($\mathcal{Y}, \mathcal{X}, \mathcal{Q}$) to indicate sets of outcomes. We often index random variables with a subscript, e.g., the random variable $X_i$ with outcomes $x_i \in \mathcal{X}_i$ (so $x_i$ does not refer to the $i$th outcome of random variable $X$, but rather to some generic outcome of random variable $X_i$). We use notation like $A - B - C$ to indicate that $A$ is conditionally independent of $C$ given $B$. Except where otherwise noted, we assume that all random variables have a finite number of outcomes.
We use notation like $P_X(x)$ to indicate the probability distribution associated with random variable $X$, $P_{XY}(x, y)$ to indicate the joint probability distribution associated with random variables $X$ and $Y$, and $P_{X|Y}(x|y)$ to indicate the conditional probability distribution of $X$ given $Y$. Given two random variables $X$ and $Y$ with outcome sets $\mathcal{X}$ and $\mathcal{Y}$, we use notation like $\kappa_{X|Y}(x|y)$ to indicate some stochastic channel of outputs $x \in \mathcal{X}$ given inputs $y \in \mathcal{Y}$. In general, a channel $\kappa_{X|Y}$ specifies some arbitrary conditional distribution of $X$ given $Y$, which can be different from $P_{X|Y}$, the actual conditional distribution of $X$ given $Y$ (as determined by the underlying probability space).
As described above, we consider the information that a set of “source” random variables $X_1, \dots, X_n$ provide about a “target” random variable $Y$. Without loss of generality, we assume that the marginal distributions $P_Y$ and $P_{X_i}$ for all $i$ have full support (if they do not, one can restrict $Y$ and/or $X_i$ to outcomes that have strictly positive probability).
Finally, note that despite our use of the terms “source” and “target”, we do not assume any causal directionality between the sources and target (see also discussion in [29]). For example, in neuroscience, $Y$ might be an external stimulus which causes the activity of brain regions $X_1, \dots, X_n$, while in computer science $Y$ might represent the output of a logic gate caused by inputs $X_1, \dots, X_n$ (so the causal direction is reversed). In yet other contexts, there could be other causal relationships among $X_1, \dots, X_n$ and $Y$, or they might not be causally related at all.

3. Background on the Partial Information Decomposition (PID)

Given a set of sources $X_1, \dots, X_n$ and a target $Y$, the PID aims to decompose $I(Y; X_1, \dots, X_n)$, the total mutual information provided by all sources about the target, into a set of nonnegative terms such as [11,12]:
  • Redundancy $I_\cap(X_1; \dots; X_n \rightarrow Y)$, the information present in each individual source. Redundancy can be considered as the intersection of the information provided by different sources and is sometimes called “intersection information” in the literature [16,18].
  • Union information $I_\cup(X_1; \dots; X_n \rightarrow Y)$, the information provided by at least one individual source [12,17].
  • Synergy $S(X_1; \dots; X_n \rightarrow Y)$, the information found in the joint outcome of all sources, but not in any of their individual outcomes. Synergy is defined as [17]
    $$S(X_1; \dots; X_n \rightarrow Y) = I(Y; X_1, \dots, X_n) - I_\cup(X_1; \dots; X_n \rightarrow Y). \tag{1}$$
  • Unique information in source $X_i$, $U(X_i \rightarrow Y \mid X_1; \dots; X_n)$, the non-redundant information in each particular source. Unique information is defined as
    $$U(X_i \rightarrow Y \mid X_1; \dots; X_n) = I(Y; X_i) - I_\cap(X_1; \dots; X_n \rightarrow Y). \tag{2}$$
In addition to the above terms, one can also define excluded information,
$$E(X_i \rightarrow Y \mid X_1; \dots; X_n) = I_\cup(X_1; \dots; X_n \rightarrow Y) - I(Y; X_i), \tag{3}$$
as the information in the union of the sources which is not in a particular source $X_i$. To our knowledge, excluded information has not been previously considered in the PID literature, although it is the natural “dual” of unique information as defined in Equation (2).
Given the definitions above, once a measure of redundancy $I_\cap$ is chosen, unique information is determined by Equation (2). Similarly, once a measure of union information $I_\cup$ is chosen, synergy and excluded information are determined by Equations (1) and (3). In Figure 1, we illustrate the relationships between these different PID terms for the simple case of two sources, $X_1$ and $X_2$. We show two different decompositions of the information provided by the sources jointly, $I(X_1, X_2; Y)$, and individually, $I(X_1; Y)$ and $I(X_2; Y)$. The diagram on the left shows the decomposition defined in terms of redundancy $I_\cap$, while the diagram on the right shows the decomposition defined in terms of union information $I_\cup$.
When more than two sources are present, the PID can be used to define additional terms, beyond the ones shown in Figure 1. For example, for three sources, one can define redundancy terms like $I_\cap(X_1; X_2; X_3 \rightarrow Y)$ (representing the information found in all individual sources) as well as redundancy terms like $I_\cap((X_1, X_2); (X_1, X_3); (X_2, X_3) \rightarrow Y)$ (representing the information found in all pairs of sources), and similarly for union information.
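To make the bookkeeping in Equations (1)–(3) concrete, the following minimal Python sketch derives synergy, unique information, and excluded information from supplied values of redundancy and union information. The function name `pid_terms` and the numerical inputs are hypothetical and chosen only for illustration; they are not taken from the paper, and no particular redundancy or union measure is assumed.

```python
def pid_terms(mi_joint, mi_sources, redundancy, union_info):
    """Derive PID terms from Equations (1)-(3).

    mi_joint    -- I(Y; X_1, ..., X_n), in bits
    mi_sources  -- list with one entry I(Y; X_i) per source
    redundancy  -- a chosen value of I_cap(X_1; ...; X_n -> Y)
    union_info  -- a chosen value of I_cup(X_1; ...; X_n -> Y)
    """
    synergy = mi_joint - union_info                     # Equation (1)
    unique = [mi - redundancy for mi in mi_sources]     # Equation (2), one value per source
    excluded = [union_info - mi for mi in mi_sources]   # Equation (3), one value per source
    return {"synergy": synergy, "unique": unique, "excluded": excluded}


# Hypothetical inputs (illustration only, not results from the paper).
terms = pid_terms(mi_joint=1.0, mi_sources=[0.5, 0.25],
                  redundancy=0.25, union_info=0.75)
print(terms)
```

Note that redundancy only enters the unique-information terms, while union information only enters synergy and excluded information, which is why the two measures give rise to two separate decompositions.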
The idea that redundancy and union information lead to two different information decompositions is rarely discussed in the literature. In fact, the very concept of union information is rarely discussed in the literature explicitly (although it often appears in an implicit form via measures of synergy, since synergy is related to union information through Equation (1)). As we discuss below in Section 4.3, the reason for this omission is that most existing work assumes (whether implicitly or explicitly) that redundancy and union information are not independent measures, but are instead related via the so-called “inclusion-exclusion principle”. If the inclusion-exclusion principle is assumed to hold, then the distinction between the two decompositions disappears. We discuss this issue in greater detail below, where we also argue that the inclusion-exclusion principle should not be expected to hold in the context of the PID.
We have not yet described how the redundancy and union information measures $I_\cap$ and $I_\cup$ are defined. In fact, this remains an open research question in the field (and one which this paper will address). When they first introduced the idea of the PID, Williams and Beer proposed a set of intuitive axioms that any measure of redundancy should satisfy [11,12], which we summarize in Appendix A. In later work, Griffith and Koch [17] proposed a similar set of axioms that union information should satisfy, which are also summarized in Appendix A. However, these axioms do not uniquely identify a particular measure of redundancy or union information.
Williams and Beer also proposed a particular redundancy measure which satisfies their axioms, which we refer to as $I_\cap^{\mathrm{WB}}$ [11,12]. Unfortunately, $I_\cap^{\mathrm{WB}}$ has been shown to behave counterintuitively in some simple cases [19,20]. For example, consider the so-called “COPY gate”, where there are two sources $X_1$ and $X_2$ and the target is a copy of their joint outcomes, $Y = (X_1, X_2)$. If $X_1$ and $X_2$ are statistically independent, $I(X_1; X_2) = 0$, then intuition suggests that the two sources provide independent information about $Y$ and therefore that redundancy should be 0. In general, however, $I_\cap^{\mathrm{WB}}(X_1; X_2 \rightarrow Y)$ does not vanish in this case. To avoid this issue, Ince [20] proposed that any valid redundancy measure should obey the following property:
$$\text{If } I(X_1; X_2) = 0, \text{ then } I_\cap(X_1; X_2 \rightarrow (X_1, X_2)) = 0, \tag{4}$$
which is called the Independent identity property.
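The COPY gate behavior can be checked numerically. The sketch below assumes the usual “specific information” form of the Williams and Beer measure, $I_\cap^{\mathrm{WB}} = \sum_y P_Y(y) \min_i I(Y = y; X_i)$; this formula is recalled from the original proposal [11] and is not restated in this article, so it should be read as an assumption. The helper names are hypothetical.

```python
from itertools import product
from math import log2


def specific_information(p_xy, y):
    """I(Y=y; X) = sum_x p(x|y) * log2(p(y|x) / p(y)), for a joint dict {(x, y): prob}."""
    p_y = sum(p for (_, yy), p in p_xy.items() if yy == y)
    total = 0.0
    for (x, yy), p in p_xy.items():
        if yy != y or p == 0.0:
            continue
        p_x = sum(q for (xx, _), q in p_xy.items() if xx == x)
        total += (p / p_y) * log2((p / p_x) / p_y)
    return total


def i_min(p_x1x2y):
    """Assumed Williams-Beer redundancy for two sources; joint is {(x1, x2, y): prob}."""
    joints = [{}, {}]   # pairwise joints of (X1, Y) and (X2, Y)
    p_y = {}
    for (x1, x2, y), p in p_x1x2y.items():
        joints[0][(x1, y)] = joints[0].get((x1, y), 0.0) + p
        joints[1][(x2, y)] = joints[1].get((x2, y), 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(py * min(specific_information(j, y) for j in joints)
               for y, py in p_y.items())


# COPY gate: X1 and X2 are independent uniform bits and Y = (X1, X2).
copy_gate = {(x1, x2, (x1, x2)): 0.25 for x1, x2 in product((0, 1), repeat=2)}
redundancy_wb = i_min(copy_gate)
print(redundancy_wb)  # 1.0 bit, even though I(X1; X2) = 0
```

Under this assumed definition, each source reveals exactly one bit about every outcome of $Y$, so the minimum over sources is one bit for each outcome and the measure reports one bit of redundancy, in conflict with Equation (4).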
In recent years, many other redundancy measures have been proposed [13,15,16,18,19,20,21]. However, while some of these proposals satisfy the Independent identity property, they suffer various other drawbacks, such as exhibiting other types of counterintuitive behavior, being limited to two sources, and/or lacking a clear operational motivation. We discuss some of these previously proposed measures in Section 4.4, Section 5.4 and Section 6.
Unlike redundancy, to our knowledge only two measures of union information have been advanced. The first one appeared in the original work on the PID [12], and was derived from $I_\cap^{\mathrm{WB}}$ using the inclusion-exclusion principle. The second one appeared more recently [13,17] and is discussed in Section 5.4 below.

4. Part I: Redundancy and Union Information from an Ordering Relation

4.1. Introduction

As mentioned above, the PID is motivated by an informal analogy with set theory [12]. In particular, redundancy is interpreted analogously to the size of the intersection of the sources $X_1, \dots, X_n$, while union information is interpreted analogously to the size of their union.
We propose to define the PID by making this analogy formal, and in particular by going back to the algebraic definitions of intersection and union in set theory. In pursuing this direction, we build on a line of previous work in information theory and PID, which we discuss in Section 4.4.
Recall that in set theory, the intersection of sets $S_1, \dots, S_n \subseteq U$ (where $U$ is some universal set) is the largest set that is contained in all $S_i$ (Section 7.2, [30]). This means that the size of the intersection can be written as
$$\Big| \bigcap_i S_i \Big| = \sup_{T \subseteq U} |T| \quad \text{such that} \quad \forall i \;\; T \subseteq S_i. \tag{5}$$
Similarly, the union of sets $S_1, \dots, S_n \subseteq U$ is the smallest set that contains all $S_i$ (Section 7.2, [30]), so the size of the union can be written as
$$\Big| \bigcup_i S_i \Big| = \inf_{T \subseteq U} |T| \quad \text{such that} \quad \forall i \;\; S_i \subseteq T. \tag{6}$$
Equations (5) and (6) are useful because they express the size of the intersection and union via an optimization over simpler terms (the size of individual sets, $|T|$, and the subset inclusion relation, $\subseteq$).
We translate these definitions to the information-theoretic setting of the PID. We take the analogue of a “set” to be some random variable $A$ that provides information about the target $Y$, and the analogue of “set size” to be the mutual information $I(A; Y)$. In addition, we assume that there is some ordering relation ⊏ between random variables analogous to set inclusion $\subseteq$. Given such a relation, the expression $A \sqsubset B$ means that random variable $B$ is “more informative” than $A$, in the sense that the information that $A$ provides about $Y$ is contained within the information that $B$ provides about $Y$.
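The optimization view in Equations (5) and (6) can be verified by brute force on a small universe. The sketch below enumerates every candidate set $T \subseteq U$ and keeps the largest one contained in all $S_i$ (or the smallest one containing all $S_i$). The function names and the example sets are hypothetical, and exhaustive enumeration is only practical for tiny universes.

```python
from itertools import combinations


def subsets(universe):
    """Yield every subset of `universe` as a frozenset."""
    items = list(universe)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def intersection_size(sets, universe):
    """Equation (5): the largest |T| such that T is contained in every S_i."""
    return max(len(t) for t in subsets(universe) if all(t <= s for s in sets))


def union_size(sets, universe):
    """Equation (6): the smallest |T| such that T contains every S_i."""
    return min(len(t) for t in subsets(universe) if all(s <= t for s in sets))


universe = set(range(6))
sets = [{0, 1, 2, 3}, {2, 3, 4}, {1, 2, 3, 5}]

print(intersection_size(sets, universe), len(set.intersection(*sets)))  # 2 2
print(union_size(sets, universe), len(set.union(*sets)))                # 6 6
```

Only two ingredients are used: a size function (`len`) and an inclusion test (`<=`). These are the two ingredients that the information-theoretic translation replaces with mutual information and an ordering relation.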
At this point, we leave the ordering relation ⊏ unspecified. In general, we believe that the choice of ⊏ will not be determined from purely information-theoretic considerations, but may instead depend on the operational setting and scientific domain in which the PID is applied. At the same time, there has been a great deal of research on ordering relations in statistics and information theory. In part II of this paper, Section 5, we will combine our general framework with a particular ordering relation, the so-called “Blackwell order”, which has a fundamental interpretation in terms of decision theory.
We now provide formal definitions of redundancy and union information, relative to the choice of ordering relation ⊏. In analogy to Equation (5), we define redundancy as
$$I_\cap(X_1; \dots; X_n \rightarrow Y) := \sup_Q I(Q; Y) \quad \text{such that} \quad \forall i \;\; Q \sqsubset X_i, \tag{7}$$
where the maximization is over all random variables $Q$ with a finite number of outcomes. Thus, redundancy $I_\cap$ is the maximum information about $Y$ in any random variable that is less informative than all of the sources. In analogy with Equation (6), we define union information as
$$I_\cup(X_1; \dots; X_n \rightarrow Y) := \inf_Q I(Q; Y) \quad \text{such that} \quad \forall i \;\; X_i \sqsubset Q. \tag{8}$$
Thus, union information $I_\cup$ is the minimum information about $Y$ in any random variable that is more informative than all of the sources. Given these definitions, other elements of the PID (such as unique information, synergy, and excluded information) can be defined using the expressions found in Section 3. Note that $I_\cap$ and $I_\cup$ depend on the choice of ordering relation ⊏, although for convenience we leave this dependence implicit in our notation.
One of the attractive aspects of our definitions is that they do not simply quantify the amount of redundancy and union information, but also specify the “content” of that redundant and union information. In particular, the random variable $Q$ that achieves the optimum in Equation (7) specifies the content of the redundant information via the joint distribution $P_{YQ}$. Similarly, the random variable $Q$ which achieves the optimum in Equation (8) specifies the content of the union information via the joint distribution $P_{YQ}$. Note that these optimizing $Q$ may not be unique, reflecting the fact that there may be different ways to represent the redundancy or union information. (Note also that the supremum or infimum may not be achieved in Equations (7) and (8), in which case one can consider $Q$ that achieve the optimal values to any desired precision $\epsilon > 0$.)
So far we have not made any assumptions about the ordering relation ⊏. However, we can derive some useful bounds by introducing three weak assumptions:
  • I. Monotonicity of mutual information: $A \sqsubset B \implies I(A; Y) \le I(B; Y)$ (less informative sources have less mutual information).
  • II. Reflexivity: $A \sqsubset A$ for all $A$ (each source is at least as informative as itself).
  • III. For all sources $X_i$, $O \sqsubset X_i \sqsubset (X_1, \dots, X_n)$, where $O$ indicates a constant random variable with a single outcome and $(X_1, \dots, X_n)$ indicates all sources considered jointly (each source is more informative than a trivial source and less informative than all sources jointly).
Assumptions I and II imply that the redundancy and union information of a single source are equal to the mutual information in that source:
$$I_\cap(X_1 \rightarrow Y) = I_\cup(X_1 \rightarrow Y) = I(X_1; Y).$$
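The single-source identity follows in two short steps from the definition in Equation (7). The derivation below spells them out for redundancy; the argument for union information is symmetric, with the infimum of Equation (8) in place of the supremum and the inequalities reversed.

```latex
\begin{align*}
\text{Reflexivity (II):}\quad
  & X_1 \sqsubset X_1
    \;\Rightarrow\; Q = X_1 \text{ is feasible in Eq.~(7)}
    \;\Rightarrow\; I_\cap(X_1 \rightarrow Y) \ge I(X_1; Y), \\
\text{Monotonicity (I):}\quad
  & Q \sqsubset X_1
    \;\Rightarrow\; I(Q; Y) \le I(X_1; Y)
    \;\Rightarrow\; I_\cap(X_1 \rightarrow Y)
      = \sup_{Q \sqsubset X_1} I(Q; Y) \le I(X_1; Y).
\end{align*}
```

Combining the two inequalities gives $I_\cap(X_1 \rightarrow Y) = I(X_1; Y)$.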
Assumptions I and III imply the following bounds on redundancy and union information:
$$0 \le I_\cap(X_1; \dots; X_n \rightarrow Y) \le \min_i I(Y; X_i), \tag{9}$$
$$\max_i I(Y; X_i) \le I_\cup(X_1; \dots; X_n \rightarrow Y) \le I(Y; X_1, \dots, X_n). \tag{10}$$
Equation (9) in turn implies that the unique information in each source $X_i$, as defined in Equation (2), is bounded between 0 and $I(Y; X_i)$. Similarly, Equation (10) implies that the synergy, as defined in Equation (1), obeys
$$0 \le S(X_1; \dots; X_n \rightarrow Y) \le \min_i I(Y; X_1, \dots, X_n \mid X_i),$$
where we have used the chain rule $I(Y; X_1, \dots, X_n) = I(Y; X_i) + I(Y; X_1, \dots, X_n \mid X_i)$. Equation (10) also implies that excluded information in each source $X_i$, as defined in Equation (3), is bounded between 0 and $I(Y; X_1, \dots, X_n \mid X_i)$.
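The bounds in Equations (9) and (10) depend only on ordinary mutual information, so they can be evaluated without choosing an ordering relation. The sketch below does this for an XOR gate with independent uniform inputs, an example chosen here for illustration. The helper `mutual_information` and the variable names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import log2


def mutual_information(joint, a_idx, b_idx):
    """I(A; B) in bits, where A and B are tuples of coordinates of the outcomes in `joint`."""
    p_ab, p_a, p_b = defaultdict(float), defaultdict(float), defaultdict(float)
    for outcome, p in joint.items():
        a = tuple(outcome[i] for i in a_idx)
        b = tuple(outcome[i] for i in b_idx)
        p_ab[(a, b)] += p
        p_a[a] += p
        p_b[b] += p
    return sum(p * log2(p / (p_a[a] * p_b[b])) for (a, b), p in p_ab.items() if p > 0)


# XOR gate: outcomes are (x1, x2, y) with independent uniform inputs and y = x1 XOR x2.
xor_gate = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}

mi_sources = [mutual_information(xor_gate, (i,), (2,)) for i in (0, 1)]  # I(Y; X_i)
mi_joint = mutual_information(xor_gate, (0, 1), (2,))                    # I(Y; X_1, X_2)

redundancy_upper = min(mi_sources)                        # upper bound in Equation (9)
union_lower, union_upper = max(mi_sources), mi_joint      # bounds in Equation (10)
synergy_upper = min(mi_joint - mi for mi in mi_sources)   # min_i I(Y; X_1, X_2 | X_i) via the chain rule

print(mi_sources, mi_joint)                                          # [0.0, 0.0] 1.0
print(redundancy_upper, union_lower, union_upper, synergy_upper)     # 0.0 0.0 1.0 1.0
```

For this gate the bounds force redundancy to be 0 for any ordering relation that satisfies Assumptions I and III, while union information (and hence synergy) may lie anywhere between 0 and 1 bit until a specific relation ⊏ is chosen.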
Note that in general, stronger orders give smaller values of redundancy and larger values of union information. Consider two orders ⊏ and ⊏′ where the first one is stronger than the second: $A \sqsubset B \implies A \sqsubset' B$ for all $A$ and $B$. Then, any $Q$ in the feasible set of Equation (7) under ⊏ will also be in the feasible set under ⊏′, and similarly for Equation (8). Therefore, $I_\cap$ defined relative to ⊏ can be no larger than $I_\cap$ defined relative to ⊏′, and $I_\cup$ defined relative to ⊏ can be no smaller than $I_\cup$ defined relative to ⊏′.
In the rest of this section, we discuss alternative axiomatic justifications for our general framework, the role of the inclusion-exclusion principle, relation to prior work, and further generalizations. Readers who are more interested in the use of our framework to define concrete measures of redundancy and union information may skip to Section 5.

4.2. Axiomatic Derivation

In Section 4.1, we defined the PID in terms of an algebraic analogy with intersection and union in set theory. This definition can be considered as the primary one in our framework. At the same time, the same definitions can also be derived in an alternative manner from a set of axioms, as commonly sought after in the PID literature. In particular, in Appendix B, we prove the following result regarding redundancy.
Theorem 1. 
Any redundancy measure that satisfies the following five axioms is equal to $I_\cap(X_1; \dots; X_n \rightarrow Y)$ as defined in Equation (7).
  • Symmetry: $I_\cap(X_1; \dots; X_n \rightarrow Y)$ is invariant to the permutation of $X_1, \dots, X_n$.
  • Self-redundancy: $I_\cap(X_1 \rightarrow Y) = I(Y; X_1)$.
  • Monotonicity: $I_\cap(X_1; \dots; X_n \rightarrow Y) \le I_\cap(X_1; \dots; X_{n-1} \rightarrow Y)$.
  • Order equality: $I_\cap(X_1; \dots; X_n \rightarrow Y) = I_\cap(X_1; \dots; X_{n-1} \rightarrow Y)$ if $X_i \sqsubset X_n$ for some $i < n$.
  • Existence: There is some $Q$ such that $I_\cap(X_1; \dots; X_n \rightarrow Y) = I(Y; Q)$ and $Q \sqsubset X_i$ for all $i$.
While the Symmetry, Self-redundancy, and Monotonicity axioms are standard in the PID literature (see Appendix A), the last two axioms require some explanation. Order equality is a generalization of the previously proposed Deterministic equality axiom, described in Appendix A, where the condition $X_i = f(X_n)$ (deterministic relationship) is generalized to the “more informative” relation $X_i \sqsubset X_n$. This axiom reflects the idea that if a new source $X_n$ is more informative than an existing source $X_i$, then redundancy should not decrease when $X_n$ is added.
Existence is the most novel of our proposed axioms. It says that for any set of sources $X_1, \dots, X_n$, there exists some random variable which captures the redundant information. It is similar to the statement in axiomatic set theory that the intersection of a collection of sets is itself a set (note that in Zermelo-Fraenkel set theory, this statement is derived from the Axiom of Separation).
We can derive a similar result for union information (proof in Appendix B).
Theorem 2. 
Any union information measure that satisfies the following five axioms is equal to $I_\cup(X_1; \dots; X_n \rightarrow Y)$ as defined in Equation (8).
  • Symmetry: $I_\cup(X_1; \dots; X_n \rightarrow Y)$ is invariant to the permutation of $X_1, \dots, X_n$.
  • Self-union: $I_\cup(X_1 \rightarrow Y) = I(Y; X_1)$.
  • Monotonicity: $I_\cup(X_1; \dots; X_n \rightarrow Y) \ge I_\cup(X_1; \dots; X_{n-1} \rightarrow Y)$.
  • Order equality: $I_\cup(X_1; \dots; X_n \rightarrow Y) = I_\cup(X_1; \dots; X_{n-1} \rightarrow Y)$ if $X_n \sqsubset X_i$ for some $i < n$.
  • Existence: There is some $Q$ such that $I_\cup(X_1; \dots; X_n \rightarrow Y) = I(Y; Q)$ and $X_i \sqsubset Q$ for all $i$.
These axioms are dual to the redundancy axioms outlined above. Compared to previously proposed axioms for union information, as described in Appendix A, the most unusual of our axioms is Existence. It says that given a set of sources $X_1, \dots, X_n$, there exists some random variable which captures the union information. It is similar in spirit to the “Axiom of Union” in axiomatic set theory [31].
Finally, note that for some choices of ⊏, there may not exist measures of redundancy and/or union information that satisfy the axioms in Theorems 1 and 2, in which case these theorems still hold but are trivial. However, even in such “pathological” cases, $I_\cap$ and $I_\cup$ can still be defined via Equations (7) and (8), as long as ⊏ has a “least informative” and a “most informative” element (e.g., as provided by Assumption III above), so that the feasible sets are not empty. In this sense, the definitions in Equations (7) and (8) are more general than the axiomatic derivations provided by Theorems 1 and 2.

4.3. Inclusion-Exclusion Principle

One unusual aspect of our approach is that, unlike most previous work, we propose separate measures of redundancy and union information.
Recall that in set theory, the size of the intersection and the union are not independent of each other, but are instead related by the inclusion-exclusion principle (IEP). For example, given any two sets $S$ and $T$, the IEP states that the size of the union of $S$ and $T$ is given by the sum of their individual sizes minus the size of their intersection,
$$|S \cup T| = |S| + |T| - |S \cap T|. \tag{11}$$
More generally, the IEP relates the sizes of intersections and unions for any number of sets, via the following inclusion-exclusion formulas:
$$\Big| \bigcup_{i=1}^n S_i \Big| = \sum_{\emptyset \ne J \subseteq \{1, \dots, n\}} (-1)^{|J| - 1} \Big| \bigcap_{i \in J} S_i \Big|, \tag{12}$$
$$\Big| \bigcap_{i=1}^n S_i \Big| = \sum_{\emptyset \ne J \subseteq \{1, \dots, n\}} (-1)^{|J| - 1} \Big| \bigcup_{i \in J} S_i \Big|. \tag{13}$$
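Both inclusion-exclusion formulas can be checked directly on small finite sets. The sketch below evaluates the alternating sums in Equations (12) and (13) over all nonempty index subsets $J$; the function names and the example sets are hypothetical.

```python
from itertools import combinations


def union_size_via_iep(sets):
    """Equation (12): |union| as an alternating sum of intersection sizes."""
    n = len(sets)
    total = 0
    for r in range(1, n + 1):                       # r = |J|
        for idx in combinations(range(n), r):       # every nonempty J of size r
            total += (-1) ** (r - 1) * len(set.intersection(*(sets[i] for i in idx)))
    return total


def intersection_size_via_iep(sets):
    """Equation (13): |intersection| as an alternating sum of union sizes."""
    n = len(sets)
    total = 0
    for r in range(1, n + 1):
        for idx in combinations(range(n), r):
            total += (-1) ** (r - 1) * len(set.union(*(sets[i] for i in idx)))
    return total


sets = [{0, 1, 2, 3}, {2, 3, 4}, {1, 2, 3, 5}]
print(union_size_via_iep(sets), len(set.union(*sets)))                # 6 6
print(intersection_size_via_iep(sets), len(set.intersection(*sets)))  # 2 2
```

For these sets the sums are $11 - 7 + 2 = 6$ for Equation (12) and $11 - 15 + 6 = 2$ for Equation (13).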
Historically, the IEP has played an important role in analogies between set theory and information theory, which began to be explored in the 1950s and 1960s [32,33,34,35,36]. Recall that the entropy $H(X)$ quantifies the amount of information gained by learning the outcome of random variable $X$. It has been observed that, for a set of random variables $X_1, \dots, X_n$, the joint entropy $H(X_1, \dots, X_n)$ behaves somewhat like the size of the union of the information in the individual variables. For instance, like the size of the union, joint entropy is subadditive ($H(X_1) + H(X_2) \ge H(X_1, X_2)$) and increases with additional random variables ($H(X_1, X_2) \ge H(X_1)$). Moreover, for two random variables $X_1$ and $X_2$, the mutual information $I(X_1; X_2) = H(X_1) + H(X_2) - H(X_1, X_2)$ acts like the size of the intersection of the information provided by $X_1$ and $X_2$, once intersection is defined analogously to the IEP expression in Equation (11) [35,36]. Given the general IEP formula in Equation (13), this can be used to define the size of the intersection between any number of random variables. For instance, the size of a three-way intersection is
$$I(X_1; X_2; X_3) = H(X_1) + H(X_2) + H(X_3) - H(X_1, X_2) - H(X_1, X_3) - H(X_2, X_3) + H(X_1, X_2, X_3),$$
a quantity called co-information or interaction information in the literature [32,33,35,36,37].
Unfortunately, interaction information, as well as other higher-order interaction terms defined via the IEP, can take negative values [32,35,37]. This conflicts with the intuition that information measures should always be non-negative, in the same way that set size is always non-negative.
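A standard example of negative interaction information is the XOR relationship, where $X_1$ and $X_2$ are independent uniform bits and $X_3 = X_1 \oplus X_2$. The sketch below evaluates the alternating entropy sum for this distribution; the example and the helper names are chosen here for illustration and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations, product
from math import log2


def entropy(joint, idx):
    """Shannon entropy (bits) of the coordinates `idx` of the outcomes in `joint`."""
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)


def co_information(joint, n):
    """Alternating sum of joint entropies over all nonempty subsets of n variables."""
    total = 0.0
    for r in range(1, n + 1):
        for idx in combinations(range(n), r):
            total += (-1) ** (r - 1) * entropy(joint, idx)
    return total


# Outcomes are (x1, x2, x3) with x3 = x1 XOR x2 and independent uniform x1, x2.
xor = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}
value = co_information(xor, 3)
print(value)  # -1.0
```

Each variable has 1 bit of entropy, while every pair and the full triple have 2 bits, so the sum is $3 - 6 + 2 = -1$ bit.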
One of the primary motivations for the PID, as originally proposed by Williams and Beer [11,12], was to solve the problem of negativity encountered by interaction information. To develop a non-negative information decomposition, Williams and Beer took two steps. First, they considered the information that a set of sources $X_1, \dots, X_n$ provide about some target random variable $Y$. Second, they developed a non-negative measure of redundancy ($I_\cap^{\mathrm{WB}}$) which leads to a non-negative union information once an IEP formula like Equation (12) is applied (Theorem 4.7, [12]). For example, in the original proposal, union information and redundancy are related via
$$I_\cup(X_1; X_2 \rightarrow Y) \overset{?}{=} I(Y; X_1) + I(Y; X_2) - I_\cap(X_1; X_2 \rightarrow Y), \tag{14}$$
which is the analogue of Equation (11). This can be plugged into expressions like Equation (1), so as to express synergy in terms of redundancy as
$$S(X_1; X_2 \rightarrow Y) \overset{?}{=} I(Y; X_1, X_2) - I(Y; X_1) - I(Y; X_2) + I_\cap(X_1; X_2 \rightarrow Y). \tag{15}$$
The meaning of IEP-based identities such as Equations (14) and (15) can be illustrated using the Venn diagrams in Figure 1. In particular, they imply that the pink region in the right diagram is equal in size to the pink region in the left diagram, and that the grey region in the left diagram is equal in size to the grey region in the right diagram. More generally, IEP implies an equivalence between the information decomposition based on redundancy and the one based on union information.
As mentioned in Section 3, due to shortcomings in the original redundancy measure $I_\cap^{\mathrm{WB}}$, numerous other proposals for the PID have been advanced. Most of these proposals introduce new measures of redundancy, while keeping the general structure of the PID as introduced by Williams and Beer. In particular, most of these proposals assume that the IEP holds, so that union information can be derived from a measure of redundancy. While the assumption of the IEP is sometimes stated explicitly, more frequently it is implicit in the definitions used. For example, many proposals assume that synergy is related to redundancy via an expression like Equation (15), although (as shown above) this implicitly assumes that the IEP holds. In general, the IEP has been a largely unchallenged and unexamined assumption in the PID field. It is easy to see the appeal of the IEP: it builds on deep-seated intuitions about intersection/union from set theory and Venn diagrams, it has a long history in the information-theoretic literature, and it simplifies the problem of defining the PID, since it only requires a measure of redundancy to be defined rather than both a measure of redundancy and a measure of union information. (Note that one can also start from union information and then derive redundancy via the IEP formula in Equation (13), as in Appendix B of Ref. [17], although this is much less common in the literature.)
However, there is a different way to define a non-negative PID, which is still grounded in a formal analogy with set theory but does not assume the IEP. Here, one defines measures of redundancy and union information based on the underlying algebra of intersection and union: the intersection of $X_1,\dots,X_n$ is the largest element that is less than each $X_i$, while the union is the smallest element that is greater than each $X_i$. Given these definitions, intersections and unions are not necessarily related to each other numerically, as in the IEP, but are instead related by an algebraic duality.
This latter approach is the one we pursue in our definitions (it has also appeared in some prior work, which we review in the next subsection). In general, the IEP will not hold for redundancy and union information as defined in Equations (7) and (8). (To emphasize this point, we put a question mark in Equations (14) and (15), and made the sizes of the pink and grey regions visibly different in Figure 1.) However, given the algebraic and axiomatic justifications for $I_\cap$ and $I_\cup$, we do not see the violation of the IEP as a fatal issue. In fact, there are many domains where generalizations of intersections and unions do not obey the IEP. For example, it is well-known that the IEP is violated in the domain of vector spaces, once the size of a vector space is measured in terms of its dimension [38]. The PID is simply another domain where the IEP should not be expected to hold.
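As a minimal worked illustration of the vector-space case (the specific subspaces are our own choice for this sketch and are not taken from Ref. [38]), take the sum of subspaces as the "union", their intersection as the "intersection", and dimension as "size". For three distinct lines through the origin in $\mathbb{R}^2$, the three-way inclusion-exclusion formula overcounts:

```latex
\begin{aligned}
U_1 &= \operatorname{span}\{(1,0)\}, \quad
U_2 = \operatorname{span}\{(0,1)\}, \quad
U_3 = \operatorname{span}\{(1,1)\}, \\
\dim(U_1 + U_2 + U_3) &= 2, \\
\sum_i \dim U_i \;-\; \sum_{i<j} \dim(U_i \cap U_j) \;+\; \dim(U_1 \cap U_2 \cap U_3)
  &= 3 - 0 + 0 = 3 \;\neq\; 2 .
\end{aligned}
```

For two subspaces the identity $\dim(U+V) = \dim U + \dim V - \dim(U \cap V)$ does hold, so in this example the failure first appears with three subspaces.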
We believe that many problems encountered in previous work on the PID—such as the failure of certain redundancy measures to generalize to more than two sources, or the appearance of uninterpretable negative synergy values—are artifacts of the IEP assumption. In fact, the following result shows that any measures of redundancy and union information which satisfy several reasonable assumptions must violate the IEP as soon as 3 or more sources are present (the proof, in Appendix I, is based on a construction from [39,40]).
Lemma 1. 
Let $I_\cap$ be any nonnegative redundancy measure which obeys Symmetry, Self-redundancy, Monotonicity, and Independent identity. Let $I_\cup$ be any union information measure which obeys $I_\cup(X_1;\dots;X_n \to Y) \le I(Y;X_1,\dots,X_n)$. Then, $I_\cap$ and $I_\cup$ cannot be related by the inclusion-exclusion principle for 3 or more sources.
The idea that different information decompositions may arise from redundancy versus synergy (and therefore union information) has recently appeared in the PID literature [15,40,41,42,43]. In particular, Chicharro and Panzeri proposed a PID that involves two decompositions: an “information gain” decomposition based on redundancy and an “information loss” decomposition based on synergy [41]. These decompositions correspond to the two Venn diagrams shown in Figure 1.

4.4. Relation to Prior Work

Here we discuss prior work which is relevant to our algebraic approach to the PID.
First, note that our definitions of redundancy and union information in Equations (7) and (8) are closely related to notions of “meet” and “join” in a field of algebra called order theory, which generalize intersections and unions to domains beyond set theory [44]. Given a set of objects $S$ and an order $\sqsubset$, the meet of $a, b \in S$ is the unique largest $c \in S$ that is smaller than both $a$ and $b$: $c \sqsubset a$, $c \sqsubset b$, and $d \sqsubset c$ for any $d$ that obeys $d \sqsubset a$, $d \sqsubset b$. Similarly, the join of $a, b \in S$ is the unique smallest $c$ that is larger than both $a$ and $b$: $a \sqsubset c$, $b \sqsubset c$, and $c \sqsubset d$ for any $d$ that obeys $a \sqsubset d$, $b \sqsubset d$. Note that meets and joins are only defined when $\sqsubset$ is a special type of partial order called a lattice. This is a strict requirement, and many important ordering relations in information theory are not lattices (this includes the “Blackwell order”, which we will consider in part II of this paper [45]).
In our approach, we do not require the ordering relation $\sqsubset$ to be a lattice, or even a partial order. We do not require these properties because we do not aim to find the unique union random variable or the unique redundancy random variable. Instead, we aim to quantify the size of the intersection and the size of the union, which we do by optimizing mutual information subject to constraints, as in Equations (7) and (8). These definitions are well-defined even when $\sqsubset$ is not a lattice, which allows us to consider a much broader set of ordering relations.
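The definitions of meet and join can be made concrete on a small lattice. The sketch below is an illustration only: it assumes the divisibility order on the integers 1 to 100 (a standard textbook lattice, not one used in this paper), where the meet is the greatest common divisor and the join is the least common multiple. The function names `divides`, `meet`, and `join` are hypothetical helpers introduced for this example.

```python
from math import gcd


def divides(a, b):
    """Order relation for this toy example: a is 'below' b iff a divides b."""
    return b % a == 0


def meet(a, b, universe):
    """Largest lower bound: a common lower bound that every other lower bound divides."""
    lower = [c for c in universe if divides(c, a) and divides(c, b)]
    return next(c for c in lower if all(divides(d, c) for d in lower))


def join(a, b, universe):
    """Smallest upper bound: a common upper bound that divides every other upper bound."""
    upper = [c for c in universe if divides(a, c) and divides(b, c)]
    return next(c for c in upper if all(divides(c, d) for d in upper))


universe = range(1, 101)
print(meet(12, 18, universe), gcd(12, 18))  # both are 6
print(join(12, 18, universe))               # 36, the least common multiple
```

The brute-force search mirrors the verbal definition directly; it only succeeds because a unique largest lower bound and smallest upper bound exist here, which is exactly the lattice requirement discussed above.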
We mention three important precursors of our approach that have been proposed in the PID literature. First, Griffith et al. [16] considered the following order between random variables:
$$A \lhd B \quad\text{iff}\quad A = f(B) \ \text{ for some deterministic function } f. \tag{16}$$
This ordering relation ⊲ was first considered in a 1953 paper by Shannon [22], who showed that it defines a lattice over random variables. That paper was the first to introduce the algebraic idea of meets and joins into information theory, leading to an important line of subsequent research [46,47,48,49,50]. Using this order, Ref. [16] defined redundancy as the maximum mutual information in any random variable that is a deterministic function of all of the sources,
$$I_\cap^{\lhd}(X_1;\dots;X_n \to Y) := \max_Q I(Q;Y) \quad\text{such that}\quad \forall i \ \ Q \lhd X_i, \tag{17}$$
which is clearly a special case of Equation (7). Unfortunately, in practice, $I_\cap^{\lhd}$ is not a useful redundancy measure, as it tends to give very small values and is highly discontinuous. For example, $I_\cap^{\lhd}(X_1;\dots;X_n \to Y) = 0$ whenever the joint distribution $P_{X_1 \dots X_n Y}$ has full support, meaning that it vanishes on almost all joint distributions [16,18,47]. The reason for this counterintuitive behavior is that the order $\lhd$ formalizes an extremely strict notion of “more informative”, which is not robust to noise.
Given the deficiencies of $I_\cap^{\lhd}$, Griffith and Ho [18] proposed another measure of redundancy (also discussed as $I_\cap^{2}$ in Ref. [49]),
$$I_\cap^{\mathrm{GH}}(X_1;\dots;X_n \to Y) := \max_Q I(Q;Y) \quad\text{such that}\quad \forall i \ \ Q - X_i - Y. \tag{18}$$
This measure is also a special case of Equation (7), where the more informative relation $A \sqsubset B$ is formalized via the conditional independence condition $A - B - Y$. This measure is similar to the redundancy measure we propose in part II of this paper, and we discuss it in more detail in Section 5.4. (Note that there are some incorrect claims about $I_\cap^{\mathrm{GH}}$ in the literature. Lemmas 6 and 7 of Ref. [49] incorrectly state that $I_\cap^{\mathrm{GH}}(X_1;X_2 \to Y) = 0$ whenever $X_1$ and $X_2$ are independent; see the AND gate counterexample in Section 6. In addition, Ref. [18] incorrectly states that $I_\cap^{\mathrm{GH}}$ obeys a property called Target Monotonicity.)
Finally, we mention the so-called “minimum mutual information” redundancy $I_\cap^{\mathrm{MMI}}$ [51]. This is perhaps the simplest redundancy measure, being equal to the minimal mutual information in any source: $I_\cap^{\mathrm{MMI}}(X_1;\dots;X_n \to Y) := \min_i I(X_i;Y)$. It can be written in the form of Equation (7) as
$$I_\cap^{\mathrm{MMI}}(X_1;\dots;X_n \to Y) := \max_Q I(Q;Y) \quad\text{such that}\quad \forall i \ \ I(Q;Y) \le I(X_i;Y). \tag{19}$$
This redundancy measure has been criticized for depending only on the amount of information provided by the different sources, being completely insensitive to the content of that information. Nonetheless, $I_\cap^{\mathrm{MMI}}$ can be useful in some settings, and it plays an important role in the context of Gaussian random variables [51].
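Because $I_\cap^{\mathrm{MMI}}$ is just a minimum over pairwise mutual informations, it is straightforward to compute. The following sketch (our own illustration, using hypothetical helper names `mutual_information` and `i_mmi`, and assuming uniformly distributed independent binary inputs) evaluates it on an AND gate and on a COPY gate. The COPY gate result shows the insensitivity to content mentioned above: the two sources are independent, yet the measure reports one full bit of redundancy.

```python
from itertools import product
from math import log2


def mutual_information(pairs):
    """I(A;B) in bits from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in pairs.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in pairs.items() if p > 0)


def i_mmi(joint, n_sources):
    """min_i I(X_i;Y) for a joint dict {(x_1, ..., x_n, y): probability}."""
    values = []
    for i in range(n_sources):
        marginal = {}  # pairwise marginal of (X_i, Y)
        for outcome, p in joint.items():
            key = (outcome[i], outcome[-1])
            marginal[key] = marginal.get(key, 0.0) + p
        values.append(mutual_information(marginal))
    return min(values)


bits = list(product((0, 1), repeat=2))
and_gate = {(x1, x2, x1 & x2): 0.25 for x1, x2 in bits}
copy_gate = {(x1, x2, (x1, x2)): 0.25 for x1, x2 in bits}

print(round(i_mmi(and_gate, 2), 4))  # 0.3113
print(i_mmi(copy_gate, 2))           # 1.0
```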
Interestingly, unlike $I_\cap^{\mathrm{MMI}}$, the original redundancy measure proposed by Williams and Beer [11], $I_\cap^{\mathrm{WB}}$, does not appear to be a special case of Equation (7) (at least not under the natural definition of the ordering relation $\sqsubset$). We demonstrate this using a counter-example in Appendix H.
As mentioned in Section 4.1, stronger ordering relations give smaller values of redundancy. For the orders considered above, it is easy to show that
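$$A \lhd B \;\implies\; A - B - Y \;\implies\; I(A;Y) \le I(B;Y). \tag{20}$$
This implies that $I_\cap^{\lhd} \le I_\cap^{\mathrm{GH}} \le I_\cap^{\mathrm{MMI}}$. In fact, $I_\cap^{\mathrm{MMI}}$ is the largest measure that is compatible with the monotonicity of mutual information (Assumption I in Section 4.1).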

4.5. Further Generalizations

We finish part I of this paper by noting that one can further generalize our approach, by considering other analogues of “set”, “set size”, and “set inclusion” beyond the ones considered in Section 4.1. Such generalizations allow one to analyze notions of information intersection and union in a wide variety of domains, including setups different from the standard one considered in the PID, and domains not based on Shannon information theory.
At a general level, consider a set of objects $\Omega$ that represents possible “sources”, which may be random variables, as in Section 4.1, or otherwise. Assume there is some function $\phi : \Omega \to \mathbb{R}$ that quantifies the “amount of information” in a given source in $\Omega$, and some relation $\sqsubset$ on $\Omega$ that indicates which sources are more informative than others. Then, in analogy to Equations (5) and (6), for any set of sources $\{b_1,\dots,b_n\} \subseteq \Omega$, one can define redundancy and union information as
$$I_\cap(b_1;\dots;b_n) := \sup_{a \in \Omega} \phi(a) \quad\text{such that}\quad \forall i \ \ a \sqsubset b_i, \tag{21}$$
$$I_\cup(b_1;\dots;b_n) := \inf_{a \in \Omega} \phi(a) \quad\text{such that}\quad \forall i \ \ b_i \sqsubset a. \tag{22}$$
Synergy, unique, and excluded information can then be defined via Equations (1) to (3).
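The generalized definitions in Equations (21) and (22) can be sketched directly for a finite $\Omega$. The example below is only an illustration under simplifying assumptions: $\Omega$ is taken to be all subsets of a five-element universe, $\sqsubset$ is subset inclusion, and $\phi$ is set size, so the two optimizations recover ordinary intersection and union sizes. The function names `redundancy` and `union_information` are hypothetical.

```python
from itertools import combinations


def redundancy(sources, omega, phi, leq):
    """sup of phi(a) over all a in omega that lie below every source."""
    return max(phi(a) for a in omega if all(leq(a, b) for b in sources))


def union_information(sources, omega, phi, leq):
    """inf of phi(a) over all a in omega that lie above every source."""
    return min(phi(a) for a in omega if all(leq(b, a) for b in sources))


# Omega: every subset of a small universe; order: inclusion; size: cardinality.
universe = range(5)
omega = [frozenset(c)
         for r in range(len(universe) + 1)
         for c in combinations(universe, r)]
phi = len


def leq(a, b):
    return a <= b  # frozenset subset test


b1, b2 = frozenset({0, 1, 2}), frozenset({1, 2, 3})
print(redundancy([b1, b2], omega, phi, leq))         # 2, the size of the intersection
print(union_information([b1, b2], omega, phi, leq))  # 4, the size of the union
```

In this set-theoretic special case the inclusion-exclusion principle holds (3 + 3 - 2 = 4); swapping in a different `omega`, `phi`, and `leq` changes the domain without changing the two functions.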
There are many possible examples of such generalizations, of which we mention a few as illustrations.
  • Shannon information theory (beyond mutual information). In Section 4.1, $\phi$ was the mutual information between each random variable and some target $Y$. This can be generalized by choosing a different “amount of information” function $\phi$, so that redundancy and union information are quantified in terms of other measures of statistical dependence. Among many other options, possible choices of $\phi$ include Pearson’s correlation (for continuous random variables) and measures of statistical dependency based on f-divergences [52], Bregman divergences [53], and Fisher information [54].
  • Shannon information theory (without a fixed target). The PID can also be defined for a different setup than the typical one considered in the literature. For example, consider a situation where the sources are channels $\kappa_{X_1|Y},\dots,\kappa_{X_n|Y}$, while the marginal distribution over the target $Y$ is left unspecified. Here one may take $\Omega$ as the set of channels, $\phi$ as the channel capacity $\phi(\kappa_{A|Y}) := \max_{P_Y} I_{P_Y \kappa_{A|Y}}(A;Y)$, and $\sqsubset$ as some ordering relation on channels [24].
  • Algorithmic information theory. The PID can be defined for other notions of information, such as the ones used in Algorithmic Information Theory (AIT) [55]. In AIT, “information” is not defined in terms of statistical uncertainty, but rather in terms of the program length necessary to generate strings. For example, one may take $\Omega$ as the set of finite strings, $\sqsubset$ as algorithmic conditional independence ($a \sqsubset b$ iff $K(y|b) - K(y|b,a) \le \mathrm{const}$, where $K(\cdot|\cdot)$ is conditional Kolmogorov complexity), and $\phi(a) := K(y) - K(y|a)$ as the “algorithmic mutual information” with some target string $y$. (This setup is closely related to the notion of algorithmic “common information” [47].)
  • Quantum information theory. As a final example, the PID can be defined in the context of quantum information theory. For example, one may take $\Omega$ as the set of quantum channels, $\sqsubset$ as the quantum Blackwell order [56,57,58], and $\phi(\Phi) = I(\rho, \Phi)$, where $I$ is the Ohya mutual information for some target density matrix $\rho$ under channel $\Phi \in \Omega$ [59].

5. Part II: Blackwell Redundancy and Union Information

In the first part of this paper, we proposed a general framework for defining PID terms. In this section, which forms part II of this paper, we develop a concrete definition of redundancy and union information by combining our general framework with a particular ordering relation $\sqsubset$. This ordering relation is called the “Blackwell order”, and it plays a fundamental role in statistics and decision theory [28,45,60]. We first introduce the Blackwell order, then use it to define measures of redundancy and union information, and finally discuss various properties of our measures.

5.1. The Blackwell Order

We begin by introducing the ordering relation that we use to define our PID. Given three random variables $B$, $C$ and $Y$, the ordering relation $B \preceq_Y C$ is defined as follows:
$$B \preceq_Y C \quad\text{iff}\quad P_{B|Y}(b|y) = \sum_c \kappa_{B|C}(b|c)\, P_{C|Y}(c|y) \ \text{ for some channel } \kappa_{B|C} \text{ and all } b, y. \tag{23}$$
We refer to the relation $\preceq_Y$ as the Blackwell order relative to random variable $Y$. (Note that the Blackwell order and Blackwell’s Theorem are usually formulated in terms of channels, that is, conditional distributions like $\kappa_{B|Y}$ and $\kappa_{C|Y}$, rather than in terms of random variables as done here. However, these two formulations are equivalent, as shown in [45].)
In words, Equation (23) means that the conditional distribution $P_{B|Y}$ can be generated by first sampling from the conditional distribution $P_{C|Y}$, and then applying some channel $\kappa_{B|C}$ to the outcome. The relation $B \preceq_Y C$ implies that $P_{B|Y}$ is noisier than $P_{C|Y}$ and, by the “data processing inequality” [61], $B$ cannot have more mutual information about $Y$ than $C$:
$$B \preceq_Y C \;\implies\; I(B;Y) \le I(C;Y). \tag{24}$$
Intuition suggests that when $B \preceq_Y C$, the information that $B$ provides about $Y$ is contained in the information that $C$ provides about $Y$. This intuition is formalized within a decision-theoretic framework using the so-called Blackwell’s Theorem [28,45,60]. To introduce this theorem, imagine a scenario in which $Y$ represents the state of the environment. Imagine also that there is an agent who acquires information about the environment via the conditional distribution $P_{B|Y}(b|y)$, and then uses outcome $B = b$ to select actions $a \in \mathcal{A}$ according to some “decision rule” given by the channel $\kappa_{A|B}$. Finally, the agent gains utility according to some utility function $u(a,y)$, which depends on the agent’s action $a$ and the environment’s state $y$. The maximum expected utility achievable by any decision rule is given by
$$V_Y^{\max}(B,u) := \max_{\kappa_{A|B}} \sum_{y,b,a} P_Y(y)\, P_{B|Y}(b|y)\, \kappa_{A|B}(a|b)\, u(a,y). \tag{25}$$
From an operational perspective, it is natural to say that $B$ is less informative than $C$ about $Y$ if there is no utility function such that an agent with access to $B$ can achieve higher expected utility than an agent with access to $C$. Blackwell’s Theorem states that this is precisely the case if and only if $B \preceq_Y C$ [28,45]:
$$B \preceq_Y C \quad\text{iff}\quad V_Y^{\max}(B,u) \le V_Y^{\max}(C,u) \ \text{ for all } u. \tag{26}$$
In some sense, this operational description of the relation $\preceq_Y$ is deeper than the data processing inequality, Equation (24), which says that $B \preceq_Y C$ is sufficient (but not necessary) for $I(B;Y) \le I(C;Y)$. In fact, it can happen that $I(B;Y) \le I(C;Y)$ even though $B \not\preceq_Y C$ [26,60,62].
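The two consequences of the Blackwell order discussed above can be checked numerically on a small example. The sketch below is our own illustration with arbitrarily chosen numbers: it builds $P_{B|Y}$ by passing $P_{C|Y}$ through a channel $\kappa_{B|C}$, as in Equation (23), and then compares mutual information and maximum expected utility for randomly drawn utility functions. The helper names are hypothetical, and the maximization over decision rules uses the fact that the objective is linear in $\kappa_{A|B}$, so a deterministic best action per observed outcome is optimal.

```python
import random
from math import log2


def garble(p_c_given_y, kappa_b_given_c):
    """P_{B|Y}(b|y) = sum_c kappa_{B|C}(b|c) P_{C|Y}(c|y); rows index the conditioning value."""
    n_b = len(kappa_b_given_c[0])
    return [[sum(row[c] * kappa_b_given_c[c][b] for c in range(len(row)))
             for b in range(n_b)]
            for row in p_c_given_y]


def mutual_information(p_y, p_s_given_y):
    """I(S;Y) in bits for a signal S observed through the channel P_{S|Y}."""
    n_y, n_s = len(p_y), len(p_s_given_y[0])
    p_s = [sum(p_y[y] * p_s_given_y[y][s] for y in range(n_y)) for s in range(n_s)]
    return sum(p_y[y] * p_s_given_y[y][s] * log2(p_s_given_y[y][s] / p_s[s])
               for y in range(n_y) for s in range(n_s)
               if p_y[y] * p_s_given_y[y][s] > 0)


def max_expected_utility(p_y, p_s_given_y, u):
    """V_Y^max(S, u) with u[a][y]: pick the best action separately for each signal value."""
    n_y, n_s = len(p_y), len(p_s_given_y[0])
    return sum(max(sum(p_y[y] * p_s_given_y[y][s] * u_a[y] for y in range(n_y))
                   for u_a in u)
               for s in range(n_s))


p_y = [0.5, 0.5]                              # assumed target distribution
p_c_given_y = [[0.9, 0.1], [0.2, 0.8]]        # assumed channel from Y to C
kappa_b_given_c = [[0.7, 0.3], [0.4, 0.6]]    # assumed garbling channel from C to B
p_b_given_y = garble(p_c_given_y, kappa_b_given_c)

rng = random.Random(0)
for _ in range(1000):
    u = [[rng.uniform(-1, 1) for _ in p_y] for _ in range(3)]  # 3 actions
    assert (max_expected_utility(p_y, p_b_given_y, u)
            <= max_expected_utility(p_y, p_c_given_y, u) + 1e-12)

print(mutual_information(p_y, p_b_given_y) <= mutual_information(p_y, p_c_given_y))  # True
```

Sampling utility functions can only fail to find a violation; it does not prove the order holds. Deciding $B \preceq_Y C$ for given channels requires solving the linear feasibility problem in Equation (23) for $\kappa_{B|C}$.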
A connection between PID and Blackwell’s theorem was first proposed in [13], which argued that the PID should be defined in an operational manner (see Section 5.3 for further discussion of [13]).

5.2. Blackwell Redundancy

We now define a measure of redundancy based on the Blackwell order. Specifically, we use our general definition of redundancy, Equation (7), while using the Blackwell order relative to $Y$ as the “more informative” relation $\sqsubset$:
$$I_\cap^{\prec}(X_1;\dots;X_n \to Y) := \sup_Q I(Q;Y) \quad\text{such that}\quad \forall i \ \ Q \preceq_Y X_i. \tag{27}$$
We refer to this measure as Blackwell redundancy.
Given Blackwell’s Theorem, $I_\cap^{\prec}$ has a simple operational interpretation. Imagine two agents, Alice and Bob, who can acquire information about $Y$ via different random variables, and then use this information to maximize their expected utility. Suppose that Alice has access to one of the sources $X_i$. Then, the Blackwell redundancy $I_\cap^{\prec}$ is the maximum information that Bob can have about $Y$ without being able to do better than Alice on any utility function, regardless of which source Alice has access to.
Blackwell redundancy can also be used to define a measure of Blackwell unique information, $U(X_i \to Y | X_1;\dots;X_n) := I(Y;X_i) - I_\cap^{\prec}(X_1;\dots;X_n \to Y)$, via Equation (2). As we show in Appendix I, $U$ satisfies the following property, which we term the Multivariate Blackwell property.
Theorem 3. 
$U(X_i \to Y | X_1;\dots;X_n) = 0$ if and only if $X_i \preceq_Y X_j$ for all $j \ne i$.
Operationally, Theorem 3 means that source $X_i$ has non-zero unique information iff there exists a utility function such that an agent with access to source $X_i$ can achieve higher utility than an agent with access to any other source $X_j$.
Computing $I_\cap^{\prec}$ involves maximizing a convex function subject to a set of linear constraints. These constraints define a feasible set which is a convex polytope, and the maximum must lie at one of the vertices of this polytope [63]. In Appendix C, we show how to solve this optimization problem. In particular, we use a computational geometry package to enumerate the vertices of the feasible set, and then choose the best vertex (code is available at [64]). In that appendix, we also prove that an optimal solution to Equation (27) can always be achieved by $Q$ with cardinality $|\mathcal{Q}| = \left(\sum_i |\mathcal{X}_i|\right) - n + 1$. Note that the supremum in Equation (27) is always achieved. Note also that $I_\cap^{\prec}$ satisfies the redundancy axioms in Section 4.2.
As discussed above, solving the optimization problem in Equation (27) gives a (possibly non-unique) optimal random variable $Q$ which specifies the content of the redundant information. As shown in Appendix C, solving Equation (27) also provides a set of channels $\kappa_{Q|X_i}$, one for each source $X_i$, which identify the redundant information in each source.
Note that the Blackwell order satisfies assumptions I-III in Section 4.1, thus Blackwell redundancy satisfies the bounds derived in that section. Finally, note that like many other redundancy measures, Blackwell redundancy becomes equivalent to the measure $I_\cap^{\mathrm{MMI}}$ (as defined in Equation (19)) when applied to Gaussian random variables (for details, see Appendix E).

5.3. Blackwell Union Information

We now define a measure of union information using our general definition in Equation (8), while using the Blackwell order relative to Y as the “more informative” relation:
$$I_\cup^{\prec}(X_1;\dots;X_n \to Y) := \inf_Q I(Q;Y) \quad\text{such that}\quad \forall i \ \ X_i \preceq_Y Q. \tag{28}$$
We refer to this measure as Blackwell union information.
As for Blackwell redundancy, Blackwell union information can be understood in operational terms. Consider two agents, Alice and Bob, who use information about $Y$ to maximize their expected utility. Suppose that Alice has access to one of the sources $X_i$. Then, the Blackwell union information $I_\cup^{\prec}$ is the minimum information that Bob must have about $Y$ in order to do at least as well as Alice on every utility function, regardless of which source Alice has access to.
Blackwell union information can be used to define measures of synergy and excluded information via Equations (1) and (3). The resulting measure of excluded information $E(X_i \to Y | X_1;\dots;X_n) := I_\cup^{\prec}(X_1;\dots;X_n \to Y) - I(Y;X_i)$ satisfies the following property, which is the “dual” of the Multivariate Blackwell property considered in Theorem 3. (See Appendix I for the proof.)
Theorem 4. 
$E(X_i \to Y | X_1;\dots;X_n) = 0$ if and only if $X_j \preceq_Y X_i$ for all $j \ne i$.
Operationally, Theorem 4 means that there is excluded information for source $X_i$ iff there exists a utility function such that an agent with access to one of the other sources $X_j$ can achieve higher expected utility than an agent with access to $X_i$.
We discuss the problem of numerically solving the optimization problem in Equation (28) in the next subsection.

5.4. Relation to Prior Work

Our measure of Blackwell redundancy $I_\cap^{\prec}$ is new to the PID literature. The most similar existing redundancy measure is $I_\cap^{\mathrm{GH}}$ [18], which is discussed above in Section 4.4. $I_\cap^{\mathrm{GH}}$ is a special case of Equation (7), once the “more informative” relation $B \sqsubset C$ is defined in terms of conditional independence $B - C - Y$. Note that conditional independence is stronger than the Blackwell order: given the definition of $\preceq_Y$ in Equation (23), it is clear that $B - C - Y$ implies $B \preceq_Y C$ (the channel $\kappa_{B|C}$ can be taken to be $P_{B|C}$), but not vice versa. As discussed in Section 4.1, stronger ordering relations give smaller values of redundancy, so in general $I_\cap^{\mathrm{GH}} \le I_\cap^{\prec}$. Note also that $B \preceq_Y C$ depends only on the pairwise marginals $P_{BY}$ and $P_{CY}$, while conditional independence $B - C - Y$ depends on the joint distribution $P_{BCY}$. As we discuss in Appendix F, the conditional independence order can be interpreted in decision-theoretic terms, which suggests an operational interpretation for $I_\cap^{\mathrm{GH}}$.
Interestingly, Blackwell union information $I_\cup^{\prec}$ is equivalent to two measures that have been previously proposed in the PID literature, although they were formulated in a different way. Bertschinger et al. [13] considered the following measure of bivariate redundancy:
$$I_\cap^{\mathrm{BROJA}}(X_1;X_2 \to Y) := I(Y;X_1) + I(Y;X_2) - I_\cup^{\mathrm{BROJA}}(X_1;X_2 \to Y), \tag{29}$$
where $I_\cup^{\mathrm{BROJA}}$ is defined via the optimization problem
$$I_\cup^{\mathrm{BROJA}}(X_1;X_2 \to Y) = \min_{\tilde{X}_1,\tilde{X}_2} I(Y;\tilde{X}_1,\tilde{X}_2) \quad\text{such that}\quad P_{\tilde{X}_1 Y} = P_{X_1 Y},\ P_{\tilde{X}_2 Y} = P_{X_2 Y}, \tag{30}$$
and reflects the minimal mutual information that two random variables can have about $Y$, given that their pairwise marginals with $Y$ are fixed to be $P_{X_1 Y}$ and $P_{X_2 Y}$. Note that Ref. [13] did not refer to $I_\cup^{\mathrm{BROJA}}$ as a measure of union information (we use our notation in writing it as $I_\cup^{\mathrm{BROJA}}$). Instead, these measures were derived from an operational motivation, with the goal of deriving a unique information measure that obeys the so-called Blackwell property: $I(Y;X_1) - I_\cap^{\mathrm{BROJA}}(X_1;X_2 \to Y) = 0$ if $X_1 \preceq_Y X_2$ (see Theorems 3 and 4 above).
Starting from a different motivation, Griffith and Koch [17] proposed a multivariate version of $I_\cup^{\mathrm{BROJA}}$,
$$I_\cup^{\mathrm{BROJA}}(X_1;\dots;X_n \to Y) = \min_{\tilde{X}_1,\dots,\tilde{X}_n} I(Y;\tilde{X}_1,\dots,\tilde{X}_n) \quad\text{such that}\quad \forall i \ \ P_{\tilde{X}_i Y} = P_{X_i Y}. \tag{31}$$
The goal of Ref. [17] was to derive a measure of multivariate synergy from a measure of union information, as in Equation (1). In that paper, $I_\cup^{\mathrm{BROJA}}$ was explicitly defined as a measure of union information. To our knowledge, Ref. [17] was the first (and perhaps only) paper to propose a measure of union information that was not derived from redundancy via the inclusion-exclusion principle.
While $I_\cup^{\mathrm{BROJA}}(X_1;\dots;X_n \to Y)$ and $I_\cup^{\prec}(X_1;\dots;X_n \to Y)$ are stated as different optimization problems, we prove in Appendix G that these optimization problems are equivalent, in that they will always achieve the same optimum value. Interestingly, since $I_\cup^{\mathrm{BROJA}}$ and $I_\cup^{\prec}$ are equivalent, our measure of Blackwell redundancy $I_\cap^{\prec}$ appears as the natural dual to $I_\cup^{\mathrm{BROJA}}$. Another implication of this equivalence is that Blackwell union information $I_\cup^{\prec}$ can be quantified by solving the optimization problem in Equation (31), rather than Equation (28). This is advantageous, because Equation (31) involves the minimization of a convex function over a convex polytope, which can be solved using standard convex optimization techniques [65].
In Ref. [13], the redundancy measure $I_\cap^{\mathrm{BROJA}}$ in Equation (29) was only defined for the bivariate case. Since then, it has been unclear how to extend this redundancy measure to more than two sources. However, by comparing Equations (14) and (29), we see the root of the problem: $I_\cap^{\mathrm{BROJA}}$ is derived by applying the inclusion-exclusion principle to a measure of union information, $I_\cup^{\mathrm{BROJA}}$. It cannot be extended to more than two sources because the inclusion-exclusion principle generally leads to counterintuitive results for more than 2 sources, as shown in Lemma 1. Note also that what Ref. [13] called the unique information in $X_1$, $I_\cup^{\mathrm{BROJA}}(X_1;X_2 \to Y) - I(Y;X_2)$, in our framework would be considered a measure of the excluded information for $X_2$.
At the same time, the union information measure $I_\cup^{\mathrm{BROJA}}$, and the corresponding synergy from Equation (1), do not use the inclusion-exclusion principle. Therefore, they can be easily extended to any number of sources [17].

5.5. Continuity of Blackwell Redundancy and Union Information

It is often desired that information-theoretic measures are continuous, meaning that small changes in underlying probability distributions lead to small changes in the resulting measures. In this section, we consider the continuity of our proposed measures, $I_\cap^{\prec}$ and $I_\cup^{\prec}$.
We first consider Blackwell redundancy $I_\cap^{\prec}$. It turns out that this measure is not always continuous in the joint distribution $P_{X_1 \dots X_n Y}$ (a discontinuous example is provided in Section 5.6). However, the discontinuity of $I_\cap^{\prec}$ is not necessarily pathological, and we can derive an interpretable geometric condition that guarantees that $I_\cap^{\prec}$ is continuous.
Consider the conditional distribution of the target $Y$ given some source $X_i$, $P_{Y|X_i}$. Let $\operatorname{rank} P_{Y|X_i}$ indicate its rank, meaning the dimension of the space spanned by the vectors $\{P_{Y|X_i = x_i}\}_{x_i \in \mathcal{X}_i}$. The rank of $P_{Y|X_i}$ quantifies the number of independent directions in which the target distribution $P_Y$ can be moved by manipulating the source distribution $P_{X_i}$, and it cannot be larger than $|\mathcal{Y}|$. The next theorem shows that $I_\cap^{\prec}$ is locally continuous, as long as $n - 1$ or more of the source conditional distributions have this maximal rank.
Theorem 5. 
As a function of the joint distribution $P_{X_1,\dots,X_n,Y}$, $I_\cap^{\prec}$ is locally continuous whenever $n - 1$ or more of the conditional distributions $P_{Y|X_i}$ have $\operatorname{rank} P_{Y|X_i} = |\mathcal{Y}|$.
In proving this result, we also show that $I_\cap^{\prec}$ is continuous almost everywhere (see proof in Appendix D). Finally, in that appendix we also use Theorem 5 to show that $I_\cap^{\prec}$ is continuous everywhere if $Y$ is a binary random variable.
We illustrate the meaning of Theorem 5 visually in Figure 2. We show two situations, both of which involve two sources $X_1$ and $X_2$ and a target $Y$ with cardinality $|\mathcal{Y}| = 3$. In one situation, both pairwise conditional distributions have rank equal to $|\mathcal{Y}|$, so $I_\cap^{\prec}$ is locally continuous. In the other situation, both pairwise conditional distributions are rank deficient (e.g., this might happen because $X_1$ and $X_2$ have cardinality $|\mathcal{X}_1| = |\mathcal{X}_2| = 2$), so $I_\cap^{\prec}$ is not guaranteed to be continuous. From the figure it is easy to see how the discontinuity may arise. Given the definition of the Blackwell order and $I_\cap^{\prec}$, for any random variable $Q$ in the feasible set of Equation (27), the conditional distributions $P_{Y|Q=q}$ must fall within the intersection of the distributions spanned by $P_{Y|X_1}$ and $P_{Y|X_2}$ (the intersection of the red and green shaded regions in Figure 2). On the right, the size of this intersection can discontinuously jump from a line (when $P_{Y|X_1}$ and $P_{Y|X_2}$ are perfectly aligned) to a point (when $P_{Y|X_1}$ and $P_{Y|X_2}$ are not perfectly aligned). Thus, the discontinuity of $I_\cap^{\prec}$ arises from a geometric phenomenon, which is related to the discontinuity of the intersection of low-dimensional vector subspaces.
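The rank condition in Theorem 5 is easy to check for a given set of conditional distributions. The sketch below is an illustration under our own assumptions: each $P_{Y|X_i}$ is stored as a list of rows, one distribution over $Y$ per source outcome, with made-up entries given as exact fractions so that the rank computation is not affected by rounding. The names `rank` and `theorem5_condition` are hypothetical.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix, computed by Gauss-Jordan elimination over exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0  # number of pivots found so far
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [x - factor * y for x, y in zip(m[i], m[r])]
        r += 1
    return r


def theorem5_condition(conditionals, n_y):
    """True if at least n - 1 of the P_{Y|X_i} have rank equal to |Y|."""
    full = sum(1 for p in conditionals if rank(p) == n_y)
    return full >= len(conditionals) - 1


# |Y| = 3. A ternary source can reach full rank; a binary source cannot.
ternary = [["1/2", "1/4", "1/4"], ["1/4", "1/2", "1/4"], ["1/4", "1/4", "1/2"]]
binary = [["1/2", "1/4", "1/4"], ["1/4", "1/4", "1/2"]]

print(rank(ternary), rank(binary))              # 3 2
print(theorem5_condition([ternary, binary], 3)) # True: one full-rank source suffices for n = 2
print(theorem5_condition([binary, binary], 3))  # False: continuity is not guaranteed
```

A `False` result only means that Theorem 5 does not apply; it does not by itself establish a discontinuity at that distribution.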
We briefly comment on the continuity of $I_\cup^{\prec}$. As we described above, this measure turns out to be equivalent to $I_\cup^{\mathrm{BROJA}}$. The continuity of $I_\cup^{\mathrm{BROJA}}$ in the bivariate case was proven in Theorem 35 of Ref. [66]. We believe that the continuity of $I_\cup^{\mathrm{BROJA}}$ for an arbitrary number of sources can be shown using similar methods, although we leave this for future work.

5.6. Behavior on the COPY Gate

As mentioned in Section 3, the “COPY gate” example is often used to test the behavior of different redundancy measures. The COPY gate has two sources, $X_1$ and $X_2$, and a target $Y = (X_1, X_2)$ which is a copy of the joint outcome. It is expected that redundancy should vanish if $X_1$ and $X_2$ are statistically independent, as formalized by the Independent identity property in Equation (4).
Blackwell redundancy $I_\cap^{\prec}$ satisfies the Independent identity. In fact, we prove a more general result, which shows that $I_\cap^{\prec}(X_1;X_2 \to (X_1,X_2))$ is equal to an information-theoretic measure called Gács-Körner common information $C(X \wedge Y)$ [16,47,67]. $C(X \wedge Y)$ quantifies the amount of information that can be deterministically extracted from either random variable, $X$ or $Y$, and it is closely related to the “deterministic function” order $\lhd$ defined in Equation (16). Formally, it can be written as
$$C(X \wedge Y) = \sup_Q H(Q) \quad\text{such that}\quad Q \lhd X,\ Q \lhd Y, \tag{32}$$
where H is Shannon entropy. In Appendix I, we prove the following result.
Theorem 6. 
$I_\cap^{\prec}(X_1;X_2 \to (X_1,X_2)) = C(X_1 \wedge X_2)$.
Note that $0 \le C(X_1 \wedge X_2) \le I(X_1;X_2)$ [47], so $I_\cap^{\prec}$ satisfies the Independent identity property. At the same time, $C(X_1 \wedge X_2)$ can be strictly less than $I(X_1;X_2)$. For example, if $P_{X_1 X_2}$ has full support, then $I(X_1;X_2)$ can be arbitrarily large while $C(X_1 \wedge X_2) = 0$ (see proof of Theorem 6). This means that $I_\cap^{\prec}$ violates a previously proposed property, sometimes called the Identity property, that suggests that redundancy should satisfy $I_\cap(X_1;X_2 \to (X_1,X_2)) = I(X_1;X_2)$. However, the validity of the Identity property is not clear, and several papers have argued against it [15,39].
The value of $C(X_1 \wedge X_2)$ depends on the precise pattern of zeros in the joint distribution $P_{X_1 X_2}$ and is therefore not continuous. For instance, for the bivariate COPY gate, redundancy can change discontinuously as one goes from the situation where $X_1 = X_2$ (so that all information is redundant, $I_\cap^{\prec} = I(X_1;X_2)$) to one where $X_1$ and $X_2$ are almost, but not entirely, identical. This discontinuity can be understood in terms of Theorem 5 and Figure 2: in the COPY gate, the cardinality of the target variable $|\mathcal{Y}| = |\mathcal{X}_1| \times |\mathcal{X}_2|$ is larger than the cardinality of the individual sources. In other words, when the sources $X_1$ and $X_2$ are not perfectly correlated, they provide information about different “subspaces” of the target $(X_1, X_2)$, and so it is possible that very little (or none) of their information is redundant.
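The dependence on the pattern of zeros can be seen numerically. The sketch below relies on a standard characterization that is not spelled out in this section: the Gács-Körner common information equals the entropy of the connected-component label of the bipartite graph that links $x_1$ and $x_2$ whenever $P_{X_1 X_2}(x_1, x_2) > 0$. The function names and the noise level 0.01 are our own choices for illustration.

```python
from math import log2


def gacs_korner(joint):
    """C(X1 ^ X2) in bits for a dict {(x1, x2): probability}."""
    parent = {}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Union the two endpoints of every positive-probability pair.
    for (x1, x2), p in joint.items():
        if p > 0:
            a, b = ("x1", x1), ("x2", x2)
            parent.setdefault(a, a)
            parent.setdefault(b, b)
            parent[find(a)] = find(b)

    # Probability mass of each connected component, then its entropy.
    mass = {}
    for (x1, x2), p in joint.items():
        if p > 0:
            root = find(("x1", x1))
            mass[root] = mass.get(root, 0.0) + p
    return -sum(p * log2(p) for p in mass.values())


def mutual_information(joint):
    """I(X1;X2) in bits for a dict {(x1, x2): probability}."""
    p1, p2 = {}, {}
    for (x1, x2), p in joint.items():
        p1[x1] = p1.get(x1, 0.0) + p
        p2[x2] = p2.get(x2, 0.0) + p
    return sum(p * log2(p / (p1[x1] * p2[x2]))
               for (x1, x2), p in joint.items() if p > 0)


perfect_copy = {(0, 0): 0.5, (1, 1): 0.5}
noisy_copy = {(0, 0): 0.495, (0, 1): 0.005, (1, 0): 0.005, (1, 1): 0.495}

for name, joint in [("perfect", perfect_copy), ("noisy", noisy_copy)]:
    print(name, round(gacs_korner(joint), 4), round(mutual_information(joint), 4))
```

With perfectly correlated bits the common information is one bit, matching $I(X_1;X_2)$. Adding a small amount of noise gives the joint distribution full support, so the support graph becomes connected and the common information drops to zero, while the mutual information changes only slightly.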
At the same time, the Blackwell property, Theorem 3, implies that
$$I_\cap^{\prec}(X_1;X_2 \to X_1) = I(X_1;X_2) = I_\cap^{\prec}(X_1;X_2 \to X_2). \tag{33}$$
In other words, the redundancy in $X_1$ and $X_2$, where either one of the individual sources is taken as the target, is given by the mutual information $I(X_1;X_2)$. This holds even though the redundancy in the COPY gate can be much lower than $I(X_1;X_2)$.
It is also interesting to consider how Blackwell union information, $I_\cup^{\prec}$, behaves on the COPY gate. Using techniques from [13], it can be shown that the union information is simply the joint entropy,
$$I_\cup^{\prec}(X_1;X_2 \to (X_1,X_2)) = H(X_1,X_2). \tag{34}$$
Since $H(X_1,X_2) = I(X_1,X_2;X_1,X_2)$, Equations (1) and (34) together imply that the COPY gate has no synergy.
Note that we can use Theorem 6 and Equation (34) to illustrate that $I_\cap^{\prec}$ and $I_\cup^{\prec}$ violate the inclusion-exclusion principle, Equation (14). Using Equation (34) and a bit of rearranging, Equation (14) becomes equivalent to $I_\cap^{\prec}(X_1;X_2 \to (X_1,X_2)) \overset{?}{=} I(X_1;X_2)$, which is the Identity property mentioned above. $I_\cap^{\prec}$ violates this property, since redundancy for the COPY gate can be smaller than $I(X_1;X_2)$.
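The rearrangement can be spelled out as follows. This sketch assumes that Equation (14) has the bivariate form $I_\cup(X_1;X_2 \to Y) \overset{?}{=} I(Y;X_1) + I(Y;X_2) - I_\cap(X_1;X_2 \to Y)$, which is the form suggested by Equation (29), and uses $I((X_1,X_2);X_i) = H(X_i)$ for the COPY gate target $Y = (X_1, X_2)$:

```latex
\begin{aligned}
I_\cup^{\prec}(X_1;X_2 \to (X_1,X_2))
  &\overset{?}{=} H(X_1) + H(X_2) - I_\cap^{\prec}(X_1;X_2 \to (X_1,X_2)) \\
H(X_1,X_2)
  &\overset{?}{=} H(X_1) + H(X_2) - I_\cap^{\prec}(X_1;X_2 \to (X_1,X_2))
  && \text{by Equation (34)} \\
I_\cap^{\prec}(X_1;X_2 \to (X_1,X_2))
  &\overset{?}{=} H(X_1) + H(X_2) - H(X_1,X_2) = I(X_1;X_2).
\end{aligned}
```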

6. Examples and Comparisons to Previous Measures

In this section, we compare our proposed measure of Blackwell redundancy $I_\cap^{\prec}$ to existing redundancy measures. We focus on redundancy, rather than union information, because redundancy has seen much more development in the literature, and because Blackwell union information $I_\cup^{\prec}$ is equivalent to an existing measure (see Section 5.4).

6.1. Qualitative Comparison

In Table 1, we compare $I_\cap^{\prec}$ to six existing measures of multivariate redundancy:
  • $I_\cap^{\mathrm{WB}}$, the redundancy measure first proposed by Williams and Beer [11].
  • $I_\cap^{\mathrm{MMI}}$, the “minimum mutual information” [51], Equation (19) in Section 4.4.
  • $I_\cap^{\lhd}$, proposed by Griffith et al. [16], Equation (17) in Section 4.4.
  • $I_\cap^{\mathrm{GH}}$, proposed by Griffith and Ho [18], Equation (18) in Section 4.4.
  • $I_\cap^{\mathrm{Ince}}$, proposed by Ince [20].
  • $I_\cap^{\mathrm{FL}}$, proposed by Finn and Lizier [21].
We also compare $I_\cap^{\prec}$ to three existing measures of bivariate redundancy (i.e., for 2 sources):
  • $I_\cap^{\mathrm{BROJA}}$, proposed by Bertschinger et al. [13], defined in Equation (29).
  • $I_\cap^{\mathrm{Harder}}$, proposed by Harder et al. [19].
  • $I_\cap^{\mathrm{dep}}$, proposed by James et al. [15].
For $I_\cap^{\prec}$ as well as the 9 existing measures, we consider the following properties, which are chosen to highlight differences between our approach and previous proposals:
  • Has it been defined for more than 2 sources
  • Does it obey the Monotonicity axiom from Section 4.2
  • Is it compatible with the inclusion-exclusion principle (IEP) for the bivariate case, such that union information as defined in Equation (14) obeys $I_\cup(X_1; X_2 \to Y) \le I(X_1, X_2; Y)$
  • Does it obey the Independent identity property, Equation (4)
  • Does it obey the Blackwell property (possibly in its multivariate form, Theorem 3)
We also consider two additional properties, which require a bit of introduction.
The first property was suggested by Ref. [13], who argued that redundancy should only depend on the pairwise marginal distributions of each source with the target,
$$\text{If } p_{X_i Y} = p_{\tilde{X}_i \tilde{Y}} \text{ for all } i, \text{ then } I_\cap(X_1; \dots; X_n \to Y) = I_\cap(\tilde{X}_1; \dots; \tilde{X}_n \to \tilde{Y}). \tag{35}$$
In Table 1, we term this property Pairwise marginals. We believe that the validity of Equation (35) is not universal, but may depend on the particular setting in which the PID is being used. However, redundancy measures that satisfy this property have one important advantage: they are well-defined not only when the sources are random variables $X_1, \dots, X_n$, but also in the more general case when the sources are channels $\kappa_{X_1|Y}, \dots, \kappa_{X_n|Y}$.
The second property has not been previously considered in the literature, although it appears to be highly intuitive. Observe that the target random variable $Y$ contains all possible information about itself. Thus, it may be expected that adding the target to the set of sources should not decrease the redundancy:
$$I_\cap(X_1; \dots; X_n; Y \to Y) = I_\cap(X_1; \dots; X_n \to Y).$$
In Table 1, we term this property Target equality. Note that for redundancy measures which can be put in the form of Equation (7), Target equality is satisfied if the order $\sqsubset$ obeys $X_i \sqsubset Y$ for all sources $X_i$. (Note also that Target equality is unrelated to the previously proposed Strong Symmetry property; for instance, it is easy to show that the redundancy measures $I_\cap^{\mathrm{WB}}$ and $I_\cap^{\mathrm{MMI}}$ satisfy Target equality, even though they violate Strong Symmetry [68]).
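As a small worked illustration of Target equality, consider the minimum mutual information measure. The sketch below assumes that Equation (19) defines it as $I_\cap^{\mathrm{MMI}}(X_1; \dots; X_n \to Y) = \min_i I(X_i; Y)$, as its name suggests; the only other ingredient is the standard bound $I(X_i; Y) \le H(Y)$.

```latex
\begin{align*}
I_\cap^{\mathrm{MMI}}(X_1; \dots; X_n; Y \to Y)
  &= \min\Big\{ \min_i I(X_i; Y),\; I(Y; Y) \Big\} \\
  &= \min\Big\{ \min_i I(X_i; Y),\; H(Y) \Big\} \\
  &= \min_i I(X_i; Y)
   = I_\cap^{\mathrm{MMI}}(X_1; \dots; X_n \to Y),
\end{align*}
```

where the last line uses $I(X_i; Y) \le H(Y)$ for every source.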

6.2. Quantitative Comparison

We now illustrate our proposed measure of redundancy $I_\cap^\prec$ on some simple examples, and compare its behavior to existing redundancy measures.
The values of $I_\cap^\prec$ were computed with our code, provided at [64]. The values of all other redundancy measures except $I_\cap^{\mathrm{GH}}$ were computed using the dit Python package [69]. To our knowledge, there have been no previous proposals for how to compute $I_\cap^{\mathrm{GH}}$. In fact, this measure involves maximizing a convex function subject to linear constraints, and can be computed using methods similar to those used for $I_\cap^\prec$. We provide code for computing $I_\cap^{\mathrm{GH}}$ at [64].
We begin by considering some simple bivariate examples. In all cases, the sources $X_1$ and $X_2$ are binary and uniformly distributed. The results are shown in Table 2.
  • The AND gate, $Y = X_1 \text{ AND } X_2$, with $X_1$ and $X_2$ independent. (It is incorrectly stated in Refs. [18,49] that $I_\cap^{\mathrm{GH}}$ vanishes here; actually $I_\cap^{\mathrm{GH}}(X_1; X_2 \to X_1 \text{ AND } X_2) \approx 0.123$, which corresponds to the maximum achieved in Equation (18) by $Q = X_1 \text{ OR } X_2$.)
  • The SUM gate: $Y = X_1 + X_2$, with $X_1$ and $X_2$ independent.
  • The UNQ gate: $Y = X_1$. Here $I_\cap^{\mathrm{Ince}}$ (marked with ∗) gave values that increased with the amount of correlation between $X_1$ and $X_2$ but were typically larger than $I(X_1; X_2)$.
  • The COPY gate: $Y = (X_1, X_2)$. Here, our redundancy measure is equal to the Gács-Körner common information between $X_1$ and $X_2$, as discussed in Section 5.6. The same holds for the redundancy measures $I_\cap^{\mathrm{GH}}$ and $I_\cap^{\wedge}$, which can be shown using a slight modification of the proof of Theorem 6. For this gate, $I_\cap^{\mathrm{Ince}}$ (marked with ∗) gave the same values as for the UNQ gate, which increased with the amount of correlation between $X_1$ and $X_2$ but were typically larger than $I(X_1; X_2)$.
We also analyze several examples with three sources, with the results shown in Table 3. We considered those previously proposed measures which can be applied to more than two sources (we do not show $I_\cap^{\mathrm{GH}}$, as our implementation was too slow for these examples).
  • Three-way AND gate: $Y = X_1 \text{ AND } X_2 \text{ AND } X_3$, where the sources are binary and uniformly and independently distributed.
  • Three-way SUM gate: $Y = X_1 + X_2 + X_3$, where the sources are binary and uniformly and independently distributed.
  • “Overlap” gate: we defined four independent uniformly distributed binary random variables, $A, B, C, D$. These were grouped into three sources $X_1, X_2, X_3$ as $X_1 = (A, B)$, $X_2 = (A, C)$, $X_3 = (A, D)$. The target was the joint outcome of all three sources, $Y = (X_1, X_2, X_3) = ((A, B), (A, C), (A, D))$. Note that the three sources overlap on a single random variable $A$, which suggests that the redundancy should be 1 bit.
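The gates listed above are easy to build by hand, which can help when checking table entries. The following minimal sketch constructs several of them with independent uniform inputs and evaluates only the simplest baseline, the minimum mutual information $\min_i I(X_i; Y)$. The helper names `mutual_information` and `mmi` are illustrative and are not taken from the paper's code or from dit.

```python
import itertools
import math
from collections import defaultdict


def mutual_information(pairs):
    """I(X;Y) in bits, given a list of equally likely (x, y) outcomes."""
    n = len(pairs)
    pxy, px, py = defaultdict(float), defaultdict(float), defaultdict(float)
    for x, y in pairs:
        pxy[(x, y)] += 1 / n
        px[x] += 1 / n
        py[y] += 1 / n
    return sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in pxy.items())


def mmi(inputs, sources, target):
    """Minimum over sources of I(X_i;Y), for uniformly distributed inputs."""
    rows = list(inputs)
    return min(
        mutual_information([(src(r), target(r)) for r in rows]) for src in sources
    )


bits2 = list(itertools.product([0, 1], repeat=2))
bits3 = list(itertools.product([0, 1], repeat=3))
bits4 = list(itertools.product([0, 1], repeat=4))
two_sources = [lambda r: r[0], lambda r: r[1]]
three_sources = [lambda r: r[0], lambda r: r[1], lambda r: r[2]]

and_mmi = mmi(bits2, two_sources, lambda r: r[0] & r[1])
sum_mmi = mmi(bits2, two_sources, lambda r: r[0] + r[1])
copy_mmi = mmi(bits2, two_sources, lambda r: r)
and3_mmi = mmi(bits3, three_sources, lambda r: r[0] & r[1] & r[2])

# "Overlap" gate: inputs are (A, B, C, D); sources are (A,B), (A,C), (A,D).
overlap_sources = [
    lambda r: (r[0], r[1]),
    lambda r: (r[0], r[2]),
    lambda r: (r[0], r[3]),
]
overlap_mmi = mmi(
    bits4, overlap_sources, lambda r: ((r[0], r[1]), (r[0], r[2]), (r[0], r[3]))
)

print(and_mmi, sum_mmi, copy_mmi, and3_mmi, overlap_mmi)
```

For the “Overlap” gate this baseline evaluates to 2 bits, since each source is fully determined by the target, whereas the intuition stated above suggests 1 bit of redundancy.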

7. Discussion and Future Work

In this paper, we proposed a new general framework for defining the partial information decomposition (PID). Our framework was motivated in several ways, including a formal analogy with intersections and unions in set theory as well as an axiomatic derivation.
We also used our general framework to propose concrete measures of redundancy and union information, which have clear operational interpretations based on Blackwell’s theorem. Other PID measures, such as synergy and unique information, can be computed from our measures of redundancy and union information via simple expressions.
One unusual aspect of our framework is that it provides separate measures of redundancy and union information. As we discuss above, most prior work on the PID assumed that redundancy and union information are related to each other via the so-called “inclusion-exclusion” principle. We argue that the inclusion-exclusion principle should not be expected to hold in the context of the PID, and in fact that it leads to counterintuitive behavior once 3 or more sources are present. This suggests that different information decompositions should be derived for redundancy vs. union information. This idea is related to a recent proposal in the literature, which argues that two different PIDs are needed, one based on redundancy and one based on synergy [41]. An interesting direction for future work is to relate our framework with the dual decompositions proposed in [41].
From a practical standpoint, an important direction for future work is to develop better schemes for computing our redundancy measure. This measure is defined in terms of a convex maximization problem, which in principle can be NP-hard (a similar convex maximization problem was proven to be NP-hard in [70]). Our current implementation, which enumerates the vertices of the feasible set, works well for relatively small state spaces, but we do not expect it to scale to situations with many sources, or where the sources have large cardinalities. However, the problem of convex maximization with linear constraints is a very active area of optimization research, with many proposed algorithms [63,71,72]. Investigating these algorithms, as well as various approximation schemes such as relaxations and variational bounds, is of interest.
Finally, we showed how our framework can be used to define measures of redundancy and union information in situations that go beyond the standard setting of the PID (e.g., when the probability distribution of the target is not specified). Our framework can even be applied in domains beyond Shannon information theory, such as algorithmic information theory and quantum information theory. Future work may exploit this flexibility to explore various new applications of the PID.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Acknowledgments

We thank Paul Williams, Alexander Gates, Nihat Ay, Bernat Corominas-Murtra, Pradeep Banerjee, and especially Johannes Rauh for helpful discussions and suggestions. We also thank the Santa Fe Institute for helping to support this research.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. PID Axioms

In developing the PID framework, Williams and Beer [11,12] proposed that any measure of redundancy should obey a set of axioms. In slightly modified form, these axioms can be written as follows:
  • Symmetry: $I_\cap(X_1; \dots; X_n \to Y)$ is invariant to the permutation of $X_1, \dots, X_n$.
  • Self-redundancy: $I_\cap(X_1 \to Y) = I(Y; X_1)$.
  • Monotonicity: $I_\cap(X_1; \dots; X_n \to Y) \le I_\cap(X_1; \dots; X_{n-1} \to Y)$.
  • Deterministic equality: $I_\cap(X_1; \dots; X_n \to Y) = I_\cap(X_1; \dots; X_{n-1} \to Y)$ if $X_i = f(X_n)$ for some $i < n$ and deterministic function $f$.
These axioms are based on intuitions regarding the behavior of intersection in set theory [12]. The Symmetry axiom is self-explanatory. Self-redundancy states that if only a single source is present, all of its information is redundant. Monotonicity states that redundancy should not increase when an additional source is considered (consider that the size of set intersection can only decrease as more sets are considered). Deterministic equality states that redundancy should remain the same when an additional source $X_n$ is added that contains all (or more) of the same information that is already contained in an existing source $X_i$ (which is formalized as the condition $X_i = f(X_n)$).
Union information was considered in the original PID proposal [12,73], as well as in a more recent paper [17]. Ref. [17] proposed that any measure of union information should satisfy the following set of natural axioms, stated here in slightly modified form:
  • Symmetry: $I_\cup(X_1; \dots; X_n \to Y)$ is invariant to the permutation of $X_1, \dots, X_n$.
  • Self-union: $I_\cup(X_1 \to Y) = I(Y; X_1)$.
  • Monotonicity: $I_\cup(X_1; \dots; X_n \to Y) \ge I_\cup(X_1; \dots; X_{n-1} \to Y)$.
  • Deterministic equality: $I_\cup(X_1; \dots; X_n \to Y) = I_\cup(X_1; \dots; X_{n-1} \to Y)$ if $X_n = f(X_i)$ for some $i < n$ and deterministic function $f$.
These axioms are based on intuitions concerning the behavior of the union operator in set theory, and are the natural “duals” of the redundancy axioms mentioned above.

Appendix B. Uniqueness Proofs

Proof of Theorem 1. 
Assume there is a redundancy measure $I'_\cap$ that obeys the five axioms stated in the theorem. We will show that $I'_\cap = I_\cap$, as defined in Equation (7).
Given Equation (7) and the definition of the supremum, for any $\epsilon > 0$ there exists a random variable $Q$ such that $Q \sqsubset X_i$ for $i \in \{1, \dots, n\}$ and
$$I(Q; Y) \ge I_\cap(X_1; \dots; X_n \to Y) - \epsilon. \tag{A1}$$
By Order equality, $I'_\cap(Q; X_1; \dots; X_k \to Y) = I'_\cap(Q; X_1; \dots; X_{k-1} \to Y)$. Induction gives
$$I'_\cap(Q; X_1; \dots; X_n \to Y) = I'_\cap(Q \to Y) = I(Q; Y) \ge I_\cap(X_1; \dots; X_n \to Y) - \epsilon,$$
where we used Self-redundancy and Equation (A1). We also have $I'_\cap(Q; X_1; \dots; X_n \to Y) \le I'_\cap(X_1; \dots; X_n \to Y)$ by Symmetry and Monotonicity. Combining gives
$$I_\cap(X_1; \dots; X_n \to Y) - \epsilon \le I'_\cap(X_1; \dots; X_n \to Y).$$
We now show that $I_\cap$ is the largest measure that satisfies Existence. Let $Q$ be a random variable that obeys $Q \sqsubset X_i$ for all $i \in \{1, \dots, n\}$ and $I'_\cap(X_1; \dots; X_n \to Y) = I(Y; Q)$. Since $Q$ falls within the feasible set of the optimization problem in Equation (7),
$$I'_\cap(X_1; \dots; X_n \to Y) = I(Q; Y) \le I_\cap(X_1; \dots; X_n \to Y).$$
Combining gives
$$I'_\cap(X_1; \dots; X_n \to Y) - \epsilon \le I_\cap(X_1; \dots; X_n \to Y) - \epsilon \le I'_\cap(X_1; \dots; X_n \to Y).$$
Since this holds for all $\epsilon > 0$, taking the limit $\epsilon \to 0$ gives $I'_\cap = I_\cap$. □
Proof of Theorem 2. 
Assume there is a union information measure $I'_\cup$ that obeys the five axioms stated in the theorem. We will show that $I'_\cup = I_\cup$, as defined in Equation (8).
Given Equation (8) and the definition of the infimum, for any $\epsilon > 0$ there exists a random variable $Q$ such that $X_i \sqsubset Q$ for $i \in \{1, \dots, n\}$ and
$$I(Q; Y) \le I_\cup(X_1; \dots; X_n \to Y) + \epsilon. \tag{A2}$$
By Order equality, $I'_\cup(Q; X_1; \dots; X_k \to Y) = I'_\cup(Q; X_1; \dots; X_{k-1} \to Y)$. Induction gives
$$I'_\cup(Q; X_1; \dots; X_n \to Y) = I'_\cup(Q \to Y) = I(Q; Y) \le I_\cup(X_1; \dots; X_n \to Y) + \epsilon,$$
where we used Self-union and Equation (A2). We also have $I'_\cup(Q; X_1; \dots; X_n \to Y) \ge I'_\cup(X_1; \dots; X_n \to Y)$ by Symmetry and Monotonicity. Combining gives
$$I_\cup(X_1; \dots; X_n \to Y) + \epsilon \ge I'_\cup(X_1; \dots; X_n \to Y).$$
We now show that $I_\cup$ is the smallest measure that satisfies Existence. Let $Q$ be a random variable that obeys $X_i \sqsubset Q$ for all $i \in \{1, \dots, n\}$ and $I'_\cup(X_1; \dots; X_n \to Y) = I(Y; Q)$. Since $Q$ falls within the feasible set of the optimization problem in Equation (8),
$$I'_\cup(X_1; \dots; X_n \to Y) = I(Q; Y) \ge I_\cup(X_1; \dots; X_n \to Y).$$
Combining gives
$$I'_\cup(X_1; \dots; X_n \to Y) \le I_\cup(X_1; \dots; X_n \to Y) + \epsilon \le I'_\cup(X_1; \dots; X_n \to Y) + \epsilon.$$
Since this holds for all $\epsilon > 0$, taking the limit $\epsilon \to 0$ gives $I'_\cup = I_\cup$. □

Appendix C. Computing $I_\cap^\prec$

Here we consider the optimization problem that defines our proposed measure of redundancy, Equation (27). We first prove a bound on the required cardinality of Q.
Theorem A1. 
For optimizing Equation (27), it suffices to consider $Q$ with cardinality $|\mathcal{Q}| = \sum_i |\mathcal{X}_i| - n + 1$.
Proof. 
Consider any random variable $Q$ with outcome set $\mathcal{Q}$ which satisfies $Q \preceq_Y X_i$ for all $i$. We show that whenever $Q$ has full support on $|\mathcal{Q}| > \sum_i |\mathcal{X}_i| - n + 1$ outcomes, there is another random variable $\tilde{Q}$ which achieves $I(\tilde{Q}; Y) \ge I(Q; Y)$, while satisfying $\tilde{Q} \preceq_Y X_i$ for all $i$ and having support on at most $\sum_i |\mathcal{X}_i| - n + 1$ outcomes.
To begin, let $\Omega$ indicate the set of random variables over outcomes $\mathcal{Q}$, such that all $\tilde{Q} \in \Omega$ satisfy:
$$P_{Y|\tilde{Q}}(y|q) = P_{Y|Q}(y|q) \quad \text{for all } y,\ q \in \{q \in \mathcal{Q} : P_{\tilde{Q}}(q) > 0\}, \tag{A3}$$
$$\sum_q P_{\tilde{Q}}(q) P_{X_i|Q}(x_i|q) = P_{X_i}(x_i) \quad \text{for all } i, x_i. \tag{A4}$$
Since $Q \preceq_Y X_i$, by Equation (23) there exist channels $\kappa_{Q|X_i}(q|x_i)$ that satisfy $P_{Q|Y}(q|y) = \sum_{x_i} \kappa_{Q|X_i}(q|x_i) P_{X_i|Y}(x_i|y)$. Now write the joint distribution over $\tilde{Q}$ and $Y$ as
$$P_{\tilde{Q}Y}(q, y) = \frac{P_{\tilde{Q}}(q)}{P_Q(q)} P_{QY}(q, y) = \sum_{x_i} \frac{P_{\tilde{Q}}(q)}{P_Q(q)} \kappa_{Q|X_i}(q|x_i) P_{X_i|Y}(x_i|y) P_Y(y) = \sum_{x_i} \kappa_{\tilde{Q}|X_i}(q|x_i) P_{X_i|Y}(x_i|y) P_Y(y), \tag{A5}$$
where we used Equation (A3) and defined the channel $\kappa_{\tilde{Q}|X_i}$ as
$$\kappa_{\tilde{Q}|X_i}(q|x_i) = \frac{P_{\tilde{Q}}(q)}{P_{X_i}(x_i)} \frac{P_{X_i}(x_i)}{P_Q(q)} \kappa_{Q|X_i}(q|x_i).$$
(Note this is a kind of double Bayesian inverse, given Equation (A4)). Equation (A5) implies that $\tilde{Q} \preceq_Y X_i$ for all $i$.
We now show that there is $\tilde{Q} \in \Omega$ that achieves $I(Q; Y) \le I(\tilde{Q}; Y)$ and has support on at most $\sum_i |\mathcal{X}_i| - n + 1$ outcomes in $\mathcal{Q}$. Write the mutual information between any $\tilde{Q} \in \Omega$ and $Y$ as
$$I(\tilde{Q}; Y) = \sum_q P_{\tilde{Q}}(q) D_{\mathrm{KL}}(P_{Y|\tilde{Q}=q} \| P_Y) = \sum_q P_{\tilde{Q}}(q) D_{\mathrm{KL}}(P_{Y|Q=q} \| P_Y), \tag{A6}$$
where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence. We consider the maximum of this mutual information across $\Omega$, $I^* = \max_{\tilde{Q} \in \Omega} I(\tilde{Q}; Y)$. Using Equations (A4) and (A6), this maximum can be written as
$$I^* = \max_{\omega \in \Delta} \sum_q \omega(q) D_{\mathrm{KL}}(P_{Y|Q=q} \| P_Y) \quad \text{such that} \quad \forall i, x_i : \sum_q \omega(q) P_{X_i|Q}(x_i|q) = P_{X_i}(x_i),$$
where $\Delta$ is the set of all distributions over $\mathcal{Q}$. By conservation of probability, $\sum_{x_i} P_{X_i}(x_i) = 1$, so we can eliminate a constraint for one of the outcomes $x_i$ of each source $i$. Thus, $I^*$ is the maximum of a linear function over $\Delta$, subject to $\sum_i (|\mathcal{X}_i| - 1) = \sum_i |\mathcal{X}_i| - n$ hyperplane constraints.
The feasible set is compact, and the maximum will be achieved at one of the extreme points of the feasible set. By Dubins' theorem [74], any extreme point of this feasible set can be expressed as a convex combination of at most $\sum_i |\mathcal{X}_i| - n + 1$ extreme points of $\Delta$. In other words, the maximum in Equation (27) is achieved by a random variable $\tilde{Q}$ with support on at most $\sum_i |\mathcal{X}_i| - n + 1$ values of $\mathcal{Q}$. This random variable satisfies
$$I(\tilde{Q}; Y) = I^* \ge I(Q; Y),$$
where the last inequality comes from the fact that $Q$ is an element of $\Omega$. □
We now return to the optimization problem in Equation (27). Given Theorem A1 and the definition of the Blackwell order in Equation (23), it can be rewritten as
$$I_\cap^\prec(X_1; \dots; X_n \to Y) = \max_{\kappa_{Q|Y},\, \kappa_{Q|X_1}, \dots, \kappa_{Q|X_n}} I_\kappa(Q; Y) \quad \text{such that} \quad \forall i, q, y : \sum_{x_i} \kappa_{Q|X_i}(q|x_i) P_{X_i|Y}(x_i|y) = \kappa_{Q|Y}(q|y), \tag{A7}$$
where the optimization is over channels with $Q$ of cardinality $\sum_i |\mathcal{X}_i| - n + 1$. The notation $I_\kappa(Q; Y)$ indicates the mutual information that arises from the marginal distribution $P_Y$ and the conditional distribution $\kappa_{Q|Y}$,
$$I_\kappa(Q; Y) = \sum_{y, q} P_Y(y) \kappa_{Q|Y}(q|y) \ln \frac{\kappa_{Q|Y}(q|y)}{\sum_{y'} \kappa_{Q|Y}(q|y') P_Y(y')}.$$
Equation (A7) involves maximizing a convex function over the convex polytope defined by the following system of linear (in)equalities:
\begin{align}
\Lambda = \big\{ (\kappa_{Q|Y}, \kappa_{Q|X_1}, \dots, \kappa_{Q|X_n}) : \quad & \notag \\
\forall i, x_i, q \quad & \kappa_{Q|X_i}(q|x_i) \ge 0, \tag{A8} \\
\forall q, y \quad & \kappa_{Q|Y}(q|y) \ge 0, \tag{A9} \\
\forall y \quad & \textstyle\sum_q \kappa_{Q|Y}(q|y) = 1, \tag{A10} \\
\forall i, x_i \quad & \textstyle\sum_q \kappa_{Q|X_i}(q|x_i) = 1, \tag{A11} \\
\forall i, y, q \in \mathcal{Q} \setminus \{0\} \quad & \textstyle\sum_{x_i} \kappa_{Q|X_i}(q|x_i) P_{X_i Y}(x_i, y) - \kappa_{Q|Y}(q|y) P_Y(y) = 0 \big\}. \tag{A12}
\end{align}
We do not place a constraint on $q = 0$ in Equation (A12) because that would be redundant with the constraints in Equations (A10) and (A11). Also, note that we replaced the sup in Equation (27) with max in Equation (A7), which is justified since we are optimizing over a finite dimensional, closed, and bounded region (thus the supremum is always achieved).
The maximum of a convex function over a convex polytope is found at one of the vertices of the polytope. To find the solution to Equation (A7), we use a computational geometry package to enumerate the vertices of $\Lambda$. We evaluate $I_\kappa(Q; Y)$ at each vertex, and pick the maximum value. This procedure also finds optimal conditional distributions $\kappa_{Q|Y}, \kappa_{Q|X_1}, \dots, \kappa_{Q|X_n}$. Code is available at [64].
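The vertex-enumeration idea can be illustrated on a very small instance. The sketch below is not the implementation referenced at [64]: instead of a computational geometry package, it enumerates basic feasible solutions of the equality constraints (A10) to (A12) by brute force, which is only practical for tiny problems and assumes the constraint matrix has full row rank. The function names are hypothetical, and mutual information is reported in bits rather than nats.

```python
import itertools
import numpy as np


def mutual_information_bits(p_y, k_qy):
    """I(Q;Y) in bits for marginal p_y[y] and channel k_qy[q, y] = kappa(q|y)."""
    joint = k_qy * p_y[None, :]                    # joint[q, y]
    denom = joint.sum(axis=1, keepdims=True) * p_y[None, :]
    mask = joint > 1e-15                           # convention: 0 log 0 = 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / denom[mask])))


def blackwell_redundancy_bruteforce(p_xy_list):
    """Brute-force sketch; p_xy_list[i][x, y] = P(X_i = x, Y = y)."""
    n = len(p_xy_list)
    p_y = p_xy_list[0].sum(axis=0)
    ny = p_y.size
    nxs = [p.shape[0] for p in p_xy_list]
    nq = sum(nxs) - n + 1                          # cardinality from Theorem A1
    offsets = np.cumsum([nq * ny] + [nq * nx for nx in nxs])
    nvar = int(offsets[-1])

    def qy(q, y):                                  # index of kappa_{Q|Y}(q|y)
        return q * ny + y

    def qx(i, q, x):                               # index of kappa_{Q|X_i}(q|x)
        return int(offsets[i]) + q * nxs[i] + x

    rows, rhs = [], []
    for y in range(ny):                            # normalization, (A10)
        r = np.zeros(nvar)
        for q in range(nq):
            r[qy(q, y)] = 1.0
        rows.append(r)
        rhs.append(1.0)
    for i in range(n):                             # normalization, (A11)
        for x in range(nxs[i]):
            r = np.zeros(nvar)
            for q in range(nq):
                r[qx(i, q, x)] = 1.0
            rows.append(r)
            rhs.append(1.0)
    for i in range(n):                             # Blackwell constraints, (A12)
        for y in range(ny):
            for q in range(1, nq):                 # q = 0 is omitted
                r = np.zeros(nvar)
                for x in range(nxs[i]):
                    r[qx(i, q, x)] = p_xy_list[i][x, y]
                r[qy(q, y)] = -p_y[y]
                rows.append(r)
                rhs.append(0.0)
    A, b = np.array(rows), np.array(rhs)
    m = A.shape[0]

    best = 0.0
    for cols in itertools.combinations(range(nvar), m):
        sub = A[:, list(cols)]
        try:
            xb = np.linalg.solve(sub, b)
        except np.linalg.LinAlgError:              # singular basis, not a vertex
            continue
        if not np.all(np.isfinite(xb)) or xb.min() < -1e-9:
            continue
        if not np.allclose(sub @ xb, b, atol=1e-9):
            continue
        kappa = np.zeros(nvar)
        kappa[list(cols)] = xb
        k_qy = np.clip(kappa[: nq * ny], 0.0, None).reshape(nq, ny)
        best = max(best, mutual_information_bits(p_y, k_qy))
    return best


# AND gate with independent uniform inputs: rows index x, columns index y.
p_and = np.array([[0.5, 0.0], [0.25, 0.25]])
and_redundancy = blackwell_redundancy_bruteforce([p_and, p_and])
print(and_redundancy)
```

For the AND gate both sources have the same pairwise marginal with the target, so $Q = X_1$ is feasible and the result should coincide with $I(X_1; Y) \approx 0.311$ bits. The number of candidate bases grows combinatorially, which is why a dedicated vertex-enumeration routine is preferable for anything larger.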

Appendix D. Continuity of $I_\cap^\prec$

To prove the continuity of $I_\cap^\prec$, we begin by considering the feasible set of the optimization problem in Equation (A7), as specified by the system of (in)equalities in Equations (A8) to (A12). For convenience, write this system of (in)equalities in matrix notation,
$$\Lambda = \left\{ \kappa \in \mathbb{R}^{|\mathcal{Q}||\mathcal{Y}| + \sum_i |\mathcal{Q}||\mathcal{X}_i|} : \kappa \ge 0,\ A\kappa = a \right\}, \tag{A13}$$
where $\kappa$ is a vector representation of $(\kappa_{Q|Y}, \kappa_{Q|X_1}, \dots, \kappa_{Q|X_n})$, the matrix $A$ encodes the left-hand side of Equations (A10) to (A12), and the vector $a$ is filled with 1s and 0s, as appropriate.
We first prove the following lemma.
Lemma A1. 
The matrix $A$ defined in Equation (A13) is full rank if $n - 1$ or more of the pairwise conditional distributions have $\operatorname{rank} P_{Y|X_i} = |\mathcal{Y}|$.
Proof. 
Without loss of generality, assume that $P_Y$ has full support (otherwise none of the pairwise marginals $P_{X_i Y}$ can achieve rank $|\mathcal{Y}|$). Write $A$ in block matrix form as $A = \begin{bmatrix} B \\ C \end{bmatrix}$, where the matrix $B$ has $|\mathcal{Y}| + \sum_i |\mathcal{X}_i|$ rows and encodes the constraints of Equations (A10) and (A11), and the matrix $C$ has $n |\mathcal{Y}| (|\mathcal{Q}| - 1)$ rows and encodes the constraints of Equation (A12).
Each row in $B$ has a 1 in some column which is zero in every other row of $B$ and every row of $C$. This column corresponds either to $\kappa_{Q|Y}(0|y)$ for a particular $y$ (for constraints like Equation (A10)), or to $\kappa_{Q|X_i}(0|x_i)$ for a particular $i$ and $x_i$ (for constraints like Equation (A11)). These columns are 0 in $C$ because $q = 0$ is omitted from Equation (A12). This means that no row of $B$ is a linear combination of other rows in $B$ or $C$, and that no row in $C$ is a linear combination of any set of other rows that includes a row in $B$. Therefore, if the rows of $A$ are linearly dependent, it must be that the rows of $C$ are linearly dependent.
Next, let $c_{i,y,q}$ indicate the row of $C$ that represents the constraints in Equation (A12) for some source $i$ and outcomes $y$, $q \ne 0$. Any such row has a column for each $x_i \in \mathcal{X}_i$ with value $P_{X_i Y}(x_i, y)$ (at the same index as the row in $\kappa$ that represents $\kappa_{Q|X_i}(q|x_i)$). Since $P_{X_i Y}(x_i, y) > 0$ for at least one $x_i \in \mathcal{X}_i$, one of these columns must be non-zero. At the same time, these columns are zero in every row $c_{j,y',q'}$ where $j \ne i$ or $q' \ne q$. This means that row $c_{i,y,q}$ can only be a linear combination of other rows in $C$ if, for all $x_i$, $P_{X_i Y}(x_i, y)$ is a linear combination of $P_{X_i Y}(x_i, y')$ for $y' \ne y$. In linear algebra terms, this can be stated as $\operatorname{rank} P_{Y|X_i} < |\mathcal{Y}|$.
The previous argument shows that if the rows of $A$ are linearly dependent, there must be at least one source $i$ with $\operatorname{rank} P_{Y|X_i} < |\mathcal{Y}|$ and some row $c_{i,y,q}$ which is a linear combination of other rows from $C$. Observe that this row $c_{i,y,q}$ has a column with magnitude $P_Y(y) > 0$ (at the same index as the row in $\kappa$ that represents $\kappa_{Q|Y}(q|y)$). This column is zero in every other row $c_{i,y',q'}$ for $y' \ne y$ or $q' \ne q$. This means that $c_{i,y,q}$ is a linear combination of a set of other rows in $C$ that include some row $c_{j,y,q}$ for $j \ne i$. This implies that $c_{j,y,q}$ is also a linear combination of other rows in $C$, which means that $\operatorname{rank} P_{Y|X_j} < |\mathcal{Y}|$.
We have shown that if the rows of $A$ are linearly dependent, there must be at least two pairwise conditionals with $\operatorname{rank} P_{Y|X_i} < |\mathcal{Y}|$. □
We are now ready to prove Theorem 5.
Proof of Theorem 5. 
For the case of a single source ($n = 1$), $I_\cap^\prec$ reduces to the mutual information $I_\cap^\prec = I(Y; X_1)$, which is continuous (Section 2.3, [75]). Thus, without loss of generality, we assume that $n \ge 2$.
Next, we define some notation. Note that the optimum value ($I_\cap^\prec$) and the feasible set of the optimization problem in Equation (A7) are functions of the pairwise marginal distributions $P_{X_1 Y}, \dots, P_{X_n Y}$. We write $\Omega$ for the set of all pairwise marginal distributions which have the same marginal over $Y$:
$$\Omega = \left\{ (q_{X_1 Y}, \dots, q_{X_n Y}) : \sum_{x_i} q_{X_i Y}(x_i, y) = \sum_{x_j} q_{X_j Y}(x_j, y) \ \ \forall i, j \right\}.$$
For any $r \in \Omega$, let $I_\cap^\prec(r)$ indicate the corresponding optimum value in Equation (A7), given the marginals in $r$, and let $\Lambda(r)$ indicate the feasible set of the optimization problem, as defined in Equation (A13).
Note that the matrix $A$ in Equation (A13) depends on the choice of $r$, which we indicate by writing it as the matrix-valued function $A(r)$. Given any $r = (q_{X_1 Y}, \dots, q_{X_n Y}) \in \Omega$ and feasible solution $\kappa = (\kappa_{Q|Y}, \kappa_{Q|X_1}, \dots, \kappa_{Q|X_n}) \in \Lambda(r)$, let $I(r, \kappa)$ indicate the corresponding mutual information $I(Q; Y)$, where the marginal distribution over $Y$ is specified by $r$ and the conditional distribution of $Q$ given $Y$ is specified by $\kappa_{Q|Y}$. Using this notation, $I_\cap^\prec(r) = \max_{\kappa \in \Lambda(r)} I(r, \kappa)$.
Below, we show that $I_\cap^\prec(r)$ is continuous if $r$ is rank regular [76], which means that there is a neighborhood $U \subseteq \Omega$ of $r$ such that $\operatorname{rank} A(r') = \operatorname{rank} A(r)$ for all $r' \in U$. Then, to prove the theorem, we assume that $A(r)$ is full rank. Given Lemma A1, this is true as long as $n - 1$ or more of the pairwise conditionals $P_{Y|X_i}$ have $\operatorname{rank} P_{Y|X_i} = |\mathcal{Y}|$. Note that a matrix $M$ is full rank iff the singular values $\sigma(M)$ are all strictly positive. Since $A(r)$ is full rank, and $A(\cdot)$ and $\sigma(\cdot)$ are continuous, there is a neighborhood $U$ of $r$ such that the singular values $\sigma(A(r'))$ are all strictly positive for all $r' \in U$, and therefore all $A(r')$ have full rank. This shows that $r$ is rank regular and so $I_\cap^\prec$ is continuous at $r$.
We now prove that $I_\cap^\prec(r)$ is continuous if $A(r)$ is rank regular. To do so, we will use Hoffman’s Theorem [77,78]. In our case, it states that for any pair of marginals $r, r' \in \Omega$ and a feasible solution $\kappa' \in \Lambda(r')$, there exists a feasible solution $\kappa \in \Lambda(r)$ such that
$$\| \kappa - \kappa' \| \le \alpha \| A(r) - A(r') \|, \tag{A14}$$
where $\alpha$ is a constant that does not depend on $r'$ or $\kappa'$. (In the notation of [78], we take $G = G'$, $g = g'$ and $d = d'$, and use that the norm of $s = \kappa'$ is bounded, given that it is finite dimensional and has entries in $[0, 1]$). We will also use Daniel’s theorem (Theorem 4.2, [78]), which states that for any $r, r' \in \Omega$ such that $\operatorname{rank} A(r) = \operatorname{rank} A(r')$, and any feasible solution $\kappa \in \Lambda(r)$, there exists $\kappa' \in \Lambda(r')$ such that
$$\| \kappa - \kappa' \| \le \beta \| A(r) - A(r') \|, \tag{A15}$$
where $\beta$ is a constant that does not depend on $r'$ (in the notation of [78], $\varepsilon = \| A(r) - A(r') \|$, and we again use that $\kappa$ has a bounded norm).
Now consider any sequence $r_1, r_2, \dots \in \Omega$ that converges to a marginal $r \in \Omega$. Let $\kappa_i \in \Lambda(r_i)$ indicate an optimal solution of Equation (A7) for $r_i$, so that $I(r_i, \kappa_i) = I_\cap^\prec(r_i)$. Given Equation (A14), there is a corresponding sequence $\kappa'_1, \kappa'_2, \dots \in \Lambda(r)$ such that
$$\| \kappa_i - \kappa'_i \| \le \alpha \| A(r) - A(r_i) \|.$$
Since $A(\cdot)$ is continuous and $r_i$ converges to $r$, we have $\lim_i A(r_i) = A(r)$ and therefore $\lim_i \| \kappa_i - \kappa'_i \| = 0$. This implies
$$0 = \lim_i \left[ I(r_i, \kappa_i) - I(r, \kappa'_i) \right] \ge \lim_i \left[ I_\cap^\prec(r_i) - I_\cap^\prec(r) \right], \tag{A16}$$
where we first used continuity of mutual information, and then $I(r_i, \kappa_i) = I_\cap^\prec(r_i)$ and $I(r, \kappa'_i) \le I_\cap^\prec(r)$.
Now assume that $r$ is rank regular. Since $r_i$ converges to $r$, $\operatorname{rank} A(r_i) = \operatorname{rank} A(r)$ for all sufficiently large $i$. Let $\kappa \in \Lambda(r)$ be an optimal solution of Equation (A7) for $r$, so that $I(r, \kappa) = I_\cap^\prec(r)$. Given Equation (A15), for all sufficiently large $i$ there exists $\kappa'_i \in \Lambda(r_i)$ such that
$$\| \kappa - \kappa'_i \| \le \beta \| A(r) - A(r_i) \|.$$
As before, we have $\lim_i A(r_i) = A(r)$ and $\lim_i \| \kappa - \kappa'_i \| = 0$, which implies
$$0 = \lim_i \left[ I(r_i, \kappa'_i) - I(r, \kappa) \right] \le \lim_i \left[ I_\cap^\prec(r_i) - I_\cap^\prec(r) \right], \tag{A17}$$
where we first used continuity of mutual information, and then $I(r_i, \kappa'_i) \le I_\cap^\prec(r_i)$ and $I(r, \kappa) = I_\cap^\prec(r)$.
Combining Equations (A16) and (A17) proves continuity, $\lim_i I_\cap^\prec(r_i) = I_\cap^\prec(r)$, under the assumption that $A(r)$ is rank regular.
Finally, note that $A(r)$ is a real analytic function of $r$. This means that almost all $r$ are rank regular, because those $r$ which are not rank regular form a proper analytic subset of $\Omega$ (which has measure zero) [76]. Thus, $I_\cap^\prec(r)$ is continuous almost everywhere. □
We finish our analysis of the continuity of $I_\cap^\prec$ by showing global continuity when the target is a binary random variable.
Corollary A1. 
$I_\cap^\prec(X_1; \dots; X_n \to Y)$ is continuous everywhere when $Y$ is a binary random variable.
Proof. 
In an overloading of notation, let $I_\cap^\prec(r)$ and $I_r(X_i; Y)$ indicate $I_\cap^\prec(X_1; \dots; X_n \to Y)$ and the mutual information $I(X_i; Y)$, respectively, for the joint distribution $r_{X_1 \dots X_n Y}$. By Theorem 5, $I_\cap^\prec$ can only be discontinuous at the joint distribution $P_{X_1 \dots X_n Y}$ if there is a source $X_i$ with $\operatorname{rank} P_{Y|X_i} = 1 < |\mathcal{Y}|$. However, if source $X_i$ has rank 1, then the conditional distributions $P_{Y|X_i = x_i}$ are the same for all $x_i$, so $I_P(X_i; Y) = 0$ and $I_\cap^\prec(P) = 0$ (since $0 \le I_\cap^\prec(P) \le I_P(X_i; Y)$). Finally, consider any sequence of joint distributions $s^{(k)}_{X_1 \dots X_n Y}$ for $k = 1, 2, \dots$ that converges to $P_{X_1 \dots X_n Y}$. We have
$$0 \le \lim_{k} I_\cap^\prec(s^{(k)}) \le \lim_{k} I_{s^{(k)}}(X_i; Y) = I_P(X_i; Y) = 0,$$
where we used the continuity of mutual information. This shows that $\lim_{k} I_\cap^\prec(s^{(k)}) = 0 = I_\cap^\prec(P)$, proving continuity. □

Appendix E. Behavior of $I_\cap^\prec$ on Gaussian Random Variables

Although in this paper we focused on random variables with finite sets of outcomes, we can briefly comment on the behavior of Blackwell redundancy on Gaussian random variables. Suppose that all sources $X_1, \dots, X_n$ and the target $Y$ are continuous-valued, and that the pairwise marginals $P_{X_i Y}$ are multivariate Gaussians. In addition, suppose that $Y$ is one-dimensional (the sources $X_i$ can be multi-dimensional). Given these assumptions, Barrett [51] analyzed the $I_\cup^{\mathrm{BROJA}}$ measure and showed that the corresponding excluded information obeys $E(X_j \to Y | X_i; X_j) = 0$ whenever $I(X_i; Y) \le I(X_j; Y)$. Recall that $I_\cup^{\mathrm{BROJA}}$ is equivalent to Blackwell union information $I_\cup^\prec$. Then, given the Blackwell property, Theorem 4, and the data processing inequality, Equation (24), the result in Ref. [51] implies that $X_i \preceq_Y X_j$ if and only if $I(X_i; Y) \le I(X_j; Y)$. Thus, for Gaussian random variables, Blackwell redundancy $I_\cap^\prec$ is equivalent to $I_\cap^{\mathrm{MMI}}$ redundancy, as defined in Equation (19). This parallels the case for most other redundancy measures [51].
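As a rough numerical illustration of this equivalence, the sketch below treats the simplest special case, in which every source is a scalar that is jointly Gaussian with the scalar target. In that case the mutual information has the standard closed form $I(X_i; Y) = -\tfrac{1}{2} \log_2 (1 - \rho_i^2)$ bits, where $\rho_i$ is the correlation coefficient, and the redundancy reduces to the minimum over sources. The function names are illustrative, and the restriction to scalar sources is an assumption of this example rather than of the result above.

```python
import math


def gaussian_mi_bits(rho):
    """I(X;Y) in bits for jointly Gaussian scalars with correlation rho, |rho| < 1."""
    return -0.5 * math.log2(1.0 - rho ** 2)


def gaussian_mmi_redundancy(correlations):
    """Minimum mutual information over scalar Gaussian sources."""
    return min(gaussian_mi_bits(rho) for rho in correlations)


# Two sources with correlations 0.5 and sqrt(0.75) with the target:
# the weaker source (rho = 0.5) determines the redundancy.
redundancy = gaussian_mmi_redundancy([0.5, math.sqrt(0.75)])
print(redundancy)
```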

Appendix F. Operational Interpretation of $I_\cap^{\mathrm{GH}}$

As mentioned in the main text, the redundancy measure $I_\cap^{\mathrm{GH}}$ is a special case of Equation (7), where the “more informative” order $B \sqsubset C$ is defined in terms of conditional independence $B - C - Y$. Here we show that this ordering relation can be given an operational interpretation, which is similar to but distinct from the operational interpretation of the Blackwell order $\preceq_Y$ discussed in Section 5.1.
To introduce this operational interpretation, let the random variable $Y$ represent the state of the environment, and assume there are two random variables $B$ and $C$ which have some information about $Y$. Suppose that an agent tries to maximize expected utility $u(a, y)$ by using a strategy that depends either on the outcomes of $B$ or $C$. Blackwell’s theorem tells us that $B \preceq_Y C$ iff an agent with access to $C$ can always achieve an expected utility at least as high as that of an agent with access to $B$. It is possible, however, that the agent with access to $C$ may do worse than the agent with access to $B$, conditional on the event that random variable $C$ has some particular outcome $c$. In the following theorem, we show that $B - C - Y$ iff the agent cannot do better with $B$ than $C$, even when conditioned on any particular outcome $C = c$. (We thank Johannes Rauh for suggesting this simplified proof).
Theorem A2. 
Given random variables $B$, $C$, and $Y$, $B - C - Y$ if and only if
$$\max_{\kappa_{A|B}} \sum_{y, a, b} P_{YB|C}(y, b|c) \kappa_{A|B}(a|b) u(a, y) \le \max_{\kappa_{A|C}} \sum_{y, a} P_{Y|C}(y|c) \kappa_{A|C}(a|c) u(a, y) \tag{A18}$$
for all utility functions $u(a, y)$ and all $c \in \mathcal{C}$ with $P_C(c) > 0$.
Proof. 
Consider any $c \in \mathcal{C}$ with $P_C(c) > 0$. By multiplying both sides of Equation (A18) by $P_C(c)$ and rearranging, this inequality can be rewritten as
$$\max_{\kappa_{A|B}} \sum_{y, a, b} P_Y(y) P_{C|Y}(c|y) P_{B|Y,C}(b|y, c) \kappa_{A|B}(a|b) u(a, y) \le \max_{\kappa_{A|C}} \sum_{y, a} P_Y(y) P_{C|Y}(c|y) \kappa_{A|C}(a|c) u(a, y). \tag{A19}$$
Note that if Equation (A18) holds for a given $c$ and all utility functions, then it must also hold for the utility function $u'(a, y) := u(a, y) / P_{C|Y}(c|y)$. Plugging into Equation (A19) gives
$$\max_{\kappa_{A|B}} \sum_{y, a, b} P_Y(y) P_{B|Y,C}(b|y, c) \kappa_{A|B}(a|b) u(a, y) \le \max_{\kappa_{A|C}} \sum_{y, a} P_Y(y) \kappa_{A|C}(a|c) u(a, y). \tag{A20}$$
Now define two random variables: a constant random variable $\hat{C}^c$ with a single outcome $c$ and $\hat{B}^c$ with the same outcomes as $B$ but having the conditional distribution $P_{\hat{B}^c|Y} = P_{B|Y, C=c}$. Then, Equation (A20) can be written in terms of these random variables as
$$\max_{\kappa_{A|B}} \sum_{y, a, b} P_Y(y) P_{\hat{B}^c|Y}(b|y) \kappa_{A|B}(a|b) u(a, y) \le \max_{\kappa_{A|C}} \sum_{y, a} P_Y(y) P_{\hat{C}^c|Y}(c|y) \kappa_{A|C}(a|c) u(a, y). \tag{A21}$$
Given Equations (25) and (26), Equation (A21) holds for all $u$ iff $\hat{B}^c \preceq_Y \hat{C}^c$. Since $\hat{C}^c$ has a single outcome, it is independent of $Y$. That means $\hat{B}^c$ must also be independent of $Y$ and so $P_{\hat{B}^c|Y=y} = P_{B|Y=y, C=c}$ is the same for all $y$, implying that $P_{B|Y=y, C=c} = P_{B|C=c}$. Since this holds for all $c \in \mathcal{C}$, $P_{B|YC} = P_{B|C}$ and therefore $B - C - Y$. □
Given Theorem A2, $I_\cap^{\mathrm{GH}}$ can be given the following operational interpretation. Imagine two agents, Alice and Bob, who can acquire information about $Y$ via different random variables, and then use this information to maximize their expected utility. Suppose that Alice has access to one of the sources $X_i$. Then, $I_\cap^{\mathrm{GH}}$ is the maximum information that Bob can have about $Y$ without being able to do better than Alice on any utility function, regardless of which source $X_i$ Alice has access to, and even when conditioned on $X_i$ having any particular outcome $x_i$.

Appendix G. Equivalence of $I_\cup^\prec$ and $I_\cup^{\mathrm{BROJA}}$

The following proves that $I_\cup^\prec$ and $I_\cup^{\mathrm{BROJA}}$, as defined via the optimization problems in Equations (28) and (31), are equivalent.
Theorem A3. 
$I_\cup^\prec(X_1; \dots; X_n \to Y) = I_\cup^{\mathrm{BROJA}}(X_1; \dots; X_n \to Y)$.
Proof. 
Let $\tilde{X}_1,\dots,\tilde{X}_n$ be a set of random variables that achieve $I(Y;\tilde{X}_1,\dots,\tilde{X}_n) = I_\cup^{\mathrm{BROJA}}(X_1;\dots;X_n \to Y)$. Define the random variable $Q := (\tilde{X}_1,\dots,\tilde{X}_n)$, and note that $\tilde{X}_i \preceq_Y Q$ for all $i$. Since $P_{\tilde{X}_i Y} = P_{X_i Y}$ for all $i$, it must be that $X_i \preceq_Y Q$ for all $i$. Thus $Q$ satisfies the constraints of the optimization problem in Equation (28), so
$$I_\cup^\prec(X_1;\dots;X_n \to Y) \le I(Y;Q) = I_\cup^{\mathrm{BROJA}}(X_1;\dots;X_n \to Y). \tag{A22}$$
Next, consider the optimization in Equation (28). For any $\epsilon > 0$, let $Q$ be a random variable that satisfies $X_i \preceq_Y Q$ and achieves
$$I(Y;Q) \le I_\cup^\prec(X_1;\dots;X_n \to Y) + \epsilon. \tag{A23}$$
For each $i$, let $\kappa_{X_i|Q}$ be a channel that obeys $P_{X_i|Y}(x_i|y) = \sum_q \kappa_{X_i|Q}(x_i|q)\, P_{Q|Y}(q|y)$ (such a channel must exist since $X_i \preceq_Y Q$). Define the random variables $\tilde{X}_1,\dots,\tilde{X}_n$ with the joint distribution
$$P_{Y Q \tilde{X}_1 \dots \tilde{X}_n}(y,q,x_1,\dots,x_n) = P_Y(y)\, P_{Q|Y}(q|y) \prod_i \kappa_{X_i|Q}(x_i|q). \tag{A24}$$
Note that the pairwise marginals obey $P_{\tilde{X}_i Y} = P_{X_i Y}$. Thus, all of the $\tilde{X}_i$ satisfy the marginal constraints in the right hand side of Equation (31), so
$$I_\cup^{\mathrm{BROJA}}(X_1;\dots;X_n \to Y) \le I(Y;\tilde{X}_1,\dots,\tilde{X}_n). \tag{A25}$$
By elementary properties of mutual information, we have
$$I(Y;\tilde{X}_1,\dots,\tilde{X}_n) \le I(Y;Q,\tilde{X}_1,\dots,\tilde{X}_n). \tag{A26}$$
Given Equation (A24), the Markov condition $Y - Q - (\tilde{X}_1,\dots,\tilde{X}_n)$ holds, so
$$I(Y;\tilde{X}_1,\dots,\tilde{X}_n) \le I(Y;Q) \tag{A27}$$
by the data processing inequality. Combining Equations (A23) and (A25) to (A27) implies
$$I_\cup^{\mathrm{BROJA}}(X_1;\dots;X_n \to Y) \le I(Y;Q) \le I_\cup^\prec(X_1;\dots;X_n \to Y) + \epsilon.$$
Since this holds for all $\epsilon$, we can take the limit $\epsilon \to 0$ to give $I_\cup^{\mathrm{BROJA}}(X_1;\dots;X_n \to Y) \le I_\cup^\prec(X_1;\dots;X_n \to Y)$. The result follows by combining with Equation (A22). □
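The construction in Equation (A24) can be checked on a small example. The sketch below is illustrative only: the distributions `p_y`, `p_q_given_y`, and the channels in `kappa` are hypothetical numbers, and `mi` is a helper written for this example. It builds the joint distribution over $(Y, Q, \tilde{X}_1, \tilde{X}_2)$, confirms that the pairwise marginal of $\tilde{X}_1$ with $Y$ matches the channel composition, and evaluates the mutual information terms that appear in the proof.

```python
from itertools import product
from math import log2

# Hypothetical binary example with n = 2 sources.
p_y = [0.4, 0.6]
p_q_given_y = [[0.9, 0.1], [0.2, 0.8]]      # indexed [y][q]
kappa = [[[0.7, 0.3], [0.1, 0.9]],          # kappa[i][q][x_i]
         [[0.5, 0.5], [0.0, 1.0]]]

# Joint distribution over (y, q, x1, x2): P(y) P(q|y) prod_i kappa_i(x_i|q).
joint = {
    (y, q, x1, x2): p_y[y] * p_q_given_y[y][q] * kappa[0][q][x1] * kappa[1][q][x2]
    for y, q, x1, x2 in product(range(2), repeat=4)
}

def mi(joint, a_idx, b_idx):
    """Mutual information in bits between two groups of coordinates."""
    p_ab, p_a, p_b = {}, {}, {}
    for outcome, p in joint.items():
        a = tuple(outcome[i] for i in a_idx)
        b = tuple(outcome[i] for i in b_idx)
        p_ab[(a, b)] = p_ab.get((a, b), 0.0) + p
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return sum(p * log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in p_ab.items() if p > 0)

# Largest deviation between P(x1|y) from the joint and sum_q kappa(x1|q) P(q|y).
marginal_error = max(
    abs(sum(p for (yy, q, a, b), p in joint.items() if yy == y and a == x1) / p_y[y]
        - sum(kappa[0][q][x1] * p_q_given_y[y][q] for q in range(2)))
    for y, x1 in product(range(2), repeat=2)
)

i_y_q = mi(joint, (0,), (1,))          # I(Y; Q)
i_y_x = mi(joint, (0,), (2, 3))        # I(Y; X~_1, X~_2)
i_y_qx = mi(joint, (0,), (1, 2, 3))    # I(Y; Q, X~_1, X~_2)
print(marginal_error, i_y_x, i_y_qx, i_y_q)
```

Under the Markov condition, the last two printed values should agree up to rounding, and the second should not exceed them.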

Appendix H. Relation between $I_\cap^{\mathrm{WB}}$ and Our General Framework

Here we consider whether the redundancy measure $I_\cap^{\mathrm{WB}}$ proposed by Williams and Beer [11] can be put in the general form of Equation (7). This measure is defined as
$$I_\cap^{\mathrm{WB}}(X_1;\dots;X_n \to Y) := \sum_y P_Y(y) \min_i I(X_i;Y=y), \tag{A28}$$
where $I(X_i;Y=y)$ is called the “specific information” between $X_i$ and target outcome $y$,
$$I(X_i;Y=y) := D_{\mathrm{KL}}\big(P_{X_i|Y=y} \,\|\, P_{X_i}\big) = \sum_{x_i} P_{X_i|Y}(x_i|y) \log \frac{P_{X_i|Y}(x_i|y)}{P_{X_i}(x_i)},$$
and $D_{\mathrm{KL}}$ is the Kullback-Leibler (KL) divergence.
Specific information obeys $I(X;Y) = \sum_y P_Y(y)\, I(X;Y=y)$. Thus, Equation (A28) looks similar to a mutual information expression, where each specific information term is given by the smallest specific information that $y$ carries about any of the sources. Motivated by this interpretation, one might ask whether there exists a random variable $Q$ whose specific information terms are equal to $I(Q;Y=y) = \min_i I(X_i;Y=y)$ for each $y$. If such a random variable existed, then $I_\cap^{\mathrm{WB}}$ could be written as
$$I_\cap^{\mathrm{WB}}(X_1;\dots;X_n \to Y) \stackrel{?}{=} \max_Q I(Q;Y) \quad \text{such that} \quad \forall i, y : I(Q;Y=y) \le I(X_i;Y=y), \tag{A29}$$
which has the form of Equation (7), with the $\sqsubset$ order defined as
$$A \sqsubset B \quad \text{iff} \quad I(A;Y=y) \le I(B;Y=y) \quad \text{for all } y \in \mathcal{Y}. \tag{A30}$$
Here we provide a counterexample to demonstrate that such a variable does not exist in general, and therefore Equation (A29) is not generally valid. Suppose $Y$ has three outcomes $\mathcal{Y} = \{0,1,2\}$ with a uniform distribution, and consider two binary sources $X_1, X_2$ with the following conditional distributions,
$$P_{X_1|Y}(x_1|y) = \begin{cases} \delta(x_1,y) & \text{if } y \in \{0,1\} \\ \tfrac{1}{2}\delta(x_1,0) + \tfrac{1}{2}\delta(x_1,1) & \text{if } y = 2 \end{cases} \qquad P_{X_2|Y}(x_2|y) = \begin{cases} \delta(x_2,y) & \text{if } y \in \{0,2\} \\ \tfrac{1}{2}\delta(x_2,0) + \tfrac{1}{2}\delta(x_2,2) & \text{if } y = 1 \end{cases}$$
In this case, a simple calculation shows that the specific information obeys (in bits)
$$\begin{aligned} I(X_1;Y=0) &= 1, & I(X_2;Y=0) &= 1, \\ I(X_1;Y=1) &= 1, & I(X_2;Y=1) &= 0, \\ I(X_1;Y=2) &= 0, & I(X_2;Y=2) &= 1. \end{aligned}$$
Plugging into Equation (A28) gives $I_\cap^{\mathrm{WB}}(X_1;X_2 \to Y) = 1/3$.
Now consider the optimization problem in Equation (A29). Since $I(X_1;Y=2) = I(X_2;Y=1) = 0$, any allowed $Q$ must satisfy $I(Q;Y=1) = I(Q;Y=2) = 0$ and therefore $P_{Q|Y=1} = P_Q = P_{Q|Y=2}$. Combined with the marginalization identity $P_Y(0) P_{Q|Y=0} + P_Y(1) P_{Q|Y=1} + P_Y(2) P_{Q|Y=2} = P_Q$, this implies that $P_{Q|Y=0} = P_Q$ and therefore that $I(Q;Y=0) = 0$. Thus, any allowed $Q$ obeys $I(Q;Y) = 0 \ne I_\cap^{\mathrm{WB}}$. This means that $I_\cap^{\mathrm{WB}}$ cannot be expressed in the form of Equation (7) when $\sqsubset$ is defined as in Equation (A30).
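The specific information values in this counterexample can be reproduced with a few lines of code. The sketch below is an illustration written for this appendix; the dictionary layout and the helper name `specific_information` are assumptions of the example, not part of any existing package.

```python
from math import log2

# Target Y is uniform on {0, 1, 2}; each source is described by P(x | y).
p_y = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
p_x1_given_y = {0: {0: 1.0}, 1: {1: 1.0}, 2: {0: 0.5, 1: 0.5}}
p_x2_given_y = {0: {0: 1.0}, 2: {2: 1.0}, 1: {0: 0.5, 2: 0.5}}

def specific_information(p_x_given_y, p_y):
    """Return {y: D_KL(P_{X|Y=y} || P_X)} in bits."""
    # Marginal P_X obtained by averaging the conditionals over P_Y.
    p_x = {}
    for y, row in p_x_given_y.items():
        for x, p in row.items():
            p_x[x] = p_x.get(x, 0.0) + p_y[y] * p
    return {
        y: sum(p * log2(p / p_x[x]) for x, p in row.items() if p > 0)
        for y, row in p_x_given_y.items()
    }

spec1 = specific_information(p_x1_given_y, p_y)
spec2 = specific_information(p_x2_given_y, p_y)

# Williams-Beer redundancy: expected minimum specific information.
i_wb = sum(p_y[y] * min(spec1[y], spec2[y]) for y in p_y)
print(spec1, spec2, i_wb)
```

Up to floating-point rounding, the printed specific information values match the table of values above, and `i_wb` is close to $1/3$ bit.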

Appendix I. Miscellaneous Derivations

Proof of Lemma 1. 
We use a modified version of the example in [39,68]. Consider a set of $n \ge 3$ sources. The inclusion-exclusion principle states that
$$I_\cup(X_1;\dots;X_n \to Y) = \sum_{\emptyset \ne J \subseteq \{1,\dots,n\}} (-1)^{|J|-1}\, I_\cap(X_{J_1};\dots;X_{J_{|J|}} \to Y). \tag{A31}$$
Now, let $X_1,\dots,X_{n-1}$ be uniformly distributed and statistically independent binary random variables, and take $X_n = X_1 \text{ XOR } X_2$ and $Y = (X_1, X_2, X_n)$. Note that $I(Y;X_i) = 1$ bit for $i \in \{1,2,n\}$ and $I(Y;X_i) = 0$ for $i \in \{3,\dots,n-1\}$, and that $I(Y;X_1,\dots,X_n) = 2$ bits. Thus, $I_\cap(X_i;X_j \to Y) = 0$ whenever $i \in \{3,\dots,n-1\}$ or $j \in \{3,\dots,n-1\}$, as follows from Symmetry, Self-redundancy, and Monotonicity. Note also that the outcomes of $Y$ are simply a relabelling of $(X_1,X_2)$, and similarly for $(X_1,X_n)$ and $(X_2,X_n)$. Then, by the Independent identity property, $I_\cap(X_i;X_j \to Y) = 0$ for $i \ne j$ where $i,j \in \{1,2,n\}$. Thus, $I_\cap(X_i;X_j \to Y) = 0$ for all pairs $i \ne j$. By Monotonicity, redundancy is 0 for any set of 2 or more sources.
Plugging this into Equation (A31) gives
$$I_\cup(X_1;\dots;X_n \to Y) = \sum_i I_\cap(X_i \to Y) = \sum_i I(Y;X_i) = 3 \text{ bits}.$$
Note that this exceeds the total amount of information about the target provided jointly by all sources, which is only 2 bits, so $I_\cup(X_1;\dots;X_n \to Y) \not\le I(Y;X_1,\dots,X_n)$. □
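The mutual information values used in this proof can be verified directly for the smallest case $n = 3$. The following sketch is illustrative; the helper `mutual_information` is written for this example and assumes equally likely joint outcomes.

```python
from itertools import product
from math import log2

def mutual_information(pairs):
    """I(A;B) in bits for a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    p_ab, p_a, p_b = {}, {}, {}
    for a, b in pairs:
        p_ab[(a, b)] = p_ab.get((a, b), 0) + 1 / n
        p_a[a] = p_a.get(a, 0) + 1 / n
        p_b[b] = p_b.get(b, 0) + 1 / n
    return sum(p * log2(p / (p_a[a] * p_b[b])) for (a, b), p in p_ab.items())

# n = 3: X1 and X2 are independent fair bits, X3 = X1 XOR X2, Y = (X1, X2, X3).
outcomes = [(x1, x2, x1 ^ x2) for x1, x2 in product((0, 1), repeat=2)]

# I(Y; X_i) for each source, and I(Y; X1, X2, X3) for all sources jointly.
single = [mutual_information([(o, o[i]) for o in outcomes]) for i in range(3)]
joint = mutual_information([(o, o) for o in outcomes])
print(single, sum(single), joint)
```

The sum of the single-source terms is larger than the joint term, which is the gap the lemma exploits.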
Proof of Theorem 3. 
Without loss of generality, let $i = 1$. We will use that $U(X_1 \to Y | X_1;\dots;X_n) = 0$ is equivalent to
$$I_\cap^\prec(X_1;\dots;X_n \to Y) = I(X_1;Y). \tag{A32}$$
We will also use that, by monotonicity of mutual information with respect to $\preceq_Y$ (see Section 4.1),
$$I(X_1;Y) \ge I_\cap^\prec(X_1;\dots;X_n \to Y). \tag{A33}$$
We first prove the “if” direction, assuming that $X_1 \preceq_Y X_i$ for all $i$. Since $Q = X_1$ is then in the feasible set of Equation (27), $I_\cap^\prec(X_1;\dots;X_n \to Y) \ge I(X_1;Y)$. Combining with Equation (A33) gives Equation (A32).
We now prove the “only if” direction. As described in Appendix C, $I_\cap^\prec$ can be expressed as an optimization over a finite dimensional, closed, and bounded region, so the supremum in Equation (27) is achieved. Thus, there is some $Q$ such that $Q \preceq_Y X_i$ for all $i$ and
$$I(Y;Q) = I_\cap^\prec(X_1;\dots;X_n \to Y). \tag{A34}$$
Since $Q \preceq_Y X_1$, there is a conditional probability distribution $\kappa_{Q|X_1}$ such that $P_{Q|Y}(q|y) = \sum_{x_1} \kappa_{Q|X_1}(q|x_1)\, P_{X_1|Y}(x_1|y)$. Define a random variable $\tilde{Q}$ with the joint distribution
$$P_{Y X_1 \tilde{Q}}(y,x_1,q) = \kappa_{Q|X_1}(q|x_1)\, P_{Y X_1}(y,x_1).$$
We will use that $P_{QY} = P_{\tilde{Q}Y}$. Then, the chain rule for mutual information gives
$$I(Y;X_1,\tilde{Q}) = I(Y;\tilde{Q}) + I(Y;X_1|\tilde{Q}) = I(Y;X_1) + I(Y;\tilde{Q}|X_1) = I(Y;X_1),$$
where we used the Markov condition $Y - X_1 - \tilde{Q}$. Combining and rearranging gives
$$I(Y;\tilde{Q}) = I(Y;X_1) - I(Y;X_1|\tilde{Q}). \tag{A35}$$
Now assume that Equation (A32) holds. Combining with Equation (A34) and $P_{QY} = P_{\tilde{Q}Y}$ gives $I(Y;X_1) = I(Y;Q) = I(Y;\tilde{Q})$. Combining with Equation (A35) gives $I(Y;X_1|\tilde{Q}) = 0$, meaning that the Markov condition $Y - \tilde{Q} - X_1$ holds and therefore $X_1 \preceq_Y \tilde{Q}$. Since $Q \preceq_Y X_i$ for all $i$ and $P_{QY} = P_{\tilde{Q}Y}$, it is also the case that $\tilde{Q} \preceq_Y X_i$ for all $i$. Finally, since $\preceq_Y$ is transitive, $X_1 \preceq_Y X_i$ for all $i$, which is the desired result. □
Proof of Theorem 4. 
Without loss of generality, let $i = 1$. We will use that $E(X_1 \to Y | X_1;\dots;X_n) = 0$ is equivalent to
$$I_\cup^\prec(X_1;\dots;X_n \to Y) = I(X_1;Y). \tag{A36}$$
We will also use that, by monotonicity of mutual information with respect to $\preceq_Y$ (see Section 4.1),
$$I(X_1;Y) \le I_\cup^\prec(X_1;\dots;X_n \to Y). \tag{A37}$$
We first prove the “if” direction, assuming that $X_i \preceq_Y X_1$ for all $i$. Since $Q = X_1$ is then in the feasible set of Equation (28), $I_\cup^\prec(X_1;\dots;X_n \to Y) \le I(X_1;Y)$. Combining with Equation (A37) gives Equation (A36).
We now prove the “only if” direction. As we show in Appendix G, $I_\cup^\prec$ is equivalent to $I_\cup^{\mathrm{BROJA}}$, which is defined as an optimization over a finite dimensional, closed, and bounded region. Thus the infimum in Equation (28) is always achieved, so there is some $Q$ such that $X_i \preceq_Y Q$ for all $i$ and
$$I(Y;Q) = I_\cup^\prec(X_1;\dots;X_n \to Y). \tag{A38}$$
Moreover, since $X_1 \preceq_Y Q$, there is a conditional probability distribution $\kappa_{X_1|Q}$ such that $P_{X_1|Y}(x_1|y) = \sum_q \kappa_{X_1|Q}(x_1|q)\, P_{Q|Y}(q|y)$. Define a random variable $\tilde{X}_1$ with the joint distribution
$$P_{Y \tilde{X}_1 Q}(y,x_1,q) = \kappa_{X_1|Q}(x_1|q)\, P_{QY}(q,y).$$
We will use that $P_{X_1 Y} = P_{\tilde{X}_1 Y}$. Then, using the chain rule for mutual information,
$$I(Y;\tilde{X}_1,Q) = I(Y;\tilde{X}_1) + I(Y;Q|\tilde{X}_1) = I(Y;Q) + I(Y;\tilde{X}_1|Q) = I(Y;Q),$$
where we used the Markov condition $Y - Q - \tilde{X}_1$. Combining and rearranging gives
$$I(Y;\tilde{X}_1) = I(Y;Q) - I(Y;Q|\tilde{X}_1). \tag{A39}$$
Now assume that Equation (A36) holds. Combining with Equation (A38) and $P_{X_1 Y} = P_{\tilde{X}_1 Y}$ gives $I(Y;X_1) = I(Y;\tilde{X}_1) = I(Y;Q)$. Combining with Equation (A39) gives $I(Y;Q|\tilde{X}_1) = 0$, meaning that the Markov condition $Y - \tilde{X}_1 - Q$ holds and therefore $Q \preceq_Y \tilde{X}_1$. Since $P_{X_1 Y} = P_{\tilde{X}_1 Y}$, it is also the case that $Q \preceq_Y X_1$. Finally, since $X_i \preceq_Y Q$ for all $i$ and $\preceq_Y$ is transitive, $X_i \preceq_Y X_1$ for all $i$, which is the desired result. □
Proof of Theorem 6. 
Consider any random variable $Q$ which achieves the maximum in Equation (27). This implies there are channels $\kappa_{Q|X_1}$ and $\kappa_{Q|X_2}$ such that for any $q \in \mathcal{Q}$ and $(x_1,x_2) \in \mathcal{X}_1 \times \mathcal{X}_2$ with $P_{X_1 X_2}(x_1,x_2) > 0$,
$$\begin{aligned} P_{Q|X_1 X_2}(q|x_1,x_2) &= \sum_{x_1'} \kappa_{Q|X_1}(q|x_1')\, P_{X_1|X_1 X_2}(x_1'|x_1,x_2), \\ P_{Q|X_1 X_2}(q|x_1,x_2) &= \sum_{x_2'} \kappa_{Q|X_2}(q|x_2')\, P_{X_2|X_1 X_2}(x_2'|x_1,x_2). \end{aligned}$$
We now equate the above two expressions, while using that $P_{X_1|X_1 X_2}(x_1'|x_1,x_2) = \delta(x_1',x_1)$ and $P_{X_2|X_1 X_2}(x_2'|x_1,x_2) = \delta(x_2',x_2)$ (where $\delta(\cdot,\cdot)$ is the Kronecker delta). This gives
$$\kappa_{Q|X_1}(q|x_1) = \kappa_{Q|X_2}(q|x_2) \tag{A40}$$
for all $q$ and any $(x_1,x_2)$ where $P_{X_1 X_2}(x_1,x_2) > 0$.
Now consider a bipartite graph with vertex set $\mathcal{X}_1 \cup \mathcal{X}_2$ and an edge between vertex $x_1$ and vertex $x_2$ if $P_{X_1 X_2}(x_1,x_2) > 0$. Define $\Pi$ to be the set of connected components of this bipartite graph, and let $f_1 : \mathcal{X}_1 \to \Pi$ be a function that maps each $x_1$ to its corresponding connected component (for any $x_1$ with $P_{X_1}(x_1) = 0$, $f_1(x_1)$ can be any value). Equation (A40) implies that if $x_1$ and $x_1'$ both belong to the same connected component, then the constraint in Equation (A40) will “propagate” from $x_1$ to $x_1'$, so that $\kappa_{Q|X_1}(q|x_1) = \kappa_{Q|X_1}(q|x_1')$. Said differently, this means that $\kappa_{Q|X_1}(q|x_1) = \kappa_{Q|X_1}(q|f_1(x_1))$ and that the Markov condition $(X_1,X_2) - X_1 - f_1(X_1) - Q$ holds. This gives
$$I(X_1,X_2;Q) \le I(X_1,X_2;f_1(X_1)) = H(f_1(X_1)), \tag{A41}$$
where the inequality uses the data processing inequality, and the equality uses that $f_1(X_1)$ is a deterministic function of $(X_1,X_2)$. The upper bound in Equation (A41) is achieved when $Q = f_1(X_1)$, so $Q$ is a deterministic function of $X_1$. A similar argument shows that $Q$ is a deterministic function of $X_2$.
We have shown that the constraints in Equation (27) can be replaced by the requirement that $Q$ be a deterministic function of $X_1$ and a deterministic function of $X_2$. It is also clear that any $Q$ which is a deterministic function of either $X_1$ or $X_2$ must also be a deterministic function of the target $Y = (X_1,X_2)$, hence $I(Y;Q) = H(Q)$. Combining these results shows that Equation (27) is equivalent to Equation (32) for the COPY gate. □
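The connected-component construction used in this proof is easy to compute. The sketch below is an illustration under stated assumptions: the function name `common_information`, the dictionary encoding of the joint distribution, and the two example distributions are hypothetical, and the union-find structure is just one convenient way to find the components. It returns $H(f_1(X_1))$, the entropy of the component label.

```python
from math import log2

def common_information(p_x1x2):
    """Entropy (bits) of the connected-component label f1(X1).

    p_x1x2 maps (x1, x2) to a probability; an edge joins x1 and x2
    whenever that probability is positive.
    """
    parent = {}

    def find(v):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for (x1, x2), p in p_x1x2.items():
        if p > 0:
            a, b = ("x1", x1), ("x2", x2)
            parent.setdefault(a, a)
            parent.setdefault(b, b)
            parent[find(a)] = find(b)  # merge the components of the two endpoints

    # Probability mass of each connected component.
    mass = {}
    for (x1, x2), p in p_x1x2.items():
        if p > 0:
            root = find(("x1", x1))
            mass[root] = mass.get(root, 0.0) + p
    return -sum(p * log2(p) for p in mass.values())

# Two components: {x1=0, x2=0, x2=1} and {x1=1, x2=2}.
block = {(0, 0): 0.25, (0, 1): 0.25, (1, 2): 0.5}
# Independent fair bits: a single component, so no common information.
indep = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
print(common_information(block), common_information(indep))
```

For the first distribution the two components each carry probability one half, giving one bit; for independent bits the graph is connected and the entropy is zero up to sign.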

References

1. Schneidman, E.; Bialek, W.; Berry, M.J. Synergy, Redundancy, and Independence in Population Codes. J. Neurosci. 2003, 23, 11539–11553.
2. Daniels, B.C.; Ellison, C.J.; Krakauer, D.C.; Flack, J.C. Quantifying collectivity. Curr. Opin. Neurobiol. 2016, 37, 106–113.
3. Tax, T.; Mediano, P.; Shanahan, M. The partial information decomposition of generative neural network models. Entropy 2017, 19, 474.
4. Amjad, R.A.; Liu, K.; Geiger, B.C. Understanding individual neuron importance using information theory. arXiv 2018, arXiv:1804.06679.
5. Lizier, J.; Bertschinger, N.; Jost, J.; Wibral, M. Information decomposition of target effects from multi-source interactions: Perspectives on previous, current and future work. Entropy 2018, 20, 307.
6. Wibral, M.; Priesemann, V.; Kay, J.W.; Lizier, J.T.; Phillips, W.A. Partial information decomposition as a unified approach to the specification of neural goal functions. Brain Cogn. 2017, 112, 25–38.
7. Timme, N.; Alford, W.; Flecker, B.; Beggs, J.M. Synergy, redundancy, and multivariate information measures: An experimentalist’s perspective. J. Comput. Neurosci. 2014, 36, 119–140.
8. Chan, C.; Al-Bashabsheh, A.; Ebrahimi, J.B.; Kaced, T.; Liu, T. Multivariate Mutual Information Inspired by Secret-Key Agreement. Proc. IEEE 2015, 103, 1883–1913.
9. Rosas, F.E.; Mediano, P.A.; Jensen, H.J.; Seth, A.K.; Barrett, A.B.; Carhart-Harris, R.L.; Bor, D. Reconciling emergences: An information-theoretic approach to identify causal emergence in multivariate data. PLoS Comput. Biol. 2020, 16, e1008289.
10. Cang, Z.; Nie, Q. Inferring spatial and signaling relationships between cells from single cell transcriptomic data. Nat. Commun. 2020, 11, 2084.
11. Williams, P.L.; Beer, R.D. Nonnegative decomposition of multivariate information. arXiv 2010, arXiv:1004.2515.
12. Williams, P.L. Information dynamics: Its theory and application to embodied cognitive systems. Ph.D. Thesis, Indiana University, Bloomington, IN, USA, 2011.
13. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J.; Ay, N. Quantifying unique information. Entropy 2014, 16, 2161–2183.
14. Quax, R.; Har-Shemesh, O.; Sloot, P. Quantifying synergistic information using intermediate stochastic variables. Entropy 2017, 19, 85.
15. James, R.G.; Emenheiser, J.; Crutchfield, J.P. Unique information via dependency constraints. J. Phys. A Math. Theor. 2018, 52, 014002.
16. Griffith, V.; Chong, E.K.; James, R.G.; Ellison, C.J.; Crutchfield, J.P. Intersection information based on common randomness. Entropy 2014, 16, 1985–2000.
17. Griffith, V.; Koch, C. Quantifying synergistic mutual information. In Guided Self-Organization: Inception; Springer: Berlin/Heidelberg, Germany, 2014; pp. 159–190.
18. Griffith, V.; Ho, T. Quantifying redundant information in predicting a target random variable. Entropy 2015, 17, 4644–4653.
19. Harder, M.; Salge, C.; Polani, D. Bivariate measure of redundant information. Phys. Rev. E 2013, 87, 012130.
20. Ince, R. Measuring Multivariate Redundant Information with Pointwise Common Change in Surprisal. Entropy 2017, 19, 318.
21. Finn, C.; Lizier, J. Pointwise Partial Information Decomposition Using the Specificity and Ambiguity Lattices. Entropy 2018, 20, 297.
22. Shannon, C. The lattice theory of information. Trans. IRE Prof. Group Inf. Theory 1953, 1, 105–107.
23. Shannon, C.E. A note on a partial ordering for communication channels. Inf. Control 1958, 1, 390–397.
24. Cohen, J.; Kempermann, J.H.; Zbaganu, G. Comparisons of Stochastic Matrices with Applications in Information Theory, Statistics, Economics and Population; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1998.
25. Le Cam, L. Sufficiency and approximate sufficiency. Ann. Math. Stat. 1964, 35, 1419–1455.
26. Körner, J.; Marton, K. Comparison of two noisy channels. Top. Inf. Theory 1977, 16, 411–423.
27. Torgersen, E. Comparison of Statistical Experiments; Cambridge University Press: Cambridge, UK, 1991; Volume 36.
28. Blackwell, D. Equivalent comparisons of experiments. Ann. Math. Stat. 1953, 24, 265–272.
29. James, R.; Emenheiser, J.; Crutchfield, J. Unique information and secret key agreement. Entropy 2019, 21, 12.
30. Whitelaw, T.A. Introduction to Abstract Algebra, 2nd ed.; OCLC: 17440604; Blackie & Son: London, UK, 1988.
31. Halmos, P.R. Naive Set Theory; Courier Dover Publications: Mineola, NY, USA, 2017.
32. McGill, W. Multivariate information transmission. Trans. IRE Prof. Group Inf. Theory 1954, 4, 93–111.
33. Fano, R.M. The Transmission of Information: A Statistical Theory of Communications; Massachusetts Institute of Technology: Cambridge, MA, USA, 1961.
34. Reza, F.M. An Introduction to Information Theory; Dover Publications, Inc.: Mineola, NY, USA, 1961.
35. Ting, H.K. On the amount of information. Theory Probab. Its Appl. 1962, 7, 439–447.
36. Yeung, R.W. A new outlook on Shannon’s information measures. IEEE Trans. Inf. Theory 1991, 37, 466–474.
37. Bell, A.J. The co-information lattice. In Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation: ICA, Nara, Japan, 1–4 April 2003.
38. Tilman. Examples of Common False Beliefs in Mathematics (Dimensions of Vector Spaces). MathOverflow. 2010. Available online: https://mathoverflow.net/q/23501 (accessed on 4 January 2022).
39. Rauh, J.; Bertschinger, N.; Olbrich, E.; Jost, J. Reconsidering unique information: Towards a multivariate information decomposition. In Proceedings of the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014; pp. 2232–2236.
40. Rauh, J. Secret Sharing and Shared Information. Entropy 2017, 19, 601.
41. Chicharro, D.; Panzeri, S. Synergy and Redundancy in Dual Decompositions of Mutual Information Gain and Information Loss. Entropy 2017, 19, 71.
42. Ay, N.; Polani, D.; Virgo, N. Information decomposition based on cooperative game theory. arXiv 2019, arXiv:1910.05979.
43. Rosas, F.E.; Mediano, P.A.; Rassouli, B.; Barrett, A.B. An operational information decomposition via synergistic disclosure. J. Phys. A Math. Theor. 2020, 53, 485001.
44. Davey, B.A.; Priestley, H.A. Introduction to Lattices and Order; Cambridge University Press: Cambridge, UK, 2002.
45. Bertschinger, N.; Rauh, J. The Blackwell relation defines no lattice. In Proceedings of the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 2479–2483.
46. Li, H.; Chong, E.K. On a connection between information and group lattices. Entropy 2011, 13, 683–708.
47. Gács, P.; Körner, J. Common information is far less than mutual information. Probl. Control Inf. Theory 1973, 2, 149–162.
48. Aumann, R.J. Agreeing to disagree. Ann. Stat. 1976, 4, 1236–1239.
49. Banerjee, P.K.; Griffith, V. Synergy, Redundancy and Common Information. arXiv 2015, arXiv:1509.03706v1.
50. Hexner, G.; Ho, Y. Information structure: Common and private (Corresp.). IEEE Trans. Inf. Theory 1977, 23, 390–393.
51. Barrett, A.B. Exploration of synergistic and redundant information sharing in static and dynamical Gaussian systems. Phys. Rev. E 2015, 91, 052802.
52. Pluim, J.P.; Maintz, J.A.; Viergever, M.A. F-information measures in medical image registration. IEEE Trans. Med. Imaging 2004, 23, 1508–1516.
53. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J.; Lafferty, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
54. Brunel, N.; Nadal, J.P. Mutual information, Fisher information, and population coding. Neural Comput. 1998, 10, 1731–1757.
55. Li, M.; Vitányi, P. An Introduction to Kolmogorov Complexity and Its Applications; Springer: Berlin/Heidelberg, Germany, 2008; Volume 3.
56. Shmaya, E. Comparison of information structures and completely positive maps. J. Phys. A Math. Gen. 2005, 38, 9717.
57. Chefles, A. The quantum Blackwell theorem and minimum error state discrimination. arXiv 2009, arXiv:0907.0866.
58. Buscemi, F. Comparison of quantum statistical models: Equivalent conditions for sufficiency. Commun. Math. Phys. 2012, 310, 625–647.
59. Ohya, M.; Watanabe, N. Quantum entropy and its applications to quantum communication and statistical physics. Entropy 2010, 12, 1194–1245.
60. Rauh, J.; Banerjee, P.K.; Olbrich, E.; Jost, J.; Bertschinger, N.; Wolpert, D. Coarse-Graining and the Blackwell Order. Entropy 2017, 19, 527.
61. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2006.
62. Makur, A.; Polyanskiy, Y. Comparison of channels: Criteria for domination by a symmetric channel. IEEE Trans. Inf. Theory 2018, 64, 5704–5725.
63. Benson, H.P. Concave minimization: Theory, applications and algorithms. In Handbook of Global Optimization; Springer: Berlin/Heidelberg, Germany, 1995; pp. 43–148.
64. Kolchinsky, A. Code for Computing I∩≺. 2022. Available online: https://github.com/artemyk/redundancy (accessed on 3 January 2022).
65. Banerjee, P.K.; Rauh, J.; Montúfar, G. Computing the unique information. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 141–145.
66. Banerjee, P.K.; Olbrich, E.; Jost, J.; Rauh, J. Unique informations and deficiencies. In Proceedings of the 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 2–5 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 32–38.
67. Wolf, S.; Wullschleger, J. Zero-error information and applications in cryptography. In Proceedings of the Information Theory Workshop, San Antonio, TX, USA, 24–29 October 2004; IEEE: Piscataway, NJ, USA, 2004; pp. 1–6.
68. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J. Shared information - new insights and problems in decomposing information in complex systems. In Proceedings of the European Conference on Complex Systems 2012; Springer: Berlin/Heidelberg, Germany, 2013; pp. 251–269.
69. James, R.G.; Ellison, C.J.; Crutchfield, J.P. dit: A Python package for discrete information theory. J. Open Source Softw. 2018, 3, 738.
70. Kovačević, M.; Stanojević, I.; Šenk, V. On the entropy of couplings. Inf. Comput. 2015, 242, 369–382.
71. Horst, R. On the global minimization of concave functions. Oper.-Res.-Spektrum 1984, 6, 195–205.
72. Pardalos, P.M.; Rosen, J.B. Methods for global concave minimization: A bibliographic survey. SIAM Rev. 1986, 28, 367–379.
73. Williams, P.L.; Beer, R.D. Generalized measures of information transfer. arXiv 2011, arXiv:1102.1507.
74. Dubins, L.E. On extreme points of convex sets. J. Math. Anal. Appl. 1962, 5, 237–244.
75. Yeung, R.W. A First Course in Information Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012.
76. Lewis, A.D. Semicontinuity of Rank and Nullity and Some Consequences. 2009. Available online: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.709.7290&rep=rep1&type=pdf (accessed on 3 January 2022).
77. Hoffman, A.J. On Approximate Solutions of Systems of Linear Inequalities. J. Res. Natl. Bur. Stand. 1952, 49, 174–176.
78. Daniel, J.W. On Perturbations in Systems of Linear Inequalities. SIAM J. Numer. Anal. 1973, 10, 299–307.
Figure 1. Partial information decomposition of the information provided by two sources about a target. On the left, we show the decomposition induced by redundancy $I_\cap$, which leads to measures of unique information $U$. On the right, we show the decomposition induced by union information $I_\cup$, which leads to measures of synergy $S$ and excluded information $E$.
Figure 2. Illustration of Theorem 5, which provides a sufficient condition for the local continuity of $I_\cap^\prec$. Consider two scenarios, both of which involve two sources $X_1$ and $X_2$ and a target $Y$ with cardinality $|\mathcal{Y}| = 3$. The blue areas indicate the simplex of probability distributions over $\mathcal{Y}$, with the marginal $P_Y$ and the pairwise conditionals $P_{Y|X_i=x_i}$ marked. On the left, both sources have $\operatorname{rank} P_{Y|X_i} = 3 = |\mathcal{Y}|$, so $I_\cap^\prec$ is locally continuous. On the right, both sources have $\operatorname{rank} P_{Y|X_i} = 2 < |\mathcal{Y}|$, so $I_\cap^\prec$ is not necessarily locally continuous. Note that $I_\cap^\prec$ is also continuous if only one source has $\operatorname{rank} P_{Y|X_i} = 3$.
Table 1. Comparison of different redundancy measures. “?” indicates properties that we could not easily establish.
$I_\cap^\prec$  $I_\cap^{\mathrm{WB}}$  $I_\cap^{\mathrm{MMI}}$  $I_\cap^{\wedge}$  $I_\cap^{\mathrm{GH}}$  $I_\cap^{\mathrm{Ince}}$  $I_\cap^{\mathrm{FL}}$  $I_\cap^{\mathrm{BROJA}}$  $I_\cap^{\mathrm{Harder}}$  $I_\cap^{\mathrm{dep}}$
More than 2 sources
Monotonicity
IEP for bivariate case ??
Independent identity
Blackwell property
Pairwise marginals
Target equality
Table 2. Behavior of $I_\cap^\prec$ and other redundancy measures on bivariate examples.
Target  $I_\cap^\prec$  $I_\cap^{\mathrm{WB}}$  $I_\cap^{\mathrm{MMI}}$  $I_\cap^{\wedge}$  $I_\cap^{\mathrm{GH}}$  $I_\cap^{\mathrm{Ince}}$  $I_\cap^{\mathrm{FL}}$  $I_\cap^{\mathrm{BROJA}}$  $I_\cap^{\mathrm{Harder}}$  $I_\cap^{\mathrm{dep}}$
$Y = X_1 \text{ AND } X_2$: 0.311, 0.311, 0.311, 0, 0.123, 0.104, 0.561, 0.311, 0.082
$Y = X_1 + X_2$: 0.5, 0.5, 0.5, 0, 0, 0, 0.5, 0.5, 0.189
$Y = X_1$: $I(X_1;X_2)$, $I(X_1;X_2)$, $I(X_1;X_2)$, $C(X_1 \wedge X_2)$, $I(X_1;X_2)$, *, 1, $I(X_1;X_2)$, $I(X_1;X_2)$
$Y = (X_1,X_2)$: $C(X_1 \wedge X_2)$, 1, 1, $C(X_1 \wedge X_2)$, $C(X_1 \wedge X_2)$, *, 1, $I(X_1;X_2)$, $I(X_1;X_2)$
Table 3. Behavior of $I_\cap^\prec$ and other redundancy measures on three sources.

| Target | $I_\cap^\prec$ | $I_\cap^{\mathrm{WB}}$ | $I_\cap^{\mathrm{MMI}}$ | $I_\cap^{\wedge}$ | $I_\cap^{\mathrm{Ince}}$ | $I_\cap^{\mathrm{FL}}$ |
|---|---|---|---|---|---|---|
| $Y = X_1 \text{ AND } X_2 \text{ AND } X_3$ | 0.138 | 0.138 | 0.138 | 0 | 0.024 | 0.294 |
| $Y = X_1 + X_2 + X_3$ | 0.311 | 0.311 | 0.311 | 0 | 0 | 0.561 |
| $Y = ((A,B),(A,C),(A,D))$ | 1 | 2 | 2 | 1 | 1 | 2 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Kolchinsky, A. A Novel Approach to the Partial Information Decomposition. Entropy 2022, 24, 403. https://doi.org/10.3390/e24030403
