Understanding interdependency through complex information sharing

The interactions between three or more random variables are often nontrivial and poorly understood, yet they are paramount for future advances in fields such as network information theory, neuroscience, genetics and many others. In this work, we propose to analyze these interactions as different modes of information sharing. Towards this end, we introduce a novel axiomatic framework for decomposing the joint entropy, which characterizes the various ways in which random variables can share information. The key contribution of our framework is to distinguish between interdependencies where the information is shared redundantly, and synergistic interdependencies where the sharing structure exists in the whole but not between the parts. We show that our axioms determine unique formulas for all the terms of the proposed decomposition for a number of cases of interest. Moreover, we show how these results can be applied to several network information theory problems, providing a more intuitive understanding of their fundamental limits.


Interdependence is a key concept for understanding the rich structures that can be exhibited by biological, economic and social systems [1], [2]. Although this phenomenon lies at the heart of our modern interconnected world, there is still no solid quantitative framework for analyzing complex interdependences, even though such a framework is crucial for future advances in a number of disciplines. In neuroscience, researchers wish to identify how various neurons affect an organism's overall behavior, asking to what extent the different neurons provide redundant or synergistic signals [3]. In genetics, the interactions and roles of multiple genes with respect to phenotypic phenomena are studied, e.g. by comparing results from single and double knockout experiments [4].
In graph and network theory, researchers are looking for measures of the information encoded in node interactions in order to quantify the complexity of the network [5]. In communication theory, sensor networks usually generate strongly correlated data [6]; a haphazard design might not account for these interdependencies and, undesirably, will process and transmit redundant information across the network, degrading the efficiency of the system.
The dependencies that can exist between two variables have been extensively studied, generating a variety of techniques that range from statistical inference [7] to information theory [8].
Most of these approaches require that one differentiate the role of the variables, e.g. between a target and predictor. However, the extension of these approaches to three or more variables is not straightforward, as a binary splitting is, in general, not enough to characterize the rich interplay that can exist between variables. Moreover, the development of more adequate frameworks has been difficult as most of our theoretical tools are rooted in sequential reasoning, which is adept at representing linear flows of influences but not as well-suited for describing distributed systems or complex interdependencies [9].
In this work, we propose to understand interdependencies between variables as information sharing. In the case of two variables, the portion of the variability that can be predicted corresponds to information that target and predictor have in common. Following this intuition, we present a framework that decomposes the total information of a distribution according to how it is shared among its variables. Our framework is novel in combining the hierarchical decomposition of higher-order interactions, as developed in [10], with the notion of synergistic information, as proposed in [11]. In contrast to [10], we study the information that exists in the system itself without comparing it with other related distributions. In contrast to [11], we analyze the joint entropy instead of the mutual information, looking for symmetric properties of the system.
One important contribution of this paper is to distinguish shared information from predictability. Predictability is a concept that requires a bipartite system divided into predictors and targets. As different splittings of the same system often yield different conclusions, we see predictability as a directed notion that strongly depends on one's "point of view". In contrast, we see shared information as a property of the system itself, which does not require differentiated roles between its components. Although it is not possible in general to find a unique measure of predictability, we show that the shared information can be uniquely defined for a number of interesting scenarios.
Additionally, our framework provides new insight into various problems of network information theory. Interestingly, many of the problems of network information theory that have been solved are related to systems which present a simple structure in terms of shared information and synergies, while most of the open problems possess a more complex mixture of them.
The rest of this article is structured as follows. First, Section II introduces the notions of hierarchical decomposition of dependencies and synergistic information, reviewing the state of the art and providing the necessary background for the unfamiliar reader. Section III presents our axiomatic decomposition for the joint entropy, focusing on the fundamental case of three random variables. Then, we illustrate the application of our framework for various cases of interest: pairwise independent variables in Section IV, pairwise maximum entropy distributions and Markov chains in Section V, and multivariate Gaussians in Section VI. After that, Section VII presents a first application of this framework in settings of fundamental importance for network information theory. Finally, Section VIII summarizes our main conclusions.

II. PRELIMINARIES AND STATE OF THE ART
One way of analyzing the interactions between the random variables $X = (X_1, \dots, X_N)$ is to study the properties of the correlation matrix $R_X = E\{X X^t\}$. However, this approach only captures linear relationships and hence the picture provided by $R_X$ is incomplete. Another possibility is to study the matrix $I_X = [I(X_i; X_j)]_{i,j}$ of mutual information terms. This matrix captures the existence of both linear and nonlinear dependencies [12], but its scope is restricted to pairwise relationships and thus misses all higher-order structure. To see an example of how this can happen, consider two independent fair coins $X_1$ and $X_2$ and let $X_3 := X_1 \oplus X_2$ be the output of an XOR logic gate. The mutual information matrix $I_X$ has all its off-diagonal elements equal to zero, making it indistinguishable from an alternative situation where $X_3$ is just another independent fair coin.
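The XOR example can be checked numerically. The following is a minimal sketch, assuming entropies in bits and joint distributions stored as dictionaries from outcome tuples to probabilities; the helper names (`entropy`, `mutual_info`, `mi_matrix`) are illustrative and not part of the original text.

```python
from itertools import product
from math import log2

def entropy(pmf, idx):
    """Shannon entropy (bits) of the marginal of `pmf` on the coordinates `idx`."""
    marg = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def mutual_info(pmf, i, j):
    """I(X_i; X_j) = H(X_i) + H(X_j) - H(X_i, X_j); the diagonal gives H(X_i)."""
    return entropy(pmf, (i,)) + entropy(pmf, (j,)) - entropy(pmf, (i, j))

def mi_matrix(pmf, n=3):
    """Matrix of pairwise mutual information terms."""
    return [[mutual_info(pmf, i, j) for j in range(n)] for i in range(n)]

# X3 = X1 xor X2 with independent fair inputs, versus three independent fair coins.
xor = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}
coins = {x: 0.125 for x in product((0, 1), repeat=3)}

for name, pmf in (("xor", xor), ("coins", coins)):
    print(name, mi_matrix(pmf), "joint entropy:", entropy(pmf, (0, 1, 2)))
```

Both systems yield the same matrix, while their joint entropies differ (2 bits versus 3 bits), which is the higher-order structure that the pairwise description misses.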
For the case of $R_X$, a possible next step would be to consider higher-order moment matrices, such as co-skewness and co-kurtosis. We seek their information-theoretic analogs, which complement the description provided by $I_X$. One method of doing this is by studying the information contained in marginal distributions of increasingly larger sizes; this approach is presented in Section II-A. Other methods try to provide a direct representation of the information that is shared between the random variables; they are discussed in Sections II-B, II-C and II-D.

A. Negentropy and total correlation
When the random variables that compose a system are independent, their joint distribution is given by the product of their marginal distributions. In this case, the marginals contain all that is to be learned about the statistics of the entire system. For an arbitrary joint probability density function (p.d.f.), knowing the single variable marginal distributions is not enough to capture all there is to know about the statistics of the system.
To quantify this idea, let us consider $N$ discrete random variables $X = (X_1, \dots, X_N)$ with joint p.d.f. $p_X$, where each $X_j$ takes values in a finite set with cardinality $\Omega_j$. The maximal amount of information that could be stored in any such system is $H^{(0)} = \sum_j \log \Omega_j$, which corresponds to the entropy of the p.d.f. $p_U := \prod_j p_{U_j}$, where $p_{U_j}(x) = 1/\Omega_j$ is the uniform distribution for each random variable $X_j$. On the other hand, the joint entropy $H(X)$ with respect to the true distribution $p_X$ measures the actual uncertainty that the system possesses. Therefore, the difference
$$\mathcal{N}(X) := H^{(0)} - H(X) \tag{1}$$
corresponds to the decrease of the uncertainty about the system that occurs when one learns its p.d.f., i.e. the information about the system that is contained in its statistics. This quantity is known as negentropy [13], and can also be computed as
$$\mathcal{N}(X) = \sum_j \left[ \log \Omega_j - H(X_j) \right] + \left[ \sum_j H(X_j) - H(X) \right] \tag{2}$$
$$= D\Big( \prod_j p_{X_j} \Big\| p_U \Big) + D\Big( p_X \Big\| \prod_j p_{X_j} \Big), \tag{3}$$
where $p_{X_j}$ is the marginal of the variable $X_j$ and $D(\cdot\|\cdot)$ is the Kullback-Leibler divergence. In this way, (3) decomposes the negentropy into a term that corresponds to the information given by simple marginals and a term that involves higher-order marginals. The second term is known as the total correlation (TC) [14] (also known as multi-information [15]), which is equal to the mutual information for the case of $N = 2$. Because of this, the TC has been suggested as an extension of the notion of mutual information for multiple variables.
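The split of the negentropy into a single-marginal term and the total correlation can be verified on a small example. The sketch below assumes entropies in bits and uses a hypothetical joint distribution of three binary variables chosen only for illustration; all function names are illustrative.

```python
from itertools import product
from math import log2

def marginal(pmf, j):
    """Marginal distribution of the j-th variable."""
    out = {}
    for x, p in pmf.items():
        out[x[j]] = out.get(x[j], 0.0) + p
    return out

def entropy(dist):
    """Shannon entropy (bits) of a distribution given as a dict."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def kl(p, q, states):
    """D(p||q) in bits; p and q are functions, q must be positive where p is."""
    return sum(p(x) * log2(p(x) / q(x)) for x in states if p(x) > 0)

# Hypothetical example: correlated binary inputs and X3 = X1 xor X2.
pmf = {(0, 0, 0): 0.4, (0, 1, 1): 0.1, (1, 0, 1): 0.1, (1, 1, 0): 0.4}
alphabets = [(0, 1)] * 3
states = list(product(*alphabets))
margs = [marginal(pmf, j) for j in range(3)]

def p_joint(x):
    return pmf.get(x, 0.0)

def p_prod(x):
    out = 1.0
    for j, xj in enumerate(x):
        out *= margs[j][xj]
    return out

def p_unif(x):
    return 1.0 / len(states)

h0 = sum(log2(len(a)) for a in alphabets)      # maximal entropy H^(0)
negentropy = h0 - entropy(pmf)                 # H^(0) - H(X)
marginal_term = kl(p_prod, p_unif, states)     # information in the 1-marginals
total_corr = kl(p_joint, p_prod, states)       # total correlation (TC)
print(negentropy, marginal_term + total_corr)
```

The two printed values agree, and `total_corr` coincides with the sum of the marginal entropies minus the joint entropy.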
An elegant framework for decomposing the TC can be found in [10] (for an equivalent formulation that does not rely on information geometry, cf. [16]). Let us call k-marginals the distributions that are obtained by marginalizing the joint p.d.f. over $N - k$ variables. Note that the k-marginals provide a more detailed description of the system than the (k − 1)-marginals, as the latter can be directly computed from the former by marginalizing the corresponding variables.
In the case where only the 1-marginals are known, the simplest guess for the joint distribution is $\tilde{p}^{(1)}_X = \prod_j p_{X_j}$. One way of generalizing this for the case where the k-marginals are known is by using the maximum entropy principle [17], which suggests choosing the distribution that maximizes the joint entropy while satisfying the constraints given by the partial (k-marginal) knowledge. Let us denote by $\tilde{p}^{(k)}_X$ the p.d.f. which achieves the maximum entropy while being consistent with all the k-marginals, and let $H^{(k)} = H(\{\tilde{p}^{(k)}_X\})$ denote its entropy. Note that $H^{(k)} \geq H^{(k+1)}$, since the number of constraints that are involved in the maximization process that generates $H^{(k)}$ increases with $k$. It can therefore be shown that the following generalized Pythagorean relationship holds for the total correlation:
$$TC = H^{(1)} - H(X) = \sum_{k=2}^{N} \left[ H^{(k-1)} - H^{(k)} \right] := \sum_{k=2}^{N} \Delta H^{(k)}. \tag{4}$$
Above, $\Delta H^{(k)} \geq 0$ measures the additional information that is provided by the k-marginals that was not contained in the description of the system given by the (k − 1)-marginals. In general, the information that is located in terms with higher values of $k$ is due to dependencies between groups of variables that cannot be reduced to combinations of dependencies between smaller groups.
It has been observed that in many practical scenarios most of the TC of the measured data is provided by the lower marginals. It can be shown that the percentage of the TC that is lost by considering only the $k_0$-order marginals is given by
$$\frac{1}{TC} \sum_{k=k_0+1}^{N} \Delta H^{(k)} = \frac{H^{(k_0)} - H(X)}{TC}. \tag{5}$$
This quantity is small if there exists a value of $k_0$ such that $\tilde{p}^{(k_0)}_X$ provides an accurate approximation for the joint p.d.f. of the system. Interestingly, it has been shown that pairwise maximum entropy models (i.e. $k_0 = 2$) can provide an accurate description of the statistics of many biological systems [18]-[21] and also some social organizations [22], [23].
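One possible way of computing the maximum entropy distributions for small systems is iterative proportional fitting started from the uniform distribution. The sketch below is an illustration under that assumption and is not the method of [10]; it also assumes the convention $\Delta H^{(k)} = H^{(k-1)} - H^{(k)}$ with entropies in bits, and the function name `maxent_given_k_marginals` is hypothetical.

```python
from itertools import combinations, product
from math import log2

def marginal(pmf, idx):
    """Marginal of `pmf` on the coordinates `idx`."""
    out = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def entropy(pmf):
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

def maxent_given_k_marginals(pmf, k, alphabets, sweeps=200):
    """Iterative proportional fitting: rescale q until it matches every k-marginal of pmf."""
    states = list(product(*alphabets))
    q = {x: 1.0 / len(states) for x in states}
    subsets = list(combinations(range(len(alphabets)), k))
    targets = {s: marginal(pmf, s) for s in subsets}
    for _ in range(sweeps):
        for s in subsets:
            current = marginal(q, s)
            for x in states:
                key = tuple(x[i] for i in s)
                if current[key] > 0:
                    q[x] *= targets[s].get(key, 0.0) / current[key]
    return q

# XOR of two independent fair coins: all 1- and 2-marginals are uniform.
xor = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}
alphabets = [(0, 1)] * 3
H = {k: entropy(maxent_given_k_marginals(xor, k, alphabets)) for k in (1, 2, 3)}
delta = {k: H[k - 1] - H[k] for k in (2, 3)}
print(H, delta)
```

For the XOR system the whole total correlation (1 bit) sits in the third-order term, so a pairwise maximum entropy model loses all of it.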

B. Internal and external decompositions
An alternative approach to study the interdependencies between many random variables is to analyze the ways in which they share information. This can be done by decomposing the joint entropy of the system. For the case of two variables, the joint entropy can be decomposed as
$$H(X_1, X_2) = I(X_1; X_2) + H(X_1|X_2) + H(X_2|X_1), \tag{6}$$
suggesting that it can be divided into shared information, $I(X_1; X_2)$, and into terms which represent information that is exclusively located in a single variable, i.e., $H(X_1|X_2)$ for $X_1$ and $H(X_2|X_1)$ for $X_2$.
In systems with more than two variables, one can compute the total information that is exclusively located in one variable as $H_{(1)} := \sum_j H(X_j|X_j^c)$ ¹, where $X_j^c$ denotes all the system's variables except $X_j$. The difference between the joint entropy and the sum of all exclusive information terms, $H_{(1)}$, defines a quantity known [24] as the dual total correlation (DTC) ²:
$$DTC = H(X) - H_{(1)}, \tag{7}$$
which measures the portion of the joint entropy that is shared between two or more variables of the system. When $N = 2$ then $DTC = I(X_1; X_2)$, and hence the DTC has also been suggested in the literature as a measure for the multivariate mutual information.
¹ The superscripts and subscripts are used to reflect that $H^{(1)} \geq H(X) \geq H_{(1)}$.
² The DTC is also known as excess entropy in [25], whose definition differs from its typical use in the context of time series, e.g. [26].
By comparing (4) and (7), it would be appealing to look for a decomposition of the DTC of the form $DTC = \sum_{k=2}^{N} \Delta H_{(k)}$, where $\Delta H_{(k)} \geq 0$ would measure the information that is shared by exactly $k$ variables [27]. With this, one could define an internal entropy $H_{(j)} = H_{(1)} + \sum_{i=2}^{j} \Delta H_{(i)}$ as the information that is shared between at most $j$ variables, in contrast to the external entropy $H^{(j)}$ which describes the information provided by the j-marginals. These entropies form a non-decreasing sequence:
$$H_{(1)} \leq \dots \leq H_{(N-1)} \leq H_{(N)} = H(X) = H^{(N)} \leq H^{(N-1)} \leq \dots \leq H^{(1)}. \tag{8}$$
This layered structure, and its relationship with the TC and the DTC, is graphically represented in Figure 1.
It is interesting to note that even though the TC and DTC coincide for the case of $N = 2$, these quantities are in general different for larger system sizes. Therefore, in general $\Delta H^{(k)} \neq \Delta H_{(k)}$, although it is appealing to believe that there should exist a relationship between them. One of the goals of this paper is to explore the difference between these quantities.
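The difference between the TC and the DTC already shows up in very small systems. The following sketch, with illustrative helper names and entropies in bits, evaluates both quantities for a pair of identical bits, for the XOR triple and for three identical bits.

```python
from itertools import product
from math import log2

def entropy(pmf, idx):
    """Shannon entropy (bits) of the marginal of `pmf` on the coordinates `idx`."""
    marg = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def total_correlation(pmf, n):
    """TC = sum_j H(X_j) - H(X)."""
    return sum(entropy(pmf, (j,)) for j in range(n)) - entropy(pmf, tuple(range(n)))

def dual_total_correlation(pmf, n):
    """DTC = H(X) - sum_j H(X_j | all other variables)."""
    full = tuple(range(n))
    h = entropy(pmf, full)
    exclusive = sum(h - entropy(pmf, tuple(i for i in full if i != j)) for j in full)
    return h - exclusive

pair = {(0, 0): 0.5, (1, 1): 0.5}                                  # N = 2, X2 = X1
xor = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}  # X3 = X1 xor X2
copies = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}                          # X1 = X2 = X3

for name, pmf, n in (("pair", pair, 2), ("xor", xor, 3), ("copies", copies, 3)):
    print(name, total_correlation(pmf, n), dual_total_correlation(pmf, n))
```

The two measures agree for the pair, while the XOR triple has TC = 1 and DTC = 2 bits and the three copies have TC = 2 and DTC = 1 bit.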

C. Inclusion-exclusion decompositions
Perhaps the most natural approach to decompose the DTC and joint entropy is to apply the inclusion-exclusion principle, using a simplifying analogy that the entropies and areas have similar properties. A refined version of this approach can be found in the I-measures [28] and in the multi-scale complexity [29]. For the case of three variables, this approach gives
$$H(X_1, X_2, X_3) = H(X_1|X_2, X_3) + H(X_2|X_1, X_3) + H(X_3|X_1, X_2) + I(X_1; X_2|X_3) + I(X_2; X_3|X_1) + I(X_3; X_1|X_2) + I(X_1; X_2; X_3). \tag{9}$$
The last term is known as the co-information [30] (being closely related to the interaction information [31]), and can be defined using the inclusion-exclusion principle as
$$I(X_1; X_2; X_3) := H(X_1) + H(X_2) + H(X_3) - H(X_1, X_2) - H(X_2, X_3) - H(X_1, X_3) + H(X_1, X_2, X_3) = I(X_1; X_2) - I(X_1; X_2|X_3).$$
As $I(X_1; X_2; X_2) = I(X_1; X_2)$, the co-information has also been proposed as a candidate for extending the mutual information to multiple variables. For a summary of the various possible extensions of the mutual information, see Table I and also additional discussion in Ref. [32].
It is tempting to coarsen the decomposition provided by this approach in order to build a decomposition for the DTC. In this decomposition, the co-information associates to $\Delta H_{(3)}$, and the remaining terms of (9) associate to $\Delta H_{(2)}$. With this, one can build a Venn diagram for the information sharing between three variables, as in Figure 2. However, the resulting decomposition and diagram are not very intuitive since the co-information can be negative.
As part of this temptation, it is appealing to consider the conditional mutual information $I(X_1; X_2|X_3)$ as the information contained in $X_1$ and $X_2$ that is not contained in $X_3$, just as the conditional entropy $H(X_1|X_2)$ is the information that is in $X_1$ and not in $X_2$. However, the latter interpretation works because conditioning always reduces entropy (i.e., $H(X_1|X_2) \leq H(X_1)$), while this is not true for mutual information; that is, in some cases the conditional mutual information $I(X_1; X_2|X_3)$ can be greater than $I(X_1; X_2)$. This suggests that the conditional mutual information can capture information that extends beyond $X_1$ and $X_2$, incorporating higher-order effects with respect to $X_3$. Therefore, a better understanding of the conditional mutual information is required in order to refine the decomposition suggested by (9).
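A short numerical check makes the sign issue concrete. The sketch below uses illustrative helper names and computes the co-information as $I(X_1;X_2) - I(X_1;X_2|X_3)$ in bits for two extreme three-variable systems.

```python
from itertools import product
from math import log2

def entropy(pmf, idx):
    """Shannon entropy (bits) of the marginal of `pmf` on the coordinates `idx`."""
    marg = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def cond_mi(pmf, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C); with c=() this is I(A;B)."""
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))

def co_information(pmf):
    """I(X1;X2;X3) = I(X1;X2) - I(X1;X2|X3)."""
    return cond_mi(pmf, (0,), (1,)) - cond_mi(pmf, (0,), (1,), (2,))

xor = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}
copies = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

print("xor:", cond_mi(xor, (0,), (1,)), cond_mi(xor, (0,), (1,), (2,)), co_information(xor))
print("copies:", co_information(copies))
```

For the XOR triple, conditioning raises the mutual information from 0 to 1 bit, so the co-information equals −1 bit; for three identical bits it equals +1 bit.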

D. Synergistic information
An extended treatment of the conditional mutual information and its relationship with the mutual information decomposition can be found in [33], [34]. For presenting these ideas, let us consider two random variables $X_1$ and $X_2$ which are used to predict $Y$. The total predictability, i.e., the part of the randomness of $Y$ that can be predicted by $X_1$ and $X_2$, can be expressed using the chain rule of the mutual information as
$$I(X_1 X_2; Y) = I(X_1; Y) + I(X_2; Y|X_1). \tag{12}$$
It is natural to think that the predictability provided by $X_1$, which is given by the term $I(X_1; Y)$, can be either unique or redundant with respect to the information provided by $X_2$. On the other hand, due to (12) it is clear that the unique predictability contributed by $X_2$ must be contained in $I(X_2; Y|X_1)$. However, the fact that $I(X_2; Y|X_1)$ can be larger than $I(X_2; Y)$, while the latter contains both the unique and redundant contributions of $X_2$, suggests that there can be an additional predictability that is accounted for only by the conditional mutual information.
Following this rationale, we denote as synergistic predictability the part of the conditional mutual information that corresponds to evidence about the target that is not contained in any single predictor, but is only revealed when both are known. As an example of this, consider again the case in which $X_1$ and $X_2$ are independent random bits and $Y = X_1 \oplus X_2$. Then, it can be seen that
$$I(X_1; Y) = I(X_2; Y) = 0 \quad \text{but} \quad I(X_1 X_2; Y) = I(X_2; Y|X_1) = 1 \text{ bit}. \tag{13}$$
Further discussions about the notion of information synergy can be found in [11], [35]-[37].

III. A NON-NEGATIVE JOINT ENTROPY DECOMPOSITION
Following the discussion presented in Section II-B, we search for a decomposition of the joint entropy that reflects the private, common and synergistic modes of information sharing. In this way, we want the decomposition to distinguish information that is shared only by a few variables from information that is accessible from the entire system.
Our framework is based on distinguishing the directed notion of predictability from the undirected one of information. It is to be noted that there is an ongoing debate about the best way of characterizing and computing the predictability in arbitrary systems, as the commonly used axioms are not enough for specifying a unique formula that satisfies them [35]. Nevertheless, our approach is to explore how far one can reach based on an axiomatic approach. In this way, our results are going to be consistent with any choice of formula that is consistent with the discussed axioms.
In the following, Sections III-A, III-B and III-C discuss the basic features of predictability and information. After these necessary preliminaries, Section III-D finally presents our joint entropy decomposition for discrete and continuous variables.

A. Predictability axioms
Let us consider two variables $X_1$ and $X_2$ that are used to predict a target variable $Y := X_3$. Intuitively, $I(X_1; Y)$ quantifies the predictability of $Y$ that is provided by $X_1$. In the following, we want to find a function $R(X_1 X_2 \to Y)$ that measures the redundant predictability provided by $X_1$ with respect to the predictability provided by $X_2$, and a function $U(X_1 \to Y|X_2)$ that measures the unique predictability that is provided by $X_1$ but not by $X_2$. Following [33], we first determine a number of desired properties that these functions should have.
Definition: A predictability decomposition is defined by the real-valued functions $R(X_1 X_2 \to Y)$ and $U(X_1 \to Y|X_2)$ over the distributions of $(X_1, Y)$ and $(X_2, Y)$, which satisfy the following axioms:
(1) Non-negativity: $R(X_1 X_2 \to Y) \geq 0$ and $U(X_1 \to Y|X_2) \geq 0$.
(2) $I(X_1; Y) = R(X_1 X_2 \to Y) + U(X_1 \to Y|X_2)$.
(3) $I(X_1 X_2; Y) \geq R(X_1 X_2 \to Y) + U(X_1 \to Y|X_2) + U(X_2 \to Y|X_1)$.
(4) Weak symmetry: $R(X_1 X_2 \to Y) = R(X_2 X_1 \to Y)$.
Above, Axiom (3) states that the sum of the redundant and corresponding unique predictabilities given by each variable cannot be larger than the total predictability. Axiom (4) states that the redundancy is independent of the ordering of the predictors. The following Lemma determines the bounds for the redundant predictability (the proof is given in Appendix A).
Lemma 1: The redundant predictability of any predictability decomposition satisfies $\min\{I(X_1; Y), I(X_2; Y)\} \geq R(X_1 X_2 \to Y) \geq [I(X_1; X_2; Y)]^+$, where $[a]^+ := \max\{a, 0\}$.
In principle, the notion of redundant predictability takes the point of view of the target variable and measures the parts that can be predicted by both $X_1$ and $X_2$ when they are used by themselves, i.e., without combining them with each other. It is appealing to think that there should exist a unique function that provides such a measure. Nevertheless, these axioms define only very basic properties that a measure of redundant predictability should satisfy, and hence in general they are not enough for defining a unique function. In fact, a number of different predictability decompositions have been proposed in the literature [35], [36], [38], [39].
Corollary 2: The functions $R(X_1 X_2 \to Y) = \min\{I(X_1; Y), I(X_2; Y)\}$ (14) and $U(X_1 \to Y|X_2) = I(X_1; Y) - R(X_1 X_2 \to Y)$ define a predictability decomposition.
It is to be noted that, from all the candidates that are compatible with the Axioms, the decomposition given in Corollary 2 gives the largest possible redundant predictability measure. It is clear that in some cases this measure gives an over-estimate of the redundant predictability given by $X_1$ and $X_2$; for an example of this consider $X_1$ and $X_2$ to be independent variables and $Y = (X_1, X_2)$. Nevertheless, (14) has been proposed as an adequate measure for the redundant predictability of multivariate Gaussians [39] (for a corresponding discussion see Section VI).
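The over-estimation can be made explicit for independent fair bits $X_1$, $X_2$ and $Y = (X_1, X_2)$. The sketch below assumes that the redundancy measure under discussion is the minimum of the two single-predictor mutual informations, with the unique predictability obtained as the remainder; variable names are illustrative and all quantities are in bits.

```python
from itertools import product
from math import log2

def entropy(pmf, idx):
    """Shannon entropy (bits) of the marginal of `pmf` on the coordinates `idx`."""
    marg = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def mi(pmf, a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return entropy(pmf, a) + entropy(pmf, b) - entropy(pmf, a + b)

# Coordinates: 0 -> X1, 1 -> X2, 2 -> Y = (X1, X2).
pmf = {(a, b, (a, b)): 0.25 for a, b in product((0, 1), repeat=2)}

i1 = mi(pmf, (0,), (2,))           # I(X1;Y)
i2 = mi(pmf, (1,), (2,))           # I(X2;Y)
i12 = mi(pmf, (0, 1), (2,))        # I(X1 X2;Y)
i_inputs = mi(pmf, (0,), (1,))     # I(X1;X2)

r_min = min(i1, i2)                # assumed redundancy measure (minimum rule)
u1, u2 = i1 - r_min, i2 - r_min    # unique parts implied by I(Xi;Y) = R + U
leftover = i12 - (r_min + u1 + u2)
print(i_inputs, r_min, u1, u2, leftover)
```

Although the predictors are independent, the minimum rule reports 1 bit of redundancy and no unique predictability, leaving 1 bit of the total predictability unassigned.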

B. Shared, private and synergistic information
Let us now introduce an additional axiom, which will form the basis for our proposed information decomposition.
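Definition: A symmetrical information decomposition is given by the real-valued functions $I_\cap(X_1; X_2; X_3)$ and $I_{priv}(X_1; X_2|X_3)$ over the marginal distributions of $(X_1, X_2)$, $(X_1, X_3)$ and $(X_2, X_3)$, which satisfy Axioms (1)-(4) with $I_\cap(X_1; X_2; X_3)$ in the role of $R(X_1 X_2 \to X_3)$ and $I_{priv}(X_1; X_2|X_3)$ in the role of $U(X_1 \to X_2|X_3)$, while also satisfying the following property:
(5) Weak symmetry II: $I_{priv}(X_1; X_2|X_3) = I_{priv}(X_2; X_1|X_3)$.
The corresponding synergistic information is $I_S(X_1; X_2; X_3) := I(X_1; X_2|X_3) - I_{priv}(X_1; X_2|X_3)$.
The role of Axiom (5) can be related to the role of the fifth of Euclid's postulates: while seemingly innocuous, its addition has strong consequences in the corresponding theory. The following Lemma explains why this decomposition is denoted as symmetrical, and also shows fundamental bounds for these information functions (the proof is presented in Appendix C).
Lemma 3: The functions that compose a symmetrical information decomposition satisfy the following properties:
(a) Strong symmetry: $I_\cap(X_1; X_2; X_3)$ and $I_S(X_1; X_2; X_3)$ are symmetric on their three arguments.
(b) Bounds: these quantities satisfy the following inequalities:
$$\min\{I(X_1; X_2), I(X_2; X_3), I(X_3; X_1)\} \geq I_\cap(X_1; X_2; X_3) \geq [I(X_1; X_2; X_3)]^+, \tag{15}$$
$$\min\{I(X_1; X_3), I(X_1; X_3|X_2)\} \geq I_{priv}(X_1; X_3|X_2) \geq 0, \tag{16}$$
$$\min\{I(X_1; X_2|X_3), I(X_2; X_3|X_1), I(X_3; X_1|X_2)\} \geq I_S(X_1; X_2; X_3) \geq [-I(X_1; X_2; X_3)]^+. \tag{17}$$
Note that the defined functions can be used to decompose the following mutual information:
$$I(X_1 X_2; X_3) = I_\cap(X_1; X_2; X_3) + I_{priv}(X_1; X_3|X_2) + I_{priv}(X_2; X_3|X_1) + I_S(X_1; X_2; X_3), \tag{18}$$
$$I(X_1; X_3) = I_\cap(X_1; X_2; X_3) + I_{priv}(X_1; X_3|X_2), \tag{19}$$
$$I(X_1; X_3|X_2) = I_{priv}(X_1; X_3|X_2) + I_S(X_1; X_2; X_3). \tag{20}$$
In contrast to a decomposition based on the predictability, these measures address properties of the system $(X_1, X_2, X_3)$ as a whole, without being dependent on how it is divided between target and predictor variables (for a parallelism with respect to the corresponding predictability measures, see Table II). Intuitively, $I_\cap(X_1; X_2; X_3)$ measures the shared information that is common to $X_1$, $X_2$ and $X_3$; $I_{priv}(X_1; X_3|X_2)$ quantifies the private information that is shared by $X_1$ and $X_3$ but not $X_2$, and $I_S(X_1; X_2; X_3)$ captures the synergistic information that exists between $(X_1, X_2, X_3)$. The latter is a non-intuitive mode of information sharing, whose nature we hope to clarify through the analysis of particular cases presented in Sections IV and VI.
Note also that the co-information can be expressed as
$$I(X_1; X_2; X_3) = I_\cap(X_1; X_2; X_3) - I_S(X_1; X_2; X_3). \tag{21}$$
Hence, a strictly positive (resp. negative) co-information is a sufficient, although not necessary, condition for the system to have a non-zero shared (resp. synergistic) information.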

C. Further properties of the symmetrical decomposition
At this point, it is important to clarify a fundamental distinction that we make between the notions of predictability and information. The predictability is intrinsically a directed notion, which is based on a distinction between predictors and the target variable. On the contrary, we use the term information to exclusively refer to intrinsic statistical properties of the whole system which do not rely on such distinction. The main difference between the two notions is that, in principle, the predictability only considers the predictable parts of the target, while the shared information also considers the joint statistics of the predictors. Although this distinction will be further developed when we address the case of Gaussian variables (cf. Section VI-C), let us for now present a simple example to help develop intuition about this issue.
Example: Define the following functions:
$$I_\cap(X_1; X_2; X_3) = \min\{I(X_1; X_2), I(X_2; X_3), I(X_3; X_1)\},$$
$$I_{priv}(X_1; X_2|X_3) = I(X_1; X_2) - I_\cap(X_1; X_2; X_3).$$
It is straightforward that these functions satisfy Axioms (1)-(5), and therefore constitute a symmetric information decomposition. In contrast to the decomposition given in Corollary 2, this can be seen to be strongly symmetric and also dependent on the three marginals $(X_1, X_2)$, $(X_2, X_3)$ and $(X_1, X_3)$.
In the following Lemma we generalize the previous construction; its simple proof is omitted.
Lemma 4: For a given predictability decomposition with functions $R(X_1 X_2 \to X_3)$ and $U(X_1 \to X_2|X_3)$, the functions provide a symmetrical information decomposition, which is called the canonical symmetrization of the predictability.
Corollary 5: There always exists at least one symmetric information decomposition.
Proof: This is a direct consequence of the previous Lemma and Corollary 2.
Maybe the most remarkable property of symmetrized information decompositions is that, in contrast to directed ones, they are uniquely determined by Axioms (1)-(5) for a number of interesting cases.
Theorem 6: The symmetric information decomposition is unique if the variables form a Markov chain or two of them are pairwise independent.
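Proof: Let us consider the upper and lower bound for $I_\cap$ given in (15), denoting them as $c_1 := \min\{I(X_1; X_2), I(X_2; X_3), I(X_3; X_1)\}$ and $c_2 := [I(X_1; X_2; X_3)]^+$. Using $I(X_i; X_j) - I(X_1; X_2; X_3) = I(X_i; X_j|X_k)$, their gap is
$$c_1 - c_2 = \min\{I(X_1; X_2), I(X_2; X_3), I(X_3; X_1), I(X_1; X_2|X_3), I(X_2; X_3|X_1), I(X_3; X_1|X_2)\}.$$
Therefore, the framework will provide a unique expression for the shared information if (at least) one of the above six terms is zero. These scenarios correspond either to Markov chains, where one conditional mutual information term is zero, or pairwise independent variables where one mutual information term vanishes.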
Pairwise independent variables and Markov chains are analyzed in Sections IV and V-A, respectively.
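The uniqueness condition can be explored numerically by evaluating the gap between the upper and lower bounds on the shared information. The sketch below assumes that these bounds are the smallest pairwise mutual information and the positive part of the co-information, respectively; the function names and the two example distributions (a chain of two binary symmetric channels and a correlated AND gate) are hypothetical illustrations.

```python
from itertools import product
from math import log2

def entropy(pmf, idx):
    """Shannon entropy (bits) of the marginal of `pmf` on the coordinates `idx`."""
    marg = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def cond_mi(pmf, a, b, c=()):
    """I(A;B|C); with c=() this is the plain mutual information I(A;B)."""
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))

def shared_info_bounds(pmf):
    """Return (upper, lower, smallest of the six MI / conditional MI terms)."""
    triples = [((0,), (1,), (2,)), ((1,), (2,), (0,)), ((2,), (0,), (1,))]
    mis = [cond_mi(pmf, a, b) for a, b, _ in triples]
    cmis = [cond_mi(pmf, a, b, c) for a, b, c in triples]
    co_info = mis[0] - cmis[0]
    return min(mis), max(co_info, 0.0), min(mis + cmis)

def bsc_chain(eps=0.1):
    """Markov chain X1 -> X2 -> X3 built from two binary symmetric channels."""
    pmf = {}
    for x1, n1, n2 in product((0, 1), repeat=3):
        p = 0.5 * (eps if n1 else 1 - eps) * (eps if n2 else 1 - eps)
        x = (x1, x1 ^ n1, x1 ^ n1 ^ n2)
        pmf[x] = pmf.get(x, 0.0) + p
    return pmf

# Correlated inputs with X3 = X1 and X2: neither Markov nor pairwise independent.
and_corr = {(0, 0, 0): 0.4, (0, 1, 0): 0.1, (1, 0, 0): 0.1, (1, 1, 1): 0.4}

for name, pmf in (("markov", bsc_chain()), ("and_corr", and_corr)):
    upper, lower, smallest = shared_info_bounds(pmf)
    print(name, "gap:", upper - lower, "smallest term:", smallest)
```

The gap vanishes for the Markov chain, so the shared information is pinned down there, while it stays strictly positive for the correlated AND gate, where the axioms leave a range of admissible values.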

D. Decomposition for the joint entropy of three variables
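Now we use the notions of redundant, private and synergistic information functions for developing a non-negative decomposition of the joint entropy, which is based on a non-negative decomposition of the DTC. For the case of three discrete variables, by applying (20) and (21) to (9), one finds that
$$DTC = I_{priv}(X_1; X_2|X_3) + I_{priv}(X_2; X_3|X_1) + I_{priv}(X_3; X_1|X_2) + I_\cap(X_1; X_2; X_3) + 2 I_S(X_1; X_2; X_3). \tag{28}$$
From (7) and (28), one can propose the following decomposition for the joint entropy:
$$H(X_1, X_2, X_3) = H_{(1)} + \Delta H_{(2)} + \Delta H_{(3)}, \tag{29}$$
where
$$H_{(1)} = H(X_1|X_2, X_3) + H(X_2|X_1, X_3) + H(X_3|X_1, X_2), \tag{30}$$
$$\Delta H_{(2)} = I_{priv}(X_1; X_2|X_3) + I_{priv}(X_2; X_3|X_1) + I_{priv}(X_3; X_1|X_2), \tag{31}$$
$$\Delta H_{(3)} = I_\cap(X_1; X_2; X_3) + 2 I_S(X_1; X_2; X_3). \tag{32}$$
In contrast to (9), here each term is non-negative because of Lemma 3 ⁶. Therefore, (29) yields a non-negative decomposition of the joint entropy, where each of the corresponding terms captures the information that is shared by one, two or three variables. Interestingly, $H_{(1)}$ and $\Delta H_{(2)}$ are homogeneous (being the sum of all the exclusive information or private information of the system) while $\Delta H_{(3)}$ is composed by a mixture of two different information sharing modes.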
An analogous decomposition can be developed for the case of continuous random variables. Nevertheless, as the differential entropy can be negative, not all the terms of the decomposition can be non-negative. In effect, following the same rationale that led to (29), the following decomposition can be found:
$$h(X_1, X_2, X_3) = h_{(1)} + \Delta H_{(2)} + \Delta H_{(3)}. \tag{33}$$
Above, $h(X)$ denotes the differential entropy of $X$, $\Delta H_{(2)}$ and $\Delta H_{(3)}$ are as defined in (31) and (32), and
$$h_{(1)} = h(X_1|X_2, X_3) + h(X_2|X_1, X_3) + h(X_3|X_1, X_2). \tag{34}$$
Hence, although both the joint entropy $h(X_1, X_2, X_3)$ and $h_{(1)}$ can be negative, the remaining terms conserve their non-negative condition.
It can be seen that the lowest layer of the decomposition is always trivial to compute, and hence the challenge is to find expressions for $\Delta H_{(2)}$ and $\Delta H_{(3)}$. In the rest of the paper, we will explore scenarios where these quantities can be characterized.
⁶ From (20), it can be seen that the co-information is sometimes negative for compensating the triple counting of the synergy due to the sum of the three conditional mutual information terms.

IV. PAIRWISE INDEPENDENT VARIABLES
In this section we focus on the case where two variables are pairwise independent while being globally connected by a third variable. The fact that pairwise independent variables can become correlated when additional information becomes available is known in the statistics literature as Berkson's paradox or selection bias [40], or as the explaining away effect in the context of artificial intelligence [41]. As an example of this phenomenon, consider $X_1$ and $X_2$ to be two independent standard Gaussian variables, and $X_3$ a binary variable that is equal to 1 if $X_1 + X_2 > 0$ and zero otherwise. Then, knowing that $X_3 = 1$ implies that $X_2 > -X_1$, and hence knowing the value of $X_1$ effectively reduces the uncertainty about $X_2$.
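The selection effect in the Gaussian example above can be illustrated with a small Monte Carlo experiment. This is a sketch only: the sample size and seed are arbitrary choices, the Pearson correlation is used as a simple proxy for dependence, and the function names are illustrative.

```python
import random
from math import sqrt

def pearson(pairs):
    """Sample Pearson correlation coefficient of a list of (a, b) pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs) / n
    var_b = sum((b - mean_b) ** 2 for _, b in pairs) / n
    return cov / sqrt(var_a * var_b)

def simulate(n=20000, seed=0):
    """Correlation of (X1, X2) over all samples and over those with X3 = 1."""
    rng = random.Random(seed)
    pairs = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
    selected = [(a, b) for a, b in pairs if a + b > 0]  # condition on X3 = 1
    return pearson(pairs), pearson(selected)

r_all, r_selected = simulate()
print("unconditional:", r_all, "given X3 = 1:", r_selected)
```

The unconditional correlation is close to zero, whereas conditioning on $X_3 = 1$ induces a clearly negative correlation between the two inputs.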
In our framework, Berkson's paradox can be understood as synergistic information that is introduced by the third component of the system. In fact, we will show that in this case the synergistic information function is unique and given by
$$I_S(X_1; X_2; X_3) = I(X_1; X_2|X_3), \tag{35}$$
which is, in fact, a measure of the dependencies between $X_1$ and $X_2$ that are created by $X_3$.
In the following, Section IV-A presents the unique symmetrized information decomposition for this case. Then, Section IV-B focuses on the particular case where X 3 is a function of the other two variables.

A. Uniqueness of the entropy decomposition
Let us assume that $X_1$ and $X_2$ are pairwise independent, and hence the joint p.d.f. of $X_1$, $X_2$ and $X_3$ has the following structure:
$$p_{X_1 X_2 X_3}(x_1, x_2, x_3) = p_{X_1}(x_1)\, p_{X_2}(x_2)\, p_{X_3|X_1 X_2}(x_3|x_1, x_2). \tag{36}$$
It is direct to see that in this case $p_{X_1 X_2} = p_{X_1} p_{X_2}$. Therefore, as $I(X_1; X_2) = 0$, it is direct from Axiom (1) that any redundant predictability function satisfies $R(X_1 X_3 \to X_2) = R(X_2 X_3 \to X_1) = 0$. However, the axioms are not enough to uniquely determine $R(X_1 X_2 \to X_3)$ ⁷. Nevertheless, the symmetrized decomposition is uniquely determined, as shown in the next Corollary, which is a consequence of Theorem 6.
⁷ Note that in this case $I(X_1; X_2; X_3) = -I(X_1; X_2|X_3) \leq 0$, so the only restriction that the bound presented in Lemma 3 provides is $\min\{I(X_1; X_3), I(X_2; X_3)\} \geq R(X_1 X_2 \to X_3) \geq 0$.

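Corollary 7: If $X_1$, $X_2$ and $X_3$ follow a p.d.f. as (36), then the shared, private and synergistic information functions are unique. They are given by
$$I_\cap(X_1; X_2; X_3) = I_{priv}(X_1; X_2|X_3) = 0,$$
$$I_{priv}(X_1; X_3|X_2) = I(X_1; X_3),$$
$$I_{priv}(X_2; X_3|X_1) = I(X_2; X_3),$$
$$I_S(X_1; X_2; X_3) = I(X_1; X_3|X_2) - I(X_1; X_3) = I(X_1; X_2|X_3).$$
Proof: The fact that there is no shared information follows directly from the upper bound presented in Lemma 3. Using this, the expressions for the private information can be found using Axiom (2). Finally, the synergistic information can be computed as
$$I_S(X_1; X_2; X_3) = I(X_1; X_3|X_2) - I_{priv}(X_1; X_3|X_2) = I(X_1; X_3|X_2) - I(X_1; X_3).$$
The second formula for the synergistic information can then be found using the fact that $I(X_1; X_2) = 0$, which gives $I(X_1; X_3|X_2) - I(X_1; X_3) = I(X_1; X_2|X_3) - I(X_1; X_2) = I(X_1; X_2|X_3)$.
With this corollary, the unique decomposition of the $DTC = \Delta H_{(2)} + \Delta H_{(3)}$ can be found to be
$$\Delta H_{(2)} = I(X_1; X_3) + I(X_2; X_3),$$
$$\Delta H_{(3)} = 2 I(X_1; X_2|X_3).$$
Note that the terms $\Delta H_{(2)}$ and $\Delta H_{(3)}$ can be bounded as follows:
$$\Delta H_{(2)} \leq \min\{H(X_1), H(X_3)\} + \min\{H(X_2), H(X_3)\}, \tag{44}$$
$$\Delta H_{(3)} \leq 2 \min\{H(X_1|X_3), H(X_2|X_3)\}. \tag{45}$$
The bound for $\Delta H_{(2)}$ follows from the basic fact that $I(X; Y) \leq \min\{H(X), H(Y)\}$. The second bound follows from
$$I(X; Y|Z) = \sum_z p_Z(z)\, I(X; Y|Z = z) \leq \sum_z p_Z(z) \min\{H(X|Z = z), H(Y|Z = z)\} \leq \min\Big\{ \sum_z p_Z(z) H(X|Z = z), \sum_z p_Z(z) H(Y|Z = z) \Big\} = \min\{H(X|Z), H(Y|Z)\}.$$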
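As a worked illustration of the pairwise independent case, the sketch below evaluates the three layers of the entropy decomposition for independent fair bits with $X_3 = X_1 \wedge X_2$. It assumes that, in this setting, the shared information vanishes, the private terms reduce to $I(X_1;X_3)$ and $I(X_2;X_3)$, and the synergy equals $I(X_1;X_2|X_3)$; helper names are illustrative and entropies are in bits.

```python
from itertools import product
from math import log2

def entropy(pmf, idx):
    """Shannon entropy (bits) of the marginal of `pmf` on the coordinates `idx`."""
    marg = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def cond_mi(pmf, a, b, c=()):
    """I(A;B|C); with c=() this is the plain mutual information I(A;B)."""
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))

# Independent fair inputs and X3 = X1 AND X2.
and_gate = {(a, b, a & b): 0.25 for a, b in product((0, 1), repeat=2)}
full = (0, 1, 2)

h_joint = entropy(and_gate, full)
# Exclusive information: sum_j H(X_j | other two variables).
h_1 = sum(h_joint - entropy(and_gate, tuple(i for i in full if i != j)) for j in full)
# Private information: I(X1;X3) + I(X2;X3).
dh_2 = cond_mi(and_gate, (0,), (2,)) + cond_mi(and_gate, (1,), (2,))
# Twice the synergy I(X1;X2|X3); no shared information in this case.
dh_3 = 2 * cond_mi(and_gate, (0,), (1,), (2,))

print(h_joint, h_1, dh_2, dh_3)
```

The three layers add up to the joint entropy of 2 bits, with 1 bit of exclusive information and the rest split between private and synergistic terms.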

B. Functions of independent arguments
Let us focus in this section on the special case where $X_3 = F(X_1, X_2)$ is a function of two independent random inputs, and study its corresponding entropy decomposition. We will consider $X_1$ and $X_2$ as the inputs and $F(X_1, X_2)$ as the output. Although this scenario fits nicely in the predictability framework, it can also be studied from the shared information framework's perspective. Our goal is to understand how $F$ affects the information sharing structure.
As $H(X_3|X_1,X_2) = 0$, we have
$$H(1) = H(X_1|X_2X_3) + H(X_2|X_1X_3).$$
The term H(1) hence measures the information of the inputs that is not reflected by the output.
The term ∆H (2) measures how much of F can be predicted with knowledge that comes from one of the inputs but not from the other. If ∆H (2) is large then F is not "mixing" the inputs too much, in the sense that each of them is by itself able to provide relevant information that is not given also by the other. In fact, a maximal value of ∆H (2) is given by where H (1) = ∆H (3) = 0 and the bound provided in (44) is attained.
Finally, due to (43), there is no shared information and hence ∆H when both input variables are uniformly distributed.
Proof: Using the same rationale as in (49), it can be shown that if $F$ is an arbitrary function then where the last inequality follows from the fact that both inputs are restricted to alphabets of size $K$.
Now, consider F * to be the function given in (51) and assume that X 1 and X 2 are uniformly distributed. It can be seen that for each z ∈ K there exist exactly K ordered pairs of inputs (x 1 , x 2 ) such that F * (x 1 , x 2 ) = z, which define a bijection from K to K. Therefore, and hence showing that the upper bound presented in (55) is attained.

Corollary 9:
The XOR logic gate generates the largest amount of synergistic information possible for the case of binary inputs.
The synergistic nature of the addition over finite fields helps to explain the central role it has in various fields. In cryptography, the one-time-pad [42] is an encryption technique that uses finite-field additions for creating a synergistic interdependency between a private message, a public signal and a secret key. This interdependency is completely destroyed when the key is not known, ensuring no information leakage to unintended receivers [43]. Also, in network coding [44], [45], nodes in the network use linear combinations of their received data packets to create and transmit synergistic combinations of the corresponding information messages. This technique has been shown to achieve the multicast capacity in wired communication networks [45] and has also been used to increase the throughput of wireless systems [46].
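The synergy created by finite-field addition, and its role in the one-time pad, can be checked numerically. The sketch below assumes independent, uniformly distributed inputs and uses the expression $I_S = I(X_1;X_2|X_3)$ that holds for independent inputs; the helper names and the choice $K = 5$ are illustrative assumptions.

```python
from itertools import product
from math import log2

def entropy(pmf, idx):
    """Entropy (bits) of the marginal of `pmf` on the coordinates listed in `idx`."""
    marg = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def cmi(pmf, a, b, c=()):
    """Conditional mutual information I(A;B|C) in bits; a, b, c are index tuples."""
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))

def synergy_of(f, K):
    """I(X1;X2|X3) for X3 = f(X1, X2) with independent uniform inputs on {0, ..., K-1}."""
    pmf = {(a, b, f(a, b)): 1.0 / K**2 for a, b in product(range(K), repeat=2)}
    return cmi(pmf, (0,), (1,), (2,))

xor_syn = synergy_of(lambda a, b: a ^ b, 2)          # XOR gate (addition modulo 2)
mod5_syn = synergy_of(lambda a, b: (a + b) % 5, 5)   # addition modulo K = 5

# One-time pad: message (index 0), key (index 1), ciphertext = message XOR key (index 2).
otp = {(m, k, m ^ k): 0.25 for m, k in product((0, 1), repeat=2)}
leak_without_key = cmi(otp, (0,), (2,))       # I(message; ciphertext)
info_with_key = cmi(otp, (0,), (2,), (1,))    # I(message; ciphertext | key)

print(xor_syn, mod5_syn, leak_without_key, info_with_key)
```

Under these assumptions the modular sum yields $\log_2 K$ bits of synergy, while the ciphertext alone reveals nothing about the message and the full message bit is recovered once the key is known.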

V. DISCRETE PAIRWISE MAXIMUM ENTROPY DISTRIBUTIONS AND MARKOV CHAINS
This section studies the case where the system's variables follow a pairwise maximum entropy (PME) distribution. These distributions are of great importance in statistical physics and machine learning communities, where they are studied under the names of Gibbs distributions [47] or Markov random fields [48].
Concretely, let us consider three pairwise marginal distributions $p_{X_1X_2}$, $p_{X_2X_3}$ and $p_{X_1X_3}$ for the discrete variables $X_1$, $X_2$ and $X_3$. Let us denote as $Q$ the set of all the joint p.d.f.s over $(X_1, X_2, X_3)$ that have those as their pairwise marginal distributions. Then, the corresponding PME distribution is given by the joint p.d.f. $\tilde{p}_X(x_1, x_2, x_3)$ that satisfies
$$\tilde{p}_X = \arg\max_{p \in Q} H(p).$$
For the case of binary variables (i.e. $X_j \in \{0, 1\}$), the PME distribution is given by an Ising distribution [49]:
$$\tilde{p}_X(X) = \frac{e^{-E(X)}}{Z}, \tag{59}$$
where $Z$ is a normalization constant and $E(X)$ is an energy function given by $E(X) = \sum_i J_i X_i + \sum_j \sum_{k \neq j} J_{j,k} X_j X_k$, with $J_{j,k}$ being the coupling terms. In effect, if $J_{j,k} = 0$ for all $j$ and $k$, then $\tilde{p}_X(X)$ can be factorized as the product of the unary-marginal p.d.f.s.
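The factorization property of the Ising form can be illustrated with a short sketch. It assumes the Boltzmann sign convention $p(x) \propto e^{-E(x)}$; the function names and the numeric values of the coefficients are hypothetical choices made only for this example.

```python
from itertools import product
from math import exp

def ising_pmf(J1, J2):
    """Ising p.m.f. on {0,1}^3 with p(x) = exp(-E(x)) / Z (assumed sign convention).
    J1[i] are the unary terms; J2[(j, k)] are couplings, one entry per listed pair."""
    def energy(x):
        e = sum(J1[i] * x[i] for i in range(3))
        e += sum(J * x[j] * x[k] for (j, k), J in J2.items())
        return e
    weights = {x: exp(-energy(x)) for x in product((0, 1), repeat=3)}
    Z = sum(weights.values())
    return {x: w / Z for x, w in weights.items()}

def factorization_gap(pmf):
    """Largest deviation between p(x) and the product of its unary marginals."""
    q = [sum(p for x, p in pmf.items() if x[i] == 1) for i in range(3)]
    gap = 0.0
    for x, p in pmf.items():
        prod = 1.0
        for i in range(3):
            prod *= q[i] if x[i] == 1 else 1.0 - q[i]
        gap = max(gap, abs(p - prod))
    return gap

gap_uncoupled = factorization_gap(ising_pmf([0.5, -1.0, 0.2], {}))
gap_coupled = factorization_gap(ising_pmf([0.5, -1.0, 0.2], {(0, 1): -2.0}))
print(gap_uncoupled, gap_coupled)
```

With all couplings set to zero the gap vanishes up to rounding, whereas a single non-zero coupling already breaks the factorization.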
In the context of the framework discussed in Section II-A, a PME system has TC = ∆H (2) while ∆H (3) = 0. In contrast, Section V-A studies these systems under the light of the decomposition of the DTC presented in Section III-D. Then, Section V-B specifies the analysis for the particular case of Markov chains.

A. Synergy minimization
It is tempting to associate the synergistic information with that which is only in the joint p.d.f. but not in the pairwise marginals, i.e. with ∆H (3) . However, the following result states that there can exist some synergy defined by the pairwise marginals themselves.
Theorem 10: PME distributions have the minimum amount of synergistic information that is allowed by their pairwise marginals.
Proof: Note that Therefore, maximizing the joint entropy for fixed pairwise marginals is equivalent to minimizing the synergistic information. Note that the last equality follows from the fact that $I_{\text{priv}}(X_2;X_3|X_1)$ by definition only depends on the pairwise marginals.
Corollary 11: For an arbitrary system $(X_1, X_2, X_3)$, the synergistic information can be decomposed as
$$I_S(X_1;X_2;X_3) = I_S^{\text{PME}} + \Delta H(3),$$
where ∆H(3) is as defined in (4) and $I_S^{\text{PME}} = \min_{p \in Q} I_S(X_1;X_2;X_3)$ is the synergistic information of the corresponding PME distribution.
Proof: This can be proven noting that, for an arbitrary p.d.f. $p_{X_1X_2X_3}$, it can be seen that
$$\Delta H(3) = \max_{p \in Q} H(X_1X_2X_3) - H(X_1X_2X_3) = I_S(X_1;X_2;X_3) - I_S^{\text{PME}}.$$
Above, the first equality corresponds to the definition of ∆H(3) and the second equality comes from using (62) on each joint entropy term and noting that only the synergistic information depends on more than the pairwise marginals.
The previous corollary shows that ∆H(3) measures only one part of the information synergy of a system, the part that can be removed without altering the pairwise marginals. Note that PME systems with non-zero synergy are easy to find. For example, consider $X_1$ and $X_2$ to be two independent equiprobable bits, and $X_3 = X_1$ AND $X_2$. It can be shown that for this case one has ∆H(3) = 0 [16]. On the other hand, as the inputs are independent the synergy can be computed using (40), and therefore a direct calculation shows that
$$I_S(X_1;X_2;X_3) = I(X_1;X_2|X_3) = \tfrac{3}{4}\log_2 3 - 1 \approx 0.189 \text{ bits}.$$
From the previous discussion, one can conclude that only a special class of pairwise distributions $p_{X_1X_2}$, $p_{X_1X_3}$ and $p_{X_2X_3}$ are compatible with having null synergistic information in the system. This is a remarkable result, as the synergistic information is usually considered to be an effect purely related to high-order marginals. It would be interesting to have an expression for the minimal information synergy that a set of pairwise distributions requires, or equivalently, a symmetrized information decomposition for PME distributions. A particular case that allows a unique solution is discussed in the next section.
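The synergy of the AND example can be verified with a direct calculation. The sketch below uses the expression $I_S = I(X_1;X_2|X_3)$ valid for independent inputs; the helper names are assumptions made for this illustration.

```python
from itertools import product
from math import log2

def entropy(pmf, idx):
    """Entropy (bits) of the marginal of `pmf` on the coordinates listed in `idx`."""
    marg = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def cmi(pmf, a, b, c=()):
    """Conditional mutual information I(A;B|C) in bits; a, b, c are index tuples."""
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))

# Two independent equiprobable bits and their logical AND.
pmf = {(a, b, a & b): 0.25 for a, b in product((0, 1), repeat=2)}

inputs_mi = cmi(pmf, (0,), (1,))          # I(X1;X2): the inputs are independent
and_synergy = cmi(pmf, (0,), (1,), (2,))  # I(X1;X2|X3)
closed_form = 0.75 * log2(3) - 1.0        # (3/4) log2(3) - 1

print(f"I(X1;X2) = {abs(inputs_mi):.3f} bits, synergy = {and_synergy:.3f} bits")
```

The conditional dependence appears only when $X_3 = 0$, which happens with probability 3/4 and leaves three equally likely input pairs; this is where the factor 3/4 in the closed form comes from.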

B. Markov chains
Markov chains maximize the joint entropy subject to constraints on only two of the three pairwise distributions. In effect, following the same rationale as in the proof of Theorem 10, it can be shown that
$$H(X_1X_2X_3) = H(X_1X_2) + H(X_3|X_2) - I(X_1;X_3|X_2).$$
Then, for fixed pairwise distributions $p_{X_1X_2}$ and $p_{X_2X_3}$, maximizing the joint entropy is equivalent to minimizing the conditional mutual information. Moreover, the maximal entropy is attained by the p.d.f. that makes $I(X_1;X_3|X_2) = 0$, which is precisely the Markov chain $X_1 - X_2 - X_3$. For the binary case, it can be shown that a Markov chain corresponds to an Ising distribution like (59), where the interaction term $J_{1,3}$ is equal to zero.
Theorem 6 showed that the symmetric information decomposition for Markov chains is unique.
We develop this decomposition in the following corollary.
Corollary 12: If $X_1 - X_2 - X_3$ is a Markov chain, then their unique shared, private and synergistic information functions are given by
$$I_\cap(X_1;X_2;X_3) = I(X_1;X_3),$$
$$I_{\text{priv}}(X_1;X_2|X_3) = I(X_1;X_2) - I(X_1;X_3),$$
$$I_{\text{priv}}(X_2;X_3|X_1) = I(X_2;X_3) - I(X_1;X_3),$$
$$I_{\text{priv}}(X_1;X_3|X_2) = 0,$$
$$I_S(X_1;X_2;X_3) = 0.$$
In particular, Markov chains have no synergistic information.
Proof: For this case one can show that
$$\min\{I(X_1;X_2), I(X_2;X_3), I(X_1;X_3)\} = I(X_1;X_3) = I(X_1;X_2;X_3),$$
where the first equality is a consequence of the data processing inequality, and the second of the fact that $I(X_1;X_3|X_2) = 0$. The above equality shows that the bounds for the shared information presented in Lemma 3 give the unique solution $I_\cap(X_1;X_2;X_3) = I(X_1;X_3)$. All the other equalities follow from this fact and their definition.
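The statement of Corollary 12 can be checked on a simple Markov chain. The sketch below assumes a uniform bit sent through two cascaded binary symmetric channels; the crossover probabilities 0.1 and 0.2 and all helper names are hypothetical choices for illustration.

```python
from itertools import product
from math import log2

def entropy(pmf, idx):
    """Entropy (bits) of the marginal of `pmf` on the coordinates listed in `idx`."""
    marg = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def cmi(pmf, a, b, c=()):
    """Conditional mutual information I(A;B|C) in bits; a, b, c are index tuples."""
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))

def bsc(eps):
    """Transition probability P(y|x) of a binary symmetric channel with crossover eps."""
    return lambda y, x: 1.0 - eps if y == x else eps

# Markov chain X1 - X2 - X3: uniform X1, then two cascaded binary symmetric channels.
ch12, ch23 = bsc(0.1), bsc(0.2)
pmf = {(a, b, c): 0.5 * ch12(b, a) * ch23(c, b)
       for a, b, c in product((0, 1), repeat=3)}

X1, X2, X3 = (0,), (1,), (2,)
I12, I23, I13 = cmi(pmf, X1, X2), cmi(pmf, X2, X3), cmi(pmf, X1, X3)
shared = I13                                      # I_cap = I(X1;X3)
priv_12 = I12 - shared                            # I_priv(X1;X2|X3)
priv_23 = I23 - shared                            # I_priv(X2;X3|X1)
synergy = cmi(pmf, X1, X3, X2) - I13 + shared     # I(X1;X3|X2) - I(X1;X3) + I_cap

print(f"shared {shared:.4f}, private {priv_12:.4f} / {priv_23:.4f}, synergy {abs(synergy):.4f}")
```

The synergy vanishes because $I(X_1;X_3|X_2) = 0$, and the shared term equals the mutual information across the end-to-end channel, whose crossover probability is $0.1 \cdot 0.8 + 0.9 \cdot 0.2 = 0.26$.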
Using this corollary, the unique decomposition of the DTC = ∆H(2) + ∆H(3) for Markov chains is given by
$$\Delta H(2) = I(X_1;X_2) + I(X_2;X_3) - 2\, I(X_1;X_3), \qquad \Delta H(3) = I(X_1;X_3).$$
Hence, Corollary 12 states that a sufficient condition for three pairwise marginals to be compatible with zero information synergy is for them to satisfy the Markov condition $p_{X_3|X_1} = \sum_{x_2} p_{X_3|X_2}\, p_{X_2|X_1}$. The question of finding a necessary condition is an open problem, intrinsically linked with the problem of finding a good definition for the shared information for arbitrary PME distributions.
To conclude, let us note an interesting duality that exists between Markov chains and the case where two variables are pairwise independent, which is illustrated in Table III.

VI. ENTROPY DECOMPOSITION FOR THE GAUSSIAN CASE
In this section we study the entropy decomposition for the case where $(X_1, X_2, X_3)$ follow a multivariate Gaussian distribution. As the entropy is not affected by translation, we assume, without loss of generality, that all the variables have zero mean. The covariance matrix is denoted as
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \alpha\sigma_1\sigma_2 & \beta\sigma_1\sigma_3 \\ \alpha\sigma_1\sigma_2 & \sigma_2^2 & \gamma\sigma_2\sigma_3 \\ \beta\sigma_1\sigma_3 & \gamma\sigma_2\sigma_3 & \sigma_3^2 \end{pmatrix}, \tag{76}$$
where $\sigma_i^2$ is the variance of $X_i$, $\alpha$ is the correlation between $X_1$ and $X_2$, $\beta$ is the correlation between $X_1$ and $X_3$ and $\gamma$ is the correlation between $X_2$ and $X_3$. The condition that the matrix $\Sigma$ should be positive semi-definite yields the following condition:
$$1 + 2\alpha\beta\gamma - \alpha^2 - \beta^2 - \gamma^2 \ge 0. \tag{77}$$
Unfortunately, Theorem 6 implicitly states that Axioms (1)-(5) do not define a unique symmetrical information decomposition for Gaussian variables with an arbitrary covariance matrix.
Nevertheless, there are some interesting properties of their shared and synergistic information, which are discussed in Sections VI-A and VI-B. Then, Section VI-C presents one symmetrical information decomposition that is consistent with these properties.

A. Understanding the synergistic information between Gaussians
The simple structure of the joint p.d.f. of multivariate Gaussians, which is fully determined by mere second order statistics, might lead one to think that these systems do not have synergistic information sharing. However, it can be shown that a multivariate Gaussian is the maximum entropy distribution for a given covariance matrix $\Sigma$. Hence, the discussion provided in Section V-A suggests that these distributions can indeed have non-zero information synergy, depending on the structure of the pairwise distributions, or equivalently, on the properties of $\Sigma$.
Moreover, it has been reported that synergistic phenomena are rather common among multivariate Gaussian variables [39]. As a simple example, consider
$$X_1 = A + B, \qquad X_2 = B, \qquad X_3 = A,$$
where $A$ and $B$ are independent Gaussians. Intuitively, it can be seen that although $X_2$ is useless by itself for predicting $X_3$, it can be used jointly with $X_1$ to remove the noise term $B$ and provide a perfect prediction. For refining this observation, let us consider a more general example where the variables have equal variances and $X_2$ and $X_3$ are independent (i.e. $\gamma = 0$). Then, the optimal predictor of $X_3$ given $X_1$ is $\hat{X}_3^{X_1} = \beta X_1$, the optimal predictor given $X_2$ is $\hat{X}_3^{X_2} = 0$, and the optimal predictor given both $X_1$ and $X_2$ is [50]
$$\hat{X}_3^{X_1X_2} = \frac{\beta}{1 - \alpha^2}\,(X_1 - \alpha X_2).$$
Therefore, although $X_2$ is useless to predict $X_3$ by itself, it can be used for further improving the prediction given by $X_1$. Hence, all the information provided by $X_2$ is synergistic, as it is useful only when combined with the information provided by $X_1$. Note that all these examples fall in the category of the systems considered in Section IV.
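These predictors can be reproduced from second-order statistics alone, since for jointly Gaussian variables the optimal (minimum mean-square error) predictor is linear. The sketch below solves the 2x2 normal equations in closed form; the function name `best_linear_predictor` and the numeric values are illustrative assumptions, with the correlations labeled as in the covariance matrix of this section ($\alpha$ between $X_1$ and $X_2$, $\beta$ between $X_1$ and $X_3$, $\gamma$ between $X_2$ and $X_3$).

```python
def best_linear_predictor(c11, c12, c22, c13, c23, c33):
    """Weights (w1, w2) minimizing E[(X3 - w1*X1 - w2*X2)^2] and the resulting MSE,
    given the second moments c_ij = E[X_i X_j] of zero-mean variables."""
    det = c11 * c22 - c12 ** 2            # determinant of the covariance of (X1, X2)
    w1 = (c22 * c13 - c12 * c23) / det
    w2 = (c11 * c23 - c12 * c13) / det
    mse = c33 - w1 * c13 - w2 * c23
    return w1, w2, mse

# Example 1: X1 = A + B, X2 = B, X3 = A with independent unit-variance A and B.
w1, w2, mse = best_linear_predictor(2.0, 1.0, 1.0, 1.0, 0.0, 1.0)
print(w1, w2, mse)   # the weights combine to X1 - X2, which recovers A without error

# Example 2: unit variances with alpha = 0.6, beta = 0.5 and gamma = 0.
alpha, beta = 0.6, 0.5
v1, v2, mse_joint = best_linear_predictor(1.0, alpha, 1.0, beta, 0.0, 1.0)
mse_x1_only = 1.0 - beta ** 2   # error of the single-variable predictor beta * X1
print(v1, v2, mse_joint, mse_x1_only)
```

In the second example $X_2$ is uncorrelated with the target, yet including it lowers the prediction error from $1 - \beta^2$ to $1 - \beta^2/(1 - \alpha^2)$.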

B. Understanding the shared information
Let us start by studying the information shared between two Gaussians. For this, let us consider a pair of zero-mean variables $(X_1, X_2)$ with unit variance and correlation $\alpha$. A suggestive way of expressing these variables is given by
$$X_1 = W_1 \pm W_{12}, \qquad X_2 = W_2 \pm W_{12}, \tag{80}$$
where $W_1$, $W_2$ and $W_{12}$ are independent centered Gaussian variables with variances $s_1^2 = s_2^2 = 1 - |\alpha|$ and $s_{12}^2 = |\alpha|$, respectively. Note that the signs in (80) can be set in order to achieve any desired sign for the covariance (as $E\{X_1X_2\} = \pm E\{W_{12}^2\} = \pm s_{12}^2$). The mutual information is given by (see Appendix D)
$$I(X_1;X_2) = -\tfrac{1}{2}\log(1 - \alpha^2) = -\tfrac{1}{2}\log(1 - s_{12}^4), \tag{81}$$
showing that it is directly related to the variance of the common term $W_{12}$.
For studying the shared information between three Gaussian variables, let us start by considering a case where $\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = 1$, $\alpha = \beta := \rho$ and $\gamma = 0$. It can be seen that (c.f. Appendix D)
$$I(X_1;X_2;X_3) = \tfrac{1}{2}\log\frac{1 - 2\rho^2}{(1 - \rho^2)^2}. \tag{82}$$
A direct evaluation shows that (82) is non-positive$^{8}$ for all $\rho$ with $|\rho| < 1/\sqrt{2}$ (note that $|\rho|$ cannot be larger than $1/\sqrt{2}$ because of condition (77)). Therefore, following the discussion related to A direct evaluation shows that, in contrast to (82), the co-information in this case is non-negative, showing that the system is dominated by shared information for all $\rho \neq 0$.
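The sign of the co-information in the case $\alpha = \beta = \rho$, $\gamma = 0$ can be checked numerically from the standard Gaussian expressions $I(X_1;X_2) = -\tfrac{1}{2}\log(1-\alpha^2)$ and $I(X_1;X_2|X_3) = \tfrac{1}{2}\log\frac{(1-\beta^2)(1-\gamma^2)}{\det}$, where $\det$ denotes the determinant of the unit-variance covariance matrix. The function name and the sample values of $\rho$ in this sketch are assumptions made for illustration.

```python
from math import log2

def gaussian_coinformation(alpha, beta, gamma):
    """Co-information I(X1;X2) - I(X1;X2|X3), in bits, for unit-variance jointly
    Gaussian variables with correlations alpha (X1,X2), beta (X1,X3), gamma (X2,X3)."""
    det = 1 + 2 * alpha * beta * gamma - alpha**2 - beta**2 - gamma**2
    if det <= 0:
        raise ValueError("correlations do not define a positive definite matrix")
    i12 = -0.5 * log2(1 - alpha**2)
    i12_given_3 = 0.5 * log2((1 - beta**2) * (1 - gamma**2) / det)
    return i12 - i12_given_3

rhos = [0.1, 0.3, 0.5, 0.7]     # all below 1/sqrt(2), roughly 0.707
values = [gaussian_coinformation(r, r, 0.0) for r in rhos]
for r, v in zip(rhos, values):
    print(f"rho = {r:.1f}: co-information = {v:.4f} bits")
```

All sampled values are negative, which is consistent with a system in which $X_2$ and $X_3$ are independent but become dependent once $X_1$ is known.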
The previous discussion suggests that the shared information depends on the smallest of the correlation coefficients. An interesting approach to understand this fact can be found in [39], where the predictability among Gaussians is discussed. In this work, the authors note that from the point of view of $X_3$ both $X_1$ and $X_2$ are able to decompose the target into a predictable and an unpredictable portion: $X_3 = \hat{X}_3 + E$. In this sense, both predictors achieve the same effect although with a different efficiency, which is determined by their correlation coefficient.
As a consequence of this, the predictor that is less correlated with the target does not provide unique predictability and hence its contribution is entirely redundant. This motivates the following redundant predictability measure:
$$R(X_1X_2 \to X_3) = \min\{I(X_1;X_3), I(X_2;X_3)\}. \tag{84}$$

C. Shared, private and synergistic information for Gaussian variables
Let us use the intuitions developed in the previous section for building a symmetrical information decomposition. For this, we use the decomposition given by the following Lemma (whose proof is presented in Appendix E).
Lemma 13: Let $(X_1, X_2, X_3)$ follow a multivariate Gaussian distribution with zero mean and covariance matrix $\Sigma$ with $\alpha \ge \beta \ge \gamma \ge 0$. Then where $W_{123}$, $W_{12}$, $W_{13}$, $W_1$, $W_2$ and $W_3$ are independent standard Gaussians and $s_{123}$, $s_{12}$, $s_{13}$, $s_1$, $s_2$ and $s_3$ are given by
It is natural to relate $s_{123}$ with the shared information, $s_{12}$ and $s_{13}$ with the private information and $s_1$, $s_2$ and $s_3$ with the exclusive terms. Note that the decomposition presented in Lemma 13 is unique in not requiring a private component between the two less correlated variables, i.e. a term $W_{23}$. Hence, based on Lemma 13 and (81), we propose the following symmetric information decomposition for Gaussians:
First, note that the above shared information coincides with what was expected from Lemma 13, as for the general case $s_{123}^2 = \min\{|\alpha|, |\beta|, |\gamma|\}$. Also, (91) is consistent with the fact that the two less correlated Gaussians share no private information. Moreover, by comparing (93) and (122), it can be seen that if $X_1$ and $X_2$ are the less correlated variables then the synergistic information can be expressed as $I_S(X_1;X_2;X_3) = I(X_1;X_2|X_3)$, which for the particular case of $\alpha = 0$ confirms (40). This in turn also shows that, for the particular case of Gaussian variables, forming a Markov chain is a necessary and sufficient condition for having zero information synergy.$^{9}$
Finally, by noting that (89) can also be expressed as
$$I_\cap(X_1;X_2;X_3) = \min\{I(X_1;X_2), I(X_2;X_3), I(X_1;X_3)\}, \tag{94}$$
it can be seen that our definition of shared information corresponds to the canonical symmetrization of (84) as discussed in Lemma 4. In contrast with (84), (94) states that there cannot be information shared by the three components of the system if two of them are pairwise independent. Therefore, the magnitude of the shared information is governed by the lowest correlation coefficient of the whole system, being upper-bounded by any of the redundant predictability terms.

$^{9}$ For the case of $\alpha \ge \beta \ge \gamma$, a direct calculation shows that $I(X_1;X_2|X_3) = 0$ is equivalent to $\gamma = \alpha\beta$.
To close this section, let us note that (94) corresponds to the upper bound provided by (15), which means that multivariate Gaussians have a maximal shared information. This is complementary to the fact that, because of being a maximum entropy distribution, they also have the smallest amount of synergy that is compatible with the corresponding second order statistics.

VII. APPLICATIONS TO NETWORK INFORMATION THEORY
In this section we use the framework presented in Section III to analyze four fundamental scenarios in network information theory [51]. Our goal is to illustrate how the framework can be used to build new intuitions over these well-known optimal information-theoretic strategies.
The application of the framework to scenarios with open problems is left for future work.
In the following, Section VII-A uses the general framework to analyze the Slepian-Wolf coding for three sources, which is a fundamental result in the literature of distributed source compression. Then, Section VII-B applies the results of Section IV to the multiple access channel, which is one of the fundamental settings in multiuser information theory. Section VII-C applies the results related to Markov chains from Section V to the wiretap channel, which constitutes one of the main models of information-theoretic secrecy. Finally, Section VII-D uses results from Section VI to study fundamental limits of public or private broadcast transmissions over Gaussian channels.

A. Slepian-Wolf coding
The Slepian-Wolf coding gives lower bounds for the data rates that are required to transfer the information contained in various data sources. Let us denote as $R_k$ the data rate of the $k$-th source and define $\tilde{R}_k = R_k - H(X_k|X_k^c)$ as the extra data rate that each source has above its own exclusive information (c.f. Section II-B). Then, in the case of two sources $X_1$ and $X_2$, the well-known Slepian-Wolf bounds can be re-written as $\tilde{R}_1 \ge 0$, $\tilde{R}_2 \ge 0$, and $\tilde{R}_1 + \tilde{R}_2 \ge I(X_1;X_2)$ [51, Section 10.3]. The last inequality states that $I(X_1;X_2)$ corresponds to shared information that can be transmitted by any of the two sources.
Let us consider now the case of three sources, and denote $R_S = I_S(X_1;X_2;X_3)$. The Slepian-Wolf bounds provide seven inequalities [51, Section 10.5], which can be re-written as
$$\tilde{R}_i \ge 0, \quad i \in \{1, 2, 3\}, \tag{95}$$
$$\tilde{R}_i + \tilde{R}_j \ge I_{\text{priv}}(X_i;X_j|X_k) + R_S, \quad \{i, j, k\} = \{1, 2, 3\}, \tag{96}$$
$$\tilde{R}_1 + \tilde{R}_2 + \tilde{R}_3 \ge \Delta H(2) + \Delta H(3). \tag{97}$$
Above, (97) states that the DTC needs to be accounted for by the extra rate of the sources, and (96) that every pair needs to take care of their private information. Interestingly, due to (32) the shared information needs to be included in only one of the rates, while the synergistic information needs to be included in at least two. For example, one possible solution that is consistent with

B. Multiple Access Channel
Let us consider a multiple access channel, where two pairwise independent transmitters send $X_1$ and $X_2$ and a receiver gets $X_3$, as shown in Fig. 3. It is well-known that, for a given distribution $(X_1, X_2) \sim p(x_1)p(x_2)$, the achievable transmission rates $R_1$ and $R_2$ satisfy the constraints [51, Section 4.5]
$$R_1 \le I(X_1;X_3|X_2), \qquad R_2 \le I(X_2;X_3|X_1), \qquad R_1 + R_2 \le I(X_1, X_2;X_3).$$
As the transmitted random variables are pairwise independent, one can apply the results of Section IV. Therefore, there is no shared information and $I_S(X_1;X_2;X_3) = I(X_1;X_3|X_2) - I(X_1;X_3)$. Let us introduce a shorthand notation for the remaining terms: $C_1 = I_{\text{priv}}(X_1;X_3|X_2) = I(X_1;X_3)$, $C_2 = I_{\text{priv}}(X_2;X_3|X_1) = I(X_2;X_3)$ and $C_S = I_S(X_1;X_2;X_3)$. Then, one can re-write the bounds for the transmission rates as
$$R_1 \le C_1 + C_S, \qquad R_2 \le C_2 + C_S, \qquad R_1 + R_2 \le C_1 + C_2 + C_S.$$
From this, it is clear that while each transmitter has a private portion of the channel with capacity $C_1$ or $C_2$, their interaction synergistically creates extra capacity $C_S$ that corresponds to what can be actually shared.
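The rewritten rate bounds can be evaluated for a concrete channel. The sketch below assumes a noiseless binary adder channel, $X_3 = X_1 + X_2$ with integer addition and independent uniform binary inputs; this channel and the helper names are illustrative choices that do not come from the paper.

```python
from itertools import product
from math import log2

def entropy(pmf, idx):
    """Entropy (bits) of the marginal of `pmf` on the coordinates listed in `idx`."""
    marg = {}
    for x, p in pmf.items():
        key = tuple(x[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def cmi(pmf, a, b, c=()):
    """Conditional mutual information I(A;B|C) in bits; a, b, c are index tuples."""
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))

# Hypothetical channel: X3 = X1 + X2 (integer sum) with independent uniform bits.
pmf = {(a, b, a + b): 0.25 for a, b in product((0, 1), repeat=2)}

X1, X2, X3 = (0,), (1,), (2,)
C1 = cmi(pmf, X1, X3)                  # private capacity of transmitter 1, I(X1;X3)
C2 = cmi(pmf, X2, X3)                  # private capacity of transmitter 2, I(X2;X3)
CS = cmi(pmf, X1, X3, X2) - C1         # synergistic capacity, I(X1;X3|X2) - I(X1;X3)

bound_r1, bound_r2, bound_sum = C1 + CS, C2 + CS, C1 + C2 + CS
print(f"R1 <= {bound_r1:.2f}, R2 <= {bound_r2:.2f}, R1 + R2 <= {bound_sum:.2f}")
```

For this channel each transmitter owns half a bit of private capacity, and the remaining half bit of synergistic capacity can be assigned to either user but not to both at once.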
Fig. 3. Capacity region of the Multiple Access Channel, which represents the possible data-rates that two transmitters can use for transferring information to one receiver.

C. Degraded Wiretap Channel
Consider a communication system with an eavesdropper (shown in Fig. 4), where the transmitter sends $X_1$, the intended receiver gets $X_2$ and the eavesdropper receives $X_3$. For simplicity of the exposition, let us consider the case where the eavesdropper gets only a degraded copy of the signal received by the intended receiver, i.e. that $X_1 - X_2 - X_3$ form a Markov chain. Using the results of Section V-B, one can see that in this case there is no synergistic but only shared and private information between $X_1$, $X_2$ and $X_3$.

Fig. 4. The rate of secure information transfer, $C_{\text{sec}}$, is the portion of the mutual information that can be used while providing perfect confidentiality with respect to the eavesdropper.

In this scenario, it is known that for a given input distribution $p_{X_1}$ the rate of secure communication that can be achieved is upper bounded by [42, Section 3.4]
$$C_{\text{sec}} = I(X_1;X_2) - I(X_1;X_3) = I_{\text{priv}}(X_1;X_2|X_3),$$
which is precisely the private information sharing between $X_1$ and $X_2$. Also, as intuition would suggest, the eavesdropping capacity is equal to the shared information between the three variables:
$$I(X_1;X_3) = I_\cap(X_1;X_2;X_3).$$

D. Gaussian Broadcast Channel
Let us consider a Gaussian Broadcast Channel, where a transmitter sends a Gaussian signal $X_1$ that is received as $X_2$ and $X_3$ by two receivers. Assuming that all these variables are jointly Gaussian with zero mean and covariance matrix as given by (76), the transmitter can broadcast a public message, intended for both users, at a maximum rate $C_{\text{pub}}$ given by [42, Section 5.1] where the redundant predictability, $R(X_2X_3 \to X_1)$, between Gaussian variables is as defined in (84). On the other hand, if the transmitter wants to send a private (confidential) message to receiver 1, the corresponding maximum rate $C_{\text{priv}}$ that can be achieved in this case is given by where the last equality follows from Axiom (2).
Interestingly, the predictability measures prove to be better suited to describe the communication limits in the above scenario than their symmetrical counterparts. In effect, using the shared information would have underestimated the public capacity (c.f. Section VI-C). This opens the question of whether directed measures could be better suited for studying certain communication systems than their symmetrized counterparts. Even though a definite answer to this question might not be straightforward, we hope that future research will provide more evidence and a better understanding of this issue.

VIII. CONCLUSIONS
In this work we propose an axiomatic framework for studying the interdependencies that can exist between multiple random variables as different modes of information sharing. The framework is based on a symmetric notion of information that refers to properties of the system as a whole. We showed that, in contrast to predictability-based decompositions, all the information terms of the proposed decomposition have unique expressions for Markov chains and for the case where two variables are pairwise independent. We also analyzed the cases of pairwise maximum entropy (PME) distributions and multivariate Gaussian variables. Finally, we illustrated the application of the framework by using it to develop a more intuitive understanding of the optimal information-theoretic strategies in several fundamental communication scenarios.
The key insight that this framework provides is that although there is only one way in which information can be shared between two random variables, there are two essentially different ways of sharing between three. One of these ways is a simple extension of the pairwise dependency, where information is shared redundantly and hence any of the variables can be used to predict any other. The second way leads to the counter-intuitive notion of synergistic information sharing, where the information is shared in a way that the statistical dependency is destroyed if any of the variables is removed; hence, the structure exists in the whole but not in any of the parts.
Information synergy has therefore been commonly related to statistical structures that exist only in the joint p.d.f. and not in low-order marginals. Interestingly, although we showed that indeed PME distributions possess the minimal information synergy that is allowed by their pairwise marginals, this minimum can be strictly positive.
Therefore, there exists a connection between pairwise marginals and synergistic information sharing that is still to be further clarified. In fact, this phenomenon is related to the difference between the TC and the DTC, which is rooted in the fact that the information sharing modes and the marginal structure of the p.d.f. are, although somehow related, intrinsically different.
This important distinction has been represented in our framework by the sequence of internal and external entropies. This new unifying picture for the entropy, negentropy, TC and DTC has shed new light on the understanding of high-order interdependencies, whose consequences have only begun to be explored.

APPENDIX A
PROOF OF LEMMA 3
Proof: Let us assume that $R(X_1X_2 \to Y)$ and $U(X_1 \to Y|X_2) = I(X_1;Y) - R(X_1X_2 \to Y)$ satisfy Axioms (1)-(3). Then, where the inequalities are a consequence of the non-negativity of $U(X_1 \to Y|X_2)$ and the third equality is due to the weak symmetry of the redundant predictability. For proving the lower bound, first notice that Axiom (2) can be re-written as The lower bound follows considering the non-negativity of $R(X_1X_2 \to Y)$ and by noting that The proof of the converse is direct, and left as an exercise to the reader.
The symmetry of $I_S(X_1;X_2;X_3)$ with respect to $X_1$ and $X_3$ follows directly from its definition, the weak symmetry of $I(X_1;X_3|X_2)$ and the strong symmetry of $I_\cap(X_1;X_2;X_3)$. The symmetry with respect to $X_1$ and $X_2$ can be shown using the definition of $I_S(X_1;X_2;X_3)$ and the strong symmetry of $I_\cap(X_1;X_2;X_3)$ and the co-information $I(X_1;X_2;X_3)$ as follows:
$$I_S(X_2;X_1;X_3) = I(X_2;X_3|X_1) - I(X_2;X_3) + I_\cap(X_2;X_1;X_3)$$
$$= -I(X_1;X_2;X_3) + I_\cap(X_2;X_1;X_3)$$
$$= I(X_1;X_3|X_2) - I(X_1;X_3) + I_\cap(X_1;X_2;X_3)$$
$$= I_S(X_1;X_2;X_3).$$
The bounds for $I_\cap(X_1;X_2;X_3)$, $I_{\text{priv}}(X_1;X_2;X_3)$ and $I_S(X_1;X_2;X_3)$ follow directly from the definition of these quantities and Axiom (3). Finally, d) is proven directly using those definitions, and the fact that the mutual information depends only on the pairwise marginals, while the conditional mutual information depends on the full p.d.f.