# Information Decomposition and Synergy


## Abstract


## 1. Introduction

… $I_{\min}$ for shared information. This measure $I_{\min}$ was, however, criticized as unintuitive [5,6], and several alternatives were proposed [7,8], but so far only for the bivariate case.

## 2. Information Decomposition

Let $X_1, \dots, X_n, S$ be random variables. We are mostly interested in two settings. In the discrete setting, all random variables have finite state spaces. In the Gaussian setting, all random variables have continuous state spaces, and their joint distribution is a multivariate Gaussian.

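In the discrete setting, the entropy of a random variable $X$ is $H(X) = -\sum_{x} p(x) \log p(x)$, and the mutual information of two discrete random variables is:

$$MI(X : Y) = H(X) + H(Y) - H(X, Y).$$

In the Gaussian setting, the (differential) entropy of an $m$-dimensional Gaussian random vector $X$ with covariance matrix $\Sigma_X$ is given by:

$$H(X) = \frac{1}{2} \log |\Sigma_X| + \frac{m}{2} \log(2 \pi e),$$

where $|\Sigma_X|$ denotes the determinant of $\Sigma_X$. This entropy is not invariant under coordinate transformations. In fact, if $A \in \mathbb{R}^{m \times m}$ is invertible, then the covariance matrix of $AX$ is given by $A \Sigma_X A^{t}$, and so the entropy of $AX$ is given by:

$$H(AX) = H(X) + \log |\det A|.$$

For Gaussian random vectors $X$, $Y$ with covariance matrices $\Sigma_X$, $\Sigma_Y$ and with a joint multivariate Gaussian distribution with joint covariance matrix $\Sigma_{X,Y}$, the mutual information is:

$$MI(X : Y) = \frac{1}{2} \log \frac{|\Sigma_X| \, |\Sigma_Y|}{|\Sigma_{X,Y}|}.$$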
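The behaviour of the Gaussian entropy under a linear map can be checked numerically. The following is a minimal sketch using only the standard library; the covariance matrix `sigma_x` and the map `A` are hypothetical values chosen for illustration, and the helper names (`det`, `matmul`, `gaussian_entropy`, `gaussian_mi`) are not taken from the paper.

```python
import math

def det(m):
    """Determinant via Laplace expansion along the first row (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def gaussian_entropy(cov):
    """Differential entropy in nats of a Gaussian vector with covariance matrix cov."""
    m = len(cov)
    return 0.5 * math.log(det(cov)) + 0.5 * m * math.log(2 * math.pi * math.e)

def gaussian_mi(cov_x, cov_y, cov_joint):
    """Mutual information in nats from the marginal and joint covariance matrices."""
    return 0.5 * math.log(det(cov_x) * det(cov_y) / det(cov_joint))

# Hypothetical 2x2 covariance matrix and invertible linear map (illustration only).
sigma_x = [[2.0, 0.5], [0.5, 1.0]]
A = [[3.0, 1.0], [0.0, 2.0]]
sigma_ax = matmul(matmul(A, sigma_x), transpose(A))  # covariance of AX

# The entropy changes by log|det A| under the coordinate change.
entropy_shift = gaussian_entropy(sigma_ax) - gaussian_entropy(sigma_x)
print(entropy_shift, math.log(abs(det(A))))

# For two unit-variance scalars with correlation 0.6 the mutual information
# reduces to -1/2 * log(1 - 0.6**2).
mi_scalar = gaussian_mi([[1.0]], [[1.0]], [[1.0, 0.6], [0.6, 1.0]])
print(mi_scalar, -0.5 * math.log(1 - 0.6 ** 2))
```

The entropy shift depends on the chosen coordinates, whereas the mutual information is built from a ratio of determinants in which such factors cancel.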

#### 2.1. Partial Information Lattice

… how the information that $X_1, \dots, X_n$ have about $S$ is distributed among $X_1, \dots, X_n$. In Shannon’s theory of information, the total amount of information about $S$ contained in $X_1, \dots, X_n$ is quantified by the mutual information $MI(S : X_1, \dots, X_n)$.

The goal is to write $MI(S : X_1, \dots, X_n)$ as a sum of non-negative functions with a good interpretation in terms of how the information is distributed, e.g., redundantly or synergistically, among $X_1, \dots, X_n$. For example, as mentioned in the Introduction and as we will see later, several suggestions have been made to measure the total synergy of $X_1, \dots, X_n$ in terms of a function $\mathrm{Synergy}(S : X_1; \dots; X_n)$. When starting with such a function, the idea of the information decomposition is to further decompose the difference:

$$MI(S : X_1, \dots, X_n) - \mathrm{Synergy}(S : X_1; \dots; X_n).$$

… the shared information $SI(S : X_1; X_2)$, the unique information $UI(S : X_1 \backslash X_2)$ and $UI(S : X_2 \backslash X_1)$ of $X_1$ and $X_2$, respectively, and the synergistic (or complementary) information $CI(S : X_1; X_2)$.
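For orientation, the four quantities named above are usually tied to Shannon's mutual information by three consistency equations. The display below is a sketch of that standard bivariate form as used in the partial information decomposition literature; it is a reconstruction, not a verbatim copy of the paper's own display.

```latex
\begin{aligned}
MI(S : X_1, X_2) &= SI(S : X_1; X_2) + UI(S : X_1 \backslash X_2) + UI(S : X_2 \backslash X_1) + CI(S : X_1; X_2),\\
MI(S : X_1) &= SI(S : X_1; X_2) + UI(S : X_1 \backslash X_2),\\
MI(S : X_2) &= SI(S : X_1; X_2) + UI(S : X_2 \backslash X_1).
\end{aligned}
```

Subtracting the third equation from the first and using the chain rule gives $MI(S : X_1 | X_2) = UI(S : X_1 \backslash X_2) + CI(S : X_1; X_2)$, so the conditional mutual information equals the synergy whenever the unique information of $X_1$ vanishes.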

… $X_1, \dots, X_n$ may interact with each other, combining redundant, unique and synergistic effects.

… $I_\cap(S : X_1; \dots; X_n)$ that measures the redundant information about $S$ contained in $X_1, \dots, X_n$. Clearly, such a function should be symmetric under permutations of $X_1, \dots, X_n$. In a second step, $I_\cap$ is also used to measure the redundant information $I_\cap(S : A_1; \dots; A_k)$ about $S$ contained in combinations $A_1, \dots, A_k$ of the original random variables (that is, $A_1, \dots, A_k$ are random vectors whose components are among $\{X_1, \dots, X_n\}$). Moreover, Williams and Beer proposed that $I_\cap$ should satisfy the following monotonicity property:

$$I_\cap(S : A_1; \dots; A_k; A_{k+1}) \le I_\cap(S : A_1; \dots; A_k), \quad \text{with equality if } A_i \subseteq A_{k+1} \text{ for some } i \le k$$

(where the inclusion $A_i \subseteq A_{k+1}$ means that any component of $A_i$ is also a component of $A_{k+1}$).

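… $I_\cap$ in the case where $A_1, \dots, A_k$ form an antichain; that is, $A_i \nsubseteq A_j$ for all $i \ne j$. The set of antichains is partially ordered by the relation:

$$(B_1, \dots, B_l) \preceq (A_1, \dots, A_k) \quad :\Longleftrightarrow \quad \text{for each } A_i \text{ there exists } B_j \text{ with } B_j \subseteq A_i.$$

… $I_\cap$ is a monotone function with respect to this partial order. This partial order actually makes the set of antichains into a lattice.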
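The lattice of antichains is small enough for $n = 2$ and $n = 3$ to be enumerated by brute force. The sketch below assumes the ordering of Williams and Beer, in which $(B_1, \dots, B_l) \preceq (A_1, \dots, A_k)$ holds when every $A_i$ contains some $B_j$; the function names are illustrative, and the labels follow the bar notation of Figure 1.

```python
from itertools import combinations

def nonempty_subsets(n):
    """All non-empty subsets of the index set {1, ..., n}."""
    items = range(1, n + 1)
    return [frozenset(c) for r in range(1, n + 1) for c in combinations(items, r)]

def antichains(n):
    """All non-empty collections of non-empty subsets in which no set contains another."""
    subsets = nonempty_subsets(n)
    found = []
    for r in range(1, len(subsets) + 1):
        for coll in combinations(subsets, r):
            if all(not (a <= b or b <= a) for a, b in combinations(coll, 2)):
                found.append(frozenset(coll))
    return found

def precedes(beta, alpha):
    """beta precedes alpha if every element of alpha contains some element of beta."""
    return all(any(b <= a for b in beta) for a in alpha)

def label(antichain):
    """Bar notation: {X1,X2},{X1,X3},{X2,X3} becomes '12|13|23'."""
    return "|".join(sorted("".join(str(i) for i in sorted(a)) for a in antichain))

lattice2 = antichains(2)
lattice3 = antichains(3)
print(len(lattice2), len(lattice3))

# The bottom element lies below every antichain, the top element above every antichain.
bottom3 = [a for a in lattice3 if all(precedes(a, b) for b in lattice3)]
top3 = [a for a in lattice3 if all(precedes(b, a) for b in lattice3)]
print(label(bottom3[0]), label(top3[0]))
```

The brute-force search over all collections of subsets is only practical for very small $n$, since the number of antichains grows extremely quickly.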

If $(B_1, \dots, B_l) \preceq (A_1, \dots, A_k)$, then the difference $I_\cap(S : A_1; \dots; A_k) - I_\cap(S : B_1; \dots; B_l)$ quantifies the information contained in all $A_i$, but not contained in some $B_j$. The idea of Williams and Beer can be summarized by saying that all information can be classified according to the antichains within which it is contained. Thus, the third step is to write:

$$I_\cap(S : A_1; \dots; A_k) = \sum_{(B_1, \dots, B_l) \preceq (A_1, \dots, A_k)} I_\partial(S : B_1; \dots; B_l),$$

where the function $I_\partial$ is uniquely defined as the Möbius transform of $I_\cap$ on the lattice of antichains.

Even if $I_\cap$ is non-negative (as it should be as an information quantity), it is not immediate that the function $I_\partial$ is also non-negative. This additional requirement was called local positivity in [5].

… $I_\cap$ should be defined. There have been some proposals of functions $I_\cap(S : X_1; X_2)$ with up to two arguments, so-called bivariate information decompositions [7,8], but so far, only two general information decompositions are known. Williams and Beer defined a function $I_{\min}$ that satisfies local positivity, but, as mentioned above, it was found to give unintuitive values in many examples [5,6]. In [5], $I_{\min}$ was compared with the function:

$$I_{mmi}(S : A_1; \dots; A_k) = \min_i MI(S : A_i)$$

(denoted by $I_I$ in [5]). This function has many nice mathematical properties, including local positivity. However, $I_{mmi}$ clearly does not have the right interpretation as measuring the shared information, since $I_{mmi}$ only compares the amounts of information that the different $A_i$ carry about $S$, without checking whether the measured information is really the “same” information [5]. Nevertheless, for Gaussian random variables, $I_{mmi}$ might actually lead to a reasonable information decomposition (as discussed in [12] for the case $n = 2$).
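To make the Möbius inversion concrete, the sketch below evaluates a minimum-mutual-information redundancy, taken here to be $\min_i MI(S : A_i)$, on the four antichains of the $n = 2$ lattice and inverts it by hand. The AND gate with uniform inputs serves as the test distribution; the variable and function names are illustrative and not from the paper.

```python
import math
from collections import defaultdict

# Joint distribution p(x, y, s) of the AND gate with independent uniform inputs.
p_and = {(x, y, x & y): 0.25 for x in (0, 1) for y in (0, 1)}

def entropy(p, idx):
    """Shannon entropy in bits of the marginal on the coordinates listed in idx."""
    marg = defaultdict(float)
    for state, prob in p.items():
        marg[tuple(state[i] for i in idx)] += prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)

def mi(p, a, b):
    """Mutual information in bits between coordinate groups a and b."""
    return entropy(p, a) + entropy(p, b) - entropy(p, tuple(a) + tuple(b))

def i_mmi(p, target, sources):
    """Minimum over the sources of the mutual information with the target."""
    return min(mi(p, target, src) for src in sources)

X, Y, S = (0,), (1,), (2,)

# Redundancy on the four antichains of the n = 2 lattice: 1|2, 1, 2 and 12.
red_1_2 = i_mmi(p_and, S, [X, Y])
red_1 = i_mmi(p_and, S, [X])
red_2 = i_mmi(p_and, S, [Y])
red_12 = i_mmi(p_and, S, [X + Y])

# Moebius inversion on this lattice, written out explicitly.
shared = red_1_2
unique_x = red_1 - red_1_2
unique_y = red_2 - red_1_2
synergy = red_12 - red_1 - red_2 + red_1_2

print(shared, unique_x, unique_y, synergy)
```

All four terms are non-negative for this example, and they add up to $MI(S : X, Y)$ by construction.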

#### 2.2. Interaction Spaces

… $X_0, X_1, \dots, X_n$. Later, we will put $X_0 = S$ in order to compare the setting of interaction spaces with the setting of information decompositions.

Let $\binom{X}{k}$ be the set of all subsets $A \subseteq \{X_0, \dots, X_n\}$ of cardinality $|A| = k$. The exponential family of $k$-th order interactions $\mathcal{E}^{(k)}$ of the random variables $X_0, X_1, \dots, X_n$ consists of all distributions of the form:

$$p(x_0, \dots, x_n) = \prod_{A \in \binom{X}{k}} \Psi_A(x_0, \dots, x_n),$$

where each $\Psi_A$ is a strictly positive function that only depends on those $x_i$ with $X_i \in A$. Taking the logarithm, this is equivalent to saying that:

$$p(x_0, \dots, x_n) = \exp\Big( \sum_{A \in \binom{X}{k}} \psi_A(x_0, \dots, x_n) \Big),$$

where each function $\psi_A$ only depends on those $x_i$ with $X_i \in A$. This second representation corresponds to the Gibbs–Boltzmann distribution used in statistical mechanics, and it also explains the name exponential family. Clearly, $\mathcal{E}^{(1)} \subseteq \mathcal{E}^{(2)} \subseteq \dots \subseteq \mathcal{E}^{(n)} \subseteq \mathcal{E}^{(n+1)}$.

The set $\mathcal{E}^{(k)}$ is not closed (for $k > 0$), in the sense that there are probability distributions outside of $\mathcal{E}^{(k)}$ that can be approximated arbitrarily well by $k$-th order interaction distributions. Thus, we denote by $\overline{\mathcal{E}^{(k)}}$ the closure of $\mathcal{E}^{(k)}$ (technically speaking, for probability spaces, there are different notions of approximation and of closure, but in the finite discrete case, they all agree; for example, one may take the topology induced by considering a probability distribution as a vector of real numbers). For example, $\overline{\mathcal{E}^{(k)}}$ contains distributions that can be written as products of non-negative functions $\Psi_A$ with zeros. In particular, $\overline{\mathcal{E}^{(n+1)}}$ consists of all possible joint distributions of $X_0, \dots, X_n$. However, for $1 < k \le n$, the closure of $\mathcal{E}^{(k)}$ also contains distributions that do not factorize at all (see Section 2.3 in [13] and the references therein).

Given an arbitrary joint distribution $p$ of $X_0, \dots, X_n$, we might ask for the best approximation of $p$ by a $k$-th order interaction distribution $q$. It is customary to measure the approximation error in terms of the Kullback-Leibler divergence:

$$D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}.$$

**Proposition 1.** (1) Let $\mathcal{E}$ be an exponential family, and let $p$ be an arbitrary distribution. Then, there is a unique distribution $p_{\mathcal{E}}$ in the closure of $\mathcal{E}$ that best approximates $p$, in the sense that:

$$D(p \| p_{\mathcal{E}}) = \inf_{q \in \mathcal{E}} D(p \| q).$$

The distribution $p_{\mathcal{E}}$ is called the rI-projection of $p$ to $\mathcal{E}$.

We write $q^{(k)} := p_{\mathcal{E}^{(k)}}$. For example, $q^{(n+1)} = p$. For $n \ge k > 1$, there is no general formula for $q^{(k)}$. For $k = 1$, one can show that:

$$q^{(1)}(x_0, \dots, x_n) = p(x_0) \, p(x_1) \cdots p(x_n).$$

… $X_0, \dots, X_n$. Applying the Pythagorean theorem $n - 1$ times to the hierarchy $\mathcal{E}^{(1)} \subseteq \mathcal{E}^{(2)} \subseteq \dots \subseteq \mathcal{E}^{(n)}$, it follows that:

$$D(p \| q^{(1)}) = D(p \| q^{(n)}) + D(q^{(n)} \| q^{(n-1)}) + \dots + D(q^{(2)} \| q^{(1)}).$$

… $X_0, \dots, X_n$ is a part of the multi-information of $X_0, \dots, X_n$. The last sum shows that the hierarchy of interaction families gives a finer decomposition of $S^{(2)}$ into terms that may be interpreted as “synergy of a fixed order”. In the case of three random variables ($n = 2$) that we will study later, there is only one term, since $p = q^{(3)}$ in this case. Using the maximum entropy principle behind exponential families [14], the function $S^{(2)}$ can also be expressed as:

$$S^{(2)}(X_0; \dots; X_n) = \max_{q \in \Delta_p^{(2)}} H_q(X_0, \dots, X_n) - H_p(X_0, \dots, X_n),$$

where $\Delta_p^{(2)}$ denotes the set of all joint distributions of $X_0, \dots, X_n$ that have the same pair marginals as $p$.
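The projections $q^{(1)}$ and $q^{(2)}$ can be approximated numerically. The sketch below assumes that $S^{(2)}$ is the divergence $D(p \| q^{(2)})$ from the pairwise-interaction family, as the surrounding text suggests, and computes $q^{(2)}$ by iterative proportional fitting. That is a standard algorithm for matching marginals, swapped in here for illustration and not necessarily what the authors used. The XOR distribution is an illustrative choice that does not appear in the text above; it is convenient because all of its pair marginals are uniform.

```python
import math
from itertools import combinations, product
from collections import defaultdict

def marginal(p, idx):
    """Marginal distribution on the coordinates listed in idx."""
    marg = defaultdict(float)
    for state, prob in p.items():
        marg[tuple(state[i] for i in idx)] += prob
    return marg

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pv * math.log2(pv / q[s]) for s, pv in p.items() if pv > 0)

def project_independent(p, n_vars, alphabet=(0, 1)):
    """q^(1): the product of the single-variable marginals of p."""
    margs = [marginal(p, (i,)) for i in range(n_vars)]
    return {s: math.prod(margs[i][(s[i],)] for i in range(n_vars))
            for s in product(alphabet, repeat=n_vars)}

def project_pairwise(p, n_vars, alphabet=(0, 1), sweeps=50):
    """Approximate q^(2): start from the uniform distribution and rescale
    repeatedly so that each pair marginal matches the one of p."""
    states = list(product(alphabet, repeat=n_vars))
    q = {s: 1.0 / len(states) for s in states}
    for _ in range(sweeps):
        for pair in combinations(range(n_vars), 2):
            target, current = marginal(p, pair), marginal(q, pair)
            for s in states:
                key = tuple(s[i] for i in pair)
                if current[key] > 0:
                    q[s] *= target[key] / current[key]
    return q

# XOR of two independent uniform bits, states ordered as (x, y, s).
p_xor = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}

q1 = project_independent(p_xor, 3)
q2 = project_pairwise(p_xor, 3)

multi_information = kl(p_xor, q1)   # D(p || q^(1))
s2 = kl(p_xor, q2)                  # D(p || q^(2))
pairwise_part = kl(q2, q1)          # D(q^(2) || q^(1))
print(multi_information, s2, pairwise_part)
```

For XOR the whole multi-information sits in the third-order term, in line with the Pythagorean splitting $D(p \| q^{(1)}) = D(p \| q^{(2)}) + D(q^{(2)} \| q^{(1)})$. For distributions whose projection lies on the boundary of the family, the fitting procedure may converge slowly, so the number of sweeps would need to be increased.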

… $MI(X_0 : X_1, \dots, X_n)$ can be achieved in a similar spirit as follows. Let $\binom{X}{k}_0$ be the set of all subsets $A \subseteq \{X_0, \dots, X_n\}$ of cardinality $|A| = k$ that contain $X_0$, and let $\widehat{\mathcal{E}}^{(k)}$ be the set of all probability distributions of the form:

$$p(x_0, \dots, x_n) = \Psi_{[n]}(x_1, \dots, x_n) \prod_{A \in \binom{X}{k}_0} \Psi_A(x_0, \dots, x_n),$$

where the $\Psi_A$ are as above and where $\Psi_{[n]}$ is a function that only depends on $x_1, \dots, x_n$. As above, each $\widehat{\mathcal{E}}^{(k)}$ is an exponential family.

… $(X_0 : X_1, \dots, X_n)$.

… $X_0, \dots, X_n$). Using the maximum entropy principle behind exponential families [14], the function $\widehat{S}^{(2)}$ can also be expressed as: … $X_0, \dots, X_n$ that have the same pair marginals as $p$ and for which, additionally, the marginal distribution for $X_1, \dots, X_n$ is the same as for $p$.

While the exponential families $\mathcal{E}^{(k)}$ are symmetric in all random variables $X_0, \dots, X_n$, in the definition of $\widehat{\mathcal{E}}^{(k)}$, the variable $X_0$ plays a special role. This is reminiscent of the special role of $S$ in the information decomposition framework, when the goal is to decompose the information about $S$. Thus, also in $\widehat{S}^{(2)}$, the variable $X_0$ is special.

… $\mathcal{E}^{(1)} \subseteq \mathcal{E}^{(2)} \subseteq \dots \subseteq \mathcal{E}^{(n)}$ and $\widehat{\mathcal{E}}^{(1)} \subseteq \widehat{\mathcal{E}}^{(2)} \subseteq \dots \subseteq \widehat{\mathcal{E}}^{(n)}$.

… $S^{(2)}(S; X; Y) = \widehat{S}^{(2)}(S : X; Y)$.

While the hierarchy of the exponential families $\mathcal{E}^{(k)}$ is classical, to the best of our knowledge, the alternative hierarchy of the families $\widehat{\mathcal{E}}^{(k)}$ has not been studied before. We do not want to analyze this second hierarchy in detail here, but we just want to demonstrate that the framework of interaction exponential families is flexible enough to give a nice decomposition of mutual information, which can naturally be compared with the information decomposition framework. In this paper, in any case, we only consider cases where $\mathcal{E}^{(k)} = \widehat{\mathcal{E}}^{(k)}$.

## 3. Measures of Synergy and Their Properties

#### 3.1. WholeMinusSum Synergy

… $S_{WMS}$ is the difference between the complementary and the shared information:

$$S_{WMS}(S : X; Y) = CI(S : X; Y) - SI(S : X; Y).$$
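The reduction to a difference of complementary and shared information can be traced in a few lines. The derivation below is a sketch that assumes the usual whole-minus-sum definition, $MI(S : X, Y) - MI(S : X) - MI(S : Y)$, and the standard bivariate consistency equations of the information decomposition.

```latex
\begin{aligned}
S_{WMS}(S : X; Y) &= MI(S : X, Y) - MI(S : X) - MI(S : Y)\\
&= \big[ SI + UI(S : X \backslash Y) + UI(S : Y \backslash X) + CI \big]
   - \big[ SI + UI(S : X \backslash Y) \big]
   - \big[ SI + UI(S : Y \backslash X) \big]\\
&= CI(S : X; Y) - SI(S : X; Y).
\end{aligned}
```

Both unique terms cancel, which is why $S_{WMS}$ can be negative: it underestimates the synergy by exactly the amount of shared information.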

#### 3.2. Synergy from Unique Information

… $\Delta_p$.

… $S^{(2)}$ coming from the interaction decomposition.

#### 3.3. Synergy from Maximum Entropy Arguments

… ${}^{(1)}(S; X; Y)$ was discussed under the name “connected information” $I_C^{(2)}$, but it was not considered as a measure of synergy. Synergy was measured instead by the WMS synergy measure (7).

… $S^{(2)}(S; X; Y)$, we see that:

- Both quantities are by definition $\ge 0$.
- $S^{(2)}(S; X; Y)$ is symmetric with respect to permutation of all of its arguments, in contrast to $\tilde{CI}(S : X; Y)$.
- $S^{(2)}(S; X; Y) \le \tilde{CI}(S : X; Y)$, because $\Delta_p^{(2)} \subseteq \Delta_p$ and: …

If $S^{(2)}(S; X; Y)$ is considered as a synergy measure in the information decomposition [8], one gets negative values for the corresponding shared information, which we will denote by $SI^{(2)}(S : X; Y)$.

## 4. Examples

#### 4.1. An Instructive Example: AND

… $S_{WMS}(S : X; Y) \approx 0.1887$ bit. On the other hand, in the AND case, the joint probability distribution $p(s, x, y)$ is already fully determined by the marginal distributions $p(x, y)$, $p(s, y)$ and $p(s, x)$; that is, $\Delta_p^{(2)} = \{p\}$ (see, e.g., [10]). Therefore, $S^{(2)}(S; X; Y) = 0$.

#### 4.2. Gaussian Random Variables: When Should Synergy Vanish?

… $S^{(2)}(S; X; Y) = 0$.

Let $r_{SX}$ and $r_{SY}$ denote the correlation coefficients between $S$ and $X$ and between $S$ and $Y$, respectively. If $|r_{SX}| \le |r_{SY}|$, then $X$ has no unique information about $S$, i.e., $\tilde{UI}(S : X \backslash Y) = 0$, and therefore, $\tilde{CI}(S : X; Y) = MI(S : X | Y)$. This was shown in [12] using explicit computations with semi-definite matrices. Here, we give a more conceptual argument involving simple properties of Gaussian random variables and general properties of $\tilde{UI}$.

Let $X_\rho = Y + \rho \epsilon$, where $\epsilon$ denotes Gaussian noise, which is independent of $X$, $Y$ and $S$. Then, $X_\rho$ is independent of $S$ given $Y$, and so, $|r_{SX_\rho}| \le |r_{SY}|$. It is easy to check that $r_{SX_\rho}$ is a continuous function of $\rho$, with $r_{SX_0} = r_{SY}$ and $r_{SX_\rho} \to 0$ as $\rho \to \infty$. In particular, assuming without loss of generality that $r_{SX}$ and $r_{SY}$ have the same sign (otherwise, replace $X$ by $-X$), there exists a value $\rho_0 \in \mathbb{R}$ such that $r_{SX_{\rho_0}} = r_{SX}$. Let $X' = \frac{\sigma_X}{\sigma_{X_{\rho_0}}} X_{\rho_0}$. Then, the pair $(X', S)$ has the same distribution as the pair $(X, S)$ (since $X'$ has the same variance as $X$ and since the two pairs have the same correlation coefficient). Thus, $\tilde{UI}(S : X \backslash Y) = \tilde{UI}(S : X' \backslash Y)$. Moreover, since $MI(S : X' | Y) = 0$, it follows from (5) that $\tilde{UI}(S : X' \backslash Y) = 0$.
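The noise level $\rho_0$ in this construction can be written down explicitly. The sketch below assumes that $\epsilon$ has unit variance and that $0 < r_{SX} \le r_{SY}$, so that $r_{SX_\rho} = r_{SY} \, \sigma_Y / \sqrt{\sigma_Y^2 + \rho^2}$; the function names and the numerical values are hypothetical.

```python
import math

def corr_with_noisy_copy(r_sy, sigma_y, rho):
    """Correlation between S and X_rho = Y + rho * eps, where eps is
    standard Gaussian noise independent of S and Y (an assumption)."""
    return r_sy * sigma_y / math.sqrt(sigma_y ** 2 + rho ** 2)

def matching_noise_level(r_sx, r_sy, sigma_y):
    """Noise level rho_0 at which corr(S, X_rho0) equals r_sx."""
    if not 0 < r_sx <= r_sy:
        raise ValueError("this sketch assumes 0 < r_sx <= r_sy")
    return sigma_y * math.sqrt((r_sy / r_sx) ** 2 - 1)

# Hypothetical correlations and standard deviation of Y.
r_sx, r_sy, sigma_y = 0.4, 0.8, 1.0
rho_0 = matching_noise_level(r_sx, r_sy, sigma_y)

print(rho_0)
print(corr_with_noisy_copy(r_sy, sigma_y, 0.0))    # rho = 0 gives back r_sy
print(corr_with_noisy_copy(r_sy, sigma_y, rho_0))  # matches r_sx
```

Rescaling $X_{\rho_0}$ to the variance of $X$ does not change the correlation, so the resulting $X'$ has the same joint distribution with $S$ as $X$ while being a noisy copy of $Y$.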

Assuming $|r_{SX}| \le |r_{SY}|$, we arrive at the following formulas:
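Written out for scalar Gaussian variables, the decomposition under this assumption takes the following form. This is a reconstruction from the two identities stated earlier in this section, $\tilde{UI}(S : X \backslash Y) = 0$ and $\tilde{CI}(S : X; Y) = MI(S : X | Y)$, combined with the bivariate consistency equations and the Gaussian formula $MI(S : X) = -\frac{1}{2} \log(1 - r_{SX}^2)$; it is not a verbatim copy of the original display.

```latex
\begin{aligned}
\tilde{UI}(S : X \backslash Y) &= 0,\\
\tilde{UI}(S : Y \backslash X) &= MI(S : Y) - MI(S : X) = \tfrac{1}{2} \log \frac{1 - r_{SX}^2}{1 - r_{SY}^2},\\
\tilde{SI}(S : X; Y) &= MI(S : X) = -\tfrac{1}{2} \log\left(1 - r_{SX}^2\right),\\
\tilde{CI}(S : X; Y) &= MI(S : X | Y) = MI(S : X, Y) - MI(S : Y).
\end{aligned}
```

In this form the shared information is the smaller of the two mutual informations $MI(S : X)$ and $MI(S : Y)$.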

… $I_{mmi}$. In fact, any information decomposition according to the PI lattice satisfies $SI(S : X; Y) \le I_{mmi}(S : X; Y)$ [4]. Moreover, any information decomposition that satisfies (*) satisfies $SI(S : X; Y) \ge \tilde{SI}(S : X; Y)$ (Lemma 3 in [8]), and thus, all such information decompositions agree in the Gaussian case (this was first observed in [12]). In [12], it is shown that this result generalizes to the case where $X$ and $Y$ are Gaussian random vectors. The proof of this result basically shows that the above argument also works in this more general case.

… $I_{mmi}$ decomposition suggests that the information decomposition based on $I_{mmi}$ may also be sensible for Gaussian distributions for larger values of $n$.

… $S^{(2)}(S; X; Y)$ vanishes). On the other hand, $S_{WMS}(S : X; Y) = CI(S : X; Y) - SI(S : X; Y)$ can be positive for Gaussian variables, and thus, synergy must be positive as well (see [12]; for a simple example, choose $0 < r_{SX} = r_{SY} = r < 1/\sqrt{2}$ and $r_{XY} = 0$; then $S_{WMS}(S : X; Y) = \frac{1}{2} \log \frac{(1 - r^2)^2}{1 - 2 r^2} > 0$).
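The Gaussian example can be checked numerically from the correlation matrix of $(S, X, Y)$. The sketch below assumes unit variances, scalar variables and natural logarithms, and it takes the whole-minus-sum quantity to be $MI(S : X, Y) - MI(S : X) - MI(S : Y)$. It also evaluates the conditional mutual information $MI(S : X | Y) = \frac{1}{2} \log \frac{1 - r^2}{1 - 2 r^2}$, which equals the synergy in this setting. The value $r = 0.5$ and all function names are illustrative.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def gaussian_terms(r_sx, r_sy, r_xy):
    """Return (MI(S:X), MI(S:Y), MI(S:X,Y)) in nats for unit-variance Gaussians."""
    mi_sx = -0.5 * math.log(1 - r_sx ** 2)
    mi_sy = -0.5 * math.log(1 - r_sy ** 2)
    corr = [[1.0, r_sx, r_sy], [r_sx, 1.0, r_xy], [r_sy, r_xy, 1.0]]
    # |Sigma_S| = 1 and |Sigma_XY| = 1 - r_xy^2 for unit variances.
    mi_s_xy = 0.5 * math.log((1 - r_xy ** 2) / det3(corr))
    return mi_sx, mi_sy, mi_s_xy

r = 0.5  # any value with 0 < r < 1/sqrt(2) keeps the correlation matrix valid
mi_sx, mi_sy, mi_s_xy = gaussian_terms(r, r, 0.0)

wms = mi_s_xy - mi_sx - mi_sy          # whole minus sum
cond_mi = mi_s_xy - mi_sy              # MI(S : X | Y) by the chain rule

wms_closed = 0.5 * math.log((1 - r ** 2) ** 2 / (1 - 2 * r ** 2))
cond_mi_closed = 0.5 * math.log((1 - r ** 2) / (1 - 2 * r ** 2))
print(wms, wms_closed)
print(cond_mi, cond_mi_closed)
```

Both quantities are strictly positive for every admissible $r$, since $(1 - r^2)^2 > 1 - 2 r^2$ and $1 - r^2 > 1 - 2 r^2$.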

… $|r_{SX}| < |r_{SY}|$. From $CI(S : X; Y) = MI(S : X | Y)$, we see that synergy vanishes if and only if $S$ and $X$ are conditionally independent given $Y$. Since all distributions are Gaussian and information measures do not depend on the mean values, this condition can be checked by computing the conditional variances $\mathrm{Var}[S | X, Y] = \sigma^2$ and $\mathrm{Var}[S | Y] = \alpha^2 \mathrm{Var}[X | Y] + \sigma^2$. These variances agree, and thus $S$ is conditionally independent of $X$ given $Y$, if $\mathrm{Var}[X | Y] = 0$ (i.e., $X$ is a function of $Y$ and effectively the same variable) or if $\alpha = 0$. Positive synergy arises whenever $X$ contributes to $S$ with a non-trivial coefficient $\alpha \ne 0$. This is a very reasonable interpretation and shows that the synergy measure $CI(S : X; Y)$ nicely captures the intuition of $X$ and $Y$ acting together to bring about $S$.
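The two conditional variances translate directly into a value for the synergy, because for jointly Gaussian variables $MI(S : X | Y) = \frac{1}{2} \log \frac{\mathrm{Var}[S | Y]}{\mathrm{Var}[S | X, Y]}$. The sketch below assumes a linear model of the form $S = \alpha X + \beta Y + \sigma \epsilon$, which is what the quoted variances suggest, together with the condition under which $CI(S : X; Y) = MI(S : X | Y)$ holds; the function name and the numbers are hypothetical.

```python
import math

def synergy_linear_gaussian(alpha, var_x_given_y, sigma):
    """MI(S : X | Y) in nats from the two conditional variances of S."""
    var_s_given_xy = sigma ** 2
    var_s_given_y = alpha ** 2 * var_x_given_y + sigma ** 2
    return 0.5 * math.log(var_s_given_y / var_s_given_xy)

# Synergy vanishes when X does not enter S, or when X is determined by Y.
no_coefficient = synergy_linear_gaussian(alpha=0.0, var_x_given_y=1.0, sigma=1.0)
copy_of_y = synergy_linear_gaussian(alpha=2.0, var_x_given_y=0.0, sigma=1.0)

# It is positive as soon as alpha != 0 and X is not a function of Y.
generic = synergy_linear_gaussian(alpha=1.0, var_x_given_y=1.0, sigma=1.0)
print(no_coefficient, copy_of_y, generic)
```

The value only depends on the ratio $\alpha^2 \mathrm{Var}[X | Y] / \sigma^2$, so a weak contribution of $X$ that is masked by strong noise yields little synergy.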

## 5. Discussion and Conclusions

… $S^{(2)}$ based on the projection on the exponential family of distributions with only pairwise interactions is not compatible with the partial information lattice framework, because it does not yield a non-negative information decomposition, as we have shown in the examples. The reason why we believe that it is important to have a complete non-negative information decomposition is that, in addition to a formula for synergy, it would give us an interpretation of the “remainder” $MI(S : X_1, \dots, X_n) - \mathrm{Synergy}$. In the bivariate case, $\tilde{CI}(S : X; Y)$ provides a synergy measure that complies with the information decomposition.

… $S^{(2)}$ for multivariate Gaussians reflects their “simplicity” in the sense that they can be transformed into independent sub-processes by a linear transformation. In contrast, this simplicity is reflected in the information decomposition by the fact that one of the two unique information terms always vanishes. Since the WholeMinusSum synergy (or co-information) can be positive for Gaussian distributions, it is not possible to define an information decomposition for Gaussian variables that puts the synergy to zero.

… $S^{(2)}(S : X; Y)$, are part of the synergy, i.e., $S^{(2)}(S : X; Y) \le CI(S : X; Y)$, they are not required for synergy, as demonstrated by our AND example and the case of Gaussian random variables. In particular, the latter example leads to the intuitive insight that synergy arises when multiple inputs $X$, $Y$ are processed simultaneously to compute the target $S$. Interestingly, the nature of this processing is less important and can be rather simple, i.e., the output is literally just “the sum of its inputs”. In this sense, we believe that our negative result, regarding the non-negativity of $S^{(2)}(S : X; Y)$, provides important insights into the nature of synergy in the partial information decomposition. It is up to future work to develop a better understanding of the relationship between the presence of higher-order dependencies and synergy.

## Acknowledgments

**PACS classifications:** 89.70.Cf; 89.75.-k

**MSC classifications:** 94A15; 94A17

## Author Contributions

## Conflicts of Interest

## References

1. Schneidman, E.; Bialek, W.; Berry, M.J.I. Synergy, redundancy, and independence in population codes. J. Neurosci. **2003**, 23, 11539–11553.
2. Margolin, A.A.; Wang, K.; Califano, A.; Nemenman, I. Multivariate dependence and genetic networks inference. IET Syst. Biol. **2010**, 4, 428–440.
3. Bertschinger, N.; Olbrich, E.; Ay, N.; Jost, J. Autonomy: An information theoretic perspective. Biosystems **2008**, 91, 331–345.
4. Williams, P.; Beer, R. Nonnegative Decomposition of Multivariate Information. **2010**, arXiv, 1004.2515v1.
5. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J. Shared Information—New Insights and Problems in Decomposing Information in Complex Systems. In Proceedings of the European Conference on Complex Systems 2012 (ECCS’12), Brussels, Belgium, 3–7 September 2012; pp. 251–269.
6. Griffith, V.; Koch, C. Quantifying Synergistic Mutual Information. In Guided Self-Organization: Inception; Prokopenko, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2014; Volume 9, Emergence, Complexity and Computation Series; pp. 159–190.
7. Harder, M.; Salge, C.; Polani, D. Bivariate measure of redundant information. Phys. Rev. E **2013**, 87, 012130.
8. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J.; Ay, N. Quantifying unique information. Entropy **2014**, 16, 2161–2183.
9. Amari, S.I. Information geometry on hierarchy of probability distributions. IEEE Trans. Inf. Theory **2001**, 47, 1701–1711.
10. Schneidman, E.; Still, S.; Berry, M.J., 2nd; Bialek, W. Network information and connected correlations. Phys. Rev. Lett. **2003**, 91, 238701.
11. Ay, N.; Olbrich, E.; Bertschinger, N.; Jost, J. A geometric approach to complexity. Chaos **2011**, 21, 037103.
12. Barrett, A.B. Exploration of synergistic and redundant information sharing in static and dynamical Gaussian systems. Phys. Rev. E **2015**, 91, 052802.
13. Rauh, J.; Kahle, T.; Ay, N. Support sets of exponential families and oriented matroids. Int. J. Approx. Reason. **2011**, 52, 613–626.
14. Csiszár, I.; Shields, P.C. Information Theory and Statistics: A Tutorial. Found. Trends Commun. Inf. Theory **2004**, 1, 417–528.
15. Studený, M.; Vejnarová, J. The Multiinformation Function as a Tool for Measuring Stochastic Dependence. In Learning in Graphical Models; Jordan, M.I., Ed.; Springer: Dordrecht, The Netherlands, 1998; Volume 89, NATO ASI Series; pp. 261–297.
16. Watanabe, S. Information Theoretical Analysis of Multivariate Correlation. IBM J. Res. Dev. **1960**, 4, 66–82.
17. Kahle, T.; Olbrich, E.; Jost, J.; Ay, N. Complexity Measures from Interaction Structures. Phys. Rev. E **2009**, 79, 026201.
18. Gawne, T.J.; Richmond, B.J. How independent are the messages carried by adjacent inferior temporal cortical neurons? J. Neurosci. **1993**, 13, 2758–2771.
19. Gat, I.; Tishby, N. Synergy and Redundancy among Brain Cells of Behaving Monkeys. In Advances in Neural Information Processing Systems 11; Kearns, M., Solla, S., Cohn, D., Eds.; MIT Press: Cambridge, MA, USA, 1999; pp. 111–117.
20. Chechik, G.; Globerson, A.; Anderson, M.J.; Young, E.D.; Nelken, I.; Tishby, N. Group Redundancy Measures Reveal Redundancy Reduction in the Auditory Pathway. In Advances in Neural Information Processing Systems 14; Dietterich, T., Becker, S., Ghahramani, Z., Eds.; MIT Press: Cambridge, MA, USA, 2002; pp. 173–180.
21. Bell, A.J. The Co-Information Lattice. In Proceedings of the 4th International Workshop on Independent Component Analysis and Blind Signal Separation, Nara, Japan, 1–4 April 2003.
22. McGill, W.J. Multivariate information transmission. Psychometrika **1954**, 19, 97–116.

**Figure 1.** (**a**) The PI lattice for two random variables; (**b**) the PI lattice for $n = 3$. For brevity, every antichain is indicated by juxtaposing the components of its elements, separated by bars |. For example, 12|13|23 stands for the antichain $\{X_1, X_2\}, \{X_1, X_3\}, \{X_2, X_3\}$.

| X | Y | S | p |
| --- | --- | --- | --- |
| 0 | 0 | 0 | 1/4 |
| 0 | 1 | 0 | 1/4 |
| 1 | 0 | 0 | 1/4 |
| 1 | 1 | 1 | 1/4 |

$$\begin{aligned}
H(S) &= 2 \log 2 - \tfrac{3}{4} \log 3\\
H(S, X) &= \tfrac{3}{2} \log 2\\
H(S, X, Y) &= 2 \log 2\\
MI(S : X) &= \tfrac{3}{2} \log 2 - \tfrac{3}{4} \log 3\\
MI(S : X | Y) &= \tfrac{1}{2} \log 2
\end{aligned}$$
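The entries of this table can be reproduced directly from the joint distribution. The following sketch works in bits (so that $\log 2 = 1$) and also evaluates the whole-minus-sum quantity $MI(S : X, Y) - MI(S : X) - MI(S : Y)$ for the AND gate; the helper name `entropy` and the coordinate order `(x, y, s)` are choices made for this illustration.

```python
import math
from collections import defaultdict

# Joint distribution of the AND gate, states ordered as (x, y, s).
p_and = {(x, y, x & y): 0.25 for x in (0, 1) for y in (0, 1)}

def entropy(p, idx):
    """Shannon entropy in bits of the marginal on the coordinates listed in idx."""
    marg = defaultdict(float)
    for state, prob in p.items():
        marg[tuple(state[i] for i in idx)] += prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)

h_s = entropy(p_and, (2,))
h_sx = entropy(p_and, (2, 0))
h_sy = entropy(p_and, (2, 1))
h_sxy = entropy(p_and, (0, 1, 2))

mi_sx = h_s + entropy(p_and, (0,)) - h_sx
mi_sy = h_s + entropy(p_and, (1,)) - h_sy
mi_s_xy = h_s + entropy(p_and, (0, 1)) - h_sxy
# Conditional mutual information MI(S : X | Y) = H(S,Y) + H(X,Y) - H(S,X,Y) - H(Y).
mi_sx_given_y = h_sy + entropy(p_and, (0, 1)) - h_sxy - entropy(p_and, (1,))

wms = mi_s_xy - mi_sx - mi_sy
print(h_s, h_sx, h_sxy, mi_sx, mi_sx_given_y)
print(round(wms, 4))
```

The last line gives the whole-minus-sum synergy of the AND example in bits.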

© 2015 by the authors; licensee MDPI, Basel, Switzerland This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Olbrich, E.; Bertschinger, N.; Rauh, J.
Information Decomposition and Synergy. *Entropy* **2015**, *17*, 3501-3517.
https://doi.org/10.3390/e17053501
