
Information Decomposition and Synergy

1 Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany
2 Frankfurt Institute for Advanced Studies, Ruth-Moufang-Straße 1, 60438 Frankfurt am Main, Germany
3 Institute of Algebraic Geometry, Leibniz Universität Hannover, Welfengarten 1, 30167 Hannover, Germany
* Author to whom correspondence should be addressed.
Academic Editor: Rick Quax
Entropy 2015, 17(5), 3501-3517; https://doi.org/10.3390/e17053501
Received: 26 March 2015 / Revised: 12 May 2015 / Accepted: 19 May 2015 / Published: 22 May 2015
(This article belongs to the Special Issue Information Processing in Complex Systems)

Abstract

Recently, a series of papers addressed the problem of decomposing the information of two random variables into shared information, unique information and synergistic information. Several measures were proposed, although still no consensus has been reached. Here, we compare these proposals with an older approach to define synergistic information based on the projections on exponential families containing only up to k-th order interactions. We show that these measures are not compatible with a decomposition into unique, shared and synergistic information if one requires that all terms are always non-negative (local positivity). We illustrate the difference between the two measures for multivariate Gaussians.
Keywords: Shannon information; mutual information; information decomposition; shared information; synergy

1. Introduction

Studying a complex system usually involves figuring out how different parts of the system interact with each other. If two processes, described by random variables X and Y, interact with each other to bring about a third one, S, it is natural to ask for the contribution of the single processes. We might distinguish unique contributions of X and Y from redundant ones. Additionally, there might be a component that can be produced only by X and Y acting together: this is what we will call synergy in the following. Attempts to measure synergy have already been undertaken in several fields. When investigating neural codes, S is the stimulus, and one asks how the information about the stimulus is encoded in neural representations X and Y [1]. When studying gene regulation in systems biology, S could be the target gene, and one might ask for synergy between transcription factors X and Y [2]. For the behavior of an autonomous system S, one could ask to what extent it is influenced by its own state history X or the environment Y [3].
Williams and Beer proposed the partial information lattice as a framework to achieve such an information decomposition starting from the redundant part, i.e., the shared information. It is based on a list of axioms that any reasonable measure for shared information should fulfill [4]. The lattice alone, however, does not determine the actual values of the different components, but just the structure of the decomposition. In the bivariate case, there are four functions (redundancy, synergy and unique information of X and Y, respectively), related by three linear conditions. Thus, to complete the theory, it suffices to provide a definition for one of these functions. In [4], Williams and Beer also proposed a measure $I_{\min}$ for shared information. This measure $I_{\min}$ was, however, criticized as unintuitive [5,6], and several alternatives were proposed [7,8], but only for the bivariate case so far.
In this paper, we do not want to propose another measure. Instead, we want to relate the recent work on information decomposition to work on information decompositions based on projections on exponential families containing only up to k-th order interactions [2,9–11]. We focus on the synergy aspect and compare both approaches for two instructive examples: the AND gate and multivariate Gaussian distributions. We start with reviewing the construction of the partial information lattice by Williams and Beer [4] and discussing the terms for the bivariate case in more detail. In particular, we show how synergy appears in this framework and how it is related to other information measures. In Section 2.2, we recall the exponential families of k-th-order interactions and the corresponding projections and how they can be used to decompose information. In Section 3, we provide the definitions of specific synergy measures, on the one hand, in the framework of the partial information lattice, and on the other hand, in the framework of interaction spaces, and discuss their properties. In Section 4, we compare the two measures for specific examples and conclude the paper by discussing the significance of the difference between the two measures for analyzing complex systems.

2. Information Decomposition

Let X1, … , Xn, S be random variables. We are mostly interested in two settings. In the discrete setting, all random variables have finite state spaces. In the Gaussian setting, all random variables have continuous state spaces, and their joint distribution is a multivariate Gaussian.
For discrete random variables, information-theoretic quantities, such as entropy and mutual information, are canonically defined. For example, the entropy of a discrete random variable X is given by $H(X) = -\sum_x p(x) \log p(x)$, and the mutual information of two discrete random variables is:
$$MI(X : Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$
The conditional mutual information is defined accordingly as:
$$MI(X : Y | Z) = \sum_{x,y,z} p(x,y|z)\,p(z) \log \frac{p(x,y|z)}{p(x|z)\,p(y|z)}$$
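The two definitions above can be checked numerically on small joint tables. The following is a minimal sketch, not part of the original presentation: it assumes NumPy, logarithms to base 2 (so that results are in bits), and the equivalent entropy forms MI(X : Y) = H(X) + H(Y) − H(X, Y) and MI(X : Y|Z) = H(X, Z) + H(Y, Z) − H(X, Y, Z) − H(Z). The function names and the XOR example are illustrative choices.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in bits of a probability table of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # convention: 0 log 0 = 0
    return float(-np.sum(p * np.log2(p)))


def mutual_information(pxy):
    """MI(X:Y) in bits from a joint table pxy[x, y]."""
    pxy = np.asarray(pxy, dtype=float)
    return entropy(pxy.sum(axis=1)) + entropy(pxy.sum(axis=0)) - entropy(pxy)


def conditional_mutual_information(pxyz):
    """MI(X:Y|Z) in bits from a joint table pxyz[x, y, z]."""
    pxyz = np.asarray(pxyz, dtype=float)
    # MI(X:Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
    return (entropy(pxyz.sum(axis=1)) + entropy(pxyz.sum(axis=0))
            - entropy(pxyz) - entropy(pxyz.sum(axis=(0, 1))))


# Illustrative example: X, Y independent fair bits and Z = X XOR Y.
p = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        p[x, y, x ^ y] = 0.25

mi_xy = mutual_information(p.sum(axis=2))          # X and Y are independent
cmi_xy_given_z = conditional_mutual_information(p)  # but dependent given Z
```

In this example, X and Y share no information marginally, yet become fully dependent once Z is known.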
For continuous random variables, there is no canonical entropy function. Instead, there is differential entropy, which is computed with respect to some reference measure dx:
$$H(X) = -\int p(x) \log p(x)\, dx$$
where p now denotes the probability density of X with respect to dx. Taking the Lebesgue measure, the entropy of an m-dimensional Gaussian random vector with covariance matrix ΣX is given by:
$$H(X) = \frac{1}{2} \log |\Sigma_X| + \frac{1}{2}\, m \log(2 \pi e)$$
where $|\Sigma_X|$ denotes the determinant of $\Sigma_X$. This entropy is not invariant under coordinate transformations. In fact, if $A \in \mathbb{R}^{m \times m}$ is invertible, then the covariance matrix of AX is given by $A \Sigma_X A^t$, and so the entropy of AX is given by:
$$H(AX) = H(X) + \log |\det A|$$
In contrast, the mutual information of continuous random variables does not depend on the choice of a reference measure. The relation MI(X : Y) = H(X) + H(Y) − H(X,Y) shows that, for Gaussian random vectors with covariance matrices ΣX, ΣY and with a joint multivariate Gaussian distribution with joint covariance matrix ΣX,Y,
$$MI(X : Y) = \frac{1}{2} \log \frac{|\Sigma_X|\,|\Sigma_Y|}{|\Sigma_{X,Y}|}$$
and it is easy to check directly that this is independent of linear transformations of X and Y (of course, here, one should not apply a linear transformation to the total vector (X, Y) that mixes components of X and Y).
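The contrast between the two transformation rules can be illustrated numerically. The sketch below is an illustration under assumptions not made in the text: it uses NumPy, natural logarithms (nats), a randomly generated covariance matrix, and a hypothetical invertible matrix A acting on a two-dimensional X while a one-dimensional Y is left unchanged.

```python
import numpy as np


def gaussian_entropy(cov):
    """Differential entropy (nats) of a Gaussian w.r.t. the Lebesgue measure."""
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    m = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * logdet + 0.5 * m * np.log(2 * np.pi * np.e)


def gaussian_mi(cov, dim_x):
    """MI(X:Y) in nats; X is the first dim_x coordinates of the joint vector."""
    cov = np.asarray(cov, dtype=float)
    _, ld_x = np.linalg.slogdet(cov[:dim_x, :dim_x])
    _, ld_y = np.linalg.slogdet(cov[dim_x:, dim_x:])
    _, ld_xy = np.linalg.slogdet(cov)
    return 0.5 * (ld_x + ld_y - ld_xy)


rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
cov = B @ B.T + np.eye(3)                 # joint covariance of (X, Y), dim X = 2

A = np.array([[2.0, 1.0], [0.0, 3.0]])    # hypothetical transform of X, det A = 6
T = np.block([[A, np.zeros((2, 1))],
              [np.zeros((1, 2)), np.eye(1)]])  # acts on X only, does not mix X and Y
cov_t = T @ cov @ T.T

entropy_shift = gaussian_entropy(cov_t[:2, :2]) - gaussian_entropy(cov[:2, :2])
mi_before = gaussian_mi(cov, 2)
mi_after = gaussian_mi(cov_t, 2)
```

Here `entropy_shift` equals the logarithm of the absolute determinant of A, whereas the mutual information is unchanged.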

2.1. Partial Information Lattice

We want to analyze how the information that X1, … , Xn have about S is distributed among X1, … , Xn. In Shannon’s theory of information, the total amount of information about S contained in X1, … , Xn is quantified by the mutual information:
$$MI(S : X_1, \dots, X_n)$$
We are looking for a way to write MI(S : X1, … , Xn) as a sum of non-negative functions with a good interpretation in terms of how the information is distributed, e.g., redundantly or synergistically, among X1, … , Xn. For example, as we have mentioned in the Introduction and as we will see later, several suggestions have been made to measure the total synergy of X1, … , Xn in terms of a function Synergy(S : X1; … ; Xn). When starting with such a function, the idea of the information decomposition is to further decompose the difference:
$$MI(S : X_1, \dots, X_n) - \mathrm{Synergy}(S : X_1; \dots; X_n) \tag{2}$$
as a sum of non-negative functions. The additional advantage of such a complete information decomposition would be to give a better interpretation of the difference (2), apart from the tautological interpretation that it just measures “everything but the synergy.” Throughout the paper, we will use the following notation: the left argument of the information quantities, the target variable S, is divided by a colon from the right arguments. The semicolon separates the different arguments on the right side, while comma-separated random variables are treated as a single vector-valued argument.
When looking for such an information decomposition, the first question is what terms to expect. In the case n = 2, this may seem quite easy, and it seems to be common sense to expect a decomposition of the form:
$$MI(S : X_1, X_2) = SI(S : X_1; X_2) + UI(S : X_1 \setminus X_2) + UI(S : X_2 \setminus X_1) + CI(S : X_1; X_2) \tag{3}$$
into four terms corresponding to the redundant (or shared) information SI(S : X1;X2), the unique information UI(S : X1 \ X2) and UI(S : X2 \ X1) of X1 and X2, respectively, and the synergistic (or complementary) information CI(S : X1; X2).
However, when n > 2, it seems less clear in which different ways X1, … , Xn may interact with each other, combining redundant, unique and synergistic effects.
As a solution, Williams and Beer proposed the partial information framework. We explain the idea only briefly here and refer to [4] for more detailed explanations. The basic idea is to construct such a decomposition purely in terms of a function for shared information $I_\cap(S : X_1; \dots; X_n)$ that measures the redundant information about S contained in $X_1, \dots, X_n$. Clearly, such a function should be symmetric under permutations of $X_1, \dots, X_n$. In a second step, $I_\cap$ is also used to measure the redundant information $I_\cap(S : A_1; \dots; A_k)$ about S contained in combinations $A_1, \dots, A_k$ of the original random variables (that is, $A_1, \dots, A_k$ are random vectors whose components are among $\{X_1, \dots, X_n\}$). Moreover, Williams and Beer proposed that $I_\cap$ should satisfy the following monotonicity property:
$$I_\cap(S : A_1; \dots; A_k; A_{k+1}) \le I_\cap(S : A_1; \dots; A_k), \quad \text{with equality if } A_i \subseteq A_{k+1} \text{ for some } i \le k$$
(where the inclusion $A_i \subseteq A_{k+1}$ means that any component of $A_i$ is also a component of $A_{k+1}$).
The monotonicity property shows that it suffices to consider the function $I_\cap$ in the case where $A_1, \dots, A_k$ form an antichain; that is, $A_i \not\subseteq A_j$ for all $i \neq j$. The set of antichains is partially ordered by the relation:
$$(B_1, \dots, B_l) \preceq (A_1, \dots, A_k) \;:\Longleftrightarrow\; \text{for each } j = 1, \dots, k, \text{ there exists } i \le l \text{ with } B_i \subseteq A_j$$
and, again by the monotonicity property, $I_\cap$ is a monotone function with respect to this partial order. This partial order actually makes the set of antichains into a lattice.
If $(B_1, \dots, B_l) \preceq (A_1, \dots, A_k)$, then the difference $I_\cap(S : A_1; \dots; A_k) - I_\cap(S : B_1; \dots; B_l)$ quantifies the information contained in all $A_i$, but not contained in some $B_l$. The idea of Williams and Beer can be summarized by saying that all information can be classified according to within which antichains it is contained. Thus, the third step is to write:
$$I_\cap(S : A_1; \dots; A_k) = \sum_{(B_1, \dots, B_l) \preceq (A_1, \dots, A_k)} I_\partial(S : B_1; \dots; B_l)$$
where the function $I_\partial$ is uniquely defined as the Möbius transform of $I_\cap$ on the lattice of antichains.
For example, the PI lattices for n = 2 and n = 3 are given in Figure 1. For n = 2, it is easy to make the connection with (3): The partial measures are:
$$\begin{aligned} I_\partial(S : (X_1, X_2)) &= CI(S : X_1; X_2) \\ I_\partial(S : X_1) &= UI(S : X_1 \setminus X_2) \\ I_\partial(S : X_2) &= UI(S : X_2 \setminus X_1) \\ I_\partial(S : X_1; X_2) &= SI(S : X_1; X_2) \end{aligned}$$
and the redundancy measure satisfies:
$$\begin{aligned} I_\cap(S : (X_1, X_2)) &= MI(S : X_1, X_2) = CI(S : X_1; X_2) + UI(S : X_1 \setminus X_2) + UI(S : X_2 \setminus X_1) + SI(S : X_1; X_2) \\ I_\cap(S : X_1) &= MI(S : X_1) = UI(S : X_1 \setminus X_2) + SI(S : X_1; X_2) \\ I_\cap(S : X_2) &= MI(S : X_2) = UI(S : X_2 \setminus X_1) + SI(S : X_1; X_2) \\ I_\cap(S : X_1; X_2) &= SI(S : X_1; X_2) \end{aligned} \tag{4}$$
From (4) and the chain rule for the mutual information:
$$MI(S : X_1, X_2) = MI(S : X_2) + MI(S : X_1 | X_2)$$
it follows immediately that:
$$MI(S : X_1 | X_2) = UI(S : X_1 \setminus X_2) + CI(S : X_1; X_2) \tag{5}$$
Even if $I_\cap$ is non-negative (as it should be as an information quantity), it is not immediate that the function $I_\partial$ is also non-negative. This additional requirement was called local positivity in [5].
While the PI lattice is a beautiful framework, so far, there has been no convincing proposal of how the function $I_\cap$ should be defined. There have been some proposals of functions $I_\cap(S : X_1; X_2)$ with up to two arguments, so-called bivariate information decompositions [7,8], but so far, only two general information decompositions are known. Williams and Beer defined a function $I_{\min}$ that satisfies local positivity, but, as mentioned above, it was found to give unintuitive values in many examples [5,6]. In [5], $I_{\min}$ was compared with the function:
$$I_{MMI}(S : A_1; \dots; A_k) = \min_i MI(S : A_i)$$
which was called minimum mutual information (MMI) in [12] (originally, it was denoted by $I_I$ in [5]). This function has many nice mathematical properties, including local positivity. However, $I_{MMI}$ clearly does not have the right interpretation as measuring the shared information, since it only compares the different amounts of information of S and $A_i$, without checking whether the measured information is really the “same” information [5]. Nevertheless, for Gaussian random variables, $I_{MMI}$ might actually lead to a reasonable information decomposition (as discussed in [12] for the case n = 2).
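For n = 2, the Möbius inversion on the PI lattice reduces to three subtractions once a redundancy function has been chosen. The following sketch illustrates this with the MMI redundancy; it is an illustration only, assuming NumPy, base-2 logarithms and a joint table indexed as p[s, x1, x2]. The AND gate used as input is an illustrative choice, and the function names are hypothetical.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in bits of a probability table of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def mmi_decomposition(p):
    """Bivariate decomposition of MI(S : X1, X2) using I_MMI as shared information.

    p is a joint table p[s, x1, x2]; all returned values are in bits.
    """
    p = np.asarray(p, dtype=float)
    h_s = entropy(p.sum(axis=(1, 2)))
    mi_1 = h_s + entropy(p.sum(axis=(0, 2))) - entropy(p.sum(axis=2))  # MI(S:X1)
    mi_2 = h_s + entropy(p.sum(axis=(0, 1))) - entropy(p.sum(axis=1))  # MI(S:X2)
    mi_12 = h_s + entropy(p.sum(axis=0)) - entropy(p)                  # MI(S:X1,X2)
    si = min(mi_1, mi_2)           # redundancy at the bottom of the lattice
    ui_1 = mi_1 - si               # Moebius inversion, node {X1}
    ui_2 = mi_2 - si               # Moebius inversion, node {X2}
    ci = mi_12 - si - ui_1 - ui_2  # Moebius inversion, top node {(X1, X2)}
    return {"SI": si, "UI_1": ui_1, "UI_2": ui_2, "CI": ci, "MI": mi_12}


# Illustrative input: independent fair bits X1, X2 and S = X1 AND X2.
p_and = np.zeros((2, 2, 2))
for x1 in (0, 1):
    for x2 in (0, 1):
        p_and[x1 & x2, x1, x2] = 0.25

d = mmi_decomposition(p_and)
```

By construction, the four terms are non-negative and sum to MI(S : X1, X2); whether they carry the intended interpretation is a separate question, as discussed above.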

2.2. Interaction Spaces

An alternative approach to quantify synergy comes from the idea that synergy among interacting systems has to do with interactions beyond simple pair interactions. We slightly change the notation and now analyze the interaction of n + 1 random variables X0, X1, … , Xn. Later, we will put X0 = S in order to compare the setting of interaction spaces with the setting of information decompositions.
For simplicity, we restrict ourselves here to the discrete setting. Let $\binom{X}{k}$ be the set of all subsets $A \subseteq \{X_0, \dots, X_n\}$ of cardinality |A| = k. The exponential family of k-th order interactions $\mathcal{E}^{(k)}$ of random variables $X_0, X_1, \dots, X_n$ consists of all distributions of the form:
$$p(x_0, \dots, x_n) = \prod_{A \in \binom{X}{k}} \Psi_A(x_0, \dots, x_n)$$
where $\Psi_A$ is a strictly positive function that only depends on those $x_i$ with $X_i \in A$. Taking the logarithm, this is equivalent to saying that:
$$p(x_0, \dots, x_n) = \exp\Big( \sum_{A \in \binom{X}{k}} \psi_A(x_0, \dots, x_n) \Big)$$
where, again, each function $\psi_A$ only depends on those $x_i$ with $X_i \in A$. This second representation corresponds to the Gibbs–Boltzmann distribution used in statistical mechanics, and it also explains the name exponential family. Clearly, $\mathcal{E}^{(1)} \subseteq \mathcal{E}^{(2)} \subseteq \dots \subseteq \mathcal{E}^{(n)} \subseteq \mathcal{E}^{(n+1)}$.
The set $\mathcal{E}^{(k)}$ is not closed (for k > 0), in the sense that there are probability distributions outside of $\mathcal{E}^{(k)}$ that can be approximated arbitrarily well by k-th order interaction distributions. Thus, we denote by $\overline{\mathcal{E}^{(k)}}$ the closure of $\mathcal{E}^{(k)}$ (technically speaking, for probability spaces, there are different notions of approximation and of closure, but in the finite discrete case, they all agree; for example, one may take the induced topology by considering a probability distribution as a vector of real numbers). For example, $\overline{\mathcal{E}^{(k)}}$ contains distributions that can be written as products of non-negative functions $\Psi_A$ with zeros. In particular, $\overline{\mathcal{E}^{(n+1)}}$ consists of all possible joint distributions of $X_0, \dots, X_n$. However, for 1 < k ≤ n, the closure of $\mathcal{E}^{(k)}$ also contains distributions that do not factorize at all (see Section 2.3 in [13] and the references therein).
Given an arbitrary joint distribution p of X0, … , Xn, we might ask for the best approximation of p by a k-th order interaction distribution q. It is customary to measure the approximation error in terms of the Kullback-Leibler divergence:
$$D(p \| q) = \sum_{x_0, \dots, x_n} p(x_0, \dots, x_n) \log \frac{p(x_0, \dots, x_n)}{q(x_0, \dots, x_n)}.$$
There are many relations between the KL divergence and exponential families. We need the following properties:
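Proposition 1. (1). Let $\mathcal{E}$ be an exponential family, and let p be an arbitrary distribution. Then, there is a unique distribution $p_{\mathcal{E}}$ in the closure of $\mathcal{E}$ that best approximates p, in the sense that:
$$D(p \| p_{\mathcal{E}}) = D(p \| \mathcal{E}) := \inf_{q \in \mathcal{E}} D(p \| q).$$
$p_{\mathcal{E}}$ is called the rI-projection of p to $\mathcal{E}$.
(2). If $\mathcal{E} \subseteq \mathcal{E}'$ are two exponential families, then:
$$D(p \| \mathcal{E}) = D(p \| \mathcal{E}') + D(p_{\mathcal{E}'} \| \mathcal{E})$$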
See [9,14] for a proof and further properties of exponential families. The second identity is also called the Pythagorean theorem for exponential families.
In the following, we will abbreviate $q^{(k)} := p_{\mathcal{E}^{(k)}}$. For example, $q^{(n+1)} = p$. For $n \ge k > 1$, there is no general formula for $q^{(k)}$. For k = 1, one can show that:
$$q^{(1)}(x_0, \dots, x_n) = p(X_0 = x_0)\, p(X_1 = x_1) \cdots p(X_n = x_n)$$
Thus, $D(p \| q^{(1)}) = \sum_{i=0}^{n} H(X_i) - H(X_0, \dots, X_n)$ equals the multi-information [15] (also known as total correlation [16]) of $X_0, \dots, X_n$. Applying the Pythagorean theorem n − 1 times to the hierarchy $\mathcal{E}^{(1)} \subseteq \mathcal{E}^{(2)} \subseteq \dots \subseteq \mathcal{E}^{(n)}$, it follows that:
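The identity between the divergence from the product of marginals and the multi-information is easy to confirm on an arbitrary table. The sketch below is only an illustration, assuming NumPy, base-2 logarithms and a randomly generated joint distribution of three variables; the helper names are hypothetical.

```python
from functools import reduce

import numpy as np


def entropy(p):
    """Shannon entropy in bits of a probability table of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def kl_divergence(p, q):
    """D(p||q) in bits; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


def single_marginals(p):
    """List of the one-variable marginals of a joint table."""
    return [p.sum(axis=tuple(j for j in range(p.ndim) if j != i))
            for i in range(p.ndim)]


def product_of_marginals(p):
    """q^(1): the product of all one-variable marginals of p."""
    return reduce(np.multiply.outer, single_marginals(p))


rng = np.random.default_rng(1)
p = rng.random((2, 3, 2))
p /= p.sum()                       # arbitrary joint distribution of X0, X1, X2

q1 = product_of_marginals(p)
multi_info = kl_divergence(p, q1)
entropy_form = sum(entropy(m) for m in single_marginals(p)) - entropy(p)
```

Both `multi_info` and `entropy_form` compute the same quantity, the first as a divergence and the second as a difference of entropies.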
$$D(p \| q^{(1)}) = D(p \| q^{(n)}) + D(q^{(n)} \| q^{(n-1)}) + \dots + D(q^{(2)} \| q^{(1)})$$
This equation decomposes the multi-information into terms corresponding to different interaction orders. This decomposition was introduced in [9] and studied for several examples in [10] or [17] with the single terms called connected information or interaction complexities, respectively. The idea that synergy should capture everything beyond pair interactions motivates us to define:
$$S^{(2)}(X_0; \dots; X_n) := D(p \| q^{(2)}) = D(p \| q^{(n)}) + D(q^{(n)} \| q^{(n-1)}) + \dots + D(q^{(3)} \| q^{(2)})$$
as a measure of synergy. In this interpretation, the synergy of $X_0, \dots, X_n$ is a part of the multi-information of $X_0, \dots, X_n$. The last sum shows that the hierarchy of interaction families gives a finer decomposition of $S^{(2)}$ into terms that may be interpreted as “synergy of a fixed order”. In the case of three variables (n = 2) that we will study later, there is only one term, since $p = q^{(3)}$ in this case. Using the maximum entropy principle behind exponential families [14], the function $S^{(2)}$ can also be expressed as:
$$S^{(2)}(S; X; Y) = \max_{q \in \Delta_p^{(2)}} H_q(S, Y, X) - H(S, Y, X)$$
where
$$\Delta_p^{(2)} = \{\, r(x_0, \dots, x_n) \;|\; r(x_i, x_j) = p(x_i, x_j) \text{ for all } i, j = 0, \dots, n \,\}$$
denotes the set of all joint distributions r of X0, … , Xn that have the same pair marginals as p.
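Since there is no closed formula for the projection onto the pair-interaction family, it has to be computed numerically. One standard option, which is an assumption of this sketch and not a method prescribed in the text, is iterative proportional fitting: starting from the uniform distribution, the three pair marginals of p are imposed in turn until the iteration stabilizes. The sketch assumes NumPy, three variables stored as p[s, x, y], base-2 logarithms, and a fixed number of sweeps instead of a convergence test; for distributions with zeros, convergence can be slow.

```python
import numpy as np


def kl_divergence(p, q):
    """D(p||q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


def pairwise_maxent(p, sweeps=500):
    """Approximate the maximum entropy distribution with the pair marginals of p.

    Iterative proportional fitting on a three-variable table p[s, x, y].
    """
    p = np.asarray(p, dtype=float)
    q = np.full(p.shape, 1.0 / p.size)
    for _ in range(sweeps):
        for axis in (0, 1, 2):  # summing out one axis leaves one pair marginal
            target = p.sum(axis=axis, keepdims=True)
            current = q.sum(axis=axis, keepdims=True)
            ratio = np.divide(target, current,
                              out=np.zeros_like(target), where=current > 0)
            q = q * ratio
    return q


def synergy_s2(p, sweeps=500):
    """S^(2)(S;X;Y) = D(p || q^(2)) in bits, with q^(2) obtained by fitting."""
    p = np.asarray(p, dtype=float)
    return kl_divergence(p, pairwise_maxent(p, sweeps))


# Illustrative case 1: S = X XOR Y with independent fair bits X, Y.
p_xor = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        p_xor[x ^ y, x, y] = 0.25
s2_xor = synergy_s2(p_xor)

# Illustrative case 2: a strictly positive distribution with pair interactions only.
rng = np.random.default_rng(2)
a, b, c = (0.5 * rng.normal(size=(2, 2)) for _ in range(3))
p_pair = np.exp(a[:, :, None] + b[:, None, :] + c[None, :, :])
p_pair /= p_pair.sum()
s2_pair = synergy_s2(p_pair)
```

For the XOR table, all pair marginals are uniform, so the fitted distribution is uniform and the synergy is one bit; for the pair-interaction table, the fitted distribution reproduces p and the synergy vanishes up to numerical error.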
In contrast, the partial information lattice provides a decomposition of the mutual information and not the multi-information. However, a decomposition of the mutual information $MI(X_0 : X_1, \dots, X_n)$ can be achieved in a similar spirit as follows. Let $\binom{X}{k}_0$ be the set of all subsets $A \subseteq \{X_0, \dots, X_n\}$ of cardinality |A| = k that contain $X_0$, and let $\hat{\mathcal{E}}^{(k)}$ be the set of all probability distributions of the form:
$$p(x_0, \dots, x_n) = \prod_{A \in \binom{X}{k}_0} \Psi_A(x_0, \dots, x_n) \cdot \Psi_{[n]}(x_1, \dots, x_n)$$
where the $\Psi_A$ are as above and where $\Psi_{[n]}$ is a function that only depends on $x_1, \dots, x_n$. As above, each $\hat{\mathcal{E}}^{(k)}$ is an exponential family.
We will abbreviate $\hat{q}^{(k)} := p_{\hat{\mathcal{E}}^{(k)}}$. Again, for general k, there is no formula for $\hat{q}^{(k)}$, but for k = 1, one can show that:
$$\hat{q}^{(1)}(x_0, \dots, x_n) = p(X_0 = x_0)\, p(X_1 = x_1, \dots, X_n = x_n)$$
Therefore, $D(p \| \hat{q}^{(1)}) = MI(X_0 : X_1, \dots, X_n)$. Moreover, by the Pythagorean theorem,
$$D(p \| \hat{q}^{(1)}) = D(p \| \hat{q}^{(n)}) + D(\hat{q}^{(n)} \| \hat{q}^{(n-1)}) + \dots + D(\hat{q}^{(2)} \| \hat{q}^{(1)})$$
Thus, we obtain a decomposition of the mutual information MI(X0 : X1, … , Xn).
Again, one can group together all terms except the last term that corresponds to the pair interactions and define:
$$\hat{S}^{(2)}(X_0 : X_1; \dots; X_n) := D(p \| \hat{q}^{(2)}) = D(p \| \hat{q}^{(n)}) + D(\hat{q}^{(n)} \| \hat{q}^{(n-1)}) + \dots + D(\hat{q}^{(3)} \| \hat{q}^{(2)})$$
as a measure of synergy. In this interpretation, synergy is a part of the mutual information $MI(X_0 : X_1, \dots, X_n)$. Using the maximum entropy principle behind exponential families [14], the function $\hat{S}^{(2)}$ can also be expressed as:
$$\hat{S}^{(2)}(S : X; Y) = \max_{q \in \hat{\Delta}_p^{(2)}} H_q(S, Y, X) - H(S, Y, X)$$
where:
$$\hat{\Delta}_p^{(2)} = \{\, r(x_0, \dots, x_n) \;|\; r(x_0, x_i) = p(x_0, x_i) \text{ for all } i = 1, \dots, n \text{ and } r(x_1, \dots, x_n) = p(x_1, \dots, x_n) \,\}$$
denotes the set of all joint distributions r of X0, … , Xn that have the same pair marginals as p and for which, additionally, the marginal distribution for X1, … , Xn is the same as for p.
While the exponential families $\mathcal{E}^{(k)}$ are symmetric in all random variables $X_0, \dots, X_n$, in the definition of $\hat{\mathcal{E}}^{(k)}$, the variable $X_0$ plays a special role. This is reminiscent of the special role of S in the information decomposition framework, when the goal is to decompose the information about S. Thus, also in $\hat{S}^{(2)}$, the variable $X_0$ is special.
There are some relations between the hierarchies $\mathcal{E}^{(1)} \subseteq \mathcal{E}^{(2)} \subseteq \dots \subseteq \mathcal{E}^{(n)}$ and $\hat{\mathcal{E}}^{(1)} \subseteq \hat{\mathcal{E}}^{(2)} \subseteq \dots \subseteq \hat{\mathcal{E}}^{(n)}$.
By definition, $\mathcal{E}^{(i)} \subseteq \hat{\mathcal{E}}^{(i)}$ and thus:
$$D(p \| \hat{q}^{(i)}) = D(p \| \hat{\mathcal{E}}^{(i)}) \le D(p \| \mathcal{E}^{(i)}) = D(p \| q^{(i)})$$
In particular, $S^{(2)}(X_0; \dots; X_n) \ge \hat{S}^{(2)}(X_0 : X_1; \dots; X_n)$. Moreover, $\mathcal{E}^{(n)} = \hat{\mathcal{E}}^{(n)}$, which implies:
$$D(p \| \hat{q}^{(n)}) = D(p \| \hat{\mathcal{E}}^{(n)}) = D(p \| \mathcal{E}^{(n)}) = D(p \| q^{(n)})$$
In particular, for n = 2, this shows $S^{(2)}(S; X; Y) = \hat{S}^{(2)}(S : X; Y)$.
The case n = 2, k = 2 is also the case that we are most interested in later for the following reasons. First, for n = 2, the terms in the partial information lattice have an intuitively clear interpretation. Second, while there are not many examples of full information decompositions for n > 2, there exist at least two proposals for reasonable measures of shared, unique and complementary information [7,8], which allow a direct comparison with measures based on the decompositions using the interaction spaces.
While the symmetric hierarchy of the families $\mathcal{E}^{(k)}$ is classical, to the best of our knowledge, the alternative hierarchy of the families $\hat{\mathcal{E}}^{(k)}$ has not been studied before. We do not want to analyze this second hierarchy in detail here, but we just want to demonstrate that the framework of interaction exponential families is flexible enough to give a nice decomposition of mutual information, which can naturally be compared with the information decomposition framework. In this paper, in any case, we only consider cases where $\mathcal{E}^{(k)} = \hat{\mathcal{E}}^{(k)}$.
It is possible to generalize the definitions of the interaction exponential families to continuous random variables, but there are some technical issues to be solved. For example, the corresponding exponential families will be infinite-dimensional. We will not do this here in detail, since we only need the following observation later: any Gaussian distribution can be described by pair-interactions. Therefore, when p is a multivariate normal distribution, then $q^{(2)} = \hat{q}^{(2)} = p$.

3. Measures of Synergy and Their Properties

Synergy or complementary information is very often considered as a core property of complex systems, being strongly related to “emergence” and the idea of the “whole being more than the sum of its parts”. In this section, we discuss three approaches to formalize this idea. We first introduce a classical function called WholeMinusSum synergy in [6], which reduces to the interaction information or (up to the sign) co-information when n = 2. This function can become negative. It is sensitive to redundancy, as well as synergy, and its sign tells which kind of information dominates. In Section 3.2, we recall the definition of the measure of synergy $\widetilde{CI}$ from [8] that comes from a (bivariate) information decomposition. In Section 3.3, we compare $\widetilde{CI}$ with the synergy defined from the interaction spaces in Section 2.2.

3.1. WholeMinusSum Synergy

WholeMinusSum synergy is the difference between the joint mutual information between the explaining variables and the target variable and the sum of the mutual informations between each single explaining variable and the target. Griffith and Koch [6] trace it back to [18–20]. In the n = 2 case, this reduces to:
$$\begin{aligned} S_{WMS}(S : X; Y) &= MI(S : X, Y) - MI(S : X) - MI(S : Y) \\ &= MI(S : Y | X) - MI(S : Y) \\ &= -H(S, Y, X) + H(X, Y) + H(S, X) + H(S, Y) - H(S) - H(X) - H(Y) \\ &= -CoI(S, X, Y) \end{aligned} \tag{7}$$
with CoI(S, X, Y) being the co-information [21] or interaction information [22]. This measure of synergy was used, e.g., in [1] to study synergy in neural population codes. As one can easily see from Equation (4), for any information decomposition, $S_{WMS}$ is the difference between the complementary and the shared information:
$$S_{WMS}(S : X; Y) = CI(S : X; Y) - SI(S : X; Y)$$
Therefore, the WholeMinusSum synergy is a lower bound for the complementary information in the partial information lattice. It can obviously also become negative, which makes it a deficient measure of synergy. However, it fulfills the condition of strong symmetry, i.e., it is not only invariant with respect to permutation of X and Y, but to permutations of all three arguments.
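The sign behavior of the WholeMinusSum synergy can be seen on three small gates. The following sketch is an illustration only, assuming NumPy, base-2 logarithms and joint tables indexed as p[s, x, y]; the three example distributions (XOR, AND and a copy system with S = X = Y) are illustrative choices.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in bits of a probability table of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def wms_synergy(p):
    """S_WMS(S:X;Y) in bits from a joint table p[s, x, y], via the entropy form."""
    p = np.asarray(p, dtype=float)
    return (-entropy(p)
            + entropy(p.sum(axis=0))         # H(X,Y)
            + entropy(p.sum(axis=2))         # H(S,X)
            + entropy(p.sum(axis=1))         # H(S,Y)
            - entropy(p.sum(axis=(1, 2)))    # H(S)
            - entropy(p.sum(axis=(0, 2)))    # H(X)
            - entropy(p.sum(axis=(0, 1))))   # H(Y)


def gate(f):
    """Joint table p[s, x, y] for S = f(X, Y) with independent fair bits X, Y."""
    p = np.zeros((2, 2, 2))
    for x in (0, 1):
        for y in (0, 1):
            p[f(x, y), x, y] = 0.25
    return p


wms_xor = wms_synergy(gate(lambda x, y: x ^ y))  # purely synergistic system
wms_and = wms_synergy(gate(lambda x, y: x & y))

p_copy = np.zeros((2, 2, 2))                     # S = X = Y, one shared fair bit
p_copy[0, 0, 0] = p_copy[1, 1, 1] = 0.5
wms_copy = wms_synergy(p_copy)                   # purely redundant system
```

The copy system yields a negative value, which illustrates that the function measures the balance between synergy and redundancy and not synergy alone.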

3.2. Synergy from Unique Information

In [8], it was proposed to use the following function as a measure of synergy:
$$\widetilde{CI}(S : X; Y) = MI(S : X, Y) - \min_{q \in \Delta_p} MI_q(S : X, Y)$$
where:
$$\Delta_p = \{\, q(s, x, y) \;|\; q(s, x) = p(s, x) \,\wedge\, q(s, y) = p(s, y) \,\}$$
denotes the set of all joint distributions of S, X, Y that have the same pair marginals as p for the pairs (S, X) and (S, Y). Originally, this function was motivated from considerations about decision problems. The basic idea is that unique information should be observable in the sense that there should be a decision problem in which this unique information is advantageous. One crucial property is the idea that the amount of unique information should only depend on the marginal distributions of the pairs (S, X) and (S,Y), i.e.:
(*) The functions $\widetilde{UI}(S : X \setminus Y)$ and $\widetilde{UI}(S : Y \setminus X)$ are constant on $\Delta_p$.
These thoughts lead to a formula for unique information $\widetilde{UI}$, from which formulas for $\widetilde{SI}$ and the above formula for $\widetilde{CI}$ can be derived. Thus, in particular, $\widetilde{CI}$ is part of a (non-negative) bivariate information decomposition. While it is not easy to see directly that $\widetilde{SI}$ is non-negative, it follows right from the definition that $\widetilde{CI}$ is non-negative.
Heuristically, the formula for $\widetilde{CI}$ also encodes the idea that synergy has to do with pair interactions, here in the form of pair marginals. Namely, the joint distribution is compared with all other distributions that have the same marginals for the pairs (S, X) and (S, Y). In Section 3.3, we will see how this is related to the synergy function $S^{(2)}$ coming from the interaction decomposition.
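For binary variables, the minimization over the set of distributions with fixed (S, X) and (S, Y) marginals can be sketched by brute force. The code below is a rough illustration and not the optimization procedure of [8]: it assumes NumPy, base-2 logarithms, binary S, X and Y stored as p[s, x, y], and a plain grid search. The parametrization uses the fact that, for each value s, a 2×2 conditional table with fixed margins has a single free parameter t = q(x = 0, y = 0 | s).

```python
import itertools

import numpy as np


def entropy(p):
    """Shannon entropy in bits of a probability table of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def mi_s_xy(q):
    """MI_q(S : X, Y) in bits for a joint table q[s, x, y]."""
    return entropy(q.sum(axis=(1, 2))) + entropy(q.sum(axis=0)) - entropy(q)


def conditional_tables(a, b, steps):
    """2x2 tables q(x, y | s) with q(x=0|s) = a and q(y=0|s) = b, on a grid in t."""
    lo, hi = max(0.0, a + b - 1.0), min(a, b)
    count = 1 if hi - lo < 1e-15 else steps
    for t in np.linspace(lo, hi, count):
        table = np.array([[t, a - t], [b - t, 1.0 - a - b + t]])
        yield np.clip(table, 0.0, None)  # remove tiny negative rounding errors


def ci_tilde_binary(p, steps=101):
    """Grid-search estimate of the synergy CI~(S:X;Y) in bits for binary variables."""
    p = np.asarray(p, dtype=float)
    p_s = p.sum(axis=(1, 2))
    families = []
    for s in (0, 1):
        if p_s[s] == 0:
            families.append([np.zeros((2, 2))])
        else:
            a = p[s, 0, :].sum() / p_s[s]   # p(x = 0 | s)
            b = p[s, :, 0].sum() / p_s[s]   # p(y = 0 | s)
            families.append(list(conditional_tables(a, b, steps)))
    best = min(mi_s_xy(np.stack([p_s[0] * t0, p_s[1] * t1]))
               for t0, t1 in itertools.product(*families))
    return mi_s_xy(p) - best


def gate(f):
    """Joint table p[s, x, y] for S = f(X, Y) with independent fair bits X, Y."""
    p = np.zeros((2, 2, 2))
    for x in (0, 1):
        for y in (0, 1):
            p[f(x, y), x, y] = 0.25
    return p


ci_and = ci_tilde_binary(gate(lambda x, y: x & y))
ci_xor = ci_tilde_binary(gate(lambda x, y: x ^ y))
```

The grid search is only meant to make the definition concrete; for larger state spaces, a proper convex optimization routine would be needed.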
The same measure of synergy was proposed in [6], without any operational justification, and generalized to n > 2 variables as follows:
$$\widetilde{CI}(S : X_1; \dots; X_n) = MI(S : X_1, \dots, X_n) - \min_{q \in \Delta_p} MI_q(S : X_1, \dots, X_n)$$
where now:
$$\Delta_p = \{\, q(s, x_1, \dots, x_n) \;|\; q(s, x_i) = p(s, x_i) \text{ for } i = 1, \dots, n \,\}$$

3.3. Synergy from Maximum Entropy Arguments

Quantifying synergy using maximum entropy projections on k-th-order interaction spaces can be viewed as a more direct approach of quantifying the extent that “a system is more than the sum of its parts” [11] than the WholeMinusSum (WMS) synergy discussed above. Surprisingly, we are not aware of any publication using this approach to define explicitly a measure of synergy, but the idea seems to be common and is proposed, for instance, in [2]. Consider the joint probability distribution p(s, x, y). Synergy should quantify dependencies among S, Y, X that cannot be explained by pairwise interactions. Therefore, one considers:
$$S^{(2)}(S; X; Y) = D(p \| \mathcal{E}^{(2)})$$
as a measure of synergy.
In [10], S(1)(S; X; Y) was discussed under the name “connected information” $I_C^{(2)}$, but it was not considered as a measure of synergy. Synergy was measured instead by the WMS synergy measure (7).
Comparing $\widetilde{CI}(S : X; Y)$ and $S^{(2)}(S; X; Y)$, we see that:
  • Both quantities are by definition ≥ 0.
  • $S^{(2)}(S; X; Y)$ is symmetric with respect to permutations of all of its arguments, in contrast to $\widetilde{CI}(S : X; Y)$.
  • $S^{(2)}(S; X; Y) \le \widetilde{CI}(S : X; Y)$, because $\Delta_p^{(2)} \subseteq \Delta_p$ and:
$$\begin{aligned} S^{(2)}(S; X; Y) &= MI(S : X, Y) - \min_{q^{(2)} \in \Delta_p^{(2)}} MI_{q^{(2)}}(S : X, Y) \\ \widetilde{CI}(S : X; Y) &= MI(S : X, Y) - \min_{q \in \Delta_p} MI_q(S : X, Y) \end{aligned}$$
In fact, as shown in [8], any measure CI of complementary information that comes from an information decomposition and that satisfies property (*) must satisfy $\widetilde{CI}(S : X; Y) \le CI(S : X; Y)$, and thus, the inequality:
$$S^{(2)}(S; X; Y) \le CI(S : X; Y)$$
also holds in this more general setting.
However, we will show in the next section that if S(2) (S; X; Y) is considered as a synergy measure in the information decomposition [8], one gets negative values for the corresponding shared information, which we will denote by SI(2)(S; X, Y).

4. Examples

4.1. An Instructive Example: AND

Let X and Y be independent binary random variables with $p(0) = p(1) = \frac{1}{2}$ and S = X AND Y. Because the marginal distributions of the pairs (S, X) and (S, Y) are identical (by symmetry), in this example, there is no unique information [8] (Corollary 8), and therefore, by (5),
$$\widetilde{CI}(S : X; Y) = MI(S : X | Y) = \frac{1}{2} \log 2 = 0.5 \text{ bit}$$
In this example, the co-information is:
$$CoI(S; X; Y) = SI(S : X; Y) - CI(S : X; Y) = MI(S : X) - MI(S : X | Y) = \log 2 - \frac{3}{4} \log 3 \approx -0.1887 \text{ bit}$$
Thus, the shared information is $\widetilde{SI}(S : X; Y) = \frac{3}{2} \log 2 - \frac{3}{4} \log 3 \approx 0.3113$ bit and the WMS synergy is $S_{WMS}(S : X; Y) \approx 0.1887$ bit. On the other hand, in the AND case, the joint probability distribution p(s, x, y) is already fully determined by the marginal distributions p(x, y), p(s, y) and p(s, x); that is, $\Delta_p^{(2)} = \{p\}$ (see, e.g., [10]). Therefore,
$$S^{(2)}(S; X; Y) = 0$$
If we now consider S(2) as a measure CI(2) for the complementary information in the information decomposition (4), we see from (10) that the corresponding shared information becomes negative:
$$SI^{(2)} = CoI + CI^{(2)} = -0.1887 \text{ bits} < 0$$
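The numbers of the AND example can be reproduced directly from the joint table. The following sketch is an illustration only, assuming NumPy, base-2 logarithms and a table indexed as p[s, x, y]; it takes the absence of unique information and the value of the interaction-space synergy (zero) from the text instead of recomputing them.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in bits of a probability table of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


# Joint table p[s, x, y] of the AND gate with independent fair inputs.
p = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        p[x & y, x, y] = 0.25

h_s = entropy(p.sum(axis=(1, 2)))
h_x = entropy(p.sum(axis=(0, 2)))
h_y = entropy(p.sum(axis=(0, 1)))
h_sx = entropy(p.sum(axis=2))
h_sy = entropy(p.sum(axis=1))
h_xy = entropy(p.sum(axis=0))
h_sxy = entropy(p)

mi_sx = h_s + h_x - h_sx                      # MI(S:X)
mi_sx_given_y = h_sy + h_xy - h_sxy - h_y     # MI(S:X|Y)

ci_tilde = mi_sx_given_y      # by (5), since there is no unique information
si_tilde = mi_sx              # MI(S:X) = UI + SI with UI = 0
co_info = mi_sx - mi_sx_given_y
ci_2 = 0.0                    # value of S^(2) for AND, taken from the text
si_2 = co_info + ci_2         # shared information implied by using S^(2) as CI
```

The last line makes the failure of local positivity explicit: inserting a vanishing synergy into the decomposition forces the shared information to equal the negative co-information.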

4.2. Gaussian Random Variables: When Should Synergy Vanish?

Let p(s, x, y) be a multivariate Gaussian distribution. As mentioned above, S(2)(S; X; Y) = 0.
What about $\widetilde{CI}$? As shown by [12], one of the two unique pieces of information $\widetilde{UI}$ always vanishes. Let $r_{SX}$ and $r_{SY}$ denote the correlation coefficients between S and X and S and Y, respectively. If $|r_{SX}| \le |r_{SY}|$, then X has no unique information about S, i.e., $\widetilde{UI}(S : X \setminus Y) = 0$, and therefore, $\widetilde{CI}(S : X; Y) = MI(S : X | Y)$. This was shown in [12] using explicit computations with semi-definite matrices. Here, we give a more conceptual argument involving simple properties of Gaussian random variables and general properties of $\widetilde{UI}$.
For any $\rho \in \mathbb{R}$, let $X_\rho = Y + \rho\epsilon$, where ϵ denotes Gaussian noise, which is independent of X, Y and S. Then, $X_\rho$ is independent of S given Y, and so, $|r_{SX_\rho}| \le |r_{SY}|$. It is easy to check that $r_{SX_\rho}$ is a continuous function of ρ, with $r_{SX_0} = r_{SY}$ and $r_{SX_\rho} \to 0$ as ρ → ∞. In particular, there exists a value $\rho_0 \in \mathbb{R}$, such that $r_{SX_{\rho_0}} = r_{SX}$. Let $X' = \frac{\sigma_X}{\sigma_{X_{\rho_0}}} X_{\rho_0}$. Then, the pair (X′, S) has the same distribution as the pair (X, S) (since X′ has the same variance as X and since the two pairs have the same correlation coefficient). Thus, by property (*), $\widetilde{UI}(S : X \setminus Y) = \widetilde{UI}(S : X' \setminus Y)$. Moreover, since MI(S : X′|Y) = 0, it follows from (5) that $\widetilde{UI}(S : X' \setminus Y) = 0$.
In summary, assuming that $|r_{SX}| \le |r_{SY}|$, we arrive at the following formulas:
$$
\begin{aligned}
\widetilde{SI}(S : X; Y) &= MI(S : X) = -\frac{1}{2}\log\left(1 - r_{SX}^2\right) \\
\widetilde{UI}(S : X \setminus Y) &= 0 \\
\widetilde{UI}(S : Y \setminus X) &= MI(S : Y) - MI(S : X) = \frac{1}{2}\log\left(\frac{1 - r_{SX}^2}{1 - r_{SY}^2}\right) \\
\widetilde{CI}(S : X; Y) &= MI(S : XY) - MI(S : Y) = MI(S : X|Y) = \frac{1}{2}\log\left(\frac{(1 - r_{SY}^2)(1 - r_{XY}^2)}{1 - (r_{SX}^2 + r_{SY}^2 + r_{XY}^2) + 2 r_{SX} r_{SY} r_{XY}}\right)
\end{aligned}
$$
Thus, for Gaussian random variables, $\widetilde{SI}$ agrees with $I_{\mathrm{mmi}}$. In fact, any information decomposition according to the PI lattice satisfies $SI(S : X; Y) \le I_{\mathrm{mmi}}(S : X; Y)$ [4]. Moreover, any information decomposition that satisfies (*) satisfies $SI(S : X; Y) \ge \widetilde{SI}(S : X; Y)$ (Lemma 3 in [8]), and thus, all such information decompositions agree in the Gaussian case (this was first observed by [12]). In [12], it is shown that this result generalizes to the case where X and Y are Gaussian random vectors. The proof of this result basically shows that the above argument also works in this more general case.
The fact that, for Gaussian distributions, all bivariate information decompositions (that satisfy (*)) agree with the $I_{\mathrm{mmi}}$ decomposition suggests that the information decomposition based on $I_{\mathrm{mmi}}$ may also be sensible for Gaussian distributions for larger values of n.
Here, we do not pursue this line of thought. Instead, we want to provide another interpretation of synergy CI(S : X; Y) in the Gaussian case. Based on the apparent simplicity of Gaussians where all information measures are obtained from the correlation coefficients, one could be led to the conclusion that there should be no synergy (recall that S(2)(S; X; Y) vanishes). On the other hand, $S_{WMS}(S : X; Y) = CI(S : X; Y) - SI(S : X; Y)$ can be positive for Gaussian variables, and thus, synergy must be positive, as well (see [12]; for a simple example, choose $0 < r_{SX} = r_{SY} = r < \frac{1}{\sqrt{2}}$ and $r_{XY} = 0$; then $S_{WMS}(S : X; Y) = \frac{1}{2}\log\frac{(1 - r^2)^2}{1 - 2r^2} > 0$).
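The Gaussian formulas for the case |rSX| ≤ |rSY| are easy to evaluate numerically. The sketch below is an illustration only: the function name `gaussian_decomposition` and the dictionary keys are hypothetical, and logarithms are taken to base 2 so that all values are in bits.

```python
from math import log2

def gaussian_decomposition(r_sx, r_sy, r_xy):
    """Bivariate decomposition (in bits) for jointly Gaussian S, X, Y.

    Uses the closed-form expressions for the case |r_sx| <= |r_sy| and
    requires a positive definite correlation matrix.
    """
    if abs(r_sx) > abs(r_sy):
        raise ValueError("formulas assume |r_sx| <= |r_sy|; swap X and Y")
    # Determinant of the 3x3 correlation matrix of (S, X, Y).
    det = 1 - (r_sx**2 + r_sy**2 + r_xy**2) + 2 * r_sx * r_sy * r_xy
    if r_xy**2 >= 1 or det <= 0:
        raise ValueError("correlation matrix must be positive definite")
    mi_sx = -0.5 * log2(1 - r_sx**2)
    mi_sy = -0.5 * log2(1 - r_sy**2)
    ci = 0.5 * log2((1 - r_sy**2) * (1 - r_xy**2) / det)
    return {"SI": mi_sx, "UI_X": 0.0, "UI_Y": mi_sy - mi_sx, "CI": ci}

# Example with equal correlations to S and uncorrelated inputs.
d = gaussian_decomposition(0.5, 0.5, 0.0)
s_wms = d["CI"] - d["SI"]  # WholeMinusSum synergy
print(d, s_wms)
```

For r = 0.5 and rXY = 0, the four terms add up to MI(S : XY) = 0.5 bit, and the difference CI − SI is positive, in line with the example in the text.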
To better understand this situation, we regress S on X and Y, i.e., we write S = αX + βY + σϵ for some coefficients α, β and normally distributed noise ϵ that is independent of X and Y. Let us again assume that |rSX| < |rSY|. From CI(S : X; Y) = MI(S : X|Y), we see that synergy vanishes if and only if S and X are conditionally independent given Y. Since all distributions are Gaussian and information measures do not depend on the mean values, this condition can be checked by computing the conditional variances $\mathrm{Var}[S|X, Y] = \sigma^2$ and $\mathrm{Var}[S|Y] = \alpha^2\,\mathrm{Var}[X|Y] + \sigma^2$. These two variances agree, and thus, S is conditionally independent of X given Y, if either Var[X|Y] = 0 (i.e., X is a function of Y and effectively the same variable) or α = 0. Positive synergy arises whenever X is not a function of Y and contributes to S with a non-trivial coefficient α ≠ 0. This is a very reasonable interpretation and shows that the synergy measure CI(S : X; Y) nicely captures the intuition of X and Y acting together to bring about S.
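The regression picture translates directly into a formula for the synergy, since for Gaussians MI(S : X|Y) is half the log-ratio of the two conditional variances. The following sketch is an illustration under the same assumption |rSX| < |rSY|; the function name `regression_synergy` and its parameter names are hypothetical.

```python
from math import log2

def regression_synergy(alpha, var_x_given_y, sigma):
    """CI(S:X;Y) = MI(S:X|Y) in bits for S = alpha*X + beta*Y + sigma*eps.

    The coefficient beta does not appear: conditioning on Y removes its
    contribution from both conditional variances.
    """
    if sigma <= 0 or var_x_given_y < 0:
        raise ValueError("need sigma > 0 and Var[X|Y] >= 0")
    var_s_given_xy = sigma**2
    var_s_given_y = alpha**2 * var_x_given_y + sigma**2
    return 0.5 * log2(var_s_given_y / var_s_given_xy)

print(regression_synergy(0.0, 1.0, 1.0))  # alpha = 0: no synergy
print(regression_synergy(1.0, 0.0, 1.0))  # X a function of Y: no synergy
print(regression_synergy(1.0, 1.0, 1.0))  # 0.5 bit
```

The synergy grows with the ratio α² Var[X|Y] / σ², i.e., with the part of S that X explains beyond what Y already explains.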

5. Discussion and Conclusions

We think that using maximum entropy projections on k-th-order interaction spaces can be viewed as a direct approach to quantifying the extent to which “a system is more than the sum of its parts” [11]. According to this view, synergy requires and manifests itself in the presence of higher-order interactions, which can be quantified using projections on the exponential families of k-th order interactions. While this idea is not new, it has, to our knowledge, not been explicitly formulated as a definition of synergy before. However, the synergy measure S(2) based on the projection on the exponential family of distributions with only pairwise interactions is not compatible with the partial information lattice framework, because it does not yield a non-negative information decomposition, as we have shown in the examples. The reason why we believe that it is important to have a complete non-negative information decomposition is that, in addition to a formula for synergy, it would give us an interpretation of the “remainder” MI(S : X1, …, Xn) − Synergy. In the bivariate case, $\widetilde{CI}(S : X; Y)$ provides a synergy measure, which complies with the information decomposition.
One could argue that the vanishing S(2) for multivariate Gaussians reflects their “simplicity” in the sense that they can be transformed into independent sub-processes by a linear transformation. In contrast, this simplicity is reflected in the information decomposition by the fact that one of the two unique informations always vanishes. Since the WholeMinusSum synergy (or co-information) can be positive for Gaussian distributions, it is not possible to define an information decomposition for Gaussian variables that sets the synergy to zero.
Overall, our results suggest that intuition about synergy should be based on information processing rather than higher-order dependencies. While higher-order dependencies, as captured by the measure S(2)(S : X; Y), are part of the synergy, i.e., S(2)(S : X; Y) ≤ CI(S : X; Y), they are not required, as demonstrated in our AND example and the case of Gaussian random variables. In particular, the latter example leads to the intuitive insight that synergy arises when multiple inputs X, Y are processed simultaneously to compute the target S. Interestingly, the nature of this processing is less important and can be rather simple, i.e., the output is literally just “the sum of its inputs”. In this sense, we believe that our negative result, regarding the non-negativity of S(2)(S : X; Y), provides important insights into the nature of synergy in the partial information decomposition. It is up to future work to develop a better understanding of the relationship between the presence of higher-order dependencies and synergy.

Acknowledgments

Eckehard Olbrich has received funding from the European Community’s Seventh Framework Program under Grant Agreement No. 318723 (Mathematics of Multilevel Anticipatory Complex Systems). Eckehard Olbrich also acknowledges interesting discussions with participants, in particular Peter Grassberger and Ilya Nemenman at the seminar and workshop on Causality, Information Transfer and Dynamical Networks (CIDNET14) in Dresden, Germany, 12 May–20 June 2014, which led to some of the ideas in this paper regarding information decompositions based on maximum entropy arguments. We also thank the organizers of the workshop Information Processing in Complex Systems (IPCS14) in Lucca, Italy, 24 September 2014, for having the opportunity to present and discuss the first version of these ideas.
PACS classifications: 89.70.Cf; 89.75.-k
MSC classifications: 94A15; 94A17

Author Contributions

The research was initiated by Eckehard Olbrich and carried out by all authors. The manuscript was written by Eckehard Olbrich, Johannes Rauh and Nils Bertschinger. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Schneidman, E.; Bialek, W.; Berry, M.J.I. Synergy, redundancy, and independence in population codes. J. Neurosci. 2003, 23, 11539–11553.
  2. Margolin, A.A.; Wang, K.; Califano, A.; Nemenman, I. Multivariate dependence and genetic networks inference. IET Syst. Biol. 2010, 4, 428–440.
  3. Bertschinger, N.; Olbrich, E.; Ay, N.; Jost, J. Autonomy: An information theoretic perspective. Biosystems 2008, 91, 331–345.
  4. Williams, P.; Beer, R. Nonnegative Decomposition of Multivariate Information 2010, arXiv, 1004.2515v1.
  5. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J. Shared Information—New Insights and Problems in Decomposing Information in Complex Systems, Proceedings of the European Conference on Complex Systems 2012 (ECCS’12), Brussels, Belgium, 3–7 September 2012; pp. 251–269.
  6. Griffith, V.; Koch, C. Quantifying Synergistic Mutual Information. In Guided Self-Organization: Inception; Prokopenko, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2014; Volume 9, Emergence, Complexity and Computation Series; pp. 159–190.
  7. Harder, M.; Salge, C.; Polani, D. Bivariate measure of redundant information. Phys. Rev. E 2013, 87, 012130.
  8. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J.; Ay, N. Quantifying unique information. Entropy 2014, 16, 2161–2183.
  9. Amari, S.I. Information geometry on hierarchy of probability distributions. IEEE Trans. Inf. Theory 2001, 47, 1701–1711.
  10. Schneidman, E.; Still, S.; Berry, M.J., 2nd; Bialek, W. Network information and connected correlations. Phys. Rev. Lett. 2003, 91, 238701.
  11. Ay, N.; Olbrich, E.; Bertschinger, N.; Jost, J. A geometric approach to complexity. Chaos 2011, 21, 037103.
  12. Barrett, A.B. Exploration of synergistic and redundant information sharing in static and dynamical Gaussian systems. Phys. Rev. E 2015, 91, 052802.
  13. Rauh, J.; Kahle, T.; Ay, N. Support sets of exponential families and oriented matroids. Int. J. Approx. Reason. 2011, 52, 613–626.
  14. Csiszár, I.; Shields, P.C. Information Theory and Statistics: A Tutorial. Found. Trends Commun. Inf. Theory 2004, 1, 417–528.
  15. Studený, M.; Vejnarová, J. The Multiinformation Function as a Tool for Measuring Stochastic Dependence. In Learning in Graphical Models; Jordan, M.I., Ed.; Springer: Dordrecht, The Netherlands, 1998; Volume 89, NATO ASI Series; pp. 261–297.
  16. Watanabe, S. Information Theoretical Analysis of Multivariate Correlation. IBM J. Res. Dev. 1960, 4, 66–82.
  17. Kahle, T.; Olbrich, E.; Jost, J.; Ay, N. Complexity Measures from Interaction Structures. Phys. Rev. E 2009, 79, 026201.
  18. Gawne, T.J.; Richmond, B.J. How independent are the messages carried by adjacent inferior temporal cortical neurons? J. Neurosci. 1993, 13, 2758–2771.
  19. Gat, I.; Tishby, N. Synergy and Redundancy among Brain Cells of Behaving Monkeys. In Advances in Neural Information Processing Systems 11; Kearns, M., Solla, S., Cohn, D., Eds.; MIT Press: Cambridge, MA, USA, 1999; pp. 111–117.
  20. Chechik, G.; Globerson, A.; Anderson, M.J.; Young, E.D.; Nelken, I.; Tishby, N. Group Redundancy Measures Reveal Redundancy Reduction in the Auditory Pathway. In Advances in Neural Information Processing Systems 14; Dietterich, T., Becker, S., Ghahramani, Z., Eds.; MIT Press: Cambridge, MA, USA, 2002; pp. 173–180.
  21. Bell, A.J. The Co-Information Lattice, Proceedings of the 4th International Workshop on Independent Component Analysis and Blind Signal Separation, Nara, Japan, 1–4 April 2003.
  22. McGill, W.J. Multivariate information transmission. Psychometrika 1954, 19, 97–116.
Figure 1. (a) The PI lattice for two random variables; (b) the PI lattice for n = 3. For brevity, every antichain is indicated by juxtaposing the components of its elements, separated by bars |. For example, 12|13|23 stands for the antichain {X1, X2}, {X1, X3}, {X2, X3}.
Table 1. Joint probabilities for the AND example and corresponding values of selected entropies.
X   Y   S   p
0   0   0   1/4
0   1   0   1/4
1   0   0   1/4
1   1   1   1/4

$H(S) = 2\log 2 - \frac{3}{4}\log 3$
$H(S, X) = \frac{3}{2}\log 2$
$H(S, X, Y) = 2\log 2$
$MI(S : X) = \frac{3}{2}\log 2 - \frac{3}{4}\log 3$
$MI(S : X|Y) = \frac{1}{2}\log 2$