Information Decomposition and Synergy

Recently, a series of papers addressed the problem of decomposing the information of two random variables into shared information, unique information and synergistic information. Several measures were proposed, although still no consensus has been reached. Here, we compare these proposals with an older approach to define synergistic information based on the projections on exponential families containing only up to k-th order interactions. We show that these measures are not compatible with a decomposition into unique, shared and synergistic information if one requires that all terms are always non-negative (local positivity). We illustrate the difference between the two measures for multivariate Gaussians.


Introduction
Studying a complex system usually involves figuring out how different parts of the system interact with each other. If two processes, described by random variables $X$ and $Y$, interact with each other to bring about a third one, $S$, it is natural to ask for the contribution of the single processes. We might distinguish unique contributions of $X$ and $Y$ from redundant ones. Additionally, there might be a component that can be produced only by $X$ and $Y$ acting together: this is what we will call synergy in the following. Attempts to measure synergy have been undertaken in several fields. When investigating neural codes, $S$ is the stimulus, and one asks how the information about the stimulus is encoded in neural representations $X$ and $Y$ [1]. When studying gene regulation in systems biology, $S$ could be the target gene, and one might ask for synergy between transcription factors $X$ and $Y$ [2]. For the behavior of an autonomous system $S$, one could ask to what extent it is influenced by its own state history $X$ or the environment $Y$ [3].
Williams and Beer proposed the partial information lattice as a framework to achieve such an information decomposition starting from the redundant part, i.e., the shared information. It is based on a list of axioms that any reasonable measure for shared information should fulfill [4]. The lattice alone, however, does not determine the actual values of the different components, but just the structure of the decomposition. In the bivariate case, there are four functions (redundancy, synergy and unique information of $X$ and $Y$, respectively), related by three linear conditions. Thus, to complete the theory, it suffices to provide a definition for one of these functions. In [4], Williams and Beer also proposed a measure $I_{\min}$ for shared information. This measure $I_{\min}$ was, however, criticized as unintuitive [5,6], and several alternatives were proposed [7,8], but only for the bivariate case so far.
In this paper, we do not want to propose another measure. Instead, we want to relate the recent work on information decomposition to work on information decompositions based on projections on exponential families containing only up to k-th order interactions [2,9-11]. We focus on the synergy aspect and compare both approaches for two instructive examples: the AND gate and multivariate Gaussian distributions. We start by reviewing the construction of the partial information lattice by Williams and Beer [4] and discussing the terms for the bivariate case in more detail. In particular, we show how synergy appears in this framework and how it is related to other information measures. In Section 2.2, we recall the exponential families of k-th order interactions and the corresponding projections and how they can be used to decompose information. In Section 3, we provide the definitions of specific synergy measures, on the one hand, in the framework of the partial information lattice, and on the other hand, in the framework of interaction spaces, and discuss their properties. In Section 4, we compare the two measures for specific examples and conclude the paper by discussing the significance of the difference between the two measures for analyzing complex systems.

Information Decomposition
Let $X_1, \dots, X_n, S$ be random variables. We are mostly interested in two settings. In the discrete setting, all random variables have finite state spaces. In the Gaussian setting, all random variables have continuous state spaces, and their joint distribution is a multivariate Gaussian.
For discrete random variables, information-theoretic quantities, such as entropy and mutual information, are canonically defined. For example, the entropy of a discrete random variable $X$ is given by $H(X) = -\sum_x p(x) \log p(x)$, and the mutual information of two discrete random variables is:
$$MI(X : Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}.$$
The conditional mutual information is defined accordingly as:
$$MI(X : Y \mid Z) = \sum_{x,y,z} p(x,y,z) \log \frac{p(x,y \mid z)}{p(x \mid z)\,p(y \mid z)}.$$
For continuous random variables, there is no canonical entropy function. Instead, there is differential entropy, which is computed with respect to some reference measure $dx$:
$$H(X) = -\int p(x) \log p(x)\, dx,$$
where $p$ now denotes the probability density of $X$ with respect to $dx$. Taking the Lebesgue measure, the entropy of an $m$-dimensional Gaussian random vector with covariance matrix $\Sigma_X$ is given by:
$$H(X) = \frac{1}{2} \log |\Sigma_X| + \frac{m}{2} \log (2\pi e),$$
where $|\Sigma_X|$ denotes the determinant of $\Sigma_X$. This entropy is not invariant under coordinate transformations. In fact, if $A \in \mathbb{R}^{m \times m}$ is invertible, then the covariance matrix of $AX$ is given by $A \Sigma_X A^t$, and so the entropy of $AX$ is given by:
$$H(AX) = \frac{1}{2} \log |A \Sigma_X A^t| + \frac{m}{2} \log (2\pi e) = H(X) + \log |\det A|.$$
In contrast, the mutual information of continuous random variables does not depend on the choice of a reference measure. The relation $MI(X : Y) = H(X) + H(Y) - H(X, Y)$ shows that, for Gaussian random vectors with covariance matrices $\Sigma_X$, $\Sigma_Y$ and with a joint multivariate Gaussian distribution with joint covariance matrix $\Sigma_{X,Y}$,
$$MI(X : Y) = \frac{1}{2} \log \frac{|\Sigma_X|\,|\Sigma_Y|}{|\Sigma_{X,Y}|},$$
and it is easy to check directly that this is independent of linear transformations of $X$ and $Y$ (of course, here, one should not apply a linear transformation to the total vector $(X, Y)$ that mixes components of $X$ and $Y$).
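The contrast between differential entropy and mutual information can be checked numerically. The following is a minimal sketch for two scalar Gaussian variables, using natural logarithms; the covariance values and the scaling factors `a` and `b` are arbitrary illustrative choices, not taken from the text.

```python
import math

def gaussian_entropy(cov_det, m):
    """Differential entropy (nats) of an m-dimensional Gaussian,
    given the determinant of its covariance matrix."""
    return 0.5 * math.log(cov_det) + 0.5 * m * math.log(2 * math.pi * math.e)

def det2(c):
    """Determinant of a 2x2 matrix given as nested lists."""
    return c[0][0] * c[1][1] - c[0][1] * c[1][0]

def gaussian_mi(cov):
    """MI(X:Y) in nats for scalar X, Y with joint 2x2 covariance matrix cov."""
    return 0.5 * math.log(cov[0][0] * cov[1][1] / det2(cov))

def rescale(cov, a, b):
    """Covariance matrix of (aX, bY): a linear map that does not mix X and Y."""
    return [[a * a * cov[0][0], a * b * cov[0][1]],
            [a * b * cov[1][0], b * b * cov[1][1]]]

cov = [[1.0, 0.6], [0.6, 2.0]]   # illustrative joint covariance of (X, Y)
a, b = 3.0, -0.5                 # illustrative scaling factors
cov_scaled = rescale(cov, a, b)

h_x = gaussian_entropy(cov[0][0], 1)
h_ax = gaussian_entropy(cov_scaled[0][0], 1)
mi = gaussian_mi(cov)
mi_scaled = gaussian_mi(cov_scaled)

print("H(aX) - H(X) =", h_ax - h_x, " log|a| =", math.log(abs(a)))
print("MI before and after rescaling:", mi, mi_scaled)
```

The entropy shifts by $\log|a|$ under the rescaling, while the mutual information is unchanged.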

Partial Information Lattice
We want to analyze how the information that $X_1, \dots, X_n$ have about $S$ is distributed among $X_1, \dots, X_n$. In Shannon's theory of information, the total amount of information about $S$ contained in $X_1, \dots, X_n$ is quantified by the mutual information:
$$MI(S : X_1, \dots, X_n) = \sum_{s, x_1, \dots, x_n} p(s, x_1, \dots, x_n) \log \frac{p(s, x_1, \dots, x_n)}{p(s)\,p(x_1, \dots, x_n)}.$$
We are looking for a way to write $MI(S : X_1, \dots, X_n)$ as a sum of non-negative functions with a good interpretation in terms of how the information is distributed, e.g., redundantly or synergistically, among $X_1, \dots, X_n$. For example, as we have mentioned in the Introduction and as we will see later, several suggestions have been made to measure the total synergy of $X_1, \dots, X_n$ in terms of a function $\mathrm{Synergy}(S : X_1; \dots; X_n)$. When starting with such a function, the idea of the information decomposition is to further decompose the difference:
$$MI(S : X_1, \dots, X_n) - \mathrm{Synergy}(S : X_1; \dots; X_n) \tag{2}$$
as a sum of non-negative functions. The additional advantage of such a complete information decomposition would be to give a better interpretation of the difference (2), apart from the tautological interpretation that it just measures "everything but the synergy." Throughout the paper, we will use the following notation: the left argument of the information quantities, the target variable $S$, is separated by a colon from the right arguments. The semicolon separates the different arguments on the right side, while comma-separated random variables are treated as a single vector-valued argument.
When looking for such an information decomposition, the first question is what terms to expect. In the case $n = 2$, this may seem quite easy, and it seems to be common sense to expect a decomposition of the form:
$$MI(S : X_1, X_2) = SI(S : X_1; X_2) + UI(S : X_1 \setminus X_2) + UI(S : X_2 \setminus X_1) + CI(S : X_1; X_2) \tag{3}$$
into four terms corresponding to the redundant (or shared) information $SI(S : X_1; X_2)$, the unique information $UI(S : X_1 \setminus X_2)$ and $UI(S : X_2 \setminus X_1)$ of $X_1$ and $X_2$, respectively, and the synergistic (or complementary) information $CI(S : X_1; X_2)$.
However, when $n > 2$, it seems less clear in which different ways $X_1, \dots, X_n$ may interact with each other, combining redundant, unique and synergistic effects.
As a solution, Williams and Beer proposed the partial information framework. We explain the idea only briefly here and refer to [4] for more detailed explanations. The basic idea is to construct such a decomposition purely in terms of a function for shared information $I_\cap(S : X_1; \dots; X_n)$ that measures the redundant information about $S$ contained in $X_1, \dots, X_n$. Clearly, such a function should be symmetric under permutations of $X_1, \dots, X_n$. In a second step, $I_\cap$ is also used to measure the redundant information $I_\cap(S : A_1; \dots; A_k)$ about $S$ contained in combinations $A_1, \dots, A_k$ of the original random variables (that is, $A_1, \dots, A_k$ are random vectors whose components are among $\{X_1, \dots, X_n\}$). Moreover, Williams and Beer proposed that $I_\cap$ should satisfy the following monotonicity property:
$$I_\cap(S : A_1; \dots; A_k; A_{k+1}) \le I_\cap(S : A_1; \dots; A_k), \quad \text{with equality if } A_i \subseteq A_{k+1} \text{ for some } i \le k.$$
The monotonicity property shows that it suffices to consider the function $I_\cap$ in the case where $A_1, \dots, A_k$ form an antichain; that is, $A_i \not\subseteq A_j$ for all $i \ne j$. The set of antichains is partially ordered by the relation:
$$(B_1, \dots, B_l) \preceq (A_1, \dots, A_k) \iff \text{for each } i = 1, \dots, k \text{ there exists } j \text{ such that } B_j \subseteq A_i,$$
and, again by the monotonicity property, $I_\cap$ is a monotone function with respect to this partial order. This partial order actually makes the set of antichains into a lattice.
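To make the structure of the lattice concrete, the following minimal sketch enumerates all antichains of non-empty subsets of $\{1, \dots, n\}$ by brute force and implements the partial order described above. The function names `antichains` and `precedes` are illustrative choices; the brute-force search is only practical for very small $n$.

```python
from itertools import combinations

def antichains(n):
    """All non-empty antichains of non-empty subsets of {1, ..., n}.
    Each antichain is returned as a tuple of frozensets."""
    elements = range(1, n + 1)
    subsets = [frozenset(c) for r in range(1, n + 1)
               for c in combinations(elements, r)]
    result = []
    for size in range(1, len(subsets) + 1):
        for family in combinations(subsets, size):
            # Antichain condition: no member is contained in another member.
            if all(not a <= b for a in family for b in family if a != b):
                result.append(family)
    return result

def precedes(beta, alpha):
    """Partial order on antichains: beta precedes alpha if every member
    of alpha contains some member of beta."""
    return all(any(b <= a for b in beta) for a in alpha)

lattice2 = antichains(2)
lattice3 = antichains(3)
print(len(lattice2), "antichains for n = 2")
print(len(lattice3), "antichains for n = 3")

# The antichain of singletons lies below every other antichain, and the
# antichain consisting of the full set lies above every other antichain.
bottom3 = tuple(frozenset({i}) for i in (1, 2, 3))
top3 = (frozenset({1, 2, 3}),)
print(all(precedes(bottom3, alpha) for alpha in lattice3))
print(all(precedes(alpha, top3) for alpha in lattice3))
```

For $n = 2$, the four antichains correspond to the four terms of the bivariate decomposition (3).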
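If $(B_1, \dots, B_l) \preceq (A_1, \dots, A_k)$, then the difference $I_\cap(S : A_1; \dots; A_k) - I_\cap(S : B_1; \dots; B_l)$ quantifies the information contained in all $A_i$, but not contained in some $B_j$. The idea of Williams and Beer can be summarized by saying that all information can be classified according to within which antichains it is contained. Thus, the third step is to write:
$$I_\cap(S : A_1; \dots; A_k) = \sum_{(B_1, \dots, B_l) \preceq (A_1, \dots, A_k)} I_\partial(S : B_1; \dots; B_l),$$
where the function $I_\partial$ is uniquely defined as the Möbius transform of $I_\cap$ on the lattice of antichains.
For example, the PI lattices for $n = 2$ and $n = 3$ are given in Figure 1. For $n = 2$, it is easy to make the connection with (3): The partial measures are:
$$I_\partial(S : (X_1 X_2)) = CI(S : X_1; X_2), \quad I_\partial(S : X_1) = UI(S : X_1 \setminus X_2), \quad I_\partial(S : X_2) = UI(S : X_2 \setminus X_1), \quad I_\partial(S : X_1; X_2) = SI(S : X_1; X_2),$$
and the redundancy measure satisfies:
$$\begin{aligned}
I_\cap(S : (X_1 X_2)) &= MI(S : X_1, X_2) = SI(S : X_1; X_2) + UI(S : X_1 \setminus X_2) + UI(S : X_2 \setminus X_1) + CI(S : X_1; X_2),\\
I_\cap(S : X_1) &= MI(S : X_1) = SI(S : X_1; X_2) + UI(S : X_1 \setminus X_2),\\
I_\cap(S : X_2) &= MI(S : X_2) = SI(S : X_1; X_2) + UI(S : X_2 \setminus X_1),\\
I_\cap(S : X_1; X_2) &= SI(S : X_1; X_2).
\end{aligned} \tag{4}$$
From (4) and the chain rule for the mutual information:
$$\begin{aligned}
MI(S : X_1 \mid X_2) &= UI(S : X_1 \setminus X_2) + CI(S : X_1; X_2),\\
MI(S : X_2 \mid X_1) &= UI(S : X_2 \setminus X_1) + CI(S : X_1; X_2).
\end{aligned} \tag{5}$$
Even if $I_\cap$ is non-negative (as it should be as an information quantity), it is not immediate that the function $I_\partial$ is also non-negative. This additional requirement was called local positivity in [5].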
While the PI lattice is a beautiful framework, so far, there has been no convincing proposal of how the function $I_\cap$ should be defined. There have been some proposals of functions $I_\cap(S : X_1; X_2)$ with up to two arguments, so-called bivariate information decompositions [7,8], but so far, only two general information decompositions are known. Williams and Beer defined a function $I_{\min}$ that satisfies local positivity, but, as mentioned above, it was found to give unintuitive values in many examples [5,6]. In [5], $I_{\min}$ was compared with the function:
$$I_{MMI}(S : A_1; \dots; A_k) = \min_{i} MI(S : A_i),$$
which was called minimum mutual information (MMI) in [12] (originally, it was denoted by $I_I$ in [5]). This function has many nice mathematical properties, including local positivity. However, $I_{MMI}$ clearly does not have the right interpretation as measuring the shared information, since $I_{MMI}$ only compares the different amounts of information of $S$ and $A_i$, without checking whether the measured information is really the "same" information [5]. However, for Gaussian random variables, $I_{MMI}$ might actually lead to a reasonable information decomposition (as discussed in [12] for the case $n = 2$).

Figure 1. (a) The PI lattice for two random variables; (b) the PI lattice for $n = 3$. For brevity, every antichain is indicated by juxtaposing the components of its elements, separated by bars |. For example, 12|13|23 stands for the antichain $\{X_1, X_2\}, \{X_1, X_3\}, \{X_2, X_3\}$.
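Since the bivariate lattice has four terms related by three linear conditions, fixing the shared information determines the remaining terms. The following minimal sketch illustrates this with the minimum mutual information as the redundancy function, applied to the AND gate with independent uniform inputs. The helper names are illustrative, and the choice of $I_{MMI}$ here only serves to show the bookkeeping; as noted above, it is not claimed to have the right interpretation for discrete variables.

```python
import math
from collections import defaultdict

def marginal(p, idx):
    """Marginal distribution over the coordinates listed in idx."""
    out = defaultdict(float)
    for state, prob in p.items():
        out[tuple(state[i] for i in idx)] += prob
    return out

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def mi(p, a, b):
    """Mutual information in bits between coordinate groups a and b."""
    return (entropy(marginal(p, a)) + entropy(marginal(p, b))
            - entropy(marginal(p, a + b)))

# Joint distribution of (S, X, Y) for S = X AND Y with uniform inputs.
p_and = {(x & y, x, y): 0.25 for x in (0, 1) for y in (0, 1)}

mi_sx = mi(p_and, (0,), (1,))
mi_sy = mi(p_and, (0,), (2,))
mi_sxy = mi(p_and, (0,), (1, 2))

# Redundancy from minimum mutual information, then the three linear
# conditions of the bivariate decomposition fix the other terms.
si = min(mi_sx, mi_sy)
ui_x = mi_sx - si
ui_y = mi_sy - si
ci = mi_sxy - si - ui_x - ui_y

print("SI =", si, "UI_X =", ui_x, "UI_Y =", ui_y, "CI =", ci)
```

By construction, the four terms are non-negative and sum to $MI(S : X, Y)$.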

Interaction Spaces
An alternative approach to quantify synergy comes from the idea that synergy among interacting systems has to do with interactions beyond simple pair interactions. We slightly change the notation and now analyze the interaction of $n + 1$ random variables $X_0, X_1, \dots, X_n$. Later, we will put $X_0 = S$ in order to compare the setting of interaction spaces with the setting of information decompositions.
For simplicity, we restrict ourselves here to the discrete setting. Let $\mathcal{X}_k$ be the set of all subsets $A \subseteq \{X_0, \dots, X_n\}$ of cardinality $|A| = k$. The exponential family of k-th order interactions $E^{(k)}$ of random variables $X_0, X_1, \dots, X_n$ consists of all distributions of the form:
$$p(x_0, \dots, x_n) = \prod_{A \in \mathcal{X}_k} \Psi_A(x_0, \dots, x_n),$$
where $\Psi_A$ is a strictly positive function that only depends on those $x_i$ with $X_i \in A$. Taking the logarithm, this is equivalent to saying that:
$$p(x_0, \dots, x_n) = \exp\Big( \sum_{A \in \mathcal{X}_k} \psi_A(x_0, \dots, x_n) \Big),$$
where, again, each function $\psi_A$ only depends on those $x_i$ with $X_i \in A$. This second representation corresponds to the Gibbs-Boltzmann distribution used in statistical mechanics, and it also explains the name exponential family. Clearly,
$$E^{(1)} \subseteq E^{(2)} \subseteq \dots \subseteq E^{(n+1)}.$$
The set $E^{(k)}$ is not closed (for $k > 0$), in the sense that there are probability distributions outside of $E^{(k)}$ that can be approximated arbitrarily well by k-th order interaction distributions. Thus, we denote by $\overline{E^{(k)}}$ the closure of $E^{(k)}$ (technically speaking, for probability spaces, there are different notions of approximation and of closure, but in the finite discrete case, they all agree; for example, one may take the induced topology by considering a probability distribution as a vector of real numbers). For example, $\overline{E^{(k)}}$ contains distributions that can be written as products of non-negative functions $\Psi_A$ with zeros. In particular, $\overline{E^{(n+1)}}$ consists of all possible joint distributions of $X_0, \dots, X_n$. However, for $1 < k \le n$, the closure of $E^{(k)}$ also contains distributions that do not factorize at all (see Section 2.3 in [13] and the references therein).
Given an arbitrary joint distribution $p$ of $X_0, \dots, X_n$, we might ask for the best approximation of $p$ by a k-th order interaction distribution $q$. It is customary to measure the approximation error in terms of the Kullback-Leibler divergence:
$$D(p \,\|\, q) = \sum_{x_0, \dots, x_n} p(x_0, \dots, x_n) \log \frac{p(x_0, \dots, x_n)}{q(x_0, \dots, x_n)}.$$
There are many relations between the KL divergence and exponential families. We need the following properties:
(1) Let $E$ be an exponential family, and let $p$ be an arbitrary distribution. Then, there is a unique distribution $p_E$ in the closure of $E$ that best approximates $p$, in the sense that:
$$D(p \,\|\, p_E) = \inf_{q \in E} D(p \,\|\, q).$$
$p_E$ is called the rI-projection of $p$ to $E$.
(2) If $E \subseteq E'$ are two exponential families, then:
$$D(p \,\|\, p_E) = D(p \,\|\, p_{E'}) + D(p_{E'} \,\|\, p_E).$$
See [9,14] for a proof and further properties of exponential families. The second identity is also called the Pythagorean theorem for exponential families.
In the following, we will abbreviate $q^{(k)} := p_{E^{(k)}}$. For example, $q^{(n+1)} = p$. For $n \ge k > 1$, there is no general formula for $q^{(k)}$. For $k = 1$, one can show that:
$$q^{(1)}(x_0, \dots, x_n) = p(x_0)\,p(x_1) \cdots p(x_n).$$
Thus, $D(p \,\|\, q^{(1)}) = \sum_{i=0}^{n} H(X_i) - H(X_0, \dots, X_n)$ equals the multi-information [15] (also known as total correlation [16]) of $X_0, \dots, X_n$. Applying the Pythagorean theorem $n - 1$ times to the hierarchy $E^{(1)} \subseteq E^{(2)} \subseteq \dots \subseteq E^{(n)}$, it follows that:
$$D(p \,\|\, q^{(1)}) = \sum_{k=2}^{n+1} D(q^{(k)} \,\|\, q^{(k-1)}).$$
This equation decomposes the multi-information into terms corresponding to different interaction orders. This decomposition was introduced in [9] and studied for several examples in [10] or [17] with the single terms called connected information or interaction complexities, respectively. The idea that synergy should capture everything beyond pair interactions motivates us to define:
$$S^{(2)}(X_0; \dots; X_n) := D(p \,\|\, q^{(2)}) = \sum_{k=3}^{n+1} D(q^{(k)} \,\|\, q^{(k-1)})$$
as a measure of synergy. In this interpretation, the synergy of $X_0, \dots, X_n$ is a part of the multi-information of $X_0, \dots, X_n$. The last sum shows that the hierarchy of interaction families gives a finer decomposition of $S^{(2)}$ into terms that may be interpreted as "synergy of a fixed order". In the case $n = 2$ (three random variables) that we will study later, there is only one term, since $p = q^{(3)}$ in this case. Using the maximum entropy principle behind exponential families [14], the function $S^{(2)}$ can also be expressed as:
$$S^{(2)}(X_0; \dots; X_n) = \max_{r \in \Delta^{(2)}_p} H_r(X_0, \dots, X_n) - H_p(X_0, \dots, X_n),$$
where:
$$\Delta^{(2)}_p = \{ r(x_0, \dots, x_n) \mid r(x_i, x_j) = p(x_i, x_j) \text{ for all } i, j = 0, \dots, n \}$$
denotes the set of all joint distributions $r$ of $X_0, \dots, X_n$ that have the same pair marginals as $p$.
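Although there is no closed formula for the projection onto the pairwise interaction family, it can be approximated numerically. The following minimal sketch uses iterative proportional fitting, a standard method for computing the maximum entropy distribution with prescribed marginals; the text does not prescribe a particular algorithm, so this choice is an assumption. The XOR gate with uniform inputs serves as an illustrative test case that is not taken from the text: all of its pair marginals are uniform, so the projection is the uniform distribution.

```python
import math
from itertools import product

def pair_marginal(dist, i, j):
    """Marginal of dist over coordinates i and j."""
    out = {}
    for state, prob in dist.items():
        key = (state[i], state[j])
        out[key] = out.get(key, 0.0) + prob
    return out

def ipf_pairwise(p, alphabets, sweeps=200):
    """Approximate the maximum entropy distribution that shares all
    pair marginals with p, starting from the uniform distribution."""
    states = list(product(*alphabets))
    q = {s: 1.0 / len(states) for s in states}
    n_vars = len(alphabets)
    pairs = [(i, j) for i in range(n_vars) for j in range(i + 1, n_vars)]
    targets = {pair: pair_marginal(p, *pair) for pair in pairs}
    for _ in range(sweeps):
        for (i, j) in pairs:
            current = pair_marginal(q, i, j)
            for s in states:
                c = current[(s[i], s[j])]
                if c > 0:
                    # Rescale so that the (i, j) marginal of q matches p.
                    q[s] *= targets[(i, j)].get((s[i], s[j]), 0.0) / c
    return q

def kl_bits(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pv * math.log2(pv / q[s]) for s, pv in p.items() if pv > 0)

# XOR gate: states are (S, X, Y) with S = X XOR Y and uniform inputs.
p_xor = {(x ^ y, x, y): 0.25 for x in (0, 1) for y in (0, 1)}
q2 = ipf_pairwise(p_xor, [(0, 1)] * 3)
s2_xor = kl_bits(p_xor, q2)
print("S^(2) for XOR (bits):", s2_xor)
```

For distributions with zeros, such as the AND gate, the projection can lie on the boundary of the family, and the iteration may then converge slowly.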
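In contrast, the partial information lattice provides a decomposition of the mutual information and not the multi-information. However, a decomposition of the mutual information $MI(X_0 : X_1, \dots, X_n)$ can be achieved in a similar spirit as follows. Let $\mathcal{X}^0_k$ be the set of all subsets $A \subseteq \{X_0, \dots, X_n\}$ of cardinality $|A| = k$ that contain $X_0$, and let $\hat{E}^{(k)}$ be the set of all probability distributions of the form:
$$p(x_0, \dots, x_n) = \Psi_{[n]}(x_1, \dots, x_n) \prod_{A \in \mathcal{X}^0_k} \Psi_A(x_0, \dots, x_n),$$
where the $\Psi_A$ are as above and where $\Psi_{[n]}$ is a function that only depends on $x_1, \dots, x_n$. As above, each $\hat{E}^{(k)}$ is an exponential family.
We will abbreviate $\hat{q}^{(k)} := p_{\hat{E}^{(k)}}$. Again, for general $k$, there is no formula for $\hat{q}^{(k)}$, but for $k = 1$, one can show that:
$$\hat{q}^{(1)}(x_0, \dots, x_n) = p(x_0)\,p(x_1, \dots, x_n).$$
Therefore,
$$D(p \,\|\, \hat{q}^{(1)}) = MI(X_0 : X_1, \dots, X_n).$$
Moreover, by the Pythagorean theorem,
$$MI(X_0 : X_1, \dots, X_n) = \sum_{k=2}^{n+1} D(\hat{q}^{(k)} \,\|\, \hat{q}^{(k-1)}).$$
Thus, we obtain a decomposition of the mutual information $MI(X_0 : X_1, \dots, X_n)$. Again, one can group together all terms except the last term that corresponds to the pair interactions and define:
$$\hat{S}^{(2)}(X_0 : X_1; \dots; X_n) := D(p \,\|\, \hat{q}^{(2)}) = \sum_{k=3}^{n+1} D(\hat{q}^{(k)} \,\|\, \hat{q}^{(k-1)})$$
as a measure of synergy. In this interpretation, synergy is a part of the mutual information $MI(X_0 : X_1, \dots, X_n)$. Using the maximum entropy principle behind exponential families [14], the function $\hat{S}^{(2)}$ can also be expressed as:
$$\hat{S}^{(2)}(X_0 : X_1; \dots; X_n) = \max_{r \in \hat{\Delta}^{(2)}_p} H_r(X_0, \dots, X_n) - H_p(X_0, \dots, X_n),$$
where:
$$\hat{\Delta}^{(2)}_p = \{ r \in \Delta^{(2)}_p \mid r(x_1, \dots, x_n) = p(x_1, \dots, x_n) \}$$
denotes the set of all joint distributions $r$ of $X_0, \dots, X_n$ that have the same pair marginals as $p$ and for which, additionally, the marginal distribution for $X_1, \dots, X_n$ is the same as for $p$.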
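The two first-order projections and the Pythagorean theorem can be checked numerically. The following minimal sketch uses three binary variables and assumes, as an illustration not spelled out in the text, that the theorem is applied to the nested pair consisting of the family of product distributions and the family of distributions of the form $p(x_0)\,\Psi(x_1, x_2)$. The joint distribution is an arbitrary strictly positive example.

```python
import math
from itertools import product

states = list(product((0, 1), repeat=3))        # states (x0, x1, x2)
weights = [1, 2, 3, 4, 5, 6, 7, 8]              # arbitrary positive weights
p = {s: w / sum(weights) for s, w in zip(states, weights)}

def marginal(dist, idx):
    """Marginal distribution over the coordinates listed in idx."""
    out = {}
    for s, v in dist.items():
        key = tuple(s[i] for i in idx)
        out[key] = out.get(key, 0.0) + v
    return out

def kl_bits(a, b):
    """Kullback-Leibler divergence D(a || b) in bits."""
    return sum(v * math.log2(v / b[s]) for s, v in a.items() if v > 0)

m0, m1, m2 = (marginal(p, (i,)) for i in range(3))
m12 = marginal(p, (1, 2))

# Projection onto product distributions: product of the single marginals.
q1 = {s: m0[(s[0],)] * m1[(s[1],)] * m2[(s[2],)] for s in states}
# Projection onto the larger family: p(x0) * p(x1, x2).
q1_hat = {s: m0[(s[0],)] * m12[(s[1], s[2])] for s in states}

multi_information = kl_bits(p, q1)              # D(p || q1)
mutual_information = kl_bits(p, q1_hat)         # MI(X0 : X1, X2)
remainder = kl_bits(q1_hat, q1)                 # equals MI(X1 : X2)

print(multi_information, mutual_information + remainder)
```

The two printed numbers agree up to rounding, which reflects the fact that the multi-information splits into $MI(X_0 : X_1, X_2)$ plus the dependence among $X_1, X_2$.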
While the exponential families $E^{(k)}$ are symmetric in all random variables $X_0, \dots, X_n$, in the definition of $\hat{E}^{(k)}$, the variable $X_0$ plays a special role. This is reminiscent of the special role of $S$ in the information decomposition framework, when the goal is to decompose the information about $S$. Thus, also in $\hat{S}^{(2)}$, the variable $X_0$ is special.
The case $n = 2$, $k = 2$ is also the case that we are most interested in later for the following reasons. First, for $n = 2$, the terms in the partial information lattice have an intuitively clear interpretation. Second, while there are not many examples of full information decompositions for $n > 2$, there exist at least two proposals for reasonable measures of shared, unique and complementary information [7,8], which allow a direct comparison with measures based on the decompositions using the interaction spaces.
While the symmetric hierarchy of the families $E^{(k)}$ is classical, to the best of our knowledge, the alternative hierarchy of the families $\hat{E}^{(k)}$ has not been studied before. We do not want to analyze this second hierarchy in detail here, but we just want to demonstrate that the framework of interaction exponential families is flexible enough to give a nice decomposition of mutual information, which can naturally be compared with the information decomposition framework. In this paper, in any case, we only consider cases where $\hat{S}^{(2)} = S^{(2)}$.
It is possible to generalize the definitions of the interaction exponential families to continuous random variables, but there are some technical issues to be solved. For example, the corresponding exponential families will be infinite-dimensional. We will not do this here in detail, since we only need the following observation later: any Gaussian distribution can be described by pair interactions. Therefore, when $p$ is a multivariate normal distribution, then $q^{(2)} = \hat{q}^{(2)} = p$.

Measures of Synergy and Their Properties
Synergy or complementary information is very often considered as a core property of complex systems, being strongly related to "emergence" and the idea of the "whole being more than the sum of its parts". In this section, we discuss three approaches to formalize this idea. We first introduce a classical function called WholeMinusSum synergy in [6], which reduces to the interaction information or (up to the sign) co-information when $n = 2$. This function can become negative. It is sensitive to redundancy, as well as synergy, and its sign tells which kind of information dominates. In Section 3.2, we recall the definition of the measure of synergy $CI$ from [8] that comes from a (bivariate) information decomposition. In Section 3.3, we compare $CI$ with the synergy defined from the interaction spaces in Section 2.2.

WholeMinusSum Synergy
WholeMinusSum synergy is the difference between the joint mutual information between the explaining variables and the target variable and the sum of the pairwise mutual informations. Griffith and Koch [6] trace it back to [18-20]. In the $n = 2$ case, this reduces to:
$$S_{WMS}(S : X; Y) = MI(S : X, Y) - MI(S : X) - MI(S : Y) = -CoI(S, X, Y), \tag{7}$$
with $CoI(S, X, Y)$ being the co-information [21] or interaction information [22]. This measure of synergy was used, e.g., in [1] to study synergy in neural population codes. As one can easily see from Equation (4), for any information decomposition, $S_{WMS}$ is the difference between the complementary and the shared information:
$$S_{WMS}(S : X; Y) = CI(S : X; Y) - SI(S : X; Y).$$
Therefore, the WholeMinusSum synergy is a lower bound for the complementary information in the partial information lattice. Obviously, it can also become negative, which makes it a deficient measure of synergy. However, it fulfills the condition of strong symmetry, i.e., it is invariant not only with respect to permutations of $X$ and $Y$, but with respect to permutations of all three arguments.
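The sign behavior of the WholeMinusSum synergy can be illustrated on small discrete examples. The following minimal sketch evaluates it for the AND gate with uniform inputs and for a fully redundant system in which $S$, $X$ and $Y$ are copies of one fair bit; the second example is an illustrative choice not taken from the text.

```python
import math

def marginal(p, idx):
    """Marginal distribution over the coordinates listed in idx."""
    out = {}
    for state, prob in p.items():
        key = tuple(state[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def mi(p, a, b):
    """Mutual information in bits between coordinate groups a and b."""
    return (entropy(marginal(p, a)) + entropy(marginal(p, b))
            - entropy(marginal(p, a + b)))

def wms_synergy(p):
    """MI(S:X,Y) - MI(S:X) - MI(S:Y) for states ordered as (s, x, y)."""
    return mi(p, (0,), (1, 2)) - mi(p, (0,), (1,)) - mi(p, (0,), (2,))

p_and = {(x & y, x, y): 0.25 for x in (0, 1) for y in (0, 1)}
p_copy = {(b, b, b): 0.5 for b in (0, 1)}   # S = X = Y, one fair bit

wms_and = wms_synergy(p_and)
wms_copy = wms_synergy(p_copy)
print("WMS synergy, AND gate (bits):", wms_and)
print("WMS synergy, copied bit (bits):", wms_copy)
```

The AND gate gives a positive value (about 0.1887 bits), while the purely redundant system gives a negative value, in line with the remark that the sign indicates which kind of information dominates.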

Synergy from Unique Information
In [8], it was proposed to use the following function as a measure of synergy:
$$CI(S : X; Y) = MI(S : X, Y) - \min_{q \in \Delta_p} MI_q(S : X, Y),$$
where:
$$\Delta_p = \{ q(s, x, y) \mid q(s, x) = p(s, x),\ q(s, y) = p(s, y) \}$$
denotes the set of all joint distributions of $S, X, Y$ that have the same pair marginals as $p$ for the pairs $(S, X)$ and $(S, Y)$. Originally, this function was motivated from considerations about decision problems. The basic idea is that unique information should be observable in the sense that there should be a decision problem in which this unique information is advantageous. One crucial property is the idea that the amount of unique information should only depend on the marginal distributions of the pairs $(S, X)$ and $(S, Y)$, i.e.: (*) The functions $UI(S : X \setminus Y)$ and $UI(S : Y \setminus X)$ are constant on $\Delta_p$.
These thoughts lead to a formula for unique information $UI$, from which formulas for $SI$ and the above formula for $CI$ can be derived. Thus, in particular, $CI$ is part of a (non-negative) bivariate information decomposition. While it is not easy to see directly that $SI$ is non-negative, it follows right from the definition that $CI$ is non-negative. Heuristically, the formula for $CI$ also encodes the idea that synergy has to do with pair interactions, here in the form of pair marginals. Namely, the joint distribution is compared with all other distributions that have the same marginals for the pairs $(S, X)$ and $(S, Y)$. In Section 3.3, we will see how this is related to the synergy function $S^{(2)}$ coming from the interaction decomposition.
The same measure of synergy was proposed in [6], without any operational justification, and generalized to $n > 2$ variables as follows:
$$CI(S : X_1; \dots; X_n) = MI(S : X_1, \dots, X_n) - \min_{q \in \Delta_p} MI_q(S : X_1, \dots, X_n),$$
where now:
$$\Delta_p = \{ q(s, x_1, \dots, x_n) \mid q(s, x_i) = p(s, x_i) \text{ for } i = 1, \dots, n \}.$$
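For binary $X$ and $Y$, the set of distributions with fixed $(S, X)$ and $(S, Y)$ marginals can be parameterized explicitly: for each value $s$, the conditional distribution of $(X, Y)$ has fixed marginals and therefore one free parameter. The following minimal sketch, under this assumption, minimizes the mutual information by a plain grid search over these parameters; the function names and the grid resolution are illustrative, and the grid search is a simple stand-in for a proper convex optimization. It is applied to the AND gate with uniform inputs.

```python
import math
from itertools import product

def conditional(a, b, t):
    """q(x, y | s) for binary X, Y with q(x=1|s) = a, q(y=1|s) = b
    and free parameter t = q(x=1, y=1 | s)."""
    raw = {(1, 1): t, (1, 0): a - t, (0, 1): b - t, (0, 0): 1 - a - b + t}
    return {xy: max(0.0, v) for xy, v in raw.items()}  # clamp rounding noise

def mi_s_xy(ps, cond):
    """MI(S:X,Y) in bits, with ps[s] = p(s) and cond[s][(x, y)] = q(x,y|s)."""
    joint_xy = {}
    for s, p_s in ps.items():
        for xy, c in cond[s].items():
            joint_xy[xy] = joint_xy.get(xy, 0.0) + p_s * c
    total = 0.0
    for s, p_s in ps.items():
        for xy, c in cond[s].items():
            if p_s * c > 0:
                total += p_s * c * math.log2(c / joint_xy[xy])
    return total

def complementary_information(ps, a, b, t_true, grid=200):
    """MI under p minus the minimal MI over all couplings with the same
    (S,X) and (S,Y) marginals, found by grid search over the parameters."""
    svals = sorted(ps)
    mi_p = mi_s_xy(ps, {s: conditional(a[s], b[s], t_true[s]) for s in svals})
    ranges = []
    for s in svals:
        lo, hi = max(0.0, a[s] + b[s] - 1.0), min(a[s], b[s])
        ranges.append([lo + (hi - lo) * k / grid for k in range(grid + 1)])
    mi_min = min(
        mi_s_xy(ps, {s: conditional(a[s], b[s], t) for s, t in zip(svals, ts)})
        for ts in product(*ranges)
    )
    return mi_p - mi_min

# AND gate with uniform inputs: p(s=1) = 1/4; given s=0, (X, Y) is uniform
# on {(0,0), (0,1), (1,0)}; given s=1, X = Y = 1.
ps = {0: 0.75, 1: 0.25}
a = {0: 1 / 3, 1: 1.0}        # p(x=1 | s)
b = {0: 1 / 3, 1: 1.0}        # p(y=1 | s)
t_true = {0: 0.0, 1: 1.0}     # p(x=1, y=1 | s) under the AND distribution
ci_and = complementary_information(ps, a, b, t_true)
print("CI for the AND gate (bits):", ci_and)
```

In this example, the minimum is attained at the boundary of the parameter range, where the mutual information drops to $MI(S : X)$.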

Synergy from Maximum Entropy Arguments
Quantifying synergy using maximum entropy projections on k-th order interaction spaces can be viewed as a more direct approach to quantifying the extent to which "a system is more than the sum of its parts" [11] than the WholeMinusSum (WMS) synergy discussed above. Surprisingly, we are not aware of any publication using this approach to define explicitly a measure of synergy, but the idea seems to be common and is proposed, for instance, in [2]. Consider the joint probability distribution $p(s, x, y)$. Synergy should quantify dependencies among $S, X, Y$ that cannot be explained by pairwise interactions. Therefore, one considers:
$$S^{(2)}(S; X; Y) := D(p \,\|\, q^{(2)})$$
as a measure of synergy. In [10], $S^{(2)}(S; X; Y)$ was discussed under the name "connected information" $I_C$, but it was not considered as a measure of synergy. Synergy was measured instead by the WMS synergy measure (7).
2. $S^{(2)}(S; X; Y)$ is symmetric with respect to permutations of all of its arguments, in contrast to $CI(S : X; Y)$.
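3. $S^{(2)}(S; X; Y) \le CI(S : X; Y)$, because $\Delta^{(2)}_p \subseteq \Delta_p$ and both quantities can be written as a maximum of the same function:
$$S^{(2)}(S; X; Y) = \max_{q \in \Delta^{(2)}_p} H_q(S \mid X, Y) - H_p(S \mid X, Y), \qquad CI(S : X; Y) = \max_{q \in \Delta_p} H_q(S \mid X, Y) - H_p(S \mid X, Y).$$
In fact, as shown in [8], any measure $CI'$ of complementary information that comes from an information decomposition and that satisfies property (*) must satisfy $CI(S : X; Y) \le CI'(S : X; Y)$, and thus, the inequality:
$$S^{(2)}(S; X; Y) \le CI'(S : X; Y).$$

If we now consider $S^{(2)}$ as a measure $CI^{(2)}$ for the complementary information in the information decomposition (4), we see from (10) that the corresponding shared information becomes negative:
$$SI^{(2)} = CoI + CI^{(2)} = -0.1887 \text{ bits} < 0.$$

Gaussian Random Variables: When Should Synergy Vanish?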
Let $p(s, x, y)$ be a multivariate Gaussian distribution. As mentioned above, $S^{(2)}(S; X; Y) = 0$. What about $CI$? As shown by [12], the result is that one of the two unique pieces of information $UI$ always vanishes. Let $r_{SX}$ and $r_{SY}$ denote the correlation coefficients between $S$ and $X$ and between $S$ and $Y$, respectively. If $|r_{SX}| \le |r_{SY}|$, then $X$ has no unique information about $S$, i.e., $UI(S : X \setminus Y) = 0$, and therefore, $CI(S : X; Y) = MI(S : X \mid Y)$. This was shown in [12] using explicit computations with semi-definite matrices. Here, we give a more conceptual argument involving simple properties of Gaussian random variables and general properties of $UI$.
For any $\rho \in \mathbb{R}$, let $X_\rho = Y + \rho\,\epsilon$, where $\epsilon$ denotes Gaussian noise, which is independent of $X$, $Y$ and $S$. Then, $X_\rho$ is independent of $S$ given $Y$, and so, $|r_{S X_\rho}| \le |r_{SY}|$. It is easy to check that $r_{S X_\rho}$ is a continuous function of $\rho$, with $r_{S X_0} = r_{SY}$ and $r_{S X_\rho} \to 0$ as $\rho \to \infty$. In particular, there exists a value $\rho_0 \in \mathbb{R}$, such that $r_{SX} = r_{S X_{\rho_0}}$. Let $X' = \frac{\sigma_X}{\sigma_{X_{\rho_0}}} X_{\rho_0}$. Then, the pair $(X', S)$ has the same distribution as the pair $(X, S)$ (since $X'$ has the same variance as $X$ and since the two pairs have the same correlation coefficient). Thus, by property (*), $UI(S : X \setminus Y) = UI(S : X' \setminus Y)$. Moreover, since $MI(S : X' \mid Y) = 0$, it follows from (5) that $UI(S : X' \setminus Y) = 0$.
In summary, assuming that $|r_{SX}| \le |r_{SY}|$, we arrive at the following formulas:
$$\begin{aligned}
CI(S : X; Y) &= MI(S : X \mid Y),\\
UI(S : X \setminus Y) &= 0,\\
UI(S : Y \setminus X) &= MI(S : Y) - MI(S : X),\\
SI(S : X; Y) &= MI(S : X).
\end{aligned}$$
Thus, for Gaussian random variables, $SI$ agrees with $I_{MMI}$. In fact, any information decomposition according to the PI lattice satisfies $SI(S : X; Y) \le I_{MMI}(S : X; Y)$ [4]. Moreover, any information decomposition $SI'$ that satisfies (*) satisfies $SI'(S : X; Y) \ge SI(S : X; Y)$ (Lemma 3 in [8]), and thus, all such information decompositions agree in the Gaussian case (this was first observed by [12]). In [12], it is shown that this result generalizes to the case where $X$ and $Y$ are Gaussian random vectors. The proof of this result basically shows that the above argument also works in this more general case. The fact that, for Gaussian distributions, all bivariate information decompositions (that satisfy (*)) agree with the $I_{MMI}$ decomposition suggests that the information decomposition based on $I_{MMI}$ may also be sensible for Gaussian distributions for larger values of $n$.
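For scalar Gaussian variables, the resulting decomposition can be evaluated directly from the three correlation coefficients. The following minimal sketch assumes unit variances (which does not affect the information quantities) and uses natural logarithms; the function name and the example correlations are illustrative choices.

```python
import math

def gaussian_decomposition(r_sx, r_sy, r_xy):
    """Bivariate decomposition for jointly Gaussian scalars S, X, Y with
    unit variances, using the minimum mutual information as redundancy.
    All values are in nats."""
    mi_sx = -0.5 * math.log(1.0 - r_sx ** 2)
    mi_sy = -0.5 * math.log(1.0 - r_sy ** 2)
    # Determinants of the correlation matrices of (X, Y) and (S, X, Y).
    det_xy = 1.0 - r_xy ** 2
    det_sxy = (1.0 + 2.0 * r_sx * r_sy * r_xy
               - r_sx ** 2 - r_sy ** 2 - r_xy ** 2)
    mi_sxy = 0.5 * math.log(det_xy / det_sxy)
    si = min(mi_sx, mi_sy)
    ui_x = mi_sx - si
    ui_y = mi_sy - si
    ci = mi_sxy - si - ui_x - ui_y
    return {"SI": si, "UI_X": ui_x, "UI_Y": ui_y, "CI": ci,
            "MI_SX": mi_sx, "MI_SY": mi_sy, "MI_joint": mi_sxy}

# Illustrative correlations with |r_sx| <= |r_sy|.
parts = gaussian_decomposition(0.3, 0.5, 0.2)
for name, value in parts.items():
    print(name, "=", value)
```

With $|r_{SX}| \le |r_{SY}|$, the unique information of $X$ is zero and the complementary information equals $MI(S : X, Y) - MI(S : Y) = MI(S : X \mid Y)$.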
Here, we do not pursue this line of thought. Instead, we want to provide another interpretation of the synergy $CI(S : X; Y)$ in the Gaussian case. Based on the apparent simplicity of Gaussians where all information measures are obtained from the correlation coefficients, one could be led to the conclusion that there should be no synergy (recall that $S^{(2)}(S; X; Y)$ vanishes). On the other hand, $S_{WMS}(S : X; Y) = CI(S : X; Y) - SI(S : X; Y)$ can be positive for Gaussian variables, and thus, synergy must be positive, as well (see [12]; for a simple example, choose $0 < r_{SX} = r_{SY} = r < 1/\sqrt{2}$ and $r_{XY} = 0$; then $S_{WMS}(S : X; Y) = \frac{1}{2} \log \frac{(1 - r^2)^2}{1 - 2r^2} > 0$). To better understand this situation, we regress $S$ on $X$ and $Y$, i.e., we write $S = \alpha X + \beta Y + \sigma \epsilon$ for some coefficients $\alpha, \beta$ and normally distributed noise $\epsilon$ that is independent of $X$ and $Y$. Let us again assume that $|r_{SX}| < |r_{SY}|$. From $CI(S : X; Y) = MI(S : X \mid Y)$, we see that synergy vanishes if and only if $S$ and $X$ are conditionally independent given $Y$. Since all distributions are Gaussian and information measures do not depend on the mean values, this condition can be checked by computing the conditional variances $\mathrm{Var}[S \mid X, Y] = \sigma^2$ and $\mathrm{Var}[S \mid Y] = \alpha^2 \mathrm{Var}[X \mid Y] + \sigma^2$. We see that these variances agree, and thus, $S$ is conditionally independent of $X$ given $Y$, if and only if $\mathrm{Var}[X \mid Y] = 0$, i.e., $X$ is a function of $Y$ and effectively the same variable, or $\alpha = 0$. Positive synergy arises whenever $X$ contributes to $S$ with a non-trivial coefficient $\alpha \ne 0$. This is a very reasonable interpretation and shows that the synergy measure $CI(S : X; Y)$ nicely captures the intuition of $X$ and $Y$ acting together to bring about $S$.
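The simple example with uncorrelated inputs can be verified numerically. The following minimal sketch assumes unit variances, natural logarithms and the illustrative value $r = 0.5$. It computes the WholeMinusSum synergy from covariance determinants, compares it with the closed form $\frac{1}{2} \log \frac{(1 - r^2)^2}{1 - 2r^2}$ that follows from the Gaussian mutual information formula, and evaluates the synergy from the regression picture, where $\alpha = r$ and $\sigma^2 = 1 - 2r^2$ for this covariance structure.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

r = 0.5                                  # illustrative, 0 < r < 1/sqrt(2)
cov = [[1.0, r, r],                      # variable order: (S, X, Y)
       [r, 1.0, 0.0],
       [r, 0.0, 1.0]]

# Mutual informations in nats; det of the (X, Y) block is 1 since r_XY = 0.
mi_sx = -0.5 * math.log(1.0 - r * r)     # equals MI(S:Y) by symmetry
mi_s_xy = 0.5 * math.log(1.0 / det3(cov))

wms = mi_s_xy - 2.0 * mi_sx
closed_form = 0.5 * math.log((1.0 - r * r) ** 2 / (1.0 - 2.0 * r * r))

# Regression S = alpha X + beta Y + sigma * noise with uncorrelated,
# unit-variance X and Y: alpha = beta = r and sigma^2 = 1 - 2 r^2.
alpha = r
sigma2 = 1.0 - 2.0 * r * r
var_s_given_xy = sigma2
var_s_given_y = alpha ** 2 * 1.0 + sigma2    # Var[X|Y] = 1 because r_XY = 0
ci = 0.5 * math.log(var_s_given_y / var_s_given_xy)   # MI(S:X|Y)

print("S_WMS:", wms, " closed form:", closed_form)
print("CI = MI(S:X|Y):", ci, " SI = MI(S:X):", mi_sx)
```

The synergy $CI$ is strictly positive here because $\alpha \ne 0$, and the WholeMinusSum value equals $CI - SI$.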

Discussion and Conclusions
We think that using maximum entropy projections on k-th order interaction spaces can be viewed as a direct approach to quantifying the extent to which "a system is more than the sum of its parts" [11]. According to this view, synergy requires and manifests itself in the presence of higher-order interactions, which can be quantified using projections on the exponential families of k-th order interactions. While this idea is not new, it has, to our knowledge, not been explicitly formulated as a definition of synergy before. However, the synergy measure $S^{(2)}$ based on the projection on the exponential family of distributions with only pairwise interactions is not compatible with the partial information lattice framework, because it does not yield a non-negative information decomposition, as we have shown in the examples. The reason why we believe that it is important to have a complete non-negative information decomposition is that, in addition to a formula for synergy, it would give us an interpretation of the "remainder" $MI(S : X_1, \dots, X_n) - \mathrm{Synergy}$. In the bivariate case, $CI(S : X; Y)$ provides a synergy measure that complies with the information decomposition.
One could argue that the vanishing $S^{(2)}$ for multivariate Gaussians reflects their "simplicity" in the sense that they can be transformed into independent sub-processes by a linear transformation. In contrast, this simplicity is reflected in the information decomposition by the fact that one of the two unique informations always vanishes. Since the WholeMinusSum synergy (or co-information) can be positive for Gaussian distributions, it is not possible to define an information decomposition for Gaussian variables that sets the synergy to zero.
Overall, our results suggest that intuition about synergy should be based on information processing rather than higher-order dependencies. While higher-order dependencies, as captured by the measure $S^{(2)}(S; X; Y)$, are part of the synergy, i.e., $S^{(2)}(S; X; Y) \le CI(S : X; Y)$, they are not required, as demonstrated in our AND example and the case of Gaussian random variables. In particular, the latter example leads to the intuitive insight that synergy arises when multiple inputs $X, Y$ are processed simultaneously to compute the target $S$. Interestingly, the nature of this processing is less important and can be rather simple, i.e., the output is literally just "the sum of its inputs". In this sense, we believe that our negative result, namely that $S^{(2)}(S; X; Y)$ does not yield a non-negative information decomposition, provides important insights into the nature of synergy in the partial information decomposition. It is up to future work to develop a better understanding of the relationship between the presence of higher-order dependencies and synergy.