Quantifying synergistic information using intermediate stochastic variables

Quantifying synergy among stochastic variables is an important open problem in information theory. Information synergy occurs when multiple sources together predict an outcome variable better than the sum of single-source predictions. It is an essential phenomenon in biology such as in neuronal networks and cellular regulatory processes, where different information flows integrate to produce a single response, but also in social cooperation processes as well as in statistical inference tasks in machine learning. Here we propose a metric of synergistic entropy and synergistic information from first principles. The proposed measure relies on so-called synergistic random variables (SRVs) which are constructed to have zero mutual information about individual source variables but non-zero mutual information about the complete set of source variables. We prove several basic and desired properties of our measure, including bounds and additivity properties. In addition, we prove several important consequences of our measure, including the fact that different types of synergistic information may co-exist between the same sets of variables. A numerical implementation is provided, which we use to demonstrate that synergy is associated with resilience to noise. Our measure may be a marked step forward in the study of multivariate information theory and its numerous applications.


Introduction
Shannon's information theory is a natural framework for studying the correlations among stochastic variables. Claude Shannon proved that the entropy of a single stochastic variable uniquely quantifies how much information is required to identify a sample value from the variable, which follows from four quite plausible axioms (non-negativity, continuity, monotonicity and additivity) [1]. Using similar arguments, the mutual information between two stochastic variables is the only pairwise correlation measure which quantifies how much information is shared. However, higher-order informational measures among three or more stochastic variables remain a long-standing research topic [2][3][4][5][6].
A prominent higher-order informational measure is synergistic information [3][4][5][7][8][9][10], however it is still an open question how to measure it. It should quantify the idea that a set of variables taken together can convey more information than the summed information of its individual variables. Synergy is studied for instance in the context of regulatory processes in cells and networks of neurons. To illustrate the idea at a high level, consider the recognition of a simple object, say a red square, implemented by a multi-layer neuronal network. Some input neurons will implement local edge detection, and some other input neurons will implement local color detection, but the presence of the red square is not defined solely by the presence of edges or red color alone: it is defined as a particular higher-order relation between edges and color. Therefore, a neuronal network which successfully recognizes an object must integrate the multiple pieces of information in a synergistic manner. However, it is unknown exactly how and where this is implemented in any dynamical network because no measure exists to quantify synergistic information among an arbitrary number of variables. Synergistic information appears to play a crucial role in all complex dynamical systems ranging from molecular cell biology to social phenomena, and some argue even in quantum entanglement [11].
We consider the task of predicting the values of an outcome variable Y using a set of source variables X ≡ {X i } i . The total predictability of Y given X is quantified information-theoretically by the classic Shannon mutual information:
$$I(X : Y) = H(Y) - H(Y \mid X).$$
Here, $H(Y) = -\sum_{y} \Pr(Y = y) \log_2 \Pr(Y = y)$ is the entropy of Y and denotes the total amount of information needed on average to determine a unique value of Y, in bits. It is also referred to as the uncertainty about Y. The conditional entropy $H(Y \mid X)$ denotes the remaining entropy of Y given that the value for X is observed. We note that H(X) and I(X : Y) easily extend to vector-valued variables; for details see for instance Cover and Thomas [12].
In this article we address the problem of quantifying synergistic information between X and Y. To illustrate information synergy, consider the classic example of the XOR-gate of two i.i.d. binary inputs, defined by the following (deterministic) input-output table (Table 1).
Table 1. The XOR-gate.
X 1 | X 2 | Y
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0
A priori the outcome value of Y is 50/50 distributed. It is easily verified that observing both inputs X 1 and X 2 simultaneously fully predicts the outcome value Y, while observing either input individually does not improve the prediction of Y at all. Indeed, we find that: I(X 1 : Y) = 0, I(X 2 : Y) = 0, I(X 1 , X 2 : Y) = 1.
In words this means that in this case the information about the outcome is not stored in either source variable individually, but is stored synergistically in the combination of the two inputs. In this case Y stores whether X 1 = X 2 , which is independent of the individual values of either X 1 or X 2 .
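The three mutual information values quoted for the XOR-gate can be checked numerically. The following is a minimal sketch in plain Python; the helper names `entropy` and `mi` and the dictionary representation of the joint distribution are illustrative choices and are not taken from the jointpdf library.

```python
from collections import defaultdict
from itertools import product
from math import log2

def entropy(joint, idx):
    """Shannon entropy (bits) of the variables at positions idx of a joint pmf."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def mi(joint, a, b):
    """Mutual information I(A : B) in bits; a and b are tuples of positions."""
    return entropy(joint, a) + entropy(joint, b) - entropy(joint, a + b)

# Joint pmf over (X1, X2, Y) with Y = X1 XOR X2 and i.i.d. uniform input bits.
xor_joint = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}

i_x1 = mi(xor_joint, (0,), (2,))      # I(X1 : Y)
i_x2 = mi(xor_joint, (1,), (2,))      # I(X2 : Y)
i_both = mi(xor_joint, (0, 1), (2,))  # I(X1, X2 : Y)
print(i_x1, i_x2, i_both)
```

Each input alone carries no information about Y, while the pair carries the full bit.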
Two general approaches to quantify synergy exist in the current literature. On the pragmatic and heuristic side, methods have been devised to approximate synergistic information using simplifying assumptions. An intuitive example is the "whole minus sum" (WMS) method [10] which simply subtracts the sum of pairwise ("individual") mutual information quantities from the total mutual information, i.e., I(X : Y) − ∑ i I(X i : Y). This formula is based on the assumption that the X i are uncorrelated; in the presence of correlations this measure may become negative and ambiguous.
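To make the sign problem of the WMS quantity concrete, the sketch below evaluates I(X : Y) − ∑ i I(X i : Y) for two toy distributions: the XOR-gate with independent inputs, and a fully correlated case where both inputs and the output are copies of one fair bit. The function name `wms` and both toy distributions are illustrative assumptions.

```python
from collections import defaultdict
from itertools import product
from math import log2

def entropy(joint, idx):
    """Shannon entropy (bits) of the variables at positions idx of a joint pmf."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def mi(joint, a, b):
    """Mutual information I(A : B) in bits; a and b are tuples of positions."""
    return entropy(joint, a) + entropy(joint, b) - entropy(joint, a + b)

def wms(joint, sources, target):
    """Whole-minus-sum: I(X : Y) minus the sum of the single-source terms I(X_i : Y)."""
    total = mi(joint, tuple(sources), (target,))
    return total - sum(mi(joint, (s,), (target,)) for s in sources)

# Independent inputs, Y = X1 XOR X2.
xor_joint = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}
# Fully correlated inputs: X1 = X2 = Y, all copies of one fair bit.
copy_joint = {(b, b, b): 0.5 for b in (0, 1)}

wms_xor = wms(xor_joint, (0, 1), 2)
wms_copy = wms(copy_joint, (0, 1), 2)
print(wms_xor, wms_copy)
```

For the XOR-gate the quantity equals one bit, whereas for the correlated copies it equals minus one bit, which illustrates why the measure becomes hard to interpret once the X i are correlated.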
On the theoretical side, the search is ongoing for a set of necessary and sufficient conditions for a general synergy measure to satisfy. To our knowledge, the most prominent systematic approach is the Partial Information Decomposition framework (PID) proposed by Williams and Beer [3]. Here, synergistic information is implicitly defined by additionally defining so-called "unique" and "shared" information; together they are required to sum up to the total mutual information I(X : Y), among other conditions. However, it appears that the original axioms of Shannon's information theory are insufficient to uniquely determine the functions in this decomposition framework [13], so two approaches exist: extending or changing the set of axioms [3,7,8,14], or finding "good enough" approximations [3,6,9,10].
Our work differs crucially from both abovementioned approaches. In fact, we define "synergy" from first principles in a way that is incompatible with PID. We use a simple example to motivate why we find PID intuitively incongruent; however, no mathematical arguments have been found in favor of either framework. Our proposed procedure of calculating synergy is based upon a newly introduced notion of perfect "orthogonal decomposition" among stochastic variables. We will prove important basic properties which we feel any successful synergy measure should obey, such as non-negativity and insensitivity to reordering subvariables. We will also derive a number of intriguing properties, such as an upper bound on the amount of synergy that any variable can have about a given set of variables. Finally, we provide a numerical implementation which we use for experimental validation and to demonstrate that synergistic variables tend to have increased resilience to localized noise, which is an important property at large and specifically in biological systems.

Definition 1: Orthogonal Decomposition
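Following the intuition from linear algebra we refer to two stochastic variables A, B as orthogonal in case they are independent, i.e., I(A : B) = 0. Given a joint distribution of two stochastic variables Pr(A, B), we denote the orthogonal decomposition of B with respect to A as $D: B \to (B^{\perp}, B^{\parallel})$, where the two parts satisfy:
$$H(B \mid B^{\perp}, B^{\parallel}) = H(B^{\perp}, B^{\parallel} \mid B) = 0, \qquad I(B^{\perp} : B^{\parallel}) = 0, \qquad I(B^{\perp} : A) = 0, \qquad I(B^{\parallel} : A) = I(B : A).$$
In words, B is decomposed into two orthogonal stochastic variables B ⊥ , B ∥ so that (i) the two parts taken together are informationally equivalent to B; (ii) the orthogonal part has zero mutual information about A; and (iii) the parallel part has the same mutual information with A as the original variable B has.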
Our measure of synergy is defined in terms of orthogonal decompositions of MSRVs. However this decomposition is not a trivial procedure and is even impossible to do exactly in certain cases. A deeper discussion of its applicability and limitations is deferred to Section 7.1; here we proceed with defining synergistic information.
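As an illustration of what a perfect orthogonal decomposition looks like, the sketch below checks the stated conditions on a deliberately trivial case: A is a fair bit and B is the pair (N, A), where N is an independent fair bit, so that B ⊥ = N and B ∥ = A can be read off directly. This toy construction and the helper names are assumptions made for illustration; the sketch does not address how to find a decomposition in general.

```python
from collections import defaultdict
from itertools import product
from math import log2

def entropy(joint, idx):
    """Shannon entropy (bits) of the variables at positions idx of a joint pmf."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def mi(joint, a, b):
    """Mutual information I(A : B) in bits; a and b are tuples of positions."""
    return entropy(joint, a) + entropy(joint, b) - entropy(joint, a + b)

# Positions: 0 = A, 1 = B_perp (independent noise bit N), 2 = B_par (copy of A).
# The decomposed variable B is the pair at positions (1, 2).
joint = {(a, n, a): 0.25 for a, n in product((0, 1), repeat=2)}

i_perp_a = mi(joint, (1,), (0,))     # orthogonal part knows nothing about A
i_par_a = mi(joint, (2,), (0,))      # parallel part ...
i_b_a = mi(joint, (1, 2), (0,))      # ... carries all of I(B : A)
i_perp_par = mi(joint, (1,), (2,))   # the two parts are independent
print(i_perp_a, i_par_a, i_b_a, i_perp_par)
```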

Synergistic Random Variable
Firstly we define S as a synergistic random variable (SRV) of X ≡ {X i } i if and only if it satisfies the conditions:
$$I(S : X) > 0, \qquad \forall i : I(S : X_i) = 0. \tag{2}$$
In words, an SRV stores information about X as a whole but no information about any individual X i which constitutes X. Each SRV S i is defined by a conditional probability distribution Pr i (S i | X) and is thus conditionally independent of any other SRV given X, i.e., Pr(S i , S j | X) = Pr(S i | X) · Pr(S j | X). We denote the collection of all possible SRVs of X as the joint random variable σ(X). We sometimes refer to σ(X) as a set because the ordering of its marginal distributions (SRVs) is irrelevant due to their conditional independence.
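The two SRV conditions can be tested directly on a joint distribution of the sources and a candidate variable. The sketch below is a minimal check in plain Python; the function name `is_srv`, the tolerance, and the three candidate variables (XOR, AND, and an unrelated coin flip) are illustrative assumptions.

```python
from collections import defaultdict
from itertools import product
from math import log2

def entropy(joint, idx):
    """Shannon entropy (bits) of the variables at positions idx of a joint pmf."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def mi(joint, a, b):
    """Mutual information I(A : B) in bits; a and b are tuples of positions."""
    return entropy(joint, a) + entropy(joint, b) - entropy(joint, a + b)

def is_srv(joint, sources, s, tol=1e-9):
    """True if the variable at position s is informative about the joint
    sources but uninformative about every individual source."""
    informative = mi(joint, tuple(sources), (s,)) > tol
    blind = all(mi(joint, (i,), (s,)) <= tol for i in sources)
    return informative and blind

bits = list(product((0, 1), repeat=2))
xor_joint = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in bits}
and_joint = {(x1, x2, x1 & x2): 0.25 for x1, x2 in bits}
coin_joint = {(x1, x2, c): 0.125 for x1, x2 in bits for c in (0, 1)}

xor_ok = is_srv(xor_joint, (0, 1), 2)    # informative only about the pair
and_ok = is_srv(and_joint, (0, 1), 2)    # leaks information about each input
coin_ok = is_srv(coin_joint, (0, 1), 2)  # carries no information about X at all
print(xor_ok, and_ok, coin_ok)
```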

Maximally Synergistic Random Variables
The set σ(X) may in general be uncountable, and many of its members may have extremely small mutual information with X, which would prevent any practical use. Therefore we introduce the notion of maximally synergistic random variables (MSRV) which we will also use in some derivations. We do not have a proof yet that this set is countable, however our numerical results (see especially the figure in Section 6.2) show that a typical MSRV has substantial mutual information with X (about 75% of the maximum possible). This suggests that either the set of MSRVs is countable or that the mutual information of a small set of MSRVs rapidly converges to the maximum possible mutual information, enabling a practical use.
We define the set of MSRVs of X, denoted Σ(X), as the smallest possible subset of σ(X) which still makes σ(X) redundant, i.e.:
$$\Sigma(X) \equiv \underset{\Sigma \subseteq \sigma(X)}{\arg\min} \; |\Sigma| \quad \text{subject to} \quad H(\sigma(X) \mid \Sigma) = 0.$$
Here, |Σ| denotes the cardinality of set Σ which is minimized. Intuitively, one could imagine building Σ(X) by iteratively removing an SRV S i from σ(X) in case it is completely redundant given another SRV S j , i.e., if ∃j : H(S i | S j ) = 0. The result is a set Σ(X) with the same informational content (entropy) as σ(X) since only redundant variables are discarded. In case multiple candidates for Σ(X) exist, any candidate among them will induce the same synergy quantity in our proposed measure, as will be clear from the definition in Section 2.2.5 and further proven in Appendix A.2 and A.3.
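The intuitive pruning step, discarding an SRV that is completely redundant given another one, can be sketched for a finite list of candidates. The example below uses two equivalent candidates (XOR and NOT-XOR of two bits); the greedy function `prune_redundant` is a hypothetical simplification that only checks pairwise redundancy against candidates already kept, and it is not claimed to construct Σ(X) in general.

```python
from collections import defaultdict
from itertools import product
from math import log2

def entropy(joint, idx):
    """Shannon entropy (bits) of the variables at positions idx of a joint pmf."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def cond_entropy(joint, a, b):
    """Conditional entropy H(A | B) in bits; a and b are tuples of positions."""
    return entropy(joint, a + b) - entropy(joint, b)

def prune_redundant(joint, candidates, tol=1e-9):
    """Keep a candidate only if no previously kept candidate determines it."""
    kept = []
    for c in candidates:
        if not any(cond_entropy(joint, (c,), (k,)) <= tol for k in kept):
            kept.append(c)
    return kept

# Positions: 0 = X1, 1 = X2, 2 = XOR, 3 = NOT-XOR.
joint = {(x1, x2, x1 ^ x2, 1 - (x1 ^ x2)): 0.25
         for x1, x2 in product((0, 1), repeat=2)}

kept = prune_redundant(joint, [2, 3])
print(kept)
```

Because the two candidates determine each other, only the first one survives, matching the convention of treating equivalent SRVs as one and the same.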

Synergistic Entropy of X
We interpret Σ(X) as representing all synergistic information that any stochastic variable could possibly store about X. Therefore we define the synergistic entropy of X as H(Σ(X)). This will be the upper bound on the synergistic information of any other variable about X.

Orthogonalized SRVs
In order to prevent doubly counting synergistic information we orthogonalize all MSRVs. Let us denote π k (Σ) for the kth permuted sequence of all MSRVs in Σ(X) out of all |Σ|! possibilities which are arbitrarily labeled by integers 1 ≤ k ≤ |Σ|!. Then we convert Σ(X) into a set of orthogonal MSRVs, or OSRVs for short, for a given ordering:
$$\Sigma^{\perp}_{\pi_k(\Sigma)}(X) \equiv \left\{ S_i^{\perp} \right\}_{i=1}^{|\Sigma|}, \quad \text{where } D: S_i \to (S_i^{\perp}, S_i^{\parallel}) \text{ is taken with respect to } (S_1, \ldots, S_{i-1}) \text{ in the order } \pi_k(\Sigma).$$
In words, we iteratively take each MSRV S i in Σ(X) and add its orthogonal part S ⊥ i to the set Σ ⊥ π k (Σ) (X) in the specific order π k (Σ). As a result, each OSRV in Σ ⊥ π k (Σ) (X) is completely independent from all others in this set. Σ ⊥ π k (Σ) (X) is still informationally equivalent to Σ(X) because during its construction we only discard completely redundant variables S i given other SRVs.
Note that each orthogonal part S ⊥ i is an SRV if S i is an SRV (or MSRV), which follows from the contradiction of the negation: if S ⊥ i is not an SRV then ∃j : I(S ⊥ i : X j ) > 0 and consequently I(S i : X j ) ≥ I(S ⊥ i : X j ) > 0 by the above definition of orthogonal decomposition, which contradicts that S i is an SRV.

Total Synergistic Information
We define the total amount of synergistic information that Y stores about X as:
$$I_{\mathrm{syn}}(X \to Y) \equiv \max_{k} \sum_{S_i^{\perp} \in \Sigma^{\perp}_{\pi_k(\Sigma)}(X)} I(Y : S_i^{\perp}). \tag{5}$$
In words, we propose to quantify synergy as the sum of the mutual information that Y contains about each MSRV of X, after first making the MSRVs independent and then reordering them to maximize this quantity. Note that the optimal ordering π k (Σ) is dependent on Y, making the calculation of the set of MSRVs used to calculate synergy also dependent on Y.
Intuitively, we first "extract" all synergistic entropy of a set of variables X ≡ {X i } i by constructing a new set of all possible maximally synergistic random variables (MSRVs) of X, denoted Σ(X), where each MSRV has non-zero mutual information with the set X but zero mutual information with any individual X i . This set of MSRVs is then transformed into a set of independent orthogonal SRVs (OSRV), denoted Σ ⊥ π k (Σ) (X), to prevent over counting. Then we define the amount of synergistic information in outcome variable Y about the set of source variables X as the sum of OSRV-specific mutual information quantities, ∑ S i ∈Σ ⊥ π k (Σ) (X) I(S i : Y). We illustrate our proposed measure on a few simple examples in Section 5 and compare to other measures. In the next Section we will prove several desired properties which this definition satisfies; here we finish with an informal outline of the intuition behind this definition and refer to corresponding proofs where appropriate.

Outline of Intuition of the Proposed Definition
Our initial idea was to quantify synergistic information directly as I(Y : σ(X)); however, we found that this results in undesired counting of non-synergistic information, which we demonstrate in Section 4.3 and in Appendix A.2.1. That is, two or more SRVs taken together do not necessarily form an SRV, meaning that their combination may store information about individual inputs. For this reason we use the summation over individual OSRVs. Intuitively, each term in the sum quantifies a "unique" amount of synergistic information which none of the other terms quantifies, due to the independence among all OSRVs in Σ ⊥ π k (Σ) (X). That is, no synergistic information is doubly counted, which we also discuss in Appendix A.2 by proving that I syn (X → Y) never exceeds I(Y : Σ ⊥ π k (Σ) (X)). On the other hand, no possible type of synergistic information is ignored (undercounted). This can be seen from the fact that only fully redundant variables are ever discarded in the above process; in addition, we prove in Section 3.6 that for any arbitrary X there exists a Y such that I syn (X → Y) equals the maximum H(σ(X)), namely Y = X.
This summation is sensitive to the ordering of the orthogonalization of the SRVs. The reason for maximizing over these orderings is the possible presence of synergies among the SRVs themselves. We prove that I syn (X → Y) handles correctly such "synergy-among-synergies", i.e., does not lead to over counting or undercounting, in Appendix A.3.

Basic Properties
Here we first list important minimal requirements that the above definitions obey. The first four properties typically appear in the related literature either implicitly or explicitly as desired properties; the latter two properties are direct consequences of our first principle to use SRVs to encode synergistic information. The corresponding proofs are straightforward and sketched briefly.

Non-Negativity
I syn (X → Y) ≥ 0.
This follows from the non-negativity of the underlying mutual information function, making every term in the sum of Equation (5) non-negative.

Upper-Bounded by Mutual Information
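I syn (X → Y) ≤ I(X : Y).
This follows from the Data-Processing Inequality [12], where X is first processed into Σ ⊥ π k (Σ) (X) so that I(Σ ⊥ π k (Σ) (X) : Y) ≤ I(X : Y), and then I syn (X → Y) ≤ I(Σ ⊥ π k (Σ) (X) : Y) follows because we can write:
$$I\left(\Sigma^{\perp}_{\pi_k(\Sigma)}(X) : Y\right) = \sum_i I\left(S_i^{\perp} : Y \mid S_1^{\perp}, \ldots, S_{i-1}^{\perp}\right) \geq \sum_i I\left(S_i^{\perp} : Y\right) = I_{\mathrm{syn}}(X \to Y),$$
where the inequality uses the mutual independence of the OSRVs. Here, S ⊥ i is understood to denote the i th element in Σ ⊥ π k (Σ) (X) after maximizing the sequence π k (Σ) used to construct Σ ⊥ π k (Σ) (X) for computing I syn (X → Y).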

Equivalence Class of Reordering in Arguments
I syn (π n (X) → π j (Y)) = I syn (π m (X) → π j (Y)), for any n and m.
This follows from the same property of the underlying mutual information function and that of the sum in Equation (5).

Zero Synergy about a Single Variable
I syn (X 1 → Y) = 0.
This follows from the constraint that any SRV must be ignorant about any individual variable in X, so σ(X 1 ) = ∅.

Zero Synergy in a Single Variable
I syn (X → X i ) = 0.
This also follows from the constraint that any SRV must be ignorant about any individual variable in X: all terms in the sum in Equation (5) are necessarily zero.

Identity Maximizes Synergistic Information
I syn (X → X) = max Y I syn (X → Y).
This follows from the fact that each S ⊥ i ∈ Σ ⊥ π k (Σ) (X) is computed from X and is therefore completely redundant given X, so each term in the sum in Equation (5) must be maximal and equal to H(S ⊥ i ).

Consequential Properties
We now list important properties which are induced by our proposed synergy measure I syn (X → Y) along with their corresponding proofs.

Upper Bound on the Mutual Information of an SRV
The maximum amount of mutual information (and entropy) of an SRV of a set of variables can be derived analytically. We start with the case of two input variables, i.e., |X| = 2, and then generalize. Maximizing I(X 1 , X 2 : S) under the two constraints I(X 1 : S) = 0 and I(X 2 : S) = 0 from Equation (2) leads to:
$$\begin{aligned} I(X_1, X_2 : S) &= I(X_1 : S) + I(X_2 : S \mid X_1) \\ &= I(X_2 : S \mid X_1) \\ &= H(X_2 \mid X_1) - H(X_2 \mid S, X_1), \end{aligned}$$
using that I(X 1 : S) = 0 by construction. Since the first term in the third line, H(X 2 |X 1 ), does not change by varying S, we can maximize I(X 1 , X 2 : S) only by minimizing the second term H(X 2 |S, X 1 ). Since H(X 2 |S, X 1 ) ≥ 0, and since relabeling (reordering) the inputs also gives the constraint I(X 1 , X 2 : S) ≤ H(X 1 |X 2 ), this leads to:
$$I(X_1, X_2 : S) \leq \min\left( H(X_2 \mid X_1),\; H(X_1 \mid X_2) \right).$$
This can be rewritten as:
$$I(X_1, X_2 : S) \leq H(X_1, X_2) - \max\left( H(X_1),\; H(X_2) \right).$$
The generalization to N variables is fairly straightforward by induction (see Appendix A.1) and is illustrated here for the case N = 3 for one particular labeling (ordering) π m (X):
$$\begin{aligned} I(X_1, X_2, X_3 : S) &= I(X_1, X_2 : S) + I(X_3 : S \mid X_1, X_2) \\ &= I(X_2 : S \mid X_1) + I(X_3 : S \mid X_1, X_2) \\ &\leq H(X_2 \mid X_1) + H(X_3 \mid X_1, X_2) \\ &= H(X_1, X_2, X_3) - H(X_1). \end{aligned}$$
Since this inequality must be true for all labelings π m (X), in particular for the labeling that maximizes H(X 1 ), and extending this result to any N, we find that:
$$I(X : S) \leq H(X) - \max_i H(X_i).$$
Corollary. Suppose that Y is completely synergistic about X, i.e., Y ∈ σ(X). Then their mutual information is bounded as follows:
$$I(X : Y) \leq H(X) - \max_i H(X_i).$$
Finally, we assume that the SRV is "efficient" in the sense that it contains no additional entropy that is unrelated to X, i.e., I(X 1 , X 2 : S) = H(S). After all, if it contained additional entropy then by our orthogonal decomposition assumption we could distill only the dependent part exactly. Therefore the derived upper bound on the mutual information of any SRV is also the upper bound on its entropy.
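The bound derived above, the joint entropy of X minus the largest single-variable entropy, is straightforward to evaluate for a given source distribution. The sketch below is a minimal illustration; the function name `srv_upper_bound` and the three example distributions are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import product
from math import log2

def entropy(joint, idx):
    """Shannon entropy (bits) of the variables at positions idx of a joint pmf."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def srv_upper_bound(joint_x):
    """H(X) - max_i H(X_i): upper bound on I(X : S) for any SRV S of X."""
    n = len(next(iter(joint_x)))
    total = entropy(joint_x, tuple(range(n)))
    return total - max(entropy(joint_x, (i,)) for i in range(n))

two_bits = {(a, b): 0.25 for a, b in product((0, 1), repeat=2)}
two_trits = {(a, b): 1 / 9 for a, b in product(range(3), repeat=2)}
copied_bit = {(b, b): 0.5 for b in (0, 1)}  # X1 = X2, fully correlated

bound_bits = srv_upper_bound(two_bits)
bound_trits = srv_upper_bound(two_trits)
bound_copy = srv_upper_bound(copied_bit)
print(bound_bits, bound_trits, bound_copy)
```

Two independent bits allow at most one bit of synergy and two independent uniform 3-valued variables allow log 2 3 bits, while two perfectly correlated inputs leave no room for any SRV.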

Non-Equivalence of SRVs
It is indeed possible to have at least two non-redundant MSRVs in Σ(X), i.e., I(S 1 : S 2 ) < H(S 1 ) where S 1 , S 2 ∈ Σ(X), or even I(S 1 : S 2 ) = 0. In words, this means that there can be multiple types of synergistic relation with X which are not equivalent. This is demonstrated by the following example: X = {X 1 , X 2 } with X i ∈ {0, 1, 2} and uniform distribution Pr(X) = 1/9, where S 1 ≡ (2 − X 1 + X 2 ) mod 3 and S 2 ≡ (X 1 + X 2 ) mod 3. The fact that these functions are MSRVs is verified numerically by trying all combinations. It can also be seen visually in Figure 1; adding additional states for S 1 or S 2 or changing their distribution will break the symmetries needed to stay uncorrelated with the individual inputs. In this case I(S 1 : S 2 ) = 0 so the two MSRVs are mutually independent, whereas I(S 1 : X) = I(S 2 : X) = log 2 3 ≈ 1.58. In fact, as shown in Section 4.1 this is actually the maximum possible mutual information that any SRV can store about X. Since the MSRVs are a subset of the SRVs it follows trivially that SRVs can be non-equivalent or even independent.
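The information-theoretic claims about this example (both variables satisfy the SRV conditions, they are mutually independent, and each stores log 2 3 bits about X) can be checked by enumerating the nine input states. The sketch below does only that; it does not verify maximality, and the helper names are illustrative.

```python
from collections import defaultdict
from itertools import product
from math import log2

def entropy(joint, idx):
    """Shannon entropy (bits) of the variables at positions idx of a joint pmf."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def mi(joint, a, b):
    """Mutual information I(A : B) in bits; a and b are tuples of positions."""
    return entropy(joint, a) + entropy(joint, b) - entropy(joint, a + b)

# Positions: 0 = X1, 1 = X2, 2 = S1, 3 = S2.
joint = {(x1, x2, (2 - x1 + x2) % 3, (x1 + x2) % 3): 1 / 9
         for x1, x2 in product(range(3), repeat=2)}

# Information of each candidate about each individual input (should vanish).
leaks = [mi(joint, (s,), (x,)) for s in (2, 3) for x in (0, 1)]
i_s1_s2 = mi(joint, (2,), (3,))    # independence of the two candidates
i_s1_x = mi(joint, (2,), (0, 1))   # information about the pair (X1, X2)
i_s2_x = mi(joint, (3,), (0, 1))
print(max(leaks), i_s1_s2, i_s1_x, i_s2_x)
```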

Synergy among MSRVs
Figure 1. The values of the two MSRVs S 1 and S 2 which are mutually independent but highly synergistic about two 3-valued variables X 1 and X 2 . X 1 and X 2 are uniformly distributed and independent.
The combination of two (or more) MSRVs, such as (S 1 , S 2 ) with S 1 , S 2 ∈ Σ(X), cannot itself be an SRV: otherwise this combination would make its members redundant and would replace them in Σ(X).
This means that all combinations of MSRVs, such as (S 1 , S 2 ), must necessarily have non-zero mutual information about at least one of the individual source variables, i.e., ∃i : I(S 1 , S 2 : X i ) > 0, violating Equation (2). Since each individual MSRV has zero mutual information with each individual source variable by definition, it must be true that this "non-synergistic" information results from synergy among MSRVs. We emphasize that this type of synergy among the S i ∈ Σ(X) is different from the synergy among the X i ∈ X which we intend to quantify in this paper, and could more appropriately be considered as a "synergy of synergies".
The fact that multiple MSRVs are possible is already proven by the example used in the previous proof in Section 4.2. The synergy among these two MSRVs in this example is indeed easily verified: I(S 1 : X 1 ) = 0 and I(S 2 : X 1 ) = 0, whereas I(S 1 , S 2 : X 1 ) = I(S 1 , S 2 : X 2 ) = log 2 3.
Since MSRVs are a subset of the SRVs it follows that also SRVs can have such "synergy-of-synergies". In fact, the existence of multiple MSRVs means that there are necessarily SRVs which are synergistic about another SRV, and conversely, if there is only one MSRV then there cannot be any set of SRVs which are synergistic about another SRV.
Corollary. Alternatively quantifying synergistic information using directly the mutual information I(Y : Σ ⊥ π k (Σ) (X)) could violate the fourth desired property, "Zero synergy about a single variable", because if Σ(X) consists of two or more MSRVs then ∃i : I(S 1 , S 2 : X i ) > 0. In this case the choice Y = X i would have non-zero synergistic information about X, which is undesired.
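The "synergy of synergies" in the 3-valued example of Section 4.2 can be made explicit by comparing what each MSRV knows about an individual input with what the pair knows. The following sketch enumerates the same nine input states; the helper names are illustrative.

```python
from collections import defaultdict
from itertools import product
from math import log2

def entropy(joint, idx):
    """Shannon entropy (bits) of the variables at positions idx of a joint pmf."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def mi(joint, a, b):
    """Mutual information I(A : B) in bits; a and b are tuples of positions."""
    return entropy(joint, a) + entropy(joint, b) - entropy(joint, a + b)

# Positions: 0 = X1, 1 = X2, 2 = S1, 3 = S2.
joint = {(x1, x2, (2 - x1 + x2) % 3, (x1 + x2) % 3): 1 / 9
         for x1, x2 in product(range(3), repeat=2)}

i_s1_x1 = mi(joint, (2,), (0,))      # each MSRV alone: nothing about X1
i_s2_x1 = mi(joint, (3,), (0,))
i_pair_x1 = mi(joint, (2, 3), (0,))  # the pair (S1, S2): everything about X1
i_pair_x2 = mi(joint, (2, 3), (1,))  # ... and everything about X2
print(i_s1_x1, i_s2_x1, i_pair_x1, i_pair_x2)
```

The pair (S 1 , S 2 ) fully determines each input, which is why summing over individual OSRVs rather than using their joint mutual information matters for the "zero synergy about a single variable" property.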

XOR-Gates of Random Binary Inputs Always Form an MSRV
Lastly we use our definition of synergy to prove the common intuition that the XOR-gate is maximally synergistic about a set of i.i.d. binary variables (bits), as suggested in the introductory example.
We start with the case of two bits X 1 , X 2 ∈ {0, 1}. As SRV we take S 1 ≡ X 1 ⊕ X 2 . The entropy of this SRV equals 1, which is in fact the upper bound of any SRV for this X, Equation (17). Therefore no other SRV can make S 1 completely redundant such that it would prevent S 1 from becoming an MSRV (Section 2.2.2). It is only possible for another SRV to make S 1 redundant in case the converse is also true, in which case the two SRVs are equivalent. An example of this would be the NOT-XOR gate which is informationally equivalent to XOR. Here we consider equivalent SRVs as one and the same.
For the more general case of N bits X 1 , ..., X N ∈ {0, 1}, consider as SRV the set of XOR-gates $S \equiv \{S_i\}_{i=1}^{N-1}$ with $S_i = X_i \oplus X_{i+1}$. It is easily verified that S does not contain mutual information about any individual bit X i , so indeed S ∈ σ(X). Moreover it is also easily verified that all S i are independent, so the entropy H(S) = N − 1, which equals the upper bound on any SRV. Following the same reasoning as the two-bit case, S is indeed an MSRV. We remark that conversely, each possible set of XOR gates is not necessarily an MSRV because, e.g., X 1 ⊕ X 3 is redundant given both X 1 ⊕ X 2 and X 2 ⊕ X 3 . That is, some (sets of) XOR-gates are redundant given others and will therefore not be a member of the set Σ(X) by construction.
The converse is proved for the case of two independent input bits in Appendix A.4, that is, the only possible MSRV of two bits is the XOR-gate.
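The N-bit statements can be checked by enumeration for a small N. The sketch below assumes that the set of XOR-gates consists of the N − 1 gates on consecutive bits, X i ⊕ X i+1 (an assumption about the exact construction), and uses N = 4; it also confirms that X 1 ⊕ X 3 is redundant given X 1 ⊕ X 2 and X 2 ⊕ X 3 .

```python
from collections import defaultdict
from itertools import product
from math import log2

def entropy(joint, idx):
    """Shannon entropy (bits) of the variables at positions idx of a joint pmf."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def mi(joint, a, b):
    """Mutual information I(A : B) in bits; a and b are tuples of positions."""
    return entropy(joint, a) + entropy(joint, b) - entropy(joint, a + b)

N = 4
joint = {}
for bits in product((0, 1), repeat=N):
    gates = tuple(bits[i] ^ bits[i + 1] for i in range(N - 1))  # assumed gate set
    extra = bits[0] ^ bits[2]                                   # X1 XOR X3
    joint[bits + gates + (extra,)] = 2.0 ** -N

x_pos = tuple(range(N))          # input bits
s_pos = tuple(range(N, 2 * N - 1))  # the N - 1 consecutive XOR-gates
extra_pos = 2 * N - 1

h_s = entropy(joint, s_pos)                            # entropy of the gate set
leak = max(mi(joint, s_pos, (i,)) for i in x_pos)      # info about any single bit
# H(X1 XOR X3 | X1 XOR X2, X2 XOR X3): zero means fully redundant.
redundancy = entropy(joint, (extra_pos, N, N + 1)) - entropy(joint, (N, N + 1))
print(h_s, leak, redundancy)
```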

Examples
In this Section we derive the SRVs and MSRVs for some small example input-output relations to illustrate how synergistic information is calculated using our proposed definition. This also allows comparing our approach to that of others in the field, particularly PID-like frameworks even though the community has not yet settled on a satisfactory definition of such a framework.

Two Independent Bits and XOR
Let X 1 , X 2 ∈ {0, 1} be two independent input bits, so Pr(X 1 , X 2 ) = Pr(X 1 ) · Pr(X 2 ). We derive in Appendix A.4 that any SRV in this case is necessarily a (stochastic) function of the XOR relation X 1 ⊕ X 2 . One could thus informally write S = f (X 1 ⊕ X 2 ) with the understanding that the function f could be either a deterministic function or result in a stochastic variable. Therefore, the set σ(X) consists of all possible deterministic and stochastic mappings f . A deterministic example would be S 1 = ¬(X 1 ⊕ X 2 ) and a stochastic example could be written loosely as the mixture in which S 2 equals X 1 ⊕ X 2 with probability p and ¬(X 1 ⊕ X 2 ) with probability 1 − p, where the probability p ≠ 1/2 satisfies the first condition in Equation (2).
The set σ(X) is indeed an uncountably large set in this case. However in Appendix A.4 we also derive that there is only a single MSRV in the set Σ(X): the XOR-gate X 1 ⊕ X 2 itself. It makes all other SRVs redundant. This confirms our conjecture in Section 2.2.2 that although σ(X) is uncountable, Σ(X) may typically still be countable.
For any output stochastic variable Y the amount of synergistic information is simply equal to I(Y : X 1 ⊕ X 2 ), according to Equation (5). This implies trivially that the XOR-gate Y ⊕ = X 1 ⊕ X 2 has maximum synergistic information for two independent input bits, as is shown schematically in Figure 2a.
Figure 2 (panel b). An additional input bit is added which copies the XOR output, adding individual (unique) information I(X 3 : Y) = 1.

XOR-Gate and Redundant Input
Suppose now that an extra input bit X 3 is added as X 3 = X 1 ⊕ X 2 and that the output is still Y ⊕ = X 1 ⊕ X 2 (see Figure 2b). This case highlights a crucial difference between our method and that of PID-like frameworks.
In our method this leaves the perceived amount of synergistic information stored in Y unchanged as it still stores the XOR-gate. Adding unique or "individual" information can never reduce this.
In PID-like frameworks, however, synergistic information and individual information are treated on equal footing and subtracted; intuitively, the more individual information is in the output, the less synergy it computes. As a result, various canonical PID-based synergy measures all reach "the desired answer of 0 bits" for this so-called XorLoses example [10].
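The contrast for this example can be illustrated with quantities that only require mutual information. The sketch below computes I(Y : X 1 ⊕ X 2 ), which the text above identifies as the synergistic information stored in Y, next to the whole-minus-sum quantity from the introduction. WMS is used here only as a simple stand-in that is easy to compute; it is not one of the PID measures themselves, and the helper names are illustrative.

```python
from collections import defaultdict
from itertools import product
from math import log2

def entropy(joint, idx):
    """Shannon entropy (bits) of the variables at positions idx of a joint pmf."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def mi(joint, a, b):
    """Mutual information I(A : B) in bits; a and b are tuples of positions."""
    return entropy(joint, a) + entropy(joint, b) - entropy(joint, a + b)

# Positions: 0 = X1, 1 = X2, 2 = X3 (copy of the XOR), 3 = Y, 4 = X1 XOR X2.
joint = {(x1, x2, x1 ^ x2, x1 ^ x2, x1 ^ x2): 0.25
         for x1, x2 in product((0, 1), repeat=2)}

i_x3_y = mi(joint, (2,), (3,))    # individual information added by X3
i_xor_y = mi(joint, (4,), (3,))   # information Y stores about X1 XOR X2
wms = mi(joint, (0, 1, 2), (3,)) - sum(mi(joint, (i,), (3,)) for i in (0, 1, 2))
print(i_x3_y, i_xor_y, wms)
```

The information that Y stores about the XOR relation stays at one bit, whereas the subtractive quantity drops to zero once the individually informative input X 3 is present.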

AND-Gate
Our method nevertheless still differs from that of Bertschinger et al. [14] as demonstrated in the simple case of the AND-gate of two independent random bits, i.e., Y ∧ = X 1 ∧ X 2 . In our method the outcome of the AND-gate is simply correlated (using mutual information) with that of the XOR-gate as shown in Section 5.1, i.e.:
$$I_{\mathrm{syn}}(X \to Y_{\wedge}) = I(X_1 \oplus X_2 : Y_{\wedge}) \approx 0.311.$$
Their proposed method in contrast calculates that the individual information with either input equals I(X 1 : Y ∧ ) = I(X 2 : Y ∧ ) = −3/4 log 2 3/4 ≈ 0.311, which they infer as fully "shared" (intuitively speaking, both inputs are said to provide exactly the same individual information to the output). The total information with both inputs equals I(X : Y ∧ ) = H(Y ∧ ) ≈ 0.811. Since in the PID framework all four types of information are required to sum up to the total mutual information and no "unique" information exists, their method infers the synergy equals I(X : Y ∧ ) − I(X 1 : Y ∧ ) = 1/2. Indeed, Griffith and Koch [10] state that for this example all but one PID-like measure result in 0.189 bits of synergy; only the original I min measure also results in 0.5 bits which agrees with our method.
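The numbers in the AND-gate comparison can be reproduced from the joint distribution of two independent fair bits. The sketch below computes the mutual information between the AND output and the XOR relation, the single-input terms, the total mutual information, and the two differences quoted above; variable names are illustrative.

```python
from collections import defaultdict
from itertools import product
from math import log2

def entropy(joint, idx):
    """Shannon entropy (bits) of the variables at positions idx of a joint pmf."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def mi(joint, a, b):
    """Mutual information I(A : B) in bits; a and b are tuples of positions."""
    return entropy(joint, a) + entropy(joint, b) - entropy(joint, a + b)

# Positions: 0 = X1, 1 = X2, 2 = AND output, 3 = X1 XOR X2.
joint = {(x1, x2, x1 & x2, x1 ^ x2): 0.25
         for x1, x2 in product((0, 1), repeat=2)}

i_xor_and = mi(joint, (3,), (2,))   # I(X1 XOR X2 : Y_and)
i_x1_and = mi(joint, (0,), (2,))    # I(X1 : Y_and)
i_x2_and = mi(joint, (1,), (2,))    # I(X2 : Y_and)
i_x_and = mi(joint, (0, 1), (2,))   # I(X : Y_and) = H(Y_and)
total_minus_one = i_x_and - i_x1_and             # total minus one shared term
whole_minus_sum = i_x_and - i_x1_and - i_x2_and  # total minus both single terms
print(round(i_xor_and, 3), round(i_x1_and, 3), round(i_x_and, 3),
      round(total_minus_one, 3), round(whole_minus_sum, 3))
```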

Numerical Implementation
We have implemented the numerical procedures to compute the above as part of a Python library named jointpdf (https://bitbucket.org/rquax/jointpdf). Here, a set of discrete stochastic variables X is represented by a matrix of joint probabilities of dimensions m^n, where n is the number of variables and m is the number of possible values per variable. This matrix is uniquely identified by m^n − 1 independent parameters each on the unit line.
In brief, finding an MSRV S amounts to numerically optimizing a subset of the (bounded) parameters of Pr(X,S) in order to maximize I(S:X) while satisfying the conditions for SRVs in Equation (2). Then we approximate the set of OSRVs Σ ⊥ π k (Σ) (X) by constructing it iteratively. For finding the next OSRV S N in addition to an existing set S 1 , . . . ,S N−1 , the independence constraint I(S N : S 1 , ..., S N−1 ) = 0 is added to the numerical optimization. The procedure finishes once no more OSRVs are found. The optimization of their ordering is implemented by restarting the sequence of numerical optimizations from different starting points and taking the result with highest synergistic information. Orthogonal decomposition is also implemented even though it is not used since the OSRV set is built directly using this optimization procedure. This uses the fact that each decomposed part of an SRV must also be an SRV (assuming perfect orthogonal decomposition) and can therefore be found directly in the optimization. For all numerical optimizations the algorithm scipy.optimize.minimize (version 0.11.0) is used.
Once the probability distribution is extended with the set of OSRVs, the amount of synergistic information has a confidence interval due to the approximate nature of the numerical optimizations. That is, one or more OSRVs may turn out to store a small amount of unwanted information about individual inputs. We subtract these unwanted quantities from each mutual information term in Equation (5) in order to estimate the synergistic information in each OSRV. However, these subtracted terms could be (partially) redundant, the extent of which cannot be determined in general. Thus, once the optimal sequence of OSRVs is found we take the lower bound on the estimated synergistic information Î syn (X → Y) as:
$$\hat{I}_{\mathrm{syn}}^{\min}(X \to Y) = \sum_i \Big[ I(S_i^{\perp} : Y) - \sum_j I(S_i^{\perp} : X_j) \Big].$$
This corresponds to the case where each subtracted mutual information term is fully independent so that they can be summed, leading to this WMS form [6].
On the other hand, the corresponding upper bound would occur if all subtracted mutual information terms were fully redundant, in which case:
$$\hat{I}_{\mathrm{syn}}^{\max}(X \to Y) = \sum_i \Big[ I(S_i^{\perp} : Y) - \max_j I(S_i^{\perp} : X_j) \Big].$$
We take the middle point between these bounds as the best estimate Î syn (X → Y). The corresponding measure of uncertainty is then defined as the relative error of this estimate with respect to the two bounds.
The following numerical results have been obtained for the case of two input variables, X 1 and X 2 , and one output variable Y. Their joint probability distribution Pr(X 1 ,X 2 ,Y) is randomly generated unless otherwise stated. Once an OSRV is found it is added to this distribution as an additional variable. All variables are constrained to have the same number of possible values ("state space") in our experiments. All results reported in this section have been obtained by sampling random probability distributions. This results in interesting characteristics pertaining to the entire space of probability distributions but poses a limitation when attempting to translate the results to any specific application domain such as neuronal networks or gene-regulation models, since such domains focus only on specific subspaces of probability distributions.
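The lower bound, upper bound and midpoint estimate described above can be sketched as plain arithmetic on already computed mutual information terms. The sketch assumes that the unwanted quantities are the terms I(S i : X j ), that the lower bound subtracts their sum and the upper bound subtracts only their maximum; the function `synergy_estimate` and the input numbers are hypothetical and are not taken from jointpdf or from any experiment.

```python
def synergy_estimate(mi_with_y, mi_with_inputs):
    """Bounds and midpoint for the estimated synergistic information.

    mi_with_y[i]      : I(S_i : Y) for the i-th OSRV found by the optimizer.
    mi_with_inputs[i] : list of unwanted terms I(S_i : X_j) for that OSRV.
    """
    # Lower bound: unwanted terms assumed fully independent, so they add up.
    lower = sum(iy - sum(leaks) for iy, leaks in zip(mi_with_y, mi_with_inputs))
    # Upper bound: unwanted terms assumed fully redundant, so only the largest counts.
    upper = sum(iy - max(leaks) for iy, leaks in zip(mi_with_y, mi_with_inputs))
    return lower, upper, 0.5 * (lower + upper)

# Made-up numbers: two OSRVs, each with small leakage about X1 and X2.
lower, upper, mid = synergy_estimate([0.50, 0.30], [[0.02, 0.01], [0.00, 0.03]])
print(lower, upper, mid)
```

The gap between the two bounds grows with the amount of leaked individual information, so a tight interval indicates that the optimizer found OSRVs that closely satisfy the SRV conditions.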

Success Rate and Accuracy of Finding SRVs
Our first result concerns the ability of our numerical algorithm to find a single SRV as a function of the number of possible states per individual variable. This matters because our definition of synergistic information in Equation (5) relies on perfect orthogonal decomposition; we showed that perfect orthogonal decomposition is impossible for at least one type of relation among binary variables (Appendix A.6), whereas previous work hints that continuous variables might be (almost) perfectly decomposed (Section 7.1). Figure 3 shows the probability of successfully finding an SRV for variables with a state space of 2, 3, 4 and 5 values. Success is defined as a relative error on the entropy of the SRV of less than 10%. Figure 3 also shows the expected relative error on the entropy of an SRV once successfully found, which is relevant for our confidence in the subsequent results. For 2 or 3 values per variable we find a relative error in the low range of 1%-3%, indicating that finding an SRV is a bimodal problem: either it is found with relatively low error, or it is not found and the error is high. For 4 or more values per variable a satisfactory SRV is always found. This indicates that additional degrees of freedom aid in finding SRVs.

Figure 3. Red line with dots: probability that an SRV could be found with at most 10% relative error in 50 randomly generated Pr(X 1 ,X 2 ,Y) distributions. The fact that it is lowest for binary variables is consistent with the observation that perfect orthogonal decomposition is impossible in this case under at least one known condition (Appendix A.6). The fact that it converges to 1 is consistent with our suggestion that orthogonal decomposition could be possible for continuous variables (Section 7.1). Blue box plot: expected relative error of the entropy of a single SRV, once successfully found.

Efficiency of a Single SRV
Once an SRV is successfully found, the next question is how much synergistic information it actually contains compared to the maximum possible. According to Equation (17) and the discussion preceding it, the upper bound is the minimum of H(X 2 |X 1 ) and H(X 1 |X 2 ). Thus, a single added variable as SRV has in principle sufficient entropy to store this information. However, depending on Pr(X 1 , X 2 ) it is possible that a single SRV cannot store all synergistic information at once, regardless of how much entropy it has, as demonstrated in Section 4.3. This happens if two or more SRVs are mutually "incompatible" (cannot be combined into a single, large SRV). Therefore we show the expected synergistic information in a single SRV normalized by the corresponding upper bound in Figure 4.
The decreasing trend indicates that this incompatibility among SRVs plays a significant role as the state space of the variables grows. This would imply that an increasing number of SRVs must be found in order to estimate the total synergistic information Î syn (X → Y). Fortunately, Figure 4 also suggests that the efficiency settles to a non-zero constant, which suggests that the number of needed SRVs does not grow to impractical numbers.
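The normalization used for this efficiency can be sketched as follows. The function names are hypothetical and the snippet only evaluates the stated upper bound min(H(X 2 |X 1 ), H(X 1 |X 2 )) from a joint table Pr(X 1 , X 2 ); it is not taken from the paper's software.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in bits; zero-probability entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def single_srv_upper_bound(p_x1x2):
    """min(H(X2|X1), H(X1|X2)) in bits from a joint table Pr(X1, X2)."""
    h_joint = entropy(p_x1x2)
    h_x1 = entropy(p_x1x2.sum(axis=1))
    h_x2 = entropy(p_x1x2.sum(axis=0))
    return min(h_joint - h_x1, h_joint - h_x2)


def srv_efficiency(i_syn_single, p_x1x2):
    """Synergistic information of one SRV divided by its upper bound."""
    return i_syn_single / single_srv_upper_bound(p_x1x2)


uniform_bits = np.full((2, 2), 0.25)
print(single_srv_upper_bound(uniform_bits))   # bound for two uniform bits
print(srv_efficiency(1.0, uniform_bits))      # XOR stores 1 bit
```

For two independent uniform bits the bound is 1 bit, which the XOR relation attains, so its efficiency is 1.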

Resilience Implication of Synergy
Finally we compare the impact of two types of perturbations in two types of input-output relations, namely the case of a randomly generated Pr(Y|X 1 , X 2 ) versus the case that Y is an SRV of X. A "local" perturbation is implemented by adding a random vector with norm 0.1 to the point in the unit hypercube that defines the marginal distribution of a randomly selected input variable, so P(X 1 ) or P(X 2 ). Conversely, a "non-local" perturbation is similarly applied to P(X 1 , X 2 ) while keeping the marginal distributions P(X 1 ) and P(X 2 ) unchanged. The impact is quantified by the relative change of the mutual information I(X 1 , X 2 : Y) due to the perturbation. That is, we ask whether a small perturbation disrupts the information transmission when viewing X 1 , X 2 → Y as a communication channel.
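A "local" perturbation and its impact can be sketched as below. This is a simplification under stated assumptions: the random vector is added directly to the probability vector P(X 1 ) (made zero-sum so that normalization is preserved, then clipped), rather than to the hypercube parameterization used in our implementation, and Pr(X 2 |X 1 ) and Pr(Y|X 1 , X 2 ) are held fixed. All function names are hypothetical.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in bits; zero-probability entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def mi_inputs_output(p_x1x2, p_y_given_x):
    """I(X1,X2:Y) from Pr(X1,X2) (k x k) and Pr(Y|X1,X2) (k x k x m)."""
    p_xy = (p_x1x2[..., None] * p_y_given_x).reshape(-1, p_y_given_x.shape[-1])
    return entropy(p_xy.sum(axis=1)) + entropy(p_xy.sum(axis=0)) - entropy(p_xy)


def local_impact(p_x1, p_x2_given_x1, p_y_given_x, norm=0.1, seed=0):
    """Relative change of I(X1,X2:Y) after perturbing P(X1) only."""
    rng = np.random.default_rng(seed)
    delta = rng.normal(size=p_x1.shape)
    delta -= delta.mean()                   # zero-sum: probabilities still sum to 1
    delta *= norm / np.linalg.norm(delta)   # rescale to the requested norm
    p_new = np.clip(p_x1 + delta, 0.0, None)
    p_new /= p_new.sum()                    # assumption: clip and renormalize
    before = mi_inputs_output(p_x1[:, None] * p_x2_given_x1, p_y_given_x)
    after = mi_inputs_output(p_new[:, None] * p_x2_given_x1, p_y_given_x)
    return abs(after - before) / before


k = 2
p_x1 = np.full(k, 0.5)
p_x2_given_x1 = np.full((k, k), 0.5)        # X2 uniform and independent of X1
xor = np.zeros((k, k, k))                   # Y = X1 XOR X2 (an SRV here)
copy = np.zeros((k, k, k))                  # Y = X1 (purely individual)
for a in range(k):
    for b in range(k):
        xor[a, b, a ^ b] = 1.0
        copy[a, b, a] = 1.0

impact_xor = local_impact(p_x1, p_x2_given_x1, xor)
impact_copy = local_impact(p_x1, p_x2_given_x1, copy)
print(impact_xor, impact_copy)
```

In this toy case the XOR output stays uniform whatever P(X 1 ) is, so its mutual information with the inputs is unchanged, whereas the copy Y = X 1 loses information as soon as P(X 1 ) moves away from uniform.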
In Figure 5 we show that a synergistic Y is significantly less susceptible to local perturbations compared to a randomly generated Y. For non-local perturbations the difference in susceptibility is smaller but still significant. The null-hypothesis of equal population median is rejected both for local and non-local perturbations (Mood's median test, p-values 1.2 × 10 −13 and 5.5 × 10 −5 respectively; threshold 0.001).

The difference in susceptibility for local perturbations is intuitive because an SRV has zero mutual information with individual inputs, so it is arguably insensitive to changes in individual inputs.
We still find a non-zero expected impact; this could be partly explained by our algorithm's relative error being on the order of 3%, which is the same order as the relative impact found (2%). In order to test this intuition we devised the non-local perturbations to compare against. A larger susceptibility is indeed found for non-local perturbations; however, it remains unclear why synergistic variables are still less susceptible in the non-local case compared to randomly generated variables. Nevertheless, our numerical results indicate that synergy plays a significant role in resilience to noise. This is relevant especially for biological systems, which are continually subject to noise and must be resilient to it. A simple use-case of the jointpdf package to estimate synergies, as is done here, is included in Appendix A.8.

Figure 5. Left: The median relative change of the mutual information I(X 1 , X 2 : Y) after perturbing a single input variable's marginal distribution P(X 1 ) ("local" perturbation). Error bars indicate the 25th and 75th percentiles. A perturbation is implemented by adding a random vector with norm 0.1 to the point in the unit hypercube that defines the marginal distribution P(X 1 ). Each bar is based on 100 randomly generated joint distributions P(X 1 , X 2 , Y), where in the synergistic case Y is constrained to be an SRV of X 1 , X 2 . Right: the same as left except that the perturbation is "non-local" in the sense that it is applied to P(X 2 |X 1 ) while keeping P(X 1 ) and P(X 2 ) unchanged.

Orthogonal Decomposition
Our formulation is currently dependent on being able to orthogonally decompose MSRVs exactly. To the best of our knowledge our decomposition formulation has not appeared in previous literature. However from similar work we gather that it is not a trivial procedure, and we derive that it is even impossible to do exactly in certain cases, as we explore next.

Related Literature on Decomposing Correlated Variables
Our notion of orthogonal decomposition is related to the ongoing study of "common random variable" definitions dating back to around 1970. In particular, our definition of B ∥ appears equivalent to the definition by Wyner [15], here denoted B W , in case it holds that I(B W : A, B) = I(B : A). That is, in Appendix A.7 we show that under this condition their B W satisfies all three requirements in Equation (1). If the condition does not hold, it is an open question whether this implies that our B ∥ does not exist for the particular A, B. The required minimization step to calculate B W is highly non-trivial and solutions are known only for very specific cases [16,17].
To illustrate a different approach in this field, Gács and Körner [18] define their common random variable as the "largest" random variable which can be extracted deterministically from both A and B individually, i.e., f(A) = g(B) = B 0 for functions f and g chosen to maximize H(B 0 ). They show that I(B W : A) ≤ I(B : A), and it appears that in practice the "less than" relation typically holds, preventing its use for our purpose. Their variable is more restricted than ours but has applications in zero-error communication and cryptography.

Sufficiency of Decomposition
Our definition of orthogonal decomposition is sufficient to be able to define a consistent measure of synergistic information. However we leave it as an open question whether Equation (1) is actually more stringent than strictly necessary. Therefore, our statement is that if orthogonal decomposition is possible then our synergy measure is valid; in case it is not possible then it remains an open question whether this implies the impossibility to calculate synergy using our method. For instance, for the calculation of synergy in the case of two independent input bits there is actually no need for any orthogonal decomposition step among SRVs, such as in the examples in Section 4.2 and in Section 5. Important future work is thus to try to minimize the reliance on orthogonal decomposition while leaving the synergy measure intact.

Satisfiability of Decomposition
Indeed it turns out that it is not always possible to achieve a perfect orthogonal decomposition according to Equation (1), depending on A and B. For example, we demonstrate in Appendix A.6 that for the case of binary-valued A and B, it is impossible to achieve the decomposition in case B depends on A as Pr(B = A) = p b .
On the other hand, one sufficient condition for being able to achieve a perfect orthogonal decomposition is being able to restate A and B as A = (W, X) and B = (W, Y) for W, X, Y independent from each other. In this case it is easy to see that B = W and B ⊥ = Y are a valid orthogonal decomposition. Such a restating could be reached by reordering and relabeling of variables and states.
As an example consider A and B denoting the sequences of positions (paths) of two causally non-interacting random walkers on the plane which are under influence of the same constant drift tendency (e.g., a constant wind speed and direction). This drift creates a spurious correlation (mutual information) between the two walkers. From sufficiently long paths this constant drift tendency W can however be estimated and subsequently subtracted from both paths to create drift-corrected paths X and Y which are independent by construction, reflecting only the internal decisions of each walker. The two walkers therefore have a mutual information equal to the entropy H(W) of the "wind" stochastic variable whose value is generated once at the beginning of the two walks and then kept constant.
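The sufficient condition A = (W, X), B = (W, Y) can be checked numerically on a toy case. The sketch below uses three independent binary variables with arbitrarily chosen (hypothetical) marginals and confirms that I(A : B) equals H(W), as in the random-walker example.

```python
import itertools
import numpy as np


def entropy(p):
    """Shannon entropy in bits; zero-probability entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


# Hypothetical marginals of the independent variables W, X and Y.
p_w = np.array([0.7, 0.3])
p_x = np.array([0.5, 0.5])
p_y = np.array([0.2, 0.8])

# Joint table Pr(A, B) with A = (W, X) and B = (W, Y), each encoded as 0..3.
p_ab = np.zeros((4, 4))
for w, x, y in itertools.product(range(2), repeat=3):
    p_ab[2 * w + x, 2 * w + y] += p_w[w] * p_x[x] * p_y[y]

i_ab = entropy(p_ab.sum(axis=1)) + entropy(p_ab.sum(axis=0)) - entropy(p_ab)
h_w = entropy(p_w)
print(f"I(A:B) = {i_ab:.6f} bits, H(W) = {h_w:.6f} bits")
```

Here the shared part W accounts for all mutual information between A and B, while Y is independent of A.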
We propose the following more general line of reasoning to (asymptotically) reach this restating of A and B or at least approximate it. Nevertheless the remainder of the paper simply assumes the existence of the orthogonal decomposition and does not use any particular method to achieve it.
Consider the Karhunen-Loève transform (KLT) [19][20][21] which can restate any stochastic variable X as:

$$X = \mu + \sum_{k} \alpha_k Z_k .$$

Here, µ is the mean of X, the Z k are pairwise independent random variables, and the coefficients α k are real scalars. This transform could be seen as the random-variable analogue of the well-known principal component analysis or the Fourier transform.
Typically this transform is defined for a range of random variables in the context of a continuous stochastic process {X t } a≤t≤b . Here each X t is decomposed by Z k which are defined through the X t themselves as: Here, the scalar coefficients become functions α k (t) on [a, b] which must be pairwise orthogonal (zero inner product) and square-integrable. Otherwise the above-mentioned transform applies to each single X t in the same way, now with t-dependent coefficients α k (t). Nevertheless, for our purpose we leave it open how the Z k are chosen; through being part of a stochastic process or otherwise. We also note that the transform works similarly for the discrete case, which is often applied to image analysis.
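For reference, the textbook form of the transform for a stochastic process is sketched below. This is the standard formulation, assumed here to correspond to the definition referred to above; the mean function µ(t), the orthonormality of the α k (t), and the variances λ k are part of that standard formulation rather than taken from this paper.

```latex
\begin{align}
X_t &= \mu(t) + \sum_{k} \alpha_k(t)\, Z_k, \\
Z_k &= \int_a^b \bigl( X_t - \mu(t) \bigr)\, \alpha_k(t)\, dt, \\
\int_a^b \alpha_j(t)\, \alpha_k(t)\, dt &= \delta_{jk}, \qquad
\mathbb{E}[Z_j Z_k] = \lambda_k\, \delta_{jk}.
\end{align}
```

In this standard form the Z k are guaranteed to be uncorrelated; they are mutually independent when the process is Gaussian, so the independence used in the argument here should be read as an additional assumption in the general case.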
Let us now choose a single sequence of Z k as our variable "basis". Now consider two random variables A and B which can both be decomposed into Z k as the sequences {α k Z k } and {β k Z k }, respectively. In particular, the mutual information I(A : B) must be equal before and after this transform. Then the desired restating of A and B into A = (W, X) and B = (W, Y) is achieved by: The choice of the common Z k could either be natural, such as a common stochastic process of which both A and B are part, or a known common signal which two receivers intermittently record. Otherwise Z k could be found through a numerical procedure to attempt a numerical approximation, as is done for instance in image analysis tasks.

Discussion
Most theoretical work on defining synergistic information uses the PID framework [3], which (informally stated) requires that I(X : Y) = synergy + individual. That is, the more synergistic information Y stores about X, the less information it can store about an individual X i and vice versa because those two types of information are required to sum up to the quantity I(X : Y) as non-negative terms. Our approach is incompatible with this viewpoint. That is, in our framework the amount of synergistic information I syn (X → Y) makes no statement on the amount of "individual" information that Y may also store about X i . In fact, the proposed synergistic information I syn (X → Y) can be maximized by the identity I syn (X → X), which obviously also stores maximum information about all individual variables X i . To date no synergy measure has been found which has earned the consensus of the PID framework (or similar) community, typically by offering counter-examples. This led us to explore this completely different viewpoint. If our proposed measure would prove successful then it may imply that the decomposition requirement is too strong for a synergy measure to obey, and that synergistic information and individual information cannot be treated on equal footing (increasing one means decreasing the other by the same amount). Whether our proposed synergy measure can be used to define a different notion of non-negative information decomposition is left as an open question.
Our intuitive argument against the decomposition requirement is exemplified in Sections 4.2 and 4.3. This example demonstrates that two independent SRVs can exist which are not synergistic when taken together. That is, there are evidently two distinct (independent or uncorrelated) ways in which a variable Y can be completely synergistic about X (it could be set equal to one or the other SRV). However, we show it is impossible for Y to store information about both these SRVs simultaneously (maximum synergy) while still having zero information about all individual input variables X i -in fact this leads to maximum mutual information with all individual inputs in the example. This suggests that synergistic information and "individual" information cannot simply be considered on equal footing or as mutually exclusive.
Therefore we propose an alternative viewpoint. Whereas synergistic information could be measured by I syn (X → Y), the amount of "individual information" could foreseeably be measured by a similar procedure. For instance, the sequence Σ ⊥ π k (Σ) could be replaced by the individual inputs π m (X) after which the same procedure in Equation (5) as for I syn (X → Y) is repeated. This would measure the amount of "unique" information that Y stores about individual inputs which is not also stored in (combinations of) other inputs. This measure would be upper bounded by H(X). For N completely random and independent inputs, this individual information in Y would be upper bounded by N · H(X 1 ) whereas if Y were synergistic then its total mutual information would be upper bounded by (N − 1) · H(X 1 ) (since it is then an SRV). This suggests that both quantities measure different but not fully independent aspects. How the two measures relate to each other is subject of future work.
Our proposed definition builds upon the concept of orthogonal decomposition. It allows us to rigorously define a single, definite measure of synergistic information from first principles. However further research is needed to determine for which cases this decomposition can be done exactly, approximately, or not at all, and in which cases a decomposition is even necessary. Even if in a specific case it would turn out to be not exactly computable (due to imperfect orthogonal decomposition) then our definition can still serve as a reference point. To the extent that a necessary orthogonal decomposition must be numerically approximated (or bounded), the resulting amount of synergistic information must also be considered an approximation (or bound).
Our final point of discussion is that the choice of how to divide a stochastic variable X into subvariables X ≡ {X i } i is crucial and determines the amount of information synergy found. This choice strongly depends on the specific research question. For instance, the neurons of a brain may be divided into the two cerebral hemispheres, into many anatomical regions, or into individual neurons altogether, where at each level the amount of information synergy may differ. In this article we are not concerned with choosing the division and will calculate the amount of information synergy once the subvariables have been chosen.

Conclusions
In this paper we propose a measure to quantify synergistic information from first principles. Briefly, we first "extract" all synergistic entropy of a set of variables X ≡ {X i } i by constructing a new set of all possible maximally synergistic random variables (MSRVs) of X, denoted ∑(X), where each MSRV has non-zero mutual information with the set X but zero mutual information with any individual X i . This set of MSRVs is then transformed into a set of independent orthogonal SRVs (OSRV), denoted Σ ⊥ π k (Σ) (X), to prevent over counting. Then we define the amount of synergistic information in outcome variable Y about the set of source variables X as the sum of OSRV-specific mutual information quantities, ∑ S i ∈Σ ⊥ π k (Σ) (X) I(S i : Y). Our proposed measure satisfies important desired properties, e.g., it is non-negative and bounded by mutual information, invariant under reordering of X, and always has zero synergy if the input is a single variable. We also prove four important properties of our synergy measure. In particular, we derive the maximum mutual information in case Y is an SRV; we demonstrate that synergistic information can be of different types (multiple, independent SRVs); and we prove the fact that the combination of multiple SRVs may store non-zero information about an individual X i in a synergistic way. This latter property leads to the intriguing concept of "synergy among synergies", which we show must necessarily be excluded from quantifying synergy in Y about X but which might turn out to be an interesting subject of study in its own right. Finally, we provide a software implementation of the proposed synergy measure.
The ability to quantify synergistic information in an arbitrary multivariate setting is a necessary step to better understand how dynamical systems implement their complex information processing capabilities. Our proposed framework based on SRVs and orthogonal decomposition provides a new line of thinking and produces a general synergy measure with important desired properties.
Our initial numerical experiments suggest that synergistic relations are less sensitive to noise, which is an important property of biological and social systems. Studying synergistic information in complex adaptive systems will certainly lead to substantial new insights into their various emergent behaviors.

Here, the negative maximization term arises from applying the base case. We emphasize that this upper bound relation must be true for all choices of orderings π m (X) of all N labels (since the labeling is arbitrary and due to the desired property in Section 3.3). Therefore, S must satisfy all N! simultaneous instances of the above inequality, one for each possible ordering. Any S that satisfies the "most constraining" inequality, i.e., where the r.h.s. is minimal, necessarily also satisfies all N! inequalities. The r.h.s. is minimized for any ordering where the X i with overall maximum H(X i ) is part of the subset {X 1 , ..., X N−1 }. In other words, for the inequality with minimal r.h.s. it is true that, due to considering all possible reorderings, Substituting this above we find indeed that I(X 1 , ..., X N : S) ≤ H(X 1 , ..., X N ) − max 1≤i≤N H(X i ).

Appendix A.2 I syn (X → Y) Does Not "Overcount" Any Synergistic Information
All synergistic information that any Y can store about X is encoded by the set of SRVs σ(X) which is informationally equivalent to Σ ⊥ π k (Σ) (X), i.e., they have equal entropy and zero conditional entropy. Therefore I(Y : Σ ⊥ π k (Σ) (X)) should be an upper bound on I syn (X → Y) since otherwise some synergistic information must have been doubly counted. In this section we derive that I syn (X → Y) ≤ I(Y : Σ ⊥ π k (Σ) (X)). In Appendix A.2.1 we use the same derivation to demonstrate that a positive difference I(Y : Σ ⊥ π k (Σ) (X)) − I syn (X → Y) is undesirable at least in some cases. Here we start with the proof that I syn (X → Y) ≤ I(Y : Σ ⊥ π k (Σ) (X)) in case Σ ⊥ π k (Σ) (X) consists of two OSRVs, taken as base case n = 2 for a proof by induction. Then we also work out the case n = 3 so that the reader can see how the derivation extends for increasing n. Then we provide the proof by induction in n.
For n = 2 we use the property H(S 1 |S 2 ) = H(S 1 ) by construction of Σ ⊥ π k (Σ) (X): For n = 3 we similarly use the independence properties H(S 1 |S 2 ) = H(S 1 ) and H(S 1 , S 2 |S 3 ) = H(S 1 , S 2 ): Essentially, the proof for each n proceeds by rewriting each conditional mutual information term as a mutual information term and four added entropy terms (third equality above) of which two cancel out (H(S 1 , S 2 ) = H(S 1 , S 2 |S 3 ) above) and the remaining two terms summed together are non-negative (H(S 1 , S 2 |Y) ≥ H(S 1 , S 2 |Y, S 3 ) above). Thus, by induction: Thus we find that it is not possible for our proposed I syn (X → Y) to exceed the mutual information I(Y : Σ ⊥ π k (Σ) (X)). This suggests that I syn (X → Y) does not "overcount" any synergistic information.
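The n = 3 step described above can be written out as follows. This is a reconstruction from the verbal description (chain rule, four added entropy terms, cancellation by independence, non-negativity of the remainder) and is not a verbatim copy of the original display equations; the final line uses the n = 2 result I(Y : S 1 , S 2 ) ≥ I(Y : S 1 ) + I(Y : S 2 ).

```latex
\begin{align}
I(Y : S_1, S_2, S_3)
  &= I(Y : S_1, S_2) + I(Y : S_3 \mid S_1, S_2) \\
  &= I(Y : S_1, S_2) + I(Y : S_3)
     + \bigl[ H(S_1, S_2 \mid Y) - H(S_1, S_2 \mid Y, S_3) \bigr]
     + \bigl[ H(S_1, S_2 \mid S_3) - H(S_1, S_2) \bigr] \\
  &= I(Y : S_1, S_2) + I(Y : S_3)
     + \bigl[ H(S_1, S_2 \mid Y) - H(S_1, S_2 \mid Y, S_3) \bigr] \\
  &\geq I(Y : S_1) + I(Y : S_2) + I(Y : S_3).
\end{align}
```

The second bracket vanishes because H(S 1 , S 2 |S 3 ) = H(S 1 , S 2 ), and the first bracket is non-negative because conditioning cannot increase entropy.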

Appendix A.2.1 I(Y : Σ ⊥ π k (Σ) (X)) Also Includes Non-Synergistic Information
In the derivation of the previous section we observe that, conversely, I(Y : Σ ⊥ π k (Σ) (X)) can exceed I syn (X → Y) and we will now proceed to show that this is undesirable at least in some cases.
The positive difference I(Y : Σ ⊥ π k (Σ) (X)) − I syn (X → Y) must arise from one of the non-negative terms in square brackets in all derivations above. Suppose that Y = X i and therefore has zero information with any individual OSRV by definition. That is, Y = X i does not correlate with any possible synergistic relation (SRV) about X. In our view, Y = X i should thus be said to store zero synergistic information about X. However, even though ∀i : H(S i |Y) = H(S i ) by construction, this does not necessarily imply H(S 1 , ..., S n−1 |Y) = H(S 1 , ..., S n−1 |Y, S n ), among others, and therefore any term in square brackets above can still be positive. In other words, it is possible for Y = X i to "cooperate" or have synergy with one or more OSRVs to have non-zero mutual information about another OSRV. A concrete example of this is given in Section 4.3. This would lead to a non-zero synergistic information if quantified by I(Y : Σ ⊥ π k (Σ) (X)), which is undesirable in our view. In contrast, our proposed definition for I syn (X → Y) in Equation (5) purposely ignores this "synergy-of-synergies" and in fact will always yield I syn (X → Y) = 0 in case Y ≡ X i , which is desirable in our view and proved in Section 3.5.

Appendix A.3 Synergy Measure Correctly Handles Synergy-of-Synergies among SRVs
By "correctly handled" we mean that synergistic information is neither overcounted nor undercounted. We start from the conjecture that "non-synergistic" redundancy among a pair of SRVs does not lead to undercounting or overcounting of synergistic information. That is, suppose that I(S 1 : S 2 ) > 0, which we consider "non-synergistic" mutual information. If Y correlates with one or neither SRV then the optimal ordering is trivial. If it correlates with both then any ordering will do, assuming that their respective "parallel" parts (see Section 2.1) are informationally equivalent, so that it does not matter which one is retained in Σ ⊥ π k (Σ) (X). The respective orthogonal parts are retained in any case. Therefore we now proceed to handle the case where there is synergy among SRVs.
First we illustrate the apparent problem which we handle in this section. Suppose that σ(X) = {S 1 , S 2 , S 3 } and further suppose that I(S 1 , S 2 : S 3 ) = H(S 3 ) while ∀i, j : I(S i : S j ) = 0. In other words, by this construction the pair S 1 , S 2 synergistically makes S 3 fully redundant, and no non-synergistic redundancy among the SRVs exists. Finally, let S 3 ∈ Y. At first sight it appears possible that Σ ⊥ π k (Σ) (X) happens to be constructed using an ordering (S i ) i such that S 3 appears after S 1 and S 2 . This is unwanted because then S 3 will not be part of the Σ ⊥ π k (Σ) (X) used to compute I syn (X → Y), i.e., the term I(Y : S 3 ) disappears from the sum, which potentially leads to the contribution of S 3 to the synergistic information being ignored.
In this Appendix we show that the contribution is always counted towards I syn (X → Y) by construction, and that the only possibility for the individual term I(Y : S 3 ) to disappear is if its synergistic information is already accounted for.
First we interpret each such (synergistic) mutual information from a set of SRVs to another, single SRV as a (n − 1 to 1) hyperedge in a hypergraph. In the above example, there would be a hyperedge from the pair S 1 , S 2 to S 3 . Let the weight of this hyperedge be equal to the mutual information. In the Appendix A.3.1 below we prove that in this setting, one hyperedge from n − 1 SRVs to one SRV implies a hyperedge from all other possible n − 1 subsets to the remaining SRV, at the same weight. That is, the hypergraph for σ(X) = {S 1 , S 2 , S 3 } forms a fully connected "clique" of three hyperedges.
In this setting, finding a "correct" ordering translates to letting S n appear before all S 1 , ..., S n−1 have appeared in case there is a hyperedge S 1 , ..., S n−1 → S n and I(Y : S n ) > 0. This translates to traversing a path of n steps through the hyperedges in reverse order, each time choosing one SRV from the ancestor set that is not already previously chosen, such that for each SRV either (i) not all ancestor SRVs were chosen, or (ii) it has zero mutual information with Y. In other words, in case there is an S i such that I(Y : S i ) = 0 then any ordering with S i as last element will suffice. Only if Y correlates with all SRVs then one of the SRVs will be (partially) discarded by the order maximization process in I syn (X → Y). This is desirable because otherwise I syn (X → Y) could exceed I(X : Y) or even H(Y). Intuitively, if Y correlates with n − 1 SRVs then it automatically correlates with the n th SRV as well, due to the redundancy among the SRVs. Counting this synergistic information would be overcounting this redundancy, leading to the violation of the boundedness by mutual information.
An example that demonstrates this phenomenon is given by X ≡ {X 1 , X 2 , X 3 } consisting of three i.i.d. binary variables. It has four pairwise-independent MSRVs, namely the three pairwise XOR functions and one nested "XOR-of-XOR" function (verified numerically). However, one pairwise XOR is synergistically fully redundant given the two other pairwise XORs, so the entropy H(σ(X)) = 3, which equals H(X). Taking e.g., Y ≡ X yields indeed 3 bits of synergistic information according to our proposed definition of I syn (X → Y), correctly discarding the synergistic redundancy among the four SRVs. However, if the synergistically redundant SRV would not be discarded from the sum then we would find 4 bits of synergistic information in Y about X, which is counterintuitive because it exceeds H(X), H(Y), and I(X : Y). Intuitively, the fact that Y correlates with two pairwise XORs necessarily implies that it also correlates with the third pairwise XOR, so this redundant correlation should not be counted.
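The entropy bookkeeping of this three-bit example can be checked by brute-force enumeration. The sketch below only verifies the stated properties of the four XOR-based variables (zero information about each individual input, pairwise independence, and the joint entropies); it does not verify that they are maximal SRVs, and the helper names are hypothetical.

```python
import itertools
from collections import Counter
from math import log2

# All 8 equally likely states of three i.i.d. uniform bits (x1, x2, x3).
states = list(itertools.product([0, 1], repeat=3))


def joint_entropy(*funcs):
    """Entropy in bits of the tuple (f(state) for f in funcs)."""
    counts = Counter(tuple(f(s) for f in funcs) for s in states)
    n = len(states)
    return -sum(c / n * log2(c / n) for c in counts.values())


def mi(f, g):
    """Mutual information in bits between two functions of the state."""
    return joint_entropy(f) + joint_entropy(g) - joint_entropy(f, g)


inputs = [lambda s, i=i: s[i] for i in range(3)]
s12 = lambda s: s[0] ^ s[1]
s13 = lambda s: s[0] ^ s[2]
s23 = lambda s: s[1] ^ s[2]
s123 = lambda s: s[0] ^ s[1] ^ s[2]
srvs = [s12, s13, s23, s123]

max_individual = max(mi(f, x) for f in srvs for x in inputs)
max_pairwise = max(mi(f, g) for f, g in itertools.combinations(srvs, 2))
h_pairwise_xors = joint_entropy(s12, s13, s23)
h_all = joint_entropy(*srvs)
print(max_individual, max_pairwise, h_pairwise_xors, h_all)
```

The three pairwise XORs jointly carry only 2 bits because any one of them is the XOR of the other two, and adding the nested XOR brings the total to 3 bits, matching H(σ(X)) = H(X).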

Appendix A.3.1 Synergy among SRVs Forms a Clique
A particular set of SRVs σ(X) is given in arbitrary order. Suppose that the set S_1, S_2 is fully synergistic about S_3, i.e., I(S_1, S_2 : S_3) = d > 0, and assume first that ∀i ≠ j : I(S_i : S_j) = 0. This assumption is dropped in the subsection below. The question is: are S_2, S_3 then also synergistic about S_1, and S_1, S_3 about S_2? We will now prove that they are, and by exactly the same amount, i.e., I(S_2, S_3 : S_1) = I(S_1, S_3 : S_2) = d. The following proof is for the case of two variables being synergistic about a third, but it generalizes trivially to n variables (provided that the condition ∀i ≠ j : I(S_i : S_j) = 0 is also generalized to n − 1 variables).
First, by the chain rule for mutual information and the assumption I(S_1 : S_3) = 0:
I(S_1, S_2 : S_3) = I(S_1 : S_3) + I(S_2 : S_3 | S_1) = I(S_2 : S_3 | S_1) = d.
Then we use this to derive a different combination I(S_1, S_3 : S_2) (the third combination is derived similarly):
I(S_1, S_3 : S_2) = I(S_1 : S_2) + I(S_3 : S_2 | S_1) = I(S_3 : S_2 | S_1) = d.
In conclusion, we find that if a set of SRVs S_1, ..., S_{n−1} synergistically stores mutual information about S_n at amount d, then all subsets of n − 1 SRVs of S_1, ..., S_n will store exactly the same synergistic information about the respective remaining SRV. If each such synergistic mutual information from a set of SRVs to another SRV is considered as a directed (n − 1 to 1) hyperedge in a hypergraph, then the resulting hypergraph of SRVs will have a clique in S_1, ..., S_n.

Appendix A.3.2 Generalize to Partial Synergy among SRVs
Above we assumed ∀i ≠ j : I(S_i : S_j) = 0. Now we remove this constraint and thus allow all mutual informations among 2 (or, in general, n − 1) variables to be arbitrary. We then proceed as above, first:
I(S_1, S_2 : S_3) = I(S_1 : S_3) + I(S_2 : S_3 | S_1) = d ⇒ I(S_2 : S_3 | S_1) = d − I(S_1 : S_3),
and then:
I(S_1, S_3 : S_2) = I(S_1 : S_2) + I(S_3 : S_2 | S_1) = d + I(S_1 : S_2) − I(S_1 : S_3).
We see that again d is obtained for the mutual information among n variables, but a correction term appears to account for a difference in the mutual information quantities among n − 1 variables.
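Both cases can be sanity-checked numerically using only the chain rule for mutual information. The sketch below is illustrative rather than the paper's implementation: it assumes binary SRVs, uses S_3 = S_1 ⊕ S_2 with independent uniform S_1, S_2 as an instance of the pairwise-independent case, and uses an arbitrary random joint distribution for the general case. The helper names `H` and `I` are hypothetical.

```python
import random
from itertools import product
from math import log2


def H(pmf, idx):
    """Entropy (bits) of the marginal of a joint pmf {state: prob} on positions idx."""
    marg = {}
    for state, p in pmf.items():
        key = tuple(state[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(q * log2(q) for q in marg.values() if q > 0)


def I(pmf, a, b):
    """Mutual information I(A : B) in bits between the index groups a and b."""
    return H(pmf, a) + H(pmf, b) - H(pmf, a + b)


# Pairwise-independent case: S1, S2 independent uniform bits, S3 = S1 XOR S2.
xor_pmf = {(s1, s2, s1 ^ s2): 0.25 for s1, s2 in product([0, 1], repeat=2)}
d = I(xor_pmf, (0, 1), (2,))
d_about_s1 = I(xor_pmf, (1, 2), (0,))
d_about_s2 = I(xor_pmf, (0, 2), (1,))

# General case: arbitrary (seeded) random joint pmf over three bits.
rng = random.Random(0)
weights = [rng.random() for _ in range(8)]
total = sum(weights)
rand_pmf = {s: w / total for s, w in zip(product([0, 1], repeat=3), weights)}

d_general = I(rand_pmf, (0, 1), (2,))
lhs = I(rand_pmf, (0, 2), (1,))
# d plus the correction term I(S1 : S2) - I(S1 : S3).
rhs = d_general + I(rand_pmf, (0,), (1,)) - I(rand_pmf, (0,), (2,))

print(d, d_about_s1, d_about_s2, lhs - rhs)
```

In the XOR case all three quantities equal 1 bit, and in the random case I(S_1, S_3 : S_2) agrees with d plus the difference of the pairwise terms up to floating-point error.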
Figure A1. Left: The median relative change of the mutual information I(X_1, X_2 : Y) after perturbing a single input variable's marginal distribution P(X_1) ("local" perturbation). Error bars indicate the 25th and 75th percentiles. A perturbation is implemented by adding a random vector with norm 0.1 to the point in the unit hypercube that defines the marginal distribution P(X_1). Each bar is based on 100 randomly generated joint distributions P(X_1, X_2, Y), where in the synergistic case Y is constrained to be an SRV of X_1, X_2. Right: the same as left, except that the perturbation is "non-local" in the sense that it is applied to P(X_2|X_1) while keeping P(X_1) and P(X_2) unchanged.

The conditional distribution Pr(S|X) is illustrated in Figure 6.
For S to satisfy the conditions for being an SRV in Equation (2), the conditional probabilities must satisfy the following constraints. Firstly, it must be true that Pr(S, X_1) = Pr(S)Pr(X_1), which implies Pr(S|X_1 = 0) = Pr(S|X_1 = 1), and similarly Pr(S|X_2 = 0) = Pr(S|X_2 = 1), leading to:
∀i : a_i + b_i = c_i + d_i, and ∀i : a_i + c_i = b_i + d_i.
Secondly, in order to ensure that I(S : X) > 0 it must be true that Pr(S|X) ≠ Pr(S), meaning that the four k-vectors cannot all be equal:
¬(∀i : a_i = b_i = c_i = d_i).
The first set of constraints implies that the diagonal probability vectors must be equal, which can be seen by summing, respectively subtracting, the two equalities:
∀i : a_i + b_i + a_i + c_i = c_i + d_i + b_i + d_i ⇔ ∀i : a_i = d_i, and ∀i : b_i − c_i = c_i − b_i ⇔ ∀i : b_i = c_i.
The second constraint then requires that the non-diagonal probability vectors must be unequal, i.e.: ¬(∀i : a_i = c_i), and ¬(∀i : b_i = d_i).
The mutual information I(S : X) equals the entropy of the average of the four probability vectors minus the average of the entropies of the individual probability vectors. Denoting the Shannon entropy of a probability vector v by h(v), and using a = d and b = c, this reads:
I(S : X) = h((a + b + c + d)/4) − (h(a) + h(b) + h(c) + h(d))/4 = h((a + b)/2) − (h(a) + h(b))/2 = I(S : X_1 ⊕ X_2).
In words, the mutual information that any SRV S stores about two independent bits X_1, X_2 is equal to the mutual information with the XOR of the bits, X_1 ⊕ X_2. Intuitively, one could therefore think of S as the result of the following sequence of (stochastic) mappings: X_1, X_2 → X_1 ⊕ X_2 → S.
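This result can be illustrated numerically. The sketch below is not the paper's implementation and rests on stated assumptions: the two input bits are uniform and independent, and the four k-vectors are assigned as a = Pr(S|X_1=0, X_2=0), b = Pr(S|X_1=0, X_2=1), c = Pr(S|X_1=1, X_2=0), d = Pr(S|X_1=1, X_2=1), with the diagonal-equality constraints imposed as a = d and b = c. All function names are hypothetical.

```python
import random
from math import log2


def entropy(vec):
    """Shannon entropy (bits) of a probability vector."""
    return -sum(p * log2(p) for p in vec if p > 0)


def mean(vectors):
    """Element-wise average of equally weighted probability vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def mutual_information(cond):
    """I(S : Z) for a uniform Z, given the conditional vectors Pr(S | Z = z)."""
    return entropy(mean(cond)) - sum(entropy(v) for v in cond) / len(cond)


def random_prob_vector(k, rng):
    w = [rng.random() for _ in range(k)]
    total = sum(w)
    return [x / total for x in w]


rng = random.Random(1)
k = 4                                  # number of states of S (arbitrary choice)
a = random_prob_vector(k, rng)         # assumed Pr(S | X1=0, X2=0)
b = random_prob_vector(k, rng)         # assumed Pr(S | X1=0, X2=1)
c, d = b, a                            # SRV constraints: b = c and a = d

i_x1 = mutual_information([mean([a, b]), mean([c, d])])   # I(S : X1)
i_x2 = mutual_information([mean([a, c]), mean([b, d])])   # I(S : X2)
i_x = mutual_information([a, b, c, d])                    # I(S : X1, X2)
i_xor = mutual_information([mean([a, d]), mean([b, c])])  # I(S : X1 xor X2)

print(i_x1, i_x2, i_x, i_xor)
```

With a ≠ b the variable S carries no information about either bit individually, a positive amount about the pair, and exactly the same amount about X_1 ⊕ X_2.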
A corollary of this result is that the deterministic XOR function S = X_1 ⊕ X_2 is an MSRV of two independent bits, since by the data-processing inequality it maximizes I(S : X).
To be more precise, and as an aside, the MSRV S = X_1 ⊕ X_2 would technically also have to include all additional stochastic variables ("noise sources") that are used in any SRV, in order for S to make all SRVs redundant and thus satisfy Equation (3). For instance, for S to make, e.g., S_2 = p · (X_1 ⊕ X_2) + (1 − p) · ¬(X_1 ⊕ X_2) redundant, it would also have to store the outcome of the independent random choice governed by the probability p as a stochastic variable, meaning that the combined variable (X_1 ⊕ X_2, p) must actually be the MSRV, and so forth for the uncountably many SRVs in σ(X). However, we assume that noise sources like p are independent of the inputs X and outputs Y. They therefore contribute nothing to the mutual information terms used to calculate the synergistic information in Equation (5), meaning that we may ignore them when writing down MSRVs.
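The role of the noise source can be made concrete with a small enumeration. This is a minimal sketch under stated assumptions: the input bits are uniform and independent, and the random choice behind S_2 is modeled by a hypothetical binary variable `keep` that equals 1 with probability p (S_2 copies the XOR) and 0 otherwise (S_2 negates it). The helper names are illustrative.

```python
from itertools import product
from math import log2


def H(pmf, idx):
    """Entropy (bits) of the marginal of a joint pmf {state: prob} on positions idx."""
    marg = {}
    for state, prob in pmf.items():
        key = tuple(state[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(q * log2(q) for q in marg.values() if q > 0)


def I(pmf, a, b):
    """Mutual information I(A : B) in bits between the index groups a and b."""
    return H(pmf, a) + H(pmf, b) - H(pmf, a + b)


def joint(p):
    """Joint pmf over (x1, x2, xor, keep, s2) for the noisy XOR with parameter p."""
    pmf = {}
    for x1, x2, keep in product([0, 1], repeat=3):
        xor = x1 ^ x2
        s2 = xor if keep else 1 - xor
        pmf[(x1, x2, xor, keep, s2)] = 0.25 * (p if keep else 1 - p)
    return pmf


X, XOR, KEEP, S2 = (0, 1), (2,), (3,), (4,)
p = 0.9
pmf = joint(p)

i_xor = I(pmf, XOR, X)                  # deterministic XOR about the inputs
i_s2 = I(pmf, S2, X)                    # noisy XOR about the inputs
i_noise = I(pmf, KEEP, X)               # noise source alone about the inputs
i_xor_noise = I(pmf, XOR + KEEP, X)     # combined variable (XOR, keep)
h_s2_given = H(pmf, XOR + KEEP + S2) - H(pmf, XOR + KEEP)  # H(S2 | XOR, keep)

binary_entropy = -(p * log2(p) + (1 - p) * log2(1 - p))
print(i_xor, i_s2, 1 - binary_entropy, i_noise, i_xor_noise, h_s2_given)
```

In this sketch the noisy SRV stores 1 − h(p) bits about X, which is less than the 1 bit stored by the deterministic XOR; the combined variable (XOR, keep) fully determines S_2, yet adding `keep` leaves the mutual information with X unchanged because the noise source is independent of the inputs.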

Appendix A.5 Independence of the Two Decomposed Parts
From the first constraint I(B⊥ : A) = 0 it follows that: In particular, we will show that B⊥ cannot be computed from B without storing information about A, violating the orthogonality condition. Since B⊥ is supposed to be independent of A, we encode B⊥ by its dependence on B, which is fully specified by two parameters: Intuitively, in the case of binary variables, B⊥ cannot store information about B without also indirectly storing information about A. A possible explanation is that the binary case has an insufficient number of degrees of freedom for this.
To satisfy the condition I(B⊥ : A) = 0 it must be true that Pr(B⊥|A) = Pr(B⊥) and therefore that Pr(B⊥ = 1|A = 1) = Pr(B⊥ = 1), among others. Let us find the conditions for this equality: Pr(B⊥ = 1|A = 1) = Pr(B⊥ = 1), This result demonstrates that a class of correlated binary variables A and B exists for which perfect orthogonal decomposition is impossible. Choices for binary A and B for which decomposition is indeed possible do exist, such as the trivial independent case. Exactly how numerous such cases are is currently unknown, especially when the number of possible states per variable is increased.
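The binary obstruction can be explored with a small grid search. The sketch below is an illustration under assumptions that are not taken from the paper: A is a uniform bit, B equals A flipped with probability 0.2, and B⊥ is generated from B alone through two hypothetical parameters `alpha` = Pr(B⊥ = 1|B = 0) and `beta` = Pr(B⊥ = 1|B = 1), so that A → B → B⊥ forms a Markov chain.

```python
from itertools import product
from math import log2


def H(pmf, idx):
    """Entropy (bits) of the marginal of a joint pmf {state: prob} on positions idx."""
    marg = {}
    for state, prob in pmf.items():
        key = tuple(state[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(q * log2(q) for q in marg.values() if q > 0)


def I(pmf, a, b):
    """Mutual information I(A : B) in bits between the index groups a and b."""
    return H(pmf, a) + H(pmf, b) - H(pmf, a + b)


def joint(alpha, beta, flip=0.2):
    """Joint pmf over (a, b, b_perp) for the assumed chain A -> B -> B_perp."""
    pmf = {}
    for a, b, b_perp in product([0, 1], repeat=3):
        p_b = (1 - flip) if b == a else flip      # Pr(B = b | A = a)
        p_one = beta if b == 1 else alpha         # Pr(B_perp = 1 | B = b)
        p_perp = p_one if b_perp == 1 else 1 - p_one
        pmf[(a, b, b_perp)] = 0.5 * p_b * p_perp
    return pmf


A, B, B_PERP = (0,), (1,), (2,)
grid = [i / 10 for i in range(11)]
tol = 1e-12

# Parameter choices where B_perp is independent of A yet still informative about B.
counterexamples = []
for alpha, beta in product(grid, repeat=2):
    pmf = joint(alpha, beta)
    if I(pmf, B_PERP, A) < tol and I(pmf, B_PERP, B) > tol:
        counterexamples.append((alpha, beta))

i_ab = I(joint(0.5, 0.5), A, B)           # A and B are correlated
leak = I(joint(0.2, 0.7), B_PERP, A)      # an informative B_perp leaks into A

print(counterexamples, i_ab, leak)
```

For this particular pair the search finds no parameter choice in which B⊥ is independent of A while still storing information about B: independence from A is reached only when `alpha` equals `beta`, in which case B⊥ is also independent of B. This is consistent with the statement above but covers only the assumed example, not the general binary case.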
These are the "parallel" and "parsimony" conditions, concluding the proof.