A Partial Information Decomposition for Multivariate Gaussian Systems Based on Information Geometry

There is much interest in the topic of partial information decomposition, both in developing new algorithms and in developing applications. An algorithm based on standard results from information geometry was recently proposed by Niu and Quinn (2019). They considered the case of three scalar random variables from an exponential family, including both discrete distributions and a trivariate Gaussian distribution. The purpose of this article is to extend their work to the general case of multivariate Gaussian systems having vector inputs and a vector output. By making use of standard results from information geometry, explicit expressions are derived for the components of the partial information decomposition for this system. These expressions depend on a real-valued parameter which is determined by performing a simple constrained convex optimisation. Furthermore, it is proved that the theoretical properties of non-negativity, self-redundancy, symmetry and monotonicity, which were proposed by Williams and Beer (2010), are valid for the decomposition I ig derived herein. Application of these results to real and simulated data shows that the I ig algorithm does produce the results expected when clear expectations are available, although in some scenarios it can overestimate the levels of the synergy and shared information components of the decomposition, and correspondingly underestimate the levels of unique information. Comparisons of the I ig and I dep (Kay and Ince, 2018) methods show that they can both produce very similar results, but interesting differences between them are also revealed. The same may be said about comparisons between the I ig and I mmi (Barrett, 2015) methods.


Introduction
Williams and Beer [1] introduced a new method for the decomposition of information in a probabilistic system, termed partial information decomposition (PID). This allows the joint mutual information between a number of input sources and a target (output) to be decomposed into components which quantify different aspects of the transmitted information in the system. These are the unique information that each source conveys about the target; the shared information that all sources possess about the target; and the synergistic information that the sources in combination possess regarding the target. An additional achievement was to prove that the interaction information [2] is actually the difference between the synergy and redundancy in a system. Thus, a positive value for interaction information signifies that there is more synergy than redundancy in the system, while a negative value indicates the opposite. The work by Williams and Beer has led to many new methods for defining a PID, mainly for discrete probabilistic systems [3–13], spawning a variety of applications [14–19].
There has been considerable interest in PID methods for Gaussian systems. The case of static and dynamic Gaussian systems with two scalar sources and a scalar target was considered in [20], which applied the minimum mutual information PID, I mmi . Further insights were developed in [21] regarding synergy. A PID for Gaussian systems based on common surprisal was published in [7]. Barrett's work [20] was extended to multivariate Gaussian systems with two vector sources and a vector target in [22] using the I dep method, which was introduced for discrete systems in [8]. Further work based on the concept of statistical deficiency is reported in [23]. PID methods for Gaussian systems have been used in a range of applications [18,24–30]. We focus in particular here on the method proposed by Niu and Quinn [3]. They applied standard results from information geometry [31–33] in order to define a PID for three scalar random variables which follow an exponential family distribution, including a trivariate Gaussian distribution.
Here, we extend this work in two ways: (a) we provide general formulae for a PID involving multivariate Gaussian systems which have two vector sources and a vector target, by making use of the same standard methods from information geometry as in [3], and (b) we prove that the Williams–Beer properties of non-negativity, self-redundancy, symmetry and monotonicity are valid for this PID. We also provide some illustrations of the resulting algorithm using real and simulated data. The PID developed herein is based on some of the probability models in the same partially ordered lattice on which the I dep algorithm is based. Therefore, we also compare the results obtained with those obtained by using the I dep method. The I ig results are also compared with those obtained using the I mmi algorithm.

Notation
A generic 'p' will be used to denote an absolutely continuous probability density function (pdf), with the arguments of the function signifying which distribution is intended. Bold capital letters are used to denote random vectors, with their realised values appearing in bold lowercase, so that p(x1, x2, x3) denotes the joint pdf of the random vectors X1, X2, X3, while p(x1, x3 | x2) is the conditional pdf of (X1, X3) given a value for X2.
We consider the case where random vectors X1, X2, X3, of dimensions n1, n2, n3, respectively, have mean vectors equal to zero vectors of lengths n1, n2, n3, respectively, and a conformably partitioned covariance matrix. We stack these random vectors into the random vector Z, so that Z has dimension m = n1 + n2 + n3, and assume that Z has a positive definite multivariate Gaussian distribution with pdf p(x1, x2, x3), mean vector 0 and covariance matrix given by

Σ = [ Σ11    Σ12    Σ13
      Σ12^T  Σ22    Σ23
      Σ13^T  Σ23^T  Σ33 ],   (2)

where the covariance matrices Σ11, Σ22, Σ33 of X1, X2, X3, respectively, are of sizes n1 × n1, n2 × n2, n3 × n3, and Σ12, Σ13, Σ23 are the pairwise cross-covariance matrices between the three vectors X1, X2, X3. We also denote the conformably partitioned precision (or concentration) matrix K = Σ^{−1} by

K = [ K11    K12    K13
      K12^T  K22    K23
      K13^T  K23^T  K33 ].
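As a concrete illustration, the partitioned covariance matrix in (2) can be assembled from its blocks in a few lines of NumPy. This is only a sketch; the function name and the example blocks are ours, not part of the paper.

```python
import numpy as np

def build_cov(S11, S22, S33, S12, S13, S23):
    """Assemble the partitioned covariance matrix of the stacked
    vector Z = (X1, X2, X3) from its diagonal and cross blocks."""
    Sigma = np.block([
        [S11,   S12,   S13],
        [S12.T, S22,   S23],
        [S13.T, S23.T, S33],
    ])
    # The distribution is assumed non-degenerate, so Sigma must be
    # positive definite; Cholesky raises LinAlgError otherwise.
    np.linalg.cholesky(Sigma)
    return Sigma
```

For example, identity diagonal blocks with small cross blocks give a valid Σ of size m = n1 + n2 + n3.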

Some Information Geometry
We now describe some standard results from information geometry [32,33] as applied to zero-mean, partitioned multivariate Gaussian probability distributions. The fact that there is no loss of generality in making this zero-mean assumption will be justified by Lemma 1 in Section 3. The multivariate Gaussian pdf defined in (2) may be written in the form

p(z) = (|K|^{1/2} / (2π)^{m/2}) exp(−(1/2) z^T K z),

which may be written in terms of the Frobenius inner product as

p(z) = (|K|^{1/2} / (2π)^{m/2}) exp(−(1/2) ⟨K, z z^T⟩_F),

where z denotes the stacked vector of (x1, x2, x3). This is of exponential family form [33] (p. 34) and [34], with natural parameter −(1/2)K. We note that there is something of a terminological ambiguity here, since a 'parameter' is usually a real number. It is convenient to use the more compact notation provided by matrices, since this enables all of the elements of a matrix natural parameter to be set to zero simultaneously.
The exponential family distribution in (2) is a dually flat manifold [31], which we denote by M.
We define the following e-flat submanifolds S1–S7 of M, which may be conveniently pictured as the partially ordered lattice in Figure 1. The submanifolds S5 and S6 are necessary for the definition of the information-geometric PID [3], and the others will be considered in the sequel. Lattices similar to that in Figure 1 appear in [8,35,36] in relation to information decomposition, and in [37], which considers dually flat manifolds on posets. See also [38], and references therein, for the use of a variety of lattices of models in statistical work. Hierarchical chains of submanifolds were considered in [31], but here the submanifolds are not all in a hierarchical chain due to the presence of two antichains: {S2, S3, S4} and {S5, S6, S7}. There are, however, several useful chains within the lattice. Of particular relevance here are the chains {S2, S5, M}, {S2, S6, M} and {S2, M}. Application of Amari's mixed-cut coordinates [31] and calculation of divergences produces measures of mutual information that are of direct relevance in PID (as was noted by [3] for three scalar random variables), in that the standard chain-rule decompositions of mutual information [39] are obtained. These are nice illustrations of Amari's method.
Figure 1. A partially ordered lattice of the manifold M and submanifolds S1–S7. The form of the pdf that is shown for each submanifold is that obtained by m-projection of the distribution p(x1, x2, x3) onto the submanifold; for example, M: p(x1, x2, x3) and S1: p(x1)p(x2)p(x3).
We now consider m-projections from the pdf p ∈ M to each of the submanifolds S1–S7 [31]. It is easy to find the pdf in each submanifold that is closest, in terms of Kullback–Leibler (KL) divergence [40] (Ch. 4), to the given pdf p in M. They are given in Figure 1. We know [34,40] that setting a block of the inverse covariance of a multivariate Gaussian distribution to zero expresses a conditional independence between the variables involved. For example, consider S5. On this submanifold, K23 = 0, and so X2 and X3 are conditionally independent given a value for X1. Therefore, this pdf, which we denote by p5, has the form p5(x1, x2, x3) = p(x2 | x1) p(x3 | x1) p(x1). On submanifold S2, there are two conditional independences, and so X3 and the pair (X1, X2) are independent; the closest pdf in S2 to the pdf p has the form p2(x1, x2, x3) = p(x1, x2) p(x3).
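The m-projection onto S5 can be checked numerically: since p5 retains the marginal and conditional structure of p except for the X2–X3 coupling, the divergence D(p || p5) coincides with the conditional mutual information I[X2; X3 | X1]. A minimal sketch of this check follows (NumPy; the function names and the example matrices are ours).

```python
import numpy as np

def gauss_kl(S0, S1):
    """D( N(0, S0) || N(0, S1) ) between zero-mean Gaussians."""
    m = S0.shape[0]
    K1 = np.linalg.inv(S1)
    return 0.5 * (np.trace(K1 @ S0) - m
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def conditional_mi(S, n1, n2):
    """I[X2; X3 | X1] for (X1, X2, X3) ~ N(0, S), with X1 the first
    n1 coordinates and X2 the next n2 coordinates."""
    A = S[:n1, :n1]
    B = S[n1:, :n1]
    # Conditional covariance of (X2, X3) given X1 (Schur complement).
    C = S[n1:, n1:] - B @ np.linalg.inv(A) @ B.T
    C22 = C[:n2, :n2]
    C33 = C[n2:, n2:]
    return 0.5 * np.log(np.linalg.det(C22) * np.linalg.det(C33)
                        / np.linalg.det(C))
```

For a scalar-block system, projecting onto S5 (replacing the X2–X3 cross-covariance by its conditional-independence value) makes the two quantities agree to machine precision.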
The probability distributions defined by these information projections could also have been obtained by the method of maximum entropy, subject to constraints on model interactions [31], and they were obtained in this manner in [22] by making use of Gaussian graphical models [34,40].
We now mention important results from information geometry which are crucial for defining a PID [3]. Consider the pdfs p5, p6, p belonging to the submanifolds S5, S6, and to the manifold M, and the e-geodesic passing through p5 and p6. Any pdf on this e-geodesic path is also a zero-mean multivariate Gaussian pdf [41] (Ch. 1). We denote such a pdf by p_t. It has covariance matrix Σt, defined by

Σt^{−1} = (1 − t)Σ6^{−1} + tΣ5^{−1},   (3)

provided that Σt^{−1} is positive definite. We consider also an m-geodesic from p to p_t. Then, by standard results [31,33], this m-geodesic meets the e-geodesic through p5 and p6 at a unique pdf p_{t*} such that generalized Pythagorean relationships hold in terms of the KL divergence:

D(p || p5) = D(p || p_{t*}) + D(p_{t*} || p5),   (4)
D(p || p6) = D(p || p_{t*}) + D(p_{t*} || p6).   (5)

The pdf p_{t*} minimizes the KL divergence between the pdf p in M and the pdfs p_t which lie on the e-geodesic passing through p5 and p6.
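A minimal sketch of the e-geodesic, under the reading of (3) in which the precision (natural-parameter) matrices interpolate linearly (NumPy; the function names are ours, not part of the paper):

```python
import numpy as np

def gauss_kl(S0, S1):
    """D( N(0, S0) || N(0, S1) ) for zero-mean multivariate Gaussians."""
    m = S0.shape[0]
    K1 = np.linalg.inv(S1)
    return 0.5 * (np.trace(K1 @ S0) - m
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def sigma_t(S5, S6, t):
    """Covariance of p_t on the e-geodesic through p5 (t = 1) and
    p6 (t = 0): the precision matrices interpolate linearly."""
    Kt = (1.0 - t) * np.linalg.inv(S6) + t * np.linalg.inv(S5)
    return np.linalg.inv(Kt)
```

At the endpoints the geodesic recovers Σ6 and Σ5, and the KL divergence is zero only between identical distributions.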

The Partial Information Decomposition
Williams and Beer [1] introduced a framework called the partial information decomposition (PID), which decomposes the joint mutual information between a target and a set of multiple predictor variables into a series of terms reflecting information which is shared, unique or synergistically available within and between subsets of predictors. The joint mutual information, conditional mutual information and bivariate mutual information are defined as follows:

I[X1, X2; X3] = E{ log ( p(x1, x2, x3) / (p(x1, x2) p(x3)) ) },
I[X1; X3 | X2] = E{ log ( p(x1, x3 | x2) / (p(x1 | x2) p(x3 | x2)) ) },
I[X1; X3] = E{ log ( p(x1, x3) / (p(x1) p(x3)) ) }.
Here, we focus on the case of two vector sources, X1, X2, and a vector target, X3. Adapting the notation of [42], we express the joint mutual information in four terms, in which Unq1 and Unq2 denote the unique information that X1 and X2, respectively, convey about X3; Shd ≡ I shd [X1, X2; X3] gives the common (or redundant or shared) information that both X1 and X2 have about X3; and Syn ≡ I syn [X1, X2; X3] is the synergy, or information that the joint vector (X1, X2) has about X3 that cannot be obtained by observing X1 and X2 separately.
It is possible to make deductions about a PID by using the following four equations, which give a link between the components of a PID and certain classical Shannon measures of mutual information. The following are from [42] (Equations (4) and (5)), with amended notation; see also [1]:

I[X1; X3] = Shd + Unq1,   (6)
I[X2; X3] = Shd + Unq2,   (7)
I[X1; X3 | X2] = Unq1 + Syn,   (8)
I[X2; X3 | X1] = Unq2 + Syn.   (9)

Also, the joint mutual information may be written as

I[X1, X2; X3] = Shd + Unq1 + Unq2 + Syn.   (10)

Equations (6)–(9) are of rank 3, and so it is necessary to provide a value for any one of the components; the remaining terms can then be easily calculated. The initial formulation of [1] was based on quantifying the shared information and deriving the other quantities, but others have focussed on quantifying unique information or synergy directly [4,5,8]. Also, the following form [16] of the interaction information [2] will be useful:

I[X1; X2; X3] = I[X1, X2; X3] − I[X1; X3] − I[X2; X3].   (11)

It was shown [1] to be equal to the difference Syn − Shd.
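The rank-3 system (6)–(9) can be solved mechanically once any one component is fixed. A small sketch (Python; the function name and the illustrative numbers are ours):

```python
def pid_from_synergy(i13, i23, i13_g2, i23_g1, syn):
    """Given the classical measures on the left-hand sides of
    Equations (6)-(9) and a value for Syn, recover the remaining
    PID components."""
    unq1 = i13_g2 - syn          # Equation (8)
    unq2 = i23_g1 - syn          # Equation (9)
    shd = i13 - unq1             # Equation (6)
    # Equation (7) must give the same shared information when the
    # inputs are consistent with the chain rule.
    assert abs(shd - (i23 - unq2)) < 1e-9
    return unq1, unq2, shd
```

For example, a system with Shd = 0.2, Unq1 = 0.3, Unq2 = 0.1 and Syn = 0.15 has I[X1; X3] = 0.5, I[X2; X3] = 0.3, I[X1; X3 | X2] = 0.45 and I[X2; X3 | X1] = 0.25, and the function recovers the remaining three components from Syn.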

A PID for Gaussian Vector Sources and a Gaussian Vector Target
We now apply the results from the previous two sections in order to derive a partial information decomposition by making use of the method defined in [3]. The following lemma will confirm that, without any loss of generality, we may assume for all of the multivariate normal distributions considered herein that the mean vector can be taken to be 0 and the covariance matrix of Z, defined on R^m, where m = n1 + n2 + n3, can have the form

Σ = [ I_{n1}  P       Q
      P^T     I_{n2}  R
      Q^T     R^T     I_{n3} ],   (12)

where the matrices P, Q, R are of size n1 × n2, n1 × n3, n2 × n3, respectively, and are the cross-covariance (correlation) matrices between the three pairings of the three random vectors X1, X2, X3. The calculation of the partial information components will involve the computation of KL divergences [43] between two multivariate Gaussian distributions associated with two submanifolds in the lattice defined in Figure 1; see Lemma 1, with proof in Appendix C. These probability distributions will have two features in common: they each have the same partitioned mean vector and also the same variance–covariance matrices for the random vectors X1, X2 and X3, but different cross-covariance matrices for each pair of the random vectors X1, X2 and X3.
Then, the Kullback–Leibler divergence D(f1 || f2) does not depend on the mean vector µ, nor does it depend directly on the variance–covariance matrices Σ11, Σ22, Σ33. The divergence can be expressed entirely in terms of the matrices Σii^{−1/2} Σij Σjj^{−1/2} (i, j = 1, 2, 3; i ≤ j), which are the respective cross-correlation matrices among X1, X2, X3. The KL divergence depends only on these cross-correlation matrices.

Covariance Matrices
Table 1 gives the covariance matrices corresponding to each of the projected distributions p1–p7 on the submanifolds. It is known from Gaussian graphical models [34,40] that the probability distributions associated with submanifolds S5 and S6 are defined by setting K23 = 0 and K13 = 0, respectively, in the precision matrix K. These conditions were shown in [22] to be equivalent to the equations R = P^T Q and Q = PR, respectively. From Table 1, we see that the covariance matrices for pdfs p5 and p6 are obtained from (12) by replacing R with P^T Q, and Q with PR, respectively:

Σ5 = [ I_{n1}  P       Q
       P^T     I_{n2}  P^T Q
       Q^T     Q^T P   I_{n3} ],

Σ6 = [ I_{n1}   P       PR
       P^T      I_{n2}  R
       R^T P^T  R^T     I_{n3} ].
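These replacement rules can be verified numerically: building Σ5 and Σ6 in this way and inverting them should produce exact zeros in the K23 and K13 blocks, respectively. A sketch (NumPy; function names and the example blocks are ours):

```python
import numpy as np

def corr_cov(P, Q, R):
    """Correlation-form covariance (12) from cross blocks P, Q, R."""
    n1, n2 = P.shape
    n3 = Q.shape[1]
    return np.block([[np.eye(n1), P,          Q],
                     [P.T,        np.eye(n2), R],
                     [Q.T,        R.T,        np.eye(n3)]])

def sigma_5(P, Q):
    """Projection onto S5 (K23 = 0): replace R by P^T Q."""
    return corr_cov(P, Q, P.T @ Q)

def sigma_6(P, R):
    """Projection onto S6 (K13 = 0): replace Q by P R."""
    return corr_cov(P, P @ R, R)
```

The zero precision blocks encode the conditional independences X2 ⊥ X3 | X1 (for S5) and X1 ⊥ X3 | X2 (for S6).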
The following lemma, which is proved in Appendix D, gives some useful results on determinants that will be used in the sequel.

Feasible Values for the Parameter t
From (3), the m-projection from the manifold M onto the e-geodesic passing through the pdfs p5 and p6 occurs in general at a pdf p_t which has covariance matrix Σt defined by

Σt^{−1} = (1 − t)Σ6^{−1} + tΣ5^{−1},

and Σt must be positive definite. Therefore, when finding the optimal pdf p_{t*}, we require to constrain the values of the parameter t to be such that Σt is positive definite. We define the set of feasible values for t as

F = { t ∈ R : Σt is positive definite }.   (16)

To enable the derivation of explicit results, it is useful to define the matrix Σ′t by

Σ′t = (1 − t)Σ6 + tΣ5.   (17)

We also require a feasible value t* for t when working with the matrix Σ′t, and so we define the set G of feasible values as follows:

G = { t ∈ R : Σ′t is positive definite }.   (18)

It turns out that the sets of feasible values F, G are actually the same set, as stated in the following lemma, which is proved in Appendix E, and this fact allows us to infer that Σ′t is positive definite when Σt is.

Lemma 3. If the parameter t belongs to the closed interval [0, 1], then the matrices Σt and Σ′t are both positive definite. Also, the two feasible sets F and G defined above in (16) and (18) are equal.
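Feasibility of a candidate t can be checked directly via a Cholesky factorisation of the interpolated precision matrix; by Lemma 3, every t in [0, 1] passes the check, while sufficiently extreme values of t may fail it. A sketch (NumPy; the function name and example matrices are ours):

```python
import numpy as np

def is_feasible(S5, S6, t):
    """True when Sigma_t^{-1} = (1-t) S6^{-1} + t S5^{-1} is
    positive definite, i.e. when t lies in the feasible set F."""
    Kt = (1.0 - t) * np.linalg.inv(S6) + t * np.linalg.inv(S5)
    try:
        # Cholesky succeeds iff the (symmetrised) matrix is PD.
        np.linalg.cholesky((Kt + Kt.T) / 2.0)
        return True
    except np.linalg.LinAlgError:
        return False
```

A convex combination of two positive definite precision matrices is always positive definite, which is why [0, 1] ⊆ F.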

A Convex Optimisation Problem
The optimal value t* of the parameter t is defined by

t* = arg min_{t ∈ F} D(p || p_t).

The following lemma, with proof in Appendix F, provides details of the optimisation required to find t*.
Lemma 4. For t ∈ F°, we define the real-valued function g by g(t) = D(p || p_t). Then the first and second derivatives of g with respect to t can be obtained in closed form and, provided that the joint mutual information is positive, the minimization of g subject to the constraint t ∈ F°, an open convex set, is a strictly convex problem, and the optimal value t* is unique. The minimum value of g is equal to the expression given in (20), which is stated in terms of a determinant d(t*), where the determinant d(t) is defined in Appendix F. Alternatively, the minimum could occur at either endpoint of F.
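The constrained convex optimisation of Lemma 4 can be carried out numerically with a bounded scalar minimiser, restricting attention to the safe sub-interval [0, 1] of F guaranteed by Lemma 3. A sketch (NumPy/SciPy; function names and the example matrices are ours, and the geodesic is taken to interpolate the precision matrices as in (3)):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gauss_kl(S0, S1):
    """D( N(0, S0) || N(0, S1) ) for zero-mean Gaussians."""
    m = S0.shape[0]
    K1 = np.linalg.inv(S1)
    return 0.5 * (np.trace(K1 @ S0) - m
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def optimal_t(S, S5, S6):
    """Minimise g(t) = D(p || p_t) over t in [0, 1]."""
    K5, K6 = np.linalg.inv(S5), np.linalg.inv(S6)
    def g(t):
        St = np.linalg.inv((1.0 - t) * K6 + t * K5)
        return gauss_kl(S, St)
    res = minimize_scalar(g, bounds=(0.0, 1.0), method="bounded",
                          options={"xatol": 1e-10})
    return res.x, res.fun
```

Since g is strictly convex on F°, the bounded search returns the unique minimiser whenever it lies in (0, 1).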
We now define the PID components.

Definition of the PID Components
Following the proposal in [3], we define the synergy of the system to be

Syn = D(p || p_{t*}) = min_{t ∈ F} D(p || p_t) = g(t*),

and by Lemma 1 and (20) an explicit expression for the synergy follows. Before defining the other PID terms, we require the following lemma, with proof in Appendix G.
Lemma 5. The trace terms required in the definitions of the unique information are both equal to m:

Tr(Σ5^{−1} Σ) = Tr(Σ6^{−1} Σ) = m.

From (4), we know that D(p || p5) = D(p || p_{t*}) + D(p_{t*} || p5), and we define the unique information in the system that is due to source X2 to be

Unq2 = D(p_{t*} || p5),

as in [3]. By (5), we also have that D(p || p6) = D(p || p_{t*}) + D(p_{t*} || p6), and we define the unique information in the system that is due to source X1 to be

Unq1 = D(p_{t*} || p6),

as in [3]. Finding the optimal point, t*, of minimisation of the KL divergence D(p || p_t), and the orthogonality provided by the generalised Pythagorean theorems, define a clear connection between the geometry of the tangent space to the manifold M and the definition of the information-geometric PID developed herein. By using two of the defining equations of a PID, (6) and (7), there are two possible expressions for the shared information, Shd, in the system:

Shd = I[X1; X3] − Unq1 = I[X2; X3] − Unq2.   (31)

Using the results in Lemma 1 and Lemma 5, the unique information terms may be written explicitly in terms of the cross-correlation matrices P, Q, R; the resulting expressions appear in Proposition 1.
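Putting the pieces together gives a compact numerical sketch of the whole decomposition for the correlation-form system (12), assuming the geodesic form (3) and the component definitions Syn = D(p || p_{t*}), Unq1 = D(p_{t*} || p6), Unq2 = D(p_{t*} || p5) as above (NumPy/SciPy; the function names are ours, and this is an illustration rather than the paper's closed-form algorithm):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gauss_kl(S0, S1):
    m = S0.shape[0]
    K1 = np.linalg.inv(S1)
    return 0.5 * (np.trace(K1 @ S0) - m
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def iig_pid(P, Q, R):
    """Sketch of the I_ig decomposition for the system (12)."""
    n1, n2 = P.shape
    n3 = Q.shape[1]
    def cov(p, q, r):
        return np.block([[np.eye(n1), p,          q],
                         [p.T,        np.eye(n2), r],
                         [q.T,        r.T,        np.eye(n3)]])
    S = cov(P, Q, R)
    S5 = cov(P, Q, P.T @ Q)        # K23 = 0
    S6 = cov(P, P @ R, R)          # K13 = 0
    K5, K6 = np.linalg.inv(S5), np.linalg.inv(S6)
    def St(t):
        return np.linalg.inv((1.0 - t) * K6 + t * K5)
    res = minimize_scalar(lambda t: gauss_kl(S, St(t)),
                          bounds=(0.0, 1.0), method="bounded",
                          options={"xatol": 1e-10})
    t_star, syn = res.x, res.fun
    unq1 = gauss_kl(St(t_star), S6)     # D(p_t* || p6)
    unq2 = gauss_kl(St(t_star), S5)     # D(p_t* || p5)
    # Bivariate mutual informations for the correlation form.
    i13 = -0.5 * np.log(np.linalg.det(np.eye(n3) - Q.T @ Q))
    i23 = -0.5 * np.log(np.linalg.det(np.eye(n3) - R.T @ R))
    return {"t*": t_star, "Syn": syn, "Unq1": unq1, "Unq2": unq2,
            "Shd": i13 - unq1, "Shd_alt": i23 - unq2}
```

In a system that is symmetric in the two sources (P = 0, Q = R with n1 = n2), the optimum occurs at t* = 1/2 and the two expressions for Shd in (31) agree by symmetry.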

The I ig PID
Explicit expressions for the PID components are given in Proposition 1, with proof in Appendix H.
Proposition 1. The partial information decomposition I ig for the zero-mean multivariate Gaussian system defined in (12) has the following components:

Syn = D(p || p_{t*}) = min_{t ∈ F} D(p || p_t),
Unq1 = D(p_{t*} || p6),
Unq2 = D(p_{t*} || p5),
Shd = I[X1; X3] − Unq1 = I[X2; X3] − Unq2,

where explicit formulae in terms of the cross-correlation matrices P, Q, R and the determinant d(t), as defined in Appendix F, are derived in Appendix H, and F is the interval of real values of t for which Σt is positive definite.
The two possible expressions for the shared information in (31) are equal.
Theoretical properties of the I ig PID are presented in Proposition 2, with proof in Appendix I.

Proposition 2. The PID defined in Proposition 1 possesses the Williams–Beer properties of non-negativity, self-redundancy, symmetry and monotonicity.

Some Examples and Illustrations
Example 1. Prediction of calcium contents.
This dataset was considered in [22]. The I ig PID developed here, along with the I dep PID [22] and I mmi PID [20], was applied using data on 73 women involving one set of predictors X1 (Age, Weight, Height), another set of two predictors X2 (diameter of os calcis, diameter of radius and ulna), and target X3 (calcium content of heel and forearm). The following results were obtained. A plot of the 'synergy' function g(t) is shown in Figure 2a. All three PIDs indicate the presence of synergy and a large component of unique information due to the variables in X1. The I ig PID suggests the transmission of more of the joint mutual information as shared and synergistic information, and correspondingly less unique information due to either source vector, than does the I dep PID. This is true also for the results from the I mmi PID, but it has higher values for synergistic and shared information and a lower value for Unq1 than those produced by the I ig PID. It was shown in [22] that the pdf p in manifold M provides a better fit to these data than any of the submanifold distributions. This pdf contains pairwise cross-correlation between the vectors X1 and X3, and between X2 and X3. Hence, it is no surprise to find a relatively large Unq1 component. One might also anticipate a large value for Unq2. That this is not the case is explained, at least partly, by the presence of unique information asymmetry, in that the mutual information between X1 and X3 (0.4309) is much larger than that between X2 and X3 (0.1032), and also by bearing in mind the constraints imposed by (6)–(10).

The PIDs were also computed with the same X1 and X3 but taking X2 to be another set of four predictors (surface area, strength of forearm, strength of leg, area of os calcis). The following results were obtained. A plot of the 'synergy' function g(t) is shown in Figure 2b. In this case, the PIDs obtained from all three methods are very similar, with the main component being unique information due to the variables in X1. The PIDs indicate almost zero synergy and almost zero unique information due to the variables in X2. In [22], it was shown that the best-fitting of the pdfs is p5, associated with submanifold S5. If this model were to hold exactly, then a PID must have Syn and Unq2 components that are equal to zero. Therefore, all three PIDs perform very well here, and the fact that the Unq1 component is much larger than the Shd component is due to unique information asymmetry, since the mutual information between X2 and X3 is only 0.0787. In this dataset, the I ig PID suggests the transmission of just a little more of the joint mutual information as shared and synergistic information, and correspondingly less unique information due to either source vector, than does the I dep PID. The I ig and I mmi PIDs produce identical results (to 4 d.p.). When working with real or simulated data, it is important to use the correct covariance matrix. In order to use the results given in Proposition 1, it is essential that the input covariance matrix has the structure of Σ, as given in (12). Further detail is provided in Appendix J.

Example 2. PID expectations and exact results.
Since there is no way to know the true PID for any given dataset, it is useful to consider situations in which some values of the PID components can be predicted, and this approach has been used in developments of the topic. Here, we consider such expectations provided by the pdfs associated with the submanifolds S3–S7, defined in Figure 1. In submanifold S3, the source X2 is independent of both the other source X1 and the target X3. Hence, we expect only unique information due to source X1 to be transmitted. Submanifold S4 is similar, but we expect only unique information due to source X2 to be transmitted. On submanifold S5, X2 and X3 are conditionally independent given a value for X1. Hence, from (9), we expect the Unq2 and Syn components to be zero. Similarly, for S6, we expect the Unq1 and Syn components to be equal to zero, by (8). On submanifold S7, the sources X1, X2 are conditionally independent given a value for the target X3 (which does not mean that the sources are marginally independent). Since the target X3 interacts with both source vectors, one might expect some shared information as well as unique information from both sources, and also perhaps some synergy. Here, from (11), the interaction information must be negative or zero, and so we can expect to see transmission of more shared information than synergy.
We will examine these expectations by using the following multivariate Gaussian distribution (which was used in [22]). The matrices P, Q, R are given an equi-cross-correlation structure in which all the entries are equal within each matrix:

P = p 1_{n1} 1_{n2}^T,  Q = q 1_{n1} 1_{n3}^T,  R = r 1_{n2} 1_{n3}^T,

where p, q, r here denote the constant cross-correlations within each matrix and 1_n denotes an n-dimensional vector whose entries are each equal to unity. The values of (p, q, r) are taken to be (−0.15, 0.15, 0.15), with n1 = 3, n2 = 4, n3 = 3. Covariance matrices for pdfs p3–p7 were computed using the results in Table 1. Thus, we have the exact covariance matrices, which can be fed into the I ig , I dep and I mmi algorithms. The PID results are displayed in Table 2.
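The equi-cross-correlation configuration is straightforward to construct and check numerically (NumPy; the function name is ours):

```python
import numpy as np

def equi_cross(c, n_rows, n_cols):
    """Cross-correlation block c * 1_{n_rows} 1_{n_cols}^T,
    i.e. every entry equal to the constant c."""
    return c * np.outer(np.ones(n_rows), np.ones(n_cols))

# The configuration used in the text:
# (p, q, r) = (-0.15, 0.15, 0.15), n1 = 3, n2 = 4, n3 = 3.
P = equi_cross(-0.15, 3, 4)
Q = equi_cross(0.15, 3, 3)
R = equi_cross(0.15, 4, 3)
Sigma = np.block([[np.eye(3), P,         Q],
                  [P.T,       np.eye(4), R],
                  [Q.T,       R.T,       np.eye(3)]])
```

A quick eigenvalue check confirms that this 10 × 10 matrix is a valid (positive definite) correlation matrix.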
From Table 2, we see that all three PIDs meet the expectations exactly for pdfs p3–p6, with only unique information transmitted when the pdfs p3, p4 are true, respectively, and zero unique information for the relevant component and zero synergy when the models p5, p6 are true, respectively. When the full model p is the true model, we find that the I ig and I dep PIDs produce virtually identical results: the joint mutual information is transmitted almost entirely as synergistic information. The I mmi PID is slightly different, with less unique information transmitted about the variables in X2, and more shared and synergistic information transmitted than with the other two PIDs. The PIDs produce very different results for pdf p7, although, as expected, they do express more shared information than synergy. When this model is satisfied, I dep sets the synergy to 0, even if there is no compelling reason to support this. This curiosity is mentioned and illustrated in [22]. On the other hand, the I ig PID suggests that each of the four components contributes to the transmission of the joint mutual information, with unique information due to X2 and shared information making more of a contribution than the other two components. The I mmi PID transmits a higher percentage of the joint information as shared and synergistic information, and a smaller percentage due to the variables in X2, than is found with I ig ; these differences are much stronger when comparison is made with the corresponding I dep components. As with model p, it appears that the forcing of the Unq1 component to zero in I mmi has been translated into its percentage being subtracted from the Unq2 component and added to both the Shd and Syn components, relative to I ig . Taking the same values of p, q, r and n1, n2, n3 as in the previous example, a small simulation study was conducted. From each of the pdfs p3–p7, p, a simple random sample of size 1000 was generated from the 10-dimensional
distribution, a covariance matrix was estimated from the data, and the I ig , I dep and I mmi algorithms were applied. This procedure was repeated 1000 times. In order to make the PID results from the sample of 1000 datasets comparable, each PID was normalized by dividing each of its components by the joint mutual information; see (10). A summary of the results is provided in Table 3. We focus here on the comparison of I ig and I dep , and also I ig and I mmi , since I dep has been compared with I mmi for Gaussian systems in [22]. Table 3. PID results for simulated datasets from the pdfs p3–p7, p, reported as median (in bold) and range of the sample of percentages of the joint mutual information, apart from t*, for which the median and range of the actual values obtained using the I ig algorithm are given.
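One replicate of this simulation procedure can be sketched as follows (NumPy; the function name and seed are ours). Rescaling the estimated covariance matrix to correlation form gives an input with the structure required by (12).

```python
import numpy as np

rng = np.random.default_rng(2024)

def estimated_correlation(Sigma, n=1000):
    """One simulation replicate: draw n observations from N(0, Sigma)
    and return the estimated correlation matrix, which has the
    unit-diagonal structure required by (12)."""
    Z = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=n)
    S_hat = np.cov(Z, rowvar=False)
    d = np.sqrt(np.diag(S_hat))
    return S_hat / np.outer(d, d)
```

Repeating this across many replicates, and feeding each estimated correlation matrix into the PID algorithms, reproduces the design of the study summarised in Table 3.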

For pdf p, the I ig and I dep PIDs produce very similar results in terms of both median and range, and the median results are very close indeed to the corresponding exact values in Table 2. For pdf p7, the differences between the PID components found in Table 2 persist here, although each PID, respectively, produces median values of its components that are close to the exact results in Table 2. For the other four pdfs, there are some small but interesting differences between the results produced by the two PID methods. The I ig method has higher median values for synergy and shared information, and lower values for unique information, when compared against the corresponding exact values in Table 2. In particular, the values of unique information given by I ig are much lower than expected for pdfs p3, p4, p6, and the levels of synergy are larger than expected, particularly for pdfs p3 and p5. On the other hand, the I dep PID tends to have larger values for the unique information, and lower values for synergy, especially for datasets generated from pdfs p3, p4 and p5. For models p3–p6, I dep has median values of synergy that are closer to the corresponding exact values than those produced by I ig . The suggestion that the I ig method can produce more synergy and shared information than the I dep method, given the same dataset, is supported by the fact that, for all the pdfs and all 6000 datasets considered, the I ig method produced greater levels of synergy and shared information and smaller values of the unique information in every dataset. This raises the question of whether such a finding is generally the case and whether there is this type of systematic difference between the methods. In the case of scalar variables, it is easy to derive general analytic formulae for the I ig PID components, and such a systematic difference is present in this case.

I ig vs. I mmi
The I ig and I mmi PIDs produce similar results for the datasets generated from pdf p, although the I mmi PID suggests the transmission of more shared and synergistic information and less unique information than does I ig . For pdf p7, the differences between the PID results are much more dramatic, with the I mmi PID allocating an additional 15% of the joint mutual information to the shared and synergistic information, and correspondingly 15% less to the unique information. Both methods produce almost identical summary statistics on the datasets generated from pdfs p3–p6. Since the same patterns are present for all four distributions, we discuss the results for pdf p5 as an exemplar and compare them with the corresponding exact values in Table 2. The results for component Unq1 show that both methods produce an underestimate of approximately 7%, on average, of the joint mutual information. The median values of Unq2 are close to those expected. The underestimates of the Unq1 component are coupled with overestimates, on average, for the shared and synergistic components; they are 2.6% and 4.3%, respectively, with the I ig method, and 3.1% and 4.7%, respectively, with I mmi .
As is to be expected with percentage data, the variation in results for each component tends to be larger for values that are not extreme and much smaller for the extreme values. Also, the optimal values of t* are shown in Table 3. They were all found to be in the range [0, 1], except for 202 of the datasets generated from pdf p5 or p6.

Discussion
For the case of multivariate Gaussian systems with two vector inputs and a vector output, results have been derived using standard theorems from information geometry in order to develop simple, almost exact formulae for the I ig PID, thus extending the scope of the work of [3] on scalar inputs and output. The formulae require one parameter to be determined by a simple, constrained convex optimisation. In addition, it has been proved that this I ig PID algorithm satisfies the desirable theoretical properties of non-negativity, self-redundancy, symmetry and monotonicity, first postulated by Williams and Beer [1]. These results strengthen the confidence that one might have in using the I ig method to separate the joint mutual information in a multivariate Gaussian system into shared, unique and synergistic components. The examples demonstrate that the I ig method is simple to use, and a small simulation study reveals that it is fairly robust, although in some of the scenarios considered the I ig method produced more synergy and shared information than expected, and correspondingly less unique information; in some other scenarios, it performed as expected. Comparison of the I ig and I dep algorithms reveals that they can produce exactly the same, or very similar, results in some scenarios, but in other situations it is clear that the I ig method tends to give larger levels of shared information and synergy, and correspondingly lower levels of unique information, when compared with the results from the I dep method.
For datasets generated from pdfs p3–p6, the PIDs produced using the I ig and I mmi methods are, on average, very similar indeed, and both methods overestimate synergy and shared information and underestimate unique information. The extent of these biases, as a percentage of the joint mutual information, is fairly small, on average, when pdf p4 or p6 is the true pdf, but larger, on average, when p3 or p5 is the true pdf. When pdf p7 or p is the true pdf, the I mmi algorithm produces even more shared and synergistic information than is obtained with the I ig method. This effect is particularly dramatic in the case of p7, where, on average, with I mmi 82% of the joint mutual information is transmitted as shared or synergistic information, as compared with 51.5% for I ig . It appears that the fact that the I mmi method forces one of the unique informations to be zero leads to an underestimation of the other unique information and an overestimation of both the shared and synergistic information, especially when p7 or p is the true pdf and both unique information components are expected to be non-zero.
While some numerical support is presented here for the hypothesis that there might be a systematic difference of this type between the I_ig and I_dep methods, further research would be required to investigate this possibility. Also, the I_ig method developed here is a bivariate PID, and it would be of interest to explore whether it could be extended to deal with more than two source vectors.

Appendix I.1. Non-Negativity

The pdf p5 is defined by the constraint K_23 = 0 in the inverse covariance matrix K. Hence, Σ_5^{-1} has a block form in which the (2,3) and (3,2) blocks are zero. Performing the multiplication of these block matrices shows that the diagonal blocks of Σ_5^{-1}(Σ − Σ_5) are all equal to a zero matrix, and so have trace equal to zero. Hence, the trace of this matrix is equal to zero. Since Tr(I_m) = m, it follows from (A18) that Σ_5^{-1}Σ has trace equal to m. Now consider the trace of Σ_6^{-1}Σ. From (12) and the form of Σ_6 in Table 1, we can write Σ_6^{-1}Σ in block form. The pdf p6 is defined by the constraint K_13 = 0 in the inverse covariance matrix K; hence, Σ_6^{-1} has a block form in which the (1,3) and (3,1) blocks are zero. By adopting a similar argument to that given above for Σ_5, it follows that Tr(Σ_6^{-1}Σ) = m. The required result follows from (A17). For the second part of the proof, we use Lemma 1 and the trace result just derived to write Σ_t^{-1} in block form, and so, by Lemma 2, we obtain an explicit expression for g(t). We wish to minimise g(t) with respect to t under the constraint that t ∈ F°. We set H(t) = (1 − t)Σ_6 + tΣ_5, which is positive definite by Lemma 3, and apply Jacobi's formula and the chain rule. Differentiating (A22) with respect to t gives the required derivative. It follows from Lemma A1(i) that A is positive definite, as is A^{-1}. Therefore, A^{-1} possesses a (symmetric) positive definite square root, C say, and we may write I_{n_3} − B(t*)^T A^{-1} B(t*) = I_{n_3} − X^T X, with X = C B(t*).
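The positive-definiteness of the mixture H(t) = (1 − t)Σ_6 + tΣ_5 can be illustrated numerically: any convex combination of positive definite matrices is positive definite, so the eigenvalues of H(t) stay positive across the whole interval. The matrices below are small toy stand-ins for Σ_5 and Σ_6, not values from the paper.

```python
import numpy as np

# Toy positive definite stand-ins for Sigma_5 and Sigma_6 (illustrative only).
Sigma5 = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma6 = np.array([[1.0, 0.2], [0.2, 2.0]])

def H(t):
    # Convex combination along the geodesic parameter t in [0, 1].
    return (1.0 - t) * Sigma6 + t * Sigma5

# Positive definiteness <=> all eigenvalues strictly positive.
all_pd = all(np.all(np.linalg.eigvalsh(H(t)) > 0) for t in np.linspace(0.0, 1.0, 11))
```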

Appendix I.2. Self-Redundancy
The property of self-redundancy considers the case of X_1 = X_2, i.e., both sources are the same, and it requires us to show that Shd ≡ I_shd[X_1, X_1; X_3] = I[X_1; X_3]. When the sources are the same we have n_1 = n_2, Q = R and P = I_{n_1}, which results in a singular covariance matrix. Therefore, we take P = (1 − ϵ)I_{n_1}, for very small ϵ such that 0 < ϵ < 1, and let ϵ → 0+. Using this information in (A27), we obtain an expression which, in the limit, is equal to I[X_1; X_3] by (A1). Out of interest, it seems worthwhile to check the limits of the classical information measures given in (A3)-(A7). From (A3) and (A13), after some cancellation, we have that I[X_1; X_3] → (1/2) log(1/|I_{n_3} − Q^T Q|). Similarly, for the joint mutual information and the interaction information, I[X_1, X_2; X_3] → (1/2) log(1/|I_{n_3} − Q^T Q|) and I[X_1; X_2; X_3] → −(1/2) log(1/|I_{n_3} − Q^T Q|) as ϵ → 0+, as expected.
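The limiting expression (1/2) log(1/|I_{n_3} − Q^T Q|) can be sanity-checked numerically for standardised Gaussian blocks: with Cov(X_1) = I, Cov(X_3) = I and cross-covariance Q, the general log-determinant form of the Gaussian mutual information agrees with this closed form. The block sizes and the matrix Q below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Toy cross-covariance between standardised blocks X1 (n1 = 2) and X3 (n3 = 2).
Q = np.array([[0.3, 0.1], [0.0, 0.2]])
n1, n3 = Q.shape

# Joint covariance of (X1, X3): unit variances within each block.
Sigma = np.block([[np.eye(n1), Q], [Q.T, np.eye(n3)]])

# Mutual information from the general Gaussian formula
# I = (1/2) (log|Sigma_11| + log|Sigma_33| - log|Sigma|); here log|I| = 0.
mi_general = -0.5 * np.linalg.slogdet(Sigma)[1]
# The closed form (1/2) log(1 / |I - Q^T Q|).
mi_closed = 0.5 * np.log(1.0 / np.linalg.det(np.eye(n3) - Q.T @ Q))
```

The agreement is exact here because, by the Schur complement, |Sigma| = |I_{n_3} − Q^T Q| when the within-block covariances are identity matrices.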

Appendix I.3. Symmetry
To validate the symmetry property, we must prove that I_shd[X_1, X_2; X_3] is equal to I_shd[X_2, X_1; X_3]. Swapping X_1 and X_2 means that the sources are now in the order X_2, X_1, X_3, and the covariance matrix in (12) becomes the matrix obtained by permuting the blocks corresponding to X_1 and X_2. The switching of X_1 and X_2 also means that the probability distributions on S_5 and S_6 swap, since the defining constraints K_13 = 0 and K_23 = 0 are exchanged. Some matrix calculation then gives an expression d′(t) which is identical to the expression for d(t) obtained in (A27). It follows that d(t*) = d′(t*) and that the shared information is unchanged by swapping X_1 and X_2.

Appendix I.4. Monotonicity on the Redundancy Lattice
We use the term 'redundancy' here for convenience rather than 'shared information', since both terms mean exactly the same thing. A redundancy lattice is defined in [1]. When there are two sources and a target, there are four terms of interest, usually denoted by {1}{2}, {1}, {2} and {12}, ordered on the lattice so that {1}{2} ⪯ {1} ⪯ {12} and {1}{2} ⪯ {2} ⪯ {12}.
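Given the redundancy at the bottom node of the two-source lattice, the remaining terms follow by the standard Williams-Beer bookkeeping: each unique information is a marginal mutual information minus the shared term, and synergy is the joint mutual information minus everything below it. A minimal sketch, where the numeric inputs are placeholders rather than values computed from any particular pdf:

```python
def pid_from_redundancy(i_joint, i_x1, i_x2, i_shd):
    """Two-source Williams-Beer decomposition given the shared term.

    i_joint : I(X1, X2; X3)   i_x1 : I(X1; X3)
    i_x2    : I(X2; X3)       i_shd : redundancy at node {1}{2}
    """
    unq1 = i_x1 - i_shd                    # node {1}
    unq2 = i_x2 - i_shd                    # node {2}
    syn = i_joint - i_shd - unq1 - unq2    # node {12}
    return {"shd": i_shd, "unq1": unq1, "unq2": unq2, "syn": syn}

# Placeholder values; the four components always sum back to i_joint.
pid = pid_from_redundancy(i_joint=1.0, i_x1=0.6, i_x2=0.5, i_shd=0.3)
```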

Table 2. PID results for exact pdfs, reported as a percentage of the joint mutual information.