#### 2.1. Gaussian Graphical Models

Graphical models represent the conditional independences present in a probability distribution. The independence graph for a probability distribution on three univariate random variables ${X}_{0},{X}_{1},Y$ has three vertices and three possible edges, as described in Table 1. Let $\mathbf{Z}={\left[\begin{array}{ccc}{X}_{0}& {X}_{1}& Y\end{array}\right]}^{T}$.

Suppose that $\mathbf{Z}$ has a multivariate Gaussian distribution with mean vector ${\mathit{\mu}}_{Z}$, positive definite covariance matrix ${\Sigma}_{Z}$ and p.d.f. $f({x}_{0},{x}_{1},y)$. There is no loss of generality in assuming that each component of $\mathbf{Z}$ has mean zero and variance equal to 1 [12]. If we let the covariance (correlation) between ${X}_{0}$ and ${X}_{1}$ be p, between ${X}_{0}$ and Y be q, and between ${X}_{1}$ and Y be r, then the covariance (correlation) matrix for $\mathbf{Z}$ is

${\Sigma}_{Z}=\left[\begin{array}{ccc}1& p& q\\ p& 1& r\\ q& r& 1\end{array}\right]$  (16)

and we require that $\left|p\right|,\left|q\right|,\left|r\right|$ are each less than 1; to ensure positive definiteness we require also that $|{\Sigma}_{Z}|>0$.

Conditional independences are specified by setting certain off-diagonal entries to zero in the inverse covariance matrix, or concentration matrix, $K={\Sigma}^{-1}$ ([13], p. 164). Given our assumptions about the covariance matrix of $\mathbf{Z}$, this concentration matrix is

$K=\frac{1}{|{\Sigma}_{Z}|}\left[\begin{array}{ccc}1-{r}^{2}& qr-p& pr-q\\ qr-p& 1-{q}^{2}& pq-r\\ pr-q& pq-r& 1-{p}^{2}\end{array}\right]$  (17)

where $|{\Sigma}_{Z}|=1-{p}^{2}-{q}^{2}-{r}^{2}+2pqr$.

We now illustrate, using these Gaussian graphical models, how conditional independence constraints also impose constraints on marginal distributions of the type required, and we use the Gaussian graphical models ${G}_{8}$ and ${G}_{6}$ to do so.
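The cofactor form of K and the determinant identity are easy to check numerically; a small sketch with illustrative values of p, q, r (NumPy assumed):

```python
import numpy as np

p, q, r = 0.3, 0.5, -0.2  # illustrative correlations giving a positive definite Sigma_Z
Sigma_Z = np.array([[1.0, p, q],
                    [p, 1.0, r],
                    [q, r, 1.0]])

# Determinant identity: |Sigma_Z| = 1 - p^2 - q^2 - r^2 + 2pqr.
det = 1 - p**2 - q**2 - r**2 + 2*p*q*r
assert abs(np.linalg.det(Sigma_Z) - det) < 1e-12

# Concentration matrix K = Sigma_Z^{-1} via the cofactor (adjugate) form.
K_closed = np.array([[1 - r**2, q*r - p, p*r - q],
                     [q*r - p, 1 - q**2, p*q - r],
                     [p*r - q, p*q - r, 1 - p**2]]) / det
assert np.allclose(np.linalg.inv(Sigma_Z), K_closed)
```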

Since $\mathbf{Z}$ is multivariate Gaussian and has a zero mean vector, the distribution of $\mathbf{Z}$ is specified via its covariance matrix ${\Sigma}_{Z}$. Hence, fitting any of the Gaussian graphical models ${G}_{1}\dots {G}_{8}$ involves estimating the relevant covariance matrix by taking the conditional independence constraints into account. Let ${\widehat{\Sigma}}_{i}$ and ${\widehat{K}}_{i}$ be the covariance and concentration matrices of the fitted model ${G}_{i}$, $(i=1,\dots ,8)$.

We begin with the saturated model ${G}_{8}$, which has a fully connected graph and no constraints of conditional independence. Therefore, there is no need to set any entries of the concentration matrix K to zero, and so ${\widehat{\Sigma}}_{8}={\Sigma}_{Z}$. That is: model ${G}_{8}$ is equal to the given model for $\mathbf{Z}$.

Now consider model ${G}_{6}$. In this model there is no edge between ${X}_{0}$ and Y, and so ${X}_{0}$ and Y are conditionally independent given ${X}_{1}$. This conditional independence is enforced by ensuring that the [1, 3] and [3, 1] entries in ${\widehat{K}}_{6}$ are zero. The other elements in ${\widehat{K}}_{6}$ remain to be determined. Therefore ${\widehat{K}}_{6}$ has the form

${\widehat{K}}_{6}=\left[\begin{array}{ccc}\ast & \ast & 0\\ \ast & \ast & \ast \\ 0& \ast & \ast \end{array}\right].$

Given the form of ${\widehat{K}}_{6}$, ${\widehat{\Sigma}}_{6}$ has the form

${\widehat{\Sigma}}_{6}=\left[\begin{array}{ccc}1& p& {\widehat{\sigma}}_{02}\\ p& 1& r\\ {\widehat{\sigma}}_{02}& r& 1\end{array}\right],$

where ${\widehat{\sigma}}_{02}$ is to be determined. Notice that only the [1, 3] and [3, 1] entries in ${\widehat{\Sigma}}_{6}$ have been changed from the given covariance matrix ${\Sigma}_{Z}$, since the [1, 3] and [3, 1] entries of ${\widehat{K}}_{6}$ have been set to zero. An exact solution is possible. The inverse of ${\widehat{\Sigma}}_{6}$ is

${\widehat{\Sigma}}_{6}^{-1}=\frac{1}{|{\widehat{\Sigma}}_{6}|}\left[\begin{array}{ccc}1-{r}^{2}& r{\widehat{\sigma}}_{02}-p& pr-{\widehat{\sigma}}_{02}\\ r{\widehat{\sigma}}_{02}-p& 1-{\widehat{\sigma}}_{02}^{2}& p{\widehat{\sigma}}_{02}-r\\ pr-{\widehat{\sigma}}_{02}& p{\widehat{\sigma}}_{02}-r& 1-{p}^{2}\end{array}\right].$

Since the [1, 3] entry in ${\widehat{K}}_{6}$ must be zero, we obtain the solution ${\widehat{\sigma}}_{02}=pr$, and so the estimated covariance matrix for model ${G}_{6}$ is

${\widehat{\Sigma}}_{6}=\left[\begin{array}{ccc}1& p& pr\\ p& 1& r\\ pr& r& 1\end{array}\right].$  (19)

The estimated covariance matrices for the other models can be obtained exactly using a similar argument.
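The fitting argument for ${G}_{6}$ can also be verified numerically: replacing the [1, 3] and [3, 1] entries of ${\Sigma}_{Z}$ by $pr$ zeroes the corresponding entries of the concentration matrix while leaving the other sub-matrices untouched. A sketch with illustrative correlations (NumPy assumed):

```python
import numpy as np

p, q, r = 0.3, 0.5, -0.2  # illustrative correlations
Sigma_Z = np.array([[1.0, p, q], [p, 1.0, r], [q, r, 1.0]])

# Fitted covariance for G6: only the [1,3]/[3,1] entries change, to p*r.
Sigma_6 = Sigma_Z.copy()
Sigma_6[0, 2] = Sigma_6[2, 0] = p * r
K_6 = np.linalg.inv(Sigma_6)

assert abs(K_6[0, 2]) < 1e-12                         # X0 and Y independent given X1
assert np.allclose(Sigma_6[:2, :2], Sigma_Z[:2, :2])  # (X0, X1) marginal preserved
assert np.allclose(Sigma_6[1:, 1:], Sigma_Z[1:, 1:])  # (X1, Y) marginal preserved
```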

Model ${G}_{6}$ contains the marginal distributions of ${X}_{0}$, ${X}_{1}$, Y, $({X}_{0},{X}_{1})$ and $({X}_{1},Y)$. It is important to note that these marginal distributions are exactly the same as in the given multivariate Gaussian distribution for $\mathbf{Z}$, which has covariance matrix ${\Sigma}_{Z}$. To see this we use a standard result on the marginal distribution of a sub-vector of a multivariate Gaussian distribution ([24], p. 63).

The covariance matrix of the marginal distribution $({X}_{0},{X}_{1})$ is equal to the upper-left 2 by 2 sub-matrix of ${\widehat{\Sigma}}_{6}$, which is also equal to the same sub-matrix in ${\Sigma}_{Z}$ in (16). This means that this marginal distribution in model ${G}_{6}$ is equal to the corresponding marginal distribution in the distribution of Z. The covariance matrix of the marginal distribution $({X}_{1},Y)$ is equal to the lower-right 2 by 2 sub-matrix of ${\widehat{\Sigma}}_{6}$, which is also equal to the same sub-matrix in ${\Sigma}_{Z}$ in (16), and so the $({X}_{1},Y)$ marginal distribution in model ${G}_{6}$ matches the corresponding marginal distribution in the distribution of Z. Using similar arguments, such equality is also true for the other marginal distributions in model ${G}_{6}$.

Looking at (17), we see that setting the [1, 3] entry of K to zero gives $q=pr$. Therefore, simply imposing this conditional independence constraint also gives the required estimated covariance matrix ${\widehat{\Sigma}}_{6}$.

It is generally true ([13], p. 176) that applying the conditional independence constraints is sufficient, and it also leads to the marginal distributions in the fitted model being exactly the same as the corresponding marginal distributions in the given distribution of $\mathbf{Z}$. For example, in (19) we see that the only elements in ${\widehat{\Sigma}}_{6}$ that are altered are the [1, 3] and [3, 1] entries, and these entries correspond exactly to the zero [1, 3] and [3, 1] entries in ${\widehat{K}}_{6}$. That is: the location of zeroes in ${\widehat{K}}_{6}$ determines which entries in ${\widehat{\Sigma}}_{6}$ will be changed; the remaining entries of ${\Sigma}_{Z}$ are unaltered, and therefore this fixes the required marginal distributions. Therefore, in Section 2.2, we will determine the required maximum entropy solutions by simply applying the necessary conditional independence constraints together with the other required constraints.

We may express the combination of the constraints on marginal distributions and the constraints imposed by conditional independences as follows [16]. For model ${G}_{k}$, the $(i,j)$th entry of ${\widehat{\Sigma}}_{k}$ is given by

${[{\widehat{\Sigma}}_{k}]}_{ij}={[{\Sigma}_{Z}]}_{ij}\quad \text{whenever}\phantom{\rule{0.333333em}{0ex}}(i,j)\in {E}_{k}\phantom{\rule{0.333333em}{0ex}}\text{or}\phantom{\rule{0.333333em}{0ex}}i=j,$

where ${E}_{k}$ is the edge set for model ${G}_{k}$ (see Table 1). For model ${G}_{k}$, the conditional independences are imposed by setting the $(i,j)$th entry of $\widehat{K}$ to zero whenever $(i,j)\notin {E}_{k}$.

Before moving on to derive the maximum entropy distributions, we consider the conditional independence constraints in model ${G}_{3}$. We see from Table 1 that this model has no edge between ${X}_{0}$ and ${X}_{1}$ and none between ${X}_{1}$ and Y. Hence, ${X}_{0}$ and ${X}_{1}$ are conditionally independent given Y, and also ${X}_{1}$ and Y are conditionally independent given ${X}_{0}$. Hence, in K in (17) we set the [1, 2] and [2, 3] (and the [2, 1] and [3, 2]) entries to zero to enforce these conditional independences. That is: $p=qr$ and $r=pq$. Taken together these equations give $p=0$ and $r=0$, and so the estimated covariance matrix for model ${G}_{3}$ is

${\widehat{\Sigma}}_{3}=\left[\begin{array}{ccc}1& 0& q\\ 0& 1& 0\\ q& 0& 1\end{array}\right].$
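Fitted covariance matrices of this kind can also be obtained by iterative proportional scaling, a standard algorithm for fitting graphical models that matches the target covariance on each clique of the graph while keeping the off-clique entries of the concentration matrix at zero. A minimal sketch (the function name and the illustrative correlations are ours):

```python
import numpy as np

def fit_ggm(Sigma_target, cliques, sweeps=50):
    """Iterative proportional scaling for a Gaussian graphical model:
    match Sigma_target on each clique while keeping K zero off the cliques."""
    d = Sigma_target.shape[0]
    K = np.eye(d)  # initial concentration matrix, compatible with any graph
    for _ in range(sweeps):
        for C in cliques:
            idx = np.ix_(C, C)
            S = np.linalg.inv(K)  # current fitted covariance
            K[idx] += np.linalg.inv(Sigma_target[idx]) - np.linalg.inv(S[idx])
    return np.linalg.inv(K)

p, q, r = 0.3, 0.5, -0.2  # illustrative correlations
Sigma_Z = np.array([[1.0, p, q], [p, 1.0, r], [q, r, 1.0]])

# Model G3 keeps only the X0--Y edge, so its cliques are {X0, Y} and {X1}.
Sigma_3 = fit_ggm(Sigma_Z, cliques=[[0, 2], [1]])
assert np.allclose(Sigma_3, [[1, 0, q], [0, 1, 0], [q, 0, 1]])
```

For this decomposable model the exact solution derived above is reached immediately; for general graphs the clique sweeps converge iteratively.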

We also note that model ${U}_{3}$ in Figure 1 possesses the same conditional independences as ${G}_{3}$. This is true for all of the maximum entropy models ${U}_{i}$, and so when finding the nature of these models in the next section we apply in each case the conditional independence constraints satisfied by the graphical model ${G}_{i}$.

#### 2.2. Maximum Entropy Distributions

We are given the distribution of $\mathbf{Z}$, which is multivariate Gaussian with zero mean vector and covariance matrix ${\Sigma}_{Z}$ in (16), and has p.d.f. $f\left(\mathbf{z}\right)\equiv f({x}_{0},{x}_{1},y)$. For each of the models ${U}_{1}\dots {U}_{8}$, we will determine the p.d.f. of the maximum entropy solution $g\left(\mathbf{z}\right)\equiv g({x}_{0},{x}_{1},y)$ subject to the constraints (23), the separate constraint (24) for model ${U}_{i}$, and the conditional independence constraints given in Table 2.

We begin with model ${U}_{8}$. As shown in the previous section, the estimated covariance matrix for model ${U}_{8}$, ${\widehat{\Sigma}}_{8}$, is equal to the covariance matrix of $\mathbf{Z}$, ${\Sigma}_{Z}$. By a well-known result [25], the solution is that ${U}_{8}$ is multivariate Gaussian with mean vector zero and covariance matrix ${\Sigma}_{Z}$. That is: ${U}_{8}$ is equal to the given distribution of $\mathbf{Z}$.

For model ${U}_{5}$, the conditional independence constraint is $r=pq$ and so

${\widehat{\Sigma}}_{5}=\left[\begin{array}{ccc}1& p& q\\ p& 1& pq\\ q& pq& 1\end{array}\right].$

Hence, using a similar argument to that for ${U}_{8}$, the maximum entropy solution for model ${U}_{5}$ is multivariate Gaussian with zero mean vector and covariance matrix ${\widehat{\Sigma}}_{5}$, and so is equal to the model ${G}_{5}$.
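The maximum entropy character of this solution can be seen directly: the differential entropy of a trivariate Gaussian is $\frac{1}{2}\mathrm{log}\{{(2\pi e)}^{3}|\Sigma |\}$, an increasing function of $|\Sigma |$, so for model ${U}_{5}$ maximizing the entropy over the one unconstrained correlation r amounts to maximizing $|{\Sigma}_{Z}|=1-{p}^{2}-{q}^{2}-{r}^{2}+2pqr$, whose maximizer is $r=pq$. A numeric sketch (the fixed correlations are illustrative):

```python
import numpy as np

p, q = 0.3, 0.5  # correlations fixed by the marginal constraints (illustrative)

# Determinant of the candidate covariance matrix as the free correlation r varies.
rs = np.linspace(-0.99, 0.99, 19801)
dets = 1 - p**2 - q**2 - rs**2 + 2*p*q*rs

valid = dets > 0                   # admissible (positive definite) values of r
r_star = rs[valid][np.argmax(dets[valid])]
assert abs(r_star - p * q) < 1e-3  # entropy is maximised at r = pq
```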

In model ${U}_{3}$, the conditional independence constraints are $p=qr,r=pq$ and so $p=0$ and $r=0$.

Therefore,

${\widehat{\Sigma}}_{3}=\left[\begin{array}{ccc}1& 0& q\\ 0& 1& 0\\ q& 0& 1\end{array}\right],$

and the maximum entropy solution for ${U}_{3}$ is multivariate Gaussian with zero mean vector and covariance matrix ${\widehat{\Sigma}}_{3}$, and so is equal to ${G}_{3}$. The derivations for the other maximum entropy models are similar, and we state the results in Proposition 1.

**Proposition** **1.** The distributions of maximum entropy, ${U}_{1}\dots {U}_{8}$, subject to the constraints (23)–(24) and the conditional independence constraints in Table 2, are the trivariate Gaussian graphical models ${G}_{1}\dots {G}_{8}$ having mean vector $\mathbf{0}$ and the covariance matrices ${\widehat{\Sigma}}_{i}$, $(i=1,\dots ,8)$, given in Table 3.

The estimated covariance matrices in Table 3 were inverted to give the corresponding concentration matrices, which are also given in Table 3. The locations of their zeroes confirm that the conditional independences have been appropriately applied in the derivation of the results in Proposition 1.

It is important to check that the relevant bivariate and univariate marginal distributions are the same in all of the models in which a particular constraint has been added. For example, the ${X}_{0}{X}_{1}$ constraint is present in models ${U}_{2},{U}_{5},{U}_{6},{U}_{8}$. The marginal bivariate ${X}_{0}{X}_{1}$ distribution has zero mean vector and so is determined by the upper-left 2 by 2 sub-matrix of the estimated covariance matrices ${\widehat{\Sigma}}_{i}$ ([24], p. 63). Inspection of Table 3 shows that this sub-matrix is equal to $\left[\begin{array}{cc}1& p\\ p& 1\end{array}\right]$ in all four models. Thus, the bivariate distribution of $({X}_{0},{X}_{1})$ is the same in all four models in which this dependency constraint is fitted. It is also the same as in the original distribution, which has covariance matrix ${\Sigma}_{Z}$ in (16). Further examination of Table 3 shows equivalent results for the $({X}_{0},Y)$ and $({X}_{1},Y)$ bivariate marginal distributions.

The univariate term Y is present in all eight models. The univariate distribution of Y has mean zero and so is determined by the [3, 3] element of the estimated covariance matrices ${\widehat{\Sigma}}_{i}$ ([24], p. 63). Looking at the ${\widehat{\Sigma}}_{i}$ column, we see that the variance of Y is equal to 1 in all eight models, and so the marginal distribution of Y is the same in all eight models. In particular, it is the same as in the original distribution, which has covariance matrix ${\Sigma}_{Z}$ in (16).

#### 2.5. Some Examples

**Example** **1.** We consider the ${I}_{\mathit{dep}}$ PID when $q=\mathit{corr}({X}_{0},Y)=0,r\ne 0,p\ne 0.$

When $q=0$, we see from Table 4 that $b=d=i=0$ and $k>0$, so unq0 = 0, and since $I({X}_{0};Y)=0$ the redundancy component is also zero. The unique information, unq1, and the synergy component, syn, are equal to

$\mathrm{unq}1=\frac{1}{2}\mathrm{log}\frac{1}{1-{r}^{2}}\quad \text{and}\quad \mathrm{syn}=\frac{1}{2}\mathrm{log}\frac{(1-{p}^{2})(1-{r}^{2})}{1-{p}^{2}-{r}^{2}},$

respectively. The ${I}_{\mathrm{mmi}}$ PID is exactly the same as the ${I}_{\mathrm{dep}}$ PID.

**Example** **2.** We consider the ${I}_{\mathit{dep}}$ PID when $r=\mathit{corr}({X}_{1},Y)=0,q\ne 0,p\ne 0.$

When $r=0$, we see from Table 5 that $k>\{b,d,i\}$, since $p\ne 0,q\ne 0$. It follows that $\mathrm{unq}0=\frac{1}{2}\mathrm{log}\frac{1}{1-{q}^{2}}$, and that the synergy component is equal to

$\mathrm{syn}=\frac{1}{2}\mathrm{log}\frac{(1-{p}^{2})(1-{q}^{2})}{1-{p}^{2}-{q}^{2}}.$

Since $I({X}_{1};Y)=0$, from (29), the redundancy component is zero, as is unq1. The ${I}_{\mathrm{mmi}}$ PID is exactly the same as the ${I}_{\mathrm{dep}}$ PID.

**Example** **3.** We consider the ${I}_{\mathit{dep}}$ PID when $p=\mathit{corr}({X}_{0},{X}_{1})=0,q\ne 0,r\ne 0.$

Under the stated conditions, it is easy to show that $b<i$ and $i<k$, and so the minimum edge value is attained at i. Using the results in Table 5 and (29)–(31), we may write down the ${I}_{\mathrm{dep}}$ PID as follows.

For this situation, the ${I}_{\mathrm{mmi}}$ PID takes two different forms, depending on whether or not $\left|q\right|<\left|r\right|$. Neither form is the same as the ${I}_{\mathrm{dep}}$ PID.

**Example** **4.** Compare the ${I}_{\mathit{mmi}}$ and ${I}_{\mathit{dep}}$ PIDs when $p=-0.2$, $q=0.7$ and $r=-0.7$.

The PIDs are given in the following table.

| **PID** | **unq0** | **unq1** | **red** | **syn** |
|---|---|---|---|---|
| ${I}_{\mathrm{dep}}$ | 0.2877 | 0.2877 | 0.1981 | 0.4504 |
| ${I}_{\mathrm{mmi}}$ | 0 | 0 | 0.4857 | 0.7380 |

There is a stark contrast between the two PIDs in this system. Since $\left|q\right|=\left|r\right|$, the ${I}_{\mathrm{mmi}}$ PID has two zero unique informations, whereas ${I}_{\mathrm{dep}}$ has equal, and quite large, values for the unique informations. The ${I}_{\mathrm{mmi}}$ PID gives much larger values for the redundancy and synergy components than does the ${I}_{\mathrm{dep}}$ PID. In order to explore the differences between these PIDs, 50 random samples were generated from a multivariate normal distribution having correlations $p=-0.2,q=0.7,r=-0.7$. The sample estimates of $p,q,r$ were $\widehat{p}=-0.1125,\widehat{q}=0.6492,\widehat{r}=-0.6915$, and the sample PIDs are

| **PID** | **unq0** | **unq1** | **red** | **syn** |
|---|---|---|---|---|
| ${I}_{\mathrm{dep}}$ | 0.2324 | 0.3068 | 0.1623 | 0.4921 |
| ${I}_{\mathrm{mmi}}$ | 0 | 0.0744 | 0.3948 | 0.7245 |
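For jointly Gaussian variables, the ${I}_{\mathrm{mmi}}$ components can be computed directly from the three correlations, using $I({X}_{0};Y)=\frac{1}{2}\mathrm{log}\frac{1}{1-{q}^{2}}$, $I({X}_{1};Y)=\frac{1}{2}\mathrm{log}\frac{1}{1-{r}^{2}}$ and $I({X}_{0},{X}_{1};Y)=\frac{1}{2}\mathrm{log}\frac{1-{p}^{2}}{|{\Sigma}_{Z}|}$. A sketch (the function name is ours; logarithms are base 2, so values are in bits):

```python
import math

def mmi_pid(p, q, r):
    """I_mmi PID of the trivariate Gaussian with correlations
    p = corr(X0, X1), q = corr(X0, Y), r = corr(X1, Y)."""
    log2 = lambda x: math.log(x, 2)
    i0 = 0.5 * log2(1 / (1 - q * q))        # I(X0; Y)
    i1 = 0.5 * log2(1 / (1 - r * r))        # I(X1; Y)
    det = 1 - p*p - q*q - r*r + 2*p*q*r     # |Sigma_Z|
    joint = 0.5 * log2((1 - p * p) / det)   # I(X0, X1; Y)
    red = min(i0, i1)                       # I_mmi redundancy
    return {"unq0": i0 - red, "unq1": i1 - red,
            "red": red, "syn": joint - i0 - i1 + red}

pid = mmi_pid(-0.2, 0.7, -0.7)  # the population correlations of Example 4
```

With these correlations the two unique informations vanish (since $\left|q\right|=\left|r\right|$), and the four components sum to $I({X}_{0},{X}_{1};Y)$, as they must for any PID.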

We now apply tests of deviance in order to test model ${U}_{i}$ within the saturated model ${U}_{8}$. The null hypothesis being tested is that model ${U}_{i}$ is true (see Appendix E). The results of applying tests of deviance ([13], p. 185), in which each of models ${U}_{1}\dots {U}_{7}$ is tested against the saturated model ${U}_{8}$, produced approximate p values that were close to zero ($p<{10}^{-11}$) for all but model ${U}_{7}$, which had a p value of ${10}^{-6}$. This suggests that none of the models ${U}_{1}\dots {U}_{7}$ provides an adequate fit to the data, and so model ${U}_{8}$ provides the best description. The results of testing ${U}_{6}$ and ${U}_{7}$ within model ${U}_{8}$ gave strong evidence to suggest that the interaction terms ${X}_{0}Y$ and ${X}_{1}Y$ are required to describe the data, and that each term makes a significant contribution in addition to the presence of the other term. Therefore, one would expect to find fairly sizeable unique components in a PID, and so the ${I}_{\mathrm{dep}}$ PID seems to provide a more sensible answer in this example. One would also expect synergy to be present, and both PIDs have a large, positive synergy component.

**Example** **5.** Prediction of grip strength

Some data concerning the prediction of grip strength from physical measurements were collected from 84 male students at Glasgow University. Let Y be the grip strength, ${X}_{0}$ the bicep circumference and ${X}_{1}$ the forearm circumference. The following correlations between each pair of variables were calculated: $\mathrm{corr}({X}_{1},Y)=0.7168,\mathrm{corr}({X}_{0},Y)=0.6383,\mathrm{corr}({X}_{0},{X}_{1})=0.8484$, and the PIDs were computed with the following results.

| **PID** | **unq0** | **unq1** | **red** | **syn** |
|---|---|---|---|---|
| ${I}_{\mathrm{dep}}$ | 0.0048 | 0.1476 | 0.3726 | 0 |
| ${I}_{\mathrm{mmi}}$ | 0 | 0.1427 | 0.3775 | 0.0048 |

The ${I}_{\mathrm{dep}}$ and ${I}_{\mathrm{mmi}}$ PIDs are very similar, and the curious fact that unq0 in ${I}_{\mathrm{dep}}$ is equal to the synergy in ${I}_{\mathrm{mmi}}$ is no accident. It is easy to show this connection theoretically by examining the results in (30)–(31) and Table 5; that is, the sum of unq0 and syn, or the sum of unq1 and syn, in the ${I}_{\mathrm{dep}}$ PID is equal to the synergy value in the ${I}_{\mathrm{mmi}}$ PID. This happens because the ${I}_{\mathrm{mmi}}$ PID must have a zero unique component.

These PIDs indicate that there is almost no synergy among the three variables, which makes sense because the value of $I({X}_{0};Y|{X}_{1})$ is close to zero, and this suggests that ${X}_{0}$ and Y are conditionally independent given ${X}_{1}$. On the other hand, $I({X}_{1};Y|{X}_{0})$ is 0.1427, which suggests that ${X}_{1}$ and Y are not conditionally independent given ${X}_{0}$, and so both terms ${X}_{0}{X}_{1}$ and ${X}_{1}Y$ are of relevance in explaining the data, which is the case in model ${U}_{6}$. This model has $I({X}_{0};Y|{X}_{1})=0$ and therefore no synergy, and also a zero unique value in relation to ${X}_{0}$. The results of applying tests of deviance ([13], p. 185), in which each of models ${U}_{1}\dots {U}_{7}$ is tested within the saturated model ${U}_{8}$, show that the approximate p values are close to zero ($p<{10}^{-13}$) for all models except ${U}_{5}$ and ${U}_{6}$. The p value for the model ${U}_{5}$ is $3\times {10}^{-5}$, while the p value for the test of ${U}_{6}$ against ${U}_{8}$ is approximately 0.45. Thus, there is strong evidence to reject all the models except model ${U}_{6}$, which suggests that model ${U}_{6}$ provides a good fit to the data; this alternative viewpoint provides support for the form of both PIDs.

#### 2.6. Graphical Illustrations

We present some graphical illustrations of the ${I}_{\mathrm{dep}}$ PID and compare it to the ${I}_{\mathrm{mmi}}$ PID; see Section 1.1.1 and Section 1.2 for definitions of these PIDs.

Since $q=r$, both the ${I}_{\mathrm{mmi}}$ unique informations are zero in Figure 3a. The redundancy component is constant, while the synergy component decreases towards zero. In Figure 3b, we observe change-point behaviour of ${I}_{\mathrm{dep}}$ when $p=0.25$. For $p<0.25$, the unique components of ${I}_{\mathrm{dep}}$ are equal, constant and positive. The redundancy component is also constant and positive, with a lower value than the corresponding component in the ${I}_{\mathrm{mmi}}$ PID. The synergy component decreases towards zero and reaches this value when $p=0.25$. The ${I}_{\mathrm{dep}}$ synergy is lower than the corresponding ${I}_{\mathrm{mmi}}$ synergy for all values of p.

At $p=0.25$, the synergy “switches off” in the ${I}_{\mathrm{dep}}$ PID, and stays “off” for larger values of p, and then the unique and redundancy components are free to change. In the range $0.25<p<1,$ the redundancy increases and takes up all the mutual information when $p=1$, while the unique informations decrease towards zero. The ${I}_{\mathrm{dep}}$ and ${I}_{\mathrm{mmi}}$ profiles show different features in this case. The “regime switching” in the ${I}_{\mathrm{dep}}$ PID is interesting. As mentioned in Proposition 2, the minimum edge value occurs with unq0 = i or k. When unq0 = k the synergy must be equal to zero, whereas when unq0 = i the synergy is positive and the values of the unique informations and the redundancy are constant. Regions of zero synergy in the ${I}_{\mathrm{dep}}$ PID are explored in Figure 5.

In Figure 4a,b, there are clear differences in the PID profiles between the two methods. The ${I}_{\mathrm{dep}}$ synergy component switches off at $p=0.5$ and is zero thereafter. For $p<0.5$, both the ${I}_{\mathrm{dep}}$ uniques are much larger than those of ${I}_{\mathrm{mmi}}$, which are zero, and ${I}_{\mathrm{mmi}}$ has a larger redundancy component. For $p>0.5$, the redundancy component in ${I}_{\mathrm{dep}}$ increases to take up all of the mutual information, while the unique information components decrease towards zero. In contrast to this, in the ${I}_{\mathrm{mmi}}$ PID the redundancy and unique components remain at their constant values while the synergy continues to decrease towards zero.

The PIDs are plotted for increasing values of $q=r$ in Figure 4c,d when $p=0.25$. The ${I}_{\mathrm{mmi}}$ and ${I}_{\mathrm{dep}}$ profiles are quite different. As q increases, the ${I}_{\mathrm{mmi}}$ uniques remain at zero, while the ${I}_{\mathrm{dep}}$ uniques rise gradually. Both the ${I}_{\mathrm{mmi}}$ redundancy and synergy profiles rise more quickly than their ${I}_{\mathrm{dep}}$ counterparts, probably because both the ${I}_{\mathrm{mmi}}$ uniques are zero. In the ${I}_{\mathrm{dep}}$ PID, the synergy switches on at $q=0.5$, and it is noticeable that all the ${I}_{\mathrm{dep}}$ components can change simultaneously as q increases.

One of the characteristics noticed in Figure 3 and Figure 4 is the "switching behaviour" of the ${I}_{\mathrm{dep}}$ PID, in that there are kinks in the plots of the PIDs against the correlation between the predictors, p: the synergy component abruptly becomes equal to zero at certain values of p, and there are other values of p at which the synergy moves from being zero to being positive.

In Proposition 2, it is explained for the ${I}_{\mathrm{dep}}$ PID that when both predictor-target correlations are non-zero the minimum edge value occurs at edge value i or k. When the synergy moves from zero to a positive value, this means that the minimum edge value has changed from being k to being equal to i, and vice-versa. For a given value of p, one can explore the regions in $(q,r)$ space at which such transitions take place. In Figure 5, this region of zero synergy is shown for four different values of p. The boundary of each of the regions is where the synergy component changes from positive synergy to zero synergy, or vice-versa.

The plots in Figure 5 show that synergy is non-zero (positive) whenever q and r are of opposite sign. When the predictor-predictor correlation, p, is 0.05 there is also positive synergy for large regions, defined by $qr-p>0$, when q and r have the same sign. As p increases, the regions of zero synergy change shape, initially increasing in area and then declining as p becomes quite large ($p=0.75$). As p is increased further, the zero-synergy bands narrow, and so zero synergy will only be found when q and r are close to being equal.

When p is negative, the corresponding plots are identical to those with positive p but rotated counter clockwise by $\pi /2$ about the point $q=0,r=0.$ Hence, synergy is present when q and r have the same sign. When q and r have opposite signs, there is also positive synergy for regions defined by $qr-p<0.$

The case of $p=0$ is also of interest: there are no non-zero admissible values of q and r (where the covariance matrix is positive definite) for which the synergy is equal to zero. Hence, the system will have synergy in this case unless $q=0$ or $r=0$. This can be seen from the ${I}_{\mathrm{dep}}$ synergy expression in Example 3.