For continuous random variables, there is no canonical entropy function. Instead, there is differential entropy, which is computed with respect to some reference measure dx:

$H(X) = -\int p(x) \log p(x) \, \mathrm{d}x,$

where p now denotes the probability density of X with respect to dx. Taking the Lebesgue measure, the entropy of an m-dimensional Gaussian random vector with covariance matrix Σ_{X} is given by:

$H(X) = \frac{1}{2} \log \left( (2 \pi e)^{m} |\Sigma_{X}| \right),$

where |Σ_{X}| denotes the determinant of Σ_{X}. This entropy is not invariant under coordinate transformations. In fact, if A ∈ ℝ^{m×m}, then the covariance matrix of AX is given by AΣ_{X} A^{t}, and so the entropy of AX is given by:

$H(AX) = \frac{1}{2} \log \left( (2 \pi e)^{m} |A \Sigma_{X} A^{t}| \right) = H(X) + \log |\det A|.$
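As a quick numerical illustration of these formulas, the following sketch (plain Python; the 2×2 covariance matrix and the map A are arbitrary choices of ours) computes the Gaussian differential entropy and checks that a linear map shifts the entropy by log |det A|:

```python
import math

def det2(M):
    # Determinant of a 2x2 matrix
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def gauss_entropy(cov):
    # H(X) = (1/2) log((2*pi*e)^m |Sigma_X|) in nats, here with m = 2
    m = 2
    return 0.5 * math.log((2 * math.pi * math.e) ** m * det2(cov))

def transform_cov(A, cov):
    # Covariance of AX is A Sigma_X A^t (2x2 case, written out explicitly)
    AS = [[sum(A[i][k] * cov[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    return [[sum(AS[i][k] * A[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

sigma = [[2.0, 0.5], [0.5, 1.0]]   # an arbitrary positive definite covariance
A = [[2.0, 0.0], [1.0, 1.0]]       # det A = 2

h_x = gauss_entropy(sigma)
h_ax = gauss_entropy(transform_cov(A, sigma))
# The entropy is not invariant: it shifts by log|det A|
print(h_ax - h_x, math.log(abs(det2(A))))
```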

#### 2.1. Partial Information Lattice

We want to analyze how the information that X_{1}, …, X_{n} have about S is distributed among X_{1}, …, X_{n}. In Shannon’s theory of information, the total amount of information about S contained in X_{1}, …, X_{n} is quantified by the mutual information:

$MI(S : X_{1}, \dots, X_{n}) = H(S) + H(X_{1}, \dots, X_{n}) - H(S, X_{1}, \dots, X_{n}).$

We are looking for a way to write MI(S : X_{1}, …, X_{n}) as a sum of non-negative functions with a good interpretation in terms of how the information is distributed, e.g., redundantly or synergistically, among X_{1}, …, X_{n}. For example, as we have mentioned in the Introduction and as we will see later, several suggestions have been made to measure the total synergy of X_{1}, …, X_{n} in terms of a function Synergy(S : X_{1}; …; X_{n}). When starting with such a function, the idea of the information decomposition is to further decompose the difference:

$MI(S : X_{1}, \dots, X_{n}) - Synergy(S : X_{1}; \dots; X_{n})$ (2)

as a sum of non-negative functions. The additional advantage of such a complete information decomposition would be to give a better interpretation of the difference (2), apart from the tautological interpretation that it just measures “everything but the synergy.” Throughout the paper, we will use the following notation: the left argument of the information quantities, the target variable S, is divided by a colon from the right arguments. The semicolon separates the different arguments on the right side, while comma-separated random variables are treated as a single vector-valued argument.

When looking for such an information decomposition, the first question is what terms to expect. In the case n = 2, this may seem quite easy, and it seems to be common sense to expect a decomposition of the form:

$MI(S : X_{1}, X_{2}) = SI(S : X_{1}; X_{2}) + UI(S : X_{1} \backslash X_{2}) + UI(S : X_{2} \backslash X_{1}) + CI(S : X_{1}; X_{2})$ (3)

into four terms corresponding to the redundant (or shared) information SI(S : X_{1}; X_{2}), the unique information UI(S : X_{1} \ X_{2}) and UI(S : X_{2} \ X_{1}) of X_{1} and X_{2}, respectively, and the synergistic (or complementary) information CI(S : X_{1}; X_{2}).
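A useful bookkeeping observation: if, as in the bivariate decomposition proposals, one additionally requires MI(S : X_{i}) = SI + UI(S : X_{i} \ X_{j}), then the four terms satisfy three linear constraints, so fixing any one of them (say SI) determines the other three. The sketch below (illustrative helper names of ours; joint distributions as dicts over outcome tuples) makes this concrete:

```python
import math
from collections import defaultdict

def marginal(p, idx):
    # Marginalize a joint dict {outcome-tuple: prob} onto the coordinates in idx
    m = defaultdict(float)
    for outcome, prob in p.items():
        m[tuple(outcome[i] for i in idx)] += prob
    return dict(m)

def mi(p, left, right):
    # Mutual information (in bits) between two groups of coordinates
    pl, pr = marginal(p, left), marginal(p, right)
    plr = marginal(p, left + right)
    return sum(prob * math.log2(prob / (pl[k[:len(left)]] * pr[k[len(left):]]))
               for k, prob in plr.items() if prob > 0)

def decompose(p, si):
    # Given a value for the shared information SI, the constraints
    #   MI(S:X1) = SI + UI1,  MI(S:X2) = SI + UI2,
    #   MI(S:X1,X2) = SI + UI1 + UI2 + CI
    # determine the remaining three terms.
    ui1 = mi(p, (0,), (1,)) - si
    ui2 = mi(p, (0,), (2,)) - si
    ci = mi(p, (0,), (1, 2)) - si - ui1 - ui2
    return ui1, ui2, ci

# XOR example: S = X1 xor X2, with X1, X2 uniform and independent
xor = {(x1 ^ x2, x1, x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}
print(decompose(xor, si=0.0))  # for SI = 0: no unique information, 1 bit of synergy
```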

However, when n > 2, it seems less clear in which different ways X_{1}, … , X_{n} may interact with each other, combining redundant, unique and synergistic effects.

As a solution, Williams and Beer proposed the partial information framework. We explain the idea only briefly here and refer to [

4] for more detailed explanations. The basic idea is to construct such a decomposition purely in terms of a function for shared information

I_{∩}(

S :

X_{1}; … ;

X_{n}) that measures the redundant information about

S contained in

X_{1}, … ,

X_{n}. Clearly, such a function should be symmetric in permutations of

X_{1}, … ,

X_{n}. In a second step,

I_{∩} is also used to measure the redundant information

I_{∩}(

S :

A_{1}; … ;

A_{k}) about

S contained in combinations

A_{1}, … ,

A_{k} of the original random variables (that is,

A_{1}, … ,

A_{k} are random vectors whose components are among {

X_{1}, … ,

X_{n}}). Moreover, Williams and Beer proposed that

I_{∩} should satisfy the following monotonicity property:

(where the inclusion

A_{i} ⊆

A_{k}_{+1} means that any component of

A_{i} is also a component of

A_{k}_{+1}).

The monotonicity property shows that it suffices to consider the function I_{∩} in the case where A_{1}, …, A_{k} form an antichain; that is, A_{i} ⊈ A_{j} for all i ≠ j. The set of antichains is partially ordered by the relation:

$(B_{1}, \dots, B_{l}) \preceq (A_{1}, \dots, A_{k}) \quad :\Longleftrightarrow \quad \text{for each } i \text{ there exists } j \text{ with } B_{j} \subseteq A_{i},$

and, again by the monotonicity property, I_{∩} is a monotone function with respect to this partial order. This partial order actually makes the set of antichains into a lattice.
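The lattice can be made concrete by enumeration. The sketch below (illustrative code of ours; sources are represented as frozensets of variable indices) lists all antichains of non-empty subsets of {1, …, n} and implements an ordering of the Williams–Beer kind, under the convention that (B_1, …, B_l) ≼ (A_1, …, A_k) when every A_i contains some B_j:

```python
from itertools import combinations

def nonempty_subsets(n):
    # All non-empty subsets of {1, ..., n} as frozensets
    elems = range(1, n + 1)
    return [frozenset(c) for r in range(1, n + 1)
            for c in combinations(elems, r)]

def is_antichain(sets):
    # No member of the collection may strictly contain another
    return all(not (a < b or b < a) for a, b in combinations(sets, 2))

def antichains(n):
    subs = nonempty_subsets(n)
    out = []
    for r in range(1, len(subs) + 1):
        for combo in combinations(subs, r):
            if is_antichain(combo):
                out.append(frozenset(combo))
    return out

def leq(beta, alpha):
    # (B_1,...,B_l) <= (A_1,...,A_k): every A_i contains some B_j
    return all(any(b <= a for b in beta) for a in alpha)

# The PI lattice for n = 2 has 4 nodes; the one for n = 3 has 18
print(len(antichains(2)), len(antichains(3)))
```

Note that the counts 4 and 18 match the sizes of the PI lattices for n = 2 and n = 3 mentioned below.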

If (B_{1}, …, B_{l}) ≼ (A_{1}, …, A_{k}), then the difference I_{∩}(S : A_{1}; …; A_{k}) − I_{∩}(S : B_{1}; …; B_{l}) quantifies the information contained in all A_{i}, but not contained in some B_{j}. The idea of Williams and Beer can be summarized by saying that all information can be classified according to within which antichains it is contained. Thus, the third step is to write:

$I_{\cap}(S : A_{1}; \dots; A_{k}) = \sum_{(B_{1}; \dots; B_{l}) \preceq (A_{1}; \dots; A_{k})} I_{\partial}(S : B_{1}; \dots; B_{l}),$

where the function I_{∂} is uniquely defined as the Möbius transform of I_{∩} on the lattice of antichains.
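The Möbius transform can be computed recursively: at each node, subtract from I_{∩} the partial terms of all strictly smaller nodes. A small sketch on the four-node n = 2 lattice (node names and the example I_{∩} values are made up for illustration):

```python
# The n = 2 lattice of antichains:
# '12' is the top node (X1,X2); '1' and '2' are incomparable; '1.2' is the bottom.
strictly_below = {
    '1.2': [],
    '1': ['1.2'],
    '2': ['1.2'],
    '12': ['1.2', '1', '2'],
}

def mobius(i_cap):
    # Invert i_cap(alpha) = sum of i_partial(beta) over beta <= alpha
    i_partial = {}
    for node in ['1.2', '1', '2', '12']:  # any linear extension of the order
        i_partial[node] = i_cap[node] - sum(i_partial[b]
                                            for b in strictly_below[node])
    return i_partial

# Illustrative (made-up) redundancy values, monotone along the order
i_cap = {'1.2': 0.2, '1': 0.5, '2': 0.4, '12': 1.0}
i_partial = mobius(i_cap)
print(i_partial)  # the four partial values sum back to i_cap['12']
```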

For example, the PI lattices for n = 2 and n = 3 are given in Figure 1. For n = 2, it is easy to make the connection with (3): The partial measures are:

$SI(S : X_{1}; X_{2}) = I_{\partial}(S : X_{1}; X_{2}), \quad UI(S : X_{i} \backslash X_{j}) = I_{\partial}(S : X_{i}), \quad CI(S : X_{1}; X_{2}) = I_{\partial}(S : X_{1}, X_{2}),$

and the redundancy measure satisfies:

$MI(S : X_{1}) = I_{\cap}(S : X_{1}) = SI(S : X_{1}; X_{2}) + UI(S : X_{1} \backslash X_{2}).$ (4)

From (4) and the chain rule for the mutual information:

$MI(S : X_{1}, X_{2}) = MI(S : X_{1}) + MI(S : X_{2} | X_{1}),$

follows immediately:

$MI(S : X_{2} | X_{1}) = UI(S : X_{2} \backslash X_{1}) + CI(S : X_{1}; X_{2}).$

Even if I_{∩} is non-negative (as it should be as an information quantity), it is not immediate that the function I_{∂} is also non-negative. This additional requirement was called local positivity in [5].

While the PI lattice is a beautiful framework, so far, there has been no convincing proposal of how the function I_{∩} should be defined. There have been some proposals of functions I_{∩}(S : X_{1}; X_{2}) with up to two arguments, so-called bivariate information decompositions [7,8], but so far, only two general information decompositions are known. Williams and Beer defined a function I_{min} that satisfies local positivity, but, as mentioned above, it was found to give unintuitive values in many examples [5,6]. In [5], I_{min} was compared with the function:

$I_{mmi}(S : A_{1}; \dots; A_{k}) = \min_{i} MI(S : A_{i}),$

which was called minimum mutual information (MMI) in [12] (originally, it was denoted by I_{I} in [5]). This function has many nice mathematical properties, including local positivity. However, I_{mmi} clearly does not have the right interpretation as measuring the shared information, since I_{mmi} only compares the different amounts of information of S and A_{i}, without checking whether the measured information is really the “same” information [5]. However, for Gaussian random variables, I_{mmi} might actually lead to a reasonable information decomposition (as discussed in [12] for the case n = 2).
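Because I_{mmi} only compares amounts of information, it is easy to compute from a joint table. A minimal sketch (illustrative helpers of ours; joint distributions as dicts mapping outcome tuples to probabilities), using the XOR example, where each single input carries zero information about S even though together they determine it:

```python
import math
from collections import defaultdict

def marginal(p, idx):
    # Marginalize a joint dict {outcome-tuple: prob} onto the coordinates in idx
    m = defaultdict(float)
    for outcome, prob in p.items():
        m[tuple(outcome[i] for i in idx)] += prob
    return dict(m)

def mi(p, left, right):
    # Mutual information in bits between two groups of coordinates
    pl, pr = marginal(p, left), marginal(p, right)
    plr = marginal(p, left + right)
    return sum(prob * math.log2(prob / (pl[k[:len(left)]] * pr[k[len(left):]]))
               for k, prob in plr.items() if prob > 0)

def i_mmi(p, target, args):
    # I_mmi(S : A_1; ...; A_k) = min_i MI(S : A_i)
    return min(mi(p, target, a) for a in args)

# XOR: S = X1 xor X2 with X1, X2 independent uniform bits (coordinates 0,1,2)
xor = {(x1 ^ x2, x1, x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}
print(i_mmi(xor, (0,), [(1,), (2,)]))  # 0.0: neither input alone informs S
print(mi(xor, (0,), (1, 2)))           # 1.0: together they determine S
```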

#### 2.2. Interaction Spaces

An alternative approach to quantify synergy comes from the idea that synergy among interacting systems has to do with interactions beyond simple pair interactions. We slightly change the notation and now analyze the interaction of n + 1 random variables X_{0}, X_{1}, … , X_{n}. Later, we will put X_{0} = S in order to compare the setting of interaction spaces with the setting of information decompositions.

For simplicity, we restrict ourselves here to the discrete setting. Let $\binom{X}{k}$ be the set of all subsets A ⊆ {X_{0}, …, X_{n}} of cardinality |A| = k. The exponential family of k-th order interactions $\mathcal{E}^{(k)}$ of random variables X_{0}, X_{1}, …, X_{n} consists of all distributions of the form:

$p(x_{0}, x_{1}, \dots, x_{n}) = \prod_{A \in \binom{X}{k}} \Psi_{A}(x_{0}, x_{1}, \dots, x_{n}),$

where Ψ_{A} is a strictly positive function that only depends on those x_{i} with X_{i} ∈ A. Taking the logarithm, this is equivalent to saying that:

$\log p(x_{0}, x_{1}, \dots, x_{n}) = \sum_{A \in \binom{X}{k}} \psi_{A}(x_{0}, x_{1}, \dots, x_{n}),$

where, again, each function ψ_{A} only depends on those x_{i} with X_{i} ∈ A. This second representation corresponds to the Gibbs–Boltzmann distribution used in statistical mechanics, and it also explains the name exponential family. Clearly,

$\mathcal{E}^{(1)} \subseteq \mathcal{E}^{(2)} \subseteq \dots \subseteq \mathcal{E}^{(n)} \subseteq \mathcal{E}^{(n+1)}.$
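To make the Gibbs–Boltzmann form concrete, the sketch below (with arbitrary, made-up pair potentials over three binary variables) builds a member of the pair-interaction family $\mathcal{E}^{(2)}$ by exponentiating a sum of functions ψ_{A}, each depending on two variables only, and normalizing:

```python
import math
from itertools import product

# Arbitrary pair potentials psi_A for A = {0,1}, {0,2}, {1,2}:
# each depends only on the two coordinates in A
def psi_01(x): return 0.8 * x[0] * x[1]
def psi_02(x): return -0.5 * x[0] * x[2]
def psi_12(x): return 0.3 * x[1] * x[2]

states = list(product((0, 1), repeat=3))

# log p(x) = sum_A psi_A(x) - log Z  (Gibbs-Boltzmann form)
weights = {x: math.exp(psi_01(x) + psi_02(x) + psi_12(x)) for x in states}
Z = sum(weights.values())
p = {x: w / Z for x, w in weights.items()}

# p is a strictly positive distribution built from pair interactions only
print(sum(p.values()), min(p.values()) > 0)
```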

The set $\mathcal{E}^{(k)}$ is not closed (for k > 0), in the sense that there are probability distributions outside of $\mathcal{E}^{(k)}$ that can be approximated arbitrarily well by k-th order interaction distributions. Thus, we denote by $\overline{\mathcal{E}^{(k)}}$ the closure of $\mathcal{E}^{(k)}$ (technically speaking, for probability spaces, there are different notions of approximation and of closure, but in the finite discrete case, they all agree; for example, one may take the topology induced by considering a probability distribution as a vector of real numbers). For example, $\overline{\mathcal{E}^{(k)}}$ contains distributions that can be written as products of non-negative functions Ψ_{A} with zeros. In particular, $\overline{\mathcal{E}^{(n+1)}}$ consists of all possible joint distributions of X_{0}, …, X_{n}. However, for 1 < k ≤ n, the closure of $\mathcal{E}^{(k)}$ also contains distributions that do not factorize at all (see Section 2.3 in [13] and the references therein).

Given an arbitrary joint distribution p of X_{0}, …, X_{n}, we might ask for the best approximation of p by a k-th order interaction distribution q. It is customary to measure the approximation error in terms of the Kullback–Leibler divergence:

$D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}.$

There are many relations between the KL divergence and exponential families. We need the following properties:

**Proposition 1.**

(1). Let $\mathcal{E}$ be an exponential family, and let p be an arbitrary distribution. Then, there is a unique distribution $p_{\mathcal{E}}$ in the closure of $\mathcal{E}$ that best approximates p, in the sense that:

$D(p \| p_{\mathcal{E}}) = \min_{q \in \overline{\mathcal{E}}} D(p \| q).$

$p_{\mathcal{E}}$ is called the rI-projection of p to $\mathcal{E}$.

(2). If $\mathcal{E} \subseteq \mathcal{E}'$ are two exponential families, then:

$D(p \| p_{\mathcal{E}}) = D(p \| p_{\mathcal{E}'}) + D(p_{\mathcal{E}'} \| p_{\mathcal{E}}).$

See [9,14] for a proof and further properties of exponential families. The second identity is also called the Pythagorean theorem for exponential families.
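The Pythagorean identity can be checked numerically for two nested families whose rI-projections have closed forms: the family of fully factorized distributions, and the larger family in which only X_{0} is independent of (X_{1}, X_{2}). In both cases, the projection is the corresponding product of marginals (a standard fact); the distribution below is an arbitrary example of ours:

```python
import math
from collections import defaultdict
from itertools import product

def marginal(p, idx):
    m = defaultdict(float)
    for x, prob in p.items():
        m[tuple(x[i] for i in idx)] += prob
    return dict(m)

def kl(p, q):
    # Kullback-Leibler divergence D(p||q) in nats
    return sum(prob * math.log(prob / q[x]) for x, prob in p.items() if prob > 0)

# An arbitrary strictly positive joint distribution of three binary variables
raw = {x: 1.0 + 0.5 * x[0] * x[1] + 0.3 * x[1] * x[2] + 0.2 * x[0]
       for x in product((0, 1), repeat=3)}
Z = sum(raw.values())
p = {x: v / Z for x, v in raw.items()}

m0, m1, m2 = (marginal(p, (i,)) for i in range(3))
m12 = marginal(p, (1, 2))

# rI-projection onto full independence: product of the single marginals
# (its divergence from p is the multi-information)
q_small = {x: m0[(x[0],)] * m1[(x[1],)] * m2[(x[2],)] for x in p}
# rI-projection onto the larger family {X0 independent of (X1, X2)}
q_big = {x: m0[(x[0],)] * m12[(x[1], x[2])] for x in p}

# Pythagorean theorem: D(p||q_small) = D(p||q_big) + D(q_big||q_small)
print(kl(p, q_small), kl(p, q_big) + kl(q_big, q_small))
```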

In the following, we will abbreviate $q^{(k)} := p_{\mathcal{E}^{(k)}}$. For example, $q^{(n+1)} = p$. For n ≥ k > 1, there is no general formula for $q^{(k)}$. For k = 1, one can show that:

$q^{(1)}(x_{0}, \dots, x_{n}) = p(x_{0}) \, p(x_{1}) \cdots p(x_{n}).$

Thus, $D(p\|{q}^{(1)})={\displaystyle {\sum}_{i=0}^{n}H({X}_{i})-H({X}_{0},\dots ,{X}_{n})}$ equals the multi-information [15] (also known as total correlation [16]) of X_{0}, …, X_{n}. Applying the Pythagorean theorem n − 1 times to the hierarchy $\mathcal{E}^{(1)} \subseteq \mathcal{E}^{(2)} \subseteq \dots \subseteq \mathcal{E}^{(n)}$, it follows that:

$D(p \| q^{(1)}) = \sum_{k=2}^{n+1} D(q^{(k)} \| q^{(k-1)}).$

This equation decomposes the multi-information into terms corresponding to different interaction orders. This decomposition was introduced in [9] and studied for several examples in [10] or [17], with the single terms called connected information or interaction complexities, respectively. The idea that synergy should capture everything beyond pair interactions motivates us to define:

$S^{(2)}(X_{0}; \dots; X_{n}) := D(p \| q^{(2)}) = \sum_{k=3}^{n+1} D(q^{(k)} \| q^{(k-1)})$

as a measure of synergy. In this interpretation, the synergy of X_{0}, …, X_{n} is a part of the multi-information of X_{0}, …, X_{n}. The last sum shows that the hierarchy of interaction families gives a finer decomposition of S^{(2)} into terms that may be interpreted as “synergy of a fixed order”. In the case n = 2 that we will study later, there is only one term, since p = q^{(3)} in this case. Using the maximum entropy principle behind exponential families [14], the function S^{(2)} can also be expressed as:

$S^{(2)}(X_{0}; \dots; X_{n}) = \max_{r \in \Delta_{p}^{(2)}} H(r) - H(X_{0}, \dots, X_{n}),$

where $\Delta_{p}^{(2)}$ denotes the set of all joint distributions r of X_{0}, …, X_{n} that have the same pair marginals as p.

In contrast, the partial information lattice provides a decomposition of the mutual information and not the multi-information. However, a decomposition of the mutual information MI(X_{0} : X_{1}, …, X_{n}) can be achieved in a similar spirit as follows. Let $\binom{X}{k}_{0}$ be the set of all subsets A ⊆ {X_{0}, …, X_{n}} of cardinality |A| = k that contain X_{0}, and let ${\widehat{\mathcal{E}}}^{(k)}$ be the set of all probability distributions of the form:

$p(x_{0}, x_{1}, \dots, x_{n}) = \Psi_{[n]}(x_{1}, \dots, x_{n}) \prod_{A \in \binom{X}{k}_{0}} \Psi_{A}(x_{0}, x_{1}, \dots, x_{n}),$

where the Ψ_{A} are as above and where Ψ_{[n]} is a function that only depends on x_{1}, …, x_{n}. As above, each ${\widehat{\mathcal{E}}}^{(k)}$ is an exponential family.

We will abbreviate ${\widehat{q}}^{(k)} := {p}_{{\widehat{\mathcal{E}}}^{(k)}}$. Again, for general k, there is no formula for ${\widehat{q}}^{(k)}$, but for k = 1, one can show that:

$\widehat{q}^{(1)}(x_{0}, x_{1}, \dots, x_{n}) = p(x_{0}) \, p(x_{1}, \dots, x_{n}).$

Therefore, $D(p\|{\widehat{q}}^{(1)})=MI({X}_{0}:{X}_{1},\dots ,{X}_{n})$. Moreover, by the Pythagorean theorem,

$D(p \| \widehat{q}^{(1)}) = \sum_{k=2}^{n+1} D(\widehat{q}^{(k)} \| \widehat{q}^{(k-1)}), \quad \text{where } \widehat{q}^{(n+1)} := p.$

Thus, we obtain a decomposition of the mutual information MI(X_{0} : X_{1}, …, X_{n}).

Again, one can group together all terms except the last term that corresponds to the pair interactions and define:

$\widehat{S}^{(2)}(X_{0} : X_{1}; \dots; X_{n}) := D(p \| \widehat{q}^{(2)}) = \sum_{k=3}^{n+1} D(\widehat{q}^{(k)} \| \widehat{q}^{(k-1)})$

as a measure of synergy. In this interpretation, synergy is a part of the mutual information MI(X_{0} : X_{1}, …, X_{n}). Using the maximum entropy principle behind exponential families [14], the function Ŝ^{(2)} can also be expressed as:

$\widehat{S}^{(2)}(X_{0} : X_{1}; \dots; X_{n}) = \max_{r \in \widehat{\Delta}_{p}^{(2)}} H(r) - H(X_{0}, \dots, X_{n}),$

where $\widehat{\Delta}_{p}^{(2)}$ denotes the set of all joint distributions r of X_{0}, …, X_{n} that have the same pair marginals as p and for which, additionally, the marginal distribution of X_{1}, …, X_{n} is the same as for p.

While the exponential families $\mathcal{E}^{(k)}$ are symmetric in all random variables X_{0}, …, X_{n}, in the definition of ${\widehat{\mathcal{E}}}^{(k)}$, the variable X_{0} plays a special role. This is reminiscent of the special role of S in the information decomposition framework, when the goal is to decompose the information about S. Thus, also in Ŝ^{(2)}, the variable X_{0} is special.

There are some relations between the hierarchies $\mathcal{E}^{(1)} \subseteq \mathcal{E}^{(2)} \subseteq \dots \subseteq \mathcal{E}^{(n)}$ and ${\widehat{\mathcal{E}}}^{(1)}\subseteq {\widehat{\mathcal{E}}}^{(2)}\subseteq \dots \subseteq {\widehat{\mathcal{E}}}^{(n)}$.

By definition, ${\mathcal{E}}^{(i)}\subseteq {\widehat{\mathcal{E}}}^{(i)}$ and thus:

$D(p \| q^{(i)}) \ge D(p \| \widehat{q}^{(i)}).$

In particular, ${S}^{(2)}({X}_{0};\dots ;{X}_{n})\ge {\widehat{S}}^{(2)}({X}_{0}:{X}_{1};\dots ;{X}_{n})$. Moreover, ${\mathcal{E}}^{(n)}={\widehat{\mathcal{E}}}^{(n)}$, which implies:

$q^{(n)} = \widehat{q}^{(n)}.$

In particular, for n = 2, this shows S^{(2)}(S; X; Y) = Ŝ^{(2)}(S : X; Y).
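These quantities can be approximated numerically. Assuming S^{(2)} = D(p‖q^{(2)}) with q^{(2)} the maximum-entropy distribution matching all pair marginals of p, the sketch below estimates q^{(2)} by iterative proportional fitting (a standard algorithm for this kind of projection; all helper names are ours) and evaluates it on the XOR distribution, whose synergy comes out as 1 bit:

```python
import math
from collections import defaultdict
from itertools import product, combinations

def marginal(p, idx):
    m = defaultdict(float)
    for x, prob in p.items():
        m[tuple(x[i] for i in idx)] += prob
    return dict(m)

def ipf(p, n_iter=200):
    # Iterative proportional fitting: rescale a uniform start until all
    # pair marginals match those of p; converges to the maximum-entropy
    # distribution with these pair marginals (when it exists)
    states = list(p)
    q = {x: 1.0 / len(states) for x in states}
    pairs = list(combinations(range(len(states[0])), 2))
    targets = {idx: marginal(p, idx) for idx in pairs}
    for _ in range(n_iter):
        for idx in pairs:
            qm = marginal(q, idx)
            for x in q:
                key = tuple(x[i] for i in idx)
                q[x] *= targets[idx].get(key, 0.0) / qm[key] if qm[key] > 0 else 0.0
    return q

def kl_bits(p, q):
    return sum(prob * math.log2(prob / q[x]) for x, prob in p.items() if prob > 0)

# XOR: X0 = X1 xor X2; all pair marginals are uniform and independent,
# so the maximum-entropy fit q^(2) is uniform on all 8 states
xor = {(x1 ^ x2, x1, x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}
states = list(product((0, 1), repeat=3))
p = {x: xor.get(x, 0.0) for x in states}
q2 = ipf(p)
print(kl_bits(p, q2))  # S^(2) for XOR: 1 bit of synergy
```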

The case n = 2, k = 2 is also the case that we are most interested in later, for the following reasons. First, for n = 2, the terms in the partial information lattice have an intuitively clear interpretation. Second, while there are not many examples of full information decompositions for n > 2, there exist at least two proposals for reasonable measures of shared, unique and complementary information [7,8], which allow a direct comparison with measures based on the decompositions using the interaction spaces.

While the symmetric hierarchy of the families $\mathcal{E}^{(k)}$ is classical, to the best of our knowledge, the alternative hierarchy of the families ${\widehat{\mathcal{E}}}^{(k)}$ has not been studied before. We do not want to analyze this second hierarchy in detail here; we just want to demonstrate that the framework of interaction exponential families is flexible enough to give a nice decomposition of the mutual information, which can naturally be compared with the information decomposition framework. In this paper, in any case, we only consider cases where ${\mathcal{E}}^{(k)}={\widehat{\mathcal{E}}}^{(k)}$.

It is possible to generalize the definitions of the interaction exponential families to continuous random variables, but there are some technical issues to be solved. For example, the corresponding exponential families will be infinite-dimensional. We will not do this here in detail, since we only need the following observation later: any Gaussian distribution can be described by pair interactions. Therefore, when p is a multivariate normal distribution, ${q}^{(2)}={\widehat{q}}^{(2)}=p$.