Quantifying Unique Information

Bertschinger, Nils; Rauh, Johannes; Olbrich, Eckehard; Jost, Jürgen; Ay, Nihat

doi:10.3390/e16042161


by Nils Bertschinger ¹, Johannes Rauh ¹,*, Eckehard Olbrich ¹, Jürgen Jost ¹,² and Nihat Ay ¹,²

¹ Max Planck Institute for Mathematics in the Sciences, Inselstraße 23, 04109 Leipzig, Germany

² Santa Fe Institute, 1399 Hyde Park Rd, Santa Fe, NM 87501, USA

* Author to whom correspondence should be addressed.

Entropy 2014, 16(4), 2161-2183; https://doi.org/10.3390/e16042161

Submission received: 15 January 2014 / Revised: 24 March 2014 / Accepted: 4 April 2014 / Published: 15 April 2014


Abstract

We propose new measures of shared information, unique information and synergistic information that can be used to decompose the mutual information of a pair of random variables (Y, Z) with a third random variable X. Our measures are motivated by an operational idea of unique information, which suggests that shared information and unique information should depend only on the marginal distributions of the pairs (X, Y) and (X, Z). Although this invariance property has not been studied before, it is satisfied by other proposed measures of shared information. The invariance property does not uniquely determine our new measures, but it implies that the functions that we define are bounds to any other measures satisfying the same invariance property. We study properties of our measures and compare them to other candidate measures.

MSC Classification: 94A15, 94A17

Keywords:

Shannon information; mutual information; information decomposition; shared information; synergy

1. Introduction

Consider three random variables X, Y, Z with finite state spaces $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{Z}$. Suppose that we are interested in the value of X, but we can only observe Y or Z. If the tuple (Y, Z) is not independent of X, then the values of Y or Z or both of them contain information about X. The information about X contained in the tuple (Y, Z) can be distributed in different ways. For example, it may happen that Y contains information about X, but Z does not, or vice versa. In this case, it would suffice to observe only one of the two variables Y, Z, namely the one containing the information. It may also happen that Y and Z contain different information, so it would be worthwhile to observe both of the variables. If Y and Z contain the same information about X, we could choose to observe either Y or Z. Finally, it is possible that neither Y nor Z taken for itself contains any information about X, but together they contain information about X. This effect is called synergy, and it occurs, for example, if all variables X, Y, Z are binary, and X = Y XOR Z. In general, all effects may be present at the same time. That is, the information that (Y, Z) has about X is a mixture of shared information SI(X : Y; Z) (that is, information contained both in Y and in Z), unique information UI(X : Y \ Z) and UI(X : Z \ Y) (that is, information that only one of Y and Z has) and synergistic or complementary information CI(X : Y; Z) (that is, information that can only be retrieved when considering Y and Z together).

The total information that (Y, Z) has about X can be quantified by the mutual information MI(X : (Y, Z)). Decomposing MI(X : (Y, Z)) into shared information, unique information and synergistic information leads to four terms, as:

M I (X : (Y, Z)) = S I (X : Y; Z) + U I (X : Y \ Z) + U I (X : Z \ Y) + C I (X : Y; Z) .

(1)

The interpretation of the four terms as information quantities demands that they should all be non-negative. Furthermore, it suggests that the following identities also hold:

\begin{array}{l} M I (X : Y) = S I (X : Y; Z) + U I (X : Y \ Z), \\ M I (X : Z) = S I (X : Y; Z) + U I (X : Z \ Y) . \end{array}

(2)

In the following, when we talk about a bivariate information decomposition, we mean a set of three functions SI, UI and CI that satisfy (1) and (2).

Combining the three equalities in (1) and (2) and using the chain rule of mutual information:

M I (X : (Y, Z)) = M I (X : Y) + M I (X : Z ∣ Y)

yields the identity:

\begin{array}{l} CoI (X; Y; Z) := MI (X : Y) - MI (X : Y ∣ Z) \\ = MI (X : Y) + MI (X : Z) - MI (X : (Y, Z)) = SI (X : Y; Z) - CI (X : Y; Z), \end{array}

(3)

which identifies the co-information with the difference of shared information and synergistic information (the co-information was originally called interaction information in [1], albeit with a different sign in the case of an odd number of variables). It has been known for a long time that a positive co-information is a sign of redundancy, while a negative co-information expresses synergy [1]. However, despite many attempts, there is so far no fully satisfactory way to separate the redundant and synergistic contributions to the co-information, and a fully satisfying definition of the function UI is also still missing. Observe that, since we have three equations ((1) and (2)) relating the four quantities SI(X : Y; Z), UI(X : Y \ Z), UI(X : Z \ Y) and CI(X : Y; Z), it suffices to specify one of them to compute the others. When defining a solution for the unique information UI, this leads to the consistency equation:

M I (X : Z) + U I (X : Y \ Z) = M I (X : Y) + U I (X : Z \ Y) .

(4)

The value of (4) can be interpreted as the union information, that is, the union of the information contained in Y and in Z without the synergy.
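As a small illustration of Equations (1) to (3), the following sketch evaluates the mutual informations and the co-information for the XOR example from the beginning of this section. It is only a numerical check under stated assumptions: the helper names `entropy` and `mi` and the dictionary encoding of the joint distribution are hypothetical choices made for this example, not part of the original text.

```python
from math import log2
from collections import defaultdict

def entropy(p, idx):
    """Shannon entropy (bits) of the marginal of p on the coordinates in idx."""
    m = defaultdict(float)
    for outcome, prob in p.items():
        m[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * log2(q) for q in m.values() if q > 0)

def mi(p, a, b):
    """Mutual information MI(A : B) in bits; a and b are tuples of coordinates."""
    return entropy(p, a) + entropy(p, b) - entropy(p, a + b)

# Coordinates of the outcome tuples (x, y, z).
X, Y, Z = (0,), (1,), (2,)

# Y and Z are independent uniform bits and X = Y XOR Z.
p_xor = {(y ^ z, y, z): 0.25 for y in (0, 1) for z in (0, 1)}

mi_xy = mi(p_xor, X, Y)        # MI(X : Y)
mi_xz = mi(p_xor, X, Z)        # MI(X : Z)
mi_xyz = mi(p_xor, X, Y + Z)   # MI(X : (Y, Z))
coi = mi_xy + mi_xz - mi_xyz   # co-information, second expression in (3)
print(mi_xy, mi_xz, mi_xyz, coi)
```

Here MI(X : Y) and MI(X : Z) both vanish, so (2) together with non-negativity forces SI = UI = 0 in any bivariate information decomposition, and (1) then assigns the whole bit MI(X : (Y, Z)) to CI. The co-information is accordingly negative.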

The problem of separating the contributions of shared information and synergistic information to the co-information is probably as old as the definition of co-information itself. Nevertheless, the co-information has been widely used as a measure of synergy in the neurosciences; see, for example, [2,3] and the references therein. The first general attempt to construct a consistent information decomposition into terms corresponding to different combinations of shared and synergistic information is due to Williams and Beer [4]. See also the references in [4] for other approaches to study multivariate information. While the general approach of [4] is intriguing, the proposed measure of shared information I_min suffers from serious flaws, which prompted a series of other papers trying to improve these results [5–7].

In our current contribution, we propose to define the unique information as follows: Let Δ be the set of all joint distributions of X, Y and Z. Define:

Δ_{P} = {Q \in Δ : Q (X = x, Y = y) = P (X = x, Y = y) and Q (X = x, Z = z) = P (X = x, Z = z) for all x \in X, y \in Y, z \in Z}

as the set of all joint distributions that have the same marginal distributions on the pairs (X, Y ) and (X, Z). Then, we define:

\tilde{U I} (X : Y \ Z) = min_{Q \in Δ_{P}} M I_{Q} (X : Y ∣ Z),

where MI_Q(X : Y |Z) denotes the conditional mutual information of X and Y given Z, computed with respect to the joint distribution Q. Observe that:

\begin{array}{l} MI (X : Z) + \tilde{UI} (X : Y \ Z) = \min_{Q \in Δ_{P}} (MI (X : Z) + MI_{Q} (X : Y ∣ Z)) \\ = \min_{Q \in Δ_{P}} (MI (X : Y) + MI_{Q} (X : Z ∣ Y)) = MI (X : Y) + \tilde{UI} (X : Z \ Y), \end{array}

where the chain rule of mutual information was used. Hence, $\tilde{U I}$ satisfies the consistency Condition (4), and we can use (2) and (3) to define corresponding functions $\tilde{S I}$ and $\tilde{C I}$ , such that $\tilde{U I}, \tilde{S I}$ and $\tilde{C I}$ form a bivariate information decomposition. $\tilde{S I}$ and $\tilde{C I}$ are given by:

\begin{array}{l} \tilde{S I} (X : Y; Z) = max_{Q \in Δ_{P}} C o I_{Q} (X; Y; Z), \\ \tilde{C I} (X : Y; Z) = M I (X : (Y, Z)) - min_{Q \in Δ_{P}} M I_{Q} (X : (Y, Z)) . \end{array}

In Section 3, we show that $\tilde{U I}, \tilde{S I}$ and $\tilde{C I}$ are non-negative, and we study further properties. In Appendix A.1, we describe Δ_P, in terms of a parametrization.
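To make the definition of $\tilde{U I}$ concrete, here is a minimal sketch that approximates the minimum of MI_Q(X : Y | Z) over Δ_P by brute-force grid search. It assumes binary Y and Z, in which case each conditional distribution Q(Y, Z | X = x) with fixed marginals has a single free parameter t = Q(Y = 0, Z = 0 | X = x). The function name `ui_tilde_grid`, the grid resolution and the dictionary encoding are assumptions made for illustration; grid search is a stand-in here and is not the method used in the paper.

```python
from itertools import product
from math import log2
from collections import defaultdict

def entropy(p, idx):
    """Shannon entropy (bits) of the marginal of p on the coordinates in idx."""
    m = defaultdict(float)
    for outcome, prob in p.items():
        m[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * log2(q) for q in m.values() if q > 0)

def cmi(p, a, b, c):
    """Conditional mutual information MI(A : B | C) in bits."""
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

def ui_tilde_grid(p, steps=20):
    """Grid approximation of min over Delta_P of MI_Q(X : Y | Z), binary Y and Z only."""
    px, py0, pz0 = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x, y, z), prob in p.items():
        px[x] += prob
        if y == 0:
            py0[x] += prob
        if z == 0:
            pz0[x] += prob
    params = []  # per x: (x, a, b, grid) with a = Q(Y=0|x), b = Q(Z=0|x)
    for x in px:
        if px[x] <= 0:
            continue
        a, b = py0[x] / px[x], pz0[x] / px[x]
        lo, hi = max(0.0, a + b - 1.0), min(a, b)  # feasible range of t
        grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
        params.append((x, a, b, grid))
    best = float("inf")
    for ts in product(*(g for _, _, _, g in params)):
        q = {}
        for (x, a, b, _), t in zip(params, ts):
            cond = {(0, 0): t, (0, 1): a - t, (1, 0): b - t, (1, 1): 1.0 - a - b + t}
            for (y, z), c in cond.items():
                q[(x, y, z)] = px[x] * max(c, 0.0)  # clip tiny negative rounding errors
        best = min(best, cmi(q, (0,), (1,), (2,)))
    return best

# X = Y XOR Z with independent uniform bits Y, Z.
p_xor = {(y ^ z, y, z): 0.25 for y in (0, 1) for z in (0, 1)}
# X = Y, and Z is an independent uniform bit.
p_copy = {(y, y, z): 0.25 for y in (0, 1) for z in (0, 1)}

ui_xor = ui_tilde_grid(p_xor)
ui_copy = ui_tilde_grid(p_copy)
print(ui_xor, ui_copy)
```

For the XOR distribution, Δ_P contains the uniform distribution, under which X is independent of (Y, Z), so the minimum is zero. For the copy distribution, Δ_P consists of P alone and the minimum equals MI(X : Y | Z) = H(X | Z) = 1 bit.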

Our approach is motivated by the idea that unique and shared information should only depend on the marginal distribution of the pairs (X, Z) and (X, Y ). This idea can be explained from an operational interpretation of unique information: Namely, if Y has unique information about X (with respect to Z), then there must be some way to exploit this information. More precisely, there must be a situation in which Y can use this information to perform better at predicting the outcome of X. We make this idea precise in Section 2 and show how it naturally leads to the definitions of $\tilde{U I}, \tilde{S I}$ and $\tilde{C I}$ as given above. Section 3 contains basic properties of these three functions. In particular, Lemma 5 shows that all three functions are non-negative. Corollary 7 proves that the interpretation of $\tilde{U I}$ as unique information is consistent with the operational idea put forward in Section 2. In Section 4, we compare our functions with other proposed information decompositions. Some examples are studied in Section 5. Remaining open problems are discussed in Section 6. Appendices A.1 to A.3 discuss more technical aspects that help to compute $\tilde{U I}, \tilde{S I}$ and $\tilde{C I}$ .

After submitting our manuscript, we learned that the authors of [5] had changed their definitions in Version 5 of their preprint. Using a different motivation, they define a multivariate measure of union information that leads to the same bivariate information decomposition.

2. Operational Interpretation

Our basic idea to characterize unique information is the following: If Y has unique information about X with respect to Z, then there must be some way to exploit this information. That is, there must be a situation in which this unique information is useful. We formalize this idea in terms of decision problems as follows:

Let X, Y, Z be three random variables; let p be the marginal distribution of X, and let $κ \in K(\mathcal{X}; \mathcal{Y})$ and $μ \in K(\mathcal{X}; \mathcal{Z})$ be (row) stochastic matrices describing the conditional distributions of Y and Z, respectively, given X. In other words, p, κ and μ satisfy:

P (X = x, Y = y) = p (x) κ (x; y) and P (X = x, Z = z) = p (x) μ (x; z) .

Observe that, if p(x) > 0, then κ(x; y) and μ(x; z) are uniquely defined. Otherwise, κ(x; y) and μ(x; z) can be chosen arbitrarily. In this section, we will assume that the random variable X has full support. If this is not the case, our discussion will remain valid after replacing $\mathcal{X}$ by the support of X. In fact, the information quantities that we consider later will not depend on those matrix elements κ(x; y) and μ(x; z) that are not uniquely defined.

Suppose that an agent has a finite set of possible actions $\mathcal{A}$. After the agent chooses her action $a \in \mathcal{A}$, she receives a reward u(x, a), which not only depends on the chosen action $a \in \mathcal{A}$, but also on the value $x \in \mathcal{X}$ of the random variable X. The tuple (p, $\mathcal{A}$, u), consisting of the prior distribution p, the set of possible actions $\mathcal{A}$ and the reward function u, is called a decision problem. If the agent can observe the value x of X before choosing her action, her best strategy is to choose a in such a way that u(x, a) = $\max_{a' \in \mathcal{A}}$ u(x, a′). Suppose now that the agent cannot observe X directly, but the agent knows the probability distribution p of X. Moreover, the agent observes another random variable Y, with conditional distribution described by the row-stochastic matrix $κ \in K(\mathcal{X}; \mathcal{Y})$. In this context, κ will also be called a channel from $\mathcal{X}$ to $\mathcal{Y}$. When using a channel κ, the agent’s optimal strategy is to choose her action in such a way that her expected reward:

\sum_{x \in X} P (X = x ∣ Y = y) u (x, a) = \frac{\sum_{x \in X} p (x) κ (x; y) u (x, a)}{\sum_{x \in X} p (x) κ (x; y)}

(5)

is maximal. Note that, in order to maximize (5), the agent has to know (or estimate) the prior distribution of X, as well as the channel κ. Often, the agent is allowed to play a stochastic strategy. However, in the present setting, the agent cannot increase her expected reward by randomizing her actions; and therefore, we only consider deterministic strategies here.

Let R(κ, p, u, y) be the maximum of (5) (over $a \in \mathcal{A}$), and let:

R (κ, p, u) = \sum_{y \in Y} P (Y = y) R (κ, p, u, y) .

be the maximal expected reward that the agent can achieve by always choosing the optimal action.
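The quantity R(κ, p, u) can be evaluated directly from its definition. The sketch below uses the fact that P(Y = y) times expression (5) equals the sum over x of p(x) κ(x; y) u(x, a), so the normalization cancels. The function name `max_expected_reward`, the nested-list encoding of matrices and the three example channels are assumptions chosen for this illustration.

```python
def max_expected_reward(kappa, p, u):
    """R(kappa, p, u) = sum_y max_a sum_x p(x) kappa(x; y) u(x, a).

    kappa[x][y]: row-stochastic channel, p[x]: prior of X, u[x][a]: reward.
    """
    n_x, n_y, n_a = len(p), len(kappa[0]), len(u[0])
    total = 0.0
    for y in range(n_y):
        # For each action a: P(Y = y) times the expected reward (5).
        scores = [
            sum(p[x] * kappa[x][y] * u[x][a] for x in range(n_x))
            for a in range(n_a)
        ]
        total += max(scores)  # the agent picks the best action for this observation
    return total

p = [0.5, 0.5]                       # uniform prior on a binary X
u = [[1.0, 0.0], [0.0, 1.0]]         # reward 1 for guessing X correctly
perfect = [[1.0, 0.0], [0.0, 1.0]]   # Y = X
noisy = [[0.9, 0.1], [0.1, 0.9]]     # binary symmetric channel, flip probability 0.1
blind = [[0.5, 0.5], [0.5, 0.5]]     # Y independent of X

r_perfect = max_expected_reward(perfect, p, u)
r_noisy = max_expected_reward(noisy, p, u)
r_blind = max_expected_reward(blind, p, u)
print(r_perfect, r_noisy, r_blind)
```

In this guessing problem the maximal expected reward is the probability of guessing X correctly from the observation, which decreases as the channel becomes noisier.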

In this setting, we make the following definition:

Definition 1

Let X, Y, Z be three random variables; let p be the marginal distribution of X, and let $κ \in K(\mathcal{X}; \mathcal{Y})$ and $μ \in K(\mathcal{X}; \mathcal{Z})$ be (row) stochastic matrices describing the conditional distributions of Y and Z, respectively, given X.

Y has unique information about X (with respect to Z), if there is a set $\mathcal{A}$ and a reward function $u \in \mathbb{R}^{\mathcal{X} \times \mathcal{A}}$ that satisfy R(κ, p, u) > R(μ, p, u).
Z has no unique information about X (with respect to Y), if for any set $\mathcal{A}$ and reward function $u \in \mathbb{R}^{\mathcal{X} \times \mathcal{A}}$, the inequality R(κ, p, u) ≥ R(μ, p, u) holds. In this situation we also say that Y knows everything that Z knows about X, and we write Y ⊒_X Z.

This operational idea allows one to decide when the unique information vanishes, but, unfortunately, does not allow one to quantify the unique information.

As shown recently in [8], the question whether or not Y ⊒_X Z does not depend on the prior distribution p (but just on the support of p, which we assume to be $\mathcal{X}$). In fact, if p has full support, then, in order to check whether Y ⊒_X Z, it suffices to know the stochastic matrices κ, μ representing the conditional distributions of Y and Z, given X.

Consider the case $\mathcal{Y} = \mathcal{Z}$ and $κ = μ \in K(\mathcal{X}; \mathcal{Y})$, i.e., Y and Z use the same channel. In this case, Y has no unique information with respect to Z, and Z has no unique information with respect to Y. Hence, in the decomposition (1), only the shared information and the synergistic information may be larger than zero. The shared information may be computed from:

S I (X : Y; Z) = M I (X : Y) - U I (X : Y \ Z) = M I (X : Y) = M I (X : Z);

and so the synergistic information is:

C I (X : Y; Z) = M I (X : (Y, Z)) - S I (X : Y; Z) = M I (X : (Y, Z)) - M I (X : Y) .

Observe that in this case, the shared information can be computed from the marginal distribution of X and Y. Only the synergistic information depends on the joint distribution of X, Y and Z.

We argue that this should be the case in general: by what was said above, whether the unique information UI(X : Y \ Z) is greater than zero only depends on the two channels κ and μ. Even more is true: the set of those decision problems (p, $\mathcal{A}$, u) that satisfy R(κ, p, u) > R(μ, p, u) only depends on κ and μ (and the support of p). To quantify the unique information, this set of decision problems must be measured in some way. It is reasonable to expect that this quantification can be achieved by taking into account only the marginal distribution p of X. Therefore, we believe that a sensible measure UI for unique information should satisfy the following property:

UI(X : Y \ Z) only depends on p, κ and μ.

(*)

Although this condition seems to have not been considered before, many candidate measures of unique information satisfy this property; for example, those defined in [4,6]. In the following, we explore the consequences of Assumption (*).

Lemma 2

Under Assumption (*), the shared information only depends on p, κ and μ.

Proof

This follows from SI(X : Y ; Z) = MI(X : Y ) − UI(X : Y \ Z).

Let Δ be the set of all joint distributions of X, Y and Z. Fix P ∈ Δ, and assume that the marginal distribution of X, denoted by p, has full support. Let:

Δ_{P} = {Q \in Δ : Q (X = x, Y = y) = P (X = x, Y = y) and Q (X = x, Z = z) = P (X = x, Z = z) for all x \in X, y \in Y, z \in Z}

be the set of all joint distributions that have the same marginal distributions on the pairs (X, Y ) and (X, Z), and let:

Δ_{P}^{*} = {Q \in Δ_{P} : Q (x) > 0 for all x \in X}

be the subset of distributions with full support. Lemma 2 says that, under Assumption (*), the functions UI(X : Y \ Z), UI(X : Z \ Y ) and SI(X : Y ; Z) are constant on $Δ_{P}^{*}$ , and only CI(X : Y ; Z) depends on the joint distribution $Q \in Δ_{P}^{*}$ . If we further assume continuity, the same statement holds true for all Q ∈ Δ_P. To make clear that we now consider the synergistic information and the mutual information as a function of the joint distribution Q ∈ Δ we write CI_Q(X : Y ; Z) and MI_Q(X : (Y, Z)) in the following; and we omit this subscript, if these information theoretic quantities are computed with respect to the “true” joint distribution P.

Consider the following functions that we defined in Section 1:

\begin{array}{l} \tilde{U I} (X : Y \ Z) = min_{Q \in Δ_{P}} M I_{Q} (X : Y ∣ Z), \\ \tilde{U I} (X : Z \ Y) = min_{Q \in Δ_{P}} M I_{Q} (X : Z ∣ Y), \\ \tilde{S I} (X : Y; Z) = max_{Q \in Δ_{P}} C o I_{Q} (X; Y; Z), \\ \tilde{C I} (X : Y; Z) = M I (X : (Y, Z)) - min_{Q \in Δ_{P}} M I_{Q} (X : (Y, Z)) . \end{array}

These minima and maxima are well-defined, since Δ_P is compact and since mutual information and co-information are continuous functions. The next lemma says that, under Assumption (*), the quantities $\tilde{U I}, \tilde{S I}$ and $\tilde{C I}$ bound the unique, shared and synergistic information.

Lemma 3

Let UI(X : Y \ Z), UI(X : Z \ Y ), SI(X : Y ; Z) and CI(X : Y ; Z) be non-negative continuous functions on Δ satisfying Equations (1) and (2) and Assumption (*). Then:

\begin{array}{l} U I (X : Y \ Z) \leq \tilde{U I} (X : Y \ Z), \\ U I (X : Z \ Y) \leq \tilde{U I} (X : Z \ Y), \\ S I (X : Y; Z) \geq \tilde{S I} (X : Y; Z), \\ C I (X : Y; Z) \geq \tilde{C I} (X : Y; Z) . \end{array}

If P ∈ Δ and if there exists Q ∈ Δ_P that satisfies CI_Q(X : Y ; Z) = 0, then equality holds in all four inequalities. Conversely, if equality holds in one of the inequalities for a joint distribution P ∈ Δ, then there exists Q ∈ Δ_P satisfying CI_Q(X : Y ; Z) = 0.

Proof

Fix a joint distribution P ∈ Δ. By Lemma 2, Assumption (*) and continuity, the functions UI_Q(X : Y \ Z), UI_Q(X : Z \ Y ) and SI_Q(X : Y ; Z) are constant on Δ_P, and only CI_Q(X : Y ; Z), depends on Q ∈ Δ_P. The decomposition (1) rewrites to:

C I_{Q} (X : Y; Z) = M I_{Q} (X : (Y, Z)) - U I (X : Y \ Z) - U I (X : Z \ Y) - S I (X : Y; Z) .

(6)

Using the non-negativity of synergistic information, this implies:

U I (X : Y \ Z) + U I (X : Z \ Y) + S I (X : Y; Z) \leq min_{Q \in Δ_{P}} M I_{Q} (X : (Y, Z)) .

Choosing P in (6) and applying this last inequality shows:

C I (X : Y; Z) \geq M I (X : (Y, Z)) - min_{Q \in Δ_{P}} M I_{Q} (X : (Y, Z)) = \tilde{C I} (X : Y; Z) .

The chain rule of mutual information says:

M I_{Q} (X : (Y, Z)) = M I_{Q} (X : Z) + M I_{Q} (X : Y ∣ Z) .

Now, Q ∈ Δ_P implies MI_Q(X : Z) = MI(X : Z), and therefore,

\tilde{C I} (X : Y; Z) = M I (X : Y ∣ Z) - min_{Q \in Δ_{P}} M I_{Q} (X : Y ∣ Z) .

Moreover,

M I_{Q} (X : Y ∣ Z) = H_{Q} (X ∣ Z) - H_{Q} (X ∣ Y, Z),

where H_Q(X|Z) = H(X|Z) for Q ∈ Δ_P, and so:

\tilde{C I} (X : Y; Z) = max_{Q \in Δ_{P}} H_{Q} (X ∣ Y, Z) - H (X ∣ Y, Z) .

By (3), the shared information satisfies:

\begin{array}{l} S I (X : Y; Z) = C I (X : Y; Z) + M I (X : Y) + M I (X : Z) - M I (X : (Y, Z)) \\ \geq \tilde{C I} (X : Y; Z) + M I (X : Y) + M I (X : Z) - M I (X : (Y, Z)) \\ = M I (X : Y) + M I (X : Z) - min_{Q \in Δ_{P}} M I_{Q} (X : (Y, Z)) \\ = max_{Q \in Δ_{P}} (M I_{Q} (X : Y) + M I_{Q} (X : Z) - M I_{Q} (X : (Y, Z))) \\ = max_{Q \in Δ_{P}} C o I_{Q} (X; Y; Z) = \tilde{S I} (X : Y; Z) . \end{array}

By (2), the unique information satisfies:

\begin{array}{l} UI (X : Y \ Z) = MI (X : Y) - SI (X : Y; Z) \\ \leq \min_{Q \in Δ_{P}} [MI_{Q} (X : (Y, Z)) - MI (X : Z)] \\ = \min_{Q \in Δ_{P}} MI_{Q} (X : Y ∣ Z) = \tilde{UI} (X : Y \ Z) . \end{array}

The inequality for UI(X : Z \ Y ) follows similarly.

If there exists Q₀ ∈ Δ_P satisfying $CI_{Q_{0}} (X : Y; Z) = 0$, then:

0 = C I_{Q_{0}} (X : Y; Z) \geq {\tilde{C I}}_{Q_{0}} (X : Y; Z) = M I (X : (Y, Z)) - min_{Q \in Δ_{P}} M I_{Q} (X : (Y, Z)) \geq 0.

Since $\tilde{S I}, \tilde{U I}$ and $\tilde{C I}$, as well as SI, UI and CI, form an information decomposition, it follows from (1) and (2) that all inequalities are equalities at Q₀. By Assumption (*), they are equalities for all Q ∈ Δ_P. Conversely, assume that one of the inequalities is tight for some P ∈ Δ. Then, for the same reason, all four inequalities hold with equality. By Assumption (*), the functions $\tilde{U I}$ and $\tilde{S I}$ are constant on Δ_P. Therefore, the inequalities are tight for all Q ∈ Δ_P. Now, if Q₀ ∈ Δ_P minimizes MI_Q(X : (Y, Z)) over Δ_P, then $C I_{Q_{0}} (X : Y; Z) = {\tilde{C I}}_{Q_{0}} (X : Y; Z) = 0$.

The proof of Lemma 3 shows that the optimization problems defining $\tilde{U I}, \tilde{S I}$ and $\tilde{C I}$ are in fact equivalent; that is, it suffices to solve one of them. Lemma 4 in Section 3 gives yet another formulation and shows that the optimization problems are convex.

In the following, we interpret $\tilde{U I}, \tilde{S I}$ and $\tilde{C I}$ as measures of unique, shared and complementary information. Under Assumption (*), Lemma 3 says that using the information decomposition given by $\tilde{U I}, \tilde{S I}, \tilde{C I}$ is equivalent to saying that in each set Δ_P there exists a probability distribution Q with vanishing synergistic information CI_Q(X : Y ; Z) = 0. In other words, $\tilde{U I}, \tilde{S I}$ and $\tilde{C I}$ are the only measures of unique, shared and complementary information that satisfy both (*) and the following property:

It is not possible to decide whether or not there is synergistic information when only the marginal distributions of (X, Y) and (X, Z) are known.

(**)

For any other combination of measures different from $\tilde{U I}, \tilde{S I}$ and $\tilde{C I}$ that satisfy Assumption (*), there are combinations of (p, μ, κ) for which the existence of non-vanishing complementary information can be deduced. Since complementary information should capture precisely the information that is carried by the joint dependencies between X, Y and Z, we find Assumption (**) natural, and we consider this observation as evidence in favor of our interpretation of $\tilde{U I}, \tilde{S I}$ and $\tilde{C I}$ .

3. Properties

3.1. Characterization and Positivity

The next lemma shows that the optimization problems involved in the definitions of $\tilde{U I}, \tilde{S I}$ and $\tilde{C I}$ are easy to solve numerically, in the sense that they are convex optimization problems on convex sets. As always, theory is easier than practice, as discussed in Example 31 in Appendix A.2.

Lemma 4

Let P ∈ Δ and Q_P ∈ Δ_P. The following conditions are equivalent:

MI_{Q_P}(X : Y | Z) = min_{Q ∈ Δ_P} MI_Q(X : Y | Z).
MI_{Q_P}(X : Z | Y) = min_{Q ∈ Δ_P} MI_Q(X : Z | Y).
MI_{Q_P}(X : (Y, Z)) = min_{Q ∈ Δ_P} MI_Q(X : (Y, Z)).
CoI_{Q_P}(X; Y; Z) = max_{Q ∈ Δ_P} CoI_Q(X; Y; Z).
H_{Q_P}(X | Y, Z) = max_{Q ∈ Δ_P} H_Q(X | Y, Z).

Moreover, the functions MI_Q(X : Y |Z), MI_Q(X : Z|Y ) and MI_Q(X : (Y, Z)) are convex on Δ_P ; and CoI_Q(X; Y ; Z) and H_Q(X|Y, Z) are concave. Therefore, for fixed P ∈ Δ, the set of all Q_P ∈ Δ_P satisfying any of these conditions is convex.

Proof

The conditional entropy H_Q(X|Y, Z) is a concave function on Δ; therefore, the set of maxima is convex. To show the equivalence of the five optimization problems and the convexity properties, it suffices to show that the difference of any two minimized functions and the sum of a minimized and a maximized function are constant on Δ_P. Except for H_Q(X|Y, Z), this follows from the proof of Lemma 3. For H_Q(X|Y, Z), this follows from the chain rule:

\begin{array}{l} M I_{Q} (X : (Y, Z)) = M I_{P} (X : Y) + M I_{Q} (X : Z ∣ Y) \\ = M I_{P} (X : Y) + H_{P} (X ∣ Y) - H_{Q} (X ∣ Y, Z) = H (X) - H_{Q} (X ∣ Y, Z) . \end{array}

The optimization problems mentioned in Lemma 4 will be studied more closely in the appendices.

Lemma 5 (Non-negativity)

$\tilde{U I}, \tilde{S I}$ and $\tilde{C I}$ are non-negative functions.

Proof

$\tilde{C I}$ is non-negative by definition. $\tilde{U I}$ is non-negative, because it is obtained by minimizing mutual information, which is non-negative.

Consider the real function:

Q_{0} (X = x, Y = y, Z = z) = \begin{cases} \frac{P (X = x, Y = y) P (X = x, Z = z)}{P (X = x)}, & if P (X = x) > 0, \\ 0, & else . \end{cases}

It is easy to check that Q₀ ∈ Δ_P. Moreover, with respect to Q₀, the two random variables Y and Z are conditionally independent given X, that is, $MI_{Q_{0}} (Y : Z ∣ X) = 0$. This implies:

C o I_{Q_{0}} (X; Y; Z) = M I_{Q_{0}} (Y : Z) - M I_{Q_{0}} (Y : Z ∣ X) = M I_{Q_{0}} (Y : Z) \geq 0.

Therefore, $\tilde{S I} (X : Y; Z) = \max_{Q \in Δ_{P}} CoI_{Q} (X; Y; Z) \geq CoI_{Q_{0}} (X; Y; Z) \geq 0$, showing that $\tilde{S I}$ is a non-negative function.

In general, the probability distribution Q₀ constructed in the proof of Lemma 5 does not satisfy the conditions of Lemma 4, i.e., it does not minimize MI_Q(X : Y |Z) over $Δ_{P}^{*}$ .
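The distribution Q₀ from the proof of Lemma 5 is easy to construct explicitly. The following sketch builds it for the AND gate (X = Y AND Z with independent uniform bits Y, Z) and checks that it has the same (X, Y) and (X, Z) marginals as P. The helper names `pair_marginal`, `build_q0` and `same_marginal` and the choice of the AND example are assumptions made for this illustration.

```python
from collections import defaultdict

def pair_marginal(p, i, j):
    """Marginal of the joint distribution p on coordinates i and j."""
    m = defaultdict(float)
    for outcome, prob in p.items():
        m[(outcome[i], outcome[j])] += prob
    return dict(m)

def build_q0(p):
    """Q0(x, y, z) = P(x, y) P(x, z) / P(x) wherever P(x) > 0."""
    pxy, pxz = pair_marginal(p, 0, 1), pair_marginal(p, 0, 2)
    px = defaultdict(float)
    for (x, _), prob in pxy.items():
        px[x] += prob
    q0 = {}
    for (x, y), a in pxy.items():
        for (x2, z), b in pxz.items():
            if x2 == x and px[x] > 0:
                q0[(x, y, z)] = a * b / px[x]
    return q0

def same_marginal(p, q, i, j, tol=1e-12):
    """True if p and q agree on the marginal of coordinates i and j."""
    mp, mq = pair_marginal(p, i, j), pair_marginal(q, i, j)
    return all(abs(mp.get(k, 0.0) - mq.get(k, 0.0)) <= tol for k in set(mp) | set(mq))

# X = Y AND Z with independent uniform bits Y, Z; outcomes are (x, y, z).
p_and = {(y & z, y, z): 0.25 for y in (0, 1) for z in (0, 1)}
q0 = build_q0(p_and)

# Membership in Delta_P: both pair marginals must agree with those of P.
in_delta_p = same_marginal(p_and, q0, 0, 1) and same_marginal(p_and, q0, 0, 2)
print(in_delta_p, q0)
```

Note that Q₀ assigns positive probability to the outcome (x, y, z) = (0, 1, 1), which is impossible under P. Distributions in Δ_P agree with P only on the two pair marginals, not on the joint behaviour of Y and Z.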

3.2. Vanishing Shared and Unique Information

In this section, we study when $\tilde{S I} = 0$ and when $\tilde{U I} = 0$ . In particular, in Corollary 7, we show that $\tilde{U I}$ conforms with the operational idea put forward in Section 2.

Lemma 6

$\tilde{U I} (X : Y \ Z)$ vanishes if and only if there exists a row-stochastic matrix $λ \in K(\mathcal{Z}; \mathcal{Y})$ that satisfies:

P (X = x, Y = y) = \sum_{z \in Z} P (X = x, Z = z) λ (z; y) .

Proof

If MI_Q(X : Y |Z) = 0 for some Q ∈ Δ_P, then X and Y are independent given Z with respect to Q. Therefore, there exists a stochastic matrix $λ \in K(\mathcal{Z}; \mathcal{Y})$ satisfying:

P (X = x, Y = y) = Q (X = x, Y = y) = \sum_{z \in Z} Q (X = x, Z = z) λ (z; y) = \sum_{z \in Z} P (X = x, Z = z) λ (z; y) .

Conversely, if such a matrix λ exists, then the equality:

Q (X = x, Y = y, Z = z) = P (X = x, Z = z) λ (z; y)

defines a probability distribution Q which lies in Δ_P. Then:

\tilde{U I} (X : Y \ Z) \leq M I_{Q} (X : Y ∣ Z) = 0.
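The condition of Lemma 6 can be verified numerically for a given candidate matrix λ. The sketch below only checks one candidate; finding a suitable λ in general is a linear feasibility problem that is not attempted here. The function name `factors_through`, the nested-list encoding and the example (Z a perfect copy of X, Y a noisy copy) are assumptions made for illustration.

```python
def factors_through(pxy, pxz, lam, tol=1e-12):
    """Check P(x, y) = sum_z P(x, z) lam[z][y] for all x and y.

    pxy[x][y] and pxz[x][z] are the pair marginals, lam[z][y] is row-stochastic.
    """
    n_x, n_y, n_z = len(pxy), len(pxy[0]), len(pxz[0])
    return all(
        abs(pxy[x][y] - sum(pxz[x][z] * lam[z][y] for z in range(n_z))) <= tol
        for x in range(n_x)
        for y in range(n_y)
    )

# X is a uniform bit, Z = X, and Y is X sent through a binary symmetric channel.
pxz = [[0.5, 0.0], [0.0, 0.5]]
pxy = [[0.45, 0.05], [0.05, 0.45]]

bsc = [[0.9, 0.1], [0.1, 0.9]]        # candidate lambda: the same noisy channel
identity = [[1.0, 0.0], [0.0, 1.0]]   # candidate lambda: no additional noise

ok_bsc = factors_through(pxy, pxz, bsc)
ok_identity = factors_through(pxy, pxz, identity)
print(ok_bsc, ok_identity)
```

Since the noisy channel is a valid λ in this example, Lemma 6 gives that Y has vanishing unique information $\tilde{U I} (X : Y \ Z)$ with respect to the perfect observation Z.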

The last result can be translated into the language of our motivational Section 2 and says that $\tilde{U I}$ is consistent with our operational idea of unique information:

Corollary 7

$\tilde{U I} (X : Z \ Y) = 0$ if and only if Z has no unique information about X with respect to Y (according to Definition 1).

Proof

We need to show that decision problems can be solved with the channel κ at least as well as with the channel μ if and only if μ = κλ for some stochastic matrix λ. This result is known as Blackwell’s theorem [9]; see also [8].

Corollary 8

Suppose that $\mathcal{Y} = \mathcal{Z}$ and that the marginal distributions of the pairs (X, Y) and (X, Z) are identical. Then:

\begin{array}{l} \tilde{U I} (X : Y \ Z) = \tilde{U I} (X : Z \ Y) = 0, \\ \tilde{S I} (X : Y; Z) = M I (X : Y) = M I (X : Z), \\ \tilde{C I} (X : Y; Z) = M I (X : Y ∣ Z) = M I (X : Z ∣ Y) . \end{array}

In particular, under Assumption (*), there is no unique information in this situation.

Proof

Apply Lemma 6 with the identity matrix in the place of λ.

Lemma 9

$\tilde{S I} (X : Y; Z) = 0$ if and only if $MI_{Q_{0}} (Y : Z) = 0$, where Q₀ ∈ Δ is the distribution constructed in the proof of Lemma 5.

The proof of the lemma will be given in Appendix A.3, since it relies on some technical results from Appendix A.2, where Δ_P is characterized and the critical equations corresponding to the optimization problems in Lemma 4 are computed.

Corollary 10

If both Y ⫫ Z |X and Y ⫫ Z, then $\tilde{S I} (X : Y; Z) = 0$ .

Proof

By assumption, P = Q₀. Thus, the statement follows from Lemma 9.

There are examples where Y ⫫ Z and $\tilde{S I} (X : Y; Z) \neq 0$ (see Example 30). Thus, independent random variables may have shared information. This fact has also been observed in other information decomposition frameworks; see [5,6].

3.3. The Bivariate PI Axioms

In [4], Williams and Beer proposed axioms that a measure of shared information should satisfy. We call these axioms the PI axioms after the partial information decomposition framework derived from these axioms in [4]. In fact, the PI axioms apply to a measure of shared information that is defined for arbitrarily many random variables, while our function $\tilde{S I}$ only measures the shared information of two random variables (about a third variable). The PI axioms are as follows:


1. The shared information of Y₁, …, Y_n about X is symmetric under permutations of Y₁, …, Y_n.	(symmetry)
2. The shared information of Y₁ about X is equal to MI(X : Y₁).	(self-redundancy)
3. The shared information of Y₁, …, Y_n about X is at most the shared information of Y₁, …, Y_{n−1} about X, with equality if Y_{n−1} is a function of Y_n.	(monotonicity)

Any measure SI of bivariate shared information that is consistent with the PI axioms must obviously satisfy the following two properties, which we call the bivariate PI axioms:


(A) SI(X : Y ; Z) = SI(X : Z; Y ).	(symmetry)
(B) SI(X : Y ; Z) ≤ MI(X : Y ), with equality if Y is a function of Z.	(bivariate monotonicity)

We do not claim that any function SI that satisfies (A) and (B) can be extended to a measure of multivariate shared information satisfying the PI axioms. In fact, such a claim is false, and as discussed in Section 6, our bivariate function $\tilde{S I}$ is not extendable in this way.

The following two lemmas show that $\tilde{S I}$ satisfies the bivariate PI axioms, and they show corresponding properties of $\tilde{U I}$ and $\tilde{C I}$ .

Lemma 11 (Symmetry)

\begin{matrix} \tilde{S I} (X : Y; Z) = \tilde{S I} (X : Z; Y), \\ \tilde{C I} (X : Y; Z) = \tilde{C I} (X : Z; Y), \\ M I (X : Z) + \tilde{U I} (X : Y \ Z) = M I (X : Y) + \tilde{U I} (X : Z \ Y) . \end{matrix}

Proof

The first two equalities follow since the definitions of $\tilde{S I}$ and $\tilde{C I}$ are symmetric in Y and Z. The third equality is the consistency condition (4), which was proved already in Section 1.

The following lemma is the inequality condition of the monotonicity axiom.

Lemma 12 (Bounds)

\begin{array}{l} \tilde{S I} (X : Y; Z) \leq M I (X : Y), \\ \tilde{C I} (X : Y; Z) \leq M I (X : Y ∣ Z), \\ \tilde{U I} (X : Y \ Z) \geq M I (X : Y) - M I (X : Z) . \end{array}

Proof

The first inequality follows from:

\tilde{S I} (X : Y; Z) = max_{Q \in Δ_{P}} C o I_{Q} (X; Y; Z) = max_{Q \in Δ_{P}} (M I (X : Y) - M I_{Q} (X : Y ∣ Z)) \leq M I (X : Y),

the second from:

\tilde{C I} (X : Y; Z) = M I (X : Y ∣ Z) - min_{Q \in Δ_{P}} M I_{Q} (X : Y ∣ Z),

using the chain rule again. The last inequality follows from the first inequality, Equality (2) and the symmetry of Lemma 11.

To finish the study of the bivariate PI axioms, only the equality condition in the monotonicity axiom is missing. We show that $\tilde{S I}$ satisfies $\tilde{S I} (X : Y; Z) = M I (X : Z)$ not only if Z is a deterministic function of Y, but also, more generally, when Z is independent of X given Y. In this case, Z can be interpreted as a stochastic function of Y, independent of X.

Lemma 13

If X is independent of Z given Y, then P solves the optimization problems of Lemma 4. In particular,

\begin{array}{l} \tilde{UI} (X : Y \ Z) = MI (X : Y ∣ Z), \\ \tilde{UI} (X : Z \ Y) = 0, \\ \tilde{SI} (X : Y; Z) = MI (X : Z), \\ \tilde{CI} (X : Y; Z) = 0. \end{array}

Proof

If X is independent of Z given Y, then:

M I (X : Z ∣ Y) = 0 \leq min_{Q \in Δ_{P}} M I_{Q} (X : Z ∣ Y),

so P minimizes MI_Q(X : Z|Y ) over Δ_P.

Remark 14

In fact, Lemma 13 can be generalized as follows: in any bivariate information decomposition, Equations (1) and (2) and the chain rule imply:

M I (X : Z ∣ Y) = M I (X : (Y, Z)) - M I (X : Y) = U I (X : Z \ Y) + C I (X : Y; Z) .

Therefore, if MI(X : Z|Y ) = 0, then UI(X : Z \ Y) = 0 = CI(X : Y ; Z).

3.4. Probability Distributions with Structure

In this section, we compute the values of $\tilde{S I}, \tilde{C I}$ and $\tilde{U I}$ for probability distributions with special structure. If two of the variables are identical, then $\tilde{C I} = 0$ as a consequence of Lemma 13 (see Corollaries 15 and 16). When X = (Y, Z), then the same is true (Proposition 18). Moreover, in this case, $\tilde{S I} ((Y, Z) : Y; Z) = M I (Y : Z)$ . This equation has been postulated as an additional axiom, called the identity axiom, in [6].

The following two results are corollaries to Lemma 13:

Corollary 15

\begin{array}{l} \tilde{C I} (X : Y; Y) = 0, \\ \tilde{S I} (X : Y; Y) = C o I (X; Y; Y) = M I (X : Y), \\ \tilde{U I} (X : Y \ Y) = 0. \end{array}

Proof

If Y = Z, then X is independent of Z, given Y.

Corollary 16

\begin{array}{l} \tilde{C I} (X : X; Z) = 0 \\ \tilde{S I} (X : X; Z) = C o I (X; X; Z) = M I (X : Z) - M I (X : Z ∣ X) = M I (X : Z), \\ \tilde{U I} (X : X \ Z) = M I (X : X ∣ Z) = H (X ∣ Z), \\ \tilde{U I} (X : Z \ X) = M I (X : Z ∣ X) = 0. \end{array}

Proof

If X = Y, then X is independent of Z, given Y.

Remark 17

Remark 14 implies that Corollaries 15 and 16 hold for any bivariate information decomposition.

Proposition 18 (Identity property)

Suppose that $\mathcal{X} = \mathcal{Y} \times \mathcal{Z}$, and X = (Y, Z). Then, P solves the optimization problems of Lemma 4. In particular,

\begin{array}{l} \tilde{C I} ((Y, Z) : Y; Z) = 0, \\ \tilde{S I} ((Y, Z) : Y; Z) = M I (Y : Z), \\ \tilde{U I} ((Y, Z) : Y \ Z) = H (Y ∣ Z), \\ \tilde{U I} ((Y, Z) : Z \ Y) = H (Z ∣ Y) . \end{array}

Proof

If X = (Y, Z), then, by Corollary 28 in the appendix, Δ_P = {P}, and therefore:

\tilde{S I} ((Y, Z) : Y; Z) = M I ((Y, Z) : Y) - M I ((Y, Z) : Y ∣ Z) = H (Y) - H (Y ∣ Z) = M I (Y : Z)

and:

\tilde{U I} ((Y, Z) : Y \ Z) = M I ((Y, Z) : Y ∣ Z) = H (Y ∣ Z),

and similarly for $\tilde{U I} ((Y, Z) : Z \ Y)$ .
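The identity property can be illustrated numerically. Because Δ_P = {P} when X = (Y, Z), no optimization is needed and all quantities can be evaluated at P itself. The sketch below assumes a hypothetical joint distribution `p_yz` of (Y, Z); the helper names are illustrative and not part of the paper.

```python
from collections import defaultdict
from math import log2

def entropy(p, idx):
    """Shannon entropy (bits) of the marginal of p on the coordinates in idx."""
    m = defaultdict(float)
    for outcome, pr in p.items():
        m[tuple(outcome[i] for i in idx)] += pr
    return -sum(pr * log2(pr) for pr in m.values() if pr > 0)

def mi(p, a, b, c=()):
    """Conditional mutual information MI(A : B | C) in bits."""
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

# Hypothetical joint distribution of (Y, Z); any pmf would do.
p_yz = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
# Outcomes are (x, y, z) with x = (y, z), that is, X is a copy of the pair.
P = {((y, z), y, z): pr for (y, z), pr in p_yz.items()}
X, Y, Z = (0,), (1,), (2,)

si = mi(P, X, Y) - mi(P, X, Y, Z)   # co-information evaluated at P
ui_y = mi(P, X, Y, Z)               # MI((Y, Z) : Y | Z)
ui_z = mi(P, X, Z, Y)               # MI((Y, Z) : Z | Y)
h_y_given_z = entropy(P, Y + Z) - entropy(P, Z)
h_z_given_y = entropy(P, Y + Z) - entropy(P, Y)

print(f"SI = {si:.4f}, MI(Y:Z) = {mi(P, Y, Z):.4f}")
print(f"UI_Y = {ui_y:.4f}, H(Y|Z) = {h_y_given_z:.4f}")
print(f"UI_Z = {ui_z:.4f}, H(Z|Y) = {h_z_given_y:.4f}")
```

Each printed pair should agree, matching the three non-trivial equalities of Proposition 18.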

The following lemma shows that $\tilde{U I}, \tilde{S I}$ and $\tilde{C I}$ are additive when considering systems that can be decomposed into independent subsystems.

Lemma 19

Let X₁, X₂, Y₁, Y₂, Z₁, Z₂ be random variables, and assume that (X₁, Y₁, Z₁) is independent of (X₂, Y₂, Z₂). Then:

\begin{array}{c} \tilde{S I} ((X_{1}, X_{2}) : (Y_{1}, Y_{2}); (Z_{1}, Z_{2})) = \tilde{S I} (X_{1} : Y_{1}; Z_{1}) + \tilde{S I} (X_{2} : Y_{2}; Z_{2}), \\ \tilde{C I} ((X_{1}, X_{2}) : (Y_{1}, Y_{2}); (Z_{1}, Z_{2})) = \tilde{C I} (X_{1} : Y_{1}; Z_{1}) + \tilde{C I} (X_{2} : Y_{2}; Z_{2}), \\ \tilde{U I} ((X_{1}, X_{2}) : (Y_{1}, Y_{2}) \ (Z_{1}, Z_{2})) = \tilde{U I} (X_{1} : Y_{1} \ Z_{1}) + \tilde{U I} (X_{2} : Y_{2} \ Z_{2}), \\ \tilde{U I} ((X_{1}, X_{2}) : (Z_{1}, Z_{2}) \ (Y_{1}, Y_{2})) = \tilde{U I} (X_{1} : Z_{1} \ Y_{1}) + \tilde{U I} (X_{2} : Z_{2} \ Y_{2}) . \end{array}

The proof of the last lemma is given in Appendix A.3.

4. Comparison with Other Measures

In this section, we compare our information decomposition using $\tilde{U I}, \tilde{S I}$ and $\tilde{C I}$ with similar functions proposed in other papers; in particular, the function I_min of [4] and the bivariate redundancy measure I_red of [6]. We do not repeat their definitions here, since they are rather technical.

The first observation is that both I_red and I_min satisfy Assumption (*). Therefore, $I_{red} \geq \tilde{S I}$ and $I_{min} \geq \tilde{S I}$ . According to [6], the value of I_min tends to be larger than the value of I_red, but there are some exceptions.

It is easy to find examples where I_min is unreasonably large [6,7]. It is much more difficult to distinguish I_red and $\tilde{S I}$ . In fact, in many special cases, I_red and $\tilde{S I}$ agree, as the following results show.

Theorem 20

I_red(X : Y ; Z) = 0 if and only if $\tilde{S I} (X : Y; Z) = 0$ .

The proof of the theorem builds on the following lemma:

Lemma 21

If both Y ⫫ Z |X and Y ⫫ Z, then I_red(X : Y ; Z) = 0.

The proof of the lemma is deferred to Appendix A.3.

Proof of Theorem 20

By Lemma 3, if I_red(X : Y ; Z) = 0, then $\tilde{S I} (X : Y; Z) = 0$ . Now assume that $\tilde{S I} (X : Y; Z) = 0$ . Since both $\tilde{S I}$ and I_red are constant on Δ_P, we may assume that P = Q₀; that is, we may assume that Y ⫫ Z |X. Then, Y ⫫ Z by Lemma 9. Therefore, Lemma 21 implies that I_red(X : Y ; Z) = 0.

Denote by UI_red the unique information defined from I_red and (2). Then:

Theorem 22

UI_red(X : Y \ Z) = 0 if and only if $\tilde{U I} (X : Y \ Z) = 0$ .

Proof

By Lemma 3, if $\tilde{U I}$ vanishes, then so does UI_red. Conversely, UI_red(X : Y \ Z) = 0 if and only if I_red(X : Y ; Z) = MI(X : Y ). By Equation (20) in [6], this is equivalent to $p (x ∣ y) = p_{y ↘ Z} (x)$ for all x, y. In this case, $p (x ∣ y) = \sum_{z \in \mathcal{Z}} p (x ∣ z) λ (z; y)$ for some λ(z; y) with $\sum_{z \in \mathcal{Z}} λ (z; y) = 1$ . Hence, Lemma 6 implies that $\tilde{U I} (X : Y \ Z) = 0$ .

Theorem 22 implies that I_red does not contradict our operational ideas introduced in Section 2.

Corollary 23

Suppose that one of the following conditions is satisfied:

X is independent of Y given Z.
X is independent of Z given Y.
$\mathcal{X} = \mathcal{Y} \times \mathcal{Z}$, and X = (Y, Z).

Then, $I_{red} (X : Y; Z) = \tilde{S I} (X : Y; Z)$ .

Proof

If X is independent of Z, given Y, then, by Remark 14, for any bivariate information decomposition, UI(X : Z \ Y) = 0. In particular, $\tilde{U I} (X : Z \ Y) = 0 = U I_{red} (X : Z \ Y)$ (compare also Lemma 13 and Theorem 22). Therefore, $I_{red} (X : Y; Z) = \tilde{S I} (X : Y; Z)$ . If $\mathcal{X} = \mathcal{Y} \times \mathcal{Z}$ and X = (Y, Z), then $\tilde{S I} (X : Y; Z) = M I (Y : Z) = I_{red} (X : Y; Z)$ by Proposition 18 and the identity axiom in [6].

Corollary 24

If the two pairs (X, Y ) and (X, Z) have the same marginal distribution, then $I_{red} (X : Y; Z) = \tilde{S I} (X : Y; Z)$ .

Proof

In this case, $\tilde{U I} (X : Y \ Z) = 0 = U I_{red} (X : Y \ Z)$ .

Although $\tilde{S I}$ and I_red often agree, they are different functions. An example where $\tilde{S I}$ and I_red have different values is the dice example given at the end of the next section. In particular, it follows that I_red does not satisfy Property (**).

5. Examples

Table 1 contains the values of $\tilde{C I}$ and $\tilde{S I}$ for some paradigmatic examples. The list of examples is taken from [6]; see also [5]. In all these examples, $\tilde{S I}$ agrees with I_red. In particular, in these examples, the values of $\tilde{S I}$ agree with the intuitively plausible values called “expected values” in [6].

As a more complicated example, we treated the following system with two parameters λ ∈ [0, 1], α ∈ {1, 2, 3, 4, 5, 6}, also proposed by [6]. Let Y and Z be two dice, and define X = Y + αZ. To change the degree of dependence of the two dice, assume that they are distributed according to:

P (Y = i, Z = j) = \frac{λ}{36} + (1 - λ) \frac{δ_{i, j}}{6} .

For λ = 0, the two dice are completely correlated, while for λ = 1, they are independent. The values of $\tilde{S I}$ and I_red are shown in Figure 1. In fact, for α = 1, α = 5 and α = 6, the two functions agree. Moreover, they agree for λ = 0 and λ = 1. In all other cases, $\tilde{S I} < I_{red}$ , in agreement with Lemma 3. For α = 1 and α = 6 and λ = 0, the fact that $I_{red} = \tilde{S I}$ follows from the results in Section 4; in the other cases, we do not know a simple reason for this coincidence.
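The dice distribution is easy to construct explicitly. The sketch below (standard library only; the function name `dice` and the helpers are illustrative assumptions) builds the joint distribution of (X, Y, Z), confirms the two limiting cases of λ, and checks two structural facts that plausibly underlie the agreement for α = 1 and α = 6: for α = 1 the pairs (X, Y) and (X, Z) have the same marginal distribution (compare Corollary 24), and for α = 6 the sum X = Y + 6Z determines the pair (Y, Z) (compare Proposition 18). It does not compute $\tilde{S I}$ or I_red.

```python
from collections import defaultdict
from math import log2

def marginal(p, idx):
    """Marginal of the pmf p on the coordinates in idx."""
    m = defaultdict(float)
    for outcome, pr in p.items():
        m[tuple(outcome[i] for i in idx)] += pr
    return m

def entropy(p, idx):
    """Shannon entropy (bits) of the marginal of p on the coordinates in idx."""
    return -sum(pr * log2(pr) for pr in marginal(p, idx).values() if pr > 0)

def dice(lam, alpha):
    """Joint pmf of (X, Y, Z) with X = Y + alpha * Z for the two-dice example."""
    p = {}
    for i in range(1, 7):
        for j in range(1, 7):
            pr = lam / 36 + (1 - lam) * (1 / 6 if i == j else 0.0)
            if pr > 0:
                p[(i + alpha * j, i, j)] = pr
    return p

X, Y, Z = (0,), (1,), (2,)

def mi_yz(p):
    """Mutual information MI(Y : Z) in bits."""
    return entropy(p, Y) + entropy(p, Z) - entropy(p, Y + Z)

mi_correlated = mi_yz(dice(0.0, 2))    # lambda = 0: Y = Z
mi_independent = mi_yz(dice(1.0, 2))   # lambda = 1: independent dice

# alpha = 1: the pairs (X, Y) and (X, Z) have the same marginal distribution.
p1 = dice(0.3, 1)
m_xy, m_xz = marginal(p1, X + Y), marginal(p1, X + Z)
same_marginals = set(m_xy) == set(m_xz) and all(
    abs(m_xy[k] - m_xz[k]) < 1e-12 for k in m_xy)

# alpha = 6: X = Y + 6 Z determines (Y, Z), so H(X) = H(Y, Z).
p6 = dice(0.3, 6)
x_determines_pair = abs(entropy(p6, X) - entropy(p6, Y + Z)) < 1e-12

print(mi_correlated, mi_independent, same_marginals, x_determines_pair)
```

The value λ = 0.3 in the last two checks is arbitrary; any λ in [0, 1] should give the same outcome.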

It is interesting to note that for small λ and α > 1, the function $\tilde{S I}$ depends only weakly on α. In contrast, the dependence of I_red on α is stronger. At the moment, we do not have an argument that tells us which of these two behaviors is more intuitive.

6. Outlook

We defined a decomposition of the mutual information MI(X : (Y, Z)) of a random variable X with a pair of random variables (Y, Z) into non-negative terms that have an interpretation in terms of shared information, unique information and synergistic information. We have shown that the quantities $\tilde{S I}, \tilde{C I}$ and $\tilde{U I}$ have many properties that such a decomposition should intuitively fulfil; among them, the PI axioms and the identity axiom. It is a natural question whether the same can be done when further random variables are added to the system.

The first question in this context is what the decomposition of MI(X : Y₁, …, Y_n) should look like. How many terms do we need? In the bivariate case n = 2, many people agree that shared, unique and synergistic information should provide a complete decomposition (but, it may well be worth looking for other types of decompositions). For n > 2, there is no universal agreement of this kind.

Williams and Beer proposed a framework that suggests constructing an information decomposition only in terms of shared information [4]. Their ideas naturally lead to a decomposition according to a lattice, called the PI lattice. For example, in this framework, MI(X : Y₁, Y₂, Y₃) has to be decomposed into 18 terms with a well-defined interpretation. The approach is very appealing, since it is only based on very natural properties of shared information (the PI axioms) and the idea that all information can be “localized,” in the sense that, in an information decomposition, it suffices to classify information according to “who knows what,” that is, which information is shared by which subsystems.

Unfortunately, as shown in [10], our function $\tilde{S I}$ cannot be generalized to the case n = 3 in the framework of the PI lattice. The problem is that the identity axiom is incompatible with a non-negative decomposition according to the PI lattice.

Even though we currently cannot extend our decomposition to n > 2, our bivariate decomposition can be useful for the analysis of larger systems consisting of more than two parts. For example, the quantity:

\tilde{U I} (X : Y_{i} \ (Y_{1}, \dots, Y_{i - 1}, Y_{i + 1}, \dots, Y_{n}))

can still be interpreted as the unique information of Y_i about X with respect to all other variables, and it can be used to assess the value of the i-th variable, when synergistic contributions can be ignored. Furthermore, the measure has the intuitive property that the unique information cannot grow when additional variables are taken into account:

Lemma 25

$\tilde{U I} (X : Y \ (Z_{1}, \dots, Z_{k})) \geq \tilde{U I} (X : Y \ (Z_{1}, \dots, Z_{k + 1}))$ .

Proof

Let $P^{k}$ be the joint distribution of X, Y, Z₁, …, Z_k, and let $P^{k + 1}$ be the joint distribution of X, Y, Z₁, …, Z_k, $Z_{k + 1}$. By definition, $P^{k}$ is a marginal of $P^{k + 1}$. For any $Q \in Δ_{P^{k}}$, the distribution Q′ defined by:

Q^{'} (x, y, z_{1}, \dots, z_{k}, z_{k + 1}) : = \begin{cases} \frac{Q (x, y, z_{1}, \dots, z_{k}) P^{k + 1} (x, z_{1}, \dots, z_{k}, z_{k + 1})}{P^{k} (x, z_{1}, \dots, z_{k})}, & \text{if } P^{k} (x, z_{1}, \dots, z_{k}) > 0, \\ 0, & \text{else}, \end{cases}

lies in $Δ_{P^{k + 1}}$. Moreover, Q is the (X, Y, Z₁, …, Z_k)-marginal of Q′, and $Z_{k + 1}$ is independent of Y given X, Z₁, …, Z_k. Therefore,

\begin{array}{l} M I_{Q^{'}} (X : Y ∣ Z_{1}, \dots, Z_{k}, Z_{k + 1}) \leq M I_{Q^{'}} ((X, Z_{k + 1}) : Y ∣ Z_{1}, \dots, Z_{k}) \\ = M I_{Q^{'}} (X : Y ∣ Z_{1}, \dots, Z_{k}) + M I_{Q^{'}} (Z_{k + 1} : Y ∣ X, Z_{1}, \dots, Z_{k}) \\ = M I_{Q} (X : Y ∣ Z_{1}, \dots, Z_{k}) . \end{array}

The statement now follows by taking the minimum over $Q \in Δ_{P^{k}}$.
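The extension Q′ used in this proof can be checked on a small instance. The sketch below takes k = 1 with binary variables, draws a random distribution in the role of $P^{k+1}$, and uses one convenient element Q of the feasible set (the one that makes Y and Z₁ independent given X). These choices, and all names in the code, are illustrative assumptions rather than part of the paper.

```python
import random
from collections import defaultdict
from itertools import product
from math import log2

def marginal(p, idx):
    """Marginal of the pmf p on the coordinates in idx."""
    m = defaultdict(float)
    for outcome, pr in p.items():
        m[tuple(outcome[i] for i in idx)] += pr
    return m

def entropy(p, idx):
    return -sum(pr * log2(pr) for pr in marginal(p, idx).values() if pr > 0)

def mi(p, a, b, c=()):
    """Conditional mutual information MI(A : B | C) in bits."""
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

def close(m1, m2):
    return all(abs(m1[k] - m2[k]) < 1e-12 for k in set(m1) | set(m2))

X, Y, Z1, Z2 = (0,), (1,), (2,), (3,)
random.seed(2)
states = list(product(range(2), repeat=4))            # outcomes (x, y, z1, z2)
w = [random.random() + 0.1 for _ in states]
total = sum(w)
P_big = {s: wi / total for s, wi in zip(states, w)}   # role of P^{k+1}, k = 1
P_small = marginal(P_big, X + Y + Z1)                 # role of P^k

# One element Q of the feasible set for P^k: Y and Z1 independent given X.
p_x, p_xy, p_xz1 = marginal(P_small, X), marginal(P_small, X + Y), marginal(P_small, X + Z1)
Q = {(x, y, z1): p_xy[(x, y)] * p_xz1[(x, z1)] / p_x[(x,)]
     for (x, y, z1) in product(range(2), repeat=3)}

# Extension Q' as defined in the proof.
p_xz1z2 = marginal(P_big, X + Z1 + Z2)
Q_ext = {(x, y, z1, z2): Q[(x, y, z1)] * p_xz1z2[(x, z1, z2)] / p_xz1[(x, z1)]
         for (x, y, z1, z2) in states}

feasible = (close(marginal(Q_ext, X + Y), marginal(P_big, X + Y))
            and close(marginal(Q_ext, X + Z1 + Z2), p_xz1z2))
cond_indep = mi(Q_ext, Z2, Y, X + Z1)   # should vanish
lhs = mi(Q_ext, X, Y, Z1 + Z2)          # MI_{Q'}(X : Y | Z1, Z2)
rhs = mi(Q, X, Y, Z1)                   # MI_Q(X : Y | Z1)
print(feasible, cond_indep, lhs <= rhs + 1e-12)
```

The check confirms, for this one instance, that Q′ has the required pair marginals, that the extra variable is conditionally independent of Y, and that the conditional mutual information does not increase.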

Thus, we believe that our measure, which is well-motivated in operational terms, can serve as a good starting point towards a general decomposition of multi-variate information.

Acknowledgments

NB is supported by the Klaus Tschira Stiftung. JR acknowledges support from the VW Stiftung. EO has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 258749 (CEEDS) and No. 318723 (MatheMACS). We thank Malte Harder, Christoph Salge and Daniel Polani for fruitful discussions and for providing us with the data for I_red in Figure 1. We thank Ryan James and Michael Wibral for helpful comments on the manuscript.

Appendix: Computing $\tilde{U I}, \tilde{S I}$ and $\tilde{C I}$

A.1. The Optimization Domain Δ_P

By Lemma 4, to compute $\tilde{U I}, \tilde{S I}$ and $\tilde{C I}$ , we need to solve a convex optimization problem. In this section, we study some aspects of this problem.

First, we describe Δ_P. For any set $\mathcal{S}$, let $Δ (\mathcal{S})$ be the set of probability distributions on $\mathcal{S}$, and let A be the map $Δ \to Δ (\mathcal{X} \times \mathcal{Y}) \times Δ (\mathcal{X} \times \mathcal{Z})$ that takes a joint probability distribution of X, Y and Z and computes the marginal distributions of the pairs (X, Y ) and (X, Z). Then, A is a linear map, and Δ_P = (P + ker A) ∩ Δ. In particular, Δ_P is the intersection of an affine space and a simplex; hence, Δ_P is a polytope.

The matrix describing A (and denoted by the same symbol in the following) is a well-studied object. For example, A describes the graphical model associated with the graph Y—X—Z. The columns of A define a polytope, called the marginal polytope. Moreover, the kernel of A is known: let $δ_{x, y, z} \in \mathbb{R}^{\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}}$ be the characteristic function of the point $(x, y, z) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Z}$, and let:

γ_{x; y, y^{'}; z, z^{'}} = δ_{x, y, z} + δ_{x, y^{'}, z^{'}} - δ_{x, y^{'}, z} - δ_{x, y, z^{'}} .

Lemma 26

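The defect of A (that is, the dimension of ker A) is $| \mathcal{X} | (| \mathcal{Y} | - 1) (| \mathcal{Z} | - 1)$.

The vectors $γ_{x; y, y^{'}; z, z^{'}}$ for all $x \in \mathcal{X}$, $y, y^{'} \in \mathcal{Y}$ and $z, z^{'} \in \mathcal{Z}$ span ker A.
For any fixed $y_{0} \in \mathcal{Y}$, $z_{0} \in \mathcal{Z}$, the vectors $γ_{x; y_{0}, y; z_{0}, z}$ for all $x \in \mathcal{X}$, $y \in \mathcal{Y} \setminus \{y_{0}\}$ and $z \in \mathcal{Z} \setminus \{z_{0}\}$ form a basis of ker A.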

Proof

See [11].
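For small state spaces, the statements of Lemma 26 can be verified by exact linear algebra. The sketch below builds the marginalization matrix A for the assumed sizes |𝒳| = 2, |𝒴| = 3, |𝒵| = 2, computes its rank over the rationals, and checks the claimed basis. The function names and the chosen sizes are illustrative; this is a sanity check, not a substitute for the proof in [11].

```python
from fractions import Fraction
from itertools import product

def marginal_matrix(nx, ny, nz):
    """Matrix of the map Q -> ((X,Y)-marginal, (X,Z)-marginal); columns are states (x, y, z)."""
    cols = list(product(range(nx), range(ny), range(nz)))
    rows = ([("xy", x, y) for x in range(nx) for y in range(ny)]
            + [("xz", x, z) for x in range(nx) for z in range(nz)])
    A = [[Fraction(int((kind == "xy" and (x, y) == (a, b))
                       or (kind == "xz" and (x, z) == (a, b))))
          for (x, y, z) in cols] for (kind, a, b) in rows]
    return A, cols

def rank(M):
    """Rank via Gauss-Jordan elimination with exact rational arithmetic."""
    M = [row[:] for row in M]
    r = 0
    for c in range(len(M[0])):
        piv = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(len(M)):
            if i != r and M[i][c] != 0:
                f = M[i][c] / M[r][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == len(M):
            break
    return r

def gamma(cols, x, y, y2, z, z2):
    """Vector delta_{x,y,z} + delta_{x,y',z'} - delta_{x,y',z} - delta_{x,y,z'}."""
    g = {c: Fraction(0) for c in cols}
    g[(x, y, z)] += 1
    g[(x, y2, z2)] += 1
    g[(x, y2, z)] -= 1
    g[(x, y, z2)] -= 1
    return [g[c] for c in cols]

nx, ny, nz = 2, 3, 2
A, cols = marginal_matrix(nx, ny, nz)
defect = len(cols) - rank(A)

# Candidate basis with y0 = 0 and z0 = 0.
basis = [gamma(cols, x, 0, y, 0, z)
         for x in range(nx) for y in range(1, ny) for z in range(1, nz)]
in_kernel = all(sum(a * g for a, g in zip(row, vec)) == 0 for vec in basis for row in A)

print(defect, len(basis), rank(basis), in_kernel)
```

For these sizes the defect, the number of candidate basis vectors and their rank should all equal 2 · 2 · 1 = 4.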

The vectors $γ_{x; y, y^{'}; z, z^{'}}$ for different values of $x \in \mathcal{X}$ have disjoint supports. As the next lemma shows, this can be used to write Δ_P as a Cartesian product of simpler polytopes. Unfortunately, the function MI(X : (Y, Z)) does not respect this product structure. In fact, the diagonal directions are important (see Example 31 below).

Lemma 27

Let P ∈ Δ. For all $x \in \mathcal{X}$ with P(x) > 0 denote by:

Δ_{P, x} = \{Q \in Δ (\mathcal{Y} \times \mathcal{Z}) : Q (Y = y) = P (Y = y ∣ X = x) \text{ and } Q (Z = z) = P (Z = z ∣ X = x)\}

the set of those joint distributions of Y and Z with respect to which the marginal distributions of Y and Z agree with the conditional distributions of Y and Z, given X = x. Then, the map $π_{P} : Δ_{P} \to \prod_{x \in \mathcal{X} : P (x) > 0} Δ_{P, x}$ that maps each Q ∈ Δ_P to the family $(Q (\cdot ∣ X = x))_{x}$ of conditional distributions of Y and Z given X = x for those $x \in \mathcal{X}$ with P(X = x) > 0 is a linear bijection.

Proof

The image of π_P is contained in $\prod_{x \in \mathcal{X} : P (x) > 0} Δ_{P, x}$ by definition of Δ_P. The relation:

Q (X = x, Y = y, Z = z) = P (X = x) Q (Y = y, Z = z ∣ X = x)

shows that π_P is injective and surjective. Since π_P is in fact a linear map, the domain and the codomain of π_P are affinely equivalent.

Each Cartesian factor Δ_P,x of Δ_P is a fiber polytope of the independence model.

Corollary 28

If X = (Y, Z), then Δ_P = {P}.

Proof

By assumption, both conditional probability distributions P(Y |X = x) and P(Z|X = x) are supported on a single point. Therefore, each factor Δ_P,x consists of a single point; namely the conditional distribution P(Y, Z|X = x) of Y and Z, given X. Hence, Δ_P is a singleton.

A.2. The Critical Equations

Lemma 29

The derivative of MI_Q(X : (Y, Z)) in the direction $γ_{x; y, y^{'}; z, z^{'}}$ is:

log \frac{Q (x, y, z) Q (x, y^{'}, z^{'}) Q (y^{'}, z) Q (y, z^{'})}{Q (x, y^{'}, z) Q (x, y, z^{'}) Q (y, z) Q (y^{'}, z^{'})} .

Therefore, Q solves the optimization problems of Lemma 4 if and only if:

log \frac{Q (x, y, z) Q (x, y^{'}, z^{'}) Q (y^{'}, z) Q (y, z^{'})}{Q (x, y^{'}, z) Q (x, y, z^{'}) Q (y, z) Q (y^{'}, z^{'})} \geq 0

(7)

for all x, y, y′, z, z′ with $Q + ε γ_{x; y, y^{'}; z, z^{'}} \in Δ_{P}$ for ε > 0 small enough.

Proof

The proof is by direct computation.
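The derivative formula of Lemma 29 can be cross-checked against a central finite difference. The sketch below does this for one randomly drawn, strictly positive distribution on binary variables and the direction $γ_{0; 0, 1; 0, 1}$. Natural logarithms are used on both sides, and the function names, the random seed and the step size are illustrative assumptions.

```python
import random
from itertools import product
from math import log

def mi_x_yz(q):
    """MI_Q(X : (Y, Z)) in nats for a pmf q[(x, y, z)]."""
    px, pyz = {}, {}
    for (x, y, z), p in q.items():
        px[x] = px.get(x, 0.0) + p
        pyz[(y, z)] = pyz.get((y, z), 0.0) + p
    return sum(p * log(p / (px[x] * pyz[(y, z)]))
               for (x, y, z), p in q.items() if p > 0)

def shift(q, eps, x, y, y2, z, z2):
    """Return q + eps * gamma_{x; y, y'; z, z'}."""
    r = dict(q)
    r[(x, y, z)] += eps
    r[(x, y2, z2)] += eps
    r[(x, y2, z)] -= eps
    r[(x, y, z2)] -= eps
    return r

def lemma29(q, x, y, y2, z, z2):
    """Closed-form directional derivative as stated in Lemma 29."""
    def yz(b, c):
        return sum(p for (_, b2, c2), p in q.items() if (b2, c2) == (b, c))
    num = q[(x, y, z)] * q[(x, y2, z2)] * yz(y2, z) * yz(y, z2)
    den = q[(x, y2, z)] * q[(x, y, z2)] * yz(y, z) * yz(y2, z2)
    return log(num / den)

random.seed(1)
states = list(product(range(2), repeat=3))
w = [random.uniform(0.5, 1.5) for _ in states]   # strictly positive weights
total = sum(w)
Q = {s: wi / total for s, wi in zip(states, w)}

direction = (0, 0, 1, 0, 1)                      # x, y, y', z, z'
h = 1e-6
numeric = (mi_x_yz(shift(Q, h, *direction)) - mi_x_yz(shift(Q, -h, *direction))) / (2 * h)
analytic = lemma29(Q, *direction)
print(numeric, analytic)
```

The two printed values should agree to roughly six decimal places or better.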

Example 30 (The AND-example). Consider the binary case $\mathcal{X} = \mathcal{Y} = \mathcal{Z} = \{0, 1\}$; assume that Y and Z are independent and uniformly distributed, and suppose that X = Y AND Z. The underlying distribution P is uniformly distributed on the four states {000, 001, 010, 111}. In this case, $Δ_{P, 1} = \{δ_{Y = 1, Z = 1}\}$ is a singleton, and $Δ_{P, 0}$ consists of all probability distributions $Q_{α^{'}}$ of the form:

Q_{α^{'}} (Y = y, Z = z) = \begin{cases} \frac{1}{3} + α^{'}, & \text{if } (y, z) = (0, 0), \\ \frac{1}{3} - α^{'}, & \text{if } (y, z) = (0, 1), \\ \frac{1}{3} - α^{'}, & \text{if } (y, z) = (1, 0), \\ α^{'}, & \text{if } (y, z) = (1, 1), \end{cases}

for some $0 \leq α^{'} \leq \frac{1}{3}$ . Therefore, Δ_P is a one-dimensional polytope consisting of all probability distributions of the form:

Q_{α} (X = x, Y = y, Z = z) = \begin{cases} \frac{1}{4} + α, & \text{if } (x, y, z) = (0, 0, 0), \\ \frac{1}{4} - α, & \text{if } (x, y, z) = (0, 0, 1), \\ \frac{1}{4} - α, & \text{if } (x, y, z) = (0, 1, 0), \\ α, & \text{if } (x, y, z) = (0, 1, 1), \\ \frac{1}{4}, & \text{if } (x, y, z) = (1, 1, 1), \\ 0, & \text{else}, \end{cases}

for some $0 \leq α \leq \frac{1}{4}$ . To compute the minimum of $M I_{Q_{α}} (X : (Y, Z))$ over Δ_P, we compute the derivative with respect to α (which equals the directional derivative of MI_Q(X : (Y, Z)) in the direction $γ_{0; 0, 1; 0, 1}$ at Q_α) and obtain:

log \frac{(\frac{1}{4} + α) α}{{(\frac{1}{4} - α)}^{2}} \frac{{(\frac{1}{4} - α)}^{2}}{{(\frac{1}{4} + α)}^{2}} = log \frac{α}{\frac{1}{4} + α} .

Since $\frac{α}{\frac{1}{4} + α} < 1$ for all α > 0, the function MI_{Q_α} (X : (Y, Z)) has a unique minimum at $α = \frac{1}{4}$ . Therefore,

\begin{array}{l} \tilde{U I} (X : Y \ Z) = M I_{Q_{1 / 4}} (X : Y ∣ Z) = 0 = \tilde{U I} (X : Z \ Y), \\ \tilde{S I} (X : Y; Z) = C o I_{Q_{1 / 4}} (X; Y; Z) = M I_{Q_{1 / 4}} (X : Y) = \frac{3}{4} log \frac{4}{3}, \\ \tilde{C I} (X : Y; Z) = M I (X : (Y, Z)) - M I_{Q_{1 / 4}} (X : (Y, Z)) = \frac{1}{2} log 2. \end{array}

In other words, in the AND-example, there is no unique information, but only shared and synergistic information. This follows, of course, also from Corollary 8.
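The numbers of the AND-example can be reproduced with a one-dimensional scan over α. The sketch below evaluates MI_{Q_α}(X : (Y, Z)) on a grid, locates the minimizer and computes the resulting decomposition in bits. The grid resolution and all function names are illustrative assumptions; a grid scan is used here only because the feasible set is a line segment.

```python
from collections import defaultdict
from math import log2

def entropy(p, idx):
    """Shannon entropy (bits) of the marginal of p on the coordinates in idx."""
    m = defaultdict(float)
    for outcome, pr in p.items():
        m[tuple(outcome[i] for i in idx)] += pr
    return -sum(pr * log2(pr) for pr in m.values() if pr > 0)

def mi(p, a, b, c=()):
    """Conditional mutual information MI(A : B | C) in bits."""
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

def q_alpha(a):
    """The family Q_alpha from Example 30; alpha = 0 gives the AND distribution P."""
    return {(0, 0, 0): 0.25 + a, (0, 0, 1): 0.25 - a, (0, 1, 0): 0.25 - a,
            (0, 1, 1): a, (1, 1, 1): 0.25}

X, Y, Z = (0,), (1,), (2,)
alphas = [k / 400 for k in range(101)]            # grid on [0, 1/4]
values = [mi(q_alpha(a), X, Y + Z) for a in alphas]
best_alpha = alphas[values.index(min(values))]
decreasing = all(v1 > v2 for v1, v2 in zip(values, values[1:]))

Q = q_alpha(best_alpha)
P = q_alpha(0.0)
ui_y = mi(Q, X, Y, Z)
ui_z = mi(Q, X, Z, Y)
si = mi(Q, X, Y) - mi(Q, X, Y, Z)
ci = mi(P, X, Y + Z) - mi(Q, X, Y + Z)
print(best_alpha, ui_y, ui_z, round(si, 3), ci)
```

With logarithms to base 2, the shared information comes out as (3/4) log₂(4/3) ≈ 0.311 bit and the synergistic information as 1/2 bit, in agreement with the row for And in Table 1.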

Example 31. The optimization problems in Lemma 4 can be very ill-conditioned, in the sense that there are directions in which the function varies fast and other directions in which the function varies slowly. As an example, let P be the distribution of three i.i.d. uniform binary random variables. In this case, Δ_P is a square. Figure A.1 contains a heat map of CoI_Q on Δ_P, where Δ_P is parametrized by:

Q (a, b) = P + a γ_{0; 0, 1; 0, 1} + b γ_{1; 0, 1; 0, 1}, - \frac{1}{8} \leq a \leq \frac{1}{8}, - \frac{1}{8} \leq b \leq \frac{1}{8} .

Clearly, the function varies very little along one of the diagonals. In fact, along this diagonal, X is independent of (Y, Z), corresponding to a very low synergy.

Although, in this case, the optimizing probability distribution is unique, it can be difficult to find. For example, Mathematica’s function FindMinimum does not always find the true optimum out of the box (apparently, FindMinimum cannot make use of the convex structure in the presence of constraints) [12].

Figure A.1. The function CoI_Q in Example 31 (figure created with Mathematica [12]). Darker colors indicate larger values of CoI_Q. In this example, Δ_P is a square. The uniform distribution lies at the center of this square and is the maximum of CoI_Q. In the two dark corners, X is independent of Y and Z, and either Y = Z or Y = ¬Z. In the two light corners, Y and Z are independent, and either X = Y XOR Z or X = ¬(Y XOR Z).
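The parametrization of Example 31 can be evaluated directly at a few points of the square. The sketch below computes CoI_Q in bits at the center, at one interior point of the diagonal a = b, at a dark corner and at a light corner. The function names and the chosen sample points are illustrative assumptions.

```python
from collections import defaultdict
from itertools import product
from math import log2

def entropy(p, idx):
    """Shannon entropy (bits) of the marginal of p on the coordinates in idx."""
    m = defaultdict(float)
    for outcome, pr in p.items():
        m[tuple(outcome[i] for i in idx)] += pr
    return -sum(pr * log2(pr) for pr in m.values() if pr > 0)

def mi(p, a, b, c=()):
    """Conditional mutual information MI(A : B | C) in bits."""
    return entropy(p, a + c) + entropy(p, b + c) - entropy(p, a + b + c) - entropy(p, c)

def q_ab(a, b):
    """Q(a, b) = P + a * gamma_{0;0,1;0,1} + b * gamma_{1;0,1;0,1}, P uniform on {0,1}^3."""
    q = {s: 1 / 8 for s in product((0, 1), repeat=3)}
    for x, t in ((0, a), (1, b)):
        q[(x, 0, 0)] += t
        q[(x, 1, 1)] += t
        q[(x, 1, 0)] -= t
        q[(x, 0, 1)] -= t
    return q

X, Y, Z = (0,), (1,), (2,)

def coi(q):
    """Co-information CoI_Q(X; Y; Z) = MI_Q(X : Y) - MI_Q(X : Y | Z)."""
    return mi(q, X, Y) - mi(q, X, Y, Z)

coi_center = coi(q_ab(0.0, 0.0))             # uniform distribution
coi_diagonal = coi(q_ab(0.05, 0.05))         # a = b: X independent of (Y, Z)
coi_dark_corner = coi(q_ab(1 / 8, 1 / 8))    # Y = Z, X independent of both
coi_light_corner = coi(q_ab(1 / 8, -1 / 8))  # X = Y XOR Z
print(coi_center, coi_diagonal, coi_dark_corner, coi_light_corner)
```

In this sketch CoI_Q vanishes at the sampled points with a = b and equals −1 bit at the XOR corner, which matches the description of the dark and light corners in the caption of Figure A.1.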

A.3. Technical Proofs

Proof of Lemma 9

Since $\tilde{S I} (X : Y; Z) \geq C o I_{Q_{0}} (X; Y; Z) \geq 0$ , if $\tilde{S I} (X : Y; Z) = 0$ , then:

0 = C o I_{Q_{0}} (X; Y; Z) = M I_{Q_{0}} (Y : Z) - M I_{Q_{0}} (Y : Z ∣ X) = M I_{Q_{0}} (Y : Z) .

To show that $M I_{Q_{0}} (Y : Z) = 0$ is also sufficient, observe that:

Q_{0} (x, y, z) Q_{0} (x, y^{'}, z^{'}) = Q_{0} (x, y, z^{'}) Q_{0} (x, y^{'}, z),

by construction of Q₀, and that:

Q_{0} (y, z) Q_{0} (y^{'}, z^{'}) = Q_{0} (y, z^{'}) Q_{0} (y^{'}, z),

by the assumption that $M I_{Q_{0}} (Y : Z) = 0$. Therefore, by Lemma 29, all partial derivatives vanish at Q₀. Hence, Q₀ solves the optimization problems in Lemma 4, and $\tilde{S I} (X : Y; Z) = C o I_{Q_{0}} (X; Y; Z) = 0$ .

Proof of Lemma 19

Let Q₁ and Q₂ be solutions of the optimization problems from Lemma 4 for (X₁, Y₁, Z₁) and (X₂, Y₂, Z₂) in the place of (X, Y, Z), respectively. Consider the probability distribution Q defined by:

Q (x_{1}, x_{2}, y_{1}, y_{2}, z_{1}, z_{2}) = Q_{1} (x_{1}, y_{1}, z_{1}) Q_{2} (x_{2}, y_{2}, z_{2}) .

Since (X₁, Y₁, Z₁) is independent of (X₂, Y₂, Z₂) (under P), Q ∈ Δ_P. We show that Q solves the optimization problems from Lemma 4 for X = (X₁, X₂), Y = (Y₁, Y₂) and Z = (Z₁, Z₂). We use the notation from Appendices A.1 and A.2.

If $Q + ɛ γ_{x_{1} x_{2}; y_{1} y_{2}, y_{1}^{'} y_{2}^{'}; z_{1} z_{2}, z_{1}^{'} z_{2}^{'}} \in Δ_{Q}$ , then:

Q_{1} + ɛ γ_{x_{1}; y_{1}, y_{1}^{'}; z_{1}, z_{1}^{'}} \in Δ_{Q_{1}} and Q_{2} + ɛ γ_{x_{2}; y_{2}, y_{2}^{'}; z_{2}, z_{2}^{'}} \in Δ_{Q_{2}} .

Therefore, by Lemma 29,

\begin{array}{l} log \frac{Q (x_{1} x_{2}, y_{1} y_{2}, z_{1} z_{2}) Q (x_{1} x_{2}, y_{1}^{'} y_{2}^{'}, z_{1}^{'} z_{2}^{'})}{Q (x_{1} x_{2}, y_{1}^{'} y_{2}^{'}, z_{1} z_{2}) Q (x_{1} x_{2}, y_{1} y_{2}, z_{1}^{'} z_{2}^{'})} \frac{Q (y_{1}^{'} y_{2}^{'}, z_{1} z_{2}) Q (y_{1} y_{2}, z_{1}^{'} z_{2}^{'})}{Q (y_{1} y_{2}, z_{1} z_{2}) Q (y_{1}^{'} y_{2}^{'}, z_{1}^{'} z_{2}^{'})} \\ = log \frac{Q (x_{1}, y_{1}, z_{1}) Q (x_{1}, y_{1}^{'}, z_{1}^{'})}{Q (x_{1}, y_{1}^{'}, z_{1}) Q (x_{1}, y_{1}, z_{1}^{'})} \frac{Q (y_{1}^{'}, z_{1}) Q (y_{1}, z_{1}^{'})}{Q (y_{1}, z_{1}) Q (y_{1}^{'}, z_{1}^{'})} \\ + log \frac{Q (x_{2}, y_{2}, z_{2}) Q (x_{2}, y_{2}^{'}, z_{2}^{'})}{Q (x_{2}, y_{2}^{'}, z_{2}) Q (x_{2}, y_{2}, z_{2}^{'})} \frac{Q (y_{2}^{'}, z_{2}) Q (y_{2}, z_{2}^{'})}{Q (y_{2}, z_{2}) Q (y_{2}^{'}, z_{2}^{'})} \geq 0, \end{array}

and hence, again by Lemma 29, Q is a critical point and solves the optimization problems.

Proof of Lemma 21

We use the notation from [6]. The information divergence is jointly convex. Therefore, any critical point of the divergence restricted to a convex set is a global minimizer. Let $y \in \mathcal{Y}$. Then, it suffices to show: If P satisfies the two conditional independence statements, then the marginal distribution P_X of X is a critical point of D(P(·|y)||Q) for Q restricted to $C_{cl} (\langle Z \rangle_{X})$; for if this statement is true, then $P_{y ↘ Z} = P_{X}$; thus $I_{X}^{π} (Y ↘ Z) = 0$ , and finally, I_red(X : Y ; Z) = 0.

Let $z, z^{'} \in \mathcal{Z}$. The derivative of D(P(·|y)||Q) at Q = P_X in the direction P(X|z) − P(X|z′) is:

\sum_{x \in X} (P (x ∣ z) - P (x ∣ z^{'})) \frac{P (x ∣ y)}{P (x)} = \sum_{x \in X} (\frac{P (x, z) P (x, y)}{P (x) P (y) P (z)} - \frac{P (x, z^{'}) P (x, y)}{P (x) P (y) P (z^{'})}) .

Now, Y ⫫ Z |X implies:

\sum_{x \in X} \frac{P (x, z) P (x, y)}{P (x)} = P (y, z) and \sum_{x \in X} \frac{P (x, z^{'}) P (x, y)}{P (x)} = P (y, z^{'}),

and Y ⫫ Z implies P(y)P(z) = P(y, z) and P(y)P(z′) = P(y, z′). Together, this shows that P_X is a critical point.

Conflicts of Interest

The authors declare no conflict of interest.

Author’s Contribution

All authors contributed to the design of the research. The research was carried out by all authors, with main contributions by Bertschinger and Rauh. The manuscript was written by Rauh, Bertschinger and Olbrich. All authors read and approved the final manuscript.

References

1. McGill, W.J. Multivariate information transmission. Psychometrika 1954, 19, 97–116.
2. Schneidman, E.; Bialek, W.; Berry, M.J.I. Synergy, redundancy, and independence in population codes. J. Neurosci 2003, 23, 11539–11553.
3. Latham, P.E.; Nirenberg, S. Synergy, redundancy, and independence in population codes, revisited. J. Neurosci 2005, 25, 5195–5206.
4. Williams, P.; Beer, R. Nonnegative decomposition of multivariate information. arXiv:1004.2515v1. 2010.
5. Griffith, V.; Koch, C. Quantifying synergistic mutual information. arXiv:1205.4265. 2013.
6. Harder, M.; Salge, C.; Polani, D. Bivariate measure of redundant information. Phys. Rev. E 2013, 87, 012130.
7. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J. Shared information–new insights and problems in decomposing Information in Complex Systems. Proceedings of the ECCS 2012, Brussels, September 3–7 2012; pp. 251–269.
8. Bertschinger, N.; Rauh, J. The blackwell relation defines no lattice. arXiv:1401.3146. 2014.
9. Blackwell, D. Equivalent comparisons of experiments. Ann. Math. Stat 1953, 24, 265–272.
10. Rauh, J.; Bertschinger, N.; Olbrich, E.; Jost, J. Reconsidering unique information: Towards a multivariate information Decomposition. arXiv:1404.3146. 2014. Accepted for ISIT 2014.
11. Hoşten, S.; Sullivant, S. Gröbner bases and polyhedral geometry of reducible and cyclic Models. J. Combin. Theor. A 2002, 100, 277–301.
12. Wolfram Research, Inc, Mathematica, 8th ed; Wolfram Research Inc: Champaign, Illinois, USA, 2003.

Figure 1. The shared information measures $\tilde{S I}$ and I_red in the dice example depending on the correlation parameter λ. The figure on the right is reproduced from [6] (copyright 2013 by the American Physical Society). In both figures, the summation parameter α varies from one (uppermost line) to six (lowest line).


**Table 1.** The value of $\tilde{S I}$ in some examples. The note is a short explanation of the example; see [6] for the details.
| Example | $\tilde{C I}$ | $\tilde{S I}$ | I_min | Note |
|---|---|---|---|---|
| Rdn | 0 | 1 | 1 | X = Y = Z uniformly distributed |
| Unq | 0 | 0 | 1 | X = (Y, Z), Y, Z i.i.d. |
| Xor | 1 | 0 | 0 | X = Y XOR Z, Y, Z i.i.d. |
| And | 1/2 | 0.311 | 0.311 | X = Y AND Z, Y, Z i.i.d. |
| RdnXor | 1 | 1 | 1 | X = (Y₁ XOR Z₁, W), Y = (Y₁, W), Z = (Z₁, W), Y₁, Z₁, W i.i.d. |
| RdnUnqXor | 1 | 1 | 2 | X = (Y₁ XOR Z₁, (Y₂, Z₂), W), Y = (Y₁, Y₂, W), Z = (Z₁, Z₂, W), Y₁, Y₂, Z₁, Z₂, W i.i.d. |
| XorAnd | 1 | 1/2 | 1/2 | X = (Y XOR Z, Y AND Z), Y, Z i.i.d. |
| Copy | 0 | MI(Y : Z) | 1 | X = (Y, Z) |

© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
