On extractable shared information

We consider the problem of quantifying the information shared by a pair of random variables $X_{1},X_{2}$ about another variable $S$. We propose a new measure of shared information, called extractable shared information, that is left monotonic; that is, the information shared about $S$ is bounded from below by the information shared about $f(S)$ for any function $f$. We show that our measure leads to a new nonnegative decomposition of the mutual information $I(S;X_1X_2)$ into shared, complementary and unique components. We study properties of this decomposition and show that a left monotonic shared information is not compatible with a Blackwell interpretation of unique information. We also discuss whether it is possible to have a decomposition in which both shared and unique information are left monotonic.


I. INTRODUCTION
A series of recent papers have focused on the bivariate information decomposition problem [1]-[6]. Consider three random variables $S$, $X_1$, $X_2$ with finite alphabets $\mathcal{S}$, $\mathcal{X}_1$ and $\mathcal{X}_2$, respectively. The total information that the pair $(X_1,X_2)$ conveys about the target $S$ can have aspects of shared or redundant information (conveyed by both $X_1$ and $X_2$), of unique information (conveyed exclusively by either $X_1$ or $X_2$), and of complementary or synergistic information (retrievable only from the joint variable $(X_1,X_2)$). In general, all three kinds of information may be present concurrently. One would like to express this by decomposing the mutual information $I(S;X_1X_2)$ into a sum of nonnegative components with a well-defined operational interpretation. One possible application area is in the neurosciences. In [7] it is argued that such a decomposition can provide a framework for analyzing neural information processing with information theory that integrates and goes beyond previous attempts.
For the general case of $k$ finite source variables $(X_1,\dots,X_k)$, Williams and Beer [3] proposed the partial information lattice framework, which specifies how the total information about the target $S$ is shared across the singleton sources and their disjoint or overlapping coalitions. It is a consequence of certain natural properties of shared information (sometimes called the Williams-Beer axioms). In the bivariate case ($k = 2$), the decomposition has the form
$$I(S;X_1X_2) = \underbrace{SI(S;X_1,X_2)}_{\text{shared}} + \underbrace{CI(S;X_1,X_2)}_{\text{complementary}} + \underbrace{UI(S;X_1\backslash X_2)}_{\text{unique ($X_1$ wrt $X_2$)}} + \underbrace{UI(S;X_2\backslash X_1)}_{\text{unique ($X_2$ wrt $X_1$)}}, \tag{1}$$
$$I(S;X_1) = SI(S;X_1,X_2) + UI(S;X_1\backslash X_2), \tag{2}$$
$$I(S;X_2) = SI(S;X_1,X_2) + UI(S;X_2\backslash X_1), \tag{3}$$
where $SI(S;X_1,X_2)$, $UI(S;X_1\backslash X_2)$, $UI(S;X_2\backslash X_1)$, and $CI(S;X_1,X_2)$ are nonnegative continuous functions on the set of all joint distributions of $(S,X_1,X_2)$. The difference between shared and complementary information is the familiar coinformation [8] (or interaction information [9]), a symmetric generalization of the mutual information for three variables,
$$CoI(S;X_1,X_2) = I(S;X_1) - I(S;X_1|X_2) = SI(S;X_1,X_2) - CI(S;X_1,X_2).$$
Equations (1) to (3) leave only a single degree of freedom, i.e., it suffices to specify either a measure for $SI$, for $CI$ or for $UI$.
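The relation between coinformation, shared and complementary information can be illustrated numerically. The following is a minimal sketch, assuming a joint distribution is stored as a Python dictionary that maps outcome tuples $(s,x_1,x_2)$ to probabilities; the helper names `marginal`, `mutual_information` and `coinformation` are illustrative and not taken from the paper. It evaluates $I(S;X_1) + I(S;X_2) - I(S;X_1X_2)$, which by (1) to (3) equals $SI - CI$.

```python
from collections import defaultdict
from math import log2

def marginal(joint, idx):
    """Marginalize a joint pmf {outcome tuple: prob} onto the coordinates in idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)

def mutual_information(joint, a, b):
    """I(A;B) in bits, where a and b are tuples of coordinate indices."""
    pa, pb, pab = marginal(joint, a), marginal(joint, b), marginal(joint, a + b)
    total = 0.0
    for key, p in pab.items():
        if p > 0:
            ka, kb = key[:len(a)], key[len(a):]
            total += p * log2(p / (pa[ka] * pb[kb]))
    return total

def coinformation(joint):
    """I(S;X1) + I(S;X2) - I(S;X1X2) for coordinates ordered (S, X1, X2)."""
    return (mutual_information(joint, (0,), (1,))
            + mutual_information(joint, (0,), (2,))
            - mutual_information(joint, (0,), (1, 2)))

# S = X1 XOR X2 with independent uniform bits: the sources inform only jointly.
xor = {(x1 ^ x2, x1, x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}
# S = X1 = X2, a single uniform bit seen by both sources.
rdn = {(b, b, b): 0.5 for b in (0, 1)}

coi_xor = coinformation(xor)
coi_rdn = coinformation(rdn)
print(coi_xor, coi_rdn)
```

For the XOR distribution the coinformation is negative, so any nonnegative decomposition must assign it at least one bit of complementary information; for the fully redundant bit it is positive, so at least one bit must be shared.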
Williams and Beer not only introduced the general partial information framework, but also proposed a measure of $SI$ to fill this framework. While their measure has subsequently been criticized for "not measuring the right thing" [4]-[6], there has been no successful attempt to find better measures, except for the bivariate case ($k = 2$) [1], [4]. One problem seems to be the lack of a clear consensus on what an ideal measure of shared (or unique or complementary) information should look like and what properties it should satisfy. In particular, the Williams-Beer axioms only put crude bounds on the values of the functions $SI$, $UI$ and $CI$. Therefore, additional axioms have been proposed by various authors [4]-[6]. Unfortunately, some of these properties contradict each other [5], and the question of the right axiomatic characterization is still open.
The Williams-Beer axioms do not say anything about what should happen when the target variable $S$ undergoes a local transformation. In this context, the following left monotonicity property was proposed in [5]:
(LM) $SI(S;X_1,X_2) \ge SI(f(S);X_1,X_2)$ for any function $f$. (left monotonicity)
Left monotonicity for unique or complementary information can be defined similarly. The property captures the intuition that shared information should only decrease if the target performs some local operation (e.g., coarse graining) on her variable $S$. As argued in [2], left monotonicity of shared and unique information are indeed desirable properties. Unfortunately, none of the measures of shared information proposed so far satisfy left monotonicity.
In this contribution, we study a construction that enforces left monotonicity. Namely, given a measure of shared information $SI$, define
$$\overline{SI}(S;X_1,X_2) := \sup_{f:\mathcal{S}\to\mathcal{S}'} SI(f(S);X_1,X_2), \tag{4}$$
where the supremum runs over all functions $f : \mathcal{S} \to \mathcal{S}'$ from the domain of $S$ to an arbitrary finite set $\mathcal{S}'$. By construction, $\overline{SI}$ satisfies left monotonicity, and $\overline{SI}$ is the smallest function bounded from below by $SI$ that satisfies left monotonicity.
Changing the definition of shared information in the information decomposition framework (1)-(3) leads to new definitions of unique and complementary information:
$$UI^*(S;X_1\backslash X_2) := I(S;X_1) - \overline{SI}(S;X_1,X_2),$$
$$UI^*(S;X_2\backslash X_1) := I(S;X_2) - \overline{SI}(S;X_1,X_2),$$
$$CI^*(S;X_1,X_2) := I(S;X_1X_2) - \overline{SI}(S;X_1,X_2) - UI^*(S;X_1\backslash X_2) - UI^*(S;X_2\backslash X_1).$$
Summary of results. Lemma 2 shows that these functions are nonnegative and thus define a nonnegative bivariate decomposition. We study this decomposition in Section IV. In Theorem 1, we show that our construction is not compatible with a decision-theoretic interpretation of unique information proposed in [1]. In Section V, we ask whether it is possible to find an information decomposition in which both shared and unique information measures are left monotonic. Our construction cannot directly be generalized to ensure left monotonicity of two functions simultaneously. Nevertheless, it is possible that such a decomposition exists, and in Proposition 5 we prove bounds on the corresponding shared information measure.
Our original motivation for the definition of $\overline{SI}$ was to find a bivariate decomposition in which the shared information satisfies left monotonicity. However, we can also interpret $\overline{SI}$ as a measure of extractable shared information, because it asks for the maximal amount of shared information that can be extracted from $S$ by further processing $S$ by a local mechanism. More generally, one can apply a similar construction to arbitrary information measures. We explore this idea in Section III and discuss probabilistic generalizations and relations to other information measures. In Section VI, we apply our construction to existing measures of shared information.

II. PROPERTIES OF INFORMATION DECOMPOSITIONS
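1) The Williams-Beer axioms: Although we are mostly concerned with the case $k = 2$, let us first recall the three axioms that Williams and Beer [3] proposed for a measure of shared information for arbitrarily many arguments:
(S) $SI(S;X_1,\dots,X_k)$ is symmetric under permutations of $X_1,\dots,X_k$. (symmetry)
(SR) $SI(S;X_1) = I(S;X_1)$. (self-redundancy)
(M) $SI(S;X_1,\dots,X_{k-1},X_k) \le SI(S;X_1,\dots,X_{k-1})$, with equality if $X_{k-1}$ is a function of $X_k$. (monotonicity)
Any measure of $SI$ satisfying these axioms is nonnegative. Moreover, the axioms imply the following:
(RM) $SI(S;X_1,\dots,X_k) \ge SI(S;f_1(X_1),\dots,f_k(X_k))$ for all functions $f_1,\dots,f_k$. (right monotonicity)
Williams and Beer also defined the function
$$I_{\min}(S;X_1,\dots,X_k) = \sum_{s} P_S(s) \min_{i} \sum_{x_i} P_{X_i|S}(x_i|s) \log\frac{P_{S|X_i}(s|x_i)}{P_S(s)}$$
and showed that $I_{\min}$ satisfies their axioms.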
2) The COPY example and the Identity axiom: Let $X_1,X_2$ be independent uniformly distributed binary random variables, and consider the copy function $\mathrm{COPY}(X_1,X_2) := (X_1,X_2)$. One point of criticism of $I_{\min}$ is that, according to $I_{\min}$, $X_1$ and $X_2$ share $I_{\min}(\mathrm{COPY}(X_1,X_2);X_1,X_2) = 1$ bit about $\mathrm{COPY}(X_1,X_2)$, even though they are independent. In [4], the authors argued that the shared information about the copied pair should equal the mutual information:
(Id) $SI(\mathrm{COPY}(X_1,X_2);X_1,X_2) = I(X_1;X_2)$. (Identity)
[4] also proposed a bivariate measure of shared information that satisfies (Id). Similarly, the measure of bivariate shared information proposed in [1] satisfies (Id). However, it has been shown in [2] that (Id) is incompatible with a nonnegative information decomposition according to the Williams-Beer axioms for $k \ge 3$.
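The value of 1 bit quoted for the COPY example can be checked directly. The sketch below assumes the standard Williams-Beer form of $I_{\min}$, namely the expectation over $s$ of the minimum over sources of the specific information $\sum_{x} P(x|s)\log_2\frac{P(s|x)}{P(s)}$; the function name `i_min` and the dictionary representation of the joint distribution are illustrative choices, not part of the paper.

```python
from collections import defaultdict
from math import log2

def i_min(joint, n_sources=2):
    """Williams-Beer I_min(S; X_1, ..., X_k) in bits.

    joint maps outcome tuples (s, x_1, ..., x_k) to probabilities.
    """
    p_s = defaultdict(float)
    p_sx = [defaultdict(float) for _ in range(n_sources)]
    p_x = [defaultdict(float) for _ in range(n_sources)]
    for outcome, p in joint.items():
        s = outcome[0]
        p_s[s] += p
        for i in range(n_sources):
            x = outcome[i + 1]
            p_sx[i][(s, x)] += p
            p_x[i][x] += p
    total = 0.0
    for s, ps in p_s.items():
        if ps == 0:
            continue
        specific = []
        for i in range(n_sources):
            # Specific information: sum_x p(x|s) * log2( p(s|x) / p(s) )
            val = 0.0
            for (s2, x), pj in p_sx[i].items():
                if s2 == s and pj > 0:
                    val += (pj / ps) * log2((pj / p_x[i][x]) / ps)
            specific.append(val)
        total += ps * min(specific)
    return total

# COPY example: S = (X1, X2) with X1, X2 independent uniform bits.
copy = {((x1, x2), x1, x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}
copy_value = i_min(copy)
print(copy_value)
```

Since $X_1$ and $X_2$ are independent here, $I(X_1;X_2) = 0$, so the computed value exceeds what (Id) would allow by a full bit.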
On the other hand, [5] uses an example from game theory to give an intuitive explanation of how even independent variables $X_1$ and $X_2$ can have nontrivial shared information. However, in any case the value of 1 bit assigned by $I_{\min}$ is deemed to be too large.
3) The Blackwell property and property ($*$):
One of the reasons that it is so difficult to find good definitions of shared, unique or synergistic information is that a clear operational idea behind these notions is missing. Starting from an operational idea about decision problems, [1] proposed the following property for the unique information, which we now propose to call the Blackwell property:
(BP) For a given joint distribution $P_{SX_1X_2}$, $UI(S;X_1\backslash X_2)$ vanishes if and only if there exists a random variable $X_1'$ such that $S - X_2 - X_1'$ is a Markov chain and $P_{SX_1'} = P_{SX_1}$. (Blackwell property)
In other words, the channel $S \to X_1$ is a garbling of the channel $S \to X_2$. Blackwell's theorem [10] implies that this garbling property is equivalent to the fact that any decision problem in which the task is to predict $S$ can be solved just as well with the knowledge of $X_2$ as with the knowledge of $X_1$. We refer to Section 2 in [1] for the details.
[1] also proposed the following property:
($*$) $SI$ and $UI$ only depend on the marginal distributions $P_{SX_1}$ and $P_{SX_2}$ of the pairs $(S,X_1)$ and $(S,X_2)$.
This property was in part motivated by (BP), which also depends only on the channels $S \to X_1$ and $S \to X_2$ and thus on $P_{SX_1}$ and $P_{SX_2}$.

III. EXTRACTABLE INFORMATION MEASURES
One can interpret $\overline{SI}$ as a measure of extractable shared information. We explain this idea in a more general setting.
For fixed $k$, let $IM(S;X_1,\dots,X_k)$ be an arbitrary information measure that measures one aspect of the information that $X_1,\dots,X_k$ contain about $S$. At this point, we do not specify what precisely an information measure is, except that it is a function that assigns a real number to any joint distribution of $S,X_1,\dots,X_k$. The notation is, of course, suggestive of the fact that we mostly think about one of the measures $SI$, $UI$ or $CI$, in which the first argument plays a special role. However, $IM$ could also be the mutual information $I(S;X_1)$, the entropy $H(S)$, or the coinformation $CoI(S;X_1,X_2)$. We define the corresponding extractable information measure as
$$\overline{IM}(S;X_1,\dots,X_k) := \sup_{f} IM(f(S);X_1,\dots,X_k), \tag{5}$$
where the supremum runs over all functions $f : \mathcal{S} \to \mathcal{S}'$ from the domain of $S$ to an arbitrary finite set $\mathcal{S}'$. The intuition is that $\overline{IM}$ is the maximal possible amount of $IM$ one can "extract" from $(X_1,\dots,X_k)$ by transforming $S$. Clearly, the precise interpretation depends on the interpretation of $IM$. This construction has the following general properties:
1) Most information measures satisfy $IM(O;X_1,\dots,X_k) = 0$ when $O$ is a constant random variable. Thus, in this case, $\overline{IM}(S;X_1,\dots,X_k) \ge 0$. So, for example, even though the coinformation can be negative, the extractable coinformation is never negative.
2) If $IM$ satisfies left monotonicity, then $\overline{IM} = IM$. For example, as shown in [2], the measure of unique information $UI$ defined in [1] satisfies left monotonicity, and so $\overline{UI} = UI$.

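3) In fact, $\overline{IM}$ is the smallest left monotonic information measure that is at least as large as $IM$.
The next result shows that our construction preserves monotonicity properties of the other arguments of $IM$. It follows that by iterating this construction, one can construct an information measure that is monotonic in all arguments.
Lemma 1. Let $f_1,\dots,f_k$ be fixed functions on the alphabets of $X_1,\dots,X_k$. If $IM(S;X_1,\dots,X_k) \ge IM(S;f_1(X_1),\dots,f_k(X_k))$ for all $S$, then $\overline{IM}(S;X_1,\dots,X_k) \ge \overline{IM}(S;f_1(X_1),\dots,f_k(X_k))$ for all $S$.
Proof.
$$\overline{IM}(S;X_1,\dots,X_k) = \sup_{f} IM(f(S);X_1,\dots,X_k) \overset{(a)}{\ge} \sup_{f} IM(f(S);f_1(X_1),\dots,f_k(X_k)) = \overline{IM}(S;f_1(X_1),\dots,f_k(X_k)),$$
where (a) follows from the assumptions.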
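The first property, that the extractable coinformation is never negative, can be illustrated on a small case. The following is a minimal sketch, assuming joint distributions are dictionaries from outcome tuples $(s,x_1,x_2)$ to probabilities and that the supremum is taken by brute force over all partitions of the support of $S$; the names `partitions`, `coarse_grain` and `extractable` are hypothetical helpers introduced only for this illustration.

```python
from collections import defaultdict
from math import log2

def partitions(items):
    """Yield every set partition of the list items as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into one of the existing blocks ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part

def entropy(joint, idx):
    """Shannon entropy (bits) of the marginal on the coordinates in idx."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def coinformation(joint):
    """I(S;X1) + I(S;X2) - I(S;X1X2), written in terms of entropies."""
    def h(*idx):
        return entropy(joint, idx)
    return h(0) + h(1) + h(2) - h(0, 1) - h(0, 2) - h(1, 2) + h(0, 1, 2)

def coarse_grain(joint, part):
    """Joint pmf of (f(S), X1, X2), where f maps s to the index of its block."""
    label = {s: b for b, block in enumerate(part) for s in block}
    out = defaultdict(float)
    for (s, x1, x2), p in joint.items():
        out[(label[s], x1, x2)] += p
    return dict(out)

def extractable(measure, joint):
    """Maximum of measure(f(S); X1, X2) over all partitions of the support of S."""
    support = sorted({o[0] for o, p in joint.items() if p > 0})
    return max(measure(coarse_grain(joint, part)) for part in partitions(support))

# S = X1 XOR X2 with independent uniform bits has coinformation -1 bit.
xor = {(x1 ^ x2, x1, x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}
plain = coinformation(xor)
extracted = extractable(coinformation, xor)
print(plain, extracted)
```

The maximum is attained by the constant function, which is exactly the mechanism behind property 1): a constant target always yields the value zero, so the supremum cannot fall below it.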
As a generalization of the construction, instead of looking at "deterministic extractability," one could also look at "probabilistic extractability" and replace $f$ by a stochastic matrix. This leads to the definition
$$\overline{\overline{IM}}(S;X_1,\dots,X_k) := \sup_{P_{S'|S}} IM(S';X_1,\dots,X_k), \tag{6}$$
where the supremum now runs over all random variables $S'$ that are independent of $X_1,\dots,X_k$ given $S$. The function $\overline{\overline{IM}}$ is the smallest function bounded from below by $IM$ that satisfies
(PLM) $IM(S;X_1,X_2) \ge IM(S';X_1,X_2)$ whenever $S'$ is independent of $X_1,X_2$ given $S$. (probabilistic left monotonicity)
An example of this construction is the intrinsic conditional information $I(X;Y{\downarrow}Z) := \min_{P_{Z'|Z}} I(X;Y|Z')$, which was defined in [11] to study the secret-key rate, that is, the maximal rate at which a secret can be generated by two agents knowing $X$ or $Y$, respectively, such that a third agent who knows $Z$ has arbitrarily small information about this key. The min instead of the max in the definition implies that $I(X;Y{\downarrow}Z)$ is "anti-monotone" in $Z$.
In this paper, we restrict ourselves to the deterministic notions, since many of the properties we want to discuss can already be explained using deterministic extractability. Moreover, the optimization problem (5) is a finite optimization problem and thus much easier to solve than (6).

IV. EXTRACTABLE SHARED INFORMATION
We now specialize to the case of shared information. The first result is that when we apply our construction to a measure of shared information that belongs to a bivariate decomposition, we again obtain a bivariate decomposition.

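Lemma 2. Suppose that $SI$ is a measure of shared information, coming from a nonnegative bivariate information decomposition (satisfying (1) to (3)). Then $\overline{SI}$ defines a nonnegative information decomposition; that is, the derived functions $UI^*(S;X_1\backslash X_2)$, $UI^*(S;X_2\backslash X_1)$ and $CI^*(S;X_1,X_2)$, obtained from (1) to (3) with $\overline{SI}$ in place of $SI$, are nonnegative.
Proof. Since $\overline{SI} \ge SI$,
$$CI^*(S;X_1,X_2) = \overline{SI}(S;X_1,X_2) - CoI(S;X_1,X_2) \ge SI(S;X_1,X_2) - CoI(S;X_1,X_2) = CI(S;X_1,X_2) \ge 0.$$
Moreover,
$$UI^*(S;X_1\backslash X_2) = I(S;X_1) - SI(f^*(S);X_1,X_2) \ge I(f^*(S);X_1) - SI(f^*(S);X_1,X_2) = UI(f^*(S);X_1\backslash X_2) \ge 0,$$
where $f^*$ is a function that achieves the supremum in (4), and where we have used the data processing inequality. The same argument applies to $UI^*(S;X_2\backslash X_1)$.

Lemma 3. 1) If $SI$ satisfies ($*$), then $\overline{SI}$ also satisfies ($*$).
2) If $SI$ is right monotonic, then $\overline{SI}$ is also right monotonic.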
Without further assumptions on $SI$, we cannot say much about when $\overline{SI}$ vanishes. However, the condition that $UI^*$ vanishes has strong consequences.

Lemma 4. Suppose that $UI^*(S;X_1\backslash X_2)$ vanishes, and let $f^*$ be a function that achieves the supremum in (4). Then $X_1 - f^*(S) - S$ is a Markov chain, and $UI(f^*(S);X_1\backslash X_2) = 0$.
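Proof. Suppose that $UI^*(S;X_1\backslash X_2) = 0$. Then $I(S;X_1) = \overline{SI}(S;X_1,X_2) = SI(f^*(S);X_1,X_2) \le I(f^*(S);X_1) \le I(S;X_1)$. Thus, the data processing inequality holds with equality. This implies that $X_1 - f^*(S) - S$ is a Markov chain. The identity $UI(f^*(S);X_1\backslash X_2) = 0$ follows from the same chain of inequalities.

Theorem 1. $UI^*$ does not satisfy the Blackwell property.
Proof. As shown in Example 9 in [12], there exist random variables $S$, $X_1$, $X_2$ and a function $f$ that satisfy
1) $S$ and $X_1$ are independent given $f(S)$.
2) The channel $f(S) \to X_1$ is a garbling of the channel $f(S) \to X_2$.
3) The channel $S \to X_1$ is not a garbling of the channel $S \to X_2$.
If $UI^*$ satisfied the Blackwell property, then 2) would give $UI^*(f(S);X_1\backslash X_2) = 0$, that is, $\overline{SI}(f(S);X_1,X_2) = I(f(S);X_1) = I(S;X_1)$, where the last equality uses 1). By left monotonicity, $\overline{SI}(S;X_1,X_2) \ge \overline{SI}(f(S);X_1,X_2) = I(S;X_1)$, so $UI^*(S;X_1\backslash X_2) = 0$, which contradicts 3). Thus, $UI^*$ does not satisfy the Blackwell property.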

Corollary 2. There is no bivariate information decomposition in which $UI$ satisfies the Blackwell property and $SI$ satisfies left monotonicity.
Proof. If $SI$ satisfies left monotonicity, then $\overline{SI} = SI$. Thus, $UI = UI^*$ cannot satisfy the Blackwell property by Theorem 1.

V. LEFT MONOTONIC INFORMATION DECOMPOSITIONS
Is it possible to have an extractable information decomposition? More precisely, is it possible to have an information decomposition in which all information measures are left monotonic? The obvious strategy of starting with an arbitrary information decomposition and replacing each partial information measure by its extractable analogue does not work, since this would mean increasing all partial information measures (unless they are extractable already), but then their sum would also increase. For example, in the bivariate case, when $SI$ is replaced by a larger function $\overline{SI}$, then $UI$ needs to be replaced by a smaller function, due to the constraints (2) and (3).
As argued in [2], it is intuitive that U I be left monotonic. As argued above (and in [5]), it is also desirable that SI be left monotonic. The intuition for synergy is much less clear. In the following we restrict to the bivariate case and study the implications of requiring both SI and U I to be left monotonic. Proposition 5 gives bounds on the corresponding SI measure.
Proposition 5. Suppose that $SI$, $UI$ and $CI$ define a bivariate information decomposition, and suppose that $SI$ and $UI$ are left monotonic. Then
$$SI(f(X_1,X_2);X_1,X_2) \le I(X_1;X_2) \tag{7}$$
for any function $f$.
Before proving the proposition, let us make some remarks. Inequality (7) is related to the identity axiom. Indeed, it is easy to derive (7) from the identity axiom and from the assumption that $SI$ is left monotonic. Although inequality (7) may not seem counterintuitive at first sight, none of the information decompositions proposed so far satisfy this property.

VI. EXAMPLES
In this section, we apply our construction to Williams and Beer's measure, $I_{\min}$ [3], and to the bivariate measure of shared information, $SI$, proposed in [1].
First, we make some remarks on how to compute the extractable information measure (under the assumption that one knows how to compute the underlying information measure itself). The optimization problem (4) is a discrete optimization problem. The search space is the set of functions from the support $\mathcal{S}$ of $S$ to some finite set $\mathcal{S}'$. For the information measures that we have in mind, we may restrict to surjective functions $f$, since the information measures only depend on events with positive probabilities. Thus, we may restrict to sets $\mathcal{S}'$ with $|\mathcal{S}'| \le |\mathcal{S}|$. Moreover, the information measures are invariant under permutations of the alphabet $\mathcal{S}$. Therefore, the only thing that matters about $f$ is which elements from $\mathcal{S}$ are mapped to the same element in $\mathcal{S}'$. Thus, any function $f : \mathcal{S} \to \mathcal{S}'$ corresponds to a partition of $\mathcal{S}$, where $s,s' \in \mathcal{S}$ belong to the same block if and only if $f(s) = f(s')$, and it suffices to look at all such partitions. The number of partitions of a finite set $\mathcal{S}$ is the Bell number $B_{|\mathcal{S}|}$.
The Bell numbers increase super-exponentially, and for larger sets $\mathcal{S}$, the search space of the optimization problem (4) becomes quite large. For smaller problems, enumerating all partitions in order to find the maximum is still feasible. For larger problems, one would need a better understanding of the optimization problem. For reference, some Bell numbers: $B_3 = 5$, $B_4 = 15$, $B_6 = 203$, $B_{10} = 115975$. In specific examples, there may be more symmetries, so in the COPY example discussed below, where $|\mathcal{S}| = 4$, it suffices to study 6 functions instead of $B_4 = 15$.
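The size of this search space is easy to check for small alphabets. The sketch below enumerates set partitions recursively and compares the counts with Bell numbers computed from the Bell triangle; both helper names (`partitions`, `bell_numbers`) are illustrative, and the brute-force enumeration is only one possible way to organize the search.

```python
def partitions(items):
    """Yield every set partition of the list items as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Add `first` to each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or start a new block containing only `first`.
        yield [[first]] + part

def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max] computed with the Bell triangle."""
    row = [1]
    bells = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above-left to its left neighbour.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells

bells = bell_numbers(10)
counts = [sum(1 for _ in partitions(list(range(n)))) for n in range(1, 7)]
print(bells)
print(counts)
```

Each partition stands for one candidate function $f$ in (4), so for $|\mathcal{S}| = 4$ the enumeration visits 15 candidates before any symmetry reduction.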
We now compare the measure $\overline{I}_{\min}$, an extractable version of Williams and Beer's measure $I_{\min}$ [3], to the measure $\overline{SI}$, an extractable version of the measure $SI$ proposed in [1]. For the latter, we briefly recall the definitions. Let $\Delta$ be the set of all joint distributions of random variables $(S,X_1,X_2)$ with given state spaces $\mathcal{S}$, $\mathcal{X}_1$, $\mathcal{X}_2$. Fix $P_{SX_1X_2} \in \Delta$. Define $\Delta_P$ as the set of all distributions $Q_{SX_1X_2}$ that preserve the marginals of the pairs $(S,X_1)$ and $(S,X_2)$, that is,
$$\Delta_P := \{Q_{SX_1X_2} \in \Delta : Q_{SX_1} = P_{SX_1},\ Q_{SX_2} = P_{SX_2}\}.$$
Then, define the functions
$$UI(S;X_1\backslash X_2) := \min_{Q\in\Delta_P} I_Q(S;X_1|X_2),$$
$$UI(S;X_2\backslash X_1) := \min_{Q\in\Delta_P} I_Q(S;X_2|X_1),$$
$$SI(S;X_1,X_2) := \max_{Q\in\Delta_P} CoI_Q(S;X_1,X_2),$$
$$CI(S;X_1,X_2) := I(S;X_1X_2) - \min_{Q\in\Delta_P} I_Q(S;X_1X_2),$$
where the index $Q$ in $I_Q$ or $CoI_Q$ indicates that the corresponding quantity is computed with respect to the joint distribution $Q$. The decomposition corresponding to $SI$ satisfies the Blackwell property and the identity axiom [1]. $UI$ is left monotonic, but $SI$ is not [2]. In particular, $\overline{SI} \neq SI$. $SI$ can be characterized as the smallest measure of shared information that satisfies property ($*$). Therefore, $\overline{SI}$ is the smallest left monotonic measure of shared information that satisfies property ($*$).
Let $\mathcal{X}_1 = \mathcal{X}_2 = \{0,1\}$ and let $X_1$, $X_2$ be independent uniformly distributed random variables. Table I collects values of shared information about $f(X_1,X_2)$ for various functions $f$ (in bits). The function $f_1 : \{00,01,10,11\} \to \{0,1,2\}$ is defined as
$$f_1(X_1,X_2) := \begin{cases} X_1, & \text{if } X_2 = 1, \\ 2, & \text{if } X_2 = 0. \end{cases}$$
The SUM function is defined as $f(X_1,X_2) := X_1 + X_2$.
TABLE I: Shared information about $f(X_1,X_2)$ for various functions $f$ (in bits).
In these examples, $\overline{I}_{\min} = I_{\min}$, but as shown in [5], $I_{\min}$ is not left monotonic in general.
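The statement that the extractable version of $I_{\min}$ coincides with $I_{\min}$ in these examples can be checked by brute force. The following is a minimal sketch, assuming the standard Williams-Beer formula for $I_{\min}$ and a search over all partitions of the support of $S$; the set of functions (COPY, AND, SUM, $f_1$) and all helper names are choices made for this illustration and need not match the rows of Table I.

```python
from collections import defaultdict
from math import log2

def i_min(joint):
    """Williams-Beer I_min(S; X1, X2) in bits for joint = {(s, x1, x2): prob}."""
    p_s = defaultdict(float)
    p_sx = [defaultdict(float), defaultdict(float)]
    p_x = [defaultdict(float), defaultdict(float)]
    for (s, x1, x2), p in joint.items():
        p_s[s] += p
        for i, x in enumerate((x1, x2)):
            p_sx[i][(s, x)] += p
            p_x[i][x] += p
    total = 0.0
    for s, ps in p_s.items():
        # Specific information of each source about the event S = s.
        specific = [sum((pj / ps) * log2((pj / p_x[i][x]) / ps)
                        for (s2, x), pj in p_sx[i].items() if s2 == s and pj > 0)
                    for i in range(2)]
        total += ps * min(specific)
    return total

def partitions(items):
    """Yield every set partition of the list items as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def extractable(measure, joint):
    """Maximum of measure(f(S); X1, X2) over all partitions of the support of S."""
    support = sorted({o[0] for o, p in joint.items() if p > 0})
    best = float("-inf")
    for part in partitions(support):
        label = {s: b for b, block in enumerate(part) for s in block}
        coarse = defaultdict(float)
        for (s, x1, x2), p in joint.items():
            coarse[(label[s], x1, x2)] += p
        best = max(best, measure(coarse))
    return best

def target_joint(f):
    """Joint pmf of (f(X1, X2), X1, X2) for independent uniform bits X1, X2."""
    return {(f(x1, x2), x1, x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}

examples = {
    "COPY": lambda x1, x2: (x1, x2),
    "AND": lambda x1, x2: x1 & x2,
    "SUM": lambda x1, x2: x1 + x2,
    "f1": lambda x1, x2: x1 if x2 == 1 else 2,
}
results = {name: (i_min(target_joint(f)), extractable(i_min, target_joint(f)))
           for name, f in examples.items()}
for name, (plain, extracted) in results.items():
    print(f"{name}: I_min = {plain:.3f}, extractable I_min = {extracted:.3f}")
```

Replacing `i_min` by another callable with the same signature would apply the same search to a different measure, at the cost of one evaluation per partition.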

VII. CONCLUSIONS
We introduced a new measure of shared information that satisfies an intuitive left monotonicity property with respect to local operations on the target variable. Our measure fits the bivariate information decomposition framework; that is, we also obtain corresponding measures of unique and synergistic information. The fact that left monotonicity of shared information contradicts the Blackwell property for unique information is an important step forward in understanding what one could expect from such a decomposition.