Quantifying Redundant Information in Predicting a Target Random Variable

This paper considers the problem of defining a measure of redundant information that quantifies how much common information two or more random variables specify about a target random variable. We discuss desired properties of such a measure, and propose new measures satisfying some of them.


I. Introduction
Many molecular and neurological systems involve multiple interacting factors affecting an outcome synergistically and/or redundantly. Attempts to shed light on issues such as population coding in neurons, or genetic contribution to a phenotype (e.g. eye color), have motivated various proposals to leverage principled information-theoretic measures for quantifying informational synergy and redundancy, e.g. [1]-[5]. In these settings, we are concerned with the statistics of how two (or more) random variables X_1, X_2, called predictors, jointly or separately specify/predict another random variable Y, called the target random variable. This focus on a target random variable is in contrast to Shannon's mutual information, which quantifies statistical dependence between two random variables, and to various notions of common information, e.g. [6]-[8].
The concepts of synergy and redundancy are based on several intuitive notions: positive informational synergy indicates that X_1 and X_2 act cooperatively or antagonistically to influence Y; positive redundancy indicates that there is an aspect of Y that X_1 and X_2 can each separately predict. However, it has proven challenging [9]-[12] to come up with precise information-theoretic definitions of synergy and redundancy that are consistent with all intuitively desired properties.

II. Background: Partial Information Decomposition
The Partial Information Decomposition (PID) approach of [13] defines the concepts of synergistic, redundant and unique information in terms of intersection information, I∩{X_1, . . . , X_n} : Y, which quantifies the common information that each of the n predictors X_1, . . . , X_n conveys about a target random variable Y. An antichain lattice of redundant, unique, and synergistic partial informations is built from the intersection information. Each PI-region is either redundant, unique, or synergistic, but any combination of positive PI-regions may be possible. Per [13], for two predictors, the four partial informations are defined as follows: the redundant information as I∩{X_1, X_2} : Y, the unique informations as

I(X_1 : Y) − I∩{X_1, X_2} : Y,   (1)
I(X_2 : Y) − I∩{X_1, X_2} : Y,   (2)

and the synergistic information as

I(X_1, X_2 : Y) − I(X_1 : Y) − I(X_2 : Y) + I∩{X_1, X_2} : Y.   (3)
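As a concrete sketch of this bookkeeping (the function name is ours, not the paper's), the two-predictor decomposition follows from the three mutual informations once a value for the intersection information is chosen:

```python
# Minimal sketch of the two-predictor PID bookkeeping of [13]: given the
# three mutual informations and a candidate intersection information i_cap,
# the remaining partial informations are determined.  Illustrative only.

def pid_two_predictors(i_x1_y, i_x2_y, i_x1x2_y, i_cap):
    """Return (redundant, unique_1, unique_2, synergistic) partial infos."""
    redundant = i_cap
    unique_1 = i_x1_y - i_cap                          # unique to X1
    unique_2 = i_x2_y - i_cap                          # unique to X2
    synergistic = i_x1x2_y - i_x1_y - i_x2_y + i_cap   # remainder of joint MI
    return redundant, unique_1, unique_2, synergistic

# Example Unq: I(X1:Y) = I(X2:Y) = 1, I(X1,X2:Y) = 2, ideal I_cap = 0,
# giving one bit of unique information per predictor and no synergy.
print(pid_two_predictors(1.0, 1.0, 2.0, 0.0))  # (0.0, 1.0, 1.0, 0.0)
```

Note that the four outputs always sum to the joint mutual information, so choosing the intersection information fixes the whole decomposition.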

III. Desired I ∩ properties and canonical examples
There are a number of intuitive properties, proposed in [5], [9]-[13], that are considered desirable for the intersection information measure I∩ to satisfy:
(SR) Self-Redundancy: The intersection information a single predictor X_1 conveys about the target Y is equal to the mutual information between X_1 and Y, i.e. I∩{X_1} : Y = I(X_1 : Y).
(M_1) Strong Monotonicity: Adding a predictor W can only decrease the intersection information, I∩{X_1, . . . , X_n, W} : Y ≤ I∩{X_1, . . . , X_n} : Y, with equality if some predictor X_i is a function of W.
(LP) Local Positivity: For all n, the derived "partial informations" defined in [13] are nonnegative. This is equivalent to requiring that I∩ satisfy total monotonicity, a stronger form of supermodularity. For n = 2 this can be concretized as

max[0, I(X_1 : Y) + I(X_2 : Y) − I(X_1, X_2 : Y)] ≤ I∩{X_1, X_2} : Y ≤ min[I(X_1 : Y), I(X_2 : Y)].

There are also a number of canonical examples for which one or more of the partial informations have intuitive values, which are considered desirable for the intersection information measure I∩ to attain.
Example Unq, shown in Figure 2, is a canonical case of unique information, in which each predictor carries independent information about the target. Y has four equiprobable states: ab, aB, Ab, and AB. X_1 uniquely specifies the a/A bit, and X_2 uniquely specifies the b/B bit. Note that the states are named so as to highlight the two bits of unique information; it is equivalent to choose any four distinct names for the four states.
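As an illustration (the joint-distribution encoding and the `mutual_info` helper are our own), the mutual informations for Unq can be checked numerically:

```python
from collections import defaultdict
from math import log2

# Example Unq: Y uniform over {ab, aB, Ab, AB}; X1 is the a/A bit and
# X2 is the b/B bit.  Outcomes are stored as (x1, x2, y) triples.
joint = {(x1, x2, (x1, x2)): 0.25 for x1 in "aA" for x2 in "bB"}

def mutual_info(joint, i, j):
    """Mutual information between coordinate groups i and j of the joint."""
    pi, pj, pij = defaultdict(float), defaultdict(float), defaultdict(float)
    for outcome, p in joint.items():
        a = tuple(outcome[k] for k in i)
        b = tuple(outcome[k] for k in j)
        pi[a] += p; pj[b] += p; pij[(a, b)] += p
    return sum(p * log2(p / (pi[a] * pj[b]))
               for (a, b), p in pij.items() if p > 0)

print(mutual_info(joint, (0,), (2,)))    # I(X1:Y)    = 1.0 bit
print(mutual_info(joint, (1,), (2,)))    # I(X2:Y)    = 1.0 bit
print(mutual_info(joint, (0, 1), (2,)))  # I(X1,X2:Y) = 2.0 bits
```

Each predictor carries one bit about Y, and jointly they carry two, consistent with two bits of unique information and no redundancy or synergy.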
Example RdnXor, shown in Figure 3, is a canonical example of redundancy and synergy coexisting: the r/R bit of Y is redundantly carried by both predictors, while the 0/1 bit of Y is synergistically specified as the XOR of the corresponding bits in X_1 and X_2.
Example And, shown in Figure 4, is an example where the relationship between X_1, X_2 and Y is nonlinear, making the desired partial information values less intuitively obvious. Nevertheless, it is desired that the partial information values be nonnegative.
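A quick numeric check of And (the helper code is our own; only the distribution comes from the example) shows the mutual informations and the window that nonnegativity of the partial informations leaves for the intersection information:

```python
from collections import defaultdict
from math import log2

# Example And: X1, X2 independent uniform bits, Y = AND(X1, X2).
joint = {(x1, x2, x1 & x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}

def mutual_info(joint, i, j):
    """Mutual information between coordinate groups i and j of the joint."""
    pi, pj, pij = defaultdict(float), defaultdict(float), defaultdict(float)
    for outcome, p in joint.items():
        a = tuple(outcome[k] for k in i)
        b = tuple(outcome[k] for k in j)
        pi[a] += p; pj[b] += p; pij[(a, b)] += p
    return sum(p * log2(p / (pi[a] * pj[b]))
               for (a, b), p in pij.items() if p > 0)

i1 = mutual_info(joint, (0,), (2,))     # I(X1:Y)    ~= 0.311 bits
i2 = mutual_info(joint, (1,), (2,))     # I(X2:Y)    ~= 0.311 bits
i12 = mutual_info(joint, (0, 1), (2,))  # I(X1,X2:Y) = H(Y) ~= 0.811 bits
# Nonnegative partial informations pin I_cap into this window:
lower = max(0.0, i1 + i2 - i12)         # synergy >= 0
upper = min(i1, i2)                     # unique informations >= 0
print(lower, upper)
```

Any I∩ value inside [lower, upper] yields a nonnegative decomposition for And; a measure returning a value outside it violates (LP).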
Example ImperfectRdn, shown in Figure 5, is an example of "imperfect" or "lossy" correlation between the predictors, where it is intuitively desirable that the derived redundancy should be positive. Given (LP), we can determine the desired decomposition analytically.
(Fig. 3 caption: RdnXor. This is the canonical example of redundancy and synergy coexisting. I_min and I∧ each reach the desired decomposition of one bit of redundancy and one bit of synergy. This example demonstrates I∧ correctly extracting the embedded redundant bit within X_1 and X_2.)

IV. Previous candidate measures
In [13], the authors propose to use the following quantity, I_min, as the intersection information measure:

I_min{X_1, . . . , X_n} : Y = Σ_y Pr(Y = y) min_i I(X_i : Y = y), where I(X_i : Y = y) = D_KL[ Pr(X_i | Y = y) ‖ Pr(X_i) ]

and D_KL is the Kullback-Leibler divergence.
Though I_min is an intuitive and plausible choice for the intersection information, [9] showed that I_min has counterintuitive properties. In particular, I_min calculates one bit of redundant information for example Unq (Figure 2). It does this because each predictor shares one bit of information with the target. However, it is quite clear that the shared informations are, in fact, different: X_1 provides the a/A bit, while X_2 provides the b/B bit. This led to the conclusion that I_min overestimates the ideal intersection information measure by focusing only on how much information the predictors provide about the target. Another way to understand why I_min overestimates redundancy in example Unq is to imagine a hypothetical example with exactly one bit of unique information per predictor and no synergy or redundancy. I_min would calculate the redundancy as the minimum over both predictors, min[1, 1] = 1 bit. Therefore I_min would report 1 bit of redundancy even though, by construction, there is no redundancy but merely two bits of unique information.
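The overestimate is easy to reproduce. Below is a minimal sketch of I_min as the average, over target states y, of the minimum specific information D_KL[Pr(X_i | y) ‖ Pr(X_i)] across predictors (the encoding of Unq is our own):

```python
from collections import defaultdict
from math import log2

# Example Unq again: Y uniform over four states, X1 = a/A bit, X2 = b/B bit.
joint = {(x1, x2, (x1, x2)): 0.25 for x1 in "aA" for x2 in "bB"}

def i_min(joint):
    """I_min of [13]: average over y of the minimum specific information."""
    py = defaultdict(float)
    for (_, _, y), p in joint.items():
        py[y] += p
    total = 0.0
    for y, p_y in py.items():
        specifics = []
        for idx in (0, 1):  # one specific information per predictor
            px, px_given_y = defaultdict(float), defaultdict(float)
            for outcome, p in joint.items():
                px[outcome[idx]] += p
                if outcome[2] == y:
                    px_given_y[outcome[idx]] += p / p_y
            specifics.append(sum(q * log2(q / px[x])
                                 for x, q in px_given_y.items() if q > 0))
        total += p_y * min(specifics)
    return total

print(i_min(joint))  # 1.0 bit of "redundancy" for Unq -- the overestimate
```

Each predictor's specific information is one bit for every y, so the minimum is one bit everywhere, even though the two predictors specify different bits of Y.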
A candidate measure of synergy, ∆I, proposed in [14], leads to a negative value of redundant information for Example And. Starting from ∆I as a direct measure of synergistic information and then using eqs. (1) and (2) to derive the other terms, we get the negative redundancy shown in Figure 4c.
(Fig. 5 caption: Example ImperfectRdn. I∧ is blind to the noisy correlation between X_1 and X_2 and calculates zero redundant information. An ideal I∩ measure would detect that all of the information X_2 specifies about Y is also specified by X_1 and calculate I∩{X_1, X_2} : Y = 0.99 bits.)
Another candidate measure of synergy, Syn [15], calculates zero synergy and redundancy for Example RdnXor, as opposed to the intuitive value of one bit of redundancy and one bit of synergy.

V. New candidate measures
A. The I∧ measure
Based on [16], we can consider as a candidate intersection information the maximum mutual information I(Q : Y) that some random variable Q conveys about Y, subject to Q being a function of each predictor X_1, . . . , X_n. This leads to

I∧{X_1, . . . , X_n} : Y ≡ max_Q I(Q : Y) subject to ∀i ∈ {1, . . . , n} : H(Q | X_i) = 0,   (4)

which reduces to a simple expression in [12].
Example ImperfectRdn highlights the foremost shortcoming of I∧: I∧ does not detect "imperfect" or "lossy" correlations between X_1 and X_2. Instead, I∧ calculates zero redundant information, i.e. I∧{X_1, X_2} : Y = 0 bits. This arises from Pr(X_1 = 1, X_2 = 0) > 0. If this probability were zero, ImperfectRdn would revert to being determined by the property (SR) and the (M_0) equality condition. Due to the nature of the common random variable, I∧ only sees the "deterministic" correlations between X_1 and X_2: add even an iota of noise between X_1 and X_2 and I∧ plummets to zero. This highlights a related issue with I∧; it is not continuous, and an arbitrarily small change in the probability distribution can produce a discontinuous jump in the value of I∧.
Despite this, I∧ is a useful stepping-stone: it captures what is inarguably redundant information (the information carried by the common random variable). In addition, unlike earlier measures, I∧ satisfies (TM).
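The common-random-variable computation can be sketched as follows, assuming the usual Gács-Körner construction: Q labels the connected components of the bipartite graph linking co-occurring (x_1, x_2) values, and I∧ is then I(Q : Y). The encodings of RdnXor and ImperfectRdn below are our own reading of the figures:

```python
from collections import defaultdict
from math import log2

def i_and(joint):  # joint: {(x1, x2, y): prob}
    """Sketch of I_wedge via the common random variable of X1 and X2."""
    parent = {}
    def find(u):  # union-find with path halving
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    # Merge x1- and x2-values that co-occur with positive probability.
    for (x1, x2, _), p in joint.items():
        if p > 0:
            parent[find(("a", x1))] = find(("b", x2))
    # Q = component label of X1's value; compute I(Q : Y).
    pq, py, pqy = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x1, _, y), p in joint.items():
        q = find(("a", x1))
        pq[q] += p; py[y] += p; pqy[(q, y)] += p
    return sum(p * log2(p / (pq[q] * py[y]))
               for (q, y), p in pqy.items() if p > 0)

# RdnXor: X1 = (r, b1), X2 = (r, b2), Y = (r, b1 XOR b2), all equiprobable.
rdnxor = {((r, b1), (r, b2), (r, b1 ^ b2)): 0.125
          for r in "rR" for b1 in (0, 1) for b2 in (0, 1)}
# ImperfectRdn-style example: X2 is a slightly noisy copy of X1 = Y
# (the exact probabilities are illustrative, with Pr(X1=1, X2=0) > 0).
imperfect = {(0, 0, 0): 0.5, (1, 1, 1): 0.49, (1, 0, 1): 0.01}

print(i_and(rdnxor))     # 1.0 bit: the embedded r/R bit is extracted
print(i_and(imperfect))  # 0.0 bits: any noise collapses the common r.v.
```

The second output illustrates the discontinuity: a single low-probability cell merges all components into one, so Q becomes constant and I∧ drops to zero.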

B. The I α measure
Intuitively, we expect that if Q specifies only redundant information, then conditioning on any predictor X_i should vanquish all of the information Q conveys about Y. Noticing that I∧ underestimates the ideal I∩ measure (i.e. it does not satisfy (LP)), we loosen the constraint H(Q | X_i) = 0 in eq. (4), leading us to define the measure I_α:

I_α{X_1, . . . , X_n} : Y ≡ max_Q I(Q : Y) subject to ∀i ∈ {1, . . . , n} : I(Q : Y | X_i) = 0.   (5)

This measure attains the desired values for the canonical examples in Section III. However, its implicit definition makes it more difficult to verify whether or not it satisfies the desired properties in Section III. Pleasingly, I_α also satisfies (TM).
While I_α attains the desired values on the previously defined canonical examples, we have found another example, shown in Figure 6, for which I∧ and I_α both calculate negative synergy. This example further complicates Example And by making the predictors mutually dependent.

VI. Conclusion
We have defined new measures for the redundant information that predictor random variables carry about a target random variable. It is not clear whether a single measure of synergy/redundancy can satisfy all previously proposed desired properties and canonical examples, some of which are themselves debatable. For example, a plausible measure of "unique information" [17] and "union information" [9] yields an equivalent I∩ measure that does not satisfy (TM). Determining whether some of these properties are mutually contradictory is an interesting question for further work.