Entropy
  • Article
  • Open Access

30 June 2023

Decomposing and Tracing Mutual Information by Quantifying Reachable Decision Regions

Department of Information Technology, Uppsala University, 752 36 Uppsala, Sweden
This article belongs to the Special Issue Synergy and Redundancy Measures: Theory and Applications to Characterize Complex Systems and Shape Neural Network Representations

Abstract

The idea of a partial information decomposition (PID) gained significant attention for attributing the components of mutual information from multiple variables about a target to being unique, redundant/shared or synergetic. Since the original measure for this analysis was criticized, several alternatives have been proposed, but these have failed to satisfy the desired axioms or an inclusion–exclusion principle, or have resulted in negative partial information components. For constructing a measure, we interpret the achievable type I/II error pairs for predicting each state of a target variable (reachable decision regions) as notions of pointwise uncertainty. For this representation of uncertainty, we construct a distributive lattice with mutual information as consistent valuation and obtain an algebra for the constructed measure. The resulting definition satisfies the original axioms, an inclusion–exclusion principle and provides a non-negative decomposition for an arbitrary number of variables. We demonstrate practical applications of this approach by tracing the flow of information through Markov chains. This can be used to model and analyze the flow of information in communication networks or data processing systems.

1. Introduction

A Partial Information Decomposition (PID) aims to attribute the provided information about a discrete target variable T from a set of predictor or visible variables $\mathbf{V} = \{V_1, \ldots, V_n\}$ to each individual variable $V_i$. The partial contributions to the information about T may be provided by all variables (redundant or shared), by a specific variable (unique) or only be available through a combination of variables (synergetic/complementing) [1]. This decomposition is particularly applicable when studying complex systems. For example, it can be used to study logical circuits, neural networks [2] or the propagation of information over multiple paths through a network. The concept of synergy has been applied to develop data privacy techniques [3,4], and we think that the concept of redundancy may be suitable to study a notion of robustness in data processing systems.
Unfortunately and to the best of our knowledge, there does not exist a non-negative decomposition of mutual information for an arbitrary number of variables that satisfies the commutativity, monotonicity and self-redundancy axioms except the original measure of Williams and Beer [5]. However, this measure has been criticized for not distinguishing “the same information and the same amount of information” [6,7,8,9].
Here, we propose an alternative non-negative partial information decomposition that satisfies Williams and Beer’s axioms [5] for an arbitrary number of variables. It provides an intuitive operational interpretation and results in an algebra like probability theory. To demonstrate that the approach distinguishes the same information from the same amount of information, we highlight its application in tracing the flow of information through a Markov chain, as visualized in Figure 1.
Figure 1. Visualization of a partial information decomposition with information flow analysis of a Markov chain as Sankey diagram. A partial information decomposition enables attributing the provided information about T to being shared (orange), unique (blue/green) or synergetic/complementing (pink). While this already offers practical insights for studying complex systems, the ability to trace the flow of partial information may create a valuable tool to model and analyze many applications.
This work is structured in three parts: Section 2 provides an overview of the related work and background information. Section 3 presents a representation of pointwise uncertainty, constructs a distributive lattice and demonstrates that mutual information is the expected value of its consistent valuation. Section 4 discusses applications of the resulting measure to PIDs and the tracing of information through Markov chains. We provide an overview of the used notation at the end of the paper.

3. Quantifying Reachable Decision Regions

We start by studying the decomposition of binary decision problems from an interpretational perspective (Section 3.1). This provides the basis for constructing a distributive lattice in Section 3.2 and demonstrating the structure of a consistent valuation function. Section 3.3 highlights that mutual information is such a consistent valuation and extends the concept from binary decision problems to target variables with an arbitrary finite number of states. The resulting definition of shared information for the PID will be discussed as an application in Section 4.1 together with the tracing of information flows in Section 4.2.
We define an equivalence relation (∼) for binary input channels $\kappa^t$, which allows for the removal of zero vectors, the permutation of columns ($P$ representing a permutation matrix) and the splitting/merging of columns with identical likelihood ratios (vectors of identical slope, $\epsilon \in \mathbb{R}$), as shown in Equation (10). These operations are invertible using garblings and do not affect the underlying zonotope.
$$\kappa^t \;\sim\; \Big[\,\kappa^t \;\; \begin{bmatrix}0\\0\end{bmatrix}\,\Big]; \qquad \kappa^t \;\sim\; \kappa^t P; \qquad \big[\,(1+\epsilon)\,v_1 \;\; v_2\,\big] \;\sim\; \big[\,v_1 \;\; \epsilon\, v_1 \;\; v_2\,\big]. \tag{10}$$
Based on this definition, block matrices cancel at an inverted sign ($\epsilon = -1$) if we allow negative columns, as shown in Equation (11), where $M_1$ and $M_2$ are some $2 \times n$ matrices.
$$\big[\,M_1 \;\; -M_1 \;\; M_2\,\big] \;\sim\; M_2 \tag{11}$$
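To make these operations concrete, the following minimal NumPy sketch (our own illustration, not the reference implementation [18]) applies the rules of Equation (10) to a binary input channel stored as a 2 × n array; the function names and the numerical tolerance are our own choices.

```python
import numpy as np

def drop_zero_columns(k):
    """Remove all-zero likelihood vectors (first rule of Equation (10))."""
    return k[:, np.any(k != 0, axis=0)]

def permute_columns(k, perm):
    """Reorder likelihood vectors; corresponds to k ~ k P (second rule)."""
    return k[:, perm]

def merge_equal_ratio_columns(k, tol=1e-12):
    """Merge columns with identical likelihood ratios (third rule):
    [(1+eps) v1] ~ [v1, eps*v1]; columns are grouped by their slope."""
    merged = []
    for v in k.T:
        for i, w in enumerate(merged):
            # identical slope <=> the 2D cross product is (numerically) zero
            if abs(v[0] * w[1] - v[1] * w[0]) < tol:
                merged[i] = w + v
                break
        else:
            merged.append(v.copy())
    return np.array(merged).T

# Example: a binary input channel (row 0: P(s | T=t), row 1: P(s | T!=t))
kappa = np.array([[0.5, 0.25, 0.25, 0.0],
                  [0.1, 0.45, 0.45, 0.0]])
kappa = drop_zero_columns(kappa)
kappa = merge_equal_ratio_columns(kappa)   # the two middle columns merge
print(kappa)               # [[0.5 0.5], [0.1 0.9]]
print(kappa.sum(axis=1))   # both rows still sum to one
```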

3.1. Motivation and Operational Interpretation

The aim of this section is to provide a first intuition based on a visual example for the methodology that will be used in Section 3.2 to construct a distributive lattice of the reachable decision regions and its consistent valuation. We only consider binary variables $\mathcal{T}_t = \{t, \bar t\}$ or the one-vs-rest encoding of others ($T_t = t \Leftrightarrow T = t$).
In the used example, the desired variable can be observed indirectly using the two variables $V_1$ and $V_2$. The visible variables are considered to be the output of the channels $T_t \xrightarrow{\kappa_1^t} V_1$, $T_t \xrightarrow{\kappa_2^t} V_2$ and $T_t \xrightarrow{\kappa_{12}^t} (V_1, V_2)$ and correspond to the zonotopes shown in Figure 5. We consider each reachable decision point (a pair of TPR and FPR) to represent a different notion of uncertainty about the state of the target variable. We want to attribute the reachable decision regions to each channel for constructing a lattice, as shown in Figure 6, with the following operational interpretation:
Figure 5. Relating the zonotope representations to TPR/FPR plots. The zonotopes correspond to the regions of a TPR/FPR plot that are reachable by some decision strategy. Regions outside of the zonotopes are known to be unreachable since the likelihood ratio test is optimal for binary decision problems. The convex hull of both zonotopes, $\kappa_1^t \lor \kappa_2^t$, is the (unique) lower bound of any joint distribution under the Blackwell order.
Figure 6. Decomposing the achievable decision regions for binary decision problems from an operational perspective. Each node is visualized by its cumulative and partial decision region. The partial decision region is shown within round brackets. The cumulative region corresponds to the matrix concatenation of the partial regions in its down-set under the defined equivalence relation. Three key elements are highlighted using a grey background.
  • Synergy: Corresponds to the partial contribution of $\kappa_{12}^t = \big(T_t \to (V_1, V_2)\big)$ and represents the decision region which is only accessible due to the (in-)dependence of both variables.
  • Joint: The joint element $\kappa_1^t \lor \kappa_2^t = (T_t \to V_1) \lor (T_t \to V_2)$ corresponds to the joint under the Blackwell order and represents the decision region which is always accessible if the marginal distributions $(V_1, T_t)$ and $(V_2, T_t)$ can be obtained. Therefore, we say that its information shall be fully attributed to $V_1$ and $V_2$ such that it has no partial contribution. For binary target variables, this definition is equivalent to the notion of union information by Bertschinger et al. [10] and Griffith and Koch [11]. However, we extend the analysis beyond binary target variables with a different approach in Section 3.3.
  • Unique: Corresponds to the partial contribution of $\kappa_1^t = (T_t \to V_1)$ or $\kappa_2^t = (T_t \to V_2)$ and represents the decision region that is lost when losing the variable. It only depends on their marginal distributions $(V_1, T_t)$ and $(V_2, T_t)$.
  • Shared: Corresponds to the cumulative contribution of $\kappa_1^t \land \kappa_2^t = (T_t \to V_1) \land (T_t \to V_2)$ and represents the decision region which is lost when losing either $V_1$ or $V_2$. Since it only depends on the marginal distributions, we interpret it as being part of both variables. The shared decision region can be split into two components: the decision region that is part of both individual variables and the component that is part of the convex hull but neither individual one. The latter component only exists if both variables provide unique information.
  • Redundant: The largest decision region $\kappa_1^t \sqcap \kappa_2^t = (T_t \to V_1) \sqcap (T_t \to V_2)$ which can be accessed from both $V_1$ and $V_2$. It corresponds to the meet under the Blackwell order and the part of shared information that can be represented by some random variable (pointwise extractable component of shared information). The redundant and shared regions are equal unless both variables provide some unique information.
Due to the invariance of re-ordering columns under the defined equivalence relation, $\kappa^t$ represents a set of likelihood vectors. All cumulative and partial decision regions of Figure 6 can be constructed using a convex hull operator (joint) and matrix concatenations under the defined equivalence relation (∼). For example, the shared decision region (meet) can be expressed through an inclusion–exclusion principle with the joint operator: $\kappa_1^t \land \kappa_2^t \sim [\,\kappa_1^t \;\; \kappa_2^t \;\; -(\kappa_1^t \lor \kappa_2^t)\,]$. This operator is not closed on channels since it introduces negative likelihood vectors. Therefore, we distinguish the notation between channels ($\kappa^t$) and atoms ($\alpha^t$). These matrices $\alpha^t$ sum to one similar to channels but may contain negative columns. Their partial contributions $\alpha_\delta^t$ sum to zero.
  • The unique contribution of $V_2$: $\qquad \alpha_\delta^t \sim [\,(\kappa_1^t \lor \kappa_2^t) \;\; -\kappa_1^t\,]$
  • The shared cumulative region of $V_1$ and $V_2$: $\qquad \beta^t \sim \kappa_1^t \land \kappa_2^t \sim [\,\kappa_1^t \;\; \kappa_2^t \;\; -(\kappa_1^t \lor \kappa_2^t)\,]$
  • The shared partial contribution: $\qquad \beta_\delta^t \sim \beta^t \sim (\kappa_1^t \land \kappa_2^t)$
  • Each cumulative region corresponds to the combination of partial contributions in its down-set. Notice that the partial contribution of the shared region is canceled by a section of each unique contribution due to an opposing sign, for example:
    $\kappa_2^t \sim [\,\alpha_\delta^t \;\; \beta_\delta^t\,] \sim [\,(\kappa_1^t \lor \kappa_2^t) \;\; -\kappa_1^t \;\; \kappa_1^t \;\; \kappa_2^t \;\; -(\kappa_1^t \lor \kappa_2^t)\,]$
In Section 3.3, we demonstrate a valuation function f that can quantify all cumulative and partial atoms of this lattice while ensuring their non-negativity and consistency with the defined equivalence relation (∼). A more detailed example of the valuation of partial decision regions is given in Appendix C, in the context of the following section.
Why does the decomposition of reachable decision regions as shown in Figure 6 provide a meaningful operational interpretation? Because combining the partial contributions of the up-set for a variable results in the decision region that becomes inaccessible when the variable is lost, while combining the partial contributions of the down-set results in the decision region that is accessible through the variable. For example, losing access to variable V 2 results in losing access to the decision regions provided uniquely by V 2 and its synergy with V 1 (the up-set on the lattice). Additionally, the cumulative component corresponds to the combination of all partial contributions in its down-set since opposing vectors cancel under the defined equivalence relation (∼) such as the shared and unique contributions. Therefore, we define a consistent valuation of this lattice in Section 3.2 by quantifying decision regions based on their spanning vectors and highlight that the expected value for each t T corresponds to the definition of mutual information.
Section 3.2 and Section 3.3 focus only on defining the meet and joint operators ($\land/\lor$) with their consistent valuation. To obtain the pointwise redundant and synergetic components for a PID, we can later add the corresponding channels when constructing the pointwise lattices $\mathbf{V} = \{V_1, V_2, (V_1, V_2), V_1 \sqcap V_2\}$ with the ordering of Figure 6 from the meet and joint operators.

3.2. Decomposition Lattice and Its Valuation

This section first defines the meet and joint operators (∧, ∨) and then constructs a consistent valuation for the resulting distributive lattice. For constructing a pointwise channel lattice based on the redundancy lattice, we notate the map of functions as shown in Equation (12) and consider the function $k^t(S_i) = (T_t \to S_i) = \kappa_i^t$ to obtain the pointwise channel $\kappa_i^t$ of a source $S_i$.
$$f\langle P\rangle = \{\,f(x) \mid x \in P\,\}, \qquad f\langle\langle P\rangle\rangle = \{\,f\langle x\rangle \mid x \in P\,\}, \qquad f\langle\langle\langle P\rangle\rangle\rangle = \{\,f\langle\langle x\rangle\rangle \mid x \in P\,\}. \tag{12}$$
The intersections shall correspond to some meet operation and the union to some joint operation on the pointwise channels, as shown in Equation (13), while maintaining the ordering relation of Williams and Beer [5]. This section aims to define suitable meet and joint operations together with a function for their consistent valuation. Each atom $\alpha^t, \beta^t \in B^t(\mathbf{V})$ now represents an expression of channels $\kappa^t$ with the operators $\land/\lor$, as shown in Appendix A. For example, the element $\{S_{12}, S_3\}$ is converted to the expression $\kappa_{12}^t \land \kappa_3^t$.
$$B^t(\mathbf{V}) = k^t\langle\langle A(\mathbf{V})\rangle\rangle \tag{13}$$
As seen in Section 3.1, we want to define the joint for a set of channels to be equivalent to their convex hull, matching the Blackwell order. This also ensures that the joint operation is closed on channels.
$$\kappa_1^t \lor \kappa_2^t \;\triangleq\; \kappa_1^t \sqcup \kappa_2^t \qquad (\text{joint is closed on channels}) \tag{14}$$
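The following sketch illustrates one way such a joint could be computed for two binary input channels: it collects the vertices of both optimal ROC curves, takes the upper convex hull of their union in the (FPR, TPR) plane and reads the joint channel off the hull's edge vectors. This is an assumed implementation for illustration, not the authors' code, and it handles exact zero columns only approximately.

```python
import numpy as np

def roc_vertices(k):
    """Vertices of the optimal ROC curve (upper zonotope boundary) of a binary
    input channel k (row 0: P(s|T=t), row 1: P(s|T!=t)): sort likelihood
    vectors by slope and accumulate. Returns points as (TPR, FPR)."""
    cols = k.T
    order = np.argsort(-(cols[:, 0] / np.maximum(cols[:, 1], 1e-12)))
    return np.vstack([[0.0, 0.0], np.cumsum(cols[order], axis=0)])

def blackwell_joint(k1, k2):
    """Joint channel: upper convex hull of the union of both reachable regions."""
    pts = [tuple(p) for p in np.vstack([roc_vertices(k1), roc_vertices(k2)])]
    pts = sorted(set((fpr, tpr) for tpr, fpr in pts))       # sort by FPR, then TPR
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    hull = []                                                # upper hull (monotone chain)
    for p in reversed(pts):
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    hull = hull[::-1]
    # edge vectors (dTPR, dFPR) of the hull are the joint channel's columns
    edges = np.diff(np.array(hull), axis=0)
    return np.vstack([edges[:, 1], edges[:, 0]])

k1 = np.array([[0.8, 0.2], [0.2, 0.8]])
k2 = np.array([[0.6, 0.4], [0.1, 0.9]])
joint = blackwell_joint(k1, k2)
print(joint, joint.sum(axis=1))   # columns are likelihood vectors; rows sum to one
```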
Since opposing vectors cancel under the defined equivalence relation, we can use a notion of the Möbius inverse to define the set of vectors spanning a partial decision region $\alpha_\delta^t$ for an atom $\alpha^t \in B^t(\mathbf{V})$, as shown in Equation (15), written as a recursive block matrix and using the strict down-set ($\dot\downarrow$) of the ordering based on the underlying redundancy lattice.
$$\alpha_\delta^t \;\sim\; \Big[\,\alpha^t \;\;\; \big[-\beta_\delta^t\big]_{\beta^t \,\in\, \dot\downarrow \alpha^t}\,\Big] \tag{15}$$
The definition of the meet operator (∧) and the extension of the joint operator (∨) from channels to atoms is now obtained from the constraint that the partial contribution for the joint of two incomparable atoms ($\alpha^t, \beta^t \in B^t(\mathbf{V})$ with $\alpha^t \land \beta^t \nsim \alpha^t$ and $\alpha^t \land \beta^t \nsim \beta^t$) shall be zero, as shown in Equation (16).
$$\alpha^t \land \beta^t \nsim \alpha^t \;\text{ and }\; \alpha^t \land \beta^t \nsim \beta^t \quad\Longrightarrow\quad \big(\alpha^t \lor \beta^t\big)_\delta \;\sim\; \begin{bmatrix}0\\0\end{bmatrix} \tag{16}$$
This creates the desired inclusion–exclusion principle and results in the equivalences of the meet for two and three atoms, as shown in Equation (17). Their resulting partial channels ($\alpha_\delta^t$) correspond to the set of vectors spanning the desired unique and shared decision regions of Figure 6.
$$\alpha^t \land \beta^t \;\sim\; \big[\,\alpha^t \;\; \beta^t \;\; -(\alpha^t \lor \beta^t)\,\big] \tag{17}$$
$$\alpha^t \land (\beta^t \land \gamma^t) \;\sim\; \big[\,\alpha^t \;\; \beta^t \;\; \gamma^t \;\; -(\alpha^t \lor \beta^t) \;\; -(\alpha^t \lor \gamma^t) \;\; -(\beta^t \lor \gamma^t) \;\; (\alpha^t \lor \beta^t \lor \gamma^t)\,\big]$$
From their construction, the meet and joint operators provide a distributive lattice for a set of channels under the defined equivalence relation as shown in Appendix B by satisfying idempotency, commutativity, associativity, absorption and distributivity. This can be used to define a corresponding ordering relation (Equation (18)).
$$\alpha^t \preceq \beta^t \quad\Longleftrightarrow\quad \alpha^t \land \beta^t \sim \alpha^t \quad\Longleftrightarrow\quad \alpha^t \lor \beta^t \sim \beta^t \tag{18}$$
To obtain a consistent valuation of this lattice, we consider a function f ( α t ) , as shown in Equation (19). First, this function has to be invariant under the defined equivalence relation, and second, it has to match the ordering of the constructed lattice.
$$f(\alpha^t) = \sum_{v \in \alpha^t} r(v) \quad\text{where } r \text{ is convex and satisfies } r(\epsilon v) = \epsilon\, r(v) \text{ and } r\!\left(\begin{bmatrix}0\\0\end{bmatrix}\right) = 0 \tag{19}$$
The function f shall apply a (convex) function $r(v)$ to each vector of the matrix of an atom, $v \in \alpha^t$. The function is invariant under the equivalence relation (∼, Equation (10)):
  • Zero vectors do not affect the quantification: $r\!\left(\begin{bmatrix}0\\0\end{bmatrix}\right) = 0$
  • The structure of f ensures invariance under reordering columns: $f(\kappa^t) = f(\kappa^t P)$
  • The property $r(\epsilon v) = \epsilon\, r(v)$ with $\epsilon \in \mathbb{R}$ ensures invariance under splitting/merging columns of identical likelihood ratios:
    $f\big([\,(1+\epsilon)\,v_1\,]\big) = (1+\epsilon)\, r(v_1) = r(v_1) + \epsilon\, r(v_1) = f\big([\,v_1 \;\; \epsilon\, v_1\,]\big)$
The function f is a consistent valuation of the ordering relation (⪯, Equation (18)) from the constructed lattice:
  • The convexity of r ensures that the quantification $f(\alpha^t)$ is a valuation as shown in Appendix C: $\beta^t \preceq \alpha^t \Rightarrow f(\beta^t) \le f(\alpha^t)$
  • The function f provides a sum-rule: $f(\alpha^t \lor \beta^t) = f\big([\,\alpha^t \;\; \beta^t \;\; -(\alpha^t \land \beta^t)\,]\big) = f(\alpha^t) + f(\beta^t) - f(\alpha^t \land \beta^t)$
  • The function f quantifies the bottom element correctly: $f(\bot) = r\!\left(\begin{bmatrix}0\\0\end{bmatrix}\right) = 0$
A parameterized function that forms a consistent lattice valuation with $0 \le p \le 1$ and that will be used in Section 3.3 is shown in Equation (20) (the convexity of $r_p$ is shown in Appendix D).
$$f_p(\alpha^t) = \sum_{v \in \alpha^t} r_p(v), \qquad r_p(v) = r_p\!\left(\begin{bmatrix}x\\y\end{bmatrix}\right) = x \log\frac{x}{p\,x + (1-p)\,y} \tag{20}$$
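A minimal sketch of the valuation in Equation (20), assuming channels and atoms are stored as 2 × n arrays; fully negated columns enter with a negative sign, consistent with $r_p(\epsilon v) = \epsilon\, r_p(v)$.

```python
import numpy as np

def r_p(v, p, base=2):
    """r_p([x, y]) = x * log(x / (p*x + (1-p)*y)); a zero vector contributes 0."""
    x, y = v
    if x == 0:
        return 0.0
    return x * np.log(x / (p * x + (1 - p) * y)) / np.log(base)

def f_p(alpha_t, p, base=2):
    """Valuation of an atom: sum of r_p over its columns (Equation (20))."""
    return sum(r_p(v, p, base) for v in np.asarray(alpha_t, dtype=float).T)

# Example: a channel and the same channel with two equal-ratio columns merged
k  = np.array([[0.5, 0.25, 0.25], [0.1, 0.45, 0.45]])
k2 = np.array([[0.5, 0.5], [0.1, 0.9]])
print(f_p(k, p=0.3), f_p(k2, p=0.3))   # identical: f_p is invariant under (~)
```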
This section demonstrated the construction of a distributive lattice and its consistent valuation, resulting in an algebra as shown in Equation (9).

3.3. Decomposing Mutual Information

This section demonstrates that mutual information is the expected value of a consistent valuation for the constructed pointwise lattices and discusses the resulting algebra. To show this, we define the parameter p and pointwise channel κ i t for the consistent valuation (Equation (20)) using a one-vs-rest encoding (Equation (21)).
$$p = P(T{=}t)\ \ (\text{parameter}), \qquad \kappa_i^t = \begin{bmatrix} P(S_i \mid T{=}t) \\ P(S_i \mid T{\ne}t) \end{bmatrix} = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \\ y_1 & y_2 & \cdots & y_m \end{bmatrix}\ \ (\text{binary input channel}) \tag{21}$$
The expected value of the resulting valuation in Equation (20) is equivalent to the definition of mutual information, as shown in Equation (22). Therefore, we can interpret mutual information as being the expected value of quantifying the reachable decision regions for each state of the target variable that represent a concept of pointwise uncertainty.
$$I(T;S_i) = \sum_{s\in\mathcal{S}_i}\sum_{t\in\mathcal{T}} P_{(S_i,T)}(s,t)\,\log\frac{P_{(S_i,T)}(s,t)}{P_{S_i}(s)\,P_T(t)} \tag{22}$$
$$\phantom{I(T;S_i)} = \mathbb{E}_T\Bigg[\sum_{s\in\mathcal{S}_i} \underbrace{P_{(S_i\mid T=t)}(s)}_{x_j}\,\log\frac{\overbrace{P_{(S_i\mid T=t)}(s)}^{x_j}}{\underbrace{P(T{=}t)}_{p}\,\underbrace{P_{(S_i\mid T=t)}(s)}_{x_j} + \underbrace{\big(1-P(T{=}t)\big)}_{1-p}\,\underbrace{P_{(S_i\mid T\ne t)}(s)}_{y_j}}\Bigg]$$
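The identity of Equation (22) can be checked numerically; the sketch below uses an arbitrary, made-up joint distribution and computes $I(T;S)$ once from its classical definition and once as the expected value of the pointwise valuation $f_p$.

```python
import numpy as np

def f_p(channel, p):
    """Valuation of Equation (20) in bits: sum of x*log2(x/(p*x+(1-p)*y))."""
    total = 0.0
    for x, y in channel.T:
        if x > 0:
            total += x * np.log2(x / (p * x + (1 - p) * y))
    return total

# An arbitrary joint distribution P(T, S) with |T| = 3 and |S| = 4
P = np.array([[0.10, 0.05, 0.05, 0.10],
              [0.05, 0.20, 0.05, 0.05],
              [0.05, 0.05, 0.20, 0.05]])
P_T, P_S = P.sum(axis=1), P.sum(axis=0)

# Classical mutual information I(T;S)
I = sum(P[t, s] * np.log2(P[t, s] / (P_T[t] * P_S[s]))
        for t in range(P.shape[0]) for s in range(P.shape[1]) if P[t, s] > 0)

# Expected value of the pointwise valuation: for each t build the binary
# input channel [P(s|T=t); P(s|T!=t)] and use p = P(T=t)
I_pw = 0.0
for t in range(P.shape[0]):
    p = P_T[t]
    channel = np.vstack([P[t] / p, (P_S - P[t]) / (1 - p)])
    I_pw += p * f_p(channel, p)

print(I, I_pw)   # both values agree (Equation (22))
```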
The expected value for a set of consistent lattice valuations corresponds to a weighted sum such that the resulting lattice remains consistent. Therefore, we can combine the pointwise lattices to extend the definition of mutual information for meet and joint elements, which we will think of as intersections and unions. Let $\alpha$ represent an expression of sources with the operators ∨ and ∧. Then, we can obtain its valuation from the pointwise lattices using the function $\hat I_T$, as shown in Equation (23). Notice that we do not define the operators for random variables but only use the notation for selecting the corresponding element on the underlying pointwise lattices. For example, we write $\alpha = (S_{12} \land S_3) \lor S_4$ to refer to the pointwise atom $\alpha^t = (\kappa_{12}^t \land \kappa_3^t) \lor \kappa_4^t$ on each pointwise lattice.
The special case of atoms that consist of a single source corresponds by construction to the definition of mutual information. However, we propose normalizing the measure, as shown in Equation (23), to capture a degree of inclusion between zero and one. This is possible for discrete variables and will lead to an easier intuition for the later definition of bi-valuations and product spaces by ensuring the same output range for these measures. As a possible interpretation for the special role of the target variable, we like to think of T as the considered origin of information within the system, which then propagates through channels to other variables.
$$\hat I_T(\alpha) \;\triangleq\; \frac{\mathbb{E}_T\big[f_{P_T(t)}(\alpha^t)\big]}{H(T)}, \qquad \hat I_T(T) = \frac{H(T)}{H(T)} = 1, \qquad \hat I_T(S_i) = \hat I_T(T \land S_i) = \frac{I(T;S_i)}{H(T)} \tag{23}$$
We obtain the following algebra with the bi-valuation I ^ T ( [ α ; β ] ) that quantifies a degree of inclusion from α within the context of β . We can think of I ^ T ( [ α ; β ] ) as asking how much of the information from β about T is shared with α .
$$\hat I_T(\alpha \lor \beta) = \hat I_T(\alpha) + \hat I_T(\beta) - \hat I_T(\alpha \land \beta) \qquad (\text{Sum rule})$$
$$\hat I_T([\alpha;\beta]) \;\triangleq\; \frac{\hat I_T(\alpha \land \beta)}{\hat I_T(\beta)} \qquad (\text{Bi-valuation})$$
$$\hat I_T([\alpha \lor \beta;\gamma]) = \hat I_T([\alpha;\gamma]) + \hat I_T([\beta;\gamma]) - \hat I_T([\alpha \land \beta;\gamma]) \qquad (\text{Conditioned sum rule})$$
$$\hat I_T([\beta \land \gamma;\alpha]) = \hat I_T([\gamma;\alpha \land \beta]) \cdot \hat I_T([\beta;\alpha]) \qquad (\text{Product rule})$$
$$\hat I_T([\beta;\alpha \land \gamma]) = \frac{\hat I_T([\gamma;\alpha \land \beta]) \cdot \hat I_T([\beta;\alpha])}{\hat I_T([\gamma;\alpha])} \qquad (\text{Bayes' theorem})$$
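As an illustration of how these rules interact, Bayes' theorem follows in one step from the product rule together with the commutativity of the meet (a short derivation sketch using only the rules stated above):
$$\hat I_T([\beta \land \gamma;\,\alpha]) = \hat I_T([\gamma;\,\alpha \land \beta]) \cdot \hat I_T([\beta;\,\alpha]) \qquad (\text{product rule})$$
$$\hat I_T([\gamma \land \beta;\,\alpha]) = \hat I_T([\beta;\,\alpha \land \gamma]) \cdot \hat I_T([\gamma;\,\alpha]) \qquad (\text{product rule, roles of } \beta \text{ and } \gamma \text{ swapped})$$
Since $\beta \land \gamma = \gamma \land \beta$, equating both right-hand sides and dividing by $\hat I_T([\gamma;\,\alpha])$ yields Bayes' theorem as stated above.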
Since the definitions satisfy an inclusion–exclusion principle, we obtain the interpretation of classical measures as proposed by Williams and Beer [5]: conditional mutual information $I(T; V_1 \mid V_2)$ measures the unique contribution of $V_1$ plus its synergy with $V_2$, and interaction information $I(T; V_1; V_2)$ measures the difference between synergy and shared information, which explains its possible negativity.
As highlighted by Knuth [13], the lattice product (the Cartesian product with ordering $(\alpha;\beta) \preceq (\tau;\upsilon) \Leftrightarrow \alpha \preceq \tau$ and $\beta \preceq \upsilon$) can be valuated using a product rule to maintain consistency with the ordering of the individual lattices. This creates an opportunity to define information product spaces for multiple reference variables. Since we normalized the measures, the valuation of the product space will also be normalized to the range from zero to one. The subscript notation $T_1 \times T_2$ shall indicate the product of the lattice constructed for $T_1$ with the lattice constructed for $T_2$.
$$\hat I_{(T_1 \times T_2)}\big((\alpha;\beta)\big) = \hat I_{T_1}(\alpha) \cdot \hat I_{T_2}(\beta) \qquad (\text{Valuation product rule})$$
$$\hat I_{(T_1 \times T_2)}\big([(\alpha;\tau);(\beta;\upsilon)]\big) = \hat I_{T_1}([\alpha;\beta]) \cdot \hat I_{T_2}([\tau;\upsilon]) \qquad (\text{Bi-valuation product rule})$$
The lattice product is distributive over the joint for disjoint elements [13], which leads to the equivalence in Equation (26). Unfortunately, it appears that only the bottom element is disjoint with other atoms in the constructed lattice.
$$\forall t:\ \alpha^t \land \beta^t \sim \bot \quad\Longrightarrow\quad \hat I_{(T_1 \times T_2)}\big((\alpha \lor \beta;\tau)\big) = \hat I_{(T_1 \times T_2)}\big((\alpha;\tau) \lor (\beta;\tau)\big) \tag{26}$$
Finally, we would like to provide an intuition for this approach based on possible operational scenarios:
  • Consider having characterized four radio links and obtained the conditional distributions $P_{V_1 \mid T}$, $P_{(V_2,V_3) \mid T}$ and $P_{V_4 \mid T}$. We are interested in their joint channel capacity; however, we lack the required joint distribution. In this case, we can use their joint $\sup_{P_T}\,\hat I_T(S_1 \lor S_{23} \lor S_4)$ to obtain a (pointwise) lower bound on their joint channel capacity.
  • Consider having two datasets $\{T_1, V_1, V_2, V_3\}$ and $\{T_2, V_2, V_3, V_4\}$ that provide different types of labels ($T_x$) and associated features ($V_y$), where some events were recorded in both datasets. In such cases, one may choose to study the cases $T_1 \to (V_1, V_2, V_3)$, $T_2 \to (V_2, V_3, V_4)$ and $(T_1, T_2) \to (V_1, V_2, V_3, V_4)$ for events appearing in both datasets, which could then be combined into a product lattice $\hat I_{(T_1 \times T_2 \times (T_1,T_2))}$.

4. Applications

This section focuses on applications of the obtained measure from Section 3.3. We first apply the meet operator to the redundancy lattice for constructing a PID. Since an atom of the redundancy lattice $\alpha \in A(\mathbf{V})$ corresponds to a set of sources for which the shared information shall be measured, we use the notation $\bigwedge\alpha$ to obtain an expression for the function $\hat I_T$. Section 4.2 additionally utilizes the properties of a Markov chain to demonstrate how the flow of partial information can be traced through system models.

4.1. Partial Information Decomposition

Based on Section 3.3, we can define a measure of shared information $\hat I(\alpha;T)$ for the elements of the redundancy lattice $\alpha \in A(\mathbf{V})$ in the framework of Williams and Beer [5], as shown in Equation (27). The measure satisfies the three axioms of Williams and Beer [5] (commutativity from the equivalence relation and structure of $f_p$, monotonicity from being a lattice valuation and self-redundancy from removing the normalization), and the decomposition is non-negative since the joint channel $\kappa_{12}^t$ is superior to the joint of two channels $\kappa_1^t \lor \kappa_2^t$ for all $t \in \mathcal{T}$. The partial contribution $\hat I_\delta(\alpha;T)$ corresponds to the expected value of the quantified partial decision regions $\alpha_\delta^t$.
This provides the interpretation of Section 3.1, where combining the partial contributions of the up-set corresponds to the expected value of quantifying the decision regions that are lost when losing the variable, while combining the partial contributions of the down-set corresponds to the expected value of quantifying the accessible decision region from this variable. Additionally, we obtain a pointwise version of the property by Bertschinger et al. [10]: if a variable provides unique information, then there is a way to utilize this information for a reward function to some target variable state. Finally, it can be seen that taking the minimal quantification of the different decision regions as done by Williams and Beer [5] leads to a lack in distinguishing distinct reachable decision regions or, as phrased in the literature: a lack of distinguishing “the same information and the same amount of information” [6,7,8,9].
$$\forall\,\alpha \in A(\mathbf{V}):\qquad \hat I(\alpha;T) = \hat I_T\Big(\bigwedge\alpha\Big) \cdot H(T), \tag{27a}$$
$$\hat I_\delta(\alpha;T) = \hat I(\alpha;T) - \sum_{\beta \,\in\, \dot\downarrow\alpha} \hat I_\delta(\beta;T) = \mathbb{E}_T\Big[f_{P_T(t)}\big(\alpha_\delta^t\big)\Big] \tag{27b}$$
An identical definition of $\hat I(\alpha;T)$ can be obtained only based on the Blackwell order, as shown in Equation (28). Let $\alpha \in A(\mathbf{V})$ be a set of sources and let $T_t$ represent a binary target variable ($\mathcal{T}_t = \{t, \bar t\}$) such that $T_t = t \Leftrightarrow T = t$. We can expand the meet operator used in Equation (27a) using the sum-rule and utilize the distributivity for arriving at the joint of two channels, which matches the Blackwell order (Equation (28b)). We write $S_i \sqcup_{T_t} S_j$ to refer to the joint of $S_i$ and $S_j$ under the Blackwell order with respect to variable $T_t$. This results in the recursive definition of $i(\alpha;T_t)$ that corresponds to the definition of mutual information for a single source (Equation (28a)). This expansion of Equation (27a) is particularly helpful since it eliminates the operators ∧/∨ for a simplified implementation.
$$i\big(\{S_i\};T_t\big) = \sum_{s\in\mathcal{S}_i} P_{(S_i\mid T_t=t)}(s)\,\log\frac{P_{(S_i\mid T_t=t)}(s)}{P(T_t{=}t)\,P_{(S_i\mid T_t=t)}(s) + \big(1-P(T_t{=}t)\big)\,P_{(S_i\mid T_t\ne t)}(s)} \tag{28a}$$
$$i\big(\{S_i\}\cup\beta;T_t\big) = i\big(\{S_i\};T_t\big) + i\big(\beta;T_t\big) - i\big(\{\,S_i \sqcup_{T_t} S_j \mid S_j \in \beta\,\};T_t\big) \tag{28b}$$
$$\hat I(\alpha;T) = \mathbb{E}_T\big[i(\alpha;T_t)\big] \tag{28c}$$
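As an illustration of Equation (28), the following sketch computes a two-source decomposition (shared, unique and synergetic components) directly from a joint distribution P(T, V1, V2). It is an independent, simplified implementation written for this illustration; the published implementation is referenced in [18], and all helper names are our own.

```python
import numpy as np

def pointwise_channel(P_joint, t):
    """Binary input channel T_t -> S from a joint table P(T, S):
    rows are P(s | T=t) and P(s | T != t)."""
    p = P_joint[t].sum()
    return np.vstack([P_joint[t] / p, (P_joint.sum(axis=0) - P_joint[t]) / (1 - p)])

def i_single(channel, p):
    """Equation (28a): pointwise information of a single source (in bits)."""
    return sum(x * np.log2(x / (p * x + (1 - p) * y)) for x, y in channel.T if x > 0)

def blackwell_joint(k1, k2):
    """Joint of two binary input channels: upper convex hull of both
    reachable regions in the (FPR, TPR) plane (see Section 3.2)."""
    def vertices(k):
        cols = k.T[np.argsort(-(k[0] / np.maximum(k[1], 1e-12)))]
        return np.vstack([[0.0, 0.0], np.cumsum(cols[:, ::-1], axis=0)])  # (FPR, TPR)
    pts = sorted(set(map(tuple, np.vstack([vertices(k1), vertices(k2)]))))
    hull = []
    for a in reversed(pts):  # upper hull (Andrew's monotone chain)
        while len(hull) >= 2 and ((hull[-1][0]-hull[-2][0])*(a[1]-hull[-2][1])
                                  - (hull[-1][1]-hull[-2][1])*(a[0]-hull[-2][0])) <= 0:
            hull.pop()
        hull.append(a)
    edges = np.diff(np.array(hull[::-1]), axis=0)
    return np.vstack([edges[:, 1], edges[:, 0]])

def pid_two_sources(P):
    """P[t, v1, v2] = P(T=t, V1=v1, V2=v2); returns (shared, unique1, unique2, synergy)."""
    P_T = P.sum(axis=(1, 2))
    I1 = I2 = I12 = Ijoint = 0.0
    for t, p in enumerate(P_T):
        k1 = pointwise_channel(P.sum(axis=2), t)                 # T_t -> V1
        k2 = pointwise_channel(P.sum(axis=1), t)                 # T_t -> V2
        k12 = pointwise_channel(P.reshape(P.shape[0], -1), t)    # T_t -> (V1, V2)
        I1 += p * i_single(k1, p)
        I2 += p * i_single(k2, p)
        I12 += p * i_single(k12, p)
        Ijoint += p * i_single(blackwell_joint(k1, k2), p)       # i(S1 v S2; T_t)
    shared = I1 + I2 - Ijoint                                    # Equation (28b)
    return shared, I1 - shared, I2 - shared, I12 - Ijoint

# Example: AND gate with uniform inputs, T = V1 AND V2
P = np.zeros((2, 2, 2))
for v1 in (0, 1):
    for v2 in (0, 1):
        P[v1 & v2, v1, v2] = 0.25
print(pid_two_sources(P))   # approx. (0.311, 0.0, 0.0, 0.5), cf. Table A7
```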
Our decomposition is equivalent to the measures of Bertschinger et al. [10], Griffith and Koch [11] and Williams and Beer [5] in two special cases:
  • For a binary target variable $\mathcal{T} = \{t, \bar t\}$ with two observable variables $V_1$ and $V_2$, our approach is identical to Bertschinger et al. [10] and Griffith and Koch [11] since $\kappa_1 \lor \kappa_2 \sim \kappa_1^t \lor \kappa_2^t \sim \kappa_1^{\bar t} \lor \kappa_2^{\bar t}$. Beyond binary target variables, the resulting definitions differ due to the pointwise construction (see Appendix E).
  • If, from a pointwise perspective ($T_t$), some variable is Blackwell superior to the other (not necessarily the same one each time), then our method is identical to Williams and Beer [5] since the defined meet operation will equal their minimum: $\kappa_1^t \land \kappa_2^t \sim \kappa_1^t \;\Rightarrow\; f_p(\kappa_1^t) \le f_p(\kappa_2^t) \;\Rightarrow\; \min\big(f_p(\kappa_1^t), f_p(\kappa_2^t)\big) = f_p(\kappa_1^t \land \kappa_2^t) = f_p(\kappa_1^t)$, and equivalently for the function $i(\alpha, T_t)$.
A decomposition of typical examples can be found in Appendix E. We also provide an implementation of the PID based on our approach [18].

4.2. Information Flow Analysis

Due to the achieved inclusion–exclusion principle, the data processing inequality of mutual information and the achieved non-negativity of partial information for an arbitrary number of variables, it is possible to trace the flow of information through Markov chains. The measure $\hat I_T$ appears suitable for this analysis due to the chaining properties of the underlying pointwise channels that are quantified. The analysis can be applied, among other things, to the analysis of communication networks or the design of data processing systems.
The flow of information in Markov chains has been studied by Niu and Quinn [19], who considered chaining individual variables $X_1 \to X_2 \to \cdots \to X_n$ and performed a decomposition on $\mathbf{V} = \{X_1, X_2, \ldots, X_n\}$. In contrast to this, we consider Markov chains that map sets of random variables from one step to the next. In this case, it is possible to perform an information decomposition at each step of the Markov chain and identify how the partial information components propagate from one set of variables to the next.
Let $T \to \mathbf{V} \to \mathbf{Q}$ be a Markov chain with the atoms $\alpha \in A(\mathbf{V})$ and $\beta \in A(\mathbf{Q})$, through which we trace the flow of partial information from $\alpha$ to $\beta$ about T. We can measure the shared information between both atoms $\alpha$ and $\beta$, as shown in Equation (29a), to obtain how much information their cumulative components share, $\hat J(\alpha \to \beta; T)$. Similar to the PID, we remove the normalization for the self-redundancy axiom. To identify how much of the cumulative information of $\beta$ is obtained from the partial information of $\alpha$, we subtract the strict down-set of $\alpha$ on the lattice $(A(\mathbf{V}), \preceq)$, as shown in Equation (29b), to obtain $\hat J^\delta(\alpha \to \beta; T)$. To compute how much of the partial information of $\alpha$ is shared with the partial contribution of $\beta$, we similarly remove the flow from the partial information of $\alpha$ into the strict down-set of $\beta$ on the lattice $(A(\mathbf{Q}), \preceq)$, as shown in Equation (29c), to obtain $\hat J^{\delta\delta}(\alpha \to \beta; T)$. This can be used to trace the origin of information for each atom $\beta \in A(\mathbf{Q})$ to the previous elements $\alpha \in A(\mathbf{V})$.
The approach is not limited to one step and can be extended for tracing the flow through Markov chains of arbitrary length, $\hat J^{\delta\delta\delta}(\alpha \to \beta \to \gamma; T)$. However, we only trace one step in this demonstration for simplicity.
$$\hat J(\alpha \to \beta; T) = \hat I_T\Big(\bigwedge\alpha \,\land\, \bigwedge\beta\Big) \cdot H(T) \tag{29a}$$
$$\hat J^{\delta}(\alpha \to \beta; T) = \hat J(\alpha \to \beta; T) - \sum_{\gamma \,\in\, \dot\downarrow\alpha} \hat J^{\delta}(\gamma \to \beta; T) \tag{29b}$$
$$\hat J^{\delta\delta}(\alpha \to \beta; T) = \hat J^{\delta}(\alpha \to \beta; T) - \sum_{\gamma \,\in\, \dot\downarrow\beta} \hat J^{\delta\delta}(\alpha \to \gamma; T) \tag{29c}$$
We demonstrate the Information Flow Analysis using a full-adder as a small logic circuit with the input variables $\mathbf{V} = \{A, B, C_{\mathrm{in}}\}$ and the output $\mathbf{T} = \{S, C_{\mathrm{out}}\}$, as shown in Equation (30). Any ideal implementation of this computation results in the same channel from $\mathbf{V}$ to $\mathbf{T}$. Therefore, they create an identical flow of the partial information from $\mathbf{V}$ to the partial information of $\mathbf{T}$. However, the specific implementation will determine how (over which intermediate representations and paths) the partial information is transported.
$$S = A \oplus B \oplus C_{\mathrm{in}}, \qquad C_{\mathrm{out}} = A \cdot B + A \cdot C_{\mathrm{in}} + B \cdot C_{\mathrm{in}} = (A \cdot B) + C_{\mathrm{in}} \cdot (A \oplus B)\ \ (\text{typical implementation}), \qquad T = (S, C_{\mathrm{out}}) \tag{30}$$
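A small sketch of the ideal computation in Equation (30), assuming uniform and independent inputs, which yields the target distribution used in this example:

```python
from itertools import product
from collections import Counter

def full_adder(a, b, c_in):
    """Ideal full-adder of Equation (30)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (c_in & (a ^ b))   # typical implementation
    return s, c_out

# Uniform, independent inputs: P(T = (s, c_out)) for the target variable
counts = Counter(full_adder(a, b, c) for a, b, c in product((0, 1), repeat=3))
P_T = {t: n / 8 for t, n in counts.items()}
print(P_T)   # {(0, 0): 0.125, (1, 0): 0.375, (0, 1): 0.375, (1, 1): 0.125}
```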
To make the example more interesting, we consider the implementation of a noisy full-adder, as shown in Figure 7, which allows for bit-flips on wires. We indicate the probability of a bit-flip below each line and imagine this value correlates to the wire length and proximity to others. Now, changing the implementation or even the layout of the same circuit would have an impact on the overall channel.
Figure 7. Noisy full-adder example for the Information Flow Analysis demonstration. The probability of a bit-flip is indicated below the wires. If a wire has two labels, the first label corresponds to the wire input and the second label to its output.
To perform the analysis, we first have to define the target variable: What is it that we want to measure information about? In this case, we select the joint distribution of the desired computation output T as the target variable and define the noisy computation result to be $\hat{\mathbf{T}} = \{\hat S, \hat C_{\mathrm{out}}\}$, as shown in Figure 7. We obtain both variables from their definition by assuming that the input variables $\mathbf{V}$ are independently and uniformly distributed and that bit-flips occur independently. However, it is worth noting that noise dependencies can be modeled in the joint distribution. This fully characterizes the Markov chain shown in Equation (31).
$$T = (S, C_{\mathrm{out}}) \;\to\; \mathbf{T} = \{S, C_{\mathrm{out}}\} \;\to\; \mathbf{V} = \{A, B, C_{\mathrm{in}}\} \;\to\; \mathbf{Q} = \{Q_1, Q_2, Q_3\} \;\to\; \mathbf{R} = \{R_1, R_2, R_3\} \;\to\; \hat{\mathbf{T}} = \{\hat S, \hat C_{\mathrm{out}}\} \tag{31}$$
We group two variables at each stage to reduce the number of interactions in the visualization. The resulting information flow of the full-adder is shown as a Sankey diagram in Figure 8. Each bar corresponds to the mutual information of a stage in the Markov chain with the target T. The bars' colors indicate the partial information decomposition of Equation (27). The information flow over one step using Equation (29) is indicated by the width of a line between the partial contributions of two stages. To follow the flow of a particular component over more than one step (for example, to see how the shared information of $\mathbf{T}$ propagates to the shared information of $\hat{\mathbf{T}}$), the analysis can be performed by tracing multiple steps after extending Equation (29).
Figure 8. Sankey diagram of the Information Flow Analysis for the noisy full-adder in Figure 7. Each bar corresponds to one stage in the Markov chain, and its height corresponds to this stage's mutual information with the target T. Each bar is decomposed into the information that the considered variables provide shared (orange), unique (blue/green) or synergetic (pink) about the target. If a stage is represented by a single variable or joint distribution, no further decomposition is performed (gray). We trace the information between variables over one step using the sub-chains $T \to T \to \mathbf{T}$, $T \to \mathbf{T} \to \mathbf{V}$, $T \to \mathbf{V} \to \mathbf{Q}$, $T \to \mathbf{Q} \to \mathbf{R}$ and $T \to \mathbf{R} \to \hat{\mathbf{T}}$ using Equation (29). The resulting flows between each bar visualize how the partial information propagates for one step in the Markov chain. For following the flow of a particular partial component over more than one step in the Sankey diagram, Equation (29) can be extended.
The results (Figure 8) show that the decomposition does not attribute unique information to $S$ or $C_{\mathrm{out}}$ about their own joint distribution. The reason for this is shown in Equation (32): both variables provide an equivalent channel for each state of their joint distribution and, thus, an equivalent uncertainty about each state of T. Phrased differently, both variables provide access to the identical decision regions for each state of their joint distribution and can therefore not provide unique information (no advantage for any reward function to any $t \in \mathcal{T}$). If this result feels counter-intuitive, we would also recommend the discussion of the two-bit-copy problem and identity axiom by Finn [9] (p. 16ff.) and Finn and Lizier [20]. The same effect can also be seen when viewing each variable in $\mathbf{V}$ individually (not shown in Figure 8), which causes neither of them to provide unique information on their own about the joint target distribution T.
$$(T_{(0,0)} \to C_{\mathrm{out}}) \;\sim\; (T_{(0,0)} \to S) \;\sim\; \begin{bmatrix} 1 & 0 \\ 3/7 & 4/7 \end{bmatrix} \;\sim\; (T_{(1,1)} \to C_{\mathrm{out}}) \;\sim\; (T_{(1,1)} \to S)$$
$$(T_{(0,1)} \to C_{\mathrm{out}}) \;\sim\; (T_{(0,1)} \to S) \;\sim\; \begin{bmatrix} 1 & 0 \\ 1/5 & 4/5 \end{bmatrix} \;\sim\; (T_{(1,0)} \to C_{\mathrm{out}}) \;\sim\; (T_{(1,0)} \to S) \tag{32}$$
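Equation (32) can be reproduced with a few lines of code under the stated assumption of uniform, independent inputs and the ideal full-adder of Equation (30); the noisy stages are not needed for this check, and the helper names are our own.

```python
from itertools import product
from fractions import Fraction

def full_adder(a, b, c_in):
    s = a ^ b ^ c_in
    return s, (a & b) | (c_in & (a ^ b))

outputs = [full_adder(*v) for v in product((0, 1), repeat=3)]  # uniform inputs

def pointwise_channel(t, component):
    """Channel T_t -> S (component=0) or T_t -> C_out (component=1):
    row 1 conditions on T = t, row 2 on T != t; the matching value comes first."""
    match = [o for o in outputs if o == t]
    rest = [o for o in outputs if o != t]
    row = lambda grp: [Fraction(sum(o[component] == t[component] for o in grp), len(grp)),
                       Fraction(sum(o[component] != t[component] for o in grp), len(grp))]
    return [row(match), row(rest)]

for t in [(0, 0), (1, 1), (0, 1), (1, 0)]:
    print(t, pointwise_channel(t, 0), pointwise_channel(t, 1))
# (0,0) and (1,1): [[1, 0], [3/7, 4/7]] for both S and C_out
# (0,1) and (1,0): [[1, 0], [1/5, 4/5]] for both S and C_out
```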
The Information Flow Analysis is particularly useful in practice since it can be performed on an arbitrary resolution of the system model to handle its complexity. For example, a small full-adder can be analyzed on the level of gates and wires represented by channels. However, the full-adder is itself a channel that can be used to analyze an n-bit adder on the level of full-adders.
Further applications of the Information Flow Analysis could include the identification of which inputs are most critical for the computational result and where information is being lost. It can also be explored if a notion of robustness in data processing systems could be meaningfully defined based on how much pointwise redundant or shared information of the input V can be traced to its output T ^ . This might indicate a notion of robustness based on whether or not it is possible to compensate for the unavailability of input sources through a system modification.
Finally, the target variable does not have to be the desired computational outcome as has been done in the demonstration. When thinking about secure multi-party computations, it might be of interest to identify the flow of information from the perspective of some sensitive or private variable (T) to understand the impact of disclosing the final computation result. The possible applications of such an analysis are as diverse as those of information theory.

5. Discussion

We propose the interpretation that the reachable decision regions correspond to different notions of uncertainty about each state of the target variable and that mutual information corresponds to the expected value of quantifying these decision regions. This allows partial information to represent the expected value of quantifying partial decision regions (Equations (27) and (28)), which can be used to attribute mutual information to the visible variables and their interactions (pointwise redundant/shared/unique/synergetic). Since the proposed quantification results in the consistent valuation of a distributive lattice, it creates a novel algebra for mutual information with possible practical applications (Equations (24) and (25)). Finally, the approach allows for tracing information components through Markov chains (Equation (29)), which can be used to model and study a wide range of scenarios. The presented method is directly applicable to discrete and categorical source variables due to their equivalent construction for the reachable decision regions (zonotopes). However, we recommend that the target variable should be categorical since the measure does not consider a notion of distance between target states (achievable estimation proximity). This would be an interesting direction for future work due to its practical application for introducing semantic meaning to sets of variables. An intuitive example is a target variable with 256 states that is used to represent an 8-bit unsigned integer as the computation result. For this reason, we wonder if it is possible to introduce a notion of distance to the analysis such that the classical definition of mutual information becomes the special case for encoding categorical targets.
A recent work by Kolchinsky [21] removes the assumption that an inclusion–exclusion principle relates the intersection and union of information and demands their extractability. This has the disadvantage that a similar algebra or tracing of information would no longer be possible. We tried to address this point by distinguishing the pointwise redundant from the pointwise shared element and also obtain no inclusion–exclusion principle for the pointwise redundancy. We focus in this work on the pointwise shared element due to the resulting properties and operational interpretation from the accessibility and losses of reachable decision regions. Moreover, the relation between the used meet and joint operators provides consistent results from performing the decomposition using the meet operator on a redundancy lattice, as done in this work, or a decomposition using the joint operator on a synergy or loss lattice [22].
Further notions of redundancy and synergy can be studied within this framework if they are extractable, meaning they can be represented by some random variable. Depending on the desired interpretation, the representing variable can be constructed for T and added to the set of visible variables or can be constructed for each pointwise variable T t and added to the pointwise lattices. We showed an example of the latter in Section 3.1 by adding the pointwise redundant element to the lattice, which we interpret as pointwise extractable components of shared information to quantify the decision regions that can be obtained from each source.
Since our approach satisfies the original axioms of Williams and Beer [5] and results in non-negative partial contributions for an arbitrary number of variables, it cannot satisfy the proposed identity axiom of Harder et al. [8]. This can also be seen by the decomposition examples in Appendix E (Table A2 and Figure A3). We do not consider this a limitation since all four axioms cannot be satisfied without obtaining negative partial information [23], which creates difficulties for interpreting results.
Finally, our approach does not appear to satisfy a target/left chain rule as proposed by Bertschinger et al. [7]. While our approach provides an algebra that can be used to handle multiple target variables, we think that further work on understanding the relations when decomposing with multiple target variables is needed. In particular, it would be helpful for the analysis of complex systems if the flow of already analyzed sub-chains could be reused and their interactions could be predicted.

6. Conclusions

We use the approach of Bertschinger et al. [10] and Griffith and Koch [11] to construct a pointwise partial information decomposition that provides non-negative results for an arbitrary number of variables and target states. The measure obtains an algebra from the resulting lattice structure and enables the analysis of complex multivariate systems in practice. To our knowledge, this is the first alternative to the original measure of Williams and Beer [5] that satisfies their three proposed axioms and results in a non-negative decomposition for an arbitrary number of variables.

Author Contributions

T.M. and C.R. conceived the idea; T.M. prepared the original draft; C.R. reviewed and edited the draft. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Swedish Civil Contingencies Agency (MSB) through the project RIOT grant number MSB 2018-12526.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used:
PID: Partial Information Decomposition
ROC: Receiver Operating Characteristic
TPR: True-Positive Rate ($\bar\beta$)
FPR: False-Positive Rate ($\alpha$)
We use the following notation conventions:
$T$, $\mathcal{T}$, $t$, $T_t$: $T$ (upper case) represents the target variable with an event $t$ (lower case) of its event space (calligraphic), $t \in \mathcal{T}$. $T_t$ represents a pointwise (binary) target variable which takes state one if $T = t$ and state two if $T \ne t$ ($T_t$ represents the one-vs-rest encoding of state $t$);
$\mathbf{V}$, $V_i$, $\mathcal{V}_i$, $v$: $\mathbf{V}$ represents a set of visible/observable/predictor variables $V_i$ with $v \in \mathcal{V}_i$;
$S_i$, $\mathcal{S}_i$: sources represent a set of visible variables, where the index $i$ lists the contained visible variables, such as $S_{12} = \{V_1, V_2\}$. The event $s \in \mathcal{S}_i$ corresponds to an event of the corresponding joint variable, e.g., $(V_1, V_2)$.
We represent channels ( κ , λ ) as row stochastic matrices with the following indexing:
P represents a permutation matrix;
$\kappa_i$ represents a channel from the target to a source $T \xrightarrow{\kappa_i} S_i$ using the joint distribution of the variables within the source, such as $T \xrightarrow{\kappa_{12}} (V_1, V_2)$;
$\kappa_i^t$ represents a pointwise channel from the target to a source $T_t \xrightarrow{\kappa_i^t} S_i$, such as $T_t \xrightarrow{\kappa_{12}^t} (V_1, V_2)$;
$Z_{\kappa_i^t}$: binary input channels $\kappa_i^t$ can be represented as a (row) stochastic matrix, which contains a likelihood vector $v_s = \begin{bmatrix} p(S_i{=}s \mid T{=}t) \\ p(S_i{=}s \mid T{\ne}t) \end{bmatrix}$ for each state $s \in \mathcal{S}_i$. $Z_{\kappa_i^t}$ represents the zonotope for this set of vectors;
$\kappa_1^t \sqcup \kappa_2^t$ represents the binary input channel corresponding to the convex hull of $Z_{\kappa_1^t}$ and $Z_{\kappa_2^t}$ (Blackwell order joint of binary input channels, $\kappa_1^t \lor \kappa_2^t \triangleq \kappa_1^t \sqcup \kappa_2^t$);
$\kappa_1^t \land \kappa_2^t$ represents the meet element for constructing a distributive lattice with the joint operator $\kappa_1^t \lor \kappa_2^t$;
$\kappa_1^t \sqcap \kappa_2^t$ represents the binary input channel corresponding to the intersection of $Z_{\kappa_1^t}$ and $Z_{\kappa_2^t}$ (Blackwell order meet of binary input channels);
$\alpha$, $\beta$: atoms represent an expression of random variables with the operators ($\land/\lor$). In Section 2.2 and Section 4, they represent sets of sources;
$\alpha^t$, $\beta^t$ represent an expression of pointwise channels with the operators ($\land/\lor$);
$\alpha_\delta^t$, $\beta_\delta^t$ represent a partial pointwise channel corresponding to $\alpha^t$.
We use the following convention for operations, functions and brackets:
$\mathcal{P}_1(\cdot)$ represents the power set without the empty set;
{ V 1 , V 2 } curly brackets with comma separation represent a set;
$[M_1 \; M_2]$ square brackets without comma separation represent a matrix, and the listing of matrices in this manner represents their concatenation;
$q([\alpha;\beta])$ square brackets with semicolon separation are used to refer to the bi-valuation $b(\alpha,\beta)$ of a consistent lattice valuation $q(\alpha)$. In a similar manner to Knuth [13], we use the notation $q([\alpha;\beta]) \triangleq b(\alpha,\beta)$;
$(\alpha;\beta)$ round brackets with semicolon separation represent an element of a Cartesian product $L_1 \times L_2$, where $\alpha \in L_1$ and $\beta \in L_2$;
$f\langle L\rangle$ angled brackets indicate that a function $f$ shall be mapped to each element of the set $L$. We may nest this notation, such as $f\langle\langle L\rangle\rangle$, to indicate a map to each element of the sets within $L$;
$\alpha$: False-Positive Rate, type I error;
$\bar\beta$: True-Positive Rate, $1 -$ type II error.
We distinguish between a joint channel $T \xrightarrow{\kappa_{12}} (V_1, V_2)$ and the joint of two channels $\kappa_1 \lor \kappa_2$. To avoid confusion, we write the first case as "joint channel ($\kappa$)" and the latter case as "joint of channels ($\kappa_i \lor \kappa_j$)" throughout this work.

Appendix A

The considered lattice relates the meet and joint elements ($\land/\lor$) through an inclusion–exclusion principle. Here, the partial contribution for the joint of any two incomparable elements ($\alpha^t, \beta^t \in B^t(\mathbf{V})$, $\alpha^t \land \beta^t \nsim \alpha^t$ and $\alpha^t \land \beta^t \nsim \beta^t$) shall be zero, which is indicated using a gray font in Figure A1.
Figure A1. The considered lattice relating the meet and joint operators. The joint of any two incomparable elements ($\alpha^t, \beta^t \in B^t(\mathbf{V})$, $\alpha^t \land \beta^t \nsim \alpha^t$ and $\alpha^t \land \beta^t \nsim \beta^t$) shall have no partial contribution to create an inclusion–exclusion principle between the operators and is highlighted using a gray font.

Appendix B

This section demonstrates that the defined meet and joint operators of Section 3.2 provide a distributive lattice under the defined equivalence relation (∼, Equation (10)).
Lemma A1.
The meet and joint operators (∧, ∨) define a distributive lattice for a set of channels under the defined equivalence relation (∼).
Proof. 
The definitions of the meet and joint satisfy associativity, commutativity, idempotency, absorption and distributivity on channels under the defined equivalence relation:
  • Idempotency: $\kappa_1^t \lor \kappa_1^t \sim \kappa_1^t$ and $\kappa_1^t \land \kappa_1^t \sim \kappa_1^t$.
    $\kappa_1^t \lor \kappa_1^t \sim \kappa_1^t \sqcup \kappa_1^t \sim \kappa_1^t$; $\quad \kappa_1^t \land \kappa_1^t \sim [\,\kappa_1^t \;\; \kappa_1^t \;\; -(\kappa_1^t \lor \kappa_1^t)\,] \sim [\,\kappa_1^t \;\; \kappa_1^t \;\; -\kappa_1^t\,] \sim \kappa_1^t$.
  • Commutativity: $\kappa_1^t \lor \kappa_2^t \sim \kappa_2^t \lor \kappa_1^t$ and $\kappa_1^t \land \kappa_2^t \sim \kappa_2^t \land \kappa_1^t$.
    $\kappa_1^t \lor \kappa_2^t \sim \kappa_1^t \sqcup \kappa_2^t \sim \kappa_2^t \sqcup \kappa_1^t \sim \kappa_2^t \lor \kappa_1^t$; $\quad \kappa_1^t \land \kappa_2^t \sim [\,\kappa_1^t \;\; \kappa_2^t \;\; -(\kappa_1^t \lor \kappa_2^t)\,] \sim [\,\kappa_2^t \;\; \kappa_1^t \;\; -(\kappa_2^t \lor \kappa_1^t)\,] \sim \kappa_2^t \land \kappa_1^t$.
  • Associativity: $\kappa_1^t \lor (\kappa_2^t \lor \kappa_3^t) \sim (\kappa_1^t \lor \kappa_2^t) \lor \kappa_3^t$ and $\kappa_1^t \land (\kappa_2^t \land \kappa_3^t) \sim (\kappa_1^t \land \kappa_2^t) \land \kappa_3^t$.
    $\kappa_1^t \lor (\kappa_2^t \lor \kappa_3^t) \sim \kappa_1^t \sqcup (\kappa_2^t \sqcup \kappa_3^t) \sim (\kappa_1^t \sqcup \kappa_2^t) \sqcup \kappa_3^t \sim (\kappa_1^t \lor \kappa_2^t) \lor \kappa_3^t$; the expansion of Equation (17), $\kappa_1^t \land (\kappa_2^t \land \kappa_3^t) \sim [\,\kappa_1^t \;\; \kappa_2^t \;\; \kappa_3^t \;\; -(\kappa_1^t \lor \kappa_2^t) \;\; -(\kappa_1^t \lor \kappa_3^t) \;\; -(\kappa_2^t \lor \kappa_3^t) \;\; (\kappa_1^t \lor \kappa_2^t \lor \kappa_3^t)\,]$, is symmetric in $\kappa_1^t$, $\kappa_2^t$ and $\kappa_3^t$ and therefore also equivalent to $(\kappa_1^t \land \kappa_2^t) \land \kappa_3^t$.
  • Absorption: $\kappa_1^t \land (\kappa_1^t \lor \kappa_2^t) \sim \kappa_1^t$ and $\kappa_1^t \lor (\kappa_1^t \land \kappa_2^t) \sim \kappa_1^t$.
    $\kappa_1^t \land (\kappa_1^t \lor \kappa_2^t) \sim [\,\kappa_1^t \;\; (\kappa_1^t \lor \kappa_2^t) \;\; -(\kappa_1^t \lor \kappa_1^t \lor \kappa_2^t)\,] \sim [\,\kappa_1^t \;\; (\kappa_1^t \lor \kappa_2^t) \;\; -(\kappa_1^t \lor \kappa_2^t)\,] \sim \kappa_1^t$; $\quad \kappa_1^t \lor (\kappa_1^t \land \kappa_2^t) \sim [\,\kappa_1^t \;\; (\kappa_1^t \land \kappa_2^t) \;\; -(\kappa_1^t \land \kappa_1^t \land \kappa_2^t)\,] \sim [\,\kappa_1^t \;\; (\kappa_1^t \land \kappa_2^t) \;\; -(\kappa_1^t \land \kappa_2^t)\,] \sim \kappa_1^t$.
  • Distributivity: $\kappa_1^t \land (\kappa_2^t \lor \kappa_3^t) \sim (\kappa_1^t \land \kappa_2^t) \lor (\kappa_1^t \land \kappa_3^t)$ and $\kappa_1^t \lor (\kappa_2^t \land \kappa_3^t) \sim (\kappa_1^t \lor \kappa_2^t) \land (\kappa_1^t \lor \kappa_3^t)$.
    Using associativity and idempotency, $(\kappa_1^t \land \kappa_2^t) \lor (\kappa_1^t \land \kappa_3^t) \sim [\,(\kappa_1^t \land \kappa_2^t) \;\; (\kappa_1^t \land \kappa_3^t) \;\; -(\kappa_1^t \land \kappa_2^t \land \kappa_3^t)\,] \sim [\,\kappa_1^t \;\; \kappa_2^t \;\; -(\kappa_1^t \lor \kappa_2^t) \;\; \kappa_1^t \;\; \kappa_3^t \;\; -(\kappa_1^t \lor \kappa_3^t) \;\; -\kappa_1^t \;\; -\kappa_2^t \;\; -\kappa_3^t \;\; (\kappa_1^t \lor \kappa_2^t) \;\; (\kappa_1^t \lor \kappa_3^t) \;\; (\kappa_2^t \lor \kappa_3^t) \;\; -(\kappa_1^t \lor \kappa_2^t \lor \kappa_3^t)\,] \sim [\,\kappa_1^t \;\; (\kappa_2^t \lor \kappa_3^t) \;\; -(\kappa_1^t \lor \kappa_2^t \lor \kappa_3^t)\,] \sim \kappa_1^t \land (\kappa_2^t \lor \kappa_3^t)$; $\quad \kappa_1^t \lor (\kappa_2^t \land \kappa_3^t) \sim [\,\kappa_1^t \;\; (\kappa_2^t \land \kappa_3^t) \;\; -(\kappa_1^t \land \kappa_2^t \land \kappa_3^t)\,] \sim [\,\kappa_1^t \;\; \kappa_2^t \;\; \kappa_3^t \;\; -(\kappa_2^t \lor \kappa_3^t) \;\; -\kappa_1^t \;\; -\kappa_2^t \;\; -\kappa_3^t \;\; (\kappa_1^t \lor \kappa_2^t) \;\; (\kappa_1^t \lor \kappa_3^t) \;\; (\kappa_2^t \lor \kappa_3^t) \;\; -(\kappa_1^t \lor \kappa_2^t \lor \kappa_3^t)\,] \sim [\,(\kappa_1^t \lor \kappa_2^t) \;\; (\kappa_1^t \lor \kappa_3^t) \;\; -(\kappa_1^t \lor \kappa_2^t \lor \kappa_3^t)\,] \sim (\kappa_1^t \lor \kappa_2^t) \land (\kappa_1^t \lor \kappa_3^t)$.
   □

Appendix C

This section demonstrates the quantification of a small example and proves that the function $f$ of Equation (19) creates a consistent valuation, $\alpha^t \land \beta^t \sim \beta^t \Rightarrow f(\beta^t) \le f(\alpha^t)$, for the pointwise lattice $(B^t(\mathbf{V}), \land, \lor)$.
The convexity of the function $r(v)$ results, in combination with the property that $r(\epsilon v) = \epsilon\, r(v)$ with $\epsilon \in \mathbb{R}$, in a triangle inequality, as shown in Equation (A1). This ensures that Blackwell superior channels obtain a larger quantification result and thus the non-negativity of channels: $f(\kappa^t \sqcup \lambda^t) \ge f(\kappa^t) \ge f\!\left(\begin{bmatrix}1\\1\end{bmatrix}\right) = 0$.
$$r\big(t\, v_1 + (1-t)\, v_2\big) \le t\, r(v_1) + (1-t)\, r(v_2) \quad (\text{convexity},\ 0 \le t \le 1)$$
$$r(v_1 + v_2) \le r(v_1) + r(v_2) \quad (\text{using } t = 0.5 \text{ and } r(\epsilon v) = \epsilon\, r(v)) \tag{A1}$$
To provide an intuition for the meet operator with a minimal example and highlight its relation to the intersection of zonotopes (redundant region), consider the two channels $\kappa_1^t$ and $\kappa_2^t$ of Equation (A2), as visualized in Figure A2. To simplify the notation, we use the property $[\,(1+\epsilon)\,v_1\,] \sim [\,v_1 \;\; \epsilon\, v_1\,]$ to differentiate the vectors $a_2$ and $a_3$ as well as $b_1$ and $b_2$.
$$\kappa_1^t \sim [\,(a_1)\;(a_2)\;(a_3)\,], \qquad \kappa_2^t \sim [\,(b_1)\;(b_2)\;(b_3)\,], \qquad \kappa_1^t \lor \kappa_2^t \sim [\,(a_1)\;(a_2+b_2)\;(b_3)\,] \tag{A2}$$
The resulting shared and redundant element is shown in Equation (A3). Due to the construction of the meet element through an inclusion–exclusion principle with the joint, the meet element always contains the vectors which span the redundant decision region as the first component.
$$\kappa_1^t \land \kappa_2^t \sim [\,(b_1)\;(a_3)\;(b_2)\;(a_2)\;\;{-(a_2+b_2)}\,], \qquad \kappa_1^t \sqcap \kappa_2^t \sim [\,(b_1)\;(a_3)\,] \tag{A3}$$
The second component of the meet element corresponds to the decision region of the joint, which is not part of either individual channel. This component is non-negative due to the triangle inequality.
$$0 \;\le\; f(\kappa_1^t \land \kappa_2^t) - f(\kappa_1^t \sqcap \kappa_2^t) \;=\; r(a_2) + r(b_2) - r(a_2 + b_2)$$
The same argument applies to the meet for an arbitrary number of channels since the inclusion–exclusion principle with the joint elements ensures that the vectors spanning the redundant region are contained in the meet element, and the triangle inequality ensures non-negativity for the additional components.
Figure A2. A minimal example to discuss the relation between the shared ($\kappa_1^t \land \kappa_2^t$) and redundant ($\kappa_1^t \sqcap \kappa_2^t$) decision regions. The channel $\kappa_1^t$ consists of the vectors $a_x$, and the channel $\kappa_2^t$ consists of the vectors $b_x$.
Lemma A2.
The function $f(\alpha^t)$ is a (consistent) valuation, $\alpha^t \land \beta^t \sim \beta^t \Rightarrow f(\beta^t) \le f(\alpha^t)$, on the pointwise lattice corresponding to $(B^t(\mathbf{V}), \land, \lor)$, as visualized in Appendix A.
Proof. 
Let $S^t = \{\kappa_1^t, \ldots, \kappa_a^t\}$ represent a set of pointwise channels. The meet element $\bigwedge_{\lambda^t \in S^t} \lambda^t$ is constructed through an inclusion–exclusion principle with the joint (convex hull). This ensures that the set of vectors spanning the zonotope intersection $\bigsqcap_{\lambda^t \in S^t} \lambda^t$ is contained within the meet element. Additionally, the meet contains a second component that is ensured to be positive from the triangle inequality of $r$: $f\big(\bigwedge_{\lambda^t \in S^t} \lambda^t\big) \ge f\big(\bigsqcap_{\lambda^t \in S^t} \lambda^t\big)$. Since the joint operator is closed on channels and is distributive, we can introduce a channel to enforce a minimal redundant decision region between the channels: $f(\kappa_0^t) \le f\big(\bigsqcap_{\lambda^t \in S^t} (\kappa_0^t \lor \lambda^t)\big) \le f\big(\bigwedge_{\lambda^t \in S^t} (\kappa_0^t \lor \lambda^t)\big) = f\big(\kappa_0^t \lor \bigwedge_{\lambda^t \in S^t} \lambda^t\big)$. Applying the sum-rule shows that $f\big(\kappa_0^t \land \bigwedge_{\lambda^t \in S^t} \lambda^t\big) \le f\big(\bigwedge_{\lambda^t \in S^t} \lambda^t\big)$.
We again make use of the distributive property, which allows writing any expression $\alpha^t$ in a conjunctive normal form. Since the joint operator is closed for channels, any expression $\alpha^t$ can be represented as the meet of a set of channels, $\alpha^t \sim \bigwedge_{\lambda^t \in \{\kappa_{p_1}^t, \ldots, \kappa_{p_i}^t\}} \lambda^t$. This demonstrates that the obtained inequality of the meet operator on channels also applies to atoms, $f(\alpha^t \land \beta^t) \le f(\alpha^t)$, such that $\alpha^t \land \beta^t \sim \beta^t \Rightarrow f(\beta^t) \le f(\alpha^t)$.

Appendix D

The considered function $f_p(\kappa^t)$ of Section 3.2 takes the sum of a convex function. The Hessian matrix $H_r$ of the function $r_p(x,y) = x \log_b\frac{x}{p\,x + (1-p)\,y}$ is positive-semidefinite in the required domain (symmetric, and its eigenvalues $e_1$ and $e_2$ are greater than or equal to zero for $x > 0$ and $b > 1$).
$$H_r = \frac{1}{\log(b)} \begin{bmatrix} \dfrac{(p-1)^2\, y^2}{x\,\big(p\,x + (1-p)\,y\big)^2} & \dfrac{-(p-1)^2\, y}{\big(p\,x + (1-p)\,y\big)^2} \\[2ex] \dfrac{-(p-1)^2\, y}{\big(p\,x + (1-p)\,y\big)^2} & \dfrac{(p-1)^2\, x}{\big(p\,x + (1-p)\,y\big)^2} \end{bmatrix}, \qquad e_1 = 0, \qquad e_2 = \frac{(p-1)^2\,(x^2 + y^2)}{x\,\log(b)\,\big(p\,x + (1-p)\,y\big)^2}$$
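The positive-semidefiniteness can also be checked symbolically, for example with SymPy (a verification sketch, not part of the original derivation): the determinant of $H_r$ vanishes, so one eigenvalue is zero and the other equals the trace, which is non-negative for $x > 0$ and $b > 1$.

```python
import sympy as sp

x, y, p, b = sp.symbols('x y p b', positive=True)
r = x * sp.log(x / (p * x + (1 - p) * y)) / sp.log(b)   # r_p(x, y) with logarithm base b

H = sp.hessian(r, (x, y))
print(sp.simplify(H.det()))     # 0  -> one eigenvalue e1 = 0
print(sp.simplify(H.trace()))   # the remaining eigenvalue e2 = trace(H), non-negative here
```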

Appendix E

We use the examples of Finn and Lizier [20] since they provided an extensive discussion of their motivation. We compare our decomposition results to $I_{\min}$ of Williams and Beer [5] and $I_\pm$ of Finn and Lizier [20]. Examples with two sources are additionally compared to $I_{\mathrm{BROJA}}$ of Bertschinger et al. [10] and Griffith and Koch [11]. We notate the results for shared information $S(V_1, V_2; T)$, unique information $U(V_x; T)$ and synergetic/complementing information $C(V_1, V_2; T)$. We use the implementation of $I_{\min}$, $I_{\mathrm{BROJA}}$ and $I_\pm$ provided by the dit Python package for discrete information theory [24].
Notice that our approach is identical to Williams and Beer [5] if one of the variables is pointwise (for each T t , not necessarily the same one each time) Blackwell superior to another, and that our approach is equal to Bertschinger et al. [10] and Griffith and Koch [11] for two visible variables at a binary target variable.
We would like to highlight Table A1 for the difference in our approach to Williams and Beer [5]. This is an arbitrary example, where the variables V 1 and V 2 are not Blackwell superior to each other from the perspective of T t , as visualized in Figure 6. For highlighting the difference in our approach to Bertschinger et al. [10] and Griffith and Koch [11], we require an example where the target variable is not binary, such as the two-bit copy example in Table A2.
It can be seen that our approach does not satisfy the identity axiom of Harder et al. [8]. This axiom demands that the decomposition of the two-bit-copy example (Table A2) attributes one bit of unique information to each variable, and it forces negative partial contributions in the three-bit even-parity example (Figure A3) [8,20].
Table A1. Two incomparable channels (visualized in Section 3.1). The table highlights the difference in our approach to Williams and Beer [5] while being identical to Bertschinger et al. [10] since the target variable is binary.

(a) Distribution
V1  V2  T   Pr
0   0   0   0.0625
0   0   1   0.3
1   0   0   0.0375
1   0   1   0.05
0   1   0   0.1875
0   1   1   0.15
1   1   0   0.2125

(b) Results
Method              S(V1,V2;T)   U(V1;T)   U(V2;T)   C(V1,V2;T)
Î_T · H(T)          0.1196       0.0272    0.0716    0.1205
I_min [5]           0.1468       0         0.0444    0.1477
I_± [20]            0.3214       −0.1746   −0.1302   0.3223
I_BROJA [10,11]     0.1196       0.0272    0.0716    0.1205
Table A2. Two-bit-copy (TBC) example. The results of our approach differ from Bertschinger et al. [10] and Griffith and Koch [11] since the target variable is not binary.

(a) Distribution
V1  V2  T   Pr
0   0   0   1/4
0   1   1   1/4
1   0   2   1/4
1   1   3   1/4

(b) Results
Method              S(V1,V2;T)   U(V1;T)   U(V2;T)   C(V1,V2;T)
Î_T · H(T)          1            0         0         1
I_min [5]           1            0         0         1
I_± [20]            1            0         0         1
I_BROJA [10,11]     0            1         1         0
Figure A3. Three-bit even-parity (Tbep) example. The results for I ^ T · H ( T ) , I min and I ± are identical. (a) Distribution. (b) Decomposition lattice. (c) Cumulative results (partial).
Table A3. XOR-gate (Xor) example. All compared measures provide the same results.

(a) Distribution
V1  V2  T   Pr
0   0   0   1/4
0   1   1   1/4
1   0   1   1/4
1   1   0   1/4

(b) Results
Method              S(V1,V2;T)   U(V1;T)   U(V2;T)   C(V1,V2;T)
Î_T · H(T)          0            0         0         1
I_min [5]           0            0         0         1
I_± [20]            0            0         0         1
I_BROJA [10,11]     0            0         0         1
Table A4. Pointwise unique (PwUnq) example. Our approach provides the same results as Williams and Beer [5] and Bertschinger et al. [10].
(a) Distribution

V1  V2  T   Pr
0   1   0   1/4
1   0   0   1/4
0   2   1   1/4
2   0   1   1/4

(b) Results

Method            S(V1,V2;T)   U(V1;T)    U(V2;T)    C(V1,V2;T)
Î_T · H(T)        0.5          0          0          0.5
I_min [5]         0.5          0          0          0.5
I_± [20]          0            0.5        0.5        0
I_BROJA [10,11]   0.5          0          0          0.5
Table A5. Redundant Error (RdnErr) example. Our approach provides the same results as Williams and Beer [5] and Bertschinger et al. [10].
(a) Distribution

V1  V2  T   Pr
0   0   0   3/8
1   1   1   3/8
0   1   0   1/8
1   0   1   1/8

(b) Results

Method            S(V1,V2;T)   U(V1;T)    U(V2;T)    C(V1,V2;T)
Î_T · H(T)        0.189        0.811      0          0
I_min [5]         0.189        0.811      0          0
I_± [20]          1            0          −0.811     0.811
I_BROJA [10,11]   0.189        0.811      0          0
Table A6. Unique (Unq) example. Our approach provides the same results as Williams and Beer [5] and Bertschinger et al. [10].
(a) Distribution

V1  V2  T   Pr
0   0   0   1/4
0   1   0   1/4
1   0   1   1/4
1   1   1   1/4

(b) Results

Method            S(V1,V2;T)   U(V1;T)    U(V2;T)    C(V1,V2;T)
Î_T · H(T)        0            1          0          0
I_min [5]         0            1          0          0
I_± [20]          1            0          −1         1
I_BROJA [10,11]   0            1          0          0
Table A7. And-gate (And) example. Our approach provides the same results as Williams and Beer [5] and Bertschinger et al. [10].
(a) Distribution

V1  V2  T   Pr
0   0   0   1/4
0   1   0   1/4
1   0   0   1/4
1   1   1   1/4

(b) Results

Method            S(V1,V2;T)   U(V1;T)    U(V2;T)    C(V1,V2;T)
Î_T · H(T)        0.311        0          0          0.5
I_min [5]         0.311        0          0          0.5
I_± [20]          0.561        −0.25      −0.25      0.75
I_BROJA [10,11]   0.311        0          0          0.5

References

1. Lizier, J.T.; Bertschinger, N.; Jost, J.; Wibral, M. Information Decomposition of Target Effects from Multi-Source Interactions: Perspectives on Previous, Current and Future Work. Entropy 2018, 20, 307.
2. Wibral, M.; Finn, C.; Wollstadt, P.; Lizier, J.T.; Priesemann, V. Quantifying Information Modification in Developing Neural Networks via Partial Information Decomposition. Entropy 2017, 19, 494.
3. Rassouli, B.; Rosas, F.E.; Gündüz, D. Data Disclosure Under Perfect Sample Privacy. IEEE Trans. Inf. Forensics Secur. 2020, 15, 2012–2025.
4. Rosas, F.E.; Mediano, P.A.M.; Rassouli, B.; Barrett, A.B. An operational information decomposition via synergistic disclosure. J. Phys. A Math. Theor. 2020, 53, 485001.
5. Williams, P.L.; Beer, R.D. Nonnegative Decomposition of Multivariate Information. arXiv 2010, arXiv:1004.2515.
6. Griffith, V.; Chong, E.K.P.; James, R.G.; Ellison, C.J.; Crutchfield, J.P. Intersection Information Based on Common Randomness. Entropy 2014, 16, 1985–2000.
7. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J. Shared Information—New Insights and Problems in Decomposing Information in Complex Systems. In Proceedings of the European Conference on Complex Systems 2012; Gilbert, T., Kirkilionis, M., Nicolis, G., Eds.; Springer: Cham, Switzerland, 2013; pp. 251–269.
8. Harder, M.; Salge, C.; Polani, D. Bivariate measure of redundant information. Phys. Rev. E 2013, 87, 012130.
9. Finn, C. A New Framework for Decomposing Multivariate Information. Ph.D. Thesis, University of Sydney, Sydney, NSW, Australia, 2019.
10. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J.; Ay, N. Quantifying Unique Information. Entropy 2014, 16, 2161–2183.
11. Griffith, V.; Koch, C. Quantifying Synergistic Mutual Information. In Guided Self-Organization: Inception; Springer: Berlin/Heidelberg, Germany, 2014; pp. 159–190.
12. Bertschinger, N.; Rauh, J. The Blackwell Relation Defines No Lattice. In Proceedings of the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014; pp. 2479–2483.
13. Knuth, K.H. Lattices and Their Consistent Quantification. Ann. Phys. 2019, 531, 1700370.
14. Blackwell, D. Equivalent Comparisons of Experiments. Ann. Math. Stat. 1953, 24, 265–272.
15. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
16. Schechtman, E.; Schechtman, G. The relationship between Gini terminology and the ROC curve. Metron 2019, 77, 171–178.
17. Neyman, J.; Pearson, E.S. IX. On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philos. Trans. R. Soc. Lond. Ser. A 1933, 231, 289–337.
18. Mages, T.; Rohner, C. Implementation: PID Quantifying Reachable Decision Regions. 2023. Available online: https://github.com/uu-core/pid-quantifying-reachable-decision-regions (accessed on 1 May 2023).
19. Niu, X.; Quinn, C.J. Information Flow in Markov Chains. In Proceedings of the 2021 60th IEEE Conference on Decision and Control (CDC), Austin, TX, USA, 14–17 December 2021; pp. 3442–3447.
20. Finn, C.; Lizier, J.T. Pointwise Partial Information Decomposition Using the Specificity and Ambiguity Lattices. Entropy 2018, 20, 297.
21. Kolchinsky, A. A Novel Approach to the Partial Information Decomposition. Entropy 2022, 24, 403.
22. Chicharro, D.; Panzeri, S. Synergy and Redundancy in Dual Decompositions of Mutual Information Gain and Information Loss. Entropy 2017, 19, 71.
23. Rauh, J.; Bertschinger, N.; Olbrich, E.; Jost, J. Reconsidering Unique Information: Towards a Multivariate Information Decomposition. In Proceedings of the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014; pp. 2232–2236.
24. James, R.G.; Ellison, C.J.; Crutchfield, J.P. dit: A Python package for discrete information theory. J. Open Source Softw. 2018, 3, 738.
