Article

Non-Negative Decomposition of Multivariate Information: From Minimum to Blackwell-Specific Information

Department of Information Technology, Uppsala University, 752 36 Uppsala, Sweden
*
Author to whom correspondence should be addressed.
Entropy 2024, 26(5), 424; https://doi.org/10.3390/e26050424
Submission received: 5 March 2024 / Revised: 6 May 2024 / Accepted: 11 May 2024 / Published: 15 May 2024

Abstract:
Partial information decompositions (PIDs) aim to categorize how a set of source variables provides information about a target variable redundantly, uniquely, or synergetically. The original proposal for such an analysis used a lattice-based approach and gained significant attention. However, finding a suitable underlying decomposition measure for an arbitrary number of discrete random variables is still an open research question. This work proposes a solution with a non-negative PID that satisfies an inclusion–exclusion relation for any f-information measure. The decomposition is constructed from a pointwise perspective of the target variable to take advantage of the equivalence between the Blackwell and zonogon order in this setting. Zonogons are the Neyman–Pearson regions for an indicator variable of each target state, and f-information is the expected value of quantifying their boundaries. We prove that the proposed decomposition satisfies the desired axioms and guarantees non-negative partial information results. Moreover, we demonstrate how the obtained decomposition can be transformed between different decomposition lattices and that it directly provides a non-negative decomposition of Rényi-information at a transformed inclusion–exclusion relation. Finally, we highlight that the decomposition behaves differently depending on the information measure used and how it can be used for tracing partial information flows through Markov chains.

1. Introduction

From computer science to neuroscience, we encounter the following problem: we would like to obtain information about a random variable T, called the target, which we cannot observe directly. However, we can obtain information about the target indirectly from another set of variables V = {V_1, …, V_n}. We can use information measures to quantify how much information any set of variables provides about the target. When doing so, we can identify the concept of redundancy: for example, if we have two identical variables V_1 = V_2, then we can use one variable to predict the other and, thus, anything that this other variable can predict. Similarly, we can identify the concept of synergy: for example, if we have two independent variables and a target that corresponds to their XOR operation T = (V_1 XOR V_2), then neither variable provides an advantage on its own for predicting the state of T, yet their combination fully determines it. Williams and Beer [1] suggested that it is possible to characterize information as visualized by the Venn diagram for two variables V = {V_1, V_2} in Figure 1a. This decomposition attributes the total information about the target to being redundant, synergetic, or unique to a particular variable. As indicated in Figure 1a by I(·; T), we can quantify three of the areas using information measures. However, this is insufficient to determine the four partial areas that represent the individual contributions. This makes it necessary to extend an information measure to quantify either the amount of redundancy or the amount of synergy between a set of variables.
Williams and Beer [1] first proposed a framework for partial information decomposition (PID), which found favor with the community [2]. However, the proposed measure of redundancy was criticized for not distinguishing “the same information and the same amount of information” [3,4,5,6]. The proposal of Williams and Beer [1] focused specifically on mutual information. This work additionally studies the decomposition of any f-information or Rényi-information for discrete random variables. These measures are significant, among others, in parameter estimation, high-dimensional statistics, hypothesis testing, channel coding, data compression, and privacy analyses [7,8].

1.1. Related Work

Most of the literature focuses on the decomposition of mutual information. Many alternative measures have been proposed, but none can fully replace the original measure of Williams and Beer [1], since they do not provide non-negative results for arbitrary |V|: The special case of bivariate partial information decompositions (|V| = 2) has been well studied, and several non-negative decompositions for the framework of Williams and Beer [1] are known [5,9,10,11,12]. However, each of these decompositions can yield negative partial information for |V| > 2. Further research [13,14,15] specifically aimed to define decompositions of mutual information for an arbitrary number of observable variables, but similarly obtained negative partial contributions and the resulting difficulty of interpreting their results. Griffith et al. [3] studied the decomposition of zero-error information and also obtained negative partial contributions. Kolchinsky [16] proposed a decomposition framework for an arbitrary number of observable variables that is applicable beyond Shannon information theory; however, its partial contributions do not sum to the total amount.
In this work, we propose a decomposition measure for replacing the one presented by Williams and Beer [1] while maintaining its desired properties. To achieve this, we combine several concepts from the literature: We use the Blackwell order, a preorder of information channels, for the decomposition and for deriving its operational interpretation, similar to Bertschinger et al. [9] and Kolchinsky [16]. We use its special case for binary input channels, the zonogon order studied by Bertschinger and Rauh [17], to achieve non-negativity at an arbitrary number of variables and provide it with a practical meaning by highlighting its equivalence to the Neyman–Pearson (decision) region. To utilize this special case for a general decomposition, we use the concept of a target pointwise decomposition as demonstrated by Williams and Beer [1] and related to Lizier et al. [18], Finn and Lizier [13], and Ince [14]. Specifically, we use Neyman–Pearson regions of an indicator variable for each target state to define distinct information and quantify pointwise information from its boundary. This allows for the non-negative decomposition of an arbitrary number of variables, where the source and target variables can have an arbitrary finite number of states. Finally, we apply the concepts from measuring on lattices, discussed by Knuth [19], to transform a non-negative decomposition with an inclusion–exclusion relation from one information measure to another while maintaining the decomposition properties.
Remark 1. 
We use the term “target pointwise” or simply “pointwise” within this work to refer to the analysis of each target state individually. This differs from [13,14,18], who use the latter term for the analysis of all joint source–target realizations.

1.2. Contributions

In a recent work [20], we presented a decomposition of mutual information on the redundancy lattice (Figure 1b). This work aims to simplify, generalize, and extend these ideas to make the following contributions to the area of partial information decompositions:
  • We propose a representation of distinct uncertainty and distinct information, which is used to demonstrate the unexpected behavior of the measure by Williams and Beer [1] (Section 2.2 and Section 3.1).
  • We propose a non-negative decomposition for any f-information measure at an arbitrary number of discrete random variables that satisfies an inclusion–exclusion relation and provides a meaningful operational interpretation (Section 3.2, Section 3.3 and Section 3.5). The decomposition satisfies the original axioms of Williams and Beer [1] (Theorems 3 and 4) and obtains different properties from different information measures (Section 4).
  • We demonstrate several transformations of the proposed decomposition: (i) We transform the cumulative measure between different decomposition lattices (Section 3.4). (ii) We demonstrate that the non-negative decomposition of f-information directly provides a non-negative decomposition of Rényi- and Bhattacharyya-information at a transformed inclusion–exclusion relation (Section 3.6).

2. Background

This section aims to provide the required background information and introduce the notation used. Section 2.1 discusses the Blackwell order and its special case at binary targets, the zonogon order, which will be used for operational interpretations and the representation of f-information for its decomposition. Section 2.2 discusses the PID framework of Williams and Beer [1] and the relation between a decomposition based on the redundancy lattice and one based on the synergy lattice. We also demonstrate the unintuitive behavior of the original decomposition measure, which will be resolved by our proposal in Section 3. Section 2.3 provides the considered definitions of f-information, Rényi-information, and Bhattacharyya-information for the later demonstration of transforming decomposition results between measures.
Notation 1 
(Random variables and their distribution). We use the notation T (upper case) to represent a random variable ranging over the event space 𝒯 (calligraphic) containing events t ∈ 𝒯 (lower case), and we use the notation P_T (P with subscript) to indicate its probability distribution. The same convention applies to other variables, such as a random variable S with events s ∈ 𝒮 and distribution P_S. We indicate the outer product of two probability distributions as P_S ⊗ P_T, which assigns the product of their marginals P_S(s) · P_T(t) to each event (s, t) of the Cartesian product 𝒮 × 𝒯. Unless stated otherwise, we use the notation T, S, and V to represent random variables throughout this work.

2.1. Blackwell and Zonogon Order

Definition 1 
(Channel). A channel μ = T → S from T to S represents a garbling of the input variable T, which results in variable S. Within this work, we represent an information channel μ as a (row) stochastic matrix, where each element is non-negative and all rows sum to one.
For the context of this work, we consider a variable S to be the observation of the output from an information channel T → S from the target variable T, such that the corresponding channel can be obtained from their conditional probability distribution, as shown in Equation (1), where 𝒯 = {t_1, …, t_n} and 𝒮 = {s_1, …, s_m}.
$$\mu = (T \to S) = P(S \mid T) = \begin{pmatrix} p(s_1 \mid t_1) & \cdots & p(s_m \mid t_1) \\ \vdots & \ddots & \vdots \\ p(s_1 \mid t_n) & \cdots & p(s_m \mid t_n) \end{pmatrix} \tag{1}$$
Notation 2 
(Binary input channels). Throughout this work, we reserve the symbol κ for binary input channels, meaning κ denotes a stochastic matrix of dimension 2 × m. We use the notation v ∈ κ to indicate a column of this matrix.
Definition 2 
(More informative [17,21]). An information channel μ_1 = T → S_1 is more informative than another channel μ_2 = T → S_2 if—for any decision problem involving a set of actions a ∈ Ω and a reward function u : (Ω, 𝒯) → ℝ that depends on the chosen action and the state of the variable T—an agent with access to S_1 can always achieve an expected reward at least as high as another agent with access to S_2.
Definition 3 
(Blackwell order [17,21]). The Blackwell order is a preorder of channels. A channel μ 1 is Blackwell superior to channel μ 2 , if we can pass its output through a second channel λ to obtain an equivalent channel to μ 2 , as shown in Equation (2).
$$\mu_2 \preceq \mu_1 \;\Leftrightarrow\; \mu_2 = \mu_1 \cdot \lambda \quad \text{for some stochastic matrix } \lambda \tag{2}$$
Blackwell [21] showed that a channel is more informative if and only if it is Blackwell superior. Bertschinger and Rauh [17] showed that the Blackwell order does not form a lattice for channels μ = T → S if |𝒯| > 2, since the ordering does not provide unique meet and join elements. However, binary target variables (|𝒯| = 2) are a special case where the Blackwell order is equivalent to the zonogon order (discussed next) and does form a lattice [17].
Definition 4 
(Zonogon [17]). The zonogon Z(κ) of a binary input channel κ = T → S is defined via the Minkowski sum of a collection of vector segments, as shown in Equation (3). The zonogon Z(κ) can equivalently be defined as the image of the unit cube [0, 1]^{|𝒮|} under the linear map κ.
$$Z(\kappa) \coloneqq \left\{ \sum_i x_i \mathbf{v}_i \;:\; 0 \le x_i \le 1,\ \mathbf{v}_i \in \kappa \right\} = \left\{ \kappa\,\mathbf{a} \;:\; \mathbf{a} \in [0,1]^{|\mathcal{S}|} \right\} \tag{3}$$
The zonogon Z(κ) is a centrally symmetric convex polygon, and the set of vectors v_i ∈ κ spans its perimeter. Figure 2 shows an example of a binary input channel and its corresponding zonogon.
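To make Definition 4 concrete, the following sketch computes the boundary chain of Z(κ) that the likelihood-ratio test traces (see Definition 7 below): sorting the columns of κ by decreasing likelihood ratio and accumulating them yields the chain of vertices from (0, 0) to (1, 1); the remaining perimeter is its point reflection about (0.5, 0.5). This is a minimal sketch assuming numpy; the function and variable names are our own, not notation from the cited works.

```python
import numpy as np

def boundary_chain(kappa):
    """Vertices of one boundary chain of the zonogon Z(kappa) of a 2 x m channel.

    The columns v = (p(s|T=t), p(s|T!=t)) are sorted by decreasing likelihood
    ratio, so their cumulative (Minkowski) sums trace half the perimeter.
    """
    cols = sorted(np.asarray(kappa, dtype=float).T.tolist(),
                  key=lambda v: -np.inf if v[1] == 0 else -(v[0] / v[1]))
    return np.cumsum([[0.0, 0.0]] + cols, axis=0)   # starts at (0,0), ends at (1,1)

# Example: binary input channel with three output states
kappa = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
print(boundary_chain(kappa))
```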
Definition 5 
(Zonogon sum). The addition of two zonogons corresponds to their Minkowski sum as shown in Equation (4).
$$Z(\kappa_1) + Z(\kappa_2) \coloneqq \left\{ \mathbf{a} + \mathbf{b} \;:\; \mathbf{a} \in Z(\kappa_1),\ \mathbf{b} \in Z(\kappa_2) \right\} = Z\!\left(\begin{bmatrix} \kappa_1 & \kappa_2 \end{bmatrix}\right) \tag{4}$$
Definition 6 
(Zonogon order [17]). A zonogon Z(κ_1) is zonogon superior to another zonogon Z(κ_2) if and only if Z(κ_2) ⊆ Z(κ_1).
Bertschinger and Rauh [17] showed that, for binary input channels, the zonogon order is equivalent to the Blackwell order and forms a lattice (Equation (5)). In the remaining work, we will only discuss binary input channels, such that the orderings of Definitions 2, 3, and 6 are equivalent and can be thought of as zonogons with a subset relation.
$$\kappa_1 \preceq \kappa_2 \;\Leftrightarrow\; Z(\kappa_1) \subseteq Z(\kappa_2) \tag{5}$$
To obtain an interpretation of what a channel zonogon Z(κ) represents, we can consider a binary decision problem: aiming to predict the state t ∈ 𝒯 of a binary target variable T using the output of the channel κ = T → S. Any decision strategy λ ∈ [0, 1]^{|𝒮|×2} for obtaining a binary prediction T̂ can be fully characterized by its resulting pair of True-Positive Rate (TPR) and False-Positive Rate (FPR), as shown in Equation (6):
$$\kappa \cdot \lambda = (T \to S \to \hat{T}) = P(\hat{T} \mid T) = \begin{pmatrix} p(\hat{T}{=}t \mid T{=}t) & p(\hat{T}{\neq}t \mid T{=}t) \\ p(\hat{T}{=}t \mid T{\neq}t) & p(\hat{T}{\neq}t \mid T{\neq}t) \end{pmatrix} = \begin{pmatrix} \mathrm{TPR} & 1-\mathrm{TPR} \\ \mathrm{FPR} & 1-\mathrm{FPR} \end{pmatrix} \tag{6}$$
Therefore, a channel zonogon Z(κ) provides the set of all achievable (TPR, FPR)-pairs for a given channel κ [20,22]. This can also be seen from Equation (3), where the unit cube a ∈ [0, 1]^{|𝒮|} represents all possible first columns of the decision strategy λ. The first column of λ fully determines the second, since each row has to sum to one. As a result, κa provides the (TPR, FPR)-pair for the decision strategy λ = (a, 1 − a), and Equation (3) describes the set of all achievable (TPR, FPR)-pairs for predicting the state of a binary target variable. Since this will be helpful for operational interpretations, we label the axes of zonogon plots accordingly, as shown in Figure 2. The zonogon ([17], p. 2480) is the Neyman–Pearson region ([7], p. 231).
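As a small numeric illustration of Equation (6), the sketch below (assuming numpy; the channel and strategy values are hypothetical) builds a randomized decision strategy λ from its first column a and reads the resulting (TPR, FPR)-pair off κ · λ.

```python
import numpy as np

kappa = np.array([[0.7, 0.2, 0.1],    # P(S | T = t)
                  [0.1, 0.3, 0.6]])   # P(S | T != t)

a = np.array([0.9, 0.5, 0.0])         # P(T_hat = t | S = s_i), freely chosen in [0, 1]
lam = np.column_stack([a, 1 - a])     # the second column is determined by the first

pred = kappa @ lam                    # equals [[TPR, 1 - TPR], [FPR, 1 - FPR]]
tpr, fpr = pred[0, 0], pred[1, 0]
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")   # a point of the zonogon Z(kappa)
```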
Definition 7 
(Neyman–Pearson region [7] and decision regions). The Neyman–Pearson region for a binary decision problem is the set of achievable (TPR,FPR)-pairs and can be visualized as shown in Figure 2. The Neyman–Pearson regions underlie the zonogon order, and their boundary can be obtained from the likelihood-ratio test. We refer to subsets of the Neyman–Pearson region as reachable decision regions, or simply decision regions, and the boundary as the zonogon perimeter.
Remark 2. 
Due to the zonogon symmetry, the diagram labels can be swapped (FPR x-axis/TPR y-axis), which changes the interpretation to aiming at a prediction for T t .
Notation 3 
(Channel lattice). We use the notation κ_1 ∧ κ_2 for the meet element of binary input channels under the Blackwell order and κ_1 ∨ κ_2 for their join element. We use the notation $\top_{\mathrm{BW}} = \left(\begin{smallmatrix}1 & 0\\ 0 & 1\end{smallmatrix}\right)$ for the top element of binary input channels under the Blackwell order and $\bot_{\mathrm{BW}} = \left(\begin{smallmatrix}1\\ 1\end{smallmatrix}\right)$ for the bottom element.
For binary input channels, the meet element of the Blackwell order corresponds to the zonogon intersection Z(κ_1 ∧ κ_2) = Z(κ_1) ∩ Z(κ_2), and the join element of the Blackwell order corresponds to the convex hull of their union Z(κ_1 ∨ κ_2) = Conv(Z(κ_1) ∪ Z(κ_2)). Equation (7) describes this for an arbitrary number of channels.
$$Z\Big(\bigwedge_{\kappa \in A} \kappa\Big) = \bigcap_{\kappa \in A} Z(\kappa) \qquad\text{and}\qquad Z\Big(\bigvee_{\kappa \in A} \kappa\Big) = \mathrm{Conv}\Big(\bigcup_{\kappa \in A} Z(\kappa)\Big) \tag{7}$$
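For binary input channels, the join in Equation (7) can be computed directly as the convex hull of the union of the two zonogons. The sketch below (our own helper names, assuming numpy) collects the boundary-chain vertices of both channels, forms the convex chain of their union from (0, 0) to (1, 1), and returns its edge vectors, which are the columns of a channel realizing κ_1 ∨ κ_2.

```python
import numpy as np

def boundary_chain(kappa):
    """Boundary-chain vertices of Z(kappa): columns sorted by decreasing likelihood ratio."""
    cols = sorted(np.asarray(kappa, dtype=float).T.tolist(),
                  key=lambda v: -np.inf if v[1] == 0 else -(v[0] / v[1]))
    return np.cumsum([[0.0, 0.0]] + cols, axis=0)

def hull_chain(points):
    """Convex chain from (0,0) to (1,1) enclosing the given boundary points (monotone chain)."""
    pts = sorted(map(tuple, points))
    chain = []
    for p in pts:
        while len(chain) >= 2:
            (ox, oy), (ax, ay) = chain[-2], chain[-1]
            # drop the middle vertex if it lies on or above the edge from chain[-2] to p
            if (ax - ox) * (p[1] - oy) - (ay - oy) * (p[0] - ox) <= 0:
                chain.pop()
            else:
                break
        chain.append(p)
    return np.array(chain)

def blackwell_join(kappa1, kappa2):
    """Channel whose zonogon is Conv(Z(kappa1) ∪ Z(kappa2)), cf. Equation (7)."""
    pts = np.vstack([boundary_chain(kappa1), boundary_chain(kappa2)])
    chain = hull_chain(pts)
    return np.diff(chain, axis=0).T   # edge vectors = columns of the joined channel

kappa1 = np.array([[0.8, 0.2], [0.3, 0.7]])
kappa2 = np.array([[0.6, 0.4], [0.1, 0.9]])
print(blackwell_join(kappa1, kappa2))   # rows again sum to one
```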
Example 1. 
The remaining work only analyzes indicator variables, so we only need to consider the case | T | = 2 where all presented ordering relations of this section are equivalent and form a lattice.
Figure 3a visualizes a channel T → S (given by κ) with |𝒮| = 3. We can use the observations of S to make a prediction T̂ about T. For example, we predict that T is in its first state with probability w_1 if S is in its first state, with probability w_2 if S is in its second state, and with probability w_3 if S is in its third state. Such a randomized decision strategy can be written as the stochastic matrix λ shown in Figure 3a. The resulting TPR and FPR of this decision strategy are obtained from the weighted sum of these parameters (w_1, w_2, and w_3) with the vectors in κ. Each decision strategy corresponds to a point within the zonogon, since the probabilities are constrained by w_1, w_2, w_3 ∈ [0, 1], and the resulting zonogon is the Neyman–Pearson region.
Figure 3b visualizes an example of the discussed ordering relations, where all observable variables have two states: |𝒮_i| = 2 for i ∈ {1, 2, 3}. The zonogon/Neyman–Pearson region corresponding to variable S_3 is fully contained within the others (Z(κ_3) ⊆ Z(κ_1) and Z(κ_3) ⊆ Z(κ_2)). Therefore, we can say that S_3 is Blackwell inferior (Definition 3) to and less informative (Definition 2) than S_1 and S_2 about T. Practically, this means that we can construct a variable equivalent to S_3 by garbling S_1 or S_2 and that, for any sequence of actions based on S_3 and any reward function with dependence on T, we can achieve an expected reward at least as high by acting based on S_1 or S_2 instead. The variables S_1 and S_2 are incomparable under the zonogon order, Blackwell order, and informativity order, since the Neyman–Pearson region of one is not fully contained in that of the other.
The zonogon shown in Figure 3a corresponds to the join, under the zonogon order, Blackwell order, and informativity order, of S_1 and S_2 in Figure 3b about T. For binary targets, this channel can be obtained directly from the convex hull of their Neyman–Pearson regions and corresponds to a valid joint distribution for (T, S_1, S_2); all other joint distributions are either equivalent or superior to it. When doing this on indicator variables for |𝒯| > 2, the obtained joint distributions for each t ∈ 𝒯 may not combine into a single valid overall joint distribution.

2.2. Partial Information Decomposition

The commonly used framework for PIDs was introduced by Williams and Beer [1]. A PID is computed with respect to a particular random variable that we would like to know information about, called the target, and tries to identify from which of the variables we have access to, called visible variables, this information can be obtained. Therefore, this section considers sets of variables, which represent their joint distribution.
Notation 4. 
Throughout this work, we use the notation T for the target variable and V = {V_1, …, V_n} for the set of visible variables. We use the notation 𝒫(V) for the power set of V and 𝒫_1(V) = 𝒫(V) ∖ {∅} for the power set without the empty set.
Definition 8 
(Sources, atoms [1]).
  • A source S_i ∈ 𝒫_1(V) is a non-empty set of visible variables.
  • An atom α ∈ 𝒜(V) is a set of sources constructed by Equation (8).
$$\mathcal{A}(V) = \left\{ \alpha \in \mathcal{P}(\mathcal{P}_1(V)) \;:\; \forall S_a, S_b \in \alpha,\ S_a \not\subset S_b \right\} \tag{8}$$
The filter used for obtaining the set of atoms (Equation (8)) removes sets that would be equivalent to other elements. This is required for obtaining a lattice from the following two ordering relations:
Definition 9 
(Redundancy/gain lattice [1]). The redundancy lattice (𝒜(V), ⪯) is obtained by applying the ordering relation of Equation (9) to all atoms α, β ∈ 𝒜(V).
$$\alpha \preceq \beta \;\Leftrightarrow\; \forall S_b \in \beta,\ \exists S_a \in \alpha,\ S_a \subseteq S_b \tag{9}$$
The redundancy lattice for three visible variables is visualized in Figure 4a. On this lattice, we can think of an atom as representing the information that can be obtained from all of its sources about the target T (their redundancy or informational intersection). For example, the atom α = { { V 1 , V 2 } , { V 1 , V 3 } } represents on the redundancy lattice the information that is contained in both ( V 1 , V 2 ) and ( V 1 , V 3 ) about T. Since both sources in α provide the information of V 1 , their redundancy contains at least this information, and the atom β = { { V 1 } } is considered its predecessor. Therefore, the ordering indicates an informational subset relation for the redundancy of atoms, and the information that is represented by an atom increases as we move up. The up-set of an atom α on the redundancy lattice indicates the information that is lost when losing all of its sources. Considering the example from above, if we lose access to { V 1 ( or ) V 2 } and { V 1 ( or ) V 3 } , then we lose access to all atoms in the up-set of α = { { V 1 , V 2 } , { V 1 , V 3 } } .
Definition 10 
(Synergy/loss lattice [23]). The synergy lattice (𝒜(V), ⊑) is obtained by applying the ordering relation of Equation (10) to all atoms α, β ∈ 𝒜(V).
$$\alpha \sqsubseteq \beta \;\Leftrightarrow\; \forall S_b \in \beta,\ \exists S_a \in \alpha,\ S_b \subseteq S_a \tag{10}$$
The synergy lattice for three visible variables is visualized in Figure 4b. On this lattice, we can think of an atom as representing the information that is contained in neither of its sources (information outside their union). For example, the atom α = { { V 1 , V 2 } , { V 1 , V 3 } } represents on the synergy lattice the information that is obtained from neither ( V 1 , V 2 ) nor ( V 1 , V 3 ) about T. The ordering again indicates their expected subset relation: the information that is obtained from neither { V 1 ( and ) V 2 } nor { V 1 ( and ) V 3 } is fully contained in the information that cannot be obtained from β = { { V 1 } } , and thus, α is a predecessor of β .
With an intuition for both ordering relations in mind, we can see how the filter in the construction of atoms (Equation (8)) removes sets that would be equivalent to another atom: the set { { V 1 , V 2 } , { V 1 } } is removed from the power set of sources since it would be equivalent to the atom { { V 1 } } under the ordering of the redundancy lattice and equivalent to the atom { { V 1 , V 2 } } under the ordering of the synergy lattice. Using Definition 11, one can similarly define the atoms of the decomposition lattices from the power set of sources without the equivalence relation.
Definition 11. 
We define equivalence relations for sets of sources under the redundancy and synergy order:
$$\text{Redundancy order:}\quad (\alpha \cong \beta) \;\Leftrightarrow\; (\alpha \preceq \beta \ \text{and}\ \beta \preceq \alpha)$$
$$\text{Synergy order:}\quad (\alpha \simeq \beta) \;\Leftrightarrow\; (\alpha \sqsubseteq \beta \ \text{and}\ \beta \sqsubseteq \alpha)$$
We use the notation $A \overset{\{\simeq\}}{=} B$ to indicate that two sets of atoms are equal when comparing their contained atoms with respect to equivalence under the synergy order.
Notation 5 
(Redundancy/synergy lattices). We use the notation (𝒜(V), ∨, ∧) for the join and meet operators on the redundancy lattice, and (𝒜(V), ⊔, ⊓) for the join and meet operators on the synergy lattice. We use the notation ⊤_RL = {V} for the top and ⊥_RL = ∅ for the bottom atom on the redundancy lattice, and ⊤_SL = ∅ and ⊥_SL = {V} for the top and bottom atom on the synergy lattice. For an atom α on the redundancy lattice, we use the notation ↓_R α for its down-set, ↓̇_R α for its strict down-set, ↑_R α for its up-set, ↑̇_R α for its strict up-set, and ⌈α⌉_R for its cover set. For an atom α on the synergy lattice, we use the notation ↓_S α for its down-set, ↓̇_S α for its strict down-set, ↑_S α for its up-set, ↑̇_S α for its strict up-set, and ⌈α⌉_S for its cover set.
For convenience, Table 1 provides a summary of the notation used.
The redundant, unique, or synergetic information (partial contributions) can be calculated based on either lattice. They are obtained by quantifying each atom of the redundancy or synergy lattice with a cumulative measure that increases as we move up in the lattice. The partial contributions are then obtained in a second step from a Möbius inverse.
Definition 12 
([Cumulative] redundancy measure [1]). A redundancy measure I_∩(α; T) is a function that assigns a real value to each atom of the redundancy lattice. It is interpreted as a cumulative information measure that quantifies the redundancy between all sources S ∈ α of an atom α ∈ 𝒜(V) about the target T.
Definition 13 
([Cumulative] loss measure [23]). A loss measure I_∂(α; T) is a function that assigns a real value to each atom of the synergy lattice. It is interpreted as a cumulative measure that quantifies the information about T that is provided by none of the sources S ∈ α of an atom α ∈ 𝒜(V).
To ensure that a redundancy measure actually captures the desired concept of redundancy, Williams and Beer [1] defined three axioms that a measure I should satisfy. For the synergy lattice, we consider the equivalent axioms discussed by Chicharro and Panzeri [23]:
Axiom 1 
(Commutativity [1,23]). Invariance in the order of sources (σ permuting the order of indices):
$$I_\cap(\{S_1, \ldots, S_i\}; T) = I_\cap(\{S_{\sigma(1)}, \ldots, S_{\sigma(i)}\}; T) \qquad\qquad I_\partial(\{S_1, \ldots, S_i\}; T) = I_\partial(\{S_{\sigma(1)}, \ldots, S_{\sigma(i)}\}; T)$$
Axiom 2 
(Monotonicity [1,23]). Additional sources can only decrease redundant information. Additional sources can only decrease the information that is in neither source.
$$I_\cap(\{S_1, \ldots, S_{i-1}\}; T) \ge I_\cap(\{S_1, \ldots, S_i\}; T) \qquad\qquad I_\partial(\{S_1, \ldots, S_{i-1}\}; T) \ge I_\partial(\{S_1, \ldots, S_i\}; T)$$
Axiom 3 
(Self-redundancy [1,23]). For a single source, redundancy equals mutual information. For a single source, the information loss equals the difference between the total available mutual information and the mutual information of the considered source with the target.
$$I_\cap(\{S_i\}; T) = I(S_i; T) \qquad\text{and}\qquad I_\partial(\{S_i\}; T) = I(V; T) - I(S_i; T)$$
The first axiom states that an atom’s redundancy and information loss should not depend on the order of its sources. The second axiom states that adding sources to an atom can only decrease the redundancy of all sources (redundancy lattice) and decrease the information from neither source (synergy lattice). The third axiom binds the measures to be consistent with mutual information and ensures that the bottom element of both lattices is quantified to zero.
Once a lattice with the corresponding cumulative measure (I_∩ or I_∂) is defined, we can use the Möbius inverse to compute the partial contribution of each atom. This partial information can be visualized as a partial area in the Venn diagram (see Figure 1a) and corresponds to the desired redundant, unique, and synergetic contributions. However, the same atom represents different partial contributions on each lattice: As visualized for the case of two visible variables in Figure 1, the unique information of variable V_1 is represented by α = {{V_1}} on the redundancy lattice and by β = {{V_2}} on the synergy lattice.
Definition 14 
(Partial information [1,23]). Partial information ΔI_∩(α; T) and ΔI_∂(α; T) corresponds to the Möbius inverse of the corresponding cumulative measure on the respective lattice.
$$\text{Redundancy lattice:}\quad \Delta I_\cap(\alpha; T) = I_\cap(\alpha; T) - \sum_{\beta \,\in\, \dot{\downarrow}_R \alpha} \Delta I_\cap(\beta; T),$$
$$\text{Synergy lattice:}\quad \Delta I_\partial(\alpha; T) = I_\partial(\alpha; T) - \sum_{\beta \,\in\, \dot{\downarrow}_S \alpha} \Delta I_\partial(\beta; T).$$
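The Möbius inverse of Definition 14 can be computed by one bottom-up pass over the lattice. The sketch below is a generic helper (our own naming; the lattice encoding and the cumulative values in the usage example are hypothetical, and the bottom atom ∅, which is quantified to zero, is omitted for brevity).

```python
def moebius_inverse(atoms, cumulative, strict_down_set):
    """Partial contributions (Definition 14) from a cumulative measure on a lattice.

    atoms           : iterable of hashable lattice elements
    cumulative      : dict mapping atom -> cumulative value, e.g. I_cap(alpha; T)
    strict_down_set : function mapping an atom to the atoms strictly below it
    """
    # Any order in which every atom appears after its strict down-set works;
    # sorting by down-set size is one such (topological) order.
    order = sorted(atoms, key=lambda a: len(strict_down_set(a)))
    partial = {}
    for alpha in order:
        partial[alpha] = cumulative[alpha] - sum(partial[b] for b in strict_down_set(alpha))
    return partial

# Usage on the two-variable redundancy lattice (hypothetical cumulative values):
# 'R' = {{V1},{V2}}, 'U1' = {{V1}}, 'U2' = {{V2}}, 'S' = {{V1,V2}}
down = {"R": [], "U1": ["R"], "U2": ["R"], "S": ["R", "U1", "U2"]}
cum = {"R": 0.2, "U1": 0.5, "U2": 0.3, "S": 1.0}
print(moebius_inverse(down, cum, lambda a: down[a]))
# {'R': 0.2, 'U1': 0.3, 'U2': 0.1, 'S': 0.4}
```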
Remark 3. 
Using the Möbius inverse for defining partial information enforces an inclusion–exclusion relation in that all partial information contributions have to sum to the corresponding cumulative measure. Kolchinsky [16] argues that an inclusion–exclusion relation should not be expected to hold for PIDs and proposes an alternative decomposition framework. In this case, the sum of partial contributions (unique/redundant/synergetic information) is no longer expected to sum to the total amount I ( V ; T ) .
Property 1 
(Local positivity, non-negativity [1]). A partial information decomposition satisfies non-negativity or local positivity if its partial information contributions are always non-negative, as shown in Equation (13).
$$\forall \alpha \in \mathcal{A}(V).\quad \Delta I_\cap(\alpha; T) \ge 0 \quad\text{or}\quad \Delta I_\partial(\alpha; T) \ge 0 \tag{13}$$
The non-negativity property is important if we assume an inclusion–exclusion relation since it states that the unique, redundant, or synergetic information cannot be negative. If an atom α provides a negative partial contribution in the framework of Williams and Beer [1], then this may indicate that we over-counted some information in its down-set.
Remark 4. 
Several additional axioms and properties have been suggested since the original proposal of Williams and Beer [1], such as target monotonicity and the target chain rule [4]. However, this work will only consider the axioms and properties of Williams and Beer [1]. To the best of our knowledge, no other measure since the original proposal (discussed below) has been able to satisfy these properties for an arbitrary number of visible variables while ensuring an inclusion–exclusion relation for their partial contributions.
It is possible to convert between both representations due to a lattice duality:
Definition 15 
(Lattice duality and dual-decompositions [23]). Let C = (𝒜(V) ∖ {⊥_RL}, ⪯) be a redundancy lattice with associated measure I_∩, and let D = (𝒜(V) ∖ {⊥_SL}, ⊑) be a synergy lattice with measure I_∂; then, the two decompositions are said to be dual if and only if the down-set on one lattice corresponds to the up-set on the other, as shown in Equation (14).
$$\forall \alpha \in C,\ \exists \beta \in D : \Delta I_\cap(\alpha; T) = \Delta I_\partial(\beta; T)$$
$$\forall \alpha \in D,\ \exists \beta \in C : \Delta I_\partial(\alpha; T) = \Delta I_\cap(\beta; T)$$
$$\forall \alpha \in C,\ \exists \beta \in D : I_\cap(\alpha; T) = \sum_{\gamma \in \downarrow_R \alpha} \Delta I_\cap(\gamma; T) = \sum_{\gamma \in \uparrow_S \beta} \Delta I_\partial(\gamma; T)$$
$$\forall \alpha \in D,\ \exists \beta \in C : I_\partial(\alpha; T) = \sum_{\gamma \in \downarrow_S \alpha} \Delta I_\partial(\gamma; T) = \sum_{\gamma \in \uparrow_R \beta} \Delta I_\cap(\gamma; T)$$
$$I_\cap(\bot_{RL}; T) = I_\partial(\bot_{SL}; T) = 0 = \Delta I_\cap(\bot_{RL}; T) = \Delta I_\partial(\bot_{SL}; T) \tag{14}$$
Williams and Beer [1] proposed I min , as shown in Equation (15), to be used as a measure of redundancy and demonstrated that it satisfies the three required axioms and local positivity. They define redundancy (Equation (15b)) as the expected value of the minimum specific information (Equation (15a)).
Remark 5. 
Throughout this work, we use the term “target pointwise information” or simply “pointwise information” to refer to “specific information”. This shall avoid confusion when naming their corresponding binary input channels in Section 3.
$$I(S_i; T{=}t) = \sum_{s \in \mathcal{S}_i} p(s \mid t) \left[ \log \frac{1}{p(t)} - \log \frac{1}{p(t \mid s)} \right] \tag{15a}$$
$$I_{\min}(S_1, \ldots, S_k; T) = \sum_{t \in \mathcal{T}} p(t) \min_{i \in 1..k} I(S_i; T{=}t). \tag{15b}$$
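For reference, Equation (15) can be computed directly from the channel matrices and the target distribution. A minimal sketch (assuming numpy; the example channels are hypothetical):

```python
import numpy as np

def specific_information(p_t, p_s_given_t, p_s):
    """I(S; T=t) of Equation (15a), with logarithms in bits."""
    p_t_given_s = p_s_given_t * p_t / p_s          # Bayes' rule, elementwise over s
    return np.sum(p_s_given_t * (np.log2(1 / p_t) - np.log2(1 / p_t_given_s)))

def i_min(P_T, channels):
    """Equation (15b); channels[i][t] = P(S_i | T=t) is row t of the channel matrix."""
    total = 0.0
    for t, p_t in enumerate(P_T):
        specifics = []
        for kappa in channels:
            p_s = P_T @ kappa                      # marginal distribution P(S_i)
            specifics.append(specific_information(p_t, kappa[t], p_s))
        total += p_t * min(specifics)
    return total

P_T = np.array([0.4, 0.6])
kappa1 = np.array([[0.8, 0.2], [0.35, 0.65]])      # P(S_1 | T)
kappa2 = np.array([[0.6, 0.4], [0.2, 0.8]])        # P(S_2 | T)
print(i_min(P_T, [kappa1, kappa2]))
```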
To the best of our knowledge, this measure is the only existing non-negative decomposition that satisfies all three axioms listed above for an arbitrary number of visible variables while providing an inclusion–exclusion relation of partial information.
However, the measure I_min can be criticized for not providing a notion of distinct information due to its use of a pointwise minimum (for each t ∈ 𝒯) over the sources. This leads to the issue of distinguishing “the same information and the same amount of information” [3,4,5,6]. We can use the definition through a pointwise minimum (Equation (15)) to construct examples of unexpected behavior: consider, for example, a uniform binary target variable T and two visible variables as the outputs of the channels visualized in Figure 5. The channels are constructed to be equivalent for both target states and to provide access to distinct decision regions while ensuring constant pointwise information ∀t ∈ 𝒯 : I(V_x; T = t) = 0.2.
Even though our ability to predict the target variable depends significantly on which of the two indicated channel outputs we observe (blue or green in Figure 5; incomparable informativity based on Definition 2), the measure I_min concludes full redundancy between them: I(V_1; T) = I_min({V_1, V_2}; T) = I(V_2; T) = 0.2. We consider this behavior undesired and, as discussed in the literature, caused by an underlying failure to distinguish the same information. To resolve this issue, we will present a representation of f-information in Section 3.1, which allows the use of all (TPR, FPR)-pairs for each state of the target variable to represent a distinct notion of uncertainty.

2.3. Information Measures

This section discusses two generalizations of mutual information at discrete random variables based on f-divergences and Rényi-divergences [24,25]. While mutual information has interpretational significance in channel coding and data compression, other f-divergences have their significance in parameter estimations, high-dimensional statistics, and hypothesis testing ([7], p. 88), while Rényi-divergences can be found among others in privacy analysis [8]. Finally, we introduce Bhattacharyya information for demonstrating that it is possible to chain decomposition transformations in Section 3.6. All definitions in this section only consider the case of discrete random variables (which is what we need for the context of this work).
Definition 16 
(f-divergence [24]). Let f : ( 0 , ) R be a function that satisfies the following three properties:
  • f is convex;
  • f ( 1 ) = 0 ;
  • f ( z ) is finite for all z > 0 .
By convention, we understand that f(0) = lim_{z→0⁺} f(z) and 0 · f(0/0) = 0. For any such function f and two discrete probability distributions P and Q over the event space 𝒳, the f-divergence for discrete random variables is defined as shown in Equation (16).
$$D_f(P \,\|\, Q) \coloneqq \sum_{x \in \mathcal{X}} Q(x)\, f\!\left(\frac{P(x)}{Q(x)}\right) = \mathbb{E}_Q\!\left[ f\!\left(\frac{P(X)}{Q(X)}\right) \right] \tag{16}$$
Notation 6. 
Throughout this work, we reserve the name f for functions that satisfy the required properties for an f-divergence of Definition 16.
An f-divergence quantifies a notion of dissimilarity between two probability distributions P and Q. Key properties of f-divergences are their non-negativity, their invariance under bijective transformations, and them satisfying a data-processing inequality ([7], p. 89). A list of commonly used f-divergences is shown in Table 2. Notably, the continuation for a = 1 of both the Hellinger- and α -divergence results in the KL-divergence [26].
The generator function of an f-divergence is not unique, since f(z) and f(z) + c(z − 1) generate the same divergence for any real constant c ∈ ℝ ([7], p. 90f). As a result, the considered α-divergence is a linear scaling of the Hellinger divergence (D_{H_a} = a · D_{α=a}), as shown in Equation (17).
$$\frac{z^a - 1}{a - 1} + c(z-1) = a \cdot \frac{z^a - 1 - a(z-1)}{a(a-1)} \qquad \text{for } c = \frac{-a}{a-1} \tag{17}$$
Definition 17 
(f-information [7]). An f-information is defined based on an f-divergence from the joint distribution of two discrete random variables and the product of their marginals, as shown in Equation (18).
$$I_f(S; T) \coloneqq D_f\!\left(P_{(S,T)} \,\big\|\, P_S \otimes P_T\right) = \sum_{(s,t) \in \mathcal{S}\times\mathcal{T}} P_S(s)\, P_T(t)\, f\!\left(\frac{P_{(S,T)}(s,t)}{P_S(s)\, P_T(t)}\right) = \sum_{t \in \mathcal{T}} P_T(t) \sum_{s \in \mathcal{S}} P_S(s)\, f\!\left(\frac{P_{S\mid T}(s \mid t)}{P_S(s)}\right) \tag{18}$$
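The following sketch computes Equation (18) from a joint distribution for a few common generator functions (f(z) = z log z for the KL case, (z − 1)² for χ², |z − 1|/2 for total variation); these are the standard forms, which may differ from the exact parameterizations listed in Table 2. The function names are our own.

```python
import numpy as np

# note: generators shown without the 0*f(0/0) guards needed when distributions contain zeros
f_generators = {
    "KL":   lambda z: z * np.log2(z),      # yields mutual information (in bits)
    "chi2": lambda z: (z - 1) ** 2,
    "TV":   lambda z: 0.5 * np.abs(z - 1),
}

def f_information(P_joint, f):
    """Equation (18): D_f between the joint distribution and the product of marginals."""
    P_joint = np.asarray(P_joint, dtype=float)     # P_joint[s, t] = P(S=s, T=t)
    P_S = P_joint.sum(axis=1, keepdims=True)
    P_T = P_joint.sum(axis=0, keepdims=True)
    Q = P_S * P_T                                  # product of the marginals
    return float(np.sum(Q * f(P_joint / Q)))

P_joint = np.array([[0.32, 0.21],                  # consistent with kappa and P_T of Equation (24)
                    [0.08, 0.39]])
for name, f in f_generators.items():
    print(name, f_information(P_joint, f))         # chi2 reproduces 0.195102 of Equation (25)
```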
Definition 18 
(f-entropy). A notion of f-entropy for a discrete random variable is obtained from the self-information of a variable H f ( T ) I f ( T ; T ) .
Notation 7. 
Using the KL-divergence results in the definition of mutual information and Shannon entropy. Therefore, we use the notation I KL for mutual information (KL-information) and H KL (KL entropy) for the Shannon entropy.
The remaining part of this section will define Rényi- and Bhattacharyya-information to highlight that they can be represented as an invertible transformation of Hellinger-information. This will be used in Section 3.6 to transform the decomposition of Hellinger-information to a decomposition of Rényi- and Bhattacharyya-information.
Remark 6. 
We could similarly choose to represent the Rényi-divergence as a transformation of the α-divergence. A linear scaling of the considered f-divergence will, however, not affect our later results (see Section 3.6).
Definition 19 
(Rényi divergence [25]). Let P and Q be two discrete probability distributions over the event space 𝒳; then, the Rényi-divergence R_a is defined as shown in Equation (19) for a ∈ (0, 1) ∪ (1, ∞) and is extended to a ∈ {0, 1, ∞} by continuation.
$$R_a(P \,\|\, Q) \coloneqq \frac{1}{a-1} \log \mathbb{E}_Q\!\left[ \left(\frac{P(X)}{Q(X)}\right)^{\!a} \right] = \frac{1}{a-1} \log\!\left( 1 + (a-1)\, \mathbb{E}_Q\!\left[ \frac{\left(\frac{P(X)}{Q(X)}\right)^{a} - 1}{a-1} \right] \right) = \frac{1}{a-1} \log\big( 1 + (a-1)\, D_{H_a}(P \,\|\, Q) \big) \tag{19}$$
Notably, the continuation of Rényi-divergence for a = 1 also equals the KL-divergence ([7], p. 116). Rényi-divergence can be expressed as an invertible transformation of the Hellinger-divergence ( D H a ; see Equation (19)) [26].
Definition 20 
(Rényi-information [7]). Rényi-information is defined equivalent to f-information as shown in Equation (20) and corresponds to an invertible transformation of Hellinger-information ( I H a ).
$$I_{R_a}(S; T) \coloneqq R_a\!\left(P_{(S,T)} \,\big\|\, P_S \otimes P_T\right) = \frac{1}{a-1} \log\big( 1 + (a-1)\, I_{H_a}(S; T) \big) \tag{20}$$
Finally, we consider the Bhattacharyya distance (Definition 21), which is a linear scaling of a special case of the Rényi-divergence (Equation (21)) [26]. It is applied, among others, in signal processing [27] and coding theory [28]. The corresponding information measure (Equation (22)) is, like the distance itself, a scaling of a special case of Rényi-information.
Definition 21 
(Bhattacharyya distance [29]). Let P and Q be two discrete probability distributions over the event space 𝒳; then, the Bhattacharyya distance is defined as shown in Equation (21).
$$B(P \,\|\, Q) \coloneqq -\log \sum_{x \in \mathcal{X}} \sqrt{P(x)\, Q(x)} = -\log \sum_{x \in \mathcal{X}} Q(x) \sqrt{\frac{P(x)}{Q(x)}} = -\log\!\left( 1 - 0.5 \cdot \mathbb{E}_Q\!\left[ \frac{\left(\frac{P(X)}{Q(X)}\right)^{0.5} - 1}{0.5 - 1} \right] \right) = -\log\big( 1 - 0.5 \cdot D_{H_{0.5}}(P \,\|\, Q) \big) = 0.5 \cdot R_{0.5}(P \,\|\, Q) \tag{21}$$
Definition 22 
(Bhattacharyya-information). Bhattacharyya-information is defined equivalently to f-information, as shown in Equation (22).
$$I_B(S; T) \coloneqq B\!\left(P_{(S,T)} \,\big\|\, P_S \otimes P_T\right) = 0.5 \cdot I_{R_{0.5}}(S; T) \tag{22}$$
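The invertible transformations of Equations (20)–(22) can be chained directly on top of Hellinger-information. A minimal sketch (assuming numpy and natural logarithms; the joint distribution reuses the numbers of Equation (24)):

```python
import numpy as np

def hellinger_information(P_joint, a):
    """I_{H_a}(S;T): f-information with generator f(z) = (z**a - 1) / (a - 1), a != 1."""
    P_joint = np.asarray(P_joint, dtype=float)
    Q = P_joint.sum(axis=1, keepdims=True) * P_joint.sum(axis=0, keepdims=True)
    return float(np.sum(Q * ((P_joint / Q) ** a - 1) / (a - 1)))

def renyi_information(P_joint, a):
    """Equation (20): invertible transformation of Hellinger-information."""
    return np.log(1 + (a - 1) * hellinger_information(P_joint, a)) / (a - 1)

def bhattacharyya_information(P_joint):
    """Equation (22): scaled special case of Renyi-information."""
    return 0.5 * renyi_information(P_joint, 0.5)

P_joint = np.array([[0.32, 0.21], [0.08, 0.39]])
print(renyi_information(P_joint, 2.0), bhattacharyya_information(P_joint))
```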
Example 2. 
Consider the channel T → S given by κ with 𝒯 = {t_1, t_2} and 𝒮 = {s_1, s_2}. While it will be discussed in more detail in Section 3.1, Equation (23) already indicates that f-information can be interpreted as the expected value of quantifying the boundary of the Neyman–Pearson region for an indicator variable of each target state t ∈ 𝒯. Each state of a source variable s ∈ 𝒮 corresponds to one side/edge of this boundary, as discussed in Section 2.1 and visualized in Figure 2. Therefore, the sum over s ∈ 𝒮 corresponds to the sum of quantifying each edge of the zonogon by some function, which is parameterized only by the distribution of the indicator variable for t. This function satisfies a triangle inequality (Corollary A1), and the total boundary is non-negative (Theorem 2, discussed later). Therefore, we can vaguely think of pointwise f-information as quantifying the length of the boundary of the Neyman–Pearson region or zonogon perimeter, to give an oversimplified intuition.
$$I_f(S;T) = \sum_{t \in \mathcal{T}} P_T(t) \underbrace{\sum_{s \in \mathcal{S}} \underbrace{P_S(s) \cdot f\!\left(\frac{P_{S \mid T}(s \mid t)}{P_S(s)}\right)}_{\text{quantifies each zonogon edge}}}_{\text{pointwise information of an indicator variable } T=t} \tag{23}$$
Below is a stepwise computation of χ²-information (f(z) = (z − 1)²) on a small example from this interpretation for the setting of Equation (24).
$$\kappa = P_{S \mid T} = \begin{pmatrix} p(S{=}s_1 \mid T{=}t_1) & p(S{=}s_2 \mid T{=}t_1) \\ p(S{=}s_1 \mid T{=}t_2) & p(S{=}s_2 \mid T{=}t_2) \end{pmatrix} = \begin{pmatrix} 0.8 & 0.2 \\ 0.35 & 0.65 \end{pmatrix} \qquad\quad P_T = \begin{pmatrix} p(T{=}t_1) \\ p(T{=}t_2) \end{pmatrix} = \begin{pmatrix} 0.4 \\ 0.6 \end{pmatrix} \tag{24}$$
Since | T | = 2 , we compute the pointwise information for two indicator variables as shown in Figure 6. Since each state s S corresponds to one edge of the zonogon, we compute them individually. Notice that the quantification of each vector v s i can be expressed as a function that is only parameterized by the distribution of the indicator variable. The total zonogon perimeter is quantified as the sum of each of its edges, which equals pointwise information. In this particular case, we obtain 0.292653 for the total boundary on the indicator of t 1 and 0.130068 for the total boundary on the indicator of t 2 . The expected information corresponds to the expected value of these pointwise quantifications and provides the final result (Equation (25)).
$$I_{\chi^2}(S;T) = p(T{=}t_1) \cdot 0.292653 + p(T{=}t_2) \cdot 0.130068 = 0.195102 \tag{25}$$
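The stepwise computation above can be reproduced with the pointwise perimeter quantification that Section 3.1 formalizes as r_f and i_f (Equation (28)); the helper names below mirror that notation, and the numbers are those of Equation (24). A minimal sketch assuming numpy:

```python
import numpy as np

f_chi2 = lambda z: (z - 1) ** 2

def r_f(p, v, f=f_chi2):
    """Quantify one zonogon edge v = (p(s|t), p(s|not t)) for target weight p = P(T=t)."""
    denom = p * v[0] + (1 - p) * v[1]          # equals P_S(s)
    return denom * f(v[0] / denom)

def i_f(p, kappa, f=f_chi2):
    """Pointwise f-information: sum of the edge quantifications (half the perimeter)."""
    return sum(r_f(p, v, f) for v in np.asarray(kappa, dtype=float).T)

P_T = np.array([0.4, 0.6])
kappa = np.array([[0.8, 0.2], [0.35, 0.65]])   # P(S | T) from Equation (24)

pointwise = []
for t in range(2):
    # pointwise channel of the indicator of t: row order (T = t, T != t)
    k_t = kappa if t == 0 else kappa[::-1]
    pointwise.append(i_f(P_T[t], k_t))

print(pointwise)                                # [0.292653..., 0.130068...]
print(np.dot(P_T, pointwise))                   # 0.195102...  (Equation (25))
```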

3. Decomposition Methodology

To construct a partial information decomposition in the framework of Williams and Beer [1], we only have to define a cumulative redundancy measure (I_∩) or a cumulative loss measure (I_∂). However, doing this requires a meaningful definition of when information is the same. Therefore, Section 3.1 presents an interpretation of f-information that enables a representation of distinct information. Specifically, we demonstrate that pointwise f-information for a target state t ∈ 𝒯 corresponds to the Neyman–Pearson region of its indicator variable, which is quantified by its boundary (zonogon perimeter). This allows for the interpretation that each distinct (TPR, FPR)-pair for predicting a state of the target variable provides a distinct notion of uncertainty. This interpretation of f-information is used in Section 3.2 to construct a partial information decomposition on the synergy lattice under the Blackwell order for each state t ∈ 𝒯 individually. These individual decompositions are then combined into the final result. Therefore, we decompose specific information based on the Blackwell order rather than using its minimum, like Williams and Beer [1]. The resulting operational interpretation is discussed in Section 3.3. Section 3.4 studies the relation between decomposition lattices to derive the dual-decomposition of any f-information on the redundancy lattice in the following Section 3.5 and prove its correctness. We use the obtained decomposition for any f-information in Section 3.6 to transform a Hellinger-information decomposition into a Rényi-information decomposition while maintaining its non-negativity and an inclusion–exclusion relation. To achieve the desired axioms and properties, we combine different aspects of the existing literature:
  • Like Bertschinger et al. [9] and Kolchinsky [16], we base the decomposition on the Blackwell order and use this to obtain the operational interpretation of the decomposition.
  • Like Williams and Beer [1] and related to Lizier et al. [18], Finn and Lizier [13], and Ince [14], we perform a decomposition from a pointwise perspective, but only for the target variable.
  • In a similar manner to how Finn and Lizier [13] used probability mass exclusion to differentiate distinct information, we use Neyman–Pearson regions for each state of a target variable to differentiate distinct information.
  • We propose applying the concepts about lattice re-graduations discussed by Knuth [19] to PIDs to transform the decomposition of one information measure to another while maintaining its consistency.
We extend Axiom 3 of Williams and Beer [1] as shown below, to allow binding any information measure to the decomposition.
Axiom 3* (Self-redundancy). For a single source, the redundancy I_{∩,*} and the information loss I_{∂,*} correspond to the information measure I_* as shown below:
$$I_{\cap,*}(\{S_i\}; T) = I_*(S_i; T) \qquad\text{and}\qquad I_{\partial,*}(\{S_i\}; T) = I_*(V; T) - I_*(S_i; T) \tag{26}$$

3.1. Representing f-Information

We begin with an interpretation of f-information, for which we define a pointwise (indicator) variable π ( T , t ) that represents one state of the target variable (Equation (27a)) and construct its pointwise information channel (Definition 23). Then, we define a function r f based on the generator function of an f-divergence for quantifying (half) the zonogon perimeter of each pointwise information channel (see Figure 2). These perimeter quantifications are pointwise f-information.
Definition 23 
([Target] pointwise binary input channel). We define a target pointwise binary input channel κ(S, T, t) from one state of the target variable t ∈ 𝒯 to an information source S with event space 𝒮 = {s_1, …, s_m}, as shown in Equation (27b).
$$\pi(T, t) \coloneqq \begin{cases} 1 & \text{if } T = t \\ 0 & \text{otherwise} \end{cases} \tag{27a}$$
$$\kappa(S, T, t) \coloneqq \big(\pi(T,t) \to S\big) = \begin{pmatrix} p(S{=}s_1 \mid T{=}t) & \cdots & p(S{=}s_m \mid T{=}t) \\ p(S{=}s_1 \mid T{\neq}t) & \cdots & p(S{=}s_m \mid T{\neq}t) \end{pmatrix} \tag{27b}$$
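A pointwise channel κ(S, T, t) can be read off any joint distribution P(T, S) by conditioning on T = t and on T ≠ t. A minimal sketch (our own function name, assuming numpy; the joint distribution is hypothetical):

```python
import numpy as np

def pointwise_channel(P_TS, t):
    """kappa(S, T, t) of Equation (27b) from a joint distribution P_TS[t, s] = P(T=t, S=s)."""
    P_TS = np.asarray(P_TS, dtype=float)
    row_t     = P_TS[t] / P_TS[t].sum()                   # P(S | T = t)
    rest      = np.delete(P_TS, t, axis=0).sum(axis=0)
    row_not_t = rest / rest.sum()                         # P(S | T != t)
    return np.vstack([row_t, row_not_t])

P_TS = np.array([[0.10, 0.05, 0.05],     # T = t_1
                 [0.05, 0.25, 0.10],     # T = t_2
                 [0.20, 0.05, 0.15]])    # T = t_3
print(pointwise_channel(P_TS, 0))
```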
Definition 24 
([Target] pointwise f-information).
  • We define a function r_f, as shown in Equation (28a), to quantify a vector, where 0 ≤ p, x, y ≤ 1.
  • We define a target pointwise f-information function i_f, as shown in Equation (28b), to quantify half the zonogon perimeter of the corresponding pointwise channel Z(κ(S, T, t)).
$$r_f\!\left(p, \begin{pmatrix} x \\ y \end{pmatrix}\right) \coloneqq \big(p x + (1-p) y\big) \cdot f\!\left(\frac{x}{p x + (1-p) y}\right) \tag{28a}$$
$$i_f(p, \kappa) \coloneqq \sum_{\mathbf{v} \in \kappa} r_f(p, \mathbf{v}) \tag{28b}$$
Theorem 1 
(Properties of r_f). For a constant 0 ≤ p ≤ 1, the function r_f(p, v): (1) is convex in v, (2) scales linearly in v, (3) satisfies a triangle inequality in v, (4) quantifies any vector of slope one to zero, and (5) quantifies the zero vector to zero.
Proof. 
  • The convexity of r f ( p , v ) in v is shown separately in Lemma A1 of Appendix A.
  • That r_f(p, c · v) = c · r_f(p, v) scales linearly in v can be seen directly from Equation (28a).
  • The triangle inequality of r f ( p , v ) in v is shown separately in Corollary A1 of Appendix A.
  • A vector of slope one is quantified to zero, r_f(p, (x, x)ᵀ) = x · f(1) = 0, since f(1) = 0 is a requirement on the generator function of an f-divergence (Definition 16).
  • The zero vector is quantified to zero, r_f(p, (0, 0)ᵀ) = 0 · f(0/0) = 0, by the convention of generator functions for an f-divergence (Definition 16).
The function r f provides the following properties to the pointwise information measure i f .
Theorem 2 
(Properties of i f ). The pointwise information measure i f (1) maintains the ordering relation of the Blackwell order for binary input channels and (2) is non-negative.
Proof. 
  • That the function r f maintains the ordering relation of the Blackwell order on binary input channels is shown separately in Lemma A2 of Appendix A (Equation (29a)).
  • The bottom element ⊥_BW = (1, 1)ᵀ consists of a single vector of slope one, which is quantified to zero by Theorem 1 (Equation (29b)). The combination with Equation (29a) ensures the non-negativity.
$$\kappa_1 \preceq \kappa_2 \;\Rightarrow\; i_f(p, \kappa_1) \le i_f(p, \kappa_2), \tag{29a}$$
$$i_f(p, \bot_{\mathrm{BW}}) = 0. \tag{29b}$$
An f-information corresponds to the expected value of the target pointwise f-information function defined above (Equation (30)). As a result, we can interpret f-information as the expected value of quantifying (half) the zonogon perimeters for the target pointwise channels κ ( S , T , t ) .
$$I_f(S;T) = \sum_{t \in \mathcal{T}} P_T(t) \cdot i_f\big(P_T(t), \kappa(S,T,t)\big) = \sum_{t \in \mathcal{T}} P_T(t) \sum_{\mathbf{v} \in \kappa(S,T,t)} r_f\big(P_T(t), \mathbf{v}\big) = \sum_{t \in \mathcal{T}} P_T(t) \sum_{s \in \mathcal{S}} P_S(s) \cdot f\!\left(\frac{P_{S\mid T}(s \mid t)}{P_S(s)}\right) \tag{30}$$

3.2. Decomposing f-Information on the Synergy Lattice

With the representation of Section 3.1 in mind, we can define a non-negative partial information decomposition for a set of visible variables V = {V_1, …, V_n} about a target variable T for any f-information. The decomposition is performed from a pointwise perspective, which means that we decompose the pointwise measure i_f on the synergy lattice (𝒜(V), ⊑) for each t ∈ 𝒯. The pointwise synergy lattices are then combined using a weighted sum to obtain the decomposition of I_f.
We map each atom of the synergy lattice to the join of pointwise channels for its contained sources.
Definition 25 
(From atoms to channels). We define the channel corresponding to an atom α ∈ 𝒜(V) as shown in Equation (31).
$$\kappa(\alpha, T, t) \coloneqq \begin{cases} \bot_{\mathrm{BW}} & \text{if } \alpha = \emptyset \\ \bigvee_{S \in \alpha} \kappa(S, T, t) & \text{otherwise} \end{cases} \tag{31}$$
Lemma 1. 
For any sets of sources α, β ∈ 𝒫(𝒫_1(V)) and target variable T with state t ∈ 𝒯, the function κ maintains the ordering of the synergy lattice under the Blackwell order, as shown in Equation (32).
$$\alpha \sqsubseteq \beta \;\Rightarrow\; \kappa(\beta, T, t) \preceq \kappa(\alpha, T, t) \tag{32}$$
Lemma 1 is shown separately in Appendix C. The mapping from Definition 25 provides a lattice that can be quantified using pointwise f-information to construct a cumulative loss measure for its decomposition using the Möbius inverse.
Definition 26 
([Target] pointwise cumulative and partial loss measures). We define the target pointwise cumulative and partial loss functions as shown in Equations (33a) and (33b).
$$i_{\partial,f}(\alpha, T, t) \coloneqq i_f\big(P_T(t), \kappa(V, T, t)\big) - i_f\big(P_T(t), \kappa(\alpha, T, t)\big) \tag{33a}$$
$$\Delta i_{\partial,f}(\alpha, T, t) \coloneqq i_{\partial,f}(\alpha, T, t) - \sum_{\beta \,\in\, \dot{\downarrow}_S \alpha} \Delta i_{\partial,f}(\beta, T, t) \tag{33b}$$
The combined cumulative and partial measures are the expected value of their corresponding pointwise measures. This corresponds to combining the pointwise decomposition lattices by a weighted sum.
Definition 27 
(Combined cumulative and partial loss measures). The cumulative loss measure I_{∂,f} is defined by Equation (34) and the decomposition result ΔI_{∂,f} by Equation (35).
$$I_{\partial,f}(\alpha; T) \coloneqq \sum_{t \in \mathcal{T}} P_T(t) \cdot i_{\partial,f}(\alpha, T, t) \tag{34}$$
$$\Delta I_{\partial,f}(\alpha; T) \coloneqq \sum_{t \in \mathcal{T}} P_T(t) \cdot \Delta i_{\partial,f}(\alpha, T, t) = I_{\partial,f}(\alpha; T) - \sum_{\beta \,\in\, \dot{\downarrow}_S \alpha} \Delta I_{\partial,f}(\beta; T) \tag{35}$$
Theorem 3. 
The presented definitions for the pointwise and expected loss measures (i_{∂,f} and I_{∂,f}) provide a non-negative PID on the synergy lattice with an inclusion–exclusion relation that satisfies Axioms 1, 2, and 3* for any f-information measure.
Proof. 
  • Axiom 1: The measure i_{∂,f} (Equation (33a)) is invariant to permuting the order of sources in α, since the join operator of the zonogon order (⋁_{S∈α}) is. Therefore, I_{∂,f} also satisfies Axiom 1.
  • Axiom 2: The monotonicity of both i_{∂,f} and I_{∂,f} on the synergy lattice is shown separately as Corollary A2 in Appendix C.
  • Axiom 3*: For a single source, i_{∂,f} equals the pointwise information loss by definition (see Equations (26), (28b), and (33a)). Therefore, I_{∂,f} satisfies Axiom 3*.
  • Non-negativity: The non-negativity of Δi_{∂,f} and ΔI_{∂,f} is shown separately as Lemma A8 in Appendix C.

3.3. Operational Interpretation

From a pointwise perspective (|𝒯| = 2), there always exists a dependency between the sources for which the synergy of this state becomes zero. This dependence corresponds, by definition, to the join of their channels. This is helpful for the operational interpretation in the following paragraph since, individually, each pointwise synergy is fully sensitive to the dependence between the sources. For |𝒯| > 2, there may not exist a dependency between the sources for which the expected synergy becomes zero. However, each decision region that is quantified as synergetic becomes inaccessible at some dependence between the sources.
The decomposition obtains the operational interpretation that, if a variable provides pointwise unique information, then there exists a unique decision region for some t ∈ 𝒯 that this variable provides access to. Moreover, if a set of variables provides synergetic information, then a decision region for some t ∈ 𝒯 may become inaccessible if the dependence between the variables changes. Due to the equivalence of the zonogon and Blackwell order for binary input channels, these interpretations can also be transferred to a set of actions a ∈ Ω and a pointwise reward function u(a, π(T, t)), which only depends on one state of the target variable π(T, t) (see Section 2.1): If a variable provides unique information, then it provides an advantage for some set of actions and pointwise reward function, while synergy indicates that the advantage for some pointwise reward function is based on the dependence between variables.
The implication of the interpretation does not hold in the other direction, which we will also highlight in the example of TV-information in Section 4.1. Finally, the definition of the Blackwell order through the chaining of channels (Equation (2)) highlights its suitability for tracing the flows of information in Markov chains (see Section 4.2).
Remark 7. 
The operational interpretation can be strengthened further, such that the implication between accessible regions and partial information holds in both directions, by revising Lemmas A1 and A2 with a strictly convex generator function to obtain κ_1 ≺ κ_2 ⇒ i_f(p, κ_1) < i_f(p, κ_2).

3.4. Decomposition Duality

A non-negative decomposition on the synergy lattice raises the question about its dual-decomposition on the redundancy lattice. Unfortunately, the definition of decomposition duality (Definition 15 [23]) does not specify the mapping between atoms to easily construct dual-decompositions. Therefore, this section discusses how the redundancy and synergy lattice are related by identifying operators that transform one lattice into the other. This transformation can then be used to refine the definition of decomposition duality and, correspondingly, transforms the cumulative measure between lattices.
Definition 28. 
We define two functions: The function Ξ : 𝒫(𝒫_1(V)) → 𝒫(𝒫_1(V)) provides the atom with complemented sources, and the function Ψ : 𝒫(𝒫_1(V)) → 𝒫(𝒫_1(V)) is the n-ary Cartesian product. We indicate the i-th source of an atom as α[i] and indicate some variable within the i-th source as x_i.
$$\Xi(\alpha) \coloneqq \begin{cases} \emptyset & \text{if } \alpha = \{V\} \\ \{V\} & \text{if } \alpha = \emptyset \\ \{\, V \setminus S \;:\; S \in \alpha \,\} & \text{otherwise} \end{cases} \qquad\qquad \Psi(\alpha) \coloneqq \begin{cases} \emptyset & \text{if } \alpha = \emptyset \\ \big\{\, \{x_1, \ldots, x_m\} \;:\; x_1 \in \alpha[1], \ldots, x_m \in \alpha[m] \,\big\} & \text{otherwise, where } m = |\alpha| \end{cases}$$
Example 3. 
For an example of these functions, let V = {V_1, V_2, V_3, V_4} and α = {{V_1}, {V_2, V_3}}:
$$\Xi(\alpha) = \{\{V_2,V_3,V_4\}, \{V_1,V_4\}\} \qquad\qquad \Xi(\Xi(\alpha)) = \{\{V_1\}, \{V_2,V_3\}\} = \alpha$$
$$\Psi(\alpha) = \{\{V_1,V_2\}, \{V_1,V_3\}\}$$
$$\Psi(\Psi(\alpha)) = \{\{V_1\}, \{V_1,V_3\}, \{V_2,V_1\}, \{V_2,V_3\}\} \cong \{\{V_1\}, \{V_2,V_1\}, \{V_2,V_3\}\} \ (\text{since } \{V_1\} \subseteq \{V_1,V_3\}) \cong \{\{V_1\}, \{V_2,V_3\}\} = \alpha \ (\text{since } \{V_1\} \subseteq \{V_2,V_1\})$$
$$\Xi(\Psi(\Psi(\alpha))) = \{\{V_2,V_3,V_4\}, \{V_2,V_4\}, \{V_3,V_4\}, \{V_1,V_4\}\} \simeq \{\{V_2,V_3,V_4\}, \{V_3,V_4\}, \{V_1,V_4\}\} \ (\text{since } \{V_2,V_4\} \subseteq \{V_2,V_3,V_4\}) \simeq \{\{V_2,V_3,V_4\}, \{V_1,V_4\}\} = \Xi(\alpha) \ (\text{since } \{V_3,V_4\} \subseteq \{V_2,V_3,V_4\})$$
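The two operators can be implemented directly on atoms represented as sets of sources (sets of variable names). The sketch below (our own representation and helper names) reproduces Example 3, including the reduction of equivalent sets of sources to atoms by dropping comparable sources (cf. Definition 11 and Equation (8)).

```python
from itertools import product

V = frozenset({"V1", "V2", "V3", "V4"})

def Xi(alpha):
    """Complement each source; swaps the roles of the top/bottom atoms."""
    if alpha == frozenset({V}):
        return frozenset()
    if alpha == frozenset():
        return frozenset({V})
    return frozenset(V - S for S in alpha)

def Psi(alpha):
    """n-ary Cartesian product over the sources of an atom."""
    if not alpha:
        return frozenset()
    return frozenset(frozenset(choice) for choice in product(*alpha))

def reduce_redundancy(alpha):
    """Equivalent atom under the redundancy order: drop sources containing another source."""
    return frozenset(S for S in alpha if not any(R < S for R in alpha))

def reduce_synergy(alpha):
    """Equivalent atom under the synergy order: drop sources contained in another source."""
    return frozenset(S for S in alpha if not any(S < R for R in alpha))

alpha = frozenset({frozenset({"V1"}), frozenset({"V2", "V3"})})
print(sorted(map(sorted, Xi(alpha))))                    # [['V1','V4'], ['V2','V3','V4']]
print(sorted(map(sorted, Psi(alpha))))                   # [['V1','V2'], ['V1','V3']]
print(reduce_redundancy(Psi(Psi(alpha))) == alpha)       # True: Psi(Psi(alpha)) is equivalent to alpha
print(reduce_synergy(Xi(Psi(Psi(alpha)))) == Xi(alpha))  # True: matches the last line of Example 3
```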
Lemma 2. 
The function Ψ(·) is a bijection on the redundancy lattice without the bottom element (∅) that reverses its order. Let α, β ∈ 𝒜(V) ∖ {⊥_RL}:
  • Ψ(Ψ(α)) ≅ α
  • α ⪯ β ⇔ Ψ(β) ⪯ Ψ(α)
Lemma 3. 
The function Ξ(·) is a bijection that maintains the ordering of atoms between the redundancy and synergy order. Let α, β ∈ 𝒜(V):
  • α = Ξ(Ξ(α))
  • α ⪯ β ⇔ Ξ(α) ⊑ Ξ(β)
The proofs of Lemmas 2 and 3 are given separately in Appendix D.
Corollary 1. 
Without bottom elements, the redundancy lattice (𝒜(V) ∖ {⊥_RL}, ⪯) and the synergy lattice (𝒜(V) ∖ {⊥_SL}, ⊑) are related as shown below, with α, β ∈ 𝒜(V) ∖ {⊥_RL}:
$$\alpha \preceq \beta \;\Leftrightarrow\; \Xi(\Psi(\beta)) \sqsubseteq \Xi(\Psi(\alpha))$$
$$\Xi(\Psi(\alpha \vee \beta)) \simeq \Xi(\Psi(\alpha)) \sqcap \Xi(\Psi(\beta))$$
$$\Xi(\Psi(\alpha \wedge \beta)) \simeq \Xi(\Psi(\alpha)) \sqcup \Xi(\Psi(\beta))$$
$$\{\Xi(\Psi(\beta)) : \beta \in \downarrow_R \alpha\} \;\overset{\{\simeq\}}{=}\; \uparrow_S \Xi(\Psi(\alpha))$$
$$\{\Xi(\Psi(\beta)) : \beta \in \uparrow_R \alpha\} \;\overset{\{\simeq\}}{=}\; \downarrow_S \Xi(\Psi(\alpha))$$
Proof. 
Follows directly from Lemmas 2 and 3. □
Figure 7 visualizes the relations from the introduced operators to provide an intuition. Applying the function Ψ to all atoms is equal to reversing the redundancy order, while applying the function Ξ to all atoms is equal to swapping the ordering relation used (synergy/redundancy order).
With these definitions in place, we can refine the definition of decomposition duality:
Lemma 4 
(Decomposition duality). A redundancy- and synergy-based information decomposition is pointwise dual if, for all α ∈ 𝒜(V) ∖ {⊥_RL}:
$$\Delta i_{\cap,f}(\alpha, T, t) = \Delta i_{\partial,f}(\Xi(\Psi(\alpha)), T, t) \qquad\quad \Delta i_{\cap,f}(\bot_{RL}, T, t) = 0 = \Delta i_{\partial,f}(\bot_{SL}, T, t)$$
A redundancy- and synergy-based information decomposition is dual if, for all α ∈ 𝒜(V) ∖ {⊥_RL}:
$$\Delta I_{\cap,f}(\alpha; T) = \Delta I_{\partial,f}(\Xi(\Psi(\alpha)); T) \qquad\quad \Delta I_{\cap,f}(\bot_{RL}; T) = 0 = \Delta I_{\partial,f}(\bot_{SL}; T)$$
The proof of Lemma 4 is shown separately in Appendix D. To convert a decomposition from the synergy lattice into its dual-decomposition on the redundancy lattice, the following relation is particularly useful. It states that, on the synergy lattice, all atoms are either in the up-set of Ξ(Ψ(α)) or in the down-set of an atom that corresponds to an individual source within α.
Lemma 5. 
For α ∈ 𝒜(V) ∖ {⊥_RL}:
$$\mathcal{A}(V) \setminus \uparrow_S \Xi(\Psi(\alpha)) \;\overset{\{\simeq\}}{=}\; \bigcup_{S_a \in \alpha} \downarrow_S \{S_a\}$$
Proof. 
When expanding the definition of up- and down-sets, it can be seen directly from Lemma A9 that both sets provide an exclusive partitioning of all atoms.
$$\uparrow_S \Xi(\Psi(\alpha)) = \{\beta \in \mathcal{A}(V) : \Xi(\Psi(\alpha)) \sqsubseteq \beta\} \qquad\quad \bigcup_{S_a \in \alpha} \downarrow_S \{S_a\} = \{\beta \in \mathcal{A}(V) : \exists S_a \in \alpha.\ \beta \sqsubseteq \{S_a\}\}$$
$$\Xi(\Psi(\alpha)) \sqsubseteq \beta \;\Leftrightarrow\; \neg\,\exists S_a \in \alpha.\ \beta \sqsubseteq \{S_a\} \qquad (\text{by Lemma A9})$$
□
Figure 8 summarizes and visualizes the required relations for the following transformation of the cumulative measure: (i) The bottom elements of all lattices are mapped to each other and quantified to zero. (ii) The function Ψ reverses the redundancy lattice (β = Ψ(α) such that α ≅ Ψ(β)) to relate the down-set of α to the up-set of β while ignoring the bottom element. The function Ξ captures the relation between both orderings (α = Ξ(β) such that β = Ξ(α)) to relate the up-set of β on the redundancy lattice to the up-set of α on the synergy lattice. This provides the desired mapping from the down-set of α on the redundancy lattice to the up-set of α on the synergy lattice for duality. Alternatively, we could first transform the down-set of α on the redundancy lattice to the down-set of β = Ξ(α) on the synergy lattice, then reverse the synergy order and obtain the same result. (iii) Lemma 5 states that all atoms on the synergy lattice are either in the up-set of Ξ(Ψ(α)) or in the down-set of {S_a} with S_a ∈ α. The example α = {S_3, S_12} is visualized in Figure 8, and we encourage the reader to work through another example, such as α = {S_13} →_Ψ {S_1, S_3} →_Ξ {S_12, S_23}.
With these relations in place, we can construct dual-decompositions and prove their correctness.
Lemma 6. 
The pointwise dual-decomposition for the redundancy lattice of a loss measure on the synergy lattice is defined by:
i , f ( α , T , t ) : = 0 if α = i , f ( SL , T , t ) β P 1 ( α ) ( 1 ) | β | 1 i , f ( β , T , t ) otherwise
The proof of Lemma 6 is shown separately in Appendix D. This section discussed the relation between four decomposition lattices, which are the redundancy and synergy lattice, as well as their reversed counterparts. Additionally, we demonstrated how this relation can be used to transform a cumulative decomposition measure between them. Decomposition duality enforces each lattice to be consistent with its set-theoretic interpretation. The function Ψ corresponds to taking the set-theoretic complement on the redundancy lattice and, thus, reflects on the cumulative measure by subtracting it from the top atom. The function Ξ corresponds to the relation between the union and intersection and, thus, introduces an inclusion–exclusion principle between their cumulative measures.
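Since all lattices involved are finite, converting a cumulative measure into its partial contributions is a Möbius inversion over the corresponding order. The following generic sketch is our own helper: it assumes only that the cumulative measure equals the sum of partial contributions over the down-set of the supplied order relation, so the concrete loss or redundancy measure and the matching lattice order have to be plugged in.

```python
def partial_contributions(atoms, leq, cumulative):
    """Moebius inversion on a finite lattice:
    Delta(alpha) = cumulative(alpha) - sum of Delta(beta) over the strict down-set of alpha."""
    delta = {}

    def solve(alpha):
        if alpha not in delta:
            below = [beta for beta in atoms if beta != alpha and leq(beta, alpha)]
            delta[alpha] = cumulative(alpha) - sum(solve(beta) for beta in below)
        return delta[alpha]

    for atom in atoms:
        solve(atom)
    return delta
```

For the synergy lattice, `leq` is the synergy order and `cumulative` the loss measure; for the redundancy lattice, the redundancy order and the redundancy measure.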

3.5. Decomposing f-Information on the Redundancy Lattice

Using the results from Section 3.4, we can now convert the decomposition of Section 3.2 to the redundancy lattice. The conversion can be applied to both the expected and the pointwise measure. The partial contributions ( Δ i , f and Δ I , f ) are obtained from the Möbius inverse.
Lemma 7 
(Dual-decomposition on the redundancy lattice). The definitions of Equation (43) correspond to the dual-decomposition of Definition 26.
i , f ( α , T , t ) = 0 if α = β P 1 ( α ) ( 1 ) | β | 1 i f ( P T ( t ) , κ ( β , T , t ) ) otherwise
I , f ( α ; T ) = t T P T ( t ) · i , f ( α , T , t ) = I f ( V ; T ) β P 1 ( α ) ( 1 ) | β | 1 I , f ( β ; T )
Proof. 
The duality of the pointwise measure is obtained from Lemma 6 and Definition 26, which in turn implies the duality of the combined measure. □
The function i f ( P T ( t ) , κ ( α , T , t ) ) quantifies the convex hull (Blackwell join) of the Neyman–Pearson regions of its sources and represents a notion of pointwise union information about the target state t ∈ T . It is used in Equation (33a) to define a pointwise loss measure for the synergy lattice by subtracting it from the total information. As expected, we can see that the corresponding dual-decomposition on the redundancy lattice enforces an inclusion–exclusion relation between our notions of pointwise union information ( i f ( P T ( t ) , κ ( α , T , t ) ) ) and pointwise intersection information ( i , f ( α , T , t ) ).
Theorem 4. 
The dual-decomposition as defined by Equation (43) provides a non-negative PID, which satisfies an inclusion–exclusion relation and the axioms of Williams and Beer [1] on the redundancy lattice for any f-information.
Proof. 
  • Axiom 1: The measure i , f is invariant to permuting the order of sources in α , since the join operator of the zonogon order ( S α ) is. Therefore, also, I , f satisfies Axiom 1.
  • Non-negativity: The non-negativity of Δ i , f is obtained from Lemma 7 and Theorem 3 as shown in Equation (44). The non-negativity of the pointwise measure implies the non-negativity of the combined measure Δ I , f .
    α A ( V ) . Δ i , f ( α , T , t ) = 0 0 if α = RL Δ i , f ( Ξ ( Ψ ( α ) ) , T , t ) 0 otherwise
  • Axiom 2: Since the cumulative measures i , f and I , f correspond to the sum of partial contributions in their down-set, the non-negativity of partial information implies the monotonicity of the cumulative measures.
  • Axiom 3*: For a single source, I , f equals f-information by definition (see Equation (30)). Therefore, I , f satisfies Axiom 3*.
The operational interpretation of Section 3.3 is maintained since the partial contributions are identical between both lattices.
Remark 8. 
The definitions of Equations (34) and (43) satisfy the desired property of Bertschinger et al. [9], who argued that any sensible measure for unique and redundant information should depend only on the marginal distribution of each source with the target.
Remark 9. 
As discussed before [20], it is possible to further split redundancy into two components for extracting the pointwise meet under the Blackwell order (zonogon intersection, first component). The second component of redundancy as defined above contains decision regions that are part of the convex hull, but not the individual channel zonogons (discussed as shared information in [20]). By combining Equation (43) and Lemma A7, we obtain that both components of this split for redundancy are non-negative.

3.6. Decomposing Rényi-Information

Since Rényi-information is an invertible transformation of Hellinger-information and α -information, we argue that their decompositions should be consistent. We propose to view the decomposition of Rényi-information as a transformation from an f-information and demonstrate the approach by transferring the Hellinger-information decomposition to a Rényi-information decomposition. Then, we demonstrate that the result is invariant to a linear scaling of the considered f-information, such that the transformation from α -information provides identical results. The obtained Rényi-information decomposition is non-negative and satisfies the three axioms proposed by Williams and Beer [1] (see below). However, its inclusion–exclusion relation is based on a transformed addition operator. For transforming the decomposition, we consider Rényi-information to be a re-graduation of Hellinger-information, as shown in Equation (45).
$v_a(z) := \frac{1}{a-1} \log\big(1 + (a-1)\, z\big)$
$I_{R_a}(S; T) = v_a\big(I_{H_a}(S; T)\big)$
To maintain consistency when transforming the measure, we also have to transform its operators ([19], p. 6 ff.):
Definition 29 
(Addition of Rényi-information). We define the addition of Rényi-information $\oplus_a$ together with its corresponding inverse operation $\ominus_a$ by Equation (46).
$x \oplus_a y := v_a\big(v_a^{-1}(x) + v_a^{-1}(y)\big) = \frac{\log\big(e^{(a-1)x} + e^{(a-1)y} - 1\big)}{a-1}$
$x \ominus_a y := v_a\big(v_a^{-1}(x) - v_a^{-1}(y)\big) = \frac{\log\big(e^{(a-1)x} - e^{(a-1)y} + 1\big)}{a-1}$
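As a numerical illustration (our own sketch, relying only on Equations (45) and (46)), the re-graduation and the transformed operators can be written as follows; the printed values indicate how the transformed addition approaches ordinary addition as a → 1, which Equation (50) below makes precise.

```python
import numpy as np

def v(a, z):
    """Re-graduation of Equation (45)."""
    return np.log1p((a - 1.0) * z) / (a - 1.0)

def v_inv(a, w):
    return np.expm1((a - 1.0) * w) / (a - 1.0)

def renyi_add(a, x, y):
    """Transformed addition of Definition 29."""
    return v(a, v_inv(a, x) + v_inv(a, y))

def renyi_sub(a, x, y):
    """Transformed subtraction (the inverse operation) of Definition 29."""
    return v(a, v_inv(a, x) - v_inv(a, y))

x, y = 0.3, 0.7
print(renyi_add(2.0, x, y))      # differs from x + y for a != 1
print(renyi_add(1.0001, x, y))   # close to x + y = 1.0
assert abs(renyi_sub(2.0, renyi_add(2.0, x, y), y) - x) < 1e-9
```

Using `log1p`/`expm1` keeps the transformation numerically stable for values of a close to 1.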
To transform a decomposition of the synergy lattice, we define the cumulative loss measures as shown in Equation (47) and use the transformed operators when computing the Möbius inverse (Equation (48a)) to maintain consistency in the results (Equation (48b)).
Definition 30. 
The cumulative and partial Rényi-information loss measures are defined as transformations of the cumulative and partial Hellinger-information loss measures, as shown in Equations (47) and (48).
I , R a ( α ; T ) : = v a ( I , H a ( α ; T ) )
Δ I , R a ( α ; T ) : = I , R a ( α ; T ) a β ˙ S α Δ I , R a ( β ; T ) where : + : = a
= v a ( Δ I , H a ( α ; T ) )
Remark 10. 
We show in Lemma A11 of Appendix E that re-scaling the original f-information does not affect the resulting decomposition or the transformed operators. Therefore, transforming a Hellinger-information decomposition or an α-information decomposition to a Rényi-information decomposition provides identical results.
The operational interpretation presented in Section 3.2 is similarly applicable to partial Rényi-information ( Δ I , R a , Equation (48b)), since the function $v_a$ satisfies $v_a(0) = 0$ and $x \ge 0 \Rightarrow v_a(x) \ge 0$.
Theorem 5. 
The presented definitions for the cumulative loss measure I , R a provide a non-negative PID on the synergy lattice with an inclusion–exclusion relation under the transformed addition (Definition 29) that satisfies Axioms 1, 2, and 3* for any Rényi-information measure.
Proof. 
  • Axiom 1: I , R a ( α ; T ) is invariant to permuting the order of sources, since I , H a ( S ; T ) satisfies Axiom 1 (see Section 3.2).
  • Axiom 2: I , R a ( α ; T ) satisfies monotonicity, since I , H a ( S ; T ) satisfies Axiom 2 (see Section 3.2) and the transformation function $v_a$ is monotonically increasing for $a \in (0,1) \cup (1,\infty)$.
  • Axiom 3*: Since I , H a satisfies Axiom 3* (see Section 3.2, Equations (45) and (47)), I , R a satisfies the self-redundancy axiom by definition, however, at a transformed operator: I , R a ( { S i } ; T ) = I R a ( { V } ; T ) $\ominus_a$ I R a ( { S i } ; T ).
  • Non-negativity: The decomposition of I , R a is non-negative, since Δ I , H a is non-negative (see Section 3.2), the Möbius inverse is computed with transformed operators (Equation (48b)) and the function v a satisfies x 0 0 v a ( x ) .
Remark 11. 
To obtain an equivalent decomposition of Rényi-information on the redundancy lattice, we can correspondingly transform the dual-decomposition of Hellinger-information on the redundancy lattice, as shown in Equation (49). The resulting decomposition satisfies non-negativity, the axioms of Williams and Beer [1], and an inclusion–exclusion relation under the transformed operators (Definition 29) for the same reasons as described above and in Theorem 4.
I , R a ( α ; T ) : = v a ( I , H a ( α ; T ) )
Δ I , R a ( α ; T ) : = v a ( Δ I , H a ( α ; T ) )
Remark 12. 
The relation between the redundancy and synergy lattice can be used for the definition of a bi-valuation [19] in calculations as discussed in [20]. This is also possible for Rényi-information at transformed operators.
When taking the limit of Rényi-information for a → 1, we obtain mutual information ( I KL ). Since mutual information is also an f-information, we expect its operators in the Möbius inverse to be ordinary addition and subtraction. This is indeed the case (Equation (50)), and the measures will be consistent.
$\lim_{a \to 1} x \oplus_a y = x + y \qquad\qquad \lim_{a \to 1} x \ominus_a y = x - y$
Finally, the decomposition of Bhattacharyya-information can be obtained by re-scaling the decomposition of Rényi-information at a = 0.5, which induces a further transformation of the addition operator in the inclusion–exclusion relation.

4. Evaluation

A comparison of the proposed decomposition with other methods in the literature can be found in [20] for mutual information. Therefore, this section first compares different f-information measures for typical decomposition examples and discusses the special case of total variation (TV)-information to explain its distinct behavior. Since we can see larger differences between measures in more complex scenarios, we then compare the measures by analyzing the information flows in a Markov chain. We provide the implementation used for both dual-decompositions of f-information and the examples used in this work in [30].

4.1. Partial Information Decomposition

4.1.1. Comparison of Different f-Information Measures

We use the examples discussed by Finn and Lizier [13] to compare different f-information decompositions and add a generic example from [20]. All probability distributions used and their abbreviations can be found in Appendix F. We normalize the decomposition results to the f-entropy of the target variable for the visualization in Figure 9.
Since all results are based on the same framework, they behave similarly for examples that analyze a specific aspect of the decomposition function (XOR, Unq, PwUnq, RdnErr, Tbc, AND). However, it can be observed that the decomposition of total variation (TV) appears to differ from others: (1) In all examples, total variation attributes more information to being redundant than other measures. (2) In the generic example, total variation is the only measure that does not attribute any information to being unique to variable one or synergetic. We discuss the case of total variation in Section 4.1.2 to explain its distinct behavior.
We visualize the zonogons for the generic example in Figure A2 to highlight that the implication of the operational interpretation does not hold in the other direction: the existence of partial information implies an advantage in the expected reward towards some state of the target variable, but such an advantage does not imply partial information, as the example of total variation shows.

4.1.2. The Special Case of Total Variation

The behavior of total variation appears different compared to other f-information measures (Figure 9). This is due to total variation measuring the perimeter of a zonogon such that the result corresponds to a linear scaling of the maximal (Euclidean) height h * that the zonogon reaches above the diagonal, as visualized in Figure 10.
Remark 13. 
From a cost perspective, the height h * can be interpreted as the performance evaluation of the optimal decision strategy (symmetric point to P * in the lower zonogon half) for a prediction $\hat{T}$ with minimal expected cost at the cost ratio $\frac{\mathrm{Cost}(T{=}t, \hat{T}{\neq}t) - \mathrm{Cost}(T{=}t, \hat{T}{=}t)}{\mathrm{Cost}(T{\neq}t, \hat{T}{=}t) - \mathrm{Cost}(T{\neq}t, \hat{T}{\neq}t)} = \frac{1 - P_T(t)}{P_T(t)}$ (see Equation (8) of [31]) for each target state individually.
Lemma 8. 
(a)
The pointwise total variation ( i TV ) is a linear scaling of the maximal (Euclidean) height h * that the corresponding zonogon reaches above the diagonal, as visualized in Figure 10 (Equation (51a)).
(b)
For a non-empty set of pointwise channels A , pointwise total variation i TV quantifies the join element to the maximum of its individual channels (Equation (51b)).
(c)
The loss measure i , TV quantifies the meet for a set of sources on the synergy lattice to their minimum (Equation (51c)).
$i_{\mathrm{TV}}(p, \kappa) = \frac{1-p}{2} \sum_{v \in \kappa} |v_x - v_y| = (1-p)\, h^{*} \sqrt{2}$
$i_{\mathrm{TV}}\big(p, \bigvee_{\kappa \in \mathcal{A}} \kappa\big) = \max_{\kappa \in \mathcal{A}} i_{\mathrm{TV}}(p, \kappa)$
i , TV ( α A α , T , t ) = min α A i , TV ( α , T , t )
Proof. 
The proof of the first two statements (Equations (51a) and (51b)) is provided separately in Appendix G, which imply the third (Equation (51c)) by Definition 26. □
Quantifying the meet element on the synergy lattice to the minimum has the following consequences for total variation: (1) It attributes a minimum amount of synergy, and therefore more information to redundancy than other measures. (2) For each state of the target, at most one variable can provide unique information. In the case of | T | = 2 , the pointwise channels are symmetric (see Equation (6)), such that the same variable provides the maximal zonogon height both times. This is the case in the generic example of Figure 9, and the reason why at most one variable can provide unique information in this setting. However, beyond binary targets ( | T | > 2 ), both variables may provide unique information at the same time since different sources can provide the maximal zonogon height for different target states (see the later example in Figure 11).
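As a quick numerical illustration of Lemma 8 (our own sketch, assuming channels are represented as 2 × n row-stochastic matrices as in Appendix A and reading Equation (51a) as reconstructed above), the pointwise total variation of a random channel coincides with the scaled zonogon height:

```python
import numpy as np

def pointwise_tv(p, kappa):
    """i_TV(p, kappa) for a binary-input channel given as a 2 x n row-stochastic matrix."""
    x, y = kappa[0], kappa[1]
    return 0.5 * (1.0 - p) * np.abs(x - y).sum()

def zonogon_height(kappa):
    """Maximal Euclidean height above the diagonal, reached at the vertex summing all columns with x > y."""
    d = kappa[0] - kappa[1]
    return d[d > 0].sum() / np.sqrt(2.0)

rng = np.random.default_rng(0)
kappa = rng.dirichlet(np.ones(4), size=2)     # random binary-input channel with four outputs
p = 0.3
assert np.isclose(pointwise_tv(p, kappa), np.sqrt(2.0) * (1.0 - p) * zonogon_height(kappa))
```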
Remark 14. 
Using the pointwise minimum on the synergy lattice results in a similar structure to the proposed measure of Williams and Beer [1]. However, TV-information is based on a different pointwise measure i TV , which, unlike pointwise KL-information, displays this behavior (Equation (51b)).

4.2. Information Flow Analysis

The differences between f-information measures observed in Section 4.1 become more visible in complex scenarios. Therefore, this section compares different measures in the information flow analysis of a Markov chain.
Consider a Markov chain $M_1 \to M_2 \to \cdots \to M_5$, where $M_i = (X_i, Y_i)$ is the joint distribution of two variables. Assume that we are interested in stage three and, thus, define $T = M_3$ as the target variable. Using the approach described in Section 3, we can compute an information decomposition for each stage $M_i$ of the Markov chain with respect to the target. Now, we are additionally interested in how the partial information decomposition of stage $M_i$ propagates into the next stage $M_{i+1}$, as visualized in Figure 11.
Definition 31 
(Partial information flow). The partial information flow of an atom α A ( M i ) into the atom β A ( M i + 1 ) quantifies the redundancy between the partial contributions of their respective decomposition lattices.
Notation 8. 
We use the notation I , f with { , } to refer to either the loss measure I , f or redundancy measure I , f . The same applies to the functions J , f and J Δ , f of Equation (52).
Let α ∈ A ( M i ) and β ∈ A ( M i + 1 ) ; then we compute information flows equivalently on the redundancy or synergy lattice as shown in Equation (52). When using the redundancy measure, the strict down-set of α refers to the strict down-set on its redundancy lattice over A ( M i ) ; when using the loss measure, it refers to the strict down-set on its synergy lattice over A ( M i ) . We obtain the intersection of cumulative measures by quantifying their meet, which is, on both lattices, equivalent to the union of their sources (Equation (52a)). To obtain how much of the partial contribution of α can be found in the cumulative measure of β , we remove the contributions of the strict down-set of α on the lattice of A ( M i ) (Equation (52b)). Finally, to obtain the flow from the partial contribution of α to the partial contribution of β , we similarly remove the contributions of the strict down-set of β on the lattice of A ( M i + 1 ) (Equation (52c)). The approach can be extended for tracing information flows over multiple steps; however, we only trace one step in this example.
J , f ( α , β , T ) : = I , f ( α β ; T )
J Δ , f ( α , β , T ) : = J , f ( α , β , T ) : = γ ˙ α J Δ , f ( γ , β , T )
J Δ Δ , f ( α , β , T ) J Δ , f ( α , β , T ) γ ˙ β J Δ Δ , f ( α , γ , T )
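Reading Equation (52) as a recursion suggests the following sketch (our own illustration; the strict down-sets and the cumulative measure of the union of sources have to be supplied from the chosen lattice and f-information measure):

```python
def partial_flow(alpha, beta, joint_cumulative, strict_down_a, strict_down_b):
    """Partial information flow from atom alpha (stage M_i) to atom beta (stage M_{i+1}), Equation (52)."""

    def j_delta(a, b):
        # Eq. (52b): strip the contributions of the strict down-set of a (lattice of M_i).
        return joint_cumulative(a, b) - sum(j_delta(g, b) for g in strict_down_a(a))

    def j_delta_delta(a, b):
        # Eq. (52c): additionally strip the contributions of the strict down-set of b (lattice of M_{i+1}).
        return j_delta(a, b) - sum(j_delta_delta(a, g) for g in strict_down_b(b))

    return j_delta_delta(alpha, beta)
```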
Remark 15. 
The resulting partial information flows are equivalent (dual) between the redundancy and loss measure, except for the bottom element since their functionality differs: The flow from or to the bottom element on the redundancy lattice is always zero. In contrast, the flow from or to the bottom element on the synergy lattice quantifies the information gained or lost in the step.
Remark 16. 
The information flow analysis of Rényi- and Bhattacharyya-information can be obtained as a transformation of the information flow from Hellinger-information. Alternatively, the information flow can be computed directly using Equation (52) under the corresponding definition of addition and subtraction for the information measure used.
We randomly generate an initial distribution and each row of a transition matrix under the constraint that at least one value shall be above 0.8 to avoid an information decay that is too rapid through the chain. The specific parameters of the example are shown in Appendix H. The event spaces used are X = { 0 , 1 , 2 } and Y = { 0 , 1 } such that | M i | = 6 . We construct a Markov chain of five steps with the target T = M 3 and trace each partial information for one step using Equation (52). We visualize the results for KL-, TV-, and χ 2 -information in Figure 11, and the results for H 2 -, LC-, and JS-information in Figure A3 of Appendix H.
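The chain construction itself can be sketched as follows (our own illustration; this is one possible way to enforce the constraint on the rows, while the exact parameters of the example are listed in Appendix H):

```python
import numpy as np

def constrained_row(n, rng, peak=0.8):
    """Random probability vector with one entry forced above `peak`."""
    row = np.zeros(n)
    j = rng.integers(n)
    row[j] = peak + (1.0 - peak) * rng.random()
    row[np.arange(n) != j] = rng.dirichlet(np.ones(n - 1)) * (1.0 - row[j])
    return row

rng = np.random.default_rng(42)
n = 6                                                    # |M_i| = |X| * |Y| = 3 * 2
initial = constrained_row(n, rng)
transition = np.stack([constrained_row(n, rng) for _ in range(n)])

marginals = [initial]                                    # distributions of the stages M_1, ..., M_5
for _ in range(4):
    marginals.append(marginals[-1] @ transition)
```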
All results display the expected behavior that the information that M i provides about M 3 increases for 1 ≤ i ≤ 3 and decreases for 3 ≤ i ≤ 5 . The information flow results of KL-, H 2 -, LC-, and JS-information are conceptually similar. Their main differences appear in the rate at which the information decays and, therefore, in how much of the total information we can trace. In contrast, the results of TV- and χ 2 -information display different behavior, as shown in Figure 11: TV-information indicates significantly more redundancy, and χ 2 -information displays significantly more synergy than the other measures. Additionally, the decomposition of TV-information contains fewer information flows. For example, it is the only analysis that does not show any information flow from M 2 into the unique contribution of Y 3 or from M 2 into the synergy of ( X 3 , Y 3 ) . This demonstrates that the same decomposition method can obtain different behaviors from different f-divergences.

5. Discussion

Using the Blackwell order to construct pointwise lattices and to decompose pointwise information is motivated by the following three aspects:
  • All information measures in Section 2.3 are the expected value of the pointwise information (quantification of the Neyman–Pearson region boundary) for an indicator variable of each target state. Therefore, we argue for acknowledging the “pointwise nature” [13] of these information measures and to decompose them accordingly. A similar argument was made previously by Finn and Lizier [13] for the case of mutual information and motivated their proposed pointwise partial information decomposition.
  • The Blackwell order does not form a lattice beyond indicator variables since it does not provide a unique meet or join element for | T | > 2 [17]. However, from a pointwise perspective, the informativity (Definition 2) provides a unique representation of union information. This enables separating the definition of redundant, unique, and synergetic information from a specific information measure, which then only serves for its quantification. We interpret these observations as an indication that the Blackwell order should be used to decompose pointwise information based on indicator variables rather than decomposing the expected information based on the full target distribution.
  • We can consider where the alternative approach would lead, if we decomposed the expected information from the full target distribution using the Blackwell order: the decomposition would become identical to the method of Bertschinger et al. [9] and Griffith and Koch [10]. For bivariate examples ( | V | = 2 ), this decomposition [9,10] is non-negative and satisfies an additional property (identity, proposed by Harder et al. [5]). However, the identity property is inconsistent [32] with the axioms of Williams and Beer [1] and non-negativity for | V | > 2 . This causes negative partial information when extending the approach to | V | > 2 . The identity property also contradicts the conclusion of Finn and Lizier [13] from studying Kelly Gambling that, “information should be regarded as redundant information, regardless of the independence of the information sources” ([13], p. 26). It also contradicts our interpretation of distinct information through distinct decision regions when predicting an indicator variable for some target state. We do not argue that this interpretation should be applicable to the concept of information in general, but acknowledge that this behavior seems present in the information measures studied in this work and construct their decomposition accordingly.
Our critique of the decomposition measure of Williams and Beer [1] focuses on the implication that a less informative variable (Definition 2) about t ∈ T provides less pointwise information ( I ( S ; T = t ) , Equation (15a)): $\kappa(S_1, T, t) \sqsubseteq \kappa(S_2, T, t) \Rightarrow I(S_1; T{=}t) \le I(S_2; T{=}t)$. This implication does not hold in the other direction. Therefore, equal pointwise information does not imply equal informativity and, thus, does not mean being redundant.
We chose to define a notion of pointwise union information based on the join of the Blackwell order since it leads to a meaningful operational interpretation: the convex hull of the pointwise Neyman–Pearson regions is always a subset of their joint distribution. Moreover, it is possible to construct joint distributions for which each individual decision region outside the convex hull becomes inaccessible, even if there may not exist one unique joint distribution at which all synergetic regions are lost simultaneously. This volatility due to the dependence between variables appears suitable for a notion of synergy. Similarly, the resulting unique information appears suitable since it ensures that a variable with unique information must provide access to some additional decision region. Finally, the obtained unique and redundant information is sensible [9] since it only depends on the marginal distributions with the target. The operational interpretation can be strengthened further such that the implication between accessible regions and partial information holds in both directions by revising Lemmas A1 and A2 with a strictly convex generator function.
We perform the decomposition on a pointwise lattice using the Blackwell join since it is possible to represent f-information as the expected value of quantifying the Neyman–Pearson region boundary (zonogon perimeter) for indicator variables (pointwise channels). Since the pointwise measures satisfy a triangle inequality, we mentioned the oversimplified intuition of pointwise f-information as the length of the zonogon perimeter. Correspondingly, if we identified an information measure that behaved more like the area of the zonogon (which could also maintain their ordering), then we would need to decompose it on a pointwise lattice using the Blackwell meet to achieve non-negativity. We assume that most information measures behave more like a quantification of the boundary length than of the enclosed area, since the boundary segments can directly be obtained from the conditional probability distribution and do not require an actual construction from the likelihood-ratio test.
In the literature, PIDs have been defined based on different ordering relations [16], the Blackwell order being only one of them. We think that this diversity is desirable since each approach provides a different operational interpretation of redundancy and synergy. For this reason, we wonder whether a non-negative decomposition with the inclusion–exclusion relation would also be possible for other ordering relations when transferring them to a pointwise perspective or from mutual information to other information measures.
Studying the relations between different information measures for the same decomposition method may provide further insights into their properties, as demonstrated by the example of total variation in Section 4.2. The ability to decompose different information measures is also necessary for applying the method in a variety of areas, since each information measure can then provide the operational meaning within its respective domain. To ensure consistency between related information measures, we allowed the re-definition of information addition, as demonstrated in the example of Rényi-information in Section 3.6, which also opens new possibilities for satisfying the inclusion–exclusion relation.
There is currently no universally accepted definition of conditional Rényi information. Assuming that $I_{R_a}(T; S_i \mid S_j)$ should capture the information that S i provides about T when already knowing the information from S j , one could propose that this quantity should correspond to the respective partial information contributions (unique/synergetic) and, thus, the definition of Equation (53).
$I_{R_a}(T; S_i \mid S_j) := I_{R_a}(T; S_i, S_j) \ominus_a I_{R_a}(T; S_j)$
With this in mind, it is also possible to define, model, decompose, and trace Transfer Entropy [33], used in the analysis of complex systems, for each presented information measure with the methodology of Section 4.2.
Finally, studying the corresponding definitions for continuous random variables and identifying suitable information measures for specific applications would be interesting directions for future work.

6. Conclusions

In this work, we demonstrated a non-negative PID in the framework of Williams and Beer [1] for any f-information with practical operational interpretation and the conversion of measures between decomposition lattices. We demonstrated that the decomposition of f-information can be used to obtain a non-negative decomposition of Rényi-information, for which we re-defined the addition to demonstrate that its results satisfy an inclusion–exclusion relation. Finally, we demonstrated how the proposed decomposition method can be used for tracing the flow of information through Markov chains and how the decomposition obtains different properties depending on the chosen information measure.

Author Contributions

Conceptualization, Writing—original draft: T.M.; Analysis of non-negativity: T.M. and E.A.; Writing—review and editing: E.A. and C.R.; Supervision: C.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Swedish Civil Contingencies Agency (grant number MSB 2018-12526) and the Swedish Research Council (grant number VR 2020-04430).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of the data; in the writing of the manuscript; nor in the decision to publish the results.

Appendix A. Quantifying Zonogon Perimeters

Lemma A1. 
If the function f is convex, then the function r f ( p , v ) as defined in Equation (28a) is convex in its second argument v [ 0 , 1 ] 2 for a constant p [ 0 , 1 ] .
Proof. 
We use the following definitions for abbreviating the notation. Let $0 \le t \le 1$ and $v_i = \begin{pmatrix} x_i \\ y_i \end{pmatrix}$:
$a_1 := x_1 p + y_1 (1-p) \qquad a_2 := x_2 p + y_2 (1-p) \qquad b_1 := \frac{t\, a_1}{t\, a_1 + (1-t)\, a_2} \qquad b_2 := \frac{(1-t)\, a_2}{t\, a_1 + (1-t)\, a_2}$
The case of $a_i = 0$ is handled by the convention that $0 \cdot f\!\left(\frac{0}{0}\right) = 0$. Therefore, we can assume that $a_i \neq 0$ and use $0 \le b_1 \le 1$ with $b_2 = 1 - b_1$ to apply the definition of convexity on the function f:
$$\begin{aligned}
r_f\!\left(p,\, \begin{pmatrix} t x_1 + (1-t) x_2 \\ t y_1 + (1-t) y_2 \end{pmatrix}\right)
&= \big(t a_1 + (1-t) a_2\big) \cdot f\!\left(\frac{t x_1 + (1-t) x_2}{t a_1 + (1-t) a_2}\right) \\
&= \big(t a_1 + (1-t) a_2\big) \cdot f\!\left(b_1 \frac{x_1}{a_1} + b_2 \frac{x_2}{a_2}\right) \\
&\le \big(t a_1 + (1-t) a_2\big) \cdot \left(b_1 f\!\left(\frac{x_1}{a_1}\right) + b_2 f\!\left(\frac{x_2}{a_2}\right)\right) \qquad (\text{by convexity of } f) \\
&= t\, a_1 f\!\left(\frac{x_1}{a_1}\right) + (1-t)\, a_2 f\!\left(\frac{x_2}{a_2}\right) \\
&= t \cdot r_f\!\left(p, \begin{pmatrix} x_1 \\ y_1 \end{pmatrix}\right) + (1-t) \cdot r_f\!\left(p, \begin{pmatrix} x_2 \\ y_2 \end{pmatrix}\right)
\end{aligned}$$
Corollary A1. 
For $v_1, v_2, (v_1 + v_2) \in [0,1]^2$ and a constant $p \in [0,1]$, the function $r_f(p, v)$ as defined in Equation (28a) satisfies a triangle inequality in its second argument: $r_f(p, v_1 + v_2) \le r_f(p, v_1) + r_f(p, v_2)$.
Proof. 
$$\begin{aligned}
r_f\big(p,\, t\, v_1 + (1-t)\, v_2\big) &\le t\, r_f(p, v_1) + (1-t)\, r_f(p, v_2) && (\text{by Lemma A1}) \\
r_f\big(p,\, 0.5\,(v_1 + v_2)\big) &\le 0.5\,\big(r_f(p, v_1) + r_f(p, v_2)\big) && (\text{let } t = 0.5) \\
r_f(p,\, v_1 + v_2) &\le r_f(p, v_1) + r_f(p, v_2) && (\text{by } r_f(p, c\, v) = c \cdot r_f(p, v))
\end{aligned}$$
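A quick numerical check of this triangle inequality (our own sketch): it assumes the explicit form of r_f that appears in the expansion of the proof of Lemma A1 and uses the KL generator f(u) = u·log(u) as an example choice.

```python
import numpy as np

def r_f(f, p, v):
    """r_f(p, v) = (x*p + y*(1-p)) * f(x / (x*p + y*(1-p))), with the 0*f(0/0) = 0 convention."""
    x, y = v
    a = x * p + y * (1.0 - p)
    return 0.0 if a == 0.0 else a * f(x / a)

f_kl = lambda u: u * np.log(u) if u > 0 else 0.0         # convex generator of KL-divergence

rng = np.random.default_rng(1)
p = 0.4
for _ in range(1000):
    v1, v2 = rng.random(2) * 0.5, rng.random(2) * 0.5    # keeps v1 + v2 inside [0, 1]^2
    assert r_f(f_kl, p, v1 + v2) <= r_f(f_kl, p, v1) + r_f(f_kl, p, v2) + 1e-12
```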
Lemma A2. 
For a constant $p \in [0,1]$, the function $i_f$ maintains the ordering relation of the Blackwell order on binary input channels: $\kappa_1 \sqsubseteq \kappa_2 \Rightarrow i_f(p, \kappa_1) \le i_f(p, \kappa_2)$.
Proof. 
Let $\kappa_1$ be represented by a $2 \times n$ matrix and $\kappa_2$ by a $2 \times m$ matrix. By the definition of the Blackwell order ($\kappa_1 \sqsubseteq \kappa_2$, Equation (2)), there exists a stochastic matrix $\lambda$ such that $\kappa_1 = \kappa_2 \cdot \lambda$. We use the notation $\kappa_2[:, i]$ to refer to the $i$-th column of matrix $\kappa_2$ and indicate the element at row $i \in \{1..m\}$ and column $j \in \{1..n\}$ of $\lambda$ by $\lambda[i, j]$. Since $\lambda$ is a valid (row) stochastic matrix of dimension $m \times n$, its rows sum to one: $\forall i \in \{1..m\}.\ \sum_{j=1}^{n} \lambda[i, j] = 1$.
$$\begin{aligned}
i_f(p, \kappa_1) &= \sum_{j=1}^{n} r_f\big(p, \kappa_1[:, j]\big) && (\text{by Equation (28b)}) \\
&= \sum_{j=1}^{n} r_f\!\left(p, \sum_{i=1}^{m} \kappa_2[:, i]\, \lambda[i, j]\right) && (\text{by Equation (2)}) \\
&\le \sum_{j=1}^{n} \sum_{i=1}^{m} r_f\big(p, \kappa_2[:, i]\, \lambda[i, j]\big) && (\text{by Corollary A1}) \\
&= \sum_{j=1}^{n} \sum_{i=1}^{m} \lambda[i, j]\, r_f\big(p, \kappa_2[:, i]\big) && (\text{by } r_f(p, c\, v) = c \cdot r_f(p, v)) \\
&= \sum_{i=1}^{m} r_f\big(p, \kappa_2[:, i]\big) && \left(\text{by } \textstyle\sum_{j=1}^{n} \lambda[i, j] = 1\right) \\
&= i_f(p, \kappa_2) && (\text{by Equation (28b)})
\end{aligned}$$
Lemma A3. 
Consider two non-empty sets of binary input channels with equal cardinality ($|\mathcal{A}| = |\mathcal{B}|$) and a constant $p \in [0,1]$. If the Minkowski sum of the zonogons of the channels in $\mathcal{A}$ is a subset of the Minkowski sum of the zonogons of the channels in $\mathcal{B}$, then the sum of pointwise information over the channels in $\mathcal{A}$ is at most the sum of pointwise information over the channels in $\mathcal{B}$, as shown in Equation (A1).
$\sum_{\kappa \in \mathcal{A}} Z(\kappa) \subseteq \sum_{\kappa \in \mathcal{B}} Z(\kappa) \;\Rightarrow\; \sum_{\kappa \in \mathcal{A}} i_f(p, \kappa) \le \sum_{\kappa \in \mathcal{B}} i_f(p, \kappa)$
Proof. 
Let n = | A | = | B | . We use the notation A [ i ] with 1 i n to indicate the channel κ i within the set A .
i = 1 n Z ( A [ i ] ) i = 1 n Z ( B [ i ] ) Z A [ 1 ] A [ n ] Z B [ 1 ] B [ n ] ( by Equation ( 4 ) ) Z 1 n · A [ 1 ] A [ n ] Z 1 n · B [ 1 ] B [ n ] ( scale to sum ( 1 , 1 ) ) i f p , 1 n · A [ 1 ] A [ n ] i f p , 1 n · B [ 1 ] B [ n ] ( by Equation ( 5 ) , Lemma A 2 ) i = 1 n i f p , 1 n A [ i ] i = 1 n i f p , 1 n B [ i ] ( by Equation ( 28 b ) ) 1 n i = 1 n i f p , A [ i ] 1 n i = 1 n i f p , B [ i ] ( by r f ( p , v ) = r f ( p , v ) ) κ A i f ( p , κ ) κ B i f ( p , κ )

Appendix B. Inclusion-Exclusion Inequality of Zonogons

Let $P(A)$ represent the power set of a non-empty set $A$ and separate the subsets of even ($L_e$) and odd ($L_o$) cardinality as shown below. Additionally, let $L_{\le 1}$ represent all subsets with cardinality less than or equal to one and $L_{=1}$ all subsets of cardinality equal to one:
$L_{\le 1} := \{B \in P(A) : |B| \le 1\} \qquad L_{=1} := \{B \in P(A) : |B| = 1\} \qquad L_e := \{B \in P(A) : |B| \text{ even}\} \qquad L_o := \{B \in P(A) : |B| \text{ odd}\}$ with $P(A) = L_e \cup L_o$ and $\emptyset = L_e \cap L_o$.
The number of subsets with even cardinality is equal to the number of subsets with odd cardinality as shown in Equation (A3).
$|L_e| = \sum_{i=0}^{\lfloor |A|/2 \rfloor} \binom{|A|}{2i} = 2^{|A|-1} = \sum_{i=0}^{\lfloor |A|/2 \rfloor} \binom{|A|}{2i+1} = |L_o|$
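A short check of this counting identity (our own sketch using only the standard library):

```python
from itertools import combinations

A = {1, 2, 3, 4}
subsets = [set(c) for r in range(len(A) + 1) for c in combinations(A, r)]
even = [B for B in subsets if len(B) % 2 == 0]
odd = [B for B in subsets if len(B) % 2 == 1]
assert len(even) == len(odd) == 2 ** (len(A) - 1)        # Equation (A3)
```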
Consider a function $g_e : L_e \to L_{\le 1}$, which takes an even subset $E \in L_e$ and returns a subset of cardinality $|g_e(E)| = \min(|E|, 1)$ according to Equation (A4).
$\forall E \in L_e: \quad g_e(E) = \emptyset \ \text{ if } E = \emptyset; \qquad g_e(E) = \{e\} \ \text{ s.t. } e \in E \ \text{ otherwise}$
Lemma A4. 
For any function g e G e , there exists a function G : ( L e , G e ) L o that satisfies the following two properties:
(a)
For any subset with even cardinality, the function g e ( · ) returns a subset of function G ( · ) :
$\forall g_e \in G_e,\ \forall E \in L_e:\quad g_e(E) \subseteq G(E, g_e).$
(b)
The function G ( · ) that satisfies Equation (A5) has an inverse on its first argument G 1 : ( L o , G e ) L e .
$\forall g_e \in G_e,\ \forall E \in L_e,\ \exists G^{-1}:\quad G^{-1}\big(G(E, g_e), g_e\big) = E.$
Proof. 
We construct a function G for an arbitrary g e and demonstrate that it satisfies both properties (Equations (A5) and (A6)) by induction on the cardinality of A . We indicate the cardinality of A with n = | A | as subscripts A n , L e , n , L o , n , and G n :
  • In the base case $A_1 = \{a\}$, the sets of subsets are $L_{e,1} = \{\emptyset\}$ and $L_{o,1} = \{\{a\}\}$. We define the function $G_1(\emptyset, g_e) := \{a\}$ for any $g_e$ to satisfy both required properties:
    (a)
    The constraints of Equation (A4) ensure that $g_e(\emptyset) = \emptyset$. Since the empty set is the only element in $L_{e,1}$, the subset relation (requirement of Equation (A5)) is satisfied: $g_e(\emptyset) = \emptyset \subseteq \{a\} = G_1(\emptyset, g_e)$.
    (b)
    The function G 1 : ( L e , 1 , G e ) L o , 1 is a bijection from L e , 1 to L o , 1 and, therefore, has an inverse on its first argument G 1 1 : ( L o , 1 , G e ) L e , 1 (requirement of Equation (A6)).
  • Assume there exists a function $G_n$ that satisfies both required properties (Equations (A5) and (A6)) for sets of cardinality $1 \le n = |A_n|$.
  • For the induction step, we show the definition of a function G n + 1 that satisfies both required properties. For sets A n + 1 = A n { q } , the subsets of even and odd cardinality can be expanded as shown in Equation (A7).
    $L_{e,n+1} = L_{e,n} \cup \big\{O \cup \{q\} : O \in L_{o,n}\big\}, \qquad L_{o,n+1} = L_{o,n} \cup \big\{E \cup \{q\} : E \in L_{e,n}\big\}.$
    We define G n + 1 for E L e , n and O L o , n at any g e as shown in Equation (A8) using the function G n and its inverse G n 1 from the induction hypothesis. The function G n + 1 is defined for any subset in L e , n + 1 as can be seen from Equation (A7).
    $$G_{n+1}(E, g_e) := \begin{cases} E \cup \{q\} & \text{if } g_e\big(G_n(E, g_e) \cup \{q\}\big) \neq \{q\} \\ G_n(E, g_e) & \text{if } g_e\big(G_n(E, g_e) \cup \{q\}\big) = \{q\} \end{cases} \qquad
    G_{n+1}(O \cup \{q\}, g_e) := \begin{cases} O & \text{if } g_e\big(O \cup \{q\}\big) \neq \{q\} \\ G_n^{-1}(O, g_e) \cup \{q\} & \text{if } g_e\big(O \cup \{q\}\big) = \{q\} \end{cases}$$
    Figure A1 provides an intuition for the definition of G n + 1 : the outcome of g e ( O { q } ) determines if the function G n + 1 maintains or breaks the mapping of G n .
    Figure A1. Intuition for the definition of Equation (A8). We can divide the set $P(A_{n+1})$ into $P(A_n)$ and $\{B \cup \{q\} : B \in P(A_n)\}$. The definition of function $G_{n+1}$ mirrors $G_n$ if $g_e(O \cup \{q\}) = \{q\}$ (blue) and otherwise breaks its mapping (orange).
    The function $G_{n+1}$ as defined in Equation (A8) satisfies both requirements (Equations (A5) and (A6)) for any $g_e$:
    (a)
    To demonstrate that the function satisfies the subset relation of Equation (A5), we analyze the four cases for the return value of G n + 1 as defined in Equation (A8) individually:
    -
    g e ( E ) E { q } holds, since the function g e always returns a subset of its input (Equation (A4)).
    -
    g e ( E ) G n ( E , g e ) holds by the induction hypothesis.
    -
    If g e ( O { q } ) { q } , then g e ( O { q } ) O : Since the input to function g e is not the empty set, the function g e ( O { q } ) returns a singleton subset of its input (Equation (A4)). If the element in the singleton subset is unequal to q, then it is a subset of O .
    -
    If g e ( O { q } ) = { q } , then g e ( O { q } ) { q } G n 1 ( O , g e ) holds trivially.
    (b)
    To demonstrate that the function G n + 1 has an inverse (Equation (A6)), we show that the function G n + 1 is a bijection from L e , n + 1 to L o , n + 1 . Since the function G n + 1 is defined for all elements in L e , n + 1 and both sets have the same cardinality ( | L e , n + 1 | = | L o , n + 1 | , Equation (A3)), it is sufficient to show that the function G n + 1 is distinct for all inputs.
    The return value of G n + 1 has four cases, two of which return a set containing q (cases 1 and 4 in Equation (A8)), while the two others do not (cases 2 and 3 in Equation (A8)). Therefore, we have to show that both of these cases cannot coincide for any input:
    -
    Cases 2 and 3 in Equation (A8): If the return value of both cases was equal, then O = G n ( E , g e ) , and therefore, g e ( O { q } ) = g e ( G n ( E , g e ) { q } ) . This leads to a contradiction, since the condition of case 3 ensures g e ( O { q } ) { q } , while the condition of case 2 ensures g e ( G n ( E , g e ) { q } ) = { q } . Hence, the return values of cases 2 and 3 are distinct.
    -
    Cases 1 and 4 in Equation (A8): If the return value of both cases was equal, then E = G n 1 ( O , g e ) , and therefore, g e ( O { q } ) = g e ( G n ( E , g e ) { q } ) . This leads to a contradiction, since the condition of case 4 ensures g e ( O { q } ) = { q } , while the condition of case 1 ensures g e ( G n ( E , g e ) { q } ) { q } . Hence, the return values of cases 1 and 4 are distinct.
    Since the function G n + 1 is a bijection, there exists an inverse G n + 1 1 .
Lemma A5. 
For a non-empty set of 2 × x row stochastic matrices A :
Z κ A κ + | B | even B A Z λ B λ | B | odd B A Z ν B ν
Proof. 
Consider a function $g_o : L_o \to L_{=1}$, where $g_o(O) \subseteq O$, such that the function returns a singleton subset for a set of odd cardinality. Equation (A10) can be obtained from the constraints on $g_e$ (Equation (A4)) and Lemma A4.
g e G e , E L e , g o G o , G : g e ( ) g o ( G ( ) ) if E = g e ( E ) = g o ( G ( E ) ) otherwise
Equation (A11a) holds since we can replace $g_e(\emptyset)$ with $g_o(G(\emptyset))$, meaning there exists a κ ∈ A for creating a (Minkowski) sum over the same set of channel zonogons on both sides of the equality. Equation (A11b) holds since Lemma A4 ensured that the existing function G is a bijection. Equation (A11c) holds since the intersection is a subset of each individual zonogon.
g e G e , g o G o , κ A , G : Z ( κ ) + E L e Z ( g e ( E ) ) = E L e Z ( g o ( G ( E ) ) )
g e G e , g o G o , κ A : Z ( κ ) + E L e Z ( g e ( E ) ) = O L o Z ( g o ( O ) )
g e G e , g o G o : κ A Z ( κ ) + E L e Z ( g e ( E ) ) O L o Z ( g o ( O ) )
Equation (A11c) is parameterized by g e , and the subsets are closed under set union. Therefore, we can combine all choices for g e and g o using the set-theoretic union as shown below. For the notation, let m = 2 | A | 1 , and we indicate subsets of A with even cardinality as E i L e , where 1 i m . We use the last index for the empty set E m = . The subsets of A with odd cardinality are correspondingly noted as O i L o . For clarity, we note binary input channels from an even subset as λ E and binary input channels from an odd subset as ν O .
λ 1 E 1 λ 2 E 2 λ m 1 E m 1 κ A Z ( κ ) + i = 1 m 1 Z ( λ i ) ν 1 O 1 ν 2 O 2 ν m O m j = 1 m Z ( ν j ) κ A Z ( κ ) + i = 1 m 1 λ E i Z ( λ ) j = 1 m ν O j Z ( ν ) Minkowski sum dis tributes over set union Conv κ A Z ( κ ) + i = 1 m 1 λ E i Z ( λ ) Conv j = 1 m ν O j Z ( ν ) if X Y then Conv ( X ) Conv ( Y ) κ A Z ( κ ) + i = 1 m 1 Conv λ E i Z ( λ ) j = 1 m Conv ν O j Z ( ν ) Convex hull distributes over Minkowski sum Z κ A κ + i = 1 m 1 Z λ E i λ j = 1 m Z ν O j ν by Equation ( 7 ) Z κ A κ + | E i | even E i A Z λ E i λ | O j | odd O j A Z ν O j ν replace notation Z κ A κ + | B | even B A Z λ B λ | B | odd B A Z ν B ν

Appendix C. Non-Negativity of Partial f-Information on the Synergy Lattice

The proof of non-negativity can be divided into three parts. First, we show that the loss measure maintains the ordering relation of the synergy lattice and how the quantification of a meet element i , f ( α β , T , t ) can be computed. Second, we demonstrate how the inclusion–exclusion inequality of zonogons under the Minkowski sum from Appendix B leads to relating pointwise information measures with respect to the Blackwell order. Finally, we combine these two results to demonstrate that an inclusion–exclusion relation using the convex hull of zonogons is greater than their intersection and obtain the non-negativity of the decomposition by transitivity.

Appendix C.1. Properties of the Loss Measure on the Synergy Lattice

Lemma A6. 
Any set of sources α P ( P 1 ( V ) ) is equivalent (≅) to some atom of the synergy lattice γ A ( V ) .
α P ( P 1 ( V ) ) . γ A ( V ) . γ α
The union for two sets of sources is equivalent to the meet of their corresponding atoms on the synergy lattice. Let α , β P ( P 1 ( V ) ) and γ , δ A ( V ) :
γ α and δ β ( γ δ ) ( α β )
Proof. 
The used filter in the definition of an atom ( A ( V ) P ( P 1 ( V ) ) , Equation (8)) only removes sets of cardinality 2 | α | , and for any removed set of sources, we can construct an equivalent set that contains one less source by removing the subset S a S b as shown in Equation (A12a). Therefore, all sets of sources α P ( P 1 ( V ) ) are equivalent to some atom γ A ( V ) within the lattice (Equation (A12b)).
S a S b α ( α S a ) where : S a , S b α
α P ( P 1 ( V ) ) , γ A ( V ) . α γ
The union of two sets of sources α P ( P 1 ( V ) ) is inferior to each individual set α and β :
( α β ) α ( by Equation ( 10 ) ) ( α β ) β ( by Equation ( 10 ) )
All sets of sources ε P ( P 1 ( V ) ) that are inferior to both α and β ( ε α and ε β ) are also inferior to their union.
ε α and ε β ε ( α β ) ( by Equation ( 10 ) )
Therefore, the union of α and β is equivalent to the meet of their corresponding atoms on the synergy lattice. □
Proof of Lemma 1 from Section 3.2.
For any set of sources α , β P ( P 1 ( V ) ) and target variable T with state t T , the function κ (Equation (31)) maintains the ordering from the synergy lattice under the Blackwell order.
α β κ ( β , T , t ) κ ( α , T , t )
Proof. 
We consider two cases for β :
  • If β = , then the implication holds for any α since the bottom element κ ( , T , t ) = BW is inferior (⊑) to any other channel.
  • If β , then α is also a non-empty set since α β SL = .
    α β S b β , S a α . S b S a ( by Equation ( 10 ) ) S b β , S a α . κ ( S b , T , t ) κ ( S a , T , t ) ( by Equation ( 2 ) ) S b β κ ( S b , T , t ) S a α κ ( S a , T , t ) κ ( β , T , t ) κ ( α , T , t )
Since the implication holds for both cases, the ordering is maintained. □
Corollary A2. 
The defined cumulative loss measures ( i , f of Equation (33a) and I , f of Equation (34)) maintain the ordering relation of the synergy lattice for any set of sources α , β P ( P 1 ( V ) ) and target variable T with state t T :
α β i , f ( α , T , t ) i , f ( β , T , t ) α β I , f ( α ; T ) I , f ( β ; T )
Proof. 
The pointwise monotonicity of the cumulative loss measure ( α β i , f ( α , T , t ) i , f ( β , T , t ) ) is obtained from Lemmas 1 and A2 with Equation (33a). Since all cumulative pointwise losses i , f are smaller for α than for β , so is their weighted sum ( α β I , f ( α ; T ) I , f ( β ; T ) , see Equation (34)). □
Corollary A3. 
The cumulative pointwise loss of the meet from two atoms is equivalent to the cumulative pointwise loss of their union for any target variable T with state t T :
i , f ( α β , T , t ) = i , f ( α β , T , t ) .
Proof. 
The result follows from Lemma A6 and Corollary A2. □

Appendix C.2. The Non-Negativity of the Decomposition

Lemma A7. 
Consider a non-empty set of of binary input channel A and 0 p 1 . Quantifying an inclusion–exclusion principle on the pointwise information of their Blackwell join is larger than the pointwise information of their Blackwell meet as shown in Equation (A14).
i f p , κ A κ B A ( 1 ) | B | 1 i f p , κ B κ
Proof. 
Z κ A κ + | B | even B A Z λ B λ | B | odd B A Z ν B ν by Lemma A 5 i f p , κ A κ + | B | e v e n B A i f p , κ B κ | B | o d d B A i f p , κ B κ by Lemma A 3 i f p , κ A κ B A ( 1 ) | B | 1 i f p , κ B κ
Lemma A8 
(Non-negativity on the synergy lattice). The decomposition of f-information is non-negative on the pointwise and combined synergy lattice for any target variable T with state t T :
α A ( V ) . 0 Δ i , f ( α , T , t ) , α A ( V ) . 0 Δ I , f ( α ; T ) .
Proof. 
We show the non-negativity of pointwise partial information ( Δ i , f ( α , T , t ) ) in two cases. We write α S to represent the cover set of α on the synergy lattice and use p = P T ( t ) as the abbreviation:
  • Let α = SL = { V } . The bottom element of the synergy lattice is quantified to zero (by Equation (33a), i , f ( SL , T , t ) = 0 ), and therefore, also its partial contribution will be zero ( Δ i , f ( SL , T , t ) = 0 ), which implies Equation (A15).
    α = SL 0 Δ i , f ( α , T , t )
  • Let α A ( V ) { SL } , then its cover set is non-empty ( α S ). Additionally, we know that no atom in the cover set is the empty set ( β α S . β ), since the empty atom is the top element ( SL = ).
    Since it will be required later, note that the inclusion–exclusion principle of a constant is the constant itself as shown in Equation (A16) since, without the empty set, there exists one more subset of odd cardinality than with even cardinality (see Equation (A3)).
    i f ( p , κ ( V , T , t ) ) = B α S ( 1 ) | B | 1 i f ( p , κ ( V , T , t ) )
    We can re-write the Möbius inverse as shown in Equation (A17), where Equation (A17b) is obtained from ([23], p. 15).
    Δ i , f ( α , T , t ) = i , f ( α , T , t ) β ˙ S α Δ i , f ( β , T , t ) ( by Equation ( 33 b ) )
    = i , f ( α , T , t ) B α S ( 1 ) | B | 1 · i , f β B β , T , t
    = i , f ( α , T , t ) B α S ( 1 ) | B | 1 · i , f β B β , T , t ( by Corollary A 3 )
    = i f ( p , κ ( α , T , t ) ) + B α S ( 1 ) | B | 1 · i f ( p , κ ( β B β , T , t ) ) ( by Equations ( 33 a ) , ( A 16 ) )
    = i f ( p , κ ( α , T , t ) ) + B α S ( 1 ) | B | 1 · i f ( p , S ( β B β ) κ ( S , T , t ) ) ( by β α S . β )
    = i f ( p , κ ( α , T , t ) ) + B α S ( 1 ) | B | 1 · i f ( p , β B S β κ ( S , T , t ) )
    = i f ( p , κ ( α , T , t ) ) + B { κ ( β , T , t ) : β α S } ( 1 ) | B | 1 · i f ( p , κ B κ )
    Consider the non-empty set of channels D = { κ ( β , T , t ) : β α S } , then we obtain Equation (A18b) from Lemma A7.
    i f p , κ { κ ( β , T , t ) : β α S } κ B { κ ( β , T , t ) : β α S } ( 1 ) | B | 1 i f p , κ B κ
    i f p , β α S κ ( β , T , t ) B { κ ( β , T , t ) : β α S } ( 1 ) | B | 1 i f p , κ B κ
    We can construct an upper bound on i f ( p , κ ( α , T , t ) ) based on the cover set α S as shown in Equation (A19).
    β α S .       β α
    β α S .       κ ( α , T , t ) κ ( β , T , t ) ( by Lemma 1 )
    κ ( α , T , t ) β α S κ ( β , T , t )
    i f p , κ ( α , T , t ) i f p , β α S κ ( β , T , t ) ( by Lemma A 2 )
    By the transitivity of Equations (A18b) and (A19d), we obtain Equation (A20).
    i f ( p , κ ( α , T , t ) ) B { κ ( β , T , t ) : β α S } ( 1 ) | B | 1 i f p , κ B κ
    By Equations (A17) and (A20), we obtain the non-negativity of pointwise partial information as shown in Equation (A21).
    α A ( V ) { SL } . 0 Δ i , f ( α , T , t )
From Equations (A15) and (A21), we obtain that pointwise partial information is non-negative for all atoms of the lattice:
α A ( V ) . 0 Δ i , f ( α , T )
If all pointwise partial components are non-negative, then their expected value will also be non-negative (see Equation (35)):
α A ( V ) . 0 Δ I , f ( α ; T )

Appendix D. Mappings between Decomposition Lattices and Their Duality

Proof of Lemma 2 from Section 3.4
The function Ψ ( · ) is a bijection on the redundancy lattice without the bottom element (∅) that reverses its order. Let α , β A ( V ) { RL } :
  • Ψ ( Ψ ( α ) ) α ;
  • α β Ψ ( β ) Ψ ( α ) .
Proof. 
  • Property 1: the n-ary Cartesian product ( Ψ ) provides all combinations of one variable from each source (Definition 28). Let γ = Ψ ( α ) , then by Definition 11 (≃) of equivalence Ψ ( γ ) α , we have to show that both elements are inferior to each other under the redundancy order:
    -
    Ψ ( γ ) α : We begin by expanding the definition of the redundancy order as shown in Equation (A24) to highlight that it is sufficient to show that α Ψ ( γ ) .
    α Ψ ( γ ) S a α , S b Ψ ( γ ) , S b S a Ψ ( γ ) α
    To show α Ψ ( γ ) , we have to demonstrate that it is possible to select one variable from each source in γ to reconstruct each source in α :
    *
    By definition γ = Ψ ( α ) , each source in γ contains one variable from each source in α , and all variables from each source in α can be found in some source of γ .
    *
    By selecting the variable in each source of γ that originated from the same source in α , we can exactly reconstruct each source in α .
    *
    Therefore, α Ψ ( Ψ ( α ) ) , which implies Ψ ( Ψ ( α ) ) α .
    -
    α Ψ ( γ ) : We begin by expanding the definition of the redundancy order (Equation (9)) as shown in Equation (A25) to highlight that we have to show that every source in Ψ ( γ ) is a superset of some source in α .
    α Ψ ( γ ) S b Ψ ( γ ) . S a α . S a S b
    For a proof by induction, the recursive formulation of Ψ ( α ) shown in Equation (A26) highlights the relation of interest more clearly. We use the notation S [ i ] to indicate the i-th variable in source S . That the recursive formulation is equivalent to Definition 28 can be seen directly, since it recursively combines all possible choices of selecting one variable from each source in α , which is exactly the definition of Ψ ( α ) .
    Ψ ( α ) { } if α = i 1 . . | S | { x { S [ i ] } : x Ψ ( α { S } ) } otherwise , where S α
    Induction on the cardinality of α :
    *
    Hypothesis: It is impossible to choose one variable from each source in Ψ ( α ) without selecting all variables of some source S a α :
    S b Ψ ( Ψ ( α ) ) . S a α . S a S b
    *
    Base case | α | = 1 : The condition is satisfied as shown in Equation (A28), since Ψ ( { S } ) turns each variable in S into its own source. The second application Ψ ( Ψ ( { S } ) ) recombines them.
    Ψ ( { S } ) = { { V } : V S } Ψ ( Ψ ( { S } ) ) = { S }
    *
    Assume the induction hypothesis holds for | α | = m .
    *
    For the induction step, let α = α { S } : From the recursive definition shown in Equation (A29), we can directly see all relevant options of choosing one element from each resulting source.
    Ψ ( α ) = i 1 . . | S | { x { S [ i ] } : x Ψ ( α ) }
    ·
    Case 1: From every source in Ψ ( α ) , we choose the variable S [ i ] that was contributed by the new source S . The resulting set contains all variables of S .
    ·
    Case 2: To avoid choosing all variables from S , we have to select the variables contributed by x Ψ ( α ) instead for some S [ i ] . By the induction hypothesis, choosing one variable from each set in Ψ ( α ) leads to choosing all variables of some source S a α .
    ·
    Choosing one variable from each set in α = α { S } leads to choosing all variables of S or all variables of some source S a α .
    ·
    Thus, the induction hypothesis holds for | α | = | α | + 1 .
    -
    As shown above, Ψ ( Ψ ( α ) ) α and α Ψ ( Ψ ( α ) ) , which implies α Ψ ( Ψ ( α ) ) .
  • Property 2: We first expand the definitions:
    α β Ψ ( β ) Ψ ( α ) S b β . S a α . S a S b S c Ψ ( α ) . S d Ψ ( β ) . S d S c ( by Definition 9 )
    Then, we view both implications separately:
    • Assume S b β . S a α . S a S b . Then, there exists a function w : P ( P 1 ( V ) ) P ( P 1 ( V ) ) that associates each source in β with a source in α .
      S b β . w ( S b ) S b and w ( S b ) α .
      All sets S c Ψ ( α ) contain one variable of each source in α . Let the function v c : P 1 ( V ) V indicate this selection:
      S c = { v c ( S x ) : S x α } where : v c ( S x ) S x
      Define the set S d Ψ ( β ) using the defined functions above as shown in Equation (A32). The function w is defined for all sources in β , and the selected element is in the original source ( v c ( w ( S x ) ) S x ) by Equation (A30).
      S d = { v c ( w ( S x ) ) : S x β }
      The constructed set S d Ψ ( β ) is a subset of S c Ψ ( α ) , and it can be constructed for each S c Ψ ( α ) . This proves Equation (A33):
      S b β . S a α . S a S b S c Ψ ( α ) . S d Ψ ( β ) . S d S c
    • For the other direction, we show Equation (A34) and start with its simplification:
      ¬ ( S b β . S a α . S a S b ) ¬ ( S c Ψ ( α ) . S d Ψ ( β ) . S d S c ) S b β . S a α . ¬ ( S a S b ) S c Ψ ( α ) . S d Ψ ( β ) . ¬ ( S d S c ) S b β . S a α . x S a . x S b S c Ψ ( α ) . S d Ψ ( β ) . x S d . x S c
      The left-hand side states that, for some S b β , all sources S a α contain an element that is not in S b . Let us fix a particular S b and define a function returning this element v : P 1 ( V ) V :
      S a α . v ( S a ) S a and v ( S a ) S b
      Then, we can define the set S c = { v ( S a ) : S a α } . The source S c selects one variable from each source; thus, S c Ψ ( α ) , and by definition, S c S b = . All sets S d Ψ ( β ) must select one element from S b and, thus, contain one element that is not in S c . This provides the required implication of Equation (A34).
Proof of Lemma 3 from Section 3.4
The function Ξ ( · ) is a bijection that maintains the ordering of atoms between the redundancy and synergy order. Let α , β A ( V ) :
  • α = Ξ ( Ξ ( α ) ) ;
  • α β Ξ ( α ) Ξ ( β ) .
Proof. 
  • Property 1 is obtained from Definition 28: the first two cases revert each other, and the third case ( α and α { V } ) holds since S α : S = V ( V S ) .
  • Property 2:
    -
    Case 1: If α = = RL , then Ξ ( α ) = { V } = SL . Therefore, β A ( V ) . RL β SL β .
    -
    Case 2: If α = { V } = RL , then Ξ ( α ) = = SL . Therefore, α A ( V ) . α RL α SL .
    -
    Case 3: If α , then β :
    α β = S b β , S a α , S a S b ( by Definition 9 ) = S b β , S a α , ( V S b ) ( V S a ) = { V S a : S a α } { V S b : S b β } ( by Definition 10 ) = Ξ ( α ) Ξ ( β ) ( by Definition 28 )
Proof of Lemma 4 from Section 3.4
A redundancy- and synergy-based information decomposition is pointwise dual if, for all α A ( V ) { RL } :
Δ i , f ( α , T , t ) = Δ i , f ( Ξ ( Ψ ( α ) ) , T , t ) Δ i , f ( RL , T , t ) = 0 = Δ i , f ( SL , T , t )
A redundancy- and synergy-based information decomposition is dual if, for all α A ( V ) { RL } :
Δ I , f ( α , T ) = Δ I , f ( Ξ ( Ψ ( α ) ) , T ) Δ I , f ( RL , T ) = 0 = Δ I , f ( SL , T )
Proof. 
β R α Δ i , f ( β , T , t ) = β ( R α ) { RL } Δ i , f ( β , T , t ) ( Δ i , f ( RL , T , t ) = i , f ( SL , T , t ) = 0 ) = β S Ξ ( Ψ ( α ) ) Δ i , f ( Ψ ( Ξ ( β ) ) , T , t ) ( by Corollary 1 ) = β S Ξ ( Ψ ( α ) ) Δ i , f ( Ξ ( Ψ ( Ψ ( Ξ ( β ) ) ) ) , T , t ) ( by Equation ( 38 ) ) = β S Ξ ( Ψ ( α ) ) Δ i , f ( β , T , t )
The duality of the pointwise measure ( i , f , i , f ) implies the duality of the combined measure ( I , f , I , f ).
Lemma A9. 
For α A ( V ) { RL } and β A ( V ) :
¬ S a α . β { S a } Ξ ( Ψ ( α ) ) β
Proof. 
  • Case β = SL = : The condition holds since it implies α to be the minimal element in A ( V ) { RL } .
  • Case β ≠ ⊤_SL: We start by simplifying the expression.
    ¬(∃ S_a ∈ α. β ⪯ {S_a})  ⟺  Ξ(Ψ(α)) ⪯ β
    ⟺ [∀ S_a ∈ α. ¬(β ⪯ {S_a})]  ⟺  Ξ(Ψ(α)) ⪯ β
    ⟺ [∀ S_a ∈ α. ¬(∃ S_b ∈ β. S_a ⊆ S_b)]  ⟺  Ξ(Ψ(α)) ⪯ β   (by Definition 10)
    ⟺ [∀ S_a ∈ α. ∀ S_b ∈ β. ¬(S_a ⊆ S_b)]  ⟺  Ξ(Ψ(α)) ⪯ β
    ⟺ [∀ S_a ∈ α. ∀ S_b ∈ β. ¬(S_a ⊆ S_b)]  ⟺  Ψ(α) ≼ Ξ(β)   (by Lemma 3)
    ⟺ [∀ S_a ∈ α. ∀ S_b ∈ β. ¬(S_a ⊆ S_b)]  ⟺  [∀ S_b ∈ Ξ(β). ∃ S_c ∈ Ψ(α). S_c ⊆ S_b]   (by Definition 9)
    ⟺ [∀ S_b ∈ β. ∀ S_a ∈ α. ¬(S_a ⊆ S_b)]  ⟺  [∀ S_b ∈ β. ∃ S_c ∈ Ψ(α). S_c ⊆ V ∖ S_b]   (by Definition 28)
    ⟺ [∀ S_b ∈ β. ∀ S_a ∈ α. ¬(S_a ⊆ S_b)]  ⟺  [∀ S_b ∈ β. ∃ S_c ∈ Ψ(α). S_c ∩ S_b = ∅]
    ⟺ [∀ S_b ∈ β. ∀ S_a ∈ α. ∃ x ∈ S_a. x ∉ S_b]  ⟺  [∀ S_b ∈ β. ∃ S_c ∈ Ψ(α). ∀ x ∈ S_c. x ∉ S_b]
    - The left-hand side states that, for all S_b ∈ β, all S_a ∈ α must have at least one element that is not in S_b.
    - The right-hand side states that, for all S_b ∈ β, there exists a combination of one variable per source in α such that no element of the resulting collection is in S_b. This is possible if and only if all sources S_a ∈ α have at least one element that is not in S_b.
    Therefore, both statements imply each other.
Lemma A10. 
For α ∈ A(V) ∖ {⊥_RL}:
P_1(α) ∪ {∅} = {⋃_{β ∈ B} β : B ⊆ {{S} : S ∈ α}}
Proof. 
P_1(α) ∪ {∅} = {B : B ⊆ α} = {⋃_{β ∈ B} β : B ⊆ {{S} : S ∈ α}} = {∅} ∪ {⋃_{β ∈ B} β : ∅ ≠ B ⊆ {{S} : S ∈ α}}   (by Lemma A6)
Proof of Lemma 6 from Section 3.4
The pointwise dual-decomposition for the redundancy lattice of a loss measure on the synergy lattice is defined by:
i_∩,f(α, T, t) ≜ 0,   if α = ∅
i_∩,f(α, T, t) ≜ i_∪,f(⊤_SL, T, t) − Σ_{β ∈ P_1(α)} (−1)^{|β|−1} i_∪,f(β, T, t),   otherwise
Proof. 
The case α = ∅ is satisfied by definition. Therefore, we proceed assuming α ≠ ∅:
i_∩,f(α, T, t) = Σ_{γ ∈ ↓_R(α)} Δi_∩,f(γ, T, t)
= Σ_{γ ∈ ↑_S(Ξ(Ψ(α)))} Δi_∪,f(γ, T, t)   (by Lemma 4)
= i_∪,f(⊤_SL, T, t) − i_∪,f(⊤_SL, T, t) + Σ_{γ ∈ ↑_S(Ξ(Ψ(α)))} Δi_∪,f(γ, T, t)
= i_∪,f(⊤_SL, T, t) − [Σ_{γ ∈ A(V)} Δi_∪,f(γ, T, t) − Σ_{γ ∈ ↑_S(Ξ(Ψ(α)))} Δi_∪,f(γ, T, t)]
= i_∪,f(⊤_SL, T, t) − Σ_{γ ∈ A(V) ∖ ↑_S(Ξ(Ψ(α)))} Δi_∪,f(γ, T, t)
= i_∪,f(⊤_SL, T, t) − Σ_{γ ∈ ⋃_{S_a ∈ α} ↓_S({S_a})} Δi_∪,f(γ, T, t)   (by Lemma 5)
= i_∪,f(⊤_SL, T, t) − Σ_{∅ ≠ B ⊆ {{S_a} : S_a ∈ α}} (−1)^{|B|−1} i_∪,f(⋃_{β ∈ B} β, T, t)   (by inclusion–exclusion)
= i_∪,f(⊤_SL, T, t) − Σ_{γ ∈ {⋃_{β ∈ B} β : ∅ ≠ B ⊆ {{S_a} : S_a ∈ α}}} (−1)^{|γ|−1} i_∪,f(γ, T, t)
= i_∪,f(⊤_SL, T, t) − Σ_{β ∈ P_1(α)} (−1)^{|β|−1} i_∪,f(β, T, t)   (by Lemma A10)

Appendix E. Scaling f-Information Does Not Affect Its Transformation

Lemma A11. 
The linear scaling of an f-information does not affect the transformation result and operator: consider scaling an f-information measure I_a2(S; T) = k · I_a1(S; T) with k ∈ (0, ∞); then their decomposition transformations to another measure I_b(S; T) will be equivalent.
Proof. 
Based on the definitions of Section 3.2, the loss measures scale linearly with the scaling of their f-divergence. Therefore, we obtain two cumulative loss measures, where I_∪,a1 and I_∪,a2 are a linear scaling of each other (Equation (A44a)). They can be transformed into another measure I_∪,b, as shown in Equation (A44b).
I_∪,a2(α; T) = k · I_∪,a1(α; T)   (A44a)
I_∪,b(α; T) = v_1(I_∪,a1(α; T)) = v_2(I_∪,a2(α; T))   (A44b)
Equation (A44b) already demonstrates that their transformation results will be equivalent and that v_1(z) = v_2(k · z) and k · v_1⁻¹(z) = v_2⁻¹(z). Therefore, their operators will also be equivalent, as shown below:
x ±_2 y := v_2(v_2⁻¹(x) ± v_2⁻¹(y))
x ±_1 y := v_1(v_1⁻¹(x) ± v_1⁻¹(y)) = v_2(k · v_1⁻¹(x) ± k · v_1⁻¹(y)) = v_2(v_2⁻¹(x) ± v_2⁻¹(y)) = x ±_2 y
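The algebra above can be checked numerically with any invertible transformation function. The short sketch below uses a hypothetical monotone v_1 and a hypothetical scaling factor k (neither is the paper's actual transformation function) purely to illustrate that setting v_2(z) = v_1(z/k) leaves the induced operator unchanged.

```python
import math

k = 3.0                                    # hypothetical scaling factor
v1 = lambda z: 1.0 - math.exp(-z)          # hypothetical monotone transformation v_1
v1_inv = lambda x: -math.log(1.0 - x)
v2 = lambda z: v1(z / k)                   # then v_1(z) = v_2(k*z)
v2_inv = lambda x: k * v1_inv(x)           # and  v_2^{-1}(z) = k * v_1^{-1}(z)

def op(v, v_inv, x, y, sign=+1.0):
    # x (+/-)_i y := v( v^{-1}(x) +/- v^{-1}(y) )
    return v(v_inv(x) + sign * v_inv(y))

x, y = 0.4, 0.3
print(op(v1, v1_inv, x, y), op(v2, v2_inv, x, y))              # addition: identical results
print(op(v1, v1_inv, x, y, -1.0), op(v2, v2_inv, x, y, -1.0))  # subtraction: identical results
```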

Appendix F. Decomposition Example Distributions

The probability distributions used in Figure 9 can be found in Table A1. To provide an intuition of the decomposition result for I_∩,TV in the generic example, we visualize its corresponding zonogons in Figure A2. It can be seen that the maximal zonogon height is obtained from V_1 (blue), which equals the maximal zonogon height of their joint distribution (V_1, V_2) (red). Therefore, I_∩,TV does not attribute partial information uniquely to V_2 or their synergy by Lemma 8.
Table A1. The distributions used from [13] and the generic example from [20]. The example names are abbreviations for: XOR-gate (XOR), Unique (Unq), Pointwise Unique (PwUnq), Redundant-Error (RdnErr), Two-Bit-copy (Tbc), and the AND-gate (AND) [13].
State (V1 V2 T)   XOR    Unq    PwUnq   RdnErr   Tbc    AND    Generic
0  0  0           1/4    1/4    0       3/8      1/4    1/4    0.0625
0  0  1           -      -      -       -        -      -      0.3000
0  1  0           -      1/4    1/4     1/8      -      1/4    0.1875
0  1  1           1/4    -      -       -        1/4    -      0.1500
0  2  1           -      -      1/4     -        -      -      -
1  0  0           -      -      1/4     -        -      1/4    0.0375
1  0  1           1/4    1/4    -       1/8      -      -      0.0500
1  0  2           -      -      -       -        1/4    -      -
1  1  0           1/4    -      -       -        -      -      0.2125
1  1  1           -      1/4    -       3/8      -      1/4    -
1  1  3           -      -      -       -        1/4    -      -
2  0  1           -      -      1/4     -        -      -      -
Figure A2. Visualization of the zonogons from the generic example of [20] in state t = 0 . The target variable T has two states. Therefore, the zonogons of its second state are symmetric (second column of Equation (6)) and have identical heights.
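For readers who want to reproduce zonogons such as those in Figure A2, the following sketch (our own illustration) builds the channel vectors of the indicator variable for t = 0 directly from the generic distribution in Table A1. It assumes the convention of Figure 2: each source state s contributes a vector of conditional probabilities, here taken as (P(s | T = t), P(s | T ≠ t)), and sorting these vectors by their likelihood ratio and cumulatively summing them traces the zonogon boundary; the helper names are ours.

```python
# Generic example of Table A1: P(V1, V2, T)
P = {(0, 0, 0): 0.0625, (0, 0, 1): 0.3000,
     (0, 1, 0): 0.1875, (0, 1, 1): 0.1500,
     (1, 0, 0): 0.0375, (1, 0, 1): 0.0500,
     (1, 1, 0): 0.2125}

def zonogon_vectors(P, group, t=0):
    """Vectors (P(s | T=t), P(s | T!=t)) for source states s = group(v1, v2),
    sorted by the likelihood ratio P(s | T!=t) / P(s | T=t)."""
    p_t = sum(p for (_, _, tt), p in P.items() if tt == t)
    acc = {}
    for (v1, v2, tt), p in P.items():
        s = group(v1, v2)
        a, b = acc.get(s, (0.0, 0.0))
        acc[s] = (a + p, b) if tt == t else (a, b + p)
    vecs = [(a / p_t, b / (1.0 - p_t)) for a, b in acc.values()]
    return sorted(vecs, key=lambda v: v[1] / v[0] if v[0] > 0 else float("inf"))

print(zonogon_vectors(P, lambda v1, v2: v1))         # source V1
print(zonogon_vectors(P, lambda v1, v2: v2))         # source V2
print(zonogon_vectors(P, lambda v1, v2: (v1, v2)))   # joint source (V1, V2)
```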

Appendix G. The Relation of Total Variation to the Zonogon Height

Proof of Lemma 8(a) from Section 4.1.2
The pointwise total variation (i_TV) is a linear scaling of the maximal (Euclidean) height h* that the corresponding zonogon Z(κ) reaches above the diagonal, as visualized in Figure 10, for any 0 ≤ p ≤ 1.
i_TV(p, κ) = (1 − p)/2 · Σ_{v ∈ κ} |v_x − v_y| = (1 − p) · h* · √2
Proof. 
The point of maximal height P * that a zonogon Z ( κ ) reaches above the diagonal is visualized in Figure 10 and can be obtained as shown in Equation (A45), where Δ v represents the slope of vector v .
P* = Σ_{v ∈ {v ∈ κ : Δ_v > 1}} v   (A45)
The maximal height (Euclidean distance) above the diagonal is calculated as shown in Equation (A46), where P * = ( P x * , P y * ) .
h* = ½ ‖(P*_x, P*_y) − (P*_y, P*_x)‖₂ = ½ √((P*_x − P*_y)² + (P*_y − P*_x)²) = (P*_y − P*_x)/√2   (A46)
The pointwise total variation i_TV can be expressed as an invertible transformation of the maximal Euclidean zonogon height above the diagonal as shown below, where v = (v_x, v_y).
i_TV(p, κ) = Σ_{v ∈ κ} ½ |v_x / (p·v_x + (1 − p)·v_y) − 1| · (p·v_x + (1 − p)·v_y)
= (1 − p)/2 · Σ_{v ∈ κ} |v_x − v_y|
= (1 − p)/2 · [Σ_{v ∈ κ : Δ_v > 1} (v_y − v_x) + Σ_{v ∈ κ : Δ_v ≤ 1} (v_x − v_y)]
= (1 − p)/2 · [(P*_y − P*_x) + (1 − P*_x) − (1 − P*_y)]   (by Equation (A45))
= (1 − p)(P*_y − P*_x) = (1 − p) · h* · √2   (by Equation (A46))
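The identity can also be checked numerically. The sketch below is our own illustration and uses the coordinate convention of the proof above, with each channel vector v = (v_x, v_y) and both columns of the channel summing to one; it evaluates i_TV once through the f-divergence form and once through the zonogon-height expression.

```python
import random

def i_tv(p, kappa):
    # f-divergence form used in the proof: sum_v q * f(v_x / q)
    # with q = p*v_x + (1-p)*v_y and f(z) = 0.5 * |z - 1|
    total = 0.0
    for vx, vy in kappa:
        q = p * vx + (1 - p) * vy
        if q > 0:
            total += q * 0.5 * abs(vx / q - 1.0)
    return total

def height_expression(p, kappa):
    # (1-p) * (P*_y - P*_x), which equals (1-p) * sqrt(2) * h* by Equation (A46)
    px = sum(vx for vx, vy in kappa if vy > vx)
    py = sum(vy for vx, vy in kappa if vy > vx)
    return (1 - p) * (py - px)

# random binary-input channel: each vector is v = (v_x, v_y), columns sum to one
random.seed(0)
xs = [random.random() for _ in range(4)]
ys = [random.random() for _ in range(4)]
kappa = [(x / sum(xs), y / sum(ys)) for x, y in zip(xs, ys)]

p = 0.3
print(i_tv(p, kappa), height_expression(p, kappa))  # the two values coincide
```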
Proof of Lemma 8(b) from Section 4.1.2
For a non-empty set of pointwise channels A and 0 ≤ p ≤ 1, the pointwise total variation i_TV quantifies the join element at the maximum of its individual channels:
i_TV(p, ⋁_{κ ∈ A} κ) = max_{κ ∈ A} i_TV(p, κ)
Proof. 
The join element Z(⋁_{κ ∈ A} κ) corresponds to the convex hull of all individual zonogons (see Equation (7)). The maximal height that the convex hull reaches above the diagonal is equal to the maximum of the maximal heights that the individual zonogons reach. Since the pointwise total variation is a linear scaling of the (Euclidean) zonogon height above the diagonal (Lemma 8(a) shown above), the join element is quantified at the maximum of its individual channels. □

Appendix H. Information Flow Example Parameters and Visualization

The parameters for the Markov chain used in Section 4.2 are shown in Equation (A47), where M_n = (X_n, Y_n), X_i ∈ {0, 1, 2}, Y_i ∈ {0, 1}, P_{M_1} is the initial distribution, and P_{M_{n+1}|M_n} is the transition matrix. The visualized results for the information flow of KL-, TV-, and χ²-information can be found in Figure 11, and the visualized results of H²-, LC-, and JS-information in Figure A3.
Figure A3. Analysis of the Markov chain information flow (Equation (A47)). Visualized results for the information measures: H 2 , LC, and JS. The remaining results (KL, TV, and χ 2 ) can be found in Figure 11.
States (X_1, Y_1): (0,0), (0,1), (1,0), (1,1), (2,0), (2,1)
P_{M_1} = (0.01, 0.81, 0.00, 0.02, 0.09, 0.07)
P_{M_{n+1}|M_n} =
[0.05  0.01  0.04  0.82  0.02  0.06]
[0.05  0.82  0.00  0.01  0.06  0.06]
[0.04  0.01  0.82  0.05  0.04  0.04]
[0.03  0.84  0.02  0.06  0.04  0.01]
[0.04  0.03  0.03  0.02  0.06  0.82]
[0.07  0.04  0.01  0.03  0.81  0.04]
(A47)
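As an illustration of how these parameters are used, the short numpy sketch below evolves the state marginals of the chain. It assumes the transition matrix is row-stochastic, i.e., each row is P(M_{n+1} | M_n = i), which is consistent with the rows above summing to one; the ten steps shown are an arbitrary choice for illustration.

```python
import numpy as np

# state order: (X, Y) in [(0,0), (0,1), (1,0), (1,1), (2,0), (2,1)]
p1 = np.array([0.01, 0.81, 0.00, 0.02, 0.09, 0.07])
P = np.array([
    [0.05, 0.01, 0.04, 0.82, 0.02, 0.06],
    [0.05, 0.82, 0.00, 0.01, 0.06, 0.06],
    [0.04, 0.01, 0.82, 0.05, 0.04, 0.04],
    [0.03, 0.84, 0.02, 0.06, 0.04, 0.01],
    [0.04, 0.03, 0.03, 0.02, 0.06, 0.82],
    [0.07, 0.04, 0.01, 0.03, 0.81, 0.04],
])
assert np.allclose(P.sum(axis=1), 1.0)   # rows are conditional distributions P(M_{n+1} | M_n = i)

marginals = [p1]
for _ in range(9):                        # M_1 ... M_10
    marginals.append(marginals[-1] @ P)   # P_{M_{n+1}} = P_{M_n} P  (row-stochastic convention)

for n, m in enumerate(marginals, start=1):
    print(f"P_M{n}:", np.round(m, 4))
```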

References

  1. Williams, P.L.; Beer, R.D. Nonnegative Decomposition of Multivariate Information. arXiv 2010, arXiv:1004.2515. [Google Scholar]
  2. Lizier, J.T.; Bertschinger, N.; Jost, J.; Wibral, M. Information Decomposition of Target Effects from Multi-Source Interactions: Perspectives on Previous, Current and Future Work. Entropy 2018, 20, 307. [Google Scholar] [CrossRef] [PubMed]
  3. Griffith, V.; Chong, E.K.P.; James, R.G.; Ellison, C.J.; Crutchfield, J.P. Intersection Information Based on Common Randomness. Entropy 2014, 16, 1985–2000. [Google Scholar] [CrossRef]
  4. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J. Shared Information—New Insights and Problems in Decomposing Information in Complex Systems. In Proceedings of the European Conference on Complex Systems 2012; Gilbert, T., Kirkilionis, M., Nicolis, G., Eds.; Springer: Cham, Switzerland, 2013; pp. 251–269. [Google Scholar]
  5. Harder, M.; Salge, C.; Polani, D. Bivariate measure of redundant information. Phys. Rev. E 2013, 87, 012130. [Google Scholar] [CrossRef]
  6. Finn, C. A New Framework for Decomposing Multivariate Information. Ph.D. Thesis, University of Sydney, Darlington, NSW, Australia, 2019. [Google Scholar]
  7. Polyanskiy, Y.; Wu, Y. Information Theory: From Coding to Learning; Book Draft; Cambridge University Press: Cambridge, UK, 2022; Available online: https://people.lids.mit.edu/yp/homepage/data/itbook-2022.pdf (accessed on 13 May 2024).
  8. Mironov, I. Rényi Differential Privacy. In Proceedings of the 2017 IEEE 30th Computer Security Foundations Symposium (CSF), Santa Barbara, CA, USA, 21–25 August 2017; pp. 263–275. [Google Scholar] [CrossRef]
  9. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J.; Ay, N. Quantifying Unique Information. Entropy 2014, 16, 2161–2183. [Google Scholar] [CrossRef]
  10. Griffith, V.; Koch, C. Quantifying Synergistic Mutual Information. In Guided Self-Organization: Inception; Springer: Berlin/Heidelberg, Germany, 2014; pp. 159–190. [Google Scholar] [CrossRef]
  11. Goodwell, A.E.; Kumar, P. Temporal information partitioning: Characterizing synergy, uniqueness, and redundancy in interacting environmental variables. Water Resour. Res. 2017, 53, 5920–5942. [Google Scholar] [CrossRef]
  12. James, R.G.; Emenheiser, J.; Crutchfield, J.P. Unique information via dependency constraints. J. Phys. A Math. Theor. 2018, 52, 014002. [Google Scholar] [CrossRef]
  13. Finn, C.; Lizier, J.T. Pointwise Partial Information Decomposition Using the Specificity and Ambiguity Lattices. Entropy 2018, 20, 297. [Google Scholar] [CrossRef]
  14. Ince, R.A.A. Measuring Multivariate Redundant Information with Pointwise Common Change in Surprisal. Entropy 2017, 19, 318. [Google Scholar] [CrossRef]
  15. Rosas, F.E.; Mediano, P.A.M.; Rassouli, B.; Barrett, A.B. An operational information decomposition via synergistic disclosure. J. Phys. A Math. Theor. 2020, 53, 485001. [Google Scholar] [CrossRef]
  16. Kolchinsky, A. A Novel Approach to the Partial Information Decomposition. Entropy 2022, 24, 403. [Google Scholar] [CrossRef] [PubMed]
  17. Bertschinger, N.; Rauh, J. The Blackwell relation defines no lattice. In Proceedings of the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014; pp. 2479–2483. [Google Scholar] [CrossRef]
  18. Lizier, J.T.; Flecker, B.; Williams, P.L. Towards a synergy-based approach to measuring information modification. In Proceedings of the 2013 IEEE Symposium on Artificial Life (ALife), Singapore, 16–19 April 2013; pp. 43–51. [Google Scholar] [CrossRef]
  19. Knuth, K.H. Lattices and Their Consistent Quantification. Ann. Phys. 2019, 531, 1700370. [Google Scholar] [CrossRef]
  20. Mages, T.; Rohner, C. Decomposing and Tracing Mutual Information by Quantifying Reachable Decision Regions. Entropy 2023, 25, 1014. [Google Scholar] [CrossRef]
  21. Blackwell, D. Equivalent comparisons of experiments. Ann. Math. Stat. 1953, 24, 265–272. [Google Scholar] [CrossRef]
  22. Neyman, J.; Pearson, E.S., IX. On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. London. Ser. A Contain. Pap. Math. Phys. Character 1933, 231, 289–337. [Google Scholar]
  23. Chicharro, D.; Panzeri, S. Synergy and Redundancy in Dual Decompositions of Mutual Information Gain and Information Loss. Entropy 2017, 19, 71. [Google Scholar] [CrossRef]
  24. Csiszár, I. On information-type measure of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 1967, 2, 299–318. [Google Scholar]
  25. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, Berkeley, CA, USA, 20–30 July 1960; University of California Press: Berkeley, CA, USA, 1961; Volume 4, pp. 547–562. [Google Scholar]
  26. Sason, I.; Verdú, S. f -Divergence Inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006. [Google Scholar] [CrossRef]
  27. Kailath, T. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. Commun. Technol. 1967, 15, 52–60. [Google Scholar] [CrossRef]
  28. Arikan, E. Channel Polarization: A Method for Constructing Capacity-Achieving Codes for Symmetric Binary-Input Memoryless Channels. IEEE Trans. Inf. Theory 2009, 55, 3051–3073. [Google Scholar] [CrossRef]
  29. Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distribution. Bull. Calcutta Math. Soc. 1943, 35, 99–110. [Google Scholar]
  30. Mages, T.; Anastasiadi, E.; Rohner, C. Implementation: PID Blackwell Specific Information. 2024. Available online: https://github.com/uu-core/pid-blackwell-specific-information (accessed on 15 March 2024).
  31. Cardenas, A.; Baras, J.; Seamon, K. A framework for the evaluation of intrusion detection systems. In Proceedings of the 2006 IEEE Symposium on Security and Privacy (S & P’06), Berkeley, CA, USA, 21–24 May 2006; pp. 15–77. [Google Scholar] [CrossRef]
  32. Rauh, J.; Bertschinger, N.; Olbrich, E.; Jost, J. Reconsidering unique information: Towards a multivariate information decomposition. In Proceedings of the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014; pp. 2232–2236. [Google Scholar] [CrossRef]
  33. Bossomaier, T.; Barnett, L.; Harré, M.; Lizier, J.T. An Introduction to Transfer Entropy; Springer International Publishing: Cham, Switzerland, 2016. [Google Scholar] [CrossRef]
Figure 1. Partial information decomposition representations at two variables V = {V_1, V_2}. (a) Desired set-theoretic analogy: visualization of the desired intuition for multivariate information as a Venn diagram. (b) Representation as redundancy lattice, where the redundancy measure I_∩ quantifies the information that is contained in all of its provided variables (inside their intersection). The ordering represents the expected subset relation of redundancy. (c) Representation as synergy lattice, where the loss measure I_∪ quantifies the information that is contained in neither of its provided variables (outside their union). (d) Information flow visualization: when having two partial information decompositions with respect to the same target variable, we can study how the partial information of one decomposition propagates into the next. We refer to this as information flow analysis of a Markov chain such as T → (A_1, A_2) → (B_1, B_2).
Figure 2. An example zonogon (blue) for a binary input channel κ from T = {t_1, t_2} to S = {s_1, s_2, s_3, s_4}. The zonogon is the Neyman–Pearson region, and its perimeter corresponds to the vectors v_{s_i}^κ sorted by an increasing/decreasing slope for the lower/upper half, which results from the likelihood ratio test. The zonogon, thus, represents the achievable (TPR, FPR)-pairs for predicting T while knowing S.
Figure 3. Visualizations for Example 1 where |T| = 2. (a) A randomized decision strategy for predictions based on T →^κ S can be represented by a |S| × 2 stochastic matrix λ. The first column of this decision matrix provides the weights for summing the columns of channel κ to determine the resulting prediction performance (TPR, FPR). Any decision strategy corresponds to a point in the zonogon. (b) All presented ordering relations in Section 2.1 are equivalent at binary targets and correspond to the subset relation of the visualized zonogons. The variable S_3 is less informative than both S_1 and S_2 with respect to T, and the variables S_1 and S_2 are incomparable. The channel shown in (a) is the Blackwell join of κ_1 and κ_2 in (b).
Figure 4. For the visualization, we abbreviated the notation by indicating the contained visible variables as the index of the source, for example S_12 = {V_1, V_2} to represent their joint distribution: (a) A redundancy/gain lattice (A({V_1, V_2, V_3}), ≼) based on the ordering of Equation (9) quantifies information present in all sources. The redundancy of all sources within an atom increases while moving up on the redundancy lattice. (b) A synergy/loss lattice (A({V_1, V_2, V_3}), ⪯) based on the ordering of Equation (10) quantifies information present in neither source. On the synergy lattice, the information that is obtained from neither source of an atom increases while moving up.
Figure 5. Example of the unexpected behavior of I_min: the dashed isoline indicates the pairs (x, y) for which the channel κ(x, y) : T → V_i results in the pointwise information ∀ t ∈ T: I(V_i, T = t) = 0.2 for a uniform binary target variable. Even though observing the output of both indicated example channels (blue/green) provides significantly different abilities for predicting the target variable state, the measure I_min indicates full redundancy.
Figure 6. This example visualizes the computation of χ 2 -information by indicating its results on the representation of zonogons of an indicator variable. (a) For the pointwise information of t 1 , both vectors of the zonogon perimeter are quantified to the sum 0.292653. (b) For the pointwise information of t 2 , both vectors of the zonogon perimeter are quantified to the sum of 0.130068 . The final χ 2 -information is their expected value I χ 2 ( S ; T ) = 0.4 · 0.292653 + 0.6 · 0.130068 = 0.195102 .
Figure 7. Visualization of the functions Ψ and Ξ : The application of function Ψ is equal to reversing the redundancy order, and the application of function Ξ is equal to swapping the ordering relation used between the redundancy and synergy lattice.
Figure 8. Visualization of lattice duality and Lemma 5. We abbreviate the notation of sources within this figure by listing the contained visible variables as source index (S_12 = {V_1, V_2}). (i) All bottom elements are mapped to each other and quantified to zero. (ii) To identify the dual for α = {S_3, S_12} from the redundancy lattice, we first apply the transformation Ψ(α) ≃ {S_13, S_23} and, then, Ξ(Ψ(α)) ≃ {S_1, S_2}. (iii) Ignoring the bottom elements, the down-set of α on the redundancy lattice corresponds to the up-set of Ξ(Ψ(α)) on the synergy lattice for duality (gray areas). (iv) Lemma 5 states that, on the synergy lattice, exactly those atoms that are not in the up-set of Ξ(Ψ({S_3, S_12})) must be in the down-set of either {S_3} or {S_12}.
Figure 9. Comparison of different f-information measures normalized to the f-entropy of the target variable. All distributions are shown in Appendix F and correspond to the examples of [13,20]. The example name abbreviations are listed below in Table A1. The measures behave mostly similarly since the decompositions follow an identical structure. However, it can be seen that total variation attributes more information to being redundant than other measures and appears to behave differently in the generic example since it does not attribute any partial information to the first variable or their synergy.
Figure 10. Visualization of the maximal (Euclidean) height h * at point P * that a zonogon (blue) reaches above the diagonal.
Figure 11. Analysis of the Markov chain information flow (Equation (A47)). Visualized results for the information measures: KL, TV, and χ². The remaining results (H²-, LC-, and JS-information) can be found in Figure A3.
Table 1. Summary of the notation used for the redundancy and synergy lattice.
                           Redundancy Order               Synergy Order
Ordering/equivalence       ≼ / ≃                          ⪯ / ≅
Join/meet                  ⋎ / ⋏                          ∨ / ∧
Up-set/strict up-set       ↑R / ↑̇R                        ↑S / ↑̇S
Down-set/strict down-set   ↓R / ↓̇R                        ↓S / ↓̇S
Cover-set                  α⋖R                            α⋖S
Top/bottom                 ⊤RL = {V} / ⊥RL = ∅            ⊤SL = ∅ / ⊥SL = {V}
Table 2. Commonly used functions for f-divergences.
Notation   Name                                            Generator Function
D_KL       Kullback–Leibler (KL) divergence                f(z) = z log z
D_TV       Total variation (TV)                            f(z) = ½ |z − 1|
D_χ²       χ²-divergence                                   f(z) = (z − 1)²
D_H²       Squared Hellinger distance                      f(z) = (1 − √z)²
D_LC       Le Cam distance                                 f(z) = (1 − z) / (2z + 2)
D_JS       Jensen–Shannon divergence                       f(z) = z log(2z / (z + 1)) + log(2 / (z + 1))
D_Ha       Hellinger divergence with a ∈ (0, 1) ∪ (1, ∞)   f(z) = (z^a − 1) / (a − 1)
D_α=a      α-divergence with a ∈ (0, 1) ∪ (1, ∞)           f(z) = (z^a − 1 − a(z − 1)) / (a(a − 1))
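For convenience, the generator functions of Table 2 can be written down directly. The sketch below (our own illustration, using the natural logarithm; the paper's base convention may differ) evaluates D_f(P‖Q) = Σ_x q(x) f(p(x)/q(x)) for a small example distribution with strictly positive probabilities.

```python
import math

# generator functions f(z) from Table 2
generators = {
    "KL":   lambda z: z * math.log(z),
    "TV":   lambda z: 0.5 * abs(z - 1),
    "chi2": lambda z: (z - 1) ** 2,
    "H2":   lambda z: (1 - math.sqrt(z)) ** 2,
    "LC":   lambda z: (1 - z) / (2 * z + 2),
    "JS":   lambda z: z * math.log(2 * z / (z + 1)) + math.log(2 / (z + 1)),
}

def f_divergence(f, p, q):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)), assuming q(x) > 0 everywhere."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

P = [0.2, 0.5, 0.3]
Q = [0.4, 0.4, 0.2]
for name, f in generators.items():
    print(name, round(f_divergence(f, P, Q), 6))
```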