A Measure of Synergy Based on Union Information

The partial information decomposition (PID) framework is concerned with decomposing the information that a set of (two or more) random variables (the sources) has about another variable (the target) into three types of information: unique, redundant, and synergistic. Classical information theory alone does not provide a unique way to decompose information in this manner and additional assumptions have to be made. One often overlooked way to achieve this decomposition is using a so-called measure of union information—which quantifies the information that is present in at least one of the sources—from which a synergy measure stems. In this paper, we introduce a new measure of union information based on adopting a communication channel perspective, compare it with existing measures, and study some of its properties. We also include a comprehensive critical review of characterizations of union information and synergy measures that have been proposed in the literature.

1 Introduction Williams and Beer [2010] introduced the partial information decomposition (PID) framework as a way to characterize, or analyze, the information that a set of random variables (often called sources) has about another variable (referred to as the target).PID is a useful tool for gathering insights and analyzing the way information is stored, modified, and transmitted within complex systems [Lizier et al., 2013, Wibral et al., 2017].It has been applied in several areas such as cryptography [Rauh, 2017] and neuroscience [Luppi et al., 2019, Varley et al., 2023], with many other potential use cases, such as in studying information flows in gene regulatory networks [Chan et al., 2017], neural coding [Faber et al., 2019], financial markets [James et al., 2018a], and network design [Ehrlich et al., 2023, Tokui andSato, 2021].
Consider the simplest case, a three-variable joint distribution p(y 1 , y 2 , t) describing three random variables: two so-called sources, Y 1 and Y 2 , and a target T .Notice that, despite what the names sources and target might suggest, there is no directionality (causal or otherwise) assumption.The goal of PID is to decompose the information that the sources Y = (Y 1 , Y 2 ) have about T into the sum of four non-negative quantities: the information that is present in both Y 1 and Y 2 , known as redundant information, R; the information that only Y 1 (respectively Y 2 ) has about T , known as unique information, U 1 (respectively U 2 ); and the synergistic information, S, that is present in the pair (Y 1 , Y 2 ) but not in Y 1 or Y 2 alone.In this case with two variables, the goal is, thus, to write where I(T ; Y ) is the mutual information between T and Y [Cover and Thomas, 1999].The redundant information R, because it is present in both Y 1 and Y 2 , is also referred to as intersection information and denoted as I ∩ .Finally, I ∪ refers to union information, i.e., the amount of information provided by at least one of the sources; in the case of two sources, Because unique information and redundancy satisfy the relationship U i = I(T ; Y i ) − R (for i ∈ {1, 2}), it turns out that defining how to compute one of these quantities (R, U i , or S) is enough to fully determine the others [Williams and Beer, 2010].Williams and Beer [2010] suggested a set of axioms that a measure of redundancy should satisfy, and proposed a measure of their own.Those axioms became well known as the Williams-Beer axioms, although the measure they proposed has subsequently been criticized for not capturing informational content, but only information size [Harder et al., 2013].It is worth noting that as the number of variables grows the number of terms appearing in the PID of I(T ; Y ) grows super-exponentially [Gutknecht et al., 2021].
Stimulated by that initial work, other measures of information and other sets of axioms for information decomposition have been introduced; see, for example, the work by Bertschinger et al. [2014], Griffith and Koch [2014], and James et al. [2018b], for different measures of redundant, unique, and synergistic information.There is no consensus about what axioms any measure should satisfy or whether a given measure captures the information that it should capture, except for the Williams-Beer axioms.Today, there is still debate about what axioms different measures of information should satisfy, and there is no general agreement on what is an appropriate PID [Chicharro and Panzeri, 2017, James et al., 2018b, Bertschinger et al., 2013, Rauh et al., 2017, Ince, 2017].
Most PID measures that have been suggested thus far are either measures of redundant information, e.g., [Williams and Beer, 2010, Harder et al., 2013, Kolchinsky, 2022, Barrett, 2015, Ince, 2017, Griffith et al., 2014, Griffith and Ho, 2015, Gomes and Figueiredo, 2023], or measures of unique information, e.g., [Bertschinger et al., 2014, James et al., 2018b].Alternatively, it is possible to define the union information of a set of sources as the amount of information provided by at least one of those sources.Synergy is then defined as the difference between the total information and union information [Kolchinsky, 2022].
In this paper, we introduce a new measure of union information based on the information channel perspective that we already pursued in earlier work [Gomes and Figueiredo, 2023] and study some of its properties.The resulting measure leads to a novel information decomposition that is particularly suited for analyzing how information is distributed in channels.
The rest of the paper is organized as follows.A final subsection of this section introduces the notation used throughout the paper.In Section 2, we recall some properties of PID and take a look at how the degradation measure for redundant information introduced by Kolchinsky [2022] decomposes information in bivariate systems, while also pointing out some drawbacks of that measure.Section 3 presents the motivation for our proposed measure, its operational interpretation, its multivariate definition, as well as some of its drawbacks.In Section 4, we propose an extension of the Williams-Beer axioms for measures of union information and show that our proposed measure satisfies those axioms.We review all properties that have been proposed both for measures of union information and synergy, and either accept or reject them.We also compare different measures of synergy and relate them, whenever possible.Finally, Section 5 presents concluding remarks and suggestions for future work.

Notation
For two discrete random variables X ∈ X and Z ∈ Z, their Shannon mutual information I(X; Z) is given by I(X; Z) = I(Z; X) = H(X) − H(X|Z) = H(Z) − H(Z|X), where H(X) = − x∈X p(x) log p(x) and H(X|Z) = − x∈X z∈X p(x, z) log p(x|z) are the entropy and conditional entropy, respectively Cover and Thomas [1999].The conditional distribution p(z|x) corresponds, in an information-theoretical perspective, to a discrete memoryless channel with a channel matrix K, i.e., such that K[x, z] = p(z|x) [Cover and Thomas, 1999].This matrix is row-stochastic: K[x, z] ≥ 0, for any x ∈ X and z ∈ Z, and z∈Z K[x, z] = 1, for any x.
Given a set of n discrete random variables (sources), Y 1 ∈ Y 1 , . .., Y n ∈ Y n , and a discrete random variable T ∈ T (target) with joint distribution (probability mass function) p(y 1 , . .., y n , t), we consider the channels matrix with the conditional distribution p(y i |t).For a vector y, y −i refers to the same vector without the i th component.
We say that three random variables, say X, Y, Z, form a Markov chain (which we denote by for instance, the redundancy measure introduced by Kolchinsky [2022], based on the so-called degradation order ⪯ d between communication channels (see recent work by Kolchinsky [2022] and Gomes and Figueiredo [2023] for definitions): In the above equation, Q is the output of the channel T → Q, which we also denote as K Q .When computing the information shared by the m sources, , the perspective is that there is a channel with a single input T and m outputs Y 1 , . .., Y m .This definition of I d ∩ corresponds to the mutual information of the most informative channel, under the constraint that this channel is dominated (in the degradation order ⪯ d sense) by all channels K (1) , . .., K (m) .Since mutual information was originally introduced to formalize the capacity of communication channels, it is not surprising that measures that presuppose channel directionality are found useful in this context.For instance, the work of James et al. [2018c] concluded that only if one assumes the directionality T → Y i does one obtain a valid PID from a secret key agreement rate, which supports the approach of assuming this directionality perspective.
Although it is not guaranteed that the structure of the joint distribution p(y 1 , . .., y n , t) is compatible with the causal model of a single input and multiple output channels (which implies that the sources are conditionally independent, given T ), one may always compute such measures, which have interesting and relevant operational interpretations.In the context of PID, where the goal is to study how information is decomposed, such measures provide an excellent starting point.Although it is not guaranteed that there is actually a channel (or a direction) from T to Y i , we can characterize how information about T is spread through the sources.In the case of the degradation order, I d ∩ provides insight about the maximum information obtained about T if any Y i is observed.
Arguably, the most common scenario in PID is finding out something about the structure of the information the variables Y 1 , . .., Y n have about T .In a particular system of variables characterized by its joint distribution, we do not make causal assumptions, so we can adopt the perspective that the variables Y i are functions of T , hence obtaining the channel structure.Although this channel structure may not be faithful [Pearl, 2009] to the conditional independence properties implied by p(y 1 , . .., y n , t), this channel perspective allows for decomposition of I(Y ; T ) and for drawing conclusions about the inner structure of the information that Y has about T .Some distributions, however, cannot have this causal structure.Take, for instance, the distribution generated by T = Y 1 xor Y 2 , where Y 1 and Y 2 are two equiprobable and independent binary random variables.We will call this distribution XOR.For this well-known distribution, we have Y 1 ⊥ Y 2 and Y 1 ̸ ⊥ Y 2 |T , whereas the implied channel distribution that I d ∩ assumes yields the exact opposite dependencies, that is, Y 1 ̸ ⊥ Y 2 and Y 1 ⊥ Y 2 |T ; see Figure 1 for more insight.

Consider the computation of
for the XOR distribution.This measure argues that, since 1) and redundancy is computed as I(Q; T ), yielding 0 bits of redundancy, and consequently 1 bit of synergy (as computed from (1)).Under this channel perspective (as in Figure 1b), If all that is needed to compute redundancy is p(T ), K (1) , and K (2) , this would lead to the wrong conclusion that the outcome (Y 1 , Y 2 , T ) = (0, 0, 1) has non-null probability, which it does not.With this, we do not mean that I d ∩ is an incomplete or incorrect measure to perform PID, we are using its insights to point us in a different direction.

Motivation for a New PID
At this point, the most often used approaches to PID are based on redundancy measures.Usually, these are in one of the two following classes: • Measures that are not concerned with information content, only information size, which makes them easy to compute even for distributions with many variables, but at the cost that the resulting decompositions may not give realistic insights into the system, precisely because they are not sensitive to informational content.
• Measures that satisfy the Blackwell property-which arguably do measure information content-but are insensitive to changes in the sources' distributions p(y 1 , . .., y n ) = t p(y 1 , . .., y n , t) (as long as p(T ), K (1) , . .., K (n)   [Bertschinger et al., 2014].It should be noted that I BROJA ∩ is only defined for the bivariate case, that is, for distributions with at most two sources, described by p(y 1 , y 2 , t).Applications of these can be found in [Kay et al., 2022, Liang et al., 2023, Hamman and Dutta, 2023].satisfy the so-called (*) assumption [Bertschinger et al., 2014], which argues that redundant and unique information should only depend on the marginal distribution of the target p(T ) and on the conditional distributions of the sources given the target, that is, on the stochastic matrices K (i) .James et al. [2018b] (Section 4) and Ince [2017] (Section 5) provide great arguments as to why the (*) assumption should not hold in general, and we agree with them.
Towards motivating a new PID, let us look at how I d ∩ decomposes information in the bivariate case.Any measure that is based on a preorder between channels and which satisfies Kolchinsky's axioms yields similar decompositions [Gomes and Figueiredo, 2023], thus there is no loss of generality in focusing on I d ∩ .We next analyze three different cases.
• Case 2: There is no ordering between the channels and the solution of ) is a trivial channel, in the sense that it has no information about T .The decomposition is given by R = 0, U 2 = I(Y 2 ; T ), U 1 = I(Y 1 ; T ), and S = I(Y ; T ) − I(Y 1 ; T ) − I(Y 2 ; T ), which may lead to a negative value of synergy.An example of this is provided later.
As an example, consider the COPY distribution with Y 1 and Y 2 i.i.d.Bernoulli variables with parameter 0.5, shown in the center of Table 1.In this case, channels K (1) and K (2) have the forms with no degradation order between them.This yields R = 0, U 1 = U 2 = 1, and S = 0.
• Case 3: There is no ordering between the channels and As an example, consider the BOOM distribution [James et al., 2018b], shown on the right-hand side of Table 1.In this case, channels K (1) and K (2) are and there is no degradation order between them.However, there is a nontrivial channel K Q that is dominated by both K (1) and K (2) that maximizes I(Q; T ).One of its versions is This class of approaches has some limitations, as is the case for all PID measures.In the bivariate case, the definition of synergy S from a measure of redundant information is the completing term such that I(Y ; , then there is no synergy.This makes intuitive sense because, in this case, T − Y 1 − Y 2 is a Markov chain (see Section 1 for the definition), consequently, I(Y ; T ) = I(Y 1 ; T ), that is, Y 1 has the same information about T as the pair Y = (Y 1 , Y 2 ).
If there is no ⪯ d ordering between the channels, as in the COPY distribution (Table 1, middle), the situation is more complicated.We saw that the decomposition for this distribution yields R = 0, U 1 = U 2 = 1, and S = 0.However, suppose we change the distribution such that T = (1, 1) has probability 0 and the other outcomes have probability 1/3.For example, consider the distribution in Table 2.For this distribution, we have I(Y ; T ) ≈ 1.585.Intuitively, we would expect that I(Y ; T ) would be decomposed as R = 0, U 1 = U 2 = I(Y ; T )/2, and S = 0, just as before, so that the proportions U i /I(Y ; T ), for i ∈ {1, 2}, in both distributions remain the same, whereas redundancy and synergy would remain zero.That is, we do not expect that removing one of the outcomes while maintaining the remaining outcomes equiprobable would change the types of information in the system.However, if we perform this and compute the decomposition yielded by I d ∩ , we obtain R = 0, U 1 = U 2 = 0.918 ̸ = I(Y ; T )/2, and S = −0.251,i.e., a negative synergy, arguably meaningless.
There are still many open questions in PID.One of those questions is: Should measures of redundant information be used to measure synergy, given that they compute it as the completing term in Equation (1).We agree that using a measure of redundant information to compute the synergy in this way may not be appropriate, especially because the inclusion-exclusion principle (IEP) should not necessarily hold in the context of PID; see [Kolchinsky, 2022] for comments on the IEP.
With these motivations, we propose a measure of union information for PID that shares with I d ∩ the implicit view of channels.However, unlike I d ∩ and I BROJA ∩ -which satisfy the (*) assumption, and thus, are not concerned with the conditional dependencies in p(y i |t, y −i )-our measure defines synergy as the information that cannot be computed from p(y i |t), but can be computed from p(y i |t, y −i ).That is, we propose that synergy be computed as the information that is not captured by assuming conditional independence of the sources, given the target.
3 A New Measure of Union Information

Motivation and Bivariate Definition
Consider a distribution p(y 1 , y 2 , t) and suppose there are two agents, agent 1 and agent 2, whose goal is to reduce their uncertainty about T by observing Y 1 and Y 2 , respectively.Suppose also that the agents know p(t), and that agent i has access to its channel distribution p(y i |t).Many PID measures make this same assumption, including I d ∩ .When agent i works alone to reduce the uncertainty about T , since it has access to p(t) and p(y i |t), it also knows p(y i ) and p(y i , t), which allows it to compute I(Y i ; T ): the amount of uncertainty reduction about T achieved by observing Y i .Now, if the agents can work together, that is, if they have access to Y = (Y 1 , Y 2 ), then they can compute I(Y ; T ), because they have access to p(y 1 , y 2 |t) and p(t).On the other hand, if the agents are not able to work together (in the sense that they are not able to observe Y together, but only Y 1 and Y 2 , separately) yet can communicate, then they can construct a different distribution q given by q(y 1 , y 2 , t) := p(t)p(y 1 |t)p(y 2 |t), i.e., a distribution under which Y 1 and Y 2 are conditionally independent given T , but have the same marginal p(t) and the same individual conditionals p(y 1 |t) and p(y 2 |t).
The form of q in the previous paragraph should be contrasted with the following factorization of p, which entails no conditional independence assumption: p(y 1 , y 2 , t) = p(t)p(y 1 |t)p(y 2 |t, y 1 ).In this sense, we would propose to define union information, for the bivariate case, as follows: where the subscript refers to the distribution under which the mutual information is computed.From this point forward, the absence of a subscript means that the computation is performed under the true distribution p.As we will see, this is not yet the final definition, for reasons to be addressed below.
Using the definition of synergy derived from a measure of union information Kolchinsky [2022], for the bivariate case we have (4) Synergy is often posited as the difference between the whole and the union of the parts.For our measure of union information, the 'union of the parts' corresponds to the reduction in uncertainty about T -under q-that agents 1 and 2 can obtain by sharing their conditional distributions.Interestingly, there are cases where the union of the parts is better than the whole, in the sense that I ∪ (Y 1 , Y 2 → T ) > I(Y ; T ).An example of this is given by the Adapted ReducedOR distribution, originally introduced by Ince [2017] and adapted by James et al. [2018b], which is shown in the left-hand side of Table 3, where r ∈ [0, 1].This distribution is such that I q (Y ; T ) does not depend on r (I q (Y ; T ) ≈ 0.549), since neither p(t) nor p(y 1 |t) and p(y 2 |t) depend on r; consequently, q(t, y 1 , y 2 ) also does not depend on r, as shown in the right-hand side of Table 3.
It can be easily shown that if r > 0.5, then I q (Y ; T ) > I(Y ; T ), which implies that synergy, if defined as in (4), could be negative.How do we interpret the fact that there exist distributions such that I q (Y ; T ) > I(Y ; T )?This means that under distribution q, which assumes Y 1 and Y 2 are conditionally independent given T , Y 1 and Y 2 reduce the uncertainty about T more than in the original distribution.Arguably, the parts working independently and achieving better results than the whole should mean there is no synergy, as opposed to negative synergy.
The observations in the previous paragraphs motivate our definition of a new measure of union information as with the superscript CI standing for conditional independence, yielding a non-negative synergy: Note that for the bivariate case we have zero synergy if , that is, if the outputs are indeed conditionally independent given T .Moreover, I CI ∪ satisfies the monotonicity axiom from the extension of the Williams-Beer axioms to measures of union information (to be mentioned in Section 4.1), which further supports this definition.For the bivariate source case, the decomposition of I(Y ; T ) derived from a measure of union information is given by

Operational Interpretation
For the bivariate case, if Y 1 and Y 2 are conditionally independent given T (Figure 1b), then p(y 1 |t) and p(y 2 |t) (and p(t)) suffice to reconstruct the original joint distribution p(y 1 , y 2 , t), which means the union of the parts is enough to reconstruct the whole, i.e., there is no synergy between Y 1 and Y 2 .Conversely, a distribution generated by the DAG in Figure 1c does not satisfy conditional independence (given T ), hence we expect positive synergy, as is the case for the XOR distribution, and indeed our measure yields 1 bit of synergy for this distribution.These two cases motivate the operational interpretation of our measure of synergy: it is the amount of information that is not captured by assuming conditional independence of the sources (given the target).
Recall, however, that some distributions are such that I q (Y ; T ) > I p (Y ; T ), i.e., such that the union of the parts 'outperforms' the whole.What does this mean?It means that under q, Y 1 and Y 2 have more information about T than under p: the constructed distribution q, which drops the conditional dependence of Y 1 and Y 2 given T , reduces the uncertainty that Y has about T more than the original distribution p.In some cases, this may happen because the support of q is larger than that of p, which may lead to a reduction in uncertainty under q that cannot be achieved under p.In these cases, since we are decomposing I p (Y ; T ), we revert to saying that the union information that a set of variables has about T is equal to I p (Y ; T ), so that our measure satisfies the monotonicity axiom (later introduced in Definition 2).We will comment on this compromise between satisfying the monotonicity axiom and ignoring dependencies later.

General (Multivariate) Definition
To extend the proposed measure to an arbitrary number n ≥ 2 of sources, we briefly recall the synergy lattice [Chicharro andPanzeri, 2017, Gutknecht et al., 2023] and the union information semi-lattice [Gutknecht et al., 2023].For n = 3, these two lattices are shown in Figure 2.For the sake of brevity, we will not address the construction of the lattices or the different orders between sources.We refer the reader to the work of Gutknecht et al. [2023] for an excellent overview of the different lattices, the orders between sources, and the construction of different PID measures.In the following, we use the term source to mean a subset of the variables {Y 1 , . .., Y n }, or a set of such subsets, we drop the curly brackets for clarity and refer to the different variables by their indices, as is common in most works on PID.The decomposition resulting from a measure of union information is not as direct to obtain as one obtained from a measure of redundant information, as the solution for the information atoms is not a Möbius inversion [Gutknecht et al., 2021].One must first construct the measure of synergy for source α by writing which is the generalization of ( 6) for an arbitrary source α.In the remainder of this paper, we will often omit "→ T " from the notation (unless it is explicitly needed), with the understanding that the target variable is always referred to as T .Also for simplicity, in the following, we identify the different agents that have access to different distributions as the distributions they have access to.
It is fairly simple to extend the proposed measure to an arbitrary number of sources, as illustrated in the following two examples.
Example 3.1.To compute , thus it can also compute, by marginalization, p(y 1 |t) and p(y 2 |t).On the other hand, agent Y 3 only knows p(y 3 |t).Recall that both agents also have access to p(t).By sharing their conditionals, the agents can compute q 1 (y 1 , y 2 , y 3 , t) := p(t)p(y 1 , y 2 |t)p(y 3 |t), and also q 2 (y 1 , y 2 , y 3 , t) := p(t)p(y 1 |t)p(y 2 |t)p(y 3 |t).After this, they may choose whichever distribution has the highest information about T , while still holding the view that any information gain larger than I(Y ; T ) must be disregarded.Consequently, we write ) .In this case, the three agents may compute four different distributions, two of which are the same q 1 and q 2 defined in the previous paragraph, and the other two are q 3 (y 1 , y 2 , y 3 , t) := p(t)p(y 1 , y 3 |t)p(y 2 |t), and q 4 (y 1 , y 2 , y 3 , t) := p(t)p(y 1 |t)p(y 2 , y 3 |t).
Given these insights, we propose the following measure of union information.Definition 1.Let A 1 , . .., A m be an arbitrary collection of m ≥ 1 sources (recall sources may be subsets of variables).Assume that no source is a subset of another source and no source is a deterministic function of other sources (if there is, remove it).We define A i and Q is the set of all different distributions that the m agents can construct by combining their conditional distributions and marginalizations thereof.
q 3 , q 4 }.We now justify the conditions in Definition 1 and the fact that they do not entail any loss of generality.
• The condition that no source is a subset of another source (which also excludes the case where two sources are the same) implies no loss of generality: if one source is a subset of another, say A i ⊆ A j , then A i may be removed without affecting either A or Q, thus yielding the same value for I CI ∪ .The removal of source A i is also performed for measures of intersection information, but under the opposite condition: whenever A j ⊆ A i .
• The condition that no source is a deterministic function of other sources is slightly more nuanced.In our perspective, an intuitive and desired property of measures of both union and synergistic information is that their value should not change whenever one adds a source that is a deterministic function of sources that are already considered.We provide arguments in favor of this property in Section 4.2.1.This property may not be satisfied by computing I CI ∪ without previously excluding such sources.For instance, consider p(t, y 1 , y 2 , y 3 ), where Y 1 and Y 2 are two i.i.d.random variables following a Bernoulli distribution with parameter 0.5, 5409.This issue is resolved by removing deterministic sources before computing I CI ∪ .
We conclude this section by commenting on the monotonicity of our measure.Suppose we wish to compute the union information of sources 3 ) (monotonicity of union information).Recall our motivation for I CI ∪ ((Y 1 , Y 2 ), Y 3 ): there are two agents, the first has access to p(y 1 , y 2 |t) and the second to p(y 3 |t).The two agents assume conditional independence of their variables and construct q ′ (y 1 , y 2 , y 3 , t) = p(t)p(y 1 , y 2 |t)p(y 3 |t).The story is similar for the computation of ), in which case we have three agents that construct q ′′ (y 1 , y 2 , y 3 , t) = p(t)p(y 1 |t)p(y 2 |t)p(y 3 |t).Now, it may be the case that I q ′ (Y ; T ) < I q ′′ (Y ; T ); considering only these two distributions would yield ), contradicting monotonicity for measures of union information.To overcome this issue, for the computation of I CI ∪ ((Y 1 , Y 2 ), Y 3 )-and other sources in general-the agent that has access to p(y 1 , y 2 |t) must be allowed to disregard the conditional dependence of Y 1 and Y 2 on T , even if it holds in the original distribution p.

Extension of the Williams-Beer Axioms for Measures of Union Information
As Gutknecht et al. [2023] rightfully notice, the so-called Williams-Beer axioms [Williams and Beer, 2010] can actually be derived from parthood distribution functions and the consistency equation [Gutknecht et al., 2023].Consequently, they are not really axioms but consequences of the PID framework.As far as we know, there has been no proposal in the literature for the equivalent of the Williams-Beer axioms (which refer to measures of redundant information) for measures of union information.In the following, we extend the Williams-Beer axioms to measures of union information and show that the proposed I CI ∪ satisfies these axioms.Although we just argued against calling them axioms, we keep the designation Williams-Beer axioms because of its popularity.Although the following definition is not in the formulation of Gutknecht et al. [2021], we suggest formally defining union information from the formulation of parthood functions as where f refers to a parthood function and Π(f ) is the information atom associated with the parthood function f .Given this formulation, the following properties must hold.Definition 2. Let A 1 , . .., A m be an arbitrary number m ≥ 2 of sources.A measure of union information I ∪ is said to satisfy the Williams-Beer axioms for union information measures if it satisfies: 1. Symmetry: I ∪ is symmetric in the A i 's.

Equality for monotonicity: A
∪ satisfies the Williams-Beer axioms for measures of union information given in Definition 2.
Proof.We address each of the axioms in turn.
1. Symmetry follows from the symmetry of mutual information, which in turn is a consequence of the well-known symmetry of joint entropy.
2. Self-redundancy follows from the fact that agent i has access to p(A i |T ) and p(T ), which means that p(A i , T ) is one of the distributions in the set Q, which implies that 3. To show that monotonicity holds, begin by noting that due to the monotonicity of mutual information.Let Q m be the set of distributions that the sources A 1 , . .., A m can construct and Consequently, which means monotonicity holds.
4. Finally, the proof of equality for monotonicity is the same that was used above to show that the assumption that no source is a subset of another source entails no loss of generality.

Review of Suggested Properties: Griffith and Koch [2014]
We now review properties of measures of union information and synergy that have been suggested in the literature in chronological order.The first set of properties was suggested by Griffith and Koch [2014], with the first two being the following.
• Duplicating a predictor does not change synergistic information; formally, where A m+1 = A i , for some i = 1, . .., m.Griffith and Koch [2014] show that this property holds if the equality for monotonicity property holds for the "corresponding" measure of union information ("corresponding" in the sense of Equation ( 7)).As shown in the previous subsection, I CI ∪ satisfies this property, and so does the corresponding synergy S CI .
• Adding a new predictor can decrease synergy, which is a weak statement.We suggest a stronger property: Following the monotonicity property, adding a new predictor cannot increase synergy, which is formally written as S(A 1 , . .., A m → T ) ≥ S(A 1 , . .., A m , A m+1 → T ).This property simply follows from monotonicity for the corresponding measure of union information, which we proved above holds for I CI ∪ .
The next properties for any measure of union information were also suggested by Griffith and Koch [2014]: 1. Global positivity: I ∪ (A 1 , . .., A m ) ≥ 0.
6. Weak local positivity: for n = 2 the derived partial informations are non-negative.This is equivalent to

Strong identity: I
We argued before that self-redundancy and symmetry are properties that follow trivially from a well-defined measure of union information [Gutknecht et al., 2023].In the following, we discuss in more detail properties 4 and 5, and return to the global positivity property later.

Stronger Monotonicity
Property 4 in the above list was originally called monotonicity by Griffith and Koch [2014]; we changed its name because we had already defined monotonicity in Definition 2, a weaker condition than stronger monotonicity.The proposed inequality clearly follows from the monotonicity of union information (the third Williams-Beer axiom).Now, if there is some Griffith and Koch [2014] suggest that we must have equality.Recall Axiom 4 (equality for monotonicity) in the extension of the WB axioms (Definition 2).It states that equality must hold if A m ⊆ A m−1 .In this context, A m and A m−1 are sets of random variables, for example, There is a different point of view we may take.The only way that A m is a subset of A m−1 is if A m , when viewed as a random vector (in this case, write , is a subvector of A m−1 .A subvector of a random vector is a deterministic function, and no information gain can come from applying a deterministic function to a random vector.As such, there is no information gain when one considers A m , a function of A m−1 , if one already has access to A m−1 .Griffith and Koch [2014] argue similarly, there is no information gain by considering A m+1 -a function of A i -in addition to A i .In conclusion, considering the 'equality for monotonicity' strictly through a set-inclusion perspective, stronger monotonicity does not follow.On the other hand, extending the idea of set inclusion to the more general context of functions of random variables, then stronger monotonicity follows, because {A m+1 } = {f (A i )} is a subset of {A i }, hence there is no information gain by considering A m+1 = f (A i ) in addition to A i .As such, we obtain I ∪ (A 1 , . .., A m ) = I ∪ (A 1 , . .., A m , A m+1 ).Consequently, it is clear that stronger monotonicity must hold for any measure of union information.We, thus, argue in favor of extending the concept of subset inclusion of property 4 in Definition 2 to include the concept of deterministic functions.

Target Monotonicity
Let us move on to target monotonicity, which we argue should not hold in general.This precise same property was suggested, but for a measure of redundant information, by Bertschinger et al. [2013]; they argue that a measure of redundant information should satisfy for any discrete random variable Z, as they argue that this property 'captures the intuition that if A 1 , . .., A m share some information about T , then at least the same amount of information is available to reduce the uncertainty about the joint outcome of (T, Z)'.Since most PID approaches have been built upon measures of redundant information, it is simpler to refute this property.Consider I d ∩ , which we argue is one of the most well-motivated and accepted measures of redundant information (as defined in (2)): it satisfies the WB axioms, it is based on the famous Blackwell channel preorder-thus inheriting a well-defined and rigorous operation interpretation-it is based on channels, just as I(X; T ) was originally motivated based on channels, and is defined for any number of input variables, which is more than most PID measures can accomplish.Consider also the distribution presented in Table 4, which satisfies T = Y 1 AND Y 2 and Z = (Y 1 , Y 2 ).
From a game theory perspective, since neither agent (Y 1 or Y 2 ) has an advantage when predicting T (because the channels that each agent has access to have the same conditional distributions), neither agent has any unique information.Moreover, redundancy-as computed by I d ∩ (Y 1 , Y 2 → T )-evaluates to approximately 0.311.However, when considering the pair (T, Z), the structure that was present in T is now destroyed, in the sense that now there is no degradation order between the channels that each agent has access to.Note that p((t, z), y 1 , y 2 ) is a relabeling of the COPY distribution.As such, , contradicting the property proposed by Bertschinger et al. [2013].
For a similar reason, we believe that this property should not hold for a general measure of union information, even if the measure satisfies the extension of the Williams-Beer axioms, as our proposed measure does.For instance, consider the distribution presented in Table 5.
, meaning target monotonicity does not hold.This happens because although I p (Y ; T ) ≤ I p (Y ; T, Z), it is not necessarily true that I q (Y ; T ) ≤ I q (Y ; T, Z).The union information measure derived from the degradation order between channels, defined as the 'dual' of (2), also agrees with our conclusion [Kolchinsky, 2022].For the distribution in Table 4 we have , for the same reason as above: considering (T, Z) as the target variable destroys the structure present in T .We agree with the remaining properties suggested by Griffith and Koch [2014] and we will address those later.

Review of Suggested Properties: Quax et al. [2017]
Moving on to additional properties, Quax et al. [2017] suggest the following properties for a measure of synergy: 1. Non-negativity: S(A 1 , . .., A m → T ) ≥ 0. Let us comment on the proposed 'zero synergy' properties (4 and 5) under the context of PID.Property 4 seems to have been proposed with the rationale that synergy can only exist for at least two sources, which intuitively makes sense, as synergy is often defined as 'the information that is present in the pair, but that is not retrievable from any individual variable'.However, because of the way a synergy-based PID is constructed-or weak-synergy, as Gutknecht et al. [2023] call it-synergy must be defined as in ( 7), so that, for example, in the bivariate case, S(Y 1 → T ) := I(Y ; T ) − I ∪ (Y 1 → T ) = I(Y 2 ; T |Y 1 ), because of self-redundancy of union information and the chain rule of mutual information [Cover and Thomas, 1999], and since I(Y 2 ; T |Y 1 ) is in general larger than 0, we reject the property 'zero synergy about a single variable'.We note that our rejection of this property is based on the PID perspective.There may be other areas of research where it makes sense to demand that any (single) random variable alone has no synergy about any target, but under the PID framework, this must not happen, particularly so that we obtain a valid information decomposition.
Property 5, 'zero synergy in a single variable', on the other hand, must hold because of self-redundancy.That is because, for any i ∈ {1, . .., n}, 4.4 Review of Suggested Properties: Rosas et al. [2020] Based on the proposals of Griffith et al. [2014], Rosas et al. [2020] suggested the following properties for a measure of synergy: We argue that a principled measure of synergy does not need to satisfy these properties (in general).Consider the distribution presented in Table 6, in which T 1 is a relabeling of the COPY distribution and Start by noting that since T 2 is a deterministic function of T 1 , then = 0 leads to zero synergy.On the other hand, S CI (Y 1 , Y 2 → T 2 ) = 1, contradicting the first property suggested by Rosas et al. [2020].This happens because Y 1 ̸ ⊥ p Y 2 |T 2 , so synergy is positive.The loss of conditional independence of the inputs (given the target) when one goes from considering the target T 1 to T 2 is the reason why synergy increases.It can be easily seen that S d , the measure of synergy derived from Kolchinsky's proposed union information measure I d ∪ [Kolchinsky, 2022], agrees with this.A simpler way to see this is by noticing that the XOR distribution must yield 1 bit of synergistic information, and many PID measures do not yield 1 bit of synergistic information for the COPY distribution.
The second suggested property argues that synergy should be a convex function of P (T |Y ), for fixed P (Y ).Our measure of synergy does not satisfy this property, even though it is derived from a measure of union information that satisfies the extension of the WB axioms.For instance, consider the XOR distribution with one extra outcome.We introduce it in Table 7 and parameterize it using r = p(T = 0|Y = (0, 0)) ∈ [0, 1].Notice that this modification does not affect P (Y ).
Synergy, as measured by S CI (Y 1 , Y 2 → T ), is maximized when r equals 1 (the distribution becomes the standard XOR) and minimized when r equals 0. We do not see an immediate reason as to why a general synergy function should be convex in p(t|y), or why it should have a unique minimizer as a function of r.Recall that a function ).In the following, we slightly abuse the notation of the input variables of a synergy function.Our synergy measure S CI , when considered as a function of r, does not satisfy this inequality.For the adapted XOR distribution, take t = 0.5, x 1 = 0, and x 2 = 0.5.We have S CI (0.5 × 0 + 0.5 × 0.5) = S CI (0.25) ≈ 0.552 and 0.5 × S CI (0) + 0.5 × S CI (0.5) ≈ 0.5 × 0.270 + 0.5 × 0.610 ≈ 0.440, contradicting the property of channel convexity.S d agrees with this.We slightly change p(y) in the distribution to obtain a new distribution, which we present in Table 8.
This can be easily seen since K (1) = K (2) for any r ∈ [0, 1], hence we may choose K Q = K (1) to compute S d = I(Y ; T ) − I(Q; T ), which is not convex for this particular distribution.To conclude this section, we present a plot of S CI (Y 1 , Y 2 ) and S d (Y 1 , Y 2 ) as a function of r in Figure 3, for the distribution presented in Table 7.  8.As we showed for this distribution, S CI is not a convex function of r.
With this we do not mean that these properties should not hold for an arbitrary PID.We are simply showing that some properties must be satisfied in the context of PID (such as the WB axioms), whereas other properties are not necessary for PID (such as the previous two properties).

Relationship with the Extended Williams-Beer Axioms
We now prove which of the introduced properties are implied by the extension of the Williams-Beer axioms for measures of union information.In what follows, assume that the goal is to decompose the information present in the distribution p(y, t) = p(y 1 , . .., y n , t).Theorem 2. Let I ∪ be a measure of union information that satisfies the extension of the Williams-Beer axioms (symmetry, self-redundancy, monotonicity, and equality for monotonicity) for measures of union information as in Definition (2).Then, I ∪ also satisfies the following properties of Griffith and Koch [2014]: global positivity, weak local positivity, strong identity, "duplicating a predictor does not change synergistic information", and "adding a new predictor cannot increase synergy".
Proof.We argued before that the last two properties follow from the definition of a measure of union information.Global positivity is a direct consequence of monotonicity and the non-negativity of mutual information: Weak local positivity holds because monotonicity and self-redundancy imply that Strong identity follows trivially from self-redundancy, since Theorem 3. Consider a measure of union information that satisfies the conditions of Theorem 2. If synergy is defined as in Equation (7), it satisfies the following properties of Quax et al. [2017]: non-negativity, upper-bounded by mutual information, weak symmetry, and zero synergy in a single variable.
Proof.Non-negativity of synergy and upper-bounded by mutual information follow from the definition of synergy and from the fact that for whichever source (A 1 , . .., A m ), with m ≥ 1, we have that Weak symmetry follows trivially from the fact that both I (Y ; T ) and I ∪ (A 1 , . .., A m → T ) are symmetric in the relevant arguments.
Finally, zero synergy in a single variable follows from self-redundancy together with the definition of synergy, as shown above.

Previous Measures of Union Information and Synergy
We now review other measures of union information and synergy proposed in the literature.For the sake of brevity, we will not recall all their definitions, only some important conclusions.We suggest the interested reader consult the bibliography for more information.

Qualitative Comparison
Griffith and Koch [2014] review three previous measures of synergy: • S WB , derived from I WB ∩ , the original redundancy measure proposed by Williams and Beer [2010], using the IEP; • the whole-minus-sum (WMS) synergy, S WMS ; • the correlational importance synergy, S ∆I .These synergies can be interpreted as resulting directly from measures of union information; that is, they are explicitly written as S(α → T ) = I(Y ; T ) − I ∪ (α → T ), where I ∪ may not necessarily satisfy our intuitions of a measure of union information, as in Definition 2, except for S ∆I , which has the form of a Kullback-Leibler divergence.Griffith and Koch [2014] argue that S WB overestimates synergy, which is not a surprise, as many authors criticized I WB ∩ for not measuring informational content, only informational values [Harder et al., 2013].The WMS synergy, on the other hand, which can be written as a difference of total correlations, can be shown to be equal to the difference between synergy and redundancy for n = 2, which is not what is desired in a measure of synergy.For n > 2, the authors show that the problem becomes even more exacerbated: S WMS equals synergy minus the redundancy counted multiple times, which is why the authors argue that S WMS underestimates synergy.Correlational importance, S ∆I , is known to be larger than I(Y ; T ) for some distributions, excluding it from being an appropriately interpretable measure of synergy.
Faced with these limitations, Griffith and Koch [2014] introduce their measure of union information, which they define as where the minimization is over joint distributions of A 1 , . .., A m , T , alongside the derived measure of synergy S VK (α → T ) = I(Y ; T ) − I VK ∪ (α → T ).This measure quantifies union information as the least amount of information that source α has about T when the source-target marginals (as determined by α) are fixed by p. Griffith and Koch [2014] also established the following inequalities for the synergistic measures they reviewed: where α = (A 1 , . .., A m ).At the time, Griffith and Koch [2014] did not provide a way to analytically compute their measure.Later, Kolchinsky [2022] showed that the measure of union information derived from the degradation order, I d ∪ , is equivalent to I VK ∪ , and provided a way to compute it.For this reason, we will only consider I d ∪ .After the work of Griffith and Koch [2014] in 2014, we are aware of only three other suggested measures of synergy: • S MSRV , proposed by Quax et al. [2017], where MSRV stands for maximally synergistic random variable; • S SD , proposed by Rosas et al. [2020], where SD stands for synergistic disclosure; • S d , proposed by Kolchinsky [2022].
The first two proposals do not define synergy via a measure of union information.They define synergy through an auxiliary random variable, Z, which has positive information about the whole-that is, I(Z; Y ) > 0-but no information about any of the parts-that is, I(Z; Y i ) = 0, i = 1, . .., n.While this property has an appealing operational interpretation, we believe that it is too restrictive; that is, we believe that information can be synergistic, even if it provides some positive information about some part of Y .
The authors of S MSRV show that their proposed measure is incompatible with PID and that it cannot be computed for all distributions, as it requires the ability to compute orthogonal random variables, which is not always possible [Quax et al., 2017].A counter-intuitive example for the value of this measure can be seen for the AND distribution, defined by T = Y 1 AND Y 2 , with Y 1 and Y 2 i.i.d.taking values in {0, 1} with equal probability.In this case, S MSRV = 0.5, a value that we argue is too large, because whenever Y 1 (respectively Y 2 ) is 0, then T does not depend on Y 2 (respectively Y 1 ) (which happens with probability 0.75).Consequently, S MSRV /I(Y ; T ) ≈ 0.5/0.811≈ 0.617 may be too large of a synergy ratio for this distribution.As the authors note, the only other measure that agrees with S MSRV for the AND distribution is S WB , which Griffith and Koch [2014] argued also overestimates synergy.
Concerning S SD , we do not have any criticism, except for the one already pointed out by Gutknecht et al. [2023]: they note that the resulting decomposition from S SD is not a standard PID, in the sense that it does not satisfy a consistency equation (see [Gutknecht et al., 2023] for more details), which implies that '. . . the atoms cannot be interpreted in terms of parthood relations with respect to mutual information terms . . . .For example, we do not obtain any atoms interpretable as unique or redundant information in the case of two sources' [Gutknecht et al., 2023].Gutknecht et al. [2023] suggest a very simple modification to the measure so that it satisfies the consistency equation.
For the AND distribution, S SD evaluates to approximately 0.311, as does S d , whereas our measure yields S CI ≈ 0.270, as the information that the parts cannot obtain when they combine their marginals, under distribution q.This shows that these four measures are not equivalent.

Quantitative Comparison
Griffith and Koch [2014] applied the synergy measures they reviewed to other distributions.We show their results below in Table 9 and compare them with the synergy resulting from our measure of union information, S CI , with the measure of Rosas et al. [2020], S SD , and that of Kolchinsky [2022], S d .Since the code for the computation of S MSRV is no longer available online, we do not present it.
unless the synergy measure is directly defined (as opposed to being defined via a union information measure).We now comment on the results.It should be noted that Kolchinsky [2022] suggested that unique information U 1 and U 2 should be computed from measures of redundant information, and excluded information E 1 and E 2 should be computed from measures of union information, as in our case.However, since we will only present the decompositions for the bivariate case and in this case E 1 = U 2 and E 2 = U 1 , we present the results considering unique information, as is mostly performed in the literature.
• XOR yields I(Y ; T ) = 1.The XOR distribution is the hallmark of synergy.Indeed, the only solution of ( 1) is (S, R, U 1 , U 2 ) = (1, 0, 0, 0), and all of the above measures yield 1 bit of synergy.• AND yields I(Y ; T ) ≈ 0.811.Unlike XOR, there are multiple solutions for (1), and none is universally agreed upon, since different information measures capture different concepts of information.• COPY yields I(Y ; T ) = 2.Most PID measures argue one of two different possibilities for this distribution.
They suggest that the solution is either (S, R, U 1 , U 2 ) = (1, 1, 0, 0) or (0, 0, 1, 1).Our measure suggests that all information flows uniquely from each source.• RDNXOR yields I(Y ; T ) = 2.In words, this distribution is the concatenation of two XOR 'blocks', each of which has its own symbols, and not allowing the two blocks to mix.That is, both Y 1 and Y 2 can determine in which XOR block the resulting value T will be-which intuitively means that they both have this information, meaning it is redundant-but neither Y 1 nor Y 2 have information about the outcome of the XOR operation-as is expected in the XOR distribution-which intuitively means that such information must be synergistic.All measures except S WMS agree with this.• RDNUNQXOR yields I(Y ; T ) = 4.According to Griffith and Koch [2014], it was constructed to carry 1 bit of each information type.Although the solution is not unique, it must satisfy U 1 = U 2 .Indeed, our measure yields the solution (S, R, U 1 , U 2 ) = (1, 1, 1, 1), like most measures except S WB and S WMS .This confirms the intuition by Griffith and Koch [2014] that S WB and S WMS overestimate and underestimate synergy, respectively.In fact, in the decomposition resulting from S WB , there are 2 bits of synergy and 2 bits of redundancy, which we argue cannot be the case, as this would imply that U 1 = U 2 = 0, and given the construction of this distribution, it is clear that there is some unique information since, unlike in RDNXOR, the XOR blocks are allowed to mix, thus (T, Y 1 , Y 2 ) = (1, 0, 1) is a possible outcome, but so is (T, Y 1 , Y 2 ) = (2, 0, 2).That is not the case with RDNXOR.On the other hand, S WMS yields zero synergy and redundancy, with U 1 and U 2 each evaluating to 2 bits.Since this distribution is a mix of blocks satisfying a relation of the form T = Y 1 xor Y 2 , we argue that there must be some non-null amount of synergy, which is why we claim that S WMS is not valid.• XORDUPLICATE yields I(Y ; T ) = 1.All measures correctly identify that the duplication of a source should not change synergy, at least for this particular distribution.• ANDDUPLICATE yields I(Y ; T ) ≈ 0.811.Unlike in the previous example, both S WMS and S ∆I yield a change in their synergy value.This is a shortcoming, since duplicating a source should not increase either synergy or union information.The other measures are not affected by the duplication of a source.
• XORLOSES yields I(Y ; T ) = 1.Its distribution is the same as XOR but with a new source Y 3 satisfying T = Y 3 .As such, since Y 3 uniquely determines T , we expect no synergy.All measures agree with this.
• XORMULTICOAL yields I(Y ; T ) = 1.Its distribution is such that any pair (Y i , Y j ), i, j = 1, 2, 3, i ̸ = j is able to determine T with no uncertainty.All measures agree that the information present in this distribution is purely synergistic.
From these results, we agree with Griffith and Koch [2014] that S WB , S WMS , and S ∆I are not good measures of synergy: they do not satisfy many of our intuitions and overestimate synergy, not being invariant to duplicate sources or taking negative values.For these reasons, and those presented in Section 5, we reject those measures of synergy.In the next section, we comment on the remaining measures S d , S SD , and S CI .

Relation to Other PID Measures
Kolchinsky [2022] introduced I d ∪ and showed that this measure is equivalent to I VK ∪ [Griffith and Koch, 2014] and to I BROJA ∪ [Bertschinger et al., 2014], in the sense that the three of them achieve the same optimum value [Kolchinsky, 2022].The multivariate extension of I BROJA ∪ was proposed by Griffith and Koch [2014], defined as which we present because it makes it clear what conditions are enforced upon the marginals.There is a relation between ) and I CI ∪ (A 1 , . .., A m ) whenever the sources A 1 , . ..A m are singletons.In this case, and only in this case, the set Q involved in the computation of I CI ∪ (A 1 , . .., A m ) has only one element: q(t, a 1 , . .., a m ) = p(t)p(a 1 |t). ..p(a m |t).Since this distribution, as well as the original distribution p, are both admissible points in I d ∪ , we have that I d ∪ ≤ I CI ∪ , which implies that S d ≥ S CI .On the other hand, if there is at least one source A 1 , . .., A m that is not a singleton, the measures are not trivially comparable.For example, suppose we wish to compute )) is a distribution p * whose marginals p * (y 1 , y 2 , t) and p * (y 2 , y 3 , t) must coincide with the marginals under the original p.However, in the computation of ∪ , and it achieves a lower mutual information with T .That is, it might be the case that I p * (Y ; T ) < I q (Y ; T ), for all q ∈ Q.In such a case, we would have I d ∪ > I CI ∪ .It is convenient to be able to upper-bound certain measures with other measures.For example, Gomes and Figueiredo [2023] (see that paper for the definitions of these measures) showed that for any source (A 1 , . .., A m ), m ≥ 1, However, we argue that the inability to draw such strong conclusions (or bounds) is a positive aspect of PID.This is because there are many different ways to define the information (be it redundant, unique, union, etc.) that one wishes to capture.If one could trivially relate all measures, it would mean that it would be possible to know a priori how those measures would behave.Consequently, this would imply the absence of variability/freedom in how to measure different information concepts, as those measures would capture, non-equivalent but similar types of information, as they would all be ordered.It is precisely because one cannot order different measures of information trivially that PID provides a rich and complex framework to distinguish different types of information, although we believe that PID is still in its infancy.James et al. [2018b] introduced a measure of unique information, which we recall now.In the bivariate case-i.e., consider p(y 1 , y 2 , t)-let q be the maximum entropy distribution that preserves the marginals p(y 1 , t) and p(y 2 , t), and let r be the maximum entropy distribution that preserves the marginals p(y 1 , t), p(y 2 , t), and p(y 1 , y 2 ).Although there is no closed form for r, which has to be computed using an iterative algorithm [Krippendorff, 2009], it may be shown that the solution for q is q(y 1 , y 2 , t) = p(t)p(y 1 |t)p(y 2 |t) (see, e.g., [James et al., 2018b]).This is the same distribution q that we consider for the bivariate decomposition (3).James et al. [2018b] suggest defining unique information U i as the least change (in sources-target mutual information) that involves the addition of the (Y i , T ) marginal constraint, that is, and analogously for U 2 .They show that their measure yields a non-negative decomposition for the bivariate case.Since I(Y 1 ; T |Y 2 ) = S + U 1 , some algebra leads to S dep = I(Y ; T ) − min{I q (Y ; T ), I r (Y ; T )}, where S dep is the synergy resulting from the decomposition of James et al. [2018b] in the bivariate case.Recall that our measure of synergy for the bivariate case is given by S CI = I(Y ; T ) − min{I q (Y ; T ), I p (Y ; T )}. ( The similarity is striking.Computing S dep for the bivariate distributions in Table 9 shows that it coincides with the decomposition given by our measure, although this is not the case in general.We could not obtain S dep for the RDNUNQXOR distribution because the algorithm that computes r did not finish in the allotted time of 10 min.James et al. [2018b] showed that for whichever bivariate distribution I r (Y ; T ) ≤ I p (Y ; T ); therefore, for the bivariate case we have S CI ≤ S dep .Unfortunately, the measure of unique information proposed by James et al. [2018b], unlike the usual proposals of intersection or union information, does not allow for the computation of the partial information atoms in the complete redundancy lattice if n > 2. The authors also comment that it is not clear if their measure satisfies monotonicity when n > 2. Naturally, our measure is not the same as S dep , so it does not retain the operational interpretation of unique information U i being the least amount that influences I(Y ; T ) when the marginal constraint (Y i , T ) is added to the resulting maximum entropy distributions.Given the form of S dep , one could define I dep ∪ := min{I q (Y ; T ), I r (Y ; T )} and study its properties.Clearly, it does not satisfy the self-redundancy axiom, but we wonder if it could be adjusted so that it satisfies all of the proposed axioms.The n = 2 decomposition retains the operational interpretation of the original measure, but it is not clear whether this is true for n > 2. For the latter case, the maximum entropy distributions that we wrote as q and r have different definitions [James et al., 2018b].We leave this for future work.

Conclusions and Future Work
In this paper, we introduced a new measure of union information for the partial information decomposition (PID) framework, based on the channel perspective, which quantifies synergy as the information that is beyond conditional independence of the sources, given the target.This measure has a clear interpretation and is very easy to compute, unlike most measures of union information or synergy, which require solving an optimization problem.The main contributions and conclusions of the paper can be summarized as follows.
• We introduced new measures of union information and synergy for the PID framework, which thus far was mainly developed based on measures of redundant or unique information.We provided its operational interpretation and defined it for an arbitrary number of sources.• We proposed an extension of the Williams-Beer axioms for measures of union information and showed our proposed measure satisfies them.• We reviewed, commented on, and rejected some of the previously proposed properties for measures of union information and synergy in the literature.• We showed that measures of union information that satisfy the extension of the Williams-Beer axioms necessarily satisfy a few other appealing properties, as well as the derived measures of synergy.• We reviewed previous measures of union information and synergy, critiqued them, and compared them with our proposed measure.• The proposed conditional independence measure is very simple to compute.
• We provide code for the computation of our measure for the bivariate case and for source {{Y 1 }, {Y 2 }, {Y 3 }} in the trivariate case.
Finally, we believe this paper opens several avenues for future research, thus we point out several directions to be pursued in upcoming work: • We saw that the synergy yielded by the measure of James et al. [2018b] is given by S dep = I(Y ; T ) − min{I q (Y ; T ), I r (Y ; T )}.Given its analytical expression, one could start by defining a measure of union information as I ∪ (Y 1 , Y 2 → T ) = min{I q (Y ; T ), I r (Y ; T )}, possibly tweak it so it satisfies the WB axioms, study its properties, and possibly extend it to the multivariate case.• Our proposed measure may ignore conditional dependencies that are present in p in favor of maximizing mutual information, as we commented in section 3.3.This is a compromise so that the measure satisfies monotonicity.We believe this is a potential drawback of our measure, and we suggest the investigation of a measure similar to ours, but that does not ignore conditional dependencies that it has access to.• Extending this measure for absolutely continuous random variables.
• Implementing our measure in the dit package [James et al., 2018d].
• This paper reviewed measures of union information and synergy, as well as properties that were suggested throughout the literature.Sometimes this was by providing examples where the suggested properties fail, and other times simply by commenting.We suggest performing something similar for measures of redundant information.

Figure 1 :
Figure 1: (a) Assuming faithfulness [Pearl, 2009], this is the only three-variable directed acyclic graph (DAG) that satisfies Y 1 ⊥ Y 2 and Y 1 ̸ ⊥ Y 2 |T , in general [Pearl, 2009].(b) The DAG that is "implied" by the perspective of I d ∩ .(c) A DAG that can generate the XOR distribution, but does not satisfy the dependencies implied by T = Y 1 xor Y 2 .In fact, any DAG that is in the same Markov equivalence class as (c) can generate the XOR distribution (or any other joint distribution), but none satisfy the earlier dependencies, assuming faithfulness.

Figure 3 :
Figure 3: Computation of S CI and S d as functions of r = p(T = 0|Y = (0, 0)) for the distribution presented in Table8.As we showed for this distribution, S CI is not a convex function of r.

Table 1 :
Three joint distributions p(yt, y 2 , y 2 ) used to exemplify the three cases.Left: joint distribution satisfying T = Y 1 .Middle: distribution satisfying T = (Y 1 , Y 2 ), known as the COPY distribution.Right: the so-called BOOM distribution (see text).

Table 2 :
Tweaked COPY distribution, now without the outcome (

Table 4 :
Counter-example distribution for target monotonicity.

Table 5 :
Counter-example distribution for target monotonicity.

Table 9 :
Griffith and Koch [2014]res reviewed inGriffith and Koch [2014](S WB , S WMS , and S ∆I ), S SD introduced by Rosas et al.[2020], S d introduced by Kolchinsky [2022], and our measure of synergy S CI to different distributions.The bottom four distributions are trivariate.We write DNF to mean that a specific computation did not finish within 10 min.We already saw the definition of the AND, COPY, and XOR distributions.The XORDUPLICATE and ANDDUPLI-CATE are built from the XOR and the AND distributions by inserting a duplicate source variable Y 3 = Y 2 .The goal is to test if the presence of a duplicate predictor impacts the different synergy measures.The definitions of the remaining distributions are presented in Appendix 7. Some of these are trivariate, and for those we compute synergy as