Causal Structure Learning with Conditional and Unique Information Groups-Decomposition Inequalities

The causal structure of a system imposes constraints on the joint probability distribution of variables that can be generated by the system. Archetypal constraints consist of conditional independencies between variables. However, particularly in the presence of hidden variables, many causal structures are compatible with the same set of independencies inferred from the marginal distributions of observed variables. Additional constraints allow further testing for the compatibility of data with specific causal structures. An existing family of causally informative inequalities compares the information about a set of target variables contained in a collection of variables, with a sum of the information contained in different groups defined as subsets of that collection. While procedures to identify the form of these groups-decomposition inequalities have been previously derived, we substantially enlarge the applicability of the framework. We derive groups-decomposition inequalities subject to weaker independence conditions, with weaker requirements in the configuration of the groups, and additionally allowing for conditioning sets. Furthermore, we show how constraints with higher inferential power may be derived with collections that include hidden variables, and then converted into testable constraints using data processing inequalities. For this purpose, we apply the standard data processing inequality of conditional mutual information and derive an analogous property for a measure of conditional unique information recently introduced to separate redundant, synergistic, and unique contributions to the information that a set of variables has about a target.


Introduction
The inference of the underlying causal structure of a system using observational data is a fundamental question in many scientific domains. The causal structure of a system imposes constraints on the joint probability distribution of variables generated from it [1-4], and these constraints can be exploited to learn the causal structure. Causal learning algorithms based on conditional independencies [1,2,5] allow the construction of a partially oriented graph [6] that represents the equivalence class of all causal structures compatible with the set of conditional independencies present in the distribution of the observable variables (the so-called Markov equivalence class). However, without restrictions on the potential existence and structure of an unknown number of hidden variables that could account for the observed dependencies, Markov equivalence classes may encompass many causal structures compatible with the data.
Conditional independencies impose equality constraints on a joint probability distribution; namely, an independence results in the equality between conditional and unconditional probability distributions, or equivalently, in a null mutual information between independent variables. In addition to the information from independencies between the observed variables, causal information can also be obtained from other functional equality constraints [7], such as dormant independencies that would occur under active interventions [8]. Further causal inference power can be obtained by incorporating assumptions on the potential form of the causal mechanisms in order to exploit additional independencies associated with hidden substructures within the generative model [9,10], or independencies related to exogenous noise terms [11-13]. Other approaches have studied the identifiability of specific parametric families of causal models [3,14]. However, these methods only provide additional inference power if the actual causal mechanisms conform to the required parametric form.
Beyond equality constraints, the causal structure may also impose inequality constraints on the distribution of the data [15,16], which reflect non-verifiable independencies involving hidden variables. Figure 1 illustrates this distinction between pairs of causal structures distinguishable based on independence constraints (Figure 1A,B) and causal structures that may be discriminated based on inequality constraints (Figure 1C,D). The structures of Figure 1A,B belong to different Markov equivalence classes because in Figure 1A variables $V_1$ and $V_2$ are independent conditioned on $S$, while in Figure 1B an independence is only obtained by further conditioning on $V_3$. On the other hand, the structures of Figure 1C,D belong to the same equivalence class because no independencies exist between the observable variables $V_i$, $i = 1, 2, 3$. Nonetheless, if the hidden variables were also observable, these structures would be distinguishable. In Figure 1D, all the dependencies between the observable variables are caused by a single hidden variable $U$, while in Figure 1C dependencies are created pairwise by different hidden variables. In this case, a testable inequality constraint involving the observable variables reflects the non-verifiable independencies that also involve hidden variables. Intuitively, in Figure 1C, the inequality constraint imposes an upper bound on the overall degree of dependence between the three variables, given that these dependencies arise only in a pairwise manner, while in Figure 1D no such bound exists.
Importantly, unlike equality constraints, inequality constraints provide necessary but not sufficient conditions for the compatibility of data with a certain causal structure. While a certain hypothesized causal structure, like that in Figure 1C, may impose the fulfillment of a given inequality intrinsically from its structure, other causal structures, like that in Figure 1D, can generate data that, given a particular instantiation of the causal mechanisms, also fulfill the inequality. Accordingly, the causal inference power of inequality constraints lies in the ability to reject hypothesized causal structures that would intrinsically require the fulfillment of an inequality when that inequality is not fulfilled by the data. This means that tighter inequalities have more inferential power, giving the capacity to discard more causal structures.
Figure 1. Examples of causal structures distinguishable from independencies (A,B) and structures that may only be discriminated based on inequality constraints (C,D). In this case, the structure in (C), and not the one in (D), intrinsically imposes a constraint, due to dependencies between the observable variables $V_i$, $i = 1, 2, 3$ arising only from pairwise dependencies with hidden common causes.
Two main classes of inequality constraints have been derived. The first class corresponds to inequality constraints in the probability space, which comprise tests of compatibility such as Bell-type inequalities [17,18], instrumental inequalities [19,20], and inequalities that appear on identifiable interventional distributions [21]. The second class corresponds to inequalities involving information-theoretic quantities. The relation between these probabilistic and entropic inequalities has been examined in [22]. One approach to construct entropic inequalities combines the inequalities defining the Shannon entropic cone, i.e., those associated with the non-negativity, monotonicity, and submodularity properties of entropy, with additional independence constraints related to the causal structure [23,24]. Additional causally informative inequalities can be derived by considering the so-called non-Shannon inequalities [25,26]. When the causal structure to be tested involves hidden variables, all non-trivial entropic inequalities in the marginal scenario associated with the set of observable variables can be derived with an algorithmic procedure [23,24] that projects the set of inequalities of all variables into inequalities that only involve the subset of observable variables.
As an alternative approach, information-theoretic inequality constraints can be derived by an explicit analytical formulation [24,27]. In particular, [27] introduced inequalities comparing the information about a target variable contained in a whole collection of variables with a weighted sum of the information contained in groups of variables corresponding to subsets of the collection. Two procedures were introduced to select the composition of these groups. In the first type of inequalities, the composition of the groups is arbitrarily determined, but an inequality only exists under some conditions of independence between the chosen variables, whose fulfillment reflects the underlying causal structure. In the second type, no conditions are required for the existence of an inequality, but the groups must be ancestral sets; that is, each group must contain all other variables that have a causal effect on any element of the group. In both cases, [27] showed that the coefficients in the weighted sum of the information contained in groups of variables are determined by the number of intersections between the groups.
In this work, we build upon the results of [27] and generalize their framework of groups-decomposition inequalities in several ways. First, we generalize both types of inequalities to the conditional case, when the inequalities involve conditional mutual information measures instead of unconditional ones. While this extension is trivial for the first type of inequalities, we show that for the second type it requires a definition of augmented ancestral sets. Second, we formulate more flexible conditions of independence for which the first type of inequalities exists. Third, we add flexibility to the construction of the ancestral sets that appear in the second type of inequalities. We show that, given a causal graph and a conditioning set of variables used for the conditional mutual information measures, alternative inequalities exist when determining ancestors in subgraphs that eliminate causal connections from different subsets of the conditioning variables. Furthermore, we determine conditions in which an inequality also holds when removing subsets of ancestors from the whole set of variables, hence relaxing, for the second type of inequalities, the requirement that the groups correspond to ancestral sets.
Apart from these generalizations, we expand the power of the approach of [27] by considering inequalities whose existence is determined by the partition into groups of a collection of variables that also contains hidden variables. That is, hidden variables can appear not only as hidden common ancestors of the collection but also as part (or even all) of the variables in the collection for which the inequality is defined. To render operational the use of inequalities derived from collections containing hidden variables, we develop procedures that allow mapping those inequalities into testable inequalities that only involve observable variables. While this mapping can be carried out by simply applying the monotonicity of mutual information to remove hidden variables from the groups, this does not work when all variables in the collection are hidden. We show that data processing inequalities [28] can be applied to obtain testable inequalities also in this case, or to obtain tighter inequalities than those obtained by simply removing the hidden variables. We illustrate how testable inequalities whose coefficients in the weighted sum depend on intersections among subsets of hidden variables instead of among subsets of observable variables can result in tighter inequalities with higher inferential power.
In order to derive testable groups-decomposition inequalities, we not only apply the standard data processing inequality of conditional mutual information [28], but also derive an additional data processing inequality for the unique information measure introduced in [29]. This measure was introduced in the framework of a decomposition of mutual information into redundant, unique, and synergistic information components [30]. Recently, alternative decompositions have been proposed to decompose the joint mutual information that a set of predictor variables has about a target variable into redundant, synergistic, and unique components [31-35] (among others). These alternative decompositions generally differ in the quantification of each component and in whether the measures fulfill certain properties or axioms. However, in our work, we do not apply the unique information measure of [29] as part of a decomposition of the joint mutual information. Instead, we show that it provides an alternative data processing inequality that holds for different causal configurations than the standard data processing inequality of conditional mutual information. In this way, the unique information data processing inequality increases the capability to eliminate hidden variables in order to obtain testable groups-decomposition inequalities. Accordingly, the groups-decomposition inequalities we derive can contain unique information terms apart from the standard mutual information and entropy measures that appear when considering the constraints of the Shannon entropic cone [23,24].
We envisage the application of the causally informative tests proposed here in the following way. Given a data set, a hypothesized causal structure is selected to test its compatibility with the data. First, the set of inequality constraints enforced by that causal structure is determined. Second, their fulfillment is evaluated from the data, and the causal structure is discarded if some inequality does not hold. In the first step, the determination of the set of groups-decomposition inequalities enforced by a causal structure requires at different levels the verification of conditional independencies. This is the case, for example, with the conditional independencies that are necessary conditions for the existence of the first type of inequalities introduced by [27]. If all variables involved were observable, this verification could be conducted directly from the data. However, as mentioned above, we here consider groups-decomposition inequalities that may contain hidden variables as part of the collection of variables, which precludes this direct verification. For this reason, we will work under the assumption that statistical independencies can be assessed from the structure of the causal graph, namely with the graphical criterion of separability between nodes in the graph known as d-separation [36]. That is, we will rely on the assumption that graphical separability is a sufficient condition for statistical independence and hence characterize the set of groups-decomposition inequalities enforced by a causal structure without using the data. Data would only be used in the second step, in which the actual fulfillment of the inequalities is evaluated.
This paper is organized as follows. In Section 2, we review previous work relevant for our contributions. In Section 3.1, we formulate the data processing inequality for the unique information measure. In Section 3.2, we generalize the first type of inequalities of [27], formulating for the conditional case more general conditions of independence for which a groups-decomposition inequality exists. We also apply data processing inequalities to derive testable groups-decomposition inequalities when collections include hidden variables. In Section 3.3, we generalize the second type of inequalities of [27] as outlined above. In Section 4, we discuss the connection of this work with other approaches to causal structure learning and point to future continuations and potential applications. The Appendix contains proofs of the results (Appendices A and B) and a discussion of the relations between conditional independencies and d-separations required so that the inequalities derived here are applicable to test causal structures (Appendix C).

Previous Work on Information-Theoretic Measures and Causal Graphs Relevant for Our Derivations
In this section, we review properties of information-theoretic measures and concepts of causal graphs relevant for our work. In Section 2.1, we review basic inequalities of the mutual information, and in Section 2.2 the definition and relevant properties of the unique information measure of [29]. We then review in Section 2.3 Directed Acyclic Graphs (DAGs) and their relation to conditional independence through the graphical criterion of d-separation [36,37]. Finally, we review the inequalities introduced by [27] to test causal structures from information decompositions involving sums of groups of variables (Section 2.4). We do not aim to more broadly review other types of information-theoretic inequalities [23,24] also used for causal inference. The relation with these other types will be considered in the Discussion.

Mutual Information Inequalities Associated with Independencies
We present in Lemma 1 two well-known inequalities that will be used in our derivations. This lemma corresponds to Lemma 1 in [27]. For completeness, we provide the proof of the lemma.

Lemma 1. (i) Given variables A, B, B', and D such that $I(A; B'|B, D) = 0$, it holds that $I(A; B|D) \geq I(A; B'|D)$. (ii) Given variables Y, A, B, and C such that $I(C; A|B) = 0$, it holds that $I(Y; A|B, C) \geq I(Y; A|B)$.

Proof. To prove (i), the chain rule is applied in different orders to $I(A; B, B'|D)$:

$I(A; B, B'|D) = I(A; B|D) + I(A; B'|B, D) = I(A; B'|D) + I(A; B|B', D)$.

Since $I(A; B'|B, D) = 0$ and the mutual information is non-negative, this implies the inequality. To prove (ii), the chain rule is applied in different orders to $I(Y, C; A|B)$:

$I(Y, C; A|B) = I(C; A|B) + I(Y; A|B, C) = I(Y; A|B) + I(C; A|B, Y)$.

Since $I(C; A|B) = 0$ and the mutual information is non-negative, this implies the inequality.
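As a numerical illustration of Lemma 1 (a sketch of ours, not code from [27]; the variable names and the randomly generated mechanisms are assumptions), the following Python snippet builds discrete joint distributions by factorization, so that the required conditional independencies hold exactly, and verifies both inequalities:

import numpy as np

rng = np.random.default_rng(0)

def cmi(p, x, y, z=()):
    """I(X; Y | Z) in bits from a joint array p; x, y, z are tuples of axes."""
    def H(axes):
        drop = tuple(a for a in range(p.ndim) if a not in axes)
        m = np.atleast_1d(p.sum(axis=drop)).ravel()
        m = m[m > 0]
        return float(-(m * np.log2(m)).sum())
    return H(x + z) + H(y + z) - H(x + y + z) - H(z)

# Case (i): axes (A, B, B', D); by construction I(A; B'|B, D) = 0.
p_ab_d = rng.dirichlet(np.ones(4), size=2)        # P(A, B | D)
p_bp_bd = rng.dirichlet(np.ones(2), size=(2, 2))  # P(B' | B, D)
p1 = np.zeros((2, 2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        for bp in (0, 1):
            for d in (0, 1):
                p1[a, b, bp, d] = 0.5 * p_ab_d[d, 2 * a + b] * p_bp_bd[b, d, bp]
print(cmi(p1, (0,), (1,), (3,)) >= cmi(p1, (0,), (2,), (3,)))  # I(A;B|D) >= I(A;B'|D)

# Case (ii): axes (Y, A, B, C); by construction I(C; A|B) = 0.
p_a_b = rng.dirichlet(np.ones(2), size=2)         # P(A | B)
p_c_b = rng.dirichlet(np.ones(2), size=2)         # P(C | B)
p_y = rng.dirichlet(np.ones(2), size=(2, 2, 2))   # P(Y | A, B, C)
p2 = np.zeros((2, 2, 2, 2))
for y in (0, 1):
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                p2[y, a, b, c] = 0.5 * p_a_b[b, a] * p_c_b[b, c] * p_y[a, b, c, y]
print(cmi(p2, (0,), (1,), (2, 3)) >= cmi(p2, (0,), (1,), (2,)))  # I(Y;A|B,C) >= I(Y;A|B)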

Definition and Properties of the Unique Information
The concept of unique information as part of a decomposition of the joint mutual information $I(Y; X)$ that a set of predictor variables $X = \{X_1, \ldots, X_N\}$ has about a target (possibly multivariate) variable $Y$ was introduced in [30]. In the simplest case of two predictors $\{X_1, X_2\}$, this framework decomposes the joint mutual information about $Y$ into four terms, namely the redundancy of $X_1$ and $X_2$, the unique information of $X_1$ with respect to $X_2$, the unique information of $X_2$ with respect to $X_1$, and the synergy between $X_1$ and $X_2$. The predictors share the redundant component, the synergistic one is only obtained by combining the predictors, and the unique components are exclusive to each predictor. Several information measures have been proposed to define this decomposition, aiming to comply with a set of desirable properties which were not all fulfilled by the original proposal [29,31-33]. However, in this work we will not study the whole decomposition but specifically apply the bivariate measure of unique information introduced in [29]. In Section 3.1, we derive a data processing inequality for this measure, and in Section 3.2 we show how it can help to obtain testable groups-decomposition inequalities for causal structures for which the standard data processing inequality of the mutual information would not allow the elimination of the hidden variables. In this section, we review the definition of the unique information measure of [29], provide a straightforward generalization to a conditional unique information measure, and state a monotonicity property that will be used to derive the data processing inequality of the unique information. The unique information of $X_1$ with respect to $X_2$ about $Y$ was defined as

$I(Y; X_1 \backslash\backslash X_2) = \min_{Q \in \Delta_P} I_Q(Y; X_1|X_2)$, (1)

where $\Delta_P$ is defined as the set of distributions on $\{Y, X_1, X_2\}$ that preserve the marginals $P(Y, X_1)$ and $P(Y, X_2)$ of the original distribution $P(Y, X_1, X_2)$. The notation $I_Q$ is used to indicate that the mutual information is calculated on the probability distribution $Q$. We use $I(Y; X_1 \backslash\backslash X_2)$ to refer to the unique information of $X_1$ with respect to $X_2$, compared to $I(Y; X_1|X_2)$, which is the standard conditional information of $X_1$ given $X_2$. We use the notation $X_1 \backslash\backslash X_2$ instead of the notation $X_1 \backslash X_2$ introduced by [29] to differentiate it from the set notation $X_1 \setminus X_2$, which indicates the subset of variables in $X_1$ that is not contained in $X_2$, since we will also be using this set notation. This unique information measure is a maximum entropy measure, since all distributions within $\Delta_P$ preserve the conditional entropy $H(Y|X_2)$, and hence the minimization is equivalent to a maximization of the conditional entropy $H(Y|X_1, X_2)$. The rationale that supports this definition is that the unique information of $X_1$ with respect to $X_2$ about $Y$ has to be determined by the marginal probabilities $P(Y, X_1)$ and $P(Y, X_2)$, and cannot depend on any additional structure in the joint distribution that contributes to the dependence between $\{X_1, X_2\}$ and $Y$ [29]. This additional contribution is removed by minimizing within $\Delta_P$.
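To make the definition operational, the following sketch (our own illustration, not code from [29]; the toy distribution and the use of a generic SLSQP optimizer are assumptions, and dedicated solvers exist for this convex problem) computes Equation (1) for binary variables by minimizing $I_Q(Y; X_1|X_2)$ over $\Delta_P$:

import numpy as np
from scipy.optimize import minimize

def cond_mi_bits(q):
    """I_Q(Y; X1 | X2) in bits for a joint array q with axes (y, x1, x2)."""
    eps = 1e-12
    q = np.clip(q, eps, None)
    q /= q.sum()
    q_x2 = q.sum(axis=(0, 1))   # Q(x2)
    q_yx2 = q.sum(axis=1)       # Q(y, x2)
    q_x1x2 = q.sum(axis=0)      # Q(x1, x2)
    ratio = q * q_x2[None, None, :] / (q_yx2[:, None, :] * q_x1x2[None, :, :])
    return float(np.sum(q * np.log2(ratio)))

def unique_information(p):
    """min over Delta_P of I_Q(Y; X1|X2); Delta_P fixes P(Y,X1) and P(Y,X2)."""
    p_yx1, p_yx2 = p.sum(axis=2), p.sum(axis=1)
    cons = [
        {"type": "eq", "fun": lambda v: v.reshape(2, 2, 2).sum(axis=2).ravel() - p_yx1.ravel()},
        {"type": "eq", "fun": lambda v: v.reshape(2, 2, 2).sum(axis=1).ravel() - p_yx2.ravel()},
    ]
    res = minimize(lambda v: cond_mi_bits(v.reshape(2, 2, 2)), p.ravel(),
                   method="SLSQP", bounds=[(0.0, 1.0)] * 8, constraints=cons)
    return res.fun

# Example: Y is a copy of X1, with X2 an independent fair coin. Every Q in
# Delta_P then keeps Y = X1 and Y independent of X2, so the unique
# information should come out close to 1 bit.
p = np.zeros((2, 2, 2))
for y in (0, 1):
    for x2 in (0, 1):
        p[y, y, x2] = 0.25
print(round(unique_information(p), 3))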
In a straightforward generalization, we define the conditional unique information given another set of variables $Z$ as

$I(Y; X_1 \backslash\backslash X_2|Z) = \min_{Q \in \Delta_{P'}} I_Q(Y; X_1|X_2, Z)$, (2)

where $\Delta_{P'}$ is the set of distributions on $\{Y, X_1, X_2, Z\}$ that preserve the marginals $P(Y, X_1, Z)$ and $P(Y, X_2, Z)$ of the original $P(Y, X_1, X_2, Z)$. By construction [29], the conditional unique information is bounded as

$I(Y; X_1 \backslash\backslash X_2|Z) \leq \min\{I(Y; X_1|Z), I(Y; X_1|X_2, Z)\}$. (3)

This is consistent with the intuition of the decomposition that the unique information is a component exclusive to $X_1$. In Lemma 2, we present a type of monotonicity fulfilled by the conditional unique information. This result is a straightforward extension to the conditional case of the one stated in Lemma 3 of [38]. We include the full proof because it will be useful in the Results section to prove a related data processing inequality for the unique information. To better suit our subsequent use of notation, we consider the two predictors to be $Z_1$ and $\{X_1, X_1'\}$, and the conditioning set to be $Z_2$.
Lemma 2. The maximum entropy conditional unique information is monotonic in its second argument, corresponding to the non-conditioning predictor, as follows:

$I(Y; X_1, X_1' \backslash\backslash Z_1|Z_2) \geq I(Y; X_1 \backslash\backslash Z_1|Z_2)$.

Proof. Consider the distribution $P_{1,1'} = P(Y, X_1, X_1', Z_1, Z_2)$ and its marginal $P_1 = P(Y, X_1, Z_1, Z_2)$. Consider any distribution $Q_{1,1'} \in \Delta_{P_{1,1'}}$ and its marginal $Q_1$ on $(Y, X_1, Z_1, Z_2)$. Then $Q_1 \in \Delta_{P_1}$. By monotonicity of the mutual information, $I_{Q_{1,1'}}(Y; X_1, X_1'|Z_1, Z_2) \geq I_{Q_{1,1'}}(Y; X_1|Z_1, Z_2)$, and since the latter does not contain $X_1'$ as an argument, it is equal to the information calculated on its marginal, $I_{Q_1}(Y; X_1|Z_1, Z_2)$. Since this holds for any distribution in $\Delta_{P_{1,1'}}$, it holds in particular for the distribution $Q^*_{1,1'}$ that minimizes $I(Y; X_1, X_1'|Z_1, Z_2)$ in $\Delta_{P_{1,1'}}$. Since the corresponding marginal $Q^*_1$ belongs to $\Delta_{P_1}$, the minimum of $I(Y; X_1|Z_1, Z_2)$ in $\Delta_{P_1}$ is equal to or smaller than $I_{Q^*_1}(Y; X_1|Z_1, Z_2)$, and hence equal to or smaller than $I_{Q^*_{1,1'}}(Y; X_1, X_1'|Z_1, Z_2)$.

Causal Graphs and Conditional Independencies
We here review basic notions of Directed Acyclic Graphs (DAGs) and the relation between causal structures and dependencies. Consider a set of random variables $V = \{V_1, \ldots, V_n\}$. A DAG $G = (V; E)$ consists of nodes $V$ and edges $E$ between the nodes. The graph contains $V_i \to V_j$ for each $(V_i; V_j) \in E$. We refer to $V$ as both a variable and its corresponding node.
Causal influences can be represented in acyclic graphs given that causal mechanisms are not instantaneous and causal loops can be spanned using separate time-indexed variables. A path in $G$ is a sequence of (at least two) distinct nodes $V_1, \ldots, V_m$, such that there is an edge between $V_k$ and $V_{k+1}$ for all $k = 1, \ldots, m - 1$. If all edges are directed as $V_k \to V_{k+1}$, the path is a causal or directed path. A node $V_i$ is a collider in a path if it has incoming arrows $V_{i-1} \to V_i \leftarrow V_{i+1}$, and is a noncollider otherwise. A node $V_i$ is called a parent of $V_j$ if there is an arrow $V_i \to V_j$. The set of parents is denoted $Pa_{V_j}$. A node $V_i$ is called an ancestor of $V_j$ if there is a directed path from $V_i$ to $V_j$. Conversely, in this case $V_j$ is a descendant of $V_i$. For convenience, we define the set of ancestors $an_G(V_i)$ as including $V_i$ itself, and the set of descendants $D_G(V_i)$ as also containing $V_i$ itself.
The link between generative mechanisms and causal graphs relies on the fact that in the graph a variable $V_i$ is a parent of another variable $V_j$ if and only if it is an argument of an underlying functional equation that captures the mechanisms that generate $V_j$; that is, an argument of $V_j := f_{V_j}(Pa_{V_j}, \varepsilon_{V_j})$, where $\varepsilon_{V_j}$ captures additional sources of stochasticity exogenous to the system. If a DAG constitutes an accurate representation of the causal mechanisms, an isomorphic relation exists between the conditional independencies that hold between variables in the system and a graphical criterion of separability between the nodes, called d-separation [36]. Two nodes $X$ and $Y$ are d-separated given a set of nodes $S$ if and only if no S-active paths exist between $X$ and $Y$. A path is active given the conditioning set $S$ (S-active) if no noncollider in the path belongs to $S$ and every collider in the path either is in $S$ or has a descendant in $S$. A causal structure $G$ and a generated probability distribution $p(V)$ are faithful [1,2] to one another when a conditional independence between $X$ and $Y$ given $S$, denoted by $X \perp_P Y|S$, holds if and only if there is no S-active path between them; that is, if $X$ and $Y$ are d-separated given $S$, denoted by $X \perp_G Y|S$. Accordingly, faithfulness is assumed in the algorithms of causal inference [1,2] that examine conditional independencies to characterize the Markov equivalence class of causal structures that share a common set of independencies. A well-known example of a system that is unfaithful to its causal structure is the exclusive-OR (X-OR) logic gate, whose output is independent of the two inputs separately but dependent on them jointly.
In contrast to the algorithms that infer Markov equivalence classes, we will show that the applicability of the groups-decomposition inequalities studied here relies on the assumption that d-separability is a sufficient condition for conditional independence. That is, instead of an if and only if relation between d-separability and conditional independence, as required in the faithfulness assumption, it is enough to assume that d-separability implies conditional independence. As we further discuss in Appendix C, this is a substantially weaker assumption, since usually faithfulness is violated due to the presence of independencies that are incompatible with the causal structure. This is the case, for example, of the X-OR logic gate, for which faithfulness is violated because the inputs are separately independent of the output despite each having an arrow towards the output in the corresponding causal graph. Conversely, the X-OR gate complies with d-separability being a sufficient condition for independence, since in the graph only the input nodes are d-separated and the corresponding input variables of the X-OR gate are independent. Despite only requiring that d-separability implies independence, to simplify the presentation of our results in the main text we will assume faithfulness and indistinctly use $X \perp Y|S$ to indicate statistical independence and graphical separability, instead of distinguishing between $X \perp_P Y|S$ and $X \perp_G Y|S$. In Appendix C, we will more closely examine how in the proofs of our results the sufficient condition of d-separability for conditional independencies is enough. An important implication of independencies following from d-separability is that, if variables $\{X_1, X_2\}$ are separately independent from $Y$, namely $Y \perp X_1$ and $Y \perp X_2$, because of the lack of any connection between node $Y$ and both nodes $X_1$ and $X_2$, then $\{X_1, X_2\}$ cannot be jointly dependent on $Y$; namely, $Y \not\perp \{X_1, X_2\}$ cannot occur. This is because d-separability between node $Y$ and the set of nodes $\{X_1, X_2\}$ is determined by separately considering the lack of active paths between $Y$ and each node $X_1$ and $X_2$. Since the set of paths between $Y$ and $\{X_1, X_2\}$ is the union of the paths between $Y$ and both $X_1$ and $X_2$, considering $\{X_1, X_2\}$ jointly does not add new paths that could create a dependence of $Y$ with $\{X_1, X_2\}$. A dependence can only be created by conditioning on some other variable, which could activate additional paths by activating a collider.
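As an algorithmic illustration (a sketch of ours; the graph is an assumed structure consistent with the description of Figure 1A), d-separation can be checked with the classic moralization criterion: $X$ and $Y$ are d-separated given $S$ exactly when they are disconnected after taking the ancestral subgraph of $X \cup Y \cup S$, moralizing it, and deleting $S$:

import networkx as nx

def d_separated(g, xs, ys, s):
    """Moralization test of d-separation between node sets xs and ys given s."""
    nodes = set(xs) | set(ys) | set(s)
    anc = set(nodes)
    for v in nodes:                      # ancestral subgraph of xs, ys, s
        anc |= nx.ancestors(g, v)
    sub = g.subgraph(anc)
    moral = nx.Graph(sub.to_undirected())
    for v in sub.nodes:                  # marry parents of every common child
        parents = list(sub.predecessors(v))
        for i in range(len(parents)):
            for j in range(i + 1, len(parents)):
                moral.add_edge(parents[i], parents[j])
    moral.remove_nodes_from(s)           # condition on s
    return not any(nx.has_path(moral, x, y) for x in xs for y in ys
                   if x in moral and y in moral)

# Assumed structure in the spirit of Figure 1A: S -> V1, S -> V2, V1 -> V3 <- V2.
g = nx.DiGraph([("S", "V1"), ("S", "V2"), ("V1", "V3"), ("V2", "V3")])
print(d_separated(g, {"V1"}, {"V2"}, {"S"}))        # True: blocked noncollider S
print(d_separated(g, {"V1"}, {"V2"}, {"S", "V3"}))  # False: conditioning opens collider V3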

Inequalities for Sums of Information Terms from Groups of Variables
We now review two results in [27] that are at the foundation of our results. The first corresponds to their Proposition 1. We provide a slightly more general formulation that is useful for subsequent extensions.

Proposition 1. (Decomposition of information from groups with conditionally independent non-shared components): Consider a collection of groups $A_{[n]} = \{A_1, \ldots, A_n\}$, where each group $A_i$ consists of a subset of observable variables $A_i \subset O$, with $O$ the set of all observable variables. For every $A_i \in A_{[n]}$, define $d_i$ as the maximal value such that $A_i$ has a non-empty joint intersection with $d_i - 1$ other distinct groups out of $A_{[n]}$. Consider a conditioning set $Z$ and target variables $Y$. If each group is conditionally independent given $Z$ of the non-shared variables in each other group (i.e., $A_i \perp A_j \setminus A_i|Z$, $\forall i, j$), then the conditional information that $A_{[n]}$ has about the target variables $Y$ given $Z$ is bounded from below by

$I(Y; A_{[n]}|Z) \geq \sum_{i=1}^{n} \frac{1}{d_i} I(Y; A_i|Z)$.

Proof. The proof is presented in Appendix A. It is a generalization to the conditional case of the proof of Proposition 1 in [27], and a slight generalization that allows dependencies to exist between variables shared by two groups as long as dependencies with non-shared variables do not exist.
An illustration of Proposition 1 for the unconditional case is presented in Figure 3 of [27], together with further discussion. In Section 3.2, we will provide further illustrations for the extensions of Proposition 1 that we introduce. We will use $d = \{d_1, \ldots, d_n\}$ to indicate the maximal values for all groups. We will add a subindex, $d_{A_{[n]}}$, to specify the collection if different collections are compared. A trivial refinement of Proposition 1 would consider $I(Y; A_{[n]} \setminus Z|Z)$ and, for each group, $I(Y; A_i \setminus Z|Z)$. This may lead to a tighter lower bound by decreasing some values in $d$ if some intersections between groups occur in $Z$. We do not present this refinement in order to simplify the presentation.
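As a concrete illustration of how the coefficients are obtained (a sketch of ours; the group compositions are hypothetical and mimic the pairwise hidden common causes of Figure 1C):

from itertools import combinations

def group_coefficients(groups):
    """d_i: maximal number of groups, including A_i itself, sharing a common element."""
    ds = []
    for i, g in enumerate(groups):
        others = [h for j, h in enumerate(groups) if j != i]
        d_i = 1
        for r in range(len(others), 0, -1):   # try the largest joint intersections first
            if any(g.intersection(*combo) for combo in combinations(others, r)):
                d_i = r + 1
                break
        ds.append(d_i)
    return ds

# Three groups that pairwise share a hidden common cause but have no triple overlap:
groups = [{"V1", "U12", "U13"}, {"V2", "U12", "U23"}, {"V3", "U13", "U23"}]
print(group_coefficients(groups))  # [2, 2, 2]: each information term is weighted by 1/2

With these coefficients, Proposition 1 yields $I(Y; A_{[3]}|Z) \geq \frac{1}{2} \sum_i I(Y; A_i|Z)$ whenever the required independencies hold.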
The second result from [27] that we will be relying on is their Theorem 1. We present a version that is slightly reduced and modified, which is more convenient in order to relate to our own results.

Theorem 1. (Decomposition of information in ancestral groups): Let $G$ be a DAG model that includes nodes corresponding to the variables in a collection of groups $A_{[n]} = \{A_1, \ldots, A_n\}$, which is a subset of all observable variables $O$. Let $an_G(A_{[n]}) = \{an_G(A_1), \ldots, an_G(A_n)\}$ be the collection of ancestors of the groups, as determined by $G$. For every ancestral set of a group, $an_G(A_i)$, let $d_i(G)$ be maximal such that there is a non-empty joint intersection of $an_G(A_i)$ and other $d_i(G) - 1$ distinct ancestral sets out of $an_G(A_{[n]})$. Let $Y$ be a set of target variables. Then the information of $an_G(A_{[n]})$ about $Y$ is bounded as

$H(Y) \geq I(Y; an_G(A_{[n]})) \geq \sum_{i=1}^{n} \frac{1}{d_i(G)} I(Y; an_G(A_i))$.
Proof. The original proof can be found in [27].
In contrast to Proposition 1, a generalization to the conditional mutual information is not trivial and will be developed in Section 3.3. We will also propose additional generalizations regarding which graph to use to construct the ancestral sets and conditions to exclude some ancestors from the groups. In their work, [27] conceptualized $Y$ as corresponding to leaf nodes in the graph, for example providing some noisy measurement of $A_{[n]}$, with $Y = A_{[n]}$ being the case of perfect measurement. While this conceptualization guided their presentation, their results were general, and here we will not assume any concrete causal relation between $Y$ and $A_{[n]}$. We have slightly modified the presentation of Theorem 1 from [27] to add the upper bound and to remove some additional subcases with extra assumptions presented in their work. The upper bound is the standard upper bound of mutual information by entropy [28]. In the Results, we will also be interested in cases in which $an_G(A_{[n]})$ contains hidden variables, so that $I(Y; an_G(A_{[n]}))$ cannot be estimated. Given the monotonicity of mutual information, the terms from each ancestral group can be lower bounded by the information in the observable variables within each group, and $H(Y)$ is used as a testable upper bound.
There are two main differences between Proposition 1 and Theorem 1. First, Theorem 1 does not impose conditions of independence for the inequality to hold. Second, while the value $d_i$ of each group $A_i$ is determined in Proposition 1 by the overlap between groups, with no influence of the causal structure relating the variables, in Theorem 1 the value $d_i(G)$ depends on the causal structure, since it is determined from the intersections between ancestral sets. Despite these differences, given the relation between causal structure and independencies reviewed in Section 2.3, both types of inequalities can have causal inference power to test the compatibility of certain causal structures with data.
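The same coefficient computation applies to Theorem 1 once the groups are replaced by their ancestral sets; a self-contained sketch of ours follows (the graph and groups are assumed for illustration):

import networkx as nx
from itertools import combinations

def ancestral_set(g, group):
    """an_G of a group, including the group itself."""
    out = set(group)
    for v in group:
        out |= nx.ancestors(g, v)
    return out

def coefficients(sets):
    """d_i(G): maximal number of ancestral sets, including the i-th, with a common element."""
    ds = []
    for i, s in enumerate(sets):
        others = [t for j, t in enumerate(sets) if j != i]
        d_i = 1
        for r in range(len(others), 0, -1):
            if any(s.intersection(*c) for c in combinations(others, r)):
                d_i = r + 1
                break
        ds.append(d_i)
    return ds

g = nx.DiGraph([("U", "V1"), ("U", "V2"), ("V1", "V3")])
anc = [ancestral_set(g, a) for a in [{"V1"}, {"V2"}, {"V3"}]]
print(anc)                # e.g., [{'V1', 'U'}, {'V2', 'U'}, {'V3', 'V1', 'U'}]
print(coefficients(anc))  # [3, 3, 3]: all three ancestral sets intersect in U

Here, Theorem 1 gives $H(Y) \geq \frac{1}{3} \sum_i I(Y; an_G(A_i))$, without requiring any independence condition.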

Results
In Section 3.1, we introduce a data processing inequality for the conditional unique information measure of [29]. In Section 3.2, we develop new information inequalities involving groups of variables and examine how data processing inequalities can help to derive testable inequalities in the presence of hidden variables. In Section 3.3, we develop new information inequalities involving ancestral sets. The application of these inequalities for causal structure learning is discussed. As justified in the proofs of our results (Appendices A and B) and further discussed in Appendix C, our derivations of groups-decomposition inequalities only rely on the assumption that d-separability implies conditional independence. No further assumptions are used in our work; in particular, our application of the unique information measure of [29] does not require any assumption regarding the precise distribution of the joint mutual information among the redundant, unique, and synergistic components.

Data Processing Inequality for the Conditional Unique Information

Proposition 2. (Data processing inequality of the conditional unique information): Consider variables A, B, B', D, and E such that $I(A; B'|B, E) = 0$. Then

$I(A; B \backslash\backslash D|E) = I(A; B, B' \backslash\backslash D|E) \geq I(A; B' \backslash\backslash D|E)$.

Proof. Let $P_{BB'} = P(A, B, B', D, E)$ be the original distribution of the variables and define $\Delta_{P_{BB'}}$ as the set of distributions on $\{A, B, B', D, E\}$ that preserve the two marginals $P(A, B, B', E)$ and $P(A, D, E)$. Let $P_B = P(A, B, D, E)$ be the marginal of $P_{BB'}$ and $\Delta_{P_B}$ be the set of distributions that preserve the marginals $P(A, B, E)$ and $P(A, D, E)$. By the definition of unique information (Equation (2)), for any $Q_{BB'} \in \Delta_{P_{BB'}}$,

$I_{Q_{BB'}}(A; B, B'|D, E) \overset{(a)}{=} I_{Q_{BB'}}(A; B|D, E) + I_{Q_{BB'}}(A; B'|B, D, E) \overset{(b)}{=} I_{Q_B}(A; B|D, E) + I_{Q_{BB'}}(A; B'|B, D, E)$. (4)

Equality (a) follows from the chain rule of mutual information. Equality (b) holds because $I_{Q_{BB'}}(A; B|D, E)$ does not depend on $B'$ and can be calculated with the marginal $Q_B$, marginalizing over $B'$. Since $I(A; B'|B, E) = 0$, the marginal $P(A, B, B', E)$ factorizes as $P(B'|B, E)P(A, B, E)$. For any distribution $\tilde{Q}_B \in \Delta_{P_B}$, which preserves $P(A, D, E)$ and $P(A, B, E)$, a distribution can be constructed as $\tilde{Q}_{BB'} = P(B'|B, E)\tilde{Q}_B$, such that $\tilde{Q}_{BB'} \in \Delta_{P_{BB'}}$, since $\tilde{Q}_{BB'}$ continues to preserve $P(A, D, E)$ and $P(A, B, B', E)$ is preserved by construction. Also by construction, $I_{\tilde{Q}_{BB'}}(A; B'|B, D, E) = 0$ for any $\tilde{Q}_{BB'}$ created from any $\tilde{Q}_B \in \Delta_{P_B}$. In particular, this holds for the distribution $\tilde{Q}^*_{BB'}$ constructed from the $\tilde{Q}^*_B$ that minimizes $I_{\tilde{Q}_B}(A; B|D, E)$, which determines $I(A; B \backslash\backslash D|E)$. The distribution $\tilde{Q}^*_{BB'}$ minimizes the first term in the r.h.s. of Equation (4) and, given the non-negativity of mutual information, it also minimizes the second term, hence providing the minimum in $\Delta_{P_{BB'}}$. Accordingly, $I(A; B, B' \backslash\backslash D|E) = I(A; B \backslash\backslash D|E)$. The monotonicity of the unique information on the non-conditioning predictor (Lemma 2) leads to $I(A; B, B' \backslash\backslash D|E) \geq I(A; B' \backslash\backslash D|E)$.
A related data processing inequality has already been previously derived for the unconditional unique information in the case of $I(A, D; B'|B) = 0$, with $E = \emptyset$ [39]. Differently, Proposition 2 formulates a data processing inequality for the case $I(A; B'|B, E) = 0$. When $E = \emptyset$, Proposition 2 states a weaker requirement for the existence of an inequality, given the decomposition axiom of the mutual information [27]. As we will now see in Section 3.2, Proposition 2 will allow us to apply the unique information data processing inequality in cases in which $I(A; B'|B, E) = 0$. In particular, $I(A; B, B' \backslash\backslash D|E) \geq I(A; B' \backslash\backslash D|E)$ allows us to obtain a lower bound when $B$ contains hidden variables that we want to eliminate in order to have a testable groups-decomposition inequality. In contrast, the application of the standard data processing inequality of the mutual information, $I(A; B, B'|D, E) \geq I(A; B'|D, E)$, requires $I(A; B'|B, D, E) = 0$, and hence the two types of data processing inequalities may be applicable in different cases to eliminate $B$. This will be fully appreciated in Propositions 5 and 6. Note that this application of the unique information measure of Equation (2) to eliminate hidden variables is not constrained by the role of the measure in the mutual information decomposition, nor by considerations about which alternative decompositions optimally quantify the different components [30,35].
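To summarize the contrast (restating Lemma 1(i), with the conditioning set extended by $E$, next to Proposition 2):

$I(A; B'|B, D, E) = 0 \;\Rightarrow\; I(A; B|D, E) \geq I(A; B'|D, E)$,
$I(A; B'|B, E) = 0 \;\Rightarrow\; I(A; B \backslash\backslash D|E) \geq I(A; B' \backslash\backslash D|E)$.

The condition for the unique information inequality does not involve $D$, which is what allows eliminating hidden variables in configurations where the required independence only holds without conditioning on $D$.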

Inequalities Involving Sums of Information Terms from Groups
In this section, we extend Proposition 1 in several ways. Propositions 3-6 present subsequent generalizations, all subsumed by Proposition 6. We present these generalizations progressively to better appreciate the new elements. For these propositions, examples are displayed in Figures 2 and 3 and explained in the text after the enunciation of each proposition. Which proposition is illustrated by each example is indicated in the figure caption and in the main text. The objective of these generalizations is twofold: first, to derive new testable inequalities for causal structures not producing a testable inequality from Proposition 1; second, to find inequalities with higher inferential power, even when some already exist. These objectives are achieved by introducing inequalities with less stringent requirements of conditional independence and using data processing inequalities to substitute certain variables from $A_{[n]}$, so that the conditions of independence are fulfilled or the number of intersections is reduced and lower values in $d$ are obtained. The first extension relaxes the conditions $A_i \perp A_j \setminus A_i|Z$ $\forall i, j$ required in Proposition 1:

Proposition 3. (Weaker conditions of independence through group augmentation for a decomposition of information from groups with conditionally independent non-shared components): Consider a collection of groups $A_{[n]}$, a conditioning set $Z$, and target variables $Y$ as in Proposition 1. Consider that for each group $A_i$ a group $B_i$ exists, such that $A_i \subseteq B_i$ and $B_i$ can be partitioned into two disjoint subsets $B_i = \{B_i^{(1)}, B_i^{(2)}\}$, such that $B_i^{(2)} \perp B_j \setminus B_i|Z, B_i^{(1)}$ for all $i, j$. Then

$I(Y; B_{[n]}|Z) \geq \sum_{i=1}^{n} \frac{1}{d_i} I(Y; B_i|Z)$,

with the values $d_{B_{[n]}}$ determined by the intersections of the groups in $B_{[n]}$, as in Proposition 1.

Proof. The proof is provided in Appendix A.
The contribution of Proposition 3 is to relax the conditional independence requirements of Proposition 1: the independencies from the non-shared variables of other groups are only required for the components $B_i^{(2)}$, conditioning on $Z$ together with $B_i^{(1)}$. This means that the variables in $B_i^{(1)}$ can contribute to create the conditional independencies required for $B_i^{(2)}$. Another difference between Propositions 1 and 3 regards the role of hidden variables. Assume that each $A_i$ is formed by $\{V_i, U_i\}$, where $U_i$ are hidden variables and $V_i$ observable variables. In Proposition 1, the requirement that the variables are observable is not fundamental and could be removed. However, to obtain a testable inequality, monotonicity of mutual information would need to be applied to reduce each term $I(Y; A_i|Z)$ to its estimable lower bound $I(Y; V_i|Z)$ that does not contain the hidden variables $U_i$. On the other hand, the fulfillment of $A_i \perp A_j \setminus A_i|Z$ implies $V_i \perp V_j \setminus V_i|Z$, and reducing $A_i$ to $V_i$ can only decrease the number of intersections, hence the $d_{V_{[n]}}$ values are equal to or smaller than $d_{A_{[n]}}$. Therefore, with Proposition 1, there is no advantage in including hidden variables. When testing Proposition 1 for a hypothesis of the underlying causal structure (and related independencies), it is equally or more powerful to use $V_{[n]}$ than $A_{[n]}$. This changes in Proposition 3, since $B_i^{(1)}$ appears in the conditioning side of the independencies that constrain $B_i^{(2)}$. If hidden variables within $B_i^{(1)}$ are necessary to create the independencies for $B_i^{(2)}$, it is not possible to reduce each group to its subset of observable variables. Note that, for a hypothesized causal structure, whether the independence conditions required by Proposition 3 are fulfilled can be verified without observing the hidden variables by using the d-separation criterion on the causal graph, assuming d-separation implies independence. The actual estimation of mutual information values is only needed when testing an inequality from the data.
If $B_{[n]}$ includes hidden variables, in general $I(Y; B_{[n]}|Z)$ cannot be estimated and $H(Y|Z)$ is used as an upper bound. For the r.h.s. of the inequality, a lower bound is obtained by monotonicity of the mutual information, removing the hidden variables. In general, a testable inequality has the form

$H(Y|Z) \geq \sum_{i=1}^{n} \frac{1}{d_i} I(Y; V_i|Z)$, (5)

with $V_i \subseteq B_i$ being the observable variables within each group. In the case that $I(Y; B_{[n]}|Z) = I(Y; V_{[n]}|Z)$, that is, if the hidden variables do not add information, then a tighter testable upper bound is available using $I(Y; V_{[n]}|Z)$. Importantly, the values $d_{B_{[n]}}$ are determined using the groups in $B_{[n]}$. Since $A_i \subseteq B_i$, group augmentation comes at the price that $d_{B_{[n]}}$ are equal to or higher than $d_{A_{[n]}}$, but the conditional independence requirements may not be fulfilled without it. Note also that the partition $B_i = \{B_i^{(1)}, B_i^{(2)}\}$ is not known a priori, but determined in the process of finding suitable augmented groups that fulfill the conditions.
We examine some examples before further generalizations. Throughout all figures, we will read independencies from the causal structures using d-separation, assuming faithfulness. In Figure 2A, consider groups $A_1 = \{V_1, V_2\}$ and $A_2 = \{V_3, V_4\}$, and $Z = \emptyset$. Proposition 1 is not applicable due to $V_2 \not\perp V_3$. Augmenting the groups to $B_1 = \{V_1, V_2, U\}$ and $B_2 = \{V_3, V_4, U\}$, with $B_i^{(1)} = \{U\}$, the conditions of Proposition 3 are fulfilled, as can be verified by d-separation. Coefficients are determined by $d = \{2, 2\}$ due to the intersection of the groups in $U$. Note that hidden variables are not restricted to be hidden common ancestors, and here $U$ is a mediator between $V_2$ and $V_3$. In Figure 2B, consider groups $A_1 = \{V_1\}$, $A_2 = \{V_3\}$, $A_3 = \{V_5\}$, which do not fulfill the conditions of Proposition 1. Augmenting the groups with the hidden variables that mediate their dependencies, the conditions of Proposition 3 are fulfilled. In both examples, the upper bound is $H(Y)$, since $I(Y; B_{[n]})$ cannot be estimated due to the hidden variables.

Figure 2. For all examples, the composition of groups is described in the main text. For graphs using subindexes $i, k$ to display two concrete groups, those are representative of the same causal structure for all groups that compose the system. In those graphs, variables with no subindex have the same connectivity with all groups. Bidirectional arrows indicate common hidden parents not included in any group.
We also consider scenarios with more groups. Figure 2C represents $2N$ groups organized in pairs, with subindexes $i, k$ indicating two particular pairs. The $2N$ groups are defined in pairs, with $A_{1j} = \{V_{1j}\}$ and $A_{2j} = \{V_{2j}\}$, $j = 1, \ldots, N$. The causal structure is the same across pairs, but the mechanisms generating the variables beyond the causal structure can possibly differ. Proposition 1 is not fulfilled since $V_{1j} \not\perp V_{2j}$. Groups can be augmented with $B_{j'j}^{(1)} = \{U_{1j}, U_{2j}\}$ and $B_{j'j}^{(2)} = \{V_{j'j}\}$, for $j' = 1, 2$. Proposition 3 then holds with $d_i = 2$ for all $2N$ groups. The pairs of groups contribute to the sum as $\frac{1}{2}[I(Y; V_{1j}, U_{1j}, U_{2j}) + I(Y; V_{2j}, U_{1j}, U_{2j})]$, which in the testable inequality of the form of Equation (5) reduces to $\frac{1}{2}[I(Y; V_{1j}) + I(Y; V_{2j})]$. The upper bound to the sum of $2N$ terms is $H(Y)$. This inequality provides causal inference power because $V_{1j} \perp V_{2j}|U_{1j}, U_{2j}$ for all $j$ is not directly testable. As previously indicated, the inference power of an inequality emanates from the possibility to discard causal structures that do not fulfill it. Note that for this system an alternative is to define $N$ groups instead of $2N$ groups, each as $A'_j = \{V_{1j}, V_{2j}\}$. In this case, Proposition 1 is already applicable with the coefficients all being 1, since $V_{1i}, V_{2i} \perp V_{1j}, V_{2j}$ for all $i \neq j$. For this inequality, each of the $N$ groups contributes with $I(Y; V_{1j}) + I(Y; V_{2j}|V_{1j})$, and since there are no hidden variables the l.h.s. is $I(Y; A'_{[N]})$. However, this latter inequality holds for any causal structure that fulfills $V_{1i}, V_{2i} \perp V_{1j}, V_{2j}$ for all $i \neq j$. Given that these independencies do not involve hidden variables, they are directly testable from data, so that the latter inequality does not provide additional inference power, in contrast to the former one.
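In the notation used here, the two alternatives discussed for Figure 2C read

$H(Y) \geq \frac{1}{2} \sum_{j=1}^{N} \left[ I(Y; V_{1j}) + I(Y; V_{2j}) \right]$ ($2N$ groups, via Proposition 3),
$I(Y; A'_{[N]}) \geq \sum_{j=1}^{N} \left[ I(Y; V_{1j}) + I(Y; V_{2j}|V_{1j}) \right]$ ($N$ groups, via Proposition 1),

and only the former encodes the non-testable independencies $V_{1j} \perp V_{2j}|U_{1j}, U_{2j}$, hence providing additional causal inference power.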
We now continue with further generalizations. Group augmentation in Proposition 3 cannot decrease the values of the maximal number of intersections. We now describe how the data processing inequalities in Lemma 1(i) and Proposition 2 can be used to substitute variables within the groups, potentially reducing the number of intersections. We start with the data processing inequality for the conditional mutual information.

Proposition 4. (Decomposition of information from groups modified with the conditional mutual information data processing inequality): Consider a collection of groups $A_{[n]}$, a conditioning set $Z$, and target variables $Y$ as in Proposition 1. Consider that for some group $A_i$ a group $B_i$ exists such that $Y \perp A_i \setminus B_i|B_i, Z$, with $A_i \setminus B_i \neq \emptyset$. Define $B_{[n]}$ as the collection of groups that replaces $A_i$ by $B_i$ for those groups fulfilling the previous independence condition. If $B_{[n]}$ fulfills the conditions of Proposition 3, the inequality derived for $B_{[n]}$ also provides an upper bound for the sum of the information provided by the groups in $A_{[n]}$:

$I(Y; B_{[n]}|Z) \geq \sum_{i=1}^{n} \frac{1}{d_i} I(Y; A_i|Z)$,

with the values $d_{B_{[n]}}$ determined by the intersections of the groups in $B_{[n]}$.

Proof. The proof applies Proposition 3 to $B_{[n]}$, followed by the data processing inequality of Lemma 1(i) for each term within the sum in which $A_i$ and $B_i$ are different. Given that $Y \perp A_i \setminus B_i|B_i, Z$ implies $I(Y; B_i|Z) \geq I(Y; A_i|Z)$, their sum is also equal to or smaller.
Proposition 3 envisaged cases in which the conditions of independence of Proposition 1 were not fulfilled for a collection $A_{[n]}$ and augmentation allowed fulfilling weaker conditions, even if with higher $d_{B_{[n]}}$ values compared to $d_{A_{[n]}}$. Proposition 4 is useful not only when the conditions of independence are not fulfilled for $A_{[n]}$, but more generally if some values in $d_{B_{[n]}}$ are lower than in $d_{A_{[n]}}$, hence providing a tighter inequality. Including hidden variables in $B_{[n]}$ is beneficial when replacing observed by hidden variables leads to fewer intersections. The procedures of Propositions 3 and 4 can be combined; that is, starting with $A_{[n]}$ containing only observable variables, a new collection can be constructed adding new variables and removing others from $A_{[n]}$, ending with $B_{[n]}$ containing both observable and hidden variables. The collection $B_{[n]}$ fulfilling the conditions of Proposition 3 may even contain only hidden variables, and a testable inequality is obtained as long as the data processing inequality allows calculating observable lower bounds for all terms in the sum.
Figure 2D-F shows examples of Proposition 4. Again, we consider cases with $N$ groups with equal causal structure and use indexes $i, k$ to represent two concrete groups. In Figure 2D, with $A_j = \{V_j\}$, Proposition 3 does not apply for $A_{[n]}$ conditioning on $\{Z_1, Z_2\}$ because $V_i \not\perp V_j|Z_1, Z_2$ for all $i, j$. However, given that $Y \perp V_j|U_j, Z_1, Z_2$, each $V_j$ can be replaced to build $B_j = \{U_j\}$, and since $U_i \perp U_j|Z_1, Z_2$ for all $i, j$, Proposition 3 applies after using Proposition 4 to create $B_{[n]}$. A testable inequality is derived with upper bound $H(Y|Z_1, Z_2)$ and a sum of terms $I(Y; V_j|Z_1, Z_2)$, each being a lower bound of $I(Y; U_j|Z_1, Z_2)$ given the data processing inequality that follows from $Y \perp V_j|U_j, Z_1, Z_2$. The coefficients are $d_{B_{[n]}} = 1$. Therefore, in this case Proposition 4 results in an inequality when no inequality held for $A_{[n]}$. In Figure 2E, the same procedure relies on $Y \perp V_j|U_j, Z_1, Z_2$ and $U_i \perp U_j|Z_1, Z_2$ to use $B_j = \{U_j\}$ to create a testable inequality with l.h.s. $H(Y|Z_1, Z_2)$ and the sum of terms $I(Y; V_j|Z_1, Z_2)$ in the r.h.s., with $d_{B_{[n]}} = 1$. Note that by $U$, which has no subindex, we represent in Figure 2E a hidden common driver of all $N$ groups, not only of the displayed $i, k$. In this example, Proposition 3 could have been directly applied without using Proposition 4 by augmenting $A_j = \{V_j\}$ to $B'_j = \{V_j, U\}$, with $B_j'^{(1)} = \{U\}$ and $B_j'^{(2)} = \{V_j\}$, leading to $d_{B'_{[n]}} = N$, since all groups $B'_j$ intersect in $U$. Therefore, in this case an inequality already exists without applying Proposition 4, but its use allows replacing $d_{B'_{[n]}} = N$ by $d_{B_{[n]}} = 1$, hence creating a tighter inequality with higher causal inference power.
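As a numerical companion (an assumed toy model of ours in the spirit of Figure 2D, omitting the conditioning set for brevity): independent hidden $U_1, U_2$ drive $Y$, each observable $V_j$ is a noisy copy of $U_j$, so that $Y \perp V_j|U_j$ and $U_1 \perp U_2$ hold, and the resulting testable bound $H(Y) \geq I(Y; V_1) + I(Y; V_2)$ with $d_{B_{[n]}} = 1$ is checked by exact enumeration:

import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mi(pxy):
    """I(X; Y) in bits from a two-dimensional joint probability table."""
    return entropy(pxy.sum(axis=1)) + entropy(pxy.sum(axis=0)) - entropy(pxy.ravel())

flip = 0.1                                   # observation noise P(Vj != Uj)
p = np.zeros((2, 2, 2, 2, 2))                # axes: (u1, u2, v1, v2, y)
for u1 in (0, 1):
    for u2 in (0, 1):
        for v1 in (0, 1):
            for v2 in (0, 1):
                for y in (0, 1):
                    p_v1 = 1 - flip if v1 == u1 else flip
                    p_v2 = 1 - flip if v2 == u2 else flip
                    p_y = 0.9 if y == (u1 | u2) else 0.1   # Y driven jointly by U1, U2
                    p[u1, u2, v1, v2, y] = 0.25 * p_v1 * p_v2 * p_y

h_y = entropy(p.sum(axis=(0, 1, 2, 3)))
lhs = mi(p.sum(axis=(0, 1, 3))) + mi(p.sum(axis=(0, 1, 2)))   # I(Y;V1) + I(Y;V2)
print(round(h_y, 3), round(lhs, 3), lhs <= h_y)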
In Figure 2F, we again consider $2N$ groups, consisting of $N$ pairs with the same causal structure across pairs and indexes $i, k$ representing two of these pairs. For groups $A_{j'j} = \{V_{j'j}\}$, with $j' = 1, 2$ and $j = 1, \ldots, N$, Proposition 3 is directly applicable for $B_{j'j}^{(1)} = \{U_j\}$ and $B_{j'j}^{(2)} = \{V_{j'j}\}$, with $d_{B_{[n]}} = 2$. The data processing inequalities associated with $Y \perp V_{j'j}|U_{j'j}$ allow applying Proposition 4 to obtain an inequality for the groups $B'_{j'j} = B_{j'j}'^{(1)} = \{U_{j'j}\}$, for which $d_{B'_{[n]}} = 1$. Proposition 4 relies on the data processing inequality of the conditional mutual information. The data processing inequality of the unique information can also be used for the same purpose, and both data processing inequalities can be combined by applying them to different groups.
Proposition 5. (Decomposition of information from groups modified using, across different groups, the conditional or the unique information data processing inequality): Consider a collection of groups $A_{[n]}$, a conditioning set $Z$, and target variables $Y$ as in Proposition 1. Consider a subset of groups such that for $A_i$ a group $B_i$ exists such that, for some partition of the conditioning set into $Z = \{Z_i^{(1)}, Z_i^{(2)}\}$, the independence $Y \perp A_i \setminus B_i|B_i, Z_i^{(1)}$ holds, with $A_i \setminus B_i \neq \emptyset$. Define $B_{[n]}$ as the collection of groups that replaces $A_i$ by $B_i$ for those groups. If $B_{[n]}$ fulfills the conditions of Proposition 3, then

$I(Y; B_{[n]}|Z) \geq \sum_{i=1}^{n} \frac{1}{d_i} I(Y; A_i \backslash\backslash Z_i^{(2)}|Z_i^{(1)})$,

with the values $d_{B_{[n]}}$ determined by the intersections of the groups in $B_{[n]}$.

Proof. The proof applies Proposition 3 to $B_{[n]}$ and then both types of data processing inequalities, depending on which one holds for the different groups:

$I(Y; B_{[n]}|Z) \geq \sum_{i=1}^{n} \frac{1}{d_i} I(Y; B_i|Z) \overset{(a)}{\geq} \sum_{i=1}^{n} \frac{1}{d_i} I(Y; B_i \backslash\backslash Z_i^{(2)}|Z_i^{(1)}) \overset{(b)}{\geq} \sum_{i=1}^{n} \frac{1}{d_i} I(Y; A_i \backslash\backslash Z_i^{(2)}|Z_i^{(1)})$.

Inequality (a) follows from the unique information always being equal to or smaller than the conditional mutual information (Equation (3)). Inequality (b) applies the conditional mutual information data processing inequality to those groups with $A_i$ different from $B_i$ but $|Z_i^{(2)}| = 0$, and the unique information data processing inequality of Proposition 2 to those groups with $|Z_i^{(2)}| > 0$.

Proposition 5 is useful when the conditions of independence required to apply Proposition 3 do not hold for $A_{[n]}$. It can also be useful to obtain inequalities with higher causal inferential power if $d_{B_{[n]}}$ are smaller than $d_{A_{[n]}}$, even if Proposition 3 is directly applicable. By definition, the terms $I(Y; A_i \backslash\backslash Z_i^{(2)}|Z_i^{(1)})$ are equal to or smaller than $I(Y; A_i|Z)$, which can only decrease the lower bound, but the data processing inequality may hold only for the unique information term and not for the conditional information term. Note that the partition $\{Z_i^{(1)}, Z_i^{(2)}\}$ can be group-specific and selected such that the data processing inequalities can be applied.

Figure 3A shows an example of the application of the data processing inequality of the unique information. For $A_j = \{V_j\}$, Proposition 3 does not apply to $I(Y; A_{[n]}|Z)$ because $V_i \not\perp V_k|Z$. The data processing inequality of the conditional mutual information does not hold, since $Y \not\perp V_i|U_i, Z$. This data processing inequality could be used by adding to $U_i$ the latent common parent in $Y \leftrightarrow Z$, but this variable would be shared by all augmented groups $B_i$, leading to an intersection of all $N$ groups. Alternatively, the data processing inequality holds for the unique information, with $I(Y; U_j \backslash\backslash Z) \geq I(Y; V_j \backslash\backslash Z)$, and $U_i \perp U_j|Z$ for all $i \neq j$. Proposition 5 is applied with $Z_j^{(1)} = \emptyset$, $Z_j^{(2)} = \{Z\}$, and $B_j = B_j^{(1)} = \{U_j\}$, $\forall j$. This leads to an inequality with $H(Y|Z)$ as upper bound and the sum of terms $I(Y; V_j \backslash\backslash Z)$ at the r.h.s., with coefficients determined by $d_{B_{[n]}} = 1$.

In Figure 3B, taking $A_j = \{V_{j1}, V_{j2}\}$ $\forall j$ and defining the conditioning set $\mathcal{Z} = \{Z, Z_1, \ldots, Z_N\}$ (we write $\mathcal{Z}$ for the conditioning set to distinguish it from the variable $Z$ it contains), we have $V_{i2} \not\perp V_{k2}|\mathcal{Z}$ and $V_{j1}, V_{j2} \not\perp Y|U_j, \mathcal{Z}$. On the other hand, $V_{j1}, V_{j2} \perp Y|U_j, \mathcal{Z} \setminus Z_j$, so that the data processing can be applied with the unique information, and Proposition 5 is applied with $Z_j^{(1)} = \mathcal{Z} \setminus Z_j$, $Z_j^{(2)} = \{Z_j\}$, and $B_j = B_j^{(1)} = \{U_j\}$. An inequality exists given that $U_i \perp U_k|\mathcal{Z}$, and the testable inequality has an upper bound $H(Y|\mathcal{Z})$ and at the r.h.s. the sum of terms $I(Y; V_{j1}, V_{j2} \backslash\backslash Z_j|\mathcal{Z} \setminus Z_j)$, with $d_{B_{[n]}} = 1$.

In Figure 3C, we examine an example in which groups differ in the causal structure of the conditioning variable $Z_j$: for the groups of the type of group $i$, $Z_i$ is a common parent of $Y$ and $V_{i1}$; for the groups of the type of $k$, $Z_k$ is a collider in a path between $Y$ and $V_{k1}$. Consider $M$ groups of the former type and $N - M$ of the latter. We examine the existence of an inequality for groups defined as $A_j = \{V_{j1}, V_{j2}\}$ $\forall j$, with $\mathcal{Z} = \{Z, Z_1, \ldots, Z_N\}$. Proposition 3 cannot be applied to $I(Y; A_{[n]}|\mathcal{Z})$ because $V_{i1} \not\perp V_{j1}|\mathcal{Z}$ for all $i \neq j$. The mutual information data processing inequality is not applicable to substitute $V_{j1}$ because $Y \not\perp V_{j1}|U_j, V_{j2}, \mathcal{Z}$. However, for the $M$ groups like $i$, the independence $Y \perp V_{j1}|U_j, V_{j2}, \mathcal{Z} \setminus Z$ leads to the data processing inequality $I(Y; U_j, V_{j2} \backslash\backslash Z|\mathcal{Z} \setminus Z) \geq I(Y; V_{j1}, V_{j2} \backslash\backslash Z|\mathcal{Z} \setminus Z)$. For these groups, $Z_j^{(1)} = \mathcal{Z} \setminus Z$ and $Z_j^{(2)} = \{Z\}$. For the $N - M$ groups like $k$, the independence $Y \perp V_{j1}|U_j, V_{j2}, \mathcal{Z} \setminus \{Z, Z_j\}$ leads to $I(Y; U_j, V_{j2} \backslash\backslash Z, Z_j|\mathcal{Z} \setminus \{Z, Z_j\}) \geq I(Y; V_{j1}, V_{j2} \backslash\backslash Z, Z_j|\mathcal{Z} \setminus \{Z, Z_j\})$. For these groups, $Z_j^{(1)} = \mathcal{Z} \setminus \{Z, Z_j\}$ and $Z_j^{(2)} = \{Z, Z_j\}$. In all cases, the modified groups are $B_j = B_j^{(1)} = \{U_j, V_{j2}\}$, which fulfill the requirement $U_j, V_{j2} \perp U_i, V_{i2}|\mathcal{Z}$ for all $i \neq j$ needed to apply Proposition 3. The testable inequality that follows from Proposition 5 has upper bound $H(Y|\mathcal{Z})$ and in the sum at the r.h.s. has $M$ terms of the form $I(Y; V_{j1}, V_{j2} \backslash\backslash Z|\mathcal{Z} \setminus Z)$ and $N - M$ terms of the form $I(Y; V_{j1}, V_{j2} \backslash\backslash Z, Z_j|\mathcal{Z} \setminus \{Z, Z_j\})$. The coefficients are determined by $d_{B_{[n]}} = 1$.
Proposition 5 combines both types of data processing inequalities, but only across different groups. Our last extension of Proposition 1 combines both types across and within groups. For each group, we introduce a disjoint partition into $m_i$ subgroups, $A_i = \{A_i^{(1)}, \ldots, A_i^{(m_i)}\}$, and define $A_i^{[k]}$ as the union of the first $k_i$ subgroups, for $k_i \in \{1, \ldots, m_i - 1\}$. Proposition 6 then bounds $I(Y; B_{[n]}|Z)$ from below by a weighted sum of terms of the form $I(Y; C_i^{[k]}, A_i \setminus A_i^{[k]} \backslash\backslash Z \setminus Z_i^{[k]}|Z_i^{[k]})$, where $C_i^{[k]}$ are variables substituting the subgroups in $A_i^{[k]}$, and $Z_i^{[k]} \subseteq Z$ is the part of the conditioning set retained for that choice of $k_i$, with each substitution licensed by a corresponding independence that allows applying one of the two data processing inequalities.
Proof. The proof is provided in Appendix A. The tightest inequality results from maximizing, across $k_i \in \{1, \ldots, m_i - 1\}$, each term in the sum. In the proof of Proposition 6 in Appendix A, we show that, when increasing $k_i \in \{1, \ldots, m_i - 1\}$, the terms $I(Y; C_i^{[k]}, A_i \setminus A_i^{[k]} \backslash\backslash Z \setminus Z_i^{[k]}|Z_i^{[k]})$ are monotonically increasing. However, in general, $C_i$ can contain hidden variables, which means that, to obtain a testable inequality, for each $k_i \in \{1, \ldots, m_i - 1\}$ each term needs to be substituted by its lower bound that quantifies the information in the subset of observable variables. For each group, the optimal $k_i$ leading to the tightest inequality will depend on this subset of observable variables.

Figure 3D shows an example of application of Proposition 6. Like in Figure 3C, there are two types of groups with different causal structure. $M$ groups have the structure of the variables with indexes $i, k$, and $A_{j'} = \{V_{j'1}, V_{j'2}\}$. The other $N - M$ groups have the structure of the variables with indexes $l, j$, and $A_{j'} = \{V_{j'}\}$. The conditioning set selected is $Z = \{Z_1, Z_2\}$. Proposition 3 cannot be applied directly because $V_{i1} \not\perp V_{k1}|Z$ for all $i \neq k$ within the $M$ groups, and $V_j \not\perp V_l|Z$ for all $j \neq l$ within the $N - M$ groups. Proposition 6 applies as follows. For the $N - M$ groups, the hidden variables $U_{j'}$ substitute the observable ones. With $B_{[n]}$ defined as $B_{j'} = \{U_{j'}\}$ for the $N - M$ groups and $B_{j'} = \{U_{j'1}, U_{j'2}\}$ for the $M$ groups, the requirements of independence of Proposition 3 are fulfilled, in particular $B_i \perp B_j|Z$ for all $i \neq j$. The terms $I(Y; B_{j'}|Z)$ for the $N - M$ groups are $I(Y; U_{j'}|Z_1, Z_2)$ and are substituted by the lower bounds $I(Y; V_{j'}|Z_1, Z_2)$ in the testable inequality. For the $M$ groups, we have the following sequence of inequalities:

$I(Y; U_{j'1}, U_{j'2}|Z_1, Z_2) \geq I(Y; U_{j'1}, V_{j'2}|Z_1, Z_2) \geq I(Y; U_{j'1}, V_{j'2} \backslash\backslash Z_1|Z_2) \geq I(Y; V_{j'1}, V_{j'2} \backslash\backslash Z_1|Z_2)$.

The first inequality follows from the independence for $k = 2$, the second from the unique information being equal to or smaller than the conditional information, and the third from the independence for $k = 1$. Considering that a testable inequality can only contain observable variables, for the $M$ groups the terms in the sum can be $I(Y; V_{j'1}, V_{j'2} \backslash\backslash Z_1|Z_2)$ or $I(Y; V_{j'2}|Z_1, Z_2)$, depending on which one is higher. The coefficients are determined by $d_{B_{[n]}} = 1$, and the resulting testable inequality has upper bound $H(Y|Z_1, Z_2)$.
Overall, Propositions 4-6 further extend the cases in which groups-decomposition inequalities of the type of Proposition 1 can be derived. Our Proposition 1 extends Proposition 1 of [27] to allow conditioning sets, Proposition 3 further weakens the conditions of independence required in Proposition 1, and Propositions 4-6 use data processing inequalities to obtain testable inequalities from groups decompositions derived comprising hidden variables, which can be more powerful than inequalities directly derived without hidden variables. In Figures 2 and 3, we have provided examples of causal structures for which these new groups-decomposition inequalities exist. In all these cases, the use of our groups-decomposition inequalities increases the set of available inequality tests that can be used to reject hypothesized causal structures underlying the data.

Inequalities Involving Sums of Information Terms from Ancestral Sets
We now examine inequalities involving ancestral sets as in Theorem 1 of Steudel and Ay [27], which we reviewed in our Theorem 1 (Section 2.4). We extend this theorem by allowing for a conditioning set $\mathbf{Z}$ and adding flexibility in how ancestral sets are constructed, as well as allowing the selection of reduced ancestral sets that exclude some variables. Like for Theorem 1, we will use $an_G(A_{[n]}) = \{an_G(A_1), \ldots, an_G(A_n)\}$ to indicate the collection of all ancestral sets in graph $G$ from the collection of groups $A_{[n]} = \{A_1, \ldots, A_n\}$.
The extension of Theorem 1 to allow for a conditioning set $\mathbf{Z}$ requires an extension of the notion of ancestral set that will be used to determine the coefficients in the inequalities. The intuition for this extension is that conditioning on $\mathbf{Z}$ can introduce new dependencies between groups, in particular when a variable $Z_j \in \mathbf{Z}$ is a common descendant of several ancestral groups, and hence conditioning on it activates paths in which it is a collider. The coefficients need to take into account that common information contributions across ancestral groups can originate from these new dependencies. At the same time, conditioning can also block paths that created dependencies between the ancestral groups. To also account for this, we will consider ancestral sets not only in the original graph $G$, but in any graph $G' = G_{\mathbf{Z}'}$, with $\mathbf{Z}' \subseteq \mathbf{Z}$. The graph $G_{\mathbf{Z}'}$ is constructed by removing from $G$ all the outgoing arrows from nodes in $\mathbf{Z}'$. This has an effect equivalent to conditioning on $\mathbf{Z}'$ with regard to eliminating dependencies enabled by paths through $\mathbf{Z}'$ in which the variables in $\mathbf{Z}'$ are noncolliders, since removing those arrows deactivates the paths. To account for these effects of conditioning on $\mathbf{Z}$, for each $Z_j \in \mathbf{Z}$ we define an augmented ancestral set of the groups $A_i \in A_{[n]}$ as follows:
$$an_{G'}(A_i; Z_j) = \begin{cases} an_{G'}(A_i) \cup \big(an_{G'}(Z_j) \cap an_{G'}(A_{[n]})\big) & \text{if } A_i \in S(G'; Z_j),\\ an_{G'}(A_i) & \text{otherwise,}\end{cases} \qquad (7)$$
where $S(G'; Z_j) = \{A_i \in A_{[n]} : an_{G'}(A_i) \not\perp an_{G'}(Z_j) \cap an_{G'}(A_{[n]}) \mid \mathbf{Z}\}$, that is, the set of groups that have some ancestor not independent, given $\mathbf{Z}$, from some ancestor of $Z_j$ that is also an ancestor of $A_{[n]}$.
For each $A_i$, let $d_i(G'; Z_j)$ be the maximal number such that a non-empty intersection exists between $an_{G'}(A_i; Z_j)$ and each of $d_i(G'; Z_j) - 1$ other distinct augmented ancestral sets of $A_{i_1}, \ldots, A_{i_{d_i(G';Z_j)-1}}$. Furthermore, we define $d_i(G'; \mathbf{Z})$ as the maximum over all $Z_j \in \mathbf{Z}$:
$$d_i(G'; \mathbf{Z}) = \max_{Z_j \in \mathbf{Z}} d_i(G'; Z_j). \qquad (8)$$
We will use $d(G'; \mathbf{Z})$ to refer to the whole set of maximal values for all groups. If required, we will use $d_{A_{[n]}}(G'; \mathbf{Z})$ to specify that the collection is $A_{[n]}$.
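As a computational illustration, the following minimal Python sketch (our own, not code from the paper) shows how Equations (7) and (8) could be evaluated by brute force for small hypothesized DAGs. It assumes the graph is encoded with networkx, uses networkx's d-separation test (`nx.is_d_separator`, available in networkx >= 3.3), and interprets the definition of $d_i$ pairwise (a non-empty intersection with each of $d_i - 1$ other augmented sets), which reproduces the values in the worked examples that follow:

```python
import networkx as nx

def ancestral_set(G, nodes):
    """an_G(A): the nodes of A together with all their ancestors in G."""
    anc = set(nodes)
    for v in nodes:
        anc |= nx.ancestors(G, v)
    return anc

def remove_outgoing(G, Zprime):
    """Build G_{Z'} by deleting all arrows leaving nodes in Z'."""
    H = G.copy()
    H.remove_edges_from([(u, v) for u, v in G.edges if u in Zprime])
    return H

def augmented_ancestral_set(G1, A_i, groups, Zj, Z):
    """Equation (7): enlarge an_{G'}(A_i) with the ancestors of Z_j that are
    also ancestors of the collection, when a dependence given Z exists."""
    an_Ai = ancestral_set(G1, A_i)
    an_all = set().union(*(ancestral_set(G1, A) for A in groups))
    shared = ancestral_set(G1, [Zj]) & an_all
    x, y = an_Ai - set(Z), shared - set(Z)
    if x & y:                      # overlapping sets are trivially dependent
        dependent = True
    elif x and y:                  # graphical test of x independent of y given Z
        dependent = not nx.is_d_separator(G1, x, y, set(Z))
    else:
        dependent = False
    return an_Ai | shared if dependent else an_Ai

def d_coefficients(G1, groups, Z):
    """Equation (8): d_i(G'; Z) as the maximum over Z_j of one plus the
    number of other groups whose augmented set intersects that of A_i."""
    n = len(groups)
    d = [1] * n
    for Zj in Z:
        aug = [augmented_ancestral_set(G1, A, groups, Zj, Z) for A in groups]
        for i in range(n):
            overlaps = sum(1 for j in range(n) if j != i and aug[i] & aug[j])
            d[i] = max(d[i], 1 + overlaps)
    return d
```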
In Figure 4A-D, we consider examples to understand the rationale of how $d_{A_{[n]}}(G'; \mathbf{Z})$ is determined in inequalities with a conditioning set $\mathbf{Z}$. In Figure 4A, for groups $A_1 = \{V_1\}$ and $A_2 = \{V_2\}$, the augmented ancestral sets on graph $G$ are $an_G(A_1; Z) = \{V_1, Z\}$ and $an_G(A_2; Z) = \{V_2, Z\}$, which intersect on $Z$, and $d_i(G; Z) = 2$ for $i = 1, 2$. However, $Z$ is a noncollider in the path creating a dependence between $V_1$ and $V_2$, and conditioning on $Z$ renders them independent, so that $d_i(G; Z) = 2$ overestimates the amount of information the groups may share after conditioning. Alternatively, selecting $G_Z$, the ancestral sets are $an_{G_Z}(A_1; Z) = \{V_1\}$ and $an_{G_Z}(A_2; Z) = \{V_2\}$, which do not intersect, and $d_i(G_Z; Z) = 1$ for $i = 1, 2$ when calculated following Equation (7). A priori, we do not know which graph $G' = G_{\mathbf{Z}'}$, $\mathbf{Z}' \subseteq \mathbf{Z}$, results in a tighter inequality. Here we see that $G_Z$ leads to an inequality with more causal inference power than $G$ for Figure 4A. In Figure 4B, $Z$ is a collider between $V_1$ and $V_2$, so that conditioning on $Z$ creates a dependence between the groups. If the values $d_i$ were determined from the standard ancestral sets, in this case $an_G(A_i) = an_{G_Z}(A_i) = \{V_i\}$ for $i = 1, 2$, they would not intersect, leading to unit coefficients. However, the augmented ancestral sets following Equation (7) are $an_G(A_i; Z) = an_{G_Z}(A_i; Z) = \{V_1, V_2\}$ for $i = 1, 2$, so that $d_i(G; Z) = d_i(G_Z; Z) = 2$. This illustrates that the augmented ancestral sets are necessary to properly determine the coefficients in inequalities with conditioning sets $\mathbf{Z}$, in this case reflecting that $I(Y; V_1 \mid Z)$ and $I(Y; V_2 \mid Z)$ can have redundant information.
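Continuing the sketch above, a hypothetical encoding of the Figure 4A and 4B structures (our guess at minimal graphs matching the description: $Z$ a noncollider between $V_1$ and $V_2$ in 4A, a collider in 4B, with both variables influencing $Y$) reproduces the coefficients just discussed:

```python
groups = [["V1"], ["V2"]]

# Figure 4A (assumed graph): Z is a noncollider, V1 <- Z -> V2.
G_a = nx.DiGraph([("Z", "V1"), ("Z", "V2"), ("V1", "Y"), ("V2", "Y")])
print(d_coefficients(G_a, groups, ["Z"]))                           # [2, 2] in G
print(d_coefficients(remove_outgoing(G_a, {"Z"}), groups, ["Z"]))   # [1, 1] in G_Z

# Figure 4B (assumed graph): Z is a collider, V1 -> Z <- V2.
G_b = nx.DiGraph([("V1", "Z"), ("V2", "Z"), ("V1", "Y"), ("V2", "Y")])
print(d_coefficients(G_b, groups, ["Z"]))                           # [2, 2]
```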
Figure 4. Inequalities involving sums of information terms from ancestral sets. (A-D) Examples to illustrate the definition of augmented ancestral sets (Equations (7) and (8)). (E,F) Examples of the application of Theorem 2 to obtain testable inequalities.
Figure 4C shows a scenario in which conditioning creates dependencies of $Y$ with $V_1$ and $V_2$, which were previously independent. The standard ancestral sets $an_{G'}(A_1) = \{V_1\}$ and $an_{G'}(A_2) = \{V_2\}$ would not intersect in any $G' = G_{\mathbf{Z}'}$, with $\mathbf{Z}' \subseteq \{Z_1, Z_2\}$, and would lead to unit values for $d_i$. On the other hand, the augmented ancestral sets are $an_{G'}(A_i; Z_j) = \{V_i\}$ for $i = j$ and $an_{G'}(A_i; Z_j) = \{V_1, V_2\}$ for $i \neq j$. This results in $d_i(G'; \mathbf{Z}) = 2$ in all cases, which appropriately captures that the two groups can have common information about $Y$ when conditioning on $\{Z_1, Z_2\}$. The example of Figure 4D illustrates why each value $d_i(G'; Z_j)$ is determined separately first (Equation (7)), and only afterwards the maximum is calculated (Equation (8)). Four groups are defined as $A_i = \{V_i\}$ for $i = 1, \ldots, 4$. If $d_i(G'; \mathbf{Z})$ were to be determined directly from Equation (7) using $\mathbf{Z} = \{Z_1, Z_2\}$ jointly, instead of using $Z_1$ and $Z_2$ separately, then for all the groups the augmented ancestral set would include all variables, since $an_{G'}(\mathbf{Z}) \cap an_{G'}(A_{[n]})$ is equal to $an_{G'}(A_{[n]})$. This would lead to $d_i = 4$ for all $i$. However, that determination would overestimate how many groups become dependent when conditioning on $\mathbf{Z}$, since $Z_1$ creates a dependence between $V_1$ and $V_2$, and $Z_2$ between $V_3$ and $V_4$, but no dependencies across these pairs are created. The determination of $d(G'; \mathbf{Z}) = 2$ from Equations (7) and (8) properly leads to a tighter inequality than the one obtained considering both conditioning variables jointly.
Equipped with this extended definition of $d_{A_{[n]}}(G'; \mathbf{Z})$, we now present our generalization of Theorem 1:

Theorem 2. Let $G$ be a DAG model containing nodes corresponding to a set of (possibly hidden) variables $\mathcal{X}$. Let $Y \subseteq \mathcal{X}$ be a set of observable target variables, and $\mathbf{Z} = \{Z_1, \ldots, Z_m\}$ a conditioning set of observable variables, with $\mathbf{Z} \subset \mathcal{X}$. Let $A_{[n]} = \{A_1, \ldots, A_n\}$ be a collection of (possibly overlapping) groups of (possibly hidden) variables $A_i \subset \mathcal{X}$. Consider a DAG $G'$ selected as $G' = G_{\mathbf{Z}'}$ with $\mathbf{Z}' \subseteq \mathbf{Z}$, constructed by removing from graph $G$ all the outgoing arrows from nodes in $\mathbf{Z}'$. Following Equation (7), define an augmented ancestral set in $G'$ for each group $A_i \in A_{[n]}$ and each variable in the conditioning set, $Z_j \in \mathbf{Z}$. Following Equation (8), determine $d_i(G'; \mathbf{Z})$ for each group, given the intersections of the augmented ancestral sets $an_{G'}(A_i; Z_j)$. Select a variable $W_0 \in an_{G'}(A_{[n]})$ and a group of variables $W \subseteq D_{G'}(W_0) \cap an_{G'}(A_{[n]})$, possibly $W = \emptyset$. Define the reduced ancestral sets $\widetilde{an}_{G'}(A_i) = an_{G'}(A_i) \setminus W$ for each $A_i \in A_{[n]}$, and the reduced collection $\widetilde{an}_{G'}(A_{[n]}) = an_{G'}(A_{[n]}) \setminus W$. The information about $Y$ in this reduced collection when conditioning on $\mathbf{Z}$ is bounded from below by
$$I(Y; \widetilde{an}_{G'}(A_{[n]}) \mid \mathbf{Z}) \;\geq\; \sum_{i=1}^{n} \frac{1}{d_i(G'; \mathbf{Z})}\, I(Y; \widetilde{an}_{G'}(A_i) \mid \mathbf{Z}).$$

Proof. The proof is provided in Appendix B.
Theorem 2 provides several extensions of Theorem 1. First, it allows for a conditioning set $\mathbf{Z}$. Second, given a hypothesis of the generative causal graph $G$ underlying the data, Theorem 2 can be applied to any $G' = G_{\mathbf{Z}'}$ with $\mathbf{Z}' \subseteq \mathbf{Z}$, and hence offers a set of inequalities potentially adding causal inference power. As we have discussed in relation to Figure 4A-D, the selection of the $G'$ that leads to the tightest inequality is in some cases determined by the causal structure, but in general it also depends on the exact probability distribution of the variables. Third, Theorem 2 allows excluding some variables $W$ from the ancestral sets, although imposing constraints on the causal structure of $W$. The role of these constraints is clear in the proof in Appendix B. The case of Theorem 1 corresponds to $\mathbf{Z} = \emptyset$, $W = \emptyset$, and $G' = G$.
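For reference, with these choices the bound of Theorem 2 reduces to the unconditional form reviewed in Theorem 1:
$$I(Y; an_G(A_{[n]})) \;\geq\; \sum_{i=1}^{n} \frac{1}{d_i}\, I(Y; an_G(A_i)).$$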
Excluding some variables $W$ can be advantageous. For example, if $Y$ is univariate and overlaps with some ancestral sets, as is the case when some groups include descendants of $Y$, then the upper bound $I(Y; an_{G'}(A_{[n]}) \mid \mathbf{Z})$ is equal to $H(Y \mid \mathbf{Z})$, and $I(Y; an_{G'}(A_i) \mid \mathbf{Z})$ is also equal to $H(Y \mid \mathbf{Z})$ for all ancestral sets that include $Y$. Excluding $W = Y$ provides a tighter upper bound $I(Y; an_{G'}(A_{[n]}) \setminus Y \mid \mathbf{Z})$ and may provide more causal inferential power. Another scenario in which a reduced collection can be useful is when excluding $W$ removes all hidden variables from $an_{G'}(A_{[n]})$, such that $\widetilde{an}_{G'}(A_{[n]})$ is observable, giving $I(Y; \widetilde{an}_{G'}(A_{[n]}) \mid \mathbf{Z})$ as a testable upper bound instead of $H(Y \mid \mathbf{Z})$. When comparing inequalities with different sets $W$, in some cases the form of the causal structure and the specification of which variables are hidden or observable will a priori determine an order of causal inference power among the inequalities. However, as for the comparison across $G' = G_{\mathbf{Z}'}$ with $\mathbf{Z}' \subseteq \mathbf{Z}$, in general the power of the different inequalities depends on the details of the generated probability distributions. Formulating general criteria to rank inequalities with different $\mathbf{Z}$, $G'$, and $W$ in terms of their inferential power is beyond the scope of this work.
Note that we have formulated Theorem 2 explicitly allowing for hidden variables. Also, in Theorem 1 (as a subcase of Theorem 2) the restriction that $A_{[n]}$ consists of observable variables can be removed. In any case, the inclusion of hidden variables can only increase the causal inference power if combined with data processing inequalities to obtain a testable inequality. Propositions 4-6 indicate how an inequality derived from Proposition 1 may be tightened by substituting $A_{[n]}$ with a new collection $B_{[n]}$ that, by including hidden variables, leads to $d_{B_{[n]}}$ smaller than $d_{A_{[n]}}$. The same application of the data processing inequalities of the unique and conditional mutual information can be used for Theorem 2 to determine a $B_{[n]}$ with $d_{B_{[n]}}(G'; \mathbf{Z})$ smaller than $d_{A_{[n]}}(G'; \mathbf{Z})$. The use of data processing inequalities is necessary because they allow substituting some of the observable variables by hidden variables, instead of only adding hidden variables. When variables are only added, the number of intersections between ancestral groups can only increase, hence not decreasing $d(G'; \mathbf{Z})$. On top of this, a testable inequality replaces the information terms of ancestral groups by their lower bounds given by observable subsets of variables. This means that, when hidden variables are only added, the testable inequality will contain the same information terms of the observable variables, but possibly smaller coefficients, hence resulting in a looser inequality. This is no longer the case when hidden variables are not added but instead substitute some of the observable variables, thanks to data processing inequalities. This substitution may decrease the number of intersections between ancestral groups, and the coefficients in the sum may be higher. We will not describe this procedure in detail, since the use of data processing inequalities is analogous to their use in Propositions 4-6.
We now illustrate the application of Theorem 2. In Figure 4E, with $\mathbf{Z} = \{Z_1, Z_2\}$, the conditions of independence required by Proposition 6 do not hold for any set of groups, and no data processing inequalities can be applied to replace some variables in order to fulfill the conditions. On the other hand, Theorem 2 can always be applied, since it does not require the fulfillment of any conditions of independence. For example, for $A_i = \{V_i\}$, $i = 1, 2, 3$ and for $G' = G_{Z_1 Z_2}$, we have $an_{G'}(V_1) = \{V_1\}$, $an_{G'}(V_2) = \{V_2\}$, $an_{G'}(V_3) = \{V_1, V_2, V_3, U, Y\}$, and, following Equation (7), $an_{G'}(V_1; Z_j) = \{V_1\}$, $an_{G'}(V_2; Z_j) = \{V_2\}$, and $an_{G'}(V_3; Z_j) = \{V_1, V_2, V_3, U, Y\}$, for $j = 1, 2$. This leads to $d(G'; \mathbf{Z}) = \{2, 2, 3\}$. For illustration purposes, we focus on $W$ equal to $\{Y, U\}$ or any of its subsets. In all cases $\widetilde{an}_{G'}(V_i) = an_{G'}(V_i)$ for $i = 1, 2$, contributing the terms $\frac{1}{2} I(Y; V_1 \mid Z_1, Z_2)$ and $\frac{1}{2} I(Y; V_2 \mid Z_1, Z_2)$. For $W = \{Y, U\}$ or $W = \{Y\}$, the contribution of the observable lower bound of the third group is $\frac{1}{3} I(Y; V_1, V_2, V_3 \mid Z_1, Z_2)$. For $W = \{U\}$ or $W = \emptyset$, the third group contributes $\frac{1}{3} H(Y \mid Z_1, Z_2)$. For $W = \{Y, U\}$, $\widetilde{an}_{G'}(A_{[n]}) = \{V_1, V_2, V_3\}$, which is observable, and the upper bound is $I(Y; V_1, V_2, V_3 \mid Z_1, Z_2)$. For any other subset of $\{Y, U\}$, the upper bound in the testable inequality is $H(Y \mid Z_1, Z_2)$. Because the terms in the sum for groups 1 and 2 are equal for all the $W$ compared, in this case it can be checked that selecting $W = \{Y, U\}$ leads to the tightest inequality. This example illustrates the utility of being able to construct inequalities for reduced ancestral sets. While in the previous example only Theorem 2 and not Proposition 6 was applicable, more generally, a causal structure will involve the fulfillment of a set of inequalities, some obtained using Proposition 6 and some using Theorem 2. Which inequalities have higher inferential power will depend on the causal structure and the exact probability distribution of the data, and neither Theorem 2 nor Proposition 6 is more powerful a priori. In Figure 4F, Proposition 6 cannot be applied using $A_i = \{V_i\}$, $i = 1, 2, 3$ and conditioning on $\mathbf{Z}$, because $V_i \not\perp V_j \mid \mathbf{Z}$ for all $i \neq j$, and no data processing inequalities help to substitute these variables. On the other hand, Theorem 2 can be applied with $A_i = \{V_i\}$, leading to $an_{G'}(V_1) = \{V_1\}$, $an_{G'}(V_3) = \{V_3\}$, and $an_{G'}(V_2) = \{V_2, U_1, U_2\}$, for all $G' = G_{\mathbf{Z}'}$. The augmented ancestral sets are $an_{G'}(V_1; \mathbf{Z}) = \{V_1, V_3, U_1\} = an_{G'}(V_3; \mathbf{Z})$ and $an_{G'}(V_2; \mathbf{Z}) = \{V_1, V_2, V_3, U_1, U_2\}$, also for all $G'$, resulting in $d(G'; \mathbf{Z}) = 3$. Focusing on the case of $W = \{Y, U_2\}$, or any subset of it, in all cases the associated testable inequality has $H(Y \mid \mathbf{Z})$ as upper bound and on the r.h.s. the sum of terms $\frac{1}{3} I(Y; V_i \mid \mathbf{Z})$, $i = 1, 2, 3$. Alternatively, defining $A_1 = \{V_1, V_3, U_1\}$ and $A_2 = \{V_2, U_1\}$, Proposition 3 is applicable with the two groups intersecting in $U_1$ and $V_1, V_3 \perp V_2 \mid \mathbf{Z}, U_1$. The associated testable inequality has the same upper bound $H(Y \mid \mathbf{Z})$ and on the r.h.s. the sum of terms $\frac{1}{2} I(Y; V_1, V_3 \mid \mathbf{Z})$ and $\frac{1}{2} I(Y; V_2 \mid \mathbf{Z})$. In this case, which inequality has more causal inferential power will depend on the exact distribution of the data.
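Putting the pieces together for $W = \{Y, U\}$ in Figure 4E, the tightest testable inequality reads
$$I(Y; V_1, V_2, V_3 \mid Z_1, Z_2) \;\geq\; \tfrac{1}{2} I(Y; V_1 \mid Z_1, Z_2) + \tfrac{1}{2} I(Y; V_2 \mid Z_1, Z_2) + \tfrac{1}{3} I(Y; V_1, V_2, V_3 \mid Z_1, Z_2),$$
or, moving the last term to the left-hand side,
$$\tfrac{2}{3}\, I(Y; V_1, V_2, V_3 \mid Z_1, Z_2) \;\geq\; \tfrac{1}{2} I(Y; V_1 \mid Z_1, Z_2) + \tfrac{1}{2} I(Y; V_2 \mid Z_1, Z_2).$$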
Overall, Theorem 2 extends Theorem 1 of [27], allowing for conditioning sets and providing more flexible conditions to form the groups. In the examples of Figure 4, we have illustrated how Theorem 2 substantially increases the number of groups-decomposition inequalities that can be tested to reject the compatibility of hypothesized causal structures with a certain data set.

Discussion
We have presented several generalizations of the type of groups-decomposition inequalities introduced by [27], which compare the information about target variables contained in a collection of variables with a weighted sum of the information contained in subsets of the collection. These generalizations include an extension to allow for conditioning sets, and methods to identify existing inequalities that involve collections and subsets selected with less restrictive criteria. This comprises less restrictive conditions of independence, the use of ancestral sets from subgraphs of the causal structure of interest, and the removal of some variables from the ancestral sets. We have also shown how to exploit inequalities identified for collections containing hidden variables, which are not directly testable, by converting them into testable inequalities using data processing inequalities.
Our use of data processing inequalities to derive testable groups-decomposition inequalities when collections contain hidden variables is not entirely new. We found inspiration for this approach in the proof of Theorem 1 in [24]. This theorem derives a causally informative inequality from a particular type of causal structure, namely common ancestor graphs in which all dependencies between observable variables are caused by hidden common ancestors. The inequality presented in the theorem corresponds to the setting of a univariate target variable and groups composed of different single observable variables. In their simplest case, each hidden ancestor has only two children, which are observable variables. Their proof uses the mutual information data processing inequality to convert a sum of information terms involving the observable variables into a sum of terms involving the hidden ancestors. The final inequality can equally be proven by applying our Proposition 4, deriving an inequality for the collection of hidden variables and then converting it into a testable inequality using data processing inequalities. The same final inequality can also be derived as an application of our Theorem 2 followed by the use of the data processing inequality.
We have expanded the applicability of data processing inequalities by showing that this type of inequality also holds for conditional unique information measures [29]. For a given causal structure, a testable causally informative inequality may be obtained by substituting hidden variables with observable variables thanks to the data processing inequality of the unique information, in cases in which the data processing inequality of mutual information is not applicable. As shown in Proposition 6, the unique information data processing inequalities are particularly powerful for deriving groups-decomposition inequalities with a conditioning set, since they can be applied iteratively to replace different subsets of hidden variables by observable variables, choosing which variables are kept as conditioning variables and which ones are taken as reference variables for the different unique information measures. This use of unique information indicates how other types of information-theoretic measures could be similarly incorporated to derive causally informative inequalities. Recent developments in the decomposition of mutual information into redundant, synergistic, and unique contributions [30] provide candidate measures whose utility for this purpose needs to be further explored [31-35,40,41], among others. Furthermore, while this type of decomposition has been extensively debated recently [35,42,43], aspects of its characterization are still unsolved, and an understanding of how its terms are related to the causal structure can provide new insights.
One particular domain in which our generalizations can be useful is the study of causal interactions among dynamical processes [23,44], for which causal interactions are characterized from time series both in the temporal [45] and the spectral domain [46-48]. When studying high-dimensional multivariate dynamical processes, such as brain dynamics (e.g., [49-51]) or econometric data [52,53], an important question is to determine whether correlations between time series are related to causal influences or to hidden common influences. For highly interconnected systems with many hidden variables, the number of independencies may be small, hence providing limited information about the causal structure. In this case, inequality constraints can help to substantially narrow down the set of causal structures compatible with the data. Accordingly, our generalization to formulate conditional inequalities may play an important role in combination with measures that quantify partial dependencies between time series [54,55]. We expect this approach to be easily adaptable to non-stationary time series, as is often the case in the presence of unit roots and co-integrated time series [56-58]. This can be carried out by selecting collections and groups consistent with the temporal partitioning in non-stationary information-theoretic measures of causality in time series [59,60]. Another area in which to extend the applicability of our proposal is the study of non-classical quantum systems [16,61-63]. In this case, an extended d-separation criterion [64] and adapted faithfulness considerations [65] have been proposed to take into account the particularities of quantum systems. Further exploration will be required to determine if and how our derivations that rely on d-separation leading to statistical independence (Appendix C) are also applicable when considering generalized causal structures for quantum systems.
Besides the extension to particular domains, an important open question regards the relation between the causal inferential power of different inequalities. Our proposal considerably enlarges the number of groups-decomposition inequalities of the type of [27] available to test the compatibility of a causal structure with a given data set. We have seen in our analysis some examples of how, under certain conditions, the causal structure imposes an ordering on the power of alternative inequalities. Future work should aim to derive broader criteria to rank the inferential power of inequalities, for example in terms of the relation between the conditioning sets or the composition of the groups that appear in each inequality. Formulating criteria to rank the inferential power of different inequalities would help to simplify the set of inequalities that needs to be tested when the compatibility of a certain causal structure with the data is to be examined.
Apart from a characterization of how groups-decomposition inequalities are related among themselves, future work should also examine the relation and embedding of this type of inequality with those derived with other approaches. In our understanding, the algorithmic projection procedure of [23,24] could equally retrieve some of the inequalities described here, but without the advantage of a constructive procedure that derives the form of an inequality by directly reading a causal graph, instead requiring costly computations that may limit the derivation of inequalities for large systems. The incorporation of constraints for other types of information-theoretic measures, such as constraints involving unique information measures, would require an extension of the algorithmic approach. Among other approaches, the so-called inflation technique [66] stands out as capable of providing asymptotically sufficient tests of causal compatibility [67]. The inflation method creates a new causal structure with multiple copies of the original structure and symmetry constraints on the ancestral properties of the different copies, in such a way that testable constraints on the inflated graph can be mapped back to the compatibility of the original causal structure. However, despite the ongoing advances in its theoretical development and implementation [68], to our knowledge it is not straightforward to identify the order of inflation and the specific inflation structure adequate to discriminate between certain causal structures. The availability of inequalities easily derived by reading the original causal structure can also be helpful in combination with the inflation method, in order to discard as many candidate causal structures as possible before the design of additional inflated graphs. The connection with other approaches [69-74] also deserves further investigation, ultimately to determine minimal sets of inequality constraints with equivalent inferential power.
Beyond the derivation of testable causally informative inequalities, a crucial issue for their application is the implementation of the corresponding tests. This implementation depends on the estimation of information-theoretic measures from data. A ubiquitous challenge for the application of mutual information measures is that their estimators are positively biased and data-demanding [75,76]. These biases scale with the dimensionality of the variables, and hence can hinder the applicability of information-theoretic inequalities for large collections of variables, or for variables with high cardinality. However, recent advances in the estimation of mutual information for high-dimensional data can help to attenuate these biases [77]. Furthermore, the implementation of the tests can take advantage of the existence of both upper-bound and lower-bound estimators of mutual information [78], using opposite bounds at the two sides of the inequalities. These technical aspects of the implementation of the tests are important for applying all types of information-theoretic inequalities [23-27,71]. Despite these common challenges, our extension of groups-decomposition inequalities does not come at the price of having to test inequalities that are intrinsically more difficult to estimate. Our contribution can substantially increase the number of inequalities available to be tested, and we have provided examples in Figures 2-4 of new inequalities in which, in particular thanks to the use of data processing inequalities, the dimensionality of the collections is not increased. Future work is required to determine how to efficiently combine all available tests. In the goal of determining minimal sets of inequality tests that are maximally informative, the statistical power of the tests will need to be considered together with their discrimination power among causal structures.
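As a minimal illustration of the positive bias mentioned above (our own sketch, not an estimator used in this work), the plug-in estimate of the mutual information between two independent discrete variables is strictly positive on finite samples:

```python
import numpy as np

rng = np.random.default_rng(1)

def plugin_mi(x, y, k):
    """Plug-in (maximum likelihood) estimate of I(X;Y) in bits, for samples
    taking values in {0, ..., k-1}."""
    joint = np.zeros((k, k))
    np.add.at(joint, (x, y), 1.0)
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / np.outer(px, py)[nz])).sum())

# X and Y are independent by construction, so the true mutual information
# is zero; the plug-in estimate is nevertheless positive, and more so for
# fewer samples or higher cardinality.
x = rng.integers(0, 8, size=200)
y = rng.integers(0, 8, size=200)
print(plugin_mi(x, y, 8))  # > 0 despite the true value being 0
```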
Equality $(a)$ holds because $\{X^{(1)}_{[r_1]}, X^{(2)}_{[r_2]}\}$ contains the same variables as $B_{[n]}$. Equality $(b)$ is an iterative application of the chain rule. Inequality $(c)$ is as follows: for the sum in $X^{(1)}_{[r_1]}$, steps $(c)$ to $(f)$ of Equation (A1) are all combined, substituting the sets $A_i$ by $B^{(1)}_i$. Before continuing with the proof of Proposition 6, we formulate in Lemma A1 a property of the unique information that will be used in the proof.

Lemma A1. (Conditioning on reference variables increases conditional unique information):
The conditional unique information $I(Y; X \backslash\backslash Z_1 Z_2 \mid Z_3)$ is smaller than or equal to $I(Y; X \backslash\backslash Z_1 \mid Z_2 Z_3)$, where $Z_2$ moves from the set of reference predictors of the unique information to the conditioning set.
Proof of Lemma A1. The unique information $I(Y; X \backslash\backslash Z_1 Z_2 \mid Z_3)$ is by definition (Equation (2)) the minimum of $I(Y; X \mid Z_1 Z_2 Z_3)$ among the distributions that preserve $P(Y, X, Z_3)$ and $P(Y, Z_1, Z_2, Z_3)$, and $I(Y; X \backslash\backslash Z_1 \mid Z_2 Z_3)$ is the minimum of $I(Y; X \mid Z_1 Z_2 Z_3)$ among the distributions that preserve $P(Y, X, Z_2, Z_3)$ and $P(Y, Z_1, Z_2, Z_3)$. Since the latter constraints subsume the former ones, the minimum can only be equal or higher.
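In symbols, writing $\Delta_1$ and $\Delta_2$ for the two feasible sets of distributions described in the proof,
$$\Delta_1 = \{Q : Q(Y, X, Z_3) = P(Y, X, Z_3),\; Q(Y, Z_1, Z_2, Z_3) = P(Y, Z_1, Z_2, Z_3)\},$$
$$\Delta_2 = \{Q : Q(Y, X, Z_2, Z_3) = P(Y, X, Z_2, Z_3),\; Q(Y, Z_1, Z_2, Z_3) = P(Y, Z_1, Z_2, Z_3)\} \subseteq \Delta_1,$$
the argument is simply that the second optimization runs over a smaller set:
$$I(Y; X \backslash\backslash Z_1 Z_2 \mid Z_3) = \min_{Q \in \Delta_1} I_Q(Y; X \mid Z_1 Z_2 Z_3) \;\leq\; \min_{Q \in \Delta_2} I_Q(Y; X \mid Z_1 Z_2 Z_3) = I(Y; X \backslash\backslash Z_1 \mid Z_2 Z_3).$$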
Proof of Proposition 6. For iterations $k = 1, \ldots, m_i - 1$, consider the following: the first inequality follows from monotonicity of mutual information. We now consider the sum involving groups that do not belong to $S^{(k)}_j$: the inequality holds applying Lemma 1(ii) with $A = \widetilde{an}(A_i) \setminus \{\mathbf{Z}, V^{[k]}\}$, $B = \{\mathbf{Z}, V^{[k-1]}\}$, and $C = V^{(k)}_j \setminus \{\mathbf{Z}, V^{[k-1]}\}$. By construction, $V^{(k)}_j \cap \{\mathbf{Z}, V^{[k-1]}\} = \emptyset$ and hence $C = V^{(k)}_j$. Furthermore, $\widetilde{an}(A_i) \setminus \{\mathbf{Z}, V^{[k-1]}\}$ is equal to $\widetilde{an}(A_i) \setminus \{\mathbf{Z}, V^{[k]}\}$ given that $A_i \notin S^{(k)}_j$. An intersection of $\widetilde{an}(A_i) \setminus \{\mathbf{Z}, V^{[k-1]}\}$ and $V^{(k)}_j$ would contradict the definition of $V^{(k)}_j$, since $|S^{(k)}_j|$ is determined to be maximal, but it would increase to $|S^{(k)}_j| + 1$ if defined by that intersection, and that would lead to $A_i \in S^{(k)}_j$ instead. Lemma 1(ii) applies given the independence $A \perp C \mid B$. We now prove that this independence holds. We proceed by discarding the presence of all types of paths in $G$ that would create a dependence $A \not\perp C \mid B$. Under the faithfulness assumption, we examine the four different types of paths in $G$ that could create a dependence. First, there is a variable $X_r \in C$ and a variable $X_l \in A$ with an active directed path in $G$ from $X_r$ to $X_l$, not blocked by $B$. If this path is active in $G$ conditioning on $B = \{\mathbf{Z}, V^{[k-1]}\}$, it also exists in any $G' = G_{\mathbf{Z}'}$, with $\mathbf{Z}' \subseteq \mathbf{Z}$, since the removal of outgoing arrows has the same effect as conditioning for the paths in which the conditioning variables are noncolliders (i.e., do not have two incoming arrows). This active directed path means that $X_r$ would be an ancestor of $X_l$ in $G'$. Therefore, given $X_l \in A$ and $X_r \in C$, $X_r$ itself would be part of $\widetilde{an}_{G'}(A_i) \setminus V^{[k-1]}$. However, as argued above, an intersection of $\widetilde{an}_{G'}(A_i) \setminus V^{[k-1]}$ and $V^{(k)}_j$ is contradictory with $A_i \notin S^{(k)}_j$. Second, there is a variable $X_r \in C$ and a variable $X_l \in A$ with an active directed path in $G$ from $X_l$ to $X_r$, not blocked by $B$. Again, this path being active in $G$ when conditioning on $B = \{\mathbf{Z}, V^{[k-1]}\}$ means that it also exists in any $G' = G_{\mathbf{Z}'}$, with $\mathbf{Z}' \subseteq \mathbf{Z}$. Therefore, $X_l$ would be an ancestor of $X_r$ in $G'$. This is again a contradiction with the definition of $V^{(k)}_j$, because it could be redefined to include $|S^{(k)}_j| + 1$ groups, since $X_l$ would be an ancestor of all groups intersecting in $V^{(k)}_j$. Third, there is a variable $X_r \in C$, a variable $X_l \in A$, and another variable $X_h$ that is part of neither $A$ nor $C$, with an active directed path in $G$ from $X_h$ to $X_r$ and an active directed path from $X_h$ to $X_l$, both not blocked by $B$. This would also imply that these directed paths exist in $G' = G_{\mathbf{Z}'}$, with $\mathbf{Z}' \subseteq \mathbf{Z}$, and hence $X_h$ is an ancestor of $A$ and $C$ in $G'$. Since $X_h$ is an ancestor of $A = \widetilde{an}(A_i) \setminus \{\mathbf{Z}, V^{[k-1]}\}$ but by construction $X_h \notin A$, $X_h$ has to be part of $\{\mathbf{Z}, V^{[k-1]}\}$ or of $W$, since any ancestor of $an(A_i)$ is part of $an(A_i)$. If $X_h \in \{\mathbf{Z}, V^{[k-1]}\}$, conditioning on $B = \{\mathbf{Z}, V^{[k-1]}\}$ would prevent the directed paths from $X_h$ to $X_r$ and from $X_h$ to $X_l$ from being active, leading to a contradiction. We now consider the case $X_h \in W$. Since $X_h$ is an ancestor of $C = V^{(k)}_j$, by construction of $V^{(k)}_j$, $X_h$ is an ancestor of $\widetilde{an}(A_j) \setminus \{\mathbf{Z}, V^{[k-1]}\}$. This means that $an(A_j)$ includes $X_h \in W$ which, given Equation (A11), is in contradiction with the criterion for the selection of reference groups such that $A_j \in S_W$. In these three types of cases, an active path would exist despite conditioning on $B$. In the last type, a path would be activated by conditioning on $B$.
At least one variable $X_h \in B = \{\mathbf{Z}, V^{[k-1]}\}$ has to be a collider, or a descendant of a collider, along the path that conditioning activates. Consider first that a single collider $X_h$ is involved. For the collider to activate the path, there must exist an active directed subpath to $X_h$ from a variable $X_r$ that is part of $C$ or part of its ancestor set in $G'$. Since this directed subpath is active in $G$ when conditioning on $B$, it is also active in $G'$. This means that $X_r$ would be an ancestor of $X_h$ in $G'$. If $X_h$ is part of $\mathbf{Z}$ or part of $V^{(0)} = (an_{G'}(\mathbf{Z}) \cap an_{G'}(A_{[n]})) \setminus W$, then $X_r$ being an ancestor of $X_h$ means that it is part of $an_{G'}(\mathbf{Z}) \cap an_{G'}(A_{[n]})$. Accordingly, by definition of $V^{(0)}$, $X_r$ would be part of $V^{(0)}$ or of $W$. The former option leads to a contradiction because $X_k \perp_G (X_{[k-1]} \cap A_j) \setminus A_i \mid (X_{[k-1]} \cap A_i), \mathbf{Z}$ for all $j \neq i$ straightforwardly implies the joint separability $X_k \perp_G X_{[k-1]} \setminus A_i \mid (X_{[k-1]} \cap A_i), \mathbf{Z}$. This is because the separability of $X_{[k-1]} \setminus A_i$ follows from the lack of active paths for each of the nodes it contains, and hence is equivalent to the separability of $(X_{[k-1]} \cap A_j) \setminus A_i$ for all $j$, which jointly comprise the same nodes.
The assumption that d-separation implies independence guarantees the independence $X_k \perp_P X_{[k-1]} \setminus A_i \mid (X_{[k-1]} \cap A_i), \mathbf{Z}$ from $X_k \perp_G X_{[k-1]} \setminus A_i \mid (X_{[k-1]} \cap A_i), \mathbf{Z}$, without the need to more broadly require faithfulness. The proof of Proposition 3 relies in an analogous way on the assumption that d-separation implies independence, using it to guarantee the conditional independencies involving the subsets in $B^{(1)}_{[n]}$ and $B^{(2)}_{[n]}$. In step $(d)$ of Equation (A2), the fact that separability for a joint set of nodes is straightforwardly guaranteed by the separability of each of its nodes is again applied, and then mapped to the existence of an independence using this assumption. The fact that the conditions of independence between the groups can be verified using d-separation, instead of estimating independencies from data, is crucial in the case that the groups include hidden variables, which precludes the direct evaluation of these independencies.
The next result whose derivation relies on the assumption that d-separation implies independence is Theorem 2. In step $(a)$ of Equation (A7), faithfulness was invoked to guarantee that conditioning on some ancestors of $\mathbf{Z}$ cannot create new dependencies that were not already created by conditioning on $\mathbf{Z}$ itself. In more detail, it was assumed that if the independence $\widetilde{an}(A_i) \perp_P V^{(j)}_{\mathbf{Z}} \mid \mathbf{Z}$ holds, then $\widetilde{an}(A_i) \perp_P V^{(j)}_{\mathbf{Z}} \mid \{\mathbf{Z}, V^{[j-1]}_{\mathbf{Z}}\}$ also holds. Faithfulness is also invoked in the proof of Theorem 2 to justify the application of Lemma 1(ii) in Equation (A15). In this case, the existence of an independence $A \perp_P C \mid B$ is directly justified in terms of the nonexistence of active paths in the graph, hence guaranteeing $A \perp_G C \mid B$ and subsequently using the assumption that d-separation implies independence to derive $A \perp_P C \mid B$.
The considerations above show that the assumption that d-separation implies statistical independence is enough to derive the existence of groups-decomposition inequalities under the conditions of Propositions 1 and 3, and of Theorem 2. Furthermore, if unfaithful independencies that do not follow from the causal structure are present in the data, this may decrease the power to reject causal structures when testing the inequalities, but will not lead to incorrect rejections. This differs from the impact of unfaithful independencies on the inference of the Markov equivalence class from data [1,2]. In that case, unfaithful independencies can lead to an incorrect reconstruction of the skeleton of the graph or result in contradictory rules for edge orientation. The assumption that d-separation implies statistical independence is substantially weaker than the reverse assumption also included in the faithfulness assumption, namely that statistical independence implies d-separation. The XOR logic gate is an example of how the latter assumption can be violated. Conversely, if the causal graph is meant to reflect the underlying structure of the actual physical mechanisms involved in generating the variables, all statistical dependencies need to originate from some paths of influence between the variables. Accordingly, a d-separation that does not lead to an independence can be taken as an indicator that some structure is missing in the causal graph, namely structure associated with the paths that create the observed dependence. In this regard, it is appropriate to reject a causal structure if it does not fulfill an inequality constraint because graphical separability is not reflected in the corresponding independencies found in the data.

Lemma 1.
The mutual information fulfills the following inequalities in the presence of the corresponding independencies:

(i) (Conditional mutual information data processing inequality): Let $A$, $B$, $B'$, and $D$ be four sets of variables. If $I(A; B' \mid B, D) = 0$, then it follows that $I(A; B \mid D) \geq I(A; B' \mid D)$.

(ii) (Increase through conditioning on independent sets): Let $A$, $B$, $C$, and $Y$ be four sets of variables. If $I(A; C \mid B) = 0$, then $I(Y; A \mid B) \leq I(Y; A \mid B, C)$.

Proof. (i) is proven by applying, in two different orders, the chain rule of mutual information to $I(A; B, B' \mid D)$:
$$I(A; B, B' \mid D) = I(A; B \mid D) + I(A; B' \mid B, D) = I(A; B' \mid D) + I(A; B \mid B', D).$$
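A minimal numeric sanity check of Lemma 1(i) (our own sketch; the binary variables and the channel construction are illustrative assumptions) enforces the premise by sampling $B'$ as a noisy function of $B$ alone, so that $I(A; B' \mid B, D) = 0$ holds by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Joint p(a, b, d): an arbitrary random pmf on binary variables.
p_abd = rng.random((2, 2, 2)); p_abd /= p_abd.sum()
# Channel p(b'|b): B' depends on B only, so I(A; B' | B, D) = 0 holds.
k = rng.random((2, 2)); k /= k.sum(axis=1, keepdims=True)
# Full joint p(a, b, b', d); axes: 0=A, 1=B, 2=B', 3=D.
p = np.einsum('abd,bc->abcd', p_abd, k)

def H(q):
    q = q[q > 0]
    return -(q * np.log2(q)).sum()

def cmi(p, X, Y, Z):
    """I(X; Y | Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z), where X, Y, Z
    are tuples of axis indices of the joint pmf p."""
    def marg(axes):
        drop = tuple(i for i in range(p.ndim) if i not in set(axes))
        return p.sum(axis=drop)
    return H(marg(X + Z)) + H(marg(Y + Z)) - H(marg(X + Y + Z)) - H(marg(Z))

assert cmi(p, (0,), (2,), (1, 3)) < 1e-12   # premise: I(A; B' | B, D) = 0
lhs = cmi(p, (0,), (1,), (3,))              # I(A; B  | D)
rhs = cmi(p, (0,), (2,), (3,))              # I(A; B' | D)
assert lhs >= rhs - 1e-12                   # conclusion of Lemma 1(i)
print(f"I(A;B|D) = {lhs:.4f} >= I(A;B'|D) = {rhs:.4f}")
```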
$\{B^{(1)}_i, B^{(2)}_i\}$ such that $B^{(1)}_i$ fulfills the conditions of independence $B^{(1)}_i \perp B^{(1)}_j \setminus B^{(1)}_i \mid \mathbf{Z}$ and $B^{(2)}_i$ the conditions $B^{(2)}_i \perp B_j \setminus B^{(2)}_i \mid B^{(1)}_i, \mathbf{Z}$ for all $i, j$, and such that $B^{(1)}_{[n]} = \{B^{(1)}_1, \ldots, B^{(1)}_n\}$ and $B^{(2)}_{[n]} = \{B^{(2)}_1, \ldots, B^{(2)}_n\}$ are disjoint. Define the maximal values $d_{B_i}$ like in Proposition 1 but for the augmented groups $B_{[n]} = \{B_1, \ldots, B_n\}$. Then, the conditional information that $B_{[n]}$ has about the target variables $Y$ given $\mathbf{Z}$ is bounded from below by
$$I(Y; B_{[n]} \mid \mathbf{Z}) \;\geq\; \sum_{i=1}^{n} \frac{1}{d_{B_i}}\, I(Y; B_i \mid \mathbf{Z}),$$
with the sum running over the $n$ groups. If $B^{(2)}_i$ is empty for all $i$, Proposition 3 reduces to Proposition 1.

Figure 2. Examples of applications of Proposition 3 (A-C) and Proposition 4 (D-F) to obtain testable inequalities. The causal graphs allow verifying whether the required conditional independence conditions are fulfilled by using d-separation. Variable $Y$ is the target variable; observable variables are denoted by $V$, hidden variables by $U$, and conditioning variables by $Z$. For all examples, the composition of the groups is described in the main text. For graphs using subindexes $i$, $k$ to display two concrete groups, those are representative of the same causal structure for all groups that compose the system. In those graphs, variables with no subindex have the same connectivity with all groups. Bidirectional arrows indicate common hidden parents not included in any group.
$\emptyset$. Define $B_{[n]}$ as the collection of groups that replaces $A_i$ by $B_i$ for those groups following the previous independence conditions. Define $Z^{(1)}_i = \mathbf{Z}$ for the unaltered groups and $Z^{(2)}_i = \mathbf{Z} \setminus Z^{(1)}_i$ for all groups. If $B_{[n]}$ fulfills the conditions of Proposition 3, the inequality derived for $B_{[n]}$ also provides an upper bound for a sum combining conditional and unique information terms for the different groups in $A_{[n]}$:
$$H(Y \mid \mathbf{Z}) \;\geq\; I(Y; B_{[n]} \mid \mathbf{Z}) \;\geq\; \sum_{i=1}^{n} \frac{1}{d_{B_i}}\, I\big(Y; A_i \backslash\backslash Z^{(2)}_i \mid Z^{(1)}_i\big).$$

Figure 3. Examples of the application of Proposition 5 (A-C) and Proposition 6 (D) to obtain testable inequalities. Notation is analogous to Figure 2. The composition of the groups is described in the main text.
$A^{(0)}_i = \emptyset$. Subgroups are analogously defined for $Z_i$, also with $Z^{(0)}_i = \emptyset$. In general, for any ordered set of vectors we use $V^{[k]}_i = \{V^{(0)}_i, V^{(1)}_i, \ldots, V^{(k)}_i\}$ to refer to all elements up to $k$, where in general $V^{(0)}_i$ can be nonempty.

Proposition 6. (Decomposition of information from groups modified with the conditional or unique information data processing inequality across and within groups): Consider a collection of groups $A_{[n]}$, a conditioning set $\mathbf{Z}$, and a target variable $Y$ as in Proposition 1. Consider that for each group $A_i$ there are disjoint partitions $A_i = \{A^{(1)}_i, \ldots, A^{(m_i)}_i\}$ and $\mathbf{Z} = \{Z^{(1)}_i, \ldots, Z^{(m_i)}_i\}$, and a collection of sets of additional variables $C_i = \{C^{(0)}_i, \ldots, C^{(m_i - 1)}_i\}$, with the associated independence conditions holding for $k = 1, \ldots, m_i - 1$. Define the collection $B_{[n]}$ with the modified groups $B_i = \{C_i, A^{(m_i)}_i\}$. If $B_{[n]}$ fulfills the conditions of Proposition 3, the inequality derived for $B_{[n]}$ also provides an upper bound for sums combining conditional and unique information terms for the different groups in $A_{[n]}$:
$$H(Y \mid \mathbf{Z}) \;\geq\; I(Y; B_{[n]} \mid \mathbf{Z}) \;\geq\; \sum_{i=1}^{n} \frac{1}{d_{B_i}}\, I\big(Y; C^{[k_i]}_i A_i \setminus A^{[k_i]}_i \backslash\backslash \mathbf{Z} \setminus Z^{[k_i]}_i \,\big|\, Z^{[k_i]}_i\big), \quad k_i \in \{1, \ldots, m_i - 1\}.$$

The $B^{(1)}_i$ variables fulfill conditions of independence equivalent to Proposition 1. For the sum in $X^{(2)}_{[r_2]}$, only steps $(c)$ and $(d)$ of Equation (A1) are applied, substituting the sets $A_i$ by $B^{(2)}_i$. Inequality $(d)$ holds applying Lemma 1(ii). In more detail, the conditions on the groups, combined with the weak union property of independencies [27], mean that for each $X_k \in B^{(2)}_i$, $X_k \perp \{B^{(1)}_{[n]} \setminus B^{(1)}_i, B^{(2)}_j \setminus B^{(2)}_i\} \mid \mathbf{Z}$ for all $j \neq i$. Assuming faithfulness, this implies $X_k \perp \{(X^{(1)}_{[r_1]} \setminus B^{(1)}_i), (X^{(2)}_{[k-1]} \setminus B^{(2)}_i)\} \mid (X^{(2)}_{[k-1]} \cap B^{(2)}_i), B^{(1)}_i, \mathbf{Z}$. Accordingly, Lemma 1(ii) applies with $A = X_k$, $B = \{(X^{(2)}_{[k-1]} \cap B^{(2)}_i), B^{(1)}_i, \mathbf{Z}\}$, and $C = \{(X^{(1)}_{[r_1]} \setminus B^{(1)}_i), (X^{(2)}_{[k-1]} \setminus B^{(2)}_i)\}$. Equalities $(e)$ and $(f)$ follow from the chain rule of mutual information.

Inequality $(a)$ holds from monotonicity: information cannot decrease when adding $C^{(k)}_i$. Inequality $(b)$ holds from Lemma A1, moving $Z^{(k)}_i$ from the set of reference predictors of the unique information to the conditioning set. Equality $(c)$ follows from the corresponding independence given $\mathbf{Z} \setminus A^{[k]}_i$; accordingly, the unique information is preserved when removing $A^{(k)}_i$ (Proposition 2). This leads to the inequality involving the term $I(Y; C^{[k-1]}_i A_i \setminus A^{[k-1]}_i \backslash\backslash \mathbf{Z} \setminus Z^{[k-1]}_i \mid Z^{[k-1]}_i)$. Finally, this unique information is by construction smaller than $I(Y; B_i \mid \mathbf{Z})$. The terms $I(Y; A_i \backslash\backslash \mathbf{Z} \setminus Z^{(1)}_i \mid Z^{(1)}_i)$ are obtained removing $C^{[1]}_i$ from $I(Y; C^{[k]}_i A_i \setminus A^{[k-1]}_i \backslash\backslash \mathbf{Z} \setminus Z^{[k]}_i \mid Z^{[k]}_i)$ by monotonicity, from step $k = 1$.

The variables in $V^{[j-1]}_{\mathbf{Z}}$ are by construction ancestors of $\mathbf{Z}$. Again, at the level of graphical separability this implication is straightforward and does not require any assumption. This is because, by definition of d-separation, a path is activated both when conditioning on a collider and when conditioning on any descendant of the collider, and $V^{[j-1]}_{\mathbf{Z}}$ being ancestors of $\mathbf{Z}$ means that $\mathbf{Z}$ contains a descendant for each node in $V^{[j-1]}_{\mathbf{Z}}$. Accordingly, no assumption is needed to ensure $\widetilde{an}(A_i) \perp_G V^{(j)}_{\mathbf{Z}} \mid \{\mathbf{Z}, V^{[j-1]}_{\mathbf{Z}}\}$ from $\widetilde{an}(A_i) \perp_G V^{(j)}_{\mathbf{Z}} \mid \mathbf{Z}$. The assumption that d-separation implies independence is then used to ensure $\widetilde{an}(A_i) \perp_P V^{(j)}_{\mathbf{Z}} \mid \{\mathbf{Z}, V^{[j-1]}_{\mathbf{Z}}\}$.

If $m_i = 1$ for all $i$, Proposition 6 reduces to Proposition 3. If $m_i = 2$ and $Z^{(1)}_i = \mathbf{Z}$ for all $i$, we recover Proposition 4, with $B_i = \{C_i, A^{(2)}_i\}$. If $m_i = 2$ for all $i$ and $Z^{(1)}_i \subset \mathbf{Z}$ for some $i$, we recover Proposition 5, with $B_i = \{C_i, A^{(2)}_i\}$ and $Z^{(2)}_i = \mathbf{Z} \setminus Z^{(1)}_i$. Like for previous propositions, some groups may be unmodified, such that $B_i = A_i$.
Proof of Proposition 3. Consider a collection of groups $B_{[n]} = \{B_1, \ldots, B_n\}$, each with a partition in disjoint subsets $B_i = \{B^{(1)}_i, B^{(2)}_i\}$