To develop how minInf terms can be used in causally informative entropic inequalities, we start from the scenario of the standard instrumental entropic inequality (Figure 1A) and consider changes in the causal structure. We first derive (Section 3.1) an instrumental entropic inequality that applies the DP inequality of unique information. We then address the embedding of this new type of inequality together with standard instrumental inequalities derived with multivariate instrumental sets, and we illustrate that they can provide additional causal inference power (Section 3.2). In Section 3.3, we examine instrumental entropic inequalities in which the DP inequalities of conditional mutual information and unique information are combined. This analysis reveals how different types of DP inequalities can be applied recursively. In Section 3.4, we introduce a type of DP inequality for minInf terms, which encompasses as subcases the DP inequalities of conditional mutual information and unique information. We show how to recursively apply these DP inequalities to obtain sums of observable information terms as lower bounds of unobservable information terms. In Section 3.5, we apply this procedure to construct more powerful instrumental entropic inequalities. In Section 3.6, we reexamine the derived minInf inequalities more broadly from a geometrical perspective, in connection with Shannon entropy cones. Finally, in Section 3.7, we apply the procedures developed in Section 3.4 to other types of entropic inequalities beyond the instrumental inequality scenario.
3.1. Instrumental Entropic Inequalities with Maximum Entropy Unique Information Terms: The Case with One Data Processing Inequality Applied
We start by considering how to construct instrumental entropic inequalities with the causal structures of
Figure 1B. Again, the graph displays several causal structures depending on the instantiation of the dashed edges. Similar to the case of
Figure 1A, a requirement for any instrumental entropic inequalities to be causally fulfilled is that the dashed edges between
Z and
Y as well as between
Z and
U are removed. Therefore, we focus on this case with no edges between
Z and
Y and between
Z and
U. A difference between
Figure 1A,B is that
W is a noncollider in
Figure 1A, leading to
and
while
W is a collider in
Figure 1B, such that
and
. For
Figure 1B, the instrumental inequality of Proposition 1 can be applied with
and
. The required independencies
and
are fulfilled, namely they correspond to
and
. This leads to
On the contrary, the fact that
W is a collider leads to a dependence
, and hence in
Figure 1B Proposition 1 cannot be applied selecting
. Note that being able to condition on
would be advantageous because, following the derivation of Proposition 1, it would lead to a tighter upper bound
instead of
. Since what prevents deriving an instrumental entropic inequality with
is that
, as opposed to
, we can consider if using the unique information DP inequality is useful in this case. This is because the unique information has a DP inequality (Lemma 3) that differs from the one of conditional mutual information (Lemma 1) in that it is associated with a conditional independence that excludes the reference variables from the conditioning set. This type of exclusion is precisely what is needed to use
instead of
. We first state a general formulation of an instrumental entropic inequality that uses the unique information and we will then go back to the example of
Figure 1B.
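The effect of conditioning on a collider, which drives the argument above, can be checked numerically on a toy distribution. The following is a minimal sketch and does not reproduce the structure of Figure 1B: the variable names `A`, `B`, `C`, the XOR mechanism, and the helper functions are assumptions introduced only for illustration. Two independent fair bits carry no information about each other marginally, but become fully dependent once their common effect is conditioned on.

```python
import itertools
import math
from collections import defaultdict


def marginal(pmf, idx):
    """Marginalize a joint pmf {outcome tuple: prob} onto the positions in idx."""
    out = defaultdict(float)
    for outcome, p in pmf.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out


def entropy(pmf, idx):
    """Shannon entropy (bits) of the variables at positions idx."""
    return -sum(p * math.log2(p) for p in marginal(pmf, idx).values() if p > 0)


def cond_mutual_info(pmf, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C), all in bits."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))


# Hypothetical toy model: A and B are independent fair bits and the
# collider C = A XOR B is their common effect. Outcomes are (A, B, C).
pmf = {(a, b, a ^ b): 0.25 for a, b in itertools.product((0, 1), repeat=2)}

i_marginal = cond_mutual_info(pmf, (0,), (1,))               # I(A;B)
i_given_collider = cond_mutual_info(pmf, (0,), (1,), (2,))   # I(A;B|C)

print(f"I(A;B)   = {i_marginal:.3f} bits")
print(f"I(A;B|C) = {i_given_collider:.3f} bits")
```

In this toy case the marginal independence holds while the conditional one fails, which is the pattern that motivates excluding a collider from the conditioning set.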
Proposition 2
(Instrumental entropic inequality with maximum entropy unique information).
Consider the variables , X, Y, , and U, all observable except U a hidden variable. Consider that the causal structure is such that, for all , no pair from is separable given that U is hidden. Consider an exclusive partition . Consider that the causal structure imposes the nontestable independencies and . These independencies result in the testable inequality
Proof. The proof is analogous to the one of Proposition 1. Again, the departing quantity is
. Using the chain rule to decompose
as the sum of
and
, the independence
allows deriving
as upper bound, as in Equation (
2). For the lower bound, instead of Equation (
3) that applies the DP inequality of conditional information, the DP inequality of unique information is applied:
Equality
applies the chain rule of mutual information. Inequality
applies the definition of the unique information as a contribution smaller than or equal to the conditional mutual information (Equation (
6)). Inequality
holds because
by Lemma 3 implies the DP inequality of the unique information
. Combining the upper bound
and the lower bound
proves the testable inequality. □
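As a numerical companion to the data processing (DP) step used in this kind of proof, the sketch below checks the textbook, unconditioned DP inequality of mutual information on a randomly generated Markov chain. It is an illustration under stated assumptions, not the exact statement of Lemma 1 or Lemma 3: the chain `Z -> U -> Y`, the alphabet sizes, the random seed, and all helper names are hypothetical choices.

```python
import math
import random
from collections import defaultdict


def marginal(pmf, idx):
    """Marginalize a joint pmf {outcome tuple: prob} onto the positions in idx."""
    out = defaultdict(float)
    for outcome, p in pmf.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out


def entropy(pmf, idx):
    """Shannon entropy (bits) of the variables at positions idx."""
    return -sum(p * math.log2(p) for p in marginal(pmf, idx).values() if p > 0)


def cond_mutual_info(pmf, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C), all in bits."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))


def random_dist(n, rng):
    """Random probability vector of length n."""
    w = [rng.random() for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


# Hypothetical Markov chain Z -> U -> Y, so that Z and Y are independent given U.
rng = random.Random(0)
p_z = random_dist(2, rng)
p_u_given_z = [random_dist(3, rng) for _ in range(2)]
p_y_given_u = [random_dist(2, rng) for _ in range(3)]

# Joint pmf over outcomes (Z, U, Y).
pmf = {
    (z, u, y): p_z[z] * p_u_given_z[z][u] * p_y_given_u[u][y]
    for z in range(2) for u in range(3) for y in range(2)
}

i_zy = cond_mutual_info(pmf, (0,), (2,))              # I(Z;Y)
i_zu = cond_mutual_info(pmf, (0,), (1,))              # I(Z;U)
i_zy_given_u = cond_mutual_info(pmf, (0,), (2,), (1,))  # I(Z;Y|U), zero by construction

print(f"I(Z;Y) = {i_zy:.4f} <= I(Z;U) = {i_zu:.4f}")
```

The independence of `Z` and `Y` given `U` is what licenses replacing a term that involves `Y` by a term that involves `U`; the proofs in this section use the same mechanism in the opposite direction, bounding a term with the hidden variable from below by an observable one.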
In the example of
Figure 1B, Proposition 2 applies with
,
, and
and results in
The inequality of Equation (
11) is causally imposed when the causal structure creates the independencies
and
. These independencies would also exist if in
Figure 1B variables
X and
W were connected by an arc
. On the other hand,
would produce
, and
would produce
.
Comparing Equations (
8) and (
11), the first is derived with
and the second with
,
, and
. To better appreciate the factors that determine their power, we can rewrite them by moving the first term of the r.h.s. to the l.h.s.:
In general, these inequalities are complementary. For example, suppose we want to test the compatibility of a data set with the causal structure in
Figure 1B with no edge
and no edge
. This causal structure creates independencies
,
, and
, and hence causally imposes both inequalities of Equations (
8) and (
11). Therefore, the violation of either of the two inequalities suffices to discard the causal structure. Comparing their form, Equation (
12b) has a smaller or equal upper bound than Equation (
12a), given the monotonicity of entropy under conditioning. However, it also has a smaller or equal lower bound, since the unique information is upper-bounded by the unconditional mutual information (Equation (
6)). This means that, for a concrete data set that has been generated from another causal structure that does not impose the fulfillment of the inequalities, either of the two inequalities can be violated while the other is not, so that their use is complementary for causal inference. Using
instead of
allows decreasing the upper bound, but when
W is a collider between
Z and
Y such that
, a smaller observable lower bound is derived with the unique information and
.
3.2. Instrumental Entropic Inequalities with Multivariate Instrumental Sets
So far, the example of
Figure 1B presented an application of Proposition 2 with a univariate instrument
. However, to establish that the new type of inequalities of Proposition 2 contributes additional causal inference power to the standard instrumental entropic inequalities, we also need to examine the standard instrumental inequalities with multivariate instrumental sets that exist for the same causal structure. To do so, we first highlight three key elements of the structure of instrumental entropic inequalities.
The derivation of both Propositions 1 and 2 departs from the quantity . The first key element is that the two required types of independencies play separate roles in the derivation of instrumental inequalities: is used to derive the observable upper bound, while the other independence is used to derive the observable lower bound. In more detail, is used after is separated into and . The other required independence is used to derive the observable lower bound thanks to a DP inequality, which is applied to . The different DP inequality applied is what differentiates Propositions 1 and 2.
The second key element is better appreciated by rewriting the inequalities of Propositions 1 and 2, moving the first term of the r.h.s. to the l.h.s.:
Written like this, the two inequalities have upper bound
, with a conditioning set
that does not differentiate between
and
, which appear in different arguments of
. Therefore, regarding the upper bound, there is an invariance under the exchange of variables between the
causal discovery instrumental set and the conditioning set
. Alternative instrumental entropic inequalities that would require independencies
, with
, would all lead to the same upper bound
. Accordingly, instrumental inequalities with multivariate instrumental sets obtained under this invariance need to be considered in order to assess the additional causal inference power provided by the new type of inequalities introduced in Proposition 2.
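To make the structure of such a test concrete, the sketch below evaluates the classical instrumental entropic inequality with an empty conditioning set, written in the rearranged form I(Z;X) + I(Z;Y|X) <= H(X). This unconditioned form, the two toy mechanisms, and all function names are assumptions chosen for illustration; they are not the conditioned inequalities derived in this section. The first mechanism respects the instrumental structure (Z independent of the hidden U, no direct effect of Z on Y); the second lets Z act directly on Y.

```python
import itertools
import math
from collections import defaultdict


def marginal(pmf, idx):
    """Marginalize a joint pmf {outcome tuple: prob} onto the positions in idx."""
    out = defaultdict(float)
    for outcome, p in pmf.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out


def entropy(pmf, idx):
    """Shannon entropy (bits) of the variables at positions idx."""
    return -sum(p * math.log2(p) for p in marginal(pmf, idx).values() if p > 0)


def cond_mutual_info(pmf, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C), all in bits."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))


def build_pmf(mechanism):
    """Observable pmf over (Z, X, Y); Z and hidden U are independent fair bits."""
    pmf = defaultdict(float)
    for z, u in itertools.product((0, 1), repeat=2):
        x, y = mechanism(z, u)
        pmf[(z, x, y)] += 0.25  # U is marginalized out here
    return dict(pmf)


def instrumental_gap(pmf):
    """H(X) - [I(Z;X) + I(Z;Y|X)]; a negative value violates the inequality."""
    lower = cond_mutual_info(pmf, (0,), (1,)) + cond_mutual_info(pmf, (0,), (2,), (1,))
    return entropy(pmf, (1,)) - lower


def compatible(z, u):
    """Z -> X <- U, X -> Y <- U: no direct effect of Z on Y."""
    x = z ^ u
    return x, x & u


def direct_effect(z, u):
    """X is constant and Y copies Z: a direct Z -> Y effect."""
    return 0, z


gap_ok = instrumental_gap(build_pmf(compatible))
gap_bad = instrumental_gap(build_pmf(direct_effect))
print(f"compatible model gap:    {gap_ok:+.3f}")
print(f"direct-effect model gap: {gap_bad:+.3f}")
```

A nonnegative gap means the data do not reject the structure; a negative gap, as in the second mechanism, rejects it. Conditioned versions would replace each term by its counterpart given the conditioning set.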
We can see an example of multivariate instrumental set in
Figure 1B, again focusing on the causal structure with no connections
and
. Variable
W, which in the derivation of Equation (
12b) is assigned to
, can be assigned to the instrumental set, leading to the bivariate instrumental set
and to
. The set
fulfills
, namely
. On the other hand,
does not fulfill the other independence condition
required in Proposition 1, since
due to the direct connection between
W and
Y.
This leads us to the third key element of the construction of instrumental entropic inequalities. In Propositions 1 and 2, there is a single set that appears in both independencies, namely in either and , or in and . However, this constraint is not necessary and can be relaxed. In the derivation of an observable lower bound, a DP inequality could be applied to any subset . This is captured in the following Proposition. For later convenience, we now also consider multivariate variables , , and :
Proposition 3
(Chainlike instrumental entropic inequalities with multivariate instrumental sets).
Consider variables , , , , and , all observable except hidden variables. Consider that the causal structure is such that for at least a there is a nonempty subset and such that no pair in is separable with hidden. Consider an exclusive partition in r parts of the multivariate instrumental set given by , with . Consider that the causal structure imposes the nontestable independence . This independence creates a nontestable instrumental entropic inequality
with , where nontestability is due to the nonestimable components of the lower bound. A nontrivial testable instrumental entropic inequality exists if for at least one term at least one conditional independence exists that enables a data processing inequality to substitute that term by an estimable lower bound that contains some variables in and does not contain .
Proof. The derivation of the upper bound is the same as in Propositions 1 and 2. Starting from the chain rule is applied and the upper bound derived with thanks to . The nonestimable lower bound follows from a direct application of the chain rule equality of conditional mutual information to separate into the terms and , followed by the chain rule to separate with the partition . Finally, the proposition states that it suffices that at least one term of the sum in the lower bound can be replaced by at least one observable information term so that a nontrivial testable inequality is obtained, dropping all remaining terms in the lower bound that contain hidden variables. This replacement is possible when applying at least one DP inequality to at least a term . Of course, a testable instrumental inequality is also obtained if more than one term can be replaced by observable lower bounds. □
Note that Proposition 3 does not specify the form of the conditional independencies and associated DP inequalities applied to obtain observable lower bounds. New procedures to do so are to be specified in
Section 3.4. The advantage of this formulation is that it distinguishes between the variables
that appear in the condition
and the subsets
that are involved in the independencies associated with DP inequalities to derive observable terms in the lower bound. This explains why in
Section 2.3 we defined
causal discovery instrumental sets based only on
.
We can now reconsider the multivariate instrumental set of
Figure 1B that assigns
and
. As mentioned,
allows deriving an upper bound
. The reason why Proposition 1 could not be applied is that
due to the direct connection between
W and
Y. However, Proposition 3 can be applied to the example of
Figure 1B selecting a partition with
,
. Then
is decomposed into
. It now suffices to apply the DP inequality based on
to obtain a lower bound
, and hence to obtain the instrumental entropic inequality
The test with this inequality subsumes both tests of Equations (
8) and (
11), derived from Propositions 1 and 2, respectively. This is because it can be rewritten as
, and hence combines the upper bound of Equation (
12b) and the lower bound of Equation (
12a).
The consideration of instrumental inequalities with multivariate instrumental sets rules out that the new type of instrumental entropic inequality of Proposition 2 provides additional causal inference power in the case of
Figure 1B. More generally, we show in
Appendix J that for a multivariate instrumental set
that fulfills an independence
, no additional power can be gained from tests that use only a subset of
as instrumental set, removing the rest of variables or transferring some variables of
into the conditioning set
. However, a causal structure may be such that for
it holds
and yet
. Or with the opposite perspective, using the notation of Proposition 2 with
, the causal structure may be such that
, but
. In this case, Proposition 2 can add causal inference power.
An example of this is illustrated in
Figure 1C. With no direct connection between
Z and
U, the independence
holds. On the other hand,
,
, and
. If any of these independencies existed, the upper bound
would be equally obtained from the corresponding multivariate instrumental set, because of the invariance of the upper bound to exchanges between
and
. In those cases, the variables of
included in the instrumental set could be marginalized instead of requiring the use of a unique information with both variables in the reference argument. However, since these other independencies do not exist, the instrumental inequality constructed from Proposition 2 using
and
adds causal inference power with the test
In this section, we have addressed the issue of whether the new type of entropic inequalities with unique information terms can add causal inference power when tested together with standard instrumental inequalities that use related multivariate instrumental sets. Our objective was to ensure that the new type of inequalities is not trivially subsumed. To do so, we have contemplated two factors that determine when a new inequality test adds causal inference power. First, that there is some hypothesized causal structure of interest that causally imposes the new type of inequality, possibly together with other inequalities. Second, that the form of the new inequality is such that a probability distribution can exist for which the new test is rejected while no other causally-imposed inequality test is simultaneously rejected. On the other hand, if the new inequality is such that when violated also another inequality is always violated, then it does not add additional power. In
Appendix B, we provide a full formal statement of the if and only if conditions under which a new entropic inequality test adds additional power to a set of other tests.
By examining multivariate instrumental sets, we have provided an example in
Figure 1C for which Proposition 2 provides additional causal inference power. More broadly, we have seen in Proposition 3 that a nontestable entropic inequality is associated with each instrumental set. The identification of a set of variables as an instrumental set implies the existence of an observable upper bound, and the lower bound can be decomposed as a sum of nonobservable information terms. Nontrivial testable entropic inequalities are obtained when finding observable lower bounds of some of these terms. Proposition 3 accommodates the inequalities derived in Propositions 1 and 2. Compared to those, it relaxes the condition of Propositions 1 and 2 that requires that no pair from
is separable when
U is hidden, for all
. This is because Proposition 3 is not restricted to the application of one type of DP inequality associated with a specific conditional independence, contrary to Propositions 1 and 2, which are linked to
and
, respectively.
Section 3.3,
Section 3.4 and
Section 3.5 will generalize procedures to convert nontestable instrumental entropic inequalities into testable ones.
3.3. Instrumental Entropic Inequalities with Mutual Information and Maximum Entropy Unique Information Terms: The Case with Two Data Processing Inequalities Applied
We have seen above that, when a single conditional independence is used to derive a lower bound, the unique information DP inequality only increases causal inference power when the variables in the reference argument of the unique information are not part of a valid instrumental set (i.e., no transfer from
as specified in Proposition 2 to
creates a valid instrumental set). We now show that this limitation does not occur when two DP inequalities are used to add terms in the lower bound. We show that unique information terms can be added in the lower bound not only instead of conditional information terms, but in addition to them, resulting in an increase of causal inference power. We use this scenario to illustrate the procedure that will then be generalized to combine an arbitrary number of DP inequalities, comprising DP inequalities of conditional mutual information, unique information, and of minInf information terms, as will be introduced in
Section 3.4.
Proposition 4
(Instrumental entropic inequalities with conditional mutual information and unique information terms).
Consider variables , , , , and , all observable except hidden variables. Consider that the causal structure is such that for at least a there is a nonempty subset and such that no pair in is separable when is hidden. Consider that the causal structure imposes the nontestable independencies , , and , with . These independencies result in the testable inequality
Proof. The upper bound is derived as in Proposition 1, given the independence
. To derive the lower bound, start with
after extracting
with the chain rule of mutual information:
Inequality
applies the DP inequality of conditional mutual information (Lemma 1) thanks to
. Equality
applies the chain rule equality of mutual information. Inequality
holds from the definition of unique information, which has conditional mutual information as an upper bound (Equation (
6)). Inequality
applies the DP inequality of unique information with
. □
The examples of
Figure 2A,B illustrate how the two types of DP inequalities are combined. For simplicity, in contrast to
Figure 1, in
Figure 2 we only represent individual causal structures (no dashed connections). Our objective here is not to derive all existing inequalities in these graphs, but to illustrate the procedure to construct inequalities combining the two types of DP inequalities and to examine the additional causal inference power they can provide. In these causal structures, select the instrumental set
and conditioning set
. With this selection, both causal structures impose inequalities of the type of Proposition 4.
In the example of
Figure 2A, given
, the upper bound
is derived with
. Following Proposition 4, the term with
in the r.h.s.,
, is obtained with
, from the mutual information DP inequality. The term with
,
, is obtained with
, from the unique information DP inequality. The derivation corresponds to the assignment in Equation (
17) of
,
, and
. Note that it is necessary to exclude
from the conditioning set because
, for any nonempty
. In
Figure 2B, the upper bound and the term with
are derived in the same way. On the other hand, the term
results from
, with
. The unique information is chosen with reference variable
as opposed to the reference variable
for
Figure 2A. This difference is due to
in
Figure 2A, which renders
a descendant of the collider
in
, which needs to be removed from conditioning to create an independence. Note that while for simplicity
Figure 2A,B show specific causal structures, Proposition 4 also applies under certain changes of these examples. The same inequalities apply in any graph in which arrows are assigned such that
continues to be a noncollider in
and
, or if either
or
is not present. The graphs could also comprise
or
, with
a noncollider.
We will now illustrate that instrumental inequalities of the type of Proposition 4 can add extra causal inference power. To do so, we show that, for example in
Figure 2A, this type of instrumental entropic inequalities is not subsumed by any standard instrumental entropic inequality that uses a multivariate instrumental set. We here justify the additional causal inference power verifying that no standard instrumental inequality can jointly use the DP inequalities that introduce variables
and
. This guarantees that it does not happen that the new type of instrumental inequality is only violated in cases in which already a standard instrumental inequality is violated. To complement this reasoning, in
Appendix D we examine concrete numerical examples in which no rejection occurs for tests based on standard instrumental inequalities, while rejections are obtained when incorporating unique information terms as in Proposition 4.
Consider the multivariate instrumental sets that can be used as an alternative given the selection of the instrumental set
and conditioning set
in
Figure 2A,B. We know from
Section 3.2 that the same upper bound would hold for any valid instrumental set obtained from a transfer between these sets, namely with
, and
, for
. Given the direct connection between
and
U, no set containing
fulfills the criterion
to be an instrumental set. On the other hand,
, so that
is a valid instrumental set with
.
We now focus specifically on
Figure 2A. Starting from
, we examine different partitions of
that can be selected to apply Proposition 3. One partition is
, which results in terms
and
. Given that
has direct connections to
and
, and hence it is nonseparable from them, no DP inequality can be applied to
. For
, the DP inequality associated with
can be applied. However, it is not possible to introduce
, which requires conditioning on
in
. The opposite partition
results in terms
and
. Again, no DP inequality is applicable to
. We see that the remaining term
corresponds to the term used as starting point when applying Proposition 4. This shows that, in this example, starting from the multivariate instrumental set leads back to
, and
.
The key element that leads the combination of the mutual information and unique information DP inequalities to add causal inference power is the intertwined requirements in the independencies
and
. In
Figure 2A, conditioning on
is necessary to separate
Z from
, since
is a noncollider in
. At the same time,
cannot appear in the conditioning set to separate
Z and
, since it is a collider in
. This means that
cannot simply be marginalized to exploit jointly the two independencies
and
. It needs to first appear in the conditioning set (when applying the DP inequality of conditional mutual information) and then be excluded from the conditioning set (leading to the application of the unique information DP inequality).
This analysis of
Figure 2A highlights the difference with the scenario addressed in
Section 3.1, in which a single DP inequality was applied. With a single DP inequality, the unique information DP inequality can only contribute to increasing causal inference power when the variables that appear in the reference argument of the unique information cannot be part of a valid instrumental set. On the other hand, when combining DP inequalities, it is the intertwined structure of the independencies associated with different types of DP inequalities that requires their combination. To further highlight this point, in
Appendix C we compare in more detail
Figure 2A,B. For
Figure 2B, the inequality of the type of Proposition 4 derived with
and
does not add causal inference power to the instrumental inequality derived with
and
that relies only on the DP inequality of conditional mutual information. The key difference is that in
Figure 2A conditioning on
is necessary to create the independence between
Z and
, while in
Figure 2B the independence
also holds. This does not create the intertwined structure of
and
, which require respectively the conditioning on
and not conditioning on
(see
Appendix C for details).
In this section, we have examined how the DP inequality of conditional mutual information and of unique information can be used sequentially to introduce new observable information terms in the lower bound of an instrumental entropic inequality. Note that the chainlike instrumental entropic inequality of Proposition 3 accommodates the use of Proposition 4. Proposition 3 indicates potential partitions of
into
, while Proposition 4 describes a procedure to derive observable information terms that can be introduced in parallel starting separately from different summands
of the r.h.s. of Equation (
14).
In the next section we will see that the sequential addition of observable information terms can be extended with a more general type of minInf DP inequalities. With Proposition 4, we have seen that the combination of the DP inequality of conditional mutual information and unique information allows sequentially including and then removing from the conditioning set variables that are required to create an independence between and , but that preclude an independence between and . This is achieved because, while the DP inequality of conditional mutual information operates in the original joint distribution , the DP inequality of unique information operates within the family of distributions that only preserve and . It is the exclusion of and of from that allows exploiting an independence with only in the conditioning set instead of . With the same logic, further relaxations of which marginals are preserved will allow us to sequentially combine more DP inequalities.
As a last remark, so far the introduction of new types of instrumental entropic inequalities (Propositions 2 and 4) has been accompanied by the comparison to related standard instrumental entropic inequalities with multivariate instrumental sets. This was necessary to verify that the new entropic inequalities do provide additional causal inference power. The examples of
Figure 1C and
Figure 2A show that indeed additional causal inference power can be gained either because of the lack of validity of corresponding multivariate instrumental sets (
Figure 1C), or because of the intertwinement between the conditioning sets that appear in different independencies, which requires the application of the unique information DP inequality (
Figure 2A). In the next sections, we will not proceed in the same way, and instead we will focus exclusively on developing instrumental entropic inequalities that add more minInf information terms in the lower bound. The verification that this addition can further increase causal inference power follows the same logic as these previous examples. Numerical examples will be provided in
Appendix H to illustrate that the addition of more minInf terms together with unique information terms increases causal inference power. It is out of the scope of this work to provide a full taxonomy of when instrumental inequalities that exploit certain types of DP inequalities are subsumed by instrumental inequalities that only exploit a subset of those types of DP inequalities. Only in
Appendix J, we derive a hierarchy between specific types of instrumental entropic inequalities with related instrumental sets.
3.4. Recursive Use of Data Processing Inequalities to Add Observable minInf Information Terms as Lower Bounds of Information Terms with Hidden Variables
We have identified the DP inequality of unique information as the key property that allows increasing causal inference power using unique information terms. This raises the question of whether analogous DP inequalities exist for other minInf information terms defined with other sets of constraints on the preserved marginals and if so, how to recursively use these DP inequalities to insert additional observable information terms into entropic inequalities. We now show that indeed there is such a DP inequality for a more general form of minInf information terms.
This section contains our core results on how to exploit minInf DP inequalities. We have used the instrumental entropic inequality as the backbone of our presentation, but in
Section 3.6 we will describe a wider framework for the finding of new entropic inequalities and in
Section 3.7 we will provide further examples of the applicability of the tools developed here. To help differentiate between general results and results specific to the instrumental inequality scenario, we continue to use a notation of variables specific to the instrumental scenario, separate from the notation used for general results. We present a general DP inequality for minInf terms using the same notation as the DP inequalities of mutual information (Lemma 1) and unique information (Lemma 3). We then show how to iteratively combine minInf DP inequalities to add new observable terms into entropic inequalities. In
Section 3.5, we will show how to apply these tools concretely to the instrumental scenario.
Proposition 5
(Data processing inequality in predictor variables of minInf information terms preserving sets of marginals).
Let , , , , and be five nonoverlapping sets of variables. Consider a probability distribution and the family of distributions that share the set of marginals and for , where is a collection of subsets and . If the distribution is such that , then
where is the family of distributions that preserve and for , and is the family of distributions that preserve and for .
Proof. Given the chain rule of mutual information
and
Now consider a distribution that minimizes Equation (
21), namely
Construct
. Given that
, it preserves the marginals
for
and
. Furthermore,
by construction preserves
, which means that it preserves
, and hence
, since all other constraints to preserve marginals are the same in
and
. Since the first two terms in the sum of Equation (
20) do not depend on
their minimization is the same in
or
and
minimizes their sum, which is equal to the one in Equation (
21). By construction of
, the independence
holds and hence, using the weak union axiom of semi-graphoids for mutual information [
25,
43], also the independence
holds. This means that the last term in the sum of Equation (
20) is zero for
. Therefore,
minimizes the r.h.s. of Equation (
20), which is equal to the r.h.s. of Equation (
21), so that
is equal to
, with the minima reached by
and
, respectively. Furthermore, monotonicity of mutual information guarantees that information can only decrease when removing variable
from
, namely
is smaller than or equal to
. Finally,
by definition is smaller than or equal to
. □
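The weak union step invoked in the proof can be written out explicitly in terms of mutual information. The symbols A, B, C, and D below are generic placeholders introduced here as an assumption; they are not the notation of Proposition 5. The implication follows from the chain rule together with the nonnegativity of conditional mutual information:

```latex
\begin{align}
I(A; B, C \mid D) &= I(A; C \mid D) + I(A; B \mid C, D), \\
I(A; B, C \mid D) = 0 \;&\Longrightarrow\; I(A; C \mid D) = 0
  \ \text{ and } \ I(A; B \mid C, D) = 0 .
\end{align}
```

Both summands on the right-hand side of the first line are nonnegative, so a vanishing sum forces each of them to vanish; the second of the two resulting independencies is the weak union property.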
Proposition 5 encompasses Lemmas 1 and 3 as subcases. The DP inequality of conditional mutual information is subsumed with , , and , such that corresponds to the joint original distribution. The DP inequality of unique information is subsumed with , , , . This results in a unique information , with , as in Lemma 3.
Given this DP inequality for minInf information terms, we now describe how it can be used to iteratively add new observable information terms in a lower bound of a minInf information term containing as predictor hidden variables. We start with an example to gain some intuition of the procedure. For this purpose, we recap how the DP inequalities of mutual information and unique information are sequentially combined in Proposition 4, concretely in the example of
Figure 2A. We then point out how a similar procedure can be used to sequentially combine more minInf DP inequalities, using
Figure 2C as an example.
The first row of
Table 1 summarizes how the relaxation of preserved marginals allows combining the DP inequalities of mutual information and unique information in
Figure 2A to sequentially insert
and
. The key aspect of this relaxation is that only the variables involved in the independence
are preserved in the marginal
that includes the hidden variable. This allows applying the unique information DP inequality to insert
, while
does not allow applying the mutual information DP inequality. The second row of
Table 1 summarizes the analogous procedure applied to
Figure 2C to combine the DP inequalities of mutual information and unique information to sequentially insert
and
. The corresponding instrumental entropic inequality that holds for
Figure 2C is examined in more detail in
Section 3.5. Here, our aim is to show that the same relaxation of the preserved marginals allows applying a third minInf DP inequality to insert
.
This is shown in the third row of
Table 1. The preserved marginals
in the third column of (ii), which allow applying the DP inequality of unique information, are the starting set in the first column of (iii). Then a new relaxation of the marginals divides
into
. This allows preserving in
only the variables involved in the independence
, while
. The other marginal
is left unchanged during the relaxation. We then recognize in the structure of the marginals preserved after the iterative application of the relaxations the pattern of constraints of the families of distributions considered in Proposition 5, namely the fact that the hidden variables and conditioning variables involved in the subsequent conditional independence to be exploited are the only ones included together with
in the marginal distribution that plays the role of
. Accordingly, the minInf DP inequality of Proposition 5 is used to insert
.
We now formalize how DP inequalities of minInf terms that are determined by sequential relaxations of the preserved marginals can be combined:
Theorem 1
(Iterative addition of observable minInf information terms to lower bounds of unobservable minInf information terms).
Consider nonoverlapping sets of variables , , and with all observable except hidden variables. For , consider a nonempty collection of observable nonoverlapping sets of variables , with . Consider a collection and a collection such that , , and , , for . Consider a collection such that and , for . Consider the collections of sets of variables and , with , , and iteratively constructed such that , , , and for , , , and , so that , for . Consider a joint distribution . Consider the family of distributions preserving and , for . Consider the family of distributions preserving and , for . If , then
Theorem 1 provides a way to iteratively add observable terms to the lower bound of nonobservable information terms with hidden variables. A full understanding of how it proceeds can be gained from its proof. In the rest of this section, we highlight its main properties, describing the transition from the exploitation of independence
in iteration
to the exploitation of
in iteration
k. We start with the simplest scenario, in which
,
, and
, for
. This case already allows appreciating the core of the recursiveness. It corresponds to the scenario in which the DP inequalities being used all have the same target variable
and rely on the same set of hidden variables
. Given that
for
, the iterative construction of
and
can be simplified to
,
, and for
,
and
. The auxiliary variables
are not needed when
for
because their role is only to add
in
. Furthermore, for
, we have
instead of
because an equality
leads to applying a DP inequality in step
with the independence
and in step
j with
. These two steps can then be merged in a new step
that jointly adds
given
, based on the
contraction axiom of semi-graphoids [
25,
43].
For this simplest scenario, we now highlight the core of the recursiveness. The independencies of step
and
k are
and
, respectively. The family
preserves
and
for
, and given that
, it hence preserves
. In iteration
, variables
are introduced using Proposition 5 with
. Here the variables
of Proposition 5 are assigned as
. The family
plays the role of
in Equation (
19). When moving from
to
, the preservation of
is relaxed to the preservation of two of its marginals, namely
and
. The first is a marginal because
. The second is a marginal because
is removed. Now
only appears in
. The variables
are introduced analogously to
, using Proposition 5 now with
. Here
are assigned as
. The family
plays the role of
in Equation (
19). The second term at the r.h.s. of Equation (
23) has the same form as the one at the l.h.s., replacing
by
k. Comparing the two independencies used in steps
and
k,
used in
is a subset of
used in
, except for the possible addition of variables from
. This follows the same pattern already seen in
Figure 2A with
and
, where
,
,
, and
, or in
Figure 2B with
and
, where
,
, and
are the same, and
.
We now provide an overview of the remaining scenarios. Concrete examples are described in
Section 3.5 and in
Appendix F. These other scenarios comprise cases in which
or
are not constant for
. The cases with nonconstant
, given that
, correspond to cases in which some of the hidden variables
are marginalized before applying subsequent DP inequalities. This happens when in step
j the variables
are colliders or descendants of colliders in paths that lead to
as opposed to
. An example will be shown in
Figure A2B.
Similarly, if
for
, the cases with nonconstant
for some
j correspond to cases in which some target variables are marginalized to apply subsequent DP inequalities. This is because the relations
,
, for
simplify to
when
for
. This happens in step
j when
, as opposed to
, because the variables
have active paths reaching
. An example will be shown in
Figure A2C.
Finally, the case in which
for some
j covers cases in which conditioning on
is necessary to create the independence
, with
. Accordingly, in step
j the variables
are moved from target variables to conditioning variables using a chain rule of the information terms. An example will be shown in
Figure A2D. Furthermore, the marginalization of some variables in
, the marginalization of some variables in
, and the conditioning on some
can co-occur in the same step
j, leading to the final general formulation of Theorem 1. The commonality to all scenarios is that in Equation (
23), while the term at the l.h.s. is not observable, the first term at r.h.s. does not depend on any hidden variable and hence leads to an observable term by relaxing the preservation of
to
. Furthermore, the second term of the r.h.s. has the same form as the term at the l.h.s., which means that Theorem 1 can be applied recursively.
3.5. Instrumental Entropic Inequalities with Sums of minInf Information Terms
We now show how to use Theorem 1 to create testable entropic inequalities from the nontestable inequalities of Proposition 3. Reexamining the terms
,
, that appear in the r.h.s. of Equation (
14), we see that each of these terms can be the starting point to iterate the addition of observable information terms using Theorem 1. The application of Theorem 1 to a nonempty subset of a partition
converts a nontestable instrumental entropic inequality from Proposition 3 into a testable one.
Proposition 6
(Testable instrumental entropic inequalities from the iterative application of data processing inequalities to minInf information terms).
Consider nonoverlapping sets of variables , , , and , all observable except hidden variables. Consider that the joint distribution of these variables is generated from a causal structure that creates the independence , which leads to a nontestable entropic inequality of the form . Consider an exclusive partition in r parts of the instrumental set given by , such that is separated in the sum of r nonestimable information terms , . Select . Consider a nonoverlapping subset , , such that each corresponds to a different . Consider that for each it is possible to iteratively apply Theorem 1 with an initial assignment of its inputs as , with . Accordingly, for each , it is possible to construct collections , , , , , and , with , which are associated with sets of independencies , for , which are imposed by the causal structure. This leads to the testable instrumental entropic inequality
where each family of distributions preserves and , for .
Proof. Proposition 6 follows directly from the iterative application of Theorem 1 to a subset of the nonobservable information terms in the sum of Equation (
14). In each case, the theorem is applied starting from a different set of variables
, namely
, where
corresponds to some
. The variables in
not included in
are marginalized. Theorem 1 describes the properties that must be fulfilled by the collections
,
,
,
,
, and
. The requirements
and
ensure that at least one observable information term is added in the lower bound, such that a nontrivial entropic inequality is testable. The form of the resulting testable inequality is determined by which independencies are imposed by the causal structure of interest, that is, which sets of independencies
, for
,
are combined to apply DP inequalities that add estimable information terms at the lower bound. □
The inequality of Proposition 6 encompasses those of Propositions 1, 2, and 4. One may ask why the term
is always separated before starting to apply DP inequalities. In fact, in the case of Proposition 1 where the standard DP inequality is applied, this is not a differentiating factor, since the r.h.s. of Equation (
1) is equal to
. However, when a DP inequality is applied in combination with relaxations of the constraints on the marginals to be preserved, this changes. As elaborated after the proof of Theorem 1 in
Appendix E, in order to obtain a lower bound as tight as possible, the constraints on the marginals should always be as strong as possible, while remaining loose enough to allow the application of the subsequent DP inequalities. Given that the term
is observable, a minimization after a relaxation of the constraints that would not preserve
results in an equal or smaller lower bound. This means that, to obtain the highest lower bound, DP inequalities should be applied starting with
, after the separation of
. Accordingly, we will further illustrate in
Figure A2A that
and
play the same role in the derivation of an estimable lower bound.
We now examine in detail the application of Proposition 6 to the example of
Figure 2C. The causal structure of
Figure 2C is analogous to the one of
Figure 2A, with some differences: It contains an additional predictor
and a new conditioning variable
. The conditioning variable
has been removed to simplify the figure, but could be kept as in
Figure 2A with no qualitative effect on our reasoning. To simplify the explanation, we now focus on the construction of an instrumental entropic inequality with instrumental set
and conditioning set
. We do so because this suffices to illustrate the iterative application of minInf DP inequalities with Proposition 6. See
Appendix I for a more detailed analysis of this example. With
and
, we apply Proposition 6 with
, using the independencies
,
, and
. The derived entropic inequality is
where
preserves the marginals
and
preserves the marginals
. The second term in the r.h.s. is the minInf term that corresponds to
. Variable
is inserted thanks to
, using the standard DP inequality of Lemma 1. Variable
is inserted thanks to
, using the DP inequality of unique information of Lemma 3. Finally, variable
is inserted thanks to
, using a minInf DP inequality of the form of Proposition 5.
Our objective with this example was to illustrate the iterative insertion of estimable information terms in the lower bound. As mentioned above, an extended analysis of instrumental entropic inequalities for the causal structure of
Figure 2C is presented in
Appendix I. This extended presentation will cover alternative entropic inequalities that are derived with the multivariate instrumental set
, when using different partitions to create chainlike instrumental entropic inequalities of the form of Equation (
14). Concretely, the inequality of Equation (
25) is subsumed by the one of Equation (A17e).
Appendix I provides further evidence of the increase in causal inference power thanks to the addition of minInf terms, since they allow obtaining tighter lower bounds thanks to the combination of conditional independencies that cannot be jointly used in a standard instrumental inequality.
This completes our extension of instrumental entropic inequalities, from the standard form reviewed in Proposition 1 to the minInf instrumental entropic inequalities of Proposition 6. We have used the specific scenario of instrumental inequalities to structure the presentation of our core contributions, namely the theoretical derivation of DP inequalities for minInf information terms (Proposition 5) and the iterative procedure to combine them (Theorem 1). In the rest of our Results, we will more broadly show how to apply these tools to derive other entropic inequalities apart from instrumental inequalities.
While our main contribution focuses on the theoretical derivation of the properties of minInf terms that render them useful for causal structure learning, in
Appendix G we also discuss their estimation. As we explain, this estimation constitutes a non-convex minimization problem [
44] and a general implementation is beyond the scope of this work. Nonetheless, in Lemma A2 we recast minInf terms by separating a convex and a non-convex component of the minimization problem. We use this approach to extend the numerical examples presented in
Appendix D, which show the gain in causal inference power obtained with Proposition 4. In this way, in
Appendix H we also provide numerical examples in which it is only thanks to the addition of a second minInf term, like in Equation (
25), that a rejection is obtained when testing the entropic inequality.
3.6. The Region of minInf Shannon Entropy Cones
In previous sections we have derived entropic inequalities with increased causal inference power by introducing minInf DP inequalities. We here more broadly reformulate this derivation from a geometrical perspective. In a geometrical perspective of entropy [
22], entropy values associated with a set of variables
are represented as a point in a
space. In more detail, given a joint distribution
, for the set of indices
associated with the variables, an entropy value is obtained for each subset of indices
. This entropy value corresponds to the joint entropy
of the marginal probability distribution
. Given that the
power set of subsets of
contains
subsets, a vector constructed with all entropy values
lies in a
space. The region in this space containing all points obtainable from probability distributions, the
entropy cone, forms a convex cone (Theorem 15.5 in [
22]), but its explicit characterization is unknown. However, an approximation of this region is given by the
Shannon entropy cone, which includes all points that comply with the following linear inequality constraints:
where
S and
T are two subsets of
. These inequalities are known as the
basic inequalities [
22,
28] and can be expressed as linear inequalities only involving entropy terms, hence introducing constraints among different components of the entropy vectors. The basic inequalities impose requirements for any well-defined probability distribution, namely the nonnegativity of entropy (Equation (26a,b)), the monotonicity of entropy (Equation (
26b)), and the nonnegativity of conditional mutual information (Equation (
26c)), associated with the submodularity of entropy.
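To make the geometrical picture concrete, the following minimal sketch builds the entropic vector of a small joint distribution and checks it against the basic inequalities. It assumes the standard form of these inequalities (nonnegativity, monotonicity, and submodularity of entropy over subsets S and T), which is how Equation (26) is described in the text. The function names and the XOR example are hypothetical illustrations.

```python
import itertools
import math

def entropy_vector(p, n):
    """Map every subset of {0, ..., n-1} to its joint entropy in bits.

    `p` is a joint pmf {outcome tuple: prob}. The empty set is stored with
    entropy 0 for convenience; the remaining 2**n - 1 entries form the
    entropic vector.
    """
    h = {frozenset(): 0.0}
    for r in range(1, n + 1):
        for subset in itertools.combinations(range(n), r):
            marg = {}
            for outcome, prob in p.items():
                key = tuple(outcome[i] for i in subset)
                marg[key] = marg.get(key, 0.0) + prob
            h[frozenset(subset)] = -sum(
                q * math.log2(q) for q in marg.values() if q > 0)
    return h

def satisfies_basic_inequalities(h, tol=1e-9):
    """Check nonnegativity, monotonicity, and submodularity for all S, T."""
    subsets = list(h)
    for s in subsets:
        if h[s] < -tol:                                   # H(S) >= 0
            return False
        for t in subsets:
            if h[s | t] < h[s] - tol:                     # H(S u T) >= H(S)
                return False
            if h[s] + h[t] < h[s | t] + h[s & t] - tol:   # submodularity
                return False
    return True

# Example: X2 = X0 XOR X1 with uniform, independent X0 and X1.
p_xor = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
h_xor = entropy_vector(p_xor, 3)

# A vector that cannot come from any distribution: the joint entropy of all
# three variables is set below the entropy of a pair (monotonicity fails).
h_bad = dict(h_xor)
h_bad[frozenset({0, 1, 2})] = 0.5
```

Any vector obtained from a well-defined distribution passes the check, whereas passing the check does not guarantee that a vector is entropic, which is why the Shannon entropy cone is only an approximation of the entropy cone.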
These basic inequalities are constraints that apply to any entropic vector created from a well-defined probability distribution. Furthermore, if a joint probability distribution of interest is generated under additional constraints, such as the compatibility with a certain causal structure, then the set of independencies induced by the causal structure adds extra equality constraints to the basic inequalities, namely in the form of conditional mutual information terms with zero values. In the presence of hidden variables, the cancelation of conditional mutual information terms involving the hidden variables is not verifiable. However, given the set combining the basic inequalities and the causally-induced equalities, it has been shown [
14,
24] that causally informative entropic inequalities, such as the standard instrumental entropic inequality, are derived by marginalization of the hidden variables to obtain inequalities that only involve observable variables. This marginalization has been algorithmically implemented using Fourier-Motzkin elimination, a standard linear programming algorithm for the elimination of variables from systems of inequalities [
14].
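As a rough illustration of the elimination step mentioned above, the sketch below implements one pass of Fourier-Motzkin elimination on a toy linear system. It is a generic textbook version written for this illustration, not the implementation used in the cited works. The row encoding `(coeffs, bound)`, meaning `coeffs · x <= bound`, and the toy system are assumptions.

```python
def fourier_motzkin(rows, k):
    """Eliminate variable k from a system of linear inequalities.

    Each row is (coeffs, bound) and encodes sum_i coeffs[i] * x_i <= bound.
    Rows with a positive coefficient on x_k are paired with rows with a
    negative one, using positive multipliers so that x_k cancels.
    """
    pos, neg, rest = [], [], []
    for coeffs, bound in rows:
        c = coeffs[k]
        (pos if c > 0 else neg if c < 0 else rest).append((coeffs, bound))
    out = list(rest)
    for cp, bp in pos:
        for cn, bn in neg:
            lp, ln = -cn[k], cp[k]   # both multipliers are positive
            coeffs = tuple(lp * a + ln * b for a, b in zip(cp, cn))
            out.append((coeffs, lp * bp + ln * bn))
    return out

# Toy system in unknowns (x, y, u), where u stands in for a quantity that
# involves hidden variables:  x - u <= 0  and  u - y <= 0.
system = [((1, 0, -1), 0), ((0, -1, 1), 0)]
projected = fourier_motzkin(system, k=2)   # eliminate u
```

In this toy case the projection leaves the single inequality `x - y <= 0`, which involves only the unknowns that remain. In the entropic setting the unknowns would be components of entropic vectors, and the number of generated rows can grow quickly with each eliminated variable.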
The derivation of entropic inequalities with minInf terms can be formulated as an analogous marginalization problem, but starting from the region of a
minInf Shannon entropy cone, which generalizes the region of the Shannon entropy cone from individual distributions to families of distributions sharing sets of constraints. In more detail, consider a minInf term
where
,
are sets of variables, without specification of which variables are observable or hidden. The family of distributions
is defined as preserving a set of marginals from a joint original distribution
, with
. The minimum within
determines a set of distributions, at least one, that reach the minimum. Select a distribution
among the ones reaching the minimum and consider the region that its entropic vector can occupy. This region is restricted by basic inequalities (Equation (26)) and also by the constraints intrinsic to the definition of the minInf term in Equation (
27). First, there is an additional constraint
, associated with the minimization, since
and
is a minimum. Second, there is a constraint
for any
that appears in one of the marginal distributions preserved in
.
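The definition of a minInf term can be illustrated in its simplest instance. The sketch below assumes three binary variables, hypothetically named X, Y, and Z, and a family that preserves only the pairwise marginals of (X, Y) and (X, Z), which is the setting of a unique information term. In this special case the family has one free parameter per value of X, so a plain grid search suffices. This is an illustration of the definition only and not the estimation procedure discussed in Appendix G.

```python
import math

def cond_mutual_info_xy_given_z(q):
    """I(X;Y|Z) in bits for a pmf q[(x, y, z)]."""
    def marg(keep):
        m = {}
        for o, pr in q.items():
            key = tuple(o[i] for i in keep)
            m[key] = m.get(key, 0.0) + pr
        return m
    pz, pxz, pyz = marg((2,)), marg((0, 2)), marg((1, 2))
    total = 0.0
    for (x, y, z), pr in q.items():
        if pr > 0:
            total += pr * math.log2(pr * pz[(z,)] / (pxz[(x, z)] * pyz[(y, z)]))
    return total

def min_inf_grid(p, steps=50):
    """Grid-search the minimum of I(X;Y|Z) over pmfs Q that preserve the
    marginals P(X,Y) and P(X,Z) of `p` (binary variables only)."""
    px = {x: sum(pr for o, pr in p.items() if o[0] == x) for x in (0, 1)}
    ranges = []
    for x in (0, 1):
        a = sum(pr for o, pr in p.items() if o[0] == x and o[1] == 1)  # Q(x, y=1)
        b = sum(pr for o, pr in p.items() if o[0] == x and o[2] == 1)  # Q(x, z=1)
        ranges.append((a, b, max(0.0, a + b - px[x]), min(a, b)))
    best = float("inf")
    for i in range(steps + 1):
        for j in range(steps + 1):
            q = {}
            for x, k in ((0, i), (1, j)):
                a, b, lo, hi = ranges[x]
                t = lo + (hi - lo) * k / steps       # free parameter Q(x, y=1, z=1)
                # max(0.0, ...) guards against tiny negative rounding errors.
                q[(x, 1, 1)] = max(0.0, t)
                q[(x, 1, 0)] = max(0.0, a - t)
                q[(x, 0, 1)] = max(0.0, b - t)
                q[(x, 0, 0)] = max(0.0, px[x] - a - b + t)
            best = min(best, cond_mutual_info_xy_given_z(q))
    return best

# Y copies X while Z is an independent fair coin: the family contains a
# single distribution, so the minimum equals I(X;Y|Z) of the original.
p_copy = {(x, x, z): 0.25 for x in (0, 1) for z in (0, 1)}

# Y and Z are conditionally independent noisy copies of X (10% flips).
p_noisy = {(x, y, z): 0.5 * (0.9 if y == x else 0.1) * (0.9 if z == x else 0.1)
           for x in (0, 1) for y in (0, 1) for z in (0, 1)}
```

For `p_noisy`, the original distribution has a strictly positive I(X;Y|Z), but the family also contains a distribution in which Y and Z coincide, so the minimum within the family is zero. This shows how the value of a minInf term depends on which marginals the family preserves.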
In the presence of additional constraints induced by a causal structure, the same constraints are imposed on P and for those independencies associated with variables whose joint marginals are preserved. For example, for a causally-induced independence , the constraints and are imposed if is preserved in . Overall, the entropic vector associated with the original distribution P and the vector associated with are coupled. Furthermore, the set of constraints that characterizes the accessible region for entropic vectors also includes the minInf DP inequalities. This is because the derivation of these DP inequalities results from the definitions in terms of the minimization operator (see the proofs of Lemma 3 and Proposition 5), that is, the DP inequality holds specifically for distributions at the minimum. Without imposing the DP inequalities as constraints to bound the region accessible to entropic vectors, the entropic vector of could correspond to any distribution within fulfilling , without corresponding to the minimum.
This type of coupling among entropic vectors does not result only from constraints involving the original distribution P. Consider two families and , with , and a term to be minimized. Consider distributions and that reach the minimum within and , respectively. Since , this means that there is a constraint . Furthermore, there are also constraints for any that appears in the shared preserved marginals of and . The same happens with the additional constraints induced by a causal structure. If both and preserve the marginal distribution containing the variables associated with a causally-imposed conditional independence, this results in information terms with zero values, creating a further coupling between entropic vectors corresponding to distributions in the two families.
Overall, there is a coupling between entropic vectors, comprising the one of the original probability distribution and those of the probability distributions defined in terms of minimizations within families that preserve certain sets of marginals. The resulting set of constraints includes constraints of different types. First, the basic inequalities of Equation (26), which apply to all distributions. Second, inequalities related to the definition of the minInf terms within a family of distributions preserving a set of marginals. This includes equalities between entropies when two families of distributions share marginals that are preserved. It also includes inequalities between information terms when they result from the minimization within families where one is a subset of the other. Third, it includes causally-induced constraints. This includes equalities (information terms with zero value) that apply to the original distribution and to any minInf distribution that preserves the joint marginal of variables involved in a conditional independence. This implies the DP inequalities of minInf terms.
The region accessible to the vectors compatible with a certain causal structure can thus be characterized in two dual ways. Given the selection of M minInf distributions resulting from M minimizations, one possibility is to describe the region as a set of interdependent entropic vectors within the space of entropies. Another possibility is to define a space in which the entropic vectors of the original joint distribution P and of the M additional joint distributions associated with the minInf terms are appended. The latter representation constructs the minInf Shannon entropy cone.
A formal characterization of the region of minInf Shannon entropy cones is beyond the aim of this paper. However, the considerations above suggest how an algorithmic entropic characterization of causal structures [
14,
24] can be extended to exploit the constraints that exist in minInf Shannon entropy cones. Since the constraints involving minInf terms also constitute a linear system of equalities and inequalities, the procedure used to derive testable inequalities by the marginalization of hidden variables [
14] can be extended to include minInf terms. Note that the minimization operations involved in the identification of the minInf distributions are not to be solved as part of the linear system. They are reflected in the system by the inclusion of the constraints associated with the definition of the minInf terms. This guarantees that, after the marginalization, the reduced system will contain entropic inequalities that express relations between estimable minInf terms, which can then be tested. The minInf instrumental entropic inequalities introduced in previous sections are one example of inequalities that would be obtained with this algorithmic approach. The implementation of this extended procedure is left for future work.
3.7. Other Types of Entropic Inequalities with minInf Information Terms
The implementation of a procedure to algorithmically derive causally informative inequalities with minInf terms is out of the scope of this work. However, we here point to two other well-known types of causally informative entropic inequalities that can be extended thanks to the minInf DP inequalities introduced in
Section 3.4. We do not aim to provide a full presentation of these entropic inequalities, but to reframe them in a form that allows appreciating their extensions.
The first type of inequalities that we extend is the Groups-Decomposition (GD) inequalities [
25,
26]. We keep the notation of [
26] to facilitate the comparison. This type of inequalities relates the information that a collection of variables has about a set of target variables
with a weighted sum of the information contained in different groups defined as subsets of that collection. Two subtypes of GD inequalities were introduced in [
25]. The existence of an inequality of the first subtype (GD1) imposes certain conditions of independence between the groups and determines the weights based on the overlap between them. The second subtype (GD2) has no requirements of independence, but only applies to collections and groups that constitute ancestral sets, that is, sets including all ancestors of their members. Several extensions were introduced in [
26], comprising a relaxation of the required conditions of independence for GD1 and more flexibility in the configuration of groups for GD2. These extensions also included a generalization to allow for conditioning sets and the use of the DP inequalities of Lemmas 1 and 3 to derive testable GD inequalities from collections containing hidden variables.
We here present GD inequalities in a form that highlights how to apply the tools developed in
Section 3.4, hence using minInf DP inequalities to increase their causal inference power. For simplicity, we restrict ourselves to GD1 inequalities, because the presentation of the second subtype is mathematically more involved. We extend GD1 inequalities following Proposition 3 of [
26], while an extension of GD2 analogously follows from their Theorem 2. For the purpose of this extension, we explicitly differentiate observable variables
and hidden variables
. The inequality considers a target set of variables
and a conditioning set
. It also considers a collection of variables
formed by
n groups, possibly overlapping. Each group can contain observable and hidden variables, such that
. The GD1 inequality states that
where
is the number of groups that intersect with
. The inequality holds if the groups fulfill the following conditions of independence: Given disjoint partitions
,
and
, for all
. For each group, the term
can be estimated, while the term
is analogous to the terms with hidden variables that appear in Proposition 3. Ref. [
26] used the DP inequalities of mutual information and unique information to derive tighter estimable lower bounds. An example of a causal structure for which this inequality is causally fulfilled is shown in
Figure 3A. In this case, the structure of a Common Ancestors (CM) graph [
24] is obtained after conditioning on
Z, with all dependencies between observable variables mediated by hidden variables. A GD inequality exists selecting
, such that
holds for all
. The standard DP inequality is applied to each group, thanks to
, which leads to a testable GD inequality. Consider now the role of the terms
if the structure of
Figure 3A is embedded as part of a larger causal structure for which other minInf DP inequalities can be applied. In that case, analogously to
Figure 1 and
Figure 2, the terms
can be the starting point to iteratively add estimable information terms at the lower bound of the inequality, with a procedure analogous to the one enabled by Propositions 3 and 6.
The second type of inequality to be generalized is the Information Causality (IC) inequality [
23,
31]. This inequality differs from the ones considered so far in that it contemplates a marginal scenario defined not only in terms of the presence of hidden variables, but also in terms of restrictions regarding which observable variables are jointly observable. The motivation for this marginal scenario is that the IC inequality was conceived to also cover quantum systems. However, we here focus on its derivation for classical systems, since the consideration of quantum systems would require examining how to adapt our results to the case in which Shannon entropy is replaced by von Neumann entropy [
31]. In particular, we follow the generalization introduced by [
31]. For classical systems, the derivation of this IC inequality can be understood in relation to the causal structure of
Figure 3B. We have kept the notation used in [
31] to facilitate a comparison with their derivations. The only exception is that we use
U for the hidden variable, consistently with our previous derivations, while in [
31] it corresponds to variable
B. All variables
,
, and
M are observable, with
U hidden. However, the marginal scenario is defined as imposing further constraints of observability, such that the variables in
have mutually exclusive observability. Only marginal distributions of the form
are observable, for each
. Accordingly, for an inequality to be testable, it can only contain information terms estimable from these marginals. We now present the IC inequality in a form that highlights how it can be extended based on the tools developed in
Section 3.4:
Note that the selection of
is arbitrary and without loss of generality it can be replaced by any other variable in
. The detailed derivation of inequality
can be found in Equations (16)–(21) of [
31] and follows from Lemma 2 in [
24]. Apart from basic properties of entropy, the derivation relies on the causally-imposed independence
. On the other hand, inequality
is derived thanks to the DP inequalities of mutual information associated with
and
, which provide observable lower bounds of the terms
and
. More generally, these terms can be the starting point to iteratively add estimable information terms at the lower bound of the inequality, with a procedure analogous to the one enabled by Propositions 3 and 6. To highlight this, we have kept in the lower bound the nonestimable information terms that contain
U and that are dropped in the final testable inequality of [
31]. If the structure of
Figure 3B is embedded as part of a larger causal structure such that other types of minInf DP inequalities can be applied, an iterative addition of observable information terms in the lower bound can proceed as in Proposition 6.