Information-theoretic inference of common ancestors

A directed acyclic graph (DAG) partially represents the conditional independence structure among observations of a system if the local Markov condition holds, that is, if every variable is independent of its non-descendants given its parents. In general, there is a whole class of DAGs that represents a given set of conditional independence relations. We are interested in properties of this class that can be derived from observations of a subsystem only. To this end, we prove an information-theoretic inequality that allows for the inference of common ancestors of observed parts in any DAG representing some unknown larger system. More explicitly, we show that a large amount of dependence in terms of mutual information among the observations implies the existence of a common ancestor that distributes this information. Within the causal interpretation of DAGs, our result can be seen as a quantitative extension of Reichenbach's Principle of Common Cause to more than two variables. Our conclusions are also valid for non-probabilistic observations such as binary strings, since we state the proof for an axiomatized notion of mutual information that includes the stochastic as well as the algorithmic version.


Introduction
Causal relations among components X_1, ..., X_n of a system are commonly modeled in terms of a directed acyclic graph (DAG) in which there is an edge X_i → X_j whenever X_i is a direct cause of X_j. Further, it is usually assumed that information about the causal structure can be obtained through interventions in the system. However, there are situations in which interventions are not feasible (too expensive, unethical or physically impossible) and one faces the problem of inferring causal relations from observational data only. To this end, postulates linking observations to the underlying causal structure have been employed, one of the most fundamental being the causal Markov condition [1,2]. It connects the underlying causal structure to conditional independencies among the observations. Explicitly, it states that every observation is independent of its non-effects given its direct causes. It formalizes the intuition that the only components of a system directly relevant to a given observation are its direct causes. In terms of DAGs, the causal Markov condition states that a DAG can only be a valid causal model of a system if every node is independent of its non-descendants given its parents. The graph is then said to fulfill the local Markov condition [3]. Consider for example the causal hypothesis X → Y ← Z on three observations X, Y and Z. Assuming the causal Markov condition, the hypothesis implies that X and Z are independent. Violation of this independence then allows one to exclude this causal hypothesis. But note that in general there are many DAGs that fulfill the local Markov condition with respect to a given set of conditional independence relations. For example, the chains X → Y → Z and X ← Y ← Z as well as the fork X ← Y → Z all imply only that X and Z are independent given Y, and it cannot be decided from information on conditional independences alone which is the true causal model. Nevertheless, properties that are shared by all valid DAGs (e.g. an edge between X and Y in the example) provide information about the underlying causal structure.

Figure 1: Two causal hypotheses for which the causal Markov condition does not imply conditional independencies among the observations X_1, X_2 and X_3. Thus they cannot be distinguished using qualitative criteria like the common cause principle (unobserved variables are indicated as dots). However, the model on the right can be excluded if the dependence among the X_i exceeds a certain bound.

The causal Markov condition is only expected to hold for a given set of observations if all relevant components of a system have been observed, that is, if there are no confounders (unmeasured common causes of more than one observation). It can then be proven by assuming a functional model of causality [1,4,5]. As an example, consider the observations X_1, ..., X_n to be jointly distributed random variables. In this case, the causal Markov condition can be derived for a given DAG on X_1, ..., X_n from two assumptions: (1) every variable X_i is a deterministic function of its parents and an independent (possibly unobserved) noise variable N_i, and (2) the noise variables N_i are jointly independent. In this paper, however, we assume that our observations provide only partial knowledge about a system and ask for structural properties common to all DAGs that represent the independencies of some larger set of elements. To motivate our result, assume first that our observation consists of only two jointly distributed random variables X_1 and X_2 which are stochastically dependent. Reichenbach [6] postulated already in 1956 that the dependence of X_1 and X_2 needs to be explained by (at least) one of the following cases: X_1 is a cause of X_2, or X_2 is a cause of X_1, or there exists a common cause of X_1 and X_2. This link between dependence and the underlying causal structure is known as Reichenbach's principle of common cause. It is easily seen that if we assume X_1 and X_2 to be part of some unknown larger system whose causal structure is described by a DAG G, then the causal Markov condition for G implies the principle of common cause. Moreover, we can subsume all three cases of the principle if we formally allow a node to be an ancestor of itself, and arrive at the following.

Common cause principle. If two observations X_1 and X_2 are dependent, then they must have a common ancestor in any DAG modeling some possibly larger system.
Our main result is an information-theoretic inequality that enables us to generalize this principle to more than two variables. It leads to the following.

Extended common cause principle (informal version). Consider n observations X_1, ..., X_n and a number c, 1 ≤ c ≤ n. If the dependence of the observations exceeds a bound that depends on c, then in any DAG modeling some possibly larger system there exist c nodes out of X_1, ..., X_n that have a common ancestor.
Thus, structural information can be obtained by exploiting the degree of dependence on the subsystem, and we would like to emphasize that, in contrast to the original common cause principle, the above criterion provides a means to distinguish among cases with the same independence structure of the observed variables. This is illustrated in Figure 1.
Above, the extended common cause principle is stated without making explicit the kind of observations we consider and how dependence is quantified. In the main case we have in mind, the observations are jointly distributed random variables and dependence is quantified by the mutual information [7] function. Then the extended common cause principle (Theorem 10) relates stochastic dependence to a property of all Bayesian networks that include the observations. However, the result holds for more general observations (such as binary strings) and for more general notions of mutual information (such as algorithmic mutual information [8]). Therefore we introduce an 'axiomatized' version of mutual information in the following section and describe how it can be connected to a DAG. Then, in Section 3 we prove a theorem on the decomposition of information about subsets of a DAG, from which the extended common cause principle follows as a corollary. Apart from a larger area of applicability, we think that an abstract proof based on an axiomatized notion of information better illustrates that the result is independent of the notion of 'probability': it relies only on the basic properties of (stochastic) mutual information (see Definition 1). Finally, in Section 4 we describe the result in more detail within different contexts and relate it to the notions of redundancy and synergy that were introduced in the area of neural information processing.

General mutual information and DAGs
Before introducing a general notion of mutual information, let us describe how it is connected to a DAG in the stochastic setting. Assume we are given an observation of n discrete random variables X_1, ..., X_n in terms of their joint probability distribution p(X_1, ..., X_n). Write [n] = {1, ..., n} and for a subset S ⊆ [n] let X_S be the random variable associated with the tuple (X_i)_{i∈S}. Assume further that a directed acyclic graph (DAG) G is associated with the nodes X_1, ..., X_n and fulfills the local Markov condition [3]:

X_i ⊥⊥ X_{nd_i} | X_{pa_i} for all i, 1 ≤ i ≤ n,   (1)

where nd_i and pa_i denote the subsets of indices corresponding to the non-descendants and to the parents of X_i in G. The tuple (G, p(X_{[n]})) is called a Bayesian net [9] and the conditional independence relations imply the factorization of the joint probability distribution

p(x_1, ..., x_n) = ∏_{i=1}^n p(x_i | x_{pa_i}),

where small letters x_i stand for values taken by the random variables X_i. From this factorization it follows that the joint information measured in terms of Shannon entropy [7] decomposes into a sum of individual conditional entropies:

H(X_{[n]}) = Σ_{i=1}^n H(X_i | X_{pa_i}).   (2)

Shannon entropy can be considered as an absolute measure of information. However, in many cases only a notion of information relative to another observation may be available. For example, in the case of continuous random variables, (differential) Shannon entropy can be negative and hence may not be a good measure of information. Therefore we would like to formulate our results based on a relative measure, such as mutual information, which, moreover, induces a notion of independence in a natural way. This can be achieved by introducing a specially designated variable Y relative to which information will be quantified. Y can for example be thought of as providing a noisy measurement of the X_{[n]} (Fig. 2 (a)). Then, with respect to a joint probability distribution p(Y, X_{[n]}), we can transform the decomposition of entropies into a decomposition of mutual information [7]:

I(Y : X_{[n]}) ≥ Σ_{i=1}^n I(Y : X_i | X_{pa_i}).   (3)

For a proof and a condition for equality see Lemma 3 below. In the case of discrete variables, Shannon entropy H(X_i) can be seen as the mutual information of X_i with a copy of itself: H(X_i) = I(X_i : X_i). Therefore we can always choose p(Y | X_{[n]}) such that Y = X_{[n]}, in which case (3) holds with equality and the decomposition of entropies in (2) is recovered.

Figure 2: The graph in (a) shows a DAG on nodes X_1, ..., X_5 whose observation is modeled by a leaf node Y (e.g. a noisy measurement). Figure (b) shows a DAG-model of observed elements.

We are interested in decompositions as in (2) and (3), since their violation allows us to exclude possible DAG structures. However, note that the above relations are not yet very useful, since they require, through the assumption of the local Markov condition, that we have observed all relevant variables of a system. Before we relax this assumption in the next section, we introduce mutual information measures on general observations.

Definition 1 (mutual information). Let O be a set of observations. A function I that assigns to any three pairwise disjoint subsets A, B, C ⊆ O a non-negative value I(A : B | C), symmetric in A and B, is called a measure of mutual information if it satisfies the chain rule

I(A : B ∪ C | D) = I(A : C | D) + I(A : B | C ∪ D)

for pairwise disjoint A, B, C, D ⊆ O. We write I(A : B) for I(A : B | ∅), and we call A and B independent given C, denoted by A ⊥⊥ B | C, if I(A : B | C) = 0.

Of course, mutual information of discrete as well as of continuous random variables is included in the above definition. Further, in Section 4.2 we will discuss a recently developed theory of causal inference [4] based on algorithmic mutual information of binary strings.¹ We now state two properties of mutual information that we need later on.

¹ Mutual information of composed quantum systems satisfies the definition as well, because it can be defined in formal analogy to classical information theory if Shannon entropy is replaced by the von Neumann entropy of a quantum state. The properties of mutual information stated above have been used to single out quantum physics from a whole class of no-signaling theories [10].
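As a concrete illustration of the factorization and the entropy decomposition (2), the following sketch builds the joint distribution of a small causal chain X_1 → X_2 → X_3 from conditional tables and verifies (2) numerically (the helper H, the noise level eps and the channel matrices are our own illustrative choices):

```python
import numpy as np

def H(p):
    """Shannon entropy (in bits) of a distribution given as an array."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Chain X1 -> X2 -> X3: each edge acts as a binary symmetric channel.
eps = 0.1
p1 = np.array([0.5, 0.5])                          # p(x1)
chan = np.array([[1 - eps, eps], [eps, 1 - eps]])  # p(next | prev)

# Factorization: p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x2)
p = np.einsum('i,ij,jk->ijk', p1, chan, chan)

H_joint = H(p)
H1 = H(p.sum(axis=(1, 2)))                         # H(X1)
H2g1 = H(p.sum(axis=2)) - H1                       # H(X2 | X1)
H3g2 = H(p.sum(axis=0)) - H(p.sum(axis=(0, 2)))    # H(X3 | X2)

# Decomposition (2): the two printed values agree.
print(H_joint, H1 + H2g1 + H3g2)
```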

Lemma 2 (properties of mutual information).
Let I be a measure of mutual information on a set of elements O. Then:

(i) (monotonicity) For disjoint sets A, B, Y ⊆ O,

I(Y : A) ≤ I(Y : A ∪ B) and I(Y : A | B) ≤ I(Y : A ∪ B).

(ii) (increase through conditioning on independent sets) For three disjoint sets A, B, C ⊆ O with A ⊥⊥ B | C it holds that

I(Y : A | C) ≤ I(Y : A | B ∪ C),   (4)

where Y is an arbitrary set Y ⊆ O disjoint from the rest. Further, the difference is given by

I(Y : A | B ∪ C) − I(Y : A | C) = I(A : B | C ∪ Y).

Proof: (i) Using the chain rule two times,

I(Y : A ∪ B) = I(Y : A) + I(Y : B | A) = I(Y : B) + I(Y : A | B),

and both inequalities follow from the non-negativity of I. To prove (ii) we again use the chain rule: expanding I(A : B ∪ Y | C) in two ways gives

I(A : B | C) + I(A : Y | B ∪ C) = I(A : Y | C) + I(A : B | C ∪ Y),

and the claim follows since I(A : B | C) = 0 by assumption.

As in the stochastic setting, we can connect a DAG to the conditional independence relation that is induced by mutual information: we say that a DAG on a given set of observations fulfills the local Markov condition if every node is independent of its non-descendants given its parents. Furthermore, we show in Appendix A that the induced independence relations are sufficiently nice, in the sense that they satisfy the semi-graphoid axioms [11]. This is useful because it implies that a DAG that fulfills the local Markov condition is an efficient partial representation of the conditional independence structure. Namely, conditional independence relations can be read off the graph with the help of a criterion called d-separation [1] (see Appendix A for details).
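Inequality (4) can be checked on the smallest possible example, two independent bits A and B with Y = A xor B: conditioning on the independent B strictly increases the information that Y carries about A. The following sketch does this numerically (the helpers H and I are our own and compute entropies and conditional mutual information from a joint probability array):

```python
import numpy as np
from itertools import product

def H(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def I(p, A, B, C=()):
    """I(A : B | C) = H(A,C) + H(B,C) - H(A,B,C) - H(C); A, B, C are
    disjoint tuples of axis indices of the joint array p."""
    def h(axes):
        drop = tuple(i for i in range(p.ndim) if i not in axes)
        return H(p.sum(axis=drop)) if drop else H(p)
    A, B, C = tuple(A), tuple(B), tuple(C)
    return h(A + C) + h(B + C) - h(A + B + C) - (h(C) if C else 0.0)

# A and B independent uniform bits, Y = A xor B; axes of p: (A, B, Y).
p = np.zeros((2, 2, 2))
for a, b in product(range(2), repeat=2):
    p[a, b, a ^ b] = 0.25

print(I(p, [0], [2]))         # I(Y : A)     = 0
print(I(p, [0], [2], [1]))    # I(Y : A | B) = 1, conditioning increased it
print(I(p, [0], [1], [2]))    # I(A : B | Y) = 1, the difference in Lemma 2 (ii)
```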
We conclude with a general formulation of the decomposition of mutual information that we already described in the probabilistic case.

Lemma 3 (decomposition of mutual information).
Let I be a measure of mutual information on elements O_{[n]} = {O_1, ..., O_n} and Y. Further, let G be a DAG with node set O_{[n]} that fulfills the local Markov condition. Then

I(Y : O_{[n]}) ≥ Σ_{i=1}^n I(Y : O_i | O_{pa_i}),   (5)

with equality if conditioning on Y preserves the independences of the local Markov condition:

O_i ⊥⊥ O_{nd_i} | O_{pa_i} ∪ Y for all i, 1 ≤ i ≤ n.

Proof: Assume the O_i are ordered topologically with respect to G. The proof is by induction on n. The lemma is trivially true, with equality, if n = 1. Assume that it holds for k − 1 < n. It is easy to see that the graph G_k with nodes O_{[k]} that is obtained from G by deleting all but the first k nodes fulfills the local Markov condition with respect to O_{[k]}. By the chain rule,

I(Y : O_{[k]}) = I(Y : O_{[k−1]}) + I(Y : O_k | O_{[k−1]}),

and we are left to show that I(Y : O_k | O_{[k−1]}) ≥ I(Y : O_k | O_{pa_k}). Since the local Markov condition holds, we have O_k ⊥⊥ O_{[k−1]\pa_k} | O_{pa_k}, and the inequality follows by applying (4). Further, by property (ii) of the previous lemma, equality holds if for every k

O_k ⊥⊥ O_{[k−1]\pa_k} | O_{pa_k} ∪ Y.

In the next section we derive a similar inequality in the case in which only the mutual information of Y with a subset of the nodes O_{[n]} is known.
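To see both cases of Lemma 3 at work, consider the DAG without edges on two independent bits O_1, O_2 (both parent sets empty). A parity read-out Y = O_1 xor O_2 makes the inequality strict, while a perfect copy Y = (O_1, O_2) attains equality, in accordance with the equality condition of the lemma. A minimal sketch (helper functions as in the previous sketch):

```python
import numpy as np
from itertools import product

def H(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def I(p, A, B, C=()):
    def h(axes):
        drop = tuple(i for i in range(p.ndim) if i not in axes)
        return H(p.sum(axis=drop)) if drop else H(p)
    A, B, C = tuple(A), tuple(B), tuple(C)
    return h(A + C) + h(B + C) - h(A + B + C) - (h(C) if C else 0.0)

# DAG with no edge between O1 and O2; O1, O2 independent uniform bits.
# Axes of p: (O1, O2, Y); both parent sets are empty.
for name, f in [("Y = O1 xor O2 (strict)  ", lambda a, b: a ^ b),
                ("Y = (O1, O2)  (equality)", lambda a, b: 2 * a + b)]:
    p = np.zeros((2, 2, 4))
    for a, b in product(range(2), repeat=2):
        p[a, b, f(a, b)] = 0.25
    lhs = I(p, [0, 1], [2])                 # I(Y : O_[2])
    rhs = I(p, [0], [2]) + I(p, [1], [2])   # sum of I(Y : O_i | O_pa_i)
    print(name, lhs, ">=", rhs)
```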

Partial information about a system
We have shown that the information about elements of a system described by a DAG decomposes if the graph fulfills the local Markov condition. In this section we derive a similar decomposition in cases where not all elements of a system have been observed. This decomposition will of course depend on specific properties of G and, in turn, enable us to exclude certain DAGs as models of the total system whenever we observe a violation of such a decomposition. More precisely, we are interested in properties of the class of DAG-models of a set of observations, which we define as follows (see Figure 2 (b)).

Definition 4 (DAG-models). Let O_{[n]} = {O_1, ..., O_n} and Y be observations with a measure of mutual information I. A DAG G with node set X is called a DAG-model of the observations if

(i) every observed element is part of the modeled system, that is, O_i ⊆ X for all i,
(ii) G fulfills the local Markov condition with respect to a measure of mutual information I_G on the elements of X and Y,
(iii) I_G coincides with I on the observations O_{[n]} and Y, and
(iv) no element of X is a descendant of Y.

The first three conditions state that, given the causal Markov condition, G is a valid hypothesis on the causal relations among components of some larger system including the O_{[n]} that is consistent with the observed mutual information values. Condition (iv) is merely a technical condition due to the special role of Y as an observation of the O_{[n]} external to the system.
Lemma 5 (supermodularity of mutual information). Let O_1, ..., O_n and Y be observations with a measure of mutual information I and assume that the O_i are mutually independent. Then the function S ↦ I(Y : O_S) on subsets S ⊆ [n] is supermodular:

I(Y : O_{S∪T}) + I(Y : O_{S∩T}) ≥ I(Y : O_S) + I(Y : O_T) for all S, T ⊆ [n].

Proof: For two subsets S, T ⊆ [n] write S′ = S \ (S ∩ T) and T′ = T \ (S ∩ T). Using the chain rule we have

I(Y : O_{S∪T}) − I(Y : O_{S∩T}) = I(Y : O_{S′} | O_{S∩T}) + I(Y : O_{T′} | O_{S′} ∪ O_{S∩T})
  ≥ I(Y : O_{S′} | O_{S∩T}) + I(Y : O_{T′} | O_{S∩T})
  = (I(Y : O_S) − I(Y : O_{S∩T})) + (I(Y : O_T) − I(Y : O_{S∩T})),

where the inequality follows from property (4) of mutual information, applied to O_{T′} ⊥⊥ O_{S′} | O_{S∩T}.

Hence, a violation of supermodularity allows one to reject mutual independence among the O_i and therefore to exclude the DAG that does not have any edges from the class of possible DAG-models (the local Markov condition would imply mutual independence). We now broaden the applicability of the above lemma based on a result for submodular functions from [12]. We assume that there are unknown objects X = {X_1, ..., X_r} that are mutually independent and that the observed elements O_i ⊆ X are subsets of them (see Figure 3 (a)). In contrast to the previous lemma it is not required anymore that the O_i are mutually independent themselves. It turns out that the way the information about the O_i decomposes allows for the inference of intersections among the sets O_i.

Proposition 6 (decomposition of information about sets of independent elements). Let X = {X_1, ..., X_r} be mutually independent objects, that is,

I(X_j : X_{[r]\{j\}}) = 0 for all j ∈ [r],   (9)

let the observed elements O_1, ..., O_n be subsets of X, and let d_i denote the maximal number of the sets O_1, ..., O_n that an element of O_i is contained in. Then the information about the O_{[n]} can be bounded from below by

I(Y : O_{[n]}) ≥ Σ_{i=1}^n (1/d_i) I(Y : O_i).   (10)

For an illustration see Figure 3(a). Even though the proposition is actually a corollary of the following theorem, its proof is given in Appendix B since it is, unlike the theorem, independent of graph-theoretic notions.
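A small numerical check of Proposition 6 (helper functions as above, example of our own choosing): three mutually independent bits X_1, X_2, X_3 with observed sets O_1 = {X_1, X_2} and O_2 = {X_2, X_3}; since X_2 belongs to both sets, d_1 = d_2 = 2, and (10) requires I(Y : O_{[2]}) ≥ (1/2) I(Y : O_1) + (1/2) I(Y : O_2).

```python
import numpy as np
from itertools import product

def H(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def I(p, A, B):
    def h(axes):
        drop = tuple(i for i in range(p.ndim) if i not in axes)
        return H(p.sum(axis=drop)) if drop else H(p)
    return h(tuple(A)) + h(tuple(B)) - h(tuple(A) + tuple(B))

# Y records the two parities (X1 xor X2, X2 xor X3); axes: (X1, X2, X3, Y).
p = np.zeros((2, 2, 2, 4))
for x1, x2, x3 in product(range(2), repeat=3):
    p[x1, x2, x3, 2 * (x1 ^ x2) + (x2 ^ x3)] = 0.125

lhs = I(p, [0, 1, 2], [3])                                 # I(Y : O_[2]) = 2 bits
bound = 0.5 * I(p, [0, 1], [3]) + 0.5 * I(p, [1, 2], [3])  # = 1 bit
print(lhs, ">=", bound)
```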
Theorem 7 (decomposition of information in DAG-models). Let G with node set X be a DAG-model of the observations O_{[n]} and Y, let an_G(O_i) denote the smallest ancestral set of G containing O_i, and let d_i be the maximal number of the sets an_G(O_1), ..., an_G(O_n) that a node of an_G(O_i) is contained in. Then

I(Y : X) ≥ Σ_{i=1}^n (1/d_i) I(Y : O_i).   (8)

Furthermore, if Y only depends on the whole system X through the O_{[n]}, that is,

I(Y : X | O_{[n]}) = 0,   (11)

we obtain an inequality containing only known values of mutual information:

I(Y : O_{[n]}) ≥ Σ_{i=1}^n (1/d_i) I(Y : O_i),

in analogy to (10). The proof is given in Appendix C and an example is illustrated in Figure 3(b). If all quantities except the structural parameters d_i are known, inequality (10) can be used to obtain information about the intersection structure among the O_i that is encoded in the d_i, provided that the independence assumption (9) holds. Even if (9) cannot be verified, Theorem 7 applies to every DAG-model, and a large amount of observed dependence then forces the parameters d_i to be large. Since a node that is contained in more than c of the ancestral sets an_G(O_j) is a common ancestor of more than c of the observed elements, this yields the following corollary; here A_{c+1} denotes the set of all nodes that are ancestors of more than c of the O_j.

Corollary 8 (inference of common ancestors). Let G be a DAG-model of the observations O_{[n]} and Y satisfying (11), and let c_1, ..., c_n ≥ 1 be numbers with

Σ_{i=1}^n (1/c_i) I(Y : O_i) > I(Y : O_{[n]}).

Then there exists an index i ∈ [n] with d_i > c_i. In particular, for this index i ∈ [n] we must have A_{c_i+1} ≠ ∅, hence there exists a common ancestor of more than c_i of the observed elements. Moreover, if c_1 = ... = c_n = c, the information about the common ancestors is bounded from below by

I(Y : A_{c+1}) ≥ (1/(n−c)) (Σ_{i=1}^n I(Y : O_i) − c I(Y : O_{[n]})).

The proof is given in Appendix D. Theorem 7 and its corollary are our most general results, but for ease of interpretation we illustrate them in the next section only in the special case in which all c_i are equal (Cor. 9), to obtain a lower bound on the information about all common ancestors of at least c + 1 elements O_i. To conclude this section, we ask what is the maximum amount of information that one can expect to obtain about the intersection structure of ancestral sets of a DAG-model of a set of observations. The main requirement for a DAG-model G is that it fulfills the local Markov condition with respect to some larger set X of elements. This remains true if we add nodes and arbitrary edges in such a way that G stays acyclic. Therefore, if G contains a common ancestor of c elements, we can always construct a DAG-model G′ that contains a common ancestor of more than c elements (e.g. the DAG-model on the right hand side of Fig. 1 can be transformed into the one on the left hand side). We conclude that, without adding minimality requirements for the DAG-models (such as the causal faithfulness assumption [2]), only assertions on ancestors of a minimal number of nodes can be made.

Structural implications of redundancy and synergy
The results of the last section can be related to the notions of redundancy and synergy. In the context of neuronal information processing, it has been proposed [13] to capture the redundancy and synergy of elements O_{[n]} = {O_1, ..., O_n} with respect to another element Y using the function

r(Y) := Σ_{i=1}^n I(Y : O_i) − I(Y : O_{[n]}),

where I is a measure of mutual information. Thus r relates the information that Y has about the single elements to the information about the whole set; the observations are called redundant with respect to Y if r(Y) > 0 and synergistic if r(Y) < 0. In this terminology, the results of the last section show that a large amount of redundancy implies the existence of common ancestors in any DAG-model. In the following two subsections we discuss this result in more detail for the cases in which the observed elements are discrete random variables and binary strings.

Common ancestors of discrete random variables
Let X_{[n]} = {X_1, ..., X_n} and Y be discrete random variables with joint distribution p(X_{[n]}, Y) and let I denote the usual measure of mutual information given by the Kullback-Leibler divergence of p from its factorized distribution [7]. If Y = X_{[n]} is a copy of the X_{[n]}, Corollary 9 (the specialization of Corollary 8 to the case in which all c_i are equal to some c) yields the following quantitative version of the extended common cause principle.

Theorem 10 (common ancestors of discrete random variables). For 1 ≤ c < n define

I_c(X_{[n]}) := (1/c) Σ_{i=1}^n H(X_i) − H(X_{[n]}).

If I_c(X_{[n]}) > 0, then in any Bayesian net containing the X_{[n]} there exists a common ancestor of more than c of the variables. Moreover, the entropy of the set A_{c+1} of all common ancestors of more than c variables is lower bounded by

H(A_{c+1}) ≥ (c/(n−c)) I_c(X_{[n]}).

We continue with some remarks to illustrate the theorem:

(a) Setting c = 1, the theorem states that, up to a factor 1/(n − 1), the multi-information I_1 is a lower bound on the entropy of common ancestors of more than one variable. In particular, if I_1(X_{[n]}) > 0, any Bayesian net containing the X_{[n]} must have at least an edge.

(b) Conversely, the entropy of common ancestors of all the elements X_1, ..., X_n is lower bounded by (n − 1) I_{n−1}(X_{[n]}). This bound is non-trivial whenever I_{n−1}(X_{[n]}) > 0, which is for example the case if the X_i are only slightly disturbed copies of some not necessarily observed random variable (see example below).
(c) We emphasize that the inferred common ancestors can be among the elements X i themselves. Unobserved common ancestors can only be inferred by postulating assumptions on the causal influences among the X i . If, for example, all the X i were measured simultaneously, a direct causal influence among the X i can be excluded and any dependence or redundancy has to be attributed to unobserved common ancestors.
(d) Finally, note that I_c > 0 is only a sufficient, but not a necessary condition for the existence of common ancestors. However, the information provided by I_c is used in the theorem in an optimal way, in the following sense: for a given c we can construct distributions p(X_{[n]}) such that I_c(X_{[n]}) = 0 and no common ancestors of c + 1 nodes have to exist.

We conclude this section with two examples.

Example (three variables): Let X_1, X_2 and X_3 be three binary variables, each with maximal entropy H(X_i) = log 2. Then I_2(X_1, X_2, X_3) > 0 iff the joint entropy H(X_1, X_2, X_3) is strictly less than (3/2) log 2. In this case, there must exist a common ancestor of all three variables in any Bayesian net that contains them. In particular, any Bayesian net corresponding to the DAG on the right hand side of Figure 1 can be excluded as a model.

Example (synchrony and interaction among random variables): Let X_1 = X_2 = ... = X_n be identical random variables with non-vanishing entropy h. Then in particular I_{n−1}(X_{[n]}) = (n − 1)^{−1} h > 0, and we can conclude that there has to exist a common ancestor of all n nodes in any Bayesian net that contains them. In contrast to the synchronized case, let X_1, X_2, ..., X_n be binary random variables taking values in {−1, 1} and assume that the joint distribution is of pure n-interaction, that is, for some β ≠ 0 it has the form

p(x_1, ..., x_n) = (1/Z) exp(β x_1 x_2 ⋯ x_n),

where Z is a normalization constant. It can be shown that there exists a Bayesian net including the X_{[n]} in which only common ancestors of at most two variables exist. This is illustrated in Figure 4 for three variables in the limiting case β = ∞, in which each X_i is uniformly distributed and X_1 = X_2 · X_3. We found it somewhat surprising that, contrary to synchronization, higher-order interaction among observations does not require common ancestors of many variables.

Figure 4: The figure illustrates that higher-order interaction among observed random variables can be explained by a Bayesian net in which only common ancestors of two variables exist. More precisely, all random variables are assumed to be binary with values in {−1, 1}, the unobserved common ancestors U_{12}, U_{13}, U_{23} are mutually independent and uniformly distributed, and the value of each observation X_i is obtained as the product of the values of its two ancestors. The resulting marginal distribution p(X_1, X_2, X_3) is of higher-order interaction: it is related to the parity function, since p(x_1, x_2, x_3) > 0 iff x_1 x_2 x_3 = 1.
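Both examples can be verified numerically. The sketch below evaluates I_c (in bits, helper functions of our own) for the synchronized distribution and for the pure-interaction distribution in the limiting case β = ∞:

```python
import numpy as np
from itertools import product

def H(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def I_c(p, c):
    """I_c = (1/c) * sum_i H(X_i) - H(X_[n]) for a joint array p."""
    n = p.ndim
    Hi = [H(p.sum(axis=tuple(j for j in range(n) if j != i))) for i in range(n)]
    return sum(Hi) / c - H(p)

# Synchrony: X1 = X2 = X3, uniform.
sync = np.zeros((2, 2, 2))
sync[0, 0, 0] = sync[1, 1, 1] = 0.5

# Pure 3-interaction at beta = infinity: X1, X2 uniform, X3 = X1 xor X2
# (in {0,1} coding; this is X1 = X2 * X3 in the {-1,1} coding).
par = np.zeros((2, 2, 2))
for a, b in product(range(2), repeat=2):
    par[a, b, a ^ b] = 0.25

print(I_c(sync, 2))  #  0.5 > 0: a common ancestor of all three is required
print(I_c(par, 2))   # -0.5 <= 0: no ancestor of all three can be inferred
print(I_c(par, 1))   #  1.0 > 0: but pairwise common ancestors are required
```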

Common ancestors in string manipulation processes
In some situations it is not convenient or straightforward to summarize an observation in terms of a joint probability distribution of random variables. Consider for example cases in which the data comes from repeated observations under varying conditions (e.g. time series). A related situation is given if the number of samples is low. Janzing and Schoelkopf [4] argue that causal inference in these situations should still be possible, provided that the observations are sufficiently complex. To this end, they developed a framework for causal inference from single observations that we now describe briefly. Assume we have observed two objects A and B in nature (e.g. two carpets) and we encoded these observations into binary strings a and b. If the descriptions of the observations in terms of the strings a and b are sufficiently complex and sufficiently similar (e.g. the same pattern on the carpets), one would expect an explanation of this similarity in terms of a mechanism that relates these two strings in nature (are the carpets produced by the same company?). It is necessary that the descriptions are sufficiently complex, as an example of [4] illustrates: assume the two observed strings are equal to the first hundred digits of the binary expansion of π; hence they can be generated independently by a simple rule. In this case, the similarity of the two strings would not be considered as strong evidence for the existence of a causal link. To exclude such cases, the Kolmogorov complexity [17] K(s) of a string s has been used as a measure of complexity. It is defined as the length of the shortest program that prints out s on a universal (prefix-free) Turing machine. With this definition, strings that can be generated using a simple rule, such as the constant string s = 0···0 or the first n digits of the binary expansion of π, are considered simple, whereas it can be shown that a random string of length n is complex with high probability. Kolmogorov complexity can be turned into a function on sets of strings by choosing a suitable concatenation function ⟨·, ·⟩, such that

K(s_1, ..., s_n) := K(⟨s_1, ⟨s_2, ..., ⟨s_{n−1}, s_n⟩ ...⟩⟩).

The algorithmic mutual information [8] of two strings a and b is then equal to the sum of the lengths of the shortest programs that generate each string separately minus the length of the shortest program that generates the strings a and b together:

I(a : b) =⁺ K(a) + K(b) − K(a, b),

where =⁺ stands for equality up to an additive constant that depends on the choice of the universal Turing machine. Analogously to Reichenbach's principle of common cause, [4] postulates a causal relation among a and b whenever I(a : b) is large, which is the case if the complexities of the strings are large and both strings together can be generated by a much shorter program than the two programs that describe them separately.
In formal analogy to the probabilistic case, algorithmic mutual information can be extended to a conditional version defined for sets of strings A, B, C ⊆ {s_1, ..., s_n} as

I(A : B | C) =⁺ K(A | C*) + K(B | C*) − K(A ∪ B | C*),

where C* denotes a shortest program that prints the strings in C. Intuitively, I(A : B | C) is the mutual information between the strings of A and the strings of B if a shortest program that prints the strings in C has been provided as an additional input. Based on this notion of conditional mutual information, the causal Markov condition can be formulated in the algorithmic setting. It can be proven [4] to hold for a directed acyclic graph G on strings s_1, ..., s_n if every s_i can be computed by a simple program on a universal Turing machine from its parents and an additional string n_i such that the n_i are mutually independent. Without going into the details, we sum up by stating that DAGs on strings can be given a causal interpretation, and it is therefore interesting to infer properties of the class of possible DAGs that represent the algorithmic conditional independence relations. In the algorithmic setting, our result states that if the observed strings are sufficiently redundant, in the sense that the algorithmic analogue of the quantity I_c of the previous subsection is sufficiently large, then in any DAG-model there exist more than c strings that have a common ancestor. Thus, highly redundant strings require a common ancestor in any DAG-model. Since the Kolmogorov complexity of a string s is uncomputable, we have argued in recent work [5] that it can be substituted by a measure of complexity in terms of the length of a compressed version of s with respect to a chosen compression scheme (instead of a universal Turing machine), and the above result should still hold approximately.
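Following the substitution of Kolmogorov complexity by compression lengths suggested in [5], algorithmic mutual information can be approximated with any off-the-shelf compressor. The sketch below uses zlib as an (arbitrary) compression scheme on invented example strings; the helper names C and mi are ours:

```python
import os
import zlib

def C(s: bytes) -> int:
    """Length of a compressed version of s, a computable stand-in for K(s)."""
    return len(zlib.compress(s, 9))

def mi(a: bytes, b: bytes) -> int:
    """Compression-based analogue of I(a : b) = K(a) + K(b) - K(a, b)."""
    return C(a) + C(b) - C(a + b)

pattern = b"swirl-leaf-border-" * 40                  # shared 'carpet pattern'
carpet1, carpet2 = pattern + b"red", pattern + b"blue"
noise1, noise2 = os.urandom(720), os.urandom(720)     # two unrelated strings

print(mi(carpet1, carpet2))  # large: shared structure hints at a common mechanism
print(mi(noise1, noise2))    # about zero (up to a small compressor overhead)
```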

Structural implications from synergy?
We saw that large redundancy implies common ancestors of many elements, and we may wonder whether structural information can be obtained from synergy in a similar way. This seems not to be possible, since synergy is related to more fine-grained information (information about the mechanisms), as the following example shows. Assume the observations O_{[n]} are mutually independent. Then any DAG is a valid DAG-model, since the local Markov condition will always be satisfied. We also know that r(Y) ≤ 0, but it turns out that the amount of synergy crucially depends on the way that Y has processed the information of the O_{[n]}. Writing r_1(O_{[n]}) := Σ_{i=2}^n I(O_i : O_{[i−1]}) for the multi-information of the observations and r_1(O_{[n]} | Y) for its conditional analogue, the redundancy of any Y can be expressed as follows.

Proposition. r(Y) = r_1(O_{[n]}) − r_1(O_{[n]} | Y).

Proof: Using the chain rule, we derive for every i

I(Y : O_i) − I(Y : O_i | O_{[i−1]}) = I(O_i : O_{[i−1]}) − I(O_i : O_{[i−1]} | Y),

since both sides are obtained by expanding I(O_i : O_{[i−1]} ∪ Y) in two ways. Summing over i and using I(Y : O_{[n]}) = Σ_{i=1}^n I(Y : O_i | O_{[i−1]}) yields the claim.
Continuing the example of binary random variables above, mutual independence of the O_{[n]} is equivalent to r_1(O_{[n]}) = 0, and therefore, using the proposition, r(Y) = −r_1(O_{[n]} | Y) ≤ 0, as already noted above.
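The following sketch (helpers as before) makes the dependence on the mechanism explicit: for the same two independent input bits O_1, O_2, the redundancy-synergy measure r(Y) takes the value −1 for a parity read-out but 0 for a partial or complete copy, so r reveals properties of the mechanism generating Y rather than of the class of DAG-models:

```python
import numpy as np
from itertools import product

def H(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def I(p, A, B):
    def h(axes):
        drop = tuple(i for i in range(p.ndim) if i not in axes)
        return H(p.sum(axis=drop)) if drop else H(p)
    return h(tuple(A)) + h(tuple(B)) - h(tuple(A) + tuple(B))

def r(p):
    """r(Y) = sum_i I(Y : O_i) - I(Y : O_[n]); the last axis of p is Y."""
    n = p.ndim - 1
    return sum(I(p, [i], [n]) for i in range(n)) - I(p, list(range(n)), [n])

# O1, O2 independent uniform bits; three different read-out mechanisms Y.
for name, f in [("Y = O1 xor O2", lambda a, b: a ^ b),
                ("Y = O1       ", lambda a, b: a),
                ("Y = (O1, O2) ", lambda a, b: 2 * a + b)]:
    p = np.zeros((2, 2, 4))
    for a, b in product(range(2), repeat=2):
        p[a, b, f(a, b)] = 0.25
    print(name, "r =", r(p))   # -1.0 (synergy), 0.0, 0.0
```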

Discussion
Based on a generalized notion of mutual information, we proved an inequality describing the decomposition of information about a whole set into the sum of information about its parts. The decomposition depended on a structural property, namely the existence of common ancestors in a DAG. We connected the result to the notions of redundancy and synergy and concluded that large redundancy implies the existence of common ancestors in any DAG-model. Specialized to the case of discrete random variables, this means that large stochastic dependence in terms of multi-information needs to be explained through a common ancestor (in a Bayesian net) acting as a broadcaster of information. Much work has already been done examining the restrictions that are imposed on observations by graphical models that include latent variables. Pearl [1,18] investigated constraints imposed by the instrumental variable model. Darroch et al. [15] and, more recently, Sullivant et al. [19] studied linear Gaussian graphical models and determined constraints in terms of the entries of the covariance matrix describing the data (tetrad constraints). Further, methods of algebraic statistics have been applied (e.g. [20]) to derive constraints that are induced by latent variable models directly on the level of probabilities. In general this does not seem to be an easy task due to the large number of variables involved, whereas information-theoretic quantities allow for relatively easy derivations of 'macroscopic' constraints (see also [21]). Finally, we think that the general methodology of connecting concepts such as synergy and redundancy of observations to properties of the class of possible DAG-models is interesting, especially in the light of their causal interpretation.

B Proof of Proposition 6
We have shown in Lemma 5 the supermodularity of I(Y : ·) with respect to independent sets. The rest of the proof follows the lines of the proof of Corollary I in [12]. First, by iteratively applying the chain rule for mutual information we obtain

I(Y : O_{[n]}) = I(Y : X_{[r]}) = Σ_{i=1}^r I(Y : X_i | X_{[i−1]}).   (14)

Without loss of generality we can assume that every X_i is part of at least one set O_k for some k (so that X_{[r]} = O_{[n]}). Let n_i be the total number of subsets O_k containing X_i. By the definition of d_k, for every k with X_i ∈ O_k it holds that n_i ≤ d_k, and we obtain

Σ_{k : X_i ∈ O_k} (1/d_k) ≤ Σ_{k : X_i ∈ O_k} (1/n_i) = 1.   (15)

Putting (14) and (15) together we get

I(Y : O_{[n]}) ≥ Σ_{i=1}^r ( Σ_{k : X_i ∈ O_k} (1/d_k) ) I(Y : X_i | X_{[i−1]})
  =(a) Σ_{k=1}^n (1/d_k) Σ_{i : X_i ∈ O_k} I(Y : X_i | X_{[i−1]})
  ≥(b) Σ_{k=1}^n (1/d_k) Σ_{i : X_i ∈ O_k} I(Y : X_i | O_k ∩ X_{[i−1]})
  =(c) Σ_{k=1}^n (1/d_k) I(Y : O_k),

where (a) is obtained by exchanging summations, (b) uses the property of I that conditioning on independent objects can only increase mutual information (inequality (4) applied to X_i ⊥⊥ (X_{[i−1]} \ O_k) | (O_k ∩ X_{[i−1]})), and (c) is an application of the chain rule to the elements of each O_k separately. Step (b) is the point at which the supermodularity of I is used, since it is actually equivalent to (4), as can be seen from the proof of Lemma 5.

C Proof of Theorem 7
By assumption O_i ⊆ X, and the DAG G with node set X fulfills the local Markov condition. For each O_i denote by an_G(O_i) the smallest ancestral set in G containing O_i. An easy observation that we need in the proof is the fact that two ancestral sets A and B are independent given their intersection:

A \ B ⊥⊥ B \ A | A ∩ B.   (16)

This is implied by d-separation using Theorem 14.

We first prove the inequality

I(Y : an_G(O_{[n]})) ≥ Σ_{i=1}^n (1/d_i) I(Y : O_i).   (17)

From this the inequalities of the theorem follow directly, since I(Y : X) ≥ I(Y : an_G(O_{[n]})) by monotonicity (Lemma 2 (i)), and assumption (11) yields the version containing only known values of mutual information.

The proof of (17) is by induction on the number of elements in A = an_G(O_{[n]}). If A = ∅ nothing has to be proven. For the induction step one passes to a reduced system of observations Õ_{[n]} with DAG G̃ and measure of mutual information Ĩ, where d̃_i is defined similarly to d_i, but with respect to the elements Õ_i and G̃, and where sums run over all non-empty Õ_i. By construction of Ĩ and Õ_{[n]}, the left hand side of the induction hypothesis (18) can be identified with the corresponding terms of (17), and its right hand side can be bounded from below using Σ_{i=1}^m 1/d̃_i ≤ 1, which has already been used, see (15) in the proof of Proposition 6. Summarizing, the estimates (18) and (19) combine to yield (17) for A, which completes the induction step.

D Proof of Corollary 8

Using assumption (11) and the chain rule for mutual information, applied multiple times, we obtain an inequality that relates I(Y : O_{[n]}) and the I(Y : O_i) to I(Y : A), where A denotes the set of inferred common ancestors. The corollary now follows by solving for I(Y : A).
We first prove the inequality From this the inequalities of the theorem follow directly: (8)  where the last equality is a consequence of (9). The proof of (17) is by induction on the number of elements in A = an G (O [n] ). If A = ∅ nothing has to be proven. Assume now (17) whered i is defined similarly as d i , but with respect to the elementsÕ i andG. Further the sum is over all non-emptyÕ i . By construction ofĨ andÕ [n] , the left hand side of (18) is equal tõ The right hand side of (18) can be rewritten to where the inequality holds because m i=1 1 di ≤ 1 which has already been used, see (15) in the proof of Proposition 6 . Summarizing, the right hand side of (18) can be bounded from below by Since we have shown in (18) and (19), that the left hand side can be bounded from above by I ( Using assumption (11) and the chain rule for mutual information we obtain where the chain rule has been applied multiple times. The corollary now follows by solving for I(Y : A).