# Information-Theoretic Inference of Common Ancestors


## Abstract


## 1. Introduction

Causal relations among the components $X_1,\dots,X_n$ of a system are commonly modeled in terms of a directed acyclic graph (DAG) in which there is an edge $X_i\to X_j$ whenever $X_i$ is a direct cause of $X_j$. Further, it is usually assumed that information about the causal structure can be obtained through interventions in the system. However, there are situations in which interventions are not feasible (too expensive, unethical or physically impossible), and one faces the problem of inferring causal relations from observational data only. To this end, postulates linking observations to the underlying causal structure have been employed, one of the most fundamental being the causal Markov condition [1,2]. It connects the underlying causal structure to conditional independencies among the observations: every observation is independent of its non-effects given its direct causes. This formalizes the intuition that the only relevant components of a system for a given observation are its direct causes.

Typically, one takes $X_1,\dots,X_n$ to be jointly distributed random variables. In this case, the causal Markov condition can be derived for a given DAG on $X_1,\dots,X_n$ from two assumptions: (1) every variable $X_i$ is a deterministic function of its parents and an independent (possibly unobserved) noise variable $N_i$, and (2) the noise variables $N_i$ are jointly independent. However, in this paper, we assume that our observations provide only partial knowledge about a system and ask for structural properties common to all DAGs that represent the independencies of some larger set of elements.

Consider two observations $X_1$ and $X_2$ that are stochastically dependent. Reichenbach [6] postulated as early as 1956 that the dependence of $X_1$ and $X_2$ needs to be explained by (at least) one of the following cases: $X_1$ is a cause of $X_2$, or $X_2$ is a cause of $X_1$, or there exists a common cause of $X_1$ and $X_2$. This link between dependence and the underlying causal structure is known as Reichenbach's principle of common cause. If we assume $X_1$ and $X_2$ to be part of some unknown larger system whose causal structure is described by a DAG $G$, then the causal Markov condition for $G$ implies the principle of common cause. Moreover, we can subsume all three cases of the principle if we formally allow a node to be an ancestor of itself, and arrive at:

**The common cause principle:** If two observations $X_1$ and $X_2$ are dependent, then they must have a common ancestor in any DAG modeling some possibly larger system.

**Extended common cause principle (informal version):** Consider $n$ observations $X_1,\dots,X_n$ and a number $c$, $1\le c\le n$. If the dependence of the observations exceeds a bound that depends on $c$, then in any DAG modeling some possibly larger system, there exist $c$ nodes out of $X_1,\dots,X_n$ that have a common ancestor.

## 2. General Mutual Information and DAGs

We first describe the observations $X_1,\dots,X_n$ in terms of their joint probability distribution $p(X_1,\dots,X_n)$. Write $[n]=\{1,\dots,n\}$, and for a subset $S\subseteq[n]$, let $X_S$ be the random variable associated with the tuple $(X_i)_{i\in S}$. Assume further that a directed acyclic graph (DAG) $G$ is associated with the nodes $X_1,\dots,X_n$ that fulfills the local Markov condition [3]: for all $i$ ($1\le i\le n$):

$$X_i\perp\!\!\!\perp X_{nd_i}\,|\,X_{pa_i},\tag{1}$$

where $nd_i$ and $pa_i$ denote the subsets of indices corresponding to the non-descendants and to the parents of $X_i$ in $G$. The tuple $(G,p(X_{[n]}))$ is called a Bayesian net [9], and the conditional independence relations imply the factorization of the joint probability distribution

$$p(x_1,\dots,x_n)=\prod_{i=1}^{n}p(x_i\,|\,x_{pa_i}),$$

where the $x_i$ stand for values of the random variables $X_i$. From this factorization, it follows that the joint information measured in terms of Shannon entropy [7] decomposes into a sum of individual conditional entropies:

$$H(X_{[n]})=\sum_{i=1}^{n}H(X_i\,|\,X_{pa_i}).\tag{2}$$

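Now let $Y$ be a further random variable that models an observation of the $X_{[n]}$ (Figure 2a). Then, with respect to a joint probability distribution $p(Y,X_{[n]})$, we can transform the decomposition of entropies into a decomposition of mutual information [7]:

$$I(Y:X_{[n]})\ge\sum_{i=1}^{n}I(Y:X_i\,|\,X_{pa_i}).\tag{3}$$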

Note that the entropy $H(X_i)$ can be seen as the mutual information of $X_i$ and a copy of itself: $H(X_i)=I(X_i:X_i)$. Therefore, we can always choose $p(Y|X_{[n]})$ such that $Y=X_{[n]}$, and the decomposition of entropies in (2) is recovered. We are interested in decompositions as in (2) and (3), since their violation allows us to exclude possible DAG structures.
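The entropy decomposition along a DAG, $H(X_{[n]})=\sum_i H(X_i|X_{pa_i})$, can be checked numerically on a small example. The following is a minimal sketch, assuming a hypothetical three-node chain $X_0\to X_1\to X_2$ of binary variables with a flip probability of 0.1; the chain, the noise level and the helper names `entropy` and `cond_entropy` are illustrative choices and do not come from the paper.

```python
import itertools
import math

def entropy(joint, idx):
    """Shannon entropy (in bits) of the marginal of `joint` on coordinates `idx`."""
    marg = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def cond_entropy(joint, target, given):
    """H(X_target | X_given) = H(target, given) - H(given)."""
    both = sorted(set(target) | set(given))
    return entropy(joint, both) - entropy(joint, sorted(given))

# Hypothetical chain X0 -> X1 -> X2: each child copies its parent and
# flips the value with probability 0.1 (an assumption for illustration).
flip = 0.1
joint = {}
for x0, x1, x2 in itertools.product([0, 1], repeat=3):
    p = 0.5
    p *= (1 - flip) if x1 == x0 else flip
    p *= (1 - flip) if x2 == x1 else flip
    joint[(x0, x1, x2)] = p

parents = {0: [], 1: [0], 2: [1]}  # parent sets of the chain
lhs = entropy(joint, [0, 1, 2])
rhs = sum(cond_entropy(joint, [i], parents[i]) for i in parents)
print(lhs, rhs)  # the two values agree up to floating-point error
```

If the parent sets are replaced by ones that do not match the generating structure (for example, all parent sets empty), the right-hand side in general exceeds the joint entropy, which is the kind of violation that rules out a candidate DAG.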

**Definition 1** (Measure of mutual information). Given a finite set of elements $\mathcal{O}$, a measure of mutual information on $\mathcal{O}$ is a three-argument function on the power set:

**Lemma 1** (Properties of mutual information). Let $I$ be a measure of mutual information on a set of elements $\mathcal{O}$. Then:

- (i) (Data processing inequality) For three disjoint sets $A,B,C\subseteq\mathcal{O}$: $$I\left(A:C|B\right)=0\Rightarrow I\left(A:B\right)\ge I\left(A:C\right).$$
- (ii) (Increase through conditioning on independent sets) For three disjoint sets $A,B,C\subseteq\mathcal{O}$: $$I\left(A:C|B\right)=0\Rightarrow I\left(Y:A|B\right)\le I\left(Y:A|B,C\right),\tag{4}$$ for any $Y\subseteq\mathcal{O}$ disjoint from $A$, $B$ and $C$.

**Proof.** (i) Using the chain rule two times:
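A sketch of the computation behind the data processing inequality, assuming only the chain rule $I(A:B\cup C)=I(A:B)+I(A:C|B)$ and the non-negativity of $I$ (the intermediate steps are a reconstruction, not the authors' wording):

$$\begin{aligned}
I(A:B\cup C)&=I(A:B)+I(A:C|B)=I(A:B),\\
I(A:B\cup C)&=I(A:C)+I(A:B|C)\ge I(A:C).
\end{aligned}$$

The first line uses the hypothesis $I(A:C|B)=0$, and the second uses $I(A:B|C)\ge0$. Comparing the two lines gives $I(A:B)\ge I(A:C)$.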

**Lemma 2** (Decomposition of mutual information). Let $I$ be a measure of mutual information on elements $O_{[n]}=\{O_1,\dots,O_n\}$ and $Y$. Further, let $G$ be a DAG with node set $O_{[n]}$ that fulfills the local Markov condition. Then:

**Proof.** Assume the $O_i$ are ordered topologically with respect to $G$. The proof is by induction on $n$. The lemma is trivially true, with equality, if $n=1$. Assume that it holds for $k-1<n$. It is easy to see that the graph $G_k$ with nodes $O_{[k]}$ that is obtained from $G$ by deleting all but the first $k$ nodes fulfills the local Markov condition with respect to $O_{[k]}$. By the chain rule,

$$I(Y:O_{[k]})=I(Y:O_{[k-1]})+I(Y:O_k\,|\,O_{[k-1]}),$$

so that, by the induction assumption, it remains to show that $I(Y:O_k|O_{[k-1]})\ge I(Y:O_k|O_{pa_k})$. Since the local Markov condition holds, we have $O_k\perp\!\!\!\perp O_{[k-1]\backslash pa_k}\,|\,O_{pa_k}$, and the inequality follows by applying (4). Further, by Property (ii) of the previous lemma, equality holds if for every $k$: $O_k\perp\!\!\!\perp O_{[k-1]\backslash pa_k}\,|\,(O_{pa_k},Y)$, which is implied by (6).

_{[}

_{n}

_{]}is known.

## 3. Partial Information about a System

**Definition 2** (DAG model of observations). An observation of elements $O_{[n]}=\{O_1,\dots,O_n\}$ with respect to a reference object $Y$ and mutual information measure $I$ is given by the values of $I(Y:O_S)$ for every subset $S\subseteq[n]$.

A DAG $G$ with node set $\mathcal{X}$, together with a measure of mutual information $I_G$ on $\mathcal{X}$, is a DAG model of an observation if the following holds:

- (i) each observation $O_i$ is a subset of the nodes of $G$;
- (ii) $G$ fulfills the local Markov condition on $\mathcal{X}$ with respect to $I_G$;
- (iii) $I_G$ is an extension of $I$, that is, $I_G(Y:O_S)=I(Y:O_S)$ for all $S\subseteq[n]$;
- (iv) $Y$ is a leaf node (no descendants) of $G$.

In other words, a DAG model is a DAG on a possibly larger set of elements, containing the $O_{[n]}$, that is consistent with the observed mutual information values. Condition (iv) is merely a technical condition, due to the special role of $Y$ as an observation of the $O_{[n]}$ external to the system.

If the $O_i$ and $Y$ are random variables with joint distribution $p(O_{[n]},Y)$, a DAG model $G$ with nodes $\mathcal{X}$ is given by the graph structure of a Bayesian net with joint distribution $p\left(\mathcal{X}\right)$ such that the marginal on $O_{[n]}$ and $Y$ equals $p(O_{[n]},Y)$. Moreover, if $Y$ is a copy of $O_{[n]}$, then an observation in our sense is given by the values of the Shannon entropy $H(O_S)$ for every subset $S\subseteq[n]$.

What can be learned from the values $I(Y:O_S)$ about the class of DAG models?

**Lemma 3** (Submodularity of $I$). If the $O_i$ are mutually independent, that is, $I(O_i:O_{[n]\backslash i})=0$ for all $i$, then the function $[n]\supseteq S\mapsto-I(Y:O_S)$ is submodular, that is, for two sets $S,T\subseteq[n]$:

$$I(Y:O_S)+I(Y:O_T)\le I(Y:O_{S\cup T})+I(Y:O_{S\cap T}).$$

**Proof.** For two subsets $S,T\subseteq[n]$, write $S'=S\backslash(S\cap T)$ and $T'=T\backslash(S\cap T)$. Using the chain rule, we have:

A violation of submodularity thus allows us to reject mutual independence among the $O_i$ and therefore to exclude the DAG that does not have any edges from the class of possible DAG models (the local Markov condition would imply mutual independence).
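The submodularity test of Lemma 3 can be illustrated on two toy distributions. This is a minimal sketch, assuming Shannon mutual information in bits and the inequality $I(Y:O_S)+I(Y:O_T)\le I(Y:O_{S\cup T})+I(Y:O_{S\cap T})$; the two distributions and the function names are hypothetical examples, not taken from the paper.

```python
import math

def mutual_info(joint, a, b):
    """I(A:B) in bits for disjoint coordinate lists a and b of a joint distribution."""
    def H(idx):
        marg = {}
        for outcome, prob in joint.items():
            key = tuple(outcome[i] for i in idx)
            marg[key] = marg.get(key, 0.0) + prob
        return -sum(p * math.log2(p) for p in marg.values() if p > 0)
    return H(a) + H(b) - H(sorted(set(a) | set(b)))

def submodularity_gap(joint, y, s, t):
    """I(Y:O_{S u T}) + I(Y:O_{S n T}) - I(Y:O_S) - I(Y:O_T).

    The gap is non-negative whenever -I(Y:O_S) is submodular in S.
    """
    union = sorted(set(s) | set(t))
    inter = sorted(set(s) & set(t))
    return (mutual_info(joint, y, union) + mutual_info(joint, y, inter)
            - mutual_info(joint, y, s) - mutual_info(joint, y, t))

# Coordinates are (O1, O2, Y).
# Case 1: O1, O2 independent fair bits, Y = O1 XOR O2.
independent = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
# Case 2: O1 = O2 = Y is one shared fair bit, so the O_i are dependent.
dependent = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

gap_independent = submodularity_gap(independent, [2], [0], [1])
gap_dependent = submodularity_gap(dependent, [2], [0], [1])
print(gap_independent, gap_dependent)
```

In the first case the gap is non-negative, as the lemma requires for independent $O_i$. In the second case it is negative, so mutual independence, and with it the edgeless DAG, can be rejected from the values $I(Y:O_S)$ alone.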

Next, consider the case in which the observations $O_i$ are sets of elements that are mutually independent themselves. It turns out that the way the information about the $O_i$ decomposes allows for the inference of intersections among the sets $O_i$, namely:

**Proposition 1** (Decomposition of information about sets of independent elements). Let $\mathcal{X}=\left\{X_1,\dots,X_r\right\}$ be mutually independent objects, that is, $I(X_j:X_{[r]\backslash j})=0$ for all $j$. Let $O_{[n]}=\{O_1,\dots,O_n\}$, where each $O_i\subseteq\mathcal{X}$ is a non-empty subset of $\mathcal{X}$. For every $i\in[n]$, let $d_i$ be maximal such that $O_i$ has a non-empty intersection with $d_i-1$ sets out of $O_{[n]}$ distinct from $O_i$. Then, the information about the $O_{[n]}$ can be bounded from below by:

$$I(Y:O_{[n]})\ge\sum_{i=1}^{n}\frac{1}{d_i}I(Y:O_i).$$

_{1}= d

_{2}= 2 and:

If there is an element of $O_i$ that is also contained in $k-1$ different sets $O_j$, then $d_i\ge k$, and we account for this redundancy by dividing the single information $I(Y:O_i)$ by at least $k$.
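The role of the numbers $d_i$ can be made concrete with a small computation. The sketch below assumes that every element of $\mathcal{X}$ is an independent fair bit and that $Y$ is a copy of all elements, so that $I(Y:O_S)$ is simply the number of distinct elements in $\bigcup_{i\in S}O_i$ (in bits), and it checks the bound $I(Y:O_{[n]})\ge\sum_i I(Y:O_i)/d_i$. The overlap pattern and the function names are hypothetical.

```python
def multiplicities(observations):
    """d_i: the largest number of observed sets (counting O_i itself)
    that share a single element of O_i."""
    count = {}
    for obs in observations:
        for x in obs:
            count[x] = count.get(x, 0) + 1
    return [max(count[x] for x in obs) for obs in observations]

def info_about(observations, subset):
    """I(Y : O_S) in bits, assuming Y copies every element and each
    element is an independent fair bit (assumption for illustration)."""
    union = set()
    for i in subset:
        union |= observations[i]
    return len(union)

# Hypothetical overlap pattern: neighbouring sets share one element,
# and no element lies in three sets.
observations = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {6, 7, 1}]
d = multiplicities(observations)
lhs = info_about(observations, range(len(observations)))
rhs = sum(info_about(observations, [i]) / d[i] for i in range(len(observations)))
print(d, lhs, rhs)
```

Because no element belongs to three of the sets, every $d_i$ equals 2, and the joint information of 7 bits indeed dominates half the sum of the single informations.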

The next step relates the observations $O_i$ to the intersection structure of associated ancestral sets. For a given DAG $G$, a set of nodes $A$ is called ancestral if, for every edge $v\to w$ in $G$ such that $w$ is in $A$, also $v$ is in $A$. Further, for a subset of nodes $S$, we denote by $an(S)$ the smallest ancestral set that contains $S$. Elements of $an(S)$ will be called ancestors of $S$.
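The smallest ancestral set $an(S)$ and the common ancestors of several observed sets are straightforward to compute. The following is a minimal sketch, assuming the DAG is stored as a dictionary from each node to a list of its parents; this representation, the example graph and the function names are illustrative assumptions.

```python
def ancestral_closure(parents, nodes):
    """Smallest ancestral set an(S): S together with every node that
    reaches S along directed edges."""
    closure = set()
    stack = list(nodes)
    while stack:
        v = stack.pop()
        if v not in closure:
            closure.add(v)
            stack.extend(parents.get(v, ()))
    return closure

def common_ancestors(parents, observations):
    """Nodes that lie in an(O_i) for every observed set O_i."""
    sets = [ancestral_closure(parents, obs) for obs in observations]
    return set.intersection(*sets)

# Hypothetical DAG, stored as child -> parents:
# u -> a, u -> b, v -> b, v -> c.
parents = {"a": ["u"], "b": ["u", "v"], "c": ["v"]}

print(ancestral_closure(parents, {"b"}))
print(common_ancestors(parents, [{"a"}, {"b"}]))
print(common_ancestors(parents, [{"a"}, {"b"}, {"c"}]))
```

Note that every node belongs to its own ancestral closure, which matches the convention used here that a node counts as an ancestor of itself.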

**Theorem 1** (Decomposition of ancestral information). Let $G$ be a DAG model of an observation of elements $O_{[n]}=\{O_1,\dots,O_n\}$. For every $i$, let $d_i$ be the maximal number such that the intersection of $an(O_i)$ with $d_i-1$ distinct sets $an(O_{i_1}),\dots,an(O_{i_{d_i-1}})$ is non-empty. Then, the information about all ancestors of $O_{[n]}$ can be bounded from below by:

_{[}

_{n}

_{]}, that is:

If the mutual information values of an observation of the $O_i$ are known, the inequality (10) can be used to obtain information about the intersection structure among the $O_i$ that is encoded in the $d_i$, provided that the independence assumption (9) holds. Even if (9) does not hold, but an upper bound on $I(Y:an(O_{[n]}))$ is available (e.g., in terms of the entropy of $Y$), information about the intersection structure may be obtained from (8). The following corollary additionally provides a bound on the minimum information about ancestral sets.

**Corollary 1** (Inference of common ancestors, local version). Given an observation of elements $O_{[n]}=\{O_1,\dots,O_n\}$, assume that for natural numbers $\mathbf{c}=(c_1,\dots,c_n)$ with $1\le c_i\le n-1$, we observe:

For every $O_i$, let $A_{c_i+1}$ be the set of common ancestors in $G$ of $O_i$ and at least $c_i$ elements of $O_{[n]}$ different from $O_i$. Then, the joint information about all common ancestors can be bounded from below by:

_{i}and at least c

_{i}elements of O

_{[}

_{n}

_{]}different from O

_{i}.

Below, we consider the special case in which all $c_i$ are equal (Corollary 2) to obtain a lower bound on the information about all common ancestors of at least $c+1$ elements $O_i$.

## 4. Structural Implications of Redundancy and Synergy

One can quantify the redundancy and synergy of elements $O_{[n]}=\{O_1,\dots,O_n\}$ with respect to another element $Y$ using the function:

$$r(Y):=\sum_{i=1}^{n}I(Y:O_i)-I(Y:O_{[n]}).$$

If the sum of the information about the single $O_i$ is larger than the information about the whole set ($r(Y)>0$), the $O_{[n]}$ are said to be redundant with respect to $Y$. This may be the case if $Y$ "contains" information that is shared by multiple $O_i$. In general, if the $O_i$ do not share any information, that is, if they are mutually independent, then they cannot be redundant with respect to any $Y$ (this follows from Lemma 3).

If, on the other hand, $r(Y)<0$, the $O_{[n]}$ are called synergistic with respect to $Y$. This may, for example, be the case if $Y$ is generated through a function $Y=f(O_1,\dots,O_n)$ and the function value contains little information about each argument (as is the case for the parity function; see below). If, instead, $Y$ is a copy of the $O_{[n]}$, then $r(Y)\ge0$, and thus, the $O_{[n]}$ are not synergistic with respect to $Y$. To connect our results to the introduced notion of redundancy and synergy, we introduce the following version of $r$ parametrized by a parameter $c\in\{1,\dots,n\}$:

$$r_c(Y):=\frac{1}{c}\sum_{i=1}^{n}I(Y:O_i)-I(Y:O_{[n]}).\tag{13}$$

If $r_c(Y)>0$ for large $c$, then the $O_i$ are highly redundant with respect to $Y$. Corollary 1 of the last section shows that high redundancy implies common ancestors of many $O_i$.

**Corollary 2** (Redundancy explained structurally). Let an observation of elements $O_{[n]}=\{O_1,\dots,O_n\}$ be given by the values of $I(Y:O_S)$ for any subset $S\subseteq[n]$. If $r_c(Y)>0$, then in any DAG model of the observation in which $Y$ only depends on $\mathcal{X}$ through $O_{[n]}$ [16], there exists a common ancestor of at least $c+1$ elements of $O_{[n]}$.

#### 4.1. Common Ancestors of Discrete Random Variables

Let $X_{[n]}=\{X_1,\dots,X_n\}$ and $Y$ be discrete random variables with joint distribution $p(X_{[n]},Y)$, and let $I$ denote the usual measure of mutual information given by the Kullback–Leibler divergence of $p$ from its factorized distribution [7]. If $Y=X_{[n]}$ is a copy of the $X_{[n]}$, then $I(Y:X_{[n]})=H(X_{[n]})$, where $H$ denotes the Shannon entropy. In this case, the redundancy $r_1(X_{[n]})$ is equal to the multi-information [17] of the $X_{[n]}$. Moreover, $r_c$ gives rise to a parametrized version of multi-information:

$$I_c(X_{[n]}):=r_c(X_{[n]})=\frac{1}{c}\sum_{i=1}^{n}H(X_i)-H(X_{[n]}).\tag{14}$$

**Theorem 2** (Lower bound on entropy of common ancestors). Let $X_{[n]}$ be jointly distributed discrete random variables. If $I_c(X_{[n]})>0$, then in any Bayesian net containing the $X_{[n]}$, there exists a common ancestor of strictly more than $c$ variables out of the $X_{[n]}$. Moreover, the entropy of the set $A_{c+1}$ of all common ancestors of more than $c$ variables is lower bounded by:

$$H(A_{c+1})\ge\frac{c}{n-c}\,I_c(X_{[n]}).$$

- Setting $c=1$, the theorem states that, up to a factor $1/(n-1)$, the multi-information $I_1$ is a lower bound on the entropy of common ancestors of more than one variable, that is, of at least two variables. In particular, if $I_1(X_{[n]})>0$, any Bayesian net containing the $X_{[n]}$ must have at least one edge.
- Conversely, the entropy of common ancestors of all of the elements $X_1,\dots,X_n$ is lower bounded by $(n-1)I_{n-1}(X_{[n]})$. This bound is not trivial whenever $I_{n-1}(X_{[n]})>0$, which is, for example, the case if the $X_i$ are only slightly disturbed copies of some not necessarily observed random variable (see the example below).
- We emphasize that the inferred common ancestors can be among the elements $X_i$ themselves. Unobserved common ancestors can only be inferred by postulating assumptions on the causal influences among the $X_i$. If, for example, all of the $X_i$ were measured simultaneously, a direct causal influence among the $X_i$ can be excluded, and any dependence or redundancy has to be attributed to unobserved common ancestors.
- Finally, note that $I_c>0$ is only a sufficient, but not a necessary, condition for the existence of common ancestors. However, the information provided by $I_c$ is used in the theorem in an optimal way. By this, we mean that we can construct distributions $p(X_{[n]})$ such that $I_c(X_{[n]})=0$ for a given $c$, and no common ancestors of $c+1$ nodes have to exist.

**Example 1** (Three variables). Let $X_1$, $X_2$ and $X_3$ be three binary variables. Then $I_2(X_1,X_2,X_3)>0$ if and only if

**Example 2** (Synchrony and interaction among random variables). Let $X_1=X_2=\dots=X_n$ be identical random variables with non-vanishing entropy $h$. Then, in particular, $I_{n-1}(X_{[n]})=(n-1)^{-1}h>0$, and we can conclude that there has to exist a common ancestor of all $n$ nodes in any Bayesian net that contains them.
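The value in Example 2 can be reproduced numerically. The sketch below assumes that the parametrized multi-information has the form $I_c(X_{[n]})=\frac{1}{c}\sum_i H(X_i)-H(X_{[n]})$, which agrees with the multi-information for $c=1$ and with the value $(n-1)^{-1}h$ stated in the example; entropies are measured in bits, and the function names are hypothetical.

```python
import math

def entropy(joint, idx):
    """Shannon entropy (in bits) of the marginal of `joint` on coordinates `idx`."""
    marg = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def parametrized_multi_information(joint, c):
    """I_c = (1/c) * sum_i H(X_i) - H(X_1, ..., X_n), in bits.

    The formula is an assumed form, chosen to be consistent with Example 2.
    """
    n = len(next(iter(joint)))
    singles = sum(entropy(joint, [i]) for i in range(n))
    return singles / c - entropy(joint, list(range(n)))

# Example 2 with n = 3 identical fair bits, so h = 1 bit.
synchronized = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
value = parametrized_multi_information(synchronized, c=2)
print(value)  # h / (n - 1) with h = 1 and n = 3

# Three independent fair bits have vanishing multi-information.
independent = {(a, b, c): 0.125 for a in (0, 1) for b in (0, 1) for c in (0, 1)}
print(parametrized_multi_information(independent, c=1))
```

For the synchronized bits, $(n-1)\,I_{n-1}=1$ bit, which is exactly the entropy of the single shared variable, so the lower bound on the entropy of the common ancestors is attained in this case.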

**Example 3** (Interaction of maximal order). In contrast to the synchronized case, let $X_1,X_2,\dots,X_n$ be binary random variables taking values in $\{-1,1\}$, and assume that the joint distribution is of pure $n$-interaction [18], that is, for some $\beta\ne0$, it has the form

Nevertheless, there exists a Bayesian net containing the $X_{[n]}$ in which common ancestors of at most two variables exist. This is illustrated in Figure 4 for three variables and in the limiting case $\beta=\infty$, in which each $X_i$ is uniformly distributed and $X_1=X_2\cdot X_3$. We found it somewhat surprising that, contrary to synchronization, higher-order interaction among observations does not require common ancestors of many variables.
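The construction described for three variables can be enumerated directly. This is a minimal sketch, assuming three independent, uniformly distributed hidden variables $U_{12}$, $U_{13}$, $U_{23}$ with values in $\{-1,1\}$ and the assignment $X_1=U_{12}U_{13}$, $X_2=U_{12}U_{23}$, $X_3=U_{13}U_{23}$; which hidden variable feeds which observation is an assumption made here for illustration.

```python
import itertools

def pairwise_product_marginal():
    """Marginal of (X1, X2, X3) where each X_i is the product of the two
    hidden variables U_ij it shares with the other observations."""
    marginal = {}
    for u12, u13, u23 in itertools.product([-1, 1], repeat=3):
        x = (u12 * u13, u12 * u23, u13 * u23)
        # The U_ij are independent and uniform: each configuration has mass 1/8.
        marginal[x] = marginal.get(x, 0.0) + 1 / 8
    return marginal

marginal = pairwise_product_marginal()
for outcome, prob in sorted(marginal.items()):
    print(outcome, prob)
```

Every outcome with positive probability satisfies $x_1x_2x_3=1$ and has probability $\frac{1}{4}$, so $X_1=X_2\cdot X_3$ holds although no hidden variable is an ancestor of all three observations.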

#### 4.2. Common Ancestors in String Manipulation Processes

$K(s_1,\dots,s_n)=K(\langle s_1,\langle s_2,\dots,\langle s_{n-1},s_n\rangle\dots\rangle)$.

_{1},…, s

_{n}} as:

A DAG satisfies the algorithmic Markov condition on strings $s_1,\dots,s_n$ if every $s_i$ can be computed by a simple program on a universal Turing machine from its parents and an additional string $n_i$, such that the $n_i$ are mutually independent. Without going into the details, we sum up by stating that DAGs on strings can be given a causal interpretation, and it is therefore interesting to infer properties of the class of possible DAGs that represent the algorithmic conditional independence relations.

**Theorem 3** (Inference of common ancestors of strings). Let $O_{[n]}=\{s_1,\dots,s_n\}$ be a set of binary strings. If for a number $c$ ($1\le c\le n-1$):

then there must exist a common ancestor of at least $c+1$ strings out of $O_{[n]}$ in any DAG model of the $O_{[n]}$. (Here, $\stackrel{+}{\ge}$ means up to an additive constant dependent only on the choice of a universal Turing machine, on $c$ and on $n$.)

**Proof.** As described, algorithmic mutual information is an information measure in our sense only up to an additive constant depending on the choice of the universal Turing machine. However, one can check that in this case, the decomposition of mutual information (Theorem 1) holds up to an additive constant that depends additionally on the number of strings $n$ and the chosen parameter $c$. The result on Kolmogorov complexities follows by choosing $Y=(s_1,\dots,s_n)$, since $K(s_i)\stackrel{+}{=}I\left(Y:s_i\right)$.

#### 4.3. Structural Implications from Synergy?

Assume that the $O_{[n]}$ are mutually independent. Then, any DAG is a valid DAG model, since the local Markov condition will always be satisfied. We also know that $r(Y)\le0$, but it turns out that the amount of synergy crucially depends on the way that $Y$ has processed the information of the $O_{[n]}$ (and therefore, not on a structural property among the $O_{[n]}$ themselves). To see this, let the observations $O_i$ be binary random variables that are mutually independent and uniformly distributed, such that:

Let $Y=(O_i\oplus O_j)_{i<j}$ be a function of the observations (addition is modulo two). Then, the $O_{[n]}$ are highly synergistic with respect to $Y$, that is, $r_1(Y)=-(n-1)\log2$. On the other hand, if $Y=O_1\oplus\dots\oplus O_n$, then $r_1(Y)=-\log2$ only.
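Both synergy values can be verified by brute-force enumeration. The sketch below assumes $r_c(Y)=\frac{1}{c}\sum_i I(Y:O_i)-I(Y:O_{[n]})$ with Shannon mutual information measured in bits (so $\log2$ corresponds to 1), and it uses $n=4$ independent fair bits; the helper names `redundancy` and `joint_with` are hypothetical.

```python
import itertools
import math

def entropy(joint, idx):
    """Shannon entropy (in bits) of the marginal of `joint` on coordinates `idx`."""
    marg = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def redundancy(joint, n, c=1):
    """r_c(Y) = (1/c) * sum_i I(Y:O_i) - I(Y:O_[n]); Y is coordinate n.

    The formula is an assumed form of the redundancy function.
    """
    def mi(idx):
        return entropy(joint, idx) + entropy(joint, [n]) - entropy(joint, idx + [n])
    return sum(mi([i]) for i in range(n)) / c - mi(list(range(n)))

def joint_with(readout, n):
    """Independent fair bits O_1..O_n followed by Y = readout(bits)."""
    return {bits + (readout(bits),): 2.0 ** -n
            for bits in itertools.product([0, 1], repeat=n)}

n = 4
# Y is the tuple of all pairwise sums modulo two.
pairwise = joint_with(
    lambda b: tuple(b[i] ^ b[j] for i in range(n) for j in range(i + 1, n)), n)
# Y is the parity of all bits.
parity = joint_with(lambda b: sum(b) % 2, n)

print(redundancy(pairwise, n))  # -(n - 1) bits
print(redundancy(parity, n))    # -1 bit
```

In both cases every single $I(Y:O_i)$ vanishes, so the whole difference comes from $I(Y:O_{[n]})=H(Y)$: the pairwise sums determine the bits up to a global flip, whereas the parity carries a single bit.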

_{[}

_{n}

_{]}, we have the following:

**Proposition 2** (Synergy from increased redundancy induced by conditioning). Let $O_{[n]}=\{O_1,\dots,O_n\}$ and $Y$ be arbitrary elements on which a mutual information function $I$ is defined. Then:

_{[}

_{n}

_{]}with respect to itself, then r

_{c}(Y) < 0 and the O

_{[}

_{n}

_{]}are synergistic with respect to Y.

**Proof.**Using the chain rule, we derive

_{c}(Y|O

_{[}

_{n}

_{]}) = 0.

Mutual independence of the $O_{[n]}$ is equivalent to $r_1(O_{[n]})=0$ and, therefore, using the proposition, $r_1(Y)=-r_1(O_{[n]}|Y)$. Thus, if $Y=O_1\oplus\dots\oplus O_n$,

## 5. Conclusions

The parametrized version $r_c$ of redundancy (see (13)) has been used by Ver Steeg and Galstyan as an objective function for hierarchical representations of high-dimensional data [36,37], where the optimization is taken with respect to the variable $Y$.

Moritz et al. [39] use the parametrized version $I_c(X_1,\dots,X_n)$ of multi-information (see (14)) and derive a method for discriminating between causal structures in Bayesian networks given partial observations.

## Appendix

## A. Semi-Graphoid Axioms and d-Separation

**Lemma 4** (General independence satisfies semi-graphoid axioms). The relation of (conditional) independence induced by an independence measure $I$ on elements $\mathcal{O}$ satisfies the semi-graphoid axioms: for disjoint subsets $W$, $X$, $Y$ and $Z$ of $\mathcal{O}$, it holds:

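Recall that, for three disjoint sets of nodes $A$, $B$ and $C$ of a DAG, $A$ is d-separated from $B$ given $C$ if every path between a node of $A$ and a node of $B$ is blocked. A path $(x_1,x_2,\dots,x_r)$ with $x_1\in A$ and $x_r\in B$ is blocked if at least one of the following is true:

- there is an $i$ such that $x_i\in C$ and $x_{i-1}\to x_i\to x_{i+1}$ or $x_{i-1}\leftarrow x_i\leftarrow x_{i+1}$ or $x_{i-1}\leftarrow x_i\to x_{i+1}$;
- there is an $i$ such that $x_i$ and its descendants are not in $C$ and $x_{i-1}\to x_i\leftarrow x_{i+1}$.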
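d-separation can be tested without enumerating paths. The following sketch does not implement the path-blocking rules literally; it uses the well-known equivalent criterion that $A$ and $B$ are d-separated given $C$ exactly when $C$ separates them in the moralized graph of the smallest ancestral set containing $A\cup B\cup C$. The graph representation (child to list of parents) and the function name are assumptions for illustration.

```python
def d_separated(parents, a, b, c):
    """True if the disjoint node sets a and b are d-separated given c.

    `parents` maps each node to a list of its parents in the DAG.
    """
    # 1. Restrict to the smallest ancestral set containing a, b and c.
    keep, stack = set(), list(a | b | c)
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents.get(v, ()))
    # 2. Moralize: link every node to its parents and marry co-parents.
    nbrs = {v: set() for v in keep}
    for v in keep:
        ps = list(parents.get(v, ()))
        for p in ps:
            nbrs[v].add(p)
            nbrs[p].add(v)
        for p in ps:
            for q in ps:
                if p != q:
                    nbrs[p].add(q)
    # 3. Remove c and check whether a can still reach b.
    seen, stack = set(), [v for v in a if v not in c]
    while stack:
        v = stack.pop()
        if v in b:
            return False
        if v not in seen:
            seen.add(v)
            stack.extend(w for w in nbrs[v] if w not in c and w not in seen)
    return True

chain = {"y": ["x"], "z": ["y"]}                 # x -> y -> z
collider = {"z": ["x", "y"], "w": ["z"]}         # x -> z <- y, z -> w

print(d_separated(chain, {"x"}, {"z"}, {"y"}))       # conditioning on y blocks the chain
print(d_separated(collider, {"x"}, {"y"}, set()))    # the collider blocks the path
print(d_separated(collider, {"x"}, {"y"}, {"w"}))    # a descendant of the collider opens it
```

The three calls illustrate the two blocking rules: a conditioned non-collider blocks a path, while a collider blocks it only as long as neither the collider nor any of its descendants is conditioned on.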

**Theorem 4** (Equivalence of Markov conditions). Let $I$ be a measure of mutual information on elements $O_{[n]}=\{O_1,\dots,O_n\}$, and let $G$ be a DAG with node set $O_{[n]}$. Then, the following two properties are equivalent:

- (1) (Local Markov condition) Every node $O_i$ of $G$ is independent of its non-descendants $O_{nd_i}$ given its parents $O_{pa_i}$: $$O_i\perp\!\!\!\perp O_{nd_i}\,|\,O_{pa_i}.$$
- (2) (Global Markov condition) For every three disjoint sets of nodes $A$, $B$ and $C$ such that $A$ is d-separated from $B$ given $C$ in $G$, it holds that $A\perp\!\!\!\perp B\,|\,C$.

**Proof.**(1)→(2). Since the dependence measure I satisfies the semi-graphoid axioms (Lemma 4), we can apply Theorem 2 in Verma and Pearl [41], which asserts that the DAG is an I-map, or in other words, that d-separation relations represent a subset of the (conditional) independences that hold for the given objects.

## B. Proof of Proposition 1

Without loss of generality, assume that every $X_i$ is part of at least one set $O_k$ for some $k$. Let $n_i$ be the total number of subsets $O_k$ containing $X_i$. By definition of $d_k$, for every $k$ with $X_i\in O_k$, it holds that $n_i\le d_k$, and we obtain:

_{i}╨ (X

_{[}

_{i}

_{−1]}nO

_{j}) |O

_{j}). This is the point at which the submodularity of I is used, since it is actually equivalent to (4), as can be seen from the proof of Lemma 3. Finally, (c) is an application of the chain rule to the elements of each O

_{j}separately.

## C. Proof of Theorem 1

For every $O_i$, denote by $an_G(O_i)$ the smallest ancestral set in $G$ containing $O_i$.

_{i})) ≥I(Y : O

_{i}) using the monotony of I (implied by the chain rule and non-negativity). Further, (10) is a direct consequence of (19) together with the independence assumption (9), since by the chain rule:

Assume that the claim holds for all collections $\tilde{O}_{[n]}=\{\tilde{O}_1,\dots,\tilde{O}_n\}$ such that $\tilde{A}=\cup_{i=1}^{n}an(\tilde{O}_i)$ is of cardinality at most $k-1$. Let $O_{[n]}$ be a set of observations such that $A$ is of cardinality $k$. From $O_{[n]}$, we construct a new collection $\tilde{O}_{[n]}$ as follows: w.l.o.g., assume $m:=d_1>0$; in particular, $O_1$ is non-empty. Moreover, by definition of $d_1$, and after reordering of the $O_i$, we can assume that the intersection $V:=\cap_{i=1}^{m}an_G(O_i)$ is non-empty. Note that $V$ itself is an ancestral set. We define $\tilde{O}_i=O_i\backslash V$ for all $1\le i\le n$ and denote by $\tilde{G}$ the modified graph that is obtained from $G$ by removing all elements of $V$. Further, denote by $\tilde{I}(A:B|C):=I(A:B|C,V)$ a modified measure of mutual information obtained by conditioning on $V$. One checks easily that the graph $\tilde{G}$ fulfills the local Markov condition with respect to the independence relation induced by $\tilde{I}$ and is a DAG model of the elements $\tilde{O}_{[n]}$. Hence, by the induction assumption:

where $\tilde{d}_i$ is defined in the same way as $d_i$, but with respect to the elements $\tilde{O}_i$ and $\tilde{G}$. Further, the sum is over all non-empty $\tilde{O}_i$. By construction of $\tilde{I}$ and $\tilde{O}_{[n]}$, the left-hand side of (20) is equal to:

_{G}(O

_{i}) ∩ V = ∅ for i > m. Hence, by (18), V and an

_{G}(O

_{i}) are independent; therefore, conditioning on V only increases mutual information, as proven in Lemma 1, and Inequality (c) follows. We continue by rewriting the first m summands of the right-hand side using the chain rule:

## D. Proof of Corollary 1

**Proof.** Let $G$ be a DAG model of the observation of $O_{[n]}=\{O_1,\dots,O_n\}$. We construct a new DAG $G'$ by removing the objects of $A:=\cup_{i=1}^{n}A_{c_i+1}$. Since $A$ is an ancestral set, $G'$ fulfills the local Markov condition with respect to the mutual information measure obtained by conditioning on $A$. We apply Theorem 1 to $G'$ and the observations $O'_{[n]}=\{O_1\backslash A,\dots,O_n\backslash A\}$ to get:

_{i}

^{′}and for (b) we plugged in Inequalities (11) and (22). Finally, (c) holds, because:


## References and Notes

1. Pearl, J. Causality; Cambridge University Press: Cambridge, UK, 2000.
2. Spirtes, P.; Glymour, C.; Scheines, R. Causation, Prediction, and Search, 2nd ed.; Adaptive Computation and Machine Learning series; The MIT Press: Cambridge, MA, USA, 2001.
3. Lauritzen, S.L. Graphical Models; Oxford Statistical Science Series; Oxford University Press: Oxford, UK, 1996.
4. Janzing, D.; Schölkopf, B. Causal inference using the algorithmic Markov condition. IEEE Trans. Inf. Theory **2010**, 56, 5168–5194.
5. Steudel, B.; Janzing, D.; Schölkopf, B. Causal Markov condition for submodular information measures. In Proceedings of the 23rd Annual Conference on Learning Theory, Haifa, Israel, 17–19 June 2010; pp. 464–476.
6. Reichenbach, H. The Direction of Time; University of California Press: Oakland, CA, USA, 1956.
7. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley: New York, NY, USA, 2006.
8. Gács, P.; Tromp, J.T.; Vitányi, P.M. Algorithmic statistics. IEEE Trans. Inf. Theory **2001**, 47, 2443–2463.
9. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1988.
10. Mutual information of composed quantum systems satisfies the definition as well, because it can be defined in formal analogy to classical information theory if Shannon entropy is replaced by von Neumann entropy of a quantum state. The properties of mutual information stated above have been used to single out quantum physics from a whole class of no-signaling theories [42].
11. Dawid, A.P. Conditional independence in statistical theory. J. R. Stat. Soc. Ser. B (Methodol.) **1979**, 41, 1–31.
12. Madiman, M.; Tetali, P. Information inequalities for joint distributions, with interpretations and applications. IEEE Trans. Inf. Theory **2010**, 56, 2699–2713.
13. Schneidman, E.; Bialek, W.; Berry, M.J., II. Synergy, redundancy, and independence in population codes. J. Neurosci. **2003**, 23, 11539–11553.
14. Latham, P.E.; Nirenberg, S. Synergy, redundancy, and independence in population codes, revisited. J. Neurosci. **2005**, 25, 5195–5206.
15. Schneidman, E.; Still, S.; Berry, M.J., II; Bialek, W. Network information and connected correlations. Phys. Rev. Lett. **2003**, 91, 238701.
16. We formulate the independence assumption as $Y\perp\!\!\!\perp\tilde{\mathcal{X}}\,|\,O_{[n]}$, where $\tilde{\mathcal{X}}$ denotes all nodes of the DAG model different from the nodes in $O_{[n]}$ and $Y$. Note that this assumption does not hold in the original context in which $r$ has been introduced. There, $Y$ is the observation of a stimulus that is presented to some neuronal system, and the $O_i$ represent the responses of (areas of) neurons to this stimulus.
17. Studeny, M.; Vejnarová, J. The multiinformation function as a tool for measuring stochastic dependence. In Learning in Graphical Models; Jordan, M.I., Ed.; Kluwer Academic Publishers: Norwell, MA, USA, 1998; pp. 261–297.
18. This terminology is motivated by the general framework of interaction spaces proposed and investigated by Darroch et al. [21] and used by Amari [43] within information geometry.
19. Li, M.; Vitányi, P. An Introduction to Kolmogorov Complexity and Its Applications (Text and Monographs in Computer Science); Springer: Berlin, Germany, 2007.
20. Pearl, J. On the testability of causal models with latent and instrumental variables. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI), Montreal, QC, Canada, 18–20 August 1995; pp. 435–443.
21. Darroch, J.N.; Lauritzen, S.L.; Speed, T.P. Markov fields and log-linear interaction models for contingency tables. Ann. Stat. **1980**, 8, 522–539.
22. Sullivant, S.; Talaska, K.; Draisma, J. Trek separation for Gaussian graphical models. Ann. Stat. **2010**, 38, 1665–1685.
23. Riccomagno, E.; Smith, J.Q. Algebraic causality: Bayes nets and beyond. **2007**, arXiv:0709.3377.
24. Ay, N. A refinement of the common cause principle. Discret. Appl. Math. **2009**, 157, 2439–2457.
25. Steudel, B.; Ay, N. Information-Theoretic Inference of Common Ancestors. **2010**, arXiv:1010.5720.
26. Fritz, T.; Chaves, R. Entropic inequalities and marginal problems. IEEE Trans. Inf. Theory **2013**, 59, 803–817.
27. Chaves, R.; Luft, L.; Gross, D. Causal structures from entropic information: geometry and novel scenarios. New J. Phys. **2014**, 16, 043001.
28. Fritz, T. Beyond Bell's theorem: correlation scenarios. New J. Phys. **2012**, 14, 103001.
29. Chaves, R.; Majenz, C.; Gross, D. Information-theoretic implications of quantum causal structures. Nat. Commun. **2015**, 6.
30. Henson, J.; Lal, R.; Pusey, M.F. Theory-independent limits on correlations from generalized Bayesian networks. New J. Phys. **2014**, 16, 113043.
31. Steudel, B.; Janzing, D.; Schölkopf, B. Causal Markov condition for submodular information measures. In Proceedings of the 23rd Annual Conference on Learning Theory, Haifa, Israel, 17–19 June 2010; Kalai, A.T., Mohri, M., Eds.; OmniPress: Madison, WI, USA; pp. 464–476.
32. Williams, P.; Beer, R. Nonnegative decomposition of multivariate information. **2010**, arXiv:1004.2515.
33. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J.; Ay, N. Quantifying unique information. Entropy **2014**, 16, 2161–2183.
34. Harder, M.; Salge, C.; Polani, D. Bivariate measure of redundant information. Phys. Rev. E **2013**, 87, 012130.
35. Griffith, V.; Koch, C. Quantifying synergistic mutual information. **2013**, arXiv:1205.4265.
36. Ver Steeg, G.; Galstyan, A. Discovering structure in high-dimensional data through correlation explanation. In Proceedings of Advances in Neural Information Processing Systems 27, Montréal, QC, Canada, 8–13 December 2014; pp. 577–585.
37. Ver Steeg, G.; Galstyan, A. Maximally Informative Hierarchical Representations of High-Dimensional Data. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), San Diego, CA, USA, 2015.
38. Ay, N.; Wenzel, W. On Solution Sets of Information Inequalities. Kybernetika **2012**, 48, 845–864.
39. Moritz, P.; Reichardt, J.; Ay, N. Discriminating between causal structures in Bayesian Networks via partial observations. Kybernetika **2014**, 50, 284–295.
40. In general, there may hold additional conditional independence relations among the observations that are not implied by the local Markov condition together with the semi-graphoid axioms. In fact, it is well known that there are so-called non-graphical probability distributions whose conditional independence structure cannot be completely represented by any DAG.
41. Verma, T.; Pearl, J. Causal networks: Semantics and expressiveness. Uncertain. Artif. Intell. **1990**, 4, 69–76.
42. Pawłowski, M.; Paterek, T.; Kaszlikowski, D.; Scarani, V.; Winter, A.; Żukowski, M. Information causality as a physical principle. Nature **2009**, 461, 1101–1104.
43. Amari, S.I. Information geometry on hierarchy of probability distributions. IEEE Trans. Inf. Theory **2001**, 47, 1701–1711.

**Figure 1.** Two causal hypotheses for which the causal Markov condition does not imply conditional independencies among the observations $X_1$, $X_2$ and $X_3$. Thus, they cannot be distinguished using qualitative criteria, like the common cause principle (unobserved variables are indicated as dots). However, the model on the right can be excluded if the dependence among the $X_i$ exceeds a certain bound.

**Figure 2.** (**a**) A directed acyclic graph (DAG) on nodes $X_1,\dots,X_5$ whose observation is modeled by a leaf node $Y$ (e.g., a noisy measurement). (**b**) A DAG model of observed elements $O_1=\{X_1\}$ and $O_2=\{X_4,X_5\}$.

**Figure 3.** (**a**) Four subsets $O_1,\dots,O_4$ of independent elements $X_1,\dots,X_8$ "observed by" $Y$. Note that the intersection of any three sets $O_i$ is empty; hence, $d_i\le2$ for all $i=1,\dots,4$ in Proposition 1 and, therefore, $I(Y:O_{[4]})\ge\frac{1}{2}\sum_{i=1}^{4}I(Y:O_i)$. (**b**) A DAG model in gray. The observed elements $O_1,\dots,O_4$ are subsets of its nodes. One can check that the DAG does not imply any conditional independencies among the $O_i$ (e.g., with the help of the d-separation criterion; see Appendix A). Nevertheless, there is no common ancestor of all four observations ($\cap_{i=1}^{4}an(O_i)=\varnothing$). Since $Y$ only depends on the $O_i$, the inequality (10) of Theorem 1 implies $I(Y:O_{[4]})\ge\frac{1}{3}\sum_{i=1}^{4}I(Y:O_i)$.

**Figure 4.** The figure illustrates that higher-order interaction among observed random variables can be explained by a Bayesian net in which only common ancestors of two variables exist. More precisely, all random variables are assumed to be binary with values in $\{-1,1\}$, and the unobserved common ancestors $U_{ij}$ are mutually independent and uniformly distributed. Further, the value of each observation $X_i$ is obtained by the product of the values of its two ancestors. Then, the resulting marginal distribution $p(X_1,X_2,X_3)$ is of higher-order interaction: it is related to the parity function, with $p(X_1=x_1,X_2=x_2,X_3=x_3)=\frac{1}{4}$ if $x_1x_2x_3=1$, and zero otherwise.

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Steudel, B.; Ay, N. Information-Theoretic Inference of Common Ancestors. *Entropy* **2015**, *17*, 2304-2327.
https://doi.org/10.3390/e17042304
