Complexity as Causal Information Integration

Complexity measures in the context of the Integrated Information Theory of consciousness try to quantify the strength of the causal connections between different neurons. This is done by minimizing the KL-divergence between a full system and one without causal cross-connections. Various measures have been proposed and compared in this setting. We will discuss a class of information geometric measures that aim at assessing the intrinsic causal cross-influences in a system. One promising candidate among these measures, denoted by Φ_CIS, is based on conditional independence statements and satisfies all of the properties that have been postulated as desirable. Unfortunately, it does not have a graphical representation, which makes it less intuitive and difficult to analyze. We propose an alternative approach using a latent variable, which models a common exterior influence. This leads to a measure Φ_CII, Causal Information Integration, that satisfies all of the required conditions. Our measure can be calculated using an iterative information geometric algorithm, the em-algorithm. Therefore we are able to compare its behavior to existing integrated information measures.


Introduction
The theory of Integrated Information aims at quantifying the amount and quality of consciousness of a neural network. It was originally proposed by Tononi and went through various phases of evolution, starting with one of the first papers, "Consciousness and Complexity" [27] in 1999, to "Consciousness as Integrated Information: a Provisional Manifesto" [26] in 2008 and IIT 3.0 [21] in 2014, up to ongoing research. Although important parts of the methodology of this theory have changed or been extended, the two key concepts determining consciousness that have remained essentially fixed are "Information" and "Integration". Information refers to the number of different states a system can be in, and Integration describes the degree to which the information is integrated among different parts of it. In order to determine to what extent a system integrates information, one divides it into smaller parts and calculates how much the split system differs from the full one. There are various ways to define a split system and the difference between the systems. Therefore, there exist different branches of complexity measures in the context of Integrated Information.
In detail, we will measure the distance between the full and the split system using the KL-divergence, as proposed in [5]. This framework was further discussed in [8]. Oizumi et al. [22] and Amari et al. [4] summarize these ideas and add a Markov condition and an upper bound to clarify what a complexity measure should satisfy. We will discuss these conditions in the next section. Additionally, they introduce one measure that satisfies all of these requirements. This measure is described by conditional independence statements and will be denoted here by Φ_CIS. We will introduce Φ_CIS along with two other existing measures, namely Stochastic Interaction Φ_SI [6] and Geometric Integrated Information Φ_G [1]. Although Φ_CIS fits perfectly into the proposed framework, it does not correspond to a graphical representation, and it is therefore difficult to analyze the nature of the measured information flow.
The main purpose of this paper is to propose a more intuitive approach using a latent variable which models a common exterior influence. This leads to a new measure, which we call Causal Information Integration Φ_CII. This measure is specifically created to measure only the intrinsic causal influences, and it satisfies all the required conditions postulated by Oizumi et al. We discuss the relationship between the introduced measures in Section 2.0.2 and present a way of calculating Φ_CII using an iterative information geometric algorithm, the em-algorithm, described in Section 2.0.3. Utilizing this algorithm, we are able to compare the behavior of Φ_CII to existing integrated information measures.

Integrated Information Measures
Measures corresponding to Integrated Information investigate the information flow in a system from a time t to t+1. This flow is represented by the connections from the nodes X_i at t to the nodes Y_i at t+1, i ∈ {1, …, n}, as displayed in Figure 1.
Figure 1: The fully connected system for n = 2 and n = 3.
The systems are modeled as discrete, stationary, n-dimensional Markov processes (Z_t)_{t∈ℕ} with

X = (X_1, …, X_n) = (X_{1,t}, …, X_{n,t}),   Y = (Y_1, …, Y_n) = (X_{1,t+1}, …, X_{n,t+1}),   Z = (X, Y)

on a finite set Z ≠ ∅, which is the Cartesian product of the sample spaces X_i of the X_i, i ∈ {1, …, n}: Z = X × Y. Since the process is stationary and Markovian, we are able to restrict the discussion to one time step. Denote the complement of X_i in X by X_{I∖{i}} = (X_1, …, X_{i−1}, X_{i+1}, …, X_n) with I = {1, …, n}. Corresponding to this notation, x_{I∖{i}} ∈ X_{I∖{i}} describes the elementary events of X_{I∖{i}}. We will use the analogous notation in the case of Y, and we will write z ∈ Z instead of (x, y) ∈ X × Y. The set of probability distributions on Z will be denoted by P(Z). Throughout this article we restrict attention to strictly positive distributions. The core idea of measuring Integrated Information is to determine how much the initial system differs from one in which no information integration takes place. The former will be called the "full" system, because we allow all possible connections between the nodes, and the latter will be called the "split" system. Graphical representations of the full systems for n = 2, 3 and their connections are depicted in Figure 1. In this article we use graphs that describe the conditional independence structure of the corresponding sets of distributions. An introduction to these is given in Appendix A.
Following the concept introduced in [5], the difference between the measures corresponding to the full and split systems will be calculated by using the KL-divergence.
Definition 1 (Complexity). Let M be a set of probability distributions on Z corresponding to a split system. Then we minimize the KL-divergence between M and the distribution P of the fully connected system to calculate the complexity

Φ_M = min_{Q∈M} D(P ‖ Q).

The question remains how to define the split model M. We want to measure the information that gets integrated between different nodes at different points in time. In Figure 1 these are the dashed connections, also called causal connections.
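For concreteness, the divergence in Definition 1 can be evaluated numerically. The following is a minimal sketch (the function name and the NumPy array representation are our own choices), assuming strictly positive distributions on a common finite state space:

```python
import numpy as np

def kl_divergence(p, q):
    """KL-divergence D(P || Q) in nats for strictly positive distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))
```

Computing Φ_M then amounts to minimizing this quantity over the split model M, for instance with the iterative algorithms discussed later.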
In order to ensure that these connections are removed in the split system, the authors of [22] and [4] argue that Y_j should be independent of X_i given X_{I∖{i}}, i ≠ j, leading to the following property.

Property 1. A valid split system should satisfy the Markov condition

Y_j ⊥⊥ X_i | X_{I∖{i}},   i ≠ j,

with Q ∈ P(Z). This can also be written in the following form:

Q(y_j | x) = Q(y_j | x_{I∖{i}}),   i ≠ j.    (1)

Now we take a closer look at the remaining connections. The dotted lines connect nodes belonging to the same point in time. These connections represent the common exterior or interior influences affecting the nodes, leading to an undirected edge. Since we want to measure the amount of integrated information between t and t+1, the distribution at t, and therefore the connections between the X_i, should stay unchanged in the split system. The dotted connections between the Y_i play an important role in Property 2. For this property, we will consider the split system in which the solid and dashed connections are removed.
The solid arrows represent the influence of a node at t on itself at t+1, and removing these arrows, in addition to the causal connections, leads to a system with completely disconnected points in time, as shown in the first row of Figure 3. The distributions corresponding to this split system are

M_I = {Q ∈ P(Z) | Q(z) = Q(x)Q(y), ∀z = (x, y) ∈ Z},

and the measure Φ_I is given by the mutual information I(X; Y), which is defined in the following way:

Φ_I = I(X; Y) = Σ_{z∈Z} P(x, y) log ( P(x, y) / (P(x)P(y)) ).
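A direct computation of Φ_I from a joint distribution, given as a two-dimensional array over the states of X and Y, might look as follows (a sketch under our own naming conventions):

```python
import numpy as np

def mutual_information(p_xy):
    """I(X; Y) in nats for a strictly positive joint array p_xy[x, y]."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal P(x)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal P(y)
    return float(np.sum(p_xy * np.log(p_xy / (p_x * p_y))))
```

For a product distribution the value is zero, in line with M_I containing exactly the distributions without any connection between the time steps.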
Since there is no information flow between the time steps, Oizumi et al. argue in [22] that an integrated information measure should be bounded from above by the mutual information.
Property 2. The mutual information should be an upper bound for a measure of Integrated Information: Φ_M ≤ I(X; Y).

Oizumi et al. [22] and Amari et al. [4] state that this property is necessary and give the following two arguments. On the one hand, it takes into account that the Y_i might have a common exterior influence affecting all of them. This is symbolized by the additional node W in Figure 2, and it should not contribute to the value of Integrated Information between the different points in time. On the other hand, we know that if the X_i are correlated, then the correlation is passed to the Y_i via the solid and dashed arrows. The question now is how much of these correlations are causal and should therefore be measured. Kanwal et al. discuss this problem in [16]. They distinguish between intrinsic and extrinsic influences that cause the connections between the Y_i, in the way displayed in Figure 2. When calculating the split system for Φ_I, the edge between the Y_i might compensate for the solid arrows and common exterior influences, but also for the dashed, causal connections, as shown in Figure 2 on the right. Kanwal et al. analyze an example of a full system without a common exterior influence, with the result that there are cases in which a measure that only removes the causal connections has a larger value than Φ_I. This is only possible if the undirected edge between the Y_i also compensates for a part of the causal connections. Hence Φ_I does not measure all the intrinsic causal influences. Therefore Kanwal et al. question the use of the mutual information as an upper bound.
Then again, we would like to contribute a different perspective. Committing to Property 2 does not necessarily mean that the connections between the Y_i are fixed. It may merely mean that M_I is a subset of the set of split distributions. We will see that the measures Φ_CIS and Φ_CII satisfy Property 2 in this way. Although the argument that Φ_I measures all the intrinsic influences is no longer valid, satisfying Property 2 is still desirable in general. Consider an initial system with the distribution P(z) = P(x)P(y), ∀z ∈ Z. This system has a common exterior influence on the Y_i and no connection between the different points in time. Since there is no information flow between the points in time, a measure for Integrated Information Φ_M should be zero for all systems of this form. This is the case exactly when M_I ⊆ M, hence when Φ_I is an upper bound for Φ_M. In order to emphasize this point, we propose a modified version of Property 2.

Property 3. The split model should contain the fully disconnected model: M_I ⊆ M.

Note that the new formulation is stronger, hence Property 2 is a consequence of Property 3. Every measure discussed here that satisfies Property 2 also fulfills Property 3. Therefore we will keep referring to Property 2 in the following sections.
Figure 3 displays an overview of the different measures and whether they satisfy Properties 1 and 2. The first complexity measure that we discuss does not fulfill Property 2. It is called Stochastic Interaction and was introduced by Ay in [5] in 2001, later published in [6]. Barrett and Seth discuss it in [9] in the context of Integrated Information. In [4] the corresponding model is called the "fully split model".
The core idea is to allow only the connections among the random variables at t and additionally the connections between X_i and Y_i, i.e. between the same random variable at different points in time. The latter correspond to the solid arrows in Figure 1. A graphical representation for n = 2 can be found in the first column of Figure 3.
Definition 2 (Stochastic Interaction). The set of distributions belonging to the split model in the sense of Stochastic Interaction can be defined as

M_SI = {Q ∈ P(Z) | Q(z) = Q(x) Π_i Q(y_i | x_i), ∀z ∈ Z},

and the complexity measure can be calculated as

Φ_SI = Σ_i H(Y_i | X_i) − H(Y | X),

as shown in [6]. In the formula above, H denotes the conditional entropy, H(Y | X) = −Σ_{x,y} P(x, y) log P(y | x). This measure does not satisfy Property 2, and therefore the corresponding graph is displayed only in the first column of Figure 3. Consider a setting without exterior influences: then this measure quantifies the strength of the causal connections alone and is therefore a reasonable choice for an Integrated Information measure. Accounting for an exterior influence that does not exist, however, leads to a split system that compensates a part of the removal of the causal connections, so that the resulting measure does not quantify all of the interior causal influences.
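Since Φ_SI has a closed form, it is easy to evaluate. The sketch below does so for n = 2 binary units, with the joint distribution stored as an array p[x1, x2, y1, y2]; the array layout and function names are our own conventions:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def phi_SI(p):
    """Phi_SI = sum_i H(Y_i | X_i) - H(Y | X) for p[x1, x2, y1, y2] (n = 2),
    using H(A | B) = H(A, B) - H(B)."""
    h_y_given_x = entropy(p) - entropy(p.sum(axis=(2, 3)))
    p_x1y1 = p.sum(axis=(1, 3))  # joint distribution of (X_1, Y_1)
    p_x2y2 = p.sum(axis=(0, 2))  # joint distribution of (X_2, Y_2)
    h1 = entropy(p_x1y1) - entropy(p_x1y1.sum(axis=1))
    h2 = entropy(p_x2y2) - entropy(p_x2y2.sum(axis=1))
    return h1 + h2 - h_y_given_x
```

The value is nonnegative, being the KL-divergence from P to its projection onto M_SI.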
To force the model to satisfy Property 2, one can add the interaction between Y i and Y j which results in the measure Geometric Integrated Information [1].
Definition 3 (Geometric Integrated Information). The graphical model corresponding to the graph in the second row and first column of Figure 3 is the set M_G of distributions that factorize according to this graph, and the measure is defined as Φ_G = min_{Q∈M_G} D(P ‖ Q). M_G is called the diagonally split model in [4]. It is not causally split, in the sense that the corresponding distributions in general do not satisfy Property 1. This can be seen by analyzing the conditional independence structure of the graph, as described in Appendix A. By introducing the edges between the Y_i as fixed, Φ_G might force these connections to be stronger than they originally are. A result of this might be that an effect of the causal connections gets compensated by the new edge. We discussed this above in the context of Property 2.
This measure has no closed-form solution, but we are able to calculate the corresponding split system with the help of the iterative scaling algorithm, see e.g. [12], Section 5.1.
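A generic iterative proportional fitting step can be sketched as follows, for an arbitrary joint array and a list of interaction sets given as axis tuples (this simplified interface is our own); the candidate distribution is repeatedly rescaled to match the marginals of P:

```python
import numpy as np

def iterative_scaling(p, margins, n_iter=200):
    """Fit q to match the marginals of p over each axis-subset in margins."""
    p = np.asarray(p, dtype=float)
    q = np.full_like(p, 1.0 / p.size)  # start from the uniform distribution
    for _ in range(n_iter):
        for axes in margins:
            other = tuple(i for i in range(p.ndim) if i not in axes)
            target = p.sum(axis=other, keepdims=True)   # marginal of p
            current = q.sum(axis=other, keepdims=True)  # marginal of q
            q = q * target / current                    # rescale to match
    return q
```

With margins = [(0,), (1,)] the procedure converges to the product of the marginals; for Φ_G one would use the interaction sets of the diagonally split graph.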
The first measure that satisfies both properties is called "Integrated Information" [22]; its model is referred to as the "causally split model" in [4], and it is derived from the first property. Since we are able to define it using conditional independence statements, we will denote it by Φ_CIS. It requires Y_i to be independent of X_{I∖{i}} given X_i.
Figure 3: The different measures and their properties in the case of n = 2.

Definition 4 (Integrated Information). The set of distributions that belongs to the split system corresponding to Integrated Information is defined as

M_CIS = {Q ∈ P(Z) | Q(y_i | x) = Q(y_i | x_i), i ∈ {1, …, n}},    (3)

and this leads to the measure

Φ_CIS = min_{Q∈M_CIS} D(P ‖ Q).

We can write the requirements on the distributions in (3) as the conditional independence statements Y_i ⊥⊥ X_{I∖{i}} | X_i. A detailed analysis of probabilistic independence statements can be found in [24]. Unfortunately, these conditional independence statements cannot in general be encoded in terms of a chain graph. The definition of this measure arises naturally from Property 1 by applying relation (1) to all pairs i, j ∈ {1, …, n}. This leads to M_CIS, as shown in Appendix B.
Note that this implies that every model satisfying Property 1 is a submodel of M_CIS. In order to show that Φ_CIS satisfies Property 1, we rewrite the condition in Property 1 as Q(Y_j | X) = Q(Y_j | X_{I∖{i}}). The definition of M_CIS allows us to write Q(Y_j | X) = Q(Y_j | X_j) = Q(Y_j | X_{I∖{i}}) for Q ∈ M_CIS, i ≠ j. Therefore Φ_CIS satisfies Property 1, and since M_I meets the conditional independence statements of Property 1, the relation M_I ⊆ M_CIS holds and Φ_CIS fulfills Property 2.
In [22], Oizumi et al. derive an analytical solution for Gaussian variables, but in general there exists no closed-form solution for discrete variables. Therefore they use Newton's method in the discrete case.
Due to the lack of a graphical representation, it is difficult to interpret the causal nature of the elements of M_CIS. In Example 1 we will see a type of model that is part of M_CIS, but which challenges the notion of Integrated Information.

Causal Information Integration
Inspired by the discussion about extrinsic and intrinsic influences in the context of Property 2, we now utilize the notion of a common exterior influence to define the measure Φ_CII, which we call Causal Information Integration. Explicitly including a common exterior influence allows us to avoid the problems of a fixed edge between the Y_i discussed earlier. This leads to the graphs in Figure 4. The distributions belonging to these graphical models factorize as

P(z, w) = P(x) Π_i P(y_i | x_i, w) P(w),   ∀(z, w) ∈ Z × W.

By marginalizing over the elements of W we get a distribution on Z, defining our new model.

Definition 5 (Causal Information Integration). The set of distributions belonging to the marginalized model for |W| = m is

M^m_CII = {Q ∈ P(Z) | Q(z) = Σ_w Q(x) Π_i Q(y_i | x_i, w) Q(w), ∀z ∈ Z}.    (5)

We define the split model for Causal Information Integration as M_CII = ∪_{m∈ℕ} M^m_CII. This leads to the measure

Φ_CII = min_{Q∈M_CII} D(P ‖ Q).

In order to show that this measure satisfies the conditional independence statements in Property 1, we calculate the conditional distributions P(y_i | x_i) and P(y_i | x) of P(z) = Σ_w P(x) Π_i P(y_i | x_i, w) P(w). We are able to represent the marginalized model by using the methods from [23]. Up to this point we have been using chain graphs. These are graphs consisting of directed and undirected edges such that there are no semi-directed cycles, as described in Appendix A. In order to obtain a graph that represents the conditional independence structure of the marginalized model, we need the concept of chain mixed graphs (CMGs). In addition to the directed and undirected edges belonging to chain graphs, chain mixed graphs also have arcs ↔. Two nodes connected by an arc are called spouses. The connection between spouses appears when we marginalize over a common influence; hence spouses do not have a directed information flow from one node to the other, but are affected by the same mechanisms. Algorithm 8 from [23] allows us to transform a chain graph with latent variables into a chain mixed graph that represents the conditional independence structure of the marginalized chain graph. Applying it to the graphs in Figure 4 leads to the CMGs in Figure 5. Unfortunately, no factorization corresponding to the CMGs is known to the authors. In order to show that Φ_CII satisfies Property 2, we will show that M_I is a subset of M_CII. First we consider the following subset of M_CII:

M^m_CI = {Q ∈ P(Z) | Q(z) = Q(x) Σ_w Π_i Q(y_i | w) Q(w), ∀z ∈ Z},

where we remove the connections between the different time steps, as shown in Figure 6. Here Q(y) = Σ_w Q(w) Π_i Q(y_i | w) for Q ∈ M_CI, and therefore we have M_CI ⊆ M_I. In order to gain equality, it remains to show that Q(Y) can approximate every distribution on Y if the state space of W is sufficiently large. These distributions are mixtures of discrete product distributions, where the Q(y_i | w) are the mixture components and the Q(w) are the mixture weights. Hence we are able to use the following result.
Theorem 2.1 (Theorem 1.3.1 from [20]). Let q be a prime power. The smallest m for which any probability distribution on {1, …, q}^n can be approximated arbitrarily well as a mixture of m product distributions is q^{n−1}.
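The mixture structure used in the theorem is easy to make explicit. The helper below (names and interface are ours) assembles the joint distribution Σ_w Q(w) Π_i Q(y_i | w) from mixture weights and per-component factors:

```python
import numpy as np

def mixture_of_products(weights, components):
    """Joint array of the mixture sum_w Q(w) * prod_i Q(y_i | w).

    weights: the Q(w); components[w]: list of factor vectors Q(y_i | w)."""
    joint = None
    for p_w, factors in zip(weights, components):
        term = np.array(p_w, dtype=float)
        for f in factors:
            term = np.multiply.outer(term, np.asarray(f, dtype=float))
        joint = term if joint is None else joint + term
    return joint
```

For instance, for q = 2 and n = 2, the perfectly correlated distribution on two binary variables needs the full m = q^{n−1} = 2 components, since it is not itself a product distribution.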
Universal approximation results like the theorem above may suggest that the models M_CII and M_CIS are equal. However, we will present numerically calculated examples of elements belonging to M_CIS but not to M_CII, even with an extremely large state space. We will discuss this matter further in Section 2.0.2.

Ground truth
The concept of an exterior influence suggests that there exists a ground truth in a larger model in which W is a visible variable. This is shown in Figure 7 on the right. Assuming that we know the distribution of the whole model, we are able to apply the concepts discussed above to define an Integrated Information measure Φ_T on the larger space. This allows us to really remove only the causal connections, as shown in Figure 7 on the left. Thus we can interpret Φ_T as the ultimate measure of Integrated Information, if the ground truth is available.
The set of distributions belonging to the larger, fully connected model will be called E_f, and the set corresponding to the graph on the left of Figure 7, which depicts the split system, will be denoted by

E = {Q ∈ P(Z × W) | Q(z, w) = Q(x) Π_i Q(y_i | x_i, w) Q(w), ∀(z, w) ∈ Z × W, |W| = m}.

Calculating the KL-divergence between P ∈ E_f and E leads to the new measure Φ_T = min_{Q∈E} D(P ‖ Q).
This measure decomposes as Φ_T = Σ_i I(Y_i; X_{I∖{i}} | X_i, W), where I(Y_i; X_{I∖{i}} | X_i, W) is the conditional mutual information, defined by

I(Y_i; X_{I∖{i}} | X_i, W) = Σ_{y_i, x, w} P(y_i, x, w) log ( P(y_i | x, w) / P(y_i | x_i, w) ).

It characterizes the reduction of uncertainty in Y_i due to X_{I∖{i}} when W and X_i are given. Therefore the measure decomposes into a sum in which each summand characterizes the information flow towards one Y_i. Written as conditional independence statements, Φ_T is 0 if and only if Y_i ⊥⊥ X_{I∖{i}} | (X_i, W) for all i ∈ I. Ignoring W would lead exactly to Property 1. For a more detailed description of the conditional mutual information and its properties, see [11]. Additionally, by using that W ⊥⊥ X, we are able to split the conditional mutual information into a part corresponding to the conditional independence statements of Property 1 and another conditional mutual information.
Since the conditional mutual information is non-negative, Φ T is 0 if and only if the conditional independence statements of Property 1 hold and additionally the reduction of uncertainty in W due to X Iztiu given Y i , X i is 0.
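The conditional mutual information appearing in Φ_T can be computed directly. The sketch below handles a single conditioning variable, with the joint stored as p[y, x, c] (the layout is our own choice); in the setting above, the conditioning block would collect X_i and W:

```python
import numpy as np

def cmi(p):
    """I(Y; X | C) in nats for a strictly positive joint array p[y, x, c]."""
    p = np.asarray(p, dtype=float)
    p_xc = p.sum(axis=0, keepdims=True)      # P(x, c)
    p_yc = p.sum(axis=1, keepdims=True)      # P(y, c)
    p_c = p.sum(axis=(0, 1), keepdims=True)  # P(c)
    return float(np.sum(p * np.log(p * p_c / (p_xc * p_yc))))
```

The value vanishes exactly when Y ⊥⊥ X | C, matching the characterization of Φ_T = 0.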
In general, we do not know what the ground truth of our system is, and therefore we have to assume that W is a hidden variable. This leads us back to Φ_CII. Since minimizing over all possible W might compensate a part of the causal information flow, Φ_CII is less than or equal to the true value Φ_T.
Proposition 2. The measure Φ_T is an upper bound for Φ_CII: Φ_CII ≤ Φ_T.

Hence, by assuming that there exists a common exterior influence, we are able to show that Φ_CII is bounded from above by the true value, which purely measures intrinsic influences.

Relationships between the different measures
Now we are going to analyze the relationships between the different measures Φ_SI, Φ_G, Φ_CIS and Φ_CII. We start with Φ_G and Φ_CII. Previously we already showed that M_CII satisfies Property 1, and since Φ_G does not satisfy Property 1, we have M_G ⊄ M_CII. To evaluate the other inclusion, we consider the more refined parametrization of elements P ∈ M_CII as defined in (6),

P(z) = P(x) f_2(x_1, y_1) g_2(x_2, y_2) Σ_w P(w) f_1(w, y_1) f_3(x_1, y_1, w) g_1(w, y_2) g_3(x_2, y_2, w),

where f_1, f_2, f_3, g_1, g_2, g_3 are non-negative functions such that P ∈ P(Z); elements Q ∈ M_G factorize analogously, with a factor φ(y_1, y_2) depending only on y_1 and y_2 in place of the sum over w. Since the sum over w depends on more than Y_1 and Y_2, P(z) does not factorize according to M_G in general. Hence M_CII ⊄ M_G holds. Furthermore, looking at the parametrizations allows us to identify a subset of distributions which lies in the intersection of M_G and M_CII. Allowing P to have only pairwise interactions leads to

P(z) = P(x) f̃_2(x_1, y_1) g̃_2(x_2, y_2) Σ_w P(w) f̃_1(w, y_1) g̃_1(w, y_2) = P(x) f̃_2(x_1, y_1) g̃_2(x_2, y_2) φ(y_1, y_2),

with non-negative functions f̃_1, f̃_2, g̃_1, g̃_2 such that P ∈ P(Z) and φ(y_1, y_2) = Σ_w P(w) f̃_1(w, y_1) g̃_1(w, y_2).
This P is an element of M_G ∩ M_CII.
In the next part we discuss the relationship between M_CII and M_CIS. The elements of M_CII satisfy the conditional independence statements of Property 1, therefore M_CII ⊆ M_CIS. Previously we have seen that making the state space of W large enough allows us to approximate any distribution between the Y_i, see Theorem 2.1. This seems to hint that doing so would lead to equality between M_CII and M_CIS, but based on numerically calculated examples, we have the following conjecture.
Conjecture 1. It is not possible to approximate every distribution Q ∈ M_CIS with arbitrary accuracy by elements P ∈ M_CII. Therefore we have M_CII ⊊ M_CIS.

The following example strongly suggests this conjecture to be true.
Example 1. Consider the set of distributions that factor according to the graph in Figure 8,

N_CIS = {P ∈ P(Z) | P(z) = P(x_1) P(x_2) P(y_1 | x_1, y_2) P(y_2)}.

This model satisfies the conditional independence statements of Property 1 and is therefore a subset of the model M_CIS. In this case X_1 and X_2 are independent of each other, hence from a causal perspective the influence of Y_2 on Y_1 should be purely external. Therefore we try to model this with a subset N_CII of M_CII, which corresponds to Figure 9. Using the em-algorithm described in Section 2.0.3, we took 500 random elements of N_CIS and calculated the closest element of N_CII by taking the minimal KL-divergence over 50 different random input distributions in each run. The results are displayed in Table 1. Further examples which hint towards M_CII ⊊ M_CIS can be found in Section 2.1.2. Adding the hidden variable W seems not to be sufficient to approximate elements of M_CIS. Now the question naturally arises whether there are other exterior influences that need to be included in order to be able to approximate M_CIS. We explore this thought by starting with the graph corresponding to the split model M_SI, depicted in Figure 10 on the left. In the next step we add hidden vertices and edges to the graph in a way such that the whole graph is still a chain graph. An example of a valid hidden structure is given in Figure 10 in the middle. Since we are going to marginalize over the hidden structure, it is only important how the visible nodes are connected via the hidden nodes. In the example in Figure 10 we have a directed path from X_1 to X_2 going through the hidden nodes. Therefore we are able to reduce the structure to a gray box, shown on the right in Figure 10. Using the aforementioned Algorithm 8, which converts a chain graph with hidden variables into a chain mixed graph reflecting the conditional independence structure of the marginalized model, we find that marginalizing would create a directed edge from X_1 to X_2. Since this directed edge already exists, the resulting model is a subset of M_SI and therefore does not approximate M_CIS.

Following this procedure, we are able to show that adding further hidden nodes and subgraphs of hidden nodes does not lead to a chain mixed graph belonging to a model that satisfies the conditional independence statements of Property 1 and strictly contains M_CII.

Theorem 2.2. It is not possible to create a chain mixed graph corresponding to a model M such that its distributions satisfy Property 1 and M_CII ⊊ M by introducing a more complicated hidden structure to the graph of M_SI.
In conclusion, assuming that Conjecture 1 holds, we have the following relations among the different presented models.
A sketch of the inclusion properties among the models is displayed in Figure 11.Every set that lies inside M CIS satisfies Property 1 and every set that completely contains M I fulfills Property 2.

em-Algorithm
The calculation of the measure Φ_CII can be done with the em-algorithm, a well-known information geometric algorithm. It was proposed by Csiszár and Tusnády in 1984 in [13], and its usage in the context of neural networks with hidden variables was described, for example, by Amari et al. in [3]. The expectation-maximization (EM) algorithm [14] used in statistics is equivalent to the em-algorithm in many cases, including this one, as we will see below. A detailed discussion of the relationship between these algorithms can be found in [2].
In order to calculate the distance between the distribution P and the set M_CII on Z, we make use of the bigger space of distributions on Z × W, P(Z × W). Let M_{W|Z} be the set of all distributions on Z × W whose Z-marginals equal the distribution P of the whole system:

M_{W|Z} = { P̄ ∈ P(Z × W) | P̄(z) = P(z), ∀z ∈ Z } = { P̄ ∈ P(Z × W) | P̄(z, w) = P(z) P̄(w | z), ∀(z, w) ∈ Z × W }.
This is an m-flat submanifold, since it is linear with respect to P̄(w | z).
The second set that we are going to use is the set E of distributions that factor according to the split model including the common exterior influence. We have seen this set before in Section 2.0.1:

E = {Q ∈ P(Z × W) | Q(z, w) = Q(x) Π_i Q(y_i | x_i, w) Q(w), ∀(z, w) ∈ Z × W, |W| = m}.    (8)
This set is in general not e-flat, but we will show that there is a unique m-projection onto it. We are able to use these sets instead of P and M_CII because of the following result.
Theorem 2.3 (Theorem 7 from [3]). The em-algorithm is an iterative algorithm that alternates between an e-projection onto M_{W|Z} and an m-projection onto E. Let Q_0 ∈ E be an arbitrary starting point and define P̄_1 as the e-projection of Q_0 onto M_{W|Z}. Repeating this leads to

P̄_{t+1} = argmin_{P̄ ∈ M_{W|Z}} D(P̄ ‖ Q_t),   Q_{t+1} = argmin_{Q ∈ E} D(P̄_{t+1} ‖ Q).

The correspondence between these projections in the bigger space P(Z × W) and one m-projection in P(Z) is illustrated in Figure 12. The algorithm iterates between the bigger spaces M_{W|Z} and E, shown on the left of Figure 12. Using Theorem 2.3 we find that this minimization is equivalent to the minimization between P and M_CII. The convergence of this algorithm is given by the following result.
Proposition 3 (Theorem 8 from [3]). The monotonic relations

D(P̄_{t+1} ‖ Q_t) ≥ D(P̄_{t+1} ‖ Q_{t+1}) ≥ D(P̄_{t+2} ‖ Q_{t+1})

hold, where equality holds only for the fixed points (P̄, Q) ∈ M_{W|Z} × E of the projections.

Proof of Proposition 3. This is immediate from the definitions of the e- and m-projections.
Hence this algorithm is guaranteed to converge towards a minimum, but this minimum might be a local one. We will see examples of this in Section 2.1.2.
In order to use this algorithm to calculate Φ_CII, we first need to determine how to perform the e- and m-projections in this case. The e-projection from Q ∈ E to M_{W|Z} is given by P̄(z, w) = P(z) Q(w | z) for all (z, w) ∈ Z × W. This is the projection because of the following equality:

D(P̄ ‖ Q) = Σ_z P(z) log ( P(z) / Q(z) ) + Σ_{z,w} P(z) P̄(w | z) log ( P̄(w | z) / Q(w | z) ).

The first addend is a constant for a fixed distribution P, and the second addend is equal to 0 if and only if P̄(w | z) = Q(w | z). Note that this means that the conditional expectation of W remains fixed during the e-projection. This is an important point, because it guarantees the equivalence to the EM algorithm and therefore the convergence towards the MLE. For a proof and examples, see Theorem 8.1 in [1] and Section 6 in [2].
After discussing the e-projection, we now consider the m-projection.
Proposition 4. The m-projection from P̄ ∈ M_{W|Z} onto E is given by

Q(z, w) = P̄(x) Π_i P̄(y_i | x_i, w) P̄(w)

for all (z, w) ∈ Z × W.
The last remaining decision to be made before calculating Φ_CII is the choice of the initial distribution. Since it depends on the initial distribution whether the algorithm converges towards a local or the global minimum, it is important to take the minimal outcome of multiple runs. One class of starting points that immediately leads to an equilibrium which is in general not minimal are the ones in which Z and W are independent, P̄_0(z, w) = P̄_0(z) P̄_0(w). It is easy to check that the algorithm then converges to the fixed point

P̄(z, w) = P(x) Π_i P(y_i | x_i) P̄_0(w).

Note that the Z-marginal of this fixed point is the result of the m-projection of P onto M_SI, the manifold belonging to Φ_SI.
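Putting the two projections together gives a compact em-algorithm. The following sketch implements it for n = 2 binary units, with P stored as P[x1, x2, y1, y2] and a hidden state space of size m; all names and the array layout are our own, and a production implementation would add convergence checks and the multiple restarts discussed above:

```python
import numpy as np

def em_phi_cii(P, m=2, n_iter=300, rng=None):
    """One em-run for Phi_CII (n = 2); returns D(P || Q_Z) at the final iterate."""
    rng = np.random.default_rng(rng)
    Q = rng.random(P.shape + (m,))
    Q /= Q.sum()
    for _ in range(n_iter):
        # e-projection: keep Q(w|z), reset the Z-marginal to P
        Pb = P[..., None] * (Q / Q.sum(axis=-1, keepdims=True))
        # m-projection: refactorize according to the split model E
        p_w = Pb.sum(axis=(0, 1, 2, 3))                      # Pbar(w)
        p_x = Pb.sum(axis=(2, 3, 4))                         # Pbar(x1, x2)
        p_x1y1w = Pb.sum(axis=(1, 3))                        # Pbar(x1, y1, w)
        p_y1 = p_x1y1w / p_x1y1w.sum(axis=1, keepdims=True)  # Pbar(y1 | x1, w)
        p_x2y2w = Pb.sum(axis=(0, 2))                        # Pbar(x2, y2, w)
        p_y2 = p_x2y2w / p_x2y2w.sum(axis=1, keepdims=True)  # Pbar(y2 | x2, w)
        Q = (p_x[:, :, None, None, None]
             * p_y1[:, None, :, None, :]
             * p_y2[None, :, None, :, :]
             * p_w[None, None, None, None, :])
    Qz = Q.sum(axis=-1)  # marginalize W to land in M_CII
    return float(np.sum(P * np.log(P / Qz)))
```

By Proposition 3 the returned divergence is nonincreasing in the number of iterations, but the limit may be a local minimum, so Φ_CII is estimated as the minimum over several random initializations.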

Comparison
In order to compare the different measures, we need a setting in which we generate the probability distributions of full systems.We chose to use weighted Ising models as described in the next section.

Ising model
The distributions used to compare the different measures in the next chapter are generated by weighted Ising models, also known as binary auto-logistic models, as described in [28], Example 3.2.3. Let us consider n binary variables X = (X_1, …, X_n), X = {−1, 1}^n. The matrix V ∈ ℝ^{n×n} contains the weights v_ij of the connection from X_i to Y_j, as displayed in Figure 13. Note that this figure is not a graphical model corresponding to the stationary distribution, but merely displays the connections of the conditional distribution of Y_i = y_i given X = x with the respective weights,

P(Y_i = y_i | X = x) ∝ exp ( β y_i Σ_j v_ji x_j ).    (9)

The inverse temperature β > 0 regulates the coupling strength between the nodes. For β close to zero the different nodes are almost independent, and as β grows the connections become stronger.
Figure 13: The weights corresponding to the connections for n = 2.
We calculate the stationary distribution P by starting with a random initial distribution P_0 and then repeatedly applying the transition kernel (9):

P_{t+1}(y) = Σ_x P(y | x) P_t(x).

This leads to P = lim_{t→∞} P_t. There always exists a unique stationary distribution; see for instance [28], Theorem 5.1.2.
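The construction above can be sketched as follows; the explicit conditional form of (9) used here is our own reading of the auto-logistic model, and the function names are ours:

```python
import numpy as np
from itertools import product

def transition_matrix(V, beta):
    """T[x, y] = P(y | x) with P(y_i = +1 | x) = 1 / (1 + exp(-2*beta*sum_j V[j,i]*x_j)),
    over states in {-1, 1}^n."""
    n = V.shape[0]
    states = [np.array(s) for s in product([-1, 1], repeat=n)]
    T = np.zeros((len(states), len(states)))
    for a, x in enumerate(states):
        h = beta * (x @ V)                       # local field at each y_i
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * h))  # P(y_i = +1 | x)
        for b, y in enumerate(states):
            T[a, b] = np.where(y == 1, p_plus, 1.0 - p_plus).prod()
    return T, states

def stationary_distribution(T, n_iter=500):
    """Power iteration P_{t+1}(y) = sum_x P(y | x) P_t(x)."""
    p = np.full(T.shape[0], 1.0 / T.shape[0])
    for _ in range(n_iter):
        p = p @ T
    return p
```

Since all entries of T are strictly positive, the iteration converges to the unique stationary distribution regardless of the starting point.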

Results
In this section we compare the different measures experimentally. The code is available at [17]. To distinguish between the Causal Information Integration Φ_CII calculated with different sized state spaces of W, we write Φ^m_CII = min_{Q∈M^m_CII} D(P ‖ Q). We start with the smallest example possible, n = 2, and the weight matrix

V = ( 0.0084181  −0.2401545 ; 0.39270161  0.37198751 ),

shown in Figure 14. In this example every measure is bounded by Φ_I, and the measures Φ_I, Φ_G and Φ_SI display a limit behavior different from Φ_CIS and Φ_CII. The state spaces of W have the sizes 2, 3, 4, 36 and 92, and the respective measures are displayed in shades of blue that get darker as the state space gets larger. In every case the em-algorithm has been run 10 times with random input distributions in order to find a global minimum. On the right side of this figure we see the difference between Φ_CIS and Φ_CII. Considering the precision of the algorithms, we regard a difference smaller than 5e-07 as approximately zero. We can see that in the region from β = 4 to β = 6 the measures differ even in the case of 92 hidden states. So this small case already hints towards M_CII ⊊ M_CIS. Here we are also able to see a difference in the behavior of Φ_G compared to the other measures: Φ_I, Φ_SI, Φ_CII and Φ_CIS are still increasing around β ≈ 1.1, while Φ_G starts to decrease. A different choice of weights produces Figure 16. This example shows that Φ_SI is not bounded by Φ_I and therefore does not satisfy Property 2.
Using this example, we take a closer look at the local minima the em-algorithm converges to. Considering only $\Phi_{CII}$ and varying the size of the state space leads to the upper picture in Figure 17. This figure displays ten different runs of the em-algorithm for each size of the state space in different shades of the respective color, namely blue for $\Phi^{2}_{CII}$, violet for $\Phi^{4}_{CII}$, red for $\Phi^{8}_{CII}$ and orange for $\Phi^{16}_{CII}$. We observe how increasing the state space leads to a smaller value of $\Phi_{CII}$. Additionally, the differences between the minimal values corresponding to each state space grow smaller and converge as the state spaces increase. The bottom half of Figure 17 highlights an observation that we made. Each of the four illustrations is a copy of the one above, where the differences between the minima are shaded in the respective color. By increasing the size of the state space, the difference in value between the various local minima decreases visibly. We think this is consistent with the general observation made in the context of high-dimensional optimization, e.g. in [10], where the authors conjecture that the probability of finding a high-valued local minimum decreases as the network size grows.
Letting the algorithm run only once with $|\mathcal{W}| = 2$ on the same data leads to the curve on the left in Figure 18. The sets $\mathcal{E}$ defined in (8) and $\mathcal{M}_{CII}$ (5) do not change for different values of $\beta$, and therefore we have a fixed set of local minima for a fixed state space of $W$. What does change with $\beta$ is which of the local minima are global minima. The vertical dotted lines represent the steps from $P_t$ to $P_{t+1}$ in which the KL-divergence between the projections to $\mathcal{M}_{CII}$ is greater than 0.2, that is, $D_Z(P_{t,\star} \,\|\, P_{t+1,\star}) > 0.2$.
This means that inside the different sections of the curve the projections to $\mathcal{M}_{CII}$ are close. As $\beta$ increases, a different region of local minima becomes global. A sketch of this is shown in Figure 19. The curve is colored according to the distribution of $W$, as shown on the right side of the figure. We see that a different distribution on $W$ results in a different minimum, except for the region between 7.5 and 8. The colors light blue and yellow refer to distributions on $W$ that are different, but symmetric in the following way. Consider two distributions $Q, \bar{Q}$ on $\mathcal{Z} \times \mathcal{W}$ such that $Q(z, w_1) = \bar{Q}(z, w_2)$ and $Q(z, w_2) = \bar{Q}(z, w_1)$ for all $z \in \mathcal{Z}$. Then the corresponding marginalized distributions in $\mathcal{M}^{2}_{CII}$ are equal. This symmetry is the reason for the different colors in the region between 7.5 and 8.
Using this geometric algorithm, we therefore gain a notion of the local minima on $\mathcal{E}$.

Discussion
This article discusses a selection of existing complexity measures in the context of Integrated Information Theory that follow the framework introduced in [5], namely $\Phi_{SI}$, $\Phi_G$ and $\Phi_{CIS}$. The main contribution is the proposal of a new measure, Causal Information Integration $\Phi_{CII}$.
In [22] and [4] the authors postulate a Markov condition and an upper bound, given by the mutual information $\Phi_I$, for valid Integrated Information measures. Although $\Phi_{SI}$ is not bounded by $\Phi_I$, as we see in Figure 16, it does measure the intrinsic causal connections in a setting in which no common exterior influence exists. Therefore the authors of [16] criticize this bound. Since wrongly assuming the existence of a common exterior influence might lead to a value that does not capture all the intrinsic causal influences, the question of which measure to use depends strongly on the setting. We argue that using $\Phi_I$ as an upper bound is reasonable in the cases in which we do have a common exterior influence. The measure $\Phi_G$ attempts to extend $\Phi_{SI}$ to a setting with exterior influences, but it does not satisfy the Markov condition postulated in [22].
One measure that fulfills all the requirements of this framework is $\Phi_{CIS}$, but it has no graphical representation. Hence the nature of the measured information flow is difficult to analyze. In Example 1 we present a submodel of $\mathcal{M}_{CIS}$ that does not fit into the framework of Integrated Information. For discrete variables $\Phi_{CIS}$ has no closed-form solution and has to be calculated numerically.
We propose a new measure, $\Phi_{CII}$, which also satisfies all the conditions and additionally has a graphical and intuitive interpretation. Numerically calculated examples indicate that $\mathcal{M}_{CII} \subsetneq \mathcal{M}_{CIS}$. The definition of $\Phi_{CII}$ explicitly includes an exterior influence as a latent variable and therefore aims at measuring only intrinsic causal influences. This measure should be used in the setting in which there exists a common exterior influence that is unknown. By assuming the existence of a ground truth, we are able to prove that our new measure is bounded from above by the ultimate value of Integrated Information $\Phi_T$ of this system. Although $\Phi_{CII}$ also has no analytical solution, we are able to use the em-algorithm to calculate it. The em-algorithm is guaranteed to converge towards a minimum, but this minimum might be local. In our experience the em-algorithm is more reliable, and for larger networks faster, than the numerical methods we used to calculate $\Phi_{CIS}$. Additionally, by letting the algorithm run multiple times we gain a notion of how the local minima in $\mathcal{E}$ are related to each other, as demonstrated in Figure 18.
If $\tau$ is a singleton, then $\tau^{\star}$ is already complete. There are different kinds of independence statements a chain graph can encode, but we only need the global chain graph Markov property. In order to define this property we need the concepts of an ancestral set and of a moral graph.
The boundary $\mathrm{bd}(A)$ of a set $A \subseteq V$ is the set of vertices in $V \setminus A$ that are parents or neighbours of vertices in $A$. If $\mathrm{bd}(\alpha) \subseteq A$ for all $\alpha \in A$, we call $A$ an ancestral set. For any $A \subseteq V$ there exists a smallest ancestral set containing $A$, because the intersection of ancestral sets is again an ancestral set. This smallest ancestral set of $A$ is denoted by $\mathrm{An}(A)$.
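The smallest ancestral set can be computed by repeatedly adding the boundary until the set is closed; a minimal sketch (the set-based representation and the names are ours):

```python
def smallest_ancestral_set(A, undirected, directed):
    """An(A): close A under adding its boundary (parents and neighbours).

    undirected: set of frozensets {a, b} (edges within chain components)
    directed:   set of (parent, child) pairs
    """
    An = set(A)
    while True:
        bd = {p for (p, c) in directed if c in An}        # parents of An
        bd |= {v for e in undirected if e & An for v in e}  # neighbours of An
        bd -= An
        if not bd:       # bd(An) is contained in An, so An is ancestral
            return An
        An |= bd
```

For the chain graph with edges $X_1 \to Y_1$, $X_2 \to Y_2$ and $Y_1 - Y_2$, the smallest ancestral set of $\{Y_1\}$ is the whole vertex set, since $Y_2$ is a neighbour of $Y_1$ and both $X_i$ are parents.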
Let $G$ be a chain graph. The moral graph of $G$ is an undirected graph, denoted by $G^m$, that consists of the same vertex set as $G$ and in which two vertices $\alpha, \beta$ are connected if and only if either they were already connected by an edge in $G$ or there are vertices $\gamma, \delta$ belonging to the same chain component such that $\alpha \to \gamma$ and $\beta \to \delta$.
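Following this definition, the moral graph can be formed by "marrying" all parents of each chain component; a sketch under the same set-based representation as above (names are ours):

```python
from itertools import combinations

def moral_graph(undirected, directed, components):
    """Edge set of the moral graph G^m of a chain graph.

    Two vertices are joined iff they are already adjacent in G, or they both
    have a child in the same chain component.
    """
    edges = {frozenset(e) for e in undirected}
    edges |= {frozenset(e) for e in directed}   # directed edges become undirected
    for comp in components:
        parents = {p for (p, c) in directed if c in comp}
        for a, b in combinations(parents, 2):   # marry parents of the component
            edges.add(frozenset({a, b}))
    return edges
```

For the graph $X_1 \to Y_1$, $X_2 \to Y_2$, $Y_1 - Y_2$ with chain component $\{Y_1, Y_2\}$, moralization adds the edge $X_1 - X_2$, since both are parents of the same component.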
Definition 7 (Global Chain Graph Markov Property). Let $P$ be a distribution on $\mathcal{Z}$ and $G$ a chain graph. $P$ satisfies the global chain Markov property with respect to $G$ if for any triple $(Z_A, Z_B, Z_S)$ of disjoint subsets of $Z$ such that $Z_S$ separates $Z_A$ from $Z_B$ in $(G_{\mathrm{An}(Z_A \cup Z_B \cup Z_S)})^m$, the moral graph of the smallest ancestral set containing $Z_A \cup Z_B \cup Z_S$, it holds that $Z_A \perp\!\!\!\perp Z_B \mid Z_S$.

Since we are only considering positive discrete distributions, we have the following result.

Proof of Lemma 1. Theorem 4.1 from [15] combined with the Hammersley-Clifford theorem, e.g. Theorem 2.9 in [7], proves this statement.
In order to understand the conditional independence structure of a chain graph after marginalization, we need the following algorithm from [23]. This algorithm converts a chain graph with latent variables into a chain mixed graph with the conditional independence structure of the marginalized chain graph. A chain mixed graph has, in addition to directed and undirected edges, also bidirected edges, called arcs. The condition that there are no semi-directed cycles also applies to chain mixed graphs.

Definition 8. Let $M$ be the set of vertices over which we want to marginalize. The following algorithm produces a chain mixed graph (CMG) with the conditional independence structure of the marginalized chain graph.
1. Generate an $ij$ edge as in Table 2, steps 8 and 9, between $i$ and $j$ on a collider trislide with an endpoint $j$ and an endpoint in $M$, if an edge of the same type does not already exist.
2. Generate an appropriate edge as in Table 2, steps 1 to 7, between the endpoints of every tripath with inner node in $M$, if an edge of the same type does not already exist. Apply this step until no further edge can be generated.

3. Remove all nodes in $M$.
Conditional independence in CMGs is defined using the concept of c-separation; see for example [23], Section 4. For this definition we need the concepts of a walk and of a collider section. A walk is a list of vertices $\alpha_0, \ldots, \alpha_k$, $k \in \mathbb{N}$, such that there is an edge or arrow between $\alpha_i$ and $\alpha_{i+1}$ for $i \in \{0, \ldots, k-1\}$. A set of vertices connected by undirected edges is called a section. If there exists a walk including a section such that an arrow points at the first and last vertices of the section, $\to \bullet - \cdots - \bullet \leftarrow$, then this is called a collider section.
Table 2: Types of edges induced by tripaths with inner node $m \in M$ and trislides with endpoint $m \in M$.

Definition 9 (c-separation). Let $A$, $B$ and $C$ be disjoint sets of vertices of a graph. A walk $\pi$ is called a c-connecting walk given $C$ if every collider section of $\pi$ has a node in $C$ and all non-collider sections are disjoint from $C$. The sets $A$ and $B$ are called c-separated given $C$ if there are no c-connecting walks between them given $C$, and we write $A \perp\!\!\!\perp_c B \mid C$.
Applying the remaining relations inductively results in (4).
Proof of Proposition 1. Let $P \in \mathcal{E}_f$ and $Q \in \mathcal{E}$, and consider the KL-divergence between the two elements. A collider is a node or a set of nodes connected by undirected edges that has an arrow pointing at the set at both ends, $\to \bullet \cdots \bullet \leftarrow$. We start with the gridded hidden structure connected to $X_1$ and $X_2$. Since there already is an undirected edge between the $X_i$, an undirected path would make no difference in the marginalized model. The cases (2) and (3) would form a directed cycle, which violates the requirements of a chain mixed graph. A collider would also make no difference, since it disappears in the marginalized model. A common exterior influence leads to
$$\sum_{\hat{w}} P(\hat{w}) P(x \mid \hat{w}) P(y_1 \mid x_1) P(y_2 \mid x_2) = \sum_{\hat{w}} P(x, \hat{w}) P(y_1 \mid x_1) P(y_2 \mid x_2) = P(x) P(y_1 \mid x_1) P(y_2 \mid x_2).$$
Now let us discuss these possibilities in the case of a gray hidden structure between $X_i$ and $Y_j$, $i, j \in \{1, 2\}$, $i \neq j$. An undirected edge (1) or a directed edge (3) would create a directed cycle. A directed path (2) from $X_i$ to $Y_j$ would lead to a chain graph in which $X_i$ and $Y_j$ are not conditionally independent given $X_j$. If there exists a collider (4) in the hidden structure, then nothing else in the graph depends on this part of the structure, and it reduces to a factor of one when we marginalize over the hidden variables. Therefore the path between $X_i$ and $Y_j$ is interrupted, leaving a potential external influence or effect; these do not have an additional impact on the marginalized model. A common exterior influence (5) leads to a chain mixed graph which does not satisfy the necessary conditional independence structure, because applying the algorithm of Definition 8 leads to an arc between $X_i$ and $Y_j$, hence they are c-connected in the sense of Definition 9.
The next possibility is a dotted hidden structure between $X_i$ and $Y_i$, $i \in \{1, 2\}$. An undirected path (1) and a directed path (3) would lead to a directed cycle. A directed path (2) would add no new structure to the model, since there already is a directed edge between $X_i$ and $Y_i$. A collider (4) does not have an effect on the marginalized model. Adding a common exterior influence $W_1$ on $X_1, Y_1$ results in a new model which is not symmetric in $i \in \{1, 2\}$ and does not include $\mathcal{M}_I$; therefore it does not fully contain $\mathcal{M}_{CII}$. Adding additional common exterior influences $W_2$ on $X_2, Y_2$ or on $Y_1, Y_2$, in order to include $\mathcal{M}_I$ in the new model, violates the conditional independence statements, since nodes in $W_1$ and $W_2$ are connected in the moralized graph.
The last hidden structure between two nodes is the striped one between the $Y_i$. An undirected path (1) or any directed path (2), (3) leads to a graph that does not satisfy the conditional independence statements. A collider (4) has no impact on the model, and a common exterior influence leads to the definition of Causal Information Integration.
Connecting $Y_1$, $Y_2$ and $X_i$, $i \in \{1, 2\}$, either leads to a violation of the conditional independence statements or contains a collider, in which case the marginalized model reduces to one of the cases above.
All the possible ways a hidden structure could be connected to the three nodes $X_1, X_2, Y_1$ by directed edges are shown in Figure 21. Replacing any of these edges by an undirected edge would either make no difference or lead to a model that does not satisfy the conditional independence statements. In this case the black boxes represent sections. More complicated hidden structures reduce to this case, since these structures either contain a collider and correspond to one of the cases above, or contain longer directed paths in the direction of the edges connecting the structure to the visible nodes, which does not change the marginalized model. The models in (c), (d), (e), (f) and (g) either contain a collider, and therefore reduce to one of the cases discussed above, or induce a directed cycle. We see that (a) and (h) display structures that do not satisfy the conditional independence statements. The hidden structure in (b) has no impact on the model.
A hidden structure connected to all four nodes contains one of the structures above and therefore does not induce a new valid model.
Let us now consider a model with $n > 2$. Any hidden structure on this model either connects only up to four nodes, and therefore reduces to one of the cases above, contains one of the connections discussed in Figure 21, or only connects nodes within one point in time. The only structures possible to add would be a common exterior influence on the $X_i$, a common exterior influence on the $Y_i$, or a collider section on any nodes. All these structures do not change the marginalized model. Therefore it is not possible to create a chain graph with hidden nodes that yields a model strictly larger than $\mathcal{M}_{CII}$.

The KL-divergence between two distributions $P, Q$ on $\mathcal{Z}$ is defined as
$$D_Z(P \,\|\, Q) = \sum_{z \in \mathcal{Z}} P(z) \log \frac{P(z)}{Q(z)}.$$
Minimizing the KL-divergence with respect to the second argument is called m-projection or rI-projection. Hence we will call $P^{\star}$ with $P^{\star} = \arg\min_{Q \in \mathcal{M}} D_Z(P \,\|\, Q)$ the projection of $P$ to $\mathcal{M}$.
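As a small illustration (our own example, not taken from the text): for the independence model of two binary variables, the m-projection of $P$ is the product of its marginals, and the resulting divergence is the mutual information:

```python
import numpy as np

def kl(p, q):
    """D_Z(P || Q) = sum_z P(z) log(P(z) / Q(z)) for positive distributions."""
    return float(np.sum(p * np.log(p / q)))

# joint distribution of two binary variables
P = np.array([[0.3, 0.2],
              [0.1, 0.4]])
# m-projection onto the independence model: the product of the marginals
P_star = np.outer(P.sum(axis=1), P.sum(axis=0))
phi = kl(P, P_star)   # equals the mutual information I(X; Y)
```

No other product distribution is closer to $P$ than $P^{\star}$, which is what makes the projection well defined in this simple case.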

Figure 2: Interior and exterior influences on $Y$ in the full and the split system corresponding to $\Phi_I$.

Property 3. The set $\mathcal{M}_I$ should be a subset of the split model $\mathcal{M}$ corresponding to the Integrated Information measure $\Phi_{\mathcal{M}}$. Then the inequality $\Phi_{\mathcal{M}} = \min_{Q \in \mathcal{M}} D_Z(P \,\|\, Q) \leq I(X; Y)$ holds.

Figure 4: Split systems with exterior influences for $n = 2$ and $n = 3$.

Figure 6: Submodels of the split models with exterior influences for $n = 2$ and $n = 3$.

Figure 7: The graphs corresponding to $\mathcal{E}$ (left) and $\mathcal{E}_f$ (right).

Figure 8: Graph of the model $\mathcal{N}_{CIS}$.

Figure 9: Graph of the model $\mathcal{N}_{CII}$.

Figure 10: Example of an exterior influence on the initial graph.

Figure 11: Sketch of the relationship between the manifolds corresponding to the different measures.

Figure 14: Ising model with 2 nodes and the differences between $\Phi_{CIS}$ and $\Phi_{CII}$.

Figure 17: The effect of differently sized state spaces.

Figure 18: Curve of one run of the em-algorithm for each $\beta$, coloured according to the distribution of $W$.

Lemma 1. The global chain Markov property and the factorization property are equivalent for positive discrete distributions.

Figure 20: Starting graph and possible two-way interactions.

Figure 21: The eight possible hidden structures between three nodes.
$$\sum_{y_i, x, w} P(y_i, x, w) \log\left( \frac{P(y_i, x_{I \setminus \{i\}} \mid x_i)}{P(y_i \mid x_i)\, P(x_{I \setminus \{i\}} \mid x_i)} \cdot \frac{P(y_i, x_i)\, P(x)\, P(y_i, x, w)\, P(x_i, w)}{P(y_i, x)\, P(x_i)\, P(y_i, x_i, w)\, P(x, w)} \right) = I(Y_i; X_{I \setminus \{i\}} \mid X_i) + \sum_{y_i, x, w} P(y_i, x, w) \log \frac{P(y_i, x_i)\, P(x)\, P(y_i, x, w)\, P(x_i, w)}{P(y_i, x)\, P(x_i)\, P(y_i, x_i, w)\, P(x, w)}$$

Table 1: The results of the em-algorithm between $\mathcal{N}_{CIS}$ and $\mathcal{N}_{CII}$.

If we trust the generated results, this would imply that the influence from $Y_2$ to $Y_1$ is not purely external, but that there suddenly develops an internal influence at timestep $t+1$ that did not exist at timestep $t$. This situation should not occur in the context of Integrated Information.
The minimum divergence between $\mathcal{M}_{W|Z}$ and $\mathcal{E}$ is equal to the minimum divergence between $P$ and $\mathcal{M}_{CII}$ in the visible manifold.

Proof of Theorem 2.3. Let $P, Q \in \mathcal{P}(\mathcal{Z} \times \mathcal{W})$. Using the chain rule for the KL-divergence leads to
$$D_{Z \times W}(P \,\|\, Q) = D_Z(P \,\|\, Q) + D_{W|Z}(P \,\|\, Q),$$
hence
$$\min_{P \in \mathcal{M}_{W|Z},\, Q \in \mathcal{E}} D_{Z \times W}(P \,\|\, Q) = \min_{P \in \mathcal{M}_{W|Z},\, Q \in \mathcal{E}} \left\{ D_Z(P \,\|\, Q) + D_{W|Z}(P \,\|\, Q) \right\} = \min_{Q \in \mathcal{M}_{CII}} D_Z(P \,\|\, Q).$$
The inequality holds because in the first and third addend we are able to apply that the cross entropy is greater than or equal to the entropy, and in the second addend we use the log-sum inequality. A hidden structure can connect two nodes $A$ and $B$ in five ways: 1. they can be connected by an undirected path, 2. they can form a directed path from $A$ to $B$, 3. they can form a directed path from $B$ to $A$, 4. there exists a collider, or 5. $A$ and $B$ have a common exterior influence.
$$D_{Z \times W}(P \,\|\, Q) = \sum_{z, w} P(z, w) \log \frac{P(x) \prod_i P(y_i \mid x, w)\, P(w)}{Q(x) \prod_i Q(y_i \mid x_i, w)\, Q(w)} \geq \sum_{z, w} P(z, w) \log \frac{\prod_i P(y_i \mid x, w)}{\prod_i P(y_i \mid x_i, w)}$$
$$= \sum_{z, w} P(z, w) \log \frac{\prod_i P(y_i, x, w)\, P(x_i, w)}{\prod_i P(y_i, x_i, w)\, P(x, w)} = \sum_{z, w} P(z, w) \log \frac{\prod_i P(y_i, x_{I \setminus \{i\}} \mid x_i, w)}{\prod_i P(y_i \mid x_i, w)\, P(x_{I \setminus \{i\}} \mid x_i, w)} = \sum_i I(Y_i; X_{I \setminus \{i\}} \mid X_i, W).$$
Furthermore,
$$\sum_{z} P(z) \log \frac{\sum_w P(x) \prod_i P(y_i \mid x, w)\, P(w)}{\sum_w Q(x) \prod_i Q(y_i \mid x_i, w)\, Q(w)} \leq \sum_{z, w} P(z, w) \log \frac{P(x) \prod_i P(y_i \mid x, w)\, P(w)}{Q(x) \prod_i Q(y_i \mid x_i, w)\, Q(w)},$$
and minimizing the right-hand side over $Q$ gives $\min_{Q \in \mathcal{E}} D_{Z \times W}(P \,\|\, Q)$. The fact that every element $Q \in \mathcal{E}$ corresponds via marginalization to an element in $\mathcal{M}_{CII}$, and every element in $\mathcal{M}_{CII}$ has at least one corresponding element in $\mathcal{E}$, leads to the equality in the last row. $\square$