Information Geometry on Complexity and Stochastic Interaction

Interdependencies of stochastically interacting units are usually quantified by the Kullback-Leibler divergence of a stationary joint probability distribution on the set of all configurations from the corresponding factorized distribution. This is a spatial approach which does not describe the intrinsically temporal aspects of interaction. In the present paper, the setting is extended to a dynamical version where temporal interdependencies are also captured by using information geometry of Markov chain manifolds.

Since the publication of Shannon's pioneering work in 1948 [1], it has been hypothesized that his information theory provides means for understanding information processing and learning in the brain.Already in the 1950s, the principle of redundancy reduction has been proposed independently by Attneave [2] and Barlow [3].In 1981, Laughlin has provided some experimental evidence for the redundancy reduction principle in terms of the maximization of the output entropy of large monopolar cells of the fly's compound eye [4].As only deterministic response functions have been considered, this principle turns out to be equivalent to the mutual information maximization between the input and the output.Later, Linsker [5] has demonstrated that the maximization of mutual information in a layered feed-forward network leads to feature detectors that are similar to those observed by Hubel and Wiesel in the visual system of the cat and the monkey [6,7].He coined his information-theoretic principle of learning the infomax principle.
The idea that an information-theoretic principle, such as the infomax principle, governs learning processes of neuronal systems has attracted many researchers.A highly recognized contribution in this regard is the work by Bell and Sejnowski [8] which applies the infomax principle to the source separation problem.An exhaustive review of all relevant contributions to that field is not within the scope of this short discussion.I shall focus on approaches that aim at relating such information based principles to the overall complexity of the system.In particular, I shall concentrate on the theory of information integration and complexity, initially proposed by Tononi, Sporns, and Edelman [9], and further developed and analyzed in a series of papers [10][11][12][13][14][15].I shall compare this line of research with my own information-geometric approach to complexity, initially proposed in my manuscript [16], entitled Information Geometry on Complexity and Stochastic Interaction, which led to various lines of research that I am going to outline below.This manuscript constitutes the main body of the present paper, starting with Section 2. It quantifies complexity as the extent to which the whole is more than the sum of its parts using information geometry [17].Thereby, it extends the notion of multi-information [18,19], also called information integration in [9], to the setting of discrete time stochastic processes, in particular Markov chains.This article was originally accepted for publication in IEEE Transactions on Information Theory, subject to minor revision.However, by the end of the unusually long reviewing process I had come to the conclusion that my geometric approach has to be further improved in order to address important aspects of complexity (I shall be more concrete on that).Recent developments, on the other hand, suggest that this work is of relevance in the context of information integration already in its present form [12][13][14][15]20,21]. Therefore, it should be useful to provide it together with a discussion of its strengths and shortcomings, thereby relating it to similar work that has been developed since its first publication.
Let us first consider the so-called multi-information [18,19] of a random vector X = (X v ) v∈V , taking values in a finite set: where H denotes the Shannon entropy (we assume V to be a non-empty and finite set).
The multi-information vanishes if and only if the variables X v , v ∈ V , are stochastically independent.
In their original paper [9], Tononi, Sporns, and Edelman call this quantity integration.Following their intuition, however, the notion of integration should rather refer to a dynamical process, the process of integration, which is causal in nature.In later works, the dynamical aspects have been more explicitly addressed in terms of a causal version of mutual information, leading to improved notions of effective information and information integration, denoted by Φ [10,11].In fact, most formulated information-theoretic principles are, in some way or another, based on (conditional) mutual information.This directly fits into Shannon's classical sender-receiver picture [1], where the mutual information has been used in order to quantify the capacity of a communication channel.At first sight, this picture suggests to treat only feed-forward networks, in which information is transmitted from one layer to the next, as in the context of Linsker's infomax principle.In order to overcome this apparent restriction, however, we can simply unfold the dynamics in time and consider corresponding temporal information flow measures, which allows us to treat also recurrent networks.In what follows, I am going to explain this idea in more detail, thereby providing a motivation of the quantities that are derived in Section 2 in terms of information geometry.We consider again a non-empty and finite set V of nodes and assume that each v ∈ V receives signals from a set of nodes which we call parents of v and denote by pa(v).Based on the received signals, the node v updates its state according to a Markov kernel K (v) , the mechanism of v, which quantifies the conditional probability of its new state ω v given the current state ω pa(v) of its parents.If v ∈ pa(v), this update will involve also ω v for generating the new state ω v .How much information is involved from "outside", that is from ∂(v) := pa(v) \ v, in addition to the information given by ω v ?We can define the local information flow from this set as where MI stands for the (conditional) mutual information.Note that this is the uncertainty reduction that the node v gains through the knowledge of its parents' state, in addition to its own state.Now let us define the total information flow in the network.In order to do so, we have to consider the overall transition kernel.Because the nodes update their states in parallel, the global transition kernel is given as In order to quantify the total information flow in the network, we simply add all the local information flows, defined by Equation (2), and obtain It is easy to see that the total information flow vanishes whenever the global transition kernel has the following structure which encodes the dynamics of isolated non-communicating nodes: Referring to these kernels as being split, we are now ready to give our network information flow measure, defined by Equation ( 4), a geometric interpretation.If K has the structure Equation (3) then = min Here, D p (K K ) is a measure of "distance", in terms of the Kullback-Leibler divergence, between K and K with respect to the distribution p (see definition by Equation ( 23)).The expression on the right-hand side of Equation ( 6) can be considered as an extension of the multi-information (1) to the temporal domain.The second equality, Equation ( 7), gives the total information flow in the network a geometric interpretation as the distance of the global dynamics K from the set of split dynamics.Stated differently, the total information flow can be seen as the extent to which the whole transition X → X is more than the sum of its individual transitions X v → X v , v ∈ V .Note, however, that Equation (6) follows from the additional structure (3) which implies This structure encodes the consistency of the dynamics with the network.Equation (7), on the other hand, holds for any transition kernel K. Therefore, without reference to a particular network, the distance min K split D p (K K ) can be considered as a complexity measure for any transition X → X , which we denote by C (1) (X → X ).The information-geometric derivation of C (1) (X → X ) is given in Section 2.4.1.Restricted to kernels that are consistent with a network, the complexity C (1) (X → X ) reduces to the total information flow in the network (see Proposition 2 (iv)).
In order to consider the maximization of the complexity measure C (1) (X → X ) as a valid information-theoretic principle of learning in neuronal systems, I analyzed the natural gradient field on the manifold of kernels that have the structure given by Equation (3) (see [17,22] for the natural gradient method within information geometry).In [23] I proved the consistency of this gradient in the sense that it is completely local: If every node v maximizes its own local information flow, defined by Equation ( 2), in terms of the natural gradient, then this will be the best way, again with respect to the natural gradient, to maximize the complexity of the whole system.This suggests that the infomax principle by Linsker and also Laughlin's ansatz, applied locally to recurrent networks, will actually lead to the maximization of the overall complexity.We used geometric methods to study the maximizers of this complexity analytically [24,25].We have shown that they are almost deterministic, which has quite interesting implications, for instance for the design of learning systems that are parametrized in a way that allows them to maximize their complexity [26] (see also [27] for an overview of geometric methods for systems design).Furthermore, evidence has been provided in [25] that the maximization of C (1) (X → X ) is achieved in terms of a rule that mimics the spike-timing-dependent plasticity of neurons in the context of discrete time.Together with Wennekers, we have studied complexity maximization as first principle of learning in neural networks also in [28][29][30][31][32][33].
Even though I implicitly assumed that a natural notion of information flow has to reflect the causal interactions of the nodes, I should point out that the above definition of information flow has a shortcoming in this regard.If X v and X ∂(v) contain the same information, due to a strong stochastic dependence, then the conditional mutual information in Equation (2) will vanish, even though there might be a strong causal effect of ∂(v) on v. Thus, correlation among various potential causes can hide the actual causal information flow.The information flow measure of Equation ( 2) is one instance of the so-called transfer entropy [34] which is used within the context of Granger causality and has, as a conditional mutual information, the mentioned shortcoming also in more general settings (see a more detailed discussion in [35]).In order to overcome these limitations of the (conditional) mutual information, in a series of papers [35][36][37][38][39] we have proposed the use of information theory in combination with Pearl's theory of causation [40].Our approach has been discussed in [41] where a variant of our notion of node exclusion, introduced in [36], has been utilized for an alternative definition.This definition, however, is restricted to direct causal effects and does not capture, in contrast to [35], mediated causal effects.
Let us now draw a parallel to causality issues of the complexity measure introduced in the original work [9], which we refer to as TSE-complexity.In order to do so, consider the following representation of the original TSE-complexity as weighted sum of mutual informations: where . Interpreting the mutual information between A and its complement V \ A in this sum as an information flow is clearly misleading.These terms are completely associational and neglect the causal nature of information flow.In [10,11], Tononi and Sporns avoid such inconsistencies by injecting noise (maximum entropy distribution) into A and then measuring the effect in V \ A. They use the corresponding interventional mutual information in order to define effective information.Note that, although their notion of noise injection is conceptually similar to the notion of intervention proposed by Pearl, they formalize it differently.However, the idea of considering a post-interventional mutual information is similar to the one formalized in [35,36] using Pearl's interventional calculus.
Clearly, the measure C (1) (X → X ) does not account for all aspects of the system's complexity.One obvious reason for that can be seen by comparison with the multi-information, defined by Equation ( 1), which also captures some aspects of complexity in the sense that it quantifies the extent to which the whole is more than the sum of its elements (parts of size one).On the other hand, it attains its (globally) maximal value, if and only if the nodes are completely correlated.Such systems, in particular completely synchronized systems, are generally not considered to be complex.Furthermore, it turns out that these maximizers are determined by the marginals of size two [42].Stated differently, the maximization of the extent to which the whole is more than the sum of its parts of size one leads to systems that are not more than the sum of their parts of size two (see for a more detailed discussion [43,44]).Therefore, the multi-information does not capture the complexity of a distribution at all levels.The measure C (1) (X → X ) has the same shortcoming as the multi-information.In order to study different levels of complexity, one can consider coarse-grainings of the system at different scales in terms of corresponding partitions Π = {S 1 , . . ., S n } of V .Given such a partition, we can define the information flows among its atoms S i as we already did for the individual elements v of V .For each S i , we denote the set of nodes that provide information to S i from outside by ∂(S i ) := v∈S i (pa(v) \ S i ).We quantify the information flow into S i as in Equation (2): For a transition that satisfies Equation (3), the total information flow among the parts S i is then given by We can now define the Π-complexity of a general transition, as we already did for the complete partition: Obviously, the Π-complexity coincides with the information flow IF (X → X | Π) in the case where the transition kernel is compatible with the network.The information-geometric derivation of C(X → X | Π) is given in Section 2.4.1.In the early work [10,11], a similar approach has been proposed where only bipartitions have been considered.Later, an extension to arbitrary partitions has been proposed by Balduzzi and Tononi [12,13] where the complexity defined by Equation ( 11) appears as measure of effective information.Note, however, that there are important differences.First, the proposed measure by Tononi and his coworkers is reversed in time, so that their quantity is given by Equation (11) where X and X have exchanged roles.This time-reversal of the effective information is motivated by its intended role as a measure relevant to conscious experience.This does not make any difference in the case where a stationary distribution is chosen as input distribution.However, in order to be consistent with causal aspects of conscious experience, the authors choose a uniform input distribution, which models the least informative prior about the input.Note that there is also a closely related measure, referred to as synergistic information in the works [15,45]: The last equation directly follows from Proposition 1 (iii) (see the derivation of Equation ( 29)).
Interpreting the mutual informations as (one-step) predictive information [46][47][48], the synergistic information quantifies the extent to which the predictive information of the whole system exceeds the sum of predictive informations of the elements.Now, having for each partition of the system the corresponding Π-complexity of Equation ( 11), how should one choose among all these complexities the right one?Following the proposal made in [10][11][12][13], one should identify the partition (or bipartition) that has the smallest, appropriately normalized, Π-complexity.Although the overall complexity is not explicitly defined in these works, the notion of information integration, denoted by Φ, seems to directly correspond to it.This is confirmed by the fact that information integration is used for the identification of so-called complexes in the system.Loosely speaking, these are defined to be subsets S of V with maximal information integration Φ(S).This suggests that the authors equate information integration with complexity.In a further refinement [12,13] of the information integration concept, this is made even more explicit.In [13], Tononi writes: "In short, integrated information captures the information generated by causal interactions in the whole, over and above the information generated by the parts." Defining the overall complexity simply as the minimal one, with respect to all partitions, will ensure that a complex system has a considerably high complexity at all levels.I refer to this choice as the weakest link approach.This is not the only approch to obtain an overall complexty measure from individual ones defined for various levels.In order to give an instructive example for an alternative approach, let us highlight another representation of the TSE-complexity.Instead of the atoms of a partition, this time we consider the subsets of V with a given size k ∈ {1, . . ., N } and define the following quantity: Let us compare this quantity with the multi-information of Equation (1).For k = 1, they are identical.While the multi-information quantifies the extent to which the whole is more than the sum of its elements (subsets of size one), its generalization C (k) (X) can be interpreted as the extent to which the whole is more than the sum of its parts of size k.Now, defining the overall complexity as the minimal C (k) (X) would correspond to the weakest link approach which I discussed above in the context of partitions.
A complex system would then have considerably high complexity C (k) (X) at all levels k.However, the TSE-complexity is not constructed according to the weakest link approach, but can be written as a weighted sum of the terms C (k) (X): where α(k) = k N .The right choice of the weights is important here.I refer to this approach as the average approach.Clearly, one can interpolate between the weakest link approach and the average approach using the standard interpolation between the L ∞ -norm (maximum) and the L 1 -norm (average) in terms of the L p -norms, p ≥ 1.However, L p -norms appear somewhat unnatural for entropic quantities.
The TSE-complexity has also an information-geometric counterpart which has been developed in a series of papers [43,44,49,50].It is instructive to consider this geometric reformulation of the TSE-complexity.For a distribution p, let p (k) be the maximum-entropy estimation of p with fixed marginals of order k.In particular, p (N ) = p, and p (1) is the product of the marginals p v , v ∈ V , of order one.In some sense, p (k) encodes the structure of p that is contained only in the parts of size k.The deviation of p from p (k) therefore corresponds to C (k) (X), as defined in Equation ( 14).This correspondence can be made more explicit by writing this deviation in terms of a difference of entropies: where D denotes the Kullback-Leibler divergence.If we compare the Equations ( 16) and ( 14), then we see that A⊆V |A|=k H(X A ) corresponds to H p (k) (X).Indeed, both terms quantify the entropy that is contained in the marginals of order k.From the information-geometric point of view, however, the second term appears more natural.The first term seems to count marginal entropies multiple times so that we can expect that this mean value is larger than H p (k) (X).In [43], we have shown that this is indeed true, which implies If we replace the C (k) (X) in the definition (15) of the TSE-complexity by D(p p (k) ), then we obtain with the Pythagorean theorem of information geometry the following quantity: where

2
. Let us compare this with the multi-information.Following [18], we can decompose the multi-information as I already mentioned that high multi-information is achieved for strongly correlated systems, which implies that the global maximizers can be generated by systems that only have pairwise interactions [42], that is p = p (2) .It follows that in the above decompsition of Equation ( 19), only the first term D(p (2) p (1) ) is positive while all the other terms vanish for maximizers of the multi-information.This suggests that the multi-information does not weight all contributions D(p (k+1) p (k) ) to the stochastic dependence in a way that would qualify it as a complexity measure.The measure defined by Equation ( 18), which I see as an information-geometric counterpart of the TSE-complexity, weights the higher-order contributions D(p (k+1) p (k) ), k ≥ 2, more strongly.In this geometric picture, we can interpret the TSE-complexity as a rescaling of the multi-information in such a way that its maximization will emphasize not only pairwise interactions.Concluding this preface, I compared two lines of research, the one pursued by Tononi and coworkers on information integration, and my own information-geometric research on complexity.The fact that both research lines independently identified closely related core concepts of complexity confirms that these concepts are quite natural.The comparison of the involved ideas suggests the following intuitive definition of complexity: The complexity of a system is the extent to which the whole is more than the sum of its parts at all system levels.I argue that information geometry provides natural methods for casting this intuitive definition into a formal and quantitative theory of complexity.My paper [16], included here as Section 2, exemplifies this way of thinking about complexity.It is presented with only minor changes compared to its initial publication, except that the original reference list is replaced by the largely extended up-to-date list of references.This implies repetitions of a few standard definitions which I already used in this preface.

Introduction
"The whole is more than the sum of its elementary parts."This statement characterizes the present approach to complexity.Let us put it in a more formal setting.Assume that we have a system consisting of elementary units v ∈ V .With each non-empty subsystem S ⊂ V we associate a set O S of objects that can be generated by S. Examples for such objects are (deterministic) dynamical systems, stochastic processes, and probability distributions.Furthermore, we assume that there is a "composition" map ⊗ : v∈V O {v} → O V that defines how to put objects of the individual units together in order to describe a global object without any interrelations.The image of ⊗ consists of the split global objects which are completely characterized by the individual ones and therefore represent the absence of complexity.In order to quantify complexity, assume that there is given a function D : (x, y) → D(x y), that measures the divergence of global objects x, y ∈ O V .We define the complexity of x ∈ O V to be the divergence from being split: Of course, this approach is very general, and there are many ways to define complexity following this concept.Is there a canonical way?At least, within the probabilistic setting, information geometry [17,51] provides a very convincing framework for this.In the context of random fields, it leads to a measure for "spatial" interdependencies: Given state sets Ω v , v ∈ V , we define the set O S of objects that are generated by a subsystem S ⊂ V to be the probability distributions on the product set v∈S Ω v .A family of individual probability distributions p (v) on Ω v can be considered as a distribution on the whole configuration set v∈V Ω v by identifying it with the product ⊗ v∈V p (v) ∈ O V .In order to define the complexity of a distribution p ∈ O V on the whole system, according to Equation (20) we have to choose a divergence function.A canonical choice for D is given by the Kullback-Leibler divergence [52,53]: It is well known that I(p) quantifies spatial interdependencies [18].It vanishes exactly when the units are stochastically independent with respect to p.Such split distributions are called factorizable in this context.In Figure 1, the example of two binary units with the state sets {0, 1} is illustrated.The distributions with maximal interdependence (complexity) are given by 1 2 δ (0,0) + δ (1,1) and 1 2 δ (1,0) + δ (0,1) .
Spatial interdependence has been studied by Amari [18] and Ay [23,55] from the information-geometric point of view, where it is referred to as (stochastic) interaction and discussed in view of neural networks.The aim of the present paper is to use the concept of complexity that is formalized by Equation ( 20) in order to extend spatial interdependence to a dynamical notion of interaction, where the evolution in time is taken into account.Therefore, the term "stochastic interaction" is mainly used in the context of spatio-temporal interdependence.
The present paper is organized as follows.After a brief introduction into the information-geometric description of finite probability spaces in Section 2.2, the general notion of separability is introduced for Markovian transition kernels, and information geometry is used for quantifying non-separability as divergence from separability (Section 2.3).In Section 2.4, the presented theoretical framework is used to derive a dynamical version of the definition in Equation ( 21), where spatio-temporal interdependencies are quantified and referred to as stochastic interaction.This is illustrated by some simple but instructive examples.

Preliminaries on Finite Information Geometry
In the following, Ω denotes a non-empty and finite set.The vector space R Ω of all functions Ω → R carries the natural topology, and we consider subsets as topological subspaces.The set of all probability distributions on Ω is given by Following the information-geometric description of finite probability spaces, its interior P(Ω) can be considered as a differentiable submanifold of R Ω with dimension |Ω|−1 and the basis-point independent tangent space (If one considers P(Ω) as an "abstract" differentiable manifold, there are many ways to represent it as a submanifold of R Ω .In information geometry, the natural embedding presented here is called (−1)respectively (m)-representation) With the Fisher metric x(ω)y(ω) , P(Ω) becomes a Riemannian manifold [56] (In mathematical biology this metric is also known as Shahshahani metric [57]).The most important additional structure studied in information geometry is given by a pair of dual affine connections on the manifold.Application of such a dual structure to the present situation leads to the notion of (−1)and (+1)-geodesics: Each two points p, q ∈ P(Ω) can be connected by the geodesics γ Here, r(t) denotes the normalization factor.
A submanifold E of P(Ω) is called an exponential family if there exist a point p 0 ∈ P(Ω) and vectors v 1 , . . ., v d ∈ R Ω , such that it can be expressed as the image of the map Let p be a probability distribution in P(Ω).An element p ∈ E is called (−1)-projection of p onto E iff the (−1)-geodesic connecting p and p intersects E orthogonally with respect to the Fisher metric.Such a point p is unique ( [51], Theorem 3.9, p. 91) and can be characterized by the Kullback-Leibler divergence [52,53] (This is a special case of Csiszár's f -divergence [54]) We define the distance D(• E) : It is well known that a point p ∈ E is the (−1)-projection of p onto E if and only if it satisfies the minimizing property D(p E) = D(p p ) ([51], Theorem 3.8, p. 90; [17], Corollary 3.9, p. 63).
In the present paper, the set of states is given by the Cartesian product of individual state sets Ω v , v ∈ V , where V denotes the set of units.In the following, the unit set and the corresponding state sets are assumed to be non-empty and finite.For a subsystem S ⊂ V , Ω S := v∈S Ω v denotes the set of all configurations on S. The elements of P(Ω S ) are the random fields on S. One has the natural restriction X S : Ω V → Ω S , ω = (ω v ) v∈V → ω S := (ω v ) v∈S , which induces the projection P(Ω V ) → P(Ω S ), p → p S , where p S denotes the image measure of p under the variable X S .If the subsystem S consists of exactly one unit v, we write p v instead of p {v} .
The following example, which allows us to put the definition of Equation ( 21) into the information-geometric setting, represents the main motivation for the present approach to stochastic interaction.It will be generalized in Section 2.4.

Example 1 (FACTORIZABLE DISTRIBUTIONS AND SPATIAL INTERDEPENDENCE
).Let V be a finite set of units and Ω v , v ∈ V , corresponding state sets.Consider the tensorial map The image F := F(Ω V ) := ⊗ v∈V p (v) : p (v) ∈ P(Ω v ), v ∈ V of this map, which consists of all factorizable and strictly positive probability distributions, is an exponential family in . For the particular case of binary units, that is |Ω v | = 2 for all v, the dimension of F is equal to the number |V | of units.The following statement is well known [18]: The (−1)-projection of a distribution p ∈ P(Ω V ) on F is given by ⊗ v∈V p v (the p v , v ∈ V , are the marginal distributions), and one has the representation where H denotes the Shannon entropy [1].As stated in the introduction, I(p) is a measure for the spatial interdependencies of the units.It vanishes exactly when the units are stochastically independent.
Before extending the spatial notion of interaction to a dynamical one, in Section 2.3 we consider the more general concept of separability of transition kernels.

Non-Separability as Divergence from Separability
Consider a Markov chain X n = (X v, n ) v∈V , n = 0, 1, 2, . . ., that is given by an initial distribution p ∈ P(Ω V ) and a kernel K ∈ K(Ω V ).The probabilistic properties of this stochastic process are determined by the following set of finite marginals: Thus, the set of Markov chains on Ω V can be identified with and we also use the notation {X n } = {X 0 , X 1 , X 2 , . . .} instead of (p, K).The interior MC(Ω V ) of the set of Markov chains carries the natural dualistic structure from . Thus, we can talk about exponential families and (−1)-projections in MC(Ω V ).The "distance" D((p, K) (p , K )) from a Markov chain (p, K) to another one (p , K ) is given by For a set S = {(A 1 , B 1 ), (A 2 , B 2 ), . . ., (A n , B n )}, we introduce the exponential family (see Proposition 3) The set of all these exponential families is partially ordered by inclusion with MC(Ω V ) as the greatest element and MC f ac (Ω V ) as the least one.This ordering is connected with the following partial ordering of the sets S : Given S = {(A 1 , B 1 ), . . ., (A m , B m )} and S = {(A 1 , B 1 ), . . ., (A n , B n )}, we write S S (S coarser than S ) iff for all (A, B) ∈ S there exists a pair (A , B ) ∈ S with A ⊂ A and B ⊂ B .One has Thus, coarsening enlarges the corresponding manifold (the proof is given in the appendix).Now, we describe the (−1)-projections on the exponential families MC S (Ω V ): Proposition 1.Let (p, K) be a Markov chain in MC(Ω V ) and S S .Then: (i) (PROJECTION) The (−1)-projection of (p, K) on MC S (Ω V ) is given by (p, K S ) with Here, the kernels K A B ∈ K(Ω B | Ω A ) denote the corresponding marginals of K: K S is the projection of K on K S (Ω V ) with respect to p.
(ii) (ENTROPIC REPRESENTATION) The corresponding divergence is given by (iii) (PYTHAGORIAN THEOREM) One has , ω, ω ∈ Ω V , with a probability distribution p ∈ P(Ω V ), then the divergence D p (K K f ac ) is nothing but the measure I(p) for spatial interdependencies that has been discussed in the introduction and in Example 1.More generally, we interpret the divergence D p (K K S ) as a natural measure for the non-separability of (p, K) with respect to S .The corresponding function I S : (p, K) → I S (p, K) := D p (K K S ) has a unique continuous extension to the set MC(Ω V ) of all Markov chains which is also denoted by I S (see Lemma 4.2 in [55]).Thus, non-separability is defined for not necessarily strictly positive Markov chains.

The Definition of Stochastic Interaction
As stated in the introduction we use the concept of complexity that is described by the formal definition in Equation (20) in order to define stochastic interaction.
Let V be a set of units and Ω v , v ∈ V , corresponding state sets.Furthermore, consider non-empty and pairwise distinct subsystems S 1 , . . ., S n ⊂ V with V = S 1 • • • S n .The stochastic interaction of S 1 , . . ., S n with respect to (p, K) ∈ MC(Ω V ) is quantified by the divergence of (p, K) from the set of Markov chains that represent {S 1 , . . ., S n }-split information processing, where the subsystems do not interact with each other (see Examples and Definitions 1 (6)).More precisely, we define the stochastic interaction (of the subsystems S 1 , . . ., S n ) to be the function I S 1 ,...,Sn : MC(Ω V ) → R + with I S 1 ,...,Sn (p, K) := I S (S 1 ,...,Sn) (p, K) = inf In the case of complete splitting of V = {v 1 , . . ., v n } into the elementary units, that is S i := {v i }, i = 1, . . ., n, we simply write I instead of I {v 1 },...,{vn} .
The definition of stochastic interaction given by Equation ( 25) is consistent with the complexity concept that is discussed in the introduction.
Here are some basic properties of I, which are well known in the spatial setting of Example 1: Proposition 2. Let V be a set of units, Ω v , v ∈ V , corresponding state sets, and X n = (X v, n ) v∈V , n = 0, 1, 2, . . ., a Markov chain on Ω V .For a subsystem S ⊂ V , we write X S, n := (X v, n ) v∈S .Assume that the chain is given by (p, K) ∈ MC(Ω V ), where p is a stationary distribution with respect to K.
Then the following holds: (i) (iii) If the process is parallel, then (iv) If the process is adapted to a graph (V, E) then In the statements (iii) and (iv), the conditional mutual information MI(X; Y | Z) of two random variables X, Y with respect to a third one Z is defined to be the difference H(X | Z) − H(X | Y, Z) (see p. 22 in [58]).
If X n+1 and X n are independent for all n, the stochastic interaction I{X n } reduces to the measure I(p) for spatial interdependencies with respect to the stationary distribution p of {X n } (see Example 1).Thus, the dynamical notion of stochastic interaction is a generalization of the spatial one.Geometrically, this can be illustrated as follows.In addition to the projection K spl of the kernel K ∈ MC(Ω V ) with respect to a distribution p ∈ P(Ω V ) on the set of split kernels, we consider its projections K ind and K f ac on the set P(Ω V ) of independent kernels and on the subset F(Ω V ), respectively.From Proposition 1 we know

(sum of individual transinformations)
According to the Pythagorian relation (Proposition 1 (iii)), we get the following representation of stochastic interaction: In the particular case of an independent process, the divergences D p (K K ind ) and D p (K spl K f ac ) in Equation ( 29) vanish, and the stochastic interaction coincides with spatial interdependence.
Figure 2. Illustration of the two ways of projecting K onto F(Ω V ).Corresponding application of the Pythagorean theorem leads to Equation (29).

Examples
Example 2 (SOURCE AND RECEIVER).Consider two units 1 = source and 2 = receiver with the state sets Ω 1 and Ω 2 .Assume that the information flow is adapted to the graph G = {{1, 2}, {(1, 2)}}, which only allows a transmission from the first unit to the second.In each transition from time n to n + 1, a state X 1, n+1 of the first unit is chosen independently from X 1, n according to a probability distribution p ∈ P(Ω 1 ).The state X 2, n+1 of the second unit at time n + 1 is "computed" from X 1, n according to a kernel K ∈ K(Ω 2 | Ω 1 ).Using formula Equation (28), we have This is the well-known mutual information of the variables X 2, n+1 and X 1, n , which has a temporal interpretation within the present approach.It plays an important role in coding and information theory [58].
Example 3 (TWO BINARY UNITS I).Consider two units with the state sets {0, 1}.Each unit copies the state of the other unit with probability 1 − ε.The transition probabilities for the units are given by the following tables: The transition kernel K ∈ Kpar ({0, 1} × {0, 1}) for the corresponding parallel dynamics of the whole system is then given by Note that for ε ∈ {0, 1}, K corresponds to the deterministic transformations ε = 0 : (x, y) → (y, x) and ε = 1 : (x, y) → (1 − y, 1 − x) , which in an intuitive sense describe complete information exchange of the units.With the unique stationary probability distribution p = ( 1 4 , 1 4 , 1 4 , 1 4 ) one can easily compute the marginal kernels which describe the split dynamics according to K spl = K 1 ⊗ K 2 .With Equation ( 27) we finally get The shape of this function is shown in Figure 3.For the extreme values of ε we have maximal stochastic interaction, which corresponds to a complete information exchange in terms of (x, y) → (y, x) for ε = 0 and (x, y) → (1 − y, 1 − x) for ε = 1.For ε = 1 2 , the dynamics is maximally random, which is associated with no interaction of the nodes.This function is symmetric around ε = 1 2 where it vanishes.In ε = 0 and ε = 1 it attains its maximal value 2 ln 2. As stated above, this corresponds to the deterministic transformations with complete information exchange.
This function is monotonically increasing from the minimal value 0 (no interaction) in ε = 0 to its maximal value 2 ln 2 (complete interaction) in ε = 1.

Conclusions
Following the general concept that complexity is characterized by the divergence of a composed system from the superposition of its elementary parts, information geometry has been used to derive a measure for spatio-temporal interdependencies among a finite set of units, which is referred to as stochastic interaction.This generalizes the well-known measure for spatial interdependence that is quantified by the Kullback-Leibler divergence of a probability distribution from its factorization [18,55].Thereby, previous work by Ay [23] is continued, where the optimization of dependencies among

0 1 2 1Figure 3 .
Figure3.Illustration of the stochastic interaction I{X n } as a function of ε.For the extreme values of ε we have maximal stochastic interaction, which corresponds to a complete information exchange in terms of (x, y) → (y, x) for ε = 0 and (x, y)→ (1 − y, 1 − x) for ε = 1.For ε = 12 , the dynamics is maximally random, which is associated with no interaction of the nodes.