# Information Geometry on Complexity and Stochastic Interaction


## Abstract


## 1. Preface: Information Integration and Complexity

Consider a system of random variables $X = (X_v)_{v\in V}$, taking values in a finite set. A classical measure of the stochastic dependence among these variables is the multi-information

$$I(X) := \sum_{v\in V} H(X_v) - H(X), \qquad (1)$$

which vanishes if and only if the variables $X_v$, $v \in V$, are stochastically independent. In their original paper [9], Tononi, Sporns, and Edelman call this quantity integration. Following their intuition, however, the notion of integration should rather refer to a dynamical process, the process of integration, which is causal in nature. In later works, the dynamical aspects have been more explicitly addressed in terms of a causal version of mutual information, leading to improved notions of effective information and information integration, denoted by Φ [10,11]. In fact, most formulated information-theoretic principles are, in some way or another, based on (conditional) mutual information. This directly fits into Shannon’s classical sender-receiver picture [1], where the mutual information has been used in order to quantify the capacity of a communication channel. At first sight, this picture suggests treating only feed-forward networks, in which information is transmitted from one layer to the next, as in the context of Linsker’s infomax principle. In order to overcome this apparent restriction, however, we can simply unfold the dynamics in time and consider corresponding temporal information flow measures, which allows us to treat also recurrent networks. In what follows, I am going to explain this idea in more detail, thereby providing a motivation of the quantities that are derived in Section 2 in terms of information geometry.
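The multi-information of Equation (1) can be computed directly as the difference between the sum of marginal entropies and the joint entropy. A minimal sketch in Python (the two example distributions are arbitrary choices for illustration, not taken from the text):

```python
import math

def entropy(dist):
    """Shannon entropy in nats of a dict mapping state -> probability."""
    return -sum(q * math.log(q) for q in dist.values() if q > 0)

def multi_information(p, n_vars):
    """I(X) = sum_v H(X_v) - H(X) for a joint distribution over n_vars units."""
    marginals = []
    for v in range(n_vars):
        m = {}
        for state, q in p.items():
            m[state[v]] = m.get(state[v], 0.0) + q
        marginals.append(m)
    return sum(entropy(m) for m in marginals) - entropy(p)

# Fully correlated binary pair: I = ln 2 + ln 2 - ln 2 = ln 2 (its maximal value)
p_corr = {(0, 0): 0.5, (1, 1): 0.5}
# Independent uniform pair: I = 0
p_ind = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
```

The completely correlated pair attains the maximal value, illustrating the point made below that maximizers of the multi-information are completely synchronized systems.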

Each node $v$ comes with a kernel $K^{(v)}$, the mechanism of $v$, which quantifies the conditional probability of its new state ${\omega'}_v$ given the current state $\omega_{pa(v)}$ of its parents. If $v \in pa(v)$, this update will involve also $\omega_v$ for generating the new state ${\omega'}_v$. How much information is involved from “outside”, that is from $\partial(v) := pa(v) \setminus v$, in addition to the information given by $\omega_v$? We can define the local information flow from this set as the conditional mutual information

$$MI\big({X'}_v ; X_{\partial(v)} \,\big|\, X_v\big). \qquad (2)$$

Here, $D_p(K \,\|\, K')$ is a measure of “distance”, in terms of the Kullback-Leibler divergence, between $K$ and $K'$ with respect to the distribution $p$ (see the definition in Equation (23)). The expression on the right-hand side of Equation (6) can be considered as an extension of the multi-information (1) to the temporal domain. The second equality, Equation (7), gives the total information flow in the network a geometric interpretation as the distance of the global dynamics $K$ from the set of split dynamics. Stated differently, the total information flow can be seen as the extent to which the whole transition $X \to X'$ is more than the sum of its individual transitions $X_v \to {X'}_v$, $v \in V$. Note, however, that Equation (6) follows from the additional structure (3), which implies $H(X'|X) = \sum_{v\in V} H({X'}_v|X)$. This structure encodes the consistency of the dynamics with the network. Equation (7), on the other hand, holds for any transition kernel $K$. Therefore, without reference to a particular network, the distance $\min_{K'\ \mathrm{split}} D_p(K \,\|\, K')$ can be considered as a complexity measure for any transition $X \to X'$, which we denote by $C^{(1)}(X \to X')$. The information-geometric derivation of $C^{(1)}(X \to X')$ is given in Section 2.4.1. Restricted to kernels that are consistent with a network, the complexity $C^{(1)}(X \to X')$ reduces to the total information flow in the network (see Proposition 2 (iv)).

In order to establish $C^{(1)}(X \to X')$ as a valid information-theoretic principle of learning in neuronal systems, I analyzed the natural gradient field on the manifold of kernels that have the structure given by Equation (3) (see [17,22] for the natural gradient method within information geometry). In [23] I proved the consistency of this gradient in the sense that it is completely local: if every node $v$ maximizes its own local information flow, defined by Equation (2), in terms of the natural gradient, then this will be the best way, again with respect to the natural gradient, to maximize the complexity of the whole system. This suggests that the infomax principle by Linsker and also Laughlin’s ansatz, applied locally to recurrent networks, will actually lead to the maximization of the overall complexity. We used geometric methods to study the maximizers of this complexity analytically [24,25]. We have shown that they are almost deterministic, which has quite interesting implications, for instance for the design of learning systems that are parametrized in a way that allows them to maximize their complexity [26] (see also [27] for an overview of geometric methods for systems design). Furthermore, evidence has been provided in [25] that the maximization of $C^{(1)}(X \to X')$ is achieved in terms of a rule that mimics the spike-timing-dependent plasticity of neurons in the context of discrete time. Together with Wennekers, we have studied complexity maximization as a first principle of learning in neural networks also in [28–33].

If $X_v$ and $X_{\partial(v)}$ contain the same information, due to a strong stochastic dependence, then the conditional mutual information in Equation (2) will vanish, even though there might be a strong causal effect of $\partial(v)$ on $v$. Thus, correlation among various potential causes can hide the actual causal information flow. The information flow measure of Equation (2) is one instance of the so-called transfer entropy [34], which is used within the context of Granger causality and has, as a conditional mutual information, the mentioned shortcoming also in more general settings (see a more detailed discussion in [35]). In order to overcome these limitations of the (conditional) mutual information, in a series of papers [35–39] we have proposed the use of information theory in combination with Pearl’s theory of causation [40]. Our approach has been discussed in [41], where a variant of our notion of node exclusion, introduced in [36], has been utilized for an alternative definition. This definition, however, is restricted to direct causal effects and does not capture, in contrast to [35], mediated causal effects.

The measure $C^{(1)}(X \to X')$ does not account for all aspects of the system’s complexity. One obvious reason for that can be seen by comparison with the multi-information, defined by Equation (1), which also captures some aspects of complexity in the sense that it quantifies the extent to which the whole is more than the sum of its elements (parts of size one). On the other hand, it attains its (globally) maximal value if and only if the nodes are completely correlated. Such systems, in particular completely synchronized systems, are generally not considered to be complex. Furthermore, it turns out that these maximizers are determined by the marginals of size two [42]. Stated differently, the maximization of the extent to which the whole is more than the sum of its parts of size one leads to systems that are not more than the sum of their parts of size two (see [43,44] for a more detailed discussion). Therefore, the multi-information does not capture the complexity of a distribution at all levels. The measure $C^{(1)}(X \to X')$ has the same shortcoming as the multi-information.

In order to study different levels of complexity, one can consider coarse-grainings of the system at different scales in terms of corresponding partitions Π = {S<sub>1</sub>,…, S<sub>n</sub>} of $V$. Given such a partition, we can define the information flows among its atoms $S_i$ as we already did for the individual elements $v$ of $V$. For each $S_i$, we denote the set of nodes that provide information to $S_i$ from outside by $\partial(S_i) := \cup_{v\in S_i}\,(pa(v)\setminus S_i)$. We quantify the information flow into $S_i$ as in Equation (2), by the conditional mutual information $MI\big({X'}_{S_i} ; X_{\partial(S_i)} \,\big|\, X_{S_i}\big)$; the total information flow at the level of the partition is then given by the sum of these flows over its atoms. This construction leads, for each level $k$, to a quantity $C^{(k)}(X)$ that can be interpreted as the extent to which the whole is more than the sum of its parts of size $k$. Now, defining the overall complexity as the minimal $C^{(k)}(X)$ would correspond to the weakest link approach which I discussed above in the context of partitions. A complex system would then have considerably high complexity $C^{(k)}(X)$ at all levels $k$. However, the TSE-complexity is not constructed according to the weakest link approach, but can be written as a weighted sum of the terms $C^{(k)}(X)$.

One could interpolate between the $L^\infty$-norm (maximum) and the $L^1$-norm (average) in terms of the $L^p$-norms, $p \ge 1$. However, $L^p$-norms appear somewhat unnatural for entropic quantities.

Let $p^{(k)}$ be the maximum-entropy estimation of $p$ with fixed marginals of order $k$. In particular, $p^{(N)} = p$, and $p^{(1)}$ is the product of the marginals $p_v$, $v \in V$, of order one. In some sense, $p^{(k)}$ encodes the structure of $p$ that is contained only in the parts of size $k$. The deviation of $p$ from $p^{(k)}$ therefore corresponds to $C^{(k)}(X)$, as defined in Equation (14). This correspondence can be made more explicit by writing this deviation in terms of a difference of entropies:

$$D\big(p \,\|\, p^{(k)}\big) = H\big(p^{(k)}\big) - H(p).$$

If we replace $C^{(k)}(X)$ in the definition (15) of the TSE-complexity by $D(p \,\|\, p^{(k)})$, then we obtain with the Pythagorean theorem of information geometry a quantity, Equation (18), given by a weighted sum of the successive divergences $D\big(p^{(k+1)} \,\|\, p^{(k)}\big)$.

Recall that the maximizers of the multi-information are determined by their marginals of size two, that is, they satisfy $p = p^{(2)}$. It follows that in the above decomposition of Equation (19), only the first term $D\big(p^{(2)} \,\|\, p^{(1)}\big)$ is positive while all the other terms vanish for maximizers of the multi-information. This suggests that the multi-information does not weight all contributions $D\big(p^{(k+1)} \,\|\, p^{(k)}\big)$ to the stochastic dependence in a way that would qualify it as a complexity measure. The measure defined by Equation (18), which I see as an information-geometric counterpart of the TSE-complexity, weights the higher-order contributions $D\big(p^{(k+1)} \,\|\, p^{(k)}\big)$, $k \ge 2$, more strongly. In this geometric picture, we can interpret the TSE-complexity as a rescaling of the multi-information in such a way that its maximization will emphasize not only pairwise interactions.
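The maximum-entropy hierarchy $p^{(1)}, p^{(2)}, \dots$ and the Pythagorean decomposition can be checked numerically. The following sketch (Python with NumPy; the joint distribution is a random example, and iterative proportional fitting is used here as one standard way to compute $p^{(2)}$, which the text itself does not prescribe) verifies $D(p\,\|\,p^{(1)}) = D(p\,\|\,p^{(2)}) + D(p^{(2)}\,\|\,p^{(1)})$ for three binary variables:

```python
import itertools
import numpy as np

states = list(itertools.product((0, 1), repeat=3))  # the 8 joint states

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return float(np.sum(p * np.log(p / q)))

def pair_marginal(p, i, j):
    """2x2 marginal distribution of the variable pair (i, j)."""
    m = np.zeros((2, 2))
    for s, q in zip(states, p):
        m[s[i], s[j]] += q
    return m

rng = np.random.default_rng(0)
p = rng.random(8); p /= p.sum()          # arbitrary strictly positive joint

# p^(1): product of the one-dimensional marginals (max entropy, order 1)
p1 = np.ones(8)
for v in range(3):
    marg = np.array([sum(q for s, q in zip(states, p) if s[v] == x) for x in (0, 1)])
    p1 *= np.array([marg[s[v]] for s in states])

# p^(2): max entropy with the pairwise marginals of p, via iterative
# proportional fitting started from the uniform distribution
p2 = np.ones(8) / 8
for _ in range(2000):
    for i, j in [(0, 1), (0, 2), (1, 2)]:
        target, current = pair_marginal(p, i, j), pair_marginal(p2, i, j)
        p2 *= np.array([target[s[i], s[j]] / current[s[i], s[j]] for s in states])

# Pythagorean relation: D(p || p^(1)) = D(p || p^(2)) + D(p^(2) || p^(1))
lhs = kl(p, p1)
rhs = kl(p, p2) + kl(p2, p1)
```

The relation holds because $p^{(1)}$ lies in the closure of the exponential family of pairwise interactions on which $p^{(2)}$ is the projection of $p$.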

## 2. “Information Geometry on Complexity and Stochastic Interaction”, Reference [16]

#### 2.1. Introduction

Given state sets $\Omega_v$, $v \in V$, we define the set $\mathcal{O}_S$ of objects that are generated by a subsystem $S \subset V$ to be the probability distributions on the product set $\prod_{v\in S}\Omega_v$. A family of individual probability distributions $p^{(v)}$ on $\Omega_v$ can be considered as a distribution on the whole configuration set $\prod_{v\in V}\Omega_v$ by identifying it with the product $\otimes_{v\in V}\,p^{(v)} \in \mathcal{O}_V$. In order to define the complexity of a distribution $p \in \mathcal{O}_V$ on the whole system, according to Equation (20) we have to choose a divergence function. A canonical choice for $D$ is given by the Kullback-Leibler divergence [52,53]:

$$D(p \,\|\, q) := \sum_{\omega} p(\omega)\,\ln\frac{p(\omega)}{q(\omega)}.$$

#### 2.2. Preliminaries on Finite Information Geometry

The vector space $\mathbb{R}^\Omega$ of all functions $\Omega \to \mathbb{R}$ carries the natural topology, and we consider subsets as topological subspaces. The set of all probability distributions on $\Omega$ is given by

$$\overline{\mathcal{P}}(\Omega) := \Big\{ p \in \mathbb{R}^\Omega : p(\omega) \ge 0 \text{ for all } \omega \in \Omega,\ \sum_{\omega\in\Omega} p(\omega) = 1 \Big\}.$$

Its interior, the set $\mathcal{P}(\Omega)$ of strictly positive distributions, is a manifold in $\mathbb{R}^\Omega$ with dimension $|\Omega|-1$ and the basis-point independent tangent space

$$T(\Omega) := \Big\{ x \in \mathbb{R}^\Omega : \sum_{\omega\in\Omega} x(\omega) = 0 \Big\},$$

a linear subspace of $\mathbb{R}^\Omega$. (In information geometry, the natural embedding presented here is called the $(-1)$-, respectively $(m)$-, representation.) On $\mathcal{P}(\Omega)$ one has the Fisher metric $\langle\cdot,\cdot\rangle_p : T(\Omega)\times T(\Omega)\to\mathbb{R}$ in $p \in \mathcal{P}(\Omega)$ defined by

$$\langle x, y \rangle_p := \sum_{\omega\in\Omega} \frac{x(\omega)\,y(\omega)}{p(\omega)}.$$

An exponential family in $\mathcal{P}(\Omega)$ is given by a point $p_0 \in \mathcal{P}(\Omega)$ and vectors $v_1,\dots,v_d \in \mathbb{R}^\Omega$, such that it can be expressed as the image of the map $\mathbb{R}^d \to \mathcal{P}(\Omega)$, $\theta = (\theta_1,\dots,\theta_d) \mapsto p_\theta$, with

$$p_\theta(\omega) := \frac{p_0(\omega)\exp\big(\sum_{i=1}^{d}\theta_i v_i(\omega)\big)}{\sum_{\omega'\in\Omega} p_0(\omega')\exp\big(\sum_{i=1}^{d}\theta_i v_i(\omega')\big)}.$$

Now consider state sets $\Omega_v$, $v \in V$, where $V$ denotes the set of units. In the following, the unit set and the corresponding state sets are assumed to be non-empty and finite. For a subsystem $S \subset V$, $\Omega_S := \prod_{v\in S}\Omega_v$ denotes the set of all configurations on $S$. The elements of $\overline{\mathcal{P}}(\Omega_S)$ are the random fields on $S$. One has the natural restriction $X_S : \Omega_V \to \Omega_S$, $\omega = (\omega_v)_{v\in V} \mapsto \omega_S := (\omega_v)_{v\in S}$, which induces the projection $\overline{\mathcal{P}}(\Omega_V) \to \overline{\mathcal{P}}(\Omega_S)$, $p \mapsto p_S$, where $p_S$ denotes the image measure of $p$ under the variable $X_S$. If the subsystem $S$ consists of exactly one unit $v$, we write $p_v$ instead of $p_{\{v\}}$.

**Example 1** (FACTORIZABLE DISTRIBUTIONS AND SPATIAL INTERDEPENDENCE). Let $V$ be a finite set of units and $\Omega_v$, $v \in V$, corresponding state sets. Consider the tensorial map

$$\otimes : \prod_{v\in V}\mathcal{P}(\Omega_v) \to \mathcal{P}(\Omega_V), \qquad \big(p^{(v)}\big)_{v\in V} \mapsto \otimes_{v\in V}\,p^{(v)},$$

whose image $\mathcal{F} = \mathcal{F}(\Omega_V)$ is the set of strictly positive factorizable distributions. If $|\Omega_v| = 2$ for all $v$, the dimension of $\mathcal{F}$ is equal to the number $|V|$ of units. The following statement is well known [18]: the $(-1)$-projection of a distribution $p \in \mathcal{P}(\Omega_V)$ on $\mathcal{F}$ is given by $\otimes_{v\in V}\,p_v$ (the $p_v$, $v \in V$, are the marginal distributions), and one has the representation

$$D(p \,\|\, \mathcal{F}) = \sum_{v\in V} H(p_v) - H(p) =: I(p).$$

#### 2.3. Quantifying Non-Separability

#### 2.3.1. Manifolds of Separable Transition Kernels

Consider state sets $\Omega_v$, $v \in V$, and two subsets $A, B \subset V$ with $B \neq \emptyset$. A function

$$K : \Omega_A \times \Omega_B \to \mathbb{R}, \qquad (\omega, \omega') \mapsto K(\omega' \,|\, \omega),$$

is called a transition kernel from $\Omega_A$ to $\Omega_B$ if $K(\omega'|\omega) \ge 0$ for all $\omega, \omega'$ and $\sum_{\omega'\in\Omega_B} K(\omega'|\omega) = 1$ for all $\omega \in \Omega_A$. The set of such kernels is denoted by $\overline{\mathcal{K}}(\Omega_B|\Omega_A)$, and $\mathcal{K}(\Omega_B|\Omega_A)$ denotes its subset of strictly positive kernels. If $A = \emptyset$, then $\Omega_A$ consists of exactly one element, namely the empty configuration $\epsilon$. In that case, $\overline{\mathcal{K}}(\Omega_B|\Omega_{\emptyset}) = \overline{\mathcal{K}}(\Omega_B|\epsilon)$ can naturally be identified with $\overline{\mathcal{P}}(\Omega_B)$ by $p(\omega) := K(\omega|\epsilon)$, $\omega \in \Omega_B$.

Given $p \in \overline{\mathcal{P}}(\Omega_A)$ and $K \in \overline{\mathcal{K}}(\Omega_B|\Omega_A)$, we define the conditional entropy

$$H(p, K) := -\sum_{\omega\in\Omega_A}\sum_{\omega'\in\Omega_B} p(\omega)\,K(\omega'|\omega)\,\ln K(\omega'|\omega).$$

If $X$ and $Y$ are random variables with $\mathrm{Prob}\{X = \omega\} = p(\omega)$ for all $\omega \in \Omega_A$, and $\mathrm{Prob}\{Y = \omega' \,|\, X = \omega\} = K(\omega'|\omega)$ for all $\omega \in \Omega_A$ with $p(\omega) > 0$ and all $\omega' \in \Omega_B$, we set $H(Y\,|\,X) := H(p, K)$.
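The conditional entropy $H(p, K)$ can be evaluated directly from this definition. A minimal sketch in Python (the dictionary-based representation of $p$ and $K$ is an implementation choice for illustration, not notation from the text):

```python
import math

def conditional_entropy(p, K):
    """H(p, K) = -sum_{w, w'} p(w) K(w'|w) ln K(w'|w), in nats.

    p: dict state -> probability.
    K: dict state -> (dict next_state -> probability), i.e. one row per state.
    """
    h = 0.0
    for w, pw in p.items():
        for w2, k in K[w].items():
            if pw > 0 and k > 0:
                h -= pw * k * math.log(k)
    return h

# A deterministic kernel has H(p, K) = 0; a uniform binary kernel gives ln 2.
p = {0: 0.5, 1: 0.5}
K_det = {0: {1: 1.0}, 1: {0: 1.0}}
K_unif = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
```

These two extreme cases bracket the range $0 \le H(Y|X) \le \ln|\Omega_B|$.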

Now let $\mathcal{S} = \{(A_1, B_1), \dots, (A_n, B_n)\}$ be a set of pairs, where the $A_i$ and $B_i$ are subsets of $V$. We assume that $\{B_1,\dots,B_n\}$ is a partition of $V$, that is $B_i \neq \emptyset$ for all $i$, $B_i \cap B_j = \emptyset$ for all $i \neq j$, and $V = B_1 \uplus \cdots \uplus B_n$. Now consider the corresponding tensorial map

$$\otimes_{\mathcal{S}} : \prod_{i=1}^{n}\mathcal{K}(\Omega_{B_i}|\Omega_{A_i}) \to \mathcal{K}(\Omega_V|\Omega_V), \qquad \big(\otimes_{\mathcal{S}}(K^{(i)})\big)(\omega'|\omega) := \prod_{i=1}^{n} K^{(i)}\big({\omega'}_{B_i} \,\big|\, \omega_{A_i}\big),$$

and denote its image by $\mathcal{K}_{\mathcal{S}}(\Omega_V)$.

**Examples and Definitions 1.**

1. If we set $\mathcal{S}:=\left\{\left(V,V\right)\right\}$, the tensorial map is nothing but the identity $\mathcal{K}\left({\Omega}_{V}\right)\to \mathcal{K}\left({\Omega}_{V}\right)$, and therefore one has ${\mathcal{K}}_{\mathcal{S}}\left({\Omega}_{V}\right)=\mathcal{K}\left({\Omega}_{V}\right)$.
2. Consider the case where no temporal information is transmitted but all spatial information: $\mathcal{S}:=ind:=\{(\emptyset ,V)\}$. In that case the tensorial map ${\otimes}_{\mathcal{S}}$ reduces to the natural embedding$$\mathcal{K}({\Omega}_{V}|{\Omega}_{\emptyset})=\mathcal{P}\left({\Omega}_{V}\right)\hookrightarrow \mathcal{K}\left({\Omega}_{V}\right),$$$$\begin{array}{cc}K\left({\omega}^{\prime}|\omega \right):=p\left({\omega}^{\prime}\right),& \omega ,{\omega}^{\prime}\in {\Omega}_{V}\end{array}.$$Therefore, we write ${\mathcal{K}}_{ind}\left({\Omega}_{V}\right)=\mathcal{P}\left({\Omega}_{V}\right)$.
3. In addition to the splitting in time described in example (2), consider also a complete splitting in space: $\mathcal{S}:=fac:=\{(\emptyset ,\left\{v\right\}):v\in V\}$. Then we recover the tensorial map of Example 1. Thus, ${\mathcal{K}}_{fac}\left({\Omega}_{V}\right)$ can be identified with $\mathcal{F}\left({\Omega}_{V}\right)$.
4. To model the important class of parallel information processing, we set $\mathcal{S}:=par:=\{\left(V,\left\{v\right\}\right):v\in V\}$. Here, each unit “computes” its new state on the basis of all current states according to a kernel ${K}^{\left(v\right)}\in \mathcal{K}\left({\Omega}_{v}|{\Omega}_{V}\right)$. The transition from a configuration $\omega = (\omega_v)_{v\in V}$ of the whole system to a new configuration $\omega' = ({\omega'}_v)_{v\in V}$ is done according to the following composed kernel in $\mathcal{K}\left({\Omega}_{V}\right)$:$$\begin{array}{cc}K({\omega}^{\prime}|\omega )={\displaystyle \prod _{v\in V}{K}^{(v)}({{\omega}^{\prime}}_{v}|\omega ),}& \omega ,{\omega}^{\prime}\in {\mathrm{\Omega}}_{V}\end{array}.$$
5. In applications, parallel processing is adapted to a graph $G = (V, E)$ – here, $E \subset V \times V$ denotes the set of edges – in order to model constraints for the information flow in the system. This is represented by $\mathcal{S}:=\mathcal{S}\left(G\right):=\{(pa(v),\left\{v\right\}):v\in V\}$. Each unit $v$ is supposed to process only information from its parents $pa(v) = \{\mu \in V: (\mu, v) \in E\}$, which is modeled by a transition kernel ${K}^{\left(v\right)}\in \mathcal{K}\left({\Omega}_{v}|{\Omega}_{pa(v)}\right)$. The parallel transition of the whole system is then described by$$\begin{array}{cc}K({\omega}^{\prime}|\omega )={\displaystyle \prod _{v\in V}{K}^{(v)}({{\omega}^{\prime}}_{v}|{\omega}_{pa(v)}),}& \omega ,{\omega}^{\prime}\in {\mathrm{\Omega}}_{V}\end{array}.$$
6. Now, we introduce the example of parallel processing that plays the most important role in the present paper: consider non-empty and pairwise disjoint subsystems $S_1,\dots,S_n$ of $V$ with $V = S_1 \uplus \cdots \uplus S_n$ and define $\mathcal{S}:=\mathcal{S}({S}_{1},\dots ,{S}_{n}):=\{({S}_{i},{S}_{i}):i=1,\dots ,n\}$. It describes $\{S_1,\dots,S_n\}$-split information processing, where the subsystems do not interact with each other. Each subsystem $S_i$ only processes information from its own current state according to a kernel ${K}^{\left(i\right)}\in \mathcal{K}\left({\Omega}_{S_i}\right)$. The composed transition of the whole system is then given by$$\begin{array}{cc}K({\omega}^{\prime}|\omega )={\displaystyle \prod _{i=1}^{n}{K}^{(i)}}({{\omega}^{\prime}}_{{S}_{i}}|{\omega}_{{S}_{i}}),& \omega ,{\omega}^{\prime}\in {\mathrm{\Omega}}_{V}\end{array}.$$For the completely split case, where the subsystems are the elementary units, we define $spl:=\mathcal{S}(\{v\},v\in V)=\{\left(\left\{v\right\},\left\{v\right\}\right):v\in V\}$.
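The composed kernels above are plain products of the unit kernels. A minimal sketch (Python; the two-unit binary setup and the particular kernel values are arbitrary assumptions for illustration) of the parallel composition $K(\omega'|\omega) = \prod_v K^{(v)}({\omega'}_v|\omega)$:

```python
import itertools

def compose_parallel(unit_kernels, n_units):
    """Joint kernel K(w'|w) = prod_v K^(v)(w'_v | w) for parallel processing.

    unit_kernels[v]: dict mapping a current configuration w (tuple) to the
    probability that unit v's next state is 1.
    """
    confs = list(itertools.product((0, 1), repeat=n_units))
    K = {}
    for w in confs:
        row = {}
        for w2 in confs:
            prob = 1.0
            for v in range(n_units):
                p1 = unit_kernels[v][w]
                prob *= p1 if w2[v] == 1 else 1.0 - p1
            row[w2] = prob
        K[w] = row
    return K

# Two binary units; each unit's next state is a noisy copy of the other's state.
eps = 0.1
k1 = {w: (1 - eps) if w[1] == 1 else eps for w in itertools.product((0, 1), repeat=2)}
k2 = {w: (1 - eps) if w[0] == 1 else eps for w in itertools.product((0, 1), repeat=2)}
K = compose_parallel([k1, k2], 2)
rows_ok = all(abs(sum(row.values()) - 1.0) < 1e-12 for row in K.values())
```

Because each factor is itself a probability distribution over the unit's next state, every row of the composed kernel sums to one automatically.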

#### 2.3.2. Non-Separability as Divergence from Separability

Consider a Markov chain $X_n = (X_{v,n})_{v\in V}$, $n = 0, 1, 2, \dots$, that is given by an initial distribution $p\in \overline{\mathcal{P}}({\mathrm{\Omega}}_{V})$ and a kernel $K\in \overline{\mathcal{K}}\left({\Omega}_{V}\right)$. The probabilistic properties of this stochastic process are determined by the corresponding set of finite-dimensional marginals, so that the set of Markov chains on $\Omega_V$ can be identified with the set of pairs $(p, K)$. We also write $\{X_n\} = \{X_0, X_1, X_2, \dots\}$ instead of $(p, K)$. The interior $\mathrm{MC}(\Omega_V)$ of the set of Markov chains carries the natural dualistic structure from $\mathcal{P}\left({\Omega}_{V}\times {\Omega}_{V}\right)$, which is induced by the diffeomorphic composition map $\otimes :\mathrm{MC}\left({\Omega}_{V}\right)\to \mathcal{P}\left({\Omega}_{V}\times {\Omega}_{V}\right)$,

$$(p, K) \mapsto p \otimes K, \qquad (p \otimes K)(\omega, \omega') := p(\omega)\,K(\omega'|\omega).$$

The “distance” $D((p, K) \,\|\, (p', K'))$ from a Markov chain $(p, K)$ to another one $(p', K')$ is given by

$$D\big((p, K) \,\|\, (p', K')\big) := D\big(p \otimes K \,\|\, p' \otimes K'\big) = \sum_{\omega,\omega'\in\Omega_V} p(\omega)\,K(\omega'|\omega)\,\ln\frac{p(\omega)\,K(\omega'|\omega)}{p'(\omega)\,K'(\omega'|\omega)}.$$

In particular, for a common initial distribution $p$, this reduces to

$$D_p(K \,\|\, K') := \sum_{\omega,\omega'\in\Omega_V} p(\omega)\,K(\omega'|\omega)\,\ln\frac{K(\omega'|\omega)}{K'(\omega'|\omega)}. \qquad (23)$$

The manifolds $\mathrm{MC}_{\mathcal{S}}(\Omega_V) := \{(p, K) \in \mathrm{MC}(\Omega_V) : K \in \mathcal{K}_{\mathcal{S}}(\Omega_V)\}$ are partially ordered by inclusion, with $\mathrm{MC}(\Omega_V)$ as the greatest element and $\mathrm{MC}_{fac}(\Omega_V)$ as the least one. This ordering is connected with the following partial ordering $\preceq$ of the sets $\mathcal{S}$: given $\mathcal{S}=\left\{\left({A}_{1},{B}_{1}\right),\dots ,\left({A}_{m},{B}_{m}\right)\right\}$ and $\mathcal{S}'=\left\{\left({{A}^{\prime}}_{1},{{B}^{\prime}}_{1}\right),\dots ,\left({{A}^{\prime}}_{n},{{B}^{\prime}}_{n}\right)\right\}$, we write $\mathcal{S} \preceq \mathcal{S}'$ ($\mathcal{S}'$ coarser than $\mathcal{S}$) iff for all $(A, B) \in \mathcal{S}$ there exists a pair $(A', B') \in \mathcal{S}'$ with $A \subset A'$ and $B \subset B'$. One has the implication

$$\mathcal{S} \preceq \mathcal{S}' \ \Rightarrow\ \mathrm{MC}_{\mathcal{S}}(\Omega_V) \subseteq \mathrm{MC}_{\mathcal{S}'}(\Omega_V). \qquad (24)$$

**Proposition 1.** Let $(p, K)$ be a Markov chain in $\mathrm{MC}(\Omega_V)$ and $\mathcal{S} \preceq \mathcal{S}'$. Then:

- (i) (PROJECTION) The $(-1)$-projection of $(p, K)$ on ${\mathrm{MC}}_{\mathcal{S}}\left({\Omega}_{V}\right)$ is given by $(p, {K}_{\mathcal{S}})$ with ${K}_{\mathcal{S}}:={\otimes}_{(A,B)\in \mathcal{S}}{K}_{B}^{A}$. Here, the kernels ${K}_{B}^{A}\in \mathcal{K}\left({\Omega}_{B}|{\Omega}_{A}\right)$ denote the corresponding marginals of $K$:$${K}_{B}^{A}({\omega}^{\prime}|\omega ):=\frac{{\displaystyle \sum {}_{\begin{array}{c}\sigma ,{\sigma}^{\prime}\in {\mathrm{\Omega}}_{V}\\ {\sigma}_{A}=\omega ,{{\sigma}^{\prime}}_{B}={\omega}^{\prime}\end{array}}p(\sigma )K({\sigma}^{\prime}|\sigma )}}{{\displaystyle \sum {}_{\begin{array}{c}\sigma \in {\mathrm{\Omega}}_{V}\\ {\sigma}_{A}=\omega \end{array}}p(\sigma )}},\qquad \omega \in {\mathrm{\Omega}}_{A},{\omega}^{\prime}\in {\mathrm{\Omega}}_{B}.$$${K}_{\mathcal{S}}$ is the projection of $K$ on ${\mathcal{K}}_{\mathcal{S}}\left({\Omega}_{V}\right)$ with respect to $p$.
- (ii) (ENTROPIC REPRESENTATION) The corresponding divergence is given by$$\begin{array}{l}D((p,K)\Vert {\mathrm{MC}}_{\mathcal{S}}({\mathrm{\Omega}}_{V}))={D}_{p}(K\Vert {K}_{\mathcal{S}})\\ \phantom{\rule{8.5em}{0ex}}={\displaystyle \sum _{(A,B)\in \mathcal{S}}H({p}_{A},{K}_{B}^{A})-H(p,K).}\end{array}$$
- (iii) (PYTHAGOREAN THEOREM) One has$${D}_{p}\left(K\Vert {K}_{\mathcal{S}}\right)={D}_{p}\left(K\Vert {K}_{{\mathcal{S}}^{\prime}}\right)+{D}_{p}\left({K}_{{\mathcal{S}}^{\prime}}\Vert {K}_{\mathcal{S}}\right).$$

If we apply this to an independent kernel $K \in \mathcal{K}_{ind}(\Omega_V) = \mathcal{P}(\Omega_V)$, identified with a probability distribution $p\in \mathcal{P}\left({\Omega}_{V}\right)$, then the divergence $D_p(K \,\|\, K_{fac})$ is nothing but the measure $I(p)$ for spatial interdependencies that has been discussed in the introduction and in Example 1. More generally, we interpret the divergence ${D}_{p}\left(K\Vert {K}_{\mathcal{S}}\right)$ as a natural measure for the non-separability of $(p, K)$ with respect to $\mathcal{S}$. The corresponding function ${I}_{\mathcal{S}}:\left(p,K\right)\mapsto {I}_{\mathcal{S}}\left(p,K\right):={D}_{p}\left(K\Vert {K}_{\mathcal{S}}\right)$ has a unique continuous extension to the set $\overline{\mathrm{MC}}\left({\Omega}_{V}\right)$ of all Markov chains, which is also denoted by ${I}_{\mathcal{S}}$ (see Lemma 4.2 in [55]). Thus, non-separability is defined for not necessarily strictly positive Markov chains.

#### 2.4. Application to Stochastic Interaction

#### 2.4.1. The Definition of Stochastic Interaction

Let $V$ be a set of units and $\Omega_v$, $v \in V$, corresponding state sets. Furthermore, consider non-empty and pairwise disjoint subsystems $S_1,\dots,S_n \subset V$ with $V = S_1 \uplus \cdots \uplus S_n$. The stochastic interaction of $S_1,\dots,S_n$ with respect to $\left(p,K\right)\in \overline{\mathrm{MC}}\left({\Omega}_{V}\right)$ is quantified by the divergence of $(p, K)$ from the set of Markov chains that represent $\{S_1,\dots,S_n\}$-split information processing, where the subsystems do not interact with each other (see Examples and Definitions 1 (6)). More precisely, we define the stochastic interaction (of the subsystems $S_1,\dots,S_n$) to be the function ${I}_{{S}_{1},\dots ,{S}_{n}}:\overline{\mathrm{MC}}\left({\Omega}_{V}\right)\to {\mathbb{R}}_{+}$ with

$$I_{S_1,\dots,S_n}(p, K) := I_{\mathcal{S}(S_1,\dots,S_n)}(p, K) = D_p\big(K \,\big\|\, K_{\mathcal{S}(S_1,\dots,S_n)}\big).$$

In the case of the partition of $V = \{v_1,\dots,v_n\}$ into the elementary units, that is $S_i := \{v_i\}$, $i = 1,\dots,n$, we simply write $I$ instead of $I_{\{v_1\},\dots,\{v_n\}}$.

**Proposition 2.** Let $V$ be a set of units, $\Omega_v$, $v \in V$, corresponding state sets, and $X_n = (X_{v,n})_{v\in V}$, $n = 0, 1, 2, \dots$, a Markov chain on $\Omega_V$. For a subsystem $S \subset V$, we write $X_{S,n} := (X_{v,n})_{v\in S}$. Assume that the chain is given by $\left(p,K\right)\in \overline{\mathrm{MC}}\left({\Omega}_{V}\right)$, where $p$ is a stationary distribution with respect to $K$. Then the following holds:

- (i) $$I\{{X}_{n}\}={\displaystyle \sum _{v\in V}H({X}_{v,n+1}|{X}_{v,n})-H({X}_{n+1}|{X}_{n})}.$$
- (ii) If $A, B \subset V$, $A, B \neq \emptyset$, $A \cap B = \emptyset$, $A \uplus B = V$, then$$I\{{X}_{n}\}=I\{{X}_{A,n}\}+I\left\{{X}_{B,n}\right\}+{I}_{A,B}\left\{{X}_{n}\right\}.$$
- (iii) If the process is parallel, then$$\begin{array}{l}I\{{X}_{n}\}={\displaystyle \sum _{v\in V}(H({X}_{v,n+1}|{X}_{v,n})-H({X}_{v,n+1}|{X}_{n}))}\\ \phantom{\rule{2em}{0ex}}={\displaystyle \sum _{v\in V}MI({X}_{v,n+1};{X}_{V\backslash v,n}|{X}_{v,n})}.\end{array}$$
- (iv) If the process is adapted to a graph $(V, E)$, then$$\begin{array}{l}I\{{X}_{n}\}={\displaystyle \sum _{v\in V}(H({X}_{v,n+1}|{X}_{v,n})-H({X}_{v,n+1}|{X}_{pa(v),n}))}\\ \phantom{\rule{2em}{0ex}}={\displaystyle \sum _{v\in V}MI({X}_{v,n+1};{X}_{pa(v)\backslash v,n}|{X}_{v,n})}.\end{array}$$

If $X_{n+1}$ and $X_n$ are independent for all $n$, the stochastic interaction $I\{X_n\}$ reduces to the measure $I(p)$ for spatial interdependencies with respect to the stationary distribution $p$ of $\{X_n\}$ (see Example 1). Thus, the dynamical notion of stochastic interaction is a generalization of the spatial one. Geometrically, this can be illustrated as follows. In addition to the projection $K_{spl}$ of the kernel $K \in \mathcal{K}(\Omega_V)$ with respect to a distribution $p\in \mathcal{P}\left({\Omega}_{V}\right)$ on the set of split kernels, we consider its projections $K_{ind}$ and $K_{fac}$ on the set $\mathcal{P}\left({\Omega}_{V}\right)$ of independent kernels and on the subset $\mathcal{F}\left({\Omega}_{V}\right)$, respectively. From Proposition 1 we know

$$D_p(K \,\|\, K_{spl}) + D_p(K_{spl} \,\|\, K_{fac}) \;=\; D_p(K \,\|\, K_{fac}) \;=\; D_p(K \,\|\, K_{ind}) + D_p(K_{ind} \,\|\, K_{fac}). \qquad (29)$$

If $X_{n+1}$ and $X_n$ are independent, the two terms $D_p(K \,\|\, K_{ind})$ and $D_p(K_{spl} \,\|\, K_{fac})$ in Equation (29) vanish, and the stochastic interaction coincides with spatial interdependence.

#### 2.4.2. Examples

**Example 2** (SOURCE AND RECEIVER). Consider two units 1 = source and 2 = receiver with the state sets $\Omega_1$ and $\Omega_2$. Assume that the information flow is adapted to the graph $G = (\{1, 2\}, \{(1, 2)\})$, which only allows a transmission from the first unit to the second. In each transition from time $n$ to $n+1$, a state $X_{1,n+1}$ of the first unit is chosen independently of $X_{1,n}$ according to a probability distribution $p\in \mathcal{P}\left({\Omega}_{1}\right)$. The state $X_{2,n+1}$ of the second unit at time $n+1$ is “computed” from $X_{1,n}$ according to a kernel $K\in \mathcal{K}\left({\Omega}_{2}|{\Omega}_{1}\right)$. Using formula Equation (28), we have

$$I\{X_n\} = MI\big(X_{2,n+1} ; X_{1,n}\big).$$

This is the mutual information of $X_{2,n+1}$ and $X_{1,n}$, which has a temporal interpretation within the present approach. It plays an important role in coding and information theory [58].

**Example 3** (TWO BINARY UNITS I). Consider two units with the state sets {0, 1}. Each unit copies the state of the other unit with probability 1 − ε. The transition probabilities for the units are given by the following tables:

| K^{(1)}(x′ \| (x, y)) | 0 | 1 |
|---|---|---|
| (0, 0) | 1−ε | ε |
| (0, 1) | ε | 1−ε |
| (1, 0) | 1−ε | ε |
| (1, 1) | ε | 1−ε |

| K^{(2)}(y′ \| (x, y)) | 0 | 1 |
|---|---|---|
| (0, 0) | 1−ε | ε |
| (0, 1) | 1−ε | ε |
| (1, 0) | ε | 1−ε |
| (1, 1) | ε | 1−ε |

The composed kernel $K = K^{(1)} \otimes K^{(2)}$ of the whole system is

| K((x′, y′) \| (x, y)) | (0, 0) | (0, 1) | (1, 0) | (1, 1) |
|---|---|---|---|---|
| (0, 0) | (1−ε)^{2} | (1−ε)ε | ε(1−ε) | ε^{2} |
| (0, 1) | ε(1−ε) | ε^{2} | (1−ε)^{2} | (1−ε)ε |
| (1, 0) | (1−ε)ε | (1−ε)^{2} | ε^{2} | ε(1−ε) |
| (1, 1) | ε^{2} | ε(1−ε) | (1−ε)ε | (1−ε)^{2} |

The uniform distribution is stationary with respect to $K$, and the corresponding marginal kernels $K_1$ and $K_2$ are

| K_{1}(x′ \| x) | 0 | 1 |
|---|---|---|
| 0 | $\frac{1}{2}$ | $\frac{1}{2}$ |
| 1 | $\frac{1}{2}$ | $\frac{1}{2}$ |

| K_{2}(y′ \| y) | 0 | 1 |
|---|---|---|
| 0 | $\frac{1}{2}$ | $\frac{1}{2}$ |
| 1 | $\frac{1}{2}$ | $\frac{1}{2}$ |

This gives the split kernel $K_{spl} = K_1 \otimes K_2$. With Equation (27) we finally get

$$I\{X_n\} = 2\,\big(\ln 2 + \epsilon \ln \epsilon + (1-\epsilon)\ln(1-\epsilon)\big).$$
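Example 3 can be verified numerically. The sketch below (Python; entropies in nats, ε = 0.2 as an arbitrary test value) builds the joint kernel of the two copying units, confirms that the uniform distribution is stationary, and checks that the stochastic interaction, computed via the entropy formula $I\{X_n\} = \sum_v H(X_{v,n+1}|X_{v,n}) - H(X_{n+1}|X_n)$ of Proposition 2, equals $2(\ln 2 - h(\varepsilon))$ with $h$ the binary entropy:

```python
import itertools
import math

eps = 0.2
confs = list(itertools.product((0, 1), repeat=2))

def copy_prob(bit, target):
    """Probability of producing `target` when copying `bit` with error eps."""
    return (1 - eps) if target == bit else eps

# Unit 1 copies unit 2's state; unit 2 copies unit 1's state (each with error eps).
K = {(x, y): {(x2, y2): copy_prob(y, x2) * copy_prob(x, y2) for (x2, y2) in confs}
     for (x, y) in confs}

p = {w: 0.25 for w in confs}  # candidate stationary distribution (uniform)
stationary = all(abs(sum(p[w] * K[w][w2] for w in confs) - p[w2]) < 1e-12
                 for w2 in confs)

def cond_entropy(pair):
    """H(Y|X) in nats from a dict (x, y) -> joint probability."""
    px = {}
    for (x, _), q in pair.items():
        px[x] = px.get(x, 0.0) + q
    return -sum(q * math.log(q / px[x]) for (x, _), q in pair.items() if q > 0)

# H(X'|X) of the whole transition
joint = {(w, w2): p[w] * K[w][w2] for w in confs for w2 in confs}
H_whole = cond_entropy(joint)

# sum_v H(X'_v | X_v) from the single-unit pair marginals
H_units = 0.0
for v in range(2):
    pair_v = {}
    for w in confs:
        for w2 in confs:
            key = (w[v], w2[v])
            pair_v[key] = pair_v.get(key, 0.0) + p[w] * K[w][w2]
    H_units += cond_entropy(pair_v)

interaction = H_units - H_whole
h = -eps * math.log(eps) - (1 - eps) * math.log(1 - eps)  # binary entropy
```

For ε → 0 the units become deterministic mirrors of each other and the interaction approaches its maximum 2 ln 2, consistent with the observation in the preface that maximizers of this complexity are almost deterministic.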

**Example 4** (TWO BINARY UNITS II). Consider again two binary units with the state sets {0, 1} and the transition probabilities

| K^{(1)}(x′ \| (x, y)) | 0 | 1 |
|---|---|---|
| (0, 0) | 1 | 0 |
| (0, 1) | 1−ε | ε |
| (1, 0) | ε | 1−ε |
| (1, 1) | 0 | 1 |

| K^{(2)}(y′ \| (x, y)) | 0 | 1 |
|---|---|---|
| (0, 0) | 0 | 1 |
| (0, 1) | 1−ε | ε |
| (1, 0) | ε | 1−ε |
| (1, 1) | 1 | 0 |

The composed kernel $K = K^{(1)} \otimes K^{(2)}$ of the whole system is

| K((x′, y′) \| (x, y)) | (0, 0) | (0, 1) | (1, 0) | (1, 1) |
|---|---|---|---|---|
| (0, 0) | 0 | 1 | 0 | 0 |
| (0, 1) | (1−ε)^{2} | (1−ε)ε | ε(1−ε) | ε^{2} |
| (1, 0) | ε^{2} | ε(1−ε) | (1−ε)ε | (1−ε)^{2} |
| (1, 1) | 0 | 0 | 1 | 0 |

With respect to the stationary distribution of $K$, the marginal kernels $K_1$ and $K_2$ are

| K_{1}(x′ \| x) | 0 | 1 |
|---|---|---|
| 0 | $1-\frac{\epsilon}{2({\epsilon}^{2}-\epsilon +1)}$ | $\frac{\epsilon}{2({\epsilon}^{2}-\epsilon +1)}$ |
| 1 | $\frac{\epsilon}{2({\epsilon}^{2}-\epsilon +1)}$ | $1-\frac{\epsilon}{2({\epsilon}^{2}-\epsilon +1)}$ |

| K_{2}(y′ \| y) | 0 | 1 |
|---|---|---|
| 0 | $\frac{\epsilon}{2({\epsilon}^{2}-\epsilon +1)}$ | $1-\frac{\epsilon}{2({\epsilon}^{2}-\epsilon +1)}$ |
| 1 | $1-\frac{\epsilon}{2({\epsilon}^{2}-\epsilon +1)}$ | $\frac{\epsilon}{2({\epsilon}^{2}-\epsilon +1)}$ |
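The marginal kernels of Example 4 can be checked numerically. The sketch below (Python; ε = 0.3 as an arbitrary test value, and power iteration as one standard way to obtain the stationary distribution) computes $K_1$ with respect to the stationary distribution and compares it with the closed form $1 - \varepsilon/(2(\varepsilon^2 - \varepsilon + 1))$ from the table:

```python
import itertools

eps = 0.3
confs = list(itertools.product((0, 1), repeat=2))

def k1(x2, x, y):
    """Unit 1's kernel from the first table: prob. of next state x2 given (x, y)."""
    p1 = {(0, 0): 0.0, (0, 1): eps, (1, 0): 1 - eps, (1, 1): 1.0}[(x, y)]
    return p1 if x2 == 1 else 1 - p1

def k2(y2, x, y):
    """Unit 2's kernel from the second table (it tends to flip its state)."""
    p1 = {(0, 0): 1.0, (0, 1): eps, (1, 0): 1 - eps, (1, 1): 0.0}[(x, y)]
    return p1 if y2 == 1 else 1 - p1

# composed kernel K((x', y') | (x, y)) = K^(1)(x'|(x, y)) K^(2)(y'|(x, y))
K = {w: {w2: k1(w2[0], *w) * k2(w2[1], *w) for w2 in confs} for w in confs}

# stationary distribution by power iteration
p = {w: 0.25 for w in confs}
for _ in range(500):
    p = {w2: sum(p[w] * K[w][w2] for w in confs) for w2 in confs}

def K1(x2, x):
    """Marginal kernel K_1(x'|x) with respect to the stationary distribution."""
    num = sum(p[w] * K[w][w2] for w in confs for w2 in confs
              if w[0] == x and w2[0] == x2)
    den = sum(p[w] for w in confs if w[0] == x)
    return num / den

closed = 1 - eps / (2 * (eps**2 - eps + 1))
```

The chain is irreducible and aperiodic for 0 < ε < 1, so the power iteration converges to the unique stationary distribution.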

## 3. Conclusions

## Conflicts of Interest

## Appendix: Proofs

**Proposition 3.** The manifold ${\mathrm{MC}}_{\mathcal{S}}\left({\Omega}_{V}\right)$ is an exponential family in $\mathrm{MC}(\Omega_V)$.

**Proof.** To see this, consider the functions $\Omega_V \times \Omega_V \to \mathbb{R}$

**Proof of Implication (24).** If $\mathcal{S} \preceq \mathcal{S}'$, then one can choose, for $i = 1,\dots,n$, subsets of the index set $\{1,\dots,m\}$ such that

**Proof of Proposition 1.**

- (i) Consider the following strictly convex function (${\mathbb{R}}_{+}^{*}$ denotes the set of positive real numbers):$$\begin{array}{c}F:{({\mathbb{R}}_{+}^{*})}^{{\mathrm{\Omega}}_{V}}\times \left({\displaystyle \prod _{(A,B)\in \mathcal{S}}{\left({\mathbb{R}}_{+}^{*}\right)}^{{\mathrm{\Omega}}_{A}\times {\mathrm{\Omega}}_{B}}}\right)\to \mathbb{R},\\ (x,y)=({x}_{\omega},\omega \in {\mathrm{\Omega}}_{V};\,{y}_{{\omega}_{A},{{\omega}^{\prime}}_{B}},\,{\omega}_{A}\in {\mathrm{\Omega}}_{A},{{\omega}^{\prime}}_{B}\in {\mathrm{\Omega}}_{B})\mapsto \\ F(x,y):={\displaystyle \sum _{\omega \in {\mathrm{\Omega}}_{V}}p(\omega )\mathrm{ln}\frac{p(\omega )}{{x}_{\omega}}}+{\displaystyle \sum _{\omega ,{\omega}^{\prime}\in {\mathrm{\Omega}}_{V}}p(\omega )K({\omega}^{\prime}|\omega )\mathrm{ln}\frac{K({\omega}^{\prime}|\omega )}{{\displaystyle {\prod}_{(A,B)\in \mathcal{S}}{y}_{{\omega}_{A},{{\omega}^{\prime}}_{B}}}}}\\ +\lambda \left({\displaystyle \sum _{\omega \in {\mathrm{\Omega}}_{V}}{x}_{\omega}-1}\right)+{\displaystyle \sum _{(A,B)\in \mathcal{S}}{\displaystyle \sum _{{\omega}_{A}\in {\mathrm{\Omega}}_{A}}{\lambda}_{{\omega}_{A}}^{B}\left({\displaystyle \sum _{{{\omega}^{\prime}}_{B}\in {\mathrm{\Omega}}_{B}}{y}_{{\omega}_{A},{{\omega}^{\prime}}_{B}}}-1\right)}}.\end{array}$$Here, $\lambda$ and the ${\lambda}_{{\omega}_{A}}^{B}$ are Lagrangian multipliers. Note that in the case $x\in \mathcal{P}({\mathrm{\Omega}}_{V})$ and $y\in {\displaystyle {\prod}_{(A,B)\in \mathcal{S}}\mathcal{K}}({\mathrm{\Omega}}_{B}|{\mathrm{\Omega}}_{A})$, the value $F(x, y)$ is nothing but the divergence of $(p, K)$ from $(x, {\otimes}_{\mathcal{S}}(y))$.
In order to find the Markov chain that minimizes the divergence, we compute the partial derivatives of $F$:$$\frac{\partial F}{\partial {x}_{\sigma}}(x,y)=-{\displaystyle \sum _{\omega \in {\mathrm{\Omega}}_{V}}p(\omega )\frac{1}{{x}_{\omega}}{\delta}_{\sigma ,\omega}}+\lambda =-\frac{p(\sigma )}{{x}_{\sigma}}+\lambda ,$$$$\begin{array}{l}\frac{\partial F}{\partial {y}_{{\sigma}_{C},{{\sigma}^{\prime}}_{D}}}(x,y)=-{\displaystyle \sum _{\omega ,{\omega}^{\prime}\in {\mathrm{\Omega}}_{V}}p(\omega )K({\omega}^{\prime}|\omega ){\displaystyle \sum _{(A,B)\in \mathcal{S}}\frac{1}{{y}_{{\omega}_{A},{{\omega}^{\prime}}_{B}}}{\delta}_{({\omega}_{A},{{\omega}^{\prime}}_{B}),({\sigma}_{C},{{\sigma}^{\prime}}_{D})}}}+{\displaystyle \sum _{(A,B)\in \mathcal{S}}{\displaystyle \sum _{{\omega}_{A}\in {\mathrm{\Omega}}_{A}}{\lambda}_{{\omega}_{A}}^{B}{\displaystyle \sum _{{{\omega}^{\prime}}_{B}\in {\mathrm{\Omega}}_{B}}{\delta}_{({\omega}_{A},{{\omega}^{\prime}}_{B}),({\sigma}_{C},{{\sigma}^{\prime}}_{D})}}}}\\ \phantom{\rule{7em}{0ex}}=-\frac{1}{{y}_{{\sigma}_{C},{{\sigma}^{\prime}}_{D}}}{\displaystyle \sum _{\begin{array}{c}\omega ,{\omega}^{\prime}\in {\mathrm{\Omega}}_{V}\\ {\omega}_{C}={\sigma}_{C},{{\omega}^{\prime}}_{D}={{\sigma}^{\prime}}_{D}\end{array}}p(\omega )K({\omega}^{\prime}|\omega )}+{\lambda}_{{\sigma}_{C}}^{D}.\end{array}$$For a critical point $(x, y)$, the partial derivatives vanish.
We get the following solution:$${x}_{\sigma}=p\left(\sigma \right),\phantom{\rule{1em}{0ex}}\sigma \in {\Omega}_{V},$$$${y}_{{\sigma}_{C},{{\sigma}^{\prime}}_{D}}=\frac{1}{{\displaystyle {\sum}_{\begin{array}{c}\omega \in {\mathrm{\Omega}}_{V}\\ {\omega}_{C}={\sigma}_{C}\end{array}}p(\omega )}}{\displaystyle \sum _{\begin{array}{c}\omega ,{\omega}^{\prime}\in {\mathrm{\Omega}}_{V}\\ {\omega}_{C}={\sigma}_{C},{{\omega}^{\prime}}_{D}={{\sigma}^{\prime}}_{D}\end{array}}p(\omega )K({\omega}^{\prime}|\omega )},\phantom{\rule{1em}{0ex}}{\sigma}_{C}\in {\mathrm{\Omega}}_{C},{{\sigma}^{\prime}}_{D}\in {\mathrm{\Omega}}_{D}.$$From Theorem 3.10 in [17] we know that this solution is the $(-1)$-projection of $(p, K)$ onto ${\mathrm{MC}}_{\mathcal{S}}\left({\Omega}_{V}\right)$. It is given by the initial distribution $p$ and the corresponding marginals ${K}_{B}^{A}$, $(A,B)\in \mathcal{S}$, of $K$.
- With (i) we get$$\begin{aligned}D((p,K)\,\Vert\,{\mathrm{MC}}_{\mathcal{S}}(\Omega_{V}))&=D_{p}(K\,\Vert\,K_{\mathcal{S}})\\&=\sum_{\omega ,\omega'\in \Omega_{V}}p(\omega )K(\omega'|\omega )\,\mathrm{ln}\frac{K(\omega'|\omega )}{\prod_{(A,B)\in \mathcal{S}}K_{B}^{A}(\omega'_{B}|\omega_{A})}\\&=-H(p,K)-\sum_{(A,B)\in \mathcal{S}}\sum_{\omega ,\omega'\in \Omega_{V}}p(\omega )K(\omega'|\omega )\,\mathrm{ln}\,K_{B}^{A}(\omega'_{B}|\omega_{A})\\&=-H(p,K)-\sum_{(A,B)\in \mathcal{S}}\sum_{\omega_{A}\in \Omega_{A}}\sum_{\omega'_{B}\in \Omega_{B}}\mathrm{ln}\,K_{B}^{A}(\omega'_{B}|\omega_{A})\underbrace{\sum_{\substack{\sigma ,\sigma'\in \Omega_{V}\\ \sigma_{A}=\omega_{A},\,\sigma'_{B}=\omega'_{B}}}p(\sigma )K(\sigma'|\sigma )}_{=\,p_{A}(\omega_{A})\,K_{B}^{A}(\omega'_{B}|\omega_{A})}\\&=\sum_{(A,B)\in \mathcal{S}}H(p_{A},K_{B}^{A})-H(p,K).\end{aligned}$$
- According to Equation (24) we have ${\mathrm{MC}}_{\mathcal{S}}(\Omega_{V})\subseteq {\mathrm{MC}}_{{\mathcal{S}}_{0}}(\Omega_{V})$, and the statement follows from the Pythagorean theorem ([17], p. 62, Theorem 3.8).
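The entropic decomposition in (ii) is an algebraic identity and can be verified numerically. Below is a minimal sketch for the special split $\mathcal{S}=\{(\{v\},\{v\}):v\in V\}$; the function name, the random test kernel, and the seed are illustrative choices.

```python
import itertools
import math
import random

def split_divergence_check(n=2, seed=1):
    """Compare D_p(K || K_S) with sum_v H(p_v, K_v) - H(p, K)
    for the split S = {({v}, {v}) : v in V} and a random kernel K."""
    rng = random.Random(seed)
    states = list(itertools.product([0, 1], repeat=n))
    # random initial distribution p and random transition kernel K
    p = {w: rng.random() for w in states}
    z = sum(p.values())
    p = {w: v / z for w, v in p.items()}
    K = {}
    for w in states:
        row = {w2: rng.random() for w2 in states}
        zr = sum(row.values())
        K[w] = {w2: v / zr for w2, v in row.items()}
    # marginal kernels K_v(b | a) of each unit v
    Ks = []
    for v in range(n):
        kv = {}
        for a in (0, 1):
            pa = sum(p[w] for w in states if w[v] == a)
            for b in (0, 1):
                kv[(a, b)] = sum(p[w] * K[w][w2] for w in states if w[v] == a
                                 for w2 in states if w2[v] == b) / pa
        Ks.append(kv)
    # left-hand side: divergence from the product of marginal kernels
    lhs = sum(p[w] * K[w][w2]
              * math.log(K[w][w2] / math.prod(Ks[v][(w[v], w2[v])] for v in range(n)))
              for w in states for w2 in states)
    # right-hand side: unit conditional entropies minus the joint one
    h_joint = -sum(p[w] * K[w][w2] * math.log(K[w][w2])
                   for w in states for w2 in states)
    h_units = -sum(sum(p[w] for w in states if w[v] == a) * kv[(a, b)]
                   * math.log(kv[(a, b)])
                   for v, kv in enumerate(Ks) for a in (0, 1) for b in (0, 1))
    return lhs, h_units - h_joint
```

For any choice of p and K the two sides agree, which is exactly the computation carried out in the proof above.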

**Proof of Proposition 2.**

- This follows from Proposition 1 (ii).
- We apply (i):$$\begin{aligned}I\{X_{n}\}&\overset{(\mathrm{i})}{=}\sum_{v\in V}H(X_{v,n+1}|X_{v,n})-H(X_{n+1}|X_{n})\\&=\Big(\sum_{v\in A}H(X_{v,n+1}|X_{v,n})-H(X_{A,n+1}|X_{A,n})\Big)\\&\quad +\Big(\sum_{v\in B}H(X_{v,n+1}|X_{v,n})-H(X_{B,n+1}|X_{B,n})\Big)\\&\quad +\Big(H(X_{A,n+1}|X_{A,n})+H(X_{B,n+1}|X_{B,n})-H(X_{n+1}|X_{n})\Big)\\&\overset{(\mathrm{i})}{=}I\{X_{A,n}\}+I\{X_{B,n}\}+I_{A,B}\{X_{n}\}.\end{aligned}$$
- For parallel processing, one has$$H({X}_{n+1}|{X}_{n})={\displaystyle \sum _{v\in V}H({X}_{v,n+1}|{X}_{n})}.$$The statement is then implied by (i).
- This follows from (iii) and the Markov property for (V, E)-adapted Markov chains.

## References

1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. **1948**, 27, 379–423.
2. Attneave, F. Informational aspects of visual perception. Psychol. Rev. **1954**, 61, 183–193.
3. Barlow, H.B. Possible principles underlying the transformation of sensory messages. Sens. Commun. **1961**, 217–234.
4. Laughlin, S. A simple coding procedure enhances a neuron’s information capacity. Z. Naturforsch. **1981**, 36, 910–912.
5. Linsker, R. Self-organization in a perceptual network. Computer **1988**, 21, 105–117.
6. Hubel, D.H.; Wiesel, T.N. Functional architecture of macaque monkey visual cortex (Ferrier Lecture). Proc. R. Soc. Lond. B **1977**, 198, 1–59.
7. Hubel, D.H.; Wiesel, T.N. Brain mechanisms of vision. Sci. Am. **1979**, 241, 150–162.
8. Bell, A.J.; Sejnowski, T.J. An information-maximization approach to blind separation and blind deconvolution. Neural Comput. **1995**, 7, 1129–1159.
9. Tononi, G.; Sporns, O.; Edelman, G.M. A measure for brain complexity: Relating functional segregation and integration in the nervous system. Proc. Natl. Acad. Sci. USA **1994**, 91, 5033–5037.
10. Tononi, G.; Sporns, O. Measuring information integration. BMC Neurosci. **2003**, 4.
11. Tononi, G. An information integration theory of consciousness. BMC Neurosci. **2004**, 5.
12. Balduzzi, D.; Tononi, G. Integrated information in discrete dynamical systems: Motivation and theoretical framework. PLoS Comput. Biol. **2008**, 4, e1000091.
13. Tononi, G. Consciousness as integrated information: A provisional manifesto. Biol. Bull. **2008**, 215, 216–242.
14. Barrett, A.B.; Seth, A.K. Practical measures of integrated information for time-series data. PLoS Comput. Biol. **2011**, 7, e1001052.
15. Edlund, J.; Chaumont, N.; Hintze, A.; Koch, C.; Tononi, G.; Adami, C. Integrated information increases with fitness in the evolution of animats. PLoS Comput. Biol. **2011**, 7, e1002236.
16. Ay, N. Information Geometry on Complexity and Stochastic Interaction. MPI MiS Preprint 95/2001. Available online: http://www.mis.mpg.de/publications/preprints/2001/prepr2001-95.html (accessed on 7 December 2001).
17. Amari, S.I.; Nagaoka, H. Methods of Information Geometry; Translations of Mathematical Monographs; AMS and Oxford University Press: New York, NY, USA, 2000.
18. Amari, S.I. Information geometry on hierarchy of probability distributions. IEEE Trans. Inf. Theory **2001**, 47, 1701–1711.
19. McGill, W.J. Multivariate information transmission. Psychometrika **1954**, 19, 97–116.
20. Kolchinsky, A.; Rocha, L.M. Prediction and modularity in dynamical systems. In Advances in Artificial Life, Proceedings of the Eleventh European Conference on the Synthesis and Simulation of Living Systems (ECAL 2011); MIT Press: Paris, France, 2011; pp. 423–430.
21. Arsiwalla, X.D.; Verschure, P.F.M.J. Integrated information for large complex networks. In Proceedings of the 2013 International Joint Conference on Neural Networks (IJCNN), Dallas, TX, USA, 4–9 August 2013; pp. 620–626.
22. Amari, S.I. Natural gradient works efficiently in learning. Neural Comput. **1998**, 10, 251–276.
23. Ay, N. Locality of global stochastic interaction in directed acyclic networks. Neural Comput. **2002**, 14, 2959–2980.
24. Ay, N.; Wennekers, T. Dynamical properties of strongly interacting Markov chains. Neural Netw. **2003**, 16, 1483–1497.
25. Wennekers, T.; Ay, N. Finite state automata resulting from temporal information maximization and a temporal learning rule. Neural Comput. **2005**, 17, 2258–2290.
26. Ay, N.; Montúfar, G.; Rauh, J. Selection criteria for neuromanifolds of stochastic dynamics. In Advances in Cognitive Neurodynamics (III), Proceedings of the Third International Conference on Cognitive Neurodynamics 2011, Hokkaido, Japan, 9–13 June 2011; pp. 147–154.
27. Ay, N. Geometric Design Principles for Brains of Embodied Agents; Santa Fe Institute Working Paper 15-02-005; Santa Fe Institute: Santa Fe, NM, USA, 2015.
28. Wennekers, T.; Ay, N. Temporal infomax leads to almost deterministic dynamical systems. Neurocomputing **2003**, 52–54, 461–466.
29. Wennekers, T.; Ay, N. Temporal infomax on Markov chains with input leads to finite state automata. Neurocomputing **2003**, 52–54, 431–436.
30. Wennekers, T.; Ay, N. Spatial and temporal stochastic interaction in neuronal assemblies. Theory Biosci. **2003**, 122, 5–18.
31. Wennekers, T.; Ay, N. Stochastic interaction in associative nets. Neurocomputing **2005**, 65, 387–392.
32. Wennekers, T.; Ay, N. A temporal learning rule in recurrent systems supports high spatio-temporal stochastic interactions. Neurocomputing **2006**, 69, 1199–1202.
33. Wennekers, T.; Ay, N.; Andras, P. High-resolution multiple-unit EEG in cat auditory cortex reveals large spatio-temporal stochastic interactions. Biosystems **2007**, 89, 190–197.
34. Schreiber, T. Measuring information transfer. Phys. Rev. Lett. **2000**, 85, 461–464.
35. Ay, N.; Polani, D. Information flows in causal networks. Adv. Complex Syst. **2008**, 11, 17–41.
36. Ay, N.; Krakauer, D.C. Geometric robustness theory and biological networks. Theory Biosci. **2007**, 125, 93–121.
37. Ay, N. A refinement of the common cause principle. Discret. Appl. Math. **2009**, 157, 2439–2457.
38. Steudel, B.; Ay, N. Information-theoretic inference of common ancestors. Entropy **2015**, 17, 2304–2327.
39. Moritz, P.; Reichardt, J.; Ay, N. Discriminating between causal structures in Bayesian networks via partial observations. Kybernetika **2014**, 50, 284–295.
40. Pearl, J. Causality: Models, Reasoning and Inference; Cambridge University Press: Cambridge, UK, 2000.
41. Janzing, D.; Balduzzi, D.; Grosse-Wentrup, M.; Schölkopf, B. Quantifying causal influences. Ann. Stat. **2013**, 41, 2324–2358.
42. Ay, N.; Knauf, A. Maximizing multi-information. Kybernetika **2006**, 42, 517–538.
43. Ay, N.; Olbrich, E.; Bertschinger, N.; Jost, J. A geometric approach to complexity. Chaos **2011**, 21, 037103.
44. Ay, N.; Olbrich, E.; Bertschinger, N.; Jost, J. A unifying framework for complexity measures of finite systems. In Proceedings of the European Conference on Complex Systems 2006 (ECCS’06), Oxford University, Oxford, UK, 25–29 September 2006; p. 80.
45. Adami, C. The use of information theory in evolutionary biology. Ann. N. Y. Acad. Sci. **2012**, 1256, 49–65.
46. Ay, N.; Bertschinger, N.; Der, R.; Guettler, F.; Olbrich, E. Predictive information and explorative behavior of autonomous robots. Eur. Phys. J. B **2008**, 63, 329–339.
47. Zahedi, K.; Ay, N.; Der, R. Higher coordination with less control—A result of information maximisation in the sensorimotor loop. Adapt. Behav. **2010**, 18, 338–355.
48. Ay, N.; Bernigau, H.; Der, R.; Prokopenko, M. Information-driven self-organization: The dynamical system approach to autonomous robot behavior. Theory Biosci. **2011**.
49. Kahle, T.; Olbrich, E.; Jost, J.; Ay, N. Complexity measures from interaction structures. Phys. Rev. E **2009**, 79, 026201.
50. Olbrich, E.; Bertschinger, N.; Ay, N.; Jost, J. How should complexity scale with system size? Eur. Phys. J. B **2008**, 63, 407–415.
51. Amari, S.I. Differential-Geometric Methods in Statistics; Lecture Notes in Statistics; Springer: Berlin, Germany, 1985.
52. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. **1951**, 22, 79–86.
53. Csiszár, I. I-divergence geometry of probability distributions and minimization problems. Ann. Probab. **1975**, 3, 146–158.
54. Csiszár, I. On topological properties of f-divergence. Stud. Sci. Math. Hungar. **1967**, 2, 329–339.
55. Ay, N. An information-geometric approach to a theory of pragmatic structuring. Ann. Probab. **2002**, 30, 416–436.
56. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. **1945**, 37, 81–91.
57. Hofbauer, J.; Sigmund, K. Evolutionary Games and Population Dynamics; Cambridge University Press: Cambridge, UK, 1998.
58. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley Series in Telecommunications; Wiley-Interscience: New York, NY, USA, 1991.

**Figure 2.** Illustration of the two ways of projecting K onto $\mathcal{F}\left({\Omega}_{V}\right)$. The corresponding application of the Pythagorean theorem leads to Equation (29).

**Figure 3.** Illustration of the stochastic interaction I{X_{n}} as a function of ε. For the extreme values of ε we have maximal stochastic interaction, which corresponds to a complete information exchange in terms of (x, y) ↦ (y, x) for ε = 0 and (x, y) ↦ (1 − y, 1 − x) for ε = 1. For $\epsilon =\frac{1}{2}$, the dynamics is maximally random, which is associated with no interaction between the nodes.

**Figure 4.** Illustration of the stochastic interaction I{X_{n}} as a function of ε. For ε = 0, the two units update their states with no information exchange: (x, y) ↦ (x, 1 − y). For ε = 1, there is maximal information exchange in terms of (x, y) ↦ (y, 1 − x).
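The values at the extreme points of the two figures can be checked directly from the definition I{X_{n}} = ∑_{v} H(X_{v,n+1}|X_{v,n}) − H(X_{n+1}|X_{n}). The sketch below assumes the uniform initial distribution on {0,1}² and the deterministic update maps stated in the captions; the function name and data layout are illustrative.

```python
import itertools
import math

def interaction(p, K):
    """Stochastic interaction I{X_n} = sum_v H(X'_v | X_v) - H(X' | X)
    for a kernel K on binary state vectors with initial distribution p."""
    states = list(p)
    n = len(states[0])
    h_joint = -sum(p[w] * K[w][w2] * math.log(K[w][w2])
                   for w in states for w2 in states if K[w][w2] > 0)
    h_units = 0.0
    for v in range(n):
        for a in (0, 1):
            pa = sum(p[w] for w in states if w[v] == a)
            for b in (0, 1):
                q = sum(p[w] * K[w][w2] for w in states if w[v] == a
                        for w2 in states if w2[v] == b) / pa
                if q > 0:
                    h_units -= pa * q * math.log(q)
    return h_units - h_joint

states = list(itertools.product([0, 1], repeat=2))
p = {w: 0.25 for w in states}
# Figure 3, eps = 0: complete exchange (x, y) -> (y, x), interaction 2 ln 2
swap = {w: {w2: 1.0 if w2 == (w[1], w[0]) else 0.0 for w2 in states} for w in states}
# Figure 3, eps = 1/2: maximally random update, interaction vanishes
noise = {w: {w2: 0.25 for w2 in states} for w in states}
# Figure 4, eps = 0: no exchange (x, y) -> (x, 1 - y), interaction vanishes
flip = {w: {w2: 1.0 if w2 == (w[0], 1 - w[1]) else 0.0 for w2 in states} for w in states}
```

For the swap map the joint conditional entropy is zero while each unit’s new state is independent of its own old state, giving the maximal value 2 ln 2; for the noise kernel and the exchange-free map both terms cancel.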

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite

**MDPI and ACS Style**

Ay, N. Information Geometry on Complexity and Stochastic Interaction. *Entropy* **2015**, *17*, 2432-2458.
https://doi.org/10.3390/e17042432
