The Causal Interaction between Complex Subsystems

Information flow provides a natural measure of the causal interaction between dynamical events. This study extends our previous rigorous formalism of componentwise information flow to the bulk information flow between two complex subsystems of a large-dimensional parental system. Analytical formulas have been obtained in closed form. Under a Gaussian assumption, their maximum likelihood estimators have also been obtained. These formulas have been validated using different subsystems with preset relations, and they yield causalities just as expected. In contrast, the commonly used proxies for the characterization of subsystems, such as averages and principal components, generally do not work correctly. This study can help diagnose the emergence of patterns in complex systems and is expected to find application in many real-world problems in disciplines such as climate science, fluid dynamics, neuroscience, and financial economics.


Introduction
When investigating the properties of a complex system, it is often necessary to study the interaction between one subsystem and another, each of which itself forms a complex system, usually with a large number of components involved. In climate science, for example, there is much interest in understanding how one sector of the system collaborates with another sector to cause climate change (see [1] and the references therein); in neuroscience, it is important to investigate the effective connectivity from one brain region to another, each with millions of neurons involved (e.g., [2,3]), and the interaction between structures (e.g., [4][5][6]; see more references in a recent review [7]). This naturally raises a question: How can we study the interaction between two subsystems in a large parental system? An immediate answer might be to study the componentwise interactions by assessing the causalities between the respective components using, for instance, the classical causal inference approaches (e.g., [8][9][10]). This is generally infeasible if the dimensionality is large. For two subsystems, each with, say, 1000 components, this ends up with one million causal relations, which are impossible to analyze even though all the details are retained. In this case, the details are not a benefit; they must be re-synthesized into a big, interpretable picture of the phenomena. On the other hand, in many situations this is not necessary; one needs only a "bulk" description of the subsystems and their interactions. Such examples are seen in the Reynolds equations for turbulence (e.g., [11]) and the thermodynamic description of molecular motions (e.g., [12]). In some fields (e.g., climate science, neuroscience, geography, etc.), a common practice is simply to take respective averages to form the mean properties, and to study the interactions between these proxies, i.e., the mean properties.
A more sophisticated approach is to extract the respective principal components (PCs) (e.g., [13][14][15]) and analyze the interactions based on them. As we will examine in this study, however, these approaches may not work satisfactorily; their validity needs to be carefully checked before they are put into application.
During the past 16 years, it has been gradually realized that causality in terms of information flow (IF) is a real physical notion that can be rigorously derived from first principles (see [16]). When two processes interact, IF provides not only the direction but also the strength of the interaction. Thus far, the formalism of the IF between two components has been well established (see [16][17][18][19][20], among others). Extending the formalism to subspaces with many components involved has been shown to be promising. A pioneering effort is [21], where the authors show that the heuristic argument in [17] equally applies to subsystems in the case with only one-way causality. A recent study on the role of individual nodes in a complex network [22] may be viewed as another effort. (Causality analyses between subspaces with the classical approaches are rare; a few examples are [23,24], etc.) However, a rigorous formalism for more generic problems (e.g., with mutual causality involved) is yet to be established. This motivates the objective of this study: to investigate the interactions between two complex subsystems within a large parental system through the "bulk" information flow between them.
The rest of the paper is organized as follows. In Section 2, we first present the setting for the problem and then derive the IF formulas. Maximum likelihood estimators of these formulas are given in Section 3, which is followed by a validation. Finally, Section 5 summarizes the study.

Information Flow between Two Subspaces of a Complex System
Consider an n-dimensional stochastic dynamical system, with its components partitioned into three groups:

    dx_i = F_i(x, t) dt + ∑_{k=1}^m b_{ik}(x, t) dw_k,   i = 1, ..., r,        (1)
    dx_i = F_i(x, t) dt + ∑_{k=1}^m b_{ik}(x, t) dw_k,   i = r+1, ..., s,      (2)
    dx_i = F_i(x, t) dt + ∑_{k=1}^m b_{ik}(x, t) dw_k,   i = s+1, ..., n,      (3)

where x ∈ R^n denotes the vector of state variables (x_1, x_2, ..., x_n), F = (F_1, ..., F_n) are differentiable functions of x and time t, w is a vector of m independent standard Wiener processes, and B = (b_{ij}) is an n × m matrix of stochastic perturbation amplitudes. Here we follow the convention in physics not to distinguish a random variable from its deterministic counterpart. From the components (x_1, ..., x_n), we separate out two sets, (x_1, ..., x_r) and (x_{r+1}, ..., x_s), and denote them as x_{1...r} and x_{r+1...s}, respectively. The remaining components (x_{s+1}, ..., x_n) are denoted as x_{s+1...n}. The subsystems formed by the first two sets are henceforth referred to as A and B, and the following is a derivation of the information flow between them. Note that, for convenience, here A and B are put adjacent to each other; if not, the equations can always be rearranged to make them so.

Associated with Equations (1)-(3) there is a Fokker-Planck equation governing the evolution of the joint probability density function (pdf) ρ of x:

    ∂ρ/∂t + ∑_{i=1}^n ∂(F_i ρ)/∂x_i = (1/2) ∑_{i=1}^n ∑_{j=1}^n ∂²(g_{ij} ρ)/∂x_i ∂x_j,        (4)

where g_{ij} = ∑_{k=1}^m b_{ik} b_{jk}, i, j = 1, ..., n. Without much loss of generality, ρ is assumed to be compactly supported on R^n. The joint pdfs of x_{1...r} and x_{r+1...s} are, respectively,

    ρ_{1...r} = ∫ ρ dx_{r+1...n},     ρ_{r+1...s} = ∫ ρ dx_{1...r} dx_{s+1...n}.

With respect to them, the joint entropies are then

    H_A = −∫ ρ_{1...r} log ρ_{1...r} dx_{1...r},     H_B = −∫ ρ_{r+1...s} log ρ_{r+1...s} dx_{r+1...s}.

To derive the evolution of ρ_{1...r}, integrate out (x_{r+1}, ..., x_n) in Equation (4). This yields, by using the assumption of compactness for ρ,

    ∂ρ_{1...r}/∂t + ∑_{i=1}^r ∂/∂x_i ∫ F_i ρ dx_{r+1...n} = (1/2) ∑_{i=1}^r ∑_{j=1}^r ∂²/∂x_i ∂x_j ∫ g_{ij} ρ dx_{r+1...n}.      (7)

A similar equation holds for ρ_{r+1...s}. Multiplication of Equation (7) by −(1 + log ρ_{1...r}), followed by an integration with respect to x_{1...r} over R^r, yields the evolution of H_A. Note that in the second term on the left hand side, the part within the summation can be reduced by integration by parts. In the derivation, the compactness assumption has been used (variables vanish at the boundaries).
By the same approach, the right hand side can be reduced as well, and an expression for dH_A/dt follows. Likewise, we obtain the corresponding expression for dH_B/dt. Now consider the impact of the subsystem A on its peer B, written T_{A→B}. Following Liang (2016) [16], this is associated with the evolution of the joint entropy of the latter:

    T_{A→B} = dH_B/dt − dH_{B\A}/dt,

where dH_{B\A}/dt signifies the entropy evolution with the influence of A excluded, which is found by instantaneously freezing (x_1, ..., x_r) ≡ x_{1...r} as parameters. To do this, examine, on an infinitesimal interval [t, t + ∆t], a system modified from the original Equations (1)-(3) by removing the r equations for x_1, x_2, ..., x_r from the equation set, i.e., retaining only the equations for x_{r+1}, ..., x_n with x_{1...r} held fixed.
Notice that the F_i's and b_ik's still have dependence on (x_1, ..., x_r) ≡ x_{1...r}, which, however, appears in the modified system as a vector of parameters. By [16], we can construct a mapping for the modified system, where x_{\A} means x but with x_{1...r} appearing as parameters, and study its Frobenius-Perron operator (see, for example, [25]). An alternative approach is given by Liang in [18], which we henceforth follow.
Observe that on the interval [t, t + ∆t], corresponding to the modified dynamical system, there is also a Fokker-Planck equation:

    ∂ρ_{\A}/∂t + ∑_{i=r+1}^n ∂(F_i ρ_{\A})/∂x_i = (1/2) ∑_{i=r+1}^n ∑_{j=r+1}^n ∂²(g_{ij} ρ_{\A})/∂x_i ∂x_j.

Here g_{ij} = ∑_{k=1}^m b_{ik} b_{jk}, and ρ_{\A} means the joint pdf of (x_{r+1}, ..., x_n) with x_{1...r} frozen as parameters. Note the difference between ρ_{\A} and ρ_{r+1...n}: the former has x_{1...r} as parameters, while the latter has no dependence on x_{1...r}. However, they are equal at time t.
Integration of the above Fokker-Planck equation with respect to x_{s+1...n} gives the evolution of the pdf of subsystem B with A frozen as parameters, written ρ_{B,\A} (Equation (16)). Divide Equation (16) by ρ_{B,\A} and simplify the notation x_{r+1...s} by x_B. Discretizing in time and noticing that ρ_{B,\A}(t) = ρ_{r+1...s}(t), we obtain an expression for the evolved density (in the following, unless otherwise indicated, the variables without arguments explicitly specified are assumed to be at time step t). To arrive at dH_{B,\A}/dt, we need to find log ρ_{B,\A}(x_B(t + ∆t); t + ∆t). Using the Euler-Bernstein approximation,

    x_B(t + ∆t) = x_B + F_B ∆t + B_B ∆w,

where, just as with the notation x_B, F_B and B_B collect the components r + 1 through s. Take mathematical expectation on both sides; the left hand side is −H_{B,\A}(t + ∆t). By Corollary III.1 of [16], and noting E∆w_k = 0, E∆w_k² = ∆t, and the fact that ∆w is independent of x(t), dH_{B,\A}/dt follows. Hence, the information flow from x_{1...r} to x_{r+1...s}, T_{A→B} = dH_B/dt − dH_{B,\A}/dt, is obtained. Likewise, we can obtain the information flow from subsystem B to subsystem A. These are summarized in the following theorem.
where g_{ij} = ∑_{k=1}^m b_{ik} b_{jk}, and E signifies mathematical expectation.
When r = 1 and s = n = 2, Equation (20) reduces precisely to Equation (15) in [18]; the same holds for Equation (19). These equations are hence verified.
The following theorem forms the basis for causal inference.

Proof.
We only check the formula for T_{B→A}. In (20), by the compactness of ρ, the whole deterministic part vanishes upon integration by parts. Likewise, it can be proved that the stochastic part also vanishes. This theorem allows us to identify causality with information flow. Ideally, if T_{B→A} = 0, then B is not causal to A, and vice versa; the same holds for T_{A→B}.

Information Flow between Linear Subsystems and Its Estimation
Linear systems provide the simplest framework and are usually taken as the first step toward a more generic setting. Simple as it may be, it has been demonstrated in practice that linear results often provide a good approximation of an otherwise much more complicated problem. It is hence of interest to examine this special case. Let

    F_i(x) = f_i + ∑_{j=1}^n a_{ij} x_j,

where the f_i and a_{ij} are constants. Additionally, suppose that the b_{ij} are constants, that is to say, the noises are additive. Then the g_{ij} are also constants, and, by the compactness of ρ, the stochastic parts in both Equations (19) and (20) vanish. Since a linear system initialized with a Gaussian distribution will always be Gaussian, we may write the joint pdf of x as

    ρ(x) = [(2π)^n det Σ]^{−1/2} exp[−(x − µ)ᵀ Σ^{−1} (x − µ)/2],

where µ is the mean vector and Σ = (σ_{ij})_{n×n} is the population covariance matrix of x. By the properties of the Gaussian distribution, it is easy to show that the marginal pdf of subsystem B is

    ρ_{r+1...s}(x_B) = [(2π)^{s−r} det Σ_B]^{−1/2} exp[−(x_B − µ_B)ᵀ Σ_B^{−1} (x_B − µ_B)/2],

where x_B = (x_{r+1}, ..., x_s), µ_B = (µ_{r+1}, ..., µ_s) is the vector of the means of x_B, and Σ_B is the covariance matrix of x_B. For easy correspondence, we will augment x_B, µ_B, and Σ_B so that their entries have the same indices as their counterparts in x, µ, and Σ. Separate F_i into two parts, one having nothing to do with the subspace B and the other collecting the terms in x_{r+1}, ..., x_s. By Theorem 2, the former does not contribute to the causality from A to B, so we only need to consider the latter in evaluating T_{A→B}. The evaluation then reduces to two terms: the second term in the bracket is simply a_{ii}, while the first involves σ^{ij}, the (i, j)th entry of a matrix which, since only 1 ≤ i, j ≤ s are in question here, may be replaced by the corresponding submatrix. Substituting back, we obtain a very simplified result for T_{A→B}. Likewise, T_{B→A} can also be obtained, as shown in the following.
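As a small numerical aside, the Gaussian ingredients above can be illustrated with a sketch (the system matrix and noise here are made-up values, not from the paper): for a stable linear system driven by additive Gaussian noise, the stationary covariance Σ can be obtained by iterating the discrete Lyapunov relation, and Σ_B is then simply the submatrix of Σ indexed by the components of B, exactly as in the augmented-index convention.

```python
import numpy as np

# Made-up stable 4-dim linear (VAR-type) system x(k+1) = M x(k) + e,
# e ~ N(0, Q); take subsystem A = components (0, 1), B = components (2, 3).
M = np.array([[0.5, 0.1, 0.0, 0.0],
              [0.0, 0.5, 0.0, 0.0],
              [0.3, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.2, 0.5]])
Q = np.eye(4)

# Iterate Sigma = M Sigma M^T + Q to convergence: this fixed point is the
# stationary covariance of the Gaussian process.
Sigma = np.eye(4)
for _ in range(500):
    Sigma = M @ Sigma @ M.T + Q

# Covariance of subsystem B: just the corresponding submatrix of Sigma.
Sigma_B = Sigma[2:, 2:]
```

Since the process stays Gaussian, Σ and its submatrix Σ_B carry all the information the linear flow formulas need.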

Theorem 3. In Equations (1)-(3), suppose the b_{ij} are constants, and

    F_i(x) = f_i + ∑_{j=1}^n a_{ij} x_j,

where the f_i and a_{ij} are also constants. Furthermore, suppose that initially x has a Gaussian distribution. Then T_{A→B} and T_{B→A} are given in closed form in terms of the a_{ij} and the entries σ^{ij} of the respective matrices.

Given a system such as (1)-(3), we can evaluate in a precise sense the information flows among the components. Now, suppose that instead of the dynamical system, what we have are just n time series with K steps, K ≫ n: {x_1(k)}, {x_2(k)}, ..., {x_n(k)}. We can estimate the system from the series and then apply the information flow formulas to fulfill the task. Assume a linear model as shown above, and assume m = 1. Following Liang (2014) [19], the maximum likelihood estimator (mle) of a_{ij} equals the least-squares solution of the over-determined problem

    f_i + ∑_{j=1}^n a_{ij} x_j(k) = ẋ_i(k),    k = 1, ..., K,

where ẋ_i(k) = (x_i(k + 1) − x_i(k))/∆t (∆t is the time stepsize), for i = 1, 2, ..., n. Use an overbar to denote the time mean over the K steps. Removing the means, the problem reads R a_i = q, where R has entries x_j(k) − x̄_j, q has entries ẋ_i(k) − ẋ̄_i, and a_i is the vector (a_{i1}, ..., a_{in})ᵀ. The least-squares solution â_i solves the normal equations

    ∑_{j=1}^n c_{lj} â_{ij} = c_{l,di},    l = 1, ..., n,

where c_{lj} is the sample covariance between {x_l(k)} and {x_j(k)}, and c_{j,di} is the sample covariance between the series {x_j(k)} and {(x_i(k + 1) − x_i(k))/∆t}. Thus, finally, the mle T̂_{A→B} is obtained by substituting the â_{ij} and the sample covariances (with c^{ij} denoting the (i, j)th entry of the inverse sample covariance matrix C̃^{−1}) into the formula of Theorem 3; likewise for T̂_{B→A}. When r = 1 and s = n = 2, the estimator recovers the well-known Equation (10) in [19].
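To make the estimation procedure concrete, here is a minimal sketch (not the paper's code) of the bivariate special case just mentioned, i.e., the well-known Equation (10) of [19] that the estimator recovers when r = 1 and s = n = 2. The coupled series and coefficients below are made up purely for the demonstration.

```python
import numpy as np

def liang_T(x1, x2, dt=1.0):
    """Bivariate mle of the information flow from series x2 to series x1,
    Eq. (10) of Liang (2014):
    T = (C11*C12*C2d1 - C12^2*C1d1) / (C11^2*C22 - C11*C12^2)."""
    dx1 = (x1[1:] - x1[:-1]) / dt                 # finite-difference dx1/dt
    C = np.cov(np.vstack([x1[:-1], x2[:-1], dx1]))
    C11, C22, C12 = C[0, 0], C[1, 1], C[0, 1]
    C1d1, C2d1 = C[0, 2], C[1, 2]
    return (C11 * C12 * C2d1 - C12**2 * C1d1) / (C11**2 * C22 - C11 * C12**2)

# Synthetic one-way coupled pair: x2 drives x1, but not vice versa.
rng = np.random.default_rng(0)
K = 20000
x = np.zeros((K, 2))
for k in range(K - 1):
    x[k + 1, 0] = 0.5 * x[k, 0] + 0.6 * x[k, 1] + rng.normal()
    x[k + 1, 1] = 0.5 * x[k, 1] + rng.normal()

T21 = liang_T(x[:, 0], x[:, 1])   # flow x2 -> x1: clearly nonzero
T12 = liang_T(x[:, 1], x[:, 0])   # flow x1 -> x2: essentially zero
```

With the one-way coupling, the estimated flow from the driver is clearly nonzero while the reverse flow stays near zero, consistent with Theorem 2.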
Since in practice averages and principal components (PCs) have been widely used to characterize complex subsystem variations, we also compute the information flows between the averages x̄ and ȳ, and those between the first PCs of (x_1, x_2, x_3) and (y_1, y_2, y_3). The results are plotted in Figure 2c,d, respectively. As can be seen, the principal component analysis (PCA) method works just fine in this case. By comparison, the averaging method yields an incorrect result.
The incorrect inference based on averaging is to be expected. In a network with complex causal relations, for example one with a causality from y_2 to y_1, averaging y_1 with y_2 is equivalent to mixing y_1 with its future state, which is related to the contemporary state of x_1, and hence will yield a spurious causality to x_1. The PCA here functions satisfactorily, perhaps because, in selecting the most coherent structure, it discards most of the influences from other (implicit) time steps. However, the relative success of PCA may not be robust, as evidenced in the following mutually causal case.

Mutually Causal Relation
If both the coupling parameters ε_1 and ε_3 are turned on, the resulting causal relation has a distribution on the ε_1-ε_3 plane. Figure 3 shows the componentwise information flows T_{x1→y1} (bottom) and T_{y3→x3} (top) on the plane. The other two flows, i.e., their counterparts T_{y1→x1} and T_{x3→y3}, are by computation essentially zero. As argued in the preceding subsection, the bulk information flows should follow the same general pattern, though perhaps in a coarser and milder form, since they are properties of the subsystems as a whole. This is indeed the case. Shown in Figure 4 are the bulk information flows between X and Y computed using Equations (28) and (30).
Again, as usual, we try the averages and first PCs as proxies for estimating the causal interaction between X and Y. Figure 5 shows the distributions of the information flows between x̄ and ȳ. The resulting patterns are totally different from what Figure 3 displays; obviously, these patterns are incorrect.
One may expect the PCA method to yield more reasonable causal patterns. We have computed the first PCs for (x_1, x_2, x_3) and (y_1, y_2, y_3), respectively, and estimated the information flows using the algorithm by Liang [20]. The resulting distributions, however, are no better than those with the averaged series (Figure 6; as in Figure 4, but for the information flows between the first principal component of (x_1, x_2, x_3) and that of (y_1, y_2, y_3)). That is to say, this seemingly more sophisticated approach does not yield the right interaction between the complex subsystems, either.

Summary
Information flow provides a natural measure of the causal interaction between dynamical events. In this study, the information flows between two complex subsystems of a large-dimensional system are studied, and analytical formulas have been obtained in closed form. For easy reference, the major results are summarized hereafter.
For an n-dimensional system such as (1)-(3), if the probability density function (pdf) of x is compactly supported, then the information flow from subsystem A, which is made of x_{1...r}, to subsystem B, made of x_{r+1...s} (1 ≤ r < s ≤ n), and that from B to A are given, respectively (in nats per unit time), by the formulas of Theorem 1, where g_{ij} = ∑_{k=1}^m b_{ik} b_{jk} and E signifies mathematical expectation. Given n stationary time series, T_{A→B} and T_{B→A} can be estimated; the maximum likelihood estimators under a Gaussian assumption are given in Equations (28) and (30).
We have constructed a VAR process to validate the formalism. The system has a dimension of 6, with two subsystems, denoted X and Y, each of dimension 3. X drives Y via a coupling at one component, and Y feeds back to X via another. The detailed, componentwise causal relations can easily be found using our previous algorithms, such as that in [20]. It is expected that the bulk information flow should in general follow a similar trend, though in a coarser and milder fashion, since what is displayed is an overall property. The above formalism does yield such a result. In contrast, the commonly used proxies for subsystems, such as averages and principal components (PCs), generally do not work. In particular, the averaged series yield wrong results in both cases considered in this study; the PC series do not work for the mutually causal case either, though they result in a satisfactory characterization in the case with one-way causality.
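The construction of such a validation system, and the least-squares (mle) fit of its drift described in Section 3, can be sketched as follows. The VAR coefficients here are illustrative assumptions, not the values used in the paper; with ∆t = 1, regressing the tendency (x(k+1) − x(k))/∆t on (1, x_1, ..., x_n) estimates A − I for a VAR matrix A, so the one-way coupling coefficient should be recovered.

```python
import numpy as np

# Hypothetical 6-dimensional VAR in the spirit of the validation experiment:
# X = (x1, x2, x3) and Y = (y1, y2, y3); X drives Y through a single
# coupling x1 -> y1 (the 0.7 below is an assumed value, not the paper's).
rng = np.random.default_rng(1)
K, n = 20000, 6
A = 0.5 * np.eye(n)
A[3, 0] = 0.7                      # y1(k+1) also depends on x1(k)
x = np.zeros((K, n))
for k in range(K - 1):
    x[k + 1] = A @ x[k] + rng.normal(size=n)

# Least-squares (mle) solution of the over-determined problem
#   f_i + sum_j a_ij x_j(k) = (x_i(k+1) - x_i(k)) / dt
dt = 1.0
dX = (x[1:] - x[:-1]) / dt
R = np.hstack([np.ones((K - 1, 1)), x[:-1]])   # regressors: 1, x_1, ..., x_n
coef, *_ = np.linalg.lstsq(R, dX, rcond=None)
A_hat = coef[1:].T                 # a_ij estimates; approximates A - I
```

The fitted drift recovers the preset one-way coupling (A_hat[3, 0] near 0.7, A_hat[0, 3] near 0); the â_{ij} and sample covariances are then all that the estimators (28) and (30) require.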
The result of this study is applicable to many real-world problems. As explained in the Introduction, it will be of particular use in related fields such as climate science, neuroscience, financial economics, and fluid mechanics. For example, it can help clarify the role of greenhouse gas emissions in bridging the climate system and the socioeconomic system (see a review in [26]). Likewise, the interaction between the earth system and public health [27] can also be studied. In short, it is expected to play a role in the frontier field of complexity, namely, multiplex networks or networks of networks (see the references in [28][29][30]). We are therefore working on these applications.

Conflicts of Interest:
The author declares no conflict of interest.