Identifying Coupling Structure in Complex Systems through the Optimal Causation Entropy Principle

Inferring the coupling structure of complex systems from time series data by means of statistical and information-theoretic techniques is a challenging problem in applied science. The reliability of statistical inferences requires the construction of suitable information-theoretic measures that take into account both direct and indirect influences, manifest in the form of information flows, between the components within the system. In this work, we present an application of the optimal causation entropy (oCSE) principle to identify the coupling structure of a synthetic biological system, the repressilator. Specifically, when the system reaches an equilibrium state, we use a stochastic perturbation approach to extract time series data that approximate a linear stochastic process. Then, we present and jointly apply the aggregative discovery and progressive removal algorithms based on the oCSE principle to infer the coupling structure of the system from the measured data. Finally, we show that the success rate of our coupling inferences not only improves with the amount of available data, but also increases with a higher frequency of sampling, and that the method is especially immune to false positives.

abundance of data. Notably, transfer entropy (TE), a particular type of conditional mutual information, was introduced to quantify the (asymmetric) information flow between pairs of components within a system [26,27] and further developed to detect the directionality of coupling in systems with two components [28][29][30]. The application of these pairwise inference methods to identify direct coupling in systems with more than two components warrants caution: the inferred couplings are often incorrect regardless of the amount of available data [24,25,31]. In fact, Granger's second principle implies that a true causal relationship should remain effective upon the removal of all other possible causes. As a consequence, an inference method cannot correctly identify the direct coupling between two components in a system without appropriate conditioning on the remaining components [15,24,25].
Inferring causal relationships in large-scale complex systems is challenging. For a candidate causal relationship, one needs to effectively determine whether the cause-and-effect relationship is real or is due to the presence of other variables in the system. A common approach is to test the relative independence between the potential cause and effect conditioned on all other variables, as demonstrated in linear models [17,33,34]. Although analytically rigorous, this approach requires the estimation of the related inference measures for high-dimensional random variables from (often limited available) data, and therefore suffers from the curse of dimensionality [22]. Many heuristic alternatives have been proposed to address this issue. The essence of these alternative approaches is to repeatedly measure the relative independence between the cause and effect conditioned on a combination of the other variables, often starting with subsets of only a few variables [22,23,35]. As soon as one such combination renders the proposed cause and effect independent, the proposed relationship is rejected, and there is no need to condition on subsets with more variables. The advantage of such an approach is that it reduces the dimensionality of the sample space if the proposed relationship is rejected at some point. However, if the relationship is not rejected, then the algorithm will continue, potentially having to enumerate over all subsets of the other variables. In this scenario, regardless of the dimensionality of the sample space, the combinatorial search itself is often computationally infeasible for large or even moderate-size systems.
In our recent work [24,25], we introduced the concept of causation entropy (CSE), proposed and proved the optimal causation entropy (oCSE) principle and presented efficient algorithms to infer causal relationships within large complex systems. In particular, we showed in [25] that by a combination of the aggregative discovery and progressive removal of variables, all causal relationships can be correctly inferred in a computationally feasible and data-efficient manner. We repeat for emphasis that without properly accounting for multiple interactions and conditioning accordingly, erroneous or irrelevant causal influences may be inferred; in particular, any pairwise-only method will inherit the problem that many false-positive connections will arise. The design of oCSE specifically addresses this issue.
In this paper, we focus on the problem of inferring the coupling structure in synthetic biological systems. When the system reaches an equilibrium state, we employ random perturbations to extract time series that approximate a linear Gaussian stochastic process. Then, we apply the oCSE principle to infer the system's coupling structure from the measured data. Finally, we show that the success rate of causal inferences not only improves with the amount of available data, but also increases with a higher frequency of sampling.

II. INFERENCE OF DIRECT COUPLING STRUCTURE THROUGH OPTIMAL CAUSATION ENTROPY
To infer causal structures in complex systems, we clearly need to specify the mathematical assumptions under which the task is to be accomplished. Accurate mathematical modeling of complex systems demands taking into account the coupling of neglected degrees of freedom or, more generally, the fluctuations of external fields that describe the environment interacting with the system itself [36]. This requirement can be addressed in a phenomenological manner by adding noise to deterministic dynamical models. The addition of noise, in turn, generally leads to the stochastic process formalism used in the modeling of natural phenomena [37]. We study the system in a probabilistic framework. Suppose that the system contains n components, V = {1, 2, . . . , n}. Each component i is assumed to be influenced by a unique set of components, denoted by N_i. Let:

X_t = (X_t^(1), X_t^(2), . . . , X_t^(n))   (1)

be a random variable that describes the state of the system at time t. For a subset K = {k_1, k_2, . . . , k_q} ⊂ V, we define:

X_t^(K) = (X_t^(k_1), X_t^(k_2), . . . , X_t^(k_q)).   (2)

Figure 1. Do the spectators influence the game? Yes. This is even scientifically proven [32]. The actual coupling structure: the four components in the system are Mario (M), Pascal (P), the game (G) and the spectators (S). They form a directed network of four nodes, as shown on the right of the picture (with self-links ignored). The game affects the state of mind of Mario, Pascal and the spectators. On the other hand, the spectators influence the game.

A. Markov Conditions
We assume that the system undergoes a stochastic process with the following Markov conditions [25]:

(i) Temporally Markov: p(X_{t+1}|X_t, X_{t'}) = p(X_{t+1}|X_t) for any t and t' < t.
(ii) Spatially Markov: p(X_{t+1}^(i)|X_t) = p(X_{t+1}^(i)|X_t^(N_i)).
(iii) Faithfully Markov: p(X_{t+1}^(i)|X_t^(K)) ≠ p(X_{t+1}^(i)|X_t^(K∖{j})) for any K ⊇ N_i and any j ∈ N_i.   (3)

Here, p(·|·) denotes conditional probability. The relationship between two probability density functions p_1 and p_2 is denoted as "p_1 = p_2" iff they are equal almost everywhere, and "p_1 ≠ p_2" iff there is a set of positive measure on which the two functions are not equal. Note that the Markov conditions stated in Equation (3) ensure that for each component i, there is a unique set of components N_i that renders the rest of the system irrelevant in making inference about X^(i), and that each individual component in N_i presents an observable cause regardless of the presence or absence of the other components.
Although several complex systems can be properly modeled in terms of Markov processes [38], we cannot avoid recalling that, after all, non-Markov is the rule and Markov is only the exception in nature [39]. Therefore, it is important to develop a theoretical framework suitable for identifying coupling structures in complex systems driven by correlated fluctuations where memory effects cannot be neglected. It is in fact possible to relax the assumptions in Equation (3) to a stochastic process with finite or vanishing memory, as discussed in the last section of the paper.
The problem of inferring direct (or causal) couplings can be stated as follows. Given time series data {x (i) t } (i = 1, 2, . . . , n; t = 1, 2, . . . , T ) drawn from a stochastic process that fulfills the Markov conditions in Equation (3), the goal is to uncover the set of causal components N i for each i. We solve this problem by means of algorithms that implement the oCSE principle.

B. Causation Entropy
We follow Shannon's definition of entropy to quantify the uncertainty of a random variable. In particular, the entropy of a continuous random variable X is defined as [40]:

H(X) = − ∫ p(x) log p(x) dx,   (4)

where p(x) is the probability density function of X. The conditional entropy of X given Y is defined as [41]:

H(X|Y) = − ∫∫ p(x, y) log p(x|y) dx dy.   (5)

Equations (4) and (5) are valid for both univariate and multivariate random variables. Consider a stochastic process. Let I, J and K be arbitrary sets of components within the system. We propose to quantify the influence of J on I conditioned upon K via the CSE:

C_{J→I|K} = lim_{t→∞} [H(X_{t+1}^(I)|X_t^(K)) − H(X_{t+1}^(I)|X_t^(K), X_t^(J))],   (6)

provided that the limit exists [24,25]. Since, in general, H(X|Y) − H(X|Y, Z) = I(X; Z|Y), where the latter defines the conditional mutual information between X and Z conditioned on Y, it follows from the nonnegativity of conditional mutual information that CSE is nonnegative. When K = ∅, we omit the conditioning part and simply use the notation C_{J→I}. Notice that if K = I, CSE reduces to TE. On the other hand, CSE generalizes TE by allowing K to be an arbitrary set of components.

C. Optimal Causation Entropy Principle
In our recent work [25], we revealed the equivalence between the problem of identifying the causal components and the optimization of CSE. In particular, we proved that for an arbitrary component i, its unique set of causal components N_i is the minimal set of components that maximizes CSE. Given the collection of sets (of components) with maximal CSE with respect to component i,

K_i = {K ⊆ V : C_{K→i} ≥ C_{K'→i} for every K' ⊆ V},   (7)

it was shown that N_i is the unique set in K_i with minimal cardinality, i.e.,

N_i = argmin_{K∈K_i} |K|.   (8)

We refer to this minimax principle as the oCSE principle [25].

D. Computational Causality Inference
Based on the oCSE principle, we developed two algorithms whose joint sequential application allows the inference of causal relationships within a system [25]. The goal of these algorithms is to effectively and efficiently identify for a given component i the direct causal set of components N i . The first algorithm aggregatively identifies the potential causal components of i. The outcome is a set of components M i that includes N i as its subset, with possibly additional components. These additional components are then progressively removed by applying the second algorithm.
For a given component i, the first algorithm, referred to as aggregative discovery, starts by selecting a component k_1 that maximizes CSE, i.e.,

k_1 = argmax_{k∈V} C_{k→i}.   (9)

Then, at each step j (j = 1, 2, . . . ), a new component k_{j+1} is identified among the rest of the components to maximize the CSE conditioned on the previously selected components:

k_{j+1} = argmax_{k∉{k_1,...,k_j}} C_{k→i|{k_1,...,k_j}}.   (10)

Recall that CSE is nonnegative. The above iterative process is terminated when the corresponding maximum CSE equals zero, i.e., when:

max_{k∉{k_1,...,k_j}} C_{k→i|{k_1,...,k_j}} = 0,   (11)

and the outcome is the set of components M_i = {k_1, k_2, . . . , k_j} ⊇ N_i. Next, to remove non-causal components (including indirect and spurious ones) that are in M_i, but not in N_i, we employ the second algorithm, referred to as progressive removal. A component k_j in M_i is removed when:

C_{k_j→i|M_i∖{k_j}} = 0,   (12)

and M_i is updated accordingly [60]. After removing all such components, the resulting set is identified as N_i.
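The two algorithms can be sketched in a few lines of code. The sketch below is our own illustration, not the reference implementation of [25]: it accepts an arbitrary CSE estimator as a callable, and is exercised here with a toy "oracle" CSE (hypothetical components and values) in which components 1 and 2 directly drive component 0, while component 3 influences 0 only indirectly through 1 and is therefore screened off once 1 is conditioned upon.

```python
def aggregative_discovery(i, nodes, cse, tol=1e-10):
    """Greedily add the component with maximal conditional CSE (Equations (9)-(11))."""
    selected = []
    while True:
        candidates = [k for k in nodes if k not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda k: cse([k], i, selected))
        if cse([best], i, selected) <= tol:
            break  # maximal conditional CSE is (numerically) zero: stop
        selected.append(best)
    return selected

def progressive_removal(i, discovered, cse, tol=1e-10):
    """Remove k whenever its CSE conditioned on the remaining set vanishes (Equation (12))."""
    result = list(discovered)
    for k in list(result):
        rest = [m for m in result if m != k]
        if cse([k], i, rest) <= tol:
            result.remove(k)  # k is an indirect or spurious component
    return result

def oracle_cse(J, i, K):
    """Toy stand-in for a CSE estimator: direct parents of 0 are {1, 2};
    3 acts on 0 only through 1, so conditioning on 1 screens it off."""
    informative = {1, 2} | ({3} if 1 not in K else set())
    return sum(2.0 if j == 3 else 1.0 for j in set(J) & (informative - set(K)))

M = aggregative_discovery(0, [1, 2, 3], oracle_cse)
N = progressive_removal(0, M, oracle_cse)
print(M, N)  # [3, 1, 2] [1, 2]
```

Note how the indirect component 3 enters during aggregative discovery (it carries information when nothing is conditioned upon) and is then eliminated by progressive removal, exactly the division of labor described above.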

E. Estimation of CSE in Practice
In practice, causation entropy C_{J→I|K} needs to be estimated from data. We define the Gaussian estimator of causation entropy as:

C^(Gaussian)_{J→I|K} = (1/2) log det Φ̂(I|K) − (1/2) log det Φ̂(I|K∪J).   (13)

Here:

Φ(τ)_IJ ≡ Cov(X_{t+τ}^(I), X_t^(J))   (14)

denotes a covariance matrix where τ ∈ {0, 1}, and Φ̂(τ)_IJ denotes the corresponding sample covariance matrix estimated from the data; the conditional covariance is Φ̂(I|W) ≡ Φ̂(0)_II − Φ̂(1)_IW [Φ̂(0)_WW]^{−1} [Φ̂(1)_IW]^⊤ [25]. The estimation C^(Gaussian)_{J→I|K} ≈ C_{J→I|K} if: (i) the underlying random variables are Gaussians; and (ii) the amount of data is sufficient for the relevant sample covariances to be close to their true values.
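A minimal numerical sketch of the Gaussian estimator follows; it is our own illustration, with data drawn from a toy sparse linear Gaussian process x_{t+1} = A x_t + noise (the matrix A and all parameter values are assumptions for the example, not taken from the repressilator). The estimate is clearly positive for the true coupling 1 → 0 and near zero for the absent coupling 2 → 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sparse linear Gaussian process: component 1 drives component 0
A = np.array([[0.5, 0.4, 0.0],
              [0.0, 0.5, 0.0],
              [0.0, 0.0, 0.5]])
T = 20_000
x = np.zeros((T, 3))
for t in range(T - 1):
    x[t + 1] = A @ x[t] + 0.1 * rng.standard_normal(3)

def cov(a, b):
    """Sample cross-covariance between the columns of a and b."""
    a, b = a - a.mean(0), b - b.mean(0)
    return a.T @ b / (len(a) - 1)

def cond_cov(I, W):
    """Sample covariance of x_{t+1}^(I) conditioned (linearly) on x_t^(W)."""
    xf, xp = x[1:][:, I], x[:-1]
    S = cov(xf, xf)
    if W:
        Sw, Sxw = cov(xp[:, W], xp[:, W]), cov(xf, xp[:, W])
        S = S - Sxw @ np.linalg.solve(Sw, Sxw.T)
    return S

def gauss_cse(J, I, K):
    """Gaussian CSE estimate: half the log-det ratio of conditional covariances."""
    KJ = sorted(set(K) | set(J))
    return 0.5 * (np.linalg.slogdet(cond_cov(I, list(K)))[1]
                  - np.linalg.slogdet(cond_cov(I, KJ))[1])

print(gauss_cse([1], [0], [0]))     # clearly positive: the coupling 1 -> 0 exists
print(gauss_cse([2], [0], [0, 1]))  # near zero: there is no coupling 2 -> 0
```

With K = {0} (the target itself), the first call is exactly a Gaussian transfer entropy estimate; conditioning on {0, 1} in the second call removes all information about component 0's future, as the oCSE principle predicts.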
When the underlying random variables are non-Gaussian, an efficient estimator for CSE is yet to be fully developed. As previously pointed out, binning-related estimators, although conceptually simple, are generally inefficient, unless the sample space is low-dimensional [16]. To gain efficiency and to minimize bias for a large system where the sample space is high-dimensional, an idea would be to build upon the k-nearest neighbor estimators of entropic measures [29,[42][43][44].
Regardless of the method that is being adopted for the estimation of C J→I|K , the estimated value is unlikely to be exactly zero due to limited data and numerical precision. In practice, it is necessary to decide whether or not the estimated quantity should be regarded as zero. Such a decision impacts the termination criterion Equation (11) in the aggregative discovery algorithm and determines which components need to be removed based on Equation (12) in the progressive removal algorithm. As discussed in [25], such a decision problem can be addressed via a nonparametric statistical test, called the permutation test, as described below.
For given time series data {x_t^(i)} (i = 1, 2, . . . , n; t = 1, 2, . . . , T), let Ĉ_{J→I|K} denote the estimated value of C_{J→I|K}. Based on the set of components J and a permutation function π on the set of integers from one to T, the corresponding permuted time series {y_t^(i)} is defined as:

y_t^(j) = x_{π(t)}^(j) if j ∈ J, and y_t^(j) = x_t^(j) otherwise.   (15)

To apply the permutation test, we generate a number of randomly permuted time series (the number will be denoted by r). We then compute the causation entropy from J to I conditioned on K for each permuted time series to obtain r values of the estimates, which are consequently used to construct an empirical cumulative distribution F as:

F(c) = |{s : Ĉ^(s)_{J→I|K} ≤ c, 1 ≤ s ≤ r}| / r,   (16)

where Ĉ^(s)_{J→I|K} are estimates from the permuted time series with 1 ≤ s ≤ r and | · | denotes the cardinality of a set. Finally, with the null hypothesis that C_{J→I|K} = 0, we regard Ĉ_{J→I|K} as strictly positive if and only if:

F(Ĉ_{J→I|K}) ≥ θ,   (17)

where 0 < (1 − θ) < 1 is the prescribed significance level. In other words, the null hypothesis is rejected at level (1 − θ) if the above inequality holds. Therefore, the permutation test relies on two input parameters: (i) the number of random permutations r; and (ii) the significance level (1 − θ). In practice, increasing r improves the accuracy of the empirical distribution at the expense of additional computational cost. A reasonable balance is often achieved when 10^3 ≲ r ≲ 10^4 [25]. The value of (1 − θ) sets an approximate upper bound on the expected false positive ratio, and therefore θ should be chosen to be close to one (for example, θ = 99%) [25].
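The permutation test can be sketched as follows. This is our own minimal illustration: it uses a scalar Gaussian CSE estimate based on regression residual variances, an illustrative drive-response pair with assumed coupling coefficients, and a reduced number of permutations (r = 200, θ = 0.95) rather than the 10^3–10^4 permutations and θ = 99% recommended above.

```python
import numpy as np

rng = np.random.default_rng(1)

def cse_hat(xj, xi_next, xk):
    """Gaussian CSE estimate for scalar series via regression residual variances."""
    def resid_var(y, regressors):
        X = np.column_stack([np.ones(len(y))] + regressors)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return (y - X @ beta).var()
    return 0.5 * np.log(resid_var(xi_next, [xk]) / resid_var(xi_next, [xk, xj]))

def permutation_test(xj, xi_next, xk, r=200, theta=0.95):
    """Reject H0: C = 0 iff the empirical CDF at the unpermuted estimate is >= theta."""
    c0 = cse_hat(xj, xi_next, xk)
    below = sum(cse_hat(rng.permutation(xj), xi_next, xk) <= c0 for _ in range(r))
    return below / r >= theta

# Toy drive-response pair: x drives y (coefficients are illustrative)
T = 2000
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(T - 1):
    y[t + 1] = 0.5 * y[t] + 0.4 * x[t] + 0.1 * rng.standard_normal()

print(permutation_test(x[:-1], y[1:], y[:-1]))  # True: the coupling x -> y is detected
```

Shuffling only the candidate-cause series destroys its temporal alignment with the target while preserving its marginal distribution, which is exactly the null hypothesis C_{J→I|K} = 0 being simulated.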

III. EXTRACTING STOCHASTIC TIME SERIES FROM DETERMINISTIC ORBITS
Biological systems can exhibit both regular and chaotic features [45]. For instance, a healthy cardiac rhythm is characterized by a chaotic time series, whereas a pathological rhythm often exhibits regular dynamics [46]. Therefore, a decrease of the chaoticity of the cardiac rhythm is an alarming clinical signature. A similar connection between the regularity of the time series and pathology is observed in spontaneous neuronal bursting in the brain [47,48]. When a system settles into a periodic or equilibrium state, it becomes nearly impossible to infer the coupling structure among the variables, as the system is not generating any information to be utilized for inference. To overcome this difficulty, we propose to apply small stochastic perturbations to the system while in an equilibrium state (this is equivalent to adding dynamic noise to a system to facilitate coupling inference, as shown in [49]). Then, we measure the system's response over short time intervals. Finally, we follow the oCSE principle and apply the aggregative discovery and progressive removal algorithms to the measured data to infer the couplings among the variables.

A. Dynamical System and Equilibrium States
Consider a continuous dynamical system:

dx/dt = f(x),   (18)

where x = [x_1, x_2, . . . , x_n]^⊤ ∈ R^n is the n-dimensional state variable, the symbol "⊤" denotes transpose, and f = [f_1, f_2, . . . , f_n]^⊤ : R^n → R^n is a smooth vector field, which models the dynamic rules of the system. A trajectory of the system (18) is a solution x(t) to the differential equation, Equation (18), with a given initial condition x(0) = x_0. An equilibrium of the system is a state x*, such that f(x*) = 0. When a system reaches an equilibrium, the time evolution of the state ceases. An equilibrium x* is called stable if nearby trajectories approach x* forward in time, i.e., there exists a constant ρ > 0 such that x(t) → x* as t → ∞ whenever ‖x_0 − x*‖ < ρ, where ‖·‖ denotes the standard Euclidean norm. Otherwise, the equilibrium is called unstable.

B. Response of the System to Stochastic Perturbations
To gain information about the coupling structure of a system, it is necessary to apply external perturbations to "knock" the system out of an equilibrium state and observe how it responds to these perturbations. Suppose that we apply and record a sequence of random perturbations to the system in such a manner that before each perturbation, the system is given sufficient time to evolve back to its stable equilibrium. In addition, the response of the system is measured shortly after each perturbation, but before the system reaches the equilibrium again. Denote the stable equilibrium of interest as x * ; we propose to repeatedly apply the following steps.
Step 1. Allow the system to (spontaneously) reach x * .
Step 2. At time t, apply and record a random perturbation ξ to the system, i.e., x(t) = x * + ξ.
Step 3. At time t + ∆t, measure the state x(t + ∆t) and record the rate of response η = [x(t + ∆t) − x(t)]/∆t.
Repeated application of these steps L times results in a collection of perturbations, denoted as {ξ_ℓ}, and rates of response, denoted as {η_ℓ}, where ℓ = 1, 2, . . . , L. Here, each perturbation ξ_ℓ is assumed to be drawn independently from the same multivariate Gaussian distribution with zero mean and covariance matrix σ²I. For a given equilibrium x*, in addition to L (the number of times the perturbations are applied), there are two more parameters in the stochastic perturbation process: the sample frequency, defined as 1/∆t, and the variance of perturbation, defined as σ². To ensure that the perturbation approximates the linearized dynamics of the system, we require that 1/∆t ≫ 1 and σ ≪ 1. The influence of these parameters will be studied in the next section with a concrete example. We remark here that the choice of Gaussian distribution is a practical rather than a conceptual one. In theory, any multivariate distribution can be used to generate the perturbation vector ξ_ℓ as long as the component-wise distributions of ξ_ℓ^(i) and ξ_k^(j) are identical and the components are independent whenever i ≠ j or ℓ ≠ k. In practice, since the effectiveness of any information theoretic method (including the one proposed here) depends on the estimation of entropies, choosing the perturbations to be Gaussian greatly improves the reliability of estimation and, thus, the accuracy of the inference. This is because the entropy of Gaussian variables depends only on covariances, rendering its estimation relatively straightforward [25,50,51].
Note that each perturbation ξ_ℓ and its response η_ℓ are related through the underlying nonlinear differential equation dx/dt = f(x), where the nonlinearity is encoded in the function f(x), which is assumed to be unknown. For an equilibrium x*, the dynamics of nearby trajectories can be approximated by its linearized system, as follows. Consider a state x ≈ x* and define the new variable δx = x − x*. To the first order, the time evolution of δx can be approximated by the following linear system:

d(δx)/dt = Df(x*) δx,   (19)

where Df(·) is the n-by-n Jacobian matrix of f, defined as [Df(x)]_ij = ∂f_i/∂x_j. A sufficient condition for x* to be stable is that all eigenvalues of Df(x*) have negative real parts. From Equation (19), with the additional assumption that ∆t ≪ 1, the relationship between the perturbation and response is approximated by the following equation:

η_ℓ ≈ Df(x*) ξ_ℓ.   (20)

Note that since ξ_ℓ is a multivariate normal random variable and Df(x*) is a constant matrix, the variable η_ℓ is also (approximately) a multivariate normal random variable. Equation (20) therefore represents a drive-response type of Gaussian process.
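The approximation in Equation (20) is easy to verify numerically. The sketch below is our own illustration on an assumed two-dimensional system (not the repressilator) with a stable equilibrium at the origin and a weak cubic nonlinearity; the matrix A, the perturbation scale and the integration scheme are all choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative 2-D system with a stable equilibrium at the origin
A = np.array([[-1.0, 0.5],
              [0.0, -1.0]])

def f(x):
    return A @ x - 0.1 * x**3  # linear part plus a weak cubic nonlinearity

def respond(xi, dt=1e-2, substeps=100):
    """Euler-integrate dx/dt = f(x) from x* + xi over time dt; return the rate of response."""
    x = xi.copy()
    h = dt / substeps
    for _ in range(substeps):
        x = x + h * f(x)
    return (x - xi) / dt

xi = 1e-3 * rng.standard_normal(2)  # a small Gaussian perturbation (sigma << 1)
eta = respond(xi)
print(np.allclose(eta, A @ xi, atol=1e-4))  # True: eta approximates Df(x*) xi
```

Because σ ≪ 1 and ∆t ≪ 1, the cubic term and the higher-order terms in ∆t are negligible, and the measured rate of response reproduces the Jacobian action on the perturbation, as Equation (20) asserts.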

IV. APPLICATION TO SYNTHETIC BIOLOGY

A. The Repressilator
Cellular dynamics is centrally important in biology [52]. To describe and, to a certain extent, to understand what happens in cells, fluctuating chemical concentrations and reaction rates need to be measured experimentally.
However, constructing dynamic models that accurately reproduce the observed phenomena in cells is quite challenging. An alternative approach consists of engineering synthetic biological systems that follow prescribed dynamic rules. An important example is the so-called repressilator (or repression-driven oscillator) presented in [53]. The repressilator is based upon three transcriptional repressors inserted into the E. coli bacteria with a plasmid. The three repressors, lacI, tetR and cI, are related as follows: lacI inhibits the transcription of the gene coding for tetR; tetR inhibits the transcription of the gene coding for cI; cI inhibits the transcription of the gene coding for lacI. In the absence of inhibition, each of the three proteins reaches a steady-state concentration resulting from a balance between its production and degradation rates within the bacterium. In the presence of cross-inhibition by the other two repressors, this network architecture potentially allows oscillations and other interesting dynamical behaviors and serves as a prototype of modeling the quorum sensing among bacteria species [54].
The repressilator dynamics can be modeled by a system of coupled differential equations, which describe the rates of change for the concentration p_i of each protein repressor and the concentration m_i of its associated mRNA in their network, as [53]:

dm_i/dt = −m_i + α/(1 + p_j^n) + α_0,
dp_i/dt = −β(p_i − m_i),   with (i, j) ∈ {(1, 3), (2, 1), (3, 2)},   (21)

where m_1 (p_1), m_2 (p_2) and m_3 (p_3) represent the mRNA (protein) concentration of the genes lacI, tetR and cI, respectively. See Figure 2a for a schematic representation of the system. Each ODE in Equation (21) consists of positive terms modeling the production rate and a negative term representing degradation. There are four parameters in the ODEs in Equation (21), namely: β is the ratio of the protein decay rate to the mRNA decay rate; n is the so-called Hill coefficient and describes the cooperativity of the binding of repressor to promoter; α_0, the leakiness of the promoter, is the rate of transcription of mRNA in the presence of a saturating concentration of the repressor; α_0 + α is the rate of transcription of mRNA in the absence of the repressor. Note that units of time and concentration in Equation (21) have been rescaled in order to make these equations non-dimensional [53]. As shown in [53], there is an extended region in the parameter space for which the system described in Equation (21) exhibits a single stable equilibrium. With the state ordered as x = (m_1, m_2, m_3, p_1, p_2, p_3) and the equilibrium x* = (m*, m*, m*, p*, p*, p*), the Jacobian matrix at the equilibrium is:

Df(x*) = [ −I_3    K    ]
         [  βI_3  −βI_3 ],   (22)

where I_3 is the 3-by-3 identity matrix and K is the 3-by-3 matrix whose only nonzero entries are K_13 = K_21 = K_32 = −αn(p*)^(n−1)/(1 + (p*)^n)². Problem statement: the goal of coupling inference is to identify the location of the nonzero entries of the Jacobian through time series generated by the system near equilibrium.
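The equilibrium, its stability and the sparsity pattern of the Jacobian can all be checked numerically. The sketch below is our own, using the parameter values quoted in Section IV.B (n = 2, α_0 = 0, α = 10, β = 100); the finite-difference Jacobian helper is an illustrative stand-in for analytic differentiation.

```python
import numpy as np

# Parameter values quoted in Section IV.B: n = 2, alpha0 = 0, alpha = 10, beta = 100
n, alpha0, alpha, beta = 2, 0.0, 10.0, 100.0

def f(x):
    """Repressilator vector field; x = (m1, m2, m3, p1, p2, p3)."""
    m, p = x[:3], x[3:]
    rep = p[[2, 0, 1]]  # m1 is repressed by p3, m2 by p1, m3 by p2
    dm = -m + alpha / (1.0 + rep**n) + alpha0
    dp = -beta * (p - m)
    return np.concatenate([dm, dp])

x_star = np.full(6, 2.0)  # the equilibrium quoted in the text
print(np.allclose(f(x_star), 0.0))  # True: x_star is an equilibrium

def jacobian(g, x, h=1e-6):
    """Central finite-difference Jacobian [Df]_ij = dfi/dxj (illustrative helper)."""
    J = np.zeros((len(x), len(x)))
    for j in range(len(x)):
        e = np.zeros(len(x)); e[j] = h
        J[:, j] = (g(x + e) - g(x - e)) / (2 * h)
    return J

J = jacobian(f, x_star)
print(np.all(np.linalg.eigvals(J).real < 0))  # True: the equilibrium is stable
B = (np.abs(J) > 1e-8).astype(int)  # the coupling structure to be inferred
print(B)
```

The binary matrix B printed at the end is precisely the ground-truth coupling structure that the inference procedure of the next subsection aims to recover.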

B. Inference of Coupling Structure via the Repressilator Dynamics
We consider the repressilator dynamics as modeled by Equation (21) with the parameters n = 2, α_0 = 0, α = 10 and β = 100. Under this setting, the system exhibits a single stable equilibrium [53] at which m_1 = m_2 = m_3 = p_1 = p_2 = p_3 = 2. Typical time series are shown in Figure 2b. After the system settles at the equilibrium, we apply the stochastic perturbations as described in Section III and obtain a time series of the perturbations {ξ_ℓ}, as well as responses {η_ℓ}. The goal is to identify, for each component (the mRNA or protein of a gene) in the system, the set of components that determine its dynamics (i.e., the variables that appear on the right-hand side of each relation in Equation (21)). To apply our algorithms according to the oCSE principle, we define the set of random variables X_t^(i) (i = 1, 2, . . . , 12; t = 1, 2, . . . ) as:

X_t^(i) = ξ_t^(i) for i = 1, . . . , 6, and X_t^(i) = η_t^(i−6) for i = 7, . . . , 12.   (23)

The approximate relationship between the perturbation and response as in Equation (20) can be expressed as:

X_t^(I) ≈ A X_t^(J),   (24)

where I = {7, 8, . . . , 12}, J = {1, 2, . . . , 6} and A = Df(x*). Since this equation defines a stochastic process that satisfies the Markov assumptions in Equation (3), the oCSE principle applies. This implies that the direct couplings can be correctly inferred, at least in principle, by jointly applying the aggregative discovery and progressive removal algorithms.
Since the perturbations {ξ_ℓ} and responses {η_ℓ} depend on the number of samples L, the rate of perturbation 1/∆t and the variance of perturbation σ², so too does the accuracy of the inference. Next, we explore the change in performance of our approach by varying these parameters. We use two quantities to measure the accuracy of the inferred coupling structure, namely, the false positive ratio ε+ and the false negative ratio ε−. Since our goal is to infer the structure rather than the weights of the couplings, we focus on the structure of the Jacobian matrix Df(x*), encoded in the binary matrix B, where:

B_ij = 1 if [Df(x*)]_ij ≠ 0, and B_ij = 0 otherwise.   (25)

On the other hand, applying the oCSE principle, the inferred direct coupling structure gives rise to the estimated binary matrix B̂, where:

B̂_ij = 1 if a direct coupling from component j to component i is inferred, and B̂_ij = 0 otherwise.   (26)

Given matrices B and B̂, the false positive and false negative ratios are defined, respectively, as:

ε+ ≡ (number of (i, j) pairs with B̂_ij = 1 and B_ij = 0) / (number of (i, j) pairs with B_ij = 0),
ε− ≡ (number of (i, j) pairs with B̂_ij = 0 and B_ij = 1) / (number of (i, j) pairs with B_ij = 1).   (27)
It follows that 0 ≤ ε+, ε− ≤ 1, and ε+ = ε− = 0 when exact (error-free) inference is achieved. Figure 3a,b shows that both the false positives and false negatives converge as the number of samples increases. However, they converge to zero only if the rate of perturbation is sufficiently high. Figure 3c,d supports this observation and, in addition, shows that in the high rate of perturbation regime, exact inference is achieved with a sufficient number of samples (in this case, L ∼ 100). The combined effects of L and 1/∆t are shown in Figure 4. In all simulations, the variance of the perturbation is set to be σ² = 10⁻⁴ and is found to have little effect on the resulting inference, provided that it is sufficiently small to keep the linearization in Equations (20) and (24) valid.

Figure 3. (a,b) ε+ and ε− as a function of the number of samples L; here, ε+ and ε− are defined in Equation (27). (c,d) ε+ and ε− as a function of 1/∆t for three different values of L. In all panels, we set σ = 10⁻², and each data point is an average over 100 independent runs.
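The two error ratios in Equation (27) are straightforward to compute from the binary matrices. The sketch below is our own; the example matrices are illustrative, not those of the repressilator.

```python
import numpy as np

def error_ratios(B_true, B_hat):
    """False positive and false negative ratios of an inferred binary coupling matrix."""
    fp = np.sum((B_hat == 1) & (B_true == 0)) / np.sum(B_true == 0)
    fn = np.sum((B_hat == 0) & (B_true == 1)) / np.sum(B_true == 1)
    return fp, fn

B_true = np.array([[1, 1, 0],
                   [0, 1, 0],
                   [0, 0, 1]])
B_hat = np.array([[1, 0, 0],
                  [0, 1, 1],
                  [0, 0, 1]])
print(error_ratios(B_true, B_hat))  # (0.2, 0.25): one spurious link, one missed link
```

Here one of the five absent couplings is wrongly inferred (ε+ = 1/5) and one of the four true couplings is missed (ε− = 1/4).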

V. DISCUSSION AND CONCLUSIONS

A. Results Summary
In this paper, we considered the challenging problem of inferring the causal structure of complex systems from limited available data (see Figure 1 for a cartoon depiction of the concept of causality during a soccer game). Specifically, we presented an application of the so-called oCSE principle to identify the coupling structure of a synthetic biological system, the repressilator. First, we briefly reviewed the main points of the oCSE principle (Equations (7) and (8)), which states that for an arbitrary component i of a complex system, its unique set of causal components N_i is the minimal set of components that maximizes CSE (causation entropy, defined in Equation (6), is a generalization of transfer entropy). We strengthen in this work our claim that CSE is a suitable information-theoretic measure for reliable statistical inferences, since it takes into account both direct and indirect influences that appear in the form of information flows between nodes of networks underlying complex systems. We also devoted some attention to the implementation of the oCSE principle. This task is accomplished by means of the joint sequential application of two algorithms, aggregative discovery and progressive removal, respectively. Second, having introduced the main theoretical and computational frameworks, we used a stochastic perturbation approach to extract time series data approximating a Gaussian process when the model system, the repressilator (see Equation (21)), reaches an equilibrium configuration. We then applied the above-mentioned algorithms implementing the oCSE principle in order to infer the coupling structure of the model system from the observed data. Finally, we numerically showed that the success rate of our causal entropic inferences not only improves with the amount of available measured data (Figure 3a,b), but it also increases with a higher frequency of sampling (Figure 3c,d).
One especially important feature of our oCSE-based causality inference approach is that it is immune (in principle) to false positives. When data is sufficient, false positives are eliminated by the sequential joint application of the aggregative discovery and progressive removal algorithms, as well as by raising the threshold θ used in the permutation test (see [25] for a more detailed investigation of this point). In contrast, any pairwise causality measure without additional appropriate conditioning will in principle be susceptible to false positives, resulting in too many connections and sometimes even implying that everything causes everything. Such a limitation is intrinsic and cannot be overcome merely by gathering more data [25]. On the other hand, false negatives are less common in practice and are usually caused by ineffective statistical estimation or insufficient data.

B. oCSE for General Stochastic Processes
We presented the oCSE principle for stochastic processes under the Markov assumptions stated in Equation (3). From an inference point of view, each and every one of these assumptions has a well-defined interpretation. For instance, the temporal Markov condition implies the stationarity of causal dependences between nodes. The loss of temporal stationarity could be addressed by partitioning the time series data into stationary segments and, then, performing time-dependent inferences. Clearly, such inferences would require extra care. Furthermore, the loss of the spatially and/or faithfully Markov condition would imply that the set of nodes that directly influence any given node of the network describing the complex system is no longer minimal and unique. Causality inference in this case becomes an ill-posed problem. These issues will be formally addressed in a forthcoming work.
Note that a finite k-th order Markov process {X_t} can always be converted into a first-order (memoryless) one {Z_t} by lifting, or in other words, by defining new random variables Z_t = (X_{t−k+1}, X_{t−k+2}, . . . , X_t) [55]. In this regard, the oCSE principle extends naturally to an arbitrary finite-order Markov process. On the other hand, for a general stationary stochastic process that is not necessarily Markov (i.e., considering a process with potentially infinite memory), there might exist an infinite number of components from the past that causally influence the current state of a given component. However, under an assumption of vanishing (or fading) memory, such influences decay rapidly as a function of the time lag, and consequently, the process itself can be approximated by a finite-order Markov chain [56][57][58][59]. We plan to leave such generalizations for forthcoming investigations.
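The lifting construction can be illustrated in a few lines; the helper name and the toy scalar series below are our own.

```python
import numpy as np

def lift(x, k):
    """Lift a k-th order series: Z_t = (X_{t-k+1}, ..., X_t), one row per valid t."""
    T = len(x)
    return np.column_stack([x[i:T - k + 1 + i] for i in range(k)])

x = np.arange(10.0).reshape(-1, 1)  # a scalar time series of length 10
z = lift(x, 3)
print(z.shape)  # (8, 3): each lifted state stacks three consecutive values
print(z[0])     # [0. 1. 2.]
```

Applied to a k-th order Markov series, the lifted series {Z_t} is first-order Markov by construction, so the Markov conditions in Equation (3), and hence the oCSE principle, apply to it directly.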