
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Inferring the coupling structure of complex systems from time series data by means of statistical and information-theoretic techniques is a challenging problem in applied science. Reliable statistical inference requires suitable information-theoretic measures that account for both direct and indirect influences, manifest as information flows, between the components of the system. In this work, we present an application of the optimal causation entropy (oCSE) principle to identify the coupling structure of a synthetic biological system, the repressilator. Specifically, when the system reaches an equilibrium state, we use a stochastic perturbation approach to extract time series data that approximate a linear stochastic process. Then, we present and jointly apply the aggregative discovery and progressive removal algorithms based on the oCSE principle to infer the coupling structure of the system from the measured data. Finally, we show that the success rate of our coupling inferences improves not only with the amount of available data, but also with a higher sampling frequency, and that the approach is especially immune to false positives.

Deducing equations of dynamics from empirical observations is fundamental in science. In real-world experiments, we gather data on the state of a system. Then, to understand the mechanisms behind the system dynamics, we often need to reconstruct the underlying dynamical equations from the measured data. For example, the laws of celestial mechanics were deduced based on observations of planet trajectories [

Later, more systematic methods were developed to design optimal models from a set of basis functions, where model quality was quantified in various ways, such as the Euclidean norm of the error [

Fortunately, in many applications, the problem we face is not necessarily the extraction of exact equations, but rather, uncovering the cause-and-effect relationships (

Among the various notions of causality, we adopt the one originally proposed by Granger, which relies on two basic principles [

The cause should occur before the effect (caused);

The causal process should carry information (unavailable in other processes) about the effect.

A causal relationship needs to satisfy both requirements. See

The classical Granger causality is limited to coupled linear systems, while most recently developed methods based on information-theoretic measures are applicable to virtually any model, although their effectiveness relies on the abundance of data. Notably, transfer entropy (TE), a particular type of conditional mutual information, was introduced to quantify the (asymmetric) information flow between pairs of components within a system [

Inferring causal relationships in large-scale complex systems is challenging. For a candidate causal relationship, one needs to determine whether the apparent cause-and-effect relationship is real or is due to the presence of other variables in the system. A common approach is to test the relative independence between the potential cause and effect conditioned on all other variables, as demonstrated in linear models [

In our recent work in [

In this paper, we focus on the problem of inferring the coupling structure in synthetic biological systems. When the system reaches an equilibrium state, we employ random perturbations to extract time series that approximate a linear Gaussian stochastic process. Then, we apply the oCSE principle to infer the system’s coupling structure from the measured data. Finally, we show that the success rate of causal inferences improves not only with the amount of available data, but also with a higher sampling frequency.

To infer causal structures in complex systems, we clearly need to specify the mathematical assumptions under which the task is to be accomplished. Accurate mathematical modeling of complex systems demands taking into account the coupling of neglected degrees of freedom or, more generally, the fluctuations of external fields that describe the environment interacting with the system itself [_{i}

be a random variable that describes the state of the system at time _{1}_{2},…_{q}

We assume that the system undergoes a stochastic process with the following Markov conditions [

Here, _{1} and _{2} is denoted as “_{1} = _{2}” iff they equal almost everywhere, and “_{1} ≠ _{2}” iff there is a set of positive measure on which the two functions do not equal. Note that the Markov conditions stated in _{i}^{(}^{i}^{)} and each individual component in _{i}

Although several complex systems can be properly modeled in terms of Markov processes [

The problem of inferring direct (or causal) couplings can be stated as follows. Given time series data
_{i}

We follow Shannon’s definition of entropy to quantify the uncertainty of a random variable. In particular, the entropy of a continuous random variable

where

provided that the limit exists [_{J→I}

In our recent work [_{i}

it was shown that _{i}

We refer to this minimax principle as the oCSE principle [

Based on the oCSE principle, we developed two algorithms whose joint sequential application allows the inference of causal relationships within a system [_{i}_{i}_{i}

For a given component _{1} that maximizes CSE,

Then, at each step _{j}_{+1} is identified among the rest of the components to maximize the CSE conditioned on the previously selected components:

Recall that CSE is nonnegative. The above iterative process is terminated when the corresponding maximum CSE equals zero,

and the outcome is the set of components _{i}_{1}_{2}_{j}_{i}

Next, to remove non-causal components (including indirect and spurious ones) that are in _{i}_{i}_{j}_{i}

and _{i}_{i}_{i}_{i}
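The two algorithms described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: causation entropy is estimated with the standard closed form for jointly Gaussian variables, and a fixed numerical `threshold` stands in for the test of whether an estimated CSE "equals zero". The function names (`gaussian_cse`, `aggregative_discovery`, `progressive_removal`, `infer_parents`) and the array layout (rows are samples, columns are components) are hypothetical choices made for this sketch.

```python
import numpy as np

def gaussian_cse(x_now, x_next, i, J, K):
    """Estimate C_{J->i|K} = I(x_next[:, i]; x_now[:, J] | x_now[:, K]),
    assuming all variables are jointly Gaussian (an assumption of this sketch)."""
    J = np.asarray(list(J), dtype=int)
    K = np.asarray(list(K), dtype=int)
    if J.size == 0:
        return 0.0

    def logdet(cols):
        # Log-determinant of the sample covariance; an empty block contributes 0.
        if cols.shape[1] == 0:
            return 0.0
        cov = np.atleast_2d(np.cov(cols, rowvar=False))
        return np.linalg.slogdet(cov)[1]

    y, xj, xk = x_next[:, [i]], x_now[:, J], x_now[:, K]
    return 0.5 * (logdet(np.hstack([y, xk])) + logdet(np.hstack([xj, xk]))
                  - logdet(xk) - logdet(np.hstack([y, xj, xk])))

def aggregative_discovery(x_now, x_next, i, threshold):
    """Greedily add the component that maximizes CSE given those already chosen."""
    n = x_now.shape[1]
    K = []
    while len(K) < n:
        rest = [j for j in range(n) if j not in K]
        gains = [gaussian_cse(x_now, x_next, i, [j], K) for j in rest]
        best = int(np.argmax(gains))
        if gains[best] <= threshold:  # stand-in for "maximum CSE equals zero"
            break
        K.append(rest[best])
    return K

def progressive_removal(x_now, x_next, i, K, threshold):
    """Drop each candidate whose CSE, conditioned on the other candidates, vanishes."""
    K = list(K)
    for j in list(K):
        others = [k for k in K if k != j]
        if gaussian_cse(x_now, x_next, i, [j], others) <= threshold:
            K = others
    return sorted(K)

def infer_parents(x_now, x_next, i, threshold):
    """Run discovery followed by removal for a single component i."""
    K = aggregative_discovery(x_now, x_next, i, threshold)
    return progressive_removal(x_now, x_next, i, K, threshold)
```

Calling `infer_parents` once per component gives one row of the inferred coupling structure at a time. In a setting with limited data, the fixed threshold would normally be replaced by a statistical significance test.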

In practice, causation entropy _{J→I|K}

Here:

denotes a covariance matrix where

Regardless of the method that is being adopted for the estimation of _{J→I|K}

For given time series data
_{J→I|K}

To apply the permutation test, we generate a number of randomly permuted time series (the number will be denoted by

where
_{J→I|K}_{J→I|K}

where 0 < (1 − ^{3} ≲ ^{4} [
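A permutation test of this kind can be sketched as follows. This is an illustrative sketch only: it assumes a Gaussian plug-in estimator of conditional mutual information, shuffles the samples of the candidate cause to build the null distribution, and declares the estimate significant when the fraction of shuffled estimates that do not exceed the observed one is larger than a level `theta`. The names `gaussian_cmi`, `permutation_test`, `n_perm` and `theta`, and their default values, are hypothetical and chosen for this example.

```python
import numpy as np

def gaussian_cmi(y, x, z):
    """I(y; x | z) under a joint Gaussian assumption.
    Each argument is an (N, d) array; z may have d = 0 columns."""
    def logdet(a):
        if a.shape[1] == 0:
            return 0.0
        return np.linalg.slogdet(np.atleast_2d(np.cov(a, rowvar=False)))[1]
    return 0.5 * (logdet(np.hstack([y, z])) + logdet(np.hstack([x, z]))
                  - logdet(z) - logdet(np.hstack([y, x, z])))

def permutation_test(y, x, z, n_perm=200, theta=0.99, seed=0):
    """Return (is_significant, observed_estimate).

    The samples of x are randomly permuted n_perm times, which destroys any
    dependence between x and (y, z) while preserving the marginal of x."""
    rng = np.random.default_rng(seed)
    observed = gaussian_cmi(y, x, z)
    null = np.array([gaussian_cmi(y, x[rng.permutation(len(x))], z)
                     for _ in range(n_perm)])
    fraction_below = np.mean(null <= observed)  # empirical CDF at the observed value
    return bool(fraction_below > theta), float(observed)
```

A larger `n_perm` resolves the tail of the null distribution more finely at a proportionally higher computational cost.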

Biological systems can exhibit both regular and chaotic features [

Consider a continuous dynamical system:

where
_{1}_{2},…, _{n}^{⊤}: ℝ^{n} →^{n}_{0}.

An equilibrium of the system is a state ^{∗}, such that ^{∗}) = 0. When a system reaches an equilibrium, the time evolution of the state ceases. An equilibrium ^{∗} is called stable if nearby trajectories approach ^{∗} forward in time, _{0} –

To gain information about the coupling structure of a system, it is necessary to apply external perturbations to “knock” the system out of an equilibrium state and observe how it responds to these perturbations. Suppose that we apply and record a sequence of random perturbations to the system in such a manner that before each perturbation, the system is given sufficient time to evolve back to its stable equilibrium. In addition, the response of the system is measured shortly after each perturbation, but before the system reaches the equilibrium again. Denote the stable equilibrium of interest as ^{∗}

Allow the system to (spontaneously) reach ^{∗}

At time ^{∗} +

At time ^{∗} −

Repeated application of these steps _{ℓ}}, where ℓ = 1_{ℓ} is assumed to be drawn independently from the same multivariate Gaussian distribution with zero mean and covariance matrix ^{2}
^{∗}^{2}. To ensure that the perturbation approximates the linearized dynamics of the system, we require that 1/Δ

We remark here that the choice of Gaussian distribution is a practical rather than a conceptual one. In theory, any multivariate distribution can be used to generate the perturbation vector _{ℓ} as long as the component-wise distributions of

Note that each perturbation _{ℓ} and its response _{ℓ}^{∗}, the dynamics of nearby trajectories can be approximated by its linearized system, as follows. Consider a state ^{∗}^{∗}

where _{ij}_{i}/∂x_{j}^{∗}^{∗}

Note that since _{ℓ} is a multivariate normal random variable and ^{∗}_{ℓ}
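The perturb-and-measure procedure and its linear approximation can be illustrated with a short simulation. The sketch below makes several assumptions that are not taken from the text: the response is recorded as the deviation from the equilibrium after a time `dt`, the system is a hypothetical three-component example with a stable equilibrium at the origin (it is not the repressilator), and integration uses a hand-written classical Runge-Kutta scheme. All function and variable names are illustrative.

```python
import numpy as np

def rk4_flow(f, x0, dt, n_sub=10):
    """Integrate dx/dt = f(x) from x0 over a time dt with n_sub RK4 substeps."""
    x, h = np.array(x0, dtype=float), dt / n_sub
    for _ in range(n_sub):
        k1 = f(x)
        k2 = f(x + 0.5 * h * k1)
        k3 = f(x + 0.5 * h * k2)
        k4 = f(x + h * k3)
        x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

def perturb_and_measure(f, x_star, sigma, dt, n_samples, seed=0):
    """Kick the system from x_star with Gaussian noise of standard deviation sigma
    and record the deviation from x_star after a time dt (assumed response)."""
    rng = np.random.default_rng(seed)
    x_star = np.asarray(x_star, dtype=float)
    xi = sigma * rng.standard_normal((n_samples, x_star.size))
    eta = np.array([rk4_flow(f, x_star + p, dt) - x_star for p in xi])
    return xi, eta

# Hypothetical example: a cyclic linear coupling plus a weak cubic nonlinearity.
A = np.array([[-1.0, 0.0, 0.8],
              [0.8, -1.0, 0.0],
              [0.0, 0.8, -1.0]])

def f(x):
    return A @ x - x**3

xi, eta = perturb_and_measure(f, np.zeros(3), sigma=1e-2, dt=0.01, n_samples=500)

# Least-squares fit of the linear map eta ~ M xi. For small sigma and dt,
# M is close to I + A*dt, so its off-diagonal pattern mirrors that of the Jacobian A.
M = np.linalg.lstsq(xi, eta, rcond=None)[0].T
```

In this example the fitted matrix `M` has clearly nonzero off-diagonal entries only where the Jacobian `A` does, which is the property that makes small perturbations near a stable equilibrium informative about the coupling structure.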

Cellular dynamics is centrally important in biology [

The repressilator dynamics can be modeled by a system of coupled differential equations, which describe the rates of change for the concentration _{i}_{i}

where _{1} (_{1}), _{2} (_{2}) and _{3} (_{3}) represent the mRNA (protein) concentration of the genes _{0}, the leakiness of the promoter, is the rate of transcription of mRNA in the presence of saturating concentration of the repressor; _{0} +

As shown in [

We consider the repressilator dynamics as modeled by _{0} = 0, _{1} = _{2} = _{3} = _{1} = _{2} = _{3} = 2. Typical time series are shown in _{ℓ}_{ℓ}

To apply our algorithms according to the oCSE principle, we define the set of random variables

The approximate relationship between the perturbation and response as in 0

where ^{∗}

Since the perturbations {_{ℓ}_{ℓ}^{2}, so is the accuracy of the inference. Next, we explore the change in performance of our approach by varying these parameters. We use two quantities to measure the accuracy of the inferred coupling structure, namely, the false positive ratio _{+} and the false negative ratio _{−}. Since our goal is to infer the structure rather than the weights of the couplings, we focus on the structure of the Jacobian matrix ^{∗}

On the other hand, applying the oCSE principle, the inferred direct coupling structure gives rise to the estimated binary matrix

Given matrices

It follows that both ratios lie between 0 and 1, and that both equal 0 when exact (error-free) inference is achieved.
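The two accuracy measures can be computed directly from the true and inferred binary matrices. The sketch below assumes one common definition, which may differ in detail from the one used in the text: the false positive ratio is the number of entries inferred as links that are truly absent, divided by the number of truly absent links, and the false negative ratio is the number of true links that were missed, divided by the number of true links. The function name `error_ratios` is hypothetical.

```python
import numpy as np

def error_ratios(b_true, b_hat):
    """Return (false_positive_ratio, false_negative_ratio) for binary matrices.

    Assumed definitions (see the lead-in): each ratio is normalized by the
    number of entries of the corresponding kind in the true matrix."""
    b_true = np.asarray(b_true, dtype=bool)
    b_hat = np.asarray(b_hat, dtype=bool)
    n_absent = max(int(np.sum(~b_true)), 1)   # guard against division by zero
    n_present = max(int(np.sum(b_true)), 1)
    false_pos = np.sum(b_hat & ~b_true) / n_absent
    false_neg = np.sum(~b_hat & b_true) / n_present
    return float(false_pos), float(false_neg)

# Small illustrative matrices (not data from the paper).
b_true = [[1, 0],
          [1, 1]]
b_hat = [[1, 1],
         [0, 1]]
fp, fn = error_ratios(b_true, b_hat)
```

With these definitions both ratios are bounded between 0 and 1, and both vanish when the inferred matrix matches the true one exactly.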

^{2} = 10^{−4} and is found to have little effect on the resulting inference, provided that it is sufficiently small to keep the linearization in

In this paper, we considered the challenging problem of inferring the causal structure of complex systems from limited available data (enjoy _{i}

One especially important feature of our oCSE-based causality inference approach is that it is immune (in principle) to false positives. When data is sufficient, false positives are eliminated by sequential joint application of the aggregative discovery and progressive removal algorithms, as well as raising the threshold

We presented the oCSE principle for stochastic processes under the Markov assumptions stated in

Note that a finite _{t}_{t}_{t}_{t−k}_{+1}_{t−k}_{+2}_{t}

We thank Samuel Stanton from the ARO Complex Dynamics and Systems Program for his continuous support. This work was funded by ARO Grant No. 61386-EG.

Jie Sun and Erik M. Bollt designed and supervised the research. Jie Sun and Carlo Cafaro performed the analytical calculations and the numerical simulations. All authors contributed to the writing of the paper.

The authors declare no conflict of interest.

Cause-and-effect relationships during a soccer game.

The story: It is July 9, 2006. Nearly 750 million people are watching the FIFA World Cup final between Italy and France. The soccer game is held at the Olympiastadion in Berlin in the presence of around 69,000 live spectators. Mario is watching the game in a hotel, cheering for Italy. Pascal is sitting comfortably in front of his new TV, supporting France. The Causality Quiz: answer “Yes” or “No” to the following questions, and explain why.

What affects the state of mind of Mario?

Is Mario happy because Pascal is sad? No. Mario has no idea about who Pascal is.

Is Mario happy because the spectators are cheering? No. If anything, Mario is only jealous of those attending the game.

Is Mario happy because of the game? Yes. Check the scoreboard.

What affects the behavior of the spectators?

Are the spectators cheering because Mario is happy? No. Why would they?

Are the spectators cheering because Pascal is sad? No. Why would they?

Are the spectators cheering because of the game? Yes. They are restless soccer lovers, just like the players.

What affects the state of the game?

Is Mario helping his team to win? No. Although Mario probably thinks so after too much wine and cheese.

Is Pascal causing his team to lose? No. Pascal is only causing his TV to break after kicking a ball against it.

Do the spectators influence the game? Yes. This is even scientifically proven [

The actual coupling structure: The four components in the system are Mario (M), Pascal (P), the game (G) and the spectators (S). They form a directed network of four nodes, as shown on the right of the picture (with self-links ignored). The game affects the state of mind of Mario, Pascal and the spectators. On the other hand, the spectators influence the game.

Illustration of the repressilator coupling structure and dynamics. (_{0} = 0,

Direct inference of couplings in the repressilator system. We consider the repressilator dynamics modeled by _{0} = 0, ^{2} (variation of perturbation) and ∆_{ℓ}_{ℓ}_{t}_{+} and false negative _{−}_{+} and _{−}_{+} and _{−}^{−}^{2}, and each data point is an average over 100 independent runs.

The same setup as in ^{−}^{2}), shown as surface plots.