1. Introduction
Causality analysis is a fundamental task in scientific research. Though traditionally formulated as a statistical problem (see, for example, the classics by Granger [1], Pearl [2], and Rubin [3,4]) in data science and computer science, among other disciplines, formalisms within the framework of dynamical systems have also been established; refer to a focus issue of Chaos [5] for relevant references. Particularly, in terms of information flow/transfer, it has been argued that causality is "a real notion in physics that can be derived ab initio" [6]. A comprehensive study with generic systems has been conducted, and explicit formulas have been attained in closed form [6,7] with the aid of the Frobenius–Perron operator (e.g., [8]), a technique also exploited in similar studies (e.g., [9]). These formulas have been validated with benchmark systems such as the baker transformation, Hénon map, etc., and have been applied successfully to real-world problems in diverse disciplines, such as global climate change, dynamic meteorology, land–atmosphere interaction, data-driven prediction, near-wall turbulence, neuroscience, financial analysis, and quantum information; see [10] for a brief review of the applications and [11,12,13,14,15,16] for some updates.
For the purpose of this study, we first give a brief introduction to the theory within the framework of a differential dynamical system (also available for discrete-time mappings; see [6]). Let

$$dX = F(X, t)\,dt + B(X, t)\,dW \quad (1)$$

be a $d$-dimensional continuous-time stochastic system for $X = (x_1, x_2, \ldots, x_d)^T$ (we do not distinguish notations for random and deterministic variables), where $F = (F_1, F_2, \ldots, F_d)^T$ may be arbitrary nonlinear differentiable functions of $X$ and $t$, $\dot W$ is a vector of white noise, and $B = (b_{ij})$ is the matrix of perturbation amplitudes, which may also be any differentiable functions of $X$ and $t$. This setting, particularly the setting in the absence of stochasticity, avails us of the arsenal from physics in approaching the problem. For example, the concept of symmetry has been found to play an important role [17]. Using the Frobenius–Perron operator, Liang (2016) [6] proved that the rate of information flowing from $x_2$ to $x_1$ (in nats per unit time) is

$$T_{2\to 1} = -E\!\left[\frac{1}{\rho_1}\int_{\mathbb{R}^{d-2}}\frac{\partial (F_1\rho_{\not 2})}{\partial x_1}\,dx_{\not 1\not 2}\right] + \frac{1}{2}E\!\left[\frac{1}{\rho_1}\int_{\mathbb{R}^{d-2}}\frac{\partial^2 (g_{11}\rho_{\not 2})}{\partial x_1^2}\,dx_{\not 1\not 2}\right], \quad (2)$$

where $dx_{\not 1\not 2}$ signifies $dx_3\,dx_4\cdots dx_d$, $E$ stands for mathematical expectation, $\rho = \rho(x_1,\ldots,x_d)$ is the joint pdf of $X$ and $\rho_{\not 2} = \int_{\mathbb{R}}\rho\,dx_2$, $\rho_1$ is the marginal probability density function (pdf) of $x_1$, $\rho_{\not 2}/\rho_1$ is the pdf of $(x_3,\ldots,x_d)$ conditioned on $x_1$, and $g_{11} = \sum_k b_{1k}b_{1k}$. The algorithm for the information flow-based causal inference is as follows: if $T_{2\to 1} = 0$, then $x_2$ is not causal to $x_1$; otherwise, it is causal, and the absolute value measures the magnitude of the causality from $x_2$ to $x_1$. This is guaranteed by a property called the "principle of nil causality". Another property, as proven by Liang (2018) (see [10]), regards the invariance upon coordinate transformation, indicating that the obtained information flow (IF) is an intrinsic property in nature. Also established in [6] is that, for a linear model, i.e., for $F = AX + f$ with $A = (a_{ij})$ and $B$ being constant matrices in (1), the information flow rate from $x_2$ to $x_1$ is

$$T_{2\to 1} = \frac{\sigma_{12}}{\sigma_{11}}\,a_{12},$$

where $\sigma_{12}$ is the population covariance of $x_1$ and $x_2$ (and $\sigma_{11}$ the population variance of $x_1$). By this, it follows that, in the linear sense, causation implies correlation, but not vice versa. In an explicit expression, this corollary formalizes mathematically the debate on causation vs. correlation that has lasted ever since George Berkeley (1710) [18].
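For a concrete feel of the linear result, consider a hypothetical two-dimensional system $dX = AX\,dt + B\,dW$ (the matrix below is assumed for illustration only; it is not an example from this paper). Its stationary covariance solves the Lyapunov equation $A\Sigma + \Sigma A^T + BB^T = 0$, after which the IF rates follow in one line; note that $a_{21} = 0$ makes the flow from $x_1$ to $x_2$ vanish, illustrating the principle of nil causality:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Hypothetical linear system dX = A X dt + B dW; a12 != 0 means x2 forces x1,
# while a21 = 0 means x1 exerts no influence on x2.
A = np.array([[-1.0, 0.5],
              [ 0.0, -1.0]])
B = np.eye(2)

# Stationary covariance S solves the Lyapunov equation  A S + S A^T + B B^T = 0.
S = solve_continuous_lyapunov(A, -B @ B.T)

T21 = S[0, 1] / S[0, 0] * A[0, 1]   # IF rate x2 -> x1
T12 = S[0, 1] / S[1, 1] * A[1, 0]   # IF rate x1 -> x2 (the mirror formula)
print(T21, T12)   # 1/9 = 0.111... nats per unit time, and exactly 0
```

Here $S = \begin{pmatrix}0.5625 & 0.125\\ 0.125 & 0.5\end{pmatrix}$, so $T_{2\to1} = (0.125/0.5625)\times 0.5 = 1/9$, while $T_{1\to2} = 0$ despite the nonzero correlation: causation implies correlation, but not vice versa.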
In the case with only $d$ time series $\{x_i(n)\}$, $i = 1,\ldots,d$, $n = 1,\ldots,N$, the quantitative causality, i.e., the IF, between them can be estimated using maximum likelihood estimation (see [19,20]). Under the assumption of a linear system with additive noise, the maximum likelihood estimator (mle) of (2) for $T_{2\to 1}$ is [20]

$$\hat T_{2\to 1} = \frac{C_{12}}{C_{11}}\cdot\frac{1}{\det C}\sum_{j=1}^{d}\Delta_{2j}C_{j,d1}, \quad (3)$$

where $C_{ij}$ is the sample covariance between $x_i$ and $x_j$, $\Delta_{2j}$ are the cofactors of the matrix $C = (C_{ij})$, and $C_{j,d1}$ is the sample covariance between $x_j$ and a series derived from $x_1$ using the Euler forward differencing scheme: $\dot x_{1,n} = (x_{1,n+k} - x_{1,n})/(k\,\Delta t)$, with $k$ being some integer. Equation (3) is rather concise in form, involving only the common statistics, i.e., sample covariances. The transparent formula makes causality analysis, which otherwise would be complicated, very easy and computationally efficient. Note, however, that Equation (3) cannot replace Equation (2); it is merely the maximum likelihood estimator (mle) of the latter. Statistical significance tests can be performed for the estimators; this is achieved with the aid of a Fisher information matrix. See [19,20] for details.
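The estimator can be coded in a few lines. The following Python sketch is our own rendition (function and variable names are illustrative, not from the paper): it assembles the required sample covariances, applies the Euler forward differencing, and checks the result against an assumed linear system whose true rates are $T_{2\to1} = 1/9$ and $T_{1\to2} = 0$:

```python
import numpy as np

def liang_mle(X, dt, k=1):
    """mle of the IF rate from column 1 to column 0 of X, following Eq. (3).

    X : (N, d) array of time series (column 0 = receiver x1, column 1 = x2);
    dt: sampling interval; k: stride of the Euler forward differencing.
    """
    N = X.shape[0]
    dx1 = (X[k:, 0] - X[:-k, 0]) / (k * dt)      # Euler forward differencing
    Xa = X[:N - k] - X[:N - k].mean(axis=0)      # anomalies
    dx1a = dx1 - dx1.mean()
    C = Xa.T @ Xa / (N - k - 1)                  # sample covariance matrix C_ij
    Cd1 = Xa.T @ dx1a / (N - k - 1)              # C_{j,d1}
    # (1/det C) * sum_j Delta_{2j} C_{j,d1} is the second entry of C^{-1} Cd1
    a12_hat = np.linalg.solve(C, Cd1)[1]
    return C[0, 1] / C[0, 0] * a12_hat

# Demo on an assumed system dX = A X dt + dW (true T_{2->1} = 1/9, T_{1->2} = 0)
rng = np.random.default_rng(0)
A = np.array([[-1.0, 0.5],
              [ 0.0, -1.0]])
dt, N = 0.01, 100_000
xi = np.sqrt(dt) * rng.standard_normal((N, 2))
X = np.empty((N, 2))
x = np.zeros(2)
for n in range(N):
    X[n] = x
    x = x + (A @ x) * dt + xi[n]                 # Euler-Maruyama step

T21_hat = liang_mle(X, dt)           # x2 -> x1: should be near 1/9
T12_hat = liang_mle(X[:, ::-1], dt)  # x1 -> x2: should be near 0
print(T21_hat, T12_hat)
```

The cofactor sum in (3) is evaluated here through a single linear solve, since $\sum_j \Delta_{2j} C_{j,d1}/\det C$ is just the second component of $C^{-1}C_{\cdot,d1}$.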
Originally, the formalism was established within the framework of a differential system; in other words, it assumes infinitesimal time increments. (The formalism for discrete mappings was also established by Liang (2016) [6], but no estimator has been developed for it so far.) A question naturally arises about its applicability to coarsely sampled time series. Indeed, it is not unusual for a given series to be coarsely sampled due to limited observations. As will be seen in the following section, this may pose a problem for nonlinear systems if the sampling interval is large. This paper hence attempts to address the issue within the original linear framework. In the following, we first check the applicability of (3) for series from a linear system and from a highly nonlinear system (Section 2), with a variety of sampling intervals. A new approach is presented in Section 3, which is then utilized in Section 4 to reconduct the causal inferences of Section 2. Some remaining issues are discussed in Section 5.
3. Approaching a Partial Solution
As shown above, if the sampling frequency of the time series is low, the resulting linear IF estimate for nonlinear series may be biased. Indeed, in the case of high nonlinearity, the linear assumption is always easy to blame. While theoretically this is not a problem (causality is guaranteed, as proven in a theorem), we agree that, until a fully nonlinear algorithm is developed, this will be a continuing issue. What we want to show here is how much room there is for improvement. At present, the algorithm documented in [19], and later in [20], is based on the Euler–Bernstein differencing scheme, which is, of course, rather rudimentary due to the first-order differencing. If a time series is coarsely sampled, the error could be large.
A theorem established by Liang (2008) [7] states that, if the noise in Equation (1) is additive, i.e., if $B$ is a constant matrix, then the noise itself does not appear in the formula of $T_{2\to 1}$. Thus, under the additive noise assumption, we can estimate the IF within the framework of a deterministic system. In this case, note that the linear equation set can be solved on an interval $[t, t+\Delta t]$, regardless of the size of $\Delta t$. This gives insight regarding a solution to the low sampling frequency problem.
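The insight can be checked with a toy computation (ours, not from the paper): for a linear system, the propagator over an interval $\Delta t$ is exactly $e^{A\Delta t}$ for any $\Delta t$, whereas the first-order (Euler) approximation $I + A\Delta t$ deteriorates as $\Delta t$ grows:

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[-1.0, 0.5],
              [ 0.0, -1.0]])      # an assumed stable system matrix

errs = []
for dt in (0.01, 0.5, 2.0):
    exact = expm(A * dt)          # exact propagator of dX/dt = AX over dt
    euler = np.eye(2) + A * dt    # first-order differencing approximation
    errs.append(np.abs(exact - euler).max())
    print(dt, errs[-1])
```

The discrepancy is negligible at $\Delta t = 0.01$ but order one at $\Delta t = 2$, which is precisely why estimating the propagator, rather than differencing, is attractive for coarse sampling.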
Consider

$$\frac{dX}{dt} = AX,$$

where $A = (a_{ij})$ is a $d\times d$ matrix. Let us assume that the constant forcing vanishes ($f = 0$), since the time series can always be pretreated by removing the linear trend, and it has been proven that this removal does not alter the IF rates. In this case, on the interval $[t, t+\Delta t]$, we actually have a mapping that takes the state $X(t)$ to the state $X(t+\Delta t)$ at $t+\Delta t$, with the propagating operator

$$\Phi = e^{A\Delta t}.$$

It is not easy to estimate $A$, but it is easy to estimate $\Phi$ instead, by observing the relation

$$X(t+\Delta t) = \Phi\,X(t).$$

This is written in matrix form as

$$\mathbf{x}_{n+1} = \Phi\,\mathbf{x}_n$$

for $n = 1, \ldots, N-1$. Averaging all rows of the algebraic equation set, and subtracting the mean from each row, we get

$$\tilde{\mathbf{x}}'_n = \Phi\,\tilde{\mathbf{x}}_n,$$

where $\tilde{\mathbf{x}}_n = \mathbf{x}_n - \bar{\mathbf{x}}$ and $\tilde{\mathbf{x}}'_n = \tilde{\mathbf{x}}_{n+1}$, i.e., the series $\tilde{\mathbf{x}}'$ is the series $\tilde{\mathbf{x}}$ advanced by one step. Let $i$ run through $1, \ldots, d$. We have the following $d$ overdetermined equation sets:

$$\tilde x'_{i,n} = \sum_{j=1}^{d}\Phi_{ij}\,\tilde x_{j,n}, \qquad n = 1,\ldots,N-1. \quad (10)$$
Denote by $\mathsf{X}$ the $(N-1)\times d$ matrix $(\tilde x_{j,n})$; then, the matrix of unknowns in the above equation sets is $\Phi^T$. Left multiplication by $\frac{1}{N-1}\mathsf{X}^T$ on both sides yields $d$ equation sets:

$$C\,\Phi^T = C_{\tilde x\tilde x'}, \quad (11)$$

where $C$ is the sample covariance matrix of $(x_1,\ldots,x_d)$, $C = (C_{ij})$, and $C_{\tilde x\tilde x'}$ is the matrix whose $(i,j)$ entry is the sample covariance between $x_i$ and $x'_j$, i.e., $x_j$ advanced by one time step. The least-square solutions of the overdetermined sets (10) are the solutions of (11),

$$\hat\Phi^T = C^{-1}C_{\tilde x\tilde x'}, \quad (12)$$

and hence $\hat\Phi = C_{\tilde x\tilde x'}^{T}\,C^{-1}$.
The estimator of $A$ is, therefore,

$$\hat A = \frac{1}{\Delta t}\log\hat\Phi.$$

(Caution should be used in cases of singularity. The irrelevant imaginary part should also be discarded.)
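A minimal Python sketch of this propagator-based estimation (ours; the matrix, sampling interval, and noise level are assumed for illustration): generate a coarsely sampled linear series with additive noise, estimate $\Phi$ by least squares, and recover $A$ through the matrix logarithm (`scipy.linalg.logm` plays the role of MATLAB's `logm`):

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(2)

# Assumed ground truth (illustrative only)
A_true = np.array([[-1.0, 0.5],
                   [ 0.0, -1.0]])
dt = 0.5                              # deliberately coarse sampling interval
Phi_true = expm(A_true * dt)          # propagator over dt

# Coarsely sampled linear series with additive noise: x_{n+1} = Phi x_n + noise
N = 100_000
noise = 0.1 * rng.standard_normal((N, 2))
X = np.empty((N, 2))
x = np.zeros(2)
for n in range(N):
    X[n] = x
    x = Phi_true @ x + noise[n]

Xa = X - X.mean(axis=0)
C  = Xa[:-1].T @ Xa[:-1] / (N - 2)    # sample covariance of the series
C1 = Xa[1:].T @ Xa[:-1] / (N - 2)     # covariance with the one-step-advanced series
Phi_hat = C1 @ np.linalg.inv(C)       # least-squares estimate of the propagator
A_hat = np.real(logm(Phi_hat)) / dt   # matrix log; spurious imaginary part discarded

print(np.abs(A_hat - A_true).max())
```

Even at this coarse sampling, where first-order differencing would be badly biased, the recovered $\hat A$ is close to the true matrix.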
Once we have obtained $\hat A$, and hence the coefficients $\hat a_{ij}$, we substitute $\hat a_{12}$ for the whole part $\frac{1}{\det C}\sum_{j=1}^{d}\Delta_{2j}C_{j,d1}$ in Equation (3); i.e., we multiply $\hat a_{12}$ by $C_{12}/C_{11}$ to arrive at the desideratum, $\hat T_{2\to 1}$. If we denote by $[\,\cdot\,]_{12}$ the extraction of the $(1,2)$ entry of the matrix, this is

$$\hat T_{2\to 1} = \frac{C_{12}}{C_{11}}\left[\frac{1}{\Delta t}\log\!\left(C_{\tilde x\tilde x'}^{T}\,C^{-1}\right)\right]_{12}. \quad (13)$$

(Note here that log is the matrix logarithm. In MATLAB, the function is logm.)
4. The Coarsely Sampled Series Problem Revisited
As demonstrated above, for series generated from linear systems, the estimation of the IF is qualitatively satisfactory. Nevertheless, we want to examine how the new scheme may improve the results. Shown in Table 2 is a recalculation of the estimates. Since this case has a (half-analytical) ground truth of ≈0.11 nats per unit time, we can see that the result is accurate enough for all SIs.
The new scheme for the estimation is particularly satisfactory in the nonlinear case. For the pair of Rössler oscillators, the computed results are plotted in Figure 4. Compared to Figure 3, it can be seen that the performance is significantly improved. Consider first the cases with SI ≤ 100. To see the improvement more clearly, we introduce a ratio $r$ to measure the performance of the one-way causal inference; the smaller the value, the more accurate the result, with sufficiently small $r$ taken as insignificant. By this standard, the cases with SI ≤ 10 (Figure 3a,b) appear satisfactory, but the causal relations shown in Figure 3c,d are inaccurate or even incorrect. Specifically, at the coupling strengths where the system is synchronized, $r$ becomes large for the cases of SI = 50 and SI = 100. For the remaining coupling strengths, the inference appears satisfactory, albeit inaccurate; in one case, however, the inference is incorrect. In contrast, the inferred causalities in Figure 4a–d are rather accurate for all the coupling strengths considered (both synchronized and nonsynchronized), with all the $r$'s being insignificant.
For the case SI = 300, which corresponds to a sampling frequency of 20 points per period, the one-way causality is accurately recovered for the nonsynchronized cases. Beyond this, i.e., for the synchronized cases, the inference fails. In particular, when SI = 500 (Figure 4f), the result is even worse than its counterpart obtained with the traditional scheme as plotted in Figure 3f. This may be due to the resulting small ensemble size, which causes singularity in the matrix logarithm. Consider the extreme limit when the series are completely synchronized: it is then impossible to determine whether a causal relation exists, as the series are identical and cannot be differentiated. Correspondingly, this means a singular covariance matrix. Now, in the above problem, the series are nearly synchronized. Given the fixed series length, the size of the ensemble thus formed reduces as the sampling interval increases, which tends to increase the condition number of the resulting covariance matrix. Whenever a matrix is ill-conditioned, its logarithm is very sensitive to perturbations, leading to large errors in the subsequent computation.
Nonetheless, the success in applying the linear estimator to such a highly nonlinear system is remarkable. The reason is probably the same as that behind linearization; that is, on a small interval, a linearized system can provide a good approximation to an otherwise nonlinear system. While its applicability remains to be proven, or at least investigated with more nonlinear problems, in our experience it indeed works well, provided that the statistics are sufficient (stochastic systems or chaotic deterministic systems) and the sampling interval is not too large.
5. Discussion
The maximum likelihood estimator of the information flow (IF), i.e., Equation (3), provides an easy route to causal inference. Theoretically, it is based on a linear assumption, but, practically, it has shown considerable success with series generated from highly nonlinear systems; after all, piecewise-in-time linearization proves to be an efficient approximation to an otherwise nonlinear system. In reality, series may be coarsely sampled; the time resolution may be low. An issue thus arises, as the formalism is theoretically based on infinitesimal time increments. In this case, as we have shown, the estimator still works for linear systems in a qualitative sense; however, for a highly nonlinear system composed of two Rössler oscillators, the bias becomes increasingly significant as the sampling frequency is reduced.
A new scheme has been proposed to address this problem and provide a partial solution. Due to a property of the IF, as proven in [7], that additive noise does not alter the form of the IF, it is reasonable to estimate the IF directly, without paying attention to the stochasticity. Instead of estimating the vector field through Euler–Bernstein differencing, we choose to consider the propagator on the finite time interval, i.e., to estimate the Lie group members. In doing so, the original Formula (3), which is rewritten here for ease of reference,

$$\hat T_{2\to 1} = \frac{C_{12}}{C_{11}}\cdot\frac{1}{\det C}\sum_{j=1}^{d}\Delta_{2j}C_{j,d1},$$

is replaced with (13),

$$\hat T_{2\to 1} = \frac{C_{12}}{C_{11}}\left[\frac{1}{\Delta t}\log\!\left(C_{\tilde x\tilde x'}^{T}\,C^{-1}\right)\right]_{12},$$

where $C$ is the sample covariance matrix of the series, and $C_{\tilde x\tilde x'}$ is the matrix of sample covariances between the series and the series advanced by one time step. Note that here log is the matrix logarithm; in MATLAB (Version 23.2 or earlier), the function is logm. (The spurious imaginary part, if it arises, should be discarded.) In this way, the preset causality within the coupled system of chaotic oscillators can be rather accurately reproduced even when the sampling interval is large (i.e., the sampling frequency is low). For convenience, this is summarized as an algorithm (Algorithm 1).
| Algorithm 1: Causal inference through information flow estimation |
| input: d time series and time step size Δt. output: a causal graph and the IFs along its edges. (1) Evaluate the covariance matrices C and C_{x̃x̃'}; (2) compute Â = (1/Δt) logm(C_{x̃x̃'}ᵀ C⁻¹); (3) for each pair (i, j), i ≠ j, compute T̂_{j→i} by (13), i.e., T̂_{j→i} = (C_{ij}/C_{ii}) â_{ij}; (4) return the causal graph, together with the IFs. |
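A compact Python sketch of Algorithm 1 (ours; names are illustrative, and an ad hoc magnitude cutoff stands in for the Fisher-information significance test of [19,20]):

```python
import numpy as np
from scipy.linalg import expm, logm

def causal_graph(X, dt, eps=1e-3):
    """Sketch of Algorithm 1: pairwise IF rates via the propagator, Eq. (13).

    X : (N, d) data matrix of d time series; dt : time step size.
    Returns T with T[i, j] = estimated IF rate x_j -> x_i, and the edge set
    {(j, i) : |T[i, j]| > eps}; eps is an assumed cutoff used here in place
    of a proper significance test.
    """
    N, d = X.shape
    Xa = X - X.mean(axis=0)
    C = Xa[:-1].T @ Xa[:-1] / (N - 2)      # sample covariance matrix
    C1 = Xa[1:].T @ Xa[:-1] / (N - 2)      # covariance with the advanced series
    A_hat = np.real(logm(C1 @ np.linalg.inv(C))) / dt
    T = (C / np.diag(C)[:, None]) * A_hat  # T[i, j] = (C_ij / C_ii) * a_ij_hat
    np.fill_diagonal(T, 0.0)
    edges = [(j, i) for i in range(d) for j in range(d)
             if i != j and abs(T[i, j]) > eps]
    return T, edges

# Demo: coarsely sampled series from an assumed one-way coupled linear system
rng = np.random.default_rng(7)
A_true = np.array([[-1.0, 0.5],
                   [ 0.0, -1.0]])
dt, N = 0.5, 100_000
Phi = expm(A_true * dt)
noise = 0.1 * rng.standard_normal((N, 2))
X = np.empty((N, 2))
x = np.zeros(2)
for n in range(N):
    X[n] = x
    x = Phi @ x + noise[n]

T, edges = causal_graph(X, dt, eps=0.02)
print(edges)   # the only edge recovered should be x2 -> x1, i.e., (1, 0)
```

As stressed in the text, a condition-number check on C before the matrix logarithm would be a sensible addition for nearly synchronized series.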
There is still much room for improvement in the above approach. For example, the estimation of the covariances in the quotient $C_{12}/C_{11}$ is achieved by replacing the population covariances with sample covariances, the sample being formed from the time series. While this is satisfactory for stochastic systems under the ergodic assumption, it may not be suitable for deterministic chaos, such as the Rössler oscillators studied here. The reason is obvious: the time mean of the series in Figure 2 is zero, but one can imagine that the ensemble mean over all possible paths is by no means zero; rather, it should be a function of time (like the series itself), which may be close to the asymptotic linear system solution. Thus, it is more reasonable to treat the linear system solution as the mean. As such, we have attempted to improve the estimation by replacing the covariances of the series with those of the deviations from the resulting linear system solution. With this, we obtain another causal inference result for SI = 300; the resulting IFs are plotted in Figure 5. As one can see, the result appears rather accurate, as expected, in contrast to Figure 4e.
We, however, do not claim that we have solved the problem. What we want to show here is how much room there is for improvement within a linear framework in estimating Equation (2), the original fully nonlinear formula for $T_{2\to 1}$. Indeed, in the case of high nonlinearity, the linear assumption is always easy to blame. While theoretically it is not a problem (causality is guaranteed, as proven in a theorem; see [6] and other references), it is believed that, until a fully nonlinear algorithm is developed (when this paper was prepared, ref. [22] had not yet been published), this will be a continuing issue.
It should be pointed out that the above algorithm works only when the sampling interval does not exceed the scale in question. For processes with multiple scales involved, different samplings may result in different, yet physically meaningful, information flow rates. In this case, it is no longer just a computational issue; it is a physical problem that deserves an in-depth investigation. We leave this problem, among others, for future research.