Relaxation of Some Confusions about Confounders

This work is about observational causal discovery for deterministic and stochastic dynamic systems. We explore what additional knowledge can be gained by using standard conditional independence tests and by assuming that the interacting systems are located in a geodesic space.


Introduction
It is not necessary to emphasize the importance of the concept of causality in science and in the natural sciences in particular. The concept traverses all disciplines, and it is a matter of extensive research fueled by the exponentially increasing available scientific data and computation power. Revealing causal relations between systems via the time series produced by them is one of the most attractive challenges. The first major advancement was due to Granger who used an auto-regressive framework for a practical implementation of the predictive causality principle by Wiener [1].
The very popular Granger method [2] has some theoretical and practical limitations. It cannot detect a hidden common cause and, instead, indicates a false directional causal relation between the observed systems (for details of the pros and cons, cf. [3]). Several further methods have appeared in the last two decades (for a concise review, see Runge [4] or [5,6]). One of the most prominent is the convergent cross mapping method developed by Sugihara [7] to investigate deterministic dynamic systems, which essentially relies on Takens' embedding theorem [8]. Stark [9,10] generalized Takens' result and showed the theoretical limitations of using it for stochastic dynamic systems. For deterministic dynamics, a recent work [11] presented a new approach based on comparing the dimensions of the attractors of the given systems with that of their joint observation.
The present paper investigates the causal relation between a pair of dynamic systems (which may be deterministic or stochastic). We reveal facts that, to the best of our knowledge, have escaped the attention of previous studies. We show that, if there is dependence between the systems at the smallest, but positive, time difference, then the common driver is an i.i.d. sequence, that is, shared observational noise. We also show that, if the pair is located in a non-abstract physical space where the speed of information transfer is known, then the direct causation and common cause cases can be distinguished, which, in general, is theoretically impossible.

Basic Definitions
First, we provide the framework of our investigation. Our aim is to find the causal relationship between two stochastic dynamic systems X and Y from which we observe time series.

[Figure: An example of a possible causation scheme for the system M. X and Y are the observed series, and L, between them, is a common cause. Above X and below Y, small circles represent the i.i.d. inputs $\xi$, $\eta$, and the large circles $V_X$, $V_Y$ (also belonging to the set of unobserved series) represent the non-i.i.d. influences that are not shared by X and Y. In this example, X drives Y, and they have L as a common cause.]

In what follows, we write $\xi$ for $\omega_1$, emphasizing that it influences $x$, and similarly $\eta$ for $\omega_2$, which influences $y$.
For brevity, we use the multi-index of the involved dimensions: $d = (d_X, d_Y, d_L)$.

Assumption 2.
The external noise $\omega_i \in \mathbb{R}^D$ is modeled with an unobserved i.i.d. sequence and affects all the systems with independent $\xi$, $\eta$, $\omega^l_i$ components. Furthermore, $\omega_{i+1}$ is independent …

Assumption 3.
The process $m_i$ is stationary.

In what follows, we use the expression "drives" for all the terms "causes", "influences" and "injects information" in relation to dynamic systems.
Following [12], we use the following model. The visible and invisible systems can be described by a $p$-order Structural Vector Auto-regressive (SVAR) model:

$$m_{n+1} = f(m_n, \ldots, m_{n-p+1}, \omega_{n+1}), \qquad n \in \mathbb{N},$$

where $m_0$ follows the stationary distribution of the system. (It is a SVAR($d$, $p$) process, where $d$ is the multi-index of the dimensions of the variables, and $p$ is the order of auto-regression.) The recursion can clearly be transformed by time delay embedding into a higher-dimensional first-order SVAR, in particular into SVAR($2D \times p$, 1), in short SVAR(1), with $M_n = (m_n, m_{n-1}, \ldots, m_{n-p+1})$: (1)

We make the same restriction as in [12] that (2) must be recursive in the variables, which ensures that there is no directed functional cycle. Variables with capital letters denote the same "embedding" as in (1).
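The delay-embedding step can be illustrated with a minimal Python sketch. The update rule `f`, its coefficients and the scalar state are hypothetical choices made only for illustration (they are not taken from [12] or from the model above). The sketch merely checks that an order-2 recursion and its first-order embedded form, with embedded state $(m_n, m_{n-1})$, generate the same trajectory.

```python
import numpy as np

def f(m_now, m_prev, omega):
    """Hypothetical order-2 update rule (illustrative assumption only)."""
    return 0.5 * m_now - 0.3 * np.tanh(m_prev) + omega

def embedded_step(M, omega):
    """First-order update of the embedded state M_n = (m_n, m_{n-1})."""
    m_now, m_prev = M
    return (f(m_now, m_prev, omega), m_now)

rng = np.random.default_rng(0)
noise = rng.standard_normal(200)  # i.i.d. input sequence

# Direct order-2 recursion: m[n+1] = f(m[n], m[n-1], noise[n+1]).
m = [0.0, 0.0]
for n in range(1, 199):
    m.append(f(m[n], m[n - 1], noise[n + 1]))

# The same trajectory generated by the first-order embedded system.
M = (m[1], m[0])
embedded = [m[0], m[1]]
for n in range(1, 199):
    M = embedded_step(M, noise[n + 1])
    embedded.append(M[0])

same = np.allclose(m, embedded)
print(same)
```

The embedded state simply carries the last $p$ values along, so the first-order map needs no information beyond what the order-$p$ recursion already uses.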
Exactness means that, if the process started from a set with positive probability, then, after a long time, the set in which it can be found has probability one. It is natural to assume exactness given that we work with an observation, and the support of the observed process for us will be the whole set where the process can run and, consequently, has probability one. On the other hand, exactness implies mixing for stationary processes, and, at the same time, a mixing stationary process is α-mixing (or strong mixing). Let us note here that, from strong mixing, ergodicity also follows, but we do not need that fact (see also [13]).
In our discovery scheme, we may allow instantaneous causation between all variables; however, we do not elaborate on that case here. For brevity, that is not reflected in (2). We note that a system like (2) with contemporaneous interaction but without a directed cycle can be rewritten into the form of (2) using time shifts thanks to the acyclic recursivity.
where $L_n$ stands for the set of latent variables, $\eta_{n+1}$ is an i.i.d. sequence that is independent of $(X_i, Y_i, L_i)_{i=0}^{n}$, and $X_n$ cannot be omitted without violating the validity of (3).
Let us explain this key definition. We may say that there is no function $g$ that makes explicit the fact that $Y_{n+1}$ can be created without $X_n$. Here, one should also observe that the i.i.d. part is the same as in (3), and there is no possibility for an i.i.d. $X_n$ to be hidden in $\eta_{n+1}$.

Causal Discovery Schemes
The literature on causal discovery is huge. This work has been inspired by two recent papers, each with its strengths and limitations. We found the framework defined by Malinsky in [12] very appealing, while the complex nature of the assumptions and of the suggested algorithm in [14] presented an essential challenge. The algorithm in [12] is an extension of [15,16]. It provides a theoretically complete recall of the underlying causal structure, at the price that some relations are marked as undetermined and some causal relations are not revealed or are only partially revealed.
In [14], in addition to many other assumptions, it is assumed that all hidden processes that influence an observed one have no memory (Assumption A9 in [14]). That assumption and A6 in [14] cannot be checked. In [12], such restrictions are eliminated. That paper and most of the works based on Pearl's DAG analysis have theoretical limitations as admitted in [12]. In what follows, we investigate some situations in which that limitation can be relaxed.
Information from X to Y can be transferred along a chain of direct causal links, that is, along a directed path $\pi_{X,Y}$. The length of the path (the number of intermediate components plus one) is denoted by $l = l_{X,Y} = l(\pi_{X,Y})$. Such a path has a starting time $n$ and an ending time $n + l$ (for arbitrary $n \ge 0$, $l > 0$); their difference is the time lag.

Assumption 6.
We assume that, with some background information, the minimal lag between the systems X and Y can be determined.

We consider the smallest lag $\tau$ for which dependence can be detected in the "direction" from X to Y.

The Decomposable Case
We introduce our notation. In order to save space, let $(A, B) = (X, Y)$ or $(Y, X)$. Let $I$ stand for the mutual information based on Shannon entropy or differential entropy. We define the conditional mutual information between elements of the time series $a_n$, $b_n$, and similarly for other series. A segment of a time series $a_n$ from $k$ to $l$ is denoted by $A_k^l$. Such segments are used in the condition, representing a part of the past or the full past. In order to investigate whether there is information transfer from B to A with a given time lag $\tau_{b,a}$, we use the conditional mutual information between $a_{n+\tau_{b,a}}$ and $b_n$ given the full past of both series, $A_0^{n+\tau_{b,a}-1}$ and $B_0^{n-1}$, and we denote it by $I_B$:

$$I_B = I\left(a_{n+\tau_{b,a}};\, b_n \,\middle|\, A_0^{n+\tau_{b,a}-1},\, B_0^{n-1}\right).$$

We also define the conditional mutual information $I^{(k)}_{A,B}$, where we set $A_p^q = (a_q, \ldots, a_p)$, and similarly for B and other variables.
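As an illustration of how such a quantity might be estimated, the following Python sketch computes a plug-in conditional mutual information under two assumptions that are not part of the text above: the variables are treated as jointly Gaussian, and the full past is replaced by the last `depth` values of each series. The function names (`gaussian_cmi`, `transfer_cmi`, `smallest_lag`), the toy system and the threshold are hypothetical.

```python
import numpy as np

def gaussian_cmi(a, b, z):
    """Plug-in I(a; b | z) in nats, assuming joint Gaussianity (an assumption)."""
    def logdet(*parts):
        cov = np.atleast_2d(np.cov(np.column_stack(parts), rowvar=False))
        return np.linalg.slogdet(cov)[1]
    return 0.5 * (logdet(a, z) + logdet(b, z) - logdet(z) - logdet(a, b, z))

def transfer_cmi(src, dst, lag, depth):
    """I(dst[n+lag]; src[n] | last `depth` values of dst before n+lag
    and last `depth` values of src before n)."""
    n = np.arange(depth, len(src) - lag)
    past_dst = [dst[n + lag - j] for j in range(1, depth + 1)]
    past_src = [src[n - j] for j in range(1, depth + 1)]
    return gaussian_cmi(dst[n + lag], src[n], np.column_stack(past_dst + past_src))

def smallest_lag(src, dst, max_lag, depth, threshold):
    """First lag whose estimated CMI exceeds a hypothetical threshold, else None."""
    for lag in range(1, max_lag + 1):
        if transfer_cmi(src, dst, lag, depth) > threshold:
            return lag
    return None

# Toy system (hypothetical): X drives Y with lag 2, no common driver.
rng = np.random.default_rng(0)
n_steps = 5000
x, y = np.zeros(n_steps), np.zeros(n_steps)
for n in range(2, n_steps):
    x[n] = 0.5 * x[n - 1] + rng.standard_normal()
    y[n] = 0.5 * y[n - 1] + 0.8 * x[n - 2] + rng.standard_normal()

lag_x_to_y = smallest_lag(x, y, max_lag=4, depth=3, threshold=0.05)
lag_y_to_x = smallest_lag(y, x, max_lag=4, depth=3, threshold=0.05)
print(lag_x_to_y, lag_y_to_x)
```

In this toy setting the scan should find dependence from X to Y at lag 2 and none in the reverse direction. For non-Gaussian or nonlinear systems, a proper conditional independence test would have to replace the fixed threshold and the Gaussian formula.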

Relation 1. Logical relations between conditional mutual information values and causal relations
where CD stands for Common Driver and $c, c', c'' > 0$. In the right part of the table, $=0$ means that $I^{(k)}_{A,B} = 0$ holds for all $0 \le k < \tau_{A,B}$, while $>0$ means that there is at least one such $k$ for which $I^{(k)}_{A,B} > 0$.

The proposition summarizes the possible inferences in a concise way. In Relation 1, the headers contain the list of possible combinations and the possible causal scenarios. We have the direct product of two lists of cases collected in the two tables. The header of each table contains, on the left, the quantities that are decisive and, on the right, the possible causal scenarios. As an example, in the left table, the first row shows that, if and only if $I_B = 0$ but $I_A = c > 0$ (significantly different from zero), then B does not drive A but A drives B. In the right table, if $I^{(k)}_{A,B} = 0$ holds for all $0 \le k < \tau_{a,b}$, then there is no common information between members of the series for $k < \tau_{a,b}$, while, in the opposite case, there should be a common driver, given that there is shared information that cannot be attributed to driving with a lag below $\tau_{a,b}$. If $\delta = \tau_{x,y}$ (or $= \tau_{y,x}$), then causation between X and Y and a common driver may coexist, and we cannot separate those models. In the next section, we provide some observations on that situation.
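The inference logic described above can be sketched as a small decision function. This is only a hedged reading of the prose: the function name `classify_pair`, the tolerance `tol` and the output labels are hypothetical, and the row for mutual driving (both values positive) is an assumption obtained by symmetry, not something stated in the text. In practice, a statistical test would replace the fixed tolerance.

```python
def classify_pair(i_b, i_a, i_k_values, tol=1e-3):
    """Map conditional mutual information values to a causal reading.

    i_b: information transfer from B to A at its minimal lag.
    i_a: information transfer from A to B at its minimal lag.
    i_k_values: values of I^(k)_{A,B} for 0 <= k < tau_{A,B}.
    tol: hypothetical threshold standing in for a significance test.
    """
    b_drives_a = i_b > tol
    a_drives_b = i_a > tol
    # Any shared information below the minimal lag points to a common driver.
    common_driver = any(value > tol for value in i_k_values)

    if a_drives_b and b_drives_a:
        direction = "A and B drive each other"  # assumed by symmetry
    elif a_drives_b:
        direction = "A drives B"
    elif b_drives_a:
        direction = "B drives A"
    else:
        direction = "no direct driving"
    return direction, common_driver

print(classify_pair(0.0, 0.2, [0.0, 0.0]))
print(classify_pair(0.0, 0.2, [0.0, 0.1]))
```

The two outputs of the function correspond to the two lists of cases: the direction of driving is read from $I_A$ and $I_B$, and the presence of a common driver from the values $I^{(k)}_{A,B}$.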

The Confounder Case
We assume that $\delta_{L,X,Y} = \tau_{X,Y}$ but $\tau > 0$. If $\tau > 0$, we can investigate the common information between $X_{n+1}$ and $Y_{n+\tau}$. Unfortunately, the variables $X_n$, $Y_{n+\tau}$ have a confounder; therefore, we cannot tell which causal relation is behind the dependence. However, some internal structure can be revealed. In line with the assumptions $\delta_{L,X,Y} = \tau_{X,Y}$ but $\tau > 0$, we assume that …

Let $b_1$ be the information that is passed from $X_n$ to $Y_{n+\tau}$, and $b_i$ for $i = 1, 2$ the information passed from an L to both (if one or the other information transfer takes place). We also let $a_1$ be the information passed from $X_n$ to $X_{n+1}$, as Figure 2 shows. From (7), we have that $b_1$ is independent of $a_1$ and $b_1$ is independent of $b_2$. Thus, the information $b_n$ injected into $X_n$ and $Y_n$ from L is an i.i.d. sequence. A similar argument shows that the information $c_1$ passed from $Y_{n+\tau}$ to $Y_{n+\tau+1}$ is independent of $b_2$. We still cannot decide whether X drives Y or L drives both; however, in the latter situation, we may say that L emits observational noise for X, and it does not influence its evolution (the value of $a_i$). Alternatively, we may consider $b_i$ as the "part" of X that is injected into Y. Let us note that L itself is not necessarily an i.i.d. sequence but, from the point of view of its impact on X and Y, this makes no difference.
One may appeal to Occam's razor (if other background knowledge does not dictate otherwise) and assume that L itself is an i.i.d. process. Whether $b$ is part of X or external noise cannot be decided without further knowledge; we may again refer to Occam's razor and assume that there is no third system acting as a common driver, but that X injects an i.i.d. sequence into Y.

Geodesic Spaces
Now, we investigate the case when the subsystems of M are located in a geodesic metric space with unique geodesics between any pair of points. We assume that the information transfer speed is uniform, constant in the space regardless of the location of the source and target. Under that assumption, we can speak interchangeably about distance in space and time.

Strict Reversed Triangular Inequality
If $\delta = \min_L \delta_{L,X,Y}$ and

$$\delta > \tau \qquad (8)$$

(the reversed, strict triangular inequality), and there is information shared between $X_n$ and $Y_{n+\tau}$, then no L can be a common driver of $X_n$ and $Y_{n+\tau}$ (cf. Figure 3), so direct driving must take place from X to Y.

Figure 3. The causation has a smaller time lag $\tau$ than the difference $\delta$ from the common driver.

Strict Triangular Inequality
On the other hand, if the strict triangular inequality holds for an L, and $X_n$ and $Y_{n+\tau}$ have positive conditional mutual information conditioned on the past, then only the common driver L can produce it, not causation (see Figure 4).

The Equality, the Confounder Case
In the case of equality, we have a confounder. If the metric space has a unique geodesic from L to Y, then X should be on that geodesic between L and Y, and this means that the information from L either enters X along the path to Y or avoids it in a tricky way, by an infinitesimal detour, as Figure 5 depicts. In the former case, we have no confounder but the causal chain L → X → Y. This is a situation that, again, cannot be resolved without additional information about the actual systems under scrutiny. Economists call such an L an instrumental variable.

Now, let us recall the inequalities (8) and (10): the latter is the strict triangular inequality, and the former is its converse (both with strict inequality). Here, we arrive at the interpretation of causation in M. If it is a system in an abstract space without metric properties, there is no point in speaking about distances in it, and there is no link between information transfer time (lag, in short) and distances.

On the other hand, if
• the system M is located in a geodesic metric space,
• the geodesics are unique,
• the information propagates along the geodesics, and
• the information transfer has a constant speed,
then distances are proportional to the delays, with the same constant factor for all members. The triangular inequality is inherited by the lags from the distances. In a metric space with unique geodesics, like the Euclidean, hyperbolic and spherical spaces (except if X and Y are oppositely positioned on the sphere), the triangular inequality holds, and thus (13) is impossible, and L cannot be a common driver that mimics driving or acts in parallel with a driving between X and Y. Let us note that the triangular inequality holds for space-like vectors in the Minkowski space, while the converse holds for time-like positions. Finally, the case of strict equality needs further investigation.

In the case of different transfer speeds, the picture is more complex, and the above geometric consideration is applicable in particular settings only. In the human brain, the information transfer has different speeds depending on the transfer mode: via sequences of neural cells, long axon bundles or volume of surface currents. The transfer speed depends on the number of intermediate relay nodes of the network as well. Consequently, the causality analysis of brain regions needs detailed information on the connection type and speed between them. It is likely that, in many other areas, like climate and geophysics, specific knowledge of the metric properties and transfer speed may contribute to the success of causal discovery. In other areas, there is no information about the temporal arrangement of the unobserved factors, and consequently revealing the perfect description of the causal structure seems impossible.
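The geometric argument of this section can be illustrated with a short Python sketch for the Euclidean plane. It rests on an assumption that is not spelled out in the text: $\delta_{L,X,Y}$ is taken to be the difference between the delay from L to Y and the delay from L to X. The function names and the random test configuration are hypothetical. The sketch names which of the three cases applies and checks numerically that the reversed strict inequality never occurs for Euclidean positions with a constant transfer speed.

```python
import numpy as np

def transfer_time(p, q, speed=1.0):
    """Delay between two Euclidean points at constant transfer speed."""
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(np.linalg.norm(diff)) / speed

def delays_from_positions(pos_l, pos_x, pos_y, speed=1.0):
    """tau: delay X -> Y; delta: delay(L -> Y) - delay(L -> X) (assumed definition)."""
    tau = transfer_time(pos_x, pos_y, speed)
    delta = transfer_time(pos_l, pos_y, speed) - transfer_time(pos_l, pos_x, speed)
    return tau, delta

def compare_lags(tau, delta, tol=1e-9):
    """Name the case of this section that the pair (tau, delta) falls into."""
    if delta > tau + tol:
        return "reversed strict triangular inequality"
    if delta < tau - tol:
        return "strict triangular inequality"
    return "equality (confounder case)"

# Random (L, X, Y) triples in the plane: the reversed case should never occur.
rng = np.random.default_rng(0)
triples = rng.uniform(-1.0, 1.0, size=(1000, 3, 2))
never_reversed = all(
    delta <= tau + 1e-9
    for tau, delta in (delays_from_positions(l, x, y) for l, x, y in triples)
)
print(never_reversed)

# X on the segment between L and Y gives the equality case.
print(compare_lags(*delays_from_positions((0, 0), (1, 0), (3, 0))))
```

With location-dependent or mode-dependent transfer speeds, the proportionality between distance and delay is lost, and the lags would have to be supplied directly to `compare_lags` from background knowledge.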

Conditions and Mixing
Let us recall here that all the methods that are based on Pearl's DAG analysis use d-separation (or causal Markovness) based on a conditional independence test (CIT) in which the parents are the conditioning variables. As such, they need access to the parents, which is impossible if those are not observed, and the computation cost can be prohibitive for large networks. Note that d-separation uses the parents as a cut set in the DAG. In Section 2, we used the full past of both observed processes. In practice, it is impossible to put the whole past in the condition; therefore, we should work with a shorter history. Let us consider, as an example, the case when $0 \le k < \tau_{xy}$, which means that there is no information transfer from $x_n$ to $y_{n+k}$, and investigate $I(x_n; y_{n+k} \mid X_0^{n-1}, Y_0^{n+k-1})$. Suppose that $I(x_n; y_{n+k} \mid X_0^{n-1}, Y_0^{n+k-1}) = 0$, i.e., there is no hidden common driver. One can show that, in this case, the conditional mutual information conditioned on a truncated past of length $d$ tends to zero as $d$ goes to infinity (here $d$ denotes the length of the conditioning history). For the proof, see Appendix A. With this argument, the convergence of the conditional mutual information to a constant or to zero determines whether there is a driving between X and Y and whether there is a common driver (as indicated in Relation 1). Under Assumption 5, it is evident that, if there is a hidden common driver, the information is passed along a fixed-length path from the common cause to X and Y, and its effect on the dependency does not diminish. If there is no common driver, the exchanged information must traverse longer and longer paths, and the Conditional Mutual Information (CMI) should go to zero as $d$ goes to infinity.
The conditional independence test (and the proper estimation of CMI) has recently been a focus of research motivated by applications in machine learning and artificial intelligence. This is known to be a challenging task (cf. [6,12,17–19]).
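The effect of shortening the conditioning history can be illustrated numerically. The sketch below is a toy experiment under stated assumptions, not the argument of Appendix A: both systems are linear with Gaussian noise, the CMI is a Gaussian plug-in estimate, and the coefficients, the function names and the history lengths are hypothetical. It compares the same-time ($k = 0$) CMI for a pair in which X drives Y with lag 1 and no common driver against a pair driven by a hidden process L with memory.

```python
import numpy as np

def gaussian_cmi(a, b, z):
    """Plug-in I(a; b | z) in nats, assuming joint Gaussianity (an assumption)."""
    def logdet(*parts):
        cov = np.atleast_2d(np.cov(np.column_stack(parts), rowvar=False))
        return np.linalg.slogdet(cov)[1]
    return 0.5 * (logdet(a, z) + logdet(b, z) - logdet(z) - logdet(a, b, z))

def same_time_cmi(x, y, depth):
    """I(x[n]; y[n] | last `depth` values of both series), i.e. the case k = 0."""
    n = np.arange(depth, len(x))
    past = [s[n - j] for s in (x, y) for j in range(1, depth + 1)]
    return gaussian_cmi(x[n], y[n], np.column_stack(past))

def simulate(hidden_driver, n_steps=20000, seed=1):
    """Toy linear systems (hypothetical coefficients)."""
    rng = np.random.default_rng(seed)
    xi, eta, om = rng.standard_normal((3, n_steps))
    x, y, l = np.zeros(n_steps), np.zeros(n_steps), np.zeros(n_steps)
    for n in range(2, n_steps):
        if hidden_driver:
            # Unobserved L with memory drives both X and Y with lag 1.
            l[n] = 0.9 * l[n - 1] + om[n]
            x[n] = 0.5 * x[n - 1] + l[n - 1] + xi[n]
            y[n] = 0.5 * y[n - 1] + l[n - 1] + eta[n]
        else:
            # X drives Y (minimal lag 1); no common driver.
            x[n] = 0.3 * x[n - 1] + 0.6 * x[n - 2] + xi[n]
            y[n] = 0.3 * y[n - 1] + 0.5 * x[n - 1] + x[n - 2] + eta[n]
    return x, y

depths = [1, 2, 3, 4]
x0, y0 = simulate(hidden_driver=False)
x1, y1 = simulate(hidden_driver=True)
no_driver = [same_time_cmi(x0, y0, d) for d in depths]
with_driver = [same_time_cmi(x1, y1, d) for d in depths]
for d, v0, v1 in zip(depths, no_driver, with_driver):
    print(d, round(v0, 4), round(v1, 4))
```

In this toy setting, the estimate without a common driver is clearly positive for the shortest history and drops to nearly zero once the history covers the order of the system, while the estimate with the hidden driver stays bounded away from zero for every history length tried.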

Related Works and Discussion
There are numerous extensions and refinements of the original PC algorithm that Pearl developed [20]. This applies to the study of causal discovery of dynamic systems based on observed time series. We mention some prominent works [4,6,12,21,22] and their bibliography for further reading (see also the extended surveys [5,23]). The recent works [12,14] (see also [23]) have a very similar approach to the present one. In particular, we also use the structural modeling framework; however, we limit our focus to the discovery of a causal relation between a pair of systems. The method can be extended to the study of many time series by considering vector valued observations and/or many pairwise investigations.
The capabilities and limitations of the causal discovery algorithms were investigated in detail in seminal works [15,16,20,24] and recently in [14,21]. The recent generalizations are complete. They extend the labeling of edge ends of classical DAGs, while completeness does not mean that all relations are well specified. Completeness means that all the possible MAGs (Markov Equivalent Acyclic Graphs) can be created.
In this paper, we used an essential assumption and two unavoidable approximations. First, we assumed that the continuous time process can be inferred using a discrete time and limited resolution time series observation. Next, we assumed that the discrete time process can be well approximated with an order-p SVAR model. Finally, if the processes contain continuous variables, the condition is not restricted to a single state value but to a set of them, and, as a consequence, it is not perfectly blocking the information flow between the marginal variables. This deficiency might be eliminated by the local permutation method proposed by Runge in [19].