Entropy-Based Discovery of Summary Causal Graphs in Time Series

This study addresses the problem of learning a summary causal graph on time series with potentially different sampling rates. To do so, we first propose a new causal temporal mutual information measure for time series. We then show how this measure relates to an entropy reduction principle that can be seen as a special case of the probability raising principle. We finally combine these two ingredients in PC-like and FCI-like algorithms to construct the summary causal graph. These algorithms are evaluated on several datasets, and the results show both their efficacy and efficiency.


Introduction
Time series arise as soon as observations, from sensors or experiments, for example, are collected over time. They are present in various forms in many different domains, such as healthcare (through, e.g., monitoring systems), Industry 4.0 (through, e.g., predictive maintenance and industrial monitoring systems), surveillance systems (from images, acoustic signals, seismic waves, etc.), or energy management (through, e.g., energy consumption data), to name but a few. We are interested in this study in analyzing quantitative discrete-time series to detect the causal relations that exist between them under the assumption of consistency throughout time [1], which states that causal relationships remain constant in direction throughout time. Under this assumption, one can replace the infinite full-time causal graph (Figure 1a) by a window causal graph (Figure 1b) defined as follows: Definition 1 (Window causal graph [1]). Let X be a multivariate discrete-time series and G = (V, E) the associated window causal graph for a window of size τ. The set of vertices in that graph consists of the set of components X^1, ..., X^d at each time t, ..., t + τ. The edges E of the graph are defined as follows: variables X^p_{t−i} and X^q_t are connected by a lag-specific directed link X^p_{t−i} → X^q_t in G pointing forward in time if and only if X^p causes X^q at time t with a time lag of 0 ≤ i ≤ τ for p ≠ q and with a time lag of 0 < i ≤ τ for p = q.
The window causal graph only covers a fixed number of time instants, which are sufficient to understand the dynamics of the system given the largest time gap between causes and effects. This said, it is difficult for an expert to provide or validate a window causal graph, because it is difficult to determine which exact time instant is the cause of another. In contrast, there exists another type of causal graph where a node corresponds to a whole time series, as illustrated in Figure 1c, which usually can be analyzed or validated by an expert more easily [2][3][4]. Such graphs are referred to as summary causal graphs [1] and are defined as follows: Definition 2 (Summary causal graph [1]). Let X be a multivariate discrete-time series and G = (V, E) the associated summary causal graph. The set of vertices in that graph consists of the set of time series X^1, ..., X^d. The edges E of the graph are defined as follows: variables X^p and X^q are connected if and only if there exist some time t and some time lag i such that X^p_{t−i} causes X^q_t at time t with a time lag of 0 ≤ i for p ≠ q and with a time lag of 0 < i for p = q.
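A summary causal graph can be deduced from a window causal graph by collapsing lag-specific edges, as stated by Definitions 1 and 2. The snippet below is a minimal sketch of this collapse; the edge-list encoding and series names are illustrative choices, not the paper's:

```python
# Sketch: collapsing a window causal graph into a summary causal graph.
# A window-graph edge ("Xp", i, "Xq") reads: X^p_{t-i} -> X^q_t.
# The summary graph keeps one edge X^p -> X^q as soon as some admissible lag
# exists (i >= 0 for p != q, i > 0 for p == q, as in Definitions 1 and 2).

def summarize(window_edges):
    return {(p, q) for (p, i, q) in window_edges if i > 0 or p != q}

# Toy window graph: X1 causes X2 at lags 1 and 2, X2 is self-causal at lag 1.
window_edges = [("X1", 1, "X2"), ("X1", 2, "X2"), ("X2", 1, "X2")]
summary = summarize(window_edges)
# summary contains one edge X1 -> X2 and the self-loop X2 -> X2
```

The reverse direction is not possible: the summary edge does not retain which lags carried the causal relations.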
Note that a summary causal graph can be deduced from a window causal graph, but the reverse is not true. We focus in this study on the summary causal graph, as it provides a simple and efficient view of the causal relations that exist between time series. In particular, we are interested in inferring a summary causal graph without passing through a window causal graph, to avoid unnecessary computations. An important aspect of real-world time series is that different time series, as they measure different elements, usually have different sampling rates. Despite this, the algorithms that have been developed so far to discover causal structures [5][6][7][8] rely on the idealized assumption that all time series have the same sampling rate with identical timestamps (assuming identical timestamps in the case of identical sampling rates seems reasonable, as one can shift time series so that they coincide in time). We introduce in this paper a constraint-based strategy to infer a summary causal graph from discrete-time series with continuous values, with equal or different sampling rates, under the two classical assumptions of causal discovery (causal Markov condition, faithfulness), in addition to acyclicity in summary causal graphs. In summary, our contribution is four-fold: • First of all, we propose a new causal temporal mutual information measure defined on a window-based representation of time series; • We then show how this measure relates to an entropy reduction principle, which can be seen as a special case of the probability raising principle; • We also show how this measure can be used for time series with different sampling rates; • We finally combine these three ingredients in PC-like and FCI-like algorithms [9] to construct the summary causal graph from time series with equal or different sampling rates.
The remainder of the paper is organized as follows: Section 2 describes the related work. Section 3 introduces the (conditional) mutual information measures we propose for time series and the entropy reduction principle that our method is based on. Section 4 presents two causal discovery algorithms we developed on top of these measures. The causal discovery algorithms we propose are illustrated and evaluated on datasets, including time series with equal and different sampling rates and a real dataset in Section 5. Finally, Section 6 concludes the paper.

Related Work
Granger Causality is one of the oldest methods proposed to detect causal relations between time series. However, in its standard form [5], it is known to handle a restricted version of causality that focuses on linear relations and causal priorities, as it assumes that the past of a cause is necessary and sufficient for optimally forecasting its effect. This approach has nevertheless been improved since then in MVGC [10] and has recently been extended to handle nonlinearities through an attention mechanism within convolutional networks [8]. This last extension is referred to as TCDF. Score-based approaches [11] search over the space of possible graphs, trying to maximize a score that reflects how well the graph fits the data. Recently, a new score-based method called Dynotears [12] was presented to infer a window causal graph from time series. In a different line, approaches based on the noise assume that the causal system can be defined by a set of equations that explains each variable by its direct causes and an additional noise. Causal relations are in this case discovered using footprints produced by the causal asymmetry in the data. For time series, the most popular nonlinear algorithm in this family is TiMINo [6], which discovers a causal relationship by looking at independence between the noise and the potential causes. The most popular approaches for inferring causal graphs are certainly constraint-based approaches, based on the PC and FCI algorithms [9], which turn knowledge about (conditional) independencies into causal knowledge assuming the causal Markov condition and faithfulness. Several algorithms, adapted from non-temporal causal graph discovery algorithms, have been proposed in this family for time series, namely oCSE by [13], which is limited to one time lag between causal relations, PCMCI [7], and tsFCI [14]. Our work is thus more closely related to that of [7,14], which use constraint-based strategies to infer a window causal graph.
However, we focus here on the summary causal graph, and we introduce a new causal temporal mutual information measure and entropy reduction principle on which we ground our causal discovery algorithms. Constraint-based methods have been also used jointly with other approaches (e.g., with a score-based method [15] and with a noise-based method [16]), but we consider such hybrid methods as beyond the scope of this paper.
At the core of constraint-based approaches lie (in)dependence measures, to detect relevant dependencies, which are based here on an information theoretic approach. Since their introduction [17], information theoretic measures have become very popular due to their non-parametric nature, their robustness against strictly monotonic transformations, which makes them capable of handling nonlinear distortions in the system, and their good behavior in previous studies on causal discovery [18]. However, their application to temporal data raises several problems related to the fact that time series may have different sampling rates, be shifted in time, and have strong internal dependencies. Many studies have attempted to re-formalize mutual information for time series. Reference [19] considered each value of each time series as different random variables and proceeded by whitening the data, such that time-dependent data will be transformed into independent residuals through a parametric model. However, whitening the data can have severe consequences on causal relations. Reference [20] proposed a reformulation of mutual information, called the transfer entropy, which represents the information flow from one state to another and, thus, is asymmetric. Later, Ref. [13] generalized transfer entropy to handle conditioning. However, their approach, called causation entropy, can only handle causal relations with lags equal to 1. Closely related, the directed information [21,22] allows the computation of the mutual information between two instantaneous time series conditioned on the past of a third time series with a lag of 1. One of our goals in this paper is to extend these approaches to handle any lag equal to or greater than 0. Another related approach is the time-delayed mutual information proposed in [23], which aims at addressing the problem of non-uniform sampling rates. 
The computation of the time-delayed mutual information relates single points from a single time series (shifted in time). In contrast, the measure we propose is better suited to discovering summary causal graphs, as it can capture potentially complex relations between timestamps in different time series through the use of window-based representations and compatible time lags, and it is more general, as it can accommodate different sampling rates.

Information Measures for Causal Discovery in Time Series
We present in this section a new mutual information measure that operates on a window-based representation of time series to assess whether time series are (conditionally) dependent or not. We then show how this measure is related to an entropy reduction principle that is a special case of the probability raising principle [24].
We first assume that all time series are aligned in time, with the same sampling rate, prior to showing how our development can be applied to time series with different sampling rates. Without loss of generality, time instants are assumed to be integers. Lastly, as performed in previous studies [20], we assume that all time series are first-order Markov self-causal (any time instant is caused by its previous instant within the same time series).

Causal Temporal Mutual Information
Let us consider d univariate time series X^1, ..., X^d and their observations (X^p_t)_{1≤t≤N_p, 1≤p≤d}. Throughout this section, we will make use of the following Example 1, illustrated in Figure 1, to discuss the notions we introduce. Example 1. Let us consider two time series X^1 and X^2 such that, for all t, X^2_{t+1} and X^2_{t+2} are effects of X^1_t; that is, X^1 causes X^2 in two steps. One can see in Example 1 that, in order to capture the dependencies between the two time series, one needs to take into account a lag between them, as the true causal relations are not instantaneous. Several studies have recognized the importance of taking into account lags to measure (conditional) dependencies between time series; for example, in [7], a pointwise mutual information between time series with lags was used to assess whether they are dependent or not.
In addition to lags, Example 1 also reveals that a window-based representation may be necessary to fully capture the dependencies between the two time series. Indeed, as X^2_{t−1} and X^2_t are the effects of the same cause (X^1_{t−2}), it may be convenient to consider them together when assessing whether the time series are dependent or not. For example, defining (overlapping) windows of size two for X^2 and one for X^1 with a lag of 1 from X^1 to X^2, as in Figure 2, allows one to fully represent the causal dependencies between the two time series. Figure 2. Why do we need windows and lags? An illustration with two time series where X^1 causes X^2 in two steps (circles correspond to observed points and rectangles to windows). The arrows in black are discussed in the text. Definition 3. Let γ_max denote the maximum lag between two time series X^p and X^q, and let the maximum window size be λ_max = γ_max + 1. The window-based representation, of size 0 < λ_pq ≤ λ_max < N_p, of the time series X^p with respect to X^q, which will be denoted X^(p;λ_pq), simply amounts to considering the (N_p − λ_pq + 1) windows (X^p_t, ..., X^p_{t+λ_pq−1}), 1 ≤ t ≤ N_p − λ_pq + 1. The window-based representation, of size 0 < λ_qp ≤ λ_max < N_q, of the time series X^q with respect to X^p is defined in the same way. A temporal lag γ_pq ∈ Z compatible with λ_pq and λ_qp relates windows in X^(p;λ_pq) and X^(q;λ_qp) whose starting time points are separated by γ_pq. We denote by C^(p,q) the set of window sizes and compatible temporal lags.
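Definition 3 can be illustrated with a short sketch of the window-based representation; the function name is an illustrative choice:

```python
import numpy as np

def window_representation(x, lam):
    """Window-based representation of Definition 3: the (N - lam + 1)
    overlapping windows (x_t, ..., x_{t+lam-1}) of a 1-D series x."""
    x = np.asarray(x)
    return np.stack([x[t:t + lam] for t in range(len(x) - lam + 1)])

x = np.arange(5)                      # toy series of length N = 5
w = window_representation(x, lam=2)   # 4 overlapping windows of size 2
```

A compatible temporal lag γ_pq then pairs window w[t] of one series with the window of the other series starting at t + γ_pq.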
Based on the above elements, we define the causal temporal mutual information between two time series X^p and X^q as the maximum of the standard mutual information over all possible compatible temporal lags and windows in C^(p,q), conditioned on the past of the two time series. Indeed, as we are interested in obtaining a summary causal graph, we do not have to consider all the potential dependencies between two time series (which would be necessary for inferring a window causal graph). Using the maximum over all possible associations is a way to summarize all temporal dependencies, which ensures that one does not miss a dependency between the two time series. Furthermore, conditioning on the past allows one to eliminate spurious dependencies in the form of auto-correlation, as in transfer entropy [20]. We follow this idea here and, as in transfer entropy, consider windows of size 1 and a temporal lag of 1 for conditioning on the past, which is in line with the first-order Markov self-causal assumption mentioned above.

Definition 4.
Consider two time series X^p and X^q. We define the causal temporal mutual information between X^p and X^q as:

CTMI(X^p; X^q) = max_{(λ_pq, λ_qp, γ_pq) ∈ C^(p,q)} I(X^(p;λ_pq)_t ; X^(q;λ_qp)_{t+γ_pq} | X^p_{t−1}, X^q_{t+γ_pq−1}),

where I represents the (conditional) mutual information. In case the maximum can be obtained with different values in C^(p,q), we first set γ̂_pq to its largest possible value. We then set λ̂_pq to its smallest possible value and, finally, λ̂_qp to its smallest possible value. γ̂_pq, λ̂_pq, and λ̂_qp, respectively, correspond to the optimal lag and optimal windows.
In the context we have retained, in which dependencies are constant over time, CTMI satisfies the standard properties of mutual information, namely it is nonnegative, symmetric, and equals 0 iff the time series are independent. Thus, two time series X^p and X^q such that CTMI(X^p; X^q) > 0 are dependent. Setting γ̂_pq to its largest possible value allows one to get rid of instants that are not crucial in determining the mutual information between the two time series. The choice for the window sizes, even though arbitrary in treating one window size before the other, is based on the same ground, as the mutual information defined above cannot decrease when one increases the size of the windows. Indeed, by the chain rule of mutual information:

I(X^(p;λ_pq+1)_t ; X^(q;λ_qp)_{t+γ_pq}) = I(X^(p;λ_pq)_t ; X^(q;λ_qp)_{t+γ_pq}) + I(X^p_{t+λ_pq} ; X^(q;λ_qp)_{t+γ_pq} | X^(p;λ_pq)_t) ≥ I(X^(p;λ_pq)_t ; X^(q;λ_qp)_{t+γ_pq}).

The last inequality is due to the fact that (conditional) mutual information is nonnegative.
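To make Definition 4 concrete, here is a minimal numerical sketch. It is not the paper's estimator: it scores each candidate lag and pair of window sizes with a Gaussian-approximation conditional mutual information (the paper uses a k-nearest-neighbor estimator), scans only nonnegative lags, and omits the tie-breaking rule of Definition 4; the past-conditioning uses windows of size 1 and a lag of 1, as in the text:

```python
import numpy as np

def windows(x, lam):
    """Overlapping windows (x_t, ..., x_{t+lam-1}) of a 1-D series x."""
    return np.stack([x[t:t + lam] for t in range(len(x) - lam + 1)])

def gauss_cmi(x, y, z):
    """Gaussian-approximation conditional mutual information I(X;Y|Z) =
    0.5 * (log det S_xz + log det S_yz - log det S_z - log det S_xyz)."""
    def logdet(*blocks):
        cov = np.atleast_2d(np.cov(np.column_stack(blocks), rowvar=False))
        return np.linalg.slogdet(cov)[1]
    return 0.5 * (logdet(x, z) + logdet(y, z) - logdet(z) - logdet(x, y, z))

def ctmi_sketch(xp, xq, gamma_max, lam_max):
    """Scan nonnegative lags and window sizes, keeping the maximum of
    I(X^(p;lpq)_t ; X^(q;lqp)_{t+gamma} | X^p_{t-1}, X^q_{t+gamma-1})."""
    best = (-np.inf, None, None, None)
    for gamma in range(gamma_max + 1):
        for lpq in range(1, lam_max + 1):
            for lqp in range(1, lam_max + 1):
                wp, wq = windows(xp, lpq), windows(xq, lqp)
                t = np.arange(1, min(len(wp), len(wq) - gamma))
                if len(t) < 30:        # not enough joint observations
                    continue
                past = np.column_stack([xp[t - 1], xq[t + gamma - 1]])
                mi = gauss_cmi(wp[t], wq[t + gamma], past)
                if mi > best[0]:
                    best = (mi, gamma, lpq, lqp)
    return best  # (value, optimal lag, optimal window sizes)
```

On data where X^1 causes X^2 with a lag of 1, the scan recovers that lag as the maximizer.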
Example 2. Consider the structure described in Example 1, and assume that λ_max = 3. Any γ_12 < 0 yields zero causal temporal mutual information, because conditioning on the past of X^1 removes the corresponding dependencies. For γ_12 > 0, starting with γ_12 = 1, the mutual information is positive, and this is similar for γ_12 = 2. Now, any increase of λ_21 alone, or of λ_12 and λ_21 together, generates an increase in the mutual information as long as the difference between the last time point of the window of X^1 (the cause) and the last time point of the window of X^2 is less than or equal to γ_12, where γ_12 ≥ 1 (1 is the minimal lag that generates a correlation that cannot be removed by conditioning on the past of X^1 and X^2). For λ_max = 3, the optimal window size λ̂_12 is equal to 2, as X^1 has no other cause than itself; λ̂_21 is equal to 2, as X^1 causes only (except itself) X^2_{t+1} and X^2_t. Furthermore, the optimal lag is γ̂_12 = 1.

Entropy Reduction Principle
Interestingly, CTMI can be related to a version of the probability raising principle [24], which states that a cause, here a time series, raises the probability of any of its effects, here another time series, even when the past of the two time series is taken into account, meaning that the relation between the two time series is not negligible compared to the internal dependencies of the time series. In this context, the following definition generalizes to window-based representations of time series the standard definition of prima facie causes for discrete variables.
Definition 5 (Prima facie cause for window-based time series). Let X^p and X^q be two time series with window sizes λ_pq and λ_qp, and let P_{t,t'} = (X^(p;1)_{t−1}, X^(q;1)_{t'−1}) represent the past of X^p and X^q for any two instants (t, t'). We say that X^p is a prima facie cause of X^q with delay γ_pq > 0 iff there exist Borel sets B_p, B_q, and B_P such that one has:

P(X^(q;λ_qp)_{t+γ_pq} ∈ B_q | X^(p;λ_pq)_t ∈ B_p, P_{t,t+γ_pq} ∈ B_P) > P(X^(q;λ_qp)_{t+γ_pq} ∈ B_q | P_{t,t+γ_pq} ∈ B_P).

We now introduce a slightly different principle based on the causal temporal mutual information, which we refer to as the entropy reduction principle.
Definition 6 (Entropic prima facie cause). Using the same notations as in Definition 5, we say that X^p is an entropic prima facie cause of X^q with delay γ_pq > 0 iff

I(X^(p;λ_pq)_t ; X^(q;λ_qp)_{t+γ_pq} | P_{t,t+γ_pq}) > 0.

Note that considering that the above mutual information is positive is equivalent to considering that the entropy of X^q when conditioned on the past reduces when one further conditions on X^p. One has the following relation between the entropy reduction and the probability raising principles.

Property 1.
With the same notations, if X^p is an entropic prima facie cause of X^q with delay γ_pq > 0, then X^p is a prima facie cause of X^q with delay γ_pq > 0. Furthermore, if CTMI(X^p; X^q) > 0 with γ̂_pq > 0, then X^p is an entropic prima facie cause of X^q with delay γ̂_pq.
Proof. Let us assume that X^p is not a prima facie cause of X^q for the delay γ_pq. Then, for all Borel sets B_p, B_q, and B_P, one has P(X^(q;λ_qp)_{t+γ_pq} ∈ B_q | X^(p;λ_pq)_t ∈ B_p, P_{t,t+γ_pq} ∈ B_P) = P(X^(q;λ_qp)_{t+γ_pq} ∈ B_q | P_{t,t+γ_pq} ∈ B_P). This translates, in terms of density functions denoted f, as f(x^q | x^p, p) = f(x^q | p), which entails I(X^(p;λ_pq)_t ; X^(q;λ_qp)_{t+γ_pq} | P_{t,t+γ_pq}) = 0, so that X^p is not an entropic prima facie cause of X^q with delay γ_pq. By contraposition, we conclude the proof of the first statement. The second statement directly derives from the definition of CTMI.
It is also interesting to note that, given two time series X^p and X^q such that X^p causes X^q with γ_pq = 0, CTMI does not necessarily increase symmetrically with respect to the increase of λ_pq and λ_qp. For an illustration on a simple model, see Figure 3. In Figure 3a, the mutual information conditioned on the past between X^p and X^q with γ_pq = 0 is positive, since X^p causes X^q instantaneously. In Figure 3b, the same mutual information does not increase when increasing only the window size of the cause. However, the mutual information increases when increasing only the window size of the effect, as in Figure 3c, or when increasing simultaneously the window sizes of the effect and the cause, as in Figure 3d. Figure 3. Illustration of the asymmetric increase of CTMI with the increase of the window sizes. The mutual information conditioned on the past in (a) is positive; it increases when increasing only the window size of the effect, as in (c), or when increasing simultaneously the window sizes of the effect and the cause, as in (d). However, it does not increase when increasing only the window size of the cause, as in (b). Dashed lines are for correlations that are not causations, and bold arrows correspond to causal relations between the window representations of time series.

Conditional Causal Temporal Mutual Information
We now extend the causal temporal mutual information by conditioning on a set of variables. In a causal discovery setting, conditioning is used to assess whether two dependent time series can be made independent by conditioning on connected time series, i.e., time series that are dependent with at least one of the two time series under consideration. Figure 4 illustrates the case where the dependence between X^1 and X^2 is due to spurious correlations originating from common causes. Conditioning on these common causes should lead to the conditional independence of the two time series. Of course, the conditioning variables cannot temporally succeed the two time series under consideration. This leads us to the following definition of the conditional causal temporal mutual information. Figure 4. Example of conditional independence between dependent time series. In (a), the conditioning set contains one time series, X^3, in addition to the past of X^1 and X^2. In (b), the conditioning set contains two time series, X^3 and X^4, in addition to the past of X^1 and X^2. Dashed lines are for correlations that are not causations, and bold arrows correspond to conditioning variables.

Definition 7.
The conditional causal temporal mutual information between two time series X^p and X^q such that γ̂_pq ≥ 0, conditioned on a set X^R = {X^{r_1}, ..., X^{r_K}}, is given by:

CTMI(X^p; X^q | X^R) = min_{(λ_1,...,λ_K), (Γ_1,...,Γ_K)} I(X^(p;λ̂_pq)_t ; X^(q;λ̂_qp)_{t+γ̂_pq} | X^(r_1;λ_1)_{t+Γ_1}, ..., X^(r_K;λ_K)_{t+Γ_K}, X^p_{t−1}, X^q_{t+γ̂_pq−1}).

In case the minimum can be obtained with different values, we first set Γ̂_k to its largest possible value. We then set λ̂_k to its smallest possible value. (Γ̂_1, ..., Γ̂_K) and (λ̂_1, ..., λ̂_K) correspond to the optimal conditional lags and window sizes, which minimize the above quantity for Γ_1, ..., Γ_K ≥ −γ̂_pq. By considering the minimum over compatible lags and window sizes, one guarantees that, if there exist conditioning variables that make the two time series independent, they will be found. Note that the case in which γ̂_qp < 0 corresponds to CTMI(X^p; X^q | X^R), where γ̂_pq > 0. Figure 4 illustrates the above on two different examples. On the left, X^1_{t−1} is correlated to X^2_t as X^3_{t−2} is a common cause with a lag of 1 for X^1 and a lag of 2 for X^2. Conditioning on X^3_{t−2} removes the dependency between X^1 and X^2. Note that all time series have here a window of size 1. On the right, X^3 and X^4 are common causes of X^1 and X^2: X^3 causes X^1 and X^2 with a temporal lag of 1, which renders X^1 and X^2 correlated at the same time point, while X^4 causes X^1 and X^2 with temporal lags of 1 and 2, respectively, which renders X^1 and X^2 correlated at lagged time points. The overall correlation between X^1 and X^2 is captured by considering a window of size 2 in X^2. All other time series have a window of size 1. By conditioning on both X^3 and X^4, X^1 and X^2 become independent, assuming we also condition on the past of X^1 and X^2 to remove the autocorrelation.

Estimation and Testing
In practice, the success of the CTMI approach (and, in fact, of any entropy-based approach) depends crucially on the reliable estimation of the relevant entropies from data. This leads to two practical challenges. The first is that entropies must be estimated from finite time-series data. The second is that, to detect independence, we need a statistical test to check whether the causal temporal mutual information is equal to zero.
Here, we rely on the k-nearest neighbor method [25,26] for the estimation of CTMI. The distance between two windows considered here is the supremum distance, i.e., the maximum of the absolute difference between any two values in the two windows.
In the case of the causal temporal mutual information, we denote by ε_{i,k}/2 the distance from the i-th joint observation (in the joint space of the two window representations and the past) to its k-th nearest neighbor, and by n^{1,3}_i, n^{2,3}_i, and n^3_i the numbers of points with a distance strictly smaller than ε_{i,k}/2 in the subspace of the first window representation and the past, of the second window representation and the past, and of the past alone, respectively; n_{γ_pq,γ_qp} denotes the number of joint observations. The estimate of the causal temporal mutual information is then given by:

ĈTMI(X^p; X^q) = ψ(k) + (1 / n_{γ_pq,γ_qp}) Σ_i [ ψ(n^3_i + 1) − ψ(n^{1,3}_i + 1) − ψ(n^{2,3}_i + 1) ],

where ψ denotes the digamma function.
Similarly, for the estimation of the conditional causal temporal mutual information, we denote by ε_{i,k}/2 the distance from the i-th joint observation to its k-th nearest neighbor in the corresponding joint space, and by n^{1,3}_i, n^{2,3}_i, and n^3_i the numbers of points with a distance strictly smaller than ε_{i,k}/2 in the subspace of the first window representation and the conditioning variables, of the second window representation and the conditioning variables, and of the conditioning variables alone, respectively; n_{γ_rp,γ_rq} denotes the number of joint observations. The estimate of the conditional causal temporal mutual information is then given by:

ĈTMI(X^p; X^q | X^R) = ψ(k) + (1 / n_{γ_rp,γ_rq}) Σ_i [ ψ(n^3_i + 1) − ψ(n^{1,3}_i + 1) − ψ(n^{2,3}_i + 1) ],

where ψ denotes the digamma function.
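Assuming the estimator above takes the Frenzel–Pompe form (a conditional variant of the Kraskov–Stögbauer–Grassberger scheme), it can be sketched as follows with the supremum distance; function and variable names are illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def cmi_knn(x, y, z, k=5):
    """Frenzel-Pompe k-NN estimate of I(X;Y|Z):
    psi(k) + mean( psi(n_z + 1) - psi(n_xz + 1) - psi(n_yz + 1) ),
    with every distance taken as the supremum (Chebyshev) distance."""
    x, y, z = (np.asarray(a, float).reshape(len(a), -1) for a in (x, y, z))
    xyz = np.hstack([x, y, z])
    # eps_i: distance from point i to its k-th neighbour in the joint space
    eps = cKDTree(xyz).query(xyz, k=k + 1, p=np.inf)[0][:, -1]

    def counts(m):
        tree = cKDTree(m)
        # number of points strictly closer than eps_i (query point excluded)
        return np.array([len(tree.query_ball_point(m[i], eps[i] - 1e-12, p=np.inf)) - 1
                         for i in range(len(m))])

    n_xz, n_yz, n_z = counts(np.hstack([x, z])), counts(np.hstack([y, z])), counts(z)
    return digamma(k) + np.mean(digamma(n_z + 1) - digamma(n_xz + 1) - digamma(n_yz + 1))
```

On a common-cause example, conditioning on the confounder drives the estimate toward zero, while conditioning on an irrelevant variable leaves a clearly positive value.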
To detect independencies through CTMI, we rely on the following permutation test. Definition 8 (Permutation test of CTMI). Given X^p, X^q, and X^R, the p-value associated with the permutation test of CTMI is given by:

p-value = (1/B) Σ_{b=1}^{B} 1[ ĈTMI((X^p)^b; X^q | X^R) ≥ ĈTMI(X^p; X^q | X^R) ],

where B is the number of permutations and (X^p)^b is a permuted version of X^p that follows the local permutation scheme presented in [27].
The advantage of the scheme presented in [27] is that it preserves marginals by drawing as much as possible without replacement, and that it performs local permutations, which ensures that, by permuting X^p, the dependence between X^p and the conditioning variables is not destroyed.
Note that Definition 8 is applicable to the causal temporal mutual information (when R is empty) and to the conditional causal temporal mutual information.
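A simplified version of the permutation test can be sketched as follows; for brevity, it uses a plain global permutation rather than the local permutation scheme of [27], and a simple correlation statistic stands in for the CTMI estimate:

```python
import numpy as np

def permutation_pvalue(stat, x, y, n_perm=200, seed=0):
    """Fraction of permuted statistics reaching the observed one,
    with the usual +1 correction so the p-value is never exactly 0."""
    rng = np.random.default_rng(seed)
    observed = stat(x, y)
    hits = sum(stat(rng.permutation(x), y) >= observed for _ in range(n_perm))
    return (1 + hits) / (1 + n_perm)

# Dependent toy series: the observed statistic should beat the permuted ones.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = x + 0.5 * rng.normal(size=500)
p = permutation_pvalue(lambda a, b: abs(np.corrcoef(a, b)[0, 1]), x, y)
# p is small for dependent series
```

The local scheme of [27] differs in that each value of x is swapped only with values whose conditioning coordinates are close, so the dependence on the conditioning variables is preserved under permutation.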

Extension to Time Series with Different Sampling Rates
The above development readily applies to time series with different sampling rates, as one can define window-based representations of the two time series, as well as a sequence of joint observations. Indeed, as one can note, Definition 3 does not rely on the fact that the time series have the same sampling rates. Figure 5 displays two time series X^p and X^q with different sampling rates, where, while λ_pq = 2 and λ_qp = 3, the time spanned by each window is the same. The joint sequence of observations, relating pairs of windows from X^p and X^q in the form S = {(X^(p;λ_pq)_{t_i}, X^(q;λ_qp)_{t'_i})}_i, should, however, be such that, for all indices i of the sequence, one has s(X^(q;λ_qp)_{t'_i}) − s(X^(p;λ_pq)_{t_i}) = γ_pq, where s(X) represents the starting time of the window X and γ_pq is constant over time. This is not the case for the first example, but is true for the second one, which is a relevant sequence of observations. If the two time series are sufficiently long, there always exists a correct sequence of joint observations. Indeed, if the window sizes λ_pq and λ_qp are known, let γ_pq = s(X^(q;λ_qp)_{t'_1}) − s(X^(p;λ_pq)_{t_1}). Furthermore, let N_p and N_q denote the numbers of observations per time unit (the time unit corresponds to the largest (integer) time interval according to the sampling rates of the different time series; for example, if a time series has a sampling rate of 10 per second and another a sampling rate of 3 per 10 min, then the time unit is equal to 10 min). Then, λ_pq, λ_qp, and γ_pq are compatible through the set of joint observations S constructed in this way. Note that this methodology for handling different sampling rates is not unique to our proposal and can be used as soon as lags and windows are defined.
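The construction of a joint observation sequence with a constant lag between window starting times can be sketched as follows, assuming integer timestamps expressed in the common time unit; names are illustrative:

```python
import numpy as np

def joint_observations(tp, xp, tq, xq, lam_pq, lam_qp, gamma_pq):
    """Sketch of the joint observation sequence S for series with different
    sampling rates: pair a window of X^p starting at time tp[i] with the
    window of X^q whose starting time is tp[i] + gamma_pq, so that the lag
    between starting times stays constant over the whole sequence."""
    pairs = []
    for i in range(len(tp) - lam_pq + 1):
        start_q = tp[i] + gamma_pq          # required starting time of the X^q window
        j = int(np.searchsorted(tq, start_q))
        if j <= len(tq) - lam_qp and tq[j] == start_q:
            pairs.append((xp[i:i + lam_pq], xq[j:j + lam_qp]))
    return pairs

# X^p sampled every time unit, X^q every 2 time units.
tp, xp = np.arange(10), np.arange(10.0)
tq, xq = np.arange(0, 20, 2), np.arange(10.0)
S = joint_observations(tp, xp, tq, xq, lam_pq=2, lam_qp=2, gamma_pq=2)
```

Only the X^p windows whose starting time admits an X^q window exactly γ_pq time units later contribute a joint observation, which mirrors the compatibility requirement on S.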

Causal Discovery Based on Causal Temporal Mutual Information
We present in this section two new methods for causal discovery in time series that rely on the causal temporal mutual information introduced above to construct the skeleton of the causal graph. The first method assumes causal sufficiency (there exists no hidden common cause of two observable time series), while the second method is an extension of the first to the case where causal sufficiency is not satisfied. In both methods, the skeleton is oriented on the basis of the entropy reduction principle in addition to classical constraint-based orientation rules (the rules used in the PC algorithm or the rules used in the FCI algorithm). Our methods assume both the causal Markov condition and faithfulness of the data distribution, which are classical assumptions for causal discovery within constraint-based methods, in addition to acyclicity in the summary causal graph.

Without Hidden Common Causes
We present here our main method, which assumes causal sufficiency. We start by describing how we infer the skeleton of the summary graph, then we show how we orient edges using classical PC-rules in addition to rules based on the entropy reduction principle.

Skeleton Construction
We follow the same steps as the ones of the PC algorithm [9], which assumes that all variables are observed. It aims at building causal graphs by orienting a skeleton obtained, from a complete graph, by removing edges connecting independent variables. The summary causal graphs considered are directed acyclic graphs (DAGs) in which self-loops are allowed to represent temporal dependencies within a time series.
Starting with a complete graph that relates all time series, the first step consists of computing CTMI for all pairs of time series and removing edges if the two time series are considered independent. Once this is done, one checks, for the remaining edges, whether the two time series are conditionally independent (the edge is removed) or not (the edge is kept). Starting from a single time series connected to X p or X q , the set of conditioning time series is gradually increased until either the edge between X p and X q is removed or all time series connected to X p and X q have been considered. We will denote by Sepset(p, q) the separation set of X p and X q , which corresponds to the smallest set of time series connected to X p and X q such that X p and X q are conditionally independent given this set. Note that we make use of the same strategy as the one used in PC-stable [28], which consists of sorting time series according to their CTMI scores and, when an independence is detected, removing all other occurrences of the time series. This leads to an order-independent procedure. The following theorem states that the skeleton obtained by the above procedure is the true one. Theorem 1. Let G = (V, E) be a summary causal graph, and assume that we are given perfect conditional independence information about all pairs of variables (X p , X q ) in V given subsets S ⊆ V\{X p , X q }. Then, the skeleton previously constructed is the skeleton of G.
Proof. Let us consider two time series X^p and X^q. If they are independent given X^R, then CTMI(X^p; X^q | X^R) = 0, as otherwise, the conditional mutual information between X^p and X^q would be non-null and the two time series would not be conditionally independent, as we are given perfect information. By the causal Markov and faithfulness conditions, there is no edge in this case between X^p and X^q in the corresponding skeleton, as in the true one. Conversely, if CTMI(X^p; X^q | X^R) = 0 for any X^R, then the two time series cannot be dependent conditioned on X^R. Indeed, if this were the case, as we are given perfect conditional information, there would exist a lag γ and two window sizes λ_pq and λ_qp such that I(X^(p;λ_pq)_t ; X^(q;λ_qp)_{t+γ} | X^R) > 0. In this case, the two windows of size λ_max centered on time point t in both X^p and X^q contain the windows of sizes λ_pq and λ_qp separated by a lag γ in X^p and X^q, as λ_max = 2γ_max + 1. Thus, CTMI(X^p; X^q | X^R) would be positive, as this quantity cannot be less than I(X^(p;λ_pq)_t ; X^(q;λ_qp)_{t+γ} | X^R), which leads to a contradiction. Finally, as we tested all necessary conditioning sets in the construction of the skeleton, we have the guarantee of removing all unnecessary edges.
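The skeleton construction can be sketched as follows, with an oracle conditional-independence test standing in for the CTMI permutation test (the PC-stable ordering refinement is omitted for brevity):

```python
from itertools import combinations

def skeleton(nodes, indep):
    """PC-like skeleton search: start from the complete graph and remove the
    edge X-Y whenever some set S of current neighbours of X (excluding Y)
    makes X and Y conditionally independent; record S as Sepset(X, Y)."""
    adj = {v: set(nodes) - {v} for v in nodes}
    sepset = {}
    level = 0                      # size of the conditioning sets tried
    progressed = True
    while progressed:
        progressed = False
        for x in nodes:
            for y in sorted(adj[x]):
                others = adj[x] - {y}
                if len(others) < level:
                    continue
                progressed = True
                for s in combinations(sorted(others), level):
                    if indep(x, y, set(s)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(s)
                        break
        level += 1
    return adj, sepset
```

On an oracle encoding a chain X1 → X2 → X3, the search removes only the X1−X3 edge and records {X2} as its separation set.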

Orientation
Once the skeleton has been constructed, one tries to orient as many edges as possible using the standard PC-rules [9], which yields a completed partially directed acyclic graph (CPDAG).

PC-rule 0 (Origin of causality).
In an unshielded triple X p − X r − X q , if X r ∉ Sepset(p, q), then X r is an unshielded collider: X p → X r ← X q .

PC-rule 1 (Propagation of causality).
In an unshielded triple X p → X r − X q , if X r ∈ sepset(p, q), then orient the unshielded triple as X p → X r → X q .

PC-rule 2.
If there exists a directed path from X p to X q and an edge between X p and X q , then orient the edge as X p → X q .

PC-rule 3.
If there exist an unshielded triple X p → X r ← X q , an unshielded triple X p − X s − X q , and an edge X s − X r , then orient X s → X r .
As we are using here the standard PC-rules, we have the following theorem, the proof of which directly derives from the results on PC [9].

Theorem 2 (Theorem 5.1 of [9]). Let the distribution of V be faithful to a DAG G = (V, E), and assume that we are given perfect conditional independence information about all pairs of variables (X p , X q ) in V given subsets X R ⊆ V\{X p , X q }. Then, the skeleton constructed previously followed by PC-rules 0, 1, 2, and 3 represents the CPDAG of G.
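For concreteness, PC-rules 0 to 2 can be sketched over a simple adjacency/sepset representation (rule 3 is handled analogously); the data structures below are illustrative choices of this sketch, not the paper's implementation, and rule 1 uses the non-adjacency form, which is equivalent to the Sepset form once rule 0 has been applied.

```python
from itertools import combinations

def apply_pc_rules(adj, sepset):
    """Sketch of PC-rules 0-2 on a skeleton (rule 3 is analogous).

    adj: dict node -> set of neighbours (the undirected skeleton).
    sepset: dict frozenset({p, q}) -> separating set of p and q.
    Returns a set of directed edges (p, q), meaning p -> q.
    """
    directed = set()

    def is_oriented(p, q):
        return (p, q) in directed or (q, p) in directed

    # Rule 0: an unshielded triple p - r - q with r not in Sepset(p, q)
    # becomes the collider p -> r <- q.
    for r in adj:
        for p, q in combinations(sorted(adj[r]), 2):
            if q not in adj[p] and r not in sepset.get(frozenset({p, q}), set()):
                directed.add((p, r))
                directed.add((q, r))

    changed = True
    while changed:
        changed = False
        # Rule 1: p -> r - q with p and q non-adjacent becomes p -> r -> q.
        for p, r in list(directed):
            for q in adj[r]:
                if q != p and q not in adj[p] and not is_oriented(r, q):
                    directed.add((r, q))
                    changed = True
        # Rule 2: p -> r -> q with an edge p - q becomes p -> q.
        for p, r in list(directed):
            for r2, q in list(directed):
                if r2 == r and q != p and q in adj[p] and not is_oriented(p, q):
                    directed.add((p, q))
                    changed = True
    return directed
```

On a four-node skeleton a − c, b − c, c − d with Sepset(a, b) = ∅, rule 0 orients the collider a → c ← b and rule 1 then propagates c → d.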
In addition to the PC orientation rules, we introduce two new rules, which are based on the notion of possible spurious correlations and the mutual information we have introduced. The notion of possible spurious correlations captures the fact that two variables may be correlated through relations that do not only correspond to direct causal relations between them. It is formalized as follows:

Definition 9 (Possible spurious correlations). We say that two nodes X p − X q have possible spurious correlations if there exists a path between them that neither contains the edge X p − X q nor any collider.
Interestingly, when two connected variables do not have possible spurious correlations, one can conclude about their orientation using CTMI.

Property 2.
Let us assume that we are given perfect conditional independence information about all pairs of variables (X p , X q ) in V given subsets S ⊆ V\{X p , X q }. Then, every non-oriented edge in the CPDAG obtained by the above procedure corresponds to a prima facie cause and, by causal sufficiency, to a true causal relation between the related time series. Furthermore, the orientation of an unoriented edge between two nodes X p and X q that do not have possible spurious correlations is given by the "direction" of the optimal lag in CTMI(X p , X q ), assuming that the maximal window size is larger than the longest lag γ max between causes and effects.
Proof. The first part of Property 2 directly derives from Property 1. As we assume that we are given perfect conditional information, the skeleton is the true one from Theorem 1. Thus, if two variables do not have possible spurious correlations, the only correlations observed between them correspond to a causal relation. We now need to prove that the optimal lag can be used to orient edges between any pair of variables X p and X q .
Without loss of generality, let us assume that X p causes X q t , for any time t, via the K time instants {t − γ, t − γ 1 , · · · , t − γ K−1 } with 0 < γ K−1 < · · · < γ 1 < γ. First, let us consider a window of size 1 in X q and a window of arbitrary size λ in X p with a lag γ pq set to γ′ ≥ 0. As γ′ ≥ 0, there is no cause of X q t in the window X^{(p;λ)}_{t+γ′}. Furthermore, the only observed correlations between X p and X q correspond to causal relations. We thus have I(X^{(q;1)}_t ; X^{(p;λ)}_{t+γ′}) = 0, as X^{(q;1)}_t and X^{(p;λ)}_{t+γ′} are independent of each other. On the contrary, for the same window size in X p and a lag γ pq set to −γ with γ > 0, one has I(X^{(q;1)}_t ; X^{(p;λ)}_{t−γ}) ≥ I(X^{(q;1)}_t ; X^{(p;1)}_{t−γ}) > 0. The first inequality derives from Inequality 2. The second inequality is due to the fact that X p t−γ is a true cause of X q t and the fact that we are given perfect information. Thus, when considering a window of size 1 for X q , the optimal lag given by CTMI will necessarily go from X p to X q , which corresponds to the correct orientation.
We now consider the case where we have a window of arbitrary size λ′ in X q . Let us further consider a window of arbitrary size λ in X p with a lag γ pq set to γ′ ≥ 0. If λ′ < γ′ + γ K−1 , there are no causal relations between the elements in X^{(q;λ′)}_t and those in X^{(p;λ)}_{t+γ′}, and the mutual information between these two windows is 0. Otherwise, one can decompose this mutual information by the chain rule as:

I(X^{(q;λ′)}_t ; X^{(p;λ)}_{t+γ′}) = I(X^{(q;γ′+γ_{K−1})}_t ; X^{(p;λ)}_{t+γ′}) + I(X^{(q;λ′−γ′−γ_{K−1})}_{t+γ′+γ_{K−1}} ; X^{(p;λ)}_{t+γ′} | X^{(q;1)}_{t+γ′+γ_{K−1}−1}),

as the conditioning on X^{(q;γ′+γ_{K−1})}_t and X^{(q;1)}_{t−1} amounts to conditioning on the instant X^{(q;1)}_{t+γ′+γ_{K−1}−1} due to the first-order Markov self-causal assumption.
As there are no causal relations between the elements in X^{(q;γ′+γ_{K−1})}_t and the elements in X^{(p;λ)}_{t+γ′}, the first term on the right-hand side is 0. Using a similar decomposition in order to exclude elements at the end of X^{(p;λ)}_{t+γ′}, which do not cause any element in X^{(q;λ′)}_t , one obtains a reduced form of the mutual information. Let us now consider the window in X p of size λ with a lag γ pq set to −γ K−1 . Using the same reasoning as before, one obtains the analogous decomposition of Equation (5). In addition, the first term on the right-hand side of Equation (5) is strictly positive, as all the elements in X^{(q;γ′+γ_{K−1})}_t have causal relations with elements in X^{(p;λ)}_{t−γ_{K−1}}. Thus, the mutual information obtained with the negative lag −γ K−1 is greater than the one obtained with any positive lag γ′: I(X^{(q;λ′)}_t ; X^{(p;λ)}_{t−γ_{K−1}}) > I(X^{(q;λ′)}_t ; X^{(p;λ)}_{t+γ′}), meaning that the optimal lag given by CTMI will necessarily go from X p to X q , which corresponds to the correct orientation.
The following orientation rule is a direct application of the above property.
ER-rule 0 (Entropy reduction-γ). In a pair X p − X q such that X p and X q do not have possible spurious correlations, if γ̂ pq > 0, then orient the edge as X p → X q .
Furthermore, we make use of the following rule to orient additional edges when the optimal lag γ̂ pq is null, based on the fact that CTMI increases asymmetrically with respect to the increase of λ pq and λ qp (Figure 3). This rule infers the direction of the cause by comparing the window sizes, as the window size of the cause cannot be greater than the window size of the effect.

ER-rule 1 (Entropy reduction-λ).
In a pair X p − X q such that X p and X q do not have possible spurious correlations, if γ̂ pq = 0 and λ̂ pq < λ̂ qp , then orient the edge as X p → X q .
In practice, we also apply ER-rule 0 before the PC-rules because, experimentally, we found that ER-rule 0 is more reliable than PC-rule 0 in detecting lagged unshielded colliders, especially in the case of a low sample size.
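Taken together, the two ER-rules amount to a simple decision on the quantities (γ̂ pq , λ̂ pq , λ̂ qp ) returned by CTMI. A minimal sketch, where the sign convention for γ̂ pq (positive when X p precedes X q ) is an assumption of this illustration:

```python
def orient_by_entropy_reduction(gamma_hat, lam_pq, lam_qp):
    """Apply ER-rule 0 and ER-rule 1 to one unoriented pair X^p - X^q.

    gamma_hat: optimal lag from CTMI, assumed positive when X^p precedes X^q.
    lam_pq, lam_qp: optimal window sizes on X^p and X^q, respectively.
    Returns 'p->q', 'q->p', or None when neither rule applies.
    """
    if gamma_hat > 0:  # ER-rule 0: the cause precedes the effect
        return 'p->q'
    if gamma_hat < 0:
        return 'q->p'
    # ER-rule 1: with a zero lag, the cause's window cannot exceed the effect's
    if lam_pq < lam_qp:
        return 'p->q'
    if lam_qp < lam_pq:
        return 'q->p'
    return None  # zero lag and equal windows: leave the edge unoriented
```

Both rules apply only to pairs without possible spurious correlations, which the caller must check beforehand.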
We call our method PCTMI; the pseudo-code is available in Algorithm 1. Adj(X q , G) represents all adjacent nodes to X q in the graph G, and sepset(p, q) is the separation set of X p and X q . The output of the algorithm is a CPDAG version of the summary graph such that all lagged relations are oriented, but instantaneous relations are partially oriented.

Algorithm 1 PCTMI.
Require: X a d-dimensional time series of length T, γ max ∈ N the maximum number of lags, α a significance threshold
Form a complete undirected graph G = (V, E) with d nodes
n = 0
while there exists X q ∈ V such that card(Adj(X q , G)) ≥ n + 1 do
    D = list()
    for X q ∈ V s.t. card(Adj(X q , G)) ≥ n + 1 do
        for X p ∈ Adj(X q , G) do
            for all subsets X R ⊂ Adj(X q , G) \ {X p } such that card(X R ) = n and (Γ rp ≥ 0 or Γ rq ≥ 0) for all r ∈ R do
                y q,p,R = CTMI(X p ; X q | X R )
                append(D, {X q , X p , X R , y q,p,R })
    Sort D by increasing order of y
    while D is not empty do
        {X q , X p , X R , y} = pop(D)
        if X p ∈ Adj(X q , G) and X R ⊂ Adj(X q , G) then
            Compute z, the p-value of CTMI(X p ; X q | X R ) given by Equation (4)
            if z > α then
                Remove edge X p − X q from G
                Sepset(p, q) = Sepset(q, p) = X R
    n = n + 1
for each triple in G, do apply PC-rule 0
while edges can still be oriented do
    for each triple in G, do apply PC-rules 1, 2, and 3
for each connected pair in G, do apply ER-rules 0 and 1
Return G
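The skeleton phase of Algorithm 1 can be sketched in Python with a pluggable conditional independence test standing in for the CTMI p-value of Equation (4). Two simplifications of this sketch: candidate tests are sorted directly by p-value rather than by CTMI score, and the Γ-condition on the conditioning sets is omitted.

```python
from itertools import combinations

def pctmi_skeleton(nodes, ci_pvalue, alpha=0.05):
    """Skeleton phase of a PCTMI-like procedure (a simplified sketch).

    ci_pvalue(p, q, cond) must return the p-value of a conditional
    independence test of X^p and X^q given X^cond (here a stand-in
    for the CTMI test of Equation (4)).
    Returns (adjacency dict, sepset dict).
    """
    adj = {v: set(nodes) - {v} for v in nodes}
    sepset = {}
    n = 0
    while any(len(adj[q]) >= n + 1 for q in nodes):
        # Collect all candidate tests at this conditioning-set size,
        # then process them in a fixed order (PC-stable-like behaviour).
        tests = []
        for q in nodes:
            if len(adj[q]) < n + 1:
                continue
            for p in sorted(adj[q]):
                for cond in combinations(sorted(adj[q] - {p}), n):
                    tests.append((ci_pvalue(p, q, cond), q, p, cond))
        tests.sort()
        for z, q, p, cond in tests:
            # Re-check adjacency: earlier removals may invalidate a test
            if p in adj[q] and set(cond) <= adj[q] and z > alpha:
                adj[q].discard(p)
                adj[p].discard(q)
                sepset[frozenset({p, q})] = set(cond)
        n += 1
    return adj, sepset
```

On a toy fork (a causes b and c), a test function that reports b ⫫ c | a recovers the fork skeleton and records Sepset(b, c) = {a}.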

Extension to Hidden Common Causes
When unobserved variables cause a variable of interest, the PC algorithm is no longer appropriate, and one needs to resort to the FCI algorithm introduced in [9], which infers a partial ancestral graph (PAG). We extend here the version of this algorithm presented in [29], without selection bias.
We recall the notations needed for FCI: a double arrow ← → indicates the presence of hidden common causes, the symbol • represents an undetermined mark, and the meta-symbol asterisk * is used as a wildcard, which can denote a tail, an arrow, or a circle. Furthermore, we make use of the notion of Possible-Dsep, which is defined as follows:

Definition 10 (Possible-Dsep [9]). X r is in the Possible-Dsep set of X p and X q if and only if X r is different from X p and X q and there is an undirected path U between X p and X r such that, for every subpath ⟨X w , X s , X v ⟩ of U, either X s is a collider on the subpath or X w and X v are adjacent in the PAG.
Lastly, we make use of the following types of paths. A discriminating path between X p and X q for X r is a path that includes at least three edges, such that each non-endpoint vertex X r on the path is adjacent to X q , X p is not adjacent to X q , and every vertex between X p and X r is a collider on the path and is a parent of X q . An uncovered path is a path where every consecutive triple on the path is unshielded. A potentially directed path is a path where no edge between two consecutive vertices is oriented against the direction of the path.
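The last two path notions translate directly into small predicates over a path and the graph. The mark encoding below (`marks[(u, v)]` gives the edge mark at v on the edge u *−* v, one of 'arrow', 'tail', or 'circle') is a hypothetical representation chosen for this sketch:

```python
def is_uncovered(path, adj):
    """A path is uncovered when every consecutive triple is unshielded,
    i.e., the two endpoints of each triple are non-adjacent."""
    return all(path[i] not in adj[path[i + 2]]
               for i in range(len(path) - 2))

def is_potentially_directed(path, marks):
    """A path is potentially directed when no edge mark points back
    toward the preceding vertex on the path."""
    return all(marks[(path[i + 1], path[i])] != 'arrow'
               for i in range(len(path) - 1))
```

For instance, the chain a − b − c − d is uncovered when a, c and b, d are non-adjacent, and a o→ b → c is potentially directed from a to c but not from c to a.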
From the skeleton obtained in Section 4.1.1, unshielded colliders are detected using the FCI-rule 0.
FCI-rule 0 (Origin of causality). For each unshielded triple X p * −• X r •− * X q , if X r ∉ Sepset(p, q), then orient the unshielded triple as a collider: X p * → X r ← * X q .
From this, Possible-Dsep sets can be constructed. As elements of Possible-Dsep sets in a PAG play a role similar to the ones of parents in a DAG, additional edges are removed by conditioning on the elements of the Possible-Dsep sets, using the same strategy as the one given in Section 4.1.1. All edges are then unoriented, and the FCI-rule 0 is again applied as some of the edges of the unshielded colliders originally detected may have been removed by the previous step. Then, as in FCI, FCI-rules 1, 2, 3, 4, 8, 9, and 10 are applied.
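A common way to compute Possible-Dsep sets is a breadth-first search over (previous, current) vertex pairs, checking the condition of Definition 10 on each one-step extension. A sketch under an assumed adjacency/collider encoding:

```python
from collections import deque

def possible_dsep(p, q, adj, colliders):
    """Possible-Dsep(p, q) following Definition 10 (a sketch).

    adj: undirected adjacency of the current PAG.
    colliders: set of triples (w, s, v) oriented as w *-> s <-* v.
    A vertex qualifies when some path from p reaches it such that every
    subpath <w, s, v> has s a collider or w adjacent to v.
    """
    out = set()
    # States are (previous, current) so each extension can check its triple
    queue = deque((p, s) for s in adj[p])
    seen = set(queue)
    while queue:
        w, s = queue.popleft()
        if s not in (p, q):
            out.add(s)
        for v in adj[s]:
            if v == w:
                continue
            ok = (w, s, v) in colliders or (v, s, w) in colliders or v in adj[w]
            if ok and (s, v) not in seen:
                seen.add((s, v))
                queue.append((s, v))
    return out
```

On the path p − a − b, b enters Possible-Dsep(p, q) only when a is a collider on the path or p and b are adjacent.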

FCI-rule 1.
In an unshielded triple X p * → X r •− * X q , if X r ∈ Sepset(p, q), then orient the unshielded triple as X p * → X r * → X q .

FCI-rule 2.
If there exists a triple X p → X r * → X q or a triple X p * → X r → X q with X p * −• X q , then orient the pair as X p * → X q .

FCI-rule 3.
If there exists an unshielded triple X p * → X r ← * X q and an unshielded triple X p * −• X s •− * X q and X s * −• X r , then orient the pair as X s → X r .

FCI-rule 4.
If there exists a discriminating path between X p and X q for X r and X r •− * X q , then, if X r ∈ Sepset(p, q), orient X r •− * X q as X r → X q ; otherwise, orient the triple as X s ← → X r ← → X q , where X s is the vertex preceding X r on the path.
FCI-rule 8.
If X p → X r → X q or X p −• X r → X q , and X p •→ X q , then orient X p •→ X q as X p → X q .

FCI-rule 9.
If X p •→ X q and there exists an uncovered potentially directed path U from X p to X q such that X q and X r , the vertex adjacent to X p on U, are not adjacent, then orient X p •→ X q as X p → X q .
FCI-rule 10. Suppose X p •→ X q , X r → X q ← X s , U 1 is an uncovered potentially directed path from X p to X r , and U 2 is an uncovered potentially directed path from X p to X s . Let µ be the vertex adjacent to X p on U 1 (µ could be X r ) and ω be the vertex adjacent to X p on U 2 (ω could be X s ). If µ and ω are distinct and are not adjacent, then orient X p •→ X q as X p → X q .
Note that we have not included FCI-rules 5, 6, and 7 from [29], as these rules deal with selection bias, a phenomenon that is not present in the datasets we consider. Including these rules in our framework is nevertheless straightforward.
Finally, as in PCTMI, we orient additional edges using an adapted version of the ER-rules.
LER-rule 0 (Latent entropy reduction-γ). In a pair X p * −• X q such that X p and X q do not have possible spurious correlations, if γ̂ pq > 0, then orient the edge as X p * → X q .
LER-rule 1 (Latent entropy reduction-λ). In a pair X p * −• X q such that X p and X q do not have possible spurious correlations, if γ̂ pq = 0 and λ̂ pq < λ̂ qp , then orient the edge as X p * → X q .
The overall process, referred to as FCITMI, is described in Algorithm 2.

Algorithm 2 FCITMI.
Require: X a d-dimensional time series of length T, γ max ∈ N the maximum number of lags, α a significance threshold
Form a complete undirected graph G = (V, E) with d nodes
n = 0
while there exists X q ∈ V such that card(Adj(X q , G)) ≥ n + 1 do
    Compute z, the p-value of CTMI(X p ; X q | X R ) given by Equation (4)
    if z > α then
        Remove edge X p − X q from G
        Sepset(p, q) = Sepset(q, p) = X R
    n = n + 1
for each triple in G, do apply FCI-rule 0
Using Possible-Dsep sets, remove edges using CTMI
Reorient all edges as •−• in G
for each triple in G, do apply FCI-rule 0
while edges can be oriented do
    for each triple in G, apply FCI-rules 1, 2, 3, 4, 8, 9, and 10
for each connected pair in G, do apply ER-rules 0 and 1
Return G

Experiments
In this section, we evaluate our method experimentally on several datasets. We first propose an extensive analysis on simulated data with equal and different sampling rates, generated from basic causal structures; we then perform an analysis on real-world datasets. To assess the quality of causal inference, we use the F1-score on directed edges in the graph, without considering self-loops.
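The evaluation metric can be made precise in a few lines: an F1-score over directed edges with self-loops excluded, so that a reversed edge counts against both precision and recall. This is our reading of the protocol, sketched as:

```python
def directed_edge_f1(true_edges, pred_edges):
    """F1-score over directed edges, ignoring self-loops.

    Edges are (cause, effect) pairs; orientation matters, so a
    reversed edge is both a false positive and a false negative.
    """
    t = {(u, v) for u, v in true_edges if u != v}
    p = {(u, v) for u, v in pred_edges if u != v}
    tp = len(t & p)
    if tp == 0:
        return 0.0
    precision = tp / len(p)
    recall = tp / len(t)
    return 2 * precision * recall / (precision + recall)
```

For example, against the true fork {a→b, a→c}, predicting {a→b, c→a} (one correct edge, one reversed) gives precision = recall = 0.5, hence F1 = 0.5.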
In the following, we first describe the different settings of the methods we compare with and the datasets, and then, we describe the results and provide a complexity analysis.

Methods and Their Use
We compared our method PCTMI, available at https://github.com/ckassaad/PCTMI (accessed on 30 June 2022), with several methods. From the Granger family, we ran MVGC through the implementation available at http://www.sussex.ac.uk/sackler/mvgc/ (accessed on 30 June 2022) and TCDF through the implementation available at https://github.com/M-Nauta/TCDF (accessed on 30 June 2022). For TCDF, some hyperparameters have to be defined: we used a kernel of size 4, a dilation coefficient of 4, one hidden layer, a learning rate of 0.01, and 5000 epochs. From the score-based family, we retained Dynotears, which is available at https://github.com/quantumblacklabs/causalnex (accessed on 30 June 2022), the hyperparameters of which were set to their recommended values (the regularization constants λ W = λ A = 0.05). Among the noise-based approaches, we ran TiMINo, which is available at http://web.math.ku.dk/~peters/code.html (accessed on 30 June 2022); TiMINo uses a test based on cross-correlation. From the constraint-based family, we ran PCMCI using mutual information to measure dependence, as provided in the implementation available at https://github.com/jakobrunge/tigramite (accessed on 30 June 2022).
We compared FCITMI with tsFCI, provided at https://sites.google.com/site/dorisentner/publications/tsfci (accessed on 30 June 2022), where independence or conditional independence is tested, respectively, with tests of zero correlation or zero partial correlation.
For all methods using mutual information, we used the k-nearest neighbor estimator [26], for which we fixed the number of nearest neighbors to k = 10. Since the outputs of these measures are necessarily positive given a finite sample size and a finite numerical precision, we used the significance permutation test introduced in [27]. For all methods, we used γ max = 5, and when performing a statistical test, we used a significance level of 0.05.
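The permutation test can be sketched as follows; the histogram plug-in estimator below is only a stand-in for the kNN estimator of [26], chosen to keep the example self-contained:

```python
import numpy as np

def plugin_mi(x, y, bins=4):
    """Histogram plug-in MI estimator (a simple stand-in for kNN MI)."""
    cxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = cxy / cxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px * py)[nz])).sum())

def mi_permutation_pvalue(x, y, mi_estimator=plugin_mi, n_perm=99, seed=0):
    """One-sided permutation test for a mutual information estimate.

    Finite-sample MI estimates stay positive even under independence,
    so significance is judged against a null built by permuting y.
    """
    rng = np.random.default_rng(seed)
    observed = mi_estimator(x, y)
    null = [mi_estimator(x, rng.permutation(y)) for _ in range(n_perm)]
    # Add-one smoothing keeps the p-value away from exactly zero
    return (1 + sum(m >= observed for m in null)) / (1 + n_perm)
```

With n_perm permutations, the smallest attainable p-value is 1/(n_perm + 1), so n_perm must be chosen large enough for the significance level used.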

Datasets
To illustrate the behavior of the causal inference algorithms, we relied on both artificial and real-world datasets.

Simulated data
In the case of causal sufficiency, we simulated time series of length 1000 with equal and different sampling rates, generated from three different causal structures (fork, v structure, and diamond), represented in Table 1. In the case of non-causal sufficiency, we considered the 7ts2h structure represented in Table 2.
For each structure, we generated 10 datasets using the following generating process: for all q, X q 0 = 0; for all t > 0, each X q t is a function of its causes, where f is a nonlinear function chosen uniformly at random among the absolute value, tanh, sine, and cosine.
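As an illustration, a fork-structure dataset in the spirit of this process can be generated as follows. The lags, coefficients, and noise level are assumptions of this sketch, since the paper's exact generating equation is not reproduced here:

```python
import numpy as np

def generate_fork(T=1000, noise=0.1, seed=0):
    """Generate one fork dataset: X1 causes X2 and X3 (a sketch).

    Series start at 0; each value is a nonlinear function f, drawn
    uniformly among |.|, tanh, sin, cos, of the lagged cause, plus
    Gaussian noise. Lag 1 and coefficient 0.5 are assumptions.
    """
    rng = np.random.default_rng(seed)
    fs = [np.abs, np.tanh, np.sin, np.cos]
    f12, f13, f_self = (fs[i] for i in rng.integers(0, 4, size=3))
    x1 = np.zeros(T)
    x2 = np.zeros(T)
    x3 = np.zeros(T)
    for t in range(1, T):
        x1[t] = 0.5 * f_self(x1[t - 1]) + noise * rng.normal()  # self-cause
        x2[t] = 0.5 * f12(x1[t - 1]) + noise * rng.normal()     # X1 -> X2
        x3[t] = 0.5 * f13(x1[t - 1]) + noise * rng.normal()     # X1 -> X3
    return x1, x2, x3
```

Running the generator 10 times with different seeds mirrors the paper's protocol of 10 datasets per structure.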

Table 1. The fork, v structure, and diamond causal structures.

Real Data
First, we considered the realistic functional magnetic resonance imaging (fMRI) benchmark, which contains blood-oxygen-level dependent (BOLD) datasets [30] for 28 different underlying brain networks. It measures the neural activity of different regions of interest in the brain based on the change of blood flow. There are 50 regions in total, each with its own associated time series. Since not all existing methods can handle 50 time series, datasets with more than 10 time series were excluded. In total, we were left with 26 datasets containing between 5 and 10 brain regions.
Second, we considered time series collected from an IT monitoring system with a one-minute sampling rate provided by EasyVista (https://www.easyvista.com/fr/produits/servicenav, accessed on 30 June 2022). In total, we considered 24 datasets with 1000 timestamps. Each dataset contains three time series: the metric extraction (M.E), which represents the activity of the extraction of the metrics from the messages; the group history insertion (G.H.I), which represents the activity of the insertion of the historical status in the database; and the collector monitoring information (C.M.I), which represents the activity of the updates in a given database. Lags between time series are unknown, as is the existence of self-causes. According to the domain experts, all these datasets follow a fork structure such that metric extraction is a common cause of the other two time series.

Simulated Data
We provide in Table 3 the performance of all methods on simulated data with an equal sampling rate. PCTMI has better results than the other methods for the v structure and the fork structure. For the diamond structure, PCTMI and PCMCI have the same performance; however, PCTMI has a lower standard deviation. In general, constraint-based algorithms perform best.

Table 3. Results for simulated datasets with an equal sampling rate. We report the mean and the standard deviation of the F1-score. The best results are in bold.

We also assess the behavior of PCTMI when the time series have different sampling rates in Table 4. We present results only for PCTMI here, because the other methods are not applicable to time series with different sampling rates. As one can see, the performance obtained here is close to the one obtained with equal sampling rates in the case of the v structure, but significantly lower in the case of the fork structure and the diamond. The degradation of the results is not really surprising, as one has less data to rely on in a different-sampling-rate scenario. By comparing the results of PCTMI on time series with different sampling rates with the results of other methods obtained when time series have an equal sampling rate, we can see that PCTMI still performs better than most methods. In the case of causal non-sufficiency, our proposed algorithm FCITMI outperforms tsFCI, as shown in Table 5.

Table 5. Results for simulated datasets with an equal sampling rate and with hidden common causes. We report the mean and the standard deviation of the F1-score.

Real Data
We provide in Table 6 the results on the fMRI benchmark and the IT benchmark for all the methods. In the case of fMRI, VarLiNGAM performs best, followed by Dynotears and, then, by PCTMI and TiMINo. The success of linear methods (VarLiNGAM and Dynotears) on this benchmark might suggest that causal relations in fMRI are linear, especially as this result was not replicated on the other datasets considered in this paper. In the case of the IT datasets, TiMINo performs best, followed by PCTMI; however, while investigating the results, we noticed that for 6/10 datasets, TiMINo returns a fully connected summary graph with arcs oriented in both directions. In both benchmarks, PCTMI clearly outperforms the other constraint-based methods. In the case of the IT datasets, we also provide in Table 7, for each dataset, the summary causal graphs inferred by the methods that have an F1-score > 0 in Table 6. Here, we can see that in 5 out of 10 datasets, PCTMI was able to infer that M.E is a cause of G.H.I. On the other hand, PCTMI was able to infer that M.E is a cause of C.M.I in only one dataset. Interestingly, in 9 out of 10 datasets, PCTMI did not infer any false positive. As one can expect from Table 6, PCMCI suffers on all datasets; however, one can notice that, as PCTMI, PCMCI has a tendency to yield sparse graphs. TiMINo, which has the best F1-score in Table 6, seems to have a very low precision: in 7 out of 10 datasets, TiMINo returned a complete bidirected graph; in the other three datasets, TiMINo detected all true causal relations, in addition to one false positive. VarLiNGAM has an even lower precision compared to TiMINo and a lower recall. Finally, MVGC infers most of the time a complete bidirected graph.

Complexity Analysis

Constraint-based algorithms that operate on the full window causal graph must test all lagged variables, so their worst-case number of operations grows exponentially with k, where k represents the maximal degree of any vertex and γ max is the maximum number of lags. Each operation consists of conducting a significance test on a conditional independence measure.
Algorithms adapted to time series, such as PCMCI [7], rely on time information to reduce the number of tests. PCTMI reduces the number of tests even more, since it infers a summary graph and thus only considers d nodes rather than all lagged variables. Figure 6 provides an empirical illustration of the difference in complexity of the two approaches on the three structures (v structure, fork, diamond), sorted according to their number of nodes, their maximal out-degree, and their maximal in-degree. The time is given in seconds. As one can note, PCTMI is always faster than PCMCI, the difference being more important when the structure to be inferred is complex.

Hyperparameters' Analysis
Here, we provide a hyperparameter analysis on CTMI with respect to the maximum lag γ max and the number of nearest neighbors k. Using the same 10 datasets generated from the fork structure in Table 1, as described in Section 5.2.1, we performed two experiments.
First, we computed the average and the standard deviation of CTMI(X 2 ; X 3 ) and CTMI(X 2 ; X 3 | X 1 ) over the 10 datasets while varying γ max between 3 and 10. The results given in Figure 7a show that CTMI is robust with respect to γ max both in the case of dependence and in the case of conditional independence: the mean of CTMI(X 2 ; X 3 ) remains the same for γ max = 3, 4, 5 and slightly decreases for γ max = 10, and the standard deviation remains the same for γ max = 3, 4 and slightly increases for γ max = 5, 10. On the other hand, the mean and the standard deviation of CTMI(X 2 ; X 3 | X 1 ) remain the same for all γ max .

Discussion and Conclusions
We addressed in this paper the problem of learning a summary causal graph on time series with equal or different sampling rates. To do so, we first proposed a new temporal mutual information measure defined on a window-based representation of time series. We then showed how this measure relates to an entropy reduction principle, which can be seen as a special case of the probability raising principle. We finally combined these two ingredients in PC-like and FCI-like algorithms to construct the summary causal graph. Our method proved to work well with a small time complexity in comparison with similar approaches that use mutual information. The main limitations of our methods are that they cannot orient causal relations that are not oriented by the classical PC-rules or FCI-rules and that have possible spurious correlations, and that they assume acyclicity in summary causal graphs.