# Causal Discovery with Attention-Based Convolutional Neural Networks


## Abstract


## 1. Introduction

- We present a new temporal causal discovery method (TCDF) that uses attention-based CNNs to discover **causal relationships** in time series data, to discover the **time delay** between each cause and effect, and to construct a **temporal causal graph** of causal relationships with delays.
- We evaluate TCDF and several other temporal causal discovery methods on two benchmarks: financial data describing stock returns, and FMRI data measuring brain blood flow.

## 2. Problem Statement

- The method should distinguish **direct** from **indirect** causes. Vertex ${v}_{i}$ is seen as an indirect cause of ${v}_{j}$ if ${e}_{i,j}\notin \mathcal{G}$ and if there is a two-edge path $p=\langle {v}_{i},{v}_{k},{v}_{j}\rangle \in \mathcal{G}$ (Figure 2a). Pairwise methods, i.e., methods that only find causal relationships between two variables, are often unable to make this distinction [10]. In contrast, multivariate methods take all variables into account to distinguish between direct and indirect causality [11].
- The method should learn **instantaneous** causal effects, where the delay between cause and effect is 0 time steps. Neglecting instantaneous influences can lead to misleading interpretations [13]. In practice, instantaneous effects mostly occur when cause and effect refer to the same time step that cannot be causally ordered a priori because of a too coarse time scale.
- The presence of a **confounder**, a common cause of at least two variables, is a well-known challenge for causal discovery methods (Figure 2b). Although confounders are quite common in real-world situations, they complicate causal discovery since the confounder’s effects (${\mathbf{X}}_{2}$ and ${\mathbf{X}}_{3}$ in Figure 2b) are correlated but not causally related. Especially when the delays between the confounder and its effects are unequal, one should be careful not to incorrectly include a causal relationship between the confounder’s effects (the grey edge in Figure 2b).
- A particular challenge occurs when a confounder is not observed (a **hidden** (or latent) **confounder**). Although it might not even be known how many hidden confounders exist, it is important that a causal discovery method can hypothesise the existence of a hidden confounder to prevent learning an incorrect causal relation between its effects.

## 3. Related Work

#### 3.1. Temporal Causal Discovery

**Granger Causality** (GC) [24] is one of the earliest methods developed to quantify the causal effects between two time series. Time series ${\mathbf{X}}_{i}$ Granger causes time series ${\mathbf{X}}_{j}$ if the future value of ${\mathbf{X}}_{j}$ (at time $t+1$) can be better predicted by using both the values of ${\mathbf{X}}_{i}$ and ${\mathbf{X}}_{j}$ up to time t than by using only the past values of ${\mathbf{X}}_{j}$ itself. Since pairwise methods cannot correctly handle indirect causal relationships, conditional Granger causality takes a third time series into account [25]. However, in practice not all relevant variables may be observed, and GC cannot correctly deal with unmeasured time series, including hidden confounders [4]. In the system identification domain, this limitation is overcome with sparse plus low-rank (S + L) networks that include an extra layer in a causal graph to explicitly model hidden variables (called factors) [26]. Furthermore, GC only captures the linear interdependencies between time series. Various extensions have been made to nonlinear and higher-order causality, e.g., [27,28]. A recent extension that outperforms other GC methods is based on conditional copula, which allows dissociating the marginal distributions from their joint density distribution to focus only on the statistical dependence between variables [10].
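As an illustration of this core idea, the following sketch (in Python with NumPy; the `granger_ratio` helper and the bivariate, zero-mean setup are illustrative assumptions, not code from any of the cited methods) compares the residual error of predicting ${\mathbf{X}}_{j}$ from its own past with the error when the past of ${\mathbf{X}}_{i}$ is added:

```python
import numpy as np

def granger_ratio(x_i, x_j, lag=2):
    """Ratio of residual sums of squares for predicting x_j with vs.
    without the past of x_i. A ratio well below 1 suggests that x_i
    Granger-causes x_j. (Toy version: least squares on zero-mean data;
    a proper test would use an F-statistic and check stationarity.)"""
    T = len(x_j)
    y = x_j[lag:]

    def lags(x):
        # Columns are x[t-1], ..., x[t-lag], aligned with y = x_j[t].
        return np.column_stack([x[lag - k : T - k] for k in range(1, lag + 1)])

    X_restricted = lags(x_j)                    # past of x_j only
    X_full = np.hstack([lags(x_j), lags(x_i)])  # past of x_j and x_i

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return float(np.sum((y - X @ beta) ** 2))

    return rss(X_full) / rss(X_restricted)
```

On synthetic data where ${\mathbf{X}}_{j}$ is driven by ${\mathbf{X}}_{i}$ with a one-step delay, the ratio drops far below 1 in the causal direction and stays near 1 in the reverse direction.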

**Constraint-based Time Series approaches** are often adapted versions of non-temporal causal graph discovery algorithms. The temporal precedence constraint reduces the search space of the causal structure [29]. The well-known algorithms PC and FCI both have a time series version: PCMCI [8] and tsFCI [21], respectively. PC [30] makes use of a series of tests to efficiently explore the whole space of Directed Acyclic Graphs (DAGs). FCI [30] can, contrary to PC, deal with hidden confounders by using independence tests. Both temporal algorithms require stationary data. The Additive Non-linear Time Series Model (ANLTSM) [20] does causal discovery in both linear and non-linear time series data, and can also deal with hidden confounders. It uses statistical tests based on additive model regression.

**Structural Equation Model approaches** assume that a causal system can be represented by a Structural Equation Model (SEM) that describes a variable ${\mathbf{X}}_{j}$ as a function of other variables ${\mathbf{X}}_{-j}$ and an error term ${\epsilon}_{j}$ to account for additive noise, such that ${\mathbf{X}}_{j}:=f({\mathbf{X}}_{-j},{\epsilon}_{j})$ [29]. It assumes that the set ${\mathbf{X}}_{-j}$ is jointly independent. TiMINo [22] discovers a causal relationship from ${\mathbf{X}}_{i}$ to ${\mathbf{X}}_{j}$ ($j\ne i$) if the coefficient of ${X}_{i}^{t}$ is nonzero for any t. Self-causation is not discovered. TiMINo remains undecided if the direct causes of ${\mathbf{X}}_{i}$ are not independent, instead of drawing possibly wrong conclusions. TiMINo is not suitable for large datasets, since small differences between the data and the fitted model may lead to failed independence tests. VAR-LiNGAM [13] is a restricted SEM that makes additional assumptions on the data distribution and combines a non-Gaussian instantaneous model with autoregressive models.
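A minimal synthetic instance of such a model may make this concrete (the $\tanh$ nonlinearity, the delay of one time step and the noise scale are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200

# Toy SEM with additive noise: X1 is exogenous; X2 at time t is a
# nonlinear function of X1 at time t-1 plus its own error term,
# i.e., X2 := f(X_{-2}, eps_2) with f = tanh and a delay of 1.
eps1 = rng.standard_normal(T)
eps2 = 0.1 * rng.standard_normal(T)
x1 = eps1
x2 = np.zeros(T)
x2[1:] = np.tanh(x1[:-1]) + eps2[1:]
```

Under the SEM assumptions, the error terms are jointly independent, which is what independence tests such as those used by TiMINo exploit.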

**Information-theoretic approaches** for temporal causal discovery exist, such as (mutual) shifted directed information [23] and transfer entropy [11]. Their main advantage is that they are model free and are able to detect both linear and non-linear dependencies [19]. The universal idea is that ${\mathbf{X}}_{i}$ is likely a cause of ${\mathbf{X}}_{j}$, $i\ne j$, if ${\mathbf{X}}_{j}$ can be better sequentially compressed given the past of both ${\mathbf{X}}_{i}$ and ${\mathbf{X}}_{j}$ than given the past of ${\mathbf{X}}_{j}$ alone. Transfer entropy cannot, contrary to directed information [31], deal with non-stationary time series. Partial Symbolic Transfer Entropy (PSTE) [11] overcomes this limitation, but is not effective when only linear causal relationships are present.

**Causal Significance** is a causal discovery framework that calculates a causal significance measure $\alpha (c,e)$ for a specific cause-effect pair by isolating the impact of cause c on effect e [9]. It also discovers the time delay and impact of a causal relationship. The method assumes that causal relationships are linear and additive, and that all causes are observed. However, the authors experimentally demonstrate that low false discovery and false negative rates are achieved even if some of these assumptions are violated.

Our **Deep Learning approach** uses neural networks to learn a function for time series prediction. Although learning such a function is comparable to fitting an SEM, the interpretation of the coefficients is different (Section 4.2). Furthermore, we apply a validation step that is to some extent comparable to conditional Granger causality: instead of removing a variable, we randomly permute its values (Section 4.3).
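The permutation idea can be sketched generically; the helper below and its `predict` interface are illustrative assumptions, not TCDF's actual implementation:

```python
import numpy as np

def permutation_loss_increase(predict, X, y, idx, rng):
    """Increase in mean-squared prediction error after randomly
    permuting input series `idx` (a row of X). A large increase
    suggests the permuted series matters for predicting the target;
    an increase near zero suggests it does not."""
    base = float(np.mean((predict(X) - y) ** 2))
    X_perm = X.copy()
    rng.shuffle(X_perm[idx])          # permute one input series in place
    perm = float(np.mean((predict(X_perm) - y) ** 2))
    return perm - base
```

Unlike dropping the variable and refitting (as in conditional Granger causality), permuting keeps the fitted model intact and only destroys the temporal alignment of one input.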

#### 3.2. Deep Learning for Non-Temporal Causal Discovery

#### 3.3. Time Series Prediction

#### 3.4. Attention Mechanism in Neural Networks

## 4. TCDF—Temporal Causal Discovery Framework

#### 4.1. The Architecture for Time Series Prediction

#### 4.1.1. Dilations

#### 4.1.2. Adaptation for Discovering Self-Causation

#### 4.1.3. Adaptation for Multivariate Causal Discovery

#### 4.1.4. The Attention Mechanism

#### 4.1.5. Residual Connections

#### 4.2. Attention Interpretation

- We require that ${\tau}_{j}\ge 1$, since all scores are initialized at 1 and a score will only be increased through backpropagation if the network attends to that time series.
- Since a temporal causal graph is usually sparse, we require that the gap selected for ${\tau}_{j}$ lies in the first half of $\mathbf{G}$ (if $N>5$) to ensure that the algorithm does not include low attention scores in the selection. At most 50% of the input time series can be a potential cause of target ${\mathbf{X}}_{j}$. By this requirement, we limit the number of time series labeled as potential causes. Although this number can be configured, we experimentally estimated that 50% gives good results.
- We require that the gap for ${\tau}_{j}$ cannot be in the first position (i.e., between the highest and second-highest attention score). This ensures that the algorithm does not truncate to zero the scores for time series which were actually a cause of the target time series, but were weaker than the top scorer. Thus, the potential causes ${\mathbf{P}}_{j}$ for target ${\mathbf{X}}_{j}$ will include at least two time series.
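These three requirements can be combined into a single selection routine; the sketch below is an interpretation of the rules above (tie-breaking and the handling of very short score lists are assumptions):

```python
def potential_causes(scores, names):
    """Select potential causes of a target from attention scores, using
    the largest gap in the sorted scores as threshold tau.

    Encodes the three requirements above: tau >= 1, the gap lies in the
    first half of the sorted scores when N > 5, and the gap may not sit
    between the highest and second-highest score.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [scores[i] for i in order]
    n = len(ranked)
    if n < 3:
        return []
    # Candidate gap positions g: the gap between ranked[g-1] and ranked[g].
    last = n // 2 if n > 5 else n - 1
    best_g, best_gap = None, -1.0
    for g in range(2, last + 1):          # g == 1 is excluded by design
        gap = ranked[g - 1] - ranked[g]
        if gap > best_gap:
            best_gap, best_g = gap, g
    tau = ranked[best_g - 1]              # lowest score above the gap
    if tau < 1:                           # scores are initialized at 1
        return []
    return [names[i] for i in order[:best_g]]
```

For example, scores `[3.2, 2.9, 1.4, 0.2, 0.1, 0.05]` place the largest admissible gap between the second and third score, so only the top two series are labeled potential causes.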

- ${h}_{i,j}=0$ and ${h}_{j,i}=0$: ${\mathbf{X}}_{i}$ is not correlated with ${\mathbf{X}}_{j}$ and vice versa.
- ${h}_{i,j}=0$ and ${h}_{j,i}>0$: ${\mathbf{X}}_{j}$ is added to ${\mathbf{P}}_{i}$ since ${\mathbf{X}}_{j}$ is a potential cause of ${\mathbf{X}}_{i}$ because of:
  - (a) (In)direct causal relation from ${\mathbf{X}}_{j}$ to ${\mathbf{X}}_{i}$, or
  - (b) Presence of a (hidden) confounder between ${\mathbf{X}}_{j}$ and ${\mathbf{X}}_{i}$ where the delay from the confounder to ${\mathbf{X}}_{j}$ is smaller than the delay to ${\mathbf{X}}_{i}$.

- ${h}_{i,j}>0$ and ${h}_{j,i}=0$: ${\mathbf{X}}_{i}$ is added to ${\mathbf{P}}_{j}$ since ${\mathbf{X}}_{i}$ is a potential cause of ${\mathbf{X}}_{j}$ because of:
  - (a) (In)direct causal relation from ${\mathbf{X}}_{i}$ to ${\mathbf{X}}_{j}$, or
  - (b) Presence of a (hidden) confounder between ${\mathbf{X}}_{i}$ and ${\mathbf{X}}_{j}$ where the delay from the confounder to ${\mathbf{X}}_{i}$ is smaller than the delay to ${\mathbf{X}}_{j}$.

- ${h}_{i,j}>0$ and ${h}_{j,i}>0$: ${\mathbf{X}}_{i}$ is added to ${\mathbf{P}}_{j}$ and ${\mathbf{X}}_{j}$ is added to ${\mathbf{P}}_{i}$ because of:
  - (a) Presence of a 2-cycle where ${\mathbf{X}}_{i}$ causes ${\mathbf{X}}_{j}$ and ${\mathbf{X}}_{j}$ causes ${\mathbf{X}}_{i}$, or
  - (b) Presence of a (hidden) confounder with equal delays to its effects ${\mathbf{X}}_{i}$ and ${\mathbf{X}}_{j}$.
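The case analysis above amounts to a small decision table; the following sketch (names and return format are illustrative) maps a pair of truncated attention scores to potential-cause edges and the hypotheses that could explain them:

```python
def interpret(h_ij, h_ji):
    """Map the truncated attention scores h_{i,j} (attention that the
    network predicting X_j pays to X_i) and h_{j,i} to the hypotheses
    enumerated above. Returns (potential-cause edges, explanations)."""
    if h_ij == 0 and h_ji == 0:
        return [], ["X_i and X_j are not correlated"]
    if h_ij > 0 and h_ji == 0:
        return [("X_i", "X_j")], ["(in)direct cause X_i -> X_j",
                                  "confounder with smaller delay to X_i"]
    if h_ij == 0 and h_ji > 0:
        return [("X_j", "X_i")], ["(in)direct cause X_j -> X_i",
                                  "confounder with smaller delay to X_j"]
    # Both positive: 2-cycle, or a (hidden) confounder with equal delays.
    return [("X_i", "X_j"), ("X_j", "X_i")], [
        "2-cycle between X_i and X_j",
        "(hidden) confounder with equal delays to its effects"]
```

The attention scores alone cannot disambiguate the competing explanations within each case; that is the task of the causal validation step (Section 4.3).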

#### 4.3. Causal Validation

- Temporal precedence: the cause precedes its effect,
- Physical influence: manipulation of the cause changes its effect.

#### 4.3.1. Permutation Importance Validation Method

#### 4.3.2. Dealing with Hidden Confounders

#### 4.4. Delay Discovery

## 5. Experiments

#### 5.1. Data Sets

The first benchmark, `FINANCE`, contains datasets for 10 different causal structures of financial markets [2]. For our experiments, we exclude the dataset without any causal relationships (since this would result in an F1-score of 0). The datasets are created using the Fama–French Three-Factor Model [57], which can be used to describe stock returns based on the three factors ‘volatility’, ‘size’ and ‘value’. A portfolio’s return ${X}_{i}^{t}$ depends on these three factors at time t plus a portfolio-specific error term [2]. We use one of the two 4000-day observation periods for each financial portfolio.

The second benchmark, `FINANCE HIDDEN`, contains four datasets. Each dataset corresponds to either dataset ‘20-1A’ or ‘40-1-3’ from `FINANCE`, except that one time series is hidden by replacing all its values by 0. Figure 11 shows the underlying causal structures, in which a grey node denotes a hidden confounder. As can be seen, we test TCDF on hidden confounders with both equal and unequal delays to their effects. To evaluate the predictive ability of TCDF, we created training datasets corresponding to the first 80% of the datasets and utilized the remaining 20% for testing. These datasets are referred to as `FINANCE TRAIN/TEST`.

The third benchmark, `FMRI`, contains realistic, simulated BOLD (Blood-oxygen-level dependent) datasets for 28 different underlying brain networks [58]. BOLD FMRI measures the neural activity of different regions of interest in the brain based on the change of blood flow. Each region (i.e., node in the brain network) has its own associated time series. Since not all existing methods can handle 50 time series, we excluded one dataset with 50 nodes. For each of the remaining 27 brain networks, we selected one dataset (scanning session) out of multiple available. All time series have a hidden external input, white noise, and are fed through a non-linear balloon model [59].

Since `FMRI` contains only six (out of 27) datasets with ‘long’ time series, we create an extra benchmark that is a subset of `FMRI`. This subset contains only datasets in which the time series have at least 1000 time steps (and which, coincidentally, are all stationary), and is therefore denoted as `FMRI $T>1000$`. To evaluate the predictive ability of TCDF, we created a training and test set corresponding to the first 80% and the last 20% of the datasets, respectively, referred to as `FMRI TRAIN/TEST` and `FMRI $T>1000$ TRAIN/TEST`.

#### 5.2. Experimental Setup

**TCDF**: All AD-DSTCNs use the Mean Squared Error as loss function and the Adam optimization algorithm, an extension of stochastic gradient descent [60]. This optimizer computes individual adaptive learning rates for each parameter, which allows the gradient descent to find the minimum more accurately. Furthermore, in all experiments we train our AD-DSTCNs for 5000 training epochs, with learning rate $\lambda =0.01$, dilation coefficient $c=4$ and kernel size $K=4$. We chose K such that the delays in the ground truth fall within the receptive field R. We vary the number of hidden layers in the depthwise convolution between $L=0$, $L=1$ and $L=2$ to evaluate how the number of hidden layers influences the framework’s accuracy. Note that increasing the number of hidden layers leads to an increased receptive field (according to Equation (2)) and therefore an increased maximum delay.
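Equation (2) itself is not reproduced in this excerpt. Assuming the standard receptive-field formula for stacked dilated convolutions, $R = 1 + (K-1)\sum_{l=0}^{L} c^{l}$ (an assumption, chosen because it reproduces the maximum delay of three time steps used for the baselines with $K=4$ and $L=0$), the effect of adding hidden layers can be checked numerically:

```python
def receptive_field(K, c, L):
    """Receptive field of a stack of dilated causal convolutions with
    kernel size K, dilation coefficient c and L hidden layers (L + 1
    convolutional layers in total, with dilations c**0, ..., c**L).
    Assumed here to correspond to the paper's Equation (2)."""
    return 1 + (K - 1) * sum(c ** l for l in range(L + 1))

# The maximum discoverable delay is R - 1:
print(receptive_field(4, 4, 0) - 1)   # L=0 -> max delay 3
print(receptive_field(4, 4, 1) - 1)   # L=1 -> max delay 15
print(receptive_field(4, 4, 2) - 1)   # L=2 -> max delay 63
```

This illustrates why each extra hidden layer multiplies the maximum delay that TCDF can discover.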

**PCMCI**: We used the authors’ implementation from the Python Tigramite module [8]. We set the maximum delay to three time steps and the minimum delay to 0, equivalent to the minimum and maximum delay that can be found by TCDF in our AD-DSTCNs with $K=4$ and $L=0$. We use the ParCorr independence test for linear partial correlation. (Besides the linear ParCorr independence test, the authors present the non-linear GPACE test to discover non-linear causal relationships [8]. However, since GPACE scales as $\sim {T}^{3}$, we apply the linear ParCorr test for computational reasons.) We let PCMCI optimize the significance level with the Akaike Information Criterion.

**tsFCI**: We set the maximum delay to three time steps, equivalent to the maximum delay that can be found by TCDF in our AD-DSTCNs with $K=4$ and $L=0$. We experimented with a p-value cutoff $\in \{0.001,0.01,0.1\}$ and chose $0.01$ because it gave the best results (and is also the default setting). Since tsFCI is in theory conservative [21], we applied the majority rule to make tsFCI slightly less conservative. We only take the discovered direct causes into account and disregard other edge types, which denote uncertainty or the presence of a hidden confounder. Only in the experiment to discover hidden confounders do we look at all edge types.

**TiMINo**: We set the maximum delay to 3, equivalent to the maximum delay that can be found by TCDF in our AD-DSTCNs with $K=4$ and $L=0$. We assumed a linear time series model, including instantaneous effects and shifted time series. (The authors present two other variants besides the linear model, of which ‘TiMINo-GP’ was shown to be more suitable for time series with more than 300 time steps [22], but only the linear model was fully implemented by the authors.) We experimented with significance level $\in \{0.05,0.01,0.001\}$. However, TiMINo did not give any result for any of these significance levels. Therefore, we set the significance level to 0 such that TiMINo always outputs a DAG.

#### 5.3. Evaluation Measures

To evaluate the **prediction performance** for time series, we report the mean absolute scaled error (MASE), since it is invariant to the scale of the time series values and is stable for values close to zero (as opposed to the mean percentage error) [61].

We evaluate the **discovered causal relationships** in the learnt graph ${\mathcal{G}}_{L}$ by looking at the presence and absence of directed edges compared to the ground truth graph ${\mathcal{G}}_{G}$. Since causality is asymmetric, all edges are directed. We used the standard evaluation measures precision and recall, defined in terms of True Positives (TP: edges present in both ${\mathcal{G}}_{L}$ and ${\mathcal{G}}_{G}$), False Positives (FP: edges present in ${\mathcal{G}}_{L}$ but not in ${\mathcal{G}}_{G}$) and False Negatives (FN: edges present in ${\mathcal{G}}_{G}$ but not in ${\mathcal{G}}_{L}$), following the usual definitions from graph comparison.

We evaluate the **discovered delay** $d({e}_{i,j}\in {\mathcal{G}}_{L})$ between cause ${\mathbf{X}}_{i}$ and effect ${\mathbf{X}}_{j}$ by comparing it to the full ground truth delay $d({e}_{i,j}\in {\mathcal{G}}_{F})$. By comparing against the full ground truth, we not only evaluate the delay of direct causal relationships, but can also evaluate whether the discovered delay of indirect causal relationships is correct. The ground truth delay of an indirect causal relationship is the sum of the delays of its direct relationships. We only evaluate the delay of True Positive edges, since the other edges do not exist in both the full ground truth graph ${\mathcal{G}}_{F}$ and the learnt graph ${\mathcal{G}}_{L}$. We measure the percentage of correct delays on correctly discovered edges w.r.t. the full ground-truth graph.
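The edge-level comparison can be made concrete with a small helper (graphs represented as sets of directed edges; an illustrative sketch, not the authors' evaluation code):

```python
def edge_scores(learnt, ground_truth):
    """Precision, recall and F1 over directed edges, comparing a learnt
    graph to a ground truth graph. Both graphs are given as sets of
    (cause, effect) tuples, so direction matters."""
    tp = len(learnt & ground_truth)   # edges present in both graphs
    fp = len(learnt - ground_truth)   # edges only in the learnt graph
    fn = len(ground_truth - learnt)   # edges only in the ground truth
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Computing the same scores against the full ground truth (including indirect edges) gives the F1'-score variant used in the results.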

We evaluate **PIVM effectiveness** by calculating the relative increase (or decrease) of the F1-score and F1’-score when PIVM is applied compared to when it is not. The goal of the Permutation Importance Validation Method (PIVM) is to label a subset of the potential causes as true causes.

We evaluate TCDF’s ability to detect a **hidden confounder** between two time series by applying it to the `FINANCE HIDDEN` benchmark and counting how many hidden confounders were discovered. As discussed in Section 4.3.2, TCDF should be able to discover the existence of a hidden confounder between two time series ${\mathbf{X}}_{i}$ and ${\mathbf{X}}_{j}$ when the confounder has equal delays to its effects ${\mathbf{X}}_{i}$ and ${\mathbf{X}}_{j}$. If the confounder has unequal delays to its effects, we expect that TCDF will discover an incorrect causal relationship between ${\mathbf{X}}_{i}$ and ${\mathbf{X}}_{j}$. We therefore evaluate not only how many hidden confounders were discovered, but also how many incorrect causal relationships were learnt between the confounder’s effects.

#### 5.4. Results

#### 5.4.1. Overall Performance

The prediction results for `FINANCE TRAIN`, `FMRI TRAIN` and `FMRI $T>1000$ TRAIN` are shown in Table 3. TCDF ($L=0$) predicts time series well: since MASE $<1$, TCDF gives, on average, smaller errors than a naïve method. The good results on `FMRI $T>1000$` show that (too) short time series in `FMRI`, combined with a more complex architecture in TCDF ($L=1$, $L=2$), are probably the reason that the prediction accuracy of TCDF decreases.

On the `FINANCE` benchmark, TCDF outperforms the other methods. Especially the F1’-score of TCDF is much higher, indicating that a substantial part of the False Positives of TCDF are correct indirect causes. Since Deep Learning models have many parameters that need to be fit during training and therefore usually need more data than models with a less complex hypothesis space [62], TCDF performs slightly worse on the `FMRI` benchmark compared to `FINANCE` because of some short time series in `FMRI`. Whereas all datasets in `FINANCE` contain 4000 time steps, `FMRI` contains only six (out of 27) datasets with more than 1000 time steps. The results for TCDF when applied only to datasets with $T>1000$ are therefore better than the overall average over all datasets. For `FMRI $T>1000$`, our results are slightly better than the performance of PCMCI, and TCDF clearly outperforms tsFCI and TiMINo. PCMCI is not affected by time series length and performs comparably on both FMRI benchmarks. TiMINo performs very poorly when applied to `FINANCE` and only slightly better on `FMRI`, which is mainly due to a large number of False Positives. TiMINo’s poor results are in line with the results of its authors, who already stated that TiMINo is not suitable for high-dimensional data [22]. Where TiMINo discovers many incorrect causal relationships, tsFCI, in contrast, seems to be too conservative, missing many causal relationships in all benchmarks. Our poor results for tsFCI correspond with the poor results of tsFCI in experiments done by its authors on continuous data [21]. In terms of computation time, PCMCI and tsFCI are faster than TCDF on both benchmarks, as shown in Table 5.

Since `FMRI` does not explicitly include delays and therefore does not have a delay ground truth, we only evaluate delays on `FINANCE`. PCMCI discovered all delays correctly, closely followed by tsFCI and TCDF. Note that TiMINo only outputs causal relationships without delays. This experiment suggests that our delay discovery algorithm not only performs well without hidden layers (which makes delay discovery relatively easy), but also keeps the percentage of correctly discovered delays relatively high when the number of hidden layers L (and therefore the number of kernels, the receptive field and the maximum delay) is increased. Thus, the number of hidden layers seems to have almost no influence on the accuracy of the delay discovery.

#### 5.4.2. Impact of the Causal Validation

The results for `FINANCE` show that performance decreases drastically when PIVM is removed. For `FMRI` and `FMRI $T>1000$`, the F1-scores are exactly the same when TCDF is applied with or without PIVM. We presume that PIVM has a large impact on the `FINANCE` benchmark because of the many confounders in `FINANCE`. The attention mechanism can select one of the effects of a confounder as a potential cause of another of the confounder’s effects, but this potential cause will not be labeled as a true cause by PIVM. In contrast, there are very few confounders in the datasets of `FMRI`, which might explain the identical scores of TCDF with and without PIVM. This experiment therefore suggests that the impact of causal validation depends on the number of confounders (shared causes) in the data, but will usually not have a negative impact on the causal discovery accuracy.

#### 5.4.3. Case Study: Detection of Hidden Confounders

The results for `FINANCE HIDDEN` are shown in Table 8. We apply TCDF with $L=1$, since this architecture was the most accurate for `FINANCE`. We denote by → a causal relationship that is discovered by TCDF using the method for hidden confounders described in Section 4.3.2. Table 9 shows a comparison between TCDF, PCMCI, tsFCI and TiMINo.

#### 5.5. Summary

TCDF outperforms the other evaluated methods on `FINANCE` and `FMRI $T>1000$`. Since a Deep Learning method has many parameters to fit, TCDF performs slightly worse on the short time series in `FMRI`. In contrast, the accuracy of PCMCI is not affected by time series length. Although computation time is not so relevant in the domain of knowledge extraction, PCMCI is faster than TCDF. TCDF discovers roughly 95–97% of delays correctly, which is only slightly worse than PCMCI and tsFCI. TCDF is the only method able to locate the presence of a hidden confounder but, contrary to PCMCI, in some cases discovers an incorrect causal relationship between a confounder’s effects.

## 6. Discussion

#### 6.1. Hyperparameters

The results on the `FINANCE` benchmark barely differ across different values for L. TCDF with $L=2$ performs worst on `FMRI` because the architecture is probably too complex for the dataset (there are too many parameters to fit) and the receptive field (and therefore the maximum delay) is unnecessarily large. The results for TCDF with $L=2$ improve substantially when applied to time series having more than 1000 time steps. Thus, the best number of hidden layers depends on the dataset, and mainly on the length of the time series.

Since the maximum delay in the `FINANCE` benchmark is three time steps, it might be more challenging for TCDF to discover the correct patterns. Interestingly, increasing the number of hidden layers barely influences the number of correctly discovered delays. The experiments show that despite the more complex delay discovery and the increased receptive field, our delay discovery algorithm correctly discovers almost all delays.

#### 6.2. Limitations of Experiments

## 7. Summary and Future Work

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

- Kleinberg, S. Why: A Guide to Finding and Using Causes; O’Reilly: Springfield, MA, USA, 2015. [Google Scholar]
- Kleinberg, S. Causality, Probability, and Time; Cambridge University Press: Cambridge, UK, 2013. [Google Scholar]
- Zorzi, M.; Sepulchre, R. AR Identification of Latent-Variable Graphical Models. IEEE Trans. Autom. Control **2016**, 61, 2327–2340.
- Spirtes, P. Introduction to causal inference. J. Mach. Learn. Res. **2010**, 11, 1643–1662.
- Zhang, K.; Schölkopf, B.; Spirtes, P.; Glymour, C. Learning causality and causality-related learning: Some recent progress. Natl. Sci. Rev. **2017**, 5, 26–29.
- Danks, D. The Psychology of Causal Perception and Reasoning. In The Oxford Handbook of Causation; Beebee, H., Hitchcock, C., Menzies, P., Eds.; Oxford University Press: Oxford, UK, 2009; Chapter 21; pp. 447–470.
- Abdul, A.; Vermeulen, J.; Wang, D.; Lim, B.Y.; Kankanhalli, M. Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 21–26 April 2018; ACM: New York, NY, USA, 2018; p. 582.
- Runge, J.; Sejdinovic, D.; Flaxman, S. Detecting causal associations in large nonlinear time series datasets. arXiv **2017**, arXiv:1702.07007.
- Huang, Y.; Kleinberg, S. Fast and Accurate Causal Inference from Time Series Data. In Proceedings of the FLAIRS Conference, Hollywood, FL, USA, 18–20 May 2015; pp. 49–54.
- Hu, M.; Liang, H. A copula approach to assessing Granger causality. NeuroImage **2014**, 100, 125–134.
- Papana, A.; Kyrtsou, C.; Kugiumtzis, D.; Diks, C. Detecting causality in non-stationary time series using partial symbolic transfer entropy: Evidence in financial data. Comput. Econ. **2016**, 47, 341–365.
- Müller, B.; Reinhardt, J.; Strickland, M.T. Neural Networks: An Introduction; Springer: Berlin/Heidelberg, Germany, 2012.
- Hyvärinen, A.; Shimizu, S.; Hoyer, P.O. Causal modelling combining instantaneous and lagged effects: An identifiable model based on non-Gaussianity. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 424–431.
- Malinsky, D.; Danks, D. Causal discovery algorithms: A practical guide. Philos. Compass **2018**, 13, e12470.
- Quinn, C.J.; Coleman, T.P.; Kiyavash, N.; Hatsopoulos, N.G. Estimating the directed information to infer causal relationships in ensemble neural spike train recordings. J. Comput. Neurosci. **2011**, 30, 17–44.
- Gevers, M.; Bazanella, A.S.; Parraga, A. On the identifiability of dynamical networks. IFAC-PapersOnLine **2017**, 50, 10580–10585.
- Friston, K.; Moran, R.; Seth, A.K. Analysing connectivity with Granger causality and dynamic causal modelling. Curr. Opin. Neurobiol. **2013**, 23, 172–178.
- Peters, J.; Janzing, D.; Schölkopf, B. Elements of Causal Inference: Foundations and Learning Algorithms; MIT Press: Cambridge, MA, USA, 2017.
- Papana, A.; Kyrtsou, K.; Kugiumtzis, D.; Diks, C. Identifying Causal Relationships in Case of Non-Stationary Time Series; Technical Report; Universiteit van Amsterdam: Amsterdam, The Netherlands, 2014.
- Chu, T.; Glymour, C. Search for additive nonlinear time series causal models. J. Mach. Learn. Res. **2008**, 9, 967–991.
- Entner, D.; Hoyer, P.O. On causal discovery from time series data using FCI. In Proceedings of the Fifth European Workshop on Probabilistic Graphical Models, Helsinki, Finland, 13–15 September 2010; pp. 121–128.
- Peters, J.; Janzing, D.; Schölkopf, B. Causal inference on time series using restricted structural equation models. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2013; pp. 154–162.
- Jiao, J.; Permuter, H.H.; Zhao, L.; Kim, Y.H.; Weissman, T. Universal estimation of directed information. IEEE Trans. Inf. Theory **2013**, 59, 6220–6242.
- Granger, C.W. Investigating causal relations by econometric models and cross-spectral methods. Econom. J. Econom. Soc. **1969**, 37, 424–438.
- Chen, Y.; Bressler, S.L.; Ding, M. Frequency decomposition of conditional Granger causality and application to multivariate neural field potential data. J. Neurosci. Methods **2006**, 150, 228–237.
- Zorzi, M.; Chiuso, A. Sparse plus low rank network identification: A nonparametric approach. Automatica **2017**, 76, 355–366.
- Marinazzo, D.; Pellicoro, M.; Stramaglia, S. Kernel method for nonlinear Granger causality. Phys. Rev. Lett. **2008**, 100, 144103.
- Luo, Q.; Ge, T.; Grabenhorst, F.; Feng, J.; Rolls, E.T. Attention-dependent modulation of cortical taste circuits revealed by Granger causality with signal-dependent noise. PLoS Comput. Biol. **2013**, 9, e1003265.
- Spirtes, P.; Zhang, K. Causal discovery and inference: Concepts and recent methodological advances. In Applied Informatics; Springer: Berlin, Germany, 2016; Volume 3, p. 3.
- Spirtes, P.; Glymour, C.N.; Scheines, R. Causation, Prediction, and Search; MIT Press: Cambridge, MA, USA, 2000.
- Liu, Y.; Aviyente, S. The relationship between transfer entropy and directed information. In Proceedings of the Statistical Signal Processing Workshop (SSP), Ann Arbor, MI, USA, 5–8 August 2012; pp. 73–76.
- Guo, T.; Lin, T.; Lu, Y. An Interpretable LSTM Neural Network for Autoregressive Exogenous Model. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Louizos, C.; Shalit, U.; Mooij, J.M.; Sontag, D.; Zemel, R.; Welling, M. Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2017; pp. 6446–6456.
- Goudet, O.; Kalainathan, D.; Caillou, P.; Guyon, I.; Lopez-Paz, D.; Sebag, M. Causal Generative Neural Networks. arXiv **2018**, arXiv:1711.08936v2.
- Kalainathan, D.; Goudet, O.; Guyon, I.; Lopez-Paz, D.; Sebag, M. SAM: Structural Agnostic Model, Causal Discovery and Penalized Adversarial Learning. arXiv **2018**, arXiv:1803.04929.
- Bai, S.; Kolter, J.Z.; Koltun, V. Convolutional Sequence Modeling Revisited. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. **1994**, 5, 157–166.
- Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional Sequence to Sequence Learning. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1243–1252.
- Van den Oord, A.; Kalchbrenner, N.; Espeholt, L.; Vinyals, O.; Graves, A.; Kavukcuoglu, K. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2016; pp. 4790–4798.
- Borovykh, A.; Bohte, S.; Oosterlee, C.W. Conditional time series forecasting with convolutional neural networks. In Lecture Notes in Computer Science/Lecture Notes in Artificial Intelligence; Springer: Berlin, Germany, 2017; pp. 729–730.
- Binkowski, M.; Marti, G.; Donnat, P. Autoregressive Convolutional Neural Networks for Asynchronous Time Series. arXiv **2017**, arXiv:1703.04122.
- Walther, D.; Rutishauser, U.; Koch, C.; Perona, P. On the usefulness of attention for object recognition. In Proceedings of the Workshop on Attention and Performance in Computational Vision at ECCV, Prague, Czech Republic, 15 May 2004; pp. 96–103.
- Yin, W.; Schütze, H.; Xiang, B.; Zhou, B. ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs. Trans. Assoc. Comput. Linguist. **2016**, 4, 259–272.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
- Van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv **2016**, arXiv:1609.03499.
- Sifre, L.; Mallat, S. Rigid-Motion Scattering for Image Classification. 2014. Available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.672.7091&rep=rep1&type=pdf (accessed on 15 October 2018).
- Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Martins, A.; Astudillo, R. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1614–1623. [Google Scholar]
- Shen, T.; Zhou, T.; Long, G.; Jiang, J.; Wang, S.; Zhang, C. Reinforced Self-Attention Network: A Hybrid of Hard and Soft Attention for Sequence Modeling. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, 13–19 July 2018; pp. 4345–4352. [Google Scholar]
- Eichler, M. Causal inference in time series analysis. In Causality: Statistical Perspectives and Applications; Wiley: Hoboken, NJ, USA, 2012; pp. 327–354. [Google Scholar]
- Woodward, J. Making Things Happen: A Theory of Causal Explanation; Oxford University Press: Oxford, UK, 2005. [Google Scholar]
- Breiman, L. Random forests. Mach. Learn.
**2001**, 45, 5–32. [Google Scholar] [CrossRef] - Van der Laan, M.J. Statistical inference for variable importance. Int. J. Biostat.
**2006**, 2. [Google Scholar] [CrossRef] - Datta, A.; Sen, S.; Zick, Y. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Proceedings of the IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 23–25 May 2016; pp. 598–617. [Google Scholar]
- Janzing, D.; Balduzzi, D.; Grosse-Wentrup, M.; Schölkopf, B. Quantifying causal influences. Ann. Stat.
**2013**, 41, 2324–2358. [Google Scholar] [CrossRef] [Green Version] - Fama, E.F.; French, K.R. The cross-section of expected stock returns. J. Financ.
**1992**, 47, 427–465. [Google Scholar] [CrossRef] - Smith, S.M.; Miller, K.L.; Salimi-Khorshidi, G.; Webster, M.; Beckmann, C.F.; Nichols, T.E.; Ramsey, J.D.; Woolrich, M.W. Network modelling methods for FMRI. Neuroimage
**2011**, 54, 875–891. [Google Scholar] [CrossRef] - Buxton, R.B.; Wong, E.C.; Frank, L.R. Dynamics of blood flow and oxygenation changes during brain activation: The balloon model. Magn. Reson. Med.
**1998**, 39, 855–864. [Google Scholar] [CrossRef] - Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization, 2014. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Hyndman, R.; Khandakar, Y. Automatic Time Series Forecasting: The forecast Package for R. J. Stat. Softw.
**2008**, 27, 95405. [Google Scholar] [CrossRef] - Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Available online: http://www.deeplearningbook.org (accessed on 3 December 2018).
- Rohrer, J.M. Thinking clearly about correlations and causation: Graphical causal models for observational data. Adv. Methods Pract. Psychol. Sci.
**2018**, 1, 27–42. [Google Scholar] [CrossRef]

**Figure 1.** A temporal causal graph learnt from multivariate observational time series data. A graph node models one time series. A directed edge denotes a causal relationship and is annotated with the time delay between cause and effect.

**Figure 3.** Overview of the Temporal Causal Discovery Framework (TCDF). With time series data as input, TCDF performs four steps (gray boxes) using the technique described in the white box and outputs a temporal causal graph.

**Figure 4.** TCDF with N independent CNNs ${\mathcal{N}}_{1}$…${\mathcal{N}}_{N}$, all having time series ${\mathbf{X}}_{1}$…${\mathbf{X}}_{N}$ of length T as input, where N equals the number of time series in the input data set. ${\mathcal{N}}_{j}$ predicts ${\mathbf{X}}_{j}$ and outputs, besides the prediction ${\widehat{\mathbf{X}}}_{j}$, the kernel weights ${\mathcal{W}}_{j}$ and attention scores ${\mathbf{a}}_{j}$. After attention interpretation, causal validation and delay discovery, TCDF constructs a temporal causal graph.

**Figure 5.** Dilated TCN to predict ${\mathbf{X}}_{2}$, with $L=3$ hidden layers, kernel size $K=2$ (shown as arrows) and dilation coefficient $c=2$, leading to a receptive field $R=16$. A PReLU activation function is applied after each convolution. To predict the first values (shown as dashed arrows), zero padding is added to the left of the sequence. Weights are shared across layers, indicated by the identical colors.
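The receptive field stated in the caption follows directly from the kernel size, the dilation coefficient and the depth: each of the $L+1$ convolutions with dilation $c^l$ widens the field by $(K-1)\cdot c^l$. A minimal sketch of this arithmetic (the function name is ours; it assumes the dilation grows as $c^l$ per layer, as in the figure):

```python
def receptive_field(kernel_size: int, dilation_c: int, hidden_layers: int) -> int:
    """Receptive field of a dilated causal convolutional network whose
    l-th convolution (l = 0 .. hidden_layers) uses dilation dilation_c**l."""
    num_convolutions = hidden_layers + 1
    return 1 + (kernel_size - 1) * sum(dilation_c ** l for l in range(num_convolutions))

# Configuration from the figure: K = 2, c = 2, L = 3 hidden layers.
print(receptive_field(2, 2, 3))  # 16
```

Doubling the dilation per layer thus makes the receptive field grow exponentially with depth, while the number of parameters grows only linearly.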

**Figure 6.** Attention-based Dilated Depthwise Separable Temporal Convolutional Network ${\mathcal{N}}_{2}$ to predict target time series ${\mathbf{X}}_{2}$. The N channels have $T=13$ time steps, $L=1$ hidden layer in the depthwise convolution and $N\times 2$ kernels with kernel size $K=2$ (denoted by colored blocks). The attention scores $\mathbf{a}$ are multiplied element-wise with the input time series, followed by an element-wise multiplication with the kernel. In the pointwise convolution, all channel outputs are combined to construct the prediction ${\widehat{\mathbf{X}}}_{2}$.

**Figure 7.** Threshold ${\tau}_{j}$ is set equal to the attention score at the left side of the largest gap ${g}_{k}$ where $k\ne 0$ and $k<\left|\mathbf{G}\right|/2$. In this example, ${\tau}_{j}$ is set equal to the third largest attention score.
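The gap-based rule in the caption can be sketched as follows. This is a simplified illustration, not the paper's implementation: the function name and the 0-indexed gap convention are our assumptions, with $\mathbf{G}$ read as the list of gaps between consecutive sorted scores.

```python
def attention_threshold(scores):
    """Sort attention scores in descending order, compute the gaps between
    consecutive scores, and select the largest gap g_k with k != 0 and
    k < |G|/2. The threshold is the score on the left (higher) side of
    that gap; scores below the threshold are discarded."""
    s = sorted(scores, reverse=True)
    gaps = [s[k] - s[k + 1] for k in range(len(s) - 1)]
    eligible = [k for k in range(1, len(gaps)) if k < len(gaps) / 2]
    k_best = max(eligible, key=lambda k: gaps[k])
    return s[k_best]

# The largest eligible gap follows the third largest score, so the threshold
# equals that third score (cf. the example in the figure).
print(attention_threshold([0.9, 0.8, 0.7, 0.2, 0.1, 0.05]))  # 0.7
```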

**Figure 8.** How TCDF deals, in theory, with hidden confounders (denoted by squares). A black square indicates that the hidden confounder is discovered by TCDF; a grey square indicates that it is not discovered. Black edges indicate causal relationships that will be included in the learnt temporal causal graph ${\mathcal{G}}_{L}$; grey edges will not be included in ${\mathcal{G}}_{L}$.

**Figure 9.** Discovering the delay between cause ${\mathbf{X}}_{1}$ and target ${\mathbf{X}}_{2}$, both having $T=16$. Starting from the top convolutional layer, the algorithm traverses the path with the highest kernel weights. Eventually, the algorithm ends in input value ${X}_{1}^{10}$, indicating a delay of $16-10=6$ time steps.
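The traversal can be illustrated with a simplified sketch. We assume kernel size $K=2$ and dilations doubling per layer, so that each convolution offers a choice between a "recent" and a "delayed" input; following the heavier kernel weight at each layer then accumulates the delay. The weight values below are hypothetical, chosen to reproduce the delay of 6 from the figure.

```python
def discover_delay(kernel_weights, T, dilation_c=2):
    """Greedy delay discovery for kernel size K = 2: kernel_weights[l] is the
    pair (w_recent, w_delayed) of layer l, ordered from the top layer down to
    the input layer. Starting at output time step T, each layer whose delayed
    weight dominates moves the pointer back by that layer's dilation."""
    t = T  # 1-based index of the current time step
    num_layers = len(kernel_weights)
    for layer, (w_recent, w_delayed) in enumerate(kernel_weights):
        dilation = dilation_c ** (num_layers - 1 - layer)
        if w_delayed > w_recent:
            t -= dilation  # follow the delayed input of this convolution
    return T - t

# Hypothetical weights for four layers (dilations 8, 4, 2, 1): the delayed
# weight dominates at dilations 4 and 2, so we end in input value X^10.
print(discover_delay([(0.9, 0.1), (0.2, 0.8), (0.3, 0.7), (0.6, 0.4)], T=16))  # 6
```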

**Figure 10.** Example datasets and causal graphs: simulation 17 from `FMRI` (**top**), graph 20-1A from `FINANCE` (**bottom**). A colored line corresponds to one time series (node) in the causal graph.

**Figure 11.** Adapted ground truth for the hidden confounder experiment, showing graphs 20-1A (**left**) and 40-1-3 (**right**) from FINANCE. Only one grey node was removed per experiment.

**Figure 12.** Example with three variables showing that ${\mathcal{G}}_{L}$ has TP = 0, FP = 1 (${e}_{1,3}$), TP′ = 1 (${e}_{1,3}$), FP′ = 0 and FN = 2 (${e}_{1,2}$ and ${e}_{2,3}$). Therefore, F1 = 0 and F1′ = 0.5.
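These scores follow from the standard formula $F1 = 2\,TP/(2\,TP+FP+FN)$, with the primed variant using TP′ and FP′ instead of TP and FP; a quick check with the counts above:

```python
def f1(tp, fp, fn):
    """F1 score from raw counts (harmonic mean of precision and recall)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

print(f1(0, 1, 2))  # F1  = 0.0
print(f1(1, 0, 2))  # F1' = 0.5
```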

| | FINANCE | FMRI | FMRI $\mathit{T}>1000$ |
|---|---|---|---|
| #datasets | 9 | 27 | 6 |
| #non-stationary datasets | 0 | 1 | 0 |
| #variables (time series) | 25 | $\in \{5,10,15\}$ | $\in \{5,10\}$ |
| #causal relationships | $\in \{6,20,40\}$ | $\in \{10,12,13,21,33\}$ | $\in \{10,21\}$ |
| time series length | 4000 | 50–5000 (mean: 774) | 1000–5000 (mean: 2867) |
| delays [time steps] | 1–3 | n.a. | n.a. |
| self-causation | ✓ | ✓ | ✓ |
| confounders | ✓ | ✓ | ✓ |
| type of relationship | linear | non-linear | non-linear |

**Table 3.** Time series prediction performance of TCDF, in terms of the mean absolute scaled error (MASE) averaged across all datasets, plus its standard deviation. Best results are highlighted in bold.

| | FINANCE TEST | FMRI TEST | FMRI $\mathit{T}>1000$ TEST |
|---|---|---|---|
| TCDF ($L=0$) | **0.38 ± 0.09** | **0.84 ± 0.38** | **0.71 ± 0.05** |
| TCDF ($L=1$) | **0.38 ± 0.10** | 1.06 ± 0.49 | 0.72 ± 0.07 |
| TCDF ($L=2$) | 0.40 ± 0.10 | 1.13 ± 0.45 | 0.74 ± 0.08 |
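The MASE reported in Table 3 scales the model's mean absolute error by the in-sample error of a naïve one-step persistence forecast, following the standard definition (Hyndman and Khandakar); the function below is an illustrative sketch, not the paper's evaluation code:

```python
import numpy as np

def mase(y_true, y_pred, y_train):
    """Mean absolute scaled error: MAE of the forecast divided by the
    in-sample MAE of a naive one-step (persistence) forecast."""
    mae = np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
    naive_mae = np.mean(np.abs(np.diff(np.asarray(y_train))))
    return float(mae / naive_mae)

# A value below 1 means the model beats the naive forecast on average.
print(mase([3.0, 4.0], [2.5, 4.5], [1.0, 2.0, 3.0]))  # 0.5
```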

**Table 4.** Causal discovery overview for all data sets and all methods. Showing macro-averaged F1 and F1′ scores and standard deviations. The highest score per benchmark is highlighted in bold.

| | FINANCE (9 Data Sets) | | FMRI (27 Data Sets) | | FMRI $\mathit{T}>1000$ (6 Data Sets) | |
|---|---|---|---|---|---|---|
| | F1 | F1′ | F1 | F1′ | F1 | F1′ |
| TCDF ($L=0$) | 0.64 ± 0.06 | 0.77 ± 0.08 | 0.60 ± 0.09 | 0.63 ± 0.09 | 0.68 ± 0.05 | 0.68 ± 0.05 |
| TCDF ($L=1$) | **0.65 ± 0.09** | **0.78 ± 0.10** | 0.58 ± 0.15 | 0.62 ± 0.14 | 0.65 ± 0.13 | 0.68 ± 0.11 |
| TCDF ($L=2$) | 0.64 ± 0.09 | 0.77 ± 0.09 | 0.55 ± 0.13 | 0.63 ± 0.11 | **0.70 ± 0.09** | **0.73 ± 0.08** |
| PCMCI | 0.55 ± 0.22 | 0.56 ± 0.22 | **0.63 ± 0.10** | **0.67 ± 0.11** | 0.67 ± 0.04 | 0.67 ± 0.04 |
| tsFCI | 0.37 ± 0.11 | 0.37 ± 0.12 | 0.49 ± 0.22 | 0.49 ± 0.22 | 0.48 ± 0.28 | 0.48 ± 0.28 |
| TiMINo | 0.13 ± 0.05 | 0.21 ± 0.10 | 0.23 ± 0.12 | 0.37 ± 0.14 | 0.23 ± 0.11 | 0.37 ± 0.15 |

**Table 5.** Run time in seconds, averaged over all datasets in the benchmark. TCDF (without parallelism) and TiMINo are run on an Ubuntu 16.04.4 LTS computer with an Intel® Xeon® E5-2683-v4 CPU and an NVIDIA TitanX 12 GB GPU. PCMCI and tsFCI are run on a Windows 10 1803 computer with an Intel® Core™ i7-5500U CPU.

| | TCDF ($\mathit{L}=0$) | PCMCI | tsFCI | TiMINo |
|---|---|---|---|---|
| FINANCE | 318 s | 10 s | 93 s | 499 s |
| FMRI | 74 s | 1 s | 1 s | 14 s |

**Table 6.** Delay discovery overview for all data sets of the `FINANCE` benchmark (nine datasets). Showing the macro-averaged percentage of delays that are correctly discovered w.r.t. the full ground truth, and standard deviation. TiMINo does not discover delays.

| | TCDF ($\mathit{L}=0$) | TCDF ($\mathit{L}=1$) | TCDF ($\mathit{L}=2$) | PCMCI | tsFCI | TiMINo |
|---|---|---|---|---|---|---|
| FINANCE | 97.79% ± 2.56 | 96.42% ± 3.68 | 95.49% ± 4.15 | 100.00% ± 0.00 | 98.77% ± 3.49 | n.a. |

**Table 7.** Impact of the causal validation step. Showing macro-averaged F1 scores and standard deviations for TCDF with PIVM and TCDF without PIVM. $\Delta $ shows the change in F1-score or F1′-score in percent.

| | FINANCE (9 Data Sets) | | FMRI (27 Data Sets) | | FMRI $\mathit{T}>1000$ (6 Data Sets) | |
|---|---|---|---|---|---|---|
| | F1 | F1′ | F1 | F1′ | F1 | F1′ |
| TCDF ($L=0$) | 0.64 ± 0.06 | 0.77 ± 0.08 | 0.60 ± 0.09 | 0.63 ± 0.09 | 0.68 ± 0.05 | 0.68 ± 0.05 |
| TCDF ($L=0$) w/o PIVM | 0.22 ± 0.09 | 0.30 ± 0.13 | 0.60 ± 0.09 | 0.63 ± 0.09 | 0.68 ± 0.05 | 0.68 ± 0.05 |
| $\Delta $ (PIVM) | −66% | −61% | 0% | 0% | 0% | 0% |

**Table 8.** Results of TCDF ($L=1$) applied to `FINANCE HIDDEN`. ‘Equal Delays’ denotes whether the delays from the confounder (conf.) to the confounder’s effects are equal. Grey causal relationships denote that the discovered relationship was not causal according to the ground truth.

| Dataset | Hidden Conf. | Effects | Equal Delays | Conf. Discovered | Learnt Causal Relationships |
|---|---|---|---|---|---|
| 20-1A | ${\mathbf{X}}_{16}$ | ${\mathbf{X}}_{8}$, ${\mathbf{X}}_{5}$ | ✓ | ✓ | ${\mathbf{X}}_{16}\to {\mathbf{X}}_{8}$, ${\mathbf{X}}_{16}\to {\mathbf{X}}_{5}$ |
| 40-1-3 | ${\mathbf{X}}_{7}$ | ${\mathbf{X}}_{8}$, ${\mathbf{X}}_{3}$ | ✓ | ✓ | ${\mathbf{X}}_{7}\to {\mathbf{X}}_{8}$, ${\mathbf{X}}_{7}\to {\mathbf{X}}_{3}$ |
| 40-1-3 | ${\mathbf{X}}_{0}$ | ${\mathbf{X}}_{5}$, ${\mathbf{X}}_{6}$ | ✗ | ✗ | ${\mathbf{X}}_{5}\to {\mathbf{X}}_{6}$ |
| 40-1-3 | ${\mathbf{X}}_{8}$ | ${\mathbf{X}}_{23}$, ${\mathbf{X}}_{4}$ | ✓ | ✓ | ${\mathbf{X}}_{8}\to {\mathbf{X}}_{23}$, ${\mathbf{X}}_{8}\to {\mathbf{X}}_{4}$ |
| 40-1-3 | ${\mathbf{X}}_{8}$ | ${\mathbf{X}}_{15}$, ${\mathbf{X}}_{4}$ | ✗ | ✗ | - |
| 40-1-3 | ${\mathbf{X}}_{8}$ | ${\mathbf{X}}_{24}$, ${\mathbf{X}}_{4}$ | ✗ | ✗ | - |
| 40-1-3 | ${\mathbf{X}}_{8}$ | ${\mathbf{X}}_{24}$, ${\mathbf{X}}_{15}$ | ✗ | ✗ | - |
| 40-1-3 | ${\mathbf{X}}_{8}$ | ${\mathbf{X}}_{24}$, ${\mathbf{X}}_{23}$ | ✗ | ✗ | ${\mathbf{X}}_{24}\to {\mathbf{X}}_{23}$ |
| 40-1-3 | ${\mathbf{X}}_{8}$ | ${\mathbf{X}}_{15}$, ${\mathbf{X}}_{23}$ | ✗ | ✗ | - |

**Table 9.** Results of TCDF compared with PCMCI, tsFCI and TiMINo when applied to datasets with hidden confounders. The first row denotes the number of incorrect causal relationships discovered between the effects of the hidden confounders. The second row denotes the number of hidden confounders that were located.

| FINANCE HIDDEN | TCDF ($\mathit{L}=1$) | PCMCI | tsFCI | TiMINo |
|---|---|---|---|---|
| # Incorrect Causal Relationships | 2 | 0 | 3 | 8 |
| # Discovered Hidden Confounders | 3 | 0 | 0 | 0 |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Nauta, M.; Bucur, D.; Seifert, C.
Causal Discovery with Attention-Based Convolutional Neural Networks. *Mach. Learn. Knowl. Extr.* **2019**, *1*, 312-340.
https://doi.org/10.3390/make1010019
