Kernel-Based Independence Tests for Causal Structure Learning on Functional Data

Measurements of systems taken along a continuous functional dimension, such as time or space, are ubiquitous in many fields, from the physical and biological sciences to economics and engineering. Such measurements can be viewed as realisations of an underlying smooth process sampled over the continuum. However, traditional methods for independence testing and causal learning are not directly applicable to such data, as they do not take into account the dependence along the functional dimension. By using specifically designed kernels, we introduce statistical tests for bivariate, joint, and conditional independence for functional variables. Our method not only extends the applicability to functional data of the Hilbert-Schmidt independence criterion (HSIC) and its d-variate version (d-HSIC), but also allows us to introduce a test for conditional independence by defining a novel statistic for the conditional permutation test (CPT) based on the Hilbert-Schmidt conditional independence criterion (HSCIC), with optimised regularisation strength estimated through an evaluation rejection rate. Empirical results on synthetic functional data show that these tests achieve good size and power, and we then exemplify their application to several constraint- and regression-based causal structure learning problems, including both synthetic examples and real socioeconomic data.


1 Introduction
Uncovering the causal relationships between measured variables, a discipline known as causal structure learning or causal discovery, is of great importance across various scientific fields, such as climatology (Runge et al., 2019a), economics (Sulemana and Kpienbaareh, 2018), or biology (Finkle et al., 2018). Doing so from passively collected ('observational') data enables the inference of causal interactions between variables without performing experiments or randomised control trials, which are often expensive, unethical, or impossible to conduct (Glymour et al., 2019). Causal structure learning is the inference, under a given set of assumptions, of directed and undirected edges in graphs representing the data generating process, where the nodes represent variables and the inferred edges capture causal (directed) or non-causal (undirected) relationships between them.
Research in various areas collates functional data consisting of multiple series of measurements observed conjointly over a given continuum (e.g., time, space, or frequency), where each series is assumed to be a realisation of an underlying smooth process (Ramsay, 2005, §3). By viewing the series of measurements as discretisations of functions, the observations are not required to be collected over regular meshes of points along the continuum. If the variables are measured over time as the underlying continuum, there is a long history of methods developed to infer (time-based) causality between variables. Among those, classic Granger-causality (Granger, 1969) declares that variable 'X causes Y' (X → Y) if predicting the future of Y becomes more accurate with, as compared to without, access to the past of X, conditional on all other relevant variables (Geweke, 1982). However, these methods assume that the observed time-series are stationary and that the causal dependency of Y on X is linear. More recently, Sugihara et al. (2012) developed convergent cross mapping (CCM), a method that relaxes the assumption of linearity and finds causal relationships based on time-embeddings of the (stationary) time-series at each point. While useful in many situations, Granger-causality and CCM can perform weakly when the time-series for X and Y are nonlinearly related or nonstationary, respectively (see Appendix H).
Here we present a method that uses kernel-based independence tests to detect statistically significant causal relationships by extending constraint- and regression-based causal structure learning to functional data. The key advantages over Granger-causality and CCM are the systematic consideration of confounders and the relaxation of assumptions of linear relationships or stationarity in the data, which can lead to different inferred causal relationships between variables. As a motivating example, consider the relationship between two variables, 'corruption' and 'income inequality', as measured by the World Governance Indicators (WGIs) (Kaufmann et al., 2011) and the World Bank (2022), respectively. Using data for 48 African countries from 1996 to 2016, Sulemana and Kpienbaareh (2018) investigated their cause-effect relationship and found that corruption 'Granger-causes' income inequality. We have independently confirmed that applying CCM to the same data leads to the same conclusion. However, by considering the time-series data as realisations of functions over time, and thus avoiding linearity and stationarity assumptions, our proposed kernel-based approach suggests the reverse result, i.e., a causal influence of income inequality on corruption appears as the more statistically likely direction. Although a bidirectional causal dependency between these two variables might appear more realistic, this conclusion is in agreement with other quantitative findings, which draw on different data sources (Jong-Sung and Khagram, 2005; Alesina and Angeletos, 2005; Dobson and Ramlogan-Dobson, 2010). We will return to this example in Section 4.2.2, where we analyse causal dependencies between all six WGIs. Methodologically, our work extends the applicability of two popular paradigms in causal structure learning, namely constraint-based (Spirtes et al., 2000, § 5) and regression-based methods (Peters et al., 2014), to functional data. Independence tests play a crucial role in uncovering causal
relationships in both paradigms, and kernels provide a powerful framework for such tests by embedding probability distributions in reproducing kernel Hilbert spaces (RKHSs) (Muandet et al., 2017, § 2.2). Until now, however, related methods for causal learning had only been applicable to univariate and multivariate data, but not to functional data. To address this limitation, we employ recently derived kernels over functions (Wynne and Duncan, 2022) to widen the applicability of kernel-based independence tests to functional data settings. To test for conditional independence, we compute the HSCIC (Park and Muandet, 2020) within a conditional permutation test (CPT) (Berrett et al., 2020), and we propose a straightforward search to determine the optimised regularisation strength in the HSCIC. We structure our paper as follows. Section 2 provides a brief overview of prior literature on functional data analysis, kernel-based independence tests, and causal structure learning methods. Section 3 presents the definition of a conditional independence test for functional data and its applicability to causal structure learning on such data. We then empirically analyse the performance of our independence tests and causal structure learning algorithms on synthetic and real-world data in Section 4. We conclude with a discussion in Section 5.
Our main contribution lies in Section 3, where we propose a conditional independence test for functional data that combines a novel test statistic based on the HSCIC with the CPT to generate samples under the null hypothesis. The algorithm also searches for the optimised regularisation strength λ* required to compute the HSCIC, using pre-test permutations to calculate an evaluation rejection rate. We also highlight the following secondary contributions:
• In Section 4.1.2, we extend the historical functional linear model (Malfait and Ramsay, 2003) to the multivariate case {X_1, . . ., X_d} → Y for regression-based causal structure learning, and we show how a joint independence test can be used to verify candidate directed acyclic graphs (DAGs) (Pfister et al., 2018, § 5.2) that embed the causal structure of function-valued random variables. This model has been contributed to the Python package scikit-fda (Ramos-Carreño et al., 2019).
• On synthetic data, we show empirically that our bivariate, joint and conditional independence tests achieve high test power, and that our causal structure learning algorithms outperform previously proposed methods.
• Using a real-world data set (World Governance Indicators) we demonstrate how our method can yield insights into cause-effect relationships amongst socio-economic variables measured in countries worldwide.
• Implementations of our algorithms are made available at https://github.com/felix-laumann/causal-fda/ in an easily usable format that builds on top of scikit-fda and causaldag (Squires, 2018).
2 Background and Related Work

2.1 Functional Data Analysis
In functional data analysis (Ramsay, 2005), a variable X is described by a set of n samples (or realisations), X = {x_i(s)}_{i=1}^{n}, where each functional sample x_i(s) corresponds to a series of observations over the continuum s, also called the functional dimension. Typical functional dimensions are time or space. In practical settings, the observations are taken at a set of S discrete values s_1, . . ., s_S of the continuum variable s. Examples of functional data sets include the vertical position of the lower lip over time when speaking a given word (Malfait and Ramsay, 2003), muscle soreness over the duration of a tennis match (Girard et al., 2006), or the ultrafine particle concentration in air measured over the distance to the nearest motorway (Zhu et al., 2006).
In applications, the functional samples are usually represented as linear combinations of a finite set of M basis functions {ϕ_m(s)}_{m=1}^{M} (e.g., Fourier or monomial basis functions):

x_i(s) = Σ_{m=1}^{M} c_{i,m} ϕ_m(s),    (1)

where the coefficients C_i = (c_{i,1}, . . ., c_{i,M}) characterise each sample. If the number of basis functions is equal to the number of observations, M = S, each observed value x_i(s_k) can be fitted exactly by obtaining the coefficients C_i using standard least squares fitting techniques (provided the ϕ_m(s) are valid basis functions), and Equation (1) allows us to interpolate between any two observations. When the number of basis functions is smaller than the number of observations, M < S, as is commonly the case in practice, the basis function expansion (1) provides a smoothed approximation x̂_i(s_k) to the set of observations.
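As a concrete illustration of the basis expansion above, the following sketch fits Fourier basis coefficients to noisy, irregularly observed data by ordinary least squares; the basis size, period, and noise level are illustrative choices rather than those used in the paper.

```python
import numpy as np

def fourier_design(s, M, period=1.0):
    """Evaluate M Fourier basis functions phi_m(s) on a mesh s."""
    cols = [np.ones_like(s)]
    m = 1
    while len(cols) < M:
        cols.append(np.sqrt(2 / period) * np.sin(2 * np.pi * m * s / period))
        if len(cols) < M:
            cols.append(np.sqrt(2 / period) * np.cos(2 * np.pi * m * s / period))
        m += 1
    return np.column_stack(cols)

def fit_basis_coefficients(s_obs, x_obs, M):
    """Least-squares coefficients c_{i,m}; M < len(s_obs) yields a smoothed fit."""
    Phi = fourier_design(s_obs, M)
    coef, *_ = np.linalg.lstsq(Phi, x_obs, rcond=None)
    return coef

# Recover a function from noisy observations on an irregular mesh.
rng = np.random.default_rng(0)
s_obs = np.sort(rng.uniform(0, 1, 40))
x_obs = 1.0 + np.sin(2 * np.pi * s_obs) + 0.01 * rng.normal(size=40)
coef = fit_basis_coefficients(s_obs, x_obs, M=3)
s_new = np.linspace(0, 1, 100)
x_smooth = fourier_design(s_new, 3) @ coef   # interpolate/smooth on a regular mesh
```

Because the basis is fixed, the coefficients solve a linear least-squares problem; the fitted curve can then be evaluated anywhere on the continuum, which is exactly the interpolation property used throughout the paper.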
For the many applications where the continuum is time, historical functional linear models (HFLMs) (Malfait and Ramsay, 2003) provide a comprehensive framework to map the relationship between two sets of functional samples. Let 0 and T be the initial and final time points for a set of samples y_i(t). HFLMs describe dependencies that can vary over time using the function β(s, t), which encapsulates the influence of x(s), s ∈ [0, S], on another variable y(t), t ∈ [0, T], at any two points in time s and t:

y_i(t) = β_0(t) + ∫_{s_0(t)}^{t} x_i(s) β(s, t) ds,    (2)

where s_0(t) encodes the maximum allowable lag for any influence of X on Y and s_0(t) ≤ s ≤ t. Typical choices for β(s, t) are exponential decay and hyperbolic paraboloid (or "saddle") functions. The continuum is not required to be time but can also be space, frequency or others (see Ramsay and Hooker (2017) for an extensive collection of function-to-function models and applications).
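On a discrete mesh, the historical integral can be approximated with a simple Riemann sum; this is an illustrative discretisation, not the estimator of Malfait and Ramsay (2003).

```python
import numpy as np

def hflm_response(x, grid, beta, max_lag=None):
    """Discretised HFLM: y(t) = integral over [s_0(t), t] of x(s) * beta(s, t) ds,
    with s_0(t) = max(0, t - max_lag); simple Riemann sum on a regular mesh."""
    ds = grid[1] - grid[0]
    y = np.zeros_like(grid)
    for j, t in enumerate(grid):
        lo = 0.0 if max_lag is None else max(0.0, t - max_lag)
        mask = (grid >= lo) & (grid <= t)
        y[j] = np.sum(x[mask] * beta(grid[mask], t)) * ds
    return y

grid = np.linspace(0, 1, 200)
x = np.sin(2 * np.pi * grid)
beta = lambda s, t: np.ones_like(s)       # constant influence kernel, for illustration
y = hflm_response(x, grid, beta)          # approximates (1 - cos(2*pi*t)) / (2*pi)
</```

With a constant β the response is simply the running integral of x, which makes the "history up to time t" structure of the model easy to verify numerically.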

2.2 Kernel Independence Tests
Let H_{k_1} and H_{k_2} be separable RKHSs with kernels k_1 and k_2, respectively. If the kernel k_1 uniquely embeds a distribution P_X in an RKHS by a mean embedding, which captures all information about P_X, we call k_1 a characteristic kernel (Fukumizu et al., 2007). Characteristic kernels have thus been used extensively in bivariate (P_{XY} = P_X P_Y), joint (P_{XYZ} = P_X P_Y P_Z) and conditional (P_{XY|Z} = P_{X|Z} P_{Y|Z}) independence tests (e.g., Gretton et al., 2008; Pfister et al., 2018; Zhang et al., 2012).
For the bivariate independence test, let P_{XY} denote the joint distribution of X and Y. Then HSIC is defined as

HSIC(X, Y) = ∥μ_{P_{XY}} − μ_{P_X ⊗ P_Y}∥²_{H_{k_1} ⊗ H_{k_2}} ≥ 0,

with equality iff P_{XY} = P_X P_Y, where μ_P denotes the kernel mean embedding of a distribution P.
We refer to Gretton et al. (2008) for the definition of an estimator for finite samples that constitutes the test statistic in the bivariate independence test with null hypothesis H_0 : X ⊥⊥ Y. The test statistic is then computed on the original data (X, Y) and statistically compared to random permutations {(X, Y_{p′})}_{p′=1}^{P} under the null hypothesis. For distributions with more than two variables, let P_{X_1,X_2,…,X_d} denote the joint distribution on X_1, X_2, . . ., X_d. To test for joint independence we compute

d-HSIC(X_1, . . ., X_d) = ∥μ_{P_{X_1,…,X_d}} − μ_{P_{X_1} ⊗ ⋯ ⊗ P_{X_d}}∥²,

which equals zero iff the variables are jointly independent. Pfister et al. (2018) derive a numerical estimator, which serves as the basis for a joint independence test on finite samples. Here, the distribution under the null hypothesis of joint independence is generated by randomly permuting all sample sets in the same way as Y is in the bivariate independence test.
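For intuition, a minimal permutation test based on the (biased) HSIC estimator with Gaussian kernels might look as follows; the estimator and median-heuristic bandwidth follow standard practice and are not a specific implementation from this paper.

```python
import numpy as np

def gram(X, sigma2=None):
    """Gaussian Gram matrix; sigma2 defaults to the median heuristic."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    if sigma2 is None:
        sigma2 = np.median(d2[d2 > 0])
    return np.exp(-d2 / (2 * sigma2))

def hsic(K, L):
    """Biased HSIC estimate trace(K H L H) / n^2 with centring matrix H."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2

def hsic_perm_test(X, Y, n_perm=200, seed=0):
    """Permutation p-value: permute Y to simulate the null X indep. of Y."""
    rng = np.random.default_rng(seed)
    K, L = gram(X), gram(Y)
    stat = hsic(K, L)
    null = [hsic(K, L[np.ix_(p, p)])
            for p in (rng.permutation(len(X)) for _ in range(n_perm))]
    return (1 + sum(s >= stat for s in null)) / (1 + n_perm)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))
p_dep = hsic_perm_test(X, X ** 2 + 0.1 * rng.normal(size=(100, 1)))  # dependent pair
p_ind = hsic_perm_test(X, rng.normal(size=(100, 1)))                 # independent pair
```

Permuting Y while holding X fixed preserves both marginals but destroys the dependence, which is exactly the null distribution described above; the dependent pair should yield a much smaller p-value than the independent one.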
Lastly, the conditional independence test relies on accurately sampling from the distribution under the null hypothesis H_0 : X ⊥⊥ Y | Z. At the core of the conditional permutation test (Berrett et al., 2020) lies a randomisation procedure that generates permutations of X, denoted {X_{p′}}_{p′=1}^{P}, without altering the conditional distribution P_{X|Z}, so that under H_0 each permuted triple (X_{p′}, Y, Z) is distributed as the original data (X, Y, Z), while any dependence between X and Y is broken. The null distribution can therefore be generated by repeating this procedure multiple times, and we can decide whether H_0 should be rejected by comparing a test statistic on the original data against its values on the generated null distribution.
The existing literature on kernel-based independence tests is extensive (see, e.g., Berrett et al., 2020, for a relevant review), but only a small part of it investigates independence among functional variables (Lai et al., 2021; Górecki et al., 2020). There have been particularly strong efforts in developing conditional independence tests and understanding the settings in which they are applicable. The kernel conditional independence test (KCIT) (Zhang et al., 2012), for example, gave promising results on univariate data but increasingly suffered as the number of conditioning variables grew. In contrast, the kernel conditional independence permutation test (KCIPT) (Doran et al., 2014) repurposed the well-established kernel two-sample test (Gretton et al., 2012) for the conditional independence setting and delivered stable results for multiple conditioning variables. However, Lee and Honavar (2017) pointed out that, as the number of permutations increases, its power increases while its calibratedness decreases. This issue was overcome by their proposed self-discrepancy conditional independence test (SDCIT), which is based on a modified unbiased estimate of the maximum mean discrepancy (MMD).

2.3 Causal Structure Learning
The aim of causal structure learning, or causal discovery, is to infer the qualitative causal relationships among a set of observed variables, typically in the form of a causal diagram or DAG. Once learnt, such a causal structure can then be used to construct a causal model such as a causal Bayesian network or a structural causal model (SCM) (Pearl, 2009). Causal models are endowed with a notion of manipulation and, unlike a statistical model, do not just describe a single distribution but many distributions indexed by different interventions and counterfactuals. They can be used for causal reasoning, that is, to answer causal questions such as computing the average causal effect of a treatment on a given outcome variable. Such questions are of interest across many disciplines, and causal discovery is thus a highly topical area. We refer to Mooij et al. (2016); Peters et al. (2017); Glymour et al. (2019); Schölkopf and von Kügelgen (2022); Squires and Uhler (2022); Vowels et al. (2022) for comprehensive surveys and accounts of the main research concepts. In particular, we focus here on causal discovery methods for causally sufficient systems, in which no unobserved confounders influence two or more of the observed variables. Existing causal discovery methods can roughly be categorised into three families:
• Score-based approaches assign a score, such as a penalised likelihood, to each candidate graph and then pick the highest-scoring graph(s). A common drawback of score-based approaches is the need for a combinatorial enumeration of all DAGs in the optimisation, although greedy approaches have been proposed to alleviate this issue (Chickering, 2002).
• Constraint-based methods start by characterising the set of conditional independences in the observed data (Spirtes et al., 2000). They then determine the graph(s) consistent with the detected conditional independences by using a graphical criterion called d-separation, as well as the causal Markov and faithfulness assumptions, which establish a one-to-one connection between d-separation and conditional independence (see Appendix B for definitions). When only observational i.i.d. data are available, this yields a so-called Markov equivalence class, possibly containing multiple candidate graphs. For example, the graphs X → Y → Z, X ← Y ← Z, and X ← Y → Z are Markov equivalent, as they all imply X ⊥⊥ Z | Y and no other conditional independence relations.
• Regression-based approaches directly fit the structural equations X_i := f_i(PA_i, U_i) of an underlying SCM for each X_i, where PA_i denote the parents of X_i in the causal DAG and U = (U_1, . . ., U_n) are jointly independent exogenous noise variables. Provided that the function class of the f_i is sufficiently restricted, e.g., by considering only linear relationships (Shimizu et al., 2006) or additive noise models (Hoyer et al., 2008), the true causal graph is identified as the unique choice of parents for each i such that the resulting residuals Û_i = X_i − f̂_i(PA_i) are jointly independent.
As can be seen from these definitions, conditional, bivariate, and joint independence tests are an integral part of constraint- and regression-based causal discovery methods. Our main focus in the present work is therefore to extend the applicability of these causal discovery frameworks to functional data by generalising the underlying independence tests to such domains.

3 Methods
In all three of our independence tests (bivariate, joint, conditional), we employ kernels over functions, also known as squared-exponential T (SE-T) kernels (Wynne and Duncan, 2022). Let X and Y be real, separable Hilbert spaces with norms ∥·∥_X and ∥·∥_Y, respectively. Then, for an operator T : X → Y, SE-T kernels are defined as

k(x, x′) = exp( −∥T(x) − T(x′)∥²_Y / (2σ²) ),

where σ² is commonly set by the median heuristic, σ² = median{∥T(x_i) − T(x_j)∥²_Y : 1 ≤ i < j ≤ n}. Replacing any characteristic kernel by the SE-T kernels for bivariate independence tests (based on HSIC) and for joint independence tests (based on the d-variable Hilbert-Schmidt independence criterion, d-HSIC) is straightforward and does not require further theoretical investigation besides evaluating numerically the validity and power of the tests (see Section 4). However, the application of SE-T kernels to conditional independence tests needs further theoretical results, as we discuss now.
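A sketch of the SE-T Gram matrix for discretised functional samples, taking T as the identity and approximating the squared L2 distance on a regular mesh (an illustrative discretisation):

```python
import numpy as np

def se_t_gram(F, dx, sigma2=None):
    """SE-T Gram matrix with T = identity: k(x, x') = exp(-||x - x'||^2 / (2 sigma2)).
    F: (n, S) array of n functions sampled on a regular mesh with spacing dx,
    so dx * sum((x - x')^2) approximates the squared L2 distance."""
    d2 = dx * np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=-1)
    if sigma2 is None:                      # median heuristic over nonzero distances
        sigma2 = np.median(d2[d2 > 0])
    return np.exp(-d2 / (2 * sigma2))

s = np.linspace(0, 1, 100)
F = np.stack([np.sin(2 * np.pi * (s + phase)) for phase in np.linspace(0, 0.5, 5)])
K = se_t_gram(F, dx=s[1] - s[0])
```

Substituting such a Gram matrix into any kernel-based test statistic (HSIC, d-HSIC) is all that is needed to lift those tests to functional inputs, which is the point made in the text.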

3.1 Conditional independence test on functional data
We consider the conditional independence test, which generates samples under the null hypothesis based on the CPT and uses the sum of HSCICs over all samples z ∈ Z as its test statistic. The CPT defines a permutation procedure that preserves the dependence of Z on both X and Y while resampling data for X in a way that eliminates any potential dependence between X and Y. This procedure results in samples according to the null hypothesis, X ⊥⊥ Y | Z. We use this procedure whenever permutations are required as part of the conditional independence test. Given that the computation of the HSCIC is based on a kernel ridge regression, it requires setting a regularisation strength λ, which has previously been chosen manually (Park and Muandet, 2020). Generally, our aim is to have type-I error rates close to the allowable false positive rate α. However, choosing λ inappropriately may result in an invalid test (type-I error rates exceed α if λ is chosen too large), or in deflated test power (type-I error rates well below α and high type-II error rates if λ is chosen too small). Thus, we define an algorithm that conducts a search over a range of potentially suitable values for λ and assesses each candidate value by what we will call an evaluation rejection rate.
The search proceeds by iterating over a range of values {λ_l}_{l=1}^{L} to find the optimised value λ*, as follows. For each λ_l, we start by producing one permutation of the samples X, which we denote X_b, and we compute its corresponding evaluation test statistic, given by the sum of HSCICs over all samples z ∈ Z. We then apply the usual strategy for permutation-based statistical tests: we produce an additional set of P permuted sample sets of X, denoted {X_π}_{π=1}^{P}, and for each X_π we determine the sum of HSCICs over z ∈ Z to generate a distribution of statistics under the null hypothesis, which we call the evaluation null statistics. We then compute the percentile at which the evaluation test statistic on X_b falls within the distribution of evaluation null statistics on the permutation set {X_π}_{π=1}^{P}. This results in an evaluation p-value, which is compared to the allowable false positive rate α to establish whether the hypothesis H_0 : X_b ⊥⊥ Y | Z can be rejected. Given that both the evaluation test statistic and the evaluation null statistics are computed on conditionally independent samples, we repeat this procedure for b = 1, . . ., B ≥ 100 times to estimate an evaluation rejection rate for each value λ_l. Having completed this procedure over all values {λ_l}_{l=1}^{L}, we select the λ_l that produces an evaluation rejection rate closest to α as the optimised regularisation strength, λ*. Finally, we apply a CPT-based conditional independence test using the optimised λ* to test the null hypothesis H_0 : X ⊥⊥ Y | Z. This entire procedure is summarised in Algorithm 1.
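The structure of the λ search can be sketched as follows, with a toy statistic standing in for the sum of HSCIC values and i.i.d. draws standing in for CPT permutations (both are placeholders, not the paper's statistic or resampling scheme):

```python
import numpy as np

def select_lambda(stat_fn, draw_null, lambdas, alpha=0.05, B=100, P=100, seed=0):
    """Pick the lambda whose evaluation rejection rate, estimated on data that
    are (conditionally) independent by construction, lies closest to alpha."""
    rng = np.random.default_rng(seed)
    best_lam, best_gap = None, np.inf
    for lam in lambdas:
        rejections = 0
        for _ in range(B):
            t_eval = stat_fn(draw_null(rng), lam)                 # evaluation test statistic
            null = [stat_fn(draw_null(rng), lam) for _ in range(P)]
            p_val = (1 + sum(t >= t_eval for t in null)) / (1 + P)
            rejections += int(p_val <= alpha)
        gap = abs(rejections / B - alpha)                         # distance to target rate
        if gap < best_gap:
            best_lam, best_gap = lam, gap
    return best_lam

# Toy statistic and null sampler, purely to exercise the search scaffold.
toy_stat = lambda data, lam: float(np.mean(data ** 2)) / (1.0 + lam)
draw = lambda rng: rng.normal(size=30)
lam_star = select_lambda(toy_stat, draw, lambdas=[1e-3, 1e-2, 1e-1], B=30, P=50)
```

In the actual algorithm, `stat_fn` would be the HSCIC sum and `draw_null` a CPT permutation of X, so that the rejection rate measured here is exactly the evaluation rejection rate described in the text.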
The following Theorem 1 guarantees the consistency of the conditional independence test in Algorithm 1 with respect to the regularisation parameter λ.
Theorem 1. Let H_{k_1} and H_{k_2} be separable RKHSs with characteristic kernels k_1 and k_2 (Szabó and Sriperumbudur, 2017). If the regularisation parameter λ = λ_n decays, as n → ∞, at a slower rate than n^{−1/2}, then the test based on the test statistic in Algorithm 1 (lines 19-27) is consistent.
Proof.See Appendix A.

3.2 Causal structure learning on functional data
To infer the existence of (directed) edges in an underlying causal graph G based on the joint data distribution P over a set of observed variables, we must assume that G and P are intrinsically linked. Spirtes et al. (2000, § 2.3.3) define the faithfulness assumption, from which it follows that if two random variables are (conditionally) independent in the observed distribution P, then they are d-separated in the underlying causal graph G (Pearl, 2009, § 1.2.3). Using this fact, constraint-based causal structure learning methods (Spirtes et al., 2000, § 5) take advantage of bivariate and conditional independence tests to infer whether two nodes are d-separated in the underlying graph. These methods yield completed partially directed acyclic graphs (CPDAGs), which are graphs with undirected and/or directed edges. In contrast, regression-based methods (Peters et al., 2014) utilise the joint independence test to discover DAGs that have all edges oriented. We now describe both of these approaches in more detail.

3.2.1 Constraint-based causal structure learning
Constraint-based causal structure learning relies on performing conditional independence tests for each pair of variables, X and Y, conditioned on any possible subset of the remaining variables, that is, any subset within the power set of {X_1, . . ., X_d} \ {X, Y}. Whenever a conditional independence is found, the edge between X and Y is removed; unshielded triples are then oriented as colliders where the found conditioning sets require it. Based on these oriented edges, 'Meek's orientation rules' defined by Meek (1995) (see, e.g., Peters et al., 2017, § 7.2.1) can be applied to direct additional edges based on certain graphical compositions, as shown in Appendix C. Briefly, these rules follow from the avoidance of cycles, which would violate the acyclicity of a DAG, and of colliders that would contradict the found conditional independences. Algorithms that implement these criteria are the SGS algorithm and the more efficient PC algorithm (Spirtes et al., 2000), which we summarise in Appendix C.
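The edge-removal (skeleton) phase just described can be sketched with a pluggable conditional independence test; here a d-separation oracle for a three-node chain stands in for a statistical test:

```python
from itertools import combinations

def pc_skeleton(nodes, ci_test):
    """Skeleton phase of the PC algorithm: start fully connected and delete the
    edge X-Y as soon as ci_test(X, Y, Z) reports X indep. of Y given Z for some
    conditioning set Z drawn from the current neighbours of X."""
    adj = {v: set(nodes) - {v} for v in nodes}
    for size in range(len(nodes) - 1):          # grow the conditioning-set size
        for x in nodes:
            for y in sorted(adj[x]):
                for Z in combinations(sorted(adj[x] - {y}), size):
                    if ci_test(x, y, set(Z)):   # independence found: remove edge
                        adj[x].discard(y)
                        adj[y].discard(x)
                        break
    return {frozenset((x, y)) for x in nodes for y in adj[x]}

def chain_oracle(x, y, Z):
    """d-separation oracle for the chain X -> Y -> Z (only X indep. of Z given Y)."""
    return {x, y} == {"X", "Z"} and "Y" in Z

skeleton = pc_skeleton(["X", "Y", "Z"], chain_oracle)
```

In practice the oracle would be replaced by the paper's kernel-based conditional independence test on functional samples; the search logic over conditioning sets is unchanged.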

3.2.2 Regression-based causal structure learning
To carry out our regression-based causal learning, we choose additive noise models (ANMs), a special class of SCMs where the noise terms are assumed to be independent and additive. This assumption guarantees that the causal graph can be identified if the functions f_i are nonlinear with Gaussian additive noise (Hoyer et al., 2008; Peters et al., 2012), a typical setting in functional data. Our model over the set of random variables {X_i}_{i=1}^{d} is given by:

X_i := f_i(PA_i) + U_i,  i = 1, . . ., d,

where the additive noise terms U_i are jointly independent, i.e., the collection U_1, . . ., U_d is mutually independent. Based on these assumptions, we follow RESIT (Regression with Subsequent Independence Testing; Peters et al., 2014), an approach to causal discovery that can be briefly described as follows. If we only have two variables X and Y, we start by assuming that the samples in Y are the effect and the samples in X are the cause. We therefore write Y as a function f̂_Y of X plus some noise, and we test whether the residual r_Y = Y − f̂_Y(X) is independent of X. We then exchange the roles of assumed effect and assumed cause to obtain the residual r_X = X − f̂_X(Y), which is tested for independence from Y. To overcome issues with finite samples and test power, Peters et al. (2014) used the p-value of the independence tests as a measure of the strength of independence, which we follow in our experiments in Section 4.
Alternatively, one may determine the causal direction for this bivariate case by comparing both cause-effect directions with respect to a score defined by Bühlmann et al. (2014).
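A minimal sketch of the bivariate RESIT procedure, substituting polynomial regression for the paper's functional regression and comparing raw HSIC values instead of p-values (both simplifying assumptions made for brevity):

```python
import numpy as np

def hsic(a, b):
    """Biased HSIC with Gaussian kernels and median-heuristic bandwidths."""
    def gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2
        return np.exp(-d2 / (2 * np.median(d2[d2 > 0])))
    n = len(a)
    H = np.eye(n) - 1 / n                      # centring matrix
    return np.trace(gram(a) @ H @ gram(b) @ H) / n ** 2

def resit_direction(x, y, degree=3):
    """Fit each direction and return the one whose residual is *more*
    independent of the putative cause (smaller HSIC)."""
    r_y = y - np.polyval(np.polyfit(x, y, degree), x)   # assume X -> Y
    r_x = x - np.polyval(np.polyfit(y, x, degree), y)   # assume Y -> X
    return "X->Y" if hsic(r_y, x) < hsic(r_x, y) else "Y->X"

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = x ** 3 + x + rng.normal(size=300)          # nonlinear ANM with true X -> Y
direction = resit_direction(x, y)
```

In the anti-causal direction no additive-noise fit makes the residual independent of the input, which is the identifiability argument behind RESIT; the paper's version uses functional regression and p-values in place of these simplifications.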
For d variables, the joint independence test replaces the bivariate independence test. Firstly, we consider the set of candidate causal graphs, Γ = {G_1, . . ., G_γ, . . ., G_Γ}, which contains every potential DAG with d nodes. For each candidate DAG G_γ, we regress each descendant X_i on its parents PA_i,

X_i = Σ_{k ∈ PA_i} f_{i,k}(X_k) + U_i,

and we compute the residuals r_{X_i} = X_i − Σ_{k ∈ PA_i} f̂_{i,k}(X_k). We then apply the joint independence test to all d residuals. The candidate DAG G_γ is accepted as the 'true' causal graph if the null hypothesis of joint independence amongst the residuals is not rejected. This process is repeated over all candidate causal graphs in the set Γ. Again, because in finite samples this procedure may not always lead to only one candidate DAG being accepted, the one with the highest p-value is chosen (Peters et al., 2014).

4 Experiments
In practice, functional data analysis requires samples of continuous functions evaluated over a mesh of discrete observations. In our synthetic data, we consider the interval s ∈ [0, 1] for our functional dimension and draw 100 equally spaced points over s to evaluate the functions. For real-world data, we map the space in which the data live (e.g., the years 1996 to 2020) to the interval [0, 1] and interpolate over the discrete measurement points. We then use this interpolation to evaluate the functional samples on 100 equally spaced points. Unless otherwise mentioned, we henceforth use SE-T kernels with T = I, where I is the identity operator, and σ² set by the median heuristic, as defined in Section 3.
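The mapping of real-world measurement points onto [0, 1] can be sketched as follows; linear interpolation is used here for brevity where splines could equally be used:

```python
import numpy as np

def to_unit_interval(t_obs, x_obs, n_points=100):
    """Rescale observation times onto [0, 1] and resample the series on a
    regular mesh via linear interpolation."""
    t_obs = np.asarray(t_obs, dtype=float)
    t = (t_obs - t_obs.min()) / (t_obs.max() - t_obs.min())
    s = np.linspace(0, 1, n_points)
    return s, np.interp(s, t, x_obs)

years = np.arange(1996, 2021)                 # e.g., annual indicator values
values = np.sin(0.3 * (years - 1996))         # placeholder measurements
s, x = to_unit_interval(years, values)
```

After this preprocessing, every variable lives on the same regular 100-point mesh, so the discretised kernels of Section 3 can be applied uniformly across data sets.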

4.1 Evaluation of independence tests for functional data
Before applying our proposed independence tests to causal structure learning problems, we use appropriate synthetic data to evaluate their type-I error rate (test size) and type-II error rate (test power). Type-I errors are made when the samples are independent (achieved by setting a = 0 in our experiments below) but the test concludes that they are dependent, i.e., the test falsely rejects the null hypothesis H_0. In contrast, type-II errors appear when the samples are dependent (achieved by setting a > 0) but the test fails to reject H_0. Specifically, we compute the error rates on 200 independent trials, which correspond to different random realisations of the data sets, i.e., synthetically generated data sets with random Fourier basis coefficients, random coefficients of the β-functions, and random additive noise. We set the test significance level at α = 0.05 and approximate the null hypothesis of independence by 1000 permutations, using the CPT scheme described above where conditioning is involved. Note that although our synthetic data are designed to replicate typical behaviour in time-series, the independence tests are not limited to time as the functional dimension and can be used more generally.

4.1.1 Bivariate independence test
We consider n functional samples {x_i(s)}_{i=1}^{n} of a random variable X defined over the interval s ∈ [0, 1]. To generate our samples, we sum M = 3 Fourier basis functions ϕ_{m,T}(s) with period T = 0.1 and coefficients c_{i,m} randomly drawn from a standard normal distribution:

x_i(s) = Σ_{m=1}^{M} c_{i,m} ϕ_{m,T}(s),    (10)

where the Fourier functions are ϕ_{1,T}(s) = 1, ϕ_{2m,T}(s) = √(2/T) sin(2πms/T), and ϕ_{2m+1,T}(s) = √(2/T) cos(2πms/T). To mimic real-world measurements, we draw each sample over a distinct, irregular mesh of evaluation points. We then interpolate them with splines and generate realisations over a regular mesh of points. Multivariate random noise ϵ^X_i(s) ∼ N(0, I), where I is the identity matrix, is then added. The samples {y_i(t)}_{i=1}^{n} of random variable Y are defined as an HFLM (Malfait and Ramsay, 2003) over t ∈ [0, 1] by:

y_i(t) = a ∫_0^t x_i(s) β(s, t) ds,    (11)

where a ∈ [0, 1]. The function β(s, t) maps the dependence from x_i at any point s to y_i at any point t, and is defined here as a hyperbolic paraboloid function:

β(s, t) = (s − c_1)(t − c_2),    (12)

with coefficients c_1 and c_2 drawn independently from a uniform distribution over the interval [0.25, 0.75]. Afterwards, the samples y_i(t) are evaluated over a regular mesh of 100 points t ∈ [0, 1] and random noise ϵ^Y_i(t) ∼ N(0, I) is added. Clearly, for a = 0 our samples are independent, as the samples y_i(t) = ϵ^Y_i(t) are just random noise. As a is increased, the dependence between the samples x_i(s) and y_i(t) becomes easier to detect. Figure 1a shows that our test stays within the allowable false-positive rate α and detects the dependence as soon as a > 0, even with a low number of samples.
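Putting the Fourier expansion and the HFLM together, a sketch of such a synthetic generator might read as follows, with a saddle-shaped β(s, t) = (s − c1)(t − c2) as one possible hyperbolic paraboloid choice and a plain Riemann sum for the historical integral (both illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, S, M, period, a = 50, 100, 3, 0.1, 1.0
s = np.linspace(0, 1, S)
ds = s[1] - s[0]

# x_i(s) = sum_m c_{i,m} phi_{m,T}(s) with c_{i,m} ~ N(0, 1), plus noise.
phi = np.stack([np.ones(S),
                np.sqrt(2 / period) * np.sin(2 * np.pi * s / period),
                np.sqrt(2 / period) * np.cos(2 * np.pi * s / period)])
X = rng.normal(size=(n, M)) @ phi + rng.normal(size=(n, S))

# y_i(t) = a * integral_0^t x_i(s) beta(s, t) ds, then additive observation noise.
c1, c2 = rng.uniform(0.25, 0.75, size=2)
B = (s[:, None] - c1) * (s[None, :] - c2)      # beta(s, t) evaluated on the mesh
B = B * (s[:, None] <= s[None, :])             # keep only the history s <= t
Y = a * (X @ B) * ds + rng.normal(size=(n, S))
```

Setting a = 0 makes Y pure noise and hence independent of X, which is how the type-I error rate is probed in the experiments.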

4.1.2 Joint independence test
To produce the synthetic data for joint independence tests, we first generate random DAGs with d nodes using an Erdös-Rényi model with density 0.5 (Pfister et al., 2018, § 5.2). For each DAG, we first use Equation (10) to produce function-valued random samples for the variables X_i without parents. We then generate samples for other variables using a historical functional linear model:

x^j_i(t) = a Σ_{p ∈ PA_j} ∫_0^t x^p_i(s) β_p(s, t) ds,    (13)

where PA_j are the parents of node X_j, and the function β_p(s, t) is given by Equation (12) with random coefficients c_1 and c_2 independently generated for each descendant-parent pair indexed by p. After being evaluated at a regular mesh of 100 points within t ∈ [0, 1], random noise ϵ^j_i(t) ∼ N(0, I) is added. Note that, again, an increase in the factor a ∈ [0, 1] should make the dependence structure of the DAG easier to detect. Figure 1b shows the test size where a = 0, resulting in independent variables, and test power where a > 0. We evaluate the joint independence test for d = 4 variables over various values of a and a range of sample sizes.
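The random graphs used here can be generated by fixing a topological order and keeping each forward edge with probability equal to the desired density; this is one common Erdős-Rényi-style construction, shown as a sketch:

```python
from itertools import combinations
import numpy as np

def random_dag(d, density=0.5, seed=0):
    """Erdős-Rényi style random DAG: fix a topological order 0..d-1 and keep
    each forward edge i -> j (i < j) independently with probability density.
    Acyclicity holds by construction since all edges respect the order."""
    rng = np.random.default_rng(seed)
    return [(i, j) for i, j in combinations(range(d), 2) if rng.random() < density]

def parents(edges, j):
    return [i for i, k in edges if k == j]

edges = random_dag(4)
roots = [j for j in range(4) if not parents(edges, j)]  # sampled via Equation (10)
```

Root nodes are then drawn directly from the Fourier model, while every other node is built from its parents through the historical model of Equation (13).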

4.1.3 Conditional independence test
To evaluate whether X is independent of Y given Z, where Z may be any subset within the power set of the d \ {X, Y} remaining variables, we generate data samples according to historical functional linear models: the conditioning variables Z_j are drawn as in Equation (10), X and Y each depend on every Z_j through β-functions given by Equation (12) and normalised by the cardinality |Z| of the set Z, and Y additionally depends on X with strength a′. Noise terms ϵ^{Z_j}_i(u), ϵ^X_i(s) and ϵ^Y_i(t) are added to the discrete observation values.
We then apply Algorithm 1 to compute the statistics for our conditional independence test. Firstly, we find that the optimised regularisation strength λ* is robust and reproducible across different random realisations with the same sample size and model parameters. Consequently, we search for one optimised λ* (lines 1-18 in Algorithm 1) for each sample size and fix this value for all 200 independent trials. Figure 2 summarises the results for λ* for n ∈ {100, 200, 300} samples after conducting a grid search over a range of L = 18 possible values 10^{-4} ≤ λ ≤ 10^{-1}, with P = 1000 permutations and B = 100 rejection iterations. Note that the range of values for λ can be tuned in stages by the practitioner, e.g., starting with a coarse initial exploration followed by a more fine-grained range. We recommend choosing B ≥ 100 for ease of comparison of the evaluation rejection rate to the acceptable false positive rate α. We find that the optimised λ* saturates as the number of conditioning variables (the dimension of the conditioning set) is increased. Given that we perform a kernel ridge regression, a larger number of samples n should result in a lower requirement for regularisation, which aligns with our observation that λ* decreases as the number of samples n increases. Algorithm 1 thus optimises the regularisation parameter λ* such that the evaluation rejection rate of the test is closest to the allowable false positive rate α = 0.05. Figures 3a-3d show the results of the conditional independence test for increasing dimension d of the conditioning set (i.e., the number of conditioning variables). We find that the test power is well preserved as d increases, through a concomitant increase of λ* that partly ameliorates the "curse of dimensionality" (Shah et al., 2020). Furthermore, the values of the test power for a′ = 0 correspond to the type-I error rates.

Causal structure learning
We use the three independence tests evaluated numerically in Section 4.1 to learn different causal structures among function-valued random variables. We start with regression-based methods for the bivariate case X → Y and extend to the multivariate case through regression-based, constraint-based, and combined constraint- and regression-based causal structure learning approaches.

Synthetic data
Regression-based causal discovery. We start by evaluating synthetic data for a bivariate system X → Y generated according to Equations (10)-(11) with a = 1. We generate 200 independent trials and score the performance of the method using the structural Hamming distance (SHD) (see Appendix D for its definition). As seen in Figure 4a, our regression-based algorithm using RESIT outperforms two well-known algorithms for causal discovery: Granger-causality, which assumes linearity in the relationships (Granger, 1969), and CCM, which allows for nonlinearity through a time-series embedding (Sugihara et al., 2012).
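To make the regression-based step concrete, the following sketch illustrates the RESIT principle on a toy bivariate sample, using a plain kernel ridge regression and the biased empirical hsic of the residuals. The data-generating model, kernel bandwidth and regularisation value are illustrative assumptions, not the settings of Equations (10)-(11):

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    """RBF Gram matrix of a 1-d sample."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y):
    """Biased empirical HSIC: trace(K H L H) / n^2 with centring matrix H."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(rbf_gram(x) @ H @ rbf_gram(y) @ H)) / n ** 2

def residuals(x, y, lam=1e-2):
    """Residuals of a kernel ridge regression of y on x."""
    K = rbf_gram(x)
    return y - K @ np.linalg.solve(K + lam * np.eye(len(x)), y)

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = np.tanh(2 * x) + 0.1 * rng.normal(size=300)   # ground truth: X -> Y

# RESIT: prefer the direction whose residuals are least dependent
# on the putative cause.
score_xy = hsic(x, residuals(x, y))   # regress Y on X
score_yx = hsic(y, residuals(y, x))   # regress X on Y
direction = "X -> Y" if score_xy < score_yx else "Y -> X"
```

Here the direction with the smaller hsic score of the residuals is reported; a full test would instead compare each score against a permutation-based null distribution at a given significance level.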
We then evaluate how the regression-based approach performs in a system of three variables, where data are generated according to random DAGs with three nodes that entail historical nonlinear dependence. One of the above methods (CCM) is commonly applied to bivariate problems only; hence we compare our method on a system with three variables to multivariate Granger-causality (Geweke, 1982) as well as to the PC algorithm with momentary conditional independence (PCMCI), an algorithm in which each time point is represented as a node in a graph (Runge et al., 2019b), from which we extract a directed graph for the variables. Figure 4b shows substantially improved performance (lower SHD) of our regression-based algorithm with respect to both of these methods on 3-variable systems. Details of all the methods are given in Appendix E.
Constraint-based causal discovery. The accuracy of our constraint-based approach is evaluated by computing three metrics: normalised SHD (SHD_norm), precision, and recall (see Appendix D for definitions). Figure 5 shows the accuracy of our algorithm when applied to d ∈ {3, 4, 5, 6} variables. The normalised SHD in Figure 5a demonstrates consistent accuracy of the constraint-based causal structure learning method across sample sizes for all d. To complement this measure, we can jointly examine precision and recall in Figures 5b-5c, where we find that the learnt edges are predominantly unoriented (low values of recall), but the oriented edges that are found are indeed correctly oriented (high values of precision).
Combined approach. To scale up our causal discovery techniques more efficiently to larger graphs, it is possible to combine constraint- and regression-based causal learning, yielding DAGs. In this "combined" approach, we start with a constraint-based step through which we learn CPDAGs by applying conditional independence tests to any two variables conditioned on any subset of the remaining d − 2 variables, and we orient edges according to the Algorithm in Appendix C, which finds v-structures, applies Meek's orientation rules, and returns the Markov equivalence class. The Markov equivalence class often contains undirected edges, but we can then take a second step, under a different set of assumptions, to orient these edges by applying RESIT. This two-step process yields a DAG, i.e., a graph in which every edge is oriented, in a more scalable manner than applying regression-based causal discovery directly. We then measure the accuracy of our approach by computing the normalised SHD, Equation (16), as above. Figure 6 shows that our results for d = 3, 4, 5, 6 variables compare favourably to PCMCI and to multivariate Granger-causality, with consistently lower SHD for all d.

Real-world data
To showcase the application of our methods to real data, we return to the motivating example on socio-economic indicators mentioned in the Introduction (Section 1). Sulemana and Kpienbaareh (2018) tested for the causal direction between corruption and income inequality, as measured by the Control of Corruption index (Kaufmann et al., 2011) and the Gini coefficient (Gini, 1936), respectively. They applied Granger-causality with Wald-χ² tests, which statistically test the null hypothesis that an assumed explanatory variable does not have a (linear) influence on the other considered variables. Each variable, corruption and income inequality, was tested as the explanatory variable (i.e., the cause) in their bivariate relationship. Their findings showed that income inequality as the explanatory variable results in a p-value of 0.079, whereas corruption is more likely to be the cause with a p-value of 0.276. Analysing the same data using CCM, which does not assume linearity, we find the same cause-effect direction: 70.9% of the sample pairs (x_i, y_i) validate corruption as the cause, against 29.7% of pairs which detect the opposite direction. However, using our proposed regression-based kernel approach, which assumes neither linearity nor stationarity, we find the opposite result: the causal dependence of corruption on income inequality is statistically more likely (p = 0.0859) than the reverse direction (p = 0.0130) when applying regression with subsequent independence test (RESIT), which rejects the null hypothesis of independence of the residuals after regression in the false causal direction.
Going beyond pairwise interactions, we have also applied our causal structure learning method to the set of World Governance Indicators (WGIs) (Kaufmann et al., 2011). The WGIs are a collection of six variables, measured in 182 countries over the 25 years from 1996 to 2020 (see Appendix I). We view the time-series of the countries as independent functional samples of each variable (so that n = 182 and the functional dimension is time), and we apply our methods as described above to establish the causal structure amongst the six WGIs. In this case, we use the "combined" approach by successively applying our constraint- and regression-based causal structure learning methods. We first learn an undirected causal skeleton from the WGIs (Figure 7a), and we find a triangle between Government Effectiveness (GE), Rule of Law (RL) and Control of Corruption (CC), and a separate link between Voice and Accountability (VA) and Regulatory Quality (RQ). We then orient these edges (Figure 7b) and find that Government Effectiveness (GE) causes both Rule of Law (RL) and Control of Corruption (CC), and Rule of Law (RL) causes Control of Corruption (CC). We also find that Voice and Accountability (VA) causes Regulatory Quality (RQ).

Figure 6: Causal structure learning with the "combined" approach, where we first apply the constraint-based method to find the Markov equivalence class, followed by the regression-based method to orient the undirected edges of the Markov equivalence class. We compute the normalised SHD over d ∈ {3, 4, 5, 6} variables and n ∈ {100, 200, 300} samples with a = 1 as in Equation (13). Our results (dashed lines) have substantially lower normalised SHD than those obtained from Granger-causality (dash-dotted lines) and PCMCI (solid lines), applied as described in Appendix E.

Discussion and Conclusion
We present a causal structure learning framework for functional data that utilises kernel-based independence tests to extend the applicability of the widely used regression- and constraint-based approaches. The foundation of the framework originates from the functional data analysis literature, interpreting discretely measured observations of a random variable as finite-dimensional realisations of underlying functions.
Using synthetic data, we have demonstrated that our regression-based approach outperforms existing methods such as Granger-causality, CCM and PCMCI when learning causal relationships between two and three variables. In the bivariate case, we have carried out a more detailed comparison to Granger-causality and CCM to explore the robustness of our regression-based approach to nonlinearity and nonstationarity in the data (see Appendix H). We find that while Granger-causality degrades under the introduction of nonlinearity in the data, and CCM degrades under the introduction of nonstationarity, our method remains robust under both nonlinearity and nonstationarity. In addition, as seen in Figure 4a, CCM can have difficulty detecting strong unidirectional causal dependencies (Sugihara et al., 2012; Ye et al., 2015), where the cause variable X uniquely determines the state of the effect variable Y, inducing "generalised synchrony" (Rulkov et al., 1995). In such cases, CCM can predict samples of Y from X equally well as X from Y; hence CCM cannot distinguish X → Y from X ← Y. In contrast, our experiments (Section 4.2.1 and Appendix H) show that our regression-based method is unaffected by the presence of "generalised synchrony" in the data.
Further, we show that our conditional independence test, which is the cornerstone of the constraint-based causal discovery approach, achieves type-I error rates close to the acceptable false-positive rate α and high test power (i.e., low type-II error rates), even when the number of variables in the conditional set increases. Shah et al. (2020) rightly state that any conditional independence test can suffer from an undesirable test size or low test power in finite samples, and our method is no exception. However, Figure 3 demonstrates the counterbalancing of the "curse of dimensionality" through the optimised regularisation strength λ*. Indeed, while we would generally expect the test power to diminish with larger numbers of conditional variables, the optimisation of λ* offsets this reduction, resulting in no significant decrease in test power as the number of conditional variables grows. Although our proposed method is computationally expensive and benefits significantly from large sample sizes, a suitable regularisation strength can in principle be chosen for large conditional sets.
Moreover, we demonstrate that constraint- and regression-based causal discovery methods can be combined to learn DAGs with a large number of nodes, which would otherwise be computationally very expensive when relying on regression-based methods alone to yield DAGs. Comparing Figures 5a and 6, however, we see that this does not necessarily result in a lower SHD. After learning the Markov equivalence class through the constraint-based approach, we apply RESIT to orient the undirected edges in the "combined" approach. Here, a mistake in the orientation of an undirected edge adds 2 to the SHD, whereas an edge left undirected adds only 1. When applied to real-world data, we have utilised this approach to learn a causal graph of the WGIs, assuming each relationship between two variables is uni-directionally identifiable, as suggested by several economic studies (Jong-Sung and Khagram, 2005; Alesina and Angeletos, 2005; Dobson and Ramlogan-Dobson, 2010).
The presented work contributes to opening the field of causal discovery to functional data. Beyond the results presented here, we believe that more research needs to be conducted to (i) increase the efficiency of the independence tests, so that smaller sample sets can achieve higher test power; (ii) learn about upper bounds of the regularisation strength λ* with respect to the size of the sample and conditional sets; (iii) reduce the computational cost of the conditional independence test (see Appendices F and G for a comparison to other methods); and (iv) establish connections, and investigate differences, to causal structure learning approaches based on transfer entropy (Barnett et al., 2009; Porta et al., 2015).
A Proof of Theorem 1

Proof. All norms ∥·∥ and inner products ⟨·, ·⟩ in this proof are taken with respect to the RKHS in which the normand and the arguments of the inner product reside.
Denote by HSCIC(X, Y, Z) the true Hilbert-Schmidt conditional independence criterion between the random variables X and Y given Z, i.e.

HSCIC(X, Y, Z) = ∥µ_XY|Z − µ_X|Z ⊗ µ_Y|Z∥,

and by

ĤSCIC(X, Y, Z, λ_n) = ∥µ̂_XY|Z − µ̂_X|Z ⊗ µ̂_Y|Z∥

its empirical counterpart, where µ̂_XY|Z, µ̂_X|Z and µ̂_Y|Z are the empirical estimates of the conditional mean embeddings µ_XY|Z, µ_X|Z and µ_Y|Z obtained with the regularisation parameter λ_n.

Note that by the reverse triangle inequality, followed by the ordinary triangle inequality, we have

|ĤSCIC(X, Y, Z, λ_n) − HSCIC(X, Y, Z)| ≤ ∥(µ̂_XY|Z − µ̂_X|Z ⊗ µ̂_Y|Z) − (µ_XY|Z − µ_X|Z ⊗ µ_Y|Z)∥ ≤ ∥µ_XY|Z − µ̂_XY|Z∥ + ∥µ_X|Z ⊗ µ_Y|Z − µ̂_X|Z ⊗ µ̂_Y|Z∥.

Here,

∥µ_X|Z ⊗ µ_Y|Z − µ̂_X|Z ⊗ µ̂_Y|Z∥ ≤ ∥µ_X|Z∥ ∥µ_Y|Z − µ̂_Y|Z∥ + ∥µ̂_Y|Z∥ ∥µ_X|Z − µ̂_X|Z∥,

where we used the Cauchy-Schwarz inequality in the last inequality. We note that Park and Muandet (2020, Theorem 4.4) ensures that each of ∥µ_XY|Z − µ̂_XY|Z∥, ∥µ_Y|Z − µ̂_Y|Z∥ and ∥µ_X|Z − µ̂_X|Z∥ converges to 0 in probability; hence ĤSCIC(X, Y, Z, λ_n) → HSCIC(X, Y, Z) in probability with respect to the L²-norm.

C Meek's orientation rules, the SGS and the PC algorithms
In Figure 8, a node is placed at each end of the edges. Rule 1 states that if a directed edge (here the vertical edge) points towards an undirected edge (the horizontal edge) that is not further connected to any other edges, the undirected edge must point in the same direction as the adjacent edge, because the three vertices would otherwise form a v-structure (which are only formed in the presence of conditional dependence). Rule 2, in contrast, orients the undirected edge in the direction of the existing directed path between its endpoints, because the opposite orientation would violate the acyclicity of a DAG. Rules 3 and 4 prevent new v-structures, which would come into existence by applying Rule 2 twice if the oriented edges pointed in the opposite direction; these new v-structures would appear in the top-left corners of the rectangles.
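As a minimal illustration, Rule 1 can be applied directly on an adjacency-matrix representation of a partially directed graph, where a directed edge i → j is stored as adj[i][j] = 1 and an undirected edge as a symmetric pair of ones (a hypothetical encoding choice; Rules 2-4 follow the same pattern):

```python
def meek_rule_1(adj):
    """Meek's Rule 1: if a -> b, b - c is undirected, and a, c are
    non-adjacent, orient b -> c (otherwise a, b, c would form a new
    v-structure). Applied repeatedly until no further change."""
    d = len(adj)
    changed = True
    while changed:
        changed = False
        for a in range(d):
            for b in range(d):
                if adj[a][b] and not adj[b][a]:            # a -> b
                    for c in range(d):
                        if c in (a, b):
                            continue
                        if (adj[b][c] and adj[c][b]         # b - c undirected
                                and not adj[a][c] and not adj[c][a]):
                            adj[c][b] = 0                   # orient b -> c
                            changed = True
    return adj

# Example: a -> b plus undirected b - c yields b -> c.
pdag = [[0, 1, 0],
        [0, 0, 1],
        [0, 1, 0]]
oriented = meek_rule_1(pdag)
```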
Figure 8: Meek's rules to orient edges that remain in the graph after the conditional independence tests, once edges have been oriented based on detected colliders. Proofs that the edges cannot be oriented in the opposite direction without violating the acyclicity of the graph or the found conditional independencies are given in Meek (1995).
Based on Meek's orientation rules, the SGS algorithm tests each pair of variables as follows (Spirtes et al., 2000, § 5.4.1). The large computational cost of applying the SGS algorithm (especially step 2) to data, due to the large number of combinations of variables as potential conditional sets, opened the door for computationally more efficient methods. The PC algorithm (Spirtes et al., 2000, § 5.4.2) is amongst the most popular alternatives and minimises the computational cost by searching for conditional independence in a structured manner, as opposed to step 2 of the SGS algorithm, which iterates over all possible conditional subsets for any two variables A and B. Starting again with a fully connected undirected graph, the PC algorithm begins by testing for marginal independence between any two variables A and B, and deletes the edge A − B connecting them if A ⊥⊥ B. Then, it proceeds by testing for A ⊥⊥ B | C and erases A − B if one conditional variable C is found that makes A and B independent given C. Afterwards, the conditional set is extended to two variables, {C, D}, and the edge A − B is deleted if this conditional set makes A and B independent. The conditional set is extended, round after round, until no more conditional independencies are found, resulting in the sparsified graph K.
Steps 3 and 4 are then pursued as in the SGS algorithm.
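The exhaustive skeleton search of step 2 of the SGS algorithm can be sketched as follows, with the kernel-based conditional independence test replaced by a crude partial-correlation test (a linear, Gaussian stand-in used purely for illustration):

```python
import numpy as np
from itertools import chain, combinations

def subsets(items):
    """All subsets of a list, including the empty set."""
    return chain.from_iterable(combinations(items, r)
                               for r in range(len(items) + 1))

def pcorr_indep(data, i, j, S, thresh=0.1):
    """Toy CI test: regress out the conditioning set S, then declare
    independence if the residual correlation is small."""
    x, y = data[:, i], data[:, j]
    if S:
        Z = np.column_stack([data[:, list(S)], np.ones(len(x))])
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return abs(np.corrcoef(x, y)[0, 1]) < thresh

def sgs_skeleton(data, ci_test=pcorr_indep):
    """Start fully connected; delete i - j if ANY subset of the other
    nodes renders i and j conditionally independent (step 2 of SGS)."""
    d = data.shape[1]
    edges = {frozenset(p) for p in combinations(range(d), 2)}
    for i, j in combinations(range(d), 2):
        rest = [k for k in range(d) if k not in (i, j)]
        if any(ci_test(data, i, j, S) for S in subsets(rest)):
            edges.discard(frozenset((i, j)))
    return edges

# Chain 0 -> 1 -> 2: the 0 - 2 edge should be removed since 0 ⊥⊥ 2 | 1.
rng = np.random.default_rng(0)
x0 = rng.normal(size=2000)
x1 = x0 + 0.5 * rng.normal(size=2000)
x2 = x1 + 0.5 * rng.normal(size=2000)
skeleton = sgs_skeleton(np.column_stack([x0, x1, x2]))
```

On this linear chain the edge 0 − 2 is removed because 0 and 2 are independent given 1, while the two true edges survive every conditioning set.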

D Definitions of performance metrics SHD
For two graphs, G_1 and G_2, the SHD is the sum of the element-wise absolute differences between the adjacency matrices A_1 and A_2 of G_1 and G_2, respectively:

SHD(G_1, G_2) = Σ_{i,j} |A_1(i, j) − A_2(i, j)|.

Thus, when a learnt causal graph G_1 includes an edge X → Y oriented opposite to the corresponding edge in the true graph G_2 (i.e., X ← Y is true), we add 2 to the SHD score ("double penalty"). Note that others (e.g., Peters and Bühlmann, 2015) only penalise by 1 for a learnt edge directed opposite to the true edge.
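A minimal sketch of the SHD on adjacency matrices; the normalised variant divides by the number of possible directed edges, which we take here to be the d(d − 1) ordered node pairs (our reading of the definition):

```python
import numpy as np

def shd(A1, A2):
    """Sum of element-wise absolute differences of the adjacency
    matrices; a reversed edge contributes 2 (the 'double penalty')."""
    return int(np.abs(np.asarray(A1) - np.asarray(A2)).sum())

def shd_norm(A1, A2):
    """SHD divided by the d(d - 1) possible directed edges."""
    d = len(A1)
    return shd(A1, A2) / (d * (d - 1))

A_true  = [[0, 1], [0, 0]]   # true graph: X -> Y
A_learn = [[0, 0], [1, 0]]   # learnt graph: X <- Y (reversed)
shd(A_true, A_learn)         # -> 2 (double penalty)
```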

Normalised SHD
We also define the normalised SHD, where we divide the SHD by the number of possible directed edges, d(d − 1), in a graph with d nodes, thus allowing comparison across graphs with different numbers of nodes:

SHD_norm = SHD / (d(d − 1)).

Precision
Let the total number of edges in a learnt graph G_1 be the sum of truly and falsely oriented edges, e = e_t + e_f (the true positives and false positives, respectively). The truly oriented edges are those that correspond to zero elements in the matrix of element-wise differences between the learnt A_1 and the true A_2; in contrast, the falsely oriented edges are those that incur the "double penalty" of 2. Precision is the proportion of correctly oriented edges e_t in the learnt graph G_1, in comparison to the true graph G_2, out of all learnt edges e in G_1:

Precision = e_t / e.
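The following sketch computes precision as defined above, together with the recall of the next subsection, assuming (our own encoding choice) that an undirected CPDAG edge is stored symmetrically, A[i, j] = A[j, i] = 1:

```python
import numpy as np

def precision_recall(A_learn, A_true):
    """Precision: fraction of learnt oriented edges that match the true
    orientation. Recall: fraction of learnt edges that are oriented at
    all (undirected edges are stored as symmetric pairs of ones)."""
    A_learn, A_true = np.asarray(A_learn), np.asarray(A_true)
    d = A_learn.shape[0]
    oriented, unoriented, correct = 0, 0, 0
    for i in range(d):
        for j in range(i + 1, d):
            if A_learn[i, j] and A_learn[j, i]:        # undirected edge
                unoriented += 1
            elif A_learn[i, j] or A_learn[j, i]:       # oriented edge
                oriented += 1
                if (A_learn[i, j] == A_true[i, j]
                        and A_learn[j, i] == A_true[j, i]):
                    correct += 1
    precision = correct / oriented if oriented else 0.0
    recall = oriented / (oriented + unoriented) if oriented + unoriented else 0.0
    return precision, recall

# True chain 0 -> 1 -> 2; learnt: 0 -> 1 oriented, 1 - 2 left undirected.
A_true  = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
A_learn = [[0, 1, 0], [0, 0, 1], [0, 1, 0]]
p, r = precision_recall(A_learn, A_true)
```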

Recall
We separate between oriented and unoriented edges, e_o and e_u, respectively, which again sum to the total number of edges, e = e_o + e_u. Recall is the fraction of edges in the CPDAG that are oriented:

Recall = e_o / (e_o + e_u).

E Other causal methods used for comparison

Granger-causality
We implement Granger-causality through software provided by Seabold and Perktold (2010) for d = 2 and by Runge et al. (2019b) for d > 2, and define the optimal lag as the first local minimum of the mutual information function over measurement points. Given that Granger-causality is applied to a pair of samples (x_i, y_i), we iterate over every pair and determine the percentage of pairs that result in X → Y versus X ← Y. We follow a similar process for more than two variables.
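A sketch of the lag-selection heuristic: a plug-in histogram estimate of mutual information, and the first local minimum of the lagged auto-mutual information. The estimator and bin count are illustrative assumptions, not the implementation of the cited software:

```python
import numpy as np

def mutual_info_hist(x, y, bins=16):
    """Plug-in histogram estimate of the mutual information I(X; Y)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    pxpy = np.outer(pxy.sum(axis=1), pxy.sum(axis=0))
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / pxpy[nz])))

def optimal_lag(x, max_lag=20, bins=16):
    """Optimal lag = first local minimum of the auto mutual information
    I(x_t; x_{t-k}) over k = 1, ..., max_lag."""
    mi = [mutual_info_hist(x[k:], x[:-k], bins) for k in range(1, max_lag + 1)]
    for k in range(1, len(mi) - 1):
        if mi[k - 1] > mi[k] < mi[k + 1]:
            return k + 1                 # lags are 1-indexed
    return int(np.argmin(mi)) + 1        # fall back to the global minimum
```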

CCM
We implement CCM through software provided by Javier (2021) and define the optimal lag as for Granger-causality.
The embedding dimensionality is determined by a false nearest neighbour method (Munch et al., 2022). As with Granger-causality, CCM is applied to a pair of samples (x_i, y_i); we take the same summarising approach, iterating over every pair and determining the percentage of pairs that result in X → Y versus X ← Y.
PCMCI

Runge et al. (2019b) proposed PCMCI, a method that produces time-series graphs in which each node represents a variable at a certain point in time. In our experiments, we estimate the maximum possible lag as the average of the first local minimum of the mutual information over all considered data series. To test for the presence of an edge, we apply distance correlation-based independence tests (Székely et al., 2007) between the residuals of X and Y that remain after regressing out the influence of the remaining nodes in the graph through Gaussian processes. The distance correlation coefficients of the significant edges between two variables are summed to measure the prevalence of a causal direction between these two variables. The direction with the greater sum is taken as the directed edge and assessed against the edge between the same two variables in the true DAG.

Regression-based causal discovery
One needs to iterate over all possible candidate DAGs to check whether each could be the causal graph that generated the observed data. The total number of candidate DAGs κ_d of a graph with d nodes is given by the recurrence

κ_d = Σ_{k=1}^{d} (−1)^{k+1} C(d, k) 2^{k(d−k)} κ_{d−k}, with κ_0 = 1,

which grows super-exponentially with d.
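Assuming the classical inclusion-exclusion recurrence for the number of labelled DAGs, a few lines of code show how quickly exhaustive enumeration becomes infeasible:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def n_dags(d):
    """Number of labelled DAGs on d nodes via the classical
    inclusion-exclusion recurrence over the k source nodes."""
    if d == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(d, k) * 2 ** (k * (d - k)) * n_dags(d - k)
               for k in range(1, d + 1))

counts = [n_dags(d) for d in range(1, 6)]   # -> [1, 3, 25, 543, 29281]
```

Already at d = 5 there are nearly thirty thousand candidate DAGs, which motivates the combined constraint- and regression-based approach of the main text.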

Constraint-based approach
One needs to exhaustively test for conditional independence between any two nodes of the graph given any possible subset of the remaining nodes. Including the empty conditional set, this results in 2^{d−2} conditional independence tests for every pair of nodes in the graph. Then, the total number of conditional independence tests is:

C(d, 2) · 2^{d−2} = d(d − 1) · 2^{d−3}.

G Comparison of running times of our proposed method and the other causal methods

The nonstationary trends used in the robustness experiments of Appendix H are given by ν_X(t) = tanh(c_ν t − c_ν/2) with c_ν ∼ N(8, 1), and similarly for ν_Y(t); the factor r ∈ [0, 1] controls the weight of the nonstationarity of the time-series as it is increased.
As shown in Figure 10, CCM is only able to capture the true causal direction X → Y for r = 0, as expected, and loses its detection power as soon as the data become nonstationary (r > 0). In contrast, the regression-based method maintains its detection power as the nonstationarity is increased.

Figure 1 :
Figure 1: (a) Bivariate and (b) joint independence tests over various values of the dependence factor a and sample size n. The joint independence test is conducted with d = 4 variables.

Figure 2 :
Figure 2: The optimised λ* for increasing dimension d of the conditional set and different sample sizes n ∈ {100, 200, 300}.

Figure 3 :
Figure 3: The test power of the conditional independence tests over various values of a′ and sample sizes n, from (a) 1 to (d) 4 conditional variables. The regularisation parameter λ* is given in the legend.

Figure 4 :
Figure 4: Accuracy of the regression-based causal discovery using kernel-based joint independence tests among residuals for (a) two and (b) three variables. In the case of two variables (a), we compare our kernel-based method to Granger-causality and CCM. In the case of three variables (b), we compare to multivariate Granger-causality as well as to PCMCI, applied so as to produce comparable results. Our method significantly outperforms both Granger-causality and CCM in the bivariate setting, with just a few mistakes made for low sample numbers (n = 100) and no mistakes for higher sample sizes. For three variables, our method is substantially more accurate than PCMCI and Granger-causality, with an average SHD of 0.3 at n = 300 versus 2.9 for PCMCI and 3.12 for Granger-causality. The accuracy of the methods is measured in terms of the structural Hamming distance (SHD).

Figure 5 :
Figure 5: Constraint-based causal structure learning experiments. From left to right, we compute (a) the normalised SHD, (b) precision and (c) recall over d ∈ {3, 4, 5, 6} variables and n ∈ {100, 200, 300} samples, with a = 1 as the dependence between any two variables in the data that are connected by an edge in the true DAG.

Figure 7 :
Figure 7: Undirected (a) and directed (b) causal networks on the World Governance Indicators data set to evaluate the results of the constraint-based approach alone (a) and the subsequent regression-based approach (b).The labels are abbreviations of the official names: Voice and Accountability (VA), Political Stability (PS), Government Effectiveness (GE), Regulatory Quality (RQ), Rule of Law (RL), and Control of Corruption (CC) (see Appendix I).

Figure 10 :
Figure 10: Comparison between CCM and our regression-based method for n = 100 and various values of r, which controls the level of nonstationarity in the data for X and Y.

Table 1
shows the running times of the proposed and comparison methods for the synthetic data in Section 4. Furthermore, for the bivariate real-world socioeconomic data set of corruption and income inequality in Section 4.2.2, Granger-causality completes the computation in 731 ms, CCM in 537 ms, and the regression-based approach in 19.2 s (according to our Python implementations).

vice-versa, and we add nonstationary trends ν_X(t) and ν_Y(t):

x_i(t + 1) = x_i(t) [0.8 − 3.5 x_i(t) − 0.02 y_i(t)] + r ν_X(t)    (22)

y_i(t + 1) = y_i(t) [0.2 − 3.2 y_i(t) − 0.1 x_i(t)] + r ν_Y(t)