Nonparametric Causal Structure Learning in High Dimensions

The PC and FCI algorithms are popular constraint-based methods for learning the structure of directed acyclic graphs (DAGs) in the absence and presence of latent and selection variables, respectively. These algorithms (and their order-independent variants, PC-stable and FCI-stable) have been shown to be consistent for learning sparse high-dimensional DAGs based on partial correlations. However, inferring conditional independences from partial correlations is valid only if the data are jointly Gaussian or generated from a linear structural equation model, an assumption that may be violated in many applications. To broaden the scope of high-dimensional causal structure learning, we propose nonparametric variants of the PC-stable and FCI-stable algorithms that employ the conditional distance covariance (CdCov) to test for conditional independence relationships. As the key theoretical contribution, we prove that the high-dimensional consistency of the PC-stable and FCI-stable algorithms carries over to general distributions over DAGs when we implement CdCov-based nonparametric tests for conditional independence. Numerical studies demonstrate that our proposed algorithms perform nearly as well as the PC-stable and FCI-stable for Gaussian distributions, and offer advantages in non-Gaussian graphical models.


Introduction
Directed acyclic graphs (DAGs) are commonly used to represent causal relationships among random variables [1][2][3]. The PC algorithm [3] is the most popular constraint-based method for learning DAGs from observational data under the assumption of causal sufficiency, i.e., when there are no unmeasured common causes and no selection variables. It first estimates the skeleton of a DAG by recursively performing a sequence of conditional independence tests, and then uses the information from the conditional independence relations to partially orient the edges, resulting in a completed partially directed acyclic graph (CPDAG). In Section 2, we provide a review of these and other notions commonly used in the graphical modeling literature that are relevant to our work. In addition, we refer to estimating the CPDAG as structure learning of the underlying DAG throughout the rest of the paper.
Observational studies often involve latent and selection variables, which complicate the causal structure learning problem. Ignoring such unmeasured variables can make the causal inference based on the PC algorithm erroneous; see, e.g., Section 1.2 in [4] for some illustrations. The Fast Causal Inference (FCI) algorithm and its variants [3][4][5][6] utilize similar strategies as the PC algorithm to learn the DAG structure in the presence of latent and selection variables.
Both PC and FCI algorithms adopt a hierarchical search strategy: they recursively perform conditional independence tests given subsets of increasingly larger cardinalities in some appropriate search pool. The PC algorithm is usually order-dependent, in the sense that its output depends on the order in which pairs of adjacent vertices and subsets of their adjacency sets are considered. The FCI algorithm suffers from a similar limitation. To overcome this limitation, Ref. [7] proposed two variants of the PC and FCI algorithms, namely the PC-stable and FCI-stable algorithms, that resolve the order dependence at different stages of the algorithms.

Preliminaries on Graphical Modeling
We begin by introducing some necessary terminology and background information. Our notation and terminology follow standard conventions in graphical modeling (see, e.g., [3]). A graph G = (V, E) consists of a vertex set V = {1, . . . , p} and an edge set E ⊆ V × V. In a graphical model, the vertices or nodes are associated with random variables X a for 1 ≤ a ≤ p. Throughout, we index the nodes by the corresponding random variables. We also allow the edge set E of the graph G to contain (a subset of) the following six types of edges: → (directed), ↔ (bidirected), − (undirected), •−• (nondirected), •− (partially undirected) and •→ (partially directed). The endpoints of an edge are called marks, which can be tails, arrowheads or circles. A "•" at the end of an edge indicates that it is not known whether an arrowhead should occur at that place. We use the symbol '∗' to denote an arbitrary edge mark; for example, the symbol ∗→ represents an edge of the type →, ↔ or •→ in the graph. A mixed graph is a graph containing directed, bidirected and undirected edges. A graph containing only directed edges (→) is called a directed graph, one containing only undirected edges (−) is called an undirected graph, and one containing directed and undirected edges is called a partially directed graph.
The adjacency set of a vertex X a in the graph G = (V, E), denoted adj(G, X a ), is the set of all vertices in V that are adjacent to X a , or, in other words, are connected to X a by an edge. The degree of a vertex X a , |adj(G, X a )|, is defined as the number of vertices adjacent to it. A graph is complete if all pairs of vertices in the graph are adjacent. A vertex X b ∈ adj(G, X a ) is called a parent of X a if X b → X a , a child of X a if X a → X b and a neighbor of X a if X a − X b . The skeleton of the graph G is the undirected graph obtained by replacing all the edges of G by undirected edges, in other words, ignoring all the edge orientations. Three vertices X a , X b , X c are called an unshielded triple if X a and X b are adjacent, X b and X c are adjacent, but X a and X c are not adjacent. A path is a sequence of distinct adjacent vertices. A node X a is an ancestor of its descendant X b if G contains a directed path X a → · · · → X b . A non-endpoint vertex X a on a path is called a collider on the path if both the edges preceding and succeeding it have an arrowhead at X a , or, in other words, the path contains ∗→ X a ←∗. An unshielded triple X a , X b , X c is called a v-structure if X b is a collider on the path X a , X b , X c .
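The notions above can be made concrete with a toy implementation (ours, for illustration only, not the paper's code), representing a directed graph by a dict mapping each vertex to its set of children:

```python
def all_nodes(children):
    """Every vertex appearing as a parent or a child."""
    return set(children) | {c for kids in children.values() for c in kids}

def skeleton(children):
    """Undirected edges obtained by ignoring all edge orientations."""
    return {frozenset((a, b)) for a, kids in children.items() for b in kids}

def adj(children, a):
    """adj(G, X_a): vertices connected to a by an edge in either direction."""
    return {x for e in skeleton(children) if a in e for x in e} - {a}

def v_structures(children):
    """Unshielded triples a -> b <- c with a and c non-adjacent."""
    sk, out = skeleton(children), set()
    for b in all_nodes(children):
        ps = sorted(a for a, kids in children.items() if b in kids)
        for i, a in enumerate(ps):
            for c in ps[i + 1:]:
                if frozenset((a, c)) not in sk:
                    out.add((a, b, c))
    return out

# Example: in 1 -> 2 <- 3, 2 -> 4, the collider at 2 is the only v-structure.
g = {1: {2}, 3: {2}, 2: {4}}
```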
A cycle occurs in a graph when there is a path from X a to X b , and X a and X b are adjacent. A directed path from X a to X b forms a directed cycle together with the edge X b → X a , and it forms an almost directed cycle together with the edge X b ↔ X a . Three vertices that form a cycle are called a triangle. A directed acyclic graph (DAG) is a directed graph that does not contain any cycle. A DAG entails conditional independence relationships via a graphical criterion called d-separation (Section 1.2.3 in [16]). Any two vertices X a and X b that are not adjacent in a DAG G are d-separated in G by some subset X S ⊆ V\{X a , X b }. A probability distribution P on R p is said to be faithful with respect to the DAG G if the conditional independence relationships in P can be inferred from G using d-separation and vice versa; in other words, X a ⊥⊥ X b |X S if and only if X a and X b are d-separated in G by X S .
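d-separation can be checked with the classical moralization criterion (not stated in the text above, but standard): X_a and X_b are d-separated by X_S in a DAG if and only if they are disconnected in the moralized subgraph induced by the ancestors of {a, b} ∪ S. A compact sketch:

```python
from collections import deque

def ancestors(parents, nodes):
    """All ancestors of `nodes` (inclusive); parents[v] = set of parents of v."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, a, b, S):
    keep = ancestors(parents, {a, b} | set(S))
    # Moralize: undirected edges between each node and its parents, plus
    # edges "marrying" every pair of parents of a common child.
    und = {v: set() for v in keep}
    for v in keep:
        ps = sorted(p for p in parents.get(v, ()) if p in keep)
        for p in ps:
            und[v].add(p); und[p].add(v)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                und[p].add(q); und[q].add(p)
    # BFS from a, never passing through S; d-separated iff b is unreachable.
    seen, queue = {a} | set(S), deque([a])
    while queue:
        for w in und[queue.popleft()]:
            if w == b:
                return False
            if w not in seen:
                seen.add(w); queue.append(w)
    return True
```

For the collider 1 → 3 ← 2, the path is blocked given the empty set but opened by conditioning on the collider (or any of its descendants), which the sketch reproduces.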
A graph that is both (partially) directed and acyclic is called a partially directed acyclic graph (PDAG). DAGs that encode the same set of conditional independence relations form a Markov equivalence class [17]. Two DAGs belong to the same Markov equivalence class if and only if they have the same skeleton and the same v-structures. A Markov equivalence class of DAGs can be uniquely represented by a completed partially directed acyclic graph (CPDAG), which is a PDAG that satisfies the following: (i) X a → X b in the CPDAG if X a → X b in every DAG in the Markov equivalence class, and (ii) X a − X b in the CPDAG if the Markov equivalence class contains a DAG in which X a → X b as well as a DAG in which X a ← X b .

The PC-Stable and FCI-Stable Algorithms
In this section, we provide an outline of the PC/PC-stable and FCI/FCI-stable algorithms. Estimation of the CPDAG by the PC algorithm involves two steps: (1) estimation of the skeleton and separating sets (also called the adjacency search step); and (2) partial orientation of edges; see Algorithms 1 and 2 in [8] for details.
Intuitively, the PC algorithm works as follows. In the first step (the adjacency search step), the algorithm starts with a complete undirected graph. Then, for conditioning sets of increasing cardinality, k = 0, 1, . . ., the algorithm removes an edge X a − X b if X a and X b are conditionally independent given a subset S of size k chosen among the current neighbors of nodes a and b. This process continues up to the order q − 1, where q is the maximum degree of the underlying DAG. By searching over the neighboring nodes, the algorithm is adaptive and can efficiently infer sparse high-dimensional DAGs, where the sparsity is characterized by the maximum node degree, q.
In the presence of latent and selection variables, one needs a generalization of a DAG, called a maximal ancestral graph (MAG). A mixed graph is called an ancestral graph if it contains no directed or almost directed cycles and no subgraph of the type X a − X b ← X c . DAGs form a subset of ancestral graphs. A MAG is an ancestral graph in which every missing edge corresponds to a conditional independence relationship via the m-separation criterion [18], a generalization of the notion of d-separation. Multiple MAGs may represent the same set of conditional independence relations. Such MAGs form a Markov equivalence class which can be represented by a partial ancestral graph (PAG) [19]; see [18] for additional details.
Under the faithfulness assumption, the Markov equivalence class of a DAG with latent and selection variables can be learned using the FCI algorithm (e.g., Algorithm 3.1 in [4]), which is a modification of the PC algorithm. The FCI algorithm first employs the adjacency search of the PC algorithm, and then performs additional conditional independence queries because of the presence of latent variables followed by partial orientation of the edges, resulting in an estimated PAG. The FCI algorithm adopts the same hierarchical search strategy as the PC algorithm: It starts with a complete undirected graph and recursively removes edges via conditional independence queries given subsets of increasingly larger cardinalities in some appropriate search pool.
The PC algorithm is usually order-dependent, in the sense that its output depends on the order in which pairs of adjacent vertices and subsets of their adjacency sets are considered. The FCI algorithm suffers from a similar limitation, as it shares the adjacency search step of the PC algorithm as its first step. To overcome this limitation, ref. [7] proposed variants of the PC and FCI algorithms, namely the PC-stable and FCI-stable algorithms, that resolve the order dependence at different stages of the algorithms. The basic difference between the PC algorithm and the PC-stable algorithm is that, in the adjacency search step, the latter computes and stores the adjacency sets of all the variables before moving to each new cardinality, k = 0, 1, . . ., of the conditioning sets. These stored adjacency sets are then used to search for conditioning sets of the given size k. As a consequence, the removal of an edge no longer affects which conditional independence relations need to be checked for other pairs of variables at the given size of the conditioning sets.
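The adjacency search with the stable fix can be sketched compactly, assuming a user-supplied conditional independence oracle or test `ci(a, b, S)` (this is an illustration of the idea, not the paper's implementation):

```python
from itertools import combinations

def pc_stable_skeleton(nodes, ci):
    """PC-stable adjacency search: `ci(a, b, S)` returns True when X_a is
    (judged) conditionally independent of X_b given X_S."""
    G = {a: set(nodes) - {a} for a in nodes}   # start from the complete graph
    sepset, k = {}, 0
    while any(len(G[a]) - 1 >= k for a in nodes):
        # The stable fix: freeze all adjacency sets before level k, so edge
        # removals within the level cannot affect which sets are searched.
        frozen = {a: set(G[a]) for a in nodes}
        for a in nodes:
            for b in list(G[a]):
                if b not in G[a]:
                    continue
                for S in combinations(frozen[a] - {b}, k):
                    if ci(a, b, set(S)):
                        G[a].discard(b); G[b].discard(a)
                        sepset[frozenset((a, b))] = set(S)
                        break
        k += 1
    return G, sepset

# Oracle for the collider DAG 1 -> 3 <- 2: the only CI relation is X1 ⊥⊥ X2.
def oracle(a, b, S):
    return {a, b} == {1, 2} and 3 not in S
```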
We refer the reader to Appendix A, where we provide detailed pseudocode for the oracle versions of the PC-stable and FCI-stable algorithms. In the oracle versions of the algorithms, it is assumed that perfect knowledge is available about all the necessary conditional independence relations. As such, conditional independence relations are not estimated from data. Of course, this perfect knowledge is not available in practice. Sample versions of the PC-stable and FCI-stable algorithms can be obtained by replacing the conditional independence queries by a suitable test for conditional independence at some pre-specified level. For example, if the variables are jointly Gaussian, one can test for zero partial correlations (see, e.g., [8]). The next subsection is devoted to discussions on nonparametric tests for independence and conditional independence.

Distance Covariance and Conditional Distance Covariance
We start by describing the notation used throughout the paper. We denote by ‖·‖_p the Euclidean norm on R^p and use ‖·‖ when the dimension is clear from the context. We use X ⊥⊥ Y to denote the independence of X and Y and use E_U to denote expectation with respect to the probability distribution of the random variable U. For any set S, we denote its cardinality by |S|.
We use the usual asymptotic notation, 'O' and 'o', as well as their probabilistic counterparts, O_p and o_p, which denote stochastic boundedness and convergence in probability, respectively. For two sequences of real numbers {a_n} and {b_n}, a_n ≍ b_n if and only if a_n/b_n = O(1) and b_n/a_n = O(1) as n → ∞. We use the symbol "a ≲ b" to indicate that a ≤ C b for some constant C > 0. For a matrix A = (a_kl)_{k,l=1}^n ∈ R^{n×n}, we denote its determinant by |A| and define its U-centered version Ã = (ã_kl)_{k,l=1}^n as

ã_kl = a_kl − (1/(n−2)) Σ_{i=1}^n a_il − (1/(n−2)) Σ_{j=1}^n a_kj + (1/((n−1)(n−2))) Σ_{i,j=1}^n a_ij for k ≠ l, and ã_kk = 0, (1)

for k, l = 1, . . . , n. We denote the indicator function of any set A by 1(A). Finally, we denote the integer part of a ∈ R by ⌊a⌋.
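The U-centering operation of Equation (1) can be written directly; a minimal numpy sketch (for a distance matrix with zero diagonal, every row of the U-centered matrix sums to zero, which makes a handy sanity check):

```python
import numpy as np

def u_center(A):
    """U-centered version of a symmetric zero-diagonal (distance) matrix."""
    n = A.shape[0]
    row = A.sum(axis=1, keepdims=True)   # per-row sums
    col = A.sum(axis=0, keepdims=True)   # per-column sums
    At = A - row / (n - 2) - col / (n - 2) + A.sum() / ((n - 1) * (n - 2))
    np.fill_diagonal(At, 0.0)            # diagonal is set to zero by definition
    return At
```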
Ref. [14], in their seminal paper, introduced the notion of distance covariance (dCov, henceforth) to quantify nonlinear and non-monotone dependence between two random vectors of arbitrary dimensions. Consider two random vectors X ∈ R^p and Y ∈ R^q with E‖X‖_p < ∞ and E‖Y‖_q < ∞. The distance covariance between X and Y is defined as the positive square root of

dCov²(X, Y) = (1/(c_p c_q)) ∫_{R^{p+q}} |f_{X,Y}(t, s) − f_X(t) f_Y(s)|² / (‖t‖_p^{1+p} ‖s‖_q^{1+q}) dt ds,

where f_X, f_Y and f_{X,Y} are the individual and joint characteristic functions of X and Y, respectively, and c_p = π^{(1+p)/2}/Γ((1 + p)/2) is a constant with Γ(·) being the complete gamma function.
The key feature of dCov is that it completely characterizes the independence between two random vectors, or in other words dCov(X, Y) = 0 if and only if X ⊥⊥ Y. According to Remark 3 in [14], dCov can be equivalently expressed as

dCov²(X, Y) = E‖X − X′‖_p ‖Y − Y′‖_q + E‖X − X′‖_p E‖Y − Y′‖_q − 2 E‖X − X′‖_p ‖Y − Y″‖_q,

where (X′, Y′) and (X″, Y″) are i.i.d. copies of (X, Y). This alternative expression comes in handy in constructing V- or U-statistic type estimators of the quantity. For an observed random sample (X_i, Y_i)_{i=1}^n from the joint distribution of X and Y, define the distance matrices d^X = (d^X_ij)_{i,j=1}^n and d^Y = (d^Y_ij)_{i,j=1}^n ∈ R^{n×n}, where d^X_ij := ‖X_i − X_j‖_p and d^Y_ij := ‖Y_i − Y_j‖_q. Following the U-centering idea in [20], an unbiased U-statistic type estimator of dCov²(X, Y) can be expressed as

dCov²_n(X, Y) = (1/(n(n−3))) Σ_{k≠l} d̃^X_kl d̃^Y_kl,

where d̃^X and d̃^Y are the U-centered versions of the matrices d^X and d^Y, respectively, as defined in (1).
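The U-statistic estimator can be sketched in a few lines (our illustration, not the authors' code); `u_center` implements the U-centering of Equation (1):

```python
import numpy as np

def u_center(A):
    """U-centering of a symmetric zero-diagonal distance matrix."""
    n = A.shape[0]
    row = A.sum(axis=1, keepdims=True)
    col = A.sum(axis=0, keepdims=True)
    At = A - row / (n - 2) - col / (n - 2) + A.sum() / ((n - 1) * (n - 2))
    np.fill_diagonal(At, 0.0)
    return At

def dcov2_ustat(X, Y):
    """Unbiased U-statistic estimator of dCov²; rows of X and Y are samples."""
    n = X.shape[0]
    dX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    return float((u_center(dX) * u_center(dY)).sum() / (n * (n - 3)))
```

For a strongly dependent pair (e.g., Y = X) the estimate is strictly positive, while for an independent pair it fluctuates around zero at rate O(1/n).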
Ref. [15] generalized the notion of dCov and introduced the conditional distance covariance (CdCov, henceforth) as a measure of conditional dependence between two random vectors of arbitrary dimensions given a third. CdCov essentially replaces the characteristic functions used in the definition of dCov by conditional characteristic functions. Consider a third random vector Z ∈ R^r with E(‖X‖_p + ‖Y‖_q | Z) < ∞. Denote by f_{X,Y|Z} the conditional joint characteristic function of X and Y given Z, and by f_{X|Z} and f_{Y|Z} the conditional marginal characteristic functions of X and Y given Z, respectively. Then, CdCov between X and Y given Z is defined as the positive square root of

CdCov²(X, Y|Z) = (1/(c_p c_q)) ∫_{R^{p+q}} |f_{X,Y|Z}(t, s) − f_{X|Z}(t) f_{Y|Z}(s)|² / (‖t‖_p^{1+p} ‖s‖_q^{1+q}) dt ds.

The key feature of CdCov is that CdCov(X, Y|Z) = 0 almost surely if and only if X ⊥⊥ Y|Z, which is quite straightforward to see from the definition. Similar to dCov, an equivalent alternative expression can be established for CdCov that avoids complicated integrations involving conditional characteristic functions. It is built from quantities d_ijkl, functions of the pairwise distances among four sample points, which are not symmetric with respect to {i, j, k, l} and therefore necessitate a symmetrized form; with this symmetric form, [15] establishes an equivalent representation of CdCov²(X, Y|Z = z) as a conditional expectation of these quantities given Z = z. Remark 1. In a recent work, [21] explore the connection between conditional independence measures induced by distances on a metric space and reproducing kernels associated with a reproducing kernel Hilbert space (RKHS). They generalize CdCov to arbitrary metric spaces of negative type, termed generalized CdCov (gCdCov), and develop a kernel-based measure of conditional independence, namely the Hilbert-Schmidt conditional independence criterion (HSCIC). Theorem 1 in their paper establishes an equivalence between gCdCov and HSCIC, or, in other words, between distance and kernel-based measures of conditional independence.
Then, by virtue of the equivalent representation of CdCov in (3), a V-statistic type estimator of CdCov²(X, Y|Z) can be constructed by replacing the conditional expectations with kernel-smoothed sample averages. Under certain regularity conditions, Theorem 4 in [15] establishes the asymptotic behavior of this estimator, conditioned on Z.
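The exact V-statistic estimator of [15] involves the symmetrized d_ijkl quantities, which we do not reproduce. As an illustrative stand-in that conveys the same kernel-smoothing idea, the sketch below computes a kernel-localized (weighted V-statistic) distance covariance at a fixed point z: Gaussian-kernel weights concentrate the sample around Z = z, and a weighted double-centering replaces the unconditional one. The function name and the specific weighting scheme are our own simplifications, not the formula from [15]:

```python
import numpy as np

def kernel_local_dcov2(X, Y, Z, z, h):
    """Weighted V-statistic distance covariance of X, Y localized at Z = z.
    X, Y, Z are 1-D samples; h is the kernel bandwidth."""
    w = np.exp(-0.5 * ((Z - z) / h) ** 2)
    w = w / w.sum()                       # kernel weights summing to one
    A = np.abs(X[:, None] - X[None, :])   # univariate distance matrices
    B = np.abs(Y[:, None] - Y[None, :])

    def w_center(M):
        r = M @ w                         # weighted row (= column) means
        return M - r[:, None] - r[None, :] + w @ M @ w

    At, Bt = w_center(A), w_center(B)
    return float((w[:, None] * w[None, :] * At * Bt).sum())
```

By construction the statistic is symmetric in X and Y, and is nonnegative when Y = X.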

The Nonparametric PC Algorithm in High Dimensions
To obtain a measure of conditional independence between X and Y given Z that is free of Z, we define ρ*_0(X, Y|Z) by aggregating CdCov²(X, Y|Z) over the distribution of Z, with ρ*(X, Y|Z) denoting its plug-in sample analogue. We reject H_0 : ρ*_0(X, Y|Z) = 0 if ρ*(X, Y|Z) exceeds a suitably chosen threshold ξ_α. In Appendix A, we present a local bootstrap procedure for choosing ξ_α in practice, which is also used in our numerical studies. Henceforth, we will often denote ρ*_0(X, Y|Z) and ρ*(X, Y|Z) simply by ρ*_0 and ρ* respectively for notational simplicity, whenever there is no confusion.
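The local bootstrap of Appendix A is not reproduced here. As a rough, clearly labeled stand-in for how such a threshold can be calibrated, the sketch below permutes X within quantile bins of Z, which approximately preserves the conditional-independence null; `stat` is any conditional dependence statistic (e.g., a CdCov-based one), and all names are ours:

```python
import numpy as np

def local_perm_threshold(stat, X, Y, Z, alpha, B=199, bins=5, seed=0):
    """Approximate the (1 - alpha) null quantile of stat(X, Y, Z) under
    X ⊥⊥ Y | Z by permuting X only within quantile bins of Z."""
    rng = np.random.default_rng(seed)
    edges = np.quantile(Z, np.linspace(0, 1, bins + 1))
    lab = np.clip(np.searchsorted(edges, Z, side="right") - 1, 0, bins - 1)
    null = []
    for _ in range(B):
        Xp = X.copy()
        for g in range(bins):
            idx = np.where(lab == g)[0]
            Xp[idx] = X[rng.permutation(idx)]   # shuffle within the Z-bin
        null.append(stat(Xp, Y, Z))
    return float(np.quantile(null, 1 - alpha))
```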
In view of the complete characterization of conditional independence by ρ * 0 , we propose testing for conditional independence relations nonparametrically in the sample version of the PC-stable algorithm based on ρ * 0 , rather than partial correlations. We coin the resulting algorithm the 'nonPC' algorithm, to emphasize that it is a nonparametric generalization of parametric PC-stable algorithms.
The oracle version of the first step of nonPC, or the skeleton estimation step, is exactly the same as that of the PC-stable algorithm (Algorithm A1 in Appendix A). The second step, which extends the skeleton estimated in the first step to a CPDAG (Algorithm A2 in Appendix A), comprises some purely deterministic rules for edge orientations, and is exactly the same for both the nonPC and PC-stable as well. The only difference lies in the implementation of the tests for conditional independence relationships in the sample versions of the first step. Specifically, we replace all the conditional independence queries in the first step by tests based on ρ*_0(X, Y|Z), carried out at some pre-specified significance level α. The critical value ξ_{n,α} in this case is obtained by a bootstrap procedure (see, e.g., Section 4 in [22] with d = 2).
Given that the equivalence between conditional independence and zero partial correlations only holds for multivariate normal random variables, our generalization broadens the scope of applicability of causal structure learning by the PC/PC-stable algorithm to general distributions over DAGs. This nonparametric approach is thus a natural extension of Gaussian and Gaussian copula models. It enables capturing nonlinear and non-monotone conditional dependence relationships among the variables, which partial correlations fail to detect.
Next, we establish theoretical guarantees on the correctness of the nonPC algorithm in learning the true underlying causal structure in sparse high-dimensional settings. Our consistency results only require mild moment and tail conditions on the set of variables, without making any strict distributional assumptions. Denote by m_p the maximum cardinality of the conditioning sets considered in the adjacency search step of the PC-stable algorithm. Clearly, m_p ≤ q, where q := max_{1≤a≤p} |adj(G, a)| is the maximum degree of the DAG G. For a fixed pair of nodes a, b ∈ V, the conditioning sets considered in the adjacency search step are elements of J^{m_p}_{a,b} := {S ⊆ V\{a, b} : |S| ≤ m_p}. We first establish a concentration inequality that gives the rate at which the absolute difference of ρ*_0(X_a, X_b|X_S) and its plug-in estimate ρ*(X_a, X_b|X_S) decays to zero, for any fixed pair of nodes a and b ∈ V and a fixed conditioning set S. To that end, we impose the following regularity conditions.
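The search pool J^{m_p}_{a,b} just defined is easy to enumerate explicitly; a small sketch (its size is Σ_{k=0}^{m_p} C(p−2, k), which drives the union bounds below):

```python
from itertools import combinations

def conditioning_sets(V, a, b, m):
    """All subsets S ⊆ V \\ {a, b} with |S| ≤ m, i.e., the pool J^m_{a,b}."""
    rest = sorted(set(V) - {a, b})
    return [set(S) for k in range(m + 1) for S in combinations(rest, k)]
```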
Condition (A1) imposes a sub-exponential tail bound on the squares of the random variables. This is a quite commonly used condition, for example, in the high-dimensional feature screening literature (see, for example, [23]). Condition (A2) is a mild condition on the kernel function K(·) that is guaranteed by many commonly used kernels, including the Gaussian kernel. Under conditions (A1) and (A2), the next result shows that the plug-in estimate ρ * (X a , X b |X S ) converges in probability to its population counterpart ρ * 0 (X a , X b |X S ) exponentially fast.

Theorem 1.
Under conditions (A1) and (A2), for any ε > 0, there exist positive constants A, B and γ ∈ (0, 1/4) such that the probability P(|ρ*(X_a, X_b|X_S) − ρ*_0(X_a, X_b|X_S)| > ε) is exponentially small in n. The proof of Theorem 1 is long and somewhat technical; it is thus relegated to Appendix B. Theorem 1 serves as the main building block towards establishing the consistency of the nonPC algorithm in sparse high-dimensional settings.
For notational convenience, henceforth, we denote ρ*_0(X_a, X_b|X_S) and ρ*(X_a, X_b|X_S) by ρ*_{0;ab|S} and ρ*_{ab|S}, respectively. In Theorem 2 below, we establish a uniform bound for the errors in inferring conditional independence relationships using the ρ*_0-based test in the skeleton estimation step of the sample version of the nonPC algorithm.

Theorem 2.
Under conditions (A1) and (A2), for any ε > 0, there exist positive constants A, B and γ ∈ (0, 1/4) such that the probability of erring in any of the conditional independence queries, uniformly over all pairs a, b and conditioning sets S ∈ J^{m_p}_{a,b}, is exponentially small in n. Next, we turn to proving the consistency of the nonPC algorithm in the high-dimensional setting where the dimension p can be much larger than the sample size n, but the DAG is considered to be sparse. We impose the following regularity conditions, which are similar to the assumptions imposed in Section 3.1 of [8] in order to prove the consistency of the PC algorithm for Gaussian graphical models. We let the number of variables p grow with the sample size n and consider p = p_n, and also the DAG G = G_n := (V_n, E_n) and the distribution P = P_n.
(A3) The dimension p_n grows at a rate such that the right-hand side of (7) tends to zero as n → ∞. In particular, this is satisfied when p_n = O(n^r) for any 0 ≤ r < ∞. (A4) The maximum degree of the DAG G_n, denoted by q_n := max_{1≤a≤p_n} |adj(G_n, a)|, grows at the rate of O(n^{1−b}) for some 0 < b ≤ 1. (A5) The distribution P_n is faithful to the DAG G_n for all n. In other words, for any a, b ∈ V_n and S ∈ J^{m_{p_n}}_{a,b}, ρ*_{0;ab|S} = 0 if and only if X_a and X_b are d-separated in G_n by X_S. Moreover, the nonzero ρ*_{0;ab|S} values are uniformly bounded both from above and below. Formally, sup_{a,b,S} ρ*_{0;ab|S} ≤ λ_max and inf{ρ*_{0;ab|S} : ρ*_{0;ab|S} ≠ 0} ≥ C_min with C_min ≍ n^{−v}, where λ_max is a positive constant and 0 < v < 1/4. Condition (A3) allows the dimension to grow at any arbitrary polynomial rate of the sample size. Condition (A4) is a sparsity assumption on the underlying true DAG, allowing the maximum degree of the DAG to also grow, but at a slower rate than n. Since m_p ≤ q_n, we also have m_p = O(n^{1−b}). Finally, Condition (A5) is the strong faithfulness assumption (Definition 1.3 in [24]) on P_n and is similar to condition (A4) in [8]. This essentially requires ρ*_{0;ab|S} to be bounded away from zero when the vertices X_a and X_b are not d-separated by X_S. It is worth noting that the faithfulness assumption alone is not enough to prove the consistency of the PC/PC-stable/nonPC algorithms in high-dimensional settings, and the more stringent strong faithfulness condition is required.

Remark 2.
For notational convenience, treat X_a, X_b and X_S as X, Y and Z, respectively, for any a, b ∈ V_n and S ∈ J^{m_{p_n}}_{a,b}. From Equation (3) and the definition of d_ijkl in Section 2.3, it follows from some simple algebra and the Cauchy-Schwarz inequality that ρ*_0 < ∞. This provides a justification for the second part of Assumption (A5), namely that sup_{a,b∈V_n, S∈J^{m_{p_n}}_{a,b}} ρ*_{0;ab|S} ≤ λ_max for some positive constant λ_max.
The next theorem establishes that the nonPC algorithm consistently estimates the skeleton of a sparse high-dimensional DAG, thereby providing the necessary theoretical guarantees to our proposed methodology. It is worth noting that, in the sample version of the PC-stable and hence the nonPC algorithm, all the inference is done during the skeleton estimation step. The second step that involves appropriately orienting the edges of the estimated skeleton is purely deterministic (see Sections 4.2 and 4.3 in [7]). Therefore, to prove the consistency of the nonPC algorithm in estimating the equivalence class of the underlying true DAG, it is enough to prove the consistency of the estimated skeleton. We include the detailed proof of Theorem 3 in Appendix B.
Theorem 3. Assume that Conditions (A1)-(A5) hold. Let G_skel,n be the true skeleton of the graph G_n, and Ĝ_skel,n be the skeleton estimated by the nonPC algorithm. Then, as n → ∞, P(Ĝ_skel,n = G_skel,n) → 1.

Remark 3.
In the proof of Theorem 3, we consider the threshold ξ α to be of constant order. However, the proof continues to work as long as ξ α is of the same order as C min as n → ∞.

The Nonparametric FCI Algorithm in High Dimensions
The FCI is a modification of the PC algorithm that accounts for latent and selection variables. Thus, generalizations of the PC algorithm naturally extend to the FCI as well. Similar to nonPC, we propose testing for conditional independence relations nonparametrically in the sample version of the FCI-stable algorithm (Algorithm A3 in Appendix A) based on ρ*_0, instead of partial correlations. We coin the resulting algorithm the 'nonFCI' algorithm, to emphasize that it is a nonparametric generalization of the parametric FCI-stable algorithm. Again, the oracle version of the nonFCI is exactly the same as that of the FCI-stable algorithm. The difference is in the implementation of the tests for conditional independence relationships in their sample versions. This broadens the scope of the FCI algorithm in causal structure learning for observational data in the presence of latent and selection variables when Gaussianity is not a viable assumption. More specifically, it enables capturing nonlinear and non-monotone conditional dependence relationships among the variables that partial correlations would fail to detect.
Equipped with the theoretical guarantees we established for the nonPC in Section 3.1, we establish below in Theorem 4 the consistency of the nonFCI algorithm for general distributions in sparse high-dimensional settings. Let H = (V, E) be a DAG with the vertex set partitioned as V = V_X ∪ V_L ∪ V_T, where V_X indexes the set of p observed variables, V_L denotes the set of latent variables and V_T stands for the set of selection variables. Let M be the unique MAG over V_X. We let p grow with n and consider p = p_n, H = H_n and Q = Q_n, where Q is the distribution of (U_1, . . . , U_p) := (X_1 | V_T, . . . , X_p | V_T). We provide below the definition of possible-D-SEP sets (Definition 3.3 in [4]). Definition 1. Let C be a graph with any of the following edge types: •−•, •→ and ↔. The possible-D-SEP set of (X_a, X_b) in C, denoted pds(C, X_a, X_b), is defined as follows: X_c ∈ pds(C, X_a, X_b) if and only if there is a path π between X_a and X_c in C such that, for every subpath X_e, X_f, X_g of π, X_f is a collider on the subpath in C or X_e, X_f, X_g is a triangle in C.
To prove the consistency of the nonFCI algorithm in sparse high-dimensional settings, we impose the following regularity conditions, which are similar to the assumptions imposed in Section 4 in [4].
(C3) The distribution Q_n is faithful to the underlying MAG M_n for all n. (C4) The maximum size of the possible-D-SEP sets for finding the final skeleton in the FCI-stable algorithm (Algorithm A6 in Appendix A), q_n, grows at the rate of O(n^{1−b}) for some 0 < b ≤ 1. Moreover, the nonzero values of the conditional dependence measure are uniformly bounded above by a positive constant λ_max and below by a sequence C_min ≍ n^{−v}, where 0 < v < 1/4.

Performance of the NonPC Algorithm
In this subsection, we compare the performances of the nonPC and the PC-stable algorithms in finding the skeleton and the CPDAG for various simulated datasets. We simulate random DAGs in the following examples and sample from probability distributions faithful to them.

Example 1 (Linear SEM).
We first fix a sparsity parameter s ∈ (0, 1) and enumerate the vertices as V = {1, . . . , p}. We then construct a p × p adjacency matrix Λ as follows. First, initialize Λ as a zero matrix. Next, fill every entry in the lower triangle (below the diagonal) of Λ by independent realizations of Bernoulli random variables with success probability s. Finally, replace each nonzero entry in Λ by independent realizations of a Uniform(0.1, 1) random variable.
In this scheme, each node has the same expected degree E(m) = (p − 1)s, where m is the degree of a node and follows a Binomial(p − 1, s) distribution. Using the adjacency matrix Λ, the data are then generated from the following linear structural equation model (SEM):

X = ΛX + ε,

where ε = (ε_1, . . . , ε_p)⊤ and ε_1, . . . , ε_p are jointly independent. To obtain samples {X^k_1, . . . , X^k_p}_{k=1}^n on {X_1, . . . , X_p}, we first sample {ε^k_1, . . . , ε^k_p}_{k=1}^n from the three following data-generating schemes. For 1 ≤ k ≤ n and 1 ≤ i ≤ p,
1. Normal: Generate the ε^k_i's independently from a standard normal distribution.
2. Copula: Generate the ε^k_i's as in (1) and then transform the marginals to an F_{1,1} distribution.
3. Mixture: Generate the ε^k_i's independently from a 50-50 mixture of a standard normal and a standard Cauchy distribution.
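The generation scheme of Example 1 can be sketched as follows (a minimal illustration; standard normal errors stand in for the three schemes, and since Λ is strictly lower triangular, X = ΛX + ε can be solved directly as X = (I − Λ)^{-1} ε):

```python
import numpy as np

def random_dag_weights(p, s, rng):
    """Strictly lower-triangular Λ: Bernoulli(s) support, Uniform(0.1, 1) weights."""
    support = np.tril(rng.random((p, p)) < s, k=-1)
    return support * rng.uniform(0.1, 1.0, size=(p, p))

def sample_linear_sem(Lam, eps):
    """Solve X = ΛX + ε per observation; rows of `eps` are ε-vectors."""
    p = Lam.shape[0]
    return np.linalg.solve(np.eye(p) - Lam, eps.T).T
```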
Example 2 (Nonlinear SEM). In this example, we first generate a p × p adjacency matrix Λ in a similar way as in Example 1 and then generate the data from the following nonlinear SEM (similar to [10]):

X_i = Σ_{j : Λ_ij ≠ 0} f_ij(X_j) + ε_i, i = 1, . . . , p.

If the functions f_ij's are chosen to be nonlinear, then the data will typically not correspond to a well-known multivariate distribution. We consider f_ij(x_j) = b_ij1 x_j + b_ij2 x_j², where b_ij1 and b_ij2 are independently sampled from N(0, 1) and N(0, 0.5) distributions, respectively.
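Because the coefficient matrices are supported on a strictly lower triangle, the nonlinear SEM of Example 2 can be sampled node by node in topological order. A minimal sketch (we read N(0, 0.5) as variance 0.5, which is our assumption about the paper's notation):

```python
import numpy as np

def sample_nonlinear_sem(b1, b2, eps):
    """Generate X_i = Σ_j f_ij(X_j) + ε_i with f_ij(x) = b1[i,j]·x + b2[i,j]·x².
    b1, b2 are strictly lower-triangular, so column i depends only on j < i."""
    n, p = eps.shape
    X = np.zeros((n, p))
    for i in range(p):
        X[:, i] = X @ b1[i] + (X ** 2) @ b2[i] + eps[:, i]
    return X

# Coefficients on a fixed strictly-lower-triangular support:
rng = np.random.default_rng(2)
p = 4
A = np.tril(np.ones((p, p)), k=-1)
b1 = rng.standard_normal((p, p)) * A
b2 = rng.normal(0.0, np.sqrt(0.5), (p, p)) * A   # N(0, 0.5): variance 0.5 (assumed)
```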
With the exception of Example 1.1, the above examples are all non-Gaussian graphical models. We would thus expect the nonPC to perform better than the PC-stable in learning the unknown causal structure in these examples. For each of the four data generating methods considered above, we compare the Structural Hamming Distance (SHD) [25] between the estimated and the true skeletons of the underlying DAGs using the nonPC and PC-stable algorithms. The SHD between two undirected graphs is the number of edge additions or deletions necessary to make the two graphs match. Therefore, larger SHD values between the estimated and the true skeleton correspond to worse estimates.
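The SHD between skeletons, as used here, can be computed directly from symmetric boolean adjacency matrices; a minimal sketch (ours, for illustration):

```python
import numpy as np

def shd_skeleton(A, B):
    """Structural Hamming distance between two skeletons given as symmetric
    boolean adjacency matrices: number of edge additions/deletions to match."""
    return int(np.triu(A != B, k=1).sum())   # count each disagreeing edge once
```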
We consider 199 bootstrap replicates for the CdCov-based conditional independence tests in the implementation of our nonPC algorithm and the significance level α = 0.05. Table 1 presents the average SHD for the different data generating schemes over 20 simulation runs, for different choices of n, p and E(m). The results in Table 1 demonstrate that the nonPC performs nearly as well as the PC-stable for the Gaussian data example, in terms of the average SHD. However, for each of the non-Gaussian data examples, the nonPC performs better than the PC-stable in estimating the true skeleton of the underlying DAGs. The improvement in SHD becomes more substantial as the dimension grows. The superior performance of the nonPC over PC-stable for the non-Gaussian graphical models is expected, as the characterization of conditional independence by partial correlations is only valid under the assumption of joint Gaussianity.

Performance of the NonFCI Algorithm
In this subsection, we compare the performances of the nonFCI and the FCI-stable algorithms over various simulated datasets. We first generate random DAGs as in Examples 1 and 2.
To assess the impact of latent variables, we randomly define half of the variables with no parents and at least one child as latent. We do not consider selection variables. We run both the nonFCI and the FCI-stable algorithms on the above data examples with n = 200, p = {10, 20, 30, 100, 200} and α = 0.01, using 199 bootstrap replicates for the CdCov-based conditional independence tests. We consider 20 simulation runs for each of the data generating models. Table 2 reports the average SHD between the estimated and true PAG skeleton by the nonFCI and FCI-stable algorithms. The results in Table 2 demonstrate that, in both the Gaussian and non-Gaussian examples, the nonFCI algorithm outperforms the FCI-stable in estimating the true PAG skeleton.

Real Data Example
A major difficulty in assessing whether the nonPC and nonFCI provide more reasonable estimates than the parametric versions of the algorithms in high-dimensional real data settings is that the true causal graph is unknown in most cases. In the absence of the truth, we can only draw conclusions about sensible causal mechanisms by examining known or logical relationships among pairs of variables. However, this becomes increasingly difficult for larger networks, where even visualization is challenging. We therefore first choose a relatively small dataset in Section 4.3.1, where we can draw upon background knowledge to glean insight into potential causal mechanisms in a setting where the data are clearly non-Gaussian. This example illustrates the main message of the paper: with non-Gaussian data (categorical, in this case), the nonPC is expected to outperform the PC-stable in learning the true causal structure of the underlying DAG. In Section 4.3.2, we consider a larger example and examine the performance of the PC-stable and nonPC in learning the DAG from both seemingly Gaussian data and a categorized version of the same data. This example clearly illustrates a potential limitation of the PC-stable: in contrast to the nonPC, its output can be strikingly different when applied to a categorized version of the original data.

Montana Poll Dataset
To demonstrate the flexibility of our proposed framework, we first apply the nonPC algorithm to the Montana Economic Outlook Poll dataset. The poll was conducted in May 1992, when a random sample of 209 Montana residents were asked whether their personal financial status was worse, the same, or better than a year ago, and whether they thought the state economic outlook was better than the year before. Accompanying demographic information on the respondents' age, income, political orientation, and area of residence in the state was also recorded. We obtained the dataset from the Data and Story Library (DASL), available at https://math.tntech.edu/e-stat/DASL/page4.html (accessed on 25 March 2021). The study comprises the following seven categorical variables: AGE = 1 for under 35, 2 for 35-54, 3 for 55 and over; SEX = 0 for male, 1 for female; INC = yearly income: 1 for under $20 K, 2 for $20-35 K, 3 for over $35 K; POL = 1 for Democrat, 2 for Independent, 3 for Republican; AREA = 1 for Western, 2 for Northeastern, 3 for Southeastern Montana; FIN (= financial status): 1 for worse, 2 for same, 3 for better than a year ago; and STAT (= state economic outlook): 1 for better, 0 for not better than a year ago.
After removing the cases with missing values, we are left with n = 163 samples. Since all the variables are categorical, the Gaussianity assumption is clearly violated. We would thus expect the nonPC to perform better than the PC-stable in learning the true causal structure among the variables in this case. Figure 1 below presents the CPDAGs estimated by the nonPC and PC-stable algorithms at significance level α = 0.1. We consider 199 bootstrap replicates for the CdCov-based conditional independence tests in the implementation of the nonPC algorithm.
It is intuitive that age and sex are likely to affect income; one's financial status and area of residence might influence their political inclination; and improvements or downturns in the state economic outlook might impact an individual's financial status. The CPDAG estimated by the nonPC algorithm in Figure 1a affirms such common-sense understanding of these causal influences. However, in the CPDAG estimated by the PC-stable in Figure 1b, the edge between age and income is missing. In addition, the directed edges POL → AREA and POL → FIN seem to make little sense in this case.

Protein Expression Data
We next consider a protein expression dataset of 410 patients with breast cancer from The Cancer Genome Atlas (TCGA). The dataset consists of p = 118 genes, and we randomly select a subset of n = 100 patients with PR-negative status. Since the true causal structure of the genes in cancer cells may differ from that of normal cells [26], we apply both the nonPC and PC-stable algorithms to learn the causal structure. To scrutinize the performance of the nonPC and PC-stable as the data depart farther from Gaussianity, we categorize the protein expression data {X_a^k}_{k=1}^n for each of the p genes, 1 ≤ a ≤ p, as follows. For every 1 ≤ a ≤ p, we compute the three quartiles Q_{1;a}, Q_{2;a} and Q_{3;a} of the protein expression values, and obtain categorized protein expressions {X_{C;a}^k}_{k=1}^n by recording which of the four quartile-delimited intervals each X_a^k falls into. We apply the nonPC and PC-stable algorithms to both the original and the categorized protein expression data at significance level α = 0.01. We consider 199 bootstrap replicates for the CdCov-based conditional independence tests in the implementation of the nonPC algorithm. Table 3 below shows the SHD between the skeletons estimated from the original and the categorized data by the nonPC and PC-stable algorithms. The SHD between the skeletons estimated from the original and categorized data is much larger for the PC-stable algorithm than for the nonPC. This example highlights a potential limitation of parametric implementations of the PC algorithm: when the data deviate farther from Gaussianity (here, by being categorical), the estimates produced by the PC-stable may deviate considerably more from those obtained from the original data. In contrast, the nonparametric test in the nonPC delivers more stable estimates regardless of the data distribution.
Table 3. Comparison of the SHD between the skeletons estimated from the original and the categorized protein expression data by the nonPC and PC-stable algorithms.
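A minimal sketch of the quartile-based categorization described above; since the displayed definition of X_{C;a}^k is not reproduced in the text, we assume each value is mapped to one of the four categories delimited by the three sample quartiles (function name is ours):

```python
import numpy as np

def categorize_by_quartiles(x):
    """Map continuous expression values to categories 1-4 using the
    three sample quartiles Q1, Q2, Q3 of the variable as cut points."""
    q1, q2, q3 = np.quantile(x, [0.25, 0.5, 0.75])
    # searchsorted against the sorted cut points gives 0..3; shift to 1..4
    return np.searchsorted([q1, q2, q3], x, side="right") + 1

x = np.array([0.1, 0.4, 0.5, 0.9, 1.5, 2.0, 3.0, 10.0])
print(categorize_by_quartiles(x))  # [1 1 2 2 3 3 4 4]
```

Applying this column-wise to the n × p expression matrix yields the categorized dataset used in the comparison.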

Discussion
We proposed nonparametric variants of the widely popular PC-stable and FCI-stable algorithms, which employ the conditional distance covariance (CdCov) to test for conditional independence relationships in their sample versions. Our proposed algorithms broaden the applicability of the PC/PC-stable and FCI/FCI-stable algorithms to general distributions over DAGs, and can account for nonlinear and non-monotone conditional dependence among the random variables, which partial correlations fail to capture. We show that the high-dimensional consistency of the PC-stable and FCI-stable algorithms carries over to more general distributions over DAGs when we implement CdCov-based nonparametric tests for conditional independence. These results are obtained without imposing strict distributional assumptions, requiring only moment and tail conditions on the variables.
There are several intriguing directions for future research. First, it is generally difficult to select the tuning parameter (i.e., the significance threshold for the CdCov test) in causal structure learning. One possible strategy is to use ideas based on stability selection [27,28]: by assessing the stability of the estimated graphs across multiple subsamples, this strategy allows us to choose the tuning parameter so as to control the false positive error, although the repeated subsampling increases the computational burden. Second, the computational and sample complexities of the PC and FCI algorithms (and hence those of the nonPC and nonFCI) scale with the maximum degree of the DAG, which is assumed to be small relative to the sample size. However, many applications involve sparse graphs containing a small number of highly connected 'hub' nodes. For such cases, ref. [29] proposed a low-complexity variant of the PC algorithm, the reduced PC (rPC) algorithm, which exploits the local separation property of large random networks [30]. The rPC is shown to consistently estimate the skeleton of a high-dimensional DAG by conditioning only on sets of small cardinality. More recently, ref. [31] generalized this approach to account for unobserved confounders. In this light, it would be intriguing to develop computationally faster variants of the nonPC and nonFCI by exploiting the idea of local separation.
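A stability-selection-style wrapper of the kind suggested by [27,28] might look as follows; `learn_skeleton` is a placeholder for any skeleton estimator (e.g., Step 1 of the nonPC at a fixed significance level), and all names here are illustrative rather than the authors' implementation:

```python
import numpy as np

def stable_edges(data, learn_skeleton, n_subsamples=50, threshold=0.6, rng=None):
    """Stability-selection-style edge screening: rerun a skeleton learner
    on random half-subsamples and keep the edges that are selected in at
    least a `threshold` fraction of the runs."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, p = data.shape
    freq = np.zeros((p, p))
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        freq += learn_skeleton(data[idx])   # learner returns a 0/1 p x p matrix
    return (freq / n_subsamples >= threshold).astype(int)

# Toy stand-in learner: threshold absolute pairwise correlations at 0.5
def corr_learner(sub):
    c = np.abs(np.corrcoef(sub, rowvar=False)) > 0.5
    np.fill_diagonal(c, False)
    return c.astype(int)

rng = np.random.default_rng(1)
x0 = rng.normal(size=200)
data = np.column_stack([x0, x0 + 0.1 * rng.normal(size=200),
                        rng.normal(size=200)])
print(stable_edges(data, corr_learner, rng=np.random.default_rng(2)))
```

Only the edge between the first two (strongly dependent) columns survives the stability screen.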

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Preliminaries and Background
For the sake of completeness, we present in this appendix the pseudocode of the oracle versions of the PC-stable and FCI-stable algorithms. We also outline a local bootstrap procedure that can be used to approximate the threshold ξ_α mentioned in Section 3.1, which is used throughout the numerical studies in the paper.
Algorithm A1 presents the pseudocode of the oracle version of Step 1 of the PC-stable algorithm (Algorithm 4.1 of [7]), which estimates the skeleton of the underlying DAG. Algorithm A2 presents the pseudocode of Step 2 of the PC-stable algorithm (Algorithm 2 of [8]) that extends the skeleton estimated in Step 1 to the CPDAG. Algorithm A3 presents the pseudocode of the FCI-stable algorithm (Section 4.4 in [7]). It implements Algorithm A4 to obtain an initial skeleton of the underlying PAG, Algorithm A5 to orient the v-structures, and finally Algorithm A6 to obtain the final skeleton that the FCI-stable returns.
To approximate the threshold ξ_α for testing H_0 : X ⊥⊥ Y | Z against the alternative H_A that X and Y are conditionally dependent given Z, at level α ∈ (0, 1) (see Section 3.1), we consider the following local bootstrap procedure in the spirit of Section 4.3 of [15]. Given the i.i.d. data, we draw a local bootstrap sample and compute the bootstrap statistic. The detailed steps are as follows:
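Whatever the resampling scheme, the threshold ξ_α is approximated by the empirical (1 − α) quantile of the bootstrap statistics. The following generic sketch takes the null resampler as a black box; a simple permutation stands in here for the local bootstrap of [15], and all names are ours:

```python
import numpy as np

def bootstrap_threshold(stat, resample, B=199, alpha=0.05):
    """Approximate the level-alpha threshold xi_alpha by the empirical
    (1 - alpha) quantile of B statistics computed on draws from (an
    approximation of) the null hypothesis of conditional independence."""
    boot = np.array([stat(*resample()) for _ in range(B)])
    return np.quantile(boot, 1.0 - alpha)

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = x + 0.1 * rng.normal(size=500)          # strongly dependent pair
stat = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
resample = lambda: (x, rng.permutation(y))  # permutation mimics the null
xi = bootstrap_threshold(stat, resample, B=199, alpha=0.05)
print(stat(x, y) > xi)  # True: the dependence is detected
```

In the nonPC and nonFCI, `stat` would be the sample CdCov statistic and `resample` the local bootstrap draw described in the steps below.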

Algorithm A1
Step 1 of the PC-stable algorithm (oracle version).
Require: Conditional independence information among all variables in V, and an ordering order(V) on the variables.
Form the complete undirected graph C on the vertex set V.
Let l = −1;
repeat
  l = l + 1;
  for all vertices X_a in C do let u(X_a) = adj(C, X_a) end for
  repeat
    Select a (new) ordered pair of vertices (X_a, X_b) that are adjacent in C and satisfy |u(X_a) \ {X_b}| ≥ l;
    repeat
      Choose a (new) set Y ⊆ u(X_a) \ {X_b} with |Y| = l;
      if X_a ⊥⊥ X_b | Y then
        Delete the edge X_a − X_b from C;
        Let sepset(X_a, X_b) = sepset(X_b, X_a) = Y;
      end if
    until X_a and X_b are no longer adjacent in C or all Y ⊆ u(X_a) \ {X_b} with |Y| = l have been considered
  until all ordered pairs of adjacent vertices (X_a, X_b) in C with |u(X_a) \ {X_b}| ≥ l have been considered
until all pairs of adjacent vertices (X_a, X_b) in C satisfy |u(X_a) \ {X_b}| ≤ l
Output: The estimated skeleton C, separation sets sepset.
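An illustrative Python rendering of this oracle skeleton search, with the conditional independence oracle supplied as a callback; this is a sketch with names of our choosing, not the authors' implementation:

```python
from itertools import combinations

def pc_stable_skeleton(nodes, indep):
    """Oracle Step 1 of PC-stable: `indep(a, b, S)` answers X_a _||_ X_b | X_S.

    Adjacency sets u(.) are frozen at the start of each level l, so edge
    deletions within a level cannot affect other tests (order independence)."""
    adj = {a: set(nodes) - {a} for a in nodes}
    sepset = {}
    l = -1
    while any(len(adj[a] - {b}) > l for a in nodes for b in adj[a]):
        l += 1
        u = {a: set(adj[a]) for a in nodes}  # freeze adjacencies for level l
        for a in nodes:
            for b in list(adj[a]):
                if b not in adj[a] or len(u[a] - {b}) < l:
                    continue
                for S in combinations(sorted(u[a] - {b}), l):
                    if indep(a, b, set(S)):
                        adj[a].discard(b)
                        adj[b].discard(a)
                        sepset[(a, b)] = sepset[(b, a)] = set(S)
                        break
    return adj, sepset

# Oracle example: chain X0 -> X1 -> X2, so X0 _||_ X2 | {X1} and nothing else
indep = lambda a, b, S: {a, b} == {0, 2} and 1 in S
adj, sepset = pc_stable_skeleton([0, 1, 2], indep)
print(adj)  # {0: {1}, 1: {0, 2}, 2: {1}}
```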

Algorithm A2
Step 2 of the PC-stable algorithm.
Require: Skeleton C, separation sets sepset.
for all pairs of nonadjacent vertices X_a, X_c with common neighbor X_b do
  if X_b ∉ sepset(X_a, X_c) then orient X_a − X_b − X_c as X_a → X_b ← X_c end if
end for
In the resulting PDAG, try to orient as many undirected edges as possible by repeated application of the following rules:
(R1) Orient X_b − X_c into X_b → X_c whenever there is an arrow X_a → X_b such that X_a and X_c are nonadjacent (otherwise, a new v-structure is created).
(R2) Orient X_a − X_c into X_a → X_c whenever there is a chain X_a → X_b → X_c (otherwise, a directed cycle is created).
(R3) Orient X_a − X_c into X_a → X_c whenever there are two chains X_a − X_b → X_c and X_a − X_d → X_c such that X_b and X_d are nonadjacent (otherwise, a new v-structure or a directed cycle is created).
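Rules (R1)-(R3) can be applied by a simple fixed-point iteration over the undirected edges; a sketch in which the graph representation and names are ours:

```python
def apply_meek_rules(undirected, directed):
    """Fixed-point application of orientation rules (R1)-(R3) to a PDAG.

    `undirected` holds edges as frozenset({a, b}); `directed` holds
    pairs (a, b) meaning a -> b.  Returns the updated edge sets."""
    und = {frozenset(e) for e in undirected}
    dr = set(directed)

    def adjacent(p, q):
        return frozenset({p, q}) in und or (p, q) in dr or (q, p) in dr

    changed = True
    while changed:
        changed = False
        for e in list(und):
            v1, v2 = tuple(e)
            for x, y in ((v1, v2), (v2, v1)):
                parents_x = {d[0] for d in dr if d[1] == x}
                children_x = {d[1] for d in dr if d[0] == x}
                parents_y = {d[0] for d in dr if d[1] == y}
                # R1: a -> x with a and y nonadjacent  =>  orient x -> y
                r1 = any(a != y and not adjacent(a, y) for a in parents_x)
                # R2: chain x -> z -> y  =>  orient x -> y
                r2 = any((z, y) in dr for z in children_x)
                # R3: x - b -> y and x - d -> y, b and d nonadjacent  =>  x -> y
                mids = [z for z in parents_y if frozenset({x, z}) in und]
                r3 = any(not adjacent(p, q)
                         for i, p in enumerate(mids) for q in mids[i + 1:])
                if r1 or r2 or r3:
                    und.discard(e)
                    dr.add((x, y))
                    changed = True
                    break
    return und, dr

# Example: v-structure X0 -> X2 <- X1 plus X2 - X3 forces X2 -> X3 by (R1)
und, dr = apply_meek_rules({frozenset({2, 3})}, {(0, 2), (1, 2)})
print(sorted(dr))  # [(0, 2), (1, 2), (2, 3)]
```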

Algorithm A3 The FCI-stable algorithm (oracle version).
Require : Conditional independence information among all variables in V X given V T . Use Algorithm A4 to find an initial skeleton (C), separation sets (sepset) and unshielded triple list (M); Use Algorithm A5 to orient v-structures (update C); Use Algorithm A6 to find the final skeleton (update C and sepset); Use Algorithm A5 to orient v-structures (update C); Use rules (R1)-(R10) of [6] to orient as many edge marks as possible (update C); Output : C, sepset.

Algorithm A4
Obtaining an initial skeleton in the FCI-stable algorithm (Algorithm 4.1 in the supplement of [4]).
Require: Conditional independence information among all variables in V_X given V_T, and an ordering order(V_X) on the variables.
Form the complete undirected graph C on the vertex set V_X with edges •−•.
Let l = −1;
repeat
  l = l + 1;
  for all vertices X_a in C do let u(X_a) = adj(C, X_a) end for
  repeat
    Select a (new) ordered pair of vertices (X_a, X_b) that are adjacent in C and satisfy |u(X_a) \ {X_b}| ≥ l;
    repeat
      Choose a (new) set Y ⊆ u(X_a) \ {X_b} with |Y| = l;
      if X_a ⊥⊥ X_b | Y then
        Delete the edge X_a •−• X_b from C;
        Let sepset(X_a, X_b) = sepset(X_b, X_a) = Y;
      end if
    until X_a and X_b are no longer adjacent in C or all Y ⊆ u(X_a) \ {X_b} with |Y| = l have been considered
  until all ordered pairs of adjacent vertices (X_a, X_b) in C with |u(X_a) \ {X_b}| ≥ l have been considered
until all pairs of adjacent vertices (X_a, X_b) in C satisfy |u(X_a) \ {X_b}| ≤ l
Form a list M of all unshielded triples ⟨X_c, ·, X_d⟩ (i.e., with the middle vertex left unspecified) in C with c < d.
Output: C, sepset, M.

Algorithm A5
Orienting v-structures in the FCI-stable algorithm (Algorithm 4.2 in the supplement of [4]).
Require: Initial skeleton (C), separation sets (sepset) and unshielded triple list (M).
for all elements ⟨X_a, X_b, X_c⟩ of M do
  if X_b ∉ sepset(X_a, X_c) then orient the triple as X_a *→ X_b ←* X_c end if
end for
Output: C.

Algorithm A6 Obtaining the final skeleton in the FCI-stable algorithm (Algorithm 4.3 in the supplement of [4]).
Require: Partially oriented graph (C) and separation sets (sepset).
for all vertices X_a in C do
  let v(X_a) = pds(C, X_a, ·);
  for all vertices X_b ∈ adj(C, X_a) do
    Let l = −1;
    repeat
      l = l + 1;
      repeat
        Choose a (new) set Y ⊆ v(X_a) \ {X_b} with |Y| = l;
        if X_a ⊥⊥ X_b | Y then
          Delete the edge between X_a and X_b from C;
          Let sepset(X_a, X_b) = sepset(X_b, X_a) = Y;
        end if
      until X_a and X_b are no longer adjacent in C or all Y ⊆ v(X_a) \ {X_b} with |Y| = l have been considered
    until X_a and X_b are no longer adjacent in C or |v(X_a) \ {X_b}| < l
  end for
end for
Reorient all edges in C as •−•.
Form a list M of all unshielded triples ⟨X_c, ·, X_d⟩ in C with c < d.
Output: C, sepset, M.
A. For i = 1, . . . , n, draw X†_i from the local bootstrap distribution of X_i given Z_i.
B. Compute ρ̂*† based on the local bootstrap sample.

Appendix B. Proofs of the Theoretical Results
In this section, we provide detailed technical proofs of the theoretical results presented in the paper. We first state a concentration inequality in Lemma A1. The result in Lemma A1 is not new and can be seen as a corollary of Theorem A in Section 5.6.1 of [32]; however, it is a key technical ingredient in the proof of Theorem 1, which is the main theoretical innovation of our paper. For completeness, we include a short proof of Lemma A1, in which k := ⌊n/m⌋.
For notational convenience, we treat X a , X b and X S as X, Y and Z, respectively.
Proof of Theorem 2. The first inequality in Theorem 2 follows by observing that, for any random sequence {X_n}_{n=1}^∞ and any ε > 0, P(|X_n| > ε) ≤ P(sup_n |X_n| > ε) for all n ≥ 1, which, in turn, implies sup_n P(|X_n| > ε) ≤ P(sup_n |X_n| > ε).
The second inequality follows from the union bound and Theorem 1.