Towards Moment-Constrained Causal Modeling †

Abstract: The fundamental problem of causal inference is discovering the causal relations between the variables that describe observational data. We address this problem within the formalism of information field theory (IFT). Specifically, we focus on the problem of bivariate causal discovery ( X → Y , Y → X ) from an observational dataset ( X , Y ) . The bivariate case is especially interesting because the methods of conditional-independence testing are not applicable here. For this class of problems, we propose the moment-constrained causal model (MCM). The MCM goes beyond the additive noise model by exploiting Bayesian hierarchical modeling to provide non-parametric reconstructions of the observational distributions. In order to identify the correct causal direction, we compare the performance of our newly developed Bayesian inference algorithm for the two causal directions ( X → Y , Y → X ) by calculating the evidence lower bound (ELBO). To this end, we developed a new method for ELBO estimation that takes advantage of the adopted variational inference scheme for parameter inference.


Introduction
Since Pearl's pioneering works on causality [1,2], various new methods have been developed for the study of causal inference (for example, see [3] for a complete overview). In particular, the case of bivariate causal inference remains of central interest because the traditional interventional methods are not applicable in this regime. The problem of inferring causal relations from observational data is usually tackled following one of three main approaches [4]. In the first approach, often referred to as the constraint-based approach, the correct causal graph is identified by analyzing conditional independence in the observational data distribution [5][6][7]. Methods that follow this approach can recover the true causal directed acyclic graph (DAG) up to its Markov equivalence class under some technical assumptions, but are inapplicable in the bivariate case, where no conditional-independence relationship is available.
The second approach, referred to as the score-based approach, fits the observational data by searching in the space of all possible DAGs of a given size and assigning a score to each candidate. This score is used to infer the correct causal relations [8], but again can only identify the causal graph up to its Markov equivalence class. Therefore, both constraint-based and score-based methods are not suited to perform causal inference in the bivariate case. However, combined constraint- and score-based approaches that use generative neural networks (see Goudet et al. [9]) are able to infer causal directions in the bivariate case.
The third and final approach is the asymmetry-based approach. Asymmetry-based methods exploit the asymmetries between the cause and the effect variables to reconstruct the underlying causal relations. These methods either evaluate the algorithmic complexity of the two possible conditional distributions p(y|x, X → Y, M) and p(x|y, Y → X, M) resulting from splitting the joint probability density p(x, y) given a model M [10][11][12][13], or check for the independence of the conditional probability of the effect given the cause from the cause's marginal probability distribution (information geometric causal inference [14,15]). This work most resembles the asymmetry-based approach described above. Here, we continue the series of works [16,17] that exploit the Information Field Theory (IFT) formalism [18][19][20][21] for causal inference, with two important novelties. First, we use the geometric variational inference (geoVI) approach as the inference tool for our hierarchical model. This has the advantage that the final covariance estimate of the posterior is better tailored to the problem and therefore to our evidence estimation, described in Section 3.7. Second, we propose a new causal model that extends beyond the additive noise model (ANM) formulation and allows for more flexible noise parameterization, while explicitly breaking the symmetry between cause and effect (Section 3.3).

Causal Inference and Bayesian Generative Models
In this section, we state the bivariate problem (Section 2.1), make a connection to the IFT formalism (Section 2.2), and then introduce structural causal models (SCMs) (Section 2.3).

The Bivariate Problem
The bivariate problem is of central importance in causal inference. Given a DAG representing a causal relation between N random variables X_1, . . . , X_N, the joint distribution factorizes as

p(x_1, . . . , x_N) = ∏_{i=1}^{N} p(x_i | pa(x_i)),   (1)

where pa(x_i) denotes the parents of X_i in the DAG; this is the Markov condition described by Pearl [1]. In fact, this shows that if we can characterize this factorization in such a way that the conditional probabilities are "smoother" or "simpler" for the true causal direction, we can recover the complete causal graph [13,22]. For this reason, since determining the correct causal structure of a complicated DAG only requires determining the parent-child direction for each edge, we can restrict ourselves to the bivariate X → Y and Y → X problems. The bivariate problem can be stated as follows. Given a dataset formed by pairs d = (x, y) composed of realizations x = (x_1, . . . , x_N) ∈ D_x and y = (y_1, . . . , y_N) ∈ D_y of the random variables X and Y, respectively, predict the underlying causal direction, choosing between X → Y and Y → X.
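As a concrete illustration of such a dataset, the following toy sketch (our own; the mechanism `np.tanh` and the noise scale are arbitrary choices, not taken from the paper) draws N pairs from a simple SCM with true direction X → Y:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n, f=np.tanh, noise_scale=0.1):
    """Draw n pairs from the SCM  X -> Y :  Y = f(X) + xi,  xi independent of X."""
    x = rng.uniform(0.0, 1.0, size=n)           # cause variable X
    xi = rng.normal(0.0, noise_scale, size=n)   # noise, independent of X
    y = f(x) + xi                               # effect via the causal mechanism f
    return x, y

x, y = sample_scm(1000)                         # the observational dataset d = (x, y)
```

The causal-discovery task is then to recover the direction X → Y from (x, y) alone, without access to the generating mechanism.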

Causal Inference and IFT
We note that the structure of Equation (1) resembles Bayesian hierarchical generative modeling. This similarity is made even stronger if we consider the so-called structural causal models (SCMs) (see Section 6.2 of [3,23]). The SCM is composed of a DAG, which uniquely identifies the causal directions between a set of given variables, and a function f, which we refer to as the causal mechanism, and which links the causes to their effects within the DAG. In addition to the cause-effect pairs, the SCM is also supplemented by a set of noise variables ξ. The bivariate problem can then be stated as

Y = f(X, ξ),  with ξ independent of X,   (2)

for the X → Y direction (and analogously for Y → X). Under certain conditions, this relation can uniquely identify the correct causal mechanism [1]. We stress that Equation (2) can easily be represented by a hierarchical generative model, once the split from Equation (1) is given. Since generative models can be directly related to inference (see, e.g., Enßlin [21]), we will use the framework of information field theory (IFT). IFT is Bayesian inference for fields. Here, the fields we need to infer from noisy data are the probability distributions of X, Y, and their conditionals, given some model M. For details on our inference scheme, we refer to Section 3.6. The problem of detecting the true underlying causal direction reduces to probing all possible splits in Equation (1) and selecting the one with the highest score, given by our evidence estimate (see Section 3.7).

SCMs and Bayesian Generative Modeling
Combining Bayesian generative modeling and IFT, it is possible to reproduce complicated densities non-parametrically, given some prior assumptions on their smoothness [24,25]. As discussed in Section 1, some asymmetry-based models, including the novel causal model we propose in this work (Section 3.3), exploit the idea that the conditional distributions are somewhat "simpler" in the correct causal direction. It is then natural to generate these probability distributions directly and then try to find their corresponding SCMs. In particular, if we want to produce the conditional distribution for the X → Y causal direction, we need to define a model M to generate p(y|x, X → Y, M). We can then relate this distribution to an SCM of the form described in Equation (2) by considering

p(y|x, X → Y, M) = ∫_{D_ξ} P(ξ) δ(y − f(x, ξ)) dξ,   (3)

where we denote with P(ξ) the distribution of the noise variable ξ, and with δ(·) Dirac's delta distribution. We note that, without loss of generality, we can choose the noise variable ξ to be uniformly distributed over its domain D_ξ = [0, η] by invoking the inverse cumulative distribution function (inverse CDF) method. We can then absorb the inverse CDF transformation into f. In this parameterization, Equation (3) reduces to

p(y|x, X → Y, M) = (1/η) ∫_0^η δ(y − f(x, ξ)) dξ.
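The inverse-CDF argument can be illustrated numerically. In this sketch (our own; the Gaussian noise choice is just an example), uniform samples u are mapped through the inverse CDF F⁻¹ of a standard normal, so a mechanism f(x, ξ) with Gaussian ξ is equivalent to the reparameterized mechanism f̃(x, u) := f(x, F⁻¹(u)) with uniform u:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
nd = NormalDist(mu=0.0, sigma=1.0)

# Uniform noise, clipped to stay strictly inside (0, 1), where inv_cdf is defined.
u = np.clip(rng.uniform(0.0, 1.0, size=50_000), 1e-12, 1.0 - 1e-12)

# xi = F^{-1}(u) is distributed as a standard normal, so the inverse-CDF map
# can be absorbed into the causal mechanism f without changing p(y | x).
xi = np.array([nd.inv_cdf(v) for v in u])
```

Any noise distribution with an invertible CDF admits the same reduction, which is why the uniform parameterization in Equation (3) loses no generality.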

From Additive Noise Models to MCM
In this section, we give an overview of the additive noise model (ANM) as a paradigmatic example of a simple asymmetry method and the flexible noise model (FNM) as a more complicated one. We then propose a novel causal model that can overcome some of their limitations.

The Additive Noise Model
From the very definition of the SCM, it might be reasonable to model the effect of a cause variable X on a variable Y through a deterministic causal mechanism f plus independent noise ξ. Such a model is called an Additive Noise Model (ANM) [11,23]. In the linear, non-Gaussian, and acyclic models (LiNGAM, see Shimizu et al. [26]), f is linear, and at most one of X and ξ is normally distributed. In the nonlinear ANM proposed by Hoyer et al. [11], f is nonlinear, and the only constraint on the noise distribution P(ξ) is to be independent of the cause X.
Mimicking the approach followed in Section 2.3, we find the ANM's corresponding conditional density distribution. The generic additive noise model can be written in the form Y = f(X) + ξ; hence,

p(y|x, X → Y) = P(y − f(x)),

where P(ξ) is the distribution of the noise variable ξ ∈ D_ξ. This shows that, for the X → Y causal direction, the conditional probability of an ANM is the noise distribution evaluated at the residual coordinates y − f(x). If we furthermore assume the noise to be normally distributed, with a fixed variance σ², this further reduces to the Gaussian distribution

p(y|x, X → Y) = G(y − f(x), σ²) = (2πσ²)^(−1/2) exp(−(y − f(x))²/(2σ²)).   (4)
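As a minimal numerical sketch (our own; f and σ are arbitrary stand-ins), the Gaussian ANM conditional is simply the noise density evaluated at the residual y − f(x):

```python
import numpy as np

def anm_conditional(y, x, f, sigma):
    """Gaussian ANM conditional:  p(y | x, X -> Y) = G(y - f(x), sigma^2)."""
    r = y - f(x)                                               # residual coordinate
    return np.exp(-0.5 * (r / sigma) ** 2) / np.sqrt(2.0 * np.pi * sigma ** 2)

# Evaluate the conditional density at one (x, y) point for a toy mechanism.
p = anm_conditional(0.5, 0.3, np.tanh, sigma=0.1)
```

For the true causal direction, the residuals y − f(x) are independent of the cause, which is the asymmetry that ANM-based methods exploit.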

Flexible Noise Model
Many attempts have been made to go beyond the additive noise formulation (see, e.g., [9,27]), and many algorithms have proven effective, even if they have not been directly related to SCMs. Some of these approaches have identified preferred causal directions by comparing complexity measures of the conditional distributions p(y|x, X → Y) and p(x|y, Y → X) (see, e.g., Sun et al. [13,22]). Unfortunately, in many cases, the proposed complexity measures are intractable and hard to quantify. In Guardiani et al. [25], we presented a hierarchical generative model that exploits prior information on the observed variables to produce a joint distribution, parametrized as

ρ_FNM(x, y | X → Y) = ρ(x) p_FNM(y|x, X → Y),  with ρ(x) = e^(r(x)).

In the following, we will refer to this model as the flexible noise model (or FNM). In the FNM, the true causal direction is determined by comparing Bayesian evidences (more specifically, the lower bounds of the evidence) between the different possible causal graphs. As we will discuss in Section 3.7, using conditional-probability model evidences to determine causal directions inherently relates to the complexity of the conditional distribution. Intuitively, the evidence accounts for how many degrees of freedom (read: algorithmic complexity) need to be excited in order to reproduce a certain joint distribution given a specific causal direction. The generative nature of Bayesian hierarchical models also allows for an understanding of these models from the SCM perspective, in which Pearl's do-calculus, interventions, and counterfactuals are naturally defined. If we use Equation (3) for the FNM, we can describe the conditional probability of the X → Y model as

p_FNM(y|x, X → Y) = e^(g(y) + h(x, y)) / ∫_{D_y} e^(g(ỹ) + h(x, ỹ)) dỹ,   (5)

which is subject to the constraint ∫_{D_y} p_FNM(ỹ|x, X → Y) dỹ = 1. While this approach is effective when prior knowledge (regarding the distributions of the observational variables) is accessible, the constraints on f are too weak to guarantee identifiability in a more general setting.
In the case that no prior information on the observed variables is available and the conditional distributions are similarly complex in either causal direction, additional asymmetry must be introduced in the model in order to identify the correct causal structure.
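A normalized conditional of the form p(y|x) ∝ exp(g(y) + h(x, y)), as used by the FNM, can be sketched on a grid as follows (our own toy illustration; the paper infers g and h non-parametrically, whereas here they are fixed toy functions):

```python
import numpy as np

ys = np.linspace(0.0, 1.0, 200)    # grid over the y-domain D_y
dy = ys[1] - ys[0]

def fnm_conditional(x, g, h):
    """Normalized conditional  p(y|x) = exp(g(y) + h(x, y)) / Z(x)  on the grid."""
    logp = g(ys) + h(x, ys)
    logp -= logp.max()             # subtract the maximum for numerical stability
    p = np.exp(logp)
    return p / (p.sum() * dy)      # enforce the constraint  int p(y|x) dy = 1

g = lambda y: -0.5 * ((y - 0.5) / 0.2) ** 2   # toy x-independent shape
h = lambda x, y: 2.0 * x * y                  # toy cause-effect coupling
p = fnm_conditional(0.7, g, h)
```

Setting h ≡ 0 removes the x-dependence entirely, which is the limit used below to connect the FNM back to the ANM.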

From FNM to First-Order MCM
In the spirit of breaking the symmetry in the split of the joint distribution p(x, y) = p(x) p(y|x) = p(y) p(x|y), we introduce the moment-constrained causal model (MCM). This model can be seen as a generalization of the nonlinear additive noise model and a (restrictive) modification of the flexible noise model. Starting from the FNM conditional distribution in Equation (5), we can represent an x-independent y-distribution by setting h(x, y) ≡ 0:

p_FNM(y | h(x, y) ≡ 0, X → Y) = e^(g(y)) / ∫_{D_y} e^(g(ỹ)) dỹ.   (8)

We then realize that the conditional probability of the ANM from Equation (4) can simply be represented from the x-independent probability distribution in Equation (8),

p(y|x, X → Y) = e^(g(y − f(x))) / ∫_{D_y} e^(g(ỹ − f(x))) dỹ,   (9)

if we substitute y → y − f(x) and require that g is quadratic, g(·) = −(·)²/(2σ²).
Equation (9) directly shows that a nonlinear ANM can be interpreted as a coordinate transformation that shifts the mean of a normal, zero-centered y-distribution (with fixed variance) by a nonlinear x-dependent function f(x). We will use this intuition to develop the MCM. We now want to extend the ANM in a way that allows for more complicated noise distributions and noise-cause entanglements, similar to the FNM. As previously discussed, though, in the absence of detailed prior knowledge the FNM is not capable of completely constraining the f of its corresponding SCM, nor is it able to impose strong constraints on the shape of the conditional distribution p_FNM(y|x, X → Y). In particular, it is not uniquely centered around a smooth function f(x) as the ANM is.
Mimicking the approach described in Equation (9), we propose to extend the ANM with a model that is parametrized similarly to the FNM, but whose conditional distribution's first moment is centered around a smooth function of x, which we call ŝ(x). To build this model, we could start from the FNM defined on some probe coordinates (x, y′) and perform the change of variable y′ = y − ŝ(x). This parameterization, though, allows both h(x, y′) and ŝ(x) to represent the average ⟨y′⟩_{p(y′|x, X → Y)}. To eliminate this degeneracy, we introduce a slightly different change of variable,

y = y′ + ŝ(x) − ⟨y′⟩_{p_FNM(y′|x, X → Y)}.   (10)

This way, ŝ(x) will model the mean of the causal mechanism f, and the function h(x, y′) will model higher-order effects. This parameterization, together with the additional smoothness hypothesis on the functions ŝ(x) and h(x, y′) (see Section 3.6), fulfills our initial purpose of further breaking the symmetry of the joint-distribution split in the two opposite causal directions X → Y and Y → X.
We will refer to the model in Equation (10) as the first-order MCM. Upon defining δy(x) := ŝ(x) − ⟨y′⟩_{p_FNM(y′|x, X → Y)}, we can write the conditional density of the MCM model as

p_MCM(y|x, X → Y) = e^(g(y′) + h(x, y′)) / Z(x),   (11)

where y′ := y − δy(x) and Z(x) represents the normalization term. By comparing Equations (9) and (11), we see that the δy(x) in the MCM plays the same role as the f(x) in the ANM, and that all the remaining contributions are modeled by h(x, y − δy(x)).
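A toy numerical sketch (our own) of this construction: given stand-in fields g, h, and ŝ, compute ⟨y′⟩ under the FNM conditional, form δy(x) = ŝ(x) − ⟨y′⟩, and evaluate the shifted conditional. With h ≡ 0, the resulting first moment tracks ŝ(x), as intended:

```python
import numpy as np

ys = np.linspace(-1.0, 2.0, 600)             # y-grid covering the support
dy = ys[1] - ys[0]

def mcm_conditional(y, x, g, h, s_hat):
    q = np.exp(g(ys) + h(x, ys))             # unnormalized FNM on probe coords y'
    Z = q.sum() * dy
    mean_yp = (ys * q).sum() * dy / Z        # <y'> under p_FNM(y'|x, X -> Y)
    delta_y = s_hat(x) - mean_yp             # the shift delta_y(x)
    yp = y - delta_y                         # y' = y - delta_y(x)
    # Z is unchanged by the shift for densities well inside the grid.
    return np.exp(g(yp) + h(x, yp)) / Z

g = lambda y: -0.5 * (y / 0.15) ** 2         # toy base shape
h = lambda x, y: 0.0 * y                     # no higher-order effects in this toy
s_hat = lambda x: np.sin(2.0 * x)            # toy smooth mean function
p = mcm_conditional(ys, 0.4, g, h, s_hat)
mean = (ys * p).sum() * dy                   # first moment of the MCM conditional
```

In this toy limit the MCM reduces to a shifted version of the x-independent density, mirroring how δy(x) takes over the role of f(x) in the ANM.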

Second-Order MCM
With the introduction of the first-order MCM, we have shown how to model a conditional distribution centered around a smooth function ŝ(x) for the causal direction X → Y. The guiding intuition we want to bear in mind is that at least some of the moments of the effect variable should depend smoothly on the cause. In the first-order MCM, we impose this dependence on the first moment. Taking this concept one step further, we can require that the variance of the MCM conditional density is represented by a smooth, strictly positive, and x-only dependent function σ̂(x). We do so by modifying the change of variable in Equation (10) into

y = ŝ(x) + σ̂(x) (y′ − ⟨y′⟩) / (⟨y′²⟩ − ⟨y′⟩²)^(1/2),

where all expectation values are taken over the FNM conditional distribution, ⟨·⟩_{p_FNM(y′|x, X → Y)}. We note that all these expectation values depend on x.
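The second-order change of variable amounts to standardizing the probe coordinate y′ under the FNM conditional and rescaling it with the smooth fields ŝ(x) and σ̂(x). A sketch (our own; sample-based expectations stand in for the FNM ones):

```python
import numpy as np

def second_order_change_of_variable(yp, mean_yp, std_yp, s_hat_x, sigma_hat_x):
    """y = s_hat(x) + sigma_hat(x) * (y' - <y'>) / sqrt(<y'^2> - <y'>^2)."""
    return s_hat_x + sigma_hat_x * (yp - mean_yp) / std_yp

rng = np.random.default_rng(2)
yp = rng.gamma(2.0, 1.0, size=100_000)       # stand-in samples of y' given x
y = second_order_change_of_variable(yp, yp.mean(), yp.std(), 0.3, 0.05)
```

After the transformation, the conditional mean and standard deviation are exactly ŝ(x) and σ̂(x) by construction, whatever the shape of the original y′ distribution.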

Likelihood
Without loss of generality, we assume that the observational data d consist of N pairs (x_i, y_i), with i ∈ {1, . . . , N}, which live on some compact space (x, y) ∈ [0, 1]². We bin the data into a fine two-dimensional grid over the x and y directions, such that n_ij is the number of data counts within the (i, j)-th pixel, where ∆x and ∆y are the pixel dimensions in the x and y directions. We then assume a Poisson likelihood to compare the number of counts n_ij with the model's expectations,

P(n | ρ) = ∏_{ij} λ_ij^(n_ij) e^(−λ_ij) / n_ij!,  with λ_ij = N ρ(x_i, y_j) ∆x ∆y,

where (x_i, y_j) is the center of the (i, j)-th pixel and we indicate with ρ(x, y) = ρ(x) p(y|x, X → Y) (or ρ(x, y) = ρ(y) p(x|y, Y → X), depending on the causal model at hand) the joint density of x and y. We note that if the data-generating process is unknown, selection effects and instrument responses can still imprint a wrong causal direction onto the data. Even though under certain hypotheses it is still possible to recover the correct causal direction (see Bareinboim et al. and Correa and Bareinboim [28,29]), in general this is not the case (see, e.g., the Results section of Guardiani et al. [25]). In principle, if we know how the data are measured, we can model the data-generation process with an additional and independent likelihood factor P_measure(d_i | x_i, y_i, X → Y).
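The binned Poisson likelihood can be sketched as follows (our own illustration; `poisson_log_likelihood` and the uniform test density are hypothetical names, and the model-independent log n_ij! term is dropped since it cancels when comparing models on the same data):

```python
import numpy as np

rng = np.random.default_rng(4)

def poisson_log_likelihood(x, y, rho, bins=64):
    """Binned Poisson log-likelihood of the counts n_ij under a candidate joint
    density rho(x, y), up to the model-independent log(n_ij!) term."""
    n, xe, ye = np.histogram2d(x, y, bins=bins, range=[[0.0, 1.0], [0.0, 1.0]])
    dx, dy = xe[1] - xe[0], ye[1] - ye[0]
    xc = 0.5 * (xe[:-1] + xe[1:])                            # pixel centers
    yc = 0.5 * (ye[:-1] + ye[1:])
    lam = len(x) * rho(xc[:, None], yc[None, :]) * dx * dy   # expected counts
    return float(np.sum(n * np.log(lam) - lam))

x, y = rng.uniform(size=1000), rng.uniform(size=1000)
uniform_rho = lambda a, b: np.ones_like(a * b)               # rho(x, y) = 1 on [0, 1]^2
ll = poisson_log_likelihood(x, y, uniform_rho)
```

In the actual model, rho(x, y) would be the joint density implied by the causal direction under test, and the likelihood enters the inference of Section 3.6.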

Inference
In Section 2.2, we explain why the IFT generative-model framework is particularly suited to developing causal inference methods. One of the main reasons is that it allows generating and inferring smooth probability distributions from the data d according to a certain causal model. Let us consider the causal direction X → Y. This problem then reduces to learning the joint density ρ_MCM(x, y | X → Y) = ρ(x) p_MCM(y|x, X → Y) by inferring the correlation structure of the independent density ρ(x), parametrized as ρ(x) = e^(r(x)), and of the conditional density p_MCM(y|x, X → Y). Upon denoting all unknown signal parameters with s := (r, g, h, ŝ), the full model reads P(d, s) = P(d|s) P(s), where P(d|s) = P(d | [r, g, h, ŝ]) and P(s) = P(r|p_r) P(g|p_g) P(h|p_h) P(ŝ|p_ŝ) P(p_r) P(p_g) P(p_h) P(p_ŝ).
We have denoted with p_r := (a_r, k_r, γ_r) (and likewise for g, h, and ŝ) the parameters of the Matérn-kernel parameterization of the power spectrum,

P_s(k) = a_s (1 + (k/k_s)²)^(γ_s/2).

This is the parameterization we use to infer the correlation structure of the signal fields s for each Fourier mode k. The process of generating smooth distributions with Matérn-kernel correlation structures is described in more detail by Guardiani et al. [25].
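As an illustration (our own), a generic Matérn-type power spectrum with amplitude a, cutoff k₀, and log-log slope γ can be written as follows; the exact convention used for the MCM fields is given in Guardiani et al. [25]:

```python
import numpy as np

def matern_power_spectrum(k, a, k0, gamma):
    """Matern-type spectrum: flat below the cutoff k0, power law k**gamma above."""
    return a * (1.0 + (k / k0) ** 2) ** (gamma / 2.0)

ks = np.linspace(0.0, 100.0, 1000)                 # Fourier modes
ps = matern_power_spectrum(ks, a=1.0, k0=5.0, gamma=-4.0)
```

Inferring (a, k₀, γ) jointly with the fields lets the smoothness of each signal field adapt to the data instead of being fixed a priori.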
We then need to infer the signal vector s. We do this by transforming the coordinates of the signal vector s = s(χ) such that the prior on the new χ coordinates becomes an uncorrelated Gaussian P (χ) = G(χ, 1) as described by Frank et al. [30]. We note once again that this generative structure resembles the SCM. We then implement the resulting model using the Python package Numerical Information Field Theory (NIFTy [31][32][33]) and finally use NIFTy's implementation of geometric variational inference (geoVI) [30] to draw posterior samples. In essence, the geoVI scheme finds an approximate coordinate transformation that maps the posterior distribution of the signal parameters χ into an approximate standard normal distribution. We refer to the bibliography for more details and benchmarks.

Evidence Estimation
In order to determine the true causal relations underlying the data, we estimate the Bayesian evidence lower bounds (ELBOs) of our models [34,35]. Since we are using the geoVI variational inference approach described in Section 3.6, we can exploit the variational nature of our minimization scheme to estimate the ELBO. In fact, the log-evidence can be related to the Kullback-Leibler divergence D_KL between the approximating distribution Q and the true posterior p(s(χ)|d),

ln p(d) = −⟨H(s(χ), d)⟩_{Q(s|d)} + ½ (N + Tr log Λ_Θ|_{χ=χ̄}) + D_KL(Q(s(χ)|d) ‖ p(s(χ)|d)),

where, as before, s(χ) represents the parameterization of the model degrees of freedom (signal), d = {x_i, y_i}_{i=1···N} is the data, and Λ_Θ|_{χ=χ̄} represents the diagonalized metric of the posterior at the current sampling point χ̄. We have also denoted with H(·) := −log p(·) the information Hamiltonian. The expectation value ⟨H(s(χ), d)⟩_Q is estimated via a geoVI posterior-sample average. Given that D_KL ≥ 0, it is more convenient to consider only the lower bound

ln p(d) ≥ −⟨H(s(χ), d)⟩_{Q(s|d)} + ½ (N + Tr log Λ_Θ|_{χ=χ̄}).
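The resulting estimator only needs the Hamiltonian values of the geoVI samples and the diagonal of the posterior metric. A sketch (our own; the sample values below are invented purely for illustration):

```python
import numpy as np

def elbo_estimate(hamiltonian_samples, log_lambda_diag):
    """ELBO estimate: -<H(s(chi), d)>_Q + 0.5 * (N + Tr log Lambda), with N the
    number of model degrees of freedom and Lambda the diagonalized metric."""
    n = log_lambda_diag.size
    return -np.mean(hamiltonian_samples) + 0.5 * (n + np.sum(log_lambda_diag))

# Toy comparison of the two causal-direction models on the same data: the model
# with the higher ELBO wins, and the difference estimates the log Bayes factor.
H_xy = np.array([1000.0, 1002.0, 998.0])     # H(s(chi), d) samples, model X -> Y
H_yx = np.array([1025.0, 1024.0, 1026.0])    # H(s(chi), d) samples, model Y -> X
lam = np.zeros(50)                           # identity metric for both toy models
log_bayes_factor = elbo_estimate(H_xy, lam) - elbo_estimate(H_yx, lam)
```

Here the entropy-like term ½(N + Tr log Λ) cancels in the comparison because both toy models share the same metric; in general it penalizes models that must excite many degrees of freedom.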
The causal direction with the highest ELBO on a given dataset d is our final estimated causal direction. We note that, for well-constructed models, the ELBO can quantify the complexity of the conditional probability distribution of the effect given the cause, since it inherently penalizes the excitation of model degrees of freedom (Occam's razor principle).

Implementation
We implemented a preliminary version of the first-order MCM in Python and NIFTy. This implementation makes use of the inference machinery described in Section 3.6 to produce posterior reconstructions of the data d according to the MCM model.
An example of the MCM performance is shown in Figure 1. Even though the model has so far only been tested on the bci_default dataset described by Kurthen et al. [16], from the reconstruction of the joint density ρ_MCM(x, y) (Figure 1, middle column) it can already be seen that the symmetry between cause and effect is clearly broken. In the true causal direction, the ŝ field can capture more x-dependent structure from the data. For this case, the ELBO calculations support the correct X → Y causal-direction model by a Bayes factor of ≈ e^20.

Figure 1. On the left, we show the raw data (counts are color-coded). In the middle, we display the reconstruction of the density ρ_MCM(x, y) as modeled by the MCM; the corresponding reconstructed ŝ field is superimposed (white dashed line; the shaded region represents one standard deviation of the geoVI posterior uncertainty). On the right, we show the relative uncertainty on ρ_MCM(x, y). In the upper half of the figure, the causal direction is the true one, X → Y; in the lower half, we present the results for the wrong causal direction, Y → X.

Conclusions
In this work, we showed the connection between SCMs and Bayesian generative modeling within the framework provided by IFT. In this spirit, we introduced a novel causal model that translates the intuition that some of the moments of the effect variable should exhibit a smooth dependence on the cause into an algorithm that explicitly breaks the symmetry in the split of the joint distribution of cause and effect. Preliminary results on a single dataset show that our new model can identify the correct causal direction. Future work directions include benchmarking the first- and second-order MCM against state-of-the-art causal inference algorithms, as well as exploring possible higher-order correction terms.