Direct and Indirect Effects—An Information Theoretic Perspective

Information theoretic (IT) approaches to quantifying causal influences have experienced some popularity in the literature, in both theoretical and applied (e.g., neuroscience and climate science) domains. While these causal measures are desirable in that they are model agnostic and can capture non-linear interactions, they are fundamentally different from common statistical notions of causal influence in that they (1) compare distributions over the effect rather than values of the effect and (2) are defined with respect to random variables representing a cause rather than specific values of a cause. We here present IT measures of direct, indirect, and total causal effects. The proposed measures are unlike existing IT techniques in that they enable measuring causal effects that are defined with respect to specific values of a cause while still offering the flexibility and general applicability of IT techniques. We provide an identifiability result and demonstrate application of the proposed measures in estimating the causal effect of the El Niño–Southern Oscillation on temperature anomalies in the North American Pacific Northwest.


Introduction
Consider a directed acyclic graph (DAG), where nodes represent random variables and edges represent a direct causal influence between two variables. We here discuss the problem of quantifying these causal influences. This problem has received considerable attention in a variety of communities; for the sake of exposition, we coarsely categorize methods as either statistical (i.e., those summarized by [40]) or information theoretic (IT) [21,3,10,42]. When viewed from an applications perspective, these two approaches are quite different. Statistical approaches are common in epidemiology and economics [14,19], whereas IT methods appear in the study of complex natural systems, for example in climate science [16,45] or neuroscience [24,50]. The fundamental difference in perspectives that gives rise to this disparity is not well presented in the development of IT methodologies.
To illustrate these differences, consider a simple example with a two node graph X → Y, where X ∈ {0, 1} represents whether or not an individual has won the lottery and Y ∈ R represents that individual's average monthly spending (assume for clarity that there are no confounding factors). A statistical measure such as the average causal effect (ACE) [18,39] would seek to answer the question "What is the effect of winning the lottery on spending?" by comparing the average spending of lottery winners (X = 1) against the average spending of lottery non-winners (X = 0):
$$ACE = \mathbb{E}[Y \mid X = 1] - \mathbb{E}[Y \mid X = 0].$$
We would of course expect this to be quite large. It is important to note that the ACE is defined irrespective of the marginal distribution of X, meaning that the probability with which x occurs has no bearing on the effect of x on Y. An IT approach addresses a subtly different question: "What is the effect of the lottery on spending?" In other words, an IT measure considers the effect of the random variable representing whether or not one wins the lottery on spending. Specifically, the effect of X on Y would be given by the mutual information (MI), I(X; Y) (see [21, Sec. 2, P1]). Using a simple IT inequality, we get that the MI is bounded above by the Shannon entropy, H(X). Given that the distribution of X is essentially a point mass at X = 0, which has zero Shannon entropy, we have I(X; Y) ≤ H(X) ≈ 0. In words, because so few people win the lottery, an IT measure indicates that the lottery has a negligible effect on spending. To summarize, statistical measures consider the effect of a specific cause, whereas IT measures consider the effect at a systemic level.
Figure 1: DAG G representing a mediation model.
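The contrast in the lottery example can be checked numerically. The following is a minimal sketch under assumed numbers: spending is discretized into two hypothetical levels, and the winning probability and conditional distributions are invented for illustration only. It computes the ACE, the mutual information, and the entropy bound H(X).

```python
import math

def kl_bits(p, q):
    """KL divergence D(p || q) in bits for two pmfs given as lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy_bits(p):
    """Shannon entropy in bits of a pmf given as a list."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Hypothetical numbers (not from the paper), chosen only for illustration.
p_win = 1e-6                                  # probability of winning
p_x = [1.0 - p_win, p_win]                    # pmf of X (0 = lose, 1 = win)
spending = [2_000.0, 20_000.0]                # two discretized spending levels
p_y_given_x = [[0.95, 0.05], [0.10, 0.90]]    # p(y | x), rows indexed by x

# Statistical view: difference in expected spending (no confounding assumed).
mean_spend = [sum(s * p for s, p in zip(spending, row)) for row in p_y_given_x]
ace = mean_spend[1] - mean_spend[0]

# IT view: I(X; Y) = sum_x p(x) D(p(Y | x) || p(Y)), bounded above by H(X).
p_y = [sum(p_x[x] * p_y_given_x[x][y] for x in range(2)) for y in range(2)]
mi = sum(p_x[x] * kl_bits(p_y_given_x[x], p_y) for x in range(2))
h_x = entropy_bits(p_x)

print(f"ACE = {ace:.1f}, I(X;Y) = {mi:.2e} bits, H(X) = {h_x:.2e} bits")
```

With these assumed values the ACE is large in dollar terms while the mutual information stays below H(X), which is tiny because winning is so rare.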
A second difference is that, whereas statistical approaches typically measure causal effects on the value of an outcome, IT approaches measure the causal effect on the distribution of an outcome. Exceptions to this include [23,44], which take a non-IT approach to measuring differences between outcome distributions. Each of these approaches comes with benefits and drawbacks. With statistical approaches, the units are preserved (in the previous example, the units of the ACE are dollars). While IT measures yield the less interpretable unit of bits, they are able to capture more complex causal effects, for instance the effect that a variable has on the variance of another. Acknowledging this difference helps to understand the disparity between the applications of statistical and IT measures. When evaluating the causal link between smoking and cancer, the number of bits of information shared by the smoking and cancer variables may not be as useful as knowing the extent to which quitting smoking decreases the likelihood of cancer. However, when studying the nature of complex natural networks, it may be desirable to use a measure that can capture higher order causal effects.
In the present work we seek to endow IT measures with the ability to measure specific causal effects. Furthermore, we show that existing IT measures of causal influences are ill-equipped for distinguishing direct and indirect effects. Following a parallel storyline to that of Pearl [38], we provide measures of the total, (natural and controlled) direct, and natural indirect effects. We show that these measures do not fundamentally change the underlying IT perspective on causality, but enable obtaining "higher resolution" measures of causal influence. In doing so, we provide increased clarity to the aforementioned differences between IT and statistical causal measures.
In Section 5, we showcase how the framework can be used in practical contexts, focusing on the evaluation of the causal effect of the El Niño-Southern Oscillation (ENSO) on land surface temperature anomalies in the North American Pacific Northwest (PNW). Our results confirm the scientific consensus that both ENSO phases affect PNW land surface temperatures asymmetrically. Furthermore, using a conditional version of the proposed measures, we show the presence of a "persistence signal" across two-week average temperature anomalies that is modulated by the El Niño phase. This result both demonstrates the value of the proposed framework and provides direction for future studies focused on climate scientific findings.

Notation and Problem Setup
We will be developing techniques for measuring the causal influence of $X \in \mathcal{X}$ upon $Y \in \mathcal{Y}$ in the presence of a mediating variable $Z \in \mathcal{Z}$ using the DAG G depicted in Figure 1. Without loss of generality, Z may represent a collection $(Z_1, Z_2, \dots, Z_k) \in \mathcal{Z}_1 \times \mathcal{Z}_2 \times \cdots \times \mathcal{Z}_k = \mathcal{Z}$ of all mediating variables. Define the parent sets of X, Z, and Y as $PA_X = U_X$, $PA_Z = \{X\} \cup U_Z$, and $PA_Y = \{X, Z\} \cup U_Y$. Dashed double headed arrows in Figure 1 are used to indicate unknown dependencies between $U_X \in \mathcal{U}_X$, $U_Y \in \mathcal{U}_Y$, and $U_Z \in \mathcal{U}_Z$ (including the possibility of $U_S \cap U_T \neq \emptyset$ for S, T ∈ {X, Y, Z}). We may use the shorthand $U = U_X \cup U_Y \cup U_Z \in \mathcal{U}$. For simplicity, we assume that all variables are discrete with arbitrary finite supports, though extending the proposed methods to continuous or mixed random variables is straightforward. In general, let p be the probability mass function (pmf) for all variables in the graph (i.e., X, Y, Z, U ∼ p), capital letters represent random variables, and lowercase letters represent their realizations. For example, $p(x \mid pa_X)$ gives the conditional probability of the event X = x given that its parents took on values $pa_X$. We further assume that p satisfies the causal Markov condition with respect to G [39], with
$$p(x, y, z, u) = p(u)\, p(x \mid u_X)\, p(z \mid x, u_Z)\, p(y \mid x, z, u_Y).$$
We use a hat to indicate the do-operator, which represents taking the action of forcing a variable to assume a particular value by means of intervention. For example, $p(y \mid \hat{z}) = p(y \mid do(Z = z))$ gives the probability of y given that Z is forced to take the value z, irrespective of the probability with which that value occurs. When working with distributions utilizing the do-operator, a set of rules known as the do-calculus can be used to identify if and how the interventional distributions correspond to observational distributions that do not utilize the interventions. While the reader is referred to [39, Sec. 3.4] for the complete do-calculus, we provide a description of the rule which enables swapping interventions for observations in Appendix A.
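To make the distinction between conditioning and intervening concrete, the following minimal sketch uses a hypothetical confounded model U → X, U → Y, X → Y with binary variables; all numbers are invented. It assumes U is fully observed, in which case the standard adjustment formula $p(y \mid \hat{x}) = \sum_u p(u)\, p(y \mid x, u)$ applies, and compares the result with the observational conditional $p(y \mid x)$.

```python
# Hypothetical confounded model (not from the paper): U -> X, U -> Y, X -> Y.
p_u = [0.5, 0.5]                 # pmf of the confounder U
p_x1_given_u = [0.2, 0.8]        # p(X = 1 | u)

def p_y1_given_xu(x, u):
    """p(Y = 1 | x, u); an assumed additive form for illustration."""
    return 0.1 + 0.2 * x + 0.5 * u

def p_x_given_u(x, u):
    return p_x1_given_u[u] if x == 1 else 1.0 - p_x1_given_u[u]

def p_y1_do_x(x):
    """Interventional p(Y = 1 | do(X = x)): average over the marginal p(u)."""
    return sum(p_u[u] * p_y1_given_xu(x, u) for u in range(2))

def p_y1_obs_x(x):
    """Observational p(Y = 1 | X = x): average over the posterior p(u | x)."""
    p_x = sum(p_u[u] * p_x_given_u(x, u) for u in range(2))
    return sum(p_u[u] * p_x_given_u(x, u) / p_x * p_y1_given_xu(x, u)
               for u in range(2))

print(p_y1_do_x(1), p_y1_obs_x(1))
```

In this assumed model the two quantities differ because observing X = 1 also carries information about U, whereas forcing X = 1 does not.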
The entropy of a random variable Y and conditional entropy of Y given X are given respectively by $H(Y) = -\sum_y p(y) \log p(y)$ and $H(Y \mid X) = -\sum_{x,y} p(x, y) \log p(y \mid x)$. It is worth noting that the conditional entropy yields the expected uncertainty of Y given X, and is not to be confused with $H(Y \mid X = x) = -\sum_y p(y \mid x) \log p(y \mid x)$, which gives the uncertainty of Y when conditioning on a particular value of X. For two distributions p and q over $\mathcal{Y}$, the KL-divergence (also known as relative entropy) from p to q represents the excess number of bits needed to represent Y if the distribution is assumed to be q when it is in fact p, and is given by $D(p(Y) \,\|\, q(Y)) = \sum_y p(y) \log \frac{p(y)}{q(y)}$ [6,30]. The KL-divergence is zero if and only if p(y) = q(y) for all y for which p(y) > 0, and is deemed infinite if there exists a y such that p(y) > 0 and q(y) = 0. We use Bern(α) to represent the distribution of a Bernoulli random variable with parameter 0 < α < 1. For the KL-divergence between two Bernoulli random variables with parameters α and β, we will use the shorthand D(α || β). Finally, the mutual information (MI) between X and Y is given by
$$I(X; Y) = H(Y) - H(Y \mid X) = \sum_x p(x)\, D\big(p(Y \mid x) \,\|\, p(Y)\big).$$
These equivalent definitions of MI give rise to two interpretations: (i) the average reduction in uncertainty in Y obtained by conditioning on X and (ii) the average increased ability to predict Y resulting from conditioning on X. It is worth noting that (barring some technical details) these definitions can be applied to continuous valued random variables by substituting integrals for sums and probability density functions for pmfs.
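The quantities defined above are straightforward to compute for finite pmfs. The sketch below, using an arbitrary joint distribution chosen for illustration, implements entropy and KL-divergence in bits and checks that the entropy-based and KL-based expressions for the mutual information agree.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits; terms with zero probability contribute zero."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_bits(p, q):
    """D(p || q) in bits; infinite if q(y) = 0 for some y with p(y) > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total

# Arbitrary illustrative joint pmf, joint[x][y] = p(x, y).
joint = [[0.30, 0.10],
         [0.15, 0.45]]
p_x = [sum(row) for row in joint]
p_y = [sum(joint[x][y] for x in range(2)) for y in range(2)]
p_y_given_x = [[joint[x][y] / p_x[x] for y in range(2)] for x in range(2)]

# (i) average reduction in uncertainty: H(Y) - H(Y | X)
mi_entropy = entropy_bits(p_y) - sum(
    p_x[x] * entropy_bits(p_y_given_x[x]) for x in range(2))
# (ii) average divergence between conditional and marginal
mi_kl = sum(p_x[x] * kl_bits(p_y_given_x[x], p_y) for x in range(2))

print(mi_entropy, mi_kl)
```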

Direct and Indirect Effects
Building upon the work of Robins and Greenland [43], Pearl [38] formalized definitions of direct and indirect effects in the context of graphical models. Such a distinction is useful in disentangling the mechanisms via which causal influences occur. A canonical example is presented by Hesslow [15], wherein a birth control pill is suspected of directly increasing the likelihood of thrombosis in women, while simultaneously reducing thrombosis through its prevention of pregnancy (which is positively linked to thrombosis). In each of Pearl's definitions, the magnitude of the causal effect is specified for a specific value x and is measured with respect to a reference (or baseline) value x*. The simplest of these measures is the total effect (TE) of X = x on Y, given by
$$TE \triangleq \mathbb{E}[Y \mid \hat{x}] - \mathbb{E}[Y \mid \hat{x}^*].$$
The TE yields the answer to a very concise causal question, namely "How much would we expect the value of Y to change if we were to change X from x* to x?" As indicated by the name, the TE does not distinguish effects that x has on Y directly from those that occur via a mediating variable Z. As such, Pearl proceeds to define the controlled direct effect (CDE) of x on Y with mediator z as
$$CDE \triangleq \mathbb{E}[Y \mid \hat{x}, \hat{z}] - \mathbb{E}[Y \mid \hat{x}^*, \hat{z}].$$
Once again, this measure addresses a clear causal question: "How much would we expect the value of Y to change if we were to change X from x* to x, but kept Z at a fixed value z?"
While this is an intuitive notion of direct effect, it is important to note that it requires the intervention do(Z = z). Given that it may be of interest to know the direct effect that occurs when the mediating variable is not controlled for, Pearl defines the natural direct effect (NDE) as
$$NDE \triangleq \mathbb{E}[Y \mid \hat{x}, Z_{x^*}] - \mathbb{E}[Y \mid \hat{x}^*],$$
where $Z_{x^*}$ is the value Z would have taken had X been x*. Using this notion of simultaneously assigning a value to X and allowing Z to take the value it would under a different X, Pearl defines the natural indirect effect (NIE) as
$$NIE \triangleq \mathbb{E}[Y \mid \hat{x}^*, Z_{x}] - \mathbb{E}[Y \mid \hat{x}^*].$$
In words, the natural indirect effect represents the expected change in Y resulting from changing Z from the value it would take under x* to the value it takes under x while leaving X fixed at x*.
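As a rough illustration of these four quantities, the sketch below evaluates them in a hypothetical binary version of the birth control example (X = pill, Z = pregnancy, Y = thrombosis). All probabilities are invented, and the model is assumed to be unconfounded, so that $\mathbb{E}[Y \mid \hat{x}, Z_{x^*}]$ can be computed as $\sum_z p(z \mid x^*)\, \mathbb{E}[Y \mid x, z]$.

```python
# Hypothetical, unconfounded mediation model: X -> Z -> Y and X -> Y.
p_z1_given_x = {0: 0.20, 1: 0.02}      # p(Z = 1 | x), invented values

def e_y(x, z):
    """E[Y | x, z] = p(Y = 1 | x, z); an assumed additive form."""
    return 0.01 + 0.01 * x + 0.04 * z

def p_z(z, x):
    return p_z1_given_x[x] if z == 1 else 1.0 - p_z1_given_x[x]

def e_y_nested(x_for_y, x_for_z):
    """E[Y] when X is set to x_for_y but Z behaves as it would under x_for_z."""
    return sum(p_z(z, x_for_z) * e_y(x_for_y, z) for z in (0, 1))

x, x_star = 1, 0
baseline = e_y_nested(x_star, x_star)

te = e_y_nested(x, x) - baseline                      # total effect
cde = {z: e_y(x, z) - e_y(x_star, z) for z in (0, 1)}  # controlled direct effect per z
nde = e_y_nested(x, x_star) - baseline                # natural direct effect
nie = e_y_nested(x_star, x) - baseline                # natural indirect effect

print(f"TE = {te:.4f}, NDE = {nde:.4f}, NIE = {nie:.4f}, CDE = {cde}")
```

With these assumed numbers the direct effect is positive and the indirect effect is negative, mirroring the opposing mechanisms described by Hesslow.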

Information Theoretic Notions of Causal Influence
While there is a considerable and growing body of IT techniques for measuring causal influence, we here focus on information flow [3] and causal strength [21].

Information Flow
Drawing on the relationship between mutual information and statistical dependence, Ay and Polani [3] define an IT notion of causal independence, which, unlike mutual information, is directed. Their definitions rely heavily on the post-interventional distribution, which dictates a truncated factorization of a joint distribution in the presence of interventions. The information flow (IF) from X to Y is defined as:
$$I(X \to Y) \triangleq \sum_x p(x) \sum_y p(y \mid \hat{x}) \log \frac{p(y \mid \hat{x})}{\sum_{x'} p(x')\, p(y \mid \hat{x}')}.$$
If all the hats are removed from the above equation, then the standard mutual information is recovered. By using these post-interventional distributions, however, all "upstream" dependencies of X are ignored, and thus any relationship between X and Y resulting from confounding variables is removed. Ay and Polani also define a conditional version of IF. Using the mediation model in Figure 1, let V be some subset of remaining variables in the graph, i.e., V ⊆ U ∪ {Z}.
The IF from X to Y imposing V is given by:
$$I(X \to Y \mid \hat{V}) \triangleq \sum_v p(v) \sum_x p(x \mid \hat{v}) \sum_y p(y \mid \hat{x}, \hat{v}) \log \frac{p(y \mid \hat{x}, \hat{v})}{\sum_{x'} p(x' \mid \hat{v})\, p(y \mid \hat{x}', \hat{v})}.$$
Noting that V always appears as an intervention, the conditional IF can be interpreted as representing the IF from X to Y when the value of V is controlled. The IF can be extended to measure the flow to and from sets of nodes, though at present we only consider the flow from X to Y. IF is not to be confused with Marko and Massey's directed information [33,34,35] or Schreiber's transfer entropy [46], as these do not employ any notion of intervention and are only used in the context of time series.
Within the IF framework, we can treat I(X → Y) as a measure of the total effect of X on Y and I(X → Y | Ẑ) as a measure of controlled direct effect. While these measures are intuitively analogous to the measures in [38], it is difficult to formalize the nature of this analogy because we cannot formulate IF measures as the answer to a concise causal question similar to those of the previous section. Furthermore, because the conditional version of IF represents controlling a set of variables, IF offers no way to measure the natural direct and indirect effects proposed by Pearl.

Causal Strength
The causal strength (CS) measure proposed by Janzing et al. [21] takes a slightly different approach in that it measures the strength of specific edges in a DAG. We call this an "edge-centric" perspective, in contrast with the "node-centric" perspective used by IF. To motivate the definition of CS, the authors propose a collection of five postulates that they argue ought to be satisfied by measures of CS. Janzing et al. acknowledge that their postulates need not apply to all reasonable measures of causal influence; as such, any present criticisms of CS can be attributed to differences in the problem formulation. The postulates are briefly summarized here, and the reader is referred to [21] for more thorough definitions: (P0) If the CS of an arrow is zero, then that arrow should be able to be removed from the DAG without breaking the causal Markov condition. (P1) If the entire DAG is given by X → Y, then the CS is I(X; Y). (P2) The strength of an arrow X → Y should be defined locally, i.e., it should depend only upon the distributions $p(y \mid pa_Y)$ and $p(pa_Y)$. (P3) The CS of an arrow X → Y should be lower bounded by the conditional mutual information $I(X; Y \mid PA_Y \setminus \{X\})$. (P4) If the CS of a set of edges is zero, then the CS of all subsets of those edges should be zero.
Janzing et al. [21] proceed to propose a measure of CS that satisfies these postulates. Central to their CS measure is the post-cutting distribution. Formally, let $V = \{V_1, \dots, V_n\}$ be the nodes in a graph, S be a set of edges, $PA_j^S$ be the subset of parents $V_i$ of $V_j$ for which $V_i \to V_j \in S$, and $PA_j^{\bar{S}} = PA_j \setminus PA_j^S$. Then the post-cutting distribution is given by:
$$p_S(v_1, \dots, v_n) = \prod_{j=1}^{n} \sum_{pa_j^S} p\big(v_j \mid pa_j^{\bar{S}}, pa_j^S\big) \prod_{i \,:\, V_i \in PA_j^S} p(v_i).$$
The post-cutting distribution factorizes much like the joint distribution p; however, nodes at the receiving end of an edge in S are fed the marginal distribution of the node at the other end, rather than the actual value of that node. Using the post-cutting distribution, the CS of a set of edges S is then given by $C_S = D(p \,\|\, p_S)$, and thus provides a measure of how much excess information is needed to accommodate the severed edges.
Consider CS in the context of the mediation model in Figure 1, i.e., $D(p(X, Y, Z, U) \,\|\, p_S(X, Y, Z, U))$ for some set of edges S ⊆ {X → Y, X → Z, Z → Y}. Within the constraints of the CS framework, one might seek to measure the total, direct, and indirect effects as the strength of the edge sets {X → Y, X → Z, Z → Y}, {X → Y}, and {X → Z, Z → Y}, respectively. To see why this is insufficient, consider an extreme case of the birth control pill example above, where the indirect and direct effects of X on Y are perfectly complementary such that for all $x_1, x_2 \in \mathcal{X}$ and $y \in \mathcal{Y}$, $p(y \mid \hat{x}_1) = p(y \mid \hat{x}_2)$. Any reasonable measure of total effect will conclude that no value of X has an effect on Y; however, note that from postulate (P4), the total effect (as we have defined it in the CS framework) must be non-zero if either the direct or indirect effect is non-zero. A similar example can be constructed for the insufficiency of CS as a measure of indirect effects by having the effect of X on Z be canceled out by the effect of Z on Y. Finally, CS is similar to IF in that it does not yield a clear causal question for which it gives the answer. This is perhaps justified by the decision to define a set of formal postulates that are used to link the properties of CS with our intuitions. However, given that causal influences are likely to be measured in order to obtain a better understanding of the system under study, we find it to be of great practical use to pair causal measures with an easily interpretable causal question for which the measure provides an answer. We will now show that this can be achieved by defining a measure of causal effect of specific values of X.

Novel Information Theoretic Causal Measures
The observation that the MI I(X; Y) does not capture how different values of X may contain different amounts of information about Y has been made in a variety of contexts throughout the literature, including experimental design [29,7], neural stimulus response [8,48], information decomposition [51], measuring surprise [20], and most recently, distinguishing between information transfer and information copying [25]. Central to each of these works is the development of a notion of MI for a specific value of X, i.e., I(x; Y). There is, however, no inherent I(x; Y) implied by the definition of I(X; Y). To see this, we use the notation of [8] and provide two candidate definitions of I(x; Y) based on the two definitions of I(X; Y) provided in Section 2.1:
$$I_1(x; Y) \triangleq D\big(p(Y \mid x) \,\|\, p(Y)\big), \qquad I_2(x; Y) \triangleq H(Y) - H(Y \mid X = x).$$
It is well understood that, in general, $I_1(x; Y) \neq I_2(x; Y)$. This is clear to see by simply noting that, for any joint distribution X, Y ∼ p, $I_1(x; Y) \geq 0$ for all x, whereas it is possible to have $I_2(x; Y) < 0$. In words, the knowledge of a specific value of X will only provide us with a more accurate distribution of Y ($I_1 \geq 0$), though it is possible for this distribution to have a greater entropy than the marginal distribution ($I_2 < 0$). We here use $I_1$ as a foundation for establishing value specific measures of causal influence, and, using the terminology of [25], refer to it as the specific mutual information (SMI). Building upon this language in the present context, we refer to the quantities measured by the proposed methods as specific causal effects. To our knowledge, the use of SMI in the context of quantifying causal influence is novel. As such, we begin with an informal discussion around the use of SMI for the quantification of causal influence in two-node DAGs, followed by a formal definition of various specific causal effects in a mediation model.

Specific Mutual Information in Two-Node DAGs
Consider a DAG X → Y with joint distribution over nodes X, Y ∼ p, and for the sake of exposition, assume there are no confounding variables. In this simple scenario, when considering the effect of X on Y, we can freely exchange interventions for observations (assuming we only consider x such that p(x) > 0), and thus the ACE of x with respect to baseline x* is given by $\mathbb{E}[Y \mid x] - \mathbb{E}[Y \mid x^*]$. Once again, this addresses the question of how much the value of Y is expected to change as a result of switching from x* to x. With regard to the CS and IF methods discussed above, both would quantify the effect of X on Y as I(X; Y). Consider the SMI $I_1(x; Y)$ as a measure of the specific causal influence of x upon Y and note the following:
(I) $\mathbb{E}[I_1(X; Y)] = \sum_x p(x)\, I_1(x; Y) = I(X; Y)$, where the expectation is taken with respect to X. As such, we can think of the specific causal effect as a random variable, whose expectation is the mutual information. In doing so, we are able to capture that different values of X may have different magnitudes of causal effect on Y, with each of those effects occurring with some probability according to p(x). Moreover, this makes clear that the perspective adopted here is consistent with that of other IT measures.
(II) $I_1(x; Y)$ is non-negative for all $x \in \mathcal{X}$. Whereas a negative ACE has the clear interpretation of x causing a decrease in the expected value of Y, we are measuring influences that x has on the distribution of Y. Given that there is no obvious notion of a (potentially negative) difference between distributions, we utilize a definition that results in all causal effects having non-negative magnitude. This serves as a partial justification for using $I_1$, rather than $I_2$, as a foundation.
(III) The SMI does not require specifying a reference value x*. Instead, we can view SMI as measuring the causal effect of x as compared with the X that would have occurred naturally. This suggests an intuition for the appearance of IT measures of causal influences in complex natural networks: values of X that are seen as changing the course of nature will be assigned a large causal influence. Given that we can (in this setting) exchange observation for intervention, we can view the SMI as comparing the effect of an intervention x with a random (i.e., non-atomic) intervention X with X ∼ p (see [39,41] for discussions on random interventions).
(IV) The SMI addresses a very clear causal question: "How much different would we expect the distribution of Y to be if, instead of forcing X to take the value x, we let X take on a value naturally?" Stated more compactly: "How much would we expect performing the intervention do(X = x) to change the course of nature for Y?" We can interpret the SMI as comparing a ground truth distribution of Y conditioned on x, $p(Y \mid x)$, with a counterfactual distribution wherein nature was allowed to run its course, $p(Y)$. This works well with the interpretation of the KL-divergence as a measure of excess bits resulting from encoding Y using the distribution that is not the true distribution from which Y is sampled. The use of the KL-divergence is further justified in this context by the fact that the logarithmic loss is unique in its ability to capture the benefit of conditioning on X in the prediction of Y [22].
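(V) Finally, we note that $I_1(x; Y) = 0$ if and only if $p(y \mid x) = p(y)$ for all y for which $p(y \mid x) > 0$. By contrast, it is possible to have $I_2(x; Y) = 0$ and $p(y \mid x) \neq p(y)$. The following example illustrates why this is undesirable: suppose the marginal distribution of Y is Bern(2/10) and the distribution of Y conditioned on X = 1 is Bern(8/10). Because these two distributions have equal entropy, $I_2(X = 1; Y) = 0$ even though conditioning on X = 1 changes the distribution of Y substantially. On the other hand, we have $I_1(X = 1; Y) = D(8/10 \,\|\, 2/10) = 1.2$ bits. This exemplifies how simply measuring differences in entropy is insufficient for capturing causal influences.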
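The properties listed above can be checked on a small example. The sketch below constructs a hypothetical joint distribution (the marginal of X and the distribution of Y given X = 0 are assumptions chosen so that Y is marginally Bern(2/10) and Y given X = 1 is Bern(8/10)) and evaluates both candidate definitions.

```python
import math

def entropy_bits(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_bits(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Assumed model: p(X = 1) = 0.1, p(Y = 1 | X = 1) = 0.8, and p(Y = 1 | X = 0)
# chosen so that the marginal p(Y = 1) equals 0.2.
p_x = [0.9, 0.1]
p_y1_given_x = [0.12 / 0.9, 0.8]
p_y_given_x = [[1.0 - a, a] for a in p_y1_given_x]      # rows: [p(Y=0|x), p(Y=1|x)]
p_y = [sum(p_x[x] * p_y_given_x[x][y] for x in range(2)) for y in range(2)]

def i1(x):
    """Specific mutual information: D(p(Y | x) || p(Y))."""
    return kl_bits(p_y_given_x[x], p_y)

def i2(x):
    """Entropy-difference candidate: H(Y) - H(Y | X = x)."""
    return entropy_bits(p_y) - entropy_bits(p_y_given_x[x])

mi = entropy_bits(p_y) - sum(p_x[x] * entropy_bits(p_y_given_x[x]) for x in range(2))
expected_i1 = sum(p_x[x] * i1(x) for x in range(2))

print(f"I1(X=1;Y) = {i1(1):.3f} bits, I2(X=1;Y) = {i2(1):.3f} bits")
print(f"E[I1] = {expected_i1:.4f}, I(X;Y) = {mi:.4f}")
```

In this construction $I_2$ assigns zero effect to X = 1 because Bern(2/10) and Bern(8/10) have the same entropy, while $I_1$ registers the change in distribution; the expectation of $I_1$ recovers the mutual information.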
We conclude this section by returning to the lottery example discussed in the introduction, recalling that X ∈ {0, 1} represents whether or not an individual has won the lottery and Y ∈ R represents that individual's average spending. Here we have p(X = 0) ≈ 1 and thus $p(y) = \sum_x p(y \mid x)\, p(x) \approx p(y \mid X = 0)$. As such, the specific causal effect of losing the lottery is $I_1(X = 0; Y) \approx 0$. By contrast, the distribution of Y conditioned on winning the lottery will be significantly different from the marginal distribution of Y, resulting in $I_1(X = 1; Y) \gg 0$. Framed in terms of the causal question discussed above in (IV), we would expect forcing someone to win the lottery to change the course of nature much more than forcing someone to not win the lottery.

Specific Causal Effects in the Mediation Model
Following the process of [38], we here formalize a series of definitions of total/direct/indirect causal influences from an information theoretic perspective. When leaving the comfort of the unconfounded two-node DAG, it is necessary to incorporate the interventions in the definition of the causal measures:
Definition 1. The specific total effect of x on Y is defined as:
$$STE(x \to Y) \triangleq D\Big(p(Y \mid \hat{x}) \,\Big\|\, \sum_{x'} p(x')\, p(Y \mid \hat{x}')\Big).$$
With the exception of the interventional notation, the STE is equivalent to the SMI. Note that for a DAG given by X → Y, we will have $STE(x \to Y) = I_1(x; Y)$ and $STE(y \to X) = 0$, where STE(y → X) represents the specific total effect of y on X.
Next we define the specific controlled direct effect (SCDE) of x on Y. Given that computing the controlled direct effect must be done by means of intervention on Z, we define the SCDE with respect to a specific value z, as it is unclear what distribution over Z should be used if the definition were to take an expectation over all possible values of z (see Theorem 2).
Definition 2. The specific controlled direct effect of x on Y with mediator z is defined as:
$$SCDE(x \to Y \mid \hat{z}) \triangleq D\Big(p(Y \mid \hat{x}, \hat{z}) \,\Big\|\, \sum_{x'} p(x')\, p(Y \mid \hat{x}', \hat{z})\Big).$$
The SCDE measures how much we would expect performing the intervention do(X = x) to change the course of nature given that Z is held fixed at z. As mentioned in Section 2.2, computing the controlled direct effect involves intervening upon the mediating variable Z, and thus does not convey the direct effect that occurs naturally from fixing a value of X.
Next, the specific natural direct effect measures the direct effect of x on Y that occurs naturally when the mediator is not controlled for:
Definition 3. The specific natural direct effect of x on Y is defined as:
$$SNDE(x \to Y) \triangleq D\Big(p(Y \mid \hat{x}) \,\Big\|\, \sum_{x', z'} p(x')\, p(z' \mid \hat{x})\, p(Y \mid \hat{x}', \hat{z}')\Big).$$
It is helpful to dissect the two distributions of Y considered by the SNDE. Expanding the first argument as $\sum_{z'} p(z' \mid \hat{x})\, p(Y \mid \hat{x}, \hat{z}')$, both distributions are given by a weighted combination of the distribution of Y conditioned upon different values of Z. In both cases, these values of Z are weighted by the probability with which they would occur under the intervention x. For the intervened values of X used to evaluate the probability of Y, however, the first distribution uses the "ground truth" value x, whereas the second uses the "naturally occurring" x', weighted according to p(x'). Using the same logic, we can define a specific natural indirect effect:
Definition 4. The specific natural indirect effect of x on Y is defined as:
$$SNIE(x \to Y) \triangleq D\Big(p(Y \mid \hat{x}) \,\Big\|\, \sum_{x', z'} p(x')\, p(z' \mid \hat{x}')\, p(Y \mid \hat{x}, \hat{z}')\Big).$$
Conducting a similar dissection, we see that the roles of x and x' are swapped from the SNDE: the "ground truth" x is used to evaluate the probability of Y, while the naturally occurring x' is used to weight different values z'. As such, the only difference between the first and second arguments of the SNIE is how the value of the mediating Z is determined, resulting in a measurement of the indirect effect of x on Y.
Unfortunately, the proposed definitions of SNDE and SNIE yield no obvious inequalities with respect to the STE (for example, the STE is not guaranteed to upper bound either of them). While this is initially unintuitive, it can be justified by the decision to have all causal influences be assigned a non-negative magnitude. As such, we would expect that contradictory indirect and direct effects could individually have a large magnitude while still resulting in a total effect of zero (as in the example in Section 2.2).
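The cancellation scenario can be illustrated with a small sketch. It assumes an unconfounded binary mediation model (so that every interventional distribution equals the corresponding conditional distribution) with invented parameters in which Z copies X and the direct and mediated influences on Y offset one another exactly.

```python
import math

def kl_bits(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

VALS = (0, 1)
p_x = [0.5, 0.5]                       # assumed marginal of X

def p_z(z, x):
    """p(z | x): in this hypothetical model Z deterministically copies X."""
    return 1.0 if z == x else 0.0

def p_y(y, x, z):
    """p(y | x, z): X raises and Z lowers the chance of Y = 1 by the same amount."""
    p1 = 0.5 + 0.3 * (x - z)
    return p1 if y == 1 else 1.0 - p1

def p_y_do_x(x):
    """p(Y | do(x)) = sum_z p(z | x) p(Y | x, z) in the unconfounded model."""
    return [sum(p_z(z, x) * p_y(y, x, z) for z in VALS) for y in VALS]

def ste(x):
    ref = [sum(p_x[xp] * p_y_do_x(xp)[y] for xp in VALS) for y in VALS]
    return kl_bits(p_y_do_x(x), ref)

def snde(x):
    # Mediator weighted as under do(x); outcome evaluated at naturally occurring x'.
    ref = [sum(p_x[xp] * p_z(zp, x) * p_y(y, xp, zp) for xp in VALS for zp in VALS)
           for y in VALS]
    return kl_bits(p_y_do_x(x), ref)

def snie(x):
    # Outcome evaluated at x; mediator weighted as under naturally occurring x'.
    ref = [sum(p_x[xp] * p_z(zp, xp) * p_y(y, x, zp) for xp in VALS for zp in VALS)
           for y in VALS]
    return kl_bits(p_y_do_x(x), ref)

print(f"STE = {ste(1):.4f}, SNDE = {snde(1):.4f}, SNIE = {snie(1):.4f} bits")
```

Under these assumptions the STE of x = 1 is zero while both the SNDE and the SNIE are strictly positive, so neither is bounded above by the STE.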

Equivalence Relations
We now analyze the relationship between the proposed specific measures and IF/CS.
Theorem 1. The expected STE is equivalent to the information flow, i.e.,
$$\mathbb{E}[STE(X \to Y)] = \sum_x p(x)\, STE(x \to Y) = I(X \to Y),$$
where the expectation is taken with respect to the marginal distribution over X.
A proof is found in Appendix C.1. The above theorem shows that the expected STE recovers the standard (unconditional) IF from X to Y. Notably, the expected STE is not equivalent to the CS associated with any subset of the arrows in the graph. Next, we show that both IF and CS provide a notion of expected SCDE:
Theorem 2. The conditional IF is given by the expected value of the SCDE taken with respect to the marginal distributions of X and Z:
$$I(X \to Y \mid \hat{Z}) = \sum_{x, z} p(x)\, p(z)\, SCDE(x \to Y \mid \hat{z}).$$
Furthermore, if the DAG consists of only X, Y, and Z (i.e., U = ∅), then the CS of X → Y is given by the expected value of the SCDE taken with respect to the joint distribution of X and Z:
$$C_{X \to Y} = \sum_{x, z} p(x, z)\, SCDE(x \to Y \mid \hat{z}).$$
A proof is found in Appendix C.2. This theorem clarifies the point made earlier with regard to the value of a measure of natural direct effect. In particular, when taking an average with respect to possible control values for the mediator Z, it is not clear what distribution over Z should be used.
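Both theorems can be sanity-checked numerically in the simplest setting. The sketch below assumes a binary mediation model with U = ∅ and invented parameters, so that interventional distributions reduce to conditionals. The IF and conditional IF are computed as entropy differences and the CS as $D(p \,\|\, p_S)$ over the joint, independently of the STE and SCDE functions they are compared against.

```python
import math

def entropy_bits(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_bits(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

V = (0, 1)
p_x = [0.3, 0.7]                                    # assumed p(x)
p_z_x = [[0.6, 0.4], [0.1, 0.9]]                    # p_z_x[x][z] = p(z | x)
p_y1 = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9}   # p(Y = 1 | x, z)

def p_y_xz(x, z):
    return [1.0 - p_y1[(x, z)], p_y1[(x, z)]]

def p_y_do_x(x):
    return [sum(p_z_x[x][z] * p_y_xz(x, z)[y] for z in V) for y in V]

def mix_over_x(z):
    """sum_{x'} p(x') p(Y | x', z): the reference distribution of the SCDE."""
    return [sum(p_x[xp] * p_y_xz(xp, z)[y] for xp in V) for y in V]

def ste(x):
    ref = [sum(p_x[xp] * p_y_do_x(xp)[y] for xp in V) for y in V]
    return kl_bits(p_y_do_x(x), ref)

def scde(x, z):
    return kl_bits(p_y_xz(x, z), mix_over_x(z))

# Theorem 1: expected STE versus IF written as an entropy difference.
p_y_mix = [sum(p_x[x] * p_y_do_x(x)[y] for x in V) for y in V]
info_flow = entropy_bits(p_y_mix) - sum(p_x[x] * entropy_bits(p_y_do_x(x)) for x in V)
expected_ste = sum(p_x[x] * ste(x) for x in V)

# Theorem 2, first part: conditional IF versus SCDE averaged over p(x) p(z).
p_z = [sum(p_x[x] * p_z_x[x][z] for x in V) for z in V]
cond_if = sum(
    p_z[z] * (entropy_bits(mix_over_x(z))
              - sum(p_x[x] * entropy_bits(p_y_xz(x, z)) for x in V))
    for z in V)
scde_marginals = sum(p_x[x] * p_z[z] * scde(x, z) for x in V for z in V)

# Theorem 2, second part: CS of X -> Y as D(p || p_S) versus SCDE averaged over p(x, z).
cs = 0.0
for x in V:
    for z in V:
        for y in V:
            p_joint = p_x[x] * p_z_x[x][z] * p_y_xz(x, z)[y]
            p_cut = p_x[x] * p_z_x[x][z] * mix_over_x(z)[y]
            cs += p_joint * math.log2(p_joint / p_cut)
scde_joint = sum(p_x[x] * p_z_x[x][z] * scde(x, z) for x in V for z in V)

print(info_flow, expected_ste, cond_if, scde_marginals, cs, scde_joint)
```

The two averages of the SCDE generally differ, which reflects the ambiguity about which distribution over Z to use.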

Conditional Specific Influences
Even though the above causal measures are defined for specific values of X, they provide a notion of average causal influence in that they are implicitly averaging over all possible covariates U. Given that different values of u may significantly affect the nature of the relationship between x and Y, we define conditional versions of the above definitions for a specific value U = u. We here consider the general case where only a subset of the covariates Ũ ⊆ U are observed:
Definition 5. The conditional STE of x on Y given ũ is defined as:
$$STE(x \to Y \mid \tilde{u}) \triangleq D\Big(p(Y \mid \hat{x}, \tilde{u}) \,\Big\|\, \sum_{x'} p(x' \mid \tilde{u})\, p(Y \mid \hat{x}', \tilde{u})\Big).$$
For the special case where we can observe all relevant covariates, i.e., Ũ = U, the conditional STE can be simplified as:
$$STE(x \to Y \mid u) = D\Big(p(Y \mid x, u) \,\Big\|\, \sum_{x'} p(x' \mid u_X)\, p(Y \mid x', u)\Big).$$
This definition violates the locality postulate (P2) of Janzing et al. [21] in that the causal effect of x on Y may be dependent upon how X is affected by its own parents. Allowing this is, however, consistent with the perspective that IT measures quantify the deviance from the course of nature in that the value u dictates the current natural state (see Section 4.1). Nevertheless, the terms $p(x' \mid \tilde{u})$ and $p(x' \mid u_X)$ can be replaced with $p(x')$ if one wishes to remain faithful to the locality postulate (though not explored presently, this would provide us with a notion of specific causal strength). The conditional versions of SCDE, SNDE, and SNIE follow very similar logic to that of the STE, and are defined in Appendix B.

Identifiability
When U is partially observable or unobservable, the nature of the dependence relationships between $U_X$, $U_Y$, and $U_Z$ will dictate the ability to estimate the proposed causal measures from observational data; more specifically, the ability to determine the interventional distributions given only estimated conditional distributions. This is crucially important given that performing interventions in many complex natural systems is infeasible. The following theorem uses the d-separation criterion [37,27] to identify when the conditional specific measures can be estimated in the partially observable setting where only Ũ ⊂ U can be observed:
Theorem 3. Consider a dataset containing observations of X, Y, Z, and partially observable covariates Ũ ⊆ U. Then, the conditional STE, SNDE, and SNIE are non-experimentally identifiable if there exist $\tilde{U}_1, \tilde{U}_2 \subseteq \tilde{U}$ such that the following two conditions hold: […], where $G_{\underline{X}}$ represents the DAG with all outgoing arrows from X removed, and […].
The proof uses a direct application of Pearl's do-calculus [39, Theorem 3.4.1], and is provided in Appendix C.3.
By letting Ũ = ∅, identifiability conditions for the specific causal effects of Section 3.2 are obtained. Similarly, the theorem provides the corollary that the setting specific causal effects may be estimated from observational data when U is fully observable. It is important to note that the above theorem assumes that each conditional distribution can be sufficiently well estimated. Indeed, the "increased resolution" of the proposed measures comes at a cost in that reliable estimation of the proposed measures poses challenges for values of X that occur infrequently. Consider, for example, estimating the second argument of the KL-divergence defining the SNDE in (8), namely $p(y \mid x', z')$. Given that there is a sum over x' and z', it is necessary to know this distribution for every pair (x', z'). Thus, when p(x', z') is very small, a significant amount of data will be required to estimate $p(y \mid x', z')$ (and therefore the SNDE) reliably.

Normalized Specific Total Effect
The opacity of measuring causal influences in bits can be addressed by identifying a normalization procedure.
Definition 6. The normalized conditional STE of x on Y conditioned on ũ is defined as:
$$\overline{\mathrm{STE}}(x \to Y \mid \tilde{u}) = \frac{\mathrm{STE}(x \to Y \mid \tilde{u})}{\mathrm{STE}(x \to Y \mid \tilde{u}) + H(Y \mid do(X = x), \tilde{U} = \tilde{u})}$$
The normalized versions of the other specific causal measures can be found in Appendix D. For the sake of exposition, suppose Ũ = ∅ and recall the data compression interpretation of STE(x → Y) as the excess number of bits used to encode Y under the assumption that X occurs naturally when we have in fact forced X = x by means of an intervention.
Noting that H(Y | do(X = x)) represents the number of bits required to encode Y when we have (knowingly) forced X = x, the denominator of the normalized STE gives the total number of bits used to encode Y under the incorrect assumption of a naturally occurring X. As such, the normalized STE represents the fraction of bits used to encode Y under the assumption that X occurred naturally that are unnecessary when performing the intervention do(X = x).
As a result of the non-negativity of entropy and the KL-divergence, the normalized STE is bounded between zero and one. Interpreting the normalized STE is facilitated by considering the scenarios that yield the extremal values. First, the normalized STE is zero if and only if the STE is zero, which is to say that p(y | x, ũ) = p(y | ũ) for all y for which p(y | x, ũ) > 0.
More interestingly, the normalized STE is one if and only if the STE is greater than zero and H(Y | do(X = x), Ũ = ũ) = 0. As such, the normalized STE being equal to one represents x having a maximal causal effect on Y in the sense that performing the intervention do(X = x) determines the value of Y with 100 percent certainty. It should be emphasized that, like the unnormalized measures, this notion of maximal causal effect applies strictly in a distributional sense and says nothing of the direction or magnitude of the causal effect with respect to the units of Y. For example, if performing do(X = x) results in Y = E[Y] with probability one, then H(Y | do(X = x)) = 0 and we would conclude that x has a maximal effect on Y even though x causes Y to take the value it is expected to take absent an intervention.
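The extremal cases discussed above can be checked numerically. The following is a minimal sketch, assuming the normalization takes the form STE / (STE + H(Y | do(X = x))), which is what the description of the denominator suggests, and assuming Ũ = ∅. The function names, and the choice to return one when the STE is infinite, are illustrative assumptions rather than part of the proposed framework.

```python
import math

def entropy_bits(p):
    """Shannon entropy (in bits) of a discrete distribution given as a list."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def kl_bits(p, q):
    """KL divergence D(p || q) in bits; infinite if q(y) = 0 where p(y) > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total

def normalized_ste(p_do, p_nat):
    """Assumed form: STE / (STE + H(Y | do(X = x))).

    p_do  -- distribution of Y under the intervention do(X = x)
    p_nat -- distribution of Y when X occurs naturally
    """
    ste = kl_bits(p_do, p_nat)
    if ste == 0:
        return 0.0
    if math.isinf(ste):
        return 1.0  # limiting value; a convention assumed for this sketch
    return ste / (ste + entropy_bits(p_do))
```

For instance, `normalized_ste([1.0, 0.0], [0.5, 0.5])` evaluates to one because the intervention removes all uncertainty in Y, whereas identical distributions yield zero.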

Examples
We now present three examples of notions of causal influence that are uniquely identified by the specific causal measures.

Chain Reaction
For the first example consider a simple chain X → Z → Y. This can be thought of as a simplified version of the example proposed by Ay and Polani [3] and modified to include noise by Janzing et al. [21, Example 7]. We will consider the simplest case of this example where a binary message is being passed from X to Z to Y, with the message being flipped by Z and Y with probability ε. We will interpret each variable as representing the message it passes on, i.e. X = 1 means "X passes the message 1 to Z." Formally, let X, Y, Z ∈ {0, 1} with X ∼ Bern(0.5): where ⊕ is the XOR operation.
Focusing first on the effect of x on Y, we note that because the only path from X to Y is the one through Z, the direct effect is zero and the total and indirect effects are equal. Noting that , the total effect is the same for both x ∈ {0, 1} and is given by: Thus, as the probability of flipping the message approaches zero, Y will be deterministically linked to X, and X resolves the entire one bit of uncertainty associated with Y. Now consider the conditional STE of z on Y for a particular x. We can compute this by comparing the distributions p(y | x, ẑ) = p(y | z) and p(y | x). Given the symmetry of the problem, this will take one of two values depending on whether or not x and z are equal: To understand this result, fix ε to be an arbitrarily small number such that Z will pass on its received message with high probability. Thus, when x = z, it is, in a sense, unreasonable to endow Z with responsibility for causing the value taken by Y when it is propagating the message in a nearly deterministic manner. In such a case, it is not so much Z that is causing Y, but rather X that initiated a chain reaction. On the other hand, in the unlikely occurrence that x ≠ z, we have that Z does have a causal effect on Y. This scenario can be thought of as Z acting of its own volition in selecting a message to pass to Y.
We acknowledge that the notion of an unbounded causal influence is initially unsettling. When looking closer, however, this property is intuitive. First, we note that for any fixed ε > 0, the STE will be finite. It is only for ε = 0 that the STE could be infinite, but in that case, the setting that results in infinite influence happens with probability zero. Thus, in general, an infinite influence could only be achieved through intervention. Furthermore, such an intervention would have to assign a value to a cause that occurs with probability zero, and that cause would in turn have to enable an otherwise impossible effect to have non-zero probability.
As mentioned in Section 3.4, this conditional formulation violates the locality postulate (P2) of Janzing et al. [21] in that the effect of z depends on the value of its own parent, x. We do not claim that the perspective taken here is "correct," but merely point out that there exist justifications for considering the value of a cause's parent in evaluating the causal effect.
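The behavior of the chain can be explored numerically. The sketch below assumes that each of Z and Y copies its parent and flips the message independently with probability ε (written `eps`), so that p(y | z) places mass 1 − ε on y = z and p(y | x) places mass (1 − ε)² + ε² on y = x. The function names are hypothetical, and the STE is taken to be the KL-divergence between the two distributions being compared, as described in the text.

```python
import math

def kl_bits(p, q):
    """KL divergence D(p || q) in bits; infinite if q(y) = 0 where p(y) > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total

def chain_conditionals(eps):
    """Return p(y | z) and p(y | x) for the assumed noisy chain X -> Z -> Y."""
    # Y equals Z unless the message is flipped (probability eps).
    p_y_given_z = {z: [1 - eps if y == z else eps for y in (0, 1)] for z in (0, 1)}
    # Y equals X if neither or both of the two hops flip the message.
    same = (1 - eps) ** 2 + eps ** 2
    p_y_given_x = {x: [same if y == x else 1 - same for y in (0, 1)] for x in (0, 1)}
    return p_y_given_z, p_y_given_x

def ste_x_on_y(x, eps):
    """Total effect of x on Y: compare p(y | x) with the uniform marginal p(y)."""
    _, p_y_given_x = chain_conditionals(eps)
    return kl_bits(p_y_given_x[x], [0.5, 0.5])

def cond_ste_z_on_y(z, x, eps):
    """Conditional effect of z on Y given x: compare p(y | z) with p(y | x)."""
    p_y_given_z, p_y_given_x = chain_conditionals(eps)
    return kl_bits(p_y_given_z[z], p_y_given_x[x])
```

Under these assumptions, `cond_ste_z_on_y(1, 0, eps)` grows without bound as `eps` shrinks, while `cond_ste_z_on_y(0, 0, eps)` shrinks toward zero, matching the chain reaction interpretation.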

Caused Uncertainty
Consider a 3-node DAG characterized by the connections X → Y ← Z and the following (conditional) distributions: Given that X and Z are both parentless, we can treat interventions on X and Z as observations, and the CS, conditional IF, and conditional mutual information (CMI) are equivalent. In particular, we have that I(Y; Z | X) = H(Y | X) − H(Y | X, Z) provides us with the interpretation of CMI as the reduction in uncertainty of Y resulting from the added conditioning of Z, which will always be non-negative.
Next we consider STE(x → Y | z) and STE(z → Y | x) for (x, z) ∈ {0, 1}². Given the symmetry of the problem with respect to X, we only need to consider two of the four possible values of (X, Z), namely (x_0, z_0) ≜ (0, 0) and (x_0, z_1) ≜ (0, 1). In order to compute the STE for each X and Z to Y in either case, we need the following distributions: For a given (x, z), the STE is given by: The results presented above are intuitive: when z = 0, then the value taken by Y is largely determined by X, and the knowledge that z = 0 tells us very little about the distribution of Y. On the other hand, when z = 1, X has no bearing on the value taken by Y. Thus, in this scenario, it is the value taken by Z that has caused the shift in the distribution of Y, even though Z provides no information with regard to the particular value taken by Y. In this sense, we can think of Z as causing uncertainty in Y. This scenario makes particularly clear why it makes sense to condition on the cause but take an expectation with respect to the effect: no outcome y could be attributed to being a result of z = 1, despite the clear influence that such an event has on the distribution of Y.

Shared Responsibility
Consider a scenario where a collection of n iid variables X_i ∼ Bern(ε) collectively influence a single outcome Y, i.e. X_i → Y for i = 1, ..., n. For a given context {x_i}, i = 1, ..., n, let k be the number of x_i that are one, i.e. k = Σ_i x_i. Then let Y be distributed as Y | X_1, ..., X_n ∼ Bern((1/2)^K), where K = Σ_i X_i is a random variable. One interpretation of this example is that each X_i is a potential inhibitor of Y. As more inhibitors become activated (i.e. as k grows), the effect of adding another inhibitor diminishes. Since the value taken by K depends on the values taken by each X_i, a measure that averages with respect to X_i will not capture this change in causal effect that results for different values of k. As with the previous example, the CS, conditional IF, and CMI are equivalent for this problem setting. While there is no simple computation for these measures as a function of ε and n, there are a couple of key points. First, the influence of each of the variables X_i on Y is the same, i.e. I(X_i; Y | X_1, ..., X_{i−1}, X_{i+1}, ..., X_n) = I(X_1; Y | X_2, ..., X_n) for all i = 1, ..., n. Second, as n → ∞, the probability of Y = 1 goes to zero, and as ε → 0, the probability of Y = 1 goes to one. In either of the limits, the entropy of Y goes to zero and thus so does the causal influence of each X_i as measured by either CMI, conditional IF, or CS.
Now consider a realization {x_i}, i = 1, ..., n, and the corresponding STE(x_1 → Y | x_2, ..., x_n). While the influence of each x_i on Y will not be the same for a given realization, the symmetry of the problem is such that the computation will be performed in the same manner for each x_i. Letting k_1 ≜ Σ_{i=2}^{n} x_i be the number of ones excluding x_1, define the following distributions: Then, for a given realization, the STE is a function of x_1 and k_1: In interpreting these results, first assume that ε is small, meaning that for each of the inhibitors, it is unlikely that it will be activated. As a result of this assumption, we have STE(X_1 = 1 → Y | k_1) > STE(X_1 = 0 → Y | k_1), i.e. an inhibitor has a greater influence when it is activated. More interestingly, note that STE(x_1 → Y | k_1) is strictly decreasing in k_1. This is consistent with the intuition provided above, namely that if a large number of inhibitors are active, then they share responsibility and the influence of any single one is negligible. On the other hand, if only one is activated (i.e. (x_1, k_1) = (1, 0)), then in the limit of ε → 0, its influence will approach infinity (and its normalized influence will approach one).
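The dependence of the effect on x_1 and k_1 can be illustrated with a short computation. This sketch assumes that, because every X_i is parentless, the conditional STE reduces to the KL-divergence between p(y | x_1, k_1) = Bern((1/2)^(x_1 + k_1)) and the mixture p(y | k_1) obtained by averaging over X_1 ∼ Bern(ε). The function names are hypothetical.

```python
import math

def kl_bits(p, q):
    """KL divergence D(p || q) in bits; infinite if q(y) = 0 where p(y) > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total

def bern(p_one):
    """Bernoulli distribution as [P(Y = 0), P(Y = 1)]."""
    return [1 - p_one, p_one]

def cond_ste_inhibitor(x1, k1, eps):
    """Assumed form of STE(x1 -> Y | k1) for the inhibitor example."""
    # Distribution of Y once x1 is fixed and k1 other inhibitors are active.
    p_given_x1 = bern(0.5 ** (x1 + k1))
    # Distribution of Y when X1 is left to occur naturally (active w.p. eps).
    p_natural = bern(eps * 0.5 ** (k1 + 1) + (1 - eps) * 0.5 ** k1)
    return kl_bits(p_given_x1, p_natural)
```

Evaluating `cond_ste_inhibitor(1, k1, 0.01)` for k1 = 0, 1, 2, 3 gives a decreasing sequence, in line with the shared responsibility interpretation.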

Case Study - Effect of El Niño-Southern Oscillation on Pacific Northwest Temperature Anomalies
We now present an application of the proposed framework to measuring the specific causal influences of the El Niño-Southern Oscillation (ENSO) on the temperature anomaly signal in the North American Pacific Northwest (PNW, latitude: 47°N, longitude: 240°E). For our purposes, ENSO is characterized by the sea surface temperature in the Niño 3.4 region located in the equatorial Pacific (latitude: 5°S to 5°N, longitude: 170°W to 120°W). The ENSO signal is typically understood as being in one of three phases (or states): a neutral phase (we will refer to this as E = 0) gives rise to a precipitation region centered near longitude 160°E (Figure 2B), the El Niño phase (E = 1) gives rise to an eastward shifted precipitation region (∼170°W, Figure 2C), and the La Niña phase (E = −1) gives rise to a westward shifted precipitation region (∼150°E, Figure 2A) [1,13]. Niño and Niña phases can occur with varying intensities during the winter months with a typical return period of two to seven years [28]. When a Niño or Niña phase occurs, the shifted precipitation signal produces large scale atmospheric Rossby waves (waves in the upper level atmospheric pressure field) that influence North American land temperatures, predominantly through the well studied Pacific North American teleconnection pattern (PNA) [5,49]. PNA affects North American land temperatures through the advection of warm marine air during a Niño phase and cool polar air during a Niña phase [52,17]. We here use the proposed framework to quantify the causal effect of this teleconnection, focusing specifically on the temperature in the PNW.
This application is a particularly good fit for the proposed analysis for a number of reasons. First, by utilizing a collection of simulation model runs, an immense amount of data can be obtained. Second, domain expertise can be leveraged to construct causal DAGs prior to performing analysis. For example, it is well known that the ENSO signal influences temperature as opposed to the temperature influencing ENSO. Third, there are well-accepted methods for detrending signals, and these methods can be used to control for possible confounding effects. Fourth, it is to be expected that different phases of the ENSO signal will, in some sense, give rise to larger causal effects than other phases [2]. The proposed framework can be used to quantify these differences in a formal sense.

Data and Preprocessing
The analyzed dataset is composed of nine scientifically validated historical CMIP6 runs [11] of the National Center for Atmospheric Research's Community Earth System Model, version 2 (CESM2) [12]. This is the gold standard US climate model. Each model run provides an array of daily temperature values spanning the years 1850 to 2015 from which we can compute the ENSO 3.4 index (as in [4]) and directly obtain the PNW two-meter temperature. Each of the model runs provides an independent realization of possible evolutions of temperatures that obey the underlying dynamic and thermodynamic equations as encoded by the model. It is important to clarify that the model is not intended for prediction, but rather gives possible atmospheric states for a given set of initial conditions and constraints determined by the selected time period (i.e. CO2 forcing, solar/lunar cycles, etc.). Both the ENSO index and PNW two-meter temperature signals have the mean and the leading six harmonics of the annual cycle removed, leaving only the anomalous components of the signal. As this is standard practice in the analysis of climate data (e.g. [26]), we henceforth strictly consider anomaly signals.
A portion of a single model run of the ENSO index is shown in Figure 3, along with a plus/minus one degree threshold for determining the quantized ENSO phase. It is clear that the ENSO signal does not reliably alternate between E = 1 and E = −1 with a constant period. As a result of ENSO cold-season phase locking [36], the ENSO signal is strongest in or near January (marked by vertical grid lines). As such, we limit our focus to the months of January, February, and March, as it is not interesting to measure the effect of the ENSO signal in the months where it is not present. We further simplify the problem by quantizing the ENSO index on an annual timescale, i.e. we assign a single value to E ∈ {−1, 0, 1} for January-March of a given year based on the ENSO index value on January 1st of that year. Given that we are estimating the effect of ENSO on temperature, we similarly consider the temperature signal only during the months of January, February, and March. Rather than attempting to assess the effect of ENSO on daily temperature anomalies, we choose to focus on two-week averages, corresponding to the limit of predictability in numerical weather forecasting [32]. As we will discuss in the next section, this choice also facilitates the causal modeling. As a final processing step, we quantize the temperature anomaly averages to T ∈ {−1, 0, 1}. While this quantization does come with an inevitable loss of resolution, it yields the easily understood interpretation of the temperature signal as representing either a cold anomaly, a warm anomaly, or a neutral state. We compute the quantization thresholds on the entire dataset (i.e. before averaging and before selecting for months) such that one third of days are in each category. The averages are then compared to these thresholds, given by −1.3 and +1.94 kelvin. As we can see from the count in the top left panel of Figure 4, the resultant dataset after selecting for the winter months and taking two-week averages consists of 9840 samples.
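The quantization step can be sketched as follows. This is an illustrative outline using only the Python standard library, not the processing pipeline used to produce the dataset: the function names are hypothetical, the input is assumed to be a flat list of daily anomaly values, and the handling of values that fall exactly on a threshold is an arbitrary choice.

```python
import statistics

def tercile_thresholds(daily_anomalies):
    """Cut points that place roughly one third of the days in each category."""
    lower, upper = statistics.quantiles(daily_anomalies, n=3)
    return lower, upper

def block_means(daily_anomalies, window=14):
    """Averages over consecutive, non-overlapping windows (two weeks by default)."""
    n_blocks = len(daily_anomalies) // window
    return [
        statistics.fmean(daily_anomalies[b * window:(b + 1) * window])
        for b in range(n_blocks)
    ]

def quantize(value, lower, upper):
    """Map an anomaly to -1 (cold), 0 (neutral), or +1 (warm)."""
    if value < lower:
        return -1
    if value > upper:
        return 1
    return 0
```

As in the text, the thresholds would be computed once on the full daily record and then applied to the two-week averages, e.g. `[quantize(m, lower, upper) for m in block_means(winter_days)]`, where `winter_days` is a hypothetical list of January to March anomalies.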

Causal Modeling
In order to implement the proposed framework, we first need to formulate a causal DAG representation of the dataset discussed above. As a starting point, consider the DAG on the left side of Figure 5, where we let E represent an annual ENSO phase, T_1, ..., T_6 represent the quantized two-week temperature anomaly averages for January through March (i.e. T_1 averages January 1st through 14th, T_2 averages January 15th through 28th, etc.), and U represent the other factors, such as seasonality and CO2 forcing. This DAG encodes a number of assumptions. First, it encodes the intuition that seasonality may affect ENSO and the temperature, but not the other way around. Similarly, ENSO will affect the temperature in the PNW, but not the other way around. The more interesting implicit assumption is that there is a persistence signal in the temperature represented by the arrow T_{i−1} → T_i. Importantly, we have assumed that this persistence signal is Markov (when conditioned on E and U), i.e. there is no arrow T_{i−2} → T_i. This assumption significantly simplifies estimation of the direct and indirect effects of E on T_i, as those require estimating the distribution of T_i for every possible combination of its parents. This serves as a motivation for the decision to consider two-week averages: if we were to simply consider daily temperatures, it is unreasonable to expect that T_i would be independent of T_{i−2} when conditioned on E, U, and T_{i−1}.
We next incorporate two assumptions in order to simplify the causal model. First, we assume that all the effects of U are removed by the detrending and removal of the annual cycle performed in the preprocessing steps. It is to be expected that this assumption will hold for the well known shared causes (such as the aforementioned seasonality and CO2 forcing), but the possibility of other factors that have effects not captured by the leading six harmonics of the annual cycle is important to note. The second assumption we make is that the distribution of the temperature anomaly averages does not change over time, i.e. that p(t_i | t_{i−1}, e) and p(t_i | e) are not dependent on i. After making these assumptions, we obtain the simplified DAG on the right of Figure 5, where we introduce the new variable S to represent the past temperature anomaly average and T to represent the subsequent temperature average, and note that this perfectly matches the mediation model in Figure 1 with U = ∅. We can think of T as representing T_i and S as representing either T_{i−1} or the collection T_1, ..., T_{i−1}. We show in Appendix F that these two interpretations will in fact yield equivalent results. As such, we let S = T_{i−1} and directly use the definitions provided in Section 3.2 to measure the causal influence of ENSO on temperature. As a result of the assumption that p(t_i | e) does not depend on i, we have that p(t | e) = p(s | e) for t = s. It should be noted that for T = T_1 (i.e. the average for the first two weeks of January), we define S = T_0 to be the average taken over the last two weeks of December.
Using this causal model, we define the corresponding dataset from which we estimate the causal influences as D = {(e_n, s_n, t_n)}, n = 1, ..., 9840. This dataset is fully characterized by Figure 4, where each column gives rise to an empirical distribution over current temperature averages conditioned on the previous average and/or ENSO phase. The causal measures are then computed by comparing the empirical distributions corresponding to specific columns, or weighted averages of them in the case of natural direct and indirect effects. For example, STE(S = 1 → T | E = 1) compares the T_{i−1} = 1 and Total columns under ENSO = 1. The next section details the estimation procedure for all of the proposed measures.

Estimation
Given that there is a large amount of data and a relatively small alphabet size, we utilize plug-in estimators of the proposed measures, where every distribution in question is estimated using a maximum likelihood estimator. Given that E has no parents in the DAG given on the right side of Figure 5, we can freely exchange interventions ê for observations e in the estimation of the effect of e on T. As such, the estimates of the specific effect of ENSO on temperature are given by: Next note that the conditional STE of the past temperature average S on the subsequent temperature T conditioned on an ENSO state E is: Letting X = S, Y = T, Z = ∅, and U = E, it follows from Theorem 3 that we can estimate the total effect from observational data. Therefore, we use the following plug-in estimator: Given the absence of an intuitive link between bits and temperature, we choose to focus on the normalized version of the proposed causal measures which are estimated as: where Ĥ_D(T | e) ≜ −Σ_t p̂_D(t | e) log p̂_D(t | e) and Ĥ_D(T | e, s) ≜ −Σ_t p̂_D(t | e, s) log p̂_D(t | e, s). In all of the following figures, we have multiplied the estimated measures by 100 to obtain a percentage.
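A plug-in estimate of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the estimator used to produce the reported results: it assumes the dataset is a list of (e, s, t) triples, that the STE is the KL-divergence between the conditional and reference distributions, and that the normalization divides by the STE plus the corresponding conditional entropy. It covers only the total effects, and all function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = (-1, 0, 1)

def empirical(values):
    """Maximum likelihood (relative frequency) estimate over ALPHABET."""
    counts = Counter(values)
    return [counts[a] / len(values) for a in ALPHABET]

def entropy_bits(p):
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def kl_bits(p, q):
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total

def normalized(p, q):
    """Assumed normalization: STE / (STE + entropy of the first argument)."""
    ste = kl_bits(p, q)
    denom = ste + entropy_bits(p)
    return ste / denom if denom > 0 else 0.0

def nste_enso(data, e):
    """Normalized total effect of ENSO phase e on T: p(t | e) versus p(t)."""
    p_t = empirical([t for _, _, t in data])
    p_t_e = empirical([t for ei, _, t in data if ei == e])
    return normalized(p_t_e, p_t)

def nste_past_temp(data, s, e):
    """Normalized conditional effect of s on T given e: p(t | e, s) versus p(t | e)."""
    p_t_e = empirical([t for ei, _, t in data if ei == e])
    p_t_es = empirical([t for ei, si, t in data if ei == e and si == s])
    return normalized(p_t_es, p_t_e)
```

The sketch assumes every conditioning event appears at least once in the data; with no matching samples, `empirical` would divide by zero.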
By applying these estimators to the complete dataset D, we obtain point estimates of the desired measures. For ease of notation, we omit D from the estimates from here on. It is important to note that even though not all estimates will utilize all 9840 samples, Figure 4 makes clear that there is a considerable number of samples available for estimating every distribution in question. In particular, we see that: In other words, the distribution estimated on the smallest number of samples is p(t | E = 1, S = −1), and this estimate is obtained from 596 samples.
In addition to these point estimates, it is desirable to have a means of measuring the significance of the estimated measures. To assess significance, we here pair two approaches: (i) performing a nonparametric bootstrap hypothesis test [31] and (ii) constructing a nonparametric bootstrap confidence interval [9]. The goal of the hypothesis test is to estimate the distribution of the estimated measure under a null hypothesis (H0) and assess the likelihood that our estimate came from such a distribution. In this case, H0 corresponds to the absence of a causal link, which would result in the true causal measure being equal to zero. The primary challenge to performing this test is the generation of samples from a distribution representative of H0. We accomplish this using a scheme similar to that presented in [21, Example 2] wherein we group the data by one of the three variables (E, S, or T) and shuffle one of the other two in order to break one of the causal links. For example, when performing the test for the direct effect of E on T, we split the sample indices into three sets: {n : s_n = −1}, {n : s_n = 0}, and {n : s_n = 1}. Within each of these sets, we shuffle (i.e. permute) all the samples of E (or T). Because the shuffling occurs within groupings of S, any possible link from E to S and S to T is preserved (and thus so is the indirect effect), but the link between E and T is destroyed.
Each of these permutations is then treated as a sample under H0 from which we estimate the SNDE. We perform this shuffling and estimation procedure 100 times and treat the 6th largest estimate as the cutoff threshold for statistical significance. This threshold is given by the upper whisker on the boxplots labeled H0 in the figures in the next section. When performing this test for the indirect effect, we choose to break the link from S to T rather than from E to S in order to preserve the assumption that p(s | e) = p(t | e) for s = t.
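The grouped shuffling scheme can be sketched as below. The code assumes the dataset is a list of (e, s, t) tuples and that `estimator` is any function mapping such a list to a number (for example, an estimate of the SNDE); the function names, the seed handling, and the use of Python's `random` module are illustrative choices, not details taken from the study.

```python
import random
from collections import defaultdict

def shuffle_within_groups(data, group_index, shuffle_index, rng):
    """Permute one coordinate among samples that share the grouping coordinate."""
    groups = defaultdict(list)
    for n, sample in enumerate(data):
        groups[sample[group_index]].append(n)
    shuffled = [list(sample) for sample in data]
    for members in groups.values():
        values = [data[n][shuffle_index] for n in members]
        rng.shuffle(values)
        for n, value in zip(members, values):
            shuffled[n][shuffle_index] = value
    return [tuple(sample) for sample in shuffled]

def null_threshold(data, estimator, group_index, shuffle_index,
                   n_perm=100, rank=6, seed=0):
    """Return the rank-th largest estimate over n_perm shuffled datasets."""
    rng = random.Random(seed)
    estimates = sorted(
        (estimator(shuffle_within_groups(data, group_index, shuffle_index, rng))
         for _ in range(n_perm)),
        reverse=True,
    )
    return estimates[rank - 1]
```

With tuples ordered as (e, s, t), grouping by S and shuffling E corresponds to `group_index=1, shuffle_index=0`, which leaves every (s, t) pair intact while breaking the association between E and T within each group.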
Rather than comparing the above threshold to our estimate on the complete dataset, we consider a necessarily stricter test wherein we compare the threshold with the lower end of an estimated nonparametric bootstrap confidence interval. This interval is constructed by repeatedly drawing a collection of samples from the empirical distribution of our data and estimating the measure on the new collection of samples. Specifically, let D*_b = {(e_j, s_j, t_j) : j = j_n^b, n = 1, ..., 9840} be the bth bootstrap sample, where the indices j_n^b are drawn independently from the uniform distribution over {1, 2, ..., 9840} for b = 1, ..., 100. We estimate the causal measure in question on each of the 100 bootstrap samples and, similarly to the hypothesis test, treat the 6th smallest and 6th largest estimates as the lower and upper bounds of our confidence interval.
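The bootstrap interval described above can be sketched as follows, assuming `data` is a list of samples and `estimator` is any function that maps a list of samples to a number. The function name and the seeding are hypothetical.

```python
import random

def bootstrap_interval(data, estimator, n_boot=100, rank=6, seed=0):
    """Nonparametric bootstrap interval from the rank-th smallest and largest estimates."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        # Resample n indices uniformly with replacement, then re-estimate.
        estimator([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    return estimates[rank - 1], estimates[-rank]
```

Under the stricter test, an effect would be declared significant only if the lower bound returned here exceeds the cutoff obtained from the shuffled datasets.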

Results
We estimate the normalized STE, SNDE, and SNIE of ENSO on temperature and the normalized conditional STE of the past temperature average on the next average conditioned on ENSO. In every case, the measure is estimated on the complete dataset (red ×) and compared with the corresponding weighted average (i.e. "non-specific") measure (red dashed lines). For the specific measure, we obtain an estimate for each value of the cause, i.e. e ∈ {−1, 0, 1} or s ∈ {−1, 0, 1}. The average measure is then calculated by taking an expectation of the specific measures with respect to p(e) or p(s | e). As an example, the red dashed line in the left panel of Figure 6 represents E_{p(E)}[STE(E → T)] and the three red dashed lines in Figure 7 represent E_{p(S|e)}[STE(S → T | e)] for e ∈ {−1, 0, 1}. Each figure also displays two boxplots for each measure: the first shows the distribution of the measure estimated on the bootstrap samples and the second shows the distribution of the measure estimated under the null hypothesis that the causal link in question does not exist (denoted "H0").
We begin by considering the total effect of ENSO on temperature shown in Figure 6. Given that E is a root node in the DAG representation given on the right of Figure 5, we note that STE and SMI are equivalent, i.e. STE(e → T) = I_1(e; T), and the expectation gives an estimate of the mutual information, i.e. Î(E; T) = E_{p(E)}[STE(E → T)]. This illustrates the value of considering a specific causal measure: as we can see, the estimated effect of E = 1 is roughly three times the effect as estimated by the average with respect to E. Recall the interpretation of the SMI provided by point (IV) in Section 3.1, namely that it provides a measure of how much we would expect performing do(E = e) to change the course of nature for T. Under this interpretation, we see that forcing an El Niño year would alter the temperature distribution from what we would expect to occur naturally more so than forcing a La Niña or neutral year.
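The relationship between the specific and average measures for a root-node cause can be verified numerically: averaging the per-value divergences D(p(t | e) || p(t)) with weights p(e) recovers the mutual information I(E; T). The sketch below uses unnormalized quantities and a hypothetical joint distribution table; the function names are illustrative.

```python
import math

def kl_bits(p, q):
    """KL divergence D(p || q) in bits, assuming q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def marginals(joint):
    """joint[i][j] = p(E = e_i, T = t_j); returns (p(e), p(t))."""
    p_e = [sum(row) for row in joint]
    p_t = [sum(col) for col in zip(*joint)]
    return p_e, p_t

def mutual_information_bits(joint):
    p_e, p_t = marginals(joint)
    return sum(
        joint[i][j] * math.log2(joint[i][j] / (p_e[i] * p_t[j]))
        for i in range(len(p_e)) for j in range(len(p_t))
        if joint[i][j] > 0
    )

def expected_specific_effect(joint):
    """Average of D(p(t | e) || p(t)) with respect to p(e)."""
    p_e, p_t = marginals(joint)
    return sum(
        p_e[i] * kl_bits([v / p_e[i] for v in joint[i]], p_t)
        for i in range(len(p_e)) if p_e[i] > 0
    )
```

For any valid joint table the two functions agree up to floating point error, while the individual terms `kl_bits(p(t | e), p(t))` can differ substantially across values of e, which is the resolution that the specific measures are meant to expose.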
We next consider the natural direct and indirect effects shown in Figure 6, first noting that both are less than the STE for all values e. This is consistent with the intuition that the direct and indirect effects of ENSO on temperature would not cancel each other out. Intuition is also validated by the fact that the SNIE is less than the SNDE for all values. While this need not be the case in general, we make the assumption that S and T are identically distributed given E, and thus we would expect the indirect link E → S → T to be weaker than the direct link E → T. While the proposed method does not explicitly identify a physical causal mechanism, the indirect link would represent a situation wherein certain temperatures give rise to environmental circumstances that may affect future temperatures, for example snow pack or soil moisture. Given that there is no evidence in the literature of these environmental factors having a large effect on temperatures, it is sensible that the SNIE is very low. As a final point, we note that for both the SNDE and SNIE, only the effect of E = 1 passes the proposed statistical significance test. This serves as further justification for the measurement of specific causal influences: when simply measuring average influences with MI, CS, or IF, statistical significance testing results in an "all or nothing" result, whereas the present framework enables identifying influences that are significant for only some values of a cause.
We conclude this section with the conditional STE of past on current temperature in a specific ENSO phase, as portrayed by Figure 7. We can clearly see that there is a strong persistence in the temperature anomaly signal, i.e. that the past temperature average has a strong effect on the subsequent average, with the largest effect (STE(S = −1 → T | E = 1)) being roughly five times that of the effect of E = 1. The fact that the largest effect of S on T occurs when performing the intervention do(S = −1) during an El Niño year can likely be explained by the tendency for El Niño years to give rise to high temperatures. Thus, we would expect that forcing a cold spell during an El Niño would alter the course of nature more so than, say, forcing a heat wave. Furthermore, the second largest effect is seen when S = 1 and E = −1, i.e. when a heat wave is forced during a La Niña year. This result is reminiscent of the example provided in Section 4.1, where there is a large causal influence resulting from a broken chain reaction. In this case, since we would expect an El Niño (resp. La Niña) year to assign a higher probability to a heat wave (resp. cold spell) that would then persist through the effect of S on T, intervening on S to force a cold spell (resp. heat wave) will result in a large deviation from the natural behavior and thus a large causal effect. It is important to note that "forcing a cold spell" is ambiguous in that there are many different mechanisms by which one could hypothetically force a temperature. The following section includes a discussion of how these different mechanisms affect the ability to consider the estimated effect as a true causal effect or merely a measure of predictive utility. In either case, the proposed methods provide a clearer picture of how the relationship between subsequent two-week anomaly averages is modulated by ENSO phase than traditional IT methods. This suggests an area for future investigation, as two-week temperature persistence is not well studied outside of the context of persistent high pressure anomalies [47].

Challenges and Caveats
Any causal interpretation of the results is predicated on the assumption that there are no confounding factors not accounted for in the preprocessing steps. This assumption is less of an issue when measuring the effect of ENSO, where we only need to assume that there is no common cause for E and S or E and T (that there is no backdoor path, to be precise) beyond the seasonality, CO2 forcing, and any other phenomena captured by the leading six harmonics. When measuring the effect of past temperatures, however, this assumption is a bit more far reaching. For example, we have neglected to consider the temperatures in neighboring regions. Moreover, the explicit nature of the causal effect of S on T is more elusive than that of E on T. While it is reasonable to expect the temperature to have some causal effect in a literal sense (i.e. via the heat equation), it is likely that the estimation procedure is also capturing the effects of temperature related variables. For example, if we additionally included PNW atmospheric pressure waves in the model, we would expect these waves to be a common cause for S and T, resulting in a significantly weaker (if not absent) link S → T. As such, the above estimate of STE(s → T | e) ought to be viewed as either a measure of predictive utility of the literal temperature, or the causal effect of a "meta variable" representative of the temperature and related quantities that are intervened upon as a whole. In any case, the present study serves as a starting point for the development of more intricate causal models relating ENSO and temperature.
A second set of challenges arises from the need to estimate the measures for every value of the cause. While these challenges are indeed fundamental to the proposed framework, they provide an opportunity for the development of novel estimation and statistical testing techniques. On one hand, the proposed specific causal measures are necessarily more challenging to estimate than their average counterparts. On the other hand, they necessarily provide more resolution and allow for estimating separate confidence intervals for each element in the analysis. If we are trying to estimate STE(x → Y) but only have a small number of points in our dataset where x_n = x, then we would have very little confidence in our estimate. However, that need not discourage us from having high confidence in an estimate of STE(x′ → Y) for some other value x′ for which we have many samples. That having been said, the proposed estimators and significance test used in the present study lack a formal analysis and leave considerable room for improvement.
As a final discussion point, we return to the comparison of information theoretic and statistical notions of causal influence. Despite having carefully formulated the proposed measures as measures of the extent to which an intervention results in a deviation from the course of nature, the results presented in this section raise the question: How useful are bits? As an absolute measure, it is worth noting that a measure in bits will be largely influenced by the number of quantization regions we select. While this can be partially addressed by the proposed normalization, there is no question that the data compression interpretation provided alongside those equations is less intuitive than a measure of, say, the number of degrees warmer we would expect it to be in an El Niño year than in a La Niña year. Moreover, this intuition gap would be even larger for someone outside of the information theory community (e.g. climate scientists). This is not to say that the proposed measures are so opaque that they are unusable. In fact, we would argue that they provide more interpretable notions of causal influence than other information theoretic measures that have experienced some popularity in the literature. Instead, this discussion is merely to maximize the level of intuition that we can associate with the proposed measures while simultaneously acknowledging the limitations of information theoretic measures in terms of interpretability.

Conclusion
We have sought inspiration from the statistical causality community in order to refine information theoretic measures of causal influence. Specifically, we have developed a series of causal measures that are defined for specific values of the cause in question with the goal of differentiating between total, direct, and indirect effects, and provided conditions under which they can be estimated from observational data. The proposed measures are, at their core, aligned with previous information theoretic measures in that they compare distributions of $Y$ rather than comparing values of $Y$. As such, they are well-equipped for capturing non-linear, higher order causal effects, although at the cost of forgoing an explanation of the exact nature of the causal effects. Perhaps most importantly, we have elucidated the key insight that information theoretic measures of causal influence can be interpreted as methods for quantifying the magnitude with which an intervention is expected to alter the course of nature. This interpretation stands in stark contrast to that of statistical measures. As such, we hope that a key lesson will be that information theoretic and statistical notions of causal influence can provide complementary methods in that they yield the answers to fundamentally different causal questions.

Acknowledgement
The CESM project is supported primarily by the National Science Foundation. We thank all the scientists, software engineers, and administrators who contributed to the development of CESM2. All of the data used is made publicly available by NCAR and can be downloaded at https://csegweb.cgd.ucar.edu/exp2-public/cgi-bin/expListPublic.cgi.

A Exchanging Interventions and Observations
The do-calculus provides a set of rules to aid using the do-operator in practice and to enable identifying if and how interventional probabilities can be computed. Of particular interest is computing interventional probabilities (i.e., those using the do-operator) from the standard conditional probabilities that represent observing variables. This is particularly important in scenarios such as the one considered in Section 5, wherein it is infeasible to actually perform interventions. The do-calculus consists of three rules, each of which involves an equivalence statement between probabilities that is implied by a d-separation criterion. We here focus on Rule 2, which provides a condition under which observations can be exchanged for actions. Specifically, this rule says that for a DAG $G$ and any disjoint sets of variables $X$, $Y$, $Z$, and $W$:

$$p(y \mid \hat{x}, \hat{z}, w) = p(y \mid \hat{x}, z, w) \quad \text{if} \quad (Y \perp\!\!\!\perp_d Z \mid X, W)_{G_{\overline{X}\underline{Z}}}$$

where $(\cdot \perp\!\!\!\perp_d \cdot \mid \cdot)_G$ represents d-separation with respect to the DAG $G$ and $G_{\overline{X}\underline{Z}}$ represents an augmented DAG with all incoming arrows to $X$ and outgoing arrows from $Z$ removed. The rule is framed in a general form in that it allows other variables to be observed or intervened upon (i.e., $W$ and $X$) on both sides of the equality. Roughly speaking, this rule says that if the only way $Z$ relates to $Y$ is via descendants of $Z$, then knowing whether a particular value $z$ was observed or forced will not change the distribution of $Y$. To see this, first let $X = \emptyset$, and note that the d-separation condition becomes: $Y$ is d-separated from $Z$ by $W$ if we ignore all paths coming out of $Z$. If that condition is not satisfied, then observing a value of $Z$ informs us about the values of $Z$'s parents, which then may provide further information on the distribution of $Y$. By contrast, if we intervene on $Z$, then no information is conveyed about $Z$'s parents, and the distribution of $Y$ will in general not be the same. Next, letting $X \neq \emptyset$, we see that the condition now requires removing all incoming arrows to $X$. This is because if $X$ is intervened upon, it will contain no information about the values of its parents.
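As a numerical illustration of Rule 2 with $X = \emptyset$, the following sketch uses a hypothetical binary model $W \to Z \to Y$ with $W \to Y$; the probability tables and function names are assumptions chosen only for illustration and do not come from the study. With the arrow out of $Z$ removed, $W$ d-separates $Z$ from $Y$, so conditioning on $W$ should make observing and forcing $Z$ equivalent, whereas without $W$ the two should differ.

```python
# Hypothetical binary model W -> Z -> Y with W -> Y (all tables are assumptions).
p_w = [0.7, 0.3]                              # p_w[w]
p_z_w = [[0.9, 0.1], [0.2, 0.8]]              # p_z_w[w][z] = p(z | w)
p_y_zw = [[[0.8, 0.2], [0.5, 0.5]],           # p_y_zw[z][w][y] = p(y | z, w)
          [[0.4, 0.6], [0.1, 0.9]]]
VALS = (0, 1)


def joint_obs(w, z, y):
    """Observational joint p(w, z, y) from the DAG factorization."""
    return p_w[w] * p_z_w[w][z] * p_y_zw[z][w][y]


def joint_do(w, y, z):
    """Joint of (w, y) under the intervention Z := z (truncated factorization)."""
    return p_w[w] * p_y_zw[z][w][y]


def p_y_given_z(y, z):
    """Observational p(y | z)."""
    num = sum(joint_obs(w, z, y) for w in VALS)
    den = sum(joint_obs(w, z, yy) for w in VALS for yy in VALS)
    return num / den


def p_y_do_z(y, z):
    """Interventional p(y | z-hat)."""
    return sum(joint_do(w, y, z) for w in VALS)


def p_y_given_zw(y, z, w):
    """Observational p(y | z, w), computed from the joint."""
    return joint_obs(w, z, y) / sum(joint_obs(w, z, yy) for yy in VALS)


def p_y_do_z_given_w(y, z, w):
    """Interventional p(y | z-hat, w), computed from the intervened joint."""
    return joint_do(w, y, z) / sum(joint_do(w, yy, z) for yy in VALS)


if __name__ == "__main__":
    print("p(y=1 | z=1)     =", round(p_y_given_z(1, 1), 4))
    print("p(y=1 | do(z=1)) =", round(p_y_do_z(1, 1), 4))
    print("p(y=1 | z=1, w=0)     =", round(p_y_given_zw(1, 1, 0), 4))
    print("p(y=1 | do(z=1), w=0) =", round(p_y_do_z_given_w(1, 1, 0), 4))
```

In this toy model the unconditional quantities differ because observing $Z$ carries information about its parent $W$, while the quantities conditioned on $W$ coincide, as the rule predicts.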
This rule is applied in a straightforward manner in two ways in Section 5. First, when measuring the effect of ENSO on temperature, we need to exchange an intervention on the ENSO phase for an observation of an ENSO phase. Focusing on the graph on the right side of Figure 5, the augmented graph $G_{\underline{E}}$ is given by $E$ being an isolated node. Thus, in this augmented graph $E$ is d-separated from $T$ by either $\emptyset$ or $S$,

We wish to show that these distributions can be estimated from observational data, i.e., that the hats can be removed. Assume that the conditions of the theorem hold. We first claim that [...]. To see this, note that in the DAG $G_{\underline{X}}$, $X$ has no children, and thus will not be connected to any other nodes in step two of the d-separation algorithm given by Algorithm 1. Since every edge connected to a node in $\tilde{U}$ is removed in step three of the algorithm, the only way for one of the implications to be violated is if there is an undirected path in $G_{\underline{X}}$ connecting $X$ and $Z$ or $X$ and $Y$ that does not pass through $\tilde{U}$; however, such a path would necessarily not pass through $\tilde{U}_1$ or $\tilde{U}_2$, which would violate $(X \perp\!\!\!\perp Y \mid \tilde{U}_1)_{G_{\underline{X}}}$ or $(X \perp\!\!\!\perp Z \mid \tilde{U}_2)_{G_{\underline{X}}}$. Thus, the claimed implications hold. Next we can directly apply rule two of the do-calculus [39, Theorem 3.

E Climate Model Details
In concordance with the CMIP6 terms of use (https://pcmdi.llnl.gov/CMIP6/TermsOfUse/TermsOfUse6-1.html), we provide the full model details for the model that provided the utilized dataset.

Given that $S$ appears nowhere in the first argument of the KL divergence, we can see that whether $S = T_{i-1}$ or $S = T_1^{i-1}$, the result is the same. The same procedure can be applied to show equivalence for the SNIE.

Figure 2: Sea surface temperatures (SST) averaged over January, February, and March from 1979-2018 in the equatorial Pacific for La Niña (A), neutral (B), and El Niño (C) ENSO phases derived from the ERA-interim OCEAN5 reanalysis product conditioned on the Nino3.4 index ± 1 anomaly standard deviation [53]. The shifted SST patterns give rise to shifted precipitation regions (yellow circle), which affect temperature anomalies in the PNW through large scale atmospheric waves.

Figure 3: Simulation of the ENSO 3.4 index from 1851-1871 from a CESM2 model run along with threshold for determining ENSO phase.

Figure 4: Counts of transitions from the past average temperature $T_{i-1}$ to the current average $T_i$ in the complete dataset and subsets corresponding with specific values of the ENSO signal. The parenthetical gives the total count of samples in a given subset. The number of samples used to estimate the distribution of the current temperature anomaly for a given ENSO phase and past temperature is given by the sums of the columns. It is clear that there is ample data for estimating each distribution in question (see (17)).

Figure 5: Left: Complete DAG representation of climate variables. Right: Simplified DAG after detrending and assuming stationarity.

$$\mathrm{STE}_D(e \to T) \triangleq D\big(p_D(T \mid e) \,\|\, p_D(T)\big)$$

$$\mathrm{SNDE}_D(e \to T) \triangleq D\Big(p_D(T \mid e) \,\Big\|\, \sum_{e', s} p_D(e')\, p_D(s \mid e)\, p_D(T \mid e', s)\Big)$$

$$\mathrm{SNIE}_D(e \to T) \triangleq D\Big(p_D(T \mid e) \,\Big\|\, \sum_{e', s} p_D(e')\, p_D(s \mid e')\, p_D(T \mid e, s)\Big)$$

where $p_D$ gives the maximum likelihood estimate of $p$ on the sample $D$. Specifically, for an arbitrary collection of $N$ samples $C = (x_n, y_n, z_n)_{n=1}^{N}$ of variables $X, Y, Z$, the estimate is given by:

$$p_C(y) \triangleq \frac{|\{n : y_n = y\}|}{N}$$

$$p_C(y \mid x) \triangleq \frac{|\{n : x_n = x,\, y_n = y\}|}{|\{n : x_n = x\}|}$$

$$p_C(y \mid x, z) \triangleq \frac{|\{n : x_n = x,\, y_n = y,\, z_n = z\}|}{|\{n : x_n = x,\, z_n = z\}|}$$

where $|\{\cdot\}|$ gives the number of elements in the set $\{\cdot\}$.
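The counting estimators above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: the function names `kl_bits` and `specific_effects`, the `(e, s, t)` tuple layout, and the choice to treat $p_D(t \mid e', s)$ as zero when the pair $(e', s)$ never occurs in the sample are all assumptions made here.

```python
from collections import Counter
from math import inf, log2


def kl_bits(p, q):
    """KL divergence D(p || q) in bits for two dicts over the same outcomes."""
    total = 0.0
    for outcome, p_val in p.items():
        if p_val == 0:
            continue  # a term with p = 0 contributes nothing
        if q[outcome] == 0:
            return inf  # p puts mass where q has none
        total += p_val * log2(p_val / q[outcome])
    return total


def specific_effects(samples, e):
    """Plug-in estimates (STE, SNDE, SNIE) of cause value e on T.

    samples: iterable of (e_n, s_n, t_n) with discrete (hashable) values.
    """
    samples = [tuple(x) for x in samples]
    n = len(samples)
    n_e = Counter(ei for ei, _, _ in samples)
    n_t = Counter(ti for _, _, ti in samples)
    n_es = Counter((ei, si) for ei, si, _ in samples)
    n_et = Counter((ei, ti) for ei, _, ti in samples)
    n_est = Counter(samples)
    e_vals, t_vals = list(n_e), list(n_t)
    s_vals = list({si for _, si in n_es})

    def p_s(s, ei):
        # Empirical p(s | e)
        return n_es[(ei, s)] / n_e[ei]

    def p_t(t, ei, s):
        # Empirical p(t | e, s); set to 0 if (e, s) was never observed (assumption)
        return n_est[(ei, s, t)] / n_es[(ei, s)] if n_es[(ei, s)] else 0.0

    p_t_e = {t: n_et[(e, t)] / n_e[e] for t in t_vals}
    p_t_marg = {t: n_t[t] / n for t in t_vals}
    # Reference for the direct effect: mediator follows p(s | e), cause is averaged out
    q_nde = {t: sum(n_e[e2] / n * p_s(s, e) * p_t(t, e2, s)
                    for e2 in e_vals for s in s_vals) for t in t_vals}
    # Reference for the indirect effect: cause fixed at e, mediator follows p(s | e')
    q_nie = {t: sum(n_e[e2] / n * p_s(s, e2) * p_t(t, e, s)
                    for e2 in e_vals for s in s_vals) for t in t_vals}
    return kl_bits(p_t_e, p_t_marg), kl_bits(p_t_e, q_nde), kl_bits(p_t_e, q_nie)


if __name__ == "__main__":
    # Toy data in which T copies E and S is irrelevant: the effect is purely direct.
    toy = [(e_val, s_val, e_val) for e_val in (0, 1) for s_val in (0, 1)]
    print(specific_effects(toy, 1))
```

Because unobserved $(e', s)$ pairs are zeroed out rather than smoothed, the reference distributions may not sum to one on sparse data; a real analysis would need to handle such cells explicitly.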

Figure 6: Estimates of the normalized specific total effect (left), specific natural direct effect (center), and specific natural indirect effect (right) of ENSO on temperature anomalies.

Figure 7: Estimates of the normalized conditional specific total effect of previous temperature anomaly on current temperature anomaly conditioned on different values of ENSO phase.
and we have $p(t \mid \hat{e}) = p(t \mid e)$ and $p(t \mid s, \hat{e}) = p(t \mid s, e)$. Similarly, for measuring the effect of $S$ on $T$, we need to consider the augmented graph $G_{\underline{S}}$ given by $S \leftarrow E \rightarrow T$. Using the d-separation algorithm described in Algorithm 1, it is straightforward to see that $(S \perp\!\!\!\perp_d T \mid E)_{G_{\underline{S}}}$ and thus $p(t \mid e, \hat{s}) = p(t \mid e, s)$.

Algorithm 1 d-Separation [27]
Input: DAG $G = (V, E)$ and disjoint sets $A, B, C \subset V$
1: Create a subgraph containing only nodes in $A$, $B$, or $C$ or with a directed path to $A$, $B$, or $C$
2: Connect with an undirected edge any two variables that share a common child
3: For each $c \in C$, remove $c$ and any edge connected to $c$
4: Make every edge an undirected edge
5: Conclude that $A$ and $B$ are d-separated by $C$ if and only if there is no path connecting $A$ and $B$

B Conditional Specific Causal Measures
Definition 7. The partially observed conditional SCDE of $x$ on $Y$ with mediator $z$ given $\tilde{u}$ is defined as:

$$\mathrm{SCDE}(x \to Y; z \mid \tilde{u}) \triangleq D\Big(p(Y \mid x, \hat{z}, \tilde{u}) \,\Big\|\, \sum_{x'} p(x' \mid \tilde{u})\, p(Y \mid x', \hat{z}, \tilde{u})\Big)$$

In the fully observable setting $\tilde{U} = U$ we have:

$$\mathrm{SCDE}(x \to Y; z \mid u) \triangleq D\Big(p(Y \mid x, \hat{z}, u_Y) \,\Big\|\, \sum_{x'} p(x' \mid u_X)\, p(Y \mid x', \hat{z}, u_Y)\Big)$$

where (20) follows from the fact that interventions on $Z$ can be ignored in the distribution of $X$. Moving on to the CS, we have:

$$
\begin{aligned}
C_{X \to Y} &= D\big(p(X, Y, Z) \,\|\, p_{X \to Y}(X, Y, Z)\big) \\
&= \sum_{x,y,z} p(x, y, z) \log \frac{p(x)\, p(z \mid x)\, p(y \mid x, z)}{p(x)\, p(z \mid x) \big(\sum_{x'} p(x')\, p(y \mid x', z)\big)} \\
&= \sum_{x,y,z} p(x, y, z) \log \frac{p(y \mid x, z)}{\sum_{x'} p(x')\, p(y \mid x', z)} \\
&= \sum_{x,z} p(x, z) \sum_{y} p(y \mid x, z) \log \frac{p(y \mid x, z)}{\sum_{x'} p(x')\, p(y \mid x', z)} \\
&= \sum_{x,z} p(x, z) \sum_{y} p(y \mid x, \hat{z}) \log \frac{p(y \mid x, \hat{z})}{\sum_{x'} p(x')\, p(y \mid x', \hat{z})} \\
&= \sum_{x,z} p(x, z)\, D\Big(p(Y \mid x, \hat{z}) \,\Big\|\, \sum_{x'} p(x')\, p(Y \mid x', \hat{z})\Big) \\
&= E_{p(X,Z)}\big[\mathrm{SCDE}(X \to Y; Z)\big]
\end{aligned}
$$

C.3 Proof of Theorem 3
Note that the conditional STE, SNDE, and SNIE only utilize three distributions involving interventions, namely $p(y \mid \hat{x}, \tilde{u})$, $p(z \mid \hat{x}, \tilde{u})$, and $p(y \mid \hat{x}, z, \tilde{u})$.
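The five steps of Algorithm 1 above translate directly into code. The following is a minimal Python sketch, not an implementation from the study: the function name `d_separated`, the edge-list representation, and the mediation triangle $E \to S$, $E \to T$, $S \to T$ used in the example are assumptions made for illustration.

```python
from itertools import combinations


def d_separated(edges, A, B, C):
    """Return True if node sets A and B are d-separated by C in the DAG.

    edges: iterable of (parent, child) pairs; A, B, C: disjoint sets of nodes.
    """
    edges = list(edges)
    A, B, C = set(A), set(B), set(C)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)

    # Step 1: keep only nodes in A, B, C and their ancestors.
    keep, stack = set(), list(A | B | C)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))

    # Steps 2 and 4: build an undirected graph and connect co-parents.
    adj = {node: set() for node in keep}
    for u, v in edges:
        if u in keep and v in keep:
            adj[u].add(v)
            adj[v].add(u)
    for child in keep:
        for p, q in combinations(parents.get(child, set()) & keep, 2):
            adj[p].add(q)
            adj[q].add(p)

    # Steps 3 and 5: search from A for any node of B while never entering C.
    seen, stack = set(A), list(A)
    while stack:
        node = stack.pop()
        if node in B:
            return False
        for nb in adj[node]:
            if nb not in C and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return True


if __name__ == "__main__":
    # Assumed triangle E -> S, E -> T, S -> T with arrows out of S removed.
    g_s = [("E", "S"), ("E", "T")]
    print(d_separated(g_s, {"S"}, {"T"}, {"E"}))   # conditioning on E
    print(d_separated(g_s, {"S"}, {"T"}, set()))   # no conditioning
```

Removing the nodes in $C$ is handled implicitly by refusing to traverse them during the search, which gives the same answer as deleting them and their edges in step three.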