1. Introduction
Philosophers have debated at length whether causality is a subject that should be treated probabilistically or deterministically. This debate resulted in the development of different inferential systems and views on reality. Pure logic dealt with inferences about deterministic truths [1,2]. Probabilistic reasoning has been developed to allow for uncertainty in inferences about deterministic truths [3,4], to make inferences about probabilistic truths [5,6], or to imply the existence of associated deterministic truths [7,8,9,10,11]. Probabilistic theories about causality were developed throughout the 20th century, with notable contributions by Reichenbach, Good, and Suppes [12]. At the same time, however, the classical model of physics maintained its position as a role model for other sciences, which led researchers, including those concerned with human behavior and economic systems, to reject ideas about probabilistic causation, opting, often, to reason probabilistically about deterministic truths.
In modern physics, the standard equations of quantum mechanics suggest that reality is, in fact, better described by probability laws [13]. The outcome of the Bohr–Einstein debates settled on the assertion that these probability laws are the result of a real indeterminacy and that reality itself is probabilistic (one may also argue that this is simply a correct exposition of the theory and not necessarily of the physical world, as more complete theories may yet be discovered). Ref. [14] provides an alternative interpretation of quantum physics in which the probability laws are statistical results of the development of completely determined, but hidden, variables. At a macroscopic level, deterministic laws and contingencies induce associated probabilistic laws (contingencies is the term used by Ref. [14] for independent factors that may exist outside the scope of what is treated by the laws under consideration, and which do not follow necessarily from anything that may be specified within the context of these laws). In particular, by broadening the context of the processes under consideration, new laws that govern some of the contingencies can be found. This inevitably leads to new contingencies: a process that repeats indefinitely. For this reason, any theory about reality that embraces either deterministic law or chance, to the exclusion of the other, is inherently incomplete. Regardless of one's position on real indeterminism, it follows, by this logic, that any natural process that arises deterministically must also satisfy statistical laws that are more general, and so any complete theory about interesting real-world phenomena must be probabilistic.
In a probabilistic view of reality, cause and consequence are related by probability laws rather than by laws of logical truths. A theory about probabilistic causality can, therefore, be stated in terms of the properties of the true measure that describes a process stochastically. The theory of causation developed here is that a causal relationship exists if there exists a true probability measure that produces a non-empty stochastic sequence describing the directly caused effects of perturbations in one variable in terms of the responses in another. The paper shows that ideas about causality, including the direction, statistical significance, and economic relevance of effects, may be tested by formulating a statistical model that correctly describes the observed data and evaluating its dynamic properties. In practice, this means that the inference is conducted with a best approximation of the true probability measure. It is the position of the paper that, in order to demonstrate that causality runs from a potential causal variable to the target variable, one must develop the best approximation of the true probability measure using the potential causal variable and a best approximation of the true probability measure without the potential causal variable. The analysis should then (1) conclude whether the first modeled measure is closer to the true measure, and (2) test that the two modeled measures are not equivalent. Practical routines to do so are discussed, and an example is provided using random forest (RF) regressions and daily data on yield spreads. The application tests how uncertainty around short- and long-term inflation expectations interacts with spreads in the daily Bitcoin price, a digital asset with a predetermined finite supply that has been characterized as a new potential inflation hedge. The results are contrasted with those obtained with standard linear Granger causality tests. It is shown that the suggested approaches not only lead to better predictive models, but also to more plausible, parsimonious descriptions of possible causal flows.
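To fix ideas, the sketch below illustrates the general shape of such a two-model routine on simulated placeholder data; the variable names, coefficients, and settings are hypothetical and are not the paper's actual Bitcoin application. A random forest using only the target's own lags is compared with one that adds the candidate causal variable, judged by out-of-sample accuracy and by a test of whether the two modeled predictive distributions differ.

```python
# A minimal sketch of the two-model comparison routine, assuming hypothetical data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
T = 1000
x = rng.normal(size=T)                      # candidate causal variable (placeholder)
y = np.zeros(T)
for t in range(1, T):                       # placeholder target with a direct causal term
    y[t] = 0.5 * y[t - 1] + 0.3 * x[t - 1] + 0.1 * rng.normal()

# One-step-ahead designs: lagged y alone vs. lagged y plus lagged x.
Z_restricted = y[:-1].reshape(-1, 1)
Z_full = np.column_stack([y[:-1], x[:-1]])
target = y[1:]
split = int(0.8 * len(target))              # chronological split, no shuffling

predictions = {}
for name, Z in {"restricted": Z_restricted, "full": Z_full}.items():
    rf = RandomForestRegressor(n_estimators=300, random_state=0)
    rf.fit(Z[:split], target[:split])
    pred = rf.predict(Z[split:])
    predictions[name] = pred
    # (1) Which modeled measure is closer to the truth, judged by predictive accuracy?
    print(name, "out-of-sample MSE:", np.mean((target[split:] - pred) ** 2))

# (2) Are the two modeled predictive distributions distinguishable?
stat, p = ks_2samp(predictions["restricted"], predictions["full"])
print("KS statistic:", stat, "p-value:", p)
```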
The focus on approximating a correct stochastic representation of the DGP (data generating process) as a means of learning about true causal linkages is different from approaches that try to simulate laboratory conditions by testing for statistical differences in control groups, such as those described by [15,16]. The focus on obtaining a correct functional representation of the data is also different from attributing the presence of causal relationships directly to the values of parameters representing averages in treatment groups; see, for instance, [17,18,19] on this approach. Placing emphasis on the need for accurate statistical models for the full data distribution when conducting causal analysis introduces an obvious weakness: it is generally accepted that all empirical models are mis-specified to some degree and are likely never correctly specified. The true process, after all, is unknown in practice; this is the reason to conduct analyses in the first place. The aim of developing correct models can therefore be seen as an ideal that is difficult to put into practice. However, it is still valuable to understand the role of the correct-specification assumption in causal analysis. It is commonly taught that mis-specification leads to residual dependencies that violate the assumptions made by general central limit theorems needed to obtain correct standard errors; see, for example, Chapter 2 in [20]. However, more general estimation theory for dependent processes, such as that developed and discussed, for instance, by [21,22,23,24,25], may help correct standard error estimation but does not remedy the issue that the structural response of the model is incorrect [26]. These are theories for correcting the variance estimator when the underlying model is wrong; they do not address the issue that the structural response of the model does not correctly describe the data.
The paper builds on the contributions of others in the following lines of research. The views on causality developed in the paper are related to the information-theoretic view on testing causal theories, as discussed by [27,28,29,30], which, as here, emphasizes model parsimony. The line of reasoning is inspired by the work of [31,32], who emphasized the importance of a probabilistic formulation of economic theories and warned against the use of statistical methods without any reference to a stochastic process. The paper also emphasizes the importance of the overall model response and, thus, of focusing on system behavior rather than on isolated parameters that make no reference to a wider economic system. This has previously been advocated by [33]. The main result of the paper is that convincing statements about partial causal linkages must be underpinned by an accurate model of broader reality, even if the interest is in inference and not prediction per se. To do so, researchers must, as shall be discussed, pay due attention to distinguishing between direct causal impacts and system memory, and take note of developments in the field of predictive modeling.
The plan of the paper is as follows. Section 2 develops definitions for probabilistic causality in terms of true probability measures, using a flexible type of dynamical system that covers many processes observed in economics, physics, finance, and related fields of study. Section 3 discusses approximating this true probability measure as an act of minimizing the divergence between the modeled probability measure and the true probability measure, while Section 4 forges the link between statistical divergence and distance, drawing the connection between distance minimization and the use of maximum likelihood criteria. Section 5 provides practical considerations and applies the theory. Finally, Section 6 concludes. Proofs are provided in Appendix A.
  2. Causality in Terms of True Probability Measures
Notation will be as follows.
Notation 1. ℕ, ℤ, and ℝ, respectively, denote the sets of natural, integer, and real numbers. If  is a set,  denotes the Borel σ-algebra over , and , alternatively denoted as , is the Cartesian product of T copies of . Definitional equivalence is denoted , which is to be distinguished from ≡, denoting equivalence, for example in the functional sense. For two maps, f and g, their composition arises from their point-wise application and is denoted f ∘ g, and f⁻¹ is the inverse function of f. The tensor product is denoted ⊗. The notation μ ≪ ν is used to indicate that μ is absolutely continuous with respect to ν, i.e., if μ and ν are two measures on the same measurable space , μ is absolutely continuous with respect to ν if μ(A) = 0 for every set A for which ν(A) = 0, or, as an example, if ν is the counting measure on ℝ and μ is the Lebesgue measure, then μ ≪ ν. It is also said that ν dominates μ when μ ≪ ν; see, for instance, ([34] p. 574). Finally, the empty set ∅ is also used in the context of an empty sequence, which would sometimes be notated as  in the literature.
Directional causality is interesting when at least two sequences are considered. Specifically, when the focus is on a
T-period sequence 
, that is a subset of the realized path of the 
-variate stochastic sequence 
 for events in the event space 
. (That is, 
. The random sequence 
 is a Borel-
 -measurable map 
. In this, 
 denotes the Cartesian product of infinite copies of 
 and 
 with 
, and 
 denotes the Borel-
 algebra on the finite dimensional cylinder set of 
, see Theorem 10.1 of [
35], p. 159). As always, the complete probability space of interest is described by a triplet 
, with 
 as the 
-field defined on the event space. 
 is used here informally as a placeholder for a collection of probability measures, as we shall introduce the exact probability measures of interest shortly.
If 
 is considered as a univariate sequence independent from causal drivers, then for every event 
, the stochastic sequence 
 would live on the probability space 
 where 
 assigns probability to all elements of 
. In a similar fashion, one can consider 
 as the subset of the realized path of the 
-variate stochastic sequence 
 indexed by identical 
t for events 
 (i.e., 
 and the random sequence 
 is a Borel-
 -measurable map 
.) If 
 were to live similarly isolated from outside influence, then for every 
, the stochastic sequence 
 would operate on a space 
 where 
 assigns probability to all the elements of 
. We have a system of two unrelated sequences (This naturally covers the most common auto-regression case, only stated for 
 here, 
, where 
 is unobserved. The linear auto-regression case is obtained when 
 is a scaled identity function.):
	  As we shall see, an important aspect of causal analysis is to rule out that the observed data is generated by Equation (
1). As such, it is important to comment on a number of properties. First, in this system of equations, the functions 
 and 
 are intentionally not indexed by 
t. This does not imply that these functions cannot possess complex time-varying properties; it only limits the discussion to observation-driven models (to the exclusion of parameter-driven models), in which time-varying parameters arise as nonlinear functions of the data. An example would be the threshold models considered by [
36,
37], in which parameter values are allowed to differ across regimes in the data. The choice to restrict the discussion is made because it is intuitively easier to conceive of causal effects in an observation-driven context where observations represent verifiable values describing different states of real-world phenomena. At the same time, it has been shown that parametric observation-driven models can produce time-varying parameters of a wide class of nonlinear models [
38] and that the forecasting power of such models may be on-par with parameter-driven models, even if the latter are correctly specified [
39]. Moreover, Refs. [
20,
40,
41] show how observation-driven models may be used to not only investigate how observations impact future observations, but also future parameter values, which may empirically be interesting if those parameters carry an economic interpretation. Finally, many popular machine learning algorithms, such as neural networks, can be reduced to equations that show how parameter values change according to levels in the data [
42].
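As a brief illustration of the observation-driven idea, the following sketch simulates a stylized two-regime threshold autoregression in which the autoregressive parameter is a function of the previous observation; the threshold and coefficient values are illustrative and are not taken from [36,37].

```python
# A stylized observation-driven threshold autoregression; all values are hypothetical.
import numpy as np

rng = np.random.default_rng(42)
T, threshold = 500, 0.0
phi_low, phi_high = 0.9, 0.3                # regime-specific autoregressive parameters
y = np.zeros(T)
for t in range(1, T):
    # The time-varying parameter is a (nonlinear) function of the observed past.
    phi_t = phi_low if y[t - 1] <= threshold else phi_high
    y[t] = phi_t * y[t - 1] + rng.normal()
```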
While the dynamics in Equation (
1) may be nonlinear, the notation is too restrictive to nest long-memory processes. In particular, the state at time 
t is only a function of the previous state at time 
, or 
 if the model were generalized to 
p-order lags, but not of the full history. Vanishing dependence, implied under contraction conditions [
43], is often key to verifying irreducibility and continuity [
44] and proving the ergodicity of time series [
45]. Proving the ergodicity of a model is needed to obtain an estimation theory under an assumption of correct specification [
20,
24]. Later, multivariate models will be considered, in which case long-memory properties may arise, for example, when time-varying parameters in one of the functions are a function of past data as well as of past values of those time-varying parameters.
If interrelated stochastic sequences are at the center of inference, additional building blocks are required to describe the processes. This increases the potential complexity of 
 and 
, but it also allows one to distinguish between causality, non-causality, and feedback. Consider the stochastic system:
	  In this multivariate context, 
 and 
 will be referred to as the direct causal maps, while 
 and 
 control the memory properties within each channel.
When  and  are analyzed individually, the properties of  and  are of key interest. They carry information on the future positions of  and , and provide predictability without considering outside influence directly. However, correct causal inference about the interdependencies of  and  may be preferred over predictive capability alone, which can result from many configurations within the parameter space that are associated with untrue probability measures. The properties of  and  determine the direction in which effects move. Verifying their properties is central to causality studies. The functions  and , on the other hand, play a central role in the system’s responses to external impulses by shaping the memory of the initial causal impact of a sequence of interventions, even after that sequence turns inactive.
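A minimal linear sketch of this decomposition, with hypothetical coefficients, may help: the own-lag coefficients play the role of the memory maps, while the cross-lag coefficients play the role of the direct causal maps.

```python
# A minimal linear sketch of the bivariate system; coefficients are illustrative only.
import numpy as np

rng = np.random.default_rng(7)
T = 500
a_xx, a_yy = 0.6, 0.5        # memory maps (own-lag effects)
c_xy, c_yx = 0.0, 0.4        # direct causal maps: y -> x switched off, x -> y active
x, y = np.zeros(T), np.zeros(T)
for t in range(1, T):
    x[t] = a_xx * x[t - 1] + c_xy * y[t - 1] + rng.normal()
    y[t] = a_yy * y[t - 1] + c_yx * x[t - 1] + rng.normal()
# With c_xy = 0 and c_yx != 0, the system exhibits uni-directional causality from
# x to y; setting both cross terms to zero reduces it to two unrelated autoregressions.
```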
The functions that control memory properties within channels in some sense determine how the past reverberates into the future, and specifying correct empirical equivalents to 
 and 
 is as crucial to the inference about the causal interdependencies as is specifying mechanisms for the action of interest (it would be more general to write Equation (
2) with 
 and 
 and with 
. In this case, for instance, the dependence of 
 on its own past, 
, is allowed to vary based on the levels in past data. However, under this notation, one could, at any point in time, decompose the change in one variable into effects attributed to memory and to outside influence separately, which the simplified notation in Equation (2) is intended to focus on). In fact, as Ref. [46] points out, systems may be dominated by memory, and the influence of the causal components on the overall process may be small, in which case predictive power can be obtained without specifying any causal maps and focusing solely on memory. Inversely, this also suggests that one must obtain a model for the memory process in order to isolate the causal impacts themselves, implying that long-memory applications in which causal inference is of interest must develop a high degree of predictive power, even if prediction is not needed for policy purposes. This can be made clearer by considering the following:
      with 
 and 
 defined as 
 and 
. Given the realized sequences 
 and 
 generated by Equation (
2), the sequential system of Equation (
3) moves forward in time as the one-step-ahead directly caused parts of 
 and 
 that are filtered from the reverberating effects of 
 and 
. More specifically, while 
 partially consists of memory, there is a part, 
, that, at any point, is directly mapped from the previous state of 
, while, at the same time, 
 consists partially of memory and a part 
 directly generated from the last position of 
. In this view, directional causality can be stated in terms of whether (
3) produces any values, i.e., diagnosing if there is any statistically significant signal from initial causal impulses left after all memory properties have been stripped from the data. Importantly, the system reveals that by the definitions of 
 and 
, obtaining appropriate estimates for 
 and 
 involves 
 and 
 being modeled correctly, since 
 and 
 are not observed and arise only as functions of the observable processes 
 and 
. Moreover, if 
 and 
 are triggered by an event, then it is possible, by a process of infinite backward substitution, to write Equation (
3) as an infinite chain initialized in the infinite past. Plugging in the equalities 
 and 
 and defining the random functions 
 and 
, one can write
      
	  Repeating infinitely, and extending infinitely in the direction 
,
      
 and 
 are the maps that generate 
 and 
 infinitely after 
 and 
 have been generated into infinity. Subscript 
 has been used, here, to mark the initialization points. This shows that 
 can be written as a sequence of iterating functional operations that are all defined on 
, and 
 defined on 
 in a similar way (Equation (
5) reveals that the sequences that constitute the directly caused parts of 
 and 
 are ultimately dependent on the values at which the observable process has been initialized. That is, the entire causal pathway depends on the initial impact. In practice, one cannot observe all impacts, including those that occurred in the infinite past, and assurance is required that the initialization effect of the causal pathway is, asymptotically, irrelevant). For ease of notation, let us write
      
      where bold-faced 
 is used to refer to the entire sequence of functional operations 
 up to 
t, starting in the infinite past 
. This highlights that generating the unobserved quantities 
 and 
 from the observed quantities 
 and 
 by back substitution eventually involves the unobserved quantities 
 and 
. This means that some feasible form of approximation is needed, since time series data are, in practice, almost never recorded from the beginning of the process.
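The practical point, that the effect of an unobserved initialization should die out, can be illustrated with a simple contraction-type recursion; the coefficient and starting values below are hypothetical.

```python
# Under a contraction (|a| < 1), two copies of the same recursion started at
# different values converge to the same path, so a truncated back-substitution
# with an arbitrary initialization is asymptotically harmless.
import numpy as np

rng = np.random.default_rng(3)
T, a = 200, 0.7
shocks = rng.normal(size=T)                 # common shock sequence
y_from_zero, y_from_ten = 0.0, 10.0         # two different initializations
gap = []
for t in range(T):
    y_from_zero = a * y_from_zero + shocks[t]
    y_from_ten = a * y_from_ten + shocks[t]
    gap.append(abs(y_from_ten - y_from_zero))
print(gap[0], gap[50], gap[-1])             # the gap decays geometrically toward zero
```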
Note first that 
 is a 
-measurable mapping, and 
 is a 
-measurable mapping. The sequence 
 thus lives on 
, where 
 is induced according to 
, and 
 lives on 
, where 
 is induced according to 
, see [
47] p. 118 and [
48] p. 115. The notation shows that the probability measures underlying the stochastic causal sequences result from the functional behavior of the entire system. In particular, the causal sequences can be written as recursive direct effects from another variable that itself consists of memory and causal effects, and the probability measures underlying the causal sequences are thus induced by the functional relationships that describe all dynamical dependencies. This is important to the extent that many causal studies focus on one single marginal dependency, while, from the measure-theoretic perspective developed here, the wider system within which any single process operates is of importance to the analysis. This suggests that researchers must pay attention to referencing the workings of a broader system when designing their models for inference, something [33] has also argued. Moreover, it has been argued (see [49] for discussion) that probabilistic definitions of causality are not strictly causal in the sense that they do not provide insight into the origin of the probability law that regulates the process of interest, and that a (correct) time-series model only describes (correctly) the probabilistic behavior as the outcome of that unknown causal origin. The notation here, however, explicitly shows the relation between the functional behavior of a system and its induced probability measure, which assigns probability to all possible outcomes. This suggests that such critiques relate, rather, to disagreements around the level of detail in the structure of a model, which in turn would be guided by the research question of interest and the availability of detailed data. In particular, dynamical systems in economics are often modeled using aggregate macro-economic data that do not have the same granularity as micro-economic data containing information about the behaviors of individual economic agents.
In many cases, a researcher is not able to observe all the relevant variables. When a third, possibly unobserved external variable, 
, with effect 
, is considered, the researcher is confronted with the situation that
      
	  If 
 is unobserved, it can still be approximated as a difference combination of 
 and 
. To obtain an approximated sequence of the 
true  sequence to condition empirical counterparts for 
 and 
 on, one can work with: 
	  Equation (
8) suggests writing Equation (
7) in terms of 
 and 
 only by defining 
 as a difference combination of 
 and 
 (Apart from stability conditions imposed on the endogenous process, one requires also that the exogenous impacts enter the system in some suitable manner, which, for example, requires that 
 and 
 are appropriately bounded. Following the same arguments that resulted in Equation (
5), the initialization of the exogenous impacts 
 should similarly not carry information influential in the empirical estimates of 
 and 
, conditional on partial information). This allows us to define the spaces and measures in terms of 
 and 
 when the multivariate process includes further variables, in this case, 
. If the process is invertible, one can write, by aggregating the functions:
	  For every 
, the map 
 is 
-measurable and 
 lives on the space 
 where the probability measure 
 is induced by 
 on 
 according to the point-wise application of 
 and the inverse of 
. (
). Similar arguments follow for 
. This tells us that, in the general case of multivariate dependencies and in the presence of possibly unobserved variables, the probability measures underlying the individual sequences are possibly a result of those of the other sequences. This means that the space of empirical candidates for the probability measure 
 that underlies the joint process 
 operates on 
. (The sequence realizes under the events 
, 
, where 
 and 
, with 
, and the probability measure of the joint process 
 is thus defined on the product 
-algebra 
 (see, [
47] p. 119)).
Regardless, the measure 
 is induced by the functional relations of Equation (
2), which, as was shown, can be decomposed into memory and causal subsystems. One can thus state causality conditions based on the measures that describe the directly caused effects represented by Equation (
6). In particular, one can keep the focus on 
 and 
, bearing in mind that they are lower-level constituents of 
 on which, in turn, the complete estimation objective will be defined.
Definition 1 (Non-causality). The stochastic sequences  and  are not causally related if  and  are null measures, such that  and .
 Definition 2 (Uni-directional Causality). Causality runs uni-directionally from the stochastic sequence  to another stochastic sequence  (vice versa), if  is a null measure, and  is a non-null measure, such that  and  (vice versa).
 Definition 3 (Bi-directional Causality). The stochastic sequence  is causal with respect to  and  is causal with respect to , if  and  are both non-null measures, such that  and .
 Respectively, conditioning on impacts in , these probabilistic causality definitions can thus be understood broadly as:
- 1.
- Whenever an intervention in  occurs, there is no chance that  reacts as a result of that. 
- 2.
- Whenever an intervention in  occurs, there is positive chance that  reacts as a result of that. 
- 3.
- Whenever an intervention in  occurs, there is positive chance that  reacts as a result of that. Subsequently there is positive chance that  reacts to this initial reaction, a probabilistic process that repeats recursively. 
Remark 1. By null measures, it is meant that the stochastic sequence describing the directly caused effects from one variable to the other takes values in the empty set with probability 1. This is because the functions that induce the probability measure cancel out; hence, they can be removed from the equations, resulting in a probability measure that is not induced by any remaining rule or relationship. In practice, one can test whether  or , where  here denotes the probability measure induced by the functional relationships in Equation (1) and  denotes the probability measure induced by the functional relationships in Equation (2), to test whether  exists. A practical test is a Kolmogorov–Smirnov-type test.
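As an illustration of such a test, the sketch below compares samples generated under two hypothetical fitted systems, one without and one with the cross-variable term, using a two-sample Kolmogorov–Smirnov test; the simulators are stand-ins for whatever restricted and unrestricted models a researcher has actually estimated.

```python
# A minimal Kolmogorov-Smirnov-type comparison, assuming hypothetical fitted simulators.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

def simulate_restricted(n):     # placeholder for a fitted system without cross effects
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = 0.8 * y[t - 1] + rng.normal()
    return y

def simulate_unrestricted(n):   # placeholder for a fitted system with a direct causal term
    x = rng.normal(size=n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = 0.8 * y[t - 1] + 0.3 * x[t - 1] + rng.normal()
    return y

n = 5000
sample_restricted = simulate_restricted(n)
sample_unrestricted = simulate_unrestricted(n)

# Test whether the two induced (marginal) distributions coincide; a small p-value
# argues against the no-causality system as a description of the data.
stat, p_value = ks_2samp(sample_restricted, sample_unrestricted)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")
```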
  3. Limit Divergence on the Space of Modeled Probability Measures
The definitions of causality, in terms of the lower-level components of , suggest that correct causal statements can be obtained empirically by extracting relevant counterparts to  and  from a relevant counterpart to , and investigating the stochastic sequences produced by these modeled measures. For such an approach to be of relevance in an empirical context, one must ensure that the concepts introduced adequately transfer over from the true measure  to a modeled measure . The focus therefore shifts toward detailing how  can be approximated as a minimally divergent measure relative to , and toward drawing on approximation theory to construct equivalence around the true measure under an axiom of correct specification.
For some event 
, a realized 
T-period sequence 
 consisting of sequences 
 and 
 can be observed. The 
true function 
, consists of our main functions of interest 
 and 
 that in turn are composed of 
 and 
 that are of particular interest to the researcher focused on causality, but possibly also functions 
 and 
 that shape the responses of an initial causal effect. The exact properties are generally unknown to the observer, but one can design a parameterization mapping that learns the behavior of 
 and 
 when exposed to sufficient data. To learn from the data an approximation of 
 and 
, one can postulate a model
      
      with 
f:
 as our postulated model function and 
 as the modeled data. In the context of parametric inference, the parameter space 
 is of finite dimensionality; in the nonparametric case, too, the vector 
 indexes parametric models nested by the nonparametric model, each inducing its own probability measure, and 
 indexes families of parametric models, each inducing a space of parametric functions generated under 
. In this discussion, a compact set of potential hypotheses is considered, limiting the inference to parametric models. The arguments can be extended to the nonparametric case by focusing on a compact subset 
 of solutions (For example, by letting 
 grow as 
, hence focusing on the case 
, see for example [
50]). For example, by using priors or penalties that discard 
 such that any solution of the criterion necessarily falls within a compact subset space, see [
20] p. 210 and [
24]. Let 
f be 
-measurable 
 so that 
 is 
-measurable 
 and 
. 
 is our space of parametric functions defined on 
 generated under 
 under the injective 
 where 
. Under any 
true probability measure 
, every potential parameter vector included in the parameter space 
 induces a probability measure 
 indexed by 
 on 
, according to 
. Thus, for every potential parameter vector included in the parameter space 
, there is a triplet 
 that describes the probability space of modeled data under 
. The triplet 
 is, thus, itself an element of the measure spaces indexed by 
 across all 
. Given the 
true probability measure 
 on 
, this process is summarized by a functional 
, that maps elements from the space of parametric functions generated by the entire parameter space 
, onto the space 
 of probability measures defined on the sets of 
 generated by 
 through 
.
Now, 
 is generally not only unknown, but for a finite 
 there is no guarantee that 
, implying that, in many empirical applications, one is concerned with the situation where 
. However, if 
, one can learn all about 
 by uncovering the properties of 
f, given that a sufficient number of observations is available. (As discussed in the literature on mis-specification, even when the axiom of correct specification is abandoned, 
f may converge to a function that produces the optimal conditional density, which may reveal properties of 
). Let
      
, be the extremum estimate for 
 as judged by the criterion 
. Trivially, 
 and 
. To see that under correct specification it is possible to approximate the 
true function 
 in terms of equivalence (in the sense of function equivalence [
51] p. 288), one can write the criterion function also as a function of the 
true function and the postulated model 
 in which use is made of the fact that 
 and 
.
The discussion further evolves toward showing that the element in  that is closest to  minimizes a divergence metric that results from a transformation of the limit criterion that measures the divergence between the true density and the density implied by the model. Note that  is induced by the proposed candidates for ; studies on causality thus rely on flexible model design, as the researcher determines which hypotheses are considered in a study by exerting control over . Naturally, if , then  produces a larger . This suggests that minimizing this divergence metric over as large a  as possible results in selecting  at a point in  that attains equivalence to  only when  is large enough to produce a correctly specified hypothesis set. Note that , as our space of parametric functions generated under , under the injective  and the functional  that induces the space of probability measures, is defined on the sample space . This highlights that the correct-specification argument, , not only stresses flexible parameterization in the sense that parameterized dependencies can take on many values, but also in the sense of using the correct data (Indeed, the potential parameters that would interact with data that is not used are essentially treated as zero, so the focus on using correct data is implicitly already contained in the standard statements of correct specification that focus directly on the dimensions of . The distinction is nevertheless useful because nonparametric models are often popularized as methods to reduce mis-specification bias as  becomes infinite dimensional, but this does not imply that  if important data is missing). When little is known about f, one is thus not only concerned with flexibility in terms of the type of parametric functions generated under , but also with the variables on which the modeled measures are defined. When these concerns are appropriately addressed, testing for causality amounts to deciding, based on the approximation , whether the best approximation of the true model suggests (1) that  and  live in isolation, (2) unidirectional causality, or (3) that  produces feedback.
To turn this problem into a selection problem that can be solved by divergence minimization  w.r.t. the true measure, first introduce the limit criterion by taking  and working with the modeled data as the minimizer of the criterion. Specifically, let the limit criterion be  evaluated at  with  and  with the criterion  as a measure of divergence  on the true probability measure and the modeled measure. More specifically, . By definition of  as a divergence on the space that contains  and , the element  is thus the minimizer of that divergence.
Moreover,  in the parameter sense,  in the function sense (in terms of a divergence metric on the true function), and  in the measure sense (in terms of a divergence metric on the true probability measure), are equivalent limits under the same consistency result. To see this, it is convenient to focus once more on the target and write , with , to make clear that the criterion establishes a divergence  on , which is, in turn, induced by  through  according to . This ensures that our statement on the probability measure is relevant under standard consistency results that are focused on the convergence of an estimated parameter vector toward , while, equivalently, the impulse response functions (IRFs) converge to the true IRFs at . This implies that deciding between Definitions 1–3 can be read from the responses produced by the IRF that minimizes divergence  w.r.t. the true IRF.
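As a concrete illustration of reading causal direction from impulse responses, the sketch below iterates a hypothetical fitted linear map on unit impulses and inspects the cross-variable responses; the coefficient matrix is illustrative only and is not an estimate from the paper.

```python
# Reading causal direction from the impulse responses of a hypothetical fitted linear map.
import numpy as np

A_hat = np.array([[0.6, 0.0],               # hypothetical fitted coefficient matrix,
                  [0.4, 0.5]])              # ordered (x, y); the x -> y entry is 0.4

def impulse_response(A, shock, horizon=10):
    state, path = np.array(shock, dtype=float), []
    for _ in range(horizon):
        path.append(state.copy())
        state = A @ state
    return np.array(path)

irf_x_shock = impulse_response(A_hat, [1.0, 0.0])
print(irf_x_shock[:, 1])    # nonzero responses of y to an x impulse: evidence of x -> y
irf_y_shock = impulse_response(A_hat, [0.0, 1.0])
print(irf_y_shock[:, 0])    # all zeros here: no feedback from y to x in this fitted map
```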
Not necessary, but convenient for a proof that holds easily in practical situations, is to assume the existence of a strictly increasing function  that provides a transformation of the limit criterion into a metric, , with r being continuous and strictly increasing. For convenience, all assumptions are summarized in Assumption 1.
Assumption 1. For a limit criterion  of the form ,  is a divergence. Assume there exists a continuous strictly increasing function  such that  is a metric. The functional  is injective and .
 Proposition 1. Assume 1, then the following are equivalent limits:
- 1.
- , 
- 2.
- , 
- 3.
- , 
- 4.
- , 
- 5.
- . 
 Remark 2. Dropping the axiom of correct specification implies , hence, the equivalences of 3–5 are now  w.r.t. item 2.
 The equivalences in Proposition 1 ensure that, for a correctly specified model , the element  results not only in functional equivalence between the model and the true model (item 3), but also in zero divergence between the probability measures  and  (item 4). Moreover, it follows that at , the empirically estimated probability measure  is equivalent to  in the sense that there is zero distance between the two (item 5).
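A small numerical illustration of these equivalences under correct specification may help: fitting a Gaussian model to Gaussian data (with illustrative parameter values) drives the divergence between the fitted and true measures toward zero as the sample grows.

```python
# Under correct specification, the KL divergence between the fitted and true
# measures shrinks toward zero as T grows; parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def kl_gauss(mu0, s0, mu1, s1):
    # KL( N(mu0, s0^2) || N(mu1, s1^2) )
    return np.log(s1 / s0) + (s0**2 + (mu0 - mu1) ** 2) / (2 * s1**2) - 0.5

for T in (100, 10_000, 1_000_000):
    sample = rng.normal(loc=1.0, scale=2.0, size=T)
    mu_hat, s_hat = sample.mean(), sample.std()     # Gaussian maximum likelihood estimates
    print(T, kl_gauss(1.0, 2.0, mu_hat, s_hat))     # divergence shrinks toward zero with T
```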
Remark 3. Proposition 1 is applicable to a large class of extremum estimators, even those not initially conceived as minimizers of distance. In particular, it is often possible to find a divergence on the space of probability measures. For example, method-of-moments estimators are naturally defined in terms of features of the underlying probability measures. In Section 4, an example is given using Kullback–Leibler divergence, for which penalized likelihood is an estimator. In this case, the squared Hellinger distance can be shown to be a lower bound.
Corollary 1 now delivers that our definitions, set on the 
true measures, transfer to modeled probability measures in the limit for correctly specified cases. It is well known that standard consistency proofs also apply to approximate extremum estimators; therefore, assuming additionally that 
a.s., is sufficient for a consistency result together with the uniqueness of 
 within the compact hypothesis space 
 (Note that, under the axiom of correct specification, consistency results require suitable forms of stability defined on the process rather than on the data. While we have loosely remarked on the fact that the non-parametric case of an infinite-dimensional 
 is easily allowed, stability of highly nonlinear multivariate time series is a difficult separate topic. Regardless, Refs. [
44,
45] provide ergodicity results for a large class of nonlinear time series, including non-parametric ones. The conditions require the nonlinearities to be sufficiently smooth. Specific stability results have also been established for certain neural network models, for example by [
52]). This implies that our causality conditions on the 
true measures transfer to the approximation not only in the limit, but also for large 
T under standard regularity conditions. Essentially, this is the setting considered by Ref. [
11]. Summarized:
Corollary 1. Given a true probability measure , and an equivalent modeled probability measure  in the sense that , there are four possibilities for causality:
- 1.
- There is no causation if  and  adhere to Definition 1. 
- 2.
-  causes  if the probability measure  adheres to Definition 2. 
- 3.
-  causes  if the probability measure  adheres to Definition 2. 
- 4.
- There is bi-directional causality if  and  adhere to Definition 3. 
 Finally, in the case of a mis-specified model, Proposition 2 implies that the divergence between the optimal probability measure, as judged by the criterion, and the true probability measure attains a minimum at a strictly positive value . In this case, the quantity  determines how “close” the empirical claim is to the true hypothesis about causality. While it is difficult to make claims about this quantity, it is evident that minimizing  may involve widening  in the direction of  by increasing the dimensionality of  and allowing flexibility while investigating a wide range of data. Disregarding the value of , the following holds.
Proposition 2. If , then . However,  is still the pseudo-true parameter that minimizes  over Θ. Therefore,  is the probability measure minimally divergent from  within . As such, it follows that, from all the potential probability measures in , the measure closest to  is supportive of one out of  in Corollary 1, based on the properties of  and  as the best approximations.  provides the best approximation of the true causal measure across all the hypotheses considered.
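To illustrate Proposition 2, the sketch below fits a Gaussian model to data drawn from a two-component mixture (all values are illustrative): the fitted, pseudo-true measure remains at a strictly positive divergence from the truth, yet it is the closest measure within the Gaussian family considered.

```python
# Under mis-specification, the divergence attains a strictly positive minimum at
# the pseudo-true parameters; the mixture and sample size are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 500_000
component = rng.random(n) < 0.5
data = np.where(component, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 1.0, n))

mu_hat, s_hat = data.mean(), data.std()       # Gaussian pseudo-true (MLE) parameters

def log_mixture_pdf(x):
    return np.log(0.5 * norm.pdf(x, -2.0, 1.0) + 0.5 * norm.pdf(x, 2.0, 1.0))

def log_gauss_pdf(x, mu, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x - mu) ** 2 / (2 * s**2)

# Monte Carlo estimate of KL(true || fitted Gaussian): positive and bounded away from zero.
kl_estimate = np.mean(log_mixture_pdf(data) - log_gauss_pdf(data, mu_hat, s_hat))
print(mu_hat, s_hat, kl_estimate)
```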
 This leads to the following collection of results.
Corollary 2. Given a true probability measure , and a non-equivalent, but pseudo-true modeled probability measure, , in the sense that  has attained a non-zero minimum, there are four possible optimal hypotheses about causality, as judged by the criterion:
- 1.
- There is no causation if  and  adhere to Definition 1. 
- 2.
-  causes  if the probability measure  adheres to Definition 2. 
- 3.
-  causes  if the probability measure  adheres to Definition 2. 
- 4.
- There is bi-directional causality if  and  adhere to Definition 3. 
 Respectively, conditioning on interventions in , the results can be understood as:
- 1.
- Whenever an intervention in  occurs, our best hypothesis is that there is no chance that  reacts as a result of that. 
- 2.
- Whenever an intervention in  occurs, our best hypothesis is that there is positive chance that  reacts as a result of that. 
- 3.
- Whenever an intervention in  occurs, our best hypothesis is that there is positive chance that  reacts as a result of that, and these interactions continue to repeat with positive probability. 
  4. Limit Squared Hellinger Distance
Both Corollaries 1 and 2 assume that an appropriate transformation of the limit criterion exists that provides us with a metric or norm. This assumption allows us to make use of the classical theorems on existence and uniqueness of best approximations that have been naturally obtained for metric, normed, and inner product spaces [
53]. While this retains the simplicity of the argument, it also shows that a direct interpretation of Corollaries 1 and 2 can be obtained within the framework of maximum likelihood. Let us first define the criterion function as the maximum likelihood estimator:
	  Note that this conforms to 
 with 
 and 
. It can be shown that, under this definition with 
, the criterion 
 is a measure of divergence 
 on the 
true probability measure and the modeled measure. Specifically, we can introduce a divergence 
 as follows. Let 
 and 
 be, respectively, the 
true density evaluated under the 
true parameter and a modeled density at 
, evaluated under the estimated parameter, both at time 
t, with respect to the Lebesgue measure (such that they are probability density functions); then the following is a divergence from the true probability measure to the modeled probability measure (Kullback–Leibler divergence, see [
54]):
	  Naturally, 
 with equality if and only if 
 almost everywhere, i.e., when the probability measures are the same (this is known as Gibbs' inequality and can be verified by applying Jensen's inequality).
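A quick numerical check of this inequality for two small discrete distributions (chosen arbitrarily for illustration):

```python
# Gibbs' inequality: the Kullback-Leibler divergence is non-negative and equals
# zero only when the two distributions coincide; p and q are arbitrary examples.
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])

def kl(p, q):
    return np.sum(p * np.log(p / q))

print(kl(p, q))   # strictly positive
print(kl(p, p))   # exactly zero
```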
Kullback–Leibler divergence is not a distance metric of the kind used in Corollaries 1 and 2 to establish equivalences by partitioning into classes of zero-distance points. In particular, it is asymmetric
      
      and the triangle inequality is also not satisfied. However, it has the product–density property
      
      for 
, and 
 defined similarly. Hence, the MLE is an unbiased estimator of minimized Kullback–Leibler divergence:
	  Note that, under standard assumptions, a law of large numbers can be applied to obtain the convergence; hence, by maximizing the log likelihood, we minimize Kullback–Leibler divergence. Now, we need either to find a continuous scaling function, 
r, to ensure that it also minimizes the distance between the 
true measure and the modeled measure so that we may reach zero at 
. Alternatively, we can find the distance metric directly. We argued above that Kullback–Leibler divergence is not a proper distance (in particular, it is not symmetric and does not satisfy the triangle inequality). It is, however, notably useful to specify 
 directly as the Hellinger distance between a modeled probability measure and the true probability measure [
55]:
Specifically, the squared Hellinger distance provides a lower bound for the Kullback–Leibler divergence. Therefore, maximizing the log likelihood implies minimizing Kullback–Leibler divergence, which in turn implies minimizing the Hellinger distance. This is easily seen by the following:
Proposition 3. The squared Hellinger distance provides a lower bound to Kullback–Leibler divergence: 
Remark 4 below highlights that these notions do not just apply to the standard real-valued time series settings considered by Granger, but can also apply to the explicit probability modeling of binary outcomes. Remark 4 further clarifies a result that has so far only been presented implicitly: that the probabilistic truth identified at the discussed zero-distance point may allow for a base level of entropy to exist even when all functional relationships in the process have been accounted for in a model.
Remark 4. While the paper has implicitly alluded to modeling continuous real-valued processes through the notational conventions, the connections between true probability and modeled probability are also easily made by focusing on an explicit binary outcome problem. Define cross-entropy for two discrete probability distributions p and q with the same support :
in which  is Kullback–Leibler divergence, or the relative entropy of q with respect to p, and  is the entropy of p. Now if  and , we can rewrite cross-entropy:
or, for predictions generated under a set of parameters  and a predictor x, as
Remember that the maximum likelihood estimator maximizes the likelihood of the data under some probabilistic model. The correct likelihood in the case of binary classification is Bernoulli:
which results in the likelihood function
Taking logs then gives the following log likelihood function
This shows that the negative log likelihood is proportional to Kullback–Leibler divergence and differs by the basic entropy in the data, which is constant. Maximizing the likelihood of a binary model can, thus, be understood as minimizing statistical distance toward a true probability measure; the minimum value is determined by the entropy in the observed data.
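The identity discussed in Remark 4 can be checked numerically: for Bernoulli data with a hypothetical true probability, the average negative log likelihood under a model probability equals the entropy of the data plus the Kullback–Leibler divergence from the truth to the model, and is minimized at the truth, where only the entropy remains.

```python
# Numerical check: average Bernoulli negative log likelihood = entropy + KL divergence.
# The true probability and candidate model probabilities are illustrative.
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.3
y = rng.binomial(1, p_true, size=200_000)

def avg_neg_log_likelihood(q):
    # average Bernoulli negative log likelihood under model probability q
    return -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))

def entropy(p):
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def kl(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

for q in (0.1, 0.3, 0.6):
    print(q, avg_neg_log_likelihood(q), entropy(p_true) + kl(p_true, q))
# The two columns agree up to sampling noise; at q = p_true the divergence term
# vanishes and the average negative log likelihood equals the entropy of the data.
```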