Causal Geometry

Pavel Chvykov; Erik Hoel

doi:10.3390/e23010024

and

¹

Physics of Living Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

²

Allen Discovery Center, Tufts University, Medford, MA 02155, USA

^*

Author to whom correspondence should be addressed.

Entropy2021, 23(1), 24;https://doi.org/10.3390/e23010024

This article belongs to the Collection Feature Papers in Information Theory

Version Notes

Order Reprints

Abstract

Information geometry has offered a way to formally study the efficacy of scientific models by quantifying the impact of model parameters on the predicted effects. However, there has been little formal investigation of causation in this framework, despite causal models being a fundamental part of science and explanation. Here, we introduce causal geometry, which formalizes not only how outcomes are impacted by parameters, but also how the parameters of a model can be intervened upon. Therefore, we introduce a geometric version of “effective information”—a known measure of the informativeness of a causal relationship. We show that it is given by the matching between the space of effects and the space of interventions, in the form of their geometric congruence. Therefore, given a fixed intervention capability, an effective causal model is one that is well matched to those interventions. This is a consequence of “causal emergence,” wherein macroscopic causal relationships may carry more information than “fundamental” microscopic ones. We thus argue that a coarse-grained model may, paradoxically, be more informative than the microscopic one, especially when it better matches the scale of accessible interventions—as we illustrate on toy examples.

Keywords:

model selection; causality; sloppy models; information geometry; effective information

1. Introduction

Many complex real-world phenomena admit surprisingly simple descriptions, from the smallest of microphysics to aspects of economics [1,2,3]. While this is not seen as entirely coincidental, the precise reasons for this fortunate circumstance are not entirely understood [4,5]. Two complementary solutions may be sought: On the one hand, we may hypothesize that this is an objective property of nature, thereby looking for some mechanism common among complex systems that allows them to be well-described with only a few parameters [3,6]. On the other, we may guess that it is a subjective property of our perception and then try to formalize the process by which we find useful patterns in arbitrarily complex systems [5,7,8]. In the recent years, substantial progress has been made in developing both of these perspectives, grounded in information theory [9].

One compelling argument for the first hypothesis has been made from information geometry. The approach starts by associating with any given model a particular “model manifold,” whose geometric properties can tell us whether and which simplifications can be helpful [6,10]. It turns out that for many real-world models, this manifold is highly anisotropic, having a hierarchical hyper-ribbon structure [11]. This property, termed “sloppiness,” indicates that only a few of the many microscopic model parameters are actually important for the model predictions—thus allowing for model simplification [12,13]. While it is not yet clear how general this property is, sloppiness was empirically illustrated in a number of biochemical and physical models and argued for on some general grounds [6,14,15]. This way, sloppiness provides an explanation for how emergent simplicity may be an objective property of complex systems themselves.

The second perspective instead takes as its starting point the well-known aphorism that “all models are wrong, but some are useful” [7]. One way to see this is as a rejection of the reductionist notion that a “fundamental” microscopic description is the “correct” one, while all emergent system properties are derivatives from it [2,16]. Instead, if no model is seen as inherently best, then we can only compare models by their efficacy in predicting and controlling a given system. This amounts to finding an effective causal description of the system—such that distinct interventions causally translate to unique effects. In this perspective, simple descriptions may arise in part from our efforts to match the system description to how we interact with it [8,9].

Despite the fundamental role causality plays in science and its applications, its formal study has been scarce and elusive, until recently [17]. Especially in complex systems, even conceptually defining what it means for something to be the cause of an observed phenomenon may be surprisingly tricky [18], let alone empirically verifying this [19,20]. In particular, the way models are studied in sloppiness makes no distinction between parameters as causes of the measured data, or merely as its phenomenological descriptors (e.g., slope of a linear fit) or statistical correlates.

Consider for example some bacterial population, whose size

y (t) = e^{θ t}

at some time t depends on its growth rate

θ

. We cannot say that

θ

is the cause of a large population; it remains merely a descriptor of the exponential fit, until we introduce an additional ingredient: “intervention capabilities”—some (at least hypothetical) way in which we can control this parameter and observe its effects. We can formalize this by recasting our system as a Bayesian dependency network and defining the

d o

-operator [17]:

d o (θ = θ_{0})

, which isolates the effects of actively setting the parameter

θ

to value

θ_{0}

, regardless of any other confounding factors that might influence it (such as if

θ

were also affected by t). The role of the

d o

-operator is to distinguish the effect of a given intervention from that of all other possible interventions (or a lack thereof), thereby allowing a rigorous counter-factual definition of causality. This allows reliably disentangling causal dependencies

p (y | d o (θ))

from mere statistical ones

p (y | θ)

[21]. See Appendix A for a worked out example explicitly illustrating how this distinction arises.

Building on this, Judea Pearl developed the “causal calculus” framework, providing a set of tools to reliably work with intricate causal structures [17]. Further combining causal calculus with information theory allowed rigorously quantifying the amount of Effective Information (

E I

) in the causal structure of a given model [22]. In a phenomenon termed “causal emergence,” it was then pointed out that in some systems, a dimension-reduced description (such as a coarse-graining) may paradoxically carry more information about causation than the full microscopic model. This is because dimension reduction can sometimes yield a substantial reduction in noise and degeneracy [23,24].

In this work, we extend the causal emergence framework to continuous systems and show that in that context, it is naturally related to information geometry and sloppy models. This leads to a novel construction, which we term causal geometry, where finding the causally most informative model translates to a geometric matching between our intervention capabilities and the effects on system behaviors, both expressed as distance metrics on the model’s parameter space. This framework captures precisely how the inherent properties of the system’s behavior (its sloppiness structure) and their relation to its use context (matching to intervention capabilities) both play a role in optimal model selection, thereby reconciling the two above perspectives. This comes up because, on the one hand, intervention capabilities are never unlimited in their degree or fineness and perfection of control, and on the other, because the behaviors of open physical systems are never without noise. This helps to formalize how neither the “simplest” nor the most “fundamental” reductionist model may be universally seen as preferable.

We take a moment here to further clarify how this work fits within the current literature context. Besides sloppiness and causal emergence, there have been other approaches developed for optimal model selection [25,26]. For example, Bayesian inference allows choosing the optimal modeling level by maximizing the posterior likelihood over a predefined library of models [27]. In contrast, the sloppiness approach lacks a scalar quantity to be optimized for the best model and only provides a heuristic for reducing unnecessarily complex models [6]. To mitigate this, in [13], the authors optimized mutual information to find the set of coarse-grained parameters on sloppy manifolds that could reasonably be constrained by some limited experimental observations, thus selecting the optimal modeling level for the available data. The mathematical procedure thus carried out closely paralleled that developed earlier in the causal emergence for discrete systems [23,24]. In our work, we use this formal resemblance to understand the connection between these two distant fields: sloppy models and causal calculus. This allows giving a continuum formulation of causal emergence, as well as a novel local measure of causal optimality. On the other side, our work establishes the proper formal role of interventions and causality in sloppiness, potentially resolving a long-standing formal challenge around the non-covariance of metric eigenvalues [28] and showing that not only the hyper-ribbon manifold structure, but also its relation to intervention capabilities account for the emergence of simple models.

In Section 2, we define the Effective Information (

E I

) for continuous models, which captures the amount of information in the model’s causal relationships. We illustrate it on a simple example (Section 2.1) and show how restricting the set of allowed interventions may sometimes, surprisingly, make the causal model more informative. Section 3 then introduces causal geometry. Specifically, it relates the continuous

E I

to information geometry, introducing a local geometric measure of causal structure, and providing a way to find the locally most effective model for a given set of intervention capabilities using the techniques of information geometry. We demonstrate our construction on another simple toy model in Section 4, showing how causal emergence can arise in our geometric formulation, subject to the given interventional and observational capabilities.

2. Effective Information in Continuous Systems

For the purposes of this work, we formalize a causal model as a set of input-output relations, or more precisely, a map from all possible interventions to the full description of all effects within the context of some system [24]. While the set of all hypothetically possible interventions on a given physical system is enormous and impractical to consider (involving arbitrary manipulations of every subatomic particle), the set of experimentally doable (or even considered) interventions for a given context always represent a much smaller bounded space

X

, which we refer to here as “intervention capabilities.” Similarly, while an intervention will lead to uncountable microscopic physical effects, the space of specific effects of interest

Y

is much smaller and often happens to be closely related to the intervention capabilities. All causal models by definition use some such subset of possibilities, and it is common in the literature around causation to restrict the set of hypotheticals, or counterfactuals, within a causal model [29]. In this work, we will illustrate how finding the optimal causal model for a given system is about a matching between the system behavior and the intervention capabilities considered.

As the focus of this paper is on continuous systems, we consider

X

and

Y

to be continuous spaces, with points

x \in X

and

y \in Y

. To formally discuss the causal model of our system, we make use of the

d o (x)

operator, as per Judea Pearl’s causal calculus [17]. This operator is defined for any doable intervention the experimenter is capable of either performing or modeling, allowing assessing its causal effects. This allows us to formally describe a causal model as a map:

\begin{matrix} x \to p (y | d o (x)) \end{matrix}

(1)

where p is the probability density over effect space

Y

resulting from “doing” the intervention x. Note that this is distinct from

p (y | x)

in that the

d o

operator allows us to distinguish the correlation introduced by the causal relation

x \to y

from one due to a common cause

a \to {x, y}

.

The notion of causality is then formalized as a counterfactual: How does the effect of

d o (x)

differ from the effect of doing anything else? This latter “null effect” may include interventions such as

d o (\neg x)

, but it can also include all other possible interventions: it is thus formally described by averaging together the effects of all considered intervention capabilities

X

, giving the total “effect distribution”:

\begin{matrix} E_{D} (y) = {⟨p (y | d o (x))⟩}_{x \in X} \end{matrix}

(2)

This way, to know precisely which effects

d o (x)

causes, we can compare

E_{D} (y)

to

p (y | d o (x))

. The distinguishability between these distributions may be captured with the Kullback–Leibler divergence

D_{K L} [p (y | d o (x)) ∥ E_{D} (y)]

, giving us the amount of information associated with the application of an individual

d o (x)

intervention [23,30].

Averaging over all accessible interventions gives the information of the system’s entire causal structure, termed the total “effective information”:

\begin{matrix} E I = {⟨D_{K L} [p (y | d o (x)) ∥ E_{D} (y)]⟩}_{x \in X} \end{matrix}

(3)

Discrete versions of this effective information have been explored in Boolean networks [23] and graphs [31]. Note that the definition of

E I

here is identical to the mutual information between the uniform distribution over interventions

I_{D} (x) = c o n s t

and the resulting distribution over effects

E_{D} (y)

, so that:

E I = I (I_{D}; E_{D})

[22,24].

We proceed to illustrate with a simple example how the EI varies across families of simple physical systems. As such, we show how it may be used to select the systems that are in some sense “best controllable,” in that they best associate unique effects with unique interventions [32]. Additionally, this example will help us illustrate how the EI may sometimes allow us to identify a coarse-grained system description that is more informative than the full microscopic one—thus illustrating causal emergence [24].

2.1. Toy Example: Dimmer Switch

Consider a continuous dimmer switch controlling a light bulb, but with an arbitrary non-linear function

y = f (θ)

, a “dimmer profile,” mapping from the switch setting

θ \in Θ = [0, 1]

to the light bulb brightness

y \in Y = [0, 1]

(Figure 1a). To quantify information about causation in continuous systems, we must carefully account for noise and errors in our inputs and outputs; else, infinite precision leads to infinite information. This is an issue for the application of all mutual information measures or their derivatives in deterministic continuous systems. Realistically, in operating a dimmer switch, any user will have certain “intervention error” on setting its value, as well as “effect error,” which can come either from intrinsic system noise or from extrinsic measurement error. To encode the effect error, we can replace the deterministic mapping

θ \to y = f (θ)

with a probabilistic one

θ \to p (y | d o (θ)) = N_{y} (f (θ), ϵ^{2})

—the normal distribution centered on

f (θ)

and with standard deviation

ϵ

. While we could incorporate intervention error of setting

θ

into this probability distribution as well, it is instructive for later discussion and generality to keep it separate. The intervention error is thus similarly encoded by introducing a probabilistic mapping from the “do”-able interventions

x \in X = [0, 1]

to the physical switch settings with some error

δ

, as

x \to q (θ | d o (x)) = N_{θ} (x, δ^{2})

. Here, we can think of the interventions

x \in X

as the “intended” switch settings, as in practice, we cannot set the switch position with infinite precision. Note that while we do not explicitly model any possible confounding factors here, we assume that these may be present and important, but are all taken care of by our use of the

d o

-operator. This ensures that only true causal relations, and not spurious correlations, are captured by the distributions p and q.

Figure 1. Illustrating continuous Effective Information (EI) on a simple toy system. (a) shows the system construction: a dimmer switch with a particular “dimmer profile”

f (θ)

. We can intervene on it by setting the switch

θ \in (0, 1)

up to error tolerance

δ

, while effects are similarly measured with error

ϵ

; (b) shows that for uniform errors

ϵ = δ = 0.03

, out of the family of dimmer profiles parametrized by a (left), the linear profile gives the “best control,” i.e., has the highest EI (where dark blue—numerical EI calculation and light blue—approximation in Equation (4)); (c) illustrates how for two other dimmer profiles (left), increasing error tolerances

ϵ = δ

influence the EI (right, calculated numerically). The profile in red represents a discrete binary switch—which emerges if we restrict the interventions on the blue dimmer profile to only use “ends of run.” Crucially, such coarse-graining allows for an improved control of the light (higher EI) when errors are sufficiently large.

With this setup, we can now use Equation (3) to explicitly compute an EI for different dimmer profiles

f (θ)

and see which is causally most informative (has the most distinguishable effects). To do this analytically for arbitrary

f (θ)

, we must take the approximation that

δ

and

ϵ

are small compared to one (the range of interventions and effects) and compared to the scale of curvature of

f (θ)

(such that

ϵ f^{″} (θ) ≪ f^{'} {(θ)}^{2}

). In this limit, we have (for the derivation, see setup in Equation (6) below and Appendix B.1):

\begin{matrix} E I ≃ - \frac{1}{2} \int d θ log [2 π e ({(\frac{ϵ}{f^{'} (θ)})}^{2} + δ^{2})], \end{matrix}

(4)

which echoes the form of the expression for entropy of a normal distribution. From this, we see variationally that, given the fixed end-points

f (0) = 0

and

f (1) = 1

, EI is maximized iff

f^{'} (θ) = 1

: a uniformly linear dimmer switch. We can check this numerically by computing the exact EI for several different choices of

f (θ)

(Figure 1b).

A slightly more interesting version of this example is when our detector (eyes) perceives light brightness on a log, rather than linear, scale (Weber–Fechner law, [33]), in which case the effect of the error will be non-uniform:

ϵ (y) \propto y

. If this error is bound to be sufficiently small everywhere, Equation (4) still holds, replacing only

ϵ \to ϵ (y) \propto y = f (θ)

. Again, we can variationally show that here, EI is maximal iff

f (θ) / f^{'} (θ)

is constant (up to fluctuations of magnitude

O [δ / ϵ]

), giving the optimal dimming profile

f (θ) = (e^{θ / r} - 1) / (e^{1 / r} - 1)

, with some constant

r ≪ δ / ϵ

. In reality, the lighting industry produces switches with many dimming profiles that depend on the application [34], so our approach can be seen as a principled way to optimize this choice.

Interestingly, restricting the accessible interventions can sometimes increase the amount of effective information if it increases the distinguishability, and therefore informativeness, of interventions. This is a form of causal emergence, wherein a higher level (coarsened) macroscopic model emerges as the more informative system description for modeling causation [24]. To give a particular example here, we compare the continuous dimmer profile shown in Figure 1c (left, blue) to its discrete restriction (left, red)—which corresponds to a simple binary switch. When the intervention and effect errors

δ = ϵ

are small, the continuous switch gives more control opportunities and is thus preferable; its EI is larger than the 1 bit for the discrete switch. However, as we increase the errors, we see a crossover in the two EI values. In this regime, the errors are so large that the intermediate switch positions of the continuous profile become essentially useless and are “distracting” from the more useful endpoint settings. Formally, such causal emergence arises due to the averaging over the set of all interventions in Equation (3). Practically, it captures the intuition that building good causal models, as well as designing useful devices involve isolating only the most powerful control parameters out of all possible degrees of freedom [32].

3. Causal Geometry

Taking inspiration from information geometry, we can construct a more intuitive geometric expression for the EI [13]. For studying causal structures of models, this “geometric EI” we introduce may be viewed as a supplement to the usual EI in Equation (3). While the geometric EI corresponds to the EI in a particular limit, we suggest that it remains useful more generally as an alternative metric of causal efficacy: it captures a causal model’s informativeness like the EI does, but in a way that is local in a system’s parameter space. In this way, it frames causality not as a global counter-factual (comparing an intervention to all other interventions) [17], but as a local neighborhood counter-factual (comparing an intervention to nearby interventions). On the flip side, our construct provides a novel formulation of information geometry that allows it to explicitly account for the causal relations of a model. Moreover, we argue that this is necessary for formal consistency when working with the Fisher information matrix eigenvalues (namely, their covariant formulation; see Section 3.2) [28]. This suggests that model reduction based on these eigenvalues may not be made fully rigorous without explicitly accounting for causal relations in the model.

3.1. Construction

Section 2 described a causal model as a set of input-output relations between interventions

X

and effects

Y

. Here, we investigate the relationship of such a causal model with the space of parameters

Θ

that describe the underlying physical system. While these parameters need not necessarily have any direct physical meaning themselves, they are meant to give some abstract internal representation of the system—i.e., they mediate the mapping between interventions and effects. For example, while the notion of energy is merely an abstract concept, it provides a useful model to mediate between interventions such as “turning on the stove” and effects like “boiling water.” The goal of our construction here is to compare how well different physical models capture the causal structure of a system.

As we are focusing on continuous systems, we assume that our parameter space

Θ

forms a smooth d-dimensional manifold, with parameters

θ_{μ}

indexed by

μ, ν \in {1, 2, \dots, d}

. Each accessible intervention

x \in X

then maps to some probability distribution over parameters

x \to q (θ | d o (x))

, and each parameter in turn maps to a distribution over the observed effects

θ \to p (y | d o (θ))

. The causal relations we are interested in here are thus simply

x \to θ \to y

, but these are assumed to be embedded in some larger more complicated causal graph of additional confounding factors. These other possible hidden causes highlight the importance of using the

d o

-operator to isolate the causal relations in which we are interested (see Appendix A for a simple example). Our one assumption is that the parameter space

Θ

is chosen to be a sufficiently complete description of the system that no causal link can pass from

X

to

Y

directly without being reflected on

Θ

.

To understand the role of various parameters

θ_{μ}

, we can ask how much the effects change as we perturb from some set of parameters in some direction: from

θ

to

θ + d θ

. Using the Kullback–Leibler divergence and expanding it to leading order in

d θ

, we get:

\begin{matrix} D_{K L} [p (y | d o (θ)) ∥ p (y | d o (θ + d θ))] ≃ g_{μ ν} (θ) d θ_{μ} d θ_{ν} \end{matrix}

(5)

where summation over repeated indices is implied. This defines the Fisher information metric

g_{μ ν} (θ) = - {⟨\partial_{μ} \partial_{ν} log p (y | d o (θ))⟩}_{p (y | θ)}

with

\partial_{μ} \equiv \frac{\partial}{\partial θ_{μ}}

. This introduces a distance metric on the parameter space

Θ

, turning it into a Riemannian manifold, which we term the “effect manifold”

M_{E}

(this is usually called simply the “model manifold” in the literature, but here, we want to distinguish it from the “intervention manifold,” introduced below). More precisely, it is usually defined as

M_{E} \equiv {p (y | d o (θ))}_{θ \in Θ}

—the collection of all the effect distributions, or the image of the parameter space

Θ

under the model mapping, with Equation (5) being the natural distance metric on this space [6,10,14].

Just as the mapping to effects defines the effect manifold

M_{E}

, we can similarly construct an “intervention manifold”

M_{I}

. For this, we use Bayes’ rule to invert the mapping from interventions to parameters

x \to q (θ | d o (x))

, thus giving

θ \to \tilde{q} (d o (x) | θ)

—the probability that a given parameter point

θ

was “activated” by an intervention x. The intervention manifold is thus defined as

M_{I} \equiv {\tilde{q} (d o (x) | θ)}_{θ \in Θ}

, with the corresponding Fisher information metric

h_{μ ν}

giving the distances on this space. With this, we can now summarize our construction:

\begin{matrix} interventions x \in X, parameters θ \in Θ, effects y \in Y \\ x \to q (θ | d o (x)), θ \to p (y | d o (θ)) \\ effect manifold M_{E} \equiv {p (y | d o (θ))}_{θ \in Θ} with metric g_{μ ν} (θ) = - \int d y p (y | θ) \partial_{μ} \partial_{ν} ln p (y | d o (θ)) \\ intervention manifold M_{I} \equiv {\tilde{q} (d o (x) | θ)}_{θ \in Θ} with metric h_{μ ν} (θ) = - \int d x \tilde{q} (d o (x) | θ) \partial_{μ} \partial_{ν} ln \tilde{q} (x | θ) \\ where \tilde{q} (d o (x) | θ) \equiv \frac{q (θ | d o (x))}{\int d x q (θ | d o (x))} and \partial_{μ} \equiv \frac{\partial}{\partial θ_{μ}} with μ, ν \in {1, 2, \dots, d} \end{matrix}

(6)

Note that for Bayesian inversion in the last line, we used a uniform prior over the intervention space

I_{D} (x) = c o n s t

, which amounts to assuming that statistically, interventions are uniformly distributed over the entire considered space

X

. Note that this is not a choice of convenience, but rather of conceptual necessity for correctly defining information in a causal model, as argued in [24]. The natural point-wise correspondence between the two manifolds

M_{E} \leftrightarrow M_{I} : p (y | d o (θ)) \leftrightarrow \tilde{q} (d o (x) | θ)

then allows for a local comparison between the two geometries. Alternatively, we may simply think of the parameter space

Θ

with two separate distance metrics on it, effect metric

g (θ)

and intervention metric

h (θ)

. With this setup, we can now define our “geometric” effective information:

\begin{matrix} E I_{g} = log [\frac{V_{I}}{{(2 π e)}^{d / 2}}] - {⟨l (θ)⟩}_{I} \end{matrix}

(7)

\begin{matrix} with l (θ) = \frac{1}{2} log det (𝟙 + g {(θ)}^{- 1} h (θ)) \end{matrix}

(8)

Here,

V_{I}

is the volume of the intervention manifold

M_{I}

, which can be computed as

V_{I} = \int d^{d} θ \sqrt{det h}

. It quantifies the effective number of distinct interventions we can

d o

, and so, the first term in Equation (7) gives the maximal possible amount of information about the causation our model could have, if all interventions perfectly translated to effects. The second term then discounts this number according to how poorly the interventions actually overlap with effects: geometrically, the expression in Equation (8) quantifies the degree of matching between the metrics g and h at the point

θ

(here, 𝟙 stands for the identity matrix). This way, the loss term

l (θ)

can be interpreted as a measure of “local mismatch” between interventions and effects at

θ

, quantifying how much information about causation is lost by our modeling choice. The average is then taken according to the intervention metric as:

{⟨l (θ)⟩}_{I} \equiv \frac{1}{V_{I}} \int d^{d} θ \sqrt{det h} l (θ)

. Note that the expression in Equation (7) is identical to the approximation in Equation (4) for the setup in that example.

In Appendix B.2, we show that this expression for

E I_{g}

in Equation (7) can be derived as the approximation of the exact

E I

in Equation (3) when both the mappings are close to deterministic:

p (y | d o (θ)) = N_{y} (f (θ), ϵ^{2})

and

\tilde{q} (d o (x) | θ) = N_{x} (F (θ), δ^{2})

, for some functions

f : Θ \to Y

and

F : Θ \to X

, with small errors

ϵ

and

δ

(which may be anisotropic and nonuniform). Outside of this regime, the

E I

and

E I_{g}

can differ. For instance, while

E I

is positive by definition,

E I_{g}

can easily become negative, especially if g is degenerate anywhere on the manifold. Second, while

E I

captures the informativeness and therefore effectiveness of a causal model globally,

E I_{g}

, and more specifically the landscape

l (θ)

, can show us which local sectors of the parameter space are most and least causally effective. Finally, the global nature of the computation for the exact

E I

quickly makes it intractable, even numerically, for many continuous systems due to the proliferation of high-dimensional probability distributions—making

E I_{g}

the more practical choice in those settings.

3.2. Relation To Sloppiness

“Sloppiness” is the property empirically observed in many real-world models, when the eigenvalues of the Fisher information matrix

g_{μ ν}

take on a hierarchy of vastly varying values [6,11,14]. As such, parameter variations in the directions corresponding to the smallest eigenvalues will have negligible impact on the effects [14]. This leads to the hypothesis that we may effectively simplify our model by projecting out such directions, with little loss for the model’s descriptive power [12,13].

The trouble with this approach is that the components of the matrix

g_{μ ν}

, and hence its eigenvalues, depend on the particular choice of

θ

-coordinates on the effect manifold

M_{E}

[6,28]. Since the parameters

Θ

represent some conceptual abstraction of the physical system, they constitute an arbitrary choice. This means that for a given point of

M_{E}

labeled by

θ

, we can always choose some coordinates in which locally,

g (θ) = 𝟙

(an identity matrix), thus apparently breaking the above sloppiness structure. This issue is avoided in the literature by relying on the coordinate independent global properties of

M_{E}

, namely its boundary structure [12].

Here, we show that by explicitly considering intervention capabilities, we can construct a local, but still coordinate independent sloppiness metric. This becomes possible since interventions give a second independent distance metric on

Θ

[28]. The matrix product

g^{- 1} h

appearing in Equation (8) is then a linear transformation, and thus, its eigenvalues are coordinate independent. This way, to evaluate how sloppy a given causal model is, we suggest that it is more appropriate to study the eigenvalues of

h^{- 1} g

instead of those of g as is usually done [6]. If we then want to identify the directions in the parameter space that are locally least informative at a point

θ

, we first need to re-express the metric g in terms of the coordinates for which

h (θ) = 𝟙

locally and then find the appropriate eigenvectors in these new coordinates.

From this perspective, we see that the usual discussion of sloppiness, which does not study interventions explicitly [6], may be said to implicitly assume that the intervention metric

h (θ) \propto 𝟙

, meaning that all model parameters directly correspond to physically doable interventions. Moreover, this requires that with respect to the given coordinate choice, each parameter can be intervened upon with equal uniform precision—which fixes the particular choice of coordinates on the parameter space. As such, the coordinate-specific eigenvalues

λ_{g} (θ)

of the effect metric

g (θ)

studied in information geometry, become physically meaningful in this special coordinate frame. In particular, our expression for the local mismatch in Equation (8) can here be expressed in terms of these eigenvalues as

l (θ) = \frac{1}{2} \sum_{λ} log (1 + 1 / λ_{g} (θ))

. Thus, locally, the directions with the smallest

λ_{g}

account for the largest contribution to the mismatch l. This recovers the standard intuition of sloppiness: we can best improve our model’s descriptive efficacy by projecting out the smallest-

λ_{g}

directions [12,35]. By seeing how this result arises in our framework, we thus point out that it formally relies on the implicit assumption of uniform intervention capabilities over all model parameters.

It is worth noting that in the construction in Equation (6), it may be possible to integrate out the parameters

θ

, giving directly the distribution of effects in terms of interventions

P (y | d o (x)) = \int d^{d} θ p (y | d o (θ)) q (θ | d o (x))

(we assume that

Θ

gives a complete description of our system in the sense that no causal links from

X

to

Y

can bypass

Θ

). This way, the

P (y | d o (x))

distribution gives an effect metric

\hat{g} (x)

over the

X

-space, which can directly quantify the amount of causal structure locally in our model. Nonetheless, the intervention metric and Equation (8) are still implicitly used here. This is because the space

X

was constructed such that the intervention metric over it would be uniform

\hat{h} (x) = 𝟙

everywhere. In turn, Equation (8) thus takes a particularly simple form in terms of

\hat{g} (x)

. This illustrates that regardless of the parametrization we choose to describe our system, causal efficacy always arises as the matching between the effect metric and the intervention metric, per Equation (8)—which may or may not take on a simple form, and must be checked explicitly in each case. Furthermore, though this goes beyond the scope of this paper, we may imagine cases where the actual interventions form a complicated high-dimensional space

X

that is harder to work with than the parameter space

Θ

(just as the effect space

Y

is often more complex than

Θ

). In fact, this may be the typical scenario for real-world systems, where

X

and

Y

represent arbitrarily detailed descriptions of the system’s context, while

Θ

gives a manageable system abstraction.

4. Two-Dimensional Example

In order to illustrate our causal geometry framework explicitly and show how higher level descriptions can emerge within it, we use a simple toy model (based on the example considered in [12,13]).

Imagine an experimenter has a mixed population of two non-interacting bacterial species that they are treating with two different antibiotics. The experimenter’s measurements cannot distinguish between the bacteria, and so, they are monitoring only the total population size over time

y (t) = e^{- θ_{1} t} + e^{- θ_{2} t}

, where

{θ_{1}, θ_{2}} \in [0, 1]

are the death rates of the two individual species. These death rates are determined by the two antibiotic concentrations the experimenter treats the system with

{x_{1}, x_{2}}

, which are the possible interventions here. In the simplest case, each antibiotic will influence both species via some linear transformation A, such that

θ_{μ} = \sum_{i} A_{μ i} x_{i}

.

This setup allows us to flesh-out the causal geometry construction and illustrate causal emergence here. Our main question is: When is this system best modeled microscopically, as the two independent species, and when does it behave more like a single homogeneous population, or something else [36]? To identify when higher scale models are more informative for the experimenter, we will calculate the geometric

E I_{g}

from Equation (7) for the full 2D model described above and then compare it to two separate 1D coarse-grained model descriptions, shown by the two red 1D sub-manifolds of the parameter space in Figure 2.

Figure 2. An illustration of the causal geometry construction in Equation (6). The parameter space

Θ

of our model gets two distinct geometric structures: the effect metric

g_{μ ν} (θ)

and the intervention metric

h_{μ ν} (θ)

. Here, a model is seen as a map that associates with each set of parameters

θ

, some distribution of possible measured effects y (right). As parameters

θ

may involve arbitrary abstractions and thus need not be directly controllable, we similarly associate them with practically doable interventions x (left). This way, our system description in terms of

θ

“mediates” between the interventions and resulting effects in the causal model.

We first specify the quantities for the construction in Equation (6). Our interventions x, having some uniform error tolerance

δ

, map to normal distributions over parameters

θ

as:

x \to q (θ | d o (x)) = N_{θ} (A x, A A^{T} δ^{2})

, giving the Bayesian inverse probability

\tilde{q} (d o (x) | θ) = N_{x} (A^{- 1} θ, δ^{2})

, and hence the intervention metric

h_{μ ν} = \sum_{i} {(A^{- 1})}_{i μ} {(A^{- 1})}_{i ν} / δ^{2}

. The effect space y is constructed by measuring the population size at several time-points, spaced out at intervals

Δ t

, such that the components of y are given by

y_{n} = y (n Δ t) = e^{- n Δ t θ_{1}} + e^{- n Δ t θ_{2}}

, with

n \in {1, 2, \dots, N}

and error

ϵ

on each measurement (the initial conditions are thus always

y (0) = 2

). Thus, we have

θ \to p (y | d o (θ)) = N_{y} ({y_{n}}, ϵ^{2})

and effect metric

g_{μ ν} = \sum_{n} \partial_{μ} y_{n} \partial_{ν} y_{n} / ϵ^{2}

. Figure 2 shows these mappings with

N = 2

for visual clarity, and we use

N = 3

for the

E I_{g}

calculations below, but all the qualitative behaviors remain the same for larger N. Figure 3 shows the resulting geometric

E I_{g}

(blue curves), computed via Equation (7) for varying values of the error tolerances

ϵ

and

δ

.

Figure 3. Causal emergence from increasing errors for the toy model in Section 4. In all panels, the blue line shows the

E I_{g}

for the full 2D model, while red for the 1D sub-manifold A shown in Figure 2 (solid red line). In (a), we vary the effect error

ϵ

at fixed intervention error

δ = 10^{- 2}

; (b) varies intervention error

δ

at fixed effect error

ϵ = 10^{- 2}

; and (c) varies both together

δ = ϵ

. In each case, we see a crossover where, with no change in system behavior, the coarse-grained 1D model becomes causally more informative when our intervention or effect errors become large.

We can similarly find the

E I_{g}

for any sub-manifold of our parameter space, which would lead to a coarse-grained causal model, with a correspondingly lower dimensional space of intervention capabilities. To do this, we identify the pull-back of the two metrics in the full parameter space, to the embedded sub-manifold, as follows. We define a 1D submanifold of

Θ

as a parametric curve

(θ_{1}, θ_{2}) = (s_{1} (σ), s_{2} (σ))

with the parameter

σ

. The pull-back effect metric on this 1D space with respect to

σ

will be the scalar

\hat{g} (σ) = \sum_{μ, ν} s_{μ}^{'} (σ) s_{ν}^{'} (σ) g_{μ ν} (s_{1} (σ), s_{2} (σ))

, and similarly for intervention metric

\hat{h} (σ)

. For the 1D submanifold depicted by the solid red line in Figure 2, the resulting

E I_{g}

is plotted in red in Figure 3.

The crossover seen in Figure 3 thus illustrates causal emergence: for larger error values, the coarse-grained 1D description turns out to be more informative than the full 2D model. Since this coarse-graining corresponds to the case where the two bacterial species are seen as identical

θ_{1} = θ_{2}

, we can say that at large errors, our bacterial colony is better modeled as a single homogeneous population. Crucially, this arises not from any change in system behavior, but merely from how we interact with it: either from what interventions we impart or from which effects we measure. Note also that when both the intervention and effect errors are scaled together

δ \propto ϵ

, we see analytically from Equation (8) that

l (θ)

is constant, and so:

\begin{matrix} E I_{g} \sim log V_{I} \sim - d log δ \end{matrix}

(9)

which is also explicitly seen in Figure 3c. This indicates that, quite generally, we expect to see crossovers between geometric

E I

s of models with different d as we scale errors, with low-dimensional models being preferred at large noise. Since noise is ubiquitous in all real-world complex systems, this argument suggests why reductionist microscopic descriptions are rarely optimal from the perspective of informative interventions.

By carrying out similar calculations, in Figure 4, we explore how the optimal model choice depends on the time-scales we care about for the population dynamics (effects) and the antibiotics we are using (interventions), all at fixed errors

ϵ, δ

. When the two antibiotics control the two bacterial species almost independently (

A \sim 𝟙

, Figure 4a), we can identify three distinct regimes in the

E I_{g}

plot as we tune the measurement time-scale

Δ t

along the x-axis. If we only care about the population’s initial response to the treatment at early times, then we get a higher

E I_{g}

by modeling our colony as a single bacterial species. For intermediate times, the full 2D model has the higher

E I_{g}

, showing that in this regime, modeling both species independently is preferred. Finally, at late times, most of the population is dead, and the biggest remaining effect identifies how dissimilar their death rates were; the coarse-grained model given by the dashed red submanifold in Figure 2 (

θ_{2} = 1 - θ_{1}

) turns out to be more informative. In this regime, rather than viewing the population as either one or two independent bacterial species, we may think of it as a tightly coupled ecosystem of two competing species. Interestingly, such apparent coupling emerges here not from the underlying system dynamics, but from the optimal choice of coarse-grained description for the given effects of interest.

Figure 4. The optimal model choice depends on both the effects we choose to measure and the intervention capabilities we have. Horizontally, we vary the time-scale

Δ t

on which we measure the bacterial population dynamics in our toy model (Section 4): the top row shows how this changes the shape of our effect manifold. (a) shows the results when our intervention capabilities are nearly in direct correspondence with the parameters

θ

. Here, the

E I_{g}

plot shows that varying

Δ t

takes us through three regimes: with submanifold A as the optimal model at early times, the full 2D model optimal at intermediate times, and submanifold B most informative at late times. (b) shows that this entire picture changes for a different set of intervention capabilities—illustrating that the appropriate model choice depends as much on the interventions as on the effects.

For a different set of intervention capabilities, where the antibiotics affect both species in a more interrelated way, with

A = (\begin{matrix} 1 & 0.8 \\ 0.7 & 1 \end{matrix})

, this entire picture changes (Figure 4b). In particular, we get the scenario where the “fundamental” two-species model is never useful, and the unintuitive “two competing species” description is actually optimal at most time-scales. Note also that in all cases, for very long and very short times

Δ t

, the geometric

E I_{g}

drops below zero. While in this regime, the agreement with exact

E I

breaks down

E I_{g} \neq E I

, it is also heuristically true that

E I_{g} < 0 \Rightarrow

small

E I

, and so, the causal model is no longer very useful here. Even so, as seen in Figure 4, some coarse-graining of the model may still be effective even when the full microscopic description becomes useless.

5. Discussion

The world appears to agents and experimenters as having a certain scale and boundaries. For instance, solid objects are made of many loosely-connected atoms, yet in our everyday experience, we invariably view them as single units. While this may be intuitively understood in terms of the dictates of compression and memory storage [37], this work proposes a way to formalize precisely how such a coarse-grained modeling choice may be the correct (causally optimal) one. This is particularly true for a given set of intervention capabilities. In particular, we frame model selection as a geometric “matching” in the information space of the causal model to accessible interventions.

Intriguingly, this suggests that the correct choice of scientific modeling may not be merely a function of the correct understanding of the system, but also of the context that system is being used in or the capabilities of the experimenters. Thus, for example, if the forces we used to handle solid objects were far larger than inter-atomic attraction holding them together, then viewing objects as single units would no longer be a good model. This echoes one of the main ideas in “embodied cognition” for AI and psychology, which posits that in order for an agent to build accurate models of reality, it needs the ability to actively intervene in the world, not merely observe it [38].

This highlights a potentially important distinction between optimizing a model’s predictive efficacy and its causal efficacy. Many approaches to optimal model selection, such as sloppiness, focus on getting computationally efficient predictions from a few fundamental parameters. In contrast, optimizing causal efficacy looks for a model that best translates all interventions to unique effects, thus giving the user optimal power to control the system. Such a shift of motivation fundamentally changes our perspective on good scientific modeling: roughly, shifting the emphasis from prediction to control. While these two motivations may often go hand-in-hand, the question of which is the more fundamental may be important in distinguishing scenarios.

The causal geometry we introduce here is a natural extension of the information geometry framework [6,10], but now explicitly accounting for the causal structure of model construction. In our proposed formalism, a given model becomes associated with two distinct Riemannian manifolds, along with a mapping between them, one capturing the role of interventions and the other of the effects. The relative geometric matching between these two manifolds locally tells us about how causally informative the present model is and what coarse-graining may lead to a local improvement.

In this structure, the colloquial notion of model “sectors” (especially used in field theories to refer to various field content [39]) becomes associated with literal sectors of the manifolds, with their local geometries specifying the optimal or emergent descriptions of that sector. Such examples also highlight the importance of having a local way to quantify model optimality, as globally, the manifold may have a complex and piecewise structure not amenable to simplification. While both traditional

E I

[24] and information-geometric model-reduction [12] depend on the global model behavior over the entire span of possible interventions, the geometric

E I_{g}

introduced here is built by averaging inherently local causal efficacy, defined for each point in parameter space. We can further speculate that fundamentally, the geometric matching in

E I_{g}

may provide a novel way to quantify causality locally, where the counter-factual comparison is considered relative to the local neighborhood of interventions, rather than to all globally accessible ones [17].

We hope that causal geometry can contribute to further development in both formal principled methods for optimal model building in complex systems, as well as an abstract understanding of what it means to develop informative scientific theories.

Author Contributions

Conceptualization, P.C. and E.H.; formal analysis, P.C.; supervision, E.H.; visualization, P.C.; writing, original draft, P.C.; writing, review and editing, E.H. All authors read and agreed to the published version of the manuscript.

Funding

E.H. was funded by Army Research Office Grant W911NF2010243 and P.C. by ARO W911NF1810101 and the James S. McDonnell Foundation Scholar Grant 220020476.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

Thanks Mark Transtrum for helpful discussions throughout the work on the manuscript, as well as review of the final draft, Mikhail Tikhonov for inspiration at the early stages of the project, and Thomas Berrueta for great comments on the final draft.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Example Illustrating the Do-Operator

Here, we give a simple pedagogic example that illustrates the use of the do-operator and its role in distinguishing causal relations from correlation. To highlight its role in the context of information geometry and sloppiness, we use a setup familiar from that literature [6]. Consider some bacterial population whose decay over time in unfavorable conditions we want to monitor:

y (t) = e^{- θ t}

, where

y (t)

is the remaining fraction at time t of the original population at

t = 0

and

θ

is the death rate. In the context of information geometry, we might consider

θ

as a model parameter, while

y (t)

as the predicted data. However, we cannot say that

θ

“causes” the decay, as any such statement about causality requires distinguishing interventions and effects. Without this, our model is merely an exponential fit to the data

y (t)

, with

θ

labeling a compressed representation of the decay curve.

To have a causal, rather than a descriptive model, we need to introduce interventions that can influence this decay curve (and hence its descriptor

θ

). For example, this may be the concentration of some harmful chemical in the bacteria’s environment. For the simplicity of our example, imagine that we have a direct linear mapping from these concentrations x to the death rate:

d o (x) \to θ

. Note that this mapping is a causal model, with the

d o

-operator being well-defined on the space of interventions: it prescribes actively setting the chemical concentrations, in spite of any other possible environmental causes and fluctuations. In this setting, we can show how causal dependencies

q (θ | d o (x))

can differ from statistical ones

q (θ | x)

; these will be distinct whenever there are any confounding factors present. For concreteness in our example, we can introduce temperature T as such a confounder: On the one hand, it might speed up bacteria’s life-cycle, and hence death rate such that

d o (x) \to θ = x + α T

. On the other, it can denature the harmful chemical as

x = x_{0} / T

, where

x_{0}

is the reference concentration at

T = 1

.

If our model is specified explicitly, as is being done here, then all the causal relations are a priori known, and no work is needed to distinguish causation from correlation. The

d o

-operator becomes important when we look at a more real-world setting where we can interact with a system, but do not know its underlying dynamics. From the perspective of sloppiness, this is referred to as data-driven manifold exploration. The

d o

-operator prescribes how we should collect our data to extract the correct causal relations.

First consider a “wrong” way we can explore the model manifold, which will yield only statistical dependencies

q (θ | x)

. For this, we study the fluctuations of bacterial population and habitat in their natural environment (specifically looking at

θ

and x). We can imagine that these fluctuations are dominated by natural variation of the reference chemical concentration

x_{0} \sim N (\hat{x}, σ_{x}^{2})

, with an additional influence from temperature fluctuations

T \sim N (0, σ_{T}^{2})

(where we defined the scale of T such that its fluctuations centered on zero). Since we only observe x, and not

x_{0}

or T, we integrate these out to get:

\begin{matrix} q (θ | x) & = \frac{1}{Z (x)} \int d x_{0} e^{- \frac{{(x_{0} - \hat{x})}^{2}}{2 σ_{x}^{2}}} \int d T e^{- \frac{T^{2}}{2 σ_{T}^{2}}} δ (θ - x - α T) δ (x_{0} - x T) \\ = \frac{1}{α Z (x)} e^{- \frac{{(x \frac{θ - x}{α} - \hat{x})}^{2}}{2 σ_{x}^{2}} - \frac{{(\frac{θ - x}{α})}^{2}}{2 σ_{T}^{2}}} = \frac{1}{\sqrt{2 π σ_{n e t}^{2}}} e^{- \frac{1}{2 σ_{n e t}^{2}} {(θ - x (1 + \frac{\hat{x} σ_{n e t}^{2}}{α σ_{x}^{2}}))}^{2}} \end{matrix}

(A1)

where we defined the x-dependent

σ_{n e t} \equiv α {(\frac{1}{σ_{T}^{2}} + \frac{x^{2}}{σ_{x}^{2}})}^{- 1 / 2}

in the last line.

Now, to capture the correct causal relationships, we instead use the

d o

-operator to find

q (θ | d o (x))

, which prescribes actively setting the chemical concentration x to specific values, rather than passively waiting for these to be observed. This means that while temperature can still fluctuate as above

T \sim N (0, σ_{T}^{2})

, we do not defer to fluctuations of

x_{0}

to produce specific values of x, but rather set these concentrations ourselves (e.g., in lab conditions). This removes confounding correlations, while T fluctuations now merely add uniform noise on

θ

:

\begin{matrix} q (θ | d o (x)) & = \frac{1}{\sqrt{2 π} σ_{T}} \int d T e^{- \frac{T^{2}}{2 σ_{T}^{2}}} δ (θ - x - α T) \\ = \frac{1}{\sqrt{2 π} σ_{T} α} e^{- \frac{{(θ - x)}^{2}}{2 σ_{T}^{2} α^{2}}} \end{matrix}

(A2)

While the distributions resulting form the two above setups (Equations (A1) and (A2)) are clearly distinct, we want to take one further step to show that the information metrics they induce are also different. First, we note that the distributions

p (y | θ)

, and so the “effect metric”

g (θ)

(see Equation (6)), simply reflect how the space of decay curves

y (t)

is parametrized by the scalar

θ

. Thus, in this example,

g (θ)

is not about causality, and so is fixed regardless of how we measure our data. We therefore expect this distinction to be captured entirely by the intervention metrics. To find these, according to the setup in Equation (6), we first need the Bayesian inverse

\tilde{q} (d o (x) | θ)

, for which we must calculate the normalization

\int d x q (θ | d o (x))

. For the distribution in Equation (A2), this is one, and

\tilde{q} (d o (x) | θ) = q (θ | d o (x))

, thus giving the uniform intervention metric

h_{c a u s} (θ) = \frac{1}{σ_{T}^{2} α^{2}}

for the true causal dependences in the system. For Equation (A1), the integral for normalization cannot be done exactly, so we take the approximation that fluctuations

σ_{T}

of the confounding temperature variable T are small. With this, we find that

\int d x q (θ | x) = 1 - α \hat{x} {(\frac{σ_{T}}{σ_{x}})}^{2} - (\frac{3}{2} α \hat{x} θ^{2} + \frac{1}{4} θ^{4} + α^{2} σ_{x}^{2}) {(\frac{σ_{T}}{σ_{x}})}^{4} + O [σ_{T}^{6}]

, and thus, the “intervention” metric for the statistical dependencies is

h_{s t a t} (θ) = {⟨\frac{1}{σ_{n e t}^{2}}⟩}_{x} - 3 (α \hat{x} + θ^{2}) {(\frac{σ_{T}}{σ_{x}})}^{4} + O [σ_{T}^{6}]

, where the average

{⟨\cdot⟩}_{x}

is taken according to the distribution

\tilde{q} (x | θ)

. Notably, unlike

h_{c a u s}

above, here,

h_{s t a t} (θ)

varies with

θ

, giving a qualitatively different geometry than the true causal geometry for this example.

As such, we have shown in an explicit setup how the causal relations

q (θ | d o (x))

can differ from statistical ones

q (θ | x)

, leading to distinct geometric structures on the model manifold, which may then produce different schemes for model simplification. In general, while we may not know all the confounding factors nor the effects they can have, we can simply rely on using the

d o

-operator to extract the true causal relations for us. For this reason, none of the examples presented in this work make any explicit reference to confounding factors, as these may be assumed to be plentiful and unknown. For further discussion on the role of confounding factors and the tools of causal calculus that help to work with complex causal graphs, see [17].

Appendix B. Deriving Geometric EI

Here, we will derive the expression for

E I_{g}

in Equation (7), and equivalently in Equation (4). We start from the definition of

E I

in Equation (3) and use the near-deterministic model approximation discussed in the main text, and below. In Appendix B.1, we go through the detailed derivation for the 1D case and then in Appendix B.2 overview the steps needed to generalize it to higher dimensions.

Appendix B.1. One-Dimensional Case

As mentioned in the main text, the expression for

E I_{g}

presented here only approximates the exact

E I

when the mappings from interventions

x \in X

to parameters

θ \in Θ

and to effects

y \in Y

are both nearly-deterministic. Explicitly, this means that we can express the probability distributions as Gaussians with small variances. Note that the variance can be different at different points and in the multi-dimensional case may be anisotropic—as long as it remains sufficiently small everywhere (to be clarified later). This way, we can specify the concrete expression for the construct in Equation (6):

\begin{matrix} interventions x \in X, parameters θ \in Θ, effects y \in Y \\ x \to q (θ | d o (x)), such that \tilde{q} (d o (x) | θ) \equiv \frac{q (θ | d o (x))}{\int d x q (θ | d o (x))} = N_{x} (F (θ), δ^{2}) \\ θ \to p (y | d o (θ)) = N_{y} (f (θ), ϵ^{2}) \\ g (θ) = {(f^{'} (θ) / ϵ)}^{2} effect metric \\ h (θ) = {(F^{'} (θ) / δ)}^{2} intervention metric \end{matrix}

(A3)

Note that in 1D, the metrics g and h become scalars, and

N_{x} (F (θ), δ^{2})

denotes a Gaussian distribution in x, centered on

F (θ)

and with standard deviation

δ

. Furthermore, we define

F (θ)

and intervention errors

δ

as above, giving that

q (θ | d o (x)) = N_{θ} (F^{- 1} (x), {(F^{- 1^{'}} (x) δ)}^{2})

—merely for convenience of notation later (as

δ

may depend on x). As such, we also assume

F (θ)

and

f (θ)

to be invertible.

We begin with the definition of

E I

from Equation (3), for which we must first calculate the distribution

P (y | d o (x)) \equiv \int d θ p (y | d o (θ)) q (θ | d o (x))

:

\begin{matrix} P (y | d o (x)) = \int d θ \overset{p (y | d o (θ))}{\overset{︷}{\frac{1}{\sqrt{2 π} ϵ} e^{- \frac{{(y - f (θ))}^{2}}{2 ϵ^{2}}}}} \times \overset{q (θ | d o (x))}{\overset{︷}{\frac{F^{'} (F^{- 1} (x))}{\sqrt{2 π} δ} e^{- \frac{{(x - F (θ))}^{2}}{2 δ^{2}}}}} \end{matrix}

(A4)

We can evaluate this Gaussian integral in the limit of small

ϵ

and

δ

. Let us understand precisely how small these must be.

To work with the above integral,

δ

must be small enough that both

\tilde{q} (d o (x) | θ)

and

q (θ | d o (x))

are Gaussian. For this, the second-order term in the Taylor expansion

F (θ) = F (θ_{x}) + (θ - θ_{x}) F^{'} (θ_{x}) + \frac{1}{2} {(θ - θ_{x})}^{2} F^{″} (θ_{x}) + \dots

around

θ_{x} \equiv F^{- 1} (x)

must be negligible in all regions with substantial probability, giving the asymptotic assumption:

\frac{1}{2} {(θ - θ_{x})}^{2} F^{″} (θ_{x}) ≪ (θ - θ_{x}) F^{'} (θ_{x})

. We can check that in this case, plugging this expansion back into

\tilde{q} (x | θ)

from Equation (A3), we get now a Gaussian in

θ

:

exp [- \frac{{(θ - θ_{x})}^{2} F^{'} {(θ_{x})}^{2}}{2 δ^{2}}]

as desired. This then shows us that we expect

θ

to typically be within

O [δ / F^{'} (θ_{x})]

of

θ_{x}

, i.e.,

(θ - θ_{x}) \sim O [δ / F^{'} (θ_{x})]

. This allows us to write the above asymptotic condition as

F^{″} (θ) δ ≪ F^{'} {(θ)}^{2}

for all

θ

. The exact same argument goes for the effect distribution

p (y | d o (θ))

and similarly gives the condition

f^{″} (θ) ϵ ≪ f^{'} {(θ)}^{2}

. As long as these two conditions hold for all

θ

, we can allow

δ = δ (x)

and

ϵ = ϵ (y)

to vary arbitrarily.

In this limit, the integration of the expression in Equation (A4) is straightforward to carry out, giving another Gaussian:

\begin{matrix} P (y | d o (x)) = \frac{1}{f^{'} (θ_{y}) \sqrt{2 π σ^{2} (θ_{x}, θ_{y})}} e^{- \frac{{(θ_{x} - θ_{y})}^{2}}{2 σ^{2} (θ_{x}, θ_{y})}} \\ with & σ^{2} (θ_{x}, θ_{y}) \equiv {(\frac{ϵ}{f^{'} (θ_{y})})}^{2} + {(\frac{δ}{F^{'} (θ_{x})})}^{2} = \frac{1}{g (θ_{y})} + \frac{1}{h (θ_{x})} \end{matrix}

(A5)

where

θ_{y} \equiv f^{- 1} (y)

and

θ_{x} \equiv F^{- 1} (x)

, and we used the expressions for the two metrics in Equation (A3). Averaging this over the interventions and using the fact that

σ \sim O [ϵ, δ]

is small, we can then find the effect distribution:

\begin{matrix} E_{D} (y) = {⟨P (y | d o (x))⟩}_{I_{D} (x)} = \int \frac{d x}{L} P (y | d o (x)) = \frac{F^{'} (θ_{y})}{L f^{'} (θ_{y})} \end{matrix}

(A6)

where L is the size of the 1D intervention space

X

, so that the uniform intervention distribution

I_{D} (x) = 1 / L

.

With these expressions, we can now calculate the

E I = {⟨D_{K L} [P (y | d o (x)) ∥ E_{D} (y)]⟩}_{I_{D} (x)}

. Since

σ

is small, the Gaussian for

P (y | d o (x))

will ensure that

θ_{x}

is close to

θ_{y}

, and so to leading order, we replace

σ (θ_{x}, θ_{y}) \approx σ (θ_{x}, θ_{x})

here. Therefore:

\begin{matrix} E I & = \int \frac{d x}{L} \int \underset{= d θ_{y}}{\underset{︸}{\frac{d y}{f^{'} (θ_{y})}}} \frac{1}{\sqrt{2 π σ^{2} (θ_{x})}} e^{- \frac{{(θ_{x} - θ_{y})}^{2}}{2 σ^{2} (θ_{x})}} (- \frac{{(θ_{x} - θ_{y})}^{2}}{2 σ^{2} (θ_{x})} - log [f^{'} (θ_{y}) \sqrt{2 π σ^{2} (θ_{x})}] - log [\frac{F^{'} (θ_{y})}{L f^{'} (θ_{y})}]) \\ = \int \frac{d x}{L} (- \frac{1}{2} - log [f^{'} (θ_{x}) \sqrt{2 π σ^{2} (θ_{x})}] - log [\frac{F^{'} (θ_{x})}{L f^{'} (θ_{x})}]) \end{matrix}

\begin{matrix} = - \frac{1}{2} \int \frac{d θ F^{'} (θ)}{L} log [2 π e {(\frac{F^{'} (θ)}{L})}^{2} ({(\frac{ϵ}{f^{'} (θ)})}^{2} + {(\frac{δ}{F^{'} (θ)})}^{2})] \end{matrix}

(A7)

\begin{matrix} = log [\frac{L}{δ \sqrt{2 π e}}] - \frac{δ}{L} \int d θ \sqrt{h (θ)} log \sqrt{h (θ) (h^{- 1} (θ) + g^{- 1} (θ))} \end{matrix}

(A8)

where in the last line, we simply rearranged and substituted the expressions for the metrics

g (θ)

and

h (θ)

from Equation (A3). Here, we first see that Line (A7) reproduces the

E I

approximation we showed for the dimmer switch example, in Equation (4), for the setup there:

F (θ) = θ

, and

L = 1

. In general, by recognizing that here, the volume of the intervention space is

V_{I} = \int d θ \sqrt{h (θ)} = \int d θ F^{'} (θ) / δ = L / δ

, we finally see that the expression in Equation (A8) agrees with our main result presented in Equation (7) in the case of a 1D parameter space discussed here.

Appendix B.2. Multi-Dimensional Case

Generalizing the above derivation to the multi-dimensional case is straightforward and is mainly a matter of careful bookkeeping. The expressions in Equation (A3) become here:

\begin{matrix} interventions x^{a} \in X, parameters θ^{μ} \in Θ, effects y^{i} \in Y \\ x \to q (θ | d o (x)), such that \tilde{q} (d o (x) | θ) \equiv \frac{q (θ | d o (x))}{\int d^{d} x q (θ | d o (x))} = \frac{\sqrt{det Δ_{a b}}}{{(2 π)}^{d_{I} / 2}} e^{- \frac{1}{2} Δ_{a b} {(x - F (θ))}^{a} {(x - F (θ))}^{b}} \\ θ \to p (y | d o (θ)) = \frac{\sqrt{det E_{i j}}}{{(2 π)}^{d_{E} / 2}} e^{- \frac{1}{2} E_{i j} {(y - f (θ))}^{i} {(y - f (θ))}^{j}} \\ g_{μ ν} (θ) = E_{i j} \partial_{μ} f^{i} (θ) \partial_{ν} f^{j} (θ) effect metric \\ h_{μ ν} (θ) = Δ_{a b} \partial_{μ} F^{a} (θ) \partial_{ν} F^{b} (θ) intervention metric \end{matrix}

(A9)

Here,

a, b \in {1, \dots, d_{I}}

index the various dimensions of the intervention space

X

,

μ, ν \in {1, \dots, d}

—for dimensions of the parameter space

Θ

and

i, j \in {1, \dots, d_{E}}

—the effects space

Y

. As we are dealing with general multi-dimensional geometric constructs, in this section, we are being careful to denote the contravariant vector components with upper indices, as in

θ^{μ}

, and covariant components with lower indices, as in

\partial_{μ} \equiv \frac{\partial}{\partial θ^{μ}}

.

As in Equations (A4) and (A5) above, we can then compute the distribution over effects conditioned on interventions:

\begin{matrix} P (y | d o (x)) = \frac{\sqrt{det Σ_{μ ν} (θ_{x}, θ_{y})}}{det (\partial_{μ} f^{i} (θ_{y})) {(2 π)}^{d / 2}} e^{- \frac{1}{2} Σ_{μ ν} {(θ_{x} - θ_{y})}^{μ} {(θ_{x} - θ_{y})}^{ν}} \\ with & Σ^{- 1} (θ_{x}, θ_{y}) \equiv g^{- 1} (θ_{y}) + h^{- 1} (θ_{x}) \end{matrix}

(A10)

where

Σ

denotes the matrix with components

Σ_{μ ν}

and

Σ^{- 1}

its matrix inverse (and similar for g and h). Here, we assume that both functions

F (θ)

and

f (θ)

are invertible, which means that the intervention and effect spaces

X

and

Y

both have the same dimension as the parameter space

Θ

:

d_{I} = d_{E} = d

. This allows us to view the map

θ^{μ} \to f^{i} (θ)

as a change of coordinates, with a square Jacobian matrix

\partial_{μ} f^{i}

, whose determinant in the first line of Equation (A10) is thus well-defined and may be usefully expressed as

det (\partial_{μ} f^{i}) = \sqrt{\frac{det g_{μ ν}}{det E_{i j}}}

. Note also that to get the above result, we once again assume the distributions

\tilde{q} (d o (x) | θ)

and

p (y | d o (θ))

to be nearly deterministic, meaning here that the matrices

Δ

and E must be large, though the precise form of the assumption is messy here.

Averaging this result over the intervention space

X

, we get:

\begin{matrix} E_{D} (y) = {⟨P (y | d o (x))⟩}_{X} = \frac{det (\partial_{μ} F^{a} (θ_{y}))}{det (\partial_{μ} f^{i} (θ_{y}))} \frac{\sqrt{det Δ_{a b}}}{V_{I}} \end{matrix}

(A11)

where we define the intervention-space volume in units of variance of

\tilde{q}

, as:

V_{I} \equiv \int d^{d} x \sqrt{det Δ_{a b}} = \int d^{d} θ \sqrt{det h_{μ ν} (θ)}

. Performing another such average over

X

, with some algebra, similar as for Equation (A7), we can arrive at our result in Equation (7):

\begin{matrix} E I = {⟨D_{K L} [P (y | d o (x)) ∥ E_{D} (y)]⟩}_{X} = log [\frac{V_{I}}{{(2 π e)}^{d / 2}}] - \frac{1}{V_{I}} \int d^{d} θ \sqrt{det h} log \sqrt{det (𝟙 + g^{- 1} h)} \end{matrix}

(A12)

References

Hoel, E.P. Agent above, atom below: How agents causally emerge from their underlying microphysics. In Wandering Towards a Goal; Springer: New York, NY, USA, 2018; pp. 63–76. [Google Scholar]
Anderson, P.W. More is different. Science 1972, 177, 393–396. [Google Scholar] [CrossRef] [PubMed]
Solé, R.V.; Manrubia Cuevas, S.; Luque, B.; Delgado, J.; Bascompte, J. Phase Transitions and Complex Systems: Simple, Nonlinear Models Capture Complex Systems at the Edge of Chaos; John Wiley Sons: Hoboken, NJ, USA, 1996. [Google Scholar]
Zenil, H.; Soler-Toscano, F.; Joosten, J.J. Empirical encounters with computational irreducibility and unpredictability. Minds Mach. 2012, 22, 149–165. [Google Scholar] [CrossRef][Green Version]
Israeli, N.; Goldenfeld, N. Computational irreducibility and the predictability of complex physical systems. Phys. Rev. Lett. 2004, 92, 074105. [Google Scholar] [CrossRef] [PubMed]
Transtrum, M.K.; Machta, B.B.; Brown, K.S.; Daniels, B.C.; Myers, C.R.; Sethna, J.P. Perspective: Sloppiness and emergent theories in physics, biology, and beyond. J. Chem. Phys. 2015, 143, 07B201_1. [Google Scholar] [CrossRef]
Box, G.E. Science and statistics. J. Am. Stat. Assoc. 1976, 71, 791–799. [Google Scholar] [CrossRef]
Maiwald, T.; Hass, H.; Steiert, B.; Vanlier, J.; Engesser, R.; Raue, A.; Kipkeew, F.; Bock, H.H.; Kaschek, D.; Kreutz, C.; et al. Driving the model to its limit: Profile likelihood based model reduction. PLoS ONE 2016, 11, e0162366. [Google Scholar] [CrossRef]
MacKay, D.J.; Mac Kay, D.J. Information Theory, Inference and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
Amari, S.I. Information Geometry and Its Applications; Springer: Berlin/Heidelberg, Germany, 2016; Volume 194. [Google Scholar]
Gutenkunst, R.N.; Waterfall, J.J.; Casey, F.P.; Brown, K.S.; Myers, C.R.; Sethna, J.P. Universally sloppy parameter sensitivities in systems biology models. PLoS Comput. Biol. 2007, 3, e189. [Google Scholar] [CrossRef]
Transtrum, M.K.; Qiu, P. Model reduction by manifold boundaries. Phys. Rev. Lett. 2014, 113, 098701. [Google Scholar] [CrossRef]
Mattingly, H.H.; Transtrum, M.K.; Abbott, M.C.; Machta, B.B. Maximizing the information learned from finite data selects a simple model. Proc. Natl. Acad. Sci. USA 2018, 115, 1760–1765. [Google Scholar] [CrossRef]
Machta, B.B.; Chachra, R.; Transtrum, M.K.; Sethna, J.P. Parameter space compression underlies emergent theories and predictive models. Science 2013, 342, 604–607. [Google Scholar] [CrossRef]
Raju, A.; Machta, B.B.; Sethna, J.P. Information loss under coarse graining: A geometric approach. Phys. Rev. E 2018, 98, 052112. [Google Scholar] [CrossRef]
Laughlin, R.B.; Pines, D. From the cover: The theory of everything. Proc. Natl. Acad. Sci. USA 2000, 97, 28. [Google Scholar] [CrossRef] [PubMed]
Pearl, J. Causality; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
Halpern, J.Y. Actual Causality; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
Sugihara, G.; May, R.; Ye, H.; Hsieh, C.H.; Deyle, E.; Fogarty, M.; Munch, S. Detecting causality in complex ecosystems. Science 2012, 338, 496–500. [Google Scholar] [CrossRef] [PubMed]
Solvang, H.K.; Subbey, S. An improved methodology for quantifying causality in complex ecological systems. PLoS ONE 2019, 14, e0208078. [Google Scholar] [CrossRef]
Albantakis, L.; Tononi, G. Causal composition: Structural differences among dynamically equivalent systems. Entropy 2019, 21, 989. [Google Scholar] [CrossRef]
Tononi, G.; Sporns, O. Measuring information integration. BMC Neurosci. 2003, 4, 31. [Google Scholar] [CrossRef]
Hoel, E.P.; Albantakis, L.; Tononi, G. Quantifying causal emergence shows that macro can beat micro. Proc. Natl. Acad. Sci. USA 2013, 110, 19790–19795. [Google Scholar] [CrossRef]
Hoel, E.P. When the map is better than the territory. Entropy 2017, 19, 188. [Google Scholar] [CrossRef]
Gugercin, S.; Antoulas, A.C. A Survey of Model Reduction by Balanced Truncation and Some New Results. Int. J. Control. 2004, 77, 748–766. [Google Scholar] [CrossRef]
Huang, H.; Fairweather, M.; Griffiths, J.; Tomlin, A.; Brad, R. A systematic lumping approach for the reduction of comprehensive kinetic models. Proc. Combust. Inst. 2005, 30, 1309–1316. [Google Scholar] [CrossRef]
Daniels, B.C.; Nemenman, I. Automated adaptive inference of phenomenological dynamical models. Nat. Commun. 2015, 6, 1–8. [Google Scholar] [CrossRef] [PubMed]
Dufresne, E.; Harrington, H.A.; Raman, D.V. The geometry of sloppiness. arXiv 2016, arXiv:1608.05679. [Google Scholar] [CrossRef]
Pearl, J.; Mackenzie, D. The Book of Why: The New Science of Cause and Effect; Basic Books: New York, NY, USA, 2018. [Google Scholar]
Balduzzi, D. Information, learning and falsification. arXiv 2011, arXiv:1110.3592. [Google Scholar]
Klein, B.; Hoel, E. The emergence of informative higher scales in complex networks. Complexity 2020, 2020. [Google Scholar] [CrossRef]
Liu, Y.Y.; Slotine, J.J.; Barabási, A.L. Controllability of complex networks. Nature 2011, 473, 167–173. [Google Scholar] [CrossRef]
Portugal, R.; Svaiter, B.F. Weber-Fechner law and the optimality of the logarithmic scale. Minds Mach. 2011, 21, 73–81. [Google Scholar] [CrossRef]
Hu, W.; Davis, W. Dimming curve based on the detectability and acceptability of illuminance differences. Opt. Express 2016, 24, A885–A897. [Google Scholar] [CrossRef]
Transtrum, M.K.; Machta, B.B.; Sethna, J.P. Geometry of nonlinear least squares with applications to sloppy models and optimization. Phys. Rev. E 2011, 83, 036701. [Google Scholar] [CrossRef]
Tikhonov, M. Theoretical ecology without species. Bull. Am. Phys. Soc. 2016, 61, 1–13. [Google Scholar]
Blumer, A.; Ehrenfeucht, A.; Haussler, D.; Warmuth, M.K. Occam’s razor. Inf. Process. Lett. 1987, 24, 377–380. [Google Scholar] [CrossRef]
Shapiro, L. Embodied Cognition; Routledge: Abingdon, UK, 2019. [Google Scholar]
Srednicki, M. Quantum Field Theory; Cambridge University Press: Cambridge, UK, 2007. [Google Scholar]

Figure 1. Illustrating continuous Effective Information (EI) on a simple toy system. (a) shows the system construction: a dimmer switch with a particular “dimmer profile”

f (θ)

. We can intervene on it by setting the switch

θ \in (0, 1)

up to error tolerance

δ

, while effects are similarly measured with error

ϵ

; (b) shows that for uniform errors

ϵ = δ = 0.03

, out of the family of dimmer profiles parametrized by a (left), the linear profile gives the “best control,” i.e., has the highest EI (where dark blue—numerical EI calculation and light blue—approximation in Equation (4)); (c) illustrates how for two other dimmer profiles (left), increasing error tolerances

ϵ = δ

influence the EI (right, calculated numerically). The profile in red represents a discrete binary switch—which emerges if we restrict the interventions on the blue dimmer profile to only use “ends of run.” Crucially, such coarse-graining allows for an improved control of the light (higher EI) when errors are sufficiently large.

Figure 2. An illustration of the causal geometry construction in Equation (6). The parameter space

Θ

of our model gets two distinct geometric structures: the effect metric

g_{μ ν} (θ)

and the intervention metric

h_{μ ν} (θ)

. Here, a model is seen as a map that associates with each set of parameters

θ

, some distribution of possible measured effects y (right). As parameters

θ

may involve arbitrary abstractions and thus need not be directly controllable, we similarly associate them with practically doable interventions x (left). This way, our system description in terms of

θ

“mediates” between the interventions and resulting effects in the causal model.

Figure 3. Causal emergence from increasing errors for the toy model in Section 4. In all panels, the blue line shows the

E I_{g}

for the full 2D model, while red for the 1D sub-manifold A shown in Figure 2 (solid red line). In (a), we vary the effect error

ϵ

at fixed intervention error

δ = 10^{- 2}

; (b) varies intervention error

δ

at fixed effect error

ϵ = 10^{- 2}

; and (c) varies both together

δ = ϵ

. In each case, we see a crossover where, with no change in system behavior, the coarse-grained 1D model becomes causally more informative when our intervention or effect errors become large.

Figure 4. The optimal model choice depends on both the effects we choose to measure and the intervention capabilities we have. Horizontally, we vary the time-scale

Δ t

on which we measure the bacterial population dynamics in our toy model (Section 4): the top row shows how this changes the shape of our effect manifold. (a) shows the results when our intervention capabilities are nearly in direct correspondence with the parameters

θ

. Here, the

E I_{g}

plot shows that varying

Δ t

takes us through three regimes: with submanifold A as the optimal model at early times, the full 2D model optimal at intermediate times, and submanifold B most informative at late times. (b) shows that this entire picture changes for a different set of intervention capabilities—illustrating that the appropriate model choice depends as much on the interventions as on the effects.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Causal Geometry

Abstract

1. Introduction

2. Effective Information in Continuous Systems

2.1. Toy Example: Dimmer Switch

3. Causal Geometry

3.1. Construction

3.2. Relation To Sloppiness

4. Two-Dimensional Example

5. Discussion

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A. Example Illustrating the Do-Operator

Appendix B. Deriving Geometric EI

Appendix B.1. One-Dimensional Case

Appendix B.2. Multi-Dimensional Case

References

Article Metrics

Citations

Article Access Statistics