Causal Geometry

Information geometry has offered a way to formally study the efficacy of scientific models by quantifying the impact of model parameters on the predicted effects. However, there has been little formal investigation of causation in this framework, despite causal models being a fundamental part of science and explanation. Here, we introduce causal geometry, which formalizes not only how outcomes are impacted by parameters, but also how the parameters of a model can be intervened upon. To do so, we introduce a geometric version of "effective information," a known measure of the informativeness of a causal relationship. We show that it is given by the matching between the space of effects and the space of interventions, in the form of their geometric congruence. Therefore, given a fixed intervention capability, an effective causal model is one that is well matched to those interventions. This can lead to "causal emergence," wherein macroscopic causal relationships may carry more information than "fundamental" microscopic ones. We thus argue that a coarse-grained model may, counterintuitively, be more informative than the microscopic one, especially when it better matches the scale of accessible interventions, as we illustrate on toy examples.


Introduction
Many complex real-world phenomena admit surprisingly simple descriptions, from the smallest of microphysics to aspects of economics [1][2][3]. While this is not seen as entirely coincidental, the precise reasons for this fortunate circumstance are not entirely understood [4,5]. Two complementary solutions may be sought: On the one hand, we may hypothesize that this is an objective property of nature, thereby looking for some mechanism common among complex systems that allows them to be well-described with only a few parameters [3,6]. On the other, we may guess that it is a subjective property of our perception and then try to formalize the process by which we find useful patterns in arbitrarily complex systems [5,7,8]. In recent years, substantial progress has been made in developing both of these perspectives, grounded in information theory [9].
One compelling argument for the first hypothesis has been made from information geometry. The approach starts by associating with any given model a particular "model manifold," whose geometric properties can tell us whether and which simplifications can be helpful [6,10]. It turns out that for many real-world models, this manifold is highly anisotropic, having a hierarchical hyper-ribbon structure [11]. This property, termed "sloppiness," indicates that only a few of the many microscopic model parameters are actually important for the model predictions, thus allowing for model simplification [12,13]. While it is not yet clear how general this property is, sloppiness was empirically illustrated in a number of biochemical and physical models and argued for on some general grounds [6,14,15]. This way, sloppiness provides an explanation for how emergent simplicity may be an objective property of complex systems themselves.
The second perspective instead takes as its starting point the well-known aphorism that "all models are wrong, but some are useful" [7]. One way to see this is as a rejection of the reductionist notion that a "fundamental" microscopic description is the "correct" one. Along these lines, recent work has sought the set of coarse-grained parameters on sloppy manifolds that could reasonably be constrained by some limited experimental observations, thus selecting the optimal modeling level for the available data. The mathematical procedure thus carried out closely parallels that developed earlier in causal emergence for discrete systems [23,24]. In our work, we use this formal resemblance to understand the connection between these two distant fields: sloppy models and causal calculus. This allows us to give a continuum formulation of causal emergence, as well as a novel local measure of causal optimality. Conversely, our work establishes the proper formal role of interventions and causality in sloppiness, potentially resolving a long-standing formal challenge around the noncovariance of metric eigenvalues [28] and showing that not only the hyper-ribbon manifold structure, but also its relation to intervention capabilities, accounts for the emergence of simple models.
In Section 2, we define the Effective Information (EI) for continuous models, which captures the amount of information in the model's causal relationships. We illustrate it on a simple example (Section 2.1) and show how restricting the set of allowed interventions may sometimes, surprisingly, make the causal model more informative. Section 3 then introduces causal geometry. Specifically, it relates the continuous EI to information geometry, introduces a local geometric measure of causal structure, and provides a way to find the locally most effective model for a given set of intervention capabilities using the techniques of information geometry. We demonstrate our construction on another simple toy model in Section 4, showing how causal emergence can arise in our geometric formulation, subject to the given interventional and observational capabilities.

Effective Information in Continuous Systems
For the purposes of this work, we formalize a causal model as a set of input-output relations, or more precisely, a map from all possible interventions to the full description of all effects within the context of some system [24]. While the set of all hypothetically possible interventions on a given physical system is enormous and impractical to consider (involving arbitrary manipulations of every subatomic particle), the set of experimentally doable (or even considered) interventions for a given context always represents a much smaller bounded space X, which we refer to here as "intervention capabilities." Similarly, while an intervention will lead to uncountable microscopic physical effects, the space of specific effects of interest Y is much smaller and often happens to be closely related to the intervention capabilities. All causal models by definition use some such subset of possibilities, and it is common in the literature around causation to restrict the set of hypotheticals, or counterfactuals, within a causal model [29]. In this work, we will illustrate how finding the optimal causal model for a given system is about a matching between the system behavior and the intervention capabilities considered.
As the focus of this paper is on continuous systems, we consider X and Y to be continuous spaces, with points x ∈ X and y ∈ Y. To formally discuss the causal model of our system, we make use of the do(x) operator, as per Judea Pearl's causal calculus [17]. This operator is defined for any doable intervention the experimenter is capable of either performing or modeling, allowing us to assess its causal effects. This allows us to formally describe a causal model as a map:

x → p(y | do(x)), (1)

where p is the probability density over the effect space Y resulting from "doing" the intervention x. Note that this is distinct from p(y | x) in that the do operator allows us to distinguish the correlation introduced by the causal relation x → y from one due to a common cause a → {x, y}.
The notion of causality is then formalized as a counterfactual: How does the effect of do(x) differ from the effect of doing anything else? This latter "null effect" may include interventions such as do(¬x), but it can also include all other possible interventions: it is thus formally described by averaging together the effects of all considered intervention capabilities X, giving the total "effect distribution":

E_D(y) ≡ ⟨p(y | do(x))⟩_{x ∈ X} = (1/|X|) ∫_X dx p(y | do(x)). (2)

This way, to know precisely which effects do(x) causes, we can compare E_D(y) to p(y | do(x)).
The distinguishability between these distributions may be captured with the Kullback-Leibler divergence D_KL[p(y | do(x)) ‖ E_D(y)], giving us the amount of information associated with the application of an individual do(x) intervention [23,30]. Averaging over all accessible interventions gives the information of the system's entire causal structure, termed the total "effective information":

EI ≡ ⟨D_KL[p(y | do(x)) ‖ E_D(y)]⟩_{x ∈ X}. (3)

Discrete versions of this effective information have been explored in Boolean networks [23] and graphs [31]. Note that the definition of EI here is identical to the mutual information between the uniform distribution over interventions I_D(x) = const and the resulting distribution over effects E_D(y), so that EI = I(I_D; E_D) [22,24].
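As a quick sanity check, Equation (3) can be evaluated numerically on a discretized grid. The following Python sketch (the `effective_information` helper and its discretization scheme are our own illustrative choices, not part of the original construction) computes EI in bits from a matrix of conditional effect densities:

```python
import numpy as np

def effective_information(p_y_do_x, dy):
    """EI (in bits) for a discretized causal model.

    p_y_do_x[i, j] approximates the density p(y_j | do(x_i));
    each row integrates to 1 over an effect grid with spacing dy.
    """
    # Effect distribution E_D(y): average over a uniform intervention space
    E_D = p_y_do_x.mean(axis=0)
    # KL divergence of each intervention's effect distribution from E_D
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(p_y_do_x > 0, p_y_do_x / E_D, 1.0)
    kl = (p_y_do_x * np.log2(ratio)).sum(axis=1) * dy
    # EI: average KL over the (uniform) intervention capabilities
    return kl.mean()

# Two perfectly distinguishable interventions carry exactly 1 bit of EI
p = np.array([[1.0, 0.0],
              [0.0, 1.0]])
ei = effective_information(p, dy=1.0)  # = 1 bit
```

With two non-overlapping effect distributions, this returns exactly 1 bit, matching the discrete intuition that EI counts distinguishable intervention-effect pairs.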
We proceed to illustrate with a simple example how the EI varies across families of simple physical systems. As such, we show how it may be used to select the systems that are in some sense "best controllable," in that they best associate unique effects with unique interventions [32]. Additionally, this example will help us illustrate how the EI may sometimes allow us to identify a coarse-grained system description that is more informative than the full microscopic one, thus illustrating causal emergence [24].

Toy Example: Dimmer Switch
Consider a continuous dimmer switch controlling a light bulb, but with an arbitrary non-linear function y = f(θ), a "dimmer profile," mapping from the switch setting θ ∈ Θ = [0, 1] to the light bulb brightness y ∈ Y = [0, 1] (Figure 1a). To quantify information about causation in continuous systems, we must carefully account for noise and errors in our inputs and outputs; else, infinite precision leads to infinite information. This is an issue for the application of all mutual information measures or their derivatives in deterministic continuous systems. Realistically, in operating a dimmer switch, any user will have a certain "intervention error" on setting its value, as well as an "effect error," which can come either from intrinsic system noise or from extrinsic measurement error. To encode the effect error, we can replace the deterministic mapping θ → y = f(θ) with a probabilistic one θ → p(y | do(θ)) = N_y(f(θ), ε²), the normal distribution centered on f(θ) with standard deviation ε. While we could incorporate the intervention error of setting θ into this probability distribution as well, it is instructive for later discussion and generality to keep it separate. The intervention error is thus similarly encoded by introducing a probabilistic mapping from the "do"-able interventions x ∈ X = [0, 1] to the physical switch settings with some error δ, as x → q(θ | do(x)) = N_θ(x, δ²). Here, we can think of the interventions x ∈ X as the "intended" switch settings, as in practice, we cannot set the switch position with infinite precision. Note that while we do not explicitly model any possible confounding factors here, we assume that these may be present and important, but are all taken care of by our use of the do-operator. This ensures that only true causal relations, and not spurious correlations, are captured by the distributions p and q.
Figure 1. Illustrating continuous Effective Information (EI) on a simple toy system. (a) shows the system construction: a dimmer switch with a particular "dimmer profile" f(θ). We can intervene on it by setting the switch θ ∈ (0, 1) up to error tolerance δ, while effects are similarly measured with error ε; (b) shows that for uniform errors ε = δ = 0.03, out of the family of dimmer profiles parametrized by a (left), the linear profile gives the "best control," i.e., has the highest EI (dark blue: numerical EI calculation; light blue: the approximation in Equation (4)); (c) illustrates how, for two other dimmer profiles (left), increasing error tolerances ε = δ influence the EI (right, calculated numerically). The profile in red represents a discrete binary switch, which emerges if we restrict the interventions on the blue dimmer profile to only use "ends of run." Crucially, such coarse-graining allows for an improved control of the light (higher EI) when errors are sufficiently large.
With this setup, we can now use Equation (3) to explicitly compute an EI for different dimmer profiles f(θ) and see which is causally most informative (has the most distinguishable effects). To do this analytically for arbitrary f(θ), we must take the approximation that δ and ε are small compared to one (the range of interventions and effects) and compared to the scale of curvature of f(θ) (such that f′′(θ) ≪ f′(θ)²). In this limit, we have (for the derivation, see the setup in Equation (6) below and Appendix B.1):

EI ≅ ∫₀¹ dθ log[ f′(θ) / √(2πe (ε² + δ² f′(θ)²)) ], (4)

which echoes the form of the expression for the entropy of a normal distribution. From this, we see variationally that, given the fixed end-points f(0) = 0 and f(1) = 1, EI is maximized iff f′(θ) = 1: a uniformly linear dimmer switch. We can check this numerically by computing the exact EI for several different choices of f(θ) (Figure 1b). A slightly more interesting version of this example is when our detector (eyes) perceives light brightness on a log, rather than linear, scale (Weber-Fechner law, [33]), in which case the effect error will be non-uniform: ε(y) ∝ y. If this error is bound to be sufficiently small everywhere, Equation (4) still holds, replacing only ε → ε(y) ∝ y = f(θ). Again, we can variationally show that here, EI is maximal iff f′(θ)/f(θ) is constant (up to fluctuations of magnitude O[δ/ε]), giving the optimal dimming profile f(θ) = (e^{θ/r} − 1)/(e^{1/r} − 1), with some constant r ∼ δ/ε. In reality, the lighting industry produces switches with many dimming profiles that depend on the application [34], so our approach can be seen as a principled way to optimize this choice.
Interestingly, restricting the accessible interventions can sometimes increase the amount of effective information if it increases the distinguishability, and therefore informativeness, of interventions. This is a form of causal emergence, wherein a higher level (coarsened) macroscopic model emerges as the more informative system description for modeling causation [24]. To give a particular example here, we compare the continuous dimmer profile shown in Figure 1c (left, blue) to its discrete restriction (left, red), which corresponds to a simple binary switch. When the intervention and effect errors δ = ε are small, the continuous switch gives more control opportunities and is thus preferable; its EI is larger than the 1 bit for the discrete switch. However, as we increase the errors, we see a crossover in the two EI values. In this regime, the errors are so large that the intermediate switch positions of the continuous profile become essentially useless and are "distracting" from the more useful endpoint settings. Formally, such causal emergence arises due to the averaging over the set of all interventions in Equation (3). Practically, it captures the intuition that building good causal models, as well as designing useful devices, involves isolating only the most powerful control parameters out of all possible degrees of freedom [32].

Causal Geometry
Taking inspiration from information geometry, we can construct a more intuitive geometric expression for the EI [13]. For studying the causal structures of models, this "geometric EI" we introduce may be viewed as a supplement to the usual EI in Equation (3). While the geometric EI corresponds to the EI in a particular limit, we suggest that it remains useful more generally as an alternative metric of causal efficacy: it captures a causal model's informativeness like the EI does, but in a way that is local in a system's parameter space. In this way, it frames causality not as a global counterfactual (comparing an intervention to all other interventions) [17], but as a local neighborhood counterfactual (comparing an intervention to nearby interventions). Conversely, our construct provides a novel formulation of information geometry that allows it to explicitly account for the causal relations of a model. Moreover, we argue that this is necessary for formal consistency when working with the Fisher information matrix eigenvalues (namely, their covariant formulation; see Section 3.2) [28]. This suggests that model reduction based on these eigenvalues may not be made fully rigorous without explicitly accounting for causal relations in the model.

Construction
Section 2 described a causal model as a set of input-output relations between interventions X and effects Y. Here, we investigate the relationship of such a causal model with the space of parameters Θ that describe the underlying physical system. While these parameters need not necessarily have any direct physical meaning themselves, they are meant to give some abstract internal representation of the system; i.e., they mediate the mapping between interventions and effects. For example, while the notion of energy is merely an abstract concept, it provides a useful model to mediate between interventions such as "turning on the stove" and effects like "boiling water." The goal of our construction here is to compare how well different physical models capture the causal structure of a system.
As we are focusing on continuous systems, we assume that our parameter space Θ forms a smooth d-dimensional manifold, with parameters θ µ indexed by µ, ν ∈ {1, 2, . . . , d}. Each accessible intervention x ∈ X then maps to some probability distribution over parameters x → q(θ | do(x)), and each parameter in turn maps to a distribution over the observed effects θ → p(y | do(θ)). The causal relations we are interested in here are thus simply x → θ → y, but these are assumed to be embedded in some larger more complicated causal graph of additional confounding factors. These other possible hidden causes highlight the importance of using the do-operator to isolate the causal relations in which we are interested (see Appendix A for a simple example). Our one assumption is that the parameter space Θ is chosen to be a sufficiently complete description of the system that no causal link can pass from X to Y directly without being reflected on Θ.
To understand the role of various parameters θ_µ, we can ask how much the effects change as we perturb from some set of parameters in some direction: from θ to θ + dθ. Using the Kullback-Leibler divergence and expanding it to leading order in dθ, we get:

D_KL[p(y | do(θ + dθ)) ‖ p(y | do(θ))] ≅ (1/2) g_µν(θ) dθ^µ dθ^ν, (5)

where summation over repeated indices is implied. This defines the Fisher information metric g_µν(θ) = ∫ dy p(y | do(θ)) ∂_µ log p(y | do(θ)) ∂_ν log p(y | do(θ)). This introduces a distance metric on the parameter space Θ, turning it into a Riemannian manifold, which we term the "effect manifold" M_E (this is usually called simply the "model manifold" in the literature, but here, we want to distinguish it from the "intervention manifold," introduced below). More precisely, it is usually defined as M_E ≡ {p(y | do(θ))}_{θ∈Θ}, the collection of all the effect distributions, or the image of the parameter space Θ under the model mapping, with Equation (5) being the natural distance metric on this space [6,10,14].
Just as the mapping to effects defines the effect manifold M_E, we can similarly construct an "intervention manifold" M_I. For this, we use Bayes' rule to invert the mapping from interventions to parameters x → q(θ | do(x)), thus giving θ → q̃(do(x) | θ), the probability that a given parameter point θ was "activated" by an intervention x. The intervention manifold is thus defined as M_I ≡ {q̃(do(x) | θ)}_{θ∈Θ}, with the corresponding Fisher information metric h_µν giving the distances on this space. With this, we can now summarize our construction:

X → Θ: x → q(θ | do(x)),
Θ → Y: θ → p(y | do(θ)),
M_I: q̃(do(x) | θ) = q(θ | do(x)) I_D(x) / ∫_X dx′ q(θ | do(x′)) I_D(x′). (6)

Note that for the Bayesian inversion in the last line, we used a uniform prior over the intervention space I_D(x) = const, which amounts to assuming that statistically, interventions are uniformly distributed over the entire considered space X. Note that this is not a choice of convenience, but rather of conceptual necessity for correctly defining information in a causal model, as argued in [24]. The natural point-wise correspondence between the two manifolds M_E ↔ M_I: p(y | do(θ)) ↔ q̃(do(x) | θ) then allows for a local comparison between the two geometries. Alternatively, we may simply think of the parameter space Θ with two separate distance metrics on it, the effect metric g(θ) and the intervention metric h(θ). With this setup, we can now define our "geometric" effective information:

EI_g ≡ log V_I − ⟨l(θ)⟩_I. (7)

Here, V_I is the volume of the intervention manifold M_I, which can be computed as

V_I = ∫_Θ d^d θ √(det[h(θ)/(2πe)]).

It quantifies the effective number of distinct interventions we can do, and so, the first term in Equation (7) gives the maximal possible amount of information about causation our model could have, if all interventions perfectly translated to effects. The second term then discounts this number according to how poorly the interventions actually overlap with effects: geometrically, the expression

l(θ) = (1/2) log det(1 + g⁻¹(θ) h(θ)) (8)

quantifies the degree of matching between the metrics g and h at the point θ (here, 1 stands for the identity matrix).
This way, the loss term l(θ) can be interpreted as a measure of "local mismatch" between interventions and effects at θ, quantifying how much information about causation is lost by our modeling choice. The average is then taken according to the intervention metric as ⟨l(θ)⟩_I ≡ (1/V_I) ∫_Θ d^d θ √(det[h(θ)/(2πe)]) l(θ). Note that the expression in Equation (7) is identical to the approximation in Equation (4) for the setup in that example.
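To make this construction concrete, the following Python sketch (our own discretization; the function name and grid scheme are illustrative assumptions) evaluates EI_g from arrays of effect and intervention metrics sampled on a parameter grid, and checks it against the constant-metric case, where it should reduce to the dimmer-switch result:

```python
import numpy as np

def geometric_EI(g, h, dtheta):
    """Geometric EI (bits), Equation (7), on a discretized parameter space.

    g, h : arrays of shape (n_points, d, d) -- effect metric g(theta)
    and intervention metric h(theta) sampled on a grid of spacing dtheta.
    """
    d = g.shape[-1]
    w = np.sqrt(np.linalg.det(h))                        # volume element of M_I
    V_I = w.sum() * dtheta / (2 * np.pi * np.e) ** (d / 2)
    # local mismatch, Equation (8): l = (1/2) log2 det(1 + g^{-1} h)
    l = 0.5 * np.log2(np.linalg.det(np.eye(d) + np.linalg.solve(g, h)))
    # intervention-metric-weighted average of the mismatch
    return np.log2(V_I) - (w * l).sum() / w.sum()

# Constant-metric check: for g = 1/eps^2 and h = 1/delta^2 on theta in [0, 1],
# EI_g reduces to -(1/2) log2(2*pi*e*(eps^2 + delta^2)), as in the dimmer example.
n, eps, delta = 100, 0.03, 0.05
g = np.full((n, 1, 1), 1 / eps**2)
h = np.full((n, 1, 1), 1 / delta**2)
ei = geometric_EI(g, h, 1.0 / n)
```

The batched `det`/`solve` calls let the same helper handle any parameter-space dimension d without modification.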
In Appendix B.2, we show that this expression for EI_g in Equation (7) can be derived as the approximation of the exact EI in Equation (3) when both the mappings are close to deterministic: p(y | do(θ)) = N_y(f(θ), ε²) and q̃(do(x) | θ) = N_x(F(θ), δ²), for some functions f : Θ → Y and F : Θ → X, with small errors ε and δ (which may be anisotropic and nonuniform). Outside of this regime, the EI and EI_g can differ. For instance, while EI is positive by definition, EI_g can easily become negative, especially if g is degenerate anywhere on the manifold. Second, while EI captures the informativeness, and therefore effectiveness, of a causal model globally, EI_g, and more specifically the landscape l(θ), can show us which local sectors of the parameter space are most and least causally effective. Finally, the global nature of the computation of the exact EI quickly makes it intractable, even numerically, for many continuous systems due to the proliferation of high-dimensional probability distributions, making EI_g the more practical choice in those settings.

Relation To Sloppiness
"Sloppiness" is the property empirically observed in many real-world models, when the eigenvalues of the Fisher information matrix g µν take on a hierarchy of vastly varying values [6,11,14]. As such, parameter variations in the directions corresponding to the smallest eigenvalues will have negligible impact on the effects [14]. This leads to the hypothesis that we may effectively simplify our model by projecting out such directions, with little loss for the model's descriptive power [12,13].
The trouble with this approach is that the components of the matrix g µν , and hence its eigenvalues, depend on the particular choice of θ-coordinates on the effect manifold M E [6,28]. Since the parameters Θ represent some conceptual abstraction of the physical system, they constitute an arbitrary choice. This means that for a given point of M E labeled by θ, we can always choose some coordinates in which locally, g(θ) = 1 (an identity matrix), thus apparently breaking the above sloppiness structure. This issue is avoided in the literature by relying on the coordinate independent global properties of M E , namely its boundary structure [12].
Here, we show that by explicitly considering intervention capabilities, we can construct a local, but still coordinate independent sloppiness metric. This becomes possible since interventions give a second independent distance metric on Θ [28]. The matrix product g −1 h appearing in Equation (8) is then a linear transformation, and thus, its eigenvalues are coordinate independent. This way, to evaluate how sloppy a given causal model is, we suggest that it is more appropriate to study the eigenvalues of h −1 g instead of those of g as is usually done [6]. If we then want to identify the directions in the parameter space that are locally least informative at a point θ, we first need to re-express the metric g in terms of the coordinates for which h(θ) = 1 locally and then find the appropriate eigenvectors in these new coordinates.
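This invariance claim is easy to verify numerically. In the following illustrative Python check (our own), both metrics transform covariantly under a coordinate change with Jacobian J, so the spectrum of g⁻¹h is preserved while the spectrum of g alone is not:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(d):
    a = rng.standard_normal((d, d))
    return a @ a.T + d * np.eye(d)      # symmetric positive-definite metric

d = 3
g, h = random_spd(d), random_spd(d)
J = rng.standard_normal((d, d)) + 3 * np.eye(d)   # coordinate-change Jacobian

# Metrics transform covariantly: g -> J^T g J, h -> J^T h J
g2, h2 = J.T @ g @ J, J.T @ h @ J

# Eigenvalues of g alone depend on the coordinate choice ...
ev_g, ev_g2 = np.linalg.eigvalsh(g), np.linalg.eigvalsh(g2)

# ... but g^{-1} h changes by a similarity transform J^{-1} (g^{-1} h) J,
# so its spectrum is coordinate independent
ev_gh = np.sort(np.linalg.eigvals(np.linalg.solve(g, h)).real)
ev_gh2 = np.sort(np.linalg.eigvals(np.linalg.solve(g2, h2)).real)
```

Here `ev_gh` and `ev_gh2` agree to machine precision, while `ev_g` and `ev_g2` generically differ.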
From this perspective, we see that the usual discussion of sloppiness, which does not study interventions explicitly [6], may be said to implicitly assume that the intervention metric h(θ) ∝ 1, meaning that all model parameters directly correspond to physically doable interventions. Moreover, this requires that, with respect to the given coordinate choice, each parameter can be intervened upon with equal uniform precision, which fixes the particular choice of coordinates on the parameter space. As such, the coordinate-specific eigenvalues λ_g(θ) of the effect metric g(θ) studied in information geometry become physically meaningful in this special coordinate frame. In particular, our expression for the local mismatch in Equation (8) can here be expressed in terms of these eigenvalues as l(θ) = (1/2) ∑_λ log(1 + 1/λ_g(θ)). Thus, locally, the directions with the smallest λ_g account for the largest contribution to the mismatch l. This recovers the standard intuition of sloppiness: we can best improve our model's descriptive efficacy by projecting out the smallest-λ_g directions [12,35]. By seeing how this result arises in our framework, we thus point out that it formally relies on the implicit assumption of uniform intervention capabilities over all model parameters.
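For h = 1, the mismatch l(θ) = (1/2) log det(1 + g⁻¹h) indeed reduces to the eigenvalue sum just quoted, as this small Python check (our own) confirms for a random effect metric:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal((4, 4))
g = a @ a.T + np.eye(4)            # a random effect metric; take h = identity

lam = np.linalg.eigvalsh(g)        # eigenvalues lambda_g
l_det = 0.5 * np.log2(np.linalg.det(np.eye(4) + np.linalg.inv(g)))
l_eig = 0.5 * np.sum(np.log2(1 + 1 / lam))
# l_det equals l_eig: the smallest lambda_g dominate the mismatch l
```

The identity holds because a symmetric g diagonalizes, so det(1 + g⁻¹) factorizes into ∏(1 + 1/λ_g).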
It is worth noting that in the construction in Equation (6), it may be possible to integrate out the parameters θ, giving directly the distribution of effects in terms of interventions P(y | do(x)) = ∫ d^d θ p(y | do(θ)) q(θ | do(x)) (we assume that Θ gives a complete description of our system in the sense that no causal links from X to Y can bypass Θ). This way, the P(y | do(x)) distribution gives an effect metric ĝ(x) over the X-space, which can directly quantify the amount of causal structure locally in our model. Nonetheless, the intervention metric and Equation (8) are still implicitly used here. This is because the space X was constructed such that the intervention metric over it would be uniform, ĥ(x) = 1, everywhere. In turn, Equation (8) thus takes a particularly simple form in terms of ĝ(x). This illustrates that regardless of the parametrization we choose to describe our system, causal efficacy always arises as the matching between the effect metric and the intervention metric, per Equation (8), which may or may not take on a simple form and must be checked explicitly in each case. Furthermore, though this goes beyond the scope of this paper, we may imagine cases where the actual interventions form a complicated high-dimensional space X that is harder to work with than the parameter space Θ (just as the effect space Y is often more complex than Θ). In fact, this may be the typical scenario for real-world systems, where X and Y represent arbitrarily detailed descriptions of the system's context, while Θ gives a manageable system abstraction.

Two-Dimensional Example
In order to illustrate our causal geometry framework explicitly and show how higher level descriptions can emerge within it, we use a simple toy model (based on the example considered in [12,13]).
Imagine an experimenter has a mixed population of two non-interacting bacterial species that they are treating with two different antibiotics. The experimenter's measurements cannot distinguish between the bacteria, and so, they are monitoring only the total population size over time y(t) = e^{−θ₁ t} + e^{−θ₂ t}, where θ₁, θ₂ ∈ [0, 1] are the death rates of the two individual species. These death rates are determined by the two antibiotic concentrations {x₁, x₂} the experimenter treats the system with, which are the possible interventions here. In the simplest case, each antibiotic will influence both species via some linear transformation A, such that θ_µ = ∑_i A_µi x_i. This setup allows us to flesh out the causal geometry construction and illustrate causal emergence. Our main question is: When is this system best modeled microscopically, as the two independent species, and when does it behave more like a single homogeneous population, or something else [36]? To identify when higher scale models are more informative for the experimenter, we will calculate the geometric EI_g from Equation (7) for the full 2D model described above and then compare it to two separate 1D coarse-grained model descriptions, shown by the two red 1D sub-manifolds of the parameter space in Figure 2.
We first specify the quantities for the construction in Equation (6). Our interventions x, having some uniform error tolerance δ, map to normal distributions over parameters θ as x → q(θ | do(x)) = N_θ(Ax, δ² AAᵀ), and hence the intervention metric is h_µν = ∑_i (A⁻¹)_iµ (A⁻¹)_iν / δ². The effect space y is constructed by measuring the population size at several time-points, spaced out at intervals ∆t, such that the components of y are given by y_n = y(n∆t) = e^{−n∆t θ₁} + e^{−n∆t θ₂}, with n ∈ {1, 2, . . . , N} and error ε on each measurement (the initial condition is thus always y(0) = 2). Thus, we have θ → p(y | do(θ)) = N_y({y_n}, ε²) and the effect metric g_µν = ∑_n ∂_µ y_n ∂_ν y_n / ε². Figure 2 shows these mappings with N = 2 for visual clarity, and we use N = 3 for the EI_g calculations below, but all the qualitative behaviors remain the same for larger N. Figure 3 shows the resulting geometric EI_g (blue curves), computed via Equation (7) for varying values of the error tolerances ε and δ; in each case, we see a crossover where, with no change in system behavior, the coarse-grained 1D model becomes causally more informative when our intervention or effect errors become large.
Figure 2. An illustration of the causal geometry construction in Equation (6). The parameter space Θ of our model gets two distinct geometric structures: the effect metric g_µν(θ) and the intervention metric h_µν(θ). Here, a model is seen as a map that associates with each set of parameters θ some distribution of possible measured effects y (right). As parameters θ may involve arbitrary abstractions and thus need not be directly controllable, we similarly associate them with practically doable interventions x (left). This way, our system description in terms of θ "mediates" between the interventions and resulting effects in the causal model.
We can similarly find the EI_g for any sub-manifold of our parameter space, which would lead to a coarse-grained causal model, with a correspondingly lower dimensional space of intervention capabilities. To do this, we identify the pull-back of the two metrics in the full parameter space to the embedded sub-manifold, as follows. We define a 1D sub-manifold of Θ as a parametric curve (θ₁, θ₂) = (s₁(σ), s₂(σ)) with the parameter σ. The pull-back effect metric on this 1D space with respect to σ will be the scalar ĝ(σ) = ∑_µν s′_µ(σ) s′_ν(σ) g_µν(s₁(σ), s₂(σ)), and similarly for the intervention metric ĥ(σ). For the 1D sub-manifold depicted by the solid red line in Figure 2, the resulting EI_g is plotted in red in Figure 3.
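The comparison just described can be sketched numerically. The following Python code (our own illustrative implementation, assuming A = identity, ∆t = 0.5, and N = 3, none of which are fixed by the text) evaluates EI_g for the full 2D model on a grid and for its 1D diagonal coarse-graining θ₁ = θ₂ = σ via the pull-back metrics:

```python
import numpy as np

DT, N = 0.5, 3                      # assumed measurement spacing and count
ns = DT * np.arange(1, N + 1)       # the products n * Delta_t

def metrics_2d(th1, th2, eps, delta):
    """Effect metric g and intervention metric h at (th1, th2), with A = 1."""
    d1 = -ns * np.exp(-ns * th1)    # d y_n / d theta_1
    d2 = -ns * np.exp(-ns * th2)    # d y_n / d theta_2
    g = np.array([[d1 @ d1, d1 @ d2],
                  [d1 @ d2, d2 @ d2]]) / eps**2
    h = np.eye(2) / delta**2
    return g, h

def EIg_2d(eps, delta, m=40):
    # offset the two grids: g degenerates exactly on the diagonal th1 == th2
    t1, t2 = (np.arange(m) + 0.25) / m, (np.arange(m) + 0.75) / m
    ws, ls = [], []
    for a in t1:
        for b in t2:
            g, h = metrics_2d(a, b, eps, delta)
            ws.append(np.sqrt(np.linalg.det(h)))
            ls.append(0.5 * np.log2(np.linalg.det(g + h) / np.linalg.det(g)))
    ws, ls = np.array(ws), np.array(ls)
    V_I = ws.sum() / m**2 / (2 * np.pi * np.e)       # (2 pi e)^{d/2}, d = 2
    return np.log2(V_I) - (ws * ls).sum() / ws.sum()

def EIg_1d(eps, delta, m=200):
    # pull-back onto theta_1 = theta_2 = sigma, tangent vector s' = (1, 1)
    sig = (np.arange(m) + 0.5) / m
    c = np.array([(ns**2 * np.exp(-2 * ns * s)).sum() for s in sig])
    g_hat = 4 * c / eps**2           # g11 + 2 g12 + g22 on the diagonal
    h_hat = 2 / delta**2             # constant along the curve
    V_I = np.sqrt(h_hat / (2 * np.pi * np.e))        # times curve length 1
    l = 0.5 * np.log2(1 + h_hat / g_hat)
    return np.log2(V_I) - l.mean()
```

At small errors the 2D model has the higher EI_g, while at large errors the 1D homogeneous-population model wins, reproducing the crossover behavior described in the text.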
The crossover seen in Figure 3 thus illustrates causal emergence: for larger error values, the coarse-grained 1D description turns out to be more informative than the full 2D model. Since this coarse-graining corresponds to the case where the two bacterial species are seen as identical (θ₁ = θ₂), we can say that at large errors, our bacterial colony is better modeled as a single homogeneous population. Crucially, this arises not from any change in system behavior, but merely from how we interact with it: either from what interventions we impart or from which effects we measure. Note also that when both the intervention and effect errors are scaled together, δ ∝ ε, we see analytically from Equation (8) that l(θ) is constant, and so:

EI_g = −d log δ + const, (9)

which is also explicitly seen in Figure 3c. This indicates that, quite generally, we expect to see crossovers between geometric EIs of models with different d as we scale the errors, with low-dimensional models being preferred at large noise. Since noise is ubiquitous in all real-world complex systems, this argument suggests why reductionist microscopic descriptions are rarely optimal from the perspective of informative interventions. By carrying out similar calculations, in Figure 4, we explore how the optimal model choice depends on the time-scales we care about for the population dynamics (effects) and the antibiotics we are using (interventions), all at fixed errors ε, δ. When the two antibiotics control the two bacterial species almost independently (A ∼ 1, Figure 4a), we can identify three distinct regimes in the EI_g plot as we tune the measurement time-scale ∆t along the x-axis. If we only care about the population's initial response to the treatment at early times, then we get a higher EI_g by modeling our colony as a single bacterial species. For intermediate times, the full 2D model has the higher EI_g, showing that in this regime, modeling both species independently is preferred.
Finally, at late times, most of the population is dead, and the biggest remaining effect reflects how dissimilar the two death rates were; here, the coarse-grained model given by the dashed red sub-manifold in Figure 2 (θ^2 = 1 − θ^1) turns out to be more informative. In this regime, rather than viewing the population as either one or two independent bacterial species, we may think of it as a tightly coupled ecosystem of two competing species. Interestingly, such apparent coupling emerges here not from the underlying system dynamics, but from the optimal choice of coarse-grained description for the given effects of interest.
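The same crossover logic can be illustrated with the original discrete EI of [24], where coarse-graining a noisy micro-level transition matrix can raise EI outright. The toy 4-state chain below is our own minimal illustration, not the bacterial model of the text:

```python
import numpy as np

def ei_bits(tpm):
    """Effective information of a transition matrix under uniform do(x):
    EI = mean_i D_KL( p(.|i) || mean_j p(.|j) ), in bits."""
    tpm = np.asarray(tpm, float)
    marginal = tpm.mean(axis=0)  # effect distribution under uniform interventions
    kl_rows = [sum(p * np.log2(p / marginal[j]) for j, p in enumerate(row) if p > 0)
               for row in tpm]
    return float(np.mean(kl_rows))

# Micro model: states {0,1,2} mix uniformly among themselves; state 3 is absorbing.
micro = np.array([[1/3, 1/3, 1/3, 0],
                  [1/3, 1/3, 1/3, 0],
                  [1/3, 1/3, 1/3, 0],
                  [0,   0,   0,   1]])
# Macro model: lump {0,1,2} -> A and {3} -> B; both macrostates map deterministically.
macro = np.array([[1, 0],
                  [0, 1]])

ei_micro, ei_macro = ei_bits(micro), ei_bits(macro)
print(ei_micro, ei_macro)  # the macro EI (1 bit) exceeds the micro EI (~0.81 bits)
```

Here the coarse-grained description is strictly more informative because the micro-level noise within the lumped group carries no causal information, mirroring the large-error regime of the geometric picture.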

For a different set of intervention capabilities, where the antibiotics affect both species in a more interrelated way, with A = (1, 0.8; 0.7, 1), this entire picture changes (Figure 4b). In particular, we get a scenario where the "fundamental" two-species model is never useful, and the unintuitive "two competing species" description is actually optimal at most time-scales. Note also that in all cases, for very long and very short times ∆t, the geometric EI_g drops below zero. While in this regime the agreement with the exact EI breaks down (EI_g ≠ EI), it is heuristically true that EI_g < 0 implies a small EI, and so the causal model is no longer very useful here. Even so, as seen in Figure 4, some coarse-graining of the model may still be effective even when the full microscopic description becomes useless.

Discussion
The world appears to agents and experimenters as having a certain scale and boundaries. For instance, solid objects are made of many loosely-connected atoms, yet in our everyday experience, we invariably view them as single units. While this may be intuitively understood in terms of the dictates of compression and memory storage [37], this work proposes a way to formalize precisely how such a coarse-grained modeling choice may be the correct (causally optimal) one for a given set of intervention capabilities. Specifically, we frame model selection as a geometric "matching," in the information space of the causal model, to the accessible interventions.
Intriguingly, this suggests that the correct choice of scientific model may be a function not merely of a correct understanding of the system, but also of the context the system is used in and of the capabilities of the experimenters. Thus, for example, if the forces we use to handle solid objects were far larger than the inter-atomic attraction holding them together, then viewing objects as single units would no longer be a good model. This echoes one of the main ideas of "embodied cognition" in AI and psychology, which posits that in order for an agent to build accurate models of reality, it needs the ability to actively intervene in the world, not merely observe it [38].
This highlights a potentially important distinction between optimizing a model's predictive efficacy and its causal efficacy. Many approaches to optimal model selection, such as sloppiness, focus on getting computationally efficient predictions from a few fundamental parameters. In contrast, optimizing causal efficacy looks for a model that best translates all interventions into unique effects, thus giving the user optimal power to control the system. Such a shift of motivation fundamentally changes our perspective on good scientific modeling, roughly shifting the emphasis from prediction to control. While these two motivations may often go hand-in-hand, the question of which is more fundamental may be important in scenarios where they disagree.
The causal geometry we introduce here is a natural extension of the information geometry framework [6,10], now explicitly accounting for the causal structure of model construction. In our proposed formalism, a given model becomes associated with two distinct Riemannian manifolds, one capturing the role of interventions and the other that of effects, along with a mapping between them. The relative geometric matching between these two manifolds tells us locally how causally informative the present model is and what coarse-graining may lead to a local improvement.
In this structure, the colloquial notion of model "sectors" (especially used in field theories to refer to various field content [39]) becomes associated with literal sectors of the manifolds, with their local geometries specifying the optimal or emergent descriptions of that sector. Such examples also highlight the importance of having a local way to quantify model optimality, as globally, the manifold may have a complex and piecewise structure not amenable to simplification. While both traditional EI [24] and information-geometric model reduction [12] depend on the global model behavior over the entire span of possible interventions, the geometric EI_g introduced here is built by averaging an inherently local causal efficacy, defined at each point in parameter space. We can further speculate that, fundamentally, the geometric matching in EI_g may provide a novel way to quantify causality locally, where the counterfactual comparison is made relative to a local neighborhood of interventions, rather than to all globally accessible ones [17].
We hope that causal geometry can contribute to further development in both formal principled methods for optimal model building in complex systems, as well as an abstract understanding of what it means to develop informative scientific theories.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.
Acknowledgments: We thank Mark Transtrum for helpful discussions throughout the work on the manuscript, as well as for review of the final draft; Mikhail Tikhonov for inspiration at the early stages of the project; and Thomas Berrueta for great comments on the final draft.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Example Illustrating the Do-Operator
Here, we give a simple pedagogical example that illustrates the use of the do-operator and its role in distinguishing causal relations from correlations. To highlight its role in the context of information geometry and sloppiness, we use a setup familiar from that literature [6]. Consider some bacterial population whose decay over time in unfavorable conditions we want to monitor: y(t) = e^{−θt}, where y(t) is the fraction remaining at time t of the original population at t = 0, and θ is the death rate. In the context of information geometry, we might consider θ as a model parameter and y(t) as the predicted data. However, we cannot say that θ "causes" the decay, as any such statement about causality requires distinguishing interventions and effects. Without this, our model is merely an exponential fit to the data y(t), with θ labeling a compressed representation of the decay curve.
To have a causal, rather than a descriptive, model, we need to introduce interventions that can influence this decay curve (and hence its descriptor θ). For example, this may be the concentration of some harmful chemical in the bacteria's environment. For the simplicity of our example, imagine that we have a direct linear mapping from these concentrations x to the death rate: do(x) → θ. Note that this mapping is a causal model, with the do-operator being well-defined on the space of interventions: it prescribes actively setting the chemical concentrations, in spite of any other possible environmental causes and fluctuations. In this setting, we can show how causal dependencies q(θ|do(x)) can differ from statistical ones q(θ|x); these will be distinct whenever any confounding factors are present. For concreteness in our example, we can introduce temperature T as such a confounder: On the one hand, it might speed up the bacteria's life-cycle, and hence the death rate, such that do(x) → θ = x + αT. On the other, it can denature the harmful chemical as x = x_0/T, where x_0 is the reference concentration at T = 1.
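A quick simulation sketch of this setup (the numbers are our own choices; we also let T fluctuate around the reference value T = 1 rather than 0, so that x = x_0/T stays numerically well-behaved) shows how the observational slope of θ against x is biased away from the causal slope of 1:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n = 1.0, 200_000
sigma_T, sigma_x0 = 0.15, 0.5  # assumed fluctuation scales for this sketch
T = 1.0 + rng.normal(0.0, sigma_T, n)
x0 = rng.normal(2.0, sigma_x0, n)

# Observational data: x co-varies with the confounder T through x = x0 / T.
x_obs = x0 / T
theta_obs = x_obs + alpha * T

# Interventional data: do(x) sets x directly; T still fluctuates, independently of x.
x_do = rng.uniform(1.0, 3.0, n)
theta_do = x_do + alpha * (1.0 + rng.normal(0.0, sigma_T, n))

slope_obs = np.cov(x_obs, theta_obs)[0, 1] / np.var(x_obs)
slope_do = np.cov(x_do, theta_do)[0, 1] / np.var(x_do)
print(slope_obs, slope_do)  # the causal slope is 1; the observational slope is biased below 1
```

The bias in the observational slope is exactly the confounding that the do-operator removes: under do(x), the temperature fluctuations add noise to θ but no longer correlate with x.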
If our model is specified explicitly, as is being done here, then all the causal relations are a priori known, and no work is needed to distinguish causation from correlation. The do-operator becomes important when we look at a more real-world setting where we can interact with a system, but do not know its underlying dynamics. From the perspective of sloppiness, this is referred to as data-driven manifold exploration. The do-operator prescribes how we should collect our data to extract the correct causal relations.
First consider a "wrong" way to explore the model manifold, which will yield only the statistical dependencies q(θ|x). For this, we study the fluctuations of the bacterial population and its habitat in their natural environment (specifically looking at θ and x). We can imagine that these fluctuations are dominated by natural variation of the reference chemical concentration x_0 ∼ N(x̄, σ_x²), with an additional influence from temperature fluctuations T ∼ N(0, σ_T²) (where we define the scale of T such that its fluctuations are centered on zero). Since we only observe x, and not x_0 or T, we integrate these out to arrive at the statistical dependence q(θ|x) of Equation (A1), with an x-dependent net standard deviation σ_net defined in its last line. Now, to capture the correct causal relationships, we instead use the do-operator to find q(θ|do(x)), which prescribes actively setting the chemical concentration x to specific values, rather than passively waiting for these to be observed. This means that while temperature can still fluctuate as above, T ∼ N(0, σ_T²), we do not defer to fluctuations of x_0 to produce specific values of x, but rather set these concentrations ourselves (e.g., in lab conditions). This removes the confounding correlations, while the T fluctuations now merely add uniform noise on θ, giving Equation (A2). While the distributions resulting from the two above setups (Equations (A1) and (A2)) are clearly distinct, we want to take one further step and show that the information metrics they induce are also different. First, we note that the distribution p(y|θ), and so the "effect metric" g(θ) (see Equation (6)), simply reflects how the space of decay curves y(t) is parametrized by the scalar θ. Thus, in this example, g(θ) is not about causality, and so is fixed regardless of how we measure our data. We therefore expect the distinction to be captured entirely by the intervention metrics. To find these, according to the setup in Equation (6), we first need the Bayesian inverse q̃(do(x)|θ), for which we must calculate the normalization ∫dx q(θ|do(x)).
For the distribution in Equation (A2), this normalization is one, and q̃(do(x)|θ) = q(θ|do(x)), thus giving the uniform intervention metric h_caus(θ) = 1/(α² σ_T²) for the true causal dependences in the system. For Equation (A1), the normalization integral cannot be done exactly, so we take the approximation that the fluctuations σ_T of the confounding temperature variable T are small. With this, we can evaluate ∫dx q(θ|x), and thus obtain the "intervention" metric h_stat(θ) for the statistical dependencies, which involves an average ⟨·⟩_x taken according to the distribution q̃(x|θ). Notably, unlike h_caus above, h_stat(θ) varies with θ, giving a qualitatively different geometry than the true causal geometry for this example.
As such, we have shown in an explicit setup how the causal relations q(θ|do(x)) can differ from statistical ones q(θ|x), leading to distinct geometric structures on the model manifold, which may then produce different schemes for model simplification.
In general, while we may not know all the confounding factors nor the effects they can have, we can simply rely on using the do-operator to extract the true causal relations for us.
For this reason, none of the examples presented in this work make any explicit reference to confounding factors, as these may be assumed to be plentiful and unknown. For further discussion on the role of confounding factors and the tools of causal calculus that help to work with complex causal graphs, see [17].

Appendix B. Deriving Geometric EI
Here, we derive the expression for EI_g in Equation (7), and equivalently in Equation (4). We start from the definition of EI in Equation (3) and use the near-deterministic model approximation discussed in the main text and below. In Appendix B.1, we go through the detailed derivation for the 1D case; in Appendix B.2, we then outline the steps needed to generalize it to higher dimensions.
Appendix B.1. One-Dimensional Case

As mentioned in the main text, the expression for EI_g presented here only approximates the exact EI when the mappings from interventions x ∈ X to parameters θ ∈ Θ and to effects y ∈ Y are both nearly deterministic. Explicitly, this means that we can express the probability distributions as Gaussians with small variances. Note that the variance can be different at different points, and in the multi-dimensional case may be anisotropic, as long as it remains sufficiently small everywhere (to be clarified later). This way, we can specify the concrete expressions for the constructs in Equation (6), given in Equation (A3). Note that in 1D, the metrics g and h become scalars, and N(x; F(θ), δ²) denotes a Gaussian distribution in x, centered on F(θ) and with standard deviation δ. Furthermore, we define F(θ) and the intervention errors δ as above, giving that q(θ|do(x)) = q̃(do(x)|θ) merely for later convenience of notation (as δ may depend on x). As such, we also assume F(θ) and f(θ) to be invertible.
We begin with the definition of EI from Equation (3), for which we must first calculate the distribution P(y|do(x)) ≡ ∫dθ p(y|do(θ)) q(θ|do(x)), written out in Equation (A4). We can evaluate this Gaussian integral in the limit of small ε and δ. Let us understand precisely how small these must be.
To work with the above integral, δ must be small enough that both q̃(do(x)|θ) and q(θ|do(x)) are Gaussian. For this, the second-order term in the Taylor expansion of F(θ) must be negligible in all regions with substantial probability, giving the asymptotic assumption that F′′(θ) ∆θ ≪ F′(θ) over the typical spread ∆θ of the distribution. We can check that in this case, plugging this expansion back into q̃(x|θ) from Equation (A3), we indeed get a Gaussian in θ, as desired. This then shows us that we expect θ to typically be within ∆θ ∼ δ/F′(θ) of F⁻¹(x), which allows us to write the above asymptotic condition as F′′(θ) δ ≪ F′(θ)² for all θ. The exact same argument applies to the effect distribution p(y|do(θ)) and similarly gives the condition f′′(θ) ε ≪ f′(θ)². As long as these two conditions hold for all θ, we can allow δ = δ(x) and ε = ε(y) to vary arbitrarily.
In this limit, the integration of the expression in Equation (A4) is straightforward to carry out, giving another Gaussian, Equation (A5), where θ_y ≡ f⁻¹(y) and θ_x ≡ F⁻¹(x), and we used the expressions for the two metrics in Equation (A3). Averaging this over the interventions and using the fact that σ ∼ O[ε, δ] is small, we can then find the effect distribution, where L is the size of the 1D intervention space X, so that the uniform intervention distribution is I_D(x) = 1/L. With these expressions, we can now calculate EI = ⟨D_KL[P(y|do(x)) ‖ E_D(y)]⟩_{I_D(x)}. Since σ is small, the Gaussian for P(y|do(x)) ensures that θ_x is close to θ_y, and so to leading order we may replace σ(θ_x, θ_y) ≈ σ(θ_x, θ_x) here. This gives Equations (A7) and (A8), where in the last line we simply rearranged and substituted the expressions for the metrics g(θ) and h(θ) from Equation (A3). Here, we first see that Line (A7) reproduces the EI approximation we showed for the dimmer-switch example in Equation (4), for the setup there: F(θ) = θ and L = 1. In general, by recognizing that here the volume of the intervention space is V_I = ∫dθ √h(θ) = ∫dθ F′(θ)/δ = L/δ, we finally see that the expression in Equation (A8) agrees with our main result presented in Equation (7) in the case of the 1D parameter space discussed here.
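As a numerical sanity check of this near-deterministic limit (our own script, not from the text; for simplicity we take F(θ) = θ and f(θ) = θ, so that P(y|do(x)) is a Gaussian channel of width σ on a length-L intervention interval), the EI computed by direct integration approaches log₂(L/(σ√(2πe))), i.e., the log-volume of interventions minus the Gaussian effect entropy:

```python
import numpy as np
from math import erf, log2, pi, e

def ei_gaussian_channel(L=1.0, sigma=0.01, n=220_001):
    """EI (in bits) for P(y|do(x)) = N(y; x, sigma^2) with x uniform on [0, L]:
    EI = H[effect marginal] - H[Gaussian of width sigma]."""
    y = np.linspace(-6 * sigma, L + 6 * sigma, n)
    dy = y[1] - y[0]
    Phi = lambda u: 0.5 * (1.0 + np.array([erf(v) for v in u]))
    # effect marginal E_D(y) = (1/L) [Phi(y/sigma) - Phi((y-L)/sigma)]
    p = (Phi(y / (sigma * np.sqrt(2.0))) - Phi((y - L) / (sigma * np.sqrt(2.0)))) / L
    p = np.clip(p, 1e-300, None)
    h_marg = -np.sum(p * np.log2(p)) * dy        # differential entropy of the marginal
    h_cond = 0.5 * log2(2 * pi * e * sigma**2)   # conditional (Gaussian) entropy
    return h_marg - h_cond

sigma = 0.01
ei_num = ei_gaussian_channel(sigma=sigma)
ei_approx = log2(1.0 / (sigma * np.sqrt(2 * pi * e)))
print(ei_num, ei_approx)  # close agreement (difference well under 0.1 bit)
```

The small residual difference comes from the smoothing of the uniform marginal at the edges of the interval, which vanishes as σ/L → 0, consistent with the asymptotic conditions above.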

Appendix B.2. Multi-Dimensional Case
Generalizing the above derivation to the multi-dimensional case is straightforward and is mainly a matter of careful bookkeeping. For interventions x^a ∈ X, parameters θ^µ ∈ Θ, and effects y^i ∈ Y, the expressions in Equation (A3) become here:

x → q(θ|do(x)), such that q̃(do(x)|θ) ≡ q(θ|do(x)) / ∫d^d x q(θ|do(x)) = √(det ∆_ab)/(2π)^{d_I/2} exp[−½ ∆_ab (x−F(θ))^a (x−F(θ))^b],

θ → p(y|do(θ)) = √(det E_ij)/(2π)^{d_E/2} exp[−½ E_ij (y−f(θ))^i (y−f(θ))^j].

Here, a, b ∈ {1, ..., d_I} index the dimensions of the intervention space X; µ, ν ∈ {1, ..., d} the dimensions of the parameter space Θ; and i, j ∈ {1, ..., d_E} those of the effects space Y. As we are dealing with general multi-dimensional geometric constructs, in this section we are careful to denote contravariant vector components with upper indices, as in θ^µ, and covariant components with lower indices, as in ∂_µ ≡ ∂/∂θ^µ. As in Equations (A4) and (A5) above, we can then compute the distribution over effects conditioned on interventions, Equation (A10), where Σ denotes the matrix with components Σ_µν and Σ⁻¹ its matrix inverse (and similarly for g and h). Here, we assume that both functions F(θ) and f(θ) are invertible, which means that the intervention and effect spaces X and Y both have the same dimension as the parameter space Θ: d_I = d_E = d. This allows us to view the map θ^µ → f^i(θ) as a change of coordinates, with a square Jacobian matrix ∂_µ f^i, whose determinant in the first line of Equation (A10) is thus well-defined and may be usefully expressed as det ∂_µ f^i = √(det g_µν / det E_ij).
Note also that to get the above result, we once again assume the distributions q̃(do(x)|θ) and p(y|do(θ)) to be nearly deterministic, meaning here that the matrices ∆ and E must be large (though the precise form of this assumption is messier here). Averaging this result over the intervention space X, we get Equation (A11), where we define the intervention-space volume, in units of the variance of q̃, as V_I = ∫_X d^d x √(det ∆_ab). Performing another such average over X, with some algebra similar to that for Equation (A7), we arrive at our result in Equation (7).
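The determinant identity used above follows from the pull-back structure g_µν = ∂_µ f^i E_ij ∂_ν f^j (consistent with the 1D expressions, where g = f′²·E with E = 1/ε²). A quick numerical check of det ∂f = √(det g / det E), with random matrices of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.normal(size=(3, 3))           # Jacobian  d f^i / d theta^mu
A = rng.normal(size=(3, 3))
E = A @ A.T + 3 * np.eye(3)           # a positive-definite effect metric E_ij
g = J.T @ E @ J                       # pulled-back parameter metric g_mu_nu

lhs = abs(np.linalg.det(J))
rhs = np.sqrt(np.linalg.det(g) / np.linalg.det(E))
print(lhs, rhs)  # identical up to floating-point error
```

This is just det g = (det J)² det E applied to the pull-back, so the identity holds for any invertible Jacobian and positive-definite E.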