# When the Map Is Better Than the Territory

## Abstract


## 1. Introduction

## 2. Assessing Causal Structure with Information Theory

[…] $t_{+1}$ are observed. More broadly, there is an application of some Intervention Distribution ($I_D$) composed of the probabilities of each $do(s_i)$.

[…] $I_D$ that equals $H^{max}$ screens off EI from being sensitive to the marginal or observed probabilities of states (for instance, how often someone uses a light switch does not impact its causal connection to a light bulb). In general, every causal model (which may correspond to a particular scale) will have some associated “fair” $I_D$ over the associated set of states endogenous to (included in) the causal model, which is when those endogenous states are intervened upon equiprobably ($I_D = H^{max}$). Perturbing using the maximum entropy distribution ($I_D = H^{max}$) means intervening on some system S over all n possible states with equal probability so that ($do(S = s_i)\ \forall i \in 1 \dots n$), i.e., the p of each member $do(s_i)$ of $I_D$ is 1/n.

[…] $I_D$ results in Effect Distribution ($E_D$), that is, the effects of $I_D$. If an $I_D$ is applied to a memoryless system with the Markov property such that each $do(s_i)$ occurs at some time t then the distribution of states transitioned into at $t_{+1}$ is $E_D$. From the application of $I_D$ a Transition Probability Matrix (TPM) is constructed using Bayes’ rule. The TPM associates each state $s_i$ in S with a probability distribution of past states ($S_P | S = s_i$) that could have led to it, and a probability distribution of future states that it could lead to ($S_F | S = s_i$). In these terms, the $E_D$ can be stated formally as the expectation of ($S_F | do(S = s_i)$) given some $I_D$.
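As a minimal sketch of this construction, the snippet below computes $E_D$ as the $I_D$-weighted average of the rows of a TPM, where row i is read as the distribution ($S_F | do(S = s_i)$). The three-state TPM and the function name `effect_distribution` are hypothetical and serve only as an illustration.

```python
def effect_distribution(tpm, intervention_dist):
    """E_D: expectation of (S_F | do(S = s_i)) under the intervention distribution."""
    n_out = len(tpm[0])
    return [
        sum(p_do * row[j] for p_do, row in zip(intervention_dist, tpm))
        for j in range(n_out)
    ]

# Hypothetical 3-state TPM (an assumption); row i is (S_F | do(S = s_i)).
tpm = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
n = len(tpm)
i_d = [1.0 / n] * n  # I_D = H^max: each do(s_i) has p = 1/n
e_d = effect_distribution(tpm, i_d)
print(e_d)
```

Because every row of the TPM sums to one and $I_D$ sums to one, the resulting $E_D$ is itself a probability distribution.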

[…] $I_D$; $E_D$). Its value is determined by the effects of each individual state in the system, such that, if intervening on a system S, the EI is:

[…] $D_{KL}$ is the Kullback-Leibler divergence [15].

[…] $I_D$ (where generally $I_D = H^{max}$). The effect information of an individual state in this formalization can be stated generally as:

[…] $s_i$ makes to the future of the system.

[…] $t_{+1}$) of three Markov chains, each with n = 4 states {00, 01, 10, 11}:

[…] $M_1$ every state completely constrains both the past and future, while the states of $M_2$ constrain the past/future only to some degree, and finally $M_3$ is entirely unconstrained (the probability of any state-to-state transition is 1/n). This affects the chains’ respective EI values. Assuming that $I_D = H^{max}$: EI($M_1$) = 2 bits, EI($M_2$) = 1 bit, EI($M_3$) = 0 bits. Given that the systems are the same size (n) and the same $I_D$ is applied to each, their differences in EI stem from their differing levels of effectiveness (Figure 1), a value which is bounded between 0 and 1. Effectiveness (eff) is how successful a causal structure is in turning interventions into unique effects. In Figure 1 the three chains are drawn above their TPMs. Probabilities are represented in grayscale (black is p = 1).
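The three EI values can be reproduced with a short sketch. It assumes EI is computed as the mutual information between $I_D$ and $E_D$, written as the $I_D$-weighted average of the Kullback-Leibler divergence between each row of the TPM and $E_D$. The TPMs below are assumptions chosen to match the verbal description (a permutation for $M_1$, two noisy blocks for $M_2$, uniform noise for $M_3$); they are not taken from Figure 1.

```python
from math import log2

def kl_bits(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def effective_information(tpm, i_d=None):
    """EI in bits; i_d defaults to the maximum entropy distribution H^max."""
    n = len(tpm)
    if i_d is None:
        i_d = [1.0 / n] * n
    e_d = [sum(w * row[j] for w, row in zip(i_d, tpm)) for j in range(len(tpm[0]))]
    return sum(w * kl_bits(row, e_d) for w, row in zip(i_d, tpm) if w > 0)

# Assumed TPMs over the states {00, 01, 10, 11}.
m1 = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
m2 = [[0.5, 0.5, 0, 0], [0.5, 0.5, 0, 0], [0, 0, 0.5, 0.5], [0, 0, 0.5, 0.5]]
m3 = [[0.25] * 4 for _ in range(4)]

ei_values = [effective_information(m) for m in (m1, m2, m3)]
print(ei_values)
```

Under these assumed TPMs the function returns 2, 1, and 0 bits respectively, in line with the values quoted above.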

[…] $S_F | do(S = s_i)$) to the maximum entropy distribution $H^{max}$:

[…] $I_D$ contribute equally to eff, which is how effective $I_D$ is at transforming interventions into unique effects:

[…] $I_D$ of $H^{max}$ leads to eff = 1 then all the state-to-state transitions of that system are logical biconditionals of the form $s_i \Leftrightarrow s_k$. This indicates that, for the causal relationship between any two states, $s_i$ is always completely sufficient and utterly necessary to produce $s_k$. That is, if all states transition with equal probability to all states then no intervention makes any difference to the future of the system. Conversely, if all states transition to the same state, then no intervention makes any difference to the future of the system. If each state transitions to a single other state that no other state transitions to, then and only then does each state make the maximal difference to the future of the system, so EI is maximal. For further details on effectiveness, determinism, and degeneracy, see [3]. Additionally, for how eff contributes to how irreducible a causal relationship in a system is when partitioned, see [16].
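The three limiting cases described here can be checked numerically. The sketch below assumes that, under $I_D = H^{max}$, eff equals EI divided by $\log_2 n$, which is consistent with the relation $I(I_D;E_D) = H(I_D) * eff$ used later in the text. The three TPMs and the function names are illustrative assumptions.

```python
from math import log2

def effective_information(tpm):
    """EI in bits under I_D = H^max (each do(s_i) has p = 1/n)."""
    n = len(tpm)
    e_d = [sum(row[j] for row in tpm) / n for j in range(n)]
    total = 0.0
    for row in tpm:
        total += sum(p * log2(p / e_d[j]) for j, p in enumerate(row) if p > 0)
    return total / n

def effectiveness(tpm):
    """Assumed normalization: eff = EI / log2(n)."""
    return effective_information(tpm) / log2(len(tpm))

n = 4
# Every state transitions to every state with equal probability.
noise = [[1.0 / n] * n for _ in range(n)]
# Every state transitions to the same state.
degenerate = [[1.0, 0.0, 0.0, 0.0] for _ in range(n)]
# Each state transitions to a single state that no other state transitions to.
biconditional = [[1.0 if j == (i + 1) % n else 0.0 for j in range(n)] for i in range(n)]

eff_noise = effectiveness(noise)
eff_degenerate = effectiveness(degenerate)
eff_biconditional = effectiveness(biconditional)
print(eff_noise, eff_degenerate, eff_biconditional)
```

Only the third TPM reaches eff = 1; the first two give eff = 0 for opposite reasons (complete noise versus complete convergence onto one state).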

[…]${}_{D}$).

[…] $I_D$ of $H^{max}$, eff = 1 (as the on/off state of LS at t perfectly constrains the on/off state of LB at $t_{+1}$). Compare that to a light dial (LD) with 256 discriminable states, which controls a different light bulb ($LB_2$) that possesses 256 discriminable states of luminosity. Just examining the sufficiency or necessity of the causal relationships would not inform us of the crucial causal difference between these two systems. However, under an $I_D$ that equals $H^{max}$, EI would be 1 bit for $LS \to LB$ and 8 bits for $LD \to LB_2$, capturing the fact that a causal model with the same general structure but more states should have higher EI over an equivalent set of interventions, indicating that the causal influence of the light dial is correspondingly that much greater.
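The switch and dial comparison can be sketched by modeling each device as a one-to-one (identity) TPM, which is an assumption about how the two systems would be represented: state i of the controller at t always yields state i of the bulb at $t_{+1}$.

```python
from math import log2

def effective_information(tpm):
    """EI in bits under I_D = H^max."""
    n = len(tpm)
    e_d = [sum(row[j] for row in tpm) / n for j in range(n)]
    total = 0.0
    for row in tpm:
        total += sum(p * log2(p / e_d[j]) for j, p in enumerate(row) if p > 0)
    return total / n

def one_to_one(n):
    """Identity TPM: each controller state maps to exactly one bulb state."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

ei_switch = effective_information(one_to_one(2))    # LS -> LB
ei_dial = effective_information(one_to_one(256))    # LD -> LB_2
print(ei_switch, ei_dial)
```

Both systems have eff = 1, so the difference in EI (1 bit versus 8 bits) comes entirely from the size of the intervention repertoire, $\log_2 2$ versus $\log_2 256$.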

[…] $I_D = H^{max}$). A system may only be able to support a less complex set of interventions, but as long as eff is higher, the EI can be higher. Consider two Markov chains: $M_A$ has a high eff, but low $H^{max}(I_D)$, while $M_B$ has a large $H^{max}(I_D)$, but low eff. If $eff_A > eff_B$, and $H^{max}(I_D)_A < H^{max}(I_D)_B$, then EI($M_A$) > EI($M_B$) only if $\frac{eff_B}{eff_A} < \frac{H^{max}(I_D)_A}{H^{max}(I_D)_B}$. This means that if $H^{max}(I_D)_B \gg H^{max}(I_D)_A$ then there must be larger relative differences in effectiveness, such that $eff_A \ggg eff_B$, for $M_A$ to have higher EI. Importantly, causal models that represent systems at higher scales can have increased eff, in fact, so much so that it outweighs the decrease in $H(I_D)$ [3].
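The condition follows from a short derivation, assuming (as the relation $I(I_D;E_D) = H(I_D) * eff$ used later in the text suggests) that EI factors as $eff \times H^{max}(I_D)$ when $I_D = H^{max}$, and that both $eff_A$ and $H^{max}(I_D)_B$ are positive. The numbers in the closing sentence are illustrative only.

$$
\begin{aligned}
EI(M_A) > EI(M_B) &\iff eff_A \, H^{max}(I_D)_A > eff_B \, H^{max}(I_D)_B \\
&\iff \frac{eff_B}{eff_A} < \frac{H^{max}(I_D)_A}{H^{max}(I_D)_B}
\end{aligned}
$$

For instance, with $eff_A = 1$, $H^{max}(I_D)_A = 1$ bit, $eff_B = 0.1$, and $H^{max}(I_D)_B = 8$ bits, the ratio condition $0.1 < 0.125$ holds, and correspondingly $EI(M_A) = 1$ bit exceeds $EI(M_B) = 0.8$ bits.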

## 3. Causal Analysis across Scales

[…] $S_m$). However, systems can also be considered as many different macro causal models ($S_M$), such as higher scales or over a subset of the state-space. The set of all possible causal models, {**S**}, is entirely fixed by the base $S_m$. In technical terms this is known as supervenience: given the lowest scale of any system (the base), all the subsequent macro causal models of that system are fixed [17,18]. Due to multiple realizability, different $S_m$ may share the same $S_M$.

[…] $S_M$ must always be of a smaller cardinality than $S_m$. For instance, a macro causal model may be a mapping of states that leaves out (considers as “black-boxed”, or exogenous) some of the states in the microscale causal model.

[…] $s_i$) mapped into $S_M$. Put simply, a coarse-grained intervention is an average over a set of micro-interventions. Note that there is also some corresponding macro-effect distribution as well, where each macro-effect is the average expectation of the result of some macro-intervention (using the same mapping).
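One way to read this averaging is sketched below. Each macro-intervention is taken as the unweighted average of the TPM rows of the microstates it groups, and each macro-effect as the total probability of landing in any microstate of the target group. The function `coarse_grain`, the four-state TPM, and the mapping are all assumptions made for illustration.

```python
def coarse_grain(tpm, mapping, n_macro):
    """Build a macro TPM; mapping[i] is the macrostate index of microstate i."""
    groups = [[i for i, m in enumerate(mapping) if m == k] for k in range(n_macro)]
    macro = []
    for members in groups:
        # Macro-intervention: average over the micro-interventions in this group.
        avg = [sum(tpm[i][j] for i in members) / len(members) for j in range(len(tpm))]
        # Macro-effect: probability mass arriving in each group of microstates.
        macro.append([sum(avg[j] for j in g) for g in groups])
    return macro

# Assumed microscale: three states mix uniformly among themselves, one is fixed.
third = 1.0 / 3.0
micro = [
    [third, third, third, 0.0],
    [third, third, third, 0.0],
    [third, third, third, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
macro = coarse_grain(micro, mapping=[0, 0, 0, 1], n_macro=2)
print(macro)
```

For this assumed system the macro TPM is, up to floating-point rounding, the two-state identity matrix: the noise among the three grouped microstates disappears at the macroscale.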

[…] $I_D = H^{max}$. However, when intervening on a macro causal model this will not always be true; for instance, it may be over only the set of states that are explicitly included in the macro causal model. Additionally, $I_D$ might be distributed non-uniformly at the microscale, due to the grouping effects of mapping microstates into macrostates. This can also be expressed by saying that the $H(I_D)$ at the macroscale is always less than $H(I_D)$ at the microscale. As we will see, this is critical for causal emergence.

## 4. Causal Emergence

[…] $S_m$ with n = 8 possible states, with the TPM:

[…] $S_m$ is very low under an $I_D$ that equals $H^{max}$ (eff = 0.18) and so EI($S_m$) is only 0.55 bits. A search over all possible mappings reveals a particular macroscale that can be represented as a causal model with an associated TPM such that $EI^{max}(S_M)$ = 1 bit:

[…] $S_m$ is:

[…] $S_F | S = s_i$) are different for each microstate. However, the maximal possible EI, $EI^{max}$, is still at the same macroscale (EI($S_M$) = 1 bit > EI($S_m$) = 0.81 bits).

[…]${}^{max}$? While all macro causal models inherently have a smaller size, there may be an increase in eff. As stated previously, for two Markov chains, EI($M_x$) > EI($M_y$) if $\frac{eff_y}{eff_x} < \frac{size_x}{size_y}$. Since the ratio of eff can increase to a greater degree than the accompanying decrease in the ratio of size, the macro can beat the micro.

[…] $n_z$ transitions to itself with p = 1. For such a system $EI^{max}(S_M)$ = 1 bit, no matter how large the size of the system is (from a mapping M of $n_z$ into macrostate 1 and all remaining n − 1 states into macrostate 2). In this case, as the size increases EI($S_m$) decreases: $\lim_{n \to \infty} EI(S_m) = 0$ as $\lim_{n \to \infty} 1/(n-1) = p = 0$. That is, a macro causal model $S_M$ can remain the same even as the underlying microscale drops off to an infinitesimal EI. This also means that the upper limit of the difference between EI($S_M$) and EI($S_m$) (the amount of emergence) is theoretically bounded only by $\log_2(m)$, where m is the number of macrostates.
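The limit can be illustrated numerically. The sketch assumes the microscale has the structure suggested by the text: n − 1 states that each transition to any of those n − 1 states with p = 1/(n − 1), plus one state that transitions to itself with p = 1. The builder `micro_tpm` and the chosen sizes are hypothetical.

```python
from math import log2

def effective_information(tpm):
    """EI in bits under I_D = H^max."""
    n = len(tpm)
    e_d = [sum(row[j] for row in tpm) / n for j in range(n)]
    total = 0.0
    for row in tpm:
        total += sum(p * log2(p / e_d[j]) for j, p in enumerate(row) if p > 0)
    return total / n

def micro_tpm(n):
    """Assumed microscale: n - 1 mutually noisy states plus one self-looping state."""
    p = 1.0 / (n - 1)
    rows = [[p] * (n - 1) + [0.0] for _ in range(n - 1)]
    rows.append([0.0] * (n - 1) + [1.0])
    return rows

# The two-macrostate model is the identity TPM regardless of n.
ei_macro = effective_information([[1.0, 0.0], [0.0, 1.0]])

sizes = (4, 16, 64, 256)
ei_micro = {n: effective_information(micro_tpm(n)) for n in sizes}
for n in sizes:
    print(n, round(ei_micro[n], 4), ei_macro)
```

For n = 4 this assumed structure gives roughly 0.81 bits, and EI($S_m$) keeps shrinking as n grows while the macro value stays at 1 bit.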

[…] $I_D$ that equals $H^{max}$ the EI is only 0.81 bits. In comparison the $S_M$ has an $EI^{max}$ value of 1 bit, as eff($S_M$) = 1 (as degeneracy($S_M$) = 0).

## 5. Causal Emergence as a Special Case of Noisy-Channel Coding

[…] $I_D$, so H(X) = $H(I_D)$. The conditional entropy H(X|Y) captures how much information is left over about X once Y is taken into account. H(X|Y) therefore has a clear causal interpretation as the amount of information lost in the set of interventions. More specifically, it is the information lost by the lack of effectiveness. Starting with the known definition of conditional entropy $H(X|Y) = H(X) - I(X;Y)$, which with the substitution of causal terminology is $H(I_D|E_D) = H(I_D) - I(I_D;E_D)$, we can see that $H(I_D|E_D) = H(I_D) - (H(I_D) * eff)$, and can show via algebraic transposition that $H(I_D|E_D)$ indeed captures the lack of effectiveness since $H(I_D|E_D) = (1 - eff) * H(I_D)$:

[…] $I_D$) is necessarily decreasing at the macroscale, the conditional entropy $H(I_D|E_D)$ may be decreasing to a greater degree, making the total mutual information higher.

[…]${}_{D}$) increase EI. The use of macro interventions transforms or warps the $I_D$, leading to causal emergence. Correspondingly, the macroscale causal model (with its associated $I_D$ and $E_D$) with $EI^{max}$ is the one that most fully uses the causal capacity of the system. Also note that, despite this warping of the $I_D$, from the perspective of some particular macroscale, $I_D$ is still at $H^{max}$ in the sense that each $do(s_M)$ is equiprobable (and $E_D$ is a set of macro-effects).

[…] $t_{+1}$):

[…]${}_{D}$. The encoding function $\varphi : \{message\} \to \{encoder\}$ is a rule that associates some channel input with some output, along with some decoding function $\psi$. The encoding/decoding functions together create the codebook. For simplicity, issues like prefixes and instantaneous decoding are ignored here.

[…] $I_D$) = 2 bits as its four possible states are successive randomized interventions (so that p(1) = 0.5). Each code specifies a rate of transmission R = n/t, where t is every state-transition of the system and n is the number of bits sent per transition. For the microscale code of the system shown above the rate R = 2 bits, although these 2 bits are not sent reliably. This is because $H(I_D|E_D)$ is large: 1.19 bits, so $I(I_D; E_D) = H(I_D) − H(I_D|E_D)$ = 0.81 bits. In application, this means that if one wanted to send the message {00,10,11,01,00,11}, this would take 6 interventions (channel usages) and there would be a very high probability of numerous errors. This is because the rate exceeds the capacity at the microscale.
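These quantities can be recomputed directly from a TPM. The four-state TPM below is an assumption (three states that transition uniformly among themselves and one that transitions to itself), chosen because it reproduces the quoted values; the conditional entropy is obtained from the joint distribution via Bayes' rule rather than from eff.

```python
from math import log2

# Assumed microscale TPM over {00, 01, 10, 11}; not taken from the figure.
third = 1.0 / 3.0
tpm = [
    [third, third, third, 0.0],
    [third, third, third, 0.0],
    [third, third, third, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
n = len(tpm)
i_d = [1.0 / n] * n  # I_D = H^max
e_d = [sum(i_d[x] * tpm[x][y] for x in range(n)) for y in range(n)]

h_id = -sum(p * log2(p) for p in i_d if p > 0)  # H(I_D)

# H(I_D | E_D) = -sum_{x,y} p(x, y) * log2 p(x | y), with p(x | y) = p(x, y) / p(y).
h_cond = 0.0
for y in range(n):
    for x in range(n):
        joint = i_d[x] * tpm[x][y]
        if joint > 0:
            h_cond -= joint * log2(joint / e_d[y])

ei = h_id - h_cond   # I(I_D; E_D)
eff = ei / h_id
print(round(h_id, 2), round(h_cond, 2), round(ei, 2))
```

The directly computed conditional entropy agrees with $(1 - eff) * H(I_D)$, as the algebra in the text requires.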

[…] $I_D$) is halved (1 bit; so that p(1) = 0.83). However, $I(I_D; E_D)$ = 1 bit, as $H(I_D|E_D)$ = 0, showing that reliable interventions can proceed at the rate of 1 bit, higher than when using a microscale code. At the macroscale there would be zero errors in transmitting any intervention. This rate of reliable communication is equal to the capacity C.

[…]${}^{max}$ has been proven to be the uniform distribution $H^{max}$ [22]. Treating the system at the microscale implies that $I_D = H^{max}$. Therefore, for symmetric or weakly symmetric systems, the microscale provides the best causal model without any need to search across model space. It is only in systems with asymmetrical causal relationships that causal emergence can occur.
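A binary symmetric channel gives a compact illustration of the symmetric case. It is used here as a stand-in example, not as a system from the text: its capacity is known to be $1 - H_b(e)$ for flip probability e, and the sketch checks that the uniform input distribution attains it while a skewed input falls short. The flip probability 0.1 and the skewed input [0.8, 0.2] are arbitrary choices.

```python
from math import log2

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist if p > 0)

def mutual_information(p_x, channel):
    """I(X;Y) = H(Y) - H(Y|X) for input distribution p_x and row-stochastic channel."""
    n_out = len(channel[0])
    p_y = [sum(px * row[y] for px, row in zip(p_x, channel)) for y in range(n_out)]
    h_y_given_x = sum(px * entropy(row) for px, row in zip(p_x, channel))
    return entropy(p_y) - h_y_given_x

e = 0.1  # flip probability (arbitrary)
bsc = [[1 - e, e], [e, 1 - e]]

capacity = 1 - entropy([e, 1 - e])                   # known closed form for the BSC
mi_uniform = mutual_information([0.5, 0.5], bsc)     # I_D = H^max
mi_skewed = mutual_information([0.8, 0.2], bsc)      # a warped input distribution
print(capacity, mi_uniform, mi_skewed)
```

In this symmetric case any departure from the uniform input lowers the mutual information, which is why no macroscale warping of $I_D$ could help.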

## 6. Causal Capacity Can Approximate Channel Capacity as Model Choice Increases

[…] $I_D$, which achieves its success in the same manner as the input distribution that uses the full channel capacity: by sending only a subset of the possible messages during channel usage. However, while the causal capacity is bounded by the channel capacity, it is not always identical to it. Because the warping of $I_D$ is a function of model choice, which is constrained in various ways (a subset of possible distributions), causal capacity is a special case of the more general channel capacity (defined over all possible distributions). Coarse-graining is one way to manipulate (warp) $I_D$: by moving up to a macro scale. It is not the only way that the $I_D$ (and the associated $E_D$) can be changed. Choices made in the causal modeling of a system, including the choice of scale to create the causal model, but also the choice of initial condition, and whether to classify variables as exogenous or endogenous to the causal model (“black boxing”), are all forms of model choice and can all also warp $I_D$ and change $E_D$, leading to causal emergence.

[…] $I_D$ of $H^{max}$ gives an EI of 0.63 bits. Yet every causal model implicitly classifies variables as endogenous or exogenous to the model. For instance, here, we can take only the last two states ($s_7$, $s_8$) as endogenous to the macro causal model, while leaving the rest of the states as exogenous. This restriction is still a macro model because it has a smaller state-space, and in this general sense also a macroscale. For this macro causal model of the system EI = 1 bit, meaning that causal emergence occurs, again because the $I_D$ is warped by model choice. This warping can itself be quantified as the loss of entropy in the intervention distribution, $H(I_D)$:

[…]${}_{D}$. This latter form has been called “black boxing”, where an element’s internal workings, or role in the system, cannot be examined [23,24]. In Figure 3, both types of model choices are shown in systems of deterministic interconnected logic gates. Each model choice leads to causal emergence.

[…] $I_D$ via model-building choices, the closer the causal capacity approximates the actual channel capacity. For example, consider the system in Figure 4A. In Figure 4B, a macroscale is shown that demonstrates causal emergence using various types of model choice (by coarse-graining, black-boxing an element, and setting a particular initial condition for an exogenous element). As can be seen in Figure 5, the more degrees of freedom in terms of model choice there are, the closer the causal capacity approximates the channel capacity. The channel capacity of this system was found via gradient ascent after the simulation of millions of random probability distributions p(X), searching for the one that maximizes I. Model choice warps the microscale $I_D$ in such a way that it moves closer to p(X), as shown in Figure 5B. As model choice increases, the $EI^{max}$ approaches the $I^{max}$ of the channel capacity (Figure 5C).
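For small systems, the channel capacity can also be estimated with the Blahut-Arimoto algorithm. This is a standard alternative and not the gradient-ascent search described above. The sketch below applies it to an assumed four-state TPM (three mutually noisy states plus one self-looping state), not to the system in Figure 4; it does not guard against input probabilities underflowing to zero.

```python
from math import log2

def blahut_arimoto(channel, tol=1e-12, max_iter=10_000):
    """Estimate channel capacity in bits and a capacity-achieving input distribution."""
    n_in, n_out = len(channel), len(channel[0])
    p = [1.0 / n_in] * n_in  # start from the uniform input distribution
    lower = 0.0
    for _ in range(max_iter):
        q = [sum(p[x] * channel[x][y] for x in range(n_in)) for y in range(n_out)]
        # c[x] = 2 ** D_KL(channel[x] || q), with the divergence measured in bits.
        c = [
            2.0 ** sum(w * log2(w / q[y]) for y, w in enumerate(channel[x]) if w > 0)
            for x in range(n_in)
        ]
        z = sum(px * cx for px, cx in zip(p, c))
        lower, upper = log2(z), log2(max(c))  # bounds on the capacity
        p = [px * cx / z for px, cx in zip(p, c)]
        if upper - lower < tol:
            break
    return lower, p

# Assumed TPM treated as a channel from interventions (rows) to effects (columns).
third = 1.0 / 3.0
tpm = [
    [third, third, third, 0.0],
    [third, third, third, 0.0],
    [third, third, third, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
capacity, p_x = blahut_arimoto(tpm)
print(capacity, p_x)
```

For this assumed TPM the estimate is 1 bit, the same as the EI of a two-macrostate model that groups the three noisy states, and the returned p(X) puts half of its mass on the self-looping state. In this small case the causal capacity and the channel capacity coincide.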

## 7. Discussion

[…]${}_{D}$. All of these make the state-space of the system smaller, so can be classified as macroscales, yet all may possibly lead to causal emergence.

[…]${}^{max}$ at some particular macroscale). In this sense, some systems may function over their inputs and outputs at a microscale or macroscale, depending on their own causal capacity and the probability distribution of some natural source of driving input.

[…]${}^{max}$ [3,37]. If there are such privileged scales in a system then intervention and experimentation should focus on those scales.

## Acknowledgments

## Conflicts of Interest

## References

1. Fodor, J.A. Special sciences (or: The disunity of science as a working hypothesis). Synthese **1974**, 28, 97–115.
2. Kim, J. Mind in a Physical World: An Essay on the Mind-Body Problem and Mental Causation; MIT Press: Cambridge, MA, USA, 2000.
3. Hoel, E.P.; Albantakis, L.; Tononi, G. Quantifying causal emergence shows that macro can beat micro. Proc. Natl. Acad. Sci. USA **2013**, 110, 19790–19795.
4. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. **1948**, 27, 623–666.
5. Granger, C.W.J. Investigating causal relations by econometric models and cross-spectral methods. Econometrica **1969**, 37, 424–438.
6. Massey, J.L. Causality, feedback and directed information. In Proceedings of the International Symposium on Information Theory and Its Applications, Waikiki, HI, USA, 27–30 November 1990; pp. 303–305.
7. Schreiber, T. Measuring information transfer. Phys. Rev. Lett. **2000**, 85, 461–464.
8. Janzing, D.; Balduzzi, D.; Grosse-Wentrup, M.; Schölkopf, B. Quantifying causal influences. Ann. Stat. **2013**, 41, 2324–2358.
9. Pearl, J. Causality; Cambridge University Press: New York, NY, USA, 2000.
10. Tononi, G.; Sporns, O. Measuring information integration. BMC Neurosci. **2003**, 4, 31.
11. Hope, L.R.; Korb, K.B. An information-theoretic causal power theory. In Australasian Joint Conference on Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2005; pp. 805–811.
12. Ay, N.; Polani, D. Information flows in causal networks. Adv. Complex Syst. **2008**, 11, 17–41.
13. Griffith, P.E.; Pocheville, A.; Calcott, B.; Stotz, K.; Kim, H.; Knight, R. Measuring causal specificity. Philos. Sci. **2015**, 82, 529–555.
14. Fisher, R.A. The Design of Experiments; Oliver and Boyd: Edinburgh, UK, 1935.
15. Kullback, S. Information Theory and Statistics; Dover Publications Inc.: Mineola, NY, USA, 1997.
16. Hoel, E.P. Can the macro beat the micro? Integrated information across spatiotemporal scales. Neurosci. Conscious. **2016**, 2016, niw012.
17. Davidson, D. Essays on Actions and Events: Philosophical Essays; Oxford University Press on Demand: Oxford, UK, 2001; Volume 1.
18. Stalnaker, R. Varieties of supervenience. Philos. Perspect. **1996**, 10, 221–241.
19. Crutchfield, J.P. The calculi of emergence: Computation, dynamics and induction. Phys. D Nonlinear Phenom. **1994**, 75, 11–54.
20. Shalizi, C.R.; Moore, C. What Is a Macrostate? Subjective Observations and Objective Dynamics. arXiv, 2003; arXiv:cond-mat/0303625.
21. Wolpert, D.H.; Grochow, J.A.; Libby, E.; DeDeo, S. Optimal High-Level Descriptions of Dynamical Systems. arXiv, 2014; arXiv:1409.7403.
22. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
23. Ashby, W.R. An Introduction to Cybernetics; Chapman & Hall: London, UK, 1956.
24. Bunge, M. A general black box theory. Philos. Sci. **1963**, 30, 346–358.
25. Rubner, Y.; Tomasi, C.; Guibas, L.J. The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vis. **2000**, 40, 99–121.
26. Campbell, D.T. ‘Downward causation’ in hierarchically organised biological systems. In Studies in the Philosophy of Biology; Macmillan Education: London, UK, 1974; pp. 179–186.
27. Ellis, G. How can Physics Underlie the Mind; Springer: Berlin/Heidelberg, Germany; New York, NY, USA, 2016.
28. Sperry, R.W. A modified concept of consciousness. Psychol. Rev. **1969**, 76, 532.
29. Auletta, G.; Ellis, G.F.R.; Jaeger, L. Top-down causation by information control: From a philosophical problem to a scientific research programme. J. R. Soc. Interface **2008**, 5, 1159–1172.
30. Ellis, G. Recognising top-down causation. In Questioning the Foundations of Physics; Springer International Publishing: Basel, Switzerland, 2015; pp. 17–44.
31. Broad, C.D. The Mind and Its Place in Nature; Routledge: New York, NY, USA, 2014.
32. Stone, J.V. Information Theory: A Tutorial Introduction; Sebtel Press: Sheffield, UK, 2015.
33. Frisch, M. Causal Reasoning in Physics; Cambridge University Press: New York, NY, USA, 2014.
34. Buxhoeveden, D.P.; Casanova, M.F. The minicolumn hypothesis in neuroscience. Brain **2002**, 125, 935–951.
35. Yuste, R. From the neuron doctrine to neural networks. Nat. Rev. Neurosci. **2015**, 16, 487–497.
36. Tononi, G. Consciousness as integrated information: A provisional manifesto. Biol. Bull. **2008**, 215, 216–242.
37. Tononi, G.; Boly, M.; Massimini, M.; Koch, C. Integrated information theory: From consciousness to its physical substrate. Nat. Rev. Neurosci. **2016**, 17, 450–461.

**Figure 1.** Markov chains with different levels of effectiveness. At the top are three Markov chains of differing levels of effectiveness, with the transition probabilities shown. This is assessed by the application of an $I_D$ of $H^{max}$ to each Markov chain, the results of which are shown as the TPMs below (probabilities in grayscale). The effectiveness of each chain is shown at the bottom.

**Figure 2.** Causal models as information channels. (**A**) A Markov chain at the microscale with four states can be encoded into a macroscale chain with two states. (**B**) Causal structure transforms interventions into effects. A macro causal model is a form of encoding for the interventions (inputs) and effects (outputs) that can use a greater amount of the capacity of the channel. TPM probabilities in grayscale.

**Figure 3.** Multiple types of model choice can lead to causal emergence. (**A**) The full microscale model of the system, where all elements are endogenous. (**B**) The same system but modeled at a macroscale where only elements {ABC} are endogenous, while {D} is exogenous; it was set to an initial state of 0 as a background condition of the causal analysis. (**C**) The full microscale model of a system with six elements. (**D**) The same system as in (C) but at a macroscale with the element {F} exogenous: it varies in the background in response to the application of the $I_D$. EI is higher for both macro causal models.

**Figure 4.** Multiple types of model choice in combination lead to greater causal emergence. (**A**) The microscale of an eight-element system. (**B**) The same system but with some elements coarse-grained, others “black boxed”, and some frozen in a particular initial condition.

**Figure 5.** Causal capacity approximates the channel capacity as more degrees of freedom in model choices are allowed. (**A**) The various $I_D$s for the system in Figure 4, each getting closer to the p(X) that gives $I^{max}$. (**B**) The Earth Mover’s Distance [25] from each $I_D$ to the input distribution p(X) that gives the channel capacity $I^{max}$. Increasing degrees of model choice leads to $I_D$ approximating the maximally informative p(X). (**C**) Increasing degrees of model choice leads to a causal model where the $EI^{max}$ approximates $I^{max}$ (the macroscale shown in Figure 4B).

© 2017 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Hoel, E.P.
When the Map Is Better Than the Territory. *Entropy* **2017**, *19*, 188.
https://doi.org/10.3390/e19050188
