# Critical Behavior in Physics and Probabilistic Formal Languages


## Abstract


## 1. Introduction

## 2. Markov Implies Exponential Decay

**Theorem 1.**

**Theorem 2.**

**Corollary 1.**
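The exponential decay asserted in this section can be illustrated concretely. The sketch below (our toy example, not code from the paper) computes $I(X_0;X_d)$ exactly for a two-state Markov chain from its transition matrix $T$, using the joint distribution $P(a,b)=\pi_a(T^d)_{ab}$, where $\pi$ is the stationary distribution; the printed values fall off exponentially with the separation $d$:

```python
import numpy as np

# Assumed toy transition matrix for a two-state Markov chain.
T = np.array([[0.9, 0.1],
              [0.1, 0.9]])

# Stationary distribution: left eigenvector of T with eigenvalue 1.
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

def mutual_info(d):
    """Exact I(X_0; X_d) in nats: joint P(a,b) = pi_a * (T^d)_{ab}."""
    Td = np.linalg.matrix_power(T, d)
    joint = pi[:, None] * Td
    marg = np.outer(pi, pi)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / marg[mask])))

for d in (1, 2, 4, 8, 16):
    print(d, mutual_info(d))
```

The decay rate is set by the second eigenvalue of $T$ (here 0.8), so doubling $d$ squares the asymptotic suppression factor.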

## 3. Power Laws from Generative Grammar

#### 3.1. A Simple Recursive Grammar Model

#### 3.2. Further Generalization: Strongly Correlated Characters in Words

**Theorem**

**3.**

#### 3.3. Further Generalization: Bayesian Networks and Context-Free Grammars

- An alphabet $\mathcal{A}=A\cup T$, which consists of non-terminal symbols $A$ and terminal symbols $T$.
- A set of production rules of the form $a\to B$, where the left-hand side $a\in A$ is always a single non-terminal character and $B$ is a string consisting of symbols in $\mathcal{A}$.
- A probability associated with each production rule, $P(a\to B)$, such that for each $a\in A$, $\sum_{B}P(a\to B)=1$.
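A sampler for a grammar of this form takes only a few lines. The grammar below is a made-up toy example (the symbols and probabilities are ours, not the paper's); note that the rule probabilities for each left-hand side sum to one, as required above:

```python
import random

# Toy probabilistic context-free grammar: non-terminal 'S',
# terminals '0' and '1'. Each rule a -> B carries a probability.
RULES = {
    "S": [(("S", "S"), 0.3),    # S -> S S  (recursive rule)
          (("0",), 0.35),       # S -> 0
          (("1",), 0.35)],      # S -> 1
}

def expand(symbol, rng):
    """Recursively expand a symbol into a list of terminal characters."""
    if symbol not in RULES:            # terminal: emit as-is
        return [symbol]
    bodies, weights = zip(*RULES[symbol])
    body = rng.choices(bodies, weights=weights)[0]
    return [t for s in body for t in expand(s, rng)]

print("".join(expand("S", random.Random(0))))
```

Because the expected number of non-terminals produced per expansion is below one, the recursion terminates with probability one; it is recursive rules like $S\to SS$ that generate the hierarchical, tree-like correlation structure discussed in this section.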

## 4. Discussion

#### 4.1. Connection to Recurrent Neural Networks

#### 4.2. A New Diagnostic for Machine Learning

Despite having $\sim\!10^3$ more parameters than our LSTM-RNN, the bigram Markov model captures less than a fifth of the mutual information captured by the LSTM-RNN even at modest separations ≳5. This phenomenon is related to a classic result in the theory of formal languages: a context-free grammar can generate the long-range, power-law correlations that no regular grammar, and hence no Markov or hidden Markov model, can reproduce.


- Why is natural language so hard? The old answer was that language is uniquely human. Our new answer is that at least part of the difficulty is that natural language is a critical system, with long-range correlations that are difficult for machines to learn.
- Why are some machines bad at natural language, and why are others good? The old answer is that Markov models are simply not brain/human-like, whereas neural nets are more brain-like and, hence, better. Our new answer is that Markov models and other one-dimensional models cannot exhibit critical behavior, whereas neural nets and other deep models (where an extra hidden dimension is formed by the layers of the network) are able to exhibit critical behavior.
- How can we know when machines are bad or good? The old answer is to compute the loss function. Our new answer is to also compute the mutual information as a function of separation, which can immediately show how well the model is doing at capturing correlations on different scales.
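The diagnostic in the last point can be sketched with a naive plug-in estimator (Appendix D describes a more careful estimator; the helper below is our simplified illustration and is biased for small samples):

```python
from collections import Counter
from math import log

def mutual_info_at_separation(seq, d):
    """Plug-in estimate of I(X_i; X_{i+d}) in bits from one sequence.
    Naive sketch: empirical pair frequencies at separation d."""
    pairs = Counter(zip(seq, seq[d:]))          # joint counts
    n = sum(pairs.values())
    left = Counter(seq[:-d])                    # marginal of first symbol
    right = Counter(seq[d:])                    # marginal of second symbol
    I = 0.0
    for (a, b), c in pairs.items():
        p_ab = c / n
        I += p_ab * log(p_ab * n * n / (left[a] * right[b]), 2)
    return I

text = "abracadabra" * 200
for d in (1, 2, 4, 8, 16):
    print(d, round(mutual_info_at_separation(text, d), 3))
```

Plotting this quantity against $d$ for model-generated text, as in Figure 3, immediately shows on which scales a model fails to capture correlations.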

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## Appendix A. Properties of Rational Mutual Information

- Symmetry: for any two random variables X and Y, ${I}_{R}(X,Y)={I}_{R}(Y,X)$. The proof is straightforward:$$I_R(X,Y)=\sum_{a,b}\frac{P(X=a,Y=b)^2}{P(X=a)P(Y=b)}-1=\sum_{a,b}\frac{P(Y=b,X=a)^2}{P(Y=b)P(X=a)}-1=I_R(Y,X).$$
- Upper bound to mutual information: the logarithm function satisfies $\ln(1+x)\le x$, with equality if and only if (iff) $x=0$. Therefore, setting $x=\frac{P(a,b)}{P(a)P(b)}-1$ gives:$$I(X,Y)=\left\langle \log_B\frac{P(a,b)}{P(a)P(b)}\right\rangle=\frac{1}{\ln B}\left\langle \ln\left[1+\left(\frac{P(a,b)}{P(a)P(b)}-1\right)\right]\right\rangle\le\frac{1}{\ln B}\left\langle \frac{P(a,b)}{P(a)P(b)}-1\right\rangle=\frac{I_R(X,Y)}{\ln B}.$$Hence, the rational mutual information satisfies $I_R\ge I\ln B$ with equality iff $I=0$ (or simply $I_R\ge I$, if we use the natural logarithm base $B=e$).
- Non-negativity: it follows from the above inequality that $I_R(X,Y)\ge 0$, with equality iff $P(a,b)=P(a)P(b)$, since $I_R=I=0$ iff $P(a,b)=P(a)P(b)$. Note that this short proof is only possible because of the information inequality $I\ge 0$: from the definition of $I_R$, it is only obvious that $I_R\ge -1$, and information theory gives a much tighter bound. Our Findings 1–3 can be summarized as follows:$$I_R(X,Y)=I_R(Y,X)\ge I(X,Y)\ge 0.$$
- Generalization: note that if we view the mutual information as the divergence between two joint probability distributions, we can generalize the notion of rational mutual information to that of a rational divergence:$$D_R(p\,\|\,q)=\left\langle \frac{p}{q}\right\rangle-1,$$which is (up to an affine transformation) a special case of the $\alpha$-divergence; the $\alpha$-divergence is itself a special case of the so-called f-divergences [55,56,57]:$$D_f(p\,\|\,q)=\sum_i p_i\,f(q_i/p_i).$$Note that as written, p could be any probability measure on either a discrete or continuous space. The above results can be trivially modified to show that $D_R(p\,\|\,q)\ge D_{KL}(p\,\|\,q)$ and, hence, $D_R(p\,\|\,q)\ge 0$, with equality iff $p=q$.
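These inequalities are easy to spot-check numerically. The sketch below (ours, not part of the appendix) draws random joint distributions and verifies $I_R\ge I\ge 0$ with the natural logarithm ($B=e$):

```python
import numpy as np

rng = np.random.default_rng(0)

def check_inequality(trials=1000):
    """Numerically verify I_R(X,Y) >= I(X,Y) >= 0 (natural log, B=e)
    on random joint distributions over a 3x4 alphabet."""
    for _ in range(trials):
        P = rng.random((3, 4))
        P /= P.sum()                             # random joint distribution
        px, py = P.sum(axis=1), P.sum(axis=0)    # marginals
        Q = np.outer(px, py)                     # product distribution
        I_R = np.sum(P**2 / Q) - 1.0             # rational mutual information
        I = np.sum(P * np.log(P / Q))            # Shannon MI in nats
        assert I_R >= I >= -1e-12, (I_R, I)
    return True

print(check_inequality())
```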

## Appendix B. General Proof for Markov Processes

#### Appendix B.1. The Degenerate Case

#### Appendix B.2. The Reducible Case

#### Appendix B.3. The Periodic Case

#### Appendix B.4. The n > 1 Case

#### Appendix B.5. The Detailed Balance Case

#### Appendix B.6. Hidden Markov Model

## Appendix C. Power Laws for Generative Grammars

#### Appendix C.1. Detailed Evaluation of the Normalization

**Figure A1.** Decay of rational mutual information with separation for a binary sequence from a numerical simulation with probabilities $p(0|0)=p(1|1)=0.9$ and a branching factor $q=2$. The blue curve is not a fit to the simulated data, but rather an analytic calculation. The smooth power law displayed on the left is what is predicted by our “continuum” approximation. The very small discrepancies (right) are not random, but are fully accounted for by more involved exact calculations with discrete sums.
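The branching process behind this simulation can be sketched as follows (our reconstruction from the caption: each symbol spawns $q=2$ children, each copying its parent with probability $p(0|0)=p(1|1)=0.9$ and flipping otherwise):

```python
import random

def branch_sequence(generations, p_same=0.9, q=2, seed=0):
    """Generate a binary sequence from the recursive branching process:
    each symbol spawns q children, each equal to its parent with
    probability p_same, else flipped."""
    rng = random.Random(seed)
    layer = [0]                                  # root symbol
    for _ in range(generations):
        layer = [b if rng.random() < p_same else 1 - b
                 for b in layer for _ in range(q)]
    return layer

seq = branch_sequence(16)                        # 2**16 = 65536 symbols
print(len(seq), sum(seq) / len(seq))
```

Reading off the leaves left to right gives the one-dimensional sequence whose rational mutual information decays as the power law shown in the figure.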

## Appendix D. Estimating (Rational) Mutual Information from Empirical Data

## References

1. Bak, P.; Tang, C.; Wiesenfeld, K. Self-organized criticality: An explanation of the 1/*f* noise. Phys. Rev. Lett. **1987**, 59, 381–384.
2. Bak, P.; Tang, C.; Wiesenfeld, K. Self-organized criticality. Phys. Rev. A **1988**, 38, 364.
3. Linkenkaer-Hansen, K.; Nikouline, V.V.; Palva, J.M.; Ilmoniemi, R.J. Long-Range Temporal Correlations and Scaling Behavior in Human Brain Oscillations. J. Neurosci. **2001**, 21, 1370–1377.
4. Levitin, D.J.; Chordia, P.; Menon, V. Musical rhythm spectra from Bach to Joplin obey a 1/f power law. Proc. Natl. Acad. Sci. USA **2012**, 109, 3716–3720.
5. Tegmark, M. Consciousness as a State of Matter. arXiv **2014**.
6. Manaris, B.; Romero, J.; Machado, P.; Krehbiel, D.; Hirzel, T.; Pharr, W.; Davis, R.B. Zipf’s law, music classification, and aesthetics. Comput. Music J. **2005**, 29, 55–69.
7. Peng, C.K.; Buldyrev, S.V.; Goldberger, A.; Havlin, S.; Sciortino, F.; Simons, M.; Stanley, H.E. Long-range correlations in nucleotide sequences. Nature **1992**, 356, 168–170.
8. Mantegna, R.N.; Buldyrev, S.V.; Goldberger, A.L.; Havlin, S.; Peng, C.K.; Simons, M.; Stanley, H.E. Linguistic features of noncoding DNA sequences. Phys. Rev. Lett. **1994**, 73, 3169–3172.
9. Ebeling, W.; Pöschel, T. Entropy and Long-Range Correlations in Literary English. EPL (Europhys. Lett.) **1994**, 26, 241–246.
10. Ebeling, W.; Neiman, A. Long-range correlations between letters and sentences in texts. Phys. A Stat. Mech. Appl. **1995**, 215, 233–241.
11. Altmann, E.G.; Cristadoro, G.; Degli Esposti, M. On the origin of long-range correlations in texts. Proc. Natl. Acad. Sci. USA **2012**, 109, 11582–11587.
12. Montemurro, M.A.; Pury, P.A. Long-range fractal correlations in literary corpora. Fractals **2002**, 10, 451–461.
13. Deco, G.; Schürmann, B. Information Dynamics: Foundations and Applications; Springer: New York, NY, USA, 2012.
14. Zipf, G.K. Human Behavior and the Principle of Least Effort; Addison-Wesley Press: Boston, MA, USA, 1949.
15. Lin, H.W.; Loeb, A. Zipf’s law from scale-free geometry. Phys. Rev. E **2016**, 93, 032306.
16. Pietronero, L.; Tosatti, E.; Tosatti, V.; Vespignani, A. Explaining the uneven distribution of numbers in nature: The laws of Benford and Zipf. Phys. A Stat. Mech. Appl. **2001**, 293, 297–304.
17. Kardar, M. Statistical Physics of Fields; Cambridge University Press: Cambridge, UK, 2007.
18. Homo Sapiens Genome. Available online: ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/ (accessed on 15 June 2016).
19. MIDI Files: Sonatas and Partitas for Solo Violin. Available online: http://www.jsbach.net/midi/midi_solo_violin.html (accessed on 15 June 2016).
20. 50,000 Euro Prize for Compressing Human Knowledge. Available online: http://prize.hutter1.net/ (accessed on 15 June 2016).
21. Corpatext 1.02. Available online: http://www.lexique.org/public/lisezmoi.corpatext.htm (accessed on 15 June 2016).
22. Turing, A.M. Computing machinery and intelligence. Mind **1950**, 59, 433–460.
23. Ferrucci, D.; Brown, E.; Chu-Carroll, J.; Fan, J.; Gondek, D.; Kalyanpur, A.A.; Lally, A.; Murdock, J.W.; Nyberg, E.; Prager, J.; et al. Building Watson: An overview of the DeepQA project. AI Mag. **2010**, 31, 59–79.
24. Campbell, M.; Hoane, A.J.; Hsu, F.H. Deep Blue. Artif. Intell. **2002**, 134, 57–83.
25. Mnih, V.; et al. Human-level control through deep reinforcement learning. Nature **2015**, 518, 529–533.
26. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature **2016**, 529, 484–489.
27. Chomsky, N. On certain formal properties of grammars. Inf. Control **1959**, 2, 137–167.
28. Kim, Y.; Jernite, Y.; Sontag, D.; Rush, A.M. Character-Aware Neural Language Models. arXiv **2015**.
29. Graves, A. Generating Sequences with Recurrent Neural Networks. arXiv **2013**.
30. Graves, A.; Mohamed, A.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649.
31. Collobert, R.; Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning (ICML), Helsinki, Finland, 5–9 July 2008; pp. 160–167.
32. Van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv **2016**.
33. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. **2015**, 61, 85–117.
34. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature **2015**, 521, 436–444.
35. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. **1997**, 9, 1735–1780.
36. Shieber, S.M. Evidence against the context-freeness of natural language. In The Formal Complexity of Natural Language; Springer: Dordrecht, The Netherlands, 1985; pp. 320–334.
37. Anisimov, A.V. Group languages. Cybern. Syst. Anal. **1971**, 7, 594–601.
38. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. **1948**, 27, 379–423.
39. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. **1951**, 22, 79–86.
40. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
41. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE **1989**, 77, 257–286.
42. Carrasco, R.C.; Oncina, J. Learning stochastic regular grammars by means of a state merging method. In Proceedings of the Second International Colloquium on Grammatical Inference and Applications (ICGI’94), Alicante, Spain, 21–23 September 1994; pp. 139–152.
43. Ginsburg, S. The Mathematical Theory of Context Free Languages; McGraw-Hill Book Company: New York, NY, USA, 1966.
44. Booth, T.L. Probabilistic representation of formal languages. In Proceedings of the 1969 IEEE Conference Record of 10th Annual Symposium on Switching and Automata Theory, Waterloo, ON, Canada, 15–17 October 1969; pp. 74–81.
45. Huang, T.; Fu, K. On stochastic context-free languages. Inf. Sci. **1971**, 3, 201–224.
46. Lari, K.; Young, S.J. The estimation of stochastic context-free grammars using the inside-outside algorithm. Comput. Speech Lang. **1990**, 4, 35–56.
47. Harlow, D.; Shenker, S.H.; Stanford, D.; Susskind, L. Tree-like structure of eternal inflation: A solvable model. Phys. Rev. D **2012**, 85, 063516.
48. Van Hove, L. Sur l’intégrale de configuration pour les systèmes de particules à une dimension. Physica **1950**, 16, 137–143.
49. Cuesta, J.A.; Sánchez, A. General Non-Existence Theorem for Phase Transitions in One-Dimensional Systems with Short Range Interactions, and Physical Examples of Such Transitions. J. Stat. Phys. **2004**, 115, 869–893.
50. Evenbly, G.; Vidal, G. Tensor network states and geometry. J. Stat. Phys. **2011**, 145, 891–918.
51. Saxe, A.M.; McClelland, J.L.; Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv **2013**.
52. Mahoney, M. Large Text Compression Benchmark. Available online: http://mattmahoney.net/dc/text.html (accessed on 23 June 2017).
53. Karpathy, A.; Johnson, J.; Fei-Fei, L. Visualizing and Understanding Recurrent Networks. arXiv **2015**.
54. Amari, S.I. α-Divergence and α-Projection in Statistical Manifold. In Differential-Geometrical Methods in Statistics; Springer: New York, NY, USA, 1985; pp. 66–103.
55. Morimoto, T. Markov processes and the H-theorem. J. Phys. Soc. Jpn. **1963**, 18, 328–331.
56. Csiszár, I. Information-type measures of difference of probability distributions and indirect observations. Stud. Sci. Math. Hung. **1967**, 2, 299–318.
57. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B (Methodol.) **1966**, 28, 131–142.
58. Gardiner, C.W. Handbook of Stochastic Methods; Springer: Berlin, Germany, 1985; Volume 3.
59. Grassberger, P. Entropy Estimates from Insufficient Samplings. arXiv **2003**.
60. Li, W. Mutual information functions versus correlation functions. J. Stat. Phys. **1990**, 60, 823–837.

**Figure 1.** Decay of mutual information with separation. Here, the mutual information in bits per symbol is shown as a function of separation $d(X,Y)=|i-j|$, where the symbols X and Y are located at positions i and j in the sequence in question, and shaded bands correspond to $1\sigma$ error bars. The statistics were computed with a sliding window, using the estimator for the mutual information detailed in Appendix D. All measured curves are seen to decay roughly as power laws, explaining why they cannot be accurately modeled as Markov processes, for which the mutual information instead plummets exponentially (the example shown has $I\propto {e}^{-d/6}$). The measured curves are qualitatively similar to that of a famous critical system in physics: a 1D slice through a critical 2D Ising model, where the slope is $-1/2$. The human genome data consist of 177,696,512 base pairs {A, C, T, G} from chromosome 5 from the National Center for Biotechnology Information [18], with unknown base pairs omitted. The Bach data consist of 5727 notes from Partita No. 2 [19], with all notes mapped into a 12-symbol alphabet consisting of the 12 half-tones {C, C#, D, D#, E, F, F#, G, G#, A, A#, B} and all timing, volume and octave information discarded. The three text corpora are 100 MB from Wikipedia [20] (206 symbols), the first 114 MB of a French corpus [21] (185 symbols) and 27 MB of English articles from slate.com (143 symbols). The large long-range information appears to be dominated by poems in the French sample and by HTML-like syntax in the Wikipedia sample.

**Figure 2.** Both a traditional Markov process (top) and our recursive generative grammar process (bottom) can be represented as Bayesian networks, where the random variable at each node depends only on the node pointing to it with an arrow. The numbers show the geodesic distance $\Delta $ to the leftmost node, defined as the smallest number of edges that must be traversed to get there. Roughly speaking, our results show that for large $\Delta $, the mutual information decays exponentially with $\Delta $ (see Theorems 1 and 2). Since this geodesic distance $\Delta $ grows only logarithmically with the separation in time in a hierarchical generative grammar (the hierarchy creates very efficient shortcuts), the exponential kills the logarithm, and we are left with power law decays of mutual information in such languages.

**Figure 3.** Diagnosing different models by hallucinating text and then measuring the mutual information as a function of separation. The red line is the mutual information of enwik8, a 100-MB sample of English Wikipedia. In shaded blue is the mutual information of hallucinated Wikipedia from a trained LSTM with three layers of size 256. We plot in solid black the mutual information of a Markov process on single characters, which we compute exactly (this would correspond to the mutual information of hallucinations in the limit where the length of the hallucinations goes to infinity). This curve shows a sharp exponential decay after a distance of ∼10, in agreement with our theoretical predictions. We also measured the mutual information for hallucinated text from a Markov process on bigrams, which still underperforms the LSTMs in long-range correlations, despite having $\sim\!10^3$ more parameters.

**Figure 4.** Our deep generative grammar model can be viewed as an idealization of a long short-term memory (LSTM) recurrent neural net, where the “forget weights” drop with depth, so that the forget timescales grow exponentially with depth. The graph drawn here is isomorphic to the Bayesian-network graph drawn in Figure 2. For each cell, we approximate the usual incremental updating rule either by perfectly remembering the previous state (horizontal arrows) or by ignoring the previous state and determining the cell state by a random rule depending on the node above (vertical arrows).

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Lin, H.W.; Tegmark, M. Critical Behavior in Physics and Probabilistic Formal Languages. *Entropy* **2017**, *19*, 299.
https://doi.org/10.3390/e19070299
