# Probabilistic Deterministic Finite Automata and Recurrent Networks, Revisited

## Abstract

## 1. Introduction

## 2. Rate-Distortion Benchmarks for Prediction Algorithms

## 3. Background

#### 3.1. PDFAs and Predictive Rate-Distortion

#### 3.2. Time Series Methods

## 4. Results

#### 4.1. The Difference between Theory and Practice: The Even and Neven Process

#### 4.2. Comparing GLMs, RCs, and LSTMs

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

**Figure 1.** (**Top**) A typical setup for a recurrent neural network (or any other predictor): input is sent to the recurrent neural network, which makes a prediction about future inputs. (**Bottom**) Our setup for a recurrent neural network, in which predictions must be made and each prediction must be communicated losslessly through the channel.

**Figure 2.** A sample predictive rate-accuracy curve, which depends not on how we process the time series but only on intrinsic properties of the time series. It is quite possible, and typical, to have a nonzero predictive accuracy at zero rate, and so the x-axis and y-axis do not meet at the origin. The rate can run between zero and one bit for the binary-valued time series we study here. The starred point, which encodes the rate and accuracy of a minimal optimal predictor, has a rate equal to the single-symbol Shannon entropy of the time series and a predictive accuracy that depends in a complicated way on the specific time series. (Note the slight difference between this communication setup and that of standard predictive rate-distortion.) Rates larger than that of the starred point are possible, up to and including one bit.

**Figure 3.** Minimal two-state PDFA that generates the Even Process, so called because there is always an even number of 1s between 0s. Arrows indicate allowed transitions, while transition labels $p|s$ indicate the transition (and so too emission) probabilities $p\in [0,1]$ for the symbol $s\in \mathcal{A}$. Given a current state and next symbol, one knows the next state; this is the deterministic, or unifilar, property of this PDFA.
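To make the caption's description concrete, the following is a minimal sketch of a two-state PDFA that emits the Even Process. The state names `A` and `B`, the dictionary encoding, the helper names, and the value `p = 0.5` are illustrative assumptions; the caption does not state the transition probabilities, and this is not the authors' code.

```python
import random


def even_process_pdfa(p=0.5):
    """Two-state PDFA as state -> [(probability, symbol, next_state), ...].

    Assumption: p = 0.5 is a placeholder; the figure's actual values
    are not given in the caption.
    """
    return {
        "A": [(p, 0, "A"), (1.0 - p, 1, "B")],  # emit 0 and stay, or emit 1 and move
        "B": [(1.0, 1, "A")],                   # must emit a second 1 and return
    }


def is_unifilar(pdfa):
    """True if each (state, symbol) pair leads to at most one next state."""
    for transitions in pdfa.values():
        symbols = [symbol for _, symbol, _ in transitions]
        if len(symbols) != len(set(symbols)):
            return False
    return True


def generate(pdfa, length, start="A", seed=0):
    """Sample `length` symbols by walking the PDFA from `start`."""
    rng = random.Random(seed)
    state, symbols = start, []
    for _ in range(length):
        r, cumulative = rng.random(), 0.0
        # Pick a transition by inverse-CDF sampling; if rounding leaves
        # r above the total, the last transition is used.
        for prob, symbol, next_state in pdfa[state]:
            cumulative += prob
            if r < cumulative:
                break
        symbols.append(symbol)
        state = next_state
    return symbols


def runs_of_ones_between_zeros(symbols):
    """Lengths of nonempty runs of 1s that have a 0 on both sides."""
    zeros = [i for i, s in enumerate(symbols) if s == 0]
    return [j - i - 1 for i, j in zip(zeros, zeros[1:]) if j - i - 1 > 0]


if __name__ == "__main__":
    pdfa = even_process_pdfa()
    sample = generate(pdfa, 40)
    print("".join(str(s) for s in sample))
    print("unifilar:", is_unifilar(pdfa))
```

Because a 0 can only be emitted from state `A` and every excursion to `B` emits exactly two 1s before returning, every run of 1s enclosed by 0s has even length, which `runs_of_ones_between_zeros` lets one check on a sample.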

**Figure 4.** Predictive rate-accuracy curve for the Even Process in Figure 3, along with empirical predictive accuracies and rates of GLMs, RCs, and LSTMs of various sizes: orders range from 1 to 10 for GLMs, the number of nodes ranges from 1 to 61 for RCs, and the number of nodes ranges from 1 to 121 for LSTMs. Despite the Even Process’s simplicity, there is a noticeable difference among the predictors’ performances and between their performances and the optimal achievable performance.

**Figure 5.** Predictive rate-accuracy curve for the Neven Process (PDFA shown at left), along with empirical predictive accuracies and rates of GLMs, RCs, and LSTMs of various sizes: orders range from 1 to 10 for GLMs, the number of nodes ranges from 1 to 61 for RCs, and the number of nodes ranges from 1 to 121 for LSTMs. Despite the Neven Process’s simplicity, there is a noticeable gap between the predictors’ performances and the optimal achievable performance.

**Figure 6.** (**Left**) Histogram of normalized predictive distortions for LSTMs (blue), RCs (orange), and GLMs (green) using 798 distinct PDFAs. While LSTMs tend to have far higher predictive accuracies, they also have a much larger probability than reservoirs or GLMs do of having noticeable inaccuracies. Some recorded normalized predictive distortions were negative, indicating the effects of finite sample size. (**Right**) Histogram of normalized distances to the predictive rate-accuracy curve for LSTMs (blue), RCs (orange), and GLMs (green) using 798 distinct PDFAs. It is apparent that LSTMs are closer to the predictive rate-accuracy curves than reservoirs and GLMs.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Marzen, S.E.; Crutchfield, J.P.
Probabilistic Deterministic Finite Automata and Recurrent Networks, Revisited. *Entropy* **2022**, *24*, 90.
https://doi.org/10.3390/e24010090
