# Estimating Predictive Rate–Distortion Curves via Neural Variational Inference


## Abstract


## 1. Introduction

## 2. Predictive Rate–Distortion

#### 2.1. Relation to Statistical Complexity

#### 2.2. Related Work

## 3. Prior Work: Optimal Causal Filtering

## 4. Neural Estimation via Variational Upper Bound

#### 4.1. Main Result: Variational Bound on Predictive Rate–Distortion

**Proposition 1.**

**Proof.**

**Proposition 2.**

**Proof.**

**Corollary 1** (Main Result).

**Proof.**

#### 4.2. Choosing Approximating Families

#### 4.3. Parameter Estimation and Evaluation

#### Estimating Predictiveness

#### 4.4. Related Work

## 5. Experiments

#### 5.1. Implementation Details

#### OCF

#### Neural Predictive Rate–Distortion

#### 5.2. Analytically Tractable Problems

#### Recovering Causal States

#### A Process with Many Causal States

## 6. Estimating Predictive Rate–Distortion for Natural Language

#### 6.1. Part-of-Speech-Level Language Modeling

#### 6.2. Discussion

#### 6.3. Word-Level Language Modeling

#### 6.4. General Discussion

## 7. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| NPRD | Neural Predictive Rate–Distortion |
| OCF | Optimal Causal Filtering |
| POS | parts of speech |
| PRD | Predictive Rate–Distortion |
| LSTM | Long Short-Term Memory |
| VAE | Variational Autoencoder |

## Appendix A. Hyperparameters

**Table A1.** NPRD Hyperparameters. See Appendix A for description of the parameters.

| | Section 5.2 | Section 6.1 | Section 6.3 |
|---|---|---|---|
| Embedding Dimension | 50, 100 | 50, 100, 200, 300 | 150 |
| LSTM Dimension | 32 | 64, 128, 256, 512 | 256, 512 |
| Dropout Rate | 0.0, 0.1, 0.2 | 0.0, 0.1, 0.2 | 0.1, 0.4 |
| Input Dropout | 0 | 0.0, 0.1, 0.2 | 0.2 |
| Adam Learning Rate | $\{1,5\}\cdot 10^{-4}$, $\{1,2,4\}\cdot 10^{-3}$ | $\{1,5\}\cdot 10^{-4}$, $\{1,2,4\}\cdot 10^{-3}$ | 0.00005, 0.0001, 0.0005, 0.001 |
| Batch Size | 16, 32, 64 | 16, 32, 64 | 16, 32, 64 |
| Flow Length | 1, 2 | 1, 2, 3, 4, 5 | 1, 2, 3, 4, 5 |
| Flow Type | DSF, DDSF | DSF, DDSF | DSF, DDSF |
| Flow Dimension | 32, 64, 128, 512 | 512 | 512 |
| Flow Layers | 2 | 2 | 2 |

## Appendix B. Alternative Modeling Choices

**Figure A1.** Rate–Distortion for the Even Process (**left**) and the Random Insertion Process (**right**), estimated using a simple diagonal unit Gaussian approximation for q. Gray lines: analytical curves; red dots: multiple runs of NPRD (>200 samples); red line: trade-off curve computed from NPRD runs. Compare Figure 1 for results from full NPRD.

**Figure A2.** Rate–Distortion for the Even Process (**left**) and the Random Insertion Process (**right**), varying $M=5$ (blue), 10 (red), 15 (green). Gray lines: analytical curves; red dots: multiple runs of NPRD; red line: trade-off curve computed from NPRD runs. Compare Figure 1 for results from full NPRD.

## Appendix C. Sample Runs on English Text

**Figure A3.** Four example outputs from English word-level modeling, with low rate ($\log\frac{1}{\lambda}=1$; red, dotted), medium rate ($\log\frac{1}{\lambda}=3$; green, dashed), and high rate ($\log\frac{1}{\lambda}=5$; blue, solid). For each sample, we provide the prior context $\overset{M\leftarrow}{X}$ (**top**) and the per-word cross-entropies (in nats) on the future words $\overset{\to M}{X}$ (**bottom**).
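The per-word cross-entropies plotted in Figure A3 are simply negative log-probabilities in nats under the predictive model. A minimal sketch with made-up probabilities (not the paper's actual model outputs):

```python
import math

def per_word_cross_entropies(probs):
    """Per-word cross-entropy in nats: -ln p(word | context, code Z)."""
    return [-math.log(p) for p in probs]

# Hypothetical model probabilities for four future words.
probs = [0.5, 0.1, 0.9, 0.25]
ces = per_word_cross_entropies(probs)
print([round(c, 3) for c in ces])  # first entry is -ln(0.5) ≈ 0.693
```

Averaging these values over the $M$ future words gives the empirical cross-entropy term that enters the distortion estimate.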

## References

- Still, S. Information Bottleneck Approach to Predictive Inference. Entropy
**2014**, 16, 968–989. [Google Scholar] [CrossRef] - Marzen, S.E.; Crutchfield, J.P. Predictive Rate-Distortion for Infinite-Order Markov Processes. J. Stat. Phys.
**2016**, 163, 1312–1338. [Google Scholar] [CrossRef][Green Version] - Creutzig, F.; Globerson, A.; Tishby, N. Past-future information bottleneck in dynamical systems. Phys. Rev. E
**2009**, 79. [Google Scholar] [CrossRef] - Amir, N.; Tiomkin, S.; Tishby, N. Past-future Information Bottleneck for linear feedback systems. In Proceedings of the 54th IEEE Conference on Decision and Control (CDC), Osaka, Japan, 15–18 December 2015; pp. 5737–5742. [Google Scholar]
- Genewein, T.; Leibfried, F.; Grau-Moya, J.; Braun, D.A. Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Front. Robot. AI
**2015**, 2, 27. [Google Scholar] [CrossRef] - Still, S.; Crutchfield, J.P.; Ellison, C.J. Optimal causal inference: Estimating stored information and approximating causal architecture. Chaos Interdiscip. J. Nonlinear Sci.
**2010**, 20, 037111. [Google Scholar] [CrossRef][Green Version] - Józefowicz, R.; Vinyals, O.; Schuster, M.; Shazeer, N.; Wu, Y. Exploring the Limits of Language Modeling. arXiv
**2016**, arXiv:1602.02410. [Google Scholar] - Merity, S.; Keskar, N.S.; Socher, R. An analysis of neural language modeling at multiple scales. arXiv
**2018**, arXiv:1803.08240. [Google Scholar] - Dai, Z.; Yang, Z.; Yang, Y.; Cohen, W.W.; Carbonell, J.; Le, Q.V.; Salakhutdinov, R. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv
**2019**, arXiv:1901.02860. [Google Scholar] - Takahashi, S.; Tanaka-Ishii, K. Cross Entropy of Neural Language Models at Infinity—A New Bound of the Entropy Rate. Entropy
**2018**, 20, 839. [Google Scholar] [CrossRef] - Ogunmolu, O.; Gu, X.; Jiang, S.; Gans, N. Nonlinear systems identification using deep dynamic neural networks. arXiv
**2016**, arXiv:1610.01439. [Google Scholar] - Laptev, N.; Yosinski, J.; Li, L.E.; Smyl, S. Time-series extreme event forecasting with neural networks at uber. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 11 August 2017; pp. 1–5. [Google Scholar]
- Meyer, P.; Noblet, V.; Mazzara, C.; Lallement, A. Survey on deep learning for radiotherapy. Comput. Biol. Med.
**2018**, 98, 126–146. [Google Scholar] [CrossRef] [PubMed] - Hundman, K.; Constantinou, V.; Laporte, C.; Colwell, I.; Soderstrom, T. Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 387–395. [Google Scholar]
- White, G.; Palade, A.; Clarke, S. Forecasting qos attributes using lstm networks. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–8. [Google Scholar]
- Woo, J.; Park, J.; Yu, C.; Kim, N. Dynamic model identification of unmanned surface vehicles using deep learning network. Appl. Ocean Res.
**2018**, 78, 123–133. [Google Scholar] [CrossRef] - Sirignano, J.; Cont, R. Universal features of price formation in financial markets: perspectives from Deep Learning. arXiv
**2018**, arXiv:1803.06917. [Google Scholar] [CrossRef] - Mohajerin, N.; Waslander, S.L. Multistep Prediction of Dynamic Systems With Recurrent Neural Networks. IEEE Trans. Neural Netw. Learn. Syst.
**2019**. [Google Scholar] [CrossRef] [PubMed] - Rezende, D.J.; Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, 6–11 July 2015; Volume 37, pp. 1530–1538. [Google Scholar]
- Huang, C.W.; Krueger, D.; Lacoste, A.; Courville, A. Neural Autoregressive Flows. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 2083–2092. [Google Scholar]
- Tishby, N.; Pereira, F.C.; Bialek, W. The Information Bottleneck Method. In Proceedings of the Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 22–24 September 1999; pp. 368–377. [Google Scholar]
- Harremoës, P.; Tishby, N. The information bottleneck revisited or how to choose a good distortion measure. In Proceedings of the IEEE International Symposium on Information Theory, Nice, France, 24–29 June 2007; pp. 566–570. [Google Scholar]
- Feldman, D.P.; Crutchfield, J.P. Synchronizing to Periodicity: The Transient Information and Synchronization Time of Periodic Sequences. Adv. Complex Syst.
**2004**, 7, 329–355. [Google Scholar] [CrossRef] - Crutchfield, J.P.; Young, K. Inferring statistical complexity. Phys. Rev. Lett.
**1989**, 63, 105–108. [Google Scholar] [CrossRef] [PubMed] - Grassberger, P. Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys.
**1986**, 25, 907–938. [Google Scholar] [CrossRef] - Löhr, W. Properties of the Statistical Complexity Functional and Partially Deterministic HMMs. Entropy
**2009**, 11, 385–401. [Google Scholar] [CrossRef][Green Version] - Clarke, R.W.; Freeman, M.P.; Watkins, N.W. Application of computational mechanics to the analysis of natural data: An example in geomagnetism. Phys. Rev. E
**2003**, 67, 016203. [Google Scholar] [CrossRef][Green Version] - Singh, S.P.; Littman, M.L.; Jong, N.K.; Pardoe, D.; Stone, P. Learning predictive state representations. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), Washington, DC, USA, 21–24 August 2003; pp. 712–719. [Google Scholar]
- Singh, S.; James, M.R.; Rudary, M.R. Predictive state representations: A new theory for modeling dynamical systems. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence; AUAI Press: Arlington, VA, USA, 2004; pp. 512–519. [Google Scholar]
- Jaeger, H. Discrete-Time, Discrete-Valued Observable Operator Models: A Tutorial; GMD-Forschungszentrum Informationstechnik: Darmstadt, Germany, 1998. [Google Scholar]
- Rubin, J.; Shamir, O.; Tishby, N. Trading value and information in MDPs. In Decision Making with Imperfect Decision Makers; Springer: Berlin, Germany, 2012; pp. 57–74. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput.
**1997**, 9, 1735–1780. [Google Scholar] [CrossRef] - Kingma, D.P.; Salimans, T.; Jozefowicz, R.; Chen, X.; Sutskever, I.; Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2016; pp. 4743–4751. [Google Scholar]
- Papamakarios, G.; Pavlakou, T.; Murray, I. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2017; pp. 2338–2347. [Google Scholar]
- Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- McAllester, D.; Stratos, K. Formal Limitations on the Measurement of Mutual Information. arXiv
**2018**, arXiv:1811.04251. [Google Scholar] - Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep Variational Information Bottleneck. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
- Grathwohl, W.; Wilson, A. Disentangling space and time in video with hierarchical variational auto-encoders. arXiv
**2016**, arXiv:1612.04440. [Google Scholar] - Walker, J.; Doersch, C.; Gupta, A.; Hebert, M. An uncertain future: Forecasting from static images using variational autoencoders. In Proceedings of the European Conference on Computer Vision; Springer: Berlin, Germany, 2016; pp. 835–851. [Google Scholar]
- Fraccaro, M.; Kamronn, S.; Paquet, U.; Winther, O. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 3601–3610. [Google Scholar]
- Hernández, C.X.; Wayment-Steele, H.K.; Sultan, M.M.; Husic, B.E.; Pande, V.S. Variational encoding of complex dynamics. Phys. Rev. E
**2018**, 97, 062412. [Google Scholar] [CrossRef] [PubMed] - Bowman, S.R.; Vilnis, L.; Vinyals, O.; Dai, A.M.; Jozefowicz, R.; Bengio, S. Generating Sentences from a Continuous Space. In Proceedings of the CoNLL, Berlin, Germany, 11–12 August 2016. [Google Scholar]
- Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; Lerchner, A. β-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
- Burgess, C.; Higgins, I.; Pal, A.; Matthey, L.; Watters, N.; Desjardins, G.; Lerchner, A. Understanding disentangling in β-VAE. arXiv
**2018**, arXiv:1804.03599. [Google Scholar] - Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in PyTorch. 2017. Available online: https://openreview.net/forum?id=BJJsrmfCZ (accessed on 28 June 2019).
- Shannon, C.E. Prediction and entropy of printed English. Bell Syst. Tech. J.
**1951**, 30, 50–64. [Google Scholar] [CrossRef] - Takahira, R.; Tanaka-Ishii, K.; Dębowski, Ł. Entropy rate estimates for natural language—A new extrapolation of compressed large-scale corpora. Entropy
**2016**, 18, 364. [Google Scholar] [CrossRef] - Bentz, C.; Alikaniotis, D.; Cysouw, M.; Ferrer-i Cancho, R. The entropy of words—Learnability and expressivity across more than 1000 languages. Entropy
**2017**, 19, 275. [Google Scholar] [CrossRef] - Hale, J. A Probabilistic Earley Parser as a Psycholinguistic Model. In Proceedings of the NAACL, Pittsburgh, PA, USA, 1–7 June 2001; Volume 2, pp. 159–166. [Google Scholar]
- Levy, R. Expectation-based syntactic comprehension. Cognition
**2008**, 106, 1126–1177. [Google Scholar] [CrossRef][Green Version] - Smith, N.J.; Levy, R. The effect of word predictability on reading time is logarithmic. Cognition
**2013**, 128, 302–319. [Google Scholar] [CrossRef][Green Version] - Frank, S.L.; Otten, L.J.; Galli, G.; Vigliocco, G. Word surprisal predicts N400 amplitude during reading. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, 4–9 August 2013; Volume 2, pp. 878–883. [Google Scholar]
- Kuperberg, G.R.; Jaeger, T.F. What do we mean by prediction in language comprehension? Lang. Cogn. Neurosci.
**2016**, 31, 32–59. [Google Scholar] [CrossRef] - Fenk, A.; Fenk, G. Konstanz im Kurzzeitgedächtnis—Konstanz im sprachlichen Informationsfluß. Z. Exp. Angew. Psychol.
**1980**, 27, 400–414. [Google Scholar] - Genzel, D.; Charniak, E. Entropy rate constancy in text. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002. [Google Scholar]
- Jaeger, T.F.; Levy, R.P. Speakers optimize information density through syntactic reduction. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 3–6 December 2007; pp. 849–856. [Google Scholar]
- Schenkel, A.; Zhang, J.; Zhang, Y.C. Long range correlation in human writings. Fractals
**1993**, 1, 47–57. [Google Scholar] [CrossRef] - Ebeling, W.; Pöschel, T. Entropy and long-range correlations in literary English. EPL (Europhys. Lett.)
**1994**, 26, 241. [Google Scholar] [CrossRef] - Ebeling, W.; Neiman, A. Long-range correlations between letters and sentences in texts. Phys. A Stat. Mech. Appl.
**1995**, 215, 233–241. [Google Scholar] [CrossRef] - Altmann, E.G.; Cristadoro, G.; Degli Esposti, M. On the origin of long-range correlations in texts. Proc. Natl. Acad. Sci. USA
**2012**, 109, 11582–11587. [Google Scholar] [CrossRef][Green Version] - Yang, T.; Gu, C.; Yang, H. Long-range correlations in sentence series from A Story of the Stone. PLoS ONE
**2016**, 11, e0162423. [Google Scholar] [CrossRef] - Chen, H.; Liu, H. Quantifying evolution of short and long-range correlations in Chinese narrative texts across 2000 years. Complexity
**2018**, 2018, 9362468. [Google Scholar] [CrossRef] - Dębowski, Ł. Is natural language a perigraphic process? The theorem about facts and words revisited. Entropy
**2018**, 20, 85. [Google Scholar] [CrossRef] - Koplenig, A.; Meyer, P.; Wolfer, S.; Mueller-Spitzer, C. The statistical trade-off between word order and word structure–Large-scale evidence for the principle of least effort. PLoS ONE
**2017**, 12, e0173614. [Google Scholar] [CrossRef] - Gibson, E. Linguistic complexity: locality of syntactic dependencies. Cognition
**1998**, 68, 1–76. [Google Scholar] [CrossRef] - Futrell, R.; Levy, R. Noisy-context surprisal as a human sentence processing cost model. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017; pp. 688–698. [Google Scholar]
- Petrov, S.; Das, D.; McDonald, R.T. A Universal Part-of-Speech Tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, 23–25 May 2012; pp. 2089–2096. [Google Scholar]
- Nivre, J.; Agic, Z.; Ahrenberg, L.; Antonsen, L.; Aranzabe, M.J.; Asahara, M.; Ateyah, L.; Attia, M.; Atutxa, A.; Augustinus, L.; et al. Universal Dependencies 2.1. 2017. Available online: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2515 (accessed on 28 June 2019).
- Kim, Y.; Jernite, Y.; Sontag, D.; Rush, A.M. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016. [Google Scholar]
- Luong, M.T.; Manning, C.D. Achieving open vocabulary neural machine translation with hybrid word-character models. arXiv
**2016**, arXiv:1604.00788. [Google Scholar] - Marcus, M.P.; Marcinkiewicz, M.A.; Santorini, B. Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist.
**1993**, 19, 313–330. [Google Scholar] - Nivre, J.; de Marneffe, M.C.; Ginter, F.; Goldberg, Y.; Hajic, J.; Manning, C.D.; McDonald, R.T.; Petrov, S.; Pyysalo, S.; Silveira, N.; et al. Universal Dependencies v1: A Multilingual Treebank Collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, 23–28 May 2016. [Google Scholar]
- Maamouri, M.; Bies, A.; Buckwalter, T.; Mekki, W. The penn arabic treebank: Building a large-scale annotated arabic corpus. In Proceedings of the NEMLAR Conference on Arabic Language Resources and Tools, Cairo, Egypt, 27–29 March 2004; Volume 27, pp. 466–467. [Google Scholar]
- Hajic, J.; Smrz, O.; Zemánek, P.; Šnaidauf, J.; Beška, E. Prague Arabic dependency treebank: Development in data and tools. In Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools, Cairo, Egypt, 22–23 September 2004; pp. 110–117. [Google Scholar]
- Dyachenko, P.B.; Iomdin, L.L.; Lazurskiy, A.V.; Mityushin, L.G.; Podlesskaya, O.Y.; Sizov, V.G.; Frolova, T.I.; Tsinman, L.L. Sovremennoe sostoyanie gluboko annotirovannogo korpusa tekstov russkogo yazyka (SinTagRus). Trudy Instituta Russkogo Yazyka im. VV Vinogradova
**2015**, 10, 272–300. [Google Scholar] - Che, W.; Li, Z.; Liu, T. Chinese Dependency Treebank 1.0 LDC2012T05; Web Download; Linguistic Data Consortium: Philadelphia, PA, USA, 2012. [Google Scholar]
- Graff, D.; Wu, Z. Japanese Business News Text; LDC95T8; Linguistic Data Consortium: Philadelphia, PA, USA, 1995. [Google Scholar]
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res.
**2014**, 15, 1929–1958. [Google Scholar] - Bradbury, J.; Merity, S.; Xiong, C.; Socher, R. Quasi-recurrent neural networks. In Proceedings of the ICLR 2017, Toulon, France, 24–26 April 2017. [Google Scholar]

**Figure 1.** Rate–Distortion for the Even Process (**left**) and the Random Insertion Process (**right**). Gray lines: analytical curves; red dots: multiple runs of NPRD; red line: trade-off curve computed from NPRD runs; blue: OCF for $M\le 5$.
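The Even Process of Figure 1 is easy to simulate from its standard two-state $\epsilon$-machine presentation (1s occur in blocks of even length, separated by 0s). A sketch of a sampler, assuming that standard presentation rather than reproducing the paper's code:

```python
import random

def sample_even_process(n, seed=0):
    """Sample n symbols from the Even Process.
    State A: emit 0 or 1 with probability 1/2 each; emitting 1 moves to B.
    State B: deterministically emit a second 1 and return to A,
    so completed blocks of 1s always have even length."""
    rng = random.Random(seed)
    state, out = "A", []
    for _ in range(n):
        if state == "A":
            if rng.random() < 0.5:
                out.append(0)
            else:
                out.append(1)
                state = "B"
        else:  # state B: finish the pair of 1s
            out.append(1)
            state = "A"
    return out

print(sample_even_process(30))
```

Sequences like these can be fed to either OCF or NPRD; only the final block of 1s may be odd, when the sample is truncated mid-pair.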

**Figure 2.** Recovering the $\epsilon$-machine from NPRD. **Left**: the $\epsilon$-machine of the Random Insertion Process, as described by [2]. **Right**: after computing a code Z from a past ${x}_{-15\dots -1}$, we recorded which of the three clusters the code moves to when appending the symbol 0 or 1 to the past sequence. The resulting transitions mirror those in the $\epsilon$-machine.

**Figure 3.** Applying Principal Component Analysis to 5000 sampled codes Z for the Random Insertion Process, at $\lambda = 0.6$ (**left**) and $\lambda = 0.25$ (**right**). We show the first two principal components. Samples are colored according to the states in the $\epsilon$-machine. A small number of samples come from sequences that, at $M=15$, cannot be uniquely attributed to any of the states (ambiguous between A and C); these are indicated in black.
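The projection in Figure 3 is ordinary PCA. A generic sketch via SVD of the centered code matrix (the 5000×16 random matrix below is a hypothetical stand-in for sampled codes Z, not data from the paper):

```python
import numpy as np

def pca_2d(codes):
    """Project code vectors onto their first two principal components,
    computed from the SVD of the centered data matrix."""
    X = codes - codes.mean(axis=0)
    # Rows of Vt are principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T

rng = np.random.default_rng(0)
codes = rng.normal(size=(5000, 16))  # stand-in for sampled codes Z
proj = pca_2d(codes)
print(proj.shape)  # (5000, 2)
```

Coloring each projected point by the $\epsilon$-machine state of its underlying past sequence then yields plots like Figure 3.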

**Figure 4.** Rate–Distortion for the Copy3 process. We show NPRD samples, and the resulting upper bound in red. The gray line represents the analytical curve.

**Figure 5.** **Left**: Rate–Predictiveness for English POS modeling. **Center** and **Right**: Rate and Predictiveness on English POS modeling, as a function of $-\log\lambda$. As $\lambda \to 0$, NPRD (red, $M=15$) continues to discover structure, while OCF (blue, plotted for $M=1,2,3$) exhausts its capacity.

**Figure 6.** Interpolated values for POS-level prediction of English (compare Figure 5).

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Hahn, M.; Futrell, R.
Estimating Predictive Rate–Distortion Curves via Neural Variational Inference. *Entropy* **2019**, *21*, 640.
https://doi.org/10.3390/e21070640
