# Entropy Rate Estimates for Natural Language—A New Extrapolation of Compressed Large-Scale Corpora


## Abstract


## 1. Introduction

## 2. Entropy Rate

## 3. Direct Estimation Methods

- The first approach is to compress the text using a data compression algorithm. Let $R({X}_{1}^{n})$ denote the size in bits of the text ${X}_{1}^{n}$ after compression. Then the code length per unit, $r(n)=R({X}_{1}^{n})/n$, is bounded below by the entropy rate [13]:
  $$r(n)\ge h.$$
  We call $r(n)$ the encoding rate. In our application, we are interested in universal compression methods. A universal text compressor guarantees that the encoding rate converges to the entropy rate, provided that the stochastic process ${X}_{1}^{\infty}$ is stationary and ergodic, i.e., the equality
  $$\lim_{n\to \infty} r(n)=h$$
  holds.
- The second approach is to estimate the probabilistic language model underlying formula (2). A representative classic work is that of Brown et al. [6], who reported $h\approx 1.75$ bpc (bits per character) by estimating trigram probabilities on the Brown Corpus.
- Beyond these two approaches, a number of other entropy estimation methods have been proposed in information theory: lower bounds on the entropy such as the plug-in estimator [15], estimators that assume the process is Markovian [16,17,18], and a few other methods such as Context Tree Weighting [15,19].
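
The encoding rate $r(n)=R({X}_{1}^{n})/n$ from the first approach can be illustrated with a minimal sketch. It assumes Python's standard-library `lzma` compressor as a stand-in; this is not one of the compressors used in the article, and the helper names `encoding_rate` and `rate_curve` are hypothetical. The rate is measured in bits per input byte, which matches bits per character only for single-byte encodings, and container overhead inflates $r(n)$ for small $n$.

```python
import lzma


def encoding_rate(data: bytes, compress=lzma.compress) -> float:
    """Return r(n) = R(X_1^n) / n, in bits per input byte.

    `compress` is any callable mapping bytes to bytes; lzma is only
    an illustrative default, not the article's compressor.
    """
    if not data:
        raise ValueError("data must be non-empty")
    compressed_bits = 8 * len(compress(data))  # R(X_1^n) in bits
    return compressed_bits / len(data)


def rate_curve(data: bytes, lengths, compress=lzma.compress):
    """Encoding rate of the prefix X_1^n for each valid length n."""
    return [
        (n, encoding_rate(data[:n], compress))
        for n in lengths
        if 0 < n <= len(data)  # skip lengths outside the data
    ]


if __name__ == "__main__":
    # Toy input: highly repetitive text, so r(n) drops quickly with n.
    sample = ("the quick brown fox jumps over the lazy dog. " * 2000).encode("utf-8")
    for n, r in rate_curve(sample, [2 ** k for k in range(8, 17)]):
        print(f"n={n:6d}  r(n)={r:.3f} bits/byte")
```

Evaluating the rate on prefixes of growing length gives the curve $r(n)$ whose decay toward $h$ is the quantity of interest; passing `bz2.compress` or `zlib.compress` as `compress` allows a quick comparison of compressors.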

## 4. Extrapolation Functions

## 5. Experimental Procedure

#### 5.1. Data Preparation

**English**: English; **Chinese**: Chinese; and **Others**: French, Russian, Japanese, Korean, and Romanized Chinese and Japanese.

#### 5.2. Detailed Procedure

## 6. Experimental Results

#### 6.1. Effects of Randomization by Documents

#### 6.2. Comparison of the Error of Fit

#### 6.3. Universality of the Estimates of Exponent β

#### 6.4. A Linear Perspective onto the Decay of the Encoding Rate

#### 6.5. Discriminative Power of the Decay of the Encoding Rate

#### 6.6. Stability of the Entropy Rate Estimates

## 7. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

1. Shannon, C. A mathematical theory of communication. Bell Syst. Tech. J. **1948**, 27, 379–423, 623–656.
2. Shannon, C. Prediction and entropy of printed English. Bell Syst. Tech. J. **1951**, 30, 50–64.
3. Genzel, D.; Charniak, E. Entropy Rate Constancy in Text. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 199–206.
4. Levy, R.; Jaeger, T.F. Speakers Optimize Information Density through Syntactic Reduction. In Proceedings of the 19th International Conference on Neural Information Processing Systems, Doha, Qatar, 12–15 November 2012.
5. Cover, T.M.; King, R.C. A Convergent Gambling Estimate of the Entropy of English. IEEE Trans. Inf. Theory **1978**, 24, 413–421.
6. Brown, P.F.; Pietra, S.A.D.; Pietra, V.J.D.; Lai, J.C.; Mercer, R.L. An Estimate of an Upper Bound for the Entropy of English. Comput. Linguist. **1992**, 18, 31–40.
7. Kontoyiannis, I. The Complexity and Entropy of Literary Styles; Technical Report 97; Department of Statistics, Stanford University: Stanford, CA, USA, 1997.
8. Schürmann, T.; Grassberger, P. Entropy estimation of symbol sequences. Chaos **1996**, 6, 414–427.
9. Hilberg, W. Der Bekannte Grenzwert der Redundanzfreien Information in Texten—Eine Fehlinterpretation der Shannonschen Experimente? Frequenz **1990**, 44, 243–248.
10. Dębowski, Ł. Maximal Repetitions in Written Texts: Finite Energy Hypothesis vs. Strong Hilberg Conjecture. Entropy **2015**, 17, 5903–5919.
11. Crutchfield, J.P.; Feldman, D.P. Regularities unseen, randomness observed: The entropy convergence hierarchy. Chaos **2003**, 13, 25–54.
12. Ebeling, W.; Nicolis, G. Entropy of Symbolic Sequences: The Role of Correlations. Europhys. Lett. **1991**, 14, 191–196.
13. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley-Interscience: Hoboken, NJ, USA, 2006.
14. Brudno, A.A. Entropy and the complexity of trajectories of a dynamical system. Trans. Moscovian Math. Soc. **1982**, 44, 124–149.
15. Gao, Y.; Kontoyiannis, I.; Bienenstock, E. Estimating the Entropy of Binary Time Series: Methodology, Some Theory and a Simulation Study. Entropy **2008**, 10, 71–99.
16. Grassberger, P. Estimating the information content of symbol sequences and efficient codes. IEEE Trans. Inf. Theory **1989**, 35, 669–675.
17. Farach, M.; Noordewier, M.; Savari, S.; Shepp, L.; Wyner, A.; Ziv, J. On the Entropy of DNA: Algorithms and Measurements Based on Memory and Rapid Convergence. In Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, CA, USA, 22–24 January 1995; pp. 48–57.
18. Shields, P.C. Entropy and Prefixes. Ann. Probab. **1992**, 20, 403–409.
19. Willems, F.M.J.; Shtarkov, Y.M.; Tjalkens, T.J. The Context Tree Weighting Method: Basic Properties. IEEE Trans. Inf. Theory **1995**, 41, 653–664.
20. Ziv, J.; Lempel, A. A Universal Algorithm for Sequential Data Compression. IEEE Trans. Inf. Theory **1977**, 23, 337–343.
21. Bell, T.C.; Cleary, J.G.; Witten, I.H. Text Compression; Prentice Hall: Upper Saddle River, NJ, USA, 1990.
22. Kieffer, J.C.; Yang, E. Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Inf. Theory **2000**, 46, 737–754.
23. Nevill-Manning, C.G.; Witten, I.H. Identifying hierarchical structure in sequences: A linear-time algorithm. J. Artif. Intell. Res. **1997**, 7, 67–82.
24. Grassberger, P. Data Compression and Entropy Estimates by Non-Sequential Recursive Pair Substitution. 2002; arXiv:physics/0207023.
25. Ryabko, B. Applications of Universal Source Coding to Statistical Analysis of Time Series. In Selected Topics in Information and Coding Theory; Woungang, I., Misra, S., Misra, S.C., Eds.; Series on Coding and Cryptology; World Scientific Publishing: Singapore, 2010.
26. Dębowski, Ł. A Preadapted Universal Switch Distribution for Testing Hilberg’s Conjecture. IEEE Trans. Inf. Theory **2015**, 61, 5708–5715.
27. Baayen, R.H. Word Frequency Distributions; Kluwer Academic Publishers: Berlin, Germany, 2001.
28. Katz, S.M. Distribution of content words and phrases in text and language modelling. Nat. Lang. Eng. **1996**, 2, 15–59.
29. Altmann, E.G.; Pierrehumbert, J.B.; Motter, A.E. Beyond Word Frequency: Bursts, Lulls, and Scaling in the Temporal Distributions of Words. PLoS ONE **2009**, 4, e7678.
30. Louchard, G.; Szpankowski, W. On the average redundancy rate of the Lempel-Ziv code. IEEE Trans. Inf. Theory **1997**, 43, 2–8.
31. Barron, A.; Rissanen, J.; Yu, B. The Minimum Description Length Principle in Coding and Modeling. IEEE Trans. Inf. Theory **1998**, 44, 2743–2760.
32. Atteson, K. The Asymptotic Redundancy of Bayes Rules for Markov Chains. IEEE Trans. Inf. Theory **1999**, 45, 2104–2109.
33. Dębowski, Ł. The Relaxed Hilberg Conjecture: A Review and New Experimental Support. J. Quant. Linguist. **2015**, 22, 311–337.
34. Daniels, P.T.; Bright, W. The World’s Writing Systems; Oxford University Press: Oxford, UK, 1996.
35. Tanaka-Ishii, K.; Aihara, S. Computational Constancy Measures of Texts—Yule’s K and Rényi’s Entropy. Comput. Linguist. **2015**, 41, 481–502.

**Figure 1.** Compression results for (**a**) a Bernoulli process ($p=0.5$) and (**b**) the Wall Street Journal for Lempel-Ziv (LZ), PPM (Prediction by Partial Match), and Sequitur.

**Figure 2.** Encoding rates for the Wall Street Journal corpus (in English). Panel (**a**) is for the original data, whereas (**b**) is the average of the data 10-fold shuffled by documents. To these results we fit functions ${f}_{1}(n)$ and ${f}_{3}(n)$.

**Figure 3.** The values of error and $h$ for all natural language data sets in Table 1 and the three ansatz functions ${f}_{1}(n)$, ${f}_{2}(n)$, and ${f}_{3}(n)$. Each data point corresponds to a distinct corpus or a distinct text, where black is English, red is Chinese, and blue is other languages. The squares are the fitting results for ${f}_{1}(n)$, the triangles for ${f}_{2}(n)$, and the circles for ${f}_{3}(n)$. The means and the standard deviations of $h$ (left) and error (right) are indicated in the figure next to the ovals, which show the range of the standard deviation: dotted for ${f}_{1}(n)$, dashed for ${f}_{2}(n)$, and solid for ${f}_{3}(n)$.

**Figure 4.** The values of $\beta$ and $h$ for all natural language data sets in Table 1 and the ansatz functions ${f}_{1}(n)$, ${f}_{2}(n)$, and ${f}_{3}(n)$. Each data point corresponds to a distinct corpus or a distinct text, where black is English, red is Chinese, and blue is other languages. The squares are the fitting results for ${f}_{1}(n)$, the triangles for ${f}_{2}(n)$, and the circles for ${f}_{3}(n)$. The means and the standard deviations of $h$ (left) and $\beta$ (right) are indicated in the figure next to the ovals, which show the range of the standard deviation: dotted for ${f}_{1}(n)$, dashed for ${f}_{2}(n)$, and solid for ${f}_{3}(n)$.

**Figure 5.** All large scale natural language data (first block of Table 1) from a linear perspective for function ${f}_{3}(n)$. The axes are $Y=\ln r(n)$ and $X={n}^{\beta -1}$, where $\beta =0.884$. The black points are English, the red ones are Chinese, and the blue ones are other languages. The two linear fit lines are for English (lower) and Chinese (upper).

**Figure 6.** Data from the third block of Table 1 from a linear perspective for function ${f}_{3}(n)$. The axes are $X={n}^{\beta -1}$ and $Y=\ln r(n)$, where $\beta =0.884$ as in Figure 5. The black points are the English text, the magenta ones are its randomized versions, whereas the blue ones are Bernoulli and Zipf processes.
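
The linear perspective used in Figures 5 and 6 can be sketched in a few lines. The sketch assumes that $\ln {f}_{3}(n)$ is linear in ${n}^{\beta -1}$, i.e., $\ln {f}_{3}(n)=A{n}^{\beta -1}+\ln h$; the axes in the captions suggest this form, but it is not spelled out in this excerpt. The function names are hypothetical, and the fit is a plain least-squares line, which may differ from the article's fitting procedure.

```python
import math


def linear_perspective(ns, rates, beta=0.884):
    """Map (n, r(n)) pairs to X = n**(beta - 1) and Y = ln r(n)."""
    xs = [n ** (beta - 1.0) for n in ns]
    ys = [math.log(r) for r in rates]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for Y = slope * X + intercept."""
    m = len(xs)
    if m < 2 or m != len(ys):
        raise ValueError("need at least two (X, Y) pairs of equal length")
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("all X values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate_entropy_rate(ns, rates, beta=0.884):
    """Estimate h as exp(intercept): X tends to 0 as n grows when beta < 1."""
    xs, ys = linear_perspective(ns, rates, beta)
    _, intercept = fit_line(xs, ys)
    return math.exp(intercept)


if __name__ == "__main__":
    # Synthetic check: data generated from the assumed form with h = 1.2.
    ns = [2 ** k for k in range(10, 30)]
    rates = [math.exp(3.0 * n ** (0.884 - 1.0) + math.log(1.2)) for n in ns]
    print(extrapolate_entropy_rate(ns, rates))
```

Under this assumption the intercept of the fitted line at $X=0$ corresponds to $\ln h$, so the vertical offset between two fit lines reflects the difference in their extrapolated entropy rates.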

| Text | Language | Size (chars) | Encoding rate (bit) | ${f}_{1}(n)$: h (bit) | ${f}_{1}(n)$: error × 10^{−2} | ${f}_{3}(n)$: h (bit) | ${f}_{3}(n)$: error × 10^{−2} |
|---|---|---|---|---|---|---|---|
| **Large Scale Random Document Data** | | | | | | | |
| Agence France-Presse | English | 4096003895 | 1.402 | 1.249 | 1.078 | 1.033 | 0.757 |
| Associated Press Worldstream | English | 6524279444 | 1.439 | 1.311 | 1.485 | 1.128 | 1.070 |
| Los Angeles Times/Washington Post | English | 1545238421 | 1.572 | 1.481 | 1.108 | 1.301 | 0.622 |
| New York Times | English | 7827873832 | 1.599 | 1.500 | 0.961 | 1.342 | 0.616 |
| Washington Post/Bloomberg | English | 97411747 | 1.535 | 1.389 | 1.429 | 1.121 | 0.991 |
| Xinhua News Agency | English | 1929885224 | 1.317 | 1.158 | 0.906 | 0.919 | 0.619 |
| Wall Street Journal | English | 112868008 | 1.456 | 1.320 | 1.301 | 1.061 | 0.812 |
| Central News Agency of Taiwan | Chinese | 678182152 | 5.053 | 4.459 | 1.055 | 3.833 | 0.888 |
| Xinhua News Agency of Beijing | Chinese | 383836212 | 4.725 | 3.810 | 0.751 | 2.924 | 0.545 |
| People’s Daily (1991–95) | Chinese | 101507796 | 4.927 | 3.805 | 0.413 | 2.722 | 0.188 |
| Mainichi | Japanese | 847606070 | 3.947 | 3.339 | 0.571 | 2.634 | 0.451 |
| Le Monde | French | 727348826 | 1.489 | 1.323 | 1.103 | 1.075 | 0.711 |
| KAIST Raw Corpus | Korean | 130873485 | 3.670 | 3.661 | 0.827 | 3.327 | 1.158 |
| Mainichi (Romanized) | Japanese | 1916108161 | 1.766 | 1.620 | 2.372 | 1.476 | 2.067 |
| People’s Daily (pinyin) | Chinese | 247551301 | 1.850 | 1.857 | 1.651 | 1.667 | 1.136 |
| **Small Scale Data** | | | | | | | |
| Ulysses (by James Joyce) | English | 1510885 | 2.271 | 2.155 | 0.811 | 1.947 | 1.104 |
| À la recherche du temps perdu (by Marcel Proust) | French | 7255271 | 1.660 | 1.414 | 0.770 | 1.078 | 0.506 |
| The Brothers Karamazov (by Fyodor Dostoyevskiy) | Russian | 1824096 | 2.223 | 1.983 | 0.566 | 1.598 | 0.839 |
| Daibosatsu toge (by Nakazato Kaizan) | Japanese | 4548008 | 4.296 | 3.503 | 1.006 | 2.630 | 0.875 |
| Dang Kou Zhi (by Wan-Chun Yu) | Chinese | 665591 | 6.739 | 4.479 | 1.344 | 2.988 | 1.335 |
| **Other Data** | | | | | | | |
| Bernoulli (0.5) | Stochastic | 8000000000 | 1.019 | 1.016 | 0.391 | 1.012 | 0.721 |
| Zipf’s law Random Character | English | 63683795 | 4.406 | 4.417 | 0.286 | 4.402 | 0.258 |
| WSJ (Original) | English | 112868008 | 1.456 | 1.305 | 1.156 | 1.041 | 0.833 |
| WSJ (Random Characters) | English | 112868008 | 4.697 | 4.706 | 0.131 | 4.699 | 0.146 |
| WSJ (Random Word) | English | 112868008 | 2.028 | 1.796 | 0.663 | 1.554 | 0.956 |
| WSJ (Random Sentence) | English | 112868008 | 1.461 | 1.026 | 0.500 | 0.562 | 0.532 |

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Takahira, R.; Tanaka-Ishii, K.; Dębowski, Ł. Entropy Rate Estimates for Natural Language—A New Extrapolation of Compressed Large-Scale Corpora. *Entropy* **2016**, *18*, 364.
https://doi.org/10.3390/e18100364
