Open Access Article
Entropy 2016, 18(10), 364

Entropy Rate Estimates for Natural Language—A New Extrapolation of Compressed Large-Scale Corpora

Graduate School of Information Science and Electrical Engineering, Kyushu University, Fukuoka 819-0395, Japan
Research Center for Advanced Science and Technology, University of Tokyo, Tokyo 153-8904, Japan
Institute of Computer Science, Polish Academy of Sciences, Warszawa 01-248, Poland
Author to whom correspondence should be addressed.
Academic Editor: J. A. Tenreiro Machado
Received: 9 September 2016 / Revised: 30 September 2016 / Accepted: 9 October 2016 / Published: 12 October 2016
(This article belongs to the Section Complexity)

One of the fundamental questions about human language is whether its entropy rate is positive. The entropy rate measures the average amount of information communicated per unit time. The question about the entropy of language dates back to experiments by Shannon in 1951, but in 1990 Hilberg raised doubts about the correct interpretation of those experiments. This article provides an in-depth empirical analysis, using 20 corpora of up to 7.8 gigabytes across six languages (English, French, Russian, Korean, Chinese, and Japanese), to conclude that the entropy rate is positive. To obtain estimates as the data length tends to infinity, we use an extrapolation function given by an ansatz. Whereas several ansatzes have been proposed previously, here we use a new stretched exponential extrapolation function that yields a smaller error of fit. We conclude that the entropy rates of human languages are positive but approximately 20% smaller than the estimates obtained without extrapolation. Although the entropy rate estimates depend on the kind of script, the exponent of the ansatz function turns out to be constant across languages and governs the complexity of natural language in general. In other words, in spite of typological differences, all languages seem equally hard to learn, which partly confirms Hilberg’s hypothesis.
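The estimation strategy the abstract describes can be sketched in code: compress growing prefixes of a corpus to get per-character code lengths, then fit an extrapolation ansatz and read off its limit as the entropy rate estimate. This is only an illustrative sketch, not the paper's method: it uses zlib rather than the compressors evaluated in the article, a toy repetitive corpus, and the simpler power-law ansatz h + A·n^(β−1) (Hilberg-style) in place of the paper's stretched exponential; all function names are hypothetical.

```python
import zlib

def compression_rate(text: str, n: int) -> float:
    """Bits per character that zlib needs to encode the first n characters."""
    data = text[:n].encode("utf-8")
    return 8 * len(zlib.compress(data, 9)) / n

def fit_ansatz(ns, rates):
    """Fit rate(n) ~ h + A * n**(beta - 1) by grid-searching beta and
    solving ordinary least squares for (h, A) at each candidate beta.
    The fitted h is the extrapolated entropy rate (n -> infinity)."""
    best = None
    for step in range(1, 100):
        beta = step / 100.0
        xs = [n ** (beta - 1) for n in ns]
        m = len(ns)
        mx, my = sum(xs) / m, sum(rates) / m
        sxx = sum((x - mx) ** 2 for x in xs)
        if sxx == 0:
            continue
        # Simple linear regression of rates against xs.
        A = sum((x - mx) * (y - my) for x, y in zip(xs, rates)) / sxx
        h = my - A * mx
        sse = sum((h + A * x - y) ** 2 for x, y in zip(xs, rates))
        if best is None or sse < best[0]:
            best = (sse, beta, h, A)
    return best  # (sse, beta, h, A)

# Toy corpus: highly repetitive, so the rate should fall as n grows.
sample = "the quick brown fox jumps over the lazy dog " * 2000
ns = [2000, 4000, 8000, 16000, 32000, 64000]
rates = [compression_rate(sample, n) for n in ns]
sse, beta, h, A = fit_ansatz(ns, rates)
```

On real corpora the fitted h stays clearly positive, which is the paper's central claim; on a toy periodic corpus like the one above it extrapolates toward a value well below the finite-length rates, showing why raw compression rates overestimate the entropy rate.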
Keywords: entropy rate; universal compression; stretched exponential; language universals

This is an open access article distributed under the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Cite This Article

MDPI and ACS Style

Takahira, R.; Tanaka-Ishii, K.; Dębowski, Ł. Entropy Rate Estimates for Natural Language—A New Extrapolation of Compressed Large-Scale Corpora. Entropy 2016, 18, 364.


Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.

Entropy EISSN 1099-4300, published by MDPI AG, Basel, Switzerland.