# The Entropy of Words—Learnability and Expressivity across More than 1000 Languages

## Abstract

## 1. Introduction

- Across languages of the world, unigram entropies display a unimodal distribution around a mean of ca. nine bits/word, with a standard deviation of ca. one bit/word. Entropy rates have a lower mean of ca. six bits/word, with a standard deviation of ca. one bit/word. Hence, there seem to be strong pressures keeping the mass of languages in a relatively narrow entropy range. This is particularly salient for the difference between unigram entropy and entropy rate (Section 5.2).
- There is a strong positive linear relationship between unigram entropies and entropy rates ($r=0.96$, $p<0.0001$). To our knowledge, this has not been reported before. We formulate a simple linear model that predicts the entropy rate of a text $\widehat{h}(T)$ from the unigram entropy $\widehat{H}_{1}(T)$ of the same text: $\widehat{h}(T)=k_{1}+k_{2}\,\widehat{H}_{1}(T)$, where $k_{1}=-1.12$ and $k_{2}=0.78$ (Section 5.3). The implication of this relationship is that uncertainty reduction by co-textual information is approximately linear across languages of the world.
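As a quick numerical check of the linear model in the second point, the following Python sketch plugs a unigram entropy into $\widehat{h}(T)=k_{1}+k_{2}\,\widehat{H}_{1}(T)$ with the reported coefficients. The function name `predict_entropy_rate` is chosen here for illustration only and is not part of the authors' software.

```python
K1 = -1.12  # intercept reported in the text (bits/word)
K2 = 0.78   # slope reported in the text


def predict_entropy_rate(unigram_entropy: float) -> float:
    """Predict the entropy rate (bits/word) from the unigram entropy (bits/word)."""
    return K1 + K2 * unigram_entropy


if __name__ == "__main__":
    # A text at the cross-linguistic mean unigram entropy of ca. nine bits/word.
    print(round(predict_entropy_rate(9.0), 2))
```

For a unigram entropy of nine bits/word the model gives about 5.9 bits/word, which agrees with the mean entropy rate of ca. six bits/word stated in the first point.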

## 2. Data

## 3. Theory

#### 3.1. Word Types and Tokens

The set of word types (in lower case) for this sentence is: in the beginning god created the heavens and the earth and the earth was waste and empty [...]
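The difference between word tokens and word types can be made concrete with a small Python sketch on the visible part of the example sentence. This is only an illustration, assuming simple white-space splitting; the elided remainder of the verse ("[...]") is left out.

```python
# Visible part of the example sentence, already in lower case.
sentence = ("in the beginning god created the heavens and the earth "
            "and the earth was waste and empty")

tokens = sentence.split()    # word tokens: every occurrence counts
types = sorted(set(tokens))  # word types: distinct word forms only

print(len(tokens), "tokens")
print(len(types), "types:", types)
```

On this fragment the sketch counts 17 tokens but only 11 types, since "the", "and" and "earth" occur more than once.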

#### 3.2. Entropy Estimation

#### 3.2.1. General Conditions

#### 3.2.2. Word Entropy Estimation

in the beginning god created the heavens and the earth and the earth was waste and empty [...]
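To illustrate word entropy estimation on this example, the sketch below computes a maximum likelihood (plug-in) unigram entropy in Python, i.e., $-\sum_{w} \widehat{p}(w)\log_{2}\widehat{p}(w)$ with relative frequencies as probability estimates. Maximum likelihood is only one of the nine estimators compared in this paper, and the function name `ml_unigram_entropy` is an assumption made for this illustration.

```python
import math
from collections import Counter


def ml_unigram_entropy(tokens):
    """Plug-in (maximum likelihood) unigram entropy in bits/word."""
    counts = Counter(tokens)
    n = len(tokens)
    # Relative frequencies serve as probability estimates.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


tokens = ("in the beginning god created the heavens and the earth "
          "and the earth was waste and empty").split()
h1 = ml_unigram_entropy(tokens)
print(f"{h1:.2f} bits/word")
```

For a fragment of 17 tokens the result (roughly 3.2 bits/word) is far below the cross-linguistic values reported in this paper, which reflects how strongly the estimate depends on text length.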

#### 3.2.3. Problem 1: The Infinite Productive Potential of Languages

#### 3.2.4. Problem 2: Short- and Long-Range Correlations between Words

#### 3.2.5. Our Perspective

#### 3.2.6. n-Gram Entropies

#### 3.2.7. Entropy Rate

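$\text{in}_{1}\ \text{the}_{2}\ \text{beginning}_{3}\ \text{god}_{4}\ \text{created}_{5}\ \text{the}_{6}\ \text{heavens}_{7}\ \text{and}_{8}\ \text{the}_{9}\ \text{earth}_{10}\ \text{and}_{11}\ \text{the}_{12}\ \text{earth}_{13}\ \text{was}_{14}\ \text{waste}_{15}\ \text{and}_{16}\ \text{empty}_{17}$ [...]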
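The indexed token sequence above lends itself to a match-length illustration of entropy rate estimation. The Python sketch below rests on an assumption: it uses a common match-length estimator of the form $\widehat{h}=\frac{1}{N}\sum_{i=2}^{N}\frac{\log_{2}(i)}{l_{i}}$, where $l_{i}$ is the length of the longest token string starting at position $i$ that already occurs among tokens $1,\dots,i-1$, plus one. The estimator definition is not reproduced in this section, so the details here (including the function names) should be read as illustrative rather than as the authors' implementation.

```python
import math


def match_length(tokens, i):
    """Longest match of tokens[i:] found entirely within tokens[:i], plus one.

    `i` is a 0-based index; the match must lie completely in the past.
    """
    longest = 0
    for j in range(i):
        k = 0
        while (i + k < len(tokens) and j + k < i
               and tokens[j + k] == tokens[i + k]):
            k += 1
        longest = max(longest, k)
    return longest + 1


def entropy_rate(tokens):
    """Match-length entropy rate estimate in bits/word (assumed form, see text)."""
    n = len(tokens)
    # 1-based position of the token at 0-based index i is i + 1.
    total = sum(math.log2(i + 1) / match_length(tokens, i) for i in range(1, n))
    return total / n


tokens = ("in the beginning god created the heavens and the earth "
          "and the earth was waste and empty").split()
print(match_length(tokens, 10))  # token 11 ("and") starts "and the earth"
print(f"{entropy_rate(tokens):.2f} bits/word")
```

At token 11 the string "and the earth" has already been seen at tokens 8 to 10, so the match length under this definition is 3 + 1 = 4. Longer repeated strings make the sequence more predictable and lower the estimate.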

## 4. Methods

#### 4.1. Entropy Estimation Software

#### 4.2. Stabilization Criterion

#### 4.3. Corpus Samples

## 5. Results

#### 5.1. Entropy Stabilization throughout the Text Sequence

#### 5.2. Word Entropies across More than 1000 Languages

#### 5.3. Correlation between Unigram Entropy and Entropy Rate

## 6. Discussion

#### 6.1. Entropy Diversity across Languages of the World

#### 6.2. Correlation between Unigram Entropies and Entropy Rates

## 7. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## Abbreviations

- EPC: European Parliament Corpus
- PBC: Parallel Bible Corpus
- UDHR: Universal Declaration of Human Rights
- NSB: Nemenman–Shafee–Bialek

## Appendix A. Text Pre-Processing

#### Appendix A.1. Converting Letters to Lower Case

#### Appendix A.2. Removal of Punctuation

Although, as you will have seen, the dreaded “millennium bug” failed to materialise, [...] EPC (English, line 3)

And God said, let there be light . And there was light . [...] PBC (English, Genesis 1:3)

Furthermore, no distinction shall be made on the basis of the political, jurisdictional or international status of the country [...] UDHR (English, paragraph 3)

- For the EPC, we use the regular expression `\\W+` in combination with the R function strsplit() to split strings of UTF-8 characters on punctuation and white spaces.
- For the PBC and UDHR, we define a regular expression meaning “at least one alpha-numeric character between white spaces”, which would be written as: `.*[[:alpha:]].*` This regex can then be matched with the respective text to yield word types. This is done via the functions regexpr() and regmatches() in R.
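The two tokenization strategies can be approximated in Python as follows. This is a rough analogue and not the R code used for the paper: `tokenize_epc` mirrors splitting on `\W+`, and `tokenize_pbc` mirrors keeping white-space-delimited strings that contain at least one alphabetic character. Both function names and the second example string (with punctuation already separated by white space) are assumptions made for illustration.

```python
import re


def tokenize_epc(text):
    """Lower-case, then split on runs of non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def tokenize_pbc(text):
    """Lower-case, split on white space, keep strings with at least one letter."""
    return [tok for tok in text.lower().split()
            if any(ch.isalpha() for ch in tok)]


print(tokenize_epc("Although, as you will have seen,"))
print(tokenize_pbc("And God said , let there be light ."))
```

The second strategy relies on punctuation marks being separated from words by white space in the input; a string such as "said," would otherwise be kept with its comma attached.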

## Appendix B. Advanced Entropy Estimators

#### Appendix B.1. The Miller–Madow Estimator

#### Appendix B.2. Bayesian Estimators

#### Appendix B.3. The Chao–Shen Estimator

#### Appendix B.4. The James–Stein Shrinkage Estimator

## Appendix C. Stabilization of the Entropy Rates for 21 Languages of the European Parliament Corpus

#### Appendix C.1. Unigram Entropies

**Figure A1.** Unigram entropies (y-axis) as a function of text length (x-axis) across 21 languages of the EPC corpus. Unigram entropies are estimated on prefixes of the text sequence increasing by one K tokens. Thus, the first prefix covers tokens one to one K, the second prefix covers tokens one to two K, etc. The number of tokens is limited to 100 K, since entropy values already (largely) stabilize throughout the text sequence before that. Hence, there are 100 points along the x-axis. Nine different methods of entropy estimation are indicated with colours. CS: Chao–Shen estimator, Jeff: Bayesian estimation with Jeffreys prior, Lap: Bayesian estimation with Laplace prior, minimax: Bayesian estimation with minimax prior, ML: maximum likelihood, MM: Miller–Madow estimator, NSB: Nemenman–Shafee–Bialek estimator, SG: Schürmann–Grassberger estimator, Shrink: James–Stein shrinkage estimator. Detailed explanations for these estimators are given in Appendix B. Language identifiers used by the EPC are given in parentheses.

**Figure A2.** Standard deviations (SDs) of unigram entropies (y-axis) as a function of text length (x-axis) across 21 languages of the EPC corpus, and the nine different estimators. Unigram entropies are estimated on prefixes of the text sequence increasing by one K tokens as in Figure A1. SDs are calculated over the entropies of the next 10 prefixes as explained in Section 4. Hence, there are 90 points along the x-axis. The horizontal dashed line indicates $SD=0.1$ as a threshold.
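The rolling SD described in the caption of Figure A2 can be sketched in Python as follows. Several details are assumptions made for this illustration: the sample standard deviation (`statistics.stdev`) is used, a prefix counts as stable when the SD falls strictly below the threshold, and the function names are hypothetical. The exact criterion is defined in Section 4.

```python
import statistics


def rolling_sds(entropies, window=10):
    """SD over the `window` estimates that follow each prefix."""
    return [statistics.stdev(entropies[i + 1:i + 1 + window])
            for i in range(len(entropies) - window)]


def stabilization_point(entropies, window=10, threshold=0.1):
    """0-based index of the first prefix whose following SD is below `threshold`.

    Returns None if the SD never drops below the threshold.
    """
    for i, sd in enumerate(rolling_sds(entropies, window)):
        if sd < threshold:
            return i
    return None
```

With 100 prefix estimates and a window of 10, `rolling_sds` returns 90 values, which matches the 90 points along the x-axis mentioned in the caption.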

#### Appendix C.2. Entropy Rate

**Figure A3.** Entropy rates as a function of text length across 21 languages of the EPC. Entropy rates are estimated on prefixes of the text sequence increasing by one K tokens as in Figure A1. Hence, there are 100 points along the x-axis. The language identifiers used by the EPC are given in parentheses.

**Figure A4.** SDs of entropy rates as a function of text length across 21 languages of the EPC corpus. The format is the same as in Figure A2.

## Appendix D. Stabilization of Entropy Rates for 32 Languages of the Parallel Bible Corpus

**Figure A5.** Entropy rates as a function of text length across 32 languages of the PBC. Languages were chosen to represent some of the major language families across the world. Entropy rates are estimated on prefixes of the text sequence increasing by one K tokens as in Figure A1. Hence, there are 100 points along the x-axis. The language names and families are taken from Glottolog 2.7 [78] and given above the plots.

**Figure A6.** SDs of entropy rates as a function of text length across 32 languages of the PBC. Languages were chosen to represent some of the major language families across the world. The format is as in Figure A2. The language names and families are given above the plots.

## Appendix E. Correlations between Estimated Unigram Entropies

**Figure A7.** Pairwise correlations of estimated unigram entropy values for three different corpora: Europarl Corpus (EPC), Parallel Bible Corpus (PBC), and Universal Declaration of Human Rights (UDHR). Results of the maximum likelihood (ML) method are taken here as a baseline and correlated with all other methods. CS: Chao–Shen estimator, Jeff: Bayesian estimation with Jeffreys prior, Lap: Bayesian estimation with Laplace prior, minimax: Bayesian estimation with minimax prior, MM: Miller–Madow estimator, NSB: Nemenman–Shafee–Bialek estimator, SG: Schürmann–Grassberger estimator, Shrink: James–Stein shrinkage estimator.

- | ML | MM | Jeff | Lap | SG | minmax | CS | NSB | Shrink |
---|---|---|---|---|---|---|---|---|---|
ML | - | | | | | | | | |
MM | 0.9999405 | - | | | | | | | |
Jeff | 0.9994266 | 0.9996888 | - | | | | | | |
Lap | 0.9983479 | 0.998819 | 0.9997185 | - | | | | | |
SG | 1 | 0.9999405 | 0.9994267 | 0.998348 | - | | | | |
minmax | 0.9999999 | 0.9999445 | 0.9994415 | 0.9983733 | 0.9999999 | - | | | |
CS | 0.9993888 | 0.9996607 | 0.9999867 | 0.9997199 | 0.9993889 | 0.9994037 | - | | |
NSB | 0.9997953 | 0.9999065 | 0.9998969 | 0.9992965 | 0.9997954 | 0.9998041 | 0.9998719 | - | |
Shrink | 0.9999945 | 0.9999059 | 0.9993348 | 0.9981998 | 0.9999945 | 0.9999935 | 0.9992906 | 0.9997419 | - |

- | ML | MM | Jeff | Lap | SG | minmax | CS | NSB | Shrink |
---|---|---|---|---|---|---|---|---|---|
ML | - | | | | | | | | |
MM | 0.9998609 | - | | | | | | | |
Jeff | 0.9989818 | 0.9993252 | - | | | | | | |
Lap | 0.9969406 | 0.9975655 | 0.999441 | - | | | | | |
SG | 1 | 0.9998611 | 0.9989821 | 0.9969412 | - | | | | |
minmax | 0.9999988 | 0.9998743 | 0.9990352 | 0.9970343 | 0.9999989 | - | | | |
CS | 0.998965 | 0.999388 | 0.9999208 | 0.9992828 | 0.9989654 | 0.9990176 | - | | |
NSB | 0.999173 | 0.9994438 | 0.9992162 | 0.9979161 | 0.9991732 | 0.9992024 | 0.9993134 | - | |
Shrink | 0.9999805 | 0.9998643 | 0.9988525 | 0.9967172 | 0.9999806 | 0.9999785 | 0.9988745 | 0.9991464 | - |

- | ML | MM | Jeff | Lap | SG | minmax | CS | NSB | Shrink |
---|---|---|---|---|---|---|---|---|---|
ML | - | | | | | | | | |
MM | 0.9979922 | - | | | | | | | |
Jeff | 0.9975981 | 0.9952309 | - | | | | | | |
Lap | 0.9928889 | 0.9895672 | 0.9986983 | - | | | | | |
SG | 0.9999999 | 0.9980081 | 0.9976072 | 0.9929003 | - | | | | |
minmax | 0.999976 | 0.9979101 | 0.9980311 | 0.9936525 | 0.9999768 | - | | | |
CS | 0.9854943 | 0.9932518 | 0.9826871 | 0.9763849 | 0.985522 | 0.985383 | - | | |
NSB | 0.9623212 | 0.9621106 | 0.9663655 | 0.964961 | 0.9623601 | 0.962801 | 0.9459217 | - | |
Shrink | 0.9986898 | 0.9984146 | 0.9942842 | 0.9877932 | 0.9986974 | 0.9984619 | 0.9866643 | 0.9607329 | - |

## Appendix F. Correlations between Unigram Entropies and Entropy Rates for the PBC

**Figure A8.** Correlations between the nine unigram entropy estimators and entropy rates for the PBC. The panels in the lower half of the plot give scatterplots; the panels in the upper half give the corresponding Pearson correlations. The diagonal panels give density plots.

## Appendix G. Correlations between PBC, EPC and UDHR Unigram Entropies

**Figure A9.** Correlations between unigram entropies (NSB estimated) for texts of the PBC and UDHR (left panel), texts of the EPC and UDHR (middle panel), and texts of the PBC and EPC (right panel). Local regression smoothers are given (blue lines) with 95% confidence intervals.

**Figure 1.** The distribution of entropic measures in bits. (**a**) Probability density of unigram entropies (light grey) and entropy rates (dark grey) across texts of the PBC (using 50 K tokens). M and $SD$ are, respectively, the mean and the standard deviation of the values. A vertical dashed line indicates the mean M. (**b**) The same for the difference between unigram entropies and entropy rates.

**Figure 2.** World maps with unigram entropies (upper panel), entropy rates (middle panel) and the difference between them (lower panel), across texts of the PBC (using 50 K tokens), amounting to 1499 texts and 1115 languages.

**Figure 3.** Linear relationship between unigram entropies approximated with the NSB estimator (x-axis) and entropy rates (y-axis) for 1495 PBC texts (50 K tokens) across 1112 languages. Four texts (Ancient Hebrew (hbo), Eastern Canadian Inuktitut (ike), Kalaallisut (kal), and Northwest Alaska Inupiatun (esk)) were excluded here, since they have extremely high values of more than 13 bits/word. In the left panel, a linear regression model is given as a blue line and a local regression smoother as a red dashed line. The Pearson correlation coefficient is $r=0.95$. In the right panels, plots are faceted by macro areas across the world. Macro areas are taken from Glottolog 2.7 [78]. Linear regression models are given as coloured lines with 95% confidence intervals.

**Figure 4.** Pearson correlation between unigram entropy and entropy rate (y-axis) as a function of text length (x-axis). Each correlation is calculated over the 21 languages of the EPC corpus. All nine unigram entropy estimators are considered.

Corpus | Register | Size * | Mean Size * | Texts | Lang |
---|---|---|---|---|---|
EPC | Political | ca. 21 M | ca. 1 M | 21 | 21 |
PBC | Religious | ca. 430 M | ca. 290 K | 1525 | 1137 |
UDHR | Legal | ca. 500 K | ca. 1.3 K | 370 | 341 |
Total: | | ca. 450 M | | 1916 | 1259 |

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

