6.1. Entropy Diversity across Languages of the World
In Section 5.2, we estimated word entropies for a sample of more than 1000 languages of the PBC. We find that unigram entropies cluster around a mean value of about nine bits/word, while entropy rates are generally lower, and fall closer to a mean of six bits/word (Figure 1). This is to be expected, since the former do not take co-textual information into account, whereas the latter do. To see this, remember that under stationarity, the entropy rate can be defined as a word entropy conditioned on a sufficiently large number of previous tokens (Equation (7)), while the unigram entropy is not conditioned on the co-text. As conditioning reduces entropy [73], it is not surprising that entropy rates tend to fall below unigram entropies.
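For concreteness, the following is a minimal sketch of the plug-in unigram entropy estimate; the whitespace tokenization and toy sentence are simplifying assumptions for illustration, not our actual preprocessing pipeline:

```python
# Minimal sketch of the plug-in (maximum likelihood) unigram entropy
# estimator: H = -sum_w p(w) * log2 p(w), with p(w) estimated from
# relative token frequencies. Whitespace tokenization is an assumption.
import math
from collections import Counter

def unigram_entropy(tokens):
    """Plug-in estimate of the unigram word entropy in bits/word."""
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in Counter(tokens).values())

tokens = "the dog saw the cat and the cat saw the dog".split()
print(round(unigram_entropy(tokens), 3))  # ~2.187 bits/word on this toy input
```

Note that this plug-in estimate is biased downwards for small samples, which is why corpus size matters for the estimates discussed here.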
It is more surprising that, given the wide range of potential entropies, from zero to ca. 14 bits/word, most natural languages fall within a relatively narrow band. It is non-trivial to establish an upper limit for the maximum word entropy of natural languages. In theory, it could be infinite, given that the range of word types is potentially infinite. In practice, however, even the highest-entropy languages only reach up to ca. 14 bits/word. Unigram entropies mainly fall in the range between seven and 12 bits/word, and entropy rates in the range between four and nine bits/word. Thus, each covers only around 40% of the scale. The distributions are also skewed to the right, and seem to differ from the Gaussian, and therefore symmetric, distribution that is expected for the plug-in estimator under a two-fold null hypothesis: (1) that the true entropy is the same for all languages, and (2) that, besides the bias in the entropy estimation, there is no additional bias constraining the distribution of entropy [62]. Further studies need to clarify where this right-skewness stems from.
Overall, the distributions suggest that there are pressures at play which keep word entropies in a relatively narrow range. We argue that this observation is related to the trade-off between the learnability and expressivity of communication systems. A (hypothetical) language with maximum word entropy would have a vast (potentially infinite) number of equiprobable word forms, and would be hard (or impossible) to learn. A language with minimum word entropy, on the other hand, would repeat the same word forms over and over again, and lack expressivity. Natural languages fall in a relatively narrow range between these extremes. This is in line with evidence from iterated learning experiments and computational simulations. For instance, Kirby et al. [79] illustrate how artificial languages collapse into states with underspecified word/meaning mappings, i.e., low word entropy states, when there is only pressure for learnability. Conversely, when there is only pressure for expressivity, so-called holistic strategies evolve, with a separate word form for each meaning, i.e., a high word entropy state. However, if pressures for learnability and expressivity interact, then combinatoriality emerges as a coding strategy, which keeps unigram word entropies in the middle ground.
This trade-off is also evident in optimization models of communication, which are based on two major principles: entropy minimization and maximization of the mutual information between meanings and word forms [40]. Zipf’s law for word frequencies, for instance, emerges in a critical balance between these two forces [39]. In these models, entropy minimization is linked with learnability: fewer word forms are easier to learn (see [80] for other cognitive costs associated with entropy), whereas mutual information maximization is linked with expressivity via the form/meaning mappings available in a communication system. Note that a fundamental property of information-theoretic models of communication is that $I(W;M)$, the mutual information between word forms $W$ and meanings $M$, cannot exceed the entropy, i.e., $I(W;M) \leq H(W)$ (Equation (13)) [37]. The lower the entropy, the lower the potential for expressivity. Hence, the entropy of words is an upper bound on the expressivity of words. This may also shed light on the right-skewness of the unigram entropy distribution in Figure 1. Displacing the distribution to the left or skewing it towards low values would compromise expressivity. In contrast, skewing it towards the right increases the potential for expressivity according to Equation (13), though this comes with a learnability cost.
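The bound itself follows in one line from the standard decomposition of mutual information (cf. [73]); in the notation introduced above, which we use here for exposition:

```latex
I(W;M) \;=\; H(W) - H(W \mid M) \;\le\; H(W),
\qquad \text{since} \quad H(W \mid M) \ge 0,
```

with $H(W \mid M)$ the conditional entropy of word forms given meanings; equality holds exactly when the meaning fully determines the word form.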
Further support for a pressure to increase entropy in order to warrant sufficient expressivity (Equation (13)) comes from the fact that no language in either the PBC or the EPC has a unigram entropy of less than six bits/word. There is only one language in the UDHR with a slightly lower unigram entropy, a Zapotecan language of Meso-America (zam). It has a fairly small corpus size (1067 tokens), and the same language has more than eight bits/word in the PBC. Thus, there is no language more than three SDs below the mean of unigram entropies. On the other hand, there are several languages more than four SDs above the mean, i.e., around or beyond 13 bits/word.
Despite the fact that natural languages do not populate the whole range of possible word entropies, there are still remarkable differences. Some of the languages at the low-entropy end are Tok Pisin and Bislama (Creole languages), and Sango (an Atlantic-Congo language of Western Africa). These have unigram entropies of around 6.7 bits/word. Languages at the high-entropy end include Greenlandic Inuktitut and Ancient Hebrew, with unigram entropies around 13 bits/word. Note that this is not to say that Greenlandic Inuktitut or Ancient Hebrew are “better” or “worse” communication systems than the Creole languages or Sango. Such an assessment is misleading for two reasons: First, information encoding happens in different linguistic (and non-linguistic) dimensions, not just at the word level. We are only just starting to understand the interactions between these levels from an information-theoretic perspective [10]. Second, if we assume that natural languages are used for communication, then learnability and expressivity of words are equally desirable features. Any combination of the two arises in the evolution of languages due to adaptive pressures. There is nothing “better” or “worse” about learnability or expressivity per se.
On a global scale, there seem to be high and low entropy areas. For example, languages in the Andean region of South America all have high unigram entropies (bright red in Figure 2). This is most likely due to their high morphological complexity, which results in a wide range of word types, a property shown to correlate with word entropies [81]. Further areas of generally high entropies include Northern Eurasia, Eastern Africa, and North America. In contrast, Meso-America, Sub-Saharan Africa, and South-East Asia are areas of relatively low word entropies (purple and blue in Figure 2). Testing these global patterns for statistical significance is an immediate next step in our research. Some preliminary results for unigram entropies and their relationship with latitude can be found in [14].

We are just beginning to understand the driving forces involved when languages develop extremely high or low word entropies. For example, Bentz et al. [12] as well as Bentz and Berdicevskis [18] argue that specific learning pressures reduce word entropy over time. As a consequence, different scenarios of language learning, transmission, and contact might lead to global patterns of low and high entropy areas [14].
6.2. Correlation between Unigram Entropies and Entropy Rates
In Section 5.3, we found a strong correlation between unigram entropies and entropy rates. This is surprising, as we would expect the co-text to have a variable effect on the information content of words, and that this effect might differ across languages too. However, what we actually find is that the co-text effect is (relatively) constant across languages. To put it differently, knowing the co-text of words decreases their uncertainty, or information content, by roughly the same amount, regardless of the language. Thus, entropy rates are systematically lower than unigram entropies, by 3.17 bits/word on average.
Notably, this result is in line with earlier findings by Montemurro and Zanette [8]. They reported, for samples of eight and 75 languages respectively, that the difference between word entropy rates of texts with randomized word order and those of texts with original word order is about 3.5 bits/word. Note that the word entropy rate under randomized word order is conceptually the same as the unigram entropy, since any dependencies between words are destroyed by randomization [8] (technically, all tokens of the sequence become independent and identically distributed variables ([73], p. 75)). Montemurro and Zanette [8] also show that while the average information content of words might differ across languages, the co-text reduces the information content of words by a constant amount. They interpret this as a universal property of languages. We have shown, for a considerably bigger sample of languages, that the entropy difference has a smaller variance than the original unigram entropies and entropy rates, which is consistent with Montemurro and Zanette’s findings (Figure 1).
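The randomization argument is easy to illustrate. The sketch below uses a synthetic token sequence with strong word-order dependencies and, as a cheap one-word-context stand-in for the full entropy rate, the bigram conditional entropy H(W_t | W_{t-1}); both choices are simplifying assumptions for illustration only:

```python
# Illustration of the randomization argument: shuffling leaves the unigram
# entropy unchanged but destroys sequential dependencies, so the
# context-conditioned entropy rises towards the unigram value. The bigram
# conditional entropy H(W_t | W_{t-1}) is a one-word-context stand-in for
# the full entropy rate (a simplifying assumption for illustration).
import math
import random
from collections import Counter

def unigram_entropy(tokens):
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in Counter(tokens).values())

def bigram_conditional_entropy(tokens):
    """H(W_t | W_{t-1}) in bits/word, from bigram counts."""
    pairs = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    n = len(tokens) - 1
    return -sum((c / n) * math.log2(c / contexts[w1])
                for (w1, _), c in pairs.items())

random.seed(0)
# Highly ordered toy "language": a fixed frame with two variable slots.
tokens = []
for _ in range(2000):
    tokens += ["the", random.choice(["dog", "cat"]), "saw",
               "a", random.choice(["bird", "fish"])]

shuffled = tokens[:]
random.shuffle(shuffled)

# Same token counts, hence the same unigram entropy (up to float rounding):
print(unigram_entropy(tokens), unigram_entropy(shuffled))
print(bigram_conditional_entropy(tokens))    # low: co-text is informative
print(bigram_conditional_entropy(shuffled))  # close to the unigram entropy
```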
As a consequence, the entropy rate is a linear function of the unigram entropy, and can be straightforwardly predicted from it. To our knowledge, we have provided the first empirical evidence for a linear dependency between the entropy rate and the unigram entropy (Figure 3). Interestingly, we have shown that the linearity of this relationship increases as text length increases (Figure 4). A mathematical investigation of the origins of this linear relationship should be the subject of future research.
There is also a practical side to this finding: estimating entropy rates requires searching strings of length i, where i is the index running through all tokens of a text. As i increases, the CPU time per additional word token increases linearly. In contrast, unigram entropies can be estimated based on dictionaries of word types and their token frequencies, so the processing time per additional word token is constant. Hence, Equation (11) can help to reduce processing costs considerably.
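As a sketch of how this shortcut might look in practice: compute the cheap unigram estimate and map it through the fitted linear relation, rather than running the costly match search. The regression coefficients below are hypothetical placeholders, not the fitted values of Equation (11):

```python
# Hedged sketch of the cost-saving shortcut: predict the entropy rate from
# the cheap unigram entropy via a linear relation (cf. Equation (11)).
# SLOPE and INTERCEPT are hypothetical placeholders; they would have to be
# fitted on corpora where both quantities were actually estimated.
import math
from collections import Counter

SLOPE, INTERCEPT = 0.9, -2.0  # hypothetical coefficients, for illustration

def unigram_entropy(tokens):
    """O(1) work per additional token: only a frequency dictionary is kept."""
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in Counter(tokens).values())

def predicted_entropy_rate(tokens):
    """Linear prediction of the entropy rate, avoiding the costly search."""
    return SLOPE * unigram_entropy(tokens) + INTERCEPT

tokens = "the dog saw the cat and the cat saw the dog".split()
print(round(predicted_entropy_rate(tokens), 3))
```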