# Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size


## Abstract


## 1. Introduction

Without any knowledge of the distribution, we need at most ${H}_{max}=lo{g}_{2}\left(K\right)=lo{g}_{2}\left(4,009,318\right)\approx 21.93$ guesses to correctly predict the word type. Calculating H for our database based on Equation (1), using the corresponding probabilities for each i, yields 12.28. The difference between ${H}_{max}$ and $H\left(p\right)$ is defined as information in [3]. Thus, knowledge of the non-uniform word frequency distribution gives us approximately 9.65 bits of information or, put differently, we save on average almost 10 guesses to correctly predict the word type.
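The arithmetic above is easy to reproduce on any word frequency list. The following sketch computes H, H_{max}, and the information difference for a small hypothetical frequency list (the counts are illustrative, not taken from the database discussed here):

```python
import math
from collections import Counter

def shannon_entropy_bits(counts):
    # H(p) = -sum_i p_i * log2(p_i), in bits.
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical toy frequency list (illustrative only, not the actual database).
counts = Counter({"der": 50, "die": 30, "entropie": 15, "zipf": 5})

h = shannon_entropy_bits(counts)   # entropy of the observed distribution
h_max = math.log2(len(counts))     # entropy of a uniform distribution over K types
information = h_max - h            # guesses saved by knowing the distribution
```

With a uniform distribution over the K = 4 toy types, H_{max} = 2 bits; the skew of the counts makes H smaller, and the difference is the information gained.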

To illustrate this problem, the word entropy H and the exponent of the Zipf distribution γ were estimated after every n = 2^{k} consecutive tokens, where k = 6, 7, …, $lo{g}_{2}\left(N\right)=28$. Figure 1 shows a Simpson’s Paradox [18] for the resulting data: an apparent strong positive relationship between H and γ is observed across all datapoints (Spearman ρ = 0.99). However, when the sample size is kept constant, this relationship completely changes: if the correlation between H and γ is calculated separately for each k, the results indicate a strong negative relationship (ρ ranges between −0.98 and −0.64, with a median of −0.92). The reason for this apparent contradiction is that both H and γ monotonically increase with the sample size. When studying word frequency distributions quantitatively, it is therefore essential to take this dependence on the sample size into account [16].

The Jensen–Shannon divergence can be generalized by replacing the Shannon entropy with the generalized entropy H_{α} (Equation (4)), thus leading to a spectrum of divergence measures D_{α}, parametrized by α [22]. For the analysis of the statistical properties of natural languages, this parameter is highly interesting because, as demonstrated in [21,22], varying the α-parameter allows us to magnify differences between different texts at specific scales of the corresponding word frequency spectrum: if α is increased (decreased), the weight of the most frequent words is increased (decreased). As pointed out by an anonymous reviewer, a similar idea was already reported in the work of Tanaka-Ishii and Aihara [23], who studied a different formulation of generalized entropy, the so-called Rényi entropy of order α [24]. Because we are especially interested in using generalized entropies to quantify the (dis)similarity between two different texts or databases, following [21,22], we chose to focus on the generalization of Havrda–Charvat–Lindhard–Nielsen–Aczél–Daróczy–Tsallis instead of the formulation of Rényi: a divergence measure based on the latter can become negative for α > 1 [25], while the corresponding divergence measure based on the former formulation can be shown to be strictly non-negative [20,22]. In addition, D_{α}(p,q) is the square of a metric for $\alpha \in \left(0,2\right]$, i.e., (i) D_{α}(p,q) ≥ 0; (ii) D_{α}(p,q) = 0 ⟺ p = q; (iii) D_{α}(p,q) = D_{α}(q,p); and (iv) $\sqrt{{D}_{\alpha}}$ obeys the triangle inequality [7,20,22].
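Since Equations (2) and (4) are not reproduced in this excerpt, the following sketch assumes the standard Tsallis-type formulation H_{α}(p) = (1 − Σ_i p_i^α)/(α − 1) and the Jensen–Shannon-type construction D_{α}(p,q) = H_{α}((p + q)/2) − ½H_{α}(p) − ½H_{α}(q); normalization constants in the paper’s actual equations may differ:

```python
import numpy as np

def h_alpha(p, alpha):
    # Generalized (Tsallis-type) entropy; alpha -> 1 recovers the Shannon
    # entropy (in nats). Assumed form: (1 - sum_i p_i^alpha) / (alpha - 1).
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log(p))
    return (1.0 - np.sum(p ** alpha)) / (alpha - 1.0)

def d_alpha(p, q, alpha):
    # Jensen-Shannon-type divergence built from h_alpha; non-negative,
    # symmetric, and zero iff p == q for alpha in (0, 2].
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return h_alpha(m, alpha) - 0.5 * (h_alpha(p, alpha) + h_alpha(q, alpha))
```

A useful consequence of this construction: for α = 2.00, the divergence collapses to the simple quadratic form D_{2}(p,q) = ¼ Σ_i (p_i − q_i)², which is one way to see why it is dominated by the most frequent words.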

Only a relatively small number of the most frequent word types effectively determines H_{α} and D_{α} for α = 2.00, and all other words are practically irrelevant. This number quickly grows as α decreases. For example, database sizes of N ≈ 10^{8} are needed for a robust estimation of the standard Jensen–Shannon divergence (Equation (2)), i.e., for α = 1.00. This connection makes the approach of [21,22] particularly interesting in relation to the systematic influence of the sample size demonstrated above (cf. Figure 1).

The remainder of this paper is structured as follows. After a description of our materials and methods (Section 2), the dependence of H_{α} and D_{α} on the sample size is tested for different α-parameters (Sections 3.1 and 3.2). This is followed by a case study, in which we demonstrate that the influence of sample size makes it difficult to quantify lexical dynamics and language change, and in which we also show that standard sampling approaches do not solve this problem (Section 3.3). The paper ends with some concluding remarks regarding the consequences of the results for the statistical analysis of languages (Section 4).

## 2. Materials and Methods

The generalized entropy H_{α} can be written as a sum over different words, where each individual word type i contributes a term determined by its probability of occurrence p_{i} and the parameter α. Table 1 shows the resulting contributions of the different token frequency classes to H_{α} as a function of α. For lower values of α, H_{α} is dominated by word types with lower token frequencies. For instance, hapax legomena, i.e., word types that only occur once, contribute almost half of H_{α=0.25}. For larger values of α, only the most frequent words contribute to H_{α}. For example, the 27 word types with a token frequency of more than 1,000,000 contribute more than 92% to H_{α=2.00}. Because words in different frequency ranges have different grammatical and pragmatic properties, varying α makes it possible to study different aspects of the word frequency spectrum [21].
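The frequency-class contributions of the kind reported in Table 1 can be approximated by the share of each type in the α-dependent sum; a minimal sketch with hypothetical counts (one very frequent type plus many hapax legomena):

```python
import numpy as np

def contribution_shares(freqs, alpha):
    # Share (in %) of each word type in the alpha-dependent sum underlying
    # H_alpha: p_i^alpha for alpha != 1, -p_i * log(p_i) in the Shannon limit.
    p = np.asarray(freqs, dtype=float)
    p = p / p.sum()
    terms = -p * np.log(p) if np.isclose(alpha, 1.0) else p ** alpha
    return 100.0 * terms / terms.sum()

# Hypothetical counts: one very frequent type plus 500 hapax legomena.
freqs = [1000] + [1] * 500
high = contribution_shares(freqs, 2.00)   # dominated by the frequent type
low = contribution_shares(freqs, 0.25)    # dominated by the hapaxes
```

With these toy counts, the single frequent type accounts for almost all of the α = 2.00 sum, while the hapaxes jointly dominate the α = 0.25 sum, mirroring the qualitative pattern described above.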

Next, we studied the dependence of H_{α} and D_{α} on the sample size for the different α-values. Let us note that each article in our database can be described by different attributes, e.g., publication date, subject matter, length, category, or author. Of course, this list of attributes is not exhaustive and can be freely extended depending on the research objective. In order to balance the articles’ characteristics across the corpus, we prepared 10 versions of our database, each with a different random arrangement of the order of all articles. To study the convergence of H_{α}, we computed H_{α} after every n = 2^{k} consecutive tokens for each version, where k = 6, 7, …, $lo{g}_{2}\left(N\right)=27$. For D_{α}, we compared the first n = 2^{k} word tokens with the last n = 2^{k} word tokens of each version of our database; here, k = 6, 7, …, 26. For instance, for k = 26, the first 67,108,864 word tokens are compared with the last 67,108,864 word tokens by calculating the generalized divergence between both “texts” for different α-values. Because the article order is randomized, any systematic differences, random fluctuations aside, can be attributed to differences in the sample size.
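The first-versus-last comparison scheme can be sketched as follows; the token stream below is a synthetic Zipf-like stand-in for one shuffled version of the database, and the divergence is computed for α = 1.00, i.e., the standard Jensen–Shannon divergence:

```python
import math
import random
from collections import Counter

def jensen_shannon(tokens_a, tokens_b):
    # Jensen-Shannon divergence (alpha = 1.00) between two token lists, in bits.
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    na, nb = len(tokens_a), len(tokens_b)
    vocab = sorted(set(ca) | set(cb))
    pa = [ca[w] / na for w in vocab]
    pb = [cb[w] / nb for w in vocab]
    pm = [0.5 * (x + y) for x, y in zip(pa, pb)]

    def h(dist):
        return -sum(p * math.log2(p) for p in dist if p > 0)

    return h(pm) - 0.5 * h(pa) - 0.5 * h(pb)

# Synthetic Zipf-like token stream standing in for one shuffled database version.
rng = random.Random(0)
vocab = [f"w{i}" for i in range(1000)]
weights = [1.0 / (i + 1) for i in range(1000)]
stream = rng.choices(vocab, weights=weights, k=2 ** 14)

# Compare the first n = 2^k with the last n = 2^k word tokens for growing k.
divergences = {k: jensen_shannon(stream[:2 ** k], stream[-2 ** k:])
               for k in range(6, 14)}
```

On the real database, this loop would run over each of the 10 shuffled versions and the results would be averaged per k.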

For the case study, the database was split into monthly subcorpora of size N_{t} for each point in time t, where each monthly observation is identified by a variable containing the year y = 1947, 1948, …, 2017 and the month m = 1, 2, …, 12.

The divergence D_{α} was then calculated for successive moments in time, i.e., D_{α}(t,t − 1), in order to estimate the rate of lexical change at a given time point t [11,12]. For instance, D_{α} at y = 2000 and m = 1 represents the generalized divergence, for a corresponding α-value, between all articles that were published in January 2000 and those published in December 1999. The resulting series of month-to-month changes can then be analyzed in a standard time-series framework. For example, we can test whether the series exhibits any large-scale tendency to change over time: a series with a positive trend increases over time, which would be indicative of an increasing rate of lexical change. It is also instructive to look at the first differences of the series, as an upward trend here, in addition to an upward trend in the actual series, would mean that the rate of lexical change is increasing at an increasing rate.
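In code, the trend diagnostics described above reduce to inspecting the series and its first differences; the values below are purely illustrative, not results from the database:

```python
import numpy as np

# Hypothetical series of monthly D_alpha(t, t-1) values (illustrative only).
d_series = np.array([0.012, 0.013, 0.013, 0.015, 0.016, 0.018, 0.021])

# First differences: the change in the rate of lexical change from month to month.
first_diff = np.diff(d_series)

# An upward trend in d_series indicates an increasing rate of lexical change;
# an upward trend in first_diff would mean that rate increases at an increasing rate.
trend_up = d_series[-1] > d_series[0]
```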

To construct a comparison corpus, we took one version of the database with a randomized article order and used its first N_{t=1} words to generate a new corpus that has the same length (in words) as the original corpus at t = 1 but in which the diachronic signal is destroyed. We then used the next N_{t=2} words to generate a corpus that has the same length as the original corpus at t = 2, and so on. For example, the concatenation of all articles that were published in Der Spiegel in January 1947 is 94,716 word tokens long. Correspondingly, our comparison corpus at this point in time also consisted of 94,716 word tokens, but the articles it consisted of could belong to any point in time between 1947 and 2017. We then computed all D_{α}(t,t − 1) values both for the original version of our database and for the version with a destroyed diachronic signal. We tentatively call this a “Litmus test”, because it determines whether our results can be attributed to real diachronic changes or to a systematic bias due to the varying sample sizes.

To quantify how H_{α} and D_{α} vary as a function of the sample size without making any assumptions regarding the functional form of the relationship, we used the non-parametric Spearman correlation coefficient, denoted as ρ. It assesses whether there is a monotonic relationship between two variables and is computed as Pearson’s correlation coefficient on the ranks and average ranks of the two variables. The significance of an observed coefficient was determined by Monte Carlo permutation tests in which the observed values of the sample size were randomly permuted 10,000 times. The null hypothesis is that H_{α}/D_{α} does not vary with the sample size; if this is the case, then the sample size becomes arbitrary and can thus be randomly re-arranged, i.e., permuted. Let c denote the number of times the absolute ρ-value of a derived dataset is greater than or equal to the absolute ρ-value computed on the original data. A coefficient was labeled as “statistically significant” if c < 10, i.e., p < 0.001. In cases where l, i.e., the number of datapoints, was lower than or equal to 7, an exact test over all l! permutations was calculated. Here, let c* denote the number of times the absolute ρ-value of a derived dataset is greater than the absolute ρ-value computed on the original data. A coefficient was then labeled as “statistically significant” if c*/l! < 0.001.
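The test can be sketched in plain numpy as follows (scipy.stats.spearmanr would serve equally well); the helper names are our own:

```python
import numpy as np

def average_ranks(x):
    # 1-based ranks; ties receive the mean of the ranks they span.
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    sx = x[order]
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sx[j + 1] == sx[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return ranks

def spearman_rho(x, y):
    # Pearson correlation computed on (average) ranks.
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]

def permutation_pvalue(x, y, n_perm=10_000, seed=0):
    # Monte Carlo permutation test: permute the sample sizes, count how often
    # the permuted |rho| reaches the observed |rho|.
    rng = np.random.default_rng(seed)
    observed = abs(spearman_rho(x, y))
    c = sum(abs(spearman_rho(rng.permutation(x), y)) >= observed
            for _ in range(n_perm))
    return c / n_perm
```

For l ≤ 7 datapoints, the Monte Carlo loop would be replaced by an exhaustive loop over all l! permutations (e.g., via itertools.permutations), as described above.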

## 3. Results

#### 3.1. Entropy H_{α}

To study the convergence of H_{α}, we computed H_{α} for the first n = 2^{k} consecutive tokens, where k = 6, 7, …, 27, for each of the 10 versions of our database (each with a different random article order) and calculated averages. Figure 4A shows the convergence pattern for the five α-values in a superimposed scatter plot with connected dots, where the color of each y-axis corresponds to one α-value (cf. the legend in Figure 4; the axes are log-scaled for improved visibility). For values of α < 1.00, there is no indication of convergence, while for H_{α=1.50} and H_{α=2.00}, H_{α} seems to converge rather quickly. To test the observed relationship between the sample size and H_{α} for different α-values, we calculated the Spearman correlation between the sample size and H_{α} for different minimum sample sizes. For example, a minimum sample size of n = 2^{17} indicates that we restrict the calculation to sample sizes ranging between n = 2^{17} and n = 2^{27}. For those 11 datapoints, we computed the Spearman correlation between the sample size and H_{α} and ran the permutation test. Table 2 summarizes the results. For all α-values except α = 2.00, there is a clear indication of a significant (at p < 0.001), strong, positive, monotonic relationship between H_{α} and the sample size for all minimum sample sizes. Thus, while Figure 4A seems to indicate that H_{α=1.50} converges rather quickly, the Spearman analysis reveals that the sample size dependence of H_{α=1.50} persists for higher values of k, with a minimum ρ of 0.80. Except for the last two minimum sample sizes, all coefficients pass the permutation test. For α = 2.00, H_{α} starts to converge after n = 2^{14} word tokens; none of the correlation coefficients for higher minimum sample sizes passes the permutation test. In line with the results of [21,22], this suggests α = 2.00 as a pragmatic choice when calculating H_{α}. However, it is important to point out that for α = 2.00, the computation of H_{α} is almost completely determined by the most frequent words (cf. Table 1). For lower values of α, the basic problem of sample size dependence (cf. Figure 1) persists. If it is the aim of a study to compare H_{α} for databases of varying sizes, this has to be taken into account. Correspondingly, [23] reached similar conclusions for the convergence of the Rényi entropy of order α = 2.00 for different languages and different kinds of texts, both on the level of words and on the level of characters. In Appendix B, we replicate the results of Table 2 based on Rényi’s formulation of the entropy generalization. Table A5 shows that the results are almost identical, which is to be expected because the Havrda–Charvat–Lindhard–Nielsen–Aczél–Daróczy–Tsallis entropy is a monotone function of the Rényi entropy [20].

#### 3.2. Divergence D_{α}

To study the convergence of D_{α} for different α-values, we computed D_{α} between a “text” consisting of the first n = 2^{k} word tokens and a “text” consisting of the last n = 2^{k} word tokens of each version of our database for k = 6, 7, …, 26, and took averages. As for H_{α} above, we then calculated the Spearman correlation between the sample size and D_{α} for different minimum sample sizes. It is worth pointing out that the “texts” come from the same population, i.e., all Der Spiegel articles, so one would expect that, with growing sample sizes, D_{α} should fluctuate around 0 with no systematic relationship between D_{α} and the sample size. Table 3 summarizes the results, while Figure 4B visualizes the convergence pattern. For all settings, there is a strong monotonic relationship between the sample size and D_{α} that passes the permutation test in almost every case. For α = 0.25, the Spearman correlation coefficients are positive. This seems to be due to the fact that H_{α=0.25} is dominated by word types from the lower end of the frequency spectrum (cf. Table 1): word types that only occur once, for example, contribute almost half of H_{α=0.25}, and each of those word types appears either in the first 2^{k} or in the last 2^{k} word tokens, but not in both. As the sample size grows, ever more of these rare types enter the comparison, which drives up D_{α} (cf. the pink line in Figure 4B). For α = 0.75, a similar pattern is observed for smaller sample sizes (cf. the orange line in Figure 4B). However, at around k = 15, the pattern changes: for k ≥ 15, there is a perfect monotonic negative relationship between D_{α=0.75} and the sample size. Surprisingly, there is a perfect monotonic negative relationship for all settings for α ≥ 1.00, even if we restrict the calculation to relatively large sample sizes. However, the corresponding values are very small. For instance, D_{α=2.00} = 7.91 × 10^{−8} for n = 2^{24}, D_{α=2.00} = 4.08 × 10^{−8} for n = 2^{25}, and D_{α=2.00} = 1.379 × 10^{−8} for n = 2^{26}. One might object that this systematic sample size dependence is practically irrelevant. In the next section, we show that, unfortunately, this is not the case.

#### 3.3. Case Study

For the case study, D_{α} was calculated for successive months, i.e., D_{α}(t,t − 1). To rule out a potential systematic influence of the varying sample size, we also calculated D_{α}(t,t − 1) for our comparison corpus in which the diachronic signal was destroyed (“Litmus test”).

Visual inspection of Figure 5 suggests a relationship between D_{α}(t,t − 1) and the sample size. To test this observation, we calculated the Spearman correlation between the sample size and D_{α}(t,t − 1) for both α = 1.00 and α = 2.00 and ran a permutation test. Table 4, row 1, shows that there is a significant, strong, negative correlation between the sample size and D_{α} for both α = 1.00 and α = 2.00. Rows 2–5 present different approaches to solving the sample size dependence of D_{α}. In row 2, we extended Equation (2) to allow for unequal sample sizes, i.e., N_{p} ≠ N_{q}, as suggested by ([22], Appendix A); here:

In row 4, we randomly drew N_{min} word tokens from each of the monthly databases, where N_{min} is equal to the size of the smallest of all monthly corpora, here N_{min} = 75,819 (June 1947). To our own surprise, row 4 of Table 4 reveals that this “random draw” approach also does not break the sample size dependence. While the absolute values of the correlation coefficients for both α = 1.00 and α = 2.00 are smaller for the original data than for the comparison data, all four coefficients are significantly different from 0 (at p < 0.001) and thus indicate that the “random draw” approach fails to pass the “Litmus test”. As a last idea, we decided to truncate each monthly corpus after N_{min} word tokens. The difference between this “cut-off” approach and the “random draw” approach is that the latter assumes that words occur randomly in texts, while truncating the data after N_{min} respects the syntactic and semantic coherence and the discourse structure at the text level [16,17]. On the one hand, row 5 of Table 4 demonstrates that this approach mostly solves the problem: all four coefficients are small, and only one coefficient is significantly different from zero (and positive). This suggests that the “cut-off” approach passes the “Litmus test”. On the other hand, it is worth pointing out that we lose a lot of information with this approach. For example, the largest monthly corpus is N = 507,542 word tokens long (October 2000); with the “cut-off” approach, more than 85% of those word tokens are not used to calculate D_{α}(t,t − 1).
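The two equalization strategies can be contrasted in a few lines; the monthly token list and the N_MIN value below are placeholders, not the actual corpora:

```python
import random

def random_draw(tokens, n_min, seed=0):
    # Sample n_min tokens without replacement, ignoring their order in the
    # text (implicitly assumes that words occur randomly in texts).
    return random.Random(seed).sample(tokens, n_min)

def cut_off(tokens, n_min):
    # Keep the first n_min running tokens, preserving syntactic/semantic
    # coherence and discourse structure at the text level.
    return tokens[:n_min]

# Placeholder monthly corpus, much longer than the smallest month.
monthly = [f"w{i % 50}" for i in range(1000)]
N_MIN = 200  # stands in for N_min = 75,819 in the case study
equalized_random = random_draw(monthly, N_MIN)
equalized_cutoff = cut_off(monthly, N_MIN)
```

Both functions return corpora of identical length; only the “cut-off” variant keeps tokens in their original textual order.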

Thus, the systematic sample size dependence of D_{α} is far from practically irrelevant. On the contrary, the analyses presented in this section demonstrate once more why it is essential to account for the sample size dependence of lexical statistics.

## 4. Discussion

Except for H_{α=2.00} at larger sample sizes, all quantities that are based on generalized entropies seem to strongly covary with the sample size (see also [23] for similar results based on Rényi’s formulation of generalized entropies). In his monograph on word frequency distributions, Baayen [16] introduces the two fundamental methodological issues in lexical statistics:

The sample size crucially determines a great many measures that have been proposed as characteristic text constants. However, the values of these measures change systematically as a function of the sample size. Similarly, the parameters of many models for word frequency distribution [sic!] are highly dependent on the sample size. This property sets lexical statistics apart from most other areas in statistics, where an increase in the sample size leads to enhanced accuracy and not to systematic changes in basic measures and parameters. […] The second issue concerns the theoretical assumption […] that words occur randomly in texts. This assumption is an obvious simplification that, however, offers the possibility of deriving useful formulae for text characteristics. The crucial question, however, is to what extent this simplifying assumption affects the reliability of the formulae when applied to actual texts and corpora. (p. 1)

- (i) In [12], an exploratory data-driven method was presented that extracts word types from diachronic corpora that have undergone the most pronounced change in frequency of occurrence in a given period of time. To this end, a measure that is approximately equivalent to the Jensen–Shannon divergence is computed, and period-to-period changes are calculated as in Section 3.3.
- (ii) In [15], the parameters of the Zipf–Mandelbrot law were used to quantify and visualize diachronic lexical, syntactical, and stylistic changes, as well as aspects of linguistic change, for different languages.

## Supplementary Materials

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A. Inclusion of Punctuation and Cardinal Numbers.

For minimum sample sizes larger than n = 2^{24}, none of the correlation coefficients passes the permutation test. Again, this indicates that α = 2.00 is a pragmatic choice when calculating H_{α}. However, it also demonstrates that the conceptual decision to remove punctuation/cardinal numbers can affect the results. Table A3 corresponds to Table 3; the results are not qualitatively affected by the exclusion of punctuation/cardinal numbers. The same conclusion can be drawn for Table A4, which corresponds to Table 4.

Token Frequency | Number of Cases | Examples | α = 0.25 | α = 0.75 | α = 1.00 | α = 1.50 | α = 2.00
---|---|---|---|---|---|---|---
1 | 2,511,837 | paragraphenplantage penicillinhaltigen partei-patt | 48.51 | 8.94 | 2.16 | 0.00 | 0.00
2–10 | 1,148,295 | koberten optimis-datenbank gazprom-zentrale | 29.82 | 10.46 | 3.32 | 0.00 | 0.00
11–100 | 303,049 | dunkelgraue stirlings drollig | 13.26 | 13.57 | 6.54 | 0.02 | 0.00
101–1000 | 76,049 | abgemagert irakern aufzugehen | 5.86 | 18.56 | 13.50 | 0.15 | 0.00
1001–10,000 | 14,710 | nord- selbstbestimmung alexandra | 1.99 | 19.35 | 20.60 | 0.83 | 0.02
10,001–100,000 | 1966 | parteien banken entscheidungen | 0.46 | 13.24 | 19.57 | 2.86 | 0.22
100,001–1,000,000 | 183 | wurde würde dieses | 0.08 | 7.47 | 14.89 | 10.05 | 2.66
1,000,001+ | 33 | auf wie , | 0.03 | 8.40 | 19.42 | 86.09 | 97.09
Total | 4,056,122 | | 100.00 | 100.00 | 100.00 | 100.00 | 100.00

Minimum Sample Size | Number of Datapoints | α = 0.25 | α = 0.75 | α = 1.00 | α = 1.50 | α = 2.00
---|---|---|---|---|---|---
2^{6} | 23 | 1.00 * | 1.00 * | 1.00 * | 0.99 * | 0.49
2^{7} | 22 | 1.00 * | 1.00 * | 1.00 * | 0.99 * | 0.41
2^{8} | 21 | 1.00 * | 1.00 * | 1.00 * | 0.99 * | 0.32
2^{9} | 20 | 1.00 * | 1.00 * | 1.00 * | 0.99 * | 0.22
2^{10} | 19 | 1.00 * | 1.00 * | 1.00 * | 0.99 * | 0.09
2^{11} | 18 | 1.00 * | 1.00 * | 1.00 * | 0.99 * | −0.08
2^{12} | 17 | 1.00 * | 1.00 * | 1.00 * | 0.98 * | −0.28
2^{13} | 16 | 1.00 * | 1.00 * | 1.00 * | 0.98 * | −0.53
2^{14} | 15 | 1.00 * | 1.00 * | 1.00 * | 0.97 * | −0.50
2^{15} | 14 | 1.00 * | 1.00 * | 1.00 * | 0.97 * | −0.45
2^{16} | 13 | 1.00 * | 1.00 * | 1.00 * | 0.96 * | −0.81
2^{17} | 12 | 1.00 * | 1.00 * | 1.00 * | 0.95 * | −0.76
2^{18} | 11 | 1.00 * | 1.00 * | 1.00 * | 0.94 * | −0.71
2^{19} | 10 | 1.00 * | 1.00 * | 1.00 * | 0.95 * | −0.61
2^{20} | 9 | 1.00 * | 1.00 * | 1.00 * | 0.95 * | −0.47
2^{21} | 8 | 1.00 * | 1.00 * | 1.00 * | 0.93 | −0.31
2^{22} | 7 | 1.00 * | 1.00 * | 1.00 * | 0.89 | 0.04
2^{23} | 6 | 1.00 * | 1.00 * | 1.00 * | 0.83 | 0.66
2^{24} | 5 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 1.00 *

* Statistically significant at p < 0.001. For minimum sample sizes larger than n = 2^{20}, an exact permutation test is calculated.

Minimum Sample Size | Number of Datapoints | α = 0.25 | α = 0.75 | α = 1.00 | α = 1.50 | α = 2.00
---|---|---|---|---|---|---
2^{6} | 22 | 1.00 * | −0.51 | −1.00 * | −1.00 * | −1.00 *
2^{7} | 21 | 1.00 * | −0.59 | −1.00 * | −1.00 * | −1.00 *
2^{8} | 20 | 1.00 * | −0.68 * | −1.00 * | −1.00 * | −1.00 *
2^{9} | 19 | 1.00 * | −0.76 * | −1.00 * | −1.00 * | −1.00 *
2^{10} | 18 | 1.00 * | −0.84 * | −1.00 * | −1.00 * | −1.00 *
2^{11} | 17 | 1.00 * | −0.89 * | −1.00 * | −1.00 * | −1.00 *
2^{12} | 16 | 1.00 * | −0.94 * | −1.00 * | −1.00 * | −1.00 *
2^{13} | 15 | 1.00 * | −0.97 * | −1.00 * | −1.00 * | −1.00 *
2^{14} | 14 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{15} | 13 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{16} | 12 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{17} | 11 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{18} | 10 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{19} | 9 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{20} | 8 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{21} | 7 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{22} | 6 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{23} | 5 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{24} | 4 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *

* Statistically significant at p < 0.001. For minimum sample sizes larger than n = 2^{20}, an exact permutation test is calculated.

**Table A4.** Spearman correlation between the sample size and D_{α}(t,t − 1) for the original data and for the “Litmus test” for α = 1.00 and α = 2.00.

Row | Scenario | α | Number of Cases | Original Data | Litmus Test
---|---|---|---|---|---
1 | Original | 1.00 | 851 | −0.77 * | −0.91 *
 | | 2.00 | 851 | −0.63 * | −0.70 *
2 | Natural weights | 1.00 | 851 | −0.77 * | −0.91 *
 | | 2.00 | 851 | −0.63 * | −0.70 *
3 | Yearly data | 1.00 | 70 | −0.74 * | −0.98 *
 | | 2.00 | 70 | −0.39 | −0.83 *
4 | Random draw | 1.00 | 851 | −0.29 * | −0.69 *
 | | 2.00 | 851 | −0.45 * | −0.56 *
5 | Cut-off | 1.00 | 851 | 0.07 | 0.05
 | | 2.00 | 851 | 0.11 | −0.07

## Appendix B. Replication of Table 2 for a Different Formulation of Generalized Entropy.

**Table A5.** Spearman correlation between the sample size and ${H}_{\alpha}^{\prime}$ for different α-values *.

Minimum Sample Size | Number of Datapoints | α = 0.25 | α = 0.75 | α = 1.00 | α = 1.50 | α = 2.00
---|---|---|---|---|---|---
2^{6} | 22 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.92 *
2^{7} | 21 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.90 *
2^{8} | 20 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.89 *
2^{9} | 19 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.87 *
2^{10} | 18 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.85 *
2^{11} | 17 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.82 *
2^{12} | 16 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.78
2^{13} | 15 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.73
2^{14} | 14 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.70
2^{15} | 13 | 1.00 * | 1.00 * | 1.00 * | 0.99 * | 0.65
2^{16} | 12 | 1.00 * | 1.00 * | 1.00 * | 0.99 * | 0.55
2^{17} | 11 | 1.00 * | 1.00 * | 1.00 * | 0.99 * | 0.43
2^{18} | 10 | 1.00 * | 1.00 * | 1.00 * | 0.99 * | 0.24
2^{19} | 9 | 1.00 * | 1.00 * | 1.00 * | 0.98 * | −0.05
2^{20} | 8 | 1.00 * | 1.00 * | 1.00 * | 0.98 * | −0.17
2^{21} | 7 | 1.00 * | 1.00 * | 1.00 * | 0.96 * | 0.25
2^{22} | 6 | 1.00 * | 1.00 * | 1.00 * | 0.94 | −0.20
2^{23} | 5 | 1.00 * | 1.00 * | 1.00 * | 0.90 | 0.10
2^{24} | 4 | 1.00 * | 1.00 * | 1.00 * | 0.80 | −0.80

* Statistically significant at p < 0.001. For minimum sample sizes larger than n = 2^{20}, an exact permutation test is calculated.

## References

1. Manning, C.D.; Schütze, H. Foundations of Statistical Natural Language Processing; MIT Press: Cambridge, MA, USA, 1999; ISBN 978-0-262-13360-9.
2. Jurafsky, D.; Martin, J.H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition; Pearson Education (US): Upper Saddle River, NJ, USA, 2009; ISBN 978-0-13-504196-3.
3. Adami, C. What is information? Philos. Trans. R. Soc. A **2016**, 374, 20150230.
4. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 2006; ISBN 978-0-471-24195-9.
5. Bentz, C.; Alikaniotis, D.; Cysouw, M.; Ferrer-i-Cancho, R. The Entropy of Words—Learnability and Expressivity across More than 1000 Languages. Entropy **2017**, 19, 275.
6. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory **1991**, 37, 145–151.
7. Endres, D.M.; Schindelin, J.E. A new metric for probability distributions. IEEE Trans. Inf. Theory **2003**, 49, 1858–1860.
8. Hughes, J.M.; Foti, N.J.; Krakauer, D.C.; Rockmore, D.N. Quantitative patterns of stylistic influence in the evolution of literature. Proc. Natl. Acad. Sci. USA **2012**, 109, 7682–7686.
9. Klingenstein, S.; Hitchcock, T.; DeDeo, S. The civilizing process in London’s Old Bailey. Proc. Natl. Acad. Sci. USA **2014**, 111, 9419–9424.
10. DeDeo, S.; Hawkins, R.; Klingenstein, S.; Hitchcock, T. Bootstrap Methods for the Empirical Study of Decision-Making and Information Flows in Social Systems. Entropy **2013**, 15, 2246–2276.
11. Bochkarev, V.; Solovyev, V.; Wichmann, S. Universals versus historical contingencies in lexical evolution. J. R. Soc. Interface **2014**, 11, 20140841.
12. Koplenig, A. A Data-Driven Method to Identify (Correlated) Changes in Chronological Corpora. J. Quant. Linguist. **2017**, 24, 289–318.
13. Pechenick, E.A.; Danforth, C.M.; Dodds, P.S. Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution. PLoS ONE **2015**.
14. Zipf, G.K. The Psycho-Biology of Language. An Introduction to Dynamic Philology; Houghton Mifflin Company: Boston, MA, USA, 1935.
15. Koplenig, A. Using the parameters of the Zipf–Mandelbrot law to measure diachronic lexical, syntactical and stylistic changes–a large-scale corpus analysis. Corpus Linguist. Linguist. Theory **2018**, 14, 1–34.
16. Baayen, R.H. Word Frequency Distributions; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2001.
17. Tweedie, F.J.; Baayen, R.H. How Variable May a Constant be? Measures of Lexical Richness in Perspective. Comput. Hum. **1998**, 32, 323–352.
18. Simpson, E.H. The Interpretation of Interaction in Contingency Tables. J. R. Stat. Soc. Ser. B **1951**, 13, 238–241.
19. Gerlach, M.; Altmann, E.G. Stochastic Model for the Vocabulary Growth in Natural Languages. Phys. Rev. X **2013**, 3, 021006.
20. Briët, J.; Harremoës, P. Properties of classical and quantum Jensen-Shannon divergence. Phys. Rev. A **2009**, 79, 052311.
21. Altmann, E.G.; Dias, L.; Gerlach, M. Generalized entropies and the similarity of texts. J. Stat. Mech. Theory Exp. **2017**, 2017, 014002.
22. Gerlach, M.; Font-Clos, F.; Altmann, E.G. Similarity of Symbol Frequency Distributions with Heavy Tails. Phys. Rev. X **2016**, 6, 021009.
23. Tanaka-Ishii, K.; Aihara, S. Computational Constancy Measures of Texts—Yule’s K and Rényi’s Entropy. Comput. Linguist. **2015**, 41, 481–502.
24. Rényi, A. On Measures of Entropy and Information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, Berkeley, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561.
25. He, Y.; Hamza, A.B.; Krim, H. A generalized divergence measure for robust image registration. IEEE Trans. Signal Process. **2003**, 51, 1211–1220.
26. Schmid, H. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK, 1994; pp. 44–49.
27. Köhler, R.; Galle, M. Dynamic aspects of text characteristics. In Quantitative Text Analysis; Hřebíček, L., Altmann, G., Eds.; WVT Wissenschaftlicher Verlag Trier: Trier, Germany, 1993; pp. 46–53; ISBN 978-3-88476-080-2.
28. Popescu, I.-I.; Altmann, G. Word Frequency Studies; Mouton de Gruyter: Berlin, Germany, 2009; ISBN 978-3-11-021852-7.
29. Wimmer, G.; Altmann, G. Review Article: On Vocabulary Richness. J. Quant. Linguist. **1999**, 6, 1–9.
30. Michel, J.-B.; Shen, Y.K.; Aiden, A.P.; Verses, A.; Gray, M.K.; Google Books Team; Pickett, J.P.; Hoiberg, D.; Clancy, D.; Norvig, P.; et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Science **2010**, 331, 176–182.
31. Lin, Y.; Michel, J.-B.; Aiden, L.E.; Orwant, J.; Brockmann, W.; Petrov, S. Syntactic Annotations for the Google Books Ngram Corpus. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Korea, 8–14 July 2012; pp. 169–174.
32. Kupietz, M.; Lüngen, H.; Kamocki, P.; Witt, A. The German Reference Corpus DeReKo: New Developments–New Opportunities. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018; European Language Resources Association (ELRA): Miyazaki, Japan, 2018.

**Figure 1.** A Simpson’s Paradox for word frequency distributions. Here, the word entropy H and the exponent of the Zipf distribution γ are estimated after every n = 2^{k} consecutive tokens, where k = 6, 7, …, $lo{g}_{2}\left(N\right)$, for 10 different random re-arrangements of the database; each dot corresponds to one observed value. The blue line represents a locally weighted regression of H on γ (with a bandwidth of 0.8). It indicates a strong positive relationship between H and γ (Spearman ρ = 0.99). However, when the sample size is held constant, this relationship completely changes, as indicated by the orange lines that correspond to separate locally weighted regressions of H on γ for each k. Here, the results indicate a strong negative relationship between H and γ (ρ ranges between −0.98 and −0.64, with a median of −0.92). The reason for this apparent contradiction is that both H and γ monotonically increase with the sample size.

**Figure 2.** Visualization of the word frequency distribution of our database. Cumulative distribution (in %) as a function of (**a**) the rank and (**b**) the word frequency.

**Figure 3.** Sample size of the database as a function of time. The gray line depicts the raw data, while the orange line shows a symmetric 25-month window moving-average smoother that highlights the central tendency of the series at each point in time.
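The smoother used in Figures 3, 5, and 6 can be sketched in a few lines. This is a generic centered moving average, not the authors' code; in particular, the edge handling here (truncating the window to the available data) is one simple convention among several:

```python
def centered_moving_average(series, window=25):
    """Symmetric moving average over `window` points, centered on each
    index; near the edges the window is truncated to the data available."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

# For a linear trend, interior points are reproduced exactly, because the
# symmetric window averages back to the center value.
trend = list(range(100))
smoothed = centered_moving_average(trend, window=25)
```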

**Figure 4.** Generalized entropies H_{α} and divergences D_{α} as a function of the sample size. (**A**) H_{α}; (**B**) D_{α}.

**Figure 5.** D_{α}(t,t − 1) as a function of time for α = 1.00 and α = 2.00. Lines represent a symmetric 25-month window moving-average smoother highlighting the central tendency of the series at each point in time. Left: results for the original data in blue. Middle: results for the “Litmus” data in orange. Right: superimposition of both the original and the “Litmus” data.
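D_{α}(t,t − 1) compares the word frequency distributions of consecutive months. The following sketch shows one common construction, a Jensen–Shannon-type divergence built from the Havrda–Charvát/Tsallis entropy of order α; the exact normalization used in the paper may differ, so treat this as an illustration of the idea rather than the authors' definition:

```python
from math import log

def h_alpha(p, alpha):
    """Havrda-Charvat/Tsallis entropy of order alpha (natural-log units);
    alpha = 1 recovers the Shannon entropy as the limiting case."""
    if alpha == 1.0:
        return -sum(pi * log(pi) for pi in p if pi > 0)
    return (1.0 - sum(pi ** alpha for pi in p)) / (alpha - 1.0)

def d_alpha(p, q, alpha):
    """Divergence as entropy of the 50/50 mixture minus the mean entropy
    of the parts; non-negative for this entropy family (alpha > 0)."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return h_alpha(m, alpha) - (h_alpha(p, alpha) + h_alpha(q, alpha)) / 2.0

# Two toy "monthly" distributions over a shared vocabulary of four types.
p = [0.5, 0.3, 0.15, 0.05]
q = [0.4, 0.3, 0.2, 0.1]
```

For α = 2 this divergence reduces to a scaled squared Euclidean distance between the two distributions, which makes its non-negativity easy to verify by hand.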

**Figure 6.** D_{α}(t,t − 1) as a function of time for α = 1.00 and α = 2.00. Here, each monthly corpus is truncated after N_{min} = 75,819 word tokens. Lines represent a symmetric 25-month window moving-average smoother highlighting the central tendency of the series at each point in time. Left: results for the original data in blue. Middle: results for the “Litmus” data in orange. Right: superimposition of both the original and the “Litmus” data.

Token Frequency | Number of Cases | Examples | α = 0.25 | α = 0.75 | α = 1.00 | α = 1.50 | α = 2.00
---|---|---|---|---|---|---|---
1 | 2,486,393 | koalitionsbündnisse nr.6/1962 bruckner-breitklang | 48.65 | 9.32 | 2.38 | 0.00 | 0.00
2–10 | 1,135,102 | geschlechterschulung unal wiedervereinigungs-prozedur | 29.86 | 10.89 | 3.65 | 0.01 | 0.00
11–100 | 296,573 | hotpants lánský planwirtschaftlichen | 13.16 | 14.03 | 7.13 | 0.04 | 0.00
101–1000 | 74,791 | wanda verbannte mitschnitt | 5.83 | 19.21 | 14.69 | 0.28 | 0.00
1001–10,000 | 14,388 | schüren ablesen vollmachten | 1.96 | 19.81 | 22.07 | 1.53 | 0.06
10,001–100,000 | 1871 | london sitzen beginnen | 0.44 | 13.38 | 20.68 | 5.31 | 0.64
100,001–1,000,000 | 173 | mark frau kaum | 0.07 | 7.38 | 15.21 | 17.83 | 7.12
1,000,001+ | 27 | es die er | 0.02 | 5.98 | 14.19 | 75.02 | 92.18
Total | 4,009,318 | | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
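The α-dependence visible in the table above, with low α emphasizing rare words and high α emphasizing the most frequent words, can be illustrated by attributing to each word type its share of the sum Σ p_i^α. This attribution is our own simplified illustration; the paper's exact decomposition may differ:

```python
def share_of_top_type(counts, alpha):
    """Fraction of the total weight sum(p_i ** alpha) carried by the single
    most frequent type; grows with alpha, since p ** alpha favors large p."""
    total = sum(counts)
    weights = [(c / total) ** alpha for c in counts]
    return max(weights) / sum(weights)

# Zipf-like toy counts: one very frequent type plus many hapax legomena.
counts = [1000, 100, 10] + [1] * 100

shares = {a: share_of_top_type(counts, a) for a in (0.25, 1.0, 2.0)}
```

On these toy counts the top type carries only a few percent of the weight at α = 0.25 but almost all of it at α = 2, mirroring the drift of mass toward the high-frequency rows of the table as α increases.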

Minimum Sample Size | Number of Datapoints | α = 0.25 | α = 0.75 | α = 1.00 | α = 1.50 | α = 2.00
---|---|---|---|---|---|---
2^{6} | 22 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.92 *
2^{7} | 21 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.91 *
2^{8} | 20 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.89 *
2^{9} | 19 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.87 *
2^{10} | 18 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.85 *
2^{11} | 17 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.82 *
2^{12} | 16 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.79 *
2^{13} | 15 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.74
2^{14} | 14 | 1.00 * | 1.00 * | 1.00 * | 1.00 * | 0.71
2^{15} | 13 | 1.00 * | 1.00 * | 1.00 * | 0.99 * | 0.65
2^{16} | 12 | 1.00 * | 1.00 * | 1.00 * | 0.99 * | 0.55
2^{17} | 11 | 1.00 * | 1.00 * | 1.00 * | 0.99 * | 0.43
2^{18} | 10 | 1.00 * | 1.00 * | 1.00 * | 0.99 * | 0.24
2^{19} | 9 | 1.00 * | 1.00 * | 1.00 * | 0.98 * | −0.05
2^{20} | 8 | 1.00 * | 1.00 * | 1.00 * | 0.98 * | −0.17
2^{21} | 7 | 1.00 * | 1.00 * | 1.00 * | 0.96 * | 0.25
2^{22} | 6 | 1.00 * | 1.00 * | 1.00 * | 0.94 | −0.20
2^{23} | 5 | 1.00 * | 1.00 * | 1.00 * | 0.90 | 0.10
2^{24} | 4 | 1.00 * | 1.00 * | 1.00 * | 0.80 | −0.80

2^{20}, an exact permutation test is calculated.
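The exact permutation test mentioned in the table footnote can be sketched as follows: enumerate every permutation of one variable and count how often the permuted |ρ| is at least as extreme as the observed one. This is a generic two-sided construction, feasible only for small numbers of datapoints; the authors' exact test may differ in detail:

```python
from itertools import permutations

def spearman(x, y):
    # Spearman rho without ties: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    n = len(x)
    rank = lambda v: [sorted(v).index(e) + 1 for e in v]  # assumes no ties
    rx, ry = rank(x), rank(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def exact_permutation_pvalue(x, y):
    """Two-sided exact p-value: share of permutations of y whose |rho| is
    at least as extreme as the observed |rho|."""
    observed = abs(spearman(x, y))
    count, total = 0, 0
    for perm in permutations(y):
        total += 1
        if abs(spearman(x, list(perm))) >= observed - 1e-12:
            count += 1
    return count / total

# Four perfectly monotone datapoints: only the identity ordering and the
# full reversal of y reach |rho| = 1, so the p-value is 2/24.
pval = exact_permutation_pvalue([1, 2, 3, 4], [10, 20, 30, 40])
```

With only 4 datapoints there are 4! = 24 orderings, so even a perfect correlation cannot reach a p-value below 2/24 ≈ 0.083, which is why significance stars disappear from the bottom rows of the table.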

Minimum Sample Size | Number of Datapoints | α = 0.25 | α = 0.75 | α = 1.00 | α = 1.50 | α = 2.00
---|---|---|---|---|---|---
2^{6} | 21 | 1.00 * | −0.42 | −1.00 * | −1.00 * | −1.00 *
2^{7} | 20 | 1.00 * | −0.54 | −1.00 * | −1.00 * | −1.00 *
2^{8} | 19 | 1.00 * | −0.64 | −1.00 * | −1.00 * | −1.00 *
2^{9} | 18 | 1.00 * | −0.74 | −1.00 * | −1.00 * | −1.00 *
2^{10} | 17 | 1.00 * | −0.83 * | −1.00 * | −1.00 * | −1.00 *
2^{11} | 16 | 1.00 * | −0.90 * | −1.00 * | −1.00 * | −1.00 *
2^{12} | 15 | 1.00 * | −0.95 * | −1.00 * | −1.00 * | −1.00 *
2^{13} | 14 | 1.00 * | −0.99 * | −1.00 * | −1.00 * | −1.00 *
2^{14} | 13 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{15} | 12 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{16} | 11 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{17} | 10 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{18} | 9 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{19} | 8 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{20} | 7 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{21} | 6 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{22} | 5 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{23} | 4 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *
2^{24} | 3 | 1.00 * | −1.00 * | −1.00 * | −1.00 * | −1.00 *

2^{19}, an exact permutation test is calculated.

**Table 4.** Spearman correlation between the sample size and D_{α}(t,t − 1) for the original data and for the “Litmus test” for α = 1.00 and α = 2.00.

Row | Scenario | α | Number of Cases | Original Data | Litmus Test
---|---|---|---|---|---
1 | Original | 1.00 | 851 | −0.76 * | −0.91 *
| | 2.00 | 851 | −0.70 * | −0.79 *
2 | Natural weights | 1.00 | 851 | −0.77 * | −0.90 *
| | 2.00 | 851 | −0.70 * | −0.79 *
3 | Yearly data | 1.00 | 70 | −0.74 * | −0.97 *
| | 2.00 | 70 | −0.46 * | −0.87 *
4 | Random draw | 1.00 | 851 | −0.16 * | −0.69 *
| | 2.00 | 851 | −0.50 * | −0.61 *
5 | Cut-off | 1.00 | 851 | 0.12 * | 0.08
| | 2.00 | 851 | 0.08 | −0.10

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Koplenig, A.; Wolfer, S.; Müller-Spitzer, C. Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size. *Entropy* **2019**, *21*, 464.
https://doi.org/10.3390/e21050464
