The Brevity Law as a Scaling Law, and a Possible Origin of Zipf’s Law for Word Frequencies

An important body of quantitative linguistics is constituted by a series of statistical laws about language usage. Despite the importance of these linguistic laws, some of them are poorly formulated, and, more importantly, there is no unified framework that encompasses all of them. This paper presents a new perspective to establish a connection between different statistical linguistic laws. Characterizing each word type by two random variables—length (in number of characters) and absolute frequency—we show that the corresponding bivariate joint probability distribution shows a rich and precise phenomenology, with the type-length and the type-frequency distributions as its two marginals, and the conditional distribution of frequency at fixed length providing a clear formulation for the brevity-frequency phenomenon. The type-length distribution turns out to be well fitted by a gamma distribution (much better than by the previously proposed lognormal), and the conditional frequency distributions at fixed length display power-law-decay behavior with a fixed exponent α≃1.4 and a characteristic-frequency crossover that scales as an inverse power δ≃2.8 of length, which implies the fulfillment of a scaling law analogous to those found in the thermodynamics of critical phenomena. As a by-product, we find a possible model-free explanation for the origin of Zipf's law, which should arise as a mixture of conditional frequency distributions governed by the crossover length-dependent frequency.


Introduction
The usage of language, both in its written and oral expressions (texts and speech), follows very strong statistical regularities. One of the goals of quantitative linguistics is to unveil, analyze, explain, and exploit those linguistic statistical laws. Perhaps the clearest example of a statistical law in language usage is Zipf's law, which quantifies the frequency of occurrence of words in such written and oral forms [1][2][3][4][5][6], establishing that there is no non-arbitrary way to distinguish between rare and common words (due to the absence of a characteristic scale in "rarity"). Surprisingly, Zipf's law is not only a linguistic law, but seems to be a rather common phenomenon in complex systems where discrete units self-organize into groups, or types (persons into cities, money into persons, etc. [7]).
Zipf's law can be considered as the "tip of the iceberg" of text statistics. Another well-known pattern of this sort is Herdan's law, also called Heaps' law [2, 8,9], which states that the growth of vocabulary with text length is sublinear (however, the precise mathematical dependence has been

Preliminary Considerations
Given a sample of natural language (a text, a fragment of speech, or a corpus, in general), any word type (i.e., each unique word) has an associated word length, which we measure in number of characters (as we deal with a written corpus), and an associated word absolute frequency, which is the number of occurrences of the word type in the corpus under consideration (i.e., the number of tokens of the type). We denote these two random variables as ℓ and n, respectively.
Zipf's law of word frequency is written as a power-law relation between f(n) and n [6], i.e.,

f(n) ∝ 1/n^β, for n ≥ c,

where f(n) is the empirical probability mass function of the word frequency n, the symbol ∝ denotes proportionality, β is the power-law exponent, and c is a lower cut-off below which the law loses its validity (so, Zipf's law is a high-frequency phenomenon). The exponent β typically takes values close to 2. When very large corpora are analyzed (made from many different texts and authors), another (additional) power-law regime appears at smaller frequencies [16,17],

f(n) ∝ 1/n^α, for a ≤ n ≤ b,

with α a new power-law exponent smaller than β, and a and b lower and upper cut-offs, respectively (with a < b < c). This second power law is not identified with Zipf's law.
On the other hand, the law of word lengths [12] proposes a lognormal distribution for the empirical probability mass function f(ℓ) of word lengths, that is,

f(ℓ) ≃ LN(µ, σ²),

where LN denotes a lognormal distribution whose associated normal distribution has mean µ and variance σ² (note that with the lognormal assumption it would seem that one is taking a continuous approximation for f(ℓ); nevertheless, discreteness of f(ℓ) can be kept just by redefining the normalization constant). The present paper challenges the lognormal law for f(ℓ). Finally, the brevity law [14] can be summarized as corr(ℓ, n) < 0, where corr(ℓ, n) is a correlation measure between ℓ and n, as, for instance, the Pearson, Spearman, or Kendall correlation. We claim that a more complete approach to the relationship between word length and word frequency can be obtained from the joint probability distribution f(ℓ, n) of both variables, together with the associated conditional distributions f(n|ℓ). To be more precise, f(ℓ, n) is the joint probability mass function of type length and frequency, and f(n|ℓ) is the probability mass function of type frequency conditioned to fixed length. Naturally, the word-frequency distribution f(n) and the word-length distribution f(ℓ) are just the two marginal distributions of f(ℓ, n).
The relationships between these quantities are

f(ℓ, n) = f(n|ℓ) f(ℓ), f(n) = Σ_ℓ f(ℓ, n), f(ℓ) = Σ_n f(ℓ, n).

Note that we will not use in this paper the equivalent relation f(ℓ, n) = f(ℓ|n) f(n), for sampling reasons (n takes many more different values than ℓ; so, for fixed values of n one may find there is not enough statistics to obtain f(ℓ|n)). Obviously, all probability mass functions fulfil normalization: Σ_ℓ Σ_n f(ℓ, n) = Σ_n f(n) = Σ_ℓ f(ℓ) = 1. We stress that, in our framework, each type yields one instance of the bivariate random variable (ℓ, n), in contrast to another, equivalent approach for which it is each token that gives one instance of the (perhaps-different) random variables, see [7]. The choice of approach has important consequences for the formulation of Zipf's law, as is well known [7], and for the formulation of the word-length law (as is not so well known [12]). Moreover, our bivariate framework is certainly different from that in [18], where the frequency was understood as a four-variate distribution with the random variables taking 26 values from a to z, and also from the generalization in [19].
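These relations can be illustrated with a minimal, self-contained sketch (the toy word counts below are invented for illustration): each word type contributes one observation of the pair (length, frequency), and the joint, marginal, and conditional probability mass functions follow by counting over types.

```python
from collections import Counter

# Toy "dictionary" (hypothetical counts): each word type gives one
# instance of the bivariate variable (length ell, frequency n).
types = {"the": 120, "of": 90, "cat": 7, "dog": 6, "house": 3, "mouse": 2}

pairs = [(len(w), n) for w, n in types.items()]
N = len(pairs)

# Joint pmf f(ell, n): fraction of types with that (length, frequency) pair.
joint = {p: c / N for p, c in Counter(pairs).items()}

# Marginals: f(ell) = sum over n of f(ell, n); f(n) = sum over ell.
f_len, f_freq = Counter(), Counter()
for (ell, n), p in joint.items():
    f_len[ell] += p
    f_freq[n] += p

# Conditional pmf f(n | ell) = f(ell, n) / f(ell).
def cond_freq(ell):
    return {n: p / f_len[ell] for (l, n), p in joint.items() if l == ell}
```

For instance, `cond_freq(5)` returns the frequency distribution restricted to the five-letter types, normalized to one.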

Corpus and Statistical Methods
We investigate the joint probability distribution of word-type length and frequency empirically, using all English books in the recently presented Standardized Project Gutenberg Corpus [20], which comprises more than 40,000 books in English, with a total number of tokens equal to 2,016,391,406 and a total number of types of 2,268,043. We disregard types with n < 10 (relative frequency below 5 × 10^−9) and also those not composed exclusively of the 26 usual letters from a to z (capital letters were previously transformed to lower-case). This sub-corpus is further reduced by the elimination of types with length above 20 characters, in order to avoid typos and "spurious" words (among the eliminated types with n ≥ 10 we find only three true English words: incomprehensibilities, crystalloluminescence, and nitrosodimethylaniline). This reduces the numbers of tokens and types, respectively, to 2,010,440,020 and 391,529. Thus, all we need for our study is the list of all types (a dictionary) including their absolute frequencies n and their lengths ℓ (measured in number of characters).
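As a rough illustration (function and variable names are ours, not part of the Standardized Project Gutenberg Corpus tooling), the filtering just described can be sketched as: lower-case and merge counts first, then keep only types made of the letters a to z, with n ≥ 10 and length at most 20.

```python
import re

ONLY_LETTERS = re.compile(r"^[a-z]+$")

def filter_types(raw_counts, min_freq=10, max_len=20):
    """Sketch of the sub-corpus construction: lower-case and merge counts,
    then keep types made only of a-z, with n >= min_freq and len <= max_len."""
    merged = {}
    for word, n in raw_counts.items():
        w = word.lower()
        merged[w] = merged.get(w, 0) + n
    return {w: n for w, n in merged.items()
            if n >= min_freq and len(w) <= max_len and ONLY_LETTERS.match(w)}

sample = {"The": 15, "Cat": 4, "cat": 7, "cat's": 12,
          "supercalifragilisticexpialidocious": 30, "dog": 11}
# "Cat" + "cat" merge to 11 and survive; "cat's" fails the a-z test;
# the 34-letter word exceeds the length cut; "The" is lower-cased and kept.
```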
Power-law distributions are fitted to the empirical data by using the version for discrete random variables of the method for continuous distributions outlined in [21] and developed in Refs. [22,23], which is based on maximum-likelihood estimation and the Kolmogorov-Smirnov goodness-of-fit test. Acceptable (i.e., non-rejectable) fits require p-values not below 0.20, which are computed with 1000 Monte Carlo simulations. Complete details in the discrete case are available in Refs. [6,24]. This method is similar in spirit to the one by Clauset et al. [25], but avoiding some of the important problems that the latter presents [26,27]. Histograms are drawn to provide visual intuition for the shape of the empirical probability mass functions and the adequacy of fits; in the case of f (n| ) and f (n), we use logarithmic binning [22,28]. Nevertheless, the computation of the fits does not make use of the graphical representation of the distributions.
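The full fitting procedure (cut-off sweeps, KS statistic, Monte Carlo p-values) is beyond a short sketch, but its core step, maximum-likelihood estimation of a discrete power-law exponent above a lower cut-off c, can be illustrated with the standard approximate estimator of Clauset et al. (accurate for c of roughly 6 or larger). This is a generic recipe, not the authors' exact code:

```python
import math
import random

def fit_beta(freqs, c):
    """Approximate discrete MLE for f(n) ~ 1/n^beta with n >= c:
    beta ~= 1 + N / sum_i ln(n_i / (c - 0.5))."""
    tail = [n for n in freqs if n >= c]
    return 1.0 + len(tail) / sum(math.log(n / (c - 0.5)) for n in tail)

# Self-check on synthetic data: draw from a (discretized) power law with
# beta = 2 by inverse-transform sampling, then recover the exponent.
rng = random.Random(0)
c = 6
sample = [int((c - 0.5) / rng.random() + 0.5) for _ in range(20000)]
beta_hat = fit_beta(sample, c)  # close to 2
```

In the full method, this estimation is repeated for each candidate cut-off, and the fit is accepted or rejected by the simulated KS test described above.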
On the other hand, scaling analysis, following the authors of [21,29], allows us to compare the shapes of the conditional distributions f(n|ℓ) for different values of ℓ. This has proven to be a very powerful tool in quantitative linguistics, allowing previous research to show that the shape of the word-frequency distribution does not change as a text increases its length [30,31].

Results
First, let us examine the raw data, looking at the scatter plot between frequency and length in Figure 1, where each point is a word type represented by an associated value of n and an associated value of ℓ (note that several or many types can overlap at the same point, if they share their values of ℓ and n, as these are discrete variables). From the tendency of decreasing maximum n with increasing ℓ, clearly visible in the plot, one could arrive at an erroneous version of the brevity law. Naturally, brevity would be apparent if the scatter plot were homogeneously populated (i.e., if f(ℓ, n) were uniform in the domain occupied by the points). However, of course, this is not the case, as we will quantify later. On the contrary, if f(ℓ, m) were the product of two independent exponentials, with m = ln n, the scatter plot would be rather similar to the real one (Figure 1), but the brevity law would not hold (because of the independence of ℓ and m, that is, of ℓ and n). We will see that exponential distributions play an important role here, but not in this way. A more acceptable approach to the brevity-frequency phenomenon is to calculate the correlation between ℓ and n. For the Pearson correlation, our dataset yields corr(ℓ, n) = −0.023, which, despite looking very small, is significantly different from zero, with a p-value below 0.01 for 100 reshufflings of the frequency (all the values obtained after reshuffling the frequencies keeping the lengths fixed are between −0.004 and 0.006). If, instead, we calculate the Pearson correlation between ℓ and the logarithm m of the frequency, we get corr(ℓ, m) = −0.083, again with a p-value below 0.01. Nevertheless, as neither of the underlying joint distributions f(ℓ, n) and f(ℓ, m) resembles a Gaussian at all, and the correlation does not seem to be linear (see Figure 1), the meaning of the Pearson correlation is difficult to interpret. We will see below that the analysis of the conditional distributions f(n|ℓ) provides more useful information.
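The reshuffling test described above is a standard permutation test; a minimal sketch (assuming only lists of lengths and log-frequencies) compares the observed Pearson correlation with the correlations obtained after shuffling the frequencies while keeping the lengths fixed.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def permutation_pvalue(lengths, logfreqs, n_shuffles=100, seed=0):
    """Observed correlation and the fraction of shuffles at least as extreme."""
    rng = random.Random(seed)
    observed = pearson(lengths, logfreqs)
    shuffled = list(logfreqs)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(pearson(lengths, shuffled)) >= abs(observed):
            extreme += 1
    return observed, extreme / n_shuffles

# Demo on synthetic data with an obvious negative dependence.
demo = random.Random(1)
lengths = [1 + i % 20 for i in range(200)]
logfreqs = [-0.5 * l + demo.gauss(0, 0.2) for l in lengths]
r, p = permutation_pvalue(lengths, logfreqs)  # r strongly negative, p ~ 0
```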

Marginal Distributions
Let us now study the word-length distribution, f(ℓ), shown in Figure 2. The distribution is clearly unimodal (with its maximum at ℓ = 7), and although it has been previously modeled as a lognormal [12], we get a nearly perfect fit using a gamma distribution,

f(ℓ) = λ^γ ℓ^(γ−1) e^(−λℓ) / Γ(γ),

with shape parameter γ = 11.10 ± 0.02 and inverted scale parameter λ = 1.439 ± 0.003 (where the uncertainty corresponds to one standard deviation, and Γ(γ) denotes the gamma function). Notice then that, for large lengths, we would get an exponential decay (asymptotically, strictly speaking). However, there is an important difference between the lognormal distribution proposed in [13] and the gamma distribution found here: the former refers to the length of tokens, whereas in our case we deal with the length of types (of course, the length of a token and the length of the corresponding type are the same length, but the relative numbers of tokens and types differ, depending on length). This distinction was already made by Herdan [12], who used the terms occurrence distribution and dictionary distribution, and proposed that both of them were lognormal. In the caption of Figure 2 we provide the log-likelihoods of both the gamma and lognormal fits, concluding that the gamma distribution yields a better fit for the "dictionary distribution" of word lengths. The fit is especially good in the range ℓ > 2.

Figure 2. Probability mass function f(ℓ) of type length, together with gamma and lognormal fits. Note that the majority of types are those with lengths between 4 and 13, and that f(ℓ) is roughly constant between 5 and 10. The superiority of the gamma fit is visually apparent, and this is confirmed by a log-likelihood equal to −872,175.2 in front of the value −876,535.1 for the lognormal (a discrete gamma distribution slightly improves the fit, but the simple continuous case is enough for our purposes). The parameters resulting from the gamma fit are given in the text, and those for the lognormal are µ = 1.9970 ± 0.0005 and σ = 0.3081 ± 0.0003.
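The log-likelihood comparison reported in the caption can be reproduced schematically: the two functions below evaluate the (continuous-approximation) gamma and lognormal log-likelihoods for a list of type lengths, given fixed parameters. This is a generic sketch, not the fitting code used in the paper.

```python
import math

def loglik_gamma(lengths, shape, lam):
    """Gamma log-likelihood:
    log f(l) = shape*log(lam) - lgamma(shape) + (shape-1)*log(l) - lam*l."""
    k = shape * math.log(lam) - math.lgamma(shape)
    return sum(k + (shape - 1) * math.log(l) - lam * l for l in lengths)

def loglik_lognormal(lengths, mu, sigma):
    """Lognormal log-likelihood:
    log f(l) = -log(l*sigma*sqrt(2*pi)) - (log(l)-mu)^2 / (2*sigma^2)."""
    k = -math.log(sigma * math.sqrt(2 * math.pi))
    return sum(k - math.log(l) - (math.log(l) - mu) ** 2 / (2 * sigma ** 2)
               for l in lengths)
```

Evaluating both on the list of corpus type lengths, with the fitted parameters quoted above, and comparing the two totals reproduces the model comparison reported in the caption.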
Regarding the other marginal distribution, the word-frequency distribution f(n) represented in Figure 3, we find that, as expected, Zipf's law is fulfilled, with β = 1.94 ± 0.03 for n ≥ 1.9 × 10^5 (almost three orders of magnitude), see Table 1. Another power-law regime in the bulk, as in [16], is found to hold for only one and a half orders of magnitude, from a ≃ 400 to b ≃ 14,000, with exponent α = 1.41 ± 0.005, see Table 2. Note that although the truncated power law for the bulk part of the distribution is much shorter than the one for the tail (1.5 orders of magnitude versus almost 3), the former contains many more data (about 50,000 types versus about 1000), see Tables 1 and 2 for the precise figures. Note also that the two power-law regimes for the frequency translate into two exponential regimes for m (the logarithm of n).

Table 1. Results of the fitting of a discrete untruncated power law to the conditional distributions f(n|ℓ), denoted by a fixed ℓ, and to the marginal distribution f(n), denoted by the range 1 ≤ ℓ ≤ 20. N is the total number of types, n_max is the frequency of the most frequent type, c is the lower cut-off of the fit, ordermag is log₁₀(n_max/c), N_c is the number of types with n ≥ c, β is the resulting fitting exponent, σ_β is its standard deviation, and p is the p-value of the fit. For the conditional distributions, the possible fits are restricted to the range n > ⟨n²|ℓ⟩/⟨n|ℓ⟩. The fit proceeds by sweeping 50 values of c per order of magnitude and using 1000 Monte Carlo simulations for the calculation of p. Of all the fits with p ≥ 0.20 for a given ℓ, the one with the smallest c is selected. Outside the range 5 ≤ ℓ ≤ 14, the number of types in the tail (below 10) is too low to yield a meaningful fit.

Table 2. Results of the fitting of a truncated power law to the conditional distributions f(n|ℓ), denoted by a fixed ℓ, and to the marginal distribution f(n), denoted by the range 1 ≤ ℓ ≤ 20. N is the total number of types; a and b are the lower and upper cut-offs of the fit, respectively; N_ab is the number of types with a ≤ n ≤ b; α is the resulting fitting exponent; σ_α is its standard deviation; and p is the p-value of the fit. The fit of a continuous power law is attempted in the range n < 0.1 ⟨n²|ℓ⟩/⟨n|ℓ⟩, sweeping 20 values of a and b per order of magnitude and using 1000 Monte Carlo simulations for the calculation of p. Of all the fits with p ≥ 0.20 for a given ℓ, the one with the largest b/a is selected, except for f(n), where the one with the largest N_ab is used.

Power Laws and Scaling Law for the Conditional Distributions
As mentioned, the conditional word-frequency distributions f(n|ℓ) are of substantial relevance. In Figure 4 we display some of those functions, and it turns out that n is broadly distributed for each value of ℓ (roughly in the same qualitative way as happens without conditioning on the value of ℓ). Remarkably, the results of a scaling analysis [21,29], depicted in Figure 5, show that all the different f(n|ℓ) (for 3 ≤ ℓ ≤ 14) share a common shape, with a scale determined by a scale parameter in frequency. Indeed, rescaling n as n⟨n|ℓ⟩/⟨n²|ℓ⟩ and f(n|ℓ) as f(n|ℓ)⟨n²|ℓ⟩²/⟨n|ℓ⟩³, where the first and second empirical moments, ⟨n|ℓ⟩ and ⟨n²|ℓ⟩, are also conditioned to the value of ℓ, we obtain an impressive data collapse, valid for about 7 orders of magnitude in n, which allows us to write the scaling law

f(n|ℓ) ≃ (⟨n|ℓ⟩³/⟨n²|ℓ⟩²) g(n⟨n|ℓ⟩/⟨n²|ℓ⟩),

where the key point is that the scaling function g is the same function for any value of ℓ. For ℓ > 14 the statistics is low and the fulfilment of the scaling law becomes uncertain. Defining the scale parameter θ(ℓ) = ⟨n²|ℓ⟩/⟨n|ℓ⟩, we get alternative expressions for the same scaling law,

f(n|ℓ) ∝ θ^(−α) g(n/θ),

where constants of proportionality have been reabsorbed into g, and the scale parameter has to be understood as proportional to a characteristic scale of the conditional distributions (i.e., θ is the characteristic scale, up to a constant factor; it is the relative change of θ that will be important for us). The reason for the fulfillment of these relations is the power-law dependence between the moments and the scale parameter when a scaling law holds; this dependence is ⟨n|ℓ⟩ ∝ θ^(2−α) and ⟨n²|ℓ⟩ ∝ θ^(3−α) for 1 < α < 2, see [21,29]. The data collapse also unveils more clearly the functional form of the scaling function g, allowing us to fit its power-law shape in two different ranges.
The scaling function turns out to be compatible with a double power-law distribution, i.e., a (long) power law for n/θ < 0.1 with exponent α ≃ 1.4 and another (short) power law for n/θ > 1 with exponent β ≃ 2.75; in one formula,

g(y) ∝ 1/y^1.4 for y ≪ 1, and g(y) ∝ 1/y^2.75 for y > 1,

with y = n/θ. In other words, there is a (smooth) change of exponent (a change of log-log slope) at a value n ≃ Cθ(ℓ), with the proportionality constant C taking some value between 0.1 and 1 (as the transition from one regime to the other is smooth, there is no well-defined value of C that separates both). Fitting power laws to those ranges we get the results shown in Tables 1 and 2. Note that Cθ(ℓ) can be understood as the characteristic scale of f(n|ℓ) mentioned before, and can also be called a frequency crossover.
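The rescaling behind the collapse follows directly from its definition: for each length ℓ, compute θ(ℓ) = ⟨n²|ℓ⟩/⟨n|ℓ⟩ from the empirical moments and divide the frequencies by it. A minimal sketch, assuming a dictionary that maps each length to the list of frequencies of its types:

```python
def scale_parameter(freqs):
    """theta = <n^2> / <n> for the frequencies of types of one length."""
    m1 = sum(freqs) / len(freqs)                  # <n | ell>
    m2 = sum(n * n for n in freqs) / len(freqs)   # <n^2 | ell>
    return m2 / m1

def collapse(freqs_by_length):
    """Per length, the rescaled frequencies n / theta(ell)."""
    return {ell: [n / scale_parameter(fr) for n in fr]
            for ell, fr in freqs_by_length.items()}
```

Plotting histograms of the rescaled values for each ℓ on common axes (with the matching vertical rescaling) should then trace out the single curve g.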
Nevertheless, although the power-law regime for intermediate frequencies (n < 0.1θ) is very clear, the validity of the other power law (the one for large frequencies) is questionable, in the sense that the power law provides an "acceptable" fit but other distributions could do an equally good job, due to the limited range spanned by the tail (less than one order of magnitude). Our main reason to fit a power law to the large-frequency regime is the comparison with Zipf's law (β ≃ 2), and, as we see, the resulting value of β for f(n|ℓ) turns out to be rather large (the results of β for all f(n|ℓ) turn out to be statistically compatible with β = 2.75). In addition, we will show in the next subsection that the high-frequency behavior of the conditional distributions (power law or not) has nothing to do with Zipf's law.

Brevity Law and Possible Origin of Zipf's Law
Coming back to the scaling law, its fulfillment has an important consequence: it is the scale parameter θ(ℓ), and not the conditional mean ⟨n|ℓ⟩, that sets the scale of the conditional distributions f(n|ℓ). Figure 6 represents the brevity law in terms of the scale parameter as a function of ℓ (the conditional mean value is also shown, for comparison, superimposed on maps of f(ℓ, n) and f(n|ℓ)). Note that the authors of [13] dealt with the conditional mean, finding an exponential decay ⟨n|ℓ⟩ ∝ 26^(−0.6ℓ). Using our corpus (which is certainly different), we find that such an exponential decay for the mean is valid in a range of ℓ between 1 and 5, approximately. In contrast, the scale parameter θ shows an approximate power-law decay from about ℓ = 6 to 15, with an exponent δ around 3 (or 2.8, to be more precise), i.e., θ(ℓ) ∝ 1/ℓ^δ (note that Herdan assumed this exponent to be 2.4, with no clear empirical support [12]). Beyond ℓ = 15, the decay of θ(ℓ) is much faster. Nevertheless, these results are somewhat qualitative. With these limitations, we can write a new version of the scaling law as

f(n|ℓ) ∝ ℓ^(αδ) g(nℓ^δ),

where the proportionality constant between θ and 1/ℓ^δ has been reabsorbed into the scaling function g. The corresponding data collapse is shown in Figure 7, for 5 ≤ ℓ ≤ 14. Despite the rough approximation provided by the power-law decay of θ(ℓ), the data collapse in terms of scaling law (3) is nearly excellent for δ = 2.8. This version of the scaling law provides a clean formulation of the brevity law: the characteristic scale of the distribution of n conditioned to the value of ℓ decays with increasing ℓ as 1/ℓ^δ; i.e., the larger ℓ, the shorter the conditional distribution f(n|ℓ), as quantified by the exponent δ.
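For illustration, the exponent δ can be estimated by ordinary least squares on log θ versus log ℓ over the approximate power-law range (a rough recipe, in line with the qualitative character of the decay described above; the input is a dictionary mapping lengths to their scale parameters):

```python
import math

def fit_delta(theta_by_length):
    """Least-squares slope of log(theta) vs log(ell); returns delta
    such that theta(ell) ~ ell^(-delta)."""
    xs = [math.log(l) for l in theta_by_length]
    ys = [math.log(t) for t in theta_by_length.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope
```

On an exact power law θ(ℓ) ∝ ℓ^(−2.8) over ℓ = 6, …, 15, this recovers δ = 2.8 to machine precision.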
However, in addition to a new understanding of the brevity law, the scaling law in terms of ℓ provides, as a by-product, an empirical explanation of the origin of Zipf's law. In the regime of ℓ in which the scaling law is approximately valid, i.e., for ℓ₁ ≤ ℓ ≤ ℓ₂, we can obtain the distribution of frequency as a mixture of conditional distributions (by the law of total probability),

f(n) = Σ_ℓ f(n|ℓ) f(ℓ) ∝ ∫ from ℓ₁ to ℓ₂ of ℓ^(αδ) g(nℓ^δ) dℓ ∝ (1/n^(α+1/δ)) ∫ from nℓ₁^δ to nℓ₂^δ of x^(α−1+1/δ) g(x) dx,

where we have used the change of variable x = nℓ^δ and have also taken advantage of the fact that, in the region of interest, f(ℓ) can be considered (in a rough approximation) constant. From here, we can see that, when the frequency is small (n ≪ θ(ℓ₂)), the integration limits are also small, and then the last integral scales with n as n^(1/δ) (because there g(x) ∝ 1/x^α), which implies that we recover a power law with exponent α for f(n), i.e., f(n) ∝ 1/n^α. However, for larger frequencies (n above θ(ℓ₂) but below θ(ℓ₁)), the integral does not scale with n but can be considered instead constant, and then we get Zipf's law as

f(n) ∝ 1/n^(α+1/δ).

This means that Zipf's exponent can be obtained from the intermediate-frequency conditional power-law exponent α and the brevity exponent δ as

β_z = α + 1/δ,

where we have introduced a subscript z in β to stress that this is the β exponent appearing in Zipf's law, corresponding to the marginal distribution f(n), and to distinguish it from the one of the conditional distributions, which we may call β_c. Note then that β_c plays no role in the determination of β_z, and, in fact, the scaling function does not need to have a power-law tail to yield Zipf's law. This sort of argument is similar to the one used in statistical seismology [32], but in that case the scaling law was elementary (i.e., θ = ⟨n|ℓ⟩). We can check the previous exponent relation using the empirical values of the exponents. We do not have a unique measure of α, but from Table 2 we see that its value for the different f(n|ℓ) is quite well defined.
Taking the harmonic mean of the values for 4 ≤ ℓ ≤ 14 we get ᾱ = 1.43, which together with δ = 2.8 leads to β_z ≃ 1.79, not far from the ideal Zipf value β_z = 2 and closer still to the empirical value β_z = 1.94. The reason to calculate the harmonic mean of the exponents is that it is the maximum-likelihood outcome when untruncated power-law datasets are put together [33]; when the power laws are truncated, the result approaches the untruncated case when the range b/a is large.
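The arithmetic of this check is easy to reproduce (the numerical values below are the ones quoted in the text):

```python
def harmonic_mean(values):
    """Harmonic mean, the ML combination of untruncated power-law exponents."""
    return len(values) / sum(1.0 / v for v in values)

alpha_bar = 1.43          # harmonic mean of alpha over 4 <= ell <= 14
delta = 2.8               # brevity exponent of theta(ell) ~ 1/ell^delta
beta_z = alpha_bar + 1.0 / delta
print(round(beta_z, 2))   # 1.79, to be compared with the fitted 1.94
```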

Conclusions
Using a large corpus of English texts, we have seen how three important laws of quantitative linguistics, which are the type-length law, Zipf's law of word frequency, and the brevity law, can be put into a unified framework just considering the joint distribution of length and frequency.
Straightforwardly, the marginals of the joint distribution provide both the type-length distribution and the word-frequency distribution. We reformulate the type-length law, finding that the gamma distribution provides an excellent fit of type lengths for lengths larger than 2, in contrast to the previously proposed lognormal distribution [12] (although some previous research dealt not with type length but with token length [13]). For the distribution of word frequency, we confirm the well-known Zipf's law, with an exponent β_z = 1.94; we also confirm the second, intermediate power-law regime that emerges in large corpora [16], with an exponent α = 1.4.
The advantages of the perspective provided by the length-frequency joint distribution become apparent when dealing with the brevity phenomenon. Concretely, this property arises very clearly when looking at the distributions of frequency conditioned to fixed length, f(n|ℓ). These show a well-defined shape, characterized by a power-law decay for intermediate frequencies followed by a faster decay, well modeled by a second power law, for larger frequencies. The exponent α for the intermediate regime turns out to be the same as the one for the usual (marginal) distribution of frequency, α ≃ 1.4. However, the exponent for higher frequencies, β_c, turns out to be larger than 2 and unrelated to Zipf's law.
At this point, scaling analysis proves to be a very powerful tool to explore and formulate the brevity law. We observe that the conditional frequency distributions show scaling for different values of length, i.e., when the distributions are rescaled by a scale parameter (proportional to the characteristic scale of each distribution), they collapse onto a unique curve, showing that they share a common shape (although at different scales). The characteristic scale of the distributions turns out to be well described by the scale parameter (given by the ratio of moments ⟨n²|ℓ⟩/⟨n|ℓ⟩), rather than by the mean value ⟨n|ℓ⟩. This is the usual situation when the distributions involved have a power-law shape (with exponent α > 1) close to the origin [29]. It also highlights the importance of looking at whole distributions, and not only at mean values, when dealing with complex phenomena.
Going further, we obtain that the characteristic scale of the conditional frequency distributions decays, approximately, as a power law of the type length, with exponent δ, which allows us to rewrite the scaling law in a form reminiscent of the one used in the theory of phase transitions and critical phenomena. Although the power-law behavior of the characteristic frequency scale is rather rough, the derived scaling law shows an excellent agreement with the data. Note that, taking together the marginal length distribution, Equation (1), and the scaling law for the conditional frequency distribution, Equation (3), we can write for the joint distribution

f(ℓ, n) = f(n|ℓ) f(ℓ) ∝ ℓ^(γ−1) e^(−λℓ) ℓ^(αδ) g(nℓ^δ),

with the scaling function g(x) given by Equation (2), up to proportionality factors.
Finally, the fulfilment of a scaling law of this form allows us to obtain a phenomenological (model-free) explanation of Zipf's law as a mixture of the conditional distributions of frequency. In contrast to some accepted explanations of Zipf's law, which put the origin of the law outside the linguistic realm (such as Simon's model [15], where only the reinforced growth of the different types counts; other explanations are in [19,34]), our approach indicates that the origin of Zipf's law can be fully linguistic, as it depends crucially on the length of the words (and length is a purely linguistic attribute). Thus, at fixed length, each (conditional) frequency distribution shows a scale-free (power-law) behavior, up to a characteristic frequency at which the power law (with exponent α) breaks down. This breaking-down frequency depends on length through the exponent δ. The mixture of different power laws, with exponent α and cut off at a scale governed by the exponent δ, yields a Zipf exponent β_z = α + 1/δ. Strictly speaking, our explanation of Zipf's law does not fully explain it, but transfers the explanation to the existence of a power law with a smaller exponent (α ≃ 1.4) as well as to the crossover frequency that depends on length as ℓ^(−δ). Clearly, more research is necessary to explain the shape of the conditional distributions. It is noteworthy that a similar phenomenology for Zipf's law (in general) was proposed in [34], using the concept of "underlying unobserved variables", which, in the case of word frequencies, were associated (without quantification) with part of speech (grammatical categories). From our point of view, the "underlying unobserved variables" in the case of word frequencies would instead be word (type) lengths.
Although our results are obtained using a single English corpus, we believe they are fully representative of this language, at least when large corpora are used. Naturally, further investigations are needed to confirm the generality of our results; a necessary extension of our work is the use of corpora in other languages, to establish the universality of our findings, as done, e.g., in [14]. The length of words is here simply measured in number of characters, but nothing precludes the use of the number of phonemes or the mean time duration of types (in speech, as in [13]). Ultimately, the goal of this kind of research is to pursue a unified theory of linguistic laws, as proposed in [35]. The line of research shown in this paper seems to be a promising one.
Author Contributions: Methodology, Á.C. and I.S.; writing, Á.C.; visualization, I.S. All authors have read and agreed to the published version of the manuscript.