# Linguistic Laws in Speech: The Case of Catalan and Spanish


## Abstract


## 1. Introduction

- Zipf’s law. After some notable precursors (such as Pareto [5], Estoup [6] and Condon [7], among others), George Kingsley Zipf formulated and explained in [8,9] one of the most popular quantitative linguistic observations, known in his honor as Zipf’s law. He observed that, when the words of a written corpus are ordered by decreasing frequency, the number of occurrences of the word with rank r can be expressed as $f\left(r\right)\sim {r}^{-\alpha}$. This is a robust linguistic law, verified in many written corpora [10] and in speech [11], even though its variations have been discussed in many contexts [12,13,14].
- Herdan’s law. Although it has little-known precedents [15], Herdan’s law [16] (also known as Heaps’ law, because it was later formulated by Heaps in [17]) states that the average number V of different words in a text of size L grows as $V\sim {L}^{\beta},\beta <1$ [16]. Thus, Herdan’s law describes how the number V of different words (types) evolves as the text size L, measured in total number of words (tokens), increases. L is simply the sum of the number of occurrences (tokens) of each word type that appears in the text.
- Brevity law. Also known as Zipf’s law of abbreviation, its original qualitative statement claims that the more a word is used, the shorter it tends to be [8,9,18]. In texts or transcriptions, word size is usually measured as the number of characters that compose the word. Measured this way, the brevity law has been empirically verified in texts from almost a thousand languages of eighty different linguistic families [19], and it also holds acoustically when word size is measured as time duration [29,30]. The leap from the classical qualitative conception of the brevity law to a quantitative proposal has recently been made [4,31]. In information-theoretic terms [32], if a certain symbol i has a probability ${p}_{i}$ of appearing in a given symbolic code with a $\mathcal{D}$-ary alphabet, then its minimum (optimal) expected description length is ${\ell}_{i}^{*}=-\log_{\mathcal{D}}\left({p}_{i}\right)$. Deviations from optimality can be effectively modelled by adding a pre-factor, such that the description length of symbol i is ${\ell}_{i}\sim -\frac{1}{{\lambda}_{\mathcal{D}}}\log_{\mathcal{D}}\left({p}_{i}\right)$, where $0<{\lambda}_{\mathcal{D}}\le 1$. So, the closer ${\lambda}_{\mathcal{D}}$ is to one, the closer the system is to optimal compression. Reordering terms, one finds that the frequency of a unit decays exponentially with its size (see [4] for further details on the mathematical formulation).
- Size-rank law. Zipf’s law and the brevity law both involve frequencies. Taking advantage of the new mathematical formulation of the latter, the two can now be combined [4] in such a way that the “size” ${\ell}_{i}$ of a unit i is mathematically related to its rank ${r}_{i}$ via the exponents $\alpha $ (Zipf’s law) and $\lambda $ (brevity law). Experimentally, $\theta =\frac{\alpha}{\lambda}$ is therefore an observable parameter that combines the Zipf and brevity exponents in a size-rank plot, and this law predicts that larger linguistic units tend to have a higher rank, following a specific logarithmic relation [4].
- Menzerath–Altmann law. Again after some forerunners [20], Paul Menzerath established that there is a negative correlation between the length of a linguistic construct and the length of its constituents [21,22]. Subsequently, a mathematical formulation of the law was heuristically proposed by Gabriel Altmann [23,24]: if n stands for the size of the linguistic construct and y is the constituent size, then $y\left(n\right)=a{n}^{b}\exp(-cn)$, where a, b and c are free parameters of the model whose interpretation remains controversial [33]. Qualitatively, the Menzerath–Altmann law can be simplified and generalized as “the longer a language construct the shorter its components (constituents)” [23,34]. This law has been examined at different linguistic levels and from multiple perspectives [1,33,34], but mostly in written texts. Recently, some researchers have turned back to the phonetic origins of the law [35], and new mathematical models explaining the actual formulation have been proposed [4].
- Lognormality law. Previous studies have consistently found lognormal distributions for the duration of spoken phonemes in several languages [25,26,27,28,36] and for word and breath group (BG) durations in English [4,37]. In [4] it was confirmed that the time durations of phonemes, words and breath groups in English speech are well described by a lognormal distribution. Moreover, [4] presented a general stochastic model that explains and justifies such lognormality at all linguistic levels, assuming only that the lowest (phonemic) level follows a lognormal distribution, hence claiming the universal validity of the lognormal shape and proposing it as a ‘lognormality law’.
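
As a worked illustration of how the size-rank relation in the list above follows from the other two laws, the short derivation below combines the Zipf and brevity formulations. It is a sketch under two assumptions that are not spelled out in the text: the logarithm in the size-rank law is taken as the natural logarithm, and proportionality constants are absorbed into an additive constant.

$$
\begin{aligned}
f(r) &\sim r^{-\alpha} && \text{(Zipf's law)}\\
f &\sim \exp(-\lambda \ell) && \text{(brevity law)}\\
\exp(-\lambda \ell) &\sim r^{-\alpha}
\;\Longrightarrow\; \lambda\,\ell \simeq \alpha \log r + \text{const}
\;\Longrightarrow\; \ell \simeq \theta \log r + \text{const},
\qquad \theta = \frac{\alpha}{\lambda}.
\end{aligned}
$$

Under the same assumptions, the information-theoretic form ${\ell}_{i}\sim -\frac{1}{{\lambda}_{\mathcal{D}}}\log_{\mathcal{D}}\left({p}_{i}\right)$ can be inverted to ${p}_{i}\sim {\mathcal{D}}^{-{\lambda}_{\mathcal{D}}{\ell}_{i}}=\exp\left(-{\lambda}_{\mathcal{D}}\ln\left(\mathcal{D}\right){\ell}_{i}\right)$, so the exponential rate would be $\lambda ={\lambda}_{\mathcal{D}}\ln \mathcal{D}$. As a quick consistency check with the word-level values reported in Table 3: $1.42/23.8\approx 0.060$ for Catalan, $1.41/24.1\approx 0.058$ for Spanish and $1.41/20.6\approx 0.07$ for English, which match the tabulated $\theta $.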

## 2. Results

#### 2.1. Lognormality Law and Low-Resolution Effects

#### 2.2. Zipf’s Law for Words and Yule Distribution for Phonemes

#### 2.3. Herdan–Heaps’s Law

#### 2.4. Brevity Law

#### 2.5. Size-Rank Law

#### 2.6. Menzerath–Altmann’s Law (MAL)

## 3. Discussion

The other major theoretical factor working against an interest in frequency of use in language is the distinction, traditionally traced back to Ferdinand de Saussure (1916), between the knowledge that speakers have of the signs and structures of their language and the way language is used by actual speakers communicating with one another. American structuralists, including those of the generativist tradition, accept this distinction and assert furthermore that the only worthwhile object of study is the underlying knowledge of language (Chomsky 1965 and subsequent works). In this view, any focus on the frequency of use of the patterns or items of language is considered irrelevant.

## 4. Materials and Methods

#### Data and Reproducibility

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning |
---|---|
BG | Breath group |
LND | Lognormal distribution |
MAL | Menzerath–Altmann’s Law |

## References

1. Köhler, R.; Altmann, G.; Piotrowski, R.G. Quantitative Linguistik/Quantitative Linguistics: Ein Internationales Handbuch/an International Handbook; Walter de Gruyter: Berlin, Germany, 2008; Volume 27.
2. Grzybek, P. History of quantitative linguistics. Glottometrics **2012**, 23, 70–80.
3. Best, K.H.; Rottmann, O. Quantitative Linguistics, an Invitation; RAM-Verlag: Lüdenscheid, Germany, 2017.
4. Torre, I.G.; Luque, B.; Lacasa, L.; Kello, C.T.; Hernández-Fernández, A. On the physical origin of linguistic laws and lognormality in speech. R. Soc. Open Sci. **2019**, 6.
5. Pareto, V. Cours d’économie Politique; Librairie Droz; Imprime en Suisse: Geneva, Switzerland, 1964; Volume 1. (In French)
6. Estoup, J.B. Gammes Sténographiques. Recueil de Textes Choisis pour L’acquisition Méthodique de la Vitesse, Précédé d’une Introduction par J.-B. Estoup; Sténographique: Paris, France, 1912. (In French)
7. Condon, E.U. Statistics of vocabulary. Science **1928**, 67, 300.
8. Zipf, G.K. The Psychobiology of Language, an Introduction to Dynamic Philology; Houghton–Mifflin: Boston, MA, USA, 1935.
9. Zipf, G.K. Human Behavior and the Principle of Least Effort; Addison–Wesley: Cambridge, MA, USA, 1949.
10. Altmann, E.G.; Gerlach, M. Statistical laws in linguistics. In Creativity and Universality in Language; Springer: Cham, Switzerland, 2016; pp. 7–26.
11. Bian, C.; Lin, R.; Zhang, X.; Ma, Q.D.Y.; Ivanov, P.C. Scaling laws and model of words organization in spoken and written language. EPL (Europhysics Letters) **2016**, 113, 18002.
12. Ferrer-i-Cancho, R. The variation of Zipf’s law in human language. Eur. Phys. J. B **2005**, 44, 249–257.
13. Baixeries, J.; Elvevåg, B.; Ferrer-i-Cancho, R. The evolution of the exponent of Zipf’s law in language ontogeny. PLoS ONE **2013**, 8.
14. Neophytou, K.; van Egmond, M.; Avrutin, S. Zipf’s Law in Aphasia Across Languages: A Comparison of English, Hungarian and Greek. J. Quant. Linguist. **2017**, 24, 178–196.
15. Kuraszkiewicz, W.; Łukaszewicz, J. Ilość różnych wyrazów w zależności od długości tekstu. Pamiętnik Literacki: Czasopismo Kwartalne Poświęcone Historii i Krytyce Literatury Polskiej **1951**, 42, 168–182. (In Polish)
16. Herdan, G. Type-Token Mathematics: A Textbook of Mathematical Linguistics; De Gruyter Mouton: Berlin, Germany, 1960.
17. Heaps, H.S. Information Retrieval, Computational and Theoretical Aspects; Academic Press: Cambridge, MA, USA, 1978.
18. Zipf, G.K. Selected Studies of the Principle of Relative Frequency in Language; De Gruyter Mouton: Berlin, Germany, 1932.
19. Bentz, C.; Ferrer-i-Cancho, R. Zipf’s Law of Abbreviation as a Language Universal; Universitätsbibliothek Tübingen: Tübingen, Germany, 2016.
20. Grégoire, A. Variation de la durée de la syllabe française suivant sa place dans les groupements phonétiques. La Parole **1899**, 1, 161–176. (In French)
21. Menzerath, P.; Oleza, J. Spanische Lautdauer: Eine Experimentelle Untersuchung; De Gruyter Mouton: Berlin, Germany, 1928. (In German)
22. Menzerath, P. Die Architektonik des Deutschen Wortschatzes; Dümmler: Berlin, Germany, 1954; Volume 3. (In German)
23. Altmann, G. Prolegomena to Menzerath’s law. Glottometrika **1980**, 2, 1–10.
24. Altmann, G.; Schwibbe, M. Das Menzerathsche Gesetz in Informationsverarbeitenden Systemen; Georg Olms: Hildesheim, Germany, 1989. (In German)
25. Herdan, G. The relation between the dictionary distribution and the occurrence distribution of word length and its importance for the study of Quantitative Linguistics. Biometrika **1958**, 45, 222–228.
26. Rosen, K.M. Analysis of speech segment duration with the lognormal distribution: A basis for unification and comparison. J. Phon. **2005**, 33, 411–426.
27. Gopinath, D.P.; Veena, S.; Nair, A.S. Modeling of Vowel Duration in Malayalam Speech using Probability Distribution. In Proceedings of the Speech Prosody, Campinas, Brazil, 6–9 May 2008; pp. 6–9.
28. Shaw, J.A.; Kawahara, S. Effects of surprisal and entropy on vowel duration in Japanese. Language Speech **2017**, 62, 80–114.
29. Gahl, S. Time and thyme are not homophones: The effect of lemma frequency on word durations in spontaneous speech. Language **2008**, 84, 474–496.
30. Tomaschek, F.; Wieling, M.; Arnold, D.; Baayen, R.H. Word frequency, Vowel Length and Vowel Quality in Speech Production: An EMA Study of the Importance of Experience. Available online: https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/5957 (accessed on 23 November 2019).
31. Ferrer-i-Cancho, R.; Bentz, C.; Seguin, C. Optimal coding and the origins of Zipfian laws. arXiv **2019**, arXiv:1906.01545.
32. Cover, T.M.; Thomas, J.A. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing); Wiley-Interscience: New York, NY, USA, 2006.
33. Cramer, I. The Parameters of the Altmann-Menzerath Law. J. Quant. Linguist. **2005**, 12, 41–52.
34. Grzybek, P.; Stadlober, E.; Kelih, E. The Relationship of Word Length and Sentence Length: The Inter-Textual Perspective. In Advances In Data Analysis; Springer: Berlin/Heidelberg, Germany, 2007; pp. 611–618.
35. Mačutek, J.; Chromý, J.; Koščová, M. Menzerath-Altmann Law and Prothetic /v/ in Spoken Czech. J. Quant. Linguist. **2019**, 26, 66–80.
36. Sayli, O. Duration Analysis and Modeling for Turkish Text-to-Speech Synthesis. Master’s Thesis, Boğaziçi University, Istanbul, Turkey, 2002.
37. Greenberg, S.; Carvey, H.; Hitchcock, L.; Chang, S. Temporal properties of spontaneous speech-a syllable-centric perspective. J. Phon. **2003**, 31, 465–485.
38. Luque, J.; Luque, B.; Lacasa, L. Scaling and universality in the human voice. J. R. Soc. Interface **2015**, 12, 20141344.
39. Torre, I.G.; Luque, B.; Lacasa, L.; Luque, J.; Hernández-Fernández, A. Emergence of linguistic laws in human voice. Sci. Rep. **2017**, 7, 43862.
40. Garrido, J.M.; Escudero, D.; Aguilar, L.; Cardeñoso, V.; Rodero, E.; de-la Mota, C.; González, C.; Rustullet, S.; Larrea, O.; Laplaza, Y.; et al. Glissando: A corpus for multidisciplinary prosodic studies in Spanish and Catalan. Lang. Resour. Eval. **2013**, 47, 945–971.
41. Fernández Planas, A. Así se Habla: Nociones Fundamentales de Fonética General y Española; Apuntes de Catalán, Gallego y Euskara; Horsori Editorial: Barcelona, Spain, 2005. (In Spanish)
42. Pitt, M.A.; Dilley, L.; Johnson, K.; Kiesling, S.; Raymond, W.; Hume, E.; Fosler-Lussier, E. Buckeye Corpus of Conversational Speech, 2nd release; Columbus, OH: Department of Psychology, Ohio State University, 2007. Available online: http://sldr.org/voir_depot.php?id=776&lang=en&sip=0 (accessed on 23 November 2019).
43. Pitt, M.A.; Johnson, K.; Hume, E.; Kiesling, S.; Raymond, W. The Buckeye corpus of conversational speech: Labeling conventions and a test of transcriber reliability. Speech Commun. **2005**, 45, 89–95.
44. Eliason, S.R. Maximum Likelihood Estimation: Logic and Practice; Sage Publications: Tucson, AZ, USA, 1993; Volume 96.
45. Clauset, A.; Shalizi, C.R.; Newman, M.E. Power-law distributions in empirical data. SIAM Rev. **2009**, 51, 661–703.
46. Gillespie, C.S. Fitting Heavy Tailed Distributions: The poweRlaw Package. J. Stat. Softw. **2015**, 64, 1–16.
47. Lü, L.; Zhang, Z.K.; Zhou, T. Zipf’s law leads to Heaps’ law: Analyzing their relation in finite-size systems. PLoS ONE **2010**, 5, e14139.
48. Font-Clos, F.; Boleda, G.; Corral, A. A scaling law beyond Zipf’s law and its relation to Heaps’ law. New J. Phys. **2013**, 15, 093033.
49. Ferrer-i-Cancho, R. Compression and the origins of Zipf’s law for word frequencies. Complexity **2016**, 21, 409–411.
50. Bybee, J. Frequency of Use and the Organization of Language; Oxford University Press: Oxford, UK, 2007.
51. Quatieri, T.F. Discrete-Time Speech Signal Processing: Principles and Practice; Prentice Hall PTR: Upper Saddle River, NJ, USA, 2002.
52. Borleffs, E.; Maassen, B.A.M.; Lyytinen, H.; Zwarts, F. Measuring orthographic transparency and morphological-syllabic complexity in alphabetic orthographies: A narrative review. Read. Writ. **2017**, 30, 1617–1638.
53. Rojo, G. Sobre la configuración estadística de los corpus textuales. Lingüística **2017**, 33, 121–134. (In Spanish)
54. Tolchinsky, L.; Martí, A.; Llaurado, A. The growth of the written lexicon in Catalan From childhood to adolescence. Writ. Lang. Lit. **2010**, 13, 206–235.
55. Baken, R.; Orlikoff, R. Clinical Measurement of Speech and Voice (Speech Science); Cengage Learning: Boston, MA, USA, 2000.
56. Casas, B.; Hernández-Fernández, A.; Català, N.; Ferrer-i-Cancho, R.; Baixeries, J. Polysemy and brevity versus frequency in language. Comput. Speech Lang. **2019**, 58, 1–50.
57. Tsao, Y.C.; Weismer, G. Interspeaker variation in habitual speaking rate: Evidence for a neuromuscular component. J. Speech Lang. Hear. Res. **1997**, 40, 858–866.
58. Garrido, J.M. SegProso: A Praat-Based Tool for the Automatic Detection and Annotation of Prosodic Boundaries in Speech Corpora. In Proceedings of the TRASP 2013, Barcelona, Spain, 30 August 2013; pp. 74–77.

**Figure 1.** Lognormality law for time duration. (**outer panels**) Time duration distribution of phonemes (orange), words (blue) and BGs (green) for the Glissando corpus: Catalan (**top left**) and Spanish (**top right**). For comparison, the bottom left panel shows the results for English from the Buckeye corpus (extracted from [4]), which has finer statistics (higher resolution) than Glissando. A coarsened version of the English corpus, built to be comparable with Glissando’s resolution, is plotted in the bottom right panel (see the text for details). (**inset panels**) Collapse of all distributions after the time rescaling ${t}^{\prime}=\left(\log t-\langle \log t\rangle \right)/\mathrm{std}\left(\log t\right)$, where $\mathrm{std}\left(\log t\right)$ stands for the standard deviation of the random variable $\log t$. If time durations at all levels comply with a lognormal distribution, then the collapsed data should approach a standard Gaussian $\mathcal{N}(0,1)$ (solid line), in good agreement with the results. The small deviations found in Catalan and Spanish also appear in the coarsened version of English, which indicates that such deviations are mainly due to finite-precision and lower-bound detectability effects, and that the lognormality law otherwise holds.

**Figure 2.** Lognormality law truncation. (**left**) Rescaled log-time duration distribution of synthetic ‘phonemes’ $P\left(\tilde{Y}\right)$, estimated by (i) sampling $Y=\exp\left(X\right)$, where X is normally distributed, $X\sim \mathcal{N}(\mu ,{\sigma}^{2})$ with $\mu =-3$, $\sigma =2$, and then (ii) rescaling $\tilde{Y}=[\log Y-\langle \log Y\rangle ]/\mathrm{std}(\log Y)$ (where $\mathrm{std}(\log Y)$ stands for the standard deviation of the random variable $\log Y$). If Y is lognormal, then $\tilde{Y}\sim \mathcal{N}(0,1)$. (**right**) Rescaled log-time duration distribution of synthetic ‘words’ $P\left(\tilde{Z}\right)$, obtained using the stochastic model of [4] by concatenating n phonemes, where n is another random variable whose distribution is approximated empirically. As in the left panel, if Z is lognormal, then $\tilde{Z}\sim \mathcal{N}(0,1)$. In both panels, the black curve is the original, high-resolution experiment, whereas the purple curve is the result of (i) reducing the precision by rounding off to two decimal digits, (ii) reducing the sample size to match the differences between Buckeye and Glissando, and (iii) imposing a lower-bound detectability $\tau =0.03$ s (akin to the 30 ms of Glissando), such that all synthetically generated phonemes with a duration $Y<\tau $ are set to $0.03$ s. Whereas lognormality is recovered in the original experiment, this shape is smeared out as soon as the lower-bound detectability threshold and the other low-resolution artifacts are imposed, thereby explaining why the lognormality law might not be fully observable in Glissando.
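
The left-panel experiment described in the Figure 2 caption can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the function names, the sample sizes and the use of skewness as a summary of the distortion are all assumptions made here, and the word-level stochastic model of [4] is not reproduced.

```python
import math
import random
import statistics


def standardize_log(durations):
    """Rescale durations t as (log t - <log t>) / std(log t)."""
    logs = [math.log(t) for t in durations]
    mean = statistics.fmean(logs)
    std = statistics.pstdev(logs)
    return [(x - mean) / std for x in logs]


def synthetic_phonemes(n, mu=-3.0, sigma=2.0, seed=0):
    """Sample Y = exp(X) with X ~ N(mu, sigma^2); durations in seconds."""
    rng = random.Random(seed)
    return [math.exp(rng.gauss(mu, sigma)) for _ in range(n)]


def coarsen(durations, tau=0.03, decimals=2, keep=None, seed=1):
    """Mimic a low-resolution corpus (hypothetical helper).

    Optionally subsample to `keep` items, raise every duration below the
    detectability threshold tau up to tau, and round to `decimals` digits.
    """
    rng = random.Random(seed)
    sample = durations if keep is None else rng.sample(durations, keep)
    return [round(max(t, tau), decimals) for t in sample]


def skewness(values):
    """Third standardized moment; close to 0 for a Gaussian sample."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / std) ** 3 for v in values])


if __name__ == "__main__":
    fine = synthetic_phonemes(20000)       # high-resolution experiment
    coarse = coarsen(fine, keep=5000)      # low-resolution version
    print("skewness, high resolution:", skewness(standardize_log(fine)))
    print("skewness, coarsened:      ", skewness(standardize_log(coarse)))
```

With the caption's parameters a large share of the synthetic phonemes falls below $\tau $, so the coarsened sample piles up at 0.03 s and its rescaled log-durations are no longer symmetric, which is the distortion the caption attributes to the lower-bound threshold.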

**Figure 3.** Zipf’s law. Log-log frequency-rank plot of phonemes (orange squares) and words (blue circles) for Catalan (**left**) and Spanish (**right**). Words are fitted to a power-law distribution following [45,46], leading to ${x}_{\min}=1$ and nearly identical slopes for both languages. Phonemes are fitted to a Yule distribution using maximum likelihood estimation (MLE).

**Figure 4.** Herdan–Heaps’ law. Sublinear increase of the number of different words V versus time elapsed T (blue circles) and versus total number of words spoken L (green diamonds) for Catalan (**left**) and Spanish (**right**). As we are dealing with a multi-speaker corpus, each line represents a different permutation of the order in which the speakers are concatenated. In every case we find scaling laws $V\left(L\right)\sim {L}^{\beta}$ and $V\left(T\right)\sim {T}^{\gamma}$ which hold for about three decades. The scaling exponents $\beta $ and $\gamma $ are estimated for each permutation using the least-squares method, and the average value of each over all permutations is shown in the figure. We find $\beta \approx \gamma $, as previously justified in [4], and the numerical value is in agreement with the one found for English [4].
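
The permutation-averaged estimate of $\beta $ described in the Figure 4 caption can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the authors' pipeline: the function names are hypothetical, the toy corpus is synthetic (tokens drawn from a Zipf-like distribution), and sampling $V(L)$ at power-of-two text sizes is a choice made here so that the log-log least-squares fit is not dominated by large L.

```python
import math
import random


def vocabulary_growth(tokens):
    """Return (L, V) pairs sampled at log-spaced text sizes L = 1, 2, 4, ..."""
    seen, points, next_size = set(), [], 1
    for size, token in enumerate(tokens, start=1):
        seen.add(token)
        if size == next_size:
            points.append((size, len(seen)))
            next_size *= 2
    return points


def loglog_slope(points):
    """Ordinary least-squares slope of log V against log L."""
    xs = [math.log(size) for size, _ in points]
    ys = [math.log(vocab) for _, vocab in points]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def mean_heaps_exponent(speakers, n_permutations=20, seed=0):
    """Average the fitted exponent over random orderings of the speakers."""
    rng = random.Random(seed)
    betas = []
    for _ in range(n_permutations):
        order = speakers[:]
        rng.shuffle(order)
        tokens = [token for speaker in order for token in speaker]
        betas.append(loglog_slope(vocabulary_growth(tokens)))
    return sum(betas) / len(betas)


if __name__ == "__main__":
    # Toy corpus (assumption): 10 speakers, 2000 tokens each, Zipf-like words.
    rng = random.Random(42)
    ranks = range(1, 5001)
    weights = [r ** -1.4 for r in ranks]
    speakers = [rng.choices(ranks, weights=weights, k=2000) for _ in range(10)]
    print("mean beta over permutations:", mean_heaps_exponent(speakers))
```

The same routine would apply to $V(T)$ by replacing the token index with the cumulative elapsed time of each token, which requires per-token durations that the toy corpus above does not have.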

**Figure 5.** Brevity law: words (Catalan in the left panel and Spanish in the right panel). Red dashed lines are fits to the exponential law $f\sim \exp(-\lambda \ell )$, where ℓ is the word size, which can be measured in physical units (mean duration, **outer panels**) or in symbolic units (number of phonemes or number of characters, inset panels). See the text and Table 3 for data fits and interpretation. Blue dots are the result of data binning. Note that the fits are performed on the raw data, but the resulting exponential shape accurately matches the binned data within a range (deviations occur for shorter sizes, where the resolution and finite-precision issues of the Glissando corpus are important). For Catalan, the Spearman test shows a consistent negative correlation of $-0.27$ for the three formulations, while for Spanish the correlation is slightly stronger in physical magnitudes ($-0.25$) than in symbolic units ($-0.22$).

**Figure 6.** Brevity law: phonemes (Catalan in the left panel and Spanish in the right panel). Red dashed lines are fits to the exponential law $f\sim \exp(-\lambda \ell )$, where ℓ is the phoneme size measured in physical units (mean duration). Orange squares are the result of data binning. The Spearman test always yields negative correlations ($-0.3$ for Catalan, $-0.54$ for Spanish), but the data sample is too small to evaluate the agreement with the exponential law.

**Figure 7.** Size-rank law for words. Linear-log representation of word size ℓ versus rank for all words (blue dots denote binned data) in Catalan (**left**) and Spanish (**right**). The black dashed line is a fit of the raw data (light grey dots) to the size-rank law (see Table 1). The fit is therefore not performed on the binned data, yet its agreement with them is excellent.

**Figure 8.** Menzerath–Altmann law: BGs versus words. Representation of BG size, measured in number of words, versus the mean size of those words for Catalan (**left**) and Spanish (**right**), where the size of the words can be measured in physical magnitudes (**main panel**) or in symbolic units (phonemes or number of characters, inset panels). Each grey point represents one BG, whereas blue circles are the mean duration of BGs. MAL holds in physical magnitudes (with coefficient of determination ${R}^{2}=0.47$ for Catalan and ${R}^{2}=0.84$ for Spanish), while it is poorly fulfilled when the size is measured symbolically (Catalan: ${R}^{2}=0.23$ for character units and ${R}^{2}=0.11$ for phoneme units; Spanish: ${R}^{2}=0.04$ for character units and ${R}^{2}=0.08$ for phoneme units). Fitted parameters $a,b,c$ are reported in Table 3.
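
One simple way to obtain parameters $a,b,c$ of the kind quoted in this caption is to note that $y\left(n\right)=a{n}^{b}\exp(-cn)$ becomes linear in its parameters after taking logarithms, $\log y=\log a+b\log n-cn$. The sketch below fits that log-linear form by ordinary least squares. It is an assumption-laden illustration: the source does not state which fitting procedure was used for MAL, the function names are hypothetical, and the data in the usage example are noise-free synthetic values chosen only to show that the routine recovers known parameters.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for i in reversed(range(n)):
        partial = sum(aug[i][k] * solution[k] for k in range(i + 1, n))
        solution[i] = (aug[i][n] - partial) / aug[i][i]
    return solution


def fit_mal(sizes, part_sizes):
    """Fit y = a * n**b * exp(-c * n) via least squares on log y.

    Returns (a, b, c). Requires n > 0 and y > 0 for every data point.
    """
    rows = [(1.0, math.log(n), float(n)) for n in sizes]
    targets = [math.log(y) for y in part_sizes]
    # Normal equations (A^T A) p = A^T t for p = (log a, b, -c).
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    log_a, b, minus_c = solve_linear_system(ata, atb)
    return math.exp(log_a), b, -minus_c


if __name__ == "__main__":
    # Synthetic check (assumption): data generated from known a, b, c.
    ns = list(range(1, 31))
    ys = [0.3 * n ** -0.13 * math.exp(-0.004 * n) for n in ns]
    print(fit_mal(ns, ys))
```

Fitting in log space weights relative rather than absolute errors, so on noisy data its estimates can differ from those of a nonlinear least-squares fit of the original formula.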

**Figure 9.** Menzerath–Altmann law: words versus phonemes. Relation between the word size, measured in number of phonemes, and the size of those phonemes in physical magnitudes. Orange squares represent the mean size of each word. Fitted parameters are shown in Table 4; the coefficients of determination for these fits are ${R}^{2}=0.75$ for Catalan and ${R}^{2}=0.9$ for Spanish.

**Table 1.** **Main linguistic laws**, according to Torre and collaborators [4]. From left to right columns: name of the linguistic law, its mathematical formulation, details on its magnitudes and parameters, and finally some basic references about each law. While Zipf’s law is naturally defined and measured in symbolic units (texts or speech transcriptions), Herdan-Heaps, Brevity, Size-Rank and Menzerath-Altmann laws can be measured both in symbolic and physical units. Lognormality law is only defined in physical units (time duration).

Law | Mathematical Formulation | Details | References |
---|---|---|---|
Zipf’s law | $f\left(r\right)\sim {r}^{-\alpha}$ | f: frequency; r: rank; $\alpha $: parameter | [5,6,7,8,9] |
Herdan-Heaps’ law | $V\sim {L}^{\beta}$ | L: text size/time elapsed; V: vocabulary; $\beta $: parameter | [15,16,17] |
Brevity law | $f\sim \exp\left(-\lambda \ell \right),\phantom{\rule{1.em}{0ex}}\lambda >0$ | f: frequency; ℓ: size; $\lambda $: parameter | [4,8,9,18,19] |
Size-rank law | $\ell \sim \theta \log\left(r\right),\phantom{\rule{4pt}{0ex}}\theta =\frac{\alpha}{\lambda}$ | ℓ: size; r: rank; $\theta $: parameter | [4,9,18] |
Menzerath-Altmann’s law | $y\left(n\right)=a{n}^{b}\exp\left(-cn\right)$ | n: size of the whole; y: size of the parts; $a,b,c$: parameters | [4,20,21,22,23,24] |
Lognormality law | $\mathrm{p}(t;\mu ,\sigma )=\frac{1}{t\sigma \sqrt{2\pi}}{e}^{-\frac{{(\ln\left(t\right)-\mu )}^{2}}{2{\sigma}^{2}}}$ | t: time duration; $\sigma ,\mu $: parameters | [4,25,26,27,28] |

**Table 2.** Main characteristics of Glissando. This table summarises the main characteristics of the Glissando corpus [40] across linguistic levels for both Catalan and Spanish. For reference, a comparison to the Buckeye corpus (English) is provided [4,42,43]. We report the total number of linguistic elements (phonemes, words and breath groups (BG)), specifying the number of different linguistic elements (types) and the total (tokens). Since the time duration distributions of linguistic levels are usually heavy-tailed [4], we use the median duration (instead of the mean) as a reference.

Language | Phonemes (tokens) | Phonemes (types) | Words (tokens) | Words (types) | BG (tokens) | Median phoneme duration (s) | Median word duration (s) | Median BG duration (s) |
---|---|---|---|---|---|---|---|---|
Catalan | $3\times {10}^{5}$ | 35 | $8\times {10}^{4}$ | $5\times {10}^{3}$ | $2\times {10}^{4}$ | $0.05$ | $0.20$ | $0.8$ |
Spanish | $2\times {10}^{5}$ | 32 | $5\times {10}^{4}$ | $4\times {10}^{3}$ | $1\times {10}^{4}$ | $0.05$ | $0.21$ | $0.9$ |
English | $8\times {10}^{5}$ | 64 | $3\times {10}^{5}$ | $9\times {10}^{3}$ | $5\times {10}^{4}$ | $0.07$ | $0.20$ | $1.1$ |

**Table 3.** Summary of exponents and parameters for the case of words. Results are in reasonably good agreement with those found for English in [4]. Note that an actual fit of the lognormality law in Spanish and Catalan was not carried out due to the low-resolution problems of the Glissando corpus; however, we verified that this law also holds (see the text).

Words | Zipf $\alpha$ | Herdan-Heaps $\beta$ | Brevity $\lambda$ | Size-Rank $\theta$ | Menzerath-Altmann $a$ | Menzerath-Altmann $b$ | Menzerath-Altmann $c$ | Lognormality $\mu$ | Lognormality $\sigma$ |
---|---|---|---|---|---|---|---|---|---|
Catalan | $1.42$ | $0.62$ | $23.8$ | $0.060$ | $0.301$ | $-0.132$ | $-0.004$ | - | - |
Spanish | $1.41$ | $0.63$ | $24.1$ | $0.058$ | $0.336$ | $-0.148$ | $-0.003$ | - | - |
English | $1.41$ | $0.63$ | $20.6$ | $0.07$ | $0.364$ | $-0.227$ | $-0.0067$ | $-1.62$ | $0.66$ |

**Table 4.** Summary of exponents and parameters for the case of phonemes. Results are in reasonably good agreement with those found for English in [4]. Note that an actual fit of the lognormality law in Spanish and Catalan was not carried out due to the low-resolution problems of the Glissando corpus; however, we verified that this law also holds (see the text).

Phonemes | Yule $a$ | Yule $b$ | Brevity $\lambda$ | Menzerath-Altmann $a$ | Menzerath-Altmann $b$ | Menzerath-Altmann $c$ | Lognormality $\mu$ | Lognormality $\sigma$ |
---|---|---|---|---|---|---|---|---|
Catalan | $0.04$ | $0.90$ | 297 | $0.092$ | $-0.355$ | $-0.037$ | - | - |
Spanish | $0.16$ | $0.89$ | 76 | $0.102$ | $-0.393$ | $-0.032$ | - | - |
English | $0.25$ | $0.96$ | 127 | $0.18$ | $-0.23$ | $-0.007$ | $-2.68$ | $0.59$ |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Hernández-Fernández, A.; G. Torre, I.; Garrido, J.-M.; Lacasa, L.
Linguistic Laws in Speech: The Case of Catalan and Spanish. *Entropy* **2019**, *21*, 1153.
https://doi.org/10.3390/e21121153
