Open Access Article

Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size

Department of Lexical Studies, Institute for the German Language (IDS), 68161 Mannheim, Germany
*
Author to whom correspondence should be addressed.
Entropy 2019, 21(5), 464; https://doi.org/10.3390/e21050464
Received: 9 April 2019 / Revised: 24 April 2019 / Accepted: 30 April 2019 / Published: 3 May 2019
(This article belongs to the Special Issue Information Theory and Language)

Abstract

Recently, it was demonstrated that generalized entropies of order α, where α is a free parameter, offer novel and important opportunities to quantify the similarity of symbol sequences. Varying this parameter makes it possible to magnify differences between texts at specific scales of the corresponding word frequency spectrum. This is especially interesting for the analysis of the statistical properties of natural languages, because textual data are characterized by Zipf’s law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine Der Spiegel (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages.
Keywords: generalized entropy; generalized divergence; Jensen–Shannon divergence; sample size; text length; Zipf’s law
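
The quantities named in the abstract can be illustrated with a minimal sketch. The abstract does not state the functional form, so the code below assumes the Havrda-Charvát (Tsallis-type) entropy of order α, H_α(p) = (Σ p_i^α − 1) / (1 − α), which reduces to the Shannon entropy as α → 1, and the corresponding generalized Jensen–Shannon divergence D_α(p, q) = H_α((p + q)/2) − ½ H_α(p) − ½ H_α(q). All function names, the 1/rank toy vocabulary, and the sample sizes are hypothetical choices for illustration and are not taken from the article.

```python
import math
import random
from collections import Counter


def relative_frequencies(tokens):
    """Map each word type to its relative frequency in the token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def generalized_entropy(p, alpha):
    """Entropy of order alpha (assumed Havrda-Charvat/Tsallis form)."""
    probs = [v for v in p.values() if v > 0]
    if math.isclose(alpha, 1.0):
        # Limit alpha -> 1: Shannon entropy (natural logarithm).
        return -sum(v * math.log(v) for v in probs)
    return (sum(v ** alpha for v in probs) - 1.0) / (1.0 - alpha)


def generalized_jsd(p, q, alpha):
    """Generalized Jensen-Shannon divergence of order alpha between p and q."""
    vocab = set(p) | set(q)
    mix = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return (generalized_entropy(mix, alpha)
            - 0.5 * generalized_entropy(p, alpha)
            - 0.5 * generalized_entropy(q, alpha))


def same_source_divergence(n_tokens, alpha, vocab_size=1000, seed=0):
    """Divergence between two samples drawn from the SAME toy Zipf-like source.

    The source is a hypothetical vocabulary whose type weights fall off as
    1/rank; any nonzero divergence is therefore a pure sampling effect.
    """
    rng = random.Random(seed)
    types = list(range(vocab_size))
    weights = [1.0 / (rank + 1) for rank in types]
    sample_a = rng.choices(types, weights=weights, k=n_tokens)
    sample_b = rng.choices(types, weights=weights, k=n_tokens)
    return generalized_jsd(relative_frequencies(sample_a),
                           relative_frequencies(sample_b), alpha)


if __name__ == "__main__":
    # Compare how the same-source divergence behaves at different text lengths.
    for n in (100, 1_000, 10_000, 100_000):
        print(n, same_source_divergence(n, alpha=1.0))
```

Because both samples in `same_source_divergence` come from one distribution, the printed values reflect text length alone. This is the kind of sample-size dependence the abstract refers to, shown here on synthetic data rather than on the Der Spiegel corpus.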

This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

MDPI and ACS Style

Koplenig, A.; Wolfer, S.; Müller-Spitzer, C. Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size. Entropy 2019, 21, 464.


Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.

Entropy, EISSN 1099-4300, published by MDPI AG, Basel, Switzerland