Special Issue "Information Theory and Language"

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Multidisciplinary Applications".

Deadline for manuscript submissions: closed (31 October 2019).

Special Issue Editors

Dr. Łukasz Dębowski
Guest Editor
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248 Warszawa, Poland
Tel.: +48 22 380 05 53
Interests: information theory; discrete stochastic processes; power laws of natural language; statistical language modeling
Dr. Christian Bentz
Guest Editor
1. URPP Language and Space, University of Zürich, Freiestrasse 16, CH-8032 Zürich, Switzerland
2. DFG Center for Advanced Studies, University of Tübingen, Rümelinstraße 23, D-72070 Tübingen, Germany
Interests: information theory; linguistic typology; language evolution

Special Issue Information

Dear Colleagues,

The historical roots of information theory lie in statistical investigations of communication in natural language during the 1950s. In the decades that followed, however, linguistics and information theory developed largely independently, due to influential non-probabilistic theories of language. Recently, statistical investigations into natural language(s) have gained momentum again, driven by progress in computational linguistics, machine learning, and cognitive science. These developments are reopening the communication channel between information theorists and language researchers. Both information theory and linguistics have made important discoveries since the 1950s. While the two frameworks have sometimes been framed as irreconcilable, we believe that they are fully compatible. We expect fruitful cross-fertilization between the two fields in the near future.

In this Special Issue, we invite researchers working at the interface of information theory and natural language to present their original and recent developments. Possible topics include but are not limited to the following:

  • Applications of information-theoretic concepts to research on natural language(s);
  • Mathematical work in information theory inspired by natural language phenomena;
  • Empirical and theoretical investigation of quantitative laws of natural language;
  • Empirical and theoretical evaluation of statistical language models.

Dr. Łukasz Dębowski
Dr. Christian Bentz
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All papers will be peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • natural language
  • entropy
  • mutual information
  • entropy rate
  • excess entropy
  • maximal repetition
  • power laws
  • minimum description length
  • statistical language models
  • neural networks
  • quantitative linguistics
  • linguistic typology

Published Papers (9 papers)


Research

Open Access Article
A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics
Entropy 2020, 22(1), 126; https://doi.org/10.3390/e22010126 - 20 Jan 2020
Abstract
The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no agreed-upon full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3 × 10^9 word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on three different levels of granularity (raw text, time series of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval. Full article
(This article belongs to the Special Issue Information Theory and Language)
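
As a rough illustration of the three granularity levels mentioned above, a minimal Python sketch of going from raw text to a token time series to word counts is given below. This is not the SPGC pipeline itself, whose code the authors publish separately; the simple letter-based tokenizer and the example string are placeholder assumptions.

from collections import Counter
import re

def granularity_levels(raw_text: str):
    """Illustrative only: derive the three levels of granularity
    described in the abstract (raw text, time series of word tokens,
    counts of words) from a single string. The letter-based tokenizer
    is an assumption, not the SPGC pre-processing."""
    # Level 1: raw text (kept unchanged)
    # Level 2: time series of word tokens, in order of appearance
    tokens = re.findall(r"[^\W\d_]+", raw_text.lower())
    # Level 3: counts of words (bag-of-words frequencies)
    counts = Counter(tokens)
    return raw_text, tokens, counts

raw, token_series, word_counts = granularity_levels(
    "It was the best of times, it was the worst of times.")
print(token_series[:5])           # ['it', 'was', 'the', 'best', 'of']
print(word_counts.most_common(3))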

Open Access Article
How the Probabilistic Structure of Grammatical Context Shapes Speech
Entropy 2020, 22(1), 90; https://doi.org/10.3390/e22010090 - 11 Jan 2020
Abstract
Does systematic covariation in the usage patterns of forms shape the sublexical variance observed in conversational speech? We address this question in terms of a recently proposed discriminative theory of human communication, which argues that the distribution of events in communicative contexts should maintain mutual predictability between language users, presents evidence that the distributions of words in the empirical contexts in which they are learned and used are geometric, and thereby supports this claim. Here, we extend this analysis to a corpus of conversational English, showing that the distribution of grammatical regularities and the sub-distributions of tokens discriminated by them are also geometric. Further analyses reveal a range of structural differences in the distribution of types in part-of-speech categories that further support the suggestion that linguistic distributions (and codes) are subcategorized by context at multiple levels of abstraction. Finally, a series of analyses of the variation in spoken language reveals that quantifiable differences in the structure of lexical subcategories appear in turn to systematically shape sublexical variation in the speech signal. Full article
(This article belongs to the Special Issue Information Theory and Language)
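
For reference, a geometric distribution over counts or ranks k = 1, 2, 3, ... (the shape the abstract reports for contextual word distributions) is commonly written as

P(K = k) = (1 - p)^{k-1} \, p , \qquad k = 1, 2, 3, \dots

It has a constant ratio between successive probabilities and is the maximum-entropy distribution on the positive integers for a fixed mean 1/p; the exact parameterization used in the paper may differ.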

Open Access Article
Approximating Information Measures for Fields
Entropy 2020, 22(1), 79; https://doi.org/10.3390/e22010079 - 09 Jan 2020
Abstract
We supply corrected proofs of the invariance of completion and the chain rule for the Shannon information measures of arbitrary fields, as stated by Dębowski in 2009. Our corrected proofs rest on a number of auxiliary approximation results for Shannon information measures, which may be of independent interest. As also discussed briefly in this article, the generalized calculus of Shannon information measures for fields, including the invariance of completion and the chain rule, is useful in particular for studying the ergodic decomposition of stationary processes and its links with statistical modeling of natural language. Full article
(This article belongs to the Special Issue Information Theory and Language)
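
For orientation, in the familiar finite-variable setting the chain rule whose field-level generalization is treated here reads

H(X, Y) = H(X) + H(Y \mid X), \qquad I(X; Y, Z) = I(X; Y) + I(X; Z \mid Y).

The article proves the analogous identities, together with invariance under completion, for Shannon information measures defined on arbitrary fields; the display above is only a reminder of the classical special case.
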
Open Access Article
Productivity and Predictability for Measuring Morphological Complexity
Entropy 2020, 22(1), 48; https://doi.org/10.3390/e22010048 - 30 Dec 2019
Abstract
We propose a quantitative approach for measuring the morphological complexity of a language based on text. Several corpus-based methods have focused on measuring the different word forms that a language can produce. We take into account not only the productivity of morphological processes but also their predictability. We use a language model that predicts the probability of sub-word sequences within a word; we calculate the entropy rate of this model and use it as a measure of the predictability of the internal structure of words. Our results show that it is important to integrate these two dimensions when measuring morphological complexity, since languages can be complex under one measure but simpler under another. We calculated the complexity measures in two different parallel corpora for a typologically diverse set of languages. Our approach is corpus-based and does not require linguistically annotated data. Full article
(This article belongs to the Special Issue Information Theory and Language)
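
The predictability dimension rests on the notion of entropy rate. For a stationary sequence of sub-word units S_1, S_2, ..., the standard definition is

h = \lim_{n \to \infty} \frac{1}{n} H(S_1, \dots, S_n) = \lim_{n \to \infty} H(S_n \mid S_1, \dots, S_{n-1}),

estimated in practice from the per-symbol log-probabilities assigned by the language model; how the authors handle word boundaries and normalization is specified in the article itself, not here.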

Open Access Article
Entropy Rate Estimation for English via a Large Cognitive Experiment Using Mechanical Turk
Entropy 2019, 21(12), 1201; https://doi.org/10.3390/e21121201 - 06 Dec 2019
Abstract
The entropy rate h of a natural language quantifies the complexity underlying the language. While recent studies have used computational approaches to estimate this rate, their results rely fundamentally on the performance of the language model used for prediction. On the other hand, in 1951, Shannon conducted a cognitive experiment to estimate the rate without the use of any such artifact. Shannon’s experiment, however, used only one subject, bringing into question the statistical validity of his value of h = 1.3 bits per character for the English language entropy rate. In this study, we conducted Shannon’s experiment on a much larger scale to reevaluate the entropy rate h via Amazon’s Mechanical Turk, a crowd-sourcing service. The online subjects recruited through Mechanical Turk were each asked to guess the succeeding character after being given the preceding characters until obtaining the correct answer. We collected 172,954 character predictions and analyzed these predictions with a bootstrap technique. The analysis suggests that a large number of character predictions per context length, perhaps as many as 10^3, would be necessary to obtain a convergent estimate of the entropy rate, and if fewer predictions are used, the resulting h value may be underestimated. Our final entropy estimate was h ≈ 1.22 bits per character. Full article
(This article belongs to the Special Issue Information Theory and Language)
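
For context, Shannon’s 1951 guessing-game analysis bounds the N-gram conditional entropy F_N (which approaches the entropy rate h for long contexts) in terms of the probabilities q_i that the correct character is obtained on the i-th guess, for a 27-character alphabet with q_{28} := 0:

\sum_{i=1}^{27} i \, (q_i - q_{i+1}) \log_2 i \;\le\; F_N \;\le\; -\sum_{i=1}^{27} q_i \log_2 q_i .

Whether the present study applies exactly these bounds or a different estimator to the collected guess data is detailed in the article.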

Open Access Article
Semantic Entropy in Language Comprehension
Entropy 2019, 21(12), 1159; https://doi.org/10.3390/e21121159 - 27 Nov 2019
Abstract
Language is processed on a more or less word-by-word basis, and the processing difficulty induced by each word is affected by our prior linguistic experience as well as our general knowledge about the world. Surprisal and entropy reduction have been independently proposed as linking theories between word processing difficulty and probabilistic language models. Extant models, however, are typically limited to capturing linguistic experience and hence cannot account for the influence of world knowledge. A recent comprehension model by Venhuizen, Crocker, and Brouwer (2019, Discourse Processes) improves upon this situation by instantiating a comprehension-centric metric of surprisal that integrates linguistic experience and world knowledge at the level of interpretation and combines them in determining online expectations. Here, we extend this work by deriving a comprehension-centric metric of entropy reduction from this model. In contrast to previous work, which has found that surprisal and entropy reduction are not easily dissociated, we do find a clear dissociation in our model. While both surprisal and entropy reduction derive from the same cognitive process—the word-by-word updating of the unfolding interpretation—they reflect different aspects of this process: state-by-state expectation (surprisal) versus end-state confirmation (entropy reduction). Full article
(This article belongs to the Special Issue Information Theory and Language)
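
In their classical, word-string form, the two linking theories contrasted here are usually written as

\mathrm{surprisal}(w_t) = -\log P(w_t \mid w_1, \dots, w_{t-1}), \qquad \Delta H_t = H_{t-1} - H_t ,

where H_t is the entropy of the distribution over possible continuations after processing w_1, ..., w_t. The comprehension-centric versions discussed above are instead defined over interpretations; their exact definitions are given in the article, and the display here only fixes notation for the standard case.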

Open Access Article
Linguistic Laws in Speech: The Case of Catalan and Spanish
Entropy 2019, 21(12), 1153; https://doi.org/10.3390/e21121153 - 26 Nov 2019
Abstract
In this work we consider the Glissando Corpus (an oral corpus of Catalan and Spanish) and empirically analyze the presence of the four classical linguistic laws (Zipf’s law, Herdan’s law, the brevity law, and the Menzerath–Altmann law) in oral communication, complementing this with an analysis of two recently formulated laws: the lognormality law and the size-rank law. By aligning the acoustic signal of speech production with the speech transcriptions, we are able to measure and compare the agreement with each of these laws in both physical and symbolic units. Our results show that all six laws are recovered in both languages, but considerably more clearly when examined in physical units, reinforcing the so-called ‘physical hypothesis’, according to which linguistic laws may indeed have a physical origin and the patterns recovered in written texts would therefore be a byproduct of regularities already present in the acoustic signals of oral communication. Full article
(This article belongs to the Special Issue Information Theory and Language)
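
For orientation, commonly cited functional forms of the classical laws are (the paper’s exact formulations, and its treatment of the lognormality and size-rank laws, may differ):

\text{Zipf: } f(r) \propto r^{-\alpha}, \qquad \text{Herdan: } V(N) \propto N^{\beta}, \qquad \text{Menzerath–Altmann: } y(x) = a \, x^{-b} e^{-c x},

with the brevity law stating that more frequent units tend to be shorter, i.e., a negative correlation between frequency and length (or duration, when working in physical units).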

Open Access Article
Estimating Predictive Rate–Distortion Curves via Neural Variational Inference
Entropy 2019, 21(7), 640; https://doi.org/10.3390/e21070640 - 28 Jun 2019
Abstract
The Predictive Rate–Distortion curve quantifies the trade-off between compressing information about the past of a stochastic process and predicting its future accurately. Existing estimation methods for this curve work by clustering finite sequences of observations or by utilizing analytically known causal states. Neither type of approach scales to processes such as natural languages, which have large alphabets and long dependencies, and where the causal states are not known analytically. We describe Neural Predictive Rate–Distortion (NPRD), an estimation method that scales to such processes, leveraging the universal approximation capabilities of neural networks. Taking only time series data as input, the method computes a variational bound on the Predictive Rate–Distortion curve. We validate the method on processes where Predictive Rate–Distortion is analytically known. As an application, we provide bounds on the Predictive Rate–Distortion of natural language, improving on bounds provided by clustering sequences. Based on the results, we argue that the Predictive Rate–Distortion curve is more useful than the usual notion of statistical complexity for characterizing highly complex processes such as natural language. Full article
(This article belongs to the Special Issue Information Theory and Language)
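
One common way to write the underlying trade-off, in an information-bottleneck style over stochastic codes Z of the past (the paper’s variational construction is more involved), is

\min_{P(Z \mid X_{\le t})} \; I(Z; X_{\le t}) - \lambda \, I(Z; X_{> t}),

where the rate I(Z; X_{\le t}) measures how much information about the past is retained and I(Z; X_{> t}) how much predictive power the code keeps; sweeping the trade-off parameter \lambda traces out points on the Predictive Rate–Distortion curve.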

Open Access Article
Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size
Entropy 2019, 21(5), 464; https://doi.org/10.3390/e21050464 - 03 May 2019
Cited by 1
Abstract
Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf’s law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine Der Spiegel (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages. Full article
(This article belongs to the Special Issue Information Theory and Language)
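
As one concrete member of this family, shown here only for illustration (the paper follows the definitions of the generalized-entropy literature it builds on, which may be parameterized differently), the Rényi entropy of order \alpha of a word-frequency distribution p is

H_\alpha(p) = \frac{1}{1 - \alpha} \log_2 \sum_i p_i^{\alpha}, \qquad \alpha \ge 0, \ \alpha \ne 1,

which recovers the Shannon entropy as \alpha \to 1; small \alpha emphasizes the many rare word types, while large \alpha emphasizes the few highly frequent ones.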
