Search Results (2)

Search Parameters:
Authors = Carolin Müller-Spitzer

10 pages, 904 KiB  
Data Descriptor
Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German
by Sascha Wolfer, Alexander Koplenig, Marc Kupietz and Carolin Müller-Spitzer
Data 2023, 8(11), 170; https://doi.org/10.3390/data8110170 - 10 Nov 2023
Cited by 2 | Viewed by 2722
Abstract
We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
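
The vocabulary-growth case study mentioned in the abstract can be illustrated with a minimal sketch. It assumes each fold is simply a list of tokens held in memory; the function name `vocabulary_growth` and the toy data are hypothetical and do not reflect the actual DeReKoGram file format or the authors' scripts.

```python
from collections import Counter


def vocabulary_growth(folds):
    """Return (n_folds, n_types, n_hapax) after cumulatively adding each fold.

    `folds` is an iterable of token lists (an assumption made for this sketch).
    """
    counts = Counter()
    growth = []
    for n_folds, fold in enumerate(folds, start=1):
        counts.update(fold)  # accumulate frequencies over all folds seen so far
        n_types = len(counts)  # vocabulary size = number of distinct tokens
        n_hapax = sum(1 for c in counts.values() if c == 1)  # tokens seen exactly once
        growth.append((n_folds, n_types, n_hapax))
    return growth


# Toy data standing in for three corpus folds (hypothetical).
toy_folds = [
    ["der", "Hund", "bellt"],
    ["der", "Hund", "schläft"],
    ["die", "Katze", "schläft"],
]

growth = vocabulary_growth(toy_folds)
for row in growth:
    print(row)
```

Because the counts are accumulated across folds, a word that is a hapax legomenon after one fold can stop being one once a later fold contains it again, which is why the hapax count need not grow monotonically.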

18 pages, 2123 KiB  
Article
Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size
by Alexander Koplenig, Sascha Wolfer and Carolin Müller-Spitzer
Entropy 2019, 21(5), 464; https://doi.org/10.3390/e21050464 - 3 May 2019
Cited by 10 | Viewed by 5499
Abstract
Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf’s law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine Der Spiegel (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages.
(This article belongs to the Special Issue Information Theory and Language)
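
As a rough illustration of how the free parameter α in the Entropy 2019 abstract shifts weight across the word frequency spectrum, the following sketch computes one common form of generalized entropy, H_α = (1 − Σ p_i^α) / (α − 1), which approaches Shannon entropy as α approaches 1. This particular formula and the function name `generalized_entropy` are assumptions made for illustration; the article may use a different definition or normalization.

```python
import math
from collections import Counter


def generalized_entropy(tokens, alpha):
    """Generalized entropy of order alpha for the word distribution of `tokens`.

    Uses H_alpha = (1 - sum(p_i ** alpha)) / (alpha - 1), with the Shannon
    entropy (natural log) as the limiting case alpha = 1. This definition is
    an assumption of the sketch, not taken from the article.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if math.isclose(alpha, 1.0):
        return -sum(p * math.log(p) for p in probs)
    return (1.0 - sum(p ** alpha for p in probs)) / (alpha - 1.0)


# Toy text (hypothetical): one frequent word and two rare ones.
tokens = ["a", "a", "b", "c"]
for alpha in (0.5, 1.0, 2.0):
    print(alpha, generalized_entropy(tokens, alpha))
```

Larger values of α give more weight to high-frequency types, while smaller values emphasize the rare types, which matches the abstract's description of magnifying differences at specific scales of the frequency spectrum.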