# A Refutation of Finite-State Language Models through Zipf’s Law for Factual Knowledge

## Abstract


## 1. Introduction

#### 1.1. Historical and Conceptual Research Context

- The expected number of distinct binary facts that can be learned from a finite text is roughly less than the mutual information between two halves of the text.
- The mutual information between two halves of the text is roughly less than the expected total length of distinct words that can be found in the text.

#### 1.2. Aims and Organization of the Article

## 2. Some Classes of Processes

**Markov process** when the conditional probability of the next symbol ${x}_{i}$ depends only on the directly preceding symbol ${x}_{i-1}$, i.e.,

**IID (independent identically distributed) process**. Whereas IID processes are central to the theory of mathematical statistics, Markov processes exhibit some rudimentary dependence, restricted to the directly preceding observation, and were in fact proposed by Andrey Markov [69,70] as primitive statistical language models.

**hidden Markov process** with respect to a Markov process ${\left({Y}_{i}\right)}_{i\in \mathbb{N}}$ over a countable alphabet $\mathbb{Y}$ when the conditional probability of the next symbol ${x}_{i}$ depends only on the hidden state ${y}_{i}$, i.e.,

**finite-state process** is a hidden Markov process such that the set of states $\mathbb{Y}$ is finite. By contrast, a

**unifilar process** ${\left({X}_{i}\right)}_{i\in \mathbb{N}}$ with respect to a Markov process ${\left({Y}_{i}\right)}_{i\in \mathbb{N}}$ is a hidden Markov process such that

**n-th order Markov processes**, also called $(n+1)$-gram models, are unifilar processes such that ${Y}_{i}={X}_{i-n}^{i-1}$. A subclass of these processes with $n=2$, called trigram models, constitutes particularly effective statistical language models, which were applied in the computational linguistics of the 1990s [7,8,9,12]. Other important examples are

**computable processes**, which are processes such that the function $w\mapsto P({X}_{1}^{\left|w\right|}=w)$ is computable. It can be seen that the class of these processes is the class of hidden Markov processes with countably infinite $\mathbb{X}$ and $\mathbb{Y}$ and with computable functions $\pi $, $\sigma $, and $\epsilon $, since it suffices to put ${Y}_{i}={X}_{1}^{i-1}$. The last example shows that hidden Markov processes can model pretty much anything if we do not impose a finite number of hidden states or a particular structure of the transition and emission matrices.
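To make these definitions concrete, the following Python sketch samples a hidden Markov process from an initial distribution $\pi $, a transition kernel $\sigma $, and an emission function $\epsilon $. The two-state toy chain at the bottom is our own illustration, not an example from the cited literature.

```python
import random

def sample_hmm(pi, sigma, epsilon, n, seed=0):
    """Sample X_1, ..., X_n from a hidden Markov process:
    pi      -- initial distribution over hidden states (dict: state -> prob)
    sigma   -- transition kernel (dict: state -> dict: state -> prob)
    epsilon -- emission distribution (dict: state -> dict: symbol -> prob)
    """
    rng = random.Random(seed)

    def draw(dist):
        r, acc = rng.random(), 0.0
        for outcome, p in dist.items():
            acc += p
            if r < acc:
                return outcome
        return outcome  # guard against floating-point rounding

    y = draw(pi)
    xs = []
    for _ in range(n):
        xs.append(draw(epsilon[y]))  # X_i depends only on the hidden state Y_i
        y = draw(sigma[y])           # Y_{i+1} depends only on Y_i
    return xs

# A toy two-state example (our own illustration):
pi = {"a": 0.5, "b": 0.5}
sigma = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 1.0}}
epsilon = {"a": {0: 0.5, 1: 0.5}, "b": {0: 1.0}}
```

Restricting `pi`, `sigma`, and `epsilon` to a finite state set yields a finite-state process in the sense above.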

**hidden Markov order** of process ${\left({X}_{i}\right)}_{i\in \mathbb{N}}$ as the number of states in its minimal hidden Markov presentation,

**unifilar Markov order** of process ${\left({X}_{i}\right)}_{i\in \mathbb{N}}$ as the number of states in its minimal unifilar presentation,

**$\epsilon $-machine**, is unique and given by the equivalence classes of conditional probability of the infinite future given the infinite past [72,73]. There exist simple processes such that ${M}_{U}=\infty $ and ${M}_{HM}<\infty $ [74], e.g., the Golden Mean process [45] or the Simple Nonunifilar Source [75]. In fact, these two processes have only two hidden states in their minimal nonunifilar presentations, but their minimal unifilar presentations have uncountably many hidden states. These processes are very simple examples of processes with ${M}_{U}=\infty $, but they have no linguistic interpretation.

**Santa Fe processes**, which are sequences of random variables ${\left({X}_{i}\right)}_{i\in \mathbb{N}}$ that consist of either pairs

**Zipf’s distribution** $P({K}_{i}=k)\propto {k}^{-\alpha}$ for a parameter $\alpha >1$. These processes were discovered by us in August 2002 during our visit to the Santa Fe Institute, but they were first published in [28,29].

**almost surely**. “Almost surely” is a mathematical quantifier that means “with probability 1”. Moreover, the Zipf distribution of random variables ${K}_{i}$ allows us to deduce a stronger property: The number of distinct facts ${Z}_{k}$ or ${z}_{k}$ described by a random text ${X}_{1}^{n}$ is asymptotically proportional to ${n}^{1/\alpha}$ almost surely [36]. That is, the facts follow a sort of Herdan–Heaps’ law, originally formulated as a power-law growth of the number of distinct words [31,32,33,34]. A generalization of this property is called

**perigraphic processes** in Section 5, applying the concept of Hilberg exponents developed in Section 3 and the notion of algorithmic randomness. What is interesting for linguistic discussions is that the non-IID Santa Fe processes (7) are not finite-state processes. We have ${M}_{HM}=\infty $ for them since the Shannon mutual information between the past and future is infinite, as we discuss in Section 3. In this article, we show that perigraphic processes, such as the IID Santa Fe processes (8), cannot be finite-state processes either.
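For intuition about the Santa Fe processes and the Herdan–Heaps-like growth of facts, the following Python sketch simulates a truncated IID Santa Fe process: the indices ${K}_{i}$ are IID Zipf-distributed, the facts ${Z}_{k}$ are fair coin flips fixed once per index, and the number of distinct facts visible in a sample of length n grows roughly like ${n}^{1/\alpha }$. The truncation at `kmax` and the parameter values are our own simplifying assumptions.

```python
import random

def sample_santa_fe(n, alpha=2.0, kmax=10**5, seed=0):
    """Sample n pairs (K_i, Z_{K_i}) of a truncated IID Santa Fe process:
    K_i are IID with P(K = k) proportional to k**(-alpha), 1 <= k <= kmax,
    and the facts Z_k are fair coin flips fixed once per index k."""
    rng = random.Random(seed)
    weights = [k ** (-alpha) for k in range(1, kmax + 1)]
    z = {}  # facts z_k, instantiated lazily
    pairs = []
    for k in rng.choices(range(1, kmax + 1), weights=weights, k=n):
        if k not in z:
            z[k] = rng.randint(0, 1)
        pairs.append((k, z[k]))
    return pairs

pairs = sample_santa_fe(10_000, alpha=2.0, seed=1)
distinct = len({k for k, _ in pairs})
# Herdan-Heaps-like growth: distinct is of the order n**(1/alpha), far below n
```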

**stationary process** if and only if the probabilities are shift invariant, i.e.,

**Birkhoff ergodic theorem** states that, for a stationary process ${\left({X}_{i}\right)}_{i\in \mathbb{N}}$, relative frequencies converge almost surely, i.e., if we define event

**ergodic process** if and only if the relative frequencies of all strings converge to their probabilities almost surely, i.e., when $P\left({\Omega}_{P}\right)=1$ for

**subjective probabilities** represent the subjective odds of a language user, or of an effective predictor, speaking more generally. As such, the subjective probabilities should be computable, but they can be nonergodic, since there may be some prior random variables in the mental state of a language user, such as variables ${Z}_{k}$ in the Santa Fe process (7). Upon conditioning of subjective probabilities on the previously seen text, the prior random variables become more and more concentrated on some particular fixed values. This concentration process can be equivalently named the process of learning the unknown parameters. The

**objective probabilities** represent an arbitrary limit of this learning process, where all prior random variables become instantiated by some fixed values, such as values ${z}_{k}$ in the Santa Fe process (8). Miraculously, it turns out that the objective probabilities of strings are exactly the asymptotic relative frequencies of these strings in the particular infinite text that was generated. As such, the objective probabilities should be ergodic by the Birkhoff ergodic theorem if the generating subjective odds form a stationary process, but they can be uncomputable since the limit of computable functions need not be computable.
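A minimal numerical illustration of objective probabilities as asymptotic relative frequencies, in the simplest case of an IID process (our own toy example): the relative frequency of a symbol converges to its probability, as guaranteed almost surely by the Birkhoff ergodic theorem.

```python
import random

# IID Bernoulli(0.3) process: by the Birkhoff ergodic theorem, the relative
# frequency of symbol 1 converges almost surely to its probability 0.3.
rng = random.Random(0)
n = 200_000
xs = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
freq = sum(xs) / n  # close to P(X_i = 1) = 0.3
```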

**ergodic decomposition theorem**, which says that, for any stationary distribution P, there exists a unique prior $\nu $ supported on stationary ergodic distributions such that

## 3. Hilberg’s Hypothesis

**expectation** of a real random variable X with respect to a probability measure P. The

**Shannon entropy** of a discrete random variable X is $H\left(X\right):=\mathbf{E}\left[-\log P\left(X\right)\right]$, where $P\left(X\right)=P(X=x)$ if $X=x$, whereas

**conditional entropy** of X given a random variable Y is $H(X|Y):=\mathbf{E}\left[-\log P(X|Y)\right]$, where $P(X|Y)=P(X=x|Y=y)$ if $X=x$ and $Y=y$. Subsequently, the

**Shannon mutual information** for random variables X and Y is $I(X;Y):=H\left(X\right)+H\left(Y\right)-H(X,Y)$.
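These quantities are straightforward to compute for finite distributions; the following Python helpers (our own, with distributions represented as dictionaries) implement $H(X)$ and $I(X;Y)=H(X)+H(Y)-H(X,Y)$ directly from the definitions.

```python
from math import log2

def entropy(p):
    """Shannon entropy H(X) = E[-log P(X)] of a distribution (dict: x -> prob)."""
    return -sum(q * log2(q) for q in p.values() if q > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), with the joint distribution
    given as a dict mapping pairs (x, y) to probabilities."""
    px, py = {}, {}
    for (x, y), q in joint.items():
        px[x] = px.get(x, 0.0) + q
        py[y] = py.get(y, 0.0) + q
    return entropy(px) + entropy(py) - entropy(joint)

# Perfectly correlated fair bits: I(X;Y) = H(X) = 1 bit.
joint = {(0, 0): 0.5, (1, 1): 0.5}
```

For independent X and Y, the same function returns $I(X;Y)=0$, as the definition requires.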

**entropy rate** h as the limiting amount of information produced by a single random variable,

**excess entropy** E as the mutual information between the infinite past and infinite future of the process,

**data-processing inequality** states that $I(X;Y)\ge I(X;Z)$ if random variables X and Z are conditionally independent given Y. This holds in particular if Z is a function of Y, $Z=f\left(Y\right)$, hence the name of this inequality: The information decreases as we process it deterministically. Consequently, if ${\left({X}_{i}\right)}_{i\in \mathbb{Z}}$ is a hidden Markov process with respect to a Markov process ${\left({Y}_{i}\right)}_{i\in \mathbb{Z}}$, then by the data-processing inequality and the Markov condition, we obtain

**Hilberg exponents**

**Theorem**

**1**

**relaxed Hilberg hypothesis** for natural language in the variant introduced in references [42,43,44,45,78] could be simply expressed as condition ${\beta}_{H}>0$ for a reasonable statistical language model. However, such a formulation is ambiguous since, as we mentioned at the beginning of this section, there are two main interpretations of probability, nonergodic subjective and ergodic objective, and this distinction affects the estimates of the power-law growth of mutual information. As we indicated, the guiding examples are the subjective nonergodic Santa Fe processes (7), where ${\beta}_{H}=1/\alpha $ is an arbitrary number in the range $(0,1)$, whereas ${\beta}_{H}=0$ holds for their objective ergodic components (8), since they are IID. Additionally, for natural language, the estimates of the Hilberg exponent vary depending on the estimation method. Universal coding estimates yield an upper bound of ${\beta}_{H}\le 0.8$ [46,47,48,51,52], whereas methods based on guessing by human subjects seem to yield an upper bound of ${\beta}_{H}\le 0.5$ [40,41]. Thus, imposing a condition on the subjective-probability Hilberg exponent ${\beta}_{H}$ may differ greatly from imposing a similar condition on the objective-probability Hilberg exponent ${\beta}_{H}$. This is the main conceptual difficulty about Hilberg’s hypothesis that researchers in this topic should be aware of.

**prefix Kolmogorov complexity** of a string w, denoted $K\left(w\right)$, is the length of the shortest self-delimiting program for a universal computer for which the output is w. Note that the prefix Kolmogorov complexity is in general uncomputable but can be effectively approximated from above. The

**algorithmic mutual information** between strings u and w is $J(u;w):=K\left(u\right)+K\left(w\right)-K(u,w)$. Many results from Shannon information theory carry over to algorithmic information theory, but the respective proofs are often more difficult [38,64,65]. Let us observe that the typical difference between the expected Kolmogorov complexity $\mathbf{E}K\left({X}_{1}^{n}\right)$ and the Shannon entropy $H\left({X}_{1}^{n}\right)$ is of the order $\log n$ if the probability measure P is computable. For an uncomputable measure P, which is the case also when some parameters of a computable formula for P are uncomputable real numbers, this difference can be somewhat or even substantially greater, which complicates the transfer of results from one sort of information theory to the other.
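Since $K(w)$ can be effectively approximated from above, a practical stand-in for algorithmic mutual information replaces $K$ with the length of a compressed representation. The following Python sketch uses zlib for this purpose; it is a crude upper-bound heuristic of our own, not a definition from the cited literature.

```python
import zlib

def c(s: bytes) -> int:
    """Crude computable upper bound on Kolmogorov complexity:
    the length of the zlib-compressed representation."""
    return len(zlib.compress(s, 9))

def approx_mutual_info(u: bytes, w: bytes) -> int:
    """Compression-based stand-in for J(u; w) = K(u) + K(w) - K(u, w)."""
    return c(u) + c(w) - c(u + w)

u = b"the quick brown fox jumps over the lazy dog " * 20
w = b"the quick brown fox jumps over the lazy dog " * 20
v = bytes(range(256)) * 4  # unrelated, poorly compressible data
# identical strings share nearly all of their (apparent) information
```

Real compressors only upper-bound $K$, so such estimates inherit the caveats about the gap between Kolmogorov complexity and entropy discussed above.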

**Definition**

**1**

## 4. Finite-State Processes

**unifilar distributions** take the following form:

**maximum likelihood** (ML)

**normalized maximum likelihood** (NML) in the spirit of Shtarkov [82]

**Ryabko mixture**, cf. [58,59],

**family complexity** of the unifilar family:

**statistical complexity** of a stochastic process discussed in [72,73,74]. The family complexity (32) is a property of a class of processes, roughly related to the number of distinguishable distributions in the class. By contrast, the statistical complexity of [72,73,74] is the entropy of the hidden state distribution in the minimal unifilar presentation of a given process. The statistical complexity is smaller than or equal to $\log {M}_{U}$ but greater than or equal to the excess entropy (15). Unlike the excess entropy, it can be infinite for some finite-state nonunifilar sources such as the Golden Mean process [45] or the Simple Nonunifilar Source [75]. By contrast, it is a rule of thumb that the family complexity of a distribution family with exactly k real parameters is roughly $k\log n$. There also exist more exact expressions assuming some particular conditions [81]. Here, we only need a very rough bound for $\mathbb{C}(n|k)$, but assuming that we have not only a real-parameter emission matrix $\epsilon $ but also an integer-parameter transition table $\tau $, we observe a small correction to the aforementioned rule of thumb.

**Theorem**

**2.**

**universality of the Ryabko mixture**, i.e., the Ryabko mixture yields a strongly consistent and asymptotically unbiased estimator of the entropy rate. For distribution families that contain Markov chain distributions of all orders and for which the family complexity $\mathbb{C}(n|k)$ grows sublinearly with the sample size n for any order k, the Ryabko mixture is a universal distribution by a reasoning following the ideas of papers [58,59]. It turns out that this is the case for the unifilar hidden Markov family. As a consequence, the Ryabko mixture can be used for universal compression of data generated by any stationary ergodic process, i.e., there is a computable procedure that takes text ${X}_{1}^{n}$ and compresses it losslessly as a string of $-\log \mathbb{P}\left({X}_{1}^{n}\right)\approx hn$ bits, and this compression cannot be substantially improved. The following theorem states the universality of the Ryabko mixture:

**Theorem**

**3.**

**unifilar order estimator**:

**strong consistency** and

**asymptotic unbiasedness** of the unifilar order estimator (36), which makes use of the universality of the Ryabko mixture claimed in Theorem 3.
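As a hedged illustration of the ideas behind the Ryabko mixture and the order estimation, the following Python sketch builds a toy mixture over binary Markov orders with Krichevsky–Trofimov (add-1/2) smoothing and estimates the order by maximizing the penalized likelihood. The prior weights ${2}^{-(k+1)}$, the binary alphabet, and the lumping of short histories are our own simplifications, not the exact construction of this section.

```python
from math import log2

def kt_markov_log2prob(xs, k):
    """Log2-probability of a binary sequence under an order-k Markov model,
    with Krichevsky-Trofimov (add-1/2) sequential estimation per context."""
    counts, logp = {}, 0.0
    for i, x in enumerate(xs):
        ctx = tuple(xs[i - k:i]) if i >= k else None  # lump short histories
        n0, n1 = counts.get(ctx, (0, 0))
        p1 = (n1 + 0.5) / (n0 + n1 + 1.0)
        logp += log2(p1 if x == 1 else 1.0 - p1)
        counts[ctx] = (n0 + (x == 0), n1 + (x == 1))
    return logp

def mixture_log2prob(xs, kmax=5):
    """Toy Ryabko-style mixture over orders k with prior weights 2**-(k+1)."""
    return log2(sum(2.0 ** (-(k + 1) + kt_markov_log2prob(xs, k))
                    for k in range(kmax + 1)))

def estimate_order(xs, kmax=5):
    """Toy order estimator: the order maximizing the penalized likelihood."""
    return max(range(kmax + 1),
               key=lambda k: -(k + 1) + kt_markov_log2prob(xs, k))

xs = [0, 1] * 100  # a deterministic order-1 pattern
```

For the alternating sequence, the order-1 model dominates the mixture, so the toy estimator recovers the true order 1, mirroring the consistency claimed by Theorems 4 to 6 for the proper construction.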

**Theorem**

**4.**

**Theorem**

**5.**

**Theorem**

**6.**

## 5. Perigraphic Processes

**data-processing inequality for algorithmic mutual information**, where ${\left({K}_{i}\right)}_{i\in \mathbb{Z}}$ is an IID process. In turn, we may suspect that algorithmic mutual information $J({K}_{-n+1}^{0};{K}_{1}^{n})$ is low and that the main contribution to high algorithmic mutual information $J({X}_{-n+1}^{0};{X}_{1}^{n})$ for almost all ergodic components (8) may come from the high Kolmogorov complexity of the fixed sequence ${\left({z}_{k}\right)}_{k\in \mathbb{N}}$.

**algorithmically random (in the Martin-Löf sense)** if it is incompressible in the sense that

**common information in the sense of Gács and Körner** [109]. Staying within the framework of Shannon information theory, if we have two random variables X and Y and a random variable Z that is a function of each of X and Y, $Z=f\left(X\right)=g\left(Y\right)$, then the Shannon mutual information between X and Y is bounded as $I(X;Y)\ge I(Z;Z)=H\left(Z\right)$ by the data-processing inequality. The Gács–Körner common information ${C}_{GK}(X;Y)$ is the supremum of entropies $H\left(Z\right)$ taken over all random variables Z such that $Z=f\left(X\right)=g\left(Y\right)$. What is surprising is that the inequality ${C}_{GK}(X;Y)\le I(X;Y)$ can be strict also if we perform the analogous construction in algorithmic information theory [109]. There is also a related concept called the

**common information in the sense of Wyner** ${C}_{W}(X;Y)$ [110], which satisfies a reversed inequality ${C}_{W}(X;Y)\ge I(X;Y)$. The theorems about facts and words discussed in [27,29,36] can be regarded as a certain application or generalization of the inequalities ${C}_{GK}(X;Y)\le I(X;Y)\le {C}_{W}(X;Y)$.
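A small worked example in the Shannon setting (our own) contrasts the two regimes: when X and Y share an explicit common component Z, the bound $I(X;Y)\ge H(Z)$ is attained, whereas two merely correlated bits admit no nontrivial common function, so ${C}_{GK}(X;Y)=0$ while $I(X;Y)>0$.

```python
from math import log2

def H(p):
    """Shannon entropy of a distribution given as a dict."""
    return -sum(q * log2(q) for q in p.values() if q > 0)

def I(joint):
    """Mutual information I(X;Y) from a joint distribution (x, y) -> prob."""
    px, py = {}, {}
    for (x, y), q in joint.items():
        px[x] = px.get(x, 0.0) + q
        py[y] = py.get(y, 0.0) + q
    return H(px) + H(py) - H(joint)

# X = (Z, A), Y = (Z, B) with Z, A, B independent fair bits:
# the shared component Z gives C_GK(X;Y) = H(Z) = 1 bit = I(X;Y).
joint = {((z, a), (z, b)): 0.125
         for z in (0, 1) for a in (0, 1) for b in (0, 1)}

# Two merely correlated bits with P(X = Y) = 0.9: no nontrivial common
# function exists, so C_GK(X;Y) = 0 although I(X;Y) > 0.
joint2 = {(0, 0): 0.45, (1, 1): 0.45, (0, 1): 0.05, (1, 0): 0.05}
```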

**knowledge extractor**, and an arbitrary fixed algorithmically random binary sequence $z={\left({z}_{k}\right)}_{k\in \mathbb{N}}$. Define random variables

**Theorem**

**7.**

**Definition**

**2**

**random hierarchical association (RHA) processes**, cf. [30] and [36] (Section 11.4), which seem to exhibit not only the Hilberg condition but also a bottom-up hierarchical structure of an infinite height. These processes are nonergodic, and we suspect that their ergodic components are perigraphic with quite a nontrivial knowledge extractor g and algorithmically random sequences z, which are different for different ergodic components. From our point of view, it is interesting that some seemingly abstract mathematical concepts such as nonergodicity or uncomputability acquire an idealized linguistic interpretation. There is a great opportunity to exhibit further examples of processes and to pursue further modeling ideas. One such idea is the transience of factual knowledge, which seems to correspond to the phenomenon of mixing in stationary stochastic processes [108]. We comment on this a bit in Section 7.

## 6. Oracle Processes

**monkey-typing explanations** of Zipf’s law [24,67]. These researchers observed that, if the characters on the typewriter keyboard are pressed at random, then the resulting text approximately obeys Zipf’s law for words understood as random strings of letters delimited by spaces.

**Definition**

**3**

- The set of symbols $\mathbb{X}=\{0,1,2\}$;
- The set of states $\mathbb{Y}=\{a,b\}\{0,1\}^{*}$, i.e., states of the form $ay$ or $by$ with $y\in \{0,1\}^{*}$;
- $\epsilon \left(x|ay\right)=\theta /2$ and $\tau (ay,x)=ayx$ for $x\in \{0,1\}$ and $y\in \{0,1\}^{*}$;
- $\epsilon \left(2|ay\right)=1-\theta $ and $\tau (ay,2)=by$ for $y\in \{0,1\}^{*}$; and
- $\epsilon \left({z}_{\varphi \left(y\right)}|by\right)=1$ and $\tau (by,{z}_{\varphi \left(y\right)})=a$ for $y\in \{0,1\}^{*}$.
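The definition above can be turned into a small simulation. The following Python sketch is our own reconstruction under stated assumptions: the enumeration $\varphi $ of binary strings is chosen as the index of y in the ordering of the empty string, 0, 1, 00, and so on, and the fixed algorithmically random sequence z is replaced by lazily sampled coin flips.

```python
import random

def sample_oracle_process(n, theta=0.5, seed=0):
    """Sample n symbols of an Oracle-like process (our own reconstruction):
    in state ay, emit a bit x with probability theta/2 each and append it
    to y, or emit the terminator 2 with probability 1 - theta and move to
    state by; in state by, emit the oracle bit z_{phi(y)} and reset to a."""
    rng = random.Random(seed)
    z = {}  # lazily sampled stand-in for the fixed random sequence z

    def phi(y):
        """Index of binary string y in the ordering '', '0', '1', '00', ..."""
        return int("1" + y, 2) - 1

    out, y, state = [], "", "a"
    while len(out) < n:
        if state == "a":
            if rng.random() < theta:
                x = rng.randint(0, 1)
                out.append(x)  # monkey-typed letter of the current word
                y += str(x)
            else:
                out.append(2)  # word terminator, like a space
                state = "b"
        else:
            k = phi(y)
            if k not in z:
                z[k] = rng.randint(0, 1)
            out.append(z[k])   # the oracle bit indexed by the typed word
            y, state = "", "a"
    return out
```

The monkey-typed words delimited by the terminator 2 follow an approximate Zipf law, while each word is echoed by one bit of the oracle sequence.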

**Theorem**

**8.**

**Theorem**

**9.**

## 7. Discussion

#### 7.1. Is It Possible to Decide by Computation That a Given Empirical Stream of Data Satisfies the Hilberg Condition or Was Generated by a Perigraphic Source?

#### 7.2. Is It Plausible That Human Speech Not Only Satisfies the Hilberg Condition in a Certain Approximation but Also Resembles a Perigraphic Process?

**Chaitin’s halting probability $\Omega $**, which is an infinite algorithmically random sequence encoding which mathematical statements are true or false [111,112]. All knowledge can be squeezed into a certain randomness, but not every randomness is useful knowledge.

**mixing process** (from a subjective probability perspective) rather than a perigraphic process. However, even in this mixing case, the process may satisfy the Hilberg condition and may differ from a finite-state process. In fact, we investigated such a mixing phenomenon in the framework of Shannon information theory in [36] (Section 11.2) and [108], but it may be interesting to translate the respective phenomenon into algorithmic information theory.

#### 7.3. What Kind of Linguistic Structures or Phenomena Do Perigraphic Processes Account for by Their Very Definition?

**shortest grammar-based compression**, cf. [29] and [36] (Problem 7.4). Hence, the perigraphicness of a stochastic process implies Hilberg’s hypothesis and this implies discernibility of discrete words, i.e., the

**double articulation**, and a Zipfian distribution of words. Since, in this article, we have shown the equality ${\beta}_{g,z}^{+}={\beta}_{\mathbb{M}}$ for the Oracle processes, we may expect that some nice class of perigraphic processes also exhibits the equality ${\beta}_{g,z}^{+}={\beta}_{V}$. Does this mean that, in that case, we may have an approximate computable one-to-one correspondence between elementary statements $(k,{z}_{k})$ and words given by the shortest grammar-based compression?

**knowledge extractors** for practical statistical language models. Namely, if the number of independent elementary facts described by a text is approximately equal to the number of automatically detectable words, then the appearance of a new word in the predicted text can be a heuristic prompt for the predicting agent that a new fact needs to be added to the agent’s database of acquired factual knowledge. However, the added fact need not necessarily be a description of the new word.

**sentence information structure**, with theme k and rheme ${z}_{k}$, at the very best. Perigraphicness, which is a sort of Zipf’s law for algorithmic information, seems to be a different argument against finite-state language models than context-free syntax of an unbounded height. A mathematically plausible language system with an infinitely complex semantics can be just an infinite set of meaningful words or rather meaningful commands applied in texts at random. However, we must be a bit careful with such statements. The lack of a hierarchical structure does not mean that Oracle processes can be recognized by a finite-state automaton. To recognize Oracle processes, we need a push-down automaton with an oracle. In this simple wording, there is also a pretty complicated hidden computer that allows one to look up the particular bit of the oracle corresponding to a given string on the stack.

#### 7.4. Are There Competing Refutations of Finite-State Language Models Based on Other Quantitative Linguistic Observations?

**power-law decay of autocorrelations** [113]. This condition can be partly adapted to categorical time series as the power-law decay of Shannon mutual information $I({X}_{0};{X}_{n})$. Lin and Tegmark [114] claimed to observe such a power-law decay of mutual information for texts in natural language. Moreover, they proved that this power-law decay is incompatible with finite-state processes, and they argued that it may be compatible with processes that exhibit hierarchical structures of an unbounded height; see also [46] for more computational experiments.

**maximal repetition length** in a given text. For many mixing sources, which include finite-state processes and probably also Oracle processes, the maximal repetition length grows asymptotically like the logarithm of the text length [115,116]. For texts in natural language, however, it seems that the maximal repetition length grows roughly like the cube of the logarithm of the text length [117], which begs for an explanation, cf. [36] (Chapter 9) and [118]. We think that the cube-logarithmic scaling of the maximal repetition length is a phenomenon that may inspire interesting mathematical models of cohesive narration rather than of unbounded accumulation of factual knowledge. However, cohesive narration and knowledge accumulation can be coupled phenomena, both in language and in some mathematical models thereof. There may be a common underlying mechanism for both of them.
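For concreteness, the maximal repetition length of a text, the length of the longest substring occurring at least twice, possibly with overlap, can be computed by a binary search over candidate lengths, since the existence of a repeated substring of length L implies the existence of one of length L-1. The following Python sketch is a naive implementation of our own, not one of the algorithms from the cited literature.

```python
def maximal_repetition(text):
    """Length of the longest substring occurring at least twice in text
    (occurrences may overlap), found by binary search over lengths."""
    def has_repeat(length):
        seen = set()
        for i in range(len(text) - length + 1):
            chunk = text[i:i + length]
            if chunk in seen:
                return True
            seen.add(chunk)
        return False

    lo, hi = 0, max(len(text) - 1, 0)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_repeat(mid):
            lo = mid  # a repeat of length mid exists; try longer lengths
        else:
            hi = mid - 1
    return lo
```

Running this statistic on growing prefixes of a corpus is one way to probe the logarithmic versus cube-logarithmic scaling discussed above.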

#### 7.5. Are There Perigraphic Processes That Satisfy All of These Quantitative Linguistic Laws and Exhibit Hierarchical Structures of an Unbounded Height?

**random hierarchical association (RHA) processes**, which seem to simultaneously exhibit the Hilberg condition, the power-law logarithmic growth of the maximal repetition length, and a bottom-up hierarchical structure of an infinite height, cf. [30] and [36] (Section 11.4). We suppose that the ergodic components of RHA processes are also perigraphic and satisfy the power-law decay of mutual information $I({X}_{0};{X}_{n})$, but we have not demonstrated it yet. In [30], it was also shown that RHA processes are nonergodic and have an infinite entropy of the invariant algebra, which would be a very promising symptom since perigraphicness and strong nonergodicity are similar conditions, cf. [27] and [36] (Section 8.3). Our definition of RHA processes is quite complicated, however, which makes them difficult to analyze, and we are not sure whether all results in [30] are correct. Probably the construction should be somewhat simplified in order to obtain more conclusive and convincing results.

#### 7.6. How Can We Improve Practical Statistical Language Models Using Ideas Borrowed from Perigraphic Processes?

**shortest grammar-based compression** [29,60,61] and they can be morphemes or multiword expressions [55]. Moreover, this new fact need not be a description of the new term but rather a sort of reaction to it.

**knowledge extractor** that works for a reasonable subclass of stationary processes, then by compressing the extracted knowledge, we could find a desired lower bound for the number of distinct time-independent facts necessary to verify the perigraphicness property. We notice that finding such a universal knowledge extractor is a different problem than constructing the minimal unifilar representation of the process, called the

**$\epsilon $-machine** in [72,73,74], but there may be some connections between these two tasks, cf. [103]. The relationship between the universal knowledge extractor and the $\epsilon $-machine may be analogous to the difference between the Gács–Körner common information [109] and the Wyner common information [110]. The former is a lower bound for the learning problem, whereas the latter is an upper bound.

## 8. Conclusions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A. Proofs

**Proof of Theorem**

**1.**

**Proof of Theorem**

**2.**

**Proof of Theorem**

**3.**

**Proof of Theorem**

**4.**

**Proof of Theorem**

**5.**

**Proof of Theorem**

**6.**

**Proof of Theorem**

**7.**

**Proof of Theorem**

**8.**

**Proof of Theorem**

**9.**

## References

- Skinner, B.F. Verbal Behavior; Prentice Hall: Englewood Cliffs, NJ, USA, 1957.
- Chomsky, N. Three models for the description of language. IRE Trans. Inf. Theory **1956**, 2, 113–124.
- Chomsky, N. Syntactic Structures; Mouton & Co.: The Hague, The Netherlands, 1957.
- Chomsky, N. A Review of B. F. Skinner’s Verbal Behavior. Language **1959**, 35, 26–58.
- Chomsky, N.; Miller, G. Finite State Languages. Inf. Control. **1959**, 1, 91–112.
- Pereira, F. Formal Grammar and Information Theory: Together Again? Philos. Trans. R. Soc. Lond. Ser. A **2000**, 358, 1239–1253.
- Jelinek, F. Continuous speech recognition by statistical methods. Proc. IEEE **1976**, 64, 532–556.
- Jelinek, F. Statistical Methods for Speech Recognition; MIT Press: Cambridge, MA, USA, 1997.
- Kupiec, J. Robust part-of-speech tagging using a hidden Markov model. Comput. Speech Lang. **1992**, 6, 225–242.
- Charniak, E. Statistical Language Learning; MIT Press: Cambridge, MA, USA, 1993.
- Chi, Z.; Geman, S. Estimation of probabilistic context-free grammars. Comput. Linguist. **1998**, 24, 299–305.
- Manning, C.D.; Schütze, H. Foundations of Statistical Natural Language Processing; MIT Press: Cambridge, MA, USA, 1999.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. **1997**, 9, 1735–1780.
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of the 2013 Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 5–10 December 2013.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 2017 Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. Available online: https://openai.com/blog/better-language-models/ (accessed on 29 August 2021).
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the 2020 Conference on Neural Information Processing Systems (NIPS), virtual meeting, 6–12 December 2020.
- Chomsky, N. Aspects of the Theory of Syntax; The MIT Press: Cambridge, MA, USA, 1965.
- Ahn, S.; Choi, H.; Pärnamaa, T.; Bengio, Y. A Neural Knowledge Language Model. A Rejected but Interesting Paper. Available online: https://openreview.net/forum?id=BJwFrvOeg (accessed on 29 August 2021).
- Khmaladze, E. The Statistical Analysis of Large Number of Rare Events; Technical Report MS-R8804; Centrum voor Wiskunde en Informatica: Amsterdam, The Netherlands, 1988.
- Baayen, R.H. Word Frequency Distributions; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2001.
- Zipf, G.K. The Psycho-Biology of Language: An Introduction to Dynamic Philology; Houghton Mifflin: Boston, MA, USA, 1935.
- Mandelbrot, B. Structure formelle des textes et communication. Word **1954**, 10, 1–27.
- Bar-Hillel, Y.; Carnap, R. An Outline of a Theory of Semantic Information. In Language and Information: Selected Essays on Their Theory and Application; Addison-Wesley: Reading, UK, 1964; pp. 221–274.
- Shannon, C. A mathematical theory of communication. Bell Syst. Tech. J. **1948**, 27, 379–423, 623–656.
- Dębowski, Ł. Is Natural Language a Perigraphic Process? The Theorem about Facts and Words Revisited. Entropy **2018**, 20, 85.
- Dębowski, Ł. A general definition of conditional information and its application to ergodic decomposition. Stat. Probab. Lett. **2009**, 79, 1260–1268.
- Dębowski, Ł. On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts. IEEE Trans. Inf. Theory **2011**, 57, 4589–4599.
- Dębowski, Ł. Regular Hilberg Processes: An Example of Processes with a Vanishing Entropy Rate. IEEE Trans. Inf. Theory **2017**, 63, 6538–6546.
- Kuraszkiewicz, W.; Łukaszewicz, J. The number of different words as a function of text length. Pamiętnik Literacki **1951**, 42, 168–182. (In Polish)
- Guiraud, P. Les Caractères Statistiques du Vocabulaire; Presses Universitaires de France: Paris, France, 1954.
- Herdan, G. Quantitative Linguistics; Butterworths: London, UK, 1964.
- Heaps, H.S. Information Retrieval—Computational and Theoretical Aspects; Academic Press: New York, NY, USA, 1978.
- Kornai, A. How many words are there? Glottometrics **2002**, 4, 61–86.
- Dębowski, Ł. Information Theory Meets Power Laws: Stochastic Processes and Language Models; Wiley & Sons: New York, NY, USA, 2021.
- Martin-Löf, P. The definition of random sequences. Inf. Control. **1966**, 9, 602–619.
- Li, M.; Vitányi, P.M.B. An Introduction to Kolmogorov Complexity and Its Applications, 3rd ed.; Springer: New York, NY, USA, 2008.
- Ziv, J.; Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory **1977**, 23, 337–343.
- Hilberg, W. Der bekannte Grenzwert der redundanzfreien Information in Texten—eine Fehlinterpretation der Shannonschen Experimente? Frequenz **1990**, 44, 243–248.
- Shannon, C. Prediction and entropy of printed English. Bell Syst. Tech. J. **1951**, 30, 50–64.
- Ebeling, W.; Nicolis, G. Entropy of Symbolic Sequences: The Role of Correlations. Europhys. Lett. **1991**, 14, 191–196.
- Ebeling, W.; Pöschel, T. Entropy and long-range correlations in literary English. Europhys. Lett. **1994**, 26, 241–246.
- Bialek, W.; Nemenman, I.; Tishby, N. Complexity through nonextensivity. Phys. A Stat. Mech. Appl. **2001**, 302, 89–99.
- Crutchfield, J.P.; Feldman, D.P. Regularities unseen, randomness observed: The entropy convergence hierarchy. Chaos **2003**, 13, 25–54.
- Tanaka-Ishii, K. Statistical Universals of Language: Mathematical Chance vs. Human Choice; Springer: New York, NY, USA, 2021.
- Takahira, R.; Tanaka-Ishii, K.; Dębowski, Ł. Entropy Rate Estimates for Natural Language—A New Extrapolation of Compressed Large-Scale Corpora. Entropy **2016**, 18, 364.
- Hestness, J.; Narang, S.; Ardalani, N.; Diamos, G.; Jun, H.; Kianinejad, H.; Patwary, M.; Ali, M.; Yang, Y.; Zhou, Y. Deep Learning Scaling Is Predictable, Empirically. arXiv **2017**, arXiv:1712.00409.
- Hahn, M.; Futrell, R. Estimating Predictive Rate-Distortion Curves via Neural Variational Inference. Entropy **2019**, 21, 640.
- Braverman, M.; Chen, X.; Kakade, S.M.; Narasimhan, K.; Zhang, C.; Zhang, Y. Calibration, Entropy Rates, and Memory in Language Models. In Proceedings of the 2020 International Conference on Machine Learning (ICML), virtual meeting, 12–18 July 2020.
- Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling Laws for Neural Language Models. arXiv **2020**, arXiv:2001.08361.
- Henighan, T.; Kaplan, J.; Katz, M.; Chen, M.; Hesse, C.; Jackson, J.; Jun, H.; Brown, T.B.; Dhariwal, P.; Gray, S. Scaling Laws for Autoregressive Generative Modeling. arXiv **2020**, arXiv:2010.14701.
- Hernandez, D.; Kaplan, J.; Henighan, T.; McCandlish, S. Scaling Laws for Transfer. arXiv **2021**, arXiv:2102.01293.
- Dębowski, Ł. On Hilberg’s law and its links with Guiraud’s law. J. Quant. Linguist. **2006**, 13, 81–109.
- de Marcken, C.G. Unsupervised Language Acquisition. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 1996.
- Dębowski, Ł. On processes with hyperbolically decaying autocorrelations. J. Time Ser. Anal.
**2011**, 32, 580–584. [Google Scholar] [CrossRef] - Cleary, J.G.; Witten, I.H. Data compression using adaptive coding and partial string matching. IEEE Trans. Commun.
**1984**, 32, 396–402. [Google Scholar] [CrossRef] [Green Version] - Ryabko, B.Y. Prediction of random sequences and universal coding. Probl. Inf. Transm.
**1988**, 24, 87–96. [Google Scholar] - Ryabko, B. Compression-based methods for nonparametric density estimation, on-line prediction, regression and classification for time series. In Proceedings of the 2008 IEEE Information Theory Workshop, Porto, Portugal, 5–9 May 2008; pp. 271–275. [Google Scholar]
- Kieffer, J.C.; Yang, E. Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Inf. Theory
**2000**, 46, 737–754. [Google Scholar] [CrossRef] - Charikar, M.; Lehman, E.; Lehman, A.; Liu, D.; Panigrahy, R.; Prabhakaran, M.; Sahai, A.; Shelat, A. The Smallest Grammar Problem. IEEE Trans. Inf. Theory
**2005**, 51, 2554–2576. [Google Scholar] [CrossRef] - Rokhlin, V.A. On the fundamental ideas of measure theory. Am. Math. Soc. Transl.
**1962**, 10, 1–54. [Google Scholar] - Gray, R.M.; Davisson, L.D. Source coding theorems without the ergodic assumption. IEEE Trans. Inf. Theory
**1974**, 20, 502–516. [Google Scholar] [CrossRef] - Gács, P. On the symmetry of algorithmic information. Dokl. Akad. Nauk. SSSR
**1974**, 15, 1477–1480. [Google Scholar] - Chaitin, G.J. A theory of program size formally identical to information theory. J. ACM
**1975**, 22, 329–340. [Google Scholar] [CrossRef] - Dębowski, Ł. Variable-length Coding of Two-sided Asymptotically Mean Stationary Measures. J. Theor. Probab.
**2010**, 23, 237–256. [Google Scholar] [CrossRef] [Green Version] - Miller, G.A. Some effects of intermittent silence. Am. J. Psychol.
**1957**, 70, 311–314. [Google Scholar] [CrossRef] [PubMed] - Billingsley, P. Probability and Measure; Wiley & Sons: New York, NY, USA, 1979. [Google Scholar]
- Markov, A.A. Essai d’une recherche statistique sur le texte du roman “Eugene Onegin” illustrant la liaison des epreuve en chain. Bulletin l’Académie Impériale Sci. St.-Pétersbourg
**1913**, 7, 153–162. [Google Scholar] - Markov, A.A. An Example of Statistical Investigation of the Text ‘Eugene Onegin’ Concerning the Connection of Samples in Chains. Sci. Context
**2006**, 19, 591–600. [Google Scholar] [CrossRef] [Green Version] - Miller, M.I.; O’Sullivan, J.A. Entropies and Combinatorics of Random Branching Processes and Context-Free Languages. IEEE Trans. Inf. Theory
**1992**, 38, 1292–1310. [Google Scholar] [CrossRef] [Green Version] - Crutchfield, J.P.; Young, K. Inferring statistical complexity. Phys. Rev. Lett.
**1989**, 63, 105–108. [Google Scholar] [CrossRef] [PubMed] - Löhr, W. Properties of the Statistical Complexity Functional and Partially Deterministic HMMs. Entropy
**2009**, 11, 385–401. [Google Scholar] [CrossRef] [Green Version] - Jurgens, A.M.; Crutchfield, J.P. Divergent Predictive States: The Statistical Complexity Dimension of Stationary, Ergodic Hidden Markov Processes. arXiv
**2021**, arXiv:2102.10487. [Google Scholar] - Marzen, S.E.; Crutchfield, J.P. Informational and Causal Architecture of Discrete-Time Renewal Processes. Entropy
**2015**, 17, 4891–4917. [Google Scholar] [CrossRef] [Green Version] - Birkhoff, G.D. Proof of the ergodic theorem. Proc. Natl. Acad. Sci. USA
**1932**, 17, 656–660. [Google Scholar] [CrossRef] - Gray, R.M. Probability, Random Processes, and Ergodic Properties; Springer: New York, NY, USA, 2009. [Google Scholar]
- Dębowski, Ł. The Relaxed Hilberg Conjecture: A Review and New Experimental Support. J. Quant. Linguist.
**2015**, 22, 311–337. [Google Scholar] [CrossRef] [Green Version] - Dębowski, Ł. Hilberg Exponents: New Measures of Long Memory in the Process. IEEE Trans. Inf. Theory
**2015**, 61, 5716–5726. [Google Scholar] [CrossRef] [Green Version] - Brudno, A.A. Entropy and the complexity of trajectories of a dynamical system. Trans. Moscovian Math. Soc.
**1982**, 44, 124–149. [Google Scholar] - Grünwald, P.D. The Minimum Description Length Principle; The MIT Press: Cambridge, MA, USA, 2007. [Google Scholar]
- Shtarkov, Y.M. Universal sequential coding of single messages. Probl. Inf. Transm.
**1987**, 23, 3–17. [Google Scholar] - Merhav, N.; Gutman, M.; Ziv, J. On the estimation of the order of a Markov chain and universal data compression. IEEE Trans. Inf. Theory
**1989**, 35, 1014–1019. [Google Scholar] [CrossRef] [Green Version] - Ziv, J.; Merhav, N. Estimating the Number of States of a Finite-State Source. IEEE Trans. Inf. Theory
**1992**, 38, 61–65. [Google Scholar] [CrossRef] [Green Version] - Csiszar, I.; Shields, P.C. The Consistency of the BIC Markov Order Estimator. Ann. Stat.
**2000**, 28, 1601–1619. [Google Scholar] [CrossRef] - Csiszar, I. Large-scale typicality of Markov sample paths and consistency of MDL order estimator. IEEE Trans. Inf. Theory
**2002**, 48, 1616–1628. [Google Scholar] [CrossRef] - Morvai, G.; Weiss, B. Order estimation of Markov chains. IEEE Trans. Inf. Theory
**2005**, 51, 1496–1497. [Google Scholar] [CrossRef] - Peres, Y.; Shields, P. Two new Markov order estimators. arXiv
**2005**, arXiv:math/0506080. [Google Scholar] - Dalevi, D.; Dubhashi, D. The Peres-Shields Order Estimator for Fixed and Variable Length Markov Models with Applications to DNA Sequence Similarity. In Algorithms in Bioinformatics; Casadio, R., Myers, G., Eds.; Springer: New York, NY, USA, 2005; pp. 291–302. [Google Scholar]
- Ryabko, B.; Astola, J. Universal Codes as a Basis for Time Series Testing. Stat. Methodol.
**2006**, 3, 375–397. [Google Scholar] [CrossRef] [Green Version] - Csiszar, I.; Talata, Z. Context tree estimation for not necessarily finite memory processes, via BIC and MDL. IEEE Trans. Inf. Theory
**2006**, 52, 1007–1016. [Google Scholar] [CrossRef] - Talata, Z. Divergence rates of Markov order estimators and their application to statistical estimation of stationary ergodic processes. Bernoulli
**2013**, 19, 846–885. [Google Scholar] [CrossRef] [Green Version] - Baigorri, A.R.; Goncalves, C.R.; Resende, P.A.A. Markov chain order estimation based on the chi-square divergence. Can. J. Stat.
**2014**, 42, 563–578. [Google Scholar] [CrossRef] - Ryabko, B.; Astola, J.; Malyutov, M. Compression-Based Methods of Statistical Analysis and Prediction of Time Series; Springer: New York, NY, USA, 2016. [Google Scholar]
- Papapetrou, M.; Kugiumtzis, D. Markov chain order estimation with parametric significance tests of conditional mutual information. Simul. Model. Pract. Theory
**2016**, 61, 1–13. [Google Scholar] [CrossRef] [Green Version] - Finesso, L. Order Estimation for Functions of Markov Chains. Ph.D Thesis, University of Maryland, College Park, MD, USA, 1990. [Google Scholar]
- Weinberger, M.J.; Lempel, A.; Ziv, J. A Sequential Algorithm for the Universal Coding of Finite Memory Sources. IEEE Trans. Inf. Theory
**1992**, 38, 1002–1014. [Google Scholar] [CrossRef] - Kieffer, J.C. Strongly Consistent Code-Based Identification and Order Estimation for Constrained Finite-State Model Classes. IEEE Trans. Inf. Theory
**1993**, 39, 893–902. [Google Scholar] [CrossRef] - Weinberger, M.J.; Feder, M. Predictive stochastic complexity and model estimation for finite-state processes. J. Stat. Plan. Inference
**1994**, 39, 353–372. [Google Scholar] [CrossRef] - Liu, C.C.; Narayan, P. Order Estimation and Sequential Universal Data Compression of a Hidden Markov Source bv the Method of Mixtures. IEEE Trans. Inf. Theory
**1994**, 40, 1167–1180. [Google Scholar] - Gassiat, E.; Boucheron, S. Optimal Error Exponents in Hidden Markov Models Order Estimation. IEEE Trans. Inf. Theory
**2003**, 49, 964–980. [Google Scholar] [CrossRef] - Lehéricy, L. Consistent order estimation for nonparametric Hidden Markov Models. Bernoulli
**2019**, 25, 464–498. [Google Scholar] [CrossRef] [Green Version] - Shalizi, C.R.; Shalizi, K.L.; Crutchfield, J.P. An Algorithm for Pattern Discovery in Time Series. arXiv
**2002**, arXiv:cs/0210025. [Google Scholar] - Zheng, J.; Huang, J.; Tong, C. The order estimation for hidden Markov models. Phys. A Stat. Mech. Appl.
**2019**, 527, 121462. [Google Scholar] [CrossRef] - Kieffer, J.C.; Rahe, M. Markov Channels are Asymptotically Mean Stationary. Siam J. Math. Anal.
**1981**, 12, 293–305. [Google Scholar] [CrossRef] - Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley & Sons: New York, NY, USA, 2006. [Google Scholar]
- Ochoa, C.; Navarro, G. RePair and All Irreducible Grammars are Upper Bounded by High-Order Empirical Entropy. IEEE Trans. Inf. Theory
**2019**, 65, 3160–3164. [Google Scholar] [CrossRef] - Dębowski, Ł. Mixing, Ergodic, and Nonergodic Processes with Rapidly Growing Information between Blocks. IEEE Trans. Inf. Theory
**2012**, 58, 3392–3401. [Google Scholar] [CrossRef] - Gács, P.; Körner, J. Common information is far less than mutual information. Probl. Control. Inf. Theory
**1973**, 2, 119–162. [Google Scholar] - Wyner, A.D. The Common Information of Two Dependent Random Variables. IEEE Trans. Inf. Theory
**1975**, IT-21, 163–179. [Google Scholar] [CrossRef] - Chaitin, G. Meta Math!: The Quest for Omega; Pantheon Books: New York, NY, USA, 2005. [Google Scholar]
- Gardner, M. The random number Ω bids fair to hold the mysteries of the universe. Sci. Am.
**1979**, 241, 20–34. [Google Scholar] [CrossRef] - Beran, J. Statistics for Long-Memory Processes; Chapman & Hall: New York, NY, USA, 1994. [Google Scholar]
- Lin, H.W.; Tegmark, M. Critical Behavior in Physics and Probabilistic Formal Languages. Entropy
**2017**, 19, 299. [Google Scholar] [CrossRef] [Green Version] - Szpankowski, W. Asymptotic Properties of Data Compression and Suffix Trees. IEEE Trans. Inf. Theory
**1993**, 39, 1647–1659. [Google Scholar] [CrossRef] - Szpankowski, W. A generalized suffix tree and its (un)expected asymptotic behaviors. Siam J. Comput.
**1993**, 22, 1176–1198. [Google Scholar] [CrossRef] [Green Version] - Dębowski, Ł. Maximal Repetitions in Written Texts: Finite Energy Hypothesis vs. Strong Hilberg Conjecture. Entropy
**2015**, 17, 5903–5919. [Google Scholar] [CrossRef] [Green Version] - Dębowski, Ł. Maximal Repetition and Zero Entropy Rate. IEEE Trans. Inf. Theory
**2018**, 64, 2212–2219. [Google Scholar] [CrossRef] - Futrell, R.; Qian, P.; Gibson, E.; Fedorenko, E.; Blank, I. Syntactic dependencies correspond to word pairs with high mutual information. In Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, Syntaxfest 2019), Paris, France, 27–28 August 2019; pp. 3–13. [Google Scholar]
- Hahn, M.; Degen, J.; Futrell, R. Modeling word and morpheme order in natural language as an efficient trade-off of memory and surprisal. Psychol. Rev.
**2021**, 128, 726–756. [Google Scholar] [CrossRef] - Barron, A.R. Logically Smooth Density Estimation. Ph.D Thesis, Stanford University, Stanford, CA, USA, 1985. [Google Scholar]
- Gray, R.M.; Kieffer, J.C. Asymptotically mean stationary measures. Ann. Probab.
**1980**, 8, 962–973. [Google Scholar] [CrossRef] - Wyner, A.D. A definition of conditional mutual information for arbitrary ensembles. Inf. Control
**1978**, 38, 51–59. [Google Scholar] [CrossRef] [Green Version] - Dębowski, Ł. Approximating Information Measures for Fields. Entropy
**2020**, 22, 79. [Google Scholar] [CrossRef] [PubMed] [Green Version] - Travers, N.F.; Crutchfield, J.P. Exact synchronization for finite-state sources. J. Stat. Phys.
**2011**, 145, 1181–1201. [Google Scholar] [CrossRef] [Green Version] - Travers, N.F.; Crutchfield, J.P. Asymptotic synchronization for finite-state sources. J. Stat. Phys.
**2011**, 145, 1202–1223. [Google Scholar] [CrossRef] [Green Version] - Travers, N.F.; Crutchfield, J.P. Infinite Excess Entropy Processes with Countable-State Generators. Entropy
**2014**, 16, 1396–1413. [Google Scholar] [CrossRef] [Green Version] - Blackwell, D. The entropy of functions of finite-state Markov chains. In Transactions of the First Prague Conference on Information Theory, Statistical Decision Functions, Random Processes; Czechoslovak Academy of Sciences: Prague, Czech Republic, 1957; pp. 13–20. [Google Scholar]
- Ephraim, Y.; Merhav, N. Hidden Markov processes. IEEE Trans. Inf. Theory
**2002**, 48, 1518–1569. [Google Scholar] [CrossRef] [Green Version] - Han, G.; Marcus, B. Analyticity of entropy rate of hidden Markov chain. IEEE Trans. Inf. Theory
**2006**, 52, 5251–5266. [Google Scholar] [CrossRef] [Green Version] - Jacquet, P.; Seroussi, G.; Szpankowski, W. On the entropy of a hidden Markov process. Theor. Comput. Sci.
**2008**, 395, 203–219. [Google Scholar] [CrossRef] [PubMed] [Green Version]

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Dębowski, Ł.
A Refutation of Finite-State Language Models through Zipf’s Law for Factual Knowledge. *Entropy* **2021**, *23*, 1148.
https://doi.org/10.3390/e23091148
