Abstract
We supply corrected proofs of the invariance of completion and the chain rule for the Shannon information measures of arbitrary fields, as stated by Dębowski in 2009. Our corrected proofs rest on a number of auxiliary approximation results for Shannon information measures, which may be of independent interest. As also discussed briefly in this article, the generalized calculus of Shannon information measures for fields, including the invariance of completion and the chain rule, is useful in particular for studying the ergodic decomposition of stationary processes and its links with statistical modeling of natural language.
MSC:
94A17
1. Introduction
As noticed by Dębowski [1,2,3], a generalized calculus of Shannon information measures for arbitrary fields, initiated by Gelfand et al. [4] and later developed by Dobrushin [5], Pinsker [6], and Wyner [7], is useful in particular for studying the ergodic decomposition of stationary processes and its links with statistical modeling of natural language. Fulfilling this need, Dębowski [1] developed the calculus of Shannon information measures for arbitrary fields, relaxing the requirement of regular conditional probability, assumed implicitly by Dobrushin [5] and Pinsker [6]. He did so unaware of the classical paper by Wyner [7], which pursued exactly the same idea, with some differences stemming from an independent motivation.
Compared to the exposition in [7], the added value of paper [1] was its treatment of the continuity of Shannon information measures and their invariance with respect to completion of fields. Unfortunately, the proof of Theorem 2 in [1] establishing this invariance and the generalized chain rule contains some mistakes and gaps, which we have discovered recently. For this reason, in this article, we would like to provide a correction and a few new auxiliary results which may be of independent interest. In this way, we will complete the full generalization of Shannon information measures and their properties, which was developed step by step by Gelfand et al. [4], Dobrushin [5], Pinsker [6], Wyner [7], and Dębowski [1]. Along the way, we will also revisit the linguistic motivations of our results.
The preliminaries are as follows. Fix a probability space . Fields are set algebras closed under finite Boolean operations, whereas σ-fields are assumed to be closed also under countable unions and intersections. A field is called finite if it has finitely many elements. A finite partition is a finite collection of events which are disjoint and whose union equals the whole sample space. The definition proposed independently by Wyner [7] and Dębowski [1] reads as follows:
Definition 1.
For finite partitions α and β and a probability measure P, the entropy and mutual information are defined as
Subsequently, for an arbitrary field and finite partitions α and β, we define the pointwise conditional entropy and mutual information as
where is the conditional probability of event with respect to the smallest complete σ-field containing . Subsequently, for arbitrary fields , , and , the (average) conditional entropy and mutual information are defined as
where the supremum is taken over all finite subpartitions and is the expectation. Finally, we define the unconditional entropy and mutual information, as is generally done in information theory. When the probability measure P is clear from the context, we omit subscript P from all of the above notations.
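For orientation, the standard Shannon formulas that Definition 1 builds on can be sketched as follows. This is a reconstruction assuming the usual notation (H_P and I_P for entropy and mutual information, and P(A | G) for the conditional probability of A given the smallest complete σ-field containing the field G); it is not a verbatim copy of the original displays.

```latex
% A sketch of the standard definitions assumed above (the notation is ours).
\begin{align*}
  H_P(\alpha) &= -\sum_{A \in \alpha} P(A) \log P(A),
  &
  I_P(\alpha;\beta) &= \sum_{A \in \alpha} \sum_{B \in \beta}
      P(A \cap B) \log \frac{P(A \cap B)}{P(A)\,P(B)},
  \\
  H(\alpha \mid \mathcal{G}) &= -\sum_{A \in \alpha}
      P(A \mid \mathcal{G}) \log P(A \mid \mathcal{G}),
  &
  I(\alpha;\beta \mid \mathcal{G}) &= \sum_{A \in \alpha} \sum_{B \in \beta}
      P(A \cap B \mid \mathcal{G})
      \log \frac{P(A \cap B \mid \mathcal{G})}
                {P(A \mid \mathcal{G})\,P(B \mid \mathcal{G})},
  \\
  H_P(\mathcal{A} \mid \mathcal{G}) &=
      \sup_{\alpha \subset \mathcal{A}} \mathbf{E}_P\, H(\alpha \mid \mathcal{G}),
  &
  I_P(\mathcal{A};\mathcal{B} \mid \mathcal{G}) &=
      \sup_{\alpha \subset \mathcal{A},\, \beta \subset \mathcal{B}}
      \mathbf{E}_P\, I(\alpha;\beta \mid \mathcal{G}),
\end{align*}
```

with the suprema running over finite subpartitions; the unconditional quantities are then obtained by conditioning on the trivial field {∅, Ω}.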
Although the above measures, called Shannon information measures, have usually been discussed for σ-fields, the defining equations (3) also make sense for fields. We observe a number of identities, such as and . It is important to stress that Definition 1, in contrast to the earlier expositions by Dobrushin [5] and Pinsker [6], is simpler, as it applies one Radon–Nikodym derivative less, and it does not require regular conditional probability, i.e., it does not demand that the conditional distribution be a probability measure almost surely. In fact, the expressions on the right-hand sides of the equations in (3) are defined for all , , and . No problems arise when conditional probability is not regular, since a conditional distribution restricted to a finite field is a probability measure almost surely [8] (Theorem 33.2).
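As a purely numerical illustration of such identities for finite partitions (a minimal sketch with made-up probabilities, not part of the original argument), one can check, for instance, that the mutual information of a partition with itself equals its entropy and that mutual information never exceeds either entropy:

```python
import numpy as np

# Joint probabilities of two finite partitions alpha (rows) and beta (columns);
# the numbers are made up for illustration only.
joint = np.array([[0.30, 0.10],
                  [0.05, 0.55]])

def entropy(p):
    """Shannon entropy of a probability vector (natural logarithms)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(joint):
    """Mutual information of two finite partitions with the given joint probabilities."""
    pa = joint.sum(axis=1)   # marginal distribution of alpha
    pb = joint.sum(axis=0)   # marginal distribution of beta
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / np.outer(pa, pb)[mask]))

pa, pb = joint.sum(axis=1), joint.sum(axis=0)
# I(alpha; alpha) = H(alpha): the joint distribution of alpha with itself is diagonal.
print(np.isclose(mutual_information(np.diag(pa)), entropy(pa)))        # True
# I(alpha; beta) <= min(H(alpha), H(beta)).
print(mutual_information(joint) <= min(entropy(pa), entropy(pb)))      # True
```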
We should admit that, in the context of statistical language modeling, the respective probability space is countably generated, so regular conditional probability is guaranteed to exist. Thus, for linguistic applications, one might think that expositions [5,6] are sufficient, although, for didactic reasons, the approaches proposed by Wyner [7] and Dębowski [1] lead to a simpler and more general calculus of Shannon information measures. Yet, there is a more important reason for Definition 1. Namely, to discuss the ergodic decomposition of the entropy rate and excess entropy (some highly relevant results for statistical language modeling, developed in [1] and briefly recalled in Section 3), we need the invariance of Shannon information measures with respect to completion of fields. But within the framework of Dobrushin [5] and Pinsker [6], such invariance of completion does not hold for strongly nonergodic processes, which seem to arise quite naturally in statistical modeling of natural language [1,2,3]. Thus, the approach proposed by Wyner [7] and Dębowski [1] is in fact indispensable.
Thus, let us inspect the problem of invariance of Shannon information measures with respect to completion of fields. A σ-field is called complete, with respect to a given probability measure P, if it contains all sets of outer P-measure 0. Let denote the intersection of all complete σ-fields containing class , i.e., is the completion of the generated σ-field. Let denote the intersection of all fields that contain and . Assuming Definition 1, the following statement was claimed to be true by Dębowski [1] (Theorem 2):
Theorem 1.
Let , , , and be subfields of .
- 1.  (invariance of completion);
- 2.  (chain rule).
The property stated in Theorem 1.1 will be referred to as the invariance of completion. It was not discussed by Wyner [7]. The property stated in Theorem 1.2 is usually referred to as the chain rule or the polymatroid identity. It was proved independently by Wyner [7].
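In symbols, and with the caveat that the particular notation below is ours rather than quoted from the original, the completion and join operations used in Theorem 1 read as follows.

```latex
% Completion of the generated sigma-field and join of two fields
% (our notation, restating the surrounding prose in symbols).
\sigma(\mathcal{A})^{*}
  = \bigcap \bigl\{ \mathcal{F} \supseteq \mathcal{A} :
      \mathcal{F} \text{ is a complete } \sigma\text{-field} \bigr\},
\qquad
\mathcal{A} \vee \mathcal{B}
  = \bigcap \bigl\{ \mathcal{F} \supseteq \mathcal{A} \cup \mathcal{B} :
      \mathcal{F} \text{ is a field} \bigr\}.
```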
As we have mentioned, the invariance of completion is crucial for proving the ergodic decomposition of the entropy rate and excess entropy of stationary processes. But the proof of the invariance of completion given by Dębowski [1] contains a mistake in the order of quantifiers, and the respective proof of the chain rule is too laconic and contains a gap. For this reason, we would like to supply the corrected proofs in this article. As we have mentioned, the chain rule was proved by Wyner [7], using an approximation result by Dobrushin [5] and Pinsker [6]. For completeness, we would like to provide a different proof of this approximation result, which follows easily from the invariance of completion, and to supply proofs of both parts of Theorem 1.
The corrected proofs of Theorem 1, to be presented in Section 2, are much longer than the original proofs by Dębowski [1]. In particular, for the sake of proving Theorem 1, we will discuss a few other approximation results, which seem to be of independent interest. To provide more context for our statements, in Section 3, we will also recall the ergodic decomposition of excess entropy and its application to statistical language modeling.
2. Proofs
Let us write for a sequence of fields such that . ( need not be a σ-field.) Our proof of Theorem 1 will rest on a few approximation results and on the following statement by Dębowski [1] (Theorem 1):
Theorem 2.
Let , , , and be subfields of .
- 1. ;
- 2.  with the equality if and only if almost surely for all and ;
- 3. ;
- 4.  if ;
- 5.  for .
Let . Subsequently, let us denote the symmetric difference
The symmetric difference satisfies the following identities, which will be used:
Moreover, we will apply the Bonferroni inequalities
and inequality .
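The displayed facts are standard. Under the assumption that the intended ones are the usual set-algebraic bounds (the selection below is ours and may differ slightly from the original display), they can be sketched as:

```latex
% Standard facts about symmetric differences and the union bound (a sketch).
\begin{gather*}
  A \bigtriangleup B = B \bigtriangleup A, \qquad
  A^{c} \bigtriangleup B^{c} = A \bigtriangleup B, \qquad
  A \bigtriangleup C \subseteq (A \bigtriangleup B) \cup (B \bigtriangleup C), \\
  \Bigl(\bigcup_i A_i\Bigr) \bigtriangleup \Bigl(\bigcup_i B_i\Bigr)
    \subseteq \bigcup_i (A_i \bigtriangleup B_i), \qquad
  \Bigl(\bigcap_i A_i\Bigr) \bigtriangleup \Bigl(\bigcap_i B_i\Bigr)
    \subseteq \bigcup_i (A_i \bigtriangleup B_i), \\
  P\Bigl(\bigcup_{i=1}^{n} A_i\Bigr) \le \sum_{i=1}^{n} P(A_i), \qquad
  \bigl|P(A) - P(B)\bigr| \le P(A \bigtriangleup B).
\end{gather*}
```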
In the following, we will derive the necessary approximation results. Our point of departure is the following folklore fact.
Theorem 3
(approximation of σ-fields). For any field and any event , there is a sequence of events such that
Proof.
Denote the class of sets G that satisfy (10) as . It is sufficient to show that is a complete σ-field that contains the field . Clearly, all satisfy (10), so . Now, we verify the conditions for to be a σ-field.
- We have . Hence, .
- For , consider such that . Then, , where . Hence, .
- For , consider events such that . Then, Moreover, Hence, which tends to 0 as n goes to infinity. Since , we thus obtain that .
Completeness of the σ-field is straightforward since, for any and , we obtain using the same sequence of approximating events in the field as for event A. □
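As a toy illustration of Theorem 3 (not part of the proof; the field and the event below are our choices), take the unit interval with Lebesgue measure and the field K of finite unions of dyadic intervals. The event G = [0, 1/3) lies in the completed σ-field generated by K but not in K itself, yet it is approximated by the field events G_n = [0, ⌊2^n/3⌋/2^n) in the sense of the theorem:

```python
from fractions import Fraction

# P(G symdiff G_n) for G = [0, 1/3) and G_n = [0, floor(2^n / 3) / 2^n);
# since both are intervals with a common left endpoint, the symmetric
# difference is an interval whose length is the difference of the right endpoints.
G_right = Fraction(1, 3)
for n in range(1, 11):
    Gn_right = Fraction(2**n // 3, 2**n)        # a dyadic rational endpoint
    print(n, float(abs(G_right - Gn_right)))    # decays like 2^{-n}
```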
The second approximation result is the following bound:
Theorem 4
(continuity of entropy). Fix an and a field . For finite partitions and such that for all , we have
Proof.
We have the expectation . Hence, by the Markov inequality we obtain
Denote
From the Bonferroni inequality, we obtain . Subsequently, we observe that holds almost surely. Hence,
The function is subadditive and increasing for . In particular, we have for . Thus, on the event B we obtain
Plugging (18) into (17) yields the claim. □
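The proof above relies on the standard fact (presumably the one invoked in the omitted display) that the function η(x) = -x log x is subadditive and increasing on [0, e^{-1}]. To give a feel for the resulting bound in the simplest unconditional case, here is a minimal numerical sketch with made-up cell probabilities, where at most ε of probability mass is moved between corresponding cells of two partitions:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector (natural logarithms)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Cell probabilities of a partition alpha, and of a perturbed partition beta
# obtained by moving mass eps from the second cell to the first, so that
# P(A_i symdiff B_i) <= eps for every i (a toy construction, not from the paper).
p = np.array([0.5, 0.3, 0.2])
for eps in [0.1, 0.01, 0.001]:
    q = p + np.array([eps, -eps, 0.0])
    print(eps, abs(entropy(p) - entropy(q)))
# The difference of entropies vanishes as eps -> 0, in the spirit of Theorem 4.
```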
Now, we can prove the invariance of completion. Note that
Proof of Theorem 1.
1 (invariance of completion): Consider some measurable fields , , and . We are going to demonstrate
Equality is straightforward since almost surely for all . It remains to prove . To this end, it suffices to show that for any and any finite partitions and there exists a finite partition such that
Let us then fix some and finite partitions and . Invoking Theorem 3, we know that for each there exists a class of sets , which need not be a partition, such that
for all . Let us put and let us construct sets and for . Subsequently, we put for and . In this way, we obtain a partition .
The next step of the proof is to show an analogue of bound (22) for the partitions and . To begin, for , we have
Now, we observe for and that
Hence, by the Bonferroni inequality we derive
Collecting our bounds, we obtain
for all and . Then, invoking Theorem 4 yields
Taking sufficiently small, we obtain (21), which is the desired claim. □
A consequence of the above result is the following approximation result, proved by Dobrushin [5] and Pinsker [6] and used by Wyner [7] to demonstrate the chain rule. Applying the invariance of completion, we supply a proof different from that of Dobrushin [5] and Pinsker [6].
Theorem 5
(split of join). Let , , , and be subfields of . We have
where the supremum is taken over all finite subpartitions.
Proof.
Define the class
It can be easily verified that is a field such that . Thus, for all finite partitions and we have . Moreover, by the definition of , for each finite partition there exist finite partitions and such that the partition is finer than . Hence, by Theorem 2.4, we obtain in this case,
In consequence, by Theorem 1.1, we obtain the claim
□
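For orientation, and under the assumption that the notation introduced above is meant, the content of Theorem 5 can be read as saying that the supremum defining the mutual information of a join may be restricted to joins of finite subpartitions of the two factors, roughly:

```latex
% A sketch of the approximation stated in Theorem 5 (the notation is ours);
% alpha, beta, gamma range over finite partitions contained in the
% respective fields.
I_P(\mathcal{A} \vee \mathcal{B};\, \mathcal{C} \mid \mathcal{D})
  = \sup_{\alpha \subset \mathcal{A},\; \beta \subset \mathcal{B},\;
          \gamma \subset \mathcal{C}}
    \mathbf{E}_P\, I(\alpha \vee \beta;\, \gamma \mid \mathcal{D}).
```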
The final approximation result needed to prove the chain rule is as follows:
Theorem 6
(convergence of conditioning). Let be a finite partition and let be a field. For each , there exists a finite partition such that for any partition finer than we have
Proof.
Fix an . For each and , partition
is finite and belongs to . If we consider partition , it remains finite and still satisfies . Let a partition be finer than . Then,
almost surely for all . We also observe
We recall that the function is subadditive and increasing for . In particular, we have for . Hence, for we obtain almost surely
Taking n so large that yields the claim. □
Taking the above into account, we can demonstrate the chain rule. Our proof essentially follows the ideas of Wyner [7], except for invoking Theorem 6.
Proof of Theorem 1.
2 (chain rule): Let , , , and be arbitrary fields, and let , , , and be finite partitions. Our point of departure is the chain rule for finite partitions [9] (Equation 2.60)
By Definition 1 and Theorems 1.1, 5, and 6, the conditional mutual information can be approximated by , where we take appropriate limits over refined finite partitions with some care.
In particular, by Theorems 1.1, 5, and 6, taking sufficiently fine finite partitions of arbitrary fields and , the chain rule (38) for finite partitions implies
where all expressions are finite. Hence, we also obtain
where all expressions are finite. Having established the above claim for a finite partition , we generalize it to
for an arbitrary field , taking its appropriately fine finite partitions. □
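To make the finite-partition chain rule used as the point of departure above concrete, here is a quick numerical check for finite-valued random variables (a sketch with a randomly generated joint distribution; it verifies the identity I(X, Y; Z) = I(Y; Z) + I(X; Z | Y), which corresponds to the unconditional case of the rule):

```python
import numpy as np

rng = np.random.default_rng(0)
# A random joint distribution of three finite-valued variables X, Y, Z
# (axis 0 = X, axis 1 = Y, axis 2 = Z); the alphabet sizes are arbitrary.
p = rng.random((2, 3, 2))
p /= p.sum()

def H(*axes):
    """Joint entropy of the variables indexed by the given axes (natural logarithms)."""
    drop = tuple(a for a in range(3) if a not in axes)
    marg = p.sum(axis=drop) if drop else p
    marg = marg[marg > 0]
    return -np.sum(marg * np.log(marg))

# Chain rule for finite partitions / finite-valued random variables:
#   I(X, Y; Z) = I(Y; Z) + I(X; Z | Y)
lhs = H(0, 1) + H(2) - H(0, 1, 2)                                    # I(X, Y; Z)
rhs = (H(1) + H(2) - H(1, 2)) + (H(0, 1) + H(1, 2) - H(0, 1, 2) - H(1))
print(np.isclose(lhs, rhs))  # True
```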
3. Applications
This section borrows its statements largely from Dębowski [1,2,3] and is provided only to sketch some context for our research and to justify its applicability to statistical language modeling. Let be a two-sided infinite stationary process over a countable alphabet on a probability space , where . We denote the random blocks and the complete σ-fields generated by them. By the generalized calculus of Shannon information measures, i.e., Theorems 1 and 2, we can define the entropy rate and the excess entropy of the process as
see [10] for more background.
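Under the assumption that the entropy rate and excess entropy are defined in the usual way (cf. [10]), with the mutual information taken between the complete σ-fields generated by the past and the future of the process, the omitted display can be sketched as:

```latex
% A sketch of the usual definitions of entropy rate and excess entropy
% (the notation is ours).
h = \lim_{n \to \infty} \frac{H(X_1, X_2, \ldots, X_n)}{n},
\qquad
E = I\bigl( (X_k)_{k \le 0};\ (X_k)_{k \ge 1} \bigr).
```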
Let be the shift operation and let be the invariant σ-field. By the Birkhoff ergodic theorem [11], we have for the tail σ-fields and . Hence, by Theorems 1 and 2 we further obtain the expressions
Denoting the conditional probability , which is a random stationary ergodic measure by the ergodic decomposition theorem [12], we notice that and , and consequently we obtain the ergodic decomposition of the entropy rate and excess entropy, which reads
Formulae (45) and (46) were derived by Gray and Davisson [13] and by Dębowski [1], respectively. The ergodic decomposition of the entropy rate (45) states that a stationary process is asymptotically deterministic, i.e., , if and only if almost all its ergodic components are asymptotically deterministic, i.e., almost surely. In contrast, the ergodic decomposition of the excess entropy (46) states that a stationary process is infinitary, i.e., , if some of its ergodic components are infinitary, i.e., with a nonzero probability, or if , i.e., in particular, if the process is strongly nonergodic; see [14,15].
The linguistic interpretation of the above results is as follows. There is a hypothesis by Hilberg [16] that the excess entropy of natural language is infinite. This hypothesis is partly supported by the original estimates of conditional entropy by Shannon [17], by the power-law decay of the entropy rate estimates given by the PPM compression algorithm [18], by the approximately power-law growth of vocabulary known as Heaps' or Herdan's law [2,3,19,20], and by some other experiments applying neural statistical language models [21,22]. In parallel, Dębowski [1,2,3] conjectured that the very large excess entropy of natural language may be caused by the fact that texts in natural language describe some relatively slowly evolving and very complex reality. Indeed, it can be mathematically proved that if the abstract reality described by random texts is unchangeable and infinitely complex, then the resulting stochastic process is strongly nonergodic, i.e., in particular [1,2,3]. Consequently, its excess entropy is infinite by formula (46). We suppose that a similar mechanism may work for natural language; see [23,24,25,26] for further examples of abstract stochastic mechanisms leading to infinitary processes.
Funding
This research received no external funding.
Conflicts of Interest
The author declares no conflict of interest.
References
- Dębowski, Ł. A general definition of conditional information and its application to ergodic decomposition. Stat. Probab. Lett. 2009, 79, 1260–1268. [Google Scholar] [CrossRef]
- Dębowski, Ł. On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts. IEEE Trans. Inf. Theory 2011, 57, 4589–4599. [Google Scholar] [CrossRef]
- Dębowski, Ł. Is Natural Language a Perigraphic Process? The Theorem about Facts and Words Revisited. Entropy 2018, 20, 85. [Google Scholar] [CrossRef]
- Gelfand, I.M.; Kolmogorov, A.N.; Yaglom, A.M. Towards the general definition of the amount of information. Dokl. Akad. Nauk. SSSR 1956, 111, 745–748. (In Russian) [Google Scholar]
- Dobrushin, R.L. A general formulation of the fundamental Shannon theorems in information theory. Uspekhi Mat. Nauk. 1959, 14, 3–104. (In Russian) [Google Scholar]
- Pinsker, M.S. Information and Information Stability of Random Variables and Processes; Holden-Day: San Francisco, CA, USA, 1964. [Google Scholar]
- Wyner, A.D. A definition of conditional mutual information for arbitrary ensembles. Inf. Control. 1978, 38, 51–59. [Google Scholar] [CrossRef]
- Billingsley, P. Probability and Measure; John Wiley: New York, NY, USA, 1979. [Google Scholar]
- Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley: New York, NY, USA, 1991. [Google Scholar]
- Crutchfield, J.P.; Feldman, D.P. Regularities unseen, randomness observed: The entropy convergence hierarchy. Chaos 2003, 15, 25–54. [Google Scholar] [CrossRef]
- Birkhoff, G.D. Proof of the ergodic theorem. Proc. Natl. Acad. Sci. USA 1932, 17, 656–660. [Google Scholar] [CrossRef]
- Rokhlin, V.A. On the fundamental ideas of measure theory. Am. Math. Soc. Transl. Ser. 1 1962, 10, 1–54. [Google Scholar]
- Gray, R.M.; Davisson, L.D. The ergodic decomposition of stationary discrete random processes. IEEE Trans. Inf. Theory 1974, 20, 625–636. [Google Scholar] [CrossRef]
- Löhr, W. Properties of the Statistical Complexity Functional and Partially Deterministic HMMs. Entropy 2009, 11, 385–401. [Google Scholar] [CrossRef]
- Crutchfield, J.P.; Marzen, S. Signatures of infinity: Nonergodicity and resource scaling in prediction, complexity, and learning. Phys. Rev. E 2015, 91, 050106. [Google Scholar] [CrossRef]
- Hilberg, W. Der bekannte Grenzwert der redundanzfreien Information in Texten—eine Fehlinterpretation der Shannonschen Experimente? Frequenz 1990, 44, 243–248. [Google Scholar] [CrossRef]
- Shannon, C. Prediction and entropy of printed English. Bell Syst. Tech. J. 1951, 30, 50–64. [Google Scholar] [CrossRef]
- Takahira, R.; Tanaka-Ishii, K.; Dębowski, Ł. Entropy Rate Estimates for Natural Language—A New Extrapolation of Compressed Large-Scale Corpora. Entropy 2016, 18, 364. [Google Scholar] [CrossRef]
- Herdan, G. Quantitative Linguistics; Butterworths: London, UK, 1964. [Google Scholar]
- Heaps, H.S. Information Retrieval—Computational and Theoretical Aspects; Academic Press: New York, NY, USA, 1978. [Google Scholar]
- Hahn, M.; Futrell, R. Estimating Predictive Rate-Distortion Curves via Neural Variational Inference. Entropy 2019, 21, 640. [Google Scholar] [CrossRef]
- Braverman, M.; Chen, X.; Kakade, S.M.; Narasimhan, K.; Zhang, C.; Zhang, Y. Calibration, Entropy Rates, and Memory in Language Models. arXiv 2019, arXiv:1906.05664. [Google Scholar]
- Dębowski, Ł. Mixing, Ergodic, and Nonergodic Processes with Rapidly Growing Information between Blocks. IEEE Trans. Inf. Theory 2012, 58, 3392–3401. [Google Scholar] [CrossRef]
- Dębowski, Ł. On Hidden Markov Processes with Infinite Excess Entropy. J. Theor. Probab. 2014, 27, 539–551. [Google Scholar] [CrossRef]
- Travers, N.F.; Crutchfield, J.P. Infinite Excess Entropy Processes with Countable-State Generators. Entropy 2014, 16, 1396–1413. [Google Scholar] [CrossRef]
- Dębowski, Ł. Maximal Repetition and Zero Entropy Rate. IEEE Trans. Inf. Theory 2018, 64, 2212–2219. [Google Scholar] [CrossRef]
© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).