Approximating Information Measures for Fields

We supply corrected proofs of the invariance of completion and the chain rule for the Shannon information measures of arbitrary fields, as stated by Dębowski in 2009. Our corrected proofs rest on a number of auxiliary approximation results for Shannon information measures, which may be of independent interest. As also discussed briefly in this article, the generalized calculus of Shannon information measures for fields, including the invariance of completion and the chain rule, is useful in particular for studying the ergodic decomposition of stationary processes and its links with statistical modeling of natural language.


Definition 1.
For finite partitions α = {A_i}_{i=1}^I and β = {B_j}_{j=1}^J and a probability measure P, the entropy and mutual information are defined as

H_P(α) := − Σ_{i=1}^I P(A_i) log P(A_i),
I_P(α; β) := Σ_{i=1}^I Σ_{j=1}^J P(A_i ∩ B_j) log [ P(A_i ∩ B_j) / (P(A_i) P(B_j)) ].  (1)

Subsequently, for an arbitrary field C and finite partitions α and β, we define the pointwise conditional entropy and mutual information as

H_P(α‖C) := H_{P(·|C)}(α),  I_P(α; β‖C) := I_{P(·|C)}(α; β),  (2)

where P(E|C) is the conditional probability of event E ∈ J with respect to the smallest complete σ-field containing C. Subsequently, for arbitrary fields A, B, and C, the (average) conditional entropy and mutual information are defined as

H_P(A|C) := sup_{α⊂A} E_P H_P(α‖C),  I_P(A; B|C) := sup_{α⊂A} sup_{β⊂B} E_P I_P(α; β‖C),  (3)

where the suprema are taken over all finite subpartitions α ⊂ A and β ⊂ B, and E_P X := ∫ X dP is the expectation. Finally, we define the unconditional entropy H_P(A) := H_P(A|{∅, Ω}) and mutual information I_P(A; B) := I_P(A; B|{∅, Ω}), as is generally done in information theory. When the probability measure P is clear from the context, we omit the subscript P from all of the above notations.
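To make the finite-partition formulas (1) concrete, here is a minimal numerical sketch in Python, using natural logarithms; the function names `entropy` and `mutual_information` are ours, not part of any cited source. The last lines check the identity H(α) = I(α; α) discussed below.

```python
import numpy as np

def entropy(p):
    """Entropy H(alpha) = -sum_i P(A_i) log P(A_i) of a finite partition,
    represented by its probability vector p (zero cells contribute 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(joint):
    """Mutual information I(alpha; beta) from the joint cell probabilities
    P(A_i & B_j), given as an I x J matrix, following formula (1)."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1)  # marginal P(A_i)
    pb = joint.sum(axis=0)  # marginal P(B_j)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / np.outer(pa, pb)[mask]))

# A two-cell example: pairing alpha with itself puts all mass on the diagonal.
p = np.array([0.25, 0.75])
print(entropy(p))                     # H(alpha)
print(mutual_information(np.diag(p))) # equals H(alpha), i.e., H(A) = I(A;A)
```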
Although the above measures, called Shannon information measures, have usually been discussed for σ-fields, the defining equations (3) also make sense for fields. We observe a number of identities, such as H(A) = I(A; A) and H(A|C) = I(A; A|C). It is important to stress that Definition 1, in contrast to the earlier expositions by Dobrushin [5] and Pinsker [6], is simpler, as it applies one fewer Radon–Nikodym derivative, and does not require regular conditional probability, i.e., it does not demand that the conditional distribution (P(E|C))_{E∈J} be a probability measure almost surely. In fact, the expressions on the right-hand sides of the equations in (3) are defined for all A, B, and C. No problems arise when conditional probability is not regular, since the conditional distribution (P(E|C))_{E∈E} restricted to a finite field E is a probability measure almost surely [8] (Theorem 33.2).
We should admit that in the context of statistical language modeling, the respective probability space is countably generated, so regular conditional probability is guaranteed to exist. Thus, for linguistic applications, one might think that the expositions [5,6] are sufficient, although for didactic reasons, the approaches proposed by Wyner [7] and Dębowski [1] lead to a simpler and more general calculus of Shannon information measures. Yet, there is a more important reason for Definition 1. Namely, to discuss the ergodic decomposition of the entropy rate and excess entropy, some highly relevant results for statistical language modeling, developed in [1] and briefly recalled in Section 3, we need the invariance of Shannon information measures with respect to the completion of fields. But within the framework of Dobrushin [5] and Pinsker [6], such invariance of completion does not hold for strongly nonergodic processes, which seem to arise quite naturally in statistical modeling of natural language [1–3]. Thus, the approach proposed by Wyner [7] and Dębowski [1] is in fact indispensable.
Thus, let us inspect the problem of the invariance of Shannon information measures with respect to the completion of fields. A σ-field is called complete, with respect to a given probability measure P, if it contains all sets of outer P-measure 0. Let σ(A) denote the intersection of all complete σ-fields containing class A, i.e., σ(A) is the completion of the generated σ-field. Let A ∧ B denote the intersection of all fields that contain A and B. Assuming Definition 1, the following statement has been claimed true by Dębowski [1]:

Theorem 1. For arbitrary fields A, B, C, and D:
1. I_P(A; B|C) = I_P(σ(A); σ(B)|σ(C));
2. I_P(A; B ∧ D|C) = I_P(A; B|C) + I_P(A; D|B ∧ C).

The property stated in Theorem 1.1 will be referred to as the invariance of completion. It was not discussed by Wyner [7]. The property stated in Theorem 1.2 is usually referred to as the chain rule or the polymatroid identity. It was proved independently by Wyner [7].
As we have mentioned, the invariance of completion is crucial for proving the ergodic decomposition of the entropy rate and excess entropy of stationary processes. But the proof of the invariance of completion given by Dębowski [1] contains a mistake in the order of quantifiers, and the respective proof of the chain rule is too laconic and contains a gap. For this reason, we would like to supply the corrected proofs in this article. As we have mentioned, the chain rule was proved by Wyner [7], using an approximation result by Dobrushin [5] and Pinsker [6]. For completeness, we would like to provide a different proof of this approximation result, which follows easily from the invariance of completion, and to supply proofs of both parts of Theorem 1.
The corrected proofs of Theorem 1, to be presented in Section 2, are much longer than the original proofs by Dębowski [1]. In particular, for the sake of proving Theorem 1, we will discuss a few other approximation results, which seem to be of independent interest. To provide more context for our statements, in Section 3, we will also recall the ergodic decomposition of excess entropy and its application to statistical language modeling.

Proofs
Let us write B_n ↑ B for a sequence (B_n)_{n∈N} of fields such that B_1 ⊂ B_2 ⊂ · · · ⊂ B = ⋃_{n∈N} B_n. (B need not be a σ-field.) Our proof of Theorem 1 will rest on a few approximation results and on the following statement by Dębowski [1]:

Theorem 2 (Dębowski [1]). For arbitrary fields A, B, B', and C and a sequence of fields B_n ↑ B:
1. I_P(A; B|C) ≥ 0;
2. I_P(A; B|C) = I_P(B; A|C);
3. I_P(A; B|C) ≤ I_P(A; B'|C) for B ⊂ B';
4. lim_{n→∞} I_P(A; B_n|C) = I_P(A; B|C).

Let A^c := Ω \ A. Subsequently, let us denote the symmetric difference

A △ B := (A ∩ B^c) ∪ (A^c ∩ B).

The symmetric difference satisfies the following identities, which will be used:

A^c △ B^c = A △ B,
(⋃_i A_i) △ (⋃_i B_i) ⊂ ⋃_i (A_i △ B_i),
(⋂_i A_i) △ (⋂_i B_i) ⊂ ⋃_i (A_i △ B_i).

Moreover, we will apply the Bonferroni inequalities

P(⋃_i A_i) ≤ Σ_i P(A_i),  P(⋂_i A_i) ≥ 1 − Σ_i P(A_i^c),

and the inequality P(A) ≤ P(B) + P(A △ B).
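As a sanity check of these set-theoretic tools, the following Python snippet, our own construction and not part of the proofs, verifies the stated identities and inequalities by brute force on a small finite sample space with the uniform measure.

```python
import random

def sym_diff(a, b):
    """Symmetric difference A △ B = (A ∩ B^c) ∪ (A^c ∩ B)."""
    return (a - b) | (b - a)

omega = set(range(12))                 # a small sample space
prob = lambda e: len(e) / len(omega)   # the uniform probability measure

random.seed(0)
for _ in range(100):
    a, b, c, d = ({x for x in omega if random.random() < 0.5} for _ in range(4))
    # Complements do not change the symmetric difference.
    assert sym_diff(omega - a, omega - b) == sym_diff(a, b)
    # Unions and intersections only shrink it.
    assert sym_diff(a | c, b | d) <= sym_diff(a, b) | sym_diff(c, d)
    assert sym_diff(a & c, b & d) <= sym_diff(a, b) | sym_diff(c, d)
    # P(A) <= P(B) + P(A △ B), and the Bonferroni union bound.
    assert prob(a) <= prob(b) + prob(sym_diff(a, b))
    assert prob(a | b | c) <= prob(a) + prob(b) + prob(c)
print("all identities verified")
```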
In the following, we will derive the necessary approximation results. Our point of departure is the following folklore fact.
Theorem 3 (approximation of σ-fields). For any field K and any event G ∈ σ(K), there is a sequence of events K_1, K_2, · · · ∈ K such that

lim_{n→∞} P(G △ K_n) = 0.  (10)

Proof. Denote the class of events G that satisfy (10) as G. It is sufficient to show that G is a complete σ-field that contains the field K. Clearly, all G ∈ K satisfy (10), so G ⊃ K. Now, we verify the conditions for G to be a σ-field.
For A ∈ G, consider K_1, K_2, · · · ∈ K such that lim_{n→∞} P(A △ K_n) = 0. Then,

P(A^c △ K_n^c) = P(A △ K_n) → 0,

so A^c ∈ G. Moreover, for a sequence of events A_1, A_2, · · · ∈ G, we may choose events K_{i,n} ∈ K such that P(A_i △ K_{i,n}) ≤ 2^{-i}/n, whereas

(⋃_{i=1}^n A_i) △ (⋃_{i=1}^n K_{i,n}) ⊂ ⋃_{i=1}^n (A_i △ K_{i,n}).

Hence, since ⋃_{i=1}^n K_{i,n} ∈ K,

P((⋃_{i∈N} A_i) △ (⋃_{i=1}^n K_{i,n})) ≤ P((⋃_{i∈N} A_i) \ (⋃_{i=1}^n A_i)) + Σ_{i=1}^n P(A_i △ K_{i,n}) ≤ P((⋃_{i∈N} A_i) \ (⋃_{i=1}^n A_i)) + 1/n,

which tends to 0 for n going to infinity.
Completeness of the σ-field G is straightforward since, for any A ∈ G and any A' with P(A △ A') = 0, we obtain A' ∈ G using the same sequence of approximating events in the field K as for the event A.
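For illustration, here is a small sketch of Theorem 3 in a concrete case of our own choosing: the Lebesgue measure on [0, 1), the field K of finite unions of dyadic intervals [k/2^n, (k + 1)/2^n), and the event G = [0, 1/3), which lies in σ(K) but not in K itself. Its dyadic truncations approximate it in measure.

```python
from fractions import Fraction

# The event G = [0, 1/3) is approximated by K_n = [0, floor(2^n / 3) / 2^n),
# a member of the field K of finite unions of dyadic intervals.
G_right = Fraction(1, 3)
for n in range(1, 11):
    k = (2**n) // 3                  # largest k with k/2^n <= 1/3
    K_right = Fraction(k, 2**n)
    gap = G_right - K_right          # P(G △ K_n) = 1/3 - k/2^n
    print(n, K_right, float(gap))
    assert gap <= Fraction(1, 2**n)  # vanishes as n grows, as Theorem 3 asserts
```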
The second approximation result is the following bound:

Theorem 4 (continuity of entropy). Fix an ε ∈ (0, e^{-1}] and a field C. For finite partitions α = {A_i}_{i=1}^I and α' = {A'_i}_{i=1}^I such that P(A_i △ A'_i) ≤ ε for all i ∈ {1, . . . , I}, we have

|E_P H_P(α‖C) − E_P H_P(α'‖C)| ≤ I√ε log(I/√ε).

Proof. We have the expectation

E_P P(A_i △ A'_i|C) = P(A_i △ A'_i) ≤ ε.

Hence, by the Markov inequality we obtain

P(P(A_i △ A'_i|C) > √ε) ≤ √ε.

Denote

B := ⋂_{i=1}^I { P(A_i △ A'_i|C) ≤ √ε }.

From the Bonferroni inequality, we obtain P(B^c) ≤ I√ε. Subsequently, we observe that |H(α‖C) − H(α'‖C)| ≤ log I holds almost surely. Hence,

|E H(α‖C) − E H(α'‖C)| ≤ E [ |H(α‖C) − H(α'‖C)| 1_B ] + P(B^c) log I.  (17)

Function −x log x is subadditive and increasing for x ∈ (0, e^{-1}]. In particular, we have |(x + y) log(x + y) − x log x| ≤ −y log y for x, y ≥ 0. Thus, on the event B we obtain

|H(α‖C) − H(α'‖C)| ≤ Σ_{i=1}^I |P(A'_i|C) log P(A'_i|C) − P(A_i|C) log P(A_i|C)| ≤ −I√ε log √ε.  (18)

Plugging (18) into (17) yields the claim.

Armed with Theorems 3 and 4, we may demonstrate the invariance of completion.

Proof of Theorem 1.1 (invariance of completion): Since the conditional probability P(·|C) is defined with respect to the smallest complete σ-field containing C, we have I_P(α; β‖C) = I_P(α; β‖σ(C)) for all finite partitions α and β. Hence, and since A ⊂ σ(A) and B ⊂ σ(B), it suffices to show that

E_P I_P(α; β‖σ(C)) ≤ I_P(A; B|C)  (21)

for all finite partitions α ⊂ σ(A) and β ⊂ σ(B). Thus, fix such partitions α = {A_i}_{i=1}^I and β = {B_j}_{j=1}^J and an η > 0. By Theorem 3, there exist events L_1, . . . , L_J ∈ B such that P(B_j △ L_j) ≤ η. Define the partition β' = {B'_j}_{j=1}^{J+1} ⊂ B by B'_j := L_j \ ⋃_{k<j} L_k for j ≤ J and B'_{J+1} := (⋃_{j=1}^J L_j)^c, and put B_{J+1} := ∅. Now, we observe for j, k ∈ {1, . . . , J} and j ≠ k that

P(L_j ∩ L_k) ≤ P(B_j △ L_j) + P(B_k △ L_k) ≤ 2η,

since B_j ∩ B_k = ∅. Hence, by the Bonferroni inequality we derive

P(B_j △ B'_j) ≤ P(B_j △ L_j) + Σ_{k≠j} P(L_j ∩ L_k) ≤ (2J + 1)η.

Resuming our bounds, we obtain

P((A_i ∩ B_j) △ (A_i ∩ B'_j)) ≤ P(B_j △ B'_j) ≤ (2J + 1)η

for all i ∈ {1, . . . , I} and j ∈ {1, . . . , J + 1}. Then, invoking Theorem 4 for the identity I(α; β‖σ(C)) = H(α‖σ(C)) + H(β‖σ(C)) − H(α ∧ β‖σ(C)) yields

|E_P I_P(α; β‖σ(C)) − E_P I_P(α; β'‖σ(C))| ≤ δ(η),

where δ(η) → 0 as η → 0. Taking η sufficiently small and repeating the same approximation for α with a partition α' ⊂ A, we obtain (21), which is the desired claim.
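Before moving on, here is a numerical illustration of Theorem 4 in the simplest case C = {∅, Ω}, where E H(α‖C) = H(α), with the bound in the form reconstructed above, I√ε log(I/√ε). The setup and names are ours: a partition of a finite sample space is perturbed on a small fraction of points, and the entropy moves by far less than the bound.

```python
import numpy as np

rng = np.random.default_rng(1)
N, I = 10_000, 4                     # sample space size and number of cells
labels = rng.integers(0, I, size=N)  # partition alpha as a labelling of omega

def entropy_of_labels(lab):
    p = np.bincount(lab, minlength=I) / len(lab)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

for flip_prob in [0.001, 0.01, 0.05]:
    # alpha' relabels a few points, so P(A_i △ A'_i) <= 2 * flip_prob =: eps.
    flips = rng.random(N) < flip_prob
    labels2 = np.where(flips, rng.integers(0, I, size=N), labels)
    eps = 2 * flip_prob
    diff = abs(entropy_of_labels(labels) - entropy_of_labels(labels2))
    bound = I * np.sqrt(eps) * np.log(I / np.sqrt(eps))
    print(f"eps={eps:.3f}  |H - H'|={diff:.4f}  bound={bound:.4f}")
    assert diff <= bound             # the entropy gap stays well below the bound
```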
A consequence of the above results is the following approximation result, proved by Dobrushin [5] and Pinsker [6] and used by Wyner [7] to demonstrate the chain rule. Applying the invariance of completion, we supply a different proof than Dobrushin [5] and Pinsker [6].
Theorem 5 (approximation of joins). For arbitrary fields A, B, and C,

I_P(A; σ(B ∧ C)) = sup_{β⊂B, γ⊂C} I_P(A; β ∧ γ),

where the supremum is taken over all finite subpartitions β ⊂ B and γ ⊂ C.

Proof. Define the class

E := { ⋃_{k=1}^n (B_k ∩ C_k) : B_k ∈ B, C_k ∈ C, n ∈ N }.

It can be easily verified that E is a field such that σ(E) = σ(B ∧ C). Thus, for all finite partitions β ⊂ B and γ ⊂ C, we have β ∧ γ ⊂ E. Moreover, by the definition of E, for each finite partition ε ⊂ E there exist finite partitions β ⊂ B and γ ⊂ C such that the partition β ∧ γ is finer than ε. Hence, by Theorem 2.4, we obtain in this case

I_P(A; E) = sup_{β⊂B, γ⊂C} I_P(A; β ∧ γ).

In consequence, by Theorem 1.1, we obtain the claim

I_P(A; σ(B ∧ C)) = I_P(A; σ(E)) = I_P(A; E) = sup_{β⊂B, γ⊂C} I_P(A; β ∧ γ).

The final approximation result, which we need to prove the chain rule, is as follows:

Theorem 6 (convergence of conditioning). Let α = {A_i}_{i=1}^I be a finite partition and let C be a field. For each ε > 0, there exists a finite partition γ_ε ⊂ σ(C) such that for any partition γ ⊂ σ(C) finer than γ_ε we have

|E_P H_P(α‖C) − E_P H_P(α‖γ)| ≤ ε.

Proof. Fix an ε > 0. For each n ∈ N and A ∈ J, the partition

γ_A^n := { { k/n ≤ P(A|C) < (k + 1)/n } : k ∈ {0, 1, . . . , n} }

is finite and belongs to σ(C). If we consider the partition γ_ε := γ_{A_1}^n ∧ · · · ∧ γ_{A_I}^n, it remains finite and still satisfies γ_ε ⊂ σ(C). Let a partition γ ⊂ σ(C) be finer than γ_ε. Then,

|P(A_i|C) − P(A_i|γ)| ≤ 1/n

almost surely for all i ∈ {1, . . . , I}. We also observe

|H(α‖C) − H(α‖γ)| ≤ Σ_{i=1}^I |P(A_i|C) log P(A_i|C) − P(A_i|γ) log P(A_i|γ)|.

We recall that function −x log x is subadditive and increasing for x ∈ (0, e^{-1}]. In particular, we have |(x + y) log(x + y) − x log x| ≤ −y log y for x, y ≥ 0. Hence, for n ≥ e we obtain almost surely

|H(α‖C) − H(α‖γ)| ≤ −I n^{-1} log n^{-1} = n^{-1} I log n.

Taking n so large that n^{-1} I log n ≤ ε yields the claim.
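A numerical illustration of Theorem 6, in a construction of our own: on a finite sample space, we condition the two-cell partition α = {A, A^c} first on a fine field C and then on the level-set partitions γ built from P(A|C) at resolution 1/n, and we observe that the gap obeys the bound n^{-1} I log n.

```python
import numpy as np

rng = np.random.default_rng(2)
N, cells = 100_000, 1000
cell = rng.integers(0, cells, size=N)  # the field C: generated by a fine partition
in_A = rng.random(N) < cell / cells    # event A, correlated with the cell index

def cond_entropy(labels, event):
    """E H(alpha || gamma) = H(alpha | gamma) for alpha = {A, A^c} and the finite
    partition gamma encoded by `labels` (empirical measure, natural logarithm)."""
    h = 0.0
    for g in np.unique(labels):
        sel = labels == g
        p, q = sel.mean(), event[sel].mean()
        for x in (q, 1.0 - q):
            if x > 0:
                h -= p * x * np.log(x)
    return h

p_A_given_C = np.array([in_A[cell == c].mean() for c in range(cells)])
h_C = cond_entropy(cell, in_A)         # conditioning on the full field C
for n in [4, 16, 64, 256]:
    # Level sets {k/n <= P(A|C) < (k+1)/n}: the partition gamma_eps of Theorem 6.
    gamma = (p_A_given_C[cell] * n).astype(int)
    err = abs(h_C - cond_entropy(gamma, in_A))
    bound = 2 * np.log(n) / n          # n^{-1} I log n with I = 2
    print(f"n={n:4d}  gap={err:.5f}  bound={bound:.5f}")
    assert err <= bound
```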
Taking the above into account, we can demonstrate the chain rule. Our proof essentially follows the ideas of Wyner [7], except for invoking Theorem 6.
Proof of Theorem 1.2 (chain rule): Let A, B, C, and D be arbitrary fields, and let α, β, γ, and δ be finite partitions. The point of our departure is the chain rule for finite partitions [9] (Equation 2.60),

I_P(α; β ∧ δ|γ) = I_P(α; β|γ) + I_P(α; δ|β ∧ γ).

By Theorem 6, conditioning on an appropriately fine finite partition γ ⊂ σ(C) approximates conditioning on the field C, whereas by Theorems 2 and 5, taking suprema over finite partitions β ⊂ B and δ ⊂ D recovers the join B ∧ D. Hence, passing to the suprema over finite partitions and applying the invariance of completion (Theorem 1.1), we obtain

I_P(A; B ∧ D|C) = I_P(A; B|C) + I_P(A; D|B ∧ C)

for an arbitrary field A, taking its appropriately fine finite partitions.
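Since the chain rule for finite partitions is the engine of the above proof, the following Python check, our own construction, verifies it on a random joint distribution of four finite partitions, computing conditional mutual information from entropies of marginal tables.

```python
import numpy as np
from itertools import chain

rng = np.random.default_rng(3)
# Joint cell probabilities of four finite partitions alpha, beta, gamma, delta.
P = rng.random((3, 4, 2, 3))
P /= P.sum()

def H(*groups):
    """Entropy of the marginal of P over the union of the listed axis groups."""
    keep = set(chain(*groups))
    drop = tuple(a for a in range(P.ndim) if a not in keep)
    m = np.atleast_1d(P.sum(axis=drop)).ravel()
    m = m[m > 0]
    return -np.sum(m * np.log(m))

def I(x, y, z=()):
    """Conditional mutual information I(x; y | z) computed from entropies."""
    return H(x, z) + H(y, z) - H(x, y, z) - H(z)

A, B, C, D = (0,), (1,), (2,), (3,)
lhs = I(A, B + D, C)               # I(alpha; beta ∧ delta | gamma)
rhs = I(A, B, C) + I(A, D, B + C)  # I(alpha; beta | gamma) + I(alpha; delta | beta ∧ gamma)
print(lhs, rhs)
assert np.isclose(lhs, rhs)        # Equation 2.60 of [9], the finite chain rule
```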

Applications
This section borrows its statements largely from Dębowski [1–3] and is provided only to sketch some context for our research and to justify its applicability to statistical language modeling. Let (X_i)_{i∈Z} be a two-sided infinite stationary process over a countable alphabet X on a probability space (X^Z, X^Z, P), where X_k((ω_i)_{i∈Z}) := ω_k. We denote the random blocks X_j^k := (X_i)_{j≤i≤k} and the complete σ-fields G_j^k := σ(X_j^k) generated by them. By the generalized calculus of Shannon information measures, i.e., Theorems 1 and 2, we can define the entropy rate h_P and the excess entropy E_P of the process (X_i)_{i∈Z} as

h_P := H_P(G_0^0 | G_{-∞}^{-1}),  E_P := I_P(G_{-∞}^{-1}; G_0^∞);

see [10] for more background. Let T((ω_i)_{i∈Z}) := (ω_{i+1})_{i∈Z} be the shift operation and let I := { A ∈ X^Z : T^{-1}(A) = A } be the invariant σ-field. By the Birkhoff ergodic theorem [11], we have σ(I) ⊂ σ(G_{-∞}) ∩ σ(G_∞) for the tail σ-fields G_{-∞} := ⋂_{n=1}^∞ G_{-∞}^{-n} and G_∞ := ⋂_{n=1}^∞ G_n^∞. Hence, by Theorems 1 and 2, we further obtain the expressions

h_P = H_P(G_0^0 | G_{-∞}^{-1} ∧ I),  E_P = H_P(I) + I_P(G_{-∞}^{-1}; G_0^∞ | I).

Denoting the conditional probability F(A) := P(A|I), which is a random stationary ergodic measure by the ergodic decomposition theorem [12], we notice that H_P(G_0^0|G_{-∞}^{-1} ∧ I) = E_P H_F(G_0^0|G_{-∞}^{-1}) and I_P(G_{-∞}^{-1}; G_0^∞|I) = E_P I_F(G_{-∞}^{-1}; G_0^∞), and consequently we obtain the ergodic decomposition of the entropy rate and excess entropy, which reads

h_P = E_P h_F,  (45)
E_P = E_P E_F + H_P(I).  (46)

Formulae (45) and (46) were derived by Gray and Davisson [13] and Dębowski [1], respectively. The ergodic decomposition of the entropy rate (45) states that a stationary process is asymptotically deterministic, i.e., h_P = 0, if and only if almost all its ergodic components are asymptotically deterministic, i.e., h_F = 0 almost surely. In contrast, the ergodic decomposition of the excess entropy (46) states that a stationary process is infinitary, i.e., E_P = ∞, if some of its ergodic components are infinitary, i.e., E_F = ∞ with a nonzero probability, or if H_P(I) = ∞, i.e., in particular, if the process is strongly nonergodic; see [14,15].
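Formula (46) can be watched at work in a toy example of our own: a mixture of two i.i.d. Bernoulli processes is nonergodic with H_P(I) = log 2, whereas its ergodic components satisfy E_F = 0, so (46) predicts E_P = log 2. The sketch below computes the mutual information between past and future blocks of growing length n exactly and shows it approaching log 2 nats.

```python
import numpy as np

def block_probs(theta, n):
    """Probabilities of all binary blocks of length n under i.i.d. Bernoulli(theta)."""
    p = np.array([1 - theta, theta])
    out = np.ones(1)
    for _ in range(n):
        out = np.outer(out, p).ravel()
    return out

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Mixture of two ergodic components: i.i.d. Bernoulli(0.1) and Bernoulli(0.9),
# each chosen with probability 1/2 by the invariant sigma-field.
for n in range(1, 9):
    p1, p2 = block_probs(0.1, n), block_probs(0.9, n)
    marginal = 0.5 * p1 + 0.5 * p2                      # law of a length-n block
    joint = 0.5 * np.outer(p1, p1) + 0.5 * np.outer(p2, p2)
    mi = 2 * H(marginal) - H(joint.ravel())             # I(past_n; future_n)
    print(f"n={n}  I(past_n; future_n)={mi:.4f} nats")
print("H_P(I) =", np.log(2), "nats")                    # the limit predicted by (46)
```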
The linguistic interpretation of the above results is as follows. There is a hypothesis by Hilberg [16] that the excess entropy of natural language is infinite. This hypothesis can be partly confirmed by the original estimates of conditional entropy by Shannon [17], by the power-law decay of the estimates of the entropy rate given by the PPM compression algorithm [18], by the approximately power-law growth of vocabulary called Heaps' or Herdan's law [2,3,19,20], and by some other experiments applying neural statistical language models [21,22]. In parallel, Dębowski [1–3] supposed that the very large excess entropy of natural language may be caused by the fact that texts in natural language describe some relatively slowly evolving and very complex reality. Indeed, it can be mathematically proved that if the abstract reality described by random texts is unchangeable and infinitely complex, then the resulting stochastic process is strongly nonergodic, with H_P(I) = ∞ in particular [1–3]. Consequently, its excess entropy is infinite by formula (46). We suppose that a similar mechanism may work for natural language; see [23–26] for further examples of abstract stochastic mechanisms leading to infinitary processes.