Approximating Information Measures for Fields

Dębowski, Łukasz

doi:10.3390/e22010079

Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248 Warszawa, Poland

Entropy 2020, 22(1), 79; https://doi.org/10.3390/e22010079

Submission received: 20 November 2019 / Revised: 6 January 2020 / Accepted: 8 January 2020 / Published: 9 January 2020

(This article belongs to the Special Issue Information Theory and Language)

Abstract

We supply corrected proofs of the invariance of completion and the chain rule for the Shannon information measures of arbitrary fields, as stated by Dębowski in 2009. Our corrected proofs rest on a number of auxiliary approximation results for Shannon information measures, which may be of independent interest. As also discussed briefly in this article, the generalized calculus of Shannon information measures for fields, including the invariance of completion and the chain rule, is useful in particular for studying the ergodic decomposition of stationary processes and its links with statistical modeling of natural language.

Keywords: Shannon information measures; fields; invariance of completion; chain rule

MSC: 94A17

1. Introduction

As noticed by Dębowski [1,2,3], a generalized calculus of Shannon information measures for arbitrary fields, initiated by Gelfand et al. [4] and later developed by Dobrushin [5], Pinsker [6], and Wyner [7], is useful in particular for studying the ergodic decomposition of stationary processes and its links with statistical modeling of natural language. To meet this need, Dębowski [1] developed the calculus of Shannon information measures for arbitrary fields, relaxing the requirement of regular conditional probability, which was assumed implicitly by Dobrushin [5] and Pinsker [6]. He did so unaware of the classical paper by Wyner [7], which pursued exactly the same idea, with some differences arising from its own independent motivation.

Compared to the exposition in [7], the added value of paper [1] was the treatment of continuity and invariance of Shannon information measures with respect to completion of fields. Unfortunately, the proof of Theorem 2 in [1], which establishes this invariance and the generalized chain rule, contains some mistakes and gaps, which we have discovered recently. For this reason, in this article, we provide a correction and a few new auxiliary results which may be of independent interest. In this way, we complete the full generalization of Shannon information measures and their properties, which was developed step by step by Gelfand et al. [4], Dobrushin [5], Pinsker [6], Wyner [7], and Dębowski [1]. In passing, we also revisit the linguistic motivations of our results.

The preliminaries are as follows. Fix a probability space $(\Omega, \mathcal{J}, P)$. Fields are set algebras closed under finite Boolean operations, whereas $\sigma$-fields are assumed to be closed also under countable unions and intersections. A field is called finite if it has finitely many elements. A finite partition is a finite collection of events $\{B_j\}_{j=1}^{J} \subset \mathcal{J}$ which are disjoint and whose union equals $\Omega$. The definition proposed by Wyner [7] and Dębowski [1] independently reads as follows:

Definition 1.

For finite partitions $\alpha = \{A_i\}_{i=1}^{I}$ and $\beta = \{B_j\}_{j=1}^{J}$ and a probability measure $P$, the entropy and mutual information are defined as

$$H_P(\alpha) := \sum_{i=1}^{I} P(A_i) \log \frac{1}{P(A_i)}, \qquad I_P(\alpha; \beta) := \sum_{i=1}^{I} \sum_{j=1}^{J} P(A_i \cap B_j) \log \frac{P(A_i \cap B_j)}{P(A_i) P(B_j)}. \tag{1}$$

Subsequently, for an arbitrary field $\mathcal{C}$ and finite partitions $\alpha$ and $\beta$, we define the pointwise conditional entropy and mutual information as

$$H_P(\alpha \,\|\, \mathcal{C}) := H_{P(\cdot | \mathcal{C})}(\alpha), \qquad I_P(\alpha; \beta \,\|\, \mathcal{C}) := I_{P(\cdot | \mathcal{C})}(\alpha; \beta), \tag{2}$$

where $P(E | \mathcal{C})$ is the conditional probability of event $E \in \mathcal{J}$ with respect to the smallest complete $\sigma$-field containing $\mathcal{C}$. Subsequently, for arbitrary fields $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$, the (average) conditional entropy and mutual information are defined as

$$H_P(\mathcal{A} | \mathcal{C}) := \sup_{\alpha \subset \mathcal{A}} \mathbf{E}_P H_P(\alpha \,\|\, \mathcal{C}), \qquad I_P(\mathcal{A}; \mathcal{B} | \mathcal{C}) := \sup_{\alpha \subset \mathcal{A},\, \beta \subset \mathcal{B}} \mathbf{E}_P I_P(\alpha; \beta \,\|\, \mathcal{C}), \tag{3}$$

where the supremum is taken over all finite subpartitions and $\mathbf{E}_P X := \int X \, dP$ is the expectation. Finally, we define the unconditional entropy $H_P(\mathcal{A}) := H_P(\mathcal{A} | \{\emptyset, \Omega\})$ and mutual information $I_P(\mathcal{A}; \mathcal{B}) := I_P(\mathcal{A}; \mathcal{B} | \{\emptyset, \Omega\})$, as it is generally done in information theory. When the probability measure $P$ is clear from the context, we omit the subscript $P$ from all the above notations.
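As an illustration of Equation (1) only, the following minimal sketch evaluates the entropy and mutual information of finite partitions on a finite sample space. The toy space of two fair coin flips and the helper names `block_prob`, `entropy`, and `mutual_information` are assumptions made for this example and do not come from the article; logarithms are natural.

```python
import math
from itertools import product


def block_prob(prob, block):
    """P(block) for an event given as a set of outcomes."""
    return sum(prob[w] for w in block)


def entropy(prob, alpha):
    """H_P(alpha) as in Equation (1), in nats; null blocks contribute 0."""
    total = 0.0
    for a in alpha:
        p = block_prob(prob, a)
        if p > 0:
            total += p * math.log(1.0 / p)
    return total


def mutual_information(prob, alpha, beta):
    """I_P(alpha; beta) as in Equation (1), in nats."""
    total = 0.0
    for a, b in product(alpha, beta):
        p_ab = block_prob(prob, a & b)
        if p_ab > 0:
            p_a = block_prob(prob, a)
            p_b = block_prob(prob, b)
            total += p_ab * math.log(p_ab / (p_a * p_b))
    return total


# Hypothetical toy space: two independent fair coin flips.
prob = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
alpha = [frozenset(w for w in prob if w[0] == v) for v in (0, 1)]  # first coin
beta = [frozenset(w for w in prob if w[1] == v) for v in (0, 1)]   # second coin

h_alpha = entropy(prob, alpha)
i_alpha_beta = mutual_information(prob, alpha, beta)
i_alpha_alpha = mutual_information(prob, alpha, alpha)
```

In this toy case the two partitions are independent, so their mutual information vanishes, while the mutual information of a partition with itself coincides with its entropy.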

Although the above measures, called Shannon information measures, have usually been discussed for $\sigma$-fields, the defining equations (3) also make sense for fields. We observe a number of identities, such as $H(\mathcal{A}) = I(\mathcal{A}; \mathcal{A})$ and $H(\mathcal{A} | \mathcal{C}) = I(\mathcal{A}; \mathcal{A} | \mathcal{C})$. It is important to stress that Definition 1, in contrast to the earlier expositions by Dobrushin [5] and Pinsker [6], is simpler, as it applies one Radon–Nikodym derivative less, and does not require regular conditional probability, i.e., it does not demand that the conditional distribution $(P(E | \mathcal{C}))_{E \in \mathcal{J}}$ be a probability measure almost surely. In fact, the expressions on the right-hand sides of the equations in (3) are defined for all $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$. No problems arise when the conditional probability is not regular, since the conditional distribution $(P(E | \mathcal{C}))_{E \in \mathcal{E}}$ restricted to a finite field $\mathcal{E}$ is a probability measure almost surely [8] (Theorem 33.2).

We should admit that in the context of statistical language modeling, the respective probability space is countably generated, so regular conditional probability is guaranteed to exist. Thus, for linguistic applications, one might think that expositions [5,6] are sufficient, although for didactic reasons, the approaches proposed by Wyner [7] and Dębowski [1] lead to a simpler and more general calculus of Shannon information measures. Yet there is a more important reason for Definition 1. Namely, to discuss the ergodic decomposition of the entropy rate and excess entropy (highly relevant results for statistical language modeling, developed in [1] and briefly recalled in Section 3), we need the invariance of Shannon information measures with respect to completion of fields. But within the framework of Dobrushin [5] and Pinsker [6], such invariance of completion does not hold for strongly nonergodic processes, which seem to arise quite naturally in statistical modeling of natural language [1,2,3]. Thus, the approach proposed by Wyner [7] and Dębowski [1] is in fact indispensable.

Thus, let us inspect the problem of invariance of Shannon information measures with respect to completion of fields. A $\sigma$-field is called complete, with respect to a given probability measure $P$, if it contains all sets of outer $P$-measure 0. Let $\sigma(\mathcal{A})$ denote the intersection of all complete $\sigma$-fields containing class $\mathcal{A}$, i.e., $\sigma(\mathcal{A})$ is the completion of the generated $\sigma$-field. Let $\mathcal{A} \land \mathcal{B}$ denote the intersection of all fields that contain $\mathcal{A}$ and $\mathcal{B}$. Assuming Definition 1, the following statement has been claimed true by Dębowski [1] (Theorem 2):

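Theorem 1.

Let $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$, and $\mathcal{D}$ be subfields of $\mathcal{J}$.

1. $I(\mathcal{A}; \mathcal{B} | \mathcal{C}) = I(\mathcal{A}; \sigma(\mathcal{B}) | \mathcal{C}) = I(\mathcal{A}; \mathcal{B} | \sigma(\mathcal{C}))$ (invariance of completion);
2. $I(\mathcal{A}; \mathcal{B} \land \mathcal{C} | \mathcal{D}) = I(\mathcal{A}; \mathcal{B} | \mathcal{D}) + I(\mathcal{A}; \mathcal{C} | \mathcal{B} \land \mathcal{D})$ (chain rule).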

The property stated in Theorem 1.1 will be referred to as the invariance of completion. It was not discussed by Wyner [7]. The property stated in Theorem 1.2 is usually referred to as the chain rule or the polymatroid identity. It was proved independently by Wyner [7].

As we have mentioned, the invariance of completion is crucial to prove the ergodic decomposition of the entropy rate and excess entropy of stationary processes. But the proof of the invariance of completion given by Dębowski [1] contains a mistake in the order of quantifiers, and the respective proof of the chain rule is too laconic and contains a gap. For this reason, we supply the corrected proofs in this article. As we have mentioned, the chain rule was proved by Wyner [7], using an approximation result by Dobrushin [5] and Pinsker [6]. For completeness, we provide a different proof of this approximation result, which follows easily from the invariance of completion, and we supply proofs of both parts of Theorem 1.

The corrected proofs of Theorem 1, to be presented in Section 2, are much longer than the original proofs by Dębowski [1]. In particular, for the sake of proving Theorem 1, we will discuss a few other approximation results, which seem to be of independent interest. To provide more context for our statements, in Section 3, we will also recall the ergodic decomposition of excess entropy and its application to statistical language modeling.

2. Proofs

Let us write $\mathcal{B}_n \uparrow \mathcal{B}$ for a sequence $(\mathcal{B}_n)_{n \in \mathbb{N}}$ of fields such that $\mathcal{B}_1 \subset \mathcal{B}_2 \subset \dots \subset \mathcal{B} = \bigcup_{n \in \mathbb{N}} \mathcal{B}_n$. ($\mathcal{B}$ need not be a $\sigma$-field.) Our proof of Theorem 1 will rest on a few approximation results and this statement by Dębowski [1] (Theorem 1):

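Theorem 2.

Let $\mathcal{A}$, $\mathcal{B}$, $\mathcal{B}_n$, and $\mathcal{C}$ be subfields of $\mathcal{J}$.

1. $I(\mathcal{A}; \mathcal{B} | \mathcal{C}) = I(\mathcal{B}; \mathcal{A} | \mathcal{C})$;
2. $I(\mathcal{A}; \mathcal{B} | \mathcal{C}) \geq 0$, with equality if and only if $P(A \cap B | \mathcal{C}) = P(A | \mathcal{C}) P(B | \mathcal{C})$ almost surely for all $A \in \mathcal{A}$ and $B \in \mathcal{B}$;
3. $I(\mathcal{A}; \mathcal{B} | \mathcal{C}) \leq \min(H(\mathcal{A} | \mathcal{C}), H(\mathcal{B} | \mathcal{C}))$;
4. $I(\mathcal{A}; \mathcal{B}_1 | \mathcal{C}) \leq I(\mathcal{A}; \mathcal{B}_2 | \mathcal{C})$ if $\mathcal{B}_1 \subset \mathcal{B}_2$;
5. $I(\mathcal{A}; \mathcal{B}_n | \mathcal{C}) \uparrow I(\mathcal{A}; \mathcal{B} | \mathcal{C})$ for $\mathcal{B}_n \uparrow \mathcal{B}$.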

Let $A^c = \Omega \setminus A$. Subsequently, let us denote the symmetric difference

$$A \triangle B := (A \setminus B) \cup (B \setminus A) = (A \cup B) \setminus (A \cap B). \tag{4}$$

Symmetric difference satisfies the following identities, which will be used:

$$A^c \triangle B^c = A \triangle B, \tag{5}$$

$$A \triangle B \subset (A \triangle C) \cup (C \triangle B), \tag{6}$$

$$(A \setminus C) \triangle B \subset (A \triangle B) \cup (C \cap B), \tag{7}$$

$$\Big(\bigcup_{i \in C} A_i\Big) \triangle \Big(\bigcup_{i \in C} B_i\Big) \subset \bigcup_{i \in C} (A_i \triangle B_i). \tag{8}$$

Moreover, we will apply the Bonferroni inequalities

$$0 \leq \sum_{1 \leq i \leq n} P(A_i) - P\Big(\bigcup_{1 \leq i \leq n} A_i\Big) \leq \sum_{1 \leq i < j \leq n} P(A_i \cap A_j) \tag{9}$$

and inequality $P(A) \leq P(B) + P(A \triangle B)$.
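The set relations (5)–(8), the Bonferroni inequalities (9), and the last inequality can be sanity-checked numerically. The sketch below does so on randomly drawn events of a small finite space with the uniform measure; the space, the measure, and the helper names `prob`, `random_event`, `union`, and `check_once` are assumptions chosen for illustration, and such a check is of course no substitute for a proof.

```python
import random
from itertools import combinations

random.seed(0)
omega = frozenset(range(12))


def prob(event):
    """Uniform probability measure on omega (an assumption of this sketch)."""
    return len(event) / len(omega)


def random_event():
    return frozenset(w for w in omega if random.random() < 0.5)


def union(events):
    return frozenset().union(*events)


def check_once():
    a, b, c = random_event(), random_event(), random_event()
    ok = ((omega - a) ^ (omega - b)) == (a ^ b)          # identity (5)
    ok &= (a ^ b) <= ((a ^ c) | (c ^ b))                 # inclusion (6)
    ok &= ((a - c) ^ b) <= ((a ^ b) | (c & b))           # inclusion (7)
    xs = [random_event() for _ in range(4)]
    ys = [random_event() for _ in range(4)]
    ok &= (union(xs) ^ union(ys)) <= union([x ^ y for x, y in zip(xs, ys)])  # (8)
    gap = sum(prob(x) for x in xs) - prob(union(xs))
    pairs = sum(prob(x & y) for x, y in combinations(xs, 2))
    ok &= -1e-12 <= gap <= pairs + 1e-12                 # Bonferroni (9)
    ok &= prob(a) <= prob(b) + prob(a ^ b) + 1e-12       # P(A) <= P(B) + P(A sym-diff B)
    return ok


all_hold = all(check_once() for _ in range(1000))
```

Here `^`, `-`, `|`, and `&` are Python's built-in set operators for symmetric difference, difference, union, and intersection, and `<=` between sets tests inclusion.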

In the following, we will derive the necessary approximation results. Our point of departure is the following folklore fact.

Theorem 3 (approximation of $\sigma$-fields). For any field $\mathcal{K}$ and any event $G \in \sigma(\mathcal{K})$, there is a sequence of events $K_1, K_2, \dots \in \mathcal{K}$ such that

$$\lim_{n \to \infty} P(G \triangle K_n) = 0. \tag{10}$$

Proof.

Denote the class of sets $G$ that satisfy (10) as $\mathcal{G}$. It is sufficient to show that $\mathcal{G}$ is a complete $\sigma$-field that contains the field $\mathcal{K}$. Clearly, all $G \in \mathcal{K}$ satisfy (10), so $\mathcal{G} \supset \mathcal{K}$. Now, we verify the conditions for $\mathcal{G}$ to be a $\sigma$-field.

We have $\Omega \in \mathcal{K}$. Hence, $\Omega \in \mathcal{G}$.
For $A \in \mathcal{G}$, consider $K_1, K_2, \dots \in \mathcal{K}$ such that $\lim_{n \to \infty} P(A \triangle K_n) = 0$. Then, $A \triangle K_n = A^c \triangle K_n^c$, where $K_1^c, K_2^c, \dots \in \mathcal{K}$. Hence, $A^c \in \mathcal{G}$.
For $A_1, A_2, \dots \in \mathcal{G}$, consider events $K_i^n \in \mathcal{K}$ such that $P(A_i \triangle K_i^n) \leq 2^{-n}$. Then,

$$P\Big(\Big(\bigcap_{i=1}^{n} A_i\Big) \triangle \Big(\bigcap_{i=1}^{n} K_i^{i+n}\Big)\Big) \leq \sum_{i=1}^{n} P(A_i \triangle K_i^{i+n}) \leq 2^{-n}. \tag{11}$$

Moreover,

$$P\Big(\Big(\bigcap_{i=1}^{\infty} A_i\Big) \triangle \Big(\bigcap_{i=1}^{n} A_i\Big)\Big) = P\Big(\bigcap_{i=1}^{n} A_i\Big) - P\Big(\bigcap_{i=1}^{\infty} A_i\Big). \tag{12}$$

Hence,

$$\begin{aligned} P\Big(\Big(\bigcap_{i=1}^{\infty} A_i\Big) \triangle \Big(\bigcap_{i=1}^{n} K_i^{i+n}\Big)\Big) &\leq P\Big(\Big(\bigcap_{i=1}^{\infty} A_i\Big) \triangle \Big(\bigcap_{i=1}^{n} A_i\Big)\Big) + P\Big(\Big(\bigcap_{i=1}^{n} A_i\Big) \triangle \Big(\bigcap_{i=1}^{n} K_i^{i+n}\Big)\Big) \\ &\leq P\Big(\bigcap_{i=1}^{n} A_i\Big) - P\Big(\bigcap_{i=1}^{\infty} A_i\Big) + 2^{-n}, \end{aligned} \tag{13}$$

which tends to 0 for $n$ going to infinity. Since $\bigcap_{i=1}^{n} K_i^{i+n} \in \mathcal{K}$, we thus obtain that $\bigcap_{i=1}^{\infty} A_i \in \mathcal{G}$.

Completeness of the $\sigma$-field $\mathcal{G}$ is straightforward since, for any $A \in \mathcal{G}$ and any $A'$ with $P(A \triangle A') = 0$, we obtain $A' \in \mathcal{G}$ using the same sequence of approximating events in field $\mathcal{K}$ as for event $A$. □

The second approximation result is the following bound:

Theorem 4 (continuity of entropy). Fix an $\epsilon \in (0, e^{-2}]$, so that $\sqrt{\epsilon} \leq e^{-1}$, and a field $\mathcal{C}$. For finite partitions $\alpha = \{A_i\}_{i=1}^{I}$ and $\alpha' = \{A_i'\}_{i=1}^{I}$ such that $P(A_i \triangle A_i') \leq \epsilon$ for all $i \in \{1, \dots, I\}$, we have

$$|H(\alpha | \mathcal{C}) - H(\alpha' | \mathcal{C})| \leq I \sqrt{\epsilon} \log \frac{I}{\sqrt{\epsilon}}. \tag{14}$$

Proof.

We have the expectation $\int P(A_i \triangle A_i' | \mathcal{C}) \, dP = P(A_i \triangle A_i') \leq \epsilon$. Hence, by the Markov inequality we obtain

$$P\big(P(A_i \triangle A_i' | \mathcal{C}) \geq \sqrt{\epsilon}\big) \leq \sqrt{\epsilon}. \tag{15}$$

Denote the event

$$B := \big(P(A_i \triangle A_i' | \mathcal{C}) < \sqrt{\epsilon} \text{ for all } i \in \{1, \dots, I\}\big). \tag{16}$$

From the Bonferroni inequality, we obtain $P(B^c) \leq I \sqrt{\epsilon}$. Subsequently, we observe that $|H(\alpha \,\|\, \mathcal{C}) - H(\alpha' \,\|\, \mathcal{C})| \leq \log I$ holds almost surely. Hence,

$$\begin{aligned} |H(\alpha | \mathcal{C}) - H(\alpha' | \mathcal{C})| &= \Big| \int [H(\alpha \,\|\, \mathcal{C}) - H(\alpha' \,\|\, \mathcal{C})] \, dP \Big| \\ &\leq P(B^c) \log I + \int_B |H(\alpha \,\|\, \mathcal{C}) - H(\alpha' \,\|\, \mathcal{C})| \, dP \\ &\leq I \sqrt{\epsilon} \log I + \int_B |H(\alpha \,\|\, \mathcal{C}) - H(\alpha' \,\|\, \mathcal{C})| \, dP. \end{aligned} \tag{17}$$

Function $-x \log x$ is subadditive and increasing for $x \in (0, e^{-1}]$. In particular, we have $|(x + y) \log (x + y) - x \log x| \leq -y \log y$ for $x \geq 0$ and $y \in [0, e^{-1}]$ with $x + y \leq 1$. Thus, on the event $B$ we obtain

$$\begin{aligned} |H(\alpha \,\|\, \mathcal{C}) - H(\alpha' \,\|\, \mathcal{C})| &= \Big| \sum_{i=1}^{I} P(A_i' | \mathcal{C}) \log P(A_i' | \mathcal{C}) - \sum_{i=1}^{I} P(A_i | \mathcal{C}) \log P(A_i | \mathcal{C}) \Big| \\ &\leq - \sum_{i=1}^{I} |P(A_i | \mathcal{C}) - P(A_i' | \mathcal{C})| \log |P(A_i | \mathcal{C}) - P(A_i' | \mathcal{C})| \\ &\leq - \sum_{i=1}^{I} P(A_i \triangle A_i' | \mathcal{C}) \log P(A_i \triangle A_i' | \mathcal{C}) \\ &\leq - I \sqrt{\epsilon} \log \sqrt{\epsilon}. \end{aligned} \tag{18}$$

Plugging (18) into (17) yields the claim. □
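To get a feeling for bound (14), one can evaluate both sides in the simplest special case of a trivial conditioning field on a finite uniform sample space, where conditional entropies reduce to ordinary entropies. The sketch below perturbs a four-block partition and compares the entropy gap with the right-hand side of (14). The sample space, the perturbation scheme, and all function names are assumptions made for this illustration, and the check covers only this unconditional special case.

```python
import math
import random


def entropy_from_labels(labels, num_blocks):
    """Entropy (nats) of the partition of a uniform space induced by block labels."""
    n = len(labels)
    counts = [0] * num_blocks
    for lab in labels:
        counts[lab] += 1
    return sum((c / n) * math.log(n / c) for c in counts if c > 0)


def max_symdiff_prob(labels, other, num_blocks):
    """max_i P(A_i sym-diff A_i') under the uniform measure."""
    n = len(labels)
    worst = 0
    for i in range(num_blocks):
        diff = sum((a == i) != (b == i) for a, b in zip(labels, other))
        worst = max(worst, diff)
    return worst / n


def continuity_bound(num_blocks, eps):
    """Right-hand side of (14): I * sqrt(eps) * log(I / sqrt(eps))."""
    root = math.sqrt(eps)
    return num_blocks * root * math.log(num_blocks / root)


random.seed(1)
n, num_blocks = 1000, 4
labels = [w % num_blocks for w in range(n)]
perturbed = list(labels)
for w in random.sample(range(n), 50):      # relabel at most 5% of the outcomes
    perturbed[w] = random.randrange(num_blocks)

eps = max_symdiff_prob(labels, perturbed, num_blocks)
gap = abs(entropy_from_labels(labels, num_blocks)
          - entropy_from_labels(perturbed, num_blocks))
bound = continuity_bound(num_blocks, eps)
```

Since at most 50 of the 1000 outcomes change their block, the resulting `eps` stays below $e^{-2}$, and in this example the bound is far from tight: the entropy gap can never exceed $\log 4$.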

Now, we can prove the invariance of completion. Note that

$$I(\alpha; \beta | \mathcal{C}) = H(\alpha | \mathcal{C}) + H(\beta | \mathcal{C}) - H(\alpha \land \beta | \mathcal{C}). \tag{19}$$

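Proof of Theorem 1.1 (invariance of completion).

Consider some measurable fields $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$. We are going to demonstrate

$$I(\mathcal{A}; \mathcal{B} | \mathcal{C}) = I(\mathcal{A}; \sigma(\mathcal{B}) | \mathcal{C}) = I(\mathcal{A}; \mathcal{B} | \sigma(\mathcal{C})). \tag{20}$$

Equality $I(\mathcal{A}; \mathcal{B} | \mathcal{C}) = I(\mathcal{A}; \mathcal{B} | \sigma(\mathcal{C}))$ is straightforward since $P(A | \mathcal{C}) = P(A | \sigma(\mathcal{C}))$ almost surely for all $A \in \mathcal{J}$. It remains to prove $I(\mathcal{A}; \mathcal{B} | \mathcal{C}) = I(\mathcal{A}; \sigma(\mathcal{B}) | \mathcal{C})$. For this goal, it suffices to show that for any $\epsilon > 0$ and any finite partitions $\alpha \subset \mathcal{A}$ and $\beta' \subset \sigma(\mathcal{B})$ there exists a finite partition $\beta \subset \mathcal{B}$ such that

$$|I(\alpha; \beta | \mathcal{C}) - I(\alpha; \beta' | \mathcal{C})| < \epsilon. \tag{21}$$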

Fix then some $\epsilon > 0$ and finite partitions $\alpha := \{A_i\}_{i=1}^{I} \subset \mathcal{A}$ and $\beta' := \{B_j'\}_{j=1}^{J} \subset \sigma(\mathcal{B})$. Invoking Theorem 3, we know that for each $\eta > 0$ there exists a class of sets $\{C_j\}_{j=1}^{J} \subset \mathcal{B}$, which need not be a partition, such that

$$P(C_j \triangle B_j') \leq \eta \tag{22}$$

for all $j \in \{1, \dots, J\}$. Let us put $B_{J+1}' := \emptyset$ and let us construct sets $D_0 := \emptyset$ and $D_j := \bigcup_{k=1}^{j} C_k$ for $j \in \{1, \dots, J\}$. Subsequently, we put $B_j := C_j \setminus D_{j-1}$ for $j \in \{1, \dots, J\}$ and $B_{J+1} := \Omega \setminus D_J$. In this way, we obtain a partition $\beta := \{B_j\}_{j=1}^{J+1} \subset \mathcal{B}$.

The next step of the proof is showing an analogue of bound (22) for partitions $\beta$ and $\beta'$. To begin, for $j \in \{1, \dots, J\}$, we have

$$\begin{aligned} P(B_j \triangle B_j') &= P((C_j \setminus D_{j-1}) \triangle B_j') \leq P(C_j \triangle B_j') + P(D_{j-1} \cap B_j') \\ &\leq \eta + \sum_{k=1}^{j-1} P(C_k \cap B_j') \\ &\leq \eta + \sum_{k=1}^{j-1} \big[P(B_k' \cap B_j') + P((C_k \cap B_j') \triangle (B_k' \cap B_j'))\big] \\ &\leq \eta + \sum_{k=1}^{j-1} \big[0 + P(C_k \triangle B_k')\big] \leq j \eta. \end{aligned} \tag{23}$$

Now, we observe for $j, k \in \{1, \dots, J\}$ and $j \neq k$ that

$$P(C_j) \geq P(B_j') - P(C_j \triangle B_j') \geq P(B_j') - \eta, \tag{24}$$

$$P(C_j \cap C_k) \leq P(B_j' \cap B_k') + P((C_j \cap C_k) \triangle (B_j' \cap B_k')) \leq 0 + P(C_j \triangle B_j') + P(C_k \triangle B_k') \leq 2 \eta. \tag{25}$$

Hence, by the Bonferroni inequality we derive

$$\begin{aligned} P(B_{J+1} \triangle B_{J+1}') &= P((\Omega \setminus D_J) \triangle \emptyset) = P(\Omega \setminus D_J) = 1 - P(D_J) \\ &\leq 1 - \sum_{1 \leq j \leq J} P(C_j) + \sum_{1 \leq j < k \leq J} P(C_j \cap C_k) \\ &\leq 1 - \sum_{1 \leq j \leq J} P(B_j') + J \eta + \sum_{1 \leq j < k \leq J} 2 \eta = J^2 \eta. \end{aligned} \tag{26}$$

Summarizing our bounds, we obtain

$$P((A_i \cap B_j) \triangle (A_i \cap B_j')) \leq P(B_j \triangle B_j') \leq J^2 \eta \tag{27}$$

for all $i \in \{1, \dots, I\}$ and $j \in \{1, \dots, J+1\}$. Then, invoking identity (19) and Theorem 4 yields

$$\begin{aligned} |I(\alpha; \beta | \mathcal{C}) - I(\alpha; \beta' | \mathcal{C})| &\leq |H(\alpha \land \beta | \mathcal{C}) - H(\alpha \land \beta' | \mathcal{C})| + |H(\beta | \mathcal{C}) - H(\beta' | \mathcal{C})| \\ &\leq I (J+1) \sqrt{J^2 \eta} \log \frac{I (J+1)}{\sqrt{J^2 \eta}} + (J+1) \sqrt{J^2 \eta} \log \frac{J+1}{\sqrt{J^2 \eta}}. \end{aligned} \tag{28}$$

Taking $\eta$ sufficiently small, we obtain (21), which is the desired claim. □
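The construction of the partition $\beta$ from the approximating sets $C_j$ used in this proof can be sketched in code. The snippet below builds $B_j := C_j \setminus D_{j-1}$ and $B_{J+1} := \Omega \setminus D_J$ for finite sets and checks bounds (23) and (26) on a small example; the ten-point uniform space, the concrete sets, and the helper names `disjointify` and `is_partition` are assumptions made for illustration.

```python
def disjointify(omega, covers):
    """Return blocks B_1..B_{J+1}: B_j is C_j minus D_{j-1}, B_{J+1} is omega minus D_J."""
    blocks = []
    seen = frozenset()                      # plays the role of D_{j-1}
    for c in covers:
        c = frozenset(c)
        blocks.append(c - seen)
        seen |= c
    blocks.append(frozenset(omega) - seen)
    return blocks


def is_partition(omega, blocks):
    """True if the blocks are pairwise disjoint and cover omega (empty blocks allowed)."""
    covered = frozenset().union(*blocks)
    return covered == frozenset(omega) and sum(len(b) for b in blocks) == len(omega)


# Hypothetical example on a uniform ten-point space.
omega = frozenset(range(10))
target = [frozenset({0, 1, 2, 3}), frozenset({4, 5, 6}), frozenset({7, 8, 9})]  # beta'
covers = [frozenset({0, 1, 2, 3, 4}), frozenset({4, 5, 6}), frozenset({7, 8})]  # C_j


def prob(event):
    return len(event) / len(omega)


blocks = disjointify(omega, covers)
eta = max(prob(c ^ b) for c, b in zip(covers, target))
J = len(target)
```

In this example `eta` equals 0.1, each of the first $J$ blocks differs from its target by probability at most $j \eta$, and the extra block $B_{J+1}$ has probability at most $J^2 \eta$.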

A consequence of the above result is the following approximation result, proved by Dobrushin [5] and Pinsker [6] and used by Wyner [7] to demonstrate the chain rule. Applying the invariance of completion, we supply a proof different from that of Dobrushin [5] and Pinsker [6].

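Theorem 5 (split of join). Let $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$, and $\mathcal{D}$ be subfields of $\mathcal{J}$. We have

$$I(\mathcal{A}; \mathcal{B} \land \mathcal{C} | \mathcal{D}) = \sup_{\alpha \subset \mathcal{A},\, \beta \subset \mathcal{B},\, \gamma \subset \mathcal{C}} \mathbf{E}\, I(\alpha; \beta \land \gamma \,\|\, \mathcal{D}), \tag{29}$$

where the supremum is taken over all finite subpartitions.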

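Proof.

Define class

$$\mathcal{E} := \bigcup_{\beta \subset \mathcal{B},\, \gamma \subset \mathcal{C}} \sigma(\beta \land \gamma). \tag{30}$$

It can be easily verified that $\mathcal{E}$ is a field such that $\sigma(\mathcal{E}) = \sigma(\mathcal{B} \land \mathcal{C})$. Clearly, for all finite partitions $\beta \subset \mathcal{B}$ and $\gamma \subset \mathcal{C}$ we have $\beta \land \gamma \subset \mathcal{E}$. Moreover, by the definition of $\mathcal{E}$, for each finite partition $\varepsilon \subset \mathcal{E}$ there exist finite partitions $\beta \subset \mathcal{B}$ and $\gamma \subset \mathcal{C}$ such that partition $\beta \land \gamma$ is finer than $\varepsilon$. Hence, by Theorem 2.4, we obtain in this case,

$$\mathbf{E}\, I(\alpha; \varepsilon \,\|\, \mathcal{D}) \leq \mathbf{E}\, I(\alpha; \beta \land \gamma \,\|\, \mathcal{D}) \leq I(\alpha; \mathcal{E} | \mathcal{D}). \tag{31}$$

In consequence, by Theorem 1.1, we obtain the claim

$$\begin{aligned} I(\mathcal{A}; \mathcal{B} \land \mathcal{C} | \mathcal{D}) &= I(\mathcal{A}; \mathcal{E} | \mathcal{D}) = \sup_{\alpha \subset \mathcal{A},\, \varepsilon \subset \mathcal{E}} \mathbf{E}\, I(\alpha; \varepsilon \,\|\, \mathcal{D}) \\ &= \sup_{\alpha \subset \mathcal{A},\, \beta \subset \mathcal{B},\, \gamma \subset \mathcal{C}} \mathbf{E}\, I(\alpha; \beta \land \gamma \,\|\, \mathcal{D}). \end{aligned} \tag{32}$$

□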

The final approximation result which we need to prove the chain rule is as follows:

Theorem 6 (convergence of conditioning). Let $\alpha = \{A_i\}_{i=1}^{I}$ be a finite partition and let $\mathcal{C}$ be a field. For each $\epsilon > 0$, there exists a finite partition $\gamma' \subset \sigma(\mathcal{C})$ such that for any partition $\gamma \subset \sigma(\mathcal{C})$ finer than $\gamma'$ we have

$$|H(\alpha | \mathcal{C}) - H(\alpha | \gamma)| \leq \epsilon. \tag{33}$$

Proof.

Fix an $\epsilon > 0$. For each $n \in \mathbb{N}$ and $A \in \mathcal{J}$, partition

$$\gamma_A := \{((k-1)/n < P(A | \mathcal{C}) \leq k/n) : k \in \{0, 1, \dots, n\}\} \tag{34}$$

is finite and belongs to $\sigma(\mathcal{C})$. If we consider partition $\gamma' := \bigwedge_{i=1}^{I} \gamma_{A_i}$, it remains finite and still satisfies $\gamma' \subset \sigma(\mathcal{C})$. Let a partition $\gamma \subset \sigma(\mathcal{C})$ be finer than $\gamma'$. Then,

$$|P(A_i | \mathcal{C}) - P(A_i | \gamma)| \leq 1/n \tag{35}$$

almost surely for all $i \in \{1, \dots, I\}$. We also observe

$$|H(\alpha | \mathcal{C}) - H(\alpha | \gamma)| \leq \int |H(\alpha \,\|\, \mathcal{C}) - H(\alpha \,\|\, \gamma)| \, dP. \tag{36}$$

We recall that function $-x \log x$ is subadditive and increasing for $x \in (0, e^{-1}]$. In particular, we have $|(x + y) \log (x + y) - x \log x| \leq -y \log y$ for $x \geq 0$ and $y \in [0, e^{-1}]$ with $x + y \leq 1$. Hence, for $n \geq e$ we obtain almost surely

$$\begin{aligned} |H(\alpha \,\|\, \mathcal{C}) - H(\alpha \,\|\, \gamma)| &= \Big| \sum_{i=1}^{I} P(A_i | \mathcal{C}) \log P(A_i | \mathcal{C}) - \sum_{i=1}^{I} P(A_i | \gamma) \log P(A_i | \gamma) \Big| \\ &\leq - \sum_{i=1}^{I} |P(A_i | \mathcal{C}) - P(A_i | \gamma)| \log |P(A_i | \mathcal{C}) - P(A_i | \gamma)| \\ &\leq \frac{I \log n}{n}. \end{aligned} \tag{37}$$

Taking $n$ so large that $n^{-1} I \log n \leq \epsilon$ yields the claim. □

Taking the above into account, we can demonstrate the chain rule. Our proof essentially follows the ideas of Wyner [7], except for invoking Theorem 6.

Proof of Theorem 1.2 (chain rule).

Let $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$, and $\mathcal{D}$ be arbitrary fields, and let $\alpha$, $\beta$, $\gamma$, and $\delta$ be finite partitions. Our point of departure is the chain rule for finite partitions [9] (Equation 2.60)

$$I(\alpha; \beta \land \gamma) = I(\alpha; \beta) + I(\alpha; \gamma | \beta). \tag{38}$$

By Definition 1 and Theorems 1.1, 5, and 6, conditional mutual information $I(\mathcal{A}; \mathcal{B} | \mathcal{C})$ can be approximated by $I(\alpha; \beta | \gamma)$, where we take appropriate limits of refined finite partitions with a certain care.

In particular, by Theorems 1.1, 5, and 6, taking sufficiently fine finite partitions of arbitrary fields $\mathcal{B}$ and $\mathcal{C}$, the chain rule (38) for finite partitions implies

$$I(\alpha; \mathcal{B} \land \mathcal{C}) = I(\alpha; \mathcal{B}) + I(\alpha; \mathcal{C} | \mathcal{B}), \tag{39}$$

where all expressions are finite. Hence, we also obtain

$$\begin{aligned} 0 &= [I(\alpha; \mathcal{B} \land \mathcal{C} \land \mathcal{D}) - I(\alpha; \mathcal{D}) - I(\alpha; \mathcal{B} \land \mathcal{C} | \mathcal{D})] \\ &\quad - [I(\alpha; \mathcal{B} \land \mathcal{D}) - I(\alpha; \mathcal{D}) - I(\alpha; \mathcal{B} | \mathcal{D})] \\ &\quad - [I(\alpha; \mathcal{B} \land \mathcal{C} \land \mathcal{D}) - I(\alpha; \mathcal{B} \land \mathcal{D}) - I(\alpha; \mathcal{C} | \mathcal{B} \land \mathcal{D})] \\ &= I(\alpha; \mathcal{B} | \mathcal{D}) + I(\alpha; \mathcal{C} | \mathcal{B} \land \mathcal{D}) - I(\alpha; \mathcal{B} \land \mathcal{C} | \mathcal{D}), \end{aligned}$$

where all expressions are finite. Having established the above claim for a finite partition $\alpha$, we generalize it to

$$I(\mathcal{A}; \mathcal{B} \land \mathcal{C} | \mathcal{D}) = I(\mathcal{A}; \mathcal{B} | \mathcal{D}) + I(\mathcal{A}; \mathcal{C} | \mathcal{B} \land \mathcal{D}) \tag{40}$$

for an arbitrary field $\mathcal{A}$, taking its appropriately fine finite partitions. □
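For finite partitions, the chain rule with conditioning can be verified numerically. In the sketch below, four finite-valued coordinates of a random joint distribution play the roles of the partitions $\alpha$, $\beta$, $\gamma$, and $\delta$, a pair of coordinates plays the role of a join such as $\beta \land \gamma$, and conditional mutual information is computed directly as the average of the pointwise quantity over the conditioning blocks. The random distribution, the coordinate encoding, and the function names are assumptions made for this example.

```python
import math
import random
from collections import defaultdict
from itertools import product


def marginal(joint, coords):
    """Marginal distribution of the selected coordinates."""
    out = defaultdict(float)
    for w, p in joint.items():
        out[tuple(w[c] for c in coords)] += p
    return out


def cond_mutual_information(joint, xs, ys, zs):
    """I(X; Y | Z) = sum_z P(z) I_{P(.|z)}(X; Y), in nats.

    xs, ys, zs are tuples of coordinate indices; zs may be empty.
    """
    p_xyz = marginal(joint, xs + ys + zs)
    p_xz = marginal(joint, xs + zs)
    p_yz = marginal(joint, ys + zs)
    p_z = marginal(joint, zs)
    total = 0.0
    for key, p in p_xyz.items():
        if p > 0:
            x = key[:len(xs)]
            y = key[len(xs):len(xs) + len(ys)]
            z = key[len(xs) + len(ys):]
            # P(x,y|z) / (P(x|z) P(y|z)) = P(x,y,z) P(z) / (P(x,z) P(y,z))
            total += p * math.log(p * p_z[z] / (p_xz[x + z] * p_yz[y + z]))
    return total


random.seed(2)
outcomes = list(product(range(2), range(3), range(2), range(2)))
weights = [random.random() for _ in outcomes]
norm = sum(weights)
joint = {w: v / norm for w, v in zip(outcomes, weights)}

A, B, C, D = (0,), (1,), (2,), (3,)   # coordinates standing for alpha, beta, gamma, delta
lhs = cond_mutual_information(joint, A, B + C, D)
rhs = (cond_mutual_information(joint, A, B, D)
       + cond_mutual_information(joint, A, C, B + D))
```

Up to floating-point error, `lhs` and `rhs` agree, which is the finite-partition instance of Theorem 1.2.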

3. Applications

This section borrows its statements largely from Dębowski [1,2,3] and is provided only to sketch some context for our research and justify its applicability to statistical language modeling. Let $(X_i)_{i \in \mathbb{Z}}$ be a two-sided infinite stationary process over a countable alphabet $\mathbb{X}$ on a probability space $(\mathbb{X}^{\mathbb{Z}}, \mathcal{X}^{\mathbb{Z}}, P)$, where $X_k((\omega_i)_{i \in \mathbb{Z}}) := \omega_k$. We denote random blocks $X_j^k := (X_i)_{j \leq i \leq k}$ and complete $\sigma$-fields $\mathcal{G}_j^k := \sigma(X_j^k)$ generated by them. By the generalized calculus of Shannon information measures, i.e., Theorems 1 and 2, we can define the entropy rate $h_P$ and the excess entropy $E_P$ of process $(X_i)_{i \in \mathbb{Z}}$ as

$$h_P := \lim_{n \to \infty} H_P(\mathcal{G}_0 | \mathcal{G}_{-n}^{-1}) = H_P(\mathcal{G}_0 | \mathcal{G}_{-\infty}^{-1}) \quad \text{if } \mathbb{X} \text{ is finite}, \tag{41}$$

$$E_P := \lim_{n \to \infty} I_P(\mathcal{G}_{-n}^{-1}; \mathcal{G}_0^{n-1}) = I_P(\mathcal{G}_{-\infty}^{-1}; \mathcal{G}_0^{\infty}), \tag{42}$$

see [10] for more background.

Let $T((\omega_i)_{i \in \mathbb{Z}}) := (\omega_{i+1})_{i \in \mathbb{Z}}$ be the shift operation and let $\mathcal{I} := \{A \in \mathcal{X}^{\mathbb{Z}} : T^{-1}(A) = A\}$ be the invariant $\sigma$-field. By the Birkhoff ergodic theorem [11], we have $\sigma(\mathcal{I}) \subset \sigma(\mathcal{G}_{-\infty}) \cap \sigma(\mathcal{G}_{\infty})$ for the tail $\sigma$-fields $\mathcal{G}_{-\infty} := \bigcap_{n=1}^{\infty} \mathcal{G}_{-\infty}^{-n}$ and $\mathcal{G}_{\infty} := \bigcap_{n=1}^{\infty} \mathcal{G}_n^{\infty}$. Hence, by Theorems 1 and 2 we further obtain expressions

$$h_P = H_P(\mathcal{G}_0 | \mathcal{G}_{-\infty}^{-1}) = H_P(\mathcal{G}_0 | \mathcal{G}_{-\infty}^{-1} \land \mathcal{I}) \quad \text{if } \mathbb{X} \text{ is finite}, \tag{43}$$

$$E_P = I_P(\mathcal{G}_{-\infty}^{-1}; \mathcal{G}_0^{\infty}) = H_P(\mathcal{I}) + I_P(\mathcal{G}_{-\infty}^{-1}; \mathcal{G}_0^{\infty} | \mathcal{I}). \tag{44}$$

Denoting the conditional probability $F(A) := P(A | \mathcal{I})$, which is a random stationary ergodic measure by the ergodic decomposition theorem [12], we notice that $H_P(\mathcal{G}_0 | \mathcal{G}_{-\infty}^{-1} \land \mathcal{I}) = \mathbf{E}_P H_F(\mathcal{G}_0 | \mathcal{G}_{-\infty}^{-1})$ and $I_P(\mathcal{G}_{-\infty}^{-1}; \mathcal{G}_0^{\infty} | \mathcal{I}) = \mathbf{E}_P I_F(\mathcal{G}_{-\infty}^{-1}; \mathcal{G}_0^{\infty})$, where $\mathbf{E}_P$ denotes the expectation with respect to $P$, and consequently we obtain the ergodic decomposition of the entropy rate and excess entropy, which reads

$$h_P = \mathbf{E}_P h_F \quad \text{if } \mathbb{X} \text{ is finite}, \tag{45}$$

$$E_P = H_P(\mathcal{I}) + \mathbf{E}_P E_F. \tag{46}$$

Formulae (45) and (46) were derived by Gray and Davisson [13] and Dębowski [1], respectively. The ergodic decomposition of the entropy rate (45) states that a stationary process is asymptotically deterministic, i.e., $h_P = 0$, if and only if almost all its ergodic components are asymptotically deterministic, i.e., $h_F = 0$ almost surely. In contrast, the ergodic decomposition of the excess entropy (46) states that a stationary process is infinitary, i.e., $E_P = \infty$, if some of its ergodic components are infinitary, i.e., $E_F = \infty$ with a nonzero probability, or if $H_P(\mathcal{I}) = \infty$, i.e., if the process is strongly nonergodic in particular, see [14,15].
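A simple nonergodic example, which is not taken from the article, may illustrate decomposition (46). Consider an equal mixture of two i.i.d. Bernoulli processes with distinct parameters $p \neq q$. Its ergodic components are i.i.d., so their excess entropy is zero, while the invariant $\sigma$-field is generated, up to null sets, by the choice of the component, so its entropy equals $\log 2$. Decomposition (46) then predicts that the block mutual information in (42) tends to $\log 2$. The sketch below computes this mutual information for growing block lengths; the parameters $p = 0.1$, $q = 0.9$ and the function names are assumptions of this example, and logarithms are natural.

```python
import math


def block_entropy(n, p, q, w=0.5):
    """Entropy (nats) of a block of length n for a w:(1-w) mixture of
    i.i.d. Bernoulli(p) and i.i.d. Bernoulli(q) processes."""
    h = 0.0
    for k in range(n + 1):
        # Probability of any single block containing exactly k ones.
        prob = w * p**k * (1 - p)**(n - k) + (1 - w) * q**k * (1 - q)**(n - k)
        if prob > 0:
            h -= math.comb(n, k) * prob * math.log(prob)
    return h


def block_mutual_information(n, p, q, w=0.5):
    """Mutual information between two adjacent blocks of length n.

    By stationarity it equals 2 H(n) - H(2n).
    """
    return 2 * block_entropy(n, p, q, w) - block_entropy(2 * n, p, q, w)


lengths = (1, 5, 10, 40)
values = [block_mutual_information(n, 0.1, 0.9) for n in lengths]
limit = math.log(2)   # entropy of the invariant sigma-field in this example
```

The computed values increase with the block length and approach `limit` from below, in agreement with (46) for this mixture.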

The linguistic interpretation of the above results is as follows. There is a hypothesis by Hilberg [16] that the excess entropy of natural language is infinite. This hypothesis can be partly confirmed by the original estimates of conditional entropy by Shannon [17], by the power-law decay of the estimates of the entropy rate given by the PPM compression algorithm [18], by the approximately power-law growth of vocabulary called Heaps’ or Herdan’s law [2,3,19,20], and by some other experiments applying neural statistical language models [21,22]. In parallel, Dębowski [1,2,3] supposed that the very large excess entropy in natural language may be caused by the fact that texts in natural language describe some relatively slowly evolving and very complex reality. Indeed, it can be mathematically proved that if the abstract reality described by random texts is unchangeable and infinitely complex, then the resulting stochastic process is strongly nonergodic, i.e., $H_P(\mathcal{I}) = \infty$ in particular [1,2,3]. Consequently, its excess entropy is infinite by formula (46). We suppose that a similar mechanism may work for natural language, see [23,24,25,26] for further examples of abstract stochastic mechanisms leading to infinitary processes.

Funding

This research received no external funding.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Dębowski, Ł. A general definition of conditional information and its application to ergodic decomposition. Stat. Probab. Lett. 2009, 79, 1260–1268.
2. Dębowski, Ł. On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts. IEEE Trans. Inf. Theory 2011, 57, 4589–4599.
3. Dębowski, Ł. Is Natural Language a Perigraphic Process? The Theorem about Facts and Words Revisited. Entropy 2018, 20, 85.
4. Gelfand, I.M.; Kolmogorov, A.N.; Yaglom, A.M. Towards the general definition of the amount of information. Dokl. Akad. Nauk. SSSR 1956, 111, 745–748. (In Russian)
5. Dobrushin, R.L. A general formulation of the fundamental Shannon theorems in information theory. Uspekhi Mat. Nauk. 1959, 14, 3–104. (In Russian)
6. Pinsker, M.S. Information and Information Stability of Random Variables and Processes; Holden-Day: San Francisco, CA, USA, 1964.
7. Wyner, A.D. A definition of conditional mutual information for arbitrary ensembles. Inf. Control. 1978, 38, 51–59.
8. Billingsley, P. Probability and Measure; John Wiley: New York, NY, USA, 1979.
9. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley: New York, NY, USA, 1991.
10. Crutchfield, J.P.; Feldman, D.P. Regularities unseen, randomness observed: The entropy convergence hierarchy. Chaos 2003, 13, 25–54.
11. Birkhoff, G.D. Proof of the ergodic theorem. Proc. Natl. Acad. Sci. USA 1932, 17, 656–660.
12. Rokhlin, V.A. On the fundamental ideas of measure theory. Am. Math. Soc. Transl. Ser. 1 1962, 10, 1–54.
13. Gray, R.M.; Davisson, L.D. The ergodic decomposition of stationary discrete random processes. IEEE Trans. Inf. Theory 1974, 20, 625–636.
14. Löhr, W. Properties of the Statistical Complexity Functional and Partially Deterministic HMMs. Entropy 2009, 11, 385–401.
15. Crutchfield, J.P.; Marzen, S. Signatures of infinity: Nonergodicity and resource scaling in prediction, complexity, and learning. Phys. Rev. E 2015, 91, 050106.
16. Hilberg, W. Der bekannte Grenzwert der redundanzfreien Information in Texten—eine Fehlinterpretation der Shannonschen Experimente? Frequenz 1990, 44, 243–248.
17. Shannon, C. Prediction and entropy of printed English. Bell Syst. Tech. J. 1951, 30, 50–64.
18. Takahira, R.; Tanaka-Ishii, K.; Dębowski, Ł. Entropy Rate Estimates for Natural Language—A New Extrapolation of Compressed Large-Scale Corpora. Entropy 2016, 18, 364.
19. Herdan, G. Quantitative Linguistics; Butterworths: London, UK, 1964.
20. Heaps, H.S. Information Retrieval—Computational and Theoretical Aspects; Academic Press: New York, NY, USA, 1978.
21. Hahn, M.; Futrell, R. Estimating Predictive Rate-Distortion Curves via Neural Variational Inference. Entropy 2019, 21, 640.
22. Braverman, M.; Chen, X.; Kakade, S.M.; Narasimhan, K.; Zhang, C.; Zhang, Y. Calibration, Entropy Rates, and Memory in Language Models. arXiv 2019, arXiv:1906.05664.
23. Dębowski, Ł. Mixing, Ergodic, and Nonergodic Processes with Rapidly Growing Information between Blocks. IEEE Trans. Inf. Theory 2012, 58, 3392–3401.
24. Dębowski, Ł. On Hidden Markov Processes with Infinite Excess Entropy. J. Theor. Probab. 2014, 27, 539–551.
25. Travers, N.F.; Crutchfield, J.P. Infinite Excess Entropy Processes with Countable-State Generators. Entropy 2014, 16, 1396–1413.
26. Dębowski, Ł. Maximal Repetition and Zero Entropy Rate. IEEE Trans. Inf. Theory 2018, 64, 2212–2219.

© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
