
*Entropy*
**2014**,
*16*(3),
1396-1413;
doi:10.3390/e16031396


## Abstract

We present two examples of finite-alphabet, infinite excess entropy processes generated by stationary hidden Markov models (HMMs) with countable state sets. The first, simpler example is not ergodic, but the second is. These are the first explicit constructions of processes of this type.

**PACS Classification:** 02.50.-r 89.70.+c 05.45.Tp 02.50.Ey

## 1. Introduction

For a stationary process $(X_t)$ the excess entropy **E** is the mutual information between the infinite past $\overleftarrow{X} = \ldots X_{-2}X_{-1}$ and the infinite future $\overrightarrow{X} = X_{0}X_{1}\ldots$. It has a long history and is widely employed as a measure of correlation and complexity in a variety of fields, from ergodic theory and dynamical systems to neuroscience and linguistics [1–6]. For a review the reader is referred to [7].

An important question in classifying a given process is whether the excess entropy is finite or infinite. In the former case the process is said to be finitary, and in the latter infinitary.

Over a finite alphabet, most of the commonly studied, simple process types are always finitary, including all independent identically distributed (IID) processes, finite-order Markov processes, and processes with finite-state hidden Markov model (HMM) presentations. However, there are also well known examples of finite-alphabet, infinitary processes. For instance, the symbolic dynamics at the onset of chaos in the logistic map and similar dynamical systems [7] and the stationary representation of the binary Fibonacci sequence [8] are both infinitary.

These latter processes, though, only admit stationary HMM presentations with uncountable state sets. Indeed, one can show that any process generated by a stationary, countable-state HMM either has positive entropy rate or consists entirely of periodic sequences, which these do not. Versions of the Santa Fe Process introduced in [6] are finite-alphabet, infinitary processes with positive entropy rate. However, they were not constructed directly as hidden Markov processes, and it seems unlikely that they should have any stationary, countable-state presentations either.

Here, we present two examples of stationary, countable-state HMMs that do generate finite-alphabet, infinitary processes. To the best of our knowledge, these are the first explicit constructions of this type in the literature, although, subsequent to the release of an earlier version of the present work [9], two additional examples were given in [10].

Our first example is nonergodic, and the information conveyed from the past to the future essentially consists of the ergodic component along a given realization. This example is straightforward to construct and, though previously unpublished, others are likely aware of it or similar constructions. The second, ergodic example, though, is more involved, and both its structure and properties are novel.

To put these contributions in perspective, we note that any stationary, finite-alphabet process may be trivially presented by a stationary hidden Markov model with an uncountable state set, in which each infinite history $\overleftarrow{x}$ corresponds to a single state. Thus, it is clear that stationary HMMs with uncountable state sets can generate finite-alphabet, infinitary processes. In contrast, for any finite-state HMM **E** is always finite, bounded by the logarithm of the number of states. The case of countable-state HMMs lies between the finite-state and uncountable-state cases, and it was previously not demonstrated whether it is possible to have countable-state, stationary HMMs that generate infinitary, finite-alphabet processes and, in particular, ergodic ones.

## 2. Background

#### 2.1. Excess Entropy

We denote by H [X] the Shannon entropy in a random variable X, by H [X|Y ] the conditional entropy in X given Y, and by I [X; Y ] the mutual information between random variables X and Y. For definitions of these information theoretic quantities, as well as the definitions of stationarity and ergodicity for a stochastic process (X_{t}), the reader is referred to [11].

#### Definition 1

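For a stationary, finite-alphabet process $(X_t)_{t\in\mathbb{Z}}$ the excess entropy **E** is the mutual information between the infinite past $\overleftarrow{X} = \ldots X_{-2}X_{-1}$ and the infinite future $\overrightarrow{X} = X_{0}X_{1}\ldots$:

$$\mathbf{E} = I[\overleftarrow{X};\overrightarrow{X}] = \lim_{t\to\infty} I[\overleftarrow{X}^{t};\overrightarrow{X}^{t}], \qquad (1)$$

where $\overleftarrow{X}^{t} = X_{-t}\ldots X_{-1}$ and $\overrightarrow{X}^{t} = X_{0}\ldots X_{t-1}$ are the length-t past and future, respectively.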

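As noted in [7,12] this quantity, **E**, may also be expressed alternatively as:

$$\mathbf{E} = \lim_{t\to\infty}\left(H[\overrightarrow{X}^{t}] - h\cdot t\right), \qquad (2)$$

where h is the process entropy rate:

$$h = \lim_{t\to\infty}\frac{H[\overrightarrow{X}^{t}]}{t}. \qquad (3)$$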

That is, the excess entropy **E** is the asymptotic amount of entropy (information) in length-t blocks of random variables beyond that explained by the entropy rate. The excess entropy derives its name from this latter formulation. It is also this formulation that we use to establish that the process of Section 3.1 is infinitary.

Expanding the block entropy $H[\overrightarrow{X}^{t}]$ in Equation (2) with the chain rule and recombining terms gives another important formulation [7]:

$$\mathbf{E} = \sum_{t=1}^{\infty}\left(h(t) - h\right), \qquad (4)$$

where h(t) is the length-t entropy-rate approximation:

$$h(t) = H[X_{t-1}\mid\overrightarrow{X}^{t-1}], \qquad (5)$$

the conditional entropy in the t-th symbol given the previous t – 1 symbols. This final formulation will be used to establish that the process of Section 3.2 is infinitary.
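The recombination step between the block-entropy form and the summed form can be spelled out in a few lines. The following is a sketch of the standard telescoping argument, using only the quantities defined in this section:

```latex
\begin{align*}
H[\overrightarrow{X}^{t}]
  &= \sum_{s=1}^{t} H[X_{s-1} \mid X_{0} \ldots X_{s-2}]
   = \sum_{s=1}^{t} h(s)
  && \text{(chain rule)} \\
H[\overrightarrow{X}^{t}] - h \cdot t
  &= \sum_{s=1}^{t} \bigl( h(s) - h \bigr) \\
\lim_{t \to \infty} \bigl( H[\overrightarrow{X}^{t}] - h \cdot t \bigr)
  &= \sum_{t=1}^{\infty} \bigl( h(t) - h \bigr)
\end{align*}
```

For a stationary process h(s) is nonincreasing in s and converges to h, so every term of the sum is nonnegative and the limit exists, though it may be infinite.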

#### 2.2. Hidden Markov Models

There are two primary types of hidden Markov models: edge-emitting (or Mealy) and state-emitting (or Moore). We work with the former edge-emitting type, but the two are equivalent in that any model of one type with a finite output alphabet may be converted to a model of the other type without changing the cardinality of the state set by more than a constant factor—the alphabet size. Thus, for our purposes, Mealy HMMs are sufficiently general. We also consider only stationary HMMs with finite output alphabets and countable state sets.

#### Definition 2

A stationary, edge-emitting, countable-state, finite-alphabet hidden Markov model (hereafter referred to simply as a countable-state HMM) is a 4-tuple $(\mathcal{S}, \mathcal{X}, \{T^{(x)}\}, \pi)$ where:

- (1) $\mathcal{S}$ is a countable set of states.

- (2) $\mathcal{X}$ is a finite alphabet of output symbols.

- (3) $T^{(x)}$, $x \in \mathcal{X}$, are symbol-labeled transition matrices whose sum $T = \sum_{x\in\mathcal{X}} T^{(x)}$ is stochastic. ${T}_{\sigma {\sigma}^{\prime}}^{(x)}$ is the probability that state σ transitions to state σ′ on symbol x.

- (4) π is a stationary distribution for the underlying Markov chain over states with transition matrix T. That is, π satisfies π = πT.

#### Remarks

- (1)
“Countable” in Property 1 means either finite or countably infinite. If the state set is finite, we also refer to the HMM as finite-state.

- (2)
We do not assume, in general, that the underlying Markov chain over states with transition matrix T is irreducible. Thus, even in the case that $\mathcal{S}$ is finite, the stationary distribution π is not necessarily uniquely defined by the matrix T and is, therefore, specified separately.

Visually, a hidden Markov model may be depicted as a directed graph with labeled edges. The vertices are the states $\sigma \in \mathcal{S}$ and, for all $\sigma, \sigma' \in \mathcal{S}$ with ${T}_{\sigma {\sigma}^{\prime}}^{(x)}>0$, there is a directed edge from state σ to state σ′ labeled p|x for the symbol x and transition probability $p={T}_{\sigma {\sigma}^{\prime}}^{(x)}$. These probabilities are normalized so that the sum of probabilities on all outgoing edges from each state is 1. An example is given in Figure 1.

The operation of a HMM may be thought of as a weighted random walk on the associated graph. From the current state σ the next state σ′ is determined by following an outgoing edge from σ chosen according to the edge probabilities (or weights). During the transition, the HMM also outputs the symbol x labeling this edge.
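The random-walk picture can be sketched in a few lines of Python. The edge list below is an assumption: it encodes a two-state topology chosen to be consistent with the Even Process of Figure 1 (the state names `s1`, `s2` and all function names are hypothetical). The walk starts from a fixed state rather than from a draw from π, so the first few symbols do not follow the stationary law.

```python
import random

def even_process_edges(p):
    """Outgoing edges per state as (probability, symbol, next_state).

    Assumed topology, chosen to match the Even Process described in Figure 1.
    """
    return {
        "s1": [(p, 0, "s1"), (1.0 - p, 1, "s2")],
        "s2": [(1.0, 1, "s1")],
    }

def step(edges, state, rng):
    """Follow one outgoing edge chosen by its weight; return (symbol, next_state)."""
    r, acc = rng.random(), 0.0
    for prob, symbol, nxt in edges[state]:
        acc += prob
        if r < acc:
            break
    # If rounding leaves r >= acc, the loop ends on the last edge, which is used.
    return symbol, nxt

def generate(edges, state, n, rng):
    """Emit n symbols by walking the graph from the given start state."""
    out = []
    for _ in range(n):
        symbol, state = step(edges, state, rng)
        out.append(symbol)
    return out

rng = random.Random(0)
symbols = generate(even_process_edges(0.5), "s1", 30, rng)
print("".join(map(str, symbols)))
```

Only the emitted symbols are returned, mirroring the observer's view: the visited states stay hidden.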

We denote the state at time t by S_{t} and the t-th symbol by X_{t}, so that symbol X_{t} is generated upon the transition from state S_{t} to state S_{t}_{+1}. The state sequence (S_{t}) is simply a Markov chain with transition matrix T. However, we are interested not simply in this sequence of states, but also in the associated sequence of output symbols (X_{t}) that are generated by reading the labels off the edges as they are followed. The interpretation is that an observer of the HMM may directly observe this sequence of output symbols, but not the hidden internal states. Alternatively, one may consider the Markov chain over edges (E_{t}), of which the observed symbol sequence (X_{t}) is simply a projection.

In either case, the process $(X_t)$ generated by the HMM $(\mathcal{S}, \mathcal{X}, \{T^{(x)}\}, \pi)$ is defined as the output sequence of edge symbols, which results from running the Markov chain over states according to the stationary law with marginals ℙ(S_{0}) = ℙ(S_{t}) = π. It is easy to verify that this process is itself stationary, with word probabilities given by:

$$\mathbb{P}(w) = \mathbb{P}(\overrightarrow{X}^{|w|} = w) = \|\pi T^{(w)}\|_{1}, \qquad (6)$$

where for a given word $w = w_{1}\ldots w_{n} \in \mathcal{X}^{*}$, $T^{(w)}$ is the word transition matrix $T^{(w)} = T^{(w_{1})}\cdots T^{(w_{n})}$.
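As an illustration, the word probability can be computed by pushing the row vector π through the symbol-labeled matrices one symbol at a time and summing the result. The sketch below assumes a two-state model whose matrices are chosen to agree with the stationary distribution quoted in the Figure 1 caption; the function names are hypothetical.

```python
def even_process(p):
    """Symbol-labeled matrices T[x][i][j] and stationary distribution pi.

    The matrices are an assumption consistent with the Figure 1 caption.
    """
    T = {
        0: [[p, 0.0], [0.0, 0.0]],
        1: [[0.0, 1.0 - p], [1.0, 0.0]],
    }
    pi = [1.0 / (2.0 - p), (1.0 - p) / (2.0 - p)]
    return T, pi

def word_probability(T, pi, word):
    """P(w) = || pi T^(w1) ... T^(wn) ||_1 for a word given as a sequence of symbols."""
    v = list(pi)
    for x in word:
        M = T[x]
        # Row vector times matrix: new_v[j] = sum_i v[i] * M[i][j].
        v = [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]
    return sum(v)

T, pi = even_process(0.5)
print(word_probability(T, pi, [0, 1, 1, 0]))  # an allowed word
print(word_probability(T, pi, [0, 1, 0]))     # odd block of 1s between 0s
```

The second word has probability zero because a block of 1s bounded by 0s must have even length in this process.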

#### Remark

Even for a nonstationary HMM $(\mathcal{S}, \mathcal{X}, \{T^{(x)}\}, \rho)$, where the state distribution ρ is not stationary, one may always define a one-sided process $(X_t)_{t\ge 0}$ with marginals given by:

$$\mathbb{P}(\overrightarrow{X}^{|w|} = w) = \|\rho T^{(w)}\|_{1}. \qquad (7)$$

Furthermore, though the state sequence $(S_t)_{t\ge 0}$ will not be a stationary process if ρ is not a stationary distribution for T, the output sequence $(X_t)_{t\ge 0}$ may still be stationary. In fact, as shown in [12] (Example 2.9), any one-sided process over a finite alphabet $\mathcal{X}$, stationary or not, may be represented by a countable-state, nonstationary HMM in which the states correspond to finite-length words in $\mathcal{X}^{*}$, of which there are only countably many. By stationarity, a one-sided stationary process generated by such a nonstationary HMM can be uniquely extended to a two-sided stationary process. So, in a sense, any two-sided stationary process $(X_t)_{t\in\mathbb{Z}}$ can be said to be generated by a nonstationary, countable-state HMM. This is a slightly unnatural interpretation of process generation, though, in that the two-sided process $(X_t)_{t\in\mathbb{Z}}$ is not directly that obtained by reading symbols off the edges of the HMM as it runs along transitioning between states in bi-infinite time. In either case, the space of stationary, finite-alphabet processes generated by nonstationary, countable-state HMMs is too large: it includes all stationary, finite-alphabet processes. Due to this, we restrict to the case of stationary HMMs where both the state sequence $(S_t)$ and output sequence $(X_t)$ are stationary processes, and henceforth use the term HMM implicitly to mean stationary HMM. Clearly, if one allows finite-alphabet processes generated by nonstationary, countable-state HMMs there are infinitary examples.

We consider now an important property known as unifilarity. This property is useful in that many quantities are analytically computable only for unifilar HMMs. In particular, for unifilar HMMs the entropy rate h is often directly computable, unlike in the nonunifilar case. Both of the examples constructed in Section 3 are unifilar, as is the Even Process HMM of Figure 1.

#### Definition 3

A HMM $(\mathcal{S}, \mathcal{X}, \{T^{(x)}\}, \pi)$ is unifilar if for each $\sigma \in \mathcal{S}$ and $x \in \mathcal{X}$ there is at most one outgoing edge from state σ labeled with symbol x in the associated graph G.
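For a finite-state model stored as symbol-labeled matrices, the definition translates into a direct check: each row of each $T^{(x)}$ may contain at most one positive entry. A minimal sketch follows; the two example models are assumptions introduced for illustration, the first matching the Even Process of Figure 1 and the second a hypothetical nonunifilar machine.

```python
def is_unifilar(T):
    """True if every state has at most one positive-probability edge per symbol."""
    return all(
        sum(1 for q in row if q > 0.0) <= 1
        for matrix in T.values()
        for row in matrix
    )

p = 0.5
# Assumed Even Process matrices: T[x][i][j] = P(state i -> state j on symbol x).
even = {0: [[p, 0.0], [0.0, 0.0]], 1: [[0.0, 1.0 - p], [1.0, 0.0]]}
# Hypothetical nonunifilar model: from state 0, symbol 0 may lead to either state.
split = {0: [[0.5, 0.5], [0.0, 0.0]], 1: [[0.0, 0.0], [0.0, 1.0]]}

print(is_unifilar(even), is_unifilar(split))
```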

It is well known that for any finite-state, unifilar HMM the entropy rate in the output process $(X_t)$ is simply the conditional entropy in the next symbol given the current state:

$$h = H[X_{0}\mid S_{0}] = \sum_{\sigma\in\mathcal{S}}\pi_{\sigma}h_{\sigma}, \qquad (8)$$

where π_{σ} is the stationary probability of state σ and h_{σ} = H [X_{0}|S_{0} = σ] is the conditional entropy in the next symbol given that the current state is σ.
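The state-averaged formula can be compared numerically with the entropy-rate approximations h(t) obtained from block entropies. The sketch below does this by brute-force enumeration for a two-state model assumed to match the Even Process of Figure 1; all function names are hypothetical, and the enumeration is only feasible for small t.

```python
import math
from itertools import product

def even_process(p):
    """Assumed Even Process matrices T[x][i][j] and stationary distribution pi."""
    T = {0: [[p, 0.0], [0.0, 0.0]], 1: [[0.0, 1.0 - p], [1.0, 0.0]]}
    pi = [1.0 / (2.0 - p), (1.0 - p) / (2.0 - p)]
    return T, pi

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(q * math.log2(q) for q in probs if q > 0.0)

def unifilar_entropy_rate(T, pi):
    """h = sum_sigma pi_sigma * h_sigma, with h_sigma the next-symbol entropy from sigma."""
    h = 0.0
    for s, weight in enumerate(pi):
        symbol_probs = [sum(T[x][s]) for x in T]
        h += weight * entropy(symbol_probs)
    return h

def block_entropy(T, pi, t):
    """H[X^t] by enumerating all |alphabet|^t words."""
    probs = []
    for word in product(T.keys(), repeat=t):
        v = list(pi)
        for x in word:
            v = [sum(v[i] * T[x][i][j] for i in range(len(v))) for j in range(len(v))]
        probs.append(sum(v))
    return entropy(probs)

T, pi = even_process(0.5)
h = unifilar_entropy_rate(T, pi)
H = [block_entropy(T, pi, t) for t in range(11)]
h_t = [H[t] - H[t - 1] for t in range(1, 11)]   # h(1), ..., h(10)
partial_E = H[10] - 10 * h                       # sum_{t=1}^{10} (h(t) - h)
print(h, h_t[-1], partial_E)
```

For p = 1/2 the state-averaged value is 2/3 bit, since only the first state has a random next symbol. The partial sum can never exceed the state entropy H[π], which is the finite-state bound on **E**.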

We are unaware, though, of any proof that this is generally true for countable-state HMMs. If the entropy in the stationary distribution H [π] is finite, then a proof along the lines given in [13] carries through to the countable-state case and Equation (8) still holds. However, countable-state HMMs may sometimes have H [π] = ∞. Furthermore, it can be shown [12] that the excess entropy **E** is always bounded above by H [π]. So, for the infinitary process of Section 3.2 we need slightly more than unifilarity to establish the value of h. To this end, we consider a property known as exactness [14].

#### Definition 4

A HMM is said to be exact if for a.e. infinite future $\overrightarrow{x} = x_{0}x_{1}\ldots$ generated by the HMM an observer synchronizes to the internal state after a finite time. That is, for a.e. $\overrightarrow{x}$ there exists t ∈ ℕ such that $H[S_{t}\mid\overrightarrow{X}^{t} = \overrightarrow{x}^{t}] = 0$, where $\overrightarrow{x}^{t} = x_{0}x_{1}\ldots x_{t-1}$ denotes the first t symbols of a given $\overrightarrow{x}$.

In the appendix we prove the following proposition.

#### Proposition 1

For any countable-state, exact, unifilar HMM the entropy rate is given by the standard formula of Equation (8).

The HMM constructed in Section 3.2 is both exact and unifilar, so Proposition 1 applies. Using this explicit formula for h, we will show that $\mathbf{E}={\sum}_{t=1}^{\infty}(h(t)-h)$ is infinite.

## 3. Constructions

We now present the two constructions of (stationary) countable-state HMMs that generate infinitary processes. In the first example the output process is not ergodic, but in the second it is.

#### 3.1. Heavy-Tailed Periodic Mixture: An infinitary nonergodic process with a countable-state presentation

Figure 2 depicts a countable-state HMM M for a nonergodic infinitary process ℘. The machine M consists of a countable collection of disjoint strongly connected subcomponents M_{i}, i ≥ 2. For each i, the component M_{i} generates the periodic process ℘_{i} consisting of i – 1 1s followed by a 0. The weighting (μ_{2}, μ_{3}, ...) over components is taken as a heavy-tailed distribution with infinite entropy. For this reason, we refer to the process M generates as the Heavy-Tailed Periodic Mixture (HPM) process.

Intuitively, the information transmitted from the past to the future for the HPM Process is the ergodic component i along with the phase of the period-i process ℘_{i} in this component. This is more information than simply the ergodic component i, which is itself an infinite amount of information: H [(μ_{2}, μ_{3}, ...)] = ∞. Hence, **E** should be infinite. This intuition can be made precise using the ergodic decomposition theorem of Debowski [15], but we present a more direct proof here.

#### Proposition 2

The HPM Process has infinite excess entropy.

#### Proof

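For the HPM Process ℘ we will show that (i) $\lim_{t\to\infty} H[\overrightarrow{X}^{t}] = \infty$ and (ii) h = 0. The conclusion then follows immediately from Equation (2). To this end, we define sets:

$$W_{i,t} = \{w : |w| = t \text{ and } w \text{ is in the support of } \wp_{i}\}, \qquad U_{t} = \bigcup_{2\le i\le t/2} W_{i,t}, \qquad V_{t} = \bigcup_{i>t/2} W_{i,t}.$$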

Note that any word w ∈ W_{i,t} with i ≤ t/2 contains at least two 0s. Therefore:

- (1) No two distinct states σ_{ij} and σ_{ij′} with i ≤ t/2 generate the same length-t word.

- (2) The sets W_{i,t}, i ≤ t/2, are disjoint from both each other and V_{t}.

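It follows that each word w ∈ W_{i,t}, with i ≤ t/2, can only be generated from a single state σ_{ij} of the HMM and has probability:

$$\mathbb{P}(w) = \mathbb{P}(S_{0} = \sigma_{ij}) = \mu_{i}/i.$$

Hence, for any fixed t:

$$H[\overrightarrow{X}^{t}] = \sum_{w}\mathbb{P}(w)\log\frac{1}{\mathbb{P}(w)} \ge \sum_{i=2}^{\lfloor t/2\rfloor}\sum_{w\in W_{i,t}}\mathbb{P}(w)\log\frac{1}{\mathbb{P}(w)} = \sum_{i=2}^{\lfloor t/2\rfloor}\mu_{i}\log\frac{i}{\mu_{i}},$$

so:

$$\lim_{t\to\infty} H[\overrightarrow{X}^{t}] \ge \sum_{i=2}^{\infty}\mu_{i}\log\frac{i}{\mu_{i}} \ge \sum_{i=2}^{\infty}\mu_{i}\log\frac{1}{\mu_{i}} = H[(\mu_{2},\mu_{3},\ldots)] = \infty,$$

which proves Claim (i). Now, to prove Claim (ii) consider the quantity:

$$h(t+1) = H[X_{t}\mid\overrightarrow{X}^{t}] = \sum_{w\in U_{t}\cup V_{t}}\mathbb{P}(w)\,H[X_{t}\mid\overrightarrow{X}^{t} = w]. \qquad (11)$$

On the one hand, for w ∈ U_{t}, H [X_{t}|X⃗^{t} = w] = 0 since the current state and, hence, entire future are completely determined by any word w ∈ U_{t}. On the other hand, for w ∈ V_{t}, H [X_{t}|X⃗^{t} = w] ≤ 1 since the alphabet is binary. Moreover, the combined probability of all words in the set V_{t} is simply the probability of starting in some component M_{i} with i > t/2: ℙ(V_{t}) = ∑_{i>t/2} μ_{i}. Thus, by Equation (11), h(t + 1) ≤ ∑_{i>t/2} μ_{i}. Since ∑_{i} μ_{i} converges, it follows that h(t) ↘ 0, which verifies Claim (ii).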
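The two halves of the argument can be illustrated numerically. The sketch below makes several assumptions that are not taken from the text: the weights are chosen as μ_i ∝ 1/(i log² i), one heavy-tailed family with divergent entropy, and they are truncated at a finite `n_max` so that they can be tabulated (the truncated distribution necessarily has finite entropy). It also uses the fact that each of the i phases of component i carries stationary probability μ_i/i, so that the words from components with i ≤ t/2 contribute μ_i log₂(i/μ_i) to the block entropy.

```python
import math

def truncated_heavy_tail(n_max):
    """Weights mu_i proportional to 1/(i * log(i)^2) for i = 2..n_max, normalized.

    The functional form and the truncation are illustrative assumptions.
    """
    raw = {i: 1.0 / (i * math.log(i) ** 2) for i in range(2, n_max + 1)}
    z = sum(raw.values())
    return {i: r / z for i, r in raw.items()}

def block_entropy_lower_bound(mu, t):
    """Sum over components i <= t/2 of mu_i * log2(i / mu_i), in bits."""
    return sum(m * math.log2(i / m) for i, m in mu.items() if i <= t / 2)

def entropy_rate_upper_bound(mu, t):
    """Tail mass sum_{i > t/2} mu_i, which bounds h(t+1) for a binary alphabet."""
    return sum(m for i, m in mu.items() if i > t / 2)

mu = truncated_heavy_tail(100_000)
for t in (10, 100, 1_000, 10_000):
    print(t, block_entropy_lower_bound(mu, t), entropy_rate_upper_bound(mu, t))
```

As t grows the first column keeps increasing while the second shrinks, which is the behavior the proof establishes in the untruncated limit.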

#### 3.2. Branching Copy Process: An infinitary ergodic process with a countable-state presentation

Figure 3 depicts a countable-state HMM M for the ergodic, infinitary Branching Copy Process. Essentially, the machine M consists of a binary tree with loop backs to the root node. From the root a path is chosen down the tree with each left-right (or 0–1) choice equally likely. But, at each step there is also a chance of turning back towards the root. The path back is not a single step, however. It has length equal to the number of steps taken down the tree before returning back, and copies the path taken down symbol-wise with 0s replaced by 2s and 1s replaced by 3s. There is also a high self-loop probability at the root node on symbol 4, so some number of 4s will normally be generated after returning to the root node before proceeding again down the tree. The process generated by this machine is referred to as the Branching Copy (BC) Process, because the branch taken down the tree is copied on the loop back to the root.

By inspection we see that the machine is unifilar with synchronizing word w = 4, i.e., H [S_{1}|X_{0} = 4] = 0. Since the underlying Markov chain over states (S_{t}) is positive recurrent, the state sequence (S_{t}) and symbol sequence (X_{t}) are both ergodic. Thus, a.e. infinite future $\overrightarrow{x}$ contains a 4, so the machine is exact. Therefore, Proposition 1 may be applied, and we know the entropy rate h is given by the standard formula of Equation (8): h = ∑_{σ} π_{σ}h_{σ}. Since ℙ(S_{t} = σ) = π_{σ} for any t ∈ ℕ, we may alternatively represent this entropy rate as:

$$h = \sum_{w\in\mathcal{L}_{t}}\mathbb{P}(w)\,\tilde{h}_{w}, \qquad (12)$$

where $\mathcal{L}_{t}$ = {w : |w| = t, ℙ(w) > 0} is the set of length-t words in the process language $\mathcal{L}$, φ(w) is the conditional state distribution induced by the word w (i.e., $\phi(w)_{\sigma} = \mathbb{P}(S_{t} = \sigma\mid\overrightarrow{X}^{t} = w)$), and $\tilde{h}_{w} = \sum_{\sigma}\phi(w)_{\sigma}h_{\sigma}$ is the φ(w)-weighted average entropy in the next symbol given knowledge of the current state σ. Similarly, for any t ∈ ℕ the entropy-rate approximation h(t + 1) may be expressed as:

$$h(t+1) = \sum_{w\in\mathcal{L}_{t}}\mathbb{P}(w)\,h_{w}, \qquad (13)$$

where $h_{w} = H[X_{t}\mid\overrightarrow{X}^{t} = w]$ is the entropy in the next symbol after observing the word w. Combining Equations (12) and (13) we have for any t ∈ ℕ:

$$h(t+1) - h = \sum_{w\in\mathcal{L}_{t}}\mathbb{P}(w)\,(h_{w} - \tilde{h}_{w}). \qquad (14)$$

As we will show in Claim 6, concavity of the entropy function implies the quantity h_{w} – h̃_{w} is always nonnegative. Furthermore, in Claim 5 we will show that h_{w} – h̃_{w} is always bounded below by some fixed positive constant for any word w consisting entirely of 2s and 3s. Also, in Claim 3 we will show that ℙ(W_{t}) scales as 1/t, where W_{t} is the set of length-t words consisting entirely of 2s and 3s. Combining these results it follows that h(t + 1) – h ≳ 1/t and, hence, the sum $\mathbf{E}={\sum}_{t=1}^{\infty}(h(t)-h)$ is infinite.

A more detailed analysis with the claims and their proofs is given below. In this we will use the following notation:

- $\mathbb{P}_{\sigma}(\cdot) = \mathbb{P}(\cdot\mid S_{0} = \sigma)$,
- $V_{t} = \{w \in \mathcal{L}_{t} : w \text{ contains only 0s and 1s}\}$ and $W_{t} = \{w \in \mathcal{L}_{t} : w \text{ contains only 2s and 3s}\}$,
- ${\pi}_{ij}^{k}=\mathbb{P}({\sigma}_{ij}^{k})$ is the stationary probability of state ${\sigma}_{ij}^{k}$,
- ${R}_{ij}=\{{\sigma}_{ij}^{1},{\sigma}_{ij}^{2},\dots ,{\sigma}_{ij}^{i}\}$, and
- ${\pi}_{ij}={\sum}_{k=1}^{i}{\pi}_{ij}^{k}$ and ${\pi}_{i}^{1}={\sum}_{j=1}^{{2}^{i}}{\pi}_{ij}^{1}$.

Note that:

and:

These facts will be used in the proof of Claim 1.

#### Claim 1

The underlying Markov chain over states for the HMM is positive recurrent.

#### Proof

Let ${\tau}_{{\sigma}_{01}^{1}}=\text{min}\{t>0:{S}_{t}={\sigma}_{01}^{1}\}$ be the first return time to state ${\sigma}_{01}^{1}$. Then, by continuity:

Hence, the Markov chain is recurrent and we have:

from which it follows that the chain is also positive recurrent. Note that the topology of the chain implies the first return time may not be an odd integer greater than 1.

#### Claim 2

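The stationary distribution π has:

$$\pi_{ij}^{1} = \frac{C}{i^{2}\cdot 2^{i}} \quad (i \ge 1,\ 1 \le j \le 2^{i}), \qquad \pi_{ij}^{k} = \frac{C}{i^{2}\cdot 2^{i}}\cdot\frac{2i+1}{(i+1)^{2}} \quad (i \ge 2,\ 1 \le j \le 2^{i},\ 2 \le k \le i),$$

where $C={\pi}_{01}^{1}(1-{p}_{0})$.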

#### Proof

Existence of a unique stationary distribution π is guaranteed by Claim 1. Given this, clearly ${\pi}_{1}^{1}={\pi}_{01}^{1}(1-{p}_{0})$. Similarly, for i ≥ 1, ${\pi}_{i+1}^{1}={\pi}_{i}^{1}(1-{p}_{i})={\pi}_{i}^{1}\frac{{i}^{2}}{{(i+1)}^{2}}$, from which it follows by induction that ${\pi}_{i}^{1}={\pi}_{01}^{1}(1-{p}_{0})/{i}^{2}$, for all i ≥ 1. By symmetry ${\pi}_{ij}^{1}={\pi}_{i}^{1}/{2}^{i}$ for each i ∈ ℕ and 1 ≤ j ≤ 2^{i}. Therefore, for each i ∈ ℕ, 1 ≤ j ≤ 2^{i} we have ${\pi}_{ij}^{1}={\pi}_{01}^{1}(1-{p}_{0})/({i}^{2}\cdot {2}^{i})=C/({i}^{2}\cdot {2}^{i})$ as was claimed. Moreover, for i ≥ 2, ${\pi}_{ij}^{2}={\pi}_{ij}^{1}\cdot {p}_{i}={\pi}_{ij}^{1}\cdot \frac{2i+1}{{(i+1)}^{2}}$. Combining with the expression for ${\pi}_{ij}^{1}$ gives ${\pi}_{ij}^{2}=\frac{C}{{i}^{2}\cdot {2}^{i}}\cdot \frac{2i+1}{{(i+1)}^{2}}$. By induction, ${\pi}_{ij}^{2}={\pi}_{ij}^{3}=\dots ={\pi}_{ij}^{i}$, so this completes the proof.
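The induction in this proof is easy to check with exact rational arithmetic. The sketch below takes the turn-back probability p_i = (2i+1)/(i+1)² from the proof and iterates the recursion for the level masses, working in units of C; the function names are hypothetical.

```python
from fractions import Fraction

def turn_back_probability(i):
    """p_i = (2i + 1) / (i + 1)^2 for depth i >= 1, as used in the proof of Claim 2."""
    return Fraction(2 * i + 1, (i + 1) ** 2)

def level_mass(i):
    """pi_i^1 / C, obtained by iterating pi_{k+1}^1 = pi_k^1 * (1 - p_k) from pi_1^1 = C."""
    mass = Fraction(1)
    for k in range(1, i):
        mass *= 1 - turn_back_probability(k)
    return mass

for i in (1, 2, 3, 10):
    print(i, level_mass(i), Fraction(1, i * i))
```

The same quantities show that the i – 1 copy states of a branch together hold (i – 1)·p_i times the mass of its tree state, which stays below 2 for every i, so a whole branch carries at most three times the mass of its tree state.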

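Note that for all i ≥ 1 and 1 ≤ j ≤ 2^{i}:

$$\pi_{ij} \ge \pi_{ij}^{1} = \frac{C}{i^{2}\cdot 2^{i}}, \qquad (19)$$

$$\pi_{ij} \le 3\,\pi_{ij}^{1} = \frac{3C}{i^{2}\cdot 2^{i}}. \qquad (20)$$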

Also note that for any t ∈ ℕ and i ≥ 2t we have for each 1 ≤ j ≤ 2^{i}:

- (1) $\mathbb{P}({\overrightarrow{X}}^{t}\in {W}_{t}\mid {S}_{0}={\sigma}_{ij}^{k})=1$, for 2 ≤ k ≤ ⌈i/2⌉ + 1.

- (2) $\left({\sum}_{k=2}^{i}{\pi}_{ij}^{k}\right)/{\pi}_{ij}\ge 1/3$ and $\mid \{k:2\le k\le \lceil i/2\rceil +1\}\mid \ge \frac{1}{2}\cdot \mid \{k:2\le k\le i\}\mid $. Hence, $\left({\sum}_{k=2}^{\lceil i/2\rceil +1}{\pi}_{ij}^{k}\right)/{\pi}_{ij}\ge 1/6$.

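Therefore, for each t ∈ ℕ:

$$\mathbb{P}(\overrightarrow{X}^{t}\in W_{t}\mid S_{0}\in R_{ij}) \ge \frac{1}{6}, \quad \text{for all } i \ge 2t \text{ and } 1 \le j \le 2^{i}. \qquad (21)$$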

Equations (19), (20), and (21) will be used in the proof of Claim 3 below, along with the following simple lemma.

#### **Lemma 1** (Integral Test)

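Let n ∈ ℕ and let f : [n, ∞) → ℝ be a positive, continuous, monotone-decreasing function. Then:

$$\int_{n}^{\infty} f(x)\,dx \le \sum_{k=n}^{\infty} f(k) \le f(n) + \int_{n}^{\infty} f(x)\,dx.$$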

#### Claim 3

ℙ(W_{t}) decays roughly as 1/t. More exactly, C/(12t) ≤ ℙ(W_{t}) ≤ 6C/t for all t ∈ ℕ.

#### Proof

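For any state ${\sigma}_{ij}^{k}$ with i < t, $\mathbb{P}({\overrightarrow{X}}^{t}\in {W}_{t}\mid {S}_{0}={\sigma}_{ij}^{k})=0$. Thus, we have:

$$\mathbb{P}(W_{t}) = \sum_{i=t}^{\infty}\sum_{j=1}^{2^{i}}\mathbb{P}(S_{0}\in R_{ij})\cdot\mathbb{P}(\overrightarrow{X}^{t}\in W_{t}\mid S_{0}\in R_{ij}) = \sum_{i=t}^{\infty}2^{i}\cdot\mathbb{P}(S_{0}\in R_{i1})\cdot\mathbb{P}(\overrightarrow{X}^{t}\in W_{t}\mid S_{0}\in R_{i1}), \qquad (22)$$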

where the final equality follows from symmetry. We prove the bounds from above and below on ℙ(W_{t}) separately using Equation (22).

Bound from below:

$$\begin{array}{l}\mathbb{P}({W}_{t})=\sum _{i=t}^{\infty}{2}^{i}\cdot \mathbb{P}({S}_{0}\in {R}_{i1})\cdot \mathbb{P}({\overrightarrow{X}}^{t}\in {W}_{t}\mid {S}_{0}\in {R}_{i1})\\ \ge \sum _{i=2t}^{\infty}{2}^{i}\cdot \mathbb{P}({S}_{0}\in {R}_{i1})\cdot \mathbb{P}({\overrightarrow{X}}^{t}\in {W}_{t}\mid {S}_{0}\in {R}_{i1})\\ \stackrel{(a)}{\ge}\sum _{i=2t}^{\infty}{2}^{i}\cdot \frac{C}{{2}^{i}\cdot {i}^{2}}\cdot \frac{1}{6}\\ =\frac{C}{6}\sum _{i=2t}^{\infty}\frac{1}{{i}^{2}}\\ \stackrel{(b)}{\ge}\frac{C}{6}{\int}_{2t}^{\infty}\frac{1}{{x}^{2}}dx\\ =\frac{C}{12t}.\end{array}$$

Here, (a) follows from Equations (19) and (21) and (b) from Lemma 1.

Bound from above:

$$\begin{array}{l}\mathbb{P}({W}_{t})=\sum _{i=t}^{\infty}{2}^{i}\cdot \mathbb{P}({S}_{0}\in {R}_{i1})\cdot \mathbb{P}({\overrightarrow{X}}^{t}\in {W}_{t}\mid {S}_{0}\in {R}_{i1})\\ \stackrel{(a)}{\le}\sum _{i=t}^{\infty}{2}^{i}\cdot \frac{3C}{{2}^{i}\cdot {i}^{2}}\cdot 1\\ =3C\sum _{i=t}^{\infty}\frac{1}{{i}^{2}}\\ \stackrel{(b)}{\le}3C\left(\frac{1}{{t}^{2}}+{\int}_{t}^{\infty}\frac{1}{{x}^{2}}dx\right)\\ =3C\cdot \left(\frac{1}{{t}^{2}}+\frac{1}{t}\right)\\ \le \frac{6C}{t}.\end{array}$$

Here, (a) follows from Equation (20) and (b) from Lemma 1.
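The two integral-test steps can be checked numerically. The sketch below evaluates the tail sums of 1/i² exactly, up to floating-point error, by subtracting a finite head from ζ(2) = π²/6, and compares them with the bounds used in steps (b); the function name is hypothetical.

```python
import math

def tail_sum_inverse_squares(n):
    """sum_{i >= n} 1/i^2, computed as pi^2/6 minus the finite head sum_{i < n} 1/i^2."""
    return math.pi ** 2 / 6 - sum(1.0 / (i * i) for i in range(1, n))

for t in (1, 2, 5, 10, 100):
    lower_tail = tail_sum_inverse_squares(2 * t)  # appears in the bound from below
    upper_tail = tail_sum_inverse_squares(t)      # appears in the bound from above
    print(
        t,
        lower_tail >= 1.0 / (2 * t),              # integral from 2t of 1/x^2 is 1/(2t)
        upper_tail <= 1.0 / t ** 2 + 1.0 / t,     # f(t) plus integral from t
        upper_tail <= 2.0 / t,                    # hence 3C * tail <= 6C / t
    )
```

Multiplying the first comparison by C/6 and the last by 3C recovers the stated bounds C/(12t) and 6C/t.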

#### Claim 4

ℙ(X_{t} ∈ {2, 3}|X⃗^{t} = w) ≥ 1/150, for all t ∈ ℕ and w ∈ W_{t}.

#### Proof

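Applying Claim 3 we have for any t ∈ ℕ:

$$\mathbb{P}(X_{t}\in\{2,3\}\mid\overrightarrow{X}^{t}\in W_{t}) = \frac{\mathbb{P}(W_{t+1})}{\mathbb{P}(W_{t})} \ge \frac{C/(12(t+1))}{6C/t} = \frac{t}{72(t+1)} \ge \frac{1}{144} \ge \frac{1}{150}.$$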

By symmetry, ℙ(X_{t} ∈ {2, 3}|X⃗^{t} = w) is the same for each w ∈ W_{t}. Thus, the same bound must also hold for each w ∈ W_{t} individually: ℙ(X_{t} ∈ {2, 3}|X⃗^{t} = w) ≥ 1/150 for all w ∈ W_{t}.

#### Claim 5

For each t ∈ ℕ and w ∈ W_{t},

- (i) h̃_{w} ≤ 1/300 and

- (ii) h_{w} ≥ 1/150.

Hence, h_{w} – h̃_{w} ≥ 1/300.

#### Proof of (i)

${h}_{{\sigma}_{ij}^{k}}=0$, for all i ≥ 1, 1 ≤ j ≤ 2^{i}, and k ≥ 2. And, for each w ∈ W_{t}, $\phi {(w)}_{{\sigma}_{ij}^{1}}=0$, for all i ≥ 1 and 1 ≤ j ≤ 2^{i}. Hence, for each w ∈ W_{t}, ${\tilde{h}}_{w}={\sum}_{\sigma \in \mathcal{S}}\phi {(w)}_{\sigma}{h}_{\sigma}=\phi {(w)}_{{\sigma}_{01}^{1}}{h}_{{\sigma}_{01}^{1}}$. By construction of the machine ${h}_{{\sigma}_{01}^{1}}\le 1/300$ and, clearly, $\phi {(w)}_{{\sigma}_{01}^{1}}$ can never exceed 1. Thus, h̃_{w} ≤ 1/300 for all w ∈ W_{t}.

#### Proof of (ii)

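Let the random variable Z_{t} be defined by: Z_{t} = 1 if X_{t} ∈ {2, 3} and Z_{t} = 0 if X_{t} ∉ {2, 3}. By Claim 4, $\mathbb{P}(Z_{t} = 1\mid\overrightarrow{X}^{t} = w) \ge 1/150$ for any w ∈ W_{t}. Also, by symmetry, the probabilities of a 2 or a 3 following any word w ∈ W_{t} are equal, so $\mathbb{P}(X_{t} = 2\mid\overrightarrow{X}^{t} = w, Z_{t} = 1) = \mathbb{P}(X_{t} = 3\mid\overrightarrow{X}^{t} = w, Z_{t} = 1) = 1/2$. Therefore, for any w ∈ W_{t}:

$$h_{w} = H[X_{t}\mid\overrightarrow{X}^{t} = w] \ge H[X_{t}\mid\overrightarrow{X}^{t} = w, Z_{t}] \ge \mathbb{P}(Z_{t} = 1\mid\overrightarrow{X}^{t} = w)\cdot H[X_{t}\mid\overrightarrow{X}^{t} = w, Z_{t} = 1] \ge \frac{1}{150}\cdot 1 = \frac{1}{150}.$$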

#### Claim 6

For each t ∈ ℕ and w ∈ $\mathcal{L}$_{t}, h_{w} –h̃_{w} ≥ 0.

#### Proof

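For $w \in \mathcal{L}_{t}$, let $P_{w} = \mathbb{P}(X_{t}\mid\overrightarrow{X}^{t} = w)$ denote the probability distribution over the next output symbol after observing the word w. Also, for $\sigma \in \mathcal{S}$, let $P_{\sigma} = \mathbb{P}(X_{t}\mid S_{t} = \sigma)$ denote the probability distribution over the next output symbol given that the current state is σ. Then, by concavity of the entropy function H [·], we have that for any $w \in \mathcal{L}_{t}$:

$$h_{w} = H[P_{w}] = H\Bigl[\sum_{\sigma}\phi(w)_{\sigma}P_{\sigma}\Bigr] \ge \sum_{\sigma}\phi(w)_{\sigma}H[P_{\sigma}] = \sum_{\sigma}\phi(w)_{\sigma}h_{\sigma} = \tilde{h}_{w}.$$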

#### Claim 7

The quantity h(t) – h decays at a rate no faster than 1/t. More exactly, $h(t+1)-h\ge \frac{C}{3600t}$, for all t ∈ ℕ.

#### Proof

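As noted above, since the machine satisfies the conditions of Proposition 1, the entropy rate is given by Equation (8) and the difference h(t + 1) – h is given by Equation (14). Therefore, applying Claims 3, 5, and 6 we may bound this difference h(t + 1) – h as follows:

$$h(t+1) - h = \sum_{w\in\mathcal{L}_{t}}\mathbb{P}(w)\,(h_{w} - \tilde{h}_{w}) \ge \sum_{w\in W_{t}}\mathbb{P}(w)\,(h_{w} - \tilde{h}_{w}) \ge \frac{1}{300}\cdot\mathbb{P}(W_{t}) \ge \frac{1}{300}\cdot\frac{C}{12t} = \frac{C}{3600t}.$$

Here, the first inequality follows from Claim 6, the second from Claim 5, and the third from Claim 3.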

With the above decay on h(t) established we easily see the Branching Copy Process must have infinite excess entropy.

#### Proposition 3

The excess entropy **E** for the BC Process is infinite.

#### Proof

$\mathbf{E}={\sum}_{t=1}^{\infty}(h(t)-h)$. By Claim 7, the terms of this sum decay no faster than a constant multiple of 1/t, so the sum must diverge.
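The divergence can be written out explicitly as a comparison with the harmonic series. The following sketch uses only Claim 7 and the fact that h(1) – h ≥ 0:

```latex
\begin{align*}
\mathbf{E}
  = \sum_{t=1}^{\infty} \bigl( h(t) - h \bigr)
  \;\ge\; \sum_{t=1}^{\infty} \bigl( h(t+1) - h \bigr)
  \;\ge\; \frac{C}{3600} \sum_{t=1}^{\infty} \frac{1}{t}
  = \infty .
\end{align*}
```

The first inequality drops the nonnegative term h(1) – h, and the constant $C={\pi}_{01}^{1}(1-{p}_{0})$ is strictly positive, so the harmonic lower bound is not vacuous.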

## 4. Conclusions

Any stationary, finite-alphabet process can be presented by a stationary HMM with an uncountable state set. Thus, there exist stationary HMMs with uncountable state sets capable of generating infinitary, finite-alphabet processes. It is impossible, however, to have a finite-state, stationary HMM that generates an infinitary process. The excess entropy **E** is always bounded by the entropy in the stationary distribution H [π], which is finite for any finite-state HMM. Countable-state HMMs are intermediate between the finite and uncountable cases, and it was previously not shown whether infinite excess entropy was possible in this case, or not. We have demonstrated that it is indeed possible, by giving two explicit constructions of finite-alphabet, infinitary processes generated by stationary HMMs with countable state sets.

The second example, the Branching Copy Process, is also ergodic—a strong restriction. It is a priori quite plausible that infinite **E** might only occur in the countable-state case for nonergodic processes. Moreover, both HMMs we constructed are unifilar, so the ε-machines [12,16] of the processes have countable state sets as well. Again, unifilarity is a strong restriction to impose, and it is a priori conceivable that infinite **E** might only occur in the countable-state case for nonunifilar HMMs. Our examples have shown, though, that infinite **E** is possible for countable-state HMMs, even if one requires both ergodicity and unifilarity.

Following the original release of the above results [9], two additional examples of both ergodic and nonergodic infinitary, finite-alphabet processes with countable-state HMM presentations appeared [10]. For these examples it was shown that the mutual information $\mathbf{E}(t) = I[\overleftarrow{X}^{t};\overrightarrow{X}^{t}]$ between length-t blocks diverges as a power law, whereas in our nonergodic example it diverges sublogarithmically and in our ergodic example, presumably, at most logarithmically. The ergodic example given in [10] is also somewhat simpler than ours. However, the HMM presentation for the ergodic process there is not unifilar and, moreover, one does not expect the ε-machine for this process to have a countable state set either. Taking this all into account leaves open the question: Is power law divergence of **E**(t) possible for ergodic processes with unifilar, countable-state HMM presentations?

## Acknowledgments

The authors thank Lukasz Debowski for helpful discussions. Nicholas F. Travers was partially supported on a National Science Foundation VIGRE fellowship. This material is based upon work supported by, or in part by, the US Army Research Laboratory and the US Army Research Office under grant number W911NF-12-1-0288 and the Defense Advanced Research Projects Agency (DARPA) Physical Intelligence project via subcontract No. 9060-000709. The views, opinions, and findings here are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the DARPA or the Department of Defense.

## Appendix

We prove Proposition 1 from Section 2.2, which states that the entropy rate of any countable-state, exact, unifilar HMM is given by the standard formula:

$$h = H[X_{0}\mid S_{0}] = \sum_{\sigma\in\mathcal{S}}\pi_{\sigma}h_{\sigma}.$$

#### Proof

Let $\mathcal{L}_{t}$ = {w : |w| = t, ℙ(w) > 0} be the set of length-t words in the process language $\mathcal{L}$, and let φ(w) be the conditional state distribution induced by a word $w \in \mathcal{L}_{t}$: i.e., $\phi(w)_{\sigma} = \mathbb{P}(S_{t} = \sigma\mid\overrightarrow{X}^{t} = w)$. Furthermore, let $\tilde{h}_{w} = \sum_{\sigma}\phi(w)_{\sigma}h_{\sigma}$ be the φ(w)-weighted average entropy in the next symbol given knowledge of the current state σ. And, let $h_{w} = H[X_{t}\mid\overrightarrow{X}^{t} = w]$ be the entropy in the next symbol after observing the word w. Note that:

- (1) $h(t+1) = H[X_{t}\mid\overrightarrow{X}^{t}] = \sum_{w\in\mathcal{L}_{t}}\mathbb{P}(w)\,h_{w}$, and

- (2) $\sum_{\sigma}\pi_{\sigma}h_{\sigma} = \sum_{\sigma}\sum_{w\in\mathcal{L}_{t}}\mathbb{P}(w)\,\phi(w)_{\sigma}h_{\sigma} = \sum_{w\in\mathcal{L}_{t}}\mathbb{P}(w)\left(\sum_{\sigma}\phi(w)_{\sigma}h_{\sigma}\right) = \sum_{w\in\mathcal{L}_{t}}\mathbb{P}(w)\,\tilde{h}_{w}$.

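Thus, since we know h(t) limits to h, it suffices to show that:

$$\lim_{t\to\infty}\sum_{w\in\mathcal{L}_{t}}\mathbb{P}(w)\,(h_{w} - \tilde{h}_{w}) = 0. \qquad (26)$$

Now, for any $w \in \mathcal{L}_{t}$, we have $|h_{w} - \tilde{h}_{w}| \le \log|\mathcal{X}|$. However, for a synchronizing word $w = w_{1}\ldots w_{t}$ with $H[S_{t}\mid\overrightarrow{X}^{t} = w] = 0$, h_{w} – h̃_{w} is always 0, since the distribution φ(w) is concentrated only on a single state. Combining these two facts gives the estimate:

$$\Bigl|\sum_{w\in\mathcal{L}_{t}}\mathbb{P}(w)\,(h_{w} - \tilde{h}_{w})\Bigr| \le \sum_{w\in\mathcal{L}_{t}}\mathbb{P}(w)\,|h_{w} - \tilde{h}_{w}| \le \log|\mathcal{X}|\cdot\mathbb{P}(NS_{t}), \qquad (27)$$

where NS_{t} is the set of length-t words that are nonsynchronizing and ℙ(NS_{t}) is the combined probability of all words in this set. Since the HMM is exact, we know that for a.e. infinite future $\overrightarrow{x}$ an observer will synchronize exactly at some finite time $t = t(\overrightarrow{x})$. And, since it is unifilar, the observer will remain synchronized for all t′ ≥ t. It follows that ℙ(NS_{t}) must be monotonically decreasing and limit to 0:

$$\lim_{t\to\infty}\mathbb{P}(NS_{t}) = 0. \qquad (28)$$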

Combining Equation (27) with Equation (28) shows that Equation (26) does in fact hold, which completes the proof.

## Conflicts of Interest

The authors declare no conflict of interest.

**Author Contribution** Nicholas F. Travers and James P. Crutchfield designed research; Nicholas F. Travers performed research; Nicholas F. Travers and James P. Crutchfield wrote the paper. Both authors read and approved the final manuscript.

## References

1. Del Junco, A.; Rahe, M. Finitary codings and weak Bernoulli partitions. Proc. AMS **1979**, 75.
2. Crutchfield, J.P.; Packard, N.H. Symbolic dynamics of one-dimensional maps: Entropies, finite precision, and noise. Int. J. Theor. Phys. **1982**, 21, 433.
3. Grassberger, P. Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys. **1986**, 25, 907–938.
4. Lindgren, K.; Nordahl, M.G. Complexity measures and cellular automata. Complex Syst. **1988**, 2, 409–440.
5. Bialek, W.; Nemenman, I.; Tishby, N. Predictability, complexity, and learning. Neural Comput. **2001**, 13, 2409–2463.
6. Debowski, L. Excess entropy in natural language: Present state and perspectives. Chaos **2011**, 21, 037105.
7. Crutchfield, J.P.; Feldman, D.P. Regularities unseen, randomness observed: Levels of entropy convergence. Chaos **2003**, 13, 25–54.
8. Ebeling, W. Prediction and entropy of nonlinear dynamical systems and symbolic sequences with LRO. Physica D **1997**, 109, 42–52.
9. Travers, N.F.; Crutchfield, J.P. Infinite excess entropy processes with countable-state generators. **2011**, arXiv:1111.3393.
10. Debowski, L. On hidden Markov processes with infinite excess entropy. J. Theor. Probab. **2012**.
11. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley: New York, NY, USA, 2006.
12. Löhr, W. Models of Discrete Time Stochastic Processes and Associated Complexity Measures. Ph.D. Thesis, Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany, 2010.
13. Travers, N.F.; Crutchfield, J.P. Asymptotic synchronization for finite-state sources. J. Stat. Phys. **2011**, 145, 1202–1223.
14. Travers, N.F.; Crutchfield, J.P. Exact synchronization for finite-state sources. J. Stat. Phys. **2011**, 145, 1181–1201.
15. Debowski, L. A general definition of conditional information and its application to ergodic decomposition. Stat. Probab. Lett. **2009**, 79, 1260–1268.
16. Crutchfield, J.P.; Young, K. Inferring statistical complexity. Phys. Rev. Lett. **1989**, 63, 105–108.

**Figure 1.** A hidden Markov model (the ε-machine) for the Even Process. The support for this process consists of all binary sequences in which blocks of uninterrupted 1s are even in length, bounded by 0s. After each even length is reached, there is a probability p of breaking the block of 1s by inserting a 0. The machine has two internal states $\mathcal{S}$ = {σ_{1}, σ_{2}}, a two-symbol alphabet $\mathcal{X}$ = {0, 1}, and a single parameter p ∈ (0, 1) that controls the transition probabilities. The associated Markov chain over states is finite-state and irreducible and, thus, has a unique stationary distribution π = (π_{1}, π_{2}) = (1/(2 − p), (1 − p)/(2 − p)). The graphical representation of the machine is given on the left, with the corresponding transition matrices on the right. In the graphical representation the symbols labeling the transitions have been colored blue, for visual contrast, while the transition probabilities are black.
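The stationary distribution quoted in the caption of Figure 1 can be verified directly. The following is a minimal sketch, assuming the transition structure implied by the caption (σ_{1} emits 0 with probability p and stays put, or emits 1 with probability 1 − p and moves to σ_{2}; σ_{2} emits 1 and returns to σ_{1}). The matrices shown in the figure itself are not reproduced in the text, so this reading is an assumption, as are the function names.

```python
from fractions import Fraction

def even_process_matrices(p):
    """Symbol-labeled matrices T0, T1 (assumed reading of the Figure 1 caption)."""
    T0 = [[p, 0], [0, 0]]        # sigma_1 --0--> sigma_1 with probability p
    T1 = [[0, 1 - p], [1, 0]]    # sigma_1 --1--> sigma_2, sigma_2 --1--> sigma_1
    return T0, T1

def claimed_stationary(p):
    """pi = (1/(2 - p), (1 - p)/(2 - p)) as stated in the caption."""
    return [1 / (2 - p), (1 - p) / (2 - p)]

def left_multiply(vec, mat):
    """Row vector times matrix."""
    n = len(vec)
    return [sum(vec[i] * mat[i][j] for i in range(n)) for j in range(n)]

p = Fraction(1, 3)                       # any p in (0, 1) works here
T0, T1 = even_process_matrices(p)
T = [[T0[i][j] + T1[i][j] for j in range(2)] for i in range(2)]  # state-to-state chain
pi = claimed_stationary(p)

assert sum(pi) == 1                      # pi is a probability vector
assert left_multiply(pi, T) == pi        # pi T = pi, so pi is stationary
```

Exact rational arithmetic is used so that the stationarity check is an equality rather than a floating-point comparison.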

**Figure 2.** A countable-state hidden Markov model (HMM) for the Heavy-Tailed Periodic Mixture Process. The machine M is the union of the machines M_{i}, i ≥ 2, generating the period-i processes of i − 1 1s followed by a 0. All topologically allowed transitions have probability 1. So, for visual clarity these probabilities are omitted from the edge labels and only the symbols labeling the transitions are given. The stationary distribution π is chosen such that the combined probability μ_{i} of all states in the i-th component is μ_{i} = C/(i log^{2} i), where $C=1/\left({\sum}_{i=2}^{\infty}1/(i\hspace{0.17em}{\text{log}}^{2}\hspace{0.17em}i)\right)$ is a normalizing constant. Formally, the HMM $M = (\mathcal{S}, \mathcal{X}, \{T^{(x)}\}, \pi)$ has alphabet $\mathcal{X}$ = {0, 1}, state set $\mathcal{S}$ = {σ_{ij} : i ≥ 2, 1 ≤ j ≤ i}, stationary distribution π defined by π_{ij} = C/(i^{2} log^{2} i), and transition probabilities ${T}_{ij,i(j+1)}^{(1)}=1$ for i ≥ 2 and 1 ≤ j < i, ${T}_{ii,i1}^{(0)}=1$ for i ≥ 2, and all other transition probabilities 0. Note that all logs here (and throughout) are taken base 2, as is typical when using information-theoretic quantities.
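The weights in the caption of Figure 2 can be explored numerically. The sketch below is only an illustration under a stated assumption: the infinite sum defining C is replaced by a finite truncation at a hypothetical cutoff `n_max`, so the resulting values are approximations to μ_{i} and π_{ij}, not the exact quantities. It checks that spreading μ_{i} evenly over the i states of component i reproduces the form π_{ij} ∝ 1/(i^{2} log^{2} i), and it reports the entropy of the truncated component weights for several cutoffs.

```python
import math

def component_weights(n_max):
    """Truncated, renormalized stand-in for mu_i = C / (i log2(i)^2), i = 2..n_max."""
    raw = {i: 1.0 / (i * math.log2(i) ** 2) for i in range(2, n_max + 1)}
    total = sum(raw.values())            # approximates 1/C from below
    return {i: w / total for i, w in raw.items()}

def state_weights(mu):
    """Each of the i states in component i receives mu_i / i."""
    return {(i, j): m / i for i, m in mu.items() for j in range(1, i + 1)}

def entropy_bits(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

mu = component_weights(50)
pi = state_weights(mu)
print(abs(sum(pi.values()) - 1.0) < 1e-12)   # state weights still sum to 1

# Entropy of the truncated component weights for growing cutoffs.
for n_max in (10, 100, 1000, 10000):
    print(n_max, entropy_bits(component_weights(n_max).values()))
```

The series for 1/C converges very slowly (its tail shrinks only like 1/log n), so a truncated sum is a rough stand-in for C. The printed entropies increase with the cutoff, which is consistent with the heavy tail that gives the process its name.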

**Figure 3.** A countable-state HMM for the Branching Copy Process. The machine M is essentially a binary tree with loop-back paths from each node in the tree to the root node and a self-loop on the root. At each node ${\sigma}_{ij}^{1}$ in the tree there is a probability 2q_{i} of continuing down the tree and a probability p_{i} = 1 − 2q_{i} of turning back towards the root ${\sigma}_{01}^{1}$ on path ${l}_{ij}~{\sigma}_{ij}^{1}\to {\sigma}_{ij}^{2}\to {\sigma}_{ij}^{3}\dots \to {\sigma}_{ij}^{i}\to {\sigma}_{01}^{1}$. If the choice is made to head back, the next i − 1 transitions are deterministic. The path of 0s and 1s taken to get from ${\sigma}_{01}^{1}$ to ${\sigma}_{ij}^{1}$ is copied on the return with 0s replaced by 2s and 1s replaced by 3s. Formally, the alphabet is $\mathcal{X}$ = {0, 1, 2, 3, 4} and the state set is $\mathcal{S}=\{{\sigma}_{ij}^{k}:i\ge 0,1\le j\le {2}^{i},1\le k\le \text{max}\{i,1\}\}$. The nonzero transition probabilities are as depicted graphically with p_{i} = 1 − 2q_{i} for all i ≥ 0, q_{i} = i^{2}/[2(i + 1)^{2}] for all i ≥ 1, and q_{0} > 0 taken sufficiently small so that H[(p_{0}, q_{0}, q_{0})] ≤ 1/300. The graph is strongly connected so the Markov chain over states is irreducible. Claim 1 shows that the Markov chain is also positive recurrent and, hence, has a unique stationary distribution π. Claim 2 gives the form of π.
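The parameter choices in the caption of Figure 3 lend themselves to a quick check. The sketch below is an illustration only. It confirms with exact arithmetic that the probabilities 2q_{i} of continuing down the tree telescope, so that the chance of descending from depth 1 to depth n without turning back is 1/n^{2}, and it tests the entropy condition on q_{0} for one example value. The value `Q0_EXAMPLE = 1e-4` is a hypothetical choice for illustration; the caption only requires q_{0} to be sufficiently small.

```python
import math
from fractions import Fraction

def q(i):
    """q_i = i^2 / (2 (i + 1)^2) for i >= 1, as given in the caption."""
    return Fraction(i * i, 2 * (i + 1) ** 2)

def prob_descend(n):
    """Probability of going from depth 1 down to depth n without turning back."""
    prob = Fraction(1)
    for i in range(1, n):
        prob *= 2 * q(i)         # 2 q_i: continue to either of the two children
    return prob

def entropy_bits(dist):
    """Shannon entropy in bits of a finite distribution."""
    return -sum(x * math.log2(x) for x in dist if x > 0)

# Telescoping product: prod_{i=1}^{n-1} i^2/(i+1)^2 = 1/n^2.
for n in range(1, 11):
    assert prob_descend(n) == Fraction(1, n * n)

Q0_EXAMPLE = 1e-4                # hypothetical value, not taken from the paper
p0 = 1 - 2 * Q0_EXAMPLE
print(entropy_bits((p0, Q0_EXAMPLE, Q0_EXAMPLE)) <= 1 / 300)
```

Including the first step out of the root, the probability of reaching depth n in one excursion is 2q_{0}/n^{2} under the same reading of the caption.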

© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).