Article

Equivalence of History and Generator ϵ-Machines

by Nicholas F. Travers 1,2 and James P. Crutchfield 1,2,3,*

1 Complexity Sciences Center, University of California at Davis, One Shields Avenue, Davis, CA 95616, USA
2 Mathematics Department, University of California at Davis, One Shields Avenue, Davis, CA 95616, USA
3 Physics Department, University of California at Davis, One Shields Avenue, Davis, CA 95616, USA
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(1), 78; https://doi.org/10.3390/sym17010078
Submission received: 3 December 2024 / Revised: 25 December 2024 / Accepted: 31 December 2024 / Published: 6 January 2025
(This article belongs to the Special Issue Symmetry in Geometric Mechanics and Mathematical Physics)

Abstract: ϵ-Machines are minimal, unifilar presentations of stationary stochastic processes. They were originally defined in the history machine sense, as hidden Markov models whose states are the equivalence classes of infinite pasts with the same probability distribution over futures. In analyzing synchronization, though, an alternative generator definition was given: unifilar, edge-emitting hidden Markov models with probabilistically distinct states. The key difference is that history ϵ-machines are defined by a process, whereas generator ϵ-machines define a process. We show here that these two definitions are equivalent in the finite-state case.

1. Introduction

Let $\mathcal{P} = (X_t)_{t \in \mathbb{Z}}$ be a stationary stochastic process. The ϵ-machine $M = M(\mathcal{P})$ for the process $\mathcal{P}$ is the hidden Markov model whose states consist of equivalence classes of infinite past sequences (histories) $\overleftarrow{x} = \cdots x_{-2} x_{-1}$ with the same probability distribution over future sequences $\overrightarrow{x} = x_0 x_1 \cdots$. The corresponding equivalence relation on the set of pasts $\overleftarrow{x}$ is denoted by $\sim_\epsilon$ as follows:
$$\overleftarrow{x} \sim_\epsilon \overleftarrow{x}' \quad \text{if} \quad P(\overrightarrow{X} \mid \overleftarrow{x}) = P(\overrightarrow{X} \mid \overleftarrow{x}'), \qquad (1)$$
where $\overrightarrow{X} = X_0 X_1 \cdots$ denotes the infinite sequence of future random variables.
These machines were first introduced in [1] as minimal, predictive models to measure the structural complexity of dynamical systems and have subsequently been applied in a number of contexts for nonlinear modeling [2,3,4,5,6]. Important extensions and a more thorough development of the theory were given in [7,8,9,10]. However, it was not until quite recently that the first fully formal construction was presented in [11,12,13].
Shortly thereafter, in our studies of synchronization [14,15], we introduced an alternative “generator ϵ-machine” definition, in contrast to the original “history ϵ-machine” construction discussed above. A generator ϵ-machine is defined simply as a unifilar, edge-emitting hidden Markov model with probabilistically distinct states. As opposed to the history ϵ-machine $M_h = M_h(\mathcal{P})$, which is derived from a process $\mathcal{P}$, a generator ϵ-machine $M_g$ itself defines a stationary process $\mathcal{P} = \mathcal{P}(M_g)$. Namely, the stationary output process of the hidden Markov model $M_g$ is obtained by choosing the initial state according to the stationary distribution $\pi$ over states of the underlying Markov chain.
The main contribution established here is that the history and generator ϵ-machine definitions are equivalent in the finite-state case. This equivalence has long been assumed, though without the generator definition ever being formally specified. Our work makes the definition explicit and gives one of the first formal proofs of equivalence.
The equivalence is also implicit in [16]; in fact, this is the case for a more general class of machines, not just finite-state ones. However, the techniques used there differ substantially from ours and use somewhat more machinery. In particular, the proof of equivalence in the more difficult direction of Theorem 1 (Section 4.1) uses a supermartingale argument that, though elegant, relies implicitly on the martingale convergence theorem and is not particularly concrete. By contrast, our proof of Theorem 1 follows directly from the synchronization results given in [14,15], which are themselves fairly elementary, using only basic information theory and a large deviation estimate for finite-state Markov chains. Thus, the alternative proof presented here should be useful in providing intuition for the theorem. Furthermore, since the definitions and terminology used in [16] differ significantly from ours, it is not immediately clear that the history-generator equivalence is what is shown there or whether this is a consequence of what is shown. Thus, the exposition here should be helpful in clarifying these issues.
We note also that in order to parallel the generator ϵ -machine definition used in our synchronization studies and apply the results from those works, we somewhat restrict the range of processes when defining history ϵ -machines. In particular, we assume when defining history ϵ -machines that the process P is not only stationary but also ergodic and that the process alphabet is finite. This is required for equivalence, since the output process of a generator ϵ -machine is of this form. However, neither of these assumptions is strictly necessary for history ϵ -machines. Only stationarity is actually needed. The history ϵ -machine definition can be extended to nonergodic stationary processes and countable or even more general alphabets [9,16].

2. Related Work

Since their introduction in the late 1980s, most of the work on ϵ-machines, both theoretical and applied, has come from physics and information theory perspectives. However, similar concepts have been around for some time in several other disciplines. Among others, there has been substantial work on related topics by both probabilists and automata theorists, as well as those in the symbolic dynamics community. Below, we review some of the most germane developments in these areas. The interested reader is also referred to ref. [9] (Appendix H), where a very broad overview of such connections is given, and to ref. [17] for a recent review of the relation between symbolic dynamics and hidden Markov models in general.
We hope that our review provides context for the study of ϵ -machines and helps elucidate the relationship between ϵ -machines and other related models—both in terms of their similarities and their differences. However, an understanding of these relationships will not be necessary for the equivalence results that follow. The reader uninterested in these connections may safely skip to the definitions in Section 3.

2.1. Sofic Shifts and Topological Presentations

Let $\mathcal{X}$ be a finite alphabet, and let $\mathcal{X}^{\mathbb{Z}}$ denote the set of all bi-infinite sequences $x = \cdots x_{-1} x_0 x_1 \cdots$ consisting of symbols in $\mathcal{X}$. A subshift $\Sigma \subseteq \mathcal{X}^{\mathbb{Z}}$ is said to be sofic if it is the image of a subshift of finite type under a k-block factor map. This concept was first introduced in [18], where it was also shown that any sofic shift Σ may be presented as a finite, directed graph with edges labeled by symbols in the alphabet $\mathcal{X}$. The allowed sequences $x \in \Sigma$ consist of projections (under the edge labeling) of bi-infinite walks on the graph edges.
In the following, we will assume that all vertices in a presenting graph G of a sofic shift are essential. That is, each vertex v occurs as the target vertex of some edge e in a bi-infinite walk on the graph edges. If this is not the case, one may restrict to the subgraph G′ consisting of the essential vertices in G, along with their outgoing edges, and G′ will also be a presenting graph for the sofic shift Σ. Thus, it is only necessary to consider presenting graphs in which all vertices are essential.
The language $\mathcal{L}(\Sigma)$ of a subshift Σ is the set of all finite words w occurring in some point $x \in \Sigma$. For a sofic shift Σ with a presenting graph G, one may consider the nondeterministic finite automaton (NFA) M associated with the graph G, in which all states (vertices of G) are both start and accept states. Clearly (under the assumption that all vertices are essential), the language accepted by M is just $\mathcal{L}(\Sigma)$. Thus, the language of any sofic shift is regular. By standard algorithms (see, e.g., [19]) one may obtain from M the unique, minimal, deterministic finite automaton (DFA) M′ with the fewest states of all DFAs accepting the language $\mathcal{L}(\Sigma)$. We call M′ the minimal deterministic automaton for the sofic shift Σ.
A subshift Σ is said to be irreducible if for any two words $w_1, w_2 \in \mathcal{L}(\Sigma)$ there exists $w_3 \in \mathcal{L}(\Sigma)$ such that $w_1 w_3 w_2 \in \mathcal{L}(\Sigma)$. As shown in [20], a sofic shift is irreducible if and only if it has some irreducible (i.e., strongly connected) presenting graph G.
A presenting graph G of a sofic shift Σ is said to be unifilar or right-resolving if for each vertex $v \in G$ and symbol $x \in \mathcal{X}$ there is at most one outgoing edge e from v labeled with the symbol x. As shown in [21], an irreducible sofic shift Σ always has a unique, minimal, unifilar presenting graph that, as it turns out, is also irreducible. In symbolic dynamics, this presentation is often referred to as the (right) Fischer cover of Σ.
For an irreducible sofic shift Σ, the graph associated with the minimal deterministic automaton always has a single recurrent, irreducible component. This recurrent component is isomorphic to the Fischer cover. That is, there exists a bijection between vertices in the automaton graph and vertices of the Fischer cover that preserves both edges and edge labels.
A related notion is the Krieger cover, based on future sets [22]. For a subshift $\Sigma \subseteq \mathcal{X}^{\mathbb{Z}}$, let $\Sigma^{+}$ denote the set of allowed future sequences $\overrightarrow{x} = x_0 x_1 \cdots$ and let $\Sigma^{-}$ be the set of allowed past sequences $\overleftarrow{x} = \cdots x_{-2} x_{-1}$. This is expressed as follows:
$$\Sigma^{+} = \{\overrightarrow{x} : \exists\, \overleftarrow{x} \text{ with } \overleftarrow{x}\,\overrightarrow{x} \in \Sigma\} \quad \text{and} \quad \Sigma^{-} = \{\overleftarrow{x} : \exists\, \overrightarrow{x} \text{ with } \overleftarrow{x}\,\overrightarrow{x} \in \Sigma\}.$$
Furthermore, for a past $\overleftarrow{x} \in \Sigma^{-}$, let the future set $F(\overleftarrow{x})$ of $\overleftarrow{x}$ be the set of all possible future sequences $\overrightarrow{x}$ that can follow $\overleftarrow{x}$:
$$F(\overleftarrow{x}) = \{\overrightarrow{x} \in \Sigma^{+} : \overleftarrow{x}\,\overrightarrow{x} \in \Sigma\}.$$
Define an equivalence relation $\sim_K$ on the set of infinite pasts $\overleftarrow{x} \in \Sigma^{-}$ by the following:
$$\overleftarrow{x} \sim_K \overleftarrow{x}' \quad \text{if} \quad F(\overleftarrow{x}) = F(\overleftarrow{x}'). \qquad (2)$$
The Krieger cover of Σ is the (possibly infinite) directed, edge-labeled graph G whose vertices consist of equivalence classes of pasts $\overleftarrow{x}$ under the relation $\sim_K$. There is a directed edge in G from vertex v to vertex v′ labeled with symbol x if, for some past $\overleftarrow{x} \in v$ (equivalently, all pasts $\overleftarrow{x} \in v$), the past $\overleftarrow{x}' = \overleftarrow{x}\,x \in v'$. By construction, the Krieger cover G is necessarily unifilar. Moreover, it is easily shown that G is a finite graph if and only if the subshift Σ is sofic.
If the subshift Σ is both irreducible and sofic, then the Krieger cover is isomorphic to the subgraph of the minimal deterministic automaton consisting of all states v that are not finite-time transient (and their outgoing edges). That is, the subgraph consisting of those states v such that there exist arbitrarily long words $w \in \mathcal{L}(\Sigma)$ on which the automaton transitions from its start state to v. Clearly, any state in the recurrent, irreducible component of the automaton graph is not finite-time transient. Thus, the Krieger cover contains this recurrent component, the Fischer cover.
To summarize, the minimal deterministic automaton, Fischer cover, and Krieger cover are three closely related ways of presenting an irreducible sofic shift, each of which is, in a slightly different sense, a minimal unifilar presentation. The Fischer cover is always an irreducible graph. The Krieger cover and the graph of the minimal deterministic automaton are not necessarily irreducible, but they each have a single recurrent, irreducible component that is isomorphic to the Fischer cover. The Krieger cover itself is also isomorphic to a subgraph of the minimal deterministic automaton.
ϵ-Machines are a probabilistic extension of these purely topological presentations. More specifically, for a stationary process $\mathcal{P}$, the history ϵ-machine $M_h$ is the probabilistic analog of the Krieger cover G for the subshift consisting of $\mathrm{supp}(\mathcal{P})$. It is the edge-emitting hidden Markov model defined analogously to the Krieger cover, but with states that are equivalence classes of infinite past sequences $\overleftarrow{x}$ with the same probability distribution over future sequences $\overrightarrow{x}$, rather than simply the same set of allowed future sequences (compare Equations (1) and (2)).
In some cases, the two presentations may be topologically equivalent, e.g., the history ϵ -machine and Krieger cover can be isomorphic as graphs when the transition probabilities are removed from edges of the ϵ -machine. In other cases, however, they are not. For example, for the Even Process (Example 1, Section 3.5) the Krieger cover (or at least its recurrent component, the Fischer cover) and the history ϵ -machine are topologically equivalent. However, this is not so for the ABC process (Example 2, Section 3.5). In fact, there exist many examples of ergodic processes whose support is an irreducible sofic shift, but for which the history ϵ -machine has an infinite (or even continuum) number of states. See, e.g., Example 4 in Section 3.5, and Example 3.26 in [16].

2.2. Semigroup Measures

Semigroup measures are a class of probability measures on sofic shifts that arise from assigning probability transition structures to the right and left covers obtained from the Cayley graphs associated with generating semigroups for the shifts. These measures are studied extensively in [23], where a rich theory is developed and many of their key structural properties are characterized.
In particular, it is shown there that a stationary probability measure P on a sofic shift Σ is a semigroup measure if and only if it has a finite number of future measures (distributions over future sequences $\overrightarrow{x}$) induced by finite-length past words w. That is, P is a semigroup measure if there exist a finite number of finite-length words $w_1, \ldots, w_N$ such that for any word w of positive probability the following holds:
$$P(\overrightarrow{X} \mid w) = P(\overrightarrow{X} \mid w_i),$$
for some $1 \le i \le N$, where $\overrightarrow{X}$ denotes the infinite sequence of future random variables $X_t$ on Σ, defined by the natural projections $X_t(x) = x_t$.
By contrast, a process $\mathcal{P}$ (or measure P) has a finite-state history ϵ-machine if there exist a finite number of infinite past sequences $\overleftarrow{x}_1, \ldots, \overleftarrow{x}_N$ such that, for almost every infinite past $\overleftarrow{x}$, the following holds:
$$P(\overrightarrow{X} \mid \overleftarrow{x}) = P(\overrightarrow{X} \mid \overleftarrow{x}_i),$$
for some $1 \le i \le N$. The latter condition is strictly more general. The Alternating Biased Coins Process described in Section 3.5, for instance, has a finite-state (2-state) ϵ-machine but does not correspond to a semigroup measure.
Thus, unfortunately, though the theory of semigroup measures is quite rich and well developed, much of it does not apply to the measures we study. For this reason, our proof methods are quite different from those previously used for semigroup measures, despite the seeming similarity between the two settings.

2.3. g-Functions and g-Measures

For a finite alphabet $\mathcal{X}$, let $\mathcal{X}^{-}$ denote the set of all infinite past sequences $\overleftarrow{x} = \cdots x_{-2} x_{-1}$ consisting of symbols in $\mathcal{X}$. A g-function for the full shift $\mathcal{X}^{\mathbb{Z}}$ is a map
$$g : (\mathcal{X}^{-} \times \mathcal{X}) \to [0, 1],$$
such that for any $\overleftarrow{x} \in \mathcal{X}^{-}$, the following holds:
$$\sum_{x \in \mathcal{X}} g(\overleftarrow{x}, x) = 1.$$
A g-measure for a g-function g on the full shift $\mathcal{X}^{\mathbb{Z}}$ is a stationary probability measure P on $\mathcal{X}^{\mathbb{Z}}$ that is consistent with the g-function g, in that for P-a.e. $\overleftarrow{x} \in \mathcal{X}^{-}$:
$$P(X_0 = x \mid \overleftarrow{X} = \overleftarrow{x}) = g(\overleftarrow{x}, x), \quad \text{for each } x \in \mathcal{X}.$$
g-Functions and g-Measures have been studied for some time, though sometimes under different names [24,25,26,27]. In particular, many of these studies address cases where a g-function has or does not have a unique corresponding g-measure. Normally, g is assumed to be continuous (with respect to the natural product topology) and in this case, using fixed point theory, it can be shown that at least one g-measure exists. However, continuity is not enough to ensure uniqueness, even if some natural mixing conditions are required as well [26]. Thus, stronger conditions are often required, such as Hölder continuity.
Of particular relevance to us is the more recent work [28] on g-functions restricted to subshifts. That study constructs, in many instances, g-functions on subshifts whose associated g-measures have an infinite or even continuum number of future measures, even when fairly strong requirements are imposed on the g-function; for example, residual local constancy or a synchronization condition similar to the exactness condition introduced in [15]. Most surprising, perhaps, are the constructions of g-functions on irreducible subshifts that themselves take only a finite number of values, but have unique associated g-measures with an infinite number of future measures.
The relation to ϵ-machines is the following. Given a g-function g, one may divide the set of infinite past sequences $\overleftarrow{x}$ into equivalence classes, in a manner analogous to that for history ϵ-machines, by the relation $\sim_g$ as follows:
$$\overleftarrow{x} \sim_g \overleftarrow{x}' \quad \text{if} \quad g(\overleftarrow{x}, x) = g(\overleftarrow{x}', x), \quad \text{for all } x \in \mathcal{X}. \qquad (3)$$
The equivalence classes induced by the relation $\sim_g$ of Equation (3) are coarser than those induced by the relation $\sim_\epsilon$ of Equation (1). For any g-measure P of the g-function g, the states of the history ϵ-machine are a refinement, or splitting, of the $\sim_g$ equivalence classes. Two infinite pasts $\overleftarrow{x}$ and $\overleftarrow{x}'$ that induce different probability distributions over the next symbol $x_0$ must induce different probability distributions over infinite future sequences $\overrightarrow{x}$, but the converse is not necessarily true. As shown in [28], the splitting may in fact be quite “bad” even if “nice” conditions are enforced on the g-function associated with the probability measure P. Concretely, there exist processes with history ϵ-machines that have an infinite or even continuum number of states, but for which the associated “nice” g-function from which the process is derived has only a finite number of equivalence classes.

3. Definitions

In this section, we set up the formal framework for our results and give more complete definitions for our objects of study: stationary processes, hidden Markov models, and ϵ-machines.

3.1. Processes

There are several ways to define a stochastic process. Perhaps the most traditional definition is simply as a sequence of random variables $(X_t)$ on some common probability space Ω. However, in the following discussion, it will be convenient to use a slightly different but equivalent construction in which a process is itself a probability space whose sample space consists of bi-infinite sequences $x = \cdots x_{-1} x_0 x_1 \cdots$. Of course, on this space we have random variables $X_t$ defined by the natural projections $X_t(x) = x_t$, which we will at times employ in our proofs. However, for most of our development and, in particular, for defining history ϵ-machines, it will be more convenient to adopt the sequence-space viewpoint.
Throughout, we restrict our attention to processes over a finite alphabet $\mathcal{X}$. We denote by $\mathcal{X}^{*}$ the set of all words w of finite positive length consisting of symbols in $\mathcal{X}$ and, for a word $w \in \mathcal{X}^{*}$, we write $|w|$ for its length. Note that we deviate slightly from the standard convention here and explicitly exclude the null word λ from $\mathcal{X}^{*}$.
Definition 1.
Let $\mathcal{X}$ be a finite set. A process $\mathcal{P}$ over the alphabet $\mathcal{X}$ is a probability space $(\mathcal{X}^{\mathbb{Z}}, \mathbb{X}, P)$ where the following hold:
  • $\mathcal{X}^{\mathbb{Z}}$ is the set of all bi-infinite sequences of symbols in $\mathcal{X}$: $\mathcal{X}^{\mathbb{Z}} = \{x = \cdots x_{-1} x_0 x_1 \cdots : x_t \in \mathcal{X}, \text{ for all } t \in \mathbb{Z}\}$.
  • $\mathbb{X}$ is the σ-algebra generated by finite cylinder sets of the form $A_{w,t} = \{x \in \mathcal{X}^{\mathbb{Z}} : x_t \cdots x_{t+|w|-1} = w\}$.
  • P is a probability measure on the measurable space $(\mathcal{X}^{\mathbb{Z}}, \mathbb{X})$.
For each symbol $x \in \mathcal{X}$, we assume implicitly that $P(A_{x,t}) > 0$ for some $t \in \mathbb{N}$. Otherwise, the symbol x is useless, and the process can be restricted to the alphabet $\mathcal{X} \setminus \{x\}$. In the following, we will be primarily interested in stationary, ergodic processes.
Let $r : \mathcal{X}^{\mathbb{Z}} \to \mathcal{X}^{\mathbb{Z}}$ be the right shift operator. A process $\mathcal{P}$ is stationary if the measure P is shift invariant: $P(A) = P(r(A))$, for any measurable set A. A process $\mathcal{P}$ is ergodic if every shift-invariant event A is trivial. That is, for any measurable event A such that A and $r(A)$ are P-a.s. equal, the probability of A is either 0 or 1. A stationary process $\mathcal{P}$ is defined entirely by the word probabilities $P(w)$, $w \in \mathcal{X}^{*}$, where $P(w) = P(A_{w,t})$ is the shift-invariant probability of cylinder sets for the word w. Ergodicity is equivalent to the almost sure convergence of empirical word probabilities $\hat{P}(w)$ in finite sequences $x^t = x_0 x_1 \cdots x_{t-1}$ to their true values $P(w)$, as $t \to \infty$.
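This convergence is easy to observe numerically. The following sketch (our own illustration, not part of the formal development; the biased-coin process and all function names are ours) estimates empirical word probabilities from a long sample of an i.i.d. binary process and compares them with the true cylinder probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample an i.i.d. biased-coin process: P(1) = 0.3, P(0) = 0.7.
# The process is stationary and ergodic, so empirical word
# probabilities converge a.s. to the true values as t -> infinity.
bias = 0.3
x = (rng.random(100_000) < bias).astype(int)

def empirical_word_prob(seq, word):
    """Fraction of positions in `seq` at which `word` occurs."""
    n, l = len(seq), len(word)
    hits = sum(all(seq[t + k] == word[k] for k in range(l))
               for t in range(n - l + 1))
    return hits / (n - l + 1)

for word in [(1,), (1, 1), (0, 1, 0)]:
    true = np.prod([bias if s == 1 else 1 - bias for s in word])
    print(word, f"empirical={empirical_word_prob(x, word):.4f}",
          f"true={true:.4f}")
```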
For a stationary process $\mathcal{P}$ and words $w, v \in \mathcal{X}^{*}$ with $P(v) > 0$, we define $P(w \mid v)$ as the probability that the word v is followed by the word w in a bi-infinite sequence x:
$$P(w \mid v) \equiv P(A_{w,0} \mid A_{v,-|v|}) = P(A_{v,-|v|} \cap A_{w,0}) / P(A_{v,-|v|}) = P(vw) / P(v). \qquad (4)$$
The following facts concerning word probabilities and conditional word probabilities for a stationary process come directly from the definitions. They will be used repeatedly throughout our development, without further mention. For any words $u, v, w \in \mathcal{X}^{*}$, the following hold (a small numerical check appears after the list):
  • $\sum_{x \in \mathcal{X}} P(wx) = \sum_{x \in \mathcal{X}} P(xw) = P(w)$.
  • $P(w) \ge P(wv)$ and $P(w) \ge P(vw)$.
  • If $P(w) > 0$, then $\sum_{x \in \mathcal{X}} P(x \mid w) = 1$.
  • If $P(u) > 0$, then $P(v \mid u) \ge P(vw \mid u)$.
  • If $P(u) > 0$ and $P(uv) > 0$, then $P(vw \mid u) = P(v \mid u) \cdot P(w \mid uv)$.
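As a sanity check (again our own illustration), these identities can be verified for any process whose word probabilities are computable in closed form; a stationary two-state Markov chain over {0, 1} suffices:

```python
import numpy as np

# Word probabilities for a stationary Markov chain over {0, 1}:
# P(w) = pi[w_0] * prod_t M[w_t, w_{t+1}], with pi the stationary distribution.
M = np.array([[0.9, 0.1],
              [0.4, 0.6]])          # symbol-to-symbol transition matrix
evals, evecs = np.linalg.eig(M.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()                      # stationary distribution: pi = pi M

def P(w):
    p = pi[w[0]]
    for a, b in zip(w, w[1:]):
        p *= M[a, b]
    return p

def P_cond(w, v):                   # P(w | v) = P(vw) / P(v), Equation (4)
    return P(v + w) / P(v)

u, v, w = (0, 1), (1,), (0, 0)
assert np.isclose(sum(P(w + (x,)) for x in (0, 1)), P(w))   # marginalization
assert P(w) >= P(w + v) and P(w) >= P(v + w)                # monotonicity
assert np.isclose(sum(P_cond((x,), w) for x in (0, 1)), 1)  # normalization
assert P_cond(v, u) >= P_cond(v + w, u)
assert np.isclose(P_cond(v + w, u), P_cond(v, u) * P_cond(w, u + v))
print("all word-probability identities verified")
```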

3.2. Hidden Markov Models

There are two primary types of hidden Markov models as follows: state-emitting (or Moore) and edge-emitting (or Mealy). The state-emitting type is the simpler of the two and, also, the more commonly studied and applied [29,30]. However, we focus on edge-emitting hidden Markov models here, since ϵ -machines are edge-emitting. We also restrict our development to the case where the hidden Markov model has a finite number of states and output symbols, although generalizations to countably infinite and even uncountable state sets and output alphabets are certainly possible.
Definition 2.
An edge-emitting hidden Markov model (HMM) is a triple $(\mathcal{S}, \mathcal{X}, \{T^{(x)}\})$ where the following hold:
  • $\mathcal{S}$ is a finite set of states.
  • $\mathcal{X}$ is a finite alphabet of output symbols.
  • $T^{(x)}$, $x \in \mathcal{X}$, are symbol-labeled transition matrices. $T^{(x)}_{\sigma \sigma'} \ge 0$ represents the probability of transitioning from state σ to state σ′ on symbol x.
In what follows, we normally take the state set to be $\mathcal{S} = \{\sigma_1, \ldots, \sigma_N\}$ and denote $T^{(x)}_{\sigma_i \sigma_j}$ simply as $T^{(x)}_{ij}$. We also denote the overall state-to-state transition matrix for an HMM as T: $T = \sum_{x \in \mathcal{X}} T^{(x)}$. $T_{ij}$ is the overall probability of transitioning from state $\sigma_i$ to state $\sigma_j$, regardless of symbol. The matrix T is stochastic: $\sum_{j=1}^{N} T_{ij} = 1$, for each i.
Pictorially, an HMM can be represented as a directed graph with labeled edges. The vertices are the states $\sigma_1, \ldots, \sigma_N$ and, for each i, j, x with $T^{(x)}_{ij} > 0$, there is a directed edge from state $\sigma_i$ to state $\sigma_j$, labeled $p \mid x$ for the symbol x and transition probability $p = T^{(x)}_{ij}$. The transition probabilities are normalized so that their sum over all outgoing edges from each state $\sigma_i$ is 1.
Example. 
Even Machine
Figure 1 depicts an HMM for the Even Process. The support for this process consists of all binary sequences in which blocks of uninterrupted 1s are even in length, bounded by 0s. After each even length is reached, there is a probability p of breaking the block of 1s by inserting a 0. The HMM has two states $\{\sigma_1, \sigma_2\}$ and symbol-labeled transition matrices as follows:
$$T^{(0)} = \begin{pmatrix} p & 0 \\ 0 & 0 \end{pmatrix} \quad \text{and} \quad T^{(1)} = \begin{pmatrix} 0 & 1-p \\ 1 & 0 \end{pmatrix}.$$
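Concretely, these matrices can be written down and checked for stochasticity in a few lines (a sketch of our own; the value p = 0.5 is an arbitrary choice):

```python
import numpy as np

p = 0.5  # probability of ending a completed even block of 1s with a 0

# Symbol-labeled transition matrices of the Even Machine.
T = {
    0: np.array([[p, 0.0],
                 [0.0, 0.0]]),
    1: np.array([[0.0, 1.0 - p],
                 [1.0, 0.0]]),
}

# The overall state-to-state matrix T = sum_x T^(x) must be stochastic:
# every row sums to 1.
T_total = sum(T.values())
assert np.allclose(T_total.sum(axis=1), 1.0)
print(T_total)
```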
The operation of an HMM may be thought of as a weighted random walk on the associated directed graph. That is, from the current state σ i , the next state σ j is determined by selecting an outgoing edge from σ i according to their relative probabilities. Having selected a transition, the HMM then moves to the new state and outputs the symbol x labeling this edge. The same procedure is then invoked repeatedly to generate future states and output symbols.
The state sequence determined in such a fashion is simply a Markov chain with transition matrix T. However, we are interested not simply in the HMM’s state sequence, but rather the associated sequence of output symbols it generates. We assume that an observer of the HMM has direct access to this sequence of output symbols, but not to the associated sequence of “hidden” states.
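The walk itself is equally direct to simulate. The following sketch (ours; it reuses the Even Machine matrices above, and all names are illustrative) performs the weighted random walk and records the emitted symbols:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.5
T = {0: np.array([[p, 0.0], [0.0, 0.0]]),
     1: np.array([[0.0, 1.0 - p], [1.0, 0.0]])}

def sample(T, n_steps, state=0):
    """Weighted random walk on the HMM graph; yields output symbols."""
    symbols = sorted(T)
    for _ in range(n_steps):
        # All positive-probability (symbol, target-state) edges from `state`.
        edges = [(x, j, T[x][state, j]) for x in symbols
                 for j in range(T[x].shape[1]) if T[x][state, j] > 0]
        probs = np.array([e[2] for e in edges])
        x, j, _ = edges[rng.choice(len(edges), p=probs)]
        state = j
        yield x

out = ''.join(str(x) for x in sample(T, 40))
print(out)  # every block of 1s bounded by 0s has even length
```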
Formally, from an initial state $\sigma_i$, the probability that the HMM next outputs symbol x and transitions to state $\sigma_j$ is expressed as follows:
$$P_{\sigma_i}(x, \sigma_j) = T^{(x)}_{ij}. \qquad (5)$$
Furthermore, the probability of longer sequences is computed inductively. Thus, for an initial state $\sigma_i = \sigma_{i_0}$, the probability that the HMM outputs a length-l word $w = w_0 \cdots w_{l-1}$ while following the state path $s = \sigma_{i_1} \cdots \sigma_{i_l}$ in the next l steps is expressed as follows:
$$P_{\sigma_i}(w, s) = \prod_{t=0}^{l-1} T^{(w_t)}_{i_t, i_{t+1}}. \qquad (6)$$
If the initial state is chosen according to some distribution $\rho = (\rho_1, \ldots, \rho_N)$ rather than as a fixed state $\sigma_i$, we obtain the following by linearity:
$$P_\rho(x, \sigma_j) = \sum_i \rho_i \cdot P_{\sigma_i}(x, \sigma_j) \qquad (7)$$
and
$$P_\rho(w, s) = \sum_i \rho_i \cdot P_{\sigma_i}(w, s). \qquad (8)$$
The overall probabilities of next generating a symbol x or word $w = w_0 \cdots w_{l-1}$ from a given state $\sigma_i$ are computed by summing over all possible associated target states or state sequences:
$$P_{\sigma_i}(x) = \sum_j P_{\sigma_i}(x, \sigma_j) = e_i T^{(x)} \mathbf{1} \qquad (9)$$
and
$$P_{\sigma_i}(w) = \sum_{\{s : |s| = l\}} P_{\sigma_i}(w, s) = e_i T^{(w)} \mathbf{1}, \qquad (10)$$
respectively, where $e_i = (0, \ldots, 1, \ldots, 0)$ is the $i$th standard basis vector in $\mathbb{R}^N$ and
$$T^{(w)} = T^{(w_0 \cdots w_{l-1})} \equiv \prod_{t=0}^{l-1} T^{(w_t)}. \qquad (11)$$
Finally, the overall probabilities of next generating a symbol x or word $w = w_0 \cdots w_{l-1}$ from an initial state distribution ρ are, respectively, expressed as follows:
$$P_\rho(x) = \sum_i \rho_i \cdot P_{\sigma_i}(x) = \rho\, T^{(x)} \mathbf{1} \qquad (12)$$
and
$$P_\rho(w) = \sum_i \rho_i \cdot P_{\sigma_i}(w) = \rho\, T^{(w)} \mathbf{1}. \qquad (13)$$
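In code, Equations (11)-(13) amount to a chain of matrix products (our own sketch, reusing the Even Machine matrices):

```python
import numpy as np
from functools import reduce

p = 0.5
T = {0: np.array([[p, 0.0], [0.0, 0.0]]),
     1: np.array([[0.0, 1.0 - p], [1.0, 0.0]])}

def T_word(word):
    """T^(w) = T^(w_0) ... T^(w_{l-1}), Equation (11)."""
    return reduce(np.matmul, (T[x] for x in word))

def word_prob(rho, word):
    """P_rho(w) = rho T^(w) 1, Equation (13)."""
    return rho @ T_word(word) @ np.ones(2)

rho = np.array([1.0, 0.0])          # start in state sigma_1
print(word_prob(rho, (1, 1, 0)))    # P_{sigma_1}(110) = (1-p)*1*p = 0.25
print(word_prob(rho, (1, 0)))       # 10 is forbidden from sigma_1: 0.0
```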
If the graph G associated with a given HMM is strongly connected, then the corresponding Markov chain over states is irreducible, and the state-to-state transition matrix T has a unique stationary distribution π satisfying $\pi = \pi T$ [31]. In this case, we may define a stationary process $\mathcal{P} = (\mathcal{X}^{\mathbb{Z}}, \mathbb{X}, P)$ by the word probabilities obtained from choosing the initial state according to π. That is, for any word $w \in \mathcal{X}^{*}$:
$$P(w) \equiv P_\pi(w) = \pi\, T^{(w)} \mathbf{1}. \qquad (14)$$
Strong connectivity also implies that the process $\mathcal{P}$ is ergodic, as it is a pointwise function of the irreducible Markov chain over edges, which is itself ergodic [31]. That is, at each time step, the symbol labeling the edge is a deterministic function of the edge.
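For instance (our sketch; p = 0.5 as before), the stationary distribution of the Even Machine and the resulting process word probabilities of Equation (14) can be computed as follows. Note that P(010) = 0: an odd block of 1s bounded by 0s never occurs in the Even Process.

```python
import numpy as np
from functools import reduce

p = 0.5
T = {0: np.array([[p, 0.0], [0.0, 0.0]]),
     1: np.array([[0.0, 1.0 - p], [1.0, 0.0]])}
T_total = sum(T.values())

# Stationary distribution: left eigenvector of T for eigenvalue 1.
evals, evecs = np.linalg.eig(T_total.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi /= pi.sum()
print(pi)  # (2/3, 1/3) for p = 0.5

def process_prob(word):
    """P(w) = pi T^(w) 1, Equation (14)."""
    Tw = reduce(np.matmul, (T[x] for x in word))
    return pi @ Tw @ np.ones(2)

print(process_prob((0, 1, 1, 0)))  # allowed: even block of 1s
print(process_prob((0, 1, 0)))     # forbidden: odd block of 1s -> 0.0
```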
We denote the corresponding (stationary, ergodic) process over bi-infinite symbol-state sequences $(x, s)$ by $\tilde{\mathcal{P}}$. That is, $\tilde{\mathcal{P}} = ((\mathcal{X} \times \mathcal{S})^{\mathbb{Z}}, \mathbb{X} \times \mathbb{S}, \tilde{P})$ where the following hold:
  • $(\mathcal{X} \times \mathcal{S})^{\mathbb{Z}} = \{(x, s) \equiv ((x_t, s_t))_{t \in \mathbb{Z}} : x_t \in \mathcal{X} \text{ and } s_t \in \mathcal{S}, \text{ for all } t \in \mathbb{Z}\}$.
  • $\mathbb{X} \times \mathbb{S}$ is the σ-algebra generated by finite cylinder sets on the bi-infinite symbol-state sequences.
  • The (stationary) probability measure $\tilde{P}$ on $\mathbb{X} \times \mathbb{S}$ is defined by Equation (8) with ρ = π. Specifically, for any length-l word w and length-l state sequence s, we have the following:
$$\tilde{P}(\{(x, s) : x_0 \cdots x_{l-1} = w, \; s_1 \cdots s_l = s\}) = P_\pi(w, s).$$
By stationarity, this measure may be extended uniquely to all finite cylinders and, hence, to all $(\mathbb{X} \times \mathbb{S})$-measurable sets. Furthermore, it is consistent with the measure P:
$$\tilde{P}(\{(x, s) : x_0 \cdots x_{l-1} = w\}) = P(w),$$
for all $w \in \mathcal{X}^{*}$.
Two HMMs are said to be isomorphic if there is a bijection between their state sets that preserves edges, including the symbols and probabilities labeling the edges. Clearly, any two isomorphic, irreducible HMMs generate the same process, but the converse is not true. Nonisomorphic HMMs may also generate equivalent processes. In Section 4, we will be concerned with isomorphism between generator and history ϵ -machines.

3.3. Generator ϵ -Machines

Generator ϵ-machines are irreducible HMMs with two additional important properties: unifilarity and probabilistically distinct states.
Definition 3.
A generator ϵ-machine $M_g$ is an HMM with the following properties:
1. Irreducibility: The graph G associated with the HMM is strongly connected.
2. Unifilarity: For each state $\sigma_i \in \mathcal{S}$ and each symbol $x \in \mathcal{X}$, there is at most one outgoing edge from state $\sigma_i$ labeled with symbol x.
3. Probabilistically distinct states: For each pair of distinct states $\sigma_i, \sigma_j \in \mathcal{S}$, there exists some word $w \in \mathcal{X}^{*}$ such that $P_{\sigma_i}(w) \neq P_{\sigma_j}(w)$.
Note that all three of these properties may be easily checked for a given HMM. Irreducibility and unifilarity are immediate. The probabilistically distinct states condition can (if necessary) be checked by inductively separating distinct pairs of states, with an algorithm similar to the one used to check for topologically distinct states in [15]; a sketch appears below.
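The following sketch (our own; the algorithm in [15] may differ in detail) checks the condition for a unifilar HMM by Moore-style partition refinement: states are first grouped by their next-symbol distributions and then repeatedly split whenever two states in a group transition, on some symbol, into different groups. The machine has probabilistically distinct states exactly when the refinement terminates with singleton groups.

```python
import numpy as np

def probabilistically_distinct(T, n_states):
    """Check the probabilistically-distinct-states condition for a
    unifilar HMM given as {symbol: transition matrix}."""
    symbols = sorted(T)

    def next_sym_probs(i):
        # Distribution over the next output symbol from state i.
        return tuple(round(float(T[x][i].sum()), 12) for x in symbols)

    def successor(i, x):
        # Unifilarity: at most one positive-probability target state.
        targets = np.nonzero(T[x][i])[0]
        return int(targets[0]) if len(targets) else None

    # Initial partition: group states with equal next-symbol distributions.
    groups = {}
    for i in range(n_states):
        groups.setdefault(next_sym_probs(i), []).append(i)
    partition = list(groups.values())

    # Refine: split a block if two members lead into different blocks.
    changed = True
    while changed:
        changed = False
        block_of = {i: b for b, blk in enumerate(partition) for i in blk}
        refined = []
        for blk in partition:
            sub = {}
            for i in blk:
                sig = tuple(block_of.get(successor(i, x)) for x in symbols)
                sub.setdefault(sig, []).append(i)
            refined.extend(sub.values())
            changed |= len(sub) > 1
        partition = refined
    return all(len(blk) == 1 for blk in partition)

# Even Machine: the two states already differ on the next symbol.
p = 0.5
T = {0: np.array([[p, 0.0], [0.0, 0.0]]),
     1: np.array([[0.0, 1.0 - p], [1.0, 0.0]])}
print(probabilistically_distinct(T, 2))  # True
```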
By irreducibility, there is always a unique stationary distribution π over the states of a generator ϵ-machine, so we may associate to each generator ϵ-machine $M_g$ a unique stationary, ergodic process $\mathcal{P} = \mathcal{P}(M_g)$, with word probabilities defined as in Equation (14). We refer to $\mathcal{P}$ as the process generated by the generator ϵ-machine $M_g$. The transition function for a generator ϵ-machine (or, more generally, any unifilar HMM) is denoted by δ. That is, for i and x with $P_{\sigma_i}(x) > 0$, $\delta(\sigma_i, x) \equiv \sigma_j$, where $\sigma_j$ is the (unique) state to which state $\sigma_i$ transitions on symbol x.
In a unifilar HMM, for any given initial state $\sigma_i$ and word $w = w_0 \cdots w_{l-1} \in \mathcal{X}^{*}$, there can be at most one associated state path $s = s_1 \cdots s_l$ such that the word w may be generated following the state path s from $\sigma_i$. Moreover, the probability $P_{\sigma_i}(w)$ of generating w from $\sigma_i$ is nonzero if and only if there is such a path s. In this case, the states $s_1, \ldots, s_l$ are defined inductively by the relations $s_{t+1} = \delta(s_t, w_t)$, $0 \le t \le l-1$, with $s_0 = \sigma_i$, and the probability $P_{\sigma_i}(w)$ is simply expressed as follows:
$$P_{\sigma_i}(w) = \prod_{t=0}^{l-1} P_{s_t}(w_t). \qquad (15)$$
Slightly more generally, Equation (15) holds as long as there is a well-defined path $s_1 \cdots s_{l-1}$ upon which the subword $w_0 \cdots w_{l-2}$ may be generated starting in $\sigma_i$, though, in this case, $P_{\sigma_i}(w)$ may be 0 if state $s_{l-1}$ has no outgoing transition on symbol $w_{l-1}$. This formula for word probabilities in unifilar HMMs will be useful in establishing the equivalence of generator and history ϵ-machines in Section 4.
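For unifilar machines, Equation (15) replaces the matrix products of Equation (10) with a single walk along the deterministic state path. A minimal sketch (ours, Even Machine again):

```python
import numpy as np

p = 0.5
T = {0: np.array([[p, 0.0], [0.0, 0.0]]),
     1: np.array([[0.0, 1.0 - p], [1.0, 0.0]])}

def unifilar_word_prob(T, state, word):
    """P_sigma(w) by following the unique state path, Equation (15)."""
    prob = 1.0
    for x in word:
        targets = np.nonzero(T[x][state])[0]
        if len(targets) == 0:          # no outgoing edge on x: forbidden
            return 0.0
        prob *= T[x][state, targets[0]]
        state = int(targets[0])        # unifilarity: unique successor
    return prob

print(unifilar_word_prob(T, 0, (1, 1, 0)))  # 0.25, matches e_1 T^(110) 1
```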

3.4. History ϵ -Machines

The history ϵ -machine M h for a stationary process P is, essentially, just the hidden Markov model whose states are the equivalence classes of infinite past sequences defined by the equivalence relation ϵ of Equation (1). Two pasts x and x are considered equivalent if they induce the same probability distribution over future sequences. However, it takes some effort to make this notion precise and specify the transitions. The formal definition itself is quite lengthy, so, for clarity, the verification of many technicalities is deferred to the appendices. We recommend first reading through this section in its entirety without reference to the appendices for an overview and then reading through the appendices separately afterward for the details. The appendices are entirely self-contained in that, except for the notation introduced here, none of the results derived in the appendices rely on the developments in this section. As noted before, our focus is restricted to ergodic, finite-alphabet processes to parallel the generator definition. However, neither of these requirements is strictly necessary. Only stationarity is actually needed.
Let $\mathcal{P} = (\mathcal{X}^{\mathbb{Z}}, \mathbb{X}, P)$ be a stationary, ergodic process over a finite alphabet $\mathcal{X}$, and let $(\mathcal{X}^{-}, \mathbb{X}^{-}, P)$ be the corresponding probability space over past sequences $\overleftarrow{x}$, where the following hold:
  • $\mathcal{X}^{-}$ is the set of infinite past sequences of symbols in $\mathcal{X}$: $\mathcal{X}^{-} = \{\overleftarrow{x} = \cdots x_{-2} x_{-1} : x_t \in \mathcal{X}, \; t = -1, -2, \ldots\}$.
  • $\mathbb{X}^{-}$ is the σ-algebra generated by finite cylinder sets on past sequences: $\mathbb{X}^{-} = \sigma\big(\bigcup_{t=1}^{\infty} \mathbb{X}^{-}_t\big)$, where $\mathbb{X}^{-}_t = \sigma(\{A_w : |w| = t\})$ and $A_w = \{\overleftarrow{x} = \cdots x_{-2} x_{-1} : x_{-|w|} \cdots x_{-1} = w\}$.
  • P is the probability measure on the measurable space $(\mathcal{X}^{-}, \mathbb{X}^{-})$ obtained as the projection of the process measure to past sequences: $P(A_w) = P(w)$, for each $w \in \mathcal{X}^{*}$.
For a given past $\overleftarrow{x} \in \mathcal{X}^{-}$, we denote the last t symbols of $\overleftarrow{x}$ as $\overleftarrow{x}^t = x_{-t} \cdots x_{-1}$. A past $\overleftarrow{x} \in \mathcal{X}^{-}$ is said to be trivial if $P(\overleftarrow{x}^t) = 0$ for some finite t, and is otherwise nontrivial. If a past $\overleftarrow{x}$ is nontrivial, then for each $w \in \mathcal{X}^{*}$, $P(w \mid \overleftarrow{x}^t)$ is well defined for each t, Equation (4), and one may consider $\lim_{t \to \infty} P(w \mid \overleftarrow{x}^t)$. A nontrivial past $\overleftarrow{x}$ is said to be w-regular if $\lim_{t \to \infty} P(w \mid \overleftarrow{x}^t)$ exists, and regular if it is w-regular for each $w \in \mathcal{X}^{*}$. Appendix A shows that the set of trivial pasts $\mathcal{T}$ is a null set and that the set of regular pasts $\mathcal{R}$ has full measure. That is, $P(\mathcal{T}) = 0$ and $P(\mathcal{R}) = 1$.
For a word $w \in \mathcal{X}^{*}$, the function $P(w \mid \cdot) : \mathcal{R} \to \mathbb{R}$ is defined as follows:
$$P(w \mid \overleftarrow{x}) \equiv \lim_{t \to \infty} P(w \mid \overleftarrow{x}^t). \qquad (16)$$
Intuitively, $P(w \mid \overleftarrow{x})$ is the conditional probability of w given $\overleftarrow{x}$. However, this probability is technically not well defined in accordance with Equation (4), since the probability of each individual past $\overleftarrow{x}$ is normally 0. Furthermore, we do not want to define $P(w \mid \overleftarrow{x})$ in terms of a formal conditional expectation, because such a definition is only unique up to a.e. equivalence, while we would like its value on individual pasts to be uniquely determined. Nevertheless, the intuition that $P(w \mid \overleftarrow{x})$ is the conditional probability of w given $\overleftarrow{x}$ should be kept in mind, as it will provide insight for what follows. Indeed, if one does consider the conditional probability $P(w \mid \overleftarrow{X})$ as a formal conditional expectation, any version of it will be equal to $P(w \mid \overleftarrow{x})$ for a.e. $\overleftarrow{x}$, so this intuition is justified.
The central idea in the construction of the history ϵ-machine is the following equivalence relation on the set of regular pasts:
$$\overleftarrow{x} \sim \overleftarrow{x}' \quad \text{if} \quad P(w \mid \overleftarrow{x}) = P(w \mid \overleftarrow{x}'), \quad \text{for all } w \in \mathcal{X}^{*}. \qquad (17)$$
That is, two pasts $\overleftarrow{x}$ and $\overleftarrow{x}'$ are ∼ equivalent if their predictions are the same: conditioning on either past leads to the same probability distribution over future words of all lengths. This is simply a more precise version of the equivalence relation $\sim_\epsilon$ of Equation (1) (we drop the subscript ϵ, as this is the only equivalence relation we consider from here on).
The set of equivalence classes of regular pasts under the relation ∼ is denoted as $\mathcal{E} = \{E_\beta, \beta \in B\}$, where B is simply an index set. In general, there may be finitely many, countably many, or uncountably many such equivalence classes. Examples with finite and countably infinite $\mathcal{E}$ are given in Section 3.5. For uncountable $\mathcal{E}$, see Example 3.26 in [16].
For an equivalence class $E_\beta \in \mathcal{E}$ and word $w \in \mathcal{X}^{*}$, we define the probability of w given $E_\beta$ as follows:
$$P(w \mid E_\beta) \equiv P(w \mid \overleftarrow{x}), \quad \overleftarrow{x} \in E_\beta. \qquad (18)$$
By construction of the equivalence classes, this definition is independent of the representative $\overleftarrow{x} \in E_\beta$, and Appendix B shows that these probabilities are normalized, so that for each equivalence class $E_\beta$ the following holds:
$$\sum_{x \in \mathcal{X}} P(x \mid E_\beta) = 1. \qquad (19)$$
Appendix B also shows that the equivalence-class-to-equivalence-class transitions for the relation ∼ are well defined, so that the following hold:
  • For any regular past $\overleftarrow{x}$ and symbol $x \in \mathcal{X}$ with $P(x \mid \overleftarrow{x}) > 0$, the past $\overleftarrow{x} x$ is also regular.
  • If $\overleftarrow{x}$ and $\overleftarrow{x}'$ are two regular pasts in the same equivalence class $E_\beta$ and $P(x \mid E_\beta) > 0$, then the two pasts $\overleftarrow{x} x$ and $\overleftarrow{x}' x$ must also be in the same equivalence class.
So, for each $E_\beta \in \mathcal{E}$ and $x \in \mathcal{X}$ with $P(x \mid E_\beta) > 0$, there is a unique equivalence class $E_\alpha = \delta_h(E_\beta, x)$ to which equivalence class $E_\beta$ transitions on symbol x:
$$\delta_h(E_\beta, x) \equiv E_\alpha, \quad \text{where} \quad \overleftarrow{x} x \in E_\alpha \;\; \text{for} \;\; \overleftarrow{x} \in E_\beta. \qquad (20)$$
According to Point 2 above, this definition is again independent of the representative $\overleftarrow{x} \in E_\beta$.
The subscript h in $\delta_h$ indicates that it is a transition function between equivalence classes of pasts, or histories, $\overleftarrow{x}$. Formally, it is to be distinguished from the transition function δ between the states of a unifilar HMM. However, the two are essentially equivalent for a history ϵ-machine.
Appendix C shows that each equivalence class $E_\beta$ is an $\mathbb{X}^{-}$-measurable set, so we can meaningfully assign a probability
$$P(E_\beta) \equiv P(\{\overleftarrow{x} \in E_\beta\}) = P(\{x = \overleftarrow{x}\,\overrightarrow{x} : \overleftarrow{x} \in E_\beta\}) \qquad (21)$$
to each equivalence class $E_\beta$. We say a process $\mathcal{P}$ is finitely characterized if there is a finite number of positive probability equivalence classes $E_1, \ldots, E_N$ that together comprise a set of full measure: $P(E_i) > 0$, for each $1 \le i \le N$, and $\sum_{i=1}^{N} P(E_i) = 1$. For a finitely characterized process $\mathcal{P}$, we will also occasionally say, by a slight abuse of terminology, that $\mathcal{E}^{+} \equiv \{E_1, \ldots, E_N\}$ is the set of equivalence classes of pasts, ignoring the remaining measure-zero subset of equivalence classes.
Appendix E shows that, for any finitely characterized process $\mathcal{P}$, the transitions from the positive probability equivalence classes $E_i \in \mathcal{E}^{+}$ all go to other positive probability equivalence classes. That is, if $E_i \in \mathcal{E}^{+}$, then the following holds:
$$\delta_h(E_i, x) \in \mathcal{E}^{+}, \quad \text{for all } x \text{ with } P(x \mid E_i) > 0. \qquad (22)$$
As such, we may define the symbol-labeled transition matrices $T^{(x)}$, $x \in \mathcal{X}$, between the equivalence classes $E_i \in \mathcal{E}^{+}$. A component $T^{(x)}_{ij}$ of the matrix $T^{(x)}$ gives the probability that equivalence class $E_i$ transitions to equivalence class $E_j$ on symbol x:
$$T^{(x)}_{ij} = P(E_i \overset{x}{\to} E_j) \equiv I(x, i, j) \cdot P(x \mid E_i), \qquad (23)$$
where $I(x, i, j)$ is the indicator function of the transition from $E_i$ to $E_j$ on symbol x:
$$I(x, i, j) = \begin{cases} 1 & \text{if } P(x \mid E_i) > 0 \text{ and } \delta_h(E_i, x) = E_j, \\ 0 & \text{otherwise}. \end{cases} \qquad (24)$$
It follows from Equations (19) and (22) that the matrix $T \equiv \sum_{x \in \mathcal{X}} T^{(x)}$ is stochastic (see also Claim 17 in Appendix E).
Definition 4.
Let $\mathcal{P} = (\mathcal{X}^{\mathbb{Z}}, \mathbb{X}, P)$ be a finitely characterized, stationary, ergodic, finite-alphabet process. The history ϵ-machine $M_h(\mathcal{P})$ is defined as the triple $(\mathcal{E}^{+}, \mathcal{X}, \{T^{(x)}\})$.
Note that $M_h$ is a valid HMM, since T is stochastic.

3.5. Examples

In this section, we present several examples of irreducible HMMs and the associated ϵ-machines for the processes that these HMMs generate. This should provide some useful intuition for the definitions. For the sake of brevity, descriptions of the history ϵ-machine constructions in our examples are less detailed than in the formal definition given above, but the ideas should be clear. In all cases, the process alphabet is the binary alphabet $\mathcal{X} = \{0, 1\}$.
Example 1.
Even Machine
The first example we consider, shown in Figure 2, is the generating HMM M for the Even Process previously introduced in Section 3.2. It is easily seen that this HMM is both irreducible and unifilar and, also, that it has probabilistically distinct states: state $\sigma_1$ can generate the symbol 0, whereas state $\sigma_2$ cannot. M is therefore a generator ϵ-machine and, by Theorem 1 below, the history ϵ-machine $M_h$ for the process $\mathcal{P}$ that M generates is isomorphic to M. The Fischer cover for the sofic shift $\mathrm{supp}(\mathcal{P})$ is also isomorphic to M, if probabilities are removed from the edge labels of M.
More directly, the history ϵ-machine states for $\mathcal{P}$ can be deduced by noting that 0 is a synchronizing word for M [15]; it synchronizes the observer to state $\sigma_1$. Thus, for any nontrivial past $\overleftarrow{x}$ terminating in $x_{-1} = 0$, the initial state $s_0$ must be $\sigma_1$. By unifilarity, any nontrivial past $\overleftarrow{x}$ terminating in a word of the form $01^n$ for some $n \ge 0$ also uniquely determines the initial state $s_0$: for n even, we must have $s_0 = \sigma_1$ and, for n odd, we must have $s_0 = \sigma_2$. Since a.e. infinite past $\overleftarrow{x}$ generated by M contains at least one 0, and the distributions over future sequences $\overrightarrow{x}$ are distinct for the two states $\sigma_1$ and $\sigma_2$, the process $\mathcal{P}$ is finitely characterized with exactly two positive probability equivalence classes of infinite pasts: $E_1 = \{\overleftarrow{x} = \cdots 01^n : n \text{ even}\}$ and $E_2 = \{\overleftarrow{x} = \cdots 01^n : n \text{ odd}\}$. These correspond to the states $\sigma_1$ and $\sigma_2$ of M, respectively. More generally, a similar argument holds for any exact generator ϵ-machine; that is, any generator ϵ-machine having a finite synchronizing word w [15]. A small numerical illustration of this synchronization follows.
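The sketch below (ours; words are assumed to have positive probability) tracks an observer's belief distribution over the Even Machine's states by Bayesian updating, one symbol at a time, and shows that the belief collapses onto a single state as soon as a 0 is observed:

```python
import numpy as np

p = 0.5
T = {0: np.array([[p, 0.0], [0.0, 0.0]]),
     1: np.array([[0.0, 1.0 - p], [1.0, 0.0]])}
pi = np.array([2/3, 1/3])  # stationary distribution for p = 0.5

def belief_after(word, phi=pi):
    """Observer's state distribution after seeing `word` (Bayes update)."""
    for x in word:
        phi = phi @ T[x]
        phi = phi / phi.sum()  # condition on having observed symbol x
    return phi

print(belief_after((0,)))     # [1, 0]: synchronized to sigma_1
print(belief_after((0, 1)))   # [0, 1]: unifilarity keeps us synchronized
print(belief_after((1, 1)))   # still uncertain without a 0
```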
Example 2.
Alternating Biased Coins Machine
Figure 3 depicts a generating HMM M for the Alternating Biased Coins (ABC) Process. This process may be thought of as being generated by alternately flipping two coins with different biases $p \neq q$. The phase (whether the p-biased coin is flipped on odd time steps or on even time steps) is chosen uniformly at random. M is again, by inspection, a generator ϵ-machine: irreducible and unifilar with probabilistically distinct states. Therefore, by Theorem 1 below, the history ϵ-machine $M_h$ for the process $\mathcal{P}$ that M generates is again isomorphic to M. However, the Fischer cover for the sofic shift $\mathrm{supp}(\mathcal{P})$ is not isomorphic to M. The support of $\mathcal{P}$ is the full shift $\mathcal{X}^{\mathbb{Z}}$, so the Fischer cover consists of a single state transitioning to itself on both symbols 0 and 1.
In this simple example, the history ϵ-machine states can also be deduced directly, despite the fact that the generator M does not have a synchronizing word. If the initial state is $s_0 = \sigma_1$, then by the strong law of large numbers, the limiting fraction of 1s at odd time steps in finite-length past blocks $\overleftarrow{x}^t$ converges a.s. to q, whereas if the initial state is $s_0 = \sigma_2$, then the limiting fraction of 1s at odd time steps converges a.s. to p. Therefore, the initial state $s_0$ can be inferred a.s. from the complete past $\overleftarrow{x}$, so the process $\mathcal{P}$ is finitely characterized with two positive probability equivalence classes of infinite pasts, $E_1$ and $E_2$, corresponding to the two states $\sigma_1$ and $\sigma_2$. Unlike the exact case, however, arguments like this do not generalize easily to other nonexact generator ϵ-machines.
Example 3.
Nonminimal Noisy Period-2 Machine
Figure 4 depicts a nonminimal generating HMM M for the Noisy Period-2 (NP2) Process $\mathcal{P}$, in which 1s alternate with random symbols. M is again unifilar, but it does not have probabilistically distinct states and is, therefore, not a generator ϵ-machine: states $\sigma_1$ and $\sigma_3$ have the same probability distribution over future output sequences, as do states $\sigma_2$ and $\sigma_4$.
There are two positive probability equivalence classes of pasts $\overleftarrow{x}$ for the process $\mathcal{P}$: those containing 0s at a subset of the odd time steps, and those containing 0s at a subset of the even time steps. Those with 0s at odd time steps induce distributions over future output equivalent to that from states $\sigma_2$ and $\sigma_4$, while those with 0s at even time steps induce distributions over future output equivalent to that from states $\sigma_1$ and $\sigma_3$. Thus, the ϵ-machine for $\mathcal{P}$ consists of just two states, $E_1 \simeq \{\sigma_1, \sigma_3\}$ and $E_2 \simeq \{\sigma_2, \sigma_4\}$. In general, for a unifilar HMM without probabilistically distinct states, the ϵ-machine is formed by grouping together equivalent states in a similar fashion.
Example 4.
Simple Nonunifilar Source
Figure 5 depicts a generating HMM M known as the Simple Nonunifilar Source (SNS) [7]. The output process $\mathcal{P}$ generated by M consists of long sequences of 1s broken by isolated 0s. As its name indicates, M is nonunifilar, so it is not an ϵ-machine.
The symbol 0 is a synchronizing word for M, so all pasts $\overleftarrow{x}$ ending in a 0 induce the same probability distribution over future output sequences $\overrightarrow{x}$: namely, the distribution over futures given by starting M in the initial state $s_0 = \sigma_1$. However, since M is nonunifilar, an observer does not remain synchronized after seeing a 0. Any nontrivial past of the form $\overleftarrow{x} = \cdots 01^n$ induces the same distribution over the initial state $s_0$ as any other past of this form; however, for $n \ge 1$, there is some possibility of being in each of $\sigma_1$ and $\sigma_2$ at time 0. A direct calculation shows that the distributions over $s_0$ and, hence, the distributions over future output sequences $\overrightarrow{x}$ are distinct for different values of n. Thus, since a.e. past $\overleftarrow{x}$ contains at least one 0, it follows that the process $\mathcal{P}$ has a countable collection of positive probability equivalence classes of pasts, comprising a set of full measure: $\{E_n : n = 0, 1, 2, \ldots\}$, where $E_n = \{\overleftarrow{x} = \cdots 01^n\}$. This leads to a countable-state history ϵ-machine $M_h$, as depicted on the right of Figure 5. We will not address countable-state machines further here, as other technical issues arise in this case. Conceptually, however, they are similar to the finite-state case and may be depicted graphically in an analogous fashion.

4. Equivalence

We will now show that the two ϵ-machine definitions, history and generator, are equivalent in the following sense:
  • If $\mathcal{P}$ is the process generated by a generator ϵ-machine $M_g$, then $\mathcal{P}$ is finitely characterized, and the history ϵ-machine $M_h(\mathcal{P})$ is isomorphic to $M_g$ as a hidden Markov model.
  • If $\mathcal{P}$ is a finitely characterized, stationary, ergodic, finite-alphabet process, then the history ϵ-machine $M_h(\mathcal{P})$, when considered as a hidden Markov model, is also a generator ϵ-machine. Furthermore, the process $\mathcal{P}'$ generated by $M_h$ is the same as the original process $\mathcal{P}$ from which the history machine was derived.
That is, there is a 1-1 correspondence between finite-state generator ϵ-machines and finite-state history ϵ-machines. Every generator ϵ-machine is also a history ϵ-machine, for the same process $\mathcal{P}$ it generates, and every history ϵ-machine is also a generator ϵ-machine, for the same process $\mathcal{P}$ from which it was derived.

4.1. Generator ϵ -Machines Are History ϵ -Machines

In this section, we establish equivalence in the following direction:
Theorem 1.
If $\mathcal{P} = (\mathcal{X}^{\mathbb{Z}}, \mathbb{X}, P)$ is the process generated by a generator ϵ-machine $M_g$, then $\mathcal{P}$ is finitely characterized, and the history ϵ-machine $M_h(\mathcal{P})$ is isomorphic to $M_g$ as a hidden Markov model.
The key ideas in proving this theorem come from the study of synchronization to generator ϵ -machines [14,15]. In order to state these ideas precisely, however, we first need to introduce some terminology.
Let $M_g$ be a generator ϵ-machine, and let $\mathcal{P} = (\mathcal{X}^{\mathbb{Z}}, \mathbb{X}, P)$ and $\tilde{\mathcal{P}} = ((\mathcal{X} \times \mathcal{S})^{\mathbb{Z}}, \mathbb{X} \times \mathbb{S}, \tilde{P})$ be the associated symbol and symbol-state processes generated by $M_g$, as in Section 3.2. Further, let the random variables $X_t : (\mathcal{X} \times \mathcal{S})^{\mathbb{Z}} \to \mathcal{X}$ and $S_t : (\mathcal{X} \times \mathcal{S})^{\mathbb{Z}} \to \mathcal{S}$ be the natural projections $X_t(x, s) = x_t$ and $S_t(x, s) = s_t$, and let $\overrightarrow{X}^t = X_0 \cdots X_{t-1}$ and $\overleftarrow{X}^t = X_{-t} \cdots X_{-1}$.
The process language $\mathcal{L}(\mathcal{P})$ is the set of words w of positive probability: $\mathcal{L}(\mathcal{P}) = \{w \in \mathcal{X}^{*} : P(w) > 0\}$. For a given word $w \in \mathcal{L}(\mathcal{P})$, we define $\phi(w) = \tilde{P}(S \mid w)$ to be an observer's belief distribution as to the machine's current state after observing the word w. Specifically, for a length-t word $w \in \mathcal{L}(\mathcal{P})$, $\phi(w)$ is a probability distribution over the machine states $\{\sigma_1, \ldots, \sigma_N\}$ whose $i$th component is expressed as follows:
$$\phi(w)_i = \tilde{P}(S_0 = \sigma_i \mid \overleftarrow{X}^t = w) = \tilde{P}(S_0 = \sigma_i, \overleftarrow{X}^t = w) / \tilde{P}(\overleftarrow{X}^t = w).$$
For a word $w \notin \mathcal{L}(\mathcal{P})$ we will, by convention, take $\phi(w) = \pi$.
For any word w, $\bar{\sigma}(w)$ is defined to be the most likely machine state at the current time, given that the word w was just observed. That is, $\bar{\sigma}(w) = \sigma_{i^*}$, where $i^*$ is defined by the relation $\phi(w)_{i^*} = \max_i \phi(w)_i$. In the case of a tie, $i^*$ is taken to be the lowest value of the index i maximizing the quantity $\phi(w)_i$. Furthermore, $P(w)$ is defined to be the probability of the most likely state after observing w:
$$P(w) \equiv \phi(w)_{i^*}.$$
Furthermore, $Q(w)$ is defined to be the combined probability of all other states after observing w:
$$Q(w) \equiv \sum_{i \neq i^*} \phi(w)_i = 1 - P(w).$$
So, for example, if $\phi(w) = (0.2, 0.7, 0.1)$, then $\bar{\sigma}(w) = \sigma_2$, $P(w) = 0.7$, and $Q(w) = 0.3$.
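These quantities are directly computable from the transition matrices, since $\phi(w) \propto \pi T^{(w)}$ for words in the process language (our own sketch, Even Machine once more):

```python
import numpy as np
from functools import reduce

p = 0.5
T = {0: np.array([[p, 0.0], [0.0, 0.0]]),
     1: np.array([[0.0, 1.0 - p], [1.0, 0.0]])}
pi = np.array([2/3, 1/3])

def phi(word):
    """Belief distribution over the current state after observing `word`."""
    if not word:
        return pi
    v = pi @ reduce(np.matmul, (T[x] for x in word))
    if v.sum() == 0:          # w not in the process language:
        return pi             # by convention, phi(w) = pi
    return v / v.sum()

def doubt(word):
    """Q(w): total probability of all states other than the most likely."""
    return 1.0 - phi(word).max()

for w in [(), (1,), (1, 1), (1, 1, 0)]:
    print(w, phi(w), f"Q={doubt(w):.3f}")  # Q drops to 0 after a 0 is seen
```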
The most recent t symbols are described by the block random variable $\overleftarrow{X}^t$, and so we define the corresponding random variables $\Phi_t = \phi(\overleftarrow{X}^t)$, $\bar{S}_t = \bar{\sigma}(\overleftarrow{X}^t)$, $P_t = P(\overleftarrow{X}^t)$, and $Q_t = Q(\overleftarrow{X}^t)$. Although their values depend only on the symbol sequence x, formally, we think of $\Phi_t$, $\bar{S}_t$, $P_t$, and $Q_t$ as defined on the cross product space $(\mathcal{X} \times \mathcal{S})^{\mathbb{Z}}$. Their realizations are denoted with lowercase letters $\phi_t$, $\bar{s}_t$, $p_t$, and $q_t$, so that for a given realization $(x, s) \in (\mathcal{X} \times \mathcal{S})^{\mathbb{Z}}$, $\phi_t = \phi(\overleftarrow{x}^t)$, $\bar{s}_t = \bar{\sigma}(\overleftarrow{x}^t)$, $p_t = P(\overleftarrow{x}^t)$, and $q_t = Q(\overleftarrow{x}^t)$. The primary result we use is the following exponential decay bound on the quantity $Q_t$.
Lemma 1.
For any generator ϵ-machine $M_g$, there exist constants $K > 0$ and $0 < \alpha < 1$ such that:
$$\tilde{P}(Q_t > \alpha^t) \le K \alpha^t, \quad \text{for all } t \in \mathbb{N}.$$
Proof. 
This follows directly from the Exact Machine Synchronization Theorem of [15] and the Nonexact Machine Synchronization Theorem of [14], by stationarity. (Note that the notation used there differs slightly from that here by a time shift of length t. That is, $Q_t$ there refers to the observer's doubt in $S_t$ given $\overrightarrow{X}^t$, instead of the observer's doubt in $S_0$ given $\overleftarrow{X}^t$. Furthermore, L is used as a time index rather than t in those works.) □
Essentially, this lemma says that after observing a block of t symbols, it is exponentially unlikely that an observer’s doubt Q t in the machine state will be more than exponentially small. Using the lemma, we now prove Theorem 1.
Proof. 
(Theorem 1) Let $M_g$ be a generator ϵ-machine with state set $\mathcal{S} = \{\sigma_1, \ldots, \sigma_N\}$ and stationary distribution $\pi = (\pi_1, \ldots, \pi_N)$. Let $\mathcal{P}$ and $\tilde{\mathcal{P}}$ be the associated symbol and symbol-state processes generated by $M_g$. By Lemma 1, there exist constants $K > 0$ and $0 < \alpha < 1$ such that $\tilde{P}(Q_t > \alpha^t) \le K \alpha^t$, for all $t \in \mathbb{N}$. Let us define the following sets:
$$V_t = \{(x, s) : q_t \le \alpha^t, \; s_0 = \bar{s}_t\}, \quad V_t' = \{(x, s) : q_t \le \alpha^t, \; s_0 \neq \bar{s}_t\}, \quad W_t = \{(x, s) : q_t > \alpha^t\}, \quad \text{and} \quad U_t = W_t \cup V_t'.$$
Then, we obtain the following:
$$\tilde{P}(U_t) = \tilde{P}(V_t') + \tilde{P}(W_t) \le \alpha^t + K \alpha^t = (K + 1)\alpha^t.$$
So the following holds:
$$\sum_{t=1}^{\infty} \tilde{P}(U_t) \le \sum_{t=1}^{\infty} (K + 1)\alpha^t < \infty.$$
Hence, by the Borel-Cantelli Lemma, $\tilde{P}(U_t \text{ occurs infinitely often}) = 0$. Or, equivalently, for $\tilde{P}$-a.e. $(x, s)$, there exists $t_0 \in \mathbb{N}$ such that $(x, s) \in V_t$ for all $t \ge t_0$. Now, we define the following:
$$C = \{(x, s) : \text{there exists } t_0 \in \mathbb{N} \text{ such that } (x, s) \in V_t \text{ for all } t \ge t_0\}, \quad D_i = \{(x, s) : s_0 = \sigma_i\}, \quad \text{and} \quad C_i = C \cap D_i.$$
According to the above discussion, $\tilde{P}(C) = 1$ and, clearly, $\tilde{P}(D_i) = \pi_i$. Thus, $\tilde{P}(C_i) = \tilde{P}(C \cap D_i) = \pi_i$. Furthermore, by the convention that $\phi(w) = \pi$ for $w \notin \mathcal{L}(\mathcal{P})$, we know that for every $(x, s) \in C_i$ the corresponding symbol past $\overleftarrow{x}$ is nontrivial. So, the conditional probabilities $P(w \mid \overleftarrow{x}^t)$ are well defined for each t.
Now, given any $(x, s) \in C_i$, take $t_0$ sufficiently large so that $(x, s) \in V_t$ for all $t \ge t_0$. Then, for $t \ge t_0$, $\bar{s}_t = \sigma_i$ and $q_t \le \alpha^t$. So, for any word $w \in \mathcal{X}^{*}$ and any $t \ge t_0$, the following holds:
$$\begin{aligned} \big| P(w \mid \overleftarrow{x}^t) - P_{\sigma_i}(w) \big| &= \big| \tilde{P}(\overrightarrow{X}^{|w|} = w \mid \overleftarrow{X}^t = \overleftarrow{x}^t) - \tilde{P}(\overrightarrow{X}^{|w|} = w \mid S_0 = \sigma_i) \big| \\ &\overset{(*)}{=} \Big| \sum_j \tilde{P}(\overrightarrow{X}^{|w|} = w \mid S_0 = \sigma_j)\, \tilde{P}(S_0 = \sigma_j \mid \overleftarrow{X}^t = \overleftarrow{x}^t) - \tilde{P}(\overrightarrow{X}^{|w|} = w \mid S_0 = \sigma_i) \Big| \\ &= \Big| \sum_{j \neq i} \tilde{P}(\overrightarrow{X}^{|w|} = w \mid S_0 = \sigma_j)\, \tilde{P}(S_0 = \sigma_j \mid \overleftarrow{X}^t = \overleftarrow{x}^t) \\ &\qquad - \big(1 - \tilde{P}(S_0 = \sigma_i \mid \overleftarrow{X}^t = \overleftarrow{x}^t)\big)\, \tilde{P}(\overrightarrow{X}^{|w|} = w \mid S_0 = \sigma_i) \Big| \\ &\le \sum_{j \neq i} \tilde{P}(\overrightarrow{X}^{|w|} = w \mid S_0 = \sigma_j)\, \tilde{P}(S_0 = \sigma_j \mid \overleftarrow{X}^t = \overleftarrow{x}^t) \\ &\qquad + \big(1 - \tilde{P}(S_0 = \sigma_i \mid \overleftarrow{X}^t = \overleftarrow{x}^t)\big)\, \tilde{P}(\overrightarrow{X}^{|w|} = w \mid S_0 = \sigma_i) \\ &\le \sum_{j \neq i} \tilde{P}(S_0 = \sigma_j \mid \overleftarrow{X}^t = \overleftarrow{x}^t) + \big(1 - \tilde{P}(S_0 = \sigma_i \mid \overleftarrow{X}^t = \overleftarrow{x}^t)\big) \\ &= 2 q_t \le 2 \alpha^t. \end{aligned}$$
Step (*) follows from the fact that $\overleftarrow{X}^m$ and $\overrightarrow{X}^n$ are conditionally independent given $S_0$, for any $m, n \in \mathbb{N}$, by construction of the measure $\tilde{P}$. Since $|P(w \mid \overleftarrow{x}^t) - P_{\sigma_i}(w)| \le 2\alpha^t$ for all $t \ge t_0$, we know $\lim_{t \to \infty} P(w \mid \overleftarrow{x}^t) = P_{\sigma_i}(w)$ exists. Since this holds for all $w \in \mathcal{X}^{*}$, we know $\overleftarrow{x}$ is regular, and $P(w \mid \overleftarrow{x}) = P_{\sigma_i}(w)$ for all $w \in \mathcal{X}^{*}$.
Now, let us define equivalence classes $E_i$, $i = 1, \ldots, N$, as follows:
$$E_i = \{\overleftarrow{x} : \overleftarrow{x} \text{ is regular and } P(w \mid \overleftarrow{x}) = P_{\sigma_i}(w) \text{ for all } w \in \mathcal{X}^{*}\}.$$
Furthermore, for each $i = 1, \ldots, N$, let the following hold:
$$\tilde{E}_i = \{(x, s) : \overleftarrow{x} \in E_i\}.$$
From the results of Appendix C, we know that each equivalence class $E_i$ is measurable, so each set $\tilde{E}_i$ is also measurable, with $\tilde{P}(\tilde{E}_i) = P(E_i)$. Furthermore, for each i, $C_i \subseteq \tilde{E}_i$, so $P(E_i) = \tilde{P}(\tilde{E}_i) \ge \tilde{P}(C_i) = \pi_i$. Since $\sum_{i=1}^{N} \pi_i = 1$ and the equivalence classes $E_i$, $i = 1, \ldots, N$, are all disjoint, it follows that $P(E_i) = \pi_i$ for each i, and $\sum_{i=1}^{N} P(E_i) = \sum_{i=1}^{N} \pi_i = 1$. Hence, the process $\mathcal{P}$ is finitely characterized, with positive probability equivalence classes $\mathcal{E}^{+} = \{E_1, \ldots, E_N\}$.
Moreover, the equivalence classes $\{E_1, \ldots, E_N\}$ (the history ϵ-machine states) have a natural one-to-one correspondence with the states of the generating ϵ-machine: $E_i \leftrightarrow \sigma_i$, $i = 1, \ldots, N$. It remains only to verify that this bijection is also edge-preserving and is, thus, an isomorphism. Specifically, we must show the following:
  • For each $i = 1, \ldots, N$ and $x \in \mathcal{X}$, $P(x \mid E_i) = P_{\sigma_i}(x)$.
  • For all i and x with $P(x \mid E_i) = P_{\sigma_i}(x) > 0$, $\delta_h(E_i, x)$ corresponds to $\delta(\sigma_i, x)$. That is, if $\delta_h(E_i, x) = E_j$ and $\delta(\sigma_i, x) = \sigma_{j'}$, then $j = j'$.
Point 1 follows directly from the definition of $E_i$. To prove Point 2, take any i and x with $P(x \mid E_i) = P_{\sigma_i}(x) > 0$, and let $\delta_h(E_i, x) = E_j$ and $\delta(\sigma_i, x) = \sigma_{j'}$. Then, for any word $w \in \mathcal{X}^{*}$, we have the following:
(i) $P(xw \mid E_i) = P_{\sigma_i}(xw)$, by definition of the equivalence class $E_i$.
(ii) $P(xw \mid E_i) = P(x \mid E_i) \cdot P(w \mid E_j)$, by Claim 11 in Appendix D.
(iii) $P_{\sigma_i}(xw) = P_{\sigma_i}(x) \cdot P_{\sigma_{j'}}(w)$, by Equation (10) applied to a unifilar HMM.
Since $P(x \mid E_i) = P_{\sigma_i}(x) > 0$, it follows that $P(w \mid E_j) = P_{\sigma_{j'}}(w)$. Since this holds for all $w \in \mathcal{X}^{*}$ and the states of the generator are, by assumption, probabilistically distinct, it follows that $j = j'$. □
Corollary 1.
Generator ϵ-machines are unique in the following sense: two generator ϵ-machines $M_{g_1}$ and $M_{g_2}$ that generate the same process $\mathcal{P}$ are isomorphic.
Proof. 
By Theorem 1, the two generator ϵ -machines are both isomorphic to the process’s history ϵ -machine M h ( P ) and, hence, isomorphic to each other. □
Remark 1.
Unlike history ϵ-machines that are unique by construction, generator ϵ-machines are not by definition unique. Furthermore, it is not a priori clear that they must be. Indeed, general HMMs are not unique. There are infinitely many nonisomorphic HMMs for any given process P generated by some HMM. Moreover, if either the unifilarity or probabilistically distinct states condition is removed from the definition of generator ϵ-machines, then uniqueness no longer holds. It is only when both of these properties are required together that one obtains uniqueness.

4.2. History ϵ -Machines Are Generator ϵ -Machines

In this section, we establish equivalence in the reverse direction as follows:
Theorem 2.
If $\mathcal{P}$ is a finitely characterized, stationary, ergodic, finite-alphabet process, then the history ϵ-machine $M_h(\mathcal{P})$, when considered as a hidden Markov model, is also a generator ϵ-machine. Furthermore, the process $\mathcal{P}'$ generated by $M_h$ is the same as the original process $\mathcal{P}$ from which the history machine was derived.
Note that, by Claim 17 in Appendix E, we know that for any finitely characterized, stationary, ergodic, finite-alphabet process the history ϵ-machine $M_h(\mathcal{P}) = (\mathcal{E}^{+}, \mathcal{X}, \{T^{(x)}\})$ is a valid hidden Markov model. So, we need only show that this HMM has the three properties of a generator ϵ-machine (strongly connected graph, unifilar transitions, and probabilistically distinct states) and that the process $\mathcal{P}'$ generated by this HMM is the same as $\mathcal{P}$. Unifilarity is immediate from the construction, but the other claims take more work and require several lemmas to establish. Throughout, $\mu = (\mu_1, \ldots, \mu_N) \equiv (P(E_1), \ldots, P(E_N))$, where $\mathcal{E}^{+} = \{E_1, \ldots, E_N\}$ is the set of positive probability equivalence classes for the process $\mathcal{P}$.
Lemma 2.
The distribution μ over equivalence-class states is stationary for the transition matrix $T = \sum_{x \in \mathcal{X}} T^{(x)}$. That is, for any $1 \le j \le N$, $\mu_j = \sum_{i=1}^{N} \mu_i \cdot T_{ij}$.
Proof. 
This follows directly from Claim 15 in Appendix E and the definition of the $T^{(x)}$ matrices. □
Lemma 3.
The graph G associated with the HMM $M_h = (\mathcal{E}^{+}, \mathcal{X}, \{T^{(x)}\})$ consists entirely of disjoint strongly connected components. That is, each connected component of G is strongly connected.
Proof. 
It is equivalent to show that the graphical representation of the associated Markov chain, with state set $\mathcal{E}^{+}$ and transition matrix T, consists entirely of disjoint strongly connected components. However, this follows directly from the existence of a stationary distribution μ with $\mu_i = P(E_i) > 0$ for all i [31]. □
Lemma 4.
For any E_i ∈ E^+ and w ∈ X*, P(w | E_i) = P_{E_i}(w), where P_{E_i}(w) ≡ e_i T^{(w)} 1 is the probability of generating the word w starting in state E_i of the HMM M_h = (E^+, X, {T^{(x)}}), as defined in Section 3.2.
Proof. 
By construction, M_h is a unifilar HMM, and its transition function δ, as defined in Section 3.3, is the same as the transition function δ_h between equivalence classes of histories, as defined in Equation (20). Moreover, by construction, for each x ∈ X and state E_i, we have P_{E_i}(x) = P(x | E_i). The lemma essentially follows from these facts. We consider separately the two cases P(w | E_i) > 0 and P(w | E_i) = 0.
  • Case (i)—P(w | E_i) > 0. Let w = w_0 ⋯ w_{l−1} be a word of length l ≥ 1 with P(w | E_i) > 0. By Claim 12 in Appendix D and the ensuing remark, we know that the equivalence classes s_0 = E_i, s_1 = δ_h(s_0, w_0), …, s_l = δ_h(s_{l−1}, w_{l−1}) are well defined and the following holds:
    P(w | E_i) = ∏_{t=0}^{l−1} P(w_t | s_t).
    Since δ_h and δ coincide, there is an allowed state path s in the HMM M_h, namely s = s_1, …, s_l, such that the word w can be generated following s from the initial state E_i. It follows that P_{E_i}(w) > 0 and is given by Equation (15) as follows:
    P_{E_i}(w) = ∏_{t=0}^{l−1} P_{s_t}(w_t) = ∏_{t=0}^{l−1} P(w_t | s_t) = P(w | E_i).
  • Case (ii)—P(w | E_i) = 0. Let w = w_0 ⋯ w_{l−1} be a word of length l ≥ 1 with P(w | E_i) = 0. For 0 ≤ m ≤ l − 1, define w^m = w_0 ⋯ w_{m−1} (w^0 is the null word λ). Take the largest integer m ∈ {0, …, l − 1} such that P(w^m | E_i) > 0. By convention, we take P(λ | E_i) = 1 for all i, so there is always some such m. A similar analysis to the above then shows that the equivalence classes s_0, …, s_m defined by s_0 = E_i and s_{t+1} = δ_h(s_t, w_t) are well defined and the following holds:
    P(w^{m+1} | E_i) = ∏_{t=0}^{m} P(w_t | s_t) = P_{E_i}(w^{m+1}).
    By our choice of m, P(w^{m+1} | E_i) = 0, so P_{E_i}(w^{m+1}) = 0 as well. It follows that P_{E_i}(w) = 0, since w^{m+1} is a prefix of w. □
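The two cases in this proof translate directly into a computation: follow the unifilar transition function symbol by symbol, multiply the per-symbol probabilities, and stop early once a prefix has probability zero; the result should match the matrix expression e_i T^{(w)} 1. A minimal sketch under the assumed Even Process matrices used above (the helper names delta, chain_prob, and matrix_prob are ours):

```python
import numpy as np

# Lemma 4 numerically: the chain-rule product along unifilar transitions
# agrees with the matrix expression e_i T^(w) 1. Matrices are the assumed
# Even Process parameterization (p = 0.5) from the earlier sketches.
p = 0.5
T = {0: np.array([[p, 0], [0, 0.]]), 1: np.array([[0, 1 - p], [1, 0.]])}

def delta(i, x):
    """Unifilar successor state, or None when P(x | E_i) = 0."""
    row = T[x][i]
    return int(np.argmax(row)) if row.sum() > 0 else None

def chain_prob(i, word):
    """Case-by-case evaluation mirroring the proof of Lemma 4."""
    prob, s = 1.0, i
    for x in word:
        prob *= T[x][s].sum()   # P(w_t | s_t)
        if prob == 0.0:
            return 0.0          # case (ii): a prefix has probability zero
        s = delta(s, x)
    return prob                 # case (i): product over the state path

def matrix_prob(i, word):
    v = np.eye(2)[i]
    for x in word:
        v = v @ T[x]
    return v.sum()

for i in (0, 1):
    for w in [(0,), (1, 0), (0, 0, 1), (1, 1, 0)]:
        assert np.isclose(chain_prob(i, w), matrix_prob(i, w))
print("Chain-rule and matrix word probabilities agree.")
```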
Lemma 5.
For any w ∈ X*, P(w) = μ T^{(w)} 1.
Proof. 
Let E_{i,w} ≡ { x : x⃖ ∈ E_i, x_0 ⋯ x_{|w|−1} = w }, the set of sequences whose past lies in E_i and whose next |w| symbols form w. Claim 14 of Appendix D shows that each E_{i,w} is an X-measurable set with P(E_{i,w}) = P(E_i) · P(w | E_i). Since the E_i's are disjoint sets with probabilities summing to 1, it follows that P(w) = Σ_{i=1}^N P(E_{i,w}) for each w ∈ X*. Thus, applying Lemma 4, for any w ∈ X* we have the following:
P(w) = Σ_{i=1}^N P(E_{i,w}) = Σ_{i=1}^N P(E_i) · P(w | E_i) = Σ_{i=1}^N μ_i e_i T^{(w)} 1 = μ T^{(w)} 1. □
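As a sanity check on the formula P(w) = μ T^{(w)} 1, the sketch below verifies, for the assumed Even Process matrices, that these word probabilities are normalized at each length and consistent under marginalization, as the word distribution of a stationary process must be.

```python
import numpy as np
from itertools import product

# Consistency checks for P(w) = mu T^(w) 1, with the assumed Even Process
# matrices (p = 0.5).
p, X = 0.5, [0, 1]
T = {0: np.array([[p, 0], [0, 0.]]),
     1: np.array([[0, 1 - p], [1, 0.]])}
mu = np.array([1 / (2 - p), (1 - p) / (2 - p)])   # stationary distribution
assert np.allclose(mu @ (T[0] + T[1]), mu)        # Lemma 2: mu = mu T

def P(word):
    v = mu.copy()
    for x in word:
        v = v @ T[x]
    return v.sum()

for l in range(1, 7):
    # Word probabilities at each length sum to one ...
    assert np.isclose(sum(P(w) for w in product(X, repeat=l)), 1.0)
# ... and are consistent under marginalization: P(w) = sum_x P(wx) = sum_x P(xw).
w = (1, 1, 0)
assert np.isclose(P(w), sum(P(w + (x,)) for x in X))
assert np.isclose(P(w), sum(P((x,) + w) for x in X))
print("mu T^(w) 1 defines a consistent stationary word distribution.")
```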
Proof. 
(Theorem 2)
  • Unifilarity: As mentioned above, this is immediate from the history ϵ -machine construction.
  • Probabilistically Distinct States: Take any i and j with i ≠ j. By construction of the equivalence classes, there exists some word w ∈ X* such that P(w | E_i) ≠ P(w | E_j). However, by Lemma 4, P(w | E_i) = P_{E_i}(w) and P(w | E_j) = P_{E_j}(w). Hence, P_{E_i}(w) ≠ P_{E_j}(w), so the states E_i and E_j of the HMM M_h = (E^+, X, {T^{(x)}}) are probabilistically distinct. Since this holds for all i ≠ j, M_h has probabilistically distinct states.
  • Strongly Connected Graph: By Lemma 3, we know the graph G associated with the HMM M_h consists of one or more connected components C_1, …, C_n, each of which is strongly connected. Assume that there is more than one such component, n ≥ 2. By the two points above, each component C_k defines a generator ϵ-machine. If two of these components C_k and C_j were isomorphic via a function f : C_k states → C_j states, then for states E_i ∈ C_k and E_l ∈ C_j with f(E_i) = E_l, we would have P_{E_i}(w) = P_{E_l}(w) for all w ∈ X*. By Lemma 4, however, this implies P(w | E_i) = P(w | E_l) for all w ∈ X* as well, which contradicts the fact that E_i and E_l are distinct equivalence classes. Hence, no two of the components C_k, k = 1, …, n, can be isomorphic. By Corollary 1, this implies that the stationary processes P_k, k = 1, …, n, generated by each of the generator ϵ-machine components are all distinct. However, by a block diagonalization argument, it follows from Lemma 5 that P = Σ_{k=1}^n μ_k · P_k, where μ_k = Σ_{i : E_i ∈ C_k} μ_i. That is, for any word w ∈ X*, we have the following:
    P(w) = Σ_{k=1}^n μ_k · P_k(w) = Σ_{k=1}^n μ_k · ρ_k T_k^{(w)} 1,
    where ρ_k and T_k^{(w)} are, respectively, the stationary state distribution and w-transition matrix for the generator ϵ-machine of component C_k. Since the P_k's are all distinct, this implies that the process P cannot be ergodic, which is a contradiction. Hence, there can be only one strongly connected component C_1; that is, the whole graph is strongly connected.
  • Equivalence of P′ and P: Since the graph of the HMM M_h = (E^+, X, {T^{(x)}}) is strongly connected, there is a unique stationary distribution π over its states satisfying π = πT. However, we already know the distribution μ is stationary. Hence, π = μ. By definition, the word probabilities P′(w) for the process P′ generated by this HMM are P′(w) = π T^{(w)} 1, w ∈ X*. However, by Lemma 5, we also have P(w) = μ T^{(w)} 1 = π T^{(w)} 1 for each w ∈ X*. Hence, P′(w) = P(w) for all w ∈ X*, so P′ and P are the same process. □
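The three properties verified in this proof are all mechanically checkable for a concrete finite HMM. The sketch below implements such checks; note that the probabilistically-distinct-states test searches for separating words only up to a fixed cutoff, so it is a heuristic rather than a proof, and the function names are ours.

```python
import numpy as np
from itertools import product

def is_unifilar(Ts):
    """Each (state, symbol) pair has at most one outgoing transition."""
    return all(int((Tx > 0).sum(axis=1).max()) <= 1 for Tx in Ts.values())

def is_strongly_connected(Ts):
    """Every state reaches every state in the transition graph."""
    A = (sum(Ts.values()) > 0).astype(float)
    n = len(A)
    reach = np.linalg.matrix_power(A + np.eye(n), n)
    return bool((reach > 0).all())

def probabilistically_distinct(Ts, max_len=8):
    """Look for a separating word for every state pair, up to max_len.
    The finite cutoff makes this a heuristic check, not a proof."""
    n = len(next(iter(Ts.values())))
    def Pw(i, word):
        v = np.eye(n)[i]
        for x in word:
            v = v @ Ts[x]
        return v.sum()
    for i in range(n):
        for j in range(i + 1, n):
            if all(np.isclose(Pw(i, w), Pw(j, w))
                   for l in range(1, max_len + 1)
                   for w in product(Ts.keys(), repeat=l)):
                return False
    return True

p = 0.5  # assumed Even Process matrices, as in the earlier sketches
Ts = {0: np.array([[p, 0], [0, 0.]]), 1: np.array([[0, 1 - p], [1, 0.]])}
print(is_unifilar(Ts), is_strongly_connected(Ts), probabilistically_distinct(Ts))
# True True True: this history machine is a generator epsilon-machine.
```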

5. Conclusions

We have demonstrated the equivalence of finite-state history and generator ϵ -machines. This is not a new idea. However, a formal treatment was absent until quite recently and, while the rigorous development of ϵ -machines in [16] also implies equivalence, the proofs given here, especially for Theorem 1, are more direct and provide improved intuition.
The key step in proving the equivalence, at least the new approach used for Theorem 1, comes directly from recent bounds on synchronization rates for finite-state generator ϵ -machines. To generalize the equivalence to larger model classes, such as machines with a countably infinite number of states, it therefore seems reasonable that one should first deduce and apply similar synchronization results for countable-state generators. Unfortunately, for countable-state generators, synchronization can be much more difficult and exponential decay rates as in Lemma 1 no longer always hold. Thus, it is unclear whether equivalence in the countable-state case also always holds. Nevertheless, the results in [16] do indicate equivalence holds for countable-state machines if the entropy in the stationary distribution H [ π ] is finite, which it often is.

Author Contributions

Both authors contributed to conceptualization, investigation, validation, and writing and editing the manuscript. N.F.T. was responsible for the methodology and formal analysis. J.P.C. provided supervision, administration, and funding. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by ARO award W911NF-12-1-0234-0. N.F.T. was partially supported by an NSF VIGRE fellowship.

Data Availability Statement

The article provides all data supporting the results.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Regular Pasts and Trivial Pasts

We establish that the set of trivial pasts T is a null set and that the set of regular pasts R has full measure. Throughout this section, P = (X^Z, X, P) is a stationary, ergodic process over a finite alphabet X, and (X, X, P) is the corresponding probability space over past sequences x⃖. Other notation is as in Section 3.
Claim 1.
P a.e. x⃖ is nontrivial. That is, T is an X-measurable set with P(T) = 0.
Proof. 
For any fixed t, T_t ≡ { x⃖ : P(x⃖_t) = 0 } is X-measurable, since it is X_t-measurable, and P(T_t) = 0. Hence, T = ∪_{t=1}^∞ T_t is also X-measurable with P(T) = 0. □
Claim 2.
For any w ∈ X*, P a.e. x⃖ is w-regular. That is,
R_w ≡ { x⃖ : P(x⃖_t) > 0 for all t, and lim_{t→∞} P(w | x⃖_t) exists }
is an X-measurable set with P(R_w) = 1.
Proof. 
Fix w ∈ X*. Let Y_{w,t} : X⃖ → R be defined by the following:
Y_{w,t}(x⃖) = P(w | x⃖_t) if P(x⃖_t) > 0, and 0 otherwise.
Then, the sequence (Y_{w,t}) is a martingale with respect to the filtration (X_t), and E(Y_{w,t}) ≤ 1 for all t. Hence, by the Martingale Convergence Theorem, Y_{w,t} → Y_w a.s. for some X-measurable random variable Y_w. In particular, lim_{t→∞} Y_{w,t}(x⃖) exists for P a.e. x⃖.
Let R̂_w ≡ { x⃖ : lim_{t→∞} Y_{w,t}(x⃖) exists }. Then, as just shown, R̂_w is X-measurable with P(R̂_w) = 1 and, from Claim 1, we know T is X-measurable with P(T) = 0. Hence, R_w = R̂_w ∩ T^c is also X-measurable with P(R_w) = 1. □
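The almost-sure convergence in this claim can be watched numerically: sample one past from a process with a known finite-state presentation and compute P(w | x⃖_t) from progressively longer suffixes of it. The sketch below does this for the assumed Even Process parameterization used in the earlier sketches; the sampling routine is illustrative only.

```python
import numpy as np

# Watching Y_{w,t} = P(w | x_t) converge along one sampled past, for the
# assumed Even Process machine (p = 0.5).
rng = np.random.default_rng(0)
p = 0.5
T = {0: np.array([[p, 0], [0, 0.]]), 1: np.array([[0, 1 - p], [1, 0.]])}
mu = np.array([1 / (2 - p), (1 - p) / (2 - p)])

def P(word):
    """Stationary word probability mu T^(w) 1."""
    v = mu.copy()
    for x in word:
        v = v @ T[x]
    return v.sum()

# Sample one past ... x_{-2} x_{-1} of length 40.
state = rng.choice(2, p=mu)
past = []
for _ in range(40):
    x = rng.choice(2, p=[T[0][state].sum(), T[1][state].sum()])
    state = int(np.argmax(T[x][state]))   # unifilar: successor is unique
    past.append(int(x))

w = (1, 1)
for t in (1, 2, 5, 10, 20, 40):
    xt = tuple(past[-t:])                 # last t symbols of the past
    print(t, P(xt + w) / P(xt))           # P(w | x_t): settles as t grows
```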
Claim 3.
P a.e. x⃖ is regular. That is, R is an X-measurable set with P(R) = 1.
Proof. 
R = ∩_{w∈X*} R_w. By Claim 2, each R_w is X-measurable with P(R_w) = 1. Since there are only countably many finite-length words w ∈ X*, it follows that R is also X-measurable with P(R) = 1. □

Appendix B. Well Definedness of Equivalence Class Transitions

We establish that the equivalence-class-to-equivalence-class transitions are well defined and normalized for the equivalence classes E_β ∈ E. Throughout this section, P = (X^Z, X, P) is a stationary, ergodic process over a finite alphabet X, and (X, X, P) is the corresponding probability space over past sequences x⃖. Other notation is as in Section 3. Recall that, by definition, for any regular past x⃖, P(x⃖_t) > 0 for each t ∈ N. This fact is used implicitly several times in the proofs of the following claims, to ensure that various quantities are well defined.
Claim 4.
For any regular past x⃖ and word w ∈ X* with P(w | x⃖) > 0, the following hold:
(i)
P(x⃖_t w) > 0 for each t ∈ N.
(ii)
P(w | x⃖_t) > 0 for each t ∈ N.
Proof. 
Fix any regular past x⃖ and word w ∈ X* with P(w | x⃖) > 0. Assume there exists t ∈ N such that P(x⃖_t w) = 0. Then, P(x⃖_n w) = 0 for all n ≥ t and, thus, P(w | x⃖_n) = P(x⃖_n w)/P(x⃖_n) = 0 for all n ≥ t as well. Taking the limit gives P(w | x⃖) = lim_{n→∞} P(w | x⃖_n) = 0, which is a contradiction. Hence, we must have P(x⃖_t w) > 0 for each t, proving (i). Point (ii) follows, since P(w | x⃖_t) = P(x⃖_t w)/P(x⃖_t) is greater than zero as long as P(x⃖_t w) > 0. □
Claim 5.
For any regular past x⃖ and any symbol x ∈ X with P(x | x⃖) > 0, the past x⃖x is regular.
Proof. 
Fix any regular past x⃖ and symbol x ∈ X with P(x | x⃖) > 0. By Claim 4, P(x⃖_t x) and P(x | x⃖_t) are both nonzero for each t ∈ N. Thus, the past x⃖x is nontrivial, and the conditional probability P(w | x⃖_t x) is well defined for each w ∈ X*, t ∈ N, and given by the following:
P(w | x⃖_t x) = P(xw | x⃖_t) / P(x | x⃖_t).
Taking the limit gives the following:
lim_{t→∞} P(w | (x⃖x)_t) = lim_{t→∞} P(w | x⃖_t x) = lim_{t→∞} P(xw | x⃖_t) / P(x | x⃖_t) = [lim_{t→∞} P(xw | x⃖_t)] / [lim_{t→∞} P(x | x⃖_t)] = P(xw | x⃖) / P(x | x⃖).
In particular, lim_{t→∞} P(w | (x⃖x)_t) = P(xw | x⃖) / P(x | x⃖) exists. Since this holds for all w ∈ X*, the past x⃖x is regular. □
Claim 6.
If x⃖ and x⃖′ are two regular pasts in the same equivalence class E_β ∈ E then, for any symbol x ∈ X with P(x | E_β) > 0, the regular pasts x⃖x and x⃖′x are also in the same equivalence class.
Proof. 
Let E_β ∈ E and fix any x⃖, x⃖′ ∈ E_β and x ∈ X with P(x | E_β) = P(x | x⃖) = P(x | x⃖′) > 0. By Claim 5, x⃖x and x⃖′x are both regular. Furthermore, just as in the proof of Claim 5, for any w ∈ X* we have the following:
P(w | x⃖x) = lim_{t→∞} P(w | (x⃖x)_t) = P(xw | x⃖) / P(x | x⃖) = P(xw | E_β) / P(x | E_β).
Similarly, for any w ∈ X* the following holds:
P(w | x⃖′x) = lim_{t→∞} P(w | (x⃖′x)_t) = P(xw | x⃖′) / P(x | x⃖′) = P(xw | E_β) / P(x | E_β).
Since this holds for all w ∈ X*, it follows that x⃖x and x⃖′x are in the same equivalence class. □
Claim 7.
For any equivalence class E_β, Σ_{x∈X} P(x | E_β) = 1.
Proof. 
Fix x⃖ ∈ E_β. Then, since X is finite and the limit and sum may therefore be exchanged, the following holds:
Σ_{x∈X} P(x | E_β) = Σ_{x∈X} P(x | x⃖) = Σ_{x∈X} lim_{t→∞} P(x | x⃖_t) = lim_{t→∞} Σ_{x∈X} P(x | x⃖_t) = lim_{t→∞} 1 = 1. □

Appendix C. Measurability of Equivalence Classes

We establish that the equivalence classes E_β, β ∈ B, are measurable sets. Throughout this section, P = (X^Z, X, P) is a stationary, ergodic process over a finite alphabet X, and (X, X, P) is the corresponding probability space over past sequences x⃖. Other notation is as in Section 3.
Claim 8.
Let A_{w,p} ≡ { x⃖ : P(x⃖_t) > 0 for all t, and lim_{t→∞} P(w | x⃖_t) = p }. Then, A_{w,p} is X-measurable for each w ∈ X* and p ∈ [0, 1].
Proof. 
We proceed in steps through a series of intermediate sets.
  • Let A^+_{w,p,ϵ,t} ≡ { x⃖ : P(x⃖_t) > 0, P(w | x⃖_t) ≤ p + ϵ } and A^−_{w,p,ϵ,t} ≡ { x⃖ : P(x⃖_t) > 0, P(w | x⃖_t) ≥ p − ϵ }.
    A^+_{w,p,ϵ,t} and A^−_{w,p,ϵ,t} are both X-measurable, since they are both X_t-measurable.
  • Let A^+_{w,p,ϵ} ≡ ∪_{n=1}^∞ ∩_{t=n}^∞ A^+_{w,p,ϵ,t} = { x⃖ : P(x⃖_t) > 0 for all t, and there exists n ∈ N such that P(w | x⃖_t) ≤ p + ϵ for t ≥ n }, and A^−_{w,p,ϵ} ≡ ∪_{n=1}^∞ ∩_{t=n}^∞ A^−_{w,p,ϵ,t} = { x⃖ : P(x⃖_t) > 0 for all t, and there exists n ∈ N such that P(w | x⃖_t) ≥ p − ϵ for t ≥ n }. Then, A^+_{w,p,ϵ} and A^−_{w,p,ϵ} are each X-measurable, since they are countable unions of countable intersections of X-measurable sets.
  • Let A_{w,p,ϵ} ≡ A^+_{w,p,ϵ} ∩ A^−_{w,p,ϵ} = { x⃖ : P(x⃖_t) > 0 for all t, and there exists n ∈ N such that |P(w | x⃖_t) − p| ≤ ϵ for t ≥ n }. A_{w,p,ϵ} is X-measurable, since it is the intersection of two X-measurable sets.
  • Finally, note that A_{w,p} = ∩_{m=1}^∞ A_{w,p,ϵ_m}, where ϵ_m = 1/m. Hence, A_{w,p} is X-measurable, as it is a countable intersection of X-measurable sets. □
Claim 9.
Any equivalence class E_β ∈ E is an X-measurable set.
Proof. 
Fix any equivalence class E_β ∈ E and, for w ∈ X*, let p_w = P(w | E_β). By definition, E_β = ∩_{w∈X*} A_{w,p_w} and, by Claim 8, each A_{w,p_w} is X-measurable. Thus, since there are only countably many finite-length words w ∈ X*, E_β must also be X-measurable. □

Appendix D. Probabilistic Consistency of Equivalence Class Transitions

We establish that the probability of word generation from each equivalence class is consistent, in the sense of Claims 12 and 14. Claim 14 is used in the proof of Claim 15 in Appendix E, and Claim 12 is used in the proof of Theorem 2. Throughout this section, we assume P = (X^Z, X, P) is a stationary, ergodic process over a finite alphabet X and denote the corresponding probability space over past sequences as (X, X, P), with other notation as in Section 3. We also define the history σ-algebra H for a process P = (X^Z, X, P) as the σ-algebra generated by cylinder sets of all finite-length histories. That is,
H = σ( ∪_{t=1}^∞ H_t ), where H_t = σ( { A_{w,−|w|} : |w| = t } ),
with A_{w,t} = { x : x_t ⋯ x_{t+|w|−1} = w } as in Section 3. H is, in effect, the σ-algebra X on the space of past sequences, carried over to the two-sided sequence space X^Z.
Claim 10.
For any E_β ∈ E and w, v ∈ X*, P(wv | E_β) ≤ P(w | E_β).
Proof. 
Fix x⃖ ∈ E_β. Since P(wv | x⃖_t) ≤ P(w | x⃖_t) for each t, we have the following:
P(wv | E_β) = P(wv | x⃖) = lim_{t→∞} P(wv | x⃖_t) ≤ lim_{t→∞} P(w | x⃖_t) = P(w | x⃖) = P(w | E_β). □
Claim 11.
Let E_β ∈ E and x ∈ X with P(x | E_β) > 0, and let E_α = δ_h(E_β, x). Then, P(xw | E_β) = P(x | E_β) · P(w | E_α) for any word w ∈ X*.
Proof. 
Fix x⃖ ∈ E_β. Then, x⃖x ∈ E_α is regular, so P(x⃖_t x) > 0 for all t, and we have the following:
P(xw | E_β) = P(xw | x⃖) = lim_{t→∞} P(xw | x⃖_t) = lim_{t→∞} P(x | x⃖_t) · P(w | x⃖_t x) = [lim_{t→∞} P(x | x⃖_t)] · [lim_{t→∞} P(w | x⃖_t x)] = P(x | x⃖) · P(w | x⃖x) = P(x | E_β) · P(w | E_α). □
Claim 12.
Let w = w_0 ⋯ w_{l−1} ∈ X* be a word of length l ≥ 1, and let w^m = w_0 ⋯ w_{m−1} for 0 ≤ m ≤ l. Assume that P(w^{l−1} | E_β) > 0 for some E_β ∈ E. Then, the equivalence classes s_t, 0 ≤ t ≤ l−1, defined inductively by the relations s_0 = E_β and s_t = δ_h(s_{t−1}, w_{t−1}) for 1 ≤ t ≤ l−1, are well defined. That is, P(w_{t−1} | s_{t−1}) > 0 for each 1 ≤ t ≤ l−1. Further, the probability P(w | E_β) may be expressed as follows:
P(w | E_β) = ∏_{t=0}^{l−1} P(w_t | s_t).
In the above, w^0 = λ is the null word and, for any equivalence class E_β, P(λ | E_β) ≡ 1.
Proof. 
For |w| = 1, the statement is immediate and, for |w| = 2, it reduces to Claim 11. For |w| ≥ 3, it can be proved by induction on the length of w, using Claim 11 together with the monotonicity bound of Claim 10, which guarantees that P(w^m | E_β) > 0 for each m ≤ l − 1 whenever P(w^{l−1} | E_β) > 0. □
Remark A1.
If P(w | E_β) > 0 then, by Claim 10, we know P(w^{l−1} | E_β) > 0, so the formula above holds for any word w with P(w | E_β) > 0. Moreover, in this case, P(w_{l−1} | s_{l−1}) must be nonzero in order to ensure that P(w | E_β) is nonzero. Thus, the equivalence class s_l = δ_h(s_{l−1}, w_{l−1}) is also well defined.
The following theorem from [32] (Chapter 4, Theorem 5.7) is needed in the proof of Claim 13. It is an application of the Martingale Convergence Theorem.
Theorem A1.
Let (Ω, F, P) be a probability space, and let F_1 ⊆ F_2 ⊆ F_3 ⊆ ⋯ be an increasing sequence of σ-algebras on Ω with F_∞ = σ( ∪_{n=1}^∞ F_n ) ⊆ F. Suppose X : Ω → R is an F-measurable random variable with E|X| < ∞. Then, for (any versions of) the conditional expectations E(X | F_n) and E(X | F_∞), we have the following:
E(X | F_n) → E(X | F_∞) a.s. and in L^1.
Claim 13.
For any w ∈ X*, P_w(x) is (a version of) the conditional expectation E(1_{A_{w,0}} | H)(x), where P_w : X^Z → [0, 1] is defined by the following:
P_w(x) = P(w | x⃖) if the past x⃖ of x = x⃖x⃗ is regular, and 0 otherwise.
Proof. 
Fix w ∈ X*, and let E_w be any fixed version of the conditional expectation E(1_{A_{w,0}} | H). Since the function P_{w,t} : X^Z → [0, 1] defined by the following:
P_{w,t}(x) = P(w | x⃖_t) if P(x⃖_t) > 0, and 0 otherwise
is a version of the conditional expectation E(1_{A_{w,0}} | H_t), Theorem A1 implies that P_{w,t}(x) → E_w(x) for P a.e. x. Now, define the following:
V_w ≡ { x : P_{w,t}(x) → E_w(x) }, W_w ≡ { x ∈ V_w : x⃖ is regular }.
By the above, P(V_w) = 1 and, by Claim 3, the regular pasts have probability 1. Hence, P(W_w) = 1.
However, for each x ∈ W_w, we have the following:
P_w(x) = P(w | x⃖) = E_w(x).
Thus, P_w(x) = E_w(x) for P a.e. x. So, for any H-measurable set H, ∫_H P_w dP = ∫_H E_w dP. Furthermore, P_w is H-measurable, since P_{w,t} → P_w a.s. and each P_{w,t} is H-measurable. It follows that P_w(x) is a version of the conditional expectation E(1_{A_{w,0}} | H). □
Claim 14.
For any equivalence class E_β ∈ E and word w ∈ X*, the set E_{β,w} ≡ { x : x⃖ ∈ E_β, x_0 ⋯ x_{|w|−1} = w } is X-measurable with P(E_{β,w}) = P(E_β) · P(w | E_β).
Proof. 
Let Ê_β = { x : x⃖ ∈ E_β }. Then, Ê_β and A_{w,0} are both X-measurable, so their intersection E_{β,w} is as well. Furthermore, we have the following:
P(E_{β,w}) = ∫_{Ê_β} 1_{A_{w,0}}(x) dP =(a) ∫_{Ê_β} E(1_{A_{w,0}} | H)(x) dP =(b) ∫_{Ê_β} P_w(x) dP = ∫_{Ê_β} P(w | E_β) dP = P(E_β) · P(w | E_β),
where (a) follows from the fact that Ê_β is H-measurable and (b) follows from Claim 13. □
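Claim 14's product identity can also be checked by simulation. In the sketch below, we exploit the fact that the assumed Even Process machine is unifilar and synchronizing, so the hidden state reached after generating a long past almost surely identifies the past's equivalence class; the joint frequency of "past in E_1 and next symbols equal w" is then compared with the product of the estimated P(E_1) and the exact P(w | E_1). All names and parameter values are ours.

```python
import numpy as np

# Monte Carlo check of P(E_{beta,w}) = P(E_beta) * P(w | E_beta), with
# beta = 1 and w = (1, 1), for the assumed Even Process machine.
rng = np.random.default_rng(1)
p = 0.5
T = {0: np.array([[p, 0], [0, 0.]]), 1: np.array([[0, 1 - p], [1, 0.]])}
mu = np.array([1 / (2 - p), (1 - p) / (2 - p)])

def step(s):
    """Emit one symbol from state s and return (symbol, successor state)."""
    x = rng.choice(2, p=[T[0][s].sum(), T[1][s].sum()])
    return x, int(np.argmax(T[x][s]))   # unifilar: successor is unique

w, trials, n_past = (1, 1), 50_000, 20
count_beta = count_joint = 0
for _ in range(trials):
    s = rng.choice(2, p=mu)
    for _ in range(n_past):             # generate a long past
        _, s = step(s)
    if s == 0:                          # past (a.s.) lies in class E_1
        count_beta += 1
        future = []
        for _ in range(len(w)):
            x, s = step(s)
            future.append(x)
        count_joint += tuple(future) == w

Pw_exact = (np.eye(2)[0] @ T[1] @ T[1]).sum()   # P(11 | E_1) = (1 - p) * 1
print(count_joint / trials, (count_beta / trials) * Pw_exact)
# The two estimates agree up to Monte Carlo error.
```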

Appendix E. Finitely Characterized Processes

We establish several results concerning finitely characterized processes. In particular, we show (Claim 17) that the history ϵ-machine M_h(P) is, in fact, a well-defined HMM. Throughout, we assume P = (X^Z, X, P) is a stationary, ergodic, finitely characterized process over a finite alphabet X and denote the corresponding probability space over past sequences as (X, X, P). The set of positive probability equivalence classes is denoted by E^+ = {E_1, …, E_N} and the set of all equivalence classes by E = {E_β, β ∈ B}. For equivalence classes E_β, E_α ∈ E and symbol x ∈ X, I(x, α, β) is the indicator of the transition from class E_α to class E_β on symbol x:
I(x, α, β) = 1 if P(x | E_α) > 0 and δ_h(E_α, x) = E_β, and 0 otherwise.
Finally, the symbol-labeled transition matrices T^{(x)}, x ∈ X, between the equivalence classes E_1, …, E_N are defined by T^{(x)}_{ij} = P(x | E_i) · I(x, i, j). The overall transition matrix between these equivalence classes is denoted by T = Σ_{x∈X} T^{(x)}.
Claim 15.
For any equivalence class E_β ∈ E, the following holds:
P(E_β) = Σ_{i=1}^N Σ_{x∈X} P(E_i) · P(x | E_i) · I(x, i, β).
Proof. 
We have the following:
P(E_β) ≡ P({ x⃖ : x⃖ ∈ E_β }) =(a) P({ x : x⃖x_0 ∈ E_β }) =(b) Σ_{i=1}^N P({ x : x⃖x_0 ∈ E_β, x⃖ ∈ E_i }) = Σ_{i=1}^N Σ_{x∈X} P({ x : x⃖x_0 ∈ E_β, x⃖ ∈ E_i, x_0 = x }) = Σ_{i=1}^N Σ_{x∈X} P({ x : x⃖ ∈ E_i, x_0 = x }) · I(x, i, β) = Σ_{i=1}^N Σ_{x∈X} P(E_{i,x}) · I(x, i, β) =(c) Σ_{i=1}^N Σ_{x∈X} P(E_i) · P(x | E_i) · I(x, i, β),
where (a) follows from stationarity, (b) from the fact that Σ_{i=1}^N P(E_i) = 1, and (c) from Claim 14. □
Claim 16.
For any E_i ∈ E^+ and symbol x with P(x | E_i) > 0, δ_h(E_i, x) ∈ E^+.
Proof. 
Fix E_i ∈ E^+ and x ∈ X with P(x | E_i) > 0. By Claim 15, P(δ_h(E_i, x)) ≥ P(E_i) · P(x | E_i) > 0. Hence, δ_h(E_i, x) ∈ E^+. □
Claim 17.
The transition matrix T = Σ_{x∈X} T^{(x)} is stochastic. That is, Σ_{j=1}^N T_{ij} = 1 for each 1 ≤ i ≤ N. Hence, the HMM M_h(P) = (E^+, X, {T^{(x)}}) is well defined.
Proof. 
This follows directly from Claims 7 and 16. □
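To make the definitions above concrete, the sketch below assembles the matrices T^{(x)}_{ij} = P(x | E_i) · I(x, i, j) from tabulated symbol probabilities and the transition function δ_h, then verifies Claim 17's stochasticity condition. The numbers again come from our assumed Even Process parameterization, not from the paper's figures directly.

```python
import numpy as np

# Assembling T^(x)_ij = P(x | E_i) * I(x, i, j) from per-class symbol
# probabilities and the transition function delta_h, then checking
# Claim 17. Values follow the assumed Even Process parameterization.
p = 0.5
P_sym = {(0, 0): p, (0, 1): 1 - p,   # P(x | E_1) for x = 0 and x = 1
         (1, 1): 1.0}                # P(x | E_2): only x = 1 occurs
delta_h = {(0, 0): 0, (0, 1): 1,     # delta_h(E_1, 0) = E_1, etc.
           (1, 1): 0}

N, alphabet = 2, [0, 1]
T = {x: np.zeros((N, N)) for x in alphabet}
for (i, x), prob in P_sym.items():
    T[x][i, delta_h[(i, x)]] = prob  # P(x | E_i) * I(x, i, j)

total = sum(T.values())
assert np.allclose(total.sum(axis=1), 1.0)  # Claim 17: T is stochastic
print(total)
```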

References

  1. Crutchfield, J.P.; Young, K. Inferring statistical complexity. Phys. Rev. Lett. 1989, 63, 105–108. [Google Scholar] [CrossRef]
  2. Crutchfield, J.P.; Feldman, D.P. Statistical complexity of simple one-dimensional spin systems. Phys. Rev. E 1997, 55, 1239R–1243R. [Google Scholar] [CrossRef]
  3. Varn, D.P.; Canright, G.S.; Crutchfield, J.P. Discovering planar disorder in close-packed structures from X-ray diffraction: Beyond the fault model. Phys. Rev. B 2002, 66, 174110. [Google Scholar] [CrossRef]
  4. Varn, D.P.; Crutchfield, J.P. From finite to infinite range order via annealing: The causal architecture of deformation faulting in annealed close-packed crystals. Phys. Lett. A 2004, 234, 299–307. [Google Scholar] [CrossRef]
  5. Still, S.; Crutchfield, J.P.; Ellison, C.J. Optimal causal inference: Estimating stored information and approximating causal architecture. Chaos Interdiscip. J. Nonlinear Sci. 2010, 20, 037111. [Google Scholar] [CrossRef] [PubMed]
  6. Li, C.-B.; Yang, H.; Komatsuzaki, T. Multiscale complex network of protein conformational fluctuations in single-molecule time series. Proc. Natl. Acad. Sci. USA 2008, 105, 536–541. [Google Scholar] [CrossRef]
  7. Crutchfield, J.P. The calculi of emergence: Computation, dynamics, and induction. Phys. D 1994, 75, 11–54. [Google Scholar] [CrossRef]
  8. Upper, D.R. Theory and Algorithms for Hidden Markov Models and Generalized Hidden Markov Models. Ph.D. Thesis, University of California, Berkeley, CA, USA, 1997. [Google Scholar]
  9. Shalizi, C.R.; Crutchfield, J.P. Computational mechanics: Pattern and prediction, structure and simplicity. J. Stat. Phys. 2001, 104, 817–879. [Google Scholar] [CrossRef]
  10. Ay, N.; Crutchfield, J.P. Reductions of hidden information sources. J. Stat. Phys. 2005, 210, 659–684. [Google Scholar] [CrossRef]
  11. Löhr, W. Predictive Models and Generative Complexity. J. Syst. Sci. Complex 2009, 25, 30–45. [Google Scholar] [CrossRef]
  12. Löhr, W.; Ay, N. Non-sufficient Memories That Are Sufficient for Prediction. In Complex 2009: First International Conference, Shanghai, China, 23–25 February 2009; Zhou, J., Ed.; LNICST 4; Springer: Berlin/Heidelberg, Germany, 2009; pp. 265–276. [Google Scholar]
  13. Löhr, W. Properties of the Statistical Complexity Functional and Partially Deterministic HMMs. Entropy 2009, 11, 385–401. [Google Scholar] [CrossRef]
  14. Travers, N.F.; Crutchfield, J.P. Asymptotic synchronization for finite-state sources. J. Stat. Phys. 2011, 145, 1202–1223. [Google Scholar] [CrossRef]
  15. Travers, N.F.; Crutchfield, J.P. Exact synchronization for finite-state sources. J. Stat. Phys. 2011, 145, 1181–1201. [Google Scholar] [CrossRef]
  16. Löhr, W. Models of Discrete Time Stochastic Processes and Associated Complexity Measures. Ph.D. Thesis, Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany, 2010. [Google Scholar]
  17. Boyle, M.; Petersen, K. Hidden Markov processes in the context of symbolic dynamics. arXiv 2009, arXiv:0907.1858. [Google Scholar]
  18. Weiss, B. Subshifts of finite type and sofic systems. Monatsh. Math. 1973, 77, 462–474. [Google Scholar] [CrossRef]
  19. Hopcroft, J.E.; Ullman, J.D. Introduction to Automata Theory, Languages, and Computation; Addison-Wesley: Reading, UK, 1979. [Google Scholar]
  20. Fischer, R. Sofic systems and graphs. Monatsh. Math. 1975, 80, 179–186. [Google Scholar] [CrossRef]
  21. Boyle, M.; Kitchens, B.; Marcus, B. A note on minimal covers for sofic systems. Proc. AMS 1985, 95, 403–411. [Google Scholar] [CrossRef]
  22. Lind, D.; Marcus, B. An Introduction to Symbolic Dynamics and Coding; Cambridge University Press: New York, NY, USA, 1995. [Google Scholar]
  23. Kitchens, B.; Tuncel, S. Finitary measures for subshifts of finite type and sofic systems. Mem. AMS 1985, 58, 338. [Google Scholar] [CrossRef]
  24. Harris, T.E. On chains of infinite order. Pacific J. Math. 1955, 5, 707–724. [Google Scholar] [CrossRef]
  25. Keane, M. Strongly mixing g-measures. Invent. Math. 1972, 16, 309–324. [Google Scholar] [CrossRef]
  26. Bramson, M.; Kalikow, S. Nonuniqueness in g-functions. Isr. J. Math. 1993, 84, 153–160. [Google Scholar] [CrossRef]
  27. Stenflo, Ö. Uniqueness in g-measures. Nonlinearity 2003, 16, 403–410. [Google Scholar] [CrossRef]
  28. Krieger, W.; Weiss, B. On g measures in symbolic dynamics. Isr. J. Math 2010, 176, 1–27. [Google Scholar] [CrossRef]
  29. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. IEEE Proc. 1989, 77, 257. [Google Scholar] [CrossRef]
  30. Ephraim, Y.; Merhav, N. Hidden Markov processes. IEEE Trans. Inf. Theory 2002, 48, 1518–1569. [Google Scholar] [CrossRef]
  31. Levin, D.; Peres, Y.; Wilmer, E.L. Markov Chains and Mixing Times; American Mathematical Society: Providence, RI, USA, 2006. [Google Scholar]
  32. Durrett, R. Probability: Theory and Examples, 2nd ed.; Wadsworth Publishing Company: Pacific Grove, CA, USA, 1995. [Google Scholar]
Figure 1. A hidden Markov model (the ϵ-machine) for the Even Process. The HMM has two internal states S = {σ_1, σ_2}, a two-symbol alphabet X = {0, 1}, and a single parameter p ∈ (0, 1) that controls the transition probabilities.
Figure 2. The Even Machine M (left) and associated history ϵ-machine M_h (right) for the process P generated by M. p ∈ (0, 1) is a parameter.
Figure 3. The Alternating Biased Coins (ABC) Machine M (left) and associated history ϵ-machine M_h (right) for the process P generated by M. p, q ∈ (0, 1) are parameters, p ≠ q.
Figure 4. A nonminimal generating HMM M for the Noisy Period-2 (NP2) Process (left) and the associated history ϵ-machine M_h for this process (right). p ∈ (0, 1) is a parameter.
Figure 5. The Simple Nonunifilar Source (SNS) M (left) and associated history ϵ-machine M_h (right) for the process P generated by M. In the history ϵ-machine, q_n + p_n = 1 for each n ∈ N, and (q_n)_{n∈N} is an increasing sequence defined by q_n = (1 − q) · (1 − p) Σ_{m=0}^{n−1} p^m q^{n−1−m} / [ p^n + (1 − p) Σ_{m=0}^{n−1} p^m q^{n−1−m} ].