Properties of the Statistical Complexity Functional and Partially Deterministic HMMs

Statistical complexity is a measure of complexity of discrete-time stationary stochastic processes, which has many applications. We investigate its more abstract properties as a non-linear function on the space of processes and show its close relation to Knight's prediction process. We prove lower semi-continuity, concavity, and a formula for the ergodic decomposition of statistical complexity. On the way, we show that the discrete version of the prediction process has a continuous Markov transition. We also prove that, given the past output of a partially deterministic hidden Markov model (HMM), the uncertainty of the internal state is constant over time and knowledge of the internal state gives no additional information on the future output. Using this fact, we show that the causal state distribution is the unique stationary representation on prediction space that may have finite entropy.


Introduction
An important task of complex systems sciences is to define "complexity". Measures that quantify complexity are of both theoretical (e.g., [1]) and practical interest. In applications, they are widely used to identify "interesting" parts of simulations and real-world data (e.g., [2]). There exist various measures of different kinds of complexity. In particular, statistical complexity constitutes a complexity measure for stationary stochastic processes in doubly infinite discrete time with discrete state space. It was introduced by Jim Crutchfield and co-workers within a theory called computational mechanics [3][4][5]. Note that here "computational mechanics" is unrelated to computer simulations of mechanical systems. Statistical complexity has been applied to a variety of real-world data, e.g., in [6]. An important, closely related concept of computational mechanics is the so-called ε-machine. It is a particular partially deterministic HMM that encodes the mechanisms of prediction. Partially deterministic HMMs are often called deterministic stochastic automata to emphasise their close connection to a key concept of theoretical computer science, namely deterministic finite state automata [7].
In this paper, we look at more abstract features of statistical complexity as well as partially deterministic HMMs. We consider statistical complexity to be a non-linear functional from the space of ∆-valued stationary processes (∆ countable) to the set R̄_+ = R_+ ∪ { ∞ } of non-negative extended real numbers. Here, we identify stationary processes with their law, i.e., with shift-invariant probability measures on the sequence space ∆^Z, and equip the space of measures with the usual weak-* topology (often called "weak topology"). Because ∆ is discrete, this topology is equal to the topology of finite-dimensional convergence. In ergodic theory, Kolmogorov-Sinai entropy is studied as a function of the (invariant) measure, and the questions of continuity properties, affinity, and behaviour under ergodic decomposition arise naturally (e.g., [8]). We believe that these questions are worth considering also for complexity measures. A formula for the ergodic decomposition of excess entropy, which is another complexity measure for stochastic processes, was obtained in [9,10]. Our results presented here include the corresponding formula for statistical complexity, and this formula directly implies concavity. The most important result is lower semi-continuity of statistical complexity. We consider this a desirable property for a complexity measure, as it means that a process cannot be complex if it can be approximated by non-complex ones.
In Section 2, we define statistical complexity and show its relations to a discrete version of Frank Knight's prediction process [11,12]. The prediction process is the measure-valued process of conditional probabilities of the future given the past. It takes values in the space P(∆^N) of probability measures on ∆^N, called prediction space. In our formulation, statistical complexity is the marginal entropy of the prediction process. This is equivalent to the classical definition as entropy of a certain partition of the past. We only replace equivalence classes with the respective induced probabilities on the future. In this section, we also show that the discrete (and thus technically vastly simplified) version of the prediction process has a continuous Markov transition kernel (Proposition 5).
In Section 3, we investigate properties of partially deterministic HMMs. Here, we use a general notion of HMM (sometimes called edge-emitting HMM), where the new internal state and the output symbol are jointly determined and may have dependencies conditioned on the last internal state. Partial determinism means that this dependence is extreme in the sense that the last internal state and the output altogether uniquely determine the following internal state. We show that, if one knows the past output trajectory, the remaining uncertainty (measured by entropy) of the internal state is constant over time, although it may depend on the ergodic component (Proposition 18). Furthermore, the distribution of future output is the same for any internal state that is compatible with the past output (Corollary 20). In Section 3.3, we construct a canonical Markov kernel such that, taking any measure ν on prediction space P(∆^N) (i.e., ν is a measure on measures) as initial distribution, we obtain a partially deterministic HMM of a process P ∈ P(∆^N). This process P coincides with the measure r(ν) represented by ν in the sense of integral representation theory, and if ν is appropriately chosen, we obtain the ε-machine of computational mechanics (or something isomorphic) as a special case. Using the properties of partially deterministic HMMs, we obtain that there is no invariant representation on prediction space with finite entropy other than, possibly, the causal state distribution, which may have finite or infinite entropy (Proposition 23).
Section 4 contains our results about statistical complexity. We show that the complexity of a process is the average complexity of its ergodic components plus the entropy of the mixture (Proposition 26). As a direct consequence, statistical complexity is concave (Corollary 27) and non-continuous (even w.r.t. the variational topology). But it does have a continuity property. Namely, using the results of the previous sections, we show in Theorem 32 that it is weak-* lower semi-continuous.

Prediction Dynamic and Statistical Complexity
For the whole article, fix a countable set ∆ with at least two elements and discrete topology. We identify ∆-valued stochastic processes X_Z := (X_k)_{k∈Z}, defined on some probability space (Ω, A, P), with their respective laws P := P ∘ X_Z^{-1} ∈ P(∆^Z). Here, P denotes the set of probability measures. If X_Z is stationary, P is in the set P_inv(∆^Z) of shift-invariant probability measures. Let ξ_k : ∆^Z → ∆ be the canonical projections. Then ξ_Z is a process on (∆^Z, B(∆^Z), P) with the same distribution as X_Z. Here, B denotes the Borel σ-algebra. We often decompose the time set Z into the "future" N and the "past" Z \ N = −N_0, where N_0 = N ∪ { 0 }. For simplicity of notation, we denote the canonical projections on ∆^N with the same symbols, ξ_k, as the projections on ∆^Z. If not stated otherwise, product spaces are equipped with the product topology and spaces of probability measures are equipped with the weak-* topology. We write µ_n ⇀* µ for weak-* convergence.

Discrete Version of Knight's Prediction Process
For every measurable stochastic process with time set R_+ on some Lusin space, Frank Knight defines the corresponding prediction process as a process of conditional probabilities of the future given the past. This theory originated in [11] and was further developed in [12][13][14]. The most important properties of the prediction process are that its paths are right continuous with left limits (cadlag), it has the strong Markov property, and it determines the original process. The continuity of the time set and the generality of the state space lead to a lot of technical difficulties. In our simpler, discrete setting, these difficulties mostly disappear, and useful properties of the prediction process, such as having cadlag paths, become meaningless. A new aspect, however, is added by considering infinite pasts of stationary processes via the time set Z. The marginal distribution (unique because of stationarity) of the prediction process is an important characteristic, which is used to define statistical complexity. For this subsection, fix a stationary process X_Z with distribution P ∈ P_inv(∆^Z).
We use the following notation concerning Markov kernels and conditional probabilities. If K is a kernel from Ω to a measurable space M, we consider K as a measurable function from Ω to P(M) and write K(ω; A) := K(ω)(A) for the probability of a measurable set A w.r.t. the measure K(ω). Given random variables X, Y on Ω, we write P(X | Y) for (a version of) the conditional distribution of X given Y, considered as a σ(Y)-measurable, measure-valued random variable.

Definition 1. Let Z_Z = Z_Z^P be the P(∆^N)-valued stochastic process of conditional probabilities defined by Z_k := P((ξ_{k+n})_{n∈N} | ξ_{]-∞,k]}). We call Z_Z the prediction process of P.

It is evident that the Markov property of the prediction process in continuous time also holds in discrete time. Nevertheless, we give a proof, because it is elementary in our discrete setting. The corresponding transition kernel works as follows. Assume the prediction process is in state z ∈ P(∆^N). The transition kernel maps z to a measure on measures, namely P(Z_1 | Z_0 = z) ∈ P(P(∆^N)). Note that z is a state of the prediction process but at the same time a probability measure. Thus it makes sense to consider the conditional probability given ξ_1 = d w.r.t. the measure z. It is intuitively plausible that the next state will be one of those conditional probabilities, with d distributed according to the marginal of z. The resulting measure has to be shifted by one as time proceeds. With ς : ∆^N → ∆^N, we denote the left shift.
Proposition 2. The prediction process Z_Z of a stationary process is stationary and Markovian with

P(Z_1 | Z_0) = S(Z_0) a.s., where S(z) := Σ_{d∈∆} z(ξ_1 = d) · δ_{z( · | ξ_1 = d) ∘ ς^{-1}}.

In other words, S is the transition kernel of the prediction process.

Proof. Stationarity is obvious from stationarity of X_Z. We obtain a.s.

S(Z_0) = Σ_{d∈∆} P(ξ_1 = d | ξ_{-N_0}) · δ_{P((ξ_{1+n})_{n∈N} | ξ_{-N_0}, ξ_1 = d)} = P(Z_1 | ξ_{-N_0}).    (1)

In particular, P(Z_1 | Z_0) = S(Z_0), as claimed. We still have to verify the Markov property. But because the σ-algebra induced by Z_{-N_0} is nested between those induced by Z_0 and ξ_{-N_0}, we obtain the Markov property from the first equality in (1).
Definition 3. We call the Markov transition S of the prediction process the prediction dynamic.
Note that although the prediction process Z_Z obviously depends on P, the prediction space P(∆^N) and the prediction dynamic S do not. In the case of a general Lusin state space, it is non-trivial to prove the existence of regular versions of the conditional probability such that φ_z(ω) is jointly measurable in (z, ω) (see [12]). For countable ∆, however, we even obtain essential continuity in an elementary way. This enables us to prove continuity of the prediction dynamic.

Lemma 4. Let z, z_n ∈ P(∆^N) and z_n ⇀* z. There is a clopen (i.e., closed and open) set Ω_z ⊆ ∆^N with z(Ω_z) = 1 and versions φ_{z_n}, φ_z : ∆^N → P(∆^N) of the shifted conditional probabilities, φ_z(ω) := z( · | ξ_1 = ξ_1(ω)) ∘ ς^{-1}, such that φ_{z_n} converges to φ_z uniformly on compact subsets of Ω_z.
Proposition 5. The prediction dynamic S is continuous.
Proof. Let z_n, z ∈ P(∆^N) with z_n ⇀* z and Ω_z as in Lemma 4. We have to show

lim_{n→∞} ∫ g dS(z_n) = ∫ g dS(z),    (2)

i.e., lim_{n→∞} ∫ g ∘ φ_{z_n} dz_n = ∫ g ∘ φ_z dz, for any bounded continuous g : P(∆^N) → R. According to Prokhorov's theorem, the sequence (z_n)_{n∈N} is uniformly tight, and we can restrict the integrations to compact subsets. Because lim_{n→∞} z_n(Ω_z) = z(Ω_z) = 1, we can restrict to compact subsets of Ω_z. There, the convergence of φ_{z_n} is uniform; thus (2) holds.
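As a toy numerical illustration (not taken from the text), one step of the prediction dynamic S can be computed for a measure that is specified through a finite-horizon marginal; the truncation to words of fixed length and the dictionary representation are our own assumptions:

```python
from itertools import product

def prediction_step(z):
    """One step of the prediction dynamic S, applied to a measure z on
    Delta^N given through its marginal on the first n coordinates (a dict
    mapping length-n words to probabilities). Returns S(z) as a list of
    pairs (weight, conditional): the weight z(xi_1 = d) and the shifted
    conditional z( . | xi_1 = d), a dict over length-(n-1) words."""
    weights = {}
    for word, p in z.items():
        weights[word[0]] = weights.get(word[0], 0.0) + p
    mixture = []
    for d, w in weights.items():
        if w > 0:
            cond = {word[1:]: p / w for word, p in z.items() if word[0] == d}
            mixture.append((w, cond))
    return mixture

# Example: a fair i.i.d. coin on {0, 1}, truncated to words of length 3.
z_iid = {word: 0.5 ** 3 for word in product((0, 1), repeat=3)}
mixture = prediction_step(z_iid)
# For an i.i.d. process, every conditional is again the i.i.d. marginal,
# so S(z) is concentrated on a single point of prediction space.
```

This mirrors the intuition above: the next state is one of the shifted conditionals z( · | ξ_1 = d), with d drawn from the marginal of z.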

Statistical Complexity
In integral representation theory, a measure ν ∈ P(P(∆^N)) represents the measure

r(ν) := ∫ id_{P(∆^N)} dν,    (3)

where r : P(P(∆^N)) → P(∆^N) is called the resolvent or barycentre map (see [15]) and id is the identity map. Here, measure-valued integrals are Gel'fand integrals. That is, µ = ∫ K dν for some kernel K means ∫ f dµ = ∫∫ f dK( · ) dν for all continuous, real-valued f or, equivalently, µ(B) = ∫ K( · ; B) dν for all measurable sets B. z = r(ν) means that z is a mixture (convex combination) of other processes, and the mixture is described by ν. A trivial representation of z is given by δ_z, the Dirac measure in z. The measure ν is called S-invariant if νS = ν, where νS := ∫ S dν. In other words, it is S-invariant if iteration with the prediction dynamic S does not change it. We see in the following lemma that, in general, iteration with S shifts the represented measure, i.e., νS represents r(ν) ∘ ς^{-1}.

Lemma 6. r(νS) = r(ν) ∘ ς^{-1}. In particular, S-invariant ν represent stationary processes.
Proof. Because r(νS) = ∫∫ id_{P(∆^N)} dS dν, it is sufficient to consider Dirac measures δ_z, z ∈ P(∆^N) (the general claim follows by integration over ν). For Dirac measures we have

r(δ_z S) = r(S(z)) = Σ_{d∈∆} z(ξ_1 = d) · z( · | ξ_1 = d) ∘ ς^{-1} = z ∘ ς^{-1}.

If ν is S-invariant, we also say that ν represents the stationary extension of r(ν) to ∆^Z. The marginal of the prediction process is an important such representation, which we call the causal state distribution because of its close relation to the causal states of computational mechanics.

Definition 7. For P ∈ P_inv(∆^Z), the causal state distribution µ_C(P) is the marginal distribution of the prediction process, i.e., µ_C(P) := P ∘ Z_0^{-1} ∈ P(P(∆^N)).

The causal state distribution of P is an S-invariant representation of P.

Lemma 8. Let P ∈ P_inv(∆^Z). Then µ_C(P) is S-invariant and represents P.
Proof. From Proposition 2 we know that

µ_C(P) S = E_P(S(Z_0)) = E_P(P(Z_1 | ξ_{-N_0})) = P ∘ Z_1^{-1} = P ∘ Z_0^{-1} = µ_C(P).

Furthermore, µ_C(P) represents P because we have

r(µ_C(P)) = ∫ id dµ_C(P) = E_P(Z_0) = E_P(P(ξ_N ∈ · | ξ_{-N_0})) = P ∘ ξ_N^{-1}.

Remark. The definitions in computational mechanics are slightly different. There, one works with equivalence classes of past trajectories (called causal states) instead of probability distributions on future trajectories. Because past trajectories x, y ∈ ∆^{-N_0} are identified if P(ξ_N ∈ · | ξ_{-N_0} = x) = P(ξ_N ∈ · | ξ_{-N_0} = y), the two approaches are equivalent. The advantage of working on the prediction space P(∆^N) is that it has a natural topology, and the prediction processes of all ∆-valued stochastic processes are described in a unified manner on the same space with the same transition kernel.
Example 9. µ_C is not continuous. Let P be a non-deterministic i.i.d. (independent, identically distributed) process. Obviously, the causal state distribution of an i.i.d. process is the Dirac measure δ_{P_N} in its restriction P_N := P ∘ ξ_N^{-1} to positive time. According to [16], periodic measures are dense in the stationary measures, and we find an approximating sequence P_n ⇀* P of periodic measures P_n. But the past of a periodic process determines its future. Thus its causal state distribution is supported by the set of Dirac measures on ∆^N. Because the set of Dirac measures is closed in P(P(∆^N)), the topological supports supp µ_C(P_n) are disjoint from the support supp µ_C(P) = { P_N }, and µ_C(P_n) does not converge to µ_C(P). ♦

With statistical complexity, we measure the complexity of a process P by the "diversity" of its expected futures, given observed pasts (i.e., of µ_C(P)). The Shannon entropy H(µ) is used as the measure of "diversity" of a probability measure µ. With ϕ(x) := −x log(x), it is defined as

H(µ) := sup { Σ_{B∈Ξ} ϕ(µ(B)) | Ξ a countable measurable partition }.    (4)

Definition 10. For P ∈ P_inv(∆^Z), the quantity C_C(P) := H(µ_C(P)) ∈ R̄_+ is called the statistical complexity of P.
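For a finite-order Markov source, the causal states can be found by grouping pasts that share a predictive distribution; the following sketch (a binary first-order chain with illustrative transition probabilities of our own choosing) estimates C_C as the entropy of the resulting grouping:

```python
from collections import defaultdict
from itertools import product
from math import log

def phi(x):
    return -x * log(x) if x > 0 else 0.0

# Illustrative binary first-order Markov chain (parameters chosen ad hoc).
# For an order-1 chain, the predictive distribution of a past depends only
# on its last symbol, so there are exactly two causal states.
T = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
pi = {0: T[1][0] / (T[0][1] + T[1][0])}   # stationary distribution
pi[1] = 1.0 - pi[0]

def past_probability(w):
    """Probability of a finite word under the stationary chain."""
    p = pi[w[0]]
    for a, b in zip(w, w[1:]):
        p *= T[a][b]
    return p

k = 6
groups = defaultdict(float)   # predictive distribution -> probability mass
for w in product((0, 1), repeat=k):
    prediction = tuple(round(T[w[-1]][d], 12) for d in (0, 1))
    groups[prediction] += past_probability(w)

C = sum(phi(p) for p in groups.values())  # statistical complexity, in nats
```

Here C equals the entropy of the stationary distribution, since the causal state is exactly the last symbol.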
Note that if the probability space is sufficiently regular (e.g., separable metrisable), H(µ) can only be finite if µ is supported by a countable set A. In this case,

H(µ) = Σ_{x∈A} ϕ(µ({ x })).

Lower semi-continuity of this entropy functional is probably well known; we give a proof in the appendix.
Lemma 11. Let M be a separable, metrisable space. Then the entropy H : P(M) → R̄_+ is weak-* lower semi-continuous.

Partially Deterministic HMMs
The probability measures on prediction space induce hidden Markov models (HMMs) with an additional partial determinism property, and it turns out to be helpful to investigate such HMMs. In Section 3.1, we define HMMs and introduce the notation we need for the further discussion. In Section 3.2, we define the partial determinism property and obtain our results about the HMMs satisfying this property. In Section 3.3, we show how measures on prediction space induce partially deterministic HMMs and apply the results from Section 3.2 to prove that the causal state distribution is the only invariant representation on prediction space that can have finite entropy.

HMMs
We use the term HMM in a wide sense, meaning a pair (T, µ), where µ is an initial probability measure on some Polish space M of internal states and T is a Markov kernel from M to ∆ × M. The HMM generates on (Ω, A, P) a ∆-valued output process X_N and a (coupled) M-valued internal process W_{N_0}, such that W_0 is µ-distributed and the joint process is Markovian with

P((X_{k+1}, W_{k+1}) ∈ · | X_{[1,k]}, W_{[0,k]}) = T(W_k) a.s.

If µ(B) = ∫ T(m; ∆ × B) dµ(m) for all B ∈ B(M), we say that the HMM is invariant and extend the generated processes to stationary processes X_Z and W_Z. We need some further notation.

Definition 12. a) The output kernel K from M to ∆ is given by K_m := T(m; · × M), and we write K_µ := ∫ K dµ for µ ∈ P(M). b) The internal operators L_d : P(M) → P(M), d ∈ ∆, are given by

L_d(µ)(B) := ∫ T(m; { d } × B) dµ(m) / K_µ(d), B ∈ B(M),

defined whenever K_µ(d) > 0.

Remark. a) K_m is the distribution of the next output symbol when the internal state is m, i.e., K_m = P(X_1 | W_0 = m) a.s. Further, K_µ is the law of X_1.
b) The internal operator L_d describes the update of knowledge of the internal state when the symbol d ∈ ∆ is observed. For Dirac measures, we obtain

L_d(δ_m)(B) = T(m; { d } × B) / K_m(d).

Be warned that L_d is not induced by a kernel in the following sense. There is no kernel L from ∆ × M to M such that L_d(µ) = ∫ L(d, · ) dµ in general: in L_d(µ), the measure is normalised outside the integral, as opposed to an individual normalisation of the L_d(δ_m) inside the integral on the right-hand side. It directly follows from the definition of (X_N, W_{N_0}) by a Markov kernel that the conditional probability, given that the internal state is m, is obtained by starting the HMM in m. In other words, it is generated by the HMM (T, δ_m). Similarly, the conditional probability given an observed symbol X_1 = d is obtained by starting the HMM in the updated initial distribution L_d(µ). We formulate these observations in the following lemma and give a formal proof in the appendix.
Lemma 13. Let (T, µ) be an HMM with internal and output processes W_{N_0}, X_N as above, and let G_T(m) ∈ P(∆^N) denote the distribution of the output process of (T, δ_m). Then a.s. P(X_N | W_0) = G_T ∘ W_0, and (T, L_{X_1}(µ)) is an HMM of P(X_{[2,∞[} | X_1).
Definition 14 (processes Y_Z and H_Z). Given an invariant HMM, let Y_Z be the P(M)-valued stochastic process of expectations over internal states, given by Y_k := P(W_k | X_{]-∞,k]}), and let H_Z be the process given by H_k := H(Y_k), where entropy H is defined by (4).
Remark. Y_k describes the current knowledge of the internal state, given the past. H_k is the entropy of the value of Y_k and measures "how uncertain" the knowledge of the internal state is. It is important to bear in mind that this is different from the entropy of the random variable Y_k. To avoid confusion, we always write H_P(X) when referring to the entropy of a random variable X defined on a probability space with measure P.
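For HMMs with finitely many internal states, the output kernel K_µ and the internal operator L_d reduce to finite sums; the sketch below (with a hypothetical two-state machine of our own making) also makes the single outer normalisation in L_d explicit:

```python
def output_kernel(T, mu):
    """K_mu(d) = sum_m mu(m) T(m; {d} x M) for a finite-state HMM,
    where T[m] maps pairs (d, m') to probabilities."""
    out = {}
    for m, pm in mu.items():
        for (d, m2), p in T[m].items():
            out[d] = out.get(d, 0.0) + pm * p
    return out

def internal_operator(T, d, mu):
    """L_d(mu): updated internal-state distribution after observing d.
    Note the single normalisation outside the sum; L_d(mu) is in general
    not the mu-average of the individually normalised L_d(delta_m)."""
    new = {}
    for m, pm in mu.items():
        for (sym, m2), p in T[m].items():
            if sym == d:
                new[m2] = new.get(m2, 0.0) + pm * p
    total = sum(new.values())   # = K_mu(d)
    return {m2: p / total for m2, p in new.items()}

# Hypothetical two-state HMM: 'A' emits '0' (staying) or '1' (moving to
# 'B') with probability 1/2 each; 'B' emits '1' and moves back to 'A'.
T = {'A': {('0', 'A'): 0.5, ('1', 'B'): 0.5}, 'B': {('1', 'A'): 1.0}}
mu = {'A': 0.5, 'B': 0.5}
```

For instance, observing '0' from the uniform prior concentrates the belief on state 'A', since only 'A' can emit a '0'.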
The following lemma justifies the idea of the internal operator L_d being an update of knowledge of the internal state. Furthermore, it enables us to condition on Y_0 instead of X_{-N_0}. The conditional probability of the internal state given the past, Y_0, contains as much information about X_1 (and in fact X_N, but we do not need that here) as the past X_{-N_0} does.

Lemma 15. For an invariant HMM, a.s. a) Y_1 = L_{X_1}(Y_0), and b) P(X_1 | Y_0) = P(X_1 | X_{-N_0}) = K_{Y_0}.
Proof. a) Conditional independence of (X_1, W_1) and X_{-N_0} given W_0 implies that a.s.

P(X_1 = d, W_1 ∈ B | X_{-N_0}) = ∫ T(m; { d } × B) dY_0(m).    (5)

Setting B = M yields P(X_1 = d | X_{-N_0}) = K_{Y_0}(d), and dividing (5) by this quantity yields Y_1 = P(W_1 | X_{]-∞,1]}) = L_{X_1}(Y_0).
b) The second equality follows directly from (5). The first follows because, due to the second equality,

P(X_1 | Y_0) = E(P(X_1 | X_{-N_0}) | Y_0) = E(K_{Y_0} | Y_0) = K_{Y_0}.

The previous lemma enables us to prove that Y_Z is Markovian and to compute its transition kernel. We already know that L_d(ν) is the updated expectation of the internal state when it was previously ν and d is now observed. Thus it is not surprising that the conditional probability of Y_k given Y_{k−1} = ν is a convex combination of Dirac measures in L_d(ν) for different d (note that Y_k is a measure-valued random variable; thus its conditional probability distribution is indeed a distribution on distributions). The mixture is given by the output kernel K, more precisely by K_ν.

Lemma 16. For an invariant HMM, Y_Z and H_Z are stationary, and Y_Z is a Markov process with transition kernel ν ↦ Σ_{d∈∆} K_ν(d) δ_{L_d(ν)}.
Proof. Stationarity is obvious. For ν_0, …, ν_k ∈ P(M) and ν := ν_k we obtain from Lemma 15

P(Y_{k+1} ∈ · | Y_0 = ν_0, …, Y_k = ν) = Σ_{d∈∆} P(X_{k+1} = d | Y_k = ν) δ_{L_d(ν)} = Σ_{d∈∆} K_ν(d) δ_{L_d(ν)},

and hence the claim.

Partial Determinism
If the transition T of an HMM is deterministic, i.e., if the internal state determines the next state and output (and thus the whole future) uniquely, the HMM is called (completely) deterministic. In a deterministic HMM, all randomness is due to the initial distribution. This is a very strong property, and a weaker partial determinism property is useful. In a partially deterministic HMM, the output symbol is determined randomly, but the new internal state is a function f(m, d) of the last internal state m and the new output symbol d. If the internal space M is finite, such HMMs are stochastic versions of deterministic finite state automata (DFAs), an important concept of theoretical computer science (see [7, Chap. 2]). The function f directly corresponds to the transition function of the DFA, but the start state is replaced by the initial distribution, and the HMM assigns probabilities to the outputs via the output kernel K. A difference in interpretation is that the symbols from ∆ are considered input of the DFA and output of HMMs. To emphasise their close connection to DFAs, partially deterministic HMMs are often called deterministic stochastic automata, although they are not completely deterministic.
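The DFA analogy can be made concrete with a small sampler. The automaton below is the well-known "even process" (our illustrative choice, not an example from the text): its output kernel and DFA-style transition function guarantee that maximal blocks of 1s in the output have even length.

```python
import random

# Output kernel and transition function of a small partially deterministic
# HMM: state 'A' emits 0 or 1 with probability 1/2; emitting a 1 forces a
# second 1 via state 'B' before returning to 'A'.
out_kernel = {'A': {0: 0.5, 1: 0.5}, 'B': {1: 1.0}}
f = {('A', 0): 'A', ('A', 1): 'B', ('B', 1): 'A'}   # DFA-style transitions

def sample(n, state='A', rng=random):
    """Sample n output symbols; the next state is f(state, symbol)."""
    out = []
    for _ in range(n):
        r, acc = rng.random(), 0.0
        for d, p in out_kernel[state].items():
            acc += p
            if r < acc:
                break
        out.append(d)
        state = f[(state, d)]   # partial determinism: state follows output
    return out
```

Only the output symbol is random; given the symbol, the internal trajectory is fully determined, exactly as in the definition above.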
Definition 17. An HMM (T, µ) is called partially deterministic with (measurable) transition function f : M × ∆ → M if

T(m; { d } × B) = K_m(d) · 1_B(f(m, d)) for all m ∈ M, d ∈ ∆, B ∈ B(M),

where f_m(d) := f_d(m) := f(m, d) and B(M) is the Borel σ-algebra on M.
Remark. For partially deterministic HMMs we obtain

T(m) = Σ_{d∈∆} K_m(d) δ_{(d, f_d(m))} and P(W_1 = f(W_0, X_1)) = 1.

The second equation implies that W_{k+1} = f(W_k, X_{k+1}) a.s., justifying the name transition function for f.
The following proposition is crucial for understanding partially deterministic representations. It states that, given the past output, the uncertainty H_k = H(Y_k) about the internal state is constant over time and the next output symbol is independent of the internal state. The proof is along the following lines. If we know the internal state at one point in time, we can maintain knowledge of the internal state due to partial determinism. More generally, the uncertainty H_k of the internal state cannot decrease on average and thus is a supermartingale. But because it is also stationary, the trajectories have to be constant. If two possible internal states would lead to different probabilities for the next output symbol, we could increase our knowledge of the internal state by observing the next output. But because of partial determinism, this would also decrease the uncertainty of the following internal state, in contradiction to the constant trajectories of H_Z.

Proposition 18. Let (T, µ) be a partially deterministic, invariant HMM with H(µ) < ∞. Then H_Z has a.s. constant trajectories, i.e., H_k = H_0 a.s. Furthermore, the restriction K|_{supp(Y_0)} of the output kernel K to the support supp(Y_0) ⊆ M of the random measure Y_0 is a.s. a constant kernel, i.e.,

K_m = K_{Y_0} for all m ∈ supp(Y_0) a.s.    (7)

Proof. We show that H_Z is a supermartingale to use the following well-known property.
Lemma. Every stationary supermartingale has a.s. constant trajectories.
Because H(µ) < ∞, we may assume w.l.o.g. that M is countable. Note that ϕ(x) = −x log(x) satisfies ϕ(Σ_i x_i) ≤ Σ_i ϕ(x_i). We use the filtration F_k := σ(X_{]-∞,k]}), k ∈ Z, and write ν_d for the conditional distribution of W_k given the past and X_{k+1} = d, i.e., ν_d(m) := Y_k(m) K_m(d) / K_{Y_k}(d), so that L_d(Y_k) = ν_d ∘ f_d^{-1}. We obtain a.s.

E(H_{k+1} | F_k) = Σ_{d∈∆} K_{Y_k}(d) H(L_d(Y_k)) ≤ Σ_{d∈∆} K_{Y_k}(d) H(ν_d) ≤ H(Y_k) = H_k,    (8)

where the first equality holds because Y_{k+1} = L_{X_{k+1}}(Y_k) by Lemma 15, the first inequality is the subadditivity of ϕ applied to the pushforward under f_d, and the second inequality says that conditioning on X_{k+1} cannot increase the average entropy of W_k. Thus H_Z is a supermartingale w.r.t. (F_k)_{k∈Z} and has a.s. constant trajectories. In particular, inequality (8) is actually an equality. Because equality in the second inequality of (8) means that observing X_{k+1} a.s. does not change the conditional distribution of W_k, we obtain K_m = K_{Y_k} for Y_k-almost all m, i.e., (7).

Note that the finite-entropy assumption is indeed necessary for the second statement of Proposition 18. For example, the shift defines a deterministic HMM that does not (in general) satisfy (7).
Example 19 (shift HMM). The shift HMM is defined as follows. The internal state consists of the whole trajectory, M := ∆^Z. T = T_ς outputs the symbol at position one and shifts the sequence to the left. More formally, with m = (m_k)_{k∈Z} ∈ M and ς(m) = (m_{k+1})_{k∈Z}, we have T_ς(m) := δ_{(m_1, ς(m))}. If P ∈ P_inv(∆^Z), it is obvious that (T_ς, P) is an invariant, deterministic (in particular partially deterministic) HMM of P. Here, P is the law of both X_Z and W_0; in fact, even X_Z = W_0. We claim that, generically, (T_ς, P) does not satisfy (7) (and, of course, the internal state entropy H(P) is infinite unless P is supported by countably many trajectories). Indeed, K_m = δ_{m_1}, and thus (7) implies that X_{-N_0} determines X_1 uniquely, which is generically not true. The analogously defined one-sided shift on M = ∆^N also does not satisfy (7). Note that, because future trajectories are equivalent to internal states, the associated process Y_Z of the one-sided shift is essentially the prediction process, Y_k = Z_k. ♦

Proposition 18 tells us that the next output symbol of a partially deterministic HMM is conditionally independent of the internal state, given the past output. But even more is true. The whole future output is conditionally independent of the internal state. Thus, if we know the past, the internal state provides no additional information useful for the prediction of the future output.
Corollary 20. Let (T, µ) be partially deterministic, invariant, and H(µ) < ∞. Then a.s. P(X_N | X_{-N_0}, W_0) = P(X_N | X_{-N_0}).

Proof. According to Proposition 18, P(X_1 | X_{-N_0}, W_0) = K_{W_0} = K_{Y_0} = P(X_1 | X_{-N_0}) a.s. To obtain the statement for X_{[1,n]}, we consider the n-tuple HMM defined as follows. The output space is ∆^n, the internal space is M, whereas the output and internal processes X̃_Z and W̃_Z are given by X̃_k = X_{[(k−1)n+1,kn]} and W̃_k = W_{nk}. This is achieved by the HMM (T̃, µ), where T̃ performs n steps of T and outputs the resulting n-tuple of symbols. The HMM is obviously partially deterministic with transition function f̃_{(d_1,…,d_n)} := f_{d_n} ∘ ⋯ ∘ f_{d_1}. Because we can couple the processes such that Ỹ_0 = Y_0, the claim follows.
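A minimal numerical illustration of Proposition 18, under assumptions of our own: two internal states with identical output kernels, i.e., two ergodic components that are indistinguishable from the output. The belief update Y_{k+1} = L_{X_{k+1}}(Y_k) then leaves the conditional entropy H_k constant at log 2, whatever is observed.

```python
import random
from math import log

def entropy(mu):
    return sum(-p * log(p) for p in mu.values() if p > 0)

# Toy example: states 'A' and 'B' both emit a fair coin and stay put, so
# the output reveals nothing about the ergodic component.
out_kernel = {'A': {0: 0.5, 1: 0.5}, 'B': {0: 0.5, 1: 0.5}}
f = {(m, d): m for m in 'AB' for d in (0, 1)}

def update(mu, d):
    """Belief update Y_{k+1} = L_d(Y_k) for a partially deterministic HMM."""
    new = {}
    for m, pm in mu.items():
        w = pm * out_kernel[m][d]
        if w > 0:
            new[f[(m, d)]] = new.get(f[(m, d)], 0.0) + w
    total = sum(new.values())
    return {m: p / total for m, p in new.items()}

rng = random.Random(1)
mu = {'A': 0.5, 'B': 0.5}
entropies = [entropy(mu)]
for _ in range(50):
    mu = update(mu, rng.choice((0, 1)))
    entropies.append(entropy(mu))
# entropies stays at log 2: the uncertainty depends only on the mixture
# of ergodic components, as Proposition 18 asserts.
```
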

Representations on Prediction Space
We can interpret any probability measure µ on the prediction space P(∆^N) as the initial distribution of an HMM. The "internal state update" of the corresponding transition T_C follows the same rule as the prediction dynamic S, described by the conditional probability given the last observation. The difference is that now we include output symbols from ∆. We want to construct the HMM in such a way that if it is started in the internal state z ∈ P(∆^N), its output process is distributed according to z (which is also a measure on the future). Thus, the distribution of the next output d has to be equal to the marginal of z. The next internal state has to be the conditional z-probability of the future given ξ_1 = d, shifted by one.

Definition 21. We define the Markov kernel T_C from P(∆^N) to ∆ × P(∆^N) by

T_C(z) := Σ_{d∈∆} z(ξ_1 = d) · δ_{(d, z( · | ξ_1 = d) ∘ ς^{-1})}.

Note that T_C(z; ∆ × B) = S(z; B), i.e., marginalising T_C(z) to the internal component yields the prediction dynamic. Thus, if µ = µ_C(P) is the causal state distribution (Definition 7) of some process P ∈ P_inv(∆^Z), then the internal state process of the induced HMM (T_C, µ) coincides with the prediction process Z_Z of P. From the following lemma we conclude that the output process X_Z is, as expected, distributed according to P. More generally, if µ ∈ P(P(∆^N)) represents a process z ∈ P(∆^N) in the sense of integral representation theory as a mixture of other processes, it also induces an HMM of z, namely (T_C, µ). Recall that r is the resolvent, defined in (3), and associates the represented process to µ.

Lemma 22. For every µ ∈ P(P(∆^N)), (T_C, µ) is a partially deterministic HMM with output process distributed according to r(µ).
Proof. Partial determinism follows directly from the definition of T_C. We have K_z = z ∘ ξ_1^{-1}, and the transition function f is given by f_z ∘ ξ_1 := φ_z; it is well defined due to the σ(ξ_1)-measurability of φ_z, and obviously T_C(z; { (d, f_z(d)) | d ∈ ∆ }) = 1. We assume w.l.o.g. that µ is a Dirac measure (the general claim follows by integration over µ). Thus let µ = δ_z with z = r(µ). Recall that, according to Lemma 13, (T_C, L_d(δ_z)) is an HMM of the conditional probability of ξ_{[2,∞[} given that ξ_1 = d (w.r.t. the output process of (T_C, δ_z)). Using

L_d(δ_z) = δ_{z( · | ξ_1 = d) ∘ ς^{-1}},

the claim follows by induction.

Using the properties of partially deterministic HMMs, one obtains that the causal state distribution is the only S-invariant representation that can have finite entropy.

Proposition 23. Let P ∈ P_inv(∆^Z) and let ν ∈ P(P(∆^N)) be an S-invariant representation of P with H(ν) < ∞. Then ν = µ_C(P).
Properties of Statistical Complexity

Let P_e(∆^Z) denote the set of ergodic measures in P_inv(∆^Z). By the ergodic decomposition theorem, every P ∈ P_inv(∆^Z) is represented by a measure ν ∈ P(P_e(∆^Z)) in the sense that P = ∫ Q dν(Q). Such a measure ν always exists and is uniquely determined by P. In [9,10], Łukasz Dębowski investigated another complexity measure, excess entropy, and gave a formula for its ergodic decomposition. Here, we obtain the corresponding result for statistical complexity. It is the average complexity of the ergodic components plus the entropy of the mixture.
Proposition 26 (ergodic decomposition). Let ν ∈ P(P_e(∆^Z)) be the ergodic decomposition of P ∈ P_inv(∆^Z). Then

C_C(P) = ∫ C_C dν + H(ν).

Proof. First note that µ_C(P_1) and µ_C(P_2) are singular for distinct ergodic P_1, P_2 ∈ P_e(∆^Z). Indeed, there exist disjoint, shift-invariant, measurable sets A_1, A_2 ⊆ ∆^N with P_i(ξ_N ∈ A_i) = 1, and thus µ_C(P_i) is concentrated on { z ∈ P(∆^N) | z(A_i) = 1 }. Consequently, if ν is not supported by a countable set, µ_C(P) cannot be supported by a countable set and C_C(P) = H(ν) = ∞. Thus assume ν = Σ_{k∈N} ν_k δ_{P_k} for some ν_k ≥ 0 and distinct P_k ∈ P_e(∆^Z). Then there are disjoint measurable sets B_k ⊆ P(∆^N) with µ_C(P_k)(B_k) = 1, and µ_C(P) = Σ_k ν_k µ_C(P_k) yields

C_C(P) = H(Σ_k ν_k µ_C(P_k)) = Σ_k ν_k H(µ_C(P_k)) + Σ_k ϕ(ν_k) = ∫ C_C dν + H(ν).

Several corollaries follow directly from this proposition. The set P_C := C_C^{-1}(R) of stationary processes with finite statistical complexity is convex, C_C is concave but not continuous, and the set P_∞ := P_inv(∆^Z) \ P_C of processes with infinite statistical complexity is dense.
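The formula of Proposition 26 can be checked numerically in a small hypothetical case of our own: a mixture of a period-2 and a period-3 process with uniformly random phase, where each phase is one causal state, so the component complexities are log 2 and log 3 and the causal state distributions have disjoint supports.

```python
from math import log, isclose

def H(probs):
    return sum(-p * log(p) for p in probs if p > 0)

nu = [0.5, 0.5]                            # mixture over ergodic components
components = [[1 / 2] * 2, [1 / 3] * 3]    # causal state distributions

# Causal state distribution of the mixture (disjoint supports, so the
# masses simply concatenate with weights nu):
mu_C = [w * p for w, comp in zip(nu, components) for p in comp]

lhs = H(mu_C)                                              # C_C(P)
rhs = sum(w * H(comp) for w, comp in zip(nu, components)) + H(nu)
# lhs == rhs: average component complexity plus entropy of the mixture.
```
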
Corollary 27 (concavity). P_C is a convex set and C_C is concave. Moreover, for all ν ∈ P(N), ν_k := ν(k), and P_k ∈ P_inv(∆^Z),

Σ_k ν_k C_C(P_k) ≤ C_C(Σ_k ν_k P_k) ≤ Σ_k ν_k C_C(P_k) + H(ν).

Proof. Use the ergodic decompositions of the P_k and Proposition 26.
Corollary 28 (non-continuity). C_C|_{P_C} is not continuous in any P ∈ P_C w.r.t. the variational topology, let alone w.r.t. the weak-* topology.
We argue that, from a theoretical point of view, every complexity measure should be lower semi-continuous. While it is not counter-intuitive that a simple system can be approximated by unnecessarily complex ones (and hence complexity is not continuous), it would be strange to consider a process complex if there is an approximating sequence of (uniformly) simple processes. Therefore, an axiomatic characterisation of complexity measures (although, of course, we are far from having such a characterisation) should include lower semi-continuity. There are also slightly more practical reasons why semi-continuity is a nice property.
In a model selection task, for instance, it might be desirable to impose some upper bound a ∈ R̄_+ on the complexity of the considered processes (e.g., to avoid overfitting). An important consequence of lower semi-continuity is that the set C_C^{-1}([0, a]) = { P ∈ P_inv(∆^Z) | C_C(P) ≤ a } of processes with complexity bounded by a is closed. This makes the complexity constraint technically easier to handle. Consider any complete metric on P_inv(∆^Z) compatible with the weak-* (or any stronger) topology (e.g., the Prokhorov, Kantorovich-Rubinshtein, or variational metric). Then, due to the closedness, for every P ∈ P_inv(∆^Z) with arbitrary complexity, there is a (not necessarily unique) closest "sufficiently simple" process P_a with complexity not exceeding a. Another consequence is that the set of processes with infinite complexity is generic in the following sense.

Corollary 33. P_∞ contains a dense G_δ-set.

Proof. Because all C_C^{-1}([0, n]) are closed, P_∞ is a G_δ-set. It is dense according to Corollary 29.

Example 34. Consider the experiment of first choosing a random coin with success probability p uniformly in [0, 1] and then generating an i.i.d. sequence with this coin. More precisely, let Q_p be the Bernoulli process with parameter p on ∆ = { 0, 1 } and P = ∫ Q_p dp. Then P has infinite statistical complexity according to Proposition 26. We might approximate P by P_n ⇀* P (e.g., with ergodic P_n). Then Theorem 32 implies that the complexity of P_n necessarily tends to infinity. ♦

Example 35. Let ∆ be finite; then P_inv(∆^Z) is compact. Assume we made observations of a ∆-valued process and want to fit some P ∈ P_inv(∆^Z). From the observations, we might derive a set of closed constraints, e.g., P({ ξ_1 = ξ_2 }) ∈ [a, b], P({ ξ_1 = d }) ≥ ε, and P({ ξ_2 = d } | { ξ_1 = d }) ∈ [a, b] (the third is closed only in the presence of the second). Further closed constraints may be given by modelling assumptions. Because the resulting set of admissible processes is compact, lower semi-continuity implies that there is at least one process of minimal complexity satisfying all constraints. ♦

Appendix

Proof of Lemma 11 (lower semi-continuity of the entropy). Recall that ϕ(x) := −x log(x) and denote the boundary of a set B by ∂B. Define

H̃(µ) := sup { Σ_{i=1}^n ϕ(µ(B_i)) | n ∈ N, B_i disjoint, µ(∂B_i) = 0 }.

Obviously, H̃ ≤ H. Recall that µ_n ⇀* µ implies µ_n(A) → µ(A) for all A with µ(∂A) = 0 (e.g., [18]). Thus H̃ is clearly lower semi-continuous, and it is sufficient to show H ≤ H̃. If µ is not supported by any countable set, separability of M yields H̃(µ) = H(µ) = ∞. Thus let µ = Σ_{i=1}^∞ a_i δ_{x_i} (a_i ∈ [0, 1], x_i ∈ M), and let d be a compatible metric on M. For fixed n ∈ N, we can choose a radius r_n > 0 such that the balls B_i^n := { x ∈ M | d(x_i, x) < r_n }, i = 1, …, n, are disjoint, µ(∂B_i^n) = 0, and

Σ_{i=1}^n ϕ(a_i) ≤ Σ_{i=1}^n ϕ(µ(B_i^n)) + 1/n ≤ H̃(µ) + 1/n.

Therefore, H(µ) = lim_{n→∞} Σ_{i=1}^n ϕ(a_i) ≤ H̃(µ).

Proof of Lemma 13. We first prove that (T, δ_{W_0}) is an HMM of P(X_N | W_0). Let G_T(m) ∈ P(∆^N) be the distribution of the output process of (T, δ_m). Because G_T is measurable, G_T ∘ W_0 is σ(W_0)-measurable. From the definition of (W_{N_0}, X_N), it follows for measurable B ⊆ M, A ⊆ ∆^N that

P(W_0 ∈ B, X_N ∈ A) = ∫_B G_T(m)(A) dµ(m) = E(1_B(W_0) · (G_T ∘ W_0)(A)),

where the second equality holds because W_0 is distributed according to µ. Thus G_T ∘ W_0 is the claimed conditional probability. To see that (T, L_{X_1}(µ)) is an HMM of P(X_{[2,∞[} | X_1), let d ∈ ∆ and observe

P(X_1 = d, X_{[2,∞[} ∈ A) = ∫_M ∫_M G_T(w)(A) T(m; { d } × dw) dµ(m) = K_µ(d) · ∫_M G_T(w)(A) d(L_d(µ))(w),

i.e., P(X_{[2,∞[} ∈ A | X_1 = d) is the distribution of the output process of (T, L_d(µ)).