1. Introduction
Multicomponent systems are typically much more structurally complex than the collection of their parts, even infinitely more so. This should be contrasted with statistical mixtures—such as arise in the Gibbs Paradox of thermodynamics (Sections 2–3 in [1])—where gases of distinct molecular species exhibit only a modest entropy increase upon mixing, due to the uncertainty about which species one has in hand. This contrast demonstrates how the ansatz of statistical mixtures misses key aspects of hierarchical organization. The result, as we show, is an awareness of a new kind of structural complexity of composite systems.
The development here focuses on the theoretical core of this basic phenomenon, arguing that it is, in fact, quite commonplace. To appreciate this, it will be helpful to address the motivating issues upfront.
The multicomponent systems of interest are found in several different domains, including the entropy of mixing in thermodynamics [2,3], the change point problem in statistics [4], the attractor-basin portrait of a dynamical system [5], Smale’s basic sets [6,7], spatially extended systems with multiple local attractors [8], chaotic crystallography [9,10], evolutionary dynamics [11], and adaptive and learning systems with memory. More recently, nonergodicity has been broadly implicated in, for example, computation theory [12], learning theory [13], and the complex behaviors of social and economic systems [14,15].
We introduce the concept of hidden multistationary processes to capture what is common across these domains—a system comprising multiple locally competing behaviors and structures. The basic idea can be appreciated within an experimental paradigm: multistationarity models repeated experimental trials in which different initial conditions lead to statistically distinct behaviors. When we wish to emphasize their structure, we refer to a multicomponent process; when emphasizing the statistical consequences, we refer to a multistationary process.
In short, one goal is to provide a tractable model that quantitatively captures what is common among these domains while providing an architectural, high-level view of the state-space organization of behaviors. In particular, we would like to analyze how unpredictable and how structurally complex hidden multistationary processes are, given their components, whose unpredictability and complexity we know. Another goal is that the approach be constructive, allowing one to quantitatively determine essential properties and to pinpoint precisely what gives rise to the emergent global complexity.
The development proceeds as follows: It first reviews statistical mixtures, briefly recalling stochastic processes, information theory, structural complexity, and mixed-state processes. It then introduces the theory and construction of hidden multistationary processes. This includes a canonical minimal representation of hidden multistationary processes and a method to analyze their ergodic decompositions that determines how the latter affect information measures.
The sections following this explore a number of examples, going from the simplest cases and familiar structured stationary component processes to the Mother of All Processes that subsumes them all. Taken together, these illustrate a new kind of structural hierarchy and make plain how infinite complexity naturally emerges. The development concludes by drawing out parallels with related results and consequences in nonequilibrium thermodynamics and machine learning.
2. Background
To get started, we give a minimal summary of the required background—a summary that assumes familiarity with computational mechanics [16,17] and with information theory for complex systems [18,19].
2.1. Processes
A process, denoted $\mathcal{P}$, is specified by the joint distribution over its chain of random variables $\ldots X_{-2} X_{-1} X_0 X_1 X_2 \ldots$. We view $\mathcal{P}$ as a communication channel with a fixed input distribution $\Pr(\overleftarrow{X})$: It transmits information from the past $\overleftarrow{X} = \ldots X_{-2} X_{-1}$ to the future $\overrightarrow{X} = X_0 X_1 \ldots$ by storing it in the present. $X_t$ denotes the discrete random variable at time $t$ taking on values $x$ from a discrete alphabet $\mathcal{A}$. And $X_{t:t+\ell} = X_t X_{t+1} \cdots X_{t+\ell-1}$ is the block of $\ell$ random variables starting at time $t$. A particular realization is denoted using lowercase: $x_{t:t+\ell}$. Often, we simply refer to a particular sequence $w = x_{0:\ell}$, $\ell > 0$, as a word. If we have a symbol $x$ and a word $w$, we form a new word by concatenation, e.g., $wx$ or $xw$.
2.2. Information
Given a process, we form the block distributions $\Pr(X_{t:t+\ell})$ by marginalizing the given joint distribution. (We ignore here the measure-theoretic construction of cylinder sets and their measures; for background, see Ref. [20] and references therein.) A stationary process is one for which $\Pr(X_{t:t+\ell}) = \Pr(X_{0:\ell})$ for all $t$ and $\ell$. For a stationary process, we drop the time index and thereby have the family of word distributions $\{\Pr(X_{0:\ell}) : \ell > 0\}$ that completely characterizes the process.
The amount of Shannon information in words is measured by the block entropy, $H(\ell) = H[X_{0:\ell}]$, where $H[Y] = -\sum_{y} \Pr(y) \log_2 \Pr(y)$ is the Shannon entropy of the random variable $Y$. A process’ information production is given by its entropy rate, $h_\mu = \lim_{\ell \to \infty} H(\ell)/\ell$, where $\mu$ refers to the measure over infinite sequences and so, too, to the word probabilities $\Pr(w)$. It is often used to measure a process’ degree of unpredictability.
At a minimum, a good predictor—denote this model’s state random variables $\mathcal{R}$—must capture all of a process’ excess entropy [19]—all of the information shared between past and future: $\mathbf{E} = I[\overleftarrow{X}; \overrightarrow{X}]$. Here, $I[Y; Z]$ is the mutual information between variables $Y$ and $Z$. That is, for a good predictor $\mathcal{R}$: $I[\mathcal{R}; \overrightarrow{X}] \geq \mathbf{E}$.
These quantities are closely related. In particular, for finitary processes, those with $\mathbf{E} < \infty$, the block entropy has the linear asymptotic behavior, $H(\ell) \sim \mathbf{E} + h_\mu \ell$ as $\ell \to \infty$. More precisely, $\mathbf{E} = \lim_{\ell \to \infty} \left[ H(\ell) - h_\mu \ell \right]$. This shows that $\mathbf{E}$ controls the convergence of the entropy rate estimates $h_\mu(\ell) = H(\ell) - H(\ell - 1)$. In fact, for time-series processes, $\mathbf{E}$ can also be defined in terms of entropy convergence, $\mathbf{E} = \sum_{\ell=1}^{\infty} \left[ h_\mu(\ell) - h_\mu \right]$. An analogous quantity that controls the block entropy convergence to the linear asymptote is the transient information, $\mathbf{T} = \sum_{\ell=0}^{\infty} \left[ \mathbf{E} + h_\mu \ell - H(\ell) \right]$. $\mathbf{T}$ measures the average amount of information an observer must extract in order to know a process’ internal state (for a review of these and related informations, see Ref. [19]).
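These convergence quantities can be estimated numerically. The following is a minimal sketch that computes the block entropy $H(\ell)$, the entropy-rate estimate $h_\mu(\ell) = H(\ell) - H(\ell-1)$, and the resulting excess-entropy estimate for a simple two-state process (the Golden Mean Process analyzed in Section 7.1.3); the state labels and transition-table layout are our own choices, not from the text.

```python
from itertools import product
from math import log2

# Golden Mean Process as a 2-state unifilar HMM (labels A, B are ours):
T = {0: {('A', 'B'): 0.5},                      # emit 0: A -> B w.p. 1/2
     1: {('A', 'A'): 0.5, ('B', 'A'): 1.0}}    # emit 1: A -> A, B -> A
pi = {'A': 2/3, 'B': 1/3}                       # stationary state distribution

def word_prob(w):
    """Pr(w): propagate state mass through the labeled transitions."""
    mass = dict(pi)
    for x in w:
        nxt = {}
        for (s, t), p in T[x].items():
            if s in mass:
                nxt[t] = nxt.get(t, 0.0) + mass[s] * p
        mass = nxt
    return sum(mass.values())

def block_entropy(ell):
    """H(ell) = -sum_w Pr(w) log2 Pr(w) over length-ell words."""
    return -sum(p * log2(p)
                for w in product([0, 1], repeat=ell)
                if (p := word_prob(w)) > 0)

H = [block_entropy(ell) for ell in range(1, 11)]
h_est = H[-1] - H[-2]        # h_mu(ell) = H(ell) - H(ell-1)
E_est = H[-1] - 10 * h_est   # E = lim [H(ell) - h_mu * ell]
```

Since this process is order-1 Markov, the estimates converge after one step: $h_\mu = 2/3$ bits per symbol and $\mathbf{E} = H(1) - h_\mu \approx 0.2516$ bits.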
2.3. Structure
We refer to a model of a process—a particular choice of state random variables $\mathcal{R}$—as a presentation. Note that building a model of a process is more demanding than developing a prediction scheme, since one wishes to go beyond sequence statistics to express a process’ mechanisms and internal organization.
To do this, we first recall that a process’ communication channel is determined by the conditional distributions $\Pr(\overrightarrow{X} \mid \overleftarrow{x})$. Based on this, computational mechanics introduced an equivalence relation $\sim_\epsilon$ that groups all of a process’ histories that give rise to the same prediction. This results in constructing a map $\epsilon(\cdot)$ from all pasts (finite and infinite) to causal states defined by $\epsilon(\overleftarrow{x}) = \{ \overleftarrow{x}\,' : \Pr(\overrightarrow{X} \mid \overleftarrow{x}\,') = \Pr(\overrightarrow{X} \mid \overleftarrow{x}) \}$. In other words, a process’ finite-history causal states are equivalence classes—$\mathcal{S}_\ell = \overleftarrow{X}_\ell / \sim_\epsilon$—that partition the space of pasts into sets which are predictively equivalent. The causal states, then, are the collection across all past lengths: $\mathcal{S} = \bigcup_\ell \mathcal{S}_\ell$. These consist of recurrent and transient causal states that are visited with positive or vanishing probability, respectively.
With the causal states in hand, one determines the causal-state to causal-state transitions, $T^{(x)}_{\sigma \sigma'} = \Pr(X_t = x, \mathcal{S}_{t+1} = \sigma' \mid \mathcal{S}_t = \sigma)$. The resulting model $M$, consisting of the causal states and transitions, is called the process’ ϵ-machine [21]. Informally, a process is ergodic if its statistics can be estimated from a single realization that is sufficiently long. If $\mathcal{P}$ is ergodic, then $M$’s recurrent causal states are strongly connected and their asymptotic invariant distribution $\pi$ is unique and given by $\pi = \pi T$, where $T = \sum_{x \in \mathcal{A}} T^{(x)}$.
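As a concrete sketch, the invariant distribution satisfying $\pi = \pi T$ can be found by iterating the summed transition matrix of a small ϵ-machine; here we use the Golden Mean Process, with a state ordering [A, B] that is our own convention.

```python
import numpy as np

# Symbol-labeled transition matrices for the Golden Mean epsilon-machine:
T0 = np.array([[0.0, 0.5],     # emit 0: A -> B
               [0.0, 0.0]])
T1 = np.array([[0.5, 0.0],     # emit 1: A -> A
               [1.0, 0.0]])    #         B -> A
T = T0 + T1                    # row-stochastic state-to-state matrix

pi = np.ones(2) / 2            # any initial guess on the simplex
for _ in range(200):           # power iteration: pi <- pi T
    pi = pi @ T
pi /= pi.sum()                 # guard against round-off drift
```

For this machine the iteration converges to $\pi = (2/3, 1/3)$.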
As described, an ϵ-machine is obtained from a process, but one can also simply define an ϵ-machine and consider its generated process. We will use both notions in the following, as they are equivalent [22]. But why should one use the ϵ-machine presentation of a process in the first place?
The main takeaway from computational mechanics is that out of all optimally predictive models $\mathcal{R}$ resulting from a partition of the past—those such that $I[\mathcal{R}; \overrightarrow{X}] = \mathbf{E}$—the ϵ-machine captures the amount of information that a process stores—the statistical complexity $C_\mu = H[\mathcal{S}]$. The excess entropy $\mathbf{E}$—the information explicitly observed in sequences—is only a lower bound on the information that a process stores [21]: $\mathbf{E} \leq C_\mu$. The difference $\chi = C_\mu - \mathbf{E}$, called the crypticity, measures how the process hides its internal state information from an observer [23].
A process’ ϵ-machine is its minimal unifilar presentation. It is unique for the process. Moreover, it allows a number of the process’ complexity measures to be directly and efficiently calculated [24]. The latter include the process’ entropy rate, excess entropy, statistical complexity, and crypticity. In short, a process’ ϵ-machine captures all of its informational and structural properties.
3. Mixed State Operator
Given an ϵ-machine $M$, its recurrent causal states can be treated as a standard basis $\{\delta_\sigma : \sigma \in \mathcal{S}\}$ in a vector space. Then, any distribution $\mu$ over the states is a linear combination: $\mu = \sum_{\sigma \in \mathcal{S}} \Pr(\sigma)\, \delta_\sigma$. Following Ref. [17], these distributions are called mixed states. For a $k$-state ϵ-machine, the mixed-state space is a $(k-1)$-dimensional simplex $\Delta^{k-1}$, as the distributions $\mu$ are normalized.
Consider a special subset of mixed states. Define $\mu(w)$ as the distribution over $M$’s states induced after observing sequence $w$, $M$ having started with state distribution $\mu_0$:
$\mu(w) = \dfrac{\mu_0 T^{(w)}}{\mu_0 T^{(w)} \mathbf{1}}$,
where $\mathbf{1}$ is a column vector of 1s and $T^{(w)} = T^{(x_0)} T^{(x_1)} \cdots T^{(x_{\ell-1})}$ for $w = x_0 x_1 \cdots x_{\ell-1}$. Here, the notation $X \sim P$ serves to indicate that random variable $X$ is governed by distribution $P$.
The last line gives the mixed state directly in terms of the initial state distribution and $M$’s transition matrices. One interpretation is that $\mu(w)$ represents an observer’s best guess as to the process’ causal-state distribution given that it saw word $w$ and knows both the process’ ϵ-machine and the initial distribution $\mu_0$. Occasionally in the following, it will be noted that $\mu_0$ refers not to the initial distribution but to another.
To determine the set of mixed states allowed by a process, we simply calculate the set of distinct $\mu(w)$ for all words $w \in \mathcal{A}^*$. This is most directly achieved by enumerating $w$ in lexicographic order, e.g., for a binary alphabet, successively choosing $w = \lambda, 0, 1, 00, 01, \ldots$. Here, $\lambda$ is the null word. As we will see, the mixed-state set can be finite or infinite.
If we consider the entire set of mixed states, then we construct a presentation of the process by specifying the transition matrices between them: on symbol $x$, mixed state $\mu(w)$ leads to $\mu(wx)$ with probability $\Pr(x \mid w) = \mu(w)\, T^{(x)} \mathbf{1}$. Note that many words can induce the same mixed state.
It is useful to define a corresponding mixed-state operator that acts on a machine $M$, returning its mixed-state presentation under initial distribution $\mu_0$. The examples to follow shortly illustrate how mixed states and mixed-state presentations are calculated.
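As an illustrative sketch, the update $\mu(w) = \mu_0 T^{(w)} / (\mu_0 T^{(w)} \mathbf{1})$ can be implemented directly; we use the Golden Mean Process ϵ-machine, whose matrix entries and state order [A, B] are our assumptions.

```python
import numpy as np

# Mixed-state update for the Golden Mean epsilon-machine.
T = {'0': np.array([[0.0, 0.5],
                    [0.0, 0.0]]),
     '1': np.array([[0.5, 0.0],
                    [1.0, 0.0]])}

def mixed_state(word, mu0):
    """mu(w) = mu0 T^(w) / (mu0 T^(w) 1); None for disallowed words."""
    v = np.asarray(mu0, dtype=float)
    for x in word:
        v = v @ T[x]
    norm = v.sum()             # this is Pr(w) under start distribution mu0
    return v / norm if norm > 0 else None

mu0 = np.array([2/3, 1/3])     # stationary start distribution
```

For this machine a single symbol synchronizes the observer: seeing 0 yields the point distribution on B, seeing 1 the point distribution on A, and the word 00 has zero probability.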
4. Constructing Hidden Multistationary Processes
Recall that a hidden multistationary nonergodic process is one that evolves, across successive realizations, to statistically distinct long-term behaviors. We now introduce our model of this by giving a construction procedure. This, in effect, defines what we mean by multistationary. We then develop several basic properties and analyze in detail a series of example constructions to illustrate them and their ergodic decompositions.
The main tool used to construct a hidden multistationary process is the mixed-state operator. We show that this results in a canonical presentation of a given set of stationary components. This is the multistationary process’ ϵ-machine.
Definition 1. A hidden multistationary process (HMSP) is defined by the presentation determined via the following procedure:
Identify an indexed family of distinct component stationary ergodic processes $\{\mathcal{P}_i : i \in I\}$, where $I$ is a finite or countable index set. Each $\mathcal{P}_i$ is specified by its ϵ-machine presentation $M_i$. The ϵ-machines consist only of their recurrent states that, due to ergodicity, form a single, strongly connected set.
Specify the component mixture distribution π—the probability $\pi_i$ with which each $\mathcal{P}_i$ will be visited (sampled).
Finally, calculate the mixed-state presentation $M$ of the multistationary process, where we take the nonoverlapping set of the measure semi-groups [25] specified by the component ϵ-machines. In this way, $M$’s states and transitions are determined from the component ϵ-machines and the mixture distribution π.
The HMSP $M$, the result of the construction, determines the transient portion of a nonergodic ϵ-machine. $M$’s recurrent components are essentially the same as those (the $M_i$’s) of the original component stationary processes $\mathcal{P}_i$. That is to say, what is new in $M$ is the set of transient causal states.
Note that this construction is a stochastic analog of building recognizers for multiregular formal languages [26].
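The construction can be sketched programmatically: stack the component ϵ-machines' symbol-labeled matrices block-diagonally, start from the mixture of their stationary distributions, and enumerate the distinct reachable mixed states. The function names, tolerances, and the uniform-vector power iteration are our own choices, not from the text.

```python
import numpy as np
from collections import deque

def stationary(component):
    """Invariant distribution of one component's state-to-state matrix.

    Power iteration from the uniform vector; adequate for these sketches."""
    T = sum(component.values())
    pi = np.ones(T.shape[0]) / T.shape[0]
    for _ in range(500):
        pi = pi @ T
    return pi / pi.sum()

def build_hmsp(components, mixture, max_states=1000, tol=1e-9):
    """components: list of {symbol: matrix}; mixture: component weights."""
    symbols = sorted(set().union(*(c.keys() for c in components)))
    dims = [next(iter(c.values())).shape[0] for c in components]
    n = sum(dims)
    T = {x: np.zeros((n, n)) for x in symbols}
    off = 0
    for c, d in zip(components, dims):          # block-diagonal stacking
        for x, m in c.items():
            T[x][off:off + d, off:off + d] = m
        off += d
    mu0 = np.concatenate(
        [w * stationary(c) for w, c in zip(mixture, components)])
    states, queue = [mu0], deque([mu0])
    while queue and len(states) < max_states:   # breadth-first enumeration
        mu = queue.popleft()
        for x in symbols:
            v = mu @ T[x]
            if v.sum() > tol:                   # word has positive probability
                v = v / v.sum()
                if not any(np.allclose(v, s) for s in states):
                    states.append(v)
                    queue.append(v)
    return T, states
```

The `max_states` cap matters because, as later examples show, the mixed-state set can be countably infinite; in that case the enumeration must be truncated.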
5. The Multistationary ϵ-Machine
With the background and definitions set, we are ready to explore the properties of multistationary nonergodic processes. We first establish the structural properties of their ϵ-machine presentations and then their informational properties via ergodic decompositions of various complexity measures.
Each component $M_i$, considered as generating its own process $\mathcal{P}_i$, has a stationary distribution $\pi^{(i)}$ over its states, $\pi^{(i)} = \pi^{(i)} T_i$. We will also write this as a vector over the multistationary process’ recurrent states, when we have a finite number of components, $\widetilde{\pi}^{(i)} = (0, \ldots, 0, \pi^{(i)}, 0, \ldots, 0)$, where component $i$’s block is the only nonzero one and the blocks are ordered by $i \in I$. The stationary state distribution for the multistationary process generated by $M$ is, then, $\pi_M = \sum_{i \in I} \pi_i\, \widetilde{\pi}^{(i)}$.
Consider the following properties of a multistationary process as just defined. The proofs of these results closely follow those of Theorem 1, Lemma 7, and Theorem 2 of Section 4 in Ref. [27]. One simply groups together the states and transitions of the nonergodic mixture component machines and applies Ref. [27]’s methods to these new (larger) sets of the composite machine. Reference [27]’s proof steps then apply directly to obtain the results claimed. Indeed, many of the results for ϵ-machines in Ref. [27] carry over to the ϵ-machines for multistationary processes.
To help motivate the construction and rationale of multistationary ϵ-machines, the following describes their properties with short outlines of the arguments. The claims themselves, though, are stated here only as conjectures, leaving to a sequel the formal development and proofs. With this noted, for each of the example cases in Section 7 below, the properties can be verified by hand.
Lemma 1 (Stationarity).
The state distribution is stationary.
Proof. This follows from realizing that the recurrent portion of M’s transition matrix is block diagonal. That is, asymptotically the components are independent, and, by assumption, the component distributions are invariant. □
Hypothesis 1 (Unifilarity).
The hidden multistationary process machine M is unifilar.
Remark 1. This would follow by adapting Lemma 5 (ϵ-Machines Are Deterministic) of Ref. [27] to the composite machine’s mixed-state presentation.
Hypothesis 2 (Minimality).
The hidden multistationary process machine M is minimal.
Remark 2. This would follow by adapting Theorem 2 (Causal States Are Minimal) of Ref. [27] to the composite machine’s mixed-state presentation. Recall that the latter is determined from each component’s ϵ-machine, which is minimal.
Hypothesis 3 (Uniqueness).
The hidden multistationary process machine M is unique.
Remark 3. This would follow by adapting Theorem 3 (Causal States Are Unique) of Ref. [27] to the composite machine’s mixed-state presentation.
As noted, the relevant definitions and proofs of these closely follow those given for ϵ-machines generally, as in Ref. [27], and will be the subject of a sequel.
Hypothesis 4. The mixed state operator applied to a mixture of (finite, ergodic) ϵ-machines produces an ϵ-machine. That is, the machine $M$ for the hidden multistationary process so generated is an ϵ-machine.
Remark 4. This would follow from the preceding claims.
Remark: Constructing HMSPs in this way, one could start with other classes of presentation for the ergodic component processes, such as nonunifilar presentations—i.e., generic HMMs. However, the resulting $M$ need not be an ϵ-machine. And, as a consequence, one could not directly calculate from such an $M$ the various complexity measures nor, lacking minimality, draw structural conclusions about its architecture. This is one reason why we choose to specify the component processes using ϵ-machine presentations. Limiting the current construction to ergodic components specified by finite-state ϵ-machine presentations serves to simplify the discussion and highlight our main results.
However, lifting these various restrictions or generalizing the previous properties to address them would be a fruitful effort, giving a much broader characterization of the complexity of multistationary processes.
So, from here on out we assume the ergodic components are given by ϵ-machines and ask what properties hold for the multistationary processes so constructed. We build processes consisting of either a finite or a countably infinite number of components.
6. Ergodic Decompositions
Since we are given the component processes $\mathcal{P}_i$, what can we say about the resulting multistationary process generated by $M$? A first step develops various kinds of ergodic decomposition that attempt to predict $M$’s properties in terms of its ergodic components’ properties. The basic question has a very long history in ergodic and information theories. The reader is referred to the review given in Ref. [28]. Our approach here is, on the one hand, to briefly give a flavor of several ergodic decompositions and, on the other, to compensate for that lack of rigor by analyzing in detail a number of concrete examples.
The word distribution for $\mathcal{P}$ is given by the mixture of the component word distributions. That is, for word $w$, $\Pr(w) = \sum_{i \in I} \pi_i \Pr_i(w)$, where $\Pr_i(w)$ denotes the probability that $\mathcal{P}_i$ generates $w$.
Quantitatively, the HMSP’s block entropy is upper bounded in terms of the component block entropies,
$H(\ell) \leq \sum_{i \in I} \pi_i H_i(\ell) + H[\pi]$,
a bound that follows from Jensen’s inequality [18]; here $H_i(\ell)$ is component $\mathcal{P}_i$’s block entropy.
A more insightful route to this upper bound is developed by first imagining that the sequences generated by the ergodic components do not overlap—for example, the $\mathcal{P}_i$s have disjoint alphabets $\mathcal{A}_i$. Then we define an indicator function $f$ of the process and an associated random variable $Y$: $Y = f(w) = i$, if $w \in \mathcal{A}_i^\ell$. We have
$H(\ell) = H[X_{0:\ell}, Y] = H[X_{0:\ell} \mid Y] + H[Y] = \sum_{i \in I} \pi_i H_i(\ell) + H[\pi]$,
since $Y$ is determined by the observed word. In the general setting, however, the sequences generated by distinct components can overlap. This reduces the number of distinct positive-probability words and so, too, the block entropy. In this way, we see that the above equality is only an upper bound on the HMSP’s block entropy:
$H(\ell) \leq \sum_{i \in I} \pi_i H_i(\ell) + H[\pi]$.
This bound highlights the contribution of the mixture entropy $H[\pi]$. We return to critique this notion of ergodic decomposition later on. For now, we draw out several useful consequences of this line of reasoning, relying on the bound Equation (18). Elsewhere we explore tighter informational bounds on decomposition.
From this, we see that an HMSP’s entropy rate is simply determined by those of its ergodic components. Assuming the mixture entropy $H[\pi]$ is finite, we have
$h_\mu = \lim_{\ell \to \infty} H(\ell)/\ell = \sum_{i \in I} \pi_i h_\mu^{(i)}$,
where $h_\mu^{(i)}$ is the component entropy rate. Reference [28] originally established this decomposition.
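The rate decomposition can be checked numerically. The sketch below (our own example, anticipating the two-biased-coins mixture of Section 7.2.2) mixes $B(p)$ and $B(1-p)$ with fair weights; both components have rate $H(p)$, so the block-entropy differences should approach $H(p)$ from above.

```python
from itertools import product
from math import log2

p = 0.7    # coin biases p and 1-p; the value is our choice

def word_prob(w):
    """Pr(w) for the fair mixture of B(p) and B(1-p)."""
    ones = sum(w)
    zeros = len(w) - ones
    return 0.5 * p**ones * (1 - p)**zeros + 0.5 * (1 - p)**ones * p**zeros

def block_entropy(ell):
    return -sum(q * log2(q)
                for w in product([0, 1], repeat=ell)
                if (q := word_prob(w)) > 0)

Hp = -(p * log2(p) + (1 - p) * log2(1 - p))    # component rate H(p)
h_est = block_entropy(12) - block_entropy(11)  # h_mu(12), converging to Hp
```

The excess of $h_\mu(\ell)$ over $H(p)$ is exactly the observer's per-symbol information gain about which coin is running; it decays to zero as the components are distinguished.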
What is less intuitive, though, are various complexity measures as they apply to HMSPs. As we will see, unlike the entropy rate, which component processes are selected and how they relate to one another play key roles. We first consider the ergodic decomposition for excess entropy, then for the transient information, and finally that for the statistical complexity.
The excess entropy also has an ergodic decomposition. In this case, we have
$\mathbf{E} = \sum_{i \in I} \pi_i \mathbf{E}_i + H[\pi]$,
where $\mathbf{E}_i$ is the excess entropy for ergodic component $i$. The excess entropy decomposition was explored in Refs. [29,30].
Combining the entropy rate and excess entropy ergodic decompositions, we see that the block-entropy linear asymptotes—$\mathbf{E} + h_\mu \ell$—have their own decomposition,
$\mathbf{E} + h_\mu \ell = \sum_{i \in I} \pi_i \left[ \mathbf{E}_i + h_\mu^{(i)} \ell \right] + H[\pi]$.
It is a simple additional step to develop the ergodic decomposition for the transient information,
$\mathbf{T} = \sum_{i \in I} \pi_i \mathbf{T}_i$.
Curiously, like the entropy rate decomposition, the mixture entropy $H[\pi]$ does not play a role.
The statistical complexity also has an ergodic decomposition,
$C_\mu = \sum_{i \in I} \pi_i C_\mu^{(i)} + H[\pi]$,
where $C_\mu^{(i)}$ are the statistical complexities of the ergodic components. The decomposition for statistical complexity was first noted in Ref. [31]. Note that this decomposition does not rely on assuming an equality as in Equation (18).
Finally, the multistationary crypticity $\chi$, which measures how a process hides state information from an observer, is also unaffected by the mixture distribution,
$\chi = \sum_{i \in I} \pi_i \chi_i$,
where $\chi_i$ is the crypticity of component $\mathcal{P}_i$. In this, it is similar to the entropy rate and transient information decompositions.
7. Structural Decompositions—Beyond Statistical
To emphasize, what’s notable in these kinds of informational decomposition is that, for nonergodic ϵ-machines, we have, for example,
$C_\mu > \sum_{i \in I} \pi_i C_\mu^{(i)}$.
That is, the global structural complexity $C_\mu$ of a multistationary process is strictly greater than that contained in its components $C_\mu^{(i)}$. In short, a multistationary process is at least the sum of its parts. Indeed, the above inequality leaves out the entropy of mixing. But this is too facile. As we will see, multistationary processes are much, much more.
We will see below, taking a more structural perspective going beyond the ergodic decompositions, that the transient causal state structure is key to a process’ global organization and what sequences of observations reveal. This leads us to call into question the interpretation and use of the preceding kinds of ergodic decomposition.
We now show that the construction procedure can be used to answer a number of different questions about multistationary processes. Several questions are illustrated via particular examples; others via general constructions. The series of examples is developed incrementally to highlight the methods and particular results, as much in isolation as possible.
We first start with processes built from finite-state ergodic components that lead to a multistationary process that is itself finite-state. Then we analyze the case in which finite components lead to a multistationary process with an infinite number of states. We end with examples built from an infinite number of finitary ergodic processes. In each case, we explore the structure of the resulting multistationary process, its complexity measures, and its ergodic decomposition.
7.1. Finite Hidden Multistationary Processes
7.1.1. A Base Case
A simple but illustrative case is that of two period-1 component processes: all Heads and all Tails, selected with fair probability: $\pi = (1/2, 1/2)$.
The components observed separately have $h_\mu = \mathbf{E} = C_\mu = 0$. But together, $C_\mu = H[\pi] = 1$ bit, since that is the informational uncertainty we have about which component the process is in. Naturally, once in a component there is no uncertainty about the symbols emitted. In this way, we see that the HMSP information of the mixture is all mixing entropy $H[\pi]$.
The composite ϵ-machine consists of three causal states: a single transient start state that immediately transitions (with fair probability) to either the All-Heads recurrent component (single recurrent state) or to the All-Tails recurrent component (also a single recurrent state).
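A quick sketch of this base case (the Heads/Tails string labels are ours): the mixture supports only two words at each length, so every block entropy equals the 1 bit of mixing entropy, giving $h_\mu = 0$ and $\mathbf{E} = C_\mu = H[\pi] = 1$ bit.

```python
from math import log2

# All-Heads and All-Tails mixed with pi = (1/2, 1/2): at each length,
# exactly two words occur, each with probability 1/2.
def block_entropy(ell):
    word_probs = {'H' * ell: 0.5, 'T' * ell: 0.5}
    return -sum(q * log2(q) for q in word_probs.values())

H_vals = [block_entropy(ell) for ell in range(1, 9)]
```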
7.1.2. Period-1 and Period-2 Processes
Define the Period-$n$ Process as the periodic process that endlessly repeats a word of length $n$. Let us construct the simplest multistationary process consisting of the following two components:
Period-1 process $P_1$, which has complexity measures $h_\mu = 0$ bits per symbol, $\mathbf{E} = 0$ bits, $C_\mu = 0$ bits, $\mathbf{T} = 0$ bit-symbols, and $\chi = 0$ bits.
Period-2 process $P_2$, which has complexity measures $h_\mu = 0$ bits per symbol, $\mathbf{E} = 1$ bit, $C_\mu = 1$ bit, $\mathbf{T} = 1$ bit-symbol, and $\chi = 0$ bits.
The Period-1 component has a single recurrent state $A$ and the Period-2 has two recurrent states; label them $B$ and $C$. The second step is to specify the mixture distribution, and we take this to be uniform: $\pi = (1/2, 1/2)$. That is, $\pi_1 = 1/2$ and $\pi_2 = 1/2$. And the final step is to use the mixed-state operator to construct $M$. The resulting multistationary ϵ-machine is shown in Figure 1c.
The recurrent states of the component ϵ-machines show up as $M$’s recurrent states, as claimed. The two recurrent state sets are not connected. What is new are the two transient states (solid circles). As a generator of the multistationary process, $M$ begins in its start state (solid circle with circumscribing circle) and then follows transitions according to the edge probabilities, emitting the corresponding symbols, eventually reaching one or the other of the two recurrent-state sets—$\{A\}$ or $\{B, C\}$.
We can understand $M$’s structure by calculating its mixed states $\mu(w)$, $w \in \{0, 1\}^*$, using Equation (9). In this, on the one hand, $\mu(\lambda) = (1/2, 1/4, 1/4)$ is the start state of the mixed state presentation and its distribution gives the asymptotic invariant distribution over the component recurrent states $A$, $B$, and $C$—the state probabilities before any symbols have been generated.
On the other hand, if a 0 is generated, then we immediately know the process is in component $P_2$, since $P_1$ cannot produce a 0, and, in particular, it is in a specific state, $B$. This is reflected in the transient mixed state $\mu(0) = (0, 1, 0)$. In fact, any time a valid 0 is generated, we know $M$ is in state $B$. This is also seen in the mixed state $\mu(10)$, in which the last symbol generated is a 0 and we again obtain a $\delta$-function distribution concentrated on state $B$.
Now, there are also disallowed transitions and so disallowed words. This is shown in the mixed state for the word $00$, which has zero probability.
More interesting, though, is the transient mixed state $\mu(1) = (2/3, 0, 1/3)$, which indicates that, having seen a 1, we know that $M$ cannot be in state $B$. However, the best we can say is that it is either in state $A$ (the Period-1 component) or in state $C$ (the Period-2 component), with probabilities 2/3 and 1/3. It is not until we see another symbol that we are guaranteed to know with certainty in which component $M$ is. If the next symbol is a 1, then $M$ is in $A$; if it is a 0, then $M$ is in $B$. Since we now know the state with certainty, we say that $11$ and $10$ are synchronizing words. In this case, they are the minimal synchronizing words.
The ergodic decompositions tell us the following:
$h_\mu = 0$ bits per symbol;
$\mathbf{E} = 3/2$ bits;
$C_\mu = 3/2$ bits;
$\mathbf{T} = 1/2$ bit-symbols;
$\chi = 0$ bits.
Let us check these by directly calculating the entropy growth $H(\ell)$ and convergence $h_\mu(\ell)$ for $M$. These are shown in Figure 2.
The entropy growth plot (top) leads to an estimate of $\mathbf{E} = 3/2$ bits, which is predicted by the ergodic decomposition. Both entropy growth and entropy convergence (bottom) show that $h_\mu(\ell) = 0$ after $\ell = 2$. And this too is correctly predicted by the corresponding entropy rate decomposition.
In fact, for lengths longer than the longest period, there are always three distinct sequences—$11\cdots1$, $0101\cdots$, and $1010\cdots$. And so, $H(\ell) = 3/2$ bits. This is roughly consistent with the block entropy plots.
Let us analyze this exactly. One of those sequences is $11\cdots1$ and it occurs with probability 1/2. The two other sequences are $0101\cdots$ and $1010\cdots$ and they are generated equally often by their component. But since that component appears only half the time, they occur in the output sequences with probability 1/4 each. Thus, $H(\ell) = \tfrac{1}{2} \log_2 2 + 2 \cdot \tfrac{1}{4} \log_2 4 = 3/2$ bits. And this is what is seen in the plots.
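This exact count can be reproduced in a few lines (a sketch; we take $P_2$ to repeat the word 10, which only fixes the phase labels).

```python
from math import log2

def p1_words(ell):
    """Word distribution of the Period-1 (all-1s) component."""
    return {'1' * ell: 1.0}

def p2_words(ell):
    """Word distribution of the Period-2 component: two phases, fair."""
    return {('10' * ell)[:ell]: 0.5, ('01' * ell)[:ell]: 0.5}

def mixture_H(ell):
    """Block entropy of the fair mixture of P1 and P2."""
    probs = {}
    for comp, weight in ((p1_words(ell), 0.5), (p2_words(ell), 0.5)):
        for w, q in comp.items():
            probs[w] = probs.get(w, 0.0) + weight * q
    return -sum(q * log2(q) for q in probs.values())
```

For every $\ell \geq 2$ this returns $3/2$ bits, from the three words of probabilities 1/2, 1/4, and 1/4.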
The HMSP’s statistical complexity is
$C_\mu = H[(1/2, 1/4, 1/4)] = 3/2$ bits,
which agrees with the ergodic decomposition.
The ergodic decomposition, however, predicts $\mathbf{T} = 1/2$ bit-symbols, while the entropy growth plot shows that, in fact, $\mathbf{T}$ is larger. So, the ergodic decomposition for $\mathbf{T}$ is incorrect. In short, we see that the ergodic decomposition does not properly account for the state distribution’s relaxation through the transient mixed states (solid circles) in $M$; Figure 1c. That relaxation takes longer than a single step (as the decomposition incorrectly assumes) and that increased relaxation time increases $\mathbf{T}$.
Note that this is one of the simpler examples of the class of processes that have finite transients. Let us consider one that is more complex.
7.1.3. Isomorphic Golden Mean Processes
The No-Repeated-0s Golden Mean Process (GMP) generates all binary sequences except those with consecutive 0s. After a 1 is generated, the next symbol is a 0 or a 1 with fair probability; a 0 is always followed by a 1. The GMP is an order-1 Markov process.
Let $\mathcal{P}_1$ be the No-Repeated-0s GMP, and let $\mathcal{P}_2$ be the No-Repeated-1s GMP. See Figure 3a,b. We define a nonergodic mixture $\mathcal{P} = \{\mathcal{P}_1, \mathcal{P}_2\}$ with mixture distribution $\pi = (p, 1-p)$. The probability of any word $w$ is, then,
$\Pr(w) = p \Pr_1(w) + (1 - p) \Pr_2(w)$.
Using the mixed-state operator, we construct $\mathcal{P}$’s transient and recurrent states using this mixture distribution, finding a finite collection of mixed states. Longer words can only lead to one of these mixed states and so the ϵ-machine is finite. The full multistationary ϵ-machine is shown in Figure 3c, as a function of the mixture parameter $p$. We see that the number of states, including the transients, is finite for all mixture probabilities.
The transition matrices for $\mathcal{P}$’s recurrent causal states are block diagonal in the two components’ matrices. The stationary distribution is defined by the mixture of the two processes,
$\pi_M = p\, \widetilde{\pi}^{(1)} + (1 - p)\, \widetilde{\pi}^{(2)}$,
recalling that each component’s stationary distribution satisfies $\pi^{(i)} = \pi^{(i)} T_i$.
Using methods from Refs. [16,17], the excess entropy for each recurrent component is seen to be
$\mathbf{E}_i = H(1) - h_\mu = H[(2/3, 1/3)] - 2/3 \approx 0.2516$ bits.
By the ergodic decomposition theorem, the excess entropy for the mixture, as a function of $p$, is
$\mathbf{E}(p) = \mathbf{E}_i + H(p)$,
since the two components are isomorphic. For $p = 1/2$, we expect $\mathbf{E} \approx 1.2516$ bits.
Again, the component transient information equals the excess entropy, since the GMP is order-1 Markov. So, the associated ergodic decomposition gives
$\mathbf{T}(p) = \mathbf{T}_i = \mathbf{E}_i$,
since the two components are isomorphic. For $p = 1/2$, we expect $\mathbf{T} \approx 0.2516$ bit-symbols.
Similarly, the statistical complexity of each recurrent component is
$C_\mu^{(i)} = H[(2/3, 1/3)] \approx 0.9183$ bits.
So, from Equation (23) the statistical complexity of the mixture as a function of $p$ is
$C_\mu(p) = C_\mu^{(i)} + H(p)$.
For $p = 1/2$, we expect $C_\mu \approx 1.9183$ bits.
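The component quantities here are a short calculation, sketched below under the fair-transition GMP just defined: stationary state distribution $(2/3, 1/3)$, with branching only at the state reached after a 1.

```python
from math import log2

piA, piB = 2/3, 1/3
h_mu = piA * 1.0                 # 1 bit of choice, made 2/3 of the time
C_mu = -(piA * log2(piA) + piB * log2(piB))   # H[(2/3, 1/3)]
H1 = C_mu                        # the symbol distribution is also (2/3, 1/3)
E = H1 - h_mu                    # order-1 Markov: E = H(1) - h_mu
H_pi = 1.0                       # mixture entropy H(1/2) for p = 1/2

E_mix = E + H_pi                 # predicted mixture excess entropy
C_mix = C_mu + H_pi              # predicted statistical complexity
```

This yields $\mathbf{E}_i \approx 0.2516$ bits, $C_\mu^{(i)} \approx 0.9183$ bits, and the mixture predictions $\mathbf{E} \approx 1.2516$ bits and $C_\mu \approx 1.9183$ bits quoted above.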
Let us check the decompositions by calculating the associated complexity measures from $M$’s entropy growth and convergence. The latter are shown in Figure 4.
The entropy growth plot estimates $\mathbf{E}$ at a value that is low by 2%. And the entropy convergence plot shows that the estimate of $\mathbf{E}$, calculated there using Equation (5) as the area shown, is a bit lower still. These errors are expected, though, due to the slow convergence and the finite number of terms taken in the approximation. Similarly, the ergodic decomposition of the entropy rate $h_\mu = 2/3$ bits per symbol shows up correctly when estimated from $\mathcal{P}$’s entropy growth and convergence. And so, the predictions from the related ergodic decompositions are consistent.
The entropy growth, however, shows the transient information is substantially larger than that predicted from its ergodic decomposition ($\mathbf{T} \approx 0.2516$ bit-symbols). This discrepancy is clearly not due to estimation errors. Rather, as noted above for the P1-P2 mixture, it arises from the decomposition not accounting for the five transient causal states of $M$; see Figure 3c.
Individually, GMPs are subshifts of finite type and finite Markov order. From the cycles in the transient states, we see that as components they make the multistationary process sofic—infinite Markov order. There are subsets of sequences—specifically the alternating sequences $0101\cdots$ and $1010\cdots$, allowed by both components—for which one never synchronizes.
This means that mixtures of finite-order Markov chains, even “linear” mixtures that come from independently running them, are processes that are not finite Markovian. They require hidden Markov representations.
7.2. Infinite State
The preceding examples, chosen to explicitly illustrate methods and as harbingers of coming results, are rather special in that they led to finite-state multistationary processes. We now turn to more typical cases, still constructed from finite-state ergodic components, that lead to a multistationary process with an infinite number of states.
7.2.1. Period-1 and Fair Coin Process
The next example of a multistationary process mixes stochastic and periodic behaviors: We build it out of a period-1 process and a fair coin. In effect, we ask how difficult it is to distinguish these two simple, but extreme processes—one completely predictable, the other completely unpredictable.
For here and a bit later, define the Bernoulli Process $B(p)$, which is a model of a coin flip with bias probability $p$.
The first step, then, is to select the following two stationary components:
Period-1 Process $P_1$: $h_\mu = 0$ bits per symbol, $\mathbf{E} = 0$ bits, $C_\mu = 0$ bits, $\mathbf{T} = 0$ bit-symbols, and $\chi = 0$ bits. See Figure 5a.
Fair Coin Process $B(1/2)$: $h_\mu = 1$ bit per symbol, $\mathbf{E} = 0$ bits, $C_\mu = 0$ bits, $\mathbf{T} = 0$ bit-symbols, and $\chi = 0$ bits. See Figure 5b.
Though at the two extremes of predictability, these are structurally trivial processes—$\mathbf{E} = C_\mu = 0$.
The second step is to select the mixture distribution, which we take to be uniform: $\pi = (1/2, 1/2)$. And the third step is to use the mixed-state operator to construct $M$. Several of the mixed states are
$\mu(1^k) = \left( \dfrac{2^k}{2^k + 1}, \dfrac{1}{2^k + 1} \right)$,
where the first coordinate is the probability of the Period-1 state and the second that of the Fair Coin state. The resulting ϵ-machine is shown in Figure 5c.
In Figure 5c, and so too in the mixed states, we see our first surprising result for multistationary processes. Starting from two structurally trivial processes, the multistationary ϵ-machine has a countable infinity of transient causal states. Why? If, at any point, one sees a 0, then we know the process is in the Fair Coin component, since the other component cannot generate a 0. However, it is only after “seeing” an infinite sequence of 1s that one could determine that the process is in the All-1s component. In short, the effort required to distinguish between these two trivial processes is infinite, and this is directly reflected in the infinite set of transient states.
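The infinite transient structure can be seen in a small calculation. The sketch below, under the section's setup and with a function name of our choosing, tracks an observer's posterior that the hidden component is the All-1s process.

```python
def posterior_all1s(seq):
    """Posterior probability that the running component is the Period-1
    (All-1s) process rather than the Fair Coin, given observations seq
    and a uniform prior over the two components."""
    # All-1s assigns probability 1 to all-1 words and 0 to any other;
    # the Fair Coin assigns 2^-n to every length-n word.
    p_all1s = 1.0 if all(s == 1 for s in seq) else 0.0
    p_fair = 0.5 ** len(seq)
    return p_all1s / (p_all1s + p_fair)

# A single 0 synchronizes immediately to the Fair Coin component:
print(posterior_all1s([1, 1, 0]))  # 0.0
# But after n 1s the posterior is 1 / (1 + 2^-n): it approaches 1 yet
# never reaches it, one distinct mixed state for every n.
for n in (1, 4, 16):
    print(n, posterior_all1s([1] * n))
```

Each run length n yields a different posterior, which is exactly the countable chain of transient causal states described above.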
The ergodic decompositions tell us the following: $h_\mu = \frac{1}{2}(0 + 1) = \frac{1}{2}$ bit per symbol; $\mathbf{E} = H[\pi] = 1$ bit; $C_\mu = H[\pi] = 1$ bit; the predicted transient information (in bit-symbols) likewise carries only the mixture contribution; and $\chi = C_\mu - \mathbf{E} = 0$ bits.
Note that the ergodic decompositions predict that the structural complexity measures are driven solely by the mixture entropy $H[\pi]$. Both components contribute nothing. Let us check these predictions by estimating the quantities from M's entropy growth and convergence, shown in Figure 6.
The entropy growth plot shows that $\mathbf{E} = 1$ bit, as predicted by M's ergodic decomposition. And the entropy convergence plot shows that the excess entropy, calculated as the area shown, is also the same. Similarly, the ergodic decomposition of the entropy rate, $h_\mu = 1/2$ bit per symbol, shows up correctly on the entropy plots.
The entropy growth plot, however, shows the transient information is quite a bit larger than that predicted from its ergodic decomposition.
Also, the informational ergodic decompositions, while indicating a role for the mixture entropy, miss entirely the existence of an infinite number of transient states and the attendant difficulty that confronts an observer trying to detect in which component the process is.
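As a check one can carry out by hand, this mixture's block entropy has a closed form: the word $1^L$ has probability $1/2 + 2^{-(L+1)}$ and each of the other $2^L - 1$ words has probability $2^{-(L+1)}$. The short script below (our construction, following the section's setup) confirms the linear asymptote with slope $1/2$ and intercept 1 bit.

```python
import math

def H_L(L):
    """Exact block entropy H(L), in bits, for the uniform mixture of
    the All-1s process and the Fair Coin process."""
    p_ones = 0.5 + 0.5 * 2.0 ** -L   # probability of the word 1^L
    p_other = 0.5 * 2.0 ** -L        # each of the 2^L - 1 other words
    return -(p_ones * math.log2(p_ones)
             + (2 ** L - 1) * p_other * math.log2(p_other))

# H(L) - L/2 converges to E = 1 bit: slope h_mu = 1/2 bit per symbol
# and y-intercept 1 bit, matching the ergodic decomposition.
for L in (1, 5, 20):
    print(L, round(H_L(L) - 0.5 * L, 4))
```

The slow approach of $H(L) - L/2$ toward 1 bit mirrors the entropy-convergence plot's long tail.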
7.2.2. Two Biased Coins
Slightly increasing the level of sophistication, we now construct a multistationary process out of fully stochastic components: two biased coins of unequal (but symmetric) biases $p$ and $1 - p$. See Figure 7a,b.
We again take a uniform mixture distribution: $\pi = (1/2, 1/2)$. The result of constructing the mixed-state presentation is shown in Figure 8.
The mixed-state presentation reveals two countably infinite chains of transient causal states, one leading to each biased coin's ergodic component. In a simple sense these long transient chains show the mechanism by which one determines the coin biases. Interestingly, though, at any point statistical fluctuations can change the apparent bias and drive the state back up the long chains, heading for the complementary biased coin.
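The back-and-forth motion along the transient chains corresponds to a random walk of the observer's log-odds between the two biases. A minimal simulation follows; the bias 0.7 and the function name are illustrative assumptions, since the section leaves the biases unspecified.

```python
import math
import random

def posterior_trace(seq, p=0.7):
    """Track P(component has bias p | observations so far) for a
    uniform mixture of two coins with symmetric biases p and 1 - p."""
    log_odds = 0.0  # log-likelihood ratio; 0 corresponds to the prior
    trace = []
    for s in seq:
        # Each symbol nudges the log-odds by a bounded step, so a run
        # of fluctuations can always drive the state back up the chain.
        log_odds += math.log(p / (1 - p)) if s == 1 else math.log((1 - p) / p)
        trace.append(1.0 / (1.0 + math.exp(-log_odds)))
    return trace

rng = random.Random(1)
seq = [1 if rng.random() < 0.7 else 0 for _ in range(200)]
trace = posterior_trace(seq)
print(round(trace[-1], 4))  # close to 1: nearly synchronized to bias 0.7
```

The trace's excursions toward and away from 1 are the quantitative counterpart of the mixed states moving down and back up the transient chains.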
Consider, as above, the entropy-rate, excess-entropy, statistical-complexity, transient-information, and crypticity ergodic decompositions. The ergodic decompositions for excess entropy and statistical complexity give similar results; namely, both reduce to the mixture entropy $H[\pi]$, since each biased coin alone has zero excess entropy and zero statistical complexity.
That is, the complexities of the multistationary process are all in the mixture distribution. Even then, the mixture entropy, in this case upper bounded by 1 bit, belies the infinite number of transients and the difficulty of determining in which ergodic component the process is. Quantitatively, it seems another measure of global process complexity and a new decomposition are in order. We return to this shortly, after examining several more kinds of multistationary processes.
Let us validate the ergodic decompositions' predictions against the process estimates of their various measures from M's entropy growth and convergence, shown in Figure 9.
Entropy growth, using the y-intercept method, shows that $\mathbf{E} = 1$ bit, as predicted by M's ergodic decomposition. And the entropy convergence plot shows that the excess entropy, as the area shown, is also the same, though it takes many terms and so exhibits slow convergence. The ergodic decomposition of the entropy rate shows up correctly on the entropy plots.
The entropy growth, however, shows the transient information is substantially larger than that predicted from its ergodic decomposition. Again, the mixing entropy fails to account for the dominating transient causal state structure.
7.2.3. Pair of Isomorphic Even Processes
The Even Process (EP) generates all binary sequences in which 1s occur in blocks of even length bounded by 0s. Once a 0 is seen, a 0 or a 1 is generated with fair probability. The EP is closely related to the Golden Mean Process: they have the same entropy rates and statistical complexities. The important difference, despite the close similarity and a simple relabeling of transitions, is that the EP is described by no finite-order Markov chain. It is of infinite Markov order, though finite state.
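The EP's infinite Markov order can be checked with the standard forward (state-filtering) computation over its two-state presentation. The sketch below is ours: state A (reached after a 0) emits 0 or 1 with fair probability, and state B completes each pair of 1s.

```python
def even_process_next(history):
    """Forward-algorithm prediction P(next = 1 | history) for the Even
    Process: recurrent state A emits 0 or 1 with probability 1/2 (a 1
    moves to B); state B emits 1 with certainty, moving back to A."""
    a, b = 2 / 3, 1 / 3              # stationary distribution over (A, B)
    for s in history:
        if s == 0:
            a, b = a * 0.5, 0.0      # only A can emit a 0
        else:
            a, b = b * 1.0, a * 0.5  # B --1--> A and A --1--> B
        z = a + b
        a, b = a / z, b / z          # renormalize the state estimate
    return a * 0.5 + b * 1.0

# After a 0, the parity of the subsequent run of 1s fixes the prediction,
# no matter how long ago that 0 occurred: infinite Markov order.
print(even_process_next([0, 1, 1]))     # 0.5: even run, back in state A
print(even_process_next([0, 1, 1, 1]))  # 1.0: odd run, the pair must close
```

Because the prediction depends on a parity carried arbitrarily far from the last 0, no finite Markov order can reproduce it.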
To construct a multistationary process, the first step is to select two stationary components. One component will be an EP with an even number of 0s (Figure 10b) and the other an EP with an even number of 1s (Figure 10a). The second step is to choose the mixture distribution.
Finally, Figure 11 shows the ϵ-machine for the HMSP. The ϵ-machine displayed is estimated only up to words of length 8 and the transitions are set to give a well-formed ϵ-machine at this approximation.
There are several observations. First, the HMSP ϵ-machine is symmetric under 1-0 exchange, as it should be given this symmetry in the ergodic components. Second, and less obviously, there is an infinite number of transient causal states. This is due to the outside paths along the all-1s and all-0s sequences. These two sequences arise from the 2-cycles in the respective ergodic components' recurrent states: pairs of 1s in one component never synchronize; ditto for pairs of 0s in the other. And so, in M there are infinitely long sequences that never reach the recurrent states.
Third, the HMSP is of infinite Markov order. To see this, note that there are six cycles in the transient states; these cycles are the signature of infinite Markov order or, what is called, “soficity”. The HMSP is a shift of infinite type [32]. In particular, there is a two-cycle between states 32 and 38 and one between states 37 and 41. There are two four-cycles, between states 10, 18, 24, and 26 and between states 24, 29, 34, and 36; and two more, between states 11, 19, 25, and 23 and between states 25, 30, 35, and 33.
The ergodic decompositions again give predictions for the entropy rate ($h_\mu = 2/3$ bit per symbol, shared by both EP components), the excess entropy (bits), the statistical complexity (bits), the transient information (bit-symbols), and the crypticity (bits).
Let us check the decompositions by comparing their predictions to estimates from M's entropy growth and convergence. These functions are shown in Figure 12.
The entropy growth plot gives an estimate of $\mathbf{E}$ that disagrees by about 4% with the prediction from the ergodic decomposition. And the entropy convergence plot shows the excess entropy, as the area shown, is also the same. Given the slow convergence, the finite number of terms taken in the numerical approximation, and the finite number of transient states taken in the approximation of M, this error is not surprising. Similarly, the ergodic decomposition of the entropy rate shows up correctly on the entropy plots.
Entropy growth, however, shows the transient information is three times larger than that predicted from its ergodic decomposition. Again, this discrepancy follows from the mixture entropy missing the contributions from the (infinite) number of transient causal states.
7.3. Infinite Components
We end our selection of example multistationary processes by constructing several from an infinite number of finitary ergodic components.
7.3.1. Handbag of Necklaces
Fourier analysis of a signal assumes the generating process consists of (at most) periodic sequences. As an analog of this assumption in the present setting, consider the Handbag of Necklaces HMSP consisting of ergodic-component stationary processes, one for each period $p = 1, 2, 3, \ldots$. That is, if we assume a binary process, the emitted sequences consist of the period-1, period-2, period-3, period-4, and longer periodic words. The HMSP ϵ-machine is shown in Figure 13.
Note that there is an infinite number of transient causal states. Overall the HMSP is a highly symmetric structure dominated by the transient states. From this one can readily read off how to synchronize, that is, how to know in which ergodic component the process is. For example, to synchronize to component $i$, there are exactly $i$ paths.
Now, consider the mixture measure for the components. The state probabilities then follow, up to a normalization constant. Note this is the presentation's stationary invariant distribution.
There is some flexibility in setting the mixture distribution $\pi$; there are several criteria for choosing it over a countable number of states.
Consider the structure of the transitions in the HMSP's first row of states. The first transition probability for seeing an a follows from the probability of seeing an a in each component, and the probability for the succeeding transition emitting an a follows similarly. These transitions determine those leaving the top row of states on a b.
Now, consider the second-to-top row of transitions. These follow in the same way, and there is a second path controlled by its own transition probability. That is, the appearance of a and b in each component occurs due to the same conditions.
The ergodic decompositions tell us the following: $h_\mu = 0$ bit per symbol, as each component is periodic. The excess-entropy ergodic decomposition for a process with $p$ periodic component processes with periods $p_1, \ldots, p_p$ is $\mathbf{E} = H[\pi] + \sum_{i=1}^{p} \pi_i \log_2 p_i = \log_2 p + \frac{1}{p} \sum_{i=1}^{p} \log_2 p_i$, where the second step follows assuming $\pi$ is uniform. $C_\mu = \mathbf{E}$. The transient information (in bit-symbols) follows from the same mixture contributions. And $\chi = C_\mu - \mathbf{E} = 0$: This is a bit surprising: no crypticity, no hidden information.
These are consistent with directly calculating the entropy growth for M, as shown in Figure 14.
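The excess-entropy decomposition above reduces to a one-line computation. This sketch assumes, as stated, a uniform mixture and $\mathbf{E} = \log_2 p_i$ for a period-$p_i$ component; the function name is ours.

```python
import math

def excess_entropy_necklaces(periods):
    """E = H[pi] + sum_i pi_i * log2(period_i) for a uniform mixture of
    periodic processes, using E = log2(p) for a period-p component."""
    p = len(periods)
    H_pi = math.log2(p)  # entropy of the uniform mixture distribution
    return H_pi + sum(math.log2(pi) for pi in periods) / p

# Periods 1 through 4, uniformly mixed:
print(round(excess_entropy_necklaces([1, 2, 3, 4]), 4))
```

Note how the mixture-entropy term $\log_2 p$ dominates as more necklaces are added, consistent with the complexity residing in the mixture rather than in any single component.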
7.3.2. The Purse Process
The example of two biased coins suggests extending to an infinite number of biased coins in a purse—a bag of coins with different biases. As hinted at in the two-coin case, all of the (infinite) complexity is in the mixture and none comes from the components.
Moreover, we can choose the mixture distribution $\pi$ such that $H[\pi]$ is finite or infinite. Thus, the Purse Process is an extreme example in which infinite complexity comes from zero-complexity components. There is probably no simpler way to say that a multistationary process is way more than the sum of its (zero-complexity) parts.
To obtain a brief sense of the Purse Process, consider an HMSP consisting of three coins of unequal bias and compare this to the case of two coins of Section 7.2.2. Figure 15 shows the HMSP for two completely biased coins and one fair coin. Its basic features were already encountered above. And it suggests a notable generalization to which we now turn.
7.3.3. Mother of All Processes
Finally, consider upping the complexity ante substantially. This generalization is to an HMSP consisting of a mixture of all processes. Let us step through its construction.
First, recall that every stationary process has a unique ϵ-machine presentation. That is, ϵ-machines and stationary processes are in a 1-to-1 correspondence. Second, an efficient algorithm exists to list all ϵ-machines by the number of recurrent causal states. Reference [34] shows how to systematically enumerate the ϵ-machine process library for k-state ϵ-machines. See Table I there for the list of binary-alphabet topological ϵ-machines. There are 1,117,768,214 such 8-state ϵ-machines.
In the current construction, consider only topological ϵ-machines for which any branching transitions are taken with fair probability. We refer to each process by its ϵ-machine's enumeration number; we call this the process' Gödel number.
Second, define the Process Urn (PU) as containing the entire library of ϵ-machines. That is, we imagine an HMSP that is the result of reaching into the urn, selecting one ϵ-machine, and having it generate a full realization. The repeatedly sampled PU is an HMSP: The Mother of All Processes. It is certainly one of the most nonergodic processes one could work with.
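To make the urn concrete, here is a toy sketch. The three-entry stand-in library is our illustrative choice, not the enumerated library of [34]; each trial draws one component, which then generates the entire realization.

```python
import random

# Toy stand-ins for library entries: (name, step function), where a
# step maps (state, rng) to (emitted symbol, next state).
LIBRARY = [
    ("all-ones",   lambda s, rng: (1, s)),
    ("fair-coin",  lambda s, rng: (rng.randint(0, 1), s)),
    ("alternator", lambda s, rng: (s, 1 - s)),
]

def sample_process_urn(n, rng):
    """One Process Urn trial: draw a component uniformly at random,
    then let it alone generate a length-n realization."""
    name, step = rng.choice(LIBRARY)
    state, out = 0, []
    for _ in range(n):
        symbol, state = step(state, rng)
        out.append(symbol)
    return name, out

rng = random.Random(7)
for _ in range(3):
    print(sample_process_urn(8, rng))
```

The crucial nonergodic feature is that the component draw happens once per realization, not once per symbol; time averages within a trial therefore need not match averages over the urn.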
Definition 2. The Mother of All Processes is the HMSP built over the entire ϵ-machine process library, with π being a chosen mixture distribution.
To simplify, let us examine the HMSP whose components are all one-state and all two-state ϵ-machines. There are now 10 components: three 1-state components and seven 2-state components.
There are 17 recurrent causal states altogether across the ergodic components. However, the many hundreds of mixed states are no longer usefully presented in a state-transition diagram, as illustrated up to this point. Instead, we plot the mixed states themselves as dots in the state simplex. This is shown in Figure 16. This is a 2D projection in which the recurrent states are the vertices of the simplex and so appear on its periphery. The start state, with uniform probability across the components and not across the recurrent states, is not in the simplex center.
One notes the concentration of mixed states near the simplex's vertices, indicating close approaches to synchronization.
There are a number of notable properties, including the following:
The simplex vertices correspond to recurrent causal states.
There is an uncountably infinite number of transient states. These fill out a complicated fractal measure within the simplex.
All mixed states that are not on vertices are transient states.
There are a number of open questions, including the following: While it is clear that M is not exactly synchronizable [35], as it contains infinite Markov-order components, is it asymptotically synchronizable [36]? What about the synchronizability of approximations to it?
8. Discussion
In addition to their particular application, the ergodic decompositions give important insight into basic questions about what structural complexity is and how to measure it. A number of previous efforts that address these definitional issues consider it a key property that complexity be additive over the components of a system [38]. This is often motivated by a parallel with Boltzmann entropy in thermodynamics. And, for that matter, additivity was also posited as an axiom by Shannon for his measure of surprise [39].
However, the ergodic decompositions here show that the manner in which a system’s components relate to one another—specifically, the mixture distribution—plays a central role in the process’ organization and contributes to quantitative measures of global complexity. The foregoing offered a different, more structural view that goes beyond the ergodic decompositions and statistical mixtures. Constructively, the transient state structure is key to a multistationary process’ global organization and what observations can or cannot reveal.
The lessons here also suggest skepticism in applying the ergodic decompositions of
Section 6. One reason is that underlying them is the assumption of an IID sampling of components, which is not generally valid. Another is that they completely ignore how the internal structures of the components interrelate with each other. And, as shown, this brings out wholly new properties that are not part of any given component nor their sum nor their IID mixture. Indeed, the mixture entropy does not capture this, except in the most limited of cases.
Constructive responses to this will address the new kind of hierarchical structure explicitly represented by the multistationary process' ϵ-machine transient causal states and their complicated measure in the mixed-state simplex. Quantitatively, in contrast to the block entropy, entropy rate, and excess entropy, we demonstrated that the transient information is sensitive to this new kind of complexity in structural mixtures. It is this additional structure that makes the organization of multistationary processes way more than the sum of their parts. As a complementary metric, adapting the statistical complexity dimension suggests itself [37].
9. Conclusions
Let us close by exploring several wider implications for thermodynamics, on the one hand, and various attempts to introduce “universal modeling” schemes on the other.
First, we started out highlighting the colloquialism, made familiar by the social movements of the 1960s, that a system is more than the sum of its parts. Presumably, the social reaction then reflected an increasing awareness of the impact of technical systems humans were creating. The preceding development explored in which senses this could be true for truly complex systems—ones consisting of many structured components—more akin to the social subsystems than mere atoms. And the various informational ergodic decompositions bolstered the popular understanding.
However, in emphasizing structure and analyzing the concrete process class of hidden multistationary processes, it became abundantly clear—through all of the examples presented—that composite or heterogeneous (to use Gibbs' word [40]) systems are far more than the sum of their components. Specifically, beyond a mere entropic, missing contribution from increased disorder that arises from the random selection of components, composite systems are markedly more complex. And they are more structured according to the relative interplay of the components' internal organization. It is that interplay that drives the explosive complexity of multicomponent systems.
On this score, the history of composite systems is perhaps a bit confusing, especially as they arose in the early foundations of thermodynamics. There is, for example, Gibbs' seemingly contradictory statement, as quoted by Jaynes (p. 13 in [1]), that “The whole is simpler than the sum of its parts”. The ergodic decompositions seemed to say the opposite. However, there is not really a confusion here. First, Gibbs was thinking of the correlations that would emerge between system components when coupled together. Here, we intentionally did not couple the components. Sequels address this. Second, at root, the issue turns on an ambiguous vocabulary for describing randomness and structure. Here, at least, by distinguishing between “randomness” in terms of Shannon's notion of the flatness of a probability distribution and “structure” in terms of statistical complexity, we shed some light on these important and still evolving issues.
Second, the HMSP construction procedure here gives a rather direct picture of one kind of hierarchical organization in how a stochastic process can be built from other processes. The constructive procedure uses the mixed-state presentation. And this generates a new kind of hierarchy that emerges due to the diverse combinatorial relationships between the components' internal organizations. Other related hierarchies can be similarly constructed, such as when using generalized hidden Markov models [41] as ergodic components.
Third and finally, modern statistical inference has been treated to a number of formalizations of general learning that make minimal assumptions. Consider, for example, the following:
Universal Priors [42,43,44]: In the computation-theoretic approach to modeling and statistical inference, there are attempts to define a most-general prior over model space. However, these raise very natural questions: What kind of process would generate such a prior? Moreover, what kinds of difficulties are there in detecting processes drawn according to such a prior?
No Free Lunch Theorem [45]: This framing makes a number of implicit assumptions about the measure on the Process Urn simplex. Does the theorem hold? Not when you consider structure.
Probably Approximately Correct Learning [46]: This “distribution-free” approach is a bold attempt within machine learning to identify the computational nature of evolution and learning. However, is this not the same thing as assuming any process is possible? If so, then it is analogous to assuming the Mother of All Processes. That is, rather than being “distribution-free”, the assumption underlying PAC learning is “distribution-full”.
In light of these, The Mother of All Processes suggests a construction for such assumption-free or minimal-assumption modeling. One samples from the space of all processes and exploits the ϵ-machine representation to be specific about probability, on the one hand, and structure, on the other. The preceding development demonstrated that the transient-state structure makes explicit the challenges in detecting component processes and that this is captured informationally via the transient information.