Informational and Causal Architecture of Discrete-Time Renewal Processes

Renewal processes are broadly used to model stochastic behavior consisting of isolated events separated by periods of quiescence, whose durations are specified by a given probability law. Here, we identify the minimal sufficient statistic for their prediction (the set of causal states), calculate the historical memory capacity required to store those states (statistical complexity), delineate what information is predictable (excess entropy), and decompose the entropy of a single measurement into that shared with the past, future, or both. The causal-state equivalence relation defines a new subclass of renewal processes with a finite number of causal states despite having an unbounded interevent count distribution. We use these formulae to analyze the output of the parametrized Simple Nonunifilar Source, a process generated by a simple two-state hidden Markov model that nonetheless has an infinite-state epsilon-machine presentation. All in all, the results lay the groundwork for analyzing processes with infinite statistical complexity and infinite excess entropy.


I. INTRODUCTION
Stationary renewal processes are widely used, analytically tractable, compact models of an important class of point processes [1][2][3][4]. Realizations consist of sequences of events (e.g., neuronal spikes or earthquakes) separated by epochs of quiescence, the lengths of which are drawn independently from the same interevent distribution. Renewal processes on their own have a long history and, since they offer a parsimonious mechanism, are often implicated in highly complex behavior [5][6][7][8][9][10]. Additionally, understanding more complicated processes [11][12][13][14] requires fully analyzing renewal processes and their generalizations.
As done here and elsewhere [15], analyzing them in depth from a structural information viewpoint yields new statistical signatures of apparent high complexity: long-range statistical dependence, memory, and internal structure. To that end, we derive the causal-state minimal sufficient statistics (the ε-machine) for renewal processes and then derive new formulae for their various information measures in terms of the interevent count distribution. The result is a thorough-going analysis of their information architecture, a shorthand referring to a collection of measures that together quantify key process properties: predictability, difficulty of prediction, inherent randomness, memory, Markovity, and the like. The measures include:

- the statistical complexity C_µ, which quantifies the historical memory that must be stored in order to predict a process's future;
- the entropy rate h_µ, which quantifies a process's inherent randomness as the uncertainty in the next observation even given that we can predict as well as possible;
- the excess entropy E, which quantifies how much of a process's future is predictable, as the mutual information between its past and future;
- the bound information b_µ, which identifies the portion of the inherent randomness (h_µ) that affects a process's future, as the information in the next observation shared with the future above and beyond that of the entire past; and
- the elusive information σ_µ, which quantifies a process's deviation from Markovity as the mutual information between the past and future conditioned on the present.

Analyzing a process in this way gives a more detailed understanding of its structure and stochasticity.
Beyond this, these information measures are key to finding limits on a process's optimal lossy predictive features [16][17][18][19], designing action policies for intelligent autonomous agents [20], and quantifying whether or not a given process has one or another kind of infinite memory [21][22][23].

FIG. 1. The role of maximally predictive (prescient) models: Estimating information measures directly from trajectory data encounters a curse of dimensionality or, in other words, severe undersampling. Instead, one can calculate information measures in closed form from (inferred) maximally predictive models [24]. Alternate generative models that are not maximally predictive cannot be used directly, as Blackwell showed in the 1950s [25].
While it is certainly possible to numerically estimate information measures directly from trajectory data, statistical methods generally encounter a curse of dimensionality when a renewal process has long-range temporal correlations since the number of typical trajectories grows exponentially (at entropy rate h µ ). Alternatively, we gain substantial advantages by first building a maximally predictive model of a process (e.g., using Bayesian inference [26]) and then using that model to calculate information measures (e.g., using recently available closed-form expressions when the model is finite [24]). Mathematicians have known for over a half century [25] that alternative models that are not maximally predictive are inadequate for such calculations. Thus, maximally predictive models are critical. Figure 1 depicts the overall procedure just outlined, highlighting their important role. Here, extending the benefits of this procedure, we determine formulae for the information measures mentioned above and the appropriate model structures for a class of processes that require countably infinite models-the ubiquitous renewal processes.
Our development requires familiarity with computational mechanics [27]. Readers uninterested in its methods, but who wish to use the results, can skip to Figs. 3-6 and Table I. A pedagogical example is provided in Sec. V. Two sequels will use the results to examine the limit of infinitesimal time resolution for information in neural spike trains [28] and the conditions under which renewal processes have infinite excess entropy [29].
The development is organized as follows. Section II provides a quick introduction to computational mechanics and prediction-related information measures of stationary time series. Section III identifies the causal states (in both forward and reverse time), the statistical complexity, and the ε-machine of discrete-time stationary renewal processes. Section IV calculates the information architecture and predictable information of a discrete-time stationary renewal process. Section V calculates these information-theoretic measures explicitly for the parametrized Simple Nonunifilar Source, a simple two-state nonunifilar Hidden Markov Model with a countable infinity of causal states. Finally, Sec. VI summarizes the results and lessons, giving a view to future directions and mathematical and empirical challenges.

II. BACKGROUND
We first describe renewal processes, then introduce a small piece of information theory, review the definition of process structure, and finally recall several information-theoretic measures designed to capture organization in structured processes.

A. Renewal Processes
We are interested in a system's immanent, possibly emergent, properties. To this end we focus on behaviors and not, for example, particular equations of motion or particular forms of stochastic differential or difference equation. The latter are important in applications because they are generators of behavior, as we will see in a later section. As Fig. 1 explains, for a given process, some of its generators facilitate calculating key properties. Others lead to complicated calculations and others still cannot be used at all.
As a result, our main object of study is a process P: the list of all of a system's behaviors or realizations {. . . , x −2 , x −1 , x 0 , x 1 , . . .} as specified by their measure µ(. . . , X −2 , X −1 , X 0 , X 1 , . . .). We denote a contiguous chain of random variables as X_{0:L} = X_0 X_1 · · · X_{L−1}. Left indices are inclusive; right, exclusive. We suppress indices that are infinite. In this setting, the present X_0 is the random variable measured at t = 0, the past is the chain X_{:0} = . . . X_{−2} X_{−1} leading up to the present, and the future is the chain following the present, X_{1:} = X_1 X_2 · · · . The joint probabilities Pr(X_{0:N}) of sequences are determined by the measure of the corresponding cylinder sets: Pr(X_{0:N} = x_0 x_1 . . . x_{N−1}) = µ(. . . , x_0 , x_1 , . . . , x_{N−1} , . . .). Finally, we assume a process is ergodic and stationary—Pr(X_{0:L}) = Pr(X_{t:L+t}) for all t ∈ Z—and the observation values x_t range over a finite alphabet: x ∈ A. In short, we work with hidden Markov processes [30].
Discrete-time stationary renewal processes here have binary observation alphabets A = {0, 1}. Observation of the binary symbol 1 is called an event. The interevent count is the number of 0's between successive 1s. Counts n are i.i.d. random variables drawn from an interevent distribution F(n), n ≥ 0. We restrict ourselves to persistent renewal processes, for which the probability distribution function is normalized: Σ_{n=0}^∞ F(n) = 1. This translates into the processes being ergodic and stationary. We also define the survival function w(n) = Σ_{n'=n}^∞ F(n'), and the expected interevent count is given by µ = Σ_{n=0}^∞ n F(n). We assume also that µ < ∞. It is straightforward to check that Σ_{n=0}^∞ w(n) = µ + 1. Note the dual use of µ. On the one hand, it denotes the measure over sequences and, since it determines probabilities, it appears in names for informational quantities. On the other, it is commonplace in renewal process theory to denote mean rates. Fortunately, context easily distinguishes the meaning through the very different uses.
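To make these definitions concrete, here is a minimal numerical sketch. It assumes a geometric interevent distribution (truncated at a large n for numerics), builds the survival function, and checks the identity Σ_n w(n) = µ + 1:

```python
import numpy as np

# Hypothetical example: a geometric interevent distribution
# F(n) = (1 - lam) * lam**n, truncated at a large n for numerics.
lam = 0.6
N = 500  # truncation point; tail mass beyond this is negligible
n = np.arange(N)
F = (1 - lam) * lam**n

# Survival function: w(n) = sum over n' >= n of F(n')
w = np.cumsum(F[::-1])[::-1]

# Expected interevent count: mu = sum over n of n * F(n)
mu = np.sum(n * F)

# The identity sum_n w(n) = mu + 1 from the text
print(np.isclose(w.sum(), mu + 1))
```

The reversed-cumsum trick computes all tail sums w(n) in one pass; the identity holds because Σ_n w(n) counts each F(n') exactly n' + 1 times.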

B. Process Unpredictability
The information or uncertainty in a process is often defined as the Shannon entropy H[X_0] of a single symbol X_0 [31]:

H[X_0] = − Σ_{x∈A} Pr(X_0 = x) log₂ Pr(X_0 = x) .     (1)

However, since we are interested in general complex processes—those with arbitrary dependence structure—we employ the block entropy to monitor information in long sequences:

H(L) = − Σ_{w∈A^L} Pr(X_{0:L} = w) log₂ Pr(X_{0:L} = w) .
To measure a process's asymptotic per-symbol uncertainty one then uses the Shannon entropy rate:

h_µ = lim_{L→∞} H(L)/L ,

when the limit exists. (Here and elsewhere, µ reminds us that information quantities depend on the process's measure µ over sequences.) h_µ quantifies the rate at which a stochastic process generates information. Using standard informational identities, one sees that the entropy rate is also given by the conditional entropy:

h_µ = H[X_0 | X_{:0}] .     (2)

This form makes transparent its interpretation as the residual uncertainty in a measurement given the infinite past. As such, it is often employed as a measure of a process's degree of unpredictability.
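For intuition, the convergence of block-entropy differences to h_µ can be seen in a small finite-state example. The sketch below uses a hypothetical two-state "golden mean" process (no two consecutive 1s; not one of this paper's examples), computing H(L) exactly from labeled transition matrices:

```python
import numpy as np
from itertools import product

# Golden mean process: after a 0 the next symbol is fair; after a 1 it is 0.
# T[x][i, j] = Pr(next state j, symbol x | state i); state 0 = "after 0".
T = {0: np.array([[0.5, 0.0], [1.0, 0.0]]),
     1: np.array([[0.0, 0.5], [0.0, 0.0]])}
pi = np.array([2/3, 1/3])  # stationary state distribution

def block_entropy(L):
    """Exact H(L) by summing -p log2 p over all length-L words."""
    H = 0.0
    for word in product([0, 1], repeat=L):
        v = pi.copy()
        for x in word:
            v = v @ T[x]          # propagate the state distribution
        p = v.sum()               # word probability
        if p > 0:
            H -= p * np.log2(p)
    return H

# H(L) - H(L-1) converges to the entropy rate, here h_mu = 2/3 bit/symbol.
print(round(block_entropy(10) - block_entropy(9), 4))
```

Because this process is Markov of order one, the difference H(L) − H(L−1) equals h_µ already for small L; for long-range processes the convergence can be very slow, which is the undersampling issue discussed below.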

C. Maximally Predictive Models
Forward-time causal states S⁺ are minimal sufficient statistics for predicting a process's future [32,33]. This follows from their definition: a causal state σ⁺ ∈ S⁺ is a set of pasts grouped by the equivalence relation ∼⁺:

x_{:t} ∼⁺ x'_{:t} ⟺ Pr(X_{t:} | X_{:t} = x_{:t}) = Pr(X_{t:} | X_{:t} = x'_{:t}) .

So, S⁺ is a set of classes—a coarse-graining of the uncountably infinite set of all pasts. At time t, we have the random variable S⁺_t that takes values σ⁺ ∈ S⁺ and describes the causal-state process . . . , S⁺_{−1}, S⁺_0, S⁺_1, . . .. S⁺_t is a partition of pasts X_{:t} that, according to the indexing convention, does not include the present observation X_t. In addition to the set of pasts leading to it, a causal state σ⁺_t has an associated future morph: the conditional measure µ(X_{t:} | σ⁺_t) of futures that can be generated from it. Moreover, each state σ⁺_t inherits a probability π(σ⁺_t) from the process's measure over pasts µ(X_{:t}). The forward-time statistical complexity is defined as the Shannon entropy of the probability distribution over forward-time causal states [32]:

C⁺_µ = H[S⁺] .

A generative model is constructed out of the causal states by endowing the causal-state process with transitions:

T^{(x)}_{σσ'} = Pr(S⁺_{t+1} = σ', X_t = x | S⁺_t = σ) ,

which give the probability of generating the next symbol x and ending in the next state σ', if starting in state σ.
(Residing in a state and generating a symbol do not occur simultaneously. Since symbols are generated during transitions there is, in effect, a half time-step difference in the indices of the random variables X_t and S⁺_t. We suppress notating this.) To summarize, a process's forward-time ε-machine is the tuple {A, S⁺, {T^{(x)} : x ∈ A}}. For a discrete-time, discrete-alphabet process, the ε-machine is its minimal unifilar Hidden Markov Model (HMM) [32,33]. (For general background on HMMs see [34][35][36].) Note that the causal-state set can be finite, countable, or uncountable; the latter two cases can occur even for processes generated by finite-state HMMs. Minimality can be defined either by the smallest number of states or by the smallest entropy over states [33]. Unifilarity is a constraint on the transition matrices T^{(x)} such that the next state σ' is determined by knowing the current state σ and the next symbol x. That is, if the transition exists, then Pr(S⁺_{t+1} | X_t = x, S⁺_t = σ) has support on a single causal state.
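Unifilarity is easy to check mechanically: for each symbol x, every row of T^{(x)} may have at most one nonzero entry. A small sketch, with made-up transition matrices purely for illustration:

```python
import numpy as np

def is_unifilar(T):
    """Check unifilarity: for every state and symbol with an outgoing
    transition, the row T[x][state, :] has support on a single next state."""
    n_states = next(iter(T.values())).shape[0]
    return all(np.count_nonzero(T[x][s]) <= 1
               for x in T for s in range(n_states))

# Hypothetical labeled transition matrices, T[x][i, j] = Pr(j, x | i).
# A unifilar example: each (state, symbol) pair fixes the next state.
T_uni = {0: np.array([[0.7, 0.0], [0.0, 0.4]]),
         1: np.array([[0.0, 0.3], [0.6, 0.0]])}
# A nonunifilar example: symbol 0 from state 0 can lead to either state.
T_non = {0: np.array([[0.5, 0.2], [0.0, 0.4]]),
         1: np.array([[0.0, 0.3], [0.6, 0.0]])}

print(is_unifilar(T_uni), is_unifilar(T_non))  # True False
```

This is exactly the property the SNS of Sec. III violates, which is why its two-state generator is not an ε-machine.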
While the ε-machine is a process's minimal, maximally predictive model, there can be alternative HMMs that are as predictive but are not minimal. We refer to this maximally predictive property by calling the ε-machine and these alternatives prescient. The state and transition structure of a prescient model allow one to immediately calculate the entropy rate h_µ, for example. More generally, any statistic that gives the same (optimal) level of predictability we call a prescient statistic.
A similar equivalence relation can be applied to find minimal sufficient statistics for retrodiction [37]. Futures are grouped together if they have equivalent conditional probability distributions over pasts:

x_{0:} ∼⁻ x'_{0:} ⟺ Pr(X_{:0} | X_{0:} = x_{0:}) = Pr(X_{:0} | X_{0:} = x'_{0:}) .

A cluster of futures—a reverse-time causal state—defined by ∼⁻ is denoted σ⁻ ∈ S⁻. Again, each σ⁻ inherits a probability π(σ⁻) from the measure over futures µ(X_{0:}). And, the reverse-time statistical complexity is the Shannon entropy of the probability distribution over reverse-time causal states:

C⁻_µ = H[S⁻] .

In general, the forward- and reverse-time statistical complexities are not equal [37,38]. That is, different amounts of information must be stored from the past (future) to predict (retrodict). Their difference Ξ = C⁺_µ − C⁻_µ is a process's causal irreversibility, and it reflects this statistical asymmetry.
Since we work with stationary processes in the following, the time origin is arbitrary, and so we drop the time index t when it is unnecessary.

D. Information Measures for Processes
Shannon's various information quantities—entropy, conditional entropy, mutual information, and the like—when applied to time series are functions of the joint distributions Pr(X_{0:L}). Importantly, they define an algebra of information measures for a given set of random variables [39]. Reference [40] used this to show that the past and future partition the single-measurement entropy H[X_0] into several distinct measure-theoretic atoms. These include the ephemeral information:

r_µ = H[X_0 | X_{:0}, X_{1:}] ,

which measures the uncertainty of the present knowing the past and future; the bound information:

b_µ = I[X_0 ; X_{1:} | X_{:0}] ,

which is the mutual information shared between present and future conditioned on the past; and the enigmatic information:

q_µ = I[X_{:0} ; X_0 ; X_{1:}] ,

which is the three-way mutual information between past, present, and future. Multi-way mutual informations are sometimes referred to as co-informations [40,41] and, compared to Shannon entropies and two-way mutual informations, can have counterintuitive properties, such as being negative. For a stationary time series, the bound information is also the shared information between present and past conditioned on the future:

b_µ = I[X_0 ; X_{:0} | X_{1:}] .

One can also consider the amount of predictable information not captured by the present:

σ_µ = I[X_{:0} ; X_{1:} | X_0] .

This is called the elusive information [42]. It measures the amount of past-future correlation not contained in the present. It is nonzero if the process necessarily has hidden states and is therefore quite sensitive to how the state space is observed or coarse-grained. The maximum amount of information in the future predictable from the past (or vice versa) is the excess entropy:

E = I[X_{:0} ; X_{0:}] .

It is symmetric in time and a lower bound on the stored informations C⁺_µ and C⁻_µ. It is directly given by the information atoms above:

E = b_µ + σ_µ + q_µ .

The process's Shannon entropy rate h_µ—recall the form of Eq. (2)—can also be written as a sum of atoms:

h_µ = r_µ + b_µ .

Thus, a portion of the information (h_µ) a process spontaneously generates is thrown away (r_µ) and a portion is actively stored (b_µ).
Putting these observations together gives the information architecture of a single measurement (Eq. (1)):

H[X_0] = r_µ + b_µ + q_µ .

These identities can be used to determine r_µ, q_µ, and E from H[X_0], h_µ, b_µ, and σ_µ, for example. We have a particular interest in when C_µ and E are infinite and so will investigate finite-time variants of causal states and finite-time estimates of statistical complexity and E. For example, the latter is given by:

E(M, N) = I[X_{−M:0} ; X_{0:N}] .

If E is finite, then E = lim_{M,N→∞} E(M, N). When E is infinite, then the way in which E(M, N) diverges is one measure of a process's complexity [21,43,44]. Analogous finite past-future (M, N)-parametrized equivalence relations lead to finite-time forward and reverse causal states and statistical complexities C⁺_µ(M, N) and C⁻_µ(M, N).
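Given numerical values of H[X_0], h_µ, b_µ, and σ_µ, the remaining atoms follow directly from these identities. A sketch with made-up numbers:

```python
# Solving for the remaining atoms from the identities
# H[X0] = r_mu + b_mu + q_mu, h_mu = r_mu + b_mu, E = b_mu + sigma_mu + q_mu.
# The numbers below are hypothetical, chosen only to illustrate the algebra.
H_X0, h_mu, b_mu, sigma_mu = 1.0, 0.6, 0.2, 0.1

r_mu = h_mu - b_mu            # ephemeral information
q_mu = H_X0 - r_mu - b_mu     # enigmatic information (co-information)
E = b_mu + sigma_mu + q_mu    # excess entropy

print(round(r_mu, 3), round(q_mu, 3), round(E, 3))  # 0.4 0.4 0.7
```

This is the same strategy used in Sec. IV: compute a few atoms directly from a prescient model and recover the rest by linear algebra.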

III. CAUSAL ARCHITECTURE OF RENEWAL PROCESSES
It will be helpful pedagogically to anchor our theory in the contrast between two different, but still simple, renewal processes. One is the familiar "memoryless" Poisson process with rate λ. Its HMM generator, a biased coin, is shown at the left of Fig. 2. It has interevent count distribution F(n) = (1 − λ)λⁿ, a distribution with unbounded support. However, we notice in Fig. 2 that it is a unifilar model with a minimal number of states. So, in fact, this one-state machine is the ε-machine of the Poisson process. The rate at which it generates information is given by the entropy rate: h_µ = H(λ) bits per output symbol. (Here, H(p) is the binary entropy function.) It also has vanishing statistical complexity, C⁺_µ = 0, and so stores no historical information.
The second example is the Simple Nonunifilar Source (SNS) [45]; an HMM generator for which is shown on the right of Fig. 2. Transitions from state B are unifilar, but transitions from state A are not. In fact, a little reflection shows that the time series produced by the SNS is a discrete-time renewal process. Once we observe the "event" x t = 1, we know the internal model state to be σ t+1 = A, so successive interevent counts are completely uncorrelated.
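This renewal structure is easy to check by simulation. The sketch below generates SNS output under our reading of the transition diagram (the state labels and branch probabilities are assumptions about Fig. 2, with both branches set to 1/2 as in the original SNS) and extracts the interevent counts:

```python
import random

def sns_sequence(n_steps, p=0.5, q=0.5, seed=0):
    """Simulate the Simple Nonunifilar Source: from A, emit 0 and either
    stay in A or move to B; from B, emit 0 (stay) or emit 1 (return to A).
    The exact parametrization is our assumption about the diagram."""
    rng = random.Random(seed)
    state, out = 'A', []
    for _ in range(n_steps):
        if state == 'A':
            out.append(0)
            if rng.random() > p:      # leave A with probability 1 - p
                state = 'B'
        else:
            if rng.random() < q:      # stay in B, emitting 0
                out.append(0)
            else:                     # emit the event 1 and reset to A
                out.append(1)
                state = 'A'
    return out

seq = sns_sequence(100_000)
# Interevent counts: lengths of the runs of 0s between successive 1s.
counts = [len(run) for run in ''.join(map(str, seq)).split('1')[1:-1]]
print(min(counts) >= 1)  # events are always separated by at least one 0
```

Because every event resets the machine to state A, successive counts are independent draws from one distribution, which is exactly the renewal property; note also that F(0) = 0 here, since A always emits at least one 0.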
The SNS generator is not an ε-machine and, moreover, it cannot be used to calculate the process's information per output symbol (entropy rate). If we can only see 0's and 1's, we will usually be uncertain as to whether we are in state A or state B, so this generative model is not maximally predictive. How can we calculate this basic quantity? And, if we cannot use the two-state generator, how many states are required and what is their transition dynamic? The following uses computational mechanics to answer these and a number of related questions. To aid readability, though, we sequester almost all of the detailed calculations and proofs in App. A.
We start with a simple lemma that follows directly from the definitions of a renewal process and the causal states. It allows us to introduce notation that simplifies the development.

Lemma 1. The count since last event is a prescient statistic of a discrete-time stationary renewal process.
That is, if we remember only the number of counts since the last event and nothing prior, we can predict the future as well as if we had memorized the entire past. Specifically, a prescient state R is a function of the past such that:

Pr(X_{0:} | R) = Pr(X_{0:} | X_{:0}) .

Causal states can be written as unions of prescient states [33]. We start with a definition that helps to characterize the converse; i.e., when the prescient states of Lemma 1 are also causal states.
To ground our intuition, recall that Poisson processes are "memoryless". This may seem counterintuitive when viewed from a parameter-estimation point of view. After all, observing longer pasts, one makes better and better estimates of the Poisson rate. However, finite-data fluctuations in estimating model parameters are irrelevant to the present mathematical setting unless the parameters are themselves random variables, as in Ref. [44]. This is not our setting here: the parameters are fixed. In fact, we restrict ourselves to studying ergodic processes, and for a Poisson process the conditional probability distribution of futures given pasts is independent of the past.
We therefore expect the prescient states in Lemma 1 to fail to be causal states precisely when the interevent distribution is similar to that of a Poisson renewal process. This intuition is made precise by Def. 2.

Definition 1. A ∆-Poisson process has an interevent distribution satisfying

F(n + ∆) = λ^∆ F(n)

for all n and some λ > 0. If this statement holds for multiple ∆ ≥ 1, then we choose the smallest possible ∆.
Definition 2. An (ñ, ∆) eventually ∆-Poisson process has an interevent distribution that is ∆-Poisson for all n ≥ ñ:

F(ñ + m + k∆) = λ^{k∆} F(ñ + m)

for all 0 ≤ m < ∆, for all k ≥ 0, and for some λ > 0. If this statement holds for multiple ∆ ≥ 1 and multiple ñ, then we choose the smallest possible ∆ and the smallest possible ñ.
Thus, a Poisson process is a ∆-Poisson process with ∆ = 1 and an eventually ∆-Poisson process with ∆ = 1 and ñ = 0. Moreover, we will now show that past some finite ñ, an eventually ∆-Poisson renewal process behaves either as (i) a Poisson process, if ∆ = 1, or (ii) a combination of several Poisson processes, if ∆ > 1.
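Definitions 1 and 2 suggest a direct numerical test: past ñ, the ratios F(n + ∆)/F(n) must all equal a single constant λ^∆. A sketch of such a check (assuming an exactly known, truncated F, not an estimated one):

```python
import numpy as np

def eventually_delta_poisson(F, n_tilde, delta, tol=1e-12):
    """Test whether a truncated interevent distribution F is eventually
    Delta-Poisson for the given (n_tilde, delta): for n >= n_tilde,
    F(n + delta) = lam**delta * F(n) with one lam > 0. A sketch based on
    Defs. 1 and 2."""
    F = np.asarray(F, dtype=float)
    pairs = [(F[n], F[n + delta])
             for n in range(n_tilde, len(F) - delta) if F[n] > 0]
    if not pairs:
        return True
    ratios = [b / a for a, b in pairs]
    return (max(ratios) - min(ratios)) < tol and all(b > 0 for _, b in pairs)

# Geometric (Poisson) distribution: eventually Delta-Poisson with
# n_tilde = 0 and delta = 1.
lam = 0.5
F_geo = (1 - lam) * lam**np.arange(50)
print(eventually_delta_poisson(F_geo, 0, 1))   # True

# An SNS-like distribution F(n) proportional to n * lam**n is not.
F_sns = np.arange(50) * lam**np.arange(50)
F_sns = F_sns / F_sns.sum()
print(eventually_delta_poisson(F_sns, 0, 1))   # False
```

When the test succeeds, λ is recovered as the common ratio raised to 1/∆; scanning over small (ñ, ∆) pairs and taking the smallest then matches the minimality convention in the definitions.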
Why identify new classes of renewal process? In short, renewal processes that are similar to, but not the same as, the Poisson process do not have an infinite number of causal states. The particular condition for when they do not is given by the eventually ∆-Poisson definition. Notably, this new class is what emerged, rather unexpectedly, by applying the causal-state equivalence relation ∼ + to renewal processes. The resulting insight is that general renewal processes, after some number of counts (the "eventually" part) and after some coarse-graining of counts (the ∆ part), behave like a Poisson process.
With these definitions in hand, we can proceed to identify the causal architecture of discrete-time stationary renewal processes.
Theorem 1. (a) The forward-time causal states of a discrete-time stationary renewal process that is not eventually ∆-Poisson are groupings of pasts with the same count since last event. (b) The forward-time causal states of a discrete-time eventually ∆-Poisson stationary renewal process are groupings of pasts with the same count since last event up until ñ, together with groupings of pasts whose count n since last event is in the same equivalence class as ñ modulo ∆.
The Poisson process, as an eventually ∆-Poisson process with ñ = 0 and ∆ = 1, is represented by the one-state ε-machine despite the unbounded support of its interevent count distribution. Unlike most processes, the Poisson process's ε-machine is the same as its generative model, shown in Fig. 2(left).
The SNS, on the other hand, has an interevent count distribution that is not eventually ∆-Poisson. According to Thm. 1, then, the SNS has a countable infinity of causal states despite its simple two-state generative model in Fig. 2(right); compare Fig. 3. Each causal state corresponds to a different probability distribution over the internal states A and B. These internal-state distributions are the mixed states of Ref. [46]. Observing more 0's, one becomes increasingly convinced that the internal state is B. For maximal predictive power, however, we must track the probability that the process is still in state A. Both Fig. 3 and Fig. 2(right) are "minimally complex" models of the same process, but with different definitions of model complexity. We return to this point in Sec. V.
Appendix A makes the statements in Thm. 1 precise. The main result is that the causal states are sensitive to two features: (i) eventually ∆-Poisson structure in the interevent distribution and (ii) the boundedness of F(n)'s support. If the support is bounded, then there are a finite number of causal states rather than a countable infinity. Similarly, if F(n) has ∆-Poisson tails, then there are a finite number of causal states despite the support of F(n) having no bound. Nonetheless, one can say that the generic discrete-time stationary renewal process has a countable infinity of causal states.
Finding the probability distribution over these causal states is straightforwardly related to the survival function w(n) and the mean interevent count µ, since the probability that an interevent count is at least n is w(n). Hence, the probability of seeing n counts since the last event is simply the normalized survival function w(n)/(µ + 1). Appendix A derives the statistical complexity using this and Thm. 1. The resulting formulae are given in Table I for the various cases.
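The normalized survival function gives the causal-state probabilities directly, so the statistical complexity is just their Shannon entropy. A sketch for the case where the count-since-last-event states are all distinct causal states (it would overestimate C_µ for an eventually ∆-Poisson process, whose states collapse), applied to a bounded-support example:

```python
import numpy as np

def statistical_complexity(F):
    """C_mu for a renewal process whose prescient (count-since-event)
    states are also causal states, using pi(n) = w(n)/(mu + 1).
    F is a finite array holding the interevent distribution."""
    F = np.asarray(F, dtype=float)
    n = np.arange(len(F))
    mu = np.sum(n * F)                     # mean interevent count
    w = np.cumsum(F[::-1])[::-1]           # survival function
    pi = w / (mu + 1)                      # causal-state probabilities
    pi = pi[pi > 0]
    return -np.sum(pi * np.log2(pi))

# Period-3 process: F(2) = 1, i.e., the event always follows two 0s.
print(round(statistical_complexity([0.0, 0.0, 1.0]), 6))  # log2(3) = 1.584963
```

The period-T check matches the base case discussed in Sec. IV: w(n) = 1 for n < T, so the T causal states are equiprobable and C_µ = log₂ T.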
As described in Sec. II, we can also endow the causal-state space with a transition dynamic in order to construct the renewal process ε-machine: the process's minimal unifilar hidden Markov model. The transition dynamic is sensitive to F(n)'s support and not only its boundedness. For instance, the probability of observing an event given that it has been n counts since the last event is F(n)/w(n). For the generic discrete-time renewal process this is exactly the transition probability from causal state n to causal state 0. If F(n) = 0, then there is no probability of transition from σ = n to σ = 0. See App. A for details.

[Table caption: Possible ε-machine architectures for discrete-time stationary renewal processes; the first case is not eventually ∆-Poisson with an unbounded-support interevent distribution.]

Figures 3-6 display the causal-state architectures, depicted as state-transition diagrams, for the ε-machines in the various cases delineated. Figure 3 is the ε-machine of a generic renewal process whose interevent count can be arbitrarily large and whose interevent distribution never has exponential tails. Figure 4 is the ε-machine of a renewal process whose interevent distribution never has exponential tails but cannot have arbitrarily large interevent counts. The ε-machine in Fig. 5 looks quite similar to that in Fig. 4, but it has an additional transition that connects the last state ñ to itself. This added transition changes our structural interpretation of the process: interevent counts can be arbitrarily large for this ε-machine, but past an interevent count of ñ the interevent distribution is exponential. Finally, the ε-machine in Fig. 6 represents an eventually ∆-Poisson process with ∆ > 1, whose structure is conceptually most similar to that of the ε-machine in Fig. 5. (See Def. 2 for the precise version of that statement.) If our renewal process disallows interevent counts of a particular length L, then this will be apparent from the ε-machine, since there will be no transition between the causal state corresponding to an interevent count of L and causal state 0.
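The transition dynamic can also be verified numerically: from state n the machine emits an event with the hazard probability F(n)/w(n) and resets to state 0, otherwise emitting a 0 and advancing to state n + 1. The sketch below (with a made-up short-support F) checks that the resulting chain's stationary distribution is w(n)/(µ + 1):

```python
import numpy as np

F = np.array([0.0, 0.5, 0.3, 0.2])        # hypothetical interevent distribution
w = np.cumsum(F[::-1])[::-1]              # survival function
mu = np.sum(np.arange(len(F)) * F)        # mean interevent count

N = len(F)
P = np.zeros((N, N))                      # state-to-state transition matrix
for n in range(N):
    hazard = F[n] / w[n]                  # Pr(event | n counts since last)
    P[n, 0] = hazard                      # emit 1, reset to state 0
    if n + 1 < N:
        P[n, n + 1] = 1.0 - hazard        # emit 0, advance the count

# Stationary distribution: left eigenvector of P with eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

print(np.allclose(pi, w / (mu + 1)))  # True
```

Note that state 0 has hazard F(0)/w(0) = 0 here, so the machine never emits two consecutive 1s, mirroring the absent transition described above for counts outside F's support.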
As described in Sec. II, we can analytically characterize a process' information architecture far better once we characterize its statistical structure in reverse time.
Lemma 2. Groupings of futures with the same count to next event are reverse-time prescient statistics for discrete-time stationary renewal processes.
Theorem 2. (a) The reverse-time causal states of a discrete-time stationary renewal process that is not eventually ∆-Poisson are groupings of futures with the same count to next event. (b) The reverse-time causal states of a discrete-time eventually ∆-Poisson stationary renewal process are groupings of futures with the same count to next event up until ñ, plus groupings of futures whose count n to next event is in the same equivalence class as ñ modulo ∆.
As a result, in reverse time a stationary renewal process is effectively the same stationary renewal process: counts between events are still independently drawn from F(n). Thus, the causal irreversibility vanishes: Ξ = 0.

[Table I layout: Quantity, Expression, with columns for the not eventually ∆-Poisson and eventually ∆-Poisson cases.]
Moreover, these results taken together indicate that we can straightforwardly build a renewal process's bidirectional machine from these forward and reverse-time causal states, as described in Refs. [37,38,46]. Additional properties can then be deduced from the bidirectional machine, but we leave this for the future.

IV. INFORMATION ARCHITECTURE OF RENEWAL PROCESSES
As Sec. II described, many quantities that capture a process's predictability and randomness can be calculated from the block entropy function H(L). Often, the block entropy is estimated by generating samples of a process and estimating the entropy of the trajectory distribution. This method has the obvious disadvantage that at large L there are |A|^L possible trajectories and roughly 2^{h_µ L} typical trajectories. And so, one easily runs into the problem of severe undersampling, referred to previously as the curse of dimensionality. This matters most when the underlying process has long-range temporal correlations.
Nor can one calculate the block entropy and other such information measures exactly from generative models that are not maximally predictive (prescient). Then, the model states do not shield the past from the future.
For instance, as noted above, one cannot calculate the SNS's entropy rate from its simple two-state generative HMM. The entropy of the next symbol given the generative model's current state (A or B) actually underestimates the true entropy rate by assuming that we can almost always precisely determine the underlying model state from the past. For a sense of the fundamental challenge, see Refs. [25,47].
However, we can calculate the block entropy and various other information measures in closed form from a maximally predictive model. In other words, finding an ε-machine allows one to avoid the curse of dimensionality inherent in calculating the entropy rate, excess entropy, or the other information measures discussed here. Figure 1 summarized the above points. This section makes good on the procedure outlined there by providing analytic formulae for various information measures of renewal processes. The formula for the entropy rate of a renewal process is already well known, but all the others are new.
Prescient HMMs built from the prescient statistics of Lemma 1 are maximally predictive models; they correspond to the unifilar Hidden Markov Model shown in Fig. 3. The prescient machines make no distinction between an eventually ∆-Poisson renewal process and one that is not, but they do contain information about the support of F(n) through their transition dynamics. (See App. A.) Appendix B describes how a prescient machine can be used to calculate all of the information-architecture quantities r_µ, b_µ, σ_µ, and q_µ, as well as the more familiar Shannon entropy rate h_µ and excess entropy E. A general strategy for calculating these quantities, as described in Sec. II and Refs. [19,40], is to calculate b_µ, h_µ, E, and H[X_0] and then to derive the other quantities using the information-theoretic identities given in Sec. II. Table I gives the results of these calculations. It helps one's interpretation to consider two base cases. For a Poisson process, we gain no predictive power by remembering specific pasts, and so we expect the statistical complexity, excess entropy, and bound information rate to vanish. The entropy rate and ephemeral information, though, are nonzero. One can check that this is, indeed, the case. For a periodic process with period T, in contrast, one can check that µ + 1 = T, since the period is the length of the string of 0's (mean interevent count µ) concatenated with the subsequent event x = 1. The statistical complexity and excess entropy of this process are both log₂ T, and the entropy rate is h_µ = 0, as expected.
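Both base cases can be checked numerically. Since the prescient machine is unifilar, its entropy rate is the state-averaged uncertainty of the next symbol; the sketch below uses that standard fact with the hazard F(n)/w(n) (not Table I's exact expressions, which are not reproduced here):

```python
import numpy as np

def binary_entropy(p):
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def renewal_entropy_rate(F):
    """h_mu from the prescient machine of a renewal process: state n emits
    an event with hazard F(n)/w(n); unifilarity makes h_mu the state-averaged
    next-symbol entropy. A sketch using a finite, truncated F."""
    F = np.asarray(F, dtype=float)
    w = np.cumsum(F[::-1])[::-1]
    mu = np.sum(np.arange(len(F)) * F)
    pi = w / (mu + 1)
    hazard = np.where(w > 0, F / np.where(w > 0, w, 1), 0.0)
    mask = (w > 0) & (hazard > 0) & (hazard < 1)   # deterministic states add 0
    return np.sum(pi[mask] * binary_entropy(hazard[mask]))

# Poisson (geometric) case: the hazard is constant, so h_mu = H(lam).
lam = 0.3
F_poisson = (1 - lam) * lam**np.arange(2000)
print(np.isclose(renewal_entropy_rate(F_poisson), binary_entropy(lam)))

# Period-3 case: F(2) = 1 gives h_mu = 0.
print(np.isclose(renewal_entropy_rate([0.0, 0.0, 1.0]), 0.0))
```

States whose hazard is exactly 0 or 1 emit their next symbol deterministically and contribute no entropy, which is why they can be masked out of the sum.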
Calculating the predictable information E(M, N) requires identifying finite-time prescient statistics, since the predictable information is the mutual information between forward-time causal states over pasts of length M and reverse-time causal states over futures of length N. Such finite-time prescient statistics are identified in Corollary 1, below, and the predictable information is derived in App. B. The final expression is not included in Table I due to its length. All of these quantities can be calculated using a mixed-state presentation, as described in Ref. [46], though the formulae developed there are as yet unable to describe processes with a countably infinite set of mixed states. Calculations of finite-time entropy-rate estimates using a mixed-state presentation are consistent with all other results here, though. Purely for simplicity, we avoid discussing mixed-state presentations.

V. NONUNIFILAR HMMS AND RENEWAL PROCESSES
The task of inferring an ε-machine for discrete-time, discrete-alphabet processes is essentially that of inferring minimal unifilar HMMs, sometimes also called "probabilistic deterministic" finite automata. In a unifilar HMM, the next hidden state is uniquely determined by the previous hidden state and the next emitted symbol. Nonunifilar HMMs are a more general class of time-series model in which the transition between underlying states, given the next emitted symbol, can be stochastic.
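This distinction is easy to state operationally: an HMM presentation is unifilar exactly when each (state, emitted symbol) pair has at most one successor state. A minimal sketch, in which the dictionary encoding and the SNS-style topology are our own illustrative assumptions:

```python
def is_unifilar(hmm):
    """Unifilar: each (state, symbol) pair has at most one successor state."""
    return all(len(successors) <= 1
               for state in hmm.values()
               for successors in state.values())

p, q = 0.3, 0.4
# {state: {symbol: {next_state: probability}}}; SNS-style topology (assumed).
sns = {
    "A": {"0": {"A": 1 - p, "B": p}},         # symbol 0 from A: two successors
    "B": {"0": {"B": 1 - q}, "1": {"A": q}},  # each symbol from B: one successor
}
biased_coin = {"S": {"0": {"S": 1 - p}, "1": {"S": p}}}
print(is_unifilar(sns))          # False: transitions from A are nonunifilar
print(is_unifilar(biased_coin))  # True: a one-state HMM is trivially unifilar
```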
This simple difference in HMM structure has important consequences for calculating the predictable information, information architecture, and statistical complexity of time series generated by nonunifilar HMMs. First, note that for processes with a finite number of transient and recurrent causal states, these quantities can be calculated in closed form [24]. Second, the autocorrelation function and power spectrum can also be calculated in closed form for nonunifilar presentations [48]. Unlike these cases, though, most of Table I's quantities defy current calculational techniques. As a result, exact calculations of these prediction-related information measures for even the simplest nonunifilar HMMs can be surprisingly difficult.
To illustrate this point, we focus our attention on a parametrized version of the SNS shown in Fig. 7. As for the original SNS in Fig. 2, transitions from state B are unifilar, but transitions from state A are not. As noted before, the time series generated by the parametrized SNS is a discrete-time renewal process with interevent count distribution:

The nonunifilar HMM there should be contrasted with the unifilar HMM presentation of the parametrized SNS, which is the ε-machine of Fig. 3, with a countable infinity of causal states. Both parametrized SNS presentations are "minimally complex", but according to different metrics. On the one hand, the nonunifilar presentation is a minimal generative model: no one-state HMM (i.e., biased coin) can produce a time series with the same statistics. On the other, the unifilar HMM is the minimal maximally predictive model: to predict the future as well as possible given the entire past, we must at least remember how many 0's have been seen since the last 1, and that memory requires a countable infinity of prescient states. The preferred complexity metric is a matter of taste and desired implementation, modulo important concerns regarding overfitting and ease of inference [26]. However, if we wish to calculate the information measures in Table I as accurately as possible, finding a maximally predictive model (that is, a unifilar presentation) is necessary.

[Displaced figure caption: The components of the predictable information, the excess entropy E = σ_µ + b_µ + q_µ in bits, as a function of p with q = p. The lowest (blue) line is q_µ; the middle (green) line is q_µ + b_µ, so the green area denotes b_µ's contribution to E. The upper (red) line is E, so the red area denotes the elusive information σ_µ in E. Note that for a large range of p the co-information q_µ is (slightly) negative.]
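The interevent count distribution referred to above can also be estimated empirically as a consistency check. The sketch below simulates parametrized-SNS-style dynamics under our assumed transition layout (from A, emit 0 and move to B with probability p; from B, emit 1 and return to A with probability q, else emit 0 and stay in B); it is a sanity-check simulation, not the paper's closed form.

```python
import random

def sns_interevent_counts(p, q, n_events=20_000, seed=0):
    """Interevent 0-counts from assumed parametrized-SNS dynamics:
    A always emits 0 and moves to B with probability p; B emits 1 (an event,
    returning to A) with probability q, else emits 0 and stays in B."""
    rng = random.Random(seed)
    counts, state, zeros = [], "A", 0
    while len(counts) < n_events:
        if state == "A":
            zeros += 1
            if rng.random() < p:
                state = "B"
        elif rng.random() < q:
            counts.append(zeros)         # event: record count, reset
            zeros, state = 0, "A"
        else:
            zeros += 1
    return counts

counts = sns_interevent_counts(p=0.5, q=0.5)
F_hat = {n: counts.count(n) / len(counts) for n in (1, 2, 3, 4)}
print(F_hat)  # empirical estimate of the interevent count distribution
```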
Using the formulae of Table I, Fig. 8 shows how the statistical complexity C_µ, excess entropy E, entropy rate h_µ, and bound information b_µ vary with the transition probabilities p and q. C_µ often reveals detailed information about a process's underlying structure, but for the parametrized SNS and other renewal processes, the statistical complexity merely reflects the spread of the interevent distribution; thus, it increases with increasing p and q. E, a measure of how much can be predicted rather than of the historical memory required for prediction, increases as p and q decrease. The intuition is that as p and q vanish, the process approaches a perfectly predictable period-2 sequence. We see that the SNS constitutes a simple example of a class of processes in which information transmission between past and future (E) and information storage (C_µ) are anticorrelated. The entropy rate h_µ at the top right of Fig. 8 is maximized when transitions are uniformly stochastic, and the bound information b_µ at the bottom right is maximized somewhere between the fully stochastic and fully deterministic regimes.

Figure 9 presents a more nuanced decomposition of the information measures as p = q varies from 0 to 1. The topmost plot breaks down the single-measurement entropy H[X_0] into the redundant information ρ_µ in a single measurement, the predictively useless generated information r_µ, and the predictively useful generated entropy b_µ. As p increases, the SNS moves from mostly predictable (close to period-2) to mostly unpredictable, as shown by the relative heights of the (green) line denoting h_µ and the (red) line denoting H[X_0]. The portion b_µ of h_µ that is predictive of the future is maximized at lower p, where the process resembles a slightly noisy period-2 sequence.
The plot at the bottom decomposes the predictable information E into the predictable information hidden from the present σ_µ, the predictable generated entropy in the present b_µ, and the co-information q_µ shared between past, present, and future. Recall that the co-information q_µ = E − σ_µ − b_µ can be negative and, for a large range of parameter values, it is. Most of the predictable information passes through the present, as indicated by σ_µ being small for most values of p. Hence, even though the parametrized SNS is technically an infinite-order Markov process, it can be well approximated by a finite-order Markov process without losing much predictable information, as noted previously using rate-distortion theory [49].

VI. CONCLUSIONS
Stationary renewal processes are well studied, easy to define, and, in many ways, temporally simple. Given this simplicity and their long history, it is somewhat surprising that one can still discover new properties; in our case, by viewing them through an information-theoretic lens. Indeed, their simplicity becomes apparent in the informational and structural analyses. For instance, renewal processes are causally reversible, with isomorphic ε-machines in forward and reverse time; i.e., they are temporally reversible. Applying the causal-state equivalence relation to renewal processes, however, also revealed several unanticipated subtleties. For instance, we had to delineate a new subclass of renewal processes ("eventually ∆-Poisson") in order to completely classify the ε-machines of renewal processes. Additionally, the information-architecture formulae in Table I are surprisingly complicated, since exactly calculating these informational measures requires a unifilar presentation. In Sec. V, we needed an infinite-state machine to study the informational architecture of a process generated by a simple two-state HMM.
Looking to the future, the new structural view of renewal processes will help improve inference methods for infinite-state processes, as it tells us what to expect in the ideal setting: what the effective states are, what appropriate null models look like, how informational quantities scale, and the like. For example, Figs. 3-6 gave all possible causal architectures for discrete-time stationary renewal processes. Such a classification will allow more efficient Bayesian inference of ε-machines for point processes, as developed in Ref. [26]. That is, we can leverage the "expert" knowledge that one is seeing a renewal process to delineate the appropriate subset of model architectures and thereby avoid searching over the superexponentially large set of all HMM topologies.
The range of the results' application is much larger than that explicitly considered here. The formulae in Table I will be most useful for understanding renewal processes with infinite statistical complexity. For instance, Ref. [28] applies the formulae to study the divergence of the statistical complexity of continuous-time processes as the observation time scale decreases. And Ref. [29] applies these formulae to renewal processes with infinite excess entropy. In particular, there we investigate the causal architectures of infinite-state processes that generate so-called critical phenomena: behavior with power-law temporal or spatial correlations [50]. The analysis of such critical systems often turns on having an appropriate order parameter. The statistical complexity and excess entropy are application-agnostic order parameters [51][52][53] that allow one to better quantify when a phase transition in a stochastic process has or has not occurred, as seen in Ref. [29]. Such critical behavior has even been implicated in early studies of human communication [54][55] and recently in neural dynamics [56] and in socially constructed, communal knowledge systems [57].

Pr(X_{-n:m+1} = 0^{n+m} 1 | X_{-n-1} = 1)

We see that π(r^+_n) = w(n)/Z, with normalization constant Z = Σ_{n=0}^{∞} w(n) = µ + 1. And so:

In the main text, Thm. 1 was stated with less precision so as to be comprehensible. Here, we state it with more precision, even though doing so somewhat obscures the meaning. In the proof, we still err somewhat on the side of comprehensibility, and so one might view it as more of a proof sketch.
Theorem 1 The forward-time causal states of a discrete-time stationary renewal process that is not eventually ∆-Poisson are exactly S^+ = R^+, if F has unbounded support. When the support is bounded, such that F(n) = 0 for all n ≥ N, then S^+ = {r^+_n}_{n=0}^{N}. Finally, a discrete-time eventually ∆-Poisson renewal process with characteristic (ñ, ∆) has forward-time causal states:

This is a complete classification of the causal states of any persistent renewal process.
Proof. From the proof of Lemma 1 in this appendix, we know that two prescient states r^+_n and r^+_{n'} fall into the same causal state exactly when:

and, since Pr(N_0 = n) = w(n)/(µ + 1) from earlier, the equivalence-class condition becomes:

for all m ≥ 0. First, note that for these conditional probabilities even to be well defined, we need w(n) > 0 and w(n') > 0. Hence, if F has bounded support, with max supp F = N, then the causal states do not include any r^+_n for n > N. Furthermore, Eq. (A2) cannot hold for all m ≥ 0 unless n = n', for n, n' ≤ N. To see this, suppose that n ≠ n' but that Eq. (A2) holds. Then choosing m = N + 1 − max(n, n') gives 0 = F(N + 1 − |n − n'|)/w(n'), a contradiction unless n = n'. So, in all remaining cases, we can assume that F in Eq. (A2) has unbounded support.
A little rewriting makes the connection between Eq. (A2) and an eventually ∆-Poisson process clearer. First, we choose m = 0 to find:

which we can use to rewrite Eq. (A2) as:

or, more usefully:

A particularly compact way of rewriting this is to define ∆' := n' − n, which gives F(n + m) = F((n + m) + ∆'). In this form, it is clear that the equation is a recurrence relation on F in steps of ∆', so that we can write:

This must hold for every m ≥ 0. Importantly, since w(n) = Σ_{m=n}^{∞} F(m), satisfying this recurrence relation is equivalent to satisfying Eq. (A2). But Eq. (A3) is just the definition of an eventually ∆-Poisson process in disguise: relabel with λ := F(n')/F(n), ñ := n, and ∆ := ∆'. Therefore, if Eq. (A2) does not hold for any pair n ≠ n', the process is not eventually ∆-Poisson and the prescient states identified in Lemma 1 are minimal; i.e., they are the causal states.
If Eq. (A2) does hold for some n ≠ n', choose the minimal such n and n'. The renewal process is then eventually ∆-Poisson with characterization ∆ = n' − n and ñ. And F(ñ + m)/w(ñ + m) = F(ñ + m')/w(ñ + m') implies that m ≡ m' mod ∆, since otherwise the n and n' chosen would not be minimal. Hence, the causal states are exactly those given in the theorem's statement.
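As a numerical illustration of the equivalence condition Eq. (A2), consider the memoryless case F(n) = λ(1 − λ)^n, for which w(n) = (1 − λ)^n and the ratio F(n + m)/w(n) = λ(1 − λ)^m is independent of n. Every pair of prescient states is then equivalent and a single causal state remains: the Poisson process. A sketch, with helper names of our choosing:

```python
import math

lam = 0.3
F = lambda n: lam * (1 - lam) ** n   # memoryless interevent distribution
w = lambda n: (1 - lam) ** n         # w(n) = sum_{m >= n} F(m)

# Eq. (A2): r+_n and r+_{n'} are equivalent iff F(n + m)/w(n) = F(n' + m)/w(n')
# for all m >= 0. Here the ratio is lam*(1 - lam)**m for every n, so all
# prescient states collapse into a single causal state.
for n in (0, 3, 7):
    for m in range(6):
        assert math.isclose(F(n + m) / w(n), lam * (1 - lam) ** m)
print("all prescient states equivalent: one causal state (Poisson process)")
```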
The probability distribution over these forward-time causal states is straightforward to derive from π(r^+_n) = w(n)/(µ + 1). For a renewal process that is not eventually ∆-Poisson, or one with bounded support, π(σ^+_n) = w(n)/(µ + 1). (For the latter, n only runs from 0 to N.) For an eventually ∆-Poisson renewal process, π(σ^+_n) = w(n)/(µ + 1) when n < ñ and:

when ñ ≤ n < ñ + ∆. And so, the statistical complexity given in Table I follows from C^+_µ = H[S^+].

Recall Lemma 2 and Thm. 2.

Lemma 2 Groupings of futures with the same counts to the next event are reverse-time prescient statistics for discrete-time stationary renewal processes.
Theorem 2 (a) The reverse-time causal states of a discrete-time stationary renewal process that is not eventually ∆-Poisson are groupings of futures with the same count to the next event, up until and including N if N is finite. (b) The reverse-time causal states of a discrete-time eventually ∆-Poisson stationary renewal process are groupings of futures with the same count to the next event up until ñ, plus groupings of futures whose counts to the next event fall in the same equivalence class as ñ modulo ∆.
Proof. The proof for both claims relies on a single fact: In reverse-time, a stationary renewal process is still a stationary renewal process with the same interevent count distribution. The lemma and theorem therefore follow from Lemma 1 and Thm. 1.
Since the forward- and reverse-time causal states are the same, with the same future conditional probability distributions, we have C^+_µ = C^-_µ and the causal irreversibility vanishes: Ξ = 0.
Transition probabilities can be derived for both the renewal process's prescient states and its ε-machine as follows. For the prescient machine, if a 0 is observed while in r^+_n, we transition to r^+_{n+1}; otherwise, we transition to r^+_0, since we just saw an event. Basic calculations show that these transition probabilities are:

Not only do these specify the prescient-machine transition dynamic but, due to the close correspondence between prescient and causal states, they also automatically give the ε-machine transition dynamic:

T^(x)_{σσ'} = Pr(S^+_{t+1} = σ', X_{t+1} = x | S^+_t = σ)
            = Σ_{r,r' ∈ R^+} T^(x)_{r'→r} Pr(S^+_{t+1} = σ' | R^+_{t+1} = r) Pr(R^+_t = r' | S^+_t = σ).
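The normalization of these transition probabilities can be checked directly. The sketch below assumes the natural renewal-process form (our reading, not a quotation of the displayed equation): from r^+_n, emit 0 and move to r^+_{n+1} with probability w(n + 1)/w(n), or emit 1 and reset to r^+_0 with probability F(n)/w(n); the rows then sum to one precisely because w(n + 1) + F(n) = w(n).

```python
def transition_probs(F, n):
    """From r+_n: emit 0 and go to r+_{n+1} w.p. w(n+1)/w(n); emit 1 (event)
    and reset to r+_0 w.p. F(n)/w(n). (Assumed form; see text.)"""
    w = lambda k: sum(F[k:])                  # w(k) = sum_{m >= k} F(m)
    return {("0", n + 1): w(n + 1) / w(n),
            ("1", 0): F[n] / w(n)}

F = [0.5, 0.25, 0.125, 0.125]                 # an arbitrary bounded-support F
for n in range(3):
    probs = transition_probs(F, n)
    assert abs(sum(probs.values()) - 1.0) < 1e-12  # w(n+1) + F(n) = w(n)
print("rows normalize: transition probabilities are consistent")
```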
And, after some algebra, this simplifies to:

once we recognize that w(0) = 1, so that w(0) log_2 w(0) = 0, and recall that w(n + 1) + F(n) = w(n). The excess entropy, being the mutual information between forward- and reverse-time prescient states, is [37,46]:

And so, to calculate it, we note that:

Pr(r^+_n, r^-_m) = F(m + n)/(µ + 1)  and  Pr(r^+_n | r^-_m) = F(n + m)/w(m).
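These two probabilities suffice for a direct numerical evaluation of E. The following sketch computes the mutual information between forward- and reverse-time prescient states from the joint Pr(r^+_n, r^-_m) = F(n + m)/(µ + 1), for a bounded-support F; the function name is ours.

```python
import math

def excess_entropy(F):
    """E = I[R+; R-] from Pr(r+_n, r-_m) = F(n + m)/(mu + 1), for a
    bounded-support interevent count distribution F (a sketch)."""
    N = len(F)
    mu_plus_1 = sum((n + 1) * F[n] for n in range(N))
    joint = [[F[n + m] / mu_plus_1 if n + m < N else 0.0
              for m in range(N)] for n in range(N)]
    pn = [sum(row) for row in joint]                    # Pr(r+_n) = w(n)/(mu+1)
    pm = [sum(joint[n][m] for n in range(N)) for m in range(N)]
    return sum(joint[n][m] * math.log2(joint[n][m] / (pn[n] * pm[m]))
               for n in range(N) for m in range(N) if joint[n][m] > 0)

T = 8
print(excess_entropy([0.0] * (T - 1) + [1.0]))  # period-T process -> 3.0 = log2 T
```

The period-T check reproduces the base case discussed earlier: E = log_2 T.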
After some algebra, we find that:

(w(n)/(µ + 1)) log_2 (w(n)/(µ + 1))

and that:

The above quantity is the forward crypticity χ^+ [37] when the renewal process is not eventually ∆-Poisson. Together, these imply:

And, finally, the bound information b_µ is:

where we used the causal-shielding property of the prescient states, X_{:0} → R^+_0 → R^-_1 → X_{1:}, the unifilarity of the prescient machines shown in the figures above, and:

Pr(R^+_0 = r^+_n | R^-_1 = r^-_m) = (F(m + n + 1) + F(n)F(m))/w(m).
From the expressions above, we immediately obtain r_µ = h_µ − b_µ, q_µ = H[X_0] − h_µ − b_µ, and σ_µ = E − q_µ, thereby laying out the information architecture of stationary renewal processes. Finally, we calculate the finite-time predictable information E(M, N) as the mutual information between finite-time forward- and reverse-time prescient states:

Recall Corollary 1.

Corollary 1 The forward-time (and reverse-time) finite-time M prescient states of a discrete-time stationary renewal process are the counts from the last (and to the next) event, up until and including M.
Proof. From Lemmas 1 and 2, we know that counts from (to) the last (next) event are prescient forward-time (reverse-time) statistics. If our window onto pasts (futures) has length M, then we cannot distinguish between counts since (to) the last (next) event that are M or larger. Hence, the finite-time M prescient statistics are the counts from (to) the last (next) event up until and including M, where the finite-time M prescient state at M includes all pasts with M or more counts from (to) the last (next) event.
To calculate E(M, N), we find Pr(R^+_M, R^-_N) by marginalizing Pr(R^+, R^-). For ease of notation, we first define a function:

In the latter case of semi-infinite pasts, several terms vanish and we have:

Σ_{n=0}^{∞} (n + 1) F(n) log_2 (F(n)/(µ + 1)).
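Numerically, E(M, N) can be obtained by exactly this marginalization: all past counts at or beyond M are lumped into one finite-time prescient state, and likewise for future counts beyond N. A sketch for a bounded-support F (function name ours):

```python
import math

def finite_time_E(F, M, N):
    """E(M, N): mutual information between finite-time prescient states,
    obtained from Pr(r+_n, r-_m) = F(n + m)/(mu + 1) by lumping all past
    counts >= M and future counts >= N into single boundary states (a sketch
    for bounded-support F)."""
    L = len(F)
    mu_plus_1 = sum((n + 1) * F[n] for n in range(L))
    joint = {}
    for n in range(L):
        for m in range(L - n):
            key = (min(n, M), min(m, N))
            joint[key] = joint.get(key, 0.0) + F[n + m] / mu_plus_1
    pn, pm = {}, {}
    for (a, b), p in joint.items():
        pn[a] = pn.get(a, 0.0) + p
        pm[b] = pm.get(b, 0.0) + p
    return sum(p * math.log2(p / (pn[a] * pm[b]))
               for (a, b), p in joint.items() if p > 0)

F = [0.0] * 7 + [1.0]                 # period-8 process
print(finite_time_E(F, M=16, N=16))   # windows exceed the period: full E = 3.0
print(finite_time_E(F, M=2, N=2))     # truncated windows retain less
```

With windows longer than the support, E(M, N) recovers the full excess entropy; shortening the windows can only reduce the shared information.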